
Logistic Regression

Machine Learning

1
Where are we?

We have seen the following ideas


– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)

2
This lecture

• Logistic regression

• Training a logistic regression classifier

• Back to loss minimization

3
Logistic Regression: Setup

• The setting
– Binary classification
– Inputs: Feature vectors $\mathbf{x} \in \mathbb{R}^n$
– Labels: 𝑦 ∈ {−1, +1}

• Training data
– $S = \{(\mathbf{x}_i, y_i)\}$, consisting of $m$ examples

5
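To make the setup concrete, a tiny Python/NumPy sketch of how such a training set might be represented (the toy numbers and the array names X and y are illustrative only, not from the lecture):

```python
import numpy as np

# m = 4 examples, each a feature vector in R^n with n = 3
X = np.array([[ 1.0,  2.0, -1.0],
              [ 0.5, -0.3,  2.2],
              [-1.2,  0.8,  0.1],
              [ 2.0,  1.5, -0.7]])   # shape (m, n)

# Labels encoded as -1 / +1, as in the lecture
y = np.array([+1, -1, -1, +1])       # shape (m,)
```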
Classification, but…

The output $y$ is discrete: either −1 or +1

Instead of predicting a label, let us try to predict $P(y = +1 \mid \mathbf{x})$

Expand the hypothesis space to functions whose output is in $[0, 1]$

• Original problem: $\mathbb{R}^n \to \{-1, +1\}$
• Modified problem: $\mathbb{R}^n \to [0, 1]$
• Effectively, this makes the problem a regression problem

Many hypothesis spaces are possible

7
The Sigmoid function

The hypothesis space for logistic regression: all functions of the form $\sigma(\mathbf{w}^T\mathbf{x})$

That is, a linear function composed with the sigmoid function (the logistic function), defined as

$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

What is the domain and the range of the sigmoid function?
(Domain: all of $\mathbb{R}$; range: $(0, 1)$.)

This is a reasonable choice. We will see why later.

10
The Sigmoid function

[Figure: plot of $\sigma(z)$ against $z$ — an S-shaped curve increasing from 0 towards 1.]

11
The Sigmoid function

What is its derivative with respect to $z$?

$$\frac{d\sigma}{dz} = \sigma(z)\,\big(1 - \sigma(z)\big)$$

13
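A minimal sketch of these two formulas in Python/NumPy (the function names are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """The logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The range of sigma is (0, 1): very negative z gives values near 0,
# very positive z gives values near 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
print(sigmoid_derivative(0.0))                # 0.25, the derivative's maximum
```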
Predicting probabilities

According to the logistic regression model, we have

$$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$$

$$P(y = -1 \mid \mathbf{x}, \mathbf{w}) = 1 - \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(\mathbf{w}^T\mathbf{x})}$$

Or, equivalently, for either label $y \in \{-1, +1\}$:

$$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^T\mathbf{x})}$$

Note that we are directly modeling $P(y \mid \mathbf{x})$ rather than $P(\mathbf{x} \mid y)$ and $P(y)$.

18
Predicting a label with logistic regression

• Compute $P(y = +1 \mid \mathbf{x}; \mathbf{w})$

• If this is greater than half, predict +1; else predict −1

– What does this correspond to in terms of $\mathbf{w}^T\mathbf{x}$?

– Prediction = $\operatorname{sgn}(\mathbf{w}^T\mathbf{x})$

20
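The prediction rule above, as a small Python/NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """P(y = +1 | x; w) = sigma(w^T x) under the logistic model."""
    return sigmoid(np.dot(w, x))

def predict_label(w, x):
    """Predict +1 when P(y = +1 | x; w) > 1/2, i.e. when w^T x > 0."""
    return 1 if np.dot(w, x) > 0 else -1

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
print(predict_proba(w, x))  # sigmoid(1.5) ~ 0.818
print(predict_label(w, x))  # +1
```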
This lecture

• Logistic regression

• Training a logistic regression classifier


– First: Maximum likelihood estimation
– Then: Adding priors → Maximum a Posteriori estimation

• Back to loss minimization

21
Maximum likelihood estimation

Let’s address the problem of learning

• Training data
– $S = \{(\mathbf{x}_i, y_i)\}$, consisting of $m$ examples

• What we want
– Find a weight vector 𝐰 such that P(S ∣ 𝐰) is maximized
– We know that our examples are drawn independently and
are identically distributed (i.i.d)
– How do we proceed?

22
Maximum likelihood estimation

$$\operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w}) = \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$$

The usual trick: convert products to sums by taking the log.

Recall that this works only because log is an increasing function, so the maximizer does not change.

Equivalent to solving

$$\max_{\mathbf{w}} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$$

But (by definition) we know that

$$P(y_i \mid \mathbf{w}, \mathbf{x}_i) = \sigma(y_i\,\mathbf{w}^T\mathbf{x}_i) = \frac{1}{1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)}$$

Substituting, the problem is equivalent to solving

$$\max_{\mathbf{w}} \sum_{i=1}^{m} -\log\big(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\big)$$

The goal: maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.

Equivalent to: training a linear classifier by minimizing the logistic loss.

28
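A small Python/NumPy sketch of the objective just derived (the use of np.logaddexp for numerical stability is my own choice, not from the slides):

```python
import numpy as np

def negative_log_likelihood(w, X, y):
    """Sum over examples of log(1 + exp(-y_i * w^T x_i)).

    Maximizing the likelihood of the data is the same as minimizing
    this quantity (the logistic loss summed over the training set).
    np.logaddexp(0, t) computes log(1 + exp(t)) without overflow.
    """
    margins = y * (X @ w)            # y_i * w^T x_i for every example
    return np.sum(np.logaddexp(0.0, -margins))

X = np.array([[1.0, 2.0], [-0.5, 1.0], [2.0, -1.0]])
y = np.array([+1, -1, +1])
w = np.array([0.3, -0.2])
print(negative_log_likelihood(w, X, y))
```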
Maximum a posteriori estimation

We could also add a prior on the weights.

Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation $\sigma$:

$$p(\mathbf{w}) = \prod_{j=1}^{n} p(w_j) = \prod_{j=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(\frac{-w_j^2}{\sigma^2}\right)$$

29
MAP estimation for logistic regression

$$p(\mathbf{w}) = \prod_{j=1}^{n} p(w_j) = \prod_{j=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(\frac{-w_j^2}{\sigma^2}\right)$$

Let us work through this procedure again to see what changes from maximum likelihood estimation.

What is the goal of MAP estimation?
(In maximum likelihood estimation, we maximized the likelihood of the data.)

To maximize the posterior probability of the model given the data (i.e., to find the most probable model, given the data):

$$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w})\,P(\mathbf{w})$$

32
MAP estimation for logistic regression

Learning by solving

$$\operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w} \mid S) = \operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w})\,P(\mathbf{w})$$

Take the log to simplify:

$$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$$

We have already expanded out the first term:

$$\sum_{i=1}^{m} -\log\big(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\big)$$

Expanding the log prior as well gives

$$\max_{\mathbf{w}} \sum_{i=1}^{m} -\log\big(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\big) + \sum_{j=1}^{n} \frac{-w_j^2}{\sigma^2} + \text{constants}$$

or, dropping the constants and writing the prior term compactly,

$$\max_{\mathbf{w}} \sum_{i=1}^{m} -\log\big(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\big) - \frac{1}{\sigma^2}\,\mathbf{w}^T\mathbf{w}$$

Maximizing a negative function is the same as minimizing the function.

39
Learning a logistic regression classifier

Learning a logistic regression classifier is equivalent to solving

$$\min_{\mathbf{w}} \sum_{i=1}^{m} \log\big(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\big) + \frac{1}{\sigma^2}\,\mathbf{w}^T\mathbf{w}$$

Where have we seen this before?

Exercise: Write down the stochastic gradient descent (SGD) algorithm for this objective. (A sketch follows below.)

Other training algorithms exist; for example, LBFGS is a quasi-Newton method. But gradient-based methods like SGD and its variants are far more commonly used.

42
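One possible answer to the exercise, as a rough Python/NumPy sketch. The constant learning rate, the number of epochs, and passing the regularization strength as a single parameter reg (playing the role of $1/\sigma^2$) are illustrative choices, not prescriptions from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, reg=0.1, lr=0.01, epochs=100, seed=0):
    """SGD on  sum_i log(1 + exp(-y_i w^T x_i)) + reg * w^T w.

    For a sampled example i, the gradient of its logistic loss is
        -y_i * x_i * (1 - sigma(y_i * w^T x_i)),
    and the regularizer's gradient, 2 * reg * w, is scaled by 1/m so that
    its total contribution over one epoch matches the objective above.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * np.dot(w, X[i])
            grad = -y[i] * X[i] * (1.0 - sigmoid(margin)) + 2.0 * reg * w / m
            w -= lr * grad
    return w
```

The returned weight vector is then used with the prediction rule from earlier: predict +1 exactly when $\mathbf{w}^T\mathbf{x} > 0$.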
Logistic regression is…

• A classifier that predicts the probability that the label is +1 for a particular input

• The discriminative counterpart of the naïve Bayes classifier

• A discriminative classifier that can be trained via MAP or MLE estimation

• A discriminative classifier that minimizes the logistic loss over the training set

43
This lecture

• Logistic regression

• Training a logistic regression classifier

• Back to loss minimization

44
Learning as loss minimization
• The setup
– Examples x drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f
• The ideal situation
– Define a function L that penalizes bad hypotheses
– Learning: Pick a function ℎ ∈ 𝐻 to minimize expected loss

But distribution D is unknown

• Instead, minimize empirical loss on the training set

45
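One common way of writing the two objectives contrasted above, using the loss function $L$, the hypothesis space $H$, and the training-set notation from earlier (a sketch, not the slides' own notation):

$$\text{Ideal: } \min_{h \in H} \; \mathbb{E}_{\mathbf{x} \sim D}\big[L\big(h(\mathbf{x}), f(\mathbf{x})\big)\big]$$

$$\text{Instead: } \min_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L\big(h(\mathbf{x}_i), y_i\big)$$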
Empirical loss minimization

Learning = minimize empirical loss on the training set

Is there a problem here? Overfitting!

We need something that biases the learner towards simpler hypotheses
• Achieved using a regularizer, which penalizes complex hypotheses

47
Regularized loss minimization

• Learning:

• With linear classifiers: (using $\ell_2$ regularization)

• What is a loss function?
– Loss functions should penalize mistakes
– We are minimizing average loss over the training data

• What is the ideal loss function for classification?

48
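One common way of writing the regularized objectives referred to in the first two bullets above (a sketch; the regularization weight $\lambda$ is a generic placeholder, not notation from the slides):

$$\min_{h \in H} \sum_{i=1}^{m} L\big(h(\mathbf{x}_i), y_i\big) + \text{regularizer}$$

$$\min_{\mathbf{w}} \sum_{i=1}^{m} L\big(y_i, \mathbf{w}^T\mathbf{x}_i\big) + \lambda\,\mathbf{w}^T\mathbf{w} \qquad (\ell_2 \text{ regularization})$$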
The 0-1 loss

Penalize classification mistakes between the true label $y$ and the prediction $y'$

• For linear classifiers, the prediction is $y' = \operatorname{sgn}(\mathbf{w}^T\mathbf{x})$
– Mistake if $y\,\mathbf{w}^T\mathbf{x} \le 0$

Minimizing the 0-1 loss is intractable. We need surrogates.

49
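Written out explicitly (a standard way to express the 0-1 loss for a linear classifier, consistent with the mistake condition above):

$$L_{0\text{-}1}(y, \mathbf{x}; \mathbf{w}) = \begin{cases} 1 & \text{if } y\,\mathbf{w}^T\mathbf{x} \le 0 \\ 0 & \text{otherwise} \end{cases}$$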
The loss function zoo

Many loss functions exist


– Perceptron loss

– Hinge loss (SVM)

– Exponential loss (AdaBoost)

– Logistic loss (logistic regression)

50
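These surrogate losses are usually written as functions of the margin $y\,\mathbf{w}^T\mathbf{x}$; a small Python/NumPy sketch of their standard forms (function names are my own):

```python
import numpy as np

def zero_one_loss(margin):
    """1 when the prediction is a mistake (margin <= 0), else 0."""
    return (margin <= 0).astype(float)

def perceptron_loss(margin):
    return np.maximum(0.0, -margin)

def hinge_loss(margin):          # SVM
    return np.maximum(0.0, 1.0 - margin)

def exponential_loss(margin):    # AdaBoost
    return np.exp(-margin)

def logistic_loss(margin):       # logistic regression
    return np.log(1.0 + np.exp(-margin))

margins = np.linspace(-2.0, 2.0, 5)   # margin = y * w^T x
for loss in (zero_one_loss, perceptron_loss, hinge_loss,
             exponential_loss, logistic_loss):
    print(loss.__name__, loss(margins))
```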
The loss function zoo

[Figure: the loss functions plotted together — zero-one, hinge (SVM), perceptron, exponential (AdaBoost), and logistic regression — shown at progressively zoomed-out scales across several slides.]

58
This lecture

• Logistic regression

• Training a logistic regression classifier

• Back to loss minimization

• Connection to Naïve Bayes

59
Naïve Bayes and Logistic regression

Remember that the naïve Bayes decision is a linear function:

$$\log \frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T\mathbf{x}$$

Here, the P's represent the naïve Bayes posterior distribution, and $\mathbf{w}$ can be used to calculate the priors and the likelihoods. That is, $P(y = +1 \mid \mathbf{w}, \mathbf{x})$ is computed using $P(\mathbf{x} \mid y = +1, \mathbf{w})$ and $P(y = +1 \mid \mathbf{w})$.

But we also know that $P(y = +1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = -1 \mid \mathbf{x}, \mathbf{w})$.

Substituting into the expression above, we get

$$P(y = +1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$$

Exercise: Show this formally. (A short derivation follows below.)

That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs. Naïve Bayes is a generative model; logistic regression is its discriminative version.

63
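One way to work the exercise (a sketch of the algebra; write $p = P(y = +1 \mid \mathbf{x}, \mathbf{w})$, so that $P(y = -1 \mid \mathbf{x}, \mathbf{w}) = 1 - p$):

$$\log\frac{p}{1-p} = \mathbf{w}^T\mathbf{x}
\;\Rightarrow\;
\frac{p}{1-p} = \exp(\mathbf{w}^T\mathbf{x})
\;\Rightarrow\;
p = \frac{\exp(\mathbf{w}^T\mathbf{x})}{1 + \exp(\mathbf{w}^T\mathbf{x})}
= \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}
= \sigma(\mathbf{w}^T\mathbf{x})$$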
