7: Logistic Regression
Machine Learning
1
Where are we?
2
This lecture
• Logistic regression
3
Logistic Regression: Setup
• The setting
– Binary classification
– Inputs: Feature vectors 𝐱 ∈ ℝⁿ
– Labels: 𝑦 ∈ {−1, +1}
• Training data
– S = {(𝐱ᵢ, yᵢ)}, consisting of 𝑚 examples
5
Classification, but…
6
The Sigmoid function
[Figure: plot of the sigmoid function σ(z), which squashes any real number into the interval (0, 1)]
13
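The plot itself did not survive extraction, so here is a minimal Python sketch of the sigmoid for illustration (the function name and the test values are mine, not from the slides). It makes the properties the figure shows concrete: σ(z) lies strictly between 0 and 1, σ(0) = 0.5, and σ(−z) = 1 − σ(z).

import numpy as np

def sigmoid(z):
    # Numerically stable sigma(z) = 1 / (1 + exp(-z)).
    # For negative z, compute via exp(z) to avoid overflow in exp(-z).
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))
    return out

z = np.linspace(-10.0, 10.0, 5)
print(sigmoid(z))                            # values strictly between 0 and 1
print(sigmoid(np.array([0.0, 3.0, -3.0])))   # 0.5 at z = 0; sigma(3) = 1 - sigma(-3)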
Predicting probabilities
• The logistic model uses the sigmoid to define the posterior probability of a label given an input:
– P(y = +1 ∣ 𝐱; 𝐰) = σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))
– P(y = −1 ∣ 𝐱; 𝐰) = 1 − σ(𝐰ᵀ𝐱) = σ(−𝐰ᵀ𝐱)
Or equivalently, since 1 − σ(z) = σ(−z), both cases for y ∈ {−1, +1} can be written as
P(y ∣ 𝐱; 𝐰) = σ(y 𝐰ᵀ𝐱) = 1 / (1 + exp(−y 𝐰ᵀ𝐱))
18
Predicting a label with logistic regression
• Compute 𝑃(𝑦 = +1 ∣ 𝐱; 𝐰) = σ(𝐰ᵀ𝐱)
– Predict +1 if this probability is greater than ½, and −1 otherwise
– Equivalently: Prediction = sgn(𝐰ᵀ𝐱)
20
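A minimal sketch of prediction with a trained weight vector, in Python for illustration; the names (predict_proba, predict_label) and the example numbers are mine, and the weight vector is assumed to be given. It mirrors the rule above: predict +1 exactly when σ(𝐰ᵀ𝐱) > ½, i.e. when 𝐰ᵀ𝐱 > 0.

import numpy as np

def predict_proba(w, x):
    # P(y = +1 | x; w) = sigma(w^T x) under the logistic model
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def predict_label(w, x):
    # Thresholding the probability at 1/2 is the same as taking sgn(w^T x)
    return 1 if np.dot(w, x) > 0 else -1

w = np.array([0.5, -1.0, 2.0])   # a made-up weight vector, just for the demo
x = np.array([1.0, 0.0, 0.3])
print(predict_proba(w, x), predict_label(w, x))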
This lecture
• Logistic regression
21
Maximum likelihood estimation
• Training data
– S = {(𝐱ᵢ, yᵢ)}, consisting of 𝑚 examples
• What we want
– Find a weight vector 𝐰 such that P(S ∣ 𝐰) is maximized
– We know that our examples are drawn independently and are identically distributed (i.i.d.)
– How do we proceed?
22
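One step worth making explicit before the product on the next slide (a sketch in the notation used here, with the standard assumption that the distribution of the inputs themselves does not depend on 𝐰): because the examples are i.i.d. and we model only the label given the input, the likelihood of the whole training set factors as
P(S ∣ 𝐰) = ∏_{i=1}^{m} P(𝐱ᵢ, yᵢ ∣ 𝐰) = ∏_{i=1}^{m} P(yᵢ ∣ 𝐱ᵢ, 𝐰) P(𝐱ᵢ) ∝ ∏_{i=1}^{m} P(yᵢ ∣ 𝐱ᵢ, 𝐰)
which is why only the product of conditional probabilities appears in the maximization that follows.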
Maximum likelihood estimation
argmax_𝐰 P(S ∣ 𝐰) = argmax_𝐰 ∏_{i=1}^{m} P(yᵢ ∣ 𝐱ᵢ, 𝐰)
23
Maximum likelihood estimation
argmax_𝐰 P(S ∣ 𝐰) = argmax_𝐰 ∏_{i=1}^{m} P(yᵢ ∣ 𝐱ᵢ, 𝐰)
Equivalent to solving
max_𝐰 Σᵢ log P(yᵢ ∣ 𝐱ᵢ, 𝐰)
24
Maximum likelihood estimation
argmax_𝐰 P(S ∣ 𝐰) = argmax_𝐰 ∏_{i=1}^{m} P(yᵢ ∣ 𝐱ᵢ, 𝐰)
Equivalent to solving
max_𝐰 Σᵢ log P(yᵢ ∣ 𝐱ᵢ, 𝐰)
But (by definition) we know that
P(yᵢ ∣ 𝐰, 𝐱ᵢ) = σ(yᵢ 𝐰ᵀ𝐱ᵢ) = 1 / (1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))
25
Recall: P(yᵢ ∣ 𝐰, 𝐱ᵢ) = 1 / (1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))
Maximum likelihood estimation
argmax_𝐰 P(S ∣ 𝐰) = argmax_𝐰 ∏_{i=1}^{m} P(yᵢ ∣ 𝐱ᵢ, 𝐰)
max_𝐰 Σᵢ log P(yᵢ ∣ 𝐱ᵢ, 𝐰)
Equivalent to solving (because log σ(z) = −log(1 + exp(−z)))
max_𝐰 Σᵢ −log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))
26
Recall: P(yᵢ ∣ 𝐰, 𝐱ᵢ) = 1 / (1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))
Maximum likelihood estimation
argmax_𝐰 P(S ∣ 𝐰) = argmax_𝐰 ∏_{i=1}^{m} P(yᵢ ∣ 𝐱ᵢ, 𝐰)
max_𝐰 Σᵢ log P(yᵢ ∣ 𝐱ᵢ, 𝐰)
Equivalent to solving
max_𝐰 Σᵢ −log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))
The goal: Maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.
27
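A minimal code sketch of this objective (Python, for illustration only; the function and variable names are mine). It evaluates the maximum-likelihood objective above, written as a negative log-likelihood to be minimized, using the log(1 + exp(·)) form of each term.

import numpy as np

def negative_log_likelihood(w, X, y):
    # Labels y are in {-1, +1}; rows of X are the examples.
    # Minimizing this sum of log(1 + exp(-y_i w^T x_i)) terms is the
    # same problem as the maximization on the slide.
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))   # stable log(1 + exp(-margin))

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]])  # made-up data
y = np.array([1.0, -1.0, 1.0])
print(negative_log_likelihood(np.zeros(2), X, y))     # equals 3 * log(2) at w = 0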
Maximum a posteriori estimation
29
MAP estimation for logistic regression
Assume a zero-mean Gaussian prior over each weight:
p(𝐰) = ∏_{j=1}^{n} p(wⱼ) = ∏_{j=1}^{n} (1 / (σ√(2π))) exp(−wⱼ² / σ²)
30
MAP estimation for logistic regression
p(𝐰) = ∏_{j=1}^{n} p(wⱼ) = ∏_{j=1}^{n} (1 / (σ√(2π))) exp(−wⱼ² / σ²)
To maximize the posterior probability of the model given the data (i.e. to find the most probable model, given the data):
P(𝐰 ∣ S) ∝ P(S ∣ 𝐰) P(𝐰)
32
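For completeness, the proportionality on this slide is just Bayes' rule with the normalizer dropped: P(𝐰 ∣ S) = P(S ∣ 𝐰) P(𝐰) / P(S), and since P(S) does not depend on 𝐰, maximizing P(𝐰 ∣ S) over 𝐰 is the same as maximizing P(S ∣ 𝐰) P(𝐰).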
MAP estimation for logistic regression
p(𝐰) = ∏_{j=1}^{n} p(wⱼ) = ∏_{j=1}^{n} (1 / (σ√(2π))) exp(−wⱼ² / σ²)
Learning by solving
argmax_𝐰 P(S ∣ 𝐰) P(𝐰)
Take log to simplify
36
MAP estimation for logistic regression
p(𝐰) = ∏_{j=1}^{n} p(wⱼ) = ∏_{j=1}^{n} (1 / (σ√(2π))) exp(−wⱼ² / σ²)
Learning by solving
argmax_𝐰 P(S ∣ 𝐰) P(𝐰)
Take log to simplify
max_𝐰 Σᵢ −log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ)) + Σ_{j=1}^{n} (−wⱼ² / σ²) + constants
37
MAP estimation for logistic regression
p(𝐰) = ∏_{j=1}^{n} p(wⱼ) = ∏_{j=1}^{n} (1 / (σ√(2π))) exp(−wⱼ² / σ²)
Learning by solving
argmax_𝐰 P(S ∣ 𝐰) P(𝐰)
Take log to simplify
max_𝐰 Σᵢ −log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ)) − (1/σ²) 𝐰ᵀ𝐰
38
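A minimal code sketch of the MAP objective (Python, for illustration; the names are mine, and the coefficient is written as 1/σ² to match the slide). It is the logistic loss from before plus a squared-norm penalty on 𝐰, again written as a quantity to minimize.

import numpy as np

def map_objective(w, X, y, sigma_sq):
    # Negative of the MAP objective on the slide:
    #   sum_i log(1 + exp(-y_i w^T x_i)) + (1 / sigma^2) * w^T w
    margins = y * (X @ w)
    logistic_loss = np.sum(np.logaddexp(0.0, -margins))
    l2_penalty = np.dot(w, w) / sigma_sq
    return logistic_loss + l2_penalty

X = np.array([[1.0, 2.0], [-1.0, 0.5]])   # made-up data
y = np.array([1.0, -1.0])
print(map_objective(np.array([0.1, -0.2]), X, y, sigma_sq=10.0))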
Learning a logistic regression classifier
41
Learning a logistic regression classifier
Exercise: Write down the stochastic gradient descent (SGD) algorithm for this objective. (One possible sketch is given below.)
Other training algorithms exist. For example, LBFGS is a quasi-Newton method. But gradient-based methods such as SGD and its variants are far more commonly used.
42
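One possible answer to the exercise, offered as a sketch rather than the course's reference solution (Python; the fixed step size, the regularization constant lam, and the epoch count are my choices). Each update picks one example, takes a gradient step on that example's logistic loss, and adds the gradient of the L2 penalty.

import numpy as np

def sgd_logistic_regression(X, y, lam=0.01, lr=0.1, epochs=20, seed=0):
    # SGD for L2-regularized logistic regression with labels in {-1, +1}.
    # Minimizes (1/m) * sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * np.dot(w, X[i])
            # d/dw of the per-example loss is -y_i * x_i * sigma(-margin)
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + 2.0 * lam * w
            w -= lr * grad
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])  # made-up data
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_logistic_regression(X, y)
print(w, np.sign(X @ w))   # the signs should match y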
Logistic regression is…
• Logistic regression
44
Learning as loss minimization
• The setup
– Examples 𝐱 drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f
• The ideal situation
– Define a function L that penalizes bad hypotheses
– Learning: Pick a function ℎ ∈ 𝐻 to minimize expected loss
45
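Written out in symbols (my notation; the slide states this only in words), the ideal learning problem is
h* = argmin_{h ∈ H} E_{𝐱 ~ D} [ L(h(𝐱), f(𝐱)) ]
i.e. pick the hypothesis whose loss is smallest in expectation over examples drawn from D. This expectation cannot be computed directly because D is unknown, which is what the empirical version on the next slides addresses.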
Empirical loss minimization
47
Regularized loss minimization
• Learning: minimize the average training loss plus a regularization penalty (sketched below)
48
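The body of this slide did not survive extraction; as a sketch consistent with the rest of the lecture (notation mine), regularized loss minimization for a linear hypothesis 𝐰 replaces the expectation by the average training loss and adds a penalty term:
min_𝐰 (1/m) Σᵢ L(yᵢ, 𝐰ᵀ𝐱ᵢ) + λ R(𝐰),  e.g. R(𝐰) = 𝐰ᵀ𝐰
With the logistic loss L(yᵢ, 𝐰ᵀ𝐱ᵢ) = log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ)) and the squared-norm regularizer, this is (up to scaling) exactly the MAP objective derived earlier, with λ playing the role of 1/σ².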
The 0-1 loss
49
The loss function zoo
[Figure: several loss functions plotted as functions of the margin y 𝐰ᵀ𝐱 — Zero-one, Hinge (SVM), Perceptron, Exponential (AdaBoost), and Logistic regression — shown at increasing levels of zoom]
58
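Since the plots did not survive extraction, here is a small Python sketch of the losses named in the figure (function names mine), each written as a function of the margin z = y 𝐰ᵀ𝐱; plotting them over a range of z reproduces the "zoo".

import numpy as np

# Each loss is written as a function of the margin z = y * w^T x.
def zero_one(z):    return (z <= 0).astype(float)     # 1 on a mistake, 0 otherwise
def hinge(z):       return np.maximum(0.0, 1.0 - z)   # SVM
def perceptron(z):  return np.maximum(0.0, -z)        # Perceptron
def exponential(z): return np.exp(-z)                 # AdaBoost
def logistic(z):    return np.log(1.0 + np.exp(-z))   # Logistic regression

z = np.linspace(-2.0, 2.0, 5)
for name, loss in [("zero-one", zero_one), ("hinge", hinge), ("perceptron", perceptron),
                   ("exponential", exponential), ("logistic", logistic)]:
    print(name, np.round(loss(z), 3))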
This lecture
• Logistic regression
59
Naïve Bayes and Logistic regression
60
Naïve Bayes and Logistic regression
Both classifiers use the same functional form for the posterior probability of a label:
P(y = +1 ∣ 𝐰, 𝐱) = σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))
63
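A brief note on why these two appear side by side (stated from general background, since the slide's remaining content is only the equation above): naïve Bayes, under suitable assumptions about the class-conditional distributions, yields a posterior of exactly this sigmoid-of-a-linear-function form, with the weights determined by its generative parameters; logistic regression keeps the same functional form but trains the weights discriminatively by maximizing the conditional likelihood directly. The two are therefore often described as a generative–discriminative pair.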