Week 4: Logistic Regression
Nigel Goddard
School of Informatics
Semester 1
Outline
- Logistic function
- Logistic regression
- Learning logistic regression
- Optimization
- The power of non-linear basis functions
- Least-squares classification
- Generative and discriminative models
- Relationships to Generative Models
- Multiclass classification
Decision Boundaries
Example Data
[Figure: example data in (x1, x2) space with two classes of points, plotted as 'o' and 'x'.]
Linear Classifiers
[Figure: the example data in (x1, x2) space with a linear decision boundary drawn between the two classes.]
- F(x, w) = wᵀx + w0 is a score that represents how aligned the instance is with y = 1.
- w are parameters of the classifier that we learn from data.
- To do classification of an input x: x ↦ (y = 1) if F(x, w) > 0
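To make the rule concrete, here is a minimal sketch (my own illustration, assuming NumPy; the weights below are hand-picked placeholders, not learned values):

```python
import numpy as np

def linear_classify(X, w, w0):
    """Return 1 where F(x, w) = w^T x + w0 > 0, else 0, for each row x of X."""
    scores = X @ w + w0              # F(x, w) for every instance
    return (scores > 0).astype(int)

# Toy example with hand-picked (not learned) weights.
w = np.array([1.0, -1.0])            # placeholder weights
w0 = 0.5                             # placeholder bias
X = np.array([[2.0, 0.5],            # scores positive
              [-1.0, 3.0]])          # scores negative
print(linear_classify(X, w, w0))     # -> [1 0]
```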
A Geometric View
[Figure: the example data in (x1, x2) space with the weight vector w drawn as an arrow.]
Explanation of Geometric View
The decision boundary is the set of points {x | wᵀx + w0 = 0}: the weight vector w is perpendicular (normal) to this hyperplane, and w0 sets its offset from the origin.
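A small sketch of this geometric view (my own illustration, assuming NumPy): the signed distance of a point from the hyperplane is (wᵀx + w0)/‖w‖, and its sign says which side of the boundary the point lies on.

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance of x from the hyperplane {x | w^T x + w0 = 0}."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0                     # placeholder boundary: 3*x1 + 4*x2 - 5 = 0
print(signed_distance(np.array([1.0, 0.5]), w, w0))    # 0.0: exactly on the boundary
print(signed_distance(np.array([2.0, 2.0]), w, w0))    # positive: on the side w points towards
```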
Two Class Discrimination
The logistic function
- We need a function that returns probabilities (i.e. stays between 0 and 1).
- The logistic function provides this:
- f(z) = σ(z) ≡ 1/(1 + exp(−z)).
- As z goes from −∞ to ∞, so f goes from 0 to 1, a "squashing function".
- It has a "sigmoid" shape (i.e. S-like shape).
[Figure: plot of σ(z) for z from −6 to 6, rising in an S-shaped curve from near 0 to near 1 and passing through 0.5 at z = 0.]
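A minimal NumPy sketch of the logistic function (my own illustration, evaluated over the same range of z as the plot above):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(z))   # roughly [0.0025, 0.12, 0.5, 0.88, 0.9975]
```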
Linear weights
Logistic regression
- Model the class probability as p(y = 1 | x, w) = σ(wᵀx + w0).
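Putting the linear score through the logistic function, a small sketch (my own illustration, assuming NumPy; the weights are placeholders):

```python
import numpy as np

def predict_proba(X, w, w0):
    """p(y = 1 | x, w) = sigma(w^T x + w0) for each row x of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))

w, w0 = np.array([1.0, -1.0]), 0.0           # placeholder weights
X = np.array([[2.0, 0.5], [-1.0, 3.0]])
print(predict_proba(X, w, w0))               # roughly [0.82, 0.02]
```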
Learning Logistic Regression
- Assume data is independent and identically distributed.
- Call the data set D = {(x1, y1), (x2, y2), . . . , (xn, yn)}
- The likelihood is

  p(D|w) = ∏_{i=1}^n p(y = yi | xi, w)
         = ∏_{i=1}^n p(y = 1 | xi, w)^yi (1 − p(y = 1 | xi, w))^(1−yi)

- Taking logs gives the log-likelihood

  L(w) = ∑_{i=1}^n [ yi log σ(wᵀxi) + (1 − yi) log(1 − σ(wᵀxi)) ]
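A minimal NumPy sketch of this log-likelihood (my own illustration; the toy data X, y and the epsilon clipping that guards the logarithms are my additions):

```python
import numpy as np

def log_likelihood(w, X, y, eps=1e-12):
    """L(w) = sum_i y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i))."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # sigma(w^T x_i) for each i
    p = np.clip(p, eps, 1.0 - eps)         # avoid log(0)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Placeholder data: 4 points, 2 features (a constant 1 column plays the role of w0).
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, -2.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
print(log_likelihood(np.array([0.0, 1.0]), X, y))
```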
- It turns out that the likelihood has a unique optimum (given sufficient training examples): the negative log-likelihood is convex (equivalently, L(w) is concave), so any local maximum is the global one.
- How to maximize? Take the gradient:

  ∂L/∂wj = ∑_{i=1}^n (yi − σ(wᵀxi)) xij
Fitting this into the general structure for learning algorithms:
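One way to realise that structure is plain gradient ascent on L(w). The sketch below is my own illustration (NumPy, hand-picked learning rate and iteration count, toy placeholder data), not an algorithm prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Maximise L(w) by gradient ascent, using dL/dw_j = sum_i (y_i - sigma(w^T x_i)) x_ij."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ w))   # the gradient from the slide, vectorised
        w = w + lr * grad                   # step uphill on the log-likelihood
    return w

# Placeholder toy data; the column of ones plays the role of the bias w0.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, -2.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = fit_logistic(X, y)
print(np.round(sigmoid(X @ w), 2))          # predicted probabilities move towards the 0/1 targets
```

In practice one would monitor L(w) and stop when it plateaus, or use a second-order method; the fixed iteration count here is only for brevity.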
XOR and Linear Separability
- A problem is linearly separable if we can find weights so that
  - w̃ᵀx + w0 > 0 for all positive cases (where y = 1), and
  - w̃ᵀx + w0 ≤ 0 for all negative cases (where y = 0)
- XOR, a failure for the perceptron
[Figure: left, the data plotted in the original (x1, x2) space; right, the same data mapped into basis-function space (φ1, φ2), where the classes can be separated by a linear boundary.]
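A small sketch of the XOR point (my own illustration, assuming NumPy): no linear rule separates the four XOR points in the original space, but after adding one non-linear basis function, here x1·x2 as an assumed example (the slides' φ1, φ2 may be defined differently), a hand-picked linear rule does.

```python
import numpy as np

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# No weights (w, w0) separate these four points in the original (x1, x2) space.
# Add one non-linear basis function, phi(x) = x1 * x2 (an assumed example choice).
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])   # features: x1, x2, x1*x2

# Hand-picked weights on the expanded features that do separate XOR:
w, w0 = np.array([1.0, 1.0, -2.0]), -0.5        # score = x1 + x2 - 2*x1*x2 - 0.5
print((Phi @ w + w0 > 0).astype(int))           # -> [0 1 1 0], matching y
```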
Multiclass classification
  p(y = k | x) = exp(wkᵀx) / ∑_{j=1}^C exp(wjᵀx)

- Note that 0 ≤ p(y = k | x) ≤ 1 and ∑_{j=1}^C p(y = j | x) = 1
- This is the natural generalization of logistic regression to more than 2 classes.
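A minimal NumPy sketch of this rule (my own illustration; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not part of the slide):

```python
import numpy as np

def softmax_proba(x, W):
    """p(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x); W has one row per class."""
    scores = W @ x
    scores -= scores.max()          # stability: shifting all scores leaves the ratios unchanged
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[1.0, 0.0],           # placeholder weight vectors, C = 3 classes
              [0.0, 1.0],
              [-1.0, -1.0]])
p = softmax_proba(np.array([2.0, 1.0]), W)
print(p, p.sum())                   # probabilities in [0, 1] summing to 1
```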
Least-squares classification
- Logistic regression is more complicated algorithmically than linear regression
- Why not just use linear regression with 0/1 targets?
[Figure: two panels of two-class example data (axes roughly −4 to 8 by −8 to 4) illustrating the behaviour of least-squares classification.]
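A sketch of the least-squares alternative (my own illustration, assuming NumPy): fit ordinary linear regression to the 0/1 targets and threshold the prediction at 1/2. It is simple to compute, but the squared-error loss also penalises points that are classified "too correctly", which is one reason its decision boundary can be dragged around by points far from the boundary and why logistic regression is usually preferred.

```python
import numpy as np

def least_squares_classifier(X, y):
    """Fit linear regression to 0/1 targets; classify by thresholding the prediction at 0.5."""
    Xb = np.column_stack([np.ones(len(X)), X])       # prepend a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)       # ordinary least-squares fit
    return lambda Xnew: (np.column_stack([np.ones(len(Xnew)), Xnew]) @ w > 0.5).astype(int)

# Placeholder data: one feature, targets in {0, 1}.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
classify = least_squares_classifier(X, y)
print(classify(np.array([[-3.0], [0.2], [3.0]])))    # -> [0 1 1]
```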