
IAML: Logistic Regression

Nigel Goddard
School of Informatics

Semester 1

Outline

I Logistic function
I Logistic regression
I Learning logistic regression
I Optimization
I The power of non-linear basis functions
I Least-squares classification
I Generative and discriminative models
I Relationships to Generative Models
I Multiclass classification

Decision Boundaries

I In this class we will discuss linear classifiers.
I For each class, there is a region of feature space in which the classifier selects that class over the other.
I The decision boundary is the boundary of this region (i.e., where the two classes are "tied").
I In linear classifiers the decision boundary is a line (in two dimensions).

Example Data

[Figure: example data plotted in the (x1, x2) plane; points of one class are marked 'o' and points of the other are marked 'x'.]

Linear Classifiers

I In a two-class linear classifier, we learn a function

F(x, w) = wᵀx + w0

that represents how aligned the instance is with y = 1.
I w are parameters of the classifier that we learn from data.
I To do classification of an input x: x ↦ (y = 1) if F(x, w) > 0.

[Figure: the example data in the (x1, x2) plane, shown alongside the classifier.]

A Geometric View

[Figure: the example data with the weight vector w drawn; w is normal to the decision boundary (explained on the next slide).]

Explanation of Geometric View

I The decision boundary in this case is {x | wᵀx + w0 = 0}.
I w is a normal vector to this surface.
I (Remember how lines can be written in terms of their normal vector.)
I Notice that in more than 2 dimensions, this boundary will be a hyperplane.

Two Class Discrimination

I For now consider a two-class case: y ∈ {0, 1}.
I From now on we'll write x = (1, x1, x2, . . . , xd) and w = (w0, w1, . . . , wd).
I We want a linear, probabilistic model. We could try P(y = 1|x) = wᵀx, but wᵀx can be any real number, so it is not a valid probability.
I Instead what we will do is

P(y = 1|x) = f(wᵀx)

I f must be between 0 and 1: it will squash the real line into [0, 1].
I Furthermore, the fact that probabilities sum to one means

P(y = 0|x) = 1 − f(wᵀx)

The logistic function

I We need a function that returns probabilities (i.e. stays between 0 and 1).
I The logistic function provides this:
I f(z) = σ(z) ≡ 1/(1 + exp(−z)).
I As z goes from −∞ to ∞, f goes from 0 to 1: a "squashing function".
I It has a "sigmoid" shape (i.e. an S-like shape).

[Figure: plot of the logistic function σ(z) for z from −6 to 6, rising from near 0 towards 1.]

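A minimal code sketch (not from the slides; NumPy assumed) of the logistic function:

    import numpy as np

    def sigmoid(z):
        # Logistic function: squashes any real number into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(np.array([-6.0, 0.0, 6.0])))   # approx. [0.0025, 0.5, 0.9975]
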
Linear weights

I Linear weights + logistic squashing function == logistic regression.
I We model the class probabilities as

p(y = 1|x) = σ( Σ_{j=0}^{D} wj xj ) = σ(wᵀx)

I σ(z) = 0.5 when z = 0. Hence the decision boundary is given by wᵀx = 0.
I The decision boundary is an (M − 1)-dimensional hyperplane for an M-dimensional problem.

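As a minimal sketch (hypothetical inputs and weights; NumPy assumed), the model and its decision rule look like this:

    import numpy as np

    def predict_proba(X, w):
        # p(y = 1|x) = sigma(w^T x); the first column of X is the constant 1 (x0).
        return 1.0 / (1.0 + np.exp(-X @ w))

    def predict(X, w):
        # Classify as y = 1 exactly when w^T x > 0, i.e. when sigma(w^T x) > 0.5.
        return (X @ w > 0).astype(int)

    # Hypothetical 2-D inputs with a bias column, and made-up weights (w0, w1, w2).
    X = np.array([[1.0,  0.5, 2.0],
                  [1.0, -1.5, 0.3]])
    w = np.array([0.1, 2.0, -1.0])
    print(predict_proba(X, w), predict(X, w))
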
Logistic regression

I For this slide write w̃ = (w1, w2, . . . , wd) (i.e., exclude the bias w0).
I The bias parameter w0 shifts the position of the hyperplane, but does not alter the angle.
I The direction of the vector w̃ affects the angle of the hyperplane. The hyperplane is perpendicular to w̃.
I The magnitude of the vector w̃ affects how certain the classifications are (see the numeric sketch below).
I For small w̃, most of the probabilities within the region of the decision boundary will be near to 0.5.
I For large w̃, probabilities in the same region will be close to 1 or 0.

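A small numeric illustration of the last two points (made-up point and weights): rescaling w̃ leaves the decision boundary unchanged but pushes the probabilities towards 0 or 1.

    import numpy as np

    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 0.4, -0.2])    # a point near the boundary (bias input first)
    w = np.array([0.0, 1.0, 1.0])     # hypothetical weights; here w^T x = 0.2

    for c in (0.5, 2.0, 10.0):        # scale the weights up
        print(c, sigma(c * (w @ x)))  # approx. 0.52, 0.60, 0.88: same side, growing certainty
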
Learning Logistic Regression

I Want to set the parameters w using training data.


I As before:
I Write out the model and hence the likelihood
I Find the derivatives of the log likelihood w.r.t. the parameters.
I Adjust the parameters to maximize the log likelihood.

I Assume data is independent and identically distributed.
I Call the data set D = {(x1, y1), (x2, y2), . . . , (xn, yn)}.
I The likelihood is

p(D|w) = Π_{i=1}^{n} p(y = yi | xi, w)
       = Π_{i=1}^{n} p(y = 1|xi, w)^{yi} (1 − p(y = 1|xi, w))^{1−yi}

I Hence the log likelihood L(w) = log p(D|w) is given by

L(w) = Σ_{i=1}^{n} [ yi log σ(wᵀxi) + (1 − yi) log(1 − σ(wᵀxi)) ]

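A direct transcription of L(w) into code (a sketch; NumPy assumed, with a small epsilon to avoid log(0)):

    import numpy as np

    def log_likelihood(w, X, y):
        # L(w) = sum_i [ yi log sigma(w^T xi) + (1 - yi) log(1 - sigma(w^T xi)) ]
        p = 1.0 / (1.0 + np.exp(-X @ w))
        eps = 1e-12                      # numerical guard against log(0)
        return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
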
I It turns out that the likelihood has a unique optimum (given sufficient training examples): the negative log likelihood is convex.
I How to maximize? Take the gradient:

∂L/∂wj = Σ_{i=1}^{n} (yi − σ(wᵀxi)) xij

I (Aside: something similar holds for linear regression,

∂E/∂wj = Σ_{i=1}^{n} (wᵀφ(xi) − yi) xij

where E is the squared error.)
I Unfortunately, you cannot maximize L(w) explicitly as for linear regression. You need to use a numerical optimisation method; see later.

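A minimal gradient-ascent sketch built on the gradient above (hypothetical learning rate and iteration count; in practice one of the optimizers on the next slide would be used):

    import numpy as np

    def fit_logistic(X, y, lr=0.1, n_iters=1000):
        # Batch gradient ascent on L(w).
        # dL/dwj = sum_i (yi - sigma(w^T xi)) xij, i.e. the full gradient is X^T (y - p).
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-X @ w))
            w += lr * X.T @ (y - p)
        return w
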
Fitting this into the general structure for learning algorithms:

I Define the task: classification, discriminative


I Decide on the model structure: logistic regression model
I Decide on the score function: log likelihood
I Decide on optimization/search method to optimize the
score function: numerical optimization routine. Note we
have several choices here (stochastic gradient descent,
conjugate gradient, BFGS).

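For example, the score function and its gradient can be handed to an off-the-shelf routine such as BFGS (a sketch assuming SciPy's scipy.optimize.minimize; minimizing −L(w) is equivalent to maximizing L(w)):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(w, X, y):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        eps = 1e-12
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    def neg_gradient(w, X, y):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return -(X.T @ (y - p))

    def fit(X, y):
        w0 = np.zeros(X.shape[1])
        res = minimize(neg_log_likelihood, w0, args=(X, y),
                       jac=neg_gradient, method="BFGS")
        return res.x   # fitted weight vector
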
XOR and Linear Separability

I A problem is linearly separable if we can find weights so that
I w̃ᵀx + w0 > 0 for all positive cases (where y = 1), and
I w̃ᵀx + w0 ≤ 0 for all negative cases (where y = 0).
I XOR: a classic failure case for the perceptron.
I XOR becomes linearly separable if we apply a non-linear transformation φ(x) of the input; can you find one? (One answer is sketched below.)

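One possible answer, as a minimal sketch (appending the product feature x1·x2 is just one choice of φ; the slide leaves finding one as an exercise):

    import numpy as np

    # XOR: the label is 1 exactly when the two binary inputs differ.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    def phi(X):
        # Append the product x1*x2 as an extra feature.
        return np.column_stack([X, X[:, 0] * X[:, 1]])

    # In the transformed space a linear rule separates the classes, e.g.
    # f(x) = x1 + x2 - 2*x1*x2 - 0.5 is positive exactly for the XOR-true points.
    w, w0 = np.array([1.0, 1.0, -2.0]), -0.5
    print((phi(X) @ w + w0 > 0).astype(int))   # [0 1 1 0], matching y
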
The power of non-linear basis functions

[Figure: a two-class data set shown in the original (x1, x2) space and in the transformed (φ1, φ2) space, using two Gaussian basis functions φ1(x) and φ2(x). Figure credit: Chris Bishop, PRML.]

As for linear regression, we can transform the input space if we want: x → φ(x).

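A sketch of the transformation itself (hypothetical centres and width; the transformed features would then be fed to logistic regression exactly as before):

    import numpy as np

    def gaussian_basis(X, centres, s=1.0):
        # phi_k(x) = exp(-||x - c_k||^2 / (2 s^2)), one feature per centre c_k.
        sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dists / (2.0 * s ** 2))

    centres = np.array([[0.0, 0.0], [1.0, 1.0]])      # two made-up centres
    X = np.random.randn(5, 2)                         # some 2-D inputs
    Phi = gaussian_basis(X, centres)                  # new features (phi1, phi2)
    Phi = np.column_stack([np.ones(len(Phi)), Phi])   # prepend the constant-1 feature
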
Generative and Discriminative Models
I Notice that we have done something very different here than with naive Bayes.
I Naive Bayes: we modelled how a class "generated" the feature vector, p(x|y). Then we could classify using

p(y|x) ∝ p(x|y) p(y)

This is called a generative approach.
I Logistic regression: we model p(y|x) directly. This is a discriminative approach.
I Discriminative advantage: why spend effort modelling p(x)? It seems a waste, since we are always given it as input.
I Generative advantage: it can be good with missing data (remember how naive Bayes handles missing data) and for detecting outliers. Or, sometimes you really do want to generate the input.

Generative Classifiers can be Linear Too

Two scenarios where naive Bayes gives you a linear classifier:

1. Gaussian data with equal covariance. If p(x|y = 1) ∼ N(µ1, Σ) and p(x|y = 0) ∼ N(µ2, Σ), then

   p(y = 1|x) = σ(w̃ᵀx + w0)

   for some (w0, w̃) that depend on µ1, µ2, Σ and the class priors.

2. Binary data. Let each component xj be a Bernoulli variable, i.e. xj ∈ {0, 1}. Then a Naïve Bayes classifier has the form

   p(y = 1|x) = σ(w̃ᵀx + w0)

3. Exercise for keeners: prove these two results.

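A sketch of the first result (standard Gaussian algebra, not worked through on the slides): with equal covariances the quadratic terms in x cancel from the log-odds,

    log [ p(y = 1|x) / p(y = 0|x) ]
      = log p(x|y = 1) − log p(x|y = 0) + log [ p(y = 1) / p(y = 0) ]
      = (µ1 − µ2)ᵀ Σ⁻¹ x − ½ µ1ᵀ Σ⁻¹ µ1 + ½ µ2ᵀ Σ⁻¹ µ2 + log [ p(y = 1) / p(y = 0) ]

so w̃ = Σ⁻¹(µ1 − µ2), w0 collects the remaining constant terms, and applying σ to the log-odds gives p(y = 1|x) = σ(w̃ᵀx + w0).
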
Multiclass classification

I Create a different weight vector wk for each class, to classify into k and not-k.
I Then use the "softmax" function

p(y = k|x) = exp(wkᵀx) / Σ_{j=1}^{C} exp(wjᵀx)

I Note that 0 ≤ p(y = k|x) ≤ 1 and Σ_{j=1}^{C} p(y = j|x) = 1.
I This is the natural generalization of logistic regression to more than 2 classes.

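A minimal sketch (NumPy assumed): stacking the C per-class weight vectors as the rows of a matrix W, the softmax probabilities are

    import numpy as np

    def softmax_proba(X, W):
        # p(y = k|x) = exp(wk^T x) / sum_j exp(wj^T x); row k of W is wk.
        scores = X @ W.T                               # shape (n, C)
        scores -= scores.max(axis=1, keepdims=True)    # subtract row max for numerical stability
        e = np.exp(scores)
        return e / e.sum(axis=1, keepdims=True)        # each row sums to 1
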
Least-squares classification
I Logistic regression is more complicated algorithmically than linear regression.
I Why not just use linear regression with 0/1 targets?

[Figure: two panels showing decision boundaries fit to two-class data. Green: logistic regression; magenta: least-squares regression. Figure credit: Chris Bishop, PRML.]
