
Machine Learning – COMS3007

Logistic Regression
Benjamin Rosman

Based heavily on course notes by Chris Williams and Victor Lavrenko, Amos Storkey, Eric Eaton, and Clint van Alten
Classification
• Data X = {x^(0), …, x^(n)}, where x^(i) ∈ ℝ^d
• Labels y = {y^(0), …, y^(n)}, where y^(i) ∈ {0, 1}
• Want to learn a function y = f(x, θ) to predict y for a new x
y is the class (red/blue)

[Figure: two scatter plots of the data in the (x_1, x_2) plane, points coloured by class]
Generative vs discriminative
• In Naïve Bayes, we used a generative approach
• Class-conditional modeling
• p(y | x) ∝ p(x | y) p(y)

• Now model p(y | x) directly: the discriminative approach
• As was the case in decision trees
• Don't model p(x)

• Discriminative:
• Can’t generate data
• Often better
• Fewer variables

• Both are valid approaches


Two class discrimination
• Consider two classes: y ∈ {0, 1}
• We could use linear regression
• Doesn't perform well
• Values < 0 or > 1 don't make sense as probabilities
• We want a model of the form:
• P(y = 1 | x) = f(x; θ)
• It is a probability, so 0 ≤ f ≤ 1
• Also, probabilities sum to 1, so
• P(y = 0 | x) = 1 − f(x; θ)
• What form should we use for 𝑓?
The logistic function
• We need a function that gives probabilities: 0 ≤ f ≤ 1
• Logistic function:
• f(z) = σ(z) = 1 / (1 + exp(−z))
• "Sigmoid function": S-shaped
• "Squashing function"
• As z goes from −∞ to ∞, f goes from 0 to 1

[Figure: plot of σ(z) against z]

• Notes:
• σ(0) = 0.5: the "decision boundary"
• σ′(z) = σ(z)(1 − σ(z))
• Negative values of z → class 0; positive values of z → class 1
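As a minimal sketch in Python (assuming NumPy; function names are hypothetical), the logistic function and its derivative from the notes above:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# sigmoid(0) = 0.5 sits on the decision boundary;
# negative z gives probabilities below 0.5 (class 0), positive z above 0.5 (class 1).
```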
Linear weights
• Now we need a way of incorporating features x and parameters/weights θ
• Use the same idea of a linear weighting scheme from linear regression
• p(y = 1 | x) = σ(θ^T φ(x))
• θ is a vector of parameters
• φ(x) is the vector of features
• Decision boundary: σ(z) = 0.5 when z = 0
• So the decision boundary is θ^T φ(x) = 0
• For an M-dimensional problem, the boundary is an (M − 1)-dimensional hyperplane
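A small sketch of the resulting model and its decision rule (hypothetical names, assuming NumPy; phi_x stands for the feature vector φ(x)):

```python
import numpy as np

def predict_proba(theta, phi_x):
    """p(y = 1 | x) = sigma(theta^T phi(x)) for a single feature vector phi(x)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, phi_x)))

def predict_class(theta, phi_x):
    """Class 1 when theta^T phi(x) >= 0 (probability >= 0.5), else class 0."""
    return int(np.dot(theta, phi_x) >= 0)
```

The decision boundary is exactly the set of points where np.dot(theta, phi_x) is zero.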
Linear decision boundary

In linear regression, θ^T φ(x) defined the function going through our data. Here, the decision boundary θ^T φ(x) = 0 is the function separating our classes.
Cost function
• So:
• p(y = 1 | x; θ) = σ(θ^T φ(x)) = h_θ(x)
• p(y = 0 | x; θ) = 1 − h_θ(x)
• Write this more compactly as:
• p(y | x; θ) = h_θ(x)^y (1 − h_θ(x))^(1−y)
  (Why does this work? Check what happens when y = 0, and when y = 1.)

• Likelihood of m data points:
• L(θ) = ∏_{i=1}^m p(y^(i) | x^(i); θ)
•      = ∏_{i=1}^m h_θ(x^(i))^(y^(i)) (1 − h_θ(x^(i)))^(1−y^(i))
Cost function
• Likelihood of m data points:
• L(θ) = ∏_{i=1}^m h_θ(x^(i))^(y^(i)) (1 − h_θ(x^(i)))^(1−y^(i))

• Take the log of the likelihood:
• l(θ) = log L(θ)
•      = ∑_{i=1}^m [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

• We need to maximise the log likelihood
• Equivalent to minimising E(θ) = −l(θ)
• Cannot use a closed-form solution
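A minimal sketch of the cost E(θ) = −l(θ) over m data points (assuming NumPy; the names and the eps clipping to avoid log(0) are assumptions):

```python
import numpy as np

def neg_log_likelihood(theta, Phi, y, eps=1e-12):
    """E(theta) = -sum_i [ y_i log(h(x_i)) + (1 - y_i) log(1 - h(x_i)) ].

    Phi: (m, d) matrix whose rows are the feature vectors phi(x_i)
    y:   (m,) vector of 0/1 labels
    """
    h = 1.0 / (1.0 + np.exp(-Phi @ theta))   # h_theta(x_i) for every data point
    h = np.clip(h, eps, 1.0 - eps)           # keep the logs finite
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```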


Regularisation
• Just as in linear regression, regularisation is useful here
• Penalise the weights for growing too large
• Note: the higher the weights, the "steeper" the S shape, so this stops the model becoming over-confident

• min_θ E(θ), where
• E(θ) = −∑_{i=1}^m [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + λ ∑_{j=1}^d θ_j^2
• λ = strength of regularisation
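Reusing neg_log_likelihood from the sketch above, the regularised cost simply adds the penalty term (assuming theta[0] is the bias θ_0, which the gradient-descent slide later leaves unregularised):

```python
import numpy as np

def regularised_cost(theta, Phi, y, lam):
    """E(theta) plus the L2 penalty lambda * sum_j theta_j^2 (bias theta_0 not penalised)."""
    return neg_log_likelihood(theta, Phi, y) + lam * np.sum(theta[1:] ** 2)
```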
Regularisation
• Note: the higher the weights, the "steeper" the S shape, so regularisation stops the model becoming over-confident
• For example, in y = 1 / (1 + e^(−kx)), the weight k controls how steep the sigmoid is
Gradient descent (again)
• Initialise θ
• Repeat until convergence:
• θ_j ← θ_j − α ∂E(θ)/∂θ_j
• Simultaneous update for j = 0, …, d

• 0 < α ≤ 1 is the learning rate, usually set quite small
• Take a step of size α in the "downhill" direction (the negative gradient)
GD with regularisation
• Initialise θ
• Repeat until convergence:
• θ_0 ← θ_0 − α (h_θ(x^(i)) − y^(i))                (no regularisation on θ_0)
• θ_j ← θ_j − α [(h_θ(x^(i)) − y^(i)) x_j^(i) + λ θ_j]
• Simultaneous update for j = 0, …, d

• This is identical to linear regression!
• But the model is completely different:
• h_θ(x) = 1 / (1 + e^(−θ^T x))
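A compact sketch of these updates as a training loop (assuming NumPy; the convergence test and default values are assumptions):

```python
import numpy as np

def fit_logistic(Phi, y, alpha=0.1, lam=0.01, tol=1e-6, max_iters=1000):
    """Gradient descent for regularised logistic regression, one data point at a time.

    Phi: (m, d+1) design matrix with a leading column of ones (the bias feature x_0 = 1)
    y:   (m,) vector of 0/1 labels
    """
    theta = np.zeros(Phi.shape[1])
    for _ in range(max_iters):
        theta_old = theta.copy()
        for phi_i, y_i in zip(Phi, y):
            h = 1.0 / (1.0 + np.exp(-(phi_i @ theta)))   # h_theta(x_i)
            grad = (h - y_i) * phi_i                      # gradient of the data term
            grad[1:] += lam * theta[1:]                   # L2 penalty, not applied to theta_0
            theta = theta - alpha * grad
        if np.linalg.norm(theta - theta_old) < tol:       # converged: theta barely changed
            break
    return theta
```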
The effect of 𝛼
Example
• Generate two random classes of data, from Gaussians centered at (1, -1) and (-1, 1)
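A sketch of generating such data (assuming NumPy; the sample size, unit variance, and which Gaussian belongs to which class are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                                      # points per class (assumed)
X0 = rng.normal(loc=[1.0, -1.0], scale=1.0, size=(n, 2))    # class 0, centred at (1, -1)
X1 = rng.normal(loc=[-1.0, 1.0], scale=1.0, size=(n, 2))    # class 1, centred at (-1, 1)
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])
Phi = np.hstack([np.ones((2 * n, 1)), X])                   # prepend the bias feature x_0 = 1
```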
Example
• Weights randomly initialized: 𝜃 = (0.3, −0.01, −0.3)
Example
• Cycle through each data point i:
• Compute:
  δθ_0 = y^(i) − h_θ(x^(i))
  δθ_1 = (y^(i) − h_θ(x^(i))) x_1^(i)
  δθ_2 = (y^(i) − h_θ(x^(i))) x_2^(i)
• Update:
  θ_0 ← θ_0 + α δθ_0
  θ_1 ← θ_1 + α δθ_1
  θ_2 ← θ_2 + α δθ_2
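The same cycle written as code (a sketch reusing Phi and y from the data-generation sketch above; α is an assumed value, and there is no regularisation here, matching the example):

```python
import numpy as np

alpha = 0.1                                     # learning rate (assumed value)
theta = np.array([0.3, -0.01, -0.3])            # random initialisation from the slides

for phi_i, y_i in zip(Phi, y):                  # cycle through each data point i
    h = 1.0 / (1.0 + np.exp(-(phi_i @ theta)))  # h_theta(x_i)
    delta = (y_i - h) * phi_i                   # (delta_theta_0, delta_theta_1, delta_theta_2)
    theta = theta + alpha * delta               # step uphill on the log likelihood
```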
Example
• Run until convergence (threshold on the size of change of 𝜃)

Probabilities (of class 1) at four example points: 0.9999, 0.4386, 0.7759, 0.0007
Digression: the perceptron
• The logistic function gives a probabilistic output
• What if we wanted to instead force the output to be in {0, 1}?
• Instead of the logistic function, what about a step function?
• g(z) = 1 if z ≥ 0, and 0 if z < 0
• Use this as before:
• p(y = 1 | x) = g(θ^T φ(x)) = h_θ(x)

• Perceptron learning rule:
• θ_j ← θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)
• Exactly as before (with a different function)!
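A sketch of the perceptron rule under the same assumptions as the earlier sketches (NumPy, a design matrix Phi with a leading column of ones):

```python
import numpy as np

def step(z):
    """Hard threshold g(z): 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def perceptron_epoch(theta, Phi, y, alpha=0.1):
    """One pass of the perceptron learning rule over the data."""
    for phi_i, y_i in zip(Phi, y):
        h = step(phi_i @ theta)                        # h_theta(x_i) = g(theta^T phi(x_i))
        theta = theta + alpha * (y_i - h) * phi_i      # no change when the prediction is correct
    return theta
```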
The perceptron
• Historical model
• Invented by Frank Rosenblatt (1957)
• Thought to model neurons in the brain
• (Crudely)
• Originally a machine!

• Very controversial:
• It was basically claimed that the perceptron was expected to
• "be able to walk, talk, see, write, reproduce itself and be conscious of its existence"
Linear separability and XOR
• “Perceptrons” by Minsky and Papert (1969)
• Limitation of a perceptron: it cannot implement functions such as XOR
• Led to decreased research in neural networks, and increased research in symbolic AI
Basis functions (again)
• Use basis functions (again) to get around the linear separability problem
• Still need the data to be separable in some space

[Figure: adding polynomial basis functions makes the data separable]
Basis functions (again)
• Two Gaussian basis functions: centered at (-1, -1) and (0, 0)
• Data is separable under this transformation
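A sketch of a Gaussian (RBF) feature map with the two centres mentioned above (the width s and the added bias feature are assumptions):

```python
import numpy as np

def gaussian_features(X, centres, s=1.0):
    """Map each row x of X to [1, exp(-||x - c_1||^2 / (2 s^2)), exp(-||x - c_2||^2 / (2 s^2))]."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)   # (m, n_centres)
    rbf = np.exp(-sq_dists / (2.0 * s ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), rbf])                     # prepend the bias feature

centres = np.array([[-1.0, -1.0], [0.0, 0.0]])
Phi_rbf = gaussian_features(X, centres)          # X from the earlier data-generation sketch
```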
Basis functions (again)

Linear logistic regression Polynomial basis Gaussian basis functions


functions (RBFs)
Multiclass classification
• Instead of classifying between two classes, we may have more classes
Multiclass logistic regression
• For two classes:
• p(y = 1 | x; θ) = h_θ(x) = 1 / (1 + exp(−θ^T x)) = exp(θ^T x) / (1 + exp(θ^T x))

• Given C classes:
• p(y = c_k | x; θ) = exp(θ_k^T x) / ∑_{j=1}^C exp(θ_j^T x)

• This is the softmax function

• Note that 0 ≤ p(c_k | x; θ) ≤ 1, and ∑_{k=1}^C p(c_k | x; θ) = 1
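A sketch of the softmax (assuming NumPy; Theta holds one parameter vector θ_k per row):

```python
import numpy as np

def softmax_proba(Theta, x):
    """p(y = c_k | x; theta) = exp(theta_k^T x) / sum_j exp(theta_j^T x).

    Theta: (C, d) matrix, one row of parameters per class
    x:     (d,) feature vector
    """
    scores = Theta @ x
    scores = scores - scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # entries lie in [0, 1] and sum to 1
```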


Multiclass classification
• Split into one-vs-rest for each of the C classes
• Use gradient descent: update all parameters for all models simultaneously
• Pick the most probable class
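A sketch of the one-vs-rest scheme, reusing the fit_logistic sketch from the gradient-descent slide (names hypothetical):

```python
import numpy as np

def fit_one_vs_rest(Phi, y, classes, alpha=0.1, lam=0.01):
    """Fit one binary logistic regression per class: class k versus the rest."""
    return {k: fit_logistic(Phi, (y == k).astype(float), alpha=alpha, lam=lam) for k in classes}

def predict_one_vs_rest(models, phi_x):
    """Pick the class whose model gives phi(x) the highest probability."""
    probs = {k: 1.0 / (1.0 + np.exp(-(phi_x @ theta_k))) for k, theta_k in models.items()}
    return max(probs, key=probs.get)
```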
Recap
• Discriminative vs generative
• Model (logistic function)
• Decision boundaries
• Cost function
• Regularisation
• Gradient descent
• The perceptron
• Basis functions
• Multiclass classification
