Linear Classifiers
M. Soleymani
Fall 2016
Topics
Discriminant functions
Linear classifiers
Perceptron
SVM (covered in later lectures)
Fisher's linear discriminant
Multi-class classification
Classification problem
Given: a training set, i.e. a labeled set of 𝑁 input-output pairs 𝐷 = {(𝒙^(i), 𝑦^(i))}_{i=1}^𝑁
𝑦 ∈ {1, … , 𝐾}
Examples:
Spam filter
Handwritten digit recognition
…
Discriminant functions
A discriminant function can directly assign each vector 𝒙 to a specific class 𝑘.
Discriminant Functions
Discriminant functions: a discriminant function 𝑓_𝑖(𝒙) is defined for each class 𝒞_𝑖 (𝑖 = 1, … , 𝐾):
𝒙 is assigned to class 𝒞_𝑖 if 𝑓_𝑖(𝒙) > 𝑓_𝑗(𝒙) ∀𝑗 ≠ 𝑖
Decision surface (two classes, 𝑓 = 𝑓_1 − 𝑓_2): 𝑓(𝒙) = 0
Linear classifiers
Decision boundaries are linear in 𝒙, or linear in some
given set of functions of 𝒙
Linearly separable data: data points that can be exactly
classified by a linear decision surface.
Two Category
𝑓(𝒙; 𝒘) = 𝒘^T 𝒙 + 𝑤_0 = 𝑤_0 + 𝑤_1 𝑥_1 + … + 𝑤_𝑑 𝑥_𝑑
𝒙 = [𝑥_1, 𝑥_2, … , 𝑥_𝑑]^T
𝒘 = [𝑤_1, 𝑤_2, … , 𝑤_𝑑]^T
𝑤_0: bias
Decision rule: if 𝒘^T 𝒙 + 𝑤_0 ≥ 0 then 𝒞_1, else 𝒞_2
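As a minimal sketch in Python (NumPy; the weights are illustrative, matching the example on the next slide):

import numpy as np

def predict(w, w0, x):
    # Class 1 if w^T x + w0 >= 0, else class 2
    return 1 if w @ x + w0 >= 0 else 2

w = np.array([-1.0, -0.75])   # [w1, w2] for the boundary 3 - x1 - (3/4) x2 = 0
w0 = 3.0                      # bias
print(predict(w, w0, np.array([1.0, 1.0])))   # positive half-space -> 1
print(predict(w, w0, np.array([4.0, 4.0])))   # negative half-space -> 2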
Example
[Figure: the decision boundary 3 − 𝑥_1 − (3/4)𝑥_2 = 0 in the (𝑥_1, 𝑥_2) plane; points with 𝒘^T 𝒙 + 𝑤_0 ≥ 0 are assigned to 𝒞_1, the rest to 𝒞_2.]
Linear classifier: Two Category
The decision boundary is a (𝑑 − 1)-dimensional hyperplane 𝐻 in the 𝑑-dimensional feature space.
The orientation of 𝐻 is determined by the normal vector [𝑤_1, … , 𝑤_𝑑]; 𝑤_0 determines the location of the surface.
The signed distance of the origin from the decision surface is 𝑤_0/‖𝒘‖.
Writing 𝒙 = 𝒙_⊥ + 𝑟 𝒘/‖𝒘‖, where 𝒙_⊥ is the orthogonal projection of 𝒙 onto the surface:
𝒘^T 𝒙 + 𝑤_0 = 𝑟‖𝒘‖ ⇒ 𝑟 = (𝒘^T 𝒙 + 𝑤_0)/‖𝒘‖
So 𝑓(𝒙) = 𝒘^T 𝒙 + 𝑤_0 gives a signed measure (scaled by ‖𝒘‖) of the perpendicular distance 𝑟 of the point 𝒙 from the decision surface, on which 𝑓(𝒙) = 0.
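The distance formula translates directly; a small sketch, reusing the illustrative boundary from before:

import numpy as np

def signed_distance(w, w0, x):
    # r = (w^T x + w0) / ||w||
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([-1.0, -0.75]), 3.0
print(signed_distance(w, w0, np.zeros(2)))  # w0/||w||: signed distance of the origin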
Linear boundary: geometry
[Figure: the hyperplane 𝒘^T 𝒙 + 𝑤_0 = 0 separates the region 𝒘^T 𝒙 + 𝑤_0 > 0 from 𝒘^T 𝒙 + 𝑤_0 < 0; 𝒘 is normal to the hyperplane, and (𝒘^T 𝒙 + 𝑤_0)/‖𝒘‖ is the signed distance of 𝒙 from it.]
Non-linear decision boundary
Choose non-linear features; the classifier is still linear in the parameters 𝒘.
[Figure: the circular boundary −1 + 𝑥_1² + 𝑥_2² = 0 in the (𝑥_1, 𝑥_2) plane.]
if 𝒘^T 𝝓(𝒙) ≥ 0 then 𝑦 = 1, else 𝑦 = −1
Here 𝒙 = [𝑥_1, 𝑥_2], and the boundary above corresponds to 𝝓(𝒙) = [1, 𝑥_1², 𝑥_2²] with 𝒘 = [−1, 1, 1].
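A minimal sketch of this particular boundary, assuming the feature map 𝝓(𝒙) = [1, 𝑥_1², 𝑥_2²] and 𝒘 = [−1, 1, 1] given above:

import numpy as np

w = np.array([-1.0, 1.0, 1.0])        # encodes -1 + x1^2 + x2^2

def phi(x):
    # non-linear features; the classifier below is still linear in w
    return np.array([1.0, x[0]**2, x[1]**2])

def predict(x):
    return 1 if w @ phi(x) >= 0 else -1

print(predict(np.array([0.0, 0.0])))  # inside the unit circle -> -1
print(predict(np.array([2.0, 0.0])))  # outside the unit circle -> 1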
Cost Function for linear classification
Finding linear classifiers can be formulated as an optimization
problem:
Select how to measure the prediction loss
Based on the training set 𝐷 = {(𝒙^(i), 𝑦^(i))}_{i=1}^𝑁, a cost function 𝐽(𝒘) is defined
Solve the resulting optimization problem to find the parameters:
Find the optimal 𝑓(𝒙) = 𝑓(𝒙; 𝒘*) where 𝒘* = argmin_𝒘 𝐽(𝒘)
SSE cost function for classification 𝐾=2
𝐽(𝒘) = Σ_{i=1}^𝑁 (𝒘^T 𝒙^(i) − 𝑦^(i))²
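Minimizing this cost is an ordinary least-squares problem; a sketch with illustrative data (bias absorbed as a leading 1 in each 𝒙):

import numpy as np

X = np.array([[1.0, 0.0, 0.0],     # each row: [1, x1, x2]
              [1.0, 0.5, 0.5],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # w minimizing ||Xw - y||^2
print(np.sign(X @ w))                      # classify by the sign of w^T x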
SSE cost function for classification 𝐾=2
[Figure (from Bishop): the squared error (𝒘^T 𝒙 − 𝑦)² plotted against 𝒘^T 𝒙 for 𝑦 = 1 and 𝑦 = −1; predictions that are correct but far from the target (e.g. 𝒘^T 𝒙 ≫ 1 for 𝑦 = 1) are still penalized by SSE.]
SSE cost function for classification 𝐾=2
Replacing 𝒘^T 𝒙 by sign(𝒘^T 𝒙) in the SSE gives a cost proportional to the number of misclassifications:
𝐽(𝒘) = Σ_{i=1}^𝑁 (sign(𝒘^T 𝒙^(i)) − 𝑦^(i))²
sign(z) = −1 if z < 0; 1 if z ≥ 0
Perceptron algorithm
Linear classifier
Two-class: 𝑦 ∈ {−1,1}
𝑦 = −1 for 𝐶2 , 𝑦 = 1 for 𝐶1
𝑓(𝒙; 𝒘) = sign(𝒘^T 𝒙)   (here 𝒙 is augmented with a constant 1 so that the bias 𝑤_0 is absorbed into 𝒘)
Perceptron criterion
𝐽_P(𝒘) = −Σ_{i∈ℳ} 𝒘^T 𝒙^(i) 𝑦^(i)
where ℳ is the set of misclassified training samples.
Cost function
[Figure: the number of misclassifications as a cost function (piecewise constant in (𝑤_0, 𝑤_1)) versus the perceptron cost 𝐽_P(𝒘) (piecewise linear).]
𝛻_𝒘 𝐽_P(𝒘) = −Σ_{i∈ℳ} 𝒙^(i) 𝑦^(i)
Batch perceptron (converges in a finite number of steps for linearly separable data):
Initialize 𝒘
Repeat
  𝒘 ← 𝒘 + 𝜂 Σ_{i∈ℳ} 𝒙^(i) 𝑦^(i)
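A sketch of the batch update in Python (X holds one augmented sample per row, y ∈ {−1, 1}):

import numpy as np

def batch_perceptron(X, y, eta=1.0, max_iter=1000):
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        miscls = np.sign(X @ w) != y            # sign(0) = 0 also counts as an error
        if not miscls.any():
            break                               # converged: no misclassifications
        w += eta * (X[miscls] * y[miscls, None]).sum(axis=0)
    return w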
Stochastic gradient descent for Perceptron
Single-sample perceptron:
If 𝒙^(i) is misclassified:
𝒘^(t+1) = 𝒘^(t) + 𝜂 𝒙^(i) 𝑦^(i)
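The single-sample variant, as a sketch:

import numpy as np

def sgd_perceptron(X, y, eta=1.0, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:       # misclassified (or on the boundary)
                w += eta * y_i * x_i       # w^(t+1) = w^(t) + eta * x^(i) y^(i)
                errors += 1
        if errors == 0:
            break
    return w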
Perceptron: Example
[Figure (from Bishop): each update changes 𝒘 in a direction that corrects the error on a misclassified point.]
Convergence of Perceptron
For data sets that are not linearly separable, the single-sample
perceptron learning algorithm will never converge
Pocket algorithm
For data that are not linearly separable (e.g. due to noise):
Keep in the pocket the best 𝒘 encountered so far.
Initialize 𝒘
for 𝑡 = 1, … , 𝑇
  𝑖 ← 𝑡 mod 𝑁
  if 𝒙^(i) is misclassified then
    𝒘_new = 𝒘 + 𝒙^(i) 𝑦^(i)
    if 𝐸_train(𝒘_new) < 𝐸_train(𝒘) then
      𝒘 ← 𝒘_new
end
𝐸_train(𝒘) = (1/𝑁) Σ_{n=1}^𝑁 [sign(𝒘^T 𝒙^(n)) ≠ 𝑦^(n)]
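A direct translation of the pocket loop above (NumPy sketch):

import numpy as np

def pocket(X, y, T=1000):
    N = X.shape[0]
    w = np.zeros(X.shape[1])

    def e_train(w):
        # fraction of training samples with sign(w^T x) != y
        return np.mean(np.sign(X @ w) != y)

    for t in range(T):
        i = t % N
        if np.sign(X[i] @ w) != y[i]:        # x^(i) is misclassified
            w_new = w + y[i] * X[i]
            if e_train(w_new) < e_train(w):  # keep the better w in the pocket
                w = w_new
    return w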
Linear Discriminant Analysis (LDA)
Fisher's Linear Discriminant Analysis:
Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as new discriminant variables)
Classification: predicts the class of an observation 𝒙 by first projecting it onto the space of discriminant variables and then classifying it in this space
Good Projection for Classification
What is a good criterion?
Separating different classes in the projected space
LDA Problem
Problem definition:
Two classes (𝐶 = 2)
{(𝒙^(i), 𝑦^(i))}_{i=1}^𝑁 training samples, with 𝑁_1 samples from the first class (𝒞_1) and 𝑁_2 samples from the second class (𝒞_2)
Goal: find the best direction 𝒘, in the hope that it enables accurate classification
Measure of Separation in the Projected Direction
Is the direction of the line joining the class means a good candidate for 𝒘?
[Figure from Bishop]
Measure of Separation in the Projected Direction
The direction of the line joining the class means solves the problem of maximizing the separation of the projected class means. Fisher's criterion also takes the projected within-class scatters into account:
𝐽(𝒘) = (𝜇′_1 − 𝜇′_2)² / (𝑠′²_1 + 𝑠′²_2)
where 𝜇′_k = 𝒘^T 𝝁_k is the projected mean of class k and 𝑠′²_k its projected scatter.
LDA Criteria
The scatters of the original data are:
𝑠²_1 = Σ_{𝒙^(i)∈𝒞_1} ‖𝒙^(i) − 𝝁_1‖²
𝑠²_2 = Σ_{𝒙^(i)∈𝒞_2} ‖𝒙^(i) − 𝝁_2‖²
LDA Criteria
𝐽(𝒘) = (𝜇′_1 − 𝜇′_2)² / (𝑠′²_1 + 𝑠′²_2)
(𝜇′_1 − 𝜇′_2)² = (𝒘^T 𝝁_1 − 𝒘^T 𝝁_2)² = 𝒘^T (𝝁_1 − 𝝁_2)(𝝁_1 − 𝝁_2)^T 𝒘
𝑠′²_1 = Σ_{𝒙^(i)∈𝒞_1} (𝒘^T 𝒙^(i) − 𝒘^T 𝝁_1)² = 𝒘^T [Σ_{𝒙^(i)∈𝒞_1} (𝒙^(i) − 𝝁_1)(𝒙^(i) − 𝝁_1)^T] 𝒘
LDA Criteria
𝐽(𝒘) = (𝒘^T 𝑺_B 𝒘) / (𝒘^T 𝑺_W 𝒘)
Between-class scatter matrix: 𝑺_B = (𝝁_1 − 𝝁_2)(𝝁_1 − 𝝁_2)^T
Within-class scatter matrix: 𝑺_W = 𝑺_1 + 𝑺_2
𝑺_1 = Σ_{𝒙^(i)∈𝒞_1} (𝒙^(i) − 𝝁_1)(𝒙^(i) − 𝝁_1)^T
𝑺_2 = Σ_{𝒙^(i)∈𝒞_2} (𝒙^(i) − 𝝁_2)(𝒙^(i) − 𝝁_2)^T
Setting the derivative of 𝐽(𝒘) to zero:
∂𝐽(𝒘)/∂𝒘 = [2𝑺_B𝒘 (𝒘^T 𝑺_W 𝒘) − 2𝑺_W𝒘 (𝒘^T 𝑺_B 𝒘)] / (𝒘^T 𝑺_W 𝒘)² = 0
⇒ 𝑺_B𝒘 = 𝜆𝑺_W𝒘, with 𝜆 = 𝐽(𝒘) = (𝒘^T 𝑺_B 𝒘)/(𝒘^T 𝑺_W 𝒘)
LDA Derivation
If 𝑺_W is full-rank:
𝑺_B𝒘 = 𝜆𝑺_W𝒘 ⇒ 𝑺_W^(−1) 𝑺_B𝒘 = 𝜆𝒘
Since 𝑺_B𝒘 = (𝝁_1 − 𝝁_2)(𝝁_1 − 𝝁_2)^T 𝒘 is always in the direction of (𝝁_1 − 𝝁_2):
𝒘 ∝ 𝑺_W^(−1) (𝝁_1 − 𝝁_2)
LDA Algorithm
Find 𝝁_1 and 𝝁_2 as the means of classes 1 and 2, respectively
Find 𝑺_1 and 𝑺_2 as the scatter matrices of classes 1 and 2, respectively
𝑺_W = 𝑺_1 + 𝑺_2
𝑺_B = (𝝁_1 − 𝝁_2)(𝝁_1 − 𝝁_2)^T
Feature extraction:
𝒘 = 𝑺_W^(−1) (𝝁_1 − 𝝁_2) is the eigenvector corresponding to the largest eigenvalue of 𝑺_W^(−1) 𝑺_B
Classification:
𝒘 = 𝑺_W^(−1) (𝝁_1 − 𝝁_2); using a threshold on 𝒘^T 𝒙, we can classify 𝒙
[Figure: samples of the two classes with means 𝝁_1 and 𝝁_2, projected onto 𝒘.]
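A sketch of the two-class LDA steps above (the midpoint threshold is an assumption; the slides only call for some threshold on 𝒘^T 𝒙):

import numpy as np

def fisher_lda(X1, X2):
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class 2
    Sw = S1 + S2                          # within-class scatter (assumed full-rank)
    w = np.linalg.solve(Sw, mu1 - mu2)    # w proportional to S_W^{-1} (mu1 - mu2)
    threshold = 0.5 * (w @ mu1 + w @ mu2) # assumed: midpoint of projected means
    return w, threshold

# classify x as class 1 if w @ x >= threshold, else class 2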
Multi-class classification
Solutions to multi-category problems:
Extend the learning algorithm to support multi-class directly:
A function 𝑓_𝑖(𝒙) for each class 𝑖 is found; 𝒙 is assigned to class 𝒞_𝑖 if 𝑓_𝑖(𝒙) > 𝑓_𝑗(𝒙) ∀𝑗 ≠ 𝑖, i.e.
𝑦 = argmax_{𝑖=1,…,𝑐} 𝑓_𝑖(𝒙)
Converting a multi-class problem to a set of two-class problems
"one versus rest" or "one against all"
For each class 𝐶_𝑖, a linear discriminant function that separates samples of 𝐶_𝑖 from all the other samples is found.
This assumes the data are totally linearly separable (each class can be separated from the rest by a linear boundary); a sketch follows.
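A sketch of one-vs-rest on top of the batch_perceptron sketch from earlier (labels assumed to be 0, …, n_classes − 1):

import numpy as np

def one_vs_rest(X, y, n_classes):
    # one binary problem per class: label +1 for class i, -1 for the rest
    return np.stack([batch_perceptron(X, np.where(y == i, 1.0, -1.0))
                     for i in range(n_classes)])

def predict(W, x):
    return int(np.argmax(W @ x))   # class with the largest score w_i^T x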
Multi-class classification
One-vs-all (one-vs-rest)
[Figure: three binary problems in the (𝑥_1, 𝑥_2) plane, each separating one of Class 1, Class 2, Class 3 from the other two.]
Multi-class classification
One-vs-one
[Figure: three pairwise binary problems in the (𝑥_1, 𝑥_2) plane, one for each pair of Class 1, Class 2, Class 3.]
Multi-class classification: ambiguity
Converting the multi-class problem to a set of two-class
problems can lead to regions in which the classification is
undefined
Multi-class classification: linear machine
A linear machine uses 𝑐 linear discriminants and assigns 𝒙 to the class with the largest 𝒘_𝑖^T 𝒙 (next slide), which removes the ambiguous regions.
Perceptron: multi-class
𝑦 = argmax_{𝑖=1,…,𝑐} 𝒘_𝑖^T 𝒙
𝐽_P(𝑾) = −Σ_{𝑖∈ℳ} (𝒘_{𝑦^(i)} − 𝒘_{𝑦̂^(i)})^T 𝒙^(i)
ℳ: subset of training data that are misclassified, ℳ = {𝑖 | 𝑦̂^(i) ≠ 𝑦^(i)}
Initialize 𝑾 = [𝒘_1, … , 𝒘_𝑐], 𝑘 ← 0
repeat
  𝑘 ← (𝑘 + 1) mod 𝑁
  if 𝒙^(k) is misclassified then
    𝒘_{𝑦̂^(k)} ← 𝒘_{𝑦̂^(k)} − 𝒙^(k)
    𝒘_{𝑦^(k)} ← 𝒘_{𝑦^(k)} + 𝒙^(k)
until all patterns are properly classified
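A sketch of the multi-class perceptron loop (epoch-based rather than the mod-𝑁 counter above, for brevity):

import numpy as np

def multiclass_perceptron(X, y, n_classes, max_epochs=100):
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(max_epochs):
        mistakes = 0
        for x_k, y_k in zip(X, y):
            y_hat = int(np.argmax(W @ x_k))
            if y_hat != y_k:
                W[y_hat] -= x_k     # penalize the wrongly predicted class
                W[y_k] += x_k       # reinforce the true class
                mistakes += 1
        if mistakes == 0:
            break                   # all patterns properly classified
    return W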
Resources
C. Bishop, “Pattern Recognition and Machine Learning”,
Chapter 4.1.