
Linear classifiers

CE-717: Machine Learning


Sharif University of Technology

M. Soleymani
Fall 2016
Topics
• Discriminant functions
• Linear classifiers
• Perceptron
  (SVM will be covered in the later lectures)
• Fisher
• Multi-class classification

2
Classification problem
• Given: Training set
  • labeled set of 𝑁 input-output pairs 𝐷 = {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^{𝑁}
  • 𝑦 ∈ {1, … , 𝐾}

• Goal: Given an input 𝒙, assign it to one of 𝐾 classes

• Examples:
  • Spam filter
  • Handwritten digit recognition
  • …

3
Discriminant functions
• A discriminant function can directly assign each vector 𝒙 to a specific class 𝑘

• A popular way of representing a classifier

• Many classification methods are based on discriminant functions

• Assumption: the classes are taken to be disjoint

• The input space is thereby divided into decision regions
  • The boundaries between regions are called decision boundaries or decision surfaces.

4
Discriminant Functions
• Discriminant functions: a discriminant function 𝑓𝑖(𝒙) for each class 𝒞𝑖 (𝑖 = 1, … , 𝐾):
  • 𝒙 is assigned to class 𝒞𝑖 if:
      𝑓𝑖(𝒙) > 𝑓𝑗(𝒙)  ∀𝑗 ≠ 𝑖

• Thus, we can easily divide the feature space into 𝐾 decision regions:
      ∀𝒙, 𝑓𝑖(𝒙) > 𝑓𝑗(𝒙) ∀𝑗 ≠ 𝑖  ⇒  𝒙 ∈ ℛ𝑖
  ℛ𝑖 : region of the 𝑖-th class

• Decision surfaces (or boundaries) can also be found using discriminant functions
  • Boundary between ℛ𝑖 and ℛ𝑗, separating samples of these two categories:
      𝑓𝑖(𝒙) = 𝑓𝑗(𝒙)
5
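As a minimal illustration of the rule above, a small Python sketch; the three discriminant functions below are made up purely for the example:

```python
import numpy as np

# Classify a point by the largest discriminant value f_i(x).
def classify(x, discriminants):
    """Assign x to the class whose discriminant function gives the largest value."""
    scores = [f(x) for f in discriminants]
    return int(np.argmax(scores))   # index i of the winning class C_i

# Three made-up discriminant functions, purely for illustration
f = [lambda x: x[0] + x[1],
     lambda x: 2.0 - x[0],
     lambda x: 0.5 * x[1] - 1.0]
print(classify(np.array([1.0, 2.0]), f))   # prints 0: the first discriminant wins here
```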
Discriminant Functions: Two-Category

• For a two-category problem, we need only find a single function 𝑓 ∶ ℝ^𝑑 → ℝ
  • 𝑓1(𝒙) = 𝑓(𝒙)
  • 𝑓2(𝒙) = −𝑓(𝒙)
  • Decision surface: 𝑓(𝒙) = 0

• First, we explain the two-category classification problem and then discuss multi-category problems.
  • Binary classification: a target variable 𝑦 ∈ {0, 1} or 𝑦 ∈ {−1, 1}

6
Linear classifiers
• Decision boundaries are linear in 𝒙, or linear in some given set of functions of 𝒙
• Linearly separable data: data points that can be exactly classified by a linear decision surface.

• Why linear classifiers?
  • Even when they are not optimal, we can benefit from their simplicity
  • They are relatively easy to compute
  • In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.

7
Two Category
• 𝑓(𝒙; 𝒘) = 𝒘^𝑇𝒙 + 𝑤0 = 𝑤0 + 𝑤1𝑥1 + ⋯ + 𝑤𝑑𝑥𝑑
  • 𝒙 = [𝑥1 𝑥2 … 𝑥𝑑]
  • 𝒘 = [𝑤1 𝑤2 … 𝑤𝑑]
  • 𝑤0 : bias

• if 𝒘^𝑇𝒙 + 𝑤0 ≥ 0 then 𝒞1, else 𝒞2

• Decision surface (boundary): 𝒘^𝑇𝒙 + 𝑤0 = 0
  • 𝒘 is orthogonal to every vector lying within the decision surface

8
Example

• Decision boundary: 3 − 𝑥1 − (3/4)𝑥2 = 0
• if 𝒘^𝑇𝒙 + 𝑤0 ≥ 0 then 𝒞1, else 𝒞2

[Figure: the line 3 − 𝑥1 − (3/4)𝑥2 = 0 plotted in the (𝑥1, 𝑥2) plane]
9
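A small Python sketch of the two-category rule, using the boundary from this example (3 − 𝑥1 − (3/4)𝑥2 = 0, i.e. 𝒘 = [−1, −0.75] and 𝑤0 = 3); the test points are made up:

```python
import numpy as np

def predict(x, w, w0):
    """Return C1 (+1) if w^T x + w0 >= 0, else C2 (-1)."""
    return 1 if w @ x + w0 >= 0 else -1

w, w0 = np.array([-1.0, -0.75]), 3.0          # boundary: 3 - x1 - (3/4) x2 = 0
print(predict(np.array([1.0, 1.0]), w, w0))   # +1: w^T x + w0 = 1.25 >= 0
print(predict(np.array([4.0, 3.0]), w, w0))   # -1: w^T x + w0 = -3.25 < 0
```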
Linear classifier: Two Category
• The decision boundary is a (𝑑 − 1)-dimensional hyperplane 𝐻 in the 𝑑-dimensional feature space
  • The orientation of 𝐻 is determined by the normal vector [𝑤1, … , 𝑤𝑑]
  • 𝑤0 determines the location of the surface.
  • The distance from the origin to the decision surface is |𝑤0| / ‖𝒘‖

• Writing 𝒙 = 𝒙⊥ + 𝑟 (𝒘 / ‖𝒘‖), where 𝒙⊥ is the projection of 𝒙 onto the surface (𝑓(𝒙⊥) = 0):
      𝒘^𝑇𝒙 + 𝑤0 = 𝑟‖𝒘‖  ⇒  𝑟 = (𝒘^𝑇𝒙 + 𝑤0) / ‖𝒘‖

• Thus 𝑓(𝒙) = 𝒘^𝑇𝒙 + 𝑤0 gives a signed measure of the perpendicular distance 𝑟 of the point 𝒙 from the decision surface
10
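The signed-distance formula can be checked with a short sketch (same illustrative 𝒘, 𝑤0 as in the earlier example):

```python
import numpy as np

def signed_distance(x, w, w0):
    """r = (w^T x + w0) / ||w||; the sign says on which side of the surface x lies."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([-1.0, -0.75]), 3.0
print(signed_distance(np.zeros(2), w, w0))            # 2.4 = |w0| / ||w||, distance of the origin
print(signed_distance(np.array([4.0, 3.0]), w, w0))   # -2.6: the point lies on the C2 side
```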
Linear boundary: geometry
[Figure: a hyperplane 𝒘^𝑇𝒙 + 𝑤0 = 0 dividing the space into the regions 𝒘^𝑇𝒙 + 𝑤0 > 0 and 𝒘^𝑇𝒙 + 𝑤0 < 0; the signed distance of a point from the plane is (𝒘^𝑇𝒙 + 𝑤0) / ‖𝒘‖]
11
Non-linear decision boundary
• Choose non-linear features
  • The classifier is still linear in the parameters 𝒘

• Example: 𝒙 = [𝑥1, 𝑥2], decision boundary −1 + 𝑥1² + 𝑥2² = 0 (a circle of radius 1)
    𝝓(𝒙) = [1, 𝑥1, 𝑥2, 𝑥1², 𝑥2², 𝑥1𝑥2]
    𝒘 = [𝑤0, 𝑤1, … , 𝑤𝑚] = [−1, 0, 0, 1, 1, 0]

• if 𝒘^𝑇𝝓(𝒙) ≥ 0 then 𝑦 = 1, else 𝑦 = −1
12
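A sketch of the same idea in code, using the feature map and weights from this slide (𝝓(𝒙) = [1, 𝑥1, 𝑥2, 𝑥1², 𝑥2², 𝑥1𝑥2], 𝒘 = [−1, 0, 0, 1, 1, 0]), which reproduces the circular boundary; the test points are made up:

```python
import numpy as np

def phi(x):
    """Non-linear feature map from the slide."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

def predict(x, w):
    """Linear classifier in feature space: sign of w^T phi(x)."""
    return 1 if w @ phi(x) >= 0 else -1

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # boundary: -1 + x1^2 + x2^2 = 0
print(predict((0.5, 0.5), w))   # -1: inside the unit circle
print(predict((1.5, 0.0), w))   # +1: outside the unit circle
```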
Cost Function for linear classification
• Finding a linear classifier can be formulated as an optimization problem:
  • Select how to measure the prediction loss
    • Based on the training set 𝐷 = {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^{𝑛}, a cost function 𝐽(𝒘) is defined
  • Solve the resulting optimization problem to find the parameters:
    • Find the optimal 𝑓(𝒙) = 𝑓(𝒙; 𝒘̂) where 𝒘̂ = argmin_𝒘 𝐽(𝒘)

• Criterion or cost functions for classification:
  • We will investigate several cost functions for the classification problem

13
SSE cost function for classification (𝐾 = 2)

• The SSE cost function is not suitable for classification:
  • Least squares loss penalizes ‘too correct’ predictions (those that lie a long way on the correct side of the decision boundary)
  • Least squares loss also lacks robustness to noise

    𝐽(𝒘) = Σ_{𝑖=1}^{𝑁} (𝒘^𝑇𝒙^(𝑖) − 𝑦^(𝑖))²

14
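To make the criticism concrete, a tiny sketch (made-up data; the leading 1 in each row is the bias feature folded into 𝒘) that evaluates the SSE cost on ±1 labels:

```python
import numpy as np

def sse_cost(w, X, y):
    """J(w) = sum_i (w^T x^(i) - y^(i))^2."""
    residuals = X @ w - y
    return float(residuals @ residuals)

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 3.0]])   # first column: bias feature
y = np.array([1.0, -1.0, 1.0])
# The third sample is 'too correct' (w^T x = 6, far beyond its target +1),
# yet it dominates the cost: total SSE is 26, of which 25 comes from that sample.
print(sse_cost(np.array([0.0, 2.0]), X, y))
```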
SSE cost function for classification (𝐾 = 2)

[Figure: the squared error (𝒘^𝑇𝒙 − 𝑦)² plotted against 𝒘^𝑇𝒙 for 𝑦 = 1 and for 𝑦 = −1; predictions that are correct but lie far beyond the target are still penalized by SSE]  [Bishop]

15
SSE cost function for classification (𝐾 = 2)

• Is it more suitable if we set 𝑓(𝒙; 𝒘) = 𝑔(𝒘^𝑇𝒙)?

    𝐽(𝒘) = Σ_{𝑖=1}^{𝑁} (sign(𝒘^𝑇𝒙^(𝑖)) − 𝑦^(𝑖))²

    sign(𝑧) = −1 if 𝑧 < 0,  and 1 if 𝑧 ≥ 0

• 𝐽(𝒘) is a piecewise constant function that shows the number of misclassifications
  • i.e., the training error incurred in classifying the training samples

[Figure: (sign(𝒘^𝑇𝒙) − 𝑦)² as a function of 𝒘^𝑇𝒙 for 𝑦 = 1, and the resulting piecewise constant cost 𝐽(𝒘)]
16
Perceptron algorithm
• Linear classifier
• Two-class: 𝑦 ∈ {−1, 1}
  • 𝑦 = −1 for 𝒞2 , 𝑦 = 1 for 𝒞1

• Goal: ∀𝑖, 𝒙^(𝑖) ∈ 𝒞1 ⇒ 𝒘^𝑇𝒙^(𝑖) > 0
        ∀𝑖, 𝒙^(𝑖) ∈ 𝒞2 ⇒ 𝒘^𝑇𝒙^(𝑖) < 0

• 𝑓(𝒙; 𝒘) = sign(𝒘^𝑇𝒙)

18
Perceptron criterion

𝐽_𝑃(𝒘) = − Σ_{𝑖∈ℳ} 𝒘^𝑇𝒙^(𝑖) 𝑦^(𝑖)

ℳ: subset of training data that are misclassified

If there are many solutions, which one should we pick?

19
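A minimal sketch of computing the perceptron criterion; here a sample counts as misclassified when 𝑦^(𝑖)𝒘^𝑇𝒙^(𝑖) ≤ 0 (points exactly on the boundary included), and the data are made up:

```python
import numpy as np

def perceptron_cost(w, X, y):
    """J_P(w) = -sum over misclassified i of w^T x^(i) * y^(i).
    X: (N, d) inputs, y: (N,) labels in {-1, +1}, w: (d,) weights."""
    scores = X @ w
    misclassified = y * scores <= 0                      # the set M
    return -np.sum(scores[misclassified] * y[misclassified])

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, -1, -1])
print(perceptron_cost(np.array([0.5, 0.5]), X, y))       # 0.5: one misclassified sample
```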
Cost function
[Figure: the number-of-misclassifications cost 𝐽(𝒘) and the perceptron cost 𝐽_𝑃(𝒘) plotted as surfaces over the (𝑤0, 𝑤1) plane]

There may be many solutions that minimize these cost functions

20 [Duda, Hart, and Stork, 2002]


Batch Perceptron
“Gradient descent” to solve the optimization problem:

    𝒘^(𝑡+1) = 𝒘^(𝑡) − 𝜂 𝛻_𝒘 𝐽_𝑃(𝒘^(𝑡))

    𝛻_𝒘 𝐽_𝑃(𝒘) = − Σ_{𝑖∈ℳ} 𝒙^(𝑖) 𝑦^(𝑖)

Batch Perceptron converges in a finite number of steps for linearly separable data:

    Initialize 𝒘
    Repeat
      𝒘 = 𝒘 + 𝜂 Σ_{𝑖∈ℳ} 𝒙^(𝑖) 𝑦^(𝑖)
    Until ‖𝜂 Σ_{𝑖∈ℳ} 𝒙^(𝑖) 𝑦^(𝑖)‖ < 𝜃

21
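A sketch of the batch perceptron above; the learning rate 𝜂, stopping threshold 𝜃, and the max_iters safeguard are illustrative choices, and the bias is assumed to be folded into 𝒘 via a constant feature of 1:

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iters=1000):
    """X: (N, d) inputs (include a column of 1s for the bias), y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        miscls = y * (X @ w) <= 0                 # the set M of misclassified samples
        update = eta * (X[miscls].T @ y[miscls])  # eta * sum_{i in M} x^(i) y^(i)
        if np.linalg.norm(update) < theta:        # until the update norm falls below theta
            break
        w = w + update
    return w
```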
Stochastic gradient descent for Perceptron
• Single-sample perceptron:
  • If 𝒙^(𝑖) is misclassified:
      𝒘^(𝑡+1) = 𝒘^(𝑡) + 𝜂 𝒙^(𝑖) 𝑦^(𝑖)

• Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps

Fixed-increment single-sample Perceptron (𝜂 can be set to 1 and the proof still works):
    Initialize 𝒘, 𝑡 ← 0
    repeat
      𝑡 ← 𝑡 + 1
      𝑖 ← 𝑡 mod 𝑁
      if 𝒙^(𝑖) is misclassified then
        𝒘 = 𝒘 + 𝒙^(𝑖) 𝑦^(𝑖)
    until all patterns are properly classified

22
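A sketch of the fixed-increment single-sample perceptron (𝜂 = 1); the max_epochs cap is an added safeguard, since the loop only terminates on its own for linearly separable data:

```python
import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """X: (N, d) inputs (with a bias column of 1s), y: labels in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            if y[i] * (X[i] @ w) <= 0:      # x^(i) is misclassified
                w = w + X[i] * y[i]         # fixed-increment update
                errors += 1
        if errors == 0:                     # all patterns properly classified
            return w
    return w
```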
Example

23
Perceptron: Example

Change 𝒘 in a direction that corrects the error

24
[Bishop]
Convergence of Perceptron

[Duda, Hart & Stork, 2002]

• For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge

25
Pocket algorithm
• For data that are not linearly separable (e.g., due to noise):
  • The algorithm keeps in its pocket the best 𝒘 encountered so far.

    Initialize 𝒘
    for 𝑡 = 1, … , 𝑇
      𝑖 ← 𝑡 mod 𝑁
      if 𝒙^(𝑖) is misclassified then
        𝒘_new = 𝒘 + 𝒙^(𝑖) 𝑦^(𝑖)
        if 𝐸_train(𝒘_new) < 𝐸_train(𝒘) then
          𝒘 = 𝒘_new
    end

    𝐸_train(𝒘) = (1/𝑁) Σ_{𝑛=1}^{𝑁} [sign(𝒘^𝑇𝒙^(𝑛)) ≠ 𝑦^(𝑛)]

26
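A sketch of the pocket algorithm exactly as written above: propose a perceptron update and keep it only if it lowers 𝐸_train; the number of steps 𝑇 is an illustrative parameter:

```python
import numpy as np

def error_rate(w, X, y):
    """E_train(w): fraction of samples with sign(w^T x) != y (sign(0) taken as +1)."""
    pred = np.where(X @ w >= 0, 1, -1)
    return np.mean(pred != y)

def pocket(X, y, T=1000):
    """X: (N, d) inputs (with a bias column), y: (N,) labels in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = t % N
        if y[i] * (X[i] @ w) <= 0:                  # x^(i) is misclassified
            w_new = w + X[i] * y[i]                 # proposed perceptron update
            if error_rate(w_new, X, y) < error_rate(w, X, y):
                w = w_new                           # keep the better weights in the pocket
    return w
```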
Linear Discriminant Analysis (LDA)
• Fisher’s Linear Discriminant Analysis:
  • Dimensionality reduction
    • Finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as new discriminant variables)
  • Classification
    • Predicts the class of an observation 𝒙 by first projecting it onto the space of discriminant variables and then classifying it in this space

27
Good Projection for Classification
• What is a good criterion?
  • Separating different classes in the projected space

[Figure: two-class data projected onto different candidate directions]
28
LDA Problem
• Problem definition:
  • 𝐶 = 2 classes
  • {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^{𝑁} training samples, with 𝑁1 samples from the first class (𝒞1) and 𝑁2 samples from the second class (𝒞2)
  • Goal: find the best direction 𝒘 that we hope will enable accurate classification

• The projection of a sample 𝒙 onto a line in direction 𝒘 is 𝒘^𝑇𝒙

• What is a good measure of the separation between the projected points of different classes?

31
Measure of Separation in the Projected Direction
• Is the direction of the line joining the class means a good candidate for 𝒘?

[Bishop]

32
Measure of Separation in the Projected Direction
• The direction of the line joining the class means is the solution of the following problem:
  • Maximize the separation of the projected class means:

      max_𝒘 𝐽(𝒘) = (𝜇1′ − 𝜇2′)²
      s.t. ‖𝒘‖ = 1

      𝜇1′ = 𝒘^𝑇𝝁1,  𝝁1 = (1/𝑁1) Σ_{𝒙^(𝑖)∈𝒞1} 𝒙^(𝑖)
      𝜇2′ = 𝒘^𝑇𝝁2,  𝝁2 = (1/𝑁2) Σ_{𝒙^(𝑖)∈𝒞2} 𝒙^(𝑖)

• What is the problem with a criterion that considers only (𝜇1′ − 𝜇2′)?
  • It does not consider the variances of the classes in the projected direction
33
LDA Criteria
• Fisher's idea: maximize a function that will give
  • large separation between the projected class means
  • while also achieving a small variance within each class, thereby minimizing the class overlap.

    𝐽(𝒘) = (𝜇1′ − 𝜇2′)² / (𝑠1′² + 𝑠2′²)

34
LDA Criteria
• The scatters of the original data are:
      𝑠1² = Σ_{𝒙^(𝑖)∈𝒞1} ‖𝒙^(𝑖) − 𝝁1‖²
      𝑠2² = Σ_{𝒙^(𝑖)∈𝒞2} ‖𝒙^(𝑖) − 𝝁2‖²

• The scatters of the projected data are:
      𝑠1′² = Σ_{𝒙^(𝑖)∈𝒞1} (𝒘^𝑇𝒙^(𝑖) − 𝒘^𝑇𝝁1)²
      𝑠2′² = Σ_{𝒙^(𝑖)∈𝒞2} (𝒘^𝑇𝒙^(𝑖) − 𝒘^𝑇𝝁2)²

35
LDA Criteria
𝐽(𝒘) = (𝜇1′ − 𝜇2′)² / (𝑠1′² + 𝑠2′²)

(𝜇1′ − 𝜇2′)² = (𝒘^𝑇𝝁1 − 𝒘^𝑇𝝁2)²
             = 𝒘^𝑇(𝝁1 − 𝝁2)(𝝁1 − 𝝁2)^𝑇𝒘

𝑠1′² = Σ_{𝒙^(𝑖)∈𝒞1} (𝒘^𝑇𝒙^(𝑖) − 𝒘^𝑇𝝁1)²
     = 𝒘^𝑇 [ Σ_{𝒙^(𝑖)∈𝒞1} (𝒙^(𝑖) − 𝝁1)(𝒙^(𝑖) − 𝝁1)^𝑇 ] 𝒘

36
LDA Criteria
𝐽(𝒘) = (𝒘^𝑇𝑺_𝐵𝒘) / (𝒘^𝑇𝑺_𝑊𝒘)

Between-class scatter matrix:  𝑺_𝐵 = (𝝁1 − 𝝁2)(𝝁1 − 𝝁2)^𝑇

Within-class scatter matrix:   𝑺_𝑊 = 𝑺1 + 𝑺2
      𝑺1 = Σ_{𝒙^(𝑖)∈𝒞1} (𝒙^(𝑖) − 𝝁1)(𝒙^(𝑖) − 𝝁1)^𝑇
      𝑺2 = Σ_{𝒙^(𝑖)∈𝒞2} (𝒙^(𝑖) − 𝝁2)(𝒙^(𝑖) − 𝝁2)^𝑇

(scatter matrix = 𝑁 × covariance matrix)

37


LDA Derivation

𝐽(𝒘) = (𝒘^𝑇𝑺_𝐵𝒘) / (𝒘^𝑇𝑺_𝑊𝒘)

∂𝐽(𝒘)/∂𝒘 = [ 2𝑺_𝐵𝒘 (𝒘^𝑇𝑺_𝑊𝒘) − 2𝑺_𝑊𝒘 (𝒘^𝑇𝑺_𝐵𝒘) ] / (𝒘^𝑇𝑺_𝑊𝒘)²

∂𝐽(𝒘)/∂𝒘 = 0  ⇒  𝑺_𝐵𝒘 = 𝜆𝑺_𝑊𝒘   (with 𝜆 = 𝐽(𝒘))

38
LDA Derivation

If 𝑺_𝑊 is full-rank:
      𝑺_𝐵𝒘 = 𝜆𝑺_𝑊𝒘  ⇒  𝑺_𝑊⁻¹𝑺_𝐵𝒘 = 𝜆𝒘

• 𝑺_𝐵𝒘 (for any vector 𝒘) points in the same direction as 𝝁1 − 𝝁2:
      𝑺_𝐵𝒘 = (𝝁1 − 𝝁2)(𝝁1 − 𝝁2)^𝑇𝒘 ∝ 𝝁1 − 𝝁2

      ⇒  𝒘 ∝ 𝑺_𝑊⁻¹(𝝁1 − 𝝁2)

• Thus, we can solve this eigenvalue problem immediately

39
LDA Algorithm
• Find 𝝁1 and 𝝁2 as the means of class 1 and class 2, respectively
• Find 𝑺1 and 𝑺2 as the scatter matrices of class 1 and class 2, respectively
  • 𝑺_𝑊 = 𝑺1 + 𝑺2
  • 𝑺_𝐵 = (𝝁1 − 𝝁2)(𝝁1 − 𝝁2)^𝑇

• Feature extraction
  • 𝒘 = 𝑺_𝑊⁻¹(𝝁1 − 𝝁2) is the eigenvector corresponding to the largest eigenvalue of 𝑺_𝑊⁻¹𝑺_𝐵
• Classification
  • 𝒘 = 𝑺_𝑊⁻¹(𝝁1 − 𝝁2)
  • Using a threshold on 𝒘^𝑇𝒙, we can classify 𝒙
40
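A sketch of the two-class Fisher LDA procedure above; the slides only say "a threshold on 𝒘^𝑇𝒙", so the midpoint of the projected class means used below is an assumption for illustration:

```python
import numpy as np

def fisher_lda(X1, X2):
    """X1: (N1, d) samples of class 1, X2: (N2, d) samples of class 2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)          # scatter matrix of class 2
    Sw = S1 + S2                            # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu2)      # w proportional to Sw^{-1} (mu1 - mu2)
    threshold = w @ (mu1 + mu2) / 2.0       # assumed: midpoint of the projected means
    return w, threshold

def lda_predict(x, w, threshold):
    """Return class 1 if the projection w^T x exceeds the threshold, else class 2."""
    return 1 if w @ x > threshold else 2
```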
Multi-class classification
• Solutions to multi-category problems:
  • Extend the learning algorithm to support multi-class:
    • A function 𝑓𝑖(𝒙) for each class 𝑖 is found
    • 𝑦 = argmax_{𝑖=1,…,𝑐} 𝑓𝑖(𝒙), i.e., 𝒙 is assigned to class 𝐶𝑖 if 𝑓𝑖(𝒙) > 𝑓𝑗(𝒙) ∀𝑗 ≠ 𝑖
  • Convert the problem to a set of two-class problems:
41
Converting a multi-class problem to a set of two-class problems
• “one versus rest” or “one against all”
  • For each class 𝐶𝑖, a linear discriminant function that separates samples of 𝐶𝑖 from all the other samples is found.
  • Requires the classes to be totally linearly separable

• “one versus one”
  • 𝑐(𝑐 − 1)/2 linear discriminant functions are used, one to separate the samples of each pair of classes.
  • Requires the classes to be pairwise linearly separable

42
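A sketch of the "one versus rest" construction, reusing the single_sample_perceptron sketch from the earlier slide as the binary learner (any two-class linear classifier could be plugged in instead); the label coding 0, …, 𝑐−1 is an assumption:

```python
import numpy as np

def one_vs_rest_train(X, y, num_classes, max_epochs=100):
    """X: (N, d) inputs (with a bias column), y: (N,) labels in {0, ..., c-1}."""
    W = []
    for i in range(num_classes):
        y_bin = np.where(y == i, 1, -1)                        # class i vs. the rest
        W.append(single_sample_perceptron(X, y_bin, max_epochs))  # defined earlier
    return np.stack(W)                                         # (c, d) weight matrix

def one_vs_rest_predict(x, W):
    """Assign x to the class with the largest score w_i^T x."""
    return int(np.argmax(W @ x))
```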
Multi-class classification
• One-vs-all (one-vs-rest)

[Figure: three classes in the (𝑥1, 𝑥2) plane with the three "class 𝑖 vs. rest" linear boundaries; legend: Class 1, Class 2, Class 3]

43

Multi-class classification
• One-vs-one

[Figure: the same three classes with the pairwise "class 𝑖 vs. class 𝑗" linear boundaries; legend: Class 1, Class 2, Class 3]

44
Multi-class classification: ambiguity
• Converting the multi-class problem to a set of two-class problems can lead to regions in which the classification is undefined

[Figure: ambiguous regions under "one versus rest" and "one versus one"]

[Duda, Hart & Stork, 2002]


45
Multi-class classification: linear machine
• A discriminant function 𝑓𝑖(𝒙) = 𝒘𝑖^𝑇𝒙 + 𝑤𝑖0 for each class 𝒞𝑖 (𝑖 = 1, … , 𝐾):
  • 𝒙 is assigned to class 𝒞𝑖 if:
      𝑓𝑖(𝒙) > 𝑓𝑗(𝒙)  ∀𝑗 ≠ 𝑖

• Decision surfaces (boundaries) can also be found using the discriminant functions
  • Boundary between the contiguous regions ℛ𝑖 and ℛ𝑗: 𝑓𝑖(𝒙) = 𝑓𝑗(𝒙), i.e.
      (𝒘𝑖 − 𝒘𝑗)^𝑇𝒙 + (𝑤𝑖0 − 𝑤𝑗0) = 0

46
Multi-class classification: linear machine

[Duda, Hart & Stork, 2002]

47
Perceptron: multi-class
𝑦̂ = argmax_{𝑖=1,…,𝑐} 𝒘𝑖^𝑇𝒙

𝐽_𝑃(𝑾) = − Σ_{𝑖∈ℳ} (𝒘_{𝑦^(𝑖)} − 𝒘_{𝑦̂^(𝑖)})^𝑇 𝒙^(𝑖)

ℳ: subset of training data that are misclassified
ℳ = { 𝑖 | 𝑦̂^(𝑖) ≠ 𝑦^(𝑖) }

Initialize 𝑾 = [𝒘1, … , 𝒘𝑐], 𝑘 ← 0
repeat
  𝑘 ← (𝑘 + 1) mod 𝑁
  if 𝒙^(𝑘) is misclassified then
    𝒘_{𝑦̂^(𝑘)} = 𝒘_{𝑦̂^(𝑘)} − 𝒙^(𝑘)
    𝒘_{𝑦^(𝑘)} = 𝒘_{𝑦^(𝑘)} + 𝒙^(𝑘)
until all patterns are properly classified
48
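A sketch of the multi-class perceptron above: when sample 𝑘 is misclassified, subtract 𝒙^(𝑘) from the predicted class's weight vector and add it to the true class's. The max_epochs cap is an added safeguard, and labels are assumed to be coded as 0, …, 𝑐−1:

```python
import numpy as np

def multiclass_perceptron(X, y, num_classes, max_epochs=100):
    """X: (N, d) inputs (with a bias column), y: (N,) labels in {0, ..., c-1}."""
    N, d = X.shape
    W = np.zeros((num_classes, d))                 # one weight vector per class
    for _ in range(max_epochs):
        errors = 0
        for k in range(N):
            y_hat = int(np.argmax(W @ X[k]))       # predicted class for x^(k)
            if y_hat != y[k]:                      # x^(k) is misclassified
                W[y_hat] -= X[k]                   # penalize the predicted class
                W[y[k]]  += X[k]                   # reinforce the true class
                errors += 1
        if errors == 0:                            # all patterns properly classified
            break
    return W
```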
Resources
• C. Bishop, “Pattern Recognition and Machine Learning”, Chapter 4.1.

49
