
Kernel Machine (Support Vector Machines)

Course 4232: Machine Learning

Dept. of Computer Science


Faculty of Science and Technology

Lecture No: 5   Week No: 4 (1 × 1.5 hrs)   Semester: Summer 21-22

Instructor: Dr. M M Manjurul Islam (manjurul@aiub.edu)


Support-vector machines (SVMs)*

❖ In machine learning, support-vector machines are supervised learning models
  with associated learning algorithms that analyze data for classification and
  regression.

❖ Developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues
  (Boser et al., 1992; Guyon et al., 1993; Vapnik et al., 1997).

❖ SVMs are among the most robust prediction methods, being grounded in the
  statistical learning framework (VC theory) proposed by Vapnik (1982, 1995)
  and Chervonenkis (1974).

*Wikipedia
Main Ideas

• Max-Margin Classifier
• Formalize notion of the best linear separator
• Kernels
• Projecting data into a higher-dimensional space can make it linearly
separable
• Complexity
• Depends only on the number of training examples, not on
dimensionality of the kernel space!
Strength of SVMs

❖ Good generalization
  o in theory
  o in practice
❖ Works well with few training instances
❖ Finds the globally best model
❖ Efficient algorithms
❖ Amenable to the kernel trick
Tennis example

[Figure: scatter plot of Humidity vs. Temperature; points are labeled "play tennis" or "do not play tennis".]
Linear Support Vector Machines

Data: ⟨xᵢ, yᵢ⟩, i = 1, …, l, with xᵢ ∈ ℝᵈ and yᵢ ∈ {−1, +1}

[Figure: points in the (x₁, x₂) plane, marked +1 or −1.]
Linear SVM

Data: ⟨xᵢ, yᵢ⟩, i = 1, …, l, with xᵢ ∈ ℝᵈ and yᵢ ∈ {−1, +1}

All hyperplanes in ℝᵈ are parameterized by a vector w and a constant b, and
can be expressed as w·x + b = 0 (recall the equation of a hyperplane from
algebra). Our aim is to find such a hyperplane, giving the classifier
f(x) = sign(w·x + b), that correctly classifies our data.
Definitions
Define the hyperplane H such that:
  xᵢ·w + b ≥ +1 when yᵢ = +1
  xᵢ·w + b ≤ −1 when yᵢ = −1

H1 and H2 are the planes:
  H1: xᵢ·w + b = +1
  H2: xᵢ·w + b = −1
The points lying on the planes H1 and H2 are the support vectors.

d+ = the shortest distance to the closest positive point
d− = the shortest distance to the closest negative point

The margin of a separating hyperplane is d+ + d−.
Maximizing the margin
We want a classifier with as large a margin as possible.

Recall the distance from a point (x₀, y₀) to the line Ax + By + c = 0:
  |A x₀ + B y₀ + c| / sqrt(A² + B²)

The distance between H and H1 is |w·x + b| / ‖w‖ = 1 / ‖w‖,
so the distance between H1 and H2 is 2 / ‖w‖.

In order to maximize the margin, we need to minimize ‖w‖, subject to the
condition that there are no data points between H1 and H2:
  xᵢ·w + b ≥ +1 when yᵢ = +1
  xᵢ·w + b ≤ −1 when yᵢ = −1
These can be combined into yᵢ(xᵢ·w + b) ≥ 1.
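The geometry above can be checked numerically. Below is a minimal Python sketch
(using NumPy and scikit-learn; the toy dataset and the large C used to approximate
the hard-margin case are illustrative assumptions, not part of the slides) that fits
a linear SVM and reports w, b, the margin 2/‖w‖, and the support vectors.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels y in {-1, +1} (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin SVM described above
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                    # weight vector w
b = clf.intercept_[0]               # bias b
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)

# Support vectors lie on H1 or H2, so y_i (w.x_i + b) is (approximately) 1
print(y[clf.support_] * (X[clf.support_] @ w + b))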
Notations

❖ To better match the notation used for SVMs and to make matrix formulas
  simpler:

Linear Separators

❖ Training instances
  x ∈ ℝᵈ⁺¹, x₀ = 1
  y ∈ {−1, 1}

❖ Model parameters
  θ ∈ ℝᵈ⁺¹

  Recall the inner (dot) product:
  ⟨u, v⟩ = u·v = uᵀv = Σᵢ uᵢvᵢ

❖ Hyperplane
  θᵀx = ⟨θ, x⟩ = 0

❖ Decision function
  h(x) = sign(θᵀx) = sign(⟨θ, x⟩)

  Example: x = [1, 2, 4.3, 5.5, …]
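As a quick illustration of this decision function, the following Python sketch
(the θ values and test points are made up for illustration) augments x with
x₀ = 1 and evaluates h(x) = sign(θᵀx).

import numpy as np

def h(theta, x):
    """Decision function h(x) = sign(theta^T x), with x_0 = 1 prepended."""
    x = np.concatenate(([1.0], x))
    return np.sign(theta @ x)

theta = np.array([-0.5, 1.0, 2.0])        # theta_0, theta_1, theta_2 (illustrative)
print(h(theta, np.array([2.0, 4.3])))     # +1: point lies on the positive side
print(h(theta, np.array([-1.0, -0.2])))   # -1: point lies on the negative side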
Intuitions

A "Good" Separator

Noise in the Observations

Ruling Out Some Separators

Lots of Noise

Only One Separator Remains

Maximizing the Margin

"Fat" Separators

[Figure-only slides: each title above corresponds to a figure illustrating candidate linear separators for the same data.]
Why Maximize the Margin?

❖ Increasing the margin reduces capacity
  o i.e., fewer possible models

❖ Lesson from learning theory:
  o If H is sufficiently constrained in size, and/or the size of the training
    data set n is large, then low training error is likely to be evidence of
    low generalization error.
Alternative View of Logistic Regression

  h_θ(x) = 1 / (1 + e^(−θᵀx))

❖ If y = 1, we want h_θ(x) ≈ 1, i.e. θᵀx ≫ 0
❖ If y = 0, we want h_θ(x) ≈ 0, i.e. θᵀx ≪ 0
Alternative View of Logistic Regression
Cost of one example:  −yᵢ log h_θ(xᵢ) − (1 − yᵢ) log(1 − h_θ(xᵢ))

  h_θ(x) = 1 / (1 + e^(−z)),  with z = θᵀx

[Figures: the per-example cost as a function of z, for y = 1 (want θᵀx ≫ 0) and for y = 0 (want θᵀx ≪ 0).]
Logistic Regression to SVM
❖ Logistic Regression:

  min_θ  Σᵢ [ yᵢ cost₁(θᵀxᵢ) + (1 − yᵢ) cost₀(θᵀxᵢ) ]  +  (λ/2) Σⱼ θⱼ²     (i = 1..n, j = 1..d)

❖ Support Vector Machine:

  min_θ  C Σᵢ [ yᵢ cost₁(θᵀxᵢ) + (1 − yᵢ) cost₀(θᵀxᵢ) ]  +  (1/2) Σⱼ θⱼ²    (i = 1..n, j = 1..d)

❖ C plays a role similar to 1/λ
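The cost₁ and cost₀ terms above are usually taken to be hinge-style surrogates for
the two logistic losses. The sketch below is a plain NumPy illustration under that
assumption (made-up data, θ₀ left unregularized by convention); it simply evaluates
the SVM objective C·Σ[…] + ½·Σθⱼ² for a given θ.

import numpy as np

def cost1(z):
    # Cost used when y = 1: zero once z = theta^T x >= 1 (hinge-style surrogate)
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Cost used when y = 0: zero once z = theta^T x <= -1 (hinge-style surrogate)
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * sum_i [ y_i cost1(theta^T x_i) + (1 - y_i) cost0(theta^T x_i) ]
    #   + (1/2) * sum_{j>=1} theta_j^2          (theta_0 unregularized)
    z = X @ theta
    data_term = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    return C * data_term + 0.5 * np.sum(theta[1:] ** 2)

# Tiny example: X already contains the x_0 = 1 column, y in {0, 1}
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1, 0, 1])
print(svm_objective(np.array([0.0, 1.0]), X, y, C=1.0))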
Support Vector Machine

  min_θ  C Σᵢ [ yᵢ cost₁(θᵀxᵢ) + (1 − yᵢ) cost₀(θᵀxᵢ) ]  +  (1/2) Σⱼ θⱼ²    (i = 1..n, j = 1..d)
Maximum Margin Hyperplane

Large Margin Classifier in the Presence of Outliers

Vector Inner Product

[Figure-only slides.]
Understanding the Hyperplane

Assume θ₀ = 0, so that the hyperplane is centered at the origin, and that d = 2.

Maximizing the Margin

Assume θ₀ = 0, so that the hyperplane is centered at the origin, and that d = 2.
Size of the Margin
❖ For the support vectors, we have p·‖θ‖₂ = 1
  o p is the length of the projection of the support vectors onto θ
What if the Surface is Non-Linear?

Structure of SVMs
Kernel Methods

When Linear Separators Fail

Mapping into a New Feature Space

❖ For example, with xᵢ ∈ ℝ²

❖ Rather than run the SVM on xᵢ, run it on φ(xᵢ)
  o Finds a non-linear separator in the input space

❖ What if φ(x) is really big?

❖ Use kernels to compute it implicitly!
Kernels

❖ Find a kernel K such that

  K(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩

❖ Computing K(xᵢ, xⱼ) should be efficient, much more so than computing
  φ(xᵢ) and φ(xⱼ)

❖ Use K(xᵢ, xⱼ) in the SVM algorithm rather than ⟨xᵢ, xⱼ⟩

❖ Remarkably, this is possible!
The Polynomial Kernel

❖ Given by K(xᵢ, xⱼ) = ⟨xᵢ, xⱼ⟩ᵈ
  o φ(x) contains all the monomials of degree d

❖ Useful in visual pattern recognition
  Example:
  • 16×16 pixel image
  • 10¹⁰ monomials of degree 5
  • Never explicitly compute φ(x)!

❖ Variation: K(xᵢ, xⱼ) = (⟨xᵢ, xⱼ⟩ + 1)ᵈ
  o Adds all lower-order monomials (degrees 1, …, d)!
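To see why φ(x) never has to be computed explicitly, the following Python sketch
(for d = 2 and x ∈ ℝ², with the monomials scaled so that the identity holds exactly;
the vectors are made up for illustration) checks that ⟨x, z⟩² equals ⟨φ(x), φ(z)⟩.

import numpy as np

def phi(x):
    # Explicit degree-2 monomial map for x in R^2 (sqrt(2) scaling on the cross term)
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def poly_kernel(x, z, d=2):
    # K(x, z) = <x, z>^d, computed directly in the input space
    return np.dot(x, z) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(poly_kernel(x, z))           # 16.0, computed in R^2
print(np.dot(phi(x), phi(z)))      # 16.0, via the explicit map in R^3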
The Kernel Trick

"Given an algorithm which is formulated in terms of a positive definite
kernel K₁, one can construct an alternative algorithm by replacing K₁
with another positive definite kernel K₂."

❖ SVMs can use the kernel trick

Incorporating Kernels into SVM
The Gaussian Kernel
❖ Also called the Radial Basis Function (RBF) kernel

  K(xᵢ, xⱼ) = exp( −‖xᵢ − xⱼ‖² / (2σ²) )

  o Has value 1 when xᵢ = xⱼ
  o Value falls off to 0 with increasing distance
  o Note: feature scaling is needed before using the Gaussian kernel
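A minimal Python sketch of the Gaussian kernel follows (σ and the points are
illustrative). Note that scikit-learn parameterizes its RBF kernel as
exp(−γ‖xᵢ − xⱼ‖²), so choosing γ = 1/(2σ²) gives the same function.

import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # exp(-||xi - xj||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

xi = np.array([1.0, 2.0])
print(gaussian_kernel(xi, xi))                      # 1.0 when xi == xj
print(gaussian_kernel(xi, np.array([4.0, 6.0])))    # falls toward 0 with distance

# Scale features first (e.g., sklearn.preprocessing.StandardScaler) so that
# no single feature dominates the distance ||xi - xj||.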
Gaussian Kernel Example

[Four figure-only example slides.]
Other Kernels
❖ Sigmoid Kernel

  K(xᵢ, xⱼ) = tanh(α xᵢᵀxⱼ + c)

  o Neural networks use the sigmoid as an activation function
  o An SVM with a sigmoid kernel is equivalent to a 2-layer perceptron

❖ Cosine Similarity Kernel

  K(xᵢ, xⱼ) = xᵢᵀxⱼ / (‖xᵢ‖ ‖xⱼ‖)

  o A popular choice for measuring the similarity of text documents
  o The L2 norm projects vectors onto the unit sphere; their dot product is
    then the cosine of the angle between the vectors
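scikit-learn's SVC accepts a callable kernel, so the cosine similarity kernel above
can be plugged in directly. A small sketch with toy term-count vectors (made up for
illustration):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity

def cosine_kernel(X, Y):
    # Gram matrix of (x_i . y_j) / (||x_i|| ||y_j||)
    return cosine_similarity(X, Y)

# Toy "document" vectors (e.g., term counts) for two topics
X = np.array([[3.0, 0.0, 1.0],
              [2.0, 0.0, 0.0],
              [0.0, 4.0, 1.0],
              [0.0, 3.0, 2.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel=cosine_kernel, C=1.0).fit(X, y)
print(clf.predict(np.array([[1.0, 0.0, 0.5], [0.0, 2.0, 1.0]])))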
Other Kernels
❖ Chi-squared Kernel

  K(xᵢ, xⱼ) = exp( −γ Σₖ (xᵢₖ − xⱼₖ)² / (xᵢₖ + xⱼₖ) )

  o Widely used in computer vision applications
  o The chi-squared statistic measures the distance between probability
    distributions
  o Data are assumed to be non-negative, often with an L1 norm of 1

❖ String kernels
❖ Tree kernels
❖ Graph kernels
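For the chi-squared kernel, one common route is to precompute the Gram matrix and
train an SVM on it. A sketch with scikit-learn's chi2_kernel and made-up histogram
data (γ = 1 is an illustrative choice):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# Non-negative feature vectors (e.g., normalized histograms), each with L1 norm 1
X = np.array([[0.6, 0.3, 0.1],
              [0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7],
              [0.2, 0.2, 0.6]])
y = np.array([0, 0, 1, 1])

# Precompute the chi-squared Gram matrix and pass it to the SVM
K = chi2_kernel(X, gamma=1.0)
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

X_new = np.array([[0.5, 0.4, 0.1]])
K_new = chi2_kernel(X_new, X)       # kernel between new points and training points
print(clf.predict(K_new))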
Practical Advice for Applying SVMs
❖ Use an SVM software package to solve for the parameters
  o e.g., SVMlight, libsvm, cvx (fast!), etc.

❖ You still need to specify:
  o the choice of the parameter C
  o the choice of kernel function and its associated kernel parameters
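In scikit-learn, for instance, these choices are typically made by cross-validated
grid search. A minimal sketch (synthetic data, an RBF kernel, and an illustrative
grid over C and γ):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale features, then search over C and the RBF kernel parameter gamma
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("held-out accuracy:", search.score(X_te, y_te))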
Multi-Class Classification with SVMs

  y ∈ {1, …, K}

❖ Many SVM packages already have multi-class classification built in

❖ Otherwise, use one-vs-rest:
  o Train K SVMs; each picks out one class from the rest, yielding θ⁽¹⁾, …, θ⁽ᴷ⁾
  o Predict the class i with the largest (θ⁽ⁱ⁾)ᵀx
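A sketch of one-vs-rest with linear SVMs in scikit-learn (the iris data and C value
are only for illustration): K classifiers are trained, and the predicted class is
the one with the largest decision value.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)            # K = 3 classes

# Train K one-vs-rest linear SVMs; prediction takes the class whose
# decision value (theta^(i))^T x + b^(i) is largest
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)

scores = ovr.decision_function(X[:5])        # one column of scores per class
print(np.argmax(scores, axis=1))             # same result as ovr.predict(X[:5])
print(ovr.predict(X[:5]))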
SVMs in Practice: A Demo

• https://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVM summary
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained
  increasing popularity in the late 1990s.
• SVMs represent a general methodology for many pattern recognition problems:
  classification, regression, feature extraction, clustering, novelty detection,
  etc.
• SVMs can be applied to complex data types beyond feature vectors (e.g.,
  graphs, sequences, relational data) by designing kernel functions for such
  data.
• SVM techniques have been extended to a number of tasks such as regression
  [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
• The most popular optimization algorithms for SVMs use decomposition to
  hill-climb over a subset of the αᵢ's at a time, e.g., SMO [Platt '99] and
  [Joachims '99].
Advantages of SVMs
• There are no problems with local minima, because the solution is a Quadratic
  Programming problem
• The optimal solution can be found in polynomial time
• There are few model parameters to select: the penalty term C, the kernel
  function and its parameters (e.g., the spread σ in the case of RBF kernels)
• The final results are stable and repeatable (e.g., no random initial weights)
• The SVM solution is sparse; it involves only the support vectors
• SVMs rely on elegant and principled learning methods
• SVMs provide a method to control complexity independently of dimensionality
• SVMs have been shown (theoretically and empirically) to have excellent
  generalization capabilities
• Software
  • SVMlight, by Joachims, is one of the most widely used SVM classification and
    regression packages. Distributed as C++ source and binaries for Linux, Windows,
    Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).
  • LIBSVM (Library for Support Vector Machines), http://www.csie.ntu.edu.tw/~cjlin/libsvm/,
    is developed by Chang and Lin and is also widely used. Developed in C++ and Java, it
    also supports multi-class classification, weighted SVM for unbalanced data,
    cross-validation, and automatic model selection. It has interfaces for Python, R,
    S-PLUS, MATLAB, Perl, Ruby, and LabVIEW. Kernels: linear, polynomial, radial basis
    function, and neural (tanh).
  • scikit-learn: https://scikit-learn.org/stable/modules/svm.html in Python
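For completeness, a minimal usage sketch of scikit-learn's SVC (which wraps LIBSVM);
the toy data and parameter values are illustrative.

from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [2.5, 2.5]]))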
References

• http://www.kernel-machines.org/
• http://www.support-vector.net/
• N. Cristianini and J. Shawe-Taylor: An Introduction to Support Vector Machines
  (and other kernel-based learning methods). Cambridge University Press, 2000.
  ISBN 0-521-78019-5.
• Papers by Vapnik
• C. J. C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition.
  Data Mining and Knowledge Discovery 2:121-167, 1998.
