L5-Support Vector Machine
❖ SVMs are one of the most robust prediction methods, being based on the statistical learning framework (VC theory) proposed by Vapnik (1982, 1995) and Chervonenkis (1974). [Wikipedia]
Main Ideas
• Max-Margin Classifier
• Formalize notion of the best linear separator
• Kernels
• Projecting data into a higher-dimensional space can make it linearly
separable
• Complexity
• Depends only on the number of training examples, not on
dimensionality of the kernel space!
Strengths of SVMs
❖ Good generalization
o in theory
o in practice
❖ Works well with few training instances
❖ Efficient algorithms
Tennis example
[Figure: training points plotted by Temperature vs. Humidity; legend: play tennis / do not play tennis]
Linear Support Vector Machines
[Figure: labeled data in the (x1, x2) plane, with classes = +1 and = −1]
Linear SVM
[Figure: separating hyperplane H with margin hyperplanes H1 and H2; f(x) = +1 on one side of H and −1 on the other; d+ and d− are the distances from H to the closest positive and negative examples]
Recall: the distance from a point (x₀, y₀) to the line Ax + By + c = 0 is |Ax₀ + By₀ + c| / √(A² + B²).
The distance between H and H1 is: |w·x + b| / ‖w‖ = 1 / ‖w‖
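As a quick numeric check (a sketch of mine, with made-up values for w and b, not from the slides), the point-to-hyperplane distance formula above can be evaluated directly:

```python
import numpy as np

# Illustrative hyperplane w·x + b = 0 (the values of w and b are made up).
w = np.array([3.0, 4.0])
b = -5.0

def distance_to_hyperplane(x, w, b):
    # |w·x + b| / ||w||, the hyperplane analogue of |Ax0 + By0 + c| / sqrt(A^2 + B^2)
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

x_on_H1 = np.array([2.0, 0.0])                # satisfies w·x + b = +1, i.e., lies on H1
print(distance_to_hyperplane(x_on_H1, w, b))  # 0.2
print(1.0 / np.linalg.norm(w))                # 1/||w|| = 0.2, as claimed
```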
Linear Separators
❖ Training instances: 𝑥 ∈ ℝ^(d+1), x₀ = 1; 𝑦 ∈ {−1, 1}
❖ Model parameters: 𝜃 ∈ ℝ^(d+1)
❖ Hyperplane: θᵀx = ⟨θ, x⟩ = 0
❖ Classifier: h(x) = sign(θᵀx) = sign(⟨θ, x⟩)
Recall the inner (dot) product: ⟨u, v⟩ = u·v = uᵀv = Σᵢ uᵢvᵢ
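A minimal sketch (mine, with an illustrative θ and example point) of the classifier h(x) = sign(θᵀx) under the x₀ = 1 convention:

```python
import numpy as np

def h(theta, x):
    # h(x) = sign(theta^T x); both vectors live in R^(d+1) with x[0] = 1 (bias term)
    return np.sign(np.dot(theta, x))

theta = np.array([-1.0, 2.0, 0.5])   # illustrative parameters
x = np.array([1.0, 0.4, 1.0])        # x0 = 1, followed by the d features
print(h(theta, x))                   # theta^T x = -1 + 0.8 + 0.5 = 0.3, so +1.0
```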
Intuitions
A “Good” Separator
Noise in the Observations
Ruling Out Some Separators
Lots of Noise
Only One Separator Remains
Maximizing the Margin
“Fat” Separators
Why Maximize Margin
Alternative View of Logistic Regression
h_θ(x) = 1 / (1 + e^(−θᵀx))
❖ If y = 1, we want h_θ(x) ≈ 1, i.e., θᵀx ≫ 0
❖ If y = 0, we want h_θ(x) ≈ 0, i.e., θᵀx ≪ 0
Alternative View of Logistic Regression
Cost of example: −yᵢ log h_θ(xᵢ) − (1 − yᵢ) log(1 − h_θ(xᵢ))
h_θ(x) = 1 / (1 + e^(−z)), where z = θᵀx
[Plots of the cost as a function of z: if y = 1 (want θᵀx ≫ 0), the cost is −log h_θ(x); if y = 0 (want θᵀx ≪ 0), the cost is −log(1 − h_θ(x))]
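A small numeric sketch (mine, with an illustrative θ and example) of the sigmoid and the per-example cost above:

```python
import numpy as np

def h_theta(theta, x):
    # Logistic hypothesis: 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

def logistic_cost(theta, x, y):
    # -y*log(h_theta(x)) - (1 - y)*log(1 - h_theta(x))
    h = h_theta(theta, x)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

theta = np.array([0.5, 2.0])
x = np.array([1.0, 1.5])                 # theta^T x = 3.5 >> 0, so h_theta(x) ≈ 0.97
print(logistic_cost(theta, x, y=1))      # small cost: the confident prediction is correct
print(logistic_cost(theta, x, y=0))      # large cost: the confident prediction is wrong
```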
Logistic Regression to SVM
❖ Logistic Regression (with the log losses written as cost₁ and cost₀):
  min_θ  Σᵢ₌₁ⁿ [ yᵢ cost₁(θᵀxᵢ) + (1 − yᵢ) cost₀(θᵀxᵢ) ] + (λ/2) Σⱼ₌₁ᵈ θⱼ²
❖ 𝐶 is similar to 1/λ
Support Vector Machine
  min_θ  C Σᵢ₌₁ⁿ [ yᵢ cost₁(θᵀxᵢ) + (1 − yᵢ) cost₀(θᵀxᵢ) ] + (1/2) Σⱼ₌₁ᵈ θⱼ²
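A hedged sketch (mine, not from the slides) of the hinge surrogates cost₁ and cost₀ and of the objective above, with y ∈ {0, 1} as on the slides and tiny made-up data:

```python
import numpy as np

def cost1(z):
    # Surrogate used when y = 1: zero once z = theta^T x is at least +1
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Surrogate used when y = 0: zero once z = theta^T x is at most -1
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * sum_i [ y_i*cost1(theta^T x_i) + (1 - y_i)*cost0(theta^T x_i) ] + (1/2) * sum_{j>=1} theta_j^2
    z = X @ theta
    loss = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg = 0.5 * np.sum(theta[1:] ** 2)   # the regularizer sums from j = 1, skipping the bias theta_0
    return C * loss + reg

X = np.array([[1.0, 2.0], [1.0, -1.5], [1.0, 0.2]])   # x0 = 1 column included
y = np.array([1, 0, 1])
theta = np.array([0.0, 1.0])
print(svm_objective(theta, X, y, C=1.0))              # 1.0 * 0.8 + 0.5 = 1.3
```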
Maximum Margin Hyperplane
Large Margin Classifier in Presence of Outliers
Vector Inner Product
Understanding the Hyperplane
Maximizing the Margin
Size of the Margin
❖ For the support vectors, we have |p| · ‖θ‖ = 1
o p is the length of the projection of the SVs onto θ
o so the margin is 1/‖θ‖: maximizing the margin is equivalent to minimizing ‖θ‖
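A short worked step (my reconstruction, consistent with the definitions above) showing why this makes the margin 1/‖θ‖:

```latex
% For a support vector x on the positive margin, \theta^{\top}x = 1.
% Writing \theta^{\top}x as the projection length p times \lVert\theta\rVert:
\theta^{\top}x \;=\; p\,\lVert\theta\rVert \;=\; 1
\quad\Longrightarrow\quad
p \;=\; \frac{1}{\lVert\theta\rVert}
% Maximizing the margin p is therefore equivalent to minimizing
% \tfrac{1}{2}\lVert\theta\rVert^{2}, the regularizer in the SVM objective above.
```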
What if the Surface is Non-Linear?
Structure of SVMs
Kernel Methods
When Linear Separators Fail
Mapping into a New Feature Space
Kernels
The Polynomial Kernel
❖ Given by K(xᵢ, xⱼ) = ⟨xᵢ, xⱼ⟩ᵈ
❖ Variation: K(xᵢ, xⱼ) = (⟨xᵢ, xⱼ⟩ + 1)ᵈ
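A small numpy sketch (mine, not from the slides) for d = 2 on 2-D inputs: the kernel value equals an ordinary inner product after an explicit quadratic feature map φ (the map and the points are illustrative):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    # Polynomial kernel K(x, z) = <x, z>^d
    return np.dot(x, z) ** d

def phi(x):
    # Explicit feature map matching the d = 2 kernel on 2-D inputs:
    # phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z, d=2))    # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))    # same value, computed via the explicit mapping
```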
The Kernel Trick
Incorporating Kernels into SVM
The Gaussian Kernel
❖ Also called Radial Basis Function (RBF) kernel
  K(xᵢ, xⱼ) = exp( −‖xᵢ − xⱼ‖² / (2σ²) )
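A minimal sketch (mine) of the RBF kernel formula above, with illustrative points and σ:

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    sq_dist = np.sum((x_i - x_j) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(rbf_kernel(x, z, sigma=1.0))   # exp(-2 / 2) = exp(-1) ≈ 0.368
print(rbf_kernel(x, x, sigma=1.0))   # identical points give the maximum value 1.0
```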
Gaussian Kernel Example
Other Kernels
❖ Sigmoid Kernel
  K(xᵢ, xⱼ) = tanh(xᵢᵀxⱼ + c)
❖ Cosine Similarity Kernel
  K(xᵢ, xⱼ) = xᵢᵀxⱼ / (‖xᵢ‖ ‖xⱼ‖)
Other Kernels
❖ Chi-squared Kernel
  K(xᵢ, xⱼ) = exp( − Σₖ (xᵢₖ − xⱼₖ)² / (xᵢₖ + xⱼₖ) )
❖ String kernels
❖ Tree kernels
❖ Graph kernels
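A hedged sketch (mine) of the chi-squared kernel as written above; the small epsilon and the example histograms are my additions for illustration:

```python
import numpy as np

def chi2_kernel(x_i, x_j, eps=1e-12):
    # K(x_i, x_j) = exp(-sum_k (x_ik - x_jk)^2 / (x_ik + x_jk))
    # Intended for non-negative features such as histograms; eps avoids division by zero.
    num = (x_i - x_j) ** 2
    den = x_i + x_j + eps
    return np.exp(-np.sum(num / den))

h1 = np.array([0.2, 0.5, 0.3])   # illustrative normalized histograms
h2 = np.array([0.1, 0.6, 0.3])
print(chi2_kernel(h1, h2))       # close to 1 for similar histograms
print(chi2_kernel(h1, h1))       # identical inputs give exactly 1.0
```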
Practical Advice for Applying SVMs
❖ Use an SVM software package to solve for the parameters
o e.g., SVMlight, libsvm, cvx (fast!), etc.
❖ Need to specify:
o Choice of the parameter C
o Choice of kernel function
o Associated kernel parameters
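As a concrete illustration of these choices, here is a minimal sketch using scikit-learn (the package referenced at the end of these slides); the dataset and parameter values are illustrative, not from the slides:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The two key choices from the slide: the penalty C and the kernel (plus its parameters).
clf = SVC(C=1.0, kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```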
Multi-Class Classification with SVMs
y ∈ {1, ..., K}
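One common reduction to binary SVMs is one-vs-rest; a hedged scikit-learn sketch (the toy data are mine, not from the slides):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy 2-D data with K = 3 classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [2.0, 0.0], [2.1, 0.2]])
y = np.array([1, 1, 2, 2, 3, 3])

# Trains one binary SVM per class (class k vs. the rest) and predicts the class
# whose classifier returns the largest decision value.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, y)
print(clf.predict([[0.05, 0.1], [2.0, 0.1]]))   # expected: [1 3]
```

scikit-learn's SVC also handles multi-class labels directly, using a one-vs-one scheme internally.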
SVMs in Practice
A Demo
• https://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVM summary
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs represent a general methodology for many pattern recognition (PR) problems:
classification, regression, feature extraction, clustering, novelty detection,
etc.
• SVMs can be applied to complex data types beyond feature vectors (e.g.,
graphs, sequences, relational data) by designing kernel functions for such
data.
• SVM techniques have been extended to a number of tasks such as
regression [Vapnik et al. ’97], principal component analysis [Schölkopf et
al. ’99], etc.
• The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g., SMO [Platt '99] and SVMlight [Joachims '99]
Advantages of SVMs
• There are no problems with local minima, because the solution is a convex Quadratic Programming problem
• The optimal solution can be found in polynomial time
• There are few model parameters to select: the penalty term C, the kernel
function and parameters (e.g., spread σ in the case of RBF kernels)
• The final results are stable and repeatable (e.g., no random initial weights)
• The SVM solution is sparse; it only involves the support vectors
• SVMs rely on elegant and principled learning methods
• SVMs provide a method to control complexity independently of
dimensionality
• SVMs have been shown (theoretically and empirically) to have excellent
generalization capabilities
• Software
• SVMlight, by Joachims, is one of the most widely used SVM classification and regression packages. Distributed as C++ source and binaries for Linux, Windows, Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).
• https://scikit-learn.org/stable/modules/svm.html in Python
References
• http://www.kernel-machines.org/
• http://www.support-vector.net/
• N. Cristianini and J. Shawe-Taylor: An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge University Press, 2000. ISBN 0-521-78019-5.
• Papers by Vapnik
• C.J.C. Burges: A tutorial on Support Vector Machines. Data Mining and Knowledge Discovery 2:121-167, 1998.