Support Vector Machine
Machine Learning
Types of Machine Learning
Classification
Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
Given training data in different classes (labels known)
Predict test data (labels unknown)
Attribute set (x) → Classification Model → Class label (y)
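As a concrete illustration of this input/output view, here is a minimal sketch, assuming scikit-learn and its toy iris dataset (neither of which is mentioned in the slides): a classification model is fitted on labelled training data and then predicts labels for held-out test data.

```python
# Minimal sketch of the classification setting described above:
# learn a model f that maps an attribute set x to a class label y.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # attribute sets and known labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC(kernel="linear")                      # the classification model
model.fit(X_train, y_train)                       # learn f from training data (labels known)
print(model.predict(X_test[:5]))                  # predict labels of test data (labels unknown)
print("accuracy:", model.score(X_test, y_test))
```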
Classification (contd)
Examples:
Handwritten digit recognition
Spam filtering
Text classification
Medical diagnosis
Methods:
Nearest Neighbor
Neural Networks
Decision Tree
Rule-based
Support vector machines: a new method
etc.
Introduction to SVM
The label of each training point is given by the sign of w · xi + b:
$$y_i = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 0 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b < 0 \end{cases}$$
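Read directly as code, this rule labels a point by the sign of w · xi + b. The weight vector, bias, and sample points in the sketch below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical weight vector and bias (not from the slides).
w = np.array([2.0, -1.0])
b = -0.5

def predict(x, w, b):
    """Label a point as +1 if w.x + b >= 0, else -1 (the rule above)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

points = [np.array([1.0, 0.5]), np.array([0.0, 2.0])]
for x in points:
    print(x, "->", predict(x, w, b))
```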
Hyperplane
The hyperplane that separates positive and negative training data is
w · x + b = 0
It is also called the decision boundary (surface).
There are many possible separating hyperplanes; which one should we choose?
Linear Classifiers
f(x, w, b) = sign(w · x + b)
w · x + b > 0 : predict +1
w · x + b < 0 : predict -1
(Figure: two classes of points, labelled +1 and -1, separated by a linear boundary.)
Linear Classifiers (contd)
f(x, w, b) = sign(w · x + b)
Any of these boundaries would be fine... but which is best?
(Figure: several candidate linear boundaries, all separating the +1 and -1 points.)
Linear Classifiers (contd)
f(x, w, b) = sign(w · x + b)
(Figure: a poorly chosen boundary under which a point is misclassified into the +1 class.)
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
(Figure: the +1 and -1 points with the margin drawn as an empty band around the boundary.)
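One way to make this definition concrete: for a separating (w, b), the distance from the boundary to its nearest training point is min over i of y_i(w · x_i + b) / ||w||, and the margin is the width of the empty band, i.e. twice that distance. A small sketch with made-up data:

```python
import numpy as np

# Made-up, linearly separable 2-D data (for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])

# A candidate separating hyperplane w.x + b = 0 (also made up).
w = np.array([1.0, 1.0])
b = 0.0

# Signed distance of each point from the boundary; all positive => correctly classified.
dist = y * (X @ w + b) / np.linalg.norm(w)
print("distance to nearest point:", dist.min())
print("margin (width of the empty band):", 2 * dist.min())
```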
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM, for Linear SVM).
Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition.
2. It implies that only the support vectors are important; the other training examples are ignorable.
3. Empirically it works very, very well.
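The claim that only the support vectors matter can be checked with any SVM implementation. The sketch below assumes scikit-learn's SVC (not part of the slides) and made-up data: it fits a maximum-margin linear classifier and reports which training points ended up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D training data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin (separable) LSVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)
# Removing any non-support-vector point and refitting leaves the boundary unchanged.
```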
Linear SVM: Large Margin Linear Classifier
Given a set of data points $\{(\mathbf{x}_i, y_i)\},\ i = 1, 2, \ldots, n$, where:
For $y_i = +1$, $\mathbf{w}^T \mathbf{x}_i + b > 0$
For $y_i = -1$, $\mathbf{w}^T \mathbf{x}_i + b < 0$
With a scale transformation on both $\mathbf{w}$ and $b$, the above is equivalent to:
For $y_i = +1$, $\mathbf{w}^T \mathbf{x}_i + b \ge 1$
For $y_i = -1$, $\mathbf{w}^T \mathbf{x}_i + b \le -1$
Linear SVM Mathematically
(Figure: the two margin planes passing through x+ and x-, with margin width M.)
What we know:
$\mathbf{w} \cdot \mathbf{x}^+ + b = +1$
$\mathbf{w} \cdot \mathbf{x}^- + b = -1$
$\mathbf{w} \cdot (\mathbf{x}^+ - \mathbf{x}^-) = 2$
Therefore the margin width is
$$M = \frac{\mathbf{w} \cdot (\mathbf{x}^+ - \mathbf{x}^-)}{\lVert \mathbf{w} \rVert} = \frac{2}{\lVert \mathbf{w} \rVert}$$
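Numerically, the margin width is just 2 divided by the norm of w. A quick check with a made-up weight vector (all values below are for illustration only):

```python
import numpy as np

# Made-up weight vector and bias.
w = np.array([3.0, 4.0])          # ||w|| = 5
b = -2.0

M = 2.0 / np.linalg.norm(w)       # margin width M = 2 / ||w||
print("margin width:", M)         # 0.4

# Sanity check: a point on w.x + b = +1, stepped a distance M against w,
# lands exactly on w.x + b = -1.
x_plus = np.array([1.0, 0.0])                 # satisfies w.x + b = +1
x_minus = x_plus - M * w / np.linalg.norm(w)  # step of length M against w
print("w.x_minus + b =", w @ x_minus + b)     # -1
```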
SVM imposes an additional requirement: the margin of its decision boundary must be maximum.
Maximizing the margin is equivalent to minimizing the following objective function:
$$\text{minimize } \frac{1}{2} \lVert \mathbf{w} \rVert^2$$
such that
For $y_i = +1$, $\mathbf{w}^T \mathbf{x}_i + b \ge 1$
For $y_i = -1$, $\mathbf{w}^T \mathbf{x}_i + b \le -1$
or, equivalently, $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for all $i$.
Linear SVM: Separable Case
Quadratic programming with linear constraints:
$$\text{minimize } \frac{1}{2}\lVert \mathbf{w} \rVert^2 \qquad \text{s.t. } y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$$
Lagrangian function:
$$\text{minimize } L_p(\mathbf{w}, b, \alpha_i) = \frac{1}{2}\lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{n} \alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right) \qquad \text{s.t. } \alpha_i \ge 0$$
The first term is the same as before; the second term captures the inequality constraints.
Setting the derivatives of $L_p$ to zero:
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$$
$$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$$
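The stationarity condition w = Σ α_i y_i x_i can be verified against a fitted model. In scikit-learn's SVC (an assumption of this sketch, not part of the slides), dual_coef_ stores the products α_i y_i for the support vectors, so multiplying it by the support vectors should reproduce the weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable data, as before.
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector.
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_
print("w reconstructed from alphas:", w_from_alphas)
print("w reported by the solver:   ", clf.coef_)

# The multipliers also satisfy the other condition, sum of alpha_i * y_i = 0:
print("sum of alpha_i * y_i:", clf.dual_coef_.sum())   # ~ 0
```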
Solving the Optimization Problem
From the KKT condition, we know:
$$\alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right) = 0, \qquad \alpha_i \ge 0$$
Thus only the points lying on the margin planes, where $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$, can have $\alpha_i > 0$; these are the support vectors.
Lagrangian dual problem:
$$\text{maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
$$\text{s.t. } \alpha_i \ge 0, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
Example:
Consider a two-dimensional data set that contains eight training instances. Using quadratic programming, we can solve the optimization problem to obtain the Lagrange multiplier for each training instance. The Lagrange multipliers are depicted in the last column of the table (not reproduced here). Notice that only the first two instances have non-zero Lagrange multipliers; these instances correspond to the support vectors for this data set.
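The same computation can be sketched generically. The code below is a rough illustration, not the slide's solution: it uses scipy's general-purpose SLSQP solver on a small made-up data set (not the eight instances from the slide) to maximize the dual objective subject to α_i ≥ 0 and Σ α_i y_i = 0, and most multipliers come out near zero.

```python
import numpy as np
from scipy.optimize import minimize

# Small made-up separable data set (NOT the eight instances from the slide).
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 1.0],
              [-1.0, -1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual objective (negated, since we minimize): -sum(a) + 1/2 * a^T H a,
# with H_ij = y_i y_j x_i . x_j.
H = (y[:, None] * X) @ (y[:, None] * X).T
def neg_dual(a):
    return -a.sum() + 0.5 * a @ H @ a

constraints = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum alpha_i y_i = 0
bounds = [(0, None)] * n                                  # alpha_i >= 0
res = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints)

alpha = res.x
print("Lagrange multipliers:", np.round(alpha, 4))
print("support vectors are the points with alpha_i > 0:", np.where(alpha > 1e-6)[0])

# Recover w and b from the multipliers.
w = (alpha * y) @ X
sv = np.argmax(alpha)               # any support vector gives b
b = y[sv] - w @ X[sv]
print("w =", w, " b =", b)
```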
Linear SVM: Nonseparable Case
Formulation (with slack variables $\xi_i$):
$$\text{minimize } \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i$$
such that
$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
Dual formulation:
$$\text{maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
such that
$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$
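A hedged sketch of how the trade-off parameter C behaves in practice, assuming scikit-learn's SVC and made-up overlapping data: a small C tolerates more margin violations (more support vectors), while a large C penalizes them heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not linearly separable, so slack is needed.
X = np.vstack([rng.normal(loc=[1.5, 1.5], size=(50, 2)),
               rng.normal(loc=[-1.5, -1.5], size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6}  support vectors: {len(clf.support_)}  "
          f"training accuracy: {clf.score(X, y):.2f}")
```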
Non-linear SVMs
Datasets that are linearly separable with some noise work out great, but what if the dataset is not linearly separable at all?
(Figure: two one-dimensional data sets on the x axis; one can be split by a single threshold, the other cannot.)
Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
$$\Phi : \mathbf{x} \mapsto \phi(\mathbf{x})$$
Figure: Feature Space Representation. One panel shows data that are not linearly separable in the original coordinates; the other shows the same data after mapping to polar coordinates, with the distance from the center (radius) on one axis, where they are linearly separable.
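A small sketch of the idea depicted in the figure: data arranged in concentric rings are not linearly separable in (x1, x2), but become separable after mapping each point to polar coordinates. The data and the hand-rolled feature map below are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
# Made-up "ring" data: class -1 on an inner disk, class +1 on an outer ring.
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.where(np.arange(n) < n // 2,
                  rng.uniform(0.0, 1.0, n),    # inner disk
                  rng.uniform(2.0, 3.0, n))    # outer ring
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
y = np.where(np.arange(n) < n // 2, -1, +1)

# Linear SVM in the original coordinates: poor.
print("original space:", SVC(kernel="linear").fit(X, y).score(X, y))

# Map to polar coordinates phi(x) = (radius, angle); a threshold on the
# radius now separates the classes, so a linear SVM does well.
X_polar = np.column_stack([np.hypot(X[:, 0], X[:, 1]),
                           np.arctan2(X[:, 1], X[:, 0])])
print("feature space:  ", SVC(kernel="linear").fit(X_polar, y).score(X_polar, y))
```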
The Kernel Trick
The linear classifier relies on the dot product between vectors: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$.
If every data point is mapped into a high-dimensional space via some transformation $\Phi: \mathbf{x} \mapsto \phi(\mathbf{x})$, the dot product becomes $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example: for 2-dimensional vectors $\mathbf{x} = [x_1\ x_2]$, let $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$. We need to show that $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$:
$$K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2 = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$$
$$= [1\ \ x_{i1}^2\ \ \sqrt{2}\, x_{i1} x_{i2}\ \ x_{i2}^2\ \ \sqrt{2}\, x_{i1}\ \ \sqrt{2}\, x_{i2}]^T \, [1\ \ x_{j1}^2\ \ \sqrt{2}\, x_{j1} x_{j2}\ \ x_{j2}^2\ \ \sqrt{2}\, x_{j1}\ \ \sqrt{2}\, x_{j2}]$$
$$= \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j), \quad \text{where } \phi(\mathbf{x}) = [1\ \ x_1^2\ \ \sqrt{2}\, x_1 x_2\ \ x_2^2\ \ \sqrt{2}\, x_1\ \ \sqrt{2}\, x_2]$$
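The algebra above is easy to check numerically; the sketch below evaluates both sides of the identity for a couple of made-up 2-D vectors.

```python
import numpy as np

def K(xi, xj):
    """The polynomial kernel (1 + xi . xj)^2 from the example."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map: [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

xi = np.array([0.7, -1.2])   # made-up test vectors
xj = np.array([2.0, 0.3])

print("kernel value      :", K(xi, xj))
print("phi(xi) . phi(xj) :", phi(xi) @ phi(xj))   # identical, up to rounding
```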
Examples of Kernel Functions
Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
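Beyond the linear kernel listed here, common choices include polynomial and Gaussian (RBF) kernels. The sketch below is an assumption of this write-up (scikit-learn and its make_moons toy data are not mentioned in the slides); it simply fits the same classifier with each standard kernel name.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons   # toy non-linear data, for illustration

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# 'linear' is the kernel listed above; 'poly' and 'rbf' are other standard options.
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X, y)
    print(f"{kernel:>6} kernel: training accuracy {clf.score(X, y):.2f}")
```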
Figure panels:
f) The maximum-margin hyperplane. The three support vectors are circled.
g) A data set containing one error, indicated by an arrow.
h) A separating hyperplane with a soft margin. The error is indicated by an arrow.
i) A nonseparable one-dimensional data set.
j) Separating the previously nonseparable data.
k) A linearly nonseparable two-dimensional data set, which is linearly separable in four dimensions.
l) An SVM that has overfit a two-dimensional data set.