07 SVMs

Support Vector Machines & Kernels

Support Vector Machines
Notation:
• $y^{(i)}$, $y_i$: label of the $i$-th instance (non-bold denotes a scalar)
• $x_j^{(i)}$, $x_{ij}$: $j$-th feature of the $i$-th instance
Linear Separators
• Training instances: $x \in \mathbb{R}^{d+1}$ with $x_0 = 1$, and $y \in \{-1, +1\}$
• Model parameters: $\theta \in \mathbb{R}^{d+1}$
• Recall the inner (dot) product: $\langle u, v \rangle = u \cdot v = u^\top v = \sum_i u_i v_i$
• Hyperplane: $\theta^\top x = \langle \theta, x \rangle = 0$
• Decision function: $h(x) = \mathrm{sign}(\theta^\top x) = \mathrm{sign}(\langle \theta, x \rangle)$
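A minimal NumPy sketch of the decision function above; the values of $\theta$ and the instances are made up for illustration:

```python
import numpy as np

# Hypothetical parameters and instances; x_0 = 1 is the bias feature (d = 2 here).
theta = np.array([-1.0, 2.0, 0.5])           # theta in R^{d+1}
X = np.array([[1.0, 3.0, -1.0],              # each row is [x_0 = 1, x_1, x_2]
              [1.0, 0.1,  0.2]])

def h(X, theta):
    """Decision function h(x) = sign(theta^T x)."""
    return np.sign(X @ theta)

print(h(X, theta))   # predicted labels in {-1, +1}
```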
Intuitions

A “Good” Separator
Noise in the Observations
Ruling Out Some Separators
Lots of Noise
Only One Separator Remains
Maximizing the Margin
“Fat” Separators
Why Maximize the Margin?
• Increasing the margin reduces capacity (i.e., fewer possible models)
Alternative View of Logistic Regression
$h_\theta(x) = g(z)$, where $g(z) = \frac{1}{1 + e^{-z}}$ and $z = \theta^\top x$
• If $y = 1$, we want $h_\theta(x) \approx 1$, i.e., $\theta^\top x \gg 0$
• If $y = 0$, we want $h_\theta(x) \approx 0$, i.e., $\theta^\top x \ll 0$

$J(\theta) = -\sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) \right]$

$\min_\theta J(\theta)$; the two log terms play the roles of $\mathrm{cost}_1(\theta^\top x_i)$ and $\mathrm{cost}_0(\theta^\top x_i)$ in what follows.

Based on slide by Andrew Ng
Alternate View of Logistic Regression
Cost of a single example: $-y_i \log h_\theta(x_i) - (1 - y_i) \log(1 - h_\theta(x_i))$
with $h_\theta(x) = g(z) = \frac{1}{1 + e^{-z}}$ and $z = \theta^\top x$
• If $y = 1$ (want $\theta^\top x \gg 0$): the cost reduces to $-\log h_\theta(x_i)$
• If $y = 0$ (want $\theta^\top x \ll 0$): the cost reduces to $-\log(1 - h_\theta(x_i))$

Based on slide by Andrew Ng
Logistic Regression to SVMs
Logistic Regression:
$\min_\theta \; -\sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) \right] + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

Support Vector Machines:
$\min_\theta \; C \sum_{i=1}^{n} \left[ y_i \, \mathrm{cost}_1(\theta^\top x_i) + (1 - y_i) \, \mathrm{cost}_0(\theta^\top x_i) \right] + \frac{1}{2} \sum_{j=1}^{d} \theta_j^2$

You can think of $C$ as similar to $\frac{1}{\lambda}$.
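To make the substitution of costs concrete, here is a small NumPy sketch comparing the per-example logistic costs with hinge-style surrogates. The specific forms $\mathrm{cost}_1(z) = \max(0, 1 - z)$ and $\mathrm{cost}_0(z) = \max(0, 1 + z)$ are an assumption here (the slides only plot them), but they match the standard SVM construction:

```python
import numpy as np

z = np.linspace(-3, 3, 7)                               # z = theta^T x

# Logistic regression per-example costs (y = 1 and y = 0 cases)
log_cost_y1 = -np.log(1.0 / (1.0 + np.exp(-z)))         # -log h_theta(x)
log_cost_y0 = -np.log(1.0 - 1.0 / (1.0 + np.exp(-z)))   # -log(1 - h_theta(x))

# Assumed hinge-style surrogates used by the SVM objective
cost1 = np.maximum(0.0, 1.0 - z)    # zero once z >= 1
cost0 = np.maximum(0.0, 1.0 + z)    # zero once z <= -1

for zi, l1, c1, l0, c0 in zip(z, log_cost_y1, cost1, log_cost_y0, cost0):
    print(f"z={zi:+.1f}  y=1: logistic={l1:.3f} cost1={c1:.3f}  "
          f"y=0: logistic={l0:.3f} cost0={c0:.3f}")
```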
Support Vector Machine
$\min_\theta \; C \sum_{i=1}^{n} \left[ y_i \, \mathrm{cost}_1(\theta^\top x_i) + (1 - y_i) \, \mathrm{cost}_0(\theta^\top x_i) \right] + \frac{1}{2} \sum_{j=1}^{d} \theta_j^2$

• If $y = 1$, we want $\theta^\top x \ge 1$ (not just $\ge 0$); $\mathrm{cost}_1(\theta^\top x)$ is zero in that region
• If $y = 0$, we want $\theta^\top x \le -1$ (not just $< 0$); $\mathrm{cost}_0(\theta^\top x)$ is zero in that region

Based on slide by Andrew Ng
Support Vector Machine
$\min_\theta \; C \sum_{i=1}^{n} \left[ y_i \, \mathrm{cost}_1(\theta^\top x_i) + (1 - y_i) \, \mathrm{cost}_0(\theta^\top x_i) \right] + \frac{1}{2} \sum_{j=1}^{d} \theta_j^2$

Relabel $y = 1/0$ as $y = +1/-1$. With $C$ very large, the cost terms must be driven to zero, which leaves

$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x_i \ge 1 \text{ if } y_i = +1, \quad \theta^\top x_i \le -1 \text{ if } y_i = -1$

or equivalently

$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad y_i (\theta^\top x_i) \ge 1$
Maximum Margin Hyperplane
$\text{margin} = \frac{2}{\|\theta\|_2}$
The support vectors lie on the hyperplanes $\theta^\top x = 1$ and $\theta^\top x = -1$.
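As a quick numeric check of this margin formula, here is a sketch using scikit-learn's linear SVM on made-up, separable data; a very large C approximates the hard-margin problem, and note that scikit-learn keeps the bias separate (intercept_) rather than folding it into $\theta$ via $x_0 = 1$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (labels in {-1, +1})
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C ~ hard margin

theta = clf.coef_.ravel()                      # weights; bias is kept separately in clf.intercept_
print(clf.decision_function(clf.support_vectors_))   # approximately +1 / -1 at the support vectors
print("margin =", 2.0 / np.linalg.norm(theta))        # margin = 2 / ||theta||_2
```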
Large Margin Classifier in Presence of Outliers
• Figure: decision boundaries when $C$ is very large

Based on slide by Andrew Ng
Vector Inner Product
$\|u\|_2 = \mathrm{length}(u) \in \mathbb{R} = \sqrt{u_1^2 + u_2^2}$

$u^\top v = v^\top u = u_1 v_1 + u_2 v_2 = \|u\|_2 \|v\|_2 \cos\theta = p \, \|u\|_2$, where $p = \|v\|_2 \cos\theta$ is the (signed) length of the projection of $v$ onto $u$

Based on example by Andrew Ng
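A tiny numeric check of the identity above; the vectors u and v are chosen arbitrarily:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([2.0, 1.0])

cos_angle = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
p = np.linalg.norm(v) * cos_angle        # signed length of the projection of v onto u

print(u @ v)                             # u^T v = 10.0
print(p * np.linalg.norm(u))             # p * ||u||_2 = 10.0 as well
```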
Understanding the Hyperplane
$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x_i \ge 1 \text{ if } y_i = +1, \quad \theta^\top x_i \le -1 \text{ if } y_i = -1$

Assume $\theta_0 = 0$, so that the hyperplane is centered at the origin, and that $d = 2$.

$\theta^\top x = \|\theta\|_2 \underbrace{\|x\|_2 \cos\theta}_{p} = p \, \|\theta\|_2$

Based on example by Andrew Ng
• When $p$ is small, $\|\theta\|_2$ must be large in order to have $p \, \|\theta\|_2 \ge 1$ (or $\le -1$)
• When $p$ is larger, $\|\theta\|_2$ can be smaller and still have $p \, \|\theta\|_2 \ge 1$ (or $\le -1$)
Size of the Margin
For the support vectors, we have $p \, \|\theta\|_2 = \pm 1$
• $p$ is the length of the projection of the support vectors onto $\theta$

Therefore, $p = \frac{1}{\|\theta\|_2}$, and so $\text{margin} = 2p = \frac{2}{\|\theta\|_2}$
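The same relationship can be checked numerically: for a point x on the hyperplane $\theta^\top x = 1$, the projection length p of x onto $\theta$ equals $1 / \|\theta\|_2$. The $\theta$ and x below are arbitrary made-up values:

```python
import numpy as np

theta = np.array([1.0, 2.0])
x = np.array([1.0, 0.0])                 # lies on the hyperplane theta^T x = 1

p = (theta @ x) / np.linalg.norm(theta)  # projection length of x onto theta
print(p, 1.0 / np.linalg.norm(theta))    # equal: p = 1 / ||theta||_2
print("margin =", 2.0 / np.linalg.norm(theta))
```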
The SVM Dual Problem
The primal SVM problem was given as
$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad y_i (\theta^\top x_i) \ge 1 \;\; \forall i$

We can solve it more efficiently by taking the Lagrangian dual
• Duality is a common idea in optimization
• It transforms a difficult optimization problem into a simpler one
• Key idea: introduce a Lagrange multiplier (dual variable) $\alpha_i$ for each constraint
  – $\alpha_i$ indicates how important a particular constraint is to the solution
The SVM Dual Problem
• The Lagrangian is given by
$L(\theta, \alpha) = \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 - \sum_{i=1}^{n} \alpha_i \left( y_i \, \theta^\top x_i - 1 \right)$
$\text{s.t.} \quad \alpha_i \ge 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$
Understanding the Dual
Maximize
$J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$\text{s.t.} \quad \alpha_i \ge 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$
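The dual solution recovers the primal weights as $\theta = \sum_i \alpha_i y_i x_i$. A sketch of this using scikit-learn, whose dual_coef_ attribute stores the products $y_i \alpha_i$ for the support vectors (toy data as before):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds y_i * alpha_i for each support vector,
# so theta = sum_i alpha_i y_i x_i = dual_coef_ @ support_vectors_
theta_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(theta_from_dual)    # matches the primal weights below
print(clf.coef_)
```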
What if Data Are Not Linearly Separable?
$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad y_i (\theta^\top x_i) \ge 1 \;\; \forall i$
• Cannot find a $\theta$ that satisfies the constraints
• New problem: introduce slack variables $\xi_i \ge 0$
$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (\theta^\top x_i) \ge 1 - \xi_i \;\; \forall i$
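Once a soft-margin SVM is fit, the slack of each training point can be read off as $\xi_i = \max(0, 1 - y_i \theta^\top x_i)$. A sketch on made-up data with a modest C so that some slack appears (scikit-learn again keeps the bias term separate):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(20, 2)),
               rng.normal(-1.0, 1.0, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Slack of each training point: xi_i = max(0, 1 - y_i * f(x_i));
# positive values mark margin violations
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
print("margin violations:", int(np.sum(xi > 0)))
print("total slack:", xi.sum())
```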
Strengths of SVMs
• Good generalization in theory
• Good generalization in practice
• Work well with few training instances
• Find the globally best model
• Efficient algorithms
• Amenable to the kernel trick …
What if Surface is Non-Linear?
• Figure: a cluster of X’s surrounded by O’s, which no straight line can separate

Image from http://www.atrandomresearch.com/iclass/
Kernel Methods
Mapping into a New Feature Space
$\Phi : X \mapsto \hat{X} = \Phi(x)$
• For example, with $x_i \in \mathbb{R}^2$:
$\Phi([x_{i1}, x_{i2}]) = [x_{i1}, \; x_{i2}, \; x_{i1} x_{i2}, \; x_{i1}^2, \; x_{i2}^2]$
• Rather than run the SVM on $x_i$, run it on $\Phi(x_i)$
  – Find a non-linear separator in the input space
• Computing $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ should be efficient, much more so than computing $\Phi(x_i)$ and $\Phi(x_j)$ explicitly
• Use $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ in the SVM algorithm rather than $\langle x_i, x_j \rangle$
• Remarkably, this is possible!
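Here is a sketch of the "map, then run a linear SVM" idea using the $\Phi$ shown above; the ring-shaped toy dataset and the value of C are made up for the example:

```python
import numpy as np
from sklearn.svm import SVC

def phi(X):
    """Explicit feature map Phi([x1, x2]) = [x1, x2, x1*x2, x1^2, x2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # ring pattern: not linearly separable in R^2

clf = SVC(kernel="linear", C=10.0).fit(phi(X), y)      # linear SVM in the mapped space
print("training accuracy:", clf.score(phi(X), y))      # a non-linear separator in the original space
```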
The Polynomial Kernel
Let $x_i = [x_{i1}, x_{i2}]$ and $x_j = [x_{j1}, x_{j2}]$. With a kernel, the dual objective becomes

Maximize
$J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
$\text{s.t.} \quad \alpha_i \ge 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$
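The slide's exact polynomial expansion is not recoverable here, but a standard degree-2 example makes the point: for $K(x, z) = (x^\top z)^2$, the kernel equals the inner product of the explicit map $\phi(x) = [x_1^2, \sqrt{2} x_1 x_2, x_2^2]$, so the kernel value can be computed without ever forming $\phi$:

```python
import numpy as np

def phi(x):
    """Explicit map whose inner product gives K(x, z) = (x . z)^2."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

print((x @ z) ** 2)        # kernel value: one dot product, then a square
print(phi(x) @ phi(z))     # same value via the explicit 3-dimensional map
```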
The Gaussian Kernel
• Also called the Radial Basis Function (RBF) kernel
$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)$
  – Has value 1 when $x_i = x_j$
  – Value falls off to 0 with increasing distance
  – Note: need to do feature scaling before using the Gaussian kernel
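A minimal sketch of computing the Gaussian kernel matrix directly; the choice of $\sigma$ and the data points are arbitrary. As the note above says, features should be scaled to comparable ranges first:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel_matrix(X, sigma=1.0)
print(np.diag(K))   # all 1.0, since K(x, x) = 1
print(K[0, 2])      # near 0 for distant points
```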
Other kernels include:
• String kernels
• Tree kernels
• Graph kernels
An Aside: The Math Behind Kernels
What does it mean to be a kernel?
• $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ for some $\Phi$
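One consequence worth knowing (background fact, not from the slides): the Gram matrix of any valid kernel is symmetric positive semidefinite, which can be checked numerically:

```python
import numpy as np

def is_psd_gram(K, tol=1e-10):
    """Check that a Gram matrix is symmetric and positive semidefinite."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Gram matrix of the linear kernel K(x_i, x_j) = <x_i, x_j> on random points
X = np.random.default_rng(2).normal(size=(5, 3))
print(is_psd_gram(X @ X.T))   # True: inner-product kernels always give PSD Gram matrices
```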
Many more...
• Cosine similarity kernel
• Chi-squared kernel
• String/tree/graph/wavelet/etc. kernels
Application: Automatic Photo Retouching (Leyvand et al., 2008)
Practical Advice for Applying SVMs
• Use an SVM software package to solve for the parameters
  – e.g., SVMlight, libsvm, cvx (fast!), etc.
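For example, scikit-learn's SVC wraps libsvm, so a typical call looks like the sketch below; the dataset and hyperparameter values are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder dataset standing in for real data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)           # feature scaling, as advised for the RBF kernel
clf = SVC(kernel="rbf", C=1.0, gamma="scale")    # SVC solves the SVM problem via libsvm
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```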
Multi-Class Classification with SVMs
$y \in \{1, \ldots, K\}$

Choosing a model:
• If $d$ is large relative to $n$ (e.g., $d > n$ with $d = 10{,}000$ and $n = 10$ to $1{,}000$): use logistic regression or an SVM with a linear kernel
• If $d$ is small (up to 1,000) and $n$ is intermediate (up to 10,000): use an SVM with a Gaussian kernel

Neural networks are likely to work well for most of these settings, but may be slower to train.

Based on slide by Andrew Ng
Other SVM Variations
• ν-SVM (a usage sketch follows this list)
  – The ν parameter controls the fraction of support vectors (lower bound) and the misclassification rate (upper bound)
    • E.g., $\nu = 0.05$ guarantees that at least 5% of training points are support vectors and that the training error rate is at most 5%
  – Harder to optimize than C-SVM and not as scalable
• SVMs for regression
• One-class SVMs
• SVMs for clustering
...
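A ν-SVM sketch using scikit-learn's NuSVC, with ν = 0.05 as in the example above; the toy data are made up:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+2.0, 1.0, size=(100, 2)),
               rng.normal(-2.0, 1.0, size=(100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])

clf = NuSVC(nu=0.05, kernel="rbf", gamma="scale").fit(X, y)
print("fraction of support vectors:", clf.support_vectors_.shape[0] / len(X))  # at least ~0.05
print("training error rate:", 1.0 - clf.score(X, y))                           # at most ~0.05
```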
Conclusion
• SVMs find the optimal linear separator
• The kernel trick makes SVMs learn non-linear decision surfaces