
SVMs and Kernel Methods

Lecture 3

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin
Today’s lecture

• Dual form of soft-margin SVM
• Feature mappings & kernels
• Convexity, Mercer’s theorem
• (Time permitting) Extensions:
  – Imbalanced data
  – Multi-class
  – Other loss functions
  – L1 regularization

Recap of dual SVM derivation

(Dual)

$$\max_{\alpha \ge 0}\ \min_{w,b}\ L(w, b, \alpha)$$

Can solve for the optimal w, b as a function of α:

$$\frac{\partial L}{\partial w} = w - \sum_j \alpha_j y_j x_j = 0 \quad\Longrightarrow\quad w = \sum_j \alpha_j y_j x_j$$

Substituting these values back in (and simplifying), we obtain:

(Dual)

$$\max_{\alpha}\ \sum_j \alpha_j - \frac{1}{2}\sum_{j,k} \alpha_j \alpha_k\, y_j y_k\,(x_j \cdot x_k) \qquad \text{s.t.}\quad \alpha_j \ge 0,\ \ \sum_j \alpha_j y_j = 0$$

So, in dual formulation we will solve for α directly!


• w and b are computed from α (if needed)
Solving for the offset “b”

Lagrangian:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_j \alpha_j\big[y_j(w \cdot x_j + b) - 1\big], \qquad \alpha_j \ge 0$$

αⱼ > 0 for some j implies the corresponding constraint is tight. We use this to obtain b:

(1)  $y_j(w \cdot x_j + b) = 1$

(2)  $b = y_j - w \cdot x_j$

(3)  $b = y_j - \sum_i \alpha_i y_i\,(x_i \cdot x_j)$
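
As a concrete sketch (mine, not the lecture's), scikit-learn's SVC exposes the products αⱼyⱼ for the support vectors as dual_coef_, so both w = Σⱼ αⱼyⱼxⱼ and the offset b can be recovered from a fitted model exactly as above; the toy data and C value below are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy 2-class data (arbitrary choice); map labels to {-1, +1}
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
y = np.where(y == 1, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0, j] holds alpha_j * y_j for the j-th support vector
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

# w = sum_j alpha_j y_j x_j
w = alpha_y @ sv

# b from a "margin" support vector with 0 < alpha_j < C: y_j (w . x_j + b) = 1
j = np.flatnonzero(np.abs(alpha_y) < clf.C - 1e-8)[0]   # assumes such a point exists
b = np.sign(alpha_y[j]) - w @ sv[j]                      # y_j - w . x_j

print("w:", w, "vs SVC.coef_:", clf.coef_[0])
print("b:", b, "vs SVC.intercept_:", clf.intercept_[0])
```
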
Dual formulation only depends on
dot-products of the features!

First, we introduce a feature mapping:

$$\Phi(x) = \Big[\, x^{(1)},\ \ldots,\ x^{(n)},\ \ x^{(1)}x^{(2)},\ x^{(1)}x^{(3)},\ \ldots,\ \ e^{x^{(1)}},\ \ldots \,\Big]^{\top}$$

Next, replace the dot product with an equivalent kernel function:

$$K(x, x') = \Phi(x) \cdot \Phi(x')$$

Do kernels need to be symmetric?


Classification rule using dual solution

Using the dual solution, predict with:

$$\hat{y} = \operatorname{sign}\Big(\sum_{j \in \mathrm{SV}} \alpha_j y_j\,(x_j \cdot x) + b\Big)$$

i.e., the dot product of the feature vector of the new example with each support vector.

Using a kernel function, predict with:

$$\hat{y} = \operatorname{sign}\Big(\sum_{j \in \mathrm{SV}} \alpha_j y_j\, K(x_j, x) + b\Big)$$
Dual SVM interpretation: Sparsity

[Figure: separating hyperplane w·x + b = 0 with margin hyperplanes w·x + b = +1 and w·x + b = −1; support vectors lie on the margins]

Final solution tends to be sparse:
• αⱼ = 0 for most j
• we don’t need to store these points to compute w or make predictions

Non-support vectors:
• αⱼ = 0
• moving them will not change w

Support vectors:
• αⱼ > 0
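
A quick numeric check of the sparsity claim (my example, not the lecture's): after training, only the points with αⱼ > 0 are kept as support vectors, and they are typically a small fraction of the data.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=500, centers=2, cluster_std=1.5, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Points with alpha_j = 0 are simply not stored; only support vectors are kept.
print(f"{len(clf.support_)} support vectors out of {len(X)} training points")
```
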
Soft-margin SVM

Primal: solve for w, b, ξ:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \qquad \text{s.t.}\quad y_i(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

Dual:

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,k} \alpha_i \alpha_k\, y_i y_k\,(x_i \cdot x_k) \qquad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0$$

What changed?
• Added upper bound of C on αᵢ!
• Intuitive explanation:
  • Without slack, αᵢ → ∞ when constraints are violated (points misclassified)
  • The upper bound of C limits the αᵢ, so misclassifications are allowed
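
A small sanity check of the box constraint 0 ≤ αᵢ ≤ C (my sketch, assuming scikit-learn): SVC's dual_coef_ stores yᵢαᵢ, so its absolute values should never exceed C, and margin-violating points sit at the bound αᵢ = C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes so that some slack is needed (arbitrary toy data)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=1)

C = 0.5
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_[0])            # |y_i alpha_i| = alpha_i
print("max alpha_i:", alphas.max(), "<= C =", C)
print("support vectors at the bound alpha_i = C:", int(np.sum(np.isclose(alphas, C))))
```
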
Common kernels

• Polynomials of degree exactly d:  $K(u, v) = (u \cdot v)^d$

• Polynomials of degree up to d:  $K(u, v) = (u \cdot v + 1)^d$

• Gaussian kernels:  $K(u, v) = \exp\!\big(-\|u - v\|^2 / 2\sigma^2\big)$

• Sigmoid:  $K(u, v) = \tanh(\eta\, u \cdot v + \nu)$

• And many others: very active area of research!
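
The kernels above written out as plain NumPy functions (a sketch; the Gaussian width σ and the sigmoid constants η, ν are illustrative choices, not values from the lecture):

```python
import numpy as np

def poly_exact(u, v, d=3):
    """Polynomial of degree exactly d: K(u, v) = (u . v)^d"""
    return (u @ v) ** d

def poly_up_to(u, v, d=3):
    """Polynomial of degree up to d: K(u, v) = (u . v + 1)^d"""
    return (u @ v + 1) ** d

def gaussian(u, v, sigma=1.0):
    """Gaussian (RBF): K(u, v) = exp(-||u - v||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=-1.0):
    """Sigmoid: K(u, v) = tanh(eta * u . v + nu)"""
    return np.tanh(eta * (u @ v) + nu)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_exact(u, v), poly_up_to(u, v), gaussian(u, v), sigmoid(u, v))
```
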



Polynomial kernel

d = 1:

$$\Phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix},\qquad \Phi(u)\cdot\Phi(v) = u_1 v_1 + u_2 v_2 = u \cdot v$$

d = 2:

$$\Phi(u) = \begin{bmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{bmatrix},\qquad \Phi(u)\cdot\Phi(v) = u_1^2 v_1^2 + 2\,u_1 v_1 u_2 v_2 + u_2^2 v_2^2 = (u_1 v_1 + u_2 v_2)^2 = (u \cdot v)^2$$

For any d (we will skip the proof):

$$\Phi(u)\cdot\Phi(v) = (u \cdot v)^d$$

Polynomials of degree exactly d
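
A numeric spot-check (my own, for the 2-D inputs used on the slide) that the explicit degree-2 feature map and the kernel agree, i.e. Φ(u)·Φ(v) = (u·v)²:

```python
import numpy as np

def phi_d2(x):
    """Explicit degree-2 feature map for 2-D input: [x1^2, x1*x2, x2*x1, x2^2]."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

u, v = np.array([3.0, -1.0]), np.array([2.0, 4.0])
lhs = phi_d2(u) @ phi_d2(v)      # dot product in feature space
rhs = (u @ v) ** 2               # kernel evaluated in input space
print(lhs, rhs)                  # both equal (u.v)^2
```
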


Gaussian kernel

$$K(u, v) = \exp\!\big(-\|u - v\|^2 / 2\sigma^2\big)$$

[Figure: level sets of the learned function, i.e. w · Φ(x) = r for some r; support vectors highlighted]

[Cynthia Rudin] [mblondel.org]


Kernel algebra

Q: How would you prove that the “Gaussian kernel” is a valid kernel?
A: Expand the Euclidean norm as follows:

$$\exp\!\Big(\!-\frac{\|u - v\|^2}{2\sigma^2}\Big) = \exp\!\Big(\!-\frac{\|u\|^2}{2\sigma^2}\Big)\,\exp\!\Big(\frac{u \cdot v}{\sigma^2}\Big)\,\exp\!\Big(\!-\frac{\|v\|^2}{2\sigma^2}\Big)$$

To see that this is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c); then apply (e) from above. The feature mapping is infinite dimensional!

[Justin Domke]
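
A small numeric sketch (mine) of that argument: writing the Gaussian kernel as the product of the two norm factors and a truncated Taylor series of exp(u·v/σ²) reproduces the kernel value, which is how the infinite-dimensional feature map arises.

```python
import numpy as np
from math import factorial

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def gaussian_via_taylor(u, v, sigma=1.0, terms=20):
    """exp(-||u-v||^2 / 2s^2) = exp(-||u||^2/2s^2) * exp(u.v/s^2) * exp(-||v||^2/2s^2),
    with the middle factor expanded as a (truncated) Taylor series."""
    s2 = sigma ** 2
    middle = sum((u @ v / s2) ** k / factorial(k) for k in range(terms))
    return np.exp(-u @ u / (2 * s2)) * middle * np.exp(-v @ v / (2 * s2))

u, v = np.array([0.3, -1.2]), np.array([1.0, 0.5])
print(gaussian_kernel(u, v), gaussian_via_taylor(u, v))   # nearly identical
```
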
Overfitting?

• Huge feature space with kernels: should we worry about


overfitting?
– SVM objective seeks a solution with large margin
• Theory says that large margin leads to good generalization
(we will see this in a couple of lectures)
– But everything overfits sometimes!!!
– Can control by:
• Setting C
• Choosing a better Kernel
• Varying parameters of the Kernel (width of Gaussian, etc.)
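
As a practical illustration of the last two bullets (my sketch, assuming scikit-learn), C and the Gaussian width are usually chosen by cross-validation rather than by training error:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],        # slack penalty
    "gamma": [0.01, 0.1, 1, 10],   # inverse width of the Gaussian kernel
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```
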
How to deal with imbalanced data?

• In many practical applications we may have imbalanced data sets
• We may want errors to be equally distributed between the positive and negative classes
• A slight modification to the SVM objective does the trick!

Class-specific weighting of the slack variables:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C_{+}\!\!\sum_{i:\,y_i = +1}\!\!\xi_i \;+\; C_{-}\!\!\sum_{i:\,y_i = -1}\!\!\xi_i$$
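
One way to get class-specific slack penalties in practice (a sketch; scikit-learn exposes them through class_weight, which rescales C per class rather than naming C₊ and C₋ explicitly):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Roughly 95% negatives / 5% positives (synthetic, for illustration)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# class_weight rescales C per class: errors on the rare positive class cost ~19x more,
# which roughly balances the total influence of the two classes.
clf = SVC(kernel="linear", C=1.0, class_weight={0: 1, 1: 19}).fit(X, y)
print((clf.predict(X) == 1).sum(), "predicted positives;", (y == 1).sum(), "true positives")

# class_weight="balanced" computes this kind of reweighting automatically.
```
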


How do we do multi-class classification?

One versus all classification

[Figure: three classes (+, o, −) with weight vectors w₊, w₋, w₀]

Learn 3 classifiers:
• − vs {o, +}, weights w₋
• + vs {o, −}, weights w₊
• o vs {+, −}, weights w₀

Predict label using:

$$\hat{y} = \arg\max_{y}\ w_y \cdot x + b_y$$

Any problems?

Could we learn this (1-D) dataset?

[Figure: 1-D dataset with points at −1, 0, and +1]
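
A minimal one-versus-all sketch (mine, using scikit-learn's LinearSVC): train one binary classifier per class and predict with the argmax of the decision scores, as on the slide.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)   # classes 0, 1, 2

# One binary classifier per class: class k vs. the rest
clfs = [LinearSVC(C=1.0).fit(X, (y == k).astype(int)) for k in range(3)]

# Predict with argmax_k (w_k . x + b_k)
scores = np.column_stack([clf.decision_function(X) for clf in clfs])
y_hat = np.argmax(scores, axis=1)
print("training accuracy:", np.mean(y_hat == y))
```
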
Multi-class SVM

Simultaneously learn 3 sets of weights (w₊, w₋, w₀):
• How do we guarantee the correct labels?
• Need new constraints!

The “score” of the correct class must be better than the “score” of wrong classes:

$$w_{y_j} \cdot x_j + b_{y_j} \ \ge\ w_{y'} \cdot x_j + b_{y'} + 1 \qquad \text{for all } y' \ne y_j$$
Multi-class SVM

As for the binary SVM, we introduce slack variables and maximize the margin:

$$\min_{w,b,\xi}\ \frac{1}{2}\sum_{y}\|w_y\|^2 + C\sum_j \xi_j \qquad \text{s.t.}\quad w_{y_j} \cdot x_j + b_{y_j} \ \ge\ w_{y'} \cdot x_j + b_{y'} + 1 - \xi_j \ \ \forall\, y' \ne y_j,\quad \xi_j \ge 0$$

To predict, we use:

$$\hat{y} = \arg\max_{y}\ w_y \cdot x + b_y$$

Now can we learn it?

[Figure: 1-D dataset with points at −1, 0, and +1; b₊ = .5]
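
A hedged pointer to an off-the-shelf version of this joint formulation: scikit-learn's LinearSVC has a Crammer–Singer multi-class mode that learns all the w_y simultaneously under "correct score beats wrong scores" constraints (a closely related formulation, not necessarily identical to the slide's):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Jointly learns one weight vector per class under multi-class margin constraints
clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.coef_.shape)    # (3, n_features): one w_y per class
print(clf.predict(X[:5]))
```
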
