
SVMs and Kernel Methods

Lecture 3

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin
Today’s lecture

• Dual form of soft-margin SVM
• Feature mappings & kernels
• Convexity, Mercer’s theorem
• (Time permitting) Extensions:
  – Imbalanced data
  – Multi-class
  – Other loss functions
  – L1 regularization

Recap of dual SVM derivation

(Dual)

$$\max_{\alpha \ge 0}\ \min_{w,b}\ L(w, b, \alpha)$$

Can solve for the optimal w, b as a function of α:

$$\frac{\partial L}{\partial w} = w - \sum_j \alpha_j y_j x_j = 0 \quad\Longrightarrow\quad w = \sum_j \alpha_j y_j x_j$$

Substituting these values back in (and simplifying), we obtain:

(Dual)

$$\max_{\alpha}\ \sum_j \alpha_j - \frac{1}{2}\sum_{j,k} \alpha_j \alpha_k\, y_j y_k\,(x_j \cdot x_k) \qquad \text{s.t.}\quad \alpha_j \ge 0,\ \ \sum_j \alpha_j y_j = 0$$

So, in dual formulation we will solve for α directly!


• w and b are computed from α (if needed)
Solving for the offset “b”

Lagrangian:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_j \alpha_j\big[y_j(w \cdot x_j + b) - 1\big], \qquad \alpha_j \ge 0$$

αⱼ > 0 for some j implies the corresponding constraint is tight. We use this to obtain b:

(1)  $y_j(w \cdot x_j + b) = 1$

(2)  $b = y_j - w \cdot x_j$

(3)  $b = y_j - \sum_i \alpha_i y_i\,(x_i \cdot x_j)$
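
As a concrete sketch (mine, not the lecture's), scikit-learn's SVC exposes the products αⱼyⱼ for the support vectors as dual_coef_, so both w = Σⱼ αⱼyⱼxⱼ and the offset b can be recovered from a fitted model exactly as above; the toy data and C value below are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy 2-class data (arbitrary choice); map labels to {-1, +1}
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
y = np.where(y == 1, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0, j] holds alpha_j * y_j for the j-th support vector
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

# w = sum_j alpha_j y_j x_j
w = alpha_y @ sv

# b from a "margin" support vector with 0 < alpha_j < C: y_j (w . x_j + b) = 1
j = np.flatnonzero(np.abs(alpha_y) < clf.C - 1e-8)[0]   # assumes such a point exists
b = np.sign(alpha_y[j]) - w @ sv[j]                      # y_j - w . x_j

print("w:", w, "vs SVC.coef_:", clf.coef_[0])
print("b:", b, "vs SVC.intercept_:", clf.intercept_[0])
```
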
Dual formulation only depends on
dot-products of the features!

First, we introduce a feature mapping:

$$\Phi(x) = \Big[\, x^{(1)},\ \ldots,\ x^{(n)},\ \ x^{(1)}x^{(2)},\ x^{(1)}x^{(3)},\ \ldots,\ \ e^{x^{(1)}},\ \ldots \,\Big]^{\top}$$

Next, replace the dot product with an equivalent kernel function:

$$K(x, x') = \Phi(x) \cdot \Phi(x')$$

Do kernels need to be symmetric?


Classification rule using dual solution

Using the dual solution, predict with:

$$\hat{y} = \operatorname{sign}\Big(\sum_{j \in \mathrm{SV}} \alpha_j y_j\,(x_j \cdot x) + b\Big)$$

i.e., the dot product of the feature vector of the new example with each support vector.

Using a kernel function, predict with:

$$\hat{y} = \operatorname{sign}\Big(\sum_{j \in \mathrm{SV}} \alpha_j y_j\, K(x_j, x) + b\Big)$$
Dual SVM interpretation: Sparsity

[Figure: separating hyperplane w·x + b = 0 with margin hyperplanes w·x + b = +1 and w·x + b = −1; support vectors lie on the margins]

Final solution tends to be sparse:
• αⱼ = 0 for most j
• we don’t need to store these points to compute w or make predictions

Non-support vectors:
• αⱼ = 0
• moving them will not change w

Support vectors:
• αⱼ > 0
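
A quick numeric check of the sparsity claim (my example, not the lecture's): after training, only the points with αⱼ > 0 are kept as support vectors, and they are typically a small fraction of the data.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=500, centers=2, cluster_std=1.5, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Points with alpha_j = 0 are simply not stored; only support vectors are kept.
print(f"{len(clf.support_)} support vectors out of {len(X)} training points")
```
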
Soft-margin SVM

Primal: solve for w, b, ξ:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \qquad \text{s.t.}\quad y_i(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

Dual:

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,k} \alpha_i \alpha_k\, y_i y_k\,(x_i \cdot x_k) \qquad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0$$

What changed?
• Added upper bound of C on αᵢ!
• Intuitive explanation:
  • Without slack, αᵢ → ∞ when constraints are violated (points misclassified)
  • The upper bound of C limits the αᵢ, so misclassifications are allowed
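
A small sanity check of the box constraint 0 ≤ αᵢ ≤ C (my sketch, assuming scikit-learn): SVC's dual_coef_ stores yᵢαᵢ, so its absolute values should never exceed C, and margin-violating points sit at the bound αᵢ = C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes so that some slack is needed (arbitrary toy data)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=1)

C = 0.5
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_[0])            # |y_i alpha_i| = alpha_i
print("max alpha_i:", alphas.max(), "<= C =", C)
print("support vectors at the bound alpha_i = C:", int(np.sum(np.isclose(alphas, C))))
```
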
Common kernels

• Polynomials of degree exactly d:  $K(u, v) = (u \cdot v)^d$

• Polynomials of degree up to d:  $K(u, v) = (u \cdot v + 1)^d$

• Gaussian kernels:  $K(u, v) = \exp\!\big(-\|u - v\|^2 / 2\sigma^2\big)$

• Sigmoid:  $K(u, v) = \tanh(\eta\, u \cdot v + \nu)$

• And many others: very active area of research!
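
The kernels above written out as plain NumPy functions (a sketch; the Gaussian width σ and the sigmoid constants η, ν are illustrative choices, not values from the lecture):

```python
import numpy as np

def poly_exact(u, v, d=3):
    """Polynomial of degree exactly d: K(u, v) = (u . v)^d"""
    return (u @ v) ** d

def poly_up_to(u, v, d=3):
    """Polynomial of degree up to d: K(u, v) = (u . v + 1)^d"""
    return (u @ v + 1) ** d

def gaussian(u, v, sigma=1.0):
    """Gaussian (RBF): K(u, v) = exp(-||u - v||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=-1.0):
    """Sigmoid: K(u, v) = tanh(eta * u . v + nu)"""
    return np.tanh(eta * (u @ v) + nu)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_exact(u, v), poly_up_to(u, v), gaussian(u, v), sigmoid(u, v))
```
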



Polynomial kernel

d = 1:

$$\Phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix},\qquad \Phi(u)\cdot\Phi(v) = u_1 v_1 + u_2 v_2 = u \cdot v$$

d = 2:

$$\Phi(u) = \begin{bmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{bmatrix},\qquad \Phi(u)\cdot\Phi(v) = u_1^2 v_1^2 + 2\,u_1 v_1 u_2 v_2 + u_2^2 v_2^2 = (u_1 v_1 + u_2 v_2)^2 = (u \cdot v)^2$$

For any d (we will skip the proof):

$$\Phi(u)\cdot\Phi(v) = (u \cdot v)^d$$

Polynomials of degree exactly d
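
A numeric spot-check (my own, for the 2-D inputs used on the slide) that the explicit degree-2 feature map and the kernel agree, i.e. Φ(u)·Φ(v) = (u·v)²:

```python
import numpy as np

def phi_d2(x):
    """Explicit degree-2 feature map for 2-D input: [x1^2, x1*x2, x2*x1, x2^2]."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

u, v = np.array([3.0, -1.0]), np.array([2.0, 4.0])
lhs = phi_d2(u) @ phi_d2(v)      # dot product in feature space
rhs = (u @ v) ** 2               # kernel evaluated in input space
print(lhs, rhs)                  # both equal (u.v)^2
```
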


Gaussian kernel

$$K(u, v) = \exp\!\big(-\|u - v\|^2 / 2\sigma^2\big)$$

[Figure: level sets of the learned function, i.e. w · Φ(x) = r for some r; support vectors highlighted]

[Cynthia Rudin] [mblondel.org]


Kernel algebra

Q: How would you prove that the “Gaussian kernel” is a valid kernel?
A: Expand the Euclidean norm as follows:

$$\exp\!\Big(\!-\frac{\|u - v\|^2}{2\sigma^2}\Big) = \exp\!\Big(\!-\frac{\|u\|^2}{2\sigma^2}\Big)\,\exp\!\Big(\frac{u \cdot v}{\sigma^2}\Big)\,\exp\!\Big(\!-\frac{\|v\|^2}{2\sigma^2}\Big)$$

To see that this is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c); then apply (e) from above. The feature mapping is infinite dimensional!

[Justin Domke]
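
A small numeric sketch (mine) of that argument: writing the Gaussian kernel as the product of the two norm factors and a truncated Taylor series of exp(u·v/σ²) reproduces the kernel value, which is how the infinite-dimensional feature map arises.

```python
import numpy as np
from math import factorial

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def gaussian_via_taylor(u, v, sigma=1.0, terms=20):
    """exp(-||u-v||^2 / 2s^2) = exp(-||u||^2/2s^2) * exp(u.v/s^2) * exp(-||v||^2/2s^2),
    with the middle factor expanded as a (truncated) Taylor series."""
    s2 = sigma ** 2
    middle = sum((u @ v / s2) ** k / factorial(k) for k in range(terms))
    return np.exp(-u @ u / (2 * s2)) * middle * np.exp(-v @ v / (2 * s2))

u, v = np.array([0.3, -1.2]), np.array([1.0, 0.5])
print(gaussian_kernel(u, v), gaussian_via_taylor(u, v))   # nearly identical
```
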
Overfitting?

• Huge feature space with kernels: should we worry about


overfitting?
– SVM objective seeks a solution with large margin
• Theory says that large margin leads to good generalization
(we will see this in a couple of lectures)
– But everything overfits sometimes!!!
– Can control by:
• Setting C
• Choosing a better Kernel
• Varying parameters of the Kernel (width of Gaussian, etc.)
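
As a practical illustration of the last two bullets (my sketch, assuming scikit-learn), C and the Gaussian width are usually chosen by cross-validation rather than by training error:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],        # slack penalty
    "gamma": [0.01, 0.1, 1, 10],   # inverse width of the Gaussian kernel
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```
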
How to deal with imbalanced data?

• In many practical applications we may have imbalanced data sets
• We may want errors to be equally distributed between the positive and negative classes
• A slight modification to the SVM objective does the trick!

Class-specific weighting of the slack variables:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C_{+}\!\!\sum_{i:\,y_i = +1}\!\!\xi_i \;+\; C_{-}\!\!\sum_{i:\,y_i = -1}\!\!\xi_i$$
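
One way to get class-specific slack penalties in practice (a sketch; scikit-learn exposes them through class_weight, which rescales C per class rather than naming C₊ and C₋ explicitly):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Roughly 95% negatives / 5% positives (synthetic, for illustration)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# class_weight rescales C per class: errors on the rare positive class cost ~19x more,
# which roughly balances the total influence of the two classes.
clf = SVC(kernel="linear", C=1.0, class_weight={0: 1, 1: 19}).fit(X, y)
print((clf.predict(X) == 1).sum(), "predicted positives;", (y == 1).sum(), "true positives")

# class_weight="balanced" computes this kind of reweighting automatically.
```
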


How do we do multi-class classification?

One versus all classification

[Figure: three classes (+, o, −) with weight vectors w₊, w₋, w₀]

Learn 3 classifiers:
• − vs {o, +}, weights w₋
• + vs {o, −}, weights w₊
• o vs {+, −}, weights w₀

Predict label using:

$$\hat{y} = \arg\max_{y}\ w_y \cdot x + b_y$$

Any problems?

Could we learn this (1-D) dataset?

[Figure: 1-D dataset with points at −1, 0, and +1]
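
A minimal one-versus-all sketch (mine, using scikit-learn's LinearSVC): train one binary classifier per class and predict with the argmax of the decision scores, as on the slide.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)   # classes 0, 1, 2

# One binary classifier per class: class k vs. the rest
clfs = [LinearSVC(C=1.0).fit(X, (y == k).astype(int)) for k in range(3)]

# Predict with argmax_k (w_k . x + b_k)
scores = np.column_stack([clf.decision_function(X) for clf in clfs])
y_hat = np.argmax(scores, axis=1)
print("training accuracy:", np.mean(y_hat == y))
```
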
Multi-class SVM

Simultaneously learn 3 sets of weights (w₊, w₋, w₀):
• How do we guarantee the correct labels?
• Need new constraints!

The “score” of the correct class must be better than the “score” of wrong classes:

$$w_{y_j} \cdot x_j + b_{y_j} \ \ge\ w_{y'} \cdot x_j + b_{y'} + 1 \qquad \text{for all } y' \ne y_j$$
Multi-class SVM

As for the binary SVM, we introduce slack variables and maximize the margin:

$$\min_{w,b,\xi}\ \frac{1}{2}\sum_{y}\|w_y\|^2 + C\sum_j \xi_j \qquad \text{s.t.}\quad w_{y_j} \cdot x_j + b_{y_j} \ \ge\ w_{y'} \cdot x_j + b_{y'} + 1 - \xi_j \ \ \forall\, y' \ne y_j,\quad \xi_j \ge 0$$

To predict, we use:

$$\hat{y} = \arg\max_{y}\ w_y \cdot x + b_y$$

Now can we learn it?

[Figure: 1-D dataset with points at −1, 0, and +1; b₊ = .5]
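
A hedged pointer to an off-the-shelf version of this joint formulation: scikit-learn's LinearSVC has a Crammer–Singer multi-class mode that learns all the w_y simultaneously under "correct score beats wrong scores" constraints (a closely related formulation, not necessarily identical to the slide's):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Jointly learns one weight vector per class under multi-class margin constraints
clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.coef_.shape)    # (3, n_features): one w_y per class
print(clf.predict(X[:5]))
```
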
