07 SVMs

Support Vector Machines (SVMs) perform well using linear decision surfaces. SVMs find the optimal hyperplane that separates classes with the maximum margin. Maximizing the margin reduces the capacity of the model, which helps the model generalize better to new examples according to learning theory. Kernels allow SVMs to find separating hyperplanes in transformed feature spaces, enabling them to handle non-linear decision boundaries.

Support Vector Machines & Kernels

Doing really well with linear decision surfaces

Adapted from slides by Tim Oates


Outline

• Prediction
• Why might predictions be wrong?
• Support vector machines
• Doing really well with linear models
• Kernels
• Making the non-linear linear

Why Might Predictions be Wrong?

• True non-determinism
  – Flip a biased coin
  – p(heads) = θ
  – Estimate θ
  – If θ > 0.5 predict 'heads', else 'tails'

Lots of ML research on problems like this:
  – Learn a model
  – Do the best you can in expectation

Why Might Predictions be Wrong?

• Partial observability
  – Something needed to predict y is missing from the observation x
  – N-bit parity problem
    • x contains N−1 bits (hard PO)
    • x contains N bits but the learner ignores some of them (soft PO)

• Noise in the observation x
  – Measurement error
  – Instrument limitations

Why Might Predictions be Wrong?

• True non-determinism
• Partial observability
  – hard, soft
• Representational bias
• Algorithmic bias
• Bounded resources

Representational Bias

• Having the right features (x) is crucial

[Figure: the same data plotted along x and along x²]

Support Vector Machines

Doing Really Well with Linear Decision Surfaces

Strengths of SVMs

• Good generalization
  – in theory
  – in practice
• Works well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick

Minor Notation Change

To better match notation used in SVMs, and to make matrix formulas simpler, we will drop the superscripts for the i-th instance:

• i-th instance: $x^{(i)} \rightarrow \mathbf{x}_i$ (bold denotes a vector)
• i-th instance label: $y^{(i)} \rightarrow y_i$ (non-bold denotes a scalar)
• j-th feature of the i-th instance: $x_j^{(i)} \rightarrow x_{ij}$

Linear Separators

• Training instances
  $\mathbf{x} \in \mathbb{R}^{d+1}$, with $x_0 = 1$
  $y \in \{-1, 1\}$

• Model parameters
  $\theta \in \mathbb{R}^{d+1}$

• Hyperplane
  $\theta^\top \mathbf{x} = \langle \theta, \mathbf{x} \rangle = 0$

• Decision function
  $h(\mathbf{x}) = \mathrm{sign}(\theta^\top \mathbf{x}) = \mathrm{sign}(\langle \theta, \mathbf{x} \rangle)$

Recall the inner (dot) product: $\langle u, v \rangle = u \cdot v = u^\top v = \sum_i u_i v_i$

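The decision function above is easy to state in code. A minimal NumPy sketch (the parameter values and query points below are made up for illustration, not from the slides):

```python
import numpy as np

def h(theta, x):
    """Linear decision function: sign(theta^T x), with the bias feature x_0 = 1 prepended."""
    x = np.concatenate(([1.0], x))
    return np.sign(theta @ x)

# Hypothetical parameters for d = 2: [theta_0, theta_1, theta_2]
theta = np.array([-1.0, 2.0, 0.5])
print(h(theta, np.array([1.0, 0.0])))   # theta^T x =  1.00 -> +1.0
print(h(theta, np.array([0.0, 0.5])))   # theta^T x = -0.75 -> -1.0
```
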
Intuitions

[A sequence of figure-only slides builds the intuition: a "good" separator; noise in the observations; ruling out some separators; lots of noise; only one separator remains; maximizing the margin; "fat" separators, with the margin shown around the hyperplane.]

Why Maximize Margin?

Increasing the margin reduces capacity
• i.e., fewer possible models

Lesson from learning theory:
• If the following holds:
  – H is sufficiently constrained in size
  – and/or the size of the training data set n is large,
  then low training error is likely to be evidence of low generalization error

Alternative View of Logistic Regression

$h_\theta(x) = g(z) = \frac{1}{1 + e^{-\theta^\top x}}, \qquad z = \theta^\top x$

If $y = 1$, we want $h_\theta(x) \approx 1$, i.e., $\theta^\top x \gg 0$
If $y = 0$, we want $h_\theta(x) \approx 0$, i.e., $\theta^\top x \ll 0$

$J(\theta) = -\sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$

$\min_\theta J(\theta)$, where the two bracketed terms will be replaced by $\mathrm{cost}_1(\theta^\top x_i)$ and $\mathrm{cost}_0(\theta^\top x_i)$ below.

Based on slide by Andrew Ng

Alternate View of Logistic Regression

Cost of an example: $-y_i \log h_\theta(x_i) - (1 - y_i) \log\left(1 - h_\theta(x_i)\right)$,
where $h_\theta(x) = g(z) = \frac{1}{1 + e^{-\theta^\top x}}$ and $z = \theta^\top x$

If $y = 1$ (want $\theta^\top x \gg 0$):        If $y = 0$ (want $\theta^\top x \ll 0$):
[Figure: the per-example cost plotted as a function of $\theta^\top x$ for each case]

Based on slide by Andrew Ng

Logistic Regression to SVMs

Logistic Regression:
$\min_\theta \; \sum_{i=1}^{n} -\left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right] + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

Support Vector Machines:
$\min_\theta \; C \sum_{i=1}^{n} \left[ y_i \,\mathrm{cost}_1(\theta^\top x_i) + (1 - y_i)\,\mathrm{cost}_0(\theta^\top x_i) \right] + \frac{1}{2} \sum_{j=1}^{d} \theta_j^2$

You can think of $C$ as similar to $\frac{1}{\lambda}$

Support Vector Machine

$\min_\theta \; C \sum_{i=1}^{n} \left[ y_i \,\mathrm{cost}_1(\theta^\top x_i) + (1 - y_i)\,\mathrm{cost}_0(\theta^\top x_i) \right] + \frac{1}{2} \sum_{j=1}^{d} \theta_j^2$

If $y = 1$ (want $\theta^\top x \geq 1$):        If $y = 0$ (want $\theta^\top x \leq -1$):
[Figure: $\mathrm{cost}_1$ is zero for $\theta^\top x \geq 1$ and grows linearly to its left; $\mathrm{cost}_0$ is zero for $\theta^\top x \leq -1$ and grows linearly to its right]

$\ell_{\mathrm{hinge}}(h(x)) = \max\left(0, \; 1 - y \cdot h(x)\right)$

Based on slide by Andrew Ng

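A minimal sketch of the hinge loss written above, assuming labels $y \in \{-1, +1\}$ and a raw score $h(x) = \theta^\top x$ (the example numbers are arbitrary):

```python
import numpy as np

def hinge_loss(y, score):
    """max(0, 1 - y * h(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.3))   # 0.0 -- correct side, outside the margin
print(hinge_loss(+1, 0.4))   # 0.6 -- correct side, but inside the margin
print(hinge_loss(-1, 0.4))   # 1.4 -- wrong side of the hyperplane
```
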
Support Vector Machine

$\min_\theta \; C \sum_{i=1}^{n} \left[ y_i \,\mathrm{cost}_1(\theta^\top x_i) + (1 - y_i)\,\mathrm{cost}_0(\theta^\top x_i) \right] + \frac{1}{2} \sum_{j=1}^{d} \theta_j^2$

Relabel the classes from $y \in \{1, 0\}$ to $y \in \{+1, -1\}$. With $C$ large enough that the cost terms must be driven to zero, the problem becomes

$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x_i \geq 1 \text{ if } y_i = +1, \;\; \theta^\top x_i \leq -1 \text{ if } y_i = -1$

which is equivalent to

$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad y_i\left(\theta^\top x_i\right) \geq 1$

Maximum Margin Hyperplane

$\text{margin} = \frac{2}{\|\theta\|_2}$

[Figure: the separating hyperplane with its margin boundaries $\theta^\top x = +1$ and $\theta^\top x = -1$]

Support Vectors

[Figure: the training points that lie on the margin boundaries $\theta^\top x = +1$ and $\theta^\top x = -1$ are the support vectors]

Large Margin Classifier in Presence of Outliers

[Figure: data in the $(x_1, x_2)$ plane with one outlier; a very large $C$ swings the boundary to accommodate the outlier, while a $C$ that is not too large keeps the large-margin boundary]

Based on slide by Andrew Ng

Vector Inner Product

[Figure: vectors $u$ and $v$ in the plane, with angle $\theta$ between them and $p$ the signed length of the projection of $v$ onto $u$]

$\|u\|_2 = \mathrm{length}(u) = \sqrt{u_1^2 + u_2^2} \in \mathbb{R}$

$u^\top v = v^\top u = u_1 v_1 + u_2 v_2 = \|u\|_2 \|v\|_2 \cos\theta = p\,\|u\|_2$, where $p = \|v\|_2 \cos\theta$

Based on example by Andrew Ng

Understanding the Hyperplane

$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x_i \geq 1 \text{ if } y_i = +1, \;\; \theta^\top x_i \leq -1 \text{ if } y_i = -1$

Assume $\theta_0 = 0$ so that the hyperplane is centered at the origin, and that $d = 2$.

$\theta^\top x = \|\theta\|_2 \underbrace{\|x\|_2 \cos\theta}_{p} = p\,\|\theta\|_2$, where $p$ is the signed length of the projection of $x$ onto the vector $\theta$.

Based on example by Andrew Ng

Maximizing the Margin

$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x_i \geq 1 \text{ if } y_i = +1, \;\; \theta^\top x_i \leq -1 \text{ if } y_i = -1$

Assume $\theta_0 = 0$ so that the hyperplane is centered at the origin, and that $d = 2$. Let $p_i$ be the projection of $x_i$ onto the vector $\theta$.

[Figure: two candidate hyperplanes for the same data]
• When the projections $p$ are small, $\|\theta\|_2$ must be large to have $p\,\|\theta\|_2 \geq 1$ (or $\leq -1$).
• When the projections $p$ are larger, $\|\theta\|_2$ can be smaller in order to have $p\,\|\theta\|_2 \geq 1$ (or $\leq -1$).

Based on example by Andrew Ng

Size of the Margin

For the support vectors, we have $p\,\|\theta\|_2 = \pm 1$
• $p$ is the length of the projection of the SVs onto $\theta$

Therefore,
$p = \frac{1}{\|\theta\|_2} \qquad \text{and so} \qquad \text{margin} = 2p = \frac{2}{\|\theta\|_2}$

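As a quick numeric check of the margin formula (the value of $\theta$ below is invented for illustration):

$$\theta = \begin{bmatrix} 3 \\ 4 \end{bmatrix}, \qquad \|\theta\|_2 = \sqrt{3^2 + 4^2} = 5, \qquad \text{margin} = \frac{2}{\|\theta\|_2} = \frac{2}{5} = 0.4$$
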
The SVM Dual Problem

The primal SVM problem was given as

$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad y_i\left(\theta^\top x_i\right) \geq 1 \;\; \forall i$

We can solve it more efficiently by taking the Lagrangian dual
• Duality is a common idea in optimization
• It transforms a difficult optimization problem into a simpler one
• Key idea: introduce a dual variable (Lagrange multiplier) $\alpha_i$ for each constraint
  – $\alpha_i$ indicates how important a particular constraint is to the solution

The SVM Dual Problem

• The Lagrangian is given by

$L(\theta, \alpha) = \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 - \sum_{i=1}^{n} \alpha_i \left( y_i\,\theta^\top x_i - 1 \right) \qquad \text{s.t.} \;\; \alpha_i \geq 0 \;\; \forall i$

• We must minimize over θ and maximize over α
• At the optimal solution, the partial derivatives w.r.t. the θ's are 0

Solve by a bunch of algebra and calculus ... and we obtain ...

SVM Dual Representation

Maximize $\displaystyle J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$\text{s.t.} \;\; \alpha_i \geq 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$

The decision function is given by

$h(x) = \mathrm{sign}\left( \sum_{i \in SV} \alpha_i y_i \langle x, x_i \rangle + b \right)$

where $\displaystyle b = \frac{1}{|SV|} \sum_{i \in SV} \left( y_i - \sum_{j \in SV} \alpha_j y_j \langle x_i, x_j \rangle \right)$

Understanding the Dual

Maximize $\displaystyle J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$\text{s.t.} \;\; \alpha_i \geq 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$

The constraint $\sum_i \alpha_i y_i = 0$ balances between the weight of constraints for the different classes; the constraint weights (the $\alpha_i$'s) cannot be negative.

Understanding the Dual

Maximize $\displaystyle J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$\text{s.t.} \;\; \alpha_i \geq 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$

Points with different labels increase the sum; points with the same label decrease the sum. The inner product $\langle x_i, x_j \rangle$ measures the similarity between points.

Intuitively, we should be more careful around points near the margin.

Understanding the Dual

Maximize $\displaystyle J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$\text{s.t.} \;\; \alpha_i \geq 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$

In the solution, either:
• $\alpha_i > 0$ and the constraint is tight ($y_i\left(\theta^\top x_i\right) = 1$)
  ➢ the point is a support vector
• $\alpha_i = 0$
  ➢ the point is not a support vector

Employing the Solution

• Given the optimal solution $\alpha^\star$, the optimal weights are
$\theta^\star = \sum_{i \in SVs} \alpha_i^\star y_i x_i$
  – In this formulation, we have not added $x_0 = 1$

• Therefore, we can solve one of the SV constraints
$y_i\left(\theta^\star \cdot x_i + \theta_0\right) = 1$
  to obtain $\theta_0$
  – Or, more commonly, take the average solution over all support vectors

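A minimal NumPy sketch of this recovery step, assuming a dual solution is already in hand (the α values, support vectors, and labels below are placeholders, not the output of a real solver):

```python
import numpy as np

# Hypothetical dual solution: only the support vectors have alpha > 0
alpha = np.array([0.25, 0.5, 0.25])            # alpha_i^* for the SVs (sum_i alpha_i y_i = 0)
y_sv  = np.array([+1.0, -1.0, +1.0])           # their labels
X_sv  = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 3.0]])                 # their feature vectors

# theta* = sum_i alpha_i^* y_i x_i   (no x_0 = 1 term in this formulation)
theta = (alpha * y_sv) @ X_sv

# theta_0: average over the SV constraints y_i (theta* . x_i + theta_0) = 1
theta0 = np.mean(y_sv - X_sv @ theta)

def predict(x):
    return np.sign(theta @ x + theta0)

print(theta, theta0, predict(np.array([2.5, 2.5])))
```
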
What if Data Are Not Linearly Separable?

$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 \quad \text{s.t.} \quad y_i\left(\theta^\top x_i\right) \geq 1 \;\; \forall i$

• Cannot find a θ that satisfies the constraints
• Introduce slack variables $\xi_i \geq 0$:
$y_i\left(\theta^\top x_i\right) \geq 1 - \xi_i \;\; \forall i$

• New problem:
$\min_\theta \; \frac{1}{2} \sum_{j=1}^{d} \theta_j^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\left(\theta^\top x_i\right) \geq 1 - \xi_i \;\; \forall i$

Strengths of SVMs

• Good generalization in theory
• Good generalization in practice
• Work well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick ...

What if Surface is Non-Linear?

[Figure: a cluster of X's encircled by O's, so no straight line can separate the two classes]

Image from http://www.atrandomresearch.com/iclass/

Kernel Methods

Making the Non-Linear Linear


When Linear Separators Fail

[Figure: the same data plotted along x and along x²]

Mapping into a New Feature Space

$\Phi: X \mapsto \hat{X} = \Phi(x)$

• For example, with $x_i \in \mathbb{R}^2$:
$\Phi([x_{i1}, x_{i2}]) = [x_{i1}, \; x_{i2}, \; x_{i1}x_{i2}, \; x_{i1}^2, \; x_{i2}^2]$

• Rather than run the SVM on $x_i$, run it on $\Phi(x_i)$
  – Find a non-linear separator in input space

• What if $\Phi(x_i)$ is really big?

• Use kernels to compute it implicitly!

Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/

Kernels

• Find a kernel $K$ such that
$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$

• Computing $K(x_i, x_j)$ should be efficient, much more so than computing $\Phi(x_i)$ and $\Phi(x_j)$

• Use $K(x_i, x_j)$ in the SVM algorithm rather than $\langle x_i, x_j \rangle$

• Remarkably, this is possible!

The Polynomial Kernel

Let $x_i = [x_{i1}, x_{i2}]$ and $x_j = [x_{j1}, x_{j2}]$

Consider the following function:

$K(x_i, x_j) = \langle x_i, x_j \rangle^2 = \left( x_{i1}x_{j1} + x_{i2}x_{j2} \right)^2 = x_{i1}^2 x_{j1}^2 + x_{i2}^2 x_{j2}^2 + 2\,x_{i1}x_{i2}x_{j1}x_{j2} = \langle \Phi(x_i), \Phi(x_j) \rangle$

where
$\Phi(x_i) = [x_{i1}^2, \; x_{i2}^2, \; \sqrt{2}\,x_{i1}x_{i2}]$
$\Phi(x_j) = [x_{j1}^2, \; x_{j2}^2, \; \sqrt{2}\,x_{j1}x_{j2}]$

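A quick numeric check of this identity, using arbitrary example vectors (a minimal sketch):

```python
import numpy as np

def K_poly2(xi, xj):
    """Degree-2 polynomial kernel: <xi, xj>^2, computed without the feature map."""
    return np.dot(xi, xj) ** 2

def phi(x):
    """Explicit feature map for the degree-2 kernel on 2-D inputs."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

xi, xj = np.array([1.0, 3.0]), np.array([2.0, -1.0])
print(K_poly2(xi, xj))             # (2 - 3)^2 = 1.0
print(np.dot(phi(xi), phi(xj)))    # same value, via the 3-D feature space
```
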
The Polynomial Kernel

• Given by $K(x_i, x_j) = \langle x_i, x_j \rangle^d$
  – $\Phi(x)$ contains all monomials of degree d

• Useful in visual pattern recognition
  – Example:
    • 16×16 pixel image
    • $10^{10}$ monomials of degree 5
    • Never explicitly compute $\Phi(x)$!

• Variation: $K(x_i, x_j) = \left( \langle x_i, x_j \rangle + 1 \right)^d$
  – Adds all lower-order monomials (degrees 1, ..., d)!

The Kernel Trick

"Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2"

➢ SVMs can use the kernel trick


Incorporating Kernels into SVM

The dual objective

$J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$\text{s.t.} \;\; \alpha_i \geq 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$

becomes

$J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)$
$\text{s.t.} \;\; \alpha_i \geq 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0$

The Gaussian Kernel

• Also called the Radial Basis Function (RBF) kernel

$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)$

  – Has value 1 when $x_i = x_j$
  – Value falls off to 0 with increasing distance
  – Note: need to do feature scaling before using the Gaussian kernel

[Figure: the kernel surface for three values of σ, ranging from lower bias / higher variance (small σ) to higher bias / lower variance (large σ)]

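A minimal sketch of the RBF kernel as written above (σ is the free bandwidth parameter):

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian/RBF kernel: exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq_dist = np.sum((xi - xj) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))                          # 1.0 when xi == xj
print(rbf_kernel(x, np.array([4.0, 6.0])))       # falls toward 0 with distance
print(rbf_kernel(x, np.array([4.0, 6.0]), 5.0))  # a larger sigma decays more slowly
```
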
Gaussian Kernel Example

$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)$

[Figure: three landmarks $\ell_1$, $\ell_2$, $\ell_3$ in the plane]

Imagine we've learned that: $\theta = [-0.5, \; 1, \; 1, \; 0]$

Predict +1 if $\theta_0 + \theta_1 K(x, \ell_1) + \theta_2 K(x, \ell_2) + \theta_3 K(x, \ell_3) \geq 0$

Based on example by Andrew Ng


Gaussian Kernel Example

$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)$

[Figure: the landmarks $\ell_1$, $\ell_2$, $\ell_3$ with the query point $x_1$ near $\ell_1$]

Imagine we've learned that: $\theta = [-0.5, \; 1, \; 1, \; 0]$

Predict +1 if $\theta_0 + \theta_1 K(x, \ell_1) + \theta_2 K(x, \ell_2) + \theta_3 K(x, \ell_3) \geq 0$

• For $x_1$, we have $K(x_1, \ell_1) \approx 1$ and the other similarities ≈ 0, so
$\theta_0 + \theta_1(1) + \theta_2(0) + \theta_3(0) = -0.5 + 1(1) + 1(0) + 0(0) = 0.5 \geq 0$, so predict +1

Based on example by Andrew Ng

Gaussian Kernel Example

$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)$

[Figure: the landmarks $\ell_1$, $\ell_2$, $\ell_3$ with the query point $x_2$ near $\ell_3$]

Imagine we've learned that: $\theta = [-0.5, \; 1, \; 1, \; 0]$

Predict +1 if $\theta_0 + \theta_1 K(x, \ell_1) + \theta_2 K(x, \ell_2) + \theta_3 K(x, \ell_3) \geq 0$

• For $x_2$, we have $K(x_2, \ell_3) \approx 1$ and the other similarities ≈ 0, so
$\theta_0 + \theta_1(0) + \theta_2(0) + \theta_3(1) = -0.5 + 1(0) + 1(0) + 0(1) = -0.5 < 0$, so predict −1

Based on example by Andrew Ng

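A small sketch reproducing this worked example; the landmark and query-point coordinates below are invented so that the stated similarities (≈ 1 or ≈ 0) hold:

```python
import numpy as np

def rbf(x, l, sigma=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

# Hypothetical landmark locations and the learned parameters from the example
l1, l2, l3 = np.array([1.0, 4.0]), np.array([4.0, 4.0]), np.array([3.0, 0.0])
theta = np.array([-0.5, 1.0, 1.0, 0.0])    # [theta_0, theta_1, theta_2, theta_3]

def predict(x):
    feats = np.array([1.0, rbf(x, l1), rbf(x, l2), rbf(x, l3)])
    return +1 if theta @ feats >= 0 else -1

print(predict(np.array([1.1, 3.9])))   # near l1 -> score ~  0.5 -> +1
print(predict(np.array([3.0, 0.2])))   # near l3 -> score ~ -0.5 -> -1
```
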
Gaussian Kernel Example

$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)$

Imagine we've learned that: $\theta = [-0.5, \; 1, \; 1, \; 0]$

Predict +1 if $\theta_0 + \theta_1 K(x, \ell_1) + \theta_2 K(x, \ell_2) + \theta_3 K(x, \ell_3) \geq 0$

Rough sketch of the decision surface:
[Figure: the region around $\ell_1$ and $\ell_2$ is predicted +1; everywhere else is predicted −1]

Based on example by Andrew Ng


Other Kernels

• Sigmoid kernel
$K(x_i, x_j) = \tanh\left( \alpha\, x_i^\top x_j + c \right)$
  – Neural networks use the sigmoid as an activation function
  – An SVM with a sigmoid kernel is equivalent to a 2-layer perceptron

• Cosine similarity kernel
$K(x_i, x_j) = \frac{x_i^\top x_j}{\|x_i\| \, \|x_j\|}$
  – Popular choice for measuring the similarity of text documents
  – The L2 norm projects the vectors onto the unit sphere; their dot product is then the cosine of the angle between the vectors

Other Kernels

• Chi-squared kernel
$K(x_i, x_j) = \exp\left( -\sum_k \frac{(x_{ik} - x_{jk})^2}{x_{ik} + x_{jk}} \right)$
  – Widely used in computer vision applications
  – Chi-squared measures the distance between probability distributions
  – Data is assumed to be non-negative, often with an L1 norm of 1

• String kernels
• Tree kernels
• Graph kernels

An Aside: The Math Behind Kernels

What does it mean to be a kernel?
• $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ for some Φ

What does it take to be a kernel?
• The Gram matrix $G_{ij} = K(x_i, x_j)$ must be:
  – a symmetric matrix
  – a positive semi-definite matrix: $z^\top G z \geq 0$ for every non-zero vector $z \in \mathbb{R}^n$

Establishing "kernel-hood" from first principles is non-trivial!

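As an empirical sanity check (not a proof), one can build the Gram matrix on a sample of points and verify that it is symmetric with non-negative eigenvalues. A minimal sketch using the RBF kernel:

```python
import numpy as np

def gram_matrix(X, kernel):
    """G[i, j] = K(x_i, x_j) for all pairs of rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

X = np.random.default_rng(0).normal(size=(20, 3))
G = gram_matrix(X, rbf)

print(np.allclose(G, G.T))                    # symmetric?
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # eigenvalues >= 0, up to round-off?
```
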
A Few Good Kernels...

• Linear kernel: $K(x_i, x_j) = \langle x_i, x_j \rangle$
• Polynomial kernel: $K(x_i, x_j) = \left( \langle x_i, x_j \rangle + c \right)^d$
  – c ≥ 0 trades off the influence of lower-order terms
• Gaussian kernel: $K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)$
• Sigmoid kernel: $K(x_i, x_j) = \tanh\left( \alpha\, x_i^\top x_j + c \right)$

Many more...
• Cosine similarity kernel
• Chi-squared kernel
• String/tree/graph/wavelet/etc. kernels

Application: Automatic Photo Retouching (Leyvand et al., 2008)

Practical Advice for Applying SVMs

• Use an SVM software package to solve for the parameters
  – e.g., SVMlight, libsvm, cvx (fast!), etc.

• Need to specify:
  – Choice of the parameter C
  – Choice of the kernel function, along with any associated kernel parameters, e.g.,
$K(x_i, x_j) = \left( \langle x_i, x_j \rangle + c \right)^d$
$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)$

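scikit-learn is not mentioned on the slide, but as one concrete option its SVC class exposes exactly these choices (C, the kernel, and the kernel parameters, with gamma playing the role of $1/(2\sigma^2)$ for the RBF kernel). A hedged sketch with cross-validated parameter selection:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Feature scaling + RBF-kernel SVM; search over C and the kernel parameter gamma
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(model, param_grid, cv=5).fit(X_tr, y_tr)

print(grid.best_params_, grid.score(X_te, y_te))
```
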
Multi-Class Classification with SVMs

$y \in \{1, \ldots, K\}$

• Many SVM packages already have multi-class classification built in
• Otherwise, use one-vs-rest
  – Train K SVMs, each picking out one class from the rest, yielding $\theta^{(1)}, \ldots, \theta^{(K)}$
  – Predict the class i with the largest $\left( \theta^{(i)} \right)^\top x$

Based on slide by Andrew Ng

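A minimal sketch of the one-vs-rest prediction rule, assuming the K per-class parameter vectors have already been trained (the numbers below are placeholders):

```python
import numpy as np

# Hypothetical parameters theta^(1), ..., theta^(K) for K = 3 classes, stacked as rows
Theta = np.array([[ 0.5, -1.0,  2.0],
                  [-0.2,  0.8, -0.5],
                  [ 0.1,  0.3,  0.9]])

def predict_class(x):
    """Pick the class i whose linear score (theta^(i))^T x is largest."""
    x = np.concatenate(([1.0], x))        # include the bias feature x_0 = 1
    return int(np.argmax(Theta @ x)) + 1  # classes labeled 1..K

print(predict_class(np.array([1.0, 2.0])))   # class with the largest score
```
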
SVMs vs. Logistic Regression (Advice from Andrew Ng)

n = # training examples, d = # features

If d is large (relative to n) (e.g., d > n, with d = 10,000 and n = 10-1,000):
• Use logistic regression or an SVM with a linear kernel

If d is small (up to 1,000) and n is intermediate (up to 10,000):
• Use an SVM with a Gaussian kernel

If d is small (up to 1,000) and n is large (50,000+):
• Create/add more features, then use logistic regression or an SVM without a kernel

Neural networks are likely to work well for most of these settings, but may be slower to train.

Based on slide by Andrew Ng

Other SVM Variations

• ν-SVM
  – The ν parameter controls:
    • the fraction of support vectors (lower bound) and the misclassification rate (upper bound)
    • e.g., ν = 0.05 guarantees that ≥ 5% of the training points are SVs and that the training error rate is ≤ 5%
  – Harder to optimize than C-SVM and not as scalable
• SVMs for regression
• One-class SVMs
• SVMs for clustering
• ...

Conclusion

• SVMs find the optimal linear separator
• The kernel trick makes SVMs learn non-linear decision surfaces

• Strengths of SVMs:
  – Good theoretical and empirical performance
  – Supports many types of kernels

• Disadvantages of SVMs:
  – "Slow" to train/predict for huge data sets (but relatively fast!)
  – Need to choose the kernel (and tune its parameters)
