
CIS 520: Machine Learning Spring 2021: Lecture 7

Kernel Methods

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
• Non-linear models via basis functions

• Closer look at the SVM dual: kernel functions, kernel SVM

• RKHSs and Representer Theorem

• Kernel logistic regression

• Kernel ridge regression

1 Non-linear Models via Basis Functions

Let $\mathcal{X} = \mathbb{R}^d$. We have seen methods for learning linear models of the form $h(x) = \operatorname{sign}(w^\top x + b)$ for binary classification (such as logistic regression and SVMs) and $f(x) = w^\top x + b$ for regression (such as linear least squares regression and SVR). What if we want to learn a non-linear model? What would be a simple way to achieve this using the methods we have seen so far?

One way to achieve this is to map instances $x \in \mathbb{R}^d$ to some new feature vectors $\phi(x) \in \mathbb{R}^n$ via some non-linear feature mapping $\phi : \mathbb{R}^d \to \mathbb{R}^n$, and then to learn a linear model in this transformed space.
For example, if one maps instances $x \in \mathbb{R}^d$ to $n = \big(1 + 2d + \binom{d}{2}\big)$-dimensional feature vectors
$$\phi(x) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ x_1^2 \\ \vdots \\ x_d^2 \end{pmatrix},$$

then learning a linear model in the transformed space is equivalent to learning a quadratic model in the original instance space. In general, one can choose any basis functions $\phi_1, \ldots, \phi_n : \mathcal{X} \to \mathbb{R}$, and learn a linear model over these: $w^\top \phi(x) + b$, where $w \in \mathbb{R}^n$ (in fact, one can do this for $\mathcal{X} \neq \mathbb{R}^d$ as well). For example, in least squares regression applied to a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^d \times \mathbb{R})^m$, one would simply replace the matrix $X \in \mathbb{R}^{m \times d}$ with the design matrix $\Phi \in \mathbb{R}^{m \times n}$, where $\Phi_{ij} = \phi_j(x_i)$. What is a potential difficulty in doing this?
If n is large (e.g. as would be the case if the feature mapping φ corresponded to a high-degree polynomial),
then the above approach can be computationally expensive. In this lecture we look at a technique that
allows one to implement the above idea efficiently for many algorithms. We start by taking a closer look at
the SVM dual which we derived in the last lecture.
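
To make the explicit-basis-function approach concrete, here is a minimal NumPy sketch (an illustration added here, not part of the lecture): it builds the quadratic feature map above as a design matrix and fits an ordinary least squares model on it. The helper names `quadratic_features` and `fit_least_squares` are hypothetical.

```python
import numpy as np

def quadratic_features(X):
    """Map each row x in R^d to (1, x_1,...,x_d, x_i*x_j for i<j, x_1^2,...,x_d^2)."""
    m, d = X.shape
    cross = [X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d)]
    squares = [X[:, i] ** 2 for i in range(d)]
    cols = [np.ones(m)] + [X[:, i] for i in range(d)] + cross + squares
    return np.column_stack(cols)  # design matrix Phi with Phi_ij = phi_j(x_i)

def fit_least_squares(Phi, y):
    """Ordinary least squares on the transformed design matrix."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# A quadratic model in the original space is a linear model in the feature space.
X = np.random.randn(100, 3)
y = X[:, 0] ** 2 - 2 * X[:, 1] * X[:, 2] + 0.1 * np.random.randn(100)
w = fit_least_squares(quadratic_features(X), y)
y_pred = quadratic_features(X) @ w
```

The number of columns here grows quadratically in $d$, and as $O(d^q)$ for a degree-$q$ expansion, which is exactly the computational issue the kernel trick below avoids.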

2 Closer Look at the SVM Dual: Kernel Functions, Kernel SVM

Recall the form of the dual we derived for the (soft-margin) linear SVM:
$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) + \sum_{i=1}^m \alpha_i \qquad (1)$$
subject to
$$\sum_{i=1}^m \alpha_i y_i = 0 \qquad (2)$$
$$0 \leq \alpha_i \leq C, \quad i = 1, \ldots, m. \qquad (3)$$

If we implement this on feature vectors φ(xi ) ∈ Rn in place of xi ∈ Rd , we get the following optimization
problem:
$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \big(\phi(x_i)^\top \phi(x_j)\big) + \sum_{i=1}^m \alpha_i \qquad (4)$$
subject to
$$\sum_{i=1}^m \alpha_i y_i = 0 \qquad (5)$$
$$0 \leq \alpha_i \leq C, \quad i = 1, \ldots, m. \qquad (6)$$

This involves computing dot products between vectors φ(xi ), φ(xj ) in Rn . Similarly, using the learned model
to make predictions on a new test point x ∈ Rd also involves computing dot products between vectors in Rn :
$$h(x) = \operatorname{sign}\Big( \sum_{i \in \mathrm{SV}} \hat{\alpha}_i y_i\, \phi(x_i)^\top \phi(x) + \hat{b} \Big).$$

For example, as we saw above, one can learn a quadratic classifier in X = R2 by learning a linear classifier
in $\phi(\mathbb{R}^2) \subset \mathbb{R}^6$, where
$$\phi\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ x_1 x_2 \\ x_1^2 \\ x_2^2 \end{pmatrix};$$
clearly, a straightforward approach to learning an SVM classifier in this space (and applying it to a new test point) will involve computing dot products in $\mathbb{R}^6$ (more generally, when learning a degree-$q$ polynomial in $\mathbb{R}^d$, such a straightforward approach will involve computing dot products in $\mathbb{R}^n$ for $n = O(d^q)$).

Now, consider replacing dot products $\phi(x)^\top \phi(x')$ in the above example with $K(x, x')$, where $\forall x, x' \in \mathbb{R}^2$,
$$K(x, x') = (x^\top x' + 1)^2.$$

It can be verified (exercise!) that $K(x, x') = \phi_K(x)^\top \phi_K(x')$, where
$$\phi_K\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ \sqrt{2}\, x_1 \\ \sqrt{2}\, x_2 \\ \sqrt{2}\, x_1 x_2 \\ x_1^2 \\ x_2^2 \end{pmatrix}.$$

Thus, using $K(x, x')$ above instead of $\phi(x)^\top \phi(x')$ implicitly computes dot products in $\mathbb{R}^6$, with computation of dot products required only in $\mathbb{R}^2$!
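
This identity is easy to check numerically; the snippet below (a sanity check added for illustration, not part of the notes) compares the kernel value with the explicit dot product in $\mathbb{R}^6$ for random points in $\mathbb{R}^2$.

```python
import numpy as np

def phi_K(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
lhs = (x @ xp + 1) ** 2        # kernel evaluated directly in R^2
rhs = phi_K(x) @ phi_K(xp)     # dot product in R^6
assert np.isclose(lhs, rhs)
```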
In fact, one can use any symmetric, positive semi-definite kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (also called a Mercer kernel function) in the SVM algorithm directly, even if the feature space implemented by the kernel function cannot be described explicitly. Any such kernel function yields a convex dual problem; if $K$ is positive definite, then $K$ also corresponds to inner products in some inner product space $V$ (i.e. $K(x, x') = \langle \phi(x), \phi(x') \rangle$ for some $\phi : \mathcal{X} \to V$).
For Euclidean instance spaces $\mathcal{X} = \mathbb{R}^d$, examples of commonly used kernel functions include the polynomial kernel $K(x, x') = (x^\top x' + 1)^q$, which results in learning a degree-$q$ polynomial threshold classifier, and the Gaussian kernel, also known as the radial basis function (RBF) kernel, $K(x, x') = \exp\!\big( \frac{-\|x - x'\|_2^2}{2\sigma^2} \big)$ (where $\sigma > 0$ is a parameter of the kernel), which effectively implements dot products in an infinite-dimensional inner product space; in both cases, evaluating the kernel $K(x, x')$ at any two points $x, x'$ requires only $O(d)$ computation time. Kernel functions can also be used for non-vectorial data ($\mathcal{X} \neq \mathbb{R}^d$); for example, kernel functions are often used to implicitly embed instance spaces containing strings, trees etc. into an inner product space, and to implicitly learn a linear classifier in this space. Intuitively, it is helpful to think of kernel functions as capturing some sort of 'similarity' between pairs of instances in $\mathcal{X}$.
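
As a small illustration (not from the notes), the following NumPy sketch computes Gram matrices for the polynomial and Gaussian kernels above; note that each kernel evaluation touches only the $d$ original coordinates, never the implicit feature space.

```python
import numpy as np

def polynomial_kernel(X, Z, q=2):
    """K[i, j] = (x_i . z_j + 1)^q for rows x_i of X and z_j of Z."""
    return (X @ Z.T + 1.0) ** q

def gaussian_kernel(X, Z, sigma=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.randn(5, 3)
K_poly = polynomial_kernel(X, X, q=3)      # 5 x 5 Gram matrix
K_rbf = gaussian_kernel(X, X, sigma=0.5)   # 5 x 5 Gram matrix
```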
To summarize, given a training sample S = ((x1 , y1 ), . . . , (xm , ym )) ∈ (X ×{±1})m , in order to learn a kernel
SVM classifier using a kernel function K : X × X →R, one simply solves the kernel SVM dual given by
$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^m \alpha_i \qquad (7)$$
subject to
$$\sum_{i=1}^m \alpha_i y_i = 0 \qquad (8)$$
$$0 \leq \alpha_i \leq C, \quad i = 1, \ldots, m, \qquad (9)$$
and then predicts the label of a new instance $x \in \mathcal{X}$ according to
$$h(x) = \operatorname{sign}\Big( \sum_{i \in \mathrm{SV}} \hat{\alpha}_i y_i K(x_i, x) + \hat{b} \Big),$$
where
$$\hat{b} = \frac{1}{|\mathrm{SV}_1|} \sum_{i \in \mathrm{SV}_1} \Big( y_i - \sum_{j \in \mathrm{SV}} \hat{\alpha}_j y_j K(x_i, x_j) \Big).$$
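
In practice this dual is handed to an off-the-shelf solver. The sketch below is one way to do that, assuming scikit-learn is available (the library choice, kernel parameters, and toy data are illustrative assumptions, not part of the notes): `SVC` with `kernel="precomputed"` accepts exactly the Gram matrix $K(x_i, x_j)$ used in (7), and `kernel="rbf"` is the built-in shortcut for the Gaussian kernel.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_gram(X, Z, sigma=1.0):
    """Gram matrix of the Gaussian kernel between rows of X and rows of Z."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * sigma**2))

X_train = np.random.randn(200, 2)
y_train = np.sign(X_train[:, 0]**2 + X_train[:, 1]**2 - 1.0)  # non-linear labels
X_test = np.random.randn(50, 2)

# Option 1: hand the solver the Gram matrix K(x_i, x_j) directly.
svm = SVC(C=1.0, kernel="precomputed")
svm.fit(rbf_gram(X_train, X_train), y_train)
y_pred = svm.predict(rbf_gram(X_test, X_train))   # needs K(test, train) at prediction time

# Option 2: the equivalent built-in RBF kernel (gamma = 1 / (2 sigma^2)).
svm2 = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X_train, y_train)
```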

3 RKHSs and Representer Theorem

Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric positive definite kernel function. Let
$$\mathcal{F}_K^0 = \Big\{ f : \mathcal{X} \to \mathbb{R} \;\Big|\; f(x) = \sum_{i=1}^r \alpha_i K(x_i, x) \text{ for some } r \in \mathbb{Z}_+,\ \alpha_i \in \mathbb{R},\ x_i \in \mathcal{X} \Big\}.$$
For $f, g \in \mathcal{F}_K^0$ with $f(x) = \sum_{i=1}^r \alpha_i K(x_i, x)$ and $g(x) = \sum_{j=1}^s \beta_j K(x'_j, x)$, define
$$\langle f, g \rangle_K = \sum_{i=1}^r \sum_{j=1}^s \alpha_i \beta_j K(x_i, x'_j) \qquad (10)$$
$$\|f\|_K = \sqrt{\langle f, f \rangle_K}. \qquad (11)$$
Let $\mathcal{F}_K$ be the completion of $\mathcal{F}_K^0$ under the metric induced by the above norm.¹ Then $\mathcal{F}_K$ is called the reproducing kernel Hilbert space (RKHS) associated with $K$.²
Note that the SVM classifier learned using kernel $K$ is of the form
$$h(x) = \operatorname{sign}(f(x) + b),$$
where $f(x) = \sum_{i \in \mathrm{SV}} \hat{\alpha}_i y_i K(x_i, x)$, i.e. where $f \in \mathcal{F}_K$.
In fact, consider the following optimization problem:
$$\min_{f \in \mathcal{F}_K,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \big( 1 - y_i (f(x_i) + b) \big)_+ + \lambda \|f\|_K^2.$$
It turns out that the above SVM solution (with $C = \frac{1}{2\lambda m}$) is a solution to this problem, i.e. the kernel SVM solution minimizes the RKHS-norm regularized hinge loss over all functions of the form $f(x) + b$ for $f \in \mathcal{F}_K,\ b \in \mathbb{R}$.
More generally, we have the following result:

Theorem 1 (Representer Theorem). Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel function. Let $\mathcal{Y} \subseteq \mathbb{R}$. Let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$. Let $L : \mathbb{R}^m \times \mathcal{Y}^m \to \mathbb{R}$. Let $\Omega : \mathbb{R}_+ \to \mathbb{R}_+$ be a monotonically increasing function. Then for $\lambda > 0$, there is a solution to the optimization problem
$$\min_{f \in \mathcal{F}_K,\, b \in \mathbb{R}} \; L\big( (f(x_1) + b, \ldots, f(x_m) + b), (y_1, \ldots, y_m) \big) + \lambda\, \Omega(\|f\|_K^2)$$
of the form
$$\hat{f}(x) = \sum_{i=1}^m \hat{\alpha}_i K(x_i, x) \quad \text{for some } \hat{\alpha}_1, \ldots, \hat{\alpha}_m \in \mathbb{R}.$$
If $\Omega$ is strictly increasing, then all solutions have this form.

The above result tells us that even if FK is an infinite-dimensional space, any optimization problem resulting
from minimizing a loss over a finite training sample regularized by some increasing function of the RKHS-
norm is effectively a finite-dimensional optimization problem, and moreover, the solution to this problem
can be written as a kernel expansion over the training points. In particular, minimizing any other loss over
FK (regularized by the RKHS-norm) will also yield a solution of this form!
Exercise. Show that linear functions $f : \mathbb{R}^d \to \mathbb{R}$ of the form $f(x) = w^\top x$ form an RKHS with linear kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ given by $K(x, x') = x^\top x'$ and with $\|f\|_K^2 = \|w\|_2^2$.
¹The metric induced by the norm $\|\cdot\|_K$ is given by $d_K(f, g) = \|f - g\|_K$. The completion of $\mathcal{F}_K^0$ is simply $\mathcal{F}_K^0$ plus any limit points of Cauchy sequences in $\mathcal{F}_K^0$ under this metric.
²The name reproducing kernel Hilbert space comes from the following 'reproducing' property: For any $x \in \mathcal{X}$, define $K_x : \mathcal{X} \to \mathbb{R}$ as $K_x(x') = K(x, x')$; then for any $f \in \mathcal{F}_K$, we have $\langle f, K_x \rangle = f(x)$.



4 Kernel Logistic Regression

Given a training sample S ∈ (X × {±1})m and kernel function K : X × X →R, the kernel logistic regression
classifier is given by the solution to the following optimization problem:
$$\min_{f \in \mathcal{F}_K,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\big( 1 + e^{-y_i (f(x_i) + b)} \big) + \lambda \|f\|_K^2.$$
Since we know from the Representer Theorem that the solution has the form $\hat{f}(x) = \sum_{i=1}^m \hat{\alpha}_i K(x_i, x)$, we can write the above as an optimization problem over $\alpha, b$:
$$\min_{\alpha \in \mathbb{R}^m,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\Big( 1 + e^{-y_i \left( \sum_{j=1}^m \alpha_j K(x_j, x_i) + b \right)} \Big) + \lambda \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(x_i, x_j).$$

This is of a similar form to standard logistic regression, with $m$ basis functions $\phi_j(x) = K(x_j, x)$ for $j \in [m]$ (and $w \equiv \alpha$)! In particular, define $\mathbf{K} \in \mathbb{R}^{m \times m}$ as $K_{ij} = K(x_i, x_j)$ (this is often called the Gram matrix), and let $k_i$ denote the $i$-th column of this matrix. Then we can write the above as simply
$$\min_{\alpha \in \mathbb{R}^m,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\big( 1 + e^{-y_i (\alpha^\top k_i + b)} \big) + \lambda\, \alpha^\top \mathbf{K} \alpha,$$

which is similar to the form for standard linear logistic regression (with feature vectors $k_i$), except for the regularizer being $\alpha^\top \mathbf{K} \alpha$ rather than $\|\alpha\|_2^2$, and can be solved as before using standard numerical optimization methods.
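
As an illustration (not from the notes), here is a minimal NumPy sketch that minimizes this objective by plain gradient descent over $(\alpha, b)$ using the Gram matrix; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kernel_logistic_regression(K, y, lam=0.1, lr=0.1, n_iters=2000):
    """Gradient descent on the kernelized logistic objective over (alpha, b).

    K : (m, m) Gram matrix with K[i, j] = K(x_i, x_j)
    y : (m,) labels in {-1, +1}
    """
    m = K.shape[0]
    alpha, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        z = K @ alpha + b                      # z_i = f(x_i) + b
        g = -y * sigmoid(-y * z)               # per-example derivatives of the log loss
        grad_alpha = (K @ g) / m + 2.0 * lam * (K @ alpha)
        grad_b = g.mean()
        alpha -= lr * grad_alpha
        b -= lr * grad_b
    return alpha, b

# Prediction on a new point x: sign(sum_i alpha_i K(x_i, x) + b),
# i.e. sign(k_x @ alpha + b) where k_x[i] = K(x_i, x).
```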
We note that unlike SVMs, here in general the solution has $\hat{\alpha}_i \neq 0\ \forall i \in [m]$. A variant of logistic regression called the import vector machine (IVM) adopts a greedy approach to find a subset $\mathrm{IV} \subseteq [m]$ such that the function
$$\hat{f}'(x) + \hat{b} = \sum_{i \in \mathrm{IV}} \hat{\alpha}_i K(x_i, x) + \hat{b}$$

gives good performance. Compared to SVMs, IVMs can provide more natural class probability estimates,
as well as more natural extensions to multiclass classification.

5 Kernel Ridge Regression

Given a training sample S ∈ (X × R)m and kernel function K : X × X →R, consider first a kernel ridge
regression formulation for learning a function f ∈ FK :
$$\min_{f \in \mathcal{F}_K} \; \frac{1}{m} \sum_{i=1}^m \big( y_i - f(x_i) \big)^2 + \lambda \|f\|_K^2.$$
Again, since we know from the Representer Theorem that the solution has the form $\hat{f}(x) = \sum_{i=1}^m \hat{\alpha}_i K(x_i, x)$, we can write the above as an optimization problem over $\alpha$:
$$\min_{\alpha \in \mathbb{R}^m} \; \frac{1}{m} \sum_{i=1}^m \Big( y_i - \sum_{j=1}^m \alpha_j K(x_j, x_i) \Big)^2 + \lambda \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(x_i, x_j),$$
or in matrix notation,
$$\min_{\alpha \in \mathbb{R}^m} \; \frac{1}{m} \sum_{i=1}^m \big( y_i - \alpha^\top k_i \big)^2 + \lambda\, \alpha^\top \mathbf{K} \alpha.$$

Again, this is of the same form as standard linear ridge regression, with feature vectors $k_i$ and with regularizer $\alpha^\top \mathbf{K} \alpha$ rather than $\|\alpha\|_2^2$. If $K$ is positive definite, in which case the Gram matrix $\mathbf{K}$ is invertible, then setting the gradient of the objective above w.r.t. $\alpha$ to zero can be seen to yield
$$\hat{\alpha} = \big( \mathbf{K} + \lambda m I_m \big)^{-1} y,$$
where as before $I_m$ is the $m \times m$ identity matrix and $y = (y_1, \ldots, y_m)^\top \in \mathbb{R}^m$.
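
The closed form above translates directly into code. The following NumPy sketch (an illustration, not from the notes) fits kernel ridge regression with a Gaussian kernel and evaluates $\hat{f}$ on new points via the kernel expansion.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_ridge_fit(X_train, y_train, lam=0.1, sigma=1.0):
    """Solve (K + lambda*m*I) alpha = y for the expansion coefficients."""
    m = X_train.shape[0]
    K = rbf_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y_train)

def kernel_ridge_predict(X_test, X_train, alpha, sigma=1.0):
    """f_hat(x) = sum_i alpha_i K(x_i, x), evaluated for each test point."""
    return rbf_kernel(X_test, X_train, sigma) @ alpha

X = np.random.randn(100, 1)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(100)
alpha = kernel_ridge_fit(X, y, lam=0.01)
y_hat = kernel_ridge_predict(X, X, alpha)
```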


Exercise. Show that if $\mathcal{X} = \mathbb{R}^d$ and one wants to explicitly include a bias term $b$ in the linear ridge regression solution which is not included in the regularization, then defining
$$\tilde{X} = \begin{pmatrix} x_1^\top & 1 \\ \vdots & \vdots \\ x_m^\top & 1 \end{pmatrix}, \qquad \tilde{w} = \begin{pmatrix} w \\ b \end{pmatrix}, \qquad L = \begin{pmatrix} I_d & 0 \\ 0 & 0 \end{pmatrix},$$
one gets the solution
$$\hat{\tilde{w}} = \big( \tilde{X}^\top \tilde{X} + \lambda m L \big)^{-1} \tilde{X}^\top y\,.$$
How would you extend this to learning a function of the form $f(x) + b$ for $f \in \mathcal{F}_K,\ b \in \mathbb{R}$ in the kernel ridge regression setting?
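
For reference, a minimal NumPy sketch of the augmented-design-matrix construction displayed in this exercise is given below (the kernelized extension asked for at the end is left open, as in the notes); the function name is hypothetical.

```python
import numpy as np

def ridge_with_unregularized_bias(X, y, lam=0.1):
    """Solve (X~^T X~ + lambda*m*L) w~ = X~^T y, where L zeroes out the bias entry."""
    m, d = X.shape
    X_tilde = np.hstack([X, np.ones((m, 1))])   # append a column of ones
    L = np.eye(d + 1)
    L[d, d] = 0.0                               # do not regularize the bias term
    w_tilde = np.linalg.solve(X_tilde.T @ X_tilde + lam * m * L, X_tilde.T @ y)
    return w_tilde[:d], w_tilde[d]              # (w, b)

X = np.random.randn(50, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + 0.1 * np.random.randn(50)
w, b = ridge_with_unregularized_bias(X, y, lam=0.01)
```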
