07 Kernels
Kernel Methods
Outline
• Non-linear models via basis functions
Let X = R^d. We have seen methods for learning linear models of the form h(x) = sign(w^⊤x + b) for binary classification (such as logistic regression and SVMs), and f(x) = w^⊤x + b for regression (such as linear least squares regression and SVR). What if we want to learn a non-linear model? What would be a simple way to achieve this using the methods we have seen so far?
One way to achieve this is to map instances x ∈ R^d to some new feature vectors φ(x) ∈ R^n via some non-linear feature mapping φ : R^d → R^n, and then to learn a linear model in this transformed space. For example, if one maps instances x ∈ R^d to n = (1 + 2d + \binom{d}{2})-dimensional feature vectors
\[
\phi(x) = \big(1,\; x_1, \ldots, x_d,\; x_1 x_2, \ldots, x_{d-1} x_d,\; x_1^2, \ldots, x_d^2\big)^\top,
\]
then learning a linear model in the transformed space is equivalent to learning a quadratic model in the original instance space. In general, one can choose any basis functions φ_1, …, φ_n : X → R, and learn a linear
model over these: w^⊤φ(x) + b, where w ∈ R^n (in fact, one can do this for X ≠ R^d as well). For example, in least squares regression applied to a training sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (R^d × R)^m, one would simply replace the matrix X ∈ R^{m×d} with the design matrix Φ ∈ R^{m×n}, where Φ_ij = φ_j(x_i). What is a potential difficulty in doing this?
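For concreteness, here is a minimal NumPy sketch of this explicit approach for the quadratic feature map above; the helper name quadratic_features and the toy data are our own illustration, not part of the notes.

```python
import numpy as np
from itertools import combinations

def quadratic_features(X):
    """Map each row x in R^d to (1, x_1, ..., x_d, x_i x_j (i < j), x_1^2, ..., x_d^2)."""
    m, d = X.shape
    cross = [X[:, i] * X[:, j] for i, j in combinations(range(d), 2)]
    cols = [np.ones(m)] + [X[:, i] for i in range(d)] + cross + [X[:, i] ** 2 for i in range(d)]
    return np.column_stack(cols)                # design matrix Phi, shape (m, 1 + 2d + C(d,2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # toy data with d = 3
y = 1 + X[:, 0] - 2 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=100)

Phi = quadratic_features(X)                     # Phi_ij = phi_j(x_i)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # linear least squares in the transformed space
y_hat = Phi @ w                                 # a quadratic model in the original space
```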
If n is large (e.g. as would be the case if the feature mapping φ corresponded to a high-degree polynomial),
then the above approach can be computationally expensive. In this lecture we look at a technique that
allows one to implement the above idea efficiently for many algorithms. We start by taking a closer look at
the SVM dual which we derived in the last lecture.
Recall the form of the dual we derived for the (soft-margin) linear SVM:
\[
\max_{\alpha} \;\; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \,(x_i^\top x_j) + \sum_{i=1}^m \alpha_i \tag{1}
\]
subject to
\[
\sum_{i=1}^m \alpha_i y_i = 0 \tag{2}
\]
\[
0 \le \alpha_i \le C, \quad i = 1, \ldots, m. \tag{3}
\]
If we implement this on feature vectors φ(x_i) ∈ R^n in place of x_i ∈ R^d, we get the following optimization problem:
\[
\max_{\alpha} \;\; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \,\phi(x_i)^\top \phi(x_j) + \sum_{i=1}^m \alpha_i \tag{4}
\]
subject to
\[
\sum_{i=1}^m \alpha_i y_i = 0 \tag{5}
\]
\[
0 \le \alpha_i \le C, \quad i = 1, \ldots, m. \tag{6}
\]
This involves computing dot products between vectors φ(x_i), φ(x_j) in R^n. Similarly, using the learned model to make predictions on a new test point x ∈ R^d also involves computing dot products between vectors in R^n:
\[
h(x) = \mathrm{sign}\Big( \sum_{i \in \mathrm{SV}} \hat{\alpha}_i y_i \,\phi(x_i)^\top \phi(x) + \hat{b} \Big).
\]
For example, as we saw above, one can learn a quadratic classifier in X = R^2 by learning a linear classifier in φ(R^2) ⊂ R^6, where
\[
\phi\!\left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) = \big(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; \sqrt{2}\,x_1 x_2,\; x_1^2,\; x_2^2\big)^\top;
\]
clearly, a straightforward approach to learning an SVM classifier in this space (and applying it to a new test point) will involve computing dot products in R^6 (more generally, when learning a degree-q polynomial in R^d, such a straightforward approach will involve computing dot products in R^n for n = O(d^q)).
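To get a feel for this blow-up: the number of monomials of degree at most q in d variables is \binom{d+q}{q} = O(d^q). A quick sanity check of this count (our own, using Python's math.comb):

```python
from math import comb

def poly_feature_dim(d, q):
    # number of monomials of degree <= q in d variables: C(d+q, q) = O(d^q)
    return comb(d + q, q)

# For q = 2 this matches the dimension 1 + 2d + C(d,2) of the quadratic map above.
assert poly_feature_dim(10, 2) == 1 + 2 * 10 + comb(10, 2)

for d, q in [(10, 2), (100, 2), (100, 3), (1000, 3)]:
    print(d, q, poly_feature_dim(d, q))   # grows rapidly, while one kernel evaluation costs only O(d)
```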
Now, consider replacing the dot products φ(x)^⊤φ(x') in the above example with K(x, x'), where for all x, x' ∈ R^2,
\[
K(x, x') = (x^\top x' + 1)^2 = 1 + 2 x_1 x_1' + 2 x_2 x_2' + 2 x_1 x_2 x_1' x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 = \phi(x)^\top \phi(x').
\]
Thus, using K(x, x') above instead of φ(x)^⊤φ(x') implicitly computes dot products in R^6, with computation of dot products required only in R^2!
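As a quick numerical check (our own illustration, not part of the notes), one can verify the identity above by evaluating both sides at arbitrary points of R^2:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map on R^2, weighted so that phi(x).phi(x') = (x.x' + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2, x2 ** 2])

def K(x, xp):
    # kernel evaluation: a single O(d) dot product in the original space R^2
    return (np.dot(x, xp) + 1.0) ** 2

x, xp = np.array([0.5, -1.2]), np.array([2.0, 0.3])
print(phi(x) @ phi(xp))   # dot product computed explicitly in R^6
print(K(x, xp))           # same value, computed implicitly via the kernel
```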
In fact, one can use any symmetric, positive semi-definite kernel function K : X × X → R (also called a Mercer kernel function) in the SVM algorithm directly, even if the feature space implemented by the kernel function cannot be described explicitly. Any such kernel function yields a convex dual problem; if K is positive definite, then K also corresponds to inner products in some inner product space V (i.e. K(x, x') = ⟨φ(x), φ(x')⟩ for some φ : X → V).
For Euclidean instance spaces X = R^d, examples of commonly used kernel functions include the polynomial kernel K(x, x') = (x^⊤x' + 1)^q, which results in learning a degree-q polynomial threshold classifier, and the Gaussian kernel, also known as the radial basis function (RBF) kernel,
\[
K(x, x') = \exp\!\Big( \frac{-\|x - x'\|_2^2}{2\sigma^2} \Big)
\]
(where σ > 0 is a parameter of the kernel), which effectively implements dot products in an infinite-dimensional inner product space; in both cases, evaluating the kernel K(x, x') at any two points x, x' requires only O(d) computation time. Kernel functions can also be used for non-vectorial data (X ≠ R^d);
for example, kernel functions are often used to implicitly embed instance spaces containing strings, trees, etc. into an inner product space, and to implicitly learn a linear classifier in this space. Intuitively, it is helpful to think of kernel functions as capturing some sort of ‘similarity’ between pairs of instances in X.
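For concreteness, a small NumPy sketch of these two kernels and of building a Gram matrix on a sample; the helper names (poly_kernel, rbf_kernel, gram_matrix) and the parameter defaults are our own choices, not from the notes.

```python
import numpy as np

def poly_kernel(x, xp, q=3):
    # polynomial kernel: (x^T x' + 1)^q
    return (np.dot(x, xp) + 1.0) ** q

def rbf_kernel(x, xp, sigma=1.0):
    # Gaussian / RBF kernel: exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    # K_ij = K(x_i, x_j): m^2 kernel evaluations, each costing O(d)
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

X = np.random.default_rng(0).normal(size=(5, 4))
K_poly = gram_matrix(X, poly_kernel)   # symmetric positive semi-definite
K_rbf = gram_matrix(X, rbf_kernel)     # symmetric positive semi-definite
```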
To summarize, given a training sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × {±1})^m, in order to learn a kernel SVM classifier using a kernel function K : X × X → R, one simply solves the kernel SVM dual given by
\[
\max_{\alpha} \;\; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \,K(x_i, x_j) + \sum_{i=1}^m \alpha_i \tag{7}
\]
subject to
\[
\sum_{i=1}^m \alpha_i y_i = 0 \tag{8}
\]
\[
0 \le \alpha_i \le C, \quad i = 1, \ldots, m, \tag{9}
\]
and then makes predictions on a new test point x ∈ X using
\[
h(x) = \mathrm{sign}\Big( \sum_{i \in \mathrm{SV}} \hat{\alpha}_i y_i \,K(x_i, x) + \hat{b} \Big),
\]
where
\[
\hat{b} = \frac{1}{|\mathrm{SV}_1|} \sum_{i \in \mathrm{SV}_1} \Big( y_i - \sum_{j \in \mathrm{SV}} \hat{\alpha}_j y_j K(x_i, x_j) \Big).
\]
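As an illustration only, the dual (7)-(9) is a quadratic program that can be handed to a generic constrained optimizer. The sketch below uses scipy.optimize.minimize with the SLSQP method, assuming a precomputed Gram matrix K and labels y in {±1}; the tolerance tol for identifying support vectors is our own choice, and a production SVM solver would instead use a specialized method such as SMO.

```python
import numpy as np
from scipy.optimize import minimize

def train_kernel_svm(K, y, C=1.0, tol=1e-6):
    """Solve the kernel SVM dual (7)-(9) with a generic constrained optimizer.

    K : (m, m) Gram matrix with K_ij = K(x_i, x_j);  y : (m,) labels in {-1, +1}.
    """
    y = np.asarray(y, dtype=float)
    m = len(y)
    Q = np.outer(y, y) * K                         # Q_ij = y_i y_j K(x_i, x_j)

    def neg_dual(a):                               # negative of objective (7), to be minimized
        return 0.5 * a @ Q @ a - a.sum()

    def neg_dual_grad(a):
        return Q @ a - np.ones(m)

    res = minimize(neg_dual, np.zeros(m), jac=neg_dual_grad, method="SLSQP",
                   bounds=[(0.0, C)] * m,                                   # constraint (9)
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])    # constraint (8)
    alpha = res.x
    sv = np.where(alpha > tol)[0]                           # support vectors: alpha_i > 0
    sv1 = np.where((alpha > tol) & (alpha < C - tol))[0]    # SV_1: 0 < alpha_i < C
    # b-hat averaged over SV_1, as in the expression above
    b = np.mean([y[i] - np.sum(alpha[sv] * y[sv] * K[i, sv]) for i in sv1])
    return alpha, b, sv

def predict(K_test_train, alpha, b, y, sv):
    # h(x) = sign( sum_{i in SV} alpha_i y_i K(x_i, x) + b )
    y = np.asarray(y, dtype=float)
    return np.sign(K_test_train[:, sv] @ (alpha[sv] * y[sv]) + b)
```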
A Mercer kernel K : X × X → R gives rise to a reproducing kernel Hilbert space (RKHS) F_K of real-valued functions on X, equipped with a norm ‖·‖_K. The Representer Theorem states that any problem of minimizing a loss over a finite training sample, regularized by a non-decreasing function Ω of ‖f‖_K, over f ∈ F_K has a solution of the form
\[
\hat{f}(x) = \sum_{i=1}^m \hat{\alpha}_i K(x_i, x) \quad \text{for some } \hat{\alpha}_1, \ldots, \hat{\alpha}_m \in \mathbb{R}.
\]
If Ω is strictly increasing, then all solutions have this form.
The above result tells us that even if F_K is an infinite-dimensional space, any optimization problem resulting from minimizing a loss over a finite training sample, regularized by some increasing function of the RKHS norm, is effectively a finite-dimensional optimization problem, and moreover, the solution to this problem can be written as a kernel expansion over the training points. In particular, minimizing any other loss over F_K (regularized by the RKHS norm) will also yield a solution of this form!
Exercise. Show that linear functions f : R^d → R of the form f(x) = w^⊤x form an RKHS with the linear kernel K : R^d × R^d → R given by K(x, x') = x^⊤x', and with ‖f‖_K^2 = ‖w‖_2^2.
1 The metric induced by the norm ‖·‖_K is given by d_K(f, g) = ‖f − g‖_K. The completion of F_K^0 is simply F_K^0 plus any limit points of Cauchy sequences in F_K^0 under this metric.
2 The name reproducing kernel Hilbert space comes from the following ‘reproducing’ property: for any x ∈ X, define K_x : X → R by K_x(x') = K(x, x'); then K_x ∈ F_K, and for every f ∈ F_K, ⟨f, K_x⟩_K = f(x).
Given a training sample S ∈ (X × {±1})^m and kernel function K : X × X → R, the kernel logistic regression classifier is given by the solution to the following optimization problem:
\[
\min_{f \in F_K,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\!\big( 1 + e^{-y_i (f(x_i) + b)} \big) + \lambda \|f\|_K^2 .
\]
Since we know from the Representer Theorem that the solution has the form f̂(x) = Σ_{i=1}^m α̂_i K(x_i, x), we can write the above as an optimization problem over (α, b):
\[
\min_{\alpha \in \mathbb{R}^m,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\!\Big( 1 + e^{-y_i \left( \sum_{j=1}^m \alpha_j K(x_j, x_i) + b \right)} \Big) + \lambda \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(x_i, x_j) .
\]
This is of a similar form to standard logistic regression, with m basis functions φ_j(x) = K(x_j, x) for j ∈ [m] (and w ≡ α)! In particular, define K ∈ R^{m×m} by K_ij = K(x_i, x_j) (this is often called the Gram matrix), and let k_i denote the i-th column of this matrix. Then we can write the above as simply
\[
\min_{\alpha \in \mathbb{R}^m,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\!\big( 1 + e^{-y_i (\alpha^\top k_i + b)} \big) + \lambda\, \alpha^\top K \alpha ,
\]
which is similar in form to standard linear logistic regression (with feature vectors k_i), except for the regularizer being α^⊤Kα rather than ‖α‖_2^2, and can be solved as before using standard numerical optimization methods.
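As an illustration, here is a minimal sketch of minimizing this objective by plain (full-batch) gradient descent over (α, b), assuming a precomputed Gram matrix K and labels y in {±1}; the step size and iteration count are arbitrary choices of ours.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_kernel_logreg(K, y, lam=0.1, lr=0.1, n_iters=2000):
    """Gradient descent on the kernelized logistic objective over (alpha, b).

    K : (m, m) Gram matrix with K_ij = K(x_i, x_j);  y : (m,) labels in {-1, +1}.
    """
    y = np.asarray(y, dtype=float)
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        margins = y * (K @ alpha + b)            # y_i (alpha^T k_i + b)
        s = sigmoid(-margins)                    # -d/dz ln(1 + e^{-z}) evaluated at the margins
        grad_alpha = -(K @ (y * s)) / m + 2.0 * lam * (K @ alpha)
        grad_b = -np.sum(y * s) / m
        alpha -= lr * grad_alpha
        b -= lr * grad_b
    return alpha, b

def predict(K_test_train, alpha, b):
    # h(x) = sign( sum_j alpha_j K(x_j, x) + b )
    return np.sign(K_test_train @ alpha + b)
```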
We note that unlike SVMs, here in general the solution has α̂_i ≠ 0 for all i ∈ [m]. A variant of logistic regression called the import vector machine (IVM) adopts a greedy approach to find a subset IV ⊆ [m] such that the function
\[
\hat{f}'(x) + \hat{b} = \sum_{i \in \mathrm{IV}} \hat{\alpha}_i K(x_i, x) + \hat{b}
\]
gives good performance.
gives good performance. Compared to SVMs, IVMs can provide more natural class probability estimates,
as well as more natural extensions to multiclass classification.
Given a training sample S ∈ (X × R)^m and kernel function K : X × X → R, consider first a kernel ridge regression formulation for learning a function f ∈ F_K:
\[
\min_{f \in F_K} \; \frac{1}{m} \sum_{i=1}^m \big( y_i - f(x_i) \big)^2 + \lambda \|f\|_K^2 .
\]
Again, since we know from the Representer Theorem that the solution has the form f̂(x) = Σ_{i=1}^m α̂_i K(x_i, x), we can write the above as an optimization problem over α:
\[
\min_{\alpha \in \mathbb{R}^m} \; \frac{1}{m} \sum_{i=1}^m \Big( y_i - \sum_{j=1}^m \alpha_j K(x_j, x_i) \Big)^2 + \lambda \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(x_i, x_j) ,
\]
or in matrix notation,
\[
\min_{\alpha \in \mathbb{R}^m} \; \frac{1}{m} \sum_{i=1}^m \big( y_i - \alpha^\top k_i \big)^2 + \lambda\, \alpha^\top K \alpha .
\]
Again, this is of the same form as standard linear ridge regression, with feature vectors k_i and with regularizer α^⊤Kα rather than ‖α‖_2^2. If K is positive definite, in which case the Gram matrix K is invertible, then setting the gradient of the objective above w.r.t. α to zero can be seen to yield
\[
\hat{\alpha} = \big( K + \lambda m I_m \big)^{-1} y ,
\]
where y = (y_1, …, y_m)^⊤.
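In code, this closed-form solution amounts to a single linear solve; a minimal sketch assuming a precomputed Gram matrix K (solving the linear system rather than forming the inverse explicitly):

```python
import numpy as np

def train_kernel_ridge(K, y, lam=0.1):
    # alpha-hat = (K + lam * m * I_m)^{-1} y
    m = len(y)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def predict(K_test_train, alpha):
    # f-hat(x) = sum_j alpha_j K(x_j, x)
    return K_test_train @ alpha
```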
How would you extend this to learning a function of the form f(x) + b for f ∈ F_K, b ∈ R in the kernel ridge regression setting?