0% found this document useful (0 votes)

6 views24 pages

SCH Smo 03 C

The document provides an introduction to statistical learning theory, support vector machines, and kernel feature spaces, including the derivation of the support vector optimization problem for classification and regression. It discusses the importance of similarity measures, particularly kernels, in learning algorithms and presents a basic pattern recognition algorithm using geometric interpretations. The text emphasizes the necessity of restricting the function class to ensure good generalization to unseen data, referencing the VC theory and structural risk minimization principles.

Uploaded by

priya

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views24 pages

SCH Smo 03 C

Uploaded by

priya

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 24

A Short Introduction to Learning with Kernels

Bernhard Schölkopf1 and Alexander J. Smola2

1
Max Planck Institut für Biologische Kybernetik, 72076 Tübingen, Germany
2
RSISE, The Australian National University, Canberra 0200, ACT, Australia

Abstract. We brieﬂy describe the main ideas of statistical learning the-

ory, support vector machines, and kernel feature spaces. This includes
a derivation of the support vector optimization problem for classiﬁca-
tion and regression, the ν-trick, various kernels and an overview over
applications of kernel methods.

1 An Introductory Example

Suppose we are given empirical data

(x1 , y1 ), . . . , (xm , ym ) ∈ X × {±1}. (1)

Here, the domain X is some nonempty set that the patterns xi are taken from;
the yi are called labels or targets.
Unless stated otherwise, indices i and j will always be understood to run
over the training set, i.e., i, j = 1, . . . , m.
Note that we have not made any assumptions on the domain X other than
it being a set. In order to study the problem of learning, we need additional
structure. In learning, we want to be able to generalize to unseen data points.
In the case of pattern recognition, this means that given some new pattern
x ∈ X, we want to predict the corresponding y ∈ {±1}. By this we mean, loosely
speaking, that we choose y such that (x, y) is in some sense similar to the training
examples. To this end, we need similarity measures in X and in {±1}. The latter
is easy, as two target values can only be identical or diﬀerent.1 For the former,
we require a similarity measure

k : X × X → R,
(x, x ) → k(x, x ), (2)

i.e., a function that, given two examples x and x , returns a real number char-
acterizing their similarity. For reasons that will become clear later, the function
k is called a kernel [13, 1, 8].
A type of similarity measure that is of particular mathematical appeal are
dot products. For instance, given two vectors x, x ∈ RN , the canonical dot
product is deﬁned as

The present article is based on [23].
1
If the outputs are not in {±1}, the situation gets more complex, cf. [34].

S. Mendelson, A.J. Smola (Eds.): Advanced Lectures on Machine Learning, LNAI 2600, pp. 41–64, 2003.
c Springer-Verlag Berlin Heidelberg 2003

42 B. Schölkopf and A.J. Smola

N

(x · x ) := (x)i (x )i . (3)
i=1

Here, (x)i denotes the i-th entry of x.

The geometrical interpretation of this dot product is that it computes the
cosine of the angle between the vectors x and x , provided they are normalized
to
length 1. Moreover, it allows computation of the length of a vector x as
(x · x), and of the distance between two vectors as the length of the diﬀerence
vector. Therefore, being able to compute dot products amounts to being able
to carry out all geometrical constructions that can be formulated in terms of
angles, lengths and distances.
Note, however, that we have not made the assumption that the patterns live
in a dot product space. In order to be able to use a dot product as a similarity
measure, we therefore ﬁrst need to embed them into some dot product space H,
which need not be identical to RN . To this end, we use a map

Φ:X→H
x → x. (4)

The space H is called a feature space. To summarize, embedding the data into
H has three beneﬁts.
1. It lets us deﬁne a similarity measure from the dot product in H,

k(x, x ) := (x · x ) = (Φ(x) · Φ(x )). (5)

2. It allows us to deal with the patterns geometrically, and thus lets us study
learning algorithm using linear algebra and analytic geometry.
3. The freedom to choose the mapping Φ will enable us to design a large variety
of learning algorithms. For instance, consider a situation where the inputs
already live in a dot product space. In that case, we could directly deﬁne
a similarity measure as the dot product. However, we might still choose to
ﬁrst apply a nonlinear map Φ to change the representation into one that is
more suitable for a given problem and learning algorithm.
We are now in the position to describe a pattern recognition learning algo-
rithm that is arguably one of the simplest possible. The basic idea is to compute
the means of the two classes in feature space,
1
c+ = xi , (6)
m+
{i:yi =+1}

1
c− = xi , (7)
m−
{i:yi =−1}

where m+ and m− are the number of examples with positive and negative labels,
respectively (see Figure 1). We then assign a new point x to the class whose
A Short Introduction to Learning with Kernels 43

o
+

+ . o
w o c-
+ c
c+ x-c
+
x

Fig. 1. A simple geometric classiﬁcation algorithm: given two classes of points (de-
picted by ‘o’ and ‘+’), compute their means c+ , c− and assign a test pattern x to the
one whose mean is closer. This can be done by looking at the dot product between x−c
(where c = (c+ + c− )/2) and w := c+ − c− , which changes sign as the enclosed angle
passes through π/2. Note that the corresponding decision boundary is a hyperplane
(the dotted line) orthogonal to w (from [23]).

mean is closer to it. This geometrical construction can be formulated in terms

of dot products. Half-way in between c+ and c− lies the point c := (c+ + c− )/2.
We compute the class of x by checking whether the vector connecting c and x
encloses an angle smaller than π/2 with the vector w := c+ − c− connecting the
class means, in other words
y = sgn ((x − c) · w)
y = sgn ((x − (c+ + c− )/2) · (c+ − c− ))
= sgn ((x · c+ ) − (x · c− ) + b). (8)
Here, we have deﬁned the oﬀset
1
b := c− 2 − c+ 2 . (9)
2
It will prove instructive to rewrite this expression in terms of the patterns xi
in the input domain X. To this end, note that we do not have a dot product in
X, all we have is the similarity measure k (cf. (5)). Therefore, we need to rewrite
everything in terms of the kernel k evaluated on input patterns. To this end,
substitute (6) and (7) into (8) to get the decision function
 
1 1
y = sgn  (x · xi ) − (x · xi ) + b
m+ m−
{i:yi =+1} {i:yi =−1}
 
1 1
= sgn  k(x, xi ) − k(x, xi ) + b . (10)
m+ m−
{i:yi =+1} {i:yi =−1}
44 B. Schölkopf and A.J. Smola

Similarly, the oﬀset becomes

 
1 1 1
b :=  2 k(xi , xj ) − 2 k(xi , xj ) . (11)
2 m− m+
{(i,j):yi =yj =−1} {(i,j):yi =yj =+1}

Let us consider one well-known special case of this type of classiﬁer. Assume
that the class means have the same distance to the origin (hence b = 0), and
that k can be viewed as a density, i.e., it is positive and has integral 1,

k(x, x )dx = 1 for all x ∈ X. (12)

In order to state this assumption, we have to require that we can deﬁne an

integral on X.
If the above holds true, then (10) corresponds to the so-called Bayes deci-
sion boundary separating the two classes, subject to the assumption that the
two classes were generated from two probability distributions that are correctly
estimated by the Parzen windows estimators of the two classes,
1
p1 (x) := k(x, xi ) (13)
m+
{i:yi =+1}
1
p2 (x) := k(x, xi ). (14)
m−
{i:yi =−1}

Given some point x, the label is then simply computed by checking which of the
two, p1 (x) or p2 (x), is larger, which directly leads to (10). Note that this decision
is the best we can do if we have no prior information about the probabilities of
the two classes. For further details, see [23].
The classiﬁer (10) is quite close to the types of learning machines that we will
be interested in. It is linear in the feature space, while in the input domain, it is
represented by a kernel expansion in terms of the training points. It is example-
based in the sense that the kernels are centered on the training examples, i.e.,
one of the two arguments of the kernels is always a training example. The main
point where the more sophisticated techniques to be discussed later will deviate
from (10) is in the selection of the examples that the kernels are centered on,
and in the weight that is put on the individual kernels in the decision function.
Namely, it will no longer be the case that all training examples appear in the
kernel expansion, and the weights of the kernels in the expansion will no longer
be uniform. In the feature space representation, this statement corresponds to
saying that we will study all normal vectors w of decision hyperplanes that can
be represented as linear combinations of the training examples. For instance,
we might want to remove the inﬂuence of patterns that are very far away from
the decision boundary, either since we expect that they will not improve the
generalization error of the decision function, or since we would like to reduce the
computational cost of evaluating the decision function (cf. (10)). The hyperplane
will then only depend on a subset of training examples, called support vectors.
A Short Introduction to Learning with Kernels 45

2 Learning Pattern Recognition from Examples

With the above example in mind, let us now consider the problem of pattern
recognition in a more formal setting [28, 29] following the introduction of [19].
In two-class pattern recognition, we seek to estimate a function
f : X → {±1} (15)
based on input-output training data (1). We assume that the data were generated
independently from some unknown (but ﬁxed) probability distribution P (x, y).
Our goal is to learn a function that will correctly classify unseen examples (x, y),
i.e., we want f (x) = y for examples (x, y) that were also generated from P (x, y).
If we put no restriction on the class of functions that we choose our esti-
mate f from, however, even a function which does well on the training data,
e.g. by satisfying f (xi ) = yi for all i = 1, . . . , m, need not generalize well to
unseen examples. To see this, note that for each function f and any test set
(x̄1 , ȳ1 ), . . . , (x̄m̄ , ȳm̄ ) ∈ RN ×{±1}, satisfying {x̄1 , . . . , x̄m̄ }∩{x1 , . . . , xm } = {},
there exists another function f ∗ such that f ∗ (xi ) = f (xi ) for all i = 1, . . . , m,
yet f ∗ (x̄i ) = f (x̄i ) for all i = 1, . . . , m̄. As we are only given the training data,
we have no means of selecting which of the two functions (and hence which of
the completely diﬀerent sets of test label predictions) is preferable. Hence, only
minimizing the training error (or empirical risk ),

1 1
m
Remp [f ] = |f (xi ) − yi |, (16)
m i=1 2

does not imply a small test error (called risk ), averaged over test examples drawn
from the underlying distribution P (x, y),
1
R[f ] = |f (x) − y| dP (x, y). (17)
2
Statistical learning theory [30, 28, 29], or VC (Vapnik-Chervonenkis) theory,
shows that it is imperative to restrict the class of functions that f is chosen
from to one which has a capacity that is suitable for the amount of available
training data. VC theory provides bounds on the test error. The minimization
of these bounds, which depend on both the empirical risk and the capacity of
the function class, leads to the principle of structural risk minimization [28].
The best-known capacity concept of VC theory is the VC dimension, defined
as the largest number h of points that can be separated in all possible ways
using functions of the given class. An example of a VC bound is the following: if
h < m is the VC dimension of the class of functions that the learning machine
can implement, then for all functions of that class, with a probability of at least
1 − η, the bound
h log(η)
R(α) ≤ Remp (α) + φ , (18)
m m
holds, where the confidence term φ is defined as
46 B. Schölkopf and A.J. Smola

2m

h log(η) h log h + 1 − log(η/4)
φ , = . (19)
m m m
Tighter bounds can be formulated in terms of other concepts, such as the an-
nealed VC entropy or the Growth function. These are usually considered to be
harder to evaluate, but they play a fundamental role in the conceptual part of
VC theory [29]. Alternative capacity concepts that can be used to formulate
bounds include the fat shattering dimension [2].
The bound (18) deserves some further explanatory remarks. Suppose we
wanted to learn a “dependency” where P (x, y) = P (x) · P (y), i.e., where the
pattern x contains no information about the label y, with uniform P (y). Given
a training sample of ﬁxed size, we can then surely come up with a learning
machine which achieves zero training error (provided we have no examples con-
tradicting each other). However, in order to reproduce the random labelings, this
machine will necessarily require a large VC dimension h. Thus, the conﬁdence
term (19), increasing monotonically with h, will be large, and the bound (18)
will not support possible hopes that due to the small training error, we should
expect a small test error. This makes it understandable how (18) can hold in-
dependent of assumptions about the underlying distribution P (x, y): it always
holds (provided that h < m), but it does not always make a nontrivial predic-
tion — a bound on an error rate becomes void if it is larger than the maximum
error rate. In order to get nontrivial predictions from (18), the function space
must be restricted such that the capacity (e.g. VC dimension) is small enough
(in relation to the available amount of data).

3 Hyperplane Classiﬁers
In the present section, we shall describe a hyperplane learning algorithm that
can be performed in a dot product space (such as the feature space that we
introduced previously). As described in the previous section, to design learning
algorithms, one needs to come up with a class of functions whose capacity can
be computed.
[31] considered the class of hyperplanes
(w · x) + b = 0 w ∈ RN , b ∈ R, (20)
corresponding to decision functions
f (x) = sgn ((w · x) + b), (21)
and proposed a learning algorithm for separable problems, termed the General-
ized Portrait, for constructing f from empirical data. It is based on two facts.
First, among all hyperplanes separating the data, there exists a unique one yield-
ing the maximum margin of separation between the classes,
max min{x − xi : x ∈ RN , (w · x) + b = 0, i = 1, . . . , m}. (22)
w,b
Second, the capacity decreases with increasing margin.
A Short Introduction to Learning with Kernels 47

{x | <w, x> + b = +1}

{x | <w, x> + b = −1} Note:
◆
<w, x1> + b = +1
❍ ◆ x1 yi = +1 <w, x2> + b = −1
x2❍
=> <w , (x1−x2)> = 2
◆
, w 2
yi = −1 w ◆ => < , >
||w|| (x1−x2) = ||w||

❍
❍
❍
{x | <w, x> + b = 0}

Fig. 2. A binary classiﬁcation toy problem: separate balls from diamonds. The optimal
hyperplane is orthogonal to the shortest line connecting the convex hulls of the two
classes (dotted), and intersects it half-way between the two classes. The problem being
separable, there exists a weight vector w and a threshold b such that yi ·((w·xi )+b) > 0
(i = 1, . . . , m). Rescaling w and b such that the point(s) closest to the hyperplane
satisfy |(w · xi ) + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying
yi · ((w · xi ) + b) ≥ 1. Note that in this case, the margin, measured perpendicularly to
the hyperplane, equals 2/w. This can be seen by considering two points x1 , x2 on
opposite sides of the margin, i.e., (w · x1 ) + b = 1, (w · x2 ) + b = −1, and projecting
them onto the hyperplane normal vector w/w (from [17]).

To construct this Optimal Hyperplane (cf. Figure 2), one solves the following
optimization problem:
1
minimize τ (w) = w2 (23)
2
subject to yi · ((w · xi ) + b) ≥ 1, i = 1, . . . , m. (24)

This constrained optimization problem is dealt with by introducing Lagrange

multipliers αi ≥ 0 and a Lagrangian

1 m
L(w, b, α) = w2 − αi (yi · ((xi · w) + b) − 1) . (25)
2 i=1

The Lagrangian L has to be minimized with respect to the primal variables w

and b and maximized with respect to the dual variables αi (i.e., a saddle point
has to be found). Let us try to get some intuition for this. If a constraint (24)
is violated, then yi · ((w · xi ) + b) − 1 < 0, in which case L can be increased by
increasing the corresponding αi . At the same time, w and b will have to change
such that L decreases. To prevent −αi (yi · ((w · xi ) + b) − 1) from becoming ar-
bitrarily large, the change in w and b will ensure that, provided the problem is
48 B. Schölkopf and A.J. Smola

separable, the constraint will eventually be satisﬁed. Similarly, one can under-
stand that for all constraints which are not precisely met as equalities, i.e., for
which yi · ((w · xi ) + b) − 1 > 0, the corresponding αi must be 0: this is the value
of αi that maximizes L. The latter is the statement of the Karush-Kuhn-Tucker
complementarity conditions of optimization theory [6].
The condition that at the saddle point, the derivatives of L with respect to
the primal variables must vanish,
∂ ∂
L(w, b, α) = 0, L(w, b, α) = 0, (26)
∂b ∂w
leads to

m
αi yi = 0 (27)
i=1

and

m
w= αi yi xi . (28)
i=1

The solution vector thus has an expansion in terms of a subset of the training
patterns, namely those patterns whose αi is non-zero, called Support Vectors.
By the Karush-Kuhn-Tucker complementarity conditions

αi · [yi ((xi · w) + b) − 1] = 0, i = 1, . . . , m, (29)

the Support Vectors lie on the margin (cf. Figure 2). All remaining examples of
the training set are irrelevant: their constraint (24) does not play a role in the
optimization, and they do not appear in the expansion (28). This nicely captures
our intuition of the problem: as the hyperplane (cf. Figure 2) is completely
determined by the patterns closest to it, the solution should not depend on the
other examples.
By substituting (27) and (28) into L, one eliminates the primal variables and
arrives at the Wolfe dual of the optimization problem (e.g., [6]): ﬁnd multipliers
αi which

m
1
m
maximize W (α) = αi − αi αj yi yj (xi · xj ) (30)
i=1
2 i,j=1

m
subject to αi ≥ 0, i = 1, . . . , m, and αi yi = 0. (31)
i=1

The hyperplane decision function can thus be written as

m
f (x) = sgn yi αi · (x · xi ) + b (32)
i=1

where b is computed using (29).

A Short Introduction to Learning with Kernels 49

The structure of the optimization problem closely resembles those that typi-
cally arise in Lagrange’s formulation of mechanics. Also there, often only a subset
of the constraints become active. For instance, if we keep a ball in a box, then it
will typically roll into one of the corners. The constraints corresponding to the
walls which are not touched by the ball are irrelevant, the walls could just as
well be removed.
Seen in this light, it is not too surprising that it is possible to give a me-
chanical interpretation of optimal margin hyperplanes [9]: If we assume that
each support vector xi exerts a perpendicular force of size αi and sign yi on
a solid plane sheet lying along the hyperplane, then the solution satisﬁes the
requirements of mechanical stability. The constraint (27) states that the forces
on
the sheet sum to zero; and (28) implies that the torques also sum to zero, via
i xi × yi αi · w/w = w × w/w = 0.
There are theoretical arguments supporting the good generalization perfor-
mance of the optimal hyperplane [30, 28, 4, 25, 35]. In addition, it is computation-
ally attractive, since it can be constructed by solving a quadratic programming
problem.

4 Support Vector Classiﬁers

We now have all the tools to describe support vector machines [29, 23]. Every-
thing in the last section was formulated in a dot product space. We think of this
space as the feature space H described in Section 1. To express the formulas
in terms of the input patterns living in X, we thus need to employ (5), which
expresses the dot product of bold face feature vectors x, x in terms of the kernel
k evaluated on input patterns x, x ,

k(x, x ) = (x · x ). (33)

This can be done since all feature vectors only occured in dot products. The
weight vector (cf. (28)) then becomes an expansion in feature space,2 and will
thus typically no longer correspond to the image of a single vector from input
space. We thus obtain decision functions of the more general form (cf. (32))

m
f (x) = sgn yi αi · (Φ(x) · Φ(xi )) + b
i=1

m
= sgn yi αi · k(x, xi ) + b , (34)
i=1

and the following quadratic program (cf. (30)):

2
This constitutes a special case of the so-called representer theorem, which states that
under fairly general conditions, the minimizers of objective functions which contain
a penalizer in terms of a norm in feature space will have kernel expansions [32, 23].
50 B. Schölkopf and A.J. Smola

Fig. 3. Example of a Support Vector classiﬁer found by using a radial basis function
kernel k(x, x ) = exp(−x − x 2 ). Both coordinate axes range from -1 to +1. Circles
and disks are two classes of training examples; the middle line is the decision surface;
the outer lines precisely meet the constraint (24). Note that the Support Vectors found
by the algorithm (marked by extra circles) are not centers of clusters, but examples
which are critical
for the given classiﬁcation task. Grey values code the modulus of the
argument m i=1 yi αi · k(x, xi ) + b of the decision function (34) (from [17]).)

m
1
m
maximize W (α) = αi − αi αj yi yj k(xi , xj ) (35)
i=1
2 i,j=1

m
subject to αi ≥ 0, i = 1, . . . , m, and αi yi = 0. (36)
i=1

In practice, a separating hyperplane may not exist, e.g. if a high noise level
causes a large overlap of the classes. To allow for the possibility of examples
violating (24), one introduces slack variables [10, 29, 22]

ξi ≥ 0, i = 1, . . . , m (37)

in order to relax the constraints to

yi · ((w · xi ) + b) ≥ 1 − ξi , i = 1, . . . , m. (38)

A classiﬁer which generalizes well is then found

by controlling both the classiﬁer
capacity (via w) and the sum of the slacks i ξi . The latter is done as it can
be shown to provide an upper bound on the number of training errors which
leads to a convex optimization problem.
A Short Introduction to Learning with Kernels 51

One possible realization of a soft margin classiﬁer is minimizing the objective

function
1 m
τ (w, ξ) = w2 + C ξi (39)
2 i=1

subject to the constraints (37) and (38), for some value of the constant C > 0
determining the trade-oﬀ. Here and below, we use boldface Greek letters as a
shorthand for corresponding vectors ξ = (ξ1 , . . . , ξm ). Incorporating kernels, and
rewriting it in terms of Lagrange multipliers, this again leads to the problem of
maximizing (35), subject to the constraints

m
0 ≤ αi ≤ C, i = 1, . . . , m, and αi yi = 0. (40)
i=1

The only diﬀerence from the separable case is the upper bound C on the Lagrange
multipliers αi . This way, the inﬂuence of the individual patterns (which could be
outliers) gets limited. As above, the solution takes the form (34). The threshold
b can be computed by exploiting the fact that for all SVs xi with αi < C,
the slack variable ξi is zero (this again follows from the Karush-Kuhn-Tucker
complementarity conditions), and hence

m
yj αj · k(xi , xj ) + b = yi . (41)
j=1

Another possible realization of a soft margin variant of the optimal hyper-

plane uses the ν-parameterization [22]. In it, the parameter C is replaced by a
parameter ν ∈ [0, 1] which can be shown to lower and upper bound the number
of examples that will be SVs and that will come to lie on the wrong side of the
hyperplane,
respectively. It uses a primal objective function with the error term
1
νm ξ
i i − ρ, and separation constraints

yi · ((w · xi ) + b) ≥ ρ − ξi , i = 1, . . . , m. (42)

The margin parameter ρ is a variable of the optimization problem. The dual

can be shown to consist
of maximizing the quadratic part of
(35), subject to
0 ≤ αi ≤ 1/(νm), i αi yi = 0 and the additional constraint i αi = 1.

5 Support Vector Regression

The concept of the margin is speciﬁc to pattern recognition. To generalize the SV
algorithm to regression estimation [29], an analog of the margin is constructed
in the space of the target values y (note that in regression, we have y ∈ R) by
using Vapnik’s ε-insensitive loss function (Figure 4)

|y − f (x)|ε := max{0, |y − f (x)| − ε}. (43)

52 B. Schölkopf and A.J. Smola

ξ
x
x +ε ξ
0
x x
x x
−ε
x x
x
x
x x x
x −ε +ε
x

Fig. 4. In SV regression, a tube with radius ε is ﬁtted to the data. The trade-oﬀ
between model complexity and points lying outside of the tube (with positive slack
variables ξ) is determined by minimizing (46) (from [17]).

To estimate a linear regression

f (x) = (w · x) + b (44)

with precision ε, one minimizes

1 m
w2 + C |yi − f (xi )|ε . (45)
2 i=1

Written as a constrained optimization problem, this reads:

1 m
minimize τ (w, ξ, ξ ∗ ) = w2 + C (ξi + ξi∗ ) (46)
2 i=1
subject to ((w · xi ) + b) − yi ≤ ε + ξi (47)
yi − ((w · xi ) + b) ≤ ε + ξi∗ (48)
ξi , ξi∗ ≥ 0 (49)

for all i = 1, . . . , m. Note that according to (47) and (48), any error smaller than
ε does not require a nonzero ξi or ξi∗ , and hence does not enter the objective
function (46).
Generalization to kernel-based regression estimation is carried out in com-
plete analogy to the case of pattern recognition. Introducing Lagrange multipli-
ers, one thus arrives at the following optimization problem: for C > 0, ε ≥ 0
chosen a priori,
A Short Introduction to Learning with Kernels 53

m
m
maximize W (α, α∗ ) = −ε (αi∗ + αi ) + (αi∗ − αi )yi
i=1 i=1

1 ∗
m
− (αi − αi )(αj∗ − αj )k(xi , xj ) (50)
2 i,j=1

m
subject to 0 ≤ αi , αi∗ ≤ C, i = 1, . . . , m, and (αi − αi∗ ) = 0. (51)
i=1

The regression estimate takes the form

m
f (x) = (αi∗ − αi )k(xi , x) + b, (52)
i=1

where b is computed using the fact that (47) becomes an equality with ξi = 0
if 0 < αi < C, and (48) becomes an equality with ξi∗ = 0 if 0 < αi∗ < C.
Several extensions of this algorithm are possible. From an abstract point of
view, we just need some target function which depends on the vector (w, ξ) (cf.
(46)). There are multiple degrees of freedom for constructing it, including some
freedom how to penalize, or regularize, different parts of the vector, and some
freedom how to use the kernel trick. For instance, more general loss functions
can be used for ξ, leading to problems that can still be solved efficiently [27].
Moreover, norms other than the 2-norm . can be used to regularize the solu-
tion. Yet another example is that polynomial kernels can be incorporated which
consist of multiple layers, such that the first layer only computes products within
certain specified subsets of the entries of w [17].
Finally, the algorithm can be modified such that ε need not be specified a
priori. Instead, one specifies an upper bound 0 ≤ ν ≤ 1 on the fraction of points
allowed to lie outside the tube (asymptotically, the number of SVs) and the
corresponding ε is computed automatically. This is achieved by using as primal
objective function

1
m
w2 + C νmε + |yi − f (xi )|ε (53)
2 i=1

instead of (45), and treating ε ≥ 0 as a parameter that we minimize over [22].

We conclude this section by noting that the SV algorithm has not only been
generalized to regression, but also, more recently, to one-class problems and
novelty detection [20]. Moreover, the kernel method for computing dot products
in feature spaces is not restricted to SV machines. Indeed, it has been pointed
out that it can be used to develop nonlinear generalizations of any algorithm
that can be cast in terms of dot products, such as principal component analysis
[21], and a number of developments have followed this example.

6 Polynomial Kernels
We now take a closer look at the issue of the similarity measure, or kernel, k.
In this section, we think of X as a subset of the vector space RN , (N ∈ N),
endowed with the canonical dot product (3).
54 B. Schölkopf and A.J. Smola

σ( Σ ) output σ (Σ υi k (x,xi))

υ1 υ2 ... υm weights

<, > <, > ... <, > dot product <Φ(x),Φ(xi)>= k (x,xi)

Φ(x1) Φ(x2) Φ(x) Φ(xn) mapped vectors Φ(xi), Φ(x)

... support vectors x1 ... xn

test vector x

Fig. 5. Architecture of SV machines. The input x and the Support Vectors xi are
nonlinearly mapped (by Φ) into a feature space H, where dot products are computed.
By the use of the kernel k, these two layers are in practice computed in one single step.
The results are linearly combined by weights υi , found by solving a quadratic program
(in pattern recognition, υi = yi αi ; in regression estimation, υi = αi∗ − αi ). The linear
combination is fed into the function σ (in pattern recognition, σ(x) = sgn (x + b); in
regression estimation, σ(x) = x + b) (from [17]).

6.1 Product Features

Suppose we are given patterns x ∈ RN where most information is contained in

the d-th order products (monomials) of entries [x]j of x,

[x]j1 · · · · · [x]jd , (54)

where j1 , . . . , jd ∈ {1, . . . , N }. In that case, we might prefer to extract these

product features, and work in the feature space H of all products of d entries. In
visual recognition problems, where images are often represented as vectors, this
would amount to extracting features which are products of individual pixels.
For instance, in R2 , we can collect all monomial feature extractors of degree
2 in the nonlinear map

Φ : R 2 → H = R3 (55)
([x]1 , [x]2 ) → ([x]21 , [x]22 , [x]1 [x]2 ). (56)
A Short Introduction to Learning with Kernels 55

This approach works fine for small toy examples, but it fails for realistically sized
problems: for N -dimensional input patterns, there exist
(N + d − 1)!
NH = (57)
d!(N − 1)!
different monomials (54), comprising a feature space H of dimensionality NH .
For instance, already 16 × 16 pixel input images and a monomial degree d = 5
yield a dimensionality of 1010 .
In certain cases described below, there exists, however, a way of computing dot
products in these high-dimensional feature spaces without explicitely mapping
into them: by means of kernels nonlinear in the input space RN . Thus, if the
subsequent processing can be carried out using dot products exclusively, we are
able to deal with the high dimensionality.
The following section describes how dot products in polynomial feature spaces
can be computed efficiently.

6.2 Polynomial Feature Spaces Induced by Kernels

In order to compute dot products of the form (Φ(x) · Φ(x )), we employ kernel
representations of the form

k(x, x ) = (Φ(x) · Φ(x )), (58)

which allow us to compute the value of the dot product in H without having to
carry out the map Φ. This method was used by [8] to extend the Generalized
Portrait hyperplane classiﬁer of [30] to nonlinear Support Vector machines. [1]
call H the linearization space, and used in the context of the potential function
classiﬁcation method to express the dot product between elements of H in terms
of elements of the input space.
What does k look like for the case of polynomial features? We start by giving
an example [29] for N = d = 2. For the map

C2 : ([x]1 , [x]2 ) → ([x]21 , [x]22 , [x]1 [x]2 , [x]2 [x]1 ), (59)

dot products in H take the form

(C2 (x) · C2 (x )) = [x]21 [x ]21 + [x]22 [x ]22 + 2[x]1 [x]2 [x ]1 [x ]2 = (x · x )2 , (60)

i.e., the desired kernel k is simply the square of the dot product in input space.
The same works for arbitrary N, d ∈ N [8]: as a straightforward generalization of
a result proved in the context of polynomial approximation (Lemma 2.1, [16]),
we have:
Proposition 1. Deﬁne Cd to map x ∈ RN to the vector Cd (x) whose entries
are all possible d-th degree ordered products of the entries of x. Then the corre-
sponding kernel computing the dot product of vectors mapped by Cd is

k(x, x ) = (Cd (x) · Cd (x )) = (x · x )d . (61)

56 B. Schölkopf and A.J. Smola

Proof. We directly compute

N
(Cd (x) · Cd (x )) = [x]j1 · · · · · [x]jd · [x ]j1 · · · · · [x ]jd (62)
j1 ,...,jd =1
 d
N
=  [x]j · [x ]j  = (x · x )d . (63)
j=1

Instead of ordered products, we can use unordered ones to obtain a map

Φd which yields the same value of the dot product. To this end, we have to
compensate for the multiple occurrence of certain monomials in Cd by scaling
the respective entries of Φd with the square roots of their numbers of occurrence.
Then, by this deﬁnition of Φd , and (61),

(Φd (x) · Φd (x )) = (Cd (x) · Cd (x )) = (x · x )d . (64)

For instance, if n of the ji in (54) are equal, and the remaining ones are diﬀerent,
then the coeﬃcient in the corresponding component of Φd is (d − n + 1)! (for
the general case, cf. [24]. For Φ2 , this simply means that [29]
√
Φ2 (x) = ([x]21 , [x]22 , 2 [x]1 [x]2 ). (65)

If x represents an image with the entries being pixel values, we can use
the kernel (x · x )d to work in the space spanned by products of any d pixels —
provided that we are able to do our work solely in terms of dot products, without
any explicit usage of a mapped pattern Φd (x). Using kernels of the form (61), we
take into account higher-order statistics without the combinatorial explosion (cf.
(57)) of time and memory complexity which goes along already with moderately
high N and d.
To conclude this section, note that it is possible to modify (61) such that it
maps into the space of all monomials up to degree d, deﬁning [29]

k(x, x ) = ((x · x ) + 1)d . (66)

7 Representing Similarities in Linear Spaces

In what follows, we will look at things the other way round, and start with the
kernel. Given some kernel function, can we construct a feature space such that
the kernel computes the dot product in that feature space? This question has
been brought to the attention of the machine learning community by [1], [8],
and [29]. In functional analysis, the same problem has been studied under the
heading of Hilbert space representations of kernels. A good monograph on the
functional analytic theory of kernels is [5]; indeed, a large part of the material
in the present section is based on that work.
There is one more aspect in which this section diﬀers from the previous one:
the latter dealt with vectorial data. The results in the current section, in contrast,
A Short Introduction to Learning with Kernels 57

hold for data drawn from domains which need no additional structure other
than them being nonempty sets X. This generalizes kernel learning algorithms
to a large number of situations where a vectorial representation is not readily
available [17, 12, 33].
We start with some basic deﬁnitions and results.

Deﬁnition 1 (Gram matrix). Given a kernel k and patterns x1 , . . . , xm ∈ X,

the m × m matrix

K := (k(xi , xj ))ij (67)

is called the Gram matrix (or kernel matrix) of k with respect to x1 , . . . , xm .

Deﬁnition 2 (Positive deﬁnite matrix). An m × m matrix Kij satisfying

ci c̄j Kij ≥ 0 (68)
i,j

for all ci ∈ C is called positive deﬁnite.

Deﬁnition 3 ((Positive deﬁnite) kernel). Let X be a nonempty set. A func-

tion k : X × X → C which for all m ∈ N, xi ∈ X gives rise to a positive deﬁnite
Gram matrix is called a positive deﬁnite kernel. Often, we shall refer to it simply
as a kernel.

The term kernel stems from the ﬁrst use of this type of function in the study
of integral operators. A function k which gives rise to an operator T via

(T f )(x) = k(x, x )f (x ) dx (69)

is called the kernel of T . One might argue that the term positive definite kernel is
slightly misleading. In matrix theory, the term definite is usually used to denote
the case where equality in (68) only occurs if c1 = · · · = cm = 0. Simply using
the term positive kernel, on the other hand, could be confused with a kernel
whose values are positive. In the literature, a number of different terms are used
for positive definite kernels, such as reproducing kernel, Mercer kernel, or support
vector kernel.
The definitions for (positive definite) kernels and positive definite matrices
differ only in the fact that in the former case, we are free to choose the points
on which the kernel is evaluated.
Positive definiteness implies positivity on the diagonal,

k(x1 , x1 ) ≥ 0 for all x1 ∈ X, (70)

(use m = 1 in (68)), and symmetry, i.e.,

k(xi , xj ) = k(xj , xi ). (71)

58 B. Schölkopf and A.J. Smola

Note that in the complex-valued case, our definition of symmetry includes com-
plex conjugation, depicted by the bar. The definition of symmetry of matrices is
analogous, i.e.. Kij = K̄ji .
Obviously, real-valued kernels, which are what we will mainly be concerned
with, are contained in the above definition as a special case, since we did not
require that the kernel take values in C\R. However, it is not sufficient to require
that (68) hold for real coefficients ci . If we want to get away with real coefficients
only, we additionally have to require that the kernel be symmetric,
k(xi , xj ) = k(xj , xi ). (72)
Kernels can be regarded as generalized dot products. Indeed, any dot product
can be shown to be a kernel; however, linearity does not carry over from dot
products to general kernels. Another property of dot products, the Cauchy-
Schwarz inequality, does have a natural generalization to kernels:
Proposition 2. If k is a positive definite kernel, and x1 , x2 ∈ X, then
|k(x1 , x2 )|2 ≤ k(x1 , x1 ) · k(x2 , x2 ). (73)
Proof. For sake of brevity, we give a non-elementary proof using some basic facts
of linear algebra. The 2 × 2 Gram matrix with entries Kij = k(xi , xj ) is positive
definite. Hence both its eigenvalues are nonnegative, and so is their product, K’s
determinant, i.e.,
0 ≤ K11 K22 − K12 K21 = K11 K22 − K12 K̄12 = K11 K22 − |K12 |2 . (74)
Substituting k(xi , xj ) for Kij , we get the desired inequality.
We are now in a position to construct the feature space associated with a
kernel k.
We define a map from X into the space of functions mapping X into C,
denoted as CX , via
Φ : X → CX
x → k(., x). (75)
Here, Φ(x) = k(., x) denotes the function that assigns the value k(x , x) to x ∈ X.
We have thus turned each pattern into a function on the domain X. In a
sense, a pattern is now represented by its similarity to all other points in the
input domain X. This seems a very rich representation, but it will turn out that
the kernel allows the computation of the dot product in that representation.
We shall now construct a dot product space containing the images of the
input patterns under Φ. To this end, we first need to endow it with the linear
structure of a vector space. This is done by forming linear combinations of the
form

m
f (.) = αi k(., xi ). (76)
i=1

Here, m ∈ N, αi ∈ C and xi ∈ X are arbitrary.

A Short Introduction to Learning with Kernels 59

Next, we deﬁne a dot product between f and another function

m
g(.) = βj k(., xj ) (77)
j=1

(m ∈ N, βj ∈ C and xj ∈ X) as

m
m
f, g := ᾱi βj k(xi , xj ). (78)
i=1 j=1

To see that this is well-deﬁned, although it explicitly contains the expansion

coeﬃcients (which need not be unique), note that

m
f, g = βj f (xj ), (79)
j=1

using k(xj , xi ) = k(xi , xj ). The latter, however, does not depend on the partic-
ular expansion of f . Similarly, for g, note that

m
f, g = ᾱi g(xi ). (80)
i=1

The last two equations also show that ·, · is bilinear. It is symmetric, as f, g =

g, f . Moreover, it is positive deﬁnite, since positive deﬁniteness of k implies that
for any function f , written as (76), we have

m
f, f = αi αj k(xi , xj ) ≥ 0. (81)
i,j=1

The latter implies that ·, · is actually itself a positive definite kernel, defined
on our space of functions. To see this, note that given functions f1 , . . . , fn , and
coefficients γ1 , . . . , γn ∈ R, we have
n
n
n
γi γj fi , fj = γi fi , γj fj ≥ 0. (82)
i,j=1 i=1 j=1

Here, the left hand equality follows from the bilinearity of ·, · , and the right
hand inequality from (81).
For the last step in proving that it even is a dot product, we will use the
following interesting property of Φ, which follows directly from the deﬁnition:
for all functions (76), we have

k(., x), f = f (x) (83)

— k is the representer of evaluation. In particular,

60 B. Schölkopf and A.J. Smola

k(., x), k(., x ) = k(x, x ). (84)

By virtue of these properties, positive deﬁnite kernels k are also called reproduc-
ing kernels [3, 5, 32, 17].
By (83) and Proposition 2, we have
|f (x)|2 = | k(., x), f |2 ≤ k(x, x) · f, f . (85)
Therefore, f, f = 0 directly implies f = 0, which is the last property that was
left to prove in order to establish that ., . is a dot product.
One can complete the space of functions (76) in the norm corresponding to
the dot product, i.e., add the limit points of sequences that are convergent in
that norm, and thus gets a Hilbert space H, usually called a reproducing kernel
Hilbert space.3
The case of real-valued kernels is included in the above; in that case, H can
be chosen as a real Hilbert space.

8 Examples of Kernels
Besides (61), [8] and [29] suggest the usage of Gaussian radial basis function
kernels [1]
x − x 2
k(x, x ) = exp − (86)
2 σ2
and sigmoid kernels
k(x, x ) = tanh(κ(x · x ) + Θ). (87)
While one can show that (87) is not a kernel [26], (86) has become one of the
most useful kernels in situations where no further knowledge about the problem
at hand is given.
Note that all the kernels discussed so far have the convenient property of
unitary invariance, i.e., k(x, x ) = k(U x, U x ) if U = U −1 (if we consider
complex numbers, then U ∗ instead of U has to be used).
The radial basis function kernel additionally is translation invariant. More-
over, as it satisﬁes k(x, x) = 1 for all x ∈ X, each mapped example has unit
length, Φ(x) = 1. In addition, as k(x, x ) > 0 for all x, x ∈ X, all points lie
inside the same orthant in feature space. To see this, recall that for unit length
vectors, the dot product (3) equals the cosine of the enclosed angle. Hence
cos(∠(Φ(x), Φ(x ))) = (Φ(x) · Φ(x )) = k(x, x ) > 0, (88)
which amounts to saying that the enclosed angle between any two mapped ex-
amples is smaller than π/2.
The examples given so far apply to the case of vectorial data. Let us at least
give one example where X is not a vector space.
3
A Hilbert space H is deﬁned as a complete dot product space. Completeness means
that all sequences in H which are convergent in the norm corresponding to the dot
product will actually have their limits in H, too.
A Short Introduction to Learning with Kernels 61

Example 1 (Similarity of probabilistic events). If A is a σ-algebra, and P a prob-

ability measure on A, then

k(A, B) = P (A ∩ B) − P (A)P (B) (89)

is a positive deﬁnite kernel.

Further examples include kernels for string matching, as proposed by [33]

and [12].
There is an analog of the kernel trick for distances rather than dot products,
i.e., dissimilarities rather than similarities. This leads to the class of conditionally
positive deﬁnite kernels, which contain the standard SV kernels as special cases.
Interestingly, it turns out that SVMs and kernel PCA can be applied also with
this larger class of kernels, due to their being translation invariant in feature
space [23].

9 Applications
Having described the basics of SV machines, we now summarize some empirical
findings.
By the use of kernels, the optimal margin classifier was turned into a classifier
which became a serious competitor of high-performance classifiers. Surprisingly,
it was noticed that when different kernel functions are used in SV machines,
they empirically lead to very similar classification accuracies and SV sets [18].
In this sense, the SV set seems to characterize (or compress) the given task in a
manner which up to a certain degree is independent of the type of kernel (i.e.,
the type of classifier) used.
Initial work at AT&T Bell Labs focused on OCR (optical character recog-
nition), a problem where the two main issues are classification accuracy and
classification speed. Consequently, some effort went into the improvement of
SV machines on these issues, leading to the Virtual SV method for incorpo-
rating prior knowledge about transformation invariances by transforming SVs,
and the Reduced Set method for speeding up classification. This way, SV ma-
chines became competitive with (or, in some cases, superior to) the best available
classifiers on both OCR and object recognition tasks [7, 9, 11].
Another initial weakness of SV machines, less apparent in OCR applications
which are characterized by low noise levels, was that the size of the quadratic
programming problem scaled with the number of Support Vectors. This was due
to the fact that in (35), the quadratic part contained at least all SVs — the
common practice was to extract the SVs by going through the training data
in chunks while regularly testing for the possibility that some of the patterns
that were initially not identified as SVs turn out to become SVs at a later
stage (note that without chunking, the size of the matrix would be m × m,
where m is the number of all training examples). What happens if we have a
high-noise problem? In this case, many of the slack variables ξi will become
nonzero, and all the corresponding examples will become SVs. For this case, a
62 B. Schölkopf and A.J. Smola

decomposition algorithm was proposed [14], which is based on the observation

that not only can we leave out the non-SV examples (i.e., the xi with αi = 0)
from the current chunk, but also some of the SVs, especially those that hit the
upper boundary (i.e., αi = C). In fact, one can use chunks which do not even
contain all SVs, and maximize over the corresponding sub-problems. SMO [15]
explores an extreme case, where the sub-problems are chosen so small that one
can solve them analytically. Several public domain SV packages and optimizers
are listed on the web page http://www.kernel-machines.org. For more details on
the optimization problem, see [23].

10 Conclusion
One of the most appealing features of kernel algorithms is the solid founda-
tion provided by both statistical learning theory and functional analysis. Kernel
methods let us interpret (and design) learning algorithms geometrically in fea-
ture spaces nonlinearly related to the input space, and combine statistics and
geometry in a promising way. Kernels provide an elegant framework for studying
three fundamental issues of machine learning:

– Similarity measures — the kernel can be viewed as a (nonlinear) similarity

measure, and should ideally incorporate prior knowledge about the problem
at hand
– Data representation — as described above, kernels induce representations of
the data in a linear space
– Function class — due to the representer theorem, the kernel implicitly also
determines the function class which is used for learning.

References
1. M. A. Aizerman, É. M. Braverman, and L. I. Rozonoér. Theoretical foundations
of the potential function method in pattern recognition learning. Automation and
Remote Control, 25:821–837, 1964.
2. Noga Alon, Shai Ben-David, Nicolo Cesa-Bianchi, and David Haussler. Scale-
sensitive dimensions, uniform convergence, and learnability. Journal of the ACM,
44(4):615–631, 1997.
3. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Math-
ematical Society, 68:337–404, 1950.
4. P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector
machines and other pattern classiﬁers. In B. Schölkopf, C. J. C. Burges, and A. J.
Smola, editors, Advances in Kernel Methods—Support Vector Learning, pages 43–
54, Cambridge, MA, 1999. MIT Press.
5. C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups.
Springer, New York, 1984.
6. D. P. Bertsekas. Nonlinear Programming. Athena Scientiﬁc, Belmont, MA, 1995.
A Short Introduction to Learning with Kernels 63

7. V. Blanz, B. Schölkopf, H. Bülthoﬀ, C. Burges, V. Vapnik, and T. Vetter. Com-

parison of view-based object recognition algorithms using realistic 3D models. In
C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, edi-
tors, Artificial Neural Networks ICANN’96, pages 251–256, Berlin, 1996. Springer
Lecture Notes in Computer Science, Vol. 1112.
8. B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal
margin classifiers. In D. Haussler, editor, Proceedings of the Annual Conference on
Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM
Press.
9. C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support
vector learning machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors,
Advances in Neural Information Processing Systems 9, pages 375–381, Cambridge,
MA, 1997. MIT Press.
10. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297,
1995.
11. D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine
Learning, 2002. Accepted for publication. Also: Technical Report JPL-MLTR-00-1,
Jet Propulsion Laboratory, Pasadena, CA, 2000.
12. D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSC-
CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.
13. J. Mercer. Functions of positive and negative type and their connection with
the theory of integral equations. Philosophical Transactions of the Royal Society,
London, A 209:415–446, 1909.
14. E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support
vector machines. In J. Principe, L. Gile, N. Morgan, and E. Wilson, editors, Neu-
ral Networks for Signal Processing VII—Proceedings of the 1997 IEEE Workshop,
pages 276–285, New York, 1997. IEEE.
15. J. Platt. Fast training of support vector machines using sequential minimal opti-
mization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in
Kernel Methods—Support Vector Learning, pages 185–208, Cambridge, MA, 1999.
MIT Press.
16. T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201–
209, 1975.
17. B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997.
Doktorarbeit, TU Berlin. Download: http://www.kernel-machines.org.
18. B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task.
In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International
Conference on Knowledge Discovery & Data Mining, Menlo Park, 1995. AAAI
Press.
19. B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods—
Support Vector Learning. MIT Press, Cambridge, MA, 1999.
20. B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Esti-
mating the support of a high-dimensional distribution. Neural Computation, 13(7),
2001.
21. B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
22. B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector
algorithms. Neural Computation, 12:1207–1245, 2000.
23. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge,
MA, 2002.
64 B. Schölkopf and A.J. Smola

24. A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization
operators and support vector kernels. Neural Networks, 11:637–649, 1998.
25. A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large
Margin Classifiers. MIT Press, Cambridge, MA, 2000.
26. A. J. Smola, Z. L. Óvári, and R. C. Williamson. Regularization with dot-product
kernels. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural
Information Processing Systems 13, pages 308–314. MIT Press, 2001.
27. A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition,
regression, approximation and operator inversion. Algorithmica, 22:211–231, 1998.
28. V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian].
Nauka, Moscow, 1979. (English translation: Springer, New York, 1982).
29. V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
30. V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian].
Nauka, Moscow, 1974. (German Translation: W. Wapnik & A. Tscherwonenkis,
Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
31. V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method.
Automation and Remote Control, 24:774–780, 1963.
32. G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Re-
gional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
33. C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett,
B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers,
pages 39–50, Cambridge, MA, 2000. MIT Press.
34. J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf, and V. Vapnik. Kernel de-
pendency estimation. Technical Report 98, Max Planck Institute for Biological
Cybernetics, 2002.
35. R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization bounds for regu-
larization networks and support vector machines via entropy numbers of compact
operators. IEEE Transaction on Information Theory, 2001.

Learning With Kernels Support Vector Machines, Regularization, Optimization, and Beyond by Bernhard Schlkopf, Alexander J. Smola
No ratings yet
Learning With Kernels Support Vector Machines, Regularization, Optimization, and Beyond by Bernhard Schlkopf, Alexander J. Smola
644 pages
Support Vector Machines: Dominik Wisniewski Wojciech Wawrzyniak
No ratings yet
Support Vector Machines: Dominik Wisniewski Wojciech Wawrzyniak
16 pages
2 Center of Gravity, Stability and Equilibrium PDF
100% (2)
2 Center of Gravity, Stability and Equilibrium PDF
3 pages
Test-12 RCC Vijaypath Solution
No ratings yet
Test-12 RCC Vijaypath Solution
32 pages
Stanford ML
No ratings yet
Stanford ML
168 pages
Discriminative and Generative Methods For Bags of Features: Zebra Non-Zebra
No ratings yet
Discriminative and Generative Methods For Bags of Features: Zebra Non-Zebra
40 pages
Pattern Recognition & Learning II: © UW CSE Vision Faculty
No ratings yet
Pattern Recognition & Learning II: © UW CSE Vision Faculty
47 pages
2021 UNAS REFER Rafi Yon Saputra 173112706420242 Kernel Primer
No ratings yet
2021 UNAS REFER Rafi Yon Saputra 173112706420242 Kernel Primer
65 pages
Chap 02 2024 (1) - Mechanics of Materials
No ratings yet
Chap 02 2024 (1) - Mechanics of Materials
90 pages
Poly Aml
No ratings yet
Poly Aml
76 pages
L6 Lecture Image - Classification.fundemental v4
No ratings yet
L6 Lecture Image - Classification.fundemental v4
66 pages
Electrochemical Impedance Analysis and Interpretat
0% (1)
Electrochemical Impedance Analysis and Interpretat
9 pages
Kernel Methods in Machine Learning
No ratings yet
Kernel Methods in Machine Learning
53 pages
cs229 Notes3
No ratings yet
cs229 Notes3
30 pages
Chapter 7
No ratings yet
Chapter 7
64 pages
Kernal Methods Machine Learning
No ratings yet
Kernal Methods Machine Learning
53 pages
SP14 CS188 Lecture 23 - Kernels and Clustering - Print
No ratings yet
SP14 CS188 Lecture 23 - Kernels and Clustering - Print
39 pages
2EL1730-ML-Lecture04-Non Parametric Learning and Nearest Neighbor
No ratings yet
2EL1730-ML-Lecture04-Non Parametric Learning and Nearest Neighbor
47 pages
Handout 03 Classic Classifiers
No ratings yet
Handout 03 Classic Classifiers
39 pages
C. Cifarelli Et Al - Incremental Classification With Generalized Eigenvalues
No ratings yet
C. Cifarelli Et Al - Incremental Classification With Generalized Eigenvalues
25 pages
Support Vector Machine (SVM)
No ratings yet
Support Vector Machine (SVM)
45 pages
תרגול - SVM 1
No ratings yet
תרגול - SVM 1
32 pages
Chapter 5
No ratings yet
Chapter 5
35 pages
SVM Class
No ratings yet
SVM Class
33 pages
Lecture 05
No ratings yet
Lecture 05
49 pages
An Adventure of Epic Porpoises
No ratings yet
An Adventure of Epic Porpoises
174 pages
Kernal and Multiclass
No ratings yet
Kernal and Multiclass
51 pages
Kernel Methods
No ratings yet
Kernel Methods
19 pages
Probability Product Kernels: Tony Jebara Risi Kondor Andrew Howard
No ratings yet
Probability Product Kernels: Tony Jebara Risi Kondor Andrew Howard
26 pages
1 s2.0 S0168927423000429 Main
No ratings yet
1 s2.0 S0168927423000429 Main
20 pages
Some Methods of Constructing Kernel
No ratings yet
Some Methods of Constructing Kernel
23 pages
Randomproj
No ratings yet
Randomproj
17 pages
SVM Intro
No ratings yet
SVM Intro
23 pages
Lecture 4
No ratings yet
Lecture 4
49 pages
Static Electricity QB
No ratings yet
Static Electricity QB
12 pages
Ds 11
No ratings yet
Ds 11
21 pages
A Tutorial on ν-Support Vector Machines: 1 An Introductory Example
No ratings yet
A Tutorial on ν-Support Vector Machines: 1 An Introductory Example
29 pages
Self Reading - KNN - Notes
No ratings yet
Self Reading - KNN - Notes
7 pages
HSSC-II Math Final Package
No ratings yet
HSSC-II Math Final Package
11 pages
Lecture 13 - Kernels
No ratings yet
Lecture 13 - Kernels
5 pages
Atc Lecture Tyliu
No ratings yet
Atc Lecture Tyliu
48 pages
17 Part Numbers - Replacing Units v2.3
No ratings yet
17 Part Numbers - Replacing Units v2.3
56 pages
8 EmprclMsnryDsgn PDF
No ratings yet
8 EmprclMsnryDsgn PDF
15 pages
Support Vector Machines: Xiaojin Zhu
No ratings yet
Support Vector Machines: Xiaojin Zhu
41 pages
An Introduction To Kernel Methods: C. Campbell
No ratings yet
An Introduction To Kernel Methods: C. Campbell
38 pages
ML Lecture06 2
No ratings yet
ML Lecture06 2
63 pages
MGT Turbec t100 CC
100% (1)
MGT Turbec t100 CC
28 pages
Tanis Food Tec Interpack Profile
No ratings yet
Tanis Food Tec Interpack Profile
2 pages
International Journal of Solids and Structures: Jia Guo, Li Wang, Izuru Takewaki
No ratings yet
International Journal of Solids and Structures: Jia Guo, Li Wang, Izuru Takewaki
10 pages
Introduction To Support Vector Machines: 1 Description
No ratings yet
Introduction To Support Vector Machines: 1 Description
15 pages
2016-GE-26 37 1 CEP Rock Mechanics
No ratings yet
2016-GE-26 37 1 CEP Rock Mechanics
13 pages
High Dimensional Representation
No ratings yet
High Dimensional Representation
33 pages
Lecture 13 - Lie Groups and Their Lie Algebras (Schuller's Geometric Anatomy of Theoretical Physics)
No ratings yet
Lecture 13 - Lie Groups and Their Lie Algebras (Schuller's Geometric Anatomy of Theoretical Physics)
7 pages
Support Vector Machines
No ratings yet
Support Vector Machines
57 pages
Chem Exp-2
No ratings yet
Chem Exp-2
6 pages
Mitsubishi Semiconductor Mitsubishi Semiconductor
No ratings yet
Mitsubishi Semiconductor Mitsubishi Semiconductor
8 pages
Distributions, Frobenious Theorem & Transformations: Harry G. Kwatny
No ratings yet
Distributions, Frobenious Theorem & Transformations: Harry G. Kwatny
22 pages
Chapter 2 Homework
No ratings yet
Chapter 2 Homework
4 pages
Pattern Revision
No ratings yet
Pattern Revision
63 pages
Beams On Elastic Foundation
No ratings yet
Beams On Elastic Foundation
10 pages
Poly Kernel
No ratings yet
Poly Kernel
6 pages
Fizzy Drinks To Ichi o Me Try Lesson
No ratings yet
Fizzy Drinks To Ichi o Me Try Lesson
12 pages
Introduction To Kernels: Max Welling
No ratings yet
Introduction To Kernels: Max Welling
16 pages
SML Unit 4
No ratings yet
SML Unit 4
61 pages
SVM - Hype or Hallelujah
No ratings yet
SVM - Hype or Hallelujah
13 pages
Failure Theories - Static Loads
No ratings yet
Failure Theories - Static Loads
15 pages
Catherine Lefay, Bernadette Charleux, Maud Save, Christophe Chassenieux, Olivier Guerret, Ste Phanie Magnet
No ratings yet
Catherine Lefay, Bernadette Charleux, Maud Save, Christophe Chassenieux, Olivier Guerret, Ste Phanie Magnet
11 pages
Support Vector Machines: (Vapnik, 1979)
No ratings yet
Support Vector Machines: (Vapnik, 1979)
34 pages
(Bernhard Schölkopf, Alexander J. Smola) Learning With Kernels PDF
No ratings yet
(Bernhard Schölkopf, Alexander J. Smola) Learning With Kernels PDF
645 pages
Kernel Nearest-Neighbor Algorithm
No ratings yet
Kernel Nearest-Neighbor Algorithm
10 pages
Statistical Machine Learning-The Basic Approach and Current Research Challenges
No ratings yet
Statistical Machine Learning-The Basic Approach and Current Research Challenges
35 pages
Gravitation - by @MadXAbhiOfficial - Handbook
No ratings yet
Gravitation - by @MadXAbhiOfficial - Handbook
4 pages
Introduction To: Support Vector Machines
No ratings yet
Introduction To: Support Vector Machines
53 pages
Statistical Machine Learning-The Basic Approach and Current Research Challenges
No ratings yet
Statistical Machine Learning-The Basic Approach and Current Research Challenges
35 pages
Toc PDF
No ratings yet
Toc PDF
5 pages
Half Yearly Examination, 2017-18: Mathematics
No ratings yet
Half Yearly Examination, 2017-18: Mathematics
3 pages
1501589527da Mod14 Q1 e Text
No ratings yet
1501589527da Mod14 Q1 e Text
12 pages
SVM Scribe Notes
No ratings yet
SVM Scribe Notes
16 pages
Sample Problem: From Settling Curves For Flocculent Suspension 12.4
No ratings yet
Sample Problem: From Settling Curves For Flocculent Suspension 12.4
1 page
Planta
No ratings yet
Planta
2 pages
Chapter 14 Kinetics of Particle - Work & Energy
No ratings yet
Chapter 14 Kinetics of Particle - Work & Energy
37 pages
This Is
No ratings yet
This Is
7 pages
CTR 19
No ratings yet
CTR 19
10 pages
Design of Under Ground Water Tank: CASE-1 When Tank Is Full CASE-2 When Tank Is Empty
100% (1)
Design of Under Ground Water Tank: CASE-1 When Tank Is Full CASE-2 When Tank Is Empty
10 pages
Vahid
No ratings yet
Vahid
18 pages
Optimization in Function Spaces
From Everand
Optimization in Function Spaces
Amol Sasane
No ratings yet
Transformation of Axes (Geometry) Mathematics Question Bank
From Everand
Transformation of Axes (Geometry) Mathematics Question Bank
Mohmmad Khaja Shareef
3/5 (1)
Worked Examples in Mathematics for Scientists and Engineers
From Everand
Worked Examples in Mathematics for Scientists and Engineers
G. Stephenson
No ratings yet
Calculus-II (Mathematics) Question Bank
From Everand
Calculus-II (Mathematics) Question Bank
Mohmmad Khaja Shareef
No ratings yet
A-level Maths Revision: Cheeky Revision Shortcuts
From Everand
A-level Maths Revision: Cheeky Revision Shortcuts
Scool Revision
3.5/5 (8)

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

SCH Smo 03 C

Uploaded by

SCH Smo 03 C

Uploaded by

A Short Introduction to Learning with Kernels

Bernhard Schölkopf1 and Alexander J. Smola2

Abstract. We brieﬂy describe the main ideas of statistical learning the-

Suppose we are given empirical data

(x1 , y1 ), . . . , (xm , ym ) ∈ X × {±1}. (1)

Here, (x)i denotes the i-th entry of x.

k(x, x ) := (x · x ) = (Φ(x) · Φ(x )). (5)

mean is closer to it. This geometrical construction can be formulated in terms

Similarly, the oﬀset becomes

k(x, x )dx = 1 for all x ∈ X. (12)

In order to state this assumption, we have to require that we can deﬁne an

2 Learning Pattern Recognition from Examples

{x | <w, x> + b = +1}

This constrained optimization problem is dealt with by introducing Lagrange

The Lagrangian L has to be minimized with respect to the primal variables w

αi · [yi ((xi · w) + b) − 1] = 0, i = 1, . . . , m, (29)

The hyperplane decision function can thus be written as

where b is computed using (29).

4 Support Vector Classiﬁers

and the following quadratic program (cf. (30)):

in order to relax the constraints to

A classiﬁer which generalizes well is then found

One possible realization of a soft margin classiﬁer is minimizing the objective

Another possible realization of a soft margin variant of the optimal hyper-

The margin parameter ρ is a variable of the optimization problem. The dual

5 Support Vector Regression

|y − f (x)|ε := max{0, |y − f (x)| − ε}. (43)

To estimate a linear regression

with precision ε, one minimizes

Written as a constrained optimization problem, this reads:

The regression estimate takes the form

instead of (45), and treating ε ≥ 0 as a parameter that we minimize over [22].

Φ(x1) Φ(x2) Φ(x) Φ(xn) mapped vectors Φ(xi), Φ(x)

... support vectors x1 ... xn

6.1 Product Features

Suppose we are given patterns x ∈ RN where most information is contained in

[x]j1 · · · · · [x]jd , (54)

where j1 , . . . , jd ∈ {1, . . . , N }. In that case, we might prefer to extract these

6.2 Polynomial Feature Spaces Induced by Kernels

k(x, x ) = (Φ(x) · Φ(x )), (58)

C2 : ([x]1 , [x]2 ) → ([x]21 , [x]22 , [x]1 [x]2 , [x]2 [x]1 ), (59)

dot products in H take the form

k(x, x ) = (Cd (x) · Cd (x )) = (x · x )d . (61)

Proof. We directly compute

Instead of ordered products, we can use unordered ones to obtain a map

(Φd (x) · Φd (x )) = (Cd (x) · Cd (x )) = (x · x )d . (64)

k(x, x ) = ((x · x ) + 1)d . (66)

7 Representing Similarities in Linear Spaces

Deﬁnition 1 (Gram matrix). Given a kernel k and patterns x1 , . . . , xm ∈ X,

K := (k(xi , xj ))ij (67)

is called the Gram matrix (or kernel matrix) of k with respect to x1 , . . . , xm .

Deﬁnition 2 (Positive deﬁnite matrix). An m × m matrix Kij satisfying

for all ci ∈ C is called positive deﬁnite.

Deﬁnition 3 ((Positive deﬁnite) kernel). Let X be a nonempty set. A func-

(T f )(x) = k(x, x )f (x ) dx (69)

k(x1 , x1 ) ≥ 0 for all x1 ∈ X, (70)

(use m = 1 in (68)), and symmetry, i.e.,

k(xi , xj ) = k(xj , xi ). (71)

Here, m ∈ N, αi ∈ C and xi ∈ X are arbitrary.

Next, we deﬁne a dot product between f and another function

(m ∈ N, βj ∈ C and xj ∈ X) as

To see that this is well-deﬁned, although it explicitly contains the expansion

The last two equations also show that ·, · is bilinear. It is symmetric, as f, g =

k(., x), f = f (x) (83)

— k is the representer of evaluation. In particular,

k(., x), k(., x ) = k(x, x ). (84)

Example 1 (Similarity of probabilistic events). If A is a σ-algebra, and P a prob-

k(A, B) = P (A ∩ B) − P (A)P (B) (89)

is a positive deﬁnite kernel.

Further examples include kernels for string matching, as proposed by [33]

decomposition algorithm was proposed [14], which is based on the observation

– Similarity measures — the kernel can be viewed as a (nonlinear) similarity

7. V. Blanz, B. Schölkopf, H. Bülthoﬀ, C. Burges, V. Vapnik, and T. Vetter. Com-

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.