Mathematics For Machine Learning
Lecture Notes
Authors:
Marc Deisenroth, Stefanos Zafeiriou
Contents

1 Linear Regression
   1.1 Problem Formulation
   1.2 Probabilities
      1.2.1 Means and Covariances
         1.2.1.1 Sum of Random Variables
         1.2.1.2 Affine Transformation
      1.2.2 Statistical Independence
      1.2.3 Basic Probability Distributions
         1.2.3.1 Uniform Distribution
         1.2.3.2 Bernoulli Distribution
         1.2.3.3 Binomial Distribution
         1.2.3.4 Beta Distribution
         1.2.3.5 Gaussian Distribution
         1.2.3.6 Gamma Distribution
         1.2.3.7 Wishart Distribution
      1.2.4 Conjugacy
   1.3 Probabilistic Graphical Models
      1.3.1 From Joint Distributions to Graphs
      1.3.2 From Graphs to Joint Distributions
      1.3.3 Further Reading
   1.4 Vector Calculus
      1.4.1 Scalar Differentiation
         1.4.1.1 Taylor Series
         1.4.1.2 Differentiation Rules
      1.4.2 Partial Differentiation and Gradients
      1.4.3 Gradients of Vector-Valued Functions (Vector Fields)
         1.4.3.1 Gradients of Matrices
      1.4.4 Basic Rules of Partial Differentiation
      1.4.5 Chain Rule
      1.4.6 Higher-Order Derivatives
      1.4.7 Linearization and Taylor Series
      1.4.8 Gradient Checking
   1.5 Parameter Estimation
      1.5.1 Maximum Likelihood Estimation
         1.5.1.1 Closed-Form Solution

2 Feature Extraction
   2.1 Decompositions
      2.1.1 Eigen-decomposition
         2.1.1.1 Symmetric Matrices
      2.1.2 QR decomposition
         2.1.2.1 Gram-Schmidt Process
      2.1.3 Singular Value Decomposition
         2.1.3.1 Thin SVD
         2.1.3.2 Dimensionality Reduction and SVD
      2.1.4 Principal Component Analysis
         2.1.4.1 Statistical Perspective
         2.1.4.2 Reconstruction Perspective
         2.1.4.3 Computing PCA
         2.1.4.4 Link between SVD and PCA
      2.1.5 Linear Discriminant Analysis
         2.1.5.1 The two class case
         2.1.5.2 Multi-class Case
   2.2 Computing Linear Discriminant Analysis
      2.2.1 Kernel PCA and Kernel LDA

A Appendix
   A.1 Preliminaries on Vectors and Matrices
      A.1.1 Vectors and Vector Operators
      A.1.2 Matrices and Matrix Operators
         A.1.2.1 Matrix Norms
         A.1.2.2 Matrix Multiplications
         A.1.2.3 Matrix Transposition
         A.1.2.4 Trace Operator
         A.1.2.5 Matrix Determinant
         A.1.2.6 Matrix Inverse
         A.1.2.7 Matrix Pseudo-Inverse
         A.1.2.8 Range, Null Space and Rank of a matrix
         A.1.2.9 Eigenvalues and Eigenvectors
         A.1.2.10 Positive and Negative Definite Matrices
         A.1.2.11 Triangular Matrices
         A.1.2.12 QR decomposition
   A.2 Inner Products
      A.2.1 Lengths, Distances, Orthogonality
      A.2.2 Applications of Inner Products
   A.3 Useful Matrix Identities
Introduction
These lecture notes support the course “Mathematics for Machine Learning” in the
Department of Computing at Imperial College London. The aim of the course is
to provide students the basic mathematical background and skills necessary to un-
derstand, design and implement modern statistical machine learning methodologies
and inference mechanisms. The course will focus on examples regarding the use of
mathematical tools for the design of basic machine learning and inference method-
ologies, such as Principal Component Analysis (PCA), Linear Discriminant Analysis,
Bayesian Linear Regression and Support Vector Machines (SVMs).
This course is a hard pre-requisite for the following courses:
– Chapter 1
– Chapter 2.2–2.3
– Chapter 3
– Chapter 8.1
– Chapter 12.1–12.2
• The following chapters from the book by Golub and Van Loan (2012):
For parts two and three, the relevant literature is also the lecture notes, as well as the
following chapters from the book by Golub and Van Loan (2012), the papers by Turk and
Pentland (1991) and Belhumeur et al. (1997), and the tutorials by Burges (1998) and
Lin (2006).
Chapter 1
Linear Regression
In this part of the course, we will be looking at regression problems, where we want
to find a function f that maps inputs x ∈ R^D to corresponding function values
f(x) ∈ R based on noisy observations y = f(x) + ε, where ε is a random variable.
An example is given in Fig. 1.1. A typical regression problem is given in Fig. 1.1(a):
For some values x we observe (noisy) function values y = f(x) + ε. The task is to infer
the function f that generated the data. A possible solution is given in Fig. 1.1(b).
Regression is a fundamental problem, and regression problems appear in a diverse
range of research areas and applications, including time-series analysis (e.g., sys-
tem identification), control and robotics (e.g., reinforcement learning, forward/in-
verse model learning), optimization (e.g., line searches, global optimization), deep-
learning applications (e.g., computer games, speech-to-text translation, image recog-
nition, automatic video annotation).
Finding a regression function requires solving a variety of problems. In the following,
we consider regression problems of the form
y = f(x) + ε , (1.1)
where x ∈ R^D are inputs and y ∈ R are observed targets. Furthermore, ε ∼ N(0, σ²)
is independent, identically distributed (i.i.d.) measurement noise. In this particular
case, ε is Gaussian distributed with mean 0 and variance σ². The objective is to find
a function f that is close to the unknown function that generated the data.
In this course, we focus on parametric models, i.e., we choose a parametrization of f
and find parameters that “work well”. In the most simple case, the parametrization
of linear regression is
y = f(x) + ε = x⊤θ + ε , (1.2)
where θ ∈ R^D are the parameters we seek and ε ∼ N(0, σ²).1
In this course, we will discuss in more detail how to
• Find good parameters θ
1.2 Probabilities
Probability theory is a mathematical foundation of statistics and a consistent way of
expressing uncertainty. Jaynes (2003) provides a great introduction to probability
theory.
Definition 1 (Probability Density Function)
A function p : R^D → R is called a probability density function if (1) its integral
exists, (2) ∀x ∈ R^D : p(x) ≥ 0, and (3)
∫_{R^D} p(x) dx = 1 . (1.3)
There are two fundamental rules in probability theory that appear everywhere in
machine learning and Bayesian statistics:
p(x) = ∫ p(x, y) dy    (sum rule / marginalization property) (1.4)
p(x, y) = p(y|x) p(x)    (product rule) (1.5)
1 It would be more precise to call this model "linear in the parameters". We will see later that
Φ⊤(x)θ for nonlinear transformations Φ is also a linear regression model.
2 We omit the definition of a random variable as this will become too technical for the purpose of
this course.
Here, p(x, y) is the joint distribution of the two random variables x, y, p(x), p(y)
are the corresponding marginal distributions, and p(y|x) is the conditional dis-
tribution of y given x. If we consider discrete random variables x, y, the integral
in (1.4) is replaced by a sum. This is where the name comes from.
In machine learning and Bayesian statistics, we are often interested in making infer-
ences of random variables given that we have observed other random variables. Let
us assume, we have some prior knowledge p(x) about a random variable x and some
relationship p(y|x) between x and a second random variable y. If we now observe
y, we can use Bayes’ theorem3 to draw some conclusions about x given the observed
values of y. Bayes’ theorem follows immediately from the sum and product rules
in (1.4)–(1.5) as
p(x|y) = p(y|x) p(x) / p(y) . (1.6)
Here, p(x) is the prior, which encapsulates our prior knowledge of x, and p(y|x) is
the likelihood4 that describes how y and x are related. The quantity p(y) is the
marginal likelihood or evidence and is a normalizing constant (independent of x),
which is obtained as the integral ∫ p(y|x)p(x) dx of the numerator with respect to x
and ensures that the fraction is normalized. The posterior p(x|y) expresses exactly
what we are interested in, i.e., what we know about x if we observe y.
Remark 2 (Marginal Likelihood)
Thus far, we looked at the marginal likelihood simply as a normalizing constant that
ensures that the posterior probability distribution integrates to 1. In Section 1.7 we will
see that the marginal likelihood also plays an important role in model selection.
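To make Bayes' theorem (1.6) concrete, the following minimal Python sketch performs the update for a single binary random variable; the prior and likelihood values are made up purely for illustration.

```python
import numpy as np

# Hypothetical prior over a binary variable x.
prior = np.array([0.8, 0.2])             # p(x = 0), p(x = 1)

# Hypothetical likelihood of an observation y = 1 given x.
likelihood_y1 = np.array([0.1, 0.7])     # p(y = 1 | x = 0), p(y = 1 | x = 1)

# Bayes' theorem (1.6): posterior is proportional to likelihood times prior.
unnormalized = likelihood_y1 * prior
evidence = unnormalized.sum()            # marginal likelihood p(y = 1), the normalizer
posterior = unnormalized / evidence

print("evidence p(y=1):", evidence)          # 0.22
print("posterior p(x | y=1):", posterior)    # [0.3636..., 0.6363...]
```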
3 Also called the "probabilistic inverse".
4 Also called the "measurement model".
Here, the subscript makes it explicit with respect to which variable we need to average.
The variance of a random variable x ∈ R^D with mean vector µ is defined as
V_x[x] := E_x[(x − µ)(x − µ)⊤] ∈ R^{D×D} .
This matrix is called the covariance matrix of the random variable x. The covariance
matrix is symmetric and positive semidefinite and tells us something about the spread of the
data.
The covariance matrix contains the variances of the marginals p(x_i) = ∫ p(x_1, . . . , x_D) dx_{\i}
on its diagonal, where "\i" denotes "all variables but i". The off-diagonal terms contain
the cross-covariance terms Cov[x_i, x_j] for i, j = 1, . . . , D, i ≠ j.
respectively.6 Furthermore,
Cov[x, y] = ∫ x(Ax + b)⊤ p(x) dx − E[x] E[Ax + b]⊤ (1.19)
= ∫ x p(x) dx b⊤ + ∫ x x⊤ p(x) dx A⊤ − µb⊤ − µµ⊤A⊤ (1.20)
= µb⊤ − µb⊤ + ( ∫ x x⊤ p(x) dx − µµ⊤ ) A⊤ (1.21)
= ΣA⊤ , (1.22)
where the last step uses (1.10).
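A quick way to sanity-check (1.22) is by Monte Carlo sampling. The sketch below draws samples of x and checks that E[Ax + b] ≈ Aµ + b and Cov[x, Ax + b] ≈ ΣA⊤; the particular values of µ, Σ, A and b are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary mean, covariance and affine map (illustrative values only).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of x
y = x @ A.T + b                                        # y = Ax + b for each sample

# Empirical moments vs. the analytic results E[y] = A mu + b, Cov[x, y] = Sigma A^T.
print("E[y] (empirical):", y.mean(axis=0))
print("E[y] (analytic): ", A @ mu + b)

xc = x - x.mean(axis=0)
yc = y - y.mean(axis=0)
cov_xy = xc.T @ yc / (len(x) - 1)                      # cross-covariance Cov[x, y]
print("Cov[x, y] (empirical):\n", cov_xy)
print("Sigma A^T (analytic):\n", Sigma @ A.T)
```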
Intuitively, two random variables x and y are independent if the value of y (once
known) does not add any additional information about x (and vice versa).
If x, y are (statistically) independent then
• p(y|x) = p(y)
• p(x|y) = p(x)
• Cov[x, y] = 0
If x is conditionally independent of y given z, we write x ⊥ y | z.
Figure 1.2: Examples of uniform distributions. Left: Continuous uniform distribu-
tion that distributes probability mass equally everywhere in a (compact) region. Right:
Discrete uniform distribution that assigns equal probability to four possible (discrete)
events.
Figure 1.3: The Bernoulli distribution can be used to model the binary outcome proba-
bility of a coin flip experiment.
Figure 1.4: Examples of the Binomial distribution for µ ∈ {0.1, 0.4, 0.75} and N = 15.
E[x] = µ , (1.27)
V[x] = µ(1 − µ) , (1.28)
where E[x] and V[x] are the mean and variance of the binary random variable x.
An example where the Bernoulli distribution can be used is when we are interested
in modeling the probability of “head” when flipping a coin.
Figure 1.5: Examples of the Beta distribution for different values of α and β.
governing the Bernoulli distribution). The Beta distribution itself is governed by two
parameters α > 0, β > 0 and is defined as
p(µ | α, β) = Γ(α + β) / (Γ(α)Γ(β)) µ^{α−1}(1 − µ)^{β−1} (1.32)
E[µ] = α/(α + β) ,    V[µ] = αβ / ((α + β)²(α + β + 1)) (1.33)
Note that the fraction of Gamma functions in (1.32) normalizes the Beta distribution.
Intuitively, α moves probability mass toward 1, whereas β moves probability mass
toward 0. There are some special cases (Murphy, 2012):
• For α = β = 1, we obtain the uniform distribution U[0, 1].
• For α, β < 1, we get a bimodal distribution with spikes at 0 and 1.
• For α, β > 1, the distribution is unimodal.
• For α = β > 1, the distribution is unimodal, symmetric and centered at 1/2.
of two random variables x, y, where Σxx = Cov[x, x] and Σyy = Cov[y, y] are the
marginal covariance matrices of x and y, respectively, and Σxy = Cov[x, y] is the
cross-covariance matrix between x and y.
7 Also: multivariate normal distribution.
8 We will be adopting a common, but mathematically slightly sloppy, language and call the "probability density function" a "distribution".
Note that in the computation of the mean in (1.39) the y-value is an observation
and no longer random.
Remark 3
The conditional Gaussian distribution shows up in many places, where we are interested
in posterior distributions:
• The Kalman filter (Kalman, 1960), one of the most central algorithms for state es-
timation in signal processing, does nothing but computing Gaussian conditionals
of joint distributions (Deisenroth and Ohlsson, 2011).
• Gaussian processes (Rasmussen and Williams, 2006), which are a practical im-
plementation of a distribution over functions. In a Gaussian process, we make
assumptions of joint Gaussianity of random variables. By (Gaussian) condition-
ing on observed data, we can determine a posterior distribution over functions.
• Latent linear Gaussian models (Roweis and Ghahramani, 1999; Murphy, 2012),
which include probabilistic PCA (Tipping and Bishop, 1999).
The marginal distribution p(x)9 of a joint Gaussian distribution p(x, y), see (1.37),
is itself Gaussian and computed by applying the sum-rule in (1.4) and given by
p(x) = ∫ p(x, y) dy = N(x | µ_x, Σ_xx) . (1.41)
Intuitively, looking at the joint distribution in (1.37), we ignore (i.e., integrate out)
everything we are not interested in.
Product of Gaussians The product of two Gaussians N(x | a, A) N(x | b, B) is
an unnormalized Gaussian distribution c N(x | c, C) with
C = (A^{-1} + B^{-1})^{-1} (1.42)
c = C(A^{-1}a + B^{-1}b) (1.43)
c = (2π)^{-D/2} |A + B|^{-1/2} exp( −½ (a − b)⊤(A + B)^{-1}(a − b) ) . (1.44)
y = Ax ⇔ A⊤y = A⊤Ax . (1.45)
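The product formula (1.42)–(1.44) can be checked numerically: the sketch below evaluates both sides of N(x | a, A) N(x | b, B) = c N(x | c, C) at a test point, using a small helper for the Gaussian density; all numerical values are arbitrary illustrative choices.

```python
import numpy as np

def gauss(x, m, S):
    """Density of N(x | m, S)."""
    d = len(m)
    diff = x - m
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(S) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))

# Two arbitrary Gaussians N(x | a, A) and N(x | b, B).
a, A = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
b, B = np.array([1.5, -0.5]), np.array([[0.5, 0.0], [0.0, 1.0]])

# Parameters of the (unnormalized) product, cf. (1.42)-(1.44).
C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c_mean = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
c_norm = gauss(a, b, A + B)     # the normalizer has Gaussian shape in (a - b)

# Check the identity pointwise at a test location.
x = np.array([0.3, 0.7])
lhs = gauss(x, a, A) * gauss(x, b, B)
rhs = c_norm * gauss(x, c_mean, C)
print(lhs, rhs)                 # the two values should agree
```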
Knowing that p(x + y) is Gaussian, the mean and covariance matrix can be de-
termined immediately using the results from (1.13)–(1.16). This property will be
important when we consider i.i.d. Gaussian noise acting on random variables.
1.2.4 Conjugacy
According to Bayes’ theorem (1.6), the posterior is proportional to the product of the
prior and the likelihood. The specification of the prior can be tricky for two reasons:
First, the prior should encapsulate our knowledge about the problem before we see
some data. This is often difficult to describe. Second, it is often not possible to
compute the posterior distribution analytically. However, there are some priors that
are computationally convenient: conjugate priors.
Definition 5 (Conjugate Prior)
A prior is conjugate for the likelihood function if the posterior is of the same form/type
as the prior.
i.e., the posterior distribution is a Beta distribution, like the prior; that is, the Beta prior
is conjugate for the parameter µ in the Binomial likelihood function.
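The Beta–Binomial conjugacy means the posterior update reduces to adding counts to the prior parameters. The sketch below simulates coin flips and performs the update; the prior parameters and true bias are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Beta prior on mu and simulated coin-flip data.
alpha, beta = 2.0, 2.0            # prior Beta(alpha, beta)
mu_true = 0.7                     # assumed "true" probability of heads
N = 50
flips = rng.binomial(1, mu_true, size=N)
heads = flips.sum()
tails = N - heads

# Conjugate update: Beta prior x Binomial likelihood -> Beta posterior.
alpha_post = alpha + heads
beta_post = beta + tails

posterior_mean = alpha_post / (alpha_post + beta_post)   # mean of the Beta posterior, cf. (1.33)
print(f"heads={heads}, posterior Beta({alpha_post}, {beta_post}), mean={posterior_mean:.3f}")
```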
Table 1.1 lists examples for conjugate priors for the parameters of some of the stan-
dard distributions that we discussed in this section.
The Beta distribution is the conjugate prior for the parameter µ in both the Binomial
and the Bernoulli likelihood. For a Gaussian likelihood function, we can place a con-
jugate Gaussian prior on the mean. The reason why the Gaussian likelihood appears
twice in the table is that we need to distinguish the univariate from the multivariate
case. In the univariate (scalar) case, the inverse Gamma is the conjugate prior for
the variance13 . In the multivariate case, we use a conjugate inverse Wishart distribu-
tion as a prior on the covariance matrix14 . The Dirichlet distribution is the conjugate
prior for the multinomial likelihood function. For further details, we refer to Bishop
(2006).
13 Alternatively, the Gamma prior is conjugate for the precision (inverse variance) in the Gaussian likelihood.
14 Alternatively, the Wishart prior is conjugate for the precision matrix (inverse covariance matrix) in the Gaussian likelihood.

1.3 Probabilistic Graphical Models
There are three main types of probabilistic graphical models:
• Directed graphical models (Bayesian networks)
• Undirected graphical models (Markov random fields)
• Factor graphs
Figure 1.11: Three types of graphical models: (a) Directed graphical models (Bayesian
network); (b) Undirected graphical models (Markov random field); (c) Factor graphs.
Nodes are (random) variables, edges represent probabilistic relations between vari-
ables. In this course, we will focus on directed graphical models.15
Probabilistic graphical models have some convenient properties:
• They are a simple way to visualize the structure of a probabilistic model
• They can be used to design or motivate new kinds of statistical models
• Inspection of the graph alone gives us insight into properties, e.g., conditional
independence
• Complex computations for inference and learning in statistical models can be
expressed in terms of graphical manipulations.
Figure 1.12: Directed graphical model for the factorization of the joint distribution
in (1.57).
Figure 1.13: Directed graphical model for which we seek the corresponding joint distri-
bution and its factorization.
• Each conditional depends only on the parents of the corresponding node in the
graph. For example, x4 will be conditioned on x2 .
Using these two properties, we arrive at the desired factorization of the joint distri-
bution
p(x1 , x2 , x3 , x4 , x5 ) = p(x1 )p(x5 )p(x2 |x5 )p(x3 |x1 , x2 )p(x4 |x2 ) . (1.58)
p(x) = ∏_{k=1}^{K} p(x_k | pa_k) (1.59)
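The factorization (1.58) directly suggests ancestral sampling: sample each variable given its parents, in topological order. The sketch below does this for the graph in Fig. 1.13 with binary variables; the conditional probabilities are invented purely to illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_joint():
    """One ancestral sample from p(x1) p(x5) p(x2|x5) p(x3|x1,x2) p(x4|x2), cf. (1.58).

    All variables are binary; the conditional probabilities below are hypothetical
    values chosen only to illustrate the factorization.
    """
    x1 = rng.random() < 0.3
    x5 = rng.random() < 0.6
    x2 = rng.random() < (0.8 if x5 else 0.2)            # p(x2 | x5)
    x3 = rng.random() < (0.9 if (x1 and x2) else 0.1)   # p(x3 | x1, x2)
    x4 = rng.random() < (0.7 if x2 else 0.4)            # p(x4 | x2)
    return x1, x2, x3, x4, x5

samples = [sample_joint() for _ in range(5)]
print(samples)
```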
To make the distinction between these three types easier, we introduce additional
nodes for graphical models:
To find the directed graphical model, for all (observed and unobserved) random
variables we write down all probability distributions with explicit conditioning on
the parameters/variables they depend on. In our case, we end up with:
• p(yn |xn , θ, σ)
• p(θ)
p(y_1, . . . , y_N, θ | x_1, . . . , x_N, σ) = p(θ) ∏_{n=1}^{N} p(y_n | x_n, θ, σ) . (1.61)
Now, we follow the steps Section 1.3.1 and find the graphical model in Fig. 1.14(a).
Observed random variables are shaded, deterministic parameters are dots, unob-
served random variables are "hollow". The graphical model is somewhat repetitive
because the same structure is repeated for every data point, and we can write it in a
more compact form using the plate notation in Fig. 1.14(b). The plate essentially can be read as "for n = 1, . . . , N locally
copy/repeat everything inside the plate”. Therefore, the plate replaces the dots in
Fig. 1.14(a). Note that the parameter σ for the noise and the random variable θ are
“global” and, therefore, outside the plate.
(a) Version 1 (b) Version 2 using the plate notation.
Figure 1.14: Two graphical models for linear regression. Observed random variables are
shaded, deterministic parameters are dots. (a) Graphical model without plate notation;
(b) Graphical model with plate notation, which allows for a more compact representa-
tion than (a).
(a) Online ranking with the TrueSkill system. (b) Image restoration.
Figure 1.15: Examples of message passing using graphical models: (a) Microsoft’s
TrueSkill system (Herbrich et al., 2007) is used for ranking in online video games. (b)
Image restoration (Kittler and Föglein, 1984) is used to remove noise from images.
Figure 1.16: The average incline of a function f between x0 and x0 + δx is the incline
of the secant (blue) through f (x0 ) and f (x0 + δx) and given by δy/δx.
δy/δx = ( f(x + δx) − f(x) ) / δx (1.62)
computes the slope of the secant line through two points on the graph of f . These are
the points with x-coordinates x and x + δx, see Fig. 1.16.
The difference quotient can also be considered the average slope of f between x and
x + δx if we assume f to be linear. In the limit for δx → 0, we obtain the tangent
of f at x, if f is differentiable. The tangent is then the derivative of f at x.
16 We will do this in Section 1.5.
Definition 7 (Derivative)
More formally, for h > 0 the derivative of f at x is defined as the limit
df/dx = lim_{h→0} ( f(x + h) − f(x) ) / h . (1.63)

Example
We want to compute the derivative of f(x) = x^n for n ∈ N. Using the definition of the
derivative, we obtain
df/dx = lim_{h→0} ( f(x + h) − f(x) ) / h (1.64)
= lim_{h→0} ( (x + h)^n − x^n ) / h (1.65)
= lim_{h→0} ( Σ_{i=0}^{n} \binom{n}{i} x^{n−i} h^i − x^n ) / h . (1.66)
We see that x^n = \binom{n}{0} x^{n−0} h^0. By starting the sum at 1 the x^n-term cancels, and we
obtain
df/dx = lim_{h→0} ( Σ_{i=1}^{n} \binom{n}{i} x^{n−i} h^i ) / h (1.67)
= lim_{h→0} Σ_{i=1}^{n} \binom{n}{i} x^{n−i} h^{i−1} (1.68)
= lim_{h→0} ( \binom{n}{1} x^{n−1} + Σ_{i=2}^{n} \binom{n}{i} x^{n−i} h^{i−1} ) (1.69)
= n!/(1!(n − 1)!) x^{n−1} = n x^{n−1} , (1.70)
where the remaining sum in (1.69) vanishes as h → 0.
i.e., the Taylor polynomial consists of the first n terms of the Taylor series in (1.71).
Remark 5
In general, a Taylor polynomial of degree n is an approximation of a function. The
approximation is best close to x0 . However, a Taylor polynomial of degree n is an exact
representation of a polynomial f of degree k ≤ n since all derivatives f (i) , i > k vanish.
Figure 1.17: Taylor polynomials. The original function f(x) = sin(x) + cos(x) (black,
solid) is approximated by Taylor polynomials (dashed) around x0 = 0. Higher-order
Taylor polynomials approximate the function f better and more globally. T10 is already
very close to f in [−4, 4].
T6(x) = (1 − 4 + 6 − 4 + 1) + x(4 − 12 + 12 − 4) + x²(6 − 12 + 6) + x³(4 − 4) + x⁴ (1.82)
= x⁴ = f(x) , (1.83)
We can see a pattern here: The coefficients in our Taylor series are only ±1 (since
sin(0) = 0), each of which occurs twice before switching to the other one. Further-
more, f (k+4) (0) = f (k) (0).
Therefore, the full Taylor series expansion of f at x0 = 0 is given by
f(x) = Σ_{k=0}^{∞} f^{(k)}(x0)/k! (x − x0)^k (1.91)
= 1 + x − 1/2! x² − 1/3! x³ + 1/4! x⁴ + 1/5! x⁵ − · · · (1.92)
= 1 − 1/2! x² + 1/4! x⁴ ∓ · · · + x − 1/3! x³ + 1/5! x⁵ ∓ · · · (1.93)
= Σ_{k=0}^{∞} (−1)^k 1/(2k + 1)! x^{2k+1} + Σ_{k=0}^{∞} (−1)^k 1/(2k)! x^{2k} (1.94)
= sin(x) + cos(x) (1.95)
Figure 1.17 shows the corresponding first Taylor polynomials Tn for n = 0, 1, 5, 10.
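The quality of these approximations can be checked numerically. The short Python sketch below evaluates T_n for n = 0, 1, 5, 10 on a grid in [−4, 4] and compares against sin(x) + cos(x); the grid and the way the derivative pattern 1, 1, −1, −1 is encoded are implementation choices for this illustration.

```python
import numpy as np
from math import factorial

def f(x):
    return np.sin(x) + np.cos(x)

def taylor_poly(x, n, x0=0.0):
    """Evaluate the Taylor polynomial T_n of sin + cos around x0 = 0.

    The k-th derivative of sin + cos at 0 cycles through 1, 1, -1, -1.
    """
    derivs = [1.0, 1.0, -1.0, -1.0]
    return sum(derivs[k % 4] / factorial(k) * (x - x0) ** k for k in range(n + 1))

xs = np.linspace(-4, 4, 9)
for n in (0, 1, 5, 10):
    err = np.max(np.abs(f(xs) - taylor_poly(xs, n)))
    print(f"T_{n}: max error on the [-4, 4] grid = {err:.4f}")
```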
where n is the number of variables and 1 is the dimension of the image of f .18 Here, we
used the compact vector notation x = [x1 , . . . , xn ]⊤ .
Example
Consider the function f(x, y) = x² + y² (see Fig. 1.18). We obtain the partial derivative
∂f/∂x by treating y as a constant and computing the derivative of f with respect to x.
We then obtain
∂f(x, y)/∂x = 2x . (1.106)
Similarly, we obtain the partial derivative of f with respect to y as
∂f(x, y)/∂y = 2y . (1.107)

Example
For f(x, y) = (x + 2y³)², we obtain the partial derivatives
∂f(x, y)/∂x = 2(x + 2y³) ∂/∂x (x + 2y³) = 2(x + 2y³) , (1.108)
∂f(x, y)/∂y = 2(x + 2y³) ∂/∂y (x + 2y³) = 12(x + 2y³) y² . (1.109)
From (1.102), we know that we obtain the gradient of f with respect to a vector
as the row vector of the partial derivatives. In (1.111), every partial derivative is a
column vector. Therefore, we obtain the gradient of f : Rn → Rm with respect to
x ∈ Rn as
df(x)/dx = [ ∂f(x)/∂x_1  · · ·  ∂f(x)/∂x_n ] (1.112)
= ( ∂f_1(x)/∂x_1 · · · ∂f_1(x)/∂x_n ; . . . ; ∂f_m(x)/∂x_1 · · · ∂f_m(x)/∂x_n ) ∈ R^{m×n} . (1.113)
Definition 11 (Jacobian)
The matrix (or vector) of all first-order partial derivatives of a vector-valued function
f : Rn → Rm is called the Jacobian. The Jacobian J is an m × n matrix, which we
To solve this hard problem, let us first write down what we already know: We know
that the gradient has the dimensions
dK/dL ∈ R^{(n×n)×(m×n)} , (1.117)
which is a tensor. If we compute the partial derivative of f with respect to a single
entry L_ij, i ∈ {1, . . . , m}, j ∈ {1, . . . , n}, of L, we obtain an n × n matrix
∂K/∂L_ij ∈ R^{n×n} . (1.118)
19 e.g., by stacking the columns of the matrix ("flatten")
(a) Approach 1: We compute the partial derivatives ∂A/∂x_1, ∂A/∂x_2, ∂A/∂x_3, each of which
is a 4 × 2 matrix, and collate them in a 4 × 2 × 3 tensor. (b) Approach 2: We re-shape
(flatten) A ∈ R^{4×2} into a vector Ā ∈ R^8. Then, we compute the gradient dĀ/dx ∈ R^{8×3}.
We obtain the gradient tensor by re-shaping this gradient as illustrated above.
When we now compute the partial derivative ∂K_pq/∂L_ij, we obtain
∂K_pq/∂L_ij = Σ_{k=1}^{m} ∂/∂L_ij ( L_kp L_kq ) = ∂_pqij , (1.121)
∂_pqij = L_iq if j = p, p ≠ q ;
         L_ip if j = q, p ≠ q ;
         2L_iq if j = p, p = q ;
         0 otherwise. (1.122)
From (1.117), we know that the desired gradient has the dimension (n × n) × (m × n),
and every single entry of this tensor is given by ∂_pqij in (1.122), where p, q, j =
1, . . . , n and i = 1, . . . , m.
∂f(X)⊤/∂X = ( ∂f(X)/∂X )⊤ (1.126)
∂ tr(f(X))/∂X = tr( ∂f(X)/∂X ) (1.127)
∂ det(f(X))/∂X = det(f(X)) tr( f^{-1}(X) ∂f(X)/∂X ) (1.128)
∂f^{-1}(X)/∂X = −f^{-1}(X) ( ∂f(X)/∂X ) f^{-1}(X) (1.129)
∂(a⊤X^{-1}b)/∂X = −(X^{-1})⊤ a b⊤ (X^{-1})⊤ (1.130)
∂(x⊤a)/∂x = a⊤ (1.131)
∂(a⊤x)/∂x = a⊤ (1.132)
∂(a⊤Xb)/∂X = a b⊤ (1.133)
∂(x⊤Bx)/∂x = x⊤(B + B⊤) (1.134)
∂/∂s (x − As)⊤W(x − As) = −2(x − As)⊤WA for symmetric W (1.135)
20 product rule: (fg)′ = f′g + fg′, sum rule: (f + g)′ = f′ + g′, chain rule: (g ◦ f)′ = g′(f)f′
Example
Consider f(x_1, x_2) = x_1² + 2x_2, where x_1 = sin t and x_2 = cos t, then
df/dt = ∂f/∂x_1 dx_1/dt + ∂f/∂x_2 dx_2/dt (1.137)
= 2 sin t d(sin t)/dt + 2 d(cos t)/dt (1.138)
= 2 sin t cos t − 2 sin t = 2 sin t (cos t − 1) (1.139)
Example
Consider the function h : R → R, h(t) = (f ◦ g)(t) with
f : R² → R , (1.143)
g : R → R² , (1.144)
f(x) = exp(x_1 x_2²) , (1.145)
x = g(t) = [ t cos t , t sin t ]⊤ (1.146)
and compute the gradient of h with respect to t.
Since f : R² → R and g : R → R², we note that
∂f/∂x ∈ R^{1×2} , (1.147)
∂g/∂t ∈ R^{2×1} . (1.148)
The desired gradient is computed by applying the chain rule:
dh/dt = (∂f/∂x)(dx/dt) (1.149)
= [ ∂f/∂x_1  ∂f/∂x_2 ] [ dx_1/dt ; dx_2/dt ] (1.150)
= [ exp(x_1 x_2²) x_2²   2 exp(x_1 x_2²) x_1 x_2 ] [ cos t − t sin t ; sin t + t cos t ] (1.151)
= exp(x_1 x_2²) ( x_2² (cos t − t sin t) + 2 x_1 x_2 (sin t + t cos t) ) (1.152)
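The chain-rule result (1.152) can be verified against a finite-difference approximation of dh/dt; the test value t = 0.8 and the step size are arbitrary choices for this sketch.

```python
import numpy as np

def g(t):
    return np.array([t * np.cos(t), t * np.sin(t)])

def f(x):
    return np.exp(x[0] * x[1] ** 2)

def dh_dt_analytic(t):
    """Gradient from (1.152)."""
    x1, x2 = g(t)
    return np.exp(x1 * x2 ** 2) * (x2 ** 2 * (np.cos(t) - t * np.sin(t))
                                   + 2 * x1 * x2 * (np.sin(t) + t * np.cos(t)))

# Central-difference approximation of dh/dt as a sanity check.
t, eps = 0.8, 1e-6
numeric = (f(g(t + eps)) - f(g(t - eps))) / (2 * eps)
print(dh_dt_analytic(t), numeric)   # both values should agree closely
```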
We seek ∂L/∂θ, and we will use the chain rule for this purpose.
Before we start any calculation, we determine the dimensionality of the gradient as
∂L/∂θ ∈ R^{1×D} . (1.156)
The chain rule allows us to compute the gradient as
∂L/∂θ = (∂L/∂e)(∂e/∂θ) . (1.157)
The dot product yields ‖e‖² = e⊤e (see Appendix A.2), and by exploiting the product
rule and (1.131) we determine
∂L/∂e = 2e⊤ ∈ R^{1×N} . (1.158)
Furthermore, we obtain
∂e/∂θ = −Φ ∈ R^{N×D} , (1.159)
such that our desired derivative is
∂L/∂θ = −2e⊤Φ = −2(y⊤ − θ⊤Φ⊤) Φ ∈ R^{1×D} , (1.160)
where (y⊤ − θ⊤Φ⊤) is 1 × N and Φ is N × D.
Remark 8
We would have obtained the same result without using the chain rule by immediately
looking at the function
This approach is still practical for simple functions like L2 but becomes impractical if
we consider deep function compositions.
Remark 9
When we look at deeply nested function compositions f_K ◦ f_{K−1} ◦ · · · ◦ f_1, writing out
the full function is tedious. Furthermore, for programming purposes the chain rule is
extremely useful: When we write functions/methods for every f_i that return the partial
derivative of its outputs with respect to its inputs, the total derivative with respect to the
input is just the product of the partial derivatives returned by the individual functions.
If we then decide to modify f_i into f̃_i, we simply have to write a function that computes
the partial derivative of f̃_i and use this in the product of partial derivatives (instead of
re-deriving the total derivative from scratch).
where x are the inputs (e.g., images), y are the observations (e.g., class labels) and
every function fi , i = 1, . . . , K possesses its own parameters. In neural networks with
multiple layers, we have functions fi (x) = σ(Ai xi−1 + bi ) in the ith layer, where xi−1
is the output of layer i − 1 and σ an activation function, e.g., the logistic sigmoid
1/(1 + e^{−x}), tanh or a rectified linear unit (ReLU). In order to train these models, we have
to compute the gradient of a loss function with respect to the inputs of each layer (e.g.,
xi−1 ) to obtain the partial derivative with respect to the parameters of the previous layer
(e.g., Ai−1 , bi−1 ). There are efficient ways of implementing this repeated application of
the chain-rule using backpropagation (Kelley, 1960; Bryson, 1961; Dreyfus, 1962;
Rumelhart et al., 1986).
• ∂^n f/∂x^n is the nth partial derivative of f with respect to x
• ∂²f/∂x∂y is the partial derivative obtained by first partially differentiating with respect to x and then with respect to y
• ∂²f/∂y∂x is the partial derivative obtained by first partially differentiating with respect to y and then with respect to x
∂²f/∂x∂y = ∂²f/∂y∂x , (1.163)
i.e., the order of differentiation does not matter, and the corresponding Hessian
matrix (the matrix of second partial derivatives)
H = ( ∂²f/∂x²  ∂²f/∂x∂y ; ∂²f/∂x∂y  ∂²f/∂y² ) (1.164)
Here (∇x f )(x0 ) is the gradient of f with respect to x, evaluated at x0 . Note that (1.165)
is equivalent to the first two terms in the multi-variate Taylor-series expansion of
f at x0 .
Definition 12 (Multivariate Taylor Series)
For the multivariate Taylor series, we consider a function
f : R^D → R (1.166)
x ↦ f(x) , x ∈ R^D , (1.167)
that is smooth at x0.
When we define δ := x − x0, the Taylor series of f at x0 is defined as
f(x) = Σ_{k=0}^{∞} D_x^k f(x0)/k! δ^k , (1.168)
where D_x^k f(x0) is the k-th (total) derivative of f with respect to x, evaluated at x0.
The Taylor polynomial of degree n of f at x0 contains the first n + 1 components of
the series in (1.168) and is defined as
T_n = Σ_{k=0}^{n} D_x^k f(x0)/k! δ^k . (1.169)
Figure 1.20: Visualizing outer products. Outer products of vectors increase the dimen-
sionality of the array by 1 per term.
Remark 12 (Notation)
In (1.168) and (1.169), we used the slightly sloppy notation of δ^k, which is not defined
for vectors x ∈ R^D, D > 1, and k > 1. Note that both D_x^k f and δ^k are k-th order
tensors, i.e., k-dimensional arrays. The k-th order tensor δ^k ∈ R^{D×D×···×D} (k times) is
obtained as a k-fold outer product, denoted by ⊗, of the vector δ ∈ R^D. For example,
Figure 1.20 visualizes two such outer products. In general, we obtain the following
terms in the Taylor series:
D_x^k f(x0) δ^k = Σ_a · · · Σ_k D_x^k f(x0)[a, . . . , k] δ[a] · · · δ[k] , (1.172)
Now that we defined the Taylor series for vector fields, let us explicitly write down the
first terms Dxk f (x0 )δ k of the Taylor series expansion for k = 0, . . . , 3 and δ := x − x0 :
Example
Consider the function
f(x, y) = x² + 2xy + y³ . (1.177)
We want to compute the Taylor series expansion of f at (x0, y0) = (1, 2). Before
we start, let us discuss what to expect: The function in (1.177) is a polynomial
of degree 3. We are looking for a Taylor series expansion, which itself is a linear
combination of polynomials. Therefore, we do not expect the Taylor series expansion
to contain terms of fourth or higher order to express a third-order polynomial. This
means, it should be sufficient to determine the first four terms of (1.168) for an exact
alternative representation of (1.177).
To determine the Taylor series expansion, we start with the constant term and the
first-order derivatives, which are given by
f(1, 2) = 13 (1.178)
∂f/∂x = 2x + 2y ⇒ ∂f/∂x (1, 2) = 6 (1.179)
∂f/∂y = 2x + 3y² ⇒ ∂f/∂y (1, 2) = 14 . (1.180)
⇒ D¹_{x,y} f(1, 2) = ∇_{x,y} f(1, 2) = [ ∂f/∂x (1, 2)  ∂f/∂y (1, 2) ] = [ 6  14 ] ∈ R^{1×2} (1.181)
⇒ D¹_{x,y} f(1, 2)/1! δ = [ 6  14 ] [ x − 1 ; y − 2 ] = 6(x − 1) + 14(y − 2) . (1.182)
Note that D¹_{x,y} f(1, 2)δ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by
∂²f/∂x² = 2 ⇒ ∂²f/∂x² (1, 2) = 2 (1.183)
∂²f/∂y² = 6y ⇒ ∂²f/∂y² (1, 2) = 12 (1.184)
∂²f/∂x∂y = 2 ⇒ ∂²f/∂x∂y (1, 2) = 2 (1.185)
∂²f/∂y∂x = 2 ⇒ ∂²f/∂y∂x (1, 2) = 2 . (1.186)
When we collect the second-order partial derivatives, we obtain the Hessian
H = ( ∂²f/∂x²  ∂²f/∂x∂y ; ∂²f/∂y∂x  ∂²f/∂y² ) = ( 2  2 ; 2  6y ) (1.187)
⇒ H(1, 2) = ( 2  2 ; 2  12 ) ∈ R^{2×2} , (1.188)
such that the next term of the Taylor-series expansion is given by
D²_{x,y} f(1, 2)/2! δ² = ½ δ⊤ H(1, 2) δ (1.189)
= ½ [ x − 1  y − 2 ] ( 2  2 ; 2  12 ) [ x − 1 ; y − 2 ] (1.190)
= (x − 1)² + 2(x − 1)(y − 2) + 6(y − 2)² (1.191)
Here, D²_{x,y} f(1, 2)δ² contains only quadratic terms, i.e., second-order polynomials.
The third-order derivatives are obtained as
D³_x f = [ ∂H/∂x  ∂H/∂y ] ∈ R^{2×2×2} , (1.192)
D³_x f[:, :, 1] = ∂H/∂x = ( ∂³f/∂x³  ∂³f/∂x∂y∂x ; ∂³f/∂y∂x²  ∂³f/∂y²∂x ) , (1.193)
D³_x f[:, :, 2] = ∂H/∂y = ( ∂³f/∂x²∂y  ∂³f/∂x∂y² ; ∂³f/∂y∂x∂y  ∂³f/∂y³ ) . (1.194)
Since most second-order partial derivatives in the Hessian in (1.187) are constant,
the only non-zero third-order partial derivative is
∂³f/∂y³ = 6 ⇒ ∂³f/∂y³ (1, 2) = 6 . (1.195)
Higher-order derivatives and the mixed derivatives of degree 3 (e.g., ∂³f/∂x²∂y) vanish,
such that
D³_{x,y} f[:, :, 1] = ( 0  0 ; 0  0 ) ,    D³_{x,y} f[:, :, 2] = ( 0  0 ; 0  6 ) (1.196)
and
D³_{x,y} f(1, 2)/3! δ³ = (y − 2)³ , (1.197)
which collects all cubic terms (third-order polynomials) of the Taylor series.
Overall, the (exact) Taylor series expansion of f at (x0, y0) = (1, 2) is
f(x) = f(1, 2) + D¹_{x,y} f(1, 2) δ + D²_{x,y} f(1, 2)/2! δ² + D³_{x,y} f(1, 2)/3! δ³ (1.198)
= f(1, 2) (1.199)
+ ∂f(1, 2)/∂x (x − 1) + ∂f(1, 2)/∂y (y − 2) (1.200)
+ 1/2! ( ∂²f(1, 2)/∂x² (x − 1)² + ∂²f(1, 2)/∂y² (y − 2)² + 2 ∂²f(1, 2)/∂x∂y (x − 1)(y − 2) ) (1.201)
+ 1/6 ∂³f(1, 2)/∂y³ (y − 2)³ (1.202)
= 13 + 6(x − 1) + 14(y − 2) + (x − 1)² + 6(y − 2)² + 2(x − 1)(y − 2) + (y − 2)³ . (1.203)
In this case, we obtained an exact Taylor series expansion of the polynomial in (1.177),
i.e., the polynomial in (1.203) is equivalent to the original polynomial in (1.177).
In this particular example, this result is not surprising since the original function
was a third-order polynomial, which we expressed through a linear combination of
constant terms, first-order, second order and third-order polynomials in (1.203).
Figure 1.21: Finite and central difference approximation of the gradient. The gradient
f ′ (x0 ) is shown in gray. The finite-difference approximation is shown in red, the central
differences approximation is shown in blue. In this illustration, only the slope is impor-
tant. Central differences yields a better approximation to the ground truth than finite
differences.
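A generic gradient check based on central differences can be written in a few lines. The sketch below compares an analytic gradient with the central-difference approximation for the simple test function f(x) = x⊤x; the function, test point, and step size are illustrative choices.

```python
import numpy as np

def gradient_check(f, x, grad, eps=1e-6):
    """Compare an analytic gradient `grad` of f at x with central differences."""
    x = np.asarray(x, dtype=float)
    approx = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        approx[i] = (f(x + d) - f(x - d)) / (2 * eps)   # central differences
    return np.max(np.abs(approx - grad))

# Example: f(x) = x^T x has gradient 2x.
x0 = np.array([1.0, -2.0, 0.5])
err = gradient_check(lambda x: x @ x, x0, 2 * x0)
print("max deviation:", err)   # should be tiny
```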
Figure 1.22: Probabilistic graphical model for linear regression.
The corresponding graphical model is given in Fig. 1.22.
where we exploited that the likelihood factorizes over the number of data points due
to our independence assumption on the training set.
In the linear regression model (1.2) the likelihood is Gaussian (due to the Gaussian
additive noise term), such that we arrive at
−log p(y_i | x_i, θ) = 1/(2σ²) (y_i − x_i⊤θ)² + const , (1.211)
where the constant includes all terms independent of θ. Using (1.211) in the negative
log-likelihood (1.210) we obtain (ignoring the constant terms)
L(θ) := −log p(y | X, θ) = 1/(2σ²) Σ_{i=1}^{N} (y_i − x_i⊤θ)² (1.212)
= 1/(2σ²) (y − Xθ)⊤(y − Xθ) = 1/(2σ²) ‖y − Xθ‖² , (1.213)
X := [ x_1⊤ ; . . . ; x_N⊤ ] ∈ R^{N×D} , (1.214)
23 Note that the logarithm is a (strictly) monotonically increasing function.
where X is called the design matrix.24 In (1.213) we replaced the sum of squared
terms with the squared norm25 of the difference term y − Xθ.
Remark 14
In machine learning, the negative log likelihood function is also called an error func-
tion.
Now that we have a concrete form of the negative log-likelihood function, we need
to optimize it. We will discuss two approaches, both of which are based on gradients.
y = φ⊤(x)θ + ε , (1.221)
where φ(x) = [1, x, x², . . . , x^K]⊤ ∈ R^{K+1} . (1.222)
This means, we "lift" the original 1D input space into a (K + 1)-dimensional feature
space consisting of monomials. With these features, we can model polynomials of
degree ≤ K within the framework of linear regression: A polynomial of degree K is
given by
f(x) = Σ_{i=0}^{K} θ_i x^i = φ⊤(x)θ . (1.223)
When we consider the training data x_i, y_i, i = 1, . . . , N, and define the feature (design)
matrix
Φ = ( φ_0(x_1)  φ_1(x_1)  · · ·  φ_K(x_1) ; φ_0(x_2)  φ_1(x_2)  · · ·  φ_K(x_2) ; . . . ; φ_0(x_N)  · · ·  · · ·  φ_K(x_N) ) ,    Φ_ij = φ_j(x_i) , (1.224)
for the linear regression problem with nonlinear features defined in (1.221).
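The following sketch builds the polynomial feature matrix Φ from (1.224) for toy data and computes the maximum likelihood parameters; θ_ML = (Φ⊤Φ)^{-1}Φ⊤y is the closed-form solution referred to as (1.226), and np.linalg.lstsq solves the same least-squares problem more stably. The data-generating function and all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data in the spirit of the running example.
N = 20
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=N)

def poly_features(x, K):
    """Feature matrix Phi with Phi[i, j] = x_i**j for j = 0..K, cf. (1.224)."""
    return np.vander(x, K + 1, increasing=True)

K = 4
Phi = poly_features(x, K)

# Maximum likelihood estimate (closed-form least-squares solution).
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

x_test = np.linspace(-5, 5, 5)
print(poly_features(x_test, K) @ theta_ml)   # predicted function values phi(x*)^T theta_ML
```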
1.5.1.4 Properties
The maximum likelihood estimate θ ∗ possesses the following properties:
• Asymptotic consistency: The MLE converges to the true value in the limit of in-
finitely many observations, plus a random error that is approximately normal.
• The size of the samples necessary to achieve these properties can be quite large.
• The error’s variance decays in 1/N where N is the number of data points.
• Especially, in the “small” data regime, maximum likelihood estimation can lead
to overfitting.
Let us consider the data set in Fig. 1.23(a). The data set consists of N = 20 pairs
(x_i, y_i), where x_i ∼ U[−5, 5] and y_i = −sin(x_i/5) + cos(x_i) + ε, where ε ∼ N(0, 0.2²).
We fit a polynomial of degree M = 4 using maximum likelihood estimation (i.e.,
the parameters are given in (1.226)). The maximum likelihood estimate yields an
expected function value φ(x_*)⊤θ_ML at a new test location x_*. The result is shown in
Fig. 1.23(b).
1.5.2 Overfitting
We have seen that we can use maximum likelihood estimation to fit linear models
(e.g., polynomials) to data. We can evaluate the quality of the model by comput-
ing the error/loss incurred. One way of doing this is to compute the negative log-
likelihood (1.210), which we minimized to determine the MLE. Alternatively, given
that the noise parameter σ 2 is not a free parameter, we can ignore the scaling by
Figure 1.23: Polynomial regression. (a) Data set consisting of (x_i, y_i) pairs, i =
1, . . . , 20. (b) Maximum likelihood polynomial of degree 4.
One such error measure is the root mean squared error (RMSE)
√( 1/N Σ_{n=1}^{N} (y_n − φ⊤(x_n)θ)² ) , (1.228)
which (a) allows us to compare errors of data sets with different sizes27 and (b) has
the same scale and the same units as the observed function values y_i.28 Note that
the division by σ² makes the log-likelihood "unit-free".
We can use the RMSE (or the log-likelihood) to determine the best degree of the
polynomial by finding the value M , such that the error is minimized. Given that
the polynomial degree is a natural number, we can perform a brute-force search and
enumerate all (reasonable) values of M .29
Fig. 1.24 shows a number of polynomial fits determined by maximum likelihood. We
notice that polynomials of low degree (e.g., constants (M = 0) or linear (M = 1)) fit
the data poorly and, hence, are poor representations of the true underlying function.
For degrees M = 4, . . . , 9 the fits look plausible and smoothly interpolate the data.
When we go to higher-degree polynomials, we notice that they fit the data better
and better—in the extreme case of M = N − 1 = 19, the function passes through
every single data point. However, these high-degree polynomials oscillate wildly and
are a poor representation of the underlying function that generated the data.30 The
property of the polynomials fitting the noise structure is called overfitting.
27 The RMSE is normalized.
28 Assume we fit a model that maps post-codes (x is given in latitude, longitude) to house prices (y-values are GBP). Then, the RMSE is also measured in GBP, whereas the squared error is given in GBP².
29 For a training set of size N it is sufficient to test 0 ≤ M ≤ N − 1.
30 Note that the noise variance σ² > 0.
Figure 1.24: Polynomial fits of different degrees M determined by maximum likelihood
(each panel shows the data and the corresponding maximum likelihood estimate).
Remember that the goal is to achieve good generalization by making accurate pre-
dictions for new (unseen) data. We obtain some quantitative insight into the de-
pendence of the generalization performance on the polynomial of degree M by con-
sidering a separate test set comprising 100 data points generated using exactly the
same procedure used to generate the training set, but with new choices for the ran-
dom noise values included in the target values. As test inputs, we chose a linear
grid of 100 points in the interval of [−5, 5]. For each choice of M , we evaluate the
RMSE (1.228) for both the training data and the test data.
Looking now at the test error, which is a qualitative measure of the generalization
properties of the corresponding polynomial, we notice that initially the test error
decreases, see Fig. 1.25 (red). For fourth-order polynomials the test error is rela-
tively low and stays relatively constant up to degree 11. However, from degree 12
onward the test error increases significantly, and high-order polynomials have very
Figure 1.25: Training error (blue) and test error (red), measured as RMSE, as a function
of the degree M of the polynomial.
bad generalization properties. In this particular example, this also is evident from
the corresponding maximum likelihood fits in Fig. 1.24. Note that the training error
(blue curve in Fig. 1.25) never increases as a function of M . In our example, the
best generalization (the point of the smallest test error) is achieved for a polynomial
of degree M = 6.
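The train/test behaviour described above is easy to reproduce. The sketch below fits polynomials of increasing degree by maximum likelihood and prints training and test RMSE; the data-generating function, noise level, and grid are illustrative choices mirroring the running example.

```python
import numpy as np

rng = np.random.default_rng(5)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Training set of 20 points and a separate test set on a linear grid.
x_train = rng.uniform(-5, 5, size=20)
y_train = -np.sin(x_train / 5) + np.cos(x_train) + 0.2 * rng.normal(size=20)
x_test = np.linspace(-5, 5, 100)
y_test = -np.sin(x_test / 5) + np.cos(x_test) + 0.2 * rng.normal(size=100)

for M in range(0, 12):
    Phi_train = np.vander(x_train, M + 1, increasing=True)
    Phi_test = np.vander(x_test, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)   # maximum likelihood fit
    print(f"M={M:2d}  train RMSE={rmse(y_train, Phi_train @ theta):.3f}  "
          f"test RMSE={rmse(y_test, Phi_test @ theta):.3f}")
```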
It is a bit counter-intuitive that a polynomial of degree M = 19 is a worse approxi-
mation than a polynomial of degree M = 4, which is a special case of a 19th-order
polynomial (by setting all higher coefficients to 0). However, a 19th-order polyno-
mial can also describe many more functions, i.e., it is a much more flexible model.
In the data set we considered, the observations yn were noisy (i.i.d. Gaussian). A
polynomial of a high degree will use its flexibility to model random disturbances as
systematic/structural properties of the underlying function. Overfitting can be seen
as a general problem of maximum likelihood estimation (Bishop, 2006). Assuming
we had noise-free data, overfitting does not occur, which is also revealed by the test
error, see Fig. 1.26.
1.5.3 Regularization
Figure 1.26: Maximum likelihood fit of a polynomial of degree 19 to noise-free data
(left) and the corresponding training and test errors (RMSE) as a function of the
polynomial degree (right).
where the second term is the regularizer, and λ ≥ 0 controls the “strictness” of the
regularization.31
p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X) . (1.230)
The parameter vector θ MAP that maximizes the posterior (1.230) is called the maxi-
mum a-posteriori (MAP) estimate.
To find the MAP estimate, we follow steps that are similar in flavor to maximum
likelihood estimation. We start with the log-transform and compute the log-posterior
as
log p(θ | X, y) = log p(y | X, θ) + log p(θ) + const , (1.231)
where the constant includes the terms independent of θ. We see that the log-posterior
in (1.231) consists of the log-likelihood log p(y | X, θ) and the log-prior log p(θ).
31 Instead of the 2-norm, we can choose a p-norm ‖·‖_p. In practice, smaller values for p lead to
sparser solutions. Here, "sparse" means that many parameter values θ_i = 0, which is also useful for
variable selection. For p = 1, the regularizer is called LASSO (least absolute shrinkage and selection
operator) and was proposed by Tibshirani (1996).
and we recover exactly the regularization term in (1.229). This means that for a
quadratic regularization, the regularization parameter λ in (1.229) corresponds to
twice the precision (inverse variance) of the Gaussian (isotropic) prior p(θ). The log-
prior in (1.231) plays the role of a regularizer that penalizes implausible values, i.e.,
values that are unlikely under the prior.
To find the MAP estimate θ MAP , we minimize the negative log-posterior with respect
to θ, i.e., we solve
51
⊤ σ2
⊤
⇔θ Φ Φ + 2 I = y⊤Φ (1.240)
b
−1
⊤ ⊤ ⊤ σ2
⇔θ = y Φ Φ Φ + 2 I (1.241)
b
2
−1
⊤ σ
⇔θ MAP = Φ Φ + 2 I Φ⊤ y . (1.242)
b
Comparing the MAP estimate in (1.242) with the maximum likelihood estimate in
(1.226) we see that the only difference between both solutions is the additional term
(σ²/b²) I in the inverse matrix.32 This term ensures that the inverse exists and serves
as a regularizer.
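The effect of the additional term can be seen by computing both estimates side by side. The sketch below fits a high-degree polynomial with and without the (σ²/b²)I term; the noise variance σ², prior variance b², and polynomial degree are assumed illustrative values.

```python
import numpy as np

rng = np.random.default_rng(6)

# Polynomial regression setup (illustrative values).
x = rng.uniform(-5, 5, size=20)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=20)
Phi = np.vander(x, 10, increasing=True)     # degree-9 polynomial features

sigma2 = 0.2 ** 2                           # noise variance sigma^2
b2 = 1.0 ** 2                               # variance b^2 of the isotropic Gaussian prior
D = Phi.shape[1]

# Maximum likelihood estimate vs. MAP estimate, cf. (1.226) and (1.242).
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma2 / b2) * np.eye(D), Phi.T @ y)

print("||theta_ML|| =", np.linalg.norm(theta_ml))
print("||theta_MAP|| =", np.linalg.norm(theta_map))   # smaller: the prior shrinks the parameters
```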
32 Φ⊤Φ is positive semidefinite and the additional term is strictly positive definite, such that all
eigenvalues of the sum are positive and the matrix is invertible.
Figure 1.28: Gradient descent can lead to zigzagging and slow convergence.
1.6.1 Stepsize
Choosing a good stepsize is important in gradient descent: If the stepsize (also called
the learning rate) is too small, gradient descent can be slow. If the stepsize is chosen
too large, gradient descent can overshoot, fail to converge, or even diverge.
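The stepsize trade-off is visible even on a simple quadratic. The sketch below runs gradient descent on L(θ) = ½θ⊤Aθ for an assumed ill-conditioned A; with the chosen eigenvalues, stepsizes above 0.2 diverge, very small ones converge slowly, and intermediate ones work well.

```python
import numpy as np

# Minimize the quadratic L(theta) = 0.5 * theta^T A theta, whose gradient is A theta.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])   # elongated bowl: large stepsizes overshoot along the steep direction

def grad(theta):
    return A @ theta

for stepsize in (0.01, 0.1, 0.21):
    theta = np.array([1.0, 1.0])
    for _ in range(100):
        theta = theta - stepsize * grad(theta)   # gradient descent update
    print(f"stepsize {stepsize}: theta after 100 steps = {theta}")
```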
Remark 17
Gradient descent is rarely used for solving linear equation systems Ax = b. Instead
other algorithms with better convergence properties, e.g., conjugate gradient descent,
are used. The speed of convergence of gradient descent depends on the maximal and
minimal eigenvalues of A, while the speed of convergence of conjugate gradients has
a more complex dependence on the eigenvalues, and can benefit from preconditioning.
Gradient descent also benefits from preconditioning, but this is not done as commonly.
For further information on gradient descent, pre-conditioning and convergence we refer
to CO-477.
where α ∈ [0, 1]. Due to the moving average of the gradient, momentum-based
methods are particularly useful when the gradient itself is only a (noisy) estimate.
We will discuss stochastic approximations to the gradient in the following.
where θ is the parameter vector of interest, i.e., we want to find θ that minimizes L.
An example is the negative log-likelihood
X
L(θ) = − log p(yk |xk , θ) (1.250)
k
in a regression setting, where xk ∈ RD are the training inputs, yk are the training
targets and θ are the parameters of the regression model.
Standard gradient descent, as introduced previously, is a “batch” optimization method,
i.e., optimization is performed using the full training set by updating the parameters
according to
X
θ i+1 = θ i − γi ∇L(θ i ) = θ i − γi ∇Lk (θ i ) (1.251)
k
for a suitable stepsize parameter γi . Evaluating the sum-gradient may require ex-
pensive evaluations of the gradients from all summand functions. When the training
set is enormous (commonly the case in Deep Learning) and/or no simple formu-
las exist, evaluating the sums of gradients becomes very expensive: Evaluating the
gradient requires evaluating the gradients of all summands. To economize on the
computational cost at every iteration, stochastic gradient descent samples a sub-
set of summand functions (“mini-batch”) at every step. In an extreme case, the
sum in (1.249) is approximated by a single summand (randomly chosen), and the
parameters are updated according to
θ i+1 = θ i − γ∇Lk (θ i ) . (1.252)
In practice, it is good to keep the size of the mini-batch as large as possible to (a)
reduce the variance in the parameter update33 and (b) allow for the computation
to take advantage of highly optimized matrix operations that should be used in a
well-vectorized computation of the cost and gradient.
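A minimal mini-batch SGD sketch for a linear regression loss is given below; the data set, batch size, stepsize, and number of iterations are assumptions chosen only to make the example run.

```python
import numpy as np

rng = np.random.default_rng(7)

# Linear regression loss L(theta) = sum_k (y_k - x_k^T theta)^2, minimized with mini-batch SGD.
N, D = 1000, 5
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(D)
stepsize = 0.01
batch_size = 32

for step in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)   # sample a mini-batch
    e = y[idx] - X[idx] @ theta
    grad = -2 * X[idx].T @ e / batch_size                  # stochastic gradient estimate
    theta = theta - stepsize * grad                        # SGD update, cf. (1.252)

print(np.max(np.abs(theta - theta_true)))   # should be small
```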
Remark 18
When the learning rate decreases at an appropriate rate, and subject to relatively
mild assumptions, stochastic gradient descent converges almost surely to a local mini-
mum (Bottou, 1998).
33 This often leads to more stable convergence.
The choices we make (e.g., the degree of the polynomial) influence the number of
free parameters in the model and thereby also the model complexity. More com-
plex models are more flexible in the sense that they can be used to describe more
data sets. For instance, a polynomial of degree 1 (a line a0 + a1 x) can only be used
to describe linear relations between inputs x and observations y. A polynomial of
degree 2 can additionally describe quadratic relationships between inputs and ob-
servations.34 Higher-order polynomials are very flexible models as we have seen
already in Section 1.5 in the context of polynomial regression.
A general problem is that at training time we can only use the training set to evaluate
the performance of the model. However, the performance on the training set is
not really what we are interested in: In Section 1.5, we have seen that maximum
likelihood estimation can lead to overfitting, especially when the training data set is
small. Ideally, our model (also) works well on the test set (which is not available
at training time). Therefore, we need some mechanisms for assessing how a model
generalizes to unseen test data. Model selection is concerned with exactly this
problem.
34 A polynomial a_0 + a_1x + a_2x² can also describe linear functions by setting a_2 = 0.
Figure 1.29: K-fold cross-validation. The data set is divided into K = 5 chunks, K − 1
of which serve as the training set (blue) and one as the validation set (yellow). This
procedure is repeated for all K choices for the validation set, and the performance of
the model from the K runs is averaged.
where G(V) is the generalization error (e.g., RMSE) on the validation set V for model
M . We repeat this procedure for all models and choose the model that performs best.
Note that cross-validation does not only give us the expected generalization error,
but we can also obtain high-order statistics, e.g., the standard error35 , an estimate
of how uncertain the mean estimate is.
Once the model is chosen we can evaluate the final performance on the test set.
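The sketch below implements K-fold cross-validation for choosing a polynomial degree; the data set, K = 5, and the candidate degrees are illustrative assumptions, and the validation error is measured as RMSE with its standard error.

```python
import numpy as np

rng = np.random.default_rng(8)

x = rng.uniform(-5, 5, size=40)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=40)

def kfold_rmse(x, y, degree, K=5):
    """Average validation RMSE of a degree-`degree` polynomial over K folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        Phi_train = np.vander(x[train], degree + 1, increasing=True)
        Phi_val = np.vander(x[val], degree + 1, increasing=True)
        theta, *_ = np.linalg.lstsq(Phi_train, y[train], rcond=None)
        errors.append(np.sqrt(np.mean((y[val] - Phi_val @ theta) ** 2)))
    return np.mean(errors), np.std(errors) / np.sqrt(K)    # mean and standard error

for M in (1, 4, 9):
    mean_err, std_err = kfold_rmse(x, y, M)
    print(f"M={M}: CV RMSE = {mean_err:.3f} +/- {std_err:.3f}")
```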
35 The standard error is defined as σ/√K, where K is the number of experiments and σ the standard
deviation.
Mk ∼ p(M ) (1.254)
θ k |Mk ∼ p(θ k ) (1.255)
D|θ k ∼ p(D|θ k ) (1.256)
and illustrated in Fig. 1.31. Given a training set D, we compute the posterior over
models as
Note that this posterior no longer depends on the model parameters θ k because they
have been integrated out in the Bayesian setting.
36 We assume that simpler models are less prone to overfitting than complex models.
37 If we treat model selection as a hypothesis testing problem, we are looking for the simplest
hypothesis that is consistent with the data (Murphy, 2012).
Figure 1.30: “Why Bayesian inference embodies Occam’s razor. This figure gives the
basic intuition for why complex models can turn out to be less probable. The horizontal
axis represents the space of possible data sets D. Bayes’ theorem rewards models in
proportion to how much they predicted the data that occurred. These predictions are
quantified by a normalized probability distribution on D. This probability of the data
given model Mi , p(D|Mi ), is called the evidence for Mi . A simple model M1 makes
only a limited range of predictions, shown by p(D|M1 ); a more powerful model M2 that
has, for example, more free parameters than M1 , is able to predict a greater variety of
data sets. This means, however, that M2 does not predict the data sets in region C1 as
strongly as M1 . Suppose that equal prior probabilities have been assigned to the two
models. Then, if the data set falls in region C1 , the less powerful model M1 will be the
more probable model.” (MacKay, 2003)
where p(θ k |Mk ) is the prior distribution of the model parameters θ k of model Mk .
Remark 19
There are some important differences between a likelihood and a marginal likelihood:
While the likelihood is prone to overfitting, the marginal likelihood is typically not as
the model parameters have been marginalized out (i.e., we no longer have to fit the
parameters). Furthermore, the marginal likelihood automatically embodies a trade-off
between model complexity and data fit.
Figure 1.31: Illustration of the hierarchical generative process in Bayesian model se-
lection. We place a prior p(M ) on the set of models. For each model, there is a prior
p(θ k |Mk ) on the corresponding model parameters, which are then used to generate the
data D.
The first fraction on the right-hand-side (prior odds) measures how much our prior
(initial) beliefs favor M1 over M2 . The ratio of the marginal likelihoods (second
fraction on the right-hand-side) is called the Bayes factor and measures how well
the data D is predicted by M1 compared to M2 .
Remark 20
The Jeffreys-Lindley paradox states that the “Bayes factor always favors the simpler
model since the probability of the data under a complex model with a diffuse prior will
be very small” (Murphy, 2012).
If we choose a uniform prior over models, the prior odds term in (1.260) is 1, i.e.,
the posterior odds is the ratio of the marginal likelihoods (Bayes factor)
p(D | M_1) / p(D | M_2) . (1.261)
If the Bayes factor is greater than 1, we choose model M1 , otherwise model M2 .
We compute the marginal likelihood in two steps: First, we show that the marginal
likelihood is Gaussian (as a distribution in y); Second, we compute the mean and
covariance of this Gaussian.
1. The marginal likelihood is Gaussian: From Section 1.2.3.5 we know that (i)
the product of two Gaussian densities is an (unnormalized) Gaussian distribution,
(ii) a linear transformation of a Gaussian random variable is Gaussian distributed.
In (1.266), we require a linear transformation to bring N(y | Xθ, σ²I) into the form
N(θ | µ, Σ) for some µ, Σ. Once this is done, the integral can be solved in closed
form. The result is the normalizing constant of the product of the two Gaussians.
The normalizing constant itself has Gaussian shape, see (1.44).
2. Mean and covariance. We compute the mean and covariance matrix of the
marginal likelihood by exploiting the standard results for means and covari-
ances of affine transformations of random variables, see Section 1.2.1.2. The
mean of the marginal likelihood is computed as
corrects for the bias of the maximum likelihood estimator by addition of a penalty
term to compensate for the overfitting of more complex models (with lots of param-
eters). Here, M is the number of model parameters.
38 In parametric models, the number of parameters is often related to the complexity of the model class.
Figure 1.33: Graphical model for Bayesian linear regression.
1.8.1 Model
In Bayesian linear regression, we consider the following model
y = φ⊤(x)θ + ε ,
ε ∼ N(0, σ²) , (1.272)
θ ∼ N(m_0, S_0) ,
where we now explicitly place a Gaussian prior p(θ) = N(m_0, S_0) on the parameter
vector θ.39 The corresponding graphical model is shown in Fig. 1.33.
39 Why is a Gaussian prior a convenient choice?
If we now use the re-arranged likelihood (1.287) and define its mean as µ and
covariance matrix as Σ in (1.284) and (1.283), respectively, we obtain
N(θ | µ, Σ) N(θ | m_0, S_0) ∝ N(θ | m_N, S_N) (1.288)
S_N = ( S_0^{-1} + σ^{-2} Φ⊤Φ )^{-1} (1.289)
m_N = S_N ( S_0^{-1} m_0 + σ^{-2} (Φ⊤Φ)(Φ⊤Φ)^{-1} Φ⊤y ) = S_N ( S_0^{-1} m_0 + σ^{-2} Φ⊤y ) , (1.290)
where σ^{-2}Φ⊤Φ plays the role of Σ^{-1} and (Φ⊤Φ)^{-1}Φ⊤y the role of µ.
S_N^{-1} = Φ⊤σ^{-2}IΦ + S_0^{-1} ⇔ S_N = ( σ^{-2}Φ⊤Φ + S_0^{-1} )^{-1} , (1.299)
m_N⊤ S_N^{-1} = ( σ^{-2}Φ⊤y + S_0^{-1}m_0 )⊤ ⇔ m_N = S_N ( σ^{-2}Φ⊤y + S_0^{-1}m_0 ) . (1.300)
where A is symmetric and positive definite, which we wish to bring into the form
Σ=A (1.303)
−1
µ=Σ a (1.304)
We can see that the terms inside the exponential in (1.296) are of the form (1.301)
with
A = σ −2 Φ⊤ Φ + S −1
0 , (1.305)
a = σ −2 Φ⊤ y + S −1
0 m0 . (1.306)
Remark 22
The posterior precision (inverse covariance) of the parameters (see (1.299), for example)
S_N^{-1} = S_0^{-1} + (1/σ²) Φ^⊤Φ,   (1.307)
contains two terms: S_0^{-1} is the prior precision and (1/σ²)Φ^⊤Φ is a data-dependent (precision) term. Both terms (matrices) are symmetric and positive definite. The data-dependent term (1/σ²)Φ^⊤Φ grows as more data is taken into account.41 This means (at least) two things:
• The posterior precision grows as more and more data is taken into account (therefore, the covariance shrinks).
• The (relative) influence of the parameter prior vanishes for large N.
For a test input x_∗, the posterior predictive distribution is p(y_∗|X, y, x_∗) = N(y_∗ | m_N^⊤φ(x_∗), φ^⊤(x_∗)S_Nφ(x_∗) + σ²). The term φ^⊤(x_∗)S_Nφ(x_∗) reflects the uncertainty associated with the parameters θ, whereas σ² is the noise variance. Note that S_N depends on the training inputs X, see (1.289). The predictive mean coincides with the MAP estimate.
Remark 23 (Mean and Variance of Noise-Free Function Values)
In many cases, we are not interested in the predictive distribution p(y∗ |X, y, x∗ ) of a
(noisy) observation. Instead, we would like to obtain the distribution of the (noise-free)
latent function values f (x∗ ) = φ⊤ (x∗ )θ. We determine the corresponding moments by
exploiting the properties of means and variances, which yields
E[f(x_∗)|X, y] = E_θ[φ^⊤(x_∗)θ|X, y] = φ^⊤(x_∗) E_θ[θ|X, y] = m_N^⊤ φ(x_∗),   (1.312)
V_θ[f(x_∗)|X, y] = V_θ[φ^⊤(x_∗)θ|X, y] = φ^⊤(x_∗) V_θ[θ|X, y] φ(x_∗) = φ^⊤(x_∗) S_N φ(x_∗).   (1.313)
41
ΦT Φ is accumulating contributions from the data, not averaging.
1.8.3.1 Derivation
The predictive distribution p(y∗ |X, y, x∗ ) is Gaussian: In (1.310), we multiply two
Gaussians (in θ), which results in another (unnormalized) Gaussian.42 When in-
tegrating out θ, we are left with the normalization constant, which itself is Gaus-
sian shaped (see Section 1.2.3.5). Therefore, it suffices to determine the mean and
the (co)variance of the predictive distribution, which we will do by applying the
standard rules for computing means and (co)variances (see Section 1.2.1). In the
following, we will use the shorthand notation φ∗ = φ(x∗ ).
The mean of the posterior predictive distribution p(y_∗|X, y, x_∗) is
E[y_∗|X, y, x_∗] = E_{θ,ε}[φ_∗^⊤θ + ε | X, y] = φ_∗^⊤ E[θ|X, y] = m_N^⊤ φ_∗.
Here, we exploited that the noise is i.i.d. and that its mean is 0.
The corresponding posterior predictive variance is
V[y_∗|X, y, x_∗] = V_θ[φ_∗^⊤θ | X, y] + V_ε[ε] = φ_∗^⊤ S_N φ_∗ + σ².
The term φ_∗^⊤ S_N φ_∗ in the variance expression is due to the inherent
(posterior) uncertainty of the parameters, which induces the uncertainty about the
latent function f. The additive term σ² is the variance of the i.i.d. noise variable.
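Continuing the sketch above (the names mN, SN and sigma2 are reused from it and are assumptions rather than part of the notes), the predictive mean, variance and 95% bounds at a grid of test inputs can be evaluated as follows.

xs    = linspace(-1.2, 1.2, 100)';
Phis  = [ones(size(xs)) xs xs.^2 xs.^3];     % phi(x*) for each test input
mpred = Phis * mN;                           % predictive mean m_N' phi(x*)
vpred = sum((Phis*SN).*Phis, 2) + sigma2;    % phi(x*)' S_N phi(x*) + sigma^2
upper = mpred + 1.96*sqrt(vpred);            % 95% predictive confidence bounds
lower = mpred - 1.96*sqrt(vpred);            % as in Fig. 1.34 (left panels)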
Example
Fig. 1.34 shows some examples of the posterior distribution over functions, induced
by the parameter posterior. The left panels show the maximum likelihood estimate,
the MAP estimate (which is identical to the posterior mean function) and the 95%
predictive confidence bounds, represented by the shaded area. The right panels
show samples from the posterior over functions: Here, we sampled parameters θ i
from the parameter posterior and computed the function φ⊤ (x∗ )θ i , which is a single
42
To be precise: We multiply the posterior p(θ|X, y) with a distribution of a linearly transformed
θ. Note that a linear transformation of a Gaussian random variable preserves Gaussianity (see Sec-
tion 1.2.3.5).
realization of a function under the posterior distribution over functions. For low-
order polynomials, the parameter posterior does not allow the parameters to vary
much: The sampled functions are nearly identical. When we make the model more
flexible by adding more parameters (i.e., we end up with a higher-order polyno-
mial), these parameters are not sufficiently constrained by the posterior, and the
sampled functions can be easily visually separated. We also see in the corresponding
panels on the left how the uncertainty increases, especially at the boundaries. Al-
though for a 10th-order polynomial the MAP estimate yields a good fit, the Bayesian
linear regression model additionally tells us that the posterior uncertainty is huge.
This information can be critical, if we use these predictions in a decision-making sys-
tem, where bad decisions can have significant consequences (e.g., in reinforcement
learning or robotics).
Remark 25
Bayesian linear regression naturally equips the estimate with uncertainty induced by the
(posterior) distribution on the parameters. In maximum likelihood (or MAP) estima-
tion, we can obtain an estimate of the uncertainty by looking at the point-wise squared
distances between the observed values yi in the training data and the function values
φ(xi )⊤ θ. The variance estimate would then be itself a maximum likelihood estimate
and given by
σ²_ML = (1/N) Σ_{i=1}^N ( y_i − φ(x_i)^⊤θ )²,   (1.321)
where θ is a point estimate of the parameters (e.g., maximum likelihood or MAP) and
N is the size of the training data set.
Figure 1.34: Bayesian linear regression. Left panels: The shaded area indicates the
95% predictive confidence bounds. The mean of the Bayesian linear regression model
coincides with the MAP estimate. The predictive uncertainty is the sum of the noise
term and the posterior parameter uncertainty, which depends on the location of the test
input. Right panels: Sampled functions from the posterior distribution.
Chapter 2
Feature Extraction
2.1 Decompositions
In this chapter we will discuss the use of linear algebra on vectors and matrices
in order to define basic feature extraction and dimensionality reduction methodolo-
gies. In this context, we will study particular linear decompositions, such as the
eigen-decomposition and diagonalisations, the QR decomposition, the Singular Value
Decomposition (SVD), etc. The above algebraic decompositions will be used to formulate
popular linear feature extraction methodologies such as Principal Component Anal-
ysis (PCA) and Linear Discriminant Analysis (LDA). We will show that many of these
decompositions arise naturally by formulating and solving trace optimisation prob-
lems with constraints. We will also study a Maximum Likelihood (ML) solution of a
Probabilistic PCA (PPCA) formulation. Finally, we will study some non-linear feature
extraction methodologies, using kernel methods. In the following we will be using
elements of linear algebra that can be found in Appendix A.1.
2.1.1 Eigen-decomposition
Assume square matrix A ∈ Rn×n with n linearly independent eigenvectors qi , i =
1, . . . , n and n eigenvalues λ1 , . . . , λn . Then A can be factorised as
A = QΛQ−1 (2.1)
where Q = [q1 . . . qn ] and Λ is a diagonal matrix whose diagonal elements are the
corresponding eigenvalues, i.e. Λ = diag{λ1 , . . . , λn }.
Using the eigen-decomposition we can compute various powers of A as
A^k = QΛ^kQ^{-1}.
We can easily verify the above for k = 2 as A² = QΛQ^{-1}QΛQ^{-1} = QΛ²Q^{-1}. Then,
we can easily prove the general case using induction.
In the case k = −1 we can compute the inverse as
A^{-1} = QΛ^{-1}Q^{-1},  with Λ^{-1} = diag{1/λ_1, . . . , 1/λ_n} (provided all λ_i ≠ 0).
2.1.2 QR decomposition
Any real square matrix A ∈ Rn×n can be decomposed as
A = QR (2.6)
where Q is an orthogonal matrix (i.e., QT Q = QQT = I) and R is an upper-
triangular matrix.
Some important properties of QR decomposition
• If A is non-singular then its QR decomposition is unique if we require the
diagonal elements of R to be positive.
• If A is singular then QR decomposition still exists but yields a singular upper
triangular R .
• If A has r linearly independent columns, then the first r columns of Q form an
orthonormal basis for the column space of A.
• |det(A)| = |∏_i r_ii| (see Appendix A.1.2.12 for a proof).
QR has important applications in solving linear systems, in producing an orthogonal
basis for eigenvalue computations etc. In the following we will see some examples.
If R is upper triangular, the system Rx = b (2.8) can be solved directly by back-substitution. Hence, a general system Ax = b can be solved via the QR decomposition as
Ax = b ⇔ QRx = b ⇔ Rx = Q^⊤b.   (2.10)
The QR decomposition can also be used to compute the eigenvalues of A via the following iteration (the QR algorithm): starting from A_1 = A, factorise and recombine
A_k = Q_k R_k,
A_{k+1} = R_k Q_k.   (2.11)
As can be seen,
A_{k+1} = R_k Q_k = Q_k^⊤ Q_k R_k Q_k = Q_k^⊤ A_k Q_k,   (2.12)
so all A_k are similar and hence they have the same eigenvalues. The matrices A_k
converge to a triangular matrix (Golub and Van Loan, 2012). Hence, the main
diagonal of this matrix contains the eigenvalues of A.
As an exercise, run the above algorithm in Matlab for a symmetric matrix and com-
pare the results with the function eig of Matlab. In particular, try to run the code
A = randn(100,100);
B = A*A';
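One possible way to complete the exercise is sketched below (the number of iterations is an arbitrary choice; convergence may be slow when eigenvalues are nearly equal).

A = randn(100,100);
B = A*A';                       % symmetric positive (semi-)definite matrix
Ak = B;
for k = 1:200                   % QR iteration (2.11)
    [Qk, Rk] = qr(Ak);
    Ak = Rk*Qk;                 % A_{k+1} = R_k Q_k, similar to A_k
end
qr_eigs  = sort(diag(Ak), 'descend');
mat_eigs = sort(eig(B),   'descend');
max(abs(qr_eigs - mat_eigs))    % should be small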
In the following we will discuss a practical and well-known algorithm for computing
the QR decomposition. The algorithm can also be used in order to compute an
orthogonal base from a matrix. The algorithm is the so-called Gram-Schmidt (GS)
process.
GS process answers the following question: If the columns of A = [a1 , . . . , an ] define
a basis (not orthonormal) for an inner product space, is there a way to convert it to
an orthonormal basis?
We start by geometrically describing the process for two vectors a_1 and a_2. In the first
step we normalise a_1 as q_1 = a_1 / ||a_1||_2. Then, we compute the projection of
a_2 onto q_1 as in (A.6),
proj_{q_1} a_2 = (q_1^⊤a_2 / ||q_1||_2²) q_1 = (q_1^⊤a_2) q_1.   (2.13)
We then subtract this projection from a_2, obtaining q̃_2 = a_2 − proj_{q_1} a_2, which is
orthogonal to q_1, and normalise it so that it has norm one as q_2 = q̃_2 / ||q̃_2||_2.
The general algorithm for computing the orthogonal basis Q = [q_1, . . . , q_n] is
q̃_1 = a_1,                                   q_1 = q̃_1 / ||q̃_1||_2,
q̃_2 = a_2 − proj_{q_1} a_2,                  q_2 = q̃_2 / ||q̃_2||_2,
q̃_3 = a_3 − proj_{q_1} a_3 − proj_{q_2} a_3, q_3 = q̃_3 / ||q̃_3||_2,   (2.15)
...
q̃_n = a_n − Σ_{i=1}^{n−1} proj_{q_i} a_n,    q_n = q̃_n / ||q̃_n||_2.
Conversely, each column of A can be expressed in the orthonormal basis as
a_1 = (q_1^⊤a_1) q_1,
a_2 = (q_1^⊤a_2) q_1 + (q_2^⊤a_2) q_2,
⋮                                        (2.16)
a_n = Σ_{i=1}^n (q_i^⊤a_n) q_i.
As an example, consider the matrix
A = [a_1, a_2, a_3] = [ 1  1  0 ;  1  2  1 ;  −2  −3  1 ],
i.e. a_1 = [1, 1, −2]^⊤, a_2 = [1, 2, −3]^⊤, a_3 = [0, 1, 1]^⊤.
First, q̃_1 = a_1 with ||q̃_1||_2 = √6, so q_1 = (1/√6)[1, 1, −2]^⊤. Next,
q̃_2 = a_2 − (q_1^⊤a_2) q_1 = [1, 2, −3]^⊤ − (9/6)[1, 1, −2]^⊤ = [−1/2, 1/2, 0]^⊤,
and ||q̃_2||_2 = √(1/2). Then q_2 = q̃_2 / ||q̃_2||_2 = [−1/√2, 1/√2, 0]^⊤.
For the third column, a_3^⊤q_1 = −1/√6 and a_3^⊤q_2 = 1/√2. Then,
q̃_3 = a_3 − (q_1^⊤a_3) q_1 − (q_2^⊤a_3) q_2 = [2/3, 2/3, 2/3]^⊤,
||q̃_3||_2 = 2/√3, hence q_3 = q̃_3 / ||q̃_3||_2 = (1/√3)[1, 1, 1]^⊤.   (2.19)
Hence,
Q = [ 1/√6   −1/√2   1/√3
      1/√6    1/√2   1/√3
     −2/√6      0    1/√3 ]
and
R = [ q_1^⊤a_1  q_1^⊤a_2  q_1^⊤a_3     [ √6   9/√6   −1/√6
         0      q_2^⊤a_2  q_2^⊤a_3  =     0   1/√2    1/√2
         0         0      q_3^⊤a_3 ]      0     0     2/√3 ].
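The following Matlab sketch implements the classical Gram-Schmidt process (2.15) and can be used to check the example above (the matrix and variable names are taken from the example; qr(A) yields the same factors up to signs).

A = [1 1 0; 1 2 1; -2 -3 1];
[n, m] = size(A);
Q = zeros(n, m);  R = zeros(m, m);
for j = 1:m
    v = A(:, j);
    for i = 1:j-1
        R(i, j) = Q(:, i)' * A(:, j);   % coefficient q_i' a_j
        v = v - R(i, j) * Q(:, i);      % subtract proj_{q_i} a_j
    end
    R(j, j) = norm(v);
    Q(:, j) = v / R(j, j);              % normalise
end
% Q'*Q is (numerically) the identity and Q*R reproduces A.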
Furthermore, regarding the 2-norm and the Frobenius norm we have the following:
||A||_F² = σ_1² + . . . + σ_p²,  p = min{m, n},   (2.25)
||A||_2 = σ_1.
A = U_1 Σ_1 V^⊤   (2.26)
where
U_1 = [u_1, . . . , u_n] ∈ R^{m×n}   (2.27)
and
Σ_1 = diag(σ_1, . . . , σ_n) ∈ R^{n×n}.   (2.28)
The above is the thin SVD of matrix A.
In the thin SVD we have U_1^⊤U_1 = I_n but U_1U_1^⊤ ≠ I_m. Furthermore, V^⊤V = VV^⊤ = I_n.
then
min_{rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}.   (2.30)
That is, there exist unit 2-norm vectors z such that Bz = 0 and which can be written as
a linear combination z = Σ_{i=1}^{k+1} α_i v_i with Σ_{i=1}^{k+1} α_i² = 1 and α_i = v_i^⊤z. Since Bz = 0
and
Az = Σ_{i=1}^{k+1} σ_i (v_i^⊤z) u_i,   (2.32)
we have that
||A − B||_2² ≥ ||(A − B)z||_2² = ||Az||_2² = Σ_{i=1}^{k+1} σ_i²(v_i^⊤z)² ≥ σ_{k+1}² Σ_{i=1}^{k+1} α_i² = σ_{k+1}².   (2.33)
The above theorem provides us with a way for dimensionality reduction through
rank reduction. That is, assume that we have a collection of n data samples x_1, . . . , x_n.
We stack the samples as columns of the matrix X = [x_1, . . . , x_n]. Then, the rank-k
(low-rank) representation of the data,
X_k = U_k Σ_k V_k^⊤,   (2.34)
is obtained by keeping the first k singular values and the corresponding singular vectors.
Here, 1_n = [1, . . . , 1]^⊤ ∈ R^n denotes the vector of all ones.
Using the above, a matrix M that contains in all columns the mean of the data,
m = (1/n) Σ_{i=1}^n x_i = (1/n) X 1_n, can be computed as
M = m 1_n^⊤ = X ( (1/n) 1_n 1_n^⊤ ).   (2.36)
Hence, data centering can be computed as
[x_1 − m, . . . , x_n − m] = X − M = X( I − (1/n) 1_n 1_n^⊤ ).   (2.37)
we want to find only one vector w so that the variance of the projected features
y_i = w^⊤x_i is maximised. The variance can be defined as
σ_y² = (1/n) Σ_{i=1}^n (y_i − µ_y)²,   (2.38)
where µ_y = (1/n) Σ_{i=1}^n y_i. But since we assumed centred data it is easy to verify that
µ_y = 0. Hence, the optimal features {y_1^o, . . . , y_n^o} are given by
{y_1^o, . . . , y_n^o} = arg max (1/2) σ_y² ⇒
w_o = arg max_w (1/(2n)) Σ_{i=1}^n (w^⊤x_i)² = arg max_w (1/(2n)) Σ_{i=1}^n w^⊤ x_i x_i^⊤ w
    = arg max_w (1/(2n)) w^⊤ ( Σ_{i=1}^n x_i x_i^⊤ ) w = arg max_w (1/(2n)) w^⊤ XX^⊤ w
    = arg max_w (1/2) w^⊤ S_t w.   (2.39)
Since the magnitude of w is immaterial, we impose the constraint w^⊤w = 1 and form the
Lagrangian L(w) = (1/2) w^⊤S_t w − (λ/2)(w^⊤w − 1). Setting its gradient to zero,
∇_w L(w) = S_t w − λw = 0   (2.42)
⇒ S_t w = λw.   (2.43)
Hence, the optimal w is the eigenvector of S_t that corresponds to the largest
eigenvalue. We can also see that mathematically: by replacing the solution back into the
optimisation problem (2.39), we get
λ_o = arg max_λ λ.   (2.44)
Since λk > 0 maximisation of the above cost function boils down to keeping the
eigenvectors that correspond to the k largest eigenvalues.
To conclude, the PCA transform looks for d orthogonal direction vectors (known as the
principal axes) such that the projection of the input sample vectors onto the principal
directions has the maximal spread, or equivalently that the variance of the output
coordinates is maximal. The principal directions are the first d eigenvectors (with
respect to descending eigenvalues) of the covariance matrix XX^⊤.
The reconstruction operator using the vectors stored in the columns of W = [w_1, . . . , w_d]
can be written as
x_i = W y_i,   (2.53)
which gives y_i = (W^⊤W)^{-1} W^⊤ x_i.
Substituting this back, we get that the optimal reconstruction is
x̂_i = W (W^⊤W)^{-1} W^⊤ x_i  ( = WW^⊤ x_i for orthonormal W).
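A minimal Matlab sketch of the above (synthetic data and variable names are illustrative, not part of the notes): centre the data, compute the d leading principal directions, project and reconstruct.

rng(0);
F = 5;  n = 200;  d = 2;
X  = randn(F, n) + repmat([1;2;0;0;0], 1, n);   % F x n data, samples as columns
m  = mean(X, 2);
Xc = X - repmat(m, 1, n);                       % centred data X(I - (1/n) 1 1')
[U, L] = eig(Xc*Xc');
[~, idx] = sort(diag(L), 'descend');
W  = U(:, idx(1:d));                            % d principal directions
Y  = W' * Xc;                                   % low-dimensional features
Xhat = W*Y + repmat(m, 1, n);                   % reconstruction W y_i + m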
Figure 2.1: Example of data whitening using the PCA projection matrix W = UΛ^{-1/2}.
When the dimensionality F is large, this computation can be quite expensive. We will show how to expedite
this procedure when n ≪ F. We first have to introduce the following Lemma.
Lemma 1
Let us assume that B = XX^⊤ and C = X^⊤X. It can be proven that B and C have
the same positive eigenvalues Λ and, assuming that n < F, the eigenvectors U
of B and the eigenvectors V of C are related as U = XVΛ^{-1/2}.
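A short Matlab sketch of Lemma 1 (toy dimensions and names are illustrative): the small n × n eigenproblem is solved and the eigenvectors of XX^⊤ are then recovered as U = XVΛ^{-1/2}.

rng(0);
F = 2000;  n = 50;
Xc = randn(F, n);  Xc = Xc - repmat(mean(Xc,2), 1, n);   % centred data
[V, L] = eig(Xc'*Xc);                      % n x n eigenproblem (cheap)
[lam, idx] = sort(diag(L), 'descend');
keep = lam > 1e-8*lam(1);                  % positive eigenvalues only
V = V(:, idx(keep));  lam = lam(keep);
U = Xc * V * diag(1./sqrt(lam));           % U = X V Lambda^{-1/2}
% Columns of U are orthonormal and satisfy (Xc*Xc')*U = U*diag(lam).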
From the above it is evident that the features in Y are uncorrelated (i.e., the covariance
matrix YY^⊤ is diagonal, with off-diagonal elements being zero) and the variance in
each dimension is equal to the positive eigenvalues of XX^⊤.
Assume further that we want to make the low-dimensional covariance matrix of the
data equal to the identity matrix. This procedure is called whitening (or sphering),
which is an important normalisation of the data.
The whitening transformation is given by the projection matrix
W = UΛ^{-1/2}.   (2.58)
X = U_r Σ_r V_r^⊤.   (2.60)
Using the above SVD we can write the covariance matrix as
XX^⊤ = U_r Σ_r V_r^⊤ V_r Σ_r U_r^⊤ = U_r Σ_r² U_r^⊤.
Comparing now the above with the eigen-decomposition in 2.59, we get that (1) the
basis of PCA is given by the left orthogonal space of the SVD and (2) the eigenvalues are
the singular values squared.
Furthermore, the low-dimensional features of PCA are given by
Y_r = U_r^⊤ X = Σ_r V_r^⊤.
Hence, the normalised features stacked in the matrix Σ_r^{-1} Y_r are equal to V_r^⊤, i.e., the right
orthogonal space of the SVD. That is, the right orthogonal space of the SVD is equal to the
whitened features.
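The link can be checked numerically with a short sketch (it reuses a centred data matrix Xc, e.g. the one from the previous sketch; names are illustrative).

[Ur, Sr, Vr] = svd(Xc, 'econ');
s  = diag(Sr);
r  = sum(s > 1e-8*s(1));            % numerical rank (centering removes one direction)
Ur = Ur(:,1:r);  s = s(1:r);  Vr = Vr(:,1:r);
Y  = Ur' * Xc;                      % PCA features, Y = diag(s) * Vr'
Yw = diag(1./s) * Y;                % whitened features, equal to Vr' (up to signs)
% Y*Y' is diagonal with the squared singular values and Yw*Yw' is the identity.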
• PCA is not explicitly defined for classification problems (i.e., the case where the data
come with labels).
• How do we define a latent space in this case (i.e., one that helps in data classification)?
In order to capitalise on the availability of class labels we need to properly define
relevant statistical properties which may help us in classification. Intuition: we want
to find a space in which:
1. data belonging to the same class look more like each other, while
2. data from different classes look dissimilar (i.e., we move the data from different
classes further away from each other by increasing the distance between their means).
The within-class variance of the projected features in the first class c_1 is
σ_y²(c_1) = (1/N_{c_1}) Σ_{x_i∈c_1} ( w^⊤(x_i − µ(c_1)) )²
          = (1/N_{c_1}) Σ_{x_i∈c_1} w^⊤ (x_i − µ(c_1))(x_i − µ(c_1))^⊤ w
          = w^⊤ ( (1/N_{c_1}) Σ_{x_i∈c_1} (x_i − µ(c_1))(x_i − µ(c_1))^⊤ ) w
          = w^⊤ S_1 w,
where µ(c_1) = (1/N_{c_1}) Σ_{x_i∈c_1} x_i is the mean of the first class.
Similarly, the within-class variance in the second class c_2 is given by
σ_y²(c_2) = w^⊤ S_2 w,
where S_2 = (1/N_{c_2}) Σ_{x_i∈c_2} (x_i − µ(c_2))(x_i − µ(c_2))^⊤ and µ(c_2) is the mean of the second
class.
Now, the sum of the two variances can be written as
σ_y²(c_1) + σ_y²(c_2) = w^⊤(S_1 + S_2)w = w^⊤ S_w w,
where S_w = S_1 + S_2 is the within-class scatter matrix. This leads to the problem
max_w  w^⊤ S_b w   s.t.   w^⊤ S_w w = 1.
Using the fact that ∂(w^⊤Sw)/∂w = 2Sw for a symmetric matrix S, setting the derivative
of the Lagrangian L(w, λ) = w^⊤S_b w − λ(w^⊤S_w w − 1) to zero gives
∂L/∂w = 0 ⇒ λ S_w w = S_b w.
Hence the optimal w is given by the eigenvector that corresponds to the largest eigenvalue
of S_w^{-1}S_b (assuming that S_w is invertible).
In this special case (where C = 2) the optimal w is
w ∝ S_w^{-1}( µ(c_1) − µ(c_2) ).
In the following we will compute the LDA projection for the following 2D dataset.
c1 = {(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)}
c2 = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}
S_w^{-1}S_b w = λw → |S_w^{-1}S_b − λI| = 0 →
| 11.89 − λ     8.81   |
|  5.08      3.76 − λ  | = 0 → λ = 15.65,
[ 11.89  8.81 ] [w_1]          [w_1]       [w_1]   [0.91]
[  5.08  3.76 ] [w_2]  = 15.65 [w_2]   →   [w_2] = [0.39].
Or directly by
w_* = S_w^{-1}(µ_1 − µ_2) = [−0.91  −0.39]^⊤.
Assume that we have C classes in total. We assume that each class i has N_{c_i} samples,
stored in the matrix c_i = [x_1, . . . , x_{N_{c_i}}], i = 1, . . . , C, where each x_j has F dimensions
and µ(c_i) is the mean vector of the class i. Thus, the overall data matrix X =
[c_1, . . . , c_C] has size F × n (n = Σ_{i=1}^C N_{c_i}). If m is the overall mean, then the
within-class scatter matrix, S_w, is defined as
S_w = Σ_{j=1}^C S_j = Σ_{j=1}^C Σ_{x_i∈c_j} (x_i − µ(c_j))(x_i − µ(c_j))^⊤   (2.64)
and has rank(S_w) = min(F, n − (C + 1)). Moreover, the between-class scatter matrix,
S_b, is defined as
S_b = Σ_{j=1}^C N_{c_j} (µ(c_j) − m)(µ(c_j) − m)^⊤.   (2.65)
The LDA projection matrix W is then found from the generalised eigenvalue problem
S_b W = S_w W Λ.   (2.66)
Properties
The scatter matrices have some interesting properties. Let us denote
M = diag{E_1, E_2, . . . , E_C} = [ E_1  0  · · ·  0 ;  0  E_2  · · ·  0 ;  · · ·  ;  0  0  · · ·  E_C ]   (2.67)
where
E_i = (1/N_{c_i}) 1_{N_{c_i}} 1_{N_{c_i}}^⊤ ∈ R^{N_{c_i}×N_{c_i}},   (2.68)
i.e., the N_{c_i} × N_{c_i} matrix with all entries equal to 1/N_{c_i}.
Note that M is idempotent, thus MM = M. Given that the data covariance matrix
is St = XXT , the between-class scatter matrix can be written as
Sb = XMMXT = XMXT (2.69)
and the within-class scatter matrix as
S_w = XX^⊤ − XMX^⊤ = X(I − M)X^⊤,   (2.70)
where the first term is S_t and the subtracted term is S_b.
Given the above properties, the objective function of Eq. 2.63 can be expressed as
W_o = arg max_W tr( W^⊤ X M M X^⊤ W )
subject to W^⊤ X(I − M)(I − M)X^⊤ W = I.   (2.71)
The optimisation procedure of this problem involves a procedure called Simultane-
ous Diagonalisation. Let’s assume that the final transformation matrix has the form
W = UQ (2.72)
We aim to find the matrix U that diagonalises Sw = X(I − M)(I − M)XT . This
practically means that, given the constraint of Eq. 2.71, we want
W^⊤ X(I − M)(I − M)X^⊤ W = I ⇒ Q^⊤ ( U^⊤ X(I − M)(I − M)X^⊤ U ) Q = I,   (2.73)
where U is chosen so that the bracketed term U^⊤ X(I − M)(I − M)X^⊤ U equals I.
Consequently, using Eqs. 2.72 and 2.73, the objective function of Eq. 2.71 can be
further expressed as
Q_o = arg max_Q tr( Q^⊤ U^⊤ X M M X^⊤ U Q ),   (2.74)
where the constraint W^⊤ X(I − M)(I − M)X^⊤ W = I now has the form Q^⊤Q = I.
Lemma 2
Assume the matrix X(I − M)(I − M)X^⊤ = X_w X_w^⊤, where X_w is the F × n matrix
X_w = X(I − M). By performing an eigen-analysis on X_w^⊤X_w as X_w^⊤X_w = V_w Λ V_w^⊤,
we get n − (C + 1) positive eigenvalues; thus V_w is an n × (n − (C + 1)) matrix.
2. Find Q_0. Denoting by X̃_b = U^⊤ X M the (n − (C + 1)) × n matrix of projected class
means, Eq. 2.74 becomes
Q_0 = arg max_Q tr( Q^⊤ X̃_b X̃_b^⊤ Q )  subject to  Q^⊤Q = I,
which is solved by the eigenvectors of X̃_b X̃_b^⊤. Then,
W_0 = U Q_0.   (2.76)
All the computations are performed via the use of the kernel or the centralized kernel
matrix
K̄ = [ (φ(x_i) − m^Φ)^⊤ (φ(x_j) − m^Φ) ],   m^Φ = (1/n) Σ_{i=1}^n φ(x_i).
Some popular kernel functions include:
Gaussian Radial Basis Function (RBF) kernel:
k(x_i, x_j) = exp( −||x_i − x_j||² / r² ),
Polynomial kernel:
k(x_i, x_j) = (x_i^⊤x_j + b)^n,
Hyperbolic tangent kernel:
k(x_i, x_j) = tanh(x_i^⊤x_j + b).
Given the data X = [x_1, . . . , x_n], the matrix
X^Φ = [φ(x_1), . . . , φ(x_n)]
stacks the data mapped into what is called the feature space. Using the feature space the kernel
matrix is defined as
K = [ φ(x_i)^⊤φ(x_j) ] = [ k(x_i, x_j) ] = X^{Φ⊤} X^Φ.
All computations are performed via the use of K (the so-called kernel trick).
Still, U^Φ_o = X^Φ V Λ^{-1/2} cannot be computed analytically. But we do not want to compute
U^Φ_o. What we want is to compute the latent features. That is, given a test sample
x_t, we want to compute y = U^{Φ⊤}_o (φ(x_t) − m^Φ), and this can be performed via the kernel trick:
y = U^{Φ⊤}_o (φ(x_t) − m^Φ)
  = Λ^{-1/2} V^⊤ X^{Φ⊤} (φ(x_t) − m^Φ)
  = Λ^{-1/2} V^⊤ (I − E) X^{Φ⊤} ( φ(x_t) − (1/n) X^Φ 1 )
  = Λ^{-1/2} V^⊤ (I − E) ( X^{Φ⊤}φ(x_t) − (1/n) X^{Φ⊤}X^Φ 1 )
  = Λ^{-1/2} V^⊤ (I − E) ( g(x_t) − (1/n) K 1 ),
where
g(x_t) = X^{Φ⊤} φ(x_t) = [ φ(x_1)^⊤φ(x_t), . . . , φ(x_n)^⊤φ(x_t) ]^⊤ = [ k(x_1, x_t), . . . , k(x_n, x_t) ]^⊤.
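A sketch of this projection with the kernel trick (RBF kernel; the data, kernel width and variable names are illustrative assumptions).

rng(0);
n = 100;  X = randn(2, n);                        % training data, columns
r2 = 1;   k = @(a, b) exp(-sum((a-b).^2) / r2);   % RBF kernel
K = zeros(n);
for i = 1:n, for j = 1:n, K(i,j) = k(X(:,i), X(:,j)); end, end
E  = ones(n)/n;
Kc = (eye(n)-E) * K * (eye(n)-E);                 % centred kernel matrix
[V, L] = eig((Kc+Kc')/2);                         % symmetrise for numerics
[lam, idx] = sort(real(diag(L)), 'descend');
d = 2;  V = V(:, idx(1:d));  lam = lam(1:d);      % keep d components

xt = randn(2, 1);                                 % a test sample
g  = zeros(n, 1);
for i = 1:n, g(i) = k(X(:,i), xt); end            % g(xt) = [k(x_i, xt)]
y  = diag(1./sqrt(lam)) * V' * (eye(n)-E) * (g - K*ones(n,1)/n);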
Kernel LDA is a tutorial exercise.
or equivalently
p(x_i|y_i, W, σ) = N(x_i | Wy_i + m, σ²I) = ( 1/((2π)^{F/2} σ^F) ) exp( −(1/(2σ²)) (x_i − m − Wy_i)^⊤(x_i − m − Wy_i) ),
p(y_i) = N(y_i | 0, I) = ( 1/(2π)^{d/2} ) exp( −(1/2) y_i^⊤y_i ).   (2.78)
Given the conditional probability p(x_i|y_i, W, σ) and the prior p(y_i) we can compute the
two following important distributions:
p(y_i|x_i, W, σ),  posterior,
p(x_i|W, σ),  marginal.   (2.79)
The marginal is obtained by integrating out the latent variable,
p(x_i|W, σ) = ∫ p(x_i|y_i, W, σ) p(y_i) dy_i,   (2.80)
where the integrand is
p(x_i|y_i, W, σ) p(y_i) = ( 1/((2π)^{F/2} σ^F (2π)^{d/2}) ) exp( −(1/(2σ²)) [ (x_i − m − Wy_i)^⊤(x_i − m − Wy_i) + σ² y_i^⊤y_i ] ).   (2.81)
Now, in order to compute the marginal, as well as the posterior, we will restructure
the exponent of the exponential. The aim of the restructuring is to reveal a term that
does not depend on y_i, so that it can safely be taken out of the integral in (2.80). This
term is used to produce the marginal. The other term is used to produce the posterior.
Let us now see how to restructure the exponent (for convenience let x̄_i = x_i − m):
(x̄_i − Wy_i)^⊤(x̄_i − Wy_i) + σ² y_i^⊤y_i
= x̄_i^⊤x̄_i − 2x̄_i^⊤Wy_i + y_i^⊤W^⊤Wy_i + σ² y_i^⊤y_i
= x̄_i^⊤x̄_i − 2x̄_i^⊤Wy_i + y_i^⊤(W^⊤W + σ²I)y_i
= x̄_i^⊤x̄_i − 2x̄_i^⊤Wy_i + y_i^⊤My_i,   (2.82)
where M = WT W + σ 2 I.
We observe that we have a quadratic term yiT Myi and we need some extra terms in
order to complete its quadratic form. We can do so as follows
In order to simplify the marginal we will make use of the Woodbury identity
Hence, by removing the constant terms, the function we want to optimise with respect
to the parameters is
L(W, σ, m) = −(n/2) ln det(D) − (1/2) Σ_{i=1}^n (x_i − m)^⊤ D^{-1} (x_i − m)
           = −(n/2) ln det(D) − (1/2) Σ_{i=1}^n tr( D^{-1} (x_i − m)(x_i − m)^⊤ )
           = −(n/2) ln det(D) − (n/2) tr( D^{-1} S_t ).   (2.89)
We will now take the derivatives of the function with respect to the parameters:
∇_m L = 0 ⇒ m = (1/n) Σ_{i=1}^n x_i,   (2.90)
∇_W L = 0 ⇒ D^{-1} S_t D^{-1} W − D^{-1} W = 0 ⇒ S_t D^{-1} W = W.
There are three different solutions. The first is W = 0 (not useful). The second is
D = S_t; in this case, if S_t = UΛU^⊤ is the eigendecomposition of the covariance matrix,
then W = U(Λ − σ²I)^{1/2}V^⊤ for an arbitrary rotation matrix V (i.e., V^⊤V = I). The third,
and most interesting, case has D ≠ S_t and W ≠ 0 with d < q = rank(S_t).
Assume the SVD of W, W = ULV^⊤, where
U = [u_1 . . . u_d] is an F × d matrix with U^⊤U = I,  V^⊤V = VV^⊤ = I,  L = diag(l_1, . . . , l_d).
Then
D^{-1} = (WW^⊤ + σ²I)^{-1} = (UL²U^⊤ + σ²I)^{-1}.
Assume a set of basis vectors U_{F−d} such that U_{F−d}^⊤U = 0 and U_{F−d}^⊤U_{F−d} = I. We then have
D^{-1} = ( UL²U^⊤ + σ²I )^{-1}
       = ( [U U_{F−d}] [ L²  0 ; 0  0 ] [U U_{F−d}]^⊤ + [U U_{F−d}] σ²I [U U_{F−d}]^⊤ )^{-1}
       = [U U_{F−d}] [ L² + σ²I  0 ; 0  σ²I ]^{-1} [U U_{F−d}]^⊤
       = [U U_{F−d}] [ (L² + σ²I)^{-1}  0 ; 0  σ^{-2}I ] [U U_{F−d}]^⊤.
And subsequently
D^{-1}U = [U U_{F−d}] [ (L² + σ²I)^{-1}  0 ; 0  σ^{-2}I ] [U U_{F−d}]^⊤ U
        = [U U_{F−d}] [ (L² + σ²I)^{-1}  0 ; 0  σ^{-2}I ] [I  0]^⊤
        = [U U_{F−d}] [ (L² + σ²I)^{-1} ; 0 ]
        = U (L² + σ²I)^{-1}.
As a result, we have
S_t D^{-1} W = S_t U (L² + σ²I)^{-1} L V^⊤ = ULV^⊤ = W,
which implies S_t U = U(L² + σ²I). Hence S_t u_i = (l_i² + σ²) u_i. For S_t = UΛU^⊤, the u_i are
eigenvectors of S_t and λ_i = l_i² + σ², i.e., l_i = √(λ_i − σ²). Unfortunately, V cannot be
determined; thus there is a rotation ambiguity.
Concluding, the optimum W_d is given by (keeping d eigenvectors)
W_d = U_d (Λ_d − σ²I)^{1/2} V^⊤.
L(W, σ², µ) = −(NF/2) ln(2π) − (N/2) ln |D| − (N/2) tr[ D^{-1} S_t ]
            = −(NF/2) ln(2π) − (N/2) ln |WW^⊤ + σ²I| − (N/2) tr[ (WW^⊤ + σ²I)^{-1} S_t ].
W_d W_d^⊤ + σ²I = [U_d U_{F−d}] [ Λ_d − σ²I  0 ; 0  0 ] [U_d U_{F−d}]^⊤ + [U_d U_{F−d}] σ²I [U_d U_{F−d}]^⊤
                = [U_d U_{F−d}] [ Λ_d  0 ; 0  σ²I ] [U_d U_{F−d}]^⊤.
Hence
|W_d W_d^⊤ + σ²I| = ( ∏_{i=1}^d λ_i ) ( ∏_{i=d+1}^F σ² ).
Similarly, writing S_t = UΛU^⊤ with Λ = diag(λ_1, . . . , λ_q, 0, . . . , 0),
D^{-1} S_t = [U_d U_{F−d}] [ I  0  0 ;  0  (1/σ²) Λ_{q−d}  0 ;  0  0  0 ] [U_d U_{F−d}]^⊤,   Λ_{q−d} = diag(λ_{d+1}, . . . , λ_q),
⇒ tr( D^{-1} S_t ) = (1/σ²) Σ_{i=d+1}^q λ_i + d.
L(σ²) = −(N/2) { F ln 2π + Σ_{j=1}^d ln λ_j + (F − d) ln σ² + (1/σ²) Σ_{j=d+1}^q λ_j + d },
∂L/∂σ = 0 ⇒ −2σ^{-3} Σ_{j=d+1}^q λ_j + 2(F − d)/σ = 0 ⇒ σ² = (1/(F − d)) Σ_{j=d+1}^q λ_j.
Substituting σ² back into L, we obtain
L(σ²) = −(N/2) { Σ_{j=1}^d ln λ_j + Σ_{j=d+1}^q ln λ_j − Σ_{j=d+1}^q ln λ_j + (F − d) ln( (1/(F − d)) Σ_{j=d+1}^q λ_j ) + F ln 2π + F },
where the first two sums together give ln |S_t| (the sum of the logs of the q non-zero eigenvalues).
Since ln |S_t| and the remaining constants do not depend on which eigenvalues are retained,
maximising L(σ²) is equivalent to
min { ln( (1/(F − d)) Σ_{j=d+1}^q λ_j ) − (1/(F − d)) Σ_{j=d+1}^q ln λ_j }.
Since ln is concave, we have that
ln( (1/(F − d)) Σ_{j=d+1}^q λ_j ) ≥ (1/(F − d)) Σ_{j=d+1}^q ln λ_j.
Hence
ln( (1/(F − d)) Σ_{j=d+1}^q λ_j ) − (1/(F − d)) Σ_{j=d+1}^q ln λ_j ≥ 0.
Therefore, the function is minimised when the discarded eigenvectors are the ones
that correspond to the q − d smallest eigenvalues.
A brief summary:
σ² = (1/(F − d)) Σ_{j=d+1}^q λ_j,
W_d = U_d (Λ_d − σ²I)^{1/2} V^⊤,
µ = (1/N) Σ_{i=1}^N x_i.
We no longer have a projection but a posterior mean: E_{p(y_i|x_i)}[y_i] = M^{-1}W^⊤(x_i − µ). We also have a
reconstruction x̂_i = W E_{p(y_i|x_i)}[y_i] + µ. We can notice that
lim_{σ²→0} W_d = U_d Λ_d^{1/2},
lim_{σ²→0} M = W_d^⊤ W_d.
Hence, in the limit σ² → 0 the posterior mean becomes (W_d^⊤W_d)^{-1}W_d^⊤(x_i − µ), i.e., the
standard PCA projection is recovered.
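A compact Matlab sketch of the closed-form PPCA solution summarised above (toy data and names are illustrative; V is taken as the identity, which is allowed because of the rotation ambiguity).

rng(0);
F = 10;  n = 500;  d = 2;
X  = randn(F, n);                                % toy data, columns are samples
mu = mean(X, 2);
Xc = X - repmat(mu, 1, n);
St = (Xc*Xc') / n;                               % sample covariance
[U, L] = eig(St);
[lam, idx] = sort(diag(L), 'descend');  U = U(:, idx);
sigma2 = mean(lam(d+1:end));                     % average of the discarded eigenvalues
Wd = U(:,1:d) * sqrt(diag(lam(1:d)) - sigma2*eye(d));  % W_d = U_d (Lambda_d - sigma^2 I)^(1/2)
M  = Wd'*Wd + sigma2*eye(d);
Ey = M \ (Wd' * Xc);                             % E[y_i | x_i] = M^{-1} W' (x_i - mu)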
Chapter 3
Support Vector Machines
The SVM technique tries to find the separating hyperplane with the largest margin
between two classes, measured along a line perpendicular to the hyperplane. For
example, in Figure 3.1, the two classes could be fully separated by a dotted line
wT x + b = 0. We would like to decide the line with the largest margin. In other
words, intuitively we think that the distance between two classes of training data
should be as large as possible. That means we find a line with parameters w and b
such that the distance between wT x + b = ±1 is maximised.
The distance between wT x + b = 1 and −1 can be calculated by the following way.
Consider a point x̃ on wT x + b = −1 (see Figure 3.2). As w is the “normal vector” of
the line wT x + b = −1, w and the line are perpendicular to each other. Starting from
x̃ and moving along the direction w, we assume x̃ + tw touches line wT x + b = 1.
Therefore,
wT (x̃ + tw) + b = 1 and wT x̃ + b = −1
We then have t w^⊤w = 2, so the distance (i.e., the length of tw) is ||tw||_2 = t ||w||_2 = 2/||w||_2.
Note that ||w||_2 = √(w_1² + · · · + w_n²). As maximising 2/||w||_2 is equivalent to minimising
w^⊤w / 2, we have the following problem:
min_{w,b} (1/2) w^⊤w
subject to y_i(w^⊤x_i + b) ≥ 1,  i = 1, . . . , l.   (3.1)
The constraint means that
(w^⊤x_i) + b ≥ 1 if y_i = 1,
(w^⊤x_i) + b ≤ −1 if y_i = −1.   (3.2)
That is, data in class 1 must be on the right-hand side of w^⊤x + b = 0 while data
in the other class must be on the left-hand side. Note that the reason for maximising
the distance between w^⊤x + b = ±1 is based on Vapnik's Structural Risk Minimisation
principle.
The following example gives a simple illustration of maximal-margin separating hyperplanes:
Example
Given two training data in R1 as in the following Figure 3.3: What is the separating
hyperplane?
We have two data points, namely x1 = 1, x2 = 0 with y = [+1, −1]T . Furthermore,
w ∈ R1 , so Eq. 3.1 becomes
min_{w,b} (1/2) w²
subject to w · 1 + b ≥ 1,   (3.3)
           −1(w · 0 + b) ≥ 1.   (3.4)
From Ineq. 3.4, −b ≥ 1. Putting this into Ineq. 3.3, w ≥ 2. In other words, for
any (w, b) which satisfies 3.3 and 3.4, we have w ≥ 2. As we are minimising 12 w2 ,
the smallest possibility is w = 2. Thus, (w, b) = (2, −1) is the optimal solution. The
separating hyperplane is 2x − 1 = 0, in the middle of the two training data points
(Figure 3.4).
In order to find the optimal w in the general case we need to solve optimisation
problem 3.1. Before doing so, we need some basic knowledge regarding Lagrangian
optimisation and Lagrangian duality.
min_w f(w)
subject to g(w) ≤ 0.   (3.5)
By convention, we write the constraints as g(w) ≤ 0; constraints originally given as "≥"
inequalities are brought into this form by multiplying them by minus one.
To solve this we use the method of Lagrange multipliers. We define the Lagrangian
to be the original objective function added to a weighted combination of the con-
straints. The weights are called Lagrange multipliers. It will be helpful to focus on
the simpler case with one inequality constraint and one Lagrange multiplier.
Theorem 2
The original minimisation problem can be written as
min_w max_{a≥0} ( f(w) + a g(w) ).   (3.6)
This is because when g(w) ≤ 0, we maximise (3.6) by setting a = 0. When g(w) > 0,
one can drive the value to infinity by setting a to a large number. Minimising the
outer term, one sees that we obtain the minimum value of f(w) such that the con-
straint g(w) ≤ 0 holds. Therefore, we can say that the two problems are equivalent.
The primal solution to the problem is given by
p^* = min_w max_{a≥0} L(w, a),  where L(w, a) = f(w) + a g(w) is the Lagrangian,
and the dual solution by d^* = max_{a≥0} min_w L(w, a).
We claim that d^* ≤ p^*. Let w^* be the w value that corresponds to the optimal primal
solution p^*. We can write for all a ≥ 0
p^* = max_{a'≥0} L(w^*, a') ≥ L(w^*, a) ≥ min_w L(w, a).
The Left-Hand-Side (LHS) of the above is obviously p∗ . This means we can interpret
the Right-Hand-Side (RHS) as a lower bound on p∗ for all a ≥ 0. One obtains the
best lower bound when maximising over a - this yields d∗ . Hence d∗ ≤ p∗ for any
f (w) and g(w). However, if certain conditions are met, namely
• f(w) is convex,
• the constraints are affine functions of w,
then d^* = p^*.
For the SVM problem, both of these conditions hold. Finally, in order to solve the
SVM optimisation problem using the dual, we need to further explore the optimality
conditions.
Since we are optimising a convex function with linear constraints, the dual solution
will equal the primal solution. To optimise the dual (3.12), we need to minimise
L(w, b, a) with respect to w and b for a fixed value of a. We know that the optimal
w and b must satisfy the condition that the partial derivatives of L with regards to w
and b are 0.
∇_w L(w, b, a) = w − Σ_{i=1}^n a_i y_i x_i = 0 ⇒   (3.13)
w = Σ_{i=1}^n a_i y_i x_i.   (3.14)
Similarly,
∂L(w, b, a)/∂b = Σ_{i=1}^n a_i y_i = 0.   (3.15)
Therefore, for a fixed value of a, we have a closed form solution for w that minimises
L(w, b, a). We also have a condition on the sum of ai yi . We can plug them back into
the dual expression.
L(a) = Σ_{i=1}^n a_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n a_i a_j y_i y_j x_i^⊤x_j.   (3.16)
Finally, we are left with a function of a which we wish to maximise. Putting this
together with the constraints a_i ≥ 0 and the constraint Σ_{i=1}^n a_i y_i = 0, we obtain the
following optimisation problem
max_a 1^⊤a − (1/2) a^⊤ K_y a
subject to a_i ≥ 0, i = 1, . . . , n,   (3.17)
           a^⊤y = 0,
where K_y = [ y_i y_j x_i^⊤x_j ].
Example
In this example we will see how we can solve the optimisation problem using quadprog
of Matlab. The quadprog function solves generic quadratic programming optimisa-
tion problems of the form:
min_g f^⊤g + (1/2) g^⊤Hg
subject to Ag ≤ c, A_e g = c_e, g_l ≤ g ≤ g_u.   (3.18)
This minimisation problem solves for the vector g. The first step to solving our
problem, is to encode it using the matrices H, A, f , c, ce , gl , gu and Ae . Assume we
are given a set of data stored as columns in a data matrix X ∈ RF ×n and a vector
y of labels 1, −1. Then the SVM optimisation problem (3.17) can be reformulated
to (3.18) by (a) changing maximisation to minimisation by reversing the sign of the
cost function, (b) setting g = a, H = [yi yj xTi xj ], (c) f = −1n , A = 0 and c = 0
(a dummy inequality constraint), Ae = [y1 , . . . , yn ] and ce = 0, gl = [0, . . . , 0]T , and
finally gu = [∞ . . . , ∞]T . Once we have created the matrices and vectors quadprog
function can be used like so:
g = quadprog(H, f, A, c, A_e, c_e, g_l, g_u)
A_e = y';
c_e = 0;
g_l = zeros(200,1);
g_u = 100000*ones(200,1);
alpha = quadprog(H,f,A,c,A_e,c_e,g_l,g_u);
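Since the listing above shows only its final lines, here is a self-contained sketch under assumed toy data (two Gaussian clouds of 100 points each, so that n = 200 matches the dimensions used; quadprog requires the Optimization Toolbox).

rng(0);
n  = 200;
X  = [randn(2,n/2)-1.5, randn(2,n/2)+1.5];   % 2 x n data matrix
y  = [-ones(n/2,1); ones(n/2,1)];            % labels in {-1, +1}
H  = (y*y') .* (X'*X);                       % H = [y_i y_j x_i' x_j]
f  = -ones(n,1);
A  = zeros(1,n);  c = 0;                     % dummy inequality constraint
A_e = y';  c_e = 0;
g_l = zeros(n,1);  g_u = 100000*ones(n,1);   % effectively a_i >= 0 with a large C
alpha = quadprog(H, f, A, c, A_e, c_e, g_l, g_u);

w = X * (alpha .* y);                        % w = sum_i a_i y_i x_i
S = find(alpha > 1e-5);                      % support vectors
b = mean(y(S) - (X(:,S)'*w));                % b averaged over support vectors (3.20)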
The claim is that the dual problem is more computationally convenient. We validate
this claim by considering the KKT conditions, which must hold for the solution. In
particular, the complementary slackness condition can be written as
ai = 0 ⇒ yi (wT xi + b) ≥ 1
(3.19)
ai > 0 ⇒ yi (wT xi + b) = 1.
Furthermore, from the above conditions we can find b (from any support vector). A
more numerically stable solution can be found by averaging over all support vectors
as
b = (1/N_S) Σ_{x_i∈S} ( y_i − w^⊤x_i ).   (3.20)
That is, the constraints in 3.21 allow training data to not be on the correct side of
the separating hyperplane wT x + b = 0. This happens when ξi > 1 and an example
is provided in Figure 3.5.
We have ξi ≥ 0 since if ξi < 0, we have yi (wT xi + b) ≥ 1 − ξi ≥ 1 and the training
data is already on the correct side. The new problem is always feasible since for any
(w, b),
ξi ≡ max(0, 1 − yi (wT xi + b)), i = 1, . . . , l
we have a feasible solution ((w, b, ξ)). Using this setting, we may worry that for
linearly separable data, some ξi ’s could be larger than 1 and hence corresponding
data could be wrongly classified. For the case that most data except some noisy ones
are separable by a linear function, we would like wT x + b = 0 to correctly classify
the majority of the points. Therefore, in the objective function we add a penalty
term C Σ_{i=1}^l ξ_i, where C > 0 is the penalty parameter. To have the objective value
as small as possible, most ξi ’s should be zero, so that the constraint goes back to its
original form.
In order to formulate the dual of (3.21) we need to compute
L(w, b, ξ, a, r) = (1/2) w^⊤w + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n a_i ( y_i(w^⊤x_i + b) − 1 + ξ_i ) − Σ_{i=1}^n r_i ξ_i   (3.22)
with Lagrange multipliers a_i ≥ 0, r_i ≥ 0. Computing the derivatives:
∂L/∂w = w − Σ_{i=1}^n a_i y_i x_i = 0,
∂L/∂b = Σ_{i=1}^n a_i y_i = 0,   (3.23)
∂L/∂ξ_i = C − a_i − r_i = 0.
Substituting (3.23) back into (3.22) we get the dual optimisation problem
max_a L(a) = a^⊤1 − (1/2) a^⊤ K_y a   (3.24)
subject to a^⊤y = 0, 0 ≤ a_i ≤ C,   (3.25)
where K_y = [ y_i y_j x_i^⊤x_j ].
If data are distributed in a highly non-linear way, employing only a linear function
causes many training instances to be on the wrong side of the hyperplane. As a
result, under-fitting occurs and the decision function does not perform well. To fit
the training data better, we may think of using a non-linear curve. The problem is
that it is very difficult to model non-linear curves. All we are familiar with are elliptic,
hyperbolic, or parabolic curves, which are far from enough in practice. Instead of
using more sophisticated curves, another approach is to map data into a higher
dimensional space. In this higher dimensional space, it is more likely that data can
be linearly separated. An example by mapping x from R3 to R8 is as follows
√ √ √ √ √ √
φ(x) = [1, 2x1 , 2x2 , 2x3 , x21 , x,2 x23 , 2x1 x2 , 2x2 x3 , 2x1 x3 ]
subject to 0 ≤ a_i ≤ C, i = 1, . . . , n,
           Σ_{i=1}^n y_i a_i = 0.   (3.27)
This new problem of course has some relation with the original problem 3.26, and
we hope that it can be solved more easily. We may write 3.27 in a matrix form for
convenience:
min_α (1/2) α^⊤ K_y α − 1^⊤α
subject to 0 ≤ α_i ≤ C, i = 1, . . . , l,   (3.28)
           y^⊤α = 0.
Therefore, the crucial point is whether the dual is easier to be solved than the primal.
The number of variables in the dual is the size of the training set, n, a fixed number.
In contrast, the number of variables in the primal problem varies depending on how
data are mapped to a higher dimensional space. Therefore, moving from the primal
to the dual means that we solve a finite-dimensional optimisation problem instead
of a possibly infinite-dimensional one.
If φ(x) is an infinitely-long vector, there is no way to fully write it down and then
calculate the inner product. Therefore, even though the dual possesses the advan-
tage of having a finite number of variables, we could not even write the problem
down before solving it. This is resolved by using special mapping functions φ so that
φ(xi )T φ(xj ) is efficiently calculated (i.e., by using the kernel trick). Then, a decision
function is written as
f(x) = sign( w^⊤φ(x) + b ) = sign( Σ_{i=1}^l y_i α_i φ(x_i)^⊤φ(x) + b ).   (3.29)
In other words, for a test vector x, if Σ_{i=1}^n y_i α_i φ(x_i)^⊤φ(x) + b > 0, we classify it to
be in the class 1. Otherwise, we classify it in the second class. We can see that only
support vectors will affect the results in the prediction stage. In general, the number
of support vectors is not large. Therefore, we can say SVM is used in order to derive
important data (support vectors) from the training data.
min_{w,b} Σ_{i=1}^n ( y_i − (w^⊤x_i + b) )².   (3.30)
It is easy to see that 3.30 (with x replaced by φ(x)) and 3.31 are equivalent: If
Therefore,
ξi2 + (ξi∗ )2 = (yi − (wT φ(xi ) + b))2
Moreover, ξi ξi∗ = 0 at an optimal solution.
Instead of using square errors, we can use linear ones:
min_{w,b,ξ,ξ*} Σ_{i=1}^l ( ξ_i + ξ_i* )
subject to −ξ_i* ≤ y_i − (w^⊤φ(x_i) + b) ≤ ξ_i,
           ξ_i, ξ_i* ≥ 0, i = 1, . . . , l.
Support vector regression (SVR) then employs two modifications to avoid over-fitting:
the ε-insensitive tube and the regularisation term (1/2) w^⊤w.
Clearly, ξ_i is the upper training error (ξ_i* is the lower) subject to the ε-insensitive
tube |y_i − (w^⊤φ(x_i) + b)| ≤ ε. This can be seen from Figure 3.8. If x_i is not in the
tube, there is an error ξ_i or ξ_i*, which we would like to minimise in the objective
function. SVR avoids under-fitting and over-fitting the training data by minimising
the training error C Σ_{i=1}^l (ξ_i + ξ_i*) as well as the regularisation term (1/2) w^⊤w. The addition
of the term w^⊤w can be explained in a similar way to that for classification problems.
In Figure 3.9, under the condition that training data are in the ǫ-insensitive tube, we
would like the approximate function to be as general as possible to represent the
data distribution.
Figure 3.9: More general approximate function by maximising the distance between
w^⊤x + b = ±ε.
The parameters which control the regression quality are the cost of error C, the
width of the tube ε, and the mapping function φ. Similar to support vector classifi-
cation, as w may be a huge vector variable, we solve the dual problem
min_{α,α*} (1/2)(α − α*)^⊤ K (α − α*) + ε Σ_{i=1}^n (α_i + α_i*) + Σ_{i=1}^n y_i(α_i − α_i*)
subject to Σ_{i=1}^n (α_i − α_i*) = 0, 0 ≤ α_i, α_i* ≤ C, i = 1, . . . , n,   (3.35)
where K_{ij} = k(x_i, x_j) ≡ φ(x_i)^⊤φ(x_j). The derivation of the dual uses the same
procedure as for support vector classification. The primal-dual relation shows that
w = Σ_{i=1}^n (−α_i + α_i*) φ(x_i),
Appendix A
where [x1 . . . xn ] represents a row vector. In these notes we use column vectors. Use
of row vectors will be explicitly noted.
A matrix A ∈ R^{n×l} is defined as the following collection:
A = [ a_11 . . . a_1l ; . . . ; a_n1 . . . a_nl ] = [ a_1 . . . a_l ] = [ ã_1^⊤ ; . . . ; ã_n^⊤ ] = [ a_ij ].   (A.2)
The squared ℓ2 norm of a vector can be defined using the inner product as
||x||_2² = x^⊤x = Σ_{i=1}^n x_i².   (A.4)
cos(θ(x, y)) = x^⊤y / ( ||x||_2 ||y||_2 ).   (A.5)
Figure A.1: Geometric interpretation of the projection of a vector y onto x. The green
vector is projx y, while the red vector is y − projx y.
The cosine between two vectors can be used in order to define projections onto
vectors. In particular, the projection of y onto x, denoted as proj_x y, is a vector that
is co-linear to x and can be computed as
proj_x y = βx = cos(θ) ||y||_2 ( x / ||x||_2 ) = ( x^⊤y / ||x||_2² ) x.   (A.6)
and
B = [ b̃_1^⊤ ; . . . ; b̃_l^⊤ ] = [ b_ij ],
Assume that we are given a basis {u1 , . . . , un } and an arbitrary vector x which can
be written as a linear combination of the basis as
x = Σ_{i=1}^n k_i u_i = Uk.   (A.18)
i=1
Ak = AA · · · A, k times. (A.20)
The fractional power of a matrix A can be defined as a matrix B = A^{1/k} such that
B^k = A.   (A.21)
(AB)T = BT AT . (A.22)
∂f/∂x_jk = Σ_i a_ij b_ki = [BA]_kj = [(BA)^⊤]_jk.   (A.31)
Hence, ∇_X tr(AXB) = A^⊤B^⊤.
Now we can compute ∇_W f = [ ∂f/∂w_ki ]:
∂f/∂w_ki = Σ_{r≠k} w_ri b_kr + Σ_{j≠k} w_ji b_jk + 2 w_ki b_kk
         = Σ_r w_ri b_kr + Σ_j w_ji b_jk   (A.33)
         = [BW]_ki + [B^⊤W]_ki.
where Ajk is defined as the determinant of the (n−1)×(n−1) matrix that is produced
from A by removing the j-th row and k-th column.
Example (Determinant)
Assume the matrix
A = [ −2  2  −3 ;  −1  1  3 ;  2  0  −1 ].
Expanding the determinant along the second column,
det(A) = (−1)^{1+2} · 2 · det[ −1 3 ; 2 −1 ] + (−1)^{2+2} · 1 · det[ −2 −3 ; 2 −1 ] + (−1)^{3+2} · 0 · det[ −2 −3 ; −1 3 ]
       = (−2) · ( (−1)·(−1) − 2·3 ) + 1 · ( (−2)·(−1) − 2·(−3) )
       = (−2)·(−5) + 8 = 18.   (A.35)
(A − λI)x = 0. (A.48)
The above equation has a non-zero solution if and only if the determinant |A − λI|
is zero. Therefore, the eigenvalues of A are the values of λ that satisfy the equation
det(A − λI) = 0.
Hence, a way to find eigenvalues analytically is by finding the roots of the above
polynomial (which is called the characteristic polynomial of A).
The generalised eigenvectors of matrices A and B are vectors that satisfy
Ax = λBx (A.50)
B−1 Ax = λx (A.51)
xT Ax > 0. (A.55)
Proof: Assume that all eigenvalues λ_i are positive. Then, according to the eigende-
composition of symmetric matrices in (A.53), we have A = UΛU^⊤. The columns of
U constitute a basis of R^n. Hence, x = Uc for all x ∈ R^n with c ≠ 0. Then,
x^⊤Ax = c^⊤U^⊤UΛU^⊤Uc = c^⊤Λc = Σ_{i=1}^n c_i² λ_i > 0,   (A.56)
which holds for all c ∈ R^n. Conversely, if x^⊤Ax > 0 for all x ≠ 0, then by choosing as c
the columns of the identity matrix I_n, the above inequality turns into λ_i > 0. Hence,
all eigenvalues are positive.
• The sum of two upper (lower) triangular matrices is upper (lower) triangular.
• The product of two upper (lower) triangular matrices is upper (lower) trian-
gular.
• The determinant of a triangular matrix equals the product of the diagonal en-
tries.
A.1.2.12 QR decomposition
Theorem 4
If A = QR is the QR decomposition of matrix A then
|det(A)| = |∏_i r_ii|.   (A.58)
However, inner products are more general concepts with specific properties, which
we will now introduce.
Definition 13
Let V be a vector space and β : V × V → R a bilinear mapping (i.e., linear in both
arguments).
In an inner product space, the inner product allows us to introduce concepts, such
as lengths, distances and orthogonality.
Remark 26
The norm k · k possesses the following properties:
1. kxk ≥ 0 for all x ∈ V and kxk = 0 ⇔ x = 0
2. kλxk = |λ| · kxk for all x ∈ V and λ ∈ R
3. Minkowski inequality: kx + yk ≤ kxk + kyk for all x, y ∈ V
Definition 15 (Distance and Metric)
Consider an inner product space (V, h·, ·i). Then d(x, y) := kx − yk is called distance
of x, y ∈ V . The mapping
d:V ×V →R (A.63)
(x, y) 7→ d(x, y) (A.64)
is called metric.
A metric d satisfies:
Theorem 5
Let (V, ⟨·, ·⟩) be an inner product space and x, y, z ∈ V. Then:
−1 ≤ ⟨x, y⟩ / ( ‖x‖ ‖y‖ ) ≤ 1.   (A.65)
The angle ω between x and y is then given by
cos ω = ⟨x, y⟩ / ( ‖x‖ ‖y‖ ).   (A.66)
methods5 , such as Conjugate Gradients or GMRES, minimize residual errors that are
orthogonal to each other (Stoer and Burlirsch, 2002).
In machine learning, inner products are important in the context of kernel meth-
ods (Schölkopf and Smola, 2002). Kernel methods exploit the fact that many lin-
ear algorithms can be expressed purely by inner product computations.6 Then, the
“kernel trick” allows us to compute these inner products implicitly in a (potentially
infinite-dimensional) feature space, without even knowing this feature space ex-
plicitly. This allowed the “non-linearization” of many algorithms used in machine
learning, such as kernel-PCA (Schölkopf et al., 1998) for dimensionality reduction.
Gaussian processes (Rasmussen and Williams, 2006) also fall into the category of
kernel methods and are the current state-of-the-art in probabilistic regression (fit-
ting curves to data points).
The Searle identity in (A.67) is useful if the individual inverses of A and B do not
exist or if they are ill conditioned. The Woodbury identity in (A.68) can be used to
reduce the computational burden: If Z ∈ Rp×p is diagonal, the inverse Z −1 can be
computed in O(p). Consider the case where U ∈ Rp×q , W ∈ Rq×q , and V ⊤ ∈ Rq×p
with p ≫ q. The inverse (Z + U W V ⊤ )−1 ∈ Rp×p would require O(p3 ) computations
(naively implemented). Using (A.68), the computational burden reduces to O(p)
for the inverse of the diagonal matrix Z plus O(q 3 ) for the inverse of W and the
inverse of W −1 + V ⊤ Z −1 U ∈ Rq×q . Therefore, the inversion of a p × p matrix can be
reduced to the inversion of q × q matrices, the inversion of a diagonal p × p matrix,
and some matrix multiplications, all of which require less than O(p3 ) computations.
The Kailath inverse in (A.69) is a special case of the Woodbury identity in (A.68)
with W = I. The Kailath inverse makes the inversion of A + BC numerically a bit
more stable if A + BC is ill-conditioned and A−1 exists.
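A small numerical sketch of the computational argument above (sizes and names are illustrative): the Woodbury-based inverse only ever inverts the diagonal Z and a q × q matrix, and it agrees with the naive inverse.

rng(0);
p = 2000;  q = 5;
z = rand(p,1) + 1;  Z = diag(z);              % diagonal p x p matrix
U = randn(p,q);  W = eye(q);  V = randn(p,q);

Zi = diag(1./z);                              % O(p) inverse of the diagonal part
inner = inv(W) + V'*Zi*U;                     % small q x q matrix
Minv_woodbury = Zi - Zi*U*(inner\(V'*Zi));    % Woodbury inverse (A.68)
Minv_direct   = inv(Z + U*W*V');              % naive O(p^3) inverse
max(max(abs(Minv_woodbury - Minv_direct)))    % should be close to machine precision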
5
The basis for the Krylov subspace is derived from the Cayley-Hamilton theorem, which allows us
to compute the inverse of a matrix in terms of a linear combination of its powers.
6
Matrix-vector multiplication Ax = b falls into this category since bi is dot product of the ith row
of A with x.
Bibliography
Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transac-
tions on Automatic Control, 19(6):716–723. pages 62
Belhumeur, P. N., Hespanha, J. P., and Kriegman, D. J. (1997). Eigenfaces vs. Fish-
erfaces: Recognition using Class Specific Linear Projection. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 19(7):711–720. pages 2
Bickson, D., Dolev, D., Shental, O., Siegel, P. H., and Wolf, J. K. (2007). Linear
Detection via Belief Propagation. In Proceedings of the Annual Allerton Conference
on Communication, Control, and Computing. pages 22
Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z.,
Ranzato, M. A., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large Scale
Distributed Deep Networks. In Advances in Neural Information Processing Systems,
pages 1–11. pages 56
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for On-
line Learning and Stochastic Optimization. Journal of Machine Learning Research,
12:2121–2159. pages 56
Gal, Y., van der Wilk, M., and Rasmussen, C. E. (2014). Distributed Variational
Inference in Sparse Gaussian Process Regression and Latent Variable Models. In
Advances in Neural Information Processing Systems. pages 56
Golub, G. H. and Van Loan, C. F. (2012). Matrix Computations, volume 4. JHU Press.
pages 1, 2, 73
Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian Processes for Big
Data. In Nicholson, A. and Smyth, P., editors, Proceedings of the Conference on
Uncertainty in Artificial Intelligence. AUAI Press. pages 56
Herbrich, R., Minka, T., and Graepel, T. (2007). TrueSkill(TM): A Bayesian Skill
Rating System. In Advances in Neural Information Processing Systems, pages 569–
576. MIT Press. pages 22
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic Variational
Inference. Journal of Machine Learning Research, 14(1):1303–1347. pages 56
Jimenez Rezende, D., Mohamed, S., and Wierstra, D. (2014). Stochastic Backprop-
agation and Variational Inference in Deep Latent Gaussian Models. In Proceedings
of the International Conference on Machine Learning. pages 15
Kingma, D. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. Inter-
national Conference on Learning Representations, pages 1–13. pages 56
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models. MIT Press. pages
22
Lin, C.-J. (2006). A guide to support vector machines. Department of Computer
Science & Information Engineering, National Taiwan University, Taiwan. pages 2
MacKay, D. J. C. (1992). Bayesian Interpolation. Neural Computation, 4:415–447.
pages 58
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms.
Cambridge University Press, The Edinburgh Building, Cambridge CB2 2RU, UK.
pages 58, 59
McEliece, R. J., MacKay, D. J. C., and Cheng, J.-F. (1998). Turbo Decoding as an
Instance of Pearl’s “Belief Propagation” Algorithm. IEEE Journal on Selected Areas
in Communications, 16(2):140–152. pages 22
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G.,
Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie,
C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and
Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning.
Nature, 518:529–533. pages 56
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press, Cam-
bridge, MA, USA. pages 11, 14, 16, 58, 60, 61
O’Hagan, A. (1991). Bayes-Hermite Quadrature. Journal of Statistical Planning and
Inference, 29:245–260. pages 61
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. pages 22
Petersen, K. B. and Pedersen, M. S. (2012). The Matrix Cookbook. Version
20121115. pages 33
Rasmussen, C. E. and Ghahramani, Z. (2001). Occam’s Razor. In Advances in Neural
Information Processing Systems 13, pages 294–300. The MIT Press. pages 62
Rasmussen, C. E. and Ghahramani, Z. (2003). Bayesian Monte Carlo. In Becker, S.,
Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing
Systems 15, pages 489–496. The MIT Press, Cambridge, MA, USA. pages 61
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learn-
ing. Adaptive Computation and Machine Learning. The MIT Press, Cambridge,
MA, USA. pages 14, 125
Roweis, S. and Ghahramani, Z. (1999). A Unifying Review of Linear Gaussian Mod-
els. Neural Computation, 11(2):305–345. pages 14
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Representations
by Back-propagating Errors. Nature, 323(6088):533–536. pages 36, 54
Schölkopf, B., Smola, A. J., and Müller, K.-R. (1998). Nonlinear Component Analysis
as a Kernel Eigenvalue Problem. Neural Computation, 10(5):1299–1319. pages
125
Shental, O., Bickson, D., Siegel, P. H., Wolf, J. K., and Dolev, D. (2008). Gaussian
Belief Propagation Solver for Systems of Linear Equations. In IEEE International
Symposium on Information Theory. pages 22
Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). TextonBoost: Joint Ap-
pearance, Shape and Context Modeling for Multi-Class Object Recognition and
Segmentation. In Proceedings of the European Conference on Computer Vision. pages
22
Spiegelhalter, D. and Smith, A. F. M. (1980). Bayes Factors and Choice Criteria for
Linear Models. Journal of the Royal Statistical Society B, 42(2):213–220. pages 58
Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tap-
pen, M., and Rother, C. (2008). A Comparative Study of Energy Minimization
Methods for Markov Random Fields with Smoothness-based Priors. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 30(6):1068–1080. pages 22
Tibshirani, R. (1996). Regression Selection and Shrinkage via the Lasso. Journal of
the Royal Statistical Society B, 58(1):267–288. pages 50