List of illustrations iv
Foreword 1
2 Linear Algebra 17
2.1 Systems of Linear Equations 19
2.2 Matrices 22
2.3 Solving Systems of Linear Equations 27
2.4 Vector Spaces 35
2.5 Linear Independence 40
2.6 Basis and Rank 44
2.7 Linear Mappings 48
2.8 Affine Spaces 61
2.9 Further Reading 63
Exercises 63
3 Analytic Geometry 70
3.1 Norms 71
3.2 Inner Products 72
3.3 Lengths and Distances 75
3.4 Angles and Orthogonality 76
3.5 Orthonormal Basis 78
3.6 Orthogonal Complement 79
3.7 Inner Product of Functions 80
3.8 Orthogonal Projections 81
3.9 Rotations 91
3.10 Further Reading 94
Exercises 95
4 Matrix Decompositions 98
References 395
Foreword
Contributors
We are grateful to many people who looked at early drafts of the book and suffered through painful expositions of concepts. We tried to implement their ideas that we did not violently disagree with. We would like to especially acknowledge Christfried Webers for his careful reading of many parts of the book, and his detailed suggestions on structure and presentation. Many friends and colleagues have also been kind enough to provide their time and energy on different versions of each chapter. We have been lucky to benefit from the generosity of the online community, who have suggested improvements via github.com, which greatly improved the book.
The following people have found bugs, proposed clarifications, and suggested relevant literature, either via github.com or personal communication. Their names are sorted alphabetically.
Contributors through github, whose real names were not listed on their github profile, are:
Table of Symbols
Part I
Mathematical Foundations
1
Introduction and Motivation
Since machine learning is inherently data driven, data is at the core of machine learning. The goal of machine learning is to design general-purpose methodologies to extract valuable patterns from data, ideally without much domain-specific expertise. For example, given a large corpus of documents (e.g., books in many libraries), machine learning methods can be used to automatically find relevant topics that are shared across documents (Hoffman et al., 2010). To achieve this goal, we design models that are typically related to the process that generates data, similar to the dataset we are given. For example, in a regression setting, the model would describe a function that maps inputs to real-valued outputs. To paraphrase Mitchell (1997): A model is said to learn from data if its performance on a given task improves after the data is taken into account. The goal is to find good models that generalize well to yet unseen data, which we may care about in the future. Learning can be understood as a way to automatically find patterns and structure in data by optimizing the parameters of the model.
While machine learning has seen many success stories, and software is readily available to design and train rich and flexible machine learning systems, we believe that the mathematical foundations of machine learning are important in order to understand fundamental principles upon which more complicated machine learning systems are built. Understanding these principles can facilitate creating new machine learning solutions, understanding and debugging existing approaches, and learning about the inherent assumptions and limitations of the methodologies we are working with.
[Figure: The mathematical foundations (linear algebra, analytic geometry, matrix decomposition, vector calculus, probability & distributions, optimization) supporting the four pillars of machine learning: regression, dimensionality reduction, density estimation, and classification.]
between the two parts of the book to link mathematical concepts with
machine learning algorithms.
Of course there are more than two ways to read this book. Most readers learn using a combination of top-down and bottom-up approaches, sometimes building up basic mathematical skills before attempting more complex concepts, but also choosing topics based on applications of machine learning.
2
Linear Algebra
[Figure 2.1: Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials.]
[Figure 2.2: A mind map of the concepts introduced in this chapter (vector, vector space, Abelian group with +, closure, matrix, system of linear equations, Gaussian elimination, matrix inverse, linear independence, basis, linear/affine mapping), along with where they are used in other parts of the book: Chapter 3 (Analytic Geometry), Chapter 5 (Vector Calculus), Chapter 10 (Dimensionality Reduction), and Chapter 12 (Classification).]
Example 2.1
A company produces products N1 , . . . , Nn for which resources
R1 , . . . , Rm are required. To produce a unit of product Nj , aij units of
resource Ri are needed, where i = 1, . . . , m and j = 1, . . . , n.
The objective is to find an optimal production plan, i.e., a plan of how
many units xj of product Nj should be produced if a total of bi units of
resource Ri are available and (ideally) no resources are left over.
If we produce x1 , . . . , xn units of the corresponding products, we need
a total of
ai1 x1 + · · · + ain xn (2.2)
many units of resource Ri . An optimal production plan (x1 , . . . , xn ) ∈ Rn ,
Equation (2.3) is the general form of a system of linear equations, and x1, . . . , xn are the unknowns of this system. Every n-tuple (x1, . . . , xn) ∈ Rn that satisfies (2.3) is a solution of the linear equation system.
Example 2.2
The system of linear equations
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) (2.4)
2x1 + 3x3 = 1 (3)
has no solution: Adding the first two equations yields 2x1 +3x3 = 5, which
contradicts the third equation (3).
Let us have a look at the system of linear equations
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) . (2.5)
x2 + x3 = 2 (3)
From the first and third equation it follows that x1 = 1. From (1)+(2) we
get 2+3x3 = 5, i.e., x3 = 1. From (3), we then get that x2 = 1. Therefore,
(1, 1, 1) is the only possible and unique solution (verify that (1, 1, 1) is a
solution by plugging in).
As a third example, we consider
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) . (2.6)
2x1 + 3x3 = 5 (3)
Since (1)+(2)=(3), we can omit the third equation (redundancy). From
(1) and (2), we get 2x1 = 5−3x3 and 2x2 = 1+x3 . We define x3 = a ∈ R
as a free variable, such that any triplet
$$\left(\tfrac{5}{2} - \tfrac{3}{2}a,\ \tfrac{1}{2} + \tfrac{1}{2}a,\ a\right), \quad a \in \mathbb{R}, \qquad (2.7)$$
is a solution of the system of linear equations, i.e., we obtain a solution
set that contains infinitely many solutions.
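As a quick numerical sanity check, here is a minimal NumPy sketch (not part of the original text): it solves system (2.5) and verifies that the parametric family (2.7) solves system (2.6).

```python
import numpy as np

# System (2.5): coefficient matrix and right-hand side.
A = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 2.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 2.0, 2.0])
print(np.linalg.solve(A, b))     # [1. 1. 1.]: the unique solution

# System (2.6) has infinitely many solutions; check the family (2.7).
A2 = np.array([[1.0, 1.0, 1.0],
               [1.0, -1.0, 2.0],
               [2.0, 0.0, 3.0]])
b2 = np.array([3.0, 2.0, 5.0])
for a in (-1.0, 0.0, 2.5):
    x = np.array([2.5 - 1.5 * a, 0.5 + 0.5 * a, a])
    assert np.allclose(A2 @ x, b2)
```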
2.2 Matrices
Matrices play a central role in linear algebra. They can be used to compactly represent systems of linear equations, but they also represent linear functions (linear mappings), as we will see later in Section 2.7. Before we discuss some of these interesting topics, let us first define what a matrix is and what kind of operations we can do with matrices. We will see more properties of matrices in Chapter 4.
2.2.1 Matrix Addition and Multiplication
The sum of two matrices A ∈ Rm×n, B ∈ Rm×n is defined as the element-wise sum, i.e.,
$$A + B := \begin{bmatrix} a_{11}+b_{11} & \cdots & a_{1n}+b_{1n} \\ \vdots & & \vdots \\ a_{m1}+b_{m1} & \cdots & a_{mn}+b_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}. \qquad (2.12)$$
For matrices A ∈ Rm×n, B ∈ Rn×k, the elements cij of the product C = AB ∈ Rm×k are defined as
$$c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, k. \qquad (2.13)$$
(Note the size of the matrices: there are n columns in A and n rows in B, so that we can compute ail blj for l = 1, . . . , n. In NumPy, C = np.einsum('il, lj', A, B).)
This means, to compute element cij we multiply the elements of the ith row of A with the jth column of B and sum them up. Later in Section 3.2, we will call this the dot product of the corresponding row and column. (Commonly, the dot product between two vectors a, b is denoted by a⊤b or ⟨a, b⟩.) In cases where we need to be explicit that we are performing multiplication, we use the notation A · B to denote multiplication (explicitly showing "·").
Remark. Matrices can only be multiplied if their "neighboring" dimensions
Example 2.3
For $A = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{2\times 3}$, $B = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{3\times 2}$, we obtain
$$AB = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 2 & 5 \end{bmatrix} \in \mathbb{R}^{2\times 2}, \qquad (2.15)$$
$$BA = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 6 & 4 & 2 \\ -2 & 0 & 2 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{3\times 3}. \qquad (2.16)$$
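The following minimal NumPy sketch (not from the book) reproduces Example 2.3 both via the Einstein-sum form of (2.13) and via the @ operator.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [3, 2, 1]])
B = np.array([[0, 2],
              [1, -1],
              [0, 1]])

AB = np.einsum('il,lj->ij', A, B)   # matrix product as in (2.13)
print(AB)                           # [[2 3]
                                    #  [2 5]]
print(np.allclose(AB, A @ B))       # True
print(B @ A)                        # [[ 6  4  2]
                                    #  [-2  0  2]
                                    #  [ 3  2  1]]
```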
Multiplication with the identity matrix:
$$\forall A \in \mathbb{R}^{m\times n}: \quad I_m A = A\, I_n = A \qquad (2.20)$$
If we multiply A with
$$B := \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \qquad (2.22)$$
we obtain
$$AB = \begin{bmatrix} a_{11}a_{22} - a_{12}a_{21} & 0 \\ 0 & a_{11}a_{22} - a_{12}a_{21} \end{bmatrix} = (a_{11}a_{22} - a_{12}a_{21})\,I. \qquad (2.23)$$
Therefore,
$$A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \qquad (2.24)$$
if and only if $a_{11}a_{22} - a_{12}a_{21} \neq 0$. In Section 4.1, we will see that $a_{11}a_{22} - a_{12}a_{21}$ is the determinant of a 2×2-matrix. Furthermore, we can generally use the determinant to check whether a matrix is invertible. ♦
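A minimal NumPy sketch (not from the book) of the closed-form 2×2 inverse (2.24), compared against the library routine; the test matrix is an arbitrary example.

```python
import numpy as np

def inverse_2x2(A):
    """Closed-form inverse of a 2x2 matrix, following (2.24)."""
    a11, a12 = A[0]
    a21, a22 = A[1]
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("matrix is not invertible (determinant is 0)")
    return np.array([[a22, -a12],
                     [-a21, a11]]) / det

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(inverse_2x2(A))        # [[-2.   1. ]
                             #  [ 1.5 -0.5]]
print(np.linalg.inv(A))      # same result
```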
$(\lambda C)^\top = C^\top \lambda^\top = C^\top \lambda = \lambda C^\top$ since $\lambda = \lambda^\top$ for all $\lambda \in \mathbb{R}$.
Distributivity:
$$(\lambda + \psi)C = \lambda C + \psi C, \quad C \in \mathbb{R}^{m\times n}$$
$$\lambda(B + C) = \lambda B + \lambda C, \quad B, C \in \mathbb{R}^{m\times n}$$
and use the rules for matrix multiplication, we can write this equation
system in a more compact form as
$$\begin{bmatrix} 2 & 3 & 5 \\ 4 & -2 & -7 \\ 9 & 5 & -3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 8 \\ 2 \end{bmatrix}. \qquad (2.36)$$
Note that x1 scales the first column, x2 the second one, and x3 the third
one.
Generally, systems of linear equations can be compactly represented in their matrix form as Ax = b, see (2.3), and the product Ax is a (linear) combination of the columns of A. We will discuss linear combinations in more detail in Section 2.5.
so that 0 = 8c1 + 2c2 − 1c3 + 0c4 and (x1, x2, x3, x4) = (8, 2, −1, 0). In fact, any scaling of this solution by λ1 ∈ R produces the 0 vector, i.e.,
$$\lambda_1 \left( \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} \right) = \lambda_1 (8 c_1 + 2 c_2 - c_3) = 0. \qquad (2.41)$$
Following the same line of reasoning, we express the fourth column of the matrix in (2.38) using the first two columns and generate another set of non-trivial versions of 0 as
$$\lambda_2 \left( \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix} \right) = \lambda_2 (-4 c_1 + 12 c_2 - c_4) = 0 \qquad (2.42)$$
for any λ2 ∈ R. Putting everything together, we obtain all solutions of the equation system in (2.38), which is called the general solution, as the set
$$\left\{ x \in \mathbb{R}^4 : x = \begin{bmatrix} 42 \\ 8 \\ 0 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \qquad (2.43)$$
Example 2.6
For a ∈ R, we seek all solutions of the following system of equations:
−2x1 + 4x2 − 2x3 − x4 + 4x5 = −3
4x1 − 8x2 + 3x3 − 3x4 + x5 = 2
. (2.44)
x1 − 2x2 + x3 − x4 + x5 = 0
x1 − 2x2 − 3x4 + 4x5 = a
We start by converting this system of equations into the compact matrix notation Ax = b. We no longer mention the variables x explicitly and build the augmented matrix (in the form [A | b])
$$\left[\begin{array}{ccccc|c} -2 & 4 & -2 & -1 & 4 & -3 \\ 4 & -8 & 3 & -3 & 1 & 2 \\ 1 & -2 & 1 & -1 & 1 & 0 \\ 1 & -2 & 0 & -3 & 4 & a \end{array}\right] \begin{matrix} \text{Swap with } R_3 \\ {} \\ \text{Swap with } R_1 \\ {} \end{matrix}$$
where we used the vertical line to separate the left-hand side from the right-hand side in (2.44). We use ⇝ to indicate a transformation of the augmented matrix using elementary transformations. (The augmented matrix [A | b] compactly represents the system of linear equations Ax = b.)
Swapping rows 1 and 3 leads to
$$\left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 4 & -8 & 3 & -3 & 1 & 2 \\ -2 & 4 & -2 & -1 & 4 & -3 \\ 1 & -2 & 0 & -3 & 4 & a \end{array}\right] \begin{matrix} {} \\ -4R_1 \\ +2R_1 \\ -R_1 \end{matrix}$$
When we now apply the indicated transformations (e.g., subtract Row 1 four times from Row 2), we obtain
$$\left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 & -3 & 2 \\ 0 & 0 & 0 & -3 & 6 & -3 \\ 0 & 0 & -1 & -2 & 3 & a \end{array}\right] \begin{matrix} {} \\ {} \\ {} \\ -R_2 - R_3 \end{matrix}$$
$$\rightsquigarrow \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 & -3 & 2 \\ 0 & 0 & 0 & -3 & 6 & -3 \\ 0 & 0 & 0 & 0 & 0 & a+1 \end{array}\right] \begin{matrix} {} \\ \cdot(-1) \\ \cdot(-\tfrac{1}{3}) \\ {} \end{matrix}$$
$$\rightsquigarrow \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & 1 & -1 & 3 & -2 \\ 0 & 0 & 0 & 1 & -2 & 1 \\ 0 & 0 & 0 & 0 & 0 & a+1 \end{array}\right]$$
This (augmented) matrix is in a convenient form, the row-echelon form (REF). Reverting this compact notation back into the explicit notation with the variables we seek, we obtain
$$\begin{aligned} x_1 - 2x_2 + x_3 - x_4 + x_5 &= 0 \\ x_3 - x_4 + 3x_5 &= -2 \\ x_4 - 2x_5 &= 1 \\ 0 &= a+1 \end{aligned} \qquad (2.45)$$
Only for a = −1 can this system be solved. A particular solution is
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix}. \qquad (2.46)$$
The general solution, which captures the set of all possible solutions, is
$$\left\{ x \in \mathbb{R}^5 : x = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 2 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 2 \\ 0 \\ -1 \\ 2 \\ 1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \qquad (2.47)$$
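A minimal NumPy sketch (not part of the original text) that checks the particular solution (2.46) and the general solution (2.47) against system (2.44) with a = −1.

```python
import numpy as np

A = np.array([[-2.0, 4.0, -2.0, -1.0, 4.0],
              [ 4.0, -8.0, 3.0, -3.0, 1.0],
              [ 1.0, -2.0, 1.0, -1.0, 1.0],
              [ 1.0, -2.0, 0.0, -3.0, 4.0]])
b = np.array([-3.0, 2.0, 0.0, -1.0])     # right-hand side of (2.44) with a = -1

x_p = np.array([2.0, 0.0, -1.0, 1.0, 0.0])   # particular solution (2.46)
v1 = np.array([2.0, 1.0, 0.0, 0.0, 0.0])     # spans the null space of A
v2 = np.array([2.0, 0.0, -1.0, 2.0, 1.0])

assert np.allclose(A @ v1, 0) and np.allclose(A @ v2, 0)
for l1, l2 in [(0.0, 0.0), (1.0, -2.0), (3.5, 0.7)]:
    assert np.allclose(A @ (x_p + l1 * v1 + l2 * v2), b)
```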
All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one non-zero element are on top of rows that contain only zeros.
Looking at non-zero rows only, the first non-zero number from the left (also called the pivot or the leading coefficient) is always strictly to the right of the pivot of the row above it. (In other texts, it is sometimes required that the pivot is 1.)
Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables; the other variables are free variables. For example, in (2.45), x1, x3, x4 are basic variables, whereas x2, x5 are free variables. ♦
Remark (Obtaining a Particular Solution). The row-echelon form makes it straightforward to determine a particular solution: we express the right-hand side of the equation system using the pivot columns, such that $b = \sum_i \lambda_i p_i$, where the $p_i$ are the pivot columns. The λi are determined easiest if we start with the rightmost pivot column and work our way to the left. In the previous example, we would try to find λ1, λ2, λ3 such that
$$\lambda_1 \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ -2 \\ 1 \\ 0 \end{bmatrix}.$$
From here, we find relatively directly that λ3 = 1, λ2 = −1, λ1 = 2. When we put everything together, we must not forget the non-pivot columns for which we set the coefficients implicitly to 0. Therefore, we get the particular solution x = [2, 0, −1, 1, 0]⊤. ♦
Remark (Reduced Row Echelon Form). An equation system is in reduced row echelon form (also: row-reduced echelon form or row canonical form) if
the second column from three times the first column. Now, we look at the
fifth column, which is our second non-pivot column. The fifth column can
be expressed as 3 times the first pivot column, 9 times the second pivot
column, and −4 times the third pivot column. We need to keep track of
the indices of the pivot columns and translate this into 3 times the first col-
umn, 0 times the second column (which is a non-pivot column), 9 times
the third column (which is our second pivot column), and −4 times the
fourth column (which is the third pivot column). Then we need to subtract
the fifth column to obtain 0. In the end, we are still solving a homogeneous
equation system.
To summarize, all solutions of Ax = 0, x ∈ R5, are given by
$$\left\{ x \in \mathbb{R}^5 : x = \lambda_1 \begin{bmatrix} 3 \\ -1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 3 \\ 0 \\ 9 \\ -4 \\ -1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \qquad (2.50)$$
$$[A \,|\, I_n] \;\rightsquigarrow\; \cdots \;\rightsquigarrow\; [I_n \,|\, A^{-1}]. \qquad (2.56)$$
This means that if we bring the augmented equation system into reduced
row echelon form, we can read out the inverse on the right-hand side of
the equation system. Hence, determining the inverse of a matrix is equiv-
alent to solving systems of linear equations.
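A minimal SymPy sketch (not from the book) of this procedure: row-reduce the augmented matrix [A | I] and read off the inverse; the 4×4 matrix used here is just an illustrative, invertible example.

```python
import sympy as sp

A = sp.Matrix([[1, 0, 2, 0],
               [1, 1, 0, 0],
               [1, 2, 0, 1],
               [1, 1, 1, 1]])

augmented = A.row_join(sp.eye(4))   # build [A | I]
rref, _ = augmented.rref()          # reduced row echelon form

A_inv = rref[:, 4:]                 # right-hand block is A^{-1}
assert A_inv == A.inv()
print(A_inv)
```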
and use the Moore-Penrose pseudo-inverse (A⊤A)⁻¹A⊤ to determine the solution (2.59) that solves Ax = b, which also corresponds to the minimum norm least-squares solution. A disadvantage of this approach is that it requires many computations for the matrix-matrix product and computing the inverse of A⊤A. Moreover, for reasons of numerical precision it is generally not recommended to compute the inverse or pseudo-inverse. In the following, we therefore briefly discuss alternative approaches to solving systems of linear equations.
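A minimal NumPy sketch (not from the book, using random data purely for illustration) comparing the explicit pseudo-inverse with the numerically preferred least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))      # overdetermined system: more equations than unknowns
b = rng.normal(size=10)

x_pinv = np.linalg.inv(A.T @ A) @ A.T @ b          # (A^T A)^{-1} A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # preferred in practice
print(np.allclose(x_pinv, x_lstsq))                # True
```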
Gaussian elimination plays an important role when computing determinants (Section 4.1), checking whether a set of vectors is linearly independent (Section 2.5), computing the inverse of a matrix (Section 2.2.2), computing the rank of a matrix (Section 2.6.2) and a basis of a vector space (Section 2.6.1). Gaussian elimination is an intuitive and constructive way to solve a system of linear equations with thousands of variables. However, for systems with millions of variables, it is impractical as the required number of arithmetic operations scales cubically in the number of simultaneous equations.
In practice, systems of many linear equations are solved indirectly, by either stationary iterative methods, such as the Richardson method, the Jacobi method, the Gauß-Seidel method, and the successive over-relaxation method, or Krylov subspace methods, such as conjugate gradients, generalized minimal residual, or biconjugate gradients. We refer to the books by Strang (2003), Liesen and Mehrmann (2015), and Stoer and Bulirsch (2002) for further details.
Let x* be a solution of Ax = b. The key idea of these iterative methods is to set up an iteration of the form
$$x^{(k+1)} = C x^{(k)} + d \qquad (2.60)$$
for suitable C and d that reduces the residual error ‖x^{(k+1)} − x*‖ in every iteration and converges to x*. We will introduce norms ‖·‖, which allow us to compute similarities between vectors, in Section 3.1.
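As an illustration of (2.60), here is a minimal sketch of the Jacobi method in NumPy (not part of the original text); the example matrix is diagonally dominant, a standard sufficient condition for convergence.

```python
import numpy as np

def jacobi(A, b, num_iters=50):
    """Jacobi iteration: an instance of x^(k+1) = C x^(k) + d from (2.60)."""
    D = np.diag(np.diag(A))
    C = -np.linalg.inv(D) @ (A - D)      # iteration matrix
    d = np.linalg.inv(D) @ b
    x = np.zeros_like(b)
    for _ in range(num_iters):
        x = C @ x + d
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
print(jacobi(A, b))
print(np.linalg.solve(A, b))             # reference solution
```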
2.4.1 Groups
Groups play an important role in computer science. Besides providing a
fundamental framework for operations on sets, they are heavily used in
cryptography, coding theory and graphics.
might say "You can get to Kigali by first going 506 km Northwest to Kampala (Uganda) and then 374 km Southwest.". This is sufficient information to describe the location of Kigali because the geographic coordinate system may be considered a two-dimensional vector space (ignoring altitude and the Earth's curved surface). The person may add "It is about 751 km West of here." Although this last statement is true, it is not necessary to find Kigali given the previous information (see Figure 2.7 for an illustration). In this example, the "506 km Northwest" vector (blue) and the "374 km Southwest" vector (purple) are linearly independent. This means the Southwest vector cannot be described in terms of the Northwest vector, and vice versa. However, the third "751 km West" vector (black) is a linear combination of the other two vectors, and it makes the set of vectors linearly dependent. Equivalently, "751 km West" and "374 km Southwest" can be linearly combined to obtain "506 km Northwest".
Remark. The following properties are useful to find out whether vectors
are linearly independent.
Example 2.14
Consider R4 with
$$x_1 = \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix}, \quad x_3 = \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix}. \qquad (2.67)$$
To check whether they are linearly dependent, we follow the general approach and solve
$$\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 = \lambda_1 \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix} = 0 \qquad (2.68)$$
for λ1, . . . , λ3. We write the vectors xi, i = 1, 2, 3, as the columns of a matrix and apply elementary row operations until we identify the pivot columns:
$$\begin{bmatrix} 1 & 1 & -1 \\ 2 & 1 & -2 \\ -3 & 0 & 1 \\ 4 & 2 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}. \qquad (2.69)$$
Here, every column of the matrix is a pivot column. Therefore, there is no non-trivial solution, and we require λ1 = 0, λ2 = 0, λ3 = 0 to solve the equation system. Hence, the vectors x1, x2, x3 are linearly independent.
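A one-line numerical check of Example 2.14 (a NumPy sketch, not from the book): the vectors are linearly independent exactly when the rank of the column-stacked matrix equals the number of columns.

```python
import numpy as np

X = np.column_stack([[1, 2, -3, 4],
                     [1, 1, 0, 2],
                     [-1, -2, 1, 1]])   # x1, x2, x3 from (2.67) as columns
print(np.linalg.matrix_rank(X) == X.shape[1])   # True: linearly independent
```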
This means that {x1 , . . . , xm } are linearly independent if and only if the
column vectors {λ1 , . . . , λm } are linearly independent.
♦
Remark. In a vector space V , m linear combinations of k vectors x1 , . . . , xk
are linearly dependent if m > k . ♦
Example 2.15
Consider a set of linearly independent vectors b1, b2, b3, b4 ∈ Rn and
$$\begin{aligned} x_1 &= b_1 - 2b_2 + b_3 - b_4 \\ x_2 &= -4b_1 - 2b_2 + 4b_4 \\ x_3 &= 2b_1 + 3b_2 - b_3 - 3b_4 \\ x_4 &= 17b_1 - 10b_2 + 11b_3 + b_4 \end{aligned} \qquad (2.73)$$
Are the vectors x1, . . . , x4 ∈ Rn linearly independent? To answer this question, we investigate whether the column vectors
$$\left\{ \begin{bmatrix} 1 \\ -2 \\ 1 \\ -1 \end{bmatrix}, \begin{bmatrix} -4 \\ -2 \\ 0 \\ 4 \end{bmatrix}, \begin{bmatrix} 2 \\ 3 \\ -1 \\ -3 \end{bmatrix}, \begin{bmatrix} 17 \\ -10 \\ 11 \\ 1 \end{bmatrix} \right\} \qquad (2.74)$$
are linearly independent. The reduced row echelon form of the corresponding linear equation system with coefficient matrix
$$A = \begin{bmatrix} 1 & -4 & 2 & 17 \\ -2 & -2 & 3 & -10 \\ 1 & 0 & -1 & 11 \\ -1 & 4 & -3 & 1 \end{bmatrix} \qquad (2.75)$$
is given as
$$\begin{bmatrix} 1 & 0 & 0 & -7 \\ 0 & 1 & 0 & -15 \\ 0 & 0 & 1 & -18 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \qquad (2.76)$$
We see that the corresponding linear equation system is non-trivially solvable: The last column is not a pivot column, and x4 = −7x1 − 15x2 − 18x3. Therefore, x1, . . . , x4 are linearly dependent as x4 can be expressed as a linear combination of x1, . . . , x3.
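The reduced row echelon form (2.76) can be reproduced with a short SymPy sketch (not part of the original text):

```python
import sympy as sp

A = sp.Matrix([[1, -4, 2, 17],
               [-2, -2, 3, -10],
               [1, 0, -1, 11],
               [-1, 4, -3, 1]])

rref, pivots = A.rref()
print(rref)      # matches (2.76)
print(pivots)    # (0, 1, 2): the last column is not a pivot column
```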
Generating sets are sets of vectors that span vector (sub)spaces, i.e.,
every vector can be represented as a linear combination of the vectors
in the generating set. Now, we will be more specific and characterize the
smallest generating set that spans a vector (sub)space.
Example 2.16
The set
$$\mathcal{A} = \left\{ \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}, \begin{bmatrix} 2 \\ -1 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 0 \\ -4 \end{bmatrix} \right\} \qquad (2.80)$$
$$\begin{bmatrix} x_1, x_2, x_3, x_4 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 & -1 \\ 2 & -1 & -4 & 8 \\ -1 & 1 & 3 & -5 \\ -1 & 2 & 5 & -6 \\ -1 & -2 & -3 & 1 \end{bmatrix}. \qquad (2.83)$$
With the basic transformation rules for systems of linear equations, we obtain the row echelon form
$$\begin{bmatrix} 1 & 2 & 3 & -1 \\ 2 & -1 & -4 & 8 \\ -1 & 1 & 3 & -5 \\ -1 & 2 & 5 & -6 \\ -1 & -2 & -3 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 2 & 3 & -1 \\ 0 & 1 & 2 & -2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$
Since the pivot columns indicate which set of vectors is linearly independent, we see from the row echelon form that x1, x2, x4 are linearly independent (because the system of linear equations λ1x1 + λ2x2 + λ4x4 = 0 can only be solved with λ1 = λ2 = λ4 = 0). Therefore, {x1, x2, x4} is a basis of U.
2.6.2 Rank
The number of linearly independent columns of a matrix A ∈ Rm×n equals the number of linearly independent rows and is called the rank of A, denoted by rk(A).
Remark. The rank of a matrix has some important properties:
rk(A) = rk(A⊤), i.e., the column rank equals the row rank.
The columns of A ∈ Rm×n span a subspace U ⊆ Rm with dim(U) = rk(A). Later, we will call this subspace the image or range. A basis of U can be found by applying Gaussian elimination to A to identify the pivot columns.
The rows of A ∈ Rm×n span a subspace W ⊆ Rn with dim(W) = rk(A). A basis of W can be found by applying Gaussian elimination to A⊤.
For all A ∈ Rn×n it holds that A is regular (invertible) if and only if rk(A) = n.
For all A ∈ Rm×n and all b ∈ Rm it holds that the linear equation system Ax = b can be solved if and only if rk(A) = rk(A|b), where A|b denotes the augmented system.
For A ∈ Rm×n the subspace of solutions for Ax = 0 possesses dimension n − rk(A). Later, we will call this subspace the kernel or the null space.
A matrix A ∈ Rm×n has full rank if its rank equals the largest possible rank for a matrix of the same dimensions. This means that the rank of a full-rank matrix is the lesser of the number of rows and columns, i.e., rk(A) = min(m, n). A matrix is said to be rank deficient if it does not have full rank.
♦
$$A = \begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix}.$$
We use Gaussian elimination to determine the rank:
$$\begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 1 & 3 \\ 0 & 0 & 0 \end{bmatrix}. \qquad (2.84)$$
Here, we see that the number of linearly independent rows and columns is 2, such that rk(A) = 2.
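The same result can be obtained numerically with a minimal NumPy sketch (not from the book):

```python
import numpy as np

A = np.array([[1, 2, 1],
              [-2, -3, 1],
              [3, 5, 0]])
print(np.linalg.matrix_rank(A))   # 2, i.e., A is rank deficient
```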
B = (b1 , . . . , bn ) (2.89)
x = α1 b1 + . . . + αn bn (2.90)
Example 2.20
Let us have a look at a geometric vector x ∈ R2 with coordinates [2, 3]⊤ with respect to the standard basis (e1, e2) of R2. This means, we can write x = 2e1 + 3e2. However, we do not have to choose the standard basis to represent this vector. If we use the basis vectors b1 = [1, −1]⊤, b2 = [1, 1]⊤, we will obtain the coordinates ½[−1, 5]⊤ to represent the same vector with respect to (b1, b2); see Figure 2.9.
[Figure 2.9: Different coordinate representations of a vector x, depending on the choice of basis: x = 2e1 + 3e2 and x = −½ b1 + 5/2 b2.]
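The new coordinates can be found by solving a small linear system, as in this NumPy sketch (not part of the original text):

```python
import numpy as np

x = np.array([2.0, 3.0])        # coordinates w.r.t. the standard basis
B = np.array([[1.0, 1.0],
              [-1.0, 1.0]])     # columns are b1 = [1, -1], b2 = [1, 1]

coords = np.linalg.solve(B, x)  # coordinates of x with respect to (b1, b2)
print(coords)                   # [-0.5  2.5]
```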
ŷ = AΦ x̂ . (2.94)
This means that the transformation matrix can be used to map coordinates
with respect to an ordered basis in V to coordinates with respect to an
ordered basis in W .
where we first expressed the new basis vectors c̃k ∈ W as linear combinations of the basis vectors cl ∈ W and then swapped the order of summation.
Alternatively, when we express the b̃j ∈ V as linear combinations of bi ∈ V, we arrive at
$$\Phi(\tilde{b}_j) \overset{(2.106)}{=} \Phi\!\left(\sum_{i=1}^{n} s_{ij} b_i\right) = \sum_{i=1}^{n} s_{ij} \Phi(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li} c_l \qquad (2.109a)$$
$$= \sum_{l=1}^{m} \left( \sum_{i=1}^{n} a_{li} s_{ij} \right) c_l, \qquad j = 1, \ldots, n, \qquad (2.109b)$$
and, therefore,
$$T \tilde{A}_\Phi = A_\Phi S, \qquad (2.112)$$
such that
$$\tilde{A}_\Phi = T^{-1} A_\Phi S. \qquad (2.113)$$
$$\tilde{B} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \right) \in \mathbb{R}^3, \qquad \tilde{C} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right). \qquad (2.119)$$
Then,
$$S = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \qquad T = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad (2.120)$$
where the ith column of S is the coordinate representation of b̃i in terms of the basis vectors of B. Since B is the standard basis, the coordinate representation is straightforward to find. For a general basis B we would need to solve a linear equation system to find the λi such that $\sum_{i=1}^{3} \lambda_i b_i = \tilde{b}_j$,
[Figure: The kernel ker(Φ) and image Im(Φ) of a linear mapping Φ : V → W, with 0V ∈ ker(Φ) and 0W ∈ Im(Φ).]
$$= x_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix} + x_2 \begin{bmatrix} 2 \\ 0 \end{bmatrix} + x_3 \begin{bmatrix} -1 \\ 0 \end{bmatrix} + x_4 \begin{bmatrix} 0 \\ 1 \end{bmatrix} \qquad (2.125b)$$
is linear. To determine Im(Φ), we can take the span of the columns of the transformation matrix and obtain
$$\operatorname{Im}(\Phi) = \operatorname{span}\!\left[ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right]. \qquad (2.126)$$
To compute the kernel (null space) of Φ, we need to solve Ax = 0, i.e., we need to solve a homogeneous equation system. To do this, we use Gaussian elimination to transform A into reduced row echelon form:
$$\begin{bmatrix} 1 & 2 & -1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & -\tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix}. \qquad (2.127)$$
This matrix is in reduced row echelon form, and we can use the Minus-1 Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively, we can express the non-pivot columns (columns 3 and 4) as linear combinations of the pivot columns (columns 1 and 2). The third column a3 is equivalent to −½ times the second column a2. Therefore, 0 = a3 + ½a2. In the same way, we see that a4 = a1 − ½a2 and, therefore, 0 = a1 − ½a2 − a4. Overall, this gives us the kernel (null space) as
$$\ker(\Phi) = \operatorname{span}\!\left[ \begin{bmatrix} 0 \\ \tfrac{1}{2} \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ \tfrac{1}{2} \\ 0 \\ 1 \end{bmatrix} \right]. \qquad (2.128)$$
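A minimal NumPy/SciPy sketch (not part of the original text) that computes a kernel basis numerically and checks the dimension count of the rank-nullity theorem stated next:

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, -1.0, 0.0],
              [1.0, 0.0, 0.0, 1.0]])

N = null_space(A)                  # orthonormal basis of ker(A), shape (4, 2)
print(np.allclose(A @ N, 0))       # True
# dim(ker) + dim(im) = dim of the domain (number of columns)
print(np.linalg.matrix_rank(A) + N.shape[1] == A.shape[1])   # True
```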
Theorem 2.24 (Rank-Nullity Theorem). For vector spaces V, W and a linear mapping Φ : V → W it holds that
$$\dim(\ker(\Phi)) + \dim(\operatorname{Im}(\Phi)) = \dim(V). \qquad (2.129)$$
The rank-nullity theorem is also referred to as the fundamental theorem of linear mappings (Axler, 2015, Theorem 3.22). Direct consequences of Theorem 2.24 are:
If dim(Im(Φ)) < dim(V), then ker(Φ) is non-trivial, i.e., the kernel contains more than 0V and dim(ker(Φ)) ⩾ 1.
If AΦ is the transformation matrix of Φ with respect to an ordered basis and dim(Im(Φ)) < dim(V), then the system of linear equations AΦ x = 0 has infinitely many solutions.
If dim(V) = dim(W), then the following three-way equivalence holds:
– Φ is injective
– Φ is surjective
– Φ is bijective
since Im(Φ) ⊆ W.
One-dimensional affine subspaces are called lines and can be written as y = x0 + λx1, where λ ∈ R and U = span[x1] ⊆ Rn is a one-dimensional subspace of Rn. This means, a line is defined by a support point x0 and a vector x1 that defines the direction. See Figure 2.13 for an illustration.
Definition 2.26 (Affine mapping). For two vector spaces V, W and a lin-
Exercises
2.1 We consider (R\{−1}, ⋆) where
a ⋆ b := ab + a + b, a, b ∈ R\{−1} (2.134)
3 ⋆ x ⋆ x = 15
k = {x ∈ Z | x − k = 0 (mod n)}
= {x ∈ Z | (∃a ∈ Z) : (x − k = n · a)} .
Zn = {0, 1, . . . , n − 1}
a ⊕ b := a + b
a⊗b=a×b (2.135)
2.
$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$
3.
$$\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$
4.
$$\begin{bmatrix} 1 & 2 & 1 & 2 \\ 4 & 1 & -1 & -4 \end{bmatrix} \begin{bmatrix} 0 & 3 \\ 1 & -1 \\ 2 & 1 \\ 5 & 2 \end{bmatrix}$$
5.
$$\begin{bmatrix} 0 & 3 \\ 1 & -1 \\ 2 & 1 \\ 5 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 & 2 \\ 4 & 1 & -1 & -4 \end{bmatrix}$$
2.5 Find the set S of all solutions in x of the following inhomogeneous linear systems Ax = b, where A and b are defined below:
1.
$$A = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 2 & 5 & -7 & -5 \\ 2 & -1 & 1 & 3 \\ 5 & 2 & -4 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ -2 \\ 4 \\ 6 \end{bmatrix}$$
2.
$$A = \begin{bmatrix} 1 & -1 & 0 & 0 & 1 \\ 1 & 1 & 0 & -3 & 0 \\ 2 & -1 & 0 & 1 & -1 \\ -1 & 2 & 0 & -2 & -1 \end{bmatrix}, \qquad b = \begin{bmatrix} 3 \\ 6 \\ 5 \\ -1 \end{bmatrix}$$
and $\sum_{i=1}^{3} x_i = 1$.
2.7 Determine the inverse of the following matrices if possible:
1.
$$A = \begin{bmatrix} 2 & 3 & 4 \\ 3 & 4 & 5 \\ 4 & 5 & 6 \end{bmatrix}$$
2.
$$A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}$$
2.
$$x_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 1 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$$
2.10 Write
$$y = \begin{bmatrix} 1 \\ -2 \\ 5 \end{bmatrix}$$
as linear combination of
$$x_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}$$
Determine a basis of U1 ∩ U2 .
Φ : L¹([a, b]) → R
$$f \mapsto \Phi(f) = \int_a^b f(x)\,dx,$$
where L¹([a, b]) denotes the set of integrable functions on [a, b].
2.
Φ : C¹ → C⁰
f ↦ Φ(f) = f′.
Φ : R → R
x ↦ Φ(x) = cos(x)
4.
Φ : R³ → R²
$$x \mapsto \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \end{bmatrix} x$$
Φ : R² → R²
$$x \mapsto \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix} x$$
Φ : R³ → R⁴
$$\Phi\!\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{bmatrix} 3x_1 + 2x_2 + x_3 \\ x_1 + x_2 + x_3 \\ x_1 - 3x_2 \\ 2x_1 + 3x_2 + x_3 \end{bmatrix}$$
and let us define two ordered bases B = (b1, b2) and B′ = (b′1, b′2) of R2.
1. Show that B and B′ are two bases of R2 and draw those basis vectors.
2. Compute the matrix P1 that performs a basis change from B′ to B.
3
Analytic Geometry
[Figure: Concepts of this chapter — lengths, angles, orthogonal projection, rotations.]
3.1 Norms
When we think of geometric vectors, i.e., directed line segments that start
at the origin, then intuitively the length of a vector is the distance of the
“end” of this directed line segment from the origin. In the following, we
will discuss the notion of the length of vectors using the concept of a norm.
$$\|\cdot\| : V \to \mathbb{R}, \qquad (3.1)$$
$$x \mapsto \|x\|, \qquad (3.2)$$
which assigns each vector x its length ‖x‖ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:
Absolutely homogeneous: ‖λx‖ = |λ| ‖x‖
$$\|x\|_1 := \sum_{i=1}^{n} |x_i|, \qquad (3.3)$$
where |·| is the absolute value. The left panel of Figure 3.2 shows all vectors x ∈ R2 with ‖x‖1 = 1. The Manhattan norm is also called ℓ1 norm.
and computes the Euclidean distance of x from the origin. The right panel of Figure 3.2 shows all vectors x ∈ R2 with ‖x‖2 = 1. The Euclidean norm is also called ℓ2 norm.
Remark. Throughout this book, we will use the Euclidean norm (3.4) by default if not stated otherwise. ♦
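A minimal NumPy sketch (not from the book) evaluating the two norms introduced above:

```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, ord=1))   # Manhattan (l1) norm: |3| + |-4| = 7
print(np.linalg.norm(x, ord=2))   # Euclidean (l2) norm: sqrt(9 + 16) = 5
```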
We will refer to the particular inner product above as the dot product
in this book. However, inner products are more general concepts with
specific properties, which we will now introduce.
where Aij := ⟨bi, bj⟩ and x̂, ŷ are the coordinates of x and y with respect to the basis B. This implies that the inner product ⟨·, ·⟩ is uniquely determined through A. The symmetry of the inner product also means that A
The null space (kernel) of A consists only of 0 because x⊤Ax > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
The diagonal elements aii of A are positive because aii = e⊤i A ei > 0, where ei is the ith vector of the standard basis in Rn.
in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan norm (3.3) is an example of a norm without a corresponding inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances, and angles.
Remark (Cauchy-Schwarz Inequality). For an inner product vector space (V, ⟨·, ·⟩) the induced norm ‖·‖ satisfies the Cauchy-Schwarz inequality
$$|\langle x, y \rangle| \leqslant \|x\| \, \|y\|. \qquad (3.17)$$
♦
is called the distance between x and y for x, y ∈ V. If we use the dot product as the inner product, then the distance is called the Euclidean distance.
The mapping
$$d : V \times V \to \mathbb{R} \qquad (3.22)$$
$$(x, y) \mapsto d(x, y) \qquad (3.23)$$
1. d is positive definite, i.e., d(x, y) ⩾ 0 for all x, y ∈ V and d(x, y) = 0 ⟺ x = y.
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V.
3. Triangle inequality: d(x, z) ⩽ d(x, y) + d(y, z) for all x, y, z ∈ V.
Remark. At first glance the lists of properties of inner products and metrics look very similar. However, by comparing Definition 3.3 with Definition 3.6 we observe that ⟨x, y⟩ and d(x, y) behave in opposite directions: very similar x and y will result in a large value for the inner product and a small value for the metric. ♦
$$-1 \leqslant \frac{\langle x, y \rangle}{\|x\|\,\|y\|} \leqslant 1. \qquad (3.24)$$
Therefore, there exists a unique ω ∈ [0, π], illustrated in Figure 3.4, with
$$\cos\omega = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}. \qquad (3.25)$$
The number ω is the angle between the vectors x and y. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: their orientation is the same.
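A minimal NumPy sketch (not part of the original text) of (3.25) for a general inner product ⟨x, y⟩ = x⊤Ay; the helper function and the 109.47° value for the non-standard inner product are illustrative computations, not quoted from the book.

```python
import numpy as np

def angle(x, y, A=None):
    """Angle between x and y w.r.t. <u, v> = u^T A v (A = I gives the dot product)."""
    if A is None:
        A = np.eye(len(x))
    inner = lambda u, v: u @ A @ v
    cos_w = inner(x, y) / np.sqrt(inner(x, x) * inner(y, y))
    return np.arccos(np.clip(cos_w, -1.0, 1.0))

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
print(np.degrees(angle(x, y)))                            # 90.0 with the dot product
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
print(np.degrees(angle(x, y, A)))                         # ~109.47 with <x,y> = x^T A y
```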
Consider two vectors x = [1, 1]⊤, y = [−1, 1]⊤ ∈ R2; see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as inner product yields an angle ω between x and y of 90°, such that x ⊥ y. However, if we choose the inner product
$$\langle x, y \rangle = x^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} y, \qquad (3.27)$$
which gives exactly the angle between x and y. This means that orthogonal matrices A with A⊤ = A⁻¹ preserve both angles and distances. It turns out that orthogonal matrices define transformations that are rotations (with the possibility of flips). In Section 3.9, we will discuss more details about rotations.
for all i, j = 1, . . . , n, then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note that (3.34) implies that every basis vector has length/norm 1.
Recall from Section 2.6.1 that we can use Gaussian elimination to find a basis for a vector space spanned by a set of vectors. Assume we are given a set {b̃1, . . . , b̃n} of non-orthogonal and unnormalized basis vectors. We concatenate them into a matrix B̃ = [b̃1, . . . , b̃n] and apply Gaussian elimination to the augmented matrix (Section 2.3.2) [B̃B̃⊤ | B̃] to obtain an orthonormal basis. This constructive way to iteratively build an orthonormal basis {b1, . . . , bn} is called the Gram-Schmidt process (Strang, 2003).
for lower and upper limits a, b < ∞, respectively. As with our usual inner product, we can define norms and orthogonality by looking at the inner product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To make the above inner product mathematically precise, we need to take care of measures and the definition of integrals, leading to the definition of a Hilbert space. Furthermore, unlike inner products on finite-dimensional vectors, inner products on functions may diverge (have infinite value). All this requires diving into some more intricate details of real and functional analysis, which we do not cover in this book.
product evaluates to 0. Therefore, sin and cos are orthogonal functions.
[Figure 3.8: f(x) = sin(x) cos(x).]
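This orthogonality can be checked numerically with a minimal SciPy sketch (not from the book); it assumes the integration limits a = −π and b = π used in this example.

```python
import numpy as np
from scipy.integrate import quad

# Inner product of functions (3.37), evaluated numerically on [-pi, pi].
inner, _ = quad(lambda x: np.sin(x) * np.cos(x), -np.pi, np.pi)
print(abs(inner) < 1e-10)    # True: sin and cos are orthogonal on [-pi, pi]
```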
[Figure 3.9: Orthogonal projection (orange dots) of a two-dimensional dataset (blue dots) onto a one-dimensional subspace (straight line).]
the dataset and extract relevant patterns. For example, machine learning algorithms, such as Principal Component Analysis (PCA) by Pearson (1901) and Hotelling (1933) and Deep Neural Networks (e.g., deep auto-encoders (Deng et al., 2010)), heavily exploit the idea of dimensionality reduction. In the following, we will focus on orthogonal projections, which we will use in Chapter 10 for linear dimensionality reduction and in Chapter 12 for classification. Even linear regression, which we discuss in Chapter 9, can be interpreted using orthogonal projections. For a given lower-dimensional subspace, orthogonal projections of high-dimensional data retain as much information as possible and minimize the difference/error between the original data and the corresponding projection. An illustration of such an orthogonal projection is given in Figure 3.9. Before we detail how to obtain these projections, let us define what a projection actually is.
[Figure 3.10: (a) Projection of x ∈ R2 onto a subspace U with basis vector b. (b) Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.]
We can now exploit the bilinearity of the inner product and arrive at
$$\langle x, b \rangle - \lambda \langle b, b \rangle = 0 \iff \lambda = \frac{\langle x, b \rangle}{\langle b, b \rangle} = \frac{\langle b, x \rangle}{\|b\|^2}. \qquad (3.40)$$
(With a general inner product, we get λ = ⟨x, b⟩ if ‖b‖ = 1.) In the last step, we exploited the fact that inner products are symmetric. If we choose ⟨·, ·⟩ to be the dot product, we obtain
$$\lambda = \frac{b^\top x}{b^\top b} = \frac{b^\top x}{\|b\|^2}. \qquad (3.41)$$
$$\pi_U(x) = \lambda b = \frac{\langle x, b \rangle}{\|b\|^2}\, b = \frac{b^\top x}{\|b\|^2}\, b, \qquad (3.42)$$
where the last equality holds for the dot product only. We can also compute the length of πU(x) by means of Definition 3.1 as
$$\|\pi_U(x)\| = \|\lambda b\| = |\lambda|\, \|b\|. \qquad (3.43)$$
Hence, our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of πU(x) with respect to the basis vector b that spans our one-dimensional subspace U.
If we use the dot product as an inner product, we get
$$\pi_U(x) = \lambda b = b\lambda = b\,\frac{b^\top x}{\|b\|^2} = \frac{b b^\top}{\|b\|^2}\, x, \qquad (3.45)$$
and we immediately see that
$$P_\pi = \frac{b b^\top}{\|b\|^2}. \qquad (3.46)$$
Note that bb⊤ (and, consequently, Pπ) is a symmetric matrix (of rank 1), and ‖b‖² = ⟨b, b⟩ is a scalar. (Projection matrices are always symmetric.)
The projection matrix Pπ projects any vector x ∈ Rn onto the line through the origin with direction b (equivalently, the subspace U spanned by b).
Remark. The projection πU (x) ∈ Rn is still an n-dimensional vector and
not a scalar. However, we no longer require n coordinates to represent the
projection, but only a single one if we want to express it with respect to
the basis vector b that spans the subspace U : λ. ♦
Let us now choose a particular x and see whether it lies in the subspace spanned by b. For x = [1, 1, 1]⊤, the projection is
$$\pi_U(x) = P_\pi x = \frac{1}{9}\begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9}\begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\!\left[\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}\right]. \qquad (3.48)$$
Note that the application of Pπ to πU(x) does not change anything, i.e., Pπ πU(x) = πU(x). This is expected because, according to Definition 3.10, we know that a projection matrix Pπ satisfies P²π x = Pπ x for all x.
Remark. With the results from Chapter 4 we can show that πU(x) is an eigenvector of Pπ, and the corresponding eigenvalue is 1. ♦
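A minimal NumPy sketch (not from the book) reproducing this 1D projection and its idempotence:

```python
import numpy as np

b = np.array([[1.0], [2.0], [2.0]])   # basis vector of U as a column
P = (b @ b.T) / (b.T @ b)             # projection matrix (3.46)

x = np.array([1.0, 1.0, 1.0])
proj = P @ x
print(proj)                           # [0.556 1.111 1.111] = (1/9) * [5, 10, 10]
print(np.allclose(P @ proj, proj))    # True: P is idempotent (P^2 = P)
```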
The matrix (B⊤B)⁻¹B⊤ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires that B⊤B is positive definite, which is the case if B is full rank. In practical applications (e.g., linear regression), we often add a "jitter term" I to
Remark. The solution for projecting onto general subspaces includes the 1D case as a special case: If dim(U) = 1, then B⊤B ∈ R is a scalar and we can rewrite the projection matrix in (3.59) Pπ = B(B⊤B)⁻¹B⊤ as Pπ = (BB⊤)/(B⊤B), which is exactly the projection matrix in (3.46). ♦
The corresponding projection error is the norm of the difference vector between the original vector and its projection onto U, i.e.,
$$\|x - \pi_U(x)\| = \left\| \begin{bmatrix} 1 & -2 & 1 \end{bmatrix}^\top \right\| = \sqrt{6}. \qquad (3.63)$$
(The projection error is also called the reconstruction error.) To verify the results, we can (a) check whether the displacement vector πU(x) − x is orthogonal to all basis vectors of U, and (b) verify that Pπ = P²π (see Definition 3.10).
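The following NumPy sketch (not part of the original text) illustrates projection onto a general subspace via Pπ = B(B⊤B)⁻¹B⊤; the basis B and vector x are chosen to be consistent with the projection error √6 above, but the specific numbers here are the sketch's assumption rather than quoted from the visible excerpt.

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])           # columns span a 2D subspace U of R^3
x = np.array([6.0, 0.0, 0.0])

P = B @ np.linalg.inv(B.T @ B) @ B.T  # projection matrix (3.59)
proj = P @ x
print(proj)                            # [ 5.  2. -1.]
print(x - proj)                        # [ 1. -2.  1.], with norm sqrt(6)
print(np.allclose(B.T @ (x - proj), 0))  # True: residual is orthogonal to U
```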
Remark. The projections πU (x) are still vectors in Rn although they lie in
an m-dimensional subspace U ⊆ Rn . However, to represent a projected
vector we only need the m coordinates λ1 , . . . , λm with respect to the
basis vectors b1 , . . . , bm of U . ♦
Remark. In vector spaces with general inner products, we have to pay
attention when computing angles and distances, which are defined by
means of the inner product. ♦
Projections allow us to look at situations where we have a linear system Ax = b without a solution. Recall that this means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least-squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Section 9.4. Using reconstruction errors (3.63) is one possible approach to derive principal component analysis (Section 10.3).
Remark. We just looked at projections of vectors x onto a subspace U with basis vectors {b1, . . . , bk}. If this basis is an ONB, i.e., (3.33)–(3.34) are satisfied, the projection equation (3.58) simplifies greatly to
$$\pi_U(x) = B B^\top x \qquad (3.65)$$
[Figure 3.12: Gram-Schmidt orthogonalization. (a) Non-orthogonal basis (b1, b2) of R2; (b) first constructed basis vector u1 and orthogonal projection of b2 onto span[u1]; (c) orthogonal basis (u1, u2) of R2, where u2 = b2 − πspan[u1](b2).]
Consider a basis (b1, b2) of R2, where
$$b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \quad b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad (3.69)$$
see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an orthogonal basis (u1, u2) of R2 as follows (assuming the dot product as the inner product):
$$u_1 := b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \qquad (3.70)$$
$$u_2 := b_2 - \pi_{\operatorname{span}[u_1]}(b_2) \overset{(3.45)}{=} b_2 - \frac{u_1 u_1^\top}{\|u_1\|^2}\, b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \qquad (3.71)$$
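A minimal NumPy sketch of the Gram-Schmidt construction (not from the book); the helper function `gram_schmidt` is an illustrative implementation, not the book's code.

```python
import numpy as np

def gram_schmidt(B):
    """Orthogonalize the columns of B (assumed linearly independent) by
    subtracting projections onto previously constructed vectors."""
    U = np.zeros_like(B, dtype=float)
    for j in range(B.shape[1]):
        u = B[:, j].astype(float)
        for i in range(j):
            ui = U[:, i]
            u = u - (ui @ B[:, j]) / (ui @ ui) * ui   # subtract pi_span[u_i](b_j)
        U[:, j] = u
    return U

B = np.array([[2.0, 1.0],
              [0.0, 1.0]])          # columns b1, b2 from (3.69)
print(gram_schmidt(B))               # columns [2, 0] and [0, 1], as in (3.70)-(3.71)
```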
[Figure 3.13: Projection onto an affine space. (a) Original setting; (b) setting shifted by −x0 so that x − x0 can be projected onto the direction space U = L − x0; (c) the projection is translated back to x0 + πU(x − x0), which gives the final orthogonal projection πL(x).]
[Figure 3.14: A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise. Shown: original object and the object rotated by 112.5°.]
3.9 Rotations
Length and angle preservation, as discussed in Section 3.4, are the two characteristics of linear mappings with orthogonal transformation matrices. In the following, we will have a closer look at specific orthogonal transformation matrices, which describe rotations.
A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle θ about the origin, i.e., the origin is a fixed point. For a positive angle θ > 0, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.14, where the transformation matrix is
$$R = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix}. \qquad (3.74)$$
Important application areas of rotations include computer graphics and robotics. For example, in robotics, it is often important to know how to rotate the joints of a robotic arm in order to pick up or place an object; see Figure 3.15.
[Figure 3.16: Rotation of the standard basis of R2 by an angle θ.]
3.9.1 Rotations in R2
Consider the standard basis $e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ of R2, which defines the standard coordinate system in R2. We aim to rotate this coordinate system by an angle θ as illustrated in Figure 3.16. Note that the rotated vectors are still linearly independent and, therefore, are a basis of R2. This means that the rotation performs a basis change.
Rotations Φ are linear mappings so that we can express them by a rotation matrix R(θ). Trigonometry (see Figure 3.16) allows us to determine the coordinates of the rotated axes (the image of Φ) with respect to the standard basis in R2. We obtain
$$\Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}, \qquad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix}. \qquad (3.75)$$
Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as
$$R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \qquad (3.76)$$
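A minimal NumPy sketch (not from the book) of the rotation matrix (3.76), checking that it is orthogonal and maps e1 as expected:

```python
import numpy as np

def rotation_matrix(theta):
    """2D rotation matrix R(theta) from (3.76)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s, c]])

R = rotation_matrix(np.deg2rad(90))
print(np.round(R @ np.array([1.0, 0.0]), 10))   # e1 maps to [0, 1]
print(np.allclose(R.T @ R, np.eye(2)))          # True: R is orthogonal
```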
3.9.2 Rotations in R3
In contrast to the R2 case, in R3 we can rotate any two-dimensional plane about a one-dimensional axis. The easiest way to specify the general rotation matrix is to specify how the images of the standard basis e1, e2, e3 are supposed to be rotated, and making sure these images Re1, Re2, Re3 are orthonormal to each other. We can then obtain a general rotation matrix R by combining the images of the standard basis.
To have a meaningful rotation angle, we have to define what "counterclockwise" means when we operate in more than two dimensions. We use the convention that a "counterclockwise" (planar) rotation about an axis refers to a rotation about an axis when we look at the axis "head on, from the end toward the origin". In R3, there are therefore three (planar) rotations about the three standard basis vectors (see Figure 3.17):
[Figure 3.17: Rotation of a vector (gray) in R3 by an angle θ about the e3-axis. The rotated vector is shown in blue.]
$$R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad (3.80)$$
for 1 ⩽ i < j ⩽ n and θ ∈ R. Then Rij(θ) is called a Givens rotation. Essentially, Rij(θ) is the identity matrix In with
$$r_{ii} = \cos\theta, \quad r_{ij} = -\sin\theta, \quad r_{ji} = \sin\theta, \quad r_{jj} = \cos\theta. \qquad (3.81)$$
In two dimensions (i.e., n = 2), we obtain (3.76) as a special case.
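A minimal NumPy sketch (not from the book) constructing a Givens rotation from (3.81); note the helper uses zero-based indices, unlike the 1-based convention in the text.

```python
import numpy as np

def givens(n, i, j, theta):
    """Givens rotation R_ij(theta) in R^{n x n}, following (3.80)-(3.81).
    Indices i < j are zero-based here."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = c
    R[j, j] = c
    R[i, j] = -s
    R[j, i] = s
    return R

R = givens(4, 1, 3, np.pi / 6)
print(np.allclose(R.T @ R, np.eye(4)))   # True: Givens rotations are orthogonal
```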
kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the fact that many linear algorithms can be expressed purely by inner product computations. Then, the "kernel trick" allows us to compute these inner products implicitly in a (potentially infinite-dimensional) feature space, without even knowing this feature space explicitly. This allowed the "non-linearization" of many algorithms used in machine learning, such as kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaussian processes (Rasmussen and Williams, 2006) also fall into the category of kernel methods and are the current state-of-the-art in probabilistic regression (fitting curves to data points). The idea of kernels is explored further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shadows. In optimization, orthogonal projections are often used to (iteratively) minimize residual errors. This also has applications in machine learning, e.g., in linear regression where we want to find a (linear) function that minimizes the residual errors, i.e., the lengths of the orthogonal projections of the data onto the linear function (Bishop, 2006). We will investigate this further in Chapter 9. PCA (Hotelling, 1933; Pearson, 1901) also uses projections to reduce the dimensionality of high-dimensional data. We will discuss this in more detail in Chapter 10.
Exercises
3.1 Show that ⟨·, ·⟩ defined for all x = [x1, x2]⊤ ∈ R2 and y = [y1, y2]⊤ ∈ R2 by
⟨x, y⟩ := x1y1 − (x1y2 + x2y1) + 2(x2y2)
is an inner product.
3.2 Consider R2 with ⟨·, ·⟩ defined for all x and y in R2 as
$$\langle x, y \rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y.$$
using
1. ⟨x, y⟩ := x⊤y
2. ⟨x, y⟩ := x⊤By, B := $\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$
3.5 Consider the Euclidean vector space R5 with the dot product. A subspace U ⊆ R5 and x ∈ R5 are given by
$$U = \operatorname{span}\!\left[ \begin{bmatrix} 0 \\ -1 \\ 2 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -3 \\ 1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} -3 \\ 4 \\ 1 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -3 \\ 5 \\ 0 \\ 7 \end{bmatrix} \right], \qquad x = \begin{bmatrix} -1 \\ -9 \\ -1 \\ 4 \\ 1 \end{bmatrix}$$
U = span[e1 , e3 ] .
3.8 Using the Gram-Schmidt method, turn the basis B = (b1, b2) of a two-dimensional subspace U ⊆ R3 into an ONB C = (c1, c2) of U, where
$$b_1 := \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \qquad b_2 := \begin{bmatrix} -1 \\ 2 \\ 0 \end{bmatrix}.$$
by 30◦ .
4
Matrix Decompositions
[Figure 4.1: A mind map of the concepts introduced in this chapter (eigenvalues, eigenvectors, orthogonal matrices, diagonalization, SVD), along with where they are used in other parts of the book, e.g., Chapter 6 (Probability & Distributions) and Chapter 10 (Linear Dimensionality Reduction).]
For a memory aid of the product terms in Sarrus’ rule, try tracing the
elements of the triple products in the matrix.
We call a square matrix T an upper-triangular matrix if Tij = 0 for i > j, i.e., the matrix is zero below its diagonal. Analogously, we define a lower-triangular matrix as a matrix with zeros above its diagonal. For a triangular matrix T ∈ Rn×n, the determinant is the product of the diagonal elements, i.e.,
$$\det(T) = \prod_{i=1}^{n} T_{ii}. \qquad (4.8)$$
The determinant is the signed volume of the parallelepiped formed by the columns of the matrix.

Example 4.2 (Determinants as Measures of Volume)
The notion of a determinant is natural when we consider it as a mapping from a set of n vectors spanning an object in Rn. It turns out that the determinant det(A) is the signed volume of an n-dimensional parallelepiped formed by columns of the matrix A.
Figure 4.2 The area of the parallelogram (shaded region) spanned by the vectors b and g is |det([b, g])|.
For n = 2 the columns of the matrix form a parallelogram, see Figure 4.2. As the angle between vectors gets smaller, the area of a parallelogram shrinks, too. Consider two vectors b, g that form the columns of a matrix A = [b, g]. Then, the absolute value of the determinant of A is the area of the parallelogram with vertices 0, b, g, b + g. In particular, if b, g are linearly dependent so that b = λg for some λ ∈ R, they no longer form a two-dimensional parallelogram. Therefore, the corresponding area is 0. On the contrary, if b, g are linearly independent and are multiples of the canonical basis vectors e1, e2, then they can be written as b = [b, 0]> and g = [0, g]>, and the determinant is
det([[b, 0], [0, g]]) = bg − 0 = bg ,
which is the familiar formula: area = height × length.
The sign of the determinant indicates the orientation of the spanning vectors b, g with respect to the standard basis (e1, e2). In our figure, flipping the order to g, b swaps the columns of A and reverses the orientation of the shaded area.
Figure 4.3 The volume of the parallelepiped (shaded volume) spanned by vectors r, b, g is |det([r, b, g])|.
This intuition extends to higher dimensions. In R3, we consider three vectors r, b, g ∈ R3 spanning the edges of a parallelepiped, i.e., a solid with faces that are parallel parallelograms (see Figure 4.3). The absolute value of the determinant of the 3 × 3 matrix [r, b, g] is the volume of the solid. Thus, the determinant acts as a function that measures the signed volume formed by column vectors composed in a matrix.
Consider the three linearly independent vectors r, g, b ∈ R3 given as
r = [2, 0, −8]> ,  g = [6, 1, 0]> ,  b = [1, 4, −1]> .   (4.9)
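Numerically (a minimal NumPy sketch), the volume spanned by these vectors is the absolute value of the determinant of the matrix whose columns are r, g, b:

import numpy as np

r = np.array([2.0, 0.0, -8.0])
g = np.array([6.0, 1.0, 0.0])
b = np.array([1.0, 4.0, -1.0])

# Stack the vectors as columns of a matrix and take the determinant.
A = np.column_stack([r, g, b])
volume = abs(np.linalg.det(A))      # det(A) = -186, so the volume is 186
print(volume)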
tr(A) := ∑_{i=1}^{n} aii ,   (4.18)

c0 = det(A) ,   (4.23)
cn−1 = (−1)^{n−1} tr(A) .   (4.24)
The characteristic polynomial (4.22a) will allow us to compute eigen-
values and eigenvectors, covered in the next section.
Definition 4.9 (Algebraic Multiplicity). Let a square matrix A have an eigenvalue λi. The algebraic multiplicity of λi is the number of times the root appears in the characteristic polynomial.
A matrix A and its transpose A> possess the same eigenvalues, but not
necessarily the same eigenvectors.
The eigenspace Eλ is the null space of A − λI since
Ax = λx ⇐⇒ Ax − λx = 0 (4.27a)
⇐⇒ (A − λI)x = 0 ⇐⇒ x ∈ ker(A − λI). (4.27b)
This means any vector x = [x1, x2]> where x2 = −x1, such as [1, −1]>, is an eigenvector with eigenvalue 2. The corresponding eigenspace is given as
E2 = span[ [1, −1]> ] .   (4.35)
Example 4.6
The matrix A = [[2, 1], [0, 2]] has two repeated eigenvalues λ1 = λ2 = 2 and an algebraic multiplicity of 2. The eigenvalue has, however, only one distinct eigenvector x1 = [1, 0]> and, thus, geometric multiplicity 1.
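A small NumPy sketch of Example 4.6 makes the gap between algebraic and geometric multiplicity visible: the eigenvalue 2 appears twice, but the eigenspace ker(A − 2I) is only one-dimensional.

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])

# Both eigenvalues are 2 (algebraic multiplicity 2) ...
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)            # [2. 2.]

# ... but the returned eigenvectors are (numerically) collinear with
# [1, 0]^T, so the eigenspace is one-dimensional (geometric multiplicity 1).
print(eigvecs)
dim_eigenspace = 2 - np.linalg.matrix_rank(A - 2 * np.eye(2))
print(dim_eigenspace)     # 1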
Figure 4.4 Determinants and eigenspaces. Overview of five linear mappings and their associated transformation matrices Ai ∈ R2×2, projecting 400 color-coded points x ∈ R2 (left column) onto target points Ai x (right column). The central column depicts the first eigenvector, stretched by its associated eigenvalue λ1, and the second eigenvector, stretched by its eigenvalue λ2. Each row depicts the effect of one of five transformation matrices Ai with respect to the standard basis. The eigenvalues and determinants per row are: λ1 = 2.0, λ2 = 0.5, det(A) = 1.0; λ1 = 1.0, λ2 = 1.0, det(A) = 1.0; λ1 = 0.87 − 0.5j, λ2 = 0.87 + 0.5j, det(A) = 1.0; λ1 = 0.0, λ2 = 2.0, det(A) = 0.0; λ1 = 0.5, λ2 = 1.5, det(A) = 0.75.
half of the vertical axis, and to the left vice versa. This mapping is area
preserving (det(A2 ) = 1). The eigenvalue λ1 = 1 = λ2 is repeated
and the eigenvectors are collinear (drawn here for emphasis in two
opposite directions). This indicates that the mapping acts only along
one direction (the horizontal
axis).
A3 = [[cos(π/6), − sin(π/6)], [sin(π/6), cos(π/6)]] = (1/2) [[√3, −1], [1, √3]] . The matrix A3 rotates the points by π/6 rad = 30◦ anti-clockwise and has only complex eigenvalues, reflecting that the mapping is a rotation (hence, no eigenvectors are drawn). A rotation has to be volume preserving, and so the determinant is 1. For more details on rotations we refer to Section 3.9.
A4 = [[1, −1], [−1, 1]] represents a mapping in the standard basis that collapses a two-dimensional domain onto one dimension. Since one eigen-
Figure 4.5 Caenorhabditis elegans neural network (Kaiser and Hilgetag, 2006). (a) Symmetrized connectivity matrix (axes: neuron index); (b) Eigenspectrum (vertical axis: eigenvalue).
Methods to analyze and learn from network data are an essential component of machine learning. The key to understanding networks is the connectivity between network nodes, in particular whether two nodes are connected to each other or not. In data science applications, it is often useful to study the matrix that captures this connectivity data.
We build a connectivity/adjacency matrix A ∈ R277×277 of the complete neural network of the worm C. elegans. Each row/column represents one of the 277 neurons of this worm's brain. The connectivity matrix A has
a value of aij = 1 if neuron i talks to neuron j through a synapse, and
aij = 0 otherwise. The connectivity matrix is not symmetric, which im-
plies that eigenvalues may not be real valued. Therefore, we compute a
symmetrized version of the connectivity matrix as Asym := A + A> . This
new matrix Asym is shown in Figure 4.5(a) and has a non-zero value aij
if and only if two neurons are connected (white pixels), irrespective of the
direction of the connection. In Figure 4.5(b), we show the correspond-
ing eigenspectrum of Asym . The horizontal axis shows the index of the
eigenvalues, sorted in descending order. The vertical axis shows the corre-
sponding eigenvalue. The S -like shape of this eigenspectrum is typical for
many biological neural networks. The underlying mechanism responsible
for this is an area of active neuroscience research.
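A minimal sketch of this analysis (assuming NumPy; the C. elegans connectome itself is not reproduced here, so a random binary adjacency matrix stands in for A):

import numpy as np

rng = np.random.default_rng(0)
n = 277
A = (rng.random((n, n)) < 0.05).astype(float)   # hypothetical connectivity matrix

A_sym = A + A.T                                  # symmetrize, as in the text

# Symmetric matrices have real eigenvalues; sort them in descending order to
# obtain an eigenspectrum of the kind plotted in Figure 4.5(b).
eigenvalues = np.linalg.eigvalsh(A_sym)[::-1]
print(eigenvalues[:5])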
Example 4.8
Consider the matrix
A = [[3, 2, 2], [2, 3, 2], [2, 2, 3]] .   (4.37)
Figure 4.6 Geometric interpretation of eigenvalues. The eigenvectors of A get stretched by the corresponding eigenvalues. The area of the unit square changes by |λ1 λ2|, the circumference changes by a factor 2(|λ1| + |λ2|).

Theorem 4.17. The trace of a matrix A ∈ Rn×n is the sum of its eigenvalues, i.e.,
tr(A) = ∑_{i=1}^{n} λi ,   (4.43)
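A quick numerical check of Theorem 4.17 (and of the companion fact that the determinant equals the product of the eigenvalues), using the matrix from (4.37) and NumPy:

import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, 2.0],
              [2.0, 2.0, 3.0]])

eigenvalues = np.linalg.eigvals(A)                    # 7, 1, 1
print(np.trace(A), eigenvalues.sum().real)            # both 9.0
print(np.linalg.det(A), np.prod(eigenvalues).real)    # both 7.0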
up on a different website. The matrix A has the property that for any ini-
tial rank/importance vector x of a website the sequence x, Ax, A2 x, . . .
converges to a vector x∗ . This vector is called the PageRank and satisfies
Ax∗ = x∗ , i.e., it is an eigenvector (with corresponding eigenvalue 1) of
A. After normalizing x∗ , such that kx∗ k = 1, we can interpret the entries
as probabilities. More details and different perspectives on PageRank can
be found in the original technical report (Page et al., 1999).
Comparing the left hand side of (4.45) and the right hand side of (4.46) shows that there is a simple pattern in the diagonal elements lii:
l11 = √a11 ,  l22 = √(a22 − l21²) ,  l33 = √(a33 − (l31² + l32²)) .   (4.47)
Similarly, for the elements below the diagonal (lij, where i > j) there is also a repeating pattern:
l21 = a21/l11 ,  l31 = a31/l11 ,  l32 = (a32 − l31 l21)/l22 .   (4.48)
Thus, we have constructed the Cholesky decomposition for any symmetric, positive definite 3 × 3 matrix. The key realization is that we can backward-calculate what the components lij of L should be, given the values aij of A and the previously computed values of lij.
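The backward calculation in (4.47) and (4.48) translates directly into code; a minimal sketch, assuming NumPy and a hypothetical symmetric, positive definite test matrix:

import numpy as np

def cholesky_3x3(A):
    """Backward-calculate L for a symmetric, positive definite 3x3 matrix A,
    following the patterns in (4.47) and (4.48)."""
    L = np.zeros((3, 3))
    L[0, 0] = np.sqrt(A[0, 0])
    L[1, 0] = A[1, 0] / L[0, 0]
    L[2, 0] = A[2, 0] / L[0, 0]
    L[1, 1] = np.sqrt(A[1, 1] - L[1, 0]**2)
    L[2, 1] = (A[2, 1] - L[2, 0] * L[1, 0]) / L[1, 1]
    L[2, 2] = np.sqrt(A[2, 2] - (L[2, 0]**2 + L[2, 1]**2))
    return L

# A hypothetical symmetric, positive definite test matrix.
A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])

L = cholesky_3x3(A)
print(np.allclose(L @ L.T, A))                # True
print(np.allclose(L, np.linalg.cholesky(A)))  # matches NumPy's lower-triangular factor

The same row-by-row pattern generalizes to n × n matrices by looping over rows and columns.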
polynomial of A is
det(A − λI) = det([[2 − λ, 1], [1, 2 − λ]])   (4.56a)
= (2 − λ)² − 1 = λ² − 4λ + 3 = (λ − 3)(λ − 1) .   (4.56b)
Therefore, the eigenvalues of A are λ1 = 1 and λ2 = 3 (the roots of the characteristic polynomial), and the associated (normalized) eigenvectors are obtained via
[[2, 1], [1, 2]] p1 = 1 p1 ,   [[2, 1], [1, 2]] p2 = 3 p2 .   (4.57)
This yields
p1 = (1/√2) [1, −1]> ,   p2 = (1/√2) [1, 1]> .   (4.58)
Step 2: Check for existence The eigenvectors p1 , p2 form a basis of
R2 . Therefore, A can be diagonalized.
Step 3: Construct the matrix P to diagonalize A We collect the eigen-
vectors of A in P so that
P = [p1, p2] = (1/√2) [[1, 1], [−1, 1]] .   (4.59)
We then obtain
P⁻¹AP = [[1, 0], [0, 3]] = D .   (4.60)
Equivalently, we get (exploiting that P −1 = P > since the eigenvectors p1
and p2 in this example form an ONB)
[[2, 1], [1, 2]] = (1/√2)[[1, 1], [−1, 1]] · [[1, 0], [0, 3]] · (1/√2)[[1, −1], [1, 1]] , i.e., A = P D P> .   (4.61)

A^k = (P D P⁻¹)^k = P D^k P⁻¹ .   (4.62)
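A short NumPy check of the diagonalization above and of the power identity (4.62):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigenvectors collected in P diagonalize A: P^{-1} A P = D = diag(1, 3).
P = (1 / np.sqrt(2)) * np.array([[1.0, 1.0],
                                 [-1.0, 1.0]])
D = np.linalg.inv(P) @ A @ P
print(np.round(D, 10))

# Powers become cheap: A^k = P D^k P^{-1}, cf. (4.62).
k = 5
A_k = P @ np.diag(np.diag(D)**k) @ np.linalg.inv(P)
print(np.allclose(A_k, np.linalg.matrix_power(A, k)))   # True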
A = U Σ V> ,  where U ∈ Rm×m, Σ ∈ Rm×n, and V ∈ Rn×n .   (4.64)
The diagonal entries σi, i = 1, . . . , r, of Σ are called the singular values, ui are called the left-singular vectors, and vj are called the right-singular vectors. By convention, the singular values are ordered, i.e., σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0.
The singular value matrix Σ is unique, but it requires some attention. Observe that Σ ∈ Rm×n is rectangular. In particular, Σ is of the same size as A. This means that Σ has a diagonal submatrix that contains the singular values and needs additional zero padding. Specifically, if m > n, then the matrix Σ has diagonal structure up to row n and then consists of
Remark. The SVD exists for any matrix A ∈ Rm×n . ♦
value matrix Σ. Finally, it performs a second basis change via U. The SVD entails a number of important details and caveats, which is why we will review our intuition in more detail. (It is useful to revise basis changes (Section 2.7.2), orthogonal matrices (Definition 3.8), and orthonormal bases (Section 3.5).)
Assume we are given a transformation matrix of a linear mapping Φ : Rn → Rm with respect to the standard bases B and C of Rn and Rm, respectively. Moreover, assume a second basis B̃ of Rn and C̃ of Rm. Then
1. The matrix V performs a basis change in the domain Rn from B̃ (rep- (Section 3.5).
resented by the red and orange vectors v 1 and v 2 in the top-left of Fig-
ure 4.8) to the standard basis B . V > = V −1 performs a basis change
from B to B̃ . The red and orange vectors are now aligned with the
canonical basis in the bottom left of Figure 4.8.
2. Having changed the coordinate system to B̃ , Σ scales the new coordi-
nates by the singular values σi (and adds or deletes dimensions), i.e.,
Σ is the transformation matrix of Φ with respect to B̃ and C̃ , rep-
resented by the red and orange vectors being stretched and lying in
the e1 -e2 plane, which is now embedded in a third dimension in the
bottom right of Figure 4.8.
3. U performs a basis change in the codomain Rm from C̃ into the canoni-
cal basis of Rm , represented by a rotation of the red and orange vectors
out of the e1 -e2 plane. This is shown in the top-right of Figure 4.8.
The SVD expresses a change of basis in both the domain and codomain.
This is in contrast with the eigendecomposition that operates within the
same vector space, where the same basis change is applied and then un-
done. What makes the SVD special is that these two different bases are
simultaneously linked by the singular value matrix Σ.
R3 (see bottom right panel in Figure 4.9). Note that all vectors lie in the x1-x2 plane. The third coordinate is always 0. The vectors in the x1-x2 plane have been stretched by the singular values.
The direct mapping of the vectors X by A to the codomain R3 equals
the transformation of X by U ΣV > , where U performs a rotation within
the codomain R3 so that the mapped vectors are no longer restricted to
the x1 -x2 plane; they still are on a plane as shown in the top-right panel
of Figure 4.9.
Figure 4.9 Mapping of vectors under the SVD, following the same structure as Figure 4.8 (axes x1, x2, x3).
where P is an orthogonal matrix, which is composed of the orthonormal
eigenbasis. The λi > 0 are the eigenvalues of A> A. Let us assume the
SVD of A exists and inject (4.64) into (4.71). This yields
A> A = (U ΣV > )> (U ΣV > ) = V Σ> U > U ΣV > , (4.72)
This equation closely resembles the eigenvalue equation (4.25), but the
vectors on the left and the right-hand sides are not the same.
For n > m, (4.79) holds only for i ≤ m, and (4.79) says nothing about the ui for i > m. However, we know by construction that they are orthonormal. Conversely, for m > n, (4.79) holds only for i ≤ n. For i > n we have Av i = 0 and we still know that the v i form an orthonormal set.
This means that the SVD also supplies an orthonormal basis of the kernel
(null space) of A, the set of vectors x with Ax = 0 (see Section 2.7.3).
Moreover, concatenating the v i as the columns of V and the ui as the
columns of U yields
AV = U Σ , (4.80)
where Σ has the same dimensions as A and a diagonal structure for rows
1, . . . , r. Hence, right-multiplying with V > yields A = U ΣV > , which is
the SVD of A.
u2 = (1/σ2) A v2 = (1/1) [[1, 0, 1], [−2, 1, 0]] [0, 1/√5, 2/√5]> = [2/√5, 1/√5]> ,   (4.87)
U = [u1, u2] = (1/√5) [[1, 2], [−2, 1]] .   (4.88)
Note that on a computer the approach illustrated here has poor numerical
behavior, and the SVD of A is normally computed without resorting to the
eigenvalue decomposition of A> A.
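In practice, one calls a library routine directly; a minimal NumPy sketch for the matrix A = [[1, 0, 1], [−2, 1, 0]] appearing in (4.87) (singular vectors may differ from the hand computation by a sign):

import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-2.0, 1.0, 0.0]])

# Full SVD as in (4.64): U is 2x2, S holds the singular values, Vt is 3x3.
U, S, Vt = np.linalg.svd(A, full_matrices=True)
print(S)                                 # approximately [sqrt(6), 1]

# Reassemble A = U Sigma V^T with the rectangular Sigma.
Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(S)
print(np.allclose(U @ Sigma @ Vt, A))    # True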
Figure 4.10 Movie ratings of three people (Ali, Beatrix, Chandra) for four movies and its SVD decomposition A = U Σ V>:

Star Wars      [5, 4, 1]
Blade Runner   [5, 5, 0]
Amelie         [0, 0, 5]
Delicatessen   [1, 0, 4]

U = [[−0.6710, 0.0236, 0.4647, −0.5774],
     [−0.7197, 0.2054, −0.4759, 0.4619],
     [−0.0939, −0.7705, −0.5268, −0.3464],
     [−0.1515, −0.6030, 0.5293, −0.5774]]

Σ = [[9.6438, 0, 0],
     [0, 6.3639, 0],
     [0, 0, 0.7056],
     [0, 0, 0]]

V> = [[−0.7367, −0.6515, −0.1811],
      [0.0852, 0.1762, −0.9807],
      [0.6708, −0.7379, −0.0743]]
represents a movie and each column a user. Thus, the column vectors of
movie ratings, one for each viewer, are xAli , xBeatrix , xChandra .
Factoring A using the SVD offers us a way to capture the relationships
of how people rate movies, and especially if there is a structure linking
which people like which movies. Applying the SVD to our data matrix A
makes a number of assumptions:
1. All viewers rate movies consistently using the same linear mapping.
2. There are no errors or noise in the ratings.
3. We interpret the left-singular vectors ui as stereotypical movies and
the right-singular vectors v j as stereotypical viewers.
We then make the assumption that any viewer’s specific movie preferences
can be expressed as a linear combination of the v j . Similarly, any movie’s
like-ability can be expressed as a linear combination of the ui . Therefore,
a vector in the domain of the SVD can be interpreted as a viewer in the
“space” of stereotypical viewers, and a vector in the codomain of the SVD
correspondingly as a movie in the "space" of stereotypical movies. (These two "spaces" are only meaningfully spanned by the respective viewer and movie data if the data itself covers a sufficient diversity of viewers and movies.) Let us inspect the SVD of our movie-user matrix. The first left-singular vector u1 has large absolute values for the two science fiction movies and a large first singular value (red shading in Figure 4.10). Thus, this groups a type of users with a specific set of movies (science fiction theme). Similarly, the first right-singular vector v1 shows large absolute values for Ali and Beatrix, who give high ratings to science fiction movies (green shading in Figure 4.10). This suggests that v1 reflects the notion of a science fiction lover.
Similarly, u2 seems to capture a French art house film theme, and v2 indicates that Chandra is close to an idealized lover of such movies. An idealized science fiction lover is a purist and only loves science fiction movies, so a science fiction lover v1 gives a rating of zero to everything but science fiction themed movies – this logic is implied by the diagonal substructure of the singular value matrix Σ. A specific movie is therefore represented by how it decomposes (linearly) into its stereotypical movies. Likewise, a person would be represented by how they decompose (via linear combination) into movie themes.
It is possible to define the SVD of a rank-r matrix A so that U is an m × r matrix, Σ a diagonal r × r matrix, and V an n × r matrix (so that V> is r × n). This construction is very similar to our definition and ensures that the diagonal matrix Σ has only non-zero entries along the diagonal. The main convenience of this alternative notation is that Σ is diagonal, as in the eigenvalue decomposition. Sometimes this formulation is called the reduced SVD (e.g., Datta (2010)) or the SVD (e.g., Press et al. (2007)). This alternative format changes merely how the matrices are constructed but leaves the mathematical structure of the SVD unchanged.
In Section 4.6, we will learn about matrix approximation techniques using the SVD, which is also called the truncated SVD.
A restriction that the SVD for A only applies to m × n matrices with
m > n is practically unnecessary. When m < n the SVD decomposition
will yield Σ with more zero columns than rows and, consequently, the
singular values σm+1 , . . . , σn are 0.
The SVD is used in a variety of applications in machine learning from
least squares problems in curve fitting to solving systems of linear equa-
tions. These applications harness various important properties of the SVD,
its relation to the rank of a matrix and its ability to approximate matrices
of a given rank with lower-rank matrices. Substituting a matrix with its SVD often has the advantage of making calculations more robust to numerical rounding errors. As we will explore in the next section, the SVD's
ability to approximate matrices with “simpler” matrices in a principled
manner opens up machine learning applications ranging from dimension-
ality reduction and topic modeling to data compression and clustering.
Figure 4.11 Image processing with the SVD. (a) The original grayscale image is a 1,432 × 1,910 matrix of values between 0 (black) and 1 (white). (b)–(f) Rank-1 matrices A1, . . . , A5 and their corresponding singular values σ1, . . . , σ5: (b) A1, σ1 ≈ 228,052; (c) A2, σ2 ≈ 40,647; (d) A3, σ3 ≈ 26,125; (e) A4, σ4 ≈ 20,232; (f) A5, σ5 ≈ 15,436. The grid-like structure of each rank-1 matrix is imposed by the outer product of the left- and right-singular vectors.

of U and V. Figure 4.11 shows an image of Stonehenge, which can be represented by a matrix A ∈ R1432×1910, and some outer products Ai, as defined in (4.90).
A matrix A ∈ Rm×n of rank r can be written as a sum of rank-1 matrices Ai so that
A = ∑_{i=1}^{r} σi ui vi> = ∑_{i=1}^{r} σi Ai ,   (4.91)
of A with rk(Â(k)) = k. Figure 4.12 shows low-rank approximations Â(k) of an original image A of Stonehenge. The shape of the rocks becomes increasingly visible and clearly recognizable in the rank-5 approximation. While the original image requires 1,432 · 1,910 = 2,735,120 numbers, the rank-5 approximation requires us only to store the five singular values and the five left- and right-singular vectors (1,432- and 1,910-dimensional each) for a total of 5 · (1,432 + 1,910 + 1) = 16,715 numbers – just above 0.6% of the original.
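A hedged sketch of the rank-k construction (assuming NumPy; a small random matrix stands in for the image, which is not reproduced here):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 300))                         # stand-in for the grayscale image

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]      # rank-k approximation, cf. (4.91)
print(np.linalg.matrix_rank(A_hat))                # 5

# Storage comparison for the 1,432 x 1,910 image discussed above:
print(1432 * 1910, 5 * (1432 + 1910 + 1))          # 2735120 vs 16715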
To measure the difference (error) between A and its rank-k approximation Â(k), we need the notion of a norm. In Section 3.1, we already used norms on vectors that measure the length of a vector. By analogy, we can also define norms on matrices.
Definition 4.23 (Spectral Norm of a Matrix). For x ∈ Rn \ {0}, the spectral norm of a matrix A ∈ Rm×n is defined as
‖A‖2 := max_x ‖Ax‖2 / ‖x‖2 .   (4.93)
We introduce the notation of a subscript in the matrix norm (left-hand
side), similar to the Euclidean norm for vectors (right-hand side), which
has subscript 2. The spectral norm (4.93) determines how long any vector
x can at most become when multiplied by A.
Theorem 4.24. The spectral norm of A is its largest singular value σ1 .
We leave the proof of this theorem as an exercise.
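A small numerical illustration of Theorem 4.24 (assuming NumPy and an arbitrary random matrix): the maximal stretch over sampled directions approaches the largest singular value, which is also what NumPy's matrix 2-norm returns.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))     # an arbitrary (hypothetical) matrix

# Largest singular value ...
sigma_1 = np.linalg.svd(A, compute_uv=False)[0]

# ... compared with the stretch ||Ax|| / ||x|| over many random directions.
x = rng.standard_normal((3, 10000))
stretch = np.linalg.norm(A @ x, axis=0) / np.linalg.norm(x, axis=0)
print(sigma_1, stretch.max())       # the sampled maximum approaches sigma_1
print(np.linalg.norm(A, 2))         # the matrix 2-norm is exactly sigma_1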
Theorem 4.25 (Eckart-Young Theorem (Eckart and Young, 1936)). Consider a matrix A ∈ Rm×n of rank r and let B ∈ Rm×n be a matrix of rank k. For any k ≤ r with Â(k) = ∑_{i=1}^{k} σi ui vi>, it holds that
Â(k) = argmin_{rk(B)≤k} ‖A − B‖2 ,   (4.94)
‖A − Â(k)‖2 = σk+1 .   (4.95)
Figure 4.13 A functional phylogeny of matrices encountered in machine learning. Real matrices (for which a pseudo-inverse and an SVD exist) split into square matrices Rn×n (which have a determinant and a trace) and nonsquare matrices Rn×m. Square matrices with det = 0 are singular; those with det ≠ 0 are regular (invertible), i.e., an inverse matrix exists. Square matrices without a basis of eigenvectors are defective; those with a basis of eigenvectors are non-defective (diagonalizable), further split into normal matrices (A>A = AA>) and non-normal matrices. Normal matrices include the symmetric matrices, whose eigenvalues are real.
to perform on potentially very large matrices of data (Trefethen and Bau III,
1997). Moreover, low-rank approximations are used to operate on ma-
trices that may contain missing values as well as for purposes of lossy
compression and dimensionality reduction (Moonen and De Moor, 1995;
Markovsky, 2011).
Exercises
4.1 Compute the determinant using the Laplace expansion (using the first row)
and the Sarrus Rule for
A = [[1, 3, 5], [2, 4, 6], [0, 2, 4]] .
4.6 Compute the eigenspaces of the following transformation matrices. Are they
diagonalizable?
1. A = [[2, 3, 0], [1, 4, 3], [0, 0, 1]]
2. A = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
4.7 Are the following matrices diagonalizable? If yes, determine their diagonal
form and a basis with respect to which the transformation matrices are di-
agonal. If no, give reasons why they are not diagonalizable.
1. A = [[0, 1], [−8, 4]]
2. A = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
3. A = [[5, 4, 2, 1], [0, 1, −1, −1], [−1, −1, 3, 0], [1, 1, −1, 2]]
4. A = [[5, −6, −6], [−1, 4, 2], [3, −6, −4]]
4.11 Show that for any A ∈ Rm×n the matrices A> A and AA> possess the
same non-zero eigenvalues.
4.12 Show that for x ≠ 0 Theorem 4.24 holds, i.e., show that
max_x ‖Ax‖2 / ‖x‖2 = σ1 ,
5 Vector Calculus

Figure: (a) Regression problem: find parameters such that the curve explains the observations (crosses) well. (b) Density estimation with a Gaussian mixture model, i.e., modeling data distributions: find means and covariances such that the data (dots) can be explained well.
Figure: A mind map of the concepts introduced in this chapter (partial derivatives, collected in the Jacobian and the Hessian), along with when they are used in other parts of the book (Chapter 6 Probability, Chapter 11 Density Estimation).
Example 5.1
Recall the dot product as a special case of an inner product (Section 3.2). In the above notation, the function f(x) = x>x, x ∈ R2, would be specified as
f : R2 → R   (5.2a)
x ↦ x1² + x2² .   (5.2b)
δy/δx := (f(x + δx) − f(x)) / δx   (5.3)
computes the slope of the secant line through two points on the graph of
f . In Figure 5.3 these are the points with x-coordinates x0 and x0 + δx.
The difference quotient can also be considered the average slope of f
between x and x + δx if we assume f to be a linear function. In the limit
for δx → 0, we obtain the tangent of f at x, if f is differentiable. The
tangent is then the derivative of f at x.
Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f at x is defined as the limit
df/dx := lim_{h→0} (f(x + h) − f(x)) / h ,   (5.4)
and the secant in Figure 5.3 becomes a tangent.
The derivative of f points in the direction of steepest ascent of f .
= lim_{h→0} ( ∑_{i=0}^{n} \binom{n}{i} x^{n−i} h^{i} − x^n ) / h .   (5.5c)
We see that x^n = \binom{n}{0} x^{n−0} h^0. By starting the sum at 1, the x^n-term cancels, and we obtain
df/dx = lim_{h→0} ( ∑_{i=1}^{n} \binom{n}{i} x^{n−i} h^{i} ) / h   (5.6a)
= lim_{h→0} ∑_{i=1}^{n} \binom{n}{i} x^{n−i} h^{i−1}   (5.6b)
= lim_{h→0} ( \binom{n}{1} x^{n−1} + ∑_{i=2}^{n} \binom{n}{i} x^{n−i} h^{i−1} )   (5.6c)
= n!/(1!(n − 1)!) x^{n−1} = n x^{n−1} ,   (5.6d)
since the sum starting at i = 2 vanishes as h → 0.
Definition 5.3 (Taylor Polynomial). The Taylor polynomial of degree n of f : R → R at x0 is defined as
Tn(x) := ∑_{k=0}^{n} f^(k)(x0) / k! · (x − x0)^k ,   (5.7)
(We define t^0 := 1 for all t ∈ R.)
For x0 = 0, we obtain the Maclaurin series as a special instance of the Taylor series. If f(x) = T∞(x), then f is called analytic. (Here f ∈ C∞ means that f is continuously differentiable infinitely many times.)
Remark. In general, a Taylor polynomial of degree n is an approximation of a function, which does not need to be a polynomial. The Taylor polynomial is similar to f in a neighborhood around x0. However, a Taylor polynomial of degree n is an exact representation of a polynomial f of degree k ≤ n since all derivatives f^(i), i > k, vanish. ♦
Figure 5.4 Taylor polynomials (dashed) around x0 = 0. Higher-order Taylor polynomials approximate the function f better and more globally. T10 is already similar to f in [−4, 4]. (Axes: x, y.)
Example 5.4 (Taylor Series)
Consider the function in Figure 5.4 given by
f (x) = sin(x) + cos(x) ∈ C ∞ . (5.19)
We seek a Taylor series expansion of f at x0 = 0, which is the Maclaurin
series expansion of f . We obtain the following derivatives:
f (0) = sin(0) + cos(0) = 1 (5.20)
f 0 (0) = cos(0) − sin(0) = 1 (5.21)
f''(0) = − sin(0) − cos(0) = −1   (5.22)
f^(3)(0) = − cos(0) + sin(0) = −1   (5.23)
f^(4)(0) = sin(0) + cos(0) = f(0) = 1   (5.24)
..
.
We can see a pattern here: The coefficients in our Taylor series are only
±1 (since sin(0) = 0), each of which occurs twice before switching to the
other one. Furthermore, f (k+4) (0) = f (k) (0).
Therefore, the full Taylor series expansion of f at x0 = 0 is given by
T∞(x) = ∑_{k=0}^{∞} f^(k)(x0)/k! · (x − x0)^k   (5.25a)
= 1 + x − (1/2!)x² − (1/3!)x³ + (1/4!)x⁴ + (1/5!)x⁵ − · · ·   (5.25b)
= 1 − (1/2!)x² + (1/4!)x⁴ ∓ · · · + x − (1/3!)x³ + (1/5!)x⁵ ∓ · · ·   (5.25c)
= ∑_{k=0}^{∞} (−1)^k x^{2k}/(2k)! + ∑_{k=0}^{∞} (−1)^k x^{2k+1}/(2k + 1)!   (5.25d)
= cos(x) + sin(x) ,   (5.25e)
where ak are coefficients and c is a constant, which has the special form
in Definition 5.4. ♦
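A minimal sketch (Python, assuming NumPy) that evaluates the Maclaurin polynomials of f(x) = sin(x) + cos(x) using the ±1 coefficient pattern derived above; higher degrees approach f:

import numpy as np
from math import factorial

def taylor_poly(x, n):
    """Degree-n Maclaurin polynomial of f(x) = sin(x) + cos(x), using the
    derivative pattern f^(k)(0) in (5.20)-(5.24): 1, 1, -1, -1, 1, 1, ..."""
    coeffs = [1.0, 1.0, -1.0, -1.0]
    return sum(coeffs[k % 4] / factorial(k) * x**k for k in range(n + 1))

x = 1.5
f = np.sin(x) + np.cos(x)
for n in (1, 5, 10):
    print(n, taylor_poly(x, n), f)   # higher degrees approach f(x)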
∂f(x, y)/∂y = 2(x + 2y³) · ∂/∂y (x + 2y³) = 12(x + 2y³)y² ,   (5.42)
where we used the chain rule (5.32) to compute the partial derivatives.
Chain Rule:  ∂/∂x (g ◦ f)(x) = ∂/∂x g(f(x)) = ∂g/∂f · ∂f/∂x   (5.48)
Let us have a closer look at the chain rule. The chain rule (5.48) resembles to some degree the rules for matrix multiplication, where we said that neighboring dimensions have to match for matrix multiplication to be defined; see Section 2.2.1. If we go from left to right, the chain rule exhibits similar properties: ∂f shows up in the "denominator" of the first factor and in the "numerator" of the second factor. If we multiply the factors together, multiplication is defined, i.e., the dimensions of ∂f match, and ∂f "cancels", such that ∂g/∂x remains. (This is only an intuition, and not mathematically correct, since the partial derivative is not a fraction.)
Example 5.8
Consider f(x1, x2) = x1² + 2x2, where x1 = sin t and x2 = cos t. Then
df/dt = ∂f/∂x1 · ∂x1/∂t + ∂f/∂x2 · ∂x2/∂t   (5.50a)
= 2 sin t · ∂ sin t/∂t + 2 · ∂ cos t/∂t   (5.50b)
= 2 sin t cos t − 2 sin t = 2 sin t (cos t − 1)   (5.50c)
is the corresponding derivative of f with respect to t.
∂f/∂s = ∂f/∂x1 · ∂x1/∂s + ∂f/∂x2 · ∂x2/∂s ,   (5.51)
∂f/∂t = ∂f/∂x1 · ∂x1/∂t + ∂f/∂x2 · ∂x2/∂t ,   (5.52)
and the gradient is obtained by the matrix multiplication
df/d(s, t) = ∂f/∂x · ∂x/∂(s, t) = [∂f/∂x1  ∂f/∂x2] · [[∂x1/∂s, ∂x1/∂t], [∂x2/∂s, ∂x2/∂t]] .
This compact way of writing the chain rule as a matrix multiplication only makes sense if the gradient is defined as a row vector. Otherwise, we will need to start transposing gradients for the matrix dimensions to match. This may still be straightforward as long as the gradient is a vector or a matrix; however, when the gradient becomes a tensor (we will discuss this in the following), the transpose is no longer a triviality.
Remark (Verifying the Correctness of a Gradient Implementation). The definition of the partial derivatives as the limit of the corresponding difference quotient, see (5.39), can be exploited when numerically checking the correctness of gradients in computer programs: When we compute gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value h to be small (e.g., h = 10⁻⁴) and compare the finite-difference approximation from (5.39) with our (analytic) implementation of the gradient. If the error is small, our gradient implementation is probably correct. "Small" could mean that √( ∑_i (dh_i − df_i)² / ∑_i (dh_i + df_i)² ) < 10⁻⁶, where dh_i is the finite-difference approximation and df_i is the analytic gradient of f with respect to the ith variable x_i. ♦
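A minimal gradient-checking sketch in NumPy for a hypothetical test function (central differences are used here for a slightly more accurate estimate than the one-sided quotient (5.39)):

import numpy as np

def f(x):
    # A hypothetical test function: f(x) = x1^2 + 2*x2.
    return x[0]**2 + 2 * x[1]

def analytic_grad(x):
    return np.array([2 * x[0], 2.0])

def finite_difference_grad(f, x, h=1e-4):
    # Central differences: slightly more accurate than the forward quotient.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = np.array([0.7, -1.2])
df = analytic_grad(x)
dh = finite_difference_grad(f, x)

# Relative error criterion from the remark above; a tiny value suggests
# the analytic gradient is implemented correctly.
rel_err = np.sqrt(np.sum((dh - df)**2) / np.sum((dh + df)**2))
print(rel_err)           # far below 1e-6 here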
Writing the vector-valued function in this way allows us to view a vector-
valued function f : Rn → Rm as a vector of functions [f1 , . . . , fm ]> ,
fi : Rn → R that map onto R. The differentiation rules for every fi are
exactly the ones we discussed in Section 5.2.
∂f/∂xi = [∂f1/∂xi, . . . , ∂fm/∂xi]> ∈ Rm .   (5.55)
From (5.40), we know that the gradient of f with respect to a vector is
the row vector of the partial derivatives. In (5.55), every partial derivative
∂f /∂xi is a column vector. Therefore, we obtain the gradient of f : Rn →
Rm with respect to x ∈ Rn by collecting these partial derivatives:
df(x)/dx = [∂f(x)/∂x1, · · · , ∂f(x)/∂xn]   (5.56a)
= [[∂f1(x)/∂x1 · · · ∂f1(x)/∂xn], . . . , [∂fm(x)/∂x1 · · · ∂fm(x)/∂xn]] ∈ Rm×n .   (5.56b)
exists also the denominator layout, which is the transpose of the numerator denominator layout
layout. In this book, we will use the numerator layout. ♦
We will see how the Jacobian is used in the change-of-variable method
for probability distributions in Section 6.7. The amount of scaling due to
the transformation of a variable is provided by the determinant.
In Section 4.1, we saw that the determinant can be used to compute
the area of a parallelogram. If we are given two vectors b1 = [1, 0]> ,
b2 = [0, 1]> as the sides of the unit square (blue, see Figure 5.5), the area
of this square is
det([[1, 0], [0, 1]]) = 1 .   (5.60)
If we take a parallelogram with the sides c1 = [−2, 1]> , c2 = [1, 1]>
(orange in Figure 5.5) its area is given as the absolute value of the deter-
minant (see Section 4.1)
|det([[−2, 1], [1, 1]])| = | − 3| = 3 ,   (5.61)
i.e., the area of this parallelogram is exactly three times the area of the unit square. We
can find this scaling factor by finding a mapping that transforms the unit
square into the other square. In linear algebra terms, we effectively per-
form a variable transformation from (b1 , b2 ) to (c1 , c2 ). In our case, the
mapping is linear and the absolute value of the determinant of this map-
ping gives us exactly the scaling factor we are looking for.
We will describe two approaches to identify this mapping. First, we ex-
ploit that the mapping is linear so that we can use the tools from Chapter 2
to identify this mapping. Second, we will find the mapping using partial
derivatives using the tools we have been discussing in this chapter.
Approach 1 To get started with the linear algebra approach, we
identify both {b1 , b2 } and {c1 , c2 } as bases of R2 (see Section 2.6.1 for a
recap). What we effectively perform is a change of basis from (b1 , b2 ) to
(c1 , c2 ), and we are looking for the transformation matrix that implements
the basis change. Using results from Section 2.7.2, we identify the desired
basis change matrix as
J = [[−2, 1], [1, 1]] ,   (5.62)
such that J b1 = c1 and J b2 = c2 . The absolute value of the determi-
nant of J , which yields the scaling factor we are looking for, is given as
|det(J )| = 3, i.e., the area of the square spanned by (c1 , c2 ) is three times
greater than the area spanned by (b1 , b2 ).
Approach 2 The linear algebra approach works for linear trans-
formations; for nonlinear transformations (which become relevant in Sec-
tion 6.7), we follow a more general approach using partial derivatives.
For this approach, we consider a function f : R2 → R2 that performs
a variable transformation. In our example, f maps the coordinate repre-
sentation of any vector x ∈ R2 with respect to (b1 , b2 ) onto the coordi-
nate representation y ∈ R2 with respect to (c1 , c2 ). We want to identify
the mapping so that we can compute how an area (or volume) changes
when it is being transformed by f . For this we need to find out how f (x)
changes if we modify x a bit. This question is exactly answered by the
Jacobian matrix df dx
∈ R2×2 . Since we can write
y1 = −2x1 + x2 (5.63)
y2 = x 1 + x 2 (5.64)
we obtain the functional relationship between x and y , which allows us
to get the partial derivatives
∂y1/∂x1 = −2 ,  ∂y1/∂x2 = 1 ,  ∂y2/∂x1 = 1 ,  ∂y2/∂x2 = 1   (5.65)
and compose the Jacobian as
J = [[∂y1/∂x1, ∂y1/∂x2], [∂y2/∂x1, ∂y2/∂x2]] = [[−2, 1], [1, 1]] .   (5.66)
The Jacobian represents the coordinate transformation we are looking for and is exact if the coordinate transformation is linear (as in our case), and (5.66) recovers exactly the basis change matrix in (5.62). If the coordinate transformation is nonlinear, the Jacobian approximates this nonlinear transformation locally with a linear one. The absolute value of the Jacobian determinant |det(J)| is the factor by which areas or volumes are scaled when coordinates are transformed. In our case, we obtain |det(J)| = 3. (Geometrically, the Jacobian determinant gives the magnification/scaling factor when we transform an area or volume.)
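A short NumPy check of this example: J maps b1, b2 to c1, c2, and |det(J)| = 3.

import numpy as np

# The linear map from (b1, b2) coordinates to (c1, c2) coordinates, cf. (5.62)/(5.66).
J = np.array([[-2.0, 1.0],
              [1.0, 1.0]])

b1 = np.array([1.0, 0.0])
b2 = np.array([0.0, 1.0])
c1 = np.array([-2.0, 1.0])
c2 = np.array([1.0, 1.0])

print(np.allclose(J @ b1, c1), np.allclose(J @ b2, c2))   # True True
print(abs(np.linalg.det(J)))                              # 3.0: the area scaling factor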
The Jacobian determinant and variable transformations will become relevant in Section 6.7 when we transform random variables and probability distributions. These transformations are extremely relevant in machine learning in the context of training deep neural networks using the reparametrization trick, also called infinite perturbation analysis.

Figure 5.6 Dimensionality of (partial) derivatives.

In this chapter, we encountered derivatives of functions. Figure 5.6 summarizes the dimensions of those derivatives. If f : R → R, the gradient is simply a scalar (top-left entry). For f : RD → R, the gradient is a 1 × D row vector (top-right entry). For f : R → RE, the gradient is an E × 1 column vector, and for f : RD → RE the gradient is an E × D matrix.
We collect the partial derivatives in the Jacobian and obtain the gradient
df/dx = [[∂f1/∂x1 · · · ∂f1/∂xN], . . . , [∂fM/∂x1 · · · ∂fM/∂xN]] = [[A11 · · · A1N], . . . , [AM1 · · · AMN]] = A ∈ RM×N .   (5.68)
Remark. We would have obtained the same result without using the chain
rule by immediately looking at the function
L2(θ) := ‖y − Φθ‖² = (y − Φθ)>(y − Φθ) .   (5.84)
This approach is still practical for simple functions like L2 but becomes
impractical for deep function compositions. ♦
dà dA
∈ R8×3 ∈ R4×2×3
A ∈ R4×2 Ã ∈ R8 dx dx
∂fi/∂A_{k≠i,:} = 0> ∈ R1×1×N   (5.91)
where we have to pay attention to the correct dimensionality. Since fi
maps onto R and each row of A is of size 1 × N , we obtain a 1 × 1 × N -
sized tensor as the partial derivative of fi with respect to a row of A.
We stack the partial derivatives (5.91) and get the desired gradient in (5.87) via
∂fi/∂A = [0>, . . . , 0>, x>, 0>, . . . , 0>] ∈ R1×(M×N) ,   (5.92)
where x> sits at the ith position.
∂pqij = Riq if j = p, p ≠ q;  Rip if j = q, p ≠ q;  2Riq if j = p, p = q;  and 0 otherwise.   (5.98)
From (5.94), we know that the desired gradient has the dimension (N × N) × (M × N), and every single entry of this tensor is given by ∂pqij in (5.98), where p, q, j = 1, . . . , N and i = 1, . . . , M.
Writing out the gradient in this explicit way is often impractical since it
often results in a very lengthy expression for a derivative. In practice,
it means that, if we are not careful, the implementation of the gradient
could be significantly more expensive than computing the function, which
is an unnecessary overhead. For training deep neural network models, the
backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; backpropagation
Rumelhart et al., 1986) is an efficient way to compute the gradient of an
error function with respect to the parameters of the model.
where x are the inputs (e.g., images), y are the observations (e.g., class labels), and every function fi, i = 1, . . . , K, possesses its own parameters. In neural networks with multiple layers, we have functions fi(x_{i−1}) = σ(Ai x_{i−1} + bi) in the ith layer. Here x_{i−1} is the output of layer i − 1 and σ an activation function, such as the logistic sigmoid 1/(1 + e^{−x}), tanh, or a rectified linear unit (ReLU). (We discuss the case where the activation functions are identical in each layer to unclutter notation.) In order to train these models, we require the gradient of a loss function L with respect to all model parameters Aj, bj for j = 1, . . . , K. This also requires us to compute the gradient of L with respect to the inputs of each layer. For example, if we have inputs x and observations y and a network structure defined by
f_0 := x   (5.112)
f_i := σ_i(A_{i−1} f_{i−1} + b_{i−1}) ,  i = 1, . . . , K ,   (5.113)
∂L/∂θ_i = ∂L/∂f_K · ∂f_K/∂f_{K−1} · · · ∂f_{i+2}/∂f_{i+1} · ∂f_{i+1}/∂θ_i   (5.118)
The orange terms are partial derivatives of the output of a layer with respect to its inputs, whereas the blue terms are partial derivatives of the output of a layer with respect to its parameters. Assuming we have
Figure 5.9 Backward pass in a multi-layer neural network (x → f1 → · · · → f_{K−1} → f_K → L, with parameters A_1, b_1, . . . , A_{K−1}, b_{K−1}) to compute the gradients of the loss function.
Example 5.14
Consider the function
f(x) = √(x² + exp(x²)) + cos(x² + exp(x²))   (5.122)
from (5.109). If we were to implement a function f on a computer, we would be able to save some computation by using intermediate variables:
a = x² ,   (5.123)
b = exp(a) ,   (5.124)
c = a + b ,   (5.125)
d = √c ,   (5.126)
e = cos(c) ,   (5.127)
f = d + e .   (5.128)
Figure 5.11 Computation graph with inputs x, function values f, and intermediate variables a, b, c, d, e: x → (·)² → a; a → exp(·) → b; (a, b) → + → c; c → √· → d; c → cos(·) → e; (d, e) → + → f.
This is the same kind of thinking process that occurs when applying the
chain rule. Note that the above set of equations require fewer operations
than a direct implementation of the function f (x) as defined in (5.109).
The corresponding computation graph in Figure 5.11 shows the flow of
data and computations required to obtain the function value f .
The set of equations that include intermediate variables can be thought
of as a computation graph, a representation that is widely used in imple-
mentations of neural network software libraries. We can directly compute
the derivatives of the intermediate variables with respect to their corre-
sponding inputs by recalling the definition of the derivative of elementary
functions. We obtain:
∂a/∂x = 2x   (5.129)
∂b/∂a = exp(a)   (5.130)
∂c/∂a = 1 = ∂c/∂b   (5.131)
∂d/∂c = 1/(2√c)   (5.132)
∂e/∂c = − sin(c)   (5.133)
∂f/∂d = 1 = ∂f/∂e .   (5.134)
By looking at the computation graph in Figure 5.11, we can compute ∂f/∂x by working backward from the output and obtain
∂f/∂c = ∂f/∂d · ∂d/∂c + ∂f/∂e · ∂e/∂c   (5.135)
∂f/∂b = ∂f/∂c · ∂c/∂b   (5.136)
∂f/∂a = ∂f/∂b · ∂b/∂a + ∂f/∂c · ∂c/∂a   (5.137)
∂f/∂x = ∂f/∂a · ∂a/∂x .   (5.138)
Note that we implicitly applied the chain rule to obtain ∂f /∂x. By substi-
tuting the results of the derivatives of the elementary functions, we get
∂f/∂c = 1 · 1/(2√c) + 1 · (− sin(c))   (5.139)
∂f/∂b = ∂f/∂c · 1   (5.140)
∂f/∂a = ∂f/∂b · exp(a) + ∂f/∂c · 1   (5.141)
∂f/∂x = ∂f/∂a · 2x .   (5.142)
By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counterintuitive since the mathematical expression for the derivative ∂f/∂x in (5.110) is significantly more complicated than the mathematical expression of the function f(x) in (5.109).
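A minimal Python sketch of the forward and backward passes through the computation graph of Figure 5.11, with a finite-difference check of the resulting ∂f/∂x:

import numpy as np

def f_and_grad(x):
    """Forward and backward pass through the computation graph of Figure 5.11
    for f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2))."""
    # Forward pass: intermediate variables (5.123)-(5.128).
    a = x**2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e

    # Backward pass: reuse the intermediate values, cf. (5.139)-(5.142).
    df_dc = 1.0 / (2 * np.sqrt(c)) - np.sin(c)
    df_db = df_dc * 1.0
    df_da = df_db * np.exp(a) + df_dc * 1.0
    df_dx = df_da * 2 * x
    return f, df_dx

x = 0.5
value, grad = f_and_grad(x)

# Finite-difference sanity check of the backward pass.
h = 1e-6
numeric = (f_and_grad(x + h)[0] - f_and_grad(x - h)[0]) / (2 * h)
print(value, grad, numeric)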
where the gi (·) are elementary functions and xPa(xi ) are the parent nodes
of the variable xi in the graph. Given a function defined in this way, we
can use the chain rule to compute the derivative of the function in a step-
by-step fashion. Recall that by definition f = xD and hence
∂f/∂xD = 1 .   (5.144)
For other variables xi, we apply the chain rule
∂f/∂xi = ∑_{xj : xi ∈ Pa(xj)} ∂f/∂xj · ∂xj/∂xi = ∑_{xj : xi ∈ Pa(xj)} ∂f/∂xj · ∂gj/∂xi ,   (5.145)
Figure: Linear approximation of a function around x0 by the first-order Taylor series expansion, f(x0) + f'(x0)(x − x0).
Figure 5.13
Visualizing outer
products. Outer
products of vectors
increase the
dimensionality of
the array by 1 per
term. (a) Given a vector δ ∈ R4 , we obtain the outer product δ 2 := δ ⊗ δ = δδ > ∈
R4×4 as a matrix.
where D_x^k f(x0) is the k-th (total) derivative of f with respect to x, evaluated at x0.
Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of f at x0 contains the first n + 1 components of the series in (5.151) and is defined as
Tn(x) = ∑_{k=0}^{n} D_x^k f(x0) / k! · δ^k .   (5.152)
in the Taylor series, where D_x^k f(x0) δ^k contains k-th order polynomials.
Now that we defined the Taylor series for vector fields, let us explicitly write down the first terms D_x^k f(x0) δ^k of the Taylor series expansion for k = 0, . . . , 3 and δ := x − x0:
k = 0 :  D_x^0 f(x0) δ^0 = f(x0) ∈ R   (5.156)
k = 1 :  D_x^1 f(x0) δ^1 = ∇_x f(x0) δ = ∑_{i=1}^{D} ∇_x f(x0)[i] δ[i] ∈ R   (5.157)
k = 2 :  D_x^2 f(x0) δ^2 = tr(H(x0) δ δ>) = δ> H(x0) δ   (5.158)
= ∑_{i=1}^{D} ∑_{j=1}^{D} H[i, j] δ[i] δ[j] ∈ R   (5.159)
k = 3 :  D_x^3 f(x0) δ^3 = ∑_{i=1}^{D} ∑_{j=1}^{D} ∑_{k=1}^{D} D_x^3 f(x0)[i, j, k] δ[i] δ[j] δ[k] ∈ R   (5.160)
(In NumPy notation, these contractions correspond to np.einsum('i,i', Df1, d), np.einsum('ij,i,j', Df2, d, d), and np.einsum('ijk,i,j,k', Df3, d, d, d), respectively.)
Here, H(x0 ) is the Hessian of f evaluated at x0 .
∂f/∂x = 2x + 2y  ⟹  ∂f/∂x(1, 2) = 6   (5.163)
∂f/∂y = 2x + 3y²  ⟹  ∂f/∂y(1, 2) = 14 .   (5.164)
Therefore, we obtain
D¹_{x,y} f(1, 2) = ∇_{x,y} f(1, 2) = [∂f/∂x(1, 2)  ∂f/∂y(1, 2)] = [6  14] ∈ R1×2   (5.165)
such that
D¹_{x,y} f(1, 2) / 1! · δ = [6  14] [x − 1, y − 2]> = 6(x − 1) + 14(y − 2) .   (5.166)
Note that D¹_{x,y} f(1, 2) δ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by
∂²f/∂x² = 2  ⟹  ∂²f/∂x²(1, 2) = 2   (5.167)
∂²f/∂y² = 6y  ⟹  ∂²f/∂y²(1, 2) = 12   (5.168)
∂²f/∂y∂x = 2  ⟹  ∂²f/∂y∂x(1, 2) = 2   (5.169)
∂²f/∂x∂y = 2  ⟹  ∂²f/∂x∂y(1, 2) = 2 .   (5.170)
When we collect the second-order partial derivatives, we obtain the Hessian
H = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]] = [[2, 2], [2, 6y]] ,   (5.171)
such that
H(1, 2) = [[2, 2], [2, 12]] ∈ R2×2 .   (5.172)
Therefore, the next term of the Taylor-series expansion is given by
D²_{x,y} f(1, 2) / 2! · δ² = (1/2) δ> H(1, 2) δ   (5.173a)
= (1/2) [x − 1  y − 2] [[2, 2], [2, 12]] [x − 1, y − 2]>   (5.173b)
= (x − 1)² + 2(x − 1)(y − 2) + 6(y − 2)² .   (5.173c)
Here, D²_{x,y} f(1, 2) δ² contains only quadratic terms, i.e., second-order polynomials.
Exercises
5.1 Compute the derivative f 0 (x) for
f (x) = log(x4 ) sin(x3 ) .
where x, µ ∈ RD , S ∈ RD×D .
2.
f (x) = tr(xx> + σ 2 I) , x ∈ RD
Here tr(A) is the trace of A, i.e., the sum of the diagonal elements Aii .
Hint: Explicitly write out the outer product.
3. Use the chain rule. Provide the dimensions of every single partial deriva-
tive. You do not need to compute the product of the partial derivatives
explicitly.
f = tanh(z) ∈ RM
z = Ax + b, x ∈ RN , A ∈ RM ×N , b ∈ RM .
Here, tanh is applied to every component of z .
5.9 We define
g(z, ν) := log p(x, z) − log q(z, ν)
z := t(ε, ν)
for differentiable functions p, q, t. By using the chain rule, compute the gradient
(d/dν) g(z, ν) .
6 Probability and Distributions
6.1 Construction of a Probability Space
Figure: A mind map of the concepts introduced in this chapter (independence, sufficient statistics, conjugacy, the Bernoulli and related distributions), along with where they are used in other parts of the book (e.g., Chapter 11 Density Estimation).
(Grinstead and Snell, 1997; Jaynes, 2003) that introduce the three con-
cepts of sample space, event space and probability measure. The probabil-
ity space models a real-world process (referred to as an experiment) with
random outcomes.
The probability of a single event must lie in the interval [0, 1], and the
total probability over all outcomes in the sample space Ω must be 1, i.e.,
P (Ω) = 1. Given a probability space (Ω, A, P ) we want to use it to model
some real world phenomenon. In machine learning, we often avoid ex-
plicitly referring to the probability space, but instead refer to probabilities
on quantities of interest, which we denote by T . In this book we refer to
T as the target space and refer to elements of T as states. We introduce a function X : Ω → T that takes an element of Ω (an event) and returns a particular quantity of interest x, a value in T. This association/mapping from Ω to T is called a random variable. For example, in the case of tossing two coins and counting the number of heads, a random variable X maps to the three possible outcomes: X(hh) = 2, X(ht) = 1, X(th) = 1, and X(tt) = 0. In this particular case, T = {0, 1, 2}, and it is the probabilities on elements of T that we are interested in. For a finite sample space Ω and finite T, the function corresponding to a random variable is essentially a lookup table. For any subset S ⊆ T, we associate PX(S) ∈ [0, 1] (the probability) to a particular event occurring corresponding to the random variable X. Example 6.1 provides a concrete illustration of the above terminology. (The name "random variable" is a great source of misunderstanding as it is neither random nor is it a variable. It is a function.)
Remark. The sample space Ω above unfortunately is referred to by dif-
ferent names in different books. Another common name for Ω is “state
space” (Jacod and Protter, 2004), but state space is sometimes reserved
for referring to states in a dynamical system (Hasselblatt and Katok, 2003).
Example 6.1
(This toy example is essentially a biased coin flip example.) We assume that the reader is already familiar with computing probabilities of intersections and unions of sets of events. A more gentle introduction to probability with many examples can be found in Chapter 2 of Walpole et al. (2011).
Consider a statistical experiment where we model a funfair game con-
sisting of drawing two coins from a bag (with replacement). There are
coins from USA (denoted as $) and UK (denoted as £) in the bag, and
since we draw two coins from the bag, there are four outcomes in total.
The state space or sample space Ω of this experiment is then ($, $), ($,
£), (£, $), (£, £). Let us assume that the composition of the bag of coins is
such that a draw returns at random a $ with probability 0.3.
The event we are interested in is the total number of times the repeated
draw returns $. Let us define a random variable X that maps the sample
space Ω to T , that denotes the number of times we draw $ out of the bag.
We can see from the sample space above that we can get zero $, one $, or two $s, and therefore T = {0, 1, 2}. The random variable X (a function or
lookup table) can be represented as a table like below
X(($, $)) = 2 (6.1)
X(($, £)) = 1 (6.2)
X((£, $)) = 1 (6.3)
X((£, £)) = 0 . (6.4)
Since we return the first coin we draw before drawing the second, this
implies that the two draws are independent of each other, which we will
discuss in Section 6.4.5. Note that there are two experimental outcomes,
which map to the same event, where only one of the draws returns $.
Therefore, the probability mass function (Section 6.2.1) of X is given by
P (X = 2) = P (($, $))
= P ($) · P ($)
= 0.3 · 0.3 = 0.09 (6.5)
P (X = 1) = P (($, £) ∪ (£, $))
= P (($, £)) + P ((£, $))
= 0.3 · (1 − 0.3) + (1 − 0.3) · 0.3 = 0.42 (6.6)
P (X = 0) = P ((£, £))
= P (£) · P (£)
= (1 − 0.3) · (1 − 0.3) = 0.49 . (6.7)
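A minimal Python sketch of this probability mass function:

# Probability mass function of X, the number of $ coins in two independent
# draws with replacement, where P($) = 0.3 (Example 6.1).
p_dollar = 0.3

pmf = {
    2: p_dollar * p_dollar,                                    # ($, $)
    1: p_dollar * (1 - p_dollar) + (1 - p_dollar) * p_dollar,  # ($, £) or (£, $)
    0: (1 - p_dollar) * (1 - p_dollar),                        # (£, £)
}

print(pmf)                 # {2: 0.09, 1: 0.42, 0: 0.49}
print(sum(pmf.values()))   # 1.0 (up to floating-point rounding)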
6.1.3 Statistics
Probability theory and statistics are often presented together, but they con-
cern different aspects of uncertainty. One way of contrasting them is by the
kinds of problems that are considered. Using probability we can consider
a model of some process, where the underlying uncertainty is captured
by random variables, and we use the rules of probability to derive what
happens. In statistics we observe that something has happened, and try
to figure out the underlying process that explains the observations. In this
sense, machine learning is close to statistics in its goals to construct a
model that adequately represents the process that generated the data. We
can use the rules of probability to obtain a “best fitting” model for some
data.
Another aspect of machine learning systems is that we are interested
in generalization error (see Chapter 8). This means that we are actually
interested in the performance of our system on instances that we will ob-
serve in future, which are not identical to the instances that we have seen
Figure 6.2 Visualization of a discrete bivariate probability mass function, with random variables X (states x1, . . . , x5) and Y (states y1, y2, y3); cell counts nij, column sums ci, and row sums rj. This diagram is adapted from Bishop (2006).
Example 6.2
Consider two random variables X and Y , where X has five possible states
and Y has three possible states, as shown in Figure 6.2. We denote by nij
the number of events with state X = xi and Y = yj , and denote by
N the total number of events. The value ci is the sum of the individual frequencies for the ith column, that is, ci = ∑_{j=1}^{3} nij. Similarly, the value rj is the row sum, that is, rj = ∑_{i=1}^{5} nij. Using these definitions, we can compactly express the distribution of X and Y.
The probability distribution of each random variable, the marginal probability, can be seen as the sum over a row or column:
P(X = xi) = ci/N = (∑_{j=1}^{3} nij)/N   (6.10)
and
P(Y = yj) = rj/N = (∑_{i=1}^{5} nij)/N ,   (6.11)
where ci and rj are the ith column and jth row of the probability table, respectively. By convention, for discrete random variables with a finite number of events, we assume that probabilities sum up to one, that is,
∑_{i=1}^{5} P(X = xi) = 1  and  ∑_{j=1}^{3} P(Y = yj) = 1 .   (6.12)
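A minimal NumPy sketch of (6.10)–(6.12) for a hypothetical 3 × 5 table of counts nij:

import numpy as np

# Hypothetical counts n_ij for the states of Y (rows) and X (columns),
# mirroring the layout of Figure 6.2.
n = np.array([[4, 1, 0, 2, 3],
              [2, 5, 3, 1, 0],
              [0, 2, 4, 3, 1]])
N = n.sum()

# Marginals as in (6.10) and (6.11): column sums c_i and row sums r_j.
p_x = n.sum(axis=0) / N     # P(X = x_i)
p_y = n.sum(axis=1) / N     # P(Y = y_j)

print(p_x, p_x.sum())       # sums to 1
print(p_y, p_y.sum())       # sums to 1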
For probability mass functions (pmf) of discrete random variables the in-
tegral in (6.15) is replaced with a sum (6.12).
Observe that the probability density function is any function f that is
non-negative and integrates to one. We associate a random variable X
with this function f by
P(a ≤ X ≤ b) = ∫_a^b f(x) dx ,   (6.16)
Remark. We reiterate that there are in fact two distinct concepts when
talking about distributions. First, the idea of a pdf (denoted by f (x))
which is a non-negative function that sums to one. Second, the law of
a random variable X , that is the association of a random variable X with
the pdf f (x). ♦
Figure 6.3 Examples of discrete and continuous uniform distributions. See Example 6.3 for details of the distributions. (a) Discrete distribution (vertical axis P(Z = z), horizontal axis z); (b) Continuous distribution (vertical axis p(x), horizontal axis x).
For most of this book, we will not use the notation f (x) and FX (x) as
we mostly do not need to distinguish between the pdf and cdf. However,
we will need to be careful about pdfs and cdfs in Section 6.7.
Example 6.3
We consider two examples of the uniform distribution, where each state is
equally likely to occur. This example illustrates some differences between
discrete and continuous probability distributions.
Let Z be a discrete uniform random variable with three states {z = −1.1, z = 0.3, z = 1.5}. (The actual values of these states are not meaningful here, and we deliberately chose numbers to drive home the point that we do not want to use (and should ignore) the ordering of the states.) The probability mass function can be represented as a table of probability values:

z          −1.1   0.3   1.5
P(Z = z)    1/3   1/3   1/3

Alternatively, we can think of this as a graph (Figure 6.3(a)), where we use the fact that the states can be located on the x-axis, and the y-axis represents the probability of a particular state. The y-axis in Figure 6.3(a) is deliberately extended so that it is the same as in Figure 6.3(b).
Let X be a continuous random variable taking values in the range 0.9 ≤ X ≤ 1.6, as represented by Figure 6.3(b). Observe that the height of the
low naturally from fulfilling the desiderata (Jaynes, 2003, Chapter 2).
Probabilistic modeling (Section 8.3) provides a principled foundation for
designing machine learning methods. Once we have defined probability
distributions (Section 6.2) corresponding to the uncertainties of the data
and our problem, it turns out that there are only two fundamental rules,
the sum rule and the product rule.
Recall from (6.9) that p(x, y) is the joint distribution of the two ran-
dom variables x, y . The distributions p(x) and p(y) are the correspond-
ing marginal distributions, and p(y | x) is the conditional distribution of y
given x. Given the definitions of the marginal and conditional probability
for discrete and continuous random variables in Section 6.2, we can now
present the two fundamental rules in probability theory. (These two rules arise naturally (Jaynes, 2003) from the requirements we discussed in Section 6.1.1.)
The first rule, the sum rule, states that
p(x) = ∑_{y∈Y} p(x, y) if y is discrete,  and  p(x) = ∫_Y p(x, y) dy if y is continuous ,   (6.20)
where Y denotes the states of the target space of random variable Y. This means that we sum out (or integrate out) the set of states y of the random variable Y. The sum rule is also known as the marginalization property. The sum rule relates the joint distribution to a marginal distribution. In general, when the joint distribution contains more than two random variables, the sum rule can be applied to any subset of the random variables, resulting in a marginal distribution of potentially more than one random variable. More concretely, if x = [x1, . . . , xD]>, we obtain the marginal
p(xi) = ∫ p(x1, . . . , xD) dx\i ,   (6.21)
The second rule, the product rule, relates the joint distribution to the conditional distribution: p(x, y) = p(y | x) p(x).    (6.22)
The product rule can be interpreted as the fact that every joint distribution of two random variables can be factorized into (written as a product of) two other distributions. The two factors are the marginal distribu-
tion of the first random variable p(x), and the conditional distribution
of the second random variable given the first p(y | x). Since the ordering
of random variables is arbitrary in p(x, y) the product rule also implies
p(x, y) = p(x | y)p(y). To be precise, (6.22) is expressed in terms of the
probability mass functions for discrete random variables. For continuous
random variables, the product rule is expressed in terms of the probability
density functions (Section 6.2.3).
In machine learning and Bayesian statistics, we are often interested in
making inferences of unobserved (latent) random variables given that we
have observed other random variables. Let us assume we have some prior
knowledge p(x) about an unobserved random variable x and some rela-
tionship p(y | x) between x and a second random variable y , which we
can observe. If we observe y we can use Bayes’ theorem to draw some
conclusions about x given the observed values of y . Bayes’ theorem (also called Bayes’ rule or Bayes’ law) states that

p(x \,|\, y) = \frac{p(y \,|\, x)\, p(x)}{p(y)} ,    (6.23)

where p(x) is the prior, p(y | x) the likelihood, and p(y) the marginal likelihood/evidence defined in (6.27).
The quantity
p(y) := \int p(y \,|\, x)\, p(x)\,\mathrm{d}x = \mathbb{E}_X[p(y \,|\, x)]    (6.27)
is the marginal likelihood/evidence. The right-hand side of (6.27) uses the expectation operator which we define in Section 6.4.1. By definition the
marginal likelihood integrates the numerator of (6.23) with respect to the
latent variable x. Therefore, the marginal likelihood is independent of
x and it ensures that the posterior p(x | y) is normalized. The marginal
likelihood can also be interpreted as the expected likelihood where we
take the expectation with respect to the prior p(x). Beyond normalization
of the posterior the marginal likelihood also plays an important role in
Bayesian model selection as we will discuss in Section 8.5. Due to the
integration in (8.44), the evidence is often hard to compute. Bayes’ theorem (6.23) allows us to invert the relationship between x and y given by the likelihood. Therefore, Bayes’ theorem is sometimes called the probabilistic inverse. We will discuss Bayes’ theorem further in
Section 8.3.
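A minimal sketch (with made-up numbers) of Bayes’ theorem for a discrete latent variable: the posterior is the product of prior and likelihood, normalized by the evidence.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # p(x) over three states of the latent variable
likelihood = np.array([0.10, 0.60, 0.30])  # p(y_obs | x) for one observed value of y

evidence = np.sum(likelihood * prior)      # p(y_obs) = sum_x p(y_obs | x) p(x)
posterior = likelihood * prior / evidence  # Bayes' theorem (6.23)
print(evidence)                            # marginal likelihood of the observation
print(posterior, posterior.sum())          # posterior over x; sums to 1
```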
Remark. In Bayesian statistics, the posterior distribution is the quantity
of interest as it encapsulates all available information from the prior and
the data. Instead of carrying the posterior around, it is possible to focus
on some statistic of the posterior, such as the maximum of the posterior,
which we will discuss in Section 8.2. However, focusing on some statistic
of the posterior leads to loss of information. If we think in a bigger con-
text, then the posterior can be used within a decision making system, and
having the full posterior can be extremely useful and lead to decisions that
are robust to disturbances. For example, in the context of model-based re-
inforcement learning, Deisenroth et al. (2015) show that using the full
posterior distribution of plausible transition functions leads to very fast
(data/sample efficient) learning, whereas focusing on the maximum of
the posterior leads to consistent failures. Therefore, having the full pos-
terior can be very useful for a downstream task. In Chapter 9, we will
continue this discussion in the context of linear regression. ♦
Definition 6.3 (Expected value). The expected value of a function g : R → R of a univariate continuous random variable X ∼ p(x) is given by

\mathbb{E}_X[g(x)] = \int_{\mathcal{X}} g(x)\, p(x)\,\mathrm{d}x ,    (6.28)

where \mathcal{X} is the set of possible outcomes (the target space) of the random variable X. For a multivariate random variable x = [x_1, \ldots, x_D]^\top, the expected value is defined element-wise, where the subscript \mathbb{E}_{X_d} indicates that we are taking the expected value with respect to the dth element of the vector x. ♦
Definition 6.3 defines the meaning of the notation EX as the operator
indicating that we should take the integral with respect to the probabil-
ity density (for continuous distributions) or the sum over all states (for
discrete distributions). The definition of the mean (Definition 6.4) is a
special case of the expected value, obtained by choosing g to be the iden-
tity function.
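A quick Monte Carlo check (my own example) of Definition 6.3: for X ∼ N(0, 1) and g(x) = x², the expected value E_X[g(x)] equals 1, and choosing g to be the identity recovers the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # samples from p(x) = N(0, 1)

g = lambda s: s ** 2
print(np.mean(g(x)))                 # Monte Carlo estimate of E_X[g(x)]; ~1.0
print(np.mean(x))                    # identity function: the mean, ~0.0
```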
Definition 6.4 (Mean). The mean of a random variable X with states
Example 6.4
Figure 6.4 Illustration of the mean, mode and median for a two-dimensional dataset, as well as its marginal densities.
Remark. The expected value (Definition 6.3) is a linear operator. For ex-
ample, given a real-valued function f (x) = ag(x) + bh(x) where a, b ∈ R
and x ∈ RD , we obtain
\mathbb{E}_X[f(x)] = \int f(x)\, p(x)\,\mathrm{d}x    (6.34a)
= \int [a g(x) + b h(x)]\, p(x)\,\mathrm{d}x    (6.34b)
= a \int g(x)\, p(x)\,\mathrm{d}x + b \int h(x)\, p(x)\,\mathrm{d}x .    (6.34c)
♦
For two random variables, we may wish to characterize their correspon-
Figure 6.5 Two-dimensional datasets with identical means and variances along each axis (colored lines) but with different covariances. (a) x and y are negatively correlated. (b) x and y are positively correlated.
p(x_i) = \int p(x_1, \ldots, x_D)\,\mathrm{d}x_{\setminus i} ,    (6.39)

where "\setminus i" denotes "all variables but i". The off-diagonal entries are the cross-covariance terms Cov[x_i, x_j] for i, j = 1, \ldots, D, i ≠ j.
When we want to compare the covariances between different pairs of
random variables, it turns out that the variance of each random variable
affects the value of the covariance. The normalized version of covariance
is called the correlation.
Definition 6.8 (Correlation). The correlation between two random variables X, Y is given by

corr[x, y] = \frac{\mathrm{Cov}[x, y]}{\sqrt{\mathbb{V}[x]\,\mathbb{V}[y]}} \in [-1, 1] .    (6.40)
The correlation matrix is the covariance matrix of standardized random
variables, x/σ(x). In other words, each random variable is divided by its
standard deviation (the square root of the variance) in the correlation
matrix.
The covariance (and correlation) indicate how two random variables
are related, see Figure 6.5. Positive correlation corr[x, y] means that when
x grows then y is also expected to grow. Negative correlation means that
as x increases then y decreases.
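A short simulation (illustrative data) showing how the empirical correlation (6.40) captures the sign of the relationship between two random variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.standard_normal(n)
noise = 0.5 * rng.standard_normal(n)
y_pos = 2.0 * x + noise     # y grows when x grows
y_neg = -2.0 * x + noise    # y decreases when x grows

def corr(a, b):
    # corr[a, b] = Cov[a, b] / sqrt(V[a] V[b]), cf. (6.40)
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return cov / np.sqrt(a.var() * b.var())

print(corr(x, y_pos))       # close to +1
print(corr(x, y_neg))       # close to -1
```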
where xn ∈ RD .
Similar to the empirical mean, the empirical covariance matrix is a D × D matrix

\Sigma := \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top .    (6.42)

(Throughout the book we use the empirical covariance, which is a biased estimate. The unbiased (sometimes called corrected) covariance has the factor N − 1 in the denominator instead of N. The derivations are exercises at the end of this chapter.)

To compute the statistics for a particular dataset, we would use the realizations (observations) x_1, \ldots, x_N and use (6.41) and (6.42). Empirical covariance matrices are symmetric, positive semi-definite (see Section 3.2.3).

6.4.3 Three Expressions for the Variance

We now focus on a single random variable X and use the empirical formulas above to derive three possible expressions for the variance. The derivation below is the same for the population variance, except that we need to take care of integrals. The standard definition of variance, corresponding
to the definition of covariance (Definition 6.5), is the expectation of the
squared deviation of a random variable X from its expected value µ, i.e.,
VX [x] := EX [(x − µ)2 ] . (6.43)
The expectation in (6.43) and the mean µ = EX (x) are computed us-
ing (6.32), depending on whether X is a discrete or continuous random
variable. The variance as expressed in (6.43) is the mean of a new random
variable Z := (X − µ)2 .
When estimating the variance in (6.43) empirically, we need to resort
to a two-pass algorithm: one pass through the data to calculate the mean
µ using (6.41), and then a second pass using this estimate µ̂ to calculate the
variance. It turns out that we can avoid two passes by rearranging the
terms. The formula in (6.43) can be converted to the so-called raw-score formula for variance:

\mathbb{V}_X[x] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2 .    (6.44)
A third expression is the average squared pairwise difference between all pairs of observations:

\frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = \frac{2}{N} \sum_{i=1}^{N} x_i^2 - 2\left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^2 .    (6.45)

We see that (6.45) is twice the raw-score expression (6.44). This means
that we can express the sum of pairwise distances (of which there are N 2
of them) as a sum of deviations from the mean (of which there are N ).
Geometrically, this means that there is an equivalence between the pair-
wise distances and the distances from the center of the set of points. From
a computational perspective, this means that by computing the mean
(N terms in the summation), and then computing the variance (again
N terms in the summation) we can obtain an expression (left-hand side
of (6.45)) that has N 2 terms.
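The following sketch (toy data) checks numerically that the two-pass definition (6.43), the raw-score formula (6.44), and the pairwise-difference expression (6.45) all give the same value.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=1_000)

# (6.43): two passes, first the mean, then the squared deviations.
mu = x.mean()
var_two_pass = np.mean((x - mu) ** 2)

# (6.44): raw-score formula, a single pass over x and x**2.
var_raw_score = np.mean(x ** 2) - x.mean() ** 2

# (6.45): the average squared pairwise difference is twice the variance.
pairwise = np.mean((x[:, None] - x[None, :]) ** 2)

print(var_two_pass, var_raw_score, pairwise / 2)   # numerically identical
```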
Example 6.5
Consider a random variable X with zero mean (EX [x] = 0) and also
EX [x3 ] = 0. Let y = x2 (hence, Y is dependent on X ) and consider the
covariance (6.36) between X and Y . But this gives
Cov[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y] = \mathbb{E}[x^3] = 0 .    (6.54)
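A simulation (my own setup) of Example 6.5: with X symmetric around zero and Y = X², the empirical covariance is close to zero even though Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)   # zero mean, and E[x^3] = 0 by symmetry
y = x ** 2                           # Y is completely determined by X

cov_xy = np.mean(x * y) - x.mean() * y.mean()
print(cov_xy)                        # ~0: uncorrelated, yet clearly dependent
```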
Figure 6.6 Geometry of random variables. If random variables x and y are uncorrelated, they are orthogonal vectors in a corresponding vector space, and the Pythagorean theorem applies: \sqrt{\mathrm{var}[x + y]} = \sqrt{\mathrm{var}[x] + \mathrm{var}[y]}.

Figure 6.7 Gaussian distribution of two random variables x, y (surface and contour plots of p(x1, x2)).
where Σxx = Cov[x, x] and Σyy = Cov[y, y] are the marginal covari-
ance matrices of x and y , respectively, and Σxy = Cov[x, y] is the cross-
covariance matrix between x and y .
The conditional distribution p(x | y) is also Gaussian (illustrated in Fig-
ure 6.9(c)) and given by (derived in Section 2.3 of Bishop (2006))
p(x \,|\, y) = \mathcal{N}\big(\mu_{x|y}, \Sigma_{x|y}\big)    (6.65)
\mu_{x|y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y)    (6.66)
\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} .    (6.67)
Note that in the computation of the mean in (6.66) the y -value is an
observation and no longer random.
Remark. The conditional Gaussian distribution shows up in many places,
where we are interested in posterior distributions:
The Kalman filter (Kalman, 1960), one of the most central algorithms
for state estimation in signal processing, does nothing but computing
Gaussian conditionals of joint distributions (Deisenroth and Ohlsson, 2011).
Gaussian processes (Rasmussen and Williams, 2006), which are a prac-
tical implementation of a distribution over functions. In a Gaussian pro-
cess, we make assumptions of joint Gaussianity of random variables. By
(Gaussian) conditioning on observed data, we can determine a poste-
rior distribution over functions.
Latent linear Gaussian models (Roweis and Ghahramani, 1999; Mur-
phy, 2012), which include probabilistic principal component analysis
(PPCA) (Tipping and Bishop, 1999). We will look at PPCA in more de-
tail in Section 10.7.
♦
The marginal distribution p(x) of a joint Gaussian distribution p(x, y),
see (6.64), is itself Gaussian and computed by applying the sum rule
(6.20) and given by
p(x) = \int p(x, y)\,\mathrm{d}y = \mathcal{N}\big(x \,|\, \mu_x, \Sigma_{xx}\big) .    (6.68)
Example 6.6
Consider the bivariate Gaussian distribution (illustrated in Figure 6.9)
p(x_1, x_2) = \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.3 & -1 \\ -1 & 5 \end{bmatrix} \right) .    (6.69)
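Applying (6.66)–(6.68) to the bivariate Gaussian in (6.69) with a few lines of numpy; the conditioning value x2 = −1 matches the one used in Figure 6.9, and the printed numbers follow directly from the formulas.

```python
import numpy as np

mu = np.array([0.0, 2.0])            # mean vector from (6.69)
Sigma = np.array([[0.3, -1.0],
                  [-1.0, 5.0]])      # covariance matrix from (6.69)

# Marginal p(x1), cf. (6.68): keep the corresponding mean and covariance blocks.
print(mu[0], Sigma[0, 0])            # N(0, 0.3)

# Conditional p(x1 | x2 = -1), cf. (6.66) and (6.67).
x2 = -1.0
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(mu_cond, var_cond)             # N(0.6, 0.1)
```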
Figure 6.9 (a) Bivariate Gaussian; the marginal of a joint Gaussian distribution is Gaussian, and the conditional distribution of a Gaussian (here conditioned on x2 = −1) is also Gaussian.
Here, neither a nor b are random variables. However, writing c in this way
is more compact than (6.76). ♦
Knowing that p(x+y) is Gaussian, the mean and covariance matrix can be
determined immediately using the results from (6.46)–(6.49). This prop-
erty will be important when we consider i.i.d. Gaussian noise acting on
random variables as is the case for linear regression (Chapter 9).
Example 6.7
Since expectations are linear operations, we can obtain the distribution of a weighted sum of independent Gaussian random variables x ∼ N(µ_x, Σ_x) and y ∼ N(µ_y, Σ_y):

p(ax + by) = \mathcal{N}\big(a\mu_x + b\mu_y, \; a^2\Sigma_x + b^2\Sigma_y\big) .    (6.79)
Proof The mean of the mixture x is given by the weighted sum of the
means of each random variable. We apply the definition of the mean (Def-
inition 6.4), and plug in our mixture (6.80) above, which yields
\mathbb{E}[x] = \int_{-\infty}^{\infty} x\, p(x)\,\mathrm{d}x    (6.83a)
= \int_{-\infty}^{\infty} \big( \alpha x\, p_1(x) + (1 - \alpha) x\, p_2(x) \big)\,\mathrm{d}x    (6.83b)
= \alpha \int_{-\infty}^{\infty} x\, p_1(x)\,\mathrm{d}x + (1 - \alpha) \int_{-\infty}^{\infty} x\, p_2(x)\,\mathrm{d}x    (6.83c)
= \alpha \mu_1 + (1 - \alpha) \mu_2 .    (6.83d)
To compute the variance, we can use the raw score version of the vari-
ance from (6.44), which requires an expression of the expectation of the
squared random variable. Here we use the definition of an expectation of
a function (the square) of a random variable (Definition 6.3)
\mathbb{E}[x^2] = \int_{-\infty}^{\infty} x^2 p(x)\,\mathrm{d}x    (6.84a)
= \int_{-\infty}^{\infty} \big( \alpha x^2 p_1(x) + (1 - \alpha) x^2 p_2(x) \big)\,\mathrm{d}x    (6.84b)
It turns out that the class of distributions called the exponential family
provides the right balance of generality while retaining favourable com-
putation and inference properties. Before we introduce the exponential
family, let us see three more members of “named” probability distribu-
tions, the Bernoulli (Example 6.8), Binomial (Example 6.9) and Beta (Ex-
ample 6.10) distributions.
Example 6.8
The Bernoulli distribution is a distribution for a single binary random variable X with state x ∈ {0, 1}. It is governed by a single continuous pa-
rameter µ ∈ [0, 1] that represents the probability of X = 1. The Bernoulli
distribution Ber(µ) is defined as
p(x \,|\, µ) = µ^x (1 - µ)^{1-x} ,   x ∈ \{0, 1\} ,    (6.92)
E[x] = µ , (6.93)
V[x] = µ(1 − µ) , (6.94)
where E[x] and V[x] are the mean and variance of the binary random
variable X .
Figure 6.10 Examples of the Binomial distribution for µ ∈ {0.1, 0.4, 0.75} and N = 15; p(m) is plotted over the number m of observations x = 1 in N = 15 experiments.

Figure 6.11 Examples of the Beta distribution p(µ | a, b) for different values of α and β (here a = 0.5 = b; a = 1 = b; a = 2, b = 0.3; a = 4, b = 10; a = 5, b = 1).
Remark. There is a whole zoo of distributions with names, and they are
related in different ways to each other (Leemis and McQueston, 2008).
It is worth keeping in mind that each named distribution is created for a
particular reason, but may have other applications. Knowing the reason
behind the creation of a particular distribution often allows insight into
how to best use it. We introduced the above three distributions to be able
to illustrate the concepts of conjugacy (Section 6.6.1) and exponential
families (Section 6.6.3). ♦
6.6.1 Conjugacy
According to Bayes’ theorem (6.23), the posterior is proportional to the
product of the prior and the likelihood. The specification of the prior can
be tricky for two reasons: First, the prior should encapsulate our knowl-
edge about the problem before we see any data. This is often difficult to
describe. Second, it is often not possible to compute the posterior distribu-
tion analytically. However, there are some priors that are computationally
convenient: conjugate priors.
Definition 6.13 (Conjugate Prior). A prior is conjugate for the likelihood
function if the posterior is of the same form/type as the prior.
Conjugacy is particularly convenient because we can algebraically cal-
culate our posterior distribution by updating the parameters of the prior
distribution.
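A minimal sketch (hyperparameters and data are made up) of how conjugacy turns the posterior computation into a parameter update: a Beta(a, b) prior on the Bernoulli parameter µ combined with binary observations yields a Beta posterior.

```python
import numpy as np

a, b = 2.0, 2.0                           # hypothetical Beta prior hyperparameters

x = np.array([1, 0, 1, 1, 1, 0, 1, 1])    # made-up binary observations
heads, tails = int(x.sum()), int(len(x) - x.sum())

# Conjugate update: the posterior is again a Beta distribution.
a_post, b_post = a + heads, b + tails
print(a_post, b_post)                     # Beta(8, 4)
print(a_post / (a_post + b_post))         # posterior mean of mu, here 2/3
```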
Remark. When considering the geometry of probability distributions, con-
jugate priors retain the same distance structure as the likelihood (Agarwal
and Daumé III, 2010). ♦
To introduce a concrete example of conjugate priors, we describe below
the Binomial distribution (defined on discrete random variables) and the
Beta distribution (defined on continuous random variables).
Table 6.2 lists examples for conjugate priors for the parameters of some
standard likelihoods used in probabilistic modeling. Distributions such as
Multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found
in any statistical text, and are for example described in Bishop (2006).
The Beta distribution is the conjugate prior for the parameter µ in both
the Binomial and the Bernoulli likelihood. For a Gaussian likelihood func-
tion, we can place a conjugate Gaussian prior on the mean. The reason why the Gaussian likelihood appears twice in the table is that we need to distinguish the univariate from the multivariate case. In the univariate (scalar) case, the inverse Gamma is the conjugate prior for the variance. In the multivariate case, we use a conjugate inverse Wishart distribution as a prior on the covariance matrix. (The Gamma prior is conjugate for the precision (inverse variance) in the univariate Gaussian likelihood, and the Wishart prior is conjugate for the precision matrix (inverse covariance matrix) in the multivariate Gaussian likelihood.) The Dirichlet distribution is the conjugate prior for the multinomial likelihood function. For further details, we refer to Bishop (2006).
Example 6.15
Recall the exponential family form of the Bernoulli distribution (6.113d),
p(x \,|\, µ) = \exp\!\left[ x \log \frac{µ}{1 - µ} + \log(1 - µ) \right] .    (6.121)
The canonical conjugate prior therefore has the same form
p(µ \,|\, γ, n_0) = \exp\!\left[ n_0\, γ \log \frac{µ}{1 - µ} + n_0 \log(1 - µ) - A_c(γ, n_0) \right] ,    (6.122)
which simplifies to
p(µ | γ, n0 ) = exp [n0 γ log µ + n0 (1 − γ) log(1 − µ) − Ac (γ, n0 )] .
(6.123)
Putting this in non-exponential-family form,

p(µ \,|\, γ, n_0) ∝ µ^{n_0 γ} (1 - µ)^{n_0(1-γ)} ,    (6.124)

which is of the same form as the Beta distribution (6.98); with minor manipulations we recover the original parametrization (Example 6.12).
Observe that in this example we have derived the form of the Beta dis-
tribution by looking at the conjugate prior of the exponential family.
Example 6.16
Let X be a continuous random variable with probability density function
f(x) = 3x^2   on   0 ≤ x ≤ 1 .    (6.128)
We are interested in finding the pdf of Y = X^2.
The function f is an increasing function of x, and therefore the resulting value of y lies in the interval [0, 1]. We obtain
F_Y(y) = P(Y ≤ y)    definition of cdf    (6.129a)
= P(X^2 ≤ y)    transformation of interest    (6.129b)
= P(X ≤ y^{1/2})    inverse    (6.129c)
= F_X(y^{1/2})    definition of cdf    (6.129d)
= \int_0^{y^{1/2}} 3t^2 \,\mathrm{d}t    cdf as a definite integral    (6.129e)
= \left[ t^3 \right]_{t=0}^{t=y^{1/2}}    result of integration    (6.129f)
= y^{3/2} ,   0 ≤ y ≤ 1 .    (6.129g)
Therefore, the cdf of Y is
F_Y(y) = y^{3/2}    (6.130)
for 0 ≤ y ≤ 1. To obtain the pdf, we differentiate the cdf
f(y) = \frac{\mathrm{d}}{\mathrm{d}y} F_Y(y) = \frac{3}{2} y^{1/2}    (6.131)
for 0 ≤ y ≤ 1.
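A Monte Carlo check of Example 6.16 (the sampling details are my own): draw X with density 3x² on [0, 1] by inverse transform sampling, set Y = X², and compare the empirical cdf of Y with y^{3/2} from (6.130).

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.uniform(size=1_000_000)

# Inverse transform sampling: F_X(x) = x^3 on [0, 1], so X = U^(1/3).
x = u ** (1.0 / 3.0)        # samples with density f(x) = 3 x^2
y = x ** 2                  # transformed variable Y = X^2

for y0 in (0.2, 0.5, 0.8):
    empirical = np.mean(y <= y0)       # empirical cdf of Y at y0
    print(y0, empirical, y0 ** 1.5)    # matches F_Y(y) = y^(3/2)
```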
In Example 6.16, we considered a strictly monotonically increasing function f(x) = 3x^2. This means that we could compute an inverse function. (Functions that have inverses are called injective functions; see Section 2.7.)
In general, we require that the function of interest y = U (x) has an in-
verse x = U −1 (y). A useful result can be obtained by considering the cu-
mulative distribution function FX (x) of a random variable X , and using
it as the transformation U (x). This leads to the following theorem.
Theorem 6.15. Let X be a continuous random variable with a strictly monotonic cumulative distribution function F_X(x). Then the random variable Y defined as
Y = F_X(x) ,    (6.132)
has a uniform distribution.
Theorem 6.15 is known as the probability integral transform, and it is used to derive algorithms for sampling from distributions by transforming the result of sampling from a uniform random variable.
The derivation of this rule is based on the chain rule of calculus (5.32) and
by applying twice the fundamental theorem of calculus. The fundamental
theorem of calculus formalizes the fact that integration and differentiation
are somehow “inverses” of each other. An intuitive understanding of the
rule can be obtained by thinking (loosely) about small changes (differen-
tials) to the equation u = g(x), that is, by considering ∆u = g′(x)∆x as a differential of u = g(x). By substituting u = g(x), the argument inside the
integral on the right hand side of (6.133) becomes f (g(x)). By pretending
that the term du can be approximated by du ≈ ∆u = g 0 (x)∆x, and that
dx ≈ ∆x, we obtain (6.133). ♦
Consider a univariate random variable X , and an invertible function
U , which gives us another random variable Y = U (X). We assume that
random variable X has states x ∈ [a, b]. By the definition of the cdf, we
have
FY (y) = P (Y 6 y) . (6.134)
We are interested in a function U of the random variable
P (Y 6 y) = P (U (X) 6 y) , (6.135)
Theorem 6.16. [Theorem 17.2 in Billingsley (1995)] Let f (x) be the value
of the probability density of the multivariate continuous random variable X .
If the vector-valued function y = U (x) is differentiable and invertible for
all values within the domain of x, then for corresponding values of y , the
probability density of Y = U (X) is given by
f(y) = f_x\big(U^{-1}(y)\big) \times \left| \det\!\left( \frac{\partial}{\partial y} U^{-1}(y) \right) \right| .    (6.144)
The theorem looks intimidating at first glance, but the key point is that
a change of variable of a multivariate random variable follows the pro-
cedure of the univariate change of variable. First we need to work out
the inverse transform, and substitute that into the density of x. Then we
calculate the determinant of the Jacobian and multiply the result. The
following example illustrates the case of a bivariate random variable.
Example 6.17
Consider a bivariate random variable X with states x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} and probability density function

f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2\pi} \exp\!\left( -\frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) .    (6.145)
Exercises
6.1 Consider the following bivariate distribution p(x, y) of two discrete random
variables X and Y .
x1 x2 x3 x4 x5
X
Compute:
1. The marginal distributions p(x) and p(y).
2. The conditional distributions p(x|Y = y1 ) and p(y|X = x3 ).
6.2 Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4)
0.4\, \mathcal{N}\!\left( \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) + 0.6\, \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix} \right) .
Choose a conjugate prior for the Bernoulli likelihood and compute the pos-
terior distribution p(µ | x1 , . . . , xN ).
6.4 There are two bags. The first bag contains 4 mango and 2 apples; the second
bag contains 4 mango and 4 apples.
We also have a biased coin, which shows “heads” with probability 0.6 and
“tails” with probability 0.4. If the coin shows “heads” we pick a fruit at
random from bag 1, otherwise we pick a fruit at random from bag 2.
Your friend flips the coin (you cannot see the result), picks a fruit at random
from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2?
Hint: Use Bayes’ theorem.
6.5 Consider the following time-series model:
x_{t+1} = A x_t + w ,   w ∼ \mathcal{N}(0, Q) ,
y_t = C x_t + v ,   v ∼ \mathcal{N}(0, R) ,
where w, v are i.i.d. Gaussian noise variables. Further, assume that p(x_0) = \mathcal{N}(µ_0, Σ_0).
C = (A^{-1} + B^{-1})^{-1}
c = C(A^{-1} a + B^{-1} b)
c = (2π)^{-D/2}\, |A + B|^{-1/2} \exp\!\left( -\tfrac{1}{2} (a - b)^\top (A + B)^{-1} (a - b) \right) .
Furthermore, we have
y = Ax + b + w ,
7 Continuous Optimization
[Figure residue: overview of concepts in this chapter (e.g., Lagrange multipliers, convexity) and their connections to other chapters such as Chapter 11 (Density Estimation), together with an example objective function plotted over the value of a parameter.]
Example 7.1
Consider a quadratic function in two dimensions
f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}    (7.7)

with gradient

\nabla f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top .    (7.8)
Starting at the initial location x0 = [−3, −1]> , we iteratively apply
(7.6) to obtain a sequence of estimates that converge to the minimum
[Figure 7.3 Gradient descent on a two-dimensional quadratic surface (shown as a heatmap). See Example 7.1 for a description.]
value (illustrated in Figure 7.3). We can see (both from the figure and
by plugging x0 into (7.8)) that the gradient at x0 points north and
east, leading to x1 = [−1.98, 1.21]> . Repeating that argument gives us
x2 = [−1.32, −0.42]> , and so on.
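A sketch of the iteration in Example 7.1. The step-size is not stated in the text shown here; the value 0.085 below is an assumption that is consistent with the iterates x1 and x2 quoted above.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 20.0]])
b = np.array([5.0, 3.0])

def grad(x):
    # Gradient of f(x) = 0.5 x^T A x - b^T x, cf. (7.8).
    return x @ A - b

x = np.array([-3.0, -1.0])    # initial location x0
step_size = 0.085             # assumed step-size, consistent with the quoted iterates
for _ in range(60):
    x = x - step_size * grad(x)

print(x)                      # gradient descent estimate of the minimizer
print(np.linalg.solve(A, b))  # exact minimizer A^{-1} b for comparison
```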
7.1.1 Stepsize
As mentioned earlier, choosing a good stepsize is important in gradient
descent. (The stepsize is also called the learning rate.) If the stepsize is too small, gradient descent can be slow. If the stepsize is chosen too large, gradient descent can overshoot, fail to converge, or even diverge. We will discuss the use of momentum in the next
section. It is a method that smoothes out erratic behavior of gradient up-
dates and dampens oscillations.
Adaptive gradient methods rescale the stepsize at each iteration, de-
pending on local properties of the function. There are two simple heuris-
tics (Toussaint, 2012):
When the function value increases after a gradient step, the step size
was too large. Undo the step and decrease the stepsize.
When the function value decreases the step could have been larger. Try
to increase the stepsize.
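A compact sketch of the two heuristics above (the scaling factors 0.5 and 1.1 are my own choices), applied to the quadratic from Example 7.1.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: x @ A - b

x = np.array([-3.0, -1.0])
step = 1.0                      # deliberately too large to start with
for _ in range(200):
    proposal = x - step * grad(x)
    if f(proposal) > f(x):
        step *= 0.5             # value increased: undo the step, decrease the step-size
    else:
        x = proposal
        step *= 1.1             # value decreased: accept and try a larger step

print(x, f(x))                  # x moves toward the minimizer; f decreases toward its minimum
```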
where xn ∈ RD are the training inputs, yn are the training targets and θ
are the parameters of the regression model.
Standard gradient descent, as introduced previously, is a “batch” opti-
mization method, i.e., optimization is performed using the full training set
Figure 7.4 Illustration of constrained optimization. The unconstrained problem (indicated by the contour lines) has a minimum on the right side (indicated by the circle). The box constraints (−1 ≤ x ≤ 1 and −1 ≤ y ≤ 1) require that the optimal solution lies within the box, resulting in an optimal value indicated by the star.
where f : RD → R.
In this section, we have additional constraints. That is, for real-valued functions g_i : R^D → R, i = 1, \ldots, m, we consider the constrained optimization problem

\min_{x} f(x)    (7.17)
subject to g_i(x) ≤ 0 for all i = 1, \ldots, m .
This gives infinite penalty if the constraint is not satisfied, and hence
would provide the same solution. However, this infinite step function is
equally difficult to optimize. We can overcome this difficulty by introduc-
ing Lagrange multipliers. The idea of Lagrange multipliers is to replace the step function with a linear function.
We associate to problem (7.17) the Lagrangian by introducing the Lagrange multipliers λ_i ≥ 0 corresponding to each inequality constraint respectively (Boyd and Vandenberghe, 2004, Chapter 4), so that

L(x, λ) = f(x) + \sum_{i=1}^{m} λ_i g_i(x)    (7.20a)
        = f(x) + λ^\top g(x) ,    (7.20b)

where in the last line we have concatenated all constraints g_i(x) into a vector g(x), and all the Lagrange multipliers into a vector λ ∈ R^m.
We now introduce the idea of Lagrangian duality. In general, duality
in optimization is the idea of converting an optimization problem in one
set of variables x (called the primal variables), into another optimization
problem in a different set of variables λ (called the dual variables). We
introduce two different approaches to duality: in this section, we discuss
Lagrangian duality; in Section 7.3.3 we discuss Legendre-Fenchel duality.
Note that taking the maximum over y of the left hand side of (7.24) main-
tains the inequality since the inequality is true for all y . Similarly we can
take the minimum over x of the right hand side of (7.24) to obtain (7.23).
The second concept is weak duality, which uses (7.23) to show that
primal values are always greater than or equal to dual values. This is de-
scribed in more detail in (7.27). ♦
Recall that the difference between J(x) in (7.18) and the Lagrangian
in (7.20b) is that we have relaxed the indicator function to a linear func-
tion. Therefore, when λ ≥ 0, the Lagrangian L(x, λ) is a lower bound of J(x). Hence, the maximum of L(x, λ) with respect to λ is

J(x) = \max_{λ ≥ 0} L(x, λ) .    (7.25)

By the minimax inequality (7.23) it follows that swapping the order of the minimum and maximum results in a smaller value, i.e.,

\min_{x \in \mathbb{R}^d} \max_{λ ≥ 0} L(x, λ) \;\geq\; \max_{λ ≥ 0} \min_{x \in \mathbb{R}^d} L(x, λ) .    (7.27)
This is also known as weak duality. Note that the inner part of the right-hand side is the dual objective function D(λ) and the definition follows.
In contrast to the original optimization problem, which has constraints,
minx∈Rd L(x, λ) is an unconstrained optimization problem for a given
value of λ. If solving minx∈Rd L(x, λ) is easy, then the overall problem
is easy to solve. The reason is that the outer problem (maximization over
λ) is a maximum over a set of affine functions, and hence is a concave
function, even though f (·) and gi (·) may be nonconvex. The maximum of
a concave function can be efficiently computed.
Assuming f (·) and gi (·) are differentiable, we find the Lagrange dual
problem by differentiating the Lagrangian with respect to x, setting the
differential to zero and solving for the optimal value. We will discuss two
concrete examples in Sections 7.3.1 and 7.3.2, where f (·) and gi (·) are
convex.
Remark (Equality constraints). Consider (7.17) with additional equality
constraints
min f (x)
x
Figure 7.5 Example of a convex function, y = 3x² − 5x + 2.
Definition 7.2. A set C is a convex set if for any x, y ∈ C and for any scalar θ with 0 ≤ θ ≤ 1, we have

θx + (1 − θ)y ∈ C .    (7.29)
(Figure 7.6 shows an example of a convex set.) Convex sets are sets such that a straight line connecting any two elements of the set lies inside the set. Figures 7.6 and 7.7 illustrate convex and nonconvex sets, respectively.
Convex functions are functions such that a straight line between any two points of the function lies above the function. Figure 7.2 shows a non-
convex function and Figure 7.3 shows a convex function. Another convex
function is shown in Figure 7.5.
Example 7.3
The negative entropy f (x) = x log2 x is convex for x > 0. A visualization
of the function is shown in Figure 7.8, and we can see that the function
is convex. To illustrate the above definitions of convexity, let us check the
calculations for two points x = 2 and x = 4. Note that to prove convexity
of f (x) we would need to check for all points x ∈ R.
Recall Definition 7.3. Consider a point midway between the two points
(that is θ = 0.5), then the left hand side is f (0.5 · 2 + 0.5 · 4) = 3 log2 3 ≈
4.75. The right hand side is 0.5(2 log2 2) + 0.5(4 log2 4) = 1 + 4 = 5. And
therefore the definition is satisfied.
Since f (x) is differentiable, we can alternatively use (7.31). Calculating
the derivative of f (x), we obtain
\nabla_x (x \log_2 x) = 1 \cdot \log_2 x + x \cdot \frac{1}{x \log_e 2} = \log_2 x + \frac{1}{\log_e 2} .    (7.32)
Using the same two test points x = 2 and x = 4, the left hand side of
(7.31) is given by f (4) = 8. The right hand side is
f(x) + \nabla_x f(x)^\top (y - x) = f(2) + \nabla f(2) \cdot (4 - 2)    (7.33a)
= 2 + \left(1 + \frac{1}{\log_e 2}\right) \cdot 2 \approx 6.9 .    (7.33b)
[Figure 7.8 Visualization of the function f(x) from Example 7.3, plotted for 0 ≤ x ≤ 5.]
Example 7.4
A nonnegative weighted sum of convex functions is convex. Observe that
if f is a convex function, and α > 0 is a nonnegative scalar, then the
function αf is convex. We can see this by multiplying both sides of the equation in Definition 7.3 by α, and recalling that multiplying by a nonnegative number does not change the inequality.
If f1 and f2 are convex functions, then we have by the definition
f1 (θx + (1 − θ)y) 6 θf1 (x) + (1 − θ)f1 (y) (7.34)
f2 (θx + (1 − θ)y) 6 θf2 (x) + (1 − θ)f2 (y) . (7.35)
Summing up both sides gives us
f1 (θx + (1 − θ)y) + f2 (θx + (1 − θ)y)
6 θf1 (x) + (1 − θ)f1 (y) + θf2 (x) + (1 − θ)f2 (y) , (7.36)
where the right hand side can be rearranged to
θ(f1 (x) + f2 (x)) + (1 − θ)(f1 (y) + f2 (y)) (7.37)
completing the proof that the sum of convex functions is convex.
Combining the two facts above, we see that αf1 (x) + βf2 (x) is convex
for α, β > 0. This closure property can be extended using a similar argu-
ment for nonnegative weighted sums of more than two convex functions.
Remark. The inequality in (7.30) is sometimes called Jensen’s inequality.
\min_{x} c^\top x
subject to Ax ≤ b ,
where A ∈ R^{m×d} and b ∈ R^m. This is known as a linear program. It has d variables and m linear constraints. (Linear programs are one of the most widely used approaches in industry.) The Lagrangian is given by

L(x, λ) = c^\top x + λ^\top (Ax − b) ,    (7.40)

where λ ∈ R^m is the vector of non-negative Lagrange multipliers. Rearranging the terms corresponding to x yields

L(x, λ) = (c + A^\top λ)^\top x − λ^\top b .    (7.41)

Taking the derivative of L(x, λ) with respect to x and setting it to zero gives us

c + A^\top λ = 0 .    (7.42)
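A hedged sketch (assuming scipy is available; the numbers are illustrative) of a small linear program and its Lagrange dual. The dual of min c^⊤x subject to Ax ≤ b is max_{λ≥0} −b^⊤λ subject to c + A^⊤λ = 0, which matches (7.40)–(7.42).

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative primal: min c^T x subject to A x <= b (box constraints on x).
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 1.0, 1.0])

primal = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))
print(primal.x, primal.fun)             # optimum x = [1, 1] with value -3

# Dual: maximize -b^T lambda subject to A^T lambda = -c, lambda >= 0.
# linprog minimizes, so we minimize b^T lambda instead.
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=(0, None))
print(dual.x, -dual.fun)                # dual value equals the primal value (strong duality)
```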
[Figure: illustration of the linear program. The unconstrained problem (indicated by the contour lines) has a minimum on the right side; the optimal value given the constraints is marked in the (x1, x2)-plane.]
Note that the convex conjugate definition above does not need the func-
tion f to be convex nor differentiable. In the definition above, we have
used a general inner product (Section 3.2) but in the rest of this section
we will consider the standard dot product between finite dimensional vec-
tors (hs, xi = s> x) to avoid too many technical details.
To understand the above definition in a geometric fashion, consider a nice simple one-dimensional convex and differentiable function, for example f(x) = x². (This derivation is easiest to understand by drawing the reasoning as it progresses.) Note that since we are looking at a one-dimensional problem, hyperplanes reduce to a line. Consider a line y = sx + c. Recall
planes, so let us try to describe this function f (x) by its supporting lines.
Fix the gradient of the line s ∈ R and for each point (x0 , f (x0 )) on the
graph of f , find the minimum value of c such that the line still inter-
sects (x0 , f (x0 )). Note that the minimum value of c is the place where a
line with slope s “just touches” the function f (x) = x2 . The line passing
through (x0 , f (x0 )) with gradient s is given by
y − f (x0 ) = s(x − x0 ) . (7.54)
The y -intercept of this line is −sx0 + f (x0 ). The minimum of c for which
y = sx + c intersects with the graph of f is therefore
\inf_{x_0} \; -s x_0 + f(x_0) .    (7.55)
Example 7.7
(Convex Conjugates) To illustrate the application of convex conjugates,
consider the quadratic function
f(y) = \frac{λ}{2} y^\top K^{-1} y    (7.59)
Example 7.8
In machine learning we often use sums of functions, for example the ob-
jective function of the training set includes a sum of the losses for each ex-
ample in the training set. In the following, we derive the convex conjugate
of a sum of losses ℓ(t), where ℓ : R → R. This also illustrates the application of the convex conjugate to the vector case. Let L(t) = \sum_{i=1}^{n} ℓ_i(t_i). Then,

L^*(z) = \sup_{t \in \mathbb{R}^n} \; \langle z, t \rangle - \sum_{i=1}^{n} ℓ_i(t_i)    (7.63a)
= \sup_{t \in \mathbb{R}^n} \; \sum_{i=1}^{n} z_i t_i - ℓ_i(t_i)    definition of dot product    (7.63b)
= \sum_{i=1}^{n} \sup_{t_i \in \mathbb{R}} \; z_i t_i - ℓ_i(t_i)    (7.63c)
= \sum_{i=1}^{n} ℓ_i^*(z_i) .    definition of conjugate    (7.63d)
Example 7.9
Let f (y) and g(x) be convex functions, and A a real matrix of appropriate
dimensions such that Ax = y . Then
\min_{x} f(Ax) + g(x) = \min_{Ax = y} f(y) + g(x) .    (7.64)
where the last step of swapping max and min is due to the fact that f (y)
and g(x) are convex functions. By splitting up the dot product term and
collecting x and y ,
\max_{u} \min_{x, y} \; f(y) + g(x) + (Ax - y)^\top u    (7.66a)
= \max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} (Ax)^\top u + g(x) \Big]    (7.66b)
= \max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} x^\top A^\top u + g(x) \Big]    (7.66c)
Recall the convex conjugate (Definition 7.4) and the fact that dot products are symmetric. (For general inner products, A^\top is replaced by the adjoint A^*.)

\max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} x^\top A^\top u + g(x) \Big]    (7.67a)
Exercises
7.1 Consider the univariate function
f (x) = x3 + 6x2 − 3x − 5.
Find its stationary points and indicate whether they are maximum, mini-
mum or saddle points.
7.2 Consider the update equation for stochastic gradient descent (Equation (7.15)).
Write down the update when we use a mini-batch size of one.
7.3 Consider whether the following statements are true or false:
1. The intersection of any two convex sets is convex.
2. The union of any two convex sets is convex.
3. The difference of a convex set A from another convex set B is convex.
7.4 Consider whether the following statements are true or false:
1. The sum of any two convex functions is convex.
2. The difference of any two convex functions is convex.
3. The product of any two convex functions is convex.
4. The maximum of any two convex functions is convex.
7.5 Express the following optimization problem as a standard linear program in
matrix notation
\max_{x \in \mathbb{R}^2,\; ξ \in \mathbb{R}} \; p^\top x + ξ
Derive the convex conjugate function f ∗ (s), by assuming the standard dot
product.
Hint: Take the gradient of an appropriate function and set the gradient to zero.
7.10 Consider the function
f(x) = \frac{1}{2} x^\top A x + b^\top x + c    (7.72)
where A is strictly positive definite, which means that it is invertible. Derive
the convex conjugate of f (x).
Hint: Take the gradient of an appropriate function and set the gradient to zero.
7.11 The hinge loss (which is the loss used by the Support Vector Machine) is
given by
L(α) = max{0, 1 − α}
8 When Models Meet Data
In the first part of the book, we introduced the mathematics that form
the foundations of many machine learning methods. The hope is that a
reader would be able to learn the rudimentary forms of the language of
mathematics from the first part, which we will now use to describe and
discuss machine learning. The second part of the book introduces four
pillars of machine learning:
Regression (Chapter 9)
Dimensionality reduction (Chapter 10)
Density estimation (Chapter 11)
Classification (Chapter 12)
The main aim of this part of the book is to illustrate how the mathematical
concepts introduced in the first part of the book can be used to design
machine learning algorithms that can be used to solve tasks within the
remit of the four pillars. We do not intend to introduce advanced machine
learning concepts, but instead to provide a set of practical methods that
allow the reader to apply the knowledge they gained from the first part
of the book. It also provides a gateway to the wider machine learning
literature for readers already familiar with the mathematics.
It is worth pausing at this point to consider the problem that a ma-
chine learning algorithm is designed to solve. As discussed in Chapter 1,
there are three major components of a machine learning system: data,
models and learning. The main question of machine learning is “what do
we mean by good models?”. The word model has many subtleties and we
will revisit it multiple times in this chapter. It is also not entirely obvious
how to objectively define the word “good”. One of the guiding principles
of machine learning is that good models should perform well on unseen
data. This requires us to define some performance metrics, such as accu-
racy or distance from ground truth, as well as figuring out ways to do well
under these performance metrics. This chapter covers a few necessary bits
and pieces of mathematical and statistical language that are commonly
used to talk about machine learning models. By doing so, we briefly out-
line the current best practices for training a model such that the resulting
predictor does well on data that we have not yet seen.
As mentioned in Chapter 1, there are two different senses in which we
Table 8.1 Example data from a fictitious human resource database that is not in a numerical format.

Name       Gender  Degree  Postcode   Age  Annual Salary
Aditya     M       MSc     W21BG      36   89563
Bob        M       PhD     EC1A1BA    47   123543
Chloé      F       BEcon   SW1A1BH    26   23989
Daisuke    M       BSc     SE207AT    68   138769
Elisabeth  F       MBA     SE10AA     33   113888
Data as Vectors
We assume that our data can be read by a computer, and represented ade-
quately in a numerical format. Data is assumed to be tabular (Figure 8.1),
where we think of each row of the table as representing a particular in-
stance or example, and each column to be a particular feature. (Data is assumed to be in a tidy format (Wickham, 2014; Codd, 1990).) In recent years machine learning has been applied to many types of data that do not obviously come in the tabular numerical format, for example genomic sequences, text and image contents of a webpage, and social media graphs.
We do not discuss the important and challenging aspects of identifying
good features. Many of these aspects depend on domain expertise and re-
quire careful engineering, which in recent years have been put under the
umbrella of data science (Stray, 2016; Adhikari and DeNero, 2018).
Even when we have data in tabular format, there are still choices to be
made to obtain a numerical representation. For example in Table 8.1, the
gender column (a categorical variable) may be converted into numbers 0
representing “Male” and 1 representing “Female”. Alternatively the gen-
der could be represented by numbers −1, +1, respectively (as shown in
Table 8.2). Furthermore it is often important to use domain knowledge
when constructing the representation, such as knowing that university
degrees progress from Bachelor’s to Master’s to PhD or realizing that the
postcode provided is not just a string of characters but actually encodes
an area in London. In Table 8.2, we converted the data from Table 8.1
to a numerical format, and each postcode is represented as two numbers,
a latitude and longitude. Even numerical data that could potentially be
directly read into a machine learning algorithm should be carefully con-
sidered for units, scaling, and constraints. Without additional information,
one should shift and scale all columns of the dataset such that they have
an empirical mean of 0 and an empirical variance of 1. For the purposes
of this book we assume that a domain expert already converted data ap-
propriately, i.e., each input xn is a D-dimensional vector of real numbers,
which are called features, attributes or covariates. We consider a dataset to be of the form as illustrated by Table 8.2. Observe that we have dropped the Name column of Table 8.1 in the new numerical representation. There
are two main reasons why this is desirable: 1. we do not expect the iden-
tifier (the Name) to be informative for a machine learning task, and 2.
we may wish to anonymize the data to help protect the privacy of the
employees.
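A hedged sketch of turning two rows of Table 8.1 into numerical feature vectors; the ordinal degree encoding and the postcode coordinates below are illustrative placeholders, not the book's Table 8.2 values.

```python
import numpy as np

gender_code = {"M": -1.0, "F": 1.0}
degree_code = {"BEcon": 1.0, "BSc": 1.0, "MSc": 2.0, "MBA": 2.0, "PhD": 3.0}  # assumed ordering
postcode_latlon = {"W21BG": (51.52, -0.17), "SW1A1BH": (51.50, -0.13)}        # made-up coordinates

rows = [("Aditya", "M", "MSc", "W21BG", 36, 89563),
        ("Chloé", "F", "BEcon", "SW1A1BH", 26, 23989)]

X = np.array([[gender_code[g], degree_code[d], *postcode_latlon[p], age]
              for _, g, d, p, age, _ in rows])
y = np.array([salary for *_, salary in rows], dtype=float)

# Shift and scale each feature column to empirical mean 0 and variance 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.shape, y)
```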
In this part of the book, we will use N to denote the number of examples
in a dataset and index the examples with lowercase n = 1, . . . , N . We
assume that we are given a set of numerical data, represented as an array
of vectors (Table 8.2). Each row is a particular individual xn often referred
to as an example or data point in machine learning. The subscript n refers to the fact that this is the nth example out of a total of N examples in the
dataset. Each column represents a particular feature of interest about the
example, and we index the features as d = 1, . . . , D. Recall that data is
represented as vectors, which means that each example (each data point)
is a D dimensional vector. The orientation of the table originates from
the database community, but for some machine learning algorithms (e.g.,
in Chapter 10) it is more convenient to represent examples as column
vectors.
Let us consider the problem of predicting annual salary from age, based
on the data in Table 8.2. This is called a supervised learning problem
where we have a label yn (the salary) associated with each example xn
(the age). The label yn has various other names including: target, response
variable and annotation. A dataset is written as a set of example-label pairs
{(x1 , y1 ), . . . , (xn , yn ), . . . , (xN , yN )}. The table of examples {x1 , . . . xN }
is often concatenated, and written as X ∈ RN ×D . Figure 8.1 illustrates
the dataset consisting of the rightmost two columns of Table 8.2 where
x=age and y =salary.
We use the concepts introduced in the first part of the book to formalize
the machine learning problems such as that in the previous paragraph.
Representing data as vectors xn allows us to use concepts from linear al-
gebra (introduced in Chapter 2). In many machine learning algorithms,
we need to additionally be able to compare two vectors. As we will see in
Chapters 9 and 12, computing the similarity or distance between two ex-
amples allows us to formalize the intuition that examples with similar fea-
tures should have similar labels. The comparison of two vectors requires
that we construct a geometry (explained in Chapter 3), and allows us to
optimize the resulting learning problem using techniques from Chapter 7.
Since we have vector representations of data, we can manipulate data to
find potentially better representations of it. We will discuss finding good
representations in two ways: finding lower-dimensional approximations
of the original feature vector, and using nonlinear higher-dimensional
combinations of the original feature vector. In Chapter 10 we will see an
example of finding a low-dimensional approximation of the original data
space by finding the principal components. Finding principal components
is closely related to concepts of eigenvalue and singular value decomposi-
tion as introduced in Chapter 4. For the high-dimensional representation,
feature map we will see an explicit feature map φ(·) that allows us to represent inputs
xn using a higher dimensional representation φ(xn ). The main motiva-
tion for higher dimensional representations is that we can construct new
features as non-linear combinations of the original features, which in turn
may make the learning problem easier. We will discuss the feature map
kernel in Section 9.2 and show how this feature map leads to a kernel in Sec-
tion 12.4. In recent years deep learning methods (Goodfellow et al., 2016)
have shown promise in using the data itself to learn new good features,
and has been very successful in areas such as computer vision, speech
recognition and natural language processing. We will not cover neural
networks in this part of the book, but the reader is referred to Section 5.6
for the mathematical description of backpropagation, a key concept for
training neural networks.
[Figure: example dataset of salary (y) plotted against age (x); cf. Figure 8.1 and Table 8.2.]
Models as Functions
Once we have data in an appropriate vector representation, we can get to
the business of constructing a predictive function (known as a predictor).
In Chapter 1 we did not yet have the language to be precise about models.
Using the concepts from the first part of the book, we can now introduce
what “model” means. We present two major approaches in this book: a
predictor as a function, and a predictor as a probabilistic model. We de-
scribe the former here and the latter in the next subsection.
A predictor is a function that, when given a particular input example
(in our case a vector of features), produces an output. For now consider
the output to be a single number, i.e., a real-valued scalar output. This can
be written as
f : RD → R , (8.1)
where the input vector x is D-dimensional (has D features), and the func-
tion f then applied to it (written as f (x)) returns a real number. Fig-
ure 8.2 illustrates a possible function that can be used to compute the
value of the prediction for input values x.
In this book, we do not consider the general case of all functions, which
would involve the need for functional analysis. Instead we consider the
special case of linear functions
f(x) = θ^\top x + θ_0
for unknown θ and θ_0. This restriction means that the contents of Chap-
ter 2 and 3 suffice for precisely stating the notion of a predictor for the
non-probabilistic (in contrast to the probabilistic view described next)
view of machine learning. Linear functions strike a good balance between
the generality of the problems that can be solved and the amount of back-
ground mathematics that is needed.
Section 8.1.1 What is the set of functions we allow the predictor to take?
Section 8.1.2 How do we measure how well the predictor performs on
the training data?
Section 8.1.3 How do we construct predictors from only training data
that performs well on unseen test data?
Section 8.1.4 What is the procedure for searching over the space of mod-
els?
In this section, we use the notation ŷn = f (xn , θ ∗ ) to represent the output
of the predictor.
Remark. For ease of presentation we will describe empirical risk minimiza-
tion in terms of supervised learning (where we have labels). This simpli-
fies the definition of the hypothesis class and the loss function. It is also
common in machine learning to choose a parametrized class of functions,
for example affine functions. ♦
Example 8.1
We introduce the problem of ordinary least squares regression to illustrate
empirical risk minimization. A more comprehensive account of regression
is given in Chapter 9. When the label yn is real valued, a popular choice
of function class for predictors is the set of affine functions. (Affine functions are often referred to as linear functions in machine learning.) We choose a more compact notation for an affine function by concatenating an additional unit feature x^{(0)} = 1 to x_n, i.e., x_n = [1, x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(D)}]^\top. The parameter vector is correspondingly θ = [θ_0, θ_1, θ_2, \ldots, θ_D]^\top, allowing us to write the predictor as a linear function

f(x_n, θ) = θ^\top x_n .    (8.4)

This linear predictor is equivalent to the affine model

f(x_n, θ) = θ_0 + \sum_{d=1}^{D} θ_d x_n^{(d)} .    (8.5)
where ŷ_n = f(x_n, θ). Equation (8.6) is called the empirical risk and depends on three arguments, the predictor f and the data X, y. This general strategy for learning is called empirical risk minimization.
where we substituted the predictor ŷn = f (xn , θ). By using our choice of
a linear predictor f (xn , θ) = θ > xn we obtain the optimization problem
\min_{θ \in \mathbb{R}^D} \frac{1}{N} \sum_{n=1}^{N} (y_n - θ^\top x_n)^2 .    (8.8)
This equation can be equivalently expressed in matrix form
\min_{θ \in \mathbb{R}^D} \frac{1}{N} \| y - Xθ \|^2 .    (8.9)
This is known as the least-squares problem. There exists a closed-form analytic solution for this by solving the normal equations, which we will discuss in Section 9.2.
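A minimal sketch (synthetic data) of solving the least-squares problem (8.9) in closed form via the normal equations X^⊤Xθ = X^⊤y.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, D))])   # prepend the unit feature
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(N)

# Normal equations: solve (X^T X) theta = X^T y without forming an explicit inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)                                    # close to theta_true

empirical_risk = np.mean((y - X @ theta_hat) ** 2)  # cf. (8.8) and (8.9)
print(empirical_risk)
```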
The regularization term is sometimes called the penalty term, which biases
where R(f (k) , V (k) ) is the risk (e.g., RMSE) on the validation set V (k) for
predictor f (k) . The approximation has two sources: first due to the finite
training set which results in not the best possible f (k) and second due to
the finite validation set which results in an inaccurate estimation of the
risk R(f (k) , V (k) ). A potential disadvantage of K -fold cross validation is
the computational cost of training the model K times, which can be bur-
densome if the training cost is computationally expensive. In practice, it
is often not sufficient to look at the direct parameters alone. For example,
we need to explore multiple complexity parameters (e.g., multiple regu-
larization parameters), which may not be direct parameters of the model.
Evaluating the quality of the model, depending on these hyperparameters
may result in a number of training runs that is exponential in the number
of model parameters. One can use nested cross validation (Section 8.5.1)
to search for good hyperparameters.
However, cross validation is an embarrassingly parallel problem, i.e., little effort is needed to separate the problem into a number of parallel
tasks. Given sufficient computing resources (e.g., cloud computing, server
farms), cross validation does not require longer than a single performance
assessment.
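A sketch of K-fold cross validation (synthetic data, squared-error risk, folds built with numpy only) for the linear least-squares predictor from Example 8.1.

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, K = 120, 3, 5
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, D))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.3 * rng.standard_normal(N)

folds = np.array_split(rng.permutation(N), K)

risks = []
for k in range(K):
    val = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    theta = np.linalg.solve(X[train].T @ X[train], X[train].T @ y[train])
    risks.append(np.mean((y[val] - X[val] @ theta) ** 2))   # risk on the k-th validation fold

print(np.mean(risks))   # cross-validated estimate of the expected risk
```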
In this section we saw that empirical risk minimization is based on the
following concepts: the hypothesis class of functions, the loss function and
regularization. In Section 8.2 we will see the effect of using a probability
distribution to replace the idea of loss functions and regularization.
Further Reading
Due to the fact that the original development of empirical risk minimiza-
tion (Vapnik, 1998) was couched in heavily theoretical language, many
of the subsequent developments have been theoretical. The area of study
is called statistical learning theory (Hastie et al., 2001; von Luxburg and Schölkopf, 2011; Vapnik, 1999; Evgeniou et al., 2000). A recent machine
learning textbook that builds on the theoretical foundations and develops
efficient learning algorithms is Shalev-Shwartz and Ben-David (2014).
The concept of regularization has its roots in the solution of ill-posed in-
verse problems (Neumaier, 1998). The approach presented here is called
Tikhonov regularization, and there is a closely related constrained version called Ivanov regularization. Tikhonov regularization has deep rela-
tionships to the bias-variance tradeoff and feature selection (Bühlmann
and Geer, 2011). An alternative to cross validation is bootstrap and jack-
knife (Efron and Tibshirani, 1993; Davidson and Hinkley, 1997; Hall,
1992).
Example 8.4
The first example that is often used is to specify that the conditional probability of the labels given the examples is a Gaussian distribution. In other words, we assume that we can explain our observation uncertainty by independent Gaussian noise (refer to Section 6.5) with zero mean, ε_n ∼ N(0, σ²). We further assume that the linear model x_n^⊤θ is used for prediction. This means we specify a Gaussian likelihood for each example-label pair (x_n, y_n),
p(y_n | x_n, θ) = N(y_n | x_n^⊤θ, σ²) .   (8.15)
An illustration of a Gaussian likelihood for a given parameter θ is shown in Figure 8.3. We will see in Section 9.2 how to explicitly expand the expression above out in terms of the Gaussian distribution.
We assume that the set of examples (x_1, y_1), . . . , (x_N, y_N) is independent and identically distributed (i.i.d.). The word independent (Section 6.4.5) implies that the likelihood of the whole dataset (Y = {y_1, . . . , y_N} and X = {x_1, . . . , x_N}) factorizes into a product of the likelihoods of each individual example
p(Y | X, θ) = ∏_{n=1}^{N} p(y_n | x_n, θ) .   (8.16)
While it is tempting to interpret the fact that θ is on the right of the conditioning in p(y_n | x_n, θ) (8.15) as meaning that θ is observed and fixed, this interpretation is incorrect. The negative log-likelihood L(θ) := − log p(Y | X, θ) = − Σ_{n=1}^{N} log p(y_n | x_n, θ) is the quantity we minimize to find good parameters θ.
Example 8.5
Continuing on our example of Gaussian likelihoods (8.15), the negative
log-likelihood can be rewritten as
L(θ) = − Σ_{n=1}^{N} log p(y_n | x_n, θ) = − Σ_{n=1}^{N} log N(y_n | x_n^⊤θ, σ²)   (8.18a)
= − Σ_{n=1}^{N} log [ (1/√(2πσ²)) exp( −(y_n − x_n^⊤θ)² / (2σ²) ) ]   (8.18b)
= − Σ_{n=1}^{N} log exp( −(y_n − x_n^⊤θ)² / (2σ²) ) − Σ_{n=1}^{N} log (1/√(2πσ²))   (8.18c)
= (1/(2σ²)) Σ_{n=1}^{N} (y_n − x_n^⊤θ)² − Σ_{n=1}^{N} log (1/√(2πσ²)) .   (8.18d)
As σ is given, the second term in (8.18d) is constant, and minimizing L(θ)
corresponds to solving the least squares problem (compare with (8.8))
expressed in the first term.
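As a numerical sanity check of this equivalence, the sketch below (synthetic data, σ assumed known) minimizes the negative log-likelihood (8.18d) with a generic optimizer and compares the result with the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

# A sketch, assuming synthetic data and a known noise level sigma: minimizing
# the Gaussian negative log-likelihood yields the least-squares solution (8.8).
rng = np.random.default_rng(0)
N, D, sigma = 80, 3, 0.5
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = X @ theta_true + sigma * rng.normal(size=N)

def neg_log_likelihood(theta):
    resid = y - X @ theta
    # (8.18d): squared-error term plus a constant that does not depend on theta
    return (resid @ resid) / (2 * sigma**2) + N * 0.5 * np.log(2 * np.pi * sigma**2)

theta_nll = minimize(neg_log_likelihood, x0=np.zeros(D)).x
theta_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_nll, theta_lsq, atol=1e-3))  # same minimizer (up to numerics)
```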
Figure 8.6 Comparing the predictions with the maximum likelihood estimate (MLE) and the MAP estimate at x = 60. The prior biases the slope to be less steep and the intercept to be closer to zero. In this example, the bias that moves the intercept closer to zero actually increases the slope.
The proportion relation above hides the density of the data p(x), which may be difficult to estimate. Instead of estimating the minimum of the negative log-likelihood, we now estimate the minimum of the negative log-posterior, which is referred to as maximum a posteriori estimation (MAP estimation). An illustration of the effect of adding a zero-mean Gaussian prior is shown in Figure 8.6.
Example 8.6
In addition to the assumption of Gaussian likelihood in the previous example, we assume that the parameter vector is distributed as a multivariate Gaussian with zero mean, i.e., p(θ) = N(0, Σ), where Σ is the covariance matrix (Section 6.5). Note that the conjugate prior of a Gaussian is also a Gaussian (Section 6.6.1), and therefore we expect the posterior distribution to also be a Gaussian. We will see the details of maximum a posteriori estimation in Chapter 9.
[Figure: Different model classes fitted to a regression dataset.]
Further Reading
When considering probabilistic models the principle of maximum likeli-
hood estimation generalizes the idea of least-squares regression for linear
models, which we will discuss in detail in Chapter 9. When restricting the predictor to have linear form, with an additional nonlinear function ϕ applied to the output,
we can consider other models for other prediction tasks, such as binary
classification or modeling count data (McCullagh and Nelder, 1989). An
alternative view of this is to consider likelihoods that are from the ex-
ponential family (Section 6.6). The class of models, which have linear
dependence between parameters and data, and have potentially nonlin-
ear transformation ϕ (called a link function) is referred to as generalized
linear models (Agresti, 2002, Chapter 4).
Maximum likelihood estimation has a rich history, and was originally
proposed by Sir Ronald Fisher in the 1930s. We will expand upon the idea
of a probabilistic model in Section 8.3. One debate among researchers
who use probabilistic models is the discussion between Bayesian and frequentist statistics. As mentioned in Section 6.1.1, it boils down to the
definition of probability. Recall from Section 6.1 that one can consider
probability to be a generalization (by allowing uncertainty) of logical rea-
soning (Cheeseman, 1985; Jaynes, 2003). The method of maximum like-
lihood estimation is frequentist in nature, and the interested reader is
pointed to Efron and Hastie (2016) for a balanced view of both Bayesian
and frequentist statistics.
There are some probabilistic models where maximum likelihood esti-
mation may not be possible. The reader is referred to more advanced sta-
tistical textbooks, e.g., Casella and Berger (2002), for approaches, such as
method of moments, M -estimation and estimating equations.
8.3.1 Probabilistic Models
Probabilistic models represent the uncertain aspects of an experiment as probability distributions; a probabilistic model is specified by the joint distribution of all its random variables. The benefit of using probabilistic models is that
they offer a unified and consistent set of tools from probability theory
(Chapter 6) for modeling, inference, prediction and model selection.
In probabilistic modeling, the joint distribution p(x, θ) of the observed variables x and the hidden parameters θ is of central importance: It encapsulates information from both the prior and the likelihood.
Such predictions no longer depend on the model parameters θ, which have been
marginalized/integrated out. Equation (8.23) reveals that the prediction
is an average over all plausible parameter values θ , where the plausibility
is encapsulated by the parameter distribution p(θ).
Having discussed parameter estimation in Section 8.2 and Bayesian in-
ference here, let us compare these two approaches to learning. Parameter
estimation via maximum likelihood or MAP estimation yields a consistent
point estimate θ ∗ of the parameters, and the key computational problem
to be solved is optimization. In contrast, Bayesian inference yields a (pos-
terior) distribution, and the key computational problem to be solved is
integration. Predictions with point estimates are straightforward, whereas
predictions in the Bayesian framework require solving another integration
problem, see (8.23). However, Bayesian inference gives us a principled
way to incorporate prior knowledge, account for side information and incorporate structural knowledge, none of which is easily done in the
context of parameter estimation. Moreover, the propagation of parameter
uncertainty to the prediction can be valuable in decision-making systems
for risk assessment and exploration in the context of data-efficient learn-
ing (Kamthe and Deisenroth, 2018; Deisenroth et al., 2015).
While Bayesian inference is a mathematically principled framework for
learning about parameters and making predictions, there are some prac-
tical challenges that come with it because of the integration problems we
need to solve, see (8.22) and (8.23). More specifically, if we do not choose
a conjugate prior on the parameters (Section 6.6.1), the integrals in (8.22)
and (8.23) are not analytically tractable, and we cannot compute the pos-
terior, the predictions or the marginal likelihood in closed form. In these
cases, we need to resort to approximations. Here, we can use stochas-
tic approximations, such as Markov chain Monte Carlo (MCMC) (Gilks
et al., 1996), or deterministic approximations, such as the Laplace ap-
proximation (Bishop, 2006; Murphy, 2012; Barber, 2012), variational in-
ference (Jordan et al., 1999; Blei et al., 2017) or expectation propaga-
tion (Minka, 2001a).
Despite these challenges, Bayesian inference has been successfully ap-
plied to a variety of problems, including large-scale topic modeling (Hoff-
man et al., 2013), click-through-rate prediction (Graepel et al., 2010),
data-efficient reinforcement learning in control systems (Deisenroth et al.,
2015), online ranking systems (Herbrich et al., 2007), and large-scale rec-
ommender systems. There are generic tools, such as Bayesian optimization (Brochu et al., 2009; Snoek et al., 2012; Shahriari et al., 2016), that use Bayesian inference to optimize expensive black-box functions, e.g., for hyperparameter search.
First, we obtain a likelihood of the model that does not depend on the latent variables. Second, we use this likelihood for parameter estimation or Bayesian inference, where we use exactly the same expressions as in Sections 8.2 and 8.3.2, respectively.
Since the likelihood function p(x | θ) is the predictive distribution of the
data given the model parameters, we need to marginalize out the latent
variables so that
p(x | θ) = ∫ p(x | θ, z) p(z) dz ,   (8.25)
where p(x | z, θ) is given in (8.24) and p(z) is the prior on the latent variables. Note that the likelihood must not depend on the latent variables z; it is only a function of the data x and the model parameters θ.
The likelihood in (8.25) directly allows for parameter estimation via maximum likelihood. MAP estimation is also straightforward with an additional prior on the model parameters θ, as discussed in Section 8.2.2.
Moreover, with the likelihood (8.25), Bayesian inference (Section 8.3.2) in a latent-variable model works in the usual way: We place a prior p(θ) on the model parameters and use Bayes' theorem to obtain a posterior distribution
p(θ | X) = p(X | θ) p(θ) / p(X)   (8.26)
over the model parameters given a dataset X . The posterior in (8.26) can
be used for predictions within a Bayesian inference framework, see (8.23).
One challenge we have in this latent-variable model is that the like-
lihood p(X | θ) requires the marginalization of the latent variables ac-
cording to (8.25). Except when we choose a conjugate prior p(z) for
p(x | z, θ), the marginalization in (8.25) is not analytically tractable, and
we need to resort to approximations (Paquet, 2008; Bishop, 2006; Mur-
phy, 2012; Moustaki et al., 2015).
Similar to the parameter posterior (8.26), we can compute a posterior on the latent variables according to
p(z | X) = p(X | z) p(z) / p(X) ,   p(X | z) = ∫ p(X | z, θ) p(θ) dθ ,   (8.27)
where p(z) is the prior on the latent variables and p(X | z) requires us to
integrate out the model parameters θ .
Given the difficulty of solving integrals analytically, it is clear that marginal-
izing out both the latent variables and the model parameters at the same
time is not possible in general (Murphy, 2012; Bishop, 2006). A quantity
that is easier to compute is the posterior distribution on the latent vari-
ables, but conditioned on the model parameters, i.e.,
p(z | X, θ) = p(X | z, θ) p(z) / p(X | θ) ,   (8.28)
where p(z) is the prior on the latent variables and p(X | z, θ) is given
in (8.24).
In Chapters 10 and 11, we derive the likelihood functions for PCA and
Gaussian mixture models, respectively. Moreover, we compute the poste-
rior distributions (8.28) on the latent variables for both PCA and Gaussian
mixture models.
Remark. In the following chapters, we may not be drawing such a clear
distinction between latent variables z and uncertain model parameters θ
and call the model parameters “latent” or “hidden” as well because they
are unobserved. In Chapters 10 and 11, where we use the latent variables
z , we will pay attention to the difference as we will have two different
types of hidden variables: model parameters θ and latent variables z . ♦
We can exploit the fact that all the elements of a probabilistic model are
random variables to define a unified language for representing them. In
Section 8.4, we will see a concise graphical language for representing the
structure of probabilistic models. We will use this graphical language to
describe the probabilistic models in the subsequent chapters.
Further Reading
Probabilistic models in machine learning (Bishop, 2006; Barber, 2012;
Murphy, 2012) provide a way for users to capture uncertainty about data
and predictive models in a principled fashion. Ghahramani (2015) presents
a short review of probabilistic models in machine learning. Given a proba-
bilistic model, we may be lucky enough to be able to compute parameters
of interest analytically. However, in general, analytic solutions are rare,
and computational methods such as sampling (Gilks et al., 1996; Brooks
et al., 2011) and variational inference (Jordan et al., 1999; Blei et al.,
2017) are used. Moustaki et al. (2015) and Paquet (2008) provide a good
overview of Bayesian inference in latent-variable models.
In recent years, several programming languages have been proposed
that aim to treat the variables defined in software as random variables
corresponding to probability distributions. The objective is to be able to
write complex functions of probability distributions, while under the hood
the compiler automatically takes care of the rules of Bayesian inference.
This rapidly changing field is called probabilistic programming.
Directed graphical models (also known as Bayesian networks) are a method for representing conditional dependencies in a probabilistic model. They provide a visual description of the conditional probabilities and hence a simple language for describing complex interdependence. The modular description also entails computational simplification. Directed links (arrows) between two nodes (random variables) indicate conditional probabilities; with additional assumptions, the arrows can be used to indicate causal relationships (Pearl, 2009). For example, the arrow between a and b in Figure 8.9(a) gives the conditional probability p(b | a) of b given a.
Directed graphical models can be derived from joint distributions if we
know something about their factorization.
Example 8.7
Consider the joint distribution
p(a, b, c) = p(c | a, b)p(b | a)p(a) (8.29)
of three random variables a, b, c. The factorization of the joint distribution
in (8.29) tells us something about the relationship between the random
variables:
c depends directly on a and b
b depends directly on a
a depends neither on b nor on c
For the factorization in (8.29), we obtain the directed graphical model in
Figure 8.9(a).
Example 8.8
Looking at the graphical model in Figure 8.9(b) we exploit two properties:
The joint distribution p(x1 , . . . , x5 ) we seek is the product of a set of
conditionals, one for each node in the graph. In this particular example,
we will need five conditionals.
Each conditional depends only on the parents of the corresponding
node in the graph. For example, x4 will be conditioned on x2 .
These two properties yield the desired factorization of the joint distribu-
tion
p(x1 , x2 , x3 , x4 , x5 ) = p(x1 )p(x5 )p(x2 | x5 )p(x3 | x1 , x2 )p(x4 | x2 ) . (8.30)
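To make the factorization concrete, here is a small sketch (not from the book) with five binary random variables and hypothetical conditional probability tables; multiplying the conditionals according to (8.30) yields a valid joint distribution.

```python
import numpy as np

# A minimal sketch, assuming binary variables x1,...,x5 with hypothetical
# conditional probability tables, following the factorization (8.30).
p_x1 = np.array([0.6, 0.4])                      # p(x1)
p_x5 = np.array([0.7, 0.3])                      # p(x5)
p_x2_given_x5 = np.array([[0.9, 0.1],            # p(x2 | x5); rows indexed by x5
                          [0.2, 0.8]])
p_x4_given_x2 = np.array([[0.5, 0.5],            # p(x4 | x2); rows indexed by x2
                          [0.1, 0.9]])
p_x3_given_x1x2 = np.array([[[0.8, 0.2], [0.3, 0.7]],   # p(x3 | x1, x2), indexed [x1, x2, x3]
                            [[0.6, 0.4], [0.05, 0.95]]])

# Build the joint distribution p(x1,...,x5) as the product of the conditionals.
joint = np.zeros((2, 2, 2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            for x4 in range(2):
                for x5 in range(2):
                    joint[x1, x2, x3, x4, x5] = (
                        p_x1[x1] * p_x5[x5] * p_x2_given_x5[x5, x2]
                        * p_x3_given_x1x2[x1, x2, x3] * p_x4_given_x2[x2, x4]
                    )

print(joint.sum())                  # sums to 1: a valid joint distribution
print(joint.sum(axis=(2, 3, 4)))    # marginal distribution over (x1, x2)
```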
Figure 8.10 Graphical models for a repeated Bernoulli experiment.

where Pa_k means "the parent nodes of x_k". Parent nodes of x_k are nodes that have arrows pointing to x_k.
We conclude this subsection with a concrete example of the coin flip
experiment. Consider a Bernoulli experiment (Example 6.8) where the
probability that the outcome x of this experiment is “heads” is
p(x | µ) = Ber(µ) . (8.32)
We now repeat this experiment N times and observe outcomes x1 , . . . , xN
so that we obtain the joint distribution
p(x_1, . . . , x_N | µ) = ∏_{n=1}^{N} p(x_n | µ) .   (8.33)
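A short sketch of this setting: we simulate N coin flips with a hypothetical heads probability, evaluate the i.i.d. log-likelihood corresponding to (8.33), and note that the maximum likelihood estimate of µ is the empirical frequency of heads.

```python
import numpy as np

# A sketch, assuming a hypothetical true heads probability mu_true.
rng = np.random.default_rng(0)
mu_true = 0.7
N = 100
x = rng.binomial(1, mu_true, size=N)     # outcomes x_1, ..., x_N (1 = heads)

def log_likelihood(mu, x):
    # log p(x_1,...,x_N | mu) = sum_n log Ber(x_n | mu), cf. (8.33)
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

mu_ml = x.mean()                         # maximum likelihood estimate of mu
print(mu_ml, log_likelihood(mu_ml, x), log_likelihood(0.5, x))
```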
Figure 8.11 D-separation example.

A ⊥⊥ B | C ,   (8.34)
if every path from a node in A to a node in B is blocked. A path is blocked if it includes a node such that either
the arrows on the path meet either head to tail or tail to tail at the node, and the node is in the set C, or
the arrows meet head to head at the node, and neither the node nor any of its descendants is in the set C.
Further Reading
An introduction to probabilistic graphical models can be found in Bishop
(2006, Chapter 8), and an extensive description of the different applica-
tions and corresponding algorithmic implications can be found in Koller
and Friedman (2009).
There are three main types of probabilistic graphical models:
Directed graphical models (Bayesian networks), see Figure 8.13(a)
Undirected graphical models (Markov random fields), see Figure 8.13(b)
Factor graphs, see Figure 8.13(c)
Graphical models allow for graph-based algorithms for inference and learning, e.g., via local message passing. Applications range from ranking in online games (Herbrich et al., 2007) and computer vision (e.g.,
image segmentation, semantic labeling, image de-noising, image restora-
tion (Sucar and Gillies, 1994; Shotton et al., 2006; Szeliski et al., 2008;
Kittler and Föglein, 1984)) to coding theory (McEliece et al., 1998), solv-
ing linear equation systems (Shental et al., 2008) and iterative Bayesian
state estimation in signal processing (Bickson et al., 2007; Deisenroth and
Mohamed, 2012).
One topic that is particularly important in real applications, and that we do not discuss in this book, is the idea of structured prediction (Bakir et al., 2007; Nowozin et al., 2014), which allows machine learning models to tackle predictions that are structured, for example sequences, trees and graphs. The popularity of neural network models has allowed more flexible probabilistic models to be used, resulting in many useful applications of structured models (Goodfellow et al., 2016, Chapter 16). In recent years, there has been a renewed interest in graphical models due to their applications to causal inference (Rosenbaum, 2017; Pearl, 2009; Imbens and Rubin, 2015; Peters et al., 2017).
Figure 8.14 Bayesian inference embodies Occam's razor. The horizontal axis describes the space of all possible datasets D. The evidence (vertical axis) evaluates how well a model predicts available data. Since p(D | Mi) needs to integrate to 1, we should choose the model with the greatest evidence. Adapted from MacKay (2003).
the posterior probability p(Mi | D) of model Mi given the data D, we can
employ Bayes’ theorem. Assuming a uniform prior p(M ) over all mod-
els, Bayes’ theorem rewards models in proportion to how much they pre-
dicted the data that occurred. This prediction of the data given model
Mi, p(D | Mi), is called the evidence for Mi. A simple model M1 can only
predict a small number of datasets, which is shown by p(D | M1 ); a more
powerful model M2 that has, e.g., more free parameters than M1 , is able
to predict a greater variety of datasets. This means, however, that M2
does not predict the datasets in region C as well as M1 . Suppose that
equal prior probabilities have been assigned to the two models. Then, if
the dataset falls into region C , the less powerful model M1 is the more
probable model.
Above, we argued that models need to be able to explain the data, i.e.,
there should be a way to generate data from a given model. Furthermore, if the model has been appropriately learned from the data, then we expect
that the generated data should be similar to the empirical data. For this,
it is helpful to phrase model selection as a hierarchical inference problem,
which allows us to compute the posterior distribution over models.
Let us consider a finite number of models M = {M1 , . . . , MK }, where
each model Mk possesses parameters θ_k. In Bayesian model selection, we place a prior p(M) on the set of models. The corresponding generative process that allows us to generate data from this model is
Mk ∼ p(M ) (8.40)
θ k ∼ p(θ | Mk ) (8.41)
D ∼ p(D | θ k ) (8.42)
Figure 8.15 Illustration of the hierarchical generative process in Bayesian model selection. We place a prior p(M) on the set of models. For each model, there is a distribution p(θ | M) on the corresponding model parameters, which is used to generate the data D.

With a uniform prior p(M_k) = 1/K, which gives every model equal (prior) probability, determining the MAP estimate over models amounts to picking the model that maximizes the model evidence (8.44).
Remark (Likelihood and Marginal Likelihood). There are some important differences between a likelihood and a marginal likelihood (evidence): While the likelihood is prone to overfitting, the marginal likelihood is typically not, as the model parameters have been marginalized out (i.e., we no longer have to fit the parameters). Furthermore, the marginal likelihood automatically embodies a trade-off between model complexity and data fit (Occam's razor). ♦
8.5.3 Bayes Factors for Model Comparison
Consider the problem of comparing two probabilistic models M1, M2, given a dataset D. If we compute the posteriors p(M1 | D) and p(M2 | D), we can compute the ratio of the posteriors
p(M1 | D) / p(M2 | D) = [p(D | M1) p(M1) / p(D)] / [p(D | M2) p(M2) / p(D)] = [p(M1) / p(M2)] [p(D | M1) / p(D | M2)] ,   (8.46)
where the three ratios are the posterior odds, the prior odds and the Bayes factor, respectively.
The ratio of the posteriors is also called the posterior odds. The first fraction on the right-hand side of (8.46), the prior odds, measures how much our prior (initial) beliefs favor M1 over M2. The ratio of the marginal likelihoods (second fraction on the right-hand side) is called the Bayes factor and measures how well the data D is predicted by M1 compared to M2.
Remark. The Jeffreys-Lindley paradox states that the "Bayes factor always favors the simpler model since the probability of the data under a complex model with a diffuse prior will be very small" (Murphy, 2012). Here, a diffuse prior refers to a prior that does not favor specific models, i.e., many models are a priori plausible under this prior. ♦
If we choose a uniform prior over models, the prior odds term in (8.46) is 1, i.e., the posterior odds is the ratio of the marginal likelihoods (Bayes factor)
p(D | M1) / p(D | M2) .   (8.47)
If the Bayes factor is greater than 1, we choose model M1; otherwise we choose model M2. In a similar way to frequentist statistics, there are guidelines on the size of the ratio that one should require before declaring the result "significant" (Jeffreys, 1961).
Remark (Computing the Marginal Likelihood). The marginal likelihood
plays an important role in model selection: We need to compute Bayes
factors (8.46) and posterior distributions over models (8.43).
Unfortunately, computing the marginal likelihood requires us to solve
an integral (8.44). This integration is generally analytically intractable,
and we will have to resort to approximation techniques, e.g., numerical
integration (Stoer and Bulirsch, 2002), stochastic approximations using
Monte Carlo (Murphy, 2012) or Bayesian Monte Carlo techniques (O’Hagan,
1991; Rasmussen and Ghahramani, 2003).
However, there are special cases in which we can solve it. In Section 6.6.1,
we discussed conjugate models. If we choose a conjugate parameter prior
p(θ), we can compute the marginal likelihood in closed form. In Chap-
ter 9, we will do exactly this in the context of linear regression. ♦
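As an illustration of such a conjugate case, the sketch below uses a Beta-Bernoulli coin-flip model (an assumption for this sketch, not an example from the text): its marginal likelihood is available in closed form, so the Bayes factor (8.47) can be computed directly.

```python
import numpy as np
from scipy.special import betaln

# A sketch, assuming a Beta-Bernoulli model: for a Beta(a, b) prior on the heads
# probability and data with h heads and t tails, the marginal likelihood is
#   p(D | M) = B(a + h, b + t) / B(a, b),
# where B is the Beta function.
def log_marginal_likelihood(h, t, a, b):
    return betaln(a + h, b + t) - betaln(a, b)

h, t = 62, 38  # hypothetical dataset: 62 heads, 38 tails

# Model M1: uniform Beta(1, 1) prior; Model M2: prior concentrated around 0.5.
log_evidence_m1 = log_marginal_likelihood(h, t, a=1.0, b=1.0)
log_evidence_m2 = log_marginal_likelihood(h, t, a=50.0, b=50.0)

log_bayes_factor = log_evidence_m1 - log_evidence_m2   # log p(D|M1) - log p(D|M2)
print(np.exp(log_bayes_factor))  # > 1 favors M1, < 1 favors M2, cf. (8.47)
```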
We have seen a brief introduction to the basic concepts of machine
learning in this chapter. For the rest of this part of the book we will see
how the three different flavours of learning in Section 8.1, Section 8.2 and
Section 8.3 are applied to the four pillars of machine learning (regression,
dimensionality reduction, density estimation and classification).
Further Reading
We mentioned at the start of the section that there are high level modeling
choices that influence the performance of the model. Examples include:
The degree of a polynomial in a regression setting
The number of components in a mixture model
The network architecture of a (deep) neural network
The type of kernel in a support vector machine
The dimensionality of the latent space in PCA
The learning rate (schedule) in an optimization algorithm
Rasmussen and Ghahramani (2001) showed that the automatic Occam's razor does not necessarily penalize the number of parameters in a model but is active in terms of the complexity of functions (in parametric models, the number of parameters is often related to the complexity of the model class). They also showed that the automatic Occam's razor also holds for Bayesian non-parametric models with many parameters, e.g., Gaussian processes.
Linear Regression
Figure 9.1 (a) Regression problem: observed noisy function values, from which we wish to infer the underlying function that generated the data. (b) Regression solution: a possible function that could have generated the data (blue), with an indication of the measurement noise of the function value at the corresponding inputs (orange distributions).
p(y | x) = N(y | f(x), σ²) ,   (9.1)
y = f(x) + ε ,   (9.2)
where ε ∼ N(0, σ²) is independent, identically distributed (i.i.d.) Gaussian measurement noise with mean 0 and variance σ². Our objective is
to find a function that is close (similar) to the unknown function f that
generated the data and which generalizes well.
In this chapter, we focus on parametric models, i.e., we choose a para-
metrized function and find parameters θ that “work well” for modeling the
data. For the time being, we assume that the noise variance σ 2 is known
and focus on learning the model parameters θ . In linear regression, we
consider the special case that the parameters θ appear linearly in our
model. An example of linear regression is given by
p(y | x, θ) = N(y | x^⊤θ, σ²)   (9.3)
⟺ y = x^⊤θ + ε , ε ∼ N(0, σ²) ,   (9.4)
Figure 9.2 Linear regression example. (a) Example functions (straight lines) that can be described using the linear model in (9.4); (b) Training set; (c) Maximum likelihood estimate.
p(y_* | x_*, θ_*) = N(y_* | x_*^⊤ θ_*, σ²) .   (9.6)
where we exploited that the likelihood (9.5b) factorizes over the number
of data points due to our independence assumption on the training set.
In the linear regression model (9.4) the likelihood is Gaussian (due to the Gaussian additive noise term), such that we arrive at
log p(y_n | x_n, θ) = − (1/(2σ²)) (y_n − x_n^⊤θ)² + const ,   (9.9)
where the constant includes all terms independent of θ. Using (9.9) in the negative log-likelihood, we obtain the loss L(θ) = (1/(2σ²)) (y − Xθ)^⊤(y − Xθ), whose gradient with respect to θ is
dL/dθ = d/dθ [ (1/(2σ²)) (y − Xθ)^⊤(y − Xθ) ]   (9.11a)
= (1/(2σ²)) d/dθ [ y^⊤y − 2y^⊤Xθ + θ^⊤X^⊤Xθ ]   (9.11b)
= (1/σ²) ( −y^⊤X + θ^⊤X^⊤X ) ∈ R^{1×D} .   (9.11c)
The maximum likelihood estimator θ_ML solves dL/dθ = 0^⊤ (necessary optimality condition), and using (9.11c) we obtain
dL/dθ = 0^⊤ ⟺ θ_ML^⊤ X^⊤X = y^⊤X   (9.12a)
⟺ θ_ML^⊤ = y^⊤X (X^⊤X)^{−1}   (9.12b)
⟺ θ_ML = (X^⊤X)^{−1} X^⊤ y .   (9.12c)
(Ignoring the possibility of duplicate data points, rk(X) = D if N > D, i.e., we do not have more parameters than data points.)
⟺ y = φ^⊤(x)θ + ε = Σ_{k=0}^{K−1} θ_k φ_k(x) + ε ,   (9.13)
A polynomial of degree K − 1 is
f(x) = Σ_{k=0}^{K−1} θ_k x^k = φ^⊤(x)θ ,   (9.15)
and the corresponding maximum likelihood estimate is
θ_ML = (Φ^⊤Φ)^{−1} Φ^⊤ y   (9.19)
for the linear regression problem with nonlinear features defined in (9.13).
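A sketch of polynomial-feature regression on synthetic data: the feature matrix Φ contains the monomials x^0, ..., x^{K−1}, and the maximum likelihood parameters are obtained as in (9.19), here computed via np.linalg.lstsq for numerical stability.

```python
import numpy as np

# A sketch, assuming synthetic data (not the book's dataset): polynomial
# regression of degree K-1 with feature matrix Phi and the ML estimate (9.19).
rng = np.random.default_rng(0)
N, K = 20, 5                               # N data points, polynomial degree K-1
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + 0.2 * rng.normal(size=N)   # hypothetical noisy observations

Phi = np.vander(x, K, increasing=True)     # columns: x^0, x^1, ..., x^(K-1)
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predict at new inputs by evaluating the same polynomial features.
x_test = np.linspace(-4, 4, 100)
y_pred = np.vander(x_test, K, increasing=True) @ theta_ml
print(theta_ml)
```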
Remark. When we were working without features, we required X^⊤X to be invertible, which is the case when rk(X) = D, i.e., when the columns of X are linearly independent. In (9.19), we therefore require Φ^⊤Φ ∈ R^{K×K} to be invertible. This is the case if and only if rk(Φ) = K. ♦
Figure 9.4 Polynomial regression. (a) Dataset consisting of (x_n, y_n) pairs, n = 1, . . . , 10; (b) Maximum likelihood polynomial of degree 4.
Figure 9.5 Maximum likelihood fits for different polynomial degrees M.
Figure 9.6 Training and test error (RMSE) as a function of the degree of the polynomial, for the maximum likelihood fits in Figure 9.5.

Note that the training error (blue curve in Figure 9.6) never increases when the degree of the polynomial increases. In our example, the best generalization (the point of the smallest test error) is obtained for a polynomial of degree M = 4.
where the constant comprises the terms that are independent of θ. We see that the log-posterior in (9.25) is the sum of the log-likelihood log p(Y | X, θ) and the log-prior log p(θ), so that the MAP estimate will be a "compromise" between the prior (our suggestion for plausible parameter values before observing data) and the data-dependent likelihood.
To find the MAP estimate θ_MAP, we minimize the negative log-posterior distribution with respect to θ, i.e., we solve
θ_MAP ∈ arg min_θ { − log p(Y | X, θ) − log p(θ) } .   (9.26)
Φ^⊤Φ + (σ²/b²) I is symmetric and strictly positive definite (i.e., its inverse exists and the MAP estimate is the unique solution of a system of linear equations). Moreover, it reflects the impact of the regularizer.
Figure 9.7 Polynomial regression: maximum likelihood and MAP estimates. (a) Polynomials of degree 6; (b) Polynomials of degree 8.
useful for variable selection. For p = 1, the regularizer is called LASSO (least absolute shrinkage and selection operator) and was proposed by Tibshirani (1996). ♦
The regularizer λ‖θ‖₂² in (9.32) can be interpreted as a negative log-Gaussian prior, which we use in MAP estimation, see (9.26). More specifically, with a Gaussian prior p(θ) = N(0, b²I), we obtain the negative log-Gaussian prior
− log p(θ) = (1/(2b²)) ‖θ‖₂² + const   (9.33)
so that for λ = 1/(2b²) the regularization term and the negative log-Gaussian prior are identical.
Given that the regularized least-squares loss function in (9.32) consists
of terms that are closely related to the negative log-likelihood plus a neg-
ative log-prior, it is not surprising that, when we minimize this loss, we
obtain a solution that closely resembles the MAP estimate in (9.31). More
specifically, minimizing the regularized least-squares loss function yields
θ_RLS = (Φ^⊤Φ + λI)^{−1} Φ^⊤ y ,   (9.34)
which is identical to the MAP estimate in (9.31) for λ = σ²/b², where σ² is the noise variance and b² is the variance of the (isotropic) Gaussian prior p(θ) = N(0, b²I).
So far, we covered parameter estimation using maximum likelihood and MAP estimation, where we found point estimates θ* that optimize an objective function (likelihood or posterior). A point estimate is a single specific parameter value, unlike a distribution over plausible parameter settings. We saw that both maximum likelihood and MAP estimation can lead to overfitting. In the next section, we
will discuss Bayesian linear regression, where we use Bayesian inference
(Section 8.3) to find a posterior distribution over the unknown parame-
ters, which we subsequently use to make predictions. More specifically, for
predictions we will average over all plausible sets of parameters instead
of focusing on a point estimate.
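Before moving on, here is a small sketch of regularized least squares (9.34) on synthetic polynomial features, using λ = σ²/b² to mirror the MAP correspondence discussed above; the noise level and prior variance are assumed values.

```python
import numpy as np

# A sketch, assuming synthetic data, noise level sigma and prior variance b^2:
# regularized least squares (9.34) with lambda = sigma^2 / b^2.
rng = np.random.default_rng(0)
N, K, sigma, b = 15, 8, 0.3, 1.0
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

lam = sigma**2 / b**2                       # regularization strength
theta_rls = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# For comparison: the unregularized maximum likelihood estimate.
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.linalg.norm(theta_rls), np.linalg.norm(theta_ml))  # the RLS/MAP solution has no larger norm
```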
9.3.1 Model
In Bayesian linear regression, we consider the model
prior: p(θ) = N(m_0, S_0) ,
likelihood: p(y | x, θ) = N(y | φ^⊤(x)θ, σ²) ,   (9.35)
where we now explicitly place a Gaussian prior p(θ) = N(m_0, S_0) on θ, which turns the parameter vector into a random variable. This allows us to write down the corresponding graphical model in Figure 9.8, where we made the parameters of the Gaussian prior on θ explicit.

Figure 9.8 Graphical model for Bayesian linear regression.

The full probabilistic model, i.e., the joint distribution of observed and unobserved random variables, y and θ, respectively, is
p(y, θ | x) = p(y | x, θ) p(θ) .   (9.36)
9.3.2 Prior Predictions
In practice, we are usually not so much interested in the parameter values
θ themselves. Instead, our focus often lies in the predictions we make
with those parameter values. In a Bayesian setting, we take the parameter
distribution and average over all plausible parameter settings when we
make predictions. More specifically, to make predictions at an input x∗ ,
we integrate out θ and obtain
Z
p(y∗ | x∗ ) = p(y∗ | x∗ , θ)p(θ)dθ = Eθ [p(y∗ | x∗ , θ)] , (9.37)
Σ := A^{−1} ,   (9.53)
µ := Σ a ,   (9.54)
A := σ^{−2} Φ^⊤Φ + S_0^{−1} ,   (9.55)
a := σ^{−2} Φ^⊤ y + S_0^{−1} m_0 .   (9.56)
The term φ^⊤(x_*) S_N φ(x_*) reflects the posterior uncertainty associated with the parameters θ. Note that S_N depends on the training inputs through Φ, see (9.43b). The predictive mean φ^⊤(x_*) m_N coincides with the MAP estimate.
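The posterior and predictive quantities can be computed directly. The sketch below (synthetic data, assumed noise level σ and isotropic prior) forms S_N and m_N as above and evaluates the predictive mean φ^⊤(x_*) m_N and variance φ^⊤(x_*) S_N φ(x_*) + σ².

```python
import numpy as np

# A sketch of Bayesian linear regression, assuming synthetic data, a known
# noise level sigma, and an isotropic Gaussian prior N(m_0, S_0).
rng = np.random.default_rng(0)
N, K, sigma = 20, 4, 0.2
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

m0 = np.zeros(K)                 # prior mean
S0 = np.eye(K)                   # prior covariance (assumed isotropic)

S0_inv = np.linalg.inv(S0)
SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)    # posterior covariance, cf. (9.55)
mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma**2)         # posterior mean, cf. (9.56)

# Predictive distribution at a test input x*.
x_star = 1.5
phi_star = np.vander(np.array([x_star]), K, increasing=True)[0]
pred_mean = phi_star @ mN
pred_var = phi_star @ SN @ phi_star + sigma**2         # parameter + noise uncertainty
print(pred_mean, pred_var)
```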
[Figure: Bayesian linear regression and posterior over functions. (a) Training data. (b) Posterior over functions, represented by the marginal uncertainties (shaded) showing the 67% and 95% predictive confidence bounds, the maximum likelihood estimate (MLE) and the MAP estimate (MAP), which is identical to the posterior mean function. (c) Samples from the posterior over functions, which are induced by the samples from the parameter posterior.]

[Figure: Bayesian linear regression with polynomial features. (a) Posterior distribution for polynomials of degree M = 3 (left) and samples from the posterior over functions (right). (c) Posterior distribution for polynomials of degree M = 7 (left) and samples from the posterior over functions (right). Shown are the training data, the MLE, the MAP estimate and the BLR predictive distribution with 67% (dark-gray) and 95% (light-gray) predictive confidence bounds. The mean of the Bayesian linear regression model coincides with the MAP estimate. The predictive uncertainty is the sum of the noise term and the posterior parameter uncertainty.]
Figure 9.12 (a) Regression dataset consisting of noisy observations y_n (blue) of function values f(x_n) at input locations x_n. (b) Maximum likelihood solution interpreted as a projection: the orange dots are the projections of the noisy observations (blue dots) onto the line θ_ML x. The maximum likelihood solution to a linear regression problem finds a subspace (line) onto which the overall projection error (orange lines) of the observations is minimized.
Linear regression can be thought of as a method for solving systems of linear equations. Therefore, we can relate to concepts from linear algebra and analytic geometry that we discussed in Chapters 2 and 3. In particular, looking carefully at (9.67) we see that the maximum likelihood estimator θ_ML in our example from (9.65) effectively does an orthogonal projection of y onto the one-dimensional subspace spanned by X. Recalling the results on orthogonal projections from Section 3.8, we identify XX^⊤/(X^⊤X) as the projection matrix, θ_ML as the coordinates of the projection onto the one-dimensional subspace of R^N spanned by X, and Xθ_ML as the orthogonal projection of y onto this subspace; maximum likelihood linear regression thus performs an orthogonal projection.
Therefore, the maximum likelihood solution also provides a geometrically optimal solution by finding the vectors in the subspace spanned by X that are "closest" to the corresponding observations y, where "closest" means the smallest (squared) distance of the function values y_n to x_nθ. This is achieved by orthogonal projections. Figure 9.12(b) shows the
projection of the noisy observations onto the subspace that minimizes the
squared distance between the original dataset and its projection (note that
the x-coordinate is fixed), which corresponds to the maximum likelihood
solution.
In the general linear regression case where
y = φ^⊤(x)θ + ε , ε ∼ N(0, σ²) ,   (9.68)
the maximum likelihood solution again performs an orthogonal projection,
y ≈ Φθ_ML ,   (9.69)
Φθ_ML = Φ(Φ^⊤Φ)^{−1} Φ^⊤ y .   (9.70)
When the basis functions are orthonormal, Φ^⊤Φ = I and this projection simplifies to ΦΦ^⊤y = Σ_k φ_k φ_k^⊤ y,
so that the coupling between different features has disappeared and the
maximum likelihood projection is simply the sum of projections of y onto
the individual basis vectors φk , i.e., the columns of Φ. Many popular basis
functions in signal processing, such as wavelets and Fourier bases, are
orthogonal basis functions. When the basis is not orthogonal, one can
convert a set of linearly independent basis functions to an orthogonal basis
by using the Gram-Schmidt process (Strang, 2003).
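A numerical sketch of this projection view (synthetic data, hypothetical polynomial features): the fitted values Φθ_ML coincide with Φ(Φ^⊤Φ)^{−1}Φ^⊤ y, and the projection matrix is idempotent.

```python
import numpy as np

# A sketch, assuming synthetic data and polynomial features: the ML fit equals
# the orthogonal projection of y onto the column space of Phi, cf. (9.70).
rng = np.random.default_rng(0)
N, K = 30, 4
x = rng.uniform(-3, 3, size=N)
y = np.cos(x) + 0.1 * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T   # projection matrix onto span(Phi)
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.allclose(P @ y, Phi @ theta_ml))      # projection equals the ML fit
print(np.allclose(P @ P, P))                   # projections are idempotent
```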
Working directly with high-dimensional data, such as images, comes with some difficulties: it is hard to analyze, interpretation is difficult, visualization is nearly impossible, and (from a practical point of view) storage of the data vectors can be expensive. (A 640 × 480 pixel color image, for instance, is a data point in a million-dimensional space, where every pixel corresponds to three dimensions, one for each color channel: red, green, blue.) However, high-dimensional data often has properties that we can exploit. For example, high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions. Furthermore, dimensions in high-dimensional data are often correlated so that the data possesses an
intrinsic lower-dimensional structure. Dimensionality reduction exploits
structure and correlation and allows us to work with a more compact rep-
resentation of the data, ideally without losing information. We can think
of dimensionality reduction as a compression technique, similar to jpeg or
mp3, which are compression algorithms for images and music.
In this chapter, we will discuss principal component analysis (PCA), an algorithm for linear dimensionality reduction. PCA, proposed by Pearson (1901) and Hotelling (1933), has been around for more than 100 years and is still one of the most commonly used techniques for data compression and data visualization. It is also used for the identification of simple patterns, latent factors and structures of high-dimensional data.
Figure 10.1 Illustration: dimensionality reduction. (a) Dataset with x1 and x2 coordinates; the original dataset does not vary much along the x2 direction. (b) Compressed dataset where only the x1 coordinate is relevant; the data from (a) can be represented using the x1-coordinate alone with nearly no loss.
In the signal processing community, PCA is also known as the Karhunen-Loève transform. In this chapter, we derive PCA from first principles, drawing on our understanding of basis and basis change (Sections 2.6.1 and 2.7.2), projections (Section 3.8), eigenvalues (Section 4.2), Gaussian distributions (Section 6.5) and constrained optimization (Section 7.2).
Dimensionality reduction generally exploits a property of high-dimen-
sional data (e.g., images) that it often lies on a low-dimensional subspace.
Figure 10.1 gives an illustrative example in two dimensions. Although
the data in Figure 10.1(a) does not quite lie on a line, the data does not
vary much in the x2 -direction, so that we can express it as if it was on
a line – with nearly no loss, see Figure 10.1(b). To describe the data in
Figure 10.1(b), only the x1 -coordinate is required, and the data lies in a
one-dimensional subspace of R2 .
Figure 10.3 Examples of handwritten digits from the MNIST dataset (http://yann.lecun.com/exdb/mnist/).

The MNIST dataset is a commonly used example; it contains 60,000 examples of handwritten digits 0–9. Each digit is a grayscale image of size 28 × 28, i.e., it contains 784 pixels, so that we can interpret every image in this dataset as a vector x ∈ R^784. Examples of these digits are shown in Figure 10.3.
i.e., the variance of the low-dimensional code does not depend on the
mean of the data. Therefore, we assume without loss of generality that the
data has mean 0 for the remainder of this section. With this assumption
the mean of the low-dimensional code is also 0 since Ez [z] = Ex [B > x] =
B > Ex [x] = 0. ♦
z_{1n} = b_1^⊤ x_n ,   (10.8)
Figure 10.5 (a) Eigenvalues of the data covariance matrix of all digits '8' in the MNIST training set, sorted in descending order; (b) Variance captured by the principal components associated with the largest eigenvalues.
Figure 10.5(a) shows the 200 largest eigenvalues of the data covariance
matrix. We see that only a few of them have a value that differs signifi-
cantly from 0. Therefore, most of the variance, when projecting data onto
the subspace spanned by the corresponding eigenvectors, is captured by
only a few principal components as shown in Figure 10.5(b).
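The following sketch reproduces this qualitative behavior on synthetic data with low intrinsic dimensionality (it does not use MNIST): the eigenvalue spectrum of the data covariance matrix decays quickly, and a few components capture most of the variance.

```python
import numpy as np

# A sketch, assuming synthetic data standing in for the MNIST '8's: eigenvalue
# spectrum of the data covariance matrix and cumulative captured variance.
rng = np.random.default_rng(0)
N, D = 500, 50
Z = rng.normal(size=(N, 5))                  # 5-dimensional latent structure
A = rng.normal(size=(5, D))
X = Z @ A + 0.1 * rng.normal(size=(N, D))    # plus small isotropic noise
X = X - X.mean(axis=0)                       # center the data

S = X.T @ X / N                              # data covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]        # eigenvalues in descending order

captured = np.cumsum(eigvals) / np.sum(eigvals)
print(eigvals[:8])
print(captured[:8])                          # a few components capture most variance
```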
Figure 10.6
Illustration of the
projection
approach: Find a
subspace (line) that
minimizes the
length of the
difference vector
between projected
(orange) and
original (blue) data.
where the λ_m are the M largest eigenvalues of the data covariance matrix S. Consequently, the variance lost by data compression via PCA is
J_M := Σ_{j=M+1}^{D} λ_j = V_D − V_M .   (10.24)
Figure 10.7 Simplified projection setting. (a) A vector x ∈ R² (red cross) shall be projected onto a one-dimensional subspace U ⊆ R² spanned by b. (b) The difference vectors between x and some candidates x̃.
x = Σ_{d=1}^{D} ζ_d b_d = Σ_{m=1}^{M} ζ_m b_m + Σ_{j=M+1}^{D} ζ_j b_j   (10.25)
Figure 10.8 Optimal projection of a vector x ∈ R² onto a one-dimensional subspace (continuation from Figure 10.7). (a) Distances ‖x − x̃‖ for some x̃ = z₁b ∈ U = span[b]. (b) The vector x̃ that minimizes the distance in panel (a) is its orthogonal projection onto U. The coordinate of the projection x̃ with respect to the basis vector b that spans U is the factor we need to scale b in order to "reach" x̃.
since b_i^⊤ b_i = 1. Setting this partial derivative to 0 yields immediately the optimal coordinates
z_in = x_n^⊤ b_i = b_i^⊤ x_n   (10.31)
for i = 1, . . . , M and n = 1, . . . , N. This means that the optimal coordinates z_in of the projection x̃_n are the coordinates of the orthogonal projection (see Section 3.8) of the original data point x_n onto the one-dimensional subspace that is spanned by b_i. Consequently:
The optimal linear projection x̃_n of x_n is an orthogonal projection.
The coordinates of x̃_n with respect to the basis (b_1, . . . , b_M) are the coordinates of the orthogonal projection of x_n onto the principal subspace.
An orthogonal projection is the best linear mapping given the objective (10.28).
The coordinates ζ_m of x in (10.25) and the coordinates z_m of x̃ in (10.26) must be identical for m = 1, . . . , M since U^⊥ = span[b_{M+1}, . . . , b_D] is the orthogonal complement (see Section 3.6) of U = span[b_1, . . . , b_M].
Remark (Orthogonal Projections with Orthonormal Basis Vectors). Let us briefly recap orthogonal projections from Section 3.8. If (b_1, . . . , b_D) is an orthonormal basis of R^D then
x̃ = b_j (b_j^⊤ b_j)^{−1} b_j^⊤ x = b_j b_j^⊤ x ∈ R^D   (10.32)
is the orthogonal projection of x onto the subspace spanned by the jth basis vector, and z_j = b_j^⊤ x is the coordinate of this projection with respect to the basis vector b_j that spans that subspace, since z_j b_j = x̃. Figure 10.8 illustrates this setting.
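A tiny numerical check of (10.32): projecting a vector onto the span of a unit vector b via the rank-one matrix b b^⊤, and recovering the coordinate z = b^⊤x. The vectors below are arbitrary assumptions.

```python
import numpy as np

# A sketch of (10.32), assuming arbitrary vectors b and x in R^2.
b = np.array([2.0, 1.0])
b = b / np.linalg.norm(b)        # orthonormal basis vector of a 1-D subspace
x = np.array([1.0, 2.0])

x_proj = np.outer(b, b) @ x      # orthogonal projection b b^T x
z = b @ x                        # coordinate of the projection w.r.t. b

print(np.allclose(x_proj, z * b))        # z * b reproduces the projection
print(np.isclose(b @ (x - x_proj), 0.0)) # residual is orthogonal to b
```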
Since we can generally write the original data point xn as a linear combi-
nation of all basis vectors, it holds that
x_n = Σ_{d=1}^{D} z_{dn} b_d = Σ_{d=1}^{D} (x_n^⊤ b_d) b_d = ( Σ_{d=1}^{D} b_d b_d^⊤ ) x_n   (10.36a)
= ( Σ_{m=1}^{M} b_m b_m^⊤ ) x_n + ( Σ_{j=M+1}^{D} b_j b_j^⊤ ) x_n ,   (10.36b)
where the second equality uses (10.31), and where we split the sum with D terms into a sum over M and a sum over D − M terms.
Figure 10.9 Orthogonal projection and displacement vectors. When projecting data points x_n (blue) onto subspace U1 we obtain x̃_n (orange). The displacement vector x̃_n − x_n lies completely in the orthogonal complement U2 of U1.

With this result, we find that the displacement vector x_n − x̃_n, i.e., the difference vector between the original data point and its projection, is
x_n − x̃_n = ( Σ_{j=M+1}^{D} b_j b_j^⊤ ) x_n   (10.37a)
= Σ_{j=M+1}^{D} (x_n^⊤ b_j) b_j .   (10.37b)
This means the difference is exactly the projection of the data point onto the orthogonal complement of the principal subspace: We identify the matrix Σ_{j=M+1}^{D} b_j b_j^⊤ in (10.37a) as the projection matrix that performs this projection. Hence, the displacement vector x_n − x̃_n lies in the subspace that is orthogonal to the principal subspace, as illustrated in Figure 10.9.
Remark (Low-Rank Approximation). In (10.37a), we saw that the projection matrix, which projects x onto x̃, is given by
Σ_{m=1}^{M} b_m b_m^⊤ = BB^⊤ .   (10.38)
Now, we have all the tools to reformulate the loss function (10.28).
J_M = (1/N) Σ_{n=1}^{N} ‖x_n − x̃_n‖² = (1/N) Σ_{n=1}^{N} ‖ Σ_{j=M+1}^{D} (b_j^⊤ x_n) b_j ‖² ,   (10.40)
where we used (10.37b). We now explicitly compute the squared norm and exploit the fact that the b_j form an ONB, which yields
J_M = (1/N) Σ_{n=1}^{N} Σ_{j=M+1}^{D} (b_j^⊤ x_n)² = (1/N) Σ_{n=1}^{N} Σ_{j=M+1}^{D} b_j^⊤ x_n b_j^⊤ x_n   (10.41a)
= (1/N) Σ_{n=1}^{N} Σ_{j=M+1}^{D} b_j^⊤ x_n x_n^⊤ b_j ,   (10.41b)
where we exploited the symmetry of the dot product in the last step to write b_j^⊤ x_n = x_n^⊤ b_j. We now swap the sums and obtain
J_M = Σ_{j=M+1}^{D} b_j^⊤ ( (1/N) Σ_{n=1}^{N} x_n x_n^⊤ ) b_j =: Σ_{j=M+1}^{D} b_j^⊤ S b_j   (10.42a)
= Σ_{j=M+1}^{D} tr(b_j^⊤ S b_j) = Σ_{j=M+1}^{D} tr(S b_j b_j^⊤) = tr( ( Σ_{j=M+1}^{D} b_j b_j^⊤ ) S ) ,   (10.42b)
where we exploited the property that the trace operator tr(·), see (4.18), is linear and invariant to cyclic permutations of its arguments. Since we assumed that our dataset is centered, i.e., E[X] = 0, we identify S as the data covariance matrix. Since the projection matrix in (10.42b) is constructed as a sum of rank-one matrices b_j b_j^⊤, it itself is of rank D − M.
Equation (10.42a) implies that we can formulate the average squared reconstruction error equivalently as the covariance matrix of the data, projected onto the orthogonal complement of the principal subspace. Minimizing the average squared reconstruction error is therefore equivalent to minimizing the variance of the data when projected onto the subspace we ignore, i.e., the orthogonal complement of the principal subspace. Equivalently, we maximize the variance of the projection that we retain in the principal subspace, which links the projection loss immediately to the maximum-variance formulation of PCA discussed in Section 10.2. But this then also means that we will obtain the same solution that we obtained for the maximum-variance perspective. Therefore, we omit a derivation that is identical to the one in Section 10.2 and summarize the results from earlier in the light of the projection perspective.
The average squared reconstruction error, when projecting onto the M-dimensional principal subspace, is
J_M = Σ_{j=M+1}^{D} λ_j ,   (10.43)
Figure 10.10 Embedding of MNIST digits '0' (blue) and '1' (orange) in a two-dimensional principal subspace using PCA. Four embeddings of the digits '0' and '1' in the principal subspace are highlighted in red with their corresponding original digit.
Figure 10.10 visualizes the training data of the MNIST digits '0' and '1'
embedded in the vector subspace spanned by the first two principal com-
ponents. We observe a relatively clear separation between ‘0’s (blue dots)
and ‘1’s (orange dots), and we see the variation within each individual
cluster. Four embeddings of the digits ‘0’ and ‘1’ in the principal subspace
are highlighted in red with their corresponding original digit. The figure
reveals that the variation within the set of ‘0’ is significantly greater than
the variation within the set of ‘1’.
where U ∈ R^{D×D} and V ∈ R^{N×N} are orthogonal matrices and Σ ∈ R^{D×N} is a matrix whose only non-zero entries are the singular values σ_ii > 0. It then follows that
S = (1/N) X X^⊤ = (1/N) U Σ V^⊤ V Σ^⊤ U^⊤ = (1/N) U Σ Σ^⊤ U^⊤ ,   (10.47)
since V^⊤ V = I_N. With the results from Section 4.5 we get that the columns of U are the eigenvectors of X X^⊤ (and therefore S). Furthermore, the eigenvalues λ_d of S are related to the singular values of X via
λ_d = σ_d² / N .   (10.48)
This relationship between the eigenvalues of S and the singular values
of X provides the connection between the maximum variance view (Sec-
tion 10.2) and the singular value decomposition.
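The relationship (10.48) is easy to verify numerically; in the sketch below, X is a D × N matrix of random (centered) data with data points as columns, matching the convention used here.

```python
import numpy as np

# A sketch, assuming random centered data X with data points as columns:
# the eigenvalues of S = X X^T / N equal sigma_d^2 / N, cf. (10.48).
rng = np.random.default_rng(0)
D, N = 6, 40
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)        # center the data

S = X @ X.T / N
eig_S = np.sort(np.linalg.eigvalsh(S))[::-1]

U, sing_vals, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(eig_S, sing_vals**2 / N))  # lambda_d = sigma_d^2 / N
```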
and we recover the data covariance matrix again. This now also means that we recover Xc_m as an eigenvector of S.
Remark. If we want to apply the PCA algorithm that we discussed in Section 10.6, we need to normalize the eigenvectors Xc_m of S so that they have norm 1. ♦
[Figure: Steps of PCA. (a) Original dataset. (b) Step 1: Centering by subtracting the mean from each data point. (c) Step 2: Dividing by the standard deviation to make the data unit free; the data has variance 1 along each axis. (d) Step 3: Compute eigenvalues and eigenvectors (arrows) of the data covariance matrix (ellipse). (e) Step 4: Project data onto the principal subspace. (f) Undo the standardization and move projected data back into the original data space from (a).]
where x_*^{(d)} is the dth component of x_*. We obtain the projection of the standardized data point as x̃_* = B B^⊤ x_*, with coordinates
z_* = B^⊤ x_*   (10.59)
with respect to the basis of the principal subspace. Here, B is the matrix that contains the eigenvectors that are associated with the largest eigenvalues of the data covariance matrix as columns. PCA returns the coordinates (10.59), not the projections x_*. Finally, to undo the standardization we apply
x̃_*^{(d)} ← x̃_*^{(d)} σ_d + µ_d , d = 1, . . . , D .   (10.60)
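Putting the steps together, here is a compact sketch of the PCA pipeline on synthetic data: standardize, eigendecompose the covariance, project onto the M-dimensional principal subspace, and undo the standardization as in (10.60).

```python
import numpy as np

# A sketch of the PCA steps described above, assuming synthetic data.
rng = np.random.default_rng(0)
N, D, M = 200, 5, 2                              # M-dimensional principal subspace
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, D)) + rng.normal(size=D)

mu = X.mean(axis=0)                              # step 1: centering
std = X.std(axis=0)                              # step 2: divide by standard deviation
X_std = (X - mu) / std

S = X_std.T @ X_std / N                          # step 3: covariance and eigenvectors
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
B = eigvecs[:, order[:M]]                        # eigenvectors of the M largest eigenvalues

Z = X_std @ B                                    # step 4: coordinates z* = B^T x* (as rows)
X_tilde_std = Z @ B.T                            # projection onto the principal subspace

X_tilde = X_tilde_std * std + mu                 # undo the standardization, cf. (10.60)
print(np.mean((X - X_tilde) ** 2))               # average squared reconstruction error
```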
[Figure: Reconstructions of the digit '8' with an increasing number of principal components (PCs: 10, PCs: 100, PCs: 500).]

The importance of the principal components drops off rapidly, and only marginal gains can be achieved by adding more PCs. This matches exactly our observation in Figure 10.5, where we discovered that most of the variance of the projected data is captured by only a few principal components. With about 550 PCs, we can essentially fully reconstruct the training data that contains the digit '8' (some pixels around the boundaries show no variation across the dataset as they are always black).
Figure 10.13 Average squared reconstruction error as a function of the number of principal components. The average squared reconstruction error is the sum of the eigenvalues in the orthogonal complement of the principal subspace.
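The statement in the caption can be turned into a small utility: for each number $M$ of principal components, the average squared reconstruction error equals the sum of the discarded eigenvalues. The sketch below assumes a centered $N \times D$ data matrix; names are illustrative.

```python
import numpy as np

# Average squared reconstruction error for every choice of M, as the sum of the
# eigenvalues in the orthogonal complement of the principal subspace.
def reconstruction_errors(X):
    S = X.T @ X / len(X)                              # data covariance matrix
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]        # eigenvalues, descending
    return np.array([lam[M:].sum() for M in range(len(lam) + 1)])
```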
Figure 10.14 Graphical model for probabilistic PCA. The observations $x_n$ explicitly depend on corresponding latent variables $z_n \sim \mathcal{N}(0, I)$. The model parameters $B, \mu$ and the likelihood parameter $\sigma$ are shared across the dataset.

By introducing a continuous-valued latent variable $z \in \mathbb{R}^M$ it is possible to phrase PCA as a probabilistic latent-variable model. Tipping and Bishop (1999) proposed this latent-variable model as Probabilistic PCA (PPCA). PPCA addresses most of the issues above, and the PCA solution that we obtained by maximizing the variance in the projected space or by minimizing the reconstruction error is obtained as the special case of maximum likelihood estimation in a noise-free setting.
Remark. Note the direction of the arrow that connects the latent variables
z and the observed data x: The arrow points from z to x, which means
that the PPCA model assumes a lower-dimensional latent cause z for high-
dimensional observations x. In the end, we are obviously interested in
finding something out about z given some observations. To get there we
will apply Bayesian inference to “invert” the arrow implicitly and go from
observations to latent variables. ♦
Figure 10.15
Generating new
MNIST digits. The
latent variables z
can be used to
generate new data
x̃ = Bz. The closer
we stay to the
training data the
more realistic the
generated data.
Figure 10.15 shows the latent coordinates of the MNIST digits ‘8’ found
by PCA when using a two-dimensional principal subspace (blue dots). We
can query any vector z ∗ in this latent space and generate an image x̃∗ =
$Bz_*$ that resembles the digit '8'. We show eight such generated images with their corresponding latent-space representations. Depending on where we query the latent space, the generated images look different (shape, rotation, size, ...). If we query away from the training data, we see more and more artefacts, e.g., the top-left and top-right digits. Note that the intrinsic
dimensionality of these generated images is only two.
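A hypothetical sketch of this generative use of PCA, following $\tilde{x}_* = Bz_*$; the matrix $B$ below is only a stand-in, whereas in the experiment it would contain the two leading eigenvectors obtained from PCA on the MNIST '8' digits.

```python
import numpy as np

# Generate images from latent queries: x~_* = B z_*.
rng = np.random.default_rng(2)
D = 28 * 28
B = np.linalg.qr(rng.normal(size=(D, 2)))[0]   # stand-in orthonormal basis (D x 2)

def generate(z_star):
    """Map a latent query z_star (shape (2,)) to an image x~_* = B z_*."""
    return (B @ z_star).reshape(28, 28)

img_near = generate(np.array([0.5, -0.5]))     # close to the training embeddings
img_far = generate(np.array([8.0, 8.0]))       # far away: expect more artefacts
```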
so that
$$p(x \mid B, \mu, \sigma^2) = \int p(x \mid z, \mu, \sigma^2)\, p(z)\, \mathrm{d}z \qquad (10.67a)$$
$$= \int \mathcal{N}\big(x \mid Bz + \mu, \sigma^2 I\big)\, \mathcal{N}\big(z \mid 0, I\big)\, \mathrm{d}z\,. \qquad (10.67b)$$
From Section 6.5, we know that the solution to this integral is a Gaussian distribution with mean
$$\mathbb{E}_x[x] = \mathbb{E}_z[Bz + \mu] + \mathbb{E}_\epsilon[\epsilon] = \mu \qquad (10.68)$$
and with covariance matrix
$$\mathbb{V}[x] = \mathbb{V}_z[Bz + \mu] + \mathbb{V}_\epsilon[\epsilon] = \mathbb{V}_z[Bz] + \sigma^2 I \qquad (10.69a)$$
$$= B\,\mathbb{V}_z[z]\,B^\top + \sigma^2 I = BB^\top + \sigma^2 I\,. \qquad (10.69b)$$
The likelihood in (10.67b) can be used for maximum likelihood or MAP
estimation of the model parameters.
Remark. We cannot use the conditional distribution in (10.63) for maxi-
mum likelihood estimation as it still depends on the latent variables. The
likelihood function we require for maximum likelihood (or MAP) estima-
tion should only be a function of the data x and the model parameters,
but must not depend on the latent variables. ♦
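Since the marginal likelihood is the Gaussian $\mathcal{N}(x \mid \mu, BB^\top + \sigma^2 I)$, evaluating it for a dataset is short to write down. A minimal sketch (illustrative names; data stored row-wise):

```python
import numpy as np
from scipy.stats import multivariate_normal

# PPCA marginal log-likelihood, cf. (10.67b)-(10.69b): x ~ N(mu, B B^T + sigma^2 I).
def ppca_log_likelihood(X, B, mu, sigma2):
    D = X.shape[1]
    cov = B @ B.T + sigma2 * np.eye(D)
    return multivariate_normal(mean=mu, cov=cov).logpdf(X).sum()
```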
From Section 6.5 we know that a Gaussian random variable $z$ and a linear/affine transformation $x = Bz$ of it are jointly Gaussian distributed. We already know the marginals $p(z) = \mathcal{N}(z \mid 0, I)$ and $p(x) = \mathcal{N}(x \mid \mu, BB^\top + \sigma^2 I)$. The missing cross-covariance is given as
Note that the posterior covariance does not depend on the observed data
x. For a new observation x∗ in data space, we use (10.72) to determine
the posterior distribution of the corresponding latent variable z ∗ . The co-
variance matrix C allows us to assess how confident the embedding is. A
covariance matrix C with a small determinant (which measures volumes)
tells us that the latent embedding z ∗ is fairly certain. If we obtain a pos-
terior distribution $p(z_* \mid x_*)$ with large variance, we may be faced with
an outlier. However, we can explore this posterior distribution to under-
stand what other data points x are plausible under this posterior. To do
this, we exploit the generative process underlying PPCA, which allows us
to explore the posterior distribution on the latent variables by generating
new data that are plausible under this posterior:
If we repeat this process many times, we can explore the posterior dis-
tribution (10.72) on the latent variables z ∗ and its implications on the
observed data. The sampling process effectively hypothesizes data that are plausible under the posterior distribution.
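Because (10.72) is not reproduced on this page, the sketch below uses the equivalent Gaussian-conditioning form of the PPCA posterior, $p(z \mid x) = \mathcal{N}\big(z \mid B^\top C_x^{-1}(x-\mu),\, I - B^\top C_x^{-1}B\big)$ with $C_x = BB^\top + \sigma^2 I$, which follows from the joint Gaussian of $x$ and $z$ discussed above. All names are illustrative.

```python
import numpy as np

# Posterior over the latent variable and posterior sampling of plausible data.
def latent_posterior(x_star, B, mu, sigma2):
    D, M = B.shape
    Cx_inv = np.linalg.inv(B @ B.T + sigma2 * np.eye(D))
    m = B.T @ Cx_inv @ (x_star - mu)           # posterior mean of z_*
    C = np.eye(M) - B.T @ Cx_inv @ B           # posterior covariance (independent of x_*)
    return m, C

def sample_plausible_data(x_star, B, mu, sigma2, num_samples=5, rng=None):
    """Sample z ~ p(z | x_*) and push it through the generative model."""
    rng = rng or np.random.default_rng()
    m, C = latent_posterior(x_star, B, mu, sigma2)
    Z = rng.multivariate_normal(m, C, size=num_samples)
    return Z @ B.T + mu                        # means of p(x | z); add N(0, sigma^2 I) noise if desired
```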
where $T \in \mathbb{R}^{D\times M}$ contains $M$ eigenvectors of the data covariance matrix, $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^{M\times M}$ is a diagonal matrix with the eigenvalues associated with the principal axes on its diagonal, and $R \in \mathbb{R}^{M\times M}$ is an arbitrary orthogonal matrix. The maximum likelihood solution $B_{\text{ML}}$ is unique up to an arbitrary orthogonal transformation, e.g., we can right-multiply $B_{\text{ML}}$ with any rotation matrix $R$ so that (10.77) essentially is a singular value decomposition (see Section 4.5). An outline of the proof is given by Tipping and Bishop (1999). (The matrix $\Lambda - \sigma^2 I$ in (10.77) is guaranteed to be positive semidefinite, as the smallest eigenvalue of the data covariance matrix is bounded from below by the noise variance $\sigma^2$.)
The maximum likelihood estimate for $\mu$ given in (10.76) is the sample mean of the data. The maximum likelihood estimator for the observation noise variance $\sigma^2$ given in (10.78) is the average variance in the orthogonal complement of the principal subspace, i.e., the average leftover variance that we cannot capture with the first $M$ principal components is treated as observation noise.
In the noise-free limit where $\sigma \to 0$, PPCA and PCA provide identical solutions: Since the data covariance matrix $S$ is symmetric, it can be diagonalized (see Section 4.4), i.e., there exists a matrix $T$ of eigenvectors of $S$ so that
$$S = T\Lambda T^{-1}\,. \qquad (10.79)$$
In the PPCA model, the data covariance matrix is the covariance matrix of the Gaussian likelihood $p(x \mid B, \mu, \sigma^2)$, which is $BB^\top + \sigma^2 I$, see (10.69b). For $\sigma \to 0$, we obtain $BB^\top$ so that this data covariance must equal the PCA data covariance (and its factorization given in (10.79)) so that
$$\operatorname{Cov}[\mathcal{X}] = T\Lambda T^{-1} = BB^\top \iff B = T\Lambda^{\frac{1}{2}} R\,, \qquad (10.80)$$
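A hedged sketch of this construction: given a centered data matrix, we take the $M$ leading eigenpairs of $S$ and form $B = T\Lambda^{1/2}R$ (noise-free case); for $\sigma^2 > 0$, (10.77) would use $(\Lambda - \sigma^2 I)^{1/2}$ instead. Names are illustrative.

```python
import numpy as np

# Noise-free maximum likelihood PPCA factor matrix, cf. (10.80). X is N x D, centered.
def ppca_ml_noise_free(X, M):
    S = X.T @ X / len(X)
    lam, T = np.linalg.eigh(S)
    idx = np.argsort(lam)[::-1][:M]
    lam, T = lam[idx], T[:, idx]                # M leading eigenpairs
    R = np.eye(M)                               # any orthogonal R gives the same B B^T
    # For sigma^2 > 0, (10.77) uses (Lambda - sigma^2 I)^{1/2} instead of Lambda^{1/2}.
    return T @ np.diag(np.sqrt(lam)) @ R
```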
11 Density Estimation with Gaussian Mixture Models
Figure 11.1 Two-dimensional dataset that cannot be meaningfully represented by a Gaussian.
11.1 Gaussian Mixture Model
$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\big(x \mid \mu_k, \Sigma_k\big) \qquad (11.3)$$
$$0 \leq \pi_k \leq 1\,, \qquad \sum_{k=1}^{K} \pi_k = 1\,, \qquad (11.4)$$
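A minimal sketch of evaluating the mixture density (11.3); the parameter values below are purely illustrative and are only meant to satisfy the constraints (11.4).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate a GMM density p(x | theta) = sum_k pi_k N(x | mu_k, Sigma_k).
def gmm_density(x, pi, mus, Sigmas):
    return sum(pi_k * multivariate_normal(mean=mu_k, cov=S_k).pdf(x)
               for pi_k, mu_k, S_k in zip(pi, mus, Sigmas))

# One-dimensional example with three components (illustrative values).
pi = [1 / 3, 1 / 3, 1 / 3]
mus, Sigmas = [[-4.0], [0.0], [8.0]], [[[1.0]], [[0.2]], [[3.0]]]
print(gmm_density([1.0], pi, mus, Sigmas))
```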
[Figure: the Gaussian mixture density $p(x)$ is a convex combination of Gaussian distributions and is more expressive than any individual component. Dashed lines represent the weighted Gaussian components.]
This simple form allows us to find closed-form maximum likelihood esti-
mates of µ and Σ, as discussed in Chapter 8. In (11.10), we cannot move
the log into the sum over k so that we cannot obtain a simple closed-form
maximum likelihood solution. ♦
Any local optimum of a function exhibits the property that its gradi-
ent with respect to the parameters must vanish (necessary condition), see
Chapter 7. In our case, we obtain the following necessary conditions when
we optimize the log-likelihood in (11.10) with respect to the GMM param-
eters µk , Σk , πk :
$$\frac{\partial \mathcal{L}}{\partial \mu_k} = \mathbf{0}^\top \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \mathbf{0}^\top\,, \qquad (11.12)$$
$$\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \mathbf{0} \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \mathbf{0}\,, \qquad (11.13)$$
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \pi_k} = 0\,. \qquad (11.14)$$
For all three necessary conditions, by applying the chain rule (see Sec-
tion 5.2.2), we require partial derivatives of the form
$$\frac{\partial \log p(x_n \mid \theta)}{\partial \theta} = \frac{1}{p(x_n \mid \theta)}\,\frac{\partial p(x_n \mid \theta)}{\partial \theta}\,, \qquad (11.15)$$
where $\theta = \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\}$ are the model parameters and
$$\frac{1}{p(x_n \mid \theta)} = \frac{1}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)}\,. \qquad (11.16)$$
11.2.1 Responsibilities
We define the quantity
$$r_{nk} := \frac{\pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)} \qquad (11.17)$$
as the responsibility of the $k$th mixture component for the $n$th data point. The responsibility $r_{nk}$ of the $k$th mixture component for data point $x_n$ is proportional to the likelihood
$$p(x_n \mid \pi_k, \mu_k, \Sigma_k) = \pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big) \qquad (11.18)$$
of the mixture component given the data point. Therefore, mixture components have a high responsibility for a data point when the data point could be a plausible sample from that mixture component. Note that $r_n := [r_{n1}, \ldots, r_{nK}]^\top \in \mathbb{R}^K$ is a (normalized) probability vector, i.e., $\sum_k r_{nk} = 1$ with $r_{nk} \geq 0$. ($r_n$ follows a Boltzmann/Gibbs distribution.)
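Computing all responsibilities at once is a small amount of code. The sketch below (illustrative names; data stored as an $N \times D$ array) evaluates the unnormalized terms (11.18) and normalizes each row as in (11.17).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Responsibilities r_nk for all data points and all K mixture components.
def responsibilities(X, pi, mus, Sigmas):
    # unnormalized: r_nk proportional to pi_k N(x_n | mu_k, Sigma_k), cf. (11.18)
    R = np.stack([pi_k * multivariate_normal(mean=mu_k, cov=S_k).pdf(X)
                  for pi_k, mu_k, S_k in zip(pi, mus, Sigmas)], axis=1)   # N x K
    return R / R.sum(axis=1, keepdims=True)    # each row r_n is a probability vector
```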
Here, we used the identity from (11.16) and the result of the partial
derivative in (11.21b) to get to (11.22b). The values rnk are the responsi-
bilities we defined in (11.17).
We now solve (11.22c) for $\mu_k^{\text{new}}$ so that $\frac{\partial \mathcal{L}(\mu_k^{\text{new}})}{\partial \mu_k} = \mathbf{0}^\top$ and obtain
$$\sum_{n=1}^{N} r_{nk} x_n = \sum_{n=1}^{N} r_{nk}\, \mu_k^{\text{new}} \iff \mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk} x_n\,, \qquad (11.23)$$
where we defined
$$N_k := \sum_{n=1}^{N} r_{nk} \qquad (11.24)$$
as the total responsibility of the $k$th mixture component for the entire dataset.
Therefore, the mean $\mu_k$ is pulled toward a data point $x_n$ with strength given by $r_{nk}$. The means are pulled stronger toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood. Figure 11.4 illustrates this. We can also interpret the mean update in (11.20) as the expected value of all data points under the distribution given by
$$r_k := [r_{1k}, \ldots, r_{Nk}]^\top / N_k\,, \qquad (11.25)$$
which is a normalized probability vector, i.e.,
$$\mu_k \leftarrow \mathbb{E}_{r_k}[\mathcal{X}]\,. \qquad (11.26)$$

Figure 11.4 Update of the mean parameter of a mixture component in a GMM. The mean $\mu$ is being pulled toward individual data points with the weights given by the corresponding responsibilities.

Example 11.3 (Mean Updates)
In our example from Figure 11.3, the mean values are updated as fol-
lows:
µ1 : −4 → −2.7 (11.27)
µ2 : 0 → −0.4 (11.28)
µ3 : 8 → 3.7 (11.29)
Here, we see that the means of the first and third mixture components
move toward the regime of the data, whereas the mean of the second
component does not change so dramatically. Figure 11.5 illustrates this
change, where Figure 11.5(a) shows the GMM density prior to updating
the means and Figure 11.5(b) shows the GMM density after updating the
mean values µk .
and obtain (after some rearranging) the desired partial derivative required in (11.31) as
$$\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} = \pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)\cdot\left[-\tfrac{1}{2}\Big(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\Big)\right]. \qquad (11.35)$$
Using (11.15) and (11.16) together with (11.35), the partial derivative of the log-likelihood with respect to $\Sigma_k$ becomes
$$\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)}\cdot\left[-\tfrac{1}{2}\Big(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\Big)\right] \qquad (11.36b)$$
$$= -\frac{1}{2}\sum_{n=1}^{N} r_{nk}\Big(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\Big) \qquad (11.36c)$$
$$= -\frac{1}{2}\Sigma_k^{-1}\underbrace{\sum_{n=1}^{N} r_{nk}}_{=N_k} + \frac{1}{2}\Sigma_k^{-1}\left(\sum_{n=1}^{N} r_{nk}(x_n - \mu_k)(x_n - \mu_k)^\top\right)\Sigma_k^{-1}\,. \qquad (11.36d)$$
We see that the responsibilities rnk also appear in this partial derivative.
Setting this partial derivative to 0, we obtain the necessary optimality
condition
$$N_k \Sigma_k^{-1} = \Sigma_k^{-1}\left(\sum_{n=1}^{N} r_{nk}(x_n - \mu_k)(x_n - \mu_k)^\top\right)\Sigma_k^{-1} \qquad (11.37a)$$
$$\iff N_k I = \left(\sum_{n=1}^{N} r_{nk}(x_n - \mu_k)(x_n - \mu_k)^\top\right)\Sigma_k^{-1} \qquad (11.37b)$$
Here, we see that the variances of the first and third component shrink significantly, while the variance of the second component increases slightly.
Figure 11.6 illustrates this setting. Figure 11.6(a) is identical to Figure 11.5(b) (but zoomed in) and shows the GMM density and its individual components prior to updating the variances. Figure 11.6(b) shows the GMM density after updating the variances.
Figure 11.6 Effect of updating the variances in a GMM. (a) GMM density and individual components prior to updating the variances; (b) GMM density and individual components after updating the variances while retaining the means and mixture weights.
where $\mathcal{L}$ is the log-likelihood from (11.10) and the second term encodes the equality constraint that all the mixture weights need to sum up to 1. We obtain the partial derivative with respect to $\pi_k$ as
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)} + \lambda \qquad (11.44a)$$
$$= \frac{1}{\pi_k}\underbrace{\sum_{n=1}^{N} \frac{\pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)}}_{=N_k} + \lambda = \frac{N_k}{\pi_k} + \lambda\,, \qquad (11.44b)$$
We can identify the mixture weight in (11.42) as the ratio of the total responsibility of the $k$th cluster and the number of data points. Since $N = \sum_k N_k$, the number of data points can also be interpreted as the total responsibility of all mixture components together, such that $\pi_k$ is the relative importance of the $k$th mixture component for the dataset.
Remark. Since $N_k = \sum_{n=1}^{N} r_{nk}$, the update equation (11.42) for the mixture weights $\pi_k$ also depends on all $\pi_j, \mu_j, \Sigma_j$, $j = 1, \ldots, K$, via the responsibilities $r_{nk}$. ♦
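Collecting the updates derived in this section, one full update of all GMM parameters can be sketched as follows (illustrative names; data stored as an $N \times D$ array). Section 11.3 turns these coupled updates into an iterative algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

# One full update of the GMM parameters, collecting (11.17), (11.23), (11.30), (11.42).
def update_gmm_parameters(X, pi, mus, Sigmas):
    N, K = len(X), len(pi)
    # responsibilities, cf. (11.17)
    R = np.stack([pi[k] * multivariate_normal(mean=mus[k], cov=Sigmas[k]).pdf(X)
                  for k in range(K)], axis=1)
    R /= R.sum(axis=1, keepdims=True)
    Nk = R.sum(axis=0)                                   # total responsibilities (11.24)

    new_mus = [R[:, k] @ X / Nk[k] for k in range(K)]    # mean updates (11.23)
    new_Sigmas = [(R[:, k, None] * (X - new_mus[k])).T @ (X - new_mus[k]) / Nk[k]
                  for k in range(K)]                     # covariance updates (11.30)
    new_pi = Nk / N                                      # mixture weight updates (11.42)
    return new_pi, new_mus, new_Sigmas
```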
[Figure: (a) GMM before updating the mixture weights; (b) GMM after updating the mixture weights.]
11.3 EM Algorithm
Unfortunately, the updates in (11.20), (11.30), and (11.42) do not consti-
tute a closed-form solution for the updates of the parameters µk , Σk , πk
of the mixture model because the responsibilities rnk depend on those pa-
rameters in a complex way. However, the results suggest a simple iterative
scheme for finding a solution to the parameter estimation problem via maximum likelihood: the Expectation Maximization (EM) algorithm.
Figure 11.8 EM algorithm applied to the GMM from Figure 11.2. (a) Final GMM fit: after five iterations, the EM algorithm converges and returns this GMM. (b) Negative log-likelihood as a function of the EM iterations.
Figure 11.9 Illustration of the EM algorithm for fitting a Gaussian mixture model with three components to a two-dimensional dataset. (a) Dataset; (b) Negative log-likelihood (lower is better) as a function of the EM iterations. The red dots indicate the iterations for which the mixture components of the corresponding GMM fits are shown in (c)–(f). The yellow discs indicate the means of the Gaussian mixture components. Figure 11.10(a) shows the final GMM fit. (c) EM initialization. (d) EM after 1 iteration.
When we run EM on our example from Figure 11.3, we obtain the final
result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b)
shows how the negative log-likelihood evolves as a function of the EM
iterations. The final GMM is given as
$$p(x) = 0.29\,\mathcal{N}\big(x \mid -2.75,\, 0.06\big) + 0.28\,\mathcal{N}\big(x \mid -0.50,\, 0.25\big) + 0.43\,\mathcal{N}\big(x \mid 3.64,\, 1.63\big)\,. \qquad (11.57)$$
Figure 11.10 (a) GMM fit after 62 iterations, when EM converges; (b) dataset colored according to the responsibilities of the mixture components.
the corresponding final GMM fit. Figure 11.10(b) visualizes the final re-
sponsibilities of the mixture components for the data points. The dataset is
colored according to the responsibilities of the mixture components when
EM converges. While a single mixture component is clearly responsible
for the data on the left, the overlap of the two data clusters on the right
could have been generated by two mixture components. It becomes clear
that there are data points that cannot be uniquely assigned to a single
(either blue or yellow) component, such that the responsibilities of these
two clusters for those points are around 0.5.
so that
$$p(x \mid z_k = 1) = \mathcal{N}\big(x \mid \mu_k, \Sigma_k\big)\,. \qquad (11.58)$$
$$\pi_k = p(z_k = 1) \qquad (11.60)$$
for $k = 1, \ldots, K$, so that
$$p(x, z) = \begin{bmatrix} p(x, z_1 = 1) \\ \vdots \\ p(x, z_K = 1)\end{bmatrix} = \begin{bmatrix} \pi_1\, \mathcal{N}\big(x \mid \mu_1, \Sigma_1\big) \\ \vdots \\ \pi_K\, \mathcal{N}\big(x \mid \mu_K, \Sigma_K\big)\end{bmatrix}, \qquad (11.62)$$
which fully specifies the probabilistic model.
11.4.2 Likelihood
To obtain the likelihood $p(x \mid \theta)$ in a latent-variable model, we need to marginalize out the latent variables (see Section 8.3.3). In our case, this can be done by summing out all latent variables from the joint $p(x, z)$ in (11.62) so that
$$p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta)\,, \qquad \theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\}\,. \qquad (11.63)$$
We now explicitly condition on the parameters $\theta$ of the probabilistic model, which we previously omitted. In (11.63), we sum over all $K$ possible one-hot encodings of $z$, which is denoted by $\sum_z$. Since there is only a single non-zero entry in each $z$, there are only $K$ possible configurations/settings of $z$. For example, if $K = 3$ then $z$ can have the configurations
$$\begin{bmatrix}1\\0\\0\end{bmatrix}, \quad \begin{bmatrix}0\\1\\0\end{bmatrix}, \quad \begin{bmatrix}0\\0\\1\end{bmatrix}. \qquad (11.64)$$
Summing over all possible configurations of $z$ in (11.63) is equivalent to looking at the non-zero entry of the $z$-vector and writing
$$p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta) \qquad (11.65a)$$
$$= \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta) \qquad (11.65b)$$
Figure 11.12 Graphical model for a GMM with N data points.
which is exactly the GMM likelihood from (11.9). Therefore, the latent
variable model with latent indicators zk is an equivalent way of thinking
about a Gaussian mixture model.
where the expectation of log p(x, z | θ) is taken with respect to the poste-
rior p(z | x, θ (t) ) of the latent variables. The M-step selects an updated set
of model parameters θ (t+1) by maximizing (11.73b).
Although an EM iteration does increase the log-likelihood, there are
no guarantees that EM converges to the maximum likelihood solution.
It is possible that the EM algorithm converges to a local maximum of
the log-likelihood. Different initializations of the parameters θ could be
used in multiple EM runs to reduce the risk of ending up in a bad local
optimum. We do not go into further details here, but refer to the excellent
expositions by Rogers and Girolami (2016) and Bishop (2006).
Figure 11.13 The kernel density estimator produces a smooth estimate of the underlying density, whereas the histogram is an unsmoothed count measure of how many data points (black) fall into a single bin.

In this chapter, we discussed mixture models for density estimation.
There is a plethora of density estimation techniques available. In practice, we often use histograms and kernel density estimation.
Histograms provide a non-parametric way to represent continuous densities and were proposed by Pearson (1895). A histogram is constructed by "binning" the data space and counting how many data points fall into each bin. Then a bar is drawn at the center of each bin, and the height of the bar is proportional to the number of data points within that bin. The bin size is a critical hyperparameter, and a bad choice can lead to overfitting and underfitting. Cross-validation, as discussed in Section 8.1.4, can be used to determine a good bin size.
Kernel density estimation, independently proposed by Rosenblatt (1956) and Parzen (1962), is a nonparametric method for density estimation. Given $N$ i.i.d. samples, the kernel density estimator represents the underlying distribution as
$$p(x) = \frac{1}{Nh}\sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right), \qquad (11.74)$$
where $k$ is a kernel function, i.e., a non-negative function that integrates to 1, and $h > 0$ is a smoothing/bandwidth parameter, which plays a simi-
lar role as the bin size in histograms. Note that we place a kernel on every
single data point xn in the dataset. Commonly used kernel functions are
the uniform distribution and the Gaussian distribution. Kernel density esti-
mates are closely related to histograms, but by choosing a suitable kernel,
we can guarantee smoothness of the density estimate. Figure 11.13 illus-
trates the difference between a histogram and a kernel density estimator
(with a Gaussian-shaped kernel) for a given dataset of 250 data points.
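A minimal sketch of (11.74) with a Gaussian kernel; the dataset and bandwidth below are illustrative.

```python
import numpy as np

# Kernel density estimate (11.74) with a Gaussian kernel, evaluated on a grid.
def kde(x_grid, data, h):
    u = (x_grid[:, None] - data[None, :]) / h              # (x - x_n) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)         # Gaussian kernel
    return k.sum(axis=1) / (len(data) * h)                 # (1 / (N h)) * sum_n k(.)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(3, 1.0, 150)])  # 250 points
density = kde(np.linspace(-5, 7, 200), data, h=0.3)
```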
12 Classification with Support Vector Machines
Figure 12.1 Example 2D data, illustrating the intuition of data where we can find a linear classifier that separates red crosses from blue dots.
SVMs. First, the SVM allows for a geometric way to think about supervised
machine learning. While in Chapter 9 we considered the machine learning
problem in terms of probabilistic models and attacked it using maximum
likelihood estimation and Bayesian inference, here we will consider an
alternative approach where we reason geometrically about the machine
learning task. It relies heavily on concepts, such as inner products and
projections, which we discussed in Chapter 3. The second reason why we
find SVMs instructive is that in contrast to Chapter 9, the optimization
problem for SVM does not admit an analytic solution so that we need to
resort to a variety of optimization tools introduced in Chapter 7.
The SVM view of machine learning is subtly different from the max-
imum likelihood view of Chapter 9. The maximum likelihood view pro-
poses a model based on a probabilistic view of the data distribution, from
which an optimization problem is derived. In contrast, the SVM view starts
by designing a particular function that is to be optimized during training,
based on geometric intuitions. We have seen something similar already in
Chapter 10 where we derived PCA from geometric principles. In the SVM
case, we start by designing an objective function that is to be minimized on
training data, following the principles of empirical risk minimization (Section 8.1).
This can also be understood as designing a particular loss function.
Let us derive the optimization problem corresponding to training an
SVM on example-label pairs. Intuitively, we imagine binary classification
data, which can be separated by a hyperplane as illustrated in Figure 12.1.
Here, every example $x_n$ (a vector of dimension 2) is a two-dimensional location ($x_n^{(1)}$ and $x_n^{(2)}$), and the corresponding binary label $y_n$ is one of
two different symbols (red cross or blue disc). “Hyperplane” is a word that
is commonly used in machine learning, and we encountered hyperplanes
already in Section 2.8. A hyperplane is an affine subspace of dimension
D−1 (if the corresponding vector space is of dimension D). The examples
consist of two classes (there are two possible labels) that have features
(the components of the vector representing the example) arranged in such
a way as to allow us to separate/classify them by drawing a straight line.
In the following, we formalize the idea of finding a linear separator
of the two classes. We introduce the idea of the margin and then extend
linear separators to allow for examples to fall on the “wrong” side, incur-
ring a classification error. We present two equivalent ways of formalizing
the SVM: the geometric view (Section 12.2.4) and the loss function view
(Section 12.2.5). We derive the dual version of the SVM using Lagrange
multipliers (Section 7.2). The dual SVM allows us to observe a third way
of formalizing the SVM: in terms of the convex hulls of the examples of
each class (Section 12.3.2). We conclude by briefly describing kernels and
how to numerically solve the nonlinear kernel-SVM optimization problem.
and the examples with negative labels are on the negative side, i.e.,
Figure 12.3 Possible separating hyperplanes. There are many linear classifiers (green lines) that separate red crosses from blue dots.
Figure 12.5 Derivation of the margin: $r = \frac{1}{\|w\|}$.
problem, we obtain the objective
$$\max_{w, b, r}\ \underbrace{r}_{\text{margin}} \quad \text{subject to} \quad \underbrace{y_n\big(\langle w, x_n\rangle + b\big) \geq r}_{\text{data fitting}}\,, \quad \underbrace{\|w\| = 1}_{\text{normalization}}\,, \quad r > 0\,, \qquad (12.10)$$
which says that we want to maximize the margin $r$, while ensuring that the data lies on the correct side of the hyperplane.
Remark. The concept of the margin turns out to be highly pervasive in
machine learning. It was used by Vladimir Vapnik and Alexey Chervo-
nenkis to show that when the margin is large, the “complexity” of the func-
tion class is low, and, hence, learning is possible (Vapnik, 2000). It turns
out that the concept is useful for various different approaches for theo-
retically analyzing generalization error (Shalev-Shwartz and Ben-David,
2014; Steinwart and Christmann, 2008). ♦
$$\max_{w', b, r}\ r^2 \quad \text{subject to} \quad y_n\!\left(\left\langle \frac{w'}{\|w'\|}, x_n\right\rangle + b\right) \geq r\,, \quad r > 0\,. \qquad (12.22)$$
Equation (12.22) explicitly states that the distance $r$ is positive. (Note that $r > 0$ because we assumed linear separability, and hence there is no issue to divide by $r$.) Therefore, we can divide the first constraint by $r$, which yields
$$\max_{w', b, r}\ r^2 \quad \text{subject to} \quad y_n\!\left(\Big\langle \underbrace{\tfrac{w'}{\|w'\|\, r}}_{w''}, x_n\Big\rangle + \underbrace{\tfrac{b}{r}}_{b''}\right) \geq 1\,, \quad r > 0\,, \qquad (12.23)$$
renaming the parameters to $w''$ and $b''$. Since $w'' = \frac{w'}{\|w'\|\, r}$, rearranging for $r$ gives
$$\|w''\| = \left\|\frac{w'}{\|w'\|\, r}\right\| = \frac{1}{r}\cdot\frac{\|w'\|}{\|w'\|} = \frac{1}{r}\,. \qquad (12.24)$$
By substituting this result into (12.23) we obtain
$$\max_{w'', b''}\ \frac{1}{\|w''\|^2} \quad \text{subject to} \quad y_n\big(\langle w'', x_n\rangle + b''\big) \geq 1\,. \qquad (12.25)$$
The final step is to observe that maximizing $\frac{1}{\|w''\|^2}$ yields the same solution as minimizing $\frac{1}{2}\|w''\|^2$, which concludes the proof of Theorem 12.1.

[Figure: (a) Linearly separable data, with a large margin. (b) Non-linearly separable data.]
[Figure: The slack variable $\xi$ measures the distance of a positive example $x_+$ to the positive margin hyperplane $\langle w, x\rangle + b = 1$ when $x_+$ is on the wrong side.]
[Figure: The hinge loss $\max\{0, 1 - t\}$ as a function of $t$; it is an upper bound of the zero-one loss.]
This loss can be interpreted as never allowing any examples inside the
margin.
For a given training set {(x1 , y1 ), . . . , (xN , yN )} we seek to minimize
the total loss, while regularizing the objective with ℓ2-regularization (see
Section 8.1.3). Using the hinge loss (12.28) gives us the unconstrained
optimization problem
$$\min_{w, b}\ \underbrace{\frac{1}{2}\|w\|^2}_{\text{regularizer}} + \underbrace{C\sum_{n=1}^{N}\max\{0,\, 1 - y_n(\langle w, x_n\rangle + b)\}}_{\text{error term}}\,. \qquad (12.31)$$
The first term in (12.31) is called the regularization term or the regularizer (see Section 9.2.3), and the second term is called the loss term or the error term. Recall from Section 12.2.4 that the term $\frac{1}{2}\|w\|^2$ arises directly from the margin. In other words, margin maximization can be interpreted as regularization.
In principle, the unconstrained optimization problem in (12.31) can be
directly solved with (sub-)gradient descent methods as described in Sec-
tion 7.1. To see that (12.31) and (12.26a) are equivalent, observe that the
hinge loss (12.28) essentially consists of two linear parts, as expressed
in (12.29). Consider the hinge loss (12.28) for a single example–label pair. We can equivalently replace minimization of the hinge loss over $t$ with a minimization of a slack variable $\xi$ with two constraints. In equation form,
$$\min_{t}\ \max\{0, 1 - t\} \qquad (12.32)$$
is equivalent to
$$\min_{\xi, t}\ \xi \quad \text{subject to} \quad \xi \geq 0\,, \quad \xi \geq 1 - t\,. \qquad (12.33)$$
By substituting this expression into (12.31) and rearranging one of the
constraints, we obtain exactly the soft margin SVM (12.26a).
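As a concrete illustration of solving (12.31) directly, the following sketch applies plain subgradient descent to the hinge-loss objective; the step size, iteration count, and names are illustrative choices, not a tuned implementation.

```python
import numpy as np

# Subgradient descent on 0.5*||w||^2 + C * sum_n max{0, 1 - y_n(<w, x_n> + b)}.
def svm_subgradient(X, y, C=1.0, lr=1e-3, iters=1000):
    N, D = X.shape                               # X: N x D, y in {-1, +1}
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                     # examples with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```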
Remark. Let us contrast our choice of the loss function in this section to the
loss function for linear regression in Chapter 9. Recall from Section 9.2.1
which is a particular instance of the representer theorem (Kimeldorf and Wahba, 1970). Equation (12.38) states that the optimal weight vector in the primal is a linear combination of the examples $x_n$. Recall from Section 2.6.1 that this means that the solution of the optimization problem lies in the span of the training data. Additionally, the constraint obtained by setting (12.36) to zero implies that the optimal weight vector is an affine combination of the examples. The representer theorem turns out to hold for very general settings of regularized empirical risk minimization (Hofmann et al., 2008; Argyriou and Dinuzzo, 2014). (The representer theorem is actually a collection of theorems saying that the solution of minimizing empirical risk lies in the subspace (Section 2.4.3) defined by the examples.) The theorem has more general versions (Schölkopf et al., 2001), and necessary and sufficient conditions on its existence can be found in Yu et al. (2013).
Remark. The representer theorem (12.38) also provides an explanation of the name support vector machine. The examples $x_n$ for which the corresponding parameters $\alpha_n = 0$ do not contribute to the solution $w$ at all. The other examples, where $\alpha_n > 0$, are called support vectors since they "support" the hyperplane. ♦
By substituting the expression for $w$ into the Lagrangian (12.34), we obtain the dual
$$\mathfrak{D}(\xi, \alpha, \gamma) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle - \sum_{i=1}^{N} y_i \alpha_i \left\langle \sum_{j=1}^{N} y_j \alpha_j x_j,\ x_i\right\rangle + C\sum_{i=1}^{N}\xi_i - b\sum_{i=1}^{N} y_i\alpha_i + \sum_{i=1}^{N}\alpha_i - \sum_{i=1}^{N}\alpha_i\xi_i - \sum_{i=1}^{N}\gamma_i\xi_i\,. \qquad (12.39)$$
bilinear (see Section 3.2). Therefore, the first two terms in (12.39) are over the same objects. These terms can be simplified, and we obtain the Lagrangian
$$\mathfrak{D}(\xi, \alpha, \gamma) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle + \sum_{i=1}^{N}\alpha_i + \sum_{i=1}^{N}(C - \alpha_i - \gamma_i)\xi_i\,. \qquad (12.40)$$
The last term in this equation is a collection of all terms that contain slack variables $\xi_i$. By setting (12.37) to zero, we see that the last term in (12.40) is also zero. Furthermore, by using the same equation and recalling that the Lagrange multipliers $\gamma_i$ are non-negative, we conclude that $\alpha_i \leq C$. We now obtain the dual optimization problem of the SVM, which is expressed exclusively in terms of the Lagrange multipliers $\alpha_i$. Recall from Lagrangian duality (Definition 7.1) that we maximize the dual problem. This is equivalent to minimizing the negative dual problem, such that we end up with the dual SVM
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle - \sum_{i=1}^{N}\alpha_i$$
$$\text{subject to} \quad \sum_{i=1}^{N} y_i \alpha_i = 0\,, \quad 0 \leq \alpha_i \leq C \ \ \text{for all}\ i = 1, \ldots, N\,. \qquad (12.41)$$
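For illustration, the dual (12.41) can be handed to a general-purpose constrained optimizer; in practice a dedicated quadratic programming or SMO-type solver would be used. The sketch below (illustrative names) also recovers $w$ via (12.38).

```python
import numpy as np
from scipy.optimize import minimize

# Solve the dual SVM (12.41) with a general-purpose solver. X: N x D, y in {-1, +1}.
def dual_svm(X, y, C=1.0):
    N = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)         # Q_ij = y_i y_j <x_i, x_j>

    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    jac = lambda a: Q @ a - 1.0
    constraints = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}
    res = minimize(objective, np.zeros(N), jac=jac, bounds=[(0.0, C)] * N,
                   constraints=constraints, method="SLSQP")
    alpha = res.x
    w = (alpha * y) @ X          # representer theorem (12.38); b follows from any
    return alpha, w              # support vector with 0 < alpha_i < C
```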
Figure 12.9 Convex hulls. (a) Convex hull of points, some of which lie within the boundary; (b) Convex hulls around positive (blue) and negative (red) examples. The distance between the two convex sets is the length of the difference vector c − d.
w := c − d . (12.44)
The objective function (12.48) and the constraint (12.50), along with the
assumption that α > 0, give us a constrained (convex) optimization prob-
lem. This optimization problem can be shown to be the same as that of
the dual hard margin SVM (Bennett and Bredensteiner, 2000a).
Remark. To obtain the soft margin dual, we consider the reduced hull. The reduced hull is similar to the convex hull but has an upper bound on the size of the coefficients $\alpha$. The maximum possible value of the elements of $\alpha$ restricts the size that the convex hull can take. In other words, the bound on $\alpha$ shrinks the convex hull to a smaller volume (Bennett and Bredensteiner, 2000b). ♦
12.4 Kernels
Consider the formulation of the dual SVM (12.41). Notice that the in-
ner product in the objective occurs only between examples xi and xj .
There are no inner products between the examples and the parameters.
Therefore, if we consider a set of features φ(xi ) to represent xi , the only
change in the dual SVM will be to replace the inner product. This mod-
ularity, where the choice of the classification method (the SVM) and the
choice of the feature representation φ(x) can be considered separately,
provides flexibility for us to explore the two problems independently. In
this section we discuss the representation φ(x) and briefly introduce the
idea of kernels, but do not go into the technical details.
Since φ(x) could be a non-linear function, we can use the SVM (which
assumes a linear classifier) to construct classifiers that are nonlinear in
the examples xn . This provides a second avenue, in addition to the soft
margin, for users to deal with a dataset that is not linearly separable. It
turns out that there are many algorithms and statistical methods, which
have this property that we observed in the dual SVM: the only inner prod-
ucts are those that occur between examples. Instead of explicitly defining
a non-linear feature map φ(·) and computing the resulting inner product
between examples xi and xj , we define a similarity function k(xi , xj ) be-
tween $x_i$ and $x_j$. For a certain class of similarity functions, called kernels, the similarity function implicitly defines a non-linear feature map $\phi(\cdot)$. Kernels are by definition functions $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for which there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that
$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle_{\mathcal{H}}\,. \qquad (12.52)$$
(The inputs $\mathcal{X}$ of the kernel function can be very general and are not necessarily restricted to $\mathbb{R}^D$.)
[Figure 12.10: SVM with different kernels; although the decision boundary is nonlinear, the underlying problem being solved is for a linear separating hyperplane (albeit with a nonlinear kernel).]
and Williams, 2006). Figure 12.10 illustrates the effect of different kernels
on separating hyperplanes on an example dataset. Note that we are still
solving for hyperplanes, that is, the hypothesis class of functions is still linear. The non-linear surfaces are due to the kernel function.
Remark. Unfortunately for the fledgling machine learner, there are multi-
ple meanings of the word kernel. In this chapter, the word kernel comes
from the idea of the Reproducing Kernel Hilbert Space (RKHS) (Aron-
szajn, 1950; Saitoh, 1988). We have discussed the idea of the kernel in
linear algebra (Section 2.7.3), where the kernel is another word for the
null space. The third common use of the word kernel in machine learning
is the smoothing kernel in kernel density estimation (Section 11.5). ♦
Since the explicit representation φ(x) is mathematically equivalent to
the kernel representation k(xi , xj ) a practitioner will often design the
kernel function, such that it can be computed more efficiently than the
inner product between explicit feature maps. For example, consider the
polynomial kernel (Schölkopf and Smola, 2002), where the number of
terms in the explicit expansion grows very quickly (even for polynomials
of low degree) when the input dimension is large. The kernel function
only requires one multiplication per input dimension, which can provide
significant computational savings. Another example is the Gaussian ra-
dial basis function kernel (Schölkopf and Smola, 2002; Rasmussen and
Williams, 2006) where the corresponding feature space is infinite dimen-
sional. In this case, we cannot explicitly represent the feature space but
can still compute similarities between a pair of examples using the kernel. (The choice of kernel, as well as the parameters of the kernel, are often chosen using nested cross-validation, Section 8.5.1.)
Another useful aspect of the kernel trick is that there is no need for the original data to be already represented as multivariate real-valued data. Note that the inner product is defined on the output of the function $\phi(\cdot)$, but does not restrict the input to real numbers. Hence, the function $\phi(\cdot)$ and the kernel function $k(\cdot, \cdot)$ can be defined on any object, e.g., sets, sequences, strings, graphs, and distributions (Ben-Hur et al., 2008; Gärtner, 2008; Shi et al., 2009; Vishwanathan et al., 2010; Sriperumbudur et al., 2010).
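Two commonly used kernels can be sketched as follows; the polynomial degree, the offset +1, and the RBF bandwidth $\gamma$ are illustrative parameter choices. The resulting Gram matrix is what replaces the inner products $\langle x_i, x_j\rangle$ in the dual SVM (12.41).

```python
import numpy as np

# One common parameterization of the polynomial kernel: (<x, y> + 1)^degree.
def polynomial_kernel(A, B, degree=3):
    return (A @ B.T + 1.0) ** degree

# Gaussian radial basis function (RBF) kernel: exp(-gamma * ||x - y||^2).
def rbf_kernel(A, B, gamma=0.5):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(0).normal(size=(5, 2))
K_poly = polynomial_kernel(X, X)      # 5 x 5 Gram matrix
K_rbf = rbf_kernel(X, X)              # symmetric, positive semidefinite
```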
Using this subgradient above, we can apply the optimization methods pre-
sented in Section 7.1.
Both the primal and the dual SVM result in a convex quadratic pro-
gramming problem (constrained optimization). Note that the primal SVM
in (12.26a) has optimization variables that have the size of the dimen-
sion D of the input examples. The dual SVM in (12.41) has optimization
variables that have the size of the number N of examples.
To express the primal SVM in the standard form (7.45) for quadratic programming, let us assume that we use the dot product (3.5) as the inner product. (Recall from Section 3.2 that we use the phrase dot product to mean the inner product on Euclidean vector space.) We rearrange the equation for the primal SVM (12.26a), such that the optimization variables are all on the right and the inequality of the constraint matches the standard form. This yields the optimization
$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N}\xi_n$$
$$\text{subject to} \quad -y_n x_n^\top w - y_n b - \xi_n \leq -1\,, \quad -\xi_n \leq 0\,, \qquad (12.55)$$
for $n = 1, \ldots, N$. By concatenating the variables $w, b, \xi_n$ into a single vector, and carefully collecting the terms, we obtain the following matrix form of the soft margin SVM.
$$\min_{w, b, \xi}\ \frac{1}{2}\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}^\top \begin{bmatrix} I_D & 0_{D, N+1} \\ 0_{N+1, D} & 0_{N+1, N+1}\end{bmatrix}\begin{bmatrix} w \\ b \\ \xi \end{bmatrix} + \begin{bmatrix} 0_{D+1, 1} \\ C\,1_{N,1}\end{bmatrix}^\top \begin{bmatrix} w \\ b \\ \xi \end{bmatrix}$$
$$\text{subject to} \quad \begin{bmatrix} -YX & -y & -I_N \\ 0_{N, D} & 0_{N, 1} & -I_N\end{bmatrix}\begin{bmatrix} w \\ b \\ \xi \end{bmatrix} \leq \begin{bmatrix} -1_{N,1} \\ 0_{N,1}\end{bmatrix}. \qquad (12.56)$$
In the above optimization problem, the minimization is over $[w^\top, b, \xi^\top]^\top \in \mathbb{R}^{D+1+N}$, and we use the notation: $I_m$ to represent the identity matrix of size $m \times m$, $0_{m,n}$ to represent the matrix of zeros of size $m \times n$, and $1_{m,n}$ to represent the matrix of ones of size $m \times n$. In addition, $y$ is the vector of labels $[y_1, \ldots, y_N]^\top$, and $Y = \operatorname{diag}(y)$ is an $N$ by $N$ matrix with the elements of $y$ on its diagonal.
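A sketch assembling the matrices of (12.56) with the notation above; the returned quadruple corresponds to a generic QP formulation $\min \frac{1}{2}v^\top P v + q^\top v$ subject to $Gv \leq h$ and can be handed to any off-the-shelf QP solver (names illustrative).

```python
import numpy as np

# Build the quadratic program of the primal soft margin SVM, cf. (12.56).
def primal_svm_qp_matrices(X, y, C=1.0):
    N, D = X.shape                                     # X: N x D, y in {-1, +1}
    P = np.zeros((D + 1 + N, D + 1 + N))               # quadratic term: only the w-block
    P[:D, :D] = np.eye(D)
    q = np.concatenate([np.zeros(D + 1), C * np.ones(N)])  # linear term: C on the slacks
    Y = np.diag(y)
    G = np.block([[-Y @ X, -y[:, None], -np.eye(N)],   # inequality constraints G v <= h
                  [np.zeros((N, D + 1)), -np.eye(N)]])
    h = np.concatenate([-np.ones(N), np.zeros(N)])
    return P, q, G, h
```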
relationship between the loss function and the likelihood (also compare Sec-
tion 8.1 and Section 8.2). The maximum likelihood approach correspond-
ing to a well calibrated transformation during training is called logistic
regression, which comes from a class of methods called generalized linear
models. Details of logistic regression from this point of view can be found
in Agresti (2002, Chapter 5) and McCullagh and Nelder (1989, Chapter
4). Naturally, one could take a more Bayesian view of the classifier out-
put by estimating a posterior distribution using Bayesian logistic regres-
sion. The Bayesian view also includes the specification of the prior, which
includes design choices such as conjugacy (Section 6.6.1) with the like-
lihood. Additionally, one could consider latent functions as priors, which
results in Gaussian process classification (Rasmussen and Williams, 2006,
Chapter 3).
Berlinet, Alain, and Thomas-Agnan, Christine. 2004. Reproducing Kernel Hilbert Spaces
in Probability and Statistics. Springer.
Bertsekas, Dimitri P. 1999. Nonlinear Programming. Athena Scientific.
Bertsekas, Dimitri P. 2009. Convex Optimization Theory. Athena Scientific.
Betancourt, Michael. 2018. Probability Theory (for Scientists and Engineers). https:
//betanalpha.github.io/assets/case_studies/probability_theory.html.
Bickel, Peter J., and Doksum, Kjell. 2006. Mathematical Statistics, Basic Ideas and
Selected Topics. Vol. 1. Prentice Hall.
Bickson, Danny, Dolev, Danny, Shental, Ori, Siegel, Paul H., and Wolf, Jack K. 2007.
Linear Detection via Belief Propagation. In: Proceedings of the Annual Allerton Con-
ference on Communication, Control, and Computing.
Billingsley, Patrick. 1995. Probability and Measure. Wiley.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Clarendon
Press.
Bishop, Christopher M. 1999. Bayesian PCA. In: Advances in Neural Information Pro-
cessing Systems.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Blei, David M., Kucukelbir, Alp, and McAuliffe, Jon D. 2017. Variational Inference: A
Review for Statisticians. Journal of the American Statistical Association, 112(518),
859–877.
Blum, Avrim, and Hardt, Moritz. 2015. The Ladder: A Reliable Leaderboard for Ma-
chine Learning Competitions. In: International Conference on Machine Learning.
Bonnans, J. Frédéric, Gilbert, J. Charles, Lemaréchal, Claude, and Sagastizábal, Clau-
dia A. 2006. Numerical Optimization: Theoretical and Practical Aspects. 2nd edn.
Springer Verlag.
Borwein, Jonathan M., and Lewis, Adrian S. 2006. Convex Analysis and Nonlinear
Optimization. 2nd edn. Canadian Mathematical Society.
Bottou, Léon. 1998. Online Algorithms and Stochastic Approximations. In: Online
Learning and Neural Networks. Cambridge University Press.
Bottou, Léon, Curtis, Frank E, and Nocedal, Jorge. 2018. Optimization Methods for
Large-scale Machine Learning. SIAM Review, 60(2), 223–311.
Boucheron, Stephane, Lugosi, Gabor, and Massart, Pascal. 2013. Concentration In-
equalities: A Nonasymptotic Theory of Independence. Oxford University Press.
Boyd, Stephen, and Vandenberghe, Lieven. 2004. Convex Optimization. Cambridge
University Press.
Boyd, Stephen, and Vandenberghe, Lieven. 2018. Introduction to Applied Linear Alge-
bra. Cambridge University Press.
Brochu, Eric, Cora, Vlad M., and de Freitas, Nando. 2009. A Tutorial on Bayesian
Optimization of Expensive Cost Functions, with Application to Active User Modeling
and Hierarchical Reinforcement Learning. Tech. rept. TR-2009-023. Department of
Computer Science, University of British Columbia.
Brooks, Steve, Gelman, Andrew, Jones, Galin L., and Meng, Xiao-Li (eds). 2011. Hand-
book of Markov Chain Monte Carlo. Chapman and Hall/CRC.
Brown, Lawrence D. 1986. Fundamentals of Statistical Exponential Families: With Ap-
plications in Statistical Decision Theory. Lecture Notes - Monograph Series. Institute
of Mathematical Statistics.
Bryson, Arthur E. 1961. A Gradient Method for Optimizing Multi-stage Allocation
Processes. In: Proceedings of the Harvard University Symposium on Digital Computers
and Their Applications.
Bubeck, Sébastien. 2015. Convex Optimization: Algorithms and Complexity. Founda-
tions and Trends in Machine Learning, 8(3-4), 231–357.
Bühlmann, Peter, and Geer, Sara Van De. 2011. Statistics for High-Dimensional Data.
Springer.
Drumm, Volker, and Weil, Wolfgang. 2001. Lineare Algebra und Analytische Geometrie.
Lecture Notes, Universität Karlsruhe (TH).
Dudley, Richard M. 2002. Real Analysis and Probability. Cambridge University Press.
Eaton, Morris L. 2007. Multivariate Statistics: A Vector Space Approach. Vol. 53. Insti-
tute of Mathematical Statistics Lecture Notes—Monograph Series.
Eckart, Carl, and Young, Gale. 1936. The Approximation of One Matrix by Another of
Lower Rank. Psychometrika, 1(3), 211–218.
Efron, Bradley, and Hastie, Trevor. 2016. Computer Age Statistical Inference: Algorithms,
Evidence and Data Science. Cambridge University Press.
Efron, Bradley, and Tibshirani, Robert J. 1993. An Introduction to the Bootstrap. Chap-
man and Hall/CRC.
Elliott, Conal. 2009. Beautiful Differentiation. In: International Conference on Func-
tional Programming.
Evgeniou, Theodoros, Pontil, Massimiliano, and Poggio, Tomaso. 2000. Statistical
Learning Theory: A Primer. International Journal of Computer Vision, 38(1), 9–13.
Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen.
2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine
Learning Research, 9, 1871–1874.
Gal, Yarin, van der Wilk, Mark, and Rasmussen, Carl E. 2014. Distributed Variational
Inference in Sparse Gaussian Process Regression and Latent Variable Models. In:
Advances in Neural Information Processing Systems.
Gärtner, Thomas. 2008. Kernels for Structured Data. World Scientific.
Gavish, Matan, and Donoho, David L. 2014. The Optimal Hard Threshold for Singular Values is 4/√3. IEEE Transactions on Information Theory, 60(8), 5040–5053.
Gelman, Andrew, Carlin, John B., Stern, Hal S., and Rubin, Donald B. 2004. Bayesian
Data Analysis. Second. Chapman & Hall/CRC.
Gentle, James E. 2004. Random Number Generation and Monte Carlo Methods. 2nd
edn. Springer.
Ghahramani, Zoubin. 2015. Probabilistic Machine Learning and Artificial Intelligence.
Nature, 521, 452–459.
Ghahramani, Zoubin, and Roweis, Sam T. 1999. Learning Nonlinear Dynamical Sys-
tems using an EM Algorithm. In: Advances in Neural Information Processing Systems.
MIT Press.
Gilks, Walter R., Richardson, Sylvia, and Spiegelhalter, David J. 1996. Markov Chain
Monte Carlo in Practice. Chapman & Hall.
Gneiting, Tilmann, and Raftery, Adrian E. 2007. Strictly Proper Scoring Rules, Pre-
diction, and Estimation. Journal of the American Statistical Association, 102(477),
359–378.
Goh, Gabriel. 2017. Why Momentum Really Works. Distill.
Gohberg, Israel, Goldberg, Seymour, and Krupnik, Nahum. 2012. Traces and Determi-
nants of Linear Operators. Vol. 116. Birkhäuser.
Golan, Jonathan S. 2007. The Linear Algebra a Beginning Graduate Student Ought to
Know. 2nd edn. Springer.
Golub, Gene H., and Van Loan, Charles F. 2012. Matrix Computations. Vol. 4. JHU
Press.
Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. 2016. Deep Learning. MIT
Press.
Graepel, Thore, Candela, Joaquin Quiñonero-Candela, Borchert, Thomas, and Her-
brich, Ralf. 2010. Web-scale Bayesian Click-through Rate Prediction for Sponsored
Search Advertising in Microsoft’s Bing Search Engine. In: Proceedings of the Interna-
tional Conference on Machine Learning.
Griewank, Andreas, and Walther, Andrea. 2003. Introduction to Automatic Differenti-
ation. In: Proceedings in Applied Mathematics and Mechanics.
Griewank, Andreas, and Walther, Andrea. 2008. Evaluating Derivatives, Principles and
Techniques of Algorithmic Differentiation. second edn. SIAM, Philadelphia.
Grimmett, Geoffrey, and Welsh, Dominic. 2014. Probability: an Introduction. 2nd edn.
Oxford University Press.
Grinstead, Charles M., and Snell, J. Laurie. 1997. Introduction to Probability. American
Mathematical Society.
Hacking, Ian. 2001. Probability and Inductive Logic. Cambridge University Press.
Hall, Peter. 1992. The Bootstrap and Edgeworth Expansion. Springer.
Hallin, Marc, Paindaveine, Davy, and Šiman, Miroslav. 2010. Multivariate quantiles
and multiple-output regression quantiles: from ℓ1 optimization to halfspace depth.
Annals of Statistics, 38, 635–669.
Hasselblatt, Boris, and Katok, Anatole. 2003. A first course in dynamics with a
Panorama of Recent Developments. Cambridge University Press.
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. 2001. The Elements of Statis-
tical Learning—Data Mining, Inference, and Prediction. Springer Series in Statistics.
175 Fifth Avenue, New York City, NY, USA: Springer-Verlag New York, Inc.
Hausman, Karol, Springenberg, Jost T., Wang, Ziyu, Heess, Nicolas, and Riedmiller,
Martin. 2018. Learning an Embedding Space for Transferable Robot Skills. In:
Proceedings of the International Conference on Learning Representations.
Hazan, Elad. 2015. Introduction to Online Convex Optimization. Foundations and
Trends in Optimization, 2(3-4), 157–325.
Hensman, James, Fusi, Nicolò, and Lawrence, Neil D. 2013. Gaussian Processes for
Big Data. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Herbrich, Ralf, Minka, Tom, and Graepel, Thore. 2007. TrueSkill(TM): A Bayesian
Skill Rating System. In: Advances in Neural Information Processing Systems.
Hiriart-Urruty, Jean-Baptiste, and Lemaréchal, Claude. 2001. Fundamentals of Convex
Analysis. Springer.
Hoffman, Matthew D., Blei, David M., and Bach, Francis. 2010. Online Learning for
Latent Dirichlet Allocation. Advances in Neural Information Processing Systems.
Hoffman, Matthew D., Blei, David M., Wang, Chong, and Paisley, John. 2013. Stochas-
tic Variational Inference. Journal of Machine Learning Research, 14(1), 1303–1347.
Hofmann, Thomas, Schölkopf, Bernhard, and Smola, Alexander J. 2008. Kernel Meth-
ods in Machine Learning. Annals of Statistics, 36(3), 1171–1220.
Hogben, Leslie. 2013. Handbook of Linear Algebra. 2nd edn. Chapman and Hall/CRC.
Horn, Roger A., and Johnson, Charles R. 2013. Matrix Analysis. Cambridge University
Press.
Hotelling, Harold. 1933. Analysis of a Complex of Statistical Variables into Principal
Components. Journal of Educational Psychology, 24, 417–441.
Hyvarinen, Aapo, Oja, Erkki, and Karhunen, Juha. 2001. Independent Component Anal-
ysis. Wiley.
Imbens, Guido W., and Rubin, Donald B. 2015. Causal Inference for Statistics, Social
and Biomedical Sciences. Cambridge University Press.
Jacod, Jean, and Protter, Philip. 2004. Probability Essentials. 2nd edn. Springer.
Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University
Press.
Jefferys, Willian H., and Berger, James O. 1992. Ockham’s Razor and Bayesian Analy-
sis. American Scientist, 80, 64–72.
Jeffreys, Harold. 1961. Theory of Probability. 3rd edn. Oxford University Press.
Jimenez Rezende, Danilo, Mohamed, Shakir, and Wierstra, Daan. 2014. Stochastic
Backpropagation and Approximate Inference in Deep Generative Models. In: Pro-
ceedings of the International Conference on Machine Learning.
Joachims, Thorsten. 1999. Advances in Kernel Methods—Support Vector Learning. MIT
Press. Chap. Making Large-Scale SVM Learning Practical, pages 169–184.
Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K.
1999. An Introduction to Variational Methods for Graphical Models. Machine Learn-
ing, 37, 183–233.
Julier, Simon J., and Uhlmann, Jeffrey K. 1997. A New Extension of the Kalman Filter
to Nonlinear Systems. In: Proceedings of AeroSense Symposium on Aerospace/Defense
Sensing, Simulation and Controls.
Kaiser, Marcus, and Hilgetag, Claus C. 2006. Nonoptimal Component Placement, but
Short Processing Paths, due to Long-Distance Projections in Neural Systems. PLoS
Computational Biology, 2(7), e95.
Kalman, Dan. 1996. A Singularly Valuable Decomposition: The SVD of a Matrix. The
College Mathematics Journal, 27(1), 2–23.
Kalman, Rudolf E. 1960. A New Approach to Linear Filtering and Prediction Problems.
Transactions of the ASME—Journal of Basic Engineering, 82(Series D), 35–45.
Kamthe, Sanket, and Deisenroth, Marc P. 2018. Data-Efficient Reinforcement Learning
with Probabilistic Model Predictive Control. In: Proceedings of the International
Conference on Artificial Intelligence and Statistics.
Katz, Victor J. 2004. A History of Mathematics. Pearson/Addison-Wesley.
Kelley, Henry J. 1960. Gradient Theory of Optimal Flight Paths. Ars Journal, 30(10),
947–954.
Kimeldorf, George S., and Wahba, Grace. 1970. A Correspondence Between Bayesian
Estimation on Stochastic Processes and Smoothing by Splines. Annals of Mathemat-
ical Statistics, 41(2), 495–502.
Kingma, Diederik, and Ba, Jimmy. 2014. Adam: A Method for Stochastic Optimization.
Proceedings of the International Conference on Learning Representations, 1–13.
Kingma, Diederik P., and Welling, Max. 2014. Auto-Encoding Variational Bayes. In:
Proceedings of the International Conference on Learning Representations.
Kittler, J., and Föglein, J. 1984. Contextual Classification of Multispectral Pixel Data.
Image and Vision Computing, 2(1), 13–29.
Kolda, Tamara G., and Bader, Brett W. 2009. Tensor Decompositions and Applications.
SIAM Review, 51(3), 455–500.
Koller, Daphne, and Friedman, Nir. 2009. Probabilistic Graphical Models. MIT Press.
Kong, Linglong, and Mizera, Ivan. 2012. Quantile Tomography: Using Quantiles with
Multivariate Data. Statistica Sinica, 22, 1598–1610.
Lang, Serge. 1987. Linear Algebra. Springer.
Lawrence, Neil. 2005. Probabilistic Non-linear Principal Component Analysis with
Gaussian Process Latent Variable Models. Journal of Machine Learning Research,
6(Nov.), 1783–1816.
Leemis, Lawrence M., and McQueston, Jacquelyn T. 2008. Univariate Distribution
Relationships. The American Statistician, 62(1), 45–53.
Lehmann, Erich L., and Romano, Joseph P. 2005. Testing Statistical Hypotheses.
Springer.
Lehmann, Erich Leo, and Casella, George. 1998. Theory of Point Estimation. Springer.
Liesen, Jörg, and Mehrmann, Volker. 2015. Linear Algebra. Springer.
Lin, Hsuan-Tien, Lin, Chih-Jen, and Weng, Ruby C. 2007. A Note on Platt’s Probabilistic
Outputs for Support Vector Machines. Machine Learning, 68, 267–276.
Ljung, Lennart. 1999. System Identification: Theory for the User. Prentice Hall.
Loosli, Gaëlle, Canu, Stéphane, and Ong, Cheng S. 2016. Learning SVM in Kreı̆n
Spaces. IEEE Transactions of Pattern Analysis and Machine Intelligence, 38(6), 1204–
1216.
Luenberger, David G. 1969. Optimization by Vector Space Methods. Wiley.
MacKay, David J. C. 2003. Information Theory, Inference, and Learning Algorithms.
Cambridge University Press.
MacKay, David J. C. 1992. Bayesian Interpolation. Neural Computation, 4, 415–447.
Ong, Cheng S., Mary, Xavier, Canu, Stéphane, and Smola, Alexander J. 2004. Learn-
ing with Non-positive Kernels. Pages 639–646 of: Proceedings of the International
Conference on Machine Learning.
Ormoneit, Dirk, Sidenbladh, Hedvig, Black, Michael J., and Hastie, Trevor. 2001.
Learning and Tracking Cyclic Human Motion. In: Advances in Neural Information
Processing Systems.
Page, Lawrence, Brin, Sergey, Motwani, Rajeev, and Winograd, Terry. 1999. The PageR-
ank Citation Ranking: Bringing Order to the Web. Tech. rept. Stanford InfoLab.
Paquet, Ulrich. 2008. Bayesian Inference for Latent Variable Models. Ph.D. thesis, Uni-
versity of Cambridge.
Parzen, Emanuel. 1962. On Estimation of a Probability Density Function and Mode.
The Annals of Mathematical Statistics, 33(3), 1065–1076.
Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann.
Pearl, Judea. 2009. Causality: Models, Reasoning and Inference. 2nd edn. Cambridge
University Press.
Pearson, Karl. 1895. Contributions to the Mathematical Theory of Evolution. II. Skew
Variation in Homogeneous Material. Philosophical Transactions of the Royal Society
A: Mathematical, Physical and Engineering Sciences, 186, 343–414.
Pearson, Karl. 1901. On Lines and Planes of Closest Fit to Systems of Points in Space.
Philosophical Magazine, 2(11), 559–572.
Peters, Jonas, Janzing, Dominik, and Schölkopf, Bernhard. 2017. Elements of Causal
Inference: Foundations and Learning Algorithms. MIT Press.
Petersen, K. B., and Pedersen, M. S. 2012. The Matrix Cookbook. Tech. rept. Technical University of Denmark. Version 20121115.
Platt, John C. 2000. Probabilistic Outputs for Support Vector Machines and Compar-
isons to Regularized Likelihood Methods. In: Advances in Large Margin Classifiers.
Pollard, David. 2002. A User’s Guide to Measure Theoretic Probability. Cambridge
University Press.
Polyak, Roman A. 2016. The Legendre Transformation in Modern Optimization. Pages
437–507 of: Optimization and Its Applications in Control and Data Sciences. Springer.
Press, William H., Teukolsky, Saul A., Vetterling, William T., and Flannery, Brian P.
2007. Numerical Recipes: The Art of Scientific Computing. 3rd edn. Cambridge Uni-
versity Press.
Proschan, Michael A., and Presnell, Brett. 1998. Expect the Unexpected from Conditional Expectation. The American Statistician, 52(3), 248–252.
Raschka, Sebastian, and Mirjalili, Vahid. 2017. Python Machine Learning: Machine
Learning and Deep Learning with Python, scikit-learn, and TensorFlow. Packt Publish-
ing.
Rasmussen, Carl E., and Ghahramani, Zoubin. 2001. Occam’s Razor. In: Advances in
Neural Information Processing Systems.
Rasmussen, Carl E., and Ghahramani, Zoubin. 2003. Bayesian Monte Carlo. In: Ad-
vances in Neural Information Processing Systems.
Rasmussen, Carl E., and Williams, Christopher K. I. 2006. Gaussian Processes for Machine Learning. MIT Press.
Reid, Mark, and Williamson, Robert C. 2011. Information, Divergence and Risk for
Binary Experiments. Journal of Machine Learning Research, 12, 731–817.
Rezende, Danilo J., and Mohamed, Shakir. 2015. Variational Inference with Normal-
izing Flows. In: Proceedings of the International Conference on Machine Learning.
Rifkin, Ryan M., and Lippert, Ross A. 2007. Value Regularization and Fenchel Duality.
Journal of Machine Learning Research, 8, 441–479.
Rockafellar, R. Tyrrell. 1970. Convex Analysis. Princeton University Press.
Rogers, Simon, and Girolami, Mark. 2016. A First Course in Machine Learning. 2nd
edn. Chapman and Hall/CRC.
Rosenbaum, Paul R. 2017. Observation & Experiment: An Introduction to Causal Infer-
ence. Harvard University Press.
Rosenblatt, Murray. 1956. Remarks on Some Nonparametric Estimates of a Density
Function. The Annals of Mathematical Statistics, 27(3), 832–837.
Roweis, Sam, and Ghahramani, Zoubin. 1999. A Unifying Review of Linear Gaussian
Models. Neural Computation, 11(2), 305–345.
Roweis, Sam T. 1998. EM Algorithms for PCA and SPCA. Pages 626–632 of: Advances
in Neural Information Processing Systems.
Roy, Anindya, and Banerjee, Sudipto. 2014. Linear Algebra and Matrix Analysis for
Statistics. Chapman and Hall/CRC.
Rubinstein, Reuven Y., and Kroese, Dirk P. 2016. Simulation and the Monte Carlo
Method. Vol. 10. Wiley.
Ruffini, Paolo. 1799. Teoria Generale delle Equazioni, in cui si Dimostra Impossibile la
Soluzione Algebraica delle Equazioni Generali di Grado Superiore al Quarto. Stampe-
ria di S. Tommaso d’Aquino.
Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. 1986. Learning
Representations by Back-propagating Errors. Nature, 323(6088), 533–536.
Sæmundsson, Steindór, Hofmann, Katja, and Deisenroth, Marc P. 2018. Meta Rein-
forcement Learning with Latent Variable Gaussian Processes. In: Proceedings of the
Conference on Uncertainty in Artificial Intelligence.
Saitoh, Saburou. 1988. Theory of Reproducing Kernels and its Applications. Longman
Scientific & Technical.
Schölkopf, Bernhard, and Smola, Alexander J. 2002. Learning with Kernels—Support
Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Schölkopf, Bernhard, Smola, Alexander J., and Müller, Klaus-Robert. 1997. Kernel
Principal Component Analysis. In: Proceedings of the International Conference on
Artificial Neural Networks. Springer.
Schölkopf, Bernhard, Smola, Alexander J., and Müller, Klaus-Robert. 1998. Nonlinear
Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5),
1299–1319.
Schölkopf, Bernhard, Herbrich, Ralf, and Smola, Alexander J. 2001. A Generalized
Representer Theorem. In: Proceedings of the International Conference on Computa-
tional Learning Theory.
Schwartz, Laurent. 1964. Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés. Journal d'Analyse Mathématique, 13, 115–256. In French.
Schwarz, Gideon E. 1978. Estimating the Dimension of a Model. Annals of Statistics,
6(2), 461–464.
Shahriari, Bobak, Swersky, Kevin, Wang, Ziyu, Adams, Ryan P., and De Freitas, Nando.
2016. Taking the Human out of the Loop: A Review of Bayesian Optimization.
Proceedings of the IEEE, 104(1), 148–175.
Shalev-Shwartz, Shai, and Ben-David, Shai. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Shawe-Taylor, John, and Cristianini, Nello. 2004. Kernel Methods for Pattern Analysis.
Cambridge University Press.
Shawe-Taylor, John, and Sun, Shiliang. 2011. A Review of Optimization Methodologies
in Support Vector Machines. Neurocomputing, 74(17), 3609–3618.
Shental, O., Bickson, D., Siegel, P. H., Wolf, J. K., and Dolev, D. 2008. Gaussian Belief Propagation Solver for Systems of Linear Equations. In: Proceedings of the International Symposium on Information Theory.
Shewchuk, Jonathan R. 1994. An Introduction to the Conjugate Gradient Method With-
out the Agonizing Pain.
Shi, Jianbo, and Malik, Jitendra. 2000. Normalized Cuts and Image Segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Shi, Qinfeng, Petterson, James, Dror, Gideon, Langford, John, Smola, Alex, and Vish-
wanathan, S.V.N. 2009. Hash Kernels for Structured Data. Journal of Machine
Learning Research, 2615–2637.
Shiryayev, A. N. 1984. Probability. Springer.
Shor, Naum Z. 1985. Minimization Methods for Non-differentiable Functions. Springer.
Shotton, Jamie, Winn, John, Rother, Carsten, and Criminisi, Antonio. 2006. TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation. In: Proceedings of the European Conference on Computer Vision.
Smith, Adrian F. M., and Spiegelhalter, David. 1980. Bayes Factors and Choice Criteria
for Linear Models. Journal of the Royal Statistical Society B, 42(2), 213–220.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In: Advances in Neural Information Processing Systems.
Spearman, Charles. 1904. “General Intelligence,” Objectively Determined and Mea-
sured. American Journal of Psychology, 15(2), 201–292.
Sriperumbudur, Bharath K., Gretton, Arthur, Fukumizu, Kenji, Schölkopf, Bernhard,
and Lanckriet, Gert R. G. 2010. Hilbert Space Embeddings and Metrics on Proba-
bility Measures. Journal of Machine Learning Research, 11, 1517–1561.
Steinwart, Ingo. 2007. How to Compare Different Loss Functions and Their Risks.
Constructive Approximation, 26, 225–287.
Steinwart, Ingo, and Christmann, Andreas. 2008. Support Vector Machines. Springer.
Stoer, Josef, and Bulirsch, Roland. 2002. Introduction to Numerical Analysis. Springer.
Strang, Gilbert. 1993. The Fundamental Theorem of Linear Algebra. The American
Mathematical Monthly, 100(9), 848–855.
Strang, Gilbert. 2003. Introduction to Linear Algebra. 3rd edn. Wellesley-Cambridge
Press.
Stray, Jonathan. 2016. The Curious Journalist’s Guide to Data. Tow Center for Digital
Journalism at Columbia’s Graduate School of Journalism.
Strogatz, Steven. 2014. Writing about Math for the Perplexed and the Traumatized.
Notices of the American Mathematical Society, 61(3), 286–291.
Sucar, Luis E., and Gillies, Duncan F. 1994. Probabilistic Reasoning in High-Level
Vision. Image and Vision Computing, 12(1), 42–60.
Szeliski, Richard, Zabih, Ramin, Scharstein, Daniel, Veksler, Olga, Kolmogorov,
Vladimir, Agarwala, Aseem, Tappen, Marshall, and Rother, Carsten. 2008. A Com-
parative Study of Energy Minimization Methods for Markov Random Fields with
Smoothness-based Priors. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 30(6), 1068–1080.
Tandra, Haryono. 2014. The Relationship Between the Change of Variable Theorem
and The Fundamental Theorem of Calculus for the Lebesgue Integral. Teaching of
Mathematics, 17(2), 76–83.
Tenenbaum, Joshua B., De Silva, Vin, and Langford, John C. 2000. A Global Geometric
Framework for Nonlinear Dimensionality Reduction. Science, 290(5500), 2319–
2323.
Tibshirani, Robert. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58(1), 267–288.
Tipping, Michael E., and Bishop, Christopher M. 1999. Probabilistic Principal Compo-
nent Analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611–622.
Titsias, Michalis K., and Lawrence, Neil D. 2010. Bayesian Gaussian Process Latent
Variable Model. In: Proceedings of the International Conference on Artificial Intelli-
gence and Statistics. JMLR W&CP, vol. 9.