List of illustrations iv
Foreword 1
2 Linear Algebra 17
2.1 Systems of Linear Equations 19
2.2 Matrices 22
2.3 Solving Systems of Linear Equations 27
2.4 Vector Spaces 35
2.5 Linear Independence 40
2.6 Basis and Rank 44
2.7 Linear Mappings 48
2.8 Affine Spaces 61
2.9 Further Reading 63
Exercises 63
3 Analytic Geometry 70
3.1 Norms 71
3.2 Inner Products 72
3.3 Lengths and Distances 75
3.4 Angles and Orthogonality 76
3.5 Orthonormal Basis 78
3.6 Orthogonal Complement 79
3.7 Inner Product of Functions 80
3.8 Orthogonal Projections 81
3.9 Rotations 91
3.10 Further Reading 94
Exercises 95
4 Matrix Decompositions 98
References 395
Foreword
Contributors
We are grateful to many people who looked at early drafts of the book and suffered through painful expositions of concepts. We tried to implement their ideas that we did not violently disagree with. We would like to especially acknowledge Christfried Webers for his careful reading of many parts of the book, and his detailed suggestions on structure and presentation. Many friends and colleagues have also been kind enough to provide their time and energy on different versions of each chapter. We have been lucky to benefit from the generosity of the online community, who have suggested improvements via github.com, which greatly improved the book.
The following people have found bugs, proposed clarifications, and suggested relevant literature, either via github.com or personal communication. Their names are sorted alphabetically.
Contributors through github, whose real names were not listed on their github profile, are:
Table of Symbols
Part I
Mathematical Foundations
1
Introduction and Motivation
Since machine learning is inherently data driven, data is at the core of machine learning. The goal of machine learning is to design general-purpose methodologies to extract valuable patterns from data, ideally without much domain-specific expertise. For example, given a large corpus of documents (e.g., books in many libraries), machine learning methods can be used to automatically find relevant topics that are shared across documents (Hoffman et al., 2010). To achieve this goal, we design models that are typically related to the process that generates data, similar to the dataset we are given. For example, in a regression setting, the model would describe a function that maps inputs to real-valued outputs. To paraphrase Mitchell (1997): A model is said to learn from data if its performance on a given task improves after the data is taken into account. The goal is to find good models that generalize well to yet unseen data, which we may care about in the future. Learning can be understood as a way to automatically find patterns and structure in data by optimizing the parameters of the model.
While machine learning has seen many success stories, and software is readily available to design and train rich and flexible machine learning systems, we believe that the mathematical foundations of machine learning are important in order to understand fundamental principles upon which more complicated machine learning systems are built. Understanding these principles can facilitate creating new machine learning solutions, understanding and debugging existing approaches, and learning about the inherent assumptions and limitations of the methodologies we are working with.
[Figure: The mathematical foundations (linear algebra, analytic geometry, matrix decomposition, vector calculus, probability & distributions, optimization) supporting the four pillars of machine learning: regression, dimensionality reduction, density estimation, and classification.]
between the two parts of the book to link mathematical concepts with
machine learning algorithms.
Of course there are more than two ways to read this book. Most readers learn using a combination of top-down and bottom-up approaches, sometimes building up basic mathematical skills before attempting more complex concepts, but also choosing topics based on applications of machine learning.
2
Linear Algebra
[Figure 2.1: Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials.]
[Figure 2.2: A mind map of the concepts introduced in this chapter (vector, vector space, Abelian group with +, closure, matrix, system of linear equations, Gaussian elimination, matrix inverse, linear independence, basis, linear/affine mapping), along with where they are used in other parts of the book: Chapter 3 (Analytic Geometry), Chapter 5 (Vector Calculus), Chapter 10 (Dimensionality Reduction), and Chapter 12 (Classification).]
Example 2.1
A company produces products N1 , . . . , Nn for which resources
R1 , . . . , Rm are required. To produce a unit of product Nj , aij units of
resource Ri are needed, where i = 1, . . . , m and j = 1, . . . , n.
The objective is to find an optimal production plan, i.e., a plan of how
many units xj of product Nj should be produced if a total of bi units of
resource Ri are available and (ideally) no resources are left over.
If we produce x1 , . . . , xn units of the corresponding products, we need
a total of
ai1 x1 + · · · + ain xn (2.2)
many units of resource Ri . An optimal production plan (x1 , . . . , xn ) ∈ Rn ,
Equation (2.3) is the general form of a system of linear equations, and x1, . . . , xn are the unknowns of this system. Every n-tuple (x1, . . . , xn) ∈ Rn that satisfies (2.3) is a solution of the linear equation system.
Example 2.2
The system of linear equations
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) (2.4)
2x1 + 3x3 = 1 (3)
has no solution: Adding the first two equations yields 2x1 +3x3 = 5, which
contradicts the third equation (3).
Let us have a look at the system of linear equations
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) . (2.5)
x2 + x3 = 2 (3)
From the first and third equation it follows that x1 = 1. From (1)+(2) we
get 2+3x3 = 5, i.e., x3 = 1. From (3), we then get that x2 = 1. Therefore,
(1, 1, 1) is the only possible and unique solution (verify that (1, 1, 1) is a
solution by plugging in).
As a third example, we consider
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) . (2.6)
2x1 + 3x3 = 5 (3)
Since (1)+(2)=(3), we can omit the third equation (redundancy). From
(1) and (2), we get 2x1 = 5−3x3 and 2x2 = 1+x3 . We define x3 = a ∈ R
as a free variable, such that any triplet
$$\left(\tfrac{5}{2} - \tfrac{3}{2}a,\ \tfrac{1}{2} + \tfrac{1}{2}a,\ a\right), \quad a \in \mathbb{R}, \qquad (2.7)$$
is a solution of the system of linear equations, i.e., we obtain a solution
set that contains infinitely many solutions.
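As a quick numerical sanity check, here is a minimal NumPy sketch (not part of the original text): it solves system (2.5) and verifies that the parametric family (2.7) solves system (2.6).

```python
import numpy as np

# System (2.5): coefficient matrix and right-hand side.
A = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 2.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 2.0, 2.0])
print(np.linalg.solve(A, b))     # [1. 1. 1.]: the unique solution

# System (2.6) has infinitely many solutions; check the family (2.7).
A2 = np.array([[1.0, 1.0, 1.0],
               [1.0, -1.0, 2.0],
               [2.0, 0.0, 3.0]])
b2 = np.array([3.0, 2.0, 5.0])
for a in (-1.0, 0.0, 2.5):
    x = np.array([2.5 - 1.5 * a, 0.5 + 0.5 * a, a])
    assert np.allclose(A2 @ x, b2)
```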
2.2 Matrices
Matrices play a central role in linear algebra. They can be used to compactly represent systems of linear equations, but they also represent linear functions (linear mappings), as we will see later in Section 2.7. Before we discuss some of these interesting topics, let us first define what a matrix is and what kind of operations we can do with matrices. We will see more properties of matrices in Chapter 4.
2.2.1 Matrix Addition and Multiplication
The sum of two matrices A ∈ Rm×n, B ∈ Rm×n is defined as the element-wise sum, i.e.,
$$A + B := \begin{bmatrix} a_{11}+b_{11} & \cdots & a_{1n}+b_{1n} \\ \vdots & & \vdots \\ a_{m1}+b_{m1} & \cdots & a_{mn}+b_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}. \qquad (2.12)$$
For matrices A ∈ Rm×n, B ∈ Rn×k, the elements cij of the product C = AB ∈ Rm×k are defined as
$$c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, k. \qquad (2.13)$$
(Note the size of the matrices: there are n columns in A and n rows in B, so that we can compute ail blj for l = 1, . . . , n. In NumPy, C = np.einsum('il, lj', A, B).)
This means, to compute element cij we multiply the elements of the ith row of A with the jth column of B and sum them up. Later in Section 3.2, we will call this the dot product of the corresponding row and column. (Commonly, the dot product between two vectors a, b is denoted by a⊤b or ⟨a, b⟩.) In cases where we need to be explicit that we are performing multiplication, we use the notation A · B to denote multiplication (explicitly showing "·").
Remark. Matrices can only be multiplied if their "neighboring" dimensions
Example 2.3
For $A = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{2\times 3}$, $B = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{3\times 2}$, we obtain
$$AB = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 2 & 5 \end{bmatrix} \in \mathbb{R}^{2\times 2}, \qquad (2.15)$$
$$BA = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 6 & 4 & 2 \\ -2 & 0 & 2 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{3\times 3}. \qquad (2.16)$$
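The following minimal NumPy sketch (not from the book) reproduces Example 2.3 both via the Einstein-sum form of (2.13) and via the @ operator.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [3, 2, 1]])
B = np.array([[0, 2],
              [1, -1],
              [0, 1]])

AB = np.einsum('il,lj->ij', A, B)   # matrix product as in (2.13)
print(AB)                           # [[2 3]
                                    #  [2 5]]
print(np.allclose(AB, A @ B))       # True
print(B @ A)                        # [[ 6  4  2]
                                    #  [-2  0  2]
                                    #  [ 3  2  1]]
```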
Multiplication with the identity matrix:
$$\forall A \in \mathbb{R}^{m\times n}: \quad I_m A = A\, I_n = A \qquad (2.20)$$
If we multiply A with
$$B := \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \qquad (2.22)$$
we obtain
$$AB = \begin{bmatrix} a_{11}a_{22} - a_{12}a_{21} & 0 \\ 0 & a_{11}a_{22} - a_{12}a_{21} \end{bmatrix} = (a_{11}a_{22} - a_{12}a_{21})\,I. \qquad (2.23)$$
Therefore,
$$A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \qquad (2.24)$$
if and only if $a_{11}a_{22} - a_{12}a_{21} \neq 0$. In Section 4.1, we will see that $a_{11}a_{22} - a_{12}a_{21}$ is the determinant of a 2×2-matrix. Furthermore, we can generally use the determinant to check whether a matrix is invertible. ♦
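A minimal NumPy sketch (not from the book) of the closed-form 2×2 inverse (2.24), compared against the library routine; the test matrix is an arbitrary example.

```python
import numpy as np

def inverse_2x2(A):
    """Closed-form inverse of a 2x2 matrix, following (2.24)."""
    a11, a12 = A[0]
    a21, a22 = A[1]
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("matrix is not invertible (determinant is 0)")
    return np.array([[a22, -a12],
                     [-a21, a11]]) / det

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(inverse_2x2(A))        # [[-2.   1. ]
                             #  [ 1.5 -0.5]]
print(np.linalg.inv(A))      # same result
```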
$(\lambda C)^\top = C^\top \lambda^\top = C^\top \lambda = \lambda C^\top$ since $\lambda = \lambda^\top$ for all $\lambda \in \mathbb{R}$.
Distributivity:
$$(\lambda + \psi)C = \lambda C + \psi C, \quad C \in \mathbb{R}^{m\times n}$$
$$\lambda(B + C) = \lambda B + \lambda C, \quad B, C \in \mathbb{R}^{m\times n}$$
and use the rules for matrix multiplication, we can write this equation
system in a more compact form as
$$\begin{bmatrix} 2 & 3 & 5 \\ 4 & -2 & -7 \\ 9 & 5 & -3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 8 \\ 2 \end{bmatrix}. \qquad (2.36)$$
Note that x1 scales the first column, x2 the second one, and x3 the third
one.
Generally, systems of linear equations can be compactly represented in their matrix form as Ax = b, see (2.3), and the product Ax is a (linear) combination of the columns of A. We will discuss linear combinations in more detail in Section 2.5.
so that 0 = 8c1 + 2c2 − 1c3 + 0c4 and (x1, x2, x3, x4) = (8, 2, −1, 0). In fact, any scaling of this solution by λ1 ∈ R produces the 0 vector, i.e.,
$$\lambda_1 \left( \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} \right) = \lambda_1 (8 c_1 + 2 c_2 - c_3) = 0. \qquad (2.41)$$
Following the same line of reasoning, we express the fourth column of the matrix in (2.38) using the first two columns and generate another set of non-trivial versions of 0 as
$$\lambda_2 \left( \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix} \right) = \lambda_2 (-4 c_1 + 12 c_2 - c_4) = 0 \qquad (2.42)$$
for any λ2 ∈ R. Putting everything together, we obtain all solutions of the equation system in (2.38), which is called the general solution, as the set
$$\left\{ x \in \mathbb{R}^4 : x = \begin{bmatrix} 42 \\ 8 \\ 0 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \qquad (2.43)$$
Example 2.6
For a ∈ R, we seek all solutions of the following system of equations:
−2x1 + 4x2 − 2x3 − x4 + 4x5 = −3
4x1 − 8x2 + 3x3 − 3x4 + x5 = 2
. (2.44)
x1 − 2x2 + x3 − x4 + x5 = 0
x1 − 2x2 − 3x4 + 4x5 = a
We start by converting this system of equations into the compact matrix notation Ax = b. We no longer mention the variables x explicitly and build the augmented matrix (in the form [A | b])
$$\left[\begin{array}{ccccc|c} -2 & 4 & -2 & -1 & 4 & -3 \\ 4 & -8 & 3 & -3 & 1 & 2 \\ 1 & -2 & 1 & -1 & 1 & 0 \\ 1 & -2 & 0 & -3 & 4 & a \end{array}\right] \begin{matrix} \text{Swap with } R_3 \\ {} \\ \text{Swap with } R_1 \\ {} \end{matrix}$$
where we used the vertical line to separate the left-hand side from the right-hand side in (2.44). We use ⇝ to indicate a transformation of the augmented matrix using elementary transformations. (The augmented matrix [A | b] compactly represents the system of linear equations Ax = b.)
Swapping rows 1 and 3 leads to
$$\left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 4 & -8 & 3 & -3 & 1 & 2 \\ -2 & 4 & -2 & -1 & 4 & -3 \\ 1 & -2 & 0 & -3 & 4 & a \end{array}\right] \begin{matrix} {} \\ -4R_1 \\ +2R_1 \\ -R_1 \end{matrix}$$
When we now apply the indicated transformations (e.g., subtract Row 1 four times from Row 2), we obtain
$$\left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 & -3 & 2 \\ 0 & 0 & 0 & -3 & 6 & -3 \\ 0 & 0 & -1 & -2 & 3 & a \end{array}\right] \begin{matrix} {} \\ {} \\ {} \\ -R_2 - R_3 \end{matrix}$$
$$\rightsquigarrow \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 & -3 & 2 \\ 0 & 0 & 0 & -3 & 6 & -3 \\ 0 & 0 & 0 & 0 & 0 & a+1 \end{array}\right] \begin{matrix} {} \\ \cdot(-1) \\ \cdot(-\tfrac{1}{3}) \\ {} \end{matrix}$$
$$\rightsquigarrow \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & 1 & -1 & 3 & -2 \\ 0 & 0 & 0 & 1 & -2 & 1 \\ 0 & 0 & 0 & 0 & 0 & a+1 \end{array}\right]$$
This (augmented) matrix is in a convenient form, the row-echelon form (REF). Reverting this compact notation back into the explicit notation with the variables we seek, we obtain
$$\begin{aligned} x_1 - 2x_2 + x_3 - x_4 + x_5 &= 0 \\ x_3 - x_4 + 3x_5 &= -2 \\ x_4 - 2x_5 &= 1 \\ 0 &= a+1 \end{aligned} \qquad (2.45)$$
Only for a = −1 can this system be solved. A particular solution is
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix}. \qquad (2.46)$$
The general solution, which captures the set of all possible solutions, is
$$\left\{ x \in \mathbb{R}^5 : x = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 2 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 2 \\ 0 \\ -1 \\ 2 \\ 1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \qquad (2.47)$$
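A minimal NumPy sketch (not part of the original text) that checks the particular solution (2.46) and the general solution (2.47) against system (2.44) with a = −1.

```python
import numpy as np

A = np.array([[-2.0, 4.0, -2.0, -1.0, 4.0],
              [ 4.0, -8.0, 3.0, -3.0, 1.0],
              [ 1.0, -2.0, 1.0, -1.0, 1.0],
              [ 1.0, -2.0, 0.0, -3.0, 4.0]])
b = np.array([-3.0, 2.0, 0.0, -1.0])     # right-hand side of (2.44) with a = -1

x_p = np.array([2.0, 0.0, -1.0, 1.0, 0.0])   # particular solution (2.46)
v1 = np.array([2.0, 1.0, 0.0, 0.0, 0.0])     # spans the null space of A
v2 = np.array([2.0, 0.0, -1.0, 2.0, 1.0])

assert np.allclose(A @ v1, 0) and np.allclose(A @ v2, 0)
for l1, l2 in [(0.0, 0.0), (1.0, -2.0), (3.5, 0.7)]:
    assert np.allclose(A @ (x_p + l1 * v1 + l2 * v2), b)
```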
All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one non-zero element are on top of rows that contain only zeros.
Looking at non-zero rows only, the first non-zero number from the left (also called the pivot or the leading coefficient) is always strictly to the right of the pivot of the row above it. (In other texts, it is sometimes required that the pivot is 1.)
Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables; the other variables are free variables. For example, in (2.45), x1, x3, x4 are basic variables, whereas x2, x5 are free variables. ♦
Remark (Obtaining a Particular Solution). The row-echelon form makes it straightforward to determine a particular solution: we express the right-hand side of the equation system using the pivot columns, such that $b = \sum_i \lambda_i p_i$, where the $p_i$ are the pivot columns. The λi are determined easiest if we start with the rightmost pivot column and work our way to the left. In the previous example, we would try to find λ1, λ2, λ3 such that
$$\lambda_1 \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ -2 \\ 1 \\ 0 \end{bmatrix}.$$
From here, we find relatively directly that λ3 = 1, λ2 = −1, λ1 = 2. When we put everything together, we must not forget the non-pivot columns for which we set the coefficients implicitly to 0. Therefore, we get the particular solution x = [2, 0, −1, 1, 0]⊤. ♦
Remark (Reduced Row Echelon Form). An equation system is in reduced row echelon form (also: row-reduced echelon form or row canonical form) if
the second column from three times the first column. Now, we look at the
fifth column, which is our second non-pivot column. The fifth column can
be expressed as 3 times the first pivot column, 9 times the second pivot
column, and −4 times the third pivot column. We need to keep track of
the indices of the pivot columns and translate this into 3 times the first col-
umn, 0 times the second column (which is a non-pivot column), 9 times
the third column (which is our second pivot column), and −4 times the
fourth column (which is the third pivot column). Then we need to subtract
the fifth column to obtain 0. In the end, we are still solving a homogeneous
equation system.
To summarize, all solutions of Ax = 0, x ∈ R5, are given by
$$\left\{ x \in \mathbb{R}^5 : x = \lambda_1 \begin{bmatrix} 3 \\ -1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 3 \\ 0 \\ 9 \\ -4 \\ -1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \qquad (2.50)$$
$$[A \,|\, I_n] \;\rightsquigarrow\; \cdots \;\rightsquigarrow\; [I_n \,|\, A^{-1}]. \qquad (2.56)$$
This means that if we bring the augmented equation system into reduced
row echelon form, we can read out the inverse on the right-hand side of
the equation system. Hence, determining the inverse of a matrix is equiv-
alent to solving systems of linear equations.
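A minimal SymPy sketch (not from the book) of this procedure: row-reduce the augmented matrix [A | I] and read off the inverse; the 4×4 matrix used here is just an illustrative, invertible example.

```python
import sympy as sp

A = sp.Matrix([[1, 0, 2, 0],
               [1, 1, 0, 0],
               [1, 2, 0, 1],
               [1, 1, 1, 1]])

augmented = A.row_join(sp.eye(4))   # build [A | I]
rref, _ = augmented.rref()          # reduced row echelon form

A_inv = rref[:, 4:]                 # right-hand block is A^{-1}
assert A_inv == A.inv()
print(A_inv)
```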
and use the Moore-Penrose pseudo-inverse (A⊤A)⁻¹A⊤ to determine the solution (2.59) that solves Ax = b, which also corresponds to the minimum norm least-squares solution. A disadvantage of this approach is that it requires many computations for the matrix-matrix product and computing the inverse of A⊤A. Moreover, for reasons of numerical precision it is generally not recommended to compute the inverse or pseudo-inverse. In the following, we therefore briefly discuss alternative approaches to solving systems of linear equations.
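A minimal NumPy sketch (not from the book, using random data purely for illustration) comparing the explicit pseudo-inverse with the numerically preferred least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))      # overdetermined system: more equations than unknowns
b = rng.normal(size=10)

x_pinv = np.linalg.inv(A.T @ A) @ A.T @ b          # (A^T A)^{-1} A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # preferred in practice
print(np.allclose(x_pinv, x_lstsq))                # True
```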
Gaussian elimination plays an important role when computing determinants (Section 4.1), checking whether a set of vectors is linearly independent (Section 2.5), computing the inverse of a matrix (Section 2.2.2), computing the rank of a matrix (Section 2.6.2) and a basis of a vector space (Section 2.6.1). Gaussian elimination is an intuitive and constructive way to solve a system of linear equations with thousands of variables. However, for systems with millions of variables, it is impractical as the required number of arithmetic operations scales cubically in the number of simultaneous equations.
In practice, systems of many linear equations are solved indirectly, by either stationary iterative methods, such as the Richardson method, the Jacobi method, the Gauß-Seidel method, and the successive over-relaxation method, or Krylov subspace methods, such as conjugate gradients, generalized minimal residual, or biconjugate gradients. We refer to the books by Strang (2003), Liesen and Mehrmann (2015), and Stoer and Bulirsch (2002) for further details.
Let x* be a solution of Ax = b. The key idea of these iterative methods is to set up an iteration of the form
$$x^{(k+1)} = C x^{(k)} + d \qquad (2.60)$$
for suitable C and d that reduces the residual error ‖x^{(k+1)} − x*‖ in every iteration and converges to x*. We will introduce norms ‖·‖, which allow us to compute similarities between vectors, in Section 3.1.
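As an illustration of (2.60), here is a minimal sketch of the Jacobi method in NumPy (not part of the original text); the example matrix is diagonally dominant, a standard sufficient condition for convergence.

```python
import numpy as np

def jacobi(A, b, num_iters=50):
    """Jacobi iteration: an instance of x^(k+1) = C x^(k) + d from (2.60)."""
    D = np.diag(np.diag(A))
    C = -np.linalg.inv(D) @ (A - D)      # iteration matrix
    d = np.linalg.inv(D) @ b
    x = np.zeros_like(b)
    for _ in range(num_iters):
        x = C @ x + d
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
print(jacobi(A, b))
print(np.linalg.solve(A, b))             # reference solution
```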
2.4.1 Groups
Groups play an important role in computer science. Besides providing a
fundamental framework for operations on sets, they are heavily used in
cryptography, coding theory and graphics.
might say "You can get to Kigali by first going 506 km Northwest to Kampala (Uganda) and then 374 km Southwest.". This is sufficient information to describe the location of Kigali because the geographic coordinate system may be considered a two-dimensional vector space (ignoring altitude and the Earth's curved surface). The person may add "It is about 751 km West of here." Although this last statement is true, it is not necessary to find Kigali given the previous information (see Figure 2.7 for an illustration). In this example, the "506 km Northwest" vector (blue) and the "374 km Southwest" vector (purple) are linearly independent. This means the Southwest vector cannot be described in terms of the Northwest vector, and vice versa. However, the third "751 km West" vector (black) is a linear combination of the other two vectors, and it makes the set of vectors linearly dependent. Equivalently, "751 km West" and "374 km Southwest" can be linearly combined to obtain "506 km Northwest".
Remark. The following properties are useful to find out whether vectors
are linearly independent.
Example 2.14
Consider R4 with
$$x_1 = \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix}, \quad x_3 = \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix}. \qquad (2.67)$$
To check whether they are linearly dependent, we follow the general approach and solve
$$\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 = \lambda_1 \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix} = 0 \qquad (2.68)$$
for λ1, . . . , λ3. We write the vectors xi, i = 1, 2, 3, as the columns of a matrix and apply elementary row operations until we identify the pivot columns:
$$\begin{bmatrix} 1 & 1 & -1 \\ 2 & 1 & -2 \\ -3 & 0 & 1 \\ 4 & 2 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}. \qquad (2.69)$$
Here, every column of the matrix is a pivot column. Therefore, there is no non-trivial solution, and we require λ1 = 0, λ2 = 0, λ3 = 0 to solve the equation system. Hence, the vectors x1, x2, x3 are linearly independent.
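A one-line numerical check of Example 2.14 (a NumPy sketch, not from the book): the vectors are linearly independent exactly when the rank of the column-stacked matrix equals the number of columns.

```python
import numpy as np

X = np.column_stack([[1, 2, -3, 4],
                     [1, 1, 0, 2],
                     [-1, -2, 1, 1]])   # x1, x2, x3 from (2.67) as columns
print(np.linalg.matrix_rank(X) == X.shape[1])   # True: linearly independent
```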
This means that {x1 , . . . , xm } are linearly independent if and only if the
column vectors {λ1 , . . . , λm } are linearly independent.
♦
Remark. In a vector space V , m linear combinations of k vectors x1 , . . . , xk
are linearly dependent if m > k . ♦
Example 2.15
Consider a set of linearly independent vectors b1, b2, b3, b4 ∈ Rn and
$$\begin{aligned} x_1 &= b_1 - 2b_2 + b_3 - b_4 \\ x_2 &= -4b_1 - 2b_2 + 4b_4 \\ x_3 &= 2b_1 + 3b_2 - b_3 - 3b_4 \\ x_4 &= 17b_1 - 10b_2 + 11b_3 + b_4 \end{aligned} \qquad (2.73)$$
Are the vectors x1, . . . , x4 ∈ Rn linearly independent? To answer this question, we investigate whether the column vectors
$$\left\{ \begin{bmatrix} 1 \\ -2 \\ 1 \\ -1 \end{bmatrix}, \begin{bmatrix} -4 \\ -2 \\ 0 \\ 4 \end{bmatrix}, \begin{bmatrix} 2 \\ 3 \\ -1 \\ -3 \end{bmatrix}, \begin{bmatrix} 17 \\ -10 \\ 11 \\ 1 \end{bmatrix} \right\} \qquad (2.74)$$
are linearly independent. The reduced row echelon form of the corresponding linear equation system with coefficient matrix
$$A = \begin{bmatrix} 1 & -4 & 2 & 17 \\ -2 & -2 & 3 & -10 \\ 1 & 0 & -1 & 11 \\ -1 & 4 & -3 & 1 \end{bmatrix} \qquad (2.75)$$
is given as
$$\begin{bmatrix} 1 & 0 & 0 & -7 \\ 0 & 1 & 0 & -15 \\ 0 & 0 & 1 & -18 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \qquad (2.76)$$
We see that the corresponding linear equation system is non-trivially solvable: The last column is not a pivot column, and x4 = −7x1 − 15x2 − 18x3. Therefore, x1, . . . , x4 are linearly dependent as x4 can be expressed as a linear combination of x1, . . . , x3.
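The reduced row echelon form (2.76) can be reproduced with a short SymPy sketch (not part of the original text):

```python
import sympy as sp

A = sp.Matrix([[1, -4, 2, 17],
               [-2, -2, 3, -10],
               [1, 0, -1, 11],
               [-1, 4, -3, 1]])

rref, pivots = A.rref()
print(rref)      # matches (2.76)
print(pivots)    # (0, 1, 2): the last column is not a pivot column
```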
Generating sets are sets of vectors that span vector (sub)spaces, i.e.,
every vector can be represented as a linear combination of the vectors
in the generating set. Now, we will be more specific and characterize the
smallest generating set that spans a vector (sub)space.
Example 2.16
The set
$$\mathcal{A} = \left\{ \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}, \begin{bmatrix} 2 \\ -1 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 0 \\ -4 \end{bmatrix} \right\} \qquad (2.80)$$
$$\begin{bmatrix} x_1, x_2, x_3, x_4 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 & -1 \\ 2 & -1 & -4 & 8 \\ -1 & 1 & 3 & -5 \\ -1 & 2 & 5 & -6 \\ -1 & -2 & -3 & 1 \end{bmatrix}. \qquad (2.83)$$
With the basic transformation rules for systems of linear equations, we obtain the row echelon form
$$\begin{bmatrix} 1 & 2 & 3 & -1 \\ 2 & -1 & -4 & 8 \\ -1 & 1 & 3 & -5 \\ -1 & 2 & 5 & -6 \\ -1 & -2 & -3 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 2 & 3 & -1 \\ 0 & 1 & 2 & -2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$
Since the pivot columns indicate which set of vectors is linearly independent, we see from the row echelon form that x1, x2, x4 are linearly independent (because the system of linear equations λ1x1 + λ2x2 + λ4x4 = 0 can only be solved with λ1 = λ2 = λ4 = 0). Therefore, {x1, x2, x4} is a basis of U.
2.6.2 Rank
The number of linearly independent columns of a matrix A ∈ Rm×n equals the number of linearly independent rows and is called the rank of A, denoted by rk(A).
Remark. The rank of a matrix has some important properties:
rk(A) = rk(A⊤), i.e., the column rank equals the row rank.
The columns of A ∈ Rm×n span a subspace U ⊆ Rm with dim(U) = rk(A). Later, we will call this subspace the image or range. A basis of U can be found by applying Gaussian elimination to A to identify the pivot columns.
The rows of A ∈ Rm×n span a subspace W ⊆ Rn with dim(W) = rk(A). A basis of W can be found by applying Gaussian elimination to A⊤.
For all A ∈ Rn×n it holds that A is regular (invertible) if and only if rk(A) = n.
For all A ∈ Rm×n and all b ∈ Rm it holds that the linear equation system Ax = b can be solved if and only if rk(A) = rk(A|b), where A|b denotes the augmented system.
For A ∈ Rm×n the subspace of solutions for Ax = 0 possesses dimension n − rk(A). Later, we will call this subspace the kernel or the null space.
A matrix A ∈ Rm×n has full rank if its rank equals the largest possible rank for a matrix of the same dimensions. This means that the rank of a full-rank matrix is the lesser of the number of rows and columns, i.e., rk(A) = min(m, n). A matrix is said to be rank deficient if it does not have full rank.
♦
$$A = \begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix}.$$
We use Gaussian elimination to determine the rank:
$$\begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 1 & 3 \\ 0 & 0 & 0 \end{bmatrix}. \qquad (2.84)$$
Here, we see that the number of linearly independent rows and columns is 2, such that rk(A) = 2.
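The same result can be obtained numerically with a minimal NumPy sketch (not from the book):

```python
import numpy as np

A = np.array([[1, 2, 1],
              [-2, -3, 1],
              [3, 5, 0]])
print(np.linalg.matrix_rank(A))   # 2, i.e., A is rank deficient
```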
B = (b1 , . . . , bn ) (2.89)
x = α1 b1 + . . . + αn bn (2.90)
Example 2.20
Let us have a look at a geometric vector x ∈ R2 with coordinates [2, 3]⊤ with respect to the standard basis (e1, e2) of R2. This means, we can write x = 2e1 + 3e2. However, we do not have to choose the standard basis to represent this vector. If we use the basis vectors b1 = [1, −1]⊤, b2 = [1, 1]⊤, we will obtain the coordinates ½[−1, 5]⊤ to represent the same vector with respect to (b1, b2); see Figure 2.9.
[Figure 2.9: Different coordinate representations of a vector x, depending on the choice of basis: x = 2e1 + 3e2 and x = −½ b1 + 5/2 b2.]
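The new coordinates can be found by solving a small linear system, as in this NumPy sketch (not part of the original text):

```python
import numpy as np

x = np.array([2.0, 3.0])        # coordinates w.r.t. the standard basis
B = np.array([[1.0, 1.0],
              [-1.0, 1.0]])     # columns are b1 = [1, -1], b2 = [1, 1]

coords = np.linalg.solve(B, x)  # coordinates of x with respect to (b1, b2)
print(coords)                   # [-0.5  2.5]
```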
ŷ = AΦ x̂ . (2.94)
This means that the transformation matrix can be used to map coordinates
with respect to an ordered basis in V to coordinates with respect to an
ordered basis in W .
where we first expressed the new basis vectors c̃k ∈ W as linear combinations of the basis vectors cl ∈ W and then swapped the order of summation.
Alternatively, when we express the b̃j ∈ V as linear combinations of bi ∈ V, we arrive at
$$\Phi(\tilde{b}_j) \overset{(2.106)}{=} \Phi\!\left(\sum_{i=1}^{n} s_{ij} b_i\right) = \sum_{i=1}^{n} s_{ij} \Phi(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li} c_l \qquad (2.109a)$$
$$= \sum_{l=1}^{m} \left( \sum_{i=1}^{n} a_{li} s_{ij} \right) c_l, \qquad j = 1, \ldots, n, \qquad (2.109b)$$
and, therefore,
$$T \tilde{A}_\Phi = A_\Phi S, \qquad (2.112)$$
such that
$$\tilde{A}_\Phi = T^{-1} A_\Phi S. \qquad (2.113)$$
$$\tilde{B} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \right) \in \mathbb{R}^3, \qquad \tilde{C} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right). \qquad (2.119)$$
Then,
$$S = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \qquad T = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad (2.120)$$
where the ith column of S is the coordinate representation of b̃i in terms of the basis vectors of B. Since B is the standard basis, the coordinate representation is straightforward to find. For a general basis B we would need to solve a linear equation system to find the λi such that $\sum_{i=1}^{3} \lambda_i b_i = \tilde{b}_j$,
[Figure: The kernel ker(Φ) and image Im(Φ) of a linear mapping Φ : V → W, with 0V ∈ ker(Φ) and 0W ∈ Im(Φ).]
$$= x_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix} + x_2 \begin{bmatrix} 2 \\ 0 \end{bmatrix} + x_3 \begin{bmatrix} -1 \\ 0 \end{bmatrix} + x_4 \begin{bmatrix} 0 \\ 1 \end{bmatrix} \qquad (2.125b)$$
is linear. To determine Im(Φ), we can take the span of the columns of the transformation matrix and obtain
$$\operatorname{Im}(\Phi) = \operatorname{span}\!\left[ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right]. \qquad (2.126)$$
To compute the kernel (null space) of Φ, we need to solve Ax = 0, i.e., we need to solve a homogeneous equation system. To do this, we use Gaussian elimination to transform A into reduced row echelon form:
$$\begin{bmatrix} 1 & 2 & -1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & -\tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix}. \qquad (2.127)$$
This matrix is in reduced row echelon form, and we can use the Minus-1 Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively, we can express the non-pivot columns (columns 3 and 4) as linear combinations of the pivot columns (columns 1 and 2). The third column a3 is equivalent to −½ times the second column a2. Therefore, 0 = a3 + ½a2. In the same way, we see that a4 = a1 − ½a2 and, therefore, 0 = a1 − ½a2 − a4. Overall, this gives us the kernel (null space) as
$$\ker(\Phi) = \operatorname{span}\!\left[ \begin{bmatrix} 0 \\ \tfrac{1}{2} \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ \tfrac{1}{2} \\ 0 \\ 1 \end{bmatrix} \right]. \qquad (2.128)$$
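A minimal NumPy/SciPy sketch (not part of the original text) that computes a kernel basis numerically and checks the dimension count of the rank-nullity theorem stated next:

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, -1.0, 0.0],
              [1.0, 0.0, 0.0, 1.0]])

N = null_space(A)                  # orthonormal basis of ker(A), shape (4, 2)
print(np.allclose(A @ N, 0))       # True
# dim(ker) + dim(im) = dim of the domain (number of columns)
print(np.linalg.matrix_rank(A) + N.shape[1] == A.shape[1])   # True
```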
Theorem 2.24 (Rank-Nullity Theorem). For vector spaces V, W and a linear mapping Φ : V → W it holds that
$$\dim(\ker(\Phi)) + \dim(\operatorname{Im}(\Phi)) = \dim(V). \qquad (2.129)$$
The rank-nullity theorem is also referred to as the fundamental theorem of linear mappings (Axler, 2015, Theorem 3.22). Direct consequences of Theorem 2.24 are:
If dim(Im(Φ)) < dim(V), then ker(Φ) is non-trivial, i.e., the kernel contains more than 0V and dim(ker(Φ)) ⩾ 1.
If AΦ is the transformation matrix of Φ with respect to an ordered basis and dim(Im(Φ)) < dim(V), then the system of linear equations AΦ x = 0 has infinitely many solutions.
If dim(V) = dim(W), then the following three-way equivalence holds:
– Φ is injective
– Φ is surjective
– Φ is bijective
since Im(Φ) ⊆ W.
One-dimensional affine subspaces are called lines and can be written as y = x0 + λx1, where λ ∈ R and U = span[x1] ⊆ Rn is a one-dimensional subspace of Rn. This means, a line is defined by a support point x0 and a vector x1 that defines the direction. See Figure 2.13 for an illustration.
Definition 2.26 (Affine mapping). For two vector spaces V, W and a lin-
Exercises
2.1 We consider (R\{−1}, ⋆) where
a ⋆ b := ab + a + b, a, b ∈ R\{−1} (2.134)
3 ⋆ x ⋆ x = 15
k = {x ∈ Z | x − k = 0 (mod n)}
= {x ∈ Z | (∃a ∈ Z) : (x − k = n · a)} .
Zn = {0, 1, . . . , n − 1}
a ⊕ b := a + b
a⊗b=a×b (2.135)
2.
$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$
3.
$$\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$
4.
$$\begin{bmatrix} 1 & 2 & 1 & 2 \\ 4 & 1 & -1 & -4 \end{bmatrix} \begin{bmatrix} 0 & 3 \\ 1 & -1 \\ 2 & 1 \\ 5 & 2 \end{bmatrix}$$
5.
$$\begin{bmatrix} 0 & 3 \\ 1 & -1 \\ 2 & 1 \\ 5 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 & 2 \\ 4 & 1 & -1 & -4 \end{bmatrix}$$
2.5 Find the set S of all solutions in x of the following inhomogeneous linear systems Ax = b, where A and b are defined below:
1.
$$A = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 2 & 5 & -7 & -5 \\ 2 & -1 & 1 & 3 \\ 5 & 2 & -4 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ -2 \\ 4 \\ 6 \end{bmatrix}$$
2.
$$A = \begin{bmatrix} 1 & -1 & 0 & 0 & 1 \\ 1 & 1 & 0 & -3 & 0 \\ 2 & -1 & 0 & 1 & -1 \\ -1 & 2 & 0 & -2 & -1 \end{bmatrix}, \qquad b = \begin{bmatrix} 3 \\ 6 \\ 5 \\ -1 \end{bmatrix}$$
and $\sum_{i=1}^{3} x_i = 1$.
2.7 Determine the inverse of the following matrices if possible:
1.
$$A = \begin{bmatrix} 2 & 3 & 4 \\ 3 & 4 & 5 \\ 4 & 5 & 6 \end{bmatrix}$$
2.
$$A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}$$
2.
$$x_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 1 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$$
2.10 Write
$$y = \begin{bmatrix} 1 \\ -2 \\ 5 \end{bmatrix}$$
as linear combination of
$$x_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}$$
Determine a basis of U1 ∩ U2 .
Φ : L¹([a, b]) → R
$$f \mapsto \Phi(f) = \int_a^b f(x)\,dx,$$
where L¹([a, b]) denotes the set of integrable functions on [a, b].
2.
Φ : C¹ → C⁰
f ↦ Φ(f) = f′.
Φ : R → R
x ↦ Φ(x) = cos(x)
4.
Φ : R³ → R²
$$x \mapsto \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \end{bmatrix} x$$
Φ : R² → R²
$$x \mapsto \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix} x$$
Φ : R³ → R⁴
$$\Phi\!\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{bmatrix} 3x_1 + 2x_2 + x_3 \\ x_1 + x_2 + x_3 \\ x_1 - 3x_2 \\ 2x_1 + 3x_2 + x_3 \end{bmatrix}$$
and let us define two ordered bases B = (b1, b2) and B′ = (b′1, b′2) of R2.
1. Show that B and B′ are two bases of R2 and draw those basis vectors.
2. Compute the matrix P1 that performs a basis change from B′ to B.
3
Analytic Geometry
[Figure: Concepts of this chapter — lengths, angles, orthogonal projection, rotations.]
3.1 Norms
When we think of geometric vectors, i.e., directed line segments that start
at the origin, then intuitively the length of a vector is the distance of the
“end” of this directed line segment from the origin. In the following, we
will discuss the notion of the length of vectors using the concept of a norm.
$$\|\cdot\| : V \to \mathbb{R}, \qquad (3.1)$$
$$x \mapsto \|x\|, \qquad (3.2)$$
which assigns each vector x its length ‖x‖ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:
Absolutely homogeneous: ‖λx‖ = |λ| ‖x‖
$$\|x\|_1 := \sum_{i=1}^{n} |x_i|, \qquad (3.3)$$
where |·| is the absolute value. The left panel of Figure 3.2 shows all vectors x ∈ R2 with ‖x‖1 = 1. The Manhattan norm is also called ℓ1 norm.
and computes the Euclidean distance of x from the origin. The right panel of Figure 3.2 shows all vectors x ∈ R2 with ‖x‖2 = 1. The Euclidean norm is also called ℓ2 norm.
Remark. Throughout this book, we will use the Euclidean norm (3.4) by default if not stated otherwise. ♦
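A minimal NumPy sketch (not from the book) evaluating the two norms introduced above:

```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, ord=1))   # Manhattan (l1) norm: |3| + |-4| = 7
print(np.linalg.norm(x, ord=2))   # Euclidean (l2) norm: sqrt(9 + 16) = 5
```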
We will refer to the particular inner product above as the dot product
in this book. However, inner products are more general concepts with
specific properties, which we will now introduce.
where Aij := ⟨bi, bj⟩ and x̂, ŷ are the coordinates of x and y with respect to the basis B. This implies that the inner product ⟨·, ·⟩ is uniquely determined through A. The symmetry of the inner product also means that A
The null space (kernel) of A consists only of 0 because x⊤Ax > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
The diagonal elements aii of A are positive because aii = e⊤i A ei > 0, where ei is the ith vector of the standard basis in Rn.
in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan norm (3.3) is an example of a norm without a corresponding inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances, and angles.
Remark (Cauchy-Schwarz Inequality). For an inner product vector space (V, ⟨·, ·⟩) the induced norm ‖·‖ satisfies the Cauchy-Schwarz inequality
$$|\langle x, y \rangle| \leqslant \|x\| \, \|y\|. \qquad (3.17)$$
♦
is called the distance between x and y for x, y ∈ V. If we use the dot product as the inner product, then the distance is called the Euclidean distance.
The mapping
$$d : V \times V \to \mathbb{R} \qquad (3.22)$$
$$(x, y) \mapsto d(x, y) \qquad (3.23)$$
1. d is positive definite, i.e., d(x, y) ⩾ 0 for all x, y ∈ V and d(x, y) = 0 ⟺ x = y.
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V.
3. Triangle inequality: d(x, z) ⩽ d(x, y) + d(y, z) for all x, y, z ∈ V.
Remark. At first glance the lists of properties of inner products and metrics look very similar. However, by comparing Definition 3.3 with Definition 3.6 we observe that ⟨x, y⟩ and d(x, y) behave in opposite directions: very similar x and y will result in a large value for the inner product and a small value for the metric. ♦
$$-1 \leqslant \frac{\langle x, y \rangle}{\|x\|\,\|y\|} \leqslant 1. \qquad (3.24)$$
Therefore, there exists a unique ω ∈ [0, π], illustrated in Figure 3.4, with
$$\cos\omega = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}. \qquad (3.25)$$
The number ω is the angle between the vectors x and y. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: their orientation is the same.
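A minimal NumPy sketch (not part of the original text) of (3.25) for a general inner product ⟨x, y⟩ = x⊤Ay; the helper function and the 109.47° value for the non-standard inner product are illustrative computations, not quoted from the book.

```python
import numpy as np

def angle(x, y, A=None):
    """Angle between x and y w.r.t. <u, v> = u^T A v (A = I gives the dot product)."""
    if A is None:
        A = np.eye(len(x))
    inner = lambda u, v: u @ A @ v
    cos_w = inner(x, y) / np.sqrt(inner(x, x) * inner(y, y))
    return np.arccos(np.clip(cos_w, -1.0, 1.0))

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
print(np.degrees(angle(x, y)))                            # 90.0 with the dot product
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
print(np.degrees(angle(x, y, A)))                         # ~109.47 with <x,y> = x^T A y
```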
Consider two vectors x = [1, 1]⊤, y = [−1, 1]⊤ ∈ R2; see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as inner product yields an angle ω between x and y of 90°, such that x ⊥ y. However, if we choose the inner product
$$\langle x, y \rangle = x^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} y, \qquad (3.27)$$
which gives exactly the angle between x and y. This means that orthogonal matrices A with A⊤ = A⁻¹ preserve both angles and distances. It turns out that orthogonal matrices define transformations that are rotations (with the possibility of flips). In Section 3.9, we will discuss more details about rotations.
for all i, j = 1, . . . , n, then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note that (3.34) implies that every basis vector has length/norm 1.
Recall from Section 2.6.1 that we can use Gaussian elimination to find a basis for a vector space spanned by a set of vectors. Assume we are given a set {b̃1, . . . , b̃n} of non-orthogonal and unnormalized basis vectors. We concatenate them into a matrix B̃ = [b̃1, . . . , b̃n] and apply Gaussian elimination to the augmented matrix (Section 2.3.2) [B̃B̃⊤ | B̃] to obtain an orthonormal basis. This constructive way to iteratively build an orthonormal basis {b1, . . . , bn} is called the Gram-Schmidt process (Strang, 2003).
for lower and upper limits a, b < ∞, respectively. As with our usual inner product, we can define norms and orthogonality by looking at the inner product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To make the above inner product mathematically precise, we need to take care of measures and the definition of integrals, leading to the definition of a Hilbert space. Furthermore, unlike inner products on finite-dimensional vectors, inner products on functions may diverge (have infinite value). All this requires diving into some more intricate details of real and functional analysis, which we do not cover in this book.
product evaluates to 0. Therefore, sin and cos are orthogonal functions.
[Figure 3.8: f(x) = sin(x) cos(x).]
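This orthogonality can be checked numerically with a minimal SciPy sketch (not from the book); it assumes the integration limits a = −π and b = π used in this example.

```python
import numpy as np
from scipy.integrate import quad

# Inner product of functions (3.37), evaluated numerically on [-pi, pi].
inner, _ = quad(lambda x: np.sin(x) * np.cos(x), -np.pi, np.pi)
print(abs(inner) < 1e-10)    # True: sin and cos are orthogonal on [-pi, pi]
```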
[Figure 3.9: Orthogonal projection (orange dots) of a two-dimensional dataset (blue dots) onto a one-dimensional subspace (straight line).]
the dataset and extract relevant patterns. For example, machine learning algorithms, such as Principal Component Analysis (PCA) by Pearson (1901) and Hotelling (1933) and Deep Neural Networks (e.g., deep auto-encoders (Deng et al., 2010)), heavily exploit the idea of dimensionality reduction. In the following, we will focus on orthogonal projections, which we will use in Chapter 10 for linear dimensionality reduction and in Chapter 12 for classification. Even linear regression, which we discuss in Chapter 9, can be interpreted using orthogonal projections. For a given lower-dimensional subspace, orthogonal projections of high-dimensional data retain as much information as possible and minimize the difference/error between the original data and the corresponding projection. An illustration of such an orthogonal projection is given in Figure 3.9. Before we detail how to obtain these projections, let us define what a projection actually is.
[Figure 3.10: (a) Projection of x ∈ R2 onto a subspace U with basis vector b. (b) Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.]
We can now exploit the bilinearity of the inner product and arrive at
$$\langle x, b \rangle - \lambda \langle b, b \rangle = 0 \iff \lambda = \frac{\langle x, b \rangle}{\langle b, b \rangle} = \frac{\langle b, x \rangle}{\|b\|^2}. \qquad (3.40)$$
(With a general inner product, we get λ = ⟨x, b⟩ if ‖b‖ = 1.) In the last step, we exploited the fact that inner products are symmetric. If we choose ⟨·, ·⟩ to be the dot product, we obtain
$$\lambda = \frac{b^\top x}{b^\top b} = \frac{b^\top x}{\|b\|^2}. \qquad (3.41)$$
$$\pi_U(x) = \lambda b = \frac{\langle x, b \rangle}{\|b\|^2}\, b = \frac{b^\top x}{\|b\|^2}\, b, \qquad (3.42)$$
where the last equality holds for the dot product only. We can also compute the length of πU(x) by means of Definition 3.1 as
$$\|\pi_U(x)\| = \|\lambda b\| = |\lambda|\, \|b\|. \qquad (3.43)$$
Hence, our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of πU(x) with respect to the basis vector b that spans our one-dimensional subspace U.
If we use the dot product as an inner product, we get
$$\pi_U(x) = \lambda b = b\lambda = b\,\frac{b^\top x}{\|b\|^2} = \frac{b b^\top}{\|b\|^2}\, x, \qquad (3.45)$$
and we immediately see that
$$P_\pi = \frac{b b^\top}{\|b\|^2}. \qquad (3.46)$$
Note that bb⊤ (and, consequently, Pπ) is a symmetric matrix (of rank 1), and ‖b‖² = ⟨b, b⟩ is a scalar. (Projection matrices are always symmetric.)
The projection matrix Pπ projects any vector x ∈ Rn onto the line through the origin with direction b (equivalently, the subspace U spanned by b).
Remark. The projection πU (x) ∈ Rn is still an n-dimensional vector and
not a scalar. However, we no longer require n coordinates to represent the
projection, but only a single one if we want to express it with respect to
the basis vector b that spans the subspace U : λ. ♦
Let us now choose a particular x and see whether it lies in the subspace spanned by b. For x = [1, 1, 1]⊤, the projection is
$$\pi_U(x) = P_\pi x = \frac{1}{9}\begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9}\begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\!\left[\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}\right]. \qquad (3.48)$$
Note that the application of Pπ to πU(x) does not change anything, i.e., Pπ πU(x) = πU(x). This is expected because, according to Definition 3.10, we know that a projection matrix Pπ satisfies P²π x = Pπ x for all x.
Remark. With the results from Chapter 4 we can show that πU(x) is an eigenvector of Pπ, and the corresponding eigenvalue is 1. ♦
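A minimal NumPy sketch (not from the book) reproducing this 1D projection and its idempotence:

```python
import numpy as np

b = np.array([[1.0], [2.0], [2.0]])   # basis vector of U as a column
P = (b @ b.T) / (b.T @ b)             # projection matrix (3.46)

x = np.array([1.0, 1.0, 1.0])
proj = P @ x
print(proj)                           # [0.556 1.111 1.111] = (1/9) * [5, 10, 10]
print(np.allclose(P @ proj, proj))    # True: P is idempotent (P^2 = P)
```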
The matrix (B⊤B)⁻¹B⊤ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires that B⊤B is positive definite, which is the case if B is full rank. In practical applications (e.g., linear regression), we often add a "jitter term" I to
Remark. The solution for projecting onto general subspaces includes the 1D case as a special case: If dim(U) = 1, then B⊤B ∈ R is a scalar and we can rewrite the projection matrix in (3.59) Pπ = B(B⊤B)⁻¹B⊤ as Pπ = (BB⊤)/(B⊤B), which is exactly the projection matrix in (3.46). ♦
The corresponding projection error is the norm of the difference vector between the original vector and its projection onto U, i.e.,
$$\|x - \pi_U(x)\| = \left\| \begin{bmatrix} 1 & -2 & 1 \end{bmatrix}^\top \right\| = \sqrt{6}. \qquad (3.63)$$
(The projection error is also called the reconstruction error.) To verify the results, we can (a) check whether the displacement vector πU(x) − x is orthogonal to all basis vectors of U, and (b) verify that Pπ = P²π (see Definition 3.10).
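The following NumPy sketch (not part of the original text) illustrates projection onto a general subspace via Pπ = B(B⊤B)⁻¹B⊤; the basis B and vector x are chosen to be consistent with the projection error √6 above, but the specific numbers here are the sketch's assumption rather than quoted from the visible excerpt.

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])           # columns span a 2D subspace U of R^3
x = np.array([6.0, 0.0, 0.0])

P = B @ np.linalg.inv(B.T @ B) @ B.T  # projection matrix (3.59)
proj = P @ x
print(proj)                            # [ 5.  2. -1.]
print(x - proj)                        # [ 1. -2.  1.], with norm sqrt(6)
print(np.allclose(B.T @ (x - proj), 0))  # True: residual is orthogonal to U
```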
Remark. The projections πU (x) are still vectors in Rn although they lie in
an m-dimensional subspace U ⊆ Rn . However, to represent a projected
vector we only need the m coordinates λ1 , . . . , λm with respect to the
basis vectors b1 , . . . , bm of U . ♦
Remark. In vector spaces with general inner products, we have to pay
attention when computing angles and distances, which are defined by
means of the inner product. ♦
Projections allow us to look at situations where we have a linear system Ax = b without a solution. Recall that this means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least-squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Section 9.4. Using reconstruction errors (3.63) is one possible approach to derive principal component analysis (Section 10.3).
Remark. We just looked at projections of vectors x onto a subspace U with basis vectors {b1, . . . , bk}. If this basis is an ONB, i.e., (3.33)–(3.34) are satisfied, the projection equation (3.58) simplifies greatly to
$$\pi_U(x) = B B^\top x \qquad (3.65)$$
[Figure 3.12: Gram-Schmidt orthogonalization. (a) Non-orthogonal basis (b1, b2) of R2; (b) first constructed basis vector u1 and orthogonal projection of b2 onto span[u1]; (c) orthogonal basis (u1, u2) of R2, where u2 = b2 − πspan[u1](b2).]
Consider a basis (b1, b2) of R2, where
$$b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \quad b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad (3.69)$$
see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an orthogonal basis (u1, u2) of R2 as follows (assuming the dot product as the inner product):
$$u_1 := b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \qquad (3.70)$$
$$u_2 := b_2 - \pi_{\operatorname{span}[u_1]}(b_2) \overset{(3.45)}{=} b_2 - \frac{u_1 u_1^\top}{\|u_1\|^2}\, b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \qquad (3.71)$$
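A minimal NumPy sketch of the Gram-Schmidt construction (not from the book); the helper function `gram_schmidt` is an illustrative implementation, not the book's code.

```python
import numpy as np

def gram_schmidt(B):
    """Orthogonalize the columns of B (assumed linearly independent) by
    subtracting projections onto previously constructed vectors."""
    U = np.zeros_like(B, dtype=float)
    for j in range(B.shape[1]):
        u = B[:, j].astype(float)
        for i in range(j):
            ui = U[:, i]
            u = u - (ui @ B[:, j]) / (ui @ ui) * ui   # subtract pi_span[u_i](b_j)
        U[:, j] = u
    return U

B = np.array([[2.0, 1.0],
              [0.0, 1.0]])          # columns b1, b2 from (3.69)
print(gram_schmidt(B))               # columns [2, 0] and [0, 1], as in (3.70)-(3.71)
```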
[Figure 3.13: Projection onto an affine space. (a) Original setting; (b) setting shifted by −x0 so that x − x0 can be projected onto the direction space U = L − x0; (c) the projection is translated back to x0 + πU(x − x0), which gives the final orthogonal projection πL(x).]
[Figure 3.14: A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise. Shown: original object and the object rotated by 112.5°.]
3.9 Rotations
Length and angle preservation, as discussed in Section 3.4, are the two characteristics of linear mappings with orthogonal transformation matrices. In the following, we will have a closer look at specific orthogonal transformation matrices, which describe rotations.
A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle θ about the origin, i.e., the origin is a fixed point. For a positive angle θ > 0, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.14, where the transformation matrix is
$$R = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix}. \qquad (3.74)$$
Important application areas of rotations include computer graphics and robotics. For example, in robotics, it is often important to know how to rotate the joints of a robotic arm in order to pick up or place an object; see Figure 3.15.
[Figure 3.16: Rotation of the standard basis of R2 by an angle θ.]
3.9.1 Rotations in R2
Consider the standard basis $e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ of R2, which defines the standard coordinate system in R2. We aim to rotate this coordinate system by an angle θ as illustrated in Figure 3.16. Note that the rotated vectors are still linearly independent and, therefore, are a basis of R2. This means that the rotation performs a basis change.
Rotations Φ are linear mappings so that we can express them by a rotation matrix R(θ). Trigonometry (see Figure 3.16) allows us to determine the coordinates of the rotated axes (the image of Φ) with respect to the standard basis in R2. We obtain
$$\Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}, \qquad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix}. \qquad (3.75)$$
Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as
$$R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \qquad (3.76)$$
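A minimal NumPy sketch (not from the book) of the rotation matrix (3.76), checking that it is orthogonal and maps e1 as expected:

```python
import numpy as np

def rotation_matrix(theta):
    """2D rotation matrix R(theta) from (3.76)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s, c]])

R = rotation_matrix(np.deg2rad(90))
print(np.round(R @ np.array([1.0, 0.0]), 10))   # e1 maps to [0, 1]
print(np.allclose(R.T @ R, np.eye(2)))          # True: R is orthogonal
```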
3.9.2 Rotations in R3
In contrast to the R2 case, in R3 we can rotate any two-dimensional plane about a one-dimensional axis. The easiest way to specify the general rotation matrix is to specify how the images of the standard basis e1, e2, e3 are supposed to be rotated, and making sure these images Re1, Re2, Re3 are orthonormal to each other. We can then obtain a general rotation matrix R by combining the images of the standard basis.
To have a meaningful rotation angle, we have to define what "counterclockwise" means when we operate in more than two dimensions. We use the convention that a "counterclockwise" (planar) rotation about an axis refers to a rotation about an axis when we look at the axis "head on, from the end toward the origin". In R3, there are therefore three (planar) rotations about the three standard basis vectors (see Figure 3.17):
[Figure 3.17: Rotation of a vector (gray) in R3 by an angle θ about the e3-axis. The rotated vector is shown in blue.]
$$R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad (3.80)$$
for 1 ⩽ i < j ⩽ n and θ ∈ R. Then Rij(θ) is called a Givens rotation. Essentially, Rij(θ) is the identity matrix In with
$$r_{ii} = \cos\theta, \quad r_{ij} = -\sin\theta, \quad r_{ji} = \sin\theta, \quad r_{jj} = \cos\theta. \qquad (3.81)$$
In two dimensions (i.e., n = 2), we obtain (3.76) as a special case.
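A minimal NumPy sketch (not from the book) constructing a Givens rotation from (3.81); note the helper uses zero-based indices, unlike the 1-based convention in the text.

```python
import numpy as np

def givens(n, i, j, theta):
    """Givens rotation R_ij(theta) in R^{n x n}, following (3.80)-(3.81).
    Indices i < j are zero-based here."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = c
    R[j, j] = c
    R[i, j] = -s
    R[j, i] = s
    return R

R = givens(4, 1, 3, np.pi / 6)
print(np.allclose(R.T @ R, np.eye(4)))   # True: Givens rotations are orthogonal
```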
kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the fact that many linear algorithms can be expressed purely by inner product computations. Then, the "kernel trick" allows us to compute these inner products implicitly in a (potentially infinite-dimensional) feature space, without even knowing this feature space explicitly. This allowed the "non-linearization" of many algorithms used in machine learning, such as kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaussian processes (Rasmussen and Williams, 2006) also fall into the category of kernel methods and are the current state-of-the-art in probabilistic regression (fitting curves to data points). The idea of kernels is explored further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shadows. In optimization, orthogonal projections are often used to (iteratively) minimize residual errors. This also has applications in machine learning, e.g., in linear regression where we want to find a (linear) function that minimizes the residual errors, i.e., the lengths of the orthogonal projections of the data onto the linear function (Bishop, 2006). We will investigate this further in Chapter 9. PCA (Hotelling, 1933; Pearson, 1901) also uses projections to reduce the dimensionality of high-dimensional data. We will discuss this in more detail in Chapter 10.
Exercises
3.1 Show that ⟨·, ·⟩ defined for all x = [x1, x2]⊤ ∈ R2 and y = [y1, y2]⊤ ∈ R2 by
⟨x, y⟩ := x1y1 − (x1y2 + x2y1) + 2(x2y2)
is an inner product.
3.2 Consider R2 with ⟨·, ·⟩ defined for all x and y in R2 as
$$\langle x, y \rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y.$$
using
1. ⟨x, y⟩ := x⊤y
2. ⟨x, y⟩ := x⊤By, B := $\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$
3.5 Consider the Euclidean vector space R5 with the dot product. A subspace U ⊆ R5 and x ∈ R5 are given by
$$U = \operatorname{span}\!\left[ \begin{bmatrix} 0 \\ -1 \\ 2 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -3 \\ 1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} -3 \\ 4 \\ 1 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -3 \\ 5 \\ 0 \\ 7 \end{bmatrix} \right], \qquad x = \begin{bmatrix} -1 \\ -9 \\ -1 \\ 4 \\ 1 \end{bmatrix}$$
U = span[e1 , e3 ] .
3.8 Using the Gram-Schmidt method, turn the basis B = (b1, b2) of a two-dimensional subspace U ⊆ R3 into an ONB C = (c1, c2) of U, where
$$b_1 := \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \qquad b_2 := \begin{bmatrix} -1 \\ 2 \\ 0 \end{bmatrix}.$$
by 30◦ .
4
Matrix Decompositions
[Figure 4.1: A mind map of the concepts introduced in this chapter (eigenvalues, eigenvectors, orthogonal matrices, diagonalization, SVD), along with where they are used in other parts of the book, e.g., Chapter 6 (Probability & Distributions) and Chapter 10 (Linear Dimensionality Reduction).]
For a memory aid of the product terms in Sarrus’ rule, try tracing the
elements of the triple products in the matrix.
We call a square matrix T an upper-triangular matrix if Tij = 0 for i > j, i.e., the matrix is zero below its diagonal. Analogously, we define a lower-triangular matrix as a matrix with zeros above its diagonal. For a triangular matrix T ∈ Rn×n, the determinant is the product of the diagonal elements, i.e.,
$$\det(T) = \prod_{i=1}^{n} T_{ii}. \qquad (4.8)$$
The determinant is the signed volume of the parallelepiped formed by the columns of the matrix.

Example 4.2 (Determinants as Measures of Volume)
The notion of a determinant is natural when we consider it as a mapping from a set of n vectors spanning an object in Rn. It turns out that the determinant det(A) is the signed volume of an n-dimensional parallelepiped formed by columns of the matrix A.
Figure 4.2 The area of the parallelogram (shaded region) spanned by the vectors b and g is |det([b, g])|.
For n = 2 the columns of the matrix form a parallelogram, see Figure 4.2. As the angle between vectors gets smaller, the area of a parallelogram shrinks, too. Consider two vectors b, g that form the columns of a matrix A = [b, g]. Then, the absolute value of the determinant of A is the area of the parallelogram with vertices 0, b, g, b + g. In particular, if b, g are linearly dependent so that b = λg for some λ ∈ R, they no longer form a two-dimensional parallelogram. Therefore, the corresponding area is 0. On the contrary, if b, g are linearly independent and are multiples of the canonical basis vectors e1, e2, then they can be written as b = [b, 0]> and g = [0, g]>, and the determinant is
det([[b, 0], [0, g]]) = bg − 0 = bg ,
which is the familiar formula: area = height × length.
The sign of the determinant indicates the orientation of the spanning vectors b, g with respect to the standard basis (e1, e2). In our figure, flipping the order to g, b swaps the columns of A and reverses the orientation of the shaded area.
Figure 4.3 The volume of the parallelepiped (shaded volume) spanned by vectors r, b, g is |det([r, b, g])|.
This intuition extends to higher dimensions. In R3, we consider three vectors r, b, g ∈ R3 spanning the edges of a parallelepiped, i.e., a solid with faces that are parallel parallelograms (see Figure 4.3). The absolute value of the determinant of the 3 × 3 matrix [r, b, g] is the volume of the solid. Thus, the determinant acts as a function that measures the signed volume formed by column vectors composed in a matrix.
Consider the three linearly independent vectors r, g, b ∈ R3 given as
r = [2, 0, −8]> ,  g = [6, 1, 0]> ,  b = [1, 4, −1]> .   (4.9)
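Numerically (a minimal NumPy sketch), the volume spanned by these vectors is the absolute value of the determinant of the matrix whose columns are r, g, b:

import numpy as np

r = np.array([2.0, 0.0, -8.0])
g = np.array([6.0, 1.0, 0.0])
b = np.array([1.0, 4.0, -1.0])

# Stack the vectors as columns of a matrix and take the determinant.
A = np.column_stack([r, g, b])
volume = abs(np.linalg.det(A))      # det(A) = -186, so the volume is 186
print(volume)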
tr(A) := ∑_{i=1}^{n} aii ,   (4.18)

c0 = det(A) ,   (4.23)
cn−1 = (−1)^{n−1} tr(A) .   (4.24)
The characteristic polynomial (4.22a) will allow us to compute eigen-
values and eigenvectors, covered in the next section.
Definition 4.9 (Algebraic Multiplicity). Let a square matrix A have an eigenvalue λi. The algebraic multiplicity of λi is the number of times the root appears in the characteristic polynomial.
A matrix A and its transpose A> possess the same eigenvalues, but not
necessarily the same eigenvectors.
The eigenspace Eλ is the null space of A − λI since
Ax = λx ⇐⇒ Ax − λx = 0 (4.27a)
⇐⇒ (A − λI)x = 0 ⇐⇒ x ∈ ker(A − λI). (4.27b)
This means any vector x = [x1, x2]> where x2 = −x1, such as [1, −1]>, is an eigenvector with eigenvalue 2. The corresponding eigenspace is given as
E2 = span[ [1, −1]> ] .   (4.35)
Example 4.6
The matrix A = [[2, 1], [0, 2]] has two repeated eigenvalues λ1 = λ2 = 2 and an algebraic multiplicity of 2. The eigenvalue has, however, only one distinct eigenvector x1 = [1, 0]> and, thus, geometric multiplicity 1.
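A small NumPy sketch of Example 4.6 makes the gap between algebraic and geometric multiplicity visible: the eigenvalue 2 appears twice, but the eigenspace ker(A − 2I) is only one-dimensional.

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])

# Both eigenvalues are 2 (algebraic multiplicity 2) ...
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)            # [2. 2.]

# ... but the returned eigenvectors are (numerically) collinear with
# [1, 0]^T, so the eigenspace is one-dimensional (geometric multiplicity 1).
print(eigvecs)
dim_eigenspace = 2 - np.linalg.matrix_rank(A - 2 * np.eye(2))
print(dim_eigenspace)     # 1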
Figure 4.4 Determinants and eigenspaces. Overview of five linear mappings and their associated transformation matrices Ai ∈ R2×2, projecting 400 color-coded points x ∈ R2 (left column) onto target points Ai x (right column). The central column depicts the first eigenvector, stretched by its associated eigenvalue λ1, and the second eigenvector, stretched by its eigenvalue λ2. Each row depicts the effect of one of five transformation matrices Ai with respect to the standard basis. The eigenvalues and determinants per row are: λ1 = 2.0, λ2 = 0.5, det(A) = 1.0; λ1 = 1.0, λ2 = 1.0, det(A) = 1.0; λ1 = 0.87 − 0.5j, λ2 = 0.87 + 0.5j, det(A) = 1.0; λ1 = 0.0, λ2 = 2.0, det(A) = 0.0; λ1 = 0.5, λ2 = 1.5, det(A) = 0.75.
half of the vertical axis, and to the left vice versa. This mapping is area
preserving (det(A2 ) = 1). The eigenvalue λ1 = 1 = λ2 is repeated
and the eigenvectors are collinear (drawn here for emphasis in two
opposite directions). This indicates that the mapping acts only along
one direction (the horizontal
axis).
A3 = [[cos(π/6), − sin(π/6)], [sin(π/6), cos(π/6)]] = (1/2) [[√3, −1], [1, √3]] . The matrix A3 rotates the points by π/6 rad = 30◦ anti-clockwise and has only complex eigenvalues, reflecting that the mapping is a rotation (hence, no eigenvectors are drawn). A rotation has to be volume preserving, and so the determinant is 1. For more details on rotations we refer to Section 3.9.
A4 = [[1, −1], [−1, 1]] represents a mapping in the standard basis that collapses a two-dimensional domain onto one dimension. Since one eigen-
Figure 4.5 Caenorhabditis elegans neural network (Kaiser and Hilgetag, 2006). (a) Symmetrized connectivity matrix (axes: neuron index); (b) Eigenspectrum (vertical axis: eigenvalue).
Methods to analyze and learn from network data are an essential component of machine learning. The key to understanding networks is the connectivity between network nodes, in particular whether two nodes are connected to each other or not. In data science applications, it is often useful to study the matrix that captures this connectivity data.
We build a connectivity/adjacency matrix A ∈ R277×277 of the complete neural network of the worm C. elegans. Each row/column represents one of the 277 neurons of this worm's brain. The connectivity matrix A has
a value of aij = 1 if neuron i talks to neuron j through a synapse, and
aij = 0 otherwise. The connectivity matrix is not symmetric, which im-
plies that eigenvalues may not be real valued. Therefore, we compute a
symmetrized version of the connectivity matrix as Asym := A + A> . This
new matrix Asym is shown in Figure 4.5(a) and has a non-zero value aij
if and only if two neurons are connected (white pixels), irrespective of the
direction of the connection. In Figure 4.5(b), we show the correspond-
ing eigenspectrum of Asym . The horizontal axis shows the index of the
eigenvalues, sorted in descending order. The vertical axis shows the corre-
sponding eigenvalue. The S -like shape of this eigenspectrum is typical for
many biological neural networks. The underlying mechanism responsible
for this is an area of active neuroscience research.
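A minimal sketch of this analysis (assuming NumPy; the C. elegans connectome itself is not reproduced here, so a random binary adjacency matrix stands in for A):

import numpy as np

rng = np.random.default_rng(0)
n = 277
A = (rng.random((n, n)) < 0.05).astype(float)   # hypothetical connectivity matrix

A_sym = A + A.T                                  # symmetrize, as in the text

# Symmetric matrices have real eigenvalues; sort them in descending order to
# obtain an eigenspectrum of the kind plotted in Figure 4.5(b).
eigenvalues = np.linalg.eigvalsh(A_sym)[::-1]
print(eigenvalues[:5])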
Example 4.8
Consider the matrix
A = [[3, 2, 2], [2, 3, 2], [2, 2, 3]] .   (4.37)
Figure 4.6 Geometric interpretation of eigenvalues. The eigenvectors of A get stretched by the corresponding eigenvalues. The area of the unit square changes by |λ1 λ2|, the circumference changes by a factor 2(|λ1| + |λ2|).

Theorem 4.17. The trace of a matrix A ∈ Rn×n is the sum of its eigenvalues, i.e.,
tr(A) = ∑_{i=1}^{n} λi ,   (4.43)
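A quick numerical check of Theorem 4.17 (and of the companion fact that the determinant equals the product of the eigenvalues), using the matrix from (4.37) and NumPy:

import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, 2.0],
              [2.0, 2.0, 3.0]])

eigenvalues = np.linalg.eigvals(A)                    # 7, 1, 1
print(np.trace(A), eigenvalues.sum().real)            # both 9.0
print(np.linalg.det(A), np.prod(eigenvalues).real)    # both 7.0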
up on a different website. The matrix A has the property that for any ini-
tial rank/importance vector x of a website the sequence x, Ax, A2 x, . . .
converges to a vector x∗ . This vector is called the PageRank and satisfies
Ax∗ = x∗ , i.e., it is an eigenvector (with corresponding eigenvalue 1) of
A. After normalizing x∗ , such that kx∗ k = 1, we can interpret the entries
as probabilities. More details and different perspectives on PageRank can
be found in the original technical report (Page et al., 1999).
Comparing the left hand side of (4.45) and the right hand side of (4.46) shows that there is a simple pattern in the diagonal elements lii:
l11 = √a11 ,  l22 = √(a22 − l21²) ,  l33 = √(a33 − (l31² + l32²)) .   (4.47)
Similarly, for the elements below the diagonal (lij, where i > j) there is also a repeating pattern:
l21 = a21/l11 ,  l31 = a31/l11 ,  l32 = (a32 − l31 l21)/l22 .   (4.48)
Thus, we have constructed the Cholesky decomposition for any symmetric, positive definite 3 × 3 matrix. The key realization is that we can backward-calculate what the components lij of L should be, given the values aij of A and the previously computed values of lij.
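The backward calculation in (4.47) and (4.48) translates directly into code; a minimal sketch, assuming NumPy and a hypothetical symmetric, positive definite test matrix:

import numpy as np

def cholesky_3x3(A):
    """Backward-calculate L for a symmetric, positive definite 3x3 matrix A,
    following the patterns in (4.47) and (4.48)."""
    L = np.zeros((3, 3))
    L[0, 0] = np.sqrt(A[0, 0])
    L[1, 0] = A[1, 0] / L[0, 0]
    L[2, 0] = A[2, 0] / L[0, 0]
    L[1, 1] = np.sqrt(A[1, 1] - L[1, 0]**2)
    L[2, 1] = (A[2, 1] - L[2, 0] * L[1, 0]) / L[1, 1]
    L[2, 2] = np.sqrt(A[2, 2] - (L[2, 0]**2 + L[2, 1]**2))
    return L

# A hypothetical symmetric, positive definite test matrix.
A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])

L = cholesky_3x3(A)
print(np.allclose(L @ L.T, A))                # True
print(np.allclose(L, np.linalg.cholesky(A)))  # matches NumPy's lower-triangular factor

The same row-by-row pattern generalizes to n × n matrices by looping over rows and columns.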
polynomial of A is
det(A − λI) = det([[2 − λ, 1], [1, 2 − λ]])   (4.56a)
= (2 − λ)² − 1 = λ² − 4λ + 3 = (λ − 3)(λ − 1) .   (4.56b)
Therefore, the eigenvalues of A are λ1 = 1 and λ2 = 3 (the roots of the characteristic polynomial), and the associated (normalized) eigenvectors are obtained via
[[2, 1], [1, 2]] p1 = 1 p1 ,   [[2, 1], [1, 2]] p2 = 3 p2 .   (4.57)
This yields
p1 = (1/√2) [1, −1]> ,   p2 = (1/√2) [1, 1]> .   (4.58)
Step 2: Check for existence The eigenvectors p1 , p2 form a basis of
R2 . Therefore, A can be diagonalized.
Step 3: Construct the matrix P to diagonalize A We collect the eigen-
vectors of A in P so that
P = [p1, p2] = (1/√2) [[1, 1], [−1, 1]] .   (4.59)
We then obtain
P⁻¹AP = [[1, 0], [0, 3]] = D .   (4.60)
Equivalently, we get (exploiting that P −1 = P > since the eigenvectors p1
and p2 in this example form an ONB)
[[2, 1], [1, 2]] = (1/√2)[[1, 1], [−1, 1]] · [[1, 0], [0, 3]] · (1/√2)[[1, −1], [1, 1]] , i.e., A = P D P> .   (4.61)

A^k = (P D P⁻¹)^k = P D^k P⁻¹ .   (4.62)
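A short NumPy check of the diagonalization above and of the power identity (4.62):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigenvectors collected in P diagonalize A: P^{-1} A P = D = diag(1, 3).
P = (1 / np.sqrt(2)) * np.array([[1.0, 1.0],
                                 [-1.0, 1.0]])
D = np.linalg.inv(P) @ A @ P
print(np.round(D, 10))

# Powers become cheap: A^k = P D^k P^{-1}, cf. (4.62).
k = 5
A_k = P @ np.diag(np.diag(D)**k) @ np.linalg.inv(P)
print(np.allclose(A_k, np.linalg.matrix_power(A, k)))   # True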
A = U Σ V> ,  where U ∈ Rm×m, Σ ∈ Rm×n, and V ∈ Rn×n .   (4.64)
The diagonal entries σi, i = 1, . . . , r, of Σ are called the singular values, ui are called the left-singular vectors, and vj are called the right-singular vectors. By convention, the singular values are ordered, i.e., σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0.
The singular value matrix Σ is unique, but it requires some attention. Observe that Σ ∈ Rm×n is rectangular. In particular, Σ is of the same size as A. This means that Σ has a diagonal submatrix that contains the singular values and needs additional zero padding. Specifically, if m > n, then the matrix Σ has diagonal structure up to row n and then consists of
Remark. The SVD exists for any matrix A ∈ Rm×n . ♦
value matrix Σ. Finally, it performs a second basis change via U. The SVD entails a number of important details and caveats, which is why we will review our intuition in more detail. (It is useful to revise basis changes (Section 2.7.2), orthogonal matrices (Definition 3.8), and orthonormal bases (Section 3.5).)
Assume we are given a transformation matrix of a linear mapping Φ : Rn → Rm with respect to the standard bases B and C of Rn and Rm, respectively. Moreover, assume a second basis B̃ of Rn and C̃ of Rm. Then
1. The matrix V performs a basis change in the domain Rn from B̃ (rep- (Section 3.5).
resented by the red and orange vectors v 1 and v 2 in the top-left of Fig-
ure 4.8) to the standard basis B . V > = V −1 performs a basis change
from B to B̃ . The red and orange vectors are now aligned with the
canonical basis in the bottom left of Figure 4.8.
2. Having changed the coordinate system to B̃ , Σ scales the new coordi-
nates by the singular values σi (and adds or deletes dimensions), i.e.,
Σ is the transformation matrix of Φ with respect to B̃ and C̃ , rep-
resented by the red and orange vectors being stretched and lying in
the e1 -e2 plane, which is now embedded in a third dimension in the
bottom right of Figure 4.8.
3. U performs a basis change in the codomain Rm from C̃ into the canoni-
cal basis of Rm , represented by a rotation of the red and orange vectors
out of the e1 -e2 plane. This is shown in the top-right of Figure 4.8.
The SVD expresses a change of basis in both the domain and codomain.
This is in contrast with the eigendecomposition that operates within the
same vector space, where the same basis change is applied and then un-
done. What makes the SVD special is that these two different bases are
simultaneously linked by the singular value matrix Σ.
R3 (see bottom right panel in Figure 4.9). Note that all vectors lie in the x1-x2 plane. The third coordinate is always 0. The vectors in the x1-x2 plane have been stretched by the singular values.
The direct mapping of the vectors X by A to the codomain R3 equals
the transformation of X by U ΣV > , where U performs a rotation within
the codomain R3 so that the mapped vectors are no longer restricted to
the x1 -x2 plane; they still are on a plane as shown in the top-right panel
of Figure 4.9.
Figure 4.9 Mapping of vectors under the SVD, following the same structure as Figure 4.8 (axes x1, x2, x3).
where P is an orthogonal matrix, which is composed of the orthonormal
eigenbasis. The λi > 0 are the eigenvalues of A> A. Let us assume the
SVD of A exists and inject (4.64) into (4.71). This yields
A> A = (U ΣV > )> (U ΣV > ) = V Σ> U > U ΣV > , (4.72)
This equation closely resembles the eigenvalue equation (4.25), but the
vectors on the left and the right-hand sides are not the same.
For n > m, (4.79) holds only for i ≤ m, and (4.79) says nothing about the ui for i > m. However, we know by construction that they are orthonormal. Conversely, for m > n, (4.79) holds only for i ≤ n. For i > n we have Av i = 0 and we still know that the v i form an orthonormal set.
This means that the SVD also supplies an orthonormal basis of the kernel
(null space) of A, the set of vectors x with Ax = 0 (see Section 2.7.3).
Moreover, concatenating the v i as the columns of V and the ui as the
columns of U yields
AV = U Σ , (4.80)
where Σ has the same dimensions as A and a diagonal structure for rows
1, . . . , r. Hence, right-multiplying with V > yields A = U ΣV > , which is
the SVD of A.
u2 = (1/σ2) A v2 = (1/1) [[1, 0, 1], [−2, 1, 0]] [0, 1/√5, 2/√5]> = [2/√5, 1/√5]> ,   (4.87)
U = [u1, u2] = (1/√5) [[1, 2], [−2, 1]] .   (4.88)
Note that on a computer the approach illustrated here has poor numerical
behavior, and the SVD of A is normally computed without resorting to the
eigenvalue decomposition of A> A.
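In practice, one calls a library routine directly; a minimal NumPy sketch for the matrix A = [[1, 0, 1], [−2, 1, 0]] appearing in (4.87) (singular vectors may differ from the hand computation by a sign):

import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-2.0, 1.0, 0.0]])

# Full SVD as in (4.64): U is 2x2, S holds the singular values, Vt is 3x3.
U, S, Vt = np.linalg.svd(A, full_matrices=True)
print(S)                                 # approximately [sqrt(6), 1]

# Reassemble A = U Sigma V^T with the rectangular Sigma.
Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(S)
print(np.allclose(U @ Sigma @ Vt, A))    # True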
Figure 4.10 Movie ratings of three people (Ali, Beatrix, Chandra) for four movies and its SVD decomposition A = U Σ V>:

Star Wars      [5, 4, 1]
Blade Runner   [5, 5, 0]
Amelie         [0, 0, 5]
Delicatessen   [1, 0, 4]

U = [[−0.6710, 0.0236, 0.4647, −0.5774],
     [−0.7197, 0.2054, −0.4759, 0.4619],
     [−0.0939, −0.7705, −0.5268, −0.3464],
     [−0.1515, −0.6030, 0.5293, −0.5774]]

Σ = [[9.6438, 0, 0],
     [0, 6.3639, 0],
     [0, 0, 0.7056],
     [0, 0, 0]]

V> = [[−0.7367, −0.6515, −0.1811],
      [0.0852, 0.1762, −0.9807],
      [0.6708, −0.7379, −0.0743]]
represents a movie and each column a user. Thus, the column vectors of
movie ratings, one for each viewer, are xAli , xBeatrix , xChandra .
Factoring A using the SVD offers us a way to capture the relationships
of how people rate movies, and especially if there is a structure linking
which people like which movies. Applying the SVD to our data matrix A
makes a number of assumptions:
1. All viewers rate movies consistently using the same linear mapping.
2. There are no errors or noise in the ratings.
3. We interpret the left-singular vectors ui as stereotypical movies and
the right-singular vectors v j as stereotypical viewers.
We then make the assumption that any viewer’s specific movie preferences
can be expressed as a linear combination of the v j . Similarly, any movie’s
like-ability can be expressed as a linear combination of the ui . Therefore,
a vector in the domain of the SVD can be interpreted as a viewer in the
“space” of stereotypical viewers, and a vector in the codomain of the SVD
correspondingly as a movie in the "space" of stereotypical movies. (These two "spaces" are only meaningfully spanned by the respective viewer and movie data if the data itself covers a sufficient diversity of viewers and movies.) Let us inspect the SVD of our movie-user matrix. The first left-singular vector u1 has large absolute values for the two science fiction movies and a large first singular value (red shading in Figure 4.10). Thus, this groups a type of users with a specific set of movies (science fiction theme). Similarly, the first right-singular vector v1 shows large absolute values for Ali and Beatrix, who give high ratings to science fiction movies (green shading in Figure 4.10). This suggests that v1 reflects the notion of a science fiction lover.
Similarly, u2 seems to capture a French art house film theme, and v2 indicates that Chandra is close to an idealized lover of such movies. An idealized science fiction lover is a purist and only loves science fiction movies, so a science fiction lover v1 gives a rating of zero to everything but science fiction themed movies – this logic is implied by the diagonal substructure of the singular value matrix Σ. A specific movie is therefore represented by how it decomposes (linearly) into its stereotypical movies. Likewise, a person would be represented by how they decompose (via linear combination) into movie themes.
It is possible to define the SVD of a rank-r matrix A so that U is an m × r matrix, Σ a diagonal r × r matrix, and V an n × r matrix (so that V> is r × n). This construction is very similar to our definition and ensures that the diagonal matrix Σ has only non-zero entries along the diagonal. The main convenience of this alternative notation is that Σ is diagonal, as in the eigenvalue decomposition. Sometimes this formulation is called the reduced SVD (e.g., Datta (2010)) or the SVD (e.g., Press et al. (2007)). This alternative format changes merely how the matrices are constructed but leaves the mathematical structure of the SVD unchanged.
In Section 4.6, we will learn about matrix approximation techniques using the SVD, which is also called the truncated SVD.
A restriction that the SVD for A only applies to m × n matrices with
m > n is practically unnecessary. When m < n the SVD decomposition
will yield Σ with more zero columns than rows and, consequently, the
singular values σm+1 , . . . , σn are 0.
The SVD is used in a variety of applications in machine learning from
least squares problems in curve fitting to solving systems of linear equa-
tions. These applications harness various important properties of the SVD,
its relation to the rank of a matrix and its ability to approximate matrices
of a given rank with lower-rank matrices. Substituting a matrix with its SVD often has the advantage of making calculations more robust to numerical rounding errors. As we will explore in the next section, the SVD's
ability to approximate matrices with “simpler” matrices in a principled
manner opens up machine learning applications ranging from dimension-
ality reduction and topic modeling to data compression and clustering.
Figure 4.11 Image processing with the SVD. (a) The original grayscale image is a 1,432 × 1,910 matrix of values between 0 (black) and 1 (white). (b)–(f) Rank-1 matrices A1, . . . , A5 and their corresponding singular values σ1, . . . , σ5: (b) A1, σ1 ≈ 228,052; (c) A2, σ2 ≈ 40,647; (d) A3, σ3 ≈ 26,125; (e) A4, σ4 ≈ 20,232; (f) A5, σ5 ≈ 15,436. The grid-like structure of each rank-1 matrix is imposed by the outer product of the left- and right-singular vectors.

of U and V. Figure 4.11 shows an image of Stonehenge, which can be represented by a matrix A ∈ R1432×1910, and some outer products Ai, as defined in (4.90).
A matrix A ∈ Rm×n of rank r can be written as a sum of rank-1 matrices Ai so that
A = ∑_{i=1}^{r} σi ui vi> = ∑_{i=1}^{r} σi Ai ,   (4.91)
of A with rk(Â(k)) = k. Figure 4.12 shows low-rank approximations Â(k) of an original image A of Stonehenge. The shape of the rocks becomes increasingly visible and clearly recognizable in the rank-5 approximation. While the original image requires 1,432 · 1,910 = 2,735,120 numbers, the rank-5 approximation requires us only to store the five singular values and the five left- and right-singular vectors (1,432- and 1,910-dimensional each) for a total of 5 · (1,432 + 1,910 + 1) = 16,715 numbers – just above 0.6% of the original.
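A hedged sketch of the rank-k construction (assuming NumPy; a small random matrix stands in for the image, which is not reproduced here):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 300))                         # stand-in for the grayscale image

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]      # rank-k approximation, cf. (4.91)
print(np.linalg.matrix_rank(A_hat))                # 5

# Storage comparison for the 1,432 x 1,910 image discussed above:
print(1432 * 1910, 5 * (1432 + 1910 + 1))          # 2735120 vs 16715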
To measure the difference (error) between A and its rank-k approximation Â(k), we need the notion of a norm. In Section 3.1, we already used norms on vectors that measure the length of a vector. By analogy, we can also define norms on matrices.
Definition 4.23 (Spectral Norm of a Matrix). For x ∈ Rn \ {0}, the spectral norm of a matrix A ∈ Rm×n is defined as
‖A‖2 := max_x ‖Ax‖2 / ‖x‖2 .   (4.93)
We introduce the notation of a subscript in the matrix norm (left-hand
side), similar to the Euclidean norm for vectors (right-hand side), which
has subscript 2. The spectral norm (4.93) determines how long any vector
x can at most become when multiplied by A.
Theorem 4.24. The spectral norm of A is its largest singular value σ1 .
We leave the proof of this theorem as an exercise.
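A small numerical illustration of Theorem 4.24 (assuming NumPy and an arbitrary random matrix): the maximal stretch over sampled directions approaches the largest singular value, which is also what NumPy's matrix 2-norm returns.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))     # an arbitrary (hypothetical) matrix

# Largest singular value ...
sigma_1 = np.linalg.svd(A, compute_uv=False)[0]

# ... compared with the stretch ||Ax|| / ||x|| over many random directions.
x = rng.standard_normal((3, 10000))
stretch = np.linalg.norm(A @ x, axis=0) / np.linalg.norm(x, axis=0)
print(sigma_1, stretch.max())       # the sampled maximum approaches sigma_1
print(np.linalg.norm(A, 2))         # the matrix 2-norm is exactly sigma_1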
Theorem 4.25 (Eckart-Young Theorem (Eckart and Young, 1936)). Consider a matrix A ∈ Rm×n of rank r and let B ∈ Rm×n be a matrix of rank k. For any k ≤ r with Â(k) = ∑_{i=1}^{k} σi ui vi>, it holds that
Â(k) = argmin_{rk(B)≤k} ‖A − B‖2 ,   (4.94)
‖A − Â(k)‖2 = σk+1 .   (4.95)
Figure 4.13 A functional phylogeny of matrices encountered in machine learning. Real matrices (for which a pseudo-inverse and an SVD exist) split into square matrices Rn×n (which have a determinant and a trace) and nonsquare matrices Rn×m. Square matrices with det = 0 are singular; those with det ≠ 0 are regular (invertible), i.e., an inverse matrix exists. Square matrices without a basis of eigenvectors are defective; those with a basis of eigenvectors are non-defective (diagonalizable), further split into normal matrices (A>A = AA>) and non-normal matrices. Normal matrices include the symmetric matrices, whose eigenvalues are real.
to perform on potentially very large matrices of data (Trefethen and Bau III,
1997). Moreover, low-rank approximations are used to operate on ma-
trices that may contain missing values as well as for purposes of lossy
compression and dimensionality reduction (Moonen and De Moor, 1995;
Markovsky, 2011).
Exercises
4.1 Compute the determinant using the Laplace expansion (using the first row)
and the Sarrus Rule for
A = [[1, 3, 5], [2, 4, 6], [0, 2, 4]] .
4.6 Compute the eigenspaces of the following transformation matrices. Are they
diagonalizable?
1. A = [[2, 3, 0], [1, 4, 3], [0, 0, 1]]
2. A = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
4.7 Are the following matrices diagonalizable? If yes, determine their diagonal
form and a basis with respect to which the transformation matrices are di-
agonal. If no, give reasons why they are not diagonalizable.
1. A = [[0, 1], [−8, 4]]
2. A = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
3. A = [[5, 4, 2, 1], [0, 1, −1, −1], [−1, −1, 3, 0], [1, 1, −1, 2]]
4. A = [[5, −6, −6], [−1, 4, 2], [3, −6, −4]]
4.11 Show that for any A ∈ Rm×n the matrices A> A and AA> possess the
same non-zero eigenvalues.
4.12 Show that for x ≠ 0 Theorem 4.24 holds, i.e., show that
max_x ‖Ax‖2 / ‖x‖2 = σ1 ,
5 Vector Calculus

Figure: (a) Regression problem: find parameters such that the curve explains the observations (crosses) well. (b) Density estimation with a Gaussian mixture model, i.e., modeling data distributions: find means and covariances such that the data (dots) can be explained well.
Figure: A mind map of the concepts introduced in this chapter (partial derivatives, collected in the Jacobian and the Hessian), along with when they are used in other parts of the book (Chapter 6 Probability, Chapter 11 Density Estimation).
Example 5.1
Recall the dot product as a special case of an inner product (Section 3.2). In the above notation, the function f(x) = x>x, x ∈ R2, would be specified as
f : R2 → R   (5.2a)
x ↦ x1² + x2² .   (5.2b)
δy/δx := (f(x + δx) − f(x)) / δx   (5.3)
computes the slope of the secant line through two points on the graph of
f . In Figure 5.3 these are the points with x-coordinates x0 and x0 + δx.
The difference quotient can also be considered the average slope of f
between x and x + δx if we assume f to be a linear function. In the limit
for δx → 0, we obtain the tangent of f at x, if f is differentiable. The
tangent is then the derivative of f at x.
Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f at x is defined as the limit
df/dx := lim_{h→0} (f(x + h) − f(x)) / h ,   (5.4)
and the secant in Figure 5.3 becomes a tangent.
The derivative of f points in the direction of steepest ascent of f .
= lim_{h→0} ( ∑_{i=0}^{n} \binom{n}{i} x^{n−i} h^{i} − x^n ) / h .   (5.5c)
We see that x^n = \binom{n}{0} x^{n−0} h^0. By starting the sum at 1, the x^n-term cancels, and we obtain
df/dx = lim_{h→0} ( ∑_{i=1}^{n} \binom{n}{i} x^{n−i} h^{i} ) / h   (5.6a)
= lim_{h→0} ∑_{i=1}^{n} \binom{n}{i} x^{n−i} h^{i−1}   (5.6b)
= lim_{h→0} ( \binom{n}{1} x^{n−1} + ∑_{i=2}^{n} \binom{n}{i} x^{n−i} h^{i−1} )   (5.6c)
= n!/(1!(n − 1)!) x^{n−1} = n x^{n−1} ,   (5.6d)
since the sum starting at i = 2 vanishes as h → 0.
Definition 5.3 (Taylor Polynomial). The Taylor polynomial of degree n of f : R → R at x0 is defined as
Tn(x) := ∑_{k=0}^{n} f^(k)(x0) / k! · (x − x0)^k ,   (5.7)
(We define t^0 := 1 for all t ∈ R.)
For x0 = 0, we obtain the Maclaurin series as a special instance of the Taylor series. If f(x) = T∞(x), then f is called analytic. (Here f ∈ C∞ means that f is continuously differentiable infinitely many times.)
Remark. In general, a Taylor polynomial of degree n is an approximation of a function, which does not need to be a polynomial. The Taylor polynomial is similar to f in a neighborhood around x0. However, a Taylor polynomial of degree n is an exact representation of a polynomial f of degree k ≤ n since all derivatives f^(i), i > k, vanish. ♦
Figure 5.4 Taylor polynomials (dashed) around x0 = 0. Higher-order Taylor polynomials approximate the function f better and more globally. T10 is already similar to f in [−4, 4]. (Axes: x, y.)
Example 5.4 (Taylor Series)
Consider the function in Figure 5.4 given by
f (x) = sin(x) + cos(x) ∈ C ∞ . (5.19)
We seek a Taylor series expansion of f at x0 = 0, which is the Maclaurin
series expansion of f . We obtain the following derivatives:
f (0) = sin(0) + cos(0) = 1 (5.20)
f 0 (0) = cos(0) − sin(0) = 1 (5.21)
f''(0) = − sin(0) − cos(0) = −1   (5.22)
f^(3)(0) = − cos(0) + sin(0) = −1   (5.23)
f^(4)(0) = sin(0) + cos(0) = f(0) = 1   (5.24)
..
.
We can see a pattern here: The coefficients in our Taylor series are only
±1 (since sin(0) = 0), each of which occurs twice before switching to the
other one. Furthermore, f (k+4) (0) = f (k) (0).
Therefore, the full Taylor series expansion of f at x0 = 0 is given by
T∞(x) = ∑_{k=0}^{∞} f^(k)(x0)/k! · (x − x0)^k   (5.25a)
= 1 + x − (1/2!)x² − (1/3!)x³ + (1/4!)x⁴ + (1/5!)x⁵ − · · ·   (5.25b)
= 1 − (1/2!)x² + (1/4!)x⁴ ∓ · · · + x − (1/3!)x³ + (1/5!)x⁵ ∓ · · ·   (5.25c)
= ∑_{k=0}^{∞} (−1)^k x^{2k}/(2k)! + ∑_{k=0}^{∞} (−1)^k x^{2k+1}/(2k + 1)!   (5.25d)
= cos(x) + sin(x) ,   (5.25e)
where ak are coefficients and c is a constant, which has the special form
in Definition 5.4. ♦
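A minimal sketch (Python, assuming NumPy) that evaluates the Maclaurin polynomials of f(x) = sin(x) + cos(x) using the ±1 coefficient pattern derived above; higher degrees approach f:

import numpy as np
from math import factorial

def taylor_poly(x, n):
    """Degree-n Maclaurin polynomial of f(x) = sin(x) + cos(x), using the
    derivative pattern f^(k)(0) in (5.20)-(5.24): 1, 1, -1, -1, 1, 1, ..."""
    coeffs = [1.0, 1.0, -1.0, -1.0]
    return sum(coeffs[k % 4] / factorial(k) * x**k for k in range(n + 1))

x = 1.5
f = np.sin(x) + np.cos(x)
for n in (1, 5, 10):
    print(n, taylor_poly(x, n), f)   # higher degrees approach f(x)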
∂f(x, y)/∂y = 2(x + 2y³) · ∂/∂y (x + 2y³) = 12(x + 2y³)y² ,   (5.42)
where we used the chain rule (5.32) to compute the partial derivatives.
Chain Rule:  ∂/∂x (g ◦ f)(x) = ∂/∂x g(f(x)) = ∂g/∂f · ∂f/∂x   (5.48)
Let us have a closer look at the chain rule. The chain rule (5.48) resembles to some degree the rules for matrix multiplication, where we said that neighboring dimensions have to match for matrix multiplication to be defined; see Section 2.2.1. If we go from left to right, the chain rule exhibits similar properties: ∂f shows up in the "denominator" of the first factor and in the "numerator" of the second factor. If we multiply the factors together, multiplication is defined, i.e., the dimensions of ∂f match, and ∂f "cancels", such that ∂g/∂x remains. (This is only an intuition, and not mathematically correct, since the partial derivative is not a fraction.)
Example 5.8
Consider f(x1, x2) = x1² + 2x2, where x1 = sin t and x2 = cos t. Then
df/dt = ∂f/∂x1 · ∂x1/∂t + ∂f/∂x2 · ∂x2/∂t   (5.50a)
= 2 sin t · ∂ sin t/∂t + 2 · ∂ cos t/∂t   (5.50b)
= 2 sin t cos t − 2 sin t = 2 sin t (cos t − 1)   (5.50c)
is the corresponding derivative of f with respect to t.
∂f/∂s = ∂f/∂x1 · ∂x1/∂s + ∂f/∂x2 · ∂x2/∂s ,   (5.51)
∂f/∂t = ∂f/∂x1 · ∂x1/∂t + ∂f/∂x2 · ∂x2/∂t ,   (5.52)
and the gradient is obtained by the matrix multiplication
df/d(s, t) = ∂f/∂x · ∂x/∂(s, t) = [∂f/∂x1  ∂f/∂x2] · [[∂x1/∂s, ∂x1/∂t], [∂x2/∂s, ∂x2/∂t]] .
This compact way of writing the chain rule as a matrix multiplication only makes sense if the gradient is defined as a row vector. Otherwise, we will need to start transposing gradients for the matrix dimensions to match. This may still be straightforward as long as the gradient is a vector or a matrix; however, when the gradient becomes a tensor (we will discuss this in the following), the transpose is no longer a triviality.
Remark (Verifying the Correctness of a Gradient Implementation). The definition of the partial derivatives as the limit of the corresponding difference quotient, see (5.39), can be exploited when numerically checking the correctness of gradients in computer programs: When we compute gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value h to be small (e.g., h = 10⁻⁴) and compare the finite-difference approximation from (5.39) with our (analytic) implementation of the gradient. If the error is small, our gradient implementation is probably correct. "Small" could mean that √( ∑_i (dh_i − df_i)² / ∑_i (dh_i + df_i)² ) < 10⁻⁶, where dh_i is the finite-difference approximation and df_i is the analytic gradient of f with respect to the ith variable x_i. ♦
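A minimal gradient-checking sketch in NumPy for a hypothetical test function (central differences are used here for a slightly more accurate estimate than the one-sided quotient (5.39)):

import numpy as np

def f(x):
    # A hypothetical test function: f(x) = x1^2 + 2*x2.
    return x[0]**2 + 2 * x[1]

def analytic_grad(x):
    return np.array([2 * x[0], 2.0])

def finite_difference_grad(f, x, h=1e-4):
    # Central differences: slightly more accurate than the forward quotient.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = np.array([0.7, -1.2])
df = analytic_grad(x)
dh = finite_difference_grad(f, x)

# Relative error criterion from the remark above; a tiny value suggests
# the analytic gradient is implemented correctly.
rel_err = np.sqrt(np.sum((dh - df)**2) / np.sum((dh + df)**2))
print(rel_err)           # far below 1e-6 here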
Writing the vector-valued function in this way allows us to view a vector-
valued function f : Rn → Rm as a vector of functions [f1 , . . . , fm ]> ,
fi : Rn → R that map onto R. The differentiation rules for every fi are
exactly the ones we discussed in Section 5.2.
∂f/∂xi = [∂f1/∂xi, . . . , ∂fm/∂xi]> ∈ Rm .   (5.55)
From (5.40), we know that the gradient of f with respect to a vector is
the row vector of the partial derivatives. In (5.55), every partial derivative
∂f /∂xi is a column vector. Therefore, we obtain the gradient of f : Rn →
Rm with respect to x ∈ Rn by collecting these partial derivatives:
df(x)/dx = [∂f(x)/∂x1, · · · , ∂f(x)/∂xn]   (5.56a)
= [[∂f1(x)/∂x1 · · · ∂f1(x)/∂xn], . . . , [∂fm(x)/∂x1 · · · ∂fm(x)/∂xn]] ∈ Rm×n .   (5.56b)
exists also the denominator layout, which is the transpose of the numerator denominator layout
layout. In this book, we will use the numerator layout. ♦
We will see how the Jacobian is used in the change-of-variable method
for probability distributions in Section 6.7. The amount of scaling due to
the transformation of a variable is provided by the determinant.
In Section 4.1, we saw that the determinant can be used to compute
the area of a parallelogram. If we are given two vectors b1 = [1, 0]> ,
b2 = [0, 1]> as the sides of the unit square (blue, see Figure 5.5), the area
of this square is
det([[1, 0], [0, 1]]) = 1 .   (5.60)
If we take a parallelogram with the sides c1 = [−2, 1]> , c2 = [1, 1]>
(orange in Figure 5.5) its area is given as the absolute value of the deter-
minant (see Section 4.1)
|det([[−2, 1], [1, 1]])| = | − 3| = 3 ,   (5.61)
i.e., the area of this parallelogram is exactly three times the area of the unit square. We
can find this scaling factor by finding a mapping that transforms the unit
square into the other square. In linear algebra terms, we effectively per-
form a variable transformation from (b1 , b2 ) to (c1 , c2 ). In our case, the
mapping is linear and the absolute value of the determinant of this map-
ping gives us exactly the scaling factor we are looking for.
We will describe two approaches to identify this mapping. First, we ex-
ploit that the mapping is linear so that we can use the tools from Chapter 2
to identify this mapping. Second, we will find the mapping using partial
derivatives using the tools we have been discussing in this chapter.
Approach 1 To get started with the linear algebra approach, we
identify both {b1 , b2 } and {c1 , c2 } as bases of R2 (see Section 2.6.1 for a
recap). What we effectively perform is a change of basis from (b1 , b2 ) to
(c1 , c2 ), and we are looking for the transformation matrix that implements
the basis change. Using results from Section 2.7.2, we identify the desired
basis change matrix as
J = [[−2, 1], [1, 1]] ,   (5.62)
such that J b1 = c1 and J b2 = c2 . The absolute value of the determi-
nant of J , which yields the scaling factor we are looking for, is given as
|det(J )| = 3, i.e., the area of the square spanned by (c1 , c2 ) is three times
greater than the area spanned by (b1 , b2 ).
Approach 2 The linear algebra approach works for linear trans-
formations; for nonlinear transformations (which become relevant in Sec-
tion 6.7), we follow a more general approach using partial derivatives.
For this approach, we consider a function f : R2 → R2 that performs
a variable transformation. In our example, f maps the coordinate repre-
sentation of any vector x ∈ R2 with respect to (b1 , b2 ) onto the coordi-
nate representation y ∈ R2 with respect to (c1 , c2 ). We want to identify
the mapping so that we can compute how an area (or volume) changes
when it is being transformed by f . For this we need to find out how f (x)
changes if we modify x a bit. This question is exactly answered by the
Jacobian matrix df dx
∈ R2×2 . Since we can write
y1 = −2x1 + x2 (5.63)
y2 = x 1 + x 2 (5.64)
we obtain the functional relationship between x and y , which allows us
to get the partial derivatives
∂y1/∂x1 = −2 ,  ∂y1/∂x2 = 1 ,  ∂y2/∂x1 = 1 ,  ∂y2/∂x2 = 1   (5.65)
and compose the Jacobian as
J = [[∂y1/∂x1, ∂y1/∂x2], [∂y2/∂x1, ∂y2/∂x2]] = [[−2, 1], [1, 1]] .   (5.66)
The Jacobian represents the coordinate transformation we are looking for and is exact if the coordinate transformation is linear (as in our case), and (5.66) recovers exactly the basis change matrix in (5.62). If the coordinate transformation is nonlinear, the Jacobian approximates this nonlinear transformation locally with a linear one. The absolute value of the Jacobian determinant |det(J)| is the factor by which areas or volumes are scaled when coordinates are transformed. In our case, we obtain |det(J)| = 3. (Geometrically, the Jacobian determinant gives the magnification/scaling factor when we transform an area or volume.)
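A short NumPy check of this example: J maps b1, b2 to c1, c2, and |det(J)| = 3.

import numpy as np

# The linear map from (b1, b2) coordinates to (c1, c2) coordinates, cf. (5.62)/(5.66).
J = np.array([[-2.0, 1.0],
              [1.0, 1.0]])

b1 = np.array([1.0, 0.0])
b2 = np.array([0.0, 1.0])
c1 = np.array([-2.0, 1.0])
c2 = np.array([1.0, 1.0])

print(np.allclose(J @ b1, c1), np.allclose(J @ b2, c2))   # True True
print(abs(np.linalg.det(J)))                              # 3.0: the area scaling factor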
The Jacobian determinant and variable transformations will become relevant in Section 6.7 when we transform random variables and probability distributions. These transformations are extremely relevant in machine learning in the context of training deep neural networks using the reparametrization trick, also called infinite perturbation analysis.

Figure 5.6 Dimensionality of (partial) derivatives.

In this chapter, we encountered derivatives of functions. Figure 5.6 summarizes the dimensions of those derivatives. If f : R → R, the gradient is simply a scalar (top-left entry). For f : RD → R, the gradient is a 1 × D row vector (top-right entry). For f : R → RE, the gradient is an E × 1 column vector, and for f : RD → RE the gradient is an E × D matrix.
We collect the partial derivatives in the Jacobian and obtain the gradient
df/dx = [[∂f1/∂x1 · · · ∂f1/∂xN], . . . , [∂fM/∂x1 · · · ∂fM/∂xN]] = [[A11 · · · A1N], . . . , [AM1 · · · AMN]] = A ∈ RM×N .   (5.68)
Remark. We would have obtained the same result without using the chain
rule by immediately looking at the function
L2(θ) := ‖y − Φθ‖² = (y − Φθ)>(y − Φθ) .   (5.84)
This approach is still practical for simple functions like L2 but becomes
impractical for deep function compositions. ♦
dà dA
∈ R8×3 ∈ R4×2×3
A ∈ R4×2 Ã ∈ R8 dx dx
∂fi/∂A_{k≠i,:} = 0> ∈ R1×1×N   (5.91)
where we have to pay attention to the correct dimensionality. Since fi
maps onto R and each row of A is of size 1 × N , we obtain a 1 × 1 × N -
sized tensor as the partial derivative of fi with respect to a row of A.
We stack the partial derivatives (5.91) and get the desired gradient in (5.87) via
∂fi/∂A = [0>, . . . , 0>, x>, 0>, . . . , 0>] ∈ R1×(M×N) ,   (5.92)
where x> sits at the ith position.
∂pqij = Riq if j = p, p ≠ q;  Rip if j = q, p ≠ q;  2Riq if j = p, p = q;  and 0 otherwise.   (5.98)
From (5.94), we know that the desired gradient has the dimension (N × N) × (M × N), and every single entry of this tensor is given by ∂pqij in (5.98), where p, q, j = 1, . . . , N and i = 1, . . . , M.
Writing out the gradient in this explicit way is often impractical since it
often results in a very lengthy expression for a derivative. In practice,
it means that, if we are not careful, the implementation of the gradient
could be significantly more expensive than computing the function, which
is an unnecessary overhead. For training deep neural network models, the
backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; backpropagation
Rumelhart et al., 1986) is an efficient way to compute the gradient of an
error function with respect to the parameters of the model.
where x are the inputs (e.g., images), y are the observations (e.g., class labels), and every function fi, i = 1, . . . , K, possesses its own parameters. In neural networks with multiple layers, we have functions fi(x_{i−1}) = σ(Ai x_{i−1} + bi) in the ith layer. Here x_{i−1} is the output of layer i − 1 and σ an activation function, such as the logistic sigmoid 1/(1 + e^{−x}), tanh, or a rectified linear unit (ReLU). (We discuss the case where the activation functions are identical in each layer to unclutter notation.) In order to train these models, we require the gradient of a loss function L with respect to all model parameters Aj, bj for j = 1, . . . , K. This also requires us to compute the gradient of L with respect to the inputs of each layer. For example, if we have inputs x and observations y and a network structure defined by
f_0 := x   (5.112)
f_i := σ_i(A_{i−1} f_{i−1} + b_{i−1}) ,  i = 1, . . . , K ,   (5.113)
∂L/∂θ_i = ∂L/∂f_K · ∂f_K/∂f_{K−1} · · · ∂f_{i+2}/∂f_{i+1} · ∂f_{i+1}/∂θ_i   (5.118)
The orange terms are partial derivatives of the output of a layer with respect to its inputs, whereas the blue terms are partial derivatives of the output of a layer with respect to its parameters. Assuming we have
Figure 5.9 Backward pass in a multi-layer neural network (x → f1 → · · · → f_{K−1} → f_K → L, with parameters A_1, b_1, . . . , A_{K−1}, b_{K−1}) to compute the gradients of the loss function.
Example 5.14
Consider the function
f(x) = √(x² + exp(x²)) + cos(x² + exp(x²))   (5.122)
from (5.109). If we were to implement a function f on a computer, we would be able to save some computation by using intermediate variables:
a = x² ,   (5.123)
b = exp(a) ,   (5.124)
c = a + b ,   (5.125)
d = √c ,   (5.126)
e = cos(c) ,   (5.127)
f = d + e .   (5.128)
Figure 5.11 Computation graph with inputs x, function values f, and intermediate variables a, b, c, d, e: x → (·)² → a; a → exp(·) → b; (a, b) → + → c; c → √· → d; c → cos(·) → e; (d, e) → + → f.
This is the same kind of thinking process that occurs when applying the
chain rule. Note that the above set of equations require fewer operations
than a direct implementation of the function f (x) as defined in (5.109).
The corresponding computation graph in Figure 5.11 shows the flow of
data and computations required to obtain the function value f .
The set of equations that include intermediate variables can be thought
of as a computation graph, a representation that is widely used in imple-
mentations of neural network software libraries. We can directly compute
the derivatives of the intermediate variables with respect to their corre-
sponding inputs by recalling the definition of the derivative of elementary
functions. We obtain:
∂a/∂x = 2x   (5.129)
∂b/∂a = exp(a)   (5.130)
∂c/∂a = 1 = ∂c/∂b   (5.131)
∂d/∂c = 1/(2√c)   (5.132)
∂e/∂c = − sin(c)   (5.133)
∂f/∂d = 1 = ∂f/∂e .   (5.134)
By looking at the computation graph in Figure 5.11, we can compute ∂f/∂x by working backward from the output and obtain
∂f/∂c = ∂f/∂d · ∂d/∂c + ∂f/∂e · ∂e/∂c   (5.135)
∂f/∂b = ∂f/∂c · ∂c/∂b   (5.136)
∂f/∂a = ∂f/∂b · ∂b/∂a + ∂f/∂c · ∂c/∂a   (5.137)
∂f/∂x = ∂f/∂a · ∂a/∂x .   (5.138)
Note that we implicitly applied the chain rule to obtain ∂f /∂x. By substi-
tuting the results of the derivatives of the elementary functions, we get
∂f/∂c = 1 · 1/(2√c) + 1 · (− sin(c))   (5.139)
∂f/∂b = ∂f/∂c · 1   (5.140)
∂f/∂a = ∂f/∂b · exp(a) + ∂f/∂c · 1   (5.141)
∂f/∂x = ∂f/∂a · 2x .   (5.142)
By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counterintuitive since the mathematical expression for the derivative ∂f/∂x in (5.110) is significantly more complicated than the mathematical expression of the function f(x) in (5.109).
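A minimal Python sketch of the forward and backward passes through the computation graph of Figure 5.11, with a finite-difference check of the resulting ∂f/∂x:

import numpy as np

def f_and_grad(x):
    """Forward and backward pass through the computation graph of Figure 5.11
    for f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2))."""
    # Forward pass: intermediate variables (5.123)-(5.128).
    a = x**2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e

    # Backward pass: reuse the intermediate values, cf. (5.139)-(5.142).
    df_dc = 1.0 / (2 * np.sqrt(c)) - np.sin(c)
    df_db = df_dc * 1.0
    df_da = df_db * np.exp(a) + df_dc * 1.0
    df_dx = df_da * 2 * x
    return f, df_dx

x = 0.5
value, grad = f_and_grad(x)

# Finite-difference sanity check of the backward pass.
h = 1e-6
numeric = (f_and_grad(x + h)[0] - f_and_grad(x - h)[0]) / (2 * h)
print(value, grad, numeric)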
where the gi (·) are elementary functions and xPa(xi ) are the parent nodes
of the variable xi in the graph. Given a function defined in this way, we
can use the chain rule to compute the derivative of the function in a step-
by-step fashion. Recall that by definition f = xD and hence
∂f/∂xD = 1 .   (5.144)
For other variables xi, we apply the chain rule
∂f/∂xi = ∑_{xj : xi ∈ Pa(xj)} ∂f/∂xj · ∂xj/∂xi = ∑_{xj : xi ∈ Pa(xj)} ∂f/∂xj · ∂gj/∂xi ,   (5.145)
Figure: Linear approximation of a function around x0 by the first-order Taylor series expansion, f(x0) + f'(x0)(x − x0).
Figure 5.13
Visualizing outer
products. Outer
products of vectors
increase the
dimensionality of
the array by 1 per
term. (a) Given a vector δ ∈ R4 , we obtain the outer product δ 2 := δ ⊗ δ = δδ > ∈
R4×4 as a matrix.
where D_x^k f(x0) is the k-th (total) derivative of f with respect to x, evaluated at x0.
Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of f at x0 contains the first n + 1 components of the series in (5.151) and is defined as
Tn(x) = ∑_{k=0}^{n} D_x^k f(x0) / k! · δ^k .   (5.152)
in the Taylor series, where D_x^k f(x0) δ^k contains k-th order polynomials.
Now that we defined the Taylor series for vector fields, let us explicitly write down the first terms D_x^k f(x0) δ^k of the Taylor series expansion for k = 0, . . . , 3 and δ := x − x0:
k = 0 :  D_x^0 f(x0) δ^0 = f(x0) ∈ R   (5.156)
k = 1 :  D_x^1 f(x0) δ^1 = ∇_x f(x0) δ = ∑_{i=1}^{D} ∇_x f(x0)[i] δ[i] ∈ R   (5.157)
k = 2 :  D_x^2 f(x0) δ^2 = tr(H(x0) δ δ>) = δ> H(x0) δ   (5.158)
= ∑_{i=1}^{D} ∑_{j=1}^{D} H[i, j] δ[i] δ[j] ∈ R   (5.159)
k = 3 :  D_x^3 f(x0) δ^3 = ∑_{i=1}^{D} ∑_{j=1}^{D} ∑_{k=1}^{D} D_x^3 f(x0)[i, j, k] δ[i] δ[j] δ[k] ∈ R   (5.160)
(In NumPy notation, these contractions correspond to np.einsum('i,i', Df1, d), np.einsum('ij,i,j', Df2, d, d), and np.einsum('ijk,i,j,k', Df3, d, d, d), respectively.)
Here, H(x0 ) is the Hessian of f evaluated at x0 .
∂f/∂x = 2x + 2y  ⟹  ∂f/∂x(1, 2) = 6   (5.163)
∂f/∂y = 2x + 3y²  ⟹  ∂f/∂y(1, 2) = 14 .   (5.164)
Therefore, we obtain
D¹_{x,y} f(1, 2) = ∇_{x,y} f(1, 2) = [∂f/∂x(1, 2)  ∂f/∂y(1, 2)] = [6  14] ∈ R1×2   (5.165)
such that
D¹_{x,y} f(1, 2) / 1! · δ = [6  14] [x − 1, y − 2]> = 6(x − 1) + 14(y − 2) .   (5.166)
Note that D¹_{x,y} f(1, 2) δ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by
∂²f/∂x² = 2  ⟹  ∂²f/∂x²(1, 2) = 2   (5.167)
∂²f/∂y² = 6y  ⟹  ∂²f/∂y²(1, 2) = 12   (5.168)
∂²f/∂y∂x = 2  ⟹  ∂²f/∂y∂x(1, 2) = 2   (5.169)
∂²f/∂x∂y = 2  ⟹  ∂²f/∂x∂y(1, 2) = 2 .   (5.170)
When we collect the second-order partial derivatives, we obtain the Hessian
H = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]] = [[2, 2], [2, 6y]] ,   (5.171)
such that
H(1, 2) = [[2, 2], [2, 12]] ∈ R2×2 .   (5.172)
Therefore, the next term of the Taylor-series expansion is given by
D²_{x,y} f(1, 2) / 2! · δ² = (1/2) δ> H(1, 2) δ   (5.173a)
= (1/2) [x − 1  y − 2] [[2, 2], [2, 12]] [x − 1, y − 2]>   (5.173b)
= (x − 1)² + 2(x − 1)(y − 2) + 6(y − 2)² .   (5.173c)
Here, D²_{x,y} f(1, 2) δ² contains only quadratic terms, i.e., second-order polynomials.
Exercises
5.1 Compute the derivative f 0 (x) for
f (x) = log(x4 ) sin(x3 ) .
where x, µ ∈ RD , S ∈ RD×D .
2.
f (x) = tr(xx> + σ 2 I) , x ∈ RD
Here tr(A) is the trace of A, i.e., the sum of the diagonal elements Aii .
Hint: Explicitly write out the outer product.
3. Use the chain rule. Provide the dimensions of every single partial deriva-
tive. You do not need to compute the product of the partial derivatives
explicitly.
f = tanh(z) ∈ RM
z = Ax + b, x ∈ RN , A ∈ RM ×N , b ∈ RM .
Here, tanh is applied to every component of z .
5.9 We define
g(z, ν) := log p(x, z) − log q(z, ν)
z := t(ε, ν)
for differentiable functions p, q, t. By using the chain rule, compute the gradient
(d/dν) g(z, ν) .
6 Probability and Distributions
6.1 Construction of a Probability Space
Figure: A mind map of the concepts introduced in this chapter (independence, sufficient statistics, conjugacy, the Bernoulli and related distributions), along with where they are used in other parts of the book (e.g., Chapter 11 Density Estimation).
(Grinstead and Snell, 1997; Jaynes, 2003) that introduce the three con-
cepts of sample space, event space and probability measure. The probabil-
ity space models a real-world process (referred to as an experiment) with
random outcomes.
The probability of a single event must lie in the interval [0, 1], and the
total probability over all outcomes in the sample space Ω must be 1, i.e.,
P (Ω) = 1. Given a probability space (Ω, A, P ) we want to use it to model
some real world phenomenon. In machine learning, we often avoid ex-
plicitly referring to the probability space, but instead refer to probabilities
on quantities of interest, which we denote by T . In this book we refer to
T as the target space and refer to elements of T as states. We introduce a function X : Ω → T that takes an element of Ω (an event) and returns a particular quantity of interest x, a value in T. This association/mapping from Ω to T is called a random variable. For example, in the case of tossing two coins and counting the number of heads, a random variable X maps to the three possible outcomes: X(hh) = 2, X(ht) = 1, X(th) = 1, and X(tt) = 0. In this particular case, T = {0, 1, 2}, and it is the probabilities on elements of T that we are interested in. For a finite sample space Ω and finite T, the function corresponding to a random variable is essentially a lookup table. For any subset S ⊆ T, we associate PX(S) ∈ [0, 1] (the probability) to a particular event occurring corresponding to the random variable X. Example 6.1 provides a concrete illustration of the above terminology. (The name "random variable" is a great source of misunderstanding as it is neither random nor is it a variable. It is a function.)
Remark. The sample space Ω above unfortunately is referred to by dif-
ferent names in different books. Another common name for Ω is “state
space” (Jacod and Protter, 2004), but state space is sometimes reserved
for referring to states in a dynamical system (Hasselblatt and Katok, 2003).
Example 6.1
(This toy example is essentially a biased coin flip example.) We assume that the reader is already familiar with computing probabilities of intersections and unions of sets of events. A more gentle introduction to probability with many examples can be found in Chapter 2 of Walpole et al. (2011).
Consider a statistical experiment where we model a funfair game con-
sisting of drawing two coins from a bag (with replacement). There are
coins from USA (denoted as $) and UK (denoted as £) in the bag, and
since we draw two coins from the bag, there are four outcomes in total.
The state space or sample space Ω of this experiment is then ($, $), ($,
£), (£, $), (£, £). Let us assume that the composition of the bag of coins is
such that a draw returns at random a $ with probability 0.3.
The event we are interested in is the total number of times the repeated
draw returns $. Let us define a random variable X that maps the sample
space Ω to T , that denotes the number of times we draw $ out of the bag.
We can see from the sample space above that we can get zero $, one $, or two $s, and therefore T = {0, 1, 2}. The random variable X (a function or
lookup table) can be represented as a table like below
X(($, $)) = 2 (6.1)
X(($, £)) = 1 (6.2)
X((£, $)) = 1 (6.3)
X((£, £)) = 0 . (6.4)
Since we return the first coin we draw before drawing the second, this
implies that the two draws are independent of each other, which we will
discuss in Section 6.4.5. Note that there are two experimental outcomes,
which map to the same event, where only one of the draws returns $.
Therefore, the probability mass function (Section 6.2.1) of X is given by
P (X = 2) = P (($, $))
= P ($) · P ($)
= 0.3 · 0.3 = 0.09 (6.5)
P (X = 1) = P (($, £) ∪ (£, $))
= P (($, £)) + P ((£, $))
= 0.3 · (1 − 0.3) + (1 − 0.3) · 0.3 = 0.42 (6.6)
P (X = 0) = P ((£, £))
= P (£) · P (£)
= (1 − 0.3) · (1 − 0.3) = 0.49 . (6.7)
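A minimal Python sketch of this probability mass function:

# Probability mass function of X, the number of $ coins in two independent
# draws with replacement, where P($) = 0.3 (Example 6.1).
p_dollar = 0.3

pmf = {
    2: p_dollar * p_dollar,                                    # ($, $)
    1: p_dollar * (1 - p_dollar) + (1 - p_dollar) * p_dollar,  # ($, £) or (£, $)
    0: (1 - p_dollar) * (1 - p_dollar),                        # (£, £)
}

print(pmf)                 # {2: 0.09, 1: 0.42, 0: 0.49}
print(sum(pmf.values()))   # 1.0 (up to floating-point rounding)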
6.1.3 Statistics
Probability theory and statistics are often presented together, but they con-
cern different aspects of uncertainty. One way of contrasting them is by the
kinds of problems that are considered. Using probability we can consider
a model of some process, where the underlying uncertainty is captured
by random variables, and we use the rules of probability to derive what
happens. In statistics we observe that something has happened, and try
to figure out the underlying process that explains the observations. In this
sense, machine learning is close to statistics in its goals to construct a
model that adequately represents the process that generated the data. We
can use the rules of probability to obtain a “best fitting” model for some
data.
Another aspect of machine learning systems is that we are interested
in generalization error (see Chapter 8). This means that we are actually
interested in the performance of our system on instances that we will ob-
serve in future, which are not identical to the instances that we have seen
Figure 6.2 Visualization of a discrete bivariate probability mass function, with random variables X (states x1, . . . , x5) and Y (states y1, y2, y3); cell counts nij, column sums ci, and row sums rj. This diagram is adapted from Bishop (2006).
Example 6.2
Consider two random variables X and Y , where X has five possible states
and Y has three possible states, as shown in Figure 6.2. We denote by nij
the number of events with state X = xi and Y = yj , and denote by
N the total number of events. The value ci is the sum of the individual frequencies for the ith column, that is, ci = ∑_{j=1}^{3} nij. Similarly, the value rj is the row sum, that is, rj = ∑_{i=1}^{5} nij. Using these definitions, we can compactly express the distribution of X and Y.
The probability distribution of each random variable, the marginal probability, can be seen as the sum over a row or column:
P(X = xi) = ci/N = (∑_{j=1}^{3} nij)/N   (6.10)
and
P(Y = yj) = rj/N = (∑_{i=1}^{5} nij)/N ,   (6.11)
where ci and rj are the ith column and jth row of the probability table, respectively. By convention, for discrete random variables with a finite number of events, we assume that probabilities sum up to one, that is,
∑_{i=1}^{5} P(X = xi) = 1  and  ∑_{j=1}^{3} P(Y = yj) = 1 .   (6.12)
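A minimal NumPy sketch of (6.10)–(6.12) for a hypothetical 3 × 5 table of counts nij:

import numpy as np

# Hypothetical counts n_ij for the states of Y (rows) and X (columns),
# mirroring the layout of Figure 6.2.
n = np.array([[4, 1, 0, 2, 3],
              [2, 5, 3, 1, 0],
              [0, 2, 4, 3, 1]])
N = n.sum()

# Marginals as in (6.10) and (6.11): column sums c_i and row sums r_j.
p_x = n.sum(axis=0) / N     # P(X = x_i)
p_y = n.sum(axis=1) / N     # P(Y = y_j)

print(p_x, p_x.sum())       # sums to 1
print(p_y, p_y.sum())       # sums to 1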
For probability mass functions (pmf) of discrete random variables the in-
tegral in (6.15) is replaced with a sum (6.12).
Observe that the probability density function is any function f that is
non-negative and integrates to one. We associate a random variable X
with this function f by
P(a ≤ X ≤ b) = ∫_a^b f(x) dx ,   (6.16)
Remark. We reiterate that there are in fact two distinct concepts when
talking about distributions. First, the idea of a pdf (denoted by f (x))
which is a non-negative function that sums to one. Second, the law of
a random variable X , that is the association of a random variable X with
the pdf f (x). ♦
Figure 6.3 Examples of discrete and continuous uniform distributions. See Example 6.3 for details of the distributions. (a) Discrete distribution (vertical axis P(Z = z), horizontal axis z); (b) Continuous distribution (vertical axis p(x), horizontal axis x).
For most of this book, we will not use the notation f (x) and FX (x) as
we mostly do not need to distinguish between the pdf and cdf. However,
we will need to be careful about pdfs and cdfs in Section 6.7.
Example 6.3
We consider two examples of the uniform distribution, where each state is
equally likely to occur. This example illustrates some differences between
discrete and continuous probability distributions.
Let Z be a discrete uniform random variable with three states {z = −1.1, z = 0.3, z = 1.5}. (The actual values of these states are not meaningful here, and we deliberately chose numbers to drive home the point that we do not want to use (and should ignore) the ordering of the states.) The probability mass function can be represented as a table of probability values:

z          −1.1   0.3   1.5
P(Z = z)    1/3   1/3   1/3

Alternatively, we can think of this as a graph (Figure 6.3(a)), where we use the fact that the states can be located on the x-axis, and the y-axis represents the probability of a particular state. The y-axis in Figure 6.3(a) is deliberately extended so that it is the same as in Figure 6.3(b).
Let X be a continuous random variable taking values in the range 0.9 ≤ X ≤ 1.6, as represented by Figure 6.3(b). Observe that the height of the
low naturally from fulfilling the desiderata (Jaynes, 2003, Chapter 2).
Probabilistic modeling (Section 8.3) provides a principled foundation for
designing machine learning methods. Once we have defined probability
distributions (Section 6.2) corresponding to the uncertainties of the data
and our problem, it turns out that there are only two fundamental rules,
the sum rule and the product rule.
Recall from (6.9) that p(x, y) is the joint distribution of the two ran-
dom variables x, y . The distributions p(x) and p(y) are the correspond-
ing marginal distributions, and p(y | x) is the conditional distribution of y
given x. Given the definitions of the marginal and conditional probability
for discrete and continuous random variables in Section 6.2, we can now
present the two fundamental rules in probability theory. (These two rules arise naturally (Jaynes, 2003) from the requirements we discussed in Section 6.1.1.)
The first rule, the sum rule, states that
p(x) = ∑_{y∈Y} p(x, y) if y is discrete,  and  p(x) = ∫_Y p(x, y) dy if y is continuous ,   (6.20)
where Y denotes the states of the target space of random variable Y. This means that we sum out (or integrate out) the set of states y of the random variable Y. The sum rule is also known as the marginalization property. The sum rule relates the joint distribution to a marginal distribution. In general, when the joint distribution contains more than two random variables, the sum rule can be applied to any subset of the random variables, resulting in a marginal distribution of potentially more than one random variable. More concretely, if x = [x1, . . . , xD]>, we obtain the marginal
p(xi) = ∫ p(x1, . . . , xD) dx\i ,   (6.21)
The second rule, the product rule, relates the joint distribution to the conditional distribution: p(x, y) = p(y | x) p(x).    (6.22)
The product rule can be interpreted as the fact that every joint distribution of two random variables can be factorized into (written as a product of) two other distributions. The two factors are the marginal distribu-
tion of the first random variable p(x), and the conditional distribution
of the second random variable given the first p(y | x). Since the ordering
of random variables is arbitrary in p(x, y) the product rule also implies
p(x, y) = p(x | y)p(y). To be precise, (6.22) is expressed in terms of the
probability mass functions for discrete random variables. For continuous
random variables, the product rule is expressed in terms of the probability
density functions (Section 6.2.3).
In machine learning and Bayesian statistics, we are often interested in
making inferences of unobserved (latent) random variables given that we
have observed other random variables. Let us assume we have some prior
knowledge p(x) about an unobserved random variable x and some rela-
tionship p(y | x) between x and a second random variable y , which we
can observe. If we observe y we can use Bayes’ theorem to draw some
conclusions about x given the observed values of y . Bayes’ theorem (also called Bayes’ rule or Bayes’ law) states that

p(x \,|\, y) = \frac{p(y \,|\, x)\, p(x)}{p(y)} ,    (6.23)

where p(x) is the prior, p(y | x) the likelihood, and p(y) the marginal likelihood/evidence defined in (6.27).
The quantity
p(y) := \int p(y \,|\, x)\, p(x)\,\mathrm{d}x = \mathbb{E}_X[p(y \,|\, x)]    (6.27)
is the marginal likelihood/evidence. The right-hand side of (6.27) uses the expectation operator which we define in Section 6.4.1. By definition the
marginal likelihood integrates the numerator of (6.23) with respect to the
latent variable x. Therefore, the marginal likelihood is independent of
x and it ensures that the posterior p(x | y) is normalized. The marginal
likelihood can also be interpreted as the expected likelihood where we
take the expectation with respect to the prior p(x). Beyond normalization
of the posterior the marginal likelihood also plays an important role in
Bayesian model selection as we will discuss in Section 8.5. Due to the
integration in (8.44), the evidence is often hard to compute. Bayes’ theorem (6.23) allows us to invert the relationship between x and y given by the likelihood. Therefore, Bayes’ theorem is sometimes called the probabilistic inverse. We will discuss Bayes’ theorem further in
Section 8.3.
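A minimal sketch (with made-up numbers) of Bayes’ theorem for a discrete latent variable: the posterior is the product of prior and likelihood, normalized by the evidence.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # p(x) over three states of the latent variable
likelihood = np.array([0.10, 0.60, 0.30])  # p(y_obs | x) for one observed value of y

evidence = np.sum(likelihood * prior)      # p(y_obs) = sum_x p(y_obs | x) p(x)
posterior = likelihood * prior / evidence  # Bayes' theorem (6.23)
print(evidence)                            # marginal likelihood of the observation
print(posterior, posterior.sum())          # posterior over x; sums to 1
```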
Remark. In Bayesian statistics, the posterior distribution is the quantity
of interest as it encapsulates all available information from the prior and
the data. Instead of carrying the posterior around, it is possible to focus
on some statistic of the posterior, such as the maximum of the posterior,
which we will discuss in Section 8.2. However, focusing on some statistic
of the posterior leads to loss of information. If we think in a bigger con-
text, then the posterior can be used within a decision making system, and
having the full posterior can be extremely useful and lead to decisions that
are robust to disturbances. For example, in the context of model-based re-
inforcement learning, Deisenroth et al. (2015) show that using the full
posterior distribution of plausible transition functions leads to very fast
(data/sample efficient) learning, whereas focusing on the maximum of
the posterior leads to consistent failures. Therefore, having the full pos-
terior can be very useful for a downstream task. In Chapter 9, we will
continue this discussion in the context of linear regression. ♦
Definition 6.3 (Expected value). The expected value of a function g : R → R of a univariate continuous random variable X ∼ p(x) is given by

\mathbb{E}_X[g(x)] = \int_{\mathcal{X}} g(x)\, p(x)\,\mathrm{d}x ,    (6.28)

where \mathcal{X} is the set of possible outcomes (the target space) of the random variable X. For a multivariate random variable x = [x_1, \ldots, x_D]^\top, the expected value is defined element-wise, where the subscript \mathbb{E}_{X_d} indicates that we are taking the expected value with respect to the dth element of the vector x. ♦
Definition 6.3 defines the meaning of the notation EX as the operator
indicating that we should take the integral with respect to the probabil-
ity density (for continuous distributions) or the sum over all states (for
discrete distributions). The definition of the mean (Definition 6.4) is a
special case of the expected value, obtained by choosing g to be the iden-
tity function.
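A quick Monte Carlo check (my own example) of Definition 6.3: for X ∼ N(0, 1) and g(x) = x², the expected value E_X[g(x)] equals 1, and choosing g to be the identity recovers the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # samples from p(x) = N(0, 1)

g = lambda s: s ** 2
print(np.mean(g(x)))                 # Monte Carlo estimate of E_X[g(x)]; ~1.0
print(np.mean(x))                    # identity function: the mean, ~0.0
```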
Definition 6.4 (Mean). The mean of a random variable X with states
Example 6.4
Figure 6.4 Illustration of the mean, mode and median for a two-dimensional dataset, as well as its marginal densities.
Remark. The expected value (Definition 6.3) is a linear operator. For ex-
ample, given a real-valued function f (x) = ag(x) + bh(x) where a, b ∈ R
and x ∈ RD , we obtain
\mathbb{E}_X[f(x)] = \int f(x)\, p(x)\,\mathrm{d}x    (6.34a)
= \int [a g(x) + b h(x)]\, p(x)\,\mathrm{d}x    (6.34b)
= a \int g(x)\, p(x)\,\mathrm{d}x + b \int h(x)\, p(x)\,\mathrm{d}x .    (6.34c)
♦
For two random variables, we may wish to characterize their correspon-
Figure 6.5 Two-dimensional datasets with identical means and variances along each axis (colored lines) but with different covariances. (a) x and y are negatively correlated. (b) x and y are positively correlated.
p(x_i) = \int p(x_1, \ldots, x_D)\,\mathrm{d}x_{\setminus i} ,    (6.39)

where "\setminus i" denotes "all variables but i". The off-diagonal entries are the cross-covariance terms Cov[x_i, x_j] for i, j = 1, \ldots, D, i ≠ j.
When we want to compare the covariances between different pairs of
random variables, it turns out that the variance of each random variable
affects the value of the covariance. The normalized version of covariance
is called the correlation.
Definition 6.8 (Correlation). The correlation between two random variables X, Y is given by

corr[x, y] = \frac{\mathrm{Cov}[x, y]}{\sqrt{\mathbb{V}[x]\,\mathbb{V}[y]}} \in [-1, 1] .    (6.40)
The correlation matrix is the covariance matrix of standardized random
variables, x/σ(x). In other words, each random variable is divided by its
standard deviation (the square root of the variance) in the correlation
matrix.
The covariance (and correlation) indicate how two random variables
are related, see Figure 6.5. Positive correlation corr[x, y] means that when
x grows then y is also expected to grow. Negative correlation means that
as x increases then y decreases.
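A short simulation (illustrative data) showing how the empirical correlation (6.40) captures the sign of the relationship between two random variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.standard_normal(n)
noise = 0.5 * rng.standard_normal(n)
y_pos = 2.0 * x + noise     # y grows when x grows
y_neg = -2.0 * x + noise    # y decreases when x grows

def corr(a, b):
    # corr[a, b] = Cov[a, b] / sqrt(V[a] V[b]), cf. (6.40)
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return cov / np.sqrt(a.var() * b.var())

print(corr(x, y_pos))       # close to +1
print(corr(x, y_neg))       # close to -1
```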
where xn ∈ RD .
Similar to the empirical mean, the empirical covariance matrix is a D × D matrix

\Sigma := \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top .    (6.42)

(Throughout the book we use the empirical covariance, which is a biased estimate. The unbiased (sometimes called corrected) covariance has the factor N − 1 in the denominator instead of N. The derivations are exercises at the end of this chapter.)

To compute the statistics for a particular dataset, we would use the realizations (observations) x_1, \ldots, x_N and use (6.41) and (6.42). Empirical covariance matrices are symmetric, positive semi-definite (see Section 3.2.3).

6.4.3 Three Expressions for the Variance

We now focus on a single random variable X and use the empirical formulas above to derive three possible expressions for the variance. The derivation below is the same for the population variance, except that we need to take care of integrals. The standard definition of variance, corresponding
to the definition of covariance (Definition 6.5), is the expectation of the
squared deviation of a random variable X from its expected value µ, i.e.,
VX [x] := EX [(x − µ)2 ] . (6.43)
The expectation in (6.43) and the mean µ = EX (x) are computed us-
ing (6.32), depending on whether X is a discrete or continuous random
variable. The variance as expressed in (6.43) is the mean of a new random
variable Z := (X − µ)2 .
When estimating the variance in (6.43) empirically, we need to resort
to a two-pass algorithm: one pass through the data to calculate the mean
µ using (6.41), and then a second pass using this estimate µ̂ to calculate the
variance. It turns out that we can avoid two passes by rearranging the
terms. The formula in (6.43) can be converted to the so-called raw-score formula for variance:

\mathbb{V}_X[x] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2 .    (6.44)
A third expression is the average squared pairwise difference between all pairs of observations:

\frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = \frac{2}{N} \sum_{i=1}^{N} x_i^2 - 2\left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^2 .    (6.45)

We see that (6.45) is twice the raw-score expression (6.44). This means
that we can express the sum of pairwise distances (of which there are N 2
of them) as a sum of deviations from the mean (of which there are N ).
Geometrically, this means that there is an equivalence between the pair-
wise distances and the distances from the center of the set of points. From
a computational perspective, this means that by computing the mean
(N terms in the summation), and then computing the variance (again
N terms in the summation) we can obtain an expression (left-hand side
of (6.45)) that has N 2 terms.
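The following sketch (toy data) checks numerically that the two-pass definition (6.43), the raw-score formula (6.44), and the pairwise-difference expression (6.45) all give the same value.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=1_000)

# (6.43): two passes, first the mean, then the squared deviations.
mu = x.mean()
var_two_pass = np.mean((x - mu) ** 2)

# (6.44): raw-score formula, a single pass over x and x**2.
var_raw_score = np.mean(x ** 2) - x.mean() ** 2

# (6.45): the average squared pairwise difference is twice the variance.
pairwise = np.mean((x[:, None] - x[None, :]) ** 2)

print(var_two_pass, var_raw_score, pairwise / 2)   # numerically identical
```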
Example 6.5
Consider a random variable X with zero mean (EX [x] = 0) and also
EX [x3 ] = 0. Let y = x2 (hence, Y is dependent on X ) and consider the
covariance (6.36) between X and Y . But this gives
Cov[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y] = \mathbb{E}[x^3] = 0 .    (6.54)
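A simulation (my own setup) of Example 6.5: with X symmetric around zero and Y = X², the empirical covariance is close to zero even though Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)   # zero mean, and E[x^3] = 0 by symmetry
y = x ** 2                           # Y is completely determined by X

cov_xy = np.mean(x * y) - x.mean() * y.mean()
print(cov_xy)                        # ~0: uncorrelated, yet clearly dependent
```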
Figure 6.6 Geometry of random variables. If random variables x and y are uncorrelated, they are orthogonal vectors in a corresponding vector space, and the Pythagorean theorem applies: \sqrt{\mathrm{var}[x + y]} = \sqrt{\mathrm{var}[x] + \mathrm{var}[y]}.

Figure 6.7 Gaussian distribution of two random variables x, y (surface and contour plots of p(x1, x2)).
where Σxx = Cov[x, x] and Σyy = Cov[y, y] are the marginal covari-
ance matrices of x and y , respectively, and Σxy = Cov[x, y] is the cross-
covariance matrix between x and y .
The conditional distribution p(x | y) is also Gaussian (illustrated in Fig-
ure 6.9(c)) and given by (derived in Section 2.3 of Bishop (2006))
p(x \,|\, y) = \mathcal{N}\big(\mu_{x|y}, \Sigma_{x|y}\big)    (6.65)
\mu_{x|y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y)    (6.66)
\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} .    (6.67)
Note that in the computation of the mean in (6.66) the y -value is an
observation and no longer random.
Remark. The conditional Gaussian distribution shows up in many places,
where we are interested in posterior distributions:
The Kalman filter (Kalman, 1960), one of the most central algorithms
for state estimation in signal processing, does nothing but computing
Gaussian conditionals of joint distributions (Deisenroth and Ohlsson, 2011).
Gaussian processes (Rasmussen and Williams, 2006), which are a prac-
tical implementation of a distribution over functions. In a Gaussian pro-
cess, we make assumptions of joint Gaussianity of random variables. By
(Gaussian) conditioning on observed data, we can determine a poste-
rior distribution over functions.
Latent linear Gaussian models (Roweis and Ghahramani, 1999; Mur-
phy, 2012), which include probabilistic principal component analysis
(PPCA) (Tipping and Bishop, 1999). We will look at PPCA in more de-
tail in Section 10.7.
♦
The marginal distribution p(x) of a joint Gaussian distribution p(x, y),
see (6.64), is itself Gaussian and computed by applying the sum rule
(6.20) and given by
p(x) = \int p(x, y)\,\mathrm{d}y = \mathcal{N}\big(x \,|\, \mu_x, \Sigma_{xx}\big) .    (6.68)
Example 6.6
Consider the bivariate Gaussian distribution (illustrated in Figure 6.9)
p(x_1, x_2) = \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.3 & -1 \\ -1 & 5 \end{bmatrix} \right) .    (6.69)
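Applying (6.66)–(6.68) to the bivariate Gaussian in (6.69) with a few lines of numpy; the conditioning value x2 = −1 matches the one used in Figure 6.9, and the printed numbers follow directly from the formulas.

```python
import numpy as np

mu = np.array([0.0, 2.0])            # mean vector from (6.69)
Sigma = np.array([[0.3, -1.0],
                  [-1.0, 5.0]])      # covariance matrix from (6.69)

# Marginal p(x1), cf. (6.68): keep the corresponding mean and covariance blocks.
print(mu[0], Sigma[0, 0])            # N(0, 0.3)

# Conditional p(x1 | x2 = -1), cf. (6.66) and (6.67).
x2 = -1.0
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(mu_cond, var_cond)             # N(0.6, 0.1)
```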
Figure 6.9 (a) Bivariate Gaussian; the marginal of a joint Gaussian distribution is Gaussian, and the conditional distribution of a Gaussian (here conditioned on x2 = −1) is also Gaussian.
Here, neither a nor b are random variables. However, writing c in this way
is more compact than (6.76). ♦
Knowing that p(x+y) is Gaussian, the mean and covariance matrix can be
determined immediately using the results from (6.46)–(6.49). This prop-
erty will be important when we consider i.i.d. Gaussian noise acting on
random variables as is the case for linear regression (Chapter 9).
Example 6.7
Since expectations are linear operations, we can obtain the distribution of a weighted sum of independent Gaussian random variables x ∼ N(µ_x, Σ_x) and y ∼ N(µ_y, Σ_y):

p(ax + by) = \mathcal{N}\big(a\mu_x + b\mu_y, \; a^2\Sigma_x + b^2\Sigma_y\big) .    (6.79)
Proof The mean of the mixture x is given by the weighted sum of the
means of each random variable. We apply the definition of the mean (Def-
inition 6.4), and plug in our mixture (6.80) above, which yields
\mathbb{E}[x] = \int_{-\infty}^{\infty} x\, p(x)\,\mathrm{d}x    (6.83a)
= \int_{-\infty}^{\infty} \big( \alpha x\, p_1(x) + (1 - \alpha) x\, p_2(x) \big)\,\mathrm{d}x    (6.83b)
= \alpha \int_{-\infty}^{\infty} x\, p_1(x)\,\mathrm{d}x + (1 - \alpha) \int_{-\infty}^{\infty} x\, p_2(x)\,\mathrm{d}x    (6.83c)
= \alpha \mu_1 + (1 - \alpha) \mu_2 .    (6.83d)
To compute the variance, we can use the raw score version of the vari-
ance from (6.44), which requires an expression of the expectation of the
squared random variable. Here we use the definition of an expectation of
a function (the square) of a random variable (Definition 6.3)
\mathbb{E}[x^2] = \int_{-\infty}^{\infty} x^2 p(x)\,\mathrm{d}x    (6.84a)
= \int_{-\infty}^{\infty} \big( \alpha x^2 p_1(x) + (1 - \alpha) x^2 p_2(x) \big)\,\mathrm{d}x    (6.84b)
It turns out that the class of distributions called the exponential family
provides the right balance of generality while retaining favourable com-
putation and inference properties. Before we introduce the exponential
family, let us see three more members of “named” probability distribu-
tions, the Bernoulli (Example 6.8), Binomial (Example 6.9) and Beta (Ex-
ample 6.10) distributions.
Example 6.8
The Bernoulli distribution is a distribution for a single binary random variable X with state x ∈ {0, 1}. It is governed by a single continuous pa-
rameter µ ∈ [0, 1] that represents the probability of X = 1. The Bernoulli
distribution Ber(µ) is defined as
p(x \,|\, µ) = µ^x (1 - µ)^{1-x} ,   x ∈ \{0, 1\} ,    (6.92)
E[x] = µ , (6.93)
V[x] = µ(1 − µ) , (6.94)
where E[x] and V[x] are the mean and variance of the binary random
variable X .
Figure 6.10 Examples of the Binomial distribution for µ ∈ {0.1, 0.4, 0.75} and N = 15; p(m) is plotted over the number m of observations x = 1 in N = 15 experiments.

Figure 6.11 Examples of the Beta distribution p(µ | a, b) for different values of α and β (here a = 0.5 = b; a = 1 = b; a = 2, b = 0.3; a = 4, b = 10; a = 5, b = 1).
Remark. There is a whole zoo of distributions with names, and they are
related in different ways to each other (Leemis and McQueston, 2008).
It is worth keeping in mind that each named distribution is created for a
particular reason, but may have other applications. Knowing the reason
behind the creation of a particular distribution often allows insight into
how to best use it. We introduced the above three distributions to be able
to illustrate the concepts of conjugacy (Section 6.6.1) and exponential
families (Section 6.6.3). ♦
6.6.1 Conjugacy
According to Bayes’ theorem (6.23), the posterior is proportional to the
product of the prior and the likelihood. The specification of the prior can
be tricky for two reasons: First, the prior should encapsulate our knowl-
edge about the problem before we see any data. This is often difficult to
describe. Second, it is often not possible to compute the posterior distribu-
tion analytically. However, there are some priors that are computationally
convenient: conjugate priors.
Definition 6.13 (Conjugate Prior). A prior is conjugate for the likelihood
function if the posterior is of the same form/type as the prior.
Conjugacy is particularly convenient because we can algebraically cal-
culate our posterior distribution by updating the parameters of the prior
distribution.
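A minimal sketch (hyperparameters and data are made up) of how conjugacy turns the posterior computation into a parameter update: a Beta(a, b) prior on the Bernoulli parameter µ combined with binary observations yields a Beta posterior.

```python
import numpy as np

a, b = 2.0, 2.0                           # hypothetical Beta prior hyperparameters

x = np.array([1, 0, 1, 1, 1, 0, 1, 1])    # made-up binary observations
heads, tails = int(x.sum()), int(len(x) - x.sum())

# Conjugate update: the posterior is again a Beta distribution.
a_post, b_post = a + heads, b + tails
print(a_post, b_post)                     # Beta(8, 4)
print(a_post / (a_post + b_post))         # posterior mean of mu, here 2/3
```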
Remark. When considering the geometry of probability distributions, con-
jugate priors retain the same distance structure as the likelihood (Agarwal
and Daumé III, 2010). ♦
To introduce a concrete example of conjugate priors, we describe below
the Binomial distribution (defined on discrete random variables) and the
Beta distribution (defined on continuous random variables).
Table 6.2 lists examples for conjugate priors for the parameters of some
standard likelihoods used in probabilistic modeling. Distributions such as
Multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found
in any statistical text, and are for example described in Bishop (2006).
The Beta distribution is the conjugate prior for the parameter µ in both
the Binomial and the Bernoulli likelihood. For a Gaussian likelihood func-
tion, we can place a conjugate Gaussian prior on the mean. The reason why the Gaussian likelihood appears twice in the table is that we need to distinguish the univariate from the multivariate case. In the univariate (scalar) case, the inverse Gamma is the conjugate prior for the variance. In the multivariate case, we use a conjugate inverse Wishart distribution as a prior on the covariance matrix. (The Gamma prior is conjugate for the precision (inverse variance) in the univariate Gaussian likelihood, and the Wishart prior is conjugate for the precision matrix (inverse covariance matrix) in the multivariate Gaussian likelihood.) The Dirichlet distribution is the conjugate prior for the multinomial likelihood function. For further details, we refer to Bishop (2006).
Example 6.15
Recall the exponential family form of the Bernoulli distribution (6.113d),
p(x \,|\, µ) = \exp\!\left[ x \log \frac{µ}{1 - µ} + \log(1 - µ) \right] .    (6.121)
The canonical conjugate prior therefore has the same form
p(µ \,|\, γ, n_0) = \exp\!\left[ n_0\, γ \log \frac{µ}{1 - µ} + n_0 \log(1 - µ) - A_c(γ, n_0) \right] ,    (6.122)
which simplifies to
p(µ | γ, n0 ) = exp [n0 γ log µ + n0 (1 − γ) log(1 − µ) − Ac (γ, n0 )] .
(6.123)
Putting this in non-exponential-family form,

p(µ \,|\, γ, n_0) ∝ µ^{n_0 γ} (1 - µ)^{n_0(1-γ)} ,    (6.124)

which is of the same form as the Beta distribution (6.98); with minor manipulations we recover the original parametrization (Example 6.12).
Observe that in this example we have derived the form of the Beta dis-
tribution by looking at the conjugate prior of the exponential family.
Example 6.16
Let X be a continuous random variable with probability density function
f(x) = 3x^2   on   0 ≤ x ≤ 1 .    (6.128)
We are interested in finding the pdf of Y = X^2.
The function f is an increasing function of x, and therefore the resulting value of y lies in the interval [0, 1]. We obtain
F_Y(y) = P(Y ≤ y)    definition of cdf    (6.129a)
= P(X^2 ≤ y)    transformation of interest    (6.129b)
= P(X ≤ y^{1/2})    inverse    (6.129c)
= F_X(y^{1/2})    definition of cdf    (6.129d)
= \int_0^{y^{1/2}} 3t^2 \,\mathrm{d}t    cdf as a definite integral    (6.129e)
= \left[ t^3 \right]_{t=0}^{t=y^{1/2}}    result of integration    (6.129f)
= y^{3/2} ,   0 ≤ y ≤ 1 .    (6.129g)
Therefore, the cdf of Y is
F_Y(y) = y^{3/2}    (6.130)
for 0 ≤ y ≤ 1. To obtain the pdf, we differentiate the cdf
f(y) = \frac{\mathrm{d}}{\mathrm{d}y} F_Y(y) = \frac{3}{2} y^{1/2}    (6.131)
for 0 ≤ y ≤ 1.
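A Monte Carlo check of Example 6.16 (the sampling details are my own): draw X with density 3x² on [0, 1] by inverse transform sampling, set Y = X², and compare the empirical cdf of Y with y^{3/2} from (6.130).

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.uniform(size=1_000_000)

# Inverse transform sampling: F_X(x) = x^3 on [0, 1], so X = U^(1/3).
x = u ** (1.0 / 3.0)        # samples with density f(x) = 3 x^2
y = x ** 2                  # transformed variable Y = X^2

for y0 in (0.2, 0.5, 0.8):
    empirical = np.mean(y <= y0)       # empirical cdf of Y at y0
    print(y0, empirical, y0 ** 1.5)    # matches F_Y(y) = y^(3/2)
```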
In Example 6.16, we considered a strictly monotonically increasing function f(x) = 3x^2. This means that we could compute an inverse function. (Functions that have inverses are called injective functions; see Section 2.7.)
In general, we require that the function of interest y = U (x) has an in-
verse x = U −1 (y). A useful result can be obtained by considering the cu-
mulative distribution function FX (x) of a random variable X , and using
it as the transformation U (x). This leads to the following theorem.
Theorem 6.15. Let X be a continuous random variable with a strictly monotonic cumulative distribution function F_X(x). Then the random variable Y defined as
Y = F_X(x) ,    (6.132)
has a uniform distribution.
Theorem 6.15 is known as the probability integral transform, and it is used to derive algorithms for sampling from distributions by transforming the result of sampling from a uniform random variable.
The derivation of this rule is based on the chain rule of calculus (5.32) and
by applying twice the fundamental theorem of calculus. The fundamental
theorem of calculus formalizes the fact that integration and differentiation
are somehow “inverses” of each other. An intuitive understanding of the
rule can be obtained by thinking (loosely) about small changes (differen-
tials) to the equation u = g(x), that is, by considering ∆u = g′(x)∆x as a differential of u = g(x). By substituting u = g(x), the argument inside the
integral on the right hand side of (6.133) becomes f (g(x)). By pretending
that the term du can be approximated by du ≈ ∆u = g 0 (x)∆x, and that
dx ≈ ∆x, we obtain (6.133). ♦
Consider a univariate random variable X , and an invertible function
U , which gives us another random variable Y = U (X). We assume that
random variable X has states x ∈ [a, b]. By the definition of the cdf, we
have
FY (y) = P (Y 6 y) . (6.134)
We are interested in a function U of the random variable
P (Y 6 y) = P (U (X) 6 y) , (6.135)
Theorem 6.16. [Theorem 17.2 in Billingsley (1995)] Let f (x) be the value
of the probability density of the multivariate continuous random variable X .
If the vector-valued function y = U (x) is differentiable and invertible for
all values within the domain of x, then for corresponding values of y , the
probability density of Y = U (X) is given by
f(y) = f_x\big(U^{-1}(y)\big) \times \left| \det\!\left( \frac{\partial}{\partial y} U^{-1}(y) \right) \right| .    (6.144)
The theorem looks intimidating at first glance, but the key point is that
a change of variable of a multivariate random variable follows the pro-
cedure of the univariate change of variable. First we need to work out
the inverse transform, and substitute that into the density of x. Then we
calculate the determinant of the Jacobian and multiply the result. The
following example illustrates the case of a bivariate random variable.
Example 6.17
Consider a bivariate random variable X with states x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} and probability density function

f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2\pi} \exp\!\left( -\frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) .    (6.145)
Exercises
6.1 Consider the following bivariate distribution p(x, y) of two discrete random
variables X and Y .
x1 x2 x3 x4 x5
X
Compute:
1. The marginal distributions p(x) and p(y).
2. The conditional distributions p(x|Y = y1 ) and p(y|X = x3 ).
6.2 Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4)
0.4\, \mathcal{N}\!\left( \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) + 0.6\, \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix} \right) .
Choose a conjugate prior for the Bernoulli likelihood and compute the pos-
terior distribution p(µ | x1 , . . . , xN ).
6.4 There are two bags. The first bag contains 4 mango and 2 apples; the second
bag contains 4 mango and 4 apples.
We also have a biased coin, which shows “heads” with probability 0.6 and
“tails” with probability 0.4. If the coin shows “heads” we pick a fruit at
random from bag 1, otherwise we pick a fruit at random from bag 2.
Your friend flips the coin (you cannot see the result), picks a fruit at random
from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2?
Hint: Use Bayes’ theorem.
6.5 Consider the following time-series model:
x_{t+1} = A x_t + w ,   w ∼ \mathcal{N}(0, Q) ,
y_t = C x_t + v ,   v ∼ \mathcal{N}(0, R) ,
where w, v are i.i.d. Gaussian noise variables. Further, assume that p(x_0) = \mathcal{N}(µ_0, Σ_0).
C = (A^{-1} + B^{-1})^{-1}
c = C(A^{-1} a + B^{-1} b)
c = (2π)^{-D/2}\, |A + B|^{-1/2} \exp\!\left( -\tfrac{1}{2} (a - b)^\top (A + B)^{-1} (a - b) \right) .
Furthermore, we have
y = Ax + b + w ,
7 Continuous Optimization
[Figure residue: overview of concepts in this chapter (e.g., Lagrange multipliers, convexity) and their connections to other chapters such as Chapter 11 (Density Estimation), together with an example objective function plotted over the value of a parameter.]
Example 7.1
Consider a quadratic function in two dimensions
f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}    (7.7)

with gradient

\nabla f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top .    (7.8)
Starting at the initial location x0 = [−3, −1]> , we iteratively apply
(7.6) to obtain a sequence of estimates that converge to the minimum
[Figure 7.3 Gradient descent on a two-dimensional quadratic surface (shown as a heatmap). See Example 7.1 for a description.]
value (illustrated in Figure 7.3). We can see (both from the figure and
by plugging x0 into (7.8)) that the gradient at x0 points north and
east, leading to x1 = [−1.98, 1.21]> . Repeating that argument gives us
x2 = [−1.32, −0.42]> , and so on.
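A sketch of the iteration in Example 7.1. The step-size is not stated in the text shown here; the value 0.085 below is an assumption that is consistent with the iterates x1 and x2 quoted above.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 20.0]])
b = np.array([5.0, 3.0])

def grad(x):
    # Gradient of f(x) = 0.5 x^T A x - b^T x, cf. (7.8).
    return x @ A - b

x = np.array([-3.0, -1.0])    # initial location x0
step_size = 0.085             # assumed step-size, consistent with the quoted iterates
for _ in range(60):
    x = x - step_size * grad(x)

print(x)                      # gradient descent estimate of the minimizer
print(np.linalg.solve(A, b))  # exact minimizer A^{-1} b for comparison
```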
7.1.1 Stepsize
As mentioned earlier, choosing a good stepsize is important in gradient
descent. (The stepsize is also called the learning rate.) If the stepsize is too small, gradient descent can be slow. If the stepsize is chosen too large, gradient descent can overshoot, fail to converge, or even diverge. We will discuss the use of momentum in the next
section. It is a method that smoothes out erratic behavior of gradient up-
dates and dampens oscillations.
Adaptive gradient methods rescale the stepsize at each iteration, de-
pending on local properties of the function. There are two simple heuris-
tics (Toussaint, 2012):
When the function value increases after a gradient step, the step size
was too large. Undo the step and decrease the stepsize.
When the function value decreases the step could have been larger. Try
to increase the stepsize.
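A compact sketch of the two heuristics above (the scaling factors 0.5 and 1.1 are my own choices), applied to the quadratic from Example 7.1.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: x @ A - b

x = np.array([-3.0, -1.0])
step = 1.0                      # deliberately too large to start with
for _ in range(200):
    proposal = x - step * grad(x)
    if f(proposal) > f(x):
        step *= 0.5             # value increased: undo the step, decrease the step-size
    else:
        x = proposal
        step *= 1.1             # value decreased: accept and try a larger step

print(x, f(x))                  # x moves toward the minimizer; f decreases toward its minimum
```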
where xn ∈ RD are the training inputs, yn are the training targets and θ
are the parameters of the regression model.
Standard gradient descent, as introduced previously, is a “batch” opti-
mization method, i.e., optimization is performed using the full training set
Figure 7.4 Illustration of constrained optimization. The unconstrained problem (indicated by the contour lines) has a minimum on the right side (indicated by the circle). The box constraints (−1 ≤ x ≤ 1 and −1 ≤ y ≤ 1) require that the optimal solution lies within the box, resulting in an optimal value indicated by the star.
where f : RD → R.
In this section, we have additional constraints. That is, for real-valued functions g_i : R^D → R, i = 1, \ldots, m, we consider the constrained optimization problem

\min_{x} f(x)    (7.17)
subject to g_i(x) ≤ 0 for all i = 1, \ldots, m .
This gives infinite penalty if the constraint is not satisfied, and hence
would provide the same solution. However, this infinite step function is
equally difficult to optimize. We can overcome this difficulty by introduc-
ing Lagrange multipliers. The idea of Lagrange multipliers is to replace the step function with a linear function.
We associate to problem (7.17) the Lagrangian by introducing the Lagrange multipliers λ_i ≥ 0 corresponding to each inequality constraint respectively (Boyd and Vandenberghe, 2004, Chapter 4), so that

L(x, λ) = f(x) + \sum_{i=1}^{m} λ_i g_i(x)    (7.20a)
        = f(x) + λ^\top g(x) ,    (7.20b)

where in the last line we have concatenated all constraints g_i(x) into a vector g(x), and all the Lagrange multipliers into a vector λ ∈ R^m.
We now introduce the idea of Lagrangian duality. In general, duality
in optimization is the idea of converting an optimization problem in one
set of variables x (called the primal variables), into another optimization
problem in a different set of variables λ (called the dual variables). We
introduce two different approaches to duality: in this section, we discuss
Lagrangian duality; in Section 7.3.3 we discuss Legendre-Fenchel duality.
Note that taking the maximum over y of the left hand side of (7.24) main-
tains the inequality since the inequality is true for all y . Similarly we can
take the minimum over x of the right hand side of (7.24) to obtain (7.23).
The second concept is weak duality, which uses (7.23) to show that
primal values are always greater than or equal to dual values. This is de-
scribed in more detail in (7.27). ♦
Recall that the difference between J(x) in (7.18) and the Lagrangian
in (7.20b) is that we have relaxed the indicator function to a linear func-
tion. Therefore, when λ ≥ 0, the Lagrangian L(x, λ) is a lower bound of J(x). Hence, the maximum of L(x, λ) with respect to λ is

J(x) = \max_{λ ≥ 0} L(x, λ) .    (7.25)

By the minimax inequality (7.23) it follows that swapping the order of the minimum and maximum results in a smaller value, i.e.,

\min_{x \in \mathbb{R}^d} \max_{λ ≥ 0} L(x, λ) \;\geq\; \max_{λ ≥ 0} \min_{x \in \mathbb{R}^d} L(x, λ) .    (7.27)
This is also known as weak duality. Note that the inner part of the right-hand side is the dual objective function D(λ) and the definition follows.
In contrast to the original optimization problem, which has constraints,
minx∈Rd L(x, λ) is an unconstrained optimization problem for a given
value of λ. If solving minx∈Rd L(x, λ) is easy, then the overall problem
is easy to solve. The reason is that the outer problem (maximization over
λ) is a maximum over a set of affine functions, and hence is a concave
function, even though f (·) and gi (·) may be nonconvex. The maximum of
a concave function can be efficiently computed.
Assuming f (·) and gi (·) are differentiable, we find the Lagrange dual
problem by differentiating the Lagrangian with respect to x, setting the
differential to zero and solving for the optimal value. We will discuss two
concrete examples in Sections 7.3.1 and 7.3.2, where f (·) and gi (·) are
convex.
Remark (Equality constraints). Consider (7.17) with additional equality
constraints
min f (x)
x
Figure 7.5 Example of a convex function, y = 3x² − 5x + 2.
Definition 7.2. A set C is a convex set if for any x, y ∈ C and for any scalar θ with 0 ≤ θ ≤ 1, we have

θx + (1 − θ)y ∈ C .    (7.29)
(Figure 7.6 shows an example of a convex set.) Convex sets are sets such that a straight line connecting any two elements of the set lies inside the set. Figures 7.6 and 7.7 illustrate convex and nonconvex sets, respectively.
Convex functions are functions such that a straight line between any two points of the function lies above the function. Figure 7.2 shows a non-
convex function and Figure 7.3 shows a convex function. Another convex
function is shown in Figure 7.5.
Example 7.3
The negative entropy f (x) = x log2 x is convex for x > 0. A visualization
of the function is shown in Figure 7.8, and we can see that the function
is convex. To illustrate the above definitions of convexity, let us check the
calculations for two points x = 2 and x = 4. Note that to prove convexity
of f (x) we would need to check for all points x ∈ R.
Recall Definition 7.3. Consider a point midway between the two points
(that is θ = 0.5), then the left hand side is f (0.5 · 2 + 0.5 · 4) = 3 log2 3 ≈
4.75. The right hand side is 0.5(2 log2 2) + 0.5(4 log2 4) = 1 + 4 = 5. And
therefore the definition is satisfied.
Since f (x) is differentiable, we can alternatively use (7.31). Calculating
the derivative of f (x), we obtain
\nabla_x (x \log_2 x) = 1 \cdot \log_2 x + x \cdot \frac{1}{x \log_e 2} = \log_2 x + \frac{1}{\log_e 2} .    (7.32)
Using the same two test points x = 2 and x = 4, the left hand side of
(7.31) is given by f (4) = 8. The right hand side is
f(x) + \nabla_x f(x)^\top (y - x) = f(2) + \nabla f(2) \cdot (4 - 2)    (7.33a)
= 2 + \left(1 + \frac{1}{\log_e 2}\right) \cdot 2 \approx 6.9 .    (7.33b)
[Figure 7.8 Visualization of the function f(x) from Example 7.3, plotted for 0 ≤ x ≤ 5.]
Example 7.4
A nonnegative weighted sum of convex functions is convex. Observe that
if f is a convex function, and α > 0 is a nonnegative scalar, then the
function αf is convex. We can see this by multiplying both sides of the equation in Definition 7.3 by α, and recalling that multiplying by a nonnegative number does not change the inequality.
If f1 and f2 are convex functions, then we have by the definition
f1 (θx + (1 − θ)y) 6 θf1 (x) + (1 − θ)f1 (y) (7.34)
f2 (θx + (1 − θ)y) 6 θf2 (x) + (1 − θ)f2 (y) . (7.35)
Summing up both sides gives us
f1 (θx + (1 − θ)y) + f2 (θx + (1 − θ)y)
6 θf1 (x) + (1 − θ)f1 (y) + θf2 (x) + (1 − θ)f2 (y) , (7.36)
where the right hand side can be rearranged to
θ(f1 (x) + f2 (x)) + (1 − θ)(f1 (y) + f2 (y)) (7.37)
completing the proof that the sum of convex functions is convex.
Combining the two facts above, we see that αf1 (x) + βf2 (x) is convex
for α, β > 0. This closure property can be extended using a similar argu-
ment for nonnegative weighted sums of more than two convex functions.
Remark. The inequality in (7.30) is sometimes called Jensen’s inequality.
\min_{x} c^\top x
subject to Ax ≤ b ,
where A ∈ R^{m×d} and b ∈ R^m. This is known as a linear program. It has d variables and m linear constraints. (Linear programs are one of the most widely used approaches in industry.) The Lagrangian is given by

L(x, λ) = c^\top x + λ^\top (Ax − b) ,    (7.40)

where λ ∈ R^m is the vector of non-negative Lagrange multipliers. Rearranging the terms corresponding to x yields

L(x, λ) = (c + A^\top λ)^\top x − λ^\top b .    (7.41)

Taking the derivative of L(x, λ) with respect to x and setting it to zero gives us

c + A^\top λ = 0 .    (7.42)
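A hedged sketch (assuming scipy is available; the numbers are illustrative) of a small linear program and its Lagrange dual. The dual of min c^⊤x subject to Ax ≤ b is max_{λ≥0} −b^⊤λ subject to c + A^⊤λ = 0, which matches (7.40)–(7.42).

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative primal: min c^T x subject to A x <= b (box constraints on x).
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 1.0, 1.0])

primal = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))
print(primal.x, primal.fun)             # optimum x = [1, 1] with value -3

# Dual: maximize -b^T lambda subject to A^T lambda = -c, lambda >= 0.
# linprog minimizes, so we minimize b^T lambda instead.
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=(0, None))
print(dual.x, -dual.fun)                # dual value equals the primal value (strong duality)
```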
[Figure: illustration of the linear program. The unconstrained problem (indicated by the contour lines) has a minimum on the right side; the optimal value given the constraints is marked in the (x1, x2)-plane.]
Note that the convex conjugate definition above does not need the func-
tion f to be convex nor differentiable. In the definition above, we have
used a general inner product (Section 3.2) but in the rest of this section
we will consider the standard dot product between finite dimensional vec-
tors (hs, xi = s> x) to avoid too many technical details.
To understand the above definition in a geometric fashion, consider a nice simple one-dimensional convex and differentiable function, for example f(x) = x². (This derivation is easiest to understand by drawing the reasoning as it progresses.) Note that since we are looking at a one-dimensional problem, hyperplanes reduce to a line. Consider a line y = sx + c. Recall
planes, so let us try to describe this function f (x) by its supporting lines.
Fix the gradient of the line s ∈ R and for each point (x0 , f (x0 )) on the
graph of f , find the minimum value of c such that the line still inter-
sects (x0 , f (x0 )). Note that the minimum value of c is the place where a
line with slope s “just touches” the function f (x) = x2 . The line passing
through (x0 , f (x0 )) with gradient s is given by
y − f (x0 ) = s(x − x0 ) . (7.54)
The y -intercept of this line is −sx0 + f (x0 ). The minimum of c for which
y = sx + c intersects with the graph of f is therefore
\inf_{x_0} \; -s x_0 + f(x_0) .    (7.55)
Example 7.7
(Convex Conjugates) To illustrate the application of convex conjugates,
consider the quadratic function
f(y) = \frac{λ}{2} y^\top K^{-1} y    (7.59)
Example 7.8
In machine learning we often use sums of functions, for example the ob-
jective function of the training set includes a sum of the losses for each ex-
ample in the training set. In the following, we derive the convex conjugate
of a sum of losses ℓ(t), where ℓ : R → R. This also illustrates the application of the convex conjugate to the vector case. Let L(t) = \sum_{i=1}^{n} ℓ_i(t_i). Then,

L^*(z) = \sup_{t \in \mathbb{R}^n} \; \langle z, t \rangle - \sum_{i=1}^{n} ℓ_i(t_i)    (7.63a)
= \sup_{t \in \mathbb{R}^n} \; \sum_{i=1}^{n} z_i t_i - ℓ_i(t_i)    definition of dot product    (7.63b)
= \sum_{i=1}^{n} \sup_{t_i \in \mathbb{R}} \; z_i t_i - ℓ_i(t_i)    (7.63c)
= \sum_{i=1}^{n} ℓ_i^*(z_i) .    definition of conjugate    (7.63d)
Example 7.9
Let f (y) and g(x) be convex functions, and A a real matrix of appropriate
dimensions such that Ax = y . Then
\min_{x} f(Ax) + g(x) = \min_{Ax = y} f(y) + g(x) .    (7.64)
where the last step of swapping max and min is due to the fact that f (y)
and g(x) are convex functions. By splitting up the dot product term and
collecting x and y ,
\max_{u} \min_{x, y} \; f(y) + g(x) + (Ax - y)^\top u    (7.66a)
= \max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} (Ax)^\top u + g(x) \Big]    (7.66b)
= \max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} x^\top A^\top u + g(x) \Big]    (7.66c)
Recall the convex conjugate (Definition 7.4) and the fact that dot products are symmetric. (For general inner products, A^\top is replaced by the adjoint A^*.)

\max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} x^\top A^\top u + g(x) \Big]    (7.67a)
Exercises
7.1 Consider the univariate function
f (x) = x3 + 6x2 − 3x − 5.
Find its stationary points and indicate whether they are maximum, mini-
mum or saddle points.
7.2 Consider the update equation for stochastic gradient descent (Equation (7.15)).
Write down the update when we use a mini-batch size of one.
7.3 Consider whether the following statements are true or false:
1. The intersection of any two convex sets is convex.
2. The union of any two convex sets is convex.
3. The difference of a convex set A from another convex set B is convex.
7.4 Consider whether the following statements are true or false:
1. The sum of any two convex functions is convex.
2. The difference of any two convex functions is convex.
3. The product of any two convex functions is convex.
4. The maximum of any two convex functions is convex.
7.5 Express the following optimization problem as a standard linear program in
matrix notation
\max_{x \in \mathbb{R}^2,\; ξ \in \mathbb{R}} \; p^\top x + ξ
Derive the convex conjugate function f ∗ (s), by assuming the standard dot
product.
Hint: Take the gradient of an appropriate function and set the gradient to zero.
7.10 Consider the function
f(x) = \frac{1}{2} x^\top A x + b^\top x + c    (7.72)
where A is strictly positive definite, which means that it is invertible. Derive
the convex conjugate of f (x).
Hint: Take the gradient of an appropriate function and set the gradient to zero.
7.11 The hinge loss (which is the loss used by the Support Vector Machine) is
given by
L(α) = max{0, 1 − α}
8 When Models Meet Data
In the first part of the book, we introduced the mathematics that form
the foundations of many machine learning methods. The hope is that a
reader would be able to learn the rudimentary forms of the language of
mathematics from the first part, which we will now use to describe and
discuss machine learning. The second part of the book introduces four
pillars of machine learning:
Regression (Chapter 9)
Dimensionality reduction (Chapter 10)
Density estimation (Chapter 11)
Classification (Chapter 12)
The main aim of this part of the book is to illustrate how the mathematical
concepts introduced in the first part of the book can be used to design
machine learning algorithms that can be used to solve tasks within the
remit of the four pillars. We do not intend to introduce advanced machine
learning concepts, but instead to provide a set of practical methods that
allow the reader to apply the knowledge they gained from the first part
of the book. It also provides a gateway to the wider machine learning
literature for readers already familiar with the mathematics.
It is worth pausing at this point to consider the problem that a ma-
chine learning algorithm is designed to solve. As discussed in Chapter 1,
there are three major components of a machine learning system: data,
models and learning. The main question of machine learning is “what do
we mean by good models?”. The word model has many subtleties and we
will revisit it multiple times in this chapter. It is also not entirely obvious
how to objectively define the word “good”. One of the guiding principles
of machine learning is that good models should perform well on unseen
data. This requires us to define some performance metrics, such as accu-
racy or distance from ground truth, as well as figuring out ways to do well
under these performance metrics. This chapter covers a few necessary bits
and pieces of mathematical and statistical language that are commonly
used to talk about machine learning models. By doing so, we briefly out-
line the current best practices for training a model such that the resulting
predictor does well on data that we have not yet seen.
As mentioned in Chapter 1, there are two different senses in which we
Table 8.1 Example data from a fictitious human resource database that is not in a numerical format.

Name       Gender  Degree  Postcode   Age  Annual Salary
Aditya     M       MSc     W21BG      36   89563
Bob        M       PhD     EC1A1BA    47   123543
Chloé      F       BEcon   SW1A1BH    26   23989
Daisuke    M       BSc     SE207AT    68   138769
Elisabeth  F       MBA     SE10AA     33   113888
Data as Vectors
We assume that our data can be read by a computer, and represented ade-
quately in a numerical format. Data is assumed to be tabular (Figure 8.1),
where we think of each row of the table as representing a particular in-
stance or example, and each column to be a particular feature. (Data is assumed to be in a tidy format (Wickham, 2014; Codd, 1990).) In recent years machine learning has been applied to many types of data that do not obviously come in the tabular numerical format, for example genomic sequences, text and image contents of a webpage, and social media graphs.
We do not discuss the important and challenging aspects of identifying
good features. Many of these aspects depend on domain expertise and re-
quire careful engineering, which in recent years have been put under the
umbrella of data science (Stray, 2016; Adhikari and DeNero, 2018).
Even when we have data in tabular format, there are still choices to be
made to obtain a numerical representation. For example in Table 8.1, the
gender column (a categorical variable) may be converted into numbers 0
representing “Male” and 1 representing “Female”. Alternatively the gen-
der could be represented by numbers −1, +1, respectively (as shown in
Table 8.2). Furthermore it is often important to use domain knowledge
when constructing the representation, such as knowing that university
degrees progress from Bachelor’s to Master’s to PhD or realizing that the
postcode provided is not just a string of characters but actually encodes
an area in London. In Table 8.2, we converted the data from Table 8.1
to a numerical format, and each postcode is represented as two numbers,
a latitude and longitude. Even numerical data that could potentially be
directly read into a machine learning algorithm should be carefully con-
sidered for units, scaling, and constraints. Without additional information,
one should shift and scale all columns of the dataset such that they have
an empirical mean of 0 and an empirical variance of 1. For the purposes
of this book we assume that a domain expert already converted data ap-
propriately, i.e., each input xn is a D-dimensional vector of real numbers,
which are called features, attributes or covariates. We consider a dataset to be of the form as illustrated by Table 8.2. Observe that we have dropped the Name column of Table 8.1 in the new numerical representation. There
are two main reasons why this is desirable: 1. we do not expect the iden-
tifier (the Name) to be informative for a machine learning task, and 2.
we may wish to anonymize the data to help protect the privacy of the
employees.
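A hedged sketch of turning two rows of Table 8.1 into numerical feature vectors; the ordinal degree encoding and the postcode coordinates below are illustrative placeholders, not the book's Table 8.2 values.

```python
import numpy as np

gender_code = {"M": -1.0, "F": 1.0}
degree_code = {"BEcon": 1.0, "BSc": 1.0, "MSc": 2.0, "MBA": 2.0, "PhD": 3.0}  # assumed ordering
postcode_latlon = {"W21BG": (51.52, -0.17), "SW1A1BH": (51.50, -0.13)}        # made-up coordinates

rows = [("Aditya", "M", "MSc", "W21BG", 36, 89563),
        ("Chloé", "F", "BEcon", "SW1A1BH", 26, 23989)]

X = np.array([[gender_code[g], degree_code[d], *postcode_latlon[p], age]
              for _, g, d, p, age, _ in rows])
y = np.array([salary for *_, salary in rows], dtype=float)

# Shift and scale each feature column to empirical mean 0 and variance 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.shape, y)
```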
In this part of the book, we will use N to denote the number of examples
in a dataset and index the examples with lowercase n = 1, . . . , N . We
assume that we are given a set of numerical data, represented as an array
of vectors (Table 8.2). Each row is a particular individual xn often referred
to as an example or data point in machine learning. The subscript n refers to the fact that this is the nth example out of a total of N examples in the
dataset. Each column represents a particular feature of interest about the
example, and we index the features as d = 1, . . . , D. Recall that data is
represented as vectors, which means that each example (each data point)
is a D dimensional vector. The orientation of the table originates from
the database community, but for some machine learning algorithms (e.g.,
in Chapter 10) it is more convenient to represent examples as column
vectors.
Let us consider the problem of predicting annual salary from age, based
on the data in Table 8.2. This is called a supervised learning problem
where we have a label yn (the salary) associated with each example xn
(the age). The label yn has various other names including: target, response
variable and annotation. A dataset is written as a set of example-label pairs
{(x1 , y1 ), . . . , (xn , yn ), . . . , (xN , yN )}. The table of examples {x1 , . . . xN }
is often concatenated, and written as X ∈ RN ×D . Figure 8.1 illustrates
the dataset consisting of the rightmost two columns of Table 8.2 where
x=age and y =salary.
We use the concepts introduced in the first part of the book to formalize
the machine learning problems such as that in the previous paragraph.
Representing data as vectors xn allows us to use concepts from linear al-
gebra (introduced in Chapter 2). In many machine learning algorithms,
we need to additionally be able to compare two vectors. As we will see in
Chapters 9 and 12, computing the similarity or distance between two ex-
amples allows us to formalize the intuition that examples with similar fea-
tures should have similar labels. The comparison of two vectors requires
that we construct a geometry (explained in Chapter 3), and allows us to
optimize the resulting learning problem using techniques from Chapter 7.
Since we have vector representations of data, we can manipulate data to
find potentially better representations of it. We will discuss finding good
representations in two ways: finding lower-dimensional approximations
of the original feature vector, and using nonlinear higher-dimensional
combinations of the original feature vector. In Chapter 10 we will see an
example of finding a low-dimensional approximation of the original data
space by finding the principal components. Finding principal components
is closely related to concepts of eigenvalue and singular value decomposi-
tion as introduced in Chapter 4. For the high-dimensional representation,
feature map we will see an explicit feature map φ(·) that allows us to represent inputs
xn using a higher dimensional representation φ(xn ). The main motiva-
tion for higher dimensional representations is that we can construct new
features as non-linear combinations of the original features, which in turn
may make the learning problem easier. We will discuss the feature map
kernel in Section 9.2 and show how this feature map leads to a kernel in Sec-
tion 12.4. In recent years deep learning methods (Goodfellow et al., 2016)
have shown promise in using the data itself to learn new good features,
and has been very successful in areas such as computer vision, speech
recognition and natural language processing. We will not cover neural
networks in this part of the book, but the reader is referred to Section 5.6
for the mathematical description of backpropagation, a key concept for
training neural networks.
[Figure: example dataset of salary (y) plotted against age (x); cf. Figure 8.1 and Table 8.2.]
Models as Functions
Once we have data in an appropriate vector representation, we can get to
the business of constructing a predictive function (known as a predictor).
In Chapter 1 we did not yet have the language to be precise about models.
Using the concepts from the first part of the book, we can now introduce
what “model” means. We present two major approaches in this book: a
predictor as a function, and a predictor as a probabilistic model. We de-
scribe the former here and the latter in the next subsection.
A predictor is a function that, when given a particular input example
(in our case a vector of features), produces an output. For now consider
the output to be a single number, i.e., a real-valued scalar output. This can
be written as
f : RD → R , (8.1)
where the input vector x is D-dimensional (has D features), and the func-
tion f then applied to it (written as f (x)) returns a real number. Fig-
ure 8.2 illustrates a possible function that can be used to compute the
value of the prediction for input values x.
In this book, we do not consider the general case of all functions, which
would involve the need for functional analysis. Instead we consider the
special case of linear functions
f(x) = θ^\top x + θ_0
for unknown θ and θ_0. This restriction means that the contents of Chap-
ter 2 and 3 suffice for precisely stating the notion of a predictor for the
non-probabilistic (in contrast to the probabilistic view described next)
view of machine learning. Linear functions strike a good balance between
the generality of the problems that can be solved and the amount of back-
ground mathematics that is needed.
Section 8.1.1 What is the set of functions we allow the predictor to take?
Section 8.1.2 How do we measure how well the predictor performs on
the training data?
Section 8.1.3 How do we construct predictors from only training data
that performs well on unseen test data?
Section 8.1.4 What is the procedure for searching over the space of mod-
els?
In this section, we use the notation ŷn = f (xn , θ ∗ ) to represent the output
of the predictor.
Remark. For ease of presentation we will describe empirical risk minimiza-
tion in terms of supervised learning (where we have labels). This simpli-
fies the definition of the hypothesis class and the loss function. It is also
common in machine learning to choose a parametrized class of functions,
for example affine functions. ♦
Example 8.1
We introduce the problem of ordinary least squares regression to illustrate
empirical risk minimization. A more comprehensive account of regression
is given in Chapter 9. When the label yn is real valued, a popular choice
of function class for predictors is the set of affine functions. (Affine functions are often referred to as linear functions in machine learning.) We choose a more compact notation for an affine function by concatenating an additional unit feature x^{(0)} = 1 to x_n, i.e., x_n = [1, x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(D)}]^\top. The parameter vector is correspondingly θ = [θ_0, θ_1, θ_2, \ldots, θ_D]^\top, allowing us to write the predictor as a linear function

f(x_n, θ) = θ^\top x_n .    (8.4)

This linear predictor is equivalent to the affine model

f(x_n, θ) = θ_0 + \sum_{d=1}^{D} θ_d x_n^{(d)} .    (8.5)
where ŷ_n = f(x_n, θ). Equation (8.6) is called the empirical risk and depends on three arguments, the predictor f and the data X, y. This general strategy for learning is called empirical risk minimization.
where we substituted the predictor ŷn = f (xn , θ). By using our choice of
a linear predictor f (xn , θ) = θ > xn we obtain the optimization problem
\min_{θ \in \mathbb{R}^D} \frac{1}{N} \sum_{n=1}^{N} (y_n - θ^\top x_n)^2 .    (8.8)
This equation can be equivalently expressed in matrix form
\min_{θ \in \mathbb{R}^D} \frac{1}{N} \| y - Xθ \|^2 .    (8.9)
This is known as the least-squares problem. There exists a closed-form analytic solution for this by solving the normal equations, which we will discuss in Section 9.2.
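A minimal sketch (synthetic data) of solving the least-squares problem (8.9) in closed form via the normal equations X^⊤Xθ = X^⊤y.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, D))])   # prepend the unit feature
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(N)

# Normal equations: solve (X^T X) theta = X^T y without forming an explicit inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)                                    # close to theta_true

empirical_risk = np.mean((y - X @ theta_hat) ** 2)  # cf. (8.8) and (8.9)
print(empirical_risk)
```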
The regularization term is sometimes called the penalty term, which biases
where R(f (k) , V (k) ) is the risk (e.g., RMSE) on the validation set V (k) for
predictor f (k) . The approximation has two sources: first due to the finite
training set which results in not the best possible f (k) and second due to
the finite validation set which results in an inaccurate estimation of the
risk R(f (k) , V (k) ). A potential disadvantage of K -fold cross validation is
the computational cost of training the model K times, which can be bur-
densome if the training cost is computationally expensive. In practice, it
is often not sufficient to look at the direct parameters alone. For example,
we need to explore multiple complexity parameters (e.g., multiple regu-
larization parameters), which may not be direct parameters of the model.
Evaluating the quality of the model, depending on these hyperparameters
may result in a number of training runs that is exponential in the number
of model parameters. One can use nested cross validation (Section 8.5.1)
to search for good hyperparameters.
However, cross validation is an embarrassingly parallel problem, i.e., little effort is needed to separate the problem into a number of parallel
tasks. Given sufficient computing resources (e.g., cloud computing, server
farms), cross validation does not require longer than a single performance
assessment.
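A sketch of K-fold cross validation (synthetic data, squared-error risk, folds built with numpy only) for the linear least-squares predictor from Example 8.1.

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, K = 120, 3, 5
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, D))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.3 * rng.standard_normal(N)

folds = np.array_split(rng.permutation(N), K)

risks = []
for k in range(K):
    val = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    theta = np.linalg.solve(X[train].T @ X[train], X[train].T @ y[train])
    risks.append(np.mean((y[val] - X[val] @ theta) ** 2))   # risk on the k-th validation fold

print(np.mean(risks))   # cross-validated estimate of the expected risk
```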
In this section we saw that empirical risk minimization is based on the
following concepts: the hypothesis class of functions, the loss function and
regularization. In Section 8.2 we will see the effect of using a probability
distribution to replace the idea of loss functions and regularization.
Further Reading
Due to the fact that the original development of empirical risk minimiza-
tion (Vapnik, 1998) was couched in heavily theoretical language, many
of the subsequent developments have been theoretical. The area of study
is called statistical learning theory (Hastie et al., 2001; von Luxburg and Schölkopf, 2011; Vapnik, 1999; Evgeniou et al., 2000). A recent machine
learning textbook that builds on the theoretical foundations and develops
efficient learning algorithms is Shalev-Shwartz and Ben-David (2014).
The concept of regularization has its roots in the solution of ill-posed in-
verse problems (Neumaier, 1998). The approach presented here is called
Tikhonov regularization, and there is a closely related constrained version called Ivanov regularization. Tikhonov regularization has deep rela-
tionships to the bias-variance tradeoff and feature selection (Bühlmann
and Geer, 2011). An alternative to cross validation is bootstrap and jack-
knife (Efron and Tibshirani, 1993; Davidson and Hinkley, 1997; Hall,
1992).
Example 8.4
The first example that is often used is to specify that the conditional probability of the labels given the examples is a Gaussian distribution. In other words, we assume that we can explain our observation uncertainty by independent Gaussian noise (refer to Section 6.5) with zero mean, ε_n ∼ N(0, σ²). We further assume that the linear model x_n^⊤θ is used for prediction. This means we specify a Gaussian likelihood for each example-label pair (x_n, y_n),
p(y_n | x_n, θ) = N(y_n | x_n^⊤θ, σ²) .   (8.15)
An illustration of a Gaussian likelihood for a given parameter θ is shown in Figure 8.3. We will see in Section 9.2 how to explicitly expand the expression above out in terms of the Gaussian distribution.
We assume that the set of examples (x_1, y_1), . . . , (x_N, y_N) is independent and identically distributed (i.i.d.). The word independent (Section 6.4.5) implies that the likelihood of the whole dataset (Y = {y_1, . . . , y_N} and X = {x_1, . . . , x_N}) factorizes into a product of the likelihoods of each individual example
p(Y | X, θ) = ∏_{n=1}^{N} p(y_n | x_n, θ) .   (8.16)
While it is tempting to interpret the fact that θ is on the right of the conditioning in p(y_n | x_n, θ) (8.15) as meaning that θ is observed and fixed, this interpretation is incorrect. The negative log-likelihood L(θ) := − log p(Y | X, θ) = − Σ_{n=1}^{N} log p(y_n | x_n, θ) is the quantity we minimize to find good parameters θ.
Example 8.5
Continuing on our example of Gaussian likelihoods (8.15), the negative
log-likelihood can be rewritten as
L(θ) = − Σ_{n=1}^{N} log p(y_n | x_n, θ) = − Σ_{n=1}^{N} log N(y_n | x_n^⊤θ, σ²)   (8.18a)
= − Σ_{n=1}^{N} log [ (1/√(2πσ²)) exp( −(y_n − x_n^⊤θ)² / (2σ²) ) ]   (8.18b)
= − Σ_{n=1}^{N} log exp( −(y_n − x_n^⊤θ)² / (2σ²) ) − Σ_{n=1}^{N} log (1/√(2πσ²))   (8.18c)
= (1/(2σ²)) Σ_{n=1}^{N} (y_n − x_n^⊤θ)² − Σ_{n=1}^{N} log (1/√(2πσ²)) .   (8.18d)
As σ is given, the second term in (8.18d) is constant, and minimizing L(θ)
corresponds to solving the least squares problem (compare with (8.8))
expressed in the first term.
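As a numerical sanity check of this equivalence, the sketch below (synthetic data, σ assumed known) minimizes the negative log-likelihood (8.18d) with a generic optimizer and compares the result with the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

# A sketch, assuming synthetic data and a known noise level sigma: minimizing
# the Gaussian negative log-likelihood yields the least-squares solution (8.8).
rng = np.random.default_rng(0)
N, D, sigma = 80, 3, 0.5
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = X @ theta_true + sigma * rng.normal(size=N)

def neg_log_likelihood(theta):
    resid = y - X @ theta
    # (8.18d): squared-error term plus a constant that does not depend on theta
    return (resid @ resid) / (2 * sigma**2) + N * 0.5 * np.log(2 * np.pi * sigma**2)

theta_nll = minimize(neg_log_likelihood, x0=np.zeros(D)).x
theta_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_nll, theta_lsq, atol=1e-3))  # same minimizer (up to numerics)
```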
Figure 8.6 Comparing the predictions with the maximum likelihood estimate (MLE) and the MAP estimate at x = 60. The prior biases the slope to be less steep and the intercept to be closer to zero. In this example, the bias that moves the intercept closer to zero actually increases the slope.
The proportion relation above hides the density of the data p(x), which may be difficult to estimate. Instead of estimating the minimum of the negative log-likelihood, we now estimate the minimum of the negative log-posterior, which is referred to as maximum a posteriori estimation (MAP estimation). An illustration of the effect of adding a zero-mean Gaussian prior is shown in Figure 8.6.
Example 8.6
In addition to the assumption of Gaussian likelihood in the previous example, we assume that the parameter vector is distributed as a multivariate Gaussian with zero mean, i.e., p(θ) = N(0, Σ), where Σ is the covariance matrix (Section 6.5). Note that the conjugate prior of a Gaussian is also a Gaussian (Section 6.6.1), and therefore we expect the posterior distribution to also be a Gaussian. We will see the details of maximum a posteriori estimation in Chapter 9.
[Figure: Different model classes fitted to a regression dataset.]
Further Reading
When considering probabilistic models the principle of maximum likeli-
hood estimation generalizes the idea of least-squares regression for linear
models, which we will discuss in detail in Chapter 9. When restricting the predictor to have linear form, with an additional nonlinear function ϕ applied to the output,
we can consider other models for other prediction tasks, such as binary
classification or modeling count data (McCullagh and Nelder, 1989). An
alternative view of this is to consider likelihoods that are from the ex-
ponential family (Section 6.6). The class of models, which have linear
dependence between parameters and data, and have potentially nonlin-
ear transformation ϕ (called a link function) is referred to as generalized
linear models (Agresti, 2002, Chapter 4).
Maximum likelihood estimation has a rich history, and was originally
proposed by Sir Ronald Fisher in the 1930s. We will expand upon the idea
of a probabilistic model in Section 8.3. One debate among researchers
who use probabilistic models is the discussion between Bayesian and frequentist statistics. As mentioned in Section 6.1.1, it boils down to the
definition of probability. Recall from Section 6.1 that one can consider
probability to be a generalization (by allowing uncertainty) of logical rea-
soning (Cheeseman, 1985; Jaynes, 2003). The method of maximum like-
lihood estimation is frequentist in nature, and the interested reader is
pointed to Efron and Hastie (2016) for a balanced view of both Bayesian
and frequentist statistics.
There are some probabilistic models where maximum likelihood esti-
mation may not be possible. The reader is referred to more advanced sta-
tistical textbooks, e.g., Casella and Berger (2002), for approaches, such as
method of moments, M -estimation and estimating equations.
8.3.1 Probabilistic Models
Probabilistic models represent the uncertain aspects of an experiment as probability distributions; a probabilistic model is specified by the joint distribution of all its random variables. The benefit of using probabilistic models is that
they offer a unified and consistent set of tools from probability theory
(Chapter 6) for modeling, inference, prediction and model selection.
In probabilistic modeling, the joint distribution p(x, θ) of the observed variables x and the hidden parameters θ is of central importance: It encapsulates information from both the prior and the likelihood.
Such predictions no longer depend on the model parameters θ, which have been
marginalized/integrated out. Equation (8.23) reveals that the prediction
is an average over all plausible parameter values θ , where the plausibility
is encapsulated by the parameter distribution p(θ).
Having discussed parameter estimation in Section 8.2 and Bayesian in-
ference here, let us compare these two approaches to learning. Parameter
estimation via maximum likelihood or MAP estimation yields a consistent
point estimate θ ∗ of the parameters, and the key computational problem
to be solved is optimization. In contrast, Bayesian inference yields a (pos-
terior) distribution, and the key computational problem to be solved is
integration. Predictions with point estimates are straightforward, whereas
predictions in the Bayesian framework require solving another integration
problem, see (8.23). However, Bayesian inference gives us a principled
way to incorporate prior knowledge, account for side information and incorporate structural knowledge, none of which is easily done in the
context of parameter estimation. Moreover, the propagation of parameter
uncertainty to the prediction can be valuable in decision-making systems
for risk assessment and exploration in the context of data-efficient learn-
ing (Kamthe and Deisenroth, 2018; Deisenroth et al., 2015).
While Bayesian inference is a mathematically principled framework for
learning about parameters and making predictions, there are some prac-
tical challenges that come with it because of the integration problems we
need to solve, see (8.22) and (8.23). More specifically, if we do not choose
a conjugate prior on the parameters (Section 6.6.1), the integrals in (8.22)
and (8.23) are not analytically tractable, and we cannot compute the pos-
terior, the predictions or the marginal likelihood in closed form. In these
cases, we need to resort to approximations. Here, we can use stochas-
tic approximations, such as Markov chain Monte Carlo (MCMC) (Gilks
et al., 1996), or deterministic approximations, such as the Laplace ap-
proximation (Bishop, 2006; Murphy, 2012; Barber, 2012), variational in-
ference (Jordan et al., 1999; Blei et al., 2017) or expectation propaga-
tion (Minka, 2001a).
Despite these challenges, Bayesian inference has been successfully ap-
plied to a variety of problems, including large-scale topic modeling (Hoff-
man et al., 2013), click-through-rate prediction (Graepel et al., 2010),
data-efficient reinforcement learning in control systems (Deisenroth et al.,
2015), online ranking systems (Herbrich et al., 2007), and large-scale rec-
ommender systems. There are generic tools, such as Bayesian optimization (Brochu et al., 2009; Snoek et al., 2012; Shahriari et al., 2016), that use Bayesian inference to optimize expensive black-box functions, e.g., for hyperparameter search.
First, we obtain a likelihood of the model that does not depend on the latent variables. Second, we use this likelihood for parameter estimation or Bayesian inference, where we use exactly the same expressions as in Sections 8.2 and 8.3.2, respectively.
Since the likelihood function p(x | θ) is the predictive distribution of the
data given the model parameters, we need to marginalize out the latent
variables so that
p(x | θ) = ∫ p(x | θ, z) p(z) dz ,   (8.25)
where p(x | z, θ) is given in (8.24) and p(z) is the prior on the latent variables. Note that the likelihood must not depend on the latent variables z; it is only a function of the data x and the model parameters θ.
The likelihood in (8.25) directly allows for parameter estimation via maximum likelihood. MAP estimation is also straightforward with an additional prior on the model parameters θ, as discussed in Section 8.2.2.
Moreover, with the likelihood (8.25), Bayesian inference (Section 8.3.2) in a latent-variable model works in the usual way: We place a prior p(θ) on the model parameters and use Bayes' theorem to obtain a posterior distribution
p(θ | X) = p(X | θ) p(θ) / p(X)   (8.26)
over the model parameters given a dataset X . The posterior in (8.26) can
be used for predictions within a Bayesian inference framework, see (8.23).
One challenge we have in this latent-variable model is that the like-
lihood p(X | θ) requires the marginalization of the latent variables ac-
cording to (8.25). Except when we choose a conjugate prior p(z) for
p(x | z, θ), the marginalization in (8.25) is not analytically tractable, and
we need to resort to approximations (Paquet, 2008; Bishop, 2006; Mur-
phy, 2012; Moustaki et al., 2015).
Similar to the parameter posterior (8.26), we can compute a posterior on the latent variables according to
p(z | X) = p(X | z) p(z) / p(X) ,   p(X | z) = ∫ p(X | z, θ) p(θ) dθ ,   (8.27)
where p(z) is the prior on the latent variables and p(X | z) requires us to
integrate out the model parameters θ .
Given the difficulty of solving integrals analytically, it is clear that marginal-
izing out both the latent variables and the model parameters at the same
time is not possible in general (Murphy, 2012; Bishop, 2006). A quantity
that is easier to compute is the posterior distribution on the latent vari-
ables, but conditioned on the model parameters, i.e.,
p(z | X, θ) = p(X | z, θ) p(z) / p(X | θ) ,   (8.28)
where p(z) is the prior on the latent variables and p(X | z, θ) is given
in (8.24).
In Chapters 10 and 11, we derive the likelihood functions for PCA and
Gaussian mixture models, respectively. Moreover, we compute the poste-
rior distributions (8.28) on the latent variables for both PCA and Gaussian
mixture models.
Remark. In the following chapters, we may not be drawing such a clear
distinction between latent variables z and uncertain model parameters θ
and call the model parameters “latent” or “hidden” as well because they
are unobserved. In Chapters 10 and 11, where we use the latent variables
z , we will pay attention to the difference as we will have two different
types of hidden variables: model parameters θ and latent variables z . ♦
We can exploit the fact that all the elements of a probabilistic model are
random variables to define a unified language for representing them. In
Section 8.4, we will see a concise graphical language for representing the
structure of probabilistic models. We will use this graphical language to
describe the probabilistic models in the subsequent chapters.
Further Reading
Probabilistic models in machine learning (Bishop, 2006; Barber, 2012;
Murphy, 2012) provide a way for users to capture uncertainty about data
and predictive models in a principled fashion. Ghahramani (2015) presents
a short review of probabilistic models in machine learning. Given a proba-
bilistic model, we may be lucky enough to be able to compute parameters
of interest analytically. However, in general, analytic solutions are rare,
and computational methods such as sampling (Gilks et al., 1996; Brooks
et al., 2011) and variational inference (Jordan et al., 1999; Blei et al.,
2017) are used. Moustaki et al. (2015) and Paquet (2008) provide a good
overview of Bayesian inference in latent-variable models.
In recent years, several programming languages have been proposed
that aim to treat the variables defined in software as random variables
corresponding to probability distributions. The objective is to be able to
write complex functions of probability distributions, while under the hood
the compiler automatically takes care of the rules of Bayesian inference.
This rapidly changing field is called probabilistic programming.
Directed graphical models (also known as Bayesian networks) are a method for representing conditional dependencies in a probabilistic model. They provide a visual description of the conditional probabilities and hence a simple language for describing complex interdependence. The modular description also entails computational simplification. Directed links (arrows) between two nodes (random variables) indicate conditional probabilities; with additional assumptions, the arrows can be used to indicate causal relationships (Pearl, 2009). For example, the arrow between a and b in Figure 8.9(a) gives the conditional probability p(b | a) of b given a.
Directed graphical models can be derived from joint distributions if we
know something about their factorization.
Example 8.7
Consider the joint distribution
p(a, b, c) = p(c | a, b)p(b | a)p(a) (8.29)
of three random variables a, b, c. The factorization of the joint distribution
in (8.29) tells us something about the relationship between the random
variables:
c depends directly on a and b
b depends directly on a
a depends neither on b nor on c
For the factorization in (8.29), we obtain the directed graphical model in
Figure 8.9(a).
Example 8.8
Looking at the graphical model in Figure 8.9(b) we exploit two properties:
The joint distribution p(x1 , . . . , x5 ) we seek is the product of a set of
conditionals, one for each node in the graph. In this particular example,
we will need five conditionals.
Each conditional depends only on the parents of the corresponding
node in the graph. For example, x4 will be conditioned on x2 .
These two properties yield the desired factorization of the joint distribu-
tion
p(x1 , x2 , x3 , x4 , x5 ) = p(x1 )p(x5 )p(x2 | x5 )p(x3 | x1 , x2 )p(x4 | x2 ) . (8.30)
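To make the factorization concrete, here is a small sketch (not from the book) with five binary random variables and hypothetical conditional probability tables; multiplying the conditionals according to (8.30) yields a valid joint distribution.

```python
import numpy as np

# A minimal sketch, assuming binary variables x1,...,x5 with hypothetical
# conditional probability tables, following the factorization (8.30).
p_x1 = np.array([0.6, 0.4])                      # p(x1)
p_x5 = np.array([0.7, 0.3])                      # p(x5)
p_x2_given_x5 = np.array([[0.9, 0.1],            # p(x2 | x5); rows indexed by x5
                          [0.2, 0.8]])
p_x4_given_x2 = np.array([[0.5, 0.5],            # p(x4 | x2); rows indexed by x2
                          [0.1, 0.9]])
p_x3_given_x1x2 = np.array([[[0.8, 0.2], [0.3, 0.7]],   # p(x3 | x1, x2), indexed [x1, x2, x3]
                            [[0.6, 0.4], [0.05, 0.95]]])

# Build the joint distribution p(x1,...,x5) as the product of the conditionals.
joint = np.zeros((2, 2, 2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            for x4 in range(2):
                for x5 in range(2):
                    joint[x1, x2, x3, x4, x5] = (
                        p_x1[x1] * p_x5[x5] * p_x2_given_x5[x5, x2]
                        * p_x3_given_x1x2[x1, x2, x3] * p_x4_given_x2[x2, x4]
                    )

print(joint.sum())                  # sums to 1: a valid joint distribution
print(joint.sum(axis=(2, 3, 4)))    # marginal distribution over (x1, x2)
```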
Figure 8.10 Graphical models for a repeated Bernoulli experiment.

where Pa_k means "the parent nodes of x_k". Parent nodes of x_k are nodes that have arrows pointing to x_k.
We conclude this subsection with a concrete example of the coin flip
experiment. Consider a Bernoulli experiment (Example 6.8) where the
probability that the outcome x of this experiment is “heads” is
p(x | µ) = Ber(µ) . (8.32)
We now repeat this experiment N times and observe outcomes x1 , . . . , xN
so that we obtain the joint distribution
p(x_1, . . . , x_N | µ) = ∏_{n=1}^{N} p(x_n | µ) .   (8.33)
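A short sketch of this setting: we simulate N coin flips with a hypothetical heads probability, evaluate the i.i.d. log-likelihood corresponding to (8.33), and note that the maximum likelihood estimate of µ is the empirical frequency of heads.

```python
import numpy as np

# A sketch, assuming a hypothetical true heads probability mu_true.
rng = np.random.default_rng(0)
mu_true = 0.7
N = 100
x = rng.binomial(1, mu_true, size=N)     # outcomes x_1, ..., x_N (1 = heads)

def log_likelihood(mu, x):
    # log p(x_1,...,x_N | mu) = sum_n log Ber(x_n | mu), cf. (8.33)
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

mu_ml = x.mean()                         # maximum likelihood estimate of mu
print(mu_ml, log_likelihood(mu_ml, x), log_likelihood(0.5, x))
```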
Figure 8.11 D-separation example.

A ⊥⊥ B | C ,   (8.34)
if every path from a node in A to a node in B is blocked. A path is blocked if it includes a node such that either
the arrows on the path meet either head to tail or tail to tail at the node, and the node is in the set C, or
the arrows meet head to head at the node, and neither the node nor any of its descendants is in the set C.
Further Reading
An introduction to probabilistic graphical models can be found in Bishop
(2006, Chapter 8), and an extensive description of the different applica-
tions and corresponding algorithmic implications can be found in Koller
and Friedman (2009).
There are three main types of probabilistic graphical models:
Directed graphical models (Bayesian networks), see Figure 8.13(a)
Undirected graphical models (Markov random fields), see Figure 8.13(b)
Factor graphs, see Figure 8.13(c)
Graphical models allow for graph-based algorithms for inference and learning, e.g., via local message passing. Applications range from ranking in online games (Herbrich et al., 2007) and computer vision (e.g.,
image segmentation, semantic labeling, image de-noising, image restora-
tion (Sucar and Gillies, 1994; Shotton et al., 2006; Szeliski et al., 2008;
Kittler and Föglein, 1984)) to coding theory (McEliece et al., 1998), solv-
ing linear equation systems (Shental et al., 2008) and iterative Bayesian
state estimation in signal processing (Bickson et al., 2007; Deisenroth and
Mohamed, 2012).
One topic that is particularly important in real applications, and that we do not discuss in this book, is the idea of structured prediction (Bakir et al., 2007; Nowozin et al., 2014), which allows machine learning models to tackle predictions that are structured, for example sequences, trees and graphs. The popularity of neural network models has allowed more flexible probabilistic models to be used, resulting in many useful applications of structured models (Goodfellow et al., 2016, Chapter 16). In recent years, there has been a renewed interest in graphical models due to their applications to causal inference (Rosenbaum, 2017; Pearl, 2009; Imbens and Rubin, 2015; Peters et al., 2017).
Figure 8.14 Bayesian inference embodies Occam's razor. The horizontal axis describes the space of all possible datasets D. The evidence (vertical axis) evaluates how well a model predicts available data. Since p(D | Mi) needs to integrate to 1, we should choose the model with the greatest evidence. Adapted from MacKay (2003).
the posterior probability p(Mi | D) of model Mi given the data D, we can
employ Bayes’ theorem. Assuming a uniform prior p(M ) over all mod-
els, Bayes’ theorem rewards models in proportion to how much they pre-
dicted the data that occurred. This prediction of the data given model
Mi, p(D | Mi), is called the evidence for Mi. A simple model M1 can only
predict a small number of datasets, which is shown by p(D | M1 ); a more
powerful model M2 that has, e.g., more free parameters than M1 , is able
to predict a greater variety of datasets. This means, however, that M2
does not predict the datasets in region C as well as M1 . Suppose that
equal prior probabilities have been assigned to the two models. Then, if
the dataset falls into region C , the less powerful model M1 is the more
probable model.
Above, we argued that models need to be able to explain the data, i.e.,
there should be a way to generate data from a given model. Furthermore, if the model has been appropriately learned from the data, then we expect
that the generated data should be similar to the empirical data. For this,
it is helpful to phrase model selection as a hierarchical inference problem,
which allows us to compute the posterior distribution over models.
Let us consider a finite number of models M = {M1 , . . . , MK }, where
each model Mk possesses parameters θ_k. In Bayesian model selection, we place a prior p(M) on the set of models. The corresponding generative process that allows us to generate data from this model is
Mk ∼ p(M ) (8.40)
θ k ∼ p(θ | Mk ) (8.41)
D ∼ p(D | θ k ) (8.42)
Figure 8.15 Illustration of the hierarchical generative process in Bayesian model selection. We place a prior p(M) on the set of models. For each model, there is a distribution p(θ | M) on the corresponding model parameters, which is used to generate the data D.

With a uniform prior p(M_k) = 1/K, which gives every model equal (prior) probability, determining the MAP estimate over models amounts to picking the model that maximizes the model evidence (8.44).
Remark (Likelihood and Marginal Likelihood). There are some important differences between a likelihood and a marginal likelihood (evidence): While the likelihood is prone to overfitting, the marginal likelihood is typically not, as the model parameters have been marginalized out (i.e., we no longer have to fit the parameters). Furthermore, the marginal likelihood automatically embodies a trade-off between model complexity and data fit (Occam's razor). ♦
8.5.3 Bayes Factors for Model Comparison
Consider the problem of comparing two probabilistic models M1, M2, given a dataset D. If we compute the posteriors p(M1 | D) and p(M2 | D), we can compute the ratio of the posteriors
p(M1 | D) / p(M2 | D) = [p(D | M1) p(M1) / p(D)] / [p(D | M2) p(M2) / p(D)] = [p(M1) / p(M2)] [p(D | M1) / p(D | M2)] ,   (8.46)
where the three ratios are the posterior odds, the prior odds and the Bayes factor, respectively.
The ratio of the posteriors is also called the posterior odds. The first fraction on the right-hand side of (8.46), the prior odds, measures how much our prior (initial) beliefs favor M1 over M2. The ratio of the marginal likelihoods (second fraction on the right-hand side) is called the Bayes factor and measures how well the data D is predicted by M1 compared to M2.
Remark. The Jeffreys-Lindley paradox states that the "Bayes factor always favors the simpler model since the probability of the data under a complex model with a diffuse prior will be very small" (Murphy, 2012). Here, a diffuse prior refers to a prior that does not favor specific models, i.e., many models are a priori plausible under this prior. ♦
If we choose a uniform prior over models, the prior odds term in (8.46) is 1, i.e., the posterior odds is the ratio of the marginal likelihoods (Bayes factor)
p(D | M1) / p(D | M2) .   (8.47)
If the Bayes factor is greater than 1, we choose model M1; otherwise we choose model M2. In a similar way to frequentist statistics, there are guidelines on the size of the ratio that one should require before declaring the result "significant" (Jeffreys, 1961).
Remark (Computing the Marginal Likelihood). The marginal likelihood
plays an important role in model selection: We need to compute Bayes
factors (8.46) and posterior distributions over models (8.43).
Unfortunately, computing the marginal likelihood requires us to solve
an integral (8.44). This integration is generally analytically intractable,
and we will have to resort to approximation techniques, e.g., numerical
integration (Stoer and Bulirsch, 2002), stochastic approximations using
Monte Carlo (Murphy, 2012) or Bayesian Monte Carlo techniques (O’Hagan,
1991; Rasmussen and Ghahramani, 2003).
However, there are special cases in which we can solve it. In Section 6.6.1,
we discussed conjugate models. If we choose a conjugate parameter prior
p(θ), we can compute the marginal likelihood in closed form. In Chap-
ter 9, we will do exactly this in the context of linear regression. ♦
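As an illustration of such a conjugate case, the sketch below uses a Beta-Bernoulli coin-flip model (an assumption for this sketch, not an example from the text): its marginal likelihood is available in closed form, so the Bayes factor (8.47) can be computed directly.

```python
import numpy as np
from scipy.special import betaln

# A sketch, assuming a Beta-Bernoulli model: for a Beta(a, b) prior on the heads
# probability and data with h heads and t tails, the marginal likelihood is
#   p(D | M) = B(a + h, b + t) / B(a, b),
# where B is the Beta function.
def log_marginal_likelihood(h, t, a, b):
    return betaln(a + h, b + t) - betaln(a, b)

h, t = 62, 38  # hypothetical dataset: 62 heads, 38 tails

# Model M1: uniform Beta(1, 1) prior; Model M2: prior concentrated around 0.5.
log_evidence_m1 = log_marginal_likelihood(h, t, a=1.0, b=1.0)
log_evidence_m2 = log_marginal_likelihood(h, t, a=50.0, b=50.0)

log_bayes_factor = log_evidence_m1 - log_evidence_m2   # log p(D|M1) - log p(D|M2)
print(np.exp(log_bayes_factor))  # > 1 favors M1, < 1 favors M2, cf. (8.47)
```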
We have seen a brief introduction to the basic concepts of machine
learning in this chapter. For the rest of this part of the book we will see
how the three different flavours of learning in Section 8.1, Section 8.2 and
Section 8.3 are applied to the four pillars of machine learning (regression,
dimensionality reduction, density estimation and classification).
Further Reading
We mentioned at the start of the section that there are high level modeling
choices that influence the performance of the model. Examples include:
The degree of a polynomial in a regression setting
The number of components in a mixture model
The network architecture of a (deep) neural network
The type of kernel in a support vector machine
The dimensionality of the latent space in PCA
The learning rate (schedule) in an optimization algorithm
Rasmussen and Ghahramani (2001) showed that the automatic Occam's razor does not necessarily penalize the number of parameters in a model but is active in terms of the complexity of functions (in parametric models, the number of parameters is often related to the complexity of the model class). They also showed that the automatic Occam's razor also holds for Bayesian non-parametric models with many parameters, e.g., Gaussian processes.
Linear Regression
Figure 9.1 (a) Regression problem: observed noisy function values, from which we wish to infer the underlying function that generated the data. (b) Regression solution: a possible function that could have generated the data (blue), with an indication of the measurement noise of the function value at the corresponding inputs (orange distributions).
p(y | x) = N(y | f(x), σ²) ,   (9.1)
y = f(x) + ε ,   (9.2)
where ε ∼ N(0, σ²) is independent, identically distributed (i.i.d.) Gaussian measurement noise with mean 0 and variance σ². Our objective is
to find a function that is close (similar) to the unknown function f that
generated the data and which generalizes well.
In this chapter, we focus on parametric models, i.e., we choose a para-
metrized function and find parameters θ that “work well” for modeling the
data. For the time being, we assume that the noise variance σ 2 is known
and focus on learning the model parameters θ . In linear regression, we
consider the special case that the parameters θ appear linearly in our
model. An example of linear regression is given by
p(y | x, θ) = N(y | x^⊤θ, σ²)   (9.3)
⟺ y = x^⊤θ + ε , ε ∼ N(0, σ²) ,   (9.4)
Figure 9.2 Linear regression example. (a) Example functions (straight lines) that can be described using the linear model in (9.4); (b) Training set; (c) Maximum likelihood estimate.
p(y_* | x_*, θ_*) = N(y_* | x_*^⊤ θ_*, σ²) .   (9.6)
where we exploited that the likelihood (9.5b) factorizes over the number
of data points due to our independence assumption on the training set.
In the linear regression model (9.4) the likelihood is Gaussian (due to the Gaussian additive noise term), such that we arrive at
log p(y_n | x_n, θ) = − (1/(2σ²)) (y_n − x_n^⊤θ)² + const ,   (9.9)
where the constant includes all terms independent of θ. Using (9.9) in the negative log-likelihood, we obtain the loss L(θ) = (1/(2σ²)) (y − Xθ)^⊤(y − Xθ), whose gradient with respect to θ is
dL/dθ = d/dθ [ (1/(2σ²)) (y − Xθ)^⊤(y − Xθ) ]   (9.11a)
= (1/(2σ²)) d/dθ [ y^⊤y − 2y^⊤Xθ + θ^⊤X^⊤Xθ ]   (9.11b)
= (1/σ²) ( −y^⊤X + θ^⊤X^⊤X ) ∈ R^{1×D} .   (9.11c)
The maximum likelihood estimator θ_ML solves dL/dθ = 0^⊤ (necessary optimality condition), and using (9.11c) we obtain
dL/dθ = 0^⊤ ⟺ θ_ML^⊤ X^⊤X = y^⊤X   (9.12a)
⟺ θ_ML^⊤ = y^⊤X (X^⊤X)^{−1}   (9.12b)
⟺ θ_ML = (X^⊤X)^{−1} X^⊤ y .   (9.12c)
(Ignoring the possibility of duplicate data points, rk(X) = D if N > D, i.e., we do not have more parameters than data points.)
⟺ y = φ^⊤(x)θ + ε = Σ_{k=0}^{K−1} θ_k φ_k(x) + ε ,   (9.13)
A polynomial of degree K − 1 is
f(x) = Σ_{k=0}^{K−1} θ_k x^k = φ^⊤(x)θ ,   (9.15)
and the corresponding maximum likelihood estimate is
θ_ML = (Φ^⊤Φ)^{−1} Φ^⊤ y   (9.19)
for the linear regression problem with nonlinear features defined in (9.13).
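A sketch of polynomial-feature regression on synthetic data: the feature matrix Φ contains the monomials x^0, ..., x^{K−1}, and the maximum likelihood parameters are obtained as in (9.19), here computed via np.linalg.lstsq for numerical stability.

```python
import numpy as np

# A sketch, assuming synthetic data (not the book's dataset): polynomial
# regression of degree K-1 with feature matrix Phi and the ML estimate (9.19).
rng = np.random.default_rng(0)
N, K = 20, 5                               # N data points, polynomial degree K-1
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + 0.2 * rng.normal(size=N)   # hypothetical noisy observations

Phi = np.vander(x, K, increasing=True)     # columns: x^0, x^1, ..., x^(K-1)
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predict at new inputs by evaluating the same polynomial features.
x_test = np.linspace(-4, 4, 100)
y_pred = np.vander(x_test, K, increasing=True) @ theta_ml
print(theta_ml)
```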
Remark. When we were working without features, we required X^⊤X to be invertible, which is the case when rk(X) = D, i.e., when the columns of X are linearly independent. In (9.19), we therefore require Φ^⊤Φ ∈ R^{K×K} to be invertible. This is the case if and only if rk(Φ) = K. ♦
Figure 9.4 Polynomial regression. (a) Dataset consisting of (x_n, y_n) pairs, n = 1, . . . , 10; (b) Maximum likelihood polynomial of degree 4.
Figure 9.5 Maximum likelihood fits for different polynomial degrees M.
Figure 9.6 Training and test error (RMSE) as a function of the degree of the polynomial, for the maximum likelihood fits in Figure 9.5.

Note that the training error (blue curve in Figure 9.6) never increases when the degree of the polynomial increases. In our example, the best generalization (the point of the smallest test error) is obtained for a polynomial of degree M = 4.
where the constant comprises the terms that are independent of θ. We see that the log-posterior in (9.25) is the sum of the log-likelihood log p(Y | X, θ) and the log-prior log p(θ), so that the MAP estimate will be a "compromise" between the prior (our suggestion for plausible parameter values before observing data) and the data-dependent likelihood.
To find the MAP estimate θ_MAP, we minimize the negative log-posterior distribution with respect to θ, i.e., we solve
θ_MAP ∈ arg min_θ { − log p(Y | X, θ) − log p(θ) } .   (9.26)
Φ^⊤Φ + (σ²/b²) I is symmetric and strictly positive definite (i.e., its inverse exists and the MAP estimate is the unique solution of a system of linear equations). Moreover, it reflects the impact of the regularizer.
Figure 9.7 Polynomial regression: maximum likelihood and MAP estimates. (a) Polynomials of degree 6; (b) Polynomials of degree 8.
useful for variable selection. For p = 1, the regularizer is called LASSO (least absolute shrinkage and selection operator) and was proposed by Tibshirani (1996). ♦
The regularizer λ‖θ‖₂² in (9.32) can be interpreted as a negative log-Gaussian prior, which we use in MAP estimation, see (9.26). More specifically, with a Gaussian prior p(θ) = N(0, b²I), we obtain the negative log-Gaussian prior
− log p(θ) = (1/(2b²)) ‖θ‖₂² + const   (9.33)
so that for λ = 1/(2b²) the regularization term and the negative log-Gaussian prior are identical.
Given that the regularized least-squares loss function in (9.32) consists
of terms that are closely related to the negative log-likelihood plus a neg-
ative log-prior, it is not surprising that, when we minimize this loss, we
obtain a solution that closely resembles the MAP estimate in (9.31). More
specifically, minimizing the regularized least-squares loss function yields
θ_RLS = (Φ^⊤Φ + λI)^{−1} Φ^⊤ y ,   (9.34)
which is identical to the MAP estimate in (9.31) for λ = σ²/b², where σ² is the noise variance and b² is the variance of the (isotropic) Gaussian prior p(θ) = N(0, b²I).
So far, we covered parameter estimation using maximum likelihood and MAP estimation, where we found point estimates θ* that optimize an objective function (likelihood or posterior). A point estimate is a single specific parameter value, unlike a distribution over plausible parameter settings. We saw that both maximum likelihood and MAP estimation can lead to overfitting. In the next section, we
will discuss Bayesian linear regression, where we use Bayesian inference
(Section 8.3) to find a posterior distribution over the unknown parame-
ters, which we subsequently use to make predictions. More specifically, for
predictions we will average over all plausible sets of parameters instead
of focusing on a point estimate.
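Before moving on, here is a small sketch of regularized least squares (9.34) on synthetic polynomial features, using λ = σ²/b² to mirror the MAP correspondence discussed above; the noise level and prior variance are assumed values.

```python
import numpy as np

# A sketch, assuming synthetic data, noise level sigma and prior variance b^2:
# regularized least squares (9.34) with lambda = sigma^2 / b^2.
rng = np.random.default_rng(0)
N, K, sigma, b = 15, 8, 0.3, 1.0
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

lam = sigma**2 / b**2                       # regularization strength
theta_rls = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# For comparison: the unregularized maximum likelihood estimate.
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.linalg.norm(theta_rls), np.linalg.norm(theta_ml))  # the RLS/MAP solution has no larger norm
```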
9.3.1 Model
In Bayesian linear regression, we consider the model
prior: p(θ) = N(m_0, S_0) ,
likelihood: p(y | x, θ) = N(y | φ^⊤(x)θ, σ²) ,   (9.35)
where we now explicitly place a Gaussian prior p(θ) = N(m_0, S_0) on θ, which turns the parameter vector into a random variable. This allows us to write down the corresponding graphical model in Figure 9.8, where we made the parameters of the Gaussian prior on θ explicit.

Figure 9.8 Graphical model for Bayesian linear regression.

The full probabilistic model, i.e., the joint distribution of observed and unobserved random variables, y and θ, respectively, is
p(y, θ | x) = p(y | x, θ) p(θ) .   (9.36)
9.3.2 Prior Predictions
In practice, we are usually not so much interested in the parameter values
θ themselves. Instead, our focus often lies in the predictions we make
with those parameter values. In a Bayesian setting, we take the parameter
distribution and average over all plausible parameter settings when we
make predictions. More specifically, to make predictions at an input x∗ ,
we integrate out θ and obtain
Z
p(y∗ | x∗ ) = p(y∗ | x∗ , θ)p(θ)dθ = Eθ [p(y∗ | x∗ , θ)] , (9.37)
Σ := A^{−1} ,   (9.53)
µ := Σ a ,   (9.54)
A := σ^{−2} Φ^⊤Φ + S_0^{−1} ,   (9.55)
a := σ^{−2} Φ^⊤ y + S_0^{−1} m_0 .   (9.56)
The term φ^⊤(x_*) S_N φ(x_*) reflects the posterior uncertainty associated with the parameters θ. Note that S_N depends on the training inputs through Φ, see (9.43b). The predictive mean φ^⊤(x_*) m_N coincides with the MAP estimate.
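The posterior and predictive quantities can be computed directly. The sketch below (synthetic data, assumed noise level σ and isotropic prior) forms S_N and m_N as above and evaluates the predictive mean φ^⊤(x_*) m_N and variance φ^⊤(x_*) S_N φ(x_*) + σ².

```python
import numpy as np

# A sketch of Bayesian linear regression, assuming synthetic data, a known
# noise level sigma, and an isotropic Gaussian prior N(m_0, S_0).
rng = np.random.default_rng(0)
N, K, sigma = 20, 4, 0.2
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

m0 = np.zeros(K)                 # prior mean
S0 = np.eye(K)                   # prior covariance (assumed isotropic)

S0_inv = np.linalg.inv(S0)
SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)    # posterior covariance, cf. (9.55)
mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma**2)         # posterior mean, cf. (9.56)

# Predictive distribution at a test input x*.
x_star = 1.5
phi_star = np.vander(np.array([x_star]), K, increasing=True)[0]
pred_mean = phi_star @ mN
pred_var = phi_star @ SN @ phi_star + sigma**2         # parameter + noise uncertainty
print(pred_mean, pred_var)
```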
[Figure: Bayesian linear regression and posterior over functions. (a) Training data. (b) Posterior over functions, represented by the marginal uncertainties (shaded) showing the 67% and 95% predictive confidence bounds, the maximum likelihood estimate (MLE) and the MAP estimate (MAP), which is identical to the posterior mean function. (c) Samples from the posterior over functions, which are induced by the samples from the parameter posterior.]

[Figure: Bayesian linear regression with polynomial features. (a) Posterior distribution for polynomials of degree M = 3 (left) and samples from the posterior over functions (right). (c) Posterior distribution for polynomials of degree M = 7 (left) and samples from the posterior over functions (right). Shown are the training data, the MLE, the MAP estimate and the BLR predictive distribution with 67% (dark-gray) and 95% (light-gray) predictive confidence bounds. The mean of the Bayesian linear regression model coincides with the MAP estimate. The predictive uncertainty is the sum of the noise term and the posterior parameter uncertainty.]
Figure 9.12 (a) Regression dataset consisting of noisy observations y_n (blue) of function values f(x_n) at input locations x_n. (b) Maximum likelihood solution interpreted as a projection: the orange dots are the projections of the noisy observations (blue dots) onto the line θ_ML x. The maximum likelihood solution to a linear regression problem finds a subspace (line) onto which the overall projection error (orange lines) of the observations is minimized.
Linear regression can be thought of as a method for solving systems of linear equations. Therefore, we can relate to concepts from linear algebra and analytic geometry that we discussed in Chapters 2 and 3. In particular, looking carefully at (9.67) we see that the maximum likelihood estimator θ_ML in our example from (9.65) effectively does an orthogonal projection of y onto the one-dimensional subspace spanned by X. Recalling the results on orthogonal projections from Section 3.8, we identify XX^⊤/(X^⊤X) as the projection matrix, θ_ML as the coordinates of the projection onto the one-dimensional subspace of R^N spanned by X, and Xθ_ML as the orthogonal projection of y onto this subspace; maximum likelihood linear regression thus performs an orthogonal projection.
Therefore, the maximum likelihood solution also provides a geometrically optimal solution by finding the vectors in the subspace spanned by X that are "closest" to the corresponding observations y, where "closest" means the smallest (squared) distance of the function values y_n to x_nθ. This is achieved by orthogonal projections. Figure 9.12(b) shows the
projection of the noisy observations onto the subspace that minimizes the
squared distance between the original dataset and its projection (note that
the x-coordinate is fixed), which corresponds to the maximum likelihood
solution.
In the general linear regression case where
y = φ^⊤(x)θ + ε , ε ∼ N(0, σ²) ,   (9.68)
the maximum likelihood solution again performs an orthogonal projection,
y ≈ Φθ_ML ,   (9.69)
Φθ_ML = Φ(Φ^⊤Φ)^{−1} Φ^⊤ y .   (9.70)
When the basis functions are orthonormal, Φ^⊤Φ = I and this projection simplifies to ΦΦ^⊤y = Σ_k φ_k φ_k^⊤ y,
so that the coupling between different features has disappeared and the
maximum likelihood projection is simply the sum of projections of y onto
the individual basis vectors φk , i.e., the columns of Φ. Many popular basis
functions in signal processing, such as wavelets and Fourier bases, are
orthogonal basis functions. When the basis is not orthogonal, one can
convert a set of linearly independent basis functions to an orthogonal basis
by using the Gram-Schmidt process (Strang, 2003).
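A numerical sketch of this projection view (synthetic data, hypothetical polynomial features): the fitted values Φθ_ML coincide with Φ(Φ^⊤Φ)^{−1}Φ^⊤ y, and the projection matrix is idempotent.

```python
import numpy as np

# A sketch, assuming synthetic data and polynomial features: the ML fit equals
# the orthogonal projection of y onto the column space of Phi, cf. (9.70).
rng = np.random.default_rng(0)
N, K = 30, 4
x = rng.uniform(-3, 3, size=N)
y = np.cos(x) + 0.1 * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T   # projection matrix onto span(Phi)
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.allclose(P @ y, Phi @ theta_ml))      # projection equals the ML fit
print(np.allclose(P @ P, P))                   # projections are idempotent
```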
Working directly with high-dimensional data, such as images, comes with some difficulties: it is hard to analyze, interpretation is difficult, visualization is nearly impossible, and (from a practical point of view) storage of the data vectors can be expensive. (A 640 × 480 pixel color image, for instance, is a data point in a million-dimensional space, where every pixel corresponds to three dimensions, one for each color channel: red, green, blue.) However, high-dimensional data often has properties that we can exploit. For example, high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions. Furthermore, dimensions in high-dimensional data are often correlated so that the data possesses an
intrinsic lower-dimensional structure. Dimensionality reduction exploits
structure and correlation and allows us to work with a more compact rep-
resentation of the data, ideally without losing information. We can think
of dimensionality reduction as a compression technique, similar to jpeg or
mp3, which are compression algorithms for images and music.
In this chapter, we will discuss principal component analysis (PCA), an algorithm for linear dimensionality reduction. PCA, proposed by Pearson (1901) and Hotelling (1933), has been around for more than 100 years and is still one of the most commonly used techniques for data compression and data visualization. It is also used for the identification of simple patterns, latent factors and structures of high-dimensional data.
Figure 10.1 Illustration: dimensionality reduction. (a) Dataset with x1 and x2 coordinates; the original dataset does not vary much along the x2 direction. (b) Compressed dataset where only the x1 coordinate is relevant; the data from (a) can be represented using the x1-coordinate alone with nearly no loss.
In the signal processing community, PCA is also known as the Karhunen-Loève transform. In this chapter, we derive PCA from first principles, drawing on our understanding of basis and basis change (Sections 2.6.1 and 2.7.2), projections (Section 3.8), eigenvalues (Section 4.2), Gaussian distributions (Section 6.5) and constrained optimization (Section 7.2).
Dimensionality reduction generally exploits a property of high-dimen-
sional data (e.g., images) that it often lies on a low-dimensional subspace.
Figure 10.1 gives an illustrative example in two dimensions. Although
the data in Figure 10.1(a) does not quite lie on a line, the data does not
vary much in the x2 -direction, so that we can express it as if it was on
a line – with nearly no loss, see Figure 10.1(b). To describe the data in
Figure 10.1(b), only the x1 -coordinate is required, and the data lies in a
one-dimensional subspace of R2 .
Figure 10.3 Examples of handwritten digits from the MNIST dataset (http://yann.lecun.com/exdb/mnist/).

The MNIST dataset is a commonly used example; it contains 60,000 examples of handwritten digits 0–9. Each digit is a grayscale image of size 28 × 28, i.e., it contains 784 pixels, so that we can interpret every image in this dataset as a vector x ∈ R^784. Examples of these digits are shown in Figure 10.3.
i.e., the variance of the low-dimensional code does not depend on the
mean of the data. Therefore, we assume without loss of generality that the
data has mean 0 for the remainder of this section. With this assumption
the mean of the low-dimensional code is also 0 since Ez [z] = Ex [B > x] =
B > Ex [x] = 0. ♦
z_{1n} = b_1^⊤ x_n ,   (10.8)
Figure 10.5 (a) Eigenvalues of the data covariance matrix of all digits '8' in the MNIST training set, sorted in descending order; (b) Variance captured by the principal components associated with the largest eigenvalues.
Figure 10.5(a) shows the 200 largest eigenvalues of the data covariance
matrix. We see that only a few of them have a value that differs signifi-
cantly from 0. Therefore, most of the variance, when projecting data onto
the subspace spanned by the corresponding eigenvectors, is captured by
only a few principal components as shown in Figure 10.5(b).
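The following sketch reproduces this qualitative behavior on synthetic data with low intrinsic dimensionality (it does not use MNIST): the eigenvalue spectrum of the data covariance matrix decays quickly, and a few components capture most of the variance.

```python
import numpy as np

# A sketch, assuming synthetic data standing in for the MNIST '8's: eigenvalue
# spectrum of the data covariance matrix and cumulative captured variance.
rng = np.random.default_rng(0)
N, D = 500, 50
Z = rng.normal(size=(N, 5))                  # 5-dimensional latent structure
A = rng.normal(size=(5, D))
X = Z @ A + 0.1 * rng.normal(size=(N, D))    # plus small isotropic noise
X = X - X.mean(axis=0)                       # center the data

S = X.T @ X / N                              # data covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]        # eigenvalues in descending order

captured = np.cumsum(eigvals) / np.sum(eigvals)
print(eigvals[:8])
print(captured[:8])                          # a few components capture most variance
```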
Figure 10.6
Illustration of the
projection
approach: Find a
subspace (line) that
minimizes the
length of the
difference vector
between projected
(orange) and
original (blue) data.
where the λ_m are the M largest eigenvalues of the data covariance matrix S. Consequently, the variance lost by data compression via PCA is
J_M := Σ_{j=M+1}^{D} λ_j = V_D − V_M .   (10.24)
Figure 10.7 Simplified projection setting. (a) A vector x ∈ R² (red cross) shall be projected onto a one-dimensional subspace U ⊆ R² spanned by b. (b) The difference vectors between x and some candidates x̃.
x = Σ_{d=1}^{D} ζ_d b_d = Σ_{m=1}^{M} ζ_m b_m + Σ_{j=M+1}^{D} ζ_j b_j   (10.25)
Figure 10.8 Optimal projection of a vector x ∈ R² onto a one-dimensional subspace (continuation from Figure 10.7). (a) Distances ‖x − x̃‖ for some x̃ = z₁b ∈ U = span[b]. (b) The vector x̃ that minimizes the distance in panel (a) is its orthogonal projection onto U. The coordinate of the projection x̃ with respect to the basis vector b that spans U is the factor we need to scale b in order to "reach" x̃.
since b_i^⊤ b_i = 1. Setting this partial derivative to 0 yields immediately the optimal coordinates
z_in = x_n^⊤ b_i = b_i^⊤ x_n   (10.31)
for i = 1, . . . , M and n = 1, . . . , N. This means that the optimal coordinates z_in of the projection x̃_n are the coordinates of the orthogonal projection (see Section 3.8) of the original data point x_n onto the one-dimensional subspace that is spanned by b_i. Consequently:
The optimal linear projection x̃_n of x_n is an orthogonal projection.
The coordinates of x̃_n with respect to the basis (b_1, . . . , b_M) are the coordinates of the orthogonal projection of x_n onto the principal subspace.
An orthogonal projection is the best linear mapping given the objective (10.28).
The coordinates ζ_m of x in (10.25) and the coordinates z_m of x̃ in (10.26) must be identical for m = 1, . . . , M since U^⊥ = span[b_{M+1}, . . . , b_D] is the orthogonal complement (see Section 3.6) of U = span[b_1, . . . , b_M].
Remark (Orthogonal Projections with Orthonormal Basis Vectors). Let us briefly recap orthogonal projections from Section 3.8. If (b_1, . . . , b_D) is an orthonormal basis of R^D then
x̃ = b_j (b_j^⊤ b_j)^{−1} b_j^⊤ x = b_j b_j^⊤ x ∈ R^D   (10.32)
is the orthogonal projection of x onto the subspace spanned by the jth basis vector, and z_j = b_j^⊤ x is the coordinate of this projection with respect to the basis vector b_j that spans that subspace, since z_j b_j = x̃. Figure 10.8 illustrates this setting.
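A tiny numerical check of (10.32): projecting a vector onto the span of a unit vector b via the rank-one matrix b b^⊤, and recovering the coordinate z = b^⊤x. The vectors below are arbitrary assumptions.

```python
import numpy as np

# A sketch of (10.32), assuming arbitrary vectors b and x in R^2.
b = np.array([2.0, 1.0])
b = b / np.linalg.norm(b)        # orthonormal basis vector of a 1-D subspace
x = np.array([1.0, 2.0])

x_proj = np.outer(b, b) @ x      # orthogonal projection b b^T x
z = b @ x                        # coordinate of the projection w.r.t. b

print(np.allclose(x_proj, z * b))        # z * b reproduces the projection
print(np.isclose(b @ (x - x_proj), 0.0)) # residual is orthogonal to b
```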
Since we can generally write the original data point xn as a linear combi-
nation of all basis vectors, it holds that
x_n = Σ_{d=1}^{D} z_{dn} b_d = Σ_{d=1}^{D} (x_n^⊤ b_d) b_d = ( Σ_{d=1}^{D} b_d b_d^⊤ ) x_n   (10.36a)
= ( Σ_{m=1}^{M} b_m b_m^⊤ ) x_n + ( Σ_{j=M+1}^{D} b_j b_j^⊤ ) x_n ,   (10.36b)
where the second equality uses (10.31), and where we split the sum with D terms into a sum over M and a sum over D − M terms.
Figure 10.9 Orthogonal projection and displacement vectors. When projecting data points x_n (blue) onto subspace U1 we obtain x̃_n (orange). The displacement vector x̃_n − x_n lies completely in the orthogonal complement U2 of U1.

With this result, we find that the displacement vector x_n − x̃_n, i.e., the difference vector between the original data point and its projection, is
x_n − x̃_n = ( Σ_{j=M+1}^{D} b_j b_j^⊤ ) x_n   (10.37a)
= Σ_{j=M+1}^{D} (x_n^⊤ b_j) b_j .   (10.37b)
This means the difference is exactly the projection of the data point onto the orthogonal complement of the principal subspace: We identify the matrix Σ_{j=M+1}^{D} b_j b_j^⊤ in (10.37a) as the projection matrix that performs this projection. Hence, the displacement vector x_n − x̃_n lies in the subspace that is orthogonal to the principal subspace, as illustrated in Figure 10.9.
Remark (Low-Rank Approximation). In (10.37a), we saw that the projection matrix, which projects x onto x̃, is given by
Σ_{m=1}^{M} b_m b_m^⊤ = BB^⊤ .   (10.38)
Now, we have all the tools to reformulate the loss function (10.28).
J_M = (1/N) Σ_{n=1}^{N} ‖x_n − x̃_n‖² = (1/N) Σ_{n=1}^{N} ‖ Σ_{j=M+1}^{D} (b_j^⊤ x_n) b_j ‖² ,   (10.40)
where we used (10.37b). We now explicitly compute the squared norm and exploit the fact that the b_j form an ONB, which yields
J_M = (1/N) Σ_{n=1}^{N} Σ_{j=M+1}^{D} (b_j^⊤ x_n)² = (1/N) Σ_{n=1}^{N} Σ_{j=M+1}^{D} b_j^⊤ x_n b_j^⊤ x_n   (10.41a)
= (1/N) Σ_{n=1}^{N} Σ_{j=M+1}^{D} b_j^⊤ x_n x_n^⊤ b_j ,   (10.41b)
where we exploited the symmetry of the dot product in the last step to write b_j^⊤ x_n = x_n^⊤ b_j. We now swap the sums and obtain
J_M = Σ_{j=M+1}^{D} b_j^⊤ ( (1/N) Σ_{n=1}^{N} x_n x_n^⊤ ) b_j =: Σ_{j=M+1}^{D} b_j^⊤ S b_j   (10.42a)
= Σ_{j=M+1}^{D} tr(b_j^⊤ S b_j) = Σ_{j=M+1}^{D} tr(S b_j b_j^⊤) = tr( ( Σ_{j=M+1}^{D} b_j b_j^⊤ ) S ) ,   (10.42b)
where we exploited the property that the trace operator tr(·), see (4.18), is linear and invariant to cyclic permutations of its arguments. Since we assumed that our dataset is centered, i.e., E[X] = 0, we identify S as the data covariance matrix. Since the projection matrix in (10.42b) is constructed as a sum of rank-one matrices b_j b_j^⊤, it itself is of rank D − M.
Equation (10.42a) implies that we can formulate the average squared reconstruction error equivalently as the covariance matrix of the data, projected onto the orthogonal complement of the principal subspace. Minimizing the average squared reconstruction error is therefore equivalent to minimizing the variance of the data when projected onto the subspace we ignore, i.e., the orthogonal complement of the principal subspace. Equivalently, we maximize the variance of the projection that we retain in the principal subspace, which links the projection loss immediately to the maximum-variance formulation of PCA discussed in Section 10.2. But this then also means that we will obtain the same solution that we obtained for the maximum-variance perspective. Therefore, we omit a derivation that is identical to the one in Section 10.2 and summarize the results from earlier in the light of the projection perspective.
The average squared reconstruction error, when projecting onto the M-dimensional principal subspace, is
J_M = Σ_{j=M+1}^{D} λ_j ,   (10.43)
Figure 10.10 Embedding of MNIST digits '0' (blue) and '1' (orange) in a two-dimensional principal subspace using PCA. Four embeddings of the digits '0' and '1' in the principal subspace are highlighted in red with their corresponding original digit.
Figure 10.10 visualizes the training data of the MNIST digits '0' and '1'
embedded in the vector subspace spanned by the first two principal com-
ponents. We observe a relatively clear separation between ‘0’s (blue dots)
and ‘1’s (orange dots), and we see the variation within each individual
cluster. Four embeddings of the digits ‘0’ and ‘1’ in the principal subspace
are highlighted in red with their corresponding original digit. The figure
reveals that the variation within the set of ‘0’ is significantly greater than
the variation within the set of ‘1’.
where U ∈ R^{D×D} and V ∈ R^{N×N} are orthogonal matrices and Σ ∈ R^{D×N} is a matrix whose only non-zero entries are the singular values σ_ii > 0. It then follows that
S = (1/N) X X^⊤ = (1/N) U Σ V^⊤ V Σ^⊤ U^⊤ = (1/N) U Σ Σ^⊤ U^⊤ ,   (10.47)
since V^⊤ V = I_N. With the results from Section 4.5 we get that the columns of U are the eigenvectors of X X^⊤ (and therefore S). Furthermore, the eigenvalues λ_d of S are related to the singular values of X via
λ_d = σ_d² / N .   (10.48)
This relationship between the eigenvalues of S and the singular values
of X provides the connection between the maximum variance view (Sec-
tion 10.2) and the singular value decomposition.
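The relationship (10.48) is easy to verify numerically; in the sketch below, X is a D × N matrix of random (centered) data with data points as columns, matching the convention used here.

```python
import numpy as np

# A sketch, assuming random centered data X with data points as columns:
# the eigenvalues of S = X X^T / N equal sigma_d^2 / N, cf. (10.48).
rng = np.random.default_rng(0)
D, N = 6, 40
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)        # center the data

S = X @ X.T / N
eig_S = np.sort(np.linalg.eigvalsh(S))[::-1]

U, sing_vals, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(eig_S, sing_vals**2 / N))  # lambda_d = sigma_d^2 / N
```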
and we recover the data covariance matrix again. This now also means that we recover Xc_m as an eigenvector of S.
Remark. If we want to apply the PCA algorithm that we discussed in Section 10.6, we need to normalize the eigenvectors Xc_m of S so that they have norm 1. ♦
[Figure: Steps of PCA. (a) Original dataset. (b) Step 1: Centering by subtracting the mean from each data point. (c) Step 2: Dividing by the standard deviation to make the data unit free; the data has variance 1 along each axis. (d) Step 3: Compute eigenvalues and eigenvectors (arrows) of the data covariance matrix (ellipse). (e) Step 4: Project data onto the principal subspace. (f) Undo the standardization and move projected data back into the original data space from (a).]
where x_*^{(d)} is the dth component of x_*. We obtain the projection of the standardized data point as x̃_* = B B^⊤ x_*, with coordinates
z_* = B^⊤ x_*   (10.59)
with respect to the basis of the principal subspace. Here, B is the matrix that contains the eigenvectors that are associated with the largest eigenvalues of the data covariance matrix as columns. PCA returns the coordinates (10.59), not the projections x_*. Finally, to undo the standardization we apply
x̃_*^{(d)} ← x̃_*^{(d)} σ_d + µ_d , d = 1, . . . , D .   (10.60)
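Putting the steps together, here is a compact sketch of the PCA pipeline on synthetic data: standardize, eigendecompose the covariance, project onto the M-dimensional principal subspace, and undo the standardization as in (10.60).

```python
import numpy as np

# A sketch of the PCA steps described above, assuming synthetic data.
rng = np.random.default_rng(0)
N, D, M = 200, 5, 2                              # M-dimensional principal subspace
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, D)) + rng.normal(size=D)

mu = X.mean(axis=0)                              # step 1: centering
std = X.std(axis=0)                              # step 2: divide by standard deviation
X_std = (X - mu) / std

S = X_std.T @ X_std / N                          # step 3: covariance and eigenvectors
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
B = eigvecs[:, order[:M]]                        # eigenvectors of the M largest eigenvalues

Z = X_std @ B                                    # step 4: coordinates z* = B^T x* (as rows)
X_tilde_std = Z @ B.T                            # projection onto the principal subspace

X_tilde = X_tilde_std * std + mu                 # undo the standardization, cf. (10.60)
print(np.mean((X - X_tilde) ** 2))               # average squared reconstruction error
```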
[Figure: Reconstructions of the digit '8' with an increasing number of principal components (PCs: 10, PCs: 100, PCs: 500).]

The importance of the principal components drops off rapidly, and only marginal gains can be achieved by adding more PCs. This matches exactly our observation in Figure 10.5, where we discovered that most of the variance of the projected data is captured by only a few principal components. With about 550 PCs, we can essentially fully reconstruct the training data that contains the digit '8' (some pixels around the boundaries show no variation across the dataset as they are always black).
Figure 10.13 Average squared reconstruction error as a function of the number of principal components. The average squared reconstruction error is the sum of the eigenvalues in the orthogonal complement of the principal subspace.
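The statement in the caption can be turned into a small utility: for each number $M$ of principal components, the average squared reconstruction error equals the sum of the discarded eigenvalues. The sketch below assumes a centered $N \times D$ data matrix; names are illustrative.

```python
import numpy as np

# Average squared reconstruction error for every choice of M, as the sum of the
# eigenvalues in the orthogonal complement of the principal subspace.
def reconstruction_errors(X):
    S = X.T @ X / len(X)                              # data covariance matrix
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]        # eigenvalues, descending
    return np.array([lam[M:].sum() for M in range(len(lam) + 1)])
```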
Figure 10.14 Graphical model for probabilistic PCA. The observations $x_n$ explicitly depend on corresponding latent variables $z_n \sim \mathcal{N}(0, I)$. The model parameters $B, \mu$ and the likelihood parameter $\sigma$ are shared across the dataset.

By introducing a continuous-valued latent variable $z \in \mathbb{R}^M$ it is possible to phrase PCA as a probabilistic latent-variable model. Tipping and Bishop (1999) proposed this latent-variable model as Probabilistic PCA (PPCA). PPCA addresses most of the issues above, and the PCA solution that we obtained by maximizing the variance in the projected space or by minimizing the reconstruction error is obtained as the special case of maximum likelihood estimation in a noise-free setting.
Remark. Note the direction of the arrow that connects the latent variables
z and the observed data x: The arrow points from z to x, which means
that the PPCA model assumes a lower-dimensional latent cause z for high-
dimensional observations x. In the end, we are obviously interested in
finding something out about z given some observations. To get there we
will apply Bayesian inference to “invert” the arrow implicitly and go from
observations to latent variables. ♦
Figure 10.15
Generating new
MNIST digits. The
latent variables z
can be used to
generate new data
x̃ = Bz. The closer
we stay to the
training data the
more realistic the
generated data.
Figure 10.15 shows the latent coordinates of the MNIST digits ‘8’ found
by PCA when using a two-dimensional principal subspace (blue dots). We
can query any vector z ∗ in this latent space and generate an image x̃∗ =
$Bz_*$ that resembles the digit '8'. We show eight such generated images with their corresponding latent-space representations. Depending on where we query the latent space, the generated images look different (shape, rotation, size, ...). If we query away from the training data, we see more and more artefacts, e.g., the top-left and top-right digits. Note that the intrinsic
dimensionality of these generated images is only two.
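A hypothetical sketch of this generative use of PCA, following $\tilde{x}_* = Bz_*$; the matrix $B$ below is only a stand-in, whereas in the experiment it would contain the two leading eigenvectors obtained from PCA on the MNIST '8' digits.

```python
import numpy as np

# Generate images from latent queries: x~_* = B z_*.
rng = np.random.default_rng(2)
D = 28 * 28
B = np.linalg.qr(rng.normal(size=(D, 2)))[0]   # stand-in orthonormal basis (D x 2)

def generate(z_star):
    """Map a latent query z_star (shape (2,)) to an image x~_* = B z_*."""
    return (B @ z_star).reshape(28, 28)

img_near = generate(np.array([0.5, -0.5]))     # close to the training embeddings
img_far = generate(np.array([8.0, 8.0]))       # far away: expect more artefacts
```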
so that
$$p(x \mid B, \mu, \sigma^2) = \int p(x \mid z, \mu, \sigma^2)\, p(z)\, \mathrm{d}z \qquad (10.67a)$$
$$= \int \mathcal{N}\big(x \mid Bz + \mu, \sigma^2 I\big)\, \mathcal{N}\big(z \mid 0, I\big)\, \mathrm{d}z\,. \qquad (10.67b)$$
From Section 6.5, we know that the solution to this integral is a Gaussian distribution with mean
$$\mathbb{E}_x[x] = \mathbb{E}_z[Bz + \mu] + \mathbb{E}_\epsilon[\epsilon] = \mu \qquad (10.68)$$
and with covariance matrix
$$\mathbb{V}[x] = \mathbb{V}_z[Bz + \mu] + \mathbb{V}_\epsilon[\epsilon] = \mathbb{V}_z[Bz] + \sigma^2 I \qquad (10.69a)$$
$$= B\,\mathbb{V}_z[z]\,B^\top + \sigma^2 I = BB^\top + \sigma^2 I\,. \qquad (10.69b)$$
The likelihood in (10.67b) can be used for maximum likelihood or MAP
estimation of the model parameters.
Remark. We cannot use the conditional distribution in (10.63) for maxi-
mum likelihood estimation as it still depends on the latent variables. The
likelihood function we require for maximum likelihood (or MAP) estima-
tion should only be a function of the data x and the model parameters,
but must not depend on the latent variables. ♦
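Since the marginal likelihood is the Gaussian $\mathcal{N}(x \mid \mu, BB^\top + \sigma^2 I)$, evaluating it for a dataset is short to write down. A minimal sketch (illustrative names; data stored row-wise):

```python
import numpy as np
from scipy.stats import multivariate_normal

# PPCA marginal log-likelihood, cf. (10.67b)-(10.69b): x ~ N(mu, B B^T + sigma^2 I).
def ppca_log_likelihood(X, B, mu, sigma2):
    D = X.shape[1]
    cov = B @ B.T + sigma2 * np.eye(D)
    return multivariate_normal(mean=mu, cov=cov).logpdf(X).sum()
```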
From Section 6.5 we know that a Gaussian random variable $z$ and a linear/affine transformation $x = Bz$ of it are jointly Gaussian distributed. We already know the marginals $p(z) = \mathcal{N}(z \mid 0, I)$ and $p(x) = \mathcal{N}(x \mid \mu, BB^\top + \sigma^2 I)$. The missing cross-covariance is given as
Note that the posterior covariance does not depend on the observed data
x. For a new observation x∗ in data space, we use (10.72) to determine
the posterior distribution of the corresponding latent variable z ∗ . The co-
variance matrix C allows us to assess how confident the embedding is. A
covariance matrix C with a small determinant (which measures volumes)
tells us that the latent embedding z ∗ is fairly certain. If we obtain a pos-
terior distribution $p(z_* \mid x_*)$ with large variance, we may be faced with
an outlier. However, we can explore this posterior distribution to under-
stand what other data points x are plausible under this posterior. To do
this, we exploit the generative process underlying PPCA, which allows us
to explore the posterior distribution on the latent variables by generating
new data that are plausible under this posterior:
If we repeat this process many times, we can explore the posterior dis-
tribution (10.72) on the latent variables z ∗ and its implications on the
observed data. The sampling process effectively hypothesizes data that are plausible under the posterior distribution.
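Because (10.72) is not reproduced on this page, the sketch below uses the equivalent Gaussian-conditioning form of the PPCA posterior, $p(z \mid x) = \mathcal{N}\big(z \mid B^\top C_x^{-1}(x-\mu),\, I - B^\top C_x^{-1}B\big)$ with $C_x = BB^\top + \sigma^2 I$, which follows from the joint Gaussian of $x$ and $z$ discussed above. All names are illustrative.

```python
import numpy as np

# Posterior over the latent variable and posterior sampling of plausible data.
def latent_posterior(x_star, B, mu, sigma2):
    D, M = B.shape
    Cx_inv = np.linalg.inv(B @ B.T + sigma2 * np.eye(D))
    m = B.T @ Cx_inv @ (x_star - mu)           # posterior mean of z_*
    C = np.eye(M) - B.T @ Cx_inv @ B           # posterior covariance (independent of x_*)
    return m, C

def sample_plausible_data(x_star, B, mu, sigma2, num_samples=5, rng=None):
    """Sample z ~ p(z | x_*) and push it through the generative model."""
    rng = rng or np.random.default_rng()
    m, C = latent_posterior(x_star, B, mu, sigma2)
    Z = rng.multivariate_normal(m, C, size=num_samples)
    return Z @ B.T + mu                        # means of p(x | z); add N(0, sigma^2 I) noise if desired
```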
where $T \in \mathbb{R}^{D\times M}$ contains $M$ eigenvectors of the data covariance matrix, $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^{M\times M}$ is a diagonal matrix with the eigenvalues associated with the principal axes on its diagonal, and $R \in \mathbb{R}^{M\times M}$ is an arbitrary orthogonal matrix. The maximum likelihood solution $B_{\text{ML}}$ is unique up to an arbitrary orthogonal transformation, e.g., we can right-multiply $B_{\text{ML}}$ with any rotation matrix $R$ so that (10.77) essentially is a singular value decomposition (see Section 4.5). An outline of the proof is given by Tipping and Bishop (1999). (The matrix $\Lambda - \sigma^2 I$ in (10.77) is guaranteed to be positive semidefinite, as the smallest eigenvalue of the data covariance matrix is bounded from below by the noise variance $\sigma^2$.)
The maximum likelihood estimate for $\mu$ given in (10.76) is the sample mean of the data. The maximum likelihood estimator for the observation noise variance $\sigma^2$ given in (10.78) is the average variance in the orthogonal complement of the principal subspace, i.e., the average leftover variance that we cannot capture with the first $M$ principal components is treated as observation noise.
In the noise-free limit where $\sigma \to 0$, PPCA and PCA provide identical solutions: Since the data covariance matrix $S$ is symmetric, it can be diagonalized (see Section 4.4), i.e., there exists a matrix $T$ of eigenvectors of $S$ so that
$$S = T\Lambda T^{-1}\,. \qquad (10.79)$$
In the PPCA model, the data covariance matrix is the covariance matrix of the Gaussian likelihood $p(x \mid B, \mu, \sigma^2)$, which is $BB^\top + \sigma^2 I$, see (10.69b). For $\sigma \to 0$, we obtain $BB^\top$ so that this data covariance must equal the PCA data covariance (and its factorization given in (10.79)) so that
$$\operatorname{Cov}[\mathcal{X}] = T\Lambda T^{-1} = BB^\top \iff B = T\Lambda^{\frac{1}{2}} R\,, \qquad (10.80)$$
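A hedged sketch of this construction: given a centered data matrix, we take the $M$ leading eigenpairs of $S$ and form $B = T\Lambda^{1/2}R$ (noise-free case); for $\sigma^2 > 0$, (10.77) would use $(\Lambda - \sigma^2 I)^{1/2}$ instead. Names are illustrative.

```python
import numpy as np

# Noise-free maximum likelihood PPCA factor matrix, cf. (10.80). X is N x D, centered.
def ppca_ml_noise_free(X, M):
    S = X.T @ X / len(X)
    lam, T = np.linalg.eigh(S)
    idx = np.argsort(lam)[::-1][:M]
    lam, T = lam[idx], T[:, idx]                # M leading eigenpairs
    R = np.eye(M)                               # any orthogonal R gives the same B B^T
    # For sigma^2 > 0, (10.77) uses (Lambda - sigma^2 I)^{1/2} instead of Lambda^{1/2}.
    return T @ np.diag(np.sqrt(lam)) @ R
```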
11 Density Estimation with Gaussian Mixture Models
Figure 11.1 Two-dimensional dataset that cannot be meaningfully represented by a Gaussian.
11.1 Gaussian Mixture Model
$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\big(x \mid \mu_k, \Sigma_k\big) \qquad (11.3)$$
$$0 \leq \pi_k \leq 1\,, \qquad \sum_{k=1}^{K} \pi_k = 1\,, \qquad (11.4)$$
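A minimal sketch of evaluating the mixture density (11.3); the parameter values below are purely illustrative and are only meant to satisfy the constraints (11.4).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate a GMM density p(x | theta) = sum_k pi_k N(x | mu_k, Sigma_k).
def gmm_density(x, pi, mus, Sigmas):
    return sum(pi_k * multivariate_normal(mean=mu_k, cov=S_k).pdf(x)
               for pi_k, mu_k, S_k in zip(pi, mus, Sigmas))

# One-dimensional example with three components (illustrative values).
pi = [1 / 3, 1 / 3, 1 / 3]
mus, Sigmas = [[-4.0], [0.0], [8.0]], [[[1.0]], [[0.2]], [[3.0]]]
print(gmm_density([1.0], pi, mus, Sigmas))
```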
[Figure: the Gaussian mixture density $p(x)$ is a convex combination of Gaussian distributions and is more expressive than any individual component. Dashed lines represent the weighted Gaussian components.]
This simple form allows us to find closed-form maximum likelihood esti-
mates of µ and Σ, as discussed in Chapter 8. In (11.10), we cannot move
the log into the sum over k so that we cannot obtain a simple closed-form
maximum likelihood solution. ♦
Any local optimum of a function exhibits the property that its gradi-
ent with respect to the parameters must vanish (necessary condition), see
Chapter 7. In our case, we obtain the following necessary conditions when
we optimize the log-likelihood in (11.10) with respect to the GMM param-
eters µk , Σk , πk :
$$\frac{\partial \mathcal{L}}{\partial \mu_k} = \mathbf{0}^\top \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \mathbf{0}^\top\,, \qquad (11.12)$$
$$\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \mathbf{0} \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \mathbf{0}\,, \qquad (11.13)$$
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \pi_k} = 0\,. \qquad (11.14)$$
For all three necessary conditions, by applying the chain rule (see Sec-
tion 5.2.2), we require partial derivatives of the form
$$\frac{\partial \log p(x_n \mid \theta)}{\partial \theta} = \frac{1}{p(x_n \mid \theta)}\,\frac{\partial p(x_n \mid \theta)}{\partial \theta}\,, \qquad (11.15)$$
where $\theta = \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\}$ are the model parameters and
$$\frac{1}{p(x_n \mid \theta)} = \frac{1}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)}\,. \qquad (11.16)$$
11.2.1 Responsibilities
We define the quantity
$$r_{nk} := \frac{\pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)} \qquad (11.17)$$
as the responsibility of the $k$th mixture component for the $n$th data point. The responsibility $r_{nk}$ of the $k$th mixture component for data point $x_n$ is proportional to the likelihood
$$p(x_n \mid \pi_k, \mu_k, \Sigma_k) = \pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big) \qquad (11.18)$$
of the mixture component given the data point. Therefore, mixture components have a high responsibility for a data point when the data point could be a plausible sample from that mixture component. Note that $r_n := [r_{n1}, \ldots, r_{nK}]^\top \in \mathbb{R}^K$ is a (normalized) probability vector, i.e., $\sum_k r_{nk} = 1$ with $r_{nk} \geq 0$. ($r_n$ follows a Boltzmann/Gibbs distribution.)
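Computing all responsibilities at once is a small amount of code. The sketch below (illustrative names; data stored as an $N \times D$ array) evaluates the unnormalized terms (11.18) and normalizes each row as in (11.17).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Responsibilities r_nk for all data points and all K mixture components.
def responsibilities(X, pi, mus, Sigmas):
    # unnormalized: r_nk proportional to pi_k N(x_n | mu_k, Sigma_k), cf. (11.18)
    R = np.stack([pi_k * multivariate_normal(mean=mu_k, cov=S_k).pdf(X)
                  for pi_k, mu_k, S_k in zip(pi, mus, Sigmas)], axis=1)   # N x K
    return R / R.sum(axis=1, keepdims=True)    # each row r_n is a probability vector
```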
Here, we used the identity from (11.16) and the result of the partial
derivative in (11.21b) to get to (11.22b). The values rnk are the responsi-
bilities we defined in (11.17).
We now solve (11.22c) for $\mu_k^{\text{new}}$ so that $\frac{\partial \mathcal{L}(\mu_k^{\text{new}})}{\partial \mu_k} = \mathbf{0}^\top$ and obtain
$$\sum_{n=1}^{N} r_{nk} x_n = \sum_{n=1}^{N} r_{nk}\, \mu_k^{\text{new}} \iff \mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk} x_n\,, \qquad (11.23)$$
where we defined
$$N_k := \sum_{n=1}^{N} r_{nk} \qquad (11.24)$$
as the total responsibility of the $k$th mixture component for the entire dataset.
Therefore, the mean $\mu_k$ is pulled toward a data point $x_n$ with strength given by $r_{nk}$. The means are pulled stronger toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood. Figure 11.4 illustrates this. We can also interpret the mean update in (11.20) as the expected value of all data points under the distribution given by
$$r_k := [r_{1k}, \ldots, r_{Nk}]^\top / N_k\,, \qquad (11.25)$$
which is a normalized probability vector, i.e.,
$$\mu_k \leftarrow \mathbb{E}_{r_k}[\mathcal{X}]\,. \qquad (11.26)$$

Figure 11.4 Update of the mean parameter of a mixture component in a GMM. The mean $\mu$ is being pulled toward individual data points with the weights given by the corresponding responsibilities.

Example 11.3 (Mean Updates)
In our example from Figure 11.3, the mean values are updated as fol-
lows:
µ1 : −4 → −2.7 (11.27)
µ2 : 0 → −0.4 (11.28)
µ3 : 8 → 3.7 (11.29)
Here, we see that the means of the first and third mixture components
move toward the regime of the data, whereas the mean of the second
component does not change so dramatically. Figure 11.5 illustrates this
change, where Figure 11.5(a) shows the GMM density prior to updating
the means and Figure 11.5(b) shows the GMM density after updating the
mean values µk .
and obtain (after some rearranging) the desired partial derivative required in (11.31) as
$$\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} = \pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)\cdot\left[-\tfrac{1}{2}\Big(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\Big)\right]. \qquad (11.35)$$
Using (11.15) and (11.16) together with (11.35), the partial derivative of the log-likelihood with respect to $\Sigma_k$ becomes
$$\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)}\cdot\left[-\tfrac{1}{2}\Big(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\Big)\right] \qquad (11.36b)$$
$$= -\frac{1}{2}\sum_{n=1}^{N} r_{nk}\Big(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\Big) \qquad (11.36c)$$
$$= -\frac{1}{2}\Sigma_k^{-1}\underbrace{\sum_{n=1}^{N} r_{nk}}_{=N_k} + \frac{1}{2}\Sigma_k^{-1}\left(\sum_{n=1}^{N} r_{nk}(x_n - \mu_k)(x_n - \mu_k)^\top\right)\Sigma_k^{-1}\,. \qquad (11.36d)$$
We see that the responsibilities rnk also appear in this partial derivative.
Setting this partial derivative to 0, we obtain the necessary optimality
condition
$$N_k \Sigma_k^{-1} = \Sigma_k^{-1}\left(\sum_{n=1}^{N} r_{nk}(x_n - \mu_k)(x_n - \mu_k)^\top\right)\Sigma_k^{-1} \qquad (11.37a)$$
$$\iff N_k I = \left(\sum_{n=1}^{N} r_{nk}(x_n - \mu_k)(x_n - \mu_k)^\top\right)\Sigma_k^{-1} \qquad (11.37b)$$
Here, we see that the variances of the first and third component shrink significantly, while the variance of the second component increases slightly.
Figure 11.6 illustrates this setting. Figure 11.6(a) is identical to Figure 11.5(b) (but zoomed in) and shows the GMM density and its individual components prior to updating the variances. Figure 11.6(b) shows the GMM density after updating the variances.
Figure 11.6 Effect of updating the variances in a GMM. (a) GMM density and individual components prior to updating the variances; (b) GMM density and individual components after updating the variances while retaining the means and mixture weights.
where $\mathcal{L}$ is the log-likelihood from (11.10) and the second term encodes the equality constraint that all the mixture weights need to sum up to 1. We obtain the partial derivative with respect to $\pi_k$ as
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)} + \lambda \qquad (11.44a)$$
$$= \frac{1}{\pi_k}\underbrace{\sum_{n=1}^{N} \frac{\pi_k\, \mathcal{N}\big(x_n \mid \mu_k, \Sigma_k\big)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_n \mid \mu_j, \Sigma_j\big)}}_{=N_k} + \lambda = \frac{N_k}{\pi_k} + \lambda\,, \qquad (11.44b)$$
We can identify the mixture weight in (11.42) as the ratio of the total responsibility of the $k$th cluster and the number of data points. Since $N = \sum_k N_k$, the number of data points can also be interpreted as the total responsibility of all mixture components together, such that $\pi_k$ is the relative importance of the $k$th mixture component for the dataset.
Remark. Since $N_k = \sum_{n=1}^{N} r_{nk}$, the update equation (11.42) for the mixture weights $\pi_k$ also depends on all $\pi_j, \mu_j, \Sigma_j$, $j = 1, \ldots, K$, via the responsibilities $r_{nk}$. ♦
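Collecting the updates derived in this section, one full update of all GMM parameters can be sketched as follows (illustrative names; data stored as an $N \times D$ array). Section 11.3 turns these coupled updates into an iterative algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

# One full update of the GMM parameters, collecting (11.17), (11.23), (11.30), (11.42).
def update_gmm_parameters(X, pi, mus, Sigmas):
    N, K = len(X), len(pi)
    # responsibilities, cf. (11.17)
    R = np.stack([pi[k] * multivariate_normal(mean=mus[k], cov=Sigmas[k]).pdf(X)
                  for k in range(K)], axis=1)
    R /= R.sum(axis=1, keepdims=True)
    Nk = R.sum(axis=0)                                   # total responsibilities (11.24)

    new_mus = [R[:, k] @ X / Nk[k] for k in range(K)]    # mean updates (11.23)
    new_Sigmas = [(R[:, k, None] * (X - new_mus[k])).T @ (X - new_mus[k]) / Nk[k]
                  for k in range(K)]                     # covariance updates (11.30)
    new_pi = Nk / N                                      # mixture weight updates (11.42)
    return new_pi, new_mus, new_Sigmas
```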
[Figure: (a) GMM before updating the mixture weights; (b) GMM after updating the mixture weights.]
11.3 EM Algorithm
Unfortunately, the updates in (11.20), (11.30), and (11.42) do not consti-
tute a closed-form solution for the updates of the parameters µk , Σk , πk
of the mixture model because the responsibilities rnk depend on those pa-
rameters in a complex way. However, the results suggest a simple iterative
scheme for finding a solution to the parameter estimation problem via maximum likelihood: the Expectation Maximization (EM) algorithm.
Figure 11.8 EM algorithm applied to the GMM from Figure 11.2. (a) Final GMM fit: after five iterations, the EM algorithm converges and returns this GMM. (b) Negative log-likelihood as a function of the EM iterations.
Figure 11.9 Illustration of the EM algorithm for fitting a Gaussian mixture model with three components to a two-dimensional dataset. (a) Dataset; (b) Negative log-likelihood (lower is better) as a function of the EM iterations. The red dots indicate the iterations for which the mixture components of the corresponding GMM fits are shown in (c)–(f). The yellow discs indicate the means of the Gaussian mixture components. Figure 11.10(a) shows the final GMM fit. (c) EM initialization. (d) EM after 1 iteration.
When we run EM on our example from Figure 11.3, we obtain the final
result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b)
shows how the negative log-likelihood evolves as a function of the EM
iterations. The final GMM is given as
$$p(x) = 0.29\,\mathcal{N}\big(x \mid -2.75,\, 0.06\big) + 0.28\,\mathcal{N}\big(x \mid -0.50,\, 0.25\big) + 0.43\,\mathcal{N}\big(x \mid 3.64,\, 1.63\big)\,. \qquad (11.57)$$
Figure 11.10 (a) GMM fit after 62 iterations, when EM converges; (b) dataset colored according to the responsibilities of the mixture components.
the corresponding final GMM fit. Figure 11.10(b) visualizes the final re-
sponsibilities of the mixture components for the data points. The dataset is
colored according to the responsibilities of the mixture components when
EM converges. While a single mixture component is clearly responsible
for the data on the left, the overlap of the two data clusters on the right
could have been generated by two mixture components. It becomes clear
that there are data points that cannot be uniquely assigned to a single
(either blue or yellow) component, such that the responsibilities of these
two clusters for those points are around 0.5.
so that
$$p(x \mid z_k = 1) = \mathcal{N}\big(x \mid \mu_k, \Sigma_k\big)\,. \qquad (11.58)$$
$$\pi_k = p(z_k = 1) \qquad (11.60)$$
for $k = 1, \ldots, K$, so that
$$p(x, z) = \begin{bmatrix} p(x, z_1 = 1) \\ \vdots \\ p(x, z_K = 1)\end{bmatrix} = \begin{bmatrix} \pi_1\, \mathcal{N}\big(x \mid \mu_1, \Sigma_1\big) \\ \vdots \\ \pi_K\, \mathcal{N}\big(x \mid \mu_K, \Sigma_K\big)\end{bmatrix}, \qquad (11.62)$$
which fully specifies the probabilistic model.
11.4.2 Likelihood
To obtain the likelihood $p(x \mid \theta)$ in a latent-variable model, we need to marginalize out the latent variables (see Section 8.3.3). In our case, this can be done by summing out all latent variables from the joint $p(x, z)$ in (11.62) so that
$$p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta)\,, \qquad \theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\}\,. \qquad (11.63)$$
We now explicitly condition on the parameters $\theta$ of the probabilistic model, which we previously omitted. In (11.63), we sum over all $K$ possible one-hot encodings of $z$, which is denoted by $\sum_z$. Since there is only a single non-zero entry in each $z$, there are only $K$ possible configurations/settings of $z$. For example, if $K = 3$ then $z$ can have the configurations
$$\begin{bmatrix}1\\0\\0\end{bmatrix}, \quad \begin{bmatrix}0\\1\\0\end{bmatrix}, \quad \begin{bmatrix}0\\0\\1\end{bmatrix}. \qquad (11.64)$$
Summing over all possible configurations of $z$ in (11.63) is equivalent to looking at the non-zero entry of the $z$-vector and writing
$$p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta) \qquad (11.65a)$$
$$= \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta) \qquad (11.65b)$$
Figure 11.12 Graphical model for a GMM with N data points.
which is exactly the GMM likelihood from (11.9). Therefore, the latent
variable model with latent indicators zk is an equivalent way of thinking
about a Gaussian mixture model.
where the expectation of log p(x, z | θ) is taken with respect to the poste-
rior p(z | x, θ (t) ) of the latent variables. The M-step selects an updated set
of model parameters θ (t+1) by maximizing (11.73b).
Although an EM iteration does increase the log-likelihood, there are
no guarantees that EM converges to the maximum likelihood solution.
It is possible that the EM algorithm converges to a local maximum of
the log-likelihood. Different initializations of the parameters θ could be
used in multiple EM runs to reduce the risk of ending up in a bad local
optimum. We do not go into further details here, but refer to the excellent
expositions by Rogers and Girolami (2016) and Bishop (2006).
Figure 11.13 The kernel density estimator produces a smooth estimate of the underlying density, whereas the histogram is an unsmoothed count measure of how many data points (black) fall into a single bin.

In this chapter, we discussed mixture models for density estimation.
There is a plethora of density estimation techniques available. In practice, we often use histograms and kernel density estimation.
Histograms provide a non-parametric way to represent continuous densities and were proposed by Pearson (1895). A histogram is constructed by "binning" the data space and counting how many data points fall into each bin. Then a bar is drawn at the center of each bin, and the height of the bar is proportional to the number of data points within that bin. The bin size is a critical hyperparameter, and a bad choice can lead to overfitting and underfitting. Cross-validation, as discussed in Section 8.1.4, can be used to determine a good bin size.
Kernel density estimation, independently proposed by Rosenblatt (1956) and Parzen (1962), is a nonparametric method for density estimation. Given $N$ i.i.d. samples, the kernel density estimator represents the underlying distribution as
$$p(x) = \frac{1}{Nh}\sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right), \qquad (11.74)$$
where $k$ is a kernel function, i.e., a non-negative function that integrates to 1, and $h > 0$ is a smoothing/bandwidth parameter, which plays a simi-
lar role as the bin size in histograms. Note that we place a kernel on every
single data point xn in the dataset. Commonly used kernel functions are
the uniform distribution and the Gaussian distribution. Kernel density esti-
mates are closely related to histograms, but by choosing a suitable kernel,
we can guarantee smoothness of the density estimate. Figure 11.13 illus-
trates the difference between a histogram and a kernel density estimator
(with a Gaussian-shaped kernel) for a given dataset of 250 data points.
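A minimal sketch of (11.74) with a Gaussian kernel; the dataset and bandwidth below are illustrative.

```python
import numpy as np

# Kernel density estimate (11.74) with a Gaussian kernel, evaluated on a grid.
def kde(x_grid, data, h):
    u = (x_grid[:, None] - data[None, :]) / h              # (x - x_n) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)         # Gaussian kernel
    return k.sum(axis=1) / (len(data) * h)                 # (1 / (N h)) * sum_n k(.)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(3, 1.0, 150)])  # 250 points
density = kde(np.linspace(-5, 7, 200), data, h=0.3)
```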
12 Classification with Support Vector Machines
Figure 12.1 Example 2D data, illustrating the intuition of data where we can find a linear classifier that separates red crosses from blue dots.
SVMs. First, the SVM allows for a geometric way to think about supervised
machine learning. While in Chapter 9 we considered the machine learning
problem in terms of probabilistic models and attacked it using maximum
likelihood estimation and Bayesian inference, here we will consider an
alternative approach where we reason geometrically about the machine
learning task. It relies heavily on concepts, such as inner products and
projections, which we discussed in Chapter 3. The second reason why we
find SVMs instructive is that in contrast to Chapter 9, the optimization
problem for SVM does not admit an analytic solution so that we need to
resort to a variety of optimization tools introduced in Chapter 7.
The SVM view of machine learning is subtly different from the max-
imum likelihood view of Chapter 9. The maximum likelihood view pro-
poses a model based on a probabilistic view of the data distribution, from
which an optimization problem is derived. In contrast, the SVM view starts
by designing a particular function that is to be optimized during training,
based on geometric intuitions. We have seen something similar already in
Chapter 10 where we derived PCA from geometric principles. In the SVM
case, we start by designing an objective function that is to be minimized on
training data, following the principles of empirical risk minimization (Section 8.1).
This can also be understood as designing a particular loss function.
Let us derive the optimization problem corresponding to training an
SVM on example-label pairs. Intuitively, we imagine binary classification
data, which can be separated by a hyperplane as illustrated in Figure 12.1.
Here, every example $x_n$ (a vector of dimension 2) is a two-dimensional location ($x_n^{(1)}$ and $x_n^{(2)}$), and the corresponding binary label $y_n$ is one of
two different symbols (red cross or blue disc). “Hyperplane” is a word that
is commonly used in machine learning, and we encountered hyperplanes
already in Section 2.8. A hyperplane is an affine subspace of dimension
D−1 (if the corresponding vector space is of dimension D). The examples
consist of two classes (there are two possible labels) that have features
(the components of the vector representing the example) arranged in such
a way as to allow us to separate/classify them by drawing a straight line.
In the following, we formalize the idea of finding a linear separator
of the two classes. We introduce the idea of the margin and then extend
linear separators to allow for examples to fall on the “wrong” side, incur-
ring a classification error. We present two equivalent ways of formalizing
the SVM: the geometric view (Section 12.2.4) and the loss function view
(Section 12.2.5). We derive the dual version of the SVM using Lagrange
multipliers (Section 7.2). The dual SVM allows us to observe a third way
of formalizing the SVM: in terms of the convex hulls of the examples of
each class (Section 12.3.2). We conclude by briefly describing kernels and
how to numerically solve the nonlinear kernel-SVM optimization problem.
and the examples with negative labels are on the negative side, i.e.,
Figure 12.3 Possible separating hyperplanes. There are many linear classifiers (green lines) that separate red crosses from blue dots.
Figure 12.5 Derivation of the margin: $r = \frac{1}{\|w\|}$.
problem, we obtain the objective
$$\max_{w, b, r}\ \underbrace{r}_{\text{margin}} \quad \text{subject to} \quad \underbrace{y_n\big(\langle w, x_n\rangle + b\big) \geq r}_{\text{data fitting}}\,, \quad \underbrace{\|w\| = 1}_{\text{normalization}}\,, \quad r > 0\,, \qquad (12.10)$$
which says that we want to maximize the margin $r$, while ensuring that the data lies on the correct side of the hyperplane.
Remark. The concept of the margin turns out to be highly pervasive in
machine learning. It was used by Vladimir Vapnik and Alexey Chervo-
nenkis to show that when the margin is large, the “complexity” of the func-
tion class is low, and, hence, learning is possible (Vapnik, 2000). It turns
out that the concept is useful for various different approaches for theo-
retically analyzing generalization error (Shalev-Shwartz and Ben-David,
2014; Steinwart and Christmann, 2008). ♦
$$\max_{w', b, r}\ r^2 \quad \text{subject to} \quad y_n\!\left(\left\langle \frac{w'}{\|w'\|}, x_n\right\rangle + b\right) \geq r\,, \quad r > 0\,. \qquad (12.22)$$
Equation (12.22) explicitly states that the distance $r$ is positive. (Note that $r > 0$ because we assumed linear separability, and hence there is no issue to divide by $r$.) Therefore, we can divide the first constraint by $r$, which yields
$$\max_{w', b, r}\ r^2 \quad \text{subject to} \quad y_n\!\left(\Big\langle \underbrace{\tfrac{w'}{\|w'\|\, r}}_{w''}, x_n\Big\rangle + \underbrace{\tfrac{b}{r}}_{b''}\right) \geq 1\,, \quad r > 0\,, \qquad (12.23)$$
renaming the parameters to $w''$ and $b''$. Since $w'' = \frac{w'}{\|w'\|\, r}$, rearranging for $r$ gives
$$\|w''\| = \left\|\frac{w'}{\|w'\|\, r}\right\| = \frac{1}{r}\cdot\frac{\|w'\|}{\|w'\|} = \frac{1}{r}\,. \qquad (12.24)$$
By substituting this result into (12.23) we obtain
$$\max_{w'', b''}\ \frac{1}{\|w''\|^2} \quad \text{subject to} \quad y_n\big(\langle w'', x_n\rangle + b''\big) \geq 1\,. \qquad (12.25)$$
The final step is to observe that maximizing $\frac{1}{\|w''\|^2}$ yields the same solution as minimizing $\frac{1}{2}\|w''\|^2$, which concludes the proof of Theorem 12.1.

[Figure: (a) Linearly separable data, with a large margin. (b) Non-linearly separable data.]
[Figure: The slack variable $\xi$ measures the distance of a positive example $x_+$ to the positive margin hyperplane $\langle w, x\rangle + b = 1$ when $x_+$ is on the wrong side.]
[Figure: The hinge loss $\max\{0, 1 - t\}$ as a function of $t$; it is an upper bound of the zero-one loss.]
This loss can be interpreted as never allowing any examples inside the
margin.
For a given training set {(x1 , y1 ), . . . , (xN , yN )} we seek to minimize
the total loss, while regularizing the objective with ℓ2-regularization (see
Section 8.1.3). Using the hinge loss (12.28) gives us the unconstrained
optimization problem
$$\min_{w, b}\ \underbrace{\frac{1}{2}\|w\|^2}_{\text{regularizer}} + \underbrace{C\sum_{n=1}^{N}\max\{0,\, 1 - y_n(\langle w, x_n\rangle + b)\}}_{\text{error term}}\,. \qquad (12.31)$$
The first term in (12.31) is called the regularization term or the regularizer (see Section 9.2.3), and the second term is called the loss term or the error term. Recall from Section 12.2.4 that the term $\frac{1}{2}\|w\|^2$ arises directly from the margin. In other words, margin maximization can be interpreted as regularization.
In principle, the unconstrained optimization problem in (12.31) can be
directly solved with (sub-)gradient descent methods as described in Sec-
tion 7.1. To see that (12.31) and (12.26a) are equivalent, observe that the
hinge loss (12.28) essentially consists of two linear parts, as expressed
in (12.29). Consider the hinge loss (12.28) for a single example–label pair. We can equivalently replace minimization of the hinge loss over $t$ with a minimization of a slack variable $\xi$ with two constraints. In equation form,
$$\min_{t}\ \max\{0, 1 - t\} \qquad (12.32)$$
is equivalent to
$$\min_{\xi, t}\ \xi \quad \text{subject to} \quad \xi \geq 0\,, \quad \xi \geq 1 - t\,. \qquad (12.33)$$
By substituting this expression into (12.31) and rearranging one of the
constraints, we obtain exactly the soft margin SVM (12.26a).
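As a concrete illustration of solving (12.31) directly, the following sketch applies plain subgradient descent to the hinge-loss objective; the step size, iteration count, and names are illustrative choices, not a tuned implementation.

```python
import numpy as np

# Subgradient descent on 0.5*||w||^2 + C * sum_n max{0, 1 - y_n(<w, x_n> + b)}.
def svm_subgradient(X, y, C=1.0, lr=1e-3, iters=1000):
    N, D = X.shape                               # X: N x D, y in {-1, +1}
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                     # examples with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```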
Remark. Let us contrast our choice of the loss function in this section to the
loss function for linear regression in Chapter 9. Recall from Section 9.2.1
which is a particular instance of the representer theorem (Kimeldorf and Wahba, 1970). Equation (12.38) states that the optimal weight vector in the primal is a linear combination of the examples $x_n$. Recall from Section 2.6.1 that this means that the solution of the optimization problem lies in the span of the training data. Additionally, the constraint obtained by setting (12.36) to zero implies that the optimal weight vector is an affine combination of the examples. The representer theorem turns out to hold for very general settings of regularized empirical risk minimization (Hofmann et al., 2008; Argyriou and Dinuzzo, 2014). (The representer theorem is actually a collection of theorems saying that the solution of minimizing empirical risk lies in the subspace (Section 2.4.3) defined by the examples.) The theorem has more general versions (Schölkopf et al., 2001), and necessary and sufficient conditions on its existence can be found in Yu et al. (2013).
Remark. The representer theorem (12.38) also provides an explanation of the name support vector machine. The examples $x_n$ for which the corresponding parameters $\alpha_n = 0$ do not contribute to the solution $w$ at all. The other examples, where $\alpha_n > 0$, are called support vectors since they "support" the hyperplane. ♦
By substituting the expression for $w$ into the Lagrangian (12.34), we obtain the dual
$$\mathfrak{D}(\xi, \alpha, \gamma) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle - \sum_{i=1}^{N} y_i \alpha_i \left\langle \sum_{j=1}^{N} y_j \alpha_j x_j,\ x_i\right\rangle + C\sum_{i=1}^{N}\xi_i - b\sum_{i=1}^{N} y_i\alpha_i + \sum_{i=1}^{N}\alpha_i - \sum_{i=1}^{N}\alpha_i\xi_i - \sum_{i=1}^{N}\gamma_i\xi_i\,. \qquad (12.39)$$
bilinear (see Section 3.2). Therefore, the first two terms in (12.39) are over the same objects. These terms can be simplified, and we obtain the Lagrangian
$$\mathfrak{D}(\xi, \alpha, \gamma) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle + \sum_{i=1}^{N}\alpha_i + \sum_{i=1}^{N}(C - \alpha_i - \gamma_i)\xi_i\,. \qquad (12.40)$$
The last term in this equation is a collection of all terms that contain slack variables $\xi_i$. By setting (12.37) to zero, we see that the last term in (12.40) is also zero. Furthermore, by using the same equation and recalling that the Lagrange multipliers $\gamma_i$ are non-negative, we conclude that $\alpha_i \leq C$. We now obtain the dual optimization problem of the SVM, which is expressed exclusively in terms of the Lagrange multipliers $\alpha_i$. Recall from Lagrangian duality (Definition 7.1) that we maximize the dual problem. This is equivalent to minimizing the negative dual problem, such that we end up with the dual SVM
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle - \sum_{i=1}^{N}\alpha_i$$
$$\text{subject to} \quad \sum_{i=1}^{N} y_i \alpha_i = 0\,, \quad 0 \leq \alpha_i \leq C \ \ \text{for all}\ i = 1, \ldots, N\,. \qquad (12.41)$$
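For illustration, the dual (12.41) can be handed to a general-purpose constrained optimizer; in practice a dedicated quadratic programming or SMO-type solver would be used. The sketch below (illustrative names) also recovers $w$ via (12.38).

```python
import numpy as np
from scipy.optimize import minimize

# Solve the dual SVM (12.41) with a general-purpose solver. X: N x D, y in {-1, +1}.
def dual_svm(X, y, C=1.0):
    N = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)         # Q_ij = y_i y_j <x_i, x_j>

    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    jac = lambda a: Q @ a - 1.0
    constraints = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}
    res = minimize(objective, np.zeros(N), jac=jac, bounds=[(0.0, C)] * N,
                   constraints=constraints, method="SLSQP")
    alpha = res.x
    w = (alpha * y) @ X          # representer theorem (12.38); b follows from any
    return alpha, w              # support vector with 0 < alpha_i < C
```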
Figure 12.9 Convex hulls. (a) Convex hull of points, some of which lie within the boundary; (b) Convex hulls around positive (blue) and negative (red) examples. The distance between the two convex sets is the length of the difference vector c − d.
w := c − d . (12.44)
The objective function (12.48) and the constraint (12.50), along with the
assumption that α > 0, give us a constrained (convex) optimization prob-
lem. This optimization problem can be shown to be the same as that of
the dual hard margin SVM (Bennett and Bredensteiner, 2000a).
Remark. To obtain the soft margin dual, we consider the reduced hull. The reduced hull is similar to the convex hull but has an upper bound on the size of the coefficients $\alpha$. The maximum possible value of the elements of $\alpha$ restricts the size that the convex hull can take. In other words, the bound on $\alpha$ shrinks the convex hull to a smaller volume (Bennett and Bredensteiner, 2000b). ♦
12.4 Kernels
Consider the formulation of the dual SVM (12.41). Notice that the in-
ner product in the objective occurs only between examples xi and xj .
There are no inner products between the examples and the parameters.
Therefore, if we consider a set of features φ(xi ) to represent xi , the only
change in the dual SVM will be to replace the inner product. This mod-
ularity, where the choice of the classification method (the SVM) and the
choice of the feature representation φ(x) can be considered separately,
provides flexibility for us to explore the two problems independently. In
this section we discuss the representation φ(x) and briefly introduce the
idea of kernels, but do not go into the technical details.
Since φ(x) could be a non-linear function, we can use the SVM (which
assumes a linear classifier) to construct classifiers that are nonlinear in
the examples xn . This provides a second avenue, in addition to the soft
margin, for users to deal with a dataset that is not linearly separable. It
turns out that there are many algorithms and statistical methods, which
have this property that we observed in the dual SVM: the only inner prod-
ucts are those that occur between examples. Instead of explicitly defining
a non-linear feature map φ(·) and computing the resulting inner product
between examples xi and xj , we define a similarity function k(xi , xj ) be-
tween $x_i$ and $x_j$. For a certain class of similarity functions, called kernels, the similarity function implicitly defines a non-linear feature map $\phi(\cdot)$. Kernels are by definition functions $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for which there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that
$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle_{\mathcal{H}}\,. \qquad (12.52)$$
(The inputs $\mathcal{X}$ of the kernel function can be very general and are not necessarily restricted to $\mathbb{R}^D$.)
[Figure 12.10: SVM with different kernels; although the decision boundary is nonlinear, the underlying problem being solved is for a linear separating hyperplane (albeit with a nonlinear kernel).]
and Williams, 2006). Figure 12.10 illustrates the effect of different kernels
on separating hyperplanes on an example dataset. Note that we are still
solving for hyperplanes, that is, the hypothesis class of functions is still linear. The non-linear surfaces are due to the kernel function.
Remark. Unfortunately for the fledgling machine learner, there are multi-
ple meanings of the word kernel. In this chapter, the word kernel comes
from the idea of the Reproducing Kernel Hilbert Space (RKHS) (Aron-
szajn, 1950; Saitoh, 1988). We have discussed the idea of the kernel in
linear algebra (Section 2.7.3), where the kernel is another word for the
null space. The third common use of the word kernel in machine learning
is the smoothing kernel in kernel density estimation (Section 11.5). ♦
Since the explicit representation φ(x) is mathematically equivalent to
the kernel representation k(xi , xj ) a practitioner will often design the
kernel function, such that it can be computed more efficiently than the
inner product between explicit feature maps. For example, consider the
polynomial kernel (Schölkopf and Smola, 2002), where the number of
terms in the explicit expansion grows very quickly (even for polynomials
of low degree) when the input dimension is large. The kernel function
only requires one multiplication per input dimension, which can provide
significant computational savings. Another example is the Gaussian ra-
dial basis function kernel (Schölkopf and Smola, 2002; Rasmussen and
Williams, 2006) where the corresponding feature space is infinite dimen-
sional. In this case, we cannot explicitly represent the feature space but
can still compute similarities between a pair of examples using the kernel. (The choice of kernel, as well as the parameters of the kernel, are often chosen using nested cross-validation, Section 8.5.1.)
Another useful aspect of the kernel trick is that there is no need for the original data to be already represented as multivariate real-valued data. Note that the inner product is defined on the output of the function $\phi(\cdot)$, but does not restrict the input to real numbers. Hence, the function $\phi(\cdot)$ and the kernel function $k(\cdot, \cdot)$ can be defined on any object, e.g., sets, sequences, strings, graphs, and distributions (Ben-Hur et al., 2008; Gärtner, 2008; Shi et al., 2009; Vishwanathan et al., 2010; Sriperumbudur et al., 2010).
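Two commonly used kernels can be sketched as follows; the polynomial degree, the offset +1, and the RBF bandwidth $\gamma$ are illustrative parameter choices. The resulting Gram matrix is what replaces the inner products $\langle x_i, x_j\rangle$ in the dual SVM (12.41).

```python
import numpy as np

# One common parameterization of the polynomial kernel: (<x, y> + 1)^degree.
def polynomial_kernel(A, B, degree=3):
    return (A @ B.T + 1.0) ** degree

# Gaussian radial basis function (RBF) kernel: exp(-gamma * ||x - y||^2).
def rbf_kernel(A, B, gamma=0.5):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(0).normal(size=(5, 2))
K_poly = polynomial_kernel(X, X)      # 5 x 5 Gram matrix
K_rbf = rbf_kernel(X, X)              # symmetric, positive semidefinite
```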
Using this subgradient above, we can apply the optimization methods pre-
sented in Section 7.1.
Both the primal and the dual SVM result in a convex quadratic pro-
gramming problem (constrained optimization). Note that the primal SVM
in (12.26a) has optimization variables that have the size of the dimen-
sion D of the input examples. The dual SVM in (12.41) has optimization
variables that have the size of the number N of examples.
To express the primal SVM in the standard form (7.45) for quadratic programming, let us assume that we use the dot product (3.5) as the inner product. (Recall from Section 3.2 that we use the phrase dot product to mean the inner product on Euclidean vector space.) We rearrange the equation for the primal SVM (12.26a), such that the optimization variables are all on the right and the inequality of the constraint matches the standard form. This yields the optimization
$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N}\xi_n$$
$$\text{subject to} \quad -y_n x_n^\top w - y_n b - \xi_n \leq -1\,, \quad -\xi_n \leq 0\,, \qquad (12.55)$$
for $n = 1, \ldots, N$. By concatenating the variables $w, b, \xi_n$ into a single vector, and carefully collecting the terms, we obtain the following matrix form of the soft margin SVM.
$$\min_{w, b, \xi}\ \frac{1}{2}\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}^\top \begin{bmatrix} I_D & 0_{D, N+1} \\ 0_{N+1, D} & 0_{N+1, N+1}\end{bmatrix}\begin{bmatrix} w \\ b \\ \xi \end{bmatrix} + \begin{bmatrix} 0_{D+1, 1} \\ C\,1_{N,1}\end{bmatrix}^\top \begin{bmatrix} w \\ b \\ \xi \end{bmatrix}$$
$$\text{subject to} \quad \begin{bmatrix} -YX & -y & -I_N \\ 0_{N, D} & 0_{N, 1} & -I_N\end{bmatrix}\begin{bmatrix} w \\ b \\ \xi \end{bmatrix} \leq \begin{bmatrix} -1_{N,1} \\ 0_{N,1}\end{bmatrix}. \qquad (12.56)$$
In the above optimization problem, the minimization is over $[w^\top, b, \xi^\top]^\top \in \mathbb{R}^{D+1+N}$, and we use the notation: $I_m$ to represent the identity matrix of size $m \times m$, $0_{m,n}$ to represent the matrix of zeros of size $m \times n$, and $1_{m,n}$ to represent the matrix of ones of size $m \times n$. In addition, $y$ is the vector of labels $[y_1, \ldots, y_N]^\top$, and $Y = \operatorname{diag}(y)$ is an $N$ by $N$ matrix with the elements of $y$ on its diagonal.
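A sketch assembling the matrices of (12.56) with the notation above; the returned quadruple corresponds to a generic QP formulation $\min \frac{1}{2}v^\top P v + q^\top v$ subject to $Gv \leq h$ and can be handed to any off-the-shelf QP solver (names illustrative).

```python
import numpy as np

# Build the quadratic program of the primal soft margin SVM, cf. (12.56).
def primal_svm_qp_matrices(X, y, C=1.0):
    N, D = X.shape                                     # X: N x D, y in {-1, +1}
    P = np.zeros((D + 1 + N, D + 1 + N))               # quadratic term: only the w-block
    P[:D, :D] = np.eye(D)
    q = np.concatenate([np.zeros(D + 1), C * np.ones(N)])  # linear term: C on the slacks
    Y = np.diag(y)
    G = np.block([[-Y @ X, -y[:, None], -np.eye(N)],   # inequality constraints G v <= h
                  [np.zeros((N, D + 1)), -np.eye(N)]])
    h = np.concatenate([-np.ones(N), np.zeros(N)])
    return P, q, G, h
```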
relationship between the loss function and the likelihood (also compare Sec-
tion 8.1 and Section 8.2). The maximum likelihood approach correspond-
ing to a well calibrated transformation during training is called logistic
regression, which comes from a class of methods called generalized linear
models. Details of logistic regression from this point of view can be found
in Agresti (2002, Chapter 5) and McCullagh and Nelder (1989, Chapter
4). Naturally, one could take a more Bayesian view of the classifier out-
put by estimating a posterior distribution using Bayesian logistic regres-
sion. The Bayesian view also includes the specification of the prior, which
includes design choices such as conjugacy (Section 6.6.1) with the like-
lihood. Additionally, one could consider latent functions as priors, which
results in Gaussian process classification (Rasmussen and Williams, 2006,
Chapter 3).
Berlinet, Alain, and Thomas-Agnan, Christine. 2004. Reproducing Kernel Hilbert Spaces
in Probability and Statistics. Springer.
Bertsekas, Dimitri P. 1999. Nonlinear Programming. Athena Scientific.
Bertsekas, Dimitri P. 2009. Convex Optimization Theory. Athena Scientific.
Betancourt, Michael. 2018. Probability Theory (for Scientists and Engineers). https:
//betanalpha.github.io/assets/case_studies/probability_theory.html.
Bickel, Peter J., and Doksum, Kjell. 2006. Mathematical Statistics, Basic Ideas and
Selected Topics. Vol. 1. Prentice Hall.
Bickson, Danny, Dolev, Danny, Shental, Ori, Siegel, Paul H., and Wolf, Jack K. 2007.
Linear Detection via Belief Propagation. In: Proceedings of the Annual Allerton Con-
ference on Communication, Control, and Computing.
Billingsley, Patrick. 1995. Probability and Measure. Wiley.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Clarendon
Press.
Bishop, Christopher M. 1999. Bayesian PCA. In: Advances in Neural Information Pro-
cessing Systems.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Blei, David M., Kucukelbir, Alp, and McAuliffe, Jon D. 2017. Variational Inference: A
Review for Statisticians. Journal of the American Statistical Association, 112(518),
859–877.
Blum, Avrim, and Hardt, Moritz. 2015. The Ladder: A Reliable Leaderboard for Ma-
chine Learning Competitions. In: International Conference on Machine Learning.
Bonnans, J. Frédéric, Gilbert, J. Charles, Lemaréchal, Claude, and Sagastizábal, Clau-
dia A. 2006. Numerical Optimization: Theoretical and Practical Aspects. 2nd edn.
Springer Verlag.
Borwein, Jonathan M., and Lewis, Adrian S. 2006. Convex Analysis and Nonlinear
Optimization. 2nd edn. Canadian Mathematical Society.
Bottou, Léon. 1998. Online Algorithms and Stochastic Approximations. In: Online
Learning and Neural Networks. Cambridge University Press.
Bottou, Léon, Curtis, Frank E, and Nocedal, Jorge. 2018. Optimization Methods for
Large-scale Machine Learning. SIAM Review, 60(2), 223–311.
Boucheron, Stephane, Lugosi, Gabor, and Massart, Pascal. 2013. Concentration In-
equalities: A Nonasymptotic Theory of Independence. Oxford University Press.
Boyd, Stephen, and Vandenberghe, Lieven. 2004. Convex Optimization. Cambridge
University Press.
Boyd, Stephen, and Vandenberghe, Lieven. 2018. Introduction to Applied Linear Alge-
bra. Cambridge University Press.
Brochu, Eric, Cora, Vlad M., and de Freitas, Nando. 2009. A Tutorial on Bayesian
Optimization of Expensive Cost Functions, with Application to Active User Modeling
and Hierarchical Reinforcement Learning. Tech. rept. TR-2009-023. Department of
Computer Science, University of British Columbia.
Brooks, Steve, Gelman, Andrew, Jones, Galin L., and Meng, Xiao-Li (eds). 2011. Hand-
book of Markov Chain Monte Carlo. Chapman and Hall/CRC.
Brown, Lawrence D. 1986. Fundamentals of Statistical Exponential Families: With Ap-
plications in Statistical Decision Theory. Lecture Notes - Monograph Series. Institute
of Mathematical Statistics.
Bryson, Arthur E. 1961. A Gradient Method for Optimizing Multi-stage Allocation
Processes. In: Proceedings of the Harvard University Symposium on Digital Computers
and Their Applications.
Bubeck, Sébastien. 2015. Convex Optimization: Algorithms and Complexity. Founda-
tions and Trends in Machine Learning, 8(3-4), 231–357.
Bühlmann, Peter, and Geer, Sara Van De. 2011. Statistics for High-Dimensional Data.
Springer.
Drumm, Volker, and Weil, Wolfgang. 2001. Lineare Algebra und Analytische Geometrie.
Lecture Notes, Universität Karlsruhe (TH).
Dudley, Richard M. 2002. Real Analysis and Probability. Cambridge University Press.
Eaton, Morris L. 2007. Multivariate Statistics: A Vector Space Approach. Vol. 53. Insti-
tute of Mathematical Statistics Lecture Notes—Monograph Series.
Eckart, Carl, and Young, Gale. 1936. The Approximation of One Matrix by Another of
Lower Rank. Psychometrika, 1(3), 211–218.
Efron, Bradley, and Hastie, Trevor. 2016. Computer Age Statistical Inference: Algorithms,
Evidence and Data Science. Cambridge University Press.
Efron, Bradley, and Tibshirani, Robert J. 1993. An Introduction to the Bootstrap. Chap-
man and Hall/CRC.
Elliott, Conal. 2009. Beautiful Differentiation. In: International Conference on Func-
tional Programming.
Evgeniou, Theodoros, Pontil, Massimiliano, and Poggio, Tomaso. 2000. Statistical
Learning Theory: A Primer. International Journal of Computer Vision, 38(1), 9–13.
Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen.
2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine
Learning Research, 9, 1871–1874.
Gal, Yarin, van der Wilk, Mark, and Rasmussen, Carl E. 2014. Distributed Variational
Inference in Sparse Gaussian Process Regression and Latent Variable Models. In:
Advances in Neural Information Processing Systems.
Gärtner, Thomas. 2008. Kernels for Structured Data. World Scientific.
Gavish, Matan, and Donoho, David L. 2014. The Optimal Hard Threshold for Singular Values is 4/√3. IEEE Transactions on Information Theory, 60(8), 5040–5053.
Gelman, Andrew, Carlin, John B., Stern, Hal S., and Rubin, Donald B. 2004. Bayesian
Data Analysis. Second. Chapman & Hall/CRC.
Gentle, James E. 2004. Random Number Generation and Monte Carlo Methods. 2nd
edn. Springer.
Ghahramani, Zoubin. 2015. Probabilistic Machine Learning and Artificial Intelligence.
Nature, 521, 452–459.
Ghahramani, Zoubin, and Roweis, Sam T. 1999. Learning Nonlinear Dynamical Sys-
tems using an EM Algorithm. In: Advances in Neural Information Processing Systems.
MIT Press.
Gilks, Walter R., Richardson, Sylvia, and Spiegelhalter, David J. 1996. Markov Chain
Monte Carlo in Practice. Chapman & Hall.
Gneiting, Tilmann, and Raftery, Adrian E. 2007. Strictly Proper Scoring Rules, Pre-
diction, and Estimation. Journal of the American Statistical Association, 102(477),
359–378.
Goh, Gabriel. 2017. Why Momentum Really Works. Distill.
Gohberg, Israel, Goldberg, Seymour, and Krupnik, Nahum. 2012. Traces and Determi-
nants of Linear Operators. Vol. 116. Birkhäuser.
Golan, Jonathan S. 2007. The Linear Algebra a Beginning Graduate Student Ought to
Know. 2nd edn. Springer.
Golub, Gene H., and Van Loan, Charles F. 2012. Matrix Computations. Vol. 4. JHU
Press.
Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. 2016. Deep Learning. MIT
Press.
Graepel, Thore, Candela, Joaquin Quiñonero-Candela, Borchert, Thomas, and Her-
brich, Ralf. 2010. Web-scale Bayesian Click-through Rate Prediction for Sponsored
Search Advertising in Microsoft’s Bing Search Engine. In: Proceedings of the Interna-
tional Conference on Machine Learning.
Griewank, Andreas, and Walther, Andrea. 2003. Introduction to Automatic Differenti-
ation. In: Proceedings in Applied Mathematics and Mechanics.
Griewank, Andreas, and Walther, Andrea. 2008. Evaluating Derivatives, Principles and
Techniques of Algorithmic Differentiation. second edn. SIAM, Philadelphia.
Grimmett, Geoffrey, and Welsh, Dominic. 2014. Probability: an Introduction. 2nd edn.
Oxford University Press.
Grinstead, Charles M., and Snell, J. Laurie. 1997. Introduction to Probability. American
Mathematical Society.
Hacking, Ian. 2001. Probability and Inductive Logic. Cambridge University Press.
Hall, Peter. 1992. The Bootstrap and Edgeworth Expansion. Springer.
Hallin, Marc, Paindaveine, Davy, and Šiman, Miroslav. 2010. Multivariate quantiles
and multiple-output regression quantiles: from ℓ1 optimization to halfspace depth.
Annals of Statistics, 38, 635–669.
Hasselblatt, Boris, and Katok, Anatole. 2003. A first course in dynamics with a
Panorama of Recent Developments. Cambridge University Press.
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. 2001. The Elements of Statis-
tical Learning—Data Mining, Inference, and Prediction. Springer Series in Statistics.
175 Fifth Avenue, New York City, NY, USA: Springer-Verlag New York, Inc.
Hausman, Karol, Springenberg, Jost T., Wang, Ziyu, Heess, Nicolas, and Riedmiller,
Martin. 2018. Learning an Embedding Space for Transferable Robot Skills. In:
Proceedings of the International Conference on Learning Representations.
Hazan, Elad. 2015. Introduction to Online Convex Optimization. Foundations and
Trends in Optimization, 2(3-4), 157–325.
Hensman, James, Fusi, Nicolò, and Lawrence, Neil D. 2013. Gaussian Processes for
Big Data. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Herbrich, Ralf, Minka, Tom, and Graepel, Thore. 2007. TrueSkill(TM): A Bayesian
Skill Rating System. In: Advances in Neural Information Processing Systems.
Hiriart-Urruty, Jean-Baptiste, and Lemaréchal, Claude. 2001. Fundamentals of Convex
Analysis. Springer.
Hoffman, Matthew D., Blei, David M., and Bach, Francis. 2010. Online Learning for
Latent Dirichlet Allocation. Advances in Neural Information Processing Systems.
Hoffman, Matthew D., Blei, David M., Wang, Chong, and Paisley, John. 2013. Stochas-
tic Variational Inference. Journal of Machine Learning Research, 14(1), 1303–1347.
Hofmann, Thomas, Schölkopf, Bernhard, and Smola, Alexander J. 2008. Kernel Meth-
ods in Machine Learning. Annals of Statistics, 36(3), 1171–1220.
Hogben, Leslie. 2013. Handbook of Linear Algebra. 2nd edn. Chapman and Hall/CRC.
Horn, Roger A., and Johnson, Charles R. 2013. Matrix Analysis. Cambridge University
Press.
Hotelling, Harold. 1933. Analysis of a Complex of Statistical Variables into Principal
Components. Journal of Educational Psychology, 24, 417–441.
Hyvarinen, Aapo, Oja, Erkki, and Karhunen, Juha. 2001. Independent Component Anal-
ysis. Wiley.
Imbens, Guido W., and Rubin, Donald B. 2015. Causal Inference for Statistics, Social
and Biomedical Sciences. Cambridge University Press.
Jacod, Jean, and Protter, Philip. 2004. Probability Essentials. 2nd edn. Springer.
Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University
Press.
Jefferys, Willian H., and Berger, James O. 1992. Ockham’s Razor and Bayesian Analy-
sis. American Scientist, 80, 64–72.
Jeffreys, Harold. 1961. Theory of Probability. 3rd edn. Oxford University Press.
Jimenez Rezende, Danilo, Mohamed, Shakir, and Wierstra, Daan. 2014. Stochastic
Backpropagation and Approximate Inference in Deep Generative Models. In: Pro-
ceedings of the International Conference on Machine Learning.
Joachims, Thorsten. 1999. Advances in Kernel Methods—Support Vector Learning. MIT
Press. Chap. Making Large-Scale SVM Learning Practical, pages 169–184.
Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K.
1999. An Introduction to Variational Methods for Graphical Models. Machine Learn-
ing, 37, 183–233.
Julier, Simon J., and Uhlmann, Jeffrey K. 1997. A New Extension of the Kalman Filter
to Nonlinear Systems. In: Proceedings of AeroSense Symposium on Aerospace/Defense
Sensing, Simulation and Controls.
Kaiser, Marcus, and Hilgetag, Claus C. 2006. Nonoptimal Component Placement, but
Short Processing Paths, due to Long-Distance Projections in Neural Systems. PLoS
Computational Biology, 2(7), e95.
Kalman, Dan. 1996. A Singularly Valuable Decomposition: The SVD of a Matrix. The
College Mathematics Journal, 27(1), 2–23.
Kalman, Rudolf E. 1960. A New Approach to Linear Filtering and Prediction Problems.
Transactions of the ASME—Journal of Basic Engineering, 82(Series D), 35–45.
Kamthe, Sanket, and Deisenroth, Marc P. 2018. Data-Efficient Reinforcement Learning
with Probabilistic Model Predictive Control. In: Proceedings of the International
Conference on Artificial Intelligence and Statistics.
Katz, Victor J. 2004. A History of Mathematics. Pearson/Addison-Wesley.
Kelley, Henry J. 1960. Gradient Theory of Optimal Flight Paths. Ars Journal, 30(10),
947–954.
Kimeldorf, George S., and Wahba, Grace. 1970. A Correspondence Between Bayesian
Estimation on Stochastic Processes and Smoothing by Splines. Annals of Mathemat-
ical Statistics, 41(2), 495–502.
Kingma, Diederik, and Ba, Jimmy. 2014. Adam: A Method for Stochastic Optimization.
Proceedings of the International Conference on Learning Representations, 1–13.
Kingma, Diederik P., and Welling, Max. 2014. Auto-Encoding Variational Bayes. In:
Proceedings of the International Conference on Learning Representations.
Kittler, J., and Föglein, J. 1984. Contextual Classification of Multispectral Pixel Data.
Image and Vision Computing, 2(1), 13–29.
Kolda, Tamara G., and Bader, Brett W. 2009. Tensor Decompositions and Applications.
SIAM Review, 51(3), 455–500.
Koller, Daphne, and Friedman, Nir. 2009. Probabilistic Graphical Models. MIT Press.
Kong, Linglong, and Mizera, Ivan. 2012. Quantile Tomography: Using Quantiles with
Multivariate Data. Statistica Sinica, 22, 1598–1610.
Lang, Serge. 1987. Linear Algebra. Springer.
Lawrence, Neil. 2005. Probabilistic Non-linear Principal Component Analysis with
Gaussian Process Latent Variable Models. Journal of Machine Learning Research,
6(Nov.), 1783–1816.
Leemis, Lawrence M., and McQueston, Jacquelyn T. 2008. Univariate Distribution
Relationships. The American Statistician, 62(1), 45–53.
Lehmann, Erich L., and Romano, Joseph P. 2005. Testing Statistical Hypotheses.
Springer.
Lehmann, Erich Leo, and Casella, George. 1998. Theory of Point Estimation. Springer.
Liesen, Jörg, and Mehrmann, Volker. 2015. Linear Algebra. Springer.
Lin, Hsuan-Tien, Lin, Chih-Jen, and Weng, Ruby C. 2007. A Note on Platt’s Probabilistic
Outputs for Support Vector Machines. Machine Learning, 68, 267–276.
Ljung, Lennart. 1999. System Identification: Theory for the User. Prentice Hall.
Loosli, Gaëlle, Canu, Stéphane, and Ong, Cheng S. 2016. Learning SVM in Kreı̆n
Spaces. IEEE Transactions of Pattern Analysis and Machine Intelligence, 38(6), 1204–
1216.
Luenberger, David G. 1969. Optimization by Vector Space Methods. Wiley.
MacKay, David J. C. 2003. Information Theory, Inference, and Learning Algorithms.
Cambridge University Press.
MacKay, David J. C. 1992. Bayesian Interpolation. Neural Computation, 4, 415–447.
Ong, Cheng S., Mary, Xavier, Canu, Stéphane, and Smola, Alexander J. 2004. Learn-
ing with Non-positive Kernels. Pages 639–646 of: Proceedings of the International
Conference on Machine Learning.
Ormoneit, Dirk, Sidenbladh, Hedvig, Black, Michael J., and Hastie, Trevor. 2001.
Learning and Tracking Cyclic Human Motion. In: Advances in Neural Information
Processing Systems.
Page, Lawrence, Brin, Sergey, Motwani, Rajeev, and Winograd, Terry. 1999. The PageR-
ank Citation Ranking: Bringing Order to the Web. Tech. rept. Stanford InfoLab.
Paquet, Ulrich. 2008. Bayesian Inference for Latent Variable Models. Ph.D. thesis, Uni-
versity of Cambridge.
Parzen, Emanuel. 1962. On Estimation of a Probability Density Function and Mode.
The Annals of Mathematical Statistics, 33(3), 1065–1076.
Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann.
Pearl, Judea. 2009. Causality: Models, Reasoning and Inference. 2nd edn. Cambridge
University Press.
Pearson, Karl. 1895. Contributions to the Mathematical Theory of Evolution. II. Skew
Variation in Homogeneous Material. Philosophical Transactions of the Royal Society
A: Mathematical, Physical and Engineering Sciences, 186, 343–414.
Pearson, Karl. 1901. On Lines and Planes of Closest Fit to Systems of Points in Space.
Philosophical Magazine, 2(11), 559–572.
Peters, Jonas, Janzing, Dominik, and Schölkopf, Bernhard. 2017. Elements of Causal
Inference: Foundations and Learning Algorithms. MIT Press.
Petersen, K. B., and Pedersen, M. S. 2012. The Matrix Cookbook. Tech. rept. Technical University of Denmark. Version 20121115.
Platt, John C. 2000. Probabilistic Outputs for Support Vector Machines and Compar-
isons to Regularized Likelihood Methods. In: Advances in Large Margin Classifiers.
Pollard, David. 2002. A User’s Guide to Measure Theoretic Probability. Cambridge
University Press.
Polyak, Roman A. 2016. The Legendre Transformation in Modern Optimization. Pages
437–507 of: Optimization and Its Applications in Control and Data Sciences. Springer.
Press, William H., Teukolsky, Saul A., Vetterling, William T., and Flannery, Brian P.
2007. Numerical Recipes: The Art of Scientific Computing. 3rd edn. Cambridge Uni-
versity Press.
Proschan, Michael A., and Presnell, Brett. 1998. Expect the Unexpected from Conditional Expectation. The American Statistician, 52(3), 248–252.
Raschka, Sebastian, and Mirjalili, Vahid. 2017. Python Machine Learning: Machine
Learning and Deep Learning with Python, scikit-learn, and TensorFlow. Packt Publish-
ing.
Rasmussen, Carl E., and Ghahramani, Zoubin. 2001. Occam’s Razor. In: Advances in
Neural Information Processing Systems.
Rasmussen, Carl E., and Ghahramani, Zoubin. 2003. Bayesian Monte Carlo. In: Ad-
vances in Neural Information Processing Systems.
Rasmussen, Carl E., and Williams, Christopher K. I. 2006. Gaussian Processes for Machine Learning. MIT Press.
Reid, Mark, and Williamson, Robert C. 2011. Information, Divergence and Risk for
Binary Experiments. Journal of Machine Learning Research, 12, 731–817.
Rezende, Danilo J., and Mohamed, Shakir. 2015. Variational Inference with Normal-
izing Flows. In: Proceedings of the International Conference on Machine Learning.
Rifkin, Ryan M., and Lippert, Ross A. 2007. Value Regularization and Fenchel Duality.
Journal of Machine Learning Research, 8, 441–479.
Rockafellar, R. Tyrrell. 1970. Convex Analysis. Princeton University Press.
Rogers, Simon, and Girolami, Mark. 2016. A First Course in Machine Learning. 2nd
edn. Chapman and Hall/CRC.
Rosenbaum, Paul R. 2017. Observation & Experiment: An Introduction to Causal Infer-
ence. Harvard University Press.
Rosenblatt, Murray. 1956. Remarks on Some Nonparametric Estimates of a Density
Function. The Annals of Mathematical Statistics, 27(3), 832–837.
Roweis, Sam, and Ghahramani, Zoubin. 1999. A Unifying Review of Linear Gaussian
Models. Neural Computation, 11(2), 305–345.
Roweis, Sam T. 1998. EM Algorithms for PCA and SPCA. Pages 626–632 of: Advances
in Neural Information Processing Systems.
Roy, Anindya, and Banerjee, Sudipto. 2014. Linear Algebra and Matrix Analysis for
Statistics. Chapman and Hall/CRC.
Rubinstein, Reuven Y., and Kroese, Dirk P. 2016. Simulation and the Monte Carlo
Method. Vol. 10. Wiley.
Ruffini, Paolo. 1799. Teoria Generale delle Equazioni, in cui si Dimostra Impossibile la
Soluzione Algebraica delle Equazioni Generali di Grado Superiore al Quarto. Stampe-
ria di S. Tommaso d’Aquino.
Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. 1986. Learning
Representations by Back-propagating Errors. Nature, 323(6088), 533–536.
Sæmundsson, Steindór, Hofmann, Katja, and Deisenroth, Marc P. 2018. Meta Rein-
forcement Learning with Latent Variable Gaussian Processes. In: Proceedings of the
Conference on Uncertainty in Artificial Intelligence.
Saitoh, Saburou. 1988. Theory of Reproducing Kernels and its Applications. Longman
Scientific & Technical.
Schölkopf, Bernhard, and Smola, Alexander J. 2002. Learning with Kernels—Support
Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Schölkopf, Bernhard, Smola, Alexander J., and Müller, Klaus-Robert. 1997. Kernel
Principal Component Analysis. In: Proceedings of the International Conference on
Artificial Neural Networks. Springer.
Schölkopf, Bernhard, Smola, Alexander J., and Müller, Klaus-Robert. 1998. Nonlinear
Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5),
1299–1319.
Schölkopf, Bernhard, Herbrich, Ralf, and Smola, Alexander J. 2001. A Generalized
Representer Theorem. In: Proceedings of the International Conference on Computa-
tional Learning Theory.
Schwartz, Laurent. 1964. Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés. Journal d'Analyse Mathématique, 13, 115–256. In French.
Schwarz, Gideon E. 1978. Estimating the Dimension of a Model. Annals of Statistics,
6(2), 461–464.
Shahriari, Bobak, Swersky, Kevin, Wang, Ziyu, Adams, Ryan P., and De Freitas, Nando.
2016. Taking the Human out of the Loop: A Review of Bayesian Optimization.
Proceedings of the IEEE, 104(1), 148–175.
Shalev-Shwartz, Shai, and Ben-David, Shai. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Shawe-Taylor, John, and Cristianini, Nello. 2004. Kernel Methods for Pattern Analysis.
Cambridge University Press.
Shawe-Taylor, John, and Sun, Shiliang. 2011. A Review of Optimization Methodologies
in Support Vector Machines. Neurocomputing, 74(17), 3609–3618.
Shental, O., Bickson, D., Siegel, P. H., Wolf, J. K., and Dolev, D. 2008. Gaussian Belief Propagation Solver for Systems of Linear Equations. In: Proceedings of the International Symposium on Information Theory.
Shewchuk, Jonathan R. 1994. An Introduction to the Conjugate Gradient Method With-
out the Agonizing Pain.
Shi, Jianbo, and Malik, Jitendra. 2000. Normalized Cuts and Image Segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Shi, Qinfeng, Petterson, James, Dror, Gideon, Langford, John, Smola, Alex, and Vish-
wanathan, S.V.N. 2009. Hash Kernels for Structured Data. Journal of Machine
Learning Research, 2615–2637.
Shiryayev, A. N. 1984. Probability. Springer.
Shor, Naum Z. 1985. Minimization Methods for Non-differentiable Functions. Springer.
Shotton, Jamie, Winn, John, Rother, Carsten, and Criminisi, Antonio. 2006. TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation. In: Proceedings of the European Conference on Computer Vision.
Smith, Adrian F. M., and Spiegelhalter, David. 1980. Bayes Factors and Choice Criteria
for Linear Models. Journal of the Royal Statistical Society B, 42(2), 213–220.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In: Advances in Neural Information Processing Systems.
Spearman, Charles. 1904. “General Intelligence,” Objectively Determined and Mea-
sured. American Journal of Psychology, 15(2), 201–292.
Sriperumbudur, Bharath K., Gretton, Arthur, Fukumizu, Kenji, Schölkopf, Bernhard,
and Lanckriet, Gert R. G. 2010. Hilbert Space Embeddings and Metrics on Proba-
bility Measures. Journal of Machine Learning Research, 11, 1517–1561.
Steinwart, Ingo. 2007. How to Compare Different Loss Functions and Their Risks.
Constructive Approximation, 26, 225–287.
Steinwart, Ingo, and Christmann, Andreas. 2008. Support Vector Machines. Springer.
Stoer, Josef, and Bulirsch, Roland. 2002. Introduction to Numerical Analysis. Springer.
Strang, Gilbert. 1993. The Fundamental Theorem of Linear Algebra. The American
Mathematical Monthly, 100(9), 848–855.
Strang, Gilbert. 2003. Introduction to Linear Algebra. 3rd edn. Wellesley-Cambridge
Press.
Stray, Jonathan. 2016. The Curious Journalist’s Guide to Data. Tow Center for Digital
Journalism at Columbia’s Graduate School of Journalism.
Strogatz, Steven. 2014. Writing about Math for the Perplexed and the Traumatized.
Notices of the American Mathematical Society, 61(3), 286–291.
Sucar, Luis E., and Gillies, Duncan F. 1994. Probabilistic Reasoning in High-Level
Vision. Image and Vision Computing, 12(1), 42–60.
Szeliski, Richard, Zabih, Ramin, Scharstein, Daniel, Veksler, Olga, Kolmogorov,
Vladimir, Agarwala, Aseem, Tappen, Marshall, and Rother, Carsten. 2008. A Com-
parative Study of Energy Minimization Methods for Markov Random Fields with
Smoothness-based Priors. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 30(6), 1068–1080.
Tandra, Haryono. 2014. The Relationship Between the Change of Variable Theorem
and The Fundamental Theorem of Calculus for the Lebesgue Integral. Teaching of
Mathematics, 17(2), 76–83.
Tenenbaum, Joshua B., De Silva, Vin, and Langford, John C. 2000. A Global Geometric
Framework for Nonlinear Dimensionality Reduction. Science, 290(5500), 2319–
2323.
Tibshirani, Robert. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58(1), 267–288.
Tipping, Michael E., and Bishop, Christopher M. 1999. Probabilistic Principal Compo-
nent Analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611–622.
Titsias, Michalis K., and Lawrence, Neil D. 2010. Bayesian Gaussian Process Latent
Variable Model. In: Proceedings of the International Conference on Artificial Intelli-
gence and Statistics. JMLR W&CP, vol. 9.