
LECTURE NOTES

IMPERIAL COLLEGE LONDON
DEPARTMENT OF COMPUTING

Mathematics for Machine Learning

Authors: Marc Deisenroth, Stefanos Zafeiriou

Version: September 15, 2017


Contents

1 Linear Regression
1.1 Problem Formulation
1.2 Probabilities
1.2.1 Means and Covariances
1.2.1.1 Sum of Random Variables
1.2.1.2 Affine Transformation
1.2.2 Statistical Independence
1.2.3 Basic Probability Distributions
1.2.3.1 Uniform Distribution
1.2.3.2 Bernoulli Distribution
1.2.3.3 Binomial Distribution
1.2.3.4 Beta Distribution
1.2.3.5 Gaussian Distribution
1.2.3.6 Gamma Distribution
1.2.3.7 Wishart Distribution
1.2.4 Conjugacy
1.3 Probabilistic Graphical Models
1.3.1 From Joint Distributions to Graphs
1.3.2 From Graphs to Joint Distributions
1.3.3 Further Reading
1.4 Vector Calculus
1.4.1 Scalar Differentiation
1.4.1.1 Taylor Series
1.4.1.2 Differentiation Rules
1.4.2 Partial Differentiation and Gradients
1.4.3 Gradients of Vector-Valued Functions (Vector Fields)
1.4.3.1 Gradients of Matrices
1.4.4 Basic Rules of Partial Differentiation
1.4.5 Chain Rule
1.4.6 Higher-Order Derivatives
1.4.7 Linearization and Taylor Series
1.4.8 Gradient Checking
1.5 Parameter Estimation
1.5.1 Maximum Likelihood Estimation
1.5.1.1 Closed-Form Solution
1.5.1.2 Iterative Solution
1.5.1.3 Maximum Likelihood Estimation with Features
1.5.1.4 Properties
1.5.2 Overfitting
1.5.3 Regularization
1.5.4 Maximum-A-Posterior (MAP) Estimation
1.5.4.1 MAP Estimation for Linear Regression
1.6 Gradient Descent
1.6.1 Stepsize
1.6.2 Gradient Descent with Momentum
1.6.3 Stochastic Gradient Descent
1.6.4 Further Reading
1.7 Model Selection and Cross Validation
1.7.1 Cross-Validation to Assess the Generalization Performance
1.7.2 Bayesian Model Selection
1.7.3 Bayes Factors for Model Comparison
1.7.4 Fully Bayesian Treatment
1.7.5 Computing the Marginal Likelihood
1.7.6 Further Reading
1.8 Bayesian Linear Regression
1.8.1 Model
1.8.2 Parameter Posterior
1.8.2.1 Linear Transformation of Gaussian Random Variables
1.8.2.2 Completing the Squares
1.8.3 Prediction and Inference
1.8.3.1 Derivation

2 Feature Extraction
2.1 Decompositions
2.1.1 Eigen-decomposition
2.1.1.1 Symmetric Matrices
2.1.2 QR decomposition
2.1.2.1 Gram-Schmidt Process
2.1.3 Singular Value Decomposition
2.1.3.1 Thin SVD
2.1.3.2 Dimensionality Reduction and SVD
2.1.4 Principal Component Analysis
2.1.4.1 Statistical Perspective
2.1.4.2 Reconstruction Perspective
2.1.4.3 Computing PCA
2.1.4.4 Link between SVD and PCA
2.1.5 Linear Discriminant Analysis
2.1.5.1 The two class case
2.1.5.2 Multi-class Case
2.2 Computing Linear Discriminant Analysis
2.2.1 Kernel PCA and Kernel LDA
2.2.1.1 Maximum Likelihood for Probabilistic PCA

3 Support Vector Machines
3.1 Support Vector Classification
3.1.1 Linear Separating Hyperplane with Maximal Margin
3.1.1.1 Lagrangian Duality
3.1.1.2 Conditions for Optimality (Karush-Kuhn-Tucker Conditions)
3.1.1.3 SVM dual problem
3.1.2 Mapping Data to Higher Dimensional Spaces
3.1.3 The Dual Problem
3.2 Support Vector Regression
3.2.1 Linear Regression
3.2.2 Support Vector Regression

A Appendix
A.1 Preliminaries on Vectors and Matrices
A.1.1 Vectors and Vector Operators
A.1.2 Matrices and Matrix Operators
A.1.2.1 Matrix Norms
A.1.2.2 Matrix Multiplications
A.1.2.3 Matrix Transposition
A.1.2.4 Trace Operator
A.1.2.5 Matrix Determinant
A.1.2.6 Matrix Inverse
A.1.2.7 Matrix Pseudo-Inverse
A.1.2.8 Range, Null Space and Rank of a matrix
A.1.2.9 Eigenvalues and Eigenvectors
A.1.2.10 Positive and Negative Definite Matrices
A.1.2.11 Triangular Matrices
A.1.2.12 QR decomposition
A.2 Inner Products
A.2.1 Lengths, Distances, Orthogonality
A.2.2 Applications of Inner Products
A.3 Useful Matrix Identities


Introduction

These lecture notes support the course “Mathematics for Machine Learning” in the
Department of Computing at Imperial College London. The aim of the course is
to provide students the basic mathematical background and skills necessary to un-
derstand, design and implement modern statistical machine learning methodologies
and inference mechanisms. The course will focus on examples regarding the use of
mathematical tools for the design of basic machine learning and inference method-
ologies, such as Principal Component Analysis (PCA), Linear Discriminant Analysis,
Bayesian Linear Regression and Support Vector Machines (SVMs).
This course is a hard pre-requisite for the following courses:

• CO-424H: Learning in Autonomous Systems

• CO-433: Advanced Robotics

• CO-493: Data Analysis and Probabilistic Inference1

• CO-495: Advanced Statistical Machine Learning and Pattern Recognition

and a soft pre-requisite for CO-477.


Relevant for the exam are

• The lecture notes

• The following chapters from the book by Bishop (2006):

– Chapter 1
– Chapter 2.2–2.3
– Chapter 3
– Chapter 8.1
– Chapter 12.1–12.2

• The following chapters from the book by Golub and Van Loan (2012):

– Chapter 1: Matrix Multiplication (without 1.5 and 1.6)


– Chapter 2: Matrix Analysis (without 2.6 and 2.7)
– Chapter 5: 5.2 The QR Factorisation using only the Gram-Schmidt process

1 MSc Computing Science students are exempt from this hard constraint, but they are required to know the material of this course.

• The papers by Turk and Pentland (1991); Belhumeur et al. (1997)

• The following tutorials:

– Tutorial by Burges (1998) until Section 4.4


– From the tutorial by Lin (2006) only the basic topics.

For parts two and three, the relevant literature is also the lecture notes, as well as the
following chapters from the book by Golub and Van Loan (2012):

• Chapter 1: Matrix Multiplication (without 1.5 and 1.6)

• Chapter 2: Matrix Analysis (without 2.6 and 2.7)

• Chapter 5: 5.2 The QR Factorisation using only the Gram-Schmidt process

as well as the following papers (Turk and Pentland, 1991; Belhumeur et al., 1997) and tutorials (Burges, 1998; Lin, 2006):

• Tutorial Burges (1998) until Section 4.4

• From Tutorial Lin (2006) only the basic topics.


Chapter 1

Linear Regression

In this part of the course, we will be looking at regression problems, where we want
to find a function f that maps inputs x ∈ RD to corresponding function values
f(x) ∈ R based on noisy observations y = f(x) + ε, where ε is a random variable.
An example is given in Fig. 1.1. A typical regression problem is given in Fig. 1.1(a):
For some values x we observe (noisy) function values y = f(x) + ε. The task is to infer
the function f that generated the data. A possible solution is given in Fig. 1.1(b).
Regression is a fundamental problem, and regression problems appear in a diverse
range of research areas and applications, including time-series analysis (e.g., sys-
tem identification), control and robotics (e.g., reinforcement learning, forward/in-
verse model learning), optimization (e.g., line searches, global optimization), deep-
learning applications (e.g., computer games, speech-to-text translation, image recog-
nition, automatic video annotation).
Finding a regression function requires solving a variety of problems, including

• Choice of the parametrization of the regression function. Given a data set,


what function classes (e.g., polynomials) are good candidates for modeling
the data, and what particular parametrization (e.g., degree of the polynomial)
should we choose?

• Finding good parameters. Having chosen a parametrization of the regression


function, how do we find the parameter values? Here, we will need to look at
different loss functions (they determine what a “good” fit is) and optimization
algorithms that allow us to minimize this loss function.

• Probabilistic models. Loss functions are often motivated by probabilistic mod-


els. We will derive a set of useful models and discuss the corresponding under-
lying assumptions.

• Overfitting and model selection. Overfitting is a problem when the regres-


sion function fits the training data “too well” but does not generalize to the test
data. Overfitting typically occurs if the underlying model (or its parametriza-
tion) is overly flexible and expressive. Model selection allows us to compare
various models to find the simplest model that explains the training data rea-
sonably well.


Figure 1.1: (a) Regression problem: observed noisy function values from which we wish to infer the underlying function that generated the data. (b) Regression solution: a possible function that could have generated the data.
• Uncertainty modeling. In any practical setting, we have access to only a
finite, potentially large, amount of (training) data for selecting the model class
and the corresponding parameters. Given that this finite amount of training
data does not cover all possible scenarios, we need to model the remaining
uncertainty to obtain a measure of confidence of the model’s prediction at test
time. The smaller the training set, the more important uncertainty modeling becomes.
Consistent modeling of uncertainty equips model predictions with confidence
bounds.
In this chapter, we will be reviewing the necessary mathematical background for
solving regression problems. This includes a brief introduction to probabilities and
graphical models, which are useful to visualize relationships between random vari-
ables. Furthermore, we will go through some vector calculus, which is required for
gradient-based optimization in the context of parameter learning. Once we know
how to learn regression functions, we will discuss the problem of overfitting and
model selection: Assuming we have a set of “realistic” models, are there some mod-
els that are better than others? Toward the end of this chapter, we will then discuss
Bayesian linear regression, which allows us to reason about parameters at a higher
level, thereby removing some of the problems encountered in maximum likelihood
and MAP estimation.

1.1 Problem Formulation


We consider the regression problem

y = f(x) + ε ,    (1.1)

where x ∈ RD are inputs and y ∈ R are observed targets. Furthermore, ε ∼ N(0, σ²)
is independent, identically distributed (i.i.d.) measurement noise. In this particular


case, ε is Gaussian distributed with mean 0 and variance σ². The objective is to find
a function f that is close to the unknown function that generated the data.
In this course, we focus on parametric models, i.e., we choose a parametrization of f
and find parameters that “work well”. In the most simple case, the parametrization
of linear regression is

y = f(x) + ε = x⊤θ + ε    (1.2)

where θ ∈ RD are the parameters we seek and ε ∼ N(0, σ²).1
In this course, we will discuss in more detail how to
• Find good parameters θ

• Evaluate whether a parameter set “works well”


We will start by introducing some background material that is necessary and useful:
concepts in probability theory, standard probability distributions, and probabilistic
graphical models.

1.2 Probabilities
Probability theory is a mathematical foundation of statistics and a consistent way of
expressing uncertainty. Jaynes (2003) provides a great introduction to probability
theory.
Definition 1 (Probability Density Function)
A function p : RD → R is called a probability density function if (1) its integral
exists, (2) ∀x ∈ RD : p(x) ≥ 0 and (3)

∫_{RD} p(x) dx = 1 .    (1.3)

Here, x ∈ RD is a (continuous) random variable.2 For discrete random variables, the


integral in (1.3) is replaced with a sum.
Remark 1
We will be using the expression “probability distribution” not only for discrete distribu-
tions but also for continuous probability density functions, although this is technically
incorrect.

There are two fundamental rules in probability theory that appear everywhere in
machine learning and Bayesian statistics:

p(x) = ∫ p(x, y) dy        Sum rule/Marginalization property    (1.4)
p(x, y) = p(y|x) p(x)      Product rule    (1.5)

1 It would be more precise to call this model “linear in the parameters”. We will see later that Φ(x)θ for nonlinear transformations Φ is also a linear regression model.
2 We omit the definition of a random variable as this will become too technical for the purpose of this course.

Here, p(x, y) is the joint distribution of the two random variables x, y, p(x), p(y)
are the corresponding marginal distributions, and p(y|x) is the conditional dis-
tribution of y given x. If we consider discrete random variables x, y, the integral
in (1.4) is replaced by a sum. This is where the name comes from.
In machine learning and Bayesian statistics, we are often interested in making infer-
ences of random variables given that we have observed other random variables. Let
us assume, we have some prior knowledge p(x) about a random variable x and some
relationship p(y|x) between x and a second random variable y. If we now observe
y, we can use Bayes’ theorem3 to draw some conclusions about x given the observed
values of y. Bayes’ theorem follows immediately from the sum and product rules
in (1.4)–(1.5) as
p(x|y) = p(y|x) p(x) / p(y) .    (1.6)
Here, p(x) is the prior, which encapsulates our prior knowledge of x, p(y|x) is
the likelihood4 that describes how y and x are related. The quantity p(y) is the
marginal likelihood or evidence and is a normalizing constant (independent of x),
which is obtained as the integral ∫ p(y|x) p(x) dx of the numerator with respect to x
and ensures that the fraction is normalized. The posterior p(x|y) expresses exactly
what we are interested in, i.e., what we know about x if we observe y.
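To make the mechanics of (1.4)–(1.6) concrete, here is a minimal numerical sketch in Python/NumPy (our own illustration, not part of the notes): for a discrete random variable x with three states, we combine a prior and a likelihood and normalize by the evidence.

    import numpy as np

    # Hypothetical three-state example: prior p(x) and likelihood p(y = 1 | x).
    prior = np.array([0.5, 0.3, 0.2])          # p(x)
    likelihood = np.array([0.9, 0.5, 0.1])     # p(y = 1 | x)

    evidence = np.sum(likelihood * prior)      # p(y = 1), sum rule (1.4)
    posterior = likelihood * prior / evidence  # Bayes' theorem (1.6)

    print(posterior, posterior.sum())          # the posterior sums to 1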
Remark 2 (Marginal Likelihood)
Thus far, we looked at the marginal likelihood simply as a normalizing constant that
ensures that the posterior probability distribution integrates to 1. In Section 1.7 we will
see that the marginal likelihood also plays an important role in model selection.

1.2.1 Means and Covariances


Mean and (co)variance are often useful to describe properties of probability distri-
butions (expected values and spread).
Definition 2 (Mean and Covariance)
The mean of a random variable x ∈ RD is defined as

Ex[x] = ∫ x p(x) dx = [E[x1], . . . , E[xD]]⊤ ∈ RD ,    (1.7)

where the subscript indicates the corresponding dimension of x.


The expected value of a function of a random variable x ∼ p(x), say g(x), is given by

E[g(x)] = ∫ g(x) p(x) dx .    (1.8)

3
Also called the “probabilistic inverse”
4
Also called the “measurement model”


If we consider two random variables x ∈ RD , y ∈ RE , the covariance between x and


y is defined as

Cov[x, y] = Ex,y [xy ⊤ ] − Ex [x]Ey [y]⊤ = Cov[y, x]⊤ ∈ RD×E . (1.9)

Here, the subscript makes it explicit with respect to which variable we need to average.
The variance of a random variable x ∈ RD with mean vector µ is defined as

Vx[x] = Ex[(x − µ)(x − µ)⊤] = Ex[xx⊤] − Ex[x]Ex[x]⊤    (1.10)

      = [ Cov[x1, x1]  Cov[x1, x2]  . . .  Cov[x1, xD]
          Cov[x2, x1]  Cov[x2, x2]  . . .  Cov[x2, xD]
              ⋮              ⋮        ⋱         ⋮
          Cov[xD, x1]      . . .     . . .  Cov[xD, xD] ]  ∈ RD×D .    (1.11)

This matrix is called the covariance matrix of the random variable x. The covariance
matrix is symmetric and positive definite and tells us something about the spread of the
data.
The covariance matrix contains the variances of the marginals p(xi) = ∫ p(x1, . . . , xD) dx\i
on its diagonal, where “\i” denotes “all variables but i”. The off-diagonal terms contain
the cross-covariance terms Cov[xi, xj] for i, j = 1, . . . , D, i ≠ j.

It generally holds that

Vx [x] = Covx [x, x] . (1.12)
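As a sanity check of Definition 2 (a Python/NumPy sketch of ours, not part of the notes), we can estimate the mean and the covariance matrix from samples and compare with NumPy's built-in estimator:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 3))      # 10000 samples of a 3-dimensional random variable

    mean = X.mean(axis=0)                # estimate of E[x], cf. (1.7)
    Xc = X - mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)   # estimate of V[x] = E[(x - mu)(x - mu)^T], cf. (1.10)

    print(np.allclose(cov, np.cov(X, rowvar=False)))   # matches NumPy's covariance estimator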

1.2.1.1 Sum of Random Variables


Consider two random variables x, y ∈ RD . It holds that

E[x+y] = E[x]+E[y] (1.13)


E[x−y] = E[x]−E[y] (1.14)
V[x+y] = V[x] + V[y]+ Cov[x, y]+ Cov[y, x] (1.15)
V[x−y] = V[x] + V[y]− Cov[x, y]− Cov[y, x] (1.16)

1.2.1.2 Affine Transformation


Mean and (co)variance exhibit some useful properties when it comes to affine trans-
formation of random variables5 . Consider a random variable x with mean µ and
covariance matrix Σ and a (deterministic) affine transformation y = Ax + b of x.
Then, y is itself a random variable whose mean vector and covariance matrix are
given by

Ey [y] = Ex [Ax + b] = AEx [x] + b = Aµ + b , (1.17)


Vy [y] = Vx [Ax + b] = Vx [Ax] = AVx [x]A⊤ = AΣA⊤ , (1.18)
5
The proof is left as an exercise.


respectively.6 Furthermore,

Cov[x, y] = ∫ x(Ax + b)⊤ p(x) dx − E[x] E[Ax + b]⊤    (1.19)
          = ∫ x p(x) dx b⊤ + ∫ xx⊤ p(x) dx A⊤ − µb⊤ − µµ⊤A⊤    (1.20)
          = µb⊤ − µb⊤ + ( ∫ xx⊤ p(x) dx − µµ⊤ ) A⊤    (1.21)
          = ΣA⊤ ,    using (1.10).    (1.22)
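The identities (1.17), (1.18) and (1.22) can also be checked empirically. The following Python/NumPy sketch (our own, with arbitrarily chosen µ, Σ, A, b) compares sample statistics of y = Ax + b with the analytic expressions:

    import numpy as np

    rng = np.random.default_rng(1)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    A = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [3.0, -1.0]])
    b = np.array([0.5, 0.0, -1.0])

    x = rng.multivariate_normal(mu, Sigma, size=1_000_000)
    y = x @ A.T + b                          # y = Ax + b, applied to each sample

    print(y.mean(axis=0), A @ mu + b)        # empirical mean vs. (1.17)
    print(np.cov(y, rowvar=False))           # empirical covariance, compare with (1.18):
    print(A @ Sigma @ A.T)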

1.2.2 Statistical Independence


Definition 3 (Independence)
Two random variables x, y are statistically independent if and only if

p(x, y) = p(x)p(y) . (1.23)

Intuitively, two random variables x and y are independent if the value of y (once
known) does not add any additional information about x (and vice versa).
If x, y are (statistically) independent then

• p(y|x) = p(y)

• p(x|y) = p(x)

• V[x + y] = V[x] + V[y]

• Cov[x, y] = 0

Another concept that is important in machine learning is conditional independence.

Definition 4 (Conditional Independence)


Formally, x and y are conditionally independent given z if and only if

p(x, y|z) = p(x|z)p(y|z) . (1.24)

We write x ⊥ y|z.

Independence can be cast as a special case of conditional independence if we write


x ⊥ y|∅.

1.2.3 Basic Probability Distributions


In the following, we will briefly introduce basic probability distributions.


Figure 1.2: Examples of uniform distributions. Left: Continuous uniform distribution that distributes probability mass equally everywhere in a (compact) region. Right: Discrete uniform distribution that assigns equal probability to four possible (discrete) events.

Figure 1.3: The Bernoulli distribution can be used to model the binary outcome proba-
bility of a coin flip experiment.

1.2.3.1 Uniform Distribution


The uniform distribution is a distribution that assigns equal probability mass in a
region. For a, b ∈ R and a < b, the uniform distribution is defined as

U[a, b] = { 1/(b − a)   if a ≤ x ≤ b
          { 0           otherwise    (1.25)

Mean and variance of the uniform distribution U[a, b] are (a + b)/2 and (b − a)²/12,
respectively.

1.2.3.2 Bernoulli Distribution


The Bernoulli distribution is a distribution for a single binary variable x ∈ {0, 1}
and is governed by a single continuous parameter µ ∈ [0, 1] that represents the
probability of x = 1. The Bernoulli distribution is defined as

p(x|µ) = µ^x (1 − µ)^{1−x} ,   x ∈ {0, 1} ,    (1.26)


6
This can be shown directly by using the definition of the mean and covariance.


Figure 1.4: Examples of the Binomial distribution for µ ∈ {0.1, 0.4, 0.75} and N = 15.

E[x] = µ , (1.27)
V[x] = µ(1 − µ) , (1.28)
where E[x] and V[x] are the mean and variance of the binary random variable x.
An example where the Bernoulli distribution can be used is when we are interested
in modeling the probability of “head” when flipping a coin.

1.2.3.3 Binomial Distribution


The Binomial distribution is a generalization of the Bernoulli distribution to a distri-
bution over integers. In particular, the Binomial can be used to describe the prob-
ability of observing m occurrences of x = 1 in a set of N samples from a Bernoulli
distribution where p(x = 1) = µ ∈ [0, 1]. The Binomial distribution is defined as

p(m | N, µ) = \binom{N}{m} µ^m (1 − µ)^{N−m} ,    (1.29)
E[m] = Nµ ,    (1.30)
V[m] = Nµ(1 − µ) ,    (1.31)
where E[m] and V[m] are the mean and variance of m, respectively.
An example where the Binomial could be used: what is the probability of observing m
“heads” in N coin-flip experiments if the probability of observing a head in a single
experiment is µ?
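As an illustration (a sketch of ours, not from the notes), the Binomial pmf (1.29) and its moments (1.30)–(1.31) can be evaluated directly in Python; binomial_pmf is a hypothetical helper name:

    from math import comb

    def binomial_pmf(m, N, mu):
        """p(m | N, mu) as in (1.29)."""
        return comb(N, m) * mu**m * (1 - mu)**(N - m)

    # Probability of m heads in N = 15 coin flips with p(head) = mu = 0.4
    probs = [binomial_pmf(m, 15, 0.4) for m in range(16)]
    print(sum(probs))                    # the pmf sums to 1
    print(15 * 0.4, 15 * 0.4 * 0.6)      # mean (1.30) and variance (1.31)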

1.2.3.4 Beta Distribution


The Beta distribution is a distribution over a continuous variable µ ∈ [0, 1], which
is often used to represent the probability for some binary event (e.g., the parameter


Figure 1.5: Examples of the Beta distribution for different values of α and β.

governing the Bernoulli distribution). The Beta distribution itself is governed by two
parameters α > 0, β > 0 and is defined as

p(µ | α, β) = Γ(α + β) / (Γ(α)Γ(β)) · µ^{α−1} (1 − µ)^{β−1}    (1.32)
E[µ] = α / (α + β) ,   V[µ] = αβ / ((α + β)²(α + β + 1))    (1.33)

where Γ(·) is the Gamma function defined as

Γ(t) := ∫_0^∞ x^{t−1} exp(−x) dx ,   t > 0 .    (1.34)
Γ(t + 1) = tΓ(t) .    (1.35)

Note that the fraction of Gamma functions in (1.32) normalizes the Beta distribution.
Intuitively, α moves probability mass toward 1, whereas β moves probability mass
toward 0. There are some special cases (Murphy, 2012):

• For α = 1 = β we obtain the uniform distribution U [0, 1].

• For α, β < 1, we get a bimodal distribution with spikes at 0 and 1.

• For α, β > 1, the distribution is unimodal.

• For α, β > 1 and α = β, the distribution is unimodal, symmetric and centered in the interval [0, 1], i.e., the mode/mean is at 1/2.


Figure 1.6: Gaussian distribution of two random variables x, y.

1.2.3.5 Gaussian Distribution


The multivariate Gaussian distribution7 is fully characterized by a mean vector µ
and a covariance matrix Σ and defined as
p(x | µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp( −½ (x − µ)⊤ Σ⁻¹ (x − µ) ) ,    (1.36)

where x ∈ RD is a random variable. We write x ∼ N(x | µ, Σ) or x ∼ N(µ, Σ).
Figure 1.6 shows a bi-variate Gaussian (mesh), with the corresponding contour plot.
The Gaussian distribution is the most important probability distribution8 for continuous-
valued random variables. Its importance originates from the fact that it has many
computationally convenient properties, which we will be discussing in the following.
Application areas in which the Gaussian plays a central role range from signal pro-
cessing (e.g., Kalman filter) to control (e.g., linear quadratic regulator) and ma-
chine learning (e.g., Gaussian processes, principal component analysis, clustering
with Gaussian mixture models and k-means, linear regression, deep learning with
squared errors, variational inference, reinforcement learning).
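For reference, (1.36) translates directly into code. The following Python/NumPy sketch (ours, not part of the notes; gaussian_log_pdf is a hypothetical helper) evaluates the log-density, using a linear solve instead of an explicit matrix inverse:

    import numpy as np

    def gaussian_log_pdf(x, mu, Sigma):
        """log N(x | mu, Sigma), a direct transcription of (1.36)."""
        D = mu.shape[0]
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)            # log |Sigma|
        quad = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
        return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)

    mu = np.zeros(2)
    Sigma = np.array([[1.0, 0.8],
                      [0.8, 2.0]])
    print(np.exp(gaussian_log_pdf(np.array([0.5, -0.3]), mu, Sigma)))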

Conditional and Marginal   Consider the joint Gaussian distribution

p(x, y) = N( [ µx ]   [ Σxx  Σxy ]
             [ µy ] , [ Σyx  Σyy ] )    (1.37)

of two random variables x, y, where Σxx = Cov[x, x] and Σyy = Cov[y, y] are the
marginal covariance matrices of x and y, respectively, and Σxy = Cov[x, y] is the
cross-covariance matrix between x and y.
7
Also: multivariate normal distribution
8
We will be adopting a common, but mathematically slightly sloppy, language and call the “prob-
ability density function” a “distribution”.


Figure 1.7: Gaussian distributions. Left: Univariate (1-dimensional) Gaussians; Right: Multivariate (2-dimensional) Gaussians, viewed from top. Crosses show the mean of the distribution, the red solid (contour) lines represent the 95% confidence bounds. The bottom row shows the Gaussians overlaid with 100 samples. We expect that on average 95/100 samples are within the red contour lines/intervals that indicate the 95% confidence bounds.

The conditional distribution p(x|y) is also Gaussian and given by



p(x|y) = N( µx|y , Σx|y )    (1.38)
µx|y = µx + Σxy Σyy⁻¹ (y − µy)    (1.39)
Σx|y = Σxx − Σxy Σyy⁻¹ Σyx .    (1.40)

Note that in the computation of the mean in (1.39) the y-value is an observation
and no longer random.
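Equations (1.38)–(1.40) are straightforward to implement. The following Python/NumPy sketch (our own; condition_gaussian is a hypothetical helper) computes the conditional p(x|y) of a joint Gaussian:

    import numpy as np

    def condition_gaussian(mu, Sigma, idx_x, idx_y, y_obs):
        """Mean and covariance of p(x | y) for a joint Gaussian, following (1.38)-(1.40)."""
        mu_x, mu_y = mu[idx_x], mu[idx_y]
        S_xx = Sigma[np.ix_(idx_x, idx_x)]
        S_xy = Sigma[np.ix_(idx_x, idx_y)]
        S_yy = Sigma[np.ix_(idx_y, idx_y)]
        K = S_xy @ np.linalg.inv(S_yy)          # Sigma_xy Sigma_yy^{-1}
        mu_cond = mu_x + K @ (y_obs - mu_y)     # (1.39)
        Sigma_cond = S_xx - K @ S_xy.T          # (1.40), using Sigma_yx = Sigma_xy^T
        return mu_cond, Sigma_cond

    mu = np.array([0.0, 1.0, -1.0])
    Sigma = np.array([[2.0, 0.3, 0.5],
                      [0.3, 1.0, 0.2],
                      [0.5, 0.2, 1.5]])
    # Condition the first two dimensions (x) on observing the third (y = 0.5);
    # the marginal of x would simply be N(mu[[0, 1]], Sigma[:2, :2]).
    print(condition_gaussian(mu, Sigma, [0, 1], [2], np.array([0.5])))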

Remark 3
The conditional Gaussian distribution shows up in many places, where we are interested
in posterior distributions:

• The Kalman filter (Kalman, 1960), one of the most central algorithms for state es-
timation in signal processing, does nothing but computing Gaussian conditionals
of joint distributions (Deisenroth and Ohlsson, 2011).


Figure 1.8: The conditional distribution of a Gaussian is also Gaussian.

Figure 1.9: The marginal of a joint Gaussian distribution is Gaussian.

• Gaussian processes (Rasmussen and Williams, 2006), which are a practical im-
plementation of a distribution over functions. In a Gaussian process, we make
assumptions of joint Gaussianity of random variables. By (Gaussian) condition-
ing on observed data, we can determine a posterior distribution over functions.
• Latent linear Gaussian models (Roweis and Ghahramani, 1999; Murphy, 2012),
which include probabilistic PCA (Tipping and Bishop, 1999).

The marginal distribution p(x)9 of a joint Gaussian distribution p(x, y), see (1.37),
is itself Gaussian and computed by applying the sum-rule in (1.4) and given by

p(x) = ∫ p(x, y) dy = N(x | µx , Σxx) .    (1.41)

Intuitively, looking at the joint distribution in (1.37), we ignore (i.e., integrate out)
everything we are not interested in.

Product of Gaussians   The product of two Gaussians N(x | a, A) N(x | b, B) is
an unnormalized Gaussian distribution c N(x | c, C) with

C = (A⁻¹ + B⁻¹)⁻¹    (1.42)


9
The same holds for p(y).


c = C(A⁻¹a + B⁻¹b)    (1.43)
c = (2π)^{−D/2} |A + B|^{−1/2} exp( −½ (a − b)⊤ (A + B)⁻¹ (a − b) ) .    (1.44)

Note that the normalizing constant c itself is a Gaussian either in a or in b with an
“inflated” covariance matrix A + B, i.e., c = N(a | b, A + B) = N(b | a, A + B).

Linear Transformation of Gaussian Random Variables   A linear (or affine) trans-
formation of a Gaussian random variable is Gaussian distributed.

• If p(x) = N(x | µ, Σ) and y = Ax, then p(y) = N(y | Aµ, AΣA⊤).

• If p(y) = N(y | Ax, Σ), then we can re-write this as a probability distribution
in x: If A were invertible, we could write x = A⁻¹y. However, A is not
generally invertible. Therefore, we perform a transformation with A⊤:

y = Ax ⇔ A⊤ y = A⊤ Ax . (1.45)

A⊤ A is symmetric and positive definite10 , i.e., it can be inverted. Then,

y = Ax ⇔ (A⊤ A)−1 A⊤ y = x . (1.46)

Hence, x is a linear transformation of y, and we obtain



p(x) = N( x | (A⊤A)⁻¹A⊤y , (A⊤A)⁻¹A⊤ΣA(A⊤A)⁻¹ ) .    (1.47)

Sampling from Multivariate Gaussian Distributions Assume we are interested in


generating samples xi , i = 1, . . . , n, from a multivariate Gaussian distribution with
mean µ and covariance matrix Σ. However, we only have access to a sampler that
allows us to generate samples from the standard normal N(0, I).
To obtain samples from a multivariate normal N(µ, Σ), we can use the properties
of a linear transformation of a Gaussian random variable: If x ∼ N(0, I) then
y = Ax + µ, where AA⊤ = Σ, is Gaussian distributed with mean µ and covariance
matrix Σ. We call Σ = AA⊤ the Cholesky factorization of Σ.11 The idea of
an affine transformation of N(0, I)-distributed random variables is an example of
the reparametrization trick, which is key to propagating gradient information in
the variational auto-encoder (Jimenez Rezende et al., 2014; Kingma and Welling,
2013).12
10 Actually, only positive semi-definite, but with mild assumptions we arrive at positive definite.
11 To compute the Cholesky factorization of a matrix, it is required that the matrix is symmetric and positive definite. Covariance matrices possess this property.
12 An excellent blog post about the reparametrization trick is available at http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/.
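The sampling recipe above is only a few lines with NumPy. This is our own illustrative sketch (not part of the notes):

    import numpy as np

    rng = np.random.default_rng(42)
    mu = np.array([1.0, -1.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])

    A = np.linalg.cholesky(Sigma)            # Sigma = A A^T
    z = rng.standard_normal((100000, 2))     # samples from N(0, I)
    samples = z @ A.T + mu                   # y = A x + mu  is distributed as  N(mu, Sigma)

    print(samples.mean(axis=0))              # close to mu
    print(np.cov(samples, rowvar=False))     # close to Sigma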


Figure 1.10: Gamma distribution for different values of a, b.

Sum of Independent Gaussian Random Variables If x, y are independent Gaus-


sian random variables (i.e., the joint is given as p(x, y) = p(x)p(y)) with p(x) =
N(x | µx, Σx) and p(y) = N(y | µy, Σy), then x + y is also Gaussian distributed and
given by

p(x + y) = N(µx + µy , Σx + Σy) .    (1.48)

Knowing that p(x + y) is Gaussian, the mean and covariance matrix can be de-
termined immediately using the results from (1.13)–(1.16). This property will be
important when we consider i.i.d. Gaussian noise acting on random variables.

1.2.3.6 Gamma Distribution


The Gamma distribution is a distribution over positive real numbers τ > 0. The
Gamma distribution is itself governed by two parameters a, b > 0, where a is a shape
parameter and b a rate (inverse scale) parameter of the corresponding density.
The Gamma distribution is defined as
p(τ | a, b) = (1 / Γ(a)) b^a τ^{a−1} exp(−bτ) ,    (1.49)
E[τ] = a / b ,    (1.50)
V[τ] = a / b² .    (1.51)
The Gamma distribution is often used as a prior on the precision (inverse variance)
of a univariate Gaussian distribution.
Remark 4
Other distributions are special cases of the Gamma distribution (Murphy, 2012): The
Exponential distribution with parameter λ is obtained for (a, b) = (1, λ). The expo-
nential distribution describes the time between events in a Poisson process. The Erlang


distribution is a Gamma distribution with a ∈ N. In the Chi-Squared distribution,


which is the distribution of a sum of squared standard Gaussian random variables, the scale parameter
b is fixed to b = 1/2.

1.2.3.7 Wishart Distribution


The Wishart distribution is the multivariate generalization of the Gamma distribu-
tion: It is a family of probability distributions defined over symmetric, nonnegative-
definite matrix-valued random variables (“random matrices”).
For D × D matrices the Wishart distribution is defined as
W(Σ | W, ν) = B |Σ|^{(ν−D−1)/2} exp( −½ tr(W⁻¹Σ) )    (1.52)
where ν is called the number of degrees of freedom of the distribution, and W
is a D × D scale matrix. B is a normalization constant (Bishop, 2006, p. 102) that
ensures that the distribution is normalized.
These distributions are of great importance in the estimation of covariance matrices
in multivariate statistics: The Wishart distribution is the conjugate prior for the pre-
cision matrix (inverse covariance matrix) of a Gaussian-distributed random variable
x ∼ N(µ, Σ).

1.2.4 Conjugacy
According to Bayes’ theorem (1.6), the posterior is proportional to the product of the
prior and the likelihood. The specification of the prior can be tricky for two reasons:
First, the prior should encapsulate our knowledge about the problem before we see
some data. This is often difficult to describe. Second, it is often not possible to
compute the posterior distribution analytically. However, there are some priors that
are computationally convenient: conjugate priors.
Definition 5 (Conjugate Prior)
A prior is conjugate for the likelihood function if the posterior is of the same form/type
as the prior.

Example (Beta-Binomial Conjugacy)


Consider a Binomial random variable x ∼ Bin(m|N, µ) where

p(x | µ, N) = \binom{N}{m} µ^m (1 − µ)^{N−m} ∝ µ^a (1 − µ)^b    (1.53)

for some constants a, b. We place a Beta prior on the parameter µ:

Beta(µ | α, β) = Γ(α + β) / (Γ(α)Γ(β)) · µ^{α−1} (1 − µ)^{β−1} ∝ µ^{α−1} (1 − µ)^{β−1}    (1.54)
If we now observe some outcomes x = (x1 , . . . , xN ) of a repeated coin-flip experi-
ment with h heads and t tails, we compute the posterior distribution on µ as
p(µ | x = h) ∝ p(x | µ) p(µ | α, β) = µ^h (1 − µ)^t µ^{α−1} (1 − µ)^{β−1}    (1.55)


= µ^{h+α−1} (1 − µ)^{t+β−1} ∝ Beta(h + α, t + β)    (1.56)

i.e., the posterior distribution is a Beta distribution like the prior: the Beta prior is
conjugate for the parameter µ in the Binomial likelihood function.
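The conjugate update can be verified numerically. The following Python/NumPy sketch (ours; the prior parameters and data are made up) compares the closed-form posterior Beta(h + α, t + β) from (1.56) with a brute-force normalization of likelihood × prior on a grid:

    import numpy as np

    alpha, beta = 2.0, 2.0                   # Beta prior parameters
    h, t = 7, 3                              # observed heads and tails

    # Conjugate update, cf. (1.56): the posterior is Beta(h + alpha, t + beta).
    alpha_post, beta_post = h + alpha, t + beta

    # Brute-force check: normalize likelihood * prior on a grid of mu values.
    mu = np.linspace(1e-6, 1 - 1e-6, 2001)
    unnorm = mu**h * (1 - mu)**t * mu**(alpha - 1) * (1 - mu)**(beta - 1)
    dmu = mu[1] - mu[0]
    post = unnorm / (unnorm.sum() * dmu)

    print((mu * post).sum() * dmu)                 # numerical posterior mean
    print(alpha_post / (alpha_post + beta_post))   # analytic posterior mean, E[mu] from (1.33)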

Table 1.1 lists examples for conjugate priors for the parameters of some of the stan-
dard distributions that we discussed in this section.

Table 1.1: Standard examples of conjugate priors.

Conjugate prior             Likelihood     Posterior
Beta                        Bernoulli      Beta
Beta                        Binomial       Beta
Gaussian/inverse Gamma      Gaussian       Gaussian/inverse Gamma
Gaussian/inverse Wishart    Gaussian       Gaussian/inverse Wishart
Dirichlet                   Multinomial    Dirichlet

The Beta distribution is the conjugate prior for the parameter µ in both the Binomial
and the Bernoulli likelihood. For a Gaussian likelihood function, we can place a con-
jugate Gaussian prior on the mean. The reason why the Gaussian likelihood appears
twice in the table is that we need to distinguish the univariate from the multivariate
case. In the univariate (scalar) case, the inverse Gamma is the conjugate prior for
the variance13 . In the multivariate case, we use a conjugate inverse Wishart distribu-
tion as a prior on the covariance matrix14 . The Dirichlet distribution is the conjugate
prior for the multinomial likelihood function. For further details, we refer to Bishop
(2006).

1.3 Probabilistic Graphical Models


A graphical model captures the way in which the joint distribution over all random
variables can be decomposed into a product of factors depending only on a subset of
these variables.
There are three main types of graphical models:

• Directed graphical models (Bayesian networks)

• Undirected graphical models (Markov random fields)

• Factor graphs
13
Alternatively, the Gamma prior is conjugate for the precision (inverse variance) in the Gaussian
likelihood.
14
Alternatively, the Wishart prior is conjugate for the precision matrix (inverse covariance matrix)
in the Gaussian likelihood.


Figure 1.11: Three types of graphical models: (a) Directed graphical models (Bayesian network); (b) Undirected graphical models (Markov random field); (c) Factor graphs.

Nodes are (random) variables, edges represent probabilistic relations between vari-
ables. In this course, we will focus on directed graphical models.15
Probabilistic graphical models have some convenient properties:
• They are a simple way to visualize the structure of a probabilistic model
• They can be used to design or motivate new kind of statistical models
• Inspection of the graph alone gives us insight into properties, e.g., conditional
independence
• Complex computations for inference and learning in statistical models can be
expressed in terms of graphical manipulations.

1.3.1 From Joint Distributions to Graphs


Consider the joint distribution
p(a, b, c) = p(c|a, b)p(b|a)p(a) (1.57)
of three random variables a, b, c. The factorization of the joint in (1.57) tells us some-
thing about the relationship between the random variables:
• c depends directly on a and b
• b depends directly on a
• a depends neither on b nor on c
We can build the corresponding directed graphical model as follows:
1. Create a node for all random variables
2. For each conditional distribution, we add a directed link (arrow) to the graph
from the nodes corresponding to the variables on which the distribution is
conditioned on
Note that the graph layout depends on the choice of factorization. For the factoriza-
tion in (1.57), we obtain the directed graphical model in Fig. 1.12.
15
“Data Analysis and Probabilistic Inference” (CO-493) will discuss all three types of graphical
models.


Figure 1.12: Directed graphical model for the factorization of the joint distribution in (1.57).

Figure 1.13: Directed graphical model for which we seek the corresponding joint distribution and its factorization.

1.3.2 From Graphs to Joint Distributions


In the following, we will discuss how to extract the joint distribution of a set of
random variables from a given graphical model. We will immediately look at the
graphical model in Fig. 1.13, and exploit two observations:

• The joint distribution p(x1 , . . . , x5 ) we seek is the product of a set of condition-


als, one for each node in the graph. In this particular example, we will need
five conditionals.

• Each conditional depends only on the parents of the corresponding node in the
graph. For example, x4 will be conditioned on x2 .

Using these two properties, we arrive at the desired factorization of the joint distri-
bution

p(x1 , x2 , x3 , x4 , x5 ) = p(x1 )p(x5 )p(x2 |x5 )p(x3 |x1 , x2 )p(x4 |x2 ) . (1.58)

In general, the joint distribution p(x) = p(x1 , . . . , xK ) is given as

p(x) = ∏_{k=1}^{K} p(xk | pak)    (1.59)

where pak means “the parent nodes of xk ”.
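To make (1.58)–(1.59) concrete, here is a small Python sketch of ours (all conditional probability tables are made-up numbers) that assembles the joint of Fig. 1.13 from its factors and checks that it normalizes:

    from itertools import product

    # Hypothetical conditional probability tables for binary x1, ..., x5 (numbers made up).
    # Each entry gives p(x_k = 1 | parents); cf. the factorization in (1.58).
    p_x1 = 0.3
    p_x5 = 0.6
    p_x2_given_x5 = {0: 0.2, 1: 0.7}
    p_x3_given_x1x2 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}
    p_x4_given_x2 = {0: 0.3, 1: 0.8}

    def bern(p, x):
        """p(x) for a binary variable with p(x = 1) = p."""
        return p if x == 1 else 1 - p

    def joint(x1, x2, x3, x4, x5):
        """p(x1, ..., x5) = p(x1) p(x5) p(x2|x5) p(x3|x1, x2) p(x4|x2), cf. (1.58)."""
        return (bern(p_x1, x1) * bern(p_x5, x5) * bern(p_x2_given_x5[x5], x2)
                * bern(p_x3_given_x1x2[(x1, x2)], x3) * bern(p_x4_given_x2[x2], x4))

    # Sanity check: the joint sums to 1 over all 2^5 configurations.
    print(sum(joint(*cfg) for cfg in product([0, 1], repeat=5)))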


Example (Linear Regression)


We seek the graphical model representation for the linear regression setting

yn = xn⊤θ + ε ,   ε ∼ N(0, σ²) ,    (1.60)

for n = 1, . . . , N , where yn are observed variables.


In this example, we have three different types of “variables”:

• Unknown random variables θ

• Observed random variables yn

• Deterministic parameters xn , σ, which are fixed

To make the distinction between these three types easier, we introduce additional
nodes for graphical models:

• Shaded nodes represent observed random variables

• Dots represent deterministic parameters

To find the directed graphical model, for all (observed and unobserved) random
variables we write down all probability distributions with explicit conditioning on
the parameters/variables they depend on. In our case, we end up with:

• p(yn |xn , θ, σ)

• p(θ)

This gives us the joint distribution of all random variables

p(y1 , . . . , yN , θ | x1 , . . . , xN , σ) = p(θ) ∏_{n=1}^{N} p(yn | xn , θ, σ) .    (1.61)

Now, we follow the steps in Section 1.3.1 and find the graphical model in Fig. 1.14(a).
Observed random variables are shaded, deterministic parameters are dots, unob-
served random variables are “hollow”. The graphical model is somewhat repetitive,
and we can write it in a more compact form using the plate nota-
tion in Fig. 1.14(b). The plate essentially can be read as “for n = 1, . . . , N locally
copy/repeat everything inside the plate”. Therefore, the plate replaces the dots in
Fig. 1.14(a). Note that the parameter σ for the noise and the random variable θ are
“global” and, therefore, outside the plate.


Figure 1.14: Two graphical models for linear regression. Observed random variables are shaded, deterministic parameters are dots. (a) Graphical model without plate notation; (b) Graphical model with plate notation, which allows for a more compact representation than (a).

Figure 1.15: Examples of message passing using graphical models: (a) Microsoft’s TrueSkill system (Herbrich et al., 2007) is used for ranking in online video games. (b) Image restoration (Kittler and Föglein, 1984) is used to remove noise from images.

1.3.3 Further Reading


A good and extensive introduction to probabilistic graphical models can be found in
the book by Koller and Friedman (2009).
Directed graphical models allow us to find conditional independence relationship
properties of the joint distribution only by looking at the graph. A concept called
d-separation (Pearl, 1988) is key to this. D-separation will be discussed in more
detail in “Data Analysis and Probabilistic Inference” (CO-493).
Graphical models allow for graph-based algorithms for inference and learning, e.g.,
via local message passing. Applications range from ranking in online games (Her-
brich et al., 2007) and computer vision (e.g., image segmentation, semantic labeling,
image de-noising, image restoration (Sucar and Gillies, 1994; Shotton et al., 2006;
Szeliski et al., 2008; Kittler and Föglein, 1984)) to coding theory (McEliece et al.,
1998), solving linear equation systems (Shental et al., 2008) and iterative Bayesian
state estimation in signal processing (Bickson et al., 2007; Deisenroth and Mohamed,
2012).


Figure 1.16: The average incline of a function f between x0 and x0 + δx is the incline of the secant (blue) through f(x0) and f(x0 + δx) and given by δy/δx.

1.4 Vector Calculus


Now, we return to the linear regression setting outlined in Section 1.1: We are in-
terested in finding “good” parameters for the linear regression model. Finding good
parameters can be phrased as an optimization problem16 . We will use gradient-
based approaches for solving this optimization problem. Hence, in this section, we
will be looking at differentiation of multi-variate functions with respect to parameter
vectors.

1.4.1 Scalar Differentiation


In the following, we briefly revisit differentiation of a univariate function, i.e., the
differentiation we know from school. We start with the difference quotient of a
univariate function y = f (x), x, y ∈ R, which we will subsequently use to define
derivatives.
Definition 6 (Difference Quotient)
The difference quotient

δy/δx = (f(x + δx) − f(x)) / δx    (1.62)

computes the slope of the secant line through two points on the graph of f . These are
the points with x-coordinates x and x + δx, see Fig. 1.16.

The difference quotient can also be considered the average slope of f between x and
x + δx if we assume f to be linear. In the limit for δx → 0, we obtain the tangent
of f at x, if f is differentiable. The tangent is then the derivative of f at x.
16
We will do this in Section 1.5.


Definition 7 (Derivative)
More formally, for h > 0 the derivative of f at x is defined as the limit

df/dx = lim_{h→0} (f(x + h) − f(x)) / h ,    (1.63)

and the secant in Fig. 1.16 becomes a tangent.

Example (Derivative of a Polynomial)


We want to compute the derivative of f(x) = x^n, n ∈ N. From school, we already
know that the answer will be nx^{n−1}, but we want to derive this result using the
definition of the gradient as the limit of the difference quotient.
By using (1.63), we obtain

df/dx = lim_{h→0} (f(x + h) − f(x)) / h    (1.64)
      = lim_{h→0} ((x + h)^n − x^n) / h    (1.65)
      = lim_{h→0} ( ∑_{i=0}^{n} \binom{n}{i} x^{n−i} h^i − x^n ) / h .    (1.66)

We see that x^n = \binom{n}{0} x^{n−0} h^0. By starting the sum at 1 the x^n-term cancels, and we
obtain

df/dx = lim_{h→0} ( ∑_{i=1}^{n} \binom{n}{i} x^{n−i} h^i ) / h    (1.67)
      = lim_{h→0} ∑_{i=1}^{n} \binom{n}{i} x^{n−i} h^{i−1}    (1.68)
      = lim_{h→0} ( \binom{n}{1} x^{n−1} + ∑_{i=2}^{n} \binom{n}{i} x^{n−i} h^{i−1} )    (1.69)
        (the sum over i ≥ 2 → 0 as h → 0)
      = n!/(1!(n − 1)!) · x^{n−1} = n x^{n−1} .    (1.70)

1.4.1.1 Taylor Series

The Taylor series is a representation of a function f as an (infinite) sum of terms.


These terms are derivatives of f , evaluated at a point x0 .


Definition 8 (Taylor Series)


For a function f : R → R, the Taylor series of a smooth17 function f ∈ C ∞ at x0 is
defined as

f(x) = ∑_{k=0}^{∞} f^{(k)}(x0)/k! · (x − x0)^k    (1.71)

where f^{(k)}(x0) is the kth derivative of f at x0 and f^{(k)}(x0)/k! are the coefficients of the
polynomial. For x0 = 0, we obtain the Maclaurin series, a special instance of the
Taylor series.
Definition 9 (Taylor Polynomial)
The Taylor polynomial of degree n of f at x0 contains the first n + 1 components of
the series in (1.71) and is defined as

Tn(x) = ∑_{k=0}^{n} f^{(k)}(x0)/k! · (x − x0)^k ,    (1.72)

i.e., the Taylor polynomial consists of the first n terms of the Taylor series in (1.71).
Remark 5
In general, a Taylor polynomial of degree n is an approximation of a function. The
approximation is best close to x0 . However, a Taylor polynomial of degree n is an exact
representation of a polynomial f of degree k ≤ n since all derivatives f (i) , i > k vanish.

Example (Taylor Polynomial)


We consider the polynomial

f(x) = x⁴    (1.73)

and seek the Taylor polynomial T6, evaluated at x0 = 1. We start by computing the
coefficients f^{(k)}(1) for k = 0, . . . , 6:

f(1) = 1    (1.74)
f′(1) = 4    (1.75)
f′′(1) = 12    (1.76)
f^{(3)}(1) = 24    (1.77)
f^{(4)}(1) = 24    (1.78)
f^{(5)}(1) = 0    (1.79)
f^{(6)}(1) = 0    (1.80)

Therefore, the desired Taylor polynomial is

T6(x) = ∑_{k=0}^{6} f^{(k)}(x0)/k! · (x − x0)^k
      = 1 + 4(x − 1) + 6(x − 1)² + 4(x − 1)³ + (x − 1)⁴ + 0    (1.81)
17
i.e., f ∈ C ∞ (infinitely often continuously differentiable)


Figure 1.17: Taylor polynomials. The original function f(x) = sin(x) + cos(x) (black, solid) is approximated by Taylor polynomials (dashed) around x0 = 0. Higher-order Taylor polynomials approximate the function f better and more globally. T10 is already very close to f in [−4, 4].

Multiplying out and re-arranging yields

T6(x) = (1 − 4 + 6 − 4 + 1) + x(4 − 12 + 12 − 4) + x²(6 − 12 + 6) + x³(4 − 4) + x⁴    (1.82)
      = x⁴ = f(x) ,    (1.83)

i.e., we obtain an exact representation of the original function.

Example (Taylor Series)

Consider the smooth function

f (x) = sin(x) + cos(x) ∈ C ∞ . (1.84)

We seek a Taylor series expansion of f at x0 = 0 (which is the Maclaurin series


expansion of f ).

f (0) = sin(0) + cos(0) = 1 (1.85)


f ′ (0) = cos(0) − sin(0) = 1 (1.86)
f ′′ (0) = − sin(0) − cos(0) = −1 (1.87)
f (3) (0) = − cos(0) + sin(0) = −1 (1.88)


f (4) (0) = sin(0) + cos(0) = f (0) = 1 (1.89)


. . .    (1.90)

We can see a pattern here: The coefficients in our Taylor series are only ±1 (since
sin(0) = 0), each of which occurs twice before switching to the other one. Further-
more, f (k+4) (0) = f (k) (0).
Therefore, the full Taylor series expansion of f at x0 = 0 is given by

f(x) = ∑_{k=0}^{∞} f^{(k)}(x0)/k! · (x − x0)^k    (1.91)
     = 1 + x − (1/2!) x² − (1/3!) x³ + (1/4!) x⁴ + (1/5!) x⁵ − · · ·    (1.92)
     = 1 − (1/2!) x² + (1/4!) x⁴ ∓ · · · + x − (1/3!) x³ + (1/5!) x⁵ ∓ · · ·    (1.93)
     = ∑_{k=0}^{∞} (−1)^k x^{2k+1}/(2k + 1)! + ∑_{k=0}^{∞} (−1)^k x^{2k}/(2k)!    (1.94)
     = sin(x) + cos(x)    (1.95)

Figure 1.17 shows the corresponding first Taylor polynomials Tn for n = 0, 1, 5, 10.
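The behaviour shown in Figure 1.17 can be reproduced numerically. The following Python/NumPy sketch (ours; taylor_poly is a hypothetical helper) evaluates the Taylor polynomials (1.72) of f(x) = sin(x) + cos(x) around x0 = 0 and prints the maximal approximation error on [−4, 4]:

    import numpy as np
    from math import factorial

    def taylor_poly(derivs_at_x0, x0, x):
        """Evaluate the Taylor polynomial (1.72) from the derivatives f^(k)(x0), k = 0, ..., n."""
        return sum(d / factorial(k) * (x - x0)**k for k, d in enumerate(derivs_at_x0))

    def derivs(n):
        """Derivatives of f(x) = sin(x) + cos(x) at x0 = 0 repeat with period 4: 1, 1, -1, -1, ..."""
        pattern = [1.0, 1.0, -1.0, -1.0]
        return [pattern[k % 4] for k in range(n + 1)]

    x = np.linspace(-4, 4, 401)
    f = np.sin(x) + np.cos(x)
    for n in [0, 1, 5, 10]:
        Tn = taylor_poly(derivs(n), 0.0, x)
        print(n, np.max(np.abs(Tn - f)))     # maximal error of T_n on [-4, 4]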

1.4.1.2 Differentiation Rules


We now state basic differentiation rules, where we denote the derivative of f by f ′ .

Product Rule: (f(x)g(x))′ = f′(x)g(x) + f(x)g′(x)    (1.96)
Quotient Rule: (f(x)/g(x))′ = (f′(x)g(x) − f(x)g′(x)) / (g(x))²    (1.97)
Sum Rule: (f + g)′ = f′ + g′    (1.98)
Chain Rule: (g(f(x)))′ = (g ◦ f)′ = g′(f) f′    (1.99)

1.4.2 Partial Differentiation and Gradients


Ordinary differentiation df dx
applies to functions f of a scalar variable x ∈ R, see
Chapter 1.4.1 for an introduction. In the following, we consider the general case
where the function f depends on one or more variables x ∈ Rn , e.g., f (x) =
f (x1 , x2 ). The generalization of the derivative to functions of several variables is
the gradient.
We find the gradient of the function f with respect to x by varying one variable at a
time and keeping the others constant. The gradient is then the vector of these partial
derivatives.


Definition 10 (Partial Derivative)


For a function f : Rn → R, x ↦ f(x), x ∈ Rn of n variables x1, . . . , xn we define the
partial derivatives as

∂f/∂x1 = lim_{h→0} (f(x1 + h, x2, . . . , xn) − f(x)) / h    (1.100)
   ⋮
∂f/∂xn = lim_{h→0} (f(x1, . . . , xn−1, xn + h) − f(x)) / h    (1.101)

and collect them in the row vector

∇x f = [ ∂f(x)/∂x1   ∂f(x)/∂x2   · · ·   ∂f(x)/∂xn ] ∈ R1×n ,    (1.102)
where n is the number of variables and 1 is the dimension of the image of f .18 Here, we
used the compact vector notation x = [x1 , . . . , xn ]⊤ .

Remark 6 (Gradient as a Row Vector)


It is not uncommon in the literature to define the gradient vector as a column vector,
following the convention that vectors are generally column vectors. The reason why
we define the gradient vector as a row vector is twofold: First, we can consistently
generalize the gradient to a setting where f : Rn → Rm no longer maps onto the
real line (then the gradient becomes a matrix). Second, we can immediately apply the
chain-rule without paying attention to the dimension of the gradient. We will discuss
both points later.

Remark 7 (Verifying the Correctness of the Gradient Implementation)


The definition of the gradient as the limit of the difference quotient can be exploited
when checking gradients in computer programs: When we compute gradients and im-
plement them, we can use finite differences to numerically test our computation and
implementation: We choose the value h to be small (e.g., h = 10−4 ) and compare the
finite-difference approximation from Definition 10 with our (analytic) implementation
of the gradient. If the error is small, our gradient implementation is probably correct. “Small” could mean that

√( Σ_i (dh_i − df_i)² / Σ_i (dh_i + df_i)² ) < 10⁻³ ,

where dh_i is the finite-difference approximation and df_i is the analytic gradient of f with respect to the ith variable x_i.
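A minimal sketch of this check in Python/NumPy (added here for illustration; the test function, test point, and threshold are arbitrary choices, not part of the original notes):

import numpy as np

def relative_gradient_error(f, grad_f, x, h=1e-4):
    """Compare an analytic gradient with a forward finite-difference approximation."""
    dh = np.zeros_like(x)                        # finite-difference approximation
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        dh[i] = (f(x + e) - f(x)) / h
    df = grad_f(x)                               # analytic gradient
    return np.sqrt(np.sum((dh - df)**2) / np.sum((dh + df)**2))

# Example: f(x1, x2) = x1^2 + x2^2 with analytic gradient (2 x1, 2 x2)
f = lambda x: x[0]**2 + x[1]**2
grad_f = lambda x: np.array([2 * x[0], 2 * x[1]])
print(relative_gradient_error(f, grad_f, np.array([1.0, 2.0])) < 1e-3)   # True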

Example (Partial Derivative)


For f(x1, x2) = x1² x2 + x1 x2³ ∈ R, the partial derivatives (i.e., the derivatives of f with
respect to x1 and x2) are

∂f(x1, x2)/∂x1 = 2 x1 x2 + x2³   (1.103)
∂f(x1, x2)/∂x2 = x1² + 3 x1 x2²   (1.104)
18
Note that we currently consider only functions f : Rn → R1 .


Figure 1.18: f(x, y) = x² + y²

where ∂ indicates that it is a partial derivative, and the gradient is then


df/dx = [ ∂f(x1, x2)/∂x1   ∂f(x1, x2)/∂x2 ] = [ 2 x1 x2 + x2³   x1² + 3 x1 x2² ] ∈ R^{1×2} .   (1.105)

Example
Consider the function f (x, y) = x2 + y 2 (see Fig. 1.18). We obtain the partial deriva-
tive ∂f /∂x by treating y as a constant and computing the derivative of f with respect
to x. We then obtain
∂f(x, y)/∂x = 2x .   (1.106)

Similarly, we obtain the partial derivative of f with respect to y as

∂f(x, y)/∂y = 2y .   (1.107)

Example
For f(x, y) = (x + 2y³)², we obtain the partial derivatives

∂f(x, y)/∂x = 2(x + 2y³) · ∂/∂x (x + 2y³) = 2(x + 2y³) ,   (1.108)

∂f(x, y)/∂y = 2(x + 2y³) · ∂/∂y (x + 2y³) = 12(x + 2y³) y² .   (1.109)

1.4.3 Gradients of Vector-Valued Functions (Vector Fields)


Thus far, we discussed (partial) derivatives of functions f : Rn → R. In the fol-
lowing, we will generalize the concept of the gradient to vector-valued functions
f : Rn → Rm , where n, m ≥ 1.
For f : Rn → Rm and a vector x = [x1 , . . . , xn ]⊤ ∈ Rn , the corresponding vector of
function values is given as
 
f(x) = [ f1(x), …, fm(x) ]⊤ ∈ R^m .   (1.110)

Writing the vector-valued function in this way allows us to view a vector-valued


function f : Rn → Rm as a vector of functions [f1 , . . . , fm ]⊤ , fi : Rn → R that map
onto R. The differentiation rules for every fi are exactly the ones we discussed in
Section 1.4.2.
Therefore, the partial derivative of a vector-valued function f : Rn → Rm with
respect to xi ∈ R, i = 1, . . . n is given as the vector
∂f/∂xi = [ ∂f1/∂xi, …, ∂fm/∂xi ]⊤
       = [ lim_{h→0} ( f1(x1, …, xi−1, xi + h, xi+1, …, xn) − f1(x) ) / h ,
           … ,
           lim_{h→0} ( fm(x1, …, xi−1, xi + h, xi+1, …, xn) − fm(x) ) / h ]⊤ ∈ R^m .   (1.111)

From (1.102), we know that we obtain the gradient of f with respect to a vector
as the row vector of the partial derivatives. In (1.111), every partial derivative is a
column vector. Therefore, we obtain the gradient of f : Rn → Rm with respect to
x ∈ Rn as

df(x)/dx = [ ∂f(x)/∂x1   ⋯   ∂f(x)/∂xn ]   (1.112)

           ⎡ ∂f1(x)/∂x1  ⋯  ∂f1(x)/∂xn ⎤
         = ⎢      ⋮               ⋮     ⎥ ∈ R^{m×n} .   (1.113)
           ⎣ ∂fm(x)/∂x1  ⋯  ∂fm(x)/∂xn ⎦

Definition 11 (Jacobian)
The matrix (or vector) of all first-order partial derivatives of a vector-valued function
f : Rn → Rm is called the Jacobian. The Jacobian J is an m × n matrix, which we


define and arrange as follows:

                          ⎡ ∂f1(x)/∂x1  ⋯  ∂f1(x)/∂xn ⎤
J = ∇x f = df(x)/dx = [ ∂f(x)/∂x1 ⋯ ∂f(x)/∂xn ] = ⎢      ⋮               ⋮     ⎥ ,   x = [x1, …, xn]⊤ ,   (1.114)
                          ⎣ ∂fm(x)/∂x1  ⋯  ∂fm(x)/∂xn ⎦

J(i, j) = ∂fi/∂xj .   (1.115)

In particular, a function f : Rn → R1, which maps a vector x = [x1, …, xn]⊤ ∈ Rn
onto a scalar (e.g., f(x) = Σ_i xi), possesses a Jacobian that is a row vector (matrix of
dimension 1 × n), see (1.102).
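As an illustration (Python/NumPy, not part of the original notes; the test function is an arbitrary choice), a finite-difference approximation of the Jacobian of a vector-valued function might be sketched as follows:

import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Finite-difference approximation of the m x n Jacobian of f: R^n -> R^m."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - fx) / h      # j-th column: partial derivative w.r.t. x_j
    return J

# Example: f(x) = [x1^2 + x2, sin(x2)] has Jacobian [[2 x1, 1], [0, cos(x2)]]
f = lambda x: np.array([x[0]**2 + x[1], np.sin(x[1])])
print(numerical_jacobian(f, np.array([1.0, 0.5])))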

1.4.3.1 Gradients of Matrices


We will encounter situations where we need to take gradients of matrices with re-
spect to vectors (or other matrices), which results in a multi-dimensional tensor. For
example, if we compute the gradient of an m × n matrix with respect to a p × q
matrix, the resulting Jacobian would be (p × q) × (m × n), i.e., a four-dimensional
tensor (or array). Since matrices represent linear mappings, we can exploit the fact
that there is an isomorphism (linear, invertible mapping) between the space Rm×n of
m × n matrices and the space Rmn of mn vectors. Therefore, we can re-shape19 our
matrices into vectors of lengths mn and pq, respectively. The gradient using these
re-shaped matrices results in a Jacobian of size pq × mn. Figure 1.19 visualizes both
approaches. In practical applications, it is often desirable to re-shape the matrix into
a vector and continue working with this Jacobian matrix: The chain-rule boils down
to simple matrix multiplication, whereas in the case of a Jacobian tensor, we will
need to pay more attention to what dimensions we need to sum out.

Example (Gradient of Matrices with Respect to Matrices)


Let us consider a matrix L ∈ Rm×n and f : Rm×n → Rn×n with

f (L) = L⊤ L =: K ∈ Rn×n . (1.116)

To solve this hard problem, let us first write down what we already know: We know
that the gradient has the dimensions

dK/dL ∈ R^{(n×n)×(m×n)} ,   (1.117)

which is a tensor. If we compute the partial derivative of f with respect to a single
entry Lij, i ∈ {1, …, m}, j ∈ {1, …, n}, of L, we obtain an n × n matrix

∂K/∂Lij ∈ R^{n×n} .   (1.118)
19
e.g., by stacking the columns of the matrix (“flatten”)


Figure 1.19: Visualization of gradient computation of a matrix with respect to a vector. We are interested in computing the gradient of A ∈ R^{4×2} with respect to a vector x ∈ R³; the gradient dA/dx lives in R^{4×2×3}. We can follow two equivalent approaches. (a) Approach 1: We compute the partial derivatives ∂A/∂x1, ∂A/∂x2, ∂A/∂x3, each of which is a 4 × 2 matrix, and collate them in a 4 × 2 × 3 tensor. (b) Approach 2: We re-shape (flatten) A ∈ R^{4×2} into a vector Ā ∈ R⁸, compute the gradient dĀ/dx ∈ R^{8×3}, and obtain the gradient tensor by re-shaping this gradient.

Furthermore, we know that


dKpq/dL ∈ R^{m×n}   (1.119)

for p, q = 1, …, n, where Kpq = fpq(L) is the (p, q)-th entry of K = f(L).
Denoting the i-th column of L by l_i, we see that every entry of K is given by an
inner product of two columns of L, i.e.,

Kpq = l_p⊤ l_q = Σ_{k=1}^m Lkp Lkq .   (1.120)

When we now compute the partial derivative ∂Kpq/∂Lij, we obtain

∂Kpq/∂Lij = Σ_{k=1}^m ∂/∂Lij ( Lkp Lkq ) = ∂pqij ,   (1.121)

∂pqij = Liq   if j = p, p ≠ q
        Lip   if j = q, p ≠ q
        2Liq  if j = p, p = q
        0     otherwise   (1.122)


From (1.117), we know that the desired gradient has the dimension (n × n) × (m × n),
and every single entry of this tensor is given by ∂pqij in (1.122), where p, q, j =
1, …, n and i = 1, …, m.

1.4.4 Basic Rules of Partial Differentiation


In the multivariate case, where x ∈ Rn , the basic differentiation rules that we know
from school (sum rule, product rule, chain rule)20 still apply. However, we need to
pay attention because now we have to deal with matrices where multiplication is no
longer commutative, i.e., the order matters.
Product Rule: ∂/∂x ( f(x) g(x) ) = ∂f/∂x g(x) + f(x) ∂g/∂x   (1.123)
Sum Rule:     ∂/∂x ( f(x) + g(x) ) = ∂f/∂x + ∂g/∂x   (1.124)
Chain Rule:   ∂/∂x (g ∘ f)(x) = ∂/∂x g(f(x)) = ∂g/∂f ∂f/∂x   (1.125)
Before we have a closer look at the chain rule in Section 1.4.5, we state some useful
identities for computing gradients that are frequently required in a machine learning
context (Petersen and Pedersen, 2012):

∂f(X)⊤/∂X = ( ∂f(X)/∂X )⊤   (1.126)
∂tr(f(X))/∂X = tr( ∂f(X)/∂X )   (1.127)
∂det(f(X))/∂X = det(f(X)) tr( f⁻¹(X) ∂f(X)/∂X )   (1.128)
∂f⁻¹(X)/∂X = −f⁻¹(X) ( ∂f(X)/∂X ) f⁻¹(X)   (1.129)
∂(a⊤X⁻¹b)/∂X = −(X⁻¹)⊤ a b⊤ (X⁻¹)⊤   (1.130)
∂(x⊤a)/∂x = a⊤   (1.131)
∂(a⊤x)/∂x = a⊤   (1.132)
∂(a⊤Xb)/∂X = a b⊤   (1.133)
∂(x⊤Bx)/∂x = x⊤(B + B⊤)   (1.134)
∂/∂s (x − As)⊤ W (x − As) = −2(x − As)⊤ W A   for symmetric W   (1.135)
20
product rule: (f g)′ = f ′ g + f g ′ , sum rule: (f + g)′ = f ′ + g ′ , chain rule: (g ◦ f )′ = g ′ (f )f ′
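Identities such as (1.134) are easy to sanity-check numerically. The following Python/NumPy snippet (an illustration added to these notes; the random matrix and vector are placeholders) compares the analytic gradient x⊤(B + B⊤) with a finite-difference approximation:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
f = lambda x: x @ B @ x                            # f(x) = x^T B x

analytic = x @ (B + B.T)                           # identity (1.134)
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x)) / h      # forward differences per coordinate
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-4))   # True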


1.4.5 Chain Rule


Let us have a closer look at the chain rule. Consider a function f : R2 → R of two
variables x1 , x2 . Furthermore, x1 and x2 are themselves functions of t. To compute
the gradient of f with respect to t, we need to apply the chain rule (1.125) for
multivariate functions as
df/dt = [ ∂f/∂x1   ∂f/∂x2 ] [ dx1/dt ; dx2/dt ] = ∂f/∂x1 · dx1/dt + ∂f/∂x2 · dx2/dt ,   (1.136)

where d denotes the total derivative.

Example
Consider f(x1, x2) = x1² + 2x2, where x1 = sin t and x2 = cos t. Then

df/dt = ∂f/∂x1 · dx1/dt + ∂f/∂x2 · dx2/dt   (1.137)
      = 2 sin t · (d sin t / dt) + 2 · (d cos t / dt)   (1.138)
      = 2 sin t cos t − 2 sin t = 2 sin t (cos t − 1) .   (1.139)

If f (x1 , x2 ) is a function of x1 and x2 , where x1 (s, t) and x2 (s, t) are themselves


functions of two variables s and t, the chain rule yields
df/ds = ∂f/∂x1 · ∂x1/∂s + ∂f/∂x2 · ∂x2/∂s ,   (1.140)
df/dt = ∂f/∂x1 · ∂x1/∂t + ∂f/∂x2 · ∂x2/∂t ,   (1.141)

which can be expressed as a matrix equation

df/d(s, t) = ∂f/∂x · dx/d(s, t) = [ ∂f/∂x1   ∂f/∂x2 ] ⎡ ∂x1/∂s   ∂x1/∂t ⎤
                                                       ⎣ ∂x2/∂s   ∂x2/∂t ⎦   (1.142)

Example
Consider the function h : R → R, h(t) = (f ◦ g)(t) with
f : R² → R   (1.143)
g : R → R²   (1.144)
f(x) = exp(x1 x2²) ,   (1.145)
x = g(t) = [ t cos t , t sin t ]⊤   (1.146)
and compute the gradient of h with respect to t.
Since f : R2 → R and g : R → R2 we note that
∂f/∂x ∈ R^{1×2} ,   (1.147)


∂g/∂t ∈ R^{2×1} .   (1.148)

The desired gradient is computed by applying the chain-rule:

dh/dt = ∂f/∂x · dx/dt   (1.149)
      = [ ∂f/∂x1   ∂f/∂x2 ] [ dx1/dt ; dx2/dt ]   (1.150)
      = [ exp(x1 x2²) x2²   2 exp(x1 x2²) x1 x2 ] [ cos t − t sin t ; sin t + t cos t ]   (1.151)
      = exp(x1 x2²) ( x2² (cos t − t sin t) + 2 x1 x2 (sin t + t cos t) ) .   (1.152)
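This result is easy to verify numerically. A short Python/NumPy sketch (an illustration added here; the test point is arbitrary) compares (1.152) against a central-difference approximation of dh/dt:

import numpy as np

def h(t):
    x1, x2 = t * np.cos(t), t * np.sin(t)          # x = g(t)
    return np.exp(x1 * x2**2)                      # f(x)

def dh_dt(t):
    # chain rule (1.149)-(1.152): df/dx evaluated at g(t), times dx/dt
    x1, x2 = t * np.cos(t), t * np.sin(t)
    return np.exp(x1 * x2**2) * (x2**2 * (np.cos(t) - t * np.sin(t))
                                 + 2 * x1 * x2 * (np.sin(t) + t * np.cos(t)))

t, eps = 0.7, 1e-6
numeric = (h(t + eps) - h(t - eps)) / (2 * eps)    # central difference
print(np.isclose(dh_dt(t), numeric))               # True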

Example (Linear Least-Squares Regression)


Let us consider the linear regression model
y = Φθ ,   y ∈ R^N ,   θ ∈ R^D ,   (1.153)

and we define the following functions:

L(e) := ‖e‖² ,   (1.154)
e(θ) := y − Φθ .   (1.155)

We seek ∂L/∂θ, and we will use the chain rule for this purpose.
Before we start any calculation, we determine the dimensionality of the gradient as

∂L/∂θ ∈ R^{1×D} .   (1.156)

The chain rule allows us to compute the gradient as

∂L/∂θ = ∂L/∂e · ∂e/∂θ .   (1.157)

The dot product yields ‖e‖² = e⊤e (see Appendix A.2), and by exploiting the product
rule and (1.131) we determine

∂L/∂e = 2e⊤ ∈ R^{1×N} .   (1.158)

Furthermore, we obtain

∂e/∂θ = −Φ ∈ R^{N×D} ,   (1.159)

such that our desired derivative is

∂L/∂θ = −2e⊤Φ = −2(y⊤ − θ⊤Φ⊤) Φ ∈ R^{1×D} ,   (1.160)

where the first factor has size 1 × N and Φ has size N × D.
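A brief Python/NumPy sketch (added for illustration; the data are random placeholders) that evaluates the gradient (1.160) and checks it against a finite-difference approximation:

import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 3
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

L = lambda th: np.sum((y - Phi @ th)**2)           # L(theta) = ||y - Phi theta||^2
e = y - Phi @ theta
grad = -2 * e @ Phi                                # (1.160): -2 e^T Phi (a 1 x D row vector)

h = 1e-6
numeric = np.array([(L(theta + h * d) - L(theta)) / h for d in np.eye(D)])
print(np.allclose(grad, numeric, atol=1e-3))       # True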


Remark 8
We would have obtained the same result without using the chain rule by immediately
looking at the function

L2(θ) := ‖y − Φθ‖² = (y − Φθ)⊤(y − Φθ) .   (1.161)

This approach is still practical for simple functions like L2 but becomes impractical for
deep function compositions.

Remark 9
When we look at deeply nested function compositions fK ∘ fK−1 ∘ ⋯ ∘ f1, writing out
the full function is tedious. Furthermore, for programming purposes the chain rule is
extremely useful: When we write functions/methods for every fi that return the partial
derivative of their outputs with respect to their inputs, the total derivative with respect to the
input is just the product of the partial derivatives returned by the individual functions.
If we then decide to modify fi into f̃i, we simply have to write a function that computes
the partial derivative of f̃i and use it in the product of partial derivatives (instead of
re-deriving the total derivative from scratch).

Remark 10 (Application of the Chain Rule in Machine Learning)


In machine learning, the chain rule plays an important role when optimizing parame-
ters of a hierarchical model (e.g., for maximum likelihood estimation). An area where
the chain rule is used to an extreme is Deep Learning where the function value y is
computed as a nested/layered function

y = fK (fK−1 (· · · (f1 (x)) · · · )) , (1.162)

where x are the inputs (e.g., images), y are the observations (e.g., class labels) and
every function fi , i = 1, . . . , K possesses its own parameters. In neural networks with
multiple layers, we have functions fi(x) = σ(Ai xi−1 + bi) in the ith layer, where xi−1
is the output of layer i − 1 and σ an activation function, e.g., the logistic sigmoid
1/(1 + e^{−x}), tanh or a rectified linear unit (ReLU). In order to train these models, we have
to compute the gradient of a loss function with respect to the inputs of each layer (e.g.,
xi−1 ) to obtain the partial derivative with respect to the parameters of the previous layer
(e.g., Ai−1 , bi−1 ). There are efficient ways of implementing this repeated application of
the chain-rule using backpropagation (Kelley, 1960; Bryson, 1961; Dreyfus, 1962;
Rumelhart et al., 1986).
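To make this repeated application of the chain rule concrete, here is a minimal Python/NumPy sketch (an illustration added to these notes, not the backpropagation implementations cited above) of a two-layer network with a squared-error loss, where the gradients are obtained by multiplying the layer-wise partial derivatives:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                 # input
y = rng.standard_normal(2)                 # target
A1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
A2, b2 = rng.standard_normal((2, 3)), np.zeros(2)

# Forward pass: f = A2 * tanh(A1 x + b1) + b2, loss = ||f - y||^2
z1 = A1 @ x + b1
x1 = np.tanh(z1)
f = A2 @ x1 + b2
loss = np.sum((f - y)**2)

# Backward pass: chain the partial derivatives layer by layer
dL_df = 2 * (f - y)                        # dL/df
dL_dx1 = A2.T @ dL_df                      # dL/dx1 = (df/dx1)^T dL/df
dL_dz1 = (1 - np.tanh(z1)**2) * dL_dx1     # dL/dz1 via the tanh derivative
dL_dA2 = np.outer(dL_df, x1)               # gradient w.r.t. second-layer weights
dL_dA1 = np.outer(dL_dz1, x)               # gradient w.r.t. first-layer weights
print(dL_dA1.shape, dL_dA2.shape)          # (3, 4) (2, 3)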

1.4.6 Higher-Order Derivatives


Notation Consider a function f : R2 → R of two variables x, y. Multiple partial
derivatives (as for ordinary derivatives) are expressed as
• ∂²f/∂x² is the second partial derivative of f with respect to x


• ∂ⁿf/∂xⁿ is the nth partial derivative of f with respect to x
• ∂²f/∂x∂y is the partial derivative obtained by first partial differentiating by x and then y
• ∂²f/∂y∂x is the partial derivative obtained by first partial differentiating by y and then x

If f (x, y) is a twice (continuously) differentiable function then

∂²f/∂x∂y = ∂²f/∂y∂x ,   (1.163)

i.e., the order of differentiation does not matter, and the corresponding Hessian
matrix (the matrix of second partial derivatives)
" 2 2
#
∂ f ∂ f
∂x2 ∂x∂y
H= ∂2f ∂2f (1.164)
∂x∂y ∂y 2

is symmetric. Generally, for x ∈ RD , the Hessian is a D × D matrix. The Hessian


measures the local geometry of curvature.
Remark 11 (Hessian of a Vector Field)
If f : Rn → Rm is a vector field, the Hessian is an (m × n × n)-tensor.

1.4.7 Linearization and Taylor Series


The gradient ∇f of a function f is often used for a locally linear approximation of f
around x0 :

f (x) ≈ f (x0 ) + (∇x f )(x0 )(x − x0 ) . (1.165)

Here (∇x f )(x0 ) is the gradient of f with respect to x, evaluated at x0 . Note that (1.165)
is equivalent to the first two terms in the multi-variate Taylor-series expansion of
f at x0 .
Definition 12 (Multivariate Taylor Series)
For the multivariate Taylor series, we consider a function

f : RD → R (1.166)
x 7→ f (x) , x ∈ RD , (1.167)

that is smooth at x0 .
When we define δ := x − x0, the Taylor series of f at x0 is defined as

f(x) = Σ_{k=0}^∞ ( D_x^k f(x0) / k! ) δ^k ,   (1.168)


where Dxk f (x0 ) is the k-th (total) derivative of f with respect to x, evaluated at x0 .
The Taylor polynomial of degree n of f at x0 contains the first n + 1 components of
the series in (1.168) and is defined as
Tn = Σ_{k=0}^n ( D_x^k f(x0) / k! ) δ^k .   (1.169)

Figure 1.20: Visualizing outer products. Outer products of vectors increase the dimensionality of the array by 1 per term. (a) Given a vector δ ∈ R⁴, we obtain the outer product δ² := δ ⊗ δ = δδ⊤ ∈ R^{4×4} as a matrix. (b) An outer product δ³ := δ ⊗ δ ⊗ δ ∈ R^{4×4×4} results in a third-order tensor (“three-dimensional matrix”), i.e., an array with three indexes.

Remark 12 (Notation)
In (1.168) and (1.169), we used the slightly sloppy notation of δ^k, which is not defined
for vectors x ∈ R^D, D > 1, and k > 1. Note that both D_x^k f and δ^k are k-th order
tensors,21 i.e., k-dimensional arrays. The k-th order tensor δ^k ∈ R^{D×D×⋯×D} (k times) is
obtained as a k-fold outer product, denoted by ⊗, of the vector δ ∈ R^D. For example,

δ² = δ ⊗ δ = δδ⊤ ,   δ²[i, j] = δ[i] δ[j] ,   (1.170)
δ³ = δ ⊗ δ ⊗ δ ,    δ³[i, j, k] = δ[i] δ[j] δ[k] .   (1.171)

Figure 1.20 visualizes two such outer products. In general, we obtain the following
terms in the Taylor series:

D_x^k f(x0) δ^k = Σ_a ⋯ Σ_k D_x^k f(x0)[a, …, k] δ[a] ⋯ δ[k] ,   (1.172)

where D_x^k f(x0) δ^k contains k-th order polynomials.


21
A vector is a 1-dimensional array, a matrix is a 2-dimensional array.


Now that we have defined the Taylor series for multivariate functions, let us explicitly write down the
first terms D_x^k f(x0) δ^k of the Taylor series expansion for k = 0, …, 3 and δ := x − x0:

k = 0 :  D_x^0 f(x0) δ^0 = f(x0) ∈ R   (1.173)
k = 1 :  D_x^1 f(x0) δ^1 = ∇x f(x0) δ = Σ_i ∇x f(x0)[i] δ[i] ∈ R ,   with ∇x f(x0) ∈ R^{1×D}, δ ∈ R^{D×1}   (1.174)
k = 2 :  D_x^2 f(x0) δ^2 = tr( H δ δ⊤ ) = δ⊤ H δ = Σ_i Σ_j H[i, j] δ[i] δ[j] ∈ R ,   with the Hessian H ∈ R^{D×D}   (1.175)
k = 3 :  D_x^3 f(x0) δ^3 = Σ_i Σ_j Σ_k D_x^3 f(x0)[i, j, k] δ[i] δ[j] δ[k] ∈ R   (1.176)

Example (Taylor-Series Expansion of a Function with Two Variables)


Consider the function

f(x, y) = x² + 2xy + y³ .   (1.177)

We want to compute the Taylor series expansion of f at (x0 , y0 ) = (1, 2). Before
we start, let us discuss what to expect: The function in (1.177) is a polynomial
of degree 3. We are looking for a Taylor series expansion, which itself is a linear
combination of polynomials. Therefore, we do not expect the Taylor series expansion
to contain terms of fourth or higher order to express a third-order polynomial. This
means, it should be sufficient to determine the first four terms of (1.168) for an exact
alternative representation of (1.177).
To determine the Taylor series expansion, we start with the constant term and the
first-order derivatives, which are given by

f(1, 2) = 13   (1.178)
∂f/∂x = 2x + 2y   ⇒   ∂f/∂x (1, 2) = 6   (1.179)
∂f/∂y = 2x + 3y²  ⇒   ∂f/∂y (1, 2) = 14 .   (1.180)

⇒ D¹_{x,y} f(1, 2) = ∇_{x,y} f(1, 2) = [ ∂f/∂x (1, 2)   ∂f/∂y (1, 2) ] = [ 6   14 ] ∈ R^{1×2}   (1.181)
⇒ ( D¹_{x,y} f(1, 2) / 1! ) δ = [ 6   14 ] [ x − 1 ; y − 2 ] = 6(x − 1) + 14(y − 2) .   (1.182)

Note that D¹_{x,y} f(1, 2) δ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by
∂²f/∂x² = 2    ⇒   ∂²f/∂x² (1, 2) = 2   (1.183)
∂²f/∂y² = 6y   ⇒   ∂²f/∂y² (1, 2) = 12   (1.184)
∂²f/∂x∂y = 2   ⇒   ∂²f/∂x∂y (1, 2) = 2   (1.185)


∂²f/∂y∂x = 2   ⇒   ∂²f/∂y∂x (1, 2) = 2 .   (1.186)

When we collect the second-order partial derivatives, we obtain the Hessian

H = ⎡ ∂²f/∂x²    ∂²f/∂x∂y ⎤ = ⎡ 2   2  ⎤
    ⎣ ∂²f/∂y∂x   ∂²f/∂y²  ⎦   ⎣ 2   6y ⎦   (1.187)

⇒ H(1, 2) = ⎡ 2   2  ⎤ ∈ R^{2×2} ,   (1.188)
            ⎣ 2   12 ⎦
such that the next term of the Taylor-series expansion is given by
( D²_{x,y} f(1, 2) / 2! ) δ² = ½ δ⊤ H(1, 2) δ   (1.189)
  = ½ [ x − 1   y − 2 ] ⎡ 2   2  ⎤ ⎡ x − 1 ⎤
                        ⎣ 2   12 ⎦ ⎣ y − 2 ⎦   (1.190)
  = (x − 1)² + 2(x − 1)(y − 2) + 6(y − 2)²   (1.191)

Here, D²_{x,y} f(1, 2) δ² contains only quadratic terms, i.e., second-order polynomials.
The third-order derivatives are obtained as

D³_x f = [ ∂H/∂x   ∂H/∂y ] ∈ R^{2×2×2} ,   (1.192)

D³_x f[:, :, 1] = ∂H/∂x = ⎡ ∂³f/∂x³      ∂³f/∂x∂y∂x ⎤
                          ⎣ ∂³f/∂y∂x²    ∂³f/∂y²∂x  ⎦ ,   (1.193)

D³_x f[:, :, 2] = ∂H/∂y = ⎡ ∂³f/∂x²∂y    ∂³f/∂x∂y²  ⎤
                          ⎣ ∂³f/∂y∂x∂y   ∂³f/∂y³    ⎦ .   (1.194)

Since most second-order partial derivatives in the Hessian in (1.187) are constant,
the only non-zero third-order partial derivative is

∂³f/∂y³ = 6   ⇒   ∂³f/∂y³ (1, 2) = 6 .   (1.195)

Higher-order derivatives and the mixed derivatives of degree 3 (e.g., ∂³f/∂x²∂y) vanish,
such that

D³_{x,y} f[:, :, 1] = ⎡ 0   0 ⎤ ,   D³_{x,y} f[:, :, 2] = ⎡ 0   0 ⎤
                      ⎣ 0   0 ⎦                           ⎣ 0   6 ⎦   (1.196)

and

( D³_{x,y} f(1, 2) / 3! ) δ³ = 6(y − 2)³ / 6 = (y − 2)³ ,   (1.197)
which collects all cubic terms (third-order polynomials) of the Taylor series.
Overall, the (exact) Taylor series expansion of f at (x0 , y0 ) = (1, 2) is
f(x) = f(1, 2) + D¹_{x,y} f(1, 2) δ + ( D²_{x,y} f(1, 2) / 2! ) δ² + ( D³_{x,y} f(1, 2) / 3! ) δ³   (1.198)


= f(1, 2)   (1.199)
  + ∂f(1, 2)/∂x (x − 1) + ∂f(1, 2)/∂y (y − 2)   (1.200)
  + 1/2! ( ∂²f(1, 2)/∂x² (x − 1)² + ∂²f(1, 2)/∂y² (y − 2)² + 2 ∂²f(1, 2)/∂x∂y (x − 1)(y − 2) )   (1.201)
  + 1/6 · ∂³f(1, 2)/∂y³ (y − 2)³   (1.202)
= 13 + 6(x − 1) + 14(y − 2) + (x − 1)² + 6(y − 2)² + 2(x − 1)(y − 2) + (y − 2)³ .   (1.203)
In this case, we obtained an exact Taylor series expansion of the polynomial in (1.177),
i.e., the polynomial in (1.203) is equivalent to the original polynomial in (1.177).
In this particular example, this result is not surprising since the original function
was a third-order polynomial, which we expressed through a linear combination of
constant terms, first-order, second order and third-order polynomials in (1.203).
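As a quick numerical confirmation (a Python/NumPy illustration added here; the evaluation grid is arbitrary), the expansion (1.203) indeed coincides with the original polynomial (1.177) everywhere:

import numpy as np

f = lambda x, y: x**2 + 2*x*y + y**3                        # original polynomial (1.177)
taylor = lambda x, y: (13 + 6*(x - 1) + 14*(y - 2)          # Taylor expansion (1.203)
                       + (x - 1)**2 + 6*(y - 2)**2
                       + 2*(x - 1)*(y - 2) + (y - 2)**3)

xs, ys = np.meshgrid(np.linspace(-3, 3, 7), np.linspace(-3, 3, 7))
print(np.allclose(f(xs, ys), taylor(xs, ys)))               # True: the expansion is exact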

1.4.8 Gradient Checking


When we implement gradients for learning and optimization, we need to verify that
our gradients are correct and correctly implemented. One way to check the gradients
is to compare our analytical solution that we implemented with a finite-difference
approximation. The finite-differences approximation of the gradient exploits the
definition of the gradient (1.63)
f′(x) = lim_{δ→0} [ f(x + δ) − f(x) ] / δ .   (1.204)

The finite-difference approximation approximates the gradient by choosing a small
value for δ (e.g., δ = 10⁻⁴) and ignoring the limit, such that an approximation of the
gradient is

f′(x) ≈ [ f(x + δ) − f(x) ] / δ .   (1.205)

The central difference approximation of the gradient is numerically more stable,
but approximately twice as expensive to compute. This approximation is given by

f′(x) ≈ [ f(x + δ/2) − f(x) ] / δ + [ f(x) − f(x − δ/2) ] / δ = [ f(x + δ/2) − f(x − δ/2) ] / δ .   (1.206)
Intuitively, central differences looks δ/2 to the left and right of the point x0 at which
we wish to evaluate the gradient, which is also illustrated in Figure 1.21. In
contrast, the finite-difference approximation (1.205) computes only a one-sided gradient.
However, central differences in (1.206) requires two evaluations of the function f
to obtain the gradient.
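A tiny Python/NumPy comparison (added for illustration; the test function and step size are arbitrary) of the one-sided and central approximations:

import numpy as np

f = lambda x: np.sin(x)                  # test function with known derivative cos(x)
x0, delta = 1.0, 1e-4

forward = (f(x0 + delta) - f(x0)) / delta
central = (f(x0 + delta/2) - f(x0 - delta/2)) / delta
exact = np.cos(x0)

print(abs(forward - exact))              # one-sided error, of order delta
print(abs(central - exact))              # central error, of order delta**2 (much smaller)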


Figure 1.21: Finite and central difference approximation of the gradient. The gradient
f ′ (x0 ) is shown in gray. The finite-difference approximation is shown in red, the central
differences approximation is shown in blue. In this illustration, only the slope is impor-
tant. Central differences yields a better approximation to the ground truth than finite
differences.

Figure 1.22: Probabilistic graphical model for linear regression.

If x ∈ R^D, we compute the finite/central difference approximation for every partial
derivative. Therefore, we require 2D + 1 function evaluations for central differences
compared to D + 1 for finite differences.

1.5 Parameter Estimation

Assume we are given a training set D consisting of N inputs xi ∈ RD and cor-


responding observations yi ∈ R, i = 1, . . . , N , where yi and yj are conditionally
independent given xi , xj . Our objective is to find optimal parameters θ ∗ ∈ RD for
the linear regression model (1.2).22 Once the parameters are found, we can predict
function values by using the estimate θ ∗ in the model (1.2).

22
The corresponding graphical model is given in Fig. 1.22.


1.5.1 Maximum Likelihood Estimation


A widely used approach to finding the desired parameters θ ∗ is maximum likelihood
estimation where

θ∗ = arg max_θ p(y|X, θ) ,   X := [x1 | ⋯ | xN]⊤ ∈ R^{N×D} ,   y := [y1, …, yN]⊤ ∈ R^N   (1.207)

maximizes the likelihood function p(y|X, θ).


Remark 13
Note that the likelihood is not a probability distribution in θ: It is simply a function but
does usually not integrate to 1 (i.e., it is unnormalized), and may not even be integrable
with respect to θ. However, the likelihood in (1.207) is a probability distribution in y.

To find the desired parameters θ ∗ that maximize the likelihood, we typically do


gradient ascent (or gradient descent on the negative likelihood). For numerical rea-
sons, we apply the log-transformation to the problem23 and minimize the negative
log-likelihood

θ∗ ∈ arg min_θ ( − log p(y|X, θ) )   (1.208)
   = arg min_θ ( − log ∏_{i=1}^N p(yi|xi, θ) )   (1.209)
   = arg min_θ ( − Σ_{i=1}^N log p(yi|xi, θ) ) ,   (1.210)

where we exploited that the likelihood factorizes over the number of data points due
to our independence assumption on the training set.
In the linear regression model (1.2) the likelihood is Gaussian (due to the Gaussian
additive noise term), such that we arrive at
− log p(yi|xi, θ) = 1/(2σ²) (yi − xi⊤θ)² + const ,   (1.211)

where the constant includes all terms independent of θ. Using (1.211) in the negative
log-likelihood (1.210) we obtain (ignoring the constant terms)

L(θ) := − log p(y|X, θ) = 1/(2σ²) Σ_{i=1}^N (yi − xi⊤θ)²   (1.212)
      = 1/(2σ²) (y − Xθ)⊤(y − Xθ) = 1/(2σ²) ‖y − Xθ‖² ,   (1.213)

X := [ x1⊤ ; … ; xN⊤ ] ∈ R^{N×D} ,   (1.214)

23
Note that the logarithm is a (strictly) monotonically increasing function.


where X is called the design matrix.24 In (1.213) we replaced the sum of squared terms
with the squared norm25 of the difference term y − Xθ.
Remark 14
In machine learning, the negative log likelihood function is also called an error func-
tion.
We now have a concrete form of the negative log-likelihood function that we need
to optimize. We will discuss two approaches, both of which are based on gradients.

1.5.1.1 Closed-Form Solution


We immediately see that (1.213) is quadratic in θ. This means that we can find a
unique global solution θ ∗ for minimizing the negative log-likelihood. We can find
the global optimum by computing the gradient of L, setting it to 0 and solving for
θ.26
We compute the gradient of L with respect to the parameters as

∂L/∂θ = ∂/∂θ ( 1/(2σ²) (y − Xθ)⊤(y − Xθ) )   (1.215)
      = 1/(2σ²) ∂/∂θ ( y⊤y − 2y⊤Xθ + θ⊤X⊤Xθ )   (1.216)
      = 1/σ² ( −y⊤X + θ⊤X⊤X ) ∈ R^{1×D} .   (1.217)
As a necessary condition for θ being optimal, we set this gradient to 0 and obtain

∂L/∂θ = 0   ⇔   (θ∗)⊤X⊤X = y⊤X   (1.218)
            ⇔   (θ∗)⊤ = y⊤X(X⊤X)⁻¹   (1.219)
            ⇔   θ∗ = (X⊤X)⁻¹X⊤y   (1.220)

We could right-multiply the first equation by (X ⊤ X)−1 because X ⊤ X is positive


definite (if we do not have two identical inputs xi , xj for i 6= j). Note that we
actually do obtain a global minimum since the Hessian ∇2θ L(θ) = X ⊤ X ∈ RD×D is
positive definite.
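A short Python/NumPy sketch of the closed-form solution (1.220) (an illustration added here; the synthetic data and noise level are placeholders). In practice one solves the linear system rather than inverting X⊤X explicitly:

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.standard_normal((N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(N)     # y = X theta + noise

# Maximum likelihood estimate (1.220): theta* = (X^T X)^{-1} X^T y
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ml)                                        # close to theta_true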

1.5.1.2 Iterative Solution


We may be interested in finding parameters on which the log-likelihood depends
in a more complicated way. Then, a closed-form solution is often not available.
However, we can use an iterative procedure to find a local optimum. The most
straightforward procedure for this is gradient descent, which we will discuss in more
detail in Section 1.6.
24
Note that there is some notation overloading: We summarize the set of training inputs in X,
whereas in the design matrix we additionally assume a specific “shape”.
25
Remember that kxk2 := hx, xi where h·, ·i is a scalar product (inner product).
26
In this case, setting the gradient to 0 is a necessary and sufficient condition. As an exercise,
compute the Hessian (matrix of second derivatives) and show that it is positive definite.


1.5.1.3 Maximum Likelihood Estimation with Features


We now consider the linear regression model

y = φ⊤(x)θ + ǫ ,   (1.221)

where φ : R^D → R^K is a (nonlinear) transformation of the inputs x.

Example (Polynomial Regression)


We are concerned with a regression problem y = φ⊤(x)θ + ǫ, where x ∈ R. A
transformation that is often used in this context is

φ(x) = [ φ0(x), φ1(x), …, φK(x) ]⊤ = [ 1, x, x², x³, …, x^K ]⊤ ∈ R^{K+1} .   (1.222)

This means, we “lift” the original 1D input space into a K + 1-dimensional feature
space consisting of monomials. With these features, we can model polynomials of
degree ≤ K within the framework of linear regression: A polynomial of degree K is
given by
f(x) = Σ_{i=0}^K θi x^i = φ⊤(x)θ   (1.223)

where φ is defined in (1.222) and θ = [θ0 , . . . , θK ]⊤ contains the parameters θi .

When we consider the training data xi , yi , i = 1, . . . , N and define the feature (de-
sign) matrix

Φ = ⎡ φ0(x1)   φ1(x1)   ⋯   φK(x1) ⎤
    ⎢ φ0(x2)   φ1(x2)   ⋯   φK(x2) ⎥
    ⎢   ⋮         ⋮             ⋮   ⎥ ,   Φij = φj(xi) ,   (1.224)
    ⎣ φ0(xN)   φ1(xN)   ⋯   φK(xN) ⎦

the negative log-likelihood can be written as


− log p(y|X, θ) = 1/(2σ²) (y − Φθ)⊤(y − Φθ) + const .   (1.225)
Comparing (1.225) with (1.213) we immediately see that X is replaced by Φ. Since
both X and Φ are independent of the parameters θ that we wish to optimize, we
therefore also arrive immediately at the maximum likelihood estimate

θ ∗ = (Φ⊤ Φ)−1 Φ⊤ y (1.226)


for the linear regression problem with nonlinear features defined in (1.221).

Example (Feature Matrix for Polynomials)


For a second-order polynomial and N training points xi ∈ R, i = 1, …, N, the feature
matrix is

Φ = ⎡ 1   x1   x1² ⎤
    ⎢ 1   x2   x2² ⎥
    ⎢ ⋮    ⋮     ⋮  ⎥ .   (1.227)
    ⎣ 1   xN   xN² ⎦
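As a sketch (Python/NumPy, added for illustration, with data generated as in the polynomial fit example below), building such a feature matrix and computing the maximum likelihood estimate (1.226) might look as follows:

import numpy as np

def poly_features(x, K):
    """Feature matrix Phi with Phi[i, j] = x_i**j, j = 0, ..., K (cf. (1.227))."""
    return np.vander(x, K + 1, increasing=True)

rng = np.random.default_rng(2)
x = rng.uniform(-5, 5, size=20)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.standard_normal(x.size)

Phi = poly_features(x, K=4)
theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]   # numerically stable solve of (1.226)
print(theta_ml)                                     # coefficients theta_0, ..., theta_4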

1.5.1.4 Properties
The maximum likelihood estimate θ ∗ possesses the following properties:

• Asymptotic consistency: The MLE converges to the true value in the limit of in-
finitely many observations, plus a random error that is approximately normal.

• The size of the samples necessary to achieve these properties can be quite large.

• The error’s variance decays in 1/N where N is the number of data points.

• Especially, in the “small” data regime, maximum likelihood estimation can lead
to overfitting.

Example (Maximum Likelihood Polynomial Fit)

Let us consider the data set in Fig. 1.23(a). The data set consists of N = 20 pairs
(xi, yi), where xi ∼ U[−5, 5] and yi = − sin(xi/5) + cos(xi) + ǫ with ǫ ∼ N(0, 0.2²).
We fit a polynomial of degree M = 4 using maximum likelihood estimation (i.e.,
the parameters are given in (1.226)). The maximum likelihood estimate yields an
expected function value φ(x∗ )⊤ θ ML at a new test location x∗ . The result is shown in
Fig. 1.23(b).

1.5.2 Overfitting
We have seen that we can use maximum likelihood estimation to fit linear models
(e.g., polynomials) to data. We can evaluate the quality of the model by comput-
ing the error/loss incurred. One way of doing this is to compute the negative log-
likelihood (1.210), which we minimized to determine the MLE. Alternatively, given
that the noise parameter σ 2 is not a free parameter, we can ignore the scaling by


(a) Regression data set. (b) Polynomial of degree 4 determined by maxi-
mum likelihood estimation.

Figure 1.23: Polynomial regression. (a) Data set consisting of (xi , yi ) pairs, i =
1, . . . , 20. (b) Maximum likelihood polynomial of degree 4.

1/σ², so that we end up with a squared-error-loss function ‖y − Φθ‖². Instead of


using this squared loss, we often use the root mean squared error (RMSE)
√( ‖y − Φθ‖² / N ) = √( (1/N) Σ_{n=1}^N (yn − φ⊤(xn)θ)² ) ,   (1.228)

which (a) allows us to compare errors of data sets with different sizes27 and (b) has
the same scale and the same units as the observed function values yi .28 Note that
the division by σ 2 makes the log-likelihood “unit-free”.
We can use the RMSE (or the log-likelihood) to determine the best degree of the
polynomial by finding the value M , such that the error is minimized. Given that
the polynomial degree is a natural number, we can perform a brute-force search and
enumerate all (reasonable) values of M .29
Fig. 1.24 shows a number of polynomial fits determined by maximum likelihood. We
notice that polynomials of low degree (e.g., constants (M = 0) or linear (M = 1) fit
the data poorly and, hence, are poor representations of the true underlying function.
For degrees M = 4, . . . , 9 the fits look plausible and smoothly interpolate the data.
When we go to higher-degree polynomials, we notice that they fit the data better
and better—in the extreme case of M = N − 1 = 19, the function passes through
every single data point. However, these high-degree polynomials oscillate wildly and
are a poor representation of the underlying function that generated the data.30 The
property of the polynomials fitting the noise structure is called overfitting.
27
The RMSE is normalized.
28
Assume, we fit a model that maps post-codes (x is given in latitude,longitude) to house prices
(y-values are GBP). Then, the RMSE is also measured in GBP, whereas the squared error is given in
GBP2 .
29
For a training set of size N it is sufficient to test 0 ≤ M ≤ N − 1.
30
Note that the noise variance σ 2 > 0.


Figure 1.24: Polynomial fits for different degrees M: (a) M = 0, (b) M = 1, (c) M = 4, (d) M = 6, (e) M = 9, (f) M = 12, (g) M = 14, (h) M = 16, (i) M = 19. Each panel shows the data together with the corresponding maximum likelihood estimate.

Remember that the goal is to achieve good generalization by making accurate pre-
dictions for new (unseen) data. We obtain some quantitative insight into the de-
pendence of the generalization performance on the polynomial of degree M by con-
sidering a separate test set comprising 100 data points generated using exactly the
same procedure used to generate the training set, but with new choices for the ran-
dom noise values included in the target values. As test inputs, we chose a linear
grid of 100 points in the interval of [−5, 5]. For each choice of M , we evaluate the
RMSE (1.228) for both the training data and the test data.
Looking now at the test error, which is a qualitative measure of the generalization
properties of the corresponding polynomial, we notice that initially the test error
decreases, see Fig. 1.25 (red). For fourth-order polynomials the test error is rela-
tively low and stays relatively constant up to degree 11. However, from degree 12
onward the test error increases significantly, and high-order polynomials have very


Figure 1.25: Training and test error (RMSE) as a function of the degree of the polynomial.

bad generalization properties. In this particular example, this also is evident from
the corresponding maximum likelihood fits in Fig. 1.24. Note that the training error
(blue curve in Fig. 1.25) never increases as a function of M . In our example, the
best generalization (the point of the smallest test error) is achieved for a polynomial
of degree M = 6.
It is a bit counter-intuitive that a polynomial of degree M = 19 is a worse approxi-
mation than a polynomial of degree M = 4, which is a special case of a 19th-order
polynomial (by setting all higher coefficients to 0). However, a 19th-order polyno-
mial can also describe many more functions, i.e., it is a much more flexible model.
In the data set we considered, the observations yn were noisy (i.i.d. Gaussian). A
polynomial of a high degree will use its flexibility to model random disturbances as
systematic/structural properties of the underlying function. Overfitting can be seen
as a general problem of maximum likelihood estimation (Bishop, 2006). Assuming
we had noise-free data, overfitting does not occur, which is also revealed by the test
error, see Fig. 1.26.

1.5.3 Regularization

In Section 1.5.1, we saw that maximum likelihood estimation is prone to overfitting.


It often happens that the parameter values become relatively big if we run into over-
fitting (Bishop, 2006). One way to control overfitting is to penalize big parameter
values. A technique that is often used to control overfitting is regularization. In
regularization, we add a term to the log-likelihood that penalizes the amplitude of
the parameters θ. A typical example is a “loss function” of the form

− log p(y|X, θ) + λ‖θ‖₂² ,   (1.229)


Figure 1.26: Noise-free maximum likelihood estimation with a 19th-degree polynomial. (a) Maximum likelihood fit: overfitting no longer occurs. (b) Training and test error: the test error declines to 0.

where the second term is the regularizer, and λ ≥ 0 controls the “strictness” of the
regularization.31

1.5.4 Maximum A Posteriori (MAP) Estimation


From a probabilistic perspective, adding a regularizer is identical to using a prior
distribution p(θ) on the parameters and then selecting the parameters that maxi-
mize the posterior distribution p(θ|X, y), i.e., we choose the parameters θ that are
“most probable” given the training data: In a Bayesian setting, the posterior over the
parameters θ given the training data X, y is obtained by applying Bayes’ theorem as

p(θ|X, y) = p(y|X, θ) p(θ) / p(y|X) .   (1.230)

The parameter vector θ MAP that maximizes the posterior (1.230) is called the maxi-
mum a-posteriori (MAP) estimate.
To find the MAP estimate, we follow steps that are similar in flavor to maximum
likelihood estimation. We start with the log-transform and compute the log-posterior
as

log p(θ|X, y) = log p(y|X, θ) + log p(θ) + const , (1.231)

where the constant includes the terms independent of θ. We see that the log-
posterior in (1.231) consists of the log-likelihood p(y|X, θ) and the log-prior log p(θ).
31
Instead of the 2-norm, we can choose any p-norm ‖·‖p. In practice, smaller values for p lead to
sparser solutions. Here, “sparse” means that many parameter values θi = 0, which is also useful for
variable selection. For p = 1, the regularizer is called LASSO (least absolute shrinkage and selection
operator) and was proposed by Tibshirani (1996).


Remark 15 (Relation to Regularization)
Choosing a Gaussian parameter prior p(θ) = N(0, b²I) with b² = 1/(2λ), the (negative) log-prior
term will be

− log p(θ) = λ θ⊤θ + const = λ‖θ‖₂² + const ,   (1.232)

and we recover exactly the regularization term in (1.229). This means that for a
quadratic regularization, the regularization parameter λ in (1.229) is determined by the
precision (inverse variance) of the Gaussian (isotropic) prior p(θ): the precision 1/b² equals 2λ. The log-prior
in (1.231) plays the role of a regularizer that penalizes implausible values, i.e.,
values that are unlikely under the prior.

To find the MAP estimate θMAP, we minimize the negative log-posterior with respect
to θ, i.e., we solve

θMAP ∈ arg min_θ ( − log p(y|X, θ) − log p(θ) ) .   (1.233)

We compute the gradient with respect to θ as

− ∂ log p(θ|X, y)/∂θ = − ∂ log p(y|X, θ)/∂θ − ∂ log p(θ)/∂θ ,   (1.234)
where we identify the first term on the right-hand-side as the gradient of the negative
log-likelihood given in (1.217).

1.5.4.1 MAP Estimation for Linear Regression


We consider the linear regression problem where

y = φ⊤(x)θ + ǫ ,   ǫ ∼ N(0, σ²) ,   (1.235)

with a Gaussian prior p(θ) = N(0, b²I) on the parameters θ.
The negative log-posterior for this model is

− log p(θ|X, y) = 1/(2σ²) (y − Φθ)⊤(y − Φθ) + 1/(2b²) θ⊤θ + const .   (1.236)
Here, the first term corresponds to the contribution from the log-likelihood, and the
second term originates from the log-prior.
The gradient of the log-posterior with respect to the parameters θ satisfies

− ∂ log p(θ|X, y)/∂θ = 1/σ² ( θ⊤Φ⊤Φ − y⊤Φ ) + 1/b² θ⊤ .   (1.237)
We will find the MAP estimate θMAP by setting this gradient to 0:

1/σ² ( θ⊤Φ⊤Φ − y⊤Φ ) + 1/b² θ⊤ = 0   (1.238)
⇔ θ⊤ ( 1/σ² Φ⊤Φ + 1/b² I ) − 1/σ² y⊤Φ = 0   (1.239)


 
⇔ θ⊤ ( Φ⊤Φ + σ²/b² I ) = y⊤Φ   (1.240)
⇔ θ⊤ = y⊤Φ ( Φ⊤Φ + σ²/b² I )⁻¹   (1.241)
⇔ θMAP = ( Φ⊤Φ + σ²/b² I )⁻¹ Φ⊤y .   (1.242)

Comparing the MAP estimate in (1.242) with the maximum likelihood estimate in
(1.226) we see that the only difference between both solutions is the additional
term σ²/b² I in the inverse matrix.32 This term ensures that the inverse exists and serves
as a regularizer.
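A brief Python/NumPy sketch (illustrative only; the choices of σ, b, and polynomial degree mirror the MAP example below but are otherwise arbitrary) that computes both estimates and shows the effect of the additional term:

import numpy as np

def map_estimate(Phi, y, sigma2, b2):
    """MAP solution (1.242): (Phi^T Phi + sigma^2/b^2 I)^{-1} Phi^T y."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + (sigma2 / b2) * np.eye(D), Phi.T @ y)

rng = np.random.default_rng(3)
x = rng.uniform(-5, 5, size=20)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.standard_normal(x.size)
Phi = np.vander(x, 15, increasing=True)             # degree-14 polynomial features

theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]   # maximum likelihood (1.226)
theta_map = map_estimate(Phi, y, sigma2=0.2**2, b2=0.1**2)
print(np.linalg.norm(theta_ml), np.linalg.norm(theta_map))  # MAP weights are typically much smaller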

Example (MAP Estimation for Polynomial Regression)


Figure 1.27: Polynomial regression: Maximum likelihood and MAP estimates.

In the polynomial regression example from Section 1.5.1, we place a Gaussian prior
p(θ) = N(0, 0.1²I) on the parameters θ and determine the MAP estimates accord-
ing to (1.242). In Fig. 1.27, we show both the maximum likelihood and the MAP
estimates for polynomials of degree 6 (left) and degree 14 (right). The prior (regu-
larizer) does not play a significant role for the low-degree polynomial, but keeps the
function relatively smooth for higher-degree polynomials. However, the MAP esti-
mate is only able to push the boundaries of overfitting—it is not a general solution
to this problem.

32
Φ⊤Φ is positive semidefinite and the additional term is strictly positive definite, such that all
eigenvalues of the matrix to be inverted are positive.



Figure 1.28: Gradient descent can lead to zigzagging and slow convergence.

1.6 Gradient Descent


Gradient descent is a first-order optimization algorithm. To find a local minimum
of a function using gradient descent, one takes steps proportional to the negative of
the gradient (or of the approximate gradient) of the function at the current point.
Remember that the gradient points in the direction of the steepest ascent and it is
orthogonal to the contour lines of the function we wish to optimize.
Gradient descent exploits the fact that if a multivariate function f (x) is differentiable
in a neighborhood of a point x0, then f decreases fastest if one moves from x0
in the direction of the negative gradient −(∇f )(x0 ) of f at x0 . Then, if
x1 = x0 − γ(∇f )(x0 ) (1.243)
for a small step size γ ≥ 0 then f (x1 ) ≤ f (x0 ).
This observation allows us to define a simple gradient-descent algorithm: If we want
to find a local optimum f(x∗) of a function f : Rn → R, x 7→ f(x), we start with
an initial guess x0 of the parameters we wish to optimize and then iterate according
to
xi+1 = xi − γi (∇f )(xi ) . (1.244)
For suitable γi , the sequence f (x0 ) ≥ f (x1 ) ≥ . . . converges to a local minimum.
Remark 16
Gradient descent can be relatively slow close to the minimum: Its asymptotic rate of
convergence is inferior to many other methods. For poorly conditioned convex problems,
gradient descent increasingly ‘zigzags’ as the gradients point nearly orthogonally to the
shortest direction to a minimum point, see Fig. 1.28.

1.6.1 Stepsize
Choosing a good stepsize is important in gradient descent: If the stepsize (also called
the learning rate) is too small, gradient descent can be slow. If the stepsize is chosen
too large, gradient descent can overshoot, fail to converge, or even diverge.


Robust gradient methods continually rescale the stepsize empirically depending on


local properties of the function. There are two simple heuristics (Toussaint, 2012):
• When the function value increases after a gradient step, the step size was too
large. Undo the step and decrease the stepsize.
• When the function value decreases the step could have been larger. Try to
increase the stepsize.
Although the “undo” step seems to be a waste of resources, using this heuristic guar-
antees monotonic convergence.

Example (Solving a Linear Equation System)


When we solve linear equations of the form Ax = b, in practice we solve Ax − b = 0
approximately by finding x∗ that minimizes the squared error

‖Ax − b‖² = (Ax − b)⊤(Ax − b) ,   (1.245)

if we use the Euclidean norm. The gradient of (1.245) with respect to x is

∇x = 2(Ax − b)⊤A .   (1.246)
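A small Python/NumPy sketch (added purely for illustration; A, b, the fixed step size, and the iteration count are arbitrary choices) of gradient descent on this objective:

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

x = np.zeros(5)
gamma = 2e-3                                   # small fixed step size (illustrative; often tuned or adapted)
for _ in range(2000):
    grad = 2 * (A @ x - b) @ A                 # gradient (1.246) as a row vector
    x = x - gamma * grad                       # gradient-descent update (1.244)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-3))   # True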

Remark 17
Gradient descent is rarely used for solving linear equation systems Ax = b. Instead
other algorithms with better convergence properties, e.g., conjugate gradient descent,
are used. The speed of convergence of gradient descent depends on the maximal and
minimal eigenvalues of A, while the speed of convergence of conjugate gradients has
a more complex dependence on the eigenvalues, and can benefit from preconditioning.
Gradient descent also benefits from preconditioning, but this is not done as commonly.
For further information on gradient descent, pre-conditioning and convergence we refer
to CO-477.

1.6.2 Gradient Descent with Momentum


Gradient descent with momentum (Rumelhart et al., 1986) is a method that smoothes
out erratic behavior of gradient updates and dampens oscillations.
The idea is to have a gradient update with a memory to implement a moving average.
The momentum-based method remembers the update ∆xi at each iteration i and
determines the next update as a linear combination of the current and previous
gradients

xi+1 = xi − γi (∇f )(xi ) + α∆xi (1.247)


∆xi = xi − xi−1 = −γi−1 (∇f )(xi−1 ) , (1.248)

where α ∈ [0, 1]. Due to the moving average of the gradient, momentum-based
methods are particularly useful when the gradient itself is only a (noisy) estimate.
We will discuss stochastic approximations to the gradient in the following.


1.6.3 Stochastic Gradient Descent


Computing the gradient can be very time consuming. However, often it is possible
to find a “cheap” approximation of the gradient. Approximating the gradient is still
useful as long as it points in roughly the same direction as the true gradient.
Stochastic gradient descent (often shortened in SGD) is a stochastic approximation
of the gradient descent method for minimizing an objective function that is written
as a sum of differentiable functions.
In machine learning, we often consider objective functions of the form
L(θ) = Σ_k Lk(θ) ,   (1.249)

where θ is the parameter vector of interest, i.e., we want to find θ that minimizes L.
An example is the negative log-likelihood

L(θ) = − Σ_k log p(yk|xk, θ)   (1.250)

in a regression setting, where xk ∈ RD are the training inputs, yk are the training
targets and θ are the parameters of the regression model.
Standard gradient descent, as introduced previously, is a “batch” optimization method,
i.e., optimization is performed using the full training set by updating the parameters
according to
θi+1 = θi − γi ∇L(θi) = θi − γi Σ_k ∇Lk(θi)   (1.251)

for a suitable stepsize parameter γi . Evaluating the sum-gradient may require ex-
pensive evaluations of the gradients from all summand functions. When the training
set is enormous (commonly the case in Deep Learning) and/or no simple formu-
las exist, evaluating the sums of gradients becomes very expensive: Evaluating the
gradient requires evaluating the gradients of all summands. To economize on the
computational cost at every iteration, stochastic gradient descent samples a sub-
set of summand functions (“mini-batch”) at every step. In an extreme case, the
sum in (1.249) is approximated by a single summand (randomly chosen), and the
parameters are updated according to
θ i+1 = θ i − γ∇Lk (θ i ) . (1.252)
In practice, it is good to keep the size of the mini-batch as large as possible to (a)
reduce the variance in the parameter update33 and (b) allow for the computation
to take advantage of highly optimized matrix operations that should be used in a
well-vectorized computation of the cost and gradient.
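A compact Python/NumPy sketch of mini-batch SGD (an illustration added here; the data, batch size, step size, and number of epochs are placeholder choices) for the linear regression loss L(θ) = Σ_k (yk − xk⊤θ)²:

import numpy as np

rng = np.random.default_rng(5)
N, D = 1000, 5
X = rng.standard_normal((N, D))
theta_true = rng.standard_normal(D)
y = X @ theta_true + 0.1 * rng.standard_normal(N)

theta = np.zeros(D)
gamma, batch_size = 1e-2, 32
for epoch in range(50):
    perm = rng.permutation(N)                       # shuffle the data each epoch
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]        # mini-batch of summands
        grad = -2 * (y[idx] - X[idx] @ theta) @ X[idx] / idx.size
        theta = theta - gamma * grad                # stochastic update, cf. (1.252)

print(np.allclose(theta, theta_true, atol=0.05))    # True: close to the generating parameters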
Remark 18
When the learning rate decreases at an appropriate rate, and subject to relatively
mild assumptions, stochastic gradient descent converges almost surely to a local mini-
mum (Bottou, 1998).
33
This often leads to more stable convergence.


1.6.4 Further Reading


For non-differentiable functions, gradient methods are ill-defined. In these cases,
subgradient methods can be used (Shor, 1985). For further information and algo-
rithms for optimizing non-differentiable functions, we refer to the book by Bertsekas
(1999).
Stochastic gradient descent is very effective in large-scale machine learning prob-
lems, such as training deep neural networks on millions of images (Dean et al.,
2012), topic models (Hoffman et al., 2013), reinforcement learning (Mnih et al.,
2015) or training large-scale Gaussian process models (Hensman et al., 2013; Gal
et al., 2014). Extensions of stochastic gradient descent methods include RMSProp,
AdaGrad (Duchi et al., 2011), momentum-based methods, and Adam (Kingma and
Ba, 2014).

1.7 Model Selection and Cross Validation


Sometimes, we need to make high-level decisions about the model we want to use
in order to increase the performance. Examples include:

• The degree of a polynomial in a regression setting

• The number of components in a mixture model

• The network architecture of a (deep) neural network

• The type of kernel in a support vector machine

The choices we make (e.g., the degree of the polynomial) influence the number of
free parameters in the model and thereby also the model complexity. More com-
plex models are more flexible in the sense that they can be used to describe more
data sets. For instance, a polynomial of degree 1 (a line a0 + a1 x) can only be used
to describe linear relations between inputs x and observations y. A polynomial of
degree 2 can additionally describe quadratic relationships between inputs and ob-
servations.34 Higher-order polynomials are very flexible models as we have seen
already in Section 1.5 in the context of polynomial regression.
A general problem is that at training time we can only use the training set to evaluate
the performance of the model. However, the performance on the training set is
not really what we are interested in: In Section 1.5, we have seen that maximum
likelihood estimation can lead to overfitting, especially when the training data set is
small. Ideally, our model (also) works well on the test set (which is not available
at training time). Therefore, we need some mechanisms for assessing how a model
generalizes to unseen test data. Model selection is concerned with exactly this
problem.
34
A polynomial a0 + a1 x + a2 x2 can also describe linear functions by setting a2 = 0.



Figure 1.29: K-fold cross-validation. The data set is divided into K = 5 chunks, K − 1
of which serve as the training set (blue) and one as the validation set (yellow). This
procedure is repeated for all K choices for the validation set, and the performance of
the model from the K runs is averaged.

1.7.1 Cross-Validation to Assess the Generalization Performance


One way to assess the generalization performance of a model M is to use a valida-
tion set. The validation set is a small subset of the available training set that we keep
aside. A practical issue with this approach is that the amount of data is limited, and
ideally we would use as much of the available data as possible to train the model. This would
require us to keep our validation set V small, which then would lead to a noisy estimate
(with high variance) of the predictive performance. One solution to these conflicting
objectives (large training set, large validation set) is to use cross-validation.
K-fold cross-validation effectively partitions the data into K chunks, K − 1 of which
form the training set D̃, and the last chunk serves as the validation set V (similar
to the idea outlined above). Cross-validation iterates through (ideally) all combina-
tions of assignments of chunks to D̃ and V, see Fig. 1.29.
We partition our training set D = D̃ ∪ V, D̃ ∩ V = ∅, where V is the validation set,
and train our model on D̃. After training, we assess the performance of the model M
on the validation set V (e.g., by computing RMSE values of the trained model on the
validation set). We cycle through all possible partitionings of validation and training
sets and compute the average generalization error of the model. Cross-validation
effectively computes the expected generalization error
K
1 X
EV [G(V)|M ] ≈ G(V (k) |M ) , (1.253)
K k=1

where G(V) is the generalization error (e.g., RMSE) on the validation set V for model
M . We repeat this procedure for all models and choose the model that performs best.
Note that cross-validation does not only give us the expected generalization error,
but we can also obtain high-order statistics, e.g., the standard error35 , an estimate
of how uncertain the mean estimate is.
Once the model is chosen we can evaluate the final performance on the test set.
35
The standard error is defined as σ/√K, where K is the number of experiments and σ the standard deviation.
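A schematic Python/NumPy implementation of K-fold cross-validation (added as an illustration; fit and rmse are placeholder functions standing in for training a model M and evaluating its generalization error):

import numpy as np

def k_fold_cv(X, y, fit, rmse, K=5, seed=0):
    """Average validation error over K train/validation splits, cf. (1.253)."""
    N = y.size
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    errors = []
    for k in range(K):
        val = folds[k]                                        # validation chunk V
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])                       # train on the remaining chunks
        errors.append(rmse(model, X[val], y[val]))            # evaluate on V
    return np.mean(errors), np.std(errors) / np.sqrt(K)       # mean and standard error

# Example with the degree-4 polynomial regression model of Section 1.5.1:
fit = lambda X, y: np.linalg.lstsq(np.vander(X, 5, increasing=True), y, rcond=None)[0]
rmse = lambda th, X, y: np.sqrt(np.mean((y - np.vander(X, 5, increasing=True) @ th)**2))
x = np.random.default_rng(1).uniform(-5, 5, 40)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * np.random.default_rng(2).standard_normal(40)
print(k_fold_cv(x, y, fit, rmse))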


A potential disadvantage of K-fold cross-validation is the computational cost of


training the model K times, which can be burdensome if the training cost is com-
putationally expensive. In practice, it is often not sufficient to look at the direct
parameters alone. For example, we need to explore multiple complexity parame-
ters (e.g., multiple regularization parameters), which may not be direct parame-
ters of the model. Evaluating the quality of the model as a function of these hyper-
parameters may result in a number of training runs that is exponential in the number
of model parameters.
However, cross-validation is an embarrassingly parallel problem, i.e., little effort
is needed to separate the problem into a number of parallel tasks. Given sufficient
computing resources (e.g., cloud computing, server farms), cross-validation takes no
longer than a single performance assessment.

1.7.2 Bayesian Model Selection


There are many approaches to model selection, some of which are covered in this
section. Generally, they all attempt to trade off model complexity and data fit:36
The objective is to find the simplest model that explains the data reasonably well.37
This concept is also known as Occam’s Razor. One may consider placing a prior on
models that favors simpler models. However, it is not necessary to do this: An “auto-
matic Occam’s Razor” is quantitatively embodied in the application of Bayesian prob-
ability (Spiegelhalter and Smith, 1980; MacKay, 1992; Jefferys and Berger, 1992).
Fig. 1.30 from MacKay (2003) illustrates this property.
Now, let us phrase model selection as a hierarchical inference problem. Generally,
we assume a finite number of models that we can choose from. A more efficient
way than cross validation, where we have to fit each model K times (where K is
the number of cross-validation folds), is to compute the posterior distribution over
models.
Let us consider a finite number of models M = {M1 , . . . , MK }, where each model
Mk is parametrized by θ k . In Bayesian model selection, we place a prior p(M ) on
the set of models. The corresponding generative process is

Mk ∼ p(M ) (1.254)
θ k |Mk ∼ p(θ k ) (1.255)
D|θ k ∼ p(D|θ k ) (1.256)

and illustrated in Fig. 1.31. Given a training set D, we compute the posterior over
models as

p(Mk |D) ∝ p(Mk )p(D|Mk ) . (1.257)

Note that this posterior no longer depends on the model parameters θ k because they
have been integrated out in the Bayesian setting.
36. We assume that simpler models are less prone to overfitting than complex models.
37. If we treat model selection as a hypothesis testing problem, we are looking for the simplest hypothesis that is consistent with the data (Murphy, 2012).


Figure 1.30: “Why Bayesian inference embodies Occam’s razor. This figure gives the
basic intuition for why complex models can turn out to be less probable. The horizontal
axis represents the space of possible data sets D. Bayes’ theorem rewards models in
proportion to how much they predicted the data that occurred. These predictions are
quantified by a normalized probability distribution on D. This probability of the data
given model Mi , p(D|Mi ), is called the evidence for Mi . A simple model M1 makes
only a limited range of predictions, shown by p(D|M1 ); a more powerful model M2 that
has, for example, more free parameters than M1 , is able to predict a greater variety of
data sets. This means, however, that M2 does not predict the data sets in region C1 as
strongly as M1 . Suppose that equal prior probabilities have been assigned to the two
models. Then, if the data set falls in region C1 , the less powerful model M1 will be the
more probable model.” (MacKay, 2003)

From the posterior in (1.257), we determine the MAP estimate as

M ∗ = arg max_{Mk} p(Mk |D) .   (1.258)

With a uniform (uninformative) prior p(Mk ) = 1/K, determining the MAP estimate
amounts to picking the model that maximizes the model evidence (marginal likelihood)
p(D|Mk ) = ∫ p(D|θ k )p(θ k |Mk ) dθ k ,   (1.259)

where p(θ k |Mk ) is the prior distribution of the model parameters θ k of model Mk .
Remark 19
There are some important differences between a likelihood and a marginal likelihood:
While the likelihood is prone to overfitting, the marginal likelihood is typically not as
the model parameters have been marginalized out (i.e., we no longer have to fit the
parameters). Furthermore, the marginal likelihood automatically embodies a trade-off
between model complexity and data fit.

1.7.3 Bayes Factors for Model Comparison


Consider the problem of comparing two probabilistic models M1 , M2 , given a data
set D. If we compute the posteriors p(M1 |D) and p(M2 |D), we can compute the ratio



Figure 1.31: Illustration of the hierarchical generative process in Bayesian model se-
lection. We place a prior p(M ) on the set of models. For each model, there is a prior
p(θ k |Mk ) on the corresponding model parameters, which are then used to generate the
data D.

of the posteriors (posterior odds)


p(M1 |D) / p(M2 |D) = [p(D|M1 )p(M1 )/p(D)] / [p(D|M2 )p(M2 )/p(D)] = [p(M1 )/p(M2 )] · [p(D|M1 )/p(D|M2 )]   (1.260)
(posterior odds = prior odds × Bayes factor)

The first fraction on the right-hand-side (prior odds) measures how much our prior
(initial) beliefs favor M1 over M2 . The ratio of the marginal likelihoods (second
fraction on the right-hand-side) is called the Bayes factor and measures how well
the data D is predicted by M1 compared to M2 .
Remark 20
The Jeffreys-Lindley paradox states that the “Bayes factor always favors the simpler
model since the probability of the data under a complex model with a diffuse prior will
be very small” (Murphy, 2012).
If we choose a uniform prior over models, the prior odds term in (1.260) is 1, i.e.,
the posterior odds is the ratio of the marginal likelihoods (Bayes factor)
p(D|M1 ) / p(D|M2 ) .   (1.261)
If the Bayes factor is greater than 1, we choose model M1 , otherwise model M2 .

1.7.4 Fully Bayesian Treatment


Instead of selecting a single “best” model, in a fully Bayesian treatment, we integrate
out the corresponding model parameters θ k and average over all models Mk , k =
1, . . . , K
p(D) = Σ_{k=1}^{K} p(Mk ) ∫ p(D|θ k )p(θ k |Mk ) dθ k ,   (1.262)


where p(D|Mk ) is the marginal likelihood of model Mk .

1.7.5 Computing the Marginal Likelihood


The marginal likelihood plays an important role in model selection: We need to
compute it for Bayes factors (1.260) and in the fully Bayesian setting where we
additionally integrate out the model itself (1.262).
Unfortunately, computing the marginal likelihood requires us to solve an integral.
This integration is generally analytically intractable, and we will have to resort to
approximation techniques, e.g., numerical integration (Stoer and Bulirsch, 2002),
stochastic approximations using Monte Carlo (Murphy, 2012) or Bayesian Monte
Carlo techniques (O’Hagan, 1991; Rasmussen and Ghahramani, 2003).
However, there are special cases in which we can solve it. In Section 1.2.4, we
discussed conjugate models. If we choose a conjugate parameter prior p(θ), we can
compute the marginal likelihood in closed form.

Example (Marginal Likelihood Computation)


We consider the linear-Gaussian model with the following generative process (the

Figure 1.32: Graphical model for Bayesian linear regression.

corresponding graphical model is shown in Fig. 1.32):



θ ∼ N (m0 , S0 )   (1.263)
yi | xi , θ ∼ N (xi⊤ θ, σ 2 ) ,   (1.264)

i = 1, . . . , N . We seek the marginal likelihood
p(y|X) = ∫ p(y|X, θ)p(θ) dθ   (1.265)
       = ∫ N (y | Xθ, σ 2 I) N (θ | m0 , S0 ) dθ .   (1.266)

We compute the marginal likelihood in two steps: First, we show that the marginal
likelihood is Gaussian (as a distribution in y); Second, we compute the mean and
covariance of this Gaussian.


1. The marginal likelihood is Gaussian: From Section 1.2.3.5 we know that (i)
the product of two Gaussian densities is an (unnormalized) Gaussian, (ii) a
linear transformation of a Gaussian random variable is Gaussian distributed.
In (1.266), we require a linear transformation to bring N (y | Xθ, σ 2 I) into
the form N (θ | µ, Σ) for some µ, Σ. Once this is done,
the integral can be solved in closed form. The result is the normalizing con-
stant of the product of the two Gaussians. The normalizing constant itself has
Gaussian shape, see (1.44).

2. Mean and covariance. We compute the mean and covariance matrix of the
marginal likelihood by exploiting the standard results for means and covari-
ances of affine transformations of random variables, see Section 1.2.1.2. The
mean of the marginal likelihood is computed as

Eθ [y|X] = Eθ [Xθ + ǫ] = XEθ [θ] = Xm0 . (1.267)



Note that ǫ ∼ N (0, σ 2 I) is a vector of random variables. The covariance matrix is given as
Covθ [y] = Cov[Xθ] + σ 2 I = X Covθ [θ]X ⊤ + σ 2 I = XS0 X ⊤ + σ 2 I .   (1.268)

Hence, the marginal likelihood is


p(y|X) = (2π)−N/2 |XS0 X ⊤ + σ 2 I|−1/2 exp(−(1/2) (y − Xm0 )⊤ (XS0 X ⊤ + σ 2 I)−1 (y − Xm0 )) .   (1.269)
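For concreteness, the log of the Gaussian in (1.269) can be evaluated numerically; the following Matlab sketch (function name and inputs are ours, not part of the notes) uses a Cholesky factorization for numerical stability:

function lml = log_marginal_likelihood(X, y, m0, S0, sigma2)
    N = length(y);
    C = X * S0 * X' + sigma2 * eye(N);   % covariance of the marginal likelihood
    L = chol(C, 'lower');                % Cholesky factor for stable inversion
    alpha = L' \ (L \ (y - X * m0));     % C^{-1} (y - X m0)
    logdetC = 2 * sum(log(diag(L)));     % log|C| via the Cholesky factor
    lml = -0.5 * (N * log(2*pi) + logdetC + (y - X * m0)' * alpha);
end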

1.7.6 Further Reading


Rasmussen and Ghahramani (2001) showed that the automatic Occam’s razor does
not necessarily penalize the number of parameters in a model38 but it is active in
terms of the complexity of functions. They also showed that the automatic Occam’s
razor also holds for Bayesian non-parametric models with many parameters, e.g.,
Gaussian processes.
If we focus on the maximum likelihood estimate, there exist a number of heuristics
for model selection that discourage overfitting. The Akaike Information Criterion
(AIC) (Akaike, 1974)

log p(x|θ) − M (1.270)

corrects for the bias of the maximum likelihood estimator by addition of a penalty
term to compensate for the overfitting of more complex models (with lots of param-
eters). Here, M is the number of model parameters.
38. In parametric models, the number of parameters is often related to the complexity of the model class.



Figure 1.33: Graphical model for Bayesian linear regression.

The Bayesian Information Criterion (BIC) (Schwarz, 1978)


ln p(x) = ln ∫ p(x|θ)p(θ) dθ ≈ log p(x|θ) − (1/2) M ln N   (1.271)
can be used for exponential family distributions. Here, N is the number of data
points and M is the number of parameters. BIC penalizes model complexity more
heavily than AIC.
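As a small illustration (a sketch; we assume loglik and M are vectors holding the maximized log-likelihood and the parameter count of each candidate model, and N is the number of data points):

AIC = loglik - M;                  % criterion (1.270), larger is better
BIC = loglik - 0.5 * M * log(N);   % approximation (1.271)
[~, k_aic] = max(AIC);             % model selected by AIC
[~, k_bic] = max(BIC);             % model selected by BIC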

1.8 Bayesian Linear Regression


Thus far, we looked at linear regression models where we estimated the parameters
θ, e.g., by means of maximum likelihood or MAP estimation. We discovered that
MLE can lead to severe overfitting, in particular, in the small-data regime. MAP
addresses this issue by placing a prior on the parameters that plays the role of a
regularizer.
Bayesian linear regression pushes the idea of the parameter prior a step further and
does not even attempt to compute a point estimate of the parameters, but instead
the full posterior over the parameters is taken into account when making predictions.
This means that the parameters themselves remain uncertain.

1.8.1 Model
In Bayesian linear regression, we consider the following model
y = φ⊤ (x)θ + ǫ
ǫ ∼ N (0, σ 2 )   (1.272)
θ ∼ N (m0 , S0 ) ,
where we now explicitly place a Gaussian prior p(θ) = N (m0 , S0 ) on the parameter
vector θ.39 The corresponding graphical model is shown in Fig. 1.33.
39. Why is a Gaussian prior a convenient choice?


1.8.2 Parameter Posterior


Given a training set of inputs xi ∈ RD and corresponding observations yi ∈ R,
i = 1, . . . , N , we compute the posterior over the parameters using Bayes’ theorem as
p(θ|X, y) = p(y|X, θ)p(θ) / p(y|X) ,   (1.273)
where X is the collection of training inputs and y the collection of training targets.
Furthermore, p(y|X, θ) is the likelihood, p(θ) the parameter prior, and
p(y|X) = ∫ p(y|X, θ)p(θ) dθ   (1.274)

the marginal likelihood/evidence, which is independent of the parameters θ and


normalizes the posterior. The marginal likelihood is the likelihood averaged over all
possible parameter settings (with respect to the prior distribution p(θ)).
In our specific model (1.272), the posterior (1.273) can be computed in closed form
as

p(θ|X, y) = N (θ | mN , SN ) ,   (1.275)
SN = (S0−1 + σ−2 Φ⊤ Φ)−1 ,   (1.276)
mN = SN (S0−1 m0 + σ−2 Φ⊤ y) ,   (1.277)
where the subscript N indicates the size of the training set. In the following, we will
detail how we arrive at this posterior.
Bayes’ theorem tells us that the posterior p(θ|X, y) is proportional to the product of
the likelihood p(y|X, θ) and the prior p(θ):
p(θ|X, y) = p(y|X, θ)p(θ) / p(y|X) ,   (1.278)
p(y|X, θ) = N (y | Φθ, σ 2 I) ,   (1.279)
p(θ) = N (θ | m0 , S0 ) .   (1.280)
Looking at the numerator of the posterior in (1.278), we know that the Gaussian
prior times the Gaussian likelihood (where the parameters on which we place the
Gaussian appear linearly in the mean) is an (unnormalized) Gaussian (see Sec-
tion 1.2.3.5). If necessary, we can find the normalizing constant if we know the
covariance matrix of this unnormalized Gaussian.
Before we can compute this product, e.g., by applying the results from 1.2.3.5, we
need to ensure the product has the “right” form
   
N (y | Φθ, σ 2 I) N (θ | m0 , S0 ) = N (θ | µ, Σ) N (θ | m0 , S0 )   (1.281)
for some µ, Σ. With this form we determine the desired product immediately as
N (θ | µ, Σ) N (θ | m0 , S0 ) ∝ N (θ | mN , SN )   (1.282)
SN = (S0−1 + Σ−1 )−1   (1.283)
mN = SN (S0−1 m0 + Σ−1 µ) .   (1.284)
In the following, we discuss two ways of finding µ and Σ.


1.8.2.1 Linear Transformation of Gaussian Random Variables


To find µ and Σ, we use some basic identities for linearly transforming Gaussian
random variables (see Section 1.2.3.5), such that we transform y = Φθ into By =
θ for a suitable B. In the following, we will perform these steps directly on the
Gaussian distribution40 and obtain
N (y | Φθ, σ 2 I)  ❀ (scale by Φ⊤)  N (Φ⊤ y | Φ⊤ Φθ, σ 2 Φ⊤ Φ)   (1.285)
❀ (scale by (Φ⊤ Φ)−1)  N ((Φ⊤ Φ)−1 Φ⊤ y | θ, σ 2 (Φ⊤ Φ)−1 Φ⊤ Φ(Φ⊤ Φ)−1 )   (1.286)
= N (θ | (Φ⊤ Φ)−1 Φ⊤ y, σ 2 (Φ⊤ Φ)−1 ) .   (1.287)

If we now use the re-arranged likelihood (1.287) and define its mean as µ and
covariance matrix as Σ in (1.284) and (1.283), respectively, we obtain
  
N (θ | µ, Σ) N (θ | m0 , S0 ) ∝ N (θ | mN , SN )   (1.288)
SN = (S0−1 + σ−2 Φ⊤ Φ)−1   (1.289)
mN = SN (S0−1 m0 + σ−2 (Φ⊤ Φ)(Φ⊤ Φ)−1 Φ⊤ y) = SN (S0−1 m0 + σ−2 Φ⊤ y) ,   (1.290)
where we identified Σ−1 = σ−2 Φ⊤ Φ and µ = (Φ⊤ Φ)−1 Φ⊤ y from (1.287).
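Numerically, the posterior parameters in (1.289)–(1.290) can be computed directly; a minimal Matlab sketch (variable names are ours), assuming Phi, y, m0, S0 and the noise variance sigma2 are given:

A  = inv(S0) + Phi' * Phi / sigma2;       % posterior precision S_N^{-1}, cf. (1.289)
SN = inv(A);                              % posterior covariance S_N
mN = A \ (S0 \ m0 + Phi' * y / sigma2);   % posterior mean m_N, cf. (1.290)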

1.8.2.2 Completing the Squares


Instead of looking at the product of the prior and the likelihood, we can transform
the problem into log-space and solve for µ, Σ by completing the squares.
 
log N (y | Φθ, σ 2 I) + log N (θ | m0 , S0 )   (1.291)
= −(1/2) (σ−2 (y − Φθ)⊤ (y − Φθ) + (θ − m0 )⊤ S0−1 (θ − m0 )) + const   (1.292)
where the constant contains terms independent of θ. We will ignore the constant in
the following. We now factorize (1.292), which yields
−(1/2) (σ−2 y ⊤ y − 2σ−2 y ⊤ Φθ + θ ⊤ σ−2 Φ⊤ Φθ + θ ⊤ S0−1 θ − 2m0⊤ S0−1 θ + m0⊤ S0−1 m0 )   (1.293)
= −(1/2) (θ ⊤ (σ−2 Φ⊤ Φ + S0−1 )θ − 2(σ−2 Φ⊤ y + S0−1 m0 )⊤ θ) + const ,   (1.294)
where the constant collects the terms in (1.293) that are independent of θ (the y ⊤ y and m0⊤ S0−1 m0 terms).
By inspecting (1.294), we find that this equation is quadratic in θ. The fact that the
unnormalized log-posterior distribution is a (negative) quadratic form implies that
the posterior is Gaussian, i.e.,

p(θ|X, y) = exp(log p(θ|X, y)) ∝ exp(log p(y|X, θ) + log p(θ)) (1.295)


∝ exp(−(1/2) (θ ⊤ (σ−2 Φ⊤ Φ + S0−1 )θ − 2(σ−2 Φ⊤ y + S0−1 m0 )⊤ θ))   (1.296)
40. Pay attention to the transformation of the mean and covariance.


where we just used (1.294) in the last transformation.


The remaining task is it to bring this (unnormalized) Gaussian into the form that
is proportional to N θ | mN , S N for a scaling factor, i.e., we need to identify the
mean mN and the covariance matrix S N .
Here, we use the idea of completing the squares. The desired log-posterior is
log N (θ | mN , SN ) = −(1/2) (θ − mN )⊤ SN−1 (θ − mN ) + const   (1.297)
= −(1/2) (θ ⊤ SN−1 θ − 2mN⊤ SN−1 θ + mN⊤ SN−1 mN ) .   (1.298)

Here, we factorized the quadratic form (θ − mN )⊤ SN−1 (θ − mN ) into a term that
is quadratic in θ alone, a term that is linear in θ, and a constant term. This allows
us now to find SN and mN by matching these expressions in (1.294) and (1.298),
which yields

SN−1 = Φ⊤ σ−2 I Φ + S0−1  ⇔  SN = (σ−2 Φ⊤ Φ + S0−1 )−1 ,   (1.299)
mN⊤ SN−1 = (σ−2 Φ⊤ y + S0−1 m0 )⊤  ⇔  mN = SN (σ−2 Φ⊤ y + S0−1 m0 ) .   (1.300)

This is identical to the solution in (1.289)–(1.290), which we obtained by repeated


linear transformation of Gaussian random variables.
Remark 21 (Completing the Squares—General Approach)
If we are given an equation

x⊤ Ax − 2a⊤ x + const1 , (1.301)

where A is symmetric and positive definite, which we wish to bring into the form

(x − µ)⊤ Σ(x − µ) + const2 , (1.302)

we can do this by setting

Σ = A   (1.303)
µ = Σ−1 a   (1.304)

and const2 = const1 − µ⊤ Σµ.

We can see that the terms inside the exponential in (1.296) are of the form (1.301)
with

A = σ−2 Φ⊤ Φ + S0−1 ,   (1.305)
a = σ−2 Φ⊤ y + S0−1 m0 .   (1.306)

Since A and a can be difficult to identify in equations like (1.293), it is often helpful
to bring these equations into the form (1.301), which decouples quadratic terms, linear
terms, and constants; this simplifies finding the desired solution.


Remark 22
The posterior precision (inverse covariance) of the parameters (see (1.299), for exam-
ple)
SN−1 = S0−1 + (1/σ 2 ) Φ⊤ Φ ,   (1.307)
contains two terms: S0−1 is the prior precision and (1/σ 2 ) Φ⊤ Φ is a data-dependent
(precision) term. Both terms (matrices) are symmetric and positive definite. The data-
dependent term (1/σ 2 ) Φ⊤ Φ grows as more data is taken into account.41 This means (at
least) two things:
• The posterior precision grows as more and more data is taken into account (there-
fore, the covariance shrinks).
• The (relative) influence of the parameter prior vanishes for large N .

1.8.3 Prediction and Inference


In practice, we are usually not so much interested in the parameter values θ. Instead,
our focus often lies in the predictions we make with those parameter values. In
a fully Bayesian setting, we take the full posterior distribution and average over
all plausible parameter settings when we make predictions, instead of finding the
maximum a-posteriori (point) estimate of the parameters (the setting θ MAP at which
the posterior attains its maximum value). To predict the distribution of

y∗ = φ⊤ (x∗ )θ + ǫ ,  ǫ ∼ N (0, σ 2 ) ,   (1.308)
at a test location x∗ , we compute
p(y∗ |X, y, x∗ ) = ∫ p(y∗ |x∗ , θ)p(θ|X, y) dθ   (1.309)
= ∫ N (y∗ | φ⊤ (x∗ )θ, σ 2 ) N (θ | mN , SN ) dθ   (1.310)
= N (y∗ | mN⊤ φ(x∗ ), φ⊤ (x∗ )SN φ(x∗ ) + σ 2 )   (1.311)

The term φ⊤ (x∗ )S N φ(x∗ ) reflects the uncertainty associated with the parameters θ,
whereas σ 2 is the noise variance. Note that S N depends on the training inputs X,
see (1.289). The predictive mean coincides with the MAP estimate.
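In code, the predictive moments in (1.311) follow directly from mN and SN. A small Matlab sketch (our variable names), where phi_star holds the feature vectors φ(x∗ ) of the test inputs as columns and sigma2 is the noise variance:

mu_star  = phi_star' * mN;                                 % predictive means m_N' * phi(x*)
var_star = sum(phi_star .* (SN * phi_star), 1)' + sigma2;  % phi(x*)' * S_N * phi(x*) + sigma^2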
Remark 23 (Mean and Variance of Noise-Free Function Values)
In many cases, we are not interested in the predictive distribution p(y∗ |X, y, x∗ ) of a
(noisy) observation. Instead, we would like to obtain the distribution of the (noise-free)
latent function values f (x∗ ) = φ⊤ (x∗ )θ. We determine the corresponding moments by
exploiting the properties of means and variances, which yields
E[f (x∗ )|X, y] = Eθ [φ⊤ (x∗ )θ|X, y] = φ⊤ (x∗ )Eθ [θ|X, y] = mN⊤ φ(x∗ ) ,   (1.312)
Vθ [f (x∗ )|X, y] = Vθ [φ⊤ (x∗ )θ|X, y] = φ⊤ (x∗ )Vθ [θ|X, y]φ(x∗ ) = φ⊤ (x∗ )SN φ(x∗ )   (1.313)
41. Φ⊤ Φ is accumulating contributions from the data, not averaging.


Remark 24 (Distribution over Functions)


The fact that we integrate out the parameters θ induces a distribution over functions:
If we sample θ i ∼ p(θ|X, y) from the parameter posterior, we obtain a single function
realization θ i⊤ φ(·). The mean function, i.e., the set of all expected function values
Eθ [f (·)|X, y], of this distribution over functions is mN⊤ φ(·). The (marginal) variances,
i.e., the variance of the function f (·), are given by φ⊤ (·)SN φ(·).

1.8.3.1 Derivation
The predictive distribution p(y∗ |X, y, x∗ ) is Gaussian: In (1.310), we multiply two
Gaussians (in θ), which results in another (unnormalized) Gaussian.42 When in-
tegrating out θ, we are left with the normalization constant, which itself is Gaus-
sian shaped (see Section 1.2.3.5). Therefore, it suffices to determine the mean and
the (co)variance of the predictive distribution, which we will do by applying the
standard rules for computing means and (co)variances (see Section 1.2.1). In the
following, we will use the shorthand notation φ∗ = φ(x∗ ).
The mean of the posterior predictive distribution p(y∗ |X, y, x∗ ) is

Eθ,ǫ [y∗ |X, y, x∗ ] = Eθ,ǫ [φ∗⊤ θ + ǫ|X, y]   (1.314)
= φ∗⊤ Eθ [θ|X, y] + 0   (1.315)
= φ∗⊤ mN .   (1.316)

Here, we exploited that the noise is i.i.d. and that its mean is 0.
The corresponding posterior predictive variance is

Vθ,ǫ [y∗ |X, y, x∗ ] = Vθ,ǫ [φ∗⊤ θ + ǫ|X, y]   (1.317)
= Vθ [φ∗⊤ θ|X, y] + Vǫ [ǫ]   (1.318)   (using the i.i.d. assumption)
= φ∗⊤ Vθ [θ|X, y]φ∗ + σ 2   (1.319)
= φ∗⊤ SN φ∗ + σ 2 .   (1.320)

The term φ∗⊤ SN φ∗ in the variance expression is due to the inherent (posterior)
uncertainty of the parameters, which induces the uncertainty about the latent function
f . The additive term σ 2 is the variance of the i.i.d. noise variable.
42. To be precise: We multiply the posterior p(θ|X, y) with a distribution of a linearly transformed θ. Note that a linear transformation of a Gaussian random variable preserves Gaussianity (see Section 1.2.3.5).

Example

Fig. 1.34 shows some examples of the posterior distribution over functions, induced
by the parameter posterior. The left panels show the maximum likelihood estimate,
the MAP estimate (which is identical to the posterior mean function) and the 95%
predictive confidence bounds, represented by the shaded area. The right panels
show samples from the posterior over functions: Here, we sampled parameters θ i
from the parameter posterior and computed the function φ⊤ (x∗ )θ i , which is a single

realization of a function under the posterior distribution over functions. For low-
order polynomials, the parameter posterior does not allow the parameters to vary
much: The sampled functions are nearly identical. When we make the model more
flexible by adding more parameters (i.e., we end up with a higher-order polyno-
mial), these parameters are not sufficiently constrained by the posterior, and the
sampled functions can be easily visually separated. We also see in the corresponding
panels on the left how the uncertainty increases, especially at the boundaries. Al-
though for a 10th-order polynomial the MAP estimate yields a good fit, the Bayesian
linear regression model additionally tells us that the posterior uncertainty is huge.
This information can be critical, if we use these predictions in a decision-making sys-
tem, where bad decisions can have significant consequences (e.g., in reinforcement
learning or robotics).

Remark 25
Bayesian linear regression naturally equips the estimate with uncertainty induced by the
(posterior) distribution on the parameters. In maximum likelihood (or MAP) estima-
tion, we can obtain an estimate of the uncertainty by looking at the point-wise squared
distances between the observed values yi in the training data and the function values
φ(xi )⊤ θ. The variance estimate would then be itself a maximum likelihood estimate
and given by
σ 2ML = (1/N ) Σ_{i=1}^{N} (yi − φ(xi )⊤ θ)2 ,   (1.321)

where θ is a point estimate of the parameters (e.g., maximum likelihood or MAP) and
N is the size of the training data set.


[Plots omitted: left and right panels for polynomials of degree 4, 6, and 10, showing f(x) versus x.]
Figure 1.34: Bayesian linear regression. Left panels: The shaded area indicates the
95% predictive confidence bounds. The mean of the Bayesian linear regression model
coincides with the MAP estimate. The predictive uncertainty is the sum of the noise
term and the posterior parameter uncertainty, which depends on the location of the test
input. Right panels: Sampled functions from the posterior distribution.


Chapter 2

Feature Extraction

2.1 Decompositions
In this chapter we will discuss the use of linear algebra of vectors and matrices
in order to define basic feature extraction and dimensionality reduction methodolo-
gies. In this context, we will study particular linear decompositions, such as eigen-
decomposition and diagonalisations, QR decomposition, Singular Value Decompo-
sitions (SVD), etc. The above algebraic decompositions will be used to formulate
popular linear feature extraction methodologies such as Principal Component Anal-
ysis (PCA) and Linear Discriminant Analysis (LDA). We will show that many of these
decompositions arise naturally by formulating and solving trace optimisation prob-
lems with constraints. We will also study a Maximum Likelihood (ML) solution of a
Probabilistic PCA (PPCA) formulation. Finally, we will study some non-linear feature
extraction methodologies, using kernel methods. In the following we will be using
elements of linear algebra that can be found in Appendix A.1.

2.1.1 Eigen-decomposition
Assume a square matrix A ∈ Rn×n with n linearly independent eigenvectors qi , i =
1, . . . , n and n eigenvalues λ1 , . . . , λn . Then A can be factorised as

A = QΛQ−1 (2.1)

where Q = [q1 . . . qn ] and Λ is a diagonal matrix whose diagonal elements are the
corresponding eigenvalues, i.e. Λ = diag{λ1 , . . . , λn }.
Using the eigen-decomposition we can compute various powers of A as

Ak = QΛk Q−1 . (2.2)

We can easily verify the above for k = 2 as A2 = QΛQ−1 QΛQ−1 = QΛ2 Q−1 . Then,
we can easily prove the general case using induction.
In case k = −1 we can compute the inverse as

A−1 = QΛ−1 Q−1 . (2.3)
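A quick numerical sanity check of (2.1)–(2.3) in Matlab (the matrix below is just an arbitrary diagonalisable, invertible example of ours):

A = [4 1; 2 3];        % eigenvalues 5 and 2
[Q, L] = eig(A);       % columns of Q are eigenvectors, L = diag(lambda_i)
norm(A - Q*L/Q)        % ~0, verifies A = Q*Lambda*Q^{-1}
norm(A^3 - Q*L^3/Q)    % ~0, verifies (2.2) for k = 3
norm(inv(A) - Q/L/Q)   % ~0, verifies (2.3)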


2.1.1.1 Symmetric Matrices


The eigen-decomposition of symmetric matrices is of particular interest in machine
learning. Symmetric matrices always have a real eigen-decomposition (i.e., all eigen-
values and eigenvectors are real). Furthermore, Q is an orthogonal matrix (i.e.,
QT = Q−1 ). Hence, if A is symmetric
A = QΛQT . (2.4)
In case that A is singular (i.e., rank(A) = r < n) then there are n − r eigenvectors
that correspond to zero eigenvalues.
 
A = Q diag{λ1 , . . . , λr , 0, . . . , 0} QT = Q [ Λr 0 ; 0 0 ] QT = Qr Λr QTr   (2.5)
where Qr is an n × r matrix of the eigenvectors that correspond to non-zero eigen-
values and Λr is r × r diagonal matrix of the r non-zero eigenvalues.

2.1.2 QR decomposition
Any real square matrix A ∈ Rn×n can be decomposed as
A = QR (2.6)
where Q is an orthogonal matrix (i.e., QT Q = QQT = I) and R is an upper-
triangular matrix.
Some important properties of QR decomposition
• If A is non-singular then its QR decomposition is unique if we require the
diagonal elements of R to be positive.
• If A is singular then QR decomposition still exists but yields a singular upper
triangular R .
• If A has r linearly independent columns, then the first r columns of Q form an
orthonormal basis for the column space of A.
• |det(A)| = | ∏i rii | (see Appendix A.1.2.12 for a proof).
QR has important applications in solving linear systems, in producing an orthogonal
basis for eigenvalue computations etc. In the following we will see some examples.

Example (Solving Linear System)


Assume the following linear system
r11 x1 +r12 x2 +r13 x3 +··· +r1(m−1) xm−1 +r1m xm = b1
0x1 +r22 x2 +r23 x3 +··· +r2(m−1) xm−1 +r2m xm = b2
.. (2.7)
.
0x1 +0x2 +0x3 +··· +r(m−1)(m−1) xm−1 +r(m−1)m xm = bm−1
0x1 +0x2 +0x3 +··· +0xm−1 +rmm xm = bm

72

Downloaded by Daria Godorozha (dgodorozha@gmail.com)


lOMoARcPSD|5494880

Chapter 2. Feature Extraction 2.1. Decompositions

The above linear system can be written as

Rx = b (2.8)

where R is an upper triangular matrix, x = [x1 , . . . , xm ]T and b = [b1 , . . . , bm ]T .


The above system can be easily solved using the following set of rules (backward
elimination)
xm = bm / rmm
xm−1 = (bm−1 − r(m−1)m xm ) / r(m−1)(m−1)   (2.9)
...
x1 = (b1 − Σ_{i=2}^{m} r1i xi ) / r11 .
The QR decomposition can be used in order to reduce any linear m × m system
Ax = b (with A of full rank) to a system as in (2.8). That is, assume that the QR
decomposition of A is QR then

Ax = b ⇔
QRx = b ⇔ (2.10)
Rx = QT b.
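A small Matlab illustration of (2.10) (the particular A and b below are an arbitrary full-rank example of ours):

A = [2 1 1; 1 3 2; 1 0 0];   % a full-rank 3x3 system
b = [4; 5; 6];
[Q, R] = qr(A);              % A = Q*R, Q orthogonal, R upper triangular
x = R \ (Q' * b);            % backward elimination on R*x = Q'*b, cf. (2.8)-(2.10)
norm(A*x - b)                % ~0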

A very important application of the QR decomposition is the QR algorithm for eigen-


value computation.

Example (The QR algorithm for eigenvalue computation)

Assume matrix A. We create the series Ak by choosing for k = 0, A0 = A and then

A k = Q k Rk
(2.11)
Ak+1 = Rk Qk

As can be seen
Ak+1 = Rk Qk = QTk Qk Rk Qk = QTk Ak Qk (2.12)
so all Ak are similar and hence they have the same eigenvalues. The matrices Ak
converge to a triangular matrix (Golub and Van Loan (2012)). Hence, the main
diagonal of this matrix contains the eigenvalues of A.
As an exercise, run the above algorithm in Matlab for a symmetric matrix and com-
pare the results with the function eig of Matlab. In particular, try to run the code
A = randn(100,100);
B = A*A';
B = 1/2*(B+B'); % This is for stability (making a truly symmetric matrix)
Ao = B;
for i=1:100
    [Q, R] = qr(Ao);
    Ao = R*Q;
end

and compare the eigenvalues computed by eig(B).

2.1.2.1 Gram-Schmidt Process

In the following we will discuss a practical and well-known algorithm for computing
the QR decomposition. The algorithm can also be used in order to compute an
orthogonal base from a matrix. The algorithm is the so-called Gram-Schmidt (GS)
process.
GS process answers the following question: If the columns of A = [a1 , . . . , an ] define
a basis (not orthonormal) for an inner product space, is there a way to convert it to
an orthonormal basis?
We start by geometrically describing the process for two vectors a1 and a2 . In the first
step we normalize a1 as q1 = a1 /||a1 ||2 . Then, we compute the projection of a2 onto
q1 as in (A.6)
projq1 a2 = (qT1 a2 / ||q1 ||22 ) q1 = (qT1 a2 )q1 .   (2.13)

Then, we can compute the vector orthogonal to q1 as

q̃2 = a2 − (qT1 a2 )q1 (2.14)

and normalize it so that it has norm one as q2 = q̃2 /||q̃2 ||2 .
The general algorithm for computing the orthogonal basis Q = [q1 , . . . , qn ] is

q̃1 = a1 ,                                q1 = q̃1 /||q̃1 ||2
q̃2 = a2 − projq1 a2 ,                    q2 = q̃2 /||q̃2 ||2
q̃3 = a3 − projq1 a3 − projq2 a3 ,        q3 = q̃3 /||q̃3 ||2        (2.15)
q̃n = an − Σ_{i=1}^{n−1} projqi an ,      qn = q̃n /||q̃n ||2


The upper triangular matrix R can be computed by observing that

a1 = (qT1 a1 )q1
a2 = (qT1 a2 )q1 + (qT2 a2 )q2
...                                        (2.16)
an = Σ_{i=1}^{n} (qTi an )qi .

Hence, R can be computed as


 
R = [ qT1 a1   qT1 a2   · · ·   qT1 an
      0        qT2 a2   · · ·   qT2 an
      ...
      0        · · ·    0       qTn an ] .   (2.17)
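A direct Matlab transcription of (2.15)–(2.17) is sketched below (classical Gram-Schmidt; in practice qr or the modified Gram-Schmidt variant is numerically safer):

function [Q, R] = gram_schmidt(A)
    [m, n] = size(A);
    Q = zeros(m, n);
    R = zeros(n, n);
    for j = 1:n
        v = A(:, j);
        for i = 1:j-1
            R(i, j) = Q(:, i)' * A(:, j);   % the entries q_i^T a_j of (2.17)
            v = v - R(i, j) * Q(:, i);      % subtract the projections, cf. (2.15)
        end
        R(j, j) = norm(v);
        Q(:, j) = v / R(j, j);              % normalise q_j
    end
end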

Example (The GS procedure)


Let matrix
A = [ 1  1  0 ;  1  2  1 ;  −2  −3  1 ]
and perform QR decomposition on A.
Write A = [a1 a2 a3 ], so a1 = [1, 1, −2]T , a2 = [1, 2, −3]T , a3 = [0, 1, 1]T .
Let Q = [q1 q2 q3 ]. Since ||a1 ||2 = √6, we have q1 = a1 /||a1 ||2 = [1/√6, 1/√6, −2/√6]T .
First we compute qT1 a2 = 9/√6. Then
q̃2 = a2 − (qT1 a2 )q1 = [1, 2, −3]T − (9/6)[1, 1, −2]T = [−1/2, 1/2, 0]T
and ||q̃2 ||2 = √(1/2). Then, q2 = q̃2 /||q̃2 ||2 = [−1/√2, 1/√2, 0]T .
Next, aT3 q1 = −1/√6 and aT3 q2 = 1/√2. Then,
q̃3 = a3 − (aT3 q1 )q1 − (aT3 q2 )q2 = [0, 1, 1]T + (1/6)[1, 1, −2]T − (1/2)[−1, 1, 0]T = [2/3, 2/3, 2/3]T   (2.18)
and ||q̃3 ||2 = 2/√3, hence q3 = q̃3 /||q̃3 ||2 = [1/√3, 1/√3, 1/√3]T .   (2.19)
Hence,
Q = [ 1/√6  −1/√2  1/√3 ;  1/√6  1/√2  1/√3 ;  −2/√6  0  1/√3 ]
and
R = [ qT1 a1  qT1 a2  qT1 a3 ;  0  qT2 a2  qT2 a3 ;  0  0  qT3 a3 ] = [ √6  9/√6  −1/√6 ;  0  1/√2  1/√2 ;  0  0  2/√3 ] .

2.1.3 Singular Value Decomposition


A very useful matrix decomposition is the Singular Value Decomposition (SVD). If
A ∈ Rm×n , then there exist orthogonal matrices
U = [u1 , . . . , um ] ∈ Rm×m and V = [v1 , . . . , vn ] ∈ Rn×n . (2.20)
such that
UT AV = Σ = diag{σ1 , . . . , σp } ∈ Rm×n , p = min{m, n}
(2.21)
A = UΣVT
where σ1 ≥ σ2 ≥ . . . ≥ σp ≥ 0. σi are called singular values of A and the vectors ui
and vi are the corresponding i-th left and right singular vectors, respectively.
SVD reveals a lot of information regarding the structure of a matrix. Assume the
SVD of matrix A, then we define r by
σ1 ≥ σ2 ≥ . . . ≥ σr ≥ σr+1 = · · · = σp = 0, (2.22)
then
rank(A) = r
null(A) = span{vr+1 , . . . , vn } (2.23)
ran(A) = span{u1 , . . . , ur }
and the SVD expansion can be written as
A = Σ_{i=1}^{r} σi ui viT .   (2.24)

Furthermore, regarding the 2-norm and Frobenius norm we have the following,
||A||2F = σ12 + . . . + σp2 , p = min{m, n}
(2.25)
||A||2 = σ1 .


2.1.3.1 Thin SVD


Assume A = UΣVT ∈ Rm×n is the SVD of A and m ≥ n, then
A = U1 Σ1 VT   (2.26)

where
U1 = [u1 , . . . , un ] ∈ Rm×n (2.27)
and
Σ1 = diag(σ1 , . . . , σn ) ∈ Rn×n . (2.28)
The above is the thin SVD of matrix A.
In thin SVD we have UT1 U1 = In but U1 UT1 ≠ Im . Furthermore, VT V = VVT = In .

2.1.3.2 Dimensionality Reduction and SVD


A very important application of SVD is the reduction of the rank of a matrix. Rank
reduction can be used for dimensionality reduction on the data (e.g., the smallest
singular values and the corresponding singular vectors may correspond to noise).
Theorem 1
Let the SVD of A = UΣVT . If k < r = rank(A) and
Ak = Σ_{i=1}^{k} σi ui viT ,   (2.29)
then
min_{rank(B)=k} ||A − B||2 = ||A − Ak ||2 = σk+1 .   (2.30)

Proof. Since UT Ak V = diag(σ1 , . . . , σk , 0, . . . , 0), then rank(Ak ) = k and UT (A −


Ak )V = diag(0, . . . , 0, σk+1 , . . . , σp ). Hence, ||A − Ak ||2 = σk+1 .
Now suppose rank(B) = k for a B ∈ Rm×n . It follows that there is an orthonormal
basis {q1 , . . . , qn−k } so that null(B) = span{q1 , . . . , qn−k }. It follows that

span{q1 , . . . , qn−k } ∩ span{v1 , . . . , vk+1 } ≠ {0}.   (2.31)

That is, there exists a unit 2-norm vector z such that Bz = 0 and which can be written as
a linear combination z = Σ_{i=1}^{k+1} αi vi with Σ_{i=1}^{k+1} αi2 = 1 and αi = viT z. Since Bz = 0
and
Az = Σ_{i=1}^{k+1} σi (viT z)ui   (2.32)

we have that
||A − B||2^2 ≥ ||(A − B)z||2^2 = ||Az||2^2 = Σ_{i=1}^{k+1} σi^2 (viT z)^2 ≥ σk+1^2 Σ_{i=1}^{k+1} αi^2 = σk+1^2   (2.33)

which completes the proof.


The above theorem provides us with a way for dimensionality reduction through
rank-reduction. That is, assume that we have a collection of n data samples x1 , . . . , xn .
We stack the samples as columns of matrix X = [x1 , . . . , xn ]. Then, the rank-k
representation of the data is
Xk = Uk Σk VkT ,   (2.34)
which keeps only the first k singular values and the corresponding singular vectors.
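In Matlab, (2.34) can be computed with the built-in SVD (a sketch; X holds the samples as columns and we assume k < rank(X)):

[U, S, V] = svd(X, 'econ');                 % thin SVD of the data matrix
Xk  = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';    % best rank-k approximation in the 2-norm
err = S(k+1,k+1);                           % equals ||X - Xk||_2 by Theorem 1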

2.1.4 Principal Component Analysis


In the following we describe one of the most well-studied and popular methods for
feature extraction and dimensionality reduction. That is, we will describe Principal
Component Analysis (PCA). PCA has a very long history in mathematical statistics
and engineering. It was first invented by Karl Pearson in 1901 and independently
developed by Harold Hotelling. In the following we will introduce PCA under two
perspectives. One is the statistical one and the second is based on reconstruction.
We assume that we are given a collection of n d-dimensional samples stored in the
columns of matrix X = [x1 , . . . , xn ]. For convenience we will assume that the data
are centered (i.e., Σ_{i=1}^{n} xi = 0). We can always center our data by computing
and subtracting the mean of the data m = (1/n) Σ_{i=1}^{n} xi . Data centering can be done
efficiently using matrix multiplication (see the example below).

Example (Data Centering using Matrix Multiplication)


In order to compute data centering using matrix multiplication first we need to see
how we can create a matrix that contains in all n columns the same vector a ∈ Rd

A = [a, . . . , a] = a1Tn . (2.35)

where 1n ∈ Rn is the vector of all ones.
Using the above, a matrix M that contains in all columns the mean of the data
(m = (1/n) Σ_{i=1}^{n} xi = (1/n) X1n ) can be computed as
M = m1Tn = X((1/n) 1n 1Tn ).   (2.36)
Hence, data centering can be computed as
[x1 − m, . . . , xn − m] = X − M = X(I − (1/n) 1n 1Tn ).   (2.37)
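In Matlab, (2.37) is a one-liner (a sketch, with the samples as the columns of X):

n  = size(X, 2);
Xc = X * (eye(n) - ones(n, n)/n);   % centred data, equivalent to X - mean(X, 2)*ones(1, n)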

2.1.4.1 Statistical Perspective


We want to find a set of features, via linear projections onto some basis, that max-
imise the variance of the sample data on that basis. We will start by assuming that


we want to find only one vector w so that the variance of the projected features
yi = wT xi is maximised. The variance can be defined as
σy2 = (1/n) Σ_{i=1}^{n} (yi − µy )2 .   (2.38)
where µy = (1/n) Σ_{i=1}^{n} yi . But since we assumed centered data it is easy to verify that
µy = 0. Hence, the optimal features {y1o , . . . , yno } are given by
{y1o , . . . , yno } = arg max (1/2) σy2  ⇒
wo = arg max_w (1/2n) Σ_{i=1}^{n} (wT xi )2 = arg max_w (1/2n) Σ_{i=1}^{n} wT xi xTi w   (2.39)
   = arg max_w wT ((1/2n) Σ_{i=1}^{n} xi xTi ) w = arg max_w (1/2n) wT XXT w
   = arg max_w (1/2) wT St w

where St = (1/n) XXT is a very important matrix called the covariance matrix or
total-scatter matrix.1
The above optimisation problem has a trivial solution which is w = ∞. In order to
avoid the trivial solution we incorporate the constraint ||w||2 = 1. Hence optimisa-
tion problem (2.39) is now re-formulated as
wo = arg max_w (1/2) wT St w   (2.40)
subject to wT w = 1.

In order to solve the above constrained optimisation problem we need to formulate


the Lagrangian as
L(w, λ) = (1/2) wT St w − (1/2) λ(wT w − 1).   (2.41)
We can compute the partial derivatives (using the results in Appendix A.33) as

∇w L(w) = St w − λw (2.42)

Setting ∇w L(w) = 0 we end up with

St w = λw (2.43)

which means that w is an eigenvector of St and the Lagrangian multiplier λ is the


corresponding eigenvalue. St is a symmetric positive semi-definite matrix, hence all
of its eigenvalues are non-negative.
Now, the question is what pair of eigenvalue and eigenvector, should we choose?
Intuitively, someone would answer the eigenvector that corresponds to the largest
1. We will drop the 1/n in the following for convenience.


eigenvalue. But let’s see that mathematically, by replacing the solution to the opti-
misation problem (2.39), we get
λo = arg max_λ λ.   (2.44)

Hence, we have to choose λ to be the largest eigenvalue and w to be the eigenvector


that corresponds to the largest eigenvalue. This vector is also called the principal
axis of the data.
Now assume that we want to extract more than one dimension. That is, we want to
find yi ∈ Rd so that    
yi = [y1i , . . . , ydi ]T = [w1T xi , . . . , wdT xi ]T = WT xi   (2.45)
where W = [w1 , . . . , wd ]. We also assume that WT W = I.
The optimisation problem is now written as
Wo = arg max_W (1/2n) Σ_{k=1}^{d} Σ_{i=1}^{n} yki2 = arg max_W (1/2n) Σ_{i=1}^{n} Σ_{k=1}^{d} (wkT xi )2
   = arg max_W (1/2n) Σ_{i=1}^{n} Σ_{k=1}^{d} wkT xi xTi wk
   = arg max_W Σ_{k=1}^{d} wkT ((1/2n) Σ_{i=1}^{n} xi xTi ) wk = arg max_W (1/2) Σ_{k=1}^{d} wkT St wk   (2.46)
   = arg max_W tr(WT St W)
subject to WT W = I.
The Lagrangian of the above optimisation problem is
L(W, Λ) = tr(WT St W) − tr(Λ(WT W − I)) (2.47)
where Λ is a matrix of Lagrange multipliers. Then, we set ∇W L(W) = 0, which leads to
St W = WΛ ,   (2.48)
which gives
St wk = λk wk   (2.49)
for k = 1, . . . , d. Hence, W contains as columns the eigenvectors of St .
Now, let’s discuss: what d eigenvectors should we keep?
Assume the eigendecomposition of St as
St = UΛUT (2.50)
then W = Ud = [u1 , . . . , ud ]. The cost function in (2.46) can be written as
tr(WT St W) = tr(WT UΛUT W) = tr(Λd ) = Σ_{k=1}^{d} λk .   (2.51)


Since λk ≥ 0, maximisation of the above cost function boils down to keeping the
eigenvectors that correspond to the d largest eigenvalues.
To conclude, the PCA transform looks for d orthogonal direction vectors (known as the
principal axes) such that the projection of input sample vectors onto the principal
directions has the maximal spread, or equivalently that the variance of the output
coordinates is maximal. The principal directions are the first (with respect to de-
scending eigenvalues) d eigenvectors of the covariance matrix XXT .

2.1.4.2 Reconstruction Perspective


We will also have another perspective of PCA using the notion of optimal linear
reconstruction. Assume again the same set of centered data X = [x1 , . . . , xn ]. We
assume we want to find an optimal way to reconstruct the data using a small set
of vectors and produce X̃ = [x̃1 , . . . , x̃n ]. We want to find the optimal vectors by
minimizing the least squares form as
Wo = arg min_W (1/2) Σ_{i=1}^{n} ||xi − x̃i ||2^2 = arg min_W (1/2) ||X − X̃||2F .   (2.52)

The reconstruction operator using the vectors stored in the columns of W = [w1 , . . . , wd ]
can be written as
xi = Wyi (2.53)
which gives yi = (WT W)−1 WT xi .
Replacing it back we get that the optimal reconstruction is

x̃i = W(WT W)−1 WT xi . (2.54)

Imposing that WT W = I we get


Wo = arg min_W (1/2) ||X − X̃||2F = arg min_W ||X − WWT X||2F
   = arg min_W tr((X − WWT X)T (X − WWT X))
   = arg min_W tr(XT X) − tr(WT XXT W)   (2.55)
   = arg max_W tr(WT XXT W)
subject to WT W = I

which is equivalent to the optimisation problem (2.46).

2.1.4.3 Computing PCA


In this section we will discuss how to compute PCA in Small Sample Size (SSS)
problems (i.e., where the original dimensionality of the observations, F , is larger –
and in the majority of cases much larger – than the number of samples).
If we want to keep d principal components, the computational cost of the eigen-
analysis of an F × F matrix in order to find d eigenvalues/eigenvectors is O(dF 2 ). If


Figure 2.1: Example of data whitening using the PCA projection matrix W = UΛ−1/2 .

F is large, this computation can be quite expensive. We will show how to expedite
this procedure when n ≪ F . We firstly have to introduce the following Lemma.
Lemma 1
Let us assume that B = XXT and C = XT X. It can be proven that B and C have
the same positive eigenvalues Λ and, assuming that n < F , the eigenvectors U of B
and the eigenvectors V of C are related as U = XVΛ−1/2 .

Using Lemma 1 we can compute the eigenvectors U of St = XXT in O(n3 ). The


eigenanalysis of XT X is denoted by
XT X = VΛVT (2.56)
where V is an n × (n − 1) matrix with the eigenvectors as columns and Λ is an (n −
1) × (n − 1) diagonal matrix with the eigenvalues in the main diagonal. Given that
VT V = I and VVT ≠ I we have

XT X = VΛVT and U = XVΛ−1/2  ⇒  UT XXT U = Λ−1/2 VT XT XXT XVΛ−1/2
= Λ−1/2 (VT V)Λ(VT V)Λ(VT V)Λ−1/2 = Λ .   (2.57)

The pseudocode for computing PCA in SSS problems is

Algorithm 1 Principal Component Analysis


1: procedure PCA
2: Compute dot product matrix: XT X = [(xi − m)T (xi − m)]
3: Eigenanalysis: XT X = VΛVT
4: Compute eigenvectors: U = XVΛ−1/2
5: Keep specific number of first components: Ud = [u1 , . . . , ud ]
6: Compute d features: Y = Ud T X
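A Matlab sketch of Algorithm 1 (our variable names; X is F × n with samples as columns and d ≤ n − 1 components are kept):

n  = size(X, 2);
Xc = X - mean(X, 2) * ones(1, n);         % center the data
[V, L] = eig(Xc' * Xc);                   % eigenanalysis of the n x n dot-product matrix
[lam, idx] = sort(diag(L), 'descend');    % sort eigenvalues in descending order
V   = V(:, idx(1:d));
lam = lam(1:d);
U   = Xc * V * diag(1 ./ sqrt(lam));      % eigenvectors of Xc*Xc', U = X*V*Lambda^{-1/2}
Y   = U' * Xc;                            % d-dimensional features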

Let’s inspect the covariance matrix of low-dimensional features stored in matrix Y.


The covariance matrix of Y is
YYT = UT XXT U = Λ.


From the above it is evident that the features in Y are un-correlated (i.e., the co-
variance matrix YYT is diagonal, with off-diagonal elements being zero) and the
variance in each dimension is equal to the positive eigenvalues of XXT .
Assume further that we want to make the low-dimensional covariance matrix of the
data equal to the identity matrix. This procedure is called whitening (or sphering)
which is an important normalisation of the data.
The whitening transformation is given by the projection matrix
W = UΛ−1/2   (2.58)

which normalises the data to have unit variance (Figure 2.1).

2.1.4.4 Link between SVD and PCA


Here we will investigate the relationship between PCA and SVD on a set of cen-
tralised data stored in matrix X. The projection basis of PCA is given by the r eigen-
vectors of the covariance matrix that do not correspond to zero eigenvalues

XXT = Wr Λr WrT . (2.59)

Furthermore, assume the SVD decomposition of

X = Ur Σr VrT . (2.60)

Using the above SVD decomposition we can write the covariance matrix as

XXT = Ur Σr VrT Vr Σr UTr = Ur Σ2r UTr . (2.61)

Comparing now the above with the eigen-decomposition in 2.59, we get that (1) the
basis of PCA is given by the left orthogonal space of SVD and (2) the eigenvalues are
the singular values squared.
Furthermore, the low-dimensional features of PCA are given by

Yr = UTr X = UTr Ur Σr VrT = Σr VrT . (2.62)

Hence, the normalised features stacked in matrix Σ−1 r Yr are equal to the right or-
thogonal space of SVD, Vr . That is, the right orthogonal space of SVD is equal to the
whitened features.

2.1.5 Linear Discriminant Analysis


In this section we will try to derive a dimensionality reduction methodology that
exploits the availability of discrete labels. That is we assume the following scenario:
Say we are given a set of data x1 , . . . , xn and a vector l = [l1 , . . . , ln ] with labels (a
label li per sample xi ) with li ∈ 1, . . . , C where C is the number of classes we have.
Some of the pros and cons of PCA for the above scenario are
• PCA: Unsupervised approach, good for compression of data and data recon-
struction. Good statistical prior.


• PCA: Not explicitly defined for classification problems (i.e., in case that data
come with labels)

• How do we define a latent space in this case? (i.e., one that helps in data classifica-
tion).
In order to capitalise on the availability of class labels we need to properly define
relevant statistical properties which may help us in classification. Intuition: We want
to find a space in which:
1. data consisting each class look more like each other, while,

2. data of separate classes look more dissimilar.


The relevant questions we need to answer are:
1. How do I make my data in each class look more similar? (Answer: Minimise
the variability in each class (i.e., minimize the variance)).

2. How do I make the data between classes look dissimilar? (Answer: I move
the data from different classes further away from each other (i.e., increase the
distance between their means)).

2.1.5.1 The two class case


In the following, we will discuss the two-class case. That is, we assume that C = 2
and that we have two means µy (c1 ), µy (c2 ) and two variances σy2 (c1 ), σy2 (c2 ) (one for
each class). We want a latent space of yi such that:

σy2 (c1 ) + σy2 (c2 ) is minimum

(µy (c1 ) − µy (c2 ))2 is maximum


How do I combine them together?
 
minimize (σy2 (c1 ) + σy2 (c2 )) / (µy (c1 ) − µy (c2 ))2
or, equivalently, maximize (µy (c1 ) − µy (c2 ))2 / (σy2 (c1 ) + σy2 (c2 )) .
Assuming again that the low-dimensional features are found through a projection to
a vector w as yi = wT xi , the within-class variance in the first class c1 is
σy2 (c1 ) = (1/Nc1 ) Σ_{xi ∈c1 } (yi − µy (c1 ))2
         = (1/Nc1 ) Σ_{xi ∈c1 } (wT (xi − µ(c1 )))2
         = (1/Nc1 ) Σ_{xi ∈c1 } wT (xi − µ(c1 ))(xi − µ(c1 ))T w
         = wT ((1/Nc1 ) Σ_{xi ∈c1 } (xi − µ(c1 ))(xi − µ(c1 ))T ) w
         = wT S1 w

where µ(c1 ) = (1/Nc1 ) Σ_{xi ∈c1 } xi is the mean of the first class.
Similarly, the within-class variance in the second class c2 is given by

σy2 (c2 ) = wT S2 w
where S2 = (1/Nc2 ) Σ_{xi ∈c2 } (xi − µ(c2 ))(xi − µ(c2 ))T and µ(c2 ) is the mean of the second class.
class.
Now, the sum of the two variances can be written as

σy2 (c1 ) + σy2 (c2 ) = wT (S1 + S2 )w = wT Sw w

where Sw = S1 + S2 is the within class scatter matrix.


The distance between the two means can be written as

(µy (c1 ) − µy (c2 ))2 = wT (µ(c1 ) − µ(c2 ))(µ(c1 ) − µ(c2 ))T w = wT Sb w ,
where Sb = (µ(c1 ) − µ(c2 ))(µ(c1 ) − µ(c2 ))T is the between-class scatter matrix.

The quantity we need to maximise is


(µy (c1 ) − µy (c2 ))2 / (σy2 (c1 ) + σy2 (c2 )) = (wT Sb w) / (wT Sw w) .

The equivalent optimisation problem is

max wT Sb w s.t. wT Sw w = 1

In order to solve the optimisation problem we need to formulate the Lagrangian

Lagrangian: L(w, λ) = wT Sb w − λ(wT Sw w − 1)
∂(wT Sw w)/∂w = 2Sw w ,   ∂(wT Sb w)/∂w = 2Sb w
∂L/∂w = 0 ⇒ λSw w = Sb w.
Hence the optimal w is given by the eigenvector of Sw−1 Sb that corresponds to the
largest eigenvalue (assuming that Sw is invertible).
In this special case (where C = 2) the optimal w is

w ∝ Sw−1 (µ(c1 ) − µ(c2 )).

In the following we will compute the LDA projection for the following 2D dataset.

c1 = {(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)}


c2 = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}

Solution (by hand)


The class statistics are
   
S1 = [ 0.8  −0.4 ;  −0.4  2.64 ] ,   S2 = [ 1.84  −0.04 ;  −0.04  2.64 ]
and the class means are µ1 = [3.0, 3.6]T , µ2 = [8.4, 7.6]T . The between- and within-class scatter matrices are
Sb = [ 29.16  21.6 ;  21.6  16.0 ] ,   Sw = [ 2.64  −0.44 ;  −0.44  5.28 ] .
The LDA projection is then obtained as the solution of the generalized eigenvalue problem
Sw−1 Sb w = λw → |Sw−1 Sb − λI| = 0 → det[ 11.89 − λ   8.81 ;  5.08   3.76 − λ ] = 0 → λ = 15.65
[ 11.89  8.81 ;  5.08  3.76 ] [ w1 ; w2 ] = 15.65 [ w1 ; w2 ] → [ w1 ; w2 ] = [ 0.91 ; 0.39 ] .
Or directly by
w∗ = Sw−1 (µ1 − µ2 ) = [−0.91, −0.39]T .
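The example above can be checked numerically in Matlab (a sketch with our variable names):

c1 = [4 1; 2 4; 2 3; 3 6; 4 4]';        % class 1 samples as columns
c2 = [9 10; 6 8; 9 5; 8 7; 10 8]';      % class 2 samples as columns
n1 = size(c1, 2);  n2 = size(c2, 2);
mu1 = mean(c1, 2);  mu2 = mean(c2, 2);
d1 = c1 - mu1*ones(1, n1);  d2 = c2 - mu2*ones(1, n2);
S1 = d1*d1'/n1;  S2 = d2*d2'/n2;
Sw = S1 + S2;
w  = Sw \ (mu1 - mu2);
w  = w / norm(w)                        % approximately [-0.91; -0.39], up to rounding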

2.1.5.2 Multi-class Case


In the general case, where C classes are available, the within-class scatter matrix is
defined as
Sw = Σ_{j=1}^{C} Sj = Σ_{j=1}^{C} (1/Ncj ) Σ_{xi ∈cj } (xi − µ(cj ))(xi − µ(cj ))T

and the between-class scatter matrix as


Sb = Σ_{j=1}^{C} (µ(cj ) − m)(µ(cj ) − m)T

Now, we have to find a matrix W = [w1 , . . . , wd ] by solving the following optimisa-


tion problem
max tr[WT Sb W] s.t. WT Sw W = I
The Lagrangian of the problem is defined as

L(W, Λ) = tr[WT Sb W] − tr[Λ(WT Sw W − I)]


∂tr[WT Sb W]/∂W = 2Sb W ,   ∂tr[Λ(WT Sw W − I)]/∂W = 2Sw WΛ
∂L(W, Λ)/∂W = 0 ⇒ Sb W = Sw WΛ.
As a result, the columns of W are the eigenvectors of Sw−1 Sb (assuming Sw is not
singular) that correspond to its largest eigenvalues (if d eigenvectors are kept, then
d ≤ C − 1 must hold).

2.2 Computing Linear Discriminant Analysis


In the following we will show how to solve the general LDA optimisation problem in
SSS problems (i.e., without having to assume that Sw is invertible)

W0 = arg max_W tr(WT Sb W)   (2.63)
subject to WT Sw W = I

Assume that we have C classes in total. We assume that each class has Nci samples,
stored in matrix ci = [x1 , . . . , xNci ], i = 1, . . . , C, where each xj has F dimensions
and µ(ci ) is the mean vector of the class i. Thus, the overall data matrix X =
[c1 , . . . , cC ] has size F × n (n = Σ_{i=1}^{C} Nci ). If m is the overall mean, then the
within-class scatter matrix, Sw , is defined as
Sw = Σ_{j=1}^{C} Sj = Σ_{j=1}^{C} Σ_{xi ∈cj } (xi − µ(cj ))(xi − µ(cj ))T   (2.64)

and has rank(Sw ) = min (F, n − (C + 1)). Moreover, the between-class scatter ma-
trix, Sb , is defined as
Sb = Σ_{j=1}^{C} Ncj (µ(cj ) − m)(µ(cj ) − m)T   (2.65)

and has rank(Sb ) = min (F, C − 1).


The solution of Eq. 2.63 is given by the generalised eigenvalue problem

Sb W = Sw WΛ (2.66)

thus the optimal Wo corresponds to the eigenvectors of Sw−1 Sb that correspond to
the largest eigenvalues. In order to deal with the singularity of Sw , we can do the
following steps:
1. Perform PCA on our data matrix X to reduce the dimensions to n − (C + 1)
using the eigenvectors U
2. Solve LDA on this reduced space (i.e., in the space of Y = UT X) and get
optimal matrix Q that has C − 1 columns.


3. Compute the total transformation as W = UQ.


Unfortunately, if you follow the above procedure it is possible that important infor-
mation is lost. In the following, we show how the components of LDA can be com-
puted by applying a simultaneous diagonalisation procedure. Before we continue,
we need to write some properties regarding the between and within class matrices
Sw and Sb .

Properties
The scatter matrices have some interesting properties. Let us denote
 
M = [ E1  0  · · ·  0 ;  0  E2  · · ·  0 ;  . . . ;  0  0  · · ·  EC ] = diag {E1 , E2 , . . . , EC }   (2.67)
where Ei is the Nci × Nci matrix whose entries are all equal to 1/Nci .   (2.68)

Note that M is idempotent, thus MM = M. Given that the data covariance matrix
is St = XXT , the between-class scatter matrix can be written as
Sb = XMMXT = XMXT (2.69)
and the within-class scatter matrix as
Sw = XXT − XMXT = X(I − M)XT   (2.70)

Thus, we have that St = Sw + Sb . Note that since M is idempotent, I − M is also


idempotent.

Given the above properties, the objective function of Eq. 2.63 can be expressed as
Wo = arg max_W tr(WT XMMXT W)   (2.71)
subject to WT X(I − M)(I − M)XT W = I
The optimisation procedure of this problem involves a procedure called Simultane-
ous Diagonalisation. Let’s assume that the final transformation matrix has the form
W = UQ (2.72)
We aim to find the matrix U that diagonalises Sw = X(I − M)(I − M)XT . This
practically means that, given the constraint of Eq. 2.71, we want
WT X(I − M)(I − M)XT W = I ⇒ QT UT X(I − M)(I − M)XT U Q = I ,   (2.73)
where U is chosen such that UT X(I − M)(I − M)XT U = I.


Consequently, using Eqs. 2.72 and 2.73, the objective function of Eq. 2.71 can be
further expressed as

Qo = arg max_Q tr(QT UT XMMXT UQ)   (2.74)
subject to QT Q = I

where the constraint WT X(I − M)(I − M)XT W = I now has the form QT Q = I.
Lemma 2
Assume the matrix X(I − M)(I − M)XT = Xw Xw T , where Xw is the F × n matrix
Xw = X(I − M). By performing eigenanalysis on Xw T Xw as Xw T Xw = Vw ΛVw T ,
we get n − (C + 1) positive eigenvalues, thus Vw is an n × (n − (C + 1)) matrix.

The optimisation problem of Eq. 2.74 can be solved in two steps

1. Find U such that UT X(I − M)(I − M)XT U = I. By applying Lemma 2, we get


U = Xw Vw Λw −1 . Note that U has size F × (n − (C + 1)).

2. Find Q0 . By denoting
X̃b = UT XM
the (n − (C + 1)) × n matrix of projected class means, Eq. 2.74 becomes

Qo = arg max tr(QT X̃b X̃T


b Q)
Q
(2.75)
T
subject to Q Q = I

which is equivalent to applying PCA on the matrix of projected class means.


The final Q_o is a matrix whose columns are the d eigenvectors of \tilde{X}_b \tilde{X}_b^T that
correspond to the d largest eigenvalues (d ≤ C − 1).

The final projection matrix is given by

W0 = UQ0 (2.76)

Based on the above, the pseudocode for computing LDA is

Algorithm 2 Linear Discriminant Analysis


1: procedure LDA
2: Find the eigenvectors of Sw that correspond to the non-zero eigenvalues
(usually n − (C + 1)), i.e. U = [u1 , . . . , un−(C+1) ] by performing eigen-analysis
to (I − M)XT X(I − M) = Vw Λw VwT and computing U = X(I − M)Vw Λw −1
(performing whitening on Sw ).
3: Project the data as X̃b = UT XM.
4: Perform PCA on X̃b to find Q (i.e., compute the eigenanalysis of X̃b X̃Tb =
QΛb QT ).
5: The total transform is W = UQ
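The following MATLAB sketch illustrates Algorithm 2. It is a minimal illustration rather than part of the original algorithm statement: the names X (the F × n data matrix), labels (a 1 × n vector of class indices 1, …, C) and the target dimensionality d are assumed inputs.

% Minimal sketch of Algorithm 2 (LDA via simultaneous diagonalisation).
% Assumed inputs: X (F x n data matrix), labels (1 x n class indices 1..C), d (<= C-1).
[F, n] = size(X);
C = max(labels);
M = zeros(n, n);                          % block-diagonal matrix of Eq. 2.67
for i = 1:C
    idx = find(labels == i);
    M(idx, idx) = 1/numel(idx);
end
Xw = X*(eye(n) - M);                      % "within-class" data matrix X(I - M)
[Vw, Lw] = eig(Xw'*Xw);                   % eigenanalysis of Xw' * Xw
[lw, order] = sort(diag(Lw), 'descend');
keep = lw > 1e-10;                        % keep the strictly positive eigenvalues
Vw = Vw(:, order(keep)); lw = lw(keep);
U = Xw*Vw*diag(1./lw);                    % whitening of Sw, so that U'*Sw*U = I
Xb = U'*X*M;                              % projected class means
[Q, Lb] = eig(Xb*Xb');                    % PCA on the projected class means
[~, order] = sort(diag(Lb), 'descend');
Q = Q(:, order(1:d));
W = U*Q;                                  % total LDA transform, Eq. 2.76
Y = W'*X;                                 % d-dimensional features

Note that the whitening step only requires the n × n matrix Xw'*Xw, which keeps the procedure practical when F is much larger than n.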


2.2.1 Kernel PCA and Kernel LDA


All the above techniques are linear. But in many cases it would be beneficial to
design non-linear feature extraction methods (for examples see the slides). Assume
again that we have a set of observations x1 , . . . , xn . We further assume that we have
a non-linear mapping φ
xi ∈ RF → φ(xi ) ∈ H
H can be a space of arbitrary (possibly infinite) dimensionality. The mapping φ(·) may
not be explicitly known, or may be extremely expensive to compute and store. What is
explicitly known is the dot product in H (also known as the kernel k):

φ(x_i)^T φ(x_j) = k(x_i, x_j), \qquad k(\cdot,\cdot): R^F × R^F → R.

All positive (semi)-definite functions can be used as kernels.


Given a training population of n samples [x1 , . . . , xn ], we compute the training kernel
matrix (also called Gram matrix)

K = [φ(xi )T φ(xj )] = [k(xi , xj )]

All the computations are performed via the use of the kernel or the centralized kernel
matrix
\bar{K} = \big[(φ(x_i) - m^Φ)^T (φ(x_j) - m^Φ)\big], \qquad m^Φ = \frac{1}{n}\sum_{i=1}^{n} φ(x_i).

Some popular kernel functions include the Gaussian Radial Basis Function (RBF) kernel,

k(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||^2}{r^2}\right)

Polynomial kernel:
k(xi , xj ) = (xTi xj + b)n
Hyperbolic Tangent kernel:

k(xi , xj ) = tanh(xTi xj + b)

In kernel literature the original observation space

X = [x1 , . . . , xn ]

is called input space.


While the non-linear space

XΦ = [φ(x1 ), . . . , φ(xn )]

is called feature space. Using the feature space the kernel matrix is defined as
T
K = [φ(xi )T φ(xj )] = [k(xi , xj )] = XΦ XΦ .


We can define the centralised matrix of features as


\bar{X}^Φ = [φ(x_1) - m^Φ, \ldots, φ(x_n) - m^Φ] = X^Φ(I - E) =: X^Φ M, \qquad E = \frac{1}{n}\mathbf{1}\mathbf{1}^T.

Using the centralised matrix of features, the centralised kernel matrix can be written as

\bar{K} = [(φ(x_i) - m^Φ)^T(φ(x_j) - m^Φ)] = (I - E)\, X^{Φ T} X^Φ\, (I - E) = (I - E)K(I - E) = K - EK - KE + EKE.
We can now define PCA in the feature space or, as it is known, Kernel PCA (KPCA):

U_o^Φ = \arg\max \mathrm{tr}[U^{Φ T} S_t^Φ U^Φ] = \arg\max \mathrm{tr}[U^{Φ T} \bar{X}^Φ \bar{X}^{Φ T} U^Φ] \quad \text{subject to } U^{Φ T} U^Φ = I.

The solution is given by the d eigenvectors that correspond to the d largest eigenvalues,

S_t^Φ U_o^Φ = U_o^Φ Λ.
Do you see any problem with that? How can we compute the eigenvectors of S_t^Φ? We do not even know φ!
Remember our Lemma that links the eigenvectors and eigenvalues of matrices of the form AA^T and A^T A: if

K = X^{Φ T} X^Φ = V Λ V^T, \quad \text{then} \quad U_o^Φ = X^Φ V Λ^{-1/2}.

All computations are performed via the use of K (the so-called kernel trick).

Still, U_o^Φ = X^Φ V Λ^{-1/2} cannot be computed explicitly. But we do not want to compute U_o^Φ; what we want is to compute latent features. That is, given a test sample x_t, we want to compute y = U_o^{Φ T}(φ(x_t) - m^Φ), and this can be performed via the kernel trick:
y = U_o^{Φ T}(φ(x_t) - m^Φ)
  = Λ^{-1/2} V^T \bar{X}^{Φ T} (φ(x_t) - m^Φ)
  = Λ^{-1/2} V^T (I - E)\, X^{Φ T}\left(φ(x_t) - \frac{1}{n} X^Φ \mathbf{1}\right)
  = Λ^{-1/2} V^T (I - E)\left(X^{Φ T} φ(x_t) - \frac{1}{n} X^{Φ T} X^Φ \mathbf{1}\right)
  = Λ^{-1/2} V^T (I - E)\left(g(x_t) - \frac{1}{n} K \mathbf{1}\right)

where

g(x_t) = X^{Φ T} φ(x_t) = \begin{bmatrix} φ(x_1)^T φ(x_t) \\ \vdots \\ φ(x_n)^T φ(x_t) \end{bmatrix} = \begin{bmatrix} k(x_1, x_t) \\ \vdots \\ k(x_n, x_t) \end{bmatrix}.
Kernel LDA is a tutorial exercise.
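As a concrete illustration of the computations above, the MATLAB sketch below performs KPCA with an RBF kernel and projects a test sample via the kernel trick. The inputs X (F × n data matrix), xt (an F × 1 test sample), the kernel width r and the number of components d are assumed names for this example.

% Minimal KPCA sketch (RBF kernel). Assumed inputs: X (F x n), xt (F x 1), r, d.
n  = size(X, 2);
D2 = sum(X.^2,1)' + sum(X.^2,1) - 2*(X'*X);   % pairwise squared distances
K  = exp(-D2 / r^2);                          % training kernel (Gram) matrix
E  = ones(n)/n;
Kc = K - E*K - K*E + E*K*E;                   % centralised kernel matrix
[V, L] = eig((Kc + Kc')/2);                   % symmetrise for numerical safety
[l, order] = sort(diag(L), 'descend');
V = V(:, order(1:d));  l = l(1:d);
% projection of the test sample via the kernel trick
g  = exp(-sum((X - xt).^2, 1)' / r^2);        % g(xt) = [k(x_i, xt)]
y  = diag(1./sqrt(l)) * V' * (eye(n) - E) * (g - K*ones(n,1)/n);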


2.2.1.1 Maximum Likelihood for Probabilistic PCA


PCA, as defined above, is a deterministic procedure. In the following, we discuss a
probabilistic counterpart of PCA, the so-called Probabilistic PCA (PPCA).
PPCA forms the basis for a Bayesian treatment of PCA in which the dimensionality
of the principal subspace can be found automatically from the data. Furthermore,
PPCA can be used to model class-conditional densities and hence be applied to clas-
sification problems. Finally, PPCA model can be run generatively to provide samples
from the distribution.
The probabilistic model of PPCA can be written as
xi = Wyi + m + ei
ei ∼ N (ei |0, σ 2 I) (2.77)
yi ∼ N (yi |0, I)

or equivalently

p(x_i | y_i, W, σ) = N(x_i | W y_i + m, σ^2 I) = \frac{1}{(2π)^{F/2} σ^F} \exp\left(-\frac{1}{2σ^2}(x_i - m - W y_i)^T (x_i - m - W y_i)\right)

p(y_i) = N(y_i | 0, I) = \frac{1}{(2π)^{d/2}} \exp\left(-\frac{1}{2} y_i^T y_i\right).   (2.78)
Given the conditional probability p(xi |yi , W, σ) and prior p(y) we can compute the
two following important distributions
p(yi |xi , W, σ), posterior
(2.79)
p(xi |W, σ), marginal.

We apply the technique called “completing the square” in order to do so.


p(x_i | W, σ) = \int p(x_i, y_i | W, σ)\, dy_i = \int p(x_i | y_i, W, σ)\, p(y_i)\, dy_i.   (2.80)

p(x_i | y_i, W, σ)\, p(y_i) = \frac{1}{(2π)^{F/2} σ^F (2π)^{d/2}} \exp\left(-\frac{1}{2σ^2}\big[(x_i - m - W y_i)^T(x_i - m - W y_i) + σ^2 y_i^T y_i\big]\right)   (2.81)
Now, in order to compute the marginal as well as the posterior, we restructure the
exponent of the exponential. The aim of the restructuring is to reveal a term that
does not depend on y_i, so that it can safely be taken out of the integral in (2.80). This
term is used to produce the marginal; the other term is used to produce the posterior.
Let us now see how to restructure the exponent (for convenience let x̄_i = x_i − m):
(x̄i − Wyi )T (x̄i − Wyi ) + σ 2 yiT yi
= x̄Ti x̄i − 2x̄Ti Wyi + yiT WT Wyi + σ 2 yiT yi
(2.82)
= x̄Ti x̄i − 2x̄Ti Wyi + yiT (WT W + σ 2 I)yi
= x̄Ti x̄i − 2x̄Ti Wyi + yiT Myi


where M = WT W + σ 2 I.
We observe that we have a quadratic term yiT Myi and we need some extra terms in
order to complete its quadratic form. We can do so as follows

x̄_i^T x̄_i - 2 x̄_i^T W y_i + y_i^T M y_i
= x̄_i^T x̄_i - 2 (M^{-1} W^T x̄_i)^T M y_i + y_i^T M y_i
= x̄_i^T x̄_i - 2 (M^{-1} W^T x̄_i)^T M y_i + y_i^T M y_i + (M^{-1} W^T x̄_i)^T M (M^{-1} W^T x̄_i) - (M^{-1} W^T x̄_i)^T M (M^{-1} W^T x̄_i)
= (y_i - M^{-1} W^T x̄_i)^T M (y_i - M^{-1} W^T x̄_i) + x̄_i^T (I - W M^{-1} W^T) x̄_i.   (2.83)
Hence, after straightforward computations (i.e., putting the exponent back to the
integral), we have that

p(yi |xi , W, σ) = N (yi |M−1 WT (xi − m), σ 2 M−1 )


(2.84)
p(xi |W, σ) = N (xi |m, (σ −2 I − σ −2 WM−1 WT )−1 )

In order to simplify the marginal we will make use of the Woodbury identity

(A + UCV)−1 = A−1 − A−1 U(C−1 + VA−1 U)−1 VA−1 . (2.85)

Using the above we can easily verify that if D = WWT + σ 2 I then

D−1 = σ −2 I − σ −2 WM−1 WT = σ −2 I − σ −2 W(σ 2 I + WT W)−1 WT . (2.86)

Hence, the marginal can be written as

p(xi |W, σ) = N (xi |m, σ 2 I + WWT ). (2.87)

Having computed the marginal, we are ready to formulate a maximum likelihood


framework in order to estimate the parameters of the model, i.e., θ = {W, m, σ},
given a set of training data samples x1 , . . . , xn

θ_o = \arg\max_θ \ln p(x_1, \ldots, x_n | θ)
    = \arg\max_θ \ln \prod_{i=1}^{n} p(x_i | θ)   (2.88)
    = \arg\max_θ \left\{ -\frac{nF}{2}\ln(2π) - \frac{n}{2}\ln\det(D) - \frac{1}{2}\sum_{i=1}^{n}(x_i - m)^T D^{-1}(x_i - m) \right\}.

Hence, by removing the constant terms, the function we want to optimise with respect
to the parameters is

L(W, σ, m) = -\frac{n}{2}\ln\det(D) - \frac{1}{2}\sum_{i=1}^{n}(x_i - m)^T D^{-1}(x_i - m)
           = -\frac{n}{2}\ln\det(D) - \frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}\big(D^{-1}(x_i - m)(x_i - m)^T\big)   (2.89)
           = -\frac{n}{2}\ln\det(D) - \frac{n}{2}\mathrm{tr}(D^{-1} S_t),

where S_t = \frac{1}{n}\sum_{i=1}^{n}(x_i - m)(x_i - m)^T is the data covariance matrix.


We will now take the derivatives of the function with respect to the parameters:

∇_m L = 0 \;⇒\; m = \frac{1}{n}\sum_{i=1}^{n} x_i
∇_W L = 0 \;⇒\; D^{-1} S_t D^{-1} W - D^{-1} W = 0 \;⇒\; S_t D^{-1} W = W.   (2.90)

There are three different solutions. The first is W = 0 (not useful). The second is
D = S_t; in this case, if S_t = UΛU^T is the eigendecomposition of the covariance matrix,
then W = U(Λ - σ^2 I)^{1/2} V^T for an arbitrary rotation matrix V (i.e., V^T V = I). The
third, and most interesting, case is D ≠ S_t and W ≠ 0 with d < q = rank(S_t), which
we study next.
Assume the SVD of W = U L V^T, where U = [u_1 \ldots u_d] is an F × d matrix with
U^T U = I, V^T V = V V^T = I, and L = \mathrm{diag}(l_1, \ldots, l_d). Then the stationarity
condition becomes

S_t D^{-1} U L V^T = U L V^T.


Let us study D^{-1} U. Using W = U L V^T,

D^{-1} = (W W^T + σ^2 I)^{-1} = (U L^2 U^T + σ^2 I)^{-1}.

Assume a set of basis vectors U_{F-d} such that U_{F-d}^T U = 0 and U_{F-d}^T U_{F-d} = I. We then have

D^{-1} = (U L^2 U^T + σ^2 I)^{-1}
       = \left([U\; U_{F-d}] \begin{bmatrix} L^2 & 0 \\ 0 & 0 \end{bmatrix} [U\; U_{F-d}]^T + [U\; U_{F-d}]\, σ^2 I\, [U\; U_{F-d}]^T\right)^{-1}
       = [U\; U_{F-d}] \begin{bmatrix} L^2 + σ^2 I & 0 \\ 0 & σ^2 I \end{bmatrix}^{-1} [U\; U_{F-d}]^T
       = [U\; U_{F-d}] \begin{bmatrix} (L^2 + σ^2 I)^{-1} & 0 \\ 0 & σ^{-2} I \end{bmatrix} [U\; U_{F-d}]^T

And subsequently
 
D^{-1} U = [U\; U_{F-d}] \begin{bmatrix} (L^2 + σ^2 I)^{-1} & 0 \\ 0 & σ^{-2} I \end{bmatrix} [U\; U_{F-d}]^T U
         = [U\; U_{F-d}] \begin{bmatrix} (L^2 + σ^2 I)^{-1} & 0 \\ 0 & σ^{-2} I \end{bmatrix} [I\; 0]^T
         = [U\; U_{F-d}] \begin{bmatrix} (L^2 + σ^2 I)^{-1} \\ 0 \end{bmatrix}


= U(L2 + σ 2 I)−1

As a result, we have

St D−1 ULVT = ULVT


St U(L2 + σ 2 I)−1 = U
St U = U(L2 + σ 2 I)

Hence we have that S_t u_i = (l_i^2 + σ^2) u_i. For S_t = UΛU^T, the u_i are eigenvectors of S_t
and λ_i = l_i^2 + σ^2 \;⇒\; l_i = \sqrt{λ_i - σ^2}. Unfortunately, V cannot be determined, thus there
is a rotation ambiguity.
Concluding, the optimum Wd is given by (keeping d eigenvectors)
W_d = U_d (Λ_d - σ^2 I)^{1/2} V^T

Having computed W, we need to compute the optimum σ 2

L(W, σ^2, µ) = -\frac{NF}{2}\ln(2π) - \frac{N}{2}\ln|D| - \frac{N}{2}\mathrm{tr}[D^{-1} S_t]
             = -\frac{NF}{2}\ln(2π) - \frac{N}{2}\ln|W W^T + σ^2 I| - \frac{N}{2}\mathrm{tr}[(W W^T + σ^2 I)^{-1} S_t]

 
W_d W_d^T + σ^2 I = [U_d\; U_{F-d}] \begin{bmatrix} Λ_d - σ^2 I & 0 \\ 0 & 0 \end{bmatrix} [U_d\; U_{F-d}]^T + [U_d\; U_{F-d}]\, σ^2 I\, [U_d\; U_{F-d}]^T
                  = [U_d\; U_{F-d}] \begin{bmatrix} Λ_d & 0 \\ 0 & σ^2 I \end{bmatrix} [U_d\; U_{F-d}]^T

Hence

|W_d W_d^T + σ^2 I| = \prod_{i=1}^{d} λ_i \prod_{i=d+1}^{F} σ^2

and thus the log of the determinant is given by

\ln|W_d W_d^T + σ^2 I| = (F - d)\ln σ^2 + \sum_{i=1}^{d} \ln λ_i.

We also have that

D^{-1} S_t = [U_d\; U_{F-d}] \begin{bmatrix} Λ_d & 0 \\ 0 & σ^2 I \end{bmatrix}^{-1} [U_d\; U_{F-d}]^T\, [U_d\; U_{F-d}] \begin{bmatrix} Λ_d & 0 & 0 \\ 0 & Λ_{q-d} & 0 \\ 0 & 0 & 0 \end{bmatrix} [U_d\; U_{F-d}]^T
           = [U_d\; U_{F-d}] \begin{bmatrix} I & 0 & 0 \\ 0 & \frac{1}{σ^2}Λ_{q-d} & 0 \\ 0 & 0 & 0 \end{bmatrix} [U_d\; U_{F-d}]^T

\;⇒\; \mathrm{tr}(D^{-1} S_t) = \frac{1}{σ^2}\sum_{i=d+1}^{q} λ_i + d

L(σ^2) = -\frac{N}{2}\left\{ F\ln 2π + \sum_{j=1}^{d}\ln λ_j + (F - d)\ln σ^2 + \frac{1}{σ^2}\sum_{j=d+1}^{q} λ_j + d \right\}

\frac{∂L}{∂σ} = 0 \;⇒\; -2σ^{-3}\sum_{j=d+1}^{q} λ_j + \frac{2(F - d)}{σ} = 0 \;⇒\; σ^2 = \frac{1}{F - d}\sum_{j=d+1}^{q} λ_j

Putting the solution back, we have

L(σ^2) = -\frac{N}{2}\left\{ \sum_{j=1}^{d}\ln λ_j + (F - d)\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q} λ_j\right) + F\ln 2π + F \right\}

L(σ^2) = -\frac{N}{2}\Big\{ \underbrace{\sum_{j=1}^{q}\ln λ_j}_{\ln|S_t|} - \sum_{j=d+1}^{q}\ln λ_j + (F - d)\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q} λ_j\right) + F\ln 2π + F \Big\}

Since \ln|S_t| and the remaining terms do not depend on which eigenvalues are retained, maximising L(σ^2) is equivalent to

\;⇒\; \min\left\{ \ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q} λ_j\right) - \frac{1}{F - d}\sum_{j=d+1}^{q}\ln λ_j \right\}

Taking into account Jensen's inequality,

\ln\left(\frac{1}{n}\sum_{i=1}^{n} r_i\right) ≥ \frac{1}{n}\sum_{i=1}^{n}\ln r_i,

we have that

\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q} λ_j\right) ≥ \frac{1}{F - d}\sum_{j=d+1}^{q}\ln λ_j.


Hence

\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q} λ_j\right) - \frac{1}{F - d}\sum_{j=d+1}^{q}\ln λ_j ≥ 0.

Therefore, the function is minimised when the discarded eigenvalues λ_{d+1}, \ldots, λ_q are
the q − d smallest ones, i.e., the retained directions correspond to the d largest eigenvalues.
A brief summary:

σ^2 = \frac{1}{F - d}\sum_{j=d+1}^{q} λ_j, \qquad W_d = U_d(Λ_d - σ^2 I)^{1/2} V^T, \qquad µ = \frac{1}{N}\sum_{i=1}^{N} x_i.

We no longer have a projection, but the posterior mean E_{p(y_i|x_i)}\{y_i\} = M^{-1} W^T (x_i - µ).
We also have a reconstruction x̂_i = W E_{p(y_i|x_i)}\{y_i\} + µ. We can notice that (taking V = I)

\lim_{σ^2 → 0} W_d = U_d Λ_d^{1/2}, \qquad \lim_{σ^2 → 0} M = W_d^T W_d.

Hence,

\lim_{σ^2 → 0} E_{p(y_i|x_i)}\{y_i\} = M^{-1} W_d^T (x_i - µ) = Λ_d^{-1/2} U_d^T (x_i - µ),

which gives (whitened) PCA.
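The closed-form maximum likelihood solution above can be summarised in a few lines of MATLAB. The sketch below assumes a data matrix X of size F × n and a latent dimensionality d (both hypothetical names) and follows the brief summary.

% Maximum likelihood PPCA (closed form). Assumed inputs: X (F x n), d.
[F, n] = size(X);
mu  = mean(X, 2);
Xc  = X - mu;                        % centred data
St  = (Xc*Xc')/n;                    % data covariance matrix
[U, L] = eig((St + St')/2);
[lambda, order] = sort(diag(L), 'descend');
U = U(:, order);
sigma2 = mean(lambda(d+1:end));      % sigma^2 = average of the discarded eigenvalues
Wd = U(:, 1:d) * diag(sqrt(lambda(1:d) - sigma2));   % V = I (rotation ambiguity)
M  = Wd'*Wd + sigma2*eye(d);
Y  = M \ (Wd' * Xc);                 % posterior means E[y_i | x_i]
Xhat = Wd*Y + mu;                    % reconstructions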


Chapter 3

Support Vector Machines

3.1 Support Vector Classification


In the following, we will touch upon quadratic optimisation problems with constraints
in order to see in more detail how the method of Lagrange multipliers works.
Furthermore, we will study how the dual optimisation problem is formulated
and solved. We will study this in the context of Support Vector Machines (SVMs) for
classification and regression.

3.1.1 Linear Separating Hyperplane with Maximal Margin


The original idea of SVM classification is to use a linear separating hyperplane to
create a classifier. Given training vectors x_i, i = 1, \ldots, n with x_i ∈ R^F, a label vector y is
defined as

y_i = \begin{cases} 1 & \text{if } x_i \text{ is in class 1} \\ -1 & \text{if } x_i \text{ is in class 2} \end{cases}

The SVM technique tries to find the separating hyperplane with the largest margin
between two classes, measured along a line perpendicular to the hyperplane. For
example, in Figure 3.1, the two classes could be fully separated by a dotted line
wT x + b = 0. We would like to decide the line with the largest margin. In other
words, intuitively we think that the distance between two classes of training data
should be as large as possible. That means we find a line with parameters w and b
such that the distance between wT x + b = ±1 is maximised.
The distance between wT x + b = 1 and −1 can be calculated by the following way.
Consider a point x̃ on wT x + b = −1 (see Figure 3.2). As w is the “normal vector” of
the line wT x + b = −1, w and the line are perpendicular to each other. Starting from
x̃ and moving along the direction w, we assume x̃ + tw touches line wT x + b = 1.
Therefore,
wT (x̃ + tw) + b = 1 and wT x̃ + b = −1

We then have t\,w^T w = 2, so the distance (i.e., the length of tw) is ||tw||_2 = t\,||w||_2 = \frac{2\,||w||_2}{w^T w} = \frac{2}{||w||_2}.
Note that ||w||_2 = \sqrt{w_1^2 + \cdots + w_n^2}. As maximising \frac{2}{||w||_2} is equivalent to


Figure 3.1: Separating Hyperplanes.

Figure 3.2: Distance between hyperplanes.

minimising \frac{w^T w}{2}, we have the following problem:

\min_{w,b}\; \frac{1}{2} w^T w
\text{subject to}\; y_i(w^T x_i + b) ≥ 1, \quad i = 1, \ldots, n   (3.1)

The constraint yi (wT xi + b) ≥ 1 means

(wT xi ) + b ≥ 1 if yi = 1,
(3.2)
(wT xi ) + b ≤ −1 if yi = −1,

That is, data in the class 1 must be on the right-hand side of wT x + b = 0 while data
in the other class must be on the left-hand side. Note that the reason of maximising
the distance between wT x+b = ±1 is based on Vapnik’s Structural Risk Minimisation
(?).
The following example gives a simple illustration of maximal-margin separating hy-
perplanes:


Figure 3.3: 1D Toy Example

Figure 3.4: Solution Toy

Example
Given two training data in R1 as in the following Figure 3.3: What is the separating
hyperplane?
We have two data points, namely x1 = 1, x2 = 0 with y = [+1, −1]T . Furthermore,
w ∈ R1 , so Eq. 3.1 becomes

1 2
min w
w,b 2
subject to w·1+b≥1 (3.3)
− 1(w · 0 + b) ≥ 1 (3.4)

From Ineq. 3.4, −b ≥ 1, i.e. b ≤ −1. Putting this into Ineq. 3.3, w ≥ 1 − b ≥ 2. In other words, for
any (w, b) which satisfies 3.3 and 3.4, we have w ≥ 2. As we are minimising \frac{1}{2}w^2,
the smallest possibility is w = 2. Thus, (w, b) = (2, −1) is the optimal solution. The
separating hyperplane is 2x − 1 = 0, in the middle of the two training data points
(Figure 3.4).
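As a numerical sanity check, the primal problem (3.1) for this toy example can be handed directly to a generic quadratic programming solver such as MATLAB's quadprog (used again later in this section). The encoding below is only a sketch, with the primal variables stacked as g = [w; b].

% Toy 1D example: two points x1 = 1 (y = +1), x2 = 0 (y = -1).
% Objective (1/2) w^2; constraints y_i (w x_i + b) >= 1 written as -y_i [x_i 1] g <= -1.
H = [1 0; 0 0];                   % quadratic cost only on w (PSD Hessian)
f = [0; 0];
A = [-1 -1;                       % -(w*1 + b) <= -1   (from y_1 = +1, x_1 = 1)
      0  1];                      %        b  <= -1   (from y_2 = -1, x_2 = 0)
c = [-1; -1];
g = quadprog(H, f, A, c);
% returns g = [2; -1], i.e. the separating hyperplane 2x - 1 = 0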

In order to find the optimal w in the general case we need to solve optimisation
problem 3.1. Before doing so, we need some basic knowledge regarding Lagrangian
optimisation and Lagrangian duality.

3.1.1.1 Lagrangian Duality

The problem which we have to solve is a constrained optimisation problem. It is of


the form

min f (w)
w
subject to g(w) ≤ 0. (3.5)

By convention, we write the constraint as g(w) ≤ 0; as a result, constraints given in the
form h(w) ≥ 0 (such as those of (3.1)) are multiplied by minus one so that they can be
written as g(w) = −h(w) ≤ 0.


To solve this we use the method of Lagrange multipliers. We define the Lagrangian
to be the original objective function added to a weighted combination of the con-
straints. The weights are called Lagrange multipliers. It will be helpful to focus on
the simpler case with one inequality constraint and one Lagrange multiplier.

L(w, a) = f (w) + ag(w) (3.6)

Theorem 2
The original minimisation problem can be written as

min max L(w, a) (3.7)


w a≥0

Proof: Looking at the inner term we get

\max_{a ≥ 0} L(w, a) = \begin{cases} f(w), & g(w) ≤ 0 \\ ∞, & g(w) > 0 \end{cases}

This is because when g(w) ≤ 0, we maximise (3.6) by setting a = 0. When g(w) > 0,
one can drive the value to infinity by setting a to a large number. Minimising the
outer term, one sees that we obtain the minimum value of f (w) such that the con-
straint g(w) ≤ 0 holds. Therefore, we can say that the two problems are equivalent.
The primal solution to the problem is given by

p∗ = min max L(w, a) (3.8)


w a≥0

The dual solution to the problem is given by

d∗ = max min L(w, a) (3.9)


a≥0 w

We claim that d∗ ≤ p∗ . Let w∗ be the w value that corresponds to the optimal primal
solution p∗ . We can write for all a ≥ 0

max L(w∗ , ã) ≥ L(w∗ , a) ≥ min L(w, a). (3.10)


ã≥0 w

The Left-Hand-Side (LHS) of the above is obviously p∗ . This means we can interpret
the Right-Hand-Side (RHS) as a lower bound on p∗ for all a ≥ 0. One obtains the
best lower bound when maximising over a - this yields d∗ . Hence d∗ ≤ p∗ for any
f (w) and g(w). However, if certain conditions are met, namely

• f (w) is convex

• g(w) is affine (e.g., g(w) = wT x + b)

then d∗ = p∗ .
For the SVM problem, both of these conditions hold. Finally, in order to solve the
SVM optimisation problem using the dual, we need to further explore the optimality
conditions.


3.1.1.2 Conditions for Optimality (Karush-Kuhn-Tucker Conditions)


Lagrangian duality theory also states a number of necessary and sufficient conditions
that hold at the optimum solution.
1. w and a are feasible
2. ag(w) = 0
Condition 1 means that g(w) ≤ 0 and a ≥ 0. Condition 2 is called the “complementary
slackness condition”. It follows from the fact that the constraint g(w) may or may
not affect the final solution. If the minimum of f (w) lies within the region {w :
g(w) < 0}, then one can optimise f (w) without the constraint (i.e., let a = 0). If the
minimum of f (w) lies outside this set, then the constraint is turned “on”, and the
final solution must satisfy g(w) = 0. In this case, a behaves like a typical Lagrange
multiplier for an equality constraint.

3.1.1.3 SVM dual problem


We are now ready to formulate the Lagrangian for optimisation problem (3.1). Since
we have n data, we have also n constraints, one for each sample.
The Lagrangian is hence formulated as
n
1 X
L(w, b, a) = wT w − ai (yi (wT xi + b) − 1). (3.11)
2 i=1

The solution of the dual problem is


max min L(w, b, a) (3.12)
ai ≥0 w,b

Since we are optimising a convex function with linear constraints, the dual solution
will equal the primal solution. To optimise the dual (3.12), we need to minimise
L(w, b, a) with respect to w and b for a fixed value of a. We know that the optimal
w and b must satisfy the condition that the partial derivatives of L with regards to w
and b are 0.
n
X
∇w L(w, b, a) = w − ai yi xi = 0 ⇒ (3.13)
i=1
n
X
w= ai yi xi (3.14)
i=1

Similarly,
n
∂L(w, b, a) X
= ai yi = 0. (3.15)
∂b i=1

Therefore, for a fixed value of a, we have a closed form solution for w that minimises
L(w, b, a). We also have a condition on the sum of ai yi . We can plug them back into
the dual expression.


n n n
X 1 XX
L(a) = ai − ai aj yi yj xTi xj . (3.16)
i=1
2 i=1 j=1

Finally, we are left with a function of a which we wish to maximise. Putting this
together with the constraints a_i ≥ 0 and \sum_{i=1}^{n} a_i y_i = 0, we obtain the
following optimisation problem

\max_{a}\; \mathbf{1}^T a - \frac{1}{2} a^T K_y a
\text{subject to}\; a_i ≥ 0, \; i = 1, \ldots, n, \qquad a^T y = 0   (3.17)

where a = [a_1, \ldots, a_n]^T, \mathbf{1} = [1, \ldots, 1]^T, y = [y_1, \ldots, y_n]^T and K_y = [y_i y_j x_i^T x_j].


How do I solve this optimisation problem?

Example
In this example we will see how we can solve the optimisation problem using quadprog
of Matlab. The quadprog function solves generic quadratic programming optimisa-
tion problems of the form:

\min_{g}\; f^T g + \frac{1}{2} g^T H g
\text{subject to}\; A g ≤ c, \quad A_e g = c_e, \quad g_l ≤ g ≤ g_u   (3.18)

This minimisation problem solves for the vector g. The first step to solving our
problem, is to encode it using the matrices H, A, f , c, ce , gl , gu and Ae . Assume we
are given a set of data stored as columns in a data matrix X ∈ RF ×n and a vector
y of labels 1, −1. Then the SVM optimisation problem (3.17) can be reformulated
to (3.18) by (a) changing maximisation to minimisation by reversing the sign of the
cost function, (b) setting g = a, H = [yi yj xTi xj ], (c) f = −1n , A = 0 and c = 0
(a dummy inequality constraint), Ae = [y1 , . . . , yn ] and ce = 0, gl = [0, . . . , 0]T , and
finally gu = [∞ . . . , ∞]T . Once we have created the matrices and vectors quadprog
function can be used like so:
g = quadprog(H, f, A, c, A_e, c_e, g_l, g_u);

which will return the optimal values into vector g.


For a concrete example, assume the data are stored as columns in a data matrix X ∈ R^{F×n} together with a vector y of labels 1, −1:
X1 = 4 + randn(10,100);              % class 1: 100 samples in 10 dimensions
X2 = randn(10,100);                  % class 2: 100 samples in 10 dimensions
X = [X1 X2];
y1 = ones(1,100);
y2 = -ones(1,100);
y = [y1 y2]';
f = -ones(200,1);                    % sign flipped: maximisation turned into minimisation
A = zeros(1,200);                    % dummy inequality constraint
c = 0;
H = (y*y').*(X'*X);                  % H = [y_i y_j x_i^T x_j]
A_e = y';                            % equality constraint a^T y = 0
c_e = 0;
g_l = zeros(200,1);                  % lower bounds a_i >= 0
g_u = 100000*ones(200,1);            % large upper bounds (effectively unbounded)
alpha = quadprog(H, f, A, c, A_e, c_e, g_l, g_u);

The claim is that the dual problem is more computationally convenient. We validate
this claim by considering the KKT conditions, which must hold for the solution. In
particular, the complementary slackness condition can be written as

ai = 0 ⇒ yi (wT xi + b) ≥ 1
(3.19)
ai > 0 ⇒ yi (wT xi + b) = 1.

Furthermore, from the above conditions we can find b (from any support vector). A
more numerically stable solution can be found by averaging over all support vectors as

b = \frac{1}{N_S} \sum_{x_i \in S} (y_i - w^T x_i)   (3.20)

where S is the set of support vectors and NS its corresponding cardinality.


These conditions mathematically validate our original sparseness intuition. Points
that lie beyond the margin will have a_i = 0, and so will not affect the final solution
w = \sum_{i=1}^{n} a_i y_i x_i. The final set of points with non-zero a_i, or alternatively, the set of
points with margin 1, are called the support vectors.
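Continuing the quadprog example above, the primal solution and the support vectors can be recovered from the dual variables as sketched below; the tolerance tol is a hypothetical numerical choice.

% Recover the primal solution from the dual variables (continues the example above).
tol = 1e-5;
w  = X * (alpha .* y);                 % w = sum_i a_i y_i x_i
S  = find(alpha > tol);                % indices of the support vectors
b  = mean(y(S) - (X(:,S)'*w));         % Eq. 3.20, averaged over the support vectors
yhat = sign(X'*w + b);                 % predicted labels on the training set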

3.1.2 Mapping Data to Higher Dimensional Spaces


However, problems in practice may not be linearly separable (an example is provided
in Figure 3.5). That is, there is no (w, b) which satisfies the constraints of 3.1. In this
situation, we say 3.1 is “infeasible”. We can introduce slack variables ξi , i = 1, . . . , n
in the constraints
\min_{w,b,ξ}\; \frac{1}{2} w^T w + C\sum_{i=1}^{n} ξ_i
\text{subject to}\; y_i(w^T x_i + b) ≥ 1 - ξ_i, \quad ξ_i ≥ 0, \quad i = 1, \ldots, n   (3.21)

That is, the constraints in 3.21 allow training data to not be on the correct side of
the separating hyperplane wT x + b = 0. This happens when ξi > 1 and an example
is provided in Figure 3.5.
We have ξi ≥ 0 since if ξi < 0, we have yi (wT xi + b) ≥ 1 − ξi ≥ 1 and the training
data is already on the correct side. The new problem is always feasible since for any
(w, b),
ξi ≡ max(0, 1 − yi (wT xi + b)), i = 1, . . . , l


Figure 3.5: Allowing errors

we have a feasible solution ((w, b, ξ)). Using this setting, we may worry that for
linearly separable data, some ξi ’s could be larger than 1 and hence corresponding
data could be wrongly classified. For the case that most data except some noisy ones
are separable by a linear function, we would like wT x + b = 0 to correctly classify
the majority of the points. Therefore, in the objective function we add a penalty
term C\sum_{i=1}^{l} ξ_i, where C > 0 is the penalty parameter. To have the objective value
as small as possible, most ξi ’s should be zero, so that the constraint goes back to its
original form.
In order to formulate the dual of (3.21) we need to compute

L(w, b, ξ, a, r) = \frac{1}{2}w^T w + C\sum_{i=1}^{n} ξ_i - \sum_{i=1}^{n} a_i\big(y_i(w^T x_i + b) - 1 + ξ_i\big) - \sum_{i=1}^{n} r_i ξ_i   (3.22)

with Lagrange multipliers a_i ≥ 0, r_i ≥ 0. Computing the derivatives,

\frac{∂L}{∂w} = w - \sum_{i=1}^{n} a_i y_i x_i = 0, \qquad \frac{∂L}{∂b} = \sum_{i=1}^{n} a_i y_i = 0, \qquad \frac{∂L}{∂ξ_i} = C - a_i - r_i = 0.   (3.23)
1
max L(a) = aT 1 − aT Ky a (3.24)
a 2
T
subject to a y = 0, 0 ≤ ai ≤ C (3.25)
where Ky = [yi yj xTi xj ].
If data are distributed in a highly non-linear way, employing only a linear function
causes many training instances to be on the wrong side of the hyperplane. As a


result, under-fitting occurs and the decision function does not perform well. To fit
the training data better, we may think of using a non-linear curve. The problem is
that it is very difficult to model non-linear curves. All we are familiar with are elliptic,
hyperbolic, or parabolic curves, which are far from enough in practice. Instead of
using more sophisticated curves, another approach is to map data into a higher
dimensional space. In this higher dimensional space, it is more likely that data can
be linearly separated. An example is the mapping of x from R^3 to R^{10}:

φ(x) = [1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_2x_3, \sqrt{2}x_1x_3]^T

An extreme example is to map a data instance x to an infinite dimensional space,

φ(x) = \left[1, \frac{x_1}{1!}, \frac{x_1^2}{2!}, \frac{x_1^3}{3!}, \ldots\right]^T.
We then try to find a linear separating plane in a higher dimensional space so that
3.21 becomes
\min_{w,b,ξ}\; \frac{1}{2} w^T w + C\sum_{i=1}^{n} ξ_i
\text{subject to}\; y_i(w^T φ(x_i) + b) ≥ 1 - ξ_i, \quad ξ_i ≥ 0, \quad i = 1, \ldots, n   (3.26)

3.1.3 The Dual Problem


The remaining problem is how to effectively solve 3.26. Especially after data are
mapped into a higher dimensional space, the number of variables (w, b) becomes
very large or even infinite. We handle this difficulty by solving the dual problem of
3.26
\min_{a}\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j φ(x_i)^T φ(x_j) - \sum_{i=1}^{n} a_i
\text{subject to}\; 0 ≤ a_i ≤ C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i a_i = 0   (3.27)

This new problem of course has some relation with the original problem 3.26, and
we hope that it can be solved more easily. We may write 3.27 in a matrix form for
convenience:
1 T
min α Ky α − 1T α
a 2
subject to 0 ≤ ai ≤ C i = 1, . . . , l (3.28)
T
y α=0

In 3.28, 1 is the vector of ones, C is the upper bound, Ky is an n × n positive semi-


definite matrix, Ky ≡ [yi yj k(xi , xj )], and k(xi , xj ) ≡ φ(xi )T φ(xj ) is the kernel.


Therefore, the crucial point is whether the dual is easier to be solved than the primal.
The number of variables in the dual is the size of the training set is n; a fixed number.
In contrast, the number of variables in the primal problem varies depending on how
data are mapped to a higher dimensional space. Therefore, moving from the primal
to the dual means that we solve a finite-dimensional optimisation problem instead
of a possibly infinite-dimensional one.
If φ(x) is an infinitely-long vector, there is no way to fully write it down and then
calculate the inner product. Therefore, even though the dual possesses the advan-
tage of having a finite number of variables, we could not even write the problem
down before solving it. This is resolved by using special mapping functions φ so that
φ(xi )T φ(xj ) is efficiently calculated (i.e., by using the kernel trick). Then, a decision
function is written as

l
!
X
T T
f (x) = sign(w φ(x) + b) = sign yi αi φ(xi ) φ(x) + b (3.29)
i=1

P
In other words, for a test vector x, if ni=1 yi αi φ(x)T φ(x) + b > 0, we classify it to
be in the class 1. Otherwise, we classify it in the second class. We can see that only
support vectors will affect the results in the prediction stage. In general, the number
of support vectors is not large. Therefore, we can say SVM is used in order to derive
important data (support vectors) from the training data.

3.2 Support Vector Regression

3.2.1 Linear Regression


Given training data (x1 , y1 ), . . . , (xn , yn ) in Figure 3.6, where xi is an input vector
and yi is the associated output value for xi , the traditional linear regression finds a
linear function wT x + b so that (w, b) is an optimal solution of

n
X
min (yi − (wT xi + b))2 (3.30)
w,b
i=1

In other words, wT x + b approximates training data by minimising the sum of square


errors.
Note that F , the number of features, is in general less than n. Otherwise, a line
passing through all points so that 3.30 is zero is the optimal function wT x + b. For
such cases, over-fitting occurs.
Similar to classification, if the data is non-linearly distributed, a linear function is
not good enough. Therefore, we also map data to a higher dimensional space by a
function φ(x). Then F ≤ dimensionality of φ(x) and as a result over-fitting happens
again. An example is in Figure 3.7.


Figure 3.6: Linear Regression

Figure 3.7: Non-linear Regression

Figure 3.8: Support Vector Regression

3.2.2 Support Vector Regression


To rectify the over-fitting problem after using φ, we consider the following reformu-
lation of 3.30 (geometric interpretation is given in Figure 3.8):
\min_{w,b,ξ,ξ^*}\; \sum_{i=1}^{n}\big(ξ_i^2 + (ξ_i^*)^2\big)
\text{subject to}\; -ξ_i^* ≤ y_i - (w^T φ(x_i) + b) ≤ ξ_i, \quad ξ_i, ξ_i^* ≥ 0, \quad i = 1, \ldots, n   (3.31)

It is easy to see that 3.30 (with x replaced by φ(x)) and 3.31 are equivalent: If


(w, b, ξ, ξ^*) is optimal for 3.31, as ξ_i^2 + (ξ_i^*)^2 is minimised, we have

ξ_i = \max\big(y_i - (w^T φ(x_i) + b),\, 0\big) \quad \text{and} \quad ξ_i^* = \max\big((w^T φ(x_i) + b) - y_i,\, 0\big).

Therefore,
ξi2 + (ξi∗ )2 = (yi − (wT φ(xi ) + b))2
Moreover, ξi ξi∗ = 0 at an optimal solution.
Instead of using square errors, we can use linear ones:
\min_{w,b,ξ,ξ^*}\; \sum_{i=1}^{l}(ξ_i + ξ_i^*)
\text{subject to}\; -ξ_i^* ≤ y_i - (w^T φ(x_i) + b) ≤ ξ_i, \quad ξ_i, ξ_i^* ≥ 0, \quad i = 1, \ldots, l

Support vector regression (SVR) then employs two modifications to avoid over-fitting:

1. A threshold ε is given so that if the i-th datum satisfies

   -ε ≤ y_i - (w^T φ(x_i) + b) ≤ ε   (3.32)

it is considered a correct approximation. Then ξi = ξi∗ = 0

2. To smooth the function wT φ(xi ) + b, an additional term wT w is added to the


objective function.

Thus, support vector regression solves the following optimisation problem:


\min_{w,b,ξ,ξ^*}\; \frac{1}{2} w^T w + C\sum_{i=1}^{l}(ξ_i + ξ_i^*)
\text{subject to}\; (w^T φ(x_i) + b) - y_i ≤ ε + ξ_i,   (3.33)
\qquad\qquad\;\; y_i - (w^T φ(x_i) + b) ≤ ε + ξ_i^*,   (3.34)
\qquad\qquad\;\; ξ_i, ξ_i^* ≥ 0, \quad i = 1, \ldots, l

Clearly, ξ_i is the upper training error (ξ_i^* is the lower) subject to the ε-insensitive
tube |y_i - (w^T φ(x_i) + b)| ≤ ε. This can be seen from Figure 3.8. If x_i is not in the
tube, there is an error ξ_i or ξ_i^*, which we would like to minimise in the objective
function. SVR avoids under-fitting and over-fitting the training data by minimising
the training error C\sum_{i=1}^{l}(ξ_i + ξ_i^*) as well as the regularisation term \frac{1}{2}w^T w. The addition
of the term w^T w can be explained in a similar way to that for classification problems.
of the term wT w can be explained by a similar way to that for classification problems.
In Figure 3.9, under the condition that training data are in the ǫ-insensitive tube, we
would like the approximate function to be as general as possible to represent the
data distribution.


Figure 3.9: More general approximate function by maximising the distance between
wT x + b = ±ǫ

The parameters which control the regression quality are the cost of error C, the
width of the tube ε, and the mapping function φ. Similar to support vector classification,
as w may be a huge vector variable, we solve the dual problem

\min_{α,α^*}\; \frac{1}{2}(α - α^*)^T K (α - α^*) + ε\sum_{i=1}^{n}(α_i + α_i^*) + \sum_{i=1}^{n} y_i(α_i - α_i^*)
\text{subject to}\; \sum_{i=1}^{n}(α_i - α_i^*) = 0, \quad 0 ≤ α_i, α_i^* ≤ C, \quad i = 1, \ldots, n   (3.35)

where K_{ij} = k(x_i, x_j) ≡ φ(x_i)^T φ(x_j). The derivation of the dual uses the same
procedure as for support vector classification. The primal-dual relation shows that

w = \sum_{i=1}^{n} (-α_i + α_i^*) φ(x_i),

so the approximate function is

f(x) = \sum_{i=1}^{n} (-α_i + α_i^*) k(x_i, x) + b.
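As with classification, the dual (3.35) is a quadratic program and can be handed to a generic solver. The MATLAB sketch below is one possible encoding, assuming a precomputed n × n kernel matrix K, targets y, and user-chosen epsilon and C.

% Sketch: solving the SVR dual (3.35) with quadprog.
% Assumed inputs: K (n x n kernel matrix), y (n x 1 targets), epsilon, C.
n   = length(y);
H   = [K -K; -K K];                              % quadratic term in z = [alpha; alpha*]
f   = [epsilon*ones(n,1) + y; epsilon*ones(n,1) - y];
A_e = [ones(1,n) -ones(1,n)];  c_e = 0;          % sum_i (alpha_i - alpha_i*) = 0
g_l = zeros(2*n,1);  g_u = C*ones(2*n,1);        % box constraints 0 <= alpha, alpha* <= C
z   = quadprog((H+H')/2, f, [], [], A_e, c_e, g_l, g_u);
alpha = z(1:n);  alpha_star = z(n+1:end);
% prediction at a test point x with kernel vector kx = [k(x_1,x); ...; k(x_n,x)]:
%   f(x) = (alpha_star - alpha)' * kx + b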


Appendix A

A.1 Preliminaries on Vectors and Matrices


Below are some ”soft” definitions of vectors and matrices. We revise various repre-
sentations of matrices and vectors, as well as useful definitions and identities.

A.1.1 Vectors and Vector Operators


A column vector x ∈ Rn is defined as
 
x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = [x_1 \ldots x_n]^T = [x_i]   (A.1)

where [x1 . . . xn ] represents a row vector. In these notes we use column vectors. Use
of row vectors will be explicitly noted.
A matrix A ∈ Rn×l is defined as the following collection
   
A = \begin{bmatrix} a_{11} & \ldots & a_{1l} \\ \vdots & \ddots & \vdots \\ a_{n1} & \ldots & a_{nl} \end{bmatrix} = [a_1 \ldots a_l] = \begin{bmatrix} \tilde{a}_1^T \\ \vdots \\ \tilde{a}_n^T \end{bmatrix} = [a_{ij}]   (A.2)

where ãj , j = 1, . . . , n are row vectors.


The inner product between two vectors x, y ∈ R^n is a scalar c defined as

c = x^T y = \sum_{i=1}^{n} x_i y_i.   (A.3)

The squared ℓ_2 norm of a vector can be defined using the inner product as

||x||_2^2 = x^T x = \sum_{i=1}^{n} x_i^2.   (A.4)

The cosine of the angle θ between two vectors x and y is defined as

xT y
cos(θ(x, y)) = . (A.5)
||x||2 ||y||2


Figure A.1: Geometric interpretation of the projection of a vector y onto x. The green
vector is projx y, while the red vector is y − projx y.

The cosine between two vectors can be used in order to define projections onto
vectors. In particular, the projection of y onto x, denoted as projx y, is a vector that
is co-linear to x and can be computed as

\mathrm{proj}_x y = β x = \cos(θ)\,||y||_2\, \frac{x}{||x||_2} = \frac{x^T y}{||x||_2^2}\, x.   (A.6)

The outer product of two vectors is the rank-one matrix defined as

x y^T = \begin{bmatrix} x_1 y_1 & \ldots & x_1 y_n \\ \vdots & \ddots & \vdots \\ x_n y_1 & \ldots & x_n y_n \end{bmatrix}.   (A.7)

A.1.2 Matrices and Matrix Operators


Let two matrices A = [aij ] ∈ Rn×l and B = [bij ] ∈ Rl×m . These matrices can be
represented as A = [a1 . . . al ] and B = [b1 . . . bm ], using column vectors, and as
 
ãT1
A =  ...  = [aij ]
 
ãTn

and  
b̃T1
B =  ...  = [bij ],
 

b̃Tl

using row vectors.


A.1.2.1 Matrix Norms


The most frequently used matrix norms are the Frobenius norm

||A||_F = \sqrt{\sum_i\sum_j |a_{ij}|^2}   (A.8)

and the induced p-norms

||A||_p = \sup_{x ≠ 0}\frac{||Ax||_p}{||x||_p} = \max_{||x||_p = 1} ||Ax||_p.   (A.9)

Important properties of the matrix norms include


• For all A ∈ Rm×n , B ∈ Rn×q it holds
||AB||p ≤ ||A||p ||B||p . (A.10)

• For all A ∈ Rm×n , x ∈ Rn it holds


||Ax||p ≤ ||A||p ||x||p . (A.11)

• For all A ∈ Rm×n , we can compute the induced norm for p = 1 as


m
X
||A||1 = max |aij |. (A.12)
1≤j≤n
i=1

• For all A ∈ Rm×n , we can compute the induced norm for p = ∞ as


Xn
||A||∞ = max |aij |. (A.13)
1≤i≤m
j=1

• For all A ∈ Rm×n , we can compute the induced norm for p = 2 as


||A||2 = σ1 , (A.14)
i.e. the largest singular value of A.

A.1.2.2 Matrix Multiplications


The matrix multiplication between A ∈ R^{n×l} and B ∈ R^{l×m} (the number of columns of
A must be equal to the number of rows of B) can be defined using the following forms

AB = \left[\sum_{k=1}^{l} a_{ik} b_{kj}\right] = \begin{bmatrix} \tilde{a}_1^T \\ \vdots \\ \tilde{a}_n^T \end{bmatrix}[b_1 \ldots b_m] = [\tilde{a}_i^T b_j]
   = [Ab_1 \ldots Ab_m] = \begin{bmatrix} \tilde{a}_1^T B \\ \vdots \\ \tilde{a}_n^T B \end{bmatrix}   (A.15)
   = \sum_{k=1}^{l} a_k \tilde{b}_k^T.


Furthermore, the i, j element of AB can be expressed as


l
X l
X l
X
[AB]ij = [ ak b̃Tk ]ij = [ak b̃Tk ]ij = aik bkj . (A.16)
k=1 k=1 k=1

Using the above, a special case is the matrix-vector multiplication

A b = \begin{bmatrix} \tilde{a}_1^T \\ \vdots \\ \tilde{a}_n^T \end{bmatrix} b = \begin{bmatrix} \tilde{a}_1^T b \\ \vdots \\ \tilde{a}_n^T b \end{bmatrix} = [\tilde{a}_j^T b].   (A.17)

Assume that we are given a basis {u1 , . . . , un } and an arbitrary vector x which can
be written as a linear combination of the basis as
n
X
x= ki ui = Uk (A.18)
i=1

where U = [u1 , . . . , un ] and k = [k1 , . . . , kn ].


The identity matrix is defined as identity element of matrix multiplication (i.e., AI =
IA = A). An example of identity matrix for 3 × 3 matrices is
 
1 0 0
I3 =  0 1 0  3 × 3 Identity Matrix. (A.19)
0 0 1

Using matrix multiplications we can define matrix integer power as

Ak = AA · · · A, k times. (A.20)
A fractional power A^{1/k} of a matrix A can be defined as a matrix B = A^{1/k} such that

B^k = A.   (A.21)

A.1.2.3 Matrix Transposition


 
The matrix transposition operation is defined as

A^T = \begin{bmatrix} a_{11} & \ldots & a_{n1} \\ \vdots & \ddots & \vdots \\ a_{1l} & \ldots & a_{nl} \end{bmatrix} = [\tilde{a}_1 \ldots \tilde{a}_n] = \begin{bmatrix} a_1^T \\ \vdots \\ a_l^T \end{bmatrix},

i.e., rows become columns and columns become rows. An important property is that

(AB)T = BT AT . (A.22)

Important expansion of the matrix multiplication AAT is the following


l
X
T
AA = ak aTk . (A.23)
k=1


Using the above, the quadratic form x^T A A^T x can be expanded as

x^T A A^T x = x^T\left(\sum_{k=1}^{l} a_k a_k^T\right) x = \sum_{k=1}^{l}\sum_{j=1}^{n}\sum_{i=1}^{n} x_j x_i a_{jk} a_{ik} = \sum_{k=1}^{l} (a_k^T x)^2.   (A.24)

The matrix product A B C^T has elements

[A B C^T]_{il} = \sum_j a_{ij}[B C^T]_{jl} = \sum_j a_{ij}\sum_k b_{jk} c_{lk} = \sum_j\sum_k a_{ij} b_{jk} c_{lk}.   (A.25)

Example (Derivatives of Quadratic Forms)


Let the function f(x) = x^T B x, i.e.

f = \sum_i \sum_j x_i x_j b_{ij}.   (A.26)

First we need to split f, isolating the terms that involve x_k:

f = \sum_{i\neq k}\sum_{j\neq k} x_i x_j b_{ij} + \sum_{i\neq k} x_i x_k b_{ik} + \sum_{j\neq k} x_k x_j b_{kj} + x_k^2 b_{kk}.   (A.27)

Then, we can compute

\frac{\partial f}{\partial x_k} = \sum_{i\neq k} x_i b_{ik} + \sum_{j\neq k} x_j b_{kj} + 2 x_k b_{kk} = \sum_i x_i b_{ik} + \sum_j x_j b_{kj}   (A.28)
= [Bx]_k + [B^T x]_k.

Hence, ∇x f = Bx + BT x. If B is symmetric then ∇x f = 2Bx.

A.1.2.4 Trace Operator


Matrix trace operation on square matrices is defined as tr(A) : A ∈ Rn×n → R
n
X
tr(A) = aii . (A.29)
i=1

Some important properties of the trace operator

• tr(A + B) = tr(A) + tr(B).


• tr(cA) = ctr(A) for all scalars c ∈ R


• tr(A) = tr(AT )
• tr(AB) = tr(BA), hence tr(XT Y) = tr(XYT )
• tr(ABCD) = tr(BCDA) = tr(DABC) = tr(CDAB)
• If A is an
X n × n matrix and λ1 , . . . , λn are its corresponding eigenvalues then
tr(A) = λi .
i

• ||A||2F = tr(AT A).
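These identities are easy to check numerically, as in the short MATLAB snippet below (random matrices of hypothetical sizes are assumed for illustration).

% Numerical check of two trace identities on random matrices.
A = randn(4,6);  B = randn(6,4);
abs(trace(A*B) - trace(B*A))              % cyclic property: ~0 up to round-off
abs(norm(A,'fro')^2 - trace(A'*A))        % ||A||_F^2 = tr(A^T A): ~0 up to round-off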

Example (Trace derivatives A: Linear Case)


Let the function f(X) = tr(AXB):

f = \sum_i [AXB]_{ii} = \sum_i\sum_j a_{ij}[XB]_{ji} = \sum_i\sum_j\sum_k a_{ij} x_{jk} b_{ki}.   (A.30)

Now we can compute \nabla_X f = [\partial f / \partial x_{jk}]:

\frac{\partial f}{\partial x_{jk}} = \sum_i a_{ij} b_{ki} = [BA]_{kj} = [(BA)^T]_{jk}.   (A.31)

Hence, ∇X tr(AXB) = AT BT .

Example (Trace derivatives A: Quadratic Case)


Let the function f(W) = tr(W^T B W). Using the results of the example on derivatives of
quadratic forms we get

f = \sum_i w_i^T B w_i = \sum_i \sum_j \sum_r w_{ji} w_{ri} b_{jr}
  = \sum_i \left( \sum_{j\neq k}\sum_{r\neq k} w_{ji} w_{ri} b_{jr} + \sum_{r\neq k} w_{ki} w_{ri} b_{kr} + \sum_{j\neq k} w_{ji} w_{ki} b_{jk} + w_{ki}^2 b_{kk} \right).   (A.32)

Now we can compute \nabla_W f = [\partial f / \partial w_{ki}]:

\frac{\partial f}{\partial w_{ki}} = \sum_{r\neq k} w_{ri} b_{kr} + \sum_{j\neq k} w_{ji} b_{jk} + 2 w_{ki} b_{kk} = \sum_r w_{ri} b_{kr} + \sum_j w_{ji} b_{jk}   (A.33)
= [BW]_{ki} + [B^T W]_{ki}.
= [BW]ki + [B W]ki .


Hence, ∇W tr(WT BW) = BW + BT W. If B is symmetric then ∇W tr(WT BW) =


2BW.

A.1.2.5 Matrix Determinant


The matrix determinant is defined via the Laplace (cofactor) expansion along any fixed column k,

\det(A) = \sum_{j=1}^{n} (-1)^{j+k} a_{jk} A_{jk},   (A.34)

where the minor A_{jk} is the determinant of the (n−1)×(n−1) matrix that is produced
from A by removing the j-th row and the k-th column.
from A by removing the j-th row and k-th column.

Example (Determinant)
 
Assume the matrix A = \begin{bmatrix} -2 & 2 & -3 \\ -1 & 1 & 3 \\ 2 & 0 & -1 \end{bmatrix}. Expanding along the second column (k = 2),

\det(A) = (-1)^{1+2} \cdot 2 \cdot \det\begin{bmatrix} -1 & 3 \\ 2 & -1 \end{bmatrix} + (-1)^{2+2} \cdot 1 \cdot \det\begin{bmatrix} -2 & -3 \\ 2 & -1 \end{bmatrix} + (-1)^{3+2} \cdot 0 \cdot \det\begin{bmatrix} -2 & -3 \\ -1 & 3 \end{bmatrix}
        = (-2) \cdot ((-1)(-1) - 2 \cdot 3) + 1 \cdot ((-2)(-1) - 2 \cdot (-3))
        = (-2) \cdot (-5) + 8 = 18.   (A.35)

Some important properties of the determinant of matrix A ∈ Rn×n


• det(I) = 1.
• det(A) = det(AT )
• det(AB) = det(A)det(B)
• det(cA) = cn det(A)
Qn
• If A is a triangular matrix the det(A) = i=1 aii .
n × n matrix and λ1 , . . . , λn are its corresponding eigenvalues then
• If A is an Q
det(A) = i λi .
• ∇A det(A) = det(A)(A−1 )T
• Determinant of block matrices. Let the block matrix be \begin{bmatrix} A & B \\ C & D \end{bmatrix} with A invertible. Then
  \det\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \det(A)\det(D - CA^{-1}B). Due to the above we have \det\begin{bmatrix} A & 0 \\ C & D \end{bmatrix} = \det(A)\det(D).


A.1.2.6 Matrix Inverse


The invertible matrix theorem. Let A be a square n × n matrix over R. The following
statements are all equivalent
• A is invertible (or non-singular or non-degenerate)
• det(A) 6= 0
• A has full rank rank(A) = n
• The system Ax = 0 has only one trivial solution x = 0.
• The null space of A is the empty space.
• The system Ax = b has a unique solution for each b.
• The mapping x 7→ Ax is one to one and onto (bijection).
• The columns of A are linearly independent.
• The columns of A form a basis of Rn .
• AT is invertible (hence the rows of A are linearly independent).
• There is a unique n × n matrix A−1 such that AA−1 = A−1 A = I.
• A does not have any zero eigenvalues.
Block matrix inversion:

\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{bmatrix}   (A.36)

where D - CA^{-1}B is the so-called Schur complement of A.
In linear algebra, two n × n matrices A and B are called similar if
B = P−1 AP. (A.37)
Similar matrices have the same rank, determinant, trace and eigenvalues.

A.1.2.7 Matrix Pseudo-Inverse


The pseudo-inverse of an m × n matrix A is a matrix that generalizes to arbitrary
matrices the notion of inverse of a square, invertible matrix.
• If A is full column rank, rank(A) = n ≤ m, that is AT A is not singular, then
A† is a left inverse of A (i.e., A† A = I). We have the closed-form expression
A† = (AT A)−1 AT . (A.38)

• If A is full row rank, rank(A) = m ≤ n, that is AAT is not singular, then A† is


a right inverse of A (i.e., AA† = I). We have the closed-form expression
A† = AT (AAT )−1 . (A.39)

• If A is square and invertible matrix then A† = A−1 .
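The closed-form expressions can be checked against MATLAB's pinv, as in the sketch below (a random full-column-rank matrix is assumed for illustration).

% Left inverse of a full column rank matrix vs. pinv.
A = randn(6,3);                        % full column rank with probability 1
A_dag = (A'*A) \ A';                   % A^+ = (A^T A)^{-1} A^T
norm(A_dag*A - eye(3))                 % ~0: A^+ is a left inverse
norm(A_dag - pinv(A))                  % ~0: agrees with MATLAB's pinv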


A.1.2.8 Range, Null Space and Rank of a matrix


There are two important sub-spaces associated with a matrix A ∈ R^{n×m}. The range
of A is defined by
ran(A) = \{y ∈ R^n : y = Ax, \; x ∈ R^m\}   (A.40)
and the null space of A is defined by
null(A) = \{x ∈ R^m : Ax = 0\}.   (A.41)
If A = [a_1, \ldots, a_m] is a column partitioning, then^1
ran(A) = \mathrm{span}\{a_1, \ldots, a_m\}.   (A.42)
The column rank of A is the dimension of the column space of A,while the row rank
of A is the dimension of the row space 2 of A. A fundamental result in linear algebra
is that the column rank and the row rank are always equal.
Hence, column or row rank (e.g., the number of linear independent columns, rank(A) =
dim(ran(A))) is simply called rank of matrix A.
A square matrix A ∈ Rn×n is said to have full rank if rank(A) = n. A matrix
A ∈ Rn×m has full rank if rank(A) = min{n, m}. If a matrix does not have full rank,
then it is called rank deficient.
• For all matrices A ∈ Rn×m we have that rank(A) ≤ min{n, m}
• Rank is the dimension of the largest square sub-matrix of a matrix that has a
non-zero determinant.
• rank(A) = 0 iff A = 0.
• If B ∈ Rm×k , then
rank(AB) ≤ min(rank(A), rank(B)). (A.43)
If B is of rank m, then rank(AB) = rank(A).
• If C ∈ Rl×n of rank n, then
rank(CA) = rank(A). (A.44)
• The rank of A is equal to r if and only if there exists an invertible n × n matrix
X and an invertible m × m matrix Y such that
 
Ir 0
XAY = (A.45)
0 0
where Ir denotes the r × r identity matrix.

rank(A) = rank(AT ) = rank(AAT ) = rank(AT A). (A.46)
• Rank-nullity theorem. The rank of a matrix plus its nullity3 equals the number
of columns of the matrix.
1
Column space, also referred to as the range of a matrix, is the span (set of all possible linear
combinations) of its column vectors
2
Row space is the span (set of all possible linear combinations) of its row vectors
3
Nullity is the dimension of the null space of the matrix.


A.1.2.9 Eigenvalues and Eigenvectors


The determination of the eigenvalues and eigenvectors of a system is extremely im-
portant in machine learning. Each eigenvalue is paired with a corresponding so-
called eigenvector.
Let A be a square n × n matrix. Vector x is a right eigenvector of matrix A with a
corresponding eigenvalue λ if
Ax = λx. (A.47)
The above equation can be stated equivalently as,

(A − λI)x = 0. (A.48)

The above equation has a non-zero solution if and only if the determinant |A − λI|
is zero. Therefore, the eigenvalues of A are values of λ that satisfy the equation

det(A − λI) = 0. (A.49)

Hence, a way to find eigenvalues analytically is by finding the roots of the above
polynomial (which is called the characteristic polynomial of A).
The generalised eigenvectors of matrices A and B are vectors that satisfy

Ax = λBx (A.50)

and λ is the corresponding generalised eigenvalue. If B is invertible, then the origi-


nal problem can be written in the form

B−1 Ax = λx (A.51)

which is a standard eigenvalue problem. However, in most situations it is preferable


not to perform the inversion, but rather to solve the generalised eigenvalue problem
as stated originally.
The possible values of λ must obey the following equation

det(A − λB) = 0. (A.52)

For an arbitrary matrix A its eigenvalues and eigenvectors could be complex. In


machine learning, generally we will work with symmetric matrices. If A is a sym-
metric n × n matrix, then all its eigenvalues and eigenvectors are real. Furthermore,
its eigenvectors form an orthonormal basis of Rn . Hence, it admits the following
eigendecomposition

A = UΛUT , UT U = In , UUT = In . (A.53)

If A and B are symmetric and B is a positive-definite matrix, the generalised eigen-


values λ are real and eigenvectors vi and vj with distinct eigenvalues are B-orthogonal

viT Bvj = 0. (A.54)


A.1.2.10 Positive and Negative Definite Matrices


A symmetric matrix A ∈ Rn×n is called positive definite if for all x ∈ Rn

xT Ax > 0. (A.55)

Similarly, it is called negative definite if x^T Ax < 0 for all x ∈ R^n.


Furthermore, a matrix A ∈ Rn×n is positive semi-definite if the above inequality is
not strict (i.e., xT Ax ≥ 0). All matrices of the form BBT are positive semi-definite
(the proof is in equation (A.24)).
Theorem 3
A matrix A ∈ Rn×n is positive definite iff all eigenvalues are positive.

Proof: Assume that all eigenvalues λ_i are positive. Then, according to the eigendecomposition
of symmetric matrices in (A.53), we have A = UΛU^T. The columns of U constitute a basis of R^n.
Hence, every x ∈ R^n with x ≠ 0 can be written as x = Uc with c ≠ 0. Then,

x^T A x = c^T U^T (UΛU^T) U c = c^T Λ c = \sum_{i=1}^{n} c_i^2 λ_i > 0.   (A.56)

Hence, A is positive definite.

Assume now that A is positive definite. Then, again,

x^T A x = c^T Λ c = \sum_{i=1}^{n} c_i^2 λ_i > 0   (A.57)

which holds for all c ∈ Rn . Now, by choosing as c the columns of the identity In
matrix, the above inequality turns into λi > 0. Hence, all eigenvalues are positive.

A.1.2.11 Triangular Matrices


Triangular matrices are very important matrices in linear algebra, because they allow
for efficient computations. A triangular matrix is a special kind of square matrix. A
square matrix is called lower triangular if all the entries above the main diagonal
are zero. Similarly, a square matrix is called upper triangular if all the entries below
the main diagonal are zero. A triangular matrix is one that is either lower triangular
or upper triangular. A matrix that is both upper and lower triangular is called a
diagonal matrix.
Properties of triangular matrices

• The sum of two upper (lower) triangular matrices is upper (lower) triangular.

• The product of two upper (lower) triangular matrices is upper (lower) trian-
gular.

• The inverse of an invertible upper (lower) triangular matrix is upper (lower)


triangular.


• The product of an upper (lower) triangular matrix by a constant is an upper


(lower) triangular matrix.

• A matrix A which is simultaneously triangular and normal (i.e., AAT = AT A)


is also diagonal.

• The transpose of an upper triangular matrix is a lower triangular matrix and


vice versa.

• The determinant of a triangular matrix equals the product of the diagonal en-
tries.

• The diagonal entries of a triangular matrix give the multiset of eigenvalues.

Proof of the above are exercises of the first tutorial.

A.1.2.12 QR decomposition
Theorem 4
If A = QR is the QR decomposition of matrix A then
Y
|det(A)| = | rii |. (A.58)
i

Proof: First we prove that |det(Q)| = 1.

QQT = I ⇒ det(Q)det(QT ) = 1 ⇒ (det(Q))2 = 1 ⇒ |det(Q)| = 1. (A.59)


Q
Hence, |det(A)| = | i rii |.
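A quick MATLAB check of Theorem 4 on a random matrix:

% |det(A)| equals the product of |r_ii| from the QR decomposition.
A = randn(5);
[Q, R] = qr(A);
abs(abs(det(A)) - abs(prod(diag(R))))   % ~0 up to round-off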

A.2 Inner Products


Inner products allow the rigorous introduction of intuitive geometrical quantities,
such as the length of a vector and the angle or distance between two vectors. We
may already be familiar with the scalar product/dot product in Rn
n
X

x y= xi yi . (A.60)
i=1

However, inner products are more general concepts with specific properties, which
we will now introduce.
Definition 13
Let V be a vector space and β : V × V → R a bilinear mapping (i.e., linear in both
arguments).

• β is called symmetric if β(x, y) = β(y, x) for all x, y ∈ V .

• β is called positive definite if for all x 6= 0: β(x, x) > 0. β(0, 0) = 0.


• A positive definite, symmetric bilinear mapping β : V × V → R is called inner


product on V . We typically write hx, yi instead of β(x, y).
• The pair (V, h·, ·i) is called inner product space, normed vector space or (real)
vector space with inner product. If we use the dot product defined in (A.60),
we call (V, h·, ·i) a Euclidean vector space.

Example (Inner Product)


Consider V = R2 . If we define β(x, y) = hx, yi := x1 y1 − (x1 y2 + x2 y1 ) + 2x2 y2 then
β is an inner product but different from the dot product.

In an inner product space, the inner product allows us to introduce concepts, such
as lengths, distances and orthogonality.

A.2.1 Lengths, Distances, Orthogonality


Definition 14 (Norm) p
Consider an inner product space (V, h·, ·i). Then kxk := hx, xi is the length or norm
of x ∈ V . The mapping
k·k:V →R (A.61)
x 7→ kxk (A.62)
is called norm.

Example (Length of Vectors)


In geometry, we are often interested in lengths of vectors. We can now use the inner
product to compute them. For instance, in a Euclidean vector √ space, where
√ we use
⊤ 2 2
the dot product, if x = [1, 2] then its norm/length is kxk = 1 + 2 = 5

Remark 26
The norm k · k possesses the following properties:
1. kxk ≥ 0 for all x ∈ V and kxk = 0 ⇔ x = 0
2. kλxk = |λ| · kxk for all x ∈ V and λ ∈ R
3. Minkowski inequality: kx + yk ≤ kxk + kyk for all x, y ∈ V
Definition 15 (Distance and Metric)
Consider an inner product space (V, h·, ·i). Then d(x, y) := kx − yk is called distance
of x, y ∈ V . The mapping
d:V ×V →R (A.63)
(x, y) 7→ d(x, y) (A.64)
is called metric.


A metric d satisfies:

1. d is positive definite, i.e., d(x, y) ≥ 0 for all x, y ∈ V and d(x, y) = 0 ⇔ x = y

2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V .

3. Triangular inequality: d(x, z) ≤ d(x, y) + d(y, z).


Definition 16 (Orthogonality)
Vectors x and y are orthogonal if hx, yi = 0, and we write x ⊥ y

Theorem 5
Let (V, h·, ·i) be an inner product space and x, y, z ∈ V . Then:

1. Cauchy-Schwarz inequality: |hx, yi| ≤ kxk kyk

2. Minkowski inequality: kx + yk ≤ kxk + kyk

3. Triangular inequality: d(x, z) ≤ d(x, y) + d(y, z)

4. Parallelogram law: kx + yk + kx − yk = 2kxk2 + 2kyk2

5. 4hx, yi = kx + yk2 − kx − yk2

6. x ⊥ y ⇔ kx + yk2 = kxk2 + kyk2

The Cauchy-Schwarz inequality allows us to define angles ω in inner product spaces


between two vectors x, y. Assume that x 6= 0, y 6= 0. Then

hx, yi
−1 ≤ ≤ 1. (A.65)
kxk kyk

Therefore, there exists a unique ω ∈ [0, π) with

hx, yi
cos ω = . (A.66)
kxk kyk

The number ω is the angle between x and y.
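For instance, with the dot product the angle between two (hypothetical) vectors can be computed directly in MATLAB:

% Angle between two vectors under the dot product.
x = [1; 2];  y = [3; 1];
cos_w = (x'*y) / (norm(x)*norm(y));
w = acos(cos_w);        % angle in radians, w in [0, pi)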

A.2.2 Applications of Inner Products


Inner products allow us to compute angles between vectors or distances. A major
purpose of inner products is to determine whether vectors are orthogonal to each
other; in this case hx, yi = 0. This will play an important role when we discuss pro-
jections in Section ??. The inner product also allows us to determine specific bases of
vector (sub)spaces, where each vector is orthogonal to all others (orthogonal bases)
using the Gram-Schmidt method4 . These bases are important optimization and nu-
merical algorithms for solving linear equation systems. For instance, Krylov subspace
4
not discussed in this course


methods5 , such as Conjugate Gradients or GMRES, minimize residual errors that are
orthogonal to each other (Stoer and Burlirsch, 2002).
In machine learning, inner products are important in the context of kernel meth-
ods (Schölkopf and Smola, 2002). Kernel methods exploit the fact that many lin-
ear algorithms can be expressed purely by inner product computations.6 Then, the
“kernel trick” allows us to compute these inner products implicitly in a (potentially
infinite-dimensional) feature space, without even knowing this feature space ex-
plicitly. This allowed the “non-linearization” of many algorithms used in machine
learning, such as kernel-PCA (Schölkopf et al., 1998) for dimensionality reduction.
Gaussian processes (Rasmussen and Williams, 2006) also fall into the category of
kernel methods and are the current state-of-the-art in probabilistic regression (fit-
ting curves to data points).

A.3 Useful Matrix Identities


To avoid explicit inversion of a possibly singular matrix, we often employ the follow-
ing three identities:

(A−1 + B −1 )−1 = A(A + B)−1 B = B(A + B)−1 A (A.67)


⊤ −1 −1 −1 −1 ⊤ −1 −1 ⊤ −1
(Z + U W V ) =Z −Z U (W +V Z U) V Z (A.68)
−1 −1 −1 −1 −1 −1
(A + BC) =A − A B(I + CA B) CA . (A.69)

The Searle identity in (A.67) is useful if the individual inverses of A and B do not exist or if they are ill-conditioned. The Woodbury identity in (A.68) can be used to reduce the computational burden: If Z ∈ R^{p×p} is diagonal, the inverse Z⁻¹ can be computed in O(p). Consider the case where U ∈ R^{p×q}, W ∈ R^{q×q}, and V⊤ ∈ R^{q×p} with p ≫ q. The inverse (Z + UWV⊤)⁻¹ ∈ R^{p×p} would require O(p³) computations (naively implemented). Using (A.68), the computational burden reduces to O(p) for the inverse of the diagonal matrix Z plus O(q³) for the inverse of W and the inverse of W⁻¹ + V⊤Z⁻¹U ∈ R^{q×q}. Therefore, the inversion of a p × p matrix can be reduced to the inversion of q × q matrices, the inversion of a diagonal p × p matrix, and some matrix multiplications, all of which require less than O(p³) computations. The Kailath inverse in (A.69) is a special case of the Woodbury identity in (A.68) with W = I. The Kailath inverse makes the inversion of A + BC numerically a bit more stable if A + BC is ill-conditioned and A⁻¹ exists.
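
The following sketch gives a quick numerical check of the Woodbury identity (A.68) and of the cost argument above; the dimensions p, q and the random matrices are chosen arbitrarily for illustration:

import numpy as np

p, q = 1000, 5
rng = np.random.default_rng(0)

Z = np.diag(rng.uniform(1.0, 2.0, size=p))   # diagonal p x p matrix
U = rng.normal(size=(p, q))
W = np.eye(q)
V = rng.normal(size=(p, q))                  # so that V.T is q x p

# naive inverse of the full p x p matrix: O(p^3)
naive = np.linalg.inv(Z + U @ W @ V.T)

# Woodbury: only q x q inverses plus the cheap inverse of the diagonal Z
Z_inv = np.diag(1.0 / np.diag(Z))                             # O(p)
small = np.linalg.inv(np.linalg.inv(W) + V.T @ Z_inv @ U)     # q x q inverse
woodbury = Z_inv - Z_inv @ U @ small @ V.T @ Z_inv

print(np.allclose(naive, woodbury))          # True (up to floating-point error)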

⁵ The basis for the Krylov subspace is derived from the Cayley-Hamilton theorem, which allows us to compute the inverse of a matrix in terms of a linear combination of its powers.
⁶ Matrix-vector multiplication Ax = b falls into this category since bᵢ is the dot product of the ith row of A with x.


Bibliography

Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transac-
tions on Automatic Control, 19(6):716–723. pages 62

Belhumeur, P. N., Hespanha, J. P., and Kriegman, D. J. (1997). Eigenfaces vs. Fish-
erfaces: Recognition using Class Specific Linear Projection. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 19(7):711–720. pages 2

Bertsekas, D. P. (1999). Nonlinear Programming. Athena Scientific. pages 56

Bickson, D., Dolev, D., Shental, O., Siegel, P. H., and Wolf, J. K. (2007). Linear
Detection via Belief Propagation. In Proceedings of the Annual Allerton Conference
on Communication, Control, and Computing. pages 22

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Information Science


and Statistics. Springer-Verlag. pages 1, 17, 18, 49

Bottou, L. (1998). Online Algorithms and Stochastic Approximations. In Online


Learning and Neural Networks, pages 1–34. Cambridge University Press. pages 55

Bryson, A. E. (1961). A Gradient Method for Optimizing Multi-stage Allocation


Processes. In Proceedings of the Harvard University Symposium on Digital Computers
and Their Applications. pages 36

Burges, C. J. (1998). A Tutorial on Support Vector Machines for Pattern Recognition.


Data Mining and Knowledge Discovery, 2(2):121–167. pages 2

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z.,
Ranzato, M. A., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large Scale
Distributed Deep Networks. In Advances in Neural Information Processing Systems,
pages 1–11. pages 56

Deisenroth, M. P. and Mohamed, S. (2012). Expectation Propagation in Gaussian


Process Dynamical Systems. In Advances in Neural Information Processing Systems,
pages 2618–2626. pages 22

Deisenroth, M. P. and Ohlsson, H. (2011). A General Perspective on Gaussian Fil-


tering and Smoothing: Explaining Current and Deriving New Algorithms. In Pro-
ceedings of the American Control Conference. pages 13


Dreyfus, S. (1962). The Numerical Solution of Variational Problems. Journal of


Mathematical Analysis and Applications, 5(1):30–45. pages 36

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for On-
line Learning and Stochastic Optimization. Journal of Machine Learning Research,
12:2121–2159. pages 56

Gal, Y., van der Wilk, M., and Rasmussen, C. E. (2014). Distributed Variational
Inference in Sparse Gaussian Process Regression and Latent Variable Models. In
Advances in Neural Information Processing Systems. pages 56

Golub, G. H. and Van Loan, C. F. (2012). Matrix Computations, volume 4. JHU Press.
pages 1, 2, 73

Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian Processes for Big
Data. In Nicholson, A. and Smyth, P., editors, Proceedings of the Conference on
Uncertainty in Artificial Intelligence. AUAI Press. pages 56

Herbrich, R., Minka, T., and Graepel, T. (2007). TrueSkill(TM): A Bayesian Skill
Rating System. In Advances in Neural Information Processing Systems, pages 569–
576. MIT Press. pages 22

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic Variational
Inference. Journal of Machine Learning Research, 14(1):1303–1347. pages 56

Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University


Press. pages 5

Jefferys, W. H. and Berger, J. O. (1992). Ockham’s Razor and Bayesian Analysis.


American Scientist, 80:64–72. pages 58

Jimenez Rezende, D., Mohamed, S., and Wierstra, D. (2014). Stochastic Backprop-
agation and Variational Inference in Deep Latent Gaussian Models. In Proceedings
of the International Conference on Machine Learning. pages 15

Kalman, R. E. (1960). A New Approach to Linear Filtering and Prediction Problems.


Transactions of the ASME—Journal of Basic Engineering, 82(Series D):35–45. pages
13

Kelley, H. J. (1960). Gradient Theory of Optimal Flight Paths. Ars Journal,


30(10):947–954. pages 36

Kingma, D. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. Inter-
national Conference on Learning Representations, pages 1–13. pages 56

Kingma, D. P. and Welling, M. (2013). Auto-Encoding Variational Bayes. In Proceed-


ings of the International Conference on Learning Representations. pages 15

Kittler, J. and Föglein, J. (1984). Contextual Classification of Multispectral Pixel Data. Image and Vision Computing, 2(1):13–29. pages 22


Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models. MIT Press. pages
22
Lin, C.-J. (2006). A guide to support vector machines. Department of Computer
Science & Information Engineering, National Taiwan University, Taiwan. pages 2
MacKay, D. J. C. (1992). Bayesian Interpolation. Neural Computation, 4:415–447.
pages 58
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms.
Cambridge University Press, The Edinburgh Building, Cambridge CB2 2RU, UK.
pages 58, 59
McEliece, R. J., MacKay, D. J. C., and Cheng, J.-F. (1998). Turbo Decoding as an
Instance of Pearl’s “Belief Propagation” Algorithm. IEEE Journal on Selected Areas
in Communications, 16(2):140–152. pages 22
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G.,
Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie,
C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and
Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning.
Nature, 518:529–533. pages 56
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, USA. pages 11, 14, 16, 58, 60, 61
O’Hagan, A. (1991). Bayes-Hermite Quadrature. Journal of Statistical Planning and
Inference, 29:245–260. pages 61
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. pages 22
Petersen, K. B. and Pedersen, M. S. (2012). The Matrix Cookbook. Version
20121115. pages 33
Rasmussen, C. E. and Ghahramani, Z. (2001). Occam’s Razor. In Advances in Neural
Information Processing Systems 13, pages 294–300. The MIT Press. pages 62
Rasmussen, C. E. and Ghahramani, Z. (2003). Bayesian Monte Carlo. In Becker, S.,
Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing
Systems 15, pages 489–496. The MIT Press, Cambridge, MA, USA. pages 61
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learn-
ing. Adaptive Computation and Machine Learning. The MIT Press, Cambridge,
MA, USA. pages 14, 125
Roweis, S. and Ghahramani, Z. (1999). A Unifying Review of Linear Gaussian Mod-
els. Neural Computation, 11(2):305–345. pages 14
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Representations
by Back-propagating Errors. Nature, 323(6088):533–536. pages 36, 54


Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels—Support Vector Ma-


chines, Regularization, Optimization, and Beyond. Adaptive Computation and Ma-
chine Learning. The MIT Press, Cambridge, MA, USA. pages 125

Schölkopf, B., Smola, A. J., and Müller, K.-R. (1998). Nonlinear Component Analysis
as a Kernel Eigenvalue Problem. Neural Computation, 10(5):1299–1319. pages
125

Schwarz, G. E. (1978). Estimating the Dimension of a Model. Annals of Statistics,


6(2):461–464. pages 63

Shental, O., Bickson, D., Siegel, P. H., Wolf, J. K., and Dolev, D. (2008). Gaussian Belief Propagation Solver for Systems of Linear Equations. In IEEE International Symposium on Information Theory. pages 22

Shor, N. Z. (1985). Minimization Methods for Non-differentiable Functions. Springer.


pages 56

Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation. In Proceedings of the European Conference on Computer Vision. pages 22

Spiegelhalter, D. and Smith, A. F. M. (1980). Bayes Factors and Choice Criteria for
Linear Models. Journal of the Royal Statistical Society B, 42(2):213–220. pages 58

Stoer, J. and Bulirsch, R. (2002). Introduction to Numerical Analysis. Springer. pages 61, 125

Sucar, L. E. and Gillies, D. F. (1994). Probabilistic Reasoning in High-Level Vision.


Image and Vision Computing, 12(1):42–60. pages 22

Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tap-
pen, M., and Rother, C. (2008). A Comparative Study of Energy Minimization
Methods for Markov Random Fields with Smoothness-based Priors. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 30(6):1068–1080. pages 22

Tibshirani, R. (1996). Regression Selection and Shrinkage via the Lasso. Journal of
the Royal Statistical Society B, 58(1):267–288. pages 50

Tipping, M. E. and Bishop, C. M. (1999). Probabilistic Principal Component Analysis.


Journal of the Royal Statistical Society: Series B, 61(3):611–622. pages 14

Toussaint, M. (2012). Some Notes on Gradient Descent. pages 54

Turk, M. and Pentland, A. (1991). Eigenfaces for Recognition. Journal of Cognitive


Neuroscience, 3(1):71–86. pages 2
