Machine Learning
About
CS 189 is the Machine Learning course at UC Berkeley. We have written this comprehensive course
guide to share our knowledge with students and the general public, and hopefully to draw the
interest of students from other universities to Berkeley’s Machine Learning curriculum.
This guide was started by CS 189 TAs Soroush Nasiriany and Garrett Thomas in Fall 2017, with
the assistance of William Wang and Alex Yang.
We owe gratitude to Professors Anant Sahai, Stella Yu, and Jennifer Listgarten, as this book is
heavily inspired by their lectures. In addition, we are indebted to Professor Jonathan Shewchuk
for his machine learning notes, from which we drew inspiration.
The latest version of this document can be found either at http://www.eecs189.org/ or
http://snasiriany.me/cs189/. Please report any mistakes to the staff, and contact the authors if you
wish to redistribute this document.
Notation
Notation                     Meaning
$\mathbb{R}$                 set of real numbers
$\mathbb{R}^n$               set (vector space) of $n$-tuples of real numbers, endowed with the usual inner product
$\mathbb{R}^{m \times n}$    set (vector space) of $m$-by-$n$ matrices
$\delta_{ij}$                Kronecker delta, i.e. $\delta_{ij} = 1$ if $i = j$, $0$ otherwise
$\nabla f(\mathbf{x})$       gradient of the function $f$ at $\mathbf{x}$
$\nabla^2 f(\mathbf{x})$     Hessian of the function $f$ at $\mathbf{x}$
$p(X)$                       distribution of random variable $X$
$p(x)$                       probability density/mass function evaluated at $x$
$\mathbb{E}[X]$              expected value of random variable $X$
$\text{Var}(X)$              variance of random variable $X$
$\text{Cov}(X, Y)$           covariance of random variables $X$ and $Y$
Other notes:
• Vectors and matrices are in bold (e.g. x, A). This is true for vectors in Rn as well as for
vectors in general vector spaces. We generally use Greek letters for scalars and capital Roman
letters for matrices and random variables.
• We assume that vectors are column vectors, i.e. that a vector in Rn can be interpreted as an
n-by-1 matrix. As such, taking the transpose of a vector is well-defined (and produces a row
vector, which is a 1-by-n matrix).
Contents
1 Regression I
1.1 Ordinary Least Squares
1.2 Ridge Regression
1.3 Feature Engineering
1.4 Hyperparameters and Validation
2 Regression II
2.1 MLE and MAP for Regression (Part I)
2.2 Bias-Variance Tradeoff
2.3 Multivariate Gaussians
2.4 MLE and MAP for Regression (Part II)
2.5 Kernels and Ridge Regression
2.6 Sparse Least Squares
2.7 Total Least Squares
3 Dimensionality Reduction
3.1 Principal Component Analysis
3.2 Canonical Correlation Analysis
5 Classification
5.1 Generative vs. Discriminative Classification
5.2 Least Squares Support Vector Machine
5.3 Logistic Regression
5.4 Gaussian Discriminant Analysis
5.5 Support Vector Machines
5.6 Duality
5.7 Nearest Neighbor Classification
6 Clustering
6.1 K-means Clustering
6.2 Mixture of Gaussians
6.3 Expectation Maximization (EM) Algorithm
Regression I
Our goal in machine learning is to extract a relationship from data. In regression tasks, this
relationship takes the form of a function $y = f(\mathbf{x})$, where $y \in \mathbb{R}$ is some quantity that can be
predicted from an input $\mathbf{x} \in \mathbb{R}^d$, which should for the time being be thought of as some collection
of numerical measurements. The true relationship $f$ is unknown to us, and our aim is to recover it
as well as we can from data. Our end product is a function $\hat{y} = h(\mathbf{x})$, called the hypothesis, that
should approximate $f$. We assume that we have access to a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where each
pair $(\mathbf{x}_i, y_i)$ is an example (possibly noisy or otherwise approximate) of the input-output mapping
to be learned. Since learning arbitrary functions is intractable, we restrict ourselves to some
hypothesis class $\mathcal{H}$ of allowable functions. More specifically, we typically employ a parametric
model, meaning that there is some finite-dimensional vector $\mathbf{w} \in \mathbb{R}^d$, the elements of which are
known as parameters or weights, that controls the behavior of the function. That is,
$$h_{\mathbf{w}}(\mathbf{x}) = g(\mathbf{x}, \mathbf{w})$$
for some other function $g$. The hypothesis class is then the set of all functions induced by the
possible choices of the parameters $\mathbf{w}$:
$$\mathcal{H} = \{h_{\mathbf{w}} \mid \mathbf{w} \in \mathbb{R}^d\}$$
After designating a cost function $L$, which measures how poorly the predictions $\hat{y}$ of the hypothesis
match the true output $y$, we can proceed to search for the parameters that best fit the data by
minimizing this function:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} L(\mathbf{w})$$
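1.1 Ordinary Least Squares
Ordinary least squares (OLS) is one of the simplest regression problems, but it is well-understood
and practically useful. It is a linear regression problem, which means that we take $h_{\mathbf{w}}$ to be of
the form $h_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}^\top\mathbf{w}$. We want
$$y_i \approx \mathbf{x}_i^\top\mathbf{w}, \quad i = 1, \dots, n$$
or, stacking the examples, $\mathbf{y} \approx \mathbf{X}\mathbf{w}$, where $\mathbf{y} = (y_1, \dots, y_n)^\top \in \mathbb{R}^n$.
In words, the matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ has the input datapoint $\mathbf{x}_i$ as its $i$th row. This matrix is
sometimes called the design matrix. Usually $n \geq d$, meaning that there are at least as many
datapoints as measurements.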
There will in general be no exact solution to the equation $\mathbf{y} = \mathbf{X}\mathbf{w}$ (even if the data were perfect,
consider how many equations and variables there are), but we can find an approximate solution by
minimizing the sum (or equivalently, the mean) of the squared errors:
$$\min_{\mathbf{w}} L(\mathbf{w}), \qquad L(\mathbf{w}) = \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2 = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$
Now that we have formulated an optimization problem, we want to go about solving it. We will see
that the particular structure of OLS allows us to compute a closed-form expression for a globally
optimal solution, which we denote $\mathbf{w}^*_{\text{ols}}$.
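Calculus is the primary mathematical workhorse for studying the optimization of differentiable
functions. Recall the following important result: if $L : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable, then
any local optimum $\mathbf{w}^*$ satisfies $\nabla L(\mathbf{w}^*) = \mathbf{0}$. In the OLS case,
$$L(\mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 = \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w} - 2(\mathbf{X}^\top\mathbf{y})^\top\mathbf{w} + \mathbf{y}^\top\mathbf{y}$$
Using the following results from matrix calculus
$$\nabla_{\mathbf{x}}(\mathbf{a}^\top\mathbf{x}) = \mathbf{a}$$
$$\nabla_{\mathbf{x}}(\mathbf{x}^\top\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$$
the gradient of $L$ is
$$\nabla L(\mathbf{w}) = (\mathbf{X}^\top\mathbf{X} + (\mathbf{X}^\top\mathbf{X})^\top)\mathbf{w} - 2\mathbf{X}^\top\mathbf{y}$$
$$= 2\mathbf{X}^\top\mathbf{X}\mathbf{w} - 2\mathbf{X}^\top\mathbf{y}$$
where in the last line we have used the symmetry of $\mathbf{X}^\top\mathbf{X}$ to simplify $\mathbf{X}^\top\mathbf{X} + (\mathbf{X}^\top\mathbf{X})^\top = 2\mathbf{X}^\top\mathbf{X}$.
Setting the gradient to $\mathbf{0}$, we conclude that any optimum $\mathbf{w}^*_{\text{ols}}$ satisfies
$$\mathbf{X}^\top\mathbf{X}\mathbf{w}^*_{\text{ols}} = \mathbf{X}^\top\mathbf{y}$$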
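If $\mathbf{X}$ is full rank, then $\mathbf{X}^\top\mathbf{X}$ is as well (assuming $n \geq d$), so we can solve for a unique solution
$$\mathbf{w}^*_{\text{ols}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
Note: Although we write $(\mathbf{X}^\top\mathbf{X})^{-1}$, in practice one would not actually compute the inverse; it
is more numerically stable to solve the linear system of equations above (e.g. with Gaussian
elimination).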
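As a rough illustration of the note above, the following sketch solves the normal equations with a linear solver instead of forming the inverse, and compares the result with a general least-squares routine. The synthetic data, the seed, and the name `w_true` are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed; the data are synthetic
n, d = 50, 3
X = rng.normal(size=(n, d))               # design matrix, one datapoint per row
w_true = np.array([1.0, -2.0, 0.5])       # hypothetical "true" weights
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve X^T X w = X^T y as a linear system, without computing an inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# A least-squares solver works on X directly (and also copes with rank deficiency).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient 2 X^T X w - 2 X^T y should vanish at the optimum.
grad = 2 * X.T @ X @ w_normal - 2 * X.T @ y
print(w_normal, np.abs(grad).max())
```

Both routes should agree up to floating-point error when X has full column rank.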
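In this derivation we have used the condition $\nabla L(\mathbf{w}^*) = \mathbf{0}$, which is a necessary but not sufficient
condition for optimality. We found a critical point, but in general such a point could be a local
minimum, a local maximum, or a saddle point. Fortunately, in this case the objective function
is convex, which implies that any critical point is indeed a global minimum. To show that $L$ is
convex, it suffices to compute the Hessian of $L$, which in this case is
$$\nabla^2 L(\mathbf{w}) = 2\mathbf{X}^\top\mathbf{X}$$
and to observe that it is positive semi-definite, since $\mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v} = \|\mathbf{X}\mathbf{v}\|_2^2 \geq 0$ for every $\mathbf{v} \in \mathbb{R}^d$.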
There is also a linear algebraic way to arrive at the same solution: orthogonal projections.
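Recall that if $V$ is an inner product space and $S$ a subspace of $V$, then any $\mathbf{v} \in V$ can be decomposed
uniquely in the form
$$\mathbf{v} = \mathbf{v}_S + \mathbf{v}_\perp$$
where $\mathbf{v}_S \in S$ and $\mathbf{v}_\perp \in S^\perp$. Here $S^\perp$ is the orthogonal complement of $S$, i.e. the set of vectors
that are perpendicular to every vector in $S$.
The orthogonal projection onto $S$, denoted $P_S$, is the linear operator that maps $\mathbf{v}$ to $\mathbf{v}_S$ in the
decomposition above. An important property of the orthogonal projection is that
$$\|\mathbf{v} - P_S\mathbf{v}\| \leq \|\mathbf{v} - \mathbf{s}\|$$
for all $\mathbf{s} \in S$, with equality if and only if $\mathbf{s} = P_S\mathbf{v}$. That is,
$$P_S\mathbf{v} = \arg\min_{\mathbf{s} \in S} \|\mathbf{v} - \mathbf{s}\|$$
To see this, note that $\mathbf{v} - P_S\mathbf{v} \in S^\perp$ and $P_S\mathbf{v} - \mathbf{s} \in S$, so by the Pythagorean theorem
$$\|\mathbf{v} - \mathbf{s}\|^2 = \|\mathbf{v} - P_S\mathbf{v}\|^2 + \|P_S\mathbf{v} - \mathbf{s}\|^2 \geq \|\mathbf{v} - P_S\mathbf{v}\|^2$$
with equality holding if and only if $\|P_S\mathbf{v} - \mathbf{s}\|^2 = 0$, i.e. $\mathbf{s} = P_S\mathbf{v}$. Taking square roots on both
sides gives $\|\mathbf{v} - \mathbf{s}\| \geq \|\mathbf{v} - P_S\mathbf{v}\|$ as claimed (since norms are nonnegative).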
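But observe that the set of vectors that can be written $\mathbf{X}\mathbf{w}$ for some $\mathbf{w} \in \mathbb{R}^d$ is precisely the range
of $\mathbf{X}$, which we know to be a subspace of $\mathbb{R}^n$, so
$$\min_{\mathbf{z} \in \text{range}(\mathbf{X})} \|\mathbf{z} - \mathbf{y}\|_2^2 = \min_{\mathbf{w} \in \mathbb{R}^d} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$
By pattern matching with the earlier optimality statement about $P_S$, we observe that $P_{\text{range}(\mathbf{X})}\mathbf{y} = \mathbf{X}\mathbf{w}^*_{\text{ols}}$,
where $\mathbf{w}^*_{\text{ols}}$ is any optimum for the right-hand side. The projected point $\mathbf{X}\mathbf{w}^*_{\text{ols}}$ is always
unique, but if $\mathbf{X}$ is full rank (again assuming $n \geq d$), then the optimum $\mathbf{w}^*_{\text{ols}}$ is also unique (as
expected). This is because $\mathbf{X}$ being full rank means that the columns of $\mathbf{X}$ are linearly independent,
in which case there is a one-to-one correspondence between $\mathbf{w}$ and $\mathbf{X}\mathbf{w}$.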
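To solve for $\mathbf{w}^*_{\text{ols}}$, we need the following fact¹:
$$\text{null}(\mathbf{X}^\top) = \text{range}(\mathbf{X})^\perp$$
Since we are projecting onto $\text{range}(\mathbf{X})$, the orthogonality condition for optimality is that
$\mathbf{y} - P\mathbf{y} \perp \text{range}(\mathbf{X})$, i.e. $\mathbf{y} - \mathbf{X}\mathbf{w}^*_{\text{ols}} \in \text{null}(\mathbf{X}^\top)$. This leads to the equation
$$\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}^*_{\text{ols}}) = \mathbf{0}$$
which is equivalent to
$$\mathbf{X}^\top\mathbf{X}\mathbf{w}^*_{\text{ols}} = \mathbf{X}^\top\mathbf{y}$$
as before.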
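The projection view can be checked numerically. The sketch below, on made-up random data, confirms that the OLS residual is orthogonal to every column of X and that a different point in range(X) is no closer to y; the variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))              # synthetic design matrix
y = rng.normal(size=20)                   # synthetic targets

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ w_ols                  # y minus its projection onto range(X)

# Orthogonality condition: the residual lies in null(X^T).
ortho = X.T @ residual

# Any other point X w in range(X) is at least as far from y as the projection.
w_other = w_ols + rng.normal(size=4)
dist_proj = np.linalg.norm(residual)
dist_other = np.linalg.norm(y - X @ w_other)
print(np.abs(ortho).max(), dist_proj, dist_other)
```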
¹ This result is often stated as part of the Fundamental Theorem of Linear Algebra.
1.2 Ridge Regression
While Ordinary Least Squares can be used for solving linear least squares problems, it falls short
due to numerical instability and generalization issues. Numerical instability arises when the features
of the data are close to collinear (leading to nearly linearly dependent feature columns), causing the
input matrix $\mathbf{X}$ to lose rank or to have singular values that are very close to 0. Why are small
singular values bad? Let us illustrate this via the singular value decomposition (SVD) of $\mathbf{X}$:
$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$$
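where $\mathbf{U} \in \mathbb{R}^{n \times n}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{n \times d}$, $\mathbf{V} \in \mathbb{R}^{d \times d}$. In the context of OLS, we must have that $\mathbf{X}^\top\mathbf{X}$ is invertible,
or equivalently, $\text{rank}(\mathbf{X}^\top\mathbf{X}) = \text{rank}(\mathbf{X}^\top) = \text{rank}(\mathbf{X}) = d$. Assuming that $\mathbf{X}$ has full column
rank $d$, we can express the SVD of $\mathbf{X}$ as
$$\mathbf{X} = \mathbf{U}\begin{bmatrix}\boldsymbol{\Sigma}_d \\ \mathbf{0}\end{bmatrix}\mathbf{V}^\top$$
where $\boldsymbol{\Sigma}_d \in \mathbb{R}^{d \times d}$ is a diagonal matrix with strictly positive entries. Now let’s try to expand the
$(\mathbf{X}^\top\mathbf{X})^{-1}$ term in OLS using the SVD of $\mathbf{X}$:
$$
\begin{aligned}
(\mathbf{X}^\top\mathbf{X})^{-1}
&= \left(\mathbf{V}\begin{bmatrix}\boldsymbol{\Sigma}_d & \mathbf{0}\end{bmatrix}\mathbf{U}^\top\mathbf{U}\begin{bmatrix}\boldsymbol{\Sigma}_d \\ \mathbf{0}\end{bmatrix}\mathbf{V}^\top\right)^{-1} \\
&= \left(\mathbf{V}\begin{bmatrix}\boldsymbol{\Sigma}_d & \mathbf{0}\end{bmatrix}\mathbf{I}\begin{bmatrix}\boldsymbol{\Sigma}_d \\ \mathbf{0}\end{bmatrix}\mathbf{V}^\top\right)^{-1} \\
&= (\mathbf{V}\boldsymbol{\Sigma}_d^2\mathbf{V}^\top)^{-1} = (\mathbf{V}^\top)^{-1}(\boldsymbol{\Sigma}_d^2)^{-1}\mathbf{V}^{-1} = \mathbf{V}\boldsymbol{\Sigma}_d^{-2}\mathbf{V}^\top
\end{aligned}
$$
This means that $(\mathbf{X}^\top\mathbf{X})^{-1}$ has singular values that are the inverse squares of the singular
values of $\mathbf{X}$, potentially leading to extremely large singular values when the singular values of $\mathbf{X}$ are
close to 0. Such excessively large singular values can be very problematic for numerical stability.
In addition, abnormally large entries in the optimal solution $\mathbf{w}$ would prevent OLS from
generalizing to unseen data.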
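A small numerical experiment makes the effect visible. The sketch below builds two nearly collinear features (the noise scale `1e-3` and the seed are arbitrary choices for illustration) and compares the largest singular value of the inverse of X^T X with the inverse square of the smallest singular value of X.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)       # nearly collinear with x1
X = np.column_stack([x1, x2])

sigma = np.linalg.svd(X, compute_uv=False)            # singular values of X, descending
inv_sv = np.linalg.svd(np.linalg.inv(X.T @ X), compute_uv=False)

# The largest singular value of (X^T X)^{-1} should match 1 / sigma_min^2.
print(sigma)
print(inv_sv[0], 1.0 / sigma[-1] ** 2)
```

The closer the two columns are to collinear, the smaller the last singular value of X and the larger the printed quantity becomes.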
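There is a very simple solution to these issues: discourage the entries of $\mathbf{w}$ from becoming too large.
We can do this by adding a penalty term constraining the norm of $\mathbf{w}$. For a fixed, small scalar
$\lambda > 0$, we now have:
$$\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$
Note that the $\lambda$ in our objective function is a hyperparameter that controls how strongly large
values in $\mathbf{w}$ are penalized. Just like the degree in polynomial features, $\lambda$ is a value that we must
choose ourselves, typically through validation. Let’s expand the terms of the objective function:
$$L(\mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_2^2 = \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{y}^\top\mathbf{y} + \lambda\mathbf{w}^\top\mathbf{w}$$
Finally take the gradient of the objective and find the value of $\mathbf{w}$ that achieves $\mathbf{0}$ for the gradient:
$$
\begin{aligned}
\nabla_{\mathbf{w}} L(\mathbf{w}) &= \mathbf{0} \\
2\mathbf{X}^\top\mathbf{X}\mathbf{w} - 2\mathbf{X}^\top\mathbf{y} + 2\lambda\mathbf{w} &= \mathbf{0} \\
(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})\mathbf{w} &= \mathbf{X}^\top\mathbf{y} \\
\mathbf{w}^*_{\text{ridge}} &= (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}
\end{aligned}
$$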
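The closed form above translates directly into a few lines of NumPy. This is a minimal sketch on synthetic data; the helper name `ridge` and the two values of the regularization weight are assumptions chosen for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w (no explicit inverse)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))              # synthetic design matrix
y = rng.normal(size=30)                   # synthetic targets

lam_small, lam_large = 1e-3, 1e3
w_small = ridge(X, y, lam_small)
w_large = ridge(X, y, lam_large)

# The gradient 2 X^T X w - 2 X^T y + 2 lam w should vanish at the solution.
grad = 2 * X.T @ X @ w_small - 2 * X.T @ y + 2 * lam_small * w_small

# A larger penalty shrinks the weights toward zero.
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```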
This value is guaranteed to achieve the (unique) global minimum, because the objective function
is strongly convex. To show that $L$ is strongly convex, it suffices to compute the Hessian of $L$,
which in this case is
$$\nabla^2 L(\mathbf{w}) = 2\mathbf{X}^\top\mathbf{X} + 2\lambda\mathbf{I}$$
Since the Hessian is positive definite, we can equivalently say that the eigenvalues of the Hessian are
strictly positive and that the objective function is strongly convex. A useful property of strongly
convex functions is that they have a unique optimum point, so the solution to ridge regression is
unique. We cannot make such guarantees about ordinary least squares, because the corresponding
Hessian could have eigenvalues that are 0. Let us explore the case in OLS when the Hessian has
a 0 eigenvalue. In this context, the term $\mathbf{X}^\top\mathbf{X}$ is not invertible, but this does not imply that no
solution exists! In OLS, there always exists a solution, and when the Hessian is positive definite (PD)
that solution is unique; when the Hessian is only positive semi-definite (PSD), there are infinitely
many solutions. (There always exists a solution to the equation $\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$, because the range
of $\mathbf{X}^\top\mathbf{X}$ and the range of $\mathbf{X}^\top$ are equal; since $\mathbf{X}^\top\mathbf{y}$ lies in the range of $\mathbf{X}^\top$, it must also lie in
the range of $\mathbf{X}^\top\mathbf{X}$, and therefore there always exists a $\mathbf{w}$ that satisfies $\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$.)
The technique we just described is known as ridge regression. Note that now the expression
$\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ is invertible, regardless of the rank of $\mathbf{X}$. Let’s find $(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}$ through the SVD
of $\mathbf{X}$, where $\boldsymbol{\Sigma}_r \in \mathbb{R}^{r \times r}$ holds the $r = \text{rank}(\mathbf{X})$ nonzero singular values:
!−1
Σr 0 > Σr 0 >
(X>X + λI)−1 = V UU V + λI
0 0 0 0
2 !−1
Σr 0 >
= V V + λI
0 0
2 !−1
Σr 0 > >
= V V + V(λI)V
0 0
2 ! −1
Σr 0
= V + λI V>
0 0
2 !−1
Σr + λI 0
= V V>
0 λI
2 −1
> −1 Σr + λI 0
= (V ) V−1
0 λI
(Σr + λI)−1 0
2
=V 1 V>
0 λI
Now with our slight tweak, the matrix $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ has become full rank and thus invertible. The
singular values of its inverse are $\frac{1}{\sigma_i^2 + \lambda}$ and $\frac{1}{\lambda}$, meaning that they are guaranteed to be
at most $\frac{1}{\lambda}$, solving our numerical instability issues. Furthermore, we have partially solved the
overfitting issue. By penalizing the norm of $\mathbf{w}$, we favor the weights corresponding to relevant
features that capture the main structure of the true model, and penalize the weights corresponding
to complex features that only serve to fine-tune the model and fit noise in the data.
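To see the bound in action, the following sketch uses a deliberately rank-deficient design matrix (one column is an exact multiple of another, an assumption made for this example) and checks that the regularized matrix is still invertible, with the singular values of its inverse capped at 1/λ.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=10)
# The second column is exactly twice the first, so X has rank 2 rather than 3.
X = np.column_stack([x1, 2 * x1, rng.normal(size=10)])
rank_X = np.linalg.matrix_rank(X)

lam = 0.1                                  # illustrative regularization weight
A = X.T @ X + lam * np.eye(3)              # invertible even though X^T X is singular
sv_inv = np.linalg.svd(np.linalg.inv(A), compute_uv=False)

# The zero singular direction of X contributes exactly 1 / lam, the largest value.
print(rank_X, sv_inv, 1.0 / lam)
```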
1.3 Feature Engineering
The least-squares problem $\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$
represents the “best-fit” linear model, by projecting $\mathbf{y}$ onto the subspace spanned by the columns
of $\mathbf{X}$. However, the true input-output relationship $y = f(\mathbf{x})$ may be nonlinear, so it is useful to
consider nonlinear models as well. It turns out that we can still do this under the framework of
linear least-squares, by augmenting the data with new features. In particular, we devise some
function $\boldsymbol{\phi} : \mathbb{R}^\ell \to \mathbb{R}^d$, called a feature map, that maps each raw data point $\mathbf{x} \in \mathbb{R}^\ell$ into a vector
of features $\boldsymbol{\phi}(\mathbf{x})$. The hypothesis function then reads
$$h_{\mathbf{w}}(\mathbf{x}) = \sum_{j=1}^d w_j\phi_j(\mathbf{x}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x})$$
Note that the resulting model is still linear with respect to the features, but it is nonlinear with
respect to the original data if $\boldsymbol{\phi}$ is nonlinear. The component functions $\phi_j$ are sometimes called
basis functions because our hypothesis is a linear combination of them. In the simplest case, we
could just use the components of $\mathbf{x}$ as features (i.e. $\phi_j(\mathbf{x}) = x_j$), but in general it is helpful to
distinguish the features of an example from the example’s raw entries.
We can then use least-squares to estimate the weights $\mathbf{w}$, just as before. To do this, we replace the
original data matrix $\mathbf{X} \in \mathbb{R}^{n \times \ell}$ by $\boldsymbol{\Phi} \in \mathbb{R}^{n \times d}$, which has $\boldsymbol{\phi}(\mathbf{x}_i)^\top$ as its $i$th row:
$$\min_{\mathbf{w}} \|\boldsymbol{\Phi}\mathbf{w} - \mathbf{y}\|_2^2$$
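where
$$
\boldsymbol{\Phi} = \begin{bmatrix}
x_{1,1}^2 & x_{2,1}^2 & x_{1,1}x_{2,1} & x_{1,1} & x_{2,1} \\
x_{1,2}^2 & x_{2,2}^2 & x_{1,2}x_{2,2} & x_{1,2} & x_{2,2} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
x_{1,n}^2 & x_{2,n}^2 & x_{1,n}x_{2,n} & x_{1,n} & x_{2,n}
\end{bmatrix}
$$
In this case, the feature map $\boldsymbol{\phi}$ is given by
$$\boldsymbol{\phi}(\mathbf{x}) = (x_1^2, x_2^2, x_1x_2, x_1, x_2)$$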
Note that there is no “target” vector y here, so this is not a traditional regression problem, but it
still fits into the framework of least-squares.
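As a hedged illustration of featurization in an ordinary regression setting (that is, with a target vector, unlike the example just discussed), the sketch below applies the same quadratic feature map to synthetic two-dimensional inputs and fits the weights by least squares. The data, the weights `w_true`, and the function name `phi` are assumptions of this example.

```python
import numpy as np

def phi(x):
    """Quadratic feature map (x1^2, x2^2, x1*x2, x1, x2) for a 2-D input."""
    x1, x2 = x
    return np.array([x1**2, x2**2, x1 * x2, x1, x2])

rng = np.random.default_rng(5)
X_raw = rng.uniform(-1, 1, size=(40, 2))          # raw inputs, one per row
Phi = np.array([phi(x) for x in X_raw])           # featurized matrix, shape (40, 5)

w_true = np.array([1.0, -1.0, 0.5, 2.0, 0.0])     # hypothetical weights
y = Phi @ w_true                                  # noiseless targets, quadratic in x

# The model is nonlinear in the raw inputs but linear in the features,
# so plain least squares on Phi recovers the weights.
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)
```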
Polynomial Features
The example above demonstrates an important class of features known as polynomial features.
Remember that a polynomial is a linear combination of monomial basis terms. Monomials can be
classified in two ways, by their degree and dimension:
Dimension \ Degree | 0 | 1 | 2 | 3 | ...
1 (univariate) | $1$ | $x$ | $x^2$ | $x^3$ | ...
2 (bivariate) | $1$ | $x_1, x_2$ | $x_1^2, x_2^2, x_1x_2$ | $x_1^3, x_2^3, x_1^2x_2, x_1x_2^2$ | ...
... | ... | ... | ... | ... | ...
A big reason we care about polynomial features is that any smooth function (on a closed and
bounded domain) can be approximated arbitrarily closely by some polynomial. For this reason,
polynomials are said to be universal approximators.
One downside of polynomials is that as their degree increases, their number of terms increases
rapidly. Specifically, one can use a “stars and bars” style combinatorial argument to show that a
polynomial of degree $d$ in $\ell$ variables has
$$\binom{\ell + d}{\ell} = \frac{(\ell + d)!}{\ell!\,d!}$$
terms. To get an idea for how quickly this quantity grows, consider a few examples:
ℓ \ d    1     3      5        10          25
1        2     4      6        11          26
3        4     20     56       286         3276
5        6     56     252      3003        142506
10       11    286    3003     184756      183579396
25       26    3276   142506   183579396   126410606437752
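The table can be regenerated with a few lines of standard-library Python. This is only a sketch; the function name `num_monomials` is made up for this example.

```python
from math import comb

def num_monomials(num_vars, degree):
    """Number of terms in a polynomial of the given degree in num_vars variables."""
    return comb(num_vars + degree, num_vars)

values = [1, 3, 5, 10, 25]
# Rows are indexed by the number of variables, columns by the degree.
table = {l: [num_monomials(l, d) for d in values] for l in values}
for l in values:
    print(l, table[l])
```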
Later we will learn about the kernel trick, a clever mathematical method that allows us to
circumvent this rapidly growing cost in certain cases.
1.4 Hyperparameters and Validation
Observe that the model order $d$ is not one of the decision variables being optimized when we fit to
the data. For this reason d is called a hyperparameter. We might say more specifically that it is
a model hyperparameter, since it determines the structure of the model.
For another example, recall ridge regression, in which we add an $\ell_2$ penalty on the parameters
$\mathbf{w}$:
$$\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$
The regularization weight λ is also a hyperparameter, as it is fixed during the minimization above.
However λ, unlike the previously discussed hyperparameter d, is not a part of the model. Rather,
it is an aspect of the optimization procedure used to fit the model, so we say it is an optimization
hyperparameter. Hyperparameters tend to fall into one of these two categories.
Since hyperparameters are not determined by the data-fitting optimization procedure, how should
we choose their values? A suitable answer to this question requires some discussion of the different
types of error at play.
Types of Error
We have seen that it is common to minimize some measure of how poorly our hypothesis fits the
data we have, but what we actually care about is how well the hypothesis predicts future data.
Let us try to formally distinguish the various types of error. Assume that the data are distributed
according to some (unknown) distribution $\mathcal{D}$, and that we have a loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$,
which measures the error between the true output $y$ and our estimate $\hat{y} = h(\mathbf{x})$. The risk (or
true error) of a particular hypothesis $h \in \mathcal{H}$ is the expected loss over the whole data distribution:
$$R(h) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}[\ell(h(\mathbf{x}), y)]$$
Ideally, we would find the hypothesis that minimizes the risk, i.e.
$$h^* = \arg\min_{h \in \mathcal{H}} R(h)$$
However, computing this expectation is impossible because we do not have access to the true data
distribution. Rather, we have access to samples $(\mathbf{x}_i, y_i) \overset{\text{iid}}{\sim} \mathcal{D}$. These enable us to approximate the
real problem we care about by minimizing the empirical risk (or training error)
$$\hat{R}_{\text{train}}(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(\mathbf{x}_i), y_i)$$
But since we have a finite number of samples, the hypothesis that performs the best on the training
data is not necessarily the best on the whole data distribution. In particular, if we both train and
evaluate the hypothesis using the same data points, the training error will be a very biased estimate
of the true error, since the hypothesis has been chosen specifically to perform well on those points.
This phenomenon is sometimes referred to as “data incest”.
A common solution is to set aside some portion (say 30%) of the data, to be called the validation
set, which is disjoint from the training set and not allowed to be used when fitting the model:
[Figure: the dataset partitioned into a validation block and a training block]
We can use this validation set to estimate the true error by the validation error
$$\hat{R}_{\text{val}}(h) = \frac{1}{m}\sum_{i=1}^m \ell(h(\mathbf{x}_i^{\text{val}}), y_i^{\text{val}})$$
With this estimate, we have a simple method for choosing hyperparameter values: try a bunch of
configurations of the hyperparameters and choose the one that yields the lowest validation error.
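The selection procedure just described can be sketched for the ridge regularization weight. Everything specific here (the synthetic data, the 30% split, the candidate grid, and the helper names `ridge` and `mse`) is an assumption made for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution of (X^T X + lam * I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    """Mean squared error of the linear predictor w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(6)
n, d = 100, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)   # synthetic noisy targets

# Hold out 30% of the (shuffled) data as a validation set.
perm = rng.permutation(n)
n_val = int(0.3 * n)
val_idx, train_idx = perm[:n_val], perm[n_val:]

# Fit on the training set only, score each candidate on the validation set.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {
    lam: mse(X[val_idx], y[val_idx], ridge(X[train_idx], y[train_idx], lam))
    for lam in candidates
}
best_lam = min(val_errors, key=val_errors.get)
print(best_lam, val_errors[best_lam])
```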
Note that as we add more features to a linear model, training error can only decrease. This is
because the optimizer can set $w_i = 0$ if feature $i$ cannot be used to reduce training error.
[Figure: training error plotted against model order]
Adding more features tends to reduce true error as long as the additional features are useful
predictors of the output. However, if we keep adding features, these begin to fit noise in the
training data instead of the true signal, causing true error to actually increase. This phenomenon
is known as overfitting.
[Figure: true error plotted against model order]
The validation error tracks the true error reasonably well as long as the validation set is sufficiently
large. The regularization hyperparameter λ has a somewhat different effect on training error.
Observe that if λ = 0, we recover the exact OLS problem, which is directly minimizing the training
error. As λ increases, the optimizer places less emphasis on the training error and more emphasis
on reducing the magnitude of the parameters. This leads to a degradation in training error as λ
grows:
[Figure: training error plotted against regularization weight]
Cross-validation
Setting aside a validation set works well, but comes at a cost, since we cannot use the validation
data for training. Since having more data generally improves the quality of the trained model,
we may prefer not to let that data go to waste, especially if we have little data to begin with
and/or collecting more data is expensive. Cross-validation is an alternative to having a dedicated
validation set.
k-fold cross-validation works as follows:
1. Shuffle the data and partition it into k equally-sized (or as equal as possible) blocks.
2. For i = 1, . . . , k,
• Train the model on all the data except block i.
• Evaluate the model (i.e. compute the validation error) using block i.
[Figure: the data split into blocks 1, 2, ..., k, with one block marked “validate” and the remaining blocks marked “train”]
3. Average the k validation errors; this is our final estimate of the true error.
Observe that, although every datapoint is used for evaluation at some time or another, the model
is always evaluated on a different set of points than it was trained on, thereby cleverly avoiding the
“data incest” problem mentioned earlier.
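The three steps of k-fold cross-validation can be sketched as follows. This is a minimal version under stated assumptions: the model is plain OLS, the loss is mean squared error, the data are synthetic, and the helper names (`k_fold_error`, `ols`, `mse`) are invented for this example.

```python
import numpy as np

def k_fold_error(X, y, k, fit, loss, rng):
    """Estimate the true error by k-fold cross-validation."""
    n = X.shape[0]
    # Step 1: shuffle the indices and partition them into k nearly equal blocks.
    folds = np.array_split(rng.permutation(n), k)
    errors = []
    # Step 2: train on everything except block i, then evaluate on block i.
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit(X[train_idx], y[train_idx])
        errors.append(loss(X[val_idx], y[val_idx], w))
    # Step 3: average the k validation errors.
    return float(np.mean(errors))

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=60)   # synthetic data
cv_error = k_fold_error(X, y, k=5, fit=ols, loss=mse, rng=rng)
print(cv_error)
```

To compare hyperparameter configurations, one would call `k_fold_error` once per configuration, ideally reusing the same partition.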
Note also that this process (except for the shuffling and partitioning) must be repeated for every
hyperparameter configuration we wish to test. This is the principal drawback of k-fold
cross-validation as compared to using a held-out validation set: roughly k times as much
computation is required. This is not a big deal for the relatively small linear models that we’ve seen
so far, but it can be prohibitively expensive when the model takes a long time to train, as is the
case in the Big Data regime or when using neural networks.
Chapter 2
Regression II
2.1 MLE and MAP for Regression (Part I)
So far, we’ve explored two approaches to the regression framework, Ordinary Least Squares and
Ridge Regression:
$$\hat{\mathbf{w}}_{\text{ols}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$$
$$\hat{\mathbf{w}}_{\text{ridge}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$
One question that arises is why we specifically use the $\ell_2$ norm to measure the error of our
predictions, and to penalize the model parameters. We will justify this design choice by exploring the
statistical interpretations of regression. Namely, we will employ Gaussians, MLE and MAP to
validate what we’ve done so far through a different lens.
Probabilistic Model
In the context of supervised learning, we assume that there exists a true underlying model
mapping inputs to outputs:
$$f : \mathbf{x} \mapsto f(\mathbf{x})$$
The true model is unknown to us, and our goal is to find a hypothesis model that best represents
the true model. The only information that we have about the true model is via a dataset
$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$$
where $\mathbf{x}_i \in \mathbb{R}^d$ is the input and $y_i \in \mathbb{R}$ is the observation, a noisy version of the true output $f(\mathbf{x}_i)$:
$$Y_i = f(\mathbf{x}_i) + Z_i$$
We assume that $\mathbf{x}_i$ is a fixed value (which implies that $f(\mathbf{x}_i)$ is fixed as well), while $Z_i$ is a random
variable (which implies that $Y_i$ is a random variable as well). We always assume that $Z_i$ has zero
mean, because otherwise there would be systematic bias in our observations. The $Z_i$’s could be
Gaussian, uniform, Laplacian, etc. In most contexts, we assume that they are independent
identically distributed (i.i.d.) Gaussians: $Z_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. We can therefore say that $Y_i$ is a
random variable whose probability distribution is given by
$$Y_i \sim \mathcal{N}(f(\mathbf{x}_i), \sigma^2)$$
with the $Y_i$’s independent of one another (though not identically distributed, since their means
$f(\mathbf{x}_i)$ differ).
Now that we have defined the model and data, we wish to find a hypothesis model hθ (parameterized
by θ) that best captures the relationships in the data, while possibly taking into account prior beliefs
that we have about the true model. We can represent this as a probability problem, where the goal
is to find the optimal model that maximizes our probability.
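In Maximum Likelihood Estimation (MLE), the goal is to find the hypothesis model that
maximizes the probability of the data. If we parameterize the set of hypothesis models with $\boldsymbol{\theta}$, we
can express the problem as
$$\hat{\boldsymbol{\theta}}_{\text{mle}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta}; \mathcal{D}) = \arg\max_{\boldsymbol{\theta}} p(\text{data} = \mathcal{D} \mid \text{true model} = h_{\boldsymbol{\theta}})$$
The quantity $L(\boldsymbol{\theta})$ that we are maximizing is also known as the likelihood, hence the term MLE.
Substituting our representation of $\mathcal{D}$ we have
$$\hat{\boldsymbol{\theta}}_{\text{mle}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) = \arg\max_{\boldsymbol{\theta}} p(y_1, \dots, y_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n, \boldsymbol{\theta})$$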
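Note that we implicitly condition on the $\mathbf{x}_i$’s, because we treat them as fixed values of the data. The
only randomness in our data comes from the $y_i$’s (since they are noisy versions of the true values
$f(\mathbf{x}_i)$). We can further simplify the problem by working with the log likelihood
$\ell(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) = \log L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})$:
$$\hat{\boldsymbol{\theta}}_{\text{mle}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})$$
With logs we are still working with the same problem, because logarithms are monotonically
increasing functions. In other words, for any two candidates $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ we have that:
$$L(\boldsymbol{\theta}_1; \mathbf{X}, \mathbf{y}) < L(\boldsymbol{\theta}_2; \mathbf{X}, \mathbf{y}) \iff \ell(\boldsymbol{\theta}_1; \mathbf{X}, \mathbf{y}) < \ell(\boldsymbol{\theta}_2; \mathbf{X}, \mathbf{y})$$
Let’s decompose the log likelihood:
$$\ell(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) = \log p(y_1, \dots, y_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n, \boldsymbol{\theta}) = \log \prod_{i=1}^n p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) = \sum_{i=1}^n \log p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})$$
We decoupled the probabilities for each datapoint because their corresponding noise components
are independent. Note that the logs allow us to work with sums rather than products, simplifying
the problem. This is one reason why the log likelihood is such a powerful tool. Each individual term
$p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})$ comes from a Gaussian
$$Y_i \mid \boldsymbol{\theta} \sim \mathcal{N}(h_{\boldsymbol{\theta}}(\mathbf{x}_i), \sigma^2)$$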
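Continuing with logs:
$$
\begin{aligned}
\hat{\boldsymbol{\theta}}_{\text{mle}} &= \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) && (2.1) \\
&= \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^n \log p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) && (2.2) \\
&= \arg\max_{\boldsymbol{\theta}} -\sum_{i=1}^n \frac{(y_i - h_{\boldsymbol{\theta}}(\mathbf{x}_i))^2}{2\sigma^2} - n\log(\sqrt{2\pi}\sigma) && (2.3) \\
&= \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^n \frac{(y_i - h_{\boldsymbol{\theta}}(\mathbf{x}_i))^2}{2\sigma^2} + n\log(\sqrt{2\pi}\sigma) && (2.4) \\
&= \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^n (y_i - h_{\boldsymbol{\theta}}(\mathbf{x}_i))^2 && (2.5)
\end{aligned}
$$
Note that in step (2.4) we turned the problem from a maximization problem to a minimization
problem by negating the objective. In step (2.5) we eliminated the second term and the denominator
in the first term, because they do not depend on the variables we are trying to optimize over.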
Now let’s look at the case of linear regression: our hypothesis has the form $h_{\boldsymbol{\theta}}(\mathbf{x}_i) = \mathbf{x}_i^\top\boldsymbol{\theta}$, where
$\boldsymbol{\theta} \in \mathbb{R}^d$ and $d$ is the number of dimensions of our featurized datapoints. For this specific setting,
the problem becomes:
$$\hat{\boldsymbol{\theta}}_{\text{mle}} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \sum_{i=1}^n (y_i - \mathbf{x}_i^\top\boldsymbol{\theta})^2$$
This is just the Ordinary Least Squares (OLS) problem! We just showed that OLS and MLE for
regression lead to the same answer. We conclude that MLE under Gaussian noise provides a
probabilistic justification for using squared error (which is the basis of OLS) as a metric for
evaluating a regression model.
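The equivalence can be sanity-checked numerically. In the sketch below, the Gaussian negative log likelihood from (2.4) is evaluated at the OLS solution and at randomly perturbed parameter vectors; the data, the noise level `sigma`, and the perturbation scale are assumptions of this example.

```python
import numpy as np

def neg_log_likelihood(theta, X, y, sigma):
    """Gaussian negative log likelihood: sum of r_i^2 / (2 sigma^2) + n log(sqrt(2 pi) sigma)."""
    resid = y - X @ theta
    n = len(y)
    return np.sum(resid**2) / (2 * sigma**2) + n * np.log(np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(8)
n, d, sigma = 50, 3, 0.5
X = rng.normal(size=(n, d))
y = X @ np.array([0.5, 1.0, -1.0]) + sigma * rng.normal(size=n)   # synthetic data

theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
nll_ols = neg_log_likelihood(theta_ols, X, y, sigma)

# No perturbed parameter vector should achieve a lower negative log likelihood.
nll_perturbed = [
    neg_log_likelihood(theta_ols + 0.1 * rng.normal(size=d), X, y, sigma)
    for _ in range(100)
]
print(nll_ols, min(nll_perturbed))
```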
Maximum a Posteriori
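In Maximum a Posteriori (MAP) Estimation, the goal is to find the model that is most probable
given the data:
$$\hat{\boldsymbol{\theta}}_{\text{map}} = \arg\max_{\boldsymbol{\theta}} p(\text{true model} = h_{\boldsymbol{\theta}} \mid \text{data} = \mathcal{D})$$
The probability distribution that we are maximizing is known as the posterior. Maximizing this
term directly is often infeasible, so we use Bayes’ Rule to re-express the objective:
$$
\begin{aligned}
\hat{\boldsymbol{\theta}}_{\text{map}} &= \arg\max_{\boldsymbol{\theta}} \frac{p(\text{data} = \mathcal{D} \mid \text{true model} = h_{\boldsymbol{\theta}}) \cdot p(\text{true model} = h_{\boldsymbol{\theta}})}{p(\text{data} = \mathcal{D})} \\
&= \arg\max_{\boldsymbol{\theta}} p(\text{data} = \mathcal{D} \mid \text{true model} = h_{\boldsymbol{\theta}}) \cdot p(\text{true model} = h_{\boldsymbol{\theta}}) \\
&= \arg\max_{\boldsymbol{\theta}} \log p(\text{data} = \mathcal{D} \mid \text{true model} = h_{\boldsymbol{\theta}}) + \log p(\text{true model} = h_{\boldsymbol{\theta}}) \\
&= \arg\min_{\boldsymbol{\theta}} -\log p(\text{data} = \mathcal{D} \mid \text{true model} = h_{\boldsymbol{\theta}}) - \log p(\text{true model} = h_{\boldsymbol{\theta}})
\end{aligned}
$$
We treat $p(\text{data} = \mathcal{D})$ as a constant value because it does not depend on the variables we are
optimizing over. Notice that MAP is just like MLE, except we add a term $p(\text{true model} = h_{\boldsymbol{\theta}})$ to
our objective. This term is the prior over our true model. Adding the prior has the effect of favoring
certain models over others a priori, regardless of the dataset. Note that MLE is a special case of
MAP, when the prior does not treat any model more favorably than other models. Concretely, we
have that
$$\hat{\boldsymbol{\theta}}_{\text{map}} = \arg\min_{\boldsymbol{\theta}} -\sum_{i=1}^n \log[p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})] - \log[p(\boldsymbol{\theta})]$$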
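Again, just as in MLE, notice that we implicitly condition on the $\mathbf{x}_i$’s because we treat them as
constants. Also, let us assume as before that the noise terms are i.i.d. Gaussians: $Z_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$.
For the prior term $p(\boldsymbol{\theta})$, we assume that the components $\theta_j$ are independent Gaussians:
$$\theta_j \sim \mathcal{N}(\theta_{j0}, \sigma_h^2)$$
Substituting the Gaussian densities and dropping the terms that do not depend on $\boldsymbol{\theta}$, we obtain
$$
\begin{aligned}
\hat{\boldsymbol{\theta}}_{\text{map}} &= \arg\min_{\boldsymbol{\theta}} \left(\frac{\sum_{i=1}^n (y_i - h_{\boldsymbol{\theta}}(\mathbf{x}_i))^2}{2\sigma^2} + \frac{\sum_{j=1}^d (\theta_j - \theta_{j0})^2}{2\sigma_h^2}\right) \\
&= \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^n (y_i - h_{\boldsymbol{\theta}}(\mathbf{x}_i))^2 + \frac{\sigma^2}{\sigma_h^2}\sum_{j=1}^d (\theta_j - \theta_{j0})^2
\end{aligned}
$$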
Let’s look again at the case of linear regression to illustrate the effect of the prior term when
$\theta_{j0} = 0$. In this context, we use the linear hypothesis function $h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^\top\mathbf{x}$.
$$\hat{\boldsymbol{\theta}}_{\text{map}} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \sum_{i=1}^n (y_i - \mathbf{x}_i^\top\boldsymbol{\theta})^2 + \frac{\sigma^2}{\sigma_h^2}\sum_{j=1}^d \theta_j^2$$
This is just the Ridge Regression problem! We just showed that Ridge Regression and MAP for
regression lead to the same answer. We can simply set $\lambda = \frac{\sigma^2}{\sigma_h^2}$. We conclude that MAP provides a
probabilistic justification for adding the ridge penalty term in Ridge Regression.
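The correspondence between the prior variance and the ridge weight can be verified on a toy problem. The sketch below, with assumed values for the noise and prior standard deviations, computes the ridge solution with λ = σ²/σ_h² and checks that the gradient of the negative log posterior vanishes there.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 40, 4
sigma, sigma_h = 0.5, 2.0                     # noise std and prior std (illustrative values)
X = rng.normal(size=(n, d))
theta_true = sigma_h * rng.normal(size=d)     # drawn from the zero-mean Gaussian prior
y = X @ theta_true + sigma * rng.normal(size=n)

# Ridge regression with lambda = sigma^2 / sigma_h^2.
lam = sigma**2 / sigma_h**2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient of the negative log posterior
#   sum_i (y_i - x_i^T theta)^2 / (2 sigma^2) + sum_j theta_j^2 / (2 sigma_h^2)
# evaluated at the ridge solution; it should be (numerically) zero.
grad = -X.T @ (y - X @ theta_ridge) / sigma**2 + theta_ridge / sigma_h**2
print(lam, np.abs(grad).max())
```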
Based on our analysis of Ordinary Least Squares Regression and Ridge Regression, we should
expect to see MAP perform better than MLE. But is that always the case? Let us visit a simple
2D problem where
we would like to know what parameters MLE and MAP will select, after providing them with some
dataset D. Let’s start with MLE:
The diagram above shows the contours of the likelihood distribution in model space. The gray
dot represents the true underlying model. MLE chooses the point that maximizes the likelihood,
which is indicated by the green dot. As we can see, MLE chooses a reasonable hypothesis, but
this hypothesis lies in a region of high variance, which indicates a high level of uncertainty in the
predicted model. A slightly different dataset could significantly alter the predicted model.
Now, let’s take a look at the hypothesis model from MAP. One question that arises is where the
prior should be centered and what its variance should be. This depends on our belief of what the
true underlying model is. If we have reason to believe that the model weights should all be small,
then the prior should be centered at zero with a small variance. Let’s look at MAP for a prior that
is centered at zero:
For reference, we have marked the MLE estimate from before as the green point and the true
model as the gray point. The prior distribution is indicated by the diagram on the left, and
the posterior distribution is indicated by the diagram on the right. MAP chooses the point that
maximizes the posterior probability, which is approximately (0.70, 0.25). Using a prior centered
at zero skews our estimate of the model weights toward the origin, leading to a less
accurate hypothesis than MLE. However, the posterior has significantly less variance, meaning that
the point that MAP chooses is less likely to overfit to the noise in the dataset.
Let’s say in our case that we have reason to believe that both model weights should be centered
around the 0.5 to 1 range.
Our prediction is now close to that of MLE, with the added benefit that there is significantly less
variance. However, if we believe the model weights should be centered around the -0.5 to -1 range,
we would make a much poorer prediction than MLE.
As always, in order to compare our beliefs to see which prior works best in practice, we should use
cross-validation!
2.2 Bias-Variance Tradeoff
Recall from our previous discussion on supervised learning that for a fixed input $\mathbf{x}$ the
corresponding measurement $Y$ is a noisy measurement of the true underlying response $f(\mathbf{x})$:
$$Y = f(\mathbf{x}) + Z$$
where $Z$ is a zero-mean random variable. Recall also that we fit our hypothesis to a dataset
$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$$
In that context, we treated the $\mathbf{x}_i$’s in our dataset $\mathcal{D}$ as fixed values. In this case however, we treat
the $\mathbf{x}_i$’s as values sampled from random variables $X_i$. That is, $\mathcal{D}$ is a random variable, consisting
of random variables $X_i$ and $Y_i$. For some arbitrary test input $\mathbf{x}$, $h(\mathbf{x}; \mathcal{D})$ depends on the random
variable $\mathcal{D}$ that was used to train $h$. Since $\mathcal{D}$ is random, we will have a slightly different hypothesis
model $h(\mathbf{x}; \mathcal{D})$ every time we use a new dataset. Note that $\mathbf{x}$ and $\mathcal{D}$ are completely independent
from one another: $\mathbf{x}$ is a test point, while $\mathcal{D}$ consists of the training data.
Metric
Our objective is to, for a fixed test point x, evaluate how closely the hypothesis can estimate the
noisy observation Y corresponding to x. Note that we have denoted x here as a lowercase letter
because we are treating it as a fixed constant, while we have denoted the Y and D as uppercase
letters because we are treating them as random variables. We also treat Y and D as independent random
variables, because our x and Y have no relation to the set of Xi ’s and Yi ’s in D. Again, we can
view D as the training data, and (x, Y ) as a test point — the test point x is probably not even in
the training set D! Mathematically, we express our metric as the expected squared error between
the hypothesis and the observation Y = f (x) + Z:
\[ \varepsilon(x; h) = \mathbb{E}\big[(h(x; D) - Y)^2\big] \]
Note that the error is w.r.t the observation Y and not the true underlying model f (x), because we
do not know the true model and only have access to the noisy observations from the true model.
Bias-Variance Decomposition
The error metric is difficult to interpret and work with, so let’s try to decompose it into parts that
are easier to understand. Before we start, let’s find the expectation and variance of Y :
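The displayed equations for this step are missing from this copy. What follows is a reconstruction, not the original derivation. It assumes that the noise Z has zero mean and is independent of D, which is consistent with the last lines of the alternative derivation later in this section. Here h abbreviates h(x; D).

```latex
\begin{align*}
\mathbb{E}[Y] &= \mathbb{E}[f(x) + Z] = f(x) + \mathbb{E}[Z] = f(x) \\
\operatorname{Var}(Y) &= \operatorname{Var}(f(x) + Z) = \operatorname{Var}(Z)
\end{align*}

\begin{align*}
\varepsilon(x; h) &= \mathbb{E}[(h - Y)^2]
   = \mathbb{E}[h^2] + \mathbb{E}[Y^2] - 2\,\mathbb{E}[h \cdot Y] \\
&= \big(\operatorname{Var}(h) + \mathbb{E}[h]^2\big)
   + \big(\operatorname{Var}(Y) + \mathbb{E}[Y]^2\big)
   - 2\,\mathbb{E}[h]\,\mathbb{E}[Y] \\
&= \big(\mathbb{E}[h] - \mathbb{E}[Y]\big)^2 + \operatorname{Var}(h) + \operatorname{Var}(Y) \\
&= \big(\mathbb{E}[h] - f(x)\big)^2 + \operatorname{Var}(h) + \operatorname{Var}(Z)
\end{align*}
```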
Recall that for any two independent random variables D and Y , g1 (D) and g2 (Y ) are also in-
dependent, for any functions g1 , g2 . This implies that h(x; D) and Y are independent, allowing
us to express E[h(x; D) · Y ] = E[h(x; D)] · E[Y ] in the second line of the derivation. The final
decomposition, also known as the bias-variance decomposition, consists of three terms:
• Bias² of method: Measures how well the average hypothesis (over all possible training sets)
can come close to the true underlying value f (x), for a fixed value of x. A low bias means
that on average the regressor h(x) accurately estimates f (x).
• Variance of method: Measures the variance of the hypothesis (over all possible training
sets), for a fixed value of x. A low variance means that the prediction does not change much
as the training set varies. An un-biased method (bias = 0) could have a large variance.
• Irreducible error: This is the error in our model that we cannot control or eliminate, because
it is due to errors inherent in our noisy observation Y .
The decomposition allows us to measure the error in terms of bias, variance, and irreducible error.
Irreducible error has no relation with the hypothesis model, so we can fully ignore it in theory when
minimizing the error. As we have discussed before, models that are very complex have very little
bias because on average they can fit the true underlying model value f (x) very well, but have very
high variance and are very far off from f (x) on an individual basis.
Note that the error above is only for a fixed input x, but in regression our goal is to minimize
the average error over all possible values of X. If we know the distribution for X, we can find the
effectiveness of a hypothesis model as a whole by taking an expectation of the error over all possible
values of x: EX [ε(x; h)].
Alternative Decomposition
The previous derivation is short, but may seem somewhat arbitrary. Let’s explore an alternative
derivation. At its core, it uses the technique that $E[(Z - Y)^2] = E[((Z - E[Z]) + (E[Z] - Y))^2]$
which decomposes to easily give us the variance of Z and other terms.
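\[
\begin{aligned}
\varepsilon(x; h) &= \mathbb{E}\big[(h(x; D) - Y)^2\big] \\
&= \mathbb{E}\Big[\big((h(x; D) - \mathbb{E}[h(x; D)]) + (\mathbb{E}[h(x; D)] - Y)\big)^2\Big] \\
&= \mathbb{E}\big[(h(x; D) - \mathbb{E}[h(x; D)])^2\big] + \mathbb{E}\big[(\mathbb{E}[h(x; D)] - Y)^2\big] + 2\,\mathbb{E}\big[(h(x; D) - \mathbb{E}[h(x; D)]) \cdot (\mathbb{E}[h(x; D)] - Y)\big] \\
&= \mathbb{E}\big[(h(x; D) - \mathbb{E}[h(x; D)])^2\big] + \mathbb{E}\big[(\mathbb{E}[h(x; D)] - Y)^2\big] + 2\,\underbrace{\mathbb{E}\big[h(x; D) - \mathbb{E}[h(x; D)]\big]}_{=\,0} \cdot\, \mathbb{E}\big[\mathbb{E}[h(x; D)] - Y\big] \\
&= \mathbb{E}\big[(h(x; D) - \mathbb{E}[h(x; D)])^2\big] + \mathbb{E}\big[(\mathbb{E}[h(x; D)] - Y)^2\big] \\
&= \operatorname{Var}(h(x; D)) + \mathbb{E}\big[(\mathbb{E}[h(x; D)] - Y)^2\big] \\
&= \operatorname{Var}(h(x; D)) + \mathbb{E}\Big[\big((\mathbb{E}[h(x; D)] - \mathbb{E}[Y]) + (\mathbb{E}[Y] - Y)\big)^2\Big] \\
&= \operatorname{Var}(h(x; D)) + \mathbb{E}\big[(\mathbb{E}[h(x; D)] - \mathbb{E}[Y])^2\big] + \mathbb{E}\big[(Y - \mathbb{E}[Y])^2\big] + 2\,\big(\mathbb{E}[h(x; D)] - \mathbb{E}[Y]\big) \cdot \underbrace{\mathbb{E}\big[\mathbb{E}[Y] - Y\big]}_{=\,0} \\
&= \operatorname{Var}(h(x; D)) + \mathbb{E}\big[(\mathbb{E}[h(x; D)] - \mathbb{E}[Y])^2\big] + \mathbb{E}\big[(Y - \mathbb{E}[Y])^2\big] \\
&= \operatorname{Var}(h(x; D)) + \big(\mathbb{E}[h(x; D)] - \mathbb{E}[Y]\big)^2 + \operatorname{Var}(Y) \\
&= \operatorname{Var}(h(x; D)) + \big(\mathbb{E}[h(x; D)] - f(x)\big)^2 + \operatorname{Var}(Z) \\
&= \underbrace{\big(\mathbb{E}[h(x; D)] - f(x)\big)^2}_{\text{bias}^2 \text{ of method}} + \underbrace{\operatorname{Var}(h(x; D))}_{\text{variance of method}} + \underbrace{\operatorname{Var}(Z)}_{\text{irreducible error}}
\end{aligned}
\]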
Experiments
Let’s confirm the theory behind the bias-variance decomposition with an empirical experiment that
measures the bias and variance for polynomial regression with 0 degree, 1st degree, and 2nd degree
polynomials. In our experiment, we will repeatedly fit our hypothesis model to a random training
set. We then find the expectation and variance of the fitted models generated from these training
sets.
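The experiment described above can be sketched in a few lines of NumPy. This is an illustrative simulation, not the code behind the figures in these notes: the true model f(x) = x, the noise level, the input range [-1, 1], and the number of trials are all assumptions chosen for the sketch.

```python
import numpy as np

def true_model(x):
    # Assumed true model (linear); the figures in the notes may use a different one.
    return x

def simulate(degree, n_trials=2000, n_train=10, noise_std=0.5, seed=0):
    """Fit a polynomial of the given degree to many random training sets.

    Returns empirical bias^2, variance, and mean squared error against the
    noiseless model, each evaluated on a fixed grid of test inputs.
    """
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1.0, 1.0, 21)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x_train = rng.uniform(-1.0, 1.0, size=n_train)
        y_train = true_model(x_train) + rng.normal(0.0, noise_std, size=n_train)
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)                      # average hypothesis
    bias_sq = (mean_pred - true_model(x_test)) ** 2     # bias^2 at each x
    variance = preds.var(axis=0)                        # variance at each x
    mse_vs_f = ((preds - true_model(x_test)) ** 2).mean(axis=0)
    return bias_sq, variance, mse_vs_f

results = {deg: simulate(deg) for deg in (0, 1, 2)}
for deg, (bias_sq, variance, _) in results.items():
    print(f"degree {deg}: mean bias^2 = {bias_sq.mean():.4f}, "
          f"mean variance = {variance.mean():.4f}")
```

The error measured against the noiseless f(x) splits exactly into bias² plus variance; adding the noise variance gives the error against the noisy observation Y.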
Let’s first look at a 0 degree (constant) regression model. We repeatedly fit an optimal constant
line to a training set of 10 points. The true model is denoted by gray and the hypothesis is denoted
by red. Notice that at each time the red line is slightly different due to the different training set
used.
Let’s combine all of these hypotheses together into one picture to see the bias and variance of our
model.
On the top left diagram we see all of our hypotheses and all training sets used. The bottom left
diagram shows the average hypothesis in cyan. As we can see, this model has low bias for x’s in
the center of the graph, but very high bias for x’s that are away from the center of the graph. The
diagram in the bottom right shows that the variance of the hypotheses is quite high, for all values
of x.
Now let’s look at a 1st degree (linear) regression model.
The bias is now very low for all x’s. The variance is low for x’s in the middle of the graph,
but higher for x’s that are away from the center of the graph.
Finally, let’s look at a 2nd degree (quadratic) regression model.
The bias is still very low for all x’s. However, the variance is much higher for all values of x.
Let’s summarize our results. We find the bias and the variance empirically and graph them for all
values of x, as shown in the first two graphs. Finally, we take an expectation over the bias and
variance across all values of x.
The bias-variance decomposition confirms our understanding that the true model is linear. While
a quadratic model achieves the same theoretical bias as a linear model, it overfits to the data, as
indicated by its high variance. On the other hand a constant model underfits the data, as indicated
by its high bias. In the process of training our model, we can tell that a constant model is a poor
choice, because its high bias is reflected in poor training error. However we cannot tell that a
quadratic model is poor, because its high variance is not reflected in the training error. This is the
reason why we use validation data and cross-validation as a means to measure the performance of
our hypothesis model on unseen data.
Takeaways
So far in our discussion of MLE and MAP in regression, we considered a set of Gaussian random
variables Z1 , Z2 , . . . , Zk , which can represent anything from the noise in data to the parameters of a
model. One critical assumption we made is that these variables are independent and identically dis-
tributed. However, what about the case when these variables are dependent and/or non-identical?
For example, in time series data we have the relationship
\[ Z_{i+1} = r Z_i + U_i \]
where $U_i \overset{iid}{\sim} \mathcal{N}(0, 1)$ and $-1 \le r \le 1$ (so that it doesn’t blow up).
Here’s another example: consider the “sliding window” (like echo of audio)
\[ Z_i = \sum_j r_j U_{i-j} \]
where $U_i \overset{iid}{\sim} \mathcal{N}(0, 1)$.
In general, if we can represent the random vector Z = (Z1 , Z2 , . . . , Zk ) as
\[ Z = RU \]
where $Z \in \mathbb{R}^n$, $R \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^n$, and $U_i \overset{iid}{\sim} \mathcal{N}(0, 1)$, we refer to Z as a Jointly Gaussian
Random Vector. Our goal now is to derive its probability density formula.
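As a concrete instance, the time-series example can be written in the form Z = RU. The sketch below assumes the initial condition z[0] = u[0] and relabels the noise index so that z[i] = r·z[i−1] + u[i]; the helper name `ar1_matrix` is hypothetical.

```python
import numpy as np

def ar1_matrix(r, n):
    """Lower-triangular R with R[i, j] = r**(i - j) for j <= i and 0 otherwise."""
    idx = np.arange(n)
    powers = idx[:, None] - idx[None, :]
    return np.where(powers >= 0, float(r) ** np.clip(powers, 0, None), 0.0)

rng = np.random.default_rng(0)
n, r = 6, 0.8
u = rng.standard_normal(n)            # i.i.d. N(0, 1) noise

# Recursive definition of the time series (assumed start: z[0] = u[0]).
z_rec = np.empty(n)
z_rec[0] = u[0]
for i in range(1, n):
    z_rec[i] = r * z_rec[i - 1] + u[i]

# The same vector obtained as a single linear map of the noise.
R = ar1_matrix(r, n)
z_mat = R @ u
cov_z = R @ R.T                       # covariance of Z when U ~ N(0, I)
```

The entries of Z are neither independent nor identically distributed, yet the whole vector is a linear transformation of i.i.d. standard normals.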
Definition
There are three equivalent definitions of a jointly Gaussian (JG) random vector:
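1. A random vector $Z = (Z_1, Z_2, \ldots, Z_k)^\top$ is JG if there exist a vector $U = (U_1, U_2, \ldots, U_l)^\top$ of i.i.d. standard normal random variables, a matrix $R \in \mathbb{R}^{k \times l}$, and a mean vector $\mu \in \mathbb{R}^k$ such that $Z = RU + \mu$.
2. A random vector $Z = (Z_1, Z_2, \ldots, Z_k)^\top$ is JG if $\sum_{i=1}^{k} a_i Z_i$ is normally distributed for every
$a = (a_1, a_2, \ldots, a_k)^\top \in \mathbb{R}^k$.
3. (Non-degenerate case only) A random vector $Z = (Z_1, Z_2, \ldots, Z_k)^\top$ is JG if
\[ f_Z(z) = \frac{1}{\sqrt{|\det(\Sigma)|}} \frac{1}{(\sqrt{2\pi})^k} \, e^{-\frac{1}{2}(z - \mu)^\top \Sigma^{-1} (z - \mu)} \]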
Note that all of these conditions are equivalent. In this note we will start by showing a proof that
(1) =⇒ (3). We will leave it as an exercise to prove the rest of the implications needed to show
that the three conditions are in fact equivalent.
In the context of the noise problem we defined earlier, we are starting with condition (1), i.e.
Z = RU (in this case k = l = n), and we would like to derive the probability density of Z. Note
that here we removed the µ from consideration because in this setting we assume that
the noise has a mean of 0. We leave it as an exercise for the reader to prove the case for an arbitrary
µ.
We will first start by relating the probability density function of U to that of Z. Denote fU (u) as
the probability density for U = u, and similarly denote fZ (z) as the probability density for Z = z.
One may initially believe that fU (u) = fZ (Ru), but this is NOT true. Remember that since there
is a change of variables from U to Z, we must make sure to incorporate the change of variables
constant, which in this case is the absolute value of the determinant of R. Incorporating this
constant, we will have the correct formula:
\[ f_U(u) = f_Z(Ru)\,|\det(R)| \]
Let’s see why this is true, with a simple 2D geometric explanation. Define U space to be the 2D
space with axes U1 and U2 . Now take any arbitrary region R′ in U space (note that this R′ is
different from the matrix R that relates U to Z). As shown in the diagram below, we have some
off-centered circular region R′ and we would like to approximate the probability that U takes a
value in this region. We can do so by taking a Riemann sum of the density function fU (·) over
smaller and smaller squares that make up the region R′:
Now, let’s apply the linear transformation Z = RU, mapping the region R′ in U space to the
region T (R′) in Z space.
The graph on the right is now Z space, the 2D space with axes Z1 and Z2 . Assuming that the
matrix R is invertible, there is a one-to-one correspondence between points in U space to points in
Z space. As we can note in the diagram above, each unit square in U space maps to a parallelogram
in Z space (in higher dimensions, we would use the terms hypercube and parallelepiped). Recall
the relationship between each unit hypercube and the parallelepiped it maps to:
Area(parallelepiped) = | det(R)| · Area(hypercube)
In this 2D example, if we denote the area of each unit square as ∆u1 ∆u2 , and the area of each unit
parallelepiped as ∆A, we say that
∆A = | det(R)| · ∆u1 ∆u2
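The area-scaling relationship can be checked numerically. The sketch below maps the corners of the unit square through a hypothetical 2×2 matrix R and measures the area of the image with the shoelace formula; the matrix entries are arbitrary choices.

```python
import numpy as np

def polygon_area(vertices):
    """Shoelace formula for a simple polygon given as an (m, 2) array of vertices."""
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

R = np.array([[2.0, 1.0],
              [0.5, 1.5]])                         # hypothetical invertible matrix
unit_square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
parallelogram = unit_square @ R.T                  # apply z = R u to each corner

area_square = polygon_area(unit_square)
area_image = polygon_area(parallelogram)
det_R = abs(np.linalg.det(R))
```

Here `area_image` and `det_R` agree, which is the 2D case of the relationship stated above.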
Now let’s take a Riemann sum to find the probability that Z takes a value in T (R′):
\[
\begin{aligned}
P(Z \in T(R')) &= \iint_{T(R')} f_Z(z_1, z_2)\, dz_1\, dz_2 \\
&\approx \sum\sum_{T(R')} f_Z(z)\, \Delta A \\
&= \sum\sum_{R'} f_Z(Ru)\, |\det(R)|\, \Delta u_1 \Delta u_2
\end{aligned}
\]
Note the change of variables in the last step: we sum over the squares in U space, instead of
parallelograms in Z space.
So far, we have shown that (for any dimension n)
\[ P(U \in R') = \int \cdots \iint_{R'} f_U(u)\, du_1\, du_2 \ldots du_n \]
and
\[ P(Z \in T(R')) = \int \cdots \iint_{R'} f_Z(Ru)\, |\det(R)|\, du_1\, du_2 \ldots du_n \]
Notice that these two probabilities are equivalent! The probability that U takes a value in R′ must
equal the probability that the transformed random vector Z takes a value in the transformed region
T (R′).
Therefore, we can say that
\[
\begin{aligned}
P(U \in R') &= \int \cdots \iint_{R'} f_U(u)\, du_1\, du_2 \ldots du_n \\
&= \int \cdots \iint_{R'} f_Z(Ru)\, |\det(R)|\, du_1\, du_2 \ldots du_n \\
&= P(Z \in T(R'))
\end{aligned}
\]
We conclude that
fU (u) = fZ (Ru) | det(R)|
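Since the Ui ’s are i.i.d. standard normal, and $U = R^{-1} Z$, we can write the joint density function
of Z as
\[
\begin{aligned}
f_Z(z) &= \frac{1}{|\det(R)|} f_U(R^{-1} z) \\
&= \frac{1}{|\det(R)|} \prod_{i=1}^{n} f_{U_i}\big((R^{-1} z)_i\big) \\
&= \frac{1}{|\det(R)|} \frac{1}{(\sqrt{2\pi})^n} \, e^{-\frac{1}{2} (R^{-1} z)^\top (R^{-1} z)} \\
&= \frac{1}{|\det(R)|} \frac{1}{(\sqrt{2\pi})^n} \, e^{-\frac{1}{2} z^\top R^{-\top} R^{-1} z} \\
&= \frac{1}{|\det(R)|} \frac{1}{(\sqrt{2\pi})^n} \, e^{-\frac{1}{2} z^\top (R R^\top)^{-1} z}
\end{aligned}
\]
The covariance of Z is $\Sigma_Z = \mathbb{E}[Z Z^\top] = R\, \mathbb{E}[U U^\top] R^\top = R R^\top$, so $\det(\Sigma_Z) = \det(R)^2$ and $|\det(R)| = \sqrt{\det(\Sigma_Z)}$,
and therefore
\[ f_Z(z) = \frac{1}{\sqrt{\det(\Sigma_Z)}} \frac{1}{(\sqrt{2\pi})^n} \, e^{-\frac{1}{2} z^\top \Sigma_Z^{-1} z} \]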
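A quick numerical check of this result: the density obtained from the change of variables should match the density written in terms of the covariance R Rᵀ. The matrix R and the evaluation point z below are hypothetical values chosen for illustration.

```python
import numpy as np

def standard_normal_density(u):
    """Density of U ~ N(0, I) evaluated at the point u."""
    n = u.size
    return np.exp(-0.5 * (u @ u)) / (2.0 * np.pi) ** (n / 2.0)

def density_via_change_of_variables(z, R):
    """f_Z(z) = f_U(R^{-1} z) / |det(R)|."""
    u = np.linalg.solve(R, z)
    return standard_normal_density(u) / abs(np.linalg.det(R))

def density_via_covariance(z, Sigma):
    """Zero-mean Gaussian density written in terms of the covariance Sigma."""
    n = z.size
    quad = z @ np.linalg.solve(Sigma, z)
    norm = np.sqrt(np.linalg.det(Sigma)) * (2.0 * np.pi) ** (n / 2.0)
    return np.exp(-0.5 * quad) / norm

R = np.array([[1.0, 0.0, 0.0],
              [0.5, 2.0, 0.0],
              [-1.0, 0.3, 1.5]])      # hypothetical invertible R
z = np.array([0.2, -1.0, 0.7])       # hypothetical evaluation point

p_change = density_via_change_of_variables(z, R)
p_cov = density_via_covariance(z, R @ R.T)
```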
For a particular multivariate Gaussian distribution f (.), if we do not have the true means and
covariances µ, Σ, then our best bet is to use MLE to estimate them empirically with i.i.d. samples
x1 , x2 , . . . , xn :
\[ \hat{\mu} = \frac{1}{n} \sum_{t_i = k} x_i \]
\[ \hat{\Sigma} = \frac{1}{n} \sum_{t_i = k} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top \]
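In code, these estimates are a sample mean and a sample covariance that divides by n (not n − 1). The sketch below uses hypothetical true parameters to generate data and compares the result with NumPy's built-in estimator; the helper name `gaussian_mle` is an assumption of this example.

```python
import numpy as np

def gaussian_mle(samples):
    """MLE of the mean and covariance from an (n, d) array of i.i.d. samples."""
    n = samples.shape[0]
    mu_hat = samples.mean(axis=0)
    centered = samples - mu_hat
    sigma_hat = centered.T @ centered / n     # divides by n, not n - 1
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])               # hypothetical parameters
true_sigma = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
samples = rng.multivariate_normal(true_mu, true_sigma, size=500)
mu_hat, sigma_hat = gaussian_mle(samples)
```

`np.cov(samples, rowvar=False, bias=True)` computes the same covariance estimate, since `bias=True` selects the 1/n normalization.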
Note that the above formulas are not necessarily trivial and must be formally proven using MLE.
Just to present a glimpse of the process, let’s prove that these formulas hold for the case where we
are dealing with 1-d data points. For notation purposes, assume that D = {x1 , x2 , . . . , xn } is the
set of all training data points that belong to class k. Note that the data points are i.i.d. Our goal
is to solve the following MLE problem:
Note that the objective above is not jointly convex, so we cannot simply take derivatives and set
them to 0! Instead, we decompose the minimization over σ² and µ into a nested optimization
problem:
\[ \min_{\mu, \sigma^2} \sum_{i=1}^{n} \left( \frac{(x_i - \mu)^2}{2\sigma^2} + \ln(\sigma) \right) = \min_{\sigma^2} \min_{\mu} \sum_{i=1}^{n} \left( \frac{(x_i - \mu)^2}{2\sigma^2} + \ln(\sigma) \right) \]
The optimization problem has been decomposed into an inner problem that optimizes for µ given
a fixed σ², and an outer problem that optimizes for σ² given the optimal value µ̂. Let’s first solve
the inner optimization problem. Given a fixed σ², the objective is convex in µ, so we can simply
take a partial derivative w.r.t µ and set it equal to 0:
\[ \frac{\partial}{\partial \mu} \sum_{i=1}^{n} \left( \frac{(x_i - \mu)^2}{2\sigma^2} + \ln(\sigma) \right) = \sum_{i=1}^{n} \frac{-(x_i - \mu)}{\sigma^2} = 0 \implies \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
2.3. MULTIVARIATE GAUSSIANS 35
Note that this objective is not convex in σ, so we must instead find the critical point of the objective
that minimizes the objective. Assuming that σ ≥ 0, the critical points are:
• σ = 0: assuming that not all of the points xi are equal to µ̂, there are two terms that are at
odds with each other: a 1/σ² term that blows up to ∞, and a ln(σ) term that goes to
−∞ as σ → 0. Note that the 1/σ² term grows at a faster rate, so we conclude that
\[ \lim_{\sigma \to 0} \sum_{i=1}^{n} \left( \frac{(x_i - \hat{\mu})^2}{2\sigma^2} + \ln(\sigma) \right) = \infty \]
• σ = ∞: this case does not lead to the solution, because it gives a maximum, not a minimum.
\[ \lim_{\sigma \to \infty} \sum_{i=1}^{n} \left( \frac{(x_i - \hat{\mu})^2}{2\sigma^2} + \ln(\sigma) \right) = \infty \]
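The remaining candidate, the interior stationary point, does not appear in this copy. A short reconstruction follows, assuming the per-sample objective (x_i − µ̂)²/(2σ²) + ln(σ) is summed over all n points:

```latex
\begin{align*}
\frac{\partial}{\partial \sigma} \sum_{i=1}^{n} \left( \frac{(x_i - \hat{\mu})^2}{2\sigma^2} + \ln(\sigma) \right)
  &= -\frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 + \frac{n}{\sigma} = 0 \\
\implies \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
\end{align*}
```

Because the objective tends to +∞ at both endpoints, this single stationary point is the global minimum, and it agrees with the 1-d case of the covariance estimate stated earlier.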
Isocontours
Let’s try to understand in detail how to visualize a multivariate Gaussian distribution. For simplicity,
let’s consider a zero-mean Gaussian distribution N (0, Σ), which just leaves us with the
covariance matrix Σ. Since Σ is a symmetric, positive semidefinite matrix, we can decompose it
by the spectral theorem into $\Sigma = V \Lambda V^\top$, where the columns of V form an orthonormal basis in
$\mathbb{R}^d$, and Λ is a diagonal matrix with real, non-negative values. (For the density and $\Sigma^{-1}$ to exist, Σ must
in fact be positive definite, so that every $\lambda_i > 0$.) We wish to find its level set
\[ f(x) = k \]
or simply the set of all points x such that the probability density f (x) evaluates to a fixed constant
k. This is equivalent to the level set ln f (x) = ln(k) which further reduces to
\[ x^\top \Sigma^{-1} x = c \]
for some constant c. Without loss of generality, assume that this constant is 1. The level set
$x^\top \Sigma^{-1} x = 1$ is an ellipsoid with axes $v_1, v_2, \ldots, v_d$, with semi-axis lengths $\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_d}$, respectively.
Each axis of the ellipsoid is the vector $\sqrt{\lambda_i}\, v_i$, and we can verify that
\[ (\sqrt{\lambda_i}\, v_i)^\top \Sigma^{-1} (\sqrt{\lambda_i}\, v_i) = \lambda_i\, v_i^\top \Sigma^{-1} v_i = \lambda_i\, v_i^\top (\lambda_i^{-1} v_i) = v_i^\top v_i = 1 \]
The entries of Λ dictate how elongated or shrunk the distribution is along each direction. In the
case of isotropic distributions, the entries of Λ are all identical, meaning that the ellipsoid
is a sphere (a circle in 2D). In the case of anisotropic distributions, the entries of Λ are not necessarily
identical, meaning that the resulting ellipsoid may be elongated/shrunken and also rotated.
Figure 2.1: Isotropic (left) vs Anisotropic (right) contours are ellipsoids with axes $\sqrt{\lambda_i}\, v_i$. Images courtesy
Professor Shewchuk’s notes
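The claim that each vector √λᵢ·vᵢ lies on the level set xᵀΣ⁻¹x = 1 can be verified numerically. The covariance matrix below is a hypothetical positive definite example.

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])             # hypothetical positive definite covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)   # Sigma = V diag(eigvals) V^T
Sigma_inv = np.linalg.inv(Sigma)

# Column i of semi_axes is sqrt(lambda_i) * v_i.
semi_axes = eigvecs * np.sqrt(eigvals)

# Evaluate x^T Sigma^{-1} x at each semi-axis endpoint.
level_values = np.array([a @ Sigma_inv @ a for a in semi_axes.T])
```

Every entry of `level_values` equals 1 up to floating-point error, so the endpoints of the semi-axes sit exactly on the isocontour.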
Properties
Let’s state some well-known properties of Multivariate Gaussians. Given a JG random vector
Z ∼ N (µZ , ΣZ ), the linear transformation AZ (where A is an appropriately dimensioned constant
matrix) is also JG:
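\[ AZ \sim \mathcal{N}(A\mu_Z,\; A \Sigma_Z A^\top) \]
We can derive the mean and covariance of AZ using the linearity of expectations:
\[ \mu_{AZ} = \mathbb{E}[AZ] = A\,\mathbb{E}[Z] = A\mu_Z \]
and
\[
\begin{aligned}
\Sigma_{AZ} &= \mathbb{E}\big[(AZ - \mathbb{E}[AZ])(AZ - \mathbb{E}[AZ])^\top\big] \\
&= \mathbb{E}\big[A(Z - \mathbb{E}[Z])(Z - \mathbb{E}[Z])^\top A^\top\big] \\
&= A\,\mathbb{E}\big[(Z - \mathbb{E}[Z])(Z - \mathbb{E}[Z])^\top\big] A^\top \\
&= A \Sigma_Z A^\top
\end{aligned}
\]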
Note that the statements above did not rely on the fact that Z is JG, so this reasoning applies
to all random vectors. We know that AZ is JG itself, because it can be expressed as a linear
transformation of i.i.d. Gaussians: AZ = ARU.
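The same identities hold for sample means and sample covariances, because both are built from linear operations. The sketch below uses hypothetical values for the mean, covariance, and the matrix A.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_z = np.array([1.0, 0.0, -1.0])                  # hypothetical mean
R = rng.standard_normal((3, 3))
Sigma_z = R @ R.T                                   # a valid covariance matrix
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])                    # hypothetical 2 x 3 transform

Z = rng.multivariate_normal(mu_z, Sigma_z, size=1000)   # each row is one sample
AZ = Z @ A.T                                             # apply A to every sample

mean_lhs, mean_rhs = AZ.mean(axis=0), A @ Z.mean(axis=0)
cov_lhs = np.cov(AZ, rowvar=False)
cov_rhs = A @ np.cov(Z, rowvar=False) @ A.T
```

Both pairs agree up to floating-point error for any set of samples, which mirrors the remark that the mean and covariance derivation does not rely on Z being JG.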
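Now suppose that we have the partition $Z = \begin{bmatrix} X \\ Y \end{bmatrix}$ whose distribution is given by Z ∼ N (µZ , ΣZ )
and
\[ \mu_Z = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \qquad \Sigma_Z = \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \]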
It turns out that the marginal distribution of the individual random vector X (and Y) is JG:
X ∼ N (µX , ΣXX )
2.4. MLE AND MAP FOR REGRESSION (PART II) 37
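\[ X = R_X U_X, \qquad Y = R_Y U_Y \]
we would expect that the expression for the joint distribution would be JG:
\[ \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} R_X & 0 \\ 0 & R_Y \end{bmatrix} \begin{bmatrix} U_X \\ U_Y \end{bmatrix} \]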
However, since we cannot guarantee that the entries of UX are independently distributed from the
entries of UY , we cannot conclude that the joint distribution is JG. If the entries are independently
distributed, then we would be able to conclude that the joint distribution is JG.
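Let’s now transition back to our discussion of Z. The conditional distribution of X given Y
(and vice versa) is also JG:
\[ X \mid Y \sim \mathcal{N}\big(\mu_X + \Sigma_{XY} \Sigma_{YY}^{-1} (Y - \mu_Y),\; \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}\big) \]
If X and Y are uncorrelated (that is, if ΣXY = ΣYX = 0), we can say that they are independent.
Namely, the conditional distribution of X given Y does not depend on Y:
\[ X \mid Y \sim \mathcal{N}(\mu_X, \Sigma_{XX}) \]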
Note the significance of this statement. Given any two general random vectors, we cannot neces-
sarily say “if they are uncorrelated, then they are independent”. However in the case of random
vectors from the same JG joint distribution, we can make this claim.
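The sketch below uses the standard conditioning formulas for a jointly Gaussian vector (conditional mean µ_X + Σ_XY Σ_YY⁻¹ (y − µ_Y), conditional covariance Σ_XX − Σ_XY Σ_YY⁻¹ Σ_YX) and shows that zeroing the cross-covariance makes the conditional collapse to the marginal. The numbers and the helper name `condition_gaussian` are hypothetical.

```python
import numpy as np

def condition_gaussian(mu, Sigma, k, y):
    """Condition N(mu, Sigma) on its last components being equal to y.

    The first k coordinates play the role of X, the remaining ones of Y.
    Returns the mean and covariance of X given Y = y.
    """
    mu_x, mu_y = mu[:k], mu[k:]
    S_xx, S_xy = Sigma[:k, :k], Sigma[:k, k:]
    S_yx, S_yy = Sigma[k:, :k], Sigma[k:, k:]
    gain = S_xy @ np.linalg.inv(S_yy)
    return mu_x + gain @ (y - mu_y), S_xx - gain @ S_yx

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
y_obs = np.array([2.0, 0.0])
cond_mean, cond_cov = condition_gaussian(mu, Sigma, k=1, y=y_obs)

# With zero cross-covariance, the conditional equals the marginal of X.
Sigma_indep = Sigma.copy()
Sigma_indep[:1, 1:] = 0.0
Sigma_indep[1:, :1] = 0.0
indep_mean, indep_cov = condition_gaussian(mu, Sigma_indep, k=1, y=y_obs)
```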
The power of probabilistic thinking is that it allows us a way to model situations that arise and
adapt our approaches in a reasonably principled way. This is particularly true when it comes to
incorporating information about the situation that comes from the physical context of the data
gathering process. In this note, we will explore what happens as we vary our assumptions about
the noise in our data and the priors for our parameters, as well as the “importance” of certain
training points.
So far we have used MLE and MAP to justify the optimization formulation of OLS and ridge
regression, respectively. The MLE formulation assumes that the observation Yi is a noisy version
of the true underlying output:
Yi = f (xi ) + Zi
where the noise for each datapoint is crucially i.i.d. The MAP formulation assumes that the model
parameter Wj is distributed according to an i.i.d. Gaussian prior
\[ W_j \overset{iid}{\sim} \mathcal{N}(\mu_j, \sigma_h^2) \]
So far, we have restricted ourselves to the case when the noise/parameters are i.i.d:
\[ Z \sim \mathcal{N}(0, \sigma^2 I), \qquad W \sim \mathcal{N}(\mu_W, \sigma_h^2 I) \]
However, what about the case when Zi ’s/Wj ’s are non-identical or dependent on one another? We
would like to explore the case when the observation noise and underlying parameters are jointly
Gaussian with arbitrary individual covariance matrices, but are independent of each other.
Z ∼ N (0, ΣZ ), W ∼ N (µW , ΣW )
It turns out that via a change of coordinates, we can reduce these non-i.i.d. problems back to the
i.i.d. case and solve them using the original techniques we used to solve OLS and Ridge Regression!
Changing coordinates is a powerful tool in thinking about machine learning.
The basic idea of weighted least squares is the following: we place more emphasis on the loss
contributed from certain data points over others - that is, we care more about fitting some data
points over others. It turns out that this weighted perspective is very useful as a building block
when we go beyond traditional least-squares problems.
Optimization View
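The weighted least squares (WLS) objective is
\[ \hat{w}_{wls} = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} \omega_i \,(y_i - x_i^\top w)^2 \]
This objective is the same as OLS, except that each term in the sum is weighted by a positive
coefficient ωi . As always, we can vectorize this problem:
\[ \hat{w}_{wls} = \arg\min_{w \in \mathbb{R}^d} (y - Xw)^\top \Omega\, (y - Xw) \]
where the i-th row of X is $x_i^\top$, and $\Omega \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $\Omega_{i,i} = \omega_i$.
We rewrite the WLS objective as an OLS objective:
\[ \hat{w}_{wls} = \arg\min_{w \in \mathbb{R}^d} \big\| \Omega^{1/2} y - \Omega^{1/2} X w \big\|_2^2 \]
This formulation is identical to OLS except that we have scaled the data matrix and the observation
vector by $\Omega^{1/2}$, and we conclude that
\[ \hat{w}_{wls} = \big((\Omega^{1/2} X)^\top (\Omega^{1/2} X)\big)^{-1} (\Omega^{1/2} X)^\top \Omega^{1/2} y = (X^\top \Omega X)^{-1} X^\top \Omega\, y \]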
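The equivalence between the weighted closed form and ordinary least squares on rescaled data can be checked directly. The data, weights, and coefficient vector in this sketch are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)   # hypothetical data
omega = rng.uniform(0.5, 2.0, size=n)                          # positive weights

# Closed form (X^T Omega X)^{-1} X^T Omega y, without forming the n x n matrix Omega.
Xt_omega = X.T * omega                 # scales column i of X^T by omega_i
w_closed = np.linalg.solve(Xt_omega @ X, Xt_omega @ y)

# Equivalent OLS problem on data scaled by Omega^{1/2}.
sqrt_omega = np.sqrt(omega)
w_scaled, *_ = np.linalg.lstsq(X * sqrt_omega[:, None], y * sqrt_omega, rcond=None)
```

Working with the weight vector rather than the full diagonal matrix keeps the cost linear in n for the scaling step.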
Probabilistic View
As in MLE, we assume that our observations y are noisy, but now suppose that some of the yi ’s
are more noisy than others. How can we take this into account in our learning algorithm so we can
get a better estimate of the weights? Our probabilistic model looks like
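\[ Y_i = x_i^\top w + Z_i \]
where the Zi ’s are still independent Gaussian random variables, but not necessarily identical:
$Z_i \sim \mathcal{N}(0, \sigma_i^2)$. Jointly, we have that $Z \sim \mathcal{N}(0, \Sigma_Z)$, where
\[ \Sigma_Z = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \cdots & \sigma_n^2 \end{bmatrix} \]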
We can morph the problem into an MLE one by scaling the data to make sure all the Zi ’s are
identically distributed, by dividing by σi :
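\[ \frac{Y_i}{\sigma_i} = \frac{x_i^\top}{\sigma_i} w + \frac{Z_i}{\sigma_i} \]
Note that the scaled noise entries are now i.i.d:
\[ \frac{Z_i}{\sigma_i} \overset{iid}{\sim} \mathcal{N}(0, 1) \]
Jointly, we can express this change of coordinates as
\[ \Sigma_Z^{-\frac{1}{2}} y \sim \mathcal{N}\big(\Sigma_Z^{-\frac{1}{2}} X w,\; \Sigma_Z^{-\frac{1}{2}} \Sigma_Z \Sigma_Z^{-\frac{\top}{2}}\big) = \mathcal{N}\big(\Sigma_Z^{-\frac{1}{2}} X w,\; I\big) \]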
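This change of variables is a whitening transformation (sometimes called the reparameterization trick). Now that the noise is
i.i.d. using the change of coordinates, we rewrite our original problem as a scaled MLE problem:
\[
\begin{aligned}
\hat{w}_{wls} &= \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} \frac{\big(\frac{y_i}{\sigma_i} - \frac{x_i^\top}{\sigma_i} w\big)^2}{2} + n \log \sqrt{2\pi} \\
&= \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} \frac{1}{\sigma_i^2} (y_i - x_i^\top w)^2
\end{aligned}
\]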
The MLE estimate of this scaled problem is equivalent to the WLS estimate of the original problem:
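\[ \hat{w}_{wls} = \big(X^\top \Sigma_Z^{-\frac{1}{2}} \Sigma_Z^{-\frac{1}{2}} X\big)^{-1} X^\top \Sigma_Z^{-\frac{1}{2}} \Sigma_Z^{-\frac{1}{2}} y = (X^\top \Sigma_Z^{-1} X)^{-1} X^\top \Sigma_Z^{-1} y \]
As long as no $\sigma_i$ is 0, $\Sigma_Z$ is invertible. Note that $\omega_i$ from the optimization perspective is directly
related to $\sigma_i^2$ from the probabilistic perspective: $\omega_i = \frac{1}{\sigma_i^2}$. Or at the level of matrices, $\Omega = \Sigma_Z^{-1}$.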
As the variance σi2 of the noise corresponding to data point i decreases, the weight ωi increases: we
are more concerned about fitting data point i because it is likely to match the true underlying de-
noised point. Inversely, as the variance σi2 increases, the weight ωi decreases: we are less concerned
about fitting data point i because it is noisy and should not be trusted.
Now let’s consider the case when the noise random variables are dependent on one another. We
have
Y = Xw + Z
Z ∼ N (0, ΣZ ), Y ∼ N (Xw, ΣZ )
This problem is known as generalized least squares. Our goal is to maximize the probability of
our data over the set of possible w’s:
\[ \hat{w}_{gls} = \arg\max_{w \in \mathbb{R}^d} \frac{1}{\sqrt{\det(\Sigma_Z)}} \frac{1}{(\sqrt{2\pi})^n} \, e^{-\frac{1}{2} (y - Xw)^\top \Sigma_Z^{-1} (y - Xw)} \]
Since ΣZ is symmetric, we can decompose it into its eigen structure using the spectral theorem:
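\[ \Sigma_Z = Q \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \cdots & \sigma_n^2 \end{bmatrix} Q^\top \]
where Q is orthonormal. As before with weighted least squares, our goal is to find an appropriate
linear transformation so that we can reduce the problem into the i.i.d. case.
Consider
\[ \Sigma_Z^{-\frac{1}{2}} = Q \begin{bmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \cdots & \frac{1}{\sigma_n} \end{bmatrix} Q^\top \]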
We can scale the data to morph the problem into an MLE problem with i.i.d. noise variables, by
premultiplying the data matrix X and the observation vector y by $\Sigma_Z^{-\frac{1}{2}}$. Jointly, we can express
this change of coordinates as
\[ \Sigma_Z^{-\frac{1}{2}} y \sim \mathcal{N}\big(\Sigma_Z^{-\frac{1}{2}} X w,\; \Sigma_Z^{-\frac{1}{2}} \Sigma_Z \Sigma_Z^{-\frac{\top}{2}}\big) = \mathcal{N}\big(\Sigma_Z^{-\frac{1}{2}} X w,\; I\big). \]
Consequently, in a very similar fashion to the independent noise problem, the MLE of the scaled
dependent noise problem is
\[ \hat{w}_{gls} = (X^\top \Sigma_Z^{-1} X)^{-1} X^\top \Sigma_Z^{-1} y. \]
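The whitening argument can be sketched numerically: build the inverse square root of the noise covariance from its eigendecomposition, run OLS on the whitened data, and compare with the closed form. The noise covariance and data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 3
X = rng.standard_normal((n, d))
M = rng.standard_normal((n, n))
Sigma_z = M @ M.T + n * np.eye(n)          # hypothetical positive definite noise covariance
noise = rng.multivariate_normal(np.zeros(n), Sigma_z)
y = X @ np.array([0.5, 1.0, -1.5]) + noise

# Whitening matrix Sigma_z^{-1/2} = Q diag(1 / sigma_i) Q^T from the spectral theorem.
eigvals, Q = np.linalg.eigh(Sigma_z)
whiten = Q @ np.diag(1.0 / np.sqrt(eigvals)) @ Q.T

# OLS on the whitened data ...
w_whitened, *_ = np.linalg.lstsq(whiten @ X, whiten @ y, rcond=None)

# ... agrees with the closed form (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y.
Sinv_X = np.linalg.solve(Sigma_z, X)       # Sigma^{-1} X
w_closed = np.linalg.solve(X.T @ Sinv_X, Sinv_X.T @ y)
```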
In the ordinary least squares (OLS) statistical model, we assume that the output Y is a linear
function of the input, plus some Gaussian noise. We take this one step further in MAP estimation,
where we assume that the weights are a random variable. The new statistical model is
Y = XW + Z
where Y and Z are n-dimensional random vectors, W is a d-dimensional random vector, and X is
a fixed n × d matrix. Note that random vectors are not notationally distinguished from matrices
here, so keep in mind what each symbol represents.
We have seen that ridge regression can be derived by assuming a prior distribution on W in which
Wi are i.i.d. (univariate) Gaussian, or equivalently,
\[ W \sim \mathcal{N}(0, I) \]
More generally, we can allow any multivariate Gaussian prior:
\[ W \sim \mathcal{N}(\mu_W, \Sigma_W) \]
Recall that we can rewrite a multivariate Gaussian variable as an affine transformation of a standard
Gaussian variable:
\[ W = \Sigma_W^{1/2} V + \mu_W, \qquad V \sim \mathcal{N}(0, I) \]
However V is not what we care about – we need to convert back to the actual weights W in order to
make predictions. Since W is completely determined by V (assuming fixed mean and covariance),
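\[
\begin{aligned}
\hat{w} &= \Sigma_W^{1/2} \hat{v} + \mu_W \\
&= \mu_W + \Sigma_W^{1/2} \big(\Sigma_W^{\top/2} X^\top X \Sigma_W^{1/2} + I\big)^{-1} \Sigma_W^{\top/2} X^\top (y - X\mu_W) \\
&= \mu_W + \big(X^\top X + \underbrace{\Sigma_W^{-\top/2} \Sigma_W^{-1/2}}_{\Sigma_W^{-1}}\big)^{-1} X^\top (y - X\mu_W)
\end{aligned}
\]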
Note that there are two terms: the prior mean µW , plus another term that depends on both the
data and the prior. The positive-definite precision matrix of W’s prior ($\Sigma_W^{-1}$) controls how the data
fit error affects our estimate. This is called Tikhonov regularization in the literature and generalizes
ridge regularization.
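The two expressions for the estimate, one computed in the whitened coordinates v and one in the original coordinates w, can be compared numerically. This sketch assumes unit-variance i.i.d. noise and uses a hypothetical prior mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)                       # hypothetical observations
mu_w = np.array([1.0, 0.0, -1.0, 2.0])           # hypothetical prior mean
B = rng.standard_normal((d, d))
Sigma_w = B @ B.T + np.eye(d)                    # hypothetical prior covariance

# Symmetric square root of Sigma_w via its eigendecomposition.
eigvals, Q = np.linalg.eigh(Sigma_w)
S_half = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

# Standard ridge regression in the coordinates v, then map back to w.
X_tilde = X @ S_half
residual = y - X @ mu_w
v_hat = np.linalg.solve(X_tilde.T @ X_tilde + np.eye(d), X_tilde.T @ residual)
w_from_v = mu_w + S_half @ v_hat

# Direct Tikhonov form: mu_w + (X^T X + Sigma_w^{-1})^{-1} X^T (y - X mu_w).
w_direct = mu_w + np.linalg.solve(X.T @ X + np.linalg.inv(Sigma_w), X.T @ residual)
```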
When the prior variance $\sigma_j^2$ for dimension j is large, the prior is telling us that Wj may take on a
wide range of values. Thus we do not want to penalize that dimension as much, preferring to let
the data fit sort it out. And indeed the corresponding entry in $\Sigma_W^{-1}$ will be small, as desired.
Conversely if $\sigma_j^2$ is small, there is little variance in the value of Wj , so Wj ≈ µj . As such we penalize
the magnitude of the data-fit contribution to Ŵj more heavily.
If all the $\sigma_j^2$ are the same, then we have traditional ridge regularization.
In an explicitly probabilistic perspective, MAP with colored noise (and known X) can be expressed
as:
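\[ U, V \overset{iid}{\sim} \mathcal{N}(0, I) \tag{2.6} \]
\[ \begin{bmatrix} Y \\ W \end{bmatrix} = \begin{bmatrix} R_Z & X R_W \\ 0 & R_W \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix} \tag{2.7} \]
where $R_Z$ and $R_W$ are the matrices that generate Z and W, respectively (that is, $Z = R_Z U$ and $W = R_W V$). Note that $R_W$ appears
twice because our model assumes Y = XW + noise, so if $W = R_W V$, then we must have $Y = X R_W V$ + noise.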
We want to find the posterior W | Y = y. The formulation above makes it relatively easy to find
the posterior of Y conditioned on W (see below), but not vice-versa. So let’s pretend instead that
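\[ U', V' \overset{iid}{\sim} \mathcal{N}(0, I) \]
\[ \begin{bmatrix} W \\ Y \end{bmatrix} = \begin{bmatrix} A & B \\ 0 & D \end{bmatrix} \begin{bmatrix} U' \\ V' \end{bmatrix} \]
Then $W = AU' + BV'$ and $Y = DV'$, so conditioned on Y = y we have $V' = D^{-1} y$ (assuming D is invertible), and
\[
\begin{aligned}
\mathbb{E}[W \mid Y = y] &= \mathbb{E}[AU' + BV' \mid Y = y] = A\,\mathbb{E}[U' \mid Y = y] + B D^{-1} y = A\,\mathbb{E}[U'] + B D^{-1} y = B D^{-1} y \\
\operatorname{Var}(W \mid Y = y) &= \mathbb{E}\big[(W - \mathbb{E}[W \mid Y = y])(W - \mathbb{E}[W \mid Y = y])^\top \mid Y = y\big] \\
&= \mathbb{E}\big[A U' U'^\top A^\top \mid Y = y\big] = A\,\mathbb{E}[U' U'^\top] A^\top \\
&= A A^\top
\end{aligned}
\]
In both cases above where we drop the conditioning on Y, we are using the fact that U′ is independent
of V′ (and thus independent of Y = DV′). Therefore
\[ W \mid Y = y \sim \mathcal{N}(B D^{-1} y,\; A A^\top) \]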
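Recall that a Gaussian distribution is completely specified by its mean and covariance matrix. We
see that the covariance matrix of the joint distribution is
\[
\begin{aligned}
\mathbb{E}\left[ \begin{bmatrix} W \\ Y \end{bmatrix} \begin{bmatrix} W^\top & Y^\top \end{bmatrix} \right]
&= \begin{bmatrix} A & B \\ 0 & D \end{bmatrix} \begin{bmatrix} A^\top & 0 \\ B^\top & D^\top \end{bmatrix} \\
&= \begin{bmatrix} A A^\top + B B^\top & B D^\top \\ D B^\top & D D^\top \end{bmatrix} \\
&= \begin{bmatrix} \Sigma_W & \Sigma_{W,Y} \\ \Sigma_{Y,W} & \Sigma_Y \end{bmatrix}
\end{aligned}
\]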
Matching the corresponding terms, we can express the conditional mean and variance of W | Y = y
in terms of these (cross-)covariance matrices:
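\[
\begin{aligned}
\mathbb{E}[W \mid Y = y] &= B D^{-1} y = B D^\top D^{-\top} D^{-1} y = (B D^\top)(D D^\top)^{-1} y = \Sigma_{W,Y} \Sigma_Y^{-1} y \\
\operatorname{Var}(W \mid Y = y) &= A A^\top = A A^\top + B B^\top - B B^\top \\
&= A A^\top + B B^\top - B \underbrace{D^\top D^{-\top}}_{I} \underbrace{D^{-1} D}_{I} B^\top \\
&= A A^\top + B B^\top - (B D^\top)(D D^\top)^{-1} D B^\top \\
&= \Sigma_W - \Sigma_{W,Y} \Sigma_Y^{-1} \Sigma_{Y,W}
\end{aligned}
\]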
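We can now apply the same formulas to the original setup, for which
\[
\begin{aligned}
\Sigma_W &= R_W R_W^\top \\
\Sigma_Y &= \Sigma_Z + X \Sigma_W X^\top \\
\Sigma_{Y,W} &= X \Sigma_W \\
\Sigma_{W,Y} &= \Sigma_W X^\top
\end{aligned}
\]
so that, using the matrix inversion lemma in the last step,
\[
\begin{aligned}
\hat{w} &= \mathbb{E}[W \mid Y = y] \\
&= \Sigma_{W,Y} \Sigma_Y^{-1} y \\
&= \Sigma_W X^\top (X \Sigma_W X^\top + \Sigma_Z)^{-1} y \\
&= (X^\top \Sigma_Z^{-1} X + \Sigma_W^{-1})^{-1} X^\top \Sigma_Z^{-1} y
\end{aligned}
\]
which looks more familiar. In fact, you can recognize this as the general solution when we have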
both a generic Gaussian prior on the parameters and colored noise in the observations.
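The posterior mean can be written either as Σ_{W,Y} Σ_Y⁻¹ y, which requires an n×n solve, or in the regularized least-squares form (Xᵀ Σ_Z⁻¹ X + Σ_W⁻¹)⁻¹ Xᵀ Σ_Z⁻¹ y, which requires a d×d solve. The sketch below checks that the two agree, using hypothetical covariances and a zero prior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 20                                    # fewer observations than weights
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)                      # hypothetical observations
A = rng.standard_normal((d, d))
Sigma_w = A @ A.T + np.eye(d)                   # hypothetical prior covariance
C = rng.standard_normal((n, n))
Sigma_z = C @ C.T + np.eye(n)                   # hypothetical noise covariance

# Form 1: Sigma_{W,Y} Sigma_Y^{-1} y, which solves an n x n system.
Sigma_y = Sigma_z + X @ Sigma_w @ X.T
w_posterior = Sigma_w @ X.T @ np.linalg.solve(Sigma_y, y)

# Form 2: (X^T Sigma_Z^{-1} X + Sigma_W^{-1})^{-1} X^T Sigma_Z^{-1} y, a d x d system.
Sz_inv_X = np.linalg.solve(Sigma_z, X)          # Sigma_Z^{-1} X
w_familiar = np.linalg.solve(X.T @ Sz_inv_X + np.linalg.inv(Sigma_w), Sz_inv_X.T @ y)
```

Which form is cheaper depends on whether n or d is smaller.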
We have seen a number of related linear models, with varying assumptions about the randomness
in the observations and the weights. We summarize these below:
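\[
\begin{array}{l|l|l}
\text{Prior on } W \,\backslash\, \text{Noise } Z & \mathcal{N}(0, I) & \mathcal{N}(0, \Sigma_Z) \\ \hline
\text{No prior} & \hat{w}_{ols} = (X^\top X)^{-1} X^\top y & \hat{w}_{gls} = (X^\top \Sigma_Z^{-1} X)^{-1} X^\top \Sigma_Z^{-1} y \\
\mathcal{N}(0, \lambda^{-1} I) & \hat{w}_{ridge} = (X^\top X + \lambda I)^{-1} X^\top y & (X^\top \Sigma_Z^{-1} X + \lambda I)^{-1} X^\top \Sigma_Z^{-1} y \\
\mathcal{N}(\mu_W, \Sigma_W) & \mu_W + (X^\top X + \Sigma_W^{-1})^{-1} X^\top (y - X\mu_W) & \mu_W + (X^\top \Sigma_Z^{-1} X + \Sigma_W^{-1})^{-1} X^\top \Sigma_Z^{-1} (y - X\mu_W)
\end{array}
\]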
In ridge regression, we are given a vector $y \in \mathbb{R}^n$ and a matrix $X \in \mathbb{R}^{n \times \ell}$, where n is the number of
training points and ℓ is the dimension of the raw data points. In most settings we don’t want to
work with just the raw feature space, so we augment features to the data points and replace X
with $\Phi \in \mathbb{R}^{n \times d}$, whose i-th row is $\phi_i^\top$ with $\phi_i = \phi(x_i) \in \mathbb{R}^d$. Then we solve a well-defined optimization problem that
involves Φ and y, over the parameters $w \in \mathbb{R}^d$. Note the problem that arises here. If we have
polynomial features of degree at most p in the raw ℓ-dimensional space, then there are $d = \binom{\ell + p}{p}$
terms that we need to optimize, which can be very, very large (much larger than the number of
training points n). Wouldn’t it be useful, if instead of solving an optimization problem over d
variables, we could solve an equivalent problem over (potentially much smaller) n variables, and
achieve a computational runtime independent of the number of augmented features? As it turns
out, the concept of kernels (in addition to a technique called the kernel trick) will allow us to
achieve this goal. Recall the solution to ridge regression:
\[ w^* = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y \]
This operation involves calculating $\Phi^\top \Phi$, which is a d × d matrix and takes $O(d^2 n)$ time to compute.
The matrix inversion operation takes an additional $O(d^3)$ time to compute. What we would really
like is to have an n × n matrix that takes $O(n^3)$ to invert. Here’s a simple observation: if we flip
the order of $\Phi^\top$ and $\Phi$, we end up with an n × n matrix $\Phi \Phi^\top$. In fact, the matrix $\Phi \Phi^\top$ has a very
intuitive meaning: it is the matrix of inner products between all of the augmented datapoints, which
in loose terms measures the “similarity” among the datapoints and captures their relationship.
Now let’s see if we could somehow express the solution to ridge regression using the matrix $\Phi \Phi^\top$.
Footnote 1 (matrix inversion lemma): $(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}$
2.5. KERNELS AND RIDGE REGRESSION 45
Derivation
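For simplicity of notation, let’s revert back to using X instead of Φ (pretend that we are only
working with raw features; our analysis of kernel ridge regression still holds if we use just the raw
features). Rearranging the terms of the original ridge regression solution, we have
\[ (X^\top X + \lambda I)\, w = X^\top y \implies w = \frac{1}{\lambda} (X^\top y - X^\top X w) = X^\top \frac{1}{\lambda} (y - X w) \]
so $w = X^\top v$ for the vector $v = \frac{1}{\lambda}(y - Xw) \in \mathbb{R}^n$; that is, w is a linear combination of the training points. Substituting $w = X^\top v$ back into the first equation gives
\[ (X^\top X + \lambda I)\, X^\top v = X^\top y \implies X^\top (X X^\top v + \lambda v) = X^\top y \]
We can’t yet isolate v and have a closed-form solution for it, but we can make the observation that
if we found a v such that we had
\[ X X^\top v + \lambda v = y \]
that would imply that this v also satisfies the above equation. Note that we did not “cancel the
$X^\top$’s on both sides of the equation.” We saw that having v satisfy one equation implied that it
satisfied the other as well. So, indeed, we can isolate v in this new equation:
\[ (X X^\top + \lambda I)\, v = y \implies v^* = (X X^\top + \lambda I)^{-1} y \]
and have that the v which satisfies this equation will be such that $X^\top v$ equals w. We conclude
that the optimal w is
\[ w^* = X^\top v^* = X^\top (X X^\top + \lambda I)^{-1} y \]
Recall that we previously derived the ridge regression solution as $w^* = (X^\top X + \lambda I)^{-1} X^\top y$.
In fact, these two are equivalent expressions! The question that now arises is which expression
should you pick? Which is more efficient to calculate? We will answer this question after we
introduce kernels.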
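The equivalence of the two expressions is easy to confirm numerically. In this sketch the sizes, the regularization strength, and the random data are hypothetical; d is chosen larger than n so that the n×n system is the smaller one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 15, 40, 0.1                 # hypothetical sizes with d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal form: (X^T X + lam I)^{-1} X^T y, a d x d linear system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual form: X^T (X X^T + lam I)^{-1} y, an n x n linear system.
v_star = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ v_star
```

Both routes give the same weight vector; only the size of the linear system differs.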
The previous derivation involved using some intuitive manipulations to achieve the desired answer.
Let’s formalize our derivation using more principled arguments from linear algebra and optimization.
Before we do so, we must first introduce the Fundamental Theorem of Linear Algebra
(FTLA): Suppose that there is a matrix (linear map) X that maps $\mathbb{R}^\ell$ to $\mathbb{R}^n$. Denote N (X) as
the nullspace of X, and R(X) as the range of X. Then the following properties hold:
1. $N(X) \overset{\perp}{\oplus} R(X^\top) = \mathbb{R}^\ell$ and $N(X^\top) \overset{\perp}{\oplus} R(X) = \mathbb{R}^n$ by symmetry
The symbol ⊕ indicates that we are taking a direct sum of N (X) and $R(X^\top)$, which means that
$\forall u \in \mathbb{R}^\ell$ there exist unique elements $u_1 \in N(X)$ and $u_2 \in R(X^\top)$ such that $u = u_1 + u_2$.
Furthermore, the symbol ⊥ indicates that N (X) and $R(X^\top)$ are orthogonal subspaces.
2. $N(X^\top X) = N(X)$ and $N(X X^\top) = N(X^\top)$ by symmetry
3. $R(X^\top X) = R(X^\top)$ and $R(X X^\top) = R(X)$ by symmetry.
Here’s where FTLA comes in, in the context of kernel ridge regression. We know that we can express
any $w \in \mathbb{R}^\ell$ as a unique combination $w = w_1 + w_2$, where $w_1 \in R(X^\top)$ and $w_2 \in N(X)$.
Equivalently we can express this as $w = X^\top v + r$, where $v \in \mathbb{R}^n$ and $r \in N(X)$. Now, instead of
optimizing over $w \in \mathbb{R}^\ell$, we can optimize over $v \in \mathbb{R}^n$ and $r \in \mathbb{R}^\ell$, which equates to optimizing over
n + ℓ variables. However, as we shall see, the optimization over r will be trivial so we just have to
optimize an n dimensional problem.
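We know that $\mathbf{w} = \mathbf{X}^\top\mathbf{v} + \mathbf{r}$, where $\mathbf{v} \in \mathbb{R}^n$ and $\mathbf{r} \in \mathcal{N}(\mathbf{X})$. Let’s now solve ridge regression by
optimizing over the variables $\mathbf{v}$ and $\mathbf{r}$ instead of $\mathbf{w}$:
$$\begin{aligned}
\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_2^2
&= \min_{\mathbf{v},\, \mathbf{r} \in \mathcal{N}(\mathbf{X})} \|\mathbf{X}\mathbf{X}^\top\mathbf{v} + \mathbf{X}\mathbf{r} - \mathbf{y}\|_2^2 + \lambda\left(\mathbf{v}^\top\mathbf{X}\mathbf{X}^\top\mathbf{v} + 2\mathbf{v}^\top\mathbf{X}\mathbf{r} + \mathbf{r}^\top\mathbf{r}\right) \\
&= \min_{\mathbf{v},\, \mathbf{r} \in \mathcal{N}(\mathbf{X})} \|\mathbf{X}\mathbf{X}^\top\mathbf{v} - \mathbf{y}\|_2^2 + \lambda\left(\mathbf{v}^\top\mathbf{X}\mathbf{X}^\top\mathbf{v} + \mathbf{r}^\top\mathbf{r}\right)
\end{aligned}$$
The terms $\mathbf{X}\mathbf{r}$ and $2\mathbf{v}^\top\mathbf{X}\mathbf{r}$ are crossed out because $\mathbf{r} \in \mathcal{N}(\mathbf{X})$ and therefore $\mathbf{X}\mathbf{r} = \mathbf{0}$. Now we are optimizing
over $L(\mathbf{v}, \mathbf{r})$, which is jointly convex in $\mathbf{v}$ and $\mathbf{r}$, because its Hessian is PSD. Let’s show that this
is indeed the case: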
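$$\nabla_{\mathbf{r}}^2 L(\mathbf{v}, \mathbf{r}) = 2\lambda\mathbf{I} \succeq 0$$
$$\nabla_{\mathbf{r}} \nabla_{\mathbf{v}} L(\mathbf{v}, \mathbf{r}) = \nabla_{\mathbf{v}} \nabla_{\mathbf{r}} L(\mathbf{v}, \mathbf{r}) = \mathbf{0}$$
$$\nabla_{\mathbf{v}}^2 L(\mathbf{v}, \mathbf{r}) = 2\mathbf{X}\mathbf{X}^\top\mathbf{X}\mathbf{X}^\top + 2\lambda\mathbf{X}\mathbf{X}^\top \succeq 0$$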
Since the cross terms of the Hessian are 0, it suffices that ∇2r L(v, r) and ∇2v L(v, r) are PSD to
establish joint convexity. With joint convexity established, we can set the gradient to 0 w.r.t r and
v and obtain the global minimum:
$$\nabla_{\mathbf{r}} L(\mathbf{v}, \mathbf{r}^*) = 2\lambda\mathbf{r}^* = \mathbf{0} \implies \mathbf{r}^* = \mathbf{0}$$
2.5. KERNELS AND RIDGE REGRESSION 47
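Setting the gradient with respect to $\mathbf{v}$ to zero gives
$$\nabla_{\mathbf{v}} L(\mathbf{v}^*, \mathbf{r}) = 2\mathbf{X}\mathbf{X}^\top\left((\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})\mathbf{v}^* - \mathbf{y}\right) = \mathbf{0},$$
which is satisfied by $\mathbf{v}^* = (\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$.
Note that $\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I}$ is positive definite (for $\lambda > 0$) and therefore invertible, so we can compute $(\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$.
Even though $\mathbf{v}^* = (\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$ is only known to be a critical point at which the gradient is 0, it must achieve the global
minimum because the objective is jointly convex. We conclude that
$$\mathbf{w}^* = \mathbf{X}^\top\mathbf{v}^* + \mathbf{r}^* = \mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$$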
Non-i.i.d. Case
So far we have assumed the special i.i.d. case of ridge regression, where
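As we’ve seen already, the solution in this case can be expressed in two forms, either the familiar
form
$$\mathbf{w}^* = (\mathbf{X}^\top\boldsymbol{\Sigma}_Z^{-1}\mathbf{X} + \boldsymbol{\Sigma}_W^{-1})^{-1}\mathbf{X}^\top\boldsymbol{\Sigma}_Z^{-1}\mathbf{y}$$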
or the case that we desire in kernel ridge regression
The principal difference in the non-i.i.d case is that we are computing XΣW X> as opposed to XX>.
Kernels
Having derived the kernel ridge regression formulation for the raw data matrix X, we can apply
the exact same logic to the augmented data matrix Φ and replace the optimal expression with
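$$\mathbf{w}^* = \boldsymbol{\Phi}^\top(\boldsymbol{\Phi}\boldsymbol{\Phi}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$$
Let’s explore the $\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$ term in kernel ridge regression in more detail:
$$\boldsymbol{\Phi}\boldsymbol{\Phi}^\top =
\begin{bmatrix} \boldsymbol{\phi}_1^\top \\ \boldsymbol{\phi}_2^\top \\ \vdots \\ \boldsymbol{\phi}_n^\top \end{bmatrix}
\begin{bmatrix} \boldsymbol{\phi}_1 & \boldsymbol{\phi}_2 & \cdots & \boldsymbol{\phi}_n \end{bmatrix}
=
\begin{bmatrix}
\boldsymbol{\phi}_1^\top\boldsymbol{\phi}_1 & \boldsymbol{\phi}_1^\top\boldsymbol{\phi}_2 & \cdots \\
\boldsymbol{\phi}_2^\top\boldsymbol{\phi}_1 & \ddots & \\
\vdots & & \boldsymbol{\phi}_n^\top\boldsymbol{\phi}_n
\end{bmatrix}$$
Each entry $(\boldsymbol{\Phi}\boldsymbol{\Phi}^\top)_{ij}$ is a dot product between $\boldsymbol{\phi}(\mathbf{x}_i)$ and $\boldsymbol{\phi}(\mathbf{x}_j)$ and can be interpreted as a similarity
measure:
$$(\boldsymbol{\Phi}\boldsymbol{\Phi}^\top)_{ij} = \langle \boldsymbol{\phi}_i, \boldsymbol{\phi}_j \rangle = \langle \boldsymbol{\phi}(\mathbf{x}_i), \boldsymbol{\phi}(\mathbf{x}_j) \rangle = k(\mathbf{x}_i, \mathbf{x}_j)$$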
where k(., .) is the kernel function. The kernel function takes raw-feature inputs and outputs their
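inner product in the augmented feature space. We call the matrix of $k(\mathbf{x}_i, \mathbf{x}_j)$ terms the
Gram matrix and denote it as $\mathbf{K}$:
$$\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top =
\begin{bmatrix}
k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \cdots \\
k(\mathbf{x}_2, \mathbf{x}_1) & \ddots & \\
\vdots & & k(\mathbf{x}_n, \mathbf{x}_n)
\end{bmatrix}$$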
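Formally, $k(\mathbf{x}_i, \mathbf{x}_j)$ is defined to be a valid kernel function if either of the following definitions is
met:
• There exists a feature map $\boldsymbol{\phi}(\cdot)$ such that $\forall \mathbf{x}_i, \mathbf{x}_j$, $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \boldsymbol{\phi}(\mathbf{x}_i), \boldsymbol{\phi}(\mathbf{x}_j) \rangle$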
where α, β ≥ 0 is also a valid kernel. We can show this from the second property:
is a valid kernel. We can show this from the first property: $\tilde{\boldsymbol{\phi}}(\mathbf{x}_i) = \boldsymbol{\Sigma}^{1/2}\boldsymbol{\phi}(\mathbf{x}_i)$
Computing each Gram matrix entry $k(\mathbf{x}_i, \mathbf{x}_j)$ can be done in a straightforward fashion if we
apply the feature map to $\mathbf{x}_i$ and $\mathbf{x}_j$ and then take their dot product in the augmented feature space;
this takes $O(d)$ time, where $d$ is the dimensionality of the problem in the augmented feature
space. However, if we use the kernel trick, we can perform this operation in $O(\ell + \log p)$ time,
where $\ell$ is the dimensionality of the problem in the raw feature space and $p$ is the degree of the
polynomials in the augmented feature space.
Kernel Trick
Suppose that we are computing $k(\mathbf{x}, \mathbf{z})$, using a $p$-degree polynomial feature map that maps $\ell$-dimensional
inputs to $d = O(\ell^p)$-dimensional outputs. Let’s take $p = 2$ and $\ell = 2$ as an example.
Define the polynomial feature map as
$$\boldsymbol{\phi}(\mathbf{x}) = \begin{bmatrix} x_1^2 & x_2^2 & \sqrt{2}x_1x_2 & \sqrt{2}x_1 & \sqrt{2}x_2 & 1 \end{bmatrix}^\top$$
1. Raising the inputs to the augmented feature space and taking their inner product
2. Computing $(\mathbf{x}^\top\mathbf{z} + 1)^2$, which involves an inner product of the raw-feature inputs
Clearly, the latter option is much cheaper to calculate, taking $O(\ell + \log p)$ time instead of $O(\ell^p)$
time. In fact, this concept generalizes to arbitrary $\ell$ and $p$: for $p$-degree polynomial
features, we have that
$$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + 1)^p$$
The kernel trick makes computations significantly cheaper to perform, making kernelization much
more appealing! The takeaway here is that no matter what the degree p is, the computational
complexity is the same — it is only dependent on the dimensionality of the raw feature space!
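As a quick sanity check of the $p = 2$, $\ell = 2$ example, the following sketch compares the explicit inner product in the augmented space with the kernel-trick value. The function names `phi` and `poly_kernel` and the two sample points are illustrative choices, not part of the original text.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input (the example above)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

def poly_kernel(x, z, p=2):
    """Kernel trick: evaluates (x^T z + 1)^p using only the raw features."""
    return (x @ z + 1.0) ** p

# Two arbitrary sample points (hypothetical values chosen for illustration).
x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)      # inner product in the 6-dimensional augmented space
implicit = poly_kernel(x, z)    # same quantity, computed in the 2-dimensional raw space
print(explicit, implicit)
```

Both quantities agree for these inputs, while `poly_kernel` never forms the augmented feature vectors.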
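Note that we can equivalently express the degree-2 polynomial features problem using the more
natural mapping
$$\tilde{\boldsymbol{\phi}}(\mathbf{x}) = \begin{bmatrix} x_1^2 & x_2^2 & x_1x_2 & x_1 & x_2 & 1 \end{bmatrix}^\top$$
in which case the kernel function would be expressed as
$$k(\mathbf{x}, \mathbf{z}) = \tilde{\boldsymbol{\phi}}(\mathbf{x})^\top\boldsymbol{\Sigma}\,\tilde{\boldsymbol{\phi}}(\mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + 1)^2, \qquad \boldsymbol{\Sigma} = \mathrm{Diag}(1, 1, 2, 2, 2, 1)$$
Thus we can view kernel ridge regression with the kernel trick in two ways:
1. i.i.d. prior $W \sim \mathcal{N}(\mathbf{0}, \mathrm{Diag}(1, 1, 1, 1, 1, 1))$, using the feature mapping $\boldsymbol{\phi}(\mathbf{x})$
2. non-i.i.d. prior $W \sim \mathcal{N}(\mathbf{0}, \mathrm{Diag}(1, 1, 2, 2, 2, 1))$, using the feature mapping $\tilde{\boldsymbol{\phi}}(\mathbf{x})$ (note
that the kernel trick is only applicable for this specific setting of $\boldsymbol{\Sigma}$; it does not necessarily
apply to arbitrary $\boldsymbol{\Sigma}$.)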
Computational Analysis
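1. Kernelized
Computing the $\mathbf{K}$ term takes $O(n^2(\ell + \log p))$, and inverting the matrix takes $O(n^3)$. These
two computations dominate, for a total computation time of $O(n^3 + n^2(\ell + \log p))$.
2. Non-kernelized
$$\hat{y} = \langle \boldsymbol{\phi}(\mathbf{z}), \mathbf{w}^* \rangle = \boldsymbol{\phi}(\mathbf{z})^\top(\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \lambda\mathbf{I})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}$$
Computing the $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ term takes $O(d^2 n)$, and inverting the matrix takes $O(d^3)$. These two
computations dominate, for a total computation time of $O(d^3 + d^2 n)$.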
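The two routes can be compared numerically. The sketch below is an illustration on random data (the sizes, the seed, and all variable names are assumptions): it solves ridge regression once through the $d \times d$ system and once through the $n \times n$ Gram-matrix system, and checks that the weights and a test prediction coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.1              # hypothetical problem sizes and regularization
Phi = rng.normal(size=(n, d))       # stands in for the augmented data matrix
y = rng.normal(size=n)

# Non-kernelized form: solve a d x d linear system.
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Kernelized form: solve an n x n linear system involving the Gram matrix K.
K = Phi @ Phi.T
v = np.linalg.solve(K + lam * np.eye(n), y)
w_dual = Phi.T @ v

# Prediction at a test point, given here directly by its feature vector.
phi_z = rng.normal(size=d)
y_hat_primal = phi_z @ w_primal
y_hat_dual = (Phi @ phi_z) @ v      # sum_i v_i * k(x_i, z); only inner products are needed
```

Which route is cheaper depends on whether $n$ or $d$ is larger, as the complexity analysis above suggests.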
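Suppose we want to solve the least squares objective, subject to a constraint that $\mathbf{w}$ is sparse.
Mathematically this is expressed as
$$\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 \quad \text{s.t.} \quad \|\mathbf{w}\|_0 \le k$$
where the $\ell_0$ norm of $\mathbf{w}$ is simply the number of non-zero elements in $\mathbf{w}$. This quantity is otherwise
known as the Hamming distance between $\mathbf{w}$ and $\mathbf{0}$, the vector of zeros.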
There are several motivations for designing optimization problems with sparse solutions. One
advantage is that sparse weights speed up testing time. In the context of primal problems, if the
weight vector w is sparse, then after we compute w in training, we can discard features/dimensions
with 0 weight, as they will contribute nothing to the evaluation of the hypothesized regression
values of test points. A similar reasoning applies to dual problems with dual weight vector v,
allowing us to discard the training points corresponding to dual weight 0, ultimately allowing for
faster evaluation of our hypothesis function on test points.
Note that the $\ell_0$ norm does not actually satisfy the properties of a norm, as is evident from the fact that
it is not convex, a property that all norms share. Solving this optimization problem is NP-hard,
so we instead aim to find a computationally feasible alternative method that can approximate the
optimal solution. We will present two such methods: LASSO, a relaxed version of the problem that
replaces the $\ell_0$ norm with an $\ell_1$ norm, and matching pursuit, a greedy algorithm that iteratively
updates one entry of $\mathbf{w}$ at a time until the sparsity constraint can no longer be satisfied.
LASSO
The least absolute shrinkage and selection operator (LASSO), introduced in 1996 by Robert
Tibshirani, is identical to the sparse least squares objective, except that the $\ell_0$ norm penalizing $\mathbf{w}$
is now changed to an $\ell_1$ norm:
$$\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 \quad \text{s.t.} \quad \|\mathbf{w}\|_1 \le k$$
2.6. SPARSE LEAST SQUARES 51
(The $k$ in the constraint is not necessarily the same $k$ as in the sparse least squares objective.) The
$\ell_1$ norm of $\mathbf{w}$ is the sum of absolute values of its entries:
$$\|\mathbf{w}\|_1 = \sum_{i=1}^d |w_i|$$
Unlike the $\ell_0$ norm, the $\ell_1$ norm actually satisfies the properties of norms. The relaxation from
the $\ell_0$ to the $\ell_1$ norm is desirable because it makes the optimization problem convex, so it is no longer
NP-hard to solve. But does the $\ell_1$ norm still induce sparsity like the $\ell_0$ norm? As we will see, the answer
is yes!
Due to strong duality, we can equivalently express the LASSO problem in the unconstrained form
$$\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_1$$
We make a striking observation here: LASSO is identical to the ridge regression objective, except
that the $\ell_2$ norm (squared) penalizing $\mathbf{w}$ is now changed to an $\ell_1$ norm (with no squaring term).
Recall that the $\ell_2$ norm squared of $\mathbf{w}$ is the sum of squared values of its entries:
$$\|\mathbf{w}\|_2^2 = \sum_{i=1}^d w_i^2$$
As it turns out, the simple change from the $\ell_2$ to the $\ell_1$ norm inherently leads to a sparse solution.
In fact, the sparsity-inducing properties of the $\ell_1$ norm are not unique to least squares. To
illustrate the point, let’s take a step away from least squares for a moment and discuss the $\ell_1$ norm
in the context of SVMs. Recall the soft-margin SVM problem (constraints omitted for brevity):
$$\min_{\mathbf{w}, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$
The slack ξi is constrained to be either positive or zero. Note that if a point xi has a nonzero
slack ξi > 0, by definition it must lie inside the margin. Due to the heavy penalty factor C for
violating the margin there are relatively few such points, and thus the slack vector ξ is sparse
— most of its entries are 0. We are interested in explaining why this phenomenon occurs in this
specific optimization problem, and identifying the key properties that determine sparse solutions
for arbitrary optimization problems.
To reason about the SVM case, let’s see how changing some arbitrary slack variable $\xi_i$ affects the
loss. A unit decrease in $\xi_i$ results in a “reward” of $C$, which is captured by the partial derivative
$\frac{\partial L}{\partial \xi_i}$. Note that no matter what the current value of $\xi_i$ is, the reward for decreasing $\xi_i$ is constant.
Of course, decreasing ξi may change the boundary and thus the cost attributed to the size of the
margin kwk2 . The overall reward for decreasing ξi is either going to be worth the effort (greater
than cost incurred from w) or not worth the effort (less than cost incurred from w). Intuitively, ξi
will continue to decrease until it hits a lower-bound “equilibrium” — which is often just 0.
Now consider the following formulation (constraints omitted for brevity again):
$$\min_{\mathbf{w}, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i^2$$
The reward for decreasing ξi is no longer constant — at any point, a unit decrease in ξi results in a
“reward” of 2Cξi . As ξi approaches 0, the rewards get smaller and smaller, reaching infinitesimal
values. On the other hand, decreasing ξi causes a finite increase in the cost incurred by the kwk2
— the same increase in cost as in the previous example. Intuitively, we can reason that there will
be a threshold value ξi∗ such that decreasing ξi further will no longer outweigh the cost incurred by
the size of the margin, and that the ξi ’s will halt their descent before they hit zero.
The same reasoning applies to least squares as well. For any particular component wi of w, the
corresponding loss in LASSO is the absolute value |wi |, while the loss in ridge regression is the
squared term wi2 . In the case of LASSO the “reward” for decreasing wi by a unit amount is a
constant λ, while for ridge regression the equivalent “reward” is 2λwi , which depends on the value
of wi . There is a compelling geometric argument behind this reasoning as well.
Figure 2.2: Comparing contour plots for LASSO (left) vs. ridge regression (right).
Suppose for simplicity that we are only working with 2-dimensional data points and are thus
optimizing over two weight variables $w_1$ and $w_2$. In both figures above, the red ellipses represent
isocontours in $\mathbf{w}$-space of the squared loss $\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2$. In ridge regression, each isocontour of
$\lambda\|\mathbf{w}\|_2^2$ is represented by a circle, one of which is shown in the right figure. Note that the optimal
$\mathbf{w}$ will only occur at points of tangency between the red ellipse and the blue circle. Otherwise
we could always move along the isocontour of one of the functions (keeping its overall cost fixed)
while improving the value of the other function, thereby improving the overall value of the
loss function. We can’t really infer much about these points of tangency other than the fact that
the blue circle centered at the origin draws the optimal point closer to the origin (ridge regression
penalizes large weights).
Now, let’s examine the LASSO case. The red ellipses represent the same objective $\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2$,
but now the $\ell_1$ regularization term $\lambda\|\mathbf{w}\|_1$ is represented by diamond isocontours. As with ridge
regression, note that the optimal point in $\mathbf{w}$-space must occur at a point where the
ellipse and the diamond touch. Due to the “pointy” property of the diamonds, this contact is very likely to
happen at one of the corners of the diamond, because the corners are isolated points from which the rest of the
diamond recedes. And what are the corners of the diamond? Why, they are points at
which one component of $\mathbf{w}$ is 0!
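The constant-versus-shrinking “reward” argument can be made concrete in one dimension. The sketch below is an illustration only: it assumes a scalar objective $(w - c)^2 + \lambda|w|$ for the $\ell_1$ penalty and $(w - c)^2 + \lambda w^2$ for the squared penalty, where $c$ plays the role of the unregularized solution; the function names are hypothetical.

```python
import numpy as np

def lasso_1d(c, lam):
    """Minimizer of (w - c)^2 + lam * |w|: shrink c toward 0 by lam / 2, clipping at 0."""
    return np.sign(c) * max(abs(c) - lam / 2.0, 0.0)

def ridge_1d(c, lam):
    """Minimizer of (w - c)^2 + lam * w^2: rescale c, which never reaches exactly 0."""
    return c / (1.0 + lam)

lam = 1.0
for c in [2.0, 0.4, -0.3]:
    # Columns: unregularized value, l1-penalized minimizer, squared-penalty minimizer.
    print(c, lasso_1d(c, lam), ridge_1d(c, lam))
```

Under these assumptions, small values of $c$ are set exactly to zero by the $\ell_1$ penalty, while the squared penalty only scales them down.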
Solving LASSO
Convinced that LASSO achieves sparsity, let’s now find the optimal solution to LASSO. Unlike
ridge regression, it is not clear what the closed-form solution is through linear algebra or
gradient methods, since the objective function is not differentiable (due to the “pointiness” of the $\ell_1$
norm). Specifically, the objective is non-differentiable exactly at points where some weight equals 0,
which is where LASSO solutions tend to lie. Note however, that the objective is still convex, and we could
use an iterative method such as subgradient descent or line search to solve the problem. Here, we
will use a line search method called coordinate descent.
While SGD focuses on iteratively optimizing the value of the objective L(w) for each sample in the
training set, coordinate descent iteratively optimizes the value of the objective for each feature.
Coordinate descent is guaranteed to find the global minimum if L is jointly convex. No such
guarantees can be made however if L is only elementwise convex, since it may have local minima.
To understand why, let’s start by understanding elementwise vs joint convexity. Suppose we are
trying to minimize $f(x, y)$, a function of two scalar variables $x$ and $y$. For simplicity, assume that $f$
is twice differentiable, so we can take its Hessian. $f(x, y)$ is element-wise convex in $x$ if its second
derivative with respect to $x$ is nonnegative when $y$ is fixed:
$$\frac{\partial^2}{\partial x^2} f(x, y) \ge 0$$
The same goes for element-wise convexity in $y$.
$f(x, y)$ is jointly convex in $x$ and $y$ if its Hessian $\nabla^2 f(x, y)$ is PSD. Note that being element-wise
convex in both $x$ and $y$ does not imply joint convexity in $x$ and $y$ (consider $f(x, y) = x^2 + y^2 - 4xy$
as an example). However, joint convexity in $x$ and $y$ does imply element-wise convexity
in both $x$ and $y$.
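The counterexample $f(x, y) = x^2 + y^2 - 4xy$ can be checked directly. The sketch below writes out its (constant) Hessian by hand and inspects the diagonal entries and the eigenvalues; the variable names are illustrative.

```python
import numpy as np

def f(x, y):
    return x**2 + y**2 - 4 * x * y

# Hessian of f, computed by hand; it does not depend on (x, y).
H = np.array([[2.0, -4.0],
              [-4.0, 2.0]])

# Element-wise convexity only looks at the diagonal second derivatives.
elementwise_convex = bool(H[0, 0] >= 0 and H[1, 1] >= 0)

# Joint convexity requires every eigenvalue of the Hessian to be nonnegative.
eigenvalues = np.linalg.eigvalsh(H)          # returned in ascending order
jointly_convex = bool(np.all(eigenvalues >= 0))

print(elementwise_convex, jointly_convex, eigenvalues)
```

Along the line $x = y = t$ the function equals $-2t^2$, which is unbounded below, so no global minimum exists even though each one-variable slice is convex.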
Now, if $f(x, y)$ were jointly convex, then we could compute the gradient with respect to $x$ and $y$,
set it to 0, and be guaranteed that the resulting point is the global minimum. Can we do this if $f(x, y)$ is
only element-wise convex in both $x$ and $y$? Even though it is true that $\min_{x,y} f(x, y) = \min_x \min_y f(x, y)$,
we can’t always just set gradients to 0 if $f(x, y)$ is not jointly convex. While the inner optimization
problem over $y$ is convex, the outer optimization problem over $x$ may no longer be convex. When
joint convexity does not hold, there is no clean strategy for finding the global minimum, and we
must analyze all of the critical points to find the minimum.
In the case of LASSO, the objective function is jointly convex, so we can use coordinate descent.
There are a few details to be filled in, namely the choice of which feature to update and how $w_i$
is updated. One simple way is to just pick a random feature $i$ each iteration. After choosing the
feature, we have to update $w_i \leftarrow \arg\min_{w_i} L(\mathbf{w})$. For LASSO, it turns out there is a closed-form
solution (note that we are only minimizing with respect to one feature instead of all the features).
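Let’s solve the line search problem $\min_{w_i} L(\mathbf{w})$. For convenience, let’s separate the terms that
depend on $w_i$ from those that don’t. Denoting $\mathbf{x}_j$ as the $j$-th column of $\mathbf{X}$, we have
$$\begin{aligned}
L(\mathbf{w}) &= \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_1 \\
&= \Big\|\sum_{j=1}^d w_j\mathbf{x}_j - \mathbf{y}\Big\|_2^2 + \lambda|w_i| + \lambda\sum_{j \ne i}|w_j| \\
&= \|w_i\mathbf{x}_i + \mathbf{r}\|_2^2 + \lambda|w_i| + C
\end{aligned}$$
where $\mathbf{r} = \sum_{j \ne i} w_j\mathbf{x}_j - \mathbf{y}$ and $C = \lambda\sum_{j \ne i}|w_j|$. The objective can in turn be written as
$$L(\mathbf{w}) = \lambda|w_i| + C + \sum_{j=1}^n (w_i x_{ji} + r_j)^2$$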
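Suppose first that the minimizer satisfies $w_i^* > 0$. On $(0, \infty)$ the objective is differentiable in $w_i$,
and setting its derivative to zero gives
$$w_i^* = \frac{-\lambda + a}{b}$$
But this only holds if the right hand side, $\frac{-\lambda + a}{b}$, is actually positive. If it is negative or 0, then this
means there is no optimum in $(0, \infty)$.
When $w_i^* < 0$, similar calculations lead to
$$w_i^* = \frac{\lambda + a}{b}$$
Again, this only holds if $\frac{\lambda + a}{b}$ is actually negative. If it is positive or 0, then there is no optimum in
$(-\infty, 0)$.
If neither of the conditions $\frac{-\lambda + a}{b} > 0$ or $\frac{\lambda + a}{b} < 0$ holds, then there is no optimum in $(-\infty, 0)$ or $(0, \infty)$.
But the LASSO objective is convex in $w_i$ and has an optimum somewhere, thus in this case $w_i^* = 0$.
In order for this case to hold, we must have that $\frac{-\lambda + a}{b} \le 0$ and $\frac{\lambda + a}{b} \ge 0$. Rearranging (with $b > 0$), we can see
this is equivalent to $|a| \le \lambda$.
Examine each of the following cases:
• $\frac{-\lambda + a}{b} \le 0$ and $\frac{\lambda + a}{b} \ge 0$: $w_i^* = 0$
• $\frac{-\lambda + a}{b} \le 0$ and $\frac{\lambda + a}{b} < 0$: $w_i^* < 0$
• $\frac{-\lambda + a}{b} > 0$ and $\frac{\lambda + a}{b} \ge 0$: $w_i^* > 0$
• $\frac{-\lambda + a}{b} > 0$ and $\frac{\lambda + a}{b} < 0$: impossible, since this implies $\frac{-\lambda + a}{b} > \frac{\lambda + a}{b}$ while $\lambda$ and $b$ are non-negative
Here $a$ and $b$ are defined as
$$a = -\sum_{j=1}^n 2x_{ji}r_j, \qquad b = \sum_{j=1}^n 2x_{ji}^2$$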
This is not a gradient-descent update: we have a closed-form solution for the optimum $w_i$, given
that all of the other weights are fixed constants. We can see explicitly how the LASSO objective
induces sparsity: $a$ is some function of the data and the other weights, and when $|a| \le \lambda$, we set
$w_i = 0$ in this iteration of coordinate descent. By increasing $\lambda$, we increase the threshold of $|a|$ for
$w_i$ to be set to 0, and our solution becomes more sparse. Also note that the term $\frac{a}{b}$ is the least
squares solution (without regularization), so we can see that the regularization term tries to pull
the least squares update towards 0.
One subtle point: during coordinate descent, weights can be “reactivated” after having been set to
0 in a previous iteration, since a is affected by factors other than wi .
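A minimal sketch of coordinate descent for the objective $\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_1$, using the closed-form update with $a$ and $b$ as defined above. It is an illustration, not a tuned solver: the function name, the fixed number of sweeps, and the choice to cycle through features in order (rather than picking them at random) are all assumptions.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Coordinate descent sketch for ||Xw - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    b = 2.0 * np.sum(X**2, axis=0)              # b_i = sum_j 2 * x_ji^2
    for _ in range(n_sweeps):
        for i in range(d):                       # cyclic order, an assumption of this sketch
            if b[i] == 0.0:
                continue                         # an all-zero feature cannot help; leave w_i alone
            r = X @ w - w[i] * X[:, i] - y       # r = sum_{j != i} w_j x_j - y
            a = -2.0 * (X[:, i] @ r)
            if a > lam:
                w[i] = (a - lam) / b[i]          # positive branch
            elif a < -lam:
                w[i] = (a + lam) / b[i]          # negative branch
            else:
                w[i] = 0.0                       # |a| <= lam: the weight is set exactly to 0
    return w

# Example usage on synthetic data (hypothetical sizes and values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.01 * rng.normal(size=50)
print(lasso_coordinate_descent(X_demo, y_demo, lam=5.0))
```

Recomputing `r` from scratch at every step keeps the code close to the derivation; a practical implementation would likely update the residual incrementally.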
Matching Pursuit
Rather than relaxing the $\ell_0$ constraint (as seen in LASSO), the matching pursuit algorithm
keeps the constraint, and instead finds an approximate solution to the sparse least squares problem
in a greedy fashion. The algorithm starts with a completely sparse solution ($\mathbf{w}^0 = \mathbf{0}$), and
iteratively updates $\mathbf{w}$ until the sparsity constraint $\|\mathbf{w}\|_0 \le k$ can no longer be met. At iteration
$t$, the algorithm can only update one entry of $\mathbf{w}^{t-1}$, and it chooses the feature that minimizes the
(squared) norm of the resulting residual $\|\mathbf{r}^t\|^2 = \|\mathbf{y} - \mathbf{X}\mathbf{w}^t\|^2$.
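Algorithm 2: Matching Pursuit
initialize the weights $\mathbf{w}^0 = \mathbf{0}$ and the residual $\mathbf{r}^0 = \mathbf{y} - \mathbf{X}\mathbf{w}^0 = \mathbf{y}$
while $\|\mathbf{w}\|_0 < k$ do
    find the feature $i$ for which the length of the projected residual onto $\mathbf{x}_i$ is maximized:
        $$i = \arg\min_j \Big(\min_\nu \|\mathbf{r}^{t-1} - \nu\mathbf{x}_j\|\Big) = \arg\max_j \frac{|\langle \mathbf{r}^{t-1}, \mathbf{x}_j \rangle|}{\|\mathbf{x}_j\|}$$
    update the $i$-th weight:
        $$w_i^t = w_i^{t-1} + \frac{\langle \mathbf{r}^{t-1}, \mathbf{x}_i \rangle}{\|\mathbf{x}_i\|^2}$$
    update the residual:
        $$\mathbf{r}^t = \mathbf{y} - \mathbf{X}\mathbf{w}^t$$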
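At iteration $t$, we pick the coordinate $i$ such that the distance from the residual $\mathbf{r}^{t-1}$ to the line spanned by $\mathbf{x}_i$ (the $i$’th
column of $\mathbf{X}$, corresponding to feature $i$, not datapoint $i$) is minimized:
$$i = \arg\min_j \Big(\min_\nu \|\mathbf{r}^{t-1} - \nu\mathbf{x}_j\|\Big)$$
This equates to finding the index $i$ for which the length of the projection onto $\mathbf{x}_i$ is maximized:
$$i = \arg\max_j \frac{|\langle \mathbf{r}^{t-1}, \mathbf{x}_j \rangle|}{\|\mathbf{x}_j\|}$$
Let’s see why this is true. The inner optimization problem $\min_\nu \|\mathbf{r}^{t-1} - \nu\mathbf{x}_j\|$ is simply a projection
problem, and its solution is
$$\nu^* = \frac{\langle \mathbf{r}^{t-1}, \mathbf{x}_j \rangle}{\langle \mathbf{x}_j, \mathbf{x}_j \rangle}$$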
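Now that we have found the feature $i$ that maximizes the length of the projection, we must update
the corresponding weight $w_i$ and the residual vector. The updated residual $\mathbf{r}^t$ is what remains of
$\mathbf{r}^{t-1}$ after subtracting its projection onto feature $\mathbf{x}_i$:
$$\mathbf{r}^t = \mathbf{r}^{t-1} - \frac{\langle \mathbf{r}^{t-1}, \mathbf{x}_i \rangle}{\|\mathbf{x}_i\|^2}\mathbf{x}_i$$
Figure 2.3: The updated residual $\mathbf{r}^t$, current residual $\mathbf{r}^{t-1}$, and scaled feature $\frac{\langle \mathbf{r}^{t-1}, \mathbf{x}_i \rangle}{\|\mathbf{x}_i\|^2}\mathbf{x}_i$ form a right triangle.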
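The steps of matching pursuit can be sketched in a few lines. The function below is an illustration under some assumptions not stated in the text: every column of `X` is nonzero, an iteration cap `max_iter` guards against long runs when already-selected features are picked again, and the loop stops early if the residual is orthogonal to all features.

```python
import numpy as np

def matching_pursuit(X, y, k, max_iter=1000):
    """Greedy sketch of matching pursuit; assumes every column of X is nonzero."""
    n, d = X.shape
    w = np.zeros(d)
    r = y.astype(float).copy()                   # r^0 = y because w^0 = 0
    norms = np.linalg.norm(X, axis=0)
    for _ in range(max_iter):
        if np.count_nonzero(w) >= k:             # sparsity budget is used up
            break
        scores = np.abs(X.T @ r) / norms         # projection lengths |<r, x_j>| / ||x_j||
        i = int(np.argmax(scores))
        if scores[i] < 1e-12:                    # residual is orthogonal to every feature
            break
        step = (X[:, i] @ r) / norms[i] ** 2     # <r, x_i> / ||x_i||^2
        w[i] += step
        r = r - step * X[:, i]                   # remove the projection from the residual
    return w, r

# Example usage with orthonormal features (hypothetical data).
w_mp, r_mp = matching_pursuit(np.eye(4), np.array([3.0, 0.0, -2.0, 0.5]), k=2)
print(w_mp, r_mp)
```

With orthonormal features, the algorithm simply keeps the $k$ entries of $\mathbf{y}$ that are largest in magnitude.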
The Orthogonal Matching Pursuit (OMP) algorithm is an extension to the standard Matching
Pursuit algorithm with the following difference: at iteration $t$, we maintain a set $I^t$ of all features
selected by the algorithm so far, and instead of updating just the one weight corresponding to feature
$i$ found at iteration $t$, we update all weights corresponding to the features in $I^t$ using Least Squares.
Algorithm 3: Orthogonal Matching Pursuit
initialize the weights $\mathbf{w}^0 = \mathbf{0}$ and the residual $\mathbf{r}^0 = \mathbf{y} - \mathbf{X}\mathbf{w}^0 = \mathbf{y}$
initialize a set of features $I^0 = \emptyset$
while $\|\mathbf{w}\|_0 < k$ do
    find the feature $i$ for which the length of the projected residual onto $\mathbf{x}_i$ is maximized:
        $$i = \arg\min_j \Big(\min_\nu \|\mathbf{r}^{t-1} - \nu\mathbf{x}_j\|\Big) = \arg\max_j \frac{|\langle \mathbf{r}^{t-1}, \mathbf{x}_j \rangle|}{\|\mathbf{x}_j\|}$$
    $I^t = I^{t-1} \cup \{i\}$
    Estimate the best linear fit of the target $\mathbf{y}$ using the features obtained so far. Given that
    we have found $t$ good features, we now find the best linear fit for the target $\mathbf{y}$ using these
    $t$ features. Define $\mathbf{X}_t = [\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_t}]$, made up of these $t$ features. Then we determine $\mathbf{w}^t$
    as the solution of the following least-squares problem:
        $$\mathbf{w}^t = \arg\min_{\mathbf{w}} \|\mathbf{X}_t\mathbf{w} - \mathbf{y}\|_2^2$$
The motivation for OMP is as follows: if we do not refit on all features updated so far after choosing
a new feature, the new residual will not necessarily be orthogonal to the span of the canonical
basis vectors corresponding to those chosen coordinates, and is therefore not optimal. The OMP
algorithm ensures that wt corresponds to an optimal Least Squares solution if we restricted our
features to just those in I t .
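A minimal sketch of OMP along the lines described above. It is illustrative rather than definitive: the function name is hypothetical, the loop runs until $k$ features have been selected (rather than testing $\|\mathbf{w}\|_0$ directly), already-selected features are excluded from the search, and it assumes $k \le d$ and nonzero columns.

```python
import numpy as np

def orthogonal_matching_pursuit(X, y, k):
    """OMP sketch: greedy feature selection followed by a least-squares refit."""
    n, d = X.shape
    w = np.zeros(d)
    r = y.astype(float).copy()
    selected = []                                 # plays the role of the index set I^t
    norms = np.linalg.norm(X, axis=0)
    while len(selected) < k:
        scores = np.abs(X.T @ r) / norms
        if selected:
            scores[selected] = -np.inf            # never pick the same feature twice
        i = int(np.argmax(scores))
        selected.append(i)
        # Refit all selected weights jointly by least squares on the chosen columns.
        w_sel, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        w = np.zeros(d)
        w[selected] = w_sel
        r = y - X @ w                             # residual is orthogonal to the chosen columns
    return w, selected

# Example usage on random data (hypothetical sizes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 6))
y_demo = rng.normal(size=30)
w_omp, chosen = orthogonal_matching_pursuit(X_demo, y_demo, k=3)
```

After each refit the residual is orthogonal to every selected column, which is the property that plain matching pursuit does not guarantee.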
Previously, we have covered Ordinary Least Squares (OLS) which assumes that the dependent
variable y is noisy but the independent variables x are noise-free. We now discuss Total Least
Squares (TLS), where we assume that our independent variables are also corrupted by noise. For
this reason, TLS is considered an errors-in-variables model.
A probabilistic motivation?
We might begin with a probabilistic formulation and fit the parameters via maximum likelihood
estimation, as before. Consider for simplicity a one-dimensional linear model
$$y_{\text{true}} = w\,x_{\text{true}}$$
Low-rank formulation
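To solve the TLS problem, we develop another formulation that can be solved using the singular
value decomposition. To motivate this formulation, recall that in OLS we attempt to minimize
$\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$, which is equivalent to
$$\min_{\mathbf{w}, \boldsymbol{\epsilon}} \|\boldsymbol{\epsilon}\|_2^2 \quad \text{subject to} \quad \mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$$
This only accounts for errors in the dependent variable, so for TLS we introduce a second residual
$\boldsymbol{\epsilon}_X \in \mathbb{R}^{n \times d}$ to account for independent variable error:
$$\min_{\mathbf{w}, \boldsymbol{\epsilon}_X, \boldsymbol{\epsilon}_y} \left\| \begin{bmatrix} \boldsymbol{\epsilon}_X & \boldsymbol{\epsilon}_y \end{bmatrix} \right\|_F^2 \quad \text{subject to} \quad (\mathbf{X} + \boldsymbol{\epsilon}_X)\mathbf{w} = \mathbf{y} + \boldsymbol{\epsilon}_y$$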
For comparison to the OLS case, note that the Frobenius norm is essentially the same as the 2-norm,
just applied to the elements of a matrix rather than a vector.
From a probabilistic perspective, finding the most likely value of a Gaussian corresponds to min-
imizing the squared distance from the mean. Since we assume the noise is 0-centered, we want
to minimize the sum of squares of each entry in the error matrix, which corresponds exactly to
minimizing the Frobenius norm.
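In order to separate out the terms being minimized, we rearrange the constraint equation as
$$\underbrace{\begin{bmatrix} \mathbf{X} + \boldsymbol{\epsilon}_X & \mathbf{y} + \boldsymbol{\epsilon}_y \end{bmatrix}}_{\in \mathbb{R}^{n \times (d+1)}} \begin{bmatrix} \mathbf{w} \\ -1 \end{bmatrix} = \mathbf{0}$$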
2.7. TOTAL LEAST SQUARES 59
This expression tells us that the vector $\begin{bmatrix} \mathbf{w}^\top & -1 \end{bmatrix}^\top$ lies in the nullspace of the matrix on the left.
However, if the matrix is full rank, its nullspace contains only $\mathbf{0}$, and thus the equation cannot
be satisfied (since the last component, $-1$, is always nonzero). Therefore we must choose the
perturbations $\boldsymbol{\epsilon}_X$ and $\boldsymbol{\epsilon}_y$ in such a way that the matrix is not full rank.
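It turns out that there is a mathematical result, the Eckart-Young theorem, that can help us
pick these perturbations. This theorem essentially says that the best low-rank approximation (in
terms of the Frobenius norm; see footnote 2) is obtained by throwing away the smallest singular values.
Theorem. Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$ has rank $r \le \min(m, n)$, and let $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \sum_{i=1}^r \sigma_i\mathbf{u}_i\mathbf{v}_i^\top$ be its
singular value decomposition. Then, for $k \le r$,
$$\mathbf{A}_k = \sum_{i=1}^k \sigma_i\mathbf{u}_i\mathbf{v}_i^\top = \mathbf{U}
\begin{bmatrix}
\sigma_1 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & \sigma_k & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0
\end{bmatrix}
\mathbf{V}^\top$$
is a best rank-$k$ approximation to $\mathbf{A}$, in the sense that
$$\|\mathbf{A} - \mathbf{A}_k\|_F \le \|\mathbf{A} - \tilde{\mathbf{A}}\|_F$$
for any $\tilde{\mathbf{A}}$ with $\mathrm{rank}(\tilde{\mathbf{A}}) \le k$.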
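The theorem can be illustrated numerically. The sketch below is a check on random data, not a proof: it builds the truncated SVD $\mathbf{A}_k$, compares its Frobenius error with the discarded singular values, and compares against a handful of other rank-$k$ matrices. The matrix sizes, the seed, and the way the competing matrices are sampled are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]             # keep only the k largest singular values

err_svd = np.linalg.norm(A - A_k, "fro")
err_formula = np.sqrt(np.sum(s[k:] ** 2))        # error from the discarded singular values

# Frobenius errors of some other (randomly generated) rank-k matrices.
errs_other = []
for _ in range(100):
    B = rng.normal(size=(8, k)) @ rng.normal(size=(k, 6))
    errs_other.append(np.linalg.norm(A - B, "fro"))

print(err_svd, err_formula, min(errs_other))
```

The truncated SVD attains an error equal to the root sum of squares of the discarded singular values, and none of the sampled competitors does better.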
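Let us assume that the data matrix $\begin{bmatrix} \mathbf{X} & \mathbf{y} \end{bmatrix}$ is full rank (see footnote 3). Write its singular value decomposition:
$$\begin{bmatrix} \mathbf{X} & \mathbf{y} \end{bmatrix} = \sum_{i=1}^{d+1} \sigma_i\mathbf{u}_i\mathbf{v}_i^\top$$
Then the Eckart-Young theorem tells us that the best rank-$d$ approximation to this matrix is
$$\begin{bmatrix} \mathbf{X} + \boldsymbol{\epsilon}_X & \mathbf{y} + \boldsymbol{\epsilon}_y \end{bmatrix} = \sum_{i=1}^{d} \sigma_i\mathbf{u}_i\mathbf{v}_i^\top$$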
Note that this requires the (d + 1)st component of vd+1 to be nonzero. (See Sectionfor details.)
2 There is a more general version that holds for any unitary invariant norm.
3 This should be the case in practice because the noise will cause y not to lie in the columnspace of X.
In a sense, above we have solved the problem of total least squares by reducing it to computing an
appropriate SVD. Once we have vd+1 , or any scalar multiple of it, we simply rescale it so that the
last component is −1, and then the first d components give us w. However, we can look at this
more closely to uncover the relationship between TLS and the ideas of regularization that we have
seen earlier in the course.
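The SVD recipe just described can be sketched as follows. The function name is hypothetical, and the sketch assumes $n \ge d + 1$ and that the last component of the selected right-singular vector is nonzero (it raises an error otherwise).

```python
import numpy as np

def total_least_squares(X, y):
    """TLS sketch: rescale the last right-singular vector of [X y] so its last entry is -1."""
    n, d = X.shape
    Z = np.column_stack([X, y])                   # the n x (d+1) matrix [X y]
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    v = Vt[-1]                                    # right-singular vector of the smallest singular value
    if abs(v[-1]) < 1e-12:
        raise ValueError("last component of v_{d+1} is (numerically) zero; no TLS solution")
    return -v[:d] / v[-1]                         # first d components after rescaling

# Example usage: noise in both X and y (hypothetical data).
rng = np.random.default_rng(0)
X_clean = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
X_noisy = X_clean + 0.1 * rng.normal(size=X_clean.shape)
y_noisy = X_clean @ w_true + 0.1 * rng.normal(size=50)
w_tls = total_least_squares(X_noisy, y_noisy)
```

On noise-free data the smallest singular value is zero and the sketch recovers the generating weights exactly, up to floating-point error.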
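Since $\mathbf{v}_{d+1}$ is a right-singular vector of $\begin{bmatrix} \mathbf{X} & \mathbf{y} \end{bmatrix}$, it is an eigenvector of the matrix
$$\begin{bmatrix} \mathbf{X} & \mathbf{y} \end{bmatrix}^\top \begin{bmatrix} \mathbf{X} & \mathbf{y} \end{bmatrix} = \begin{bmatrix} \mathbf{X}^\top\mathbf{X} & \mathbf{X}^\top\mathbf{y} \\ \mathbf{y}^\top\mathbf{X} & \mathbf{y}^\top\mathbf{y} \end{bmatrix}$$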
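The step from this eigenvector relation to a closed-form expression for the TLS weights can be sketched as follows. This is a reconstruction under the assumptions used in the surrounding text: $\mathbf{v}_{d+1}$ is proportional to $\begin{bmatrix} \mathbf{w}^\top & -1 \end{bmatrix}^\top$, its eigenvalue is $\sigma_{d+1}^2$, and $\mathbf{X}^\top\mathbf{X} - \sigma_{d+1}^2\mathbf{I}$ is invertible.

```latex
\begin{aligned}
\begin{bmatrix} \mathbf{X}^\top\mathbf{X} & \mathbf{X}^\top\mathbf{y} \\ \mathbf{y}^\top\mathbf{X} & \mathbf{y}^\top\mathbf{y} \end{bmatrix}
\begin{bmatrix} \mathbf{w} \\ -1 \end{bmatrix}
&= \sigma_{d+1}^2 \begin{bmatrix} \mathbf{w} \\ -1 \end{bmatrix} \\
\text{(top block)} \quad \mathbf{X}^\top\mathbf{X}\mathbf{w} - \mathbf{X}^\top\mathbf{y} &= \sigma_{d+1}^2 \mathbf{w} \\
(\mathbf{X}^\top\mathbf{X} - \sigma_{d+1}^2\mathbf{I})\,\mathbf{w} &= \mathbf{X}^\top\mathbf{y} \\
\hat{\mathbf{w}}_{\text{tls}} &= (\mathbf{X}^\top\mathbf{X} - \sigma_{d+1}^2\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}
\end{aligned}
```

Only the top $d$ rows of the eigenvector equation are needed; the last row gives an additional scalar identity that is not used here.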
This result is like ridge regression, but with a negative regularization constant!
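Why does this make sense? One of the original motivations of ridge regression was to ensure that
the matrix being inverted is in fact nonsingular, and subtracting a scalar multiple of the identity
seems like a step in the opposite direction. We can make sense of this by recalling our original
model:
$$\mathbf{X} = \mathbf{X}_{\text{true}} + \mathbf{Z}$$
where $\mathbf{X}_{\text{true}}$ are the actual values before noise corruption, and $\mathbf{Z}$ is a zero-mean noise term with
i.i.d. entries. Then
$$\mathbb{E}[\mathbf{X}^\top\mathbf{X}] = \mathbf{X}_{\text{true}}^\top\mathbf{X}_{\text{true}} + \mathbf{X}_{\text{true}}^\top\mathbb{E}[\mathbf{Z}] + \mathbb{E}[\mathbf{Z}]^\top\mathbf{X}_{\text{true}} + \mathbb{E}[\mathbf{Z}^\top\mathbf{Z}] = \mathbf{X}_{\text{true}}^\top\mathbf{X}_{\text{true}} + \mathbb{E}[\mathbf{Z}^\top\mathbf{Z}]$$
Observe that the off-diagonal terms of $\mathbb{E}[\mathbf{Z}^\top\mathbf{Z}]$ are zero because the $i$th and $j$th columns of $\mathbf{Z}$ are
independent for $i \ne j$, and the on-diagonal terms are essentially variances. Thus the $-\sigma_{d+1}^2\mathbf{I}$ term is
there to compensate for the extra noise introduced by our assumptions regarding the independent
variables.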
For another perspective, note that
If we plug this into the OLS solution (where we have assumed no noise in the independent variables),
we see
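$$\hat{\mathbf{w}}_{\text{ols}} = (\mathbf{X}_{\text{true}}^\top\mathbf{X}_{\text{true}})^{-1}\mathbf{X}_{\text{true}}^\top\mathbf{y} = (\mathbb{E}[\mathbf{X}^\top\mathbf{X}] - \mathbb{E}[\mathbf{Z}^\top\mathbf{Z}])^{-1}\mathbb{E}[\mathbf{X}]^\top\mathbf{y}$$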
which strongly resembles the TLS solution, but expressed in terms of expectations over the noise
Z.
So, is this all just a mathematical trick, or is there a practical sense in which ridge regularization
itself is related to adding noise? The math above suggests that we can take the original training
data set and, instead of working with it directly, sample lots of noisy copies of each point (say $r$ copies each),
with i.i.d. zero-mean Gaussian noise with variance $\lambda$ added to each of their features. Call the resulting
data matrix $\mathbf{X}$, and let the corresponding $\mathbf{y}$ just repeat the original $y$ values. Then, doing ordinary least squares
on this noisily degraded data set will end up behaving like ridge regression, since the law of large
numbers will make $\frac{1}{r}\mathbf{X}^\top\mathbf{X}$ concentrate around $\mathbf{X}_{\text{true}}^\top\mathbf{X}_{\text{true}} + n\lambda\mathbf{I}$, where $n$ is the number of
original training points (each of them contributes one $\lambda\mathbf{I}$ term). Meanwhile $\mathbf{X}^\top\mathbf{y}$ will concentrate
to $r\mathbf{X}_{\text{true}}^\top\mathbf{y}_{\text{orig}}$ with $O(\sqrt{r})$ noise on top of this by the Central Limit Theorem (if we used
other-than-Gaussian noise to noisily resample), and straight variance-$O(r)$ Gaussian noise if we indeed
used Gaussian noise. Putting them together means that OLS with noisily augmented
training data will result in approximately the same solution as ridge regression (with regularization
constant $n\lambda$), with the solutions
approaching each other as the number of noisy copies $r$ goes to infinity.
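This claim can be tried out on synthetic data. In the sketch below every size, seed, and name is an assumption; the ridge constant is taken to be `n * noise_var`, because each of the `n` original points contributes its own noise term to $\mathbf{X}^\top\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 4                       # hypothetical number of points and features
noise_var = 0.1                    # variance of the noise added to every feature
r = 5000                           # number of noisy copies of each training point

X_true = rng.normal(size=(n, d))
y = X_true @ np.array([1.0, -1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=n)

# Ridge regression on the clean data, with regularization constant n * noise_var.
w_ridge = np.linalg.solve(X_true.T @ X_true + n * noise_var * np.eye(d), X_true.T @ y)

# Ordinary least squares on r noisy copies of the data, keeping the original targets.
X_aug = np.tile(X_true, (r, 1)) + np.sqrt(noise_var) * rng.normal(size=(r * n, d))
y_aug = np.tile(y, r)
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(w_ridge)
print(w_aug)
```

The two weight vectors should agree more closely as `r` grows, in line with the concentration argument above.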
Why does this make intuitive sense? How can adding noise make learning more reliable? The
intuitive reason is that this added noise destroys inadvertent conspiracies. Overfitting happens
because the learning algorithm sees some degree of conspiracies between the observed training
labels y and the input features. By adding lots of copies of the training data with additional
noise added into them, many of these conspiracies will be masked by the added noise because they
are fundamentally sensitive to small details — this is why they manifest as large weights w. We
know from our studies of the bias/variance tradeoff that having more training samples reduces this
variance. Adding our own noisy samples exploits this variance reduction.
In many practical machine learning situations, appropriately adding noise to your training data
can be an important tool in helping generalization performance.
In the discussion above, we have in some places made assumptions to move the derivation forward.
These do not always hold, but we can provide sufficient conditions for the existence of a solution.
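Proposition. Let $\sigma_1, \ldots, \sigma_{d+1}$ denote the singular values of $\begin{bmatrix} \mathbf{X} & \mathbf{y} \end{bmatrix}$, and $\tilde{\sigma}_1, \ldots, \tilde{\sigma}_d$ denote the
singular values of $\mathbf{X}$. If $\sigma_{d+1} < \tilde{\sigma}_d$, then the total least squares problem has a solution, given by
$$\hat{\mathbf{w}}_{\text{tls}} = (\mathbf{X}^\top\mathbf{X} - \sigma_{d+1}^2\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
Proof. Suppose, for the sake of contradiction, that the $(d+1)$st component of $\mathbf{v}_{d+1}$ is zero, so that
$\mathbf{v}_{d+1} = \begin{bmatrix} \mathbf{a}^\top & 0 \end{bmatrix}^\top$ for some nonzero $\mathbf{a}$. Since $\mathbf{v}_{d+1}$ is an eigenvector with eigenvalue $\sigma_{d+1}^2$,
$$\begin{bmatrix} \mathbf{X} & \mathbf{y} \end{bmatrix}^\top \begin{bmatrix} \mathbf{X} & \mathbf{y} \end{bmatrix} \begin{bmatrix} \mathbf{a} \\ 0 \end{bmatrix} = \begin{bmatrix} \mathbf{X}^\top\mathbf{X} & \mathbf{X}^\top\mathbf{y} \\ \mathbf{y}^\top\mathbf{X} & \mathbf{y}^\top\mathbf{y} \end{bmatrix} \begin{bmatrix} \mathbf{a} \\ 0 \end{bmatrix} = \sigma_{d+1}^2 \begin{bmatrix} \mathbf{a} \\ 0 \end{bmatrix}$$
Then
$$\mathbf{X}^\top\mathbf{X}\mathbf{a} = \sigma_{d+1}^2\mathbf{a}$$
so $\sigma_{d+1}^2$ is an eigenvalue of $\mathbf{X}^\top\mathbf{X}$ and hence $\sigma_{d+1}^2 \ge \tilde{\sigma}_d^2$. This is a contradiction,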
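since we have assumed $\sigma_{d+1} < \tilde{\sigma}_d$. Therefore the $(d+1)$st component of $\mathbf{v}_{d+1}$ is nonzero, which
guarantees the existence of a solution.
We have already derived the given expression for $\hat{\mathbf{w}}_{\text{tls}}$, but it remains to show that the matrix
$\mathbf{X}^\top\mathbf{X} - \sigma_{d+1}^2\mathbf{I}$ is invertible. This is fairly immediate from the assumption that $\sigma_{d+1} < \tilde{\sigma}_d$, since this
implies
$$\sigma_{d+1}^2 < \tilde{\sigma}_d^2 = \lambda_{\min}(\mathbf{X}^\top\mathbf{X})$$
giving
$$\lambda_{\min}(\mathbf{X}^\top\mathbf{X} - \sigma_{d+1}^2\mathbf{I}) = \lambda_{\min}(\mathbf{X}^\top\mathbf{X}) - \sigma_{d+1}^2 > 0$$
which guarantees that the matrix is invertible.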
This gives us a nice mathematical characterization of the existence of a solution, showing that the
two technical requirements we raised earlier (the last entry of $\mathbf{v}_{d+1}$ being nonzero, and the matrix
$\mathbf{X}^\top\mathbf{X} - \sigma_{d+1}^2\mathbf{I}$ being invertible) happen together. However, is the assumption of the proof likely to
hold in practice? We give an intuitive argument that it is.
Consider that in solving the TLS problem, we have determined the error term $\boldsymbol{\epsilon}_X$. In principle, we
could use this to denoise $\mathbf{X}$, as in $\hat{\mathbf{X}}_{\text{true}} = \mathbf{X} + \boldsymbol{\epsilon}_X$ (the corrected matrix appearing in the TLS
constraint), and then perform OLS as normal. This process
is essentially the same as TLS if we compare the original formulations. Assuming the error is
drawn from a continuous distribution, the probability that the denoised matrix $\hat{\mathbf{X}}_{\text{true}}$ has collinear
columns is zero.
Recall that OLS tries to minimize the vertical distance between the fitted line and data points.
TLS, on the other hand, tries to minimize the perpendicular distance. For this reason, TLS may
sometimes be referred to as orthogonal regression.
The red lines represent vertical distance, which OLS aims to minimize. The blue lines represent
perpendicular distance, which TLS aims to minimize. Note that all blue lines are perpendicular to
the black line (hypothesis model), while all red lines are perpendicular to the x axis.
Chapter 3
Dimensionality Reduction
In machine learning, the data we have are often very high-dimensional. In fact, when we introduced
the idea of features (like polynomial features), these made the dimensionality of the data even
higher. The kernel trick was something that let us partially deal with this by working with vectors
only as long as there are training samples.
However, there are a number of reasons why we might want to work with a lower-dimensional
representation:
• Visualization (if we can get it down to 2 or 3 dimensions), e.g. for exploratory data analysis
• Reduce computational load
• Reduce variance in estimation — regularize the problem
So, how can we reduce the dimensionality of data? There are obvious ways — just keeping a subset
of features. But which features? In general, that presumably depends on what you are trying to
do. What could you do if you didn’t know what you were trying to predict with those features?
This corresponds to unsupervised dimensionality reduction. There are a couple of intuitive choices.
First, just pick some features at random to keep. This is appealing for its symmetry, but it makes
you wonder if we could do better by actually looking at the data before deciding which features to
keep.
Consequently, another thing that you could do is to just keep the few features that have the most
variability — which you could measure by the variance of that feature. But what if two of the
most variable features were actually very correlated to each other? Should we really be including
both of them? Maybe we should focus on “fresh” variability somehow. To do this, maybe it would
be helpful to allow ourselves to synthesize linear combinations of features and keep some of these
synthesized features.
that labeled training data might be hard or expensive to get, but unlabeled training data (i.e. no
y just x) might be more easily available. PCA is able to extract meaningful directions from such
unlabeled data.
Not coincidentally, PCA turns out to be intimately connected to the ideas of Total Least Squares.
Projection
Let us first review the meaning of scalar projection of one vector onto another. If $\mathbf{v} \in \mathbb{R}^d$ is a unit
vector, i.e. $\|\mathbf{v}\| = 1$, then the scalar projection of another vector $\mathbf{x} \in \mathbb{R}^d$ onto $\mathbf{v}$ is given by $\mathbf{x}^\top\mathbf{v}$.
This quantity tells us roughly how much of the projected vector $\mathbf{x}$ lies along the direction given
by $\mathbf{v}$. Why does this expression make sense? Recall the slightly more general formula which
holds for vectors of any length:
$$\mathbf{x}^\top\mathbf{v} = \|\mathbf{x}\|\|\mathbf{v}\|\cos\theta$$
where $\theta$ is the angle between the vectors. In this case, since $\|\mathbf{v}\| = 1$, the expression simplifies to
$\mathbf{x}^\top\mathbf{v} = \|\mathbf{x}\|\cos\theta$. But since cosine gives the ratio of the adjacent side (the projection we want to
find) to the hypotenuse ($\|\mathbf{x}\|$), this is exactly what we want:
Let $\mathbf{X} \in \mathbb{R}^{n \times d}$ be our matrix of data, where each row is a d-dimensional datapoint. These are to
be thought of as i.i.d. samples from some random vector $\mathbf{x}$.
We will assume that the data points have mean zero; if this is not the case, we can make it so by
subtracting the average of all the rows, $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i$, from each row. The motivation for this is
that we want to find directions of high variance within the data, and variance is defined relative
to the mean of the data. If we did not zero-center the data, the directions found would be heavily
influenced by where the data lie relative to the origin, rather than where they lie relative to the
other data, which is more useful.
Since X is zero-mean, the sample variance of the datapoints’ projections onto a unit vector v is
given by
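$$\frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{v})^2 = \frac{1}{n}\|\mathbf{X}\mathbf{v}\|^2 = \frac{1}{n}\mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v}$$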
3.1. PRINCIPAL COMPONENT ANALYSIS 65
Note that we have discarded the positive constant factor 1/n which does not affect the optimal
value of v.
To reduce this constrained optimization problem to an unconstrained one, we write down its Lagrangian:
$$L(\mathbf{v}) = \mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v} - \lambda(\mathbf{v}^\top\mathbf{v} - 1)$$
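First-order necessary conditions for optima imply that
$$\nabla L(\mathbf{v}_1) = 2\mathbf{X}^\top\mathbf{X}\mathbf{v}_1 - 2\lambda\mathbf{v}_1 = \mathbf{0}$$
Hence $\mathbf{X}^\top\mathbf{X}\mathbf{v}_1 = \lambda\mathbf{v}_1$, i.e. $\mathbf{v}_1$ is an eigenvector of $\mathbf{X}^\top\mathbf{X}$ with eigenvalue $\lambda$. Since we constrain
$\mathbf{v}_1^\top\mathbf{v}_1 = 1$, the value of the objective is precisely
$$\mathbf{v}_1^\top\mathbf{X}^\top\mathbf{X}\mathbf{v}_1 = \mathbf{v}_1^\top(\lambda\mathbf{v}_1) = \lambda\,\mathbf{v}_1^\top\mathbf{v}_1 = \lambda$$
so the optimal value is $\lambda = \lambda_{\max}(\mathbf{X}^\top\mathbf{X})$, which is achieved when $\mathbf{v}_1$ is a unit eigenvector of $\mathbf{X}^\top\mathbf{X}$
corresponding to its largest eigenvalue.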
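The result above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic dataset (the data, seed, and variable names are illustrative, not from the course material): it computes the top eigenvector of $\mathbf{X}^\top\mathbf{X}$ and compares its objective value against many random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 points in R^3 with very different spreads per axis.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)                 # zero-center the rows

A = X.T @ X                            # objective matrix (factor 1/n dropped)
eigvals, eigvecs = np.linalg.eigh(A)   # eigh returns eigenvalues in ascending order
lam_max, v1 = eigvals[-1], eigvecs[:, -1]

def objective(v):
    """Unnormalized projected variance v^T X^T X v."""
    return v @ A @ v

# Random unit vectors should never beat the top eigenvector.
V = rng.normal(size=(1000, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)
best_random = max(objective(v) for v in V)
```

Here `objective(v1)` equals `lam_max` up to floating-point error, which mirrors the observation that the objective value at an eigenvector is its eigenvalue.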
We have seen how to find the first loading vector, which is the unit vector that maximizes the
variance of the projected data points. However, in most applications, we want to find more than
one direction. We want the subsequent directions found to also be directions of high variance,
but they ought to be orthogonal to the existing directions in order to minimize redundancy in the
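information captured. Thus we define the kth loading vector $\mathbf{v}_k$ as the solution to the constrained
optimization problem
$$\mathbf{v}_k = \arg\max_{\mathbf{v}} \ \mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v} \quad \text{subject to} \quad \mathbf{v}^\top\mathbf{v} = 1, \quad \mathbf{v}^\top\mathbf{v}_i = 0, \ i = 1, \dots, k-1$$
We claim that $\mathbf{v}_k$ is a unit eigenvector of $\mathbf{X}^\top\mathbf{X}$ corresponding to its kth largest eigenvalue.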
Proof. By induction on k. We have already shown that the claim is true for the base case k = 1
(where there are no orthogonality constraints). Now assume that it is true for the first k loading
vectors v1 , . . . , vk , and consider the problem of finding vk+1 .
1 To make sense of the sample variance, recall that for any random variable Z,
$$\operatorname{Var}(Z) = E[Z^2] - E[Z]^2$$
so if $E[Z] = 0$ then $\operatorname{Var}(Z) = E[Z^2]$. In practice we will not have the true random variable Z, but rather i.i.d. observations
$z_1, \dots, z_n$ of Z. The expected value can then be approximated by a sample average, i.e.
$$E[Z^2] \approx \frac{1}{n}\sum_{i=1}^n z_i^2$$
which is justified by the law of large numbers, which states that (under mild conditions) the sample average converges to the
expected value as $n \to \infty$. In our case the random variable Z is the principal component $\mathbf{v}^\top\mathbf{x}$, and the i.i.d. observations are
the projections of our datapoints, i.e. $z_i = \mathbf{v}^\top\mathbf{x}_i$.
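By the inductive hypothesis, we know that $\mathbf{v}_1, \dots, \mathbf{v}_k$ are orthonormal eigenvectors of $\mathbf{X}^\top\mathbf{X}$. Denote
the ith largest eigenvalue of $\mathbf{X}^\top\mathbf{X}$ by $\lambda_i$, noting that $\mathbf{X}^\top\mathbf{X}\mathbf{v}_i = \lambda_i\mathbf{v}_i$.
The Lagrangian of the objective function is
$$L(\mathbf{v}) = \mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v} - \lambda(\mathbf{v}^\top\mathbf{v} - 1) + \sum_{i=1}^k \eta_i \mathbf{v}^\top\mathbf{v}_i$$
First-order necessary conditions at the optimum $\mathbf{v}_{k+1}$ give the optimality equation
$$\mathbf{0} = \nabla L(\mathbf{v}_{k+1}) = 2\mathbf{X}^\top\mathbf{X}\mathbf{v}_{k+1} - 2\lambda\mathbf{v}_{k+1} + \sum_{i=1}^k \eta_i\mathbf{v}_i$$
Left-multiplying by $\mathbf{v}_j^\top$, we find
$$\begin{aligned}
0 = \mathbf{v}_j^\top\mathbf{0} &= 2\mathbf{v}_j^\top\mathbf{X}^\top\mathbf{X}\mathbf{v}_{k+1} - 2\lambda\underbrace{\mathbf{v}_j^\top\mathbf{v}_{k+1}}_{0} + \sum_{i=1}^k \eta_i\underbrace{\mathbf{v}_j^\top\mathbf{v}_i}_{\delta_{ij}} \\
&= 2(\mathbf{X}^\top\mathbf{X}\mathbf{v}_j)^\top\mathbf{v}_{k+1} + \eta_j \\
&= 2(\lambda_j\mathbf{v}_j)^\top\mathbf{v}_{k+1} + \eta_j \\
&= 2\lambda_j\underbrace{\mathbf{v}_j^\top\mathbf{v}_{k+1}}_{0} + \eta_j \\
&= \eta_j
\end{aligned}$$
for all j = 1, . . . , k.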
Plugging these values back into the optimality equation above, we see that vk+1 must satisfy
X>Xvk+1 = λvk+1 , i.e. vk+1 is an eigenvector of X>X with eigenvalue λ. As before, the value
of the objective function is then λ. To maximize, we want the largest eigenvalue, but we must
respect the constraints that vk+1 is orthogonal to v1 , . . . , vk . Clearly if vk+1 is equal to any of
these eigenvectors (up to sign), then one of these constraints will not be satisfied. Thus to maximize
the expression, vk+1 should be a unit eigenvector of X>X corresponding to its (k + 1)st largest
eigenvalue. By the spectral theorem, we can always choose this vector in such a way that it is
orthogonal to v1 , . . . , vk , so we are done.
We have shown that the loading vectors are orthonormal eigenvectors of X>X. In other words, they
are right-singular vectors of X, so they can all be found simultaneously by computing the SVD of
X.
Once we have computed the loading vectors, we can use them as a new coordinate system. The kth
principal component of a datapoint xi ∈ Rd is defined as the scalar projection of xi onto the kth
loading vector vk , i.e. xi>vk . We can compute all the principal components of all the datapoints
at once using a matrix-matrix multiplication:
Zk = XVk
where Vk ∈ Rd×k is a matrix whose columns are the first k loading vectors v1 , . . . , vk .
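As a rough illustration of the procedure just described, the sketch below computes the loading vectors from the SVD and the principal components by a single matrix product. The function name `pca`, the synthetic data, and the choice of NumPy are assumptions made for this example.

```python
import numpy as np

def pca(X, k):
    """Return (Vk, Zk): the first k loading vectors (as columns) and the principal components."""
    Xc = X - X.mean(axis=0)                       # zero-center the rows
    # Rows of Vt are right-singular vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T
    return Vk, Xc @ Vk

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # hypothetical correlated features
Vk, Zk = pca(X, 2)
```

Because the loading vectors are eigenvectors of $\mathbf{X}^\top\mathbf{X}$, the matrix `Zk.T @ Zk` is diagonal, so the principal components are uncorrelated across the dataset.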
Observe that the data are uncorrelated in the projected space. Also note that this example does
not show the power of PCA since we have not reduced the dimensionality of the data at all – the
plot is merely to show the PCA coordinate transformation.
Once we’ve computed the principal components, we can approximately reconstruct the original
points by
$$\tilde{\mathbf{X}}_k = \mathbf{Z}_k\mathbf{V}_k^\top = \mathbf{X}\mathbf{V}_k\mathbf{V}_k^\top$$
The rows of X̃k are the projections of the original rows of X onto the subspace spanned by the
loading vectors.
We have given the most common derivation of PCA above, but it turns out that there are other
ways to solve the optimization problem, or to arrive at the same formulation. These give us helpful
additional perspectives on what PCA is doing.
Changing coordinates
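In PCA we want to find the unit-length $\mathbf{v}$ that maximizes $\mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v}$. It turns out that there is a
result, sometimes referred to as the variational characterization of eigenvalues, that tells us
which vectors $\mathbf{v}$ achieve this. The key idea in the proof is a length-preserving change of coordinates.
Theorem. Let $\mathbf{A} \in \mathbb{R}^{d \times d}$ be symmetric. Then for any $\mathbf{v} \in \mathbb{R}^d$ satisfying $\|\mathbf{v}\|_2 = 1$,
$$\lambda_{\min}(\mathbf{A}) \le \mathbf{v}^\top\mathbf{A}\mathbf{v} \le \lambda_{\max}(\mathbf{A})$$
where for both bounds, equality holds if and only if $\mathbf{v}$ is a corresponding eigenvector.
Proof. We show only the max case because the argument for the min case is entirely analogous.
Since $\mathbf{A}$ is symmetric, we can decompose it as $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$, where $\mathbf{Q} \in \mathbb{R}^{d \times d}$ is orthogonal and
$\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$ contains the eigenvalues of $\mathbf{A}$. For any $\mathbf{v}$ satisfying $\|\mathbf{v}\|_2 = 1$, define $\mathbf{z} = \mathbf{Q}^\top\mathbf{v}$,
noting that the relationship between $\mathbf{v}$ and $\mathbf{z}$ is one-to-one because $\mathbf{Q}$ is invertible and that $\|\mathbf{z}\|_2 = 1$
because $\mathbf{Q}$ is orthogonal. Hence
$$\max_{\|\mathbf{v}\|_2 = 1} \mathbf{v}^\top\mathbf{A}\mathbf{v} = \max_{\|\mathbf{z}\|_2 = 1} \mathbf{z}^\top\boldsymbol{\Lambda}\mathbf{z} = \max_{\|\mathbf{z}\|_2 = 1} \sum_{i=1}^d \lambda_i z_i^2$$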
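We note that
$$\sum_{i=1}^d \lambda_i z_i^2 \le \sum_{i=1}^d \lambda_{\max}(\mathbf{A}) z_i^2 = \lambda_{\max}(\mathbf{A})\sum_{i=1}^d z_i^2$$
so the constraint $\|\mathbf{z}\|_2^2 = \sum_{i=1}^d z_i^2 = 1$ implies
$$\sum_{i=1}^d \lambda_i z_i^2 \le \lambda_{\max}(\mathbf{A})$$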
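Defining $I = \{i : \lambda_i = \lambda_{\max}(\mathbf{A})\}$, the index set of the largest eigenvalue, we see that the bound is
achieved with equality if and only if $\sum_{i \in I} z_i^2 = 1$ and $z_j = 0$ for $j \notin I$. Suppose $\mathbf{z}^*$ satisfies this
condition. Then writing $\mathbf{q}_1, \dots, \mathbf{q}_d$ for the columns of $\mathbf{Q}$, we have
$$\mathbf{v}^* = \mathbf{Q}\mathbf{z}^* = \sum_{i=1}^d z_i^*\mathbf{q}_i = \sum_{i \in I} z_i^*\mathbf{q}_i$$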
Recall that q1 , . . . , qd are eigenvectors of A and form an orthonormal basis for Rd . Therefore by
construction, the set {qi : i ∈ I} forms an orthonormal basis for the eigenspace of λmax (A). Hence
v∗ , which is a linear combination of these, lies in that eigenspace and thus is an eigenvector of A
corresponding to λmax (A).
Conversely, suppose $\mathbf{v} \in \mathbb{R}^d$ is unit-length but not an eigenvector corresponding to $\lambda_{\max}(\mathbf{A})$. The
vectors $\mathbf{q}_1, \dots, \mathbf{q}_d$ are still a basis for $\mathbb{R}^d$, so we have a unique expansion
$$\mathbf{v} = z_1\mathbf{q}_1 + \dots + z_d\mathbf{q}_d$$
Since $\mathbf{v}$ does not lie in the eigenspace of $\lambda_{\max}(\mathbf{A})$, one of the components $z_j$ must be nonzero for
an index $j \notin I$, so equality does not hold in the bound above.
With this result established, we see that the vector we seek (which maximizes $\mathbf{v}^\top\mathbf{X}^\top\mathbf{X}\mathbf{v}$) must be an
eigenvector corresponding to $\lambda_{\max}(\mathbf{X}^\top\mathbf{X})$. This is the same solution we derived via the Lagrangian
formulation above.
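The theorem and the change of coordinates used in its proof can be sanity-checked numerically. This is only a sketch under assumed inputs (a random symmetric matrix and random unit vectors, generated with NumPy); it does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
A = (M + M.T) / 2                      # an arbitrary symmetric matrix
lam, Q = np.linalg.eigh(A)             # A = Q diag(lam) Q^T, lam in ascending order

v = rng.normal(size=(2000, 5))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # each row is a unit vector
quad = np.einsum("ij,jk,ik->i", v, A, v)        # v^T A v for every row v

# Change of coordinates z = Q^T v (stored row-wise as v @ Q) preserves length
# and turns the quadratic form into the weighted sum  sum_i lam_i * z_i^2.
z = v @ Q
quad_z = (z ** 2) @ lam
```

Every entry of `quad` lies between `lam[0]` and `lam[-1]`, and the two ways of evaluating the quadratic form agree.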
Recall that ordinary least squares minimizes the vertical distance between the fitted line and the
data points. We show that PCA can be interpreted as minimizing the perpendicular distance
between the data points and the subspace onto which we are projecting them.
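The orthogonal projection of a vector $\mathbf{x}$ onto the subspace spanned by a unit vector $\mathbf{v}$ equals $\mathbf{v}$
scaled by the scalar projection of $\mathbf{x}$ onto $\mathbf{v}$:
$$P_{\mathbf{v}}\mathbf{x} = (\mathbf{x}^\top\mathbf{v})\mathbf{v}$$
Thus
$$\begin{aligned}
\sum_{i=1}^n \|\mathbf{x}_i - P_{\mathbf{v}}\mathbf{x}_i\|^2 &= \sum_{i=1}^n \left(\|\mathbf{x}_i\|^2 - \|P_{\mathbf{v}}\mathbf{x}_i\|^2\right) \\
&= \sum_{i=1}^n \|\mathbf{x}_i\|^2 - \sum_{i=1}^n \|(\mathbf{x}_i^\top\mathbf{v})\mathbf{v}\|^2 \\
&= \sum_{i=1}^n \|\mathbf{x}_i\|^2 - \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{v})^2
\end{aligned}$$
Then since the first term $\sum_{i=1}^n \|\mathbf{x}_i\|^2$ is constant with respect to $\mathbf{v}$, minimizing reconstruction error
is equivalent to maximizing $\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{v})^2$, which is (up to an irrelevant positive constant factor 1/n)
the projected variance.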
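The identity above is easy to confirm on data. The snippet below is a small sketch, assuming NumPy, a synthetic zero-centered dataset, and an arbitrary unit direction `v` (all illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)                      # zero-center the rows
v = rng.normal(size=3)
v /= np.linalg.norm(v)                   # arbitrary unit direction

proj = np.outer(X @ v, v)                # rows are P_v x_i = (x_i^T v) v
recon_error = np.sum((X - proj) ** 2)    # sum_i ||x_i - P_v x_i||^2
total = np.sum(X ** 2)                   # sum_i ||x_i||^2, independent of v
projected = np.sum((X @ v) ** 2)         # sum_i (x_i^T v)^2
```

Since `total` does not depend on `v`, any direction that increases `projected` decreases `recon_error` by the same amount.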
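Another way to write this interpretation is that the reconstructed matrix $\tilde{\mathbf{X}}_k$ is the best rank-k
approximation to $\mathbf{X}$ in the Frobenius norm. To see this, first note that (writing $\mathbf{X} = \sum_{i=1}^d \sigma_i\mathbf{u}_i\mathbf{v}_i^\top$)
$$\tilde{\mathbf{X}}_k = \mathbf{X}\mathbf{V}_k\mathbf{V}_k^\top = \sum_{i=1}^d \sigma_i\mathbf{u}_i\mathbf{v}_i^\top\mathbf{V}_k\mathbf{V}_k^\top$$
By orthonormality, the product $\mathbf{v}_i^\top\mathbf{V}_k$ results in a k-dimensional row vector with 1 in the ith place
and 0 everywhere else, i.e. $\mathbf{e}_i^\top$, as long as $i \le k$. In this case,
$$\mathbf{v}_i^\top\mathbf{V}_k\mathbf{V}_k^\top = \mathbf{e}_i^\top\mathbf{V}_k^\top = (\mathbf{V}_k\mathbf{e}_i)^\top = \mathbf{v}_i^\top$$
If $i > k$, $\mathbf{v}_i^\top\mathbf{V}_k = \mathbf{0}^\top$, so the term disappears. Therefore we see that
$$\tilde{\mathbf{X}}_k = \sum_{i=1}^d \sigma_i\mathbf{u}_i\mathbf{v}_i^\top\mathbf{V}_k\mathbf{V}_k^\top = \sum_{i=1}^k \sigma_i\mathbf{u}_i\mathbf{v}_i^\top$$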
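The equality between the PCA reconstruction and the truncated SVD can be illustrated as follows. This is a sketch on assumed synthetic data using NumPy; the final check relies on the standard fact (Eckart-Young) that the squared Frobenius error of the truncated SVD equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))
X -= X.mean(axis=0)
k = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T                                   # first k loading vectors as columns
X_proj = X @ Vk @ Vk.T                          # PCA reconstruction X V_k V_k^T
X_trunc = (U[:, :k] * s[:k]) @ Vt[:k]           # sum over i <= k of sigma_i u_i v_i^T
frob_err = np.linalg.norm(X - X_proj, "fro")    # Frobenius reconstruction error
```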
Probabilistic PCA
We have seen probabilistic motivations or derivations of many of the methods discussed so far in
this class. In a similar vein, probabilistic PCA (PPCA) is a generative model for PCA. Here
we make the following assumptions about how the data were generated: for each datapoint i, there
is a k-dimensional latent variable
zi ∼ N (0, I)
which we cannot observe, and the actual d-dimensional observation is distributed conditionally on
this latent variable as
xi |zi ∼ N (Λzi + µ, Ψ)
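Here $\boldsymbol{\Lambda} \in \mathbb{R}^{d \times k}$ and $\boldsymbol{\mu} \in \mathbb{R}^d$ are parameters to be estimated. Since $\mathbf{z}_i$ is Gaussian and $\mathbf{x}_i \mid \mathbf{z}_i$ is
Gaussian, the stacked vector $\begin{bmatrix}\mathbf{x}_i \\ \mathbf{z}_i\end{bmatrix}$ is Gaussian, so its marginal $\mathbf{x}_i$ is Gaussian. In particular, by integrating out the
latent variable
$$p(\mathbf{x}_i) = \int_{\mathbf{z}} p(\mathbf{x}_i, \mathbf{z})\,d\mathbf{z} = \int_{\mathbf{z}} p(\mathbf{x}_i \mid \mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}$$
one can show that
$$\mathbf{x}_i \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi})$$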
It is common to assume $\boldsymbol{\Psi} = \sigma^2\mathbf{I}$. In this case, if we let $\sigma^2 \to 0$, we recover the original PCA solution
in the sense that the column space of $\hat{\boldsymbol{\Lambda}}_{\text{mle}}$ approaches the PCA subspace (i.e. the column space of
$\mathbf{V}_k$).
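The marginal distribution stated above can be checked by simulation. The sketch below samples from the generative model with $\boldsymbol{\Psi} = \sigma^2\mathbf{I}$ and compares the empirical mean and covariance with $\boldsymbol{\mu}$ and $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}$. The particular values of `Lam`, `mu`, and `sigma2` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 3, 2, 200_000
Lam = np.array([[1.0, 0.0],
                [0.5, 1.0],
                [0.0, -1.0]])          # hypothetical loading matrix Lambda (d x k)
mu = np.array([1.0, -2.0, 0.5])        # hypothetical mean
sigma2 = 0.1                           # Psi = sigma2 * I

Z = rng.normal(size=(n, k))                          # z_i ~ N(0, I)
noise = np.sqrt(sigma2) * rng.normal(size=(n, d))    # samples from N(0, Psi)
X = Z @ Lam.T + mu + noise                           # x_i | z_i ~ N(Lam z_i + mu, Psi)

emp_mean = X.mean(axis=0)
emp_cov = np.cov(X, rowvar=False)
model_cov = Lam @ Lam.T + sigma2 * np.eye(d)         # covariance of the marginal of x_i
```

With this many samples, `emp_mean` and `emp_cov` should be close to `mu` and `model_cov`, though never exactly equal.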
PCA provided us with a dimensionality-reduction approach that didn’t use the labels y in any way.
In that way, it was fundamentally unsupervised by nature. However, we can imagine that there can
be situations in which the most relevant directions in x for understanding y are not necessarily the
directions of greatest variation in x. For example, what if the x data by nature was contaminated
with a strong correlated noise signal? PCA would find the noise dimensions to be those that have
the greatest variation and keep them, throwing away those dimensions where we could actually
hope to get information relevant for predicting y!
The other potentially troublesome aspect of PCA is that it is not invariant to a change of units
or scaling. If we changed the units of some feature from meters to millimeters, then all the values
for that feature would increase by a factor of a thousand, and suddenly, this direction might be
favored by PCA. This is unavoidable because there is no natural reference point that would allow
us to treat units as arbitrary.
Consequently, it is important to have an approach to dimensionality reduction and the discovery
of linear structure from data that does take advantage of paired (x, y) data, preferably in a way
that is robust to linear transformations of both x and y individually.
What does it mean to extract the linear structure establishing the underlying relationship between
X and Y, two vector-valued quantities of which we have many paired samples? To understand
what this should mean, we need to construct a model. The first thing that we do is assume we
have a joint distribution for X and Y as random variables. In practice, we won’t have the random
variables in distribution, just paired samples of them. But it is easier to start understanding what
we want by assuming that we have the entire distribution. This corresponds to how well we think
we can do given infinite amounts of training data. The next thing we do is assume a particular form for
the random variables. Since we are interested in linear structure, jointly Gaussian random variables
are a useful model.
Our goal is to extract the underlying relationship or commonality between X and Y. To do this,
we assume that we have three underlying iid standard Gaussian random vectors ZJ (representing
the common/joint part), ZX (representing the randomness that is purely in X and not shared by
Y), and ZY (representing the randomness that is purely in Y and not shared by X). Then we can
assume that they are related by an underlying linear relationship:
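$$\begin{bmatrix}\mathbf{X} \\ \mathbf{Y}\end{bmatrix} = \begin{bmatrix}\mathbf{A} & \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{C} & \mathbf{D}\end{bmatrix}\begin{bmatrix}\mathbf{Z}_X \\ \mathbf{Z}_J \\ \mathbf{Z}_Y\end{bmatrix} \tag{3.1}$$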
How will such a joint relationship manifest in the joint distributions for X and Y? To understand
that, we should first consider the scalar case.
then we can define a distribution on individually normalized X and Y and have their joint inter-
relationship entirely captured by ρ(X, Y ). First write
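$$\rho(X, Y) = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$$
Then
$$\boldsymbol{\Sigma} = \begin{bmatrix}\sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2\end{bmatrix} = \begin{bmatrix}\sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2\end{bmatrix}$$
so
$$\begin{aligned}
\begin{bmatrix}\sigma_x^{-1} & 0 \\ 0 & \sigma_y^{-1}\end{bmatrix}\begin{bmatrix}X \\ Y\end{bmatrix} &\sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix}\sigma_x^{-1} & 0 \\ 0 & \sigma_y^{-1}\end{bmatrix}\boldsymbol{\Sigma}\begin{bmatrix}\sigma_x^{-1} & 0 \\ 0 & \sigma_y^{-1}\end{bmatrix}^\top\right) \\
&\sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix}1 & \rho \\ \rho & 1\end{bmatrix}\right)
\end{aligned}$$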
This $\rho$ quantity is the signature of the joint inter-relationship of the X and Y random variables.
To make things explicit, once we have $\rho = \frac{BC}{\sqrt{(A^2 + B^2)(C^2 + D^2)}}$, we can come up with many possible
backstories for the latent picture behind the observed random variables. Here is one that splits the
influence of the latent space proportionately.
$$A = \sigma_x\sqrt{1 - |\rho|} \tag{3.2}$$
$$B = \sigma_x\sqrt{|\rho|} \tag{3.3}$$
$$C = \sigma_y\operatorname{sign}(\rho)\sqrt{|\rho|} \tag{3.4}$$
$$D = \sigma_y\sqrt{1 - |\rho|} \tag{3.5}$$
Because $A^2 + B^2 = \sigma_x^2$, $C^2 + D^2 = \sigma_y^2$, and $\rho = \frac{BC}{\sqrt{(A^2 + B^2)(C^2 + D^2)}}$, this works.
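The claim that (3.2) through (3.5) reproduce the desired variances and correlation can be verified with a few lines of arithmetic. The sketch below assumes the scalar latent model $X = A Z_X + B Z_J$, $Y = C Z_J + D Z_Y$ with independent standard normal latents; the function name and the numeric values are hypothetical.

```python
import numpy as np

def latent_coefficients(sigma_x, sigma_y, rho):
    """Coefficients (A, B, C, D) following equations (3.2)-(3.5)."""
    A = sigma_x * np.sqrt(1 - abs(rho))
    B = sigma_x * np.sqrt(abs(rho))
    C = sigma_y * np.sign(rho) * np.sqrt(abs(rho))
    D = sigma_y * np.sqrt(1 - abs(rho))
    return A, B, C, D

sigma_x, sigma_y, rho = 2.0, 0.5, -0.6      # hypothetical target values
A, B, C, D = latent_coefficients(sigma_x, sigma_y, rho)

# With independent standard normal latents: Var(X) = A^2 + B^2,
# Var(Y) = C^2 + D^2, and Cov(X, Y) = B * C (only Z_J is shared).
var_x, var_y, cov_xy = A**2 + B**2, C**2 + D**2, B * C
rho_recovered = cov_xy / np.sqrt(var_x * var_y)
```

The `sign(rho)` factor in `C` is what allows a negative correlation to be represented while keeping all square roots real.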
Pearson Correlation
Although we defined this ρ above for a pair of jointly Gaussian random variables, it is really about
linear structure. The Pearson Correlation Coefficient ρ(X, Y ) is effectively a way to measure
how linearly related (in other words, how well a linear model captures the relationship between)
random variables X and Y .
$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$$
Here are some important facts about it:
The correlation is defined in terms of random variables rather than observed data. Assume now
that x, y ∈ Rn are vectors containing n independent observations of X and Y , respectively. Recall
the law of large numbers, which states that for i.i.d. Xi with mean µ,
$$\frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty$$
Plugging these estimates into the definition for correlation and canceling the factor of 1/n leads us
to the Sample Pearson Correlation Coefficient ρ̂:
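$$\hat{\rho}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \cdot \sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{\tilde{\mathbf{x}}^\top\tilde{\mathbf{y}}}{\sqrt{\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} \cdot \tilde{\mathbf{y}}^\top\tilde{\mathbf{y}}}} \quad \text{where } \tilde{\mathbf{x}} = \mathbf{x} - \bar{x}, \ \tilde{\mathbf{y}} = \mathbf{y} - \bar{y}$$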
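The sample correlation coefficient translates directly into code. The following sketch, assuming NumPy and synthetic observations, implements the centered inner-product form and compares it with NumPy's built-in `np.corrcoef`; the helper name `sample_pearson` is chosen for this example.

```python
import numpy as np

def sample_pearson(x, y):
    """Sample Pearson correlation via centered inner products."""
    x_t = x - x.mean()
    y_t = y - y.mean()
    return (x_t @ y_t) / np.sqrt((x_t @ x_t) * (y_t @ y_t))

rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)      # hypothetical noisy linear relationship
r = sample_pearson(x, y)
```

Exactly linear data such as `sample_pearson(x, 3 * x + 1)` gives a value of 1 up to rounding, and rescaling or shifting either argument by a positive affine map leaves `r` unchanged.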
Here are some 2-D scatterplots and their corresponding correlation coefficients:
Canonical Correlation Analysis (CCA) is a method of modeling the relationship between two
point sets by making use of the correlation coefficients.
As in PCA, it is useful to start with trying to find the directions that represent the most correlation.
You can think of this as finding the parts of Xrv and Yrv that depend on the first coordinate of
ZJ , where we choose the convention that the first coordinate represents the most shared dimension.
We will then see how to move on to get the rest.
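Formally, given zero-mean random vectors $\mathbf{X}_{rv} \in \mathbb{R}^p$ and $\mathbf{Y}_{rv} \in \mathbb{R}^q$, we want to find projection
vectors $\mathbf{u} \in \mathbb{R}^p$ and $\mathbf{v} \in \mathbb{R}^q$ that maximize the correlation between $\mathbf{X}_{rv}^\top\mathbf{u}$ and $\mathbf{Y}_{rv}^\top\mathbf{v}$:
$$\max_{\mathbf{u}, \mathbf{v}} \rho(\mathbf{X}_{rv}^\top\mathbf{u}, \mathbf{Y}_{rv}^\top\mathbf{v}) = \max_{\mathbf{u}, \mathbf{v}} \frac{\operatorname{Cov}(\mathbf{X}_{rv}^\top\mathbf{u}, \mathbf{Y}_{rv}^\top\mathbf{v})}{\sqrt{\operatorname{Var}(\mathbf{X}_{rv}^\top\mathbf{u})\operatorname{Var}(\mathbf{Y}_{rv}^\top\mathbf{v})}}$$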
Observe that
which also implies (since Var(Z) = Cov(Z, Z) for any random variable Z) that
where again the sample-based approximation is justified by the law of large numbers. Similarly,
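$$\operatorname{Cov}(\mathbf{X}_{rv}, \mathbf{X}_{rv}) = E[\mathbf{X}_{rv}\mathbf{X}_{rv}^\top] \approx \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^\top = \frac{1}{n}\mathbf{X}^\top\mathbf{X}$$
$$\operatorname{Cov}(\mathbf{Y}_{rv}, \mathbf{Y}_{rv}) = E[\mathbf{Y}_{rv}\mathbf{Y}_{rv}^\top] \approx \frac{1}{n}\sum_{i=1}^n \mathbf{y}_i\mathbf{y}_i^\top = \frac{1}{n}\mathbf{Y}^\top\mathbf{Y}$$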
Plugging these estimates in for the true covariance matrices, we arrive at the problem
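$$\max_{\mathbf{u}, \mathbf{v}} \frac{\mathbf{u}^\top\left(\frac{1}{n}\mathbf{X}^\top\mathbf{Y}\right)\mathbf{v}}{\sqrt{\mathbf{u}^\top\left(\frac{1}{n}\mathbf{X}^\top\mathbf{X}\right)\mathbf{u} \cdot \mathbf{v}^\top\left(\frac{1}{n}\mathbf{Y}^\top\mathbf{Y}\right)\mathbf{v}}} = \max_{\mathbf{u}, \mathbf{v}} \underbrace{\frac{\mathbf{u}^\top\mathbf{X}^\top\mathbf{Y}\mathbf{v}}{\sqrt{\mathbf{u}^\top\mathbf{X}^\top\mathbf{X}\mathbf{u} \cdot \mathbf{v}^\top\mathbf{Y}^\top\mathbf{Y}\mathbf{v}}}}_{\hat{\rho}(\mathbf{X}\mathbf{u}, \mathbf{Y}\mathbf{v})}$$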
Let’s try to massage the maximization problem into a form that we can reason with more easily.
Our strategy is to choose matrices to transform X and Y such that the maximization problem is
equivalent but easier to understand.
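1. First, let’s choose matrices $\mathbf{W}_x$, $\mathbf{W}_y$ to whiten $\mathbf{X}$ and $\mathbf{Y}$. This will make the (co)variance
matrices $(\mathbf{X}\mathbf{W}_x)^\top(\mathbf{X}\mathbf{W}_x)$ and $(\mathbf{Y}\mathbf{W}_y)^\top(\mathbf{Y}\mathbf{W}_y)$ become identity matrices and simplify our
expression. To do this, note that $\mathbf{X}^\top\mathbf{X}$ is symmetric and, assuming $\mathbf{X}$ has full column rank, positive definite, so we can
employ the eigendecomposition
$$\mathbf{X}^\top\mathbf{X} = \mathbf{U}_x\mathbf{S}_x\mathbf{U}_x^\top$$
Since
$$\mathbf{S}_x = \operatorname{diag}(\lambda_1(\mathbf{X}^\top\mathbf{X}), \dots, \lambda_d(\mathbf{X}^\top\mathbf{X}))$$
where all the eigenvalues are positive, we can define the “square root” of this matrix by taking
the square root of every diagonal entry:
$$\mathbf{S}_x^{1/2} = \operatorname{diag}\left(\sqrt{\lambda_1(\mathbf{X}^\top\mathbf{X})}, \dots, \sqrt{\lambda_d(\mathbf{X}^\top\mathbf{X})}\right)$$
Then, defining $\mathbf{W}_x = \mathbf{U}_x\mathbf{S}_x^{-1/2}\mathbf{U}_x^\top$, we have
$$\begin{aligned}
(\mathbf{X}\mathbf{W}_x)^\top(\mathbf{X}\mathbf{W}_x) &= \mathbf{W}_x^\top\mathbf{X}^\top\mathbf{X}\mathbf{W}_x \\
&= \mathbf{U}_x\mathbf{S}_x^{-1/2}\mathbf{U}_x^\top(\mathbf{U}_x\mathbf{S}_x\mathbf{U}_x^\top)\mathbf{U}_x\mathbf{S}_x^{-1/2}\mathbf{U}_x^\top \\
&= \mathbf{U}_x\mathbf{S}_x^{-1/2}\mathbf{S}_x\mathbf{S}_x^{-1/2}\mathbf{U}_x^\top \\
&= \mathbf{U}_x\mathbf{U}_x^\top \\
&= \mathbf{I}
\end{aligned}$$
which shows that $\mathbf{W}_x$ is a whitening matrix for $\mathbf{X}$. The same process can be repeated to
produce a whitening matrix $\mathbf{W}_y = \mathbf{U}_y\mathbf{S}_y^{-1/2}\mathbf{U}_y^\top$ for $\mathbf{Y}$.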
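Let’s denote the whitened data $\mathbf{X}_w = \mathbf{X}\mathbf{W}_x$ and $\mathbf{Y}_w = \mathbf{Y}\mathbf{W}_y$. Then by the change of variables
$\mathbf{u}_w = \mathbf{W}_x^{-1}\mathbf{u}$, $\mathbf{v}_w = \mathbf{W}_y^{-1}\mathbf{v}$,
$$\begin{aligned}
\max_{\mathbf{u}, \mathbf{v}} \hat{\rho}(\mathbf{X}\mathbf{u}, \mathbf{Y}\mathbf{v}) &= \max_{\mathbf{u}, \mathbf{v}} \frac{(\mathbf{X}\mathbf{u})^\top\mathbf{Y}\mathbf{v}}{\sqrt{(\mathbf{X}\mathbf{u})^\top\mathbf{X}\mathbf{u}\,(\mathbf{Y}\mathbf{v})^\top\mathbf{Y}\mathbf{v}}} \\
&= \max_{\mathbf{u}, \mathbf{v}} \frac{(\mathbf{X}\mathbf{W}_x\mathbf{W}_x^{-1}\mathbf{u})^\top\mathbf{Y}\mathbf{W}_y\mathbf{W}_y^{-1}\mathbf{v}}{\sqrt{(\mathbf{X}\mathbf{W}_x\mathbf{W}_x^{-1}\mathbf{u})^\top\mathbf{X}\mathbf{W}_x\mathbf{W}_x^{-1}\mathbf{u}\,(\mathbf{Y}\mathbf{W}_y\mathbf{W}_y^{-1}\mathbf{v})^\top\mathbf{Y}\mathbf{W}_y\mathbf{W}_y^{-1}\mathbf{v}}} \\
&= \max_{\mathbf{u}_w, \mathbf{v}_w} \frac{(\mathbf{X}_w\mathbf{u}_w)^\top\mathbf{Y}_w\mathbf{v}_w}{\sqrt{(\mathbf{X}_w\mathbf{u}_w)^\top\mathbf{X}_w\mathbf{u}_w\,(\mathbf{Y}_w\mathbf{v}_w)^\top\mathbf{Y}_w\mathbf{v}_w}} \\
&= \max_{\mathbf{u}_w, \mathbf{v}_w} \frac{\mathbf{u}_w^\top\mathbf{X}_w^\top\mathbf{Y}_w\mathbf{v}_w}{\sqrt{\mathbf{u}_w^\top\mathbf{X}_w^\top\mathbf{X}_w\mathbf{u}_w \cdot \mathbf{v}_w^\top\mathbf{Y}_w^\top\mathbf{Y}_w\mathbf{v}_w}} \\
&= \max_{\mathbf{u}_w, \mathbf{v}_w} \underbrace{\frac{\mathbf{u}_w^\top\mathbf{X}_w^\top\mathbf{Y}_w\mathbf{v}_w}{\sqrt{\mathbf{u}_w^\top\mathbf{u}_w \cdot \mathbf{v}_w^\top\mathbf{v}_w}}}_{\hat{\rho}(\mathbf{X}_w\mathbf{u}_w, \mathbf{Y}_w\mathbf{v}_w)}
\end{aligned}$$
Note we have used the fact that $\mathbf{X}_w^\top\mathbf{X}_w$ and $\mathbf{Y}_w^\top\mathbf{Y}_w$ are identity matrices by construction.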
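2. Second, let’s choose matrices $\mathbf{D}_x$, $\mathbf{D}_y$ to decorrelate $\mathbf{X}_w$ and $\mathbf{Y}_w$. This will let us simplify
the covariance matrix $(\mathbf{X}_w\mathbf{D}_x)^\top(\mathbf{Y}_w\mathbf{D}_y)$ into a diagonal matrix.
Recall that our ultimate goal is to understand the underlying latent structure behind $\mathbf{X}_{rv}$
and $\mathbf{Y}_{rv}$. The whitening was a normalizing change of coordinates. The decorrelation is there
so that we can pick out independent underlying components of $\mathbf{Z}_J$ (since jointly Gaussian
random variables are independent if they are uncorrelated). Alternatively, you can consider
decorrelation as reducing the problem to a sequence of scalar problems.
To do this, we’ll make use of the SVD:
$$\mathbf{X}_w^\top\mathbf{Y}_w = \mathbf{U}\mathbf{S}\mathbf{V}^\top$$
Choosing the orthogonal matrices $\mathbf{D}_x = \mathbf{U}$ and $\mathbf{D}_y = \mathbf{V}$ gives $(\mathbf{X}_w\mathbf{D}_x)^\top(\mathbf{Y}_w\mathbf{D}_y) = \mathbf{U}^\top\mathbf{U}\mathbf{S}\mathbf{V}^\top\mathbf{V} = \mathbf{S}$, which is diagonal.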
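Let’s denote the decorrelated data $\mathbf{X}_d = \mathbf{X}_w\mathbf{D}_x$ and $\mathbf{Y}_d = \mathbf{Y}_w\mathbf{D}_y$. Then by the change of variables
$\mathbf{u}_d = \mathbf{D}_x^{-1}\mathbf{u}_w = \mathbf{D}_x^\top\mathbf{u}_w$, $\mathbf{v}_d = \mathbf{D}_y^{-1}\mathbf{v}_w = \mathbf{D}_y^\top\mathbf{v}_w$,
$$\begin{aligned}
\max_{\mathbf{u}_w, \mathbf{v}_w} \hat{\rho}(\mathbf{X}_w\mathbf{u}_w, \mathbf{Y}_w\mathbf{v}_w) &= \max_{\mathbf{u}_w, \mathbf{v}_w} \frac{(\mathbf{X}_w\mathbf{u}_w)^\top\mathbf{Y}_w\mathbf{v}_w}{\sqrt{\mathbf{u}_w^\top\mathbf{u}_w \cdot \mathbf{v}_w^\top\mathbf{v}_w}} \\
&= \max_{\mathbf{u}_w, \mathbf{v}_w} \frac{(\mathbf{X}_w\mathbf{D}_x\mathbf{D}_x^{-1}\mathbf{u}_w)^\top\mathbf{Y}_w\mathbf{D}_y\mathbf{D}_y^{-1}\mathbf{v}_w}{\sqrt{(\mathbf{D}_x^\top\mathbf{u}_w)^\top\mathbf{D}_x^\top\mathbf{u}_w \cdot (\mathbf{D}_y^\top\mathbf{v}_w)^\top\mathbf{D}_y^\top\mathbf{v}_w}} \\
&= \max_{\mathbf{u}_d, \mathbf{v}_d} \frac{(\mathbf{X}_d\mathbf{u}_d)^\top\mathbf{Y}_d\mathbf{v}_d}{\sqrt{\mathbf{u}_d^\top\mathbf{u}_d \cdot \mathbf{v}_d^\top\mathbf{v}_d}} \\
&= \max_{\mathbf{u}_d, \mathbf{v}_d} \underbrace{\frac{\mathbf{u}_d^\top\mathbf{X}_d^\top\mathbf{Y}_d\mathbf{v}_d}{\sqrt{\mathbf{u}_d^\top\mathbf{u}_d \cdot \mathbf{v}_d^\top\mathbf{v}_d}}}_{\hat{\rho}(\mathbf{X}_d\mathbf{u}_d, \mathbf{Y}_d\mathbf{v}_d)} \\
&= \max_{\mathbf{u}_d, \mathbf{v}_d} \frac{\mathbf{u}_d^\top\mathbf{S}\mathbf{v}_d}{\sqrt{\mathbf{u}_d^\top\mathbf{u}_d \cdot \mathbf{v}_d^\top\mathbf{v}_d}}
\end{aligned}$$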
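Without loss of generality, suppose $\mathbf{u}_d$ and $\mathbf{v}_d$ are unit vectors³ so that the denominator becomes
1, and we can ignore it:
$$\max_{\mathbf{u}_d, \mathbf{v}_d} \frac{\mathbf{u}_d^\top\mathbf{S}\mathbf{v}_d}{\sqrt{\mathbf{u}_d^\top\mathbf{u}_d \cdot \mathbf{v}_d^\top\mathbf{v}_d}} = \max_{\substack{\|\mathbf{u}_d\| = 1 \\ \|\mathbf{v}_d\| = 1}} \frac{\mathbf{u}_d^\top\mathbf{S}\mathbf{v}_d}{\|\mathbf{u}_d\|\|\mathbf{v}_d\|} = \max_{\substack{\|\mathbf{u}_d\| = 1 \\ \|\mathbf{v}_d\| = 1}} \mathbf{u}_d^\top\mathbf{S}\mathbf{v}_d$$
The diagonal nature of $\mathbf{S}$ implies $S_{ij} = 0$ for $i \ne j$, so our simplified objective expands as
$$\mathbf{u}_d^\top\mathbf{S}\mathbf{v}_d = \sum_i\sum_j (\mathbf{u}_d)_i S_{ij} (\mathbf{v}_d)_j = \sum_i S_{ii}(\mathbf{u}_d)_i(\mathbf{v}_d)_i$$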
where Sii , the singular values of Xw>Yw , are arranged in descending order. Thus we have a weighted
sum of these singular values, where the weights are given by the entries of ud and vd , which are
constrained to have unit norm. To maximize the sum, we “put all our eggs in one basket” and
extract S11 by setting the first components of ud and vd to 1, and the rest to 0:
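$$\mathbf{u}_d = \begin{bmatrix}1 \\ 0 \\ \vdots \\ 0\end{bmatrix} \in \mathbb{R}^p \qquad \mathbf{v}_d = \begin{bmatrix}1 \\ 0 \\ \vdots \\ 0\end{bmatrix} \in \mathbb{R}^q$$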
Any other arrangement would put weight on Sii at the expense of taking that weight away from
S11 , which is the largest, thus reducing the value of the sum.
Finally we have an analytical solution, but it is in a different coordinate system than our original
problem! In particular, ud and vd are the best weights in a coordinate system where the data has
been whitened and decorrelated. To bring it back to our original coordinate system and find the
vectors we actually care about (u and v), we must invert the changes of variables we made:
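$$\mathbf{u} = \mathbf{W}_x\mathbf{u}_w = \mathbf{W}_x\mathbf{D}_x\mathbf{u}_d \qquad \mathbf{v} = \mathbf{W}_y\mathbf{v}_w = \mathbf{W}_y\mathbf{D}_y\mathbf{v}_d$$
$$\mathbf{U} = \mathbf{W}_x\mathbf{D}_x\mathbf{U}_d \qquad \mathbf{V} = \mathbf{W}_y\mathbf{D}_y\mathbf{V}_d$$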
Note that Ud and Vd have orthogonal columns. The columns of U and V, which are the projection
directions we seek, will in general not be orthogonal, but they will be linearly independent (since
they come from the application of invertible matrices to the columns of Ud , Vd ).
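The whitening, decorrelation, and change-of-variables steps can be put together in a short routine. The sketch below is one possible arrangement under stated assumptions: both data matrices have full column rank, the decorrelating matrices are taken from the SVD of the whitened cross-covariance, and the helper names `inv_sqrt` and `cca` as well as the synthetic data are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Whitening matrix U S^{-1/2} U^T for a symmetric positive definite matrix S."""
    lam, U = np.linalg.eigh(S)
    return (U / np.sqrt(lam)) @ U.T        # scales column j of U by lam_j^{-1/2}

def cca(X, Y, k):
    """Return projection matrices U (p x k), V (q x k) and the top-k sample correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Wx, Wy = inv_sqrt(X.T @ X), inv_sqrt(Y.T @ Y)          # step 1: whiten
    Dx, s, DyT = np.linalg.svd((X @ Wx).T @ (Y @ Wy))      # step 2: decorrelate
    # Undo the changes of variables; the first k columns correspond to
    # choosing the first k standard basis vectors in decorrelated coordinates.
    U = Wx @ Dx[:, :k]
    V = Wy @ DyT.T[:, :k]
    return U, V, s[:k]

rng = np.random.default_rng(7)
n = 2000
Zj = rng.normal(size=(n, 1))                               # shared latent signal
X = np.hstack([Zj + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])
Y = np.hstack([rng.normal(size=(n, 1)), -3 * Zj + 0.1 * rng.normal(size=(n, 1))])
U, V, corrs = cca(X, Y, 1)
rho_hat = np.corrcoef((X @ U).ravel(), (Y @ V).ravel())[0, 1]
```

The sample correlation `rho_hat` of the projected data matches the leading singular value `corrs[0]`, and it is high here because both datasets contain a rescaled, lightly corrupted copy of the same latent signal.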
3 Why can we assume this? Observe that the value of the objective does not change if we replace $\mathbf{u}_d$ by $\alpha\mathbf{u}_d$ and $\mathbf{v}_d$ by
$\beta\mathbf{v}_d$, where $\alpha$ and $\beta$ are any positive constants. Thus if there are maximizers $\mathbf{u}_d$, $\mathbf{v}_d$ which are not unit vectors, then $\mathbf{u}_d/\|\mathbf{u}_d\|$
and $\mathbf{v}_d/\|\mathbf{v}_d\|$ (which are unit vectors) are also maximizers.
3.2. CANONICAL CORRELATION ANALYSIS 77
Following (3.2), (3.3), (3.4), and (3.5), it is also possible to use what we have calculated to give
an explicit learned latent-space realization for $\mathbf{X}_{rv}$, $\mathbf{Y}_{rv}$ in terms of standard Gaussian random
variables $\mathbf{Z}_X$, $\mathbf{Z}_J$, $\mathbf{Z}_Y$. In particular, we can construct matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{D}$ of the appropriate sizes. This is left as
an exercise to the reader once you realize that after whitening and decorrelating (both invertible
transformations), we are left with a collection of scalar problems that would represent independent
random variables if all the variables were indeed jointly Gaussian.
CCA thus illustrates how it is possible to learn a latent representation for common (linear) structure
given paired data. This is a powerful idea not limited to the specific case of CCA. In effect, CCA
shows how we can discover (synthesize) features that distill which aspects of the input data are relevant
for understanding the output data.
This is subtly different from what happens in ordinary least squares because in ordinary least
squares, each individual element of y is predicted independently. In OLS, the different output
variables are not used collectively to distill the most relevant dimensions of the input. By contrast,
in CCA, the different output variables do vote collectively to determine relevant dimensions in the
input.
An advantage of CCA over PCA is that it is invariant to scalings and affine transformations of $\mathbf{X}$
and $\mathbf{Y}$. Consider a simplified scenario in which two matrix-valued random variables $\mathbf{X}$, $\mathbf{Y}$ satisfy
$\mathbf{Y} = \mathbf{X} + \boldsymbol{\epsilon}$, where the noise $\boldsymbol{\epsilon}$ has huge variance. What happens when we run PCA on $\mathbf{Y}$? Since
PCA maximizes variance, it will actually project $\mathbf{Y}$ (largely) into the column space of $\boldsymbol{\epsilon}$! However,
we’re interested in $\mathbf{Y}$’s relationship to $\mathbf{X}$, not its dependence on noise. How can we fix this? As
it turns out, CCA solves this issue. Instead of maximizing the variance of $\mathbf{Y}$, we maximize the correlation
between $\mathbf{X}$ and $\mathbf{Y}$. In some sense, we want to maximize the “predictive power” of the information we
have.
CCA regression
Once we’ve computed the CCA coefficients, one application is to use them for regression tasks,
predicting Y from X (or vice-versa). Recall that the correlation coefficient attains a greater value
when the two sets of data are more linearly correlated. Thus, it makes sense to find the k × k weight
matrix A that linearly relates XU and YV. We can accomplish this with ordinary least squares.
Denote the projected data matrices by Xc = XU and Yc = YV. Observe that Xc and Yc are
zero-mean because they are linear transformations of X and Y, which are zero-mean. Thus we can
fit a linear model relating the two:
Y c ≈ Xc A
The least-squares solution is given by
However, since what we really want is an estimate of Y given new (zero-mean) observations X̃
(or vice-versa), it’s useful to have the entire series of transformations that relates the two. The
predicted canonical variables are given by
We can collapse all these terms into a single matrix Aeq that gives the prediction Ŷ from X̃:
Up to this point, we’ve restricted ourselves to linear regression models. That is, our prediction
ŷ = w>x is a linear function of the input x. This holds even in the case of least-squares polynomial
regression — while the predicted value is not a linear function of the raw input x, it is still a linear
function of the augmented polynomial feature input φ(x).
Effectively, we have been able to form nonlinear models by manually augmenting features to the
input. Now what if instead of using a linear function of the augmented input, we could use
an arbitrary nonlinear function f (x; w) directly of the raw input x? This approach is often more
expressive and robust, because it removes the burden of augmenting expressive features to the input.
As a motivating example, consider the problem of estimating the 2D position w = (w1 , w2 ) of a
robot. We are given noisy distance estimates Yi ∈ R from n sensors whose positions xi ∈ R2 are fixed
and known. Since we are predicting distance, it is reasonable to use the model f (x; w) = kx − wk2 .
This model is clearly more appropriate than restricting ourselves to a linear model with augmented
features — in that case, what exactly would the augmented features be?
Note however that for most problems, we are not given the form or structure of the model. Consider
the following example: we are trying to predict a user’s income based on their occupation, age,
education, etc. It is not exactly clear what model we should use. Rather than specifying a specific
family of nonlinear functions, we are instead interested in a universal function approximator $f(\mathbf{x}; \mathbf{w})$
which can approximate any function $f(\mathbf{x})$ with appropriate parameters $\mathbf{w}$. This will be the basis
for neural networks, which we will study in detail later.
For the purposes of our discussion, let us assume that we are given a model f , an arbitrary
(nonlinear) differentiable function parameterized by w:
$$Y_i = f(\mathbf{x}_i; \mathbf{w}) + Z_i, \qquad Z_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2), \qquad i = 1, \dots, n$$
which can equivalently be expressed as Yi | xi ∼ N (f (xi ; w), σ 2 ). We are interested in finding the
parameters ŵmle that maximize the likelihood of the data:
80 CHAPTER 4. BEYOND LEAST SQUARES: OPTIMIZATION AND NEURAL NETWORKS
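$$\begin{aligned}
\hat{\mathbf{w}}_{\text{mle}} &= \arg\max_{\mathbf{w}} \sum_{i=1}^n \log p(y_i \mid \mathbf{x}_i, \mathbf{w}) \\
&= \arg\max_{\mathbf{w}} \sum_{i=1}^n \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - f(\mathbf{x}_i; \mathbf{w}))^2}{2\sigma^2}\right)\right) \\
&= \arg\max_{\mathbf{w}} \sum_{i=1}^n \left(-\frac{1}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}(y_i - f(\mathbf{x}_i; \mathbf{w}))^2\right) \\
&= \arg\min_{\mathbf{w}} \sum_{i=1}^n (y_i - f(\mathbf{x}_i; \mathbf{w}))^2
\end{aligned}$$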
Observe that the objective function is a sum of squared residuals as we’ve seen before, but now the
function f is nonlinear. For this reason this method is called nonlinear least squares.
Motivated by the MLE formulation above, our goal is to solve the following optimization problem:
$$\min_{\mathbf{w}} L(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{i=1}^n (y_i - f(\mathbf{x}_i; \mathbf{w}))^2$$
One way to solve this optimization problem is to find all of its critical points and choose the
point that minimizes the objective. From first-order optimality conditions, the gradient of the
objective function at any minimum must be zero:
$$\nabla_{\mathbf{w}} L(\mathbf{w}) = -\sum_{i=1}^n (y_i - f(\mathbf{x}_i; \mathbf{w}))\nabla_{\mathbf{w}} f(\mathbf{x}_i; \mathbf{w}) = \mathbf{0}$$
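To make the objective and its gradient concrete, here is a sketch for the robot-localization model $f(\mathbf{x}; \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_2$ described earlier. The sensor positions, noise level, and evaluation point are hypothetical; the gradient includes the minus sign produced by the chain rule, and a central finite-difference estimate is used as an independent check.

```python
import numpy as np

def f(X, w):
    """Model f(x_i; w) = ||x_i - w||_2 for each sensor position x_i (rows of X)."""
    return np.linalg.norm(X - w, axis=1)

def loss(w, X, y):
    """Nonlinear least squares objective L(w) = 1/2 * sum_i (y_i - f(x_i; w))^2."""
    return 0.5 * np.sum((y - f(X, w)) ** 2)

def grad_loss(w, X, y):
    """Analytic gradient: -sum_i (y_i - f(x_i; w)) * grad_w f(x_i; w)."""
    dist = f(X, w)
    grad_f = (w - X) / dist[:, None]       # gradient of ||x_i - w|| with respect to w
    return -np.sum((y - dist)[:, None] * grad_f, axis=0)

rng = np.random.default_rng(8)
X = rng.uniform(-5, 5, size=(6, 2))        # hypothetical sensor positions
w_true = np.array([1.0, -2.0])             # hypothetical robot position
y = f(X, w_true) + 0.05 * rng.normal(size=6)
w0 = np.array([0.3, 0.4])

# Central finite differences along each coordinate axis.
eps = 1e-6
fd = np.array([(loss(w0 + eps * e, X, y) - loss(w0 - eps * e, X, y)) / (2 * eps)
               for e in np.eye(2)])
```

Comparing `grad_loss(w0, X, y)` with `fd` is a cheap way to catch sign or indexing mistakes before handing the gradient to an optimizer.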
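When f is linear in $\mathbf{w}$, i.e. $f(\mathbf{x}_i; \mathbf{w}) = \mathbf{x}_i^\top\mathbf{w}$, this condition can be written in matrix form
and we can derive a closed-form solution for $\mathbf{w}$, arriving at the OLS solution:
$$\begin{aligned}
\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) &= \mathbf{0} \\
\mathbf{X}^\top\mathbf{y} - \mathbf{X}^\top\mathbf{X}\mathbf{w} &= \mathbf{0} \\
\mathbf{X}^\top\mathbf{y} &= \mathbf{X}^\top\mathbf{X}\mathbf{w} \\
\mathbf{w} &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}
\end{aligned}$$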
In the general case where f is nonlinear in $\mathbf{w}$, it is not necessarily possible to derive a closed-form
solution for $\mathbf{w}$, for a few reasons. First of all, without additional assumptions on f, the
NLS objective may not be convex. Therefore there may exist values of $\mathbf{w}$ that are not global
minima but nonetheless satisfy $\nabla_{\mathbf{w}} L(\mathbf{w}) = \mathbf{0}$: they could be local minima, saddle points, or worse, local
maxima! Second of all, even if the objective is convex, we may not be able to solve the equation
$\mathbf{J}(\mathbf{w})^\top(\mathbf{y} - F(\mathbf{w})) = \mathbf{0}$ for $\mathbf{w}$, where $F(\mathbf{w}) \in \mathbb{R}^n$ stacks the predictions $f(\mathbf{x}_i; \mathbf{w})$ and $\mathbf{J}(\mathbf{w})$ is
the Jacobian of F with respect to $\mathbf{w}$. Given the challenges that nonlinear least squares introduces over
linear least squares, we need a principled approach to solve problems that have no closed-form
solution, preferably agnostic of the specific objective itself.
4.2 Optimization
but more generally, as we move into the realm of neural networks and beyond, we will be solving
arbitrary problems of the form
$$\min_{\mathbf{w} \in \mathcal{X}} f(\mathbf{w})$$
over an arbitrary continuous objective function f : Rd → R and arbitrary domain X . If we are able
to solve this more general class of problems, then we can solve nonlinear least squares as a specific
instance of the problem. Solving such problems is the focus of optimization, an extensive field
that has applications in control theory, finance, and machine learning.
In optimization we are interested in finding the global minimum of a function. In the pursuit of
finding the global minimum, we may encounter local minima along the way, which are suboptimal
but may actually be close enough to the global minimum. More broadly, such points belong to the
class of critical points, the “interesting” points of deflection that we may want to consider when
finding minima:
(i) local minimum: a differentiable point w ∈ X such that there exists a neighborhood around
w where f (w) attains the minimum value
(ii) local maximum: a differentiable point w ∈ X such that there exists a neighborhood around
w where f (w) attains the maximum value
(iii) saddle point: a differentiable point w ∈ X such that for all neighborhoods around w, there
exists u, v such that f (u) ≤ f (w) ≤ f (v)
Technically, local minima must exist within a neighborhood of the domain and be differentiable, and
our analysis of minima isn’t complete without also considering the following as potential minima:
(i) boundary points: points $\mathbf{w} \in \mathcal{X}$ that can be approached from both $\mathcal{X}$ and outside $\mathcal{X}$, or
more intuitively, points that lie on the “boundary” of $\mathcal{X}$
(ii) non-differentiable points: points at which the derivative is undefined, such as points with
“kinks”
For the remainder of our discussion, we will assume that we are solving unconstrained optimization
problems with differentiable functions, so that we will only have to consider critical points in our
analysis.
The categorization in the previous section is helpful, but how exactly can we determine which points
in the domain are critical points? As it turns out, the set of all critical points is simply the set of
points at which the gradient is zero. Given that f is continuously differentiable, the gradient is
defined as the vector of partial derivatives of f , denoted by
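$$\nabla f = \begin{bmatrix}\dfrac{\partial f}{\partial w_1} \\ \dfrac{\partial f}{\partial w_2} \\ \vdots \\ \dfrac{\partial f}{\partial w_d}\end{bmatrix}$$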
Since the set of points at which the gradient is zero defines the set of critical points, the
gradient being zero is a necessary condition for local minima.
Proposition 1. If w∗ is a local minimum of f and f is continuously differentiable in a neighborhood
of w∗ , then ∇f (w∗ ) = 0.
This justifies the technique we have been using on numerous occasions so far to solve least squares
problems: setting the gradient of the objective function to zero and solving the corresponding
equation. Note, however, that while a zero gradient is a necessary condition for local
minima, it is not a sufficient condition. In many circumstances, the function that we are optimizing
may not have a local minimum, and in general setting the gradient to zero could yield local maxima or
saddle points. Even if all critical points were minima, we would still have to solve the corresponding
equation ∇f (w) = 0, which is not always trivial. In the cases when solving this equation is
intractable, we say that no closed-form solution exists, and therefore an iterative algorithm is
needed to solve the optimization problem. Even if a problem does have a closed form solution that
we can directly find, it may still be much more computationally efficient to solve the problem with
iterative algorithms.
Rather than using gradients to find the closed-form solution, we can use gradients to “creep toward”
a local minimum in an iterative fashion. Gradient Descent is an algorithm that iteratively takes
small steps in the direction of steepest descent of the objective f . Intuitively, we can view gradient
descent as a ball rolling down a hill. If we place the ball somewhere at the top of the hill, it will
4.3. GRADIENT DESCENT 83
naturally roll down the direction of steepest descent until it reaches the bottom of the hill, at which
point it may oscillate around until it eventually comes to a stop at the bottom.
Gradient descent is a simple, intuitive method that works remarkably well in practice. One question
that remains is: how exactly do we determine the direction of steepest descent of a multivariate
function, and what does this method have to do with gradients? Given that we are currently at
a point w(t) in the domain of the function, the direction of steepest descent is the negative of the
gradient at that point, −∇f (w(t) ). To see why, recall that the directional derivative in a unit
direction u at w(t) is defined as the inner product of the gradient and the direction:
$$
D_u f(w^{(t)}) = \langle \nabla f(w^{(t)}), u \rangle = \|\nabla f(w^{(t)})\| \, \|u\| \cos\theta = \|\nabla f(w^{(t)})\| \cos\theta
$$
where θ is the angle between ∇f (w(t) ) and u. Finding the direction of steepest descent entails
finding the direction that minimizes the directional derivative. We can minimize the directional
derivative by setting θ = π, which means that the direction u and ∇f (w(t) ) are opposite to
each other, and thus the direction of steepest descent u∗ points along −∇f (w(t) ) (similarly the
direction of steepest ascent points along ∇f (w(t) )). The gradient descent algorithm takes a step in this
direction, scaling the gradient by a positive scalar αt .
Algorithm 4: Gradient Descent
Initialize w(0) to a random point
while f (w(t) ) not converged do
w(t+1) ← w(t) − αt ∇f (w(t) )
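Algorithm 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than a reference implementation: the stepsize is held constant, "not converged" is taken to mean that the gradient norm is still above a tolerance, and the quadratic test objective (with the names `A`, `b`, `gradient_descent`) is hypothetical.

```python
import numpy as np

def gradient_descent(grad_f, w0, step_size=0.1, tol=1e-8, max_iters=10_000):
    """Repeat w <- w - step_size * grad_f(w) until the gradient norm drops below tol."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_iters):
        g = grad_f(w)
        if np.linalg.norm(g) < tol:  # assumed stand-in for "f(w) not converged"
            break
        w = w - step_size * g
    return w, t

# Hypothetical test objective: f(w) = 0.5 * ||A w - b||^2, whose gradient is A^T (A w - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])
w_gd, iters_used = gradient_descent(lambda w: A.T @ (A @ w - b), w0=np.zeros(2))
w_exact = np.linalg.solve(A, b)  # closed-form minimizer, for comparison only
```

On this objective the iterates approach the closed-form solution; a stepsize above 2 divided by the largest Hessian eigenvalue (here 4) would make them diverge instead.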
Determining this scaling αt is dependent on the attributes of the function f . Sometimes we can
set the scaling to a constant value and converge to the optimum value, whereas in other instances
we need to determine an adaptive stepsize. A scaling that is too high may cause the algorithm to
diverge from the optimal solution, whereas a scaling that is too low may cause the algorithm to
converge too slowly. For certain classes of functions, there are theoretical guarantees that establish
convergence, which we will state later.
Figure 4.2: In gradient descent, stepsize matters. A small stepsize (left) converges to the optimal
point very slowly, and a large stepsize (right) can lead to divergence. Source: CMU 10-725
84 CHAPTER 4. BEYOND LEAST SQUARES: OPTIMIZATION AND NEURAL NETWORKS
Mini-batch gradient descent is a stochastic variant of batch gradient descent, that instead of
summing an entire “batch” of n gradients, samples and adds a random “mini-batch” of gradients
over k < n indices drawn from {1, . . . , n}:
$$
G(w) = \frac{1}{k} \sum_{i=1}^{k} \nabla f_i(w)
$$
A major advantage of mini-batch gradient descent is that each iteration is now more computationally
efficient, leading to more progress per unit of computation and allowing us to monitor the
performance of the algorithm faster. In addition, mini-batch gradient descent can escape local minima
with more ease compared to batch gradient descent, due to the noisy nature of its gradients. However, note that
this can also lead to instability if the stochastic gradients have high variance. For this reason,
mini-batch gradient descent generally requires a higher number of overall iterations to match the
performance of batch gradient descent, which can lead to expensive computational overhead. Given
the appropriate choice of k, mini-batch gradient descent can be significantly more computationally
efficient overall than batch gradient descent.
The special case of mini-batch gradient descent with k = 1 is called stochastic gradient descent
(SGD). In this case, we can define the stochastic gradient by just drawing an index i uniformly at
random from {1, . . . , n} and setting
G(w) = ∇fi (w)
We can verify that the stochastic gradient is indeed an unbiased estimate of the true gradient:
$$
\mathbb{E}_i[G(w)] = \mathbb{E}_i[\nabla f_i(w)] = \sum_{j=1}^{n} P(i = j)\, \nabla f_j(w) = \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(w) = \nabla f(w)
$$
Compared to batch gradient descent, the gradient updates in SGD are significantly faster, but SGD
often requires a significantly higher number of updates. In practice, mini-batch gradient descent
is more effective than SGD and batch gradient descent, capturing the stability of batch gradient
descent while at the same time injecting enough stochasticity to escape local minima and saddle
points.
Figure 4.3: Increasing the batch size leads to more stability, at a higher computational cost per iteration.
Source: Towards Data Science
A fair way to compare batch, mini-batch, and stochastic gradient descent is through the concept of
epochs. An epoch is a measure of time: it is defined as the number of iterations needed to traverse
the training data once. In the case of batch gradient descent, since all n examples comprising the
training data are used to compute the gradient at each iteration, an epoch is simply equivalent to
one iteration. In the case of SGD, since we only sample one example at each iteration, an epoch is
equivalent to n iterations. In the case of mini-batch gradient descent, an epoch comprises n/k
iterations. In practice, given the same number of epochs, mini-batch gradient descent tends to
perform the best.
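The relationship between batch size, iterations, and epochs can be made concrete with a small sketch. It assumes a least-squares objective on synthetic noiseless data, a constant stepsize, and mini-batches formed by shuffling the data once per epoch (a common practical variant of sampling indices at random); the function name `minibatch_gd` and all constants are illustrative choices, not part of the text above.

```python
import numpy as np

def minibatch_gd(X, y, k, step_size, epochs, seed=0):
    """Mini-batch gradient descent on f(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    iterations = 0
    for _ in range(epochs):
        order = rng.permutation(n)              # assumption: reshuffle once per epoch
        for start in range(0, n, k):            # n / k iterations per epoch
            idx = order[start:start + k]
            residual = X[idx] @ w - y[idx]
            G = X[idx].T @ residual / len(idx)  # average of the sampled gradients
            w = w - step_size * G
            iterations += 1
    return w, iterations

# Synthetic, noiseless data (an assumption made purely for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_batch, it_batch = minibatch_gd(X, y, k=200, step_size=0.1, epochs=50)  # batch GD
w_mini, it_mini = minibatch_gd(X, y, k=20, step_size=0.1, epochs=50)     # mini-batch GD
w_sgd, it_sgd = minibatch_gd(X, y, k=1, step_size=0.05, epochs=50)       # SGD
```

With the same budget of 50 epochs, the three runs perform 50, 500, and 10,000 updates respectively, which is why the smaller batch sizes end up much closer to `w_true` here.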
Momentum
While mini-batch gradient descent can help us escape local minima and saddle points, the
stochastic nature of the algorithm can often lead to oscillations that cause instability and slow
convergence. These issues are not unique to stochastic gradients and can arise in the deterministic
case, for example when the objective function is disproportionately scaled, i.e. the function is
elongated along one axis while being contracted along another, giving the appearance of “ravines”
in the function landscape. The disproportionate scalings cause the algorithm to make large leaps
in contracted directions, while making very slow progress along elongated directions. The resulting
behavior is a series of oscillations that may reach the optimal point very slowly.
Figure 4.4: Standard gradient descent converges slowly to the optimum when the objective function is
disproportionately scaled. Source: distill.pub
Polyak’s heavy ball method addresses these issues by introducing a momentum term that adds
inertia to the iterates and prevents them from deviating from the overall direction of the updates.
Rather than updating the iterate w(t) using the gradient ∇f (w(t) ), Polyak’s heavy ball method
uses ∇f (w(t) ) along with a history of all the gradients from the iterates seen so far. Specifically, it
updates the iterates via a velocity term v(t) that represents an exponential moving average of all
of the gradients seen so far.
Algorithm 6: Polyak’s Heavy Ball Method
Initialize w(0) to a random point
Initialize v(0) to −α0 ∇f (w(0) )
while f (w(t) ) not converged do
v(t) ← βt v(t−1) − αt ∇f (w(t) )
w(t+1) ← w(t) + v(t)
The motivation for using a moving average of the gradients is as follows: we want to downplay
the directions for which the gradient is oscillating over time and boost the directions for which
the gradient is constant over time. When we are in a “ravine” this has the effect of “killing” the
gradient in constricted directions whose derivatives oscillate over time, accelerating convergence.
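A minimal sketch of the heavy ball update on a "ravine"-shaped quadratic follows. It assumes constant α and β, and it reads the initialization in Algorithm 6 as taking the first step w(1) = w(0) + v(0) before the loop starts; the objective `H` and all parameter values are hypothetical. Setting `beta=0` recovers plain gradient descent, which gives a direct comparison.

```python
import numpy as np

def heavy_ball(grad_f, w0, alpha, beta, num_iters):
    """Polyak's heavy ball method with constant stepsize alpha and momentum beta."""
    w = np.asarray(w0, dtype=float)
    v = -alpha * grad_f(w)                # v^(0) = -alpha * grad f(w^(0))
    w = w + v                             # assumed first step: w^(1) = w^(0) + v^(0)
    for _ in range(num_iters):
        v = beta * v - alpha * grad_f(w)  # v^(t) = beta * v^(t-1) - alpha * grad f(w^(t))
        w = w + v                         # w^(t+1) = w^(t) + v^(t)
    return w

# Hypothetical ill-conditioned objective f(w) = 0.5 * w^T H w, minimized at w = 0.
H = np.diag([1.0, 100.0])
grad = lambda w: H @ w
w_gd = heavy_ball(grad, [10.0, 1.0], alpha=0.018, beta=0.0, num_iters=200)  # no momentum
w_hb = heavy_ball(grad, [10.0, 1.0], alpha=0.018, beta=0.8, num_iters=200)  # with momentum
```

After the same number of iterations, the momentum run is many orders of magnitude closer to the minimizer than the run without momentum, because the slow direction (eigenvalue 1) accumulates velocity instead of creeping forward.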
Figure 4.5: Polyak’s heavy ball method uses momentum to dampen oscillations, accelerating convergence to
the optimum point. Source: distill.pub
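There is an alternative interpretation of Polyak’s heavy ball method that is condensed to just one
line. Since v(t−1) = w(t) − w(t−1) , the two updates combine into
$$
w^{(t+1)} = w^{(t)} - \alpha_t \nabla f(w^{(t)}) + \beta_t \left( w^{(t)} - w^{(t-1)} \right)
$$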
Polyak’s heavy ball method uses information about past iterates to determine the descent direction.
Nesterov’s accelerated gradient descent improves on this reasoning, incorporating information
about potential future iterates as well. The only difference in Nesterov’s accelerated gradient descent
is that it computes a “lookahead gradient” ∇f (w(t) +βt v(t−1) ) instead of the gradient at the current
iterate ∇f (w(t) ). Effectively, we are performing a one step “look ahead” of the gradient and moving
in that direction, potentially correcting for oscillations ahead of us.
Algorithm 7: Nesterov’s Accelerated Gradient Descent
Initialize w(0) to a random point
Initialize v(0) to −α0 ∇f (w(0) )
while f (w(t) ) not converged do
v(t) ← βt v(t−1) − αt ∇f (w(t) + βt v(t−1) )
w(t+1) ← w(t) + v(t)
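The only change relative to the heavy ball method is where the gradient is evaluated, which the sketch below makes explicit. As before, this assumes constant α and β and treats the initialization as a first step w(1) = w(0) + v(0); the quadratic objective and the choice β = 9/11 (a standard setting for a quadratic with condition number 100) are illustrative assumptions.

```python
import numpy as np

def nesterov(grad_f, w0, alpha, beta, num_iters):
    """Nesterov's accelerated gradient descent with constant alpha and beta."""
    w = np.asarray(w0, dtype=float)
    v = -alpha * grad_f(w)                        # v^(0) = -alpha * grad f(w^(0))
    w = w + v                                     # assumed first step
    for _ in range(num_iters):
        lookahead = w + beta * v                  # point where the gradient is evaluated
        v = beta * v - alpha * grad_f(lookahead)  # lookahead gradient replaces grad f(w^(t))
        w = w + v
    return w

# Hypothetical ill-conditioned objective f(w) = 0.5 * w^T H w, minimized at w = 0.
H = np.diag([1.0, 100.0])
grad = lambda w: H @ w
w_plain = nesterov(grad, [10.0, 1.0], alpha=0.01, beta=0.0, num_iters=200)     # beta = 0: plain GD
w_nag = nesterov(grad, [10.0, 1.0], alpha=0.01, beta=9.0 / 11.0, num_iters=200)
```

With `beta=0` the lookahead point coincides with the current iterate, so the function reduces to gradient descent; with momentum the same number of iterations gets far closer to the minimizer.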
Figure 4.6: Polyak’s heavy ball method applies gradient before update, while Nesterov’s accelerated gradient
descent applies gradient after update. Source: Stanford CS 231n
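Through the same manipulations as for Polyak’s heavy ball method (substituting
v(t−1) = w(t) − w(t−1) ), we derive the one-line update for Nesterov’s accelerated gradient descent:
$$
w^{(t+1)} = w^{(t)} + \beta_t \left( w^{(t)} - w^{(t-1)} \right) - \alpha_t \nabla f\!\left( w^{(t)} + \beta_t \left( w^{(t)} - w^{(t-1)} \right) \right)
$$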
Line search is another iterative optimization algorithm that, instead of taking small gradient
steps, repeatedly slices the function along a one-dimensional line and finds the minimum along it.
Normally, finding the (global) minimum of a d-dimensional function is an extremely difficult problem,
but in this case we are only doing so for one-dimensional “sliced” functions, which is a much easier
task. Each iteration of line search entails three steps: (1) choosing a promising descent direction
(or sometimes a random direction), (2) looking ahead in that direction and (roughly) finding the
minimum, and (3) going to that minimum.
Algorithm 8: Line Search
Initialize w(0) to a random point
while f (w(t) ) not converged do
Find a descent direction u(t)
Find αt ∈ R+ to minimize h(α) = f (w(t) + αu(t) )
w(t+1) ← w(t) + αt u(t)
There are several options for the direction, such as the negative gradient (which is used in gradient
descent). Broadly, we can pick any descent direction, that is, a direction with a negative
directional derivative:
$$
D_u f(w^{(t)}) = \langle \nabla f(w^{(t)}), u \rangle < 0
$$
Another common choice is to choose an arbitrary coordinate axis (for example, the x, y, or z axis in
3D), a line search variant called coordinate descent. Note that the minimization in the second
step does not necessarily need to be exact. A simple approach is to sample several points along
the line and choose the one with the smallest value, in a grid search fashion.
Line search methods offer a few advantages over gradient descent methods. For one, they do not
necessarily require gradients, which can be particularly helpful in non-differentiable domains. Also,
they are potentially more robust to local minima, because they find the global minima of the 1D
functions ahead of them.
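The sketch below combines coordinate descent with a grid-based inexact line search, and uses an objective with a non-differentiable term to illustrate that no gradients are needed. It makes two assumptions worth flagging: the grid of candidate stepsizes includes negative values, so that either direction along a coordinate axis can be taken (Algorithm 8 restricts α to be positive for a genuine descent direction), and the objective `f` is a made-up example.

```python
import numpy as np

def grid_line_search(f, w, u, alphas):
    """Approximately minimize h(alpha) = f(w + alpha * u) over a grid of candidate alphas."""
    values = [f(w + a * u) for a in alphas]
    return alphas[int(np.argmin(values))]

def coordinate_descent(f, w0, alphas, num_sweeps=5):
    """Line search along one coordinate axis at a time, cycling through all coordinates."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_sweeps):
        for j in range(w.size):
            u = np.zeros_like(w)
            u[j] = 1.0                                     # j-th coordinate direction
            w = w + grid_line_search(f, w, u, alphas) * u  # move to the best grid point
    return w

# Hypothetical objective with a kink at w[0] = 1; its minimizer is (1, -0.5).
f = lambda w: abs(w[0] - 1.0) + (w[1] + 0.5) ** 2
alphas = np.linspace(-2.0, 2.0, 401)  # grid spacing 0.01, contains alpha = 0
w_cd = coordinate_descent(f, [0.0, 0.0], alphas)
```

Because the grid contains α = 0, a step never increases the objective (up to rounding), and the accuracy of the result is limited by the grid spacing.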
4.5. CONVEX OPTIMIZATION 89
A critical issue with the methods we have presented so far is that they can get stuck in local minima.
With gradient descent for example, moving in the direction of steepest descent is a greedy choice
that can cause convergence to a poor local minimum, depending on the initial starting point of the
algorithm.
Figure 4.7: Depending on the initialization of gradient descent, the algorithm will converge to different local
minima. Source: Towards Data Science
Convex functions conveniently eliminate this problem due to their “bowl shape,” which ensures
that all local minima are global minima.
[Figure: surface plot of a convex function over the variables x and y.]
Due to this property, optimizing convex functions entails nice theoretical convergence rates that are
otherwise not guaranteed for non-convex functions. For these reasons, there is a dedicated subfield
of optimization called convex optimization that focuses on optimization problems with convex
functions and convex constraints.
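Given that f : Rn → R is twice continuously differentiable, the following are equivalent conditions
of convexity:
(i) $f(t w_1 + (1-t) w_2) \le t f(w_1) + (1-t) f(w_2), \quad \forall w_1, w_2, \; t \in [0, 1]$
(ii) $f(w_2) \ge f(w_1) + \nabla f(w_1)^\top (w_2 - w_1), \quad \forall w_1, w_2$
(iii) $(\nabla f(w_2) - \nabla f(w_1))^\top (w_2 - w_1) \ge 0, \quad \forall w_1, w_2$
(iv) $\nabla^2 f(w) \succeq 0, \quad \forall w$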
Figure 4.9: Any line segment connecting two points of a convex function must lie above the function. Source:
Princeton University ORF 523
Figure 4.10: Any line tangent to a convex function must lie below the function. Source: Princeton University
ORF 523
Let’s study these properties closely. The first condition states that for any two points w1 , w2 , the
function lies below the line segment connecting (w1 , f (w1 )) and (w2 , f (w2 )). The next condition
states that any tangent line to f must lie below the entire function. The third condition states that
the gradient is monotone: in one dimension, if w2 is greater than w1 , then the derivative at w2 is at
least as large as the derivative at w1 .
Finally, the last condition states that the “second derivative” of f is always non-negative. More
rigorously, we can generalize the concept of second derivatives in higher dimensions with the Hes-
sian. Given that f is twice continuously differentiable, we define the Hessian as the matrix of
second partial derivatives of f , denoted by
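$$
\nabla^2 f = \begin{bmatrix}
\dfrac{\partial^2 f}{\partial w_1^2} & \cdots & \dfrac{\partial^2 f}{\partial w_1 \partial w_d} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial w_d \partial w_1} & \cdots & \dfrac{\partial^2 f}{\partial w_d^2}
\end{bmatrix}
$$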
Unfortunately, the gradient being zero and the Hessian being positive semi-definite (PSD) together
are necessary but not sufficient conditions for a local minimum (consider the function f (w) = w^3 or
f (w) = −w^4 at w = 0). However, the gradient being zero and the Hessian being PSD in a neighborhood
are sufficient conditions.
Proposition 3. Suppose f is twice continuously differentiable with ∇2 f positive semi-definite in
a neighborhood of w∗ , and that ∇f (w∗ ) = 0. Then w∗ is a local minimum of f . 1
Since for convex functions the Hessian is PSD at all points in the domain, any critical point is a
local minimum. In fact, any local minimum is also a global minimum, so any point for which the
gradient is zero must be the global minimum.
Proposition 4. Let X be a convex set. If f is convex, then any local minimum of f in X is also
a global minimum.
Consequently we can find any point for which the gradient is zero and guarantee that it is the
global minimum (this is exactly the case in OLS and Ridge Regression, since the Hessian of the
objective function is PSD and the objective is therefore convex). Note, however, that this does not
imply that the global minimum is unique: there could be several different points which achieve the
global minimum.
Strong Convexity
While convex functions guarantee that all local minima are global minima, they do not guarantee
that the global minimum is attained at a unique point. Strong convexity is an extension that
guarantees this property. For a strictly positive m ∈ R, a function is m-strongly convex if the
following equivalent conditions hold:
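(i) $f(t w_1 + (1-t) w_2) \le t f(w_1) + (1-t) f(w_2) - \frac{t(1-t)m}{2} \|w_2 - w_1\|^2, \quad \forall w_1, w_2, \; t \in [0, 1]$
(ii) $g(w) = f(w) - \frac{m}{2} \|w\|^2$ is convex
(iii) $f(w_2) \ge f(w_1) + \nabla f(w_1)^\top (w_2 - w_1) + \frac{m}{2} \|w_2 - w_1\|^2, \quad \forall w_1, w_2$
(iv) $(\nabla f(w_2) - \nabla f(w_1))^\top (w_2 - w_1) \ge m \|w_2 - w_1\|^2, \quad \forall w_1, w_2$
(v) $\nabla^2 f(w) \succeq m I, \quad \forall w$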
The conditions for strong convexity are identical to those for convex functions, but with an addi-
tional term involving m. Strongly convex functions provide several advantages over general convex
functions. From the third condition, we see that strongly convex functions can be lower bounded
by a quadratic function, which establishes the uniqueness of a global minimum.
1 A subtle point: if ∇2 f (w∗ ) is positive definite and ∇f (w∗ ) = 0, then w∗ is a strict local minimum. We do not have to
check that the Hessian is PSD in a neighborhood of w∗ , as this condition is implied from the fact that f is twice continuously
differentiable.
Proposition 5. Let X be a convex set. If f is strongly convex, then there exists exactly one
local minimum of f in X . Consequently, it is the unique global minimum of f in X .
If the eigenvalues of the Hessian ∇2 f are bounded below by some m > 0 at all points, then f is
m-strongly convex; the largest such m is the smallest eigenvalue of ∇2 f over all points w. Recall
from our discussion of OLS vs. Ridge Regression that while OLS may have several solutions, Ridge
Regression has a unique solution. This is because the Hessian of the Ridge Regression objective is
positive definite, making the objective strongly convex, while the Hessian of the OLS objective is
only positive semi-definite, so the OLS objective is not necessarily strongly convex.
Smoothness
While strongly convex functions are lower bounded by a quadratic function, smooth functions are
upper bounded by a quadratic function. 2
An M -smooth (or more formally Lipschitz continuous gradient) function is one for which
there exists a strictly positive M ∈ R such that
$$
\|\nabla f(w_2) - \nabla f(w_1)\| \le M \|w_2 - w_1\|, \quad \forall w_1, w_2 \tag{4.1}
$$
This definition does not assume that f is convex. (4.1) implies all of the following equivalent
conditions:
(i) $f(t w_1 + (1-t) w_2) \ge t f(w_1) + (1-t) f(w_2) - \frac{t(1-t)M}{2} \|w_2 - w_1\|^2, \quad \forall w_1, w_2, \; t \in [0, 1]$
(ii) $f(w_2) \le f(w_1) + \nabla f(w_1)^\top (w_2 - w_1) + \frac{M}{2} \|w_2 - w_1\|^2, \quad \forall w_1, w_2$
(iii) $(\nabla f(w_2) - \nabla f(w_1))^\top (w_2 - w_1) \le M \|w_2 - w_1\|^2, \quad \forall w_1, w_2$
(iv) $\nabla^2 f(w) \preceq M I, \quad \forall w$
When f is convex, the above conditions also imply (4.1), establishing equivalence among all
of the conditions. Roughly speaking, smoothness is the counterpart to strong convexity, with the
inequality signs flipped. If the eigenvalues of the Hessian ∇2 f are bounded from above at all
points, f is M -smooth with M equal to the maximum eigenvalue of ∇2 f (over all points w).
While gradient descent does not have convergence guarantees in general, we can make theoretical
guarantees when the function is convex. Furthermore, strong convexity and smoothness will provide
lower and upper bounds for f respectively, allowing us to achieve a significantly faster convergence
rate. Assuming that the distance from the initial point w(0) and the optimal point w∗ is R, we
have the following:
For detailed proofs of the rates above, refer to the EE 227C lecture notes. Individually, strong
convexity and smoothness allow us to accelerate the rate of convergence from O(1/√t) to O(1/t). Put
together, they allow us to achieve an exponential convergence rate, a significant acceleration!
The quantity κ = M/m is known as the condition number: the ratio of the largest to the smallest
singular value of the Hessian of f . Recall from our discussion of OLS vs. Ridge Regression that
Ridge Regression adds a small penalty term λ‖w‖² to the objective, effectively making the problem
strongly convex. Since the OLS objective is already smooth, gradient descent can then achieve an
exponential rate of convergence to the optimal value. The higher the value of λ, the lower the value
of the condition number κ, which leads to an even faster convergence rate. This, of course, comes
at the cost of regularization.
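The effect of λ on the condition number can be checked numerically. The sketch below assumes the ridge objective is written as ‖Xw − y‖² + λ‖w‖², so that its Hessian is 2(XᵀX + λI); the badly scaled random design matrix and the helper name `ridge_condition_number` are hypothetical.

```python
import numpy as np

# Hypothetical design matrix whose columns have very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

def ridge_condition_number(X, lam):
    """kappa = M / m for f(w) = ||Xw - y||^2 + lam * ||w||^2, with Hessian 2 (X^T X + lam I)."""
    hessian = 2.0 * (X.T @ X + lam * np.eye(X.shape[1]))
    eigvals = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix, ascending
    m, M = eigvals[0], eigvals[-1]
    return M / m

kappas = {lam: ridge_condition_number(X, lam) for lam in [0.0, 1.0, 10.0, 100.0]}
```

Adding λ shifts every eigenvalue of XᵀX up by the same amount, so the ratio (M + λ)/(m + λ) shrinks toward 1 as λ grows.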
Up until this point, we have only considered first-order methods to optimize functions. Now, we
will present Newton’s method, an iterative method that utilizes second-order information to
achieve a faster rate of convergence than existing first-order methods. Given an arbitrary twice
continuously differentiable objective function f , Newton’s Method iteratively minimizes the second-
order Taylor expansion of the objective function. Given the current iterate w(t) , it minimizes the
following objective:
$$
\min_{w} \; \bar{f}(w) = f(w^{(t)}) + \nabla f(w^{(t)})^\top (w - w^{(t)}) + \frac{1}{2} (w - w^{(t)})^\top \nabla^2 f(w^{(t)}) (w - w^{(t)})
$$
We can minimize $\bar{f}(w)$ by setting its gradient to zero:
$$
\nabla \bar{f}(w) = \nabla f(w^{(t)}) + \nabla^2 f(w^{(t)}) (w - w^{(t)}) = 0
$$
which leads to the update rule (otherwise known as the Newton step)
$$
w^{(t+1)} = w^{(t)} - \nabla^2 f(w^{(t)})^{-1} \nabla f(w^{(t)})
$$
The updates for Newton’s method and gradient descent are nearly identical:
$$
\begin{aligned}
w^{(t+1)} &= w^{(t)} - \alpha_t \nabla f(w^{(t)}) && \text{(Gradient descent)} \\
w^{(t+1)} &= w^{(t)} - \nabla^2 f(w^{(t)})^{-1} \nabla f(w^{(t)}) && \text{(Newton's method)}
\end{aligned}
$$
We can think of gradient descent as a Newton update in which we approximate ∇2 f (w(t) )−1 by
a scaled version of the identity. That is, gradient descent is equivalent to Newton’s method when
∇2 f (w(t) )−1 = αt I where I is the identity matrix.
The algorithm is as follows:
Algorithm 9: Newton’s Method
Initialize w(0) to a random point
while f (w(t) ) not converged do
w(t+1) ← w(t) − ∇2 f (w(t) )−1 ∇f (w(t) )
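A short NumPy sketch of Algorithm 9 is given below. It assumes the user supplies the gradient and Hessian as functions, uses the gradient norm as the convergence test, and solves a linear system instead of forming the inverse Hessian explicitly (a common numerical practice rather than something the algorithm prescribes). The test objective f(w) = Σ_i exp(w_i) + ½‖w‖² is a hypothetical strongly convex example.

```python
import numpy as np

def newton(grad_f, hess_f, w0, tol=1e-10, max_iters=50):
    """Newton's method: w <- w - H(w)^{-1} grad f(w), stopping when the gradient is tiny."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_iters):
        g = grad_f(w)
        if np.linalg.norm(g) < tol:
            return w, t
        # Solve H(w) * step = g rather than computing the inverse Hessian.
        w = w - np.linalg.solve(hess_f(w), g)
    return w, max_iters

# Hypothetical objective f(w) = sum_i exp(w_i) + 0.5 * ||w||^2.
grad_f = lambda w: np.exp(w) + w
hess_f = lambda w: np.diag(np.exp(w)) + np.eye(w.size)
w_newton, newton_iters = newton(grad_f, hess_f, w0=np.zeros(2))
```

Each coordinate of the minimizer solves w + exp(w) = 0, and the iteration reaches it to high precision in a handful of steps, in contrast to the hundreds of steps gradient descent typically needs.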
Alternative Interpretation
approximation. Newton’s method is agnostic to the type of function that it operates on, whether
it is the gradient function or just the objective function. In its simplest form, Newton’s method
can be used to find the roots of a single-variable function ϕ : R → R. Our goal is to find a root of
the non-linear equation ϕ(w) = 0. Suppose we have a current estimate w(t) of the root. From
Taylor’s theorem, we can express the first-order expansion of ϕ(w) about w(t) as
$$
\phi(w) = \phi(w^{(t)}) + \phi'(w^{(t)}) \, (w - w^{(t)}) + o(w - w^{(t)})
$$
Writing δ = w − w(t) , dropping the higher-order term, and setting the linear approximation to zero gives
$$
\phi(w^{(t)}) + \phi'(w^{(t)}) \, \delta = 0
$$
Then $\delta = -\phi(w^{(t)}) / \phi'(w^{(t)})$, leading to the iteration
$$
w^{(t+1)} = w^{(t)} - \frac{\phi(w^{(t)})}{\phi'(w^{(t)})}
$$
We can similarly make an argument for a multivariate function F : Rd → Rd . Our goal is to solve
F (w) = 0. Again, from Taylor’s theorem we have that
$$
F(w + \Delta) = F(w) + J_F(w) \Delta + o(\|\Delta\|)
$$
where JF is the Jacobian. This gives us $\Delta = -J_F^{-1}(w) F(w)$, and the iteration
$$
w^{(t+1)} = w^{(t)} - J_F^{-1}(w^{(t)}) F(w^{(t)})
$$
In the context of optimization, Newton’s method is a special application of this root-finding method,
applied to the gradient function. That is, given that we are minimizing f : Rd → R, Newton’s
method finds the roots of the gradient function ∇f : Rd → Rd . It uses the update rule
$$
w^{(t+1)} = w^{(t)} - \nabla^2 f(w^{(t)})^{-1} \nabla f(w^{(t)})
$$
as the Hessian ∇2 f (w(t) ) of the objective function is the Jacobian of the gradient. Let’s
understand the motivation of Newton’s method in close detail. Our goal is to find local minima of f ,
points for which it is necessarily true that ∇f (w) = 0. The equation ∇f (w) = 0 can be difficult or
even intractable to solve directly, so instead we work with a first-order Taylor approximation of
the gradient about our current iterate w(t) . We solve for the roots of the linearized gradient,
update our iterate, and repeat the process. Note that while solving ∇f (w) = 0 may yield local maxima
or even saddle points, the linearized gradient is the gradient of the second-order approximation of
f ; when that approximation is convex (that is, when the Hessian at w(t) is positive semi-definite),
any point at which the linearized gradient is zero is a global minimum of the approximation.
There are a few issues with Newton’s method that we glossed over in our analysis. In general, there
are no guarantees that Newton’s method will converge, and even more concerning, the algorithm
may get stuck as the Hessian ∇2 f (w(t) ) may not be invertible. Placing invertibility issues aside,
the most concerning issue is that Newton’s method may not even be attempting to minimize the
objective function. To see why, recall that the goal of each Newton step is to minimize the second-
order approximation, which we do by setting the gradient of the approximation to zero. This
4.6. NEWTON’S METHOD 95
is not a sound step in general, as it may yield saddle points or maxima of the approximation. This
can happen when the Hessian ∇2 f (w(t) ) has non-positive eigenvalues. In order to ensure that the
second-order approximation $\bar{f}(w)$ has a unique global minimum, we must ensure that it is
strongly convex. We can do so by regularizing the objective f (w) with an additional λ‖w‖² term,
with an appropriately chosen λ that shifts all of the eigenvalues of the Hessian to be positive.
Even when the objective is convex and smooth, Newton’s method can be quite unpredictable. For
example, consider the function
$$
f(w) = \sqrt{w^2 + 1}
$$
essentially a smoothed version of the absolute value |w|. Clearly, the function is minimized at
w∗ = 0. Calculating the necessary derivatives for Newton’s method, we find
$$
f'(w) = \frac{w}{\sqrt{w^2 + 1}}, \qquad f''(w) = (1 + w^2)^{-3/2}.
$$
Note that f is strictly convex since its second derivative is strictly positive (and it is strongly
convex on any bounded interval), and it is 1-smooth since f''(w) ≤ 1. The Newton step for
minimizing f (w) is
$$
w^{(t+1)} = w^{(t)} - \frac{f'(w^{(t)})}{f''(w^{(t)})} = -\left( w^{(t)} \right)^3.
$$
The behavior of this algorithm depends on the magnitude of w(t) . In particular, we have the
following three regimes:
|w(t) | < 1    Algorithm converges cubically
|w(t) | = 1    Algorithm oscillates between −1 and 1
|w(t) | > 1    Algorithm diverges
This example shows that even for strictly convex functions with Lipschitz gradients, Newton’s
method is only guaranteed to converge locally. To avoid divergence, a popular technique is to use
a damped step size αt ∈ (0, 1]:
$$
w^{(t+1)} = w^{(t)} - \alpha_t \nabla^2 f(w^{(t)})^{-1} \nabla f(w^{(t)})
$$
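The three regimes, and the effect of damping, can be reproduced in a few lines of plain Python. Damping is taken here to mean scaling the Newton step by a constant factor in (0, 1]; the particular value 0.5 is an arbitrary illustrative choice (in practice the factor is usually chosen adaptively, for example by a line search), and a constant factor like this does not rescue every starting point.

```python
import math

def f_prime(w):
    return w / math.sqrt(w * w + 1.0)

def f_double_prime(w):
    return (1.0 + w * w) ** -1.5

def newton_1d(w0, num_iters, damping=1.0):
    """Iterate w <- w - damping * f'(w) / f''(w) for f(w) = sqrt(w^2 + 1)."""
    w = w0
    for _ in range(num_iters):
        w = w - damping * f_prime(w) / f_double_prime(w)
    return w

w_converge = newton_1d(0.5, num_iters=5)              # |w0| < 1: iterates shrink toward 0
w_oscillate = newton_1d(1.0, num_iters=5)             # |w0| = 1: sign flips every step
w_diverge = newton_1d(1.1, num_iters=5)               # |w0| > 1: magnitude blows up
w_damped = newton_1d(1.1, num_iters=50, damping=0.5)  # same start, damped step
```

With damping 0.5 the update becomes w ← w(1 − w²)/2, which pulls the iterate inside the unit interval on the first step when started from 1.1, after which it contracts toward zero.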
Convergence Analysis
We can ensure that Newton’s method converges, if all of the following conditions are met:
These conditions combined establish local convergence of Newton’s method to a local minimum.
That is, given that the initial point w(0) is sufficiently close to the local minimum, the Hessian is
positive definite at the local minimum, and the Hessian is Lipschitz (meaning that its rate of change
can be bounded), we can ensure a quadratic convergence rate of $O(e^{-e^{t}})$, which is significantly
faster than the fastest rate for gradient descent that we have seen, $O(e^{-t})$. Note, however, that each
Newton step involves inverting the Hessian, which is itself an expensive O(d^3) operation that
becomes impractical for high-dimensional functions.
Let’s revisit the nonlinear least squares problem. We can try to apply all of the techniques and
approaches we have covered so far to solve this problem, but there is a specialized algorithm for
solving the nonlinear least squares problem, called Gauss-Newton. The Gauss-Newton algorithm
has parallels to Newton’s method, as they both repeatedly make linear approximations of an
objective and solve that approximation. At each iteration, this method linearly approximates the
function F about the current iterate and solves a least-squares problem involving the linearization
in order to compute the next iterate.
Let’s say that we have a “guess” for w at iteration k, which we denote w(k) . We consider the
first-order approximation of F (w) about w(k) :
$$
F(w) \approx \tilde{F}(w) = F(w^{(k)}) + \frac{\partial}{\partial w} F(w^{(k)}) \, (w - w^{(k)}) = F(w^{(k)}) + J(w^{(k)}) \Delta w
$$
where $\Delta w := w - w^{(k)}$.
Now that F̃ is linear in ∆w (the Jacobian and F are just constants: functions evaluated at w(k) ),
our objective is convex and we can perform linear least squares to form the closed form solution for
∆w. Applying the first-order optimality condition to the objective F̃ yields the following equation:
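$$
0 = J_{\tilde{F}}(w)^\top (y - \tilde{F}(w)) = J(w^{(k)})^\top \left( y - \left( F(w^{(k)}) + J(w^{(k)}) \Delta w \right) \right)
$$
Note that the Jacobian of the linearized function F̃ , evaluated at any w, is precisely J(w(k) ).
Denoting J = J(w(k) ) and ∆y := y − F (w(k) ) for brevity, we have
$$
\begin{aligned}
J^\top (\Delta y - J \Delta w) &= 0 \\
J^\top \Delta y &= J^\top J \Delta w \\
\Delta w &= (J^\top J)^{-1} J^\top \Delta y
\end{aligned}
$$
Comparing this solution to OLS, we see that it is effectively solving
$$
\Delta w = \arg\min_{\delta w} \|J \delta w - \Delta y\|^2
$$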
Note that the solution will depend on the initial value w(0) in general. There are several choices
for measuring convergence. Some common choices include testing changes in the objective value:
$$
\left| \frac{L^{(k+1)} - L^{(k)}}{L^{(k)}} \right| \le \text{threshold}
$$
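The Gauss-Newton iteration derived above can be sketched as follows. The example model F(w)_i = w_0 exp(w_1 x_i), the synthetic noisy data, the starting point, and the function names are all assumptions made for illustration; the stopping rule is the relative change in the objective value.

```python
import numpy as np

def gauss_newton(F, J, y, w0, threshold=1e-12, max_iters=100):
    """Repeatedly linearize F around the current iterate and solve the linear least squares."""
    w = np.asarray(w0, dtype=float)
    loss = np.sum((y - F(w)) ** 2)
    for _ in range(max_iters):
        dy = y - F(w)
        # dw = argmin ||J dw - dy||^2, solved with a least-squares routine.
        dw, *_ = np.linalg.lstsq(J(w), dy, rcond=None)
        w = w + dw
        new_loss = np.sum((y - F(w)) ** 2)
        if abs(new_loss - loss) <= threshold * loss:  # relative change in the objective
            break
        loss = new_loss
    return w

# Hypothetical model and synthetic data.
x = np.linspace(0.0, 1.0, 30)
F = lambda w: w[0] * np.exp(w[1] * x)
J = lambda w: np.column_stack([np.exp(w[1] * x), w[0] * x * np.exp(w[1] * x)])
w_true = np.array([2.0, -1.5])
y = F(w_true) + 0.01 * np.random.default_rng(0).normal(size=x.size)

w_start = np.array([1.0, -1.0])
w_gn = gauss_newton(F, J, y, w_start)
```

Using a least-squares solver avoids forming (JᵀJ)⁻¹ explicitly. As noted above, the result depends on the initial value in general; a starting point far from `w_true` may fail to converge.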
Neural networks are a class of compositional function approximators. They come in a variety of
shapes and sizes. In this class, we will only discuss feedforward neural networks, those networks
whose computations can be modeled by a directed acyclic graph.3 The most basic (but still com-
monly used) class of feedforward neural networks is the multilayer perceptron. Such a network
might be drawn as follows:
Computation flows left-to-right. The circles represent nodes, a.k.a. units or neurons, which are
loosely based on the behavior of actual neurons in the brain. Observe that the nodes are organized
into layers. The first (left-most) layer is called the input layer, the last (right-most) layer is
called the output layer, and any other layers (there is only one here, but there could be multiple)
are referred to as hidden layers. The dimensionality of the input and output layers is determined
by the function we want the network to compute. For a function from Rd to Rk , we should have d
input nodes and k output nodes. The number and sizes of the hidden layers are hyperparameters
to be chosen by the network designer.
Note that in the diagram above, each non-input layer has the following property: every node in
that layer is connected to every node in the previous layer. Layers that have this property are
described as fully connected.4 Each edge in the graph has an associated weight, which is the
strength of the connection from the input node in one layer to the node in the next layer. Each
node computes a weighted sum of its inputs, with these connection strengths being the weights,
and then applies a nonlinear function which is variously referred to as the activation function or
the nonlinearity. Concretely, if wi denotes the weights and σi denotes the activation function of
node i, it computes the function
$$
x \mapsto \sigma_i(w_i^\top x)
$$
3 There are also recurrent neural networks whose computation graphs have cycles.
4 Later we will learn about convolutional layers, which have a different connectivity structure.
Let us denote the number of (non-input) layers by L, the number of units in layer $\ell \in \{0, \dots, L\}$
by $n_\ell$ (here $n_0$ is the size of the input layer), and the nonlinearity for layer $\ell \in \{1, \dots, L\}$ by
$\sigma^\ell : \mathbb{R}^{n_\ell} \to \mathbb{R}^{n_\ell}$. The weights for every node in layer $\ell$ can be stacked (as rows) into a matrix of
weights $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$. Then layer $\ell$ performs the computation
$$
x \mapsto \sigma^\ell(W_\ell x)
$$
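Since the output of each layer is passed as input to the next layer, the function represented by the
entire network can be written
$$
x \mapsto \sigma^L\!\left( W_L \, \sigma^{L-1}\!\left( \cdots \sigma^2\!\left( W_2 \, \sigma^1(W_1 x) \right) \cdots \right) \right)
$$
One common choice of nonlinearity is the softmax function,
$$
\sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}
$$
which is often used to produce a discrete probability distribution over k classes. Note that every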
entry of the softmax output depends on every entry of the input. Also, softmax preserves ordering,
in the sense that sorting the indices i = 1, . . . , k by the resulting value σ(x)i yields the same
ordering as sorting by the input value xi . In other words, more positive xi leads to larger σ(x)i .
This nonlinearity is used most commonly (but not always) at the output layer of the network.
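A forward pass through such a network is a short loop over the layers. The sketch below assumes one hidden layer with a tanh nonlinearity and a softmax output layer, with random weights; the layer sizes and the helper names `softmax` and `forward` are illustrative choices.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtracting the max leaves the result unchanged, avoids overflow
    return z / np.sum(z)

def forward(x, weights, nonlinearities):
    """Apply x -> sigma_l(W_l x) for each layer l = 1, ..., L in turn."""
    for W, sigma in zip(weights, nonlinearities):
        x = sigma(W @ x)
    return x

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3]  # n_0 = 4 inputs, one hidden layer of 5 units, n_L = 3 outputs
weights = [rng.normal(size=(n_out, n_in))  # W_l has shape n_l x n_{l-1}
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
probs = forward(rng.normal(size=4), weights, [np.tanh, softmax])

scores = np.array([1.0, 3.0, 2.0])
same_order = np.array_equal(np.argsort(softmax(scores)), np.argsort(scores))
```

The output `probs` is a vector of positive entries summing to one, and `same_order` checks the order-preserving property of softmax on a small example.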
Expressive power
It is the repeated combination of nonlinearities that gives deep neural networks their remarkable
expressive power. Consider what happens when we remove the activation functions (or equivalently,
set them to the identity function): the function computed by the network is
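$$
x \mapsto \underbrace{W_L W_{L-1} \cdots W_2 W_1}_{=: \widetilde{W}} \, x
$$
which is linear in its input! Moreover, the size of the smallest layer restricts the rank of $\widetilde{W}$, as
$$
\operatorname{rank}(\widetilde{W}) \le \min_{\ell \in \{1, \dots, L\}} \operatorname{rank}(W_\ell) \le \min_{\ell \in \{0, \dots, L\}} n_\ell
$$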
4.8. NEURAL NETWORKS 99
Despite having many layers of computation, this class of networks is not very expressive; it can
only represent linear functions.
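This collapse is easy to confirm numerically. The sketch below uses random weight matrices and hypothetical layer sizes, with a bottleneck layer of 2 units, and compares the layer-by-layer computation against a single multiplication by the product matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [6, 4, 2, 5]  # n_0, ..., n_L; the smallest layer has 2 units
Ws = [rng.normal(size=(n_out, n_in))
      for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def linear_network(x):
    """Network with identity activations: each layer only multiplies by W_l."""
    for W in Ws:
        x = W @ x
    return x

W_tilde = np.linalg.multi_dot(Ws[::-1])  # W_L ... W_2 W_1, a single 5 x 6 matrix
x = rng.normal(size=6)
rank = np.linalg.matrix_rank(W_tilde)
```

The deep linear network and the single matrix `W_tilde` give the same output, and the rank of `W_tilde` is capped by the width of the narrowest layer.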
We would like to produce a class of networks that are universal function approximators. This
essentially means that given any continuous function, we can choose a network in this class such
that the output of the network can be made arbitrarily close to the output of the given function for
all given inputs. We make a more precise statement later.
A key observation is that piecewise-constant functions are universal function approximators:
$$
\sigma(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}
$$
We can build very complicated functions from this simple step function by combining translated
and scaled versions of it. Observe that
It turns out that only one hidden layer is needed for universal approximation, and for simplicity
we assume a one-dimensional input. Thus from here on we consider networks with the following
structure:
The input x is one-dimensional, and the weight on x to node j is bj . We also introduce a constant
1, whose weight into node j is aj . (This is referred to as the bias, but it has nothing to do with
bias in the sense of the bias-variance tradeoff. It’s just there to provide the node with the ability
to shift its input.) The function implemented by the network is
$$
h(x) = \sum_{j=1}^{k} c_j \, \sigma(a_j + b_j x)
$$
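To see how a network of this form can approximate a continuous function when σ is the step function, one possible construction places a step at each point of a fine grid and sets each c_j to the increment of the target function there. The sketch below does this for sin on [0, 2π); the target, the number of hidden units, and the variable names are illustrative assumptions, not the construction used in the text.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

def h(x, a, b, c):
    """h(x_i) = sum_j c_j * step(a_j + b_j * x_i), evaluated for every x_i."""
    return step(a + np.outer(x, b)) @ c

target = np.sin
knots = np.linspace(0.0, 2 * np.pi, 200, endpoint=False)  # one hidden unit per knot
a = -knots                                                # unit j switches on at x = knots[j]
b = np.ones_like(knots)
c = np.diff(target(knots), prepend=0.0)                   # jump in the target at each knot

xs = np.linspace(0.0, 2 * np.pi, 1000, endpoint=False)
max_error = np.max(np.abs(h(xs, a, b, c) - target(xs)))
```

Between consecutive knots the network outputs the target's value at the left knot, so the error is bounded by how much the target can change over one grid cell; adding hidden units tightens the approximation.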
Choosing weights
With a proper choice of aj , bj , and cj , this function can approximate any continuous function we
want. But the question remains: given some target function, how do we choose these parameters
in the appropriate way?
Let’s try a familiar technique: least squares. Assume we have training data $\{(x_i, y_i)\}_{i=1}^{n}$. We aim
to solve the optimization problem
$$
\min_{a, b, c} \; \underbrace{\sum_{i=1}^{n} (y_i - h(x_i))^2}_{f(a, b, c)}
$$
To run gradient descent, we need derivatives of the loss with respect to our optimization variables.
We compute via the chain rule
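so no update will be made for that example. More egregiously, consider the derivatives with respect
to $a_j$ (see footnote 5):
\[
\frac{\partial f}{\partial a_j} = \sum_{i=1}^{n} -2(y_i - h(x_i)) \underbrace{\frac{\partial h(x_i)}{\partial a_j}}_{0} = 0
\]
and $b_j$:
\[
\frac{\partial f}{\partial b_j} = \sum_{i=1}^{n} -2(y_i - h(x_i)) \underbrace{\frac{\partial h(x_i)}{\partial b_j}}_{0} = 0
\]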
Since gradient descent changes weights in proportion to their gradient, it will never modify a or b!
Even though the step function is useful for the purpose of showing the approximation capabilities
of neural networks, it is seldom used in practice because it cannot be trained by conventional
gradient-based methods.
The next simplest universal approximator is the class of piecewise-linear functions. Just as
piecewise-constant functions can be achieved by combinations of the step function as a nonlinearity,
piecewise-linear functions can be achieved by combinations of the rectified linear unit (ReLU)
function
σ(x) = max{0, x}
Footnote 5: Technically, the derivative of σ is not defined at zero, where there is a discontinuity. However, it is
defined (and zero) everywhere else. In practice, we will almost never hit the point of discontinuity because it is a
set of measure zero.
Depending on the weights a and b, our ReLUs can move to the left or right, increase or decrease
their slope, and flip direction.
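Let us calculate the gradients again, assuming we replace the step functions by ReLUs:
\begin{align*}
\frac{\partial f}{\partial c_j} &= \sum_{i=1}^{n} -2(y_i - h(x_i)) \max\{0, a_j + b_j x_i\} \\
\frac{\partial f}{\partial a_j} &= \sum_{i=1}^{n} -2(y_i - h(x_i)) \, c_j \frac{\partial}{\partial a_j} \max\{0, a_j + b_j x_i\}
 = \sum_{i=1}^{n} -2(y_i - h(x_i)) \, c_j \begin{cases} 0 & \text{if } a_j + b_j x_i < 0 \\ 1 & \text{if } a_j + b_j x_i > 0 \end{cases} \\
\frac{\partial f}{\partial b_j} &= \sum_{i=1}^{n} -2(y_i - h(x_i)) \, c_j \frac{\partial}{\partial b_j} \max\{0, a_j + b_j x_i\}
 = \sum_{i=1}^{n} -2(y_i - h(x_i)) \, c_j \begin{cases} 0 & \text{if } a_j + b_j x_i < 0 \\ x_i & \text{if } a_j + b_j x_i > 0 \end{cases}
\end{align*}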
Crucially, we see that the gradient with respect to a and b is not uniformly zero, unlike with the
step function.
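The ReLU gradient formulas can be checked numerically. The following is a minimal sketch, assuming
a one-dimensional input and a tiny made-up dataset; the function names and the finite-difference
helper `numeric_grad` are ours. The data is chosen so that no pre-activation $a_j + b_j x_i$ sits at
the kink of the ReLU.

```python
def relu(z):
    return max(0.0, z)

def h(x, a, b, c):
    """Network output h(x) = sum_j c_j * relu(a_j + b_j * x)."""
    return sum(cj * relu(aj + bj * x) for aj, bj, cj in zip(a, b, c))

def loss(xs, ys, a, b, c):
    """Least squares objective f(a, b, c) = sum_i (y_i - h(x_i))^2."""
    return sum((y - h(x, a, b, c)) ** 2 for x, y in zip(xs, ys))

def gradients(xs, ys, a, b, c):
    """Analytic gradients of the loss with respect to a, b and c."""
    k = len(a)
    ga, gb, gc = [0.0] * k, [0.0] * k, [0.0] * k
    for x, y in zip(xs, ys):
        r = -2.0 * (y - h(x, a, b, c))
        for j in range(k):
            z = a[j] + b[j] * x
            active = 1.0 if z > 0 else 0.0   # derivative of relu away from z = 0
            gc[j] += r * relu(z)
            ga[j] += r * c[j] * active
            gb[j] += r * c[j] * active * x
    return ga, gb, gc

def numeric_grad(xs, ys, a, b, c, which, j, eps=1e-6):
    """Central finite difference with respect to one parameter, e.g. which='a', j=0."""
    params = {"a": list(a), "b": list(b), "c": list(c)}
    params[which][j] += eps
    up = loss(xs, ys, params["a"], params["b"], params["c"])
    params[which][j] -= 2 * eps
    down = loss(xs, ys, params["a"], params["b"], params["c"])
    return (up - down) / (2 * eps)

# Hypothetical data and weights, for illustration only.
xs, ys = [-1.0, 0.5, 2.0], [0.3, -0.2, 1.0]
a, b, c = [0.5, -0.4], [1.0, 2.0], [1.5, -0.7]
ga, gb, gc = gradients(xs, ys, a, b, c)
print("analytic df/da:", ga)
print("numeric  df/da:", [numeric_grad(xs, ys, a, b, c, "a", j) for j in range(2)])
```

Unlike the step-function network, the gradients with respect to `a` and `b` are nonzero here, so
gradient descent can actually move those weights.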
Later we will discuss backpropagation, a dynamic programming algorithm for efficiently com-
puting gradients with respect to a neural network’s parameters.
The celebrated neural network universal approximation theorem, due to Kurt Hornik (see footnote 6), tells us that
neural networks are universal function approximators in the following sense.
Theorem. Suppose $\sigma : \mathbb{R} \to \mathbb{R}$ is nonconstant, bounded, nondecreasing, and continuous (see footnote 7), and let
$S \subseteq \mathbb{R}^d$ be closed and bounded. Then for any continuous function $f : S \to \mathbb{R}$ and any $\epsilon > 0$, there
exists a neural network with one hidden layer containing finitely many nodes, which we can write
\[
h(\mathbf{x}) = \sum_{j=1}^{k} c_j \, \sigma(a_j + \mathbf{b}_j^\top \mathbf{x})
\]
such that
\[
|h(\mathbf{x}) - f(\mathbf{x})| < \epsilon
\]
for all $\mathbf{x} \in S$.
Footnote 6: See Approximation Capabilities of Multilayer Feedforward Networks.
Footnote 7: The sigmoid satisfies these requirements. ReLU is unbounded, so it does not satisfy them as stated;
universal approximation results for ReLU networks rely on other versions of the theorem.
There’s some subtlety in the theorem that’s worth noting. It says that for any given continuous
function, there exists a neural network of finite size that uniformly approximates the given func-
tion. However, it says nothing about how well any particular architecture you’re considering will
approximate the function. It also doesn’t tell us how to compute the weights.
It’s also worth pointing out that in the theorem, the network consists of just one hidden layer. In
practice, people find that using more layers works better.
We have seen that first-order optimization techniques such as gradient descent and stochastic
gradient descent are effective tools for minimizing differentiable cost functions. In order to im-
plement these techniques, we need to be able to compute the gradient of the cost function with
respect to the weights. The chain rule allows us to compute these derivatives in principle, but as we
will see, the order of the computations matters in neural networks. The backpropagation algo-
rithm takes advantage of the directed acyclic graph (DAG) nature of feedforward neural networks
to calculate these derivatives efficiently.
Computational graphs
We assume that our network can be expressed as a finite directed acyclic graph $G = (V, E)$,
sometimes called the computational graph of the network. Each vertex $v_i \in V$ represents the
result of some differentiable (see footnote 8) computation. Each edge represents a computational dependency:
there is an edge $(v_i, v_j) \in E$ if and only if the value computed at $v_i$ is used to compute $v_j$. We
denote the set of outgoing neighbors of a node $v_i$ by
\[
\operatorname{out}(v_i) = \{ v_j \in V : (v_i, v_j) \in E \}
\]
Furthermore, some of these vertices have special significance. There is a vertex $\ell \in V$, representing
the loss function, which contains no outgoing edges (i.e. $\operatorname{out}(\ell) = \emptyset$). There is also some subset of
vertices $W \subset V$ representing the trainable parameters of the network. Our objective is to efficiently
calculate $\frac{\partial \ell}{\partial w_i}$ for each $w_i \in W$.
The primary mathematical tool employed in backpropagation is the chain rule. This allows us to
write
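\[
\frac{\partial \ell}{\partial v_i} = \sum_{v_j \in \operatorname{out}(v_i)} \frac{\partial \ell}{\partial v_j} \frac{\partial v_j}{\partial v_i}
\]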
The intuition here is that the value computed at vi affects potentially all of the vertices to which
it is an input, and each of those vertices affects the loss in some way. The total contribution of vi
to the loss must be summed over these downstream effects.
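We could expand recursively to get an expression for each weight:
\begin{align*}
\frac{\partial \ell}{\partial w_i}
 &= \sum_{v_j \in \operatorname{out}(w_i)} \frac{\partial \ell}{\partial v_j} \frac{\partial v_j}{\partial w_i} \\
 &= \sum_{v_j \in \operatorname{out}(w_i)} \; \sum_{v_k \in \operatorname{out}(v_j)} \frac{\partial \ell}{\partial v_k} \frac{\partial v_k}{\partial v_j} \frac{\partial v_j}{\partial w_i} \\
 &\;\;\vdots \\
 &= \sum_{\text{paths } v^{(1)}, \dots, v^{(k)} \text{ from } w_i \text{ to } \ell} \frac{\partial \ell}{\partial v^{(k)}} \frac{\partial v^{(k)}}{\partial v^{(k-1)}} \cdots \frac{\partial v^{(2)}}{\partial v^{(1)}} \frac{\partial v^{(1)}}{\partial w_i}
\end{align*}
Footnote 8: A number of common neural network operations, such as the ReLU activation function, are not everywhere
differentiable. In practice it is sufficient to be differentiable except at finitely many points.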
However, computing the derivative by evaluating this expression is quite inefficient, as many terms
appear in more than one path from wi to `, so we are doing more work than necessary.
Backpropagation
The backpropagation algorithm combines the chain rule with the principles of dynamic programming:
dividing a large problem into simpler subproblems, solving these and storing their
solutions, and combining the stored solutions to solve larger subproblems or the original problem.
In this context, the large problem is computing $\nabla \ell(W)$, and the subproblems are computing the
individual terms $\frac{\partial \ell}{\partial w_i}$. The key observation from the first chain rule expression above is that we
can reuse work by computing $\frac{\partial \ell}{\partial v_i}$ in a “back to front” order. That is, before computing
$\frac{\partial \ell}{\partial v_i}$, we should compute $\frac{\partial \ell}{\partial v_j}$ for each $v_j \in \operatorname{out}(v_i)$. Because our computational graph is a DAG,
such an ordering always exists and can be computed efficiently via a topological sort. Then
the subproblem of computing $\frac{\partial \ell}{\partial v_i}$ can be easily accomplished by combining the stored values
$\frac{\partial \ell}{\partial v_j}$ with the local derivatives $\frac{\partial v_j}{\partial v_i}$, as in that chain rule expression.
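To make the “back to front” order concrete, here is a minimal sketch of backpropagation on a small
computational graph. The dictionary-based graph format, the node names, and the example function
$\ell = (w_1 x w_2 + w_1 x + w_2 - y)^2$ are our own assumptions for illustration. Each non-leaf
node stores its inputs, a forward function `f`, and one local partial derivative per input in `df`.

```python
def leaf():
    """A node with no inputs (data or trainable parameter)."""
    return {"inputs": [], "f": None, "df": []}

def topological_order(graph):
    """Return node names so that every node appears after all of its inputs."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for parent in graph[name]["inputs"]:
            visit(parent)
        order.append(name)
    for name in graph:
        visit(name)
    return order

def forward(graph, leaves):
    """Compute and store the value of every node (the forward pass)."""
    values = dict(leaves)
    for name in topological_order(graph):
        node = graph[name]
        if node["inputs"]:
            values[name] = node["f"](*[values[p] for p in node["inputs"]])
    return values

def backward(graph, values, loss_name):
    """Accumulate d(loss)/d(node) for every node in reverse topological order."""
    grads = {name: 0.0 for name in graph}
    grads[loss_name] = 1.0
    for name in reversed(topological_order(graph)):
        node = graph[name]
        args = [values[p] for p in node["inputs"]]
        # grads[name] is final here: every node that consumes it was handled earlier.
        for parent, dfn in zip(node["inputs"], node["df"]):
            grads[parent] += grads[name] * dfn(*args)
    return grads

graph = {
    "x": leaf(), "y": leaf(), "w1": leaf(), "w2": leaf(),
    "u": {"inputs": ["w1", "x"], "f": lambda w, x: w * x,
          "df": [lambda w, x: x, lambda w, x: w]},
    "p": {"inputs": ["u", "w2"], "f": lambda u, w: u * w,
          "df": [lambda u, w: w, lambda u, w: u]},
    "q": {"inputs": ["u", "w2"], "f": lambda u, w: u + w,
          "df": [lambda u, w: 1.0, lambda u, w: 1.0]},
    "r": {"inputs": ["p", "q"], "f": lambda p, q: p + q,
          "df": [lambda p, q: 1.0, lambda p, q: 1.0]},
    "loss": {"inputs": ["r", "y"], "f": lambda r, y: (r - y) ** 2,
             "df": [lambda r, y: 2 * (r - y), lambda r, y: -2 * (r - y)]},
}

values = forward(graph, {"x": 2.0, "y": 1.0, "w1": 0.5, "w2": -3.0})
grads = backward(graph, values, "loss")
print(values["loss"], grads["w1"], grads["w2"])
```

Note that `u` and `w2` each feed two later nodes, so their gradients are sums over both outgoing
edges, and each stored gradient is computed once and reused rather than recomputed per path.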
In a standard fully-connected layer, each vertex calculates $z_j$ as a linear combination of the
activations $a_i$ of the previous layer, with weights $w_{ji}$:
\[
z_j = \sum_i w_{ji} a_i
\]
We have omitted layer indexing to keep the notation simple, but keep in mind that this $a_i$ is the
result of some computation performed at the previous layer, and these $z_j$ are likely used as inputs
to vertices at later layers.
[Figure: this part of the computational graph.]
The derivative of $z_j$ with respect to the weight $w_{ji}$ is
\[
\frac{\partial z_j}{\partial w_{ji}} = a_i
\]
so
\[
\frac{\partial \ell}{\partial w_{ji}} = \frac{\partial \ell}{\partial z_j} a_i
\]
Observe that we must use the activations $a_i$ that were previously computed in the forward pass.
We must also compute the derivatives of $z_j$ with respect to $a_i$ so that we can pass these backward to
earlier layers. In this part of the graph, $\operatorname{out}(a_i) = \{z_1, \dots, z_k\}$, and it is straightforward to see that
\[
\frac{\partial z_j}{\partial a_i} = w_{ji}
\]
so
\[
\frac{\partial \ell}{\partial a_i} = \sum_{j=1}^{k} \frac{\partial \ell}{\partial z_j} w_{ji}
\]
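The two derivative formulas for a fully-connected layer translate directly into code. This is a
small sketch using plain Python lists; the function names and the example values are hypothetical,
and the upstream gradient `dl_dz` is simply assumed to be given by later layers.

```python
def fc_forward(W, a):
    """Forward pass: z_j = sum_i W[j][i] * a[i]."""
    return [sum(wji * ai for wji, ai in zip(row, a)) for row in W]

def fc_backward(W, a, dl_dz):
    """Given dl/dz_j from later layers, return (dl/dW, dl/da)."""
    # dl/dw_ji = dl/dz_j * a_i, using the activations stored from the forward pass.
    dl_dW = [[dz_j * a_i for a_i in a] for dz_j in dl_dz]
    # dl/da_i = sum_j dl/dz_j * w_ji, which is passed backward to the previous layer.
    dl_da = [sum(W[j][i] * dl_dz[j] for j in range(len(W))) for i in range(len(a))]
    return dl_dW, dl_da

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 outputs, 2 inputs
a = [1.0, -1.0]
z = fc_forward(W, a)
dl_dz = z                                   # e.g. if the loss were 0.5 * sum_j z_j^2
dl_dW, dl_da = fc_backward(W, a, dl_dz)
print(z, dl_dW, dl_da)
```

In matrix form, the weight gradient is the outer product of `dl_dz` and `a`, and the gradient passed
backward is the transpose of `W` applied to `dl_dz`.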
Element-wise nonlinearities
After taking linear combinations, it is typical to insert a nonlinearity. (Recall from the previous
note that nonlinearities are at the heart of neural networks’ expressive power.) In most cases,
this nonlinearity is applied elementwise. Again omitting layer indexing, we might write such a
computation as
ai = σ(zi )
where zi is the value from the previous layer, and σ is the activation function. This part of the
computational graph looks like
Classification
The task of classification differs from regression in that we are now interested in assigning a
d-dimensional data point one of a discrete number of classes, instead of assigning it a continuous
value. Thus, the task is simpler in that there are fewer choices of labels per data point but more
complicated in that we now need to somehow factor in information about each class to obtain the
classifier that we want.
Given a training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ of $n$ points, where each data point $\mathbf{x}_i \in \mathbb{R}^d$ is paired with
a known discrete class label $y_i \in \{1, 2, \dots, K\}$, our goal is to train a classifier which, when fed any
arbitrary d-dimensional data point, classifies that data point as one of the K discrete classes.
There are two main types of classification models: generative models and discriminative models.
Generative models have strong roots in probabilistic modeling. The idea is that we form a joint
probability distribution p(X, Y ) over the input X (which we treat as a random vector) and label
Y (which we treat as a random variable), and we classify an arbitrary datapoint x with the class
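label that maximizes the joint probability:
\[
\hat{y} = \arg\max_k \; p(\mathbf{x}, Y = k)
\]
Generative models typically form the joint distribution by explicitly forming the following:
• A prior probability distribution over the classes: $P(k) = P(\text{class} = k)$
• A conditional probability distribution for each class $k$: $p_k(X) = p(X \mid \text{class } k)$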
Using the prior and the conditional distributions in conjunction, we have (from Bayes’ rule) that
maximizing the joint probability over the class labels is equivalent to maximizing the posterior
probability of the class label:
\[
\hat{y} = \arg\max_k \; p(\mathbf{x}, Y = k) = \arg\max_k \; P(k) \, p_k(\mathbf{x}) = \arg\max_k \; P(Y = k \mid \mathbf{x})
\]
Maximizing the posterior will induce regions in the feature space in which one class has the highest
posterior probability, and decision boundaries in between classes where the posterior probabilities
of two classes are equal.
Figure 5.1: A collection (in dark black) of linear (left) vs quadratic (right) level set boundaries in a 2D
feature space
Generative classifiers are flexible, quick to train, and can generate new samples (in order to augment
the training dataset). However, they are also inefficient, because they require estimation of a
large number of parameters (i.e. the covariance matrices of the conditional distributions, each of which
has $\frac{d(d+1)}{2}$ parameters). Typically, the decision boundary only requires $O(d)$ parameters, but
generative models typically estimate $O(d^2)$ parameters in order to determine the class-conditional
probability distributions. As $d$ increases, generative models tend to lose their effectiveness, as the
number of parameters starts to dominate in comparison to the number of datapoints, and as a
result the variance of the model increases.
This leads us to the concept of discriminative models, where we bypass learning a generative
model altogether and directly learn a decision boundary. Discriminative models are parameterized
by weights that either (1) form a posterior distribution P (Y |X) without considering the prior or
conditional distributions, or (2) directly form a hard decision boundary without considering any
probabilities in the first place. In the former case, discriminative models choose the class that
maximizes the posterior probability distribution:
\[
\hat{y} = \arg\max_k \; P(Y = k \mid \mathbf{x})
\]
Generative models also choose the class that maximizes the posterior probability distribution. The
only difference is in the way generative and discriminative models form the posterior.
While both generative and discriminative models by default maximize the posterior probability
over classes, this strategy may not necessarily be desirable at all times. Rather than maximizing
the posterior probability, we would really like to minimize the risk of our model. Recall that the
risk for a given classifier h is defined as the expected loss over X and Y :
\[
R(h) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} [\ell(h(\mathbf{x}), y)]
\]
where $\ell(h(\mathbf{x}), y)$ measures the loss between the predicted label $h(\mathbf{x})$ and the true label $y$. In the
context of regression, the loss function was the squared error $\ell(h(\mathbf{x}), y) = (h(\mathbf{x}) - y)^2$. In the
context of classification, the loss function can take many forms, but the simplest is the standard
step function (the 0-1 loss)
\[
\ell(h(\mathbf{x}), y) = \begin{cases} 0 & \text{if } h(\mathbf{x}) = y \\ 1 & \text{if } h(\mathbf{x}) \neq y \end{cases}
\]
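Our goal is to find a classifier that minimizes the risk, given the loss function. Writing $L(j, k)$ for
the loss incurred by predicting class $j$ when the true class is $k$, we can equivalently express the risk as
\[
R(h) = \int \sum_{k=1}^{K} L(h(\mathbf{x}), k) \, P(Y = k \mid \mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}
\]
The Bayes’ classifier $h^*$ will minimize the risk. Given an arbitrary $\mathbf{x}$, the Bayes’ classifier will
pick
\[
h^*(\mathbf{x}) = \arg\min_j \sum_{k=1}^{K} L(j, k) \, P(Y = k \mid \mathbf{x})
\]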
Effectively, the Bayes’ classifier will pick the class that minimizes the expected loss for the given x.
In the special case where the loss function is the standard step function (as described above),
\[
h^*(\mathbf{x}) = \arg\min_j \sum_{k \neq j} P(Y = k \mid \mathbf{x}) = \arg\min_j \, \bigl(1 - P(Y = j \mid \mathbf{x})\bigr) = \arg\max_j \, P(Y = j \mid \mathbf{x})
\]
This is equivalent to selecting the class that maximizes the posterior probability!
Depending on which loss function we are using, the optimal classifier may or may not maximize
the posterior probability. For example, consider the case of cancer diagnosis, where a patient’s
diagnosis for cancer can come up as positive or negative. There are four possible cases:
1. Classify the patient cancer +, and in reality the patient is cancer + (Correct Classification)
2. Classify the patient cancer −, and in reality the patient is cancer − (Correct Classification)
3. Classify the patient cancer +, but in reality the patient is cancer − (False positive)
4. Classify the patient cancer −, but in reality the patient is cancer + (False negative)
Classifying the patient’s condition correctly is ideal, so we can reasonably set the loss for those
cases to 0. The false positive and false negative cases are bad, and there should be a loss for these
cases. But should these cases have the same loss value or should we weight them differently? A
false negative would be significantly worse than a false positive, because after a false negative
the cancer would go untreated and would probably be fatal. Therefore, the associated loss for
the false negative case should be higher than the associated loss for the false positive case. In this
case, the goal is no longer to maximize the posterior probability, because otherwise we would be
treating the false negative and false positive cases the same.
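The effect of an asymmetric loss on the Bayes decision can be seen in a few lines. In this sketch
the posterior and the loss values (50 for a false negative, 1 for a false positive) are made-up
numbers chosen only for illustration, and `loss[j][k]` denotes the loss of predicting class `j` when
the true class is `k`.

```python
def bayes_decision(posterior, loss):
    """Pick the class j that minimizes the expected loss sum_k loss[j][k] * posterior[k]."""
    K = len(posterior)
    expected = [sum(loss[j][k] * posterior[k] for k in range(K)) for j in range(K)]
    best = min(range(K), key=lambda j: expected[j])
    return best, expected

# Class 0 = cancer-negative, class 1 = cancer-positive.
posterior = [0.9, 0.1]              # hypothetical P(Y = k | x) for one patient
zero_one = [[0, 1], [1, 0]]         # step loss: every mistake costs 1
asymmetric = [[0, 50], [1, 0]]      # predicting 0 when the truth is 1 costs 50

print(bayes_decision(posterior, zero_one))     # picks the most probable class
print(bayes_decision(posterior, asymmetric))   # picks the class with lower expected loss
```

With the step loss the decision follows the posterior and predicts negative; with the asymmetric
loss the expected cost of a missed cancer outweighs the 90% posterior, and the decision flips to
positive.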
We need to figure out how to optimize the parameter $\mathbf{w}$. One simple procedure we can try is to fit
a least squares objective:
\[
\arg\min_{\mathbf{w}} \sum_{i=1}^{n} \| y_i - \operatorname{sign}(\mathbf{w}^\top \mathbf{x}_i) \|^2 + \lambda \|\mathbf{w}\|^2
\]
where $\mathbf{x}_i, \mathbf{w} \in \mathbb{R}^{d+1}$. Note that we have not forgotten about the bias term! Even though we are
dealing with d-dimensional data, $\mathbf{x}_i$ and $\mathbf{w}$ are (d + 1)-dimensional: we add an extra “feature” of 1
to $\mathbf{x}$, and a corresponding bias term of $k$ in $\mathbf{w}$. Note that in practice, we do not want to penalize
the bias term in the regularization term, because we should be able to work with any affine
transformation of the data and still end up with the same decision boundary. Therefore, rather
than taking the norm of $\mathbf{w}$, we often take the norm of $\mathbf{w}'$, which is every term of $\mathbf{w}$ excluding the
corresponding bias term. For simplicity of notation however, let’s just take the norm of $\mathbf{w}$.
Without the regularization term, this would be equivalent to minimizing the number of misclassified
training points. Unfortunately, the “sign” term makes this optimization problem non-convex, and
in fact this optimization problem is NP-hard (computationally intractable). Instead we can solve
a relaxed version of this problem:
\[
\arg\min_{\mathbf{w}} \sum_{i=1}^{n} \| y_i - \mathbf{w}^\top \mathbf{x}_i \|^2 + \lambda \|\mathbf{w}\|^2
\]
This method is called the binary least squares support vector machine (LS-SVM). Note that
in this relaxed version, we care about the magnitude of w>xi and not just the sign.
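Since the relaxed objective is a ridge regression on the $\pm 1$ labels, it can be solved in closed
form through the normal equations $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}) \mathbf{w} = \mathbf{X}^\top \mathbf{y}$. The sketch below does
this in plain Python on a tiny made-up dataset; the helper names (`solve`, `fit_ls_svm`, `predict`)
are ours, and for simplicity the bias weight is regularized along with the rest of $\mathbf{w}$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_ls_svm(X, y, lam):
    """Minimize sum_i (y_i - w^T x_i)^2 + lam * ||w||^2 via the normal equations."""
    Xb = [list(x) + [1.0] for x in X]          # append the constant "feature" 1
    d = len(Xb[0])
    A = [[sum(x[p] * x[q] for x in Xb) + (lam if p == q else 0.0) for q in range(d)]
         for p in range(d)]
    b = [sum(x[p] * yi for x, yi in zip(Xb, y)) for p in range(d)]
    return solve(A, b)

def predict(w, x):
    """Classify by the sign of w^T x, with threshold 0."""
    score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if score > 0 else -1

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X = [(1, 1), (2, 1), (1, 2), (-1, -1), (-2, -1), (-1, -2)]
y = [1, 1, 1, -1, -1, -1]
w = fit_ls_svm(X, y, lam=0.1)
print(w, [predict(w, x) for x in X])
```

The fitted `w` depends on the magnitudes of the scores and not only on their signs, which is the
source of the drawback discussed next.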
One drawback of LS-SVM is that the hyperplane decision boundary it computes does not necessarily
make sense for the sake of classification. For example, consider the following set of data points,
color-coded according to the class:
LS-SVM will classify every data point correctly, since all the +1 points lie on one side of the decision
boundary and all the −1 points lie on the other side. Now if we add another cluster of points as
follows:
The original LS-SVM fit would still have classified every point correctly, but now the LS-SVM
gets confused and decides that the points at the bottom right are contributing too much to the
loss (perhaps for these points, w>xi = −5 for the original choice of w so even though they are on
the correct side of the original separating hyperplane, they incur a high squared loss and thus the
hyperplane is shifted to accommodate). This problem will be solved when we introduce general
Support Vector Machines (SVM’s).
Feature Extension
Working with linear classifiers in the raw feature space may be extremely limiting, so we may
consider adding features that allow us to come up with nonlinear classifiers (note that we
are still working with linear classifiers in the augmented feature space). For example, adding
quadratic features allows us to find a linear decision boundary in the augmented quadratic space
that corresponds to a nonlinear “circle” decision boundary projected down into the raw feature
space.
Note that φ is a function that takes as input the data in raw feature space, and outputs the data
in augmented feature space.
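As a small illustration, the sketch below uses a hypothetical quadratic feature map `phi` for
two-dimensional inputs and a hand-picked weight vector `w` (not a trained one). The classifier is
linear in the augmented space, yet its decision boundary in the raw space is the circle
$x_1^2 + x_2^2 = 4$.

```python
def phi(x):
    """Hypothetical quadratic feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2, 1.0]

# Hand-picked weights so that w^T phi(x) = 4 - x1^2 - x2^2.
w = [0.0, 0.0, -1.0, -1.0, 0.0, 4.0]

def classify(x):
    """Linear rule in the augmented space: sign of w^T phi(x)."""
    score = sum(wi * fi for wi, fi in zip(w, phi(x)))
    return 1 if score > 0 else -1

print(classify((0.0, 0.0)), classify((1.0, 1.0)))   # inside the circle of radius 2
print(classify((3.0, 0.0)), classify((2.0, 2.0)))   # outside the circle
```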
Instead of using the linear function $\mathbf{w}^\top \mathbf{x}$ or augmenting features to the data, we can also directly
use a non-linear function of our choice in the original feature space, such as a neural network. One
can imagine a whole family of discriminative binary classifiers that minimize
\[
\arg\min_{\mathbf{w}} \sum_{i=1}^{n} \| y_i - g_{\mathbf{w}}(\mathbf{x}_i) \|^2 + \lambda \|\mathbf{w}\|^2
\]
where $g_{\mathbf{w}}(\mathbf{x}_i)$ can be any function that is easy to optimize. Then we can classify using the rule
\[
\hat{y}_i = \begin{cases} 1 & g_{\mathbf{w}}(\mathbf{x}_i) > \theta \\ -1 & g_{\mathbf{w}}(\mathbf{x}_i) \le \theta \end{cases}
\]
where $\theta$ is some threshold. In LS-SVM, $g_{\mathbf{w}}(\mathbf{x}_i) = \mathbf{x}_i^\top \mathbf{w}$ and $\theta = 0$. Using a neural network with
non-linearities as $g_{\mathbf{w}}$ can produce complex, non-linear decision boundaries.
Multiclass Extension
We can also adapt this approach to the case where we have multiple classes. Suppose there are K
classes, labeled 1, 2, ..., K. One possible way to extend the approach from binary classification is to
compute gw (xi ) and round it to the nearest number from 1 to K. However, this approach gives an
“ordering” to the classes, even if the classes themselves have no natural ordering. This is clearly a
problem. For example, in fruit classification, suppose 1 is used to represent “peach,” 2 is used to
represent “banana,” and 3 is used to represent “apple.” In our numerical representation, it would
appear that peaches are less than bananas, which are less than apples. As a result, if we have an
image that looks like some cross between an apple and a peach, we may simply end up classifying
it as a banana.
The typical way to get around this issue is as follows: if the i’th observation has class k, instead
of using the representation yi = k, we can use the representation yi = ek , the k’th canonical basis
vector. Now there is no relative ordering in the representations of the classes. This method is called
one-hot vector encoding.
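When we have multiple classes, each $\mathbf{y}_i$ is a K-dimensional one-hot vector, so for LS-SVM, we
instead have a K × (d + 1) weight matrix to optimize over:
\[
\arg\min_{\mathbf{W}} \sum_{i=1}^{n} \| \mathbf{y}_i - \mathbf{W} \mathbf{x}_i \|^2 + \lambda \|\mathbf{W}\|_F^2
\]
To classify an arbitrary input $\mathbf{x}$, we compute $\mathbf{W}\mathbf{x}$ and see which component $k$ is the largest:
\[
\hat{y} = \arg\max_k \; \mathbf{w}_k^\top \mathbf{x}
\]
where $\mathbf{w}_k^\top$ denotes the k’th row of $\mathbf{W}$.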
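The encoding and the prediction rule can be sketched as follows. The function names and the weight
matrix `W` are hypothetical (the weights are hand-picked rather than fitted), and classes are
labeled 1 to K as in the text.

```python
def one_hot(k, K):
    """Return the k'th canonical basis vector e_k for classes labeled 1..K."""
    return [1.0 if j == k else 0.0 for j in range(1, K + 1)]

def predict_class(W, x):
    """Return the class (1..K) whose row of W gives the largest score w_k^T x."""
    scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    return 1 + max(range(len(scores)), key=lambda k: scores[k])

# Hand-picked 3 x (2 + 1) weight matrix; the last input feature is the constant 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [-1.0, -1.0, 0.5]]

print(one_hot(2, 3))                          # label vector for class 2 ("banana")
print(predict_class(W, [2.0, 0.5, 1.0]))
print(predict_class(W, [-1.0, -1.0, 1.0]))
```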
Suppose that we have the binary classification problem where classes are represented by 0 and 1.
Note that instead of using −1/+1 labels (as in LS-SVM), in binary logistic regression we use
0/1 labels. Logistic regression makes more sense this way because it directly outputs a probability,
which belongs in the range of values between 0 and 1.
In binary logistic regression, we would like our model to output the probability that a data point
is in class 0 or 1. We can start with the raw linear “score” $\mathbf{w}^\top \mathbf{x}$ and convert it to a probability
between 0 and 1 by applying a sigmoid transformation $s(\mathbf{w}^\top \mathbf{x})$, where $s(z) = \frac{1}{1 + e^{-z}}$. To classify an
arbitrary point $\mathbf{x}$, we use the sigmoid function to output a probability distribution $P(\hat{Y})$ over the
classes 0 and 1:
\[
P(\hat{Y} = 1 \mid \mathbf{x}) = s(\mathbf{w}^\top \mathbf{x}), \qquad P(\hat{Y} = 0 \mid \mathbf{x}) = 1 - s(\mathbf{w}^\top \mathbf{x})
\]
We then classify $\mathbf{x}$ as the class with the larger probability, i.e. as class 1 when $s(\mathbf{w}^\top \mathbf{x}) \ge 0.5$.
Figure 5.5: Logistic function. For our purposes, the horizontal axis is the output of the linear function w>xi
and the vertical axis is the output of the logistic function, which can be interpreted as a probability between
0 and 1.
Equivalently, we classify $\mathbf{x}$ as
\[
\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} \ge 0 \\ 0 & \text{otherwise} \end{cases}
\]
Loss Function
Suppose we are given a training dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$. In order to train our model, we need a
loss function to optimize. One possibility is least squares:
\[
\arg\min_{\mathbf{w}} \sum_{i=1}^{n} \| y_i - s(\mathbf{w}^\top \mathbf{x}_i) \|^2 + \lambda \|\mathbf{w}\|^2
\]
However, this may not be the best choice. Ordinary least squares regression has theoretical justi-
fications such as being the maximum likelihood objective under Gaussian noise. Least squares for
this classification problem does not have a similar justification.
Instead, the loss function we use for logistic regression is called the log-loss, or cross entropy:
\[
L(\mathbf{w}) = \sum_{i=1}^{n} y_i \ln \frac{1}{s(\mathbf{w}^\top \mathbf{x}_i)} + (1 - y_i) \ln \frac{1}{1 - s(\mathbf{w}^\top \mathbf{x}_i)}
\]
If we define $p_i = s(\mathbf{w}^\top \mathbf{x}_i)$, then using the properties of logs we can express this as
\[
L(\mathbf{w}) = -\sum_{i=1}^{n} \bigl[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \bigr]
\]
For each $\mathbf{x}_i$, $p_i$ represents our predicted probability that its corresponding class is 1. Because
$y_i \in \{0, 1\}$, the loss corresponding to the i’th data point is
\[
L_i(\mathbf{w}) = \begin{cases} -\ln p_i & \text{when } y_i = 1 \\ -\ln(1 - p_i) & \text{when } y_i = 0 \end{cases}
\]
Intuitively, if pi = yi , then we incur 0 loss. However, this is never actually the case. The logistic
function can never actually output a value of exactly 0 or 1, and we will therefore always incur
some loss. If the actual label is yi = 1, then as we lower pi towards 0, the loss for this data point
approaches infinity.
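The log-loss is straightforward to evaluate. The following sketch uses plain Python; the data and
weights are made up, and each input is assumed to already include the constant bias feature. It
does not guard against overflow in `math.exp` for scores of very large magnitude.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, X, y):
    """Cross entropy L(w) = -sum_i [y_i ln p_i + (1 - y_i) ln(1 - p_i)] with p_i = s(w^T x_i)."""
    total = 0.0
    for x, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

X = [[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0]]   # last entry of each row is the bias feature
y = [1, 1, 0]

print(log_loss([0.0, 0.0], X, y))            # every p_i = 0.5, so the loss is 3 ln 2
print(log_loss([3.0, 0.0], [[1.0, 1.0]], [1]))    # confident and correct: small loss
print(log_loss([-3.0, 0.0], [[1.0, 1.0]], [1]))   # confident and wrong: large loss
```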
The loss function can be derived from a maximum likelihood perspective or an information-theoretic
perspective. First let’s present the maximum likelihood perspective. We view each observation
$y_i$ as an independent sample from a Bernoulli distribution $\hat{Y}_i \sim \operatorname{Bern}(p_i)$ (technically we mean
$\hat{Y}_i \mid \mathbf{x}_i, \mathbf{w}$, but we remove the conditioning terms for notational brevity), where $p_i$ is a function of
$\mathbf{x}_i$. Thus our observation $y_i$, which we can view as a “sample,” has probability
\[
P(\hat{Y}_i = y_i) = \begin{cases} p_i & \text{if } y_i = 1 \\ 1 - p_i & \text{if } y_i = 0 \end{cases}
\]
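Now we can estimate the parameters $\mathbf{w}$ via maximum likelihood. We have the problem
\begin{align*}
\hat{\mathbf{w}}_{lr} &= \arg\max_{\mathbf{w}} \; P(\hat{Y}_1 = y_1, \dots, \hat{Y}_n = y_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n, \mathbf{w}) \\
 &= \arg\max_{\mathbf{w}} \; \prod_{i=1}^{n} P(\hat{Y}_i = y_i \mid \mathbf{x}_i, \mathbf{w}) \\
 &= \arg\max_{\mathbf{w}} \; \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{(1 - y_i)} \\
 &= \arg\max_{\mathbf{w}} \; \ln \Bigl[ \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{(1 - y_i)} \Bigr] \\
 &= \arg\max_{\mathbf{w}} \; \sum_{i=1}^{n} \bigl[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \bigr] \\
 &= \arg\min_{\mathbf{w}} \; -\sum_{i=1}^{n} \bigl[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \bigr]
\end{align*}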
which exactly matches the cross-entropy formulation from earlier. The logistic regression loss
function can also be justified from an information-theoretic perspective. To motivate this approach, we
introduce Kullback-Leibler (KL) divergence (also called relative entropy), which measures
the amount that one distribution diverges from another. Given any two discrete probability distributions
$P$ and $Q$, the KL divergence from $Q$ to $P$ is defined as
\[
D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \ln \frac{P(x)}{Q(x)}
\]
Note that $D_{KL}$ is not a true distance metric, because it is not symmetric, i.e. $D_{KL}(P \,\|\, Q) \neq
D_{KL}(Q \,\|\, P)$ in general. It also does not satisfy the triangle inequality. However, it is always
nonnegative, i.e. $D_{KL}(P \,\|\, Q) \ge 0$, with equality iff $P = Q$.
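A quick numerical check makes these properties tangible. In the sketch below the two distributions
are arbitrary examples, the function name is ours, and we use the usual convention that terms with
$P(x) = 0$ contribute nothing; it assumes $Q(x) > 0$ wherever $P(x) > 0$.

```python
import math

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum_x P(x) ln(P(x) / Q(x)), skipping terms where P(x) = 0."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]

print(kl_divergence(P, Q))   # positive
print(kl_divergence(Q, P))   # positive, but a different value: not symmetric
print(kl_divergence(P, P))   # zero when the distributions coincide
```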
In the context of classification, if the class label yi is interpreted as the probability of being class
1, then logistic regression provides an estimate pi of the probability that the data is in class 1.
The true class label can be viewed as a sampled value from the “true” distribution Yi ∼ Bern(yi ).
P (Yi ) is not a particularly interesting distribution because all values sampled from it will yield
yi : P (Yi = yi ) = 1. Logistic regression yields a distribution Ŷi ∼ Bern(pi ), which is the posterior
probability that estimates the true distribution P (Yi ).
The KL divergence from P (Ŷi ) to P (Yi ) provides a measure of how closely logistic regression can
match the true label. We would like to minimize this KL divergence, and ideally we would try to
choose our parameters so that DKL (P (Yi ) k P (Ŷi )) = 0. Again, this is impossible for two reasons.
First, if we want DKL (P (Yi ) k P (Ŷi )) = 0, then we would need pi = yi , which is impossible because
pi is the output of a logistic function that can never actually reach 0 or 1. Second, even if we tried
tuning the parameters so that DKL (P (Yi ) k P (Ŷi )) = 0, that’s only optimizing one of the data
points – we need to tune the parameters so that we can collectively minimize the totality of all of
the KL divergences contributed by all data points.
Therefore, our goal is to tune the parameters $\mathbf{w}$ (which indirectly affect the $p_i$ values and therefore
the estimated distribution $P(\hat{Y}_i)$), in order to minimize the total sum of KL divergences contributed
by all data points:
\begin{align*}
\hat{\mathbf{w}}_{lr} &= \arg\min_{\mathbf{w}} \sum_{i=1}^{n} D_{KL}(P(Y_i) \,\|\, P(\hat{Y}_i)) \\
 &= \arg\min_{\mathbf{w}} \sum_{i=1}^{n} y_i \ln \frac{y_i}{p_i} + (1 - y_i) \ln \frac{1 - y_i}{1 - p_i} \\
 &= \arg\min_{\mathbf{w}} \sum_{i=1}^{n} y_i (\ln y_i - \ln p_i) + (1 - y_i)(\ln(1 - y_i) - \ln(1 - p_i)) \\
 &= \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \bigl( -y_i \ln p_i - (1 - y_i) \ln(1 - p_i) \bigr) + \bigl( y_i \ln y_i + (1 - y_i) \ln(1 - y_i) \bigr) \\
 &= \arg\min_{\mathbf{w}} -\sum_{i=1}^{n} \bigl[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \bigr] \\
 &= \arg\min_{\mathbf{w}} \sum_{i=1}^{n} H(P(Y_i), P(\hat{Y}_i))
\end{align*}
Note that the $y_i \ln y_i + (1 - y_i) \ln(1 - y_i)$ component of the KL divergence is a constant, independent
of our changes to $p_i$. Therefore, we are effectively minimizing the sum of the cross entropies
$H(P(Y_i), P(\hat{Y}_i))$. We conclude our discussion of KL divergence by noting the relation between KL
divergence and cross entropy:
\[
H(P(Y_i), P(\hat{Y}_i)) = H(P(Y_i)) + D_{KL}(P(Y_i) \,\|\, P(\hat{Y}_i))
\]
where $H(P(Y_i)) = -y_i \ln y_i - (1 - y_i) \ln(1 - y_i)$ is the entropy of the target distribution and
\[
D_{KL}(P(Y_i) \,\|\, P(\hat{Y}_i)) = y_i \ln \frac{y_i}{p_i} + (1 - y_i) \ln \frac{1 - y_i}{1 - p_i}
\]
Since the parameters $\mathbf{w}$ do not affect the entropy, we can optimize the cross entropy instead of
the KL divergence.
Let’s generalize logistic regression to the case where there are K classes. Similarly to our discussion
of the multi-class LS-SVM, it is important to note that there is no inherent ordering to the classes,
and predicting a class in the continuous range from 1 to K would be a poor choice. To see why,
recall our fruit classification example. Suppose 1 is used to represent “peach,” 2 is used to represent
“banana,” and 3 is used to represent “apple.” In our numerical representation, it would appear
that peaches are less than bananas, which are less than apples. As a result, if we have an image
that looks like some cross between an apple and a peach, we may simply end up classifying it as a
banana.
The solution is to use a one-hot vector encoding to represent all of our labels. If the i’th
observation has class $k$, instead of using the representation $y_i = k$, we can use the representation
$\mathbf{y}_i = \mathbf{e}_k$, the k’th canonical basis vector. For example, in our fruit example, if the i’th image is
classified as “banana”, its label representation would be
\[
\mathbf{y}_i = [0 \;\; 1 \;\; 0]
\]
(Be careful to distinguish between the class label $y_i \in \mathbb{R}$ and its one-hot encoding $\mathbf{y}_i \in \mathbb{R}^K$.) Now
there is no relative ordering in the representations of the classes. We must modify our parameter
representation accordingly to the one-hot vector encoding. Now, there is a set of d + 1 parameters
associated with every class, which amounts to a matrix $\mathbf{W} \in \mathbb{R}^{K \times (d+1)}$. For each input $\mathbf{x}_i \in \mathbb{R}^{d+1}$,
each class $k$ is given a “score”
\[
z_k = \mathbf{w}_k^\top \mathbf{x}_i
\]
where $\mathbf{w}_k \in \mathbb{R}^{d+1}$ is the k’th row of the $\mathbf{W}$ matrix. In total there are K raw linear scores for an
arbitrary input $\mathbf{x}$:
\[
\bigl[ \mathbf{w}_1^\top \mathbf{x} \;\; \mathbf{w}_2^\top \mathbf{x} \;\; \dots \;\; \mathbf{w}_K^\top \mathbf{x} \bigr]
\]
The higher the score for a class, the more likely logistic regression will pick that class. Now that we
have a score system, we must transform all of these scores into a posterior probability distribution
$P(\hat{Y})$. For binary logistic regression, we used the logistic function, which takes the value $\mathbf{w}^\top \mathbf{x}$
and squashes it to a value between 0 and 1. The generalization of the logistic function for
the multi-class case is the softmax function. The softmax function takes as input all K scores
(formally known as logits) and an index $j$, and outputs the probability that the corresponding
softmax distribution takes value $j$:
\[
\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}
\]
The logits induce a softmax distribution, which we can verify is indeed a probability distribution:
every entry is positive, and
\[
\sum_{j=1}^{K} \sigma(\mathbf{z})_j = \frac{\sum_{j=1}^{K} e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} = 1
\]
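On inspection, this softmax distribution is reasonable, because the higher the score of a class, the
higher its probability. In fact, we can verify that the logistic function used in the binary case is a
special case of the softmax function used in the multiclass case. Assuming that the corresponding
parameters for class 0 and 1 are $\mathbf{w}_0$ and $\mathbf{w}_1$, we have that:
\begin{align*}
P(\hat{Y}_i = 1 \mid \mathbf{x}_i, \mathbf{W}) &= \sigma(\mathbf{W}\mathbf{x}_i)_1 = \frac{e^{\mathbf{w}_1^\top \mathbf{x}_i}}{e^{\mathbf{w}_0^\top \mathbf{x}_i} + e^{\mathbf{w}_1^\top \mathbf{x}_i}} = s((\mathbf{w}_1 - \mathbf{w}_0)^\top \mathbf{x}_i) \\
P(\hat{Y}_i = 0 \mid \mathbf{x}_i, \mathbf{W}) &= \sigma(\mathbf{W}\mathbf{x}_i)_0 = \frac{e^{\mathbf{w}_0^\top \mathbf{x}_i}}{e^{\mathbf{w}_0^\top \mathbf{x}_i} + e^{\mathbf{w}_1^\top \mathbf{x}_i}} = 1 - s((\mathbf{w}_1 - \mathbf{w}_0)^\top \mathbf{x}_i)
\end{align*}
In the 2-class case, because we are only interested in the difference between $\mathbf{w}_1$ and $\mathbf{w}_0$, we just
use a change of variables $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_0$. We don’t need to know $\mathbf{w}_1$ and $\mathbf{w}_0$ individually, because
once we know $P(\hat{Y}_i = 1)$, we know by default that $P(\hat{Y}_i = 0) = 1 - P(\hat{Y}_i = 1)$.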
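Both facts, that the softmax outputs form a probability distribution and that the two-class softmax
reduces to the logistic function, can be confirmed numerically. In this sketch the logits are
arbitrary example values, and subtracting the largest logit before exponentiating is a common
numerical safeguard that we add here; it leaves the result unchanged.

```python
import math

def softmax(z):
    """softmax(z)_j = exp(z_j) / sum_k exp(z_k)."""
    m = max(z)                                   # shifting by the max avoids overflow
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

probs = softmax([2.0, -1.0, 0.5])
print(probs, sum(probs))                         # positive entries that sum to 1

# Two-class case: softmax over (z0, z1) matches the sigmoid of the difference z1 - z0.
z0, z1 = 0.3, -1.2
print(softmax([z0, z1])[1], sigmoid(z1 - z0))
```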
Loss Function
Let’s derive the loss function for multiclass logistic regression, first using the information-theoretic
perspective. The “true” or more formally the target distribution in this case is $P(Y_i = j) = \delta_{j, y_i}$.
In other words, the entire distribution is concentrated on the label for the training example. The
estimated distribution $P(\hat{Y}_i)$ comes from multiclass logistic regression, and in this case is the softmax
distribution:
\[
P(\hat{Y}_i = j) = \frac{e^{\mathbf{w}_j^\top \mathbf{x}_i}}{\sum_{k=1}^{K} e^{\mathbf{w}_k^\top \mathbf{x}_i}}
\]
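Now let’s proceed to deriving the loss function. The objective, as always, is to minimize the sum
of the KL divergences contributed by all of the training examples.
\begin{align*}
\hat{\mathbf{W}}_{mclr} &= \arg\min_{\mathbf{W}} \sum_{i=1}^{n} D_{KL}(P(Y_i) \,\|\, P(\hat{Y}_i)) \\
 &= \arg\min_{\mathbf{W}} \sum_{i=1}^{n} \sum_{j=1}^{K} P(Y_i = j) \ln \left( \frac{P(Y_i = j)}{P(\hat{Y}_i = j)} \right) \\
 &= \arg\min_{\mathbf{W}} \sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \ln \left( \frac{\delta_{j, y_i}}{\sigma(\mathbf{W}\mathbf{x}_i)_j} \right) \\
 &= \arg\min_{\mathbf{W}} \sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \cdot \ln \delta_{j, y_i} - \delta_{j, y_i} \cdot \ln \sigma(\mathbf{W}\mathbf{x}_i)_j \\
 &= \arg\min_{\mathbf{W}} -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \cdot \ln \sigma(\mathbf{W}\mathbf{x}_i)_j \\
 &= \arg\min_{\mathbf{W}} -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \cdot \ln \left( \frac{e^{\mathbf{w}_j^\top \mathbf{x}_i}}{\sum_{k=1}^{K} e^{\mathbf{w}_k^\top \mathbf{x}_i}} \right) \\
 &= \arg\min_{\mathbf{W}} \sum_{i=1}^{n} H(P(Y_i), P(\hat{Y}_i))
\end{align*}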
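Just like binary logistic regression, we can justify the loss function with MLE as well:
\begin{align*}
\hat{\mathbf{W}}_{mclr} &= \arg\max_{\mathbf{W}} \prod_{i=1}^{n} P(\hat{Y}_i = y_i \mid \mathbf{x}_i, \mathbf{W}) \\
 &= \arg\max_{\mathbf{W}} \sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \ln P(\hat{Y}_i = j \mid \mathbf{x}_i, \mathbf{W}) \\
 &= \arg\min_{\mathbf{W}} -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \ln \left( \frac{e^{\mathbf{w}_j^\top \mathbf{x}_i}}{\sum_{k=1}^{K} e^{\mathbf{w}_k^\top \mathbf{x}_i}} \right)
\end{align*}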
Training
The logistic regression loss function has no known analytic closed-form solution. Therefore, in order
to minimize it, we can use gradient descent, either in batch form or stochastic form. Let’s examine
the case for batch gradient descent.
Binary Case
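Recall the loss function
\[
L(\mathbf{w}) = -\sum_{i=1}^{n} \bigl[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \bigr]
\]
where
\[
p_i = s(\mathbf{w}^\top \mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}_i}}
\]
Its gradient is
\begin{align*}
\nabla_{\mathbf{w}} L(\mathbf{w}) &= \nabla_{\mathbf{w}} \Bigl( -\sum_{i=1}^{n} \bigl[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \bigr] \Bigr) \\
 &= -\sum_{i=1}^{n} \bigl[ y_i \nabla_{\mathbf{w}} \ln p_i + (1 - y_i) \nabla_{\mathbf{w}} \ln(1 - p_i) \bigr] \\
 &= -\sum_{i=1}^{n} \Bigl[ \frac{y_i}{p_i} \nabla_{\mathbf{w}} p_i - \frac{1 - y_i}{1 - p_i} \nabla_{\mathbf{w}} p_i \Bigr] \\
 &= -\sum_{i=1}^{n} \Bigl( \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \Bigr) \nabla_{\mathbf{w}} p_i
\end{align*}
Note that $\nabla_z s(z) = s(z)(1 - s(z))$, and from the chain rule we have that
\begin{align*}
\nabla_{\mathbf{w}} L(\mathbf{w}) &= -\sum_{i=1}^{n} \Bigl( \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \Bigr) p_i (1 - p_i) \mathbf{x}_i \\
 &= -\sum_{i=1}^{n} \bigl( y_i (1 - p_i) - (1 - y_i) p_i \bigr) \mathbf{x}_i \\
 &= -\sum_{i=1}^{n} (y_i - p_i) \, \mathbf{x}_i
\end{align*}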
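With the gradient in hand, batch gradient descent for binary logistic regression takes only a few
lines. This is a minimal sketch in plain Python; the toy dataset, the step size of 0.1, and the
iteration count are arbitrary choices for the example, and each input carries a trailing constant
feature for the bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """p = s(w^T x), the predicted probability of class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def loss(w, X, y):
    """Cross entropy -sum_i [y_i ln p_i + (1 - y_i) ln(1 - p_i)]."""
    return -sum(yi * math.log(predict_proba(w, x)) + (1 - yi) * math.log(1 - predict_proba(w, x))
                for x, yi in zip(X, y))

def gradient(w, X, y):
    """Gradient of the loss: -sum_i (y_i - p_i) x_i."""
    g = [0.0] * len(w)
    for x, yi in zip(X, y):
        residual = yi - predict_proba(w, x)
        for j, xj in enumerate(x):
            g[j] -= residual * xj
    return g

def gradient_descent(X, y, step=0.1, iters=2000):
    """Batch gradient descent starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        w = [wj - step * gj for wj, gj in zip(w, gradient(w, X, y))]
    return w

# Hypothetical 1-D data (plus bias feature); the classes overlap, so the optimum is finite.
X = [[0.5, 1.0], [1.0, 1.0], [1.5, 1.0], [2.0, 1.0], [2.5, 1.0], [3.0, 1.0]]
y = [0, 0, 1, 0, 1, 1]
w_fit = gradient_descent(X, y)
print("loss at w = 0:", loss([0.0, 0.0], X, y))
print("loss at fitted w:", loss(w_fit, X, y), "w =", w_fit)
```

The stochastic variant would use the same per-example term $-(y_i - p_i)\mathbf{x}_i$, evaluated on one
randomly chosen example per update instead of the full sum.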
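It does not matter what initial values we pick for $\mathbf{w}$, because the loss function $L(\mathbf{w})$ is convex, so
every local minimum is a global minimum. Let’s prove this, by first finding the Hessian of the loss function.
The $(k, \ell)$’th entry of the Hessian is the second partial derivative of the loss with respect to $w_k$ and $w_\ell$:
\begin{align*}
H_{k\ell} &= \frac{\partial^2 L(\mathbf{w})}{\partial w_k \, \partial w_\ell} \\
 &= \frac{\partial}{\partial w_k} \Bigl( -\sum_{i=1}^{n} (y_i - p_i) \, x_{i\ell} \Bigr) \\
 &= \sum_{i=1}^{n} \frac{\partial p_i}{\partial w_k} \, x_{i\ell} \\
 &= \sum_{i=1}^{n} p_i (1 - p_i) \, x_{ik} x_{i\ell}
\end{align*}
We conclude that
\[
H(\mathbf{w}) = \sum_{i=1}^{n} p_i (1 - p_i) \, \mathbf{x}_i \mathbf{x}_i^\top
\]
Since $p_i (1 - p_i) > 0$, for any vector $\mathbf{v}$ we have $\mathbf{v}^\top H(\mathbf{w}) \mathbf{v} = \sum_{i=1}^{n} p_i (1 - p_i) (\mathbf{x}_i^\top \mathbf{v})^2 \ge 0$, so the
Hessian is positive semidefinite and $L(\mathbf{w})$ is convex.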
Multiclass Case
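Instead of finding the gradient with respect to all of the parameters of the matrix $\mathbf{W}$, let’s find
them with respect to one row of $\mathbf{W}$ at a time:
\begin{align*}
\nabla_{\mathbf{w}_\ell} L(\mathbf{W}) &= \nabla_{\mathbf{w}_\ell} \Bigl( -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \cdot \ln \Bigl( \frac{e^{\mathbf{w}_j^\top \mathbf{x}_i}}{\sum_{k=1}^{K} e^{\mathbf{w}_k^\top \mathbf{x}_i}} \Bigr) \Bigr) \\
 &= -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \cdot \nabla_{\mathbf{w}_\ell} \ln \Bigl( \frac{e^{\mathbf{w}_j^\top \mathbf{x}_i}}{\sum_{k=1}^{K} e^{\mathbf{w}_k^\top \mathbf{x}_i}} \Bigr) \\
 &= -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \cdot \Bigl( \nabla_{\mathbf{w}_\ell} \mathbf{w}_j^\top \mathbf{x}_i - \nabla_{\mathbf{w}_\ell} \ln \sum_{k=1}^{K} e^{\mathbf{w}_k^\top \mathbf{x}_i} \Bigr) \\
 &= -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \cdot \Bigl( \delta_{\ell, y_i} - \frac{e^{\mathbf{w}_\ell^\top \mathbf{x}_i}}{\sum_{k=1}^{K} e^{\mathbf{w}_k^\top \mathbf{x}_i}} \Bigr) \mathbf{x}_i \\
 &= -\sum_{i=1}^{n} \bigl( \delta_{\ell, y_i} - P(\hat{Y}_i = \ell) \bigr) \mathbf{x}_i
\end{align*}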
Just as with binary logistic regression, it does not matter what initial values we pick for W, because the loss function L(W) is convex, so every local minimum is a global minimum.
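A quick way to gain confidence in the row-wise gradient above is to compare it against a finite-difference approximation of the loss. The sketch below does this with the standard library only; the weight matrix, the data, and the perturbation size `eps` are arbitrary illustrative choices, and class labels are taken to be the row indices 0, 1, 2 of W.

```python
import math

def softmax(scores):
    """Softmax of a list of scores, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(W, X, y):
    """L(W) = -sum_i ln P(Y_hat_i = y_i), where the rows of W are the vectors w_j."""
    return -sum(math.log(softmax([dot(w_j, x_i) for w_j in W])[y_i])
                for x_i, y_i in zip(X, y))

def gradient_row(W, X, y, l):
    """Analytic gradient w.r.t. row w_l: -sum_i (delta_{l,y_i} - P(Y_hat_i = l)) x_i."""
    g = [0.0] * len(W[l])
    for x_i, y_i in zip(X, y):
        p_l = softmax([dot(w_j, x_i) for w_j in W])[l]
        coeff = (1.0 if y_i == l else 0.0) - p_l
        for k in range(len(g)):
            g[k] -= coeff * x_i[k]
    return g

def numerical_gradient_row(W, X, y, l, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    g = []
    for k in range(len(W[l])):
        W_plus = [row[:] for row in W]
        W_minus = [row[:] for row in W]
        W_plus[l][k] += eps
        W_minus[l][k] -= eps
        g.append((loss(W_plus, X, y) - loss(W_minus, X, y)) / (2 * eps))
    return g

W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]            # K = 3 classes, d = 2 features
X = [[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7], [2.0, 1.0]]
y = [0, 1, 2, 1]
analytic = [gradient_row(W, X, y, l) for l in range(3)]
numeric = [numerical_gradient_row(W, X, y, l) for l in range(3)]
```

The two lists should agree up to the finite-difference error. The row gradients also sum to the zero vector, because the terms $\delta_{\ell,y_i} - P(\hat{Y}_i = \ell)$ sum to zero over the classes.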
5.4 Gaussian Discriminant Analysis
Recall the idea of generative models: we classify an arbitrary datapoint x with the class label that maximizes the joint probability p(x, Y) over the label Y:
\[
\hat{y} = \arg\max_{k} \, p(\mathbf{x}, Y = k)
\]
Generative models typically form the joint distribution by explicitly forming the following:
\[
P(k) = P(\text{class} = k)
\]
\[
p_k(X) = p(X \mid \text{class } k)
\]
In total there are K + 1 probability distributions: 1 for the prior, and K for all of the individual classes. Note that the prior probability distribution is a categorical distribution over the K discrete classes, whereas each class conditional probability distribution is a continuous distribution over $\mathbb{R}^d$ (often represented as a Gaussian). Using the prior and the conditional distributions in conjunction, we have (from Bayes’ rule) that maximizing the joint probability over the class labels is equivalent to maximizing the posterior probability of the class label:
\[
\hat{y} = \arg\max_{k} \, p(\mathbf{x}, Y = k) = \arg\max_{k} \, P(k) \, p_k(\mathbf{x}) = \arg\max_{k} \, P(Y = k \mid \mathbf{x})
\]
Consider the example of digit classification. Suppose we are given a dataset of images of handwritten digits, each with a known label in the range {0, 1, 2, . . . , 9}. The task is, given an image of a handwritten digit, to classify it as the correct digit. A generative classifier for this task would effectively form the prior distribution and the conditional probability distributions over the 10 possible digits and choose the digit that maximizes the posterior probability:
\[
\hat{y} = \arg\max_{k \in \{0, 1, \dots, 9\}} P(Y = k \mid \mathbf{x})
\]
Maximizing the posterior will induce regions in the feature space in which one class has the highest posterior probability, and decision boundaries in between classes where the posterior probabilities of two classes are equal.
Gaussian Discriminant Analysis (GDA) is a specific generative method in which the class
conditional probability distributions are Gaussian: (X|Y = k) ∼ N (µk , Σk ). (Caution: the term
“discriminant” in GDA is misleading; GDA is a generative method, it is not a discriminative
method!)
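Assume that we are given a training set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ of n points. Estimating the prior distribution is the same as for any other generative model. The probability of a class k is
\[
P(k) = \frac{n_k}{n}
\]
where $n_k$ is the number of training points that belong to class k. We can estimate the parameters of the conditional distributions with MLE. Once we form the estimated prior and conditional distributions, we use Bayes’ rule to directly solve the optimization problem
\[
\hat{y} = \arg\max_{k} \, P(Y = k \mid \mathbf{x}) = \arg\max_{k} \, P(k) \, p_k(\mathbf{x})
\]
For future reference, let’s use $Q_k(\mathbf{x}) = \ln\left( (\sqrt{2\pi})^d \, P(k) \, p_k(\mathbf{x}) \right)$ to simplify our notation. For a Gaussian class conditional distribution with mean $\mu_k$ and covariance $\Sigma_k$, this is
\[
Q_k(\mathbf{x}) = \ln P(k) - \frac{1}{2} (\mathbf{x} - \mu_k)^\top \Sigma_k^{-1} (\mathbf{x} - \mu_k) - \frac{1}{2} \ln |\Sigma_k|
\]
Because the logarithm is increasing and the factor $(\sqrt{2\pi})^d$ does not depend on k, we classify an arbitrary test point x as the class k that maximizes $Q_k(\mathbf{x})$.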
GDA comes in two flavors: Quadratic Discriminant Analysis (QDA) in which the decision boundary
is quadratic, and Linear Discriminant Analysis (LDA) in which the decision boundary is linear. We
will now present both and compare them in detail.
QDA Classification
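In QDA, the parameters of each class conditional distribution $p_k(X)$ are estimated using only the training data that are labeled as class k. The MLE estimates for the parameters of $p_k(X)$ are:
\[
\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} \mathbf{x}_i
\]
\[
\hat{\Sigma}_k = \frac{1}{n_k} \sum_{i : y_i = k} (\mathbf{x}_i - \hat{\mu}_k)(\mathbf{x}_i - \hat{\mu}_k)^\top
\]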
LDA Classification
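In LDA, the class means are estimated as in QDA, but all classes share a single covariance matrix, whose MLE estimate is
\[
\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\mu}_{y_i})(\mathbf{x}_i - \hat{\mu}_{y_i})^\top
\]
One way to understand this formula is as a weighted average of the within-class covariances. Grouping the training points by their class k, we have:
\[
\begin{aligned}
\hat{\Sigma} &= \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\mu}_{y_i})(\mathbf{x}_i - \hat{\mu}_{y_i})^\top \\
&= \frac{1}{n} \sum_{k=1}^{K} \sum_{i : y_i = k} (\mathbf{x}_i - \hat{\mu}_k)(\mathbf{x}_i - \hat{\mu}_k)^\top \\
&= \frac{1}{n} \sum_{k=1}^{K} n_k \hat{\Sigma}_k \\
&= \sum_{k=1}^{K} \frac{n_k}{n} \hat{\Sigma}_k
\end{aligned}
\]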
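The estimates above are simple enough to compute by hand on a small dataset. The following sketch, which uses only the standard library, fits the priors, the class means, the per-class covariances used by QDA, and the pooled covariance used by LDA; the function names and the toy data are hypothetical and chosen only for illustration.

```python
def mean(points):
    """Componentwise mean of a list of d-dimensional points."""
    n, d = len(points), len(points[0])
    return [sum(p[k] for p in points) / n for k in range(d)]

def scatter(points, mu):
    """Sum over the points of the outer products (x - mu)(x - mu)^T."""
    d = len(mu)
    S = [[0.0] * d for _ in range(d)]
    for p in points:
        diff = [p[k] - mu[k] for k in range(d)]
        for a in range(d):
            for b in range(d):
                S[a][b] += diff[a] * diff[b]
    return S

def fit_gda(X, y):
    """Return priors, class means, per-class covariances (QDA), and pooled covariance (LDA)."""
    n, d = len(X), len(X[0])
    priors, means, covs = {}, {}, {}
    pooled = [[0.0] * d for _ in range(d)]
    for k in sorted(set(y)):
        pts = [x for x, label in zip(X, y) if label == k]
        n_k = len(pts)
        priors[k] = n_k / n                                  # P(k) = n_k / n
        means[k] = mean(pts)                                 # class mean
        S_k = scatter(pts, means[k])
        covs[k] = [[v / n_k for v in row] for row in S_k]    # class covariance (QDA)
        for a in range(d):
            for b in range(d):
                pooled[a][b] += S_k[a][b] / n                # shared covariance (LDA)
    return priors, means, covs, pooled

X = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [4.0, 4.0], [6.0, 4.0], [5.0, 6.0], [5.0, 2.0]]
y = [0, 0, 0, 1, 1, 1, 1]
priors, means, covs, pooled = fit_gda(X, y)

# The pooled covariance should equal the prior-weighted average of the class covariances.
weighted = [[sum(priors[k] * covs[k][a][b] for k in priors) for b in range(2)]
            for a in range(2)]
```

Comparing `pooled` with `weighted` checks the weighted-average identity numerically.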
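Let’s now derive the form of the decision boundary for QDA and LDA. As we will see, the terms quadratic in QDA and linear in LDA refer to the shape of the decision boundary. We will prove this claim using binary (2-class) examples for simplicity (class A and class B). An arbitrary point x is classified according to the following cases:
\[
\hat{y} =
\begin{cases}
A & \text{if } Q_A(\mathbf{x}) > Q_B(\mathbf{x}) \\
B & \text{if } Q_A(\mathbf{x}) < Q_B(\mathbf{x}) \\
\text{either } A \text{ or } B & \text{if } Q_A(\mathbf{x}) = Q_B(\mathbf{x})
\end{cases}
\]
The decision boundary is the set of all points in x-space that are classified according to the third case.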
The simplest case is when the two classes are equally likely in prior, and their conditional probability
distributions are isotropic with identical covariances. Recall that isotropic Gaussian distributions
have covariances of the form of Σ = σ 2 I, which means that their isocontours are circles. In this
case, pA (X) and pB (X) have identical covariances of the form ΣA = ΣB = σ 2 I.
Figure 5.6: Contour plot of two isotropic, identically distributed Gaussians in R2 . The circles are the level
sets of the Gaussians.
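Geometrically, we can see that the task of classifying a 2-D point into one of the two classes amounts simply to figuring out which of the means it’s closer to. Using our notation of $Q_k(\mathbf{x})$ from before, this can be expressed mathematically as:
\[
\begin{aligned}
Q_A(\mathbf{x}) &= Q_B(\mathbf{x}) \\
\ln P(A) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_A)^\top \hat{\Sigma}_A^{-1} (\mathbf{x} - \hat{\mu}_A) - \frac{1}{2} \ln |\hat{\Sigma}_A| &= \ln P(B) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_B)^\top \hat{\Sigma}_B^{-1} (\mathbf{x} - \hat{\mu}_B) - \frac{1}{2} \ln |\hat{\Sigma}_B| \\
\ln \frac{1}{2} - \frac{1}{2} (\mathbf{x} - \hat{\mu}_A)^\top \sigma^{-2} \mathbf{I} (\mathbf{x} - \hat{\mu}_A) - \frac{1}{2} \ln |\sigma^2 \mathbf{I}| &= \ln \frac{1}{2} - \frac{1}{2} (\mathbf{x} - \hat{\mu}_B)^\top \sigma^{-2} \mathbf{I} (\mathbf{x} - \hat{\mu}_B) - \frac{1}{2} \ln |\sigma^2 \mathbf{I}| \\
(\mathbf{x} - \hat{\mu}_A)^\top (\mathbf{x} - \hat{\mu}_A) &= (\mathbf{x} - \hat{\mu}_B)^\top (\mathbf{x} - \hat{\mu}_B)
\end{aligned}
\]
The decision boundary is the set of points x for which $\|\mathbf{x} - \hat{\mu}_A\|^2 = \|\mathbf{x} - \hat{\mu}_B\|^2$, which is simply the set of points that are equidistant from $\hat{\mu}_A$ and $\hat{\mu}_B$. This decision boundary is linear because the set of points that are equidistant from $\hat{\mu}_A$ and $\hat{\mu}_B$ is simply the perpendicular bisector of the segment connecting $\hat{\mu}_A$ and $\hat{\mu}_B$.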
The next case is when the two classes are equally likely in prior, and their conditional probability
distributions are anisotropic with identical covariances.
Figure 5.7: Two anisotropic, identically distributed Gaussians in R2 . The ellipses are the level sets of the
Gaussians.
The anisotropic case can be reduced to the isotropic case simply by performing a linear change of coordinates that transforms the ellipses back into circles, which induces a linear decision boundary both in the transformed and original space. Therefore, the decision boundary is still the set of points that are equidistant from $\hat{\mu}_A$ and $\hat{\mu}_B$, where distance is now measured in the transformed coordinates (equivalently, by the Mahalanobis distance induced by the shared covariance).
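Now, let’s find the decision boundary when the two classes still have identical covariances but are not necessarily equally likely in prior:
\[
\begin{aligned}
Q_A(\mathbf{x}) &= Q_B(\mathbf{x}) \\
\ln P(A) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_A)^\top \hat{\Sigma}_A^{-1} (\mathbf{x} - \hat{\mu}_A) - \frac{1}{2} \ln |\hat{\Sigma}_A| &= \ln P(B) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_B)^\top \hat{\Sigma}_B^{-1} (\mathbf{x} - \hat{\mu}_B) - \frac{1}{2} \ln |\hat{\Sigma}_B| \\
\ln P(A) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_A)^\top \hat{\Sigma}^{-1} (\mathbf{x} - \hat{\mu}_A) - \frac{1}{2} \ln |\hat{\Sigma}| &= \ln P(B) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_B)^\top \hat{\Sigma}^{-1} (\mathbf{x} - \hat{\mu}_B) - \frac{1}{2} \ln |\hat{\Sigma}| \\
\ln P(A) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_A)^\top \hat{\Sigma}^{-1} (\mathbf{x} - \hat{\mu}_A) &= \ln P(B) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_B)^\top \hat{\Sigma}^{-1} (\mathbf{x} - \hat{\mu}_B) \\
2 \ln P(A) - \mathbf{x}^\top \hat{\Sigma}^{-1} \mathbf{x} + 2 \mathbf{x}^\top \hat{\Sigma}^{-1} \hat{\mu}_A - \hat{\mu}_A^\top \hat{\Sigma}^{-1} \hat{\mu}_A &= 2 \ln P(B) - \mathbf{x}^\top \hat{\Sigma}^{-1} \mathbf{x} + 2 \mathbf{x}^\top \hat{\Sigma}^{-1} \hat{\mu}_B - \hat{\mu}_B^\top \hat{\Sigma}^{-1} \hat{\mu}_B \\
2 \ln P(A) + 2 \mathbf{x}^\top \hat{\Sigma}^{-1} \hat{\mu}_A - \hat{\mu}_A^\top \hat{\Sigma}^{-1} \hat{\mu}_A &= 2 \ln P(B) + 2 \mathbf{x}^\top \hat{\Sigma}^{-1} \hat{\mu}_B - \hat{\mu}_B^\top \hat{\Sigma}^{-1} \hat{\mu}_B
\end{aligned}
\]
Collecting terms, the boundary is the set of points satisfying $\mathbf{w}^\top \mathbf{x} - b = 0$, with $\mathbf{w} = \hat{\Sigma}^{-1} (\hat{\mu}_A - \hat{\mu}_B)$ and $b = \frac{1}{2} \left( \hat{\mu}_A^\top \hat{\Sigma}^{-1} \hat{\mu}_A - \hat{\mu}_B^\top \hat{\Sigma}^{-1} \hat{\mu}_B \right) + \ln \frac{P(B)}{P(A)}$. In other words, the decision boundary is the level set of a linear function $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} - b$. In fact, the decision boundary is linear as long as the two class conditional probability distributions share the same covariance matrix. This is why LDA has a linear decision boundary.
Finally, consider the general case of QDA, in which the two class conditional distributions have different covariance matrices:
\[
\begin{aligned}
Q_A(\mathbf{x}) &= Q_B(\mathbf{x}) \\
\ln P(A) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_A)^\top \hat{\Sigma}_A^{-1} (\mathbf{x} - \hat{\mu}_A) - \frac{1}{2} \ln |\hat{\Sigma}_A| &= \ln P(B) - \frac{1}{2} (\mathbf{x} - \hat{\mu}_B)^\top \hat{\Sigma}_B^{-1} (\mathbf{x} - \hat{\mu}_B) - \frac{1}{2} \ln |\hat{\Sigma}_B|
\end{aligned}
\]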
Here, unlike in LDA when ΣA = ΣB , we cannot cancel out the quadratic terms in x from both
sides of the equation, and thus our decision boundary is now represented by the level set of an
arbitrary quadratic function.
It should now make sense why QDA is short for quadratic discriminant analysis and LDA is short
for linear discriminant analysis!
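As it turns out, LDA and logistic regression share the same type of posterior distribution. We already showed that the posterior distribution in logistic regression is
\[
P(Y = A \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} - b)}} = s(\mathbf{w}^\top \mathbf{x} - b)
\]
for some appropriate vector w and bias b. Now let’s derive the posterior distribution for LDA. From Bayes’ rule we have that
\[
\begin{aligned}
P(Y = A \mid \mathbf{x}) &= \frac{p(\mathbf{x} \mid Y = A) P(Y = A)}{p(\mathbf{x} \mid Y = A) P(Y = A) + p(\mathbf{x} \mid Y = B) P(Y = B)} \\
&= \frac{e^{Q_A(\mathbf{x})}}{e^{Q_A(\mathbf{x})} + e^{Q_B(\mathbf{x})}} \\
&= \frac{1}{1 + e^{-(Q_A(\mathbf{x}) - Q_B(\mathbf{x}))}}
\end{aligned}
\]
We already showed that the decision boundary in LDA is linear: it is the set of points x such that $Q_A(\mathbf{x}) - Q_B(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} - b = 0$ for some vector w and bias b. The LDA posterior is therefore $P(Y = A \mid \mathbf{x}) = s(\mathbf{w}^\top \mathbf{x} - b)$, which has the same form as the posterior in logistic regression.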
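The claim that the LDA posterior is a logistic function of a linear score can be checked numerically. The sketch below works in two dimensions with the standard library only; the priors, means, and shared covariance are made-up values, and w and b are computed as $\mathbf{w} = \Sigma^{-1}(\mu_A - \mu_B)$ and $b = \frac{1}{2}(\mu_A^\top \Sigma^{-1} \mu_A - \mu_B^\top \Sigma^{-1} \mu_B) + \ln(P(B)/P(A))$, which follows from expanding $Q_A(\mathbf{x}) - Q_B(\mathbf{x})$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inv2(S):
    """Inverse of a 2x2 matrix."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

def matvec(M, v):
    return [dot(row, v) for row in M]

def Q(x, prior, mu, S):
    """Q_k(x) = ln P(k) - 1/2 (x - mu)^T S^{-1} (x - mu) - 1/2 ln|S|, for d = 2."""
    diff = [x[0] - mu[0], x[1] - mu[1]]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return math.log(prior) - 0.5 * dot(diff, matvec(inv2(S), diff)) - 0.5 * math.log(det)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical LDA parameters: two classes sharing one covariance matrix.
prior_A, prior_B = 0.3, 0.7
mu_A, mu_B = [1.0, 2.0], [-1.0, 0.0]
Sigma = [[2.0, 0.5], [0.5, 1.0]]

def posterior_A(x):
    """Posterior from Bayes' rule: e^{Q_A} / (e^{Q_A} + e^{Q_B})."""
    qa, qb = Q(x, prior_A, mu_A, Sigma), Q(x, prior_B, mu_B, Sigma)
    return math.exp(qa) / (math.exp(qa) + math.exp(qb))

# Linear score obtained by expanding Q_A(x) - Q_B(x) = w^T x - b.
Sigma_inv = inv2(Sigma)
w = matvec(Sigma_inv, [mu_A[0] - mu_B[0], mu_A[1] - mu_B[1]])
b = (0.5 * (dot(mu_A, matvec(Sigma_inv, mu_A)) - dot(mu_B, matvec(Sigma_inv, mu_B)))
     + math.log(prior_B / prior_A))

points = [[0.0, 0.0], [1.0, 1.0], [-2.0, 3.0], [0.5, -1.0]]
pairs = [(posterior_A(x), sigmoid(dot(w, x) - b)) for x in points]
```

Each pair in `pairs` should contain two values that agree up to floating-point error.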
The analysis on the decision boundary in QDA and LDA can be extended to the general case when
there are more than two classes. In the multiclass setting, the decision boundary is a collection
of linear boundaries in LDA and quadratic boundaries in QDA. The following Voronoi diagrams
illustrate the point:
Figure 5.8: LDA (left) vs QDA (right): a collection of linear vs quadratic level set boundaries. Source:
Professor Shewchuk’s notes
5.5 Support Vector Machines
So far we’ve explored generative classifiers (LDA) and discriminative classifiers (logistic regression), but in both of these methods, we tasked ourselves with modeling some kind of probability distribution. One observation about classification is that in the end, if we only care about assigning each data point a class, all we really need to do is find a “good” decision boundary, and we can skip thinking about the distributions. Support Vector Machines (SVMs) are an attempt to model decision boundaries directly in this spirit.
Here’s the setup for the problem. We are given a training dataset D = {(xi , yi )}ni=1 , where xi ∈ Rd
and yi ∈ {−1, +1}. Our goal is to find a d − 1 dimensional hyperplane decision boundary H
which separates the +1’s from the −1’s.
In order to motivate SVMs, we first have to understand the simpler perceptron algorithm and its
shortcomings. Given that the training data is linearly separable, the perceptron algorithm finds
a d − 1 dimensional hyperplane that perfectly separates the +1’s from the −1’s. Mathematically,
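the goal is to learn a set of parameters $\mathbf{w} \in \mathbb{R}^d$ and $b \in \mathbb{R}$ that satisfy the linear separability constraints:
\[
\forall i, \quad
\begin{cases}
\mathbf{w}^\top \mathbf{x}_i - b \geq 0 & \text{if } y_i = 1 \\
\mathbf{w}^\top \mathbf{x}_i - b \leq 0 & \text{if } y_i = -1
\end{cases}
\]
Equivalently,
\[
\forall i, \quad y_i (\mathbf{w}^\top \mathbf{x}_i - b) \geq 0
\]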
The resulting decision boundary is a hyperplane H = {x : w>x − b = 0}. All points on the positive
side of the hyperplane are classified as +1, and all points on the negative side are classified as −1.
Perceptrons have two major shortcomings that, as we shall see, SVMs can overcome. First of all, if the data is not linearly separable, the perceptron fails to find a stable solution. As we shall see, soft-margin SVMs fix this issue by allowing best-fit decision boundaries even when the data is not linearly separable. Second, if the data is linearly separable, there are infinitely many separating hyperplanes that the perceptron could pick: if (w, b) is a pair that separates the data points, then the perceptron could also end up choosing a slightly different (w, b + ε) pair that still separates the data points. Some hyperplanes are better than others, but the perceptron cannot distinguish between them. This leads to generalization issues.
Figure 5.9: Several possible decision boundaries under the perceptron. The X’s and C’s represent the +1’s
and −1’s respectively.
In the figure above, we consider three potential linear separators that satisfy the constraints. To the
eyes of the perceptron algorithm, all three are perfectly valid linear separators. Ideally, we should
not treat all linear separators equally — some are better than others. One could imagine that if
we observed new test points that are nearby the region of C’s (or X’s) in the training data, they
should also be of class C (or X). The two separators close to the training points would incorrectly
classify some of these new test points, while the third separator which maintains a large distance
to the points would classify them correctly. The perceptron algorithm does not take this reasoning
into account, and may find a classifier that does not generalize well to unseen data.
Hard-Margin SVMs
Hard-Margin SVMs address the generalization problem of perceptrons by maximizing the mar-
gin, formally defined as the minimum distance from the decision boundary to the training points.
Figure 5.10: The optimal decision boundary (as shown) maximizes the margin.
Intuitively, maximizing the margin allows us to generalize better to unseen data, because the
decision boundary with the maximum margin is as far away from the training data as possible and
the boundary cannot be violated unless the unseen data contains outliers.
Simply put, the goal of hard-margin SVMs is to find a hyperplane H that maximizes the margin
m. Let’s formalize an optimization problem for hard-margin SVMs. The variables we are trying to
optimize over are the margin m and the parameters of the hyperplane, w and b. The objective is
to maximize the margin m, subject to the following constraints:
• All points labeled +1 are on the positive side of the hyperplane, and their distance to H is at least the margin
• All points labeled −1 are on the negative side of the hyperplane, and their distance to H is at least the margin
• The margin is non-negative.
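Let’s express the first two constraints mathematically. First, note that the vector w is perpendicular to the hyperplane $H = \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} - b = 0\}$.
Proof: consider any two points on H, $\mathbf{x}_0$ and $\mathbf{x}_1$. We will show that $(\mathbf{x}_1 - \mathbf{x}_0) \perp \mathbf{w}$. Note that
\[
(\mathbf{x}_1 - \mathbf{x}_0)^\top \mathbf{w} = (\mathbf{x}_1 - \mathbf{x}_0)^\top ((\mathbf{x}_1 + \mathbf{w}) - \mathbf{x}_1) = \mathbf{x}_1^\top \mathbf{w} - \mathbf{x}_0^\top \mathbf{w} = b - b = 0
\]
Since w is perpendicular to H, the (shortest) distance from any arbitrary point z to the hyperplane H is measured along the direction of w. If we take any point $\mathbf{x}_0$ on the hyperplane, the distance from z to H is the length of the projection of $\mathbf{z} - \mathbf{x}_0$ onto the vector w, which is
\[
\frac{|\mathbf{w}^\top (\mathbf{z} - \mathbf{x}_0)|}{\|\mathbf{w}\|_2} = \frac{|\mathbf{w}^\top \mathbf{z} - b|}{\|\mathbf{w}\|_2}
\]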
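The point-to-hyperplane distance formula translates directly into code. This is a small sketch with an arbitrary hyperplane chosen for illustration; `distance_to_hyperplane` is a hypothetical helper name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def distance_to_hyperplane(z, w, b):
    """Distance from z to H = {x : w^T x - b = 0}, i.e. |w^T z - b| / ||w||_2."""
    return abs(dot(w, z) - b) / math.sqrt(dot(w, w))

w, b = [3.0, 4.0], 5.0
d_off = distance_to_hyperplane([3.0, 4.0], w, b)    # a point off the hyperplane
d_on = distance_to_hyperplane([3.0, -1.0], w, b)    # a point on the hyperplane
# Rescaling (w, b) by a positive constant describes the same hyperplane,
# so the distance should not change.
d_scaled = distance_to_hyperplane([3.0, 4.0], [6.0, 8.0], 10.0)
```

The invariance under rescaling of (w, b) is the reason the hard-margin problem needs an extra normalization constraint to pin down a unique solution.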
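In order to ensure that positive points are on the positive side of the hyperplane outside a margin of size m, and that negative points are on the negative side of the hyperplane outside a margin of size m, we can express the constraint
\[
y_i \frac{(\mathbf{w}^\top \mathbf{x}_i - b)}{\|\mathbf{w}\|_2} \geq m
\]
Putting the objective and the constraints together, the optimization problem is
\[
\begin{aligned}
\max_{m, \mathbf{w}, b} \quad & m \\
\text{s.t.} \quad & y_i \frac{(\mathbf{w}^\top \mathbf{x}_i - b)}{\|\mathbf{w}\|_2} \geq m \quad \forall i \\
& m \geq 0
\end{aligned}
\tag{5.1}
\]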
Maximizing the margin m implies that there exists at least one point on the positive side of the
hyperplane and at least one point on the negative side whose distance to the hyperplane is exactly
equal to m. These points are the support vectors, hence the name “support vector machines.”
They are called support vectors because they literally hold/support the margin planes in place.
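Through a series of optimization steps, we can simplify the problem by removing the margin variable and just optimizing the parameters of the hyperplane. Note that the current optimization formulation does not induce a unique choice of w and b: if $(m^*, \mathbf{w}^*, b^*)$ is a solution, then $(m^*, \alpha \mathbf{w}^*, \alpha b^*)$ is also a solution, for any $\alpha > 0$. In order to ensure that w and b are unique (without changing the nature of the optimization problem), we can add an additional constraint on the norm of w: $\|\mathbf{w}\|_2 = \alpha$, for some $\alpha > 0$. In particular, we can add the constraint $\|\mathbf{w}\|_2 = \frac{1}{m}$, or equivalently, $m = \frac{1}{\|\mathbf{w}\|_2}$:
\[
\begin{aligned}
\max_{m, \mathbf{w}, b} \quad & m \\
\text{s.t.} \quad & y_i \frac{(\mathbf{w}^\top \mathbf{x}_i - b)}{\|\mathbf{w}\|_2} \geq m \quad \forall i \\
& m \geq 0 \\
& m = \frac{1}{\|\mathbf{w}\|_2}
\end{aligned}
\tag{5.2}
\]
Now, we can substitute $m = \frac{1}{\|\mathbf{w}\|_2}$ and eliminate m from the optimization (the constraint $m \geq 0$ is then satisfied automatically):
\[
\begin{aligned}
\max_{\mathbf{w}, b} \quad & \frac{1}{\|\mathbf{w}\|_2} \\
\text{s.t.} \quad & y_i (\mathbf{w}^\top \mathbf{x}_i - b) \geq 1 \quad \forall i
\end{aligned}
\tag{5.3}
\]
At last, we have formulated the hard-margin SVM optimization problem! Since maximizing $\frac{1}{\|\mathbf{w}\|_2}$ is equivalent to minimizing $\frac{1}{2} \|\mathbf{w}\|_2^2$, the standard formulation of hard-margin SVMs is
\[
\begin{aligned}
\min_{\mathbf{w}, b} \quad & \frac{1}{2} \|\mathbf{w}\|_2^2 \\
\text{s.t.} \quad & y_i (\mathbf{w}^\top \mathbf{x}_i - b) \geq 1 \quad \forall i
\end{aligned}
\tag{5.4}
\]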
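To make the role of the normalization concrete, the sketch below evaluates a few candidate hyperplanes on a tiny separable dataset. It does not solve the optimization problem; it only checks the constraints of (5.4) and reports the objective and the implied margin $1 / \|\mathbf{w}\|_2$. The dataset and the candidate parameters are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_feasible(w, b, X, y):
    """Check the hard-margin constraints y_i (w^T x_i - b) >= 1 for all i."""
    return all(y_i * (dot(w, x_i) - b) >= 1.0 for x_i, y_i in zip(X, y))

def objective(w):
    """Hard-margin objective (1/2) ||w||_2^2."""
    return 0.5 * dot(w, w)

def margin(w):
    """Margin implied by the normalization m = 1 / ||w||_2."""
    return 1.0 / math.sqrt(dot(w, w))

X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]

# Three parameterizations of the same hyperplane {x : x_1 = 1}.
tight = ([1.0, 0.0], 1.0)       # closest points satisfy the constraint with equality
loose = ([2.0, 0.0], 2.0)       # feasible, but with a larger norm than necessary
too_small = ([0.5, 0.0], 0.5)   # violates the constraints
```

Here the closest points of opposite classes, (2, 0) and (0, 0), are a distance of 2 apart, so no separating hyperplane can have a margin larger than 1; the `tight` candidate attains it.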
Soft-Margin SVMs
The hard-margin SVM optimization problem has a solution only if the data are linearly separable; otherwise it has no solution. This is because the constraints are impossible to satisfy if we can’t draw a hyperplane that separates the +1’s from the −1’s. In addition, hard-margin SVMs are very sensitive to outliers. For example, suppose our data is class-conditionally Gaussian and the two Gaussians are far apart. If we witness an outlier from class +1 that crosses into the typical region for class −1, then the hard-margin SVM will be forced to give up a more generalizable fit in order to accommodate this point. Our next goal is to come up with a classifier that is not sensitive to outliers and can work even in the presence of data that is not linearly separable. To this end, we’ll talk about Soft-Margin SVMs.
A soft-margin SVM modifies the constraints from the hard-margin SVM by allowing some points to violate the margin. It introduces slack variables $\xi_i$, one for each training point, into the constraints:
\[
\begin{aligned}
y_i (\mathbf{w}^\top \mathbf{x}_i - b) &\geq 1 - \xi_i \\
\xi_i &\geq 0
\end{aligned}
\]
The constraints are now a less-strict, softer version of the hard-margin SVM constraints, because each point $\mathbf{x}_i$ need only be a “distance” of $1 - \xi_i$ from the separating hyperplane instead of a hard “distance” of 1.
(By the way, the Greek letter ξ is spelled “xi” and pronounced “zai.” $\xi_i$ is pronounced “zai-eye.”)
These constraints would be fruitless if we didn’t bound the values of the $\xi_i$’s: by setting them to large values, we would be saying that any point may violate the margin by an arbitrarily large distance, which makes our choice of w meaningless. Therefore we modify the objective function to penalize the slacks:
\[
\min_{\mathbf{w}, b, \xi_i} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
\]
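where C is a hyperparameter tuned through cross-validation. Putting the objective and constraints together, the soft-margin SVM optimization problem is
\[
\begin{aligned}
\min_{\mathbf{w}, b, \xi_i} \quad & \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & y_i (\mathbf{w}^\top \mathbf{x}_i - b) \geq 1 - \xi_i \quad \forall i \\
& \xi_i \geq 0 \quad \forall i
\end{aligned}
\tag{5.5}
\]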
The table below compares the effects of having a large C versus a small C. As C goes to infinity,
the penalty for having non-zero ξi goes to infinity, and thus we force the ξi ’s to be zero, which is
exactly the setting of the hard-margin SVM.
            small C             large C
Desire      maximize margin     keep ξi’s small or zero
Danger      underfitting        overfitting
Outliers    less sensitive      more sensitive
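Soft-margin SVMs can also be interpreted as a form of regularized empirical risk minimization. Let’s see why. Manipulating the first of the two constraints, we have that
\[
\xi_i \geq 1 - y_i (\mathbf{w}^\top \mathbf{x}_i - b)
\]
Combining with the constraint $\xi_i \geq 0$, we have that
\[
\xi_i \geq \max(1 - y_i (\mathbf{w}^\top \mathbf{x}_i - b), 0)
\]
At the optimal value of the optimization problem, these inequalities must be tight. Otherwise, we could lower each $\xi_i$ to equal $\max(1 - y_i (\mathbf{w}^\top \mathbf{x}_i - b), 0)$ and decrease the value of the objective function. Thus we can rewrite the soft-margin SVM optimization problem as
\[
\begin{aligned}
\min_{\mathbf{w}, b, \xi_i} \quad & \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & \xi_i = \max(1 - y_i (\mathbf{w}^\top \mathbf{x}_i - b), 0) \quad \forall i
\end{aligned}
\tag{5.7}
\]
If we divide the objective by Cn (which does not change the optimal solution of the optimization problem) and substitute the expression for $\xi_i$, we obtain
\[
\min_{\mathbf{w}, b} \; \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i (\mathbf{w}^\top \mathbf{x}_i - b), 0) + \lambda \|\mathbf{w}\|^2, \qquad \lambda = \frac{1}{2Cn}
\]
which is a regularized empirical risk minimization problem. Thus we have two interpretations of soft-margin SVM: either as finding a max-margin hyperplane that is allowed to make some mistakes via slack variables $\xi_i$, or as regularized empirical risk minimization.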
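The equivalence between the two views can be checked numerically for any fixed (w, b). In the sketch below, the slack of each point is set to its tight value, and the soft-margin objective divided by Cn is compared with the regularized-risk form using $\lambda = \frac{1}{2Cn}$. The data, the parameters, and the value of C are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slack(w, b, x_i, y_i):
    """Tight slack value: xi_i = max(1 - y_i (w^T x_i - b), 0)."""
    return max(1.0 - y_i * (dot(w, x_i) - b), 0.0)

def soft_margin_objective(w, b, X, y, C):
    """(1/2) ||w||^2 + C * sum_i xi_i, with each slack at its tight value."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x_i, y_i) for x_i, y_i in zip(X, y))

def regularized_risk(w, b, X, y, lam):
    """(1/n) * sum_i max(1 - y_i (w^T x_i - b), 0) + lam * ||w||^2."""
    n = len(X)
    return sum(slack(w, b, x_i, y_i) for x_i, y_i in zip(X, y)) / n + lam * dot(w, w)

# The last point is a +1 example sitting on the wrong side of the margin.
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0], [0.5, 0.0]]
y = [1, 1, -1, -1, 1]
w, b, C = [1.0, 0.0], 1.0, 2.0
n = len(X)

svm_value = soft_margin_objective(w, b, X, y, C)
risk_value = regularized_risk(w, b, X, y, 1.0 / (2.0 * C * n))
```

Only the last point has a nonzero slack (1.5), so `svm_value` is 0.5 + 2 · 1.5 = 3.5, and dividing it by Cn = 10 should match `risk_value`.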
Through some manipulations, we can formulate SVMs and many other classification problems as an optimization over the objective
\[
\min_{\mathbf{w}, b} \; \frac{1}{n} \sum_{i=1}^{n} L(y_i, \mathbf{w}^\top \mathbf{x}_i - b) + \lambda \|\mathbf{w}\|^2
\]
over some specified loss function and a possible regularization term (sometimes we may set λ = 0). The simplest loss function that we can optimize is the 0-1 step loss:
\[
L_{\text{step}}(y, \mathbf{w}^\top \mathbf{x} - b) =
\begin{cases}
1 & y (\mathbf{w}^\top \mathbf{x} - b) < 0 \\
0 & y (\mathbf{w}^\top \mathbf{x} - b) \geq 0
\end{cases}
\]
The 0-1 loss is 0 if x is correctly classified and 1 otherwise. Minimizing $\frac{1}{n} \sum_{i=1}^{n} L(y_i, \mathbf{w}^\top \mathbf{x}_i - b)$ directly minimizes classification error on the training set. However, the 0-1 loss is difficult to optimize: it is neither convex nor differentiable (see Figure 5.13). Furthermore, if we exclude the regularization term, we do not penalize the classifier for being close to the training points, which leads to generalization issues.
Figure 5.13: Step (0-1) loss, hinge loss, and logistic loss. Logistic loss is convex and differentiable, hinge loss
is only convex, and step loss is neither.
Another loss function that we have seen is the logistic loss, which is used in logistic regression (written here for labels y ∈ {0, 1}):
\[
L_{\text{lr}}(y, \mathbf{w}^\top \mathbf{x} - b) = y \ln \frac{1}{s(\mathbf{w}^\top \mathbf{x} - b)} + (1 - y) \ln \frac{1}{1 - s(\mathbf{w}^\top \mathbf{x} - b)}
\]
The logistic loss is convex and differentiable, and is optimized using gradient descent methods. The logistic loss is the basis for logistic regression, and it works well without regularization:
\[
\min_{\mathbf{w}, b} \; \frac{1}{n} \sum_{i=1}^{n} L_{\text{lr}}(y_i, \mathbf{w}^\top \mathbf{x}_i - b)
\]
The hinge loss modifies the 0-1 loss to be convex. Points that are correctly classified with $y (\mathbf{w}^\top \mathbf{x} - b) \geq 1$ incur 0 loss, and all other points incur a linear penalty “ramp” that grows as $y (\mathbf{w}^\top \mathbf{x} - b)$ decreases. This leads us to the hinge loss, as illustrated in Figure 5.13:
\[
L_{\text{hinge}}(y, \mathbf{w}^\top \mathbf{x} - b) = \max(1 - y (\mathbf{w}^\top \mathbf{x} - b), 0)
\]
The ramp ensures that misclassified points that are close to the boundary are penalized less than misclassified points that are far from the boundary. The perceptron algorithm optimizes over the sum of hinge losses contributed from all of the training points:
\[
\min_{\mathbf{w}, b} \; \frac{1}{n} \sum_{i=1}^{n} L_{\text{hinge}}(y_i, \mathbf{w}^\top \mathbf{x}_i - b)
\]
The SVM formulation is an optimization over the same problem, with the addition of a regularization term:
\[
\min_{\mathbf{w}, b} \; \frac{1}{n} \sum_{i=1}^{n} L_{\text{hinge}}(y_i, \mathbf{w}^\top \mathbf{x}_i - b) + \lambda \|\mathbf{w}\|^2
\]
The regularization term allows for better generalization, in this case by penalizing choices of w for which the margin is small.
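The three losses discussed in this section can be compared side by side as functions of the label y and the raw score $f = \mathbf{w}^\top \mathbf{x} - b$. The sketch below is illustrative only: it takes labels in {−1, +1} for all three losses and, as an assumption made here for convenience, maps them to {0, 1} before applying the logistic loss formula given above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step_loss(y, f):
    """0-1 loss: 1 if y * f < 0, and 0 otherwise."""
    return 1.0 if y * f < 0 else 0.0

def hinge_loss(y, f):
    """Hinge loss: max(1 - y * f, 0)."""
    return max(1.0 - y * f, 0.0)

def logistic_loss(y, f):
    """Logistic loss y01 ln(1/s(f)) + (1 - y01) ln(1/(1 - s(f))), with y01 = (y + 1) / 2."""
    y01 = (y + 1) / 2
    return y01 * math.log(1.0 / sigmoid(f)) + (1 - y01) * math.log(1.0 / (1.0 - sigmoid(f)))

# Evaluate each loss for a positive example at several scores.
scores = [-1.0, 0.0, 0.5, 2.0]
table = [(f, step_loss(1, f), hinge_loss(1, f), logistic_loss(1, f)) for f in scores]
```

For a positive example, the hinge loss is still positive at a score of 0.5 even though the point is correctly classified, because the point lies inside the margin.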
5.6 Duality
As we have seen in our discussion of kernels, ridge regression can be viewed in two ways: (1) an optimization problem over the weights $\mathbf{w} \in \mathbb{R}^d$ which scales according to the dimensionality of the augmented feature space, and (2) an optimization problem over the weights $\alpha \in \mathbb{R}^n$ which scales according to the number of training points. These two viewpoints give rise to two equivalent solutions:
\[
\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} \quad \text{and} \quad \mathbf{w}^* = \mathbf{X}^\top (\mathbf{X} \mathbf{X}^\top + \lambda \mathbf{I})^{-1} \mathbf{y}
\]
The second (kernelized) expression is much more efficient to calculate when the number of training
points n is significantly smaller than the number of augmented features d. Recall that the derivation
for the kernelized expression relied on invoking the fundamental theorem of linear algebra and
solving for a set of dual variables. While this approach is certainly valid, it may not be applicable
for kernelizing all problems. Rather, a more principled approach is to apply Lagrangian duality
and solve the dual problem. In this section we will introduce duality for arbitrary optimization
problems, and then use duality to derive the kernelized versions for ridge regression and SVMs.
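The equivalence of the two ridge regression solutions can be verified numerically on a small example. The sketch below uses only the standard library, with a hand-written Gaussian elimination routine standing in for a linear algebra package; the data matrix X (n = 2 points, d = 3 features), the targets y, and the value of λ are arbitrary.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def add_ridge(A, lam):
    """Return A + lam * I for a square matrix A."""
    return [[A[i][j] + (lam if i == j else 0.0) for j in range(len(A))]
            for i in range(len(A))]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

X = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
y = [1.0, -2.0]
lam = 0.5
Xt = transpose(X)

# Primal form: w = (X^T X + lam I)^{-1} X^T y, which requires a d x d solve.
w_primal = solve(add_ridge(matmul(Xt, X), lam), matvec(Xt, y))
# Kernelized form: w = X^T (X X^T + lam I)^{-1} y, which requires an n x n solve.
w_dual = matvec(Xt, solve(add_ridge(matmul(X, Xt), lam), y))
```

The kernelized form only solves an n-by-n system, which is where the savings come from when n is much smaller than d.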
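For the purposes of our discussion, assume that $\mathbf{x} \in \mathbb{R}^d$. The components of an optimization problem are the objective function $f_0(\mathbf{x})$, the inequality constraints $f_i(\mathbf{x}) \leq 0$, and the equality constraints $h_j(\mathbf{x}) = 0$:
\[
\begin{aligned}
\min_{\mathbf{x}} \quad & f_0(\mathbf{x}) \\
\text{s.t.} \quad & f_i(\mathbf{x}) \leq 0 \quad i = 1, \dots, m \\
& h_j(\mathbf{x}) = 0 \quad j = 1, \dots, n
\end{aligned}
\]
Working with the constraints can be cumbersome and challenging to manipulate, and it would be ideal if we could somehow turn this constrained optimization problem into an unconstrained one. One idea is to re-express the optimization problem into
\[
\min_{\mathbf{x}} L(\mathbf{x})
\]
where
\[
L(\mathbf{x}) =
\begin{cases}
f_0(\mathbf{x}) & \text{if } f_i(\mathbf{x}) \leq 0, \; \forall i \in \{1, \dots, m\} \text{ and } h_j(\mathbf{x}) = 0, \; \forall j \in \{1, \dots, n\} \\
\infty & \text{otherwise}
\end{cases}
\tag{5.10}
\]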
Note that the unconstrained optimization problem above is equivalent to the original constrained
problem. Even though the unconstrained problem considers values that violate the constraints (and
therefore are not in the feasible set for the constrained optimization problem), it will effectively
ignore them because they are treated as ∞ in a minimization problem.
Even though we are now dealing with an unconstrained problem, it still is difficult to solve the optimization problem, because we still have to deal with all of the casework in the objective function L(x). In order to solve this issue, we have to introduce dual variables, specifically one set of dual variables for the equality constraints, and one set for the inequality constraints. If we only take into account the dual variables for the equality constraints, the optimization problem now becomes
\[
\min_{\mathbf{x}} \max_{\nu} L(\mathbf{x}, \nu)
\]
where
\[
L(\mathbf{x}, \nu) =
\begin{cases}
f_0(\mathbf{x}) + \sum_{j=1}^{n} \nu_j h_j(\mathbf{x}) & \text{if } f_i(\mathbf{x}) \leq 0, \; \forall i \in \{1, \dots, m\} \\
\infty & \text{otherwise}
\end{cases}
\tag{5.11}
\]
We are still working with an unconstrained optimization problem, except that now, we are optimizing over two sets of variables: the primal variables $\mathbf{x} \in \mathbb{R}^d$ and the dual variables $\nu \in \mathbb{R}^n$. Also note that the optimization problem has now become a nested one, with an inner optimization problem that maximizes over the dual variables, and an outer optimization problem that minimizes over the primal variables. Let’s examine why this optimization problem is equivalent to the original constrained optimization problem:
• Any x that violates the inequality constraints is still treated as ∞ by the outer minimization problem over x and therefore ignored
• For any x that violates the equality constraints (meaning that $\exists j$ s.t. $h_j(\mathbf{x}) \neq 0$), the inner maximization problem over ν can choose $\nu_j$ as ∞ if $h_j(\mathbf{x}) > 0$ (or $\nu_j$ as −∞ if $h_j(\mathbf{x}) < 0$) to cause the inner maximization to go to ∞, therefore being ignored by the outer minimization over x
• For any x that does not violate any of the equality or inequality constraints, the inner maximization problem over ν is simply equal to $f_0(\mathbf{x})$
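This solution comes at a cost: in an effort to remove the equality constraints, we had to add in dual variables, one for each equality constraint. With this in mind, let’s try to do the same for the inequality constraints. Adding in a dual variable $\lambda_i$ to represent each inequality constraint, we now have
\[
\begin{aligned}
\min_{\mathbf{x}} \max_{\lambda, \nu} \quad & L(\mathbf{x}, \lambda, \nu) = f_0(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i f_i(\mathbf{x}) + \sum_{j=1}^{n} \nu_j h_j(\mathbf{x}) \\
\text{s.t.} \quad & \lambda_i \geq 0 \quad i = 1, \dots, m
\end{aligned}
\tag{5.12}
\]
For convenience, we can place the constraints involving λ into the optimization variable.
\[
\min_{\mathbf{x}} \max_{\lambda \geq 0, \nu} \; L(\mathbf{x}, \lambda, \nu) = f_0(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i f_i(\mathbf{x}) + \sum_{j=1}^{n} \nu_j h_j(\mathbf{x})
\tag{5.13}
\]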
This optimization problem above is otherwise known as the primal (not to be confused with the primal variables), and its optimal value is indeed equivalent to that of the original constrained optimization problem.
\[
p^* = \min_{\mathbf{x}} \max_{\lambda \geq 0, \nu} L(\mathbf{x}, \lambda, \nu)
\tag{5.14}
\]
To see why, consider the following cases:
• For any x that violates the inequality constraints (meaning that $\exists i \in \{1, \dots, m\}$ s.t. $f_i(\mathbf{x}) > 0$), the inner maximization problem over λ can choose $\lambda_i$ as ∞ to cause the inner maximization to go to ∞, therefore being ignored by the outer minimization over x
• For any x that violates the equality constraints (meaning that $\exists j$ s.t. $h_j(\mathbf{x}) \neq 0$), the inner maximization problem over ν can choose $\nu_j$ as ∞ if $h_j(\mathbf{x}) > 0$ (or $\nu_j$ as −∞ if $h_j(\mathbf{x}) < 0$) to cause the inner maximization to go to ∞, therefore being ignored by the outer minimization over x
• For any x that does not violate any of the equality or inequality constraints, in the inner maximization problem over ν, the expression $\sum_{j=1}^{n} \nu_j h_j(\mathbf{x})$ evaluates to 0 no matter what the value of ν is, and in the inner maximization problem over λ, the expression $\sum_{i=1}^{m} \lambda_i f_i(\mathbf{x})$ can at maximum be 0, because $\lambda_i$ is constrained to be non-negative, and $f_i(\mathbf{x})$ is non-positive. Therefore, at best, the maximization problem sets $\lambda_i f_i(\mathbf{x}) = 0$, and
\[
\max_{\lambda \geq 0, \nu} L(\mathbf{x}, \lambda, \nu) = f_0(\mathbf{x})
\]
In its full form, the objective L(x, λ, ν) is called the Lagrangian, and it takes into account the unconstrained set of primal variables $\mathbf{x} \in \mathbb{R}^d$, the constrained set of dual variables $\lambda \in \mathbb{R}^m$ corresponding to the m inequality constraints, and the unconstrained set of dual variables $\nu \in \mathbb{R}^n$ corresponding to the n equality constraints. Note that our dual variables $\lambda_i$ are in fact constrained, so ultimately we were not able to turn the original optimization problem into an unconstrained one, but our constraints are much simpler than before.
The dual of this optimization problem is still over the same optimization objective, except that now we swap the order of the maximization of the dual variables and the minimization of the primal variables.
\[
d^* = \max_{\lambda \geq 0, \nu} \min_{\mathbf{x}} L(\mathbf{x}, \lambda, \nu) = \max_{\lambda \geq 0, \nu} g(\lambda, \nu)
\tag{5.15}
\]
where
\[
g(\lambda, \nu) = \min_{\mathbf{x}} L(\mathbf{x}, \lambda, \nu)
\tag{5.17}
\]
The dual is very useful to work with, because now the inner optimization problem over x is an unconstrained problem! Furthermore, the dual g(λ, ν) is always a concave function, regardless of the primal objective function or its constraints. This is because the dual is a pointwise minimum of concave functions, which itself is a concave function. Specifically, $g(\lambda, \nu) = \min_{\mathbf{x}} L(\mathbf{x}, \lambda, \nu)$ is a pointwise minimum of functions L(x, λ, ν) that are affine in the dual variables (and are therefore both concave and convex at the same time).
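Let’s examine the relationship between the primal and dual problem. It is always true that the solution to the primal problem is at least as large as the solution to the dual problem:
\[
p^* \geq d^*
\tag{5.18}
\]
This condition is known as weak duality. To see why it holds, note that for any x and any λ ≥ 0, ν, maximizing over the dual variables can only increase the Lagrangian, and minimizing over the primal variables can only decrease it:
\[
\max_{\tilde{\lambda} \geq 0, \tilde{\nu}} L(\mathbf{x}, \tilde{\lambda}, \tilde{\nu}) \geq L(\mathbf{x}, \lambda, \nu) \geq \min_{\tilde{\mathbf{x}}} L(\tilde{\mathbf{x}}, \lambda, \nu)
\]
More compactly,
\[
\forall \mathbf{x}, \lambda \geq 0, \nu: \quad \max_{\tilde{\lambda} \geq 0, \tilde{\nu}} L(\mathbf{x}, \tilde{\lambda}, \tilde{\nu}) \geq \min_{\tilde{\mathbf{x}}} L(\tilde{\mathbf{x}}, \lambda, \nu)
\]
Since this is true for all x, λ ≥ 0, ν, it is true in particular when we set
\[
\mathbf{x} = \arg\min_{\tilde{\mathbf{x}}} \max_{\tilde{\lambda} \geq 0, \tilde{\nu}} L(\tilde{\mathbf{x}}, \tilde{\lambda}, \tilde{\nu})
\]
and
\[
\lambda, \nu = \arg\max_{\tilde{\lambda} \geq 0, \tilde{\nu}} \min_{\tilde{\mathbf{x}}} L(\tilde{\mathbf{x}}, \tilde{\lambda}, \tilde{\nu})
\]
in which case the left-hand side equals $p^*$ and the right-hand side equals $d^*$.
The difference $p^* - d^*$ is known as the duality gap. In the case of strong duality, the duality gap is 0. That is, we can swap the order of the minimization and maximization and end up with the same optimal value:
\[
p^* = d^*
\tag{5.19}
\]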
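A one-dimensional toy problem makes the primal and dual values concrete. The sketch below considers minimizing $x^2$ subject to $1 - x \leq 0$, a problem chosen here purely for illustration. Its Lagrangian is $L(x, \lambda) = x^2 + \lambda(1 - x)$, and setting the derivative with respect to x to zero gives the inner minimizer $x = \lambda / 2$, so $g(\lambda) = \lambda - \lambda^2 / 4$. Both optimal values are approximated on a grid.

```python
def f0(x):
    """Objective of the toy problem: minimize x^2."""
    return x * x

def f1(x):
    """Inequality constraint f1(x) <= 0, i.e. x >= 1."""
    return 1.0 - x

def lagrangian(x, lam):
    return f0(x) + lam * f1(x)

def dual_function(lam):
    """g(lam) = min_x L(x, lam); the minimizer is x = lam / 2."""
    return lagrangian(lam / 2.0, lam)

xs = [i / 1000.0 for i in range(-3000, 3001)]      # grid over the primal variable
lams = [i / 1000.0 for i in range(0, 5001)]        # grid over lam >= 0

p_star = min(f0(x) for x in xs if f1(x) <= 0)      # primal optimum over feasible x
d_star = max(dual_function(lam) for lam in lams)   # dual optimum
```

On this convex problem, x = 2 strictly satisfies the constraint, so Slater’s condition holds and the two values should coincide (both equal 1, attained at x = 1 and λ = 2). Every value of the dual function is also a lower bound on `p_star`, which is weak duality.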
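There are several useful theorems detailing the existence of strong duality, such as Slater’s theorem, which states that if the primal problem is convex, and there exists an x that strictly satisfies the inequality constraints and satisfies the equality constraints, then strong duality holds. Given that strong duality holds, the Karush-Kuhn-Tucker (KKT) conditions can help us find the solution to the dual variables of the optimization problem. The KKT conditions are composed of:
1. Primal feasibility (inequalities): $f_i(\mathbf{x}) \leq 0$, $\forall i \in \{1, \dots, m\}$
2. Primal feasibility (equalities): $h_j(\mathbf{x}) = 0$, $\forall j \in \{1, \dots, n\}$
3. Dual feasibility: $\lambda_i \geq 0$, $\forall i \in \{1, \dots, m\}$
4. Complementary slackness: $\lambda_i f_i(\mathbf{x}) = 0$, $\forall i \in \{1, \dots, m\}$
5. Stationarity: $\nabla_{\mathbf{x}} f_0(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i \nabla_{\mathbf{x}} f_i(\mathbf{x}) + \sum_{j=1}^{n} \nu_j \nabla_{\mathbf{x}} h_j(\mathbf{x}) = 0$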
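Theorem 1. If strong duality holds and $\mathbf{x}^*$ and $\lambda^*, \nu^*$ are optimal solutions to the primal and dual problems, then they satisfy the KKT conditions.
Proof. KKT conditions 1, 2, 3 are trivially true, because the primal solution $\mathbf{x}^*$ must satisfy the primal constraints, and the dual solution $\lambda^*, \nu^*$ must satisfy the dual constraints. Now, let’s prove conditions 4 and 5. We know that since strong duality holds, we can say that
\[
\begin{aligned}
f_0(\mathbf{x}^*) = p^* = d^* = g(\lambda^*, \nu^*) &= \min_{\mathbf{x}} L(\mathbf{x}, \lambda^*, \nu^*) \\
&\leq L(\mathbf{x}^*, \lambda^*, \nu^*) \\
&= f_0(\mathbf{x}^*) + \sum_{i=1}^{m} \lambda_i^* f_i(\mathbf{x}^*) + \sum_{j=1}^{n} \nu_j^* h_j(\mathbf{x}^*) \\
&= f_0(\mathbf{x}^*) + \sum_{i=1}^{m} \lambda_i^* f_i(\mathbf{x}^*) \\
&\leq f_0(\mathbf{x}^*)
\end{aligned}
\]
We cancel the terms involving $h_j(\mathbf{x}^*)$ because we know that the primal solution must satisfy $h_j(\mathbf{x}^*) = 0$. Furthermore, we know that $\lambda_i^* f_i(\mathbf{x}^*) \leq 0$, because $\lambda_i^* \geq 0$ in order to satisfy the dual constraints, and $f_i(\mathbf{x}^*) \leq 0$ in order to satisfy the primal constraints. Since we established that $f_0(\mathbf{x}^*) = \min_{\mathbf{x}} L(\mathbf{x}, \lambda^*, \nu^*) \leq L(\mathbf{x}^*, \lambda^*, \nu^*) \leq f_0(\mathbf{x}^*)$, we know that all of the inequalities hold with equality and therefore $L(\mathbf{x}^*, \lambda^*, \nu^*) = \min_{\mathbf{x}} L(\mathbf{x}, \lambda^*, \nu^*)$. This implies KKT condition 5 (stationarity), that
\[
\nabla_{\mathbf{x}} f_0(\mathbf{x}^*) + \sum_{i=1}^{m} \lambda_i^* \nabla_{\mathbf{x}} f_i(\mathbf{x}^*) + \sum_{j=1}^{n} \nu_j^* \nabla_{\mathbf{x}} h_j(\mathbf{x}^*) = 0
\]
Finally, note that due to the equality $f_0(\mathbf{x}^*) + \sum_{i=1}^{m} \lambda_i^* f_i(\mathbf{x}^*) = f_0(\mathbf{x}^*)$, we know that $\sum_{i=1}^{m} \lambda_i^* f_i(\mathbf{x}^*) = 0$. This, combined with the fact that $\lambda_i^* f_i(\mathbf{x}^*) \leq 0$ for all i, establishes KKT condition 4 (complementary slackness):
\[
\lambda_i^* f_i(\mathbf{x}^*) = 0, \quad \forall i \in \{1, \dots, m\}
\]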
The theorem above establishes that in the presence of strong duality, if the solutions are optimal,
then they satisfy the KKT conditions. Let’s prove a statement that is almost (but not quite) the
converse, which will be much more helpful for solving optimization problems.
Theorem 2. If x̄ and λ̄, ν̄ satisfy the KKT conditions, and the primal problem is convex, then
they are the optimal solutions to the primal and dual problems with zero duality gap.
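Proof. If $\bar{\mathbf{x}}$ and $\bar{\lambda}, \bar{\nu}$ satisfy KKT conditions 1, 2, 3, we know that they are at least feasible for the primal and dual problem. From the KKT stationarity condition we know that
\[
\nabla_{\mathbf{x}} f_0(\bar{\mathbf{x}}) + \sum_{i=1}^{m} \bar{\lambda}_i \nabla_{\mathbf{x}} f_i(\bar{\mathbf{x}}) + \sum_{j=1}^{n} \bar{\nu}_j \nabla_{\mathbf{x}} h_j(\bar{\mathbf{x}}) = 0
\]
Since the primal problem is convex, we know that L(x, λ, ν) is convex in x, and if the gradient of $L(\mathbf{x}, \bar{\lambda}, \bar{\nu})$ at $\bar{\mathbf{x}}$ is 0, we know that
\[
\bar{\mathbf{x}} = \arg\min_{\mathbf{x}} L(\mathbf{x}, \bar{\lambda}, \bar{\nu})
\]
Therefore, we know that $\bar{\mathbf{x}}$ optimizes the inner optimization problem of the dual problem, and
\[
g(\bar{\lambda}, \bar{\nu}) = f_0(\bar{\mathbf{x}}) + \sum_{i=1}^{m} \bar{\lambda}_i f_i(\bar{\mathbf{x}}) + \sum_{j=1}^{n} \bar{\nu}_j h_j(\bar{\mathbf{x}})
\]
By the primal feasibility conditions for $h_j(\mathbf{x})$ and the complementary slackness condition, we know that
\[
g(\bar{\lambda}, \bar{\nu}) = f_0(\bar{\mathbf{x}})
\]
Now, all we have to do is to prove that $\bar{\mathbf{x}}$ and $\bar{\lambda}, \bar{\nu}$ are primal and dual optimal, respectively. Note that since weak duality always holds, we know that
\[
p^* \geq d^* \geq g(\bar{\lambda}, \bar{\nu}) = f_0(\bar{\mathbf{x}})
\]
Since $p^*$ is the minimum value for the primal problem and $\bar{\mathbf{x}}$ is feasible, we also have $p^* \leq f_0(\bar{\mathbf{x}})$, so we can go further by saying that $p^* \geq f_0(\bar{\mathbf{x}})$ holds with equality and
\[
p^* = f_0(\bar{\mathbf{x}}) = g(\bar{\lambda}, \bar{\nu}) \leq d^*
\]
Since it always holds that $p^* \geq d^*$, we conclude that
\[
p^* = d^* = f_0(\bar{\mathbf{x}}) = g(\bar{\lambda}, \bar{\nu})
\]
Therefore, we have proven that $\bar{\mathbf{x}}$ and $\bar{\lambda}, \bar{\nu}$ are primal and dual optimal respectively, with zero duality gap. We eventually arrived at the conclusion that strong duality does indeed hold.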
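Theorem 2 suggests a practical recipe for convex problems: find a point that satisfies the KKT conditions and conclude that it is optimal. The sketch below checks the conditions for a toy problem chosen for illustration, minimizing $x^2$ subject to $1 - x \leq 0$, whose Lagrangian is $L(x, \lambda) = x^2 + \lambda(1 - x)$. There are no equality constraints, and the tolerance `tol` is an arbitrary choice.

```python
def kkt_satisfied(x, lam, tol=1e-9):
    """Check the KKT conditions for: minimize x^2 subject to 1 - x <= 0."""
    primal_feasible = (1.0 - x) <= tol                       # f_1(x) <= 0
    dual_feasible = lam >= -tol                              # lam >= 0
    complementary_slackness = abs(lam * (1.0 - x)) <= tol    # lam * f_1(x) = 0
    stationarity = abs(2.0 * x - lam) <= tol                 # d/dx [x^2 + lam (1 - x)] = 0
    return (primal_feasible and dual_feasible
            and complementary_slackness and stationarity)

candidates = {
    "optimal": (1.0, 2.0),         # satisfies all of the conditions
    "interior": (1.5, 0.0),        # feasible, but not stationary
    "unconstrained": (0.0, 0.0),   # stationary, but violates the constraint
}
results = {name: kkt_satisfied(x, lam) for name, (x, lam) in candidates.items()}
```

Only the pair x = 1, λ = 2 passes all of the checks. Since the problem is convex, Theorem 2 then says that this pair is primal and dual optimal.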
140 CHAPTER 5. CLASSIFICATION
Let’s pause for a second to understand what we’ve found so far. Given an optimization problem,
its primal problem is an optimization problem over the primal variables, and its dual problem is
an optimization problem over the dual variables. If strong duality holds, then we can solve the
dual problem and arrive at the same optimal value. In order to solve the dual, we have to first
solve the unconstrained inner optimization problem over the primal variables and then solve the
constrained outer optimization problem over the dual variables. But how do we even know in the
first place that strong duality holds? This is where KKT comes into play. If the primal problem
is convex and the KKT conditions hold, we can solve for the dual variables easily and also verify
strong duality does indeed hold. We shall do just that, in our discussion of dual ridge regression
and dual SVMs.
Let’s derive kernel ridge regression again, using duality this time. Recall the unconstrained ridge
regression formulation:
$$\min_{w} \|Xw - y\|^2 + \lambda \|w\|^2$$
This formulation is not conducive to dualization, because it lacks constraints. We will add constraints
by introducing a dummy variable z = Xw − y that corresponds to equality constraints:
$$\min_{w, z} \|z\|^2 + \lambda \|w\|^2 \quad \text{s.t.} \quad z = Xw - y$$
Now we proceed to forming the dual problem. For the purposes of notation, note that we are using
α in place of ν, and there are no dual variables corresponding to λ because there are no inequality
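constraints. The Lagrangian is
$$L(w, z, \alpha) = \|z\|^2 + \lambda \|w\|^2 + \alpha^\top (Xw - y - z)$$
Since the dual function $g(\alpha) = \min_{w,z} L(w, z, \alpha)$ is a convex minimization problem over the variables w and z, we can simply set the
derivative to 0 w.r.t. w and z:
$$\nabla_w L = 2\lambda w + X^\top \alpha = 0 \implies w^* = -\frac{1}{2\lambda} X^\top \alpha$$
$$\nabla_z L = 2z - \alpha = 0 \implies z^* = \frac{1}{2} \alpha$$
Plugging these optimal values back into the optimization problem, we have that
$$\begin{aligned}
g(\alpha) &= L(w^*, z^*, \alpha) \\
&= \frac{1}{4} \alpha^\top \alpha + \frac{1}{4\lambda} \alpha^\top XX^\top \alpha - \frac{1}{2\lambda} \alpha^\top XX^\top \alpha - \alpha^\top y - \frac{1}{2} \alpha^\top \alpha \\
&= -\frac{1}{4} \alpha^\top \alpha - \frac{1}{4\lambda} \alpha^\top XX^\top \alpha - \alpha^\top y
\end{aligned} \tag{5.32}$$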
Now, the dual problem is
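$$\max_{\alpha} g(\alpha) = \max_{\alpha} \; -\frac{1}{4} \alpha^\top \alpha - \frac{1}{4\lambda} \alpha^\top XX^\top \alpha - \alpha^\top y$$
Note that this problem is a maximization over a concave problem (similar to a minimization over
a convex problem) and we can take the derivative w.r.t. α and set it to 0:
$$\nabla_\alpha g(\alpha) = -\frac{1}{2} \alpha - \frac{1}{2\lambda} XX^\top \alpha - y = 0 \implies \alpha^* = -2\lambda (XX^\top + \lambda I)^{-1} y$$
The optimal $w^*$ is therefore given by
$$w^* = -\frac{1}{2\lambda} X^\top \alpha^* = X^\top (XX^\top + \lambda I)^{-1} y$$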
Which exactly matches the expression we previously derived for kernel ridge regression! Note that
while this solution is dual optimal, it may not be optimal for the primal problem. In order to ensure
that it is primal optimal, we need to establish that strong duality holds. In this case the primal
problem is convex, so we simply need to ensure that the KKT conditions hold. Since we are not
dealing with any inequality conditions here, the only applicable conditions are primal feasibility for
the equalities and stationarity. Indeed the primal equality constraints are met, since
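$$\begin{aligned}
Xw^* - y - z^* &= -\frac{1}{2\lambda} XX^\top \alpha^* - y - \frac{1}{2} \alpha^* \\
&= -\frac{1}{2\lambda} (XX^\top + \lambda I) \alpha^* - y \\
&= (XX^\top + \lambda I)(XX^\top + \lambda I)^{-1} y - y \\
&= 0
\end{aligned}$$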
We already showed the stationarity conditions are met, when we were solving g(α) = minw,z L(w, z, α).
We conclude that w∗ is indeed the optimal solution to the primal problem.
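The derivation can be checked numerically. The sketch below uses hypothetical random data and an arbitrary regularization strength `lam`; it compares the dual route $w^* = X^\top (XX^\top + \lambda I)^{-1} y$ against the familiar primal closed form $(X^\top X + \lambda I)^{-1} X^\top y$, and confirms primal feasibility $z^* = Xw^* - y$ with $z^* = \alpha^*/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 8, 0.5            # hypothetical sizes and regularization strength
X = rng.normal(size=(n, d))      # rows are training points
y = rng.normal(size=n)

# Primal closed form: w = (X^T X + lam I)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: alpha* = -2 lam (X X^T + lam I)^{-1} y
alpha = -2 * lam * np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = -X.T @ alpha / (2 * lam)    # w* = -(1 / (2 lam)) X^T alpha*
z = alpha / 2                        # z* = alpha* / 2

print(np.allclose(w_primal, w_dual))     # both routes give the same w
print(np.allclose(X @ w_dual - y, z))    # equality constraint z = Xw - y holds
```

Note that the dual route solves an n-by-n system while the primal route solves a d-by-d system, which is the practical reason to prefer one over the other.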
Dual SVMs
Now, let’s investigate the dual of this problem. The primal problem in standard form is
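$$\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \\
\text{s.t.} \quad & (1 - \xi_i) - y_i (w^\top x_i - b) \le 0 \quad \forall i \\
& -\xi_i \le 0 \quad \forall i
\end{aligned} \tag{5.34}$$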
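Let’s identify the primal and dual variables for the SVM problem. We will have primal variables $w$, $b$, and $\xi_i$, and dual variables $\alpha_i$ (for the margin constraints) and $\beta_i$ (for the constraints $-\xi_i \le 0$).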
For the purposes of notation, note that we are using α and β in place of λ, and there are no dual
variables corresponding to ν because there are no equality constraints. The Lagrangian for the
SVM problem is
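$$\begin{aligned}
L(w, b, \xi, \alpha, \beta) &= \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \big((1 - \xi_i) - y_i (w^\top x_i - b)\big) + \sum_{i=1}^n \beta_i (-\xi_i) && (5.35) \\
&= \frac{1}{2} \|w\|^2 - \sum_{i=1}^n \alpha_i y_i (w^\top x_i - b) + \sum_{i=1}^n \alpha_i + \sum_{i=1}^n (C - \alpha_i - \beta_i) \xi_i && (5.36)
\end{aligned}$$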
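The dual function is $g(\alpha, \beta) = \min_{w, b, \xi} L(w, b, \xi, \alpha, \beta)$. Applying the KKT stationarity conditions to the Lagrangian gives
$$\nabla_w L = w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \implies w^* = \sum_{i=1}^n \alpha_i^* y_i x_i$$
$$\frac{\partial L}{\partial b} = \sum_{i=1}^n \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0$$
Verify that the other KKT conditions also hold, establishing strong duality. Using these observations, we can
eliminate some terms of the dual problem.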
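$$\begin{aligned}
L(w, b, \xi, \alpha^*, \beta^*) &= \frac{1}{2} \|w\|^2 - \sum_{i=1}^n \alpha_i^* y_i (w^\top x_i - b) + \sum_{i=1}^n \alpha_i^* + \sum_{i=1}^n (C - \alpha_i^* - \beta_i^*) \xi_i && (5.39) \\
&= \frac{1}{2} \|w\|^2 - \sum_{i=1}^n \alpha_i^* y_i (w^\top x_i) + b \underbrace{\sum_{i=1}^n \alpha_i^* y_i}_{=0} + \sum_{i=1}^n \alpha_i^* + \sum_{i=1}^n \underbrace{(C - \alpha_i^* - \beta_i^*)}_{=0} \xi_i && (5.40) \\
&= \frac{1}{2} \|w\|^2 - \sum_{i=1}^n \alpha_i^* y_i (w^\top x_i) + \sum_{i=1}^n \alpha_i^* && (5.41)
\end{aligned}$$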
Since the primal problem is convex, from the KKT conditions we have that the optimal primal
variables w∗ , b∗ , ξ ∗ minimize L(w, b, ξ, α∗ , β ∗ ):
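$$\begin{aligned}
g(\alpha^*, \beta^*) &= \min_{w, b, \xi} L(w, b, \xi, \alpha^*, \beta^*) && (5.42) \\
&= L(w^*, b^*, \xi^*, \alpha^*, \beta^*) && (5.43) \\
&= \frac{1}{2} \Big\| \sum_{i=1}^n \alpha_i^* y_i x_i \Big\|^2 - \sum_{i=1}^n \alpha_i^* y_i \Big( \Big( \sum_{j=1}^n \alpha_j^* y_j x_j \Big)^\top x_i \Big) + \sum_{i=1}^n \alpha_i^* && (5.44) \\
&= \frac{1}{2} \Big\| \sum_{i=1}^n \alpha_i^* y_i x_i \Big\|^2 - \sum_{i=1}^n \Big( \alpha_i^* y_i x_i^\top \Big( \sum_{j=1}^n \alpha_j^* y_j x_j \Big) \Big) + \sum_{i=1}^n \alpha_i^* && (5.45) \\
&= \alpha^{*\top} \mathbf{1} - \frac{1}{2} \alpha^{*\top} Q \alpha^* && (5.46)
\end{aligned}$$
where $Q_{ij} = y_i (x_i^\top x_j) y_j$ (and $Q = (\operatorname{diag} y) XX^\top (\operatorname{diag} y)$).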
Now, we can write the final form of the dual, which is only in terms of α and X and y (Note that
we have eliminated all references to β):
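$$\begin{aligned}
\max_{\alpha} \quad & \alpha^\top \mathbf{1} - \frac{1}{2} \alpha^\top Q \alpha \\
\text{s.t.} \quad & \sum_{i=1}^n \alpha_i y_i = 0 \\
& 0 \le \alpha_i \le C \quad i = 1, \dots, n
\end{aligned} \tag{5.47}$$
Remember to account for the constraints $\sum_{i=1}^n \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ that arise from the
stationarity conditions. After all of this effort, we have managed to turn a minimization problem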
over the primal variables into a maximization problem over the dual variables. One might ask, why
go through the effort to formulate and solve the dual problem instead? For one, the dual is an
optimization problem over the number of training points n rather than the number of augmented
features d, making it particularly attractive when n ≪ d. Second, it incorporates the term $XX^\top$,
which is simply the Gram matrix K of kernel evaluations among all pairs of training points. We
can apply the kernel trick to form this Gram matrix, effectively relying on the dimensionality
of the raw feature space rather than the augmented feature space. These are more or less the exact
same justifications for kernel ridge regression.
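To make the dual objective concrete, the following sketch builds $Q = (\operatorname{diag} y) K (\operatorname{diag} y)$ from a Gram matrix and evaluates $\alpha^\top \mathbf{1} - \frac{1}{2} \alpha^\top Q \alpha$ together with the dual constraints. The function names and the two-point toy dataset are illustrative assumptions; the code only evaluates the dual at a given α and does not solve the quadratic program.

```python
import numpy as np

def svm_dual_objective(alpha, X, y, kernel=None):
    """Value of alpha^T 1 - 0.5 * alpha^T Q alpha with Q = diag(y) K diag(y)."""
    K = X @ X.T if kernel is None else kernel(X, X)   # Gram matrix (linear by default)
    Q = (y[:, None] * K) * y[None, :]
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def is_dual_feasible(alpha, y, C, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C."""
    return bool(abs(alpha @ y) <= tol and np.all(alpha >= -tol) and np.all(alpha <= C + tol))

# Toy problem: one point per class, separated along the first axis.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

print(is_dual_feasible(alpha, y, C=10.0))
print(svm_dual_objective(alpha, X, y))
```

For this toy problem the primal optimum is w = (1, 0), b = 0 with objective ½‖w‖² = 0.5, which the dual value at α = (0.5, 0.5) matches.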
Geometric intuition
We’ve formulated the dual SVM problem and used the KKT conditions to formulate an equivalent
optimization problem, but what do these dual values αi even mean? That’s a good question!
We know that given optimal primal and dual values, the following KKT conditions are enforced:
• Stationarity
C − αi∗ − βi∗ = 0
• Complementary slackness
βi∗ · ξi∗ = 0
Here are some noteworthy relationships between αi and the properties of the SVMs:
• Case 1: $\alpha_i^* = 0$. In this case, we know $\beta_i^* = C$, which is nonzero, and therefore $\xi_i^* = 0$. That
is, if for point i we have that $\alpha_i^* = 0$ by the dual problem, then we know that there is no slack
given to this point. Looking at the other complementary slackness condition, this makes sense
because if $\alpha_i^* = 0$, then $y_i (w^{*\top} x_i - b^*) - (1 - \xi_i^*)$ may be any value, and if we’re minimizing
the sum of our $\xi_i$’s, we should have $\xi_i^* = 0$. So, point i lies on or outside the margin.
• Case 2: $\alpha_i^*$ is nonzero. If this is the case, then we know $\beta_i^* = C - \alpha_i^* \ge 0$.
– Case 2.1: $\alpha_i^* = C$. If this is the case, then we know $\beta_i^* = 0$, and therefore $\xi_i^*$ may be
exactly 0 or nonzero. So, point i lies on or violates the margin.
– Case 2.2: $0 < \alpha_i^* < C$. In this case, $\beta_i^*$ is nonzero and $\xi_i^* = 0$. But this is different
from Case 1 because with $\alpha_i^*$ nonzero, we can divide by $\alpha_i^*$ in the complementary slackness
condition and arrive at the fact that $1 - y_i (w^{*\top} x_i - b^*) = 0 \implies y_i (w^{*\top} x_i - b^*) = 1$,
which means $x_i$ lies exactly on the margin determined by $w^*$ and $b^*$. So, point i lies on
the margin.
Using this information, let’s reconstruct the optimal primal values $w^*$, $b^*$, $\xi_i^*$ from the optimal dual
values $\alpha^*$:
$$\begin{aligned}
w^* &= \sum_{i=1}^n \alpha_i^* y_i x_i \\
b^* &= w^{*\top} x_i - y_i \quad \text{if } 0 < \alpha_i^* < C \\
\xi_i^* &= \begin{cases} 1 - y_i (w^{*\top} x_i - b^*) & \text{if } \alpha_i^* = C, \\ 0 & \text{otherwise} \end{cases}
\end{aligned} \tag{5.48}$$
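The reconstruction of the primal variables from the dual weights can be sketched as follows. The function name, the tolerance `tol`, and the choice to average $b^*$ over all points with $0 < \alpha_i^* < C$ (any single such point gives the same value in exact arithmetic) are assumptions made for illustration.

```python
import numpy as np

def primal_from_dual(alpha, X, y, C, tol=1e-8):
    """Recover (w, b, xi) from dual weights, using the convention y_i (w^T x_i - b)."""
    w = (alpha * y) @ X                                # w* = sum_i alpha_i y_i x_i
    on_margin = (alpha > tol) & (alpha < C - tol)      # points with 0 < alpha_i < C
    if not on_margin.any():
        raise ValueError("no point with 0 < alpha_i < C; b cannot be read off this way")
    # Each such point satisfies b* = w^T x_i - y_i; average for numerical stability.
    b = np.mean(X[on_margin] @ w - y[on_margin])
    at_bound = alpha >= C - tol                        # points with alpha_i = C
    xi = np.where(at_bound, 1 - y * (X @ w - b), 0.0)
    return w, b, xi

# Toy problem whose dual solution is alpha = (0.5, 0.5).
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
w, b, xi = primal_from_dual(np.array([0.5, 0.5]), X, y, C=10.0)
print(w, b, xi)
```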
The principal takeaway is that the optimal w∗ is a linear combination of the training points for
which the corresponding dual weight αi is non-zero. Such points are called support vectors,
because they determine the optimal w∗ . There is a special relationship between the values of αi
and the position of xi relative to the margin. All training points that violate the decision boundary
have αi > 0 and are thus support vectors, while all training points that strictly do not violate the
decision boundary (meaning that they do not lie on the boundary) have αi = 0 and are not support
vectors. For training points which lie exactly on the boundary, some may have αi > 0 and some
may have αi = 0; only the points that are critical to determining the decision boundary have αi > 0
and are thus support vectors. Intuitively, there are very few support vectors compared to the total
number of training points, meaning that the dual vector α∗ is sparse. This is advantageous when
predicting class for a test point:
$$w^{*\top} \phi(x) - b^* = \sum_{i=1}^n \alpha_i^* y_i \phi(x_i)^\top \phi(x) - b^* = \sum_{i=1}^n \alpha_i^* y_i k(x_i, x) - b^*$$
We only have to make m ≪ n kernel evaluations to predict a test point, where m is the number
of support vectors. It should now be clear why the dual SVM problem is so useful: it allows us to
use the kernel trick to eliminate dependence on the dimensionality of the augmented feature space,
while also allowing us to discard most training points because they have dual weight 0.
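A minimal sketch of prediction with the dual weights is shown below. It follows the sign convention of the primal constraints, $y_i (w^\top x_i - b)$, so the decision value is $\sum_i \alpha_i^* y_i k(x_i, x) - b^*$. The function name, the linear kernel, and the toy data are hypothetical; the point is that only training points with nonzero dual weight require a kernel evaluation.

```python
import numpy as np

def decision_value(x, X, y, alpha, b, kernel, tol=1e-8):
    """sum_i alpha_i y_i k(x_i, x) - b, evaluated over support vectors only."""
    support = np.flatnonzero(alpha > tol)          # indices with nonzero dual weight
    value = sum(alpha[i] * y[i] * kernel(X[i], x) for i in support) - b
    return value, len(support)                     # also report kernel evaluations used

def linear(a, c):
    return float(a @ c)                            # hypothetical kernel choice

X = np.array([[1.0, 0.0], [-1.0, 0.0], [5.0, 5.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0])                  # third point is not a support vector
value, n_evals = decision_value(np.array([2.0, 3.0]), X, y, alpha, b=0.0, kernel=linear)
print(np.sign(value), n_evals)
```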
In classification, it is reasonable to conjecture that data points that are sufficiently close to one
another should be of the same class. For example, in fruit classification, perturbing a few pixels in
an image of a banana should still result in something that looks like a banana. The k-nearest-
neighbors (k-NN) classifier is based on this observation. Assuming that there is no preprocessing
of the training data, the training time for k-NN is effectively O(1). To train this classifier, we simply
store our training data for future reference.1 For this reason, k-NN is sometimes referred to as “lazy
learning.” The major work of k-NN is done at testing time: to predict on a test data point z,
we compute the k closest training data points to z, where “closeness” can be quantified in some
distance function such as Euclidean distance — these are the k nearest neighbors to z. We then
find the most common class y among these k neighbors and classify z as y (that is, we perform a
majority vote). For binary classification, k is usually chosen to be odd so we can break ties cleanly.
Note that k-NN can also be applied to regression tasks — in that case k-NN would return the
average label of the k nearest points.
1 Sometimes we store the data in a specialized structure called a k-d tree. This data structure is out of scope for this course,
but it usually allows for faster (average-case O(log n)) nearest neighbors queries.
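The prediction procedure described above can be sketched in a few lines. This is a brute-force version using Euclidean distance with no specialized data structure; the function names and the toy data are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, z, k=3):
    """Classify z by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - z, axis=1)    # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

def knn_regress(X_train, y_train, z, k=3):
    """Regression variant: average the labels of the k nearest training points."""
    dists = np.linalg.norm(X_train - z, axis=1)
    return y_train[np.argsort(dists)[:k]].mean()

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))
```

"Training" here is just keeping `X_train` and `y_train` around; all the work happens inside the prediction call.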
Figure 5.14: Voronoi diagram for k-NN. Points in a region shaded a certain color will be classified as that
color. Test points in a region shaded with a combination of 2 colors have those colors as their 2 nearest
neighbors.
Choosing k
Nearest neighbors can produce very complex decision functions, and its behavior is highly dependent
on the choice of k.
(a) k = 1 (b) k = 15
Figure 5.15: Voronoi diagram for k = 1 vs. k = 15. Figure from Introduction to Statistical Learning.
Choosing k = 1, we achieve an optimal training error of 0 because each training point will classify
as itself, thus achieving 100% accuracy on itself. However, k = 1 overfits to the training data, and
is a terrible choice in the context of the bias-variance tradeoff. Increasing k from 1 leads to an increase in
training error, but typically to a decrease in testing error at first, and so achieves better generalization. At some point, if
k becomes too large, the algorithm will underfit the training data, and suffer from huge bias. In
general, in order to select k we use cross-validation.
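One way to carry out this selection is leave-one-out cross-validation, sketched below on hypothetical synthetic data (two overlapping Gaussian blobs). The candidate values of k, the data, and the function name are assumptions; other cross-validation schemes would work the same way.

```python
import numpy as np

def loo_error(X, y, k):
    """Leave-one-out error of k-NN (majority vote, non-negative integer labels)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    np.fill_diagonal(D, np.inf)              # a held-out point may not vote for itself
    nearest = np.argsort(D, axis=1)[:, :k]
    preds = np.array([np.bincount(y[idx]).argmax() for idx in nearest])
    return float(np.mean(preds != y))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)

errors = {k: loo_error(X, y, k) for k in (1, 3, 5, 7, 9, 15)}
best_k = min(errors, key=errors.get)         # k with the lowest validation error
print(errors, best_k)
```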
Figure 5.16: Training and Testing error as a function of k. Figure from Introduction to Statistical Learning.
Bias-Variance Analysis
Let’s justify this reasoning formally for k-NN applied to regression tasks. Suppose we are given
a training dataset $D = \{(x_i, y_i)\}_{i=1}^n$, where the labels $y_i$ are real valued scalars. We model our
hypothesis h(z) as
$$h(z) = \frac{1}{k} \sum_{i=1}^n N(x_i, z, k)$$
where $N(x_i, z, k)$ equals $y_i$ if $x_i$ is among the k closest points to z, and 0 otherwise.
Suppose also we assume our labels $y_i = f(x_i) + \epsilon$, where $\epsilon$ is the noise that comes from $N(0, \sigma^2)$
and f is the true function. Without loss of generality, let $x_1, \dots, x_k$ be the k closest points. Let’s
first derive the bias² of our model for the given dataset D.
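$$\begin{aligned}
\big( E[h(z)] - f(z) \big)^2 &= \Big( E\Big[ \frac{1}{k} \sum_{i=1}^n N(x_i, z, k) \Big] - f(z) \Big)^2 = \Big( E\Big[ \frac{1}{k} \sum_{i=1}^k y_i \Big] - f(z) \Big)^2 \\
&= \Big( \frac{1}{k} \sum_{i=1}^k E[y_i] - f(z) \Big)^2 = \Big( \frac{1}{k} \sum_{i=1}^k E[f(x_i) + \epsilon] - f(z) \Big)^2 \\
&= \Big( \frac{1}{k} \sum_{i=1}^k f(x_i) - f(z) \Big)^2
\end{aligned}$$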
When $k \to \infty$, $\frac{1}{k} \sum_{i=1}^k f(x_i)$ goes to the average label over the training points. When k = 1, the bias
is simply $f(x_1) - f(z)$. Assuming $x_1$ is close enough to z, the bias would likely be small when
k = 1, since $x_1$ is likely to share a similar label. Meanwhile, when $k \to \infty$, the prediction no longer depends
on which training points are near z, which will likely make the bias higher.
Now, let’s derive the variance of our model.
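$$\begin{aligned}
\operatorname{Var}[h(z)] &= \operatorname{Var}\Big[ \frac{1}{k} \sum_{i=1}^k y_i \Big] = \frac{1}{k^2} \sum_{i=1}^k \operatorname{Var}[f(x_i) + \epsilon] \\
&= \frac{1}{k^2} \sum_{i=1}^k \operatorname{Var}[\epsilon] \\
&= \frac{1}{k^2} \sum_{i=1}^k \sigma^2 = \frac{1}{k^2} k \sigma^2 = \frac{\sigma^2}{k}
\end{aligned}$$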
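The σ²/k variance can be checked with a small simulation. In the sketch below, the true function, the noise level, the test point, and the sample sizes are all hypothetical choices; the training inputs are held fixed and only the label noise is redrawn, matching the setting of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 0.5, 200, 20000       # hypothetical noise level and sizes
x = rng.uniform(0, 1, n)                 # fixed training inputs
f = np.sin(2 * np.pi * x)                # hypothetical true function values f(x_i)
z = 0.3                                  # test point

empirical, theoretical = {}, {}
for k in (1, 5, 25):
    nearest = np.argsort(np.abs(x - z))[:k]            # indices of the k nearest x_i
    noise = rng.normal(0, sigma, size=(trials, k))     # fresh label noise per trial
    h = (f[nearest] + noise).mean(axis=1)              # h(z) for each noisy dataset
    empirical[k] = h.var()
    theoretical[k] = sigma**2 / k
    print(k, empirical[k], theoretical[k])
```

The empirical variance shrinks roughly in proportion to 1/k, while the bias term grows as more distant points enter the average.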
Properties
Computational complexity: We require O(n) space to store a training set of size n. There is no
runtime cost during training if we do not use specialized data structures to store the data. However,
predictions take O(n) time, which is costly. There has been research into approximate nearest
neighbors (ANN) procedures that quickly find an approximation for the nearest neighbor - some
common ANN methods are Locality-Sensitive Hashing and algorithms that perform dimensionality
reduction via randomized (Johnson-Lindenstrauss) distance-preserving projections.2
Flexibility: When k > 1, k-NN can be modified to output predicted probabilities P (Y |X) by
defining P (Y |X) as the proportion of nearest neighbors to X in the training set that have class Y .
k-NN can also be adapted for regression — instead of taking the majority vote, take the average
of the y values for the nearest neighbors. k-NN can learn very complicated, non-linear decision
boundaries.
Non-parametric: k-NN is a non-parametric method, which means that the number of parameters
in the model grows with n, the number of training points. This is as opposed to parametric
methods, for which the number of parameters is independent of n. Some examples of parametric
models include linear regression, LDA, and neural networks.
Behavior in high dimensions: k-NN does not behave well in high dimensions. As the dimension
increases, data points drift farther apart, so even the nearest neighbor to a point will tend to be
very far away.
Theoretical properties: 1-NN has impressive theoretical guarantees for such a simple method. Cover
and Hart, 1967 prove that as the number of training samples n approaches infinity, the expected
prediction error for 1-NN is upper bounded by $2\epsilon^*$, where $\epsilon^*$ is the Bayes (optimal) error. Fix and
Hodges, 1951 prove that as n and k approach infinity and if $k/n \to 0$, then the k nearest neighbor
error approaches the Bayes error.
Curse of Dimensionality
To understand why k-NN does not perform well in high-dimensional space, we first need to understand
the properties of metric spaces. In high-dimensional spaces, much of our low-dimensional
intuition breaks down. Here is one classical example. Consider a ball in $\mathbb{R}^d$ centered at the origin
with radius r, and suppose we have another ball of radius $r - \epsilon$ centered at the origin. In low
dimensions, we can visually see that much of the volume of the outer ball is also in the inner ball.
In general, the volume of the outer ball is proportional to $r^d$, while the volume of the inner ball is
proportional to $(r - \epsilon)^d$. Thus the ratio of the volume of the inner ball to that of the outer ball is
$$\frac{(r - \epsilon)^d}{r^d} = \Big( 1 - \frac{\epsilon}{r} \Big)^d \approx e^{-\epsilon d / r} \xrightarrow[d \to \infty]{} 0$$
Hence as d gets large, most of the volume of the outer ball is concentrated in the annular region
$\{x : r - \epsilon < \|x\| < r\}$ instead of the inner ball.
2 ANN methods are beyond the scope of this course, but are useful in real applications.
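A few lines of arithmetic show how quickly the inner ball's share of the volume vanishes. The radius r = 1 and shell width ε = 0.01 below are arbitrary illustrative values.

```python
import math

def inner_volume_fraction(d, r=1.0, eps=0.01):
    """Fraction of the radius-r ball's volume lying inside the radius-(r - eps) ball."""
    return (1 - eps / r) ** d

for d in (1, 10, 100, 1000, 10000):
    exact = inner_volume_fraction(d)
    approx = math.exp(-0.01 * d)          # the e^{-eps d / r} approximation
    print(d, exact, approx)
```

Even with a shell only 1% of the radius thick, nearly all of the volume sits in the shell once d reaches the thousands.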
High dimensions also make Gaussian distributions behave counter-intuitively. Suppose $X \sim N(0, \sigma^2 I)$.
If $X_i$ are the components of X and R is the distance from X to the origin, then $R^2 = \sum_{i=1}^d X_i^2$. We
have $E(R^2) = d\sigma^2$, so in expectation a random Gaussian will actually be reasonably far from the
origin. If σ = 1, then $R^2$ is distributed chi-squared with d degrees of freedom. One can show that
in high dimensions, with high probability $1 - O(e^{-d^{\epsilon}})$, this multivariate Gaussian will lie within
the annular region $\{X : |R^2 - E(R^2)| \le d^{1/2 + \epsilon}\}$ where $E(R^2) = d\sigma^2$ (one possible approach is to
note that as $d \to \infty$, the chi-squared approaches a Gaussian by the CLT, and use a Chernoff bound
to show exponential decay). This phenomenon is known as concentration of measure. Without
resorting to more complicated inequalities, we can show a simple, weaker result:
Theorem: If $X_i \sim N(0, \sigma^2)$, i = 1, ..., d are independent and $R^2 = \sum_{i=1}^d X_i^2$, then for every $\epsilon > 0$,
the following holds:
$$\lim_{d \to \infty} P\big( |R^2 - E(R^2)| \ge d^{\frac{1}{2} + \epsilon} \big) = 0$$
Thus in the limit, the squared radius is concentrated about its mean.
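Proof. From the formula for the variance of a chi-squared distribution, we see that $\operatorname{Var}(R^2) = 2d\sigma^4$.
Applying a Chebyshev bound yields
$$P\big( |R^2 - E(R^2)| \ge d^{\frac{1}{2} + \epsilon} \big) \le \frac{2d\sigma^4}{d^{1 + 2\epsilon}} \xrightarrow[d \to \infty]{} 0$$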
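The theorem can be illustrated empirically. The sketch below samples Gaussian vectors for a few dimensions, measures how often $R^2$ falls outside the band of half-width $d^{1/2+\epsilon}$ around $d\sigma^2$, and compares this with the Chebyshev bound $2d\sigma^4 / d^{1+2\epsilon}$. The values of σ, ε, the dimensions, and the sample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, eps, samples = 1.0, 0.25, 2000    # hypothetical settings

results = {}
for d in (10, 100, 1000):
    X = rng.normal(0, sigma, size=(samples, d))
    R2 = (X ** 2).sum(axis=1)                            # squared distance to the origin
    threshold = d ** (0.5 + eps)
    frac_outside = float(np.mean(np.abs(R2 - d * sigma**2) >= threshold))
    chebyshev = 2 * d * sigma**4 / d ** (1 + 2 * eps)    # bound from the proof
    results[d] = (frac_outside, chebyshev)
    print(d, frac_outside, chebyshev)
```

The empirical fraction sits well below the Chebyshev bound, consistent with the remark that the bound is weaker than what the exponential tail estimates give.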
Thus a random Gaussian will lie within a thin annular region away from the origin in high dimen-
sions with high probability, even though the mode of the Gaussian bell curve is at the origin. This
illustrates the phenomenon in high dimensions where random data is spread very far apart. The
k-NN classifier was conceived on the principle that nearby points should be of the same class -
however, in high dimensions, even the nearest neighbors that we have to a random test point will
tend to be far away, so this principle is no longer useful.
Improving k-NN
There are two main ways to improve k-NN and overcome the shortcomings we have discussed.
One example of reducing the dimensionality in image space is to lower the resolution of the image
— while this is throwing some of the original pixel features away, we may still be able to get the
same or better performance with a nearest neighbors method.
We can also modify the distance function. For example, we have a whole family of Minkowski
distances that are induced by the Lp norms:
$$D_p(x, z) = \Big( \sum_{i=1}^d |x_i - z_i|^p \Big)^{\frac{1}{p}}$$
Without preprocessing the data, 1-NN with the L3 distance outperforms 1-NN with L2 on MNIST.
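The Minkowski family is straightforward to implement; the sketch below assumes p ≥ 1 and NumPy array inputs, and the function name is illustrative.

```python
import numpy as np

def minkowski(x, z, p=2):
    """Minkowski distance D_p(x, z) = (sum_i |x_i - z_i|^p)^(1/p), for p >= 1."""
    return float(np.sum(np.abs(x - z) ** p) ** (1.0 / p))

x, z = np.array([0.0, 0.0]), np.array([3.0, 4.0])
for p in (1, 2, 3):
    print(p, minkowski(x, z, p))
```

Swapping this function into a nearest-neighbor search changes which points count as "close" without touching the rest of the algorithm.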
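We can also use kernels to compute distances in a different feature space. For example, if k is a
kernel with associated feature map Φ and we want to compute the Euclidean distance from Φ(x)
to Φ(z), then we have
$$\|\Phi(x) - \Phi(z)\|^2 = \Phi(x)^\top \Phi(x) - 2 \Phi(x)^\top \Phi(z) + \Phi(z)^\top \Phi(z) = k(x, x) - 2 k(x, z) + k(z, z)$$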
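Expanding the squared norm gives $\|\Phi(x) - \Phi(z)\|^2 = k(x, x) - 2k(x, z) + k(z, z)$, so the distance needs only kernel evaluations. The sketch below checks this against an explicit feature map for the (hypothetically chosen) quadratic kernel $k(x, z) = (x^\top z)^2$ in two dimensions.

```python
import numpy as np

def kernel_distance(x, z, k):
    """Euclidean distance between Phi(x) and Phi(z), using only kernel evaluations."""
    sq = k(x, x) - 2 * k(x, z) + k(z, z)
    return float(np.sqrt(max(sq, 0.0)))      # clip tiny negatives caused by rounding

def quad_kernel(x, z):
    return float(x @ z) ** 2                 # k(x, z) = (x^T z)^2

def quad_features(x):
    # Explicit feature map for the quadratic kernel in 2 dimensions.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
via_kernel = kernel_distance(x, z, quad_kernel)
via_features = float(np.linalg.norm(quad_features(x) - quad_features(z)))
print(via_kernel, via_features)
```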
Clustering
In the problem of clustering, we are given a dataset comprised only of input features without
labels. We wish to assign to each data point a discrete label indicating which “cluster” it belongs
to, in such a way that the resulting cluster assignment “fits” the data. We are given flexibility to
choose our notion of goodness of fit for cluster assignments.
Figure 6.2: A nonspherical clustering assignment. Possible outliers are shown in black.1
1 https://www.imperva.com/blog/2017/07/clustering-and-dimensionality-reduction-understanding-the-magic-behind-machine-learning/
In our discussion of LDA and QDA, we assumed that we had data which was conditionally Gaussian
given a discrete class label. When we observed a data point, we observed both its input features
and its class label. These are supervised learning methods, which deal with prediction of observed
outputs from observed inputs. Clustering is an example of unsupervised learning, where we are
not given labels and desire to infer something about the underlying structure of the data. Another
example of unsupervised learning is dimensionality reduction, where we desire to learn important
features from the data.
Clustering is most often used in exploratory data visualization, as it allows us to see the different
groups of similar data points within the data. Combined with domain knowledge, these clusters
can have a physical interpretation - for example, different clusters can represent different species of
plant in the biological setting, or types of consumers in a business setting. If desired, these clusters
can be used as pre-processing to make the data more compact. Clustering is also used for outlier
detection, as in Figure 6.2: data points that do not seem to belong in their assigned cluster may
be flagged as outliers.
In order to create an algorithm for clustering, we first must determine what makes a good clustering
assignment. Here are some possible desired properties:
1. High intra-cluster similarity - points within a given cluster are very similar.
2. Low inter-cluster similarity - points in different clusters are not very similar.
Of course, this depends on our notion of similarity. For now, we will say that points in Rd are
similar if their L2 distance is small, and dissimilar otherwise. A generalization of this notion is
provided in the appendix.
Let X denote the set of N data points xi ∈ Rd . A cluster assignment is a partition C1 , ..., CK ⊆ X
such that the sets Ck are disjoint and X = C1 ∪ · · · ∪ CK . A data point x ∈ X is said to belong to
cluster k if it is in Ck .
One approach to the clustering problem is to represent each cluster Ck by a single point ck ∈ Rd
in the input space - this is called the centroid approach. K-means is an example of centroid-
based clustering where we choose centroids and a cluster assignment such that the total distance
of each point to its assigned centroid is minimized. In this regard, K-means optimizes for high
intra-cluster similarity, but the clusters do not necessarily need to be far apart, so we may also
have high inter-cluster similarity.
Formally, K-means solves the following problem:
$$\arg\min_{\{C_k\}_{k=1}^K, \{c_k\}_{k=1}^K \,:\, X = C_1 \cup \cdots \cup C_K} \; \sum_{k=1}^K \sum_{x \in C_k} \|x - c_k\|^2$$
It has been shown that this problem is NP hard, so solving it exactly is intractable. However, we
can come up with a simple algorithm to compute a candidate solution. If we knew the cluster
assignment C1 , ..., CK , then we would only need to determine the centroid locations. Since the
choice of centroid location $c_i$ does not affect the distances of points in $C_j$ to $c_j$ for $i \ne j$, we
can consider each cluster separately and choose the centroid that minimizes the sum of squared
distances to points in that cluster. The centroid we compute, $\hat{c}_k$, is
$$\hat{c}_k = \arg\min_{c_k} \sum_{x \in C_k} \|x - c_k\|^2 = \frac{1}{|C_k|} \sum_{x \in C_k} x$$
that is, the mean of the points in the cluster.
6.1. K-MEANS CLUSTERING 153
Similarly, if we knew the centroids ck , in order to choose the cluster assignment C1 , ..., CK that
minimizes the sum of squared distances to the centroids, we simply assign each data point x to the
cluster represented by its closest centroid, that is, we assign x to
$$\arg\min_{k} \|x - c_k\|^2$$
Now we can perform alternating minimization - on each iteration of our algorithm, we update the
clusters using the current centroids, and then update the centroids using the new clusters. This
algorithm is sometimes called Lloyd’s Algorithm.
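The alternating minimization can be sketched as below. The random initialization from data points, the stopping rule, and the handling of empty clusters (keep the old centroid) are implementation choices assumed here; the centroid update uses the fact that the minimizer of the sum of squared distances within a cluster is the cluster mean.

```python
import numpy as np

def assign(X, centroids):
    """Index of the closest centroid for every data point."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        labels = assign(X, centroids)                 # update clusters given centroids
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:                      # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):     # nothing moved: converged
            break
        centroids = new_centroids
    labels = assign(X, centroids)
    objective = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, objective

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids, objective = kmeans(X, K=2)
print(labels, objective)
```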
This algorithm will always converge to some value. To show this, note the following facts:
1. There are only finitely many (say, M ) possible partition/centroid pairs that can be produced
by the algorithm. This is true since each partition chosen at some iteration in the algorithm
completely determines the subsequent centroid assignment in that iteration.
2. Each update of the cluster assignment and centroids does not increase the value of the ob-
jective. This is true since each of these updates is a minimization of the objective which we
solve exactly.
If the value of the objective has not converged after M iterations, then we have cycled through all
the possible partition/centroid pairs attainable by the algorithm. On the next iteration, we would
obtain a partition and centroid assignment that we have already seen, say on iteration t ≤ M . But
this means that the value of the objective at time M + 1 is the same as at time t, and because the
value of the objective function never increases throughout the algorithm, the value is the same as
at time M , so we have converged.
In practice, it is common to run the K-means algorithm multiple times with different initialization
points, and the cluster corresponding to the minimum objective value is chosen. There are also
ways to choose a smarter initialization than a random seed, which can improve the quality of the
local optimum found by the algorithm.2 It should be emphasized that no efficient algorithm for
solving the K-means optimization is guaranteed to give a good cluster assignment, as the problem
is NP hard and there are local optima.
Choosing the number of clusters K is similar to choosing the number of principal components for
PCA - we can compute the value of the objective for multiple values of K and find the “elbow” in
the curve.
2 For example, K-means++.
We have noted that the main algorithm for solving K-means does not have to produce a good
solution. Let us step back and consider some shortcomings of the K-means objective function
itself:
1. There is no likelihood attached to K-means, which makes it harder to understand what as-
sumptions we are making on the data.
2. Each feature is treated equally, so the clusters produced by K-means will look spherical. We
can also infer this by looking at the sum of squares in the objective function, which we have
seen to be related to spherical Gaussians.
3. Each cluster assignment in the optimization is a hard assignment - each point belongs in
exactly one cluster. A soft assignment would assign each point to a distribution over the
clusters, which can encode not only which cluster a point belongs to, but also how far it was
from the other clusters.
Soft K-means
We can introduce soft assignments to our algorithm easily using the familiar softmax function.
Recall that if z ∈ Rd , then the softmax function σ is defined as
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^d e^{z_k}}$$
This is now a weighted average of the xi - the weights reflect how much we believe each data point
belongs to a particular cluster, and because we are using this information, our algorithm should
not jump around between clusters, resulting in better convergence speed.
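A minimal sketch of soft K-means follows. It assumes one common form of the soft assignment, namely a softmax over the negative squared distances scaled by β, $r_i(k) \propto \exp(-\beta \|x_i - c_k\|^2)$, with each centroid updated to the $r_i(k)$-weighted average of the data. That specific form, the function signature, and the explicit initial centroids are assumptions for illustration.

```python
import numpy as np

def soft_kmeans(X, centroids, beta=1.0, n_iters=50):
    """Soft K-means with assumed assignments r[i, k] = softmax_k(-beta * ||x_i - c_k||^2)."""
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)       # stabilize the softmax
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)                 # each row is a distribution over clusters
        centroids = (r.T @ X) / r.sum(axis=0)[:, None]    # weighted average of the x_i
    return r, centroids   # r was computed from the centroids before the final update

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
r, centroids = soft_kmeans(X, centroids=[[0.0, 0.0], [10.0, 10.0]], beta=1.0)
print(r.round(3))
print(centroids)
```

Larger β makes the assignments closer to hard K-means assignments, while β near 0 makes every point belong almost equally to every cluster.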
There are still a few issues with soft K-means. One is the choice of β - it is not so clear how to
set this hyperparameter. Another issue is that our clusters are still spherical, since we are still
weighting all features the same (note that we have weighted each data point differently with soft
K-means). To solve these issues, we will use a fully probabilistic model.
6.2. MIXTURE OF GAUSSIANS 155
Suppose µk ∈ Rd , Σk ∈ Rd×d are fixed parameters for k = 1, ..., K. Consider the following experi-
ment: draw a value z from some distribution on the set of indices {1, ..., K}, and then draw x ∈ Rd
from the Gaussian distribution N (µz , Σz ). We can interpret x as belonging to cluster z. This
model is called Mixture of Gaussians (MoG), also known as a Gaussian mixture model.
If we have fit a MoG model to data (i.e., we have determined values for µk , Σk , and the prior on
z), then to perform clustering, we can use Bayes’ rule to determine the posterior P (z = k|x) and
assign x to the cluster k that maximizes this quantity. In fact, this is exactly our decision rule
with QDA using a prior - the difference is that QDA, a supervised method, is given labels to fit the
mixture model, while in the unsupervised clustering setting we must fit the mixture model without
the aid of labels. When Σk are not multiples of the identity, we can obtain non-spherical clusters,
which was not possible with K-means.
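The clustering rule can be sketched with Bayes' rule: $P(z = k \mid x) \propto P(z = k)\, N(x; \mu_k, \Sigma_k)$. The helper names and the two-component parameters below are hypothetical, and the model is assumed to be already fitted; the computation is done in log space for numerical stability.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log density of N(mu, Sigma) at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.solve(Sigma, diff)       # squared Mahalanobis distance
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def mog_posterior(x, priors, mus, Sigmas):
    """Posterior P(z = k | x) for each component k."""
    log_joint = np.array([np.log(p) + gaussian_logpdf(x, mu, S)
                          for p, mu, S in zip(priors, mus, Sigmas)])
    log_joint -= log_joint.max()                     # avoid underflow before exponentiating
    post = np.exp(log_joint)
    return post / post.sum()

priors = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
Sigmas = [np.eye(2), np.eye(2)]
post_mid = mog_posterior(np.array([2.0, 0.0]), priors, mus, Sigmas)
post_left = mog_posterior(np.array([0.0, 0.0]), priors, mus, Sigmas)
print(post_mid, post_left, post_left.argmax())
```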
MoG is an example of a latent variable model. A latent variable model is a probabilistic model
in which some variables can be directly observed or measured, while other latent (hidden) variables
cannot be observed directly; rather, we observe them indirectly through their influence on the
observed variables. When we try to fit a MoG model to data, we only observe the data xi , which
we presume to have been generated based on the latent variable zi , the cluster assignment. Latent
variable models are modular and can be used to model complex dependencies in a probabilistic
model - however, the added flexibility can lead to difficulty in learning its parameters.
To illustrate this, we will examine the likelihood function for MoG. Suppose xi has distribution
p(xi ; θ), where θ is a set of all µk , Σk , p(zi = k). The likelihood for the single data point xi is
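$$\begin{aligned}
L_i(\theta; x_i) &= p(x_i; \theta) \\
&= \sum_{k=1}^K p(x_i, z_i = k; \theta) \\
&= \sum_{k=1}^K p(x_i \mid z_i = k; \theta)\, p(z_i = k; \theta)
\end{aligned}$$
Assuming the data points are independent, the log likelihood of the full dataset is
$$\ell(\theta; X) = \sum_{i=1}^N \log \Big[ \sum_{k=1}^K p(x_i \mid z_i = k; \theta)\, p(z_i = k; \theta) \Big] \tag{6.2}$$
When we perform QDA, the $z_i$ are known, deterministic quantities and thus the likelihood
(6.2) reduces to
$$\ell(\theta; X) = \sum_{i=1}^N \log p(x_i \mid z_i; \theta)$$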
Maximizing this is equivalent to fitting the individual class-conditional Gaussians via maximum
likelihood, which is consistent with how we have described QDA in the past. When we fit the
MoG without knowledge of the latent variables, the parameters θ in (6.2) are now coupled together
inside the log, which complicates the likelihood. While it is still possible to find the MLE by
working out the gradient and using our descent methods, there is an alternative approach called
Expectation-Maximization (EM), which takes advantage of the latent variable structure.
The Expectation-Maximization (EM) Algorithm is used to compute the MLE for latent vari-
able models, such as MoG. First recall that soft K-means consisted of the following two alternating
steps:
1. For each data point, compute a soft assignment ri (k) to the clusters - that is, a probability
distribution over clusters. The soft assignment is obtained by using a softmax.
2. Update the centroids in an optimal way given the soft assignments computed in the first step.
The resulting updates are a weighted average of the data points.
More generally, these two steps can be viewed as:
1. Soft imputation of the data - fill in the missing data (“impute”) with a probability distribution
over all its possible values (“soft”).
2. Parameter updates given the imputed data
The EM algorithm alternates between these two steps in the same way as soft K-means, but the
updates for each step are performed in a principled way to maximize certain “components” of the
log likelihood (we will make this precise later). We will derive the following updates for EM when
computing the MLE for MoG:
1. Using the current parameter estimates, estimate p(zi |xi ; θ). That is, perform soft imputation
of the latent cluster variable.
2. Estimate the parameters via MLE, using the estimates of p(zi |xi ; θ) to make the computation
tractable.
Both soft K-means and EM estimate p(zi |xi ; θ), though EM will do so in a more principled way.
In the second alternating step, we will see that the EM update of the mean is exactly the same as
the soft K-means centroid update. However, EM will also update the covariance estimates, which
captures ellipsoidal structure in the data.
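The two alternating steps can be sketched for MoG as follows. This uses the standard closed-form MoG updates (responsibility-weighted priors, means, and covariances), which are assumed here rather than derived; the small ridge `reg` added to each covariance, the identity-covariance initialization, and the synthetic data are also illustrative assumptions.

```python
import numpy as np

def em_mog(X, mus, n_iters=50, reg=1e-6):
    """EM for a mixture of Gaussians, starting from the given initial means."""
    X = np.asarray(X, dtype=float)
    mus = np.array(mus, dtype=float)
    N, d = X.shape
    K = len(mus)
    Sigmas = np.stack([np.eye(d)] * K)
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: q[i, k] = p(z_i = k | x_i; theta) under the current parameters.
        log_q = np.empty((N, K))
        for k in range(K):
            diff = X - mus[k]
            _, logdet = np.linalg.slogdet(Sigmas[k])
            maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigmas[k]), diff)
            log_q[:, k] = np.log(pis[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M step: weighted maximum likelihood estimates given the soft assignments.
        Nk = q.sum(axis=0)
        pis = Nk / N
        mus = (q.T @ X) / Nk[:, None]                 # same form as the soft K-means update
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (q[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return pis, mus, Sigmas, q   # q comes from the last E step

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
pis, mus, Sigmas, q = em_mog(X, mus=[[1.0, 1.0], [5.0, 5.0]])
print(pis, mus)
```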
To derive the EM algorithm, it will be helpful to introduce the notion of the complete log
likelihood, which we define as
Lc (xi , zi ; θ) := log p(xi , zi ; θ)
If we assumed zi to be known, this would be the log likelihood of the data. In practice, we do
not know the values of zi , but if we are given a distribution q(zi |xi ) over the latents zi , we can
marginalize over the possible values of zi by taking the expectation
$$E_q(L_c(x_i, z_i; \theta)) = \sum_{k=1}^K L_c(x_i, z_i = k; \theta)\, q(z_i = k \mid x_i)$$
6.3. EXPECTATION MAXIMIZATION (EM) ALGORITHM 157
We call this the expected complete log likelihood. The distribution q is an estimate of the true
conditional distribution p(zi |xi ; θ); in the EM algorithm, we will alternate between updating our q
distribution to better estimate p(zi |xi ; θ) and maximizing the expected complete log likelihood in
place of the true likelihood function.
We are now in a position to derive the algorithm. We will need a well-known result called Jensen’s
Inequality:
Theorem 3. If X is a random variable and f is convex, then $f(E(X)) \le E(f(X))$.
If f is concave, then using the fact that −f is convex immediately yields the conclusion
$E(f(X)) \le f(E(X))$. In particular, since log is concave, we have $E(\log(X)) \le \log E(X)$.
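A quick numerical illustration of the concave case: for a small discrete random variable (the values and probabilities below are made up), E(log X) is no larger than log E(X).

```python
import math

values = [0.5, 1.0, 4.0]     # hypothetical positive outcomes of X
probs = [0.2, 0.5, 0.3]      # their probabilities

e_log = sum(p * math.log(v) for p, v in zip(probs, values))    # E[log X]
log_e = math.log(sum(p * v for p, v in zip(probs, values)))    # log E[X]
print(e_log, log_e, e_log <= log_e)
```

Equality would hold only if X were constant, which is what makes the choice of q in the later derivation matter.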
Now we can derive the EM algorithm. Suppose xi are random variables depending on zi , and θ
are the parameters of interest. Given any conditional distribution over the latents q(zi = k|xi ), the
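log likelihood for the i-th data point is
\[ \ell(\theta; x_i) = \log p(x_i; \theta) = \log \sum_{k=1}^{K} p(x_i, z_i = k; \theta) = \log \sum_{k=1}^{K} q(z_i = k \mid x_i) \frac{p(x_i, z_i = k; \theta)}{q(z_i = k \mid x_i)} \ge \sum_{k=1}^{K} q(z_i = k \mid x_i) \log \frac{p(x_i, z_i = k; \theta)}{q(z_i = k \mid x_i)} \]
where the inequality follows from Jensen’s inequality applied to the concave function log.
We will define
\[ H(q(z_i \mid x_i)) := -\sum_{k=1}^{K} q(z_i = k \mid x_i) \log[q(z_i = k \mid x_i)] \]
and
\[ L_c(x_i, z_i; \theta) := \log p(x_i, z_i; \theta) \]
so that the above lower bound can be written as
\[ \ell(\theta; x_i) \ge H(q(z_i \mid x_i)) + E_q(L_c(x_i, z_i; \theta)) =: F_i(q, \theta) \]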
The first term H(q(zi |xi )) has an information-theoretic interpretation - it is the entropy of the
distribution q(zi |xi ), a non-negative quantity that measures the amount of disorder encoded in the
distribution. As mentioned earlier, the term Lc (xi , zi ; θ) is called the complete log likelihood of
xi - it will be easier to optimize Eq (Lc (xi , zi ; θ)) than the original log likelihood, since we will not
need to deal with the “marginalization problem” in the original log likelihood from Equation 6.2.
158 CHAPTER 6. CLUSTERING
Since the log likelihood of the full data is the sum of the individual log likelihoods, we can take
sums and find
\[ \ell(\theta; X) \ge \sum_{i=1}^{N} F_i(q, \theta) = H(q(z \mid X)) + E_q(L_c(X, z; \theta)) =: F(q, \theta) \tag{6.3} \]
Here, X denotes the full dataset and z denotes the length N vector of latent variables. It is easy to
check that if q(zi |xi ) = p(zi |xi ; θ) for all i, then the inequality (6.3) is tight (set q(zi |xi ) = p(zi |xi ; θ)
in the application of Jensen’s inequality and observe that both sides of the inequality are equal). Thus it
makes sense to perform an alternating maximization scheme, where we iteratively update q(zi |xi )
to p(zi |xi ; θ) to make the inequality tight and then maximize over θ. Formally, the algorithm is as
follows:
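1. Initialize θ^0
2. Expectation (E) step (soft imputation): set q^{t+1} = arg max_q F (q, θ^t ), that is, q^{t+1}(zi = k|xi ) = p(zi = k|xi ; θ^t ) for all i and k
3. Maximization (M) step: set θ^{t+1} = arg max_θ F (q^{t+1} , θ) = arg max_θ E_{q^{t+1}} (Lc (X, z; θ))
4. Alternate steps 2 and 3 until convergence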
A few remarks: when we maximize over q in the E step, there are NK values to be updated - one
value of q(zi = k|xi ) for every data index i and latent index k. After the E step, q t+1 is fixed and
does not depend on θ, so the entropy term does not depend on θ and maximizing F (q t+1 , θ) in
the subsequent M step amounts to maximizing the expected complete log likelihood. The E step is
what we have previously described as soft imputation of the latents: we fill in values for the hidden
variables zi by determining a conditional distribution q(zi |xi ). The M step assumes the E step
has done a reasonable job at imputing the data and uses this additional information to maximize
the likelihood. Observe the connections between K-means, soft K-means, and EM - all perform
alternating steps of data imputation and subsequent parameter optimization given the imputed
data. In the data imputation step for K-means, each data point is given a hard assignment to a
latent variable value, while in soft K-means and EM, each data point gets assigned a distribution
over the latent variables.
From our derivation of EM, we can see that the value of the likelihood never decreases during
the execution of the algorithm. It turns out that EM will converge to a parameter estimate with
zero gradient, but will not necessarily find the global optimum. When the clusters are sufficiently
separated, EM can exhibit Newtonian (second-order) convergence speed - however, if the clusters
are close together, the posteriors will be very flat and EM can take longer than gradient descent
methods to converge.
EM for MoG
As a concrete example, we derive the EM updates for fitting a mixture of Gaussians. Recall the
MoG model
x|z ∼ N (µz , Σz ), p(z = k) =: αk
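We define the parameter set θ as the set of all µk , Σk , αk . Let x1 , ..., xN ∈ Rd be our observed
data. Define \( q_{ki}^{t} := q^{t}(z_i = k \mid x_i) \). The EM updates, derived below, are
E step:
\[ q^{t+1}(z_i = k \mid x_i) = \frac{\alpha_k^{t} \, p(x_i \mid z_i = k; \theta^{t})}{\sum_{j=1}^{K} \alpha_j^{t} \, p(x_i \mid z_i = j; \theta^{t})} \]
M step:
\[ \mu_k^{t+1} = \frac{\sum_{i=1}^{N} q_{ki}^{t+1} x_i}{\sum_{i=1}^{N} q_{ki}^{t+1}} \]
\[ \Sigma_k^{t+1} = \frac{\sum_{i=1}^{N} q_{ki}^{t+1} (x_i - \mu_k^{t+1})(x_i - \mu_k^{t+1})^{T}}{\sum_{i=1}^{N} q_{ki}^{t+1}} \]
\[ \alpha_k^{t+1} = \frac{1}{N} \sum_{i=1}^{N} q_{ki}^{t+1} \]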
In the E step, we assign to each xi a probability distribution over latents (that is, a soft assignment).
This assignment is p(xi |zi = k; θ t ), the Gaussian likelihood of the data, but reweighted by the prior
and normalized. In the M step, we are essentially computing the usual maximum likelihood
estimates of the parameters, but weighted by the posterior on z; indeed, if we set \( q_{ki}^{t+1} = \frac{1}{N} \), then
we are using the usual MLE. The update for µk is entirely analogous to the update to the centroids
for soft k-means (6.1). The main difference is that now we are also updating estimates of covariances
and the prior and synthesizing this information in our posterior estimates in the E step, which will
in turn influence the µk updates in the next M step.
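The E and M updates above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions that are not in the notes: the data are 1-D (so each Σk is a scalar variance), the dataset and initial parameters are made up, and a small variance floor is added purely as a numerical safeguard. The function names are hypothetical.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x (1-D case for simplicity)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, alphas, mus, variances):
    """Marginal log likelihood: sum_i log sum_k alpha_k N(x_i; mu_k, var_k)."""
    return sum(
        math.log(sum(a * gaussian_pdf(x, m, v)
                     for a, m, v in zip(alphas, mus, variances)))
        for x in xs
    )

def em_step(xs, alphas, mus, variances):
    """One E step followed by one M step for a 1-D mixture of Gaussians."""
    K, N = len(alphas), len(xs)
    # E step: q[k][i] = alpha_k p(x_i | z_i = k) / sum_j alpha_j p(x_i | z_i = j)
    q = [[alphas[k] * gaussian_pdf(x, mus[k], variances[k]) for x in xs]
         for k in range(K)]
    for i in range(N):
        total = sum(q[k][i] for k in range(K))
        for k in range(K):
            q[k][i] /= total
    # M step: posterior-weighted maximum likelihood estimates
    new_alphas, new_mus, new_vars = [], [], []
    for k in range(K):
        weight = sum(q[k])
        mu = sum(qi * x for qi, x in zip(q[k], xs)) / weight
        var = sum(qi * (x - mu) ** 2 for qi, x in zip(q[k], xs)) / weight
        new_alphas.append(weight / N)
        new_mus.append(mu)
        new_vars.append(max(var, 1e-6))  # variance floor: added safeguard, not part of the derivation
    return new_alphas, new_mus, new_vars

# Hypothetical data: two well-separated 1-D clusters.
xs = [-2.1, -1.9, -2.0, -1.8, 2.0, 2.2, 1.9, 2.1]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # (alphas, mus, variances)
history = [log_likelihood(xs, *params)]
for _ in range(20):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
```

Recording the log likelihood after every iteration in `history` makes it easy to check the monotonicity property discussed earlier.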
We now derive these updates. Recall the log likelihood is
\[ \ell(\theta; x) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} p(x_i \mid z_i = k; \theta) \, p(z_i = k; \theta) \]
Setting the gradients to zero and solving these equations gives us the updates for µk , Σk shown
above. To obtain the update for αk , we need to introduce the constraint \( \sum_{k=1}^{K} \alpha_k = 1 \) via Lagrange
multipliers. We thus maximize the Lagrangian \( \ell'_q = E_{q^{t+1}}(L_c(x, z; \theta)) - \lambda \left( \sum_{k=1}^{K} \alpha_k - 1 \right) \). Taking
gradients, we get
\[ \frac{\partial \ell'_q}{\partial \alpha_k} = -\lambda + \frac{1}{\alpha_k} \sum_{i=1}^{N} q_{ki}^{t+1} = 0 \]
Let f be convex, and X be a random variable. Construct the tangent line L(X) = aX + b to f
at E(X) for some a, b - this means L(E(X)) = f (E(X)). By convexity, L(X) ≤ f (X) for all X.3
Then by monotonicity and linearity of expectation, we have
\[ f(E(X)) = L(E(X)) = aE(X) + b = E(L(X)) \le E(f(X)) \]
1. d(x, y) = 0 iff x = y
2. d(x, y) = d(y, x) for all x, y
3. d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z (triangle inequality)
A dissimilarity measure d(x, y) is a function satisfying the above properties except possibly the
triangle inequality. One possible similarity measure s(x, y) can be defined as −d(x, y). In clus-
tering, we are free to choose our notion of similarity. Different algorithms may or may not work
differently depending on which similarity measure we choose.
For example, it does not make sense to use the K-means algorithm if we care about L1 distances
instead of L2 . However, we can apply the same principles used to derive the K-means algorithm:
if we replace the L2 norm in the objective by L1 and do a similar alternating minimization, then
on the centroid assignment step, we will set each centroid to the median of the data instead of
the mean. On the cluster assignment step, we will assign each point to the closest centroid in L1
distance. This variation is called K-medians.
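To see why the centroid update changes from the mean to the median, the sketch below compares the two candidate centroids for a single 1-D cluster under both costs. The cluster values are made up, with one deliberate outlier; in higher dimensions the L1-optimal centroid would be the component-wise median.

```python
import statistics

def l1_cost(points, c):
    """Sum of absolute deviations from centroid c (the K-medians objective for one cluster)."""
    return sum(abs(p - c) for p in points)

def l2_cost(points, c):
    """Sum of squared deviations from centroid c (the K-means objective for one cluster)."""
    return sum((p - c) ** 2 for p in points)

cluster = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical cluster with one outlier
mean_c = statistics.mean(cluster)
median_c = statistics.median(cluster)

print("mean:", mean_c, "median:", median_c)
print("L1 cost at mean / median:", l1_cost(cluster, mean_c), l1_cost(cluster, median_c))
print("L2 cost at mean / median:", l2_cost(cluster, mean_c), l2_cost(cluster, median_c))
```

The median is pulled far less by the outlier than the mean, which is the robustness property that motivates K-medians.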
Deciding which similarity measure to use is a modeling choice that typically depends on the data
and desired clustering properties. For example, K-medians may be of use if the data has outliers
and we desire a more robust estimator of the clusters. In certain domains, such as computer vision,
the Lp distances are not appropriate measures of dissimilarity, so other measures may be used.
Chapter 7
Decision Tree Learning
7.1 Decision Trees
A decision tree is a model that makes predictions by posing a series of simple tests on the given
point. This process can be represented as a tree, with the non-leaf nodes of the tree representing
these tests. At the leaves of this tree are the values to be predicted after descending the tree in
the manner prescribed by each node’s test. Decision trees can be used for both classification and
regression, but we will focus exclusively on classification.
In the simple case which we consider here, the tests are of the form “Is feature j of this point less
than the value v?”1 These tests carve up the feature space in a nested rectangular fashion:
1 One could use more complicated decision procedures. For example, in the case of a categorical variable, there could be
a separate branch in the tree for each value that the variable could take on. Or multiple features could be used, e.g. “Is
x1 > x2 ?” However, using more complicated decision procedures complicates the learning process. For simplicity we consider
the single-feature binary case here.
164 CHAPTER 7. DECISION TREE LEARNING
Given sufficiently many splits, a decision tree can represent an arbitrarily complex classifier and
perfectly classify any training set.2
Training
Decision trees are trained in a greedy, recursive fashion, downward from the root. After creating a
node with some associated split, the children of that node are constructed by the same tree-growing
procedure. However, the data used to train the left subtree are only those points satisfying xj < v,
and similarly the right subtree is grown with only those points with xj ≥ v. Eventually we must
reach a base case where no further splits are to be made, and a prediction (rather than a split) is
associated with the node.
We can see that the process of building the tree raises at least the following questions:
Typically decision trees will pick a split by considering all possible splits and choosing the one that
is the best according to some criterion. We will discuss possible criteria later, but first it is worth
asking what we mean by “all possible splits”. It is clear that we should look at all features, but
what about the possible values? Observe that in the case where tests are of the form xj < v, there
are infinitely many values of v we could choose, but only finitely many different resulting splits
(since there are finitely many training points). Therefore it suffices to perform a one-dimensional
sweep: sort the datapoints by their xj values and only consider these values as split candidates.
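The one-dimensional sweep can be sketched as follows. The helper names are hypothetical, and the sketch takes the text literally by using the distinct observed feature values as candidate thresholds (the smallest value yields an empty left side, so a practical implementation might skip it).

```python
def candidate_thresholds(feature_values):
    """Distinct sorted values of one feature; each is a candidate v for the test x_j < v."""
    return sorted(set(feature_values))

def split_indices(feature_values, v):
    """Indices of the points sent left (x_j < v) and right (x_j >= v)."""
    left = [i for i, x in enumerate(feature_values) if x < v]
    right = [i for i, x in enumerate(feature_values) if x >= v]
    return left, right

# Hypothetical values of feature j for four training points.
xs = [2.0, 0.5, 2.0, 1.0]
for v in candidate_thresholds(xs):
    print(v, split_indices(xs, v))
```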
Now let us revisit the criterion. Intuitively, we want to choose the split which most reduces our
classifier’s uncertainty about which class a point belongs to. In the ideal case, the hyperplane xj = v
2 Unless two training points of different classes coincide.
7.1. DECISION TREES 165
perfectly splits the given data points such that all the instances of the positive class lie on one side
and all the instances of the negative class lie on the other.
To quantify the aforementioned “uncertainty”, we’ll use the ideas of surprise and entropy.
The surprise of observing that a discrete random variable Y takes on value k is:
\[ \log \frac{1}{P(Y = k)} = -\log P(Y = k) \]
As P (Y = k) → 0, the surprise of observing that value approaches ∞, and conversely as P (Y =
k) → 1, the surprise of observing that value approaches 0.
The entropy of Y , denoted H(Y ), is the expected surprise:
\[ H(Y) = E[-\log P(Y)] = -\sum_{k} P(Y = k) \log P(Y = k) \]
Observe that for a binary variable, as a function of p (the probability of the variable being 1), the entropy is strictly
concave. Moreover, it is maximized at p = 1/2, when the probability distribution is uniform with
respect to the outcomes. That is to say, a coin that is completely fair (P (Y = 0) = P (Y = 1) = 1/2)
has more entropy than a coin that is biased. This is because we are less sure of the outcome of the
fair coin than the biased coin overall. Even though we are more surprised when the biased coin
comes up as its more unlikely outcome, the way that entropy is defined gives a higher uncertainty
score to the fair coin. Generally speaking, a random variable has more entropy when the distribution
of its outcomes is closer to uniform and less entropy when the distribution is highly skewed to one
outcome.
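A minimal sketch of the entropy computation, comparing a fair coin with a biased one. The base-2 logarithm (entropy in bits) is an assumption here, since the notes do not fix a base; the convention 0 · log 0 = 0 is handled by skipping zero probabilities.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])
biased = entropy([0.9, 0.1])
certain = entropy([1.0, 0.0])
print(fair, biased, certain)
```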
This definition is for random variables, but in practice we work with data. The distribution is
empirically defined by our training points \( \{(x_i, y_i)\}_{i=1}^{n} \). Concretely, the probability of class k is the
proportion of datapoints having class k:
\[ P(Y = k) = \frac{|\{i \mid y_i = k\}|}{n} \]
We know that when we choose a split-feature, split-value pair, we want to reduce entropy in some
way. Let Xj,v be an indicator variable which is 1 when xj < v, and 0 otherwise. There are a few
entropies to consider:
• H(Y )
• H(Y |Xj,v = 1), the entropy of the distribution of points such that xj < v.
• H(Y |Xj,v = 0), the entropy of the distribution of points such that xj ≥ v.
H(Y ) is not really under our control: we start with the set of points with labels represented by Y ,
this distribution has some entropy, and now we wish to carve up those points in a way to minimize
the entropy remaining. Thus, the quantity we want to minimize is a weighted average of the entropies of the two
sides of the split, where the weights are (proportional to) the sizes of the two sides:
\[ H(Y \mid X_{j,v}) = P(X_{j,v} = 1) \, H(Y \mid X_{j,v} = 1) + P(X_{j,v} = 0) \, H(Y \mid X_{j,v} = 0) \]
This quantity H(Y |Xj,v ) is known as the conditional entropy of Y given Xj,v . An equivalent
way of seeing this is that we want to maximize the information we’ve learned, which is represented
by how much entropy is reduced after learning whether or not xj < v:
\[ I(X_{j,v}; Y) = H(Y) - H(Y \mid X_{j,v}) \]
This quantity I(Xj,v ; Y ) is known as the mutual information between Xj,v and Y . It is always
nonnegative, and it’s zero iff the resulting sides of the split have the same distribution of classes
as the original set of points. Let’s say you were using a decision tree to classify emails as spam
and ham. For example, you gain no information if you take a set of (20 ham, 10 spam) and split
it on some feature to give you sets of (12 ham, 6 spam); (8 ham, 4 spam) because the empirical
distribution of those two resulting sets is equal to the original one.
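The ham/spam example can be checked numerically. The sketch below computes the information gain from class counts, using base-2 logarithms as an assumption; the function names are hypothetical. The second call, a perfectly separating split, is added only for contrast.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of the empirical class distribution given per-class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, left, right):
    """H(Y) minus the size-weighted average entropy of the two children."""
    n = sum(parent)
    conditional = (sum(left) / n) * entropy_from_counts(left) \
        + (sum(right) / n) * entropy_from_counts(right)
    return entropy_from_counts(parent) - conditional

# (ham, spam) counts from the example in the text: the split preserves the class proportions.
gain_uninformative = information_gain((20, 10), (12, 6), (8, 4))
# Contrast case (not in the text): a split that separates the classes perfectly.
gain_perfect = information_gain((20, 10), (20, 0), (0, 10))
print(gain_uninformative, gain_perfect)
```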
Gini impurity
Another way to assess the quality of a split is Gini impurity, which measures how often a randomly
chosen element from the set would be incorrectly labeled if it were randomly labeled according to
the distribution of labels in the subset. It is defined as
\[ G(Y) = \sum_{k} P(Y = k) \sum_{j \ne k} P(Y = j) = \sum_{k} P(Y = k)(1 - P(Y = k)) = 1 - \sum_{k} P(Y = k)^2 \]
Exactly as with entropy, we can define a version of this quantity which is dependent on the split.
For example, G(Y |Xj,v = 1) would be the Gini impurity computed only on those points satisfying
xj < v. And we can define an analogous quantity
\[ G(Y \mid X_{j,v}) = P(X_{j,v} = 1) \, G(Y \mid X_{j,v} = 1) + P(X_{j,v} = 0) \, G(Y \mid X_{j,v} = 0) \]
which is to be minimized.
Empirically, the Gini impurity is found to produce very similar results to entropy, and it is slightly
faster to compute because we don’t need to take logs.
Since we ultimately care about classification accuracy, it is natural to wonder why we don’t directly
use the misclassification rate by plurality vote as the measure of impurity:
\[ M(Y) = 1 - \max_{k} P(Y = k) \]
It turns out that this quantity is insensitive in the sense that the quantity it induces for evaluating
splits (M (Y |Xj,v )) may assign the same value to a variety of splits which are not, in fact, equally
good. Suppose3 the current node has 40 training points of class 1 and 40 of class 2. Here M (Y ) =
1 − 1/2 = 1/2. Now consider two possible splits:
1. Separate into a region xj < v with 30 points of class 1 and 10 of class 2 (10/40 = 1/4 misclassified),
and a region xj ≥ v with 10 points of class 1 and 30 of class 2 (10/40 = 1/4 misclassified).
Since 40/80 = 1/2 of the points satisfy xj < v and 40/80 = 1/2 of the points satisfy xj ≥ v,
\[ M(Y \mid X_{j,v}) = \frac{1}{2} \cdot \frac{1}{4} + \frac{1}{2} \cdot \frac{1}{4} = \frac{1}{4} \]
2. Separate into a region xj < v with 20 points of class 1 and 40 of class 2 (20/60 = 1/3 misclassified),
and a region xj ≥ v with 20 points of class 1 and 0 of class 2 (0/20 = 0 misclassified).
Since 60/80 = 3/4 of the points satisfy xj < v and 20/80 = 1/4 of the points satisfy xj ≥ v,
\[ M(Y \mid X_{j,v}) = \frac{3}{4} \cdot \frac{1}{3} + \frac{1}{4} \cdot 0 = \frac{1}{4} \]
We see that the criterion value is the same for both splits, even though the second one appears
to do a better job reducing our uncertainty for many of the points. And indeed, the conditional
entropy and Gini impurity scores are lower (so the information gain is higher) for the second split.4
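The comparison between the three criteria on these two splits can be reproduced with a short sketch. The class counts come from the example above; the helper names and the use of base-2 logarithms are assumptions.

```python
import math

def misclassification(counts):
    """1 - max_k P(Y = k) for per-class counts."""
    return 1 - max(counts) / sum(counts)

def gini(counts):
    """1 - sum_k P(Y = k)^2 for per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits for per-class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def conditional(impurity, children):
    """Size-weighted average of an impurity measure over the children of a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * impurity(c) for c in children)

split_1 = [(30, 10), (10, 30)]  # (class 1, class 2) counts on each side
split_2 = [(20, 40), (20, 0)]

for name, impurity in [("misclassification", misclassification),
                       ("gini", gini), ("entropy", entropy)]:
    print(name, conditional(impurity, split_1), conditional(impurity, split_2))
```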
The limitation of this criterion can be understood mathematically in terms of strict concavity.
This property means that the graph of the function lies strictly below the tangent line5 at every
point (except the point of tangency, of course). Consider the plots of entropy and misclassification
rate for a binary classification problem:
Both curves are concave, but the one for misclassification rate is not strictly so; it has only two
unique tangent lines. Because the conditional quantity is a convex combination (since it is weighted
by probabilities/proportions) of the children’s values, the strictly concave functions always yield
positive information gain as long as the children’s distributions are not identical to the parent’s.
We have no such guarantee in either linear region of the misclassification rate curve; any convex
combination of points on a line also lies on the line, yielding zero information gain.
Stopping criteria
We mentioned earlier that sufficiently deep decision trees can represent arbitrarily complex decision
boundaries, but of course this will lead to overfitting if we are not careful. There are a number of
heuristics we may consider to decide when to stop splitting:
• Limited depth: don’t split if the node is beyond some fixed depth in the tree
• Node purity: don’t split if the proportion of training points in some class is sufficiently high
• Information gain criteria: don’t split if the gained information/purity is sufficiently close to
zero
Note that these are not mutually exclusive, and the thresholds can be tuned with validation. As
an alternative (or addition) to stopping early, you can prune a fully-grown tree by re-combining
splits if doing so reduces validation error.
Another way to combat overfitting is to combine the predictions of many varied models into a
single prediction, typically by plurality vote in the case of classification and averaging in the case
of regression. This is a general technique known as ensemble learning. To understand the
motivation for averaging, consider a set of uncorrelated random variables \( \{Y_i\}_{i=1}^{n} \) with common
mean E[Yi ] = µ and variance Var(Yi ) = σ². The average of these has the same expectation
\[ E\left[ \frac{1}{n} \sum_{i=1}^{n} Y_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[Y_i] = \frac{1}{n} \cdot n\mu = \mu \]
In the context of ensemble methods, these Yi are analogous to the prediction made by classifier i.
The combined prediction has the same expected value as any individual prediction but lower vari-
ance. Real-world predictions will of course not be completely uncorrelated, but reducing correlation
will generally reduce the final variance, so this is a goal to aim for.
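The "lower variance" claim can be made explicit with a short worked step. Assuming, as above, that the Yi are uncorrelated with common variance σ², the cross terms vanish and the variance of the average shrinks by a factor of n:

```latex
\operatorname{Var}\left( \frac{1}{n} \sum_{i=1}^{n} Y_i \right)
  = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(Y_i)
  = \frac{1}{n^2} \cdot n\sigma^2
  = \frac{\sigma^2}{n}
```

If the Yi were positively correlated, the covariance terms would add back to this quantity, which is why reducing correlation among the individual predictors helps.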
Random forests are a specific ensemble method where the individual models are decision trees
trained in a randomized way so as to reduce correlation among them. Because the basic decision
tree building algorithm is deterministic, it will produce the same tree every time if we give it the
same dataset and use the same algorithm hyperparameters (stopping conditions, etc.).
Random forests are typically randomized in the following ways:
• Per-classifier bagging (short for bootstrap aggregating): sample some number m < n of
datapoints uniformly with replacement, and use these as the training set.
• Per-split feature randomization: sample some number k < d of features as candidates
to be considered for this split
Both the size of the random subsample of training points and the number of features at each split
are hyperparameters which should be tuned through cross-validation.
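The two sources of randomness can be sketched with the standard library alone. The function names and the specific values of n, m, d, and k are hypothetical, and sampling the candidate features without replacement is an assumption, since the text does not specify it.

```python
import random

def bootstrap_sample(n, m, rng):
    """Indices of m datapoints drawn uniformly with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(m)]

def feature_subset(d, k, rng):
    """k distinct candidate feature indices out of d, drawn afresh for a single split."""
    return rng.sample(range(d), k)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = bootstrap_sample(n=10, m=7, rng=rng)   # training set for one tree
features = feature_subset(d=5, k=2, rng=rng)  # candidates for one split
print(rows, features)
```

Each tree would get its own call to `bootstrap_sample`, while `feature_subset` would be called again at every split within a tree.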
7.3 Boosting
We have seen that in the case of random forests, combining many imperfect models can produce
a single model that works very well. This is the idea of ensemble methods. However, random
forests treat each member of the forest equally, taking a plurality vote or an average over their
outputs. The idea of boosting is to combine the models (typically called weak learners in this
context) in a more principled manner. The key idea is as follows: to improve our combined
model, we should focus on finding learners that correctly predict the points which the overall
boosted model is currently predicting inaccurately. Boosting algorithms implement this idea by
associating a weight with each training point and iteratively reweighting so that mispredicted points
have relatively high weights. Intuitively, some points are “harder” to predict than others, so the
algorithm should focus its efforts on those.
These ideas also connect to matching pursuit. In both cases, our overall predictor is an additive
combination of pieces which are selected one-by-one in a greedy fashion. The algorithm keeps track
of residual prediction errors, chooses the “direction” to move based on these, and then performs a
sort of line search to determine how far along that direction to move.
AdaBoost
There are many flavors of boosting. We will discuss one of the most popular versions, known as
AdaBoost (short for adaptive boosting), which is a method for binary classification. Its developers
won the prestigious Gödel Prize for this work.
Algorithm
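We present the algorithm first, then derive it later. Assume access to a dataset \( \{(x_i, y_i)\}_{i=1}^{n} \), where
xi ∈ Rd and yi ∈ {−1, 1}.
1. Initialize the weights wi = 1/n for all i = 1, . . . , n training points.
2. Repeat for m = 1, . . . , M :
(a) Build a classifier Gm : Rd → {−1, 1}, where in the training process the data are weighted
according to wi .
(b) Compute the weighted error \( e_m = \frac{\sum_{i \text{ misclassified}} w_i}{\sum_{i} w_i} \).
(c) Reweight the training points: multiply wi by \( \sqrt{\frac{1 - e_m}{e_m}} \) if point i is misclassified by Gm , and by \( \sqrt{\frac{e_m}{1 - e_m}} \) otherwise.
(d) Optionally, renormalize the weights so that they sum to 1.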
We first address the issue of step (a): how do we train a classifier if we want to weight different
samples differently? One common way to do this is to resample from the original training set
every iteration to create a new training set that is fed to the next classifier. Specifically, we create
a training set of size n by sampling n values from the original training data with replacement,
according to the distribution wi . (This is why we might renormalize the weights in step (d).) This
way, data points with large values of wi are more likely to be included in this training set, and the
next classifier will place higher priority on such data points.
Suppose6 that our weak learners always produce an error em < 1/2. To make sense of the formulas
we see in the algorithm, note that for step (c), if the i-th data point is misclassified, then the
weight wi gets increased by a factor of \( \sqrt{\frac{1 - e_m}{e_m}} \) (more priority placed on sample i), while if it is
classified correctly, the priority gets decreased. AdaBoost does have a practical weakness in that
this aggressive reweighting can cause the classifier to focus too much on certain training examples
– if the data contains outliers or a lot of noise, the boosting algorithm’s generalization performance
may suffer as it overfits to a few challenging examples.
We have not yet discussed how to make a prediction on test points given our classifiers G1 , . . . , GM .
One conceivable method is to use logistic regression with Gm (x) as features. However, a smarter
choice that is based on the AdaBoost algorithm is to set
\[ \alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m} \]
and classify x by
\[ h(x) = \operatorname{sgn}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right) \]
Note that this choice of αm (derived later) gives high weight to classifiers that have low error:
• As em → 0, (1 − em )/em → ∞, so αm → ∞.
• As em → 1, (1 − em )/em → 0, so αm → −∞.
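Putting the pieces together, here is a minimal sketch of AdaBoost with decision stumps as weak learners. Several choices are assumptions made for illustration rather than the notes' prescription: the stumps use the weights directly instead of resampling, the error is clamped away from 0 and 1 to avoid division by zero, ties in the score are broken toward +1, and the toy dataset is made up. All function names are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Weighted decision stump: the (feature, threshold, sign) with least weighted error."""
    best = None
    for j in range(len(X[0])):
        for v in sorted(set(x[j] for x in X)):
            for sign in (1, -1):
                preds = [sign if x[j] < v else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, v, sign)
    _, j, v, sign = best
    return lambda x: sign if x[j] < v else -sign

def adaboost(X, y, M):
    """Return a list of (alpha_m, G_m) pairs after M boosting rounds."""
    n = len(X)
    w = [1.0 / n] * n                                   # step 1: uniform weights
    ensemble = []
    for _ in range(M):
        G = train_stump(X, y, w)                        # step (a)
        miss = [G(x) != yi for x, yi in zip(X, y)]
        e = sum(wi for wi, m in zip(w, miss) if m) / sum(w)  # step (b)
        e = min(max(e, 1e-10), 1 - 1e-10)               # clamp: added safeguard
        alpha = 0.5 * math.log((1 - e) / e)
        factor = math.sqrt((1 - e) / e)
        w = [wi * factor if m else wi / factor for wi, m in zip(w, miss)]  # step (c)
        total = sum(w)
        w = [wi / total for wi in w]                    # step (d): renormalize
        ensemble.append((alpha, G))
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of the weak learners."""
    score = sum(alpha * G(x) for alpha, G in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D dataset that no single stump classifies perfectly.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
one_round = [predict(adaboost(X, y, 1), x) for x in X]
three_rounds = [predict(adaboost(X, y, 3), x) for x in X]
print(one_round, three_rounds)
```

On this toy set a single stump must misclassify at least two points, while the weighted vote of three stumps fits all six.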
We now proceed to demystify the formulas in the algorithm above by presenting a matching pursuit
interpretation of AdaBoost. This interpretation is also useful because it generalizes to a powerful
technique called Gradient Boosting, of which AdaBoost is just one instance.
Derivation of AdaBoost
Suppose we have computed classifiers G1 , . . . , Gm−1 along with their corresponding weights αk and
we want to compute the next classifier Gm along with its weight αm . The output of our model so
far is \( F_{m-1}(x) := \sum_{i=1}^{m-1} \alpha_i G_i(x) \), and we want to minimize the risk:
\[ \alpha_m, G_m = \arg\min_{\alpha, G} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \alpha G(x_i)) \]
6 This is a reasonable thing to ask. A classifier with error em ≥ 1/2 is even worse than the trivial classifier which predicts the
class with the most total weight without regard for the input xi .
for some suitable loss function L(y, ŷ). Loss functions we have previously used include mean squared
error for linear regression, cross-entropy loss for logistic regression, and hinge loss for SVM. For
AdaBoost, we use the exponential loss:
\[ L(y, \hat{y}) = e^{-y\hat{y}} \]
This loss function is illustrated in Figure 7.1. Observe that if y ŷ > 0 (i.e. ŷ has the correct sign), the
loss decreases exponentially in |ŷ|, which should be interpreted as the confidence of the prediction.
Conversely, if y ŷ < 0, our loss is increasing exponentially in the confidence of the prediction.
Figure 7.1: The exponential loss provides exponentially increasing penalty for confident incorrect predictions.
This figure is from Cornell CS4780 notes.
Plugging the exponential loss into the general optimization problem above yields
\[ \alpha_m, G_m = \arg\min_{\alpha, G} \sum_{i=1}^{n} e^{-y_i (F_{m-1}(x_i) + \alpha G(x_i))} = \arg\min_{\alpha, G} \sum_{i=1}^{n} e^{-y_i F_{m-1}(x_i)} e^{-y_i \alpha G(x_i)} \]
The term \( w_i^{(m)} := e^{-y_i F_{m-1}(x_i)} \) is a constant with respect to our optimization variables. We can
split out this sum into the components with correctly classified points and incorrectly classified
points:
\[ \alpha_m, G_m = \arg\min_{\alpha, G} \sum_{i=1}^{n} w_i^{(m)} e^{-y_i \alpha G(x_i)} \]
\[ = \arg\min_{\alpha, G} \sum_{y_i = G(x_i)} w_i^{(m)} e^{-\alpha} + \sum_{y_i \ne G(x_i)} w_i^{(m)} e^{\alpha} \qquad (*) \]
\[ = \arg\min_{\alpha, G} e^{-\alpha} \left( \sum_{i=1}^{n} w_i^{(m)} - \sum_{y_i \ne G(x_i)} w_i^{(m)} \right) + e^{\alpha} \sum_{y_i \ne G(x_i)} w_i^{(m)} \]
\[ = \arg\min_{\alpha, G} (e^{\alpha} - e^{-\alpha}) \sum_{y_i \ne G(x_i)} w_i^{(m)} + e^{-\alpha} \sum_{i=1}^{n} w_i^{(m)} \]
To arrive at (∗) we have used the fact that yi Gm (xi ) equals 1 if the prediction is correct, and −1
otherwise. For a fixed value of α, the second term in this last expression does not depend on G.
Thus we can see that the best choice of Gm (x) is the classifier that minimizes the total weight of
the misclassified points. Let
\[ e_m = \frac{\sum_{y_i \ne G_m(x_i)} w_i^{(m)}}{\sum_{i} w_i^{(m)}} \]
Once we have obtained Gm , we can solve for αm . Dividing (∗) by the constant \( \sum_{i=1}^{n} w_i^{(m)} \), we
obtain
\[ \alpha_m = \arg\min_{\alpha} (1 - e_m) e^{-\alpha} + e_m e^{\alpha} \]
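We can solve for the minimizer analytically using calculus. Setting the derivative of the objective
function to zero gives
\[ 0 = -(1 - e_m) e^{-\alpha_m} + e_m e^{\alpha_m} \implies e^{2\alpha_m} = \frac{1 - e_m}{e_m} \implies \alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m} \]
The weights for the next iteration then satisfy
\[ w_i^{(m+1)} = e^{-y_i F_m(x_i)} = w_i^{(m)} e^{-y_i \alpha_m G_m(x_i)} \]
\[ = w_i^{(m)} \left( \frac{1 - e_m}{e_m} \right)^{-\frac{1}{2} y_i G_m(x_i)} \]
\[ = w_i^{(m)} \left( \sqrt{\frac{e_m}{1 - e_m}} \right)^{y_i G_m(x_i)} \]
Here we see that the multiplicative factor is \( \sqrt{\frac{e_m}{1 - e_m}} \) when yi = Gm (xi ) and \( \sqrt{\frac{1 - e_m}{e_m}} \) otherwise. This
completes the derivation of the algorithm.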
As a final note about the intuition, we can view these α updates as pushing towards a solution
in some direction until we can no longer improve our performance. More precisely, whenever we
compute αm (and thus \( w^{(m+1)} \)), for the incorrectly classified entries, we have
\[ \sum_{y_i \ne G_m(x_i)} w_i^{(m+1)} = \sum_{y_i \ne G_m(x_i)} w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}} \]
Dividing the right-hand side by \( \sum_{i=1}^{n} w_i^{(m)} \), we obtain \( e_m \sqrt{\frac{1 - e_m}{e_m}} = \sqrt{e_m (1 - e_m)} \). Similarly, for
the correctly classified entries, we have
\[ \frac{\sum_{y_i = G_m(x_i)} w_i^{(m+1)}}{\sum_{i=1}^{n} w_i^{(m)}} = (1 - e_m) \sqrt{\frac{e_m}{1 - e_m}} = \sqrt{e_m (1 - e_m)} \]
Thus these two quantities are the same once we have adjusted our α, so the misclassified and
correctly classified sets both get equal total weight.
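The equal-weight property is easy to confirm numerically. In the sketch below the weights and the misclassification pattern are made-up values; only the update rule comes from the derivation.

```python
import math

w = [0.1, 0.2, 0.3, 0.4]             # hypothetical current weights w_i^(m)
miss = [True, False, False, False]   # hypothetical: only point 0 is misclassified by G_m

e_m = sum(wi for wi, m in zip(w, miss) if m) / sum(w)
factor = math.sqrt((1 - e_m) / e_m)
# Misclassified points are scaled up by the factor, correct ones scaled down by it.
w_next = [wi * factor if m else wi / factor for wi, m in zip(w, miss)]

total = sum(w)
miss_share = sum(wi for wi, m in zip(w_next, miss) if m) / total
correct_share = sum(wi for wi, m in zip(w_next, miss) if not m) / total
print(miss_share, correct_share, math.sqrt(e_m * (1 - e_m)))
```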
This observation has an interesting practical implication. Even after the training error goes to zero,
the test error may continue to decrease. This may be counter-intuitive, as one would expect the
classifier to be overfitting to the training data at this point. One interpretation for this phenomenon
is that even though the boosted classifier has achieved perfect training error, it is still refining its
fit in a max-margin fashion, which increases its generalization capabilities.
Gradient Boosting
AdaBoost assumes a particular loss function, the exponential loss function. Gradient boosting
is a more general technique that allows an arbitrary differentiable loss function L(y, ŷ). Recall the
general optimization problem we must solve when choosing the next model:
\[ \min_{\alpha, G} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \alpha G(x_i)) \]
where, in abuse of notation, Fm−1 (X) is a vector with Fm−1 (xi ) as its ith element, and ∇ŷ L(y, ŷ)
is a vector with \( \frac{\partial L}{\partial \hat{y}}(y_i, \hat{y}_i) \) as its ith element. To decrease the cost in a steepest descent fashion, we
seek the direction g which maximizes
\[ \left| \langle -\nabla_{\hat{y}} L(y, F_{m-1}(X)), g \rangle \right| \]
subject to g being the output of some model G in the model class we are considering.7
Some comments are in order. First, observe that the loss need only be differentiable with respect
to its inputs, not necessarily with respect to model parameters, so we can use non-differentiable
models such as decision trees. Additionally, in the case of squared loss \( L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 \) we have
\[ \frac{\partial L}{\partial \hat{y}}(y_i, F_{m-1}(x_i)) = -(y_i - F_{m-1}(x_i)) \]
so
\[ -\nabla_{\hat{y}} L(y, F_{m-1}(X)) = y - F_{m-1}(X) \]
This means the algorithm will follow the residual, as in matching pursuit.
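For squared loss, this residual-following behaviour can be sketched with regression stumps on 1-D inputs. The sketch rests on assumptions that go beyond the notes: each stump is fit to the residual by least squares (one common way to pick a direction correlated with the negative gradient), the line search uses the closed form available for squared loss, and the data are made up. Function names are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares stump on 1-D inputs: the mean residual on each side of a threshold."""
    best = None
    for v in sorted(set(xs))[1:]:  # skip the smallest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < v]
        right = [r for x, r in zip(xs, residuals) if x >= v]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, v, ml, mr)
    _, v, ml, mr = best
    return lambda x: ml if x < v else mr

def gradient_boost(xs, ys, M):
    """Return ([(alpha_m, G_m)], training predictions) after at most M rounds."""
    F = [0.0] * len(xs)  # current predictions F_{m-1}(x_i), starting from zero
    models = []
    for _ in range(M):
        # Negative gradient of (1/2)(y - yhat)^2 at the current predictions: the residual.
        residuals = [y - f for y, f in zip(ys, F)]
        G = fit_regression_stump(xs, residuals)
        g = [G(x) for x in xs]
        denom = sum(gi * gi for gi in g)
        if denom == 0:
            break
        # Exact line search for squared loss: alpha = <r, g> / <g, g>.
        alpha = sum(r * gi for r, gi in zip(residuals, g)) / denom
        F = [f + alpha * gi for f, gi in zip(F, g)]
        models.append((alpha, G))
    return models, F

def mse(ys, F):
    return sum((y - f) ** 2 for y, f in zip(ys, F)) / len(ys)

# Hypothetical regression data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
models_1, F_1 = gradient_boost(xs, ys, 1)
models_10, F_10 = gradient_boost(xs, ys, 10)
print(mse(ys, F_1), mse(ys, F_10))
```

Because a least-squares stump is the projection of the residual onto piecewise-constant functions, the line search returns a step size of 1 up to rounding in this setting; other losses generally need a genuine one-dimensional search.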
7 The absolute value may seem odd, but consider that after choosing the direction gm , we perform a line search to select
αm . This search may choose αm < 0, effectively flipping the direction. The key is to maximize the magnitude of the inner
product.
Chapter 8
Deep Learning
Neural networks have been successfully applied to many real-world problems, such as predicting
stock prices and learning robot dynamics in autonomous systems. The most general type of neural
network is the multilayer perceptron (MLP), which we have already studied in detail and employed
to classify 28 × 28 pixel images as digits (MNIST). While MLPs can be used to effectively
classify small images (such as those in MNIST), they are impractical for large images. Let’s see
why. Given a W × H × 3 image (over 3 channels — red, green, blue), an MLP would take the
flattened image as input, pass it through several fully connected (FC) layers and non-linearities,
and finally output a vector of probabilities for each of the classes.
Figure 8.1: A Fully Connected layer connects every input neuron to every output neuron.
Associated with each FC layer is an ni × no weight matrix that “connects” each of the ni input
neurons to each of the no output neurons, hence the term “fully connected layer”. The first FC
layer takes an image as input, with ni = W ×H ×3 input neurons. Assuming that there are no ≈ ni
output neurons, then there are ni × no ≈ W² × H² × 3² weights, a prohibitively large number
of weights (in the millions)! This analysis extends to all FC layers that have large inputs, not just
the first layer. In the framework of image classification, MLPs are generally ineffective — not only
are they computationally expensive to train (both in terms of time and memory usage), but they
176 CHAPTER 8. DEEP LEARNING
Convolutional Layers
The subscript in Ic indexes into the depth of the image, in this case for depth c. We can view
convolution as either:
1. a 2-D operator over the width/height of the image, “broadcast” over the depth
2. a 3-D operator over the width/height/depth of the image, with the convolution over the depth
spanning the whole image with no room to move.
What exactly is convolution useful for, and why do we use it in the context of image classification?
In simple terms, convolutions help us extract features. On a low level, filters can be used to detect
all kinds of edges in an image, and at a high level they can detect more complex shapes and objects
that are critical to classifying an image. Consider a simple horizontal edge detector filter [1 −1].
This filter will produce large negative values for inputs in which the left pixel is bright and the
right pixel is dark; conversely, it will produce large positive values for inputs in which the left pixel
is dark and the right pixel is bright.
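The sign behaviour described above can be checked with a small pure-Python sketch of 1-D convolution. The function name and the pixel rows are hypothetical. The sketch uses true convolution, in which the filter is flipped before sliding; many deep learning libraries instead implement cross-correlation (no flip), under which the signs below would be reversed.

```python
def conv1d_valid(signal, kernel):
    """True (kernel-flipped) 1-D convolution, keeping only fully overlapping positions."""
    k = len(kernel)
    return [
        sum(kernel[m] * signal[j + k - 1 - m] for m in range(k))
        for j in range(len(signal) - k + 1)
    ]

edge_filter = [1, -1]
bright_then_dark = [1.0, 1.0, 0.0, 0.0]  # bright pixels on the left
dark_then_bright = [0.0, 0.0, 1.0, 1.0]  # bright pixels on the right

print(conv1d_valid(bright_then_dark, edge_filter))
print(conv1d_valid(dark_then_bright, edge_filter))
```

The response is zero over flat regions and nonzero only at the position where the brightness changes.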
8.1. CONVOLUTIONAL NEURAL NETWORKS 177
In general, a filter will produce large positive values in the areas of the image which appear most
similar to it. As another example, here is a filter that detects edges at a positive 45-degree angle:
0.6 0.2 0
0.2 0 0.2
0 0.2 0.6
How do conv layers compare to FC layers? Let’s revisit the example from figure 8.2, where the
input is ni = 49 units and the output is no = 25 units. In an FC layer, we would have used
ni × no = 1225 weights, but in our conv layer the filter only has 9 weights! Conv layers use a
significantly smaller number of weights, which reduces the variance of the model significantly while
retaining the expressiveness. This is because we make use of weight sharing:
1. the same weights are shared among all the pixels of the input
2. the individual units of the output layer are all determined by the same weights
Compare this to the fully-connected architecture where for each output unit, there is a separate
weight to learn for each input-output pair. We can illustrate the point for a simple 1-D input-output
example.
Figure 8.4: FC vs. conv layer. Conv layers are equivalent to FC layers, except that (1) all weights outside
the receptive field are 0, and (2) the weights are shared.
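The equivalence in the caption can be checked numerically. The sketch below is illustrative only: the helper name `conv_as_fc_matrix`, the filter values, and the input length are assumptions. It builds the FC weight matrix whose rows are shifted copies of a 1-D filter and compares it against a sliding dot product.

```python
import numpy as np

def conv_as_fc_matrix(filt, n_in):
    """Build the FC weight matrix equivalent to sliding `filt` over n_in inputs."""
    w = len(filt)
    n_out = n_in - w + 1
    W = np.zeros((n_out, n_in))
    for i in range(n_out):
        W[i, i:i + w] = filt   # same weights in every row, zeros elsewhere
    return W

rng = np.random.default_rng(0)
x = rng.normal(size=7)
filt = np.array([0.5, -1.0, 2.0])

W = conv_as_fc_matrix(filt, n_in=len(x))
fc_out = W @ x                                   # FC layer with the structured matrix
conv_out = np.correlate(x, filt, mode="valid")   # sliding dot product, no flip

print(np.allclose(fc_out, conv_out))  # True
```

The matrix has 5 × 7 = 35 entries, but only 3 distinct learnable values, which is the weight-sharing saving in miniature.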
This architecture not only decreases the complexity of our model (there are fewer weights), it is
actually reasonable for image processing because there are repeated patterns in images, i.e. a
filter that can detect some kind of pattern in one area of the image can be used elsewhere in the
image to detect the same pattern.
In practice, we can apply several different filters to the same image to detect different patterns.
For example, we can use a filter that detects horizontal edges, one that detects vertical
edges, and another that detects diagonal edges all at once. Given a W × H × D input image and
k separate w × h × D filters, each filter produces a (W − w + 1) × (H − h + 1) × 1 dimensional
output. These individual outputs are stacked together for a (W − w + 1) × (H − h + 1) × k combined
output.
Figure 8.5: Here, we slid 6 independent 5 × 5 × 3 filters across the original image to produce 6 activation
maps in the next convolutional layer.
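The output dimensions can be confirmed with a direct, unoptimized implementation. This is a sketch under stated assumptions: the function name `conv_layer` is hypothetical, the 32 × 32 × 3 input size is assumed (the caption only gives the filter size), and the filter is applied without flipping.

```python
import numpy as np

def conv_layer(image, filters):
    """Slide k filters of shape (w, h, D) over an image of shape (W, H, D)."""
    W, H, D = image.shape
    k, w, h, _ = filters.shape
    out = np.zeros((W - w + 1, H - h + 1, k))
    for f in range(k):
        for i in range(W - w + 1):
            for j in range(H - h + 1):
                patch = image[i:i + w, j:j + h, :]
                out[i, j, f] = np.sum(patch * filters[f])  # one output unit
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))      # assumed input size
filters = rng.normal(size=(6, 5, 5, 3))   # 6 filters of size 5 x 5 x 3
activation_maps = conv_layer(image, filters)
print(activation_maps.shape)  # (28, 28, 6)
```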
Stacking filters can incur high computational costs. To mitigate this issue, we can stride our
filter across the image by multiple pixels at a time instead of moving it one pixel at a time.
In conjunction with striding, zero-padding the borders of the image is sometimes used to control the
exact dimensions of the convolutional layer.
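The effect of stride and padding on the output size along one axis follows the standard formula ⌊(n + 2p − w)/s⌋ + 1. The helper below is a small sketch; its name and argument names are assumptions, not notation from the text.

```python
def conv_output_size(n, w, stride=1, pad=0):
    """Number of filter positions along one axis of length n."""
    return (n + 2 * pad - w) // stride + 1

print(conv_output_size(7, 3))            # 5: no stride, no padding (W - w + 1)
print(conv_output_size(7, 3, stride=2))  # 3: striding shrinks the output
print(conv_output_size(7, 3, pad=1))     # 7: padding preserves the input size
```

With stride 1 and no padding this reduces to the W − w + 1 expression used above.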
So far, we have introduced convolutional layers as an intuitive and effective approach to extracting
features from images, but one potential disadvantage is that they appear to detect only
“local” features, which is not sufficient to capture complex, global patterns in images. This is not
actually the case, because as we stack more convolutional layers, the effective receptive field of
each successive layer increases. That is, as we go downstream (of the layers), the value of any
single unit is informed by an increasingly large patch of the original image. For example, if we use
two successive layers of 3 × 3 filters, any one unit in the first convolutional layer is informed by 9
separate image pixels. Any one unit in the second convolutional layer is informed by 9 separate
units of the first convolutional layer, which could be informed by up to 9 × 9 = 81 original pixels
(with a stride of 1 the nine patches overlap, so together they cover a 5 × 5 = 25 pixel region). The
increasing receptive field of the successive layers means that the filters in the first few layers extract
local low level features, and the later layers extract global high level features.
Figure 8.6: The highlighted unit in the downstream layer uses information from all the highlighted units in
the input layer.
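For stride-1 layers of k × k filters, the receptive field grows by k − 1 pixels per layer, giving a side length of 1 + n(k − 1) after n layers. The sketch below checks this formula empirically by perturbing one input pixel at a time; all function names are hypothetical, and the all-ones filter is chosen only so that every influencing pixel visibly changes the output.

```python
import numpy as np

def correlate2d_valid(x, filt):
    """Slide a 2-D filter over x with stride 1 and no padding."""
    w, h = filt.shape
    out = np.zeros((x.shape[0] - w + 1, x.shape[1] - h + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + w, j:j + h] * filt)
    return out

def receptive_field_side(num_layers, k=3):
    """Side length of the receptive field for stacked k x k filters, stride 1."""
    return 1 + num_layers * (k - 1)

def count_influencing_pixels(num_layers, size=9, k=3):
    """Count input pixels that change one output unit of the last layer."""
    filt = np.ones((k, k))

    def forward(x):
        for _ in range(num_layers):
            x = correlate2d_valid(x, filt)
        return x[0, 0]

    base = forward(np.zeros((size, size)))
    count = 0
    for i in range(size):
        for j in range(size):
            x = np.zeros((size, size))
            x[i, j] = 1.0                  # switch on a single pixel
            if forward(x) != base:
                count += 1
    return count

print(receptive_field_side(2), count_influencing_pixels(2))  # 5 25
print(receptive_field_side(3), count_influencing_pixels(3))  # 7 49
```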
Pooling Layers
In line with convolutional layers reducing the number of weights in neural networks to reduce
variance, pooling layers directly reduce the number of neurons in neural networks. The sole
purpose of a pooling layer is to downsample (also known as pool, gather, consolidate) the previous
layer, by sliding a fixed window across a layer and choosing one value that effectively “represents”
all of the units captured by the window. There are two common implementations of pooling. In
max-pooling, the representative value just becomes the largest of all the units in the window, while
in average-pooling, the representative value is the average of all the units in the window. In practice,
we stride pooling layers across the image with the stride equal to the size of the pooling window. None
of these operations involves any weights, unlike fully connected and convolutional layers.
Orthogonal to the choice between max and average pooling is the choice between spatial and cross-
channel pooling. Spatial pooling pools values within the same channel, which induces translational
invariance in our model and adds generalization capabilities. In the following figure, we can see
that even though the input layer of the right image is a translated version of the input layer of the
left image, due to spatial pooling the next layer looks more or less the same.
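A minimal sketch of spatial pooling with stride equal to the window size follows. The function name `pool2d` and the toy 4 × 4 inputs are assumptions for illustration. It shows that a small shift that stays within one pooling window leaves the pooled output unchanged; larger shifts would move the response to a neighboring output unit.

```python
import numpy as np

def pool2d(x, size, mode="max"):
    """Pool a 2-D array with a size x size window and stride equal to size."""
    H, W = x.shape
    # Split the array into non-overlapping size x size blocks.
    blocks = x[:H - H % size, :W - W % size].reshape(H // size, size, W // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

left = np.zeros((4, 4))
left[0, 0] = 1.0          # a single bright pixel
right = np.zeros((4, 4))
right[1, 1] = 1.0         # the same pixel shifted by one row and one column

print(np.array_equal(pool2d(left, 2), pool2d(right, 2)))  # True
print(pool2d(left, 2, mode="avg")[0, 0])                  # 0.25
```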
Cross-channel pooling pools values across different channels, which induces transformational in-
variance in our model, again adding generalization capabilities. To illustrate the point, consider an
example with a convolutional layer represented by 3 filters. Suppose each can detect the number
5 in some degree of rotation. If we pooled across the three channels determined by these filters,
then no matter what orientation of the number “5” we got as input to our CNN, the pooling layer
would have a large response!
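Cross-channel max-pooling is simply a maximum over the channel axis at each spatial position. In the sketch below the response values are invented for illustration: three hypothetical filters each respond to a different rotation, and the pooled output is the same regardless of which one fires.

```python
import numpy as np

# Hypothetical responses of 3 filters on a 2 x 2 spatial grid,
# stored as (height, width, channels).
upright = np.zeros((2, 2, 3))
upright[0, 1, 0] = 0.9     # the first filter fires at position (0, 1)

rotated = np.zeros((2, 2, 3))
rotated[0, 1, 2] = 0.9     # the third filter fires at the same position

pooled_upright = upright.max(axis=2)   # max over channels at each position
pooled_rotated = rotated.max(axis=2)

print(np.array_equal(pooled_upright, pooled_rotated))  # True
```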
Just like MLPs, we can use the Backpropagation algorithm to train CNNs as well. We simply
have to compute partial derivatives for conv and pool layers, just as we did for FC layers and
non-linearities.
Let’s denote the error function as f . In classification tasks, this is typically the cross-entropy loss.
The forward pass gives us the input I to the layer. The backward pass gives us the
partial derivatives of the error f with respect to the output L of the layer, $\frac{\partial f}{\partial L}$. Without additional
knowledge about the error function after L, we can compute the derivatives with respect to elements
in the input I and filter G using the chain rule. Specifically, we have
$$\frac{\partial f}{\partial G_c[x, y]} = \frac{\partial f}{\partial L} \frac{\partial L}{\partial G_c[x, y]}$$
and
$$\frac{\partial f}{\partial I_c[x, y]} = \frac{\partial f}{\partial L} \frac{\partial L}{\partial I_c[x, y]}$$
where Gc [x, y] denotes the entry in the filter for color c at position (x, y) and similarly for Ic [x, y].
From the equation for discrete convolution, we can compute the derivatives for each entry (i, j) in
L as
$$\frac{\partial L[i, j]}{\partial G_c[x, y]} = \frac{\partial}{\partial G_c[x, y]} \sum_{a=1}^{w} \sum_{b=1}^{h} \sum_{c \in \{r, g, b\}} I_c[i + a, j + b] \cdot G_c[a, b] = I_c[i + x, j + y]$$
For the input image, we similarly compute the derivative as
$$\frac{\partial L[i, j]}{\partial I_c[x, y]} = \frac{\partial}{\partial I_c[x, y]} \sum_{a=1}^{w} \sum_{b=1}^{h} \sum_{c \in \{r, g, b\}} I_c[i + a, j + b] \cdot G_c[a, b] = G_c[x - i, y - j]$$
where we have i + a = x and j + b = y. When x − i or y − j go outside the boundary of the filter,
we can treat the derivative as zero.
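We can collect the derivatives of the filter parameter for all L[i, j] into a vector and multiply it by
the derivatives we computed to get
$$\frac{\partial f}{\partial G_c[x, y]} = \frac{\partial f}{\partial L} \cdot \frac{\partial L}{\partial G_c[x, y]} = \sum_{i,j} \frac{\partial f}{\partial L[i, j]} \frac{\partial L[i, j]}{\partial G_c[x, y]} = \sum_{i,j} \frac{\partial f}{\partial L[i, j]} I_c[i + x, j + y]$$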
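The two derivative formulas translate directly into an accumulation over output positions. The following is a minimal sketch, assuming a single filter, zero-based indices, channel-first arrays, and a random stand-in for the upstream gradient; the function names are hypothetical. A finite-difference check on one filter entry is included as a sanity test.

```python
import numpy as np

def conv_forward(I, G):
    """L[i, j] = sum over a, b, c of I[c, i + a, j + b] * G[c, a, b]."""
    C, H, W = I.shape
    _, h, w = G.shape
    L = np.zeros((H - h + 1, W - w + 1))
    for i in range(L.shape[0]):
        for j in range(L.shape[1]):
            L[i, j] = np.sum(I[:, i:i + h, j:j + w] * G)
    return L

def conv_backward(I, G, dL):
    """Given dL = df/dL, return df/dG and df/dI via the chain rule."""
    dG = np.zeros_like(G)
    dI = np.zeros_like(I)
    _, h, w = G.shape
    for i in range(dL.shape[0]):
        for j in range(dL.shape[1]):
            # dL[i, j] / dG_c[x, y] = I_c[i + x, j + y]
            dG += dL[i, j] * I[:, i:i + h, j:j + w]
            # dL[i, j] / dI_c[x, y] = G_c[x - i, y - j], zero outside the filter
            dI[:, i:i + h, j:j + w] += dL[i, j] * G
    return dG, dI

rng = np.random.default_rng(0)
I = rng.normal(size=(3, 5, 5))
G = rng.normal(size=(3, 2, 2))
dL = rng.normal(size=(4, 4))      # stand-in for the upstream gradient df/dL

dG, dI = conv_backward(I, G, dL)

# Finite-difference check on one filter entry, using f = sum(dL * L).
eps = 1e-6
G_plus = G.copy()
G_plus[1, 0, 1] += eps
numeric = (np.sum(dL * conv_forward(I, G_plus)) - np.sum(dL * conv_forward(I, G))) / eps
print(np.isclose(numeric, dG[1, 0, 1], atol=1e-4))  # True
```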
Since pooling layers do not involve any weights, we only need to calculate partial derivatives with
respect to the input:
$$\frac{\partial f}{\partial I_c[x, y]}$$
Through the chain rule, we have that
$$\frac{\partial f}{\partial I_c[x, y]} = \frac{\partial f}{\partial L} \cdot \frac{\partial L}{\partial I_c[x, y]}$$
and now the problem entails finding $\frac{\partial L}{\partial I_c[x, y]}$. Computing this derivative depends on the stride,
orientation, and nature of the pooling, but in the case of max-pooling the output is simply a maximum
of inputs: $L = \max(I_1, I_2, \ldots, I_n)$, and in this case we have
$$\frac{\partial L}{\partial I_j} = \mathbf{1}(I_j = \max(I_1, I_2, \ldots, I_n))$$
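For a single pooling window, the indicator formula means the upstream gradient is routed only to the input that attained the maximum. A small sketch, with a hypothetical function name and made-up window values:

```python
import numpy as np

def maxpool_backward(window, dL):
    """Gradient of f w.r.t. each input of one pooling window, given dL = df/dL."""
    mask = (window == window.max()).astype(float)   # indicator 1(I_j = max)
    return dL * mask

window = np.array([0.2, 1.5, -0.3, 0.7])
grad = maxpool_backward(window, dL=2.0)
print(grad)  # [0. 2. 0. 0.]
```

If several inputs tie for the maximum, this sketch follows the indicator literally and passes the gradient to each of them; implementations often pick a single winner instead.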
Convolutional Neural Networks were first applied successfully to the ImageNet challenge in 2012
and continue to outperform computer vision techniques that do not use neural networks. Here are
a few of the architectures that have been developed over the years.
Key characteristics:
Figure 8.10: AlexNet architecture. Reference: “ImageNet Classification with Deep Convolutional Neural
Networks,” NIPS 2012.
Key characteristics:
• Conv filters of varying sizes - for example, the first layer has 11 × 11 conv filters
• Early prominent use of ReLU, which mitigates the problem of saturating gradients in the
then-predominant tanh activation.
• Several layers of convolution, max pooling, some normalization. Three fully connected layers
at the end of the network (these comprise the majority of the weights in the network).
• Around 60 million weights, over half of which are in the first fully connected layer following
the last convolution.
8.2. CNN ARCHITECTURES 183
• Trained over two GPUs: the top and bottom divisions in Figure 8.10 were due to the need
to separate training onto two GPUs. There was limited communication between the GPUs,
as illustrated by the arrows that go between the top and bottom.
• Dropout in first two FC layers — prevents overfitting
• Heavy data augmentation. One form is image translation and reflection: for example, an
elephant facing the left is the same class as an elephant facing the right. The second form is
altering the intensity of RGB color channels: different cameras can have different lighting on
the same objects, so it is necessary to account for this.
Reference paper: “Very Deep Convolutional Networks for Large-Scale Image Recognition,” ICLR
2015.¹ Key characteristics:
• Only uses 3×3 convolutional filters. Blocks of conv-conv-conv-pool layers are stacked together,
followed by fully connected layers at the end (the number of convolutional layers between
pooling layers can vary). Note that a stack of three 3 × 3 conv filters has the same effective
receptive field as one 7 × 7 conv filter. To see this, imagine sliding a 3 × 3 filter over a 7 × 7
image: the result is a 5 × 5 image. Do this twice more and the result is a 1 × 1 cell; sliding
one 7 × 7 filter over the original image would also result in a 1 × 1 cell. The computational cost
of the 3 × 3 filters is lower: a stack of 3 such filters over C channels requires 3 · (3²C) weights
(not including bias weights), while one 7 × 7 filter would incur a higher cost of 7²C learned
weights. Deeper, narrower networks can introduce more non-linearities than shallower,
wider networks due to the repeated composition of activation functions.
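Written out, the comparison between the two designs is as follows (counting weights per output channel, as in the text, and ignoring biases):

```latex
\begin{align*}
\text{three stacked } 3 \times 3 \text{ filters:} \quad & 3 \cdot (3^2 C) = 27C \text{ weights}, \quad \text{receptive field } 1 + 3(3 - 1) = 7 \\
\text{one } 7 \times 7 \text{ filter:} \quad & 7^2 C = 49C \text{ weights}, \quad \text{receptive field } 7 \\
\text{ratio:} \quad & \frac{27C}{49C} \approx 0.55
\end{align*}
```

So the stacked design covers the same 7 × 7 region with roughly 55% of the weights.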
GoogLeNet, also codenamed “Inception.”² Published in CVPR 2015 as “Going Deeper with Convolutions.”
Key characteristics:
1 VGG stands for the “Visual Geometry Group” at Oxford where this was developed.
2 “In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception,
which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous we need to go
deeper internet meme [1].” The authors seem to be meme-friendly.
• Deeper than previous networks (22 layers), but more computationally efficient (5 million
parameters - no fully connected layers).
• Network is composed of stacked sub-networks called “Inception modules.” The naive Incep-
tion module (a) runs convolutional layers in parallel and concatenates the filters together.
However, this can be computationally inefficient. The dimensionality reduction Inception
module (b) performs 1 × 1 convolutions that act as dimensionality reduction. This lowers the
computational cost and makes it tractable to stack many Inception modules together.
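A 1 × 1 convolution is a linear map across channels applied independently at every pixel, which is why it can shrink the depth of a feature map before a more expensive filter is applied. The sketch below uses illustrative sizes (a 28 × 28 × 192 feature map reduced to 16 channels before a 5 × 5 convolution with 32 outputs); these numbers and the function name `conv1x1` are assumptions chosen for the example.

```python
import numpy as np

def conv1x1(x, weights):
    """Apply a 1 x 1 convolution: a linear map over channels at every pixel."""
    # x has shape (H, W, C_in); weights has shape (C_in, C_out).
    return np.einsum("hwc,ck->hwk", x, weights)

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28, 192))     # assumed feature map size
reduce = rng.normal(size=(192, 16))    # assumed reduction to 16 channels
reduced = conv1x1(x, reduce)
print(reduced.shape)  # (28, 28, 16)

# Weight counts for a 5 x 5 conv producing 32 channels (biases ignored).
direct_weights = 5 * 5 * 192 * 32                  # applied to all 192 channels
reduced_weights = 192 * 16 + 5 * 5 * 16 * 32       # 1 x 1 reduction, then 5 x 5
print(direct_weights, reduced_weights)  # 153600 15872
```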
Figure 8.12: Building block for the ResNet from “Deep Residual Learning for Image Recognition,” CVPR
2016. If the desired function to be learned is H(x), we instead learn the residual F(x) := H(x) − x, so the
output of the network is F(x) + x = H(x).
Key characteristics:
• Very deep (152 layers). Residual blocks (Figure 8.12) are stacked together - each individual
weight layer in the residual block is implemented as a 3 × 3 convolution. There are no FC
layers until the final layer.
• Residual blocks solve the “vanishing gradient” problem: the gradient signal diminishes in
layers that are farther away from the end of the network. Let L be the loss, Y be the output
at a layer, x be the input. Regular neural networks have gradients that look like
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial x}$$
but the derivative of Y with respect to x can be small. If we use a residual block where
Y = F (x) + x, we have
$$\frac{\partial Y}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$$
The +x term in the residual block always provides some default gradient signal so the signal
is still backpropagated to the front of the network. This allows the network to be very deep.
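The effect of the +1 term can be seen in a toy scalar chain. This is a sketch under strong simplifying assumptions: each layer is the scalar function F(x) = weight · tanh(x) with a shared weight, which is not how a real residual block is built, and the function name is hypothetical. The end-to-end gradient is the product of the per-layer derivatives.

```python
import numpy as np

def end_to_end_gradient(x, num_layers, weight, residual):
    """d(output)/d(input) for a chain of scalar layers F(x) = weight * tanh(x)."""
    grad = 1.0
    for _ in range(num_layers):
        dF = weight * (1.0 - np.tanh(x) ** 2)       # dF/dx at the current input
        grad *= (dF + 1.0) if residual else dF       # dY/dx = dF/dx + 1 with a skip
        x = weight * np.tanh(x) + (x if residual else 0.0)
    return grad

plain = end_to_end_gradient(0.5, num_layers=50, weight=0.5, residual=False)
skip = end_to_end_gradient(0.5, num_layers=50, weight=0.5, residual=True)
print(plain < 1e-12, skip >= 1.0)  # True True
```

Without the skip connection every factor is at most 0.5, so the product collapses toward zero over 50 layers; with it, every factor is at least 1 in this toy setting.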
To conclude this section, we note that the winning ImageNet architectures have all increased in
depth over the years. While both shallow and deep neural networks are known to be universal
function approximators, there is growing empirical and theoretical evidence that deep neural net-
works can require fewer (even exponentially fewer) parameters than shallow nets to achieve the
same approximation performance. There is also evidence that deep neural networks possess better
generalization capabilities than their shallow counterparts. The performance, generalization, and
optimization benefits of adding more layers are an ongoing topic of theoretical research.
8.3. VISUALIZING AND UNDERSTANDING CNNS 185
We know that a convolutional net learns features, but these may not be directly useful to visualize.
There are several methods available that enable us to better understand what convolutional nets
actually learn. These include:
• Visualizing filters - can give an idea of what types of features the network learns, such as
edge detectors. This only works in the first layer.
• Visualizing activations - can see sparsity in the responses as the depth increases. One can
also visualize the feature map before a
fully connected layer by conducting a nearest neighbor search in feature space. This helps
to determine if the features learned by the CNN are useful - for example, in pixel space, an
elephant on the left side of the image would not be a neighbor of an elephant on the right side
of the image, but in a translation-invariant feature space these pictures might be neighbors.
• Reconstruction by deconvolution - isolate an activation and reconstruct the original image
based on that activation alone to determine its effect.
• Activation maximization - find the input that maximally activates a given neuron, a
computational version of Hubel and Wiesel’s experiment
• Saliency maps - find what locations in the image make a neuron fire
• Code inversion - given a feature representation, determine the original image
• Semantic interpretation - interpret the activations semantically (for example, is the CNN
determining whether or not an object is shiny when it is trying to classify?)
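As one concrete illustration of the saliency-map idea, the sensitivity of a neuron to each pixel can be estimated from the gradient of its activation with respect to the input. The sketch below is only an assumption-laden toy: the "neuron" is a single ReLU filter response on the top-left patch, the function names are hypothetical, and the gradient is approximated by finite differences rather than backpropagation.

```python
import numpy as np

def neuron(image, filt):
    """Toy neuron: ReLU of one filter response at the top-left patch."""
    h, w = filt.shape
    return max(0.0, float(np.sum(image[:h, :w] * filt)))

def saliency_map(image, filt, eps=1e-5):
    """Finite-difference estimate of |d neuron / d pixel| for every pixel."""
    sal = np.zeros_like(image)
    base = neuron(image, filt)
    for idx in np.ndindex(image.shape):
        bumped = image.copy()
        bumped[idx] += eps                 # nudge one pixel
        sal[idx] = abs(neuron(bumped, filt) - base) / eps
    return sal

image = np.ones((4, 4))
filt = np.array([[1.0, 2.0], [0.0, -1.0]])
sal = saliency_map(image, filt)
print(np.round(sal, 3))
```

Pixels outside the neuron's receptive field get zero saliency, and pixels inside it get a saliency equal to the magnitude of the corresponding filter weight, since the neuron is active on this input.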