Note 2: EECS 189 Introduction To Machine Learning Fall 2020
h_w(x) = g(x, w)
for some other function g. The hypothesis class is then the set of all functions induced by the
possible choices of the parameters w:
H = {h_w | w ∈ R^d}
After designating a cost function L, which measures how poorly the predictions ŷ of the hypothesis
match the true output y, we can proceed to search for the parameters that best fit the data by
minimizing this function:
w∗ = arg min_w L(w)
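As a concrete illustration (not part of the note's derivation), here is a minimal numpy sketch of this parametric setup, assuming a linear hypothesis g(x, w) = wᵀx and a squared-error cost; the function names are placeholders.

import numpy as np

def h(w, x):
    # Parametric hypothesis h_w(x) = g(x, w); here g is assumed linear: g(x, w) = w^T x.
    return x @ w

def L(w, X, y):
    # Cost function: total squared error of the predictions against the true outputs y.
    return np.sum((X @ w - y) ** 2)

# Finding w* = arg min_w L(w) is then an ordinary optimization problem, which for this
# particular choice of g and L turns out to have a closed-form solution (derived below).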
In words, the matrix X ∈ R^{n×d} has the input datapoint x_i as its ith row. This matrix is sometimes called the design matrix. Usually n ≥ d, meaning that there are more datapoints than measurements.
There will in general be no exact solution to the equation y = Xw (even if the data were perfect,
consider how many equations and variables there are), but we can find an approximate solution by
minimizing the sum (or equivalently, the mean) of the squared errors:
min_w L(w) = min_w Σ_{i=1}^{n} (x_iᵀw − y_i)² = min_w ‖Xw − y‖₂²
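As a quick sanity check on synthetic data (purely illustrative), the summation form and the squared-norm form of this objective agree:

import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.standard_normal((n, d))   # design matrix: one datapoint per row
y = rng.standard_normal(n)
w = rng.standard_normal(d)

sum_form = sum((X[i] @ w - y[i]) ** 2 for i in range(n))
norm_form = np.linalg.norm(X @ w - y) ** 2
assert np.isclose(sum_form, norm_form)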
Now that we have formulated an optimization problem, we want to go about solving it. We will see
that the particular structure of OLS allows us to compute a closed-form expression for a globally
optimal solution, which we denote w∗_ols.
Using the gradient identities

∇_x(aᵀx) = a
∇_x(xᵀAx) = (A + Aᵀ)x

we can compute the gradient of the OLS objective:

∇L(w) = ∇_w(‖Xw − y‖₂²) = ∇_w(wᵀXᵀXw − 2yᵀXw + yᵀy)
      = (XᵀX + (XᵀX)ᵀ)w − 2Xᵀy
      = 2XᵀXw − 2Xᵀy

where in the last line we have used the symmetry of XᵀX to simplify XᵀX + (XᵀX)ᵀ = 2XᵀX. Setting the gradient to 0, we conclude that any optimum w∗_ols satisfies

XᵀXw∗_ols = Xᵀy

If X is full rank, then XᵀX is as well (assuming n ≥ d), so we can solve for the unique solution

w∗_ols = (XᵀX)⁻¹Xᵀy
Note: Although we write (XᵀX)⁻¹, in practice one would not actually compute the inverse; it
is more numerically stable to solve the linear system of equations above (e.g. with Gaussian
elimination).
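For instance, a minimal numpy sketch (synthetic data; variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # design matrix, n = 100, d = 5
y = rng.standard_normal(100)

# Solve the normal equations X^T X w = X^T y with an LU-based linear solve
# rather than forming (X^T X)^{-1} explicitly.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq avoids forming X^T X altogether and is more stable still.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_ols, w_lstsq)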
In this derivation we have used the condition ∇L(w∗ ) = 0, which is a necessary but not sufficient
condition for optimality. We found a critical point, but in general such a point could be a local
minimum, a local maximum, or a saddle point. Fortunately, in this case the objective function
is convex, which implies that any critical point is indeed a global minimum. To show that L is
convex, it suffices to compute the Hessian of L, which in this case is

∇²L(w) = 2XᵀX

and observe that it is positive semidefinite: for any w, wᵀ(2XᵀX)w = 2‖Xw‖₂² ≥ 0.
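A one-line numerical sanity check (synthetic X, purely illustrative) that this Hessian is positive semidefinite:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
H = 2 * X.T @ X
# All eigenvalues of the Hessian are (numerically) nonnegative.
print(np.all(np.linalg.eigvalsh(H) >= -1e-10))   # True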
The OLS solution can also be derived geometrically, using orthogonal projection onto a subspace. For the orthogonal projection P_S v of a vector v onto a subspace S, we have

‖v − P_S v‖ ≤ ‖v − s‖ for all s ∈ S

that is,

P_S v = arg min_{s∈S} ‖v − s‖

To see this, note that v − P_S v is orthogonal to S, so for any s ∈ S the Pythagorean theorem gives

‖v − s‖² = ‖(v − P_S v) + (P_S v − s)‖² = ‖v − P_S v‖² + ‖P_S v − s‖² ≥ ‖v − P_S v‖²

with equality holding if and only if ‖P_S v − s‖² = 0, i.e. s = P_S v. Taking square roots on both sides gives ‖v − s‖ ≥ ‖v − P_S v‖ as claimed (since norms are nonnegative).
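A small numerical illustration (synthetic data) of this closest-point property, taking S to be the column space of a random matrix A:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))            # S = range(A), a 2-dimensional subspace of R^5
v = rng.standard_normal(5)

# The projection of v onto S is the least-squares fit of v by the columns of A.
c, *_ = np.linalg.lstsq(A, v, rcond=None)
Pv = A @ c

print(A.T @ (v - Pv))                      # ~ 0: the residual is orthogonal to S
s = A @ rng.standard_normal(2)             # an arbitrary point of S
print(np.linalg.norm(v - Pv) <= np.linalg.norm(v - s))   # True: Pv is the closest point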
In the OLS case,
w∗_ols = arg min_w ‖Xw − y‖₂²
But observe that the set of vectors that can be written Xw for some w ∈ R^d is precisely the range of X, which we know to be a subspace of R^n, so

min_{z∈range(X)} ‖z − y‖₂² = min_{w∈R^d} ‖Xw − y‖₂²

By pattern matching with the earlier optimality statement about P_S, we observe that P_range(X) y = Xw∗_ols, where w∗_ols is any optimum for the right-hand side. The projected point Xw∗_ols is always unique, but if X is full rank (again assuming n ≥ d), then the optimum w∗_ols is also unique (as
expected). This is because X being full rank means that the columns of X are linearly independent,
in which case there is a one-to-one correspondence between w and Xw.
To solve for w∗_ols, we need the following fact, which is often stated as part of the Fundamental Theorem of Linear Algebra:
null(Xᵀ) = range(X)⊥

Since we are projecting onto range(X), the orthogonality condition for optimality is that y − P_range(X) y ⊥ range(X), i.e. y − Xw∗_ols ∈ null(Xᵀ). This leads to the equation

Xᵀ(y − Xw∗_ols) = 0

which is equivalent to

XᵀXw∗_ols = Xᵀy
as before.
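We can check this projection picture numerically. The following sketch (synthetic data) verifies that Xw∗_ols equals the orthogonal projection of y onto range(X), computed here from an orthonormal basis obtained via a QR factorization:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))    # full column rank with probability 1
y = rng.standard_normal(30)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)

Q, _ = np.linalg.qr(X)              # columns of Q: orthonormal basis of range(X)
proj_y = Q @ (Q.T @ y)              # orthogonal projection of y onto range(X)

print(np.allclose(X @ w_ols, proj_y))           # True
print(np.allclose(X.T @ (y - X @ w_ols), 0))    # True: the residual lies in null(X^T)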
2 Ridge Regression
While Ordinary Least Squares can be used for solving linear least squares problems, it falls short
due to numerical instability and generalization issues. Numerical instability arises when the features of the data are close to collinear (leading to nearly linearly dependent feature columns), causing the
input matrix X to lose rank or have singular values that are very close to 0. Why are small singular values bad? Let us illustrate this via the singular value decomposition (SVD) of X:
X = UΣVᵀ

where U ∈ R^{n×n}, Σ ∈ R^{n×d}, V ∈ R^{d×d}. In the context of OLS, we must have that XᵀX is invertible, or equivalently, rank(XᵀX) = rank(Xᵀ) = rank(X) = d. Assuming that X has full column rank d, we can express the SVD of X as
\[
X = U \begin{bmatrix} \Sigma_d \\ 0 \end{bmatrix} V^\top
\]

where Σ_d ∈ R^{d×d} is a diagonal matrix with strictly positive entries. Now let's try to expand the (XᵀX)⁻¹ term in OLS using the SVD of X:

\[
\begin{aligned}
(X^\top X)^{-1}
&= \left( V \begin{bmatrix} \Sigma_d & 0 \end{bmatrix} U^\top U \begin{bmatrix} \Sigma_d \\ 0 \end{bmatrix} V^\top \right)^{-1} \\
&= \left( V \begin{bmatrix} \Sigma_d & 0 \end{bmatrix} I \begin{bmatrix} \Sigma_d \\ 0 \end{bmatrix} V^\top \right)^{-1} \\
&= \left( V \Sigma_d^2 V^\top \right)^{-1}
 = (V^\top)^{-1} (\Sigma_d^2)^{-1} V^{-1}
 = V \Sigma_d^{-2} V^\top
\end{aligned}
\]
This means that (XᵀX)⁻¹ has singular values that are the inverses of the squared singular values of X, which become extremely large when the singular values of X are close to 0. Such excessively large singular values are very problematic for numerical stability. In addition, the resulting abnormally large entries in the optimal solution w prevent OLS from generalizing to unseen data.
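To see the instability concretely, here is a small sketch with synthetic, nearly collinear features (purely illustrative); the OLS weights blow up even though the underlying signal is simple:

import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 1e-6 * rng.standard_normal(n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + 0.01 * rng.standard_normal(n)    # the true signal depends only on x1

print(np.linalg.cond(X))                  # huge: the smallest singular value of X is tiny
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ols)                              # typically has entries of very large magnitude,
                                          # even though weights near (1, 0) explain the data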
There is a very simple solution to these issues: discourage the entries of w from becoming too large. We can do this by adding a penalty term on the norm of w to the objective. For a fixed, small scalar λ > 0, we now have:
min_w ‖Xw − y‖₂² + λ‖w‖₂²
Note that the λ in our objective function is a hyperparameter that controls how heavily we penalize the size of the entries of w. Just like the degree in polynomial features, λ is a value that we must choose, typically through validation. Let's expand the terms of the objective function:

L(w) = ‖Xw − y‖₂² + λ‖w‖₂² = wᵀXᵀXw − 2wᵀXᵀy + yᵀy + λwᵀw

Finally, take the gradient of the objective and find the value of w for which the gradient is 0:
∇_w L(w) = 0
2XᵀXw − 2Xᵀy + 2λw = 0
(XᵀX + λI)w = Xᵀy
w∗_ridge = (XᵀX + λI)⁻¹Xᵀy
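Here is a minimal numpy sketch of this closed-form ridge solution (synthetic data; the value of λ is arbitrary and would be chosen by validation in practice):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
lam = 0.1                                  # hyperparameter λ

d = X.shape[1]
# Solve (X^T X + λI) w = X^T y; as before, prefer a linear solve to an explicit inverse.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

For nearly collinear features like the earlier example, the λI term keeps the linear system well conditioned, so w_ridge stays moderate where w_ols blew up.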
This value is guaranteed to achieve the (unique) global minimum, because the objective function is strongly convex. To show that L is strongly convex, it suffices to compute the Hessian of L, which in this case is

∇²L(w) = 2XᵀX + 2λI

and show that this is positive definite (PD): for any w ≠ 0,

wᵀ(2XᵀX + 2λI)w = 2‖Xw‖₂² + 2λ‖w‖₂² ≥ 2λ‖w‖₂² > 0
Since the Hessian is positive definite, we can equivalently say that the eigenvalues of the Hessian
are strictly positive and that the objective function is strongly convex. A useful property of strongly
convex functions is that they have a unique optimum point, so the solution to ridge regression is
unique. We cannot make such guarantees about ordinary least squares, because the corresponding
Hessian could have eigenvalues that are 0. Let us explore the case in OLS when the Hessian has a
0 eigenvalue. In this context, the term XᵀX is not invertible, but this does not imply that no solution exists! In OLS, a solution always exists: when the Hessian is PD that solution is unique, and when the Hessian is only PSD (with a 0 eigenvalue), there are infinitely many solutions. (There always exists a solution to the equation XᵀXw = Xᵀy, because the range of XᵀX and the range of Xᵀ are equal; since Xᵀy lies in the range of Xᵀ, it must also lie in the range of XᵀX, and therefore there always exists a w satisfying XᵀXw = Xᵀy.)
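As a quick illustration of this rank-deficient (merely PSD) case, the following sketch (synthetic data) builds an X with a duplicated column, so that XᵀX is singular, and checks that the normal equations still have solutions; np.linalg.lstsq returns one of them (the minimum-norm one):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 2))
X = np.column_stack([A, A[:, 0]])         # third column duplicates the first: rank(X) = 2 < d = 3
y = rng.standard_normal(50)

print(np.linalg.matrix_rank(X.T @ X))     # 2, so X^T X is not invertible
w = np.linalg.lstsq(X, y, rcond=None)[0]  # one valid solution (the minimum-norm one)
print(np.allclose(X.T @ X @ w, X.T @ y))  # True: it satisfies the normal equations

# Adding any vector in the null space of X gives another solution, e.g.:
z = np.array([1.0, 0.0, -1.0])            # X @ z = 0
print(np.allclose(X.T @ X @ (w + z), X.T @ y))   # True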
The technique we just described is known as ridge regression. Note that the expression XᵀX + λI is now invertible, regardless of the rank of X. Let's find (XᵀX + λI)⁻¹ through the SVD of X, this time writing X = UΣVᵀ where Σ ∈ R^{n×d} has the invertible diagonal block Σ_r ∈ R^{r×r} (with r = rank(X)) in its top-left corner and zeros elsewhere:
\[
\begin{aligned}
(X^\top X + \lambda I)^{-1}
&= \left( V \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix}^\top U^\top U \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} V^\top + \lambda I \right)^{-1} \\
&= \left( V \begin{bmatrix} \Sigma_r^2 & 0 \\ 0 & 0 \end{bmatrix} V^\top + \lambda I \right)^{-1} \\
&= \left( V \begin{bmatrix} \Sigma_r^2 & 0 \\ 0 & 0 \end{bmatrix} V^\top + V (\lambda I) V^\top \right)^{-1} \\
&= \left( V \left( \begin{bmatrix} \Sigma_r^2 & 0 \\ 0 & 0 \end{bmatrix} + \lambda I \right) V^\top \right)^{-1} \\
&= \left( V \begin{bmatrix} \Sigma_r^2 + \lambda I & 0 \\ 0 & \lambda I \end{bmatrix} V^\top \right)^{-1} \\
&= (V^\top)^{-1} \begin{bmatrix} \Sigma_r^2 + \lambda I & 0 \\ 0 & \lambda I \end{bmatrix}^{-1} V^{-1} \\
&= V \begin{bmatrix} (\Sigma_r^2 + \lambda I)^{-1} & 0 \\ 0 & \tfrac{1}{\lambda} I \end{bmatrix} V^\top
\end{aligned}
\]
Now with our slight tweak, the matrix XᵀX + λI has become full rank and thus invertible. The singular values of (XᵀX + λI)⁻¹ are 1/(σ_i² + λ) and 1/λ, meaning that they are guaranteed to be at most 1/λ, solving our numerical instability issues. Furthermore, we have partially solved the
overfitting issue. By penalizing the norm of w, we keep the weights corresponding to relevant features that capture the main structure of the true model, while shrinking the weights corresponding to complex features that only serve to fine-tune the model and fit noise in the data.
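A short numerical check of the bound just derived (synthetic X; purely illustrative): the singular values of (XᵀX + λI)⁻¹ should all be at most 1/λ.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X[:, 3] = X[:, 0]                  # make X rank-deficient so the 1/λ value actually appears
lam = 0.5

d = X.shape[1]
M = np.linalg.inv(X.T @ X + lam * np.eye(d))
svals = np.linalg.svd(M, compute_uv=False)
print(svals)                                # largest value equals 1/λ = 2.0 (up to rounding)
print(np.all(svals <= 1 / lam + 1e-12))     # True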