Lecture 0.2 - Linear Methods For Regression, Optimization
In supervised learning:
- There is an input x ∈ X, typically a vector of features (or covariates).
- There is a target t ∈ T (also called response, outcome, output, class).
- The objective is to learn a function f : X → T such that t ≈ y = f(x), based on some data D = {(x^(i), t^(i)) : i = 1, 2, ..., N}.
Model: y = wx + b, where w, x, b ∈ R.
- y is the prediction
- w is the weight
- b is the bias (or intercept)
- w and b together are the parameters
We hope that our prediction is close to the target: y ≈ t.
y is linear in x.
[Figure: the line y = wx + b plotted over the data; x-axis: features]
If we have D features:
y = wᵀx + b, where w, x ∈ R^D and b ∈ R.
y is linear in x.
Different (w, b) define different lines. We want the "best" line (w, b).
How to quantify "best"?
[Figure: two panels showing different candidate lines through the same data; x-axis: features]
Squared error loss: L(y, t) = ½(y − t)²
Model: y = wᵀx + b
Why vectorize? The vectorized form is simpler and executes much faster.
- The equations, and the code, will be simpler and more readable.
- It gets rid of dummy variables/indices!
- Vectorized code is much faster:
  - It cuts down on Python interpreter overhead.
  - It uses highly optimized linear algebra libraries (with hardware support).
  - Matrix multiplication is very fast on a GPU (Graphics Processing Unit).
- Switching in and out of vectorized form is a skill you gain with practice:
  - Some derivations are easier to do element-wise.
  - Some algorithms are easier to write/understand using for-loops, then vectorize later for performance (see the sketch after this list).
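For concreteness, here is the same prediction computed both ways. This is a minimal sketch; the function names are illustrative, not from the lecture:

```python
import numpy as np

def predict_loop(X, w, b):
    """Compute y^(i) = w^T x^(i) + b with explicit Python loops."""
    N, D = X.shape
    y = np.zeros(N)
    for i in range(N):
        for j in range(D):
            y[i] += w[j] * X[i, j]
        y[i] += b
    return y

def predict_vectorized(X, w, b):
    """The same computation as a single matrix-vector product."""
    return X @ w + b

# The two agree, but the vectorized version is far faster for large N and D.
X = np.random.randn(1000, 20)
w = np.random.randn(20)
b = 0.5
assert np.allclose(predict_loop(X, w, b), predict_vectorized(X, w, b))
```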
Stacking all N examples: y = Xw + b1, where the i-th row of X is (x^(i))ᵀ, so y^(i) = wᵀx^(i) + b.
Direct Solution II: Calculus
Set the gradient of the cost to zero and solve for the optimal weights w* (absorbing the bias b into X as a column of ones):
Xᵀ(y* − t) = 0, where y* = Xw*
XᵀXw* − Xᵀt = 0
XᵀXw* = Xᵀt
w* = (XᵀX)⁻¹Xᵀt
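A minimal sketch of this closed-form solution in NumPy (names and data are illustrative). Solving the linear system XᵀXw = Xᵀt is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

def least_squares_direct(X, t):
    """Solve the normal equations X^T X w = X^T t for w."""
    return np.linalg.solve(X.T @ X, X.T @ t)

# Usage: append a column of ones so the last weight acts as the bias b.
X = np.random.randn(100, 3)
t = np.random.randn(100)
X_aug = np.hstack([X, np.ones((100, 1))])
w_star = least_squares_direct(X_aug, t)
```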
∂f(x₁, x₂)/∂x₁ = lim_{h→0} [f(x₁ + h, x₂) − f(x₁, x₂)] / h
To compute a partial derivative, take the single-variable derivative, pretending the other arguments are constant.
Example: partial derivatives of the prediction y
∂y/∂w_j = ∂/∂w_j (Σ_{j′} w_{j′} x_{j′} + b) = x_j
∂y/∂b = ∂/∂b (Σ_{j′} w_{j′} x_{j′} + b) = 1
∂L/∂w_j = (dL/dy)(∂y/∂w_j) = (y − t) x_j
∂L/∂b = (dL/dy)(∂y/∂b) = y − t
since dL/dy = (d/dy) ½(y − t)² = y − t
For cost derivatives, use linearity and average over data points:
∂J/∂w_j = (1/N) Σ_{i=1}^N (y^(i) − t^(i)) x_j^(i)
∂J/∂b = (1/N) Σ_{i=1}^N (y^(i) − t^(i))
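These averages vectorize directly. A sketch with illustrative names, assuming X has shape N × D:

```python
import numpy as np

def cost_gradients(X, t, w, b):
    """Return (dJ/dw, dJ/db) for J = (1/2N) * sum_i (y^(i) - t^(i))^2."""
    N = X.shape[0]
    y = X @ w + b          # predictions, shape (N,)
    err = y - t            # residuals y^(i) - t^(i)
    dJ_dw = X.T @ err / N  # (1/N) * sum_i (y^(i) - t^(i)) x^(i)
    dJ_db = err.mean()     # (1/N) * sum_i (y^(i) - t^(i))
    return dJ_dw, dJ_db
```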
[Visualization omitted; image source: mkwiki.org]
The relation between the input and output may not be linear.
y = w₀ [Figure: M = 0 fit of t against x on [0, 1]]
y = w₀ + w₁x [Figure: M = 1 fit]
y = w₀ + w₁x + w₂x² + w₃x³ [Figure: M = 3 fit]
y = w₀ + w₁x + w₂x² + w₃x³ + ... + w₉x⁹ [Figure: M = 9 fit]
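Each of these models is still linear in its parameters, so it can be fit with the same least squares machinery after mapping x to polynomial features. A sketch (the helper name and the synthetic data are illustrative):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least squares coefficients w_0, ..., w_M of a degree-M polynomial fit."""
    X = np.vander(x, M + 1, increasing=True)  # columns: x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

# Synthetic data resembling the figures: noisy samples of a sine curve.
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
w3 = fit_polynomial(x, t, M=3)  # a reasonable fit
w9 = fit_polynomial(x, t, M=9)  # interpolates the noise (overfitting)
```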
Underfitting (M = 0): the model is too simple and does not fit the data.
Overfitting (M = 9): the model is too complex; it fits the training data perfectly but generalizes poorly.
[Figures comparing the M = 0, M = 3, and M = 9 fits on the same data]
For the least squares problem, we have J(w) = (1/2N)‖Xw − t‖².
When λ > 0 (with regularization), the regularized cost gives
w_λ^Ridge = argmin_w J_reg(w) = argmin_w [(1/2N)‖Xw − t‖₂² + (λ/2)‖w‖₂²]
          = (XᵀX + λN I)⁻¹ Xᵀt
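A minimal sketch of this ridge closed form (illustrative names; λ is passed as lam):

```python
import numpy as np

def ridge_direct(X, t, lam):
    """Solve (X^T X + lam * N * I) w = X^T t for the ridge weights."""
    N, D = X.shape
    A = X.T @ X + lam * N * np.eye(D)
    return np.linalg.solve(A, X.T @ t)
```

Note that for λ > 0 the matrix XᵀX + λN I is always invertible, unlike XᵀX itself.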
Observe:
- if ∂J/∂w_j > 0, then increasing w_j increases J.
- if ∂J/∂w_j < 0, then increasing w_j decreases J.
The following update always decreases the cost function for small enough α (unless ∂J/∂w_j = 0); a code sketch follows the notes below:
w_j ← w_j − α ∂J/∂w_j
α > 0 is the learning rate (or step size). The larger it is, the faster w changes.
- We'll see later how to tune the learning rate, but values are typically small, e.g. 0.01 or 0.0001.
- If the cost is the sum of N individual losses rather than their average, a smaller learning rate will be needed (α′ = α/N).
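Putting the update rule and the gradients together, a sketch of full-batch gradient descent for linear regression (hyperparameter values are illustrative):

```python
import numpy as np

def gradient_descent(X, t, alpha=0.01, num_steps=1000):
    """Minimize J(w, b) = (1/2N) * sum_i (w^T x^(i) + b - t^(i))^2."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_steps):
        err = X @ w + b - t           # residuals y - t, shape (N,)
        w -= alpha * (X.T @ err / N)  # w_j <- w_j - alpha * dJ/dw_j
        b -= alpha * err.mean()       # b   <- b   - alpha * dJ/db
    return w, b
```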
Warning: in general, it’s very hard to tell from the training curves
whether an optimizer has converged. They can reveal major
problems, but they can’t guarantee convergence.
Stochastic Gradient Descent
So far, the cost function J has been the average loss over the
training examples:
J(θ) = (1/N) Σ_{i=1}^N L^(i) = (1/N) Σ_{i=1}^N L(y(x^(i), θ), t^(i)).
Typical strategy:
- Use a large learning rate early in training so you can get close to the optimum.
- Gradually decay the learning rate to reduce the fluctuations (see the sketch below).
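One way to realize this: a sketch of per-example SGD with a decaying learning rate. The 1/(1 + decay · epoch) schedule and all names are illustrative, one common choice among many:

```python
import numpy as np

def sgd(X, t, alpha0=0.1, decay=0.01, num_epochs=100):
    """SGD for linear regression with a decaying learning rate."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for epoch in range(num_epochs):
        alpha = alpha0 / (1.0 + decay * epoch)  # large early, smaller later
        for i in np.random.permutation(N):      # visit examples in random order
            err = X[i] @ w + b - t[i]           # scalar residual for example i
            w -= alpha * err * X[i]             # gradient of the single loss
            b -= alpha * err
    return w, b
```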
[Figure: a curve with labeled critical points: a local maximum, a local minimum, and the global minimum]
A set S is convex if any convex combination of its points stays in S:
λ₁x₁ + ... + λ_N x_N ∈ S for all x₁, ..., x_N ∈ S, with λ_i ≥ 0 and λ₁ + ... + λ_N = 1.
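As a worked special case (not from the slides), taking N = 2 with λ₁ = λ and λ₂ = 1 − λ recovers the familiar statement that the line segment between any two points of S lies in S:

```latex
\[
  \lambda x_1 + (1 - \lambda)\, x_2 \in S
  \qquad \text{for all } x_1, x_2 \in S,\ \lambda \in [0, 1].
\]
```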