CS 304.A Training Models
Outline
• Vectorized form:
Linear Regression – Cost Function
column-wise concatenation
OR
Linear Regression – The Normal Equation
Implementation using Numpy
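The slide's code is not preserved in this text, so the following is only a minimal sketch of a normal-equation implementation in NumPy. The synthetic data (y = 4 + 3x plus Gaussian noise), the seed, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical training data: y = 4 + 3x + Gaussian noise (assumed, not from the slides).
rng = np.random.default_rng(42)
m = 100                                        # number of training instances
X = 2 * rng.random((m, 1))                     # one feature with values in [0, 2)
y = 4 + 3 * X + rng.standard_normal((m, 1))

# Column-wise concatenation: prepend x0 = 1 to every instance for the bias term.
X_b = np.c_[np.ones((m, 1)), X]

# Normal equation: theta_hat = (X^T X)^(-1) X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Predictions for two new inputs, x = 0 and x = 2.
X_new = np.array([[0.0], [2.0]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_pred = X_new_b @ theta_hat
```

In practice `np.linalg.lstsq` or `np.linalg.pinv` is usually a more numerically robust choice than an explicit inverse; the explicit form is kept here because it mirrors the equation.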
Linear Regression – The Normal Equation
Implementation using Scikit-Learn
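A possible Scikit-Learn version is sketched here, again on assumed synthetic data (y = 4 + 3x plus noise). `LinearRegression` stores the bias term in `intercept_` and the feature weights in `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data (assumed for illustration).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

lin_reg = LinearRegression()
lin_reg.fit(X, y)

theta_0 = lin_reg.intercept_     # bias term
theta_1 = lin_reg.coef_          # feature weights
y_pred = lin_reg.predict(np.array([[0.0], [2.0]]))
```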
[Output annotations: θ0 (bias term); feature weights (θ1)]
Linear Regression – The Normal Equation
Implementation using Scikit-Learn
• Fortunately, the MSE cost function for a linear regression model happens
to be a convex function. Hence, there are no local minima, just one
global minimum.
• It is also a continuous function with a slope that never changes abruptly.
• Therefore, gradient descent is guaranteed to approach arbitrarily
close to the global minimum (if you wait long enough and if the
learning rate is not too high).
Linear Regression – Cost Function
• The cost function (MSE) for linear
regression is always convex (it is
bowl shaped).
• Hence it has a single global
minimum.
• GD converges to the global
optimum.
$\mathrm{MSE}(\boldsymbol{\theta}) = J(\theta_0, \theta_1)$
Source: Andrew Ng
(for fixed θ0, θ1, this is a function of x) (function of the parameters θ0, θ1)
If the cost function is not convex
[Figure: two surface plots of J(θ0, θ1) over the axes θ0 and θ1]
• To implement gradient descent, you need to compute the gradient of the cost function
with regard to each model parameter θj.
• In other words, you need to calculate how much the cost function will change if you
change θj just a little bit.
• This is called a partial derivative.
• The partial derivative of the MSE with regard to parameter θj is written ∂MSE(θ) / ∂θj.
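For reference, the standard closed form of this partial derivative for linear regression is reconstructed below. It assumes the usual notation, with $\mathbf{x}^{(i)}$ the i-th instance (including $x_0 = 1$) and $m$ the number of training instances.

```latex
\frac{\partial}{\partial \theta_j}\,\mathrm{MSE}(\boldsymbol{\theta})
  = \frac{2}{m} \sum_{i=1}^{m}
    \left( \boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} - y^{(i)} \right) x_j^{(i)}
```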
Batch Gradient Descent
• Instead of computing these partial derivatives individually, you can compute them
all in one go.
• The gradient vector, noted ∇θMSE(θ), contains all the partial derivatives of the cost
function (one for each model parameter).
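Stacking the partial derivatives gives the gradient vector in vectorized form. This is a reconstruction under the usual assumptions that $\mathbf{X}$ holds one instance per row (with a leading column of ones) and that there are $n$ features.

```latex
\nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta})
  = \begin{bmatrix}
      \frac{\partial}{\partial \theta_0}\,\mathrm{MSE}(\boldsymbol{\theta}) \\
      \vdots \\
      \frac{\partial}{\partial \theta_n}\,\mathrm{MSE}(\boldsymbol{\theta})
    \end{bmatrix}
  = \frac{2}{m}\, \mathbf{X}^{\top} \left( \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \right)
```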
Batch Gradient Descent
• The formula involves calculations over the full training set X, at each
Gradient Descent step!
• This is why the algorithm is called Batch Gradient Descent: it uses the whole
batch of training data at every step.
• It is terribly slow on very large training datasets (there are faster versions).
• Gradient descent scales well with the number of features:
• training a linear regression model when there are hundreds of thousands of features is
much faster using gradient descent than using the normal equation.
Batch Gradient Descent
• Once you have the gradient vector, which points uphill, just go in the opposite
direction to go downhill.
• This means subtracting ∇θMSE(θ) from θ.
• This is where the learning rate η comes into play: multiply the gradient vector by η
to determine the size of the downhill step:
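The gradient descent step described above can be written as follows. The superscript "new" for the updated parameter vector is an assumed label.

```latex
\boldsymbol{\theta}^{(\text{new})}
  = \boldsymbol{\theta} - \eta \, \nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta})
```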
Batch Gradient Descent – Example (𝑛 = 1)
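$\mathrm{MSE}(\boldsymbol{\theta}) = \mathrm{MSE}(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2$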
Batch Gradient Descent – Example (𝑛 = 1)
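$\mathrm{MSE}(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2$
Find the partial derivative of $\mathrm{MSE}(\theta_0, \theta_1)$ with respect to $\theta_0$ (use the chain rule):
$\frac{\partial}{\partial\theta_0}\mathrm{MSE}(\theta_0, \theta_1) = \frac{2}{m}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)$
$\frac{\partial}{\partial\theta_1}\mathrm{MSE}(\theta_0, \theta_1) = \frac{2}{m}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)x^{(i)}$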
Batch Gradient Descent – Example (𝑛 = 1)
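• Gradient of $\mathrm{MSE}(\boldsymbol{\theta})$: $\nabla_{\boldsymbol{\theta}}\mathrm{MSE}(\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial}{\partial\theta_0}\mathrm{MSE}(\theta_0, \theta_1) \\ \frac{\partial}{\partial\theta_1}\mathrm{MSE}(\theta_0, \theta_1) \end{bmatrix}$
• $\boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$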
Batch Gradient Descent – Example (𝑛 = 1)
$\theta_1^{(\text{new})} := \theta_1 - \eta\,\frac{\partial}{\partial\theta_1}\mathrm{MSE}(\theta_1)$
Exercise 1-solution
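• Perform one step (i.e., one epoch) of batch gradient descent to update the line
parameter $\theta_1$ (use $\eta = 0.1$).
$\frac{\partial}{\partial\theta_1}\mathrm{MSE}(\theta_1) = \frac{2}{m}\sum_{i=1}^{m}\left(\theta_1 x^{(i)} - y^{(i)}\right)x^{(i)}$, $\theta_1^{(\text{new})} := \theta_1 - \eta\,\frac{\partial}{\partial\theta_1}\mathrm{MSE}(\theta_1)$
$\left.\frac{\partial}{\partial\theta_1}\mathrm{MSE}(\theta_1)\right|_{\theta_1 = 0} = \frac{2}{3}\left(-1\times1 - 2\times2 - 3\times3\right) \approx -9.3$
$\theta_1^{(\text{new})} = 0 - 0.1\times(-9.3) = 0.93$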
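The arithmetic of this solution can be checked with a few lines of plain Python. The training points (1, 1), (2, 2), (3, 3) and the starting value θ1 = 0 are inferred from the terms of the sum in the solution; the exercise data itself is not preserved in this text, so treat them as assumptions.

```python
# Assumed training points, inferred from the terms -1x1, -2x2, -3x3 in the solution.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

eta = 0.1        # learning rate given in the exercise
theta1 = 0.0     # assumed starting value of the slope
m = len(xs)

# Partial derivative of MSE(theta1) = (1/m) * sum((theta1 * x - y)^2) w.r.t. theta1.
grad = (2 / m) * sum((theta1 * x - y) * x for x, y in zip(xs, ys))

# One batch gradient descent step.
theta1_new = theta1 - eta * grad

print(round(grad, 2), round(theta1_new, 2))
```

The unrounded values are -28/3 for the gradient and 14/15 for the updated parameter, which agree with the rounded figures in the solution.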
Batch Gradient Descent
Implementation
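The original implementation is not preserved in this text; a minimal NumPy sketch of batch gradient descent might look like this. The synthetic data, the learning rate `eta = 0.1`, and `n_epochs = 1000` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical training data: y = 4 + 3x + Gaussian noise (assumed).
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]          # add x0 = 1 for the bias term

eta = 0.1                                # learning rate (assumed)
n_epochs = 1000                          # number of passes over the data (assumed)
theta = rng.standard_normal((2, 1))      # random initialization

for epoch in range(n_epochs):
    # Gradient of the MSE over the whole training set.
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    # Step downhill, scaled by the learning rate.
    theta = theta - eta * gradients
```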
That’s exactly what the Normal equation found! Gradient descent worked perfectly.
But what if you had used a different learning rate (eta)?
Batch Gradient Descent – Learning rate
• The figure shows the first 20 steps of gradient descent using three different learning rates.
• The line at the bottom of each plot represents the random starting point, then each epoch is represented by a
darker and darker line.
• Left: the learning rate is too low - the algorithm will eventually reach the solution, but it will take a long time.
• Middle: the learning rate looks pretty good - in just a few epochs, it has already converged to the solution.
• Right: the learning rate is too high - the algorithm diverges, jumping all over the place and actually getting
further and further away from the solution at every step.
• The main problem with batch gradient descent is the fact that it uses the whole
training set to compute the gradients at every step, which makes it very slow
when the training set is large.
• At the opposite extreme, stochastic gradient descent picks a random instance
in the training set at every step and computes the gradients based only on that
single instance:
• Working on a single instance at a time makes the algorithm much faster because it has
very little data to manipulate at every iteration.
• It also makes it possible to train on huge training sets, since only one instance needs to be
in memory at each iteration.
Stochastic Gradient Descent - disadvantages
• On the other hand, due to its stochastic (i.e., random) nature, this algorithm is
much less regular than batch gradient descent:
• instead of gently decreasing until it reaches the minimum, the cost function will bounce
up and down, decreasing only on average.
• Over time it will end up very close to the minimum, but once it gets there it
will continue to bounce around, never settling down (see Figure 4-9).
• Once the algorithm stops, the final parameter values will be good, but not
optimal.
Stochastic Gradient Descent
• When the cost function is very irregular (as in Figure 4-6), this can actually help the
algorithm jump out of local minima, so stochastic gradient descent has a better chance of
finding the global minimum than batch gradient descent does.
• Therefore, randomness is good to escape from local optima, but bad because it means that
the algorithm can never settle at the minimum.
• One solution to this dilemma is to gradually reduce the learning rate: The steps start out large
(which helps make quick progress and escape local minima), then get smaller and smaller,
allowing the algorithm to settle at the global minimum.
• The function that determines the learning rate at each iteration is called the learning schedule.
• If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up
frozen halfway to the minimum.
• If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end
up with a suboptimal solution if you halt training too early.
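A minimal NumPy sketch of stochastic gradient descent with a simple learning schedule is given here. The schedule `t0 / (t + t1)`, its constants, the number of epochs, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

# Hypothetical training data: y = 4 + 3x + Gaussian noise (assumed).
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
t0, t1 = 5, 50                            # schedule hyperparameters (assumed)

def learning_schedule(t):
    """Learning rate that shrinks as the iteration count t grows."""
    return t0 / (t + t1)

theta = rng.standard_normal((2, 1))       # random initialization

for epoch in range(n_epochs):
    for iteration in range(m):
        i = rng.integers(m)               # pick one random instance
        xi = X_b[i:i + 1]
        yi = y[i:i + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)   # gradient on that single instance
        eta = learning_schedule(epoch * m + iteration)
        theta = theta - eta * gradients
```

Because instances are drawn at random, some may be picked several times per epoch and others not at all; shuffling the training set once per epoch is a common alternative.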
Stochastic Gradient Descent: Implementation
• To perform linear regression using stochastic GD with Scikit-Learn, you can use the SGDRegressor
class, which defaults to optimizing the MSE cost function.
• The code runs for a maximum of 1,000 epochs (max_iter) or until the loss drops by less than 10⁻⁵ (tol)
during 100 epochs (n_iter_no_change).
• It starts with a learning rate of 0.01 (eta0), using the default learning schedule (different from the one
we used).
• Lastly, it does not use any regularization (penalty=None):
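The slide's code is not preserved here; a sketch that matches the settings listed above could look like this. The training data and `random_state` are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical training data: y = 4 + 3x + Gaussian noise (assumed).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

sgd_reg = SGDRegressor(
    max_iter=1000,           # at most 1,000 epochs
    tol=1e-5,                # stop when the loss improves by less than 1e-5 ...
    n_iter_no_change=100,    # ... during 100 consecutive epochs
    eta0=0.01,               # initial learning rate
    penalty=None,            # no regularization
    random_state=42,         # assumed, for reproducibility
)
sgd_reg.fit(X, y.ravel())    # fit() expects a 1-D target array

theta_0, theta_1 = sgd_reg.intercept_, sgd_reg.coef_
```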
• At each step, instead of computing the gradients based on the full training set
(as in batch GD) or based on just one instance (as in stochastic GD), mini-
batch GD computes the gradients on small random sets of instances called
mini-batches.
• The main advantage of mini-batch GD over stochastic GD is that you can get
a performance boost from hardware optimization of matrix operations,
especially when using GPUs.
Mini-batch Gradient Descent
Advantages, Disadvantages
• The algorithm’s progress in parameter space is less erratic than with stochastic
GD, especially with fairly large mini-batches.
• As a result, mini-batch GD will end up walking around a bit closer to the
minimum than stochastic GD—but it may be harder for it to escape from local
minima (in the case of problems that suffer from local minima, unlike linear
regression with the MSE cost function).
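A minimal NumPy sketch of mini-batch gradient descent follows. The batch size of 20, the fixed learning rate, the number of epochs, and the synthetic data are assumptions for illustration.

```python
import numpy as np

# Hypothetical training data: y = 4 + 3x + Gaussian noise (assumed).
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

eta = 0.1            # fixed learning rate (assumed)
n_epochs = 100       # assumed
batch_size = 20      # assumed mini-batch size
theta = rng.standard_normal((2, 1))

for epoch in range(n_epochs):
    shuffled = rng.permutation(m)                 # new random order every epoch
    for start in range(0, m, batch_size):
        idx = shuffled[start:start + batch_size]
        X_batch, y_batch = X_b[idx], y[idx]
        # Gradient of the MSE computed on the mini-batch only.
        gradients = (2 / len(idx)) * X_batch.T @ (X_batch @ theta - y_batch)
        theta = theta - eta * gradients
```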
Mini-batch Gradient Descent
Three Gradient Descent Algorithms
• Which linear regression training algorithm should you not use if you have a
training set with millions of features? Why?
a) Normal equation
b) Batch gradient descent
c) Stochastic gradient descent
d) Mini-batch gradient descent
E-mail: cigdem.erogluerdem@gmail.com
Subject: CS304 Poll2
Poll 2
• Answer: a) Normal equation
Reason: the normal equation will be slow because it requires inverting a matrix whose size grows with the number of features.
Resources
• X_poly now contains the original feature of X plus the square of this feature.
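The transformation described above can be sketched with Scikit-Learn's `PolynomialFeatures`. The input data here is hypothetical; `include_bias=False` is used so that no constant column is added.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical single-feature inputs in [-3, 3) (assumed for illustration).
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Column 0 is the original feature, column 1 is its square.
print(X[0], X_poly[0])
```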
Polynomial Regression - Example
• Fit a LinearRegression model
to this extended training data
• Model estimates:
• Original function:
Polynomial Regression - Example
• Note: when there are multiple features, polynomial regression is capable of
finding relationships between features, which is something a plain linear
regression model cannot do.
• This is made possible by the fact that PolynomialFeatures also adds all
combinations of features up to the given degree.
• For example, if there were two features a and b, PolynomialFeatures with
degree=3 would not only add the features a², a³, b², and b³, but also the
combinations ab, a²b, and ab².
Polynomial Regression - Example
Learning Curves
Polynomial Regression
• If you perform high-degree
Polynomial Regression, you will
likely fit the training data much better
than with plain Linear Regression.
• The figure applies a 300-degree
polynomial model to the preceding
training data and compares the result
with a pure linear model and a
quadratic model (second-degree
polynomial).
• The 300-degree polynomial model
wiggles around to get as close as
possible to the training instances.
Polynomial Regression
• Elastic Net is a middle ground between Ridge Regression and Lasso Regression.
• The regularization term is a simple mix of both Ridge and Lasso’s regularization
terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to
Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.
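In Scikit-Learn, the mix ratio r corresponds to the `l1_ratio` parameter of `ElasticNet`. This is a minimal sketch on assumed synthetic data; the values `alpha=0.1` and `l1_ratio=0.5` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical training data: y = 4 + 3x + Gaussian noise (assumed).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)

# alpha sets the overall regularization strength, l1_ratio is the mix ratio r.
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)

y_pred = elastic_net.predict(np.array([[1.5]]))
```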
• When should you use plain Linear Regression (i.e., without any regularization),
Ridge, Lasso, or Elastic Net?
• It is almost always preferable to have at least a little bit of regularization, so
generally you should avoid plain Linear Regression.
• Ridge is a good default, but if you suspect that only a few features are useful,
you should prefer Lasso or Elastic Net because they tend to reduce the useless
features’ weights down to zero.
• In general, Elastic Net is preferred over Lasso because
• Lasso may behave erratically when the number of features is greater than the number of
training instances or when several features are strongly correlated.
Early Stopping
• With stochastic and mini-batch gradient descent, the curves are not so
smooth (as in previous figure), and it may be hard to know whether you
have reached the minimum or not.
• One solution is to stop only after the validation error has been above
the minimum for some time (when you are confident that the model
will not do any better), then roll back the model parameters to the point
where the validation error was at a minimum.
Early Stopping
Implementation
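The slide's code is not preserved in this text. The following sketch is consistent with the description that follows, under several assumptions: the noisy quadratic data, the train/validation split, the polynomial degree of 10, `eta0=0.002`, and 500 epochs are all illustrative choices.

```python
from copy import deepcopy

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical noisy quadratic data, split into training and validation sets.
rng = np.random.default_rng(42)
X = 6 * rng.random((200, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.standard_normal(200)
X_train, y_train = X[:100], y[:100]
X_valid, y_valid = X[100:], y[100:]

# Add polynomial features and scale them (the degree is an assumption).
preprocessing = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),
)
X_train_prep = preprocessing.fit_transform(X_train)
X_valid_prep = preprocessing.transform(X_valid)

sgd_reg = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
n_epochs = 500
best_valid_rmse = float("inf")
best_model = None

for epoch in range(n_epochs):
    sgd_reg.partial_fit(X_train_prep, y_train)     # one pass over the training set
    y_valid_pred = sgd_reg.predict(X_valid_prep)
    valid_rmse = np.sqrt(mean_squared_error(y_valid, y_valid_pred))
    if valid_rmse < best_valid_rmse:               # best validation error so far
        best_valid_rmse = valid_rmse
        best_model = deepcopy(sgd_reg)             # keep a full copy of the model
```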
• The previous code first adds the polynomial features and scales all the input features, both
for the training set and for the validation set (the code assumes that you have split the
original training set into a smaller training set and a validation set).
• Then it creates an SGDRegressor model with no regularization and a small learning rate.
• At each epoch, it measures the RMSE on the validation set. If it is lower than the
lowest RMSE seen so far, it saves a copy of the model in the best_model variable.
• This implementation does not actually stop training, but it lets you revert to the best
model after training.
• Note that the model is copied using copy.deepcopy(), because it copies both the
model’s hyperparameters and the learned parameters. In contrast,
sklearn.base.clone() only copies the model’s hyperparameters.
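The difference between the two copying functions can be seen in a small sketch. The tiny dataset and the hyperparameter values are made up for illustration.

```python
from copy import deepcopy

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDRegressor

# Tiny hypothetical dataset, only used to give the model some learned parameters.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

model = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
model.partial_fit(X, y)

copied = deepcopy(model)   # hyperparameters and learned parameters
cloned = clone(model)      # hyperparameters only: a fresh, unfitted estimator

print(hasattr(copied, "coef_"), hasattr(cloned, "coef_"))
```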
Logistic Regression
Logistic Regression
• The objective of training is to set the parameter vector 𝜃 so that the model estimates high
probabilities for positive instances (𝑦 = 1) and low probabilities for negative instances
(𝑦 = 0).
• Cost for a single training instance:
• Cost function over the whole training set (log loss) is the average cost over all
training samples:
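The two formulas referred to above are not preserved in this text. Their standard forms are reconstructed below, assuming $\hat{p} = \sigma(\boldsymbol{\theta}^{\top}\mathbf{x})$ denotes the estimated probability of the positive class.

```latex
c(\boldsymbol{\theta}) =
  \begin{cases}
    -\log(\hat{p})     & \text{if } y = 1 \\
    -\log(1 - \hat{p}) & \text{if } y = 0
  \end{cases}
\qquad
J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log\!\left(\hat{p}^{(i)}\right)
       + \left(1 - y^{(i)}\right) \log\!\left(1 - \hat{p}^{(i)}\right) \right]
```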
Note about log loss (optional)
• It can be shown mathematically (using Bayesian inference) that minimizing the log
loss will result in the model with the maximum likelihood of being optimal,
assuming that the instances follow a Gaussian distribution around the mean of their
class.
• When you use the log loss, this is the implicit assumption you are making.
• The more wrong this assumption is, the more biased the model will be.
• Similarly, when we used the MSE to train linear regression models, we were
implicitly assuming that the data was purely linear, plus some Gaussian noise. So, if
the data is not linear (e.g., if it’s quadratic) or if the noise is not Gaussian (e.g., if
outliers are not exponentially rare), then the model will be biased.
Logistic Regression
Training
• Bad news: there is no known closed-form equation to compute the value of θ that
minimizes this cost function.
• Good News: this cost function is convex, so Gradient Descent (or any other
optimization algorithm) is guaranteed to find the global minimum (if the learning
rate is not too large and you wait long enough).
• The partial derivative of the logistic cost function with regard to the jth model
parameter θj is given by:
• Once you have the gradient vector containing all the derivatives, you can use it in
the batch, stochastic or mini-batch gradient descent algorithm.
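The standard form of this partial derivative is (1/m) Σ (σ(θᵀx⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾. A minimal NumPy sketch that uses it in batch gradient descent is given here; the tiny dataset, the learning rate, and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(theta, X_b, y):
    """Average log loss over the training set."""
    p = sigmoid(X_b @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_loss_gradient(theta, X_b, y):
    """Gradient vector: (1/m) * X^T (sigmoid(X theta) - y)."""
    return X_b.T @ (sigmoid(X_b @ theta) - y) / len(y)

# Tiny hypothetical dataset (first column is x0 = 1); the classes overlap on purpose.
X_b = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
eta = 0.1                      # learning rate (assumed)
for _ in range(1000):          # batch gradient descent
    theta -= eta * log_loss_gradient(theta, X_b, y)
```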
For full derivation see: https://medium.com/analytics-vidhya/derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d
Logistic Regression
Example
• Iris dataset: sepal and petal length and width of 150 iris flowers of
three species: Iris setosa, Iris versicolor and Iris virginica.
Logistic Regression
Example – Iris dataset
• Goal: Build a classifier to detect the Iris virginica type based only on the petal width feature.
• Load the data and take a quick look:
Logistic Regression
Example – Iris dataset
• Split the data and train a logistic regression model on the training set:
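The slide's code is not preserved here. One way to load the data, split it, and train the model is sketched below; the default test-set size and `random_state=42` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
print(iris.feature_names)        # quick look at the four features
print(iris.target_names)         # and the three species

X = iris.data[:, 3:]             # petal width (cm), kept as a 2-D column
y = iris.target == 2             # True for Iris virginica, False otherwise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
```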
Example – Iris dataset
• Look at the model’s estimated probabilities for flowers with petal widths
varying from 0 cm to 3 cm
Logistic
Regression
Example
• The petal width of Iris virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris
flowers (represented by squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm. Notice that
there is a bit of overlap.
• Above about 2 cm the classifier is highly confident that the flower is an Iris virginica (it outputs a high probability for
that class), while below 1cm it is highly confident that it is not an Iris virginica (high probability for the “Not Iris
virginica” class).
• In between these extremes, the classifier is unsure. However, if you ask it to predict the class (using the predict()
method rather than the predict_proba() method), it will return whichever class is the most likely.
• Therefore, there is a decision boundary at around 1.6 cm where both probabilities are equal to 50%: if the petal width
is higher than 1.6 cm, the classifier will predict that the flower is an Iris virginica, and otherwise it will predict that it
is not (even if it is not very confident):
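The estimated probabilities and the decision boundary can be explored with a sketch like the following. It repeats the assumed training setup (default split, `random_state=42`) so that it is self-contained; the exact boundary value depends on that split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 3:]                       # petal width (cm)
y = iris.target == 2                       # True for Iris virginica
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
log_reg = LogisticRegression(random_state=42).fit(X_train, y_train)

# Estimated probabilities for petal widths from 0 cm to 3 cm.
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)     # columns: [not virginica, virginica]

# First petal width at which the virginica probability reaches 50%.
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]

# predict() returns the most likely class on either side of the boundary.
predictions = log_reg.predict([[2.5], [0.5]])
```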
Logistic
Regression
Example:
Iris dataset
• Use two features, petal width and petal length, to classify Iris virginica vs. not Iris virginica.
• The dashed line represents the points where the model estimates a 50% probability: this is
the model’s decision boundary. Note that it is a linear boundary.
• Each parallel line represents the points where the model outputs a specific probability, from
15% (bottom left) to 90% (top right). All the flowers beyond the top-right line have an over
90% chance of being Iris virginica, according to the model.
Logistic Regression
Regularization
Softmax Regression
(Multinomial Logistic Regression)
The gradient vector of this cost function with regard to $\boldsymbol{\theta}^{(k)}$ is:
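The formula itself is not preserved in this text. The standard cross-entropy gradient for class k is reconstructed below, assuming $\hat{p}_k^{(i)}$ is the estimated probability that instance i belongs to class k and $y_k^{(i)}$ is 1 if class k is the target class for instance i, and 0 otherwise.

```latex
\nabla_{\boldsymbol{\theta}^{(k)}}\, J(\boldsymbol{\Theta})
  = \frac{1}{m} \sum_{i=1}^{m}
    \left( \hat{p}_k^{(i)} - y_k^{(i)} \right) \mathbf{x}^{(i)}
```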
Softmax Regression
Example: Iris Dataset
• Use softmax regression to classify the iris plants into all three classes.
• Scikit-Learn’s LogisticRegression classifier uses softmax regression automatically
when you train it on more than two classes (assuming you use solver="lbfgs", which is the
default).
• It also applies ℓ2 regularization by default, which you can control using the hyperparameter C
Softmax Regression
Example: Iris Dataset
• If you find an iris with petals that are 5 cm long and 2 cm wide, you can ask
your model to tell you what type of iris it is
• It will answer Iris virginica (class 2) with 96% probability (or Iris versicolor with 4%
probability):
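A sketch of this query is given here. The train/test split, `C=30`, `max_iter=1000`, and `random_state=42` are assumptions, so the exact probabilities may differ slightly from the figures quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 2:4]            # petal length and petal width (cm)
y = iris.target                  # three classes: 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With more than two classes and the default lbfgs solver, this is softmax regression.
softmax_reg = LogisticRegression(C=30, max_iter=1000, random_state=42)
softmax_reg.fit(X_train, y_train)

# An iris with petals 5 cm long and 2 cm wide.
predicted_class = softmax_reg.predict([[5, 2]])
class_probabilities = softmax_reg.predict_proba([[5, 2]])
print(predicted_class, class_probabilities.round(2))
```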
Softmax Regression
Example: Iris Dataset
• Linear Regression
• Closed-form solution
• Gradient descent (batch, stochastic, mini-batch)
• Polynomial regression
• Regularization (Ridge, Lasso, Elastic Net)
• Learning curves
• Early stopping
• Logistic regression (binary classifier, linear classifier)
• Softmax regression (multiclass classifier, linear classifier)