CS 304.A Training Models

Chapter 4:

Training
Models
Outline

• Linear Regression
• Normal equation (Least-squares solution)
• Gradient Descent
• Batch GD
• Stochastic GD
• Mini-batch GD

• Polynomial Regression
• Learning Curves
• Regularized Linear Models
• Ridge regression
• Lasso regression
• Elastic Net
• Early Stopping

• Logistic Regression
• Estimating probabilities
• Training and cost function
• Decision boundaries
• Softmax Regression
Reminder
• This chapter will use linear algebra and calculus concepts
• Review material for math was uploaded to LMS during the first week.
• Tutorials as Jupyter notebooks are available from online supplemental material
of the book:
• https://github.com/ageron/handson-ml3
• math_linear_algebra.ipynb
• math_differential_calculus.ipynb
Introduction
• So far, we have treated machine learning models and their training algorithms
mostly like “black boxes” without knowing how they actually work:
• Optimized a regression system
• Improved a digit image classifier
• Having a good understanding of algorithms will enable you to:
• Find the appropriate model
• Right training algorithm to use
• Good set of hyperparameters for your task
• Debug issues efficiently
• Perform error analysis efficiently
Linear Regression
Normal Equation
Introduction
• Linear regression
• Two very different ways to train it:
• Closed-form equation: directly computes the model parameters that best fit the
model to the training set.
• i.e., the model parameters that minimize the cost function over the training set.
• Gradient Descent: An iterative optimization approach that updates the model parameters
to minimize the cost function over the training set.
• Converges to the same set of parameters as the closed-form solution
• Variants of gradient descent:
• Batch GD
• Mini-batch GD
• Stochastic GD
Notation
Linear Regression
• We previously studied a simple regression model of life satisfaction:
Linear Regression

• A linear model makes a prediction by simply computing a weighted sum
of the input features, plus a constant called the bias term (also called the
intercept term)
Linear Regression

• Vectorized form:
Linear Regression – Cost Function

• Goal: Minimize the mean-squared error between the predicted value
($\hat{y}^{(i)}$) and the desired value ($y^{(i)}$).
Linear Regression – Training

• To train a Linear Regression model, we need to find the value of 𝜽 that
minimizes the MSE.
Linear Regression – The Normal Equation

• The normal equation is a closed-form solution that gives the result
directly: $\hat{\boldsymbol{\theta}} = (X^T X)^{-1} X^T \mathbf{y}$
Derivation of the Normal Equation
(the least-squares solution)
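A standard sketch of the least-squares derivation, assuming $X$ is the $m \times (n+1)$ matrix of training instances (one row per instance, with a leading column of ones for the bias term) and $\mathbf{y}$ is the vector of target values. Setting the gradient of the MSE to zero gives the normal equation:

```latex
\begin{aligned}
MSE(\boldsymbol{\theta})
  &= \frac{1}{m}\,(X\boldsymbol{\theta} - \mathbf{y})^T (X\boldsymbol{\theta} - \mathbf{y}) \\
\nabla_{\boldsymbol{\theta}} MSE(\boldsymbol{\theta})
  &= \frac{2}{m}\, X^T (X\boldsymbol{\theta} - \mathbf{y}) \\
\nabla_{\boldsymbol{\theta}} MSE(\boldsymbol{\theta}) = \mathbf{0}
  &\;\Rightarrow\; X^T X\,\boldsymbol{\theta} = X^T \mathbf{y} \\
  &\;\Rightarrow\; \hat{\boldsymbol{\theta}} = (X^T X)^{-1} X^T \mathbf{y}
  \quad \text{(when } X^T X \text{ is invertible)}
\end{aligned}
```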
Linear Regression – The Normal Equation
Implementation
Linear Regression – The Normal Equation
Implementation using Numpy

column-wise concatenation

OR
Linear Regression – The Normal Equation
Implementation using Numpy
Linear Regression – The Normal Equation
Implementation using Numpy
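A minimal NumPy sketch of the normal equation. The synthetic data ($y = 4 + 3x$ plus Gaussian noise), the seed, and all variable names are assumptions chosen for illustration, not the code from the original slides.

```python
import numpy as np

# Synthetic data (assumed): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100                                   # number of instances
X = 2 * rng.random((m, 1))                # one feature in [0, 2)
y = 4 + 3 * X + rng.standard_normal((m, 1))

# Add the bias feature x0 = 1 by column-wise concatenation
X_b = np.c_[np.ones((m, 1)), X]           # or: np.hstack([np.ones((m, 1)), X])

# Normal equation: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Predictions for two new instances, x = 0 and x = 2
X_new = np.array([[0.0], [2.0]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b @ theta_best
```

Because of the noise, `theta_best` should land near, but not exactly on, the generating values 4 and 3.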
Linear Regression – The Normal Equation
Implementation using Scikit-Learn

• Implement using Scikit-Learn

$\theta_0$

Feature weights ($\theta_1$)
Linear Regression – The Normal Equation
Implementation using Scikit-Learn

Note: The normal equation may not work if the matrix $X^T X$ is not invertible
(i.e. singular), but the pseudoinverse is always defined.
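A short sketch, on assumed synthetic data, of fitting with Scikit-Learn's `LinearRegression` and of computing the same solution through the pseudoinverse with `np.linalg.pinv`. The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))

lin_reg = LinearRegression()
lin_reg.fit(X, y)
bias = lin_reg.intercept_                 # bias term, shape (1,)
weights = lin_reg.coef_                   # feature weights, shape (1, 1)

# Same solution through the pseudoinverse, which is defined even if X^T X is singular
X_b = np.c_[np.ones((m, 1)), X]
theta_pinv = np.linalg.pinv(X_b) @ y
```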
Linear Regression – The Normal Equation
Computational Complexity

• The Normal equation computes the inverse of $X^T X$, which is an
$(n + 1) \times (n + 1)$ matrix (where $n$ is the number of features).
• The computational complexity of inverting such a matrix is typically about $O(n^{2.4})$ to
$O(n^3)$, depending on the implementation.
• Both the Normal equation and the SVD approach get very slow when the number of features
grows large (e.g., 100,000).
• On the positive side, both are linear with regard to the number of instances in the training set (they
are 𝑂(𝑚)), so they handle large training sets efficiently, provided they can fit in memory.
• Once you have trained your linear regression model, predictions are very fast:
• the computational complexity is linear with regard to both the number of instances you want to make
predictions on and the number of features.
Gradient Descent
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch Gradient Descent
Linear Regression – Gradient Descent

• Gradient descent is a generic optimization algorithm capable of finding optimal solutions
to a wide range of problems.
• The general idea of gradient descent is to update parameters iteratively in order to
minimize a cost function.
• Suppose you are lost in the mountains in a dense fog, and you can only feel the slope of the
ground below your feet.
• A good strategy to get to the bottom of the valley quickly is to go downhill in the direction
of the steepest slope.
• This is exactly what gradient descent does: it measures the local gradient of the error
function with regard to the parameter vector θ, and it goes in the direction of descending
gradient.
• Once the gradient is zero, you have reached a minimum!
Gradient
Descent

• In practice, you start by filling θ with random values (this is called
random initialization).
• Then you improve it gradually, taking one baby step at a time, each step
attempting to decrease the cost function (e.g., the MSE), until the algorithm
converges to a minimum.
Gradient Descent
Gradient Descent
Gradient Descent – Learning Rate

• An important parameter in gradient descent is the size of the steps,
determined by the learning rate hyperparameter.
• If the learning rate is too small, then the algorithm will have to go
through many iterations to converge, which will take a long time.
Gradient Descent – Learning Rate

• If the learning rate is too high, you might jump across the valley and
end up on the other side, possibly even higher up than you were before.
• This might make the algorithm diverge, with larger and larger
values, failing to find a good solution.
Gradient Descent – Gradient Descent Pitfalls

• Not all cost functions look like nice, regular bowls.
• There may be holes, ridges, plateaus, and all sorts of irregular terrains, making
convergence to the minimum difficult.
• If random initialization starts on the left, then it will converge to a local minimum,
which is not as good as the global minimum.
• If it starts on the right, then it will take a long time to cross the plateau. And if you
stop too early, you will never reach the global minimum.
Linear Regression – Gradient Descent

• Fortunately, the MSE cost function for a Linear Regression model happens
to be a convex function. Hence, there are no local minima, just one
global minimum.
• It is also a continuous function with a slope that never changes abruptly.
• Therefore, gradient descent is guaranteed to approach arbitrarily
close to the global minimum (if you wait long enough and if the
learning rate is not too high).
Linear Regression – Cost Function
• The cost function (MSE) for linear
regression is always convex (it is
bowl shaped).
• Hence it has a single global
minimum.
• GD converges to the global
optimum.

$MSE(\boldsymbol{\theta}) = J(\theta_0, \theta_1)$

[Figure sequence, source: Andrew Ng. Left panel: the fitted line (for fixed $\theta_0, \theta_1$, this is a function of $x$). Right panel: the cost $J(\theta_0, \theta_1)$ (a function of the parameters $\theta_0, \theta_1$). In the final frame the best-fitting line is found and the minimum of the cost function is reached.]
If the cost function is not convex

[Figure: surface plot of $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$. From one initial point, gradient descent converges to the global minimum.]

[Figure: the same surface. With a different initial point, gradient descent converges to a different local minimum.]

Gradient descent is susceptible to local minima.
Linear Regression – Gradient Descent
• While the cost function has the shape of a bowl, it can be an elongated bowl if the features have
very different scales.
• Figure shows gradient descent on a training set where features 1 and 2 have the same scale (on the
left), and on a training set where feature 1 has much smaller values than feature 2 (on the right).
• Left: the Gradient Descent algorithm goes straight toward the minimum,
thereby reaching it quickly.
• Right: it first goes in a direction almost orthogonal to the direction of the global
minimum, and it ends with a long march down an almost flat valley.
• It will eventually reach the minimum, but it will take a long time.
• Training a model is a search in the model’s parameter space.
Batch Gradient Descent

• To implement gradient descent, you need to compute the gradient of the cost function
with regard to each model parameter θj.
• In other words, you need to calculate how much the cost function will change if you
change θj just a little bit.
• This is called a partial derivative.
• The partial derivative of the MSE with regard to parameter θj is noted ∂ MSE(θ) / ∂θj.
Batch Gradient Descent

• Instead of computing these partial derivatives individually, you can compute them
all in one go.
• The gradient vector, noted ∇θMSE(θ), contains all the partial derivatives of the cost
function (one for each model parameter).
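For reference, the standard expressions for linear regression, written as a sketch that assumes $X$ holds one training instance per row (with $x_0^{(i)} = 1$ for the bias term) and $\mathbf{y}$ is the vector of targets:

```latex
\frac{\partial}{\partial \theta_j} MSE(\boldsymbol{\theta})
  = \frac{2}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)} \right) x_j^{(i)}
\qquad
\nabla_{\boldsymbol{\theta}} MSE(\boldsymbol{\theta})
  = \begin{bmatrix}
      \frac{\partial}{\partial \theta_0} MSE(\boldsymbol{\theta}) \\
      \vdots \\
      \frac{\partial}{\partial \theta_n} MSE(\boldsymbol{\theta})
    \end{bmatrix}
  = \frac{2}{m}\, X^T (X\boldsymbol{\theta} - \mathbf{y})
```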
Batch Gradient Descent

• The formula involves calculations over the full training set X, at each
Gradient Descent step!
• This is why the algorithm is called Batch Gradient Descent: it uses the whole
batch of training data at every step.
• It is terribly slow on very large training datasets (there are faster versions).
• Gradient descent scales well with the number of features:
• training a linear regression model when there are hundreds of thousands of features is
much faster using gradient descent than using the normal equation.
Batch Gradient Descent

• Once you have the gradient vector, which points uphill, just go in the opposite
direction to go downhill.
• This means subtracting ∇θMSE(θ) from θ.
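• This is where the learning rate η comes into play: multiply the gradient vector by η
to determine the size of the downhill step:
$\boldsymbol{\theta}^{(new)} = \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} MSE(\boldsymbol{\theta})$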
Batch Gradient Descent – Example (𝑛 = 1)

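• If $n = 1$ (just one feature):

$\hat{y} = \theta_0 + \theta_1 x, \qquad \boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$

$MSE(\boldsymbol{\theta}) = MSE(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$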
Batch Gradient Descent – Example (𝑛 = 1)

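$MSE(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$

Find the partial derivative of $MSE(\theta_0, \theta_1)$ with respect to $\theta_0$ (use the chain rule):

Find the partial derivative of $MSE(\theta_0, \theta_1)$ with respect to $\theta_1$:

Find the gradient of $MSE(\boldsymbol{\theta})$, i.e. $\nabla_{\boldsymbol{\theta}} MSE(\boldsymbol{\theta})$:
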
Batch Gradient Descent – Example (𝑛 = 1)

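$MSE(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$

$\frac{\partial}{\partial \theta_0} MSE(\theta_0, \theta_1) = \frac{2}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)$

$\frac{\partial}{\partial \theta_1} MSE(\theta_0, \theta_1) = \frac{2}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}$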
Batch Gradient Descent – Example (𝑛 = 1)

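• Gradient of $MSE(\boldsymbol{\theta})$: $\nabla_{\boldsymbol{\theta}} MSE(\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial}{\partial \theta_0} MSE(\theta_0, \theta_1) \\ \frac{\partial}{\partial \theta_1} MSE(\theta_0, \theta_1) \end{bmatrix}$

• $\boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$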
Batch Gradient Descent – Example (𝑛 = 1)

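Repeat until convergence: {

$\theta_0^{(new)} := \theta_0 - \eta \frac{\partial}{\partial \theta_0} MSE(\theta_0, \theta_1)$

$\theta_1^{(new)} := \theta_1 - \eta \frac{\partial}{\partial \theta_1} MSE(\theta_0, \theta_1)$
}
• Update $\theta_0$ and $\theta_1$ simultaneously.
• Evaluate the derivatives using the current values of the parameters $\theta_0$ and $\theta_1$.
Vector form:
$\boldsymbol{\theta}^{(new)} = \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} MSE(\boldsymbol{\theta})$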
Poll 1

• Suppose we want to fit a line $h(x) = \theta_1 x$ to
the points (1,1), (2,2), (3,3). If the initial
value of $\theta_1 = 0$, what is $MSE(\theta_1)$?
(a) 0
(b) 14/3
(c) 14
(d) undefined
E-mail: cigdem.erogluerdem@gmail.com
Subject: CS304 Poll1
Poll 1

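• Suppose we want to fit a line $h(x) = \theta_1 x$ to
the points (1,1), (2,2), (3,3). If the initial
value of $\theta_1 = 0$, what is $MSE(\theta_1)$?
(b) 14/3

$MSE(\theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right)^2 = \frac{1}{3} \left[ (0 - 1)^2 + (0 - 2)^2 + (0 - 3)^2 \right] = \frac{14}{3}$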
Exercise 1

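• Perform one step of batch gradient descent to update the line parameter $\theta_1$.

Hint:
$\frac{\partial}{\partial \theta_1} MSE(\theta_1) = \frac{2}{m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}$

$\theta_1^{(new)} := \theta_1 - \eta \frac{\partial}{\partial \theta_1} MSE(\theta_1)$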
Exercise 1-solution

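• Perform one step (i.e. one epoch) of batch gradient descent to update the line
parameter $\theta_1$ (use $\eta = 0.1$).

$\frac{\partial}{\partial \theta_1} MSE(\theta_1) = \frac{2}{m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}, \qquad \theta_1^{(new)} := \theta_1 - \eta \frac{\partial}{\partial \theta_1} MSE(\theta_1)$

$\left. \frac{\partial}{\partial \theta_1} MSE(\theta_1) \right|_{\theta_1 = 0} = \frac{2}{3} \left( -1 \times 1 - 2 \times 2 - 3 \times 3 \right) = -\frac{28}{3} \approx -9.33$

$\theta_1^{(new)} = 0 - 0.1 \times (-9.33) \approx 0.93$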
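The arithmetic of Poll 1 and Exercise 1 can be checked with a few lines of plain Python. This is only a sketch; the function and variable names are chosen here for illustration.

```python
# Training points from the poll: (1, 1), (2, 2), (3, 3)
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
m = len(xs)

def mse(theta1):
    """Mean squared error of the line h(x) = theta1 * x."""
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / m

def mse_gradient(theta1):
    """Derivative of the MSE with respect to theta1."""
    return (2 / m) * sum((theta1 * x - y) * x for x, y in zip(xs, ys))

eta = 0.1                                  # learning rate
theta1 = 0.0                               # initial value
initial_cost = mse(theta1)                 # 14/3
gradient = mse_gradient(theta1)            # -28/3, about -9.33
theta1_new = theta1 - eta * gradient       # about 0.93
```

After this single step the cost `mse(theta1_new)` is much lower than the initial 14/3, since the optimum for these points is $\theta_1 = 1$.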
Batch Gradient Descent
Implementation
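A minimal NumPy sketch of batch gradient descent for linear regression. The synthetic data ($y = 4 + 3x$ plus Gaussian noise), the seed, the learning rate of 0.1 and the 1,000 epochs are assumptions for illustration; the normal-equation solution is computed at the end only as a reference for comparison.

```python
import numpy as np

# Synthetic data (assumed): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

eta = 0.1                                  # learning rate
n_epochs = 1000
theta = rng.standard_normal((2, 1))        # random initialization

for epoch in range(n_epochs):
    # Gradient of the MSE over the full training set
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

# Reference solution from the normal equation, for comparison
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
```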

That’s exactly what the Normal equation found! Gradient descent worked perfectly.
But what if you had used a different learning rate (eta)?
Batch Gradient Descent – Learning rate
• Figure shows the first 20 steps of gradient descent using three different learning rates.
• The line at the bottom of each plot represents the random starting point, then each epoch is represented by a
darker and darker line.
• Left: the learning rate is too low - the algorithm will eventually reach the solution, but it will take a long time.
• Middle: the learning rate looks pretty good - in just a few epochs, it has already converged to the solution.
• Right: the learning rate is too high - the algorithm diverges, jumping all over the place and actually getting
further and further away from the solution at every step.

To find a good learning rate you can use grid search.
Batch Gradient Descent – Learning rate
• To choose 𝜂 try:

…, 0.001, 0.01, 0.1, 1, …

…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …

Batch Gradient Descent – number of iterations
• How to set the number of epochs:
• If it is too low: you will still be far away from the optimal solution when the algorithm
stops;
• If it is too high: you will waste time while the model parameters do not change
anymore.
• A simple solution: set a very large number of epochs but interrupt the
algorithm when the gradient vector becomes tiny—that is, when its norm
becomes smaller than a tiny number ϵ (called the tolerance)—because this
happens when gradient descent has (almost) reached the minimum.
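The tolerance-based stopping rule can be sketched as follows. The function name `batch_gd`, its default values, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def batch_gd(X_b, y, eta=0.1, max_epochs=100_000, tol=1e-6):
    """Batch gradient descent that stops once the gradient norm drops below tol."""
    m, n_params = X_b.shape
    theta = np.zeros((n_params, 1))
    for epoch in range(max_epochs):
        gradients = 2 / m * X_b.T @ (X_b @ theta - y)
        if np.linalg.norm(gradients) < tol:
            break                          # (almost) at the minimum
        theta -= eta * gradients
    return theta, epoch

# Synthetic data (assumed): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]

theta, epochs_used = batch_gd(X_b, y)
```

The loop normally exits long before `max_epochs`, so the large epoch budget costs nothing once the gradient has become tiny.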
Stochastic Gradient Descent - advantages

• The main problem with batch gradient descent is the fact that it uses the whole
training set to compute the gradients at every step, which makes it very slow
when the training set is large.
• At the opposite extreme, stochastic gradient descent picks a random instance
in the training set at every step and computes the gradients based only on that
single instance:
• Working on a single instance at a time makes the algorithm much faster because it has
very little data to manipulate at every iteration.
• It also makes it possible to train on huge training sets, since only one instance needs to be
in memory at each iteration.
Stochastic Gradient Descent - disadvantages

• On the other hand, due to its stochastic (i.e., random) nature, this algorithm is
much less regular than batch gradient descent:
• instead of gently decreasing until it reaches the minimum, the cost function will bounce
up and down, decreasing only on average.
• Over time it will end up very close to the minimum, but once it gets there it
will continue to bounce around, never settling down (see Figure 4-9).
• Once the algorithm stops, the final parameter values will be good, but not
optimal.
Stochastic Gradient Descent
Stochastic Gradient Descent
• When the cost function is very irregular (as in Figure 4-6), this can actually help the
algorithm jump out of local minima, so stochastic gradient descent has a better chance of
finding the global minimum than batch gradient descent does.
• Therefore, randomness is good to escape from local optima, but bad because it means that
the algorithm can never settle at the minimum.
• One solution to this dilemma is to gradually reduce the learning rate: The steps start out large
(which helps make quick progress and escape local minima), then get smaller and smaller,
allowing the algorithm to settle at the global minimum.
• The function that determines the learning rate at each iteration is called the learning schedule.
• If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up
frozen halfway to the minimum.
• If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end
up with a suboptimal solution if you halt training too early.
Stochastic Gradient Descent: Implementation

• Implements stochastic gradient descent using a simple learning
schedule:
• By convention we iterate by rounds
of m iterations; each round is called
an epoch.
• The batch gradient descent code
iterated 1,000 times through the whole
training set
• This code goes through the training set
only 50 times and reaches a pretty
good solution
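A sketch of stochastic gradient descent with a simple learning schedule, run for 50 epochs of m iterations each. The schedule hyperparameters `t0` and `t1`, the synthetic data and the seed are assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumed): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
t0, t1 = 5, 50                             # learning schedule hyperparameters (assumed)

def learning_schedule(t):
    """Learning rate that shrinks as the iteration count t grows."""
    return t0 / (t + t1)

theta = rng.standard_normal((2, 1))        # random initialization

for epoch in range(n_epochs):
    for iteration in range(m):
        i = rng.integers(m)                # pick one random instance
        xi = X_b[i : i + 1]
        yi = y[i : i + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)   # single instance: no division by m
        eta = learning_schedule(epoch * m + iteration)
        theta = theta - eta * gradients
```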
Stochastic Gradient Descent: Implementation

• Note: Since instances are picked randomly, some instances may be
picked several times per epoch, while others may not be picked at all.
• If you want to be sure that the
algorithm goes through every instance
at each epoch:
• Shuffle the training set (making sure to
shuffle the input features and the labels
jointly), then go through it instance by
instance,
• Then shuffle it again, and so on.
• This approach is more complex, and it
generally does not improve the result.
Stochastic Gradient Descent

• The first 20 steps of training


• Notice how irregular the steps
are.
Warning
• When using stochastic gradient descent, the training instances must be
independent and identically distributed (IID) to ensure that the
parameters get pulled toward the global optimum, on average.
• A simple way to ensure this is to shuffle the instances during training
(e.g., pick each instance randomly, or shuffle the training set at the
beginning of each epoch).
• If you do not shuffle the instances—for example, if the instances are
sorted by label— then SGD will start by optimizing for one label, then
the next, and so on, and it will not settle close to the global minimum.
Stochastic Gradient Descent
Implementation 2: using Scikit-Learn

• To perform linear regression using stochastic GD with Scikit-Learn, you can use the SGDRegressor
class, which defaults to optimizing the MSE cost function.
• The code runs for a maximum of 1,000 epochs (max_iter) or until the loss drops by less than $10^{-5}$ (tol)
during 100 epochs (n_iter_no_change).
• It starts with a learning rate of 0.01 (eta0), using the default learning schedule (different from the one
we used).
• Lastly, it does not use any regularization (penalty=None):
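A minimal sketch of such a call, using the hyperparameters listed above. The synthetic data ($y = 4 + 3x$ plus Gaussian noise) and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (assumed): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-5, penalty=None, eta0=0.01,
                       n_iter_no_change=100, random_state=42)
sgd_reg.fit(X, y.ravel())                  # fit() expects 1-D targets

bias = sgd_reg.intercept_                  # shape (1,)
weights = sgd_reg.coef_                    # shape (1,)
```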

Solution is quite close to true values (4,3).


Exercise 2

• Perform 1 epoch of stochastic gradient descent to update the line parameter
$\theta_1$ given in Exercise 1 (use $\eta = 0.1$, and make sure you use all 3 points).
Compare your result with Exercise 1.
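One way to check a hand computation is the plain-Python sketch below. It visits the three points once in the listed order and updates $\theta_1$ after each point, using the per-instance gradient $2(\theta_1 x - y)x$; the variable names are illustrative.

```python
# Same three training points as in Exercise 1
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
eta = 0.1
theta1 = 0.0

# One epoch: visit every point once, updating theta1 after each one
history = []
for x, y in points:
    gradient = 2 * (theta1 * x - y) * x    # gradient of the squared error on one instance
    theta1 = theta1 - eta * gradient
    history.append(theta1)
```

The successive values are 0.2, 0.84 and 1.128, so one SGD epoch overshoots the optimum $\theta_1 = 1$ slightly, whereas the single batch step gives about 0.93. Because all three points lie exactly on the line $y = x$, each update multiplies the error $\theta_1 - 1$ by $1 - 2\eta x^2$, so the final value is the same for any visiting order.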
Mini-batch Gradient Descent
Advantages, Disadvantages

• At each step, instead of computing the gradients based on the full training set
(as in batch GD) or based on just one instance (as in stochastic GD), mini-
batch GD computes the gradients on small random sets of instances called
mini-batches.
• The main advantage of mini-batch GD over stochastic GD is that you can get
a performance boost from hardware optimization of matrix operations,
especially when using GPUs.
Mini-batch Gradient Descent
Advantages, Disadvantages

• The algorithm’s progress in parameter space is less erratic than with stochastic
GD, especially with fairly large mini-batches.
• As a result, mini-batch GD will end up walking around a bit closer to the
minimum than stochastic GD—but it may be harder for it to escape from local
minima (in the case of problems that suffer from local minima, unlike linear
regression with the MSE cost function).
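A sketch of mini-batch gradient descent on assumed synthetic data. The mini-batch size of 20, the learning schedule hyperparameters and the seed are illustrative choices.

```python
import numpy as np

# Synthetic data (assumed): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
batch_size = 20                            # assumed mini-batch size
t0, t1 = 200, 1000                         # learning schedule hyperparameters (assumed)

theta = rng.standard_normal((2, 1))        # random initialization
t = 0
for epoch in range(n_epochs):
    shuffled = rng.permutation(m)          # one index order for features and labels
    for start in range(0, m, batch_size):
        idx = shuffled[start : start + batch_size]
        xi, yi = X_b[idx], y[idx]
        # Gradient of the MSE over the current mini-batch only
        gradients = 2 / len(idx) * xi.T @ (xi @ theta - yi)
        t += 1
        eta = t0 / (t + t1)
        theta = theta - eta * gradients
```

Each update is a small matrix operation over 20 instances, which is the kind of computation that hardware-optimized linear algebra handles well.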
Mini-batch Gradient Descent
Mini-batch Gradient Descent
Three Gradient Descent Algorithms

• Figure shows the paths taken by the


three gradient descent algorithms in
parameter space during training.
• They all end up near the minimum, but
batch GD’s path actually stops at the
minimum, while both stochastic GD and
mini-batch GD continue to walk around.
• However, batch GD takes a lot of time to
take each step, and stochastic GD and
mini-batch GD would also reach the
minimum if you used a good learning
schedule.
Three Gradient Descent Algorithms
• There is almost no difference after training: all these algorithms end up with
very similar models and make predictions in exactly the same way.
Poll 2

• Which linear regression training algorithm should you not use if you have a
training set with millions of features? Why?
a) Normal equation
b) Batch gradient descent
c) Stochastic gradient descent
d) Mini-batch gradient descent

E-mail: cigdem.erogluerdem@gmail.com
Subject: CS304 Poll2
Poll 2

• Which linear regression training algorithm should you not use if you have a
training set with millions of features? Why?
a) Normal equation
b) Batch gradient descent
c) Stochastic gradient descent
d) Mini-batch gradient descent
Answer: (a). Reason: the normal equation will be slow due to the matrix inversion.
E-mail: cigdem.erogluerdem@gmail.com
Subject: CS304 Poll2
Resources

• Gradient Descent, Step-by-Step by Josh Starmer (StatQuest)


• https://www.youtube.com/watch?v=sDv4f4s2SB8
Polynomial Regression
Polynomial Regression

• What if the data is more complex than a line?


• We can use a linear model to fit nonlinear data
• A simple way: add powers of each feature as new features, then train a
linear model on this extended set of features.
• This technique is called polynomial regression.
Polynomial Regression - Example

• Generate some nonlinear data based on a simple quadratic equation, that is,
an equation of the form:
$y = ax^2 + bx + c + \text{noise}$
• Noise is uniformly distributed between 0 and 1.
Polynomial Regression - Example

• A straight line will never fit this data properly.


• Use Scikit-Learn’s PolynomialFeatures class to transform our training data,
adding the square (second degree polynomial) of each feature in the training set as a
new feature (in this case there is just one feature):

• X_poly now contains the original feature of X plus the square of this feature.
Polynomial Regression - Example
• Fit a LinearRegression model
to this extended training data

• Model estimates:
• Original function:
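A sketch of the whole example, assuming coefficients $a = 0.5$, $b = 1$, $c = 2$ and a feature range of $[-3, 3)$. These values, the seed and the variable names are illustrative and are not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3             # one feature in [-3, 3)
noise = rng.random((m, 1))                 # uniform noise in [0, 1)
y = 0.5 * X**2 + X + 2 + noise             # assumed a = 0.5, b = 1, c = 2

# Add the square of the feature as a new feature
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)    # columns: x, x^2

# Train a plain linear model on the extended feature set
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
b_hat, a_hat = lin_reg.coef_[0]            # weights of x and x^2
c_hat = lin_reg.intercept_[0]              # absorbs the mean of the noise (about 0.5)
```

Since the uniform noise has mean 0.5 rather than 0, the estimated intercept should come out near $c + 0.5$ rather than $c$.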
Polynomial Regression - Example
• Note: when there are multiple features, polynomial regression is capable of
finding relationships between features, which is something a plain linear
regression model cannot do.
• This is made possible by the fact that PolynomialFeatures also adds all
combinations of features up to the given degree.
• For example, if there were two features a and b, PolynomialFeatures with
degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the
combinations $ab$, $a^2b$, and $ab^2$.
Polynomial Regression - Example
Learning Curves
Polynomial Regression
• If you perform high-degree
Polynomial Regression, you will
likely fit the training data much better
than with plain Linear Regression.
• Figure applies a 300-degree
polynomial model to the preceding
training data and compares the result
with a pure linear model and a
quadratic model (second-degree
polynomial).
• The 300-degree polynomial model
wiggles around to get as close as
possible to the training instances.
Polynomial Regression

• This high-degree Polynomial Regression model is severely overfitting the
training data, while the linear model is underfitting it.
• The model that will generalize best in this case is the quadratic model, which
makes sense because the data was generated using a quadratic model.
• But in general you won’t know what function generated the data, so how can
you decide how complex your model should be?
• How can you tell that your model is overfitting or underfitting the data?
Polynomial Regression

• Cross-validation gives an estimate of a model’s generalization performance:


• If a model performs well on the training data but generalizes poorly according to the
cross-validation metrics, then the model is overfitting.
• If it performs poorly on both training data and cross-validation, then it is underfitting.
• This is one way to tell when a model is too simple or too complex.
Learning Curves
• Another way to tell
is to look at learning
curves.
• Learning curves:
Plots of the model’s
performance on the
training set and the
validation set as a
function of the
training set size.
Learning Curves
• Scikit-Learn learning_curve()
function trains and evaluates the model
using cross-validation.
• By default it retrains the model on growing
subsets of the training set,
• If the model supports incremental learning you
can set
exploit_incremental_learning=True
when calling learning_curve() and it will
train the model incrementally instead.
• The function returns
• the training set sizes at which it evaluated the
model,
• the training and validation scores it measured for
each size and for each cross-validation fold.
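A minimal sketch of calling `learning_curve()` on a plain linear model. The quadratic synthetic data, the ten training-set sizes, the 5-fold cross-validation and the RMSE scoring are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic quadratic data (assumed)
rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = (0.5 * X**2 + X + 2 + rng.standard_normal((m, 1))).ravel()

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # fractions of the largest training fold
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Scores are negated RMSEs, one column per fold; average them into error curves
train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)
```

Plotting `train_errors` and `valid_errors` against `train_sizes` gives the two learning curves.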
Learning Curves

• The model is underfitting:
• Training error:
• With just one or two instances in the training set, the model can fit them
perfectly (training error is zero).
• As the training set grows, the linear model cannot fit it perfectly, and the error
reaches a plateau.
• Validation error:
• With very few training instances, the model cannot generalize well (validation error is large).
• With more training instances, the validation error decreases but reaches a high plateau, very
close to the training curve.
Typical underfitting: both curves have reached a plateau; they are close and fairly high.
Learning Curves
10-th Degree Polynomial
• The learning curves of a 10th-degree
polynomial model on the same data:
• The error on the training data is
much lower than with the Linear
Regression model.
• There is a gap between the curves.
• This means that the model
performs significantly better on
the training data than on the
validation data, which is the
hallmark of an overfitting model.
• If you used a much larger training
set, however, the two curves would
continue to get closer.
Bias-Variance Trade-Off
• An important theoretical result of statistics and machine learning is the fact that a model’s
generalization error can be expressed as the sum of three very different errors:
• Bias: This part of the generalization error is due to wrong assumptions, such as assuming that
the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the
training data.
• Variance: This part is due to the model’s excessive sensitivity to small variations in the
training data. A model with many degrees of freedom (such as a high-degree polynomial
model) is likely to have high variance and thus overfit the training data.
• Irreducible error: This part is due to the noisiness of the data itself. The only way to reduce
this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors,
or detect and remove outliers).
• Increasing a model’s complexity will typically increase its variance and reduce its bias.
Conversely, reducing a model’s complexity increases its bias and reduces its variance. This is
why it is called a trade-off.
Regularized Linear Models
Ridge Regression (L2 Regularization)
Lasso Regression
Elastic Net
Early Stopping
Regularized Linear Models

• A good way to reduce overfitting is to regularize the model (i.e. constrain it).
• A simple way to regularize a polynomial model is to reduce the number of
polynomial degrees.
• For a linear model, regularization is typically achieved by constraining the
weights of the model.
Ridge Regression (L2 Regularization)

• Also called Tikhonov regularization
• A regularization term based on the L2 norm of the parameter vector is added to the cost
function to force the learning algorithm to not only fit the data but also keep the
model weights as small as possible.
• The regularization term is only added during training. Once trained, use the
unregularized performance measure to evaluate the model’s performance.
Ridge Regression - Results

• Figure shows several Ridge models trained on some very noisy linear data using
different α values.
• Left: plain Ridge models are used, leading to
linear predictions.
• Right: data is first expanded using
PolynomialFeatures(degree=10), then it is
scaled using a StandardScaler, and
finally the Ridge models are applied to the
resulting features: this is Polynomial
Regression with Ridge regularization.
• Increasing α leads to flatter (i.e., less extreme,
more reasonable) predictions, thus reducing
the model’s variance but increasing its bias.
Ridge Regression
Implementation

• As with Linear Regression, we can perform Ridge Regression either by
computing a closed-form equation or by performing Gradient Descent. The
pros and cons are the same.
• The equation below shows the closed-form solution, where A is the
$(n + 1) \times (n + 1)$ identity matrix, except with a 0 in the top-left cell,
corresponding to the bias term.
$\hat{\boldsymbol{\theta}} = (X^T X + \alpha A)^{-1} X^T \mathbf{y}$
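A NumPy sketch of the closed-form ridge solution described above. The function name `ridge_closed_form`, the synthetic data and the α values are assumptions; note also that some formulations scale α differently (for example by the number of instances), so the same α may not give identical weights across libraries.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X_b^T X_b + alpha * A) theta = X_b^T y, leaving the bias unpenalized."""
    m = X.shape[0]
    X_b = np.c_[np.ones((m, 1)), X]        # prepend the bias feature x0 = 1
    A = np.eye(X_b.shape[1])
    A[0, 0] = 0.0                          # 0 in the top-left cell: no penalty on the bias
    return np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)

# Noisy linear data (assumed)
rng = np.random.default_rng(42)
X = 3 * rng.random((20, 1))
y = 1 + 0.5 * X + rng.standard_normal((20, 1)) / 1.5

theta_plain = ridge_closed_form(X, y, alpha=0.0)   # plain least squares
theta_ridge = ridge_closed_form(X, y, alpha=10.0)  # slope shrunk toward zero
```

Using `np.linalg.solve` rather than an explicit inverse is a design choice made here for numerical stability; it computes the same solution.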
Ridge Regression
Implementation

• With Scikit-Learn using a closed-form solution:

• Using Stochastic Gradient Descent:


L2 norm indicates ridge regression

A 1-D array, containing the elements of y, is returned
Lasso Regression (L1 Regularization)

• Least Absolute Shrinkage and Selection Operator (LASSO) regression adds
a regularization term to the cost function using the L1 norm of the weight
vector:
• Cost function:

• Important characteristic of Lasso regression:


• Eliminates the weights of least important features (i.e. sets them to zero).
• In other words, Lasso regression automatically performs feature selection and outputs
a sparse model (i.e. with few non-zero feature weights)
Lasso Regression (L1 Regularization)

• Right: The curve with $\alpha = 0.01$ looks roughly cubic:
• All the weights for the high-degree polynomial features are zero.
Lasso versus Ridge regularization

• The axes represent model parameters


• The background contours represent different loss functions.
• Top: the contours represent the ℓ1 loss (|θ1| + |θ2|), which
drops linearly as you get closer to any axis.
• Initialize the model parameters to θ1= 2 and θ2= 0.5
• Gradient Descent will decrement both parameters equally (as
represented by the dashed yellow line); therefore θ2 will reach 0
first (since it was closer to 0 to begin with).
• After that, Gradient Descent will roll down the axis until it
reaches θ1 = 0.
• Bottom: the contours represent the ℓ2 loss
• Gradient descent takes a straight path to origin
Lasso versus Ridge regularization

• Model parameters are initialized as:


• θ1 = 0.25, θ2 = –1
• White circles: path of gradient descent.
• Red squares: global optima. Increasing 𝛼 will shift the red square toward the left
along the yellow dashed line.
• Top:
• Contours represent Lasso regression’s cost function (MSE + ℓ1 loss)
• The path quickly reaches θ2 = 0, and then bounces around the global optimum
(gradually reduce the learning rate to converge)
• Optimal parameters for the unregularized MSE are θ1 = 2 and θ2 = 0.5
• Bottom:
• Contours represent Ridge regression’s cost function (MSE+ℓ2 loss)
• Gradient gets smaller as the parameters approach the global minimum so gradient
descent naturally slows down. This limits the bouncing around
• Ridge converges faster than lasso regression.
Lasso Regression (optional)
Lasso Regression (optional)
• The lasso cost function is not differentiable at 𝜃𝑖 = 0 (for 𝑖 = 1, 2, ⋯ , 𝑛), but
gradient descent still works if you use a subgradient vector g instead when any
𝜃𝑖 = 0.
• A subgradient vector equation can be used for gradient descent with the lasso cost
function: it replaces the derivative of |𝜃𝑖| with sign(𝜃𝑖), which is 0 where 𝜃𝑖 = 0.
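The subgradient idea can be sketched in NumPy. This is only an illustration: it assumes the lasso cost has the form MSE(θ) + 2α Σ|θi| (other conventions scale the penalty differently), and the synthetic data, the learning schedule and the helper name lasso_subgradient are hypothetical.

```python
import numpy as np

def lasso_subgradient(theta, X_b, y, alpha):
    """Subgradient of MSE(theta) + 2*alpha*sum(|theta_i|); the bias theta[0] is not penalized."""
    m = len(X_b)
    mse_grad = 2 / m * X_b.T @ (X_b @ theta - y)
    sign = np.sign(theta)   # -1, 0 or +1; exactly 0 where theta_i == 0
    sign[0] = 0.0           # leave the bias term unregularized
    return mse_grad + 2 * alpha * sign

# Synthetic data (assumption): only the first feature is useful.
rng = np.random.default_rng(42)
m = 200
X = rng.normal(size=(m, 3))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=m)
X_b = np.c_[np.ones(m), X]  # add the bias column

theta = np.zeros(4)
for t in range(1, 5001):
    eta = 0.1 / np.sqrt(t)  # shrinking learning rate damps the bouncing around zero
    theta -= eta * lasso_subgradient(theta, X_b, y, alpha=0.5)

print(theta.round(3))  # weights of the two useless features end up very close to 0
```

Plain subgradient descent only drives the useless weights to approximately zero; dedicated solvers such as coordinate descent set them to exactly zero.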
Lasso Regression
Implementation with Scikit-learn

• A small Scikit-Learn example using the Lasso class:

• You could instead use SGDRegressor(penalty="l1", alpha=0.1).
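A minimal sketch of both options mentioned above, assuming a small synthetic linear dataset (the data and the query point 1.5 are illustrative assumptions, not taken from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso, SGDRegressor

# Synthetic data (assumption): noisy linear relationship with one feature.
rng = np.random.default_rng(42)
m = 20
X = 3 * rng.random((m, 1))
y = 1 + 0.5 * X[:, 0] + rng.normal(size=m) / 1.5

# Lasso regression with the dedicated class.
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_pred = lasso_reg.predict([[1.5]])

# Alternative: stochastic gradient descent with an l1 penalty.
sgd_reg = SGDRegressor(penalty="l1", alpha=0.1, random_state=42)
sgd_reg.fit(X, y)
sgd_pred = sgd_reg.predict([[1.5]])

print(lasso_pred, sgd_pred)
```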


Elastic Net Regression

• Elastic Net is a middle ground between Ridge Regression and Lasso Regression.
• The regularization term is a simple mix of both Ridge and Lasso’s regularization
terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to
Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.

Scikit-Learn implementation (l1_ratio corresponds to the mix ratio r):
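A minimal sketch using the ElasticNet class, assuming the same kind of small synthetic dataset as for Lasso (the data and the hyperparameter values are illustrative). The last lines check the statement that a mix ratio of 1 reduces Elastic Net to Lasso.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (assumption): noisy linear relationship with one feature.
rng = np.random.default_rng(42)
m = 20
X = 3 * rng.random((m, 1))
y = 1 + 0.5 * X[:, 0] + rng.normal(size=m) / 1.5

# l1_ratio plays the role of the mix ratio r.
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
enet_pred = elastic_net.predict([[1.5]])

# With l1_ratio=1.0 the model coincides with Lasso for the same alpha.
pure_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
same_as_lasso = np.allclose(pure_l1.coef_, lasso.coef_)
print(enet_pred, same_as_lasso)
```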


Which regularization method?

• When should you use plain Linear Regression (i.e., without any regularization),
Ridge, Lasso, or Elastic Net?
• It is almost always preferable to have at least a little bit of regularization, so
generally you should avoid plain Linear Regression.
• Ridge is a good default, but if you suspect that only a few features are useful,
you should prefer Lasso or Elastic Net because they tend to reduce the useless
features’ weights down to zero.
• In general, Elastic Net is preferred over Lasso because
• Lasso may behave erratically when the number of features is greater than the number of
training instances or when several features are strongly correlated.
Early Stopping

• A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping.
• Figure (Early stopping regularization): a complex model (a high-degree Polynomial Regression model) being trained with Batch Gradient Descent.
Early Stopping - Implementation

• With stochastic and mini-batch gradient descent, the curves are not so
smooth (as in previous figure), and it may be hard to know whether you
have reached the minimum or not.
• One solution is to stop only after the validation error has been above
the minimum for some time (when you are confident that the model
will not do any better), then roll back the model parameters to the point
where the validation error was at a minimum.
Early Stopping
Implementation
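A minimal sketch of such an early-stopping loop. The synthetic quadratic dataset, the 50/50 split, degree=10, eta0=0.002 and n_epochs=500 are illustrative assumptions rather than values confirmed by the slides.

```python
from copy import deepcopy

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic data (assumption), split into training and validation sets.
rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=m)
X_train, y_train = X[: m // 2], y[: m // 2]
X_valid, y_valid = X[m // 2:], y[m // 2:]

# Add polynomial features and scale them; fit the scaler on the training set only.
preprocessing = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False), StandardScaler())
X_train_prep = preprocessing.fit_transform(X_train)
X_valid_prep = preprocessing.transform(X_valid)

sgd_reg = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
n_epochs = 500
best_valid_rmse = float("inf")
best_model = None

for epoch in range(n_epochs):
    sgd_reg.partial_fit(X_train_prep, y_train)  # one more epoch, continuing from the current weights
    y_valid_predict = sgd_reg.predict(X_valid_prep)
    val_rmse = np.sqrt(mean_squared_error(y_valid, y_valid_predict))
    if val_rmse < best_valid_rmse:
        best_valid_rmse = val_rmse
        best_model = deepcopy(sgd_reg)          # keep hyperparameters and learned weights

final_valid_rmse = val_rmse
print(best_valid_rmse, final_valid_rmse)
```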
Early Stopping
Implementation

• The previous code first adds the polynomial features and scales all the input features, both
for the training set and for the validation set (the code assumes that you have split the
original training set into a smaller training set and a validation set).

• Then it creates an SGDRegressor model with no regularization and a small learning rate.

• In the training loop, it calls partial_fit() instead of fit(), to perform incremental
learning (continuing where it left off).
Early Stopping
Implementation

• At each epoch, it measures the RMSE on the validation set. If it is lower than the
lowest RMSE seen so far, it saves a copy of the model in the best_model variable.

• This implementation does not actually stop training, but it lets you revert to the best
model after training.
• Note that the model is copied using copy.deepcopy(), because it copies both the
model’s hyperparameters and the learned parameters. In contrast,
sklearn.base.clone() only copies the model’s hyperparameters.
Logistic Regression

• Logistic Regression (also called Logit Regression) is commonly used to
estimate the probability that an instance belongs to a particular class:
• e.g., what is the probability that this email is spam?
• If the estimated probability is greater than 50%, then the model predicts that
the instance belongs to that class (called the positive class, labeled “1”), and
otherwise it predicts that it does not (i.e., it belongs to the negative class,
labeled “0”).
• This makes it a binary classifier.
Logistic Regression
Estimating Probabilities

• The Logistic Regression model computes a weighted sum of the input features (plus a bias term)
• Instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result using the sigmoid function: 𝑝̂ = σ(𝜃ᵀ𝒙), where σ(𝑡) = 1 / (1 + exp(–𝑡))
Logistic Regression
Estimating Labels
Note

• The score 𝑡 is often called the logit.


• The name comes from the fact that the logit function, defined as
𝑙𝑜𝑔𝑖𝑡(𝑝) = log(𝑝 / (1 – 𝑝)), is the inverse of the logistic function.
• Indeed, if you compute the logit of the estimated probability 𝑝, you will
find that the result is 𝑡.
• The logit is also called the log-odds, since it is the log of the ratio between
the estimated probability for the positive class and the estimated probability
for the negative class.
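The relationship between the sigmoid and the logit can be checked numerically. A small sketch, where the score values in t are arbitrary illustrative inputs:

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps a score t to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Log-odds: log of p / (1 - p), the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

t = np.array([-3.0, 0.0, 2.5])  # arbitrary scores
p = sigmoid(t)                  # estimated probabilities
recovered = logit(p)            # applying the logit gives back the scores
print(p, recovered)
```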
Logistic Regression
Training and Cost Function

• The objective of training is to set the parameter vector 𝜃 so that the model estimates high
probabilities for positive instances (𝑦 = 1) and low probabilities for negative instances
(𝑦 = 0).
• Cost for a single training instance:

• The cost function makes sense:


• −log(𝑡) grows very large when 𝑡 approaches 0, so the cost will be large if the model
estimates a probability close to 0 for a positive instance,
• It will also be large if the model estimates a probability close to 1 for a negative
instance.
• On the other hand, – log(𝑡) is close to 0 when 𝑡 is close to 1, so the cost will be close
to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a
positive instance, which is precisely what we want.
Logistic Regression
Training and Cost Function

• Cost for a single training instance:


where

• Cost function over the whole training set (log loss) is the average cost over all
training samples:
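A small sketch of the two quantities, assuming the usual definitions: the cost of one instance is −log(p̂) if y = 1 and −log(1 − p̂) if y = 0, and the log loss is the average of these costs. The labels and probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

def instance_cost(p_hat, y):
    """Cost per instance: -log(p_hat) for positives, -log(1 - p_hat) for negatives."""
    return np.where(y == 1, -np.log(p_hat), -np.log(1 - p_hat))

def log_loss_manual(p_hat, y):
    """Same costs written as a single expression and averaged over the training set."""
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 0])              # true labels (illustrative)
p_hat = np.array([0.9, 0.1, 0.2, 0.8])  # estimated probabilities of the positive class

costs = instance_cost(p_hat, y)  # confident and correct gives a small cost, confident and wrong a large one
J = log_loss_manual(p_hat, y)
print(costs, J, log_loss(y, p_hat))
```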
Note about log loss (optional)

• It can be shown mathematically (using Bayesian inference) that minimizing the log
loss will result in the model with the maximum likelihood of being optimal,
assuming that the instances follow a Gaussian distribution around the mean of their
class.
• When you use the log loss, this is the implicit assumption you are making.
• The more wrong this assumption is, the more biased the model will be.
• Similarly, when we used the MSE to train linear regression models, we were
implicitly assuming that the data was purely linear, plus some Gaussian noise. So, if
the data is not linear (e.g., if it’s quadratic) or if the noise is not Gaussian (e.g., if
outliers are not exponentially rare), then the model will be biased.
Logistic Regression
Training

• Bad news: there is no known closed-form equation to compute the value of θ that
minimizes this cost function.
• Good News: this cost function is convex, so Gradient Descent (or any other
optimization algorithm) is guaranteed to find the global minimum (if the learning
rate is not too large and you wait long enough).
• The partial derivatives of the logistic cost function with regard to the jth model
parameter θj are given by:

• Once you have the gradient vector containing all the derivatives, you can use it in
the batch, stochastic or mini-batch gradient descent algorithm.
For full derivation see: https://medium.com/analytics-vidhya/derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d
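A sketch of batch gradient descent on the log loss, assuming the standard gradient (1/m) Xᵀ(σ(Xθ) − y), whose jth component is the average of (σ(θᵀx) − y)·xj over the training set. The synthetic data, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(theta, X_b, y):
    p = sigmoid(X_b @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_loss_gradient(theta, X_b, y):
    """Vector of all partial derivatives of the log loss."""
    m = len(X_b)
    return X_b.T @ (sigmoid(X_b @ theta) - y) / m

# Synthetic binary problem (assumption): noisy threshold on a single feature.
rng = np.random.default_rng(0)
m = 200
x = rng.normal(size=m)
y = (x + 0.5 * rng.normal(size=m) > 0).astype(float)
X_b = np.c_[np.ones(m), x]  # add the bias column

theta = np.zeros(2)
eta = 0.5
initial_cost = log_loss(theta, X_b, y)
for _ in range(1000):
    theta -= eta * log_loss_gradient(theta, X_b, y)  # batch gradient descent step
final_cost = log_loss(theta, X_b, y)
print(theta, initial_cost, final_cost)
```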
Logistic Regression
Example

• Iris dataset: sepal and petal length and width of 150 iris flowers of
three species: Iris setosa, Iris versicolor and Iris virginica.
Logistic Regression
Example – Iris dataset

• Goal: Build a classifier to detect the Iris virginica type based only on the petal width feature.
• Load the data and take a quick look:
Logistic Regression
Example – Iris dataset

• Split the data and train a logistic regression model on the training set:
Example – Iris dataset

• Look at the model’s estimated probabilities for flowers with petal widths
varying from 0 cm to 3 cm
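The three steps above (load, split and train, inspect probabilities) can be sketched as follows. Selecting the petal width by column index and using random_state=42 are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
print(iris.feature_names)  # quick look: sepal/petal length and width
print(iris.target_names)   # setosa, versicolor, virginica

X = iris.data[:, 3:]       # petal width (cm) only, kept as a 2D array
y = iris.target == 2       # True for Iris virginica, False otherwise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# Estimated probabilities for petal widths from 0 cm to 3 cm.
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)  # columns: [not virginica, virginica]
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]
print(decision_boundary)
print(log_reg.predict([[1.7], [1.5]]))
```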
Logistic Regression
Example

• The petal width of Iris virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris
flowers (represented by squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm. Notice that
there is a bit of overlap.
• Above about 2 cm the classifier is highly confident that the flower is an Iris virginica (it outputs a high probability for
that class), while below 1 cm it is highly confident that it is not an Iris virginica (high probability for the “Not Iris
virginica” class).
• In between these extremes, the classifier is unsure. However, if you ask it to predict the class (using the predict()
method rather than the predict_proba() method), it will return whichever class is the most likely.
• Therefore, there is a decision boundary at around 1.6 cm where both probabilities are equal to 50%: if the petal width
is higher than 1.6 cm, the classifier will predict that the flower is an Iris virginica, and otherwise it will predict that it
is not (even if it is not very confident):
Logistic Regression
Example – Iris dataset

• Use two features, petal width and petal length, to classify Iris virginica vs. not Iris virginica.
• The dashed line represents the points where the model estimates a 50% probability: this is
the model’s decision boundary. Note that it is a linear boundary.
• Each parallel line represents the points where the model outputs a specific probability, from
15% (bottom left) to 90% (top right). All the flowers beyond the top-right line have an over
90% chance of being Iris virginica, according to the model.
Logistic Regression
Regularization
Softmax Regression
(Multinomial Logistic Regression)

• The Logistic Regression model can be generalized to support multiple
classes directly, without having to train and combine multiple binary
classifiers.
• The idea is simple: when given an instance 𝒙, the Softmax Regression model
first computes a score 𝑠𝑘(𝒙) for each class k, then estimates the probability of
each class by applying the softmax function (also called the normalized
exponential) to the scores.
• The equation to compute 𝑠𝑘(𝒙) should look familiar, as it is just like the
equation for Linear Regression prediction: 𝑠𝑘(𝒙) = (𝜃^(𝑘))ᵀ 𝒙, where 𝜃^(𝑘) is the parameter vector of class 𝑘.
Softmax Regression
(Multinomial Logistic Regression)

• Once you have computed the score of every class for the instance 𝒙, you can estimate the probability 𝑝̂𝑘 that the instance belongs to class 𝑘 by running the scores through the softmax function (Equation 4-20).
• The function computes the exponential of every score, then normalizes them (dividing by the sum of all the exponentials). The scores are generally called logits or log-odds.
Softmax Regression
(Multinomial Logistic Regression)

• Just like the Logistic Regression classifier, the Softmax Regression
classifier predicts the class with the highest estimated probability (which is
simply the class with the highest score):
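A small NumPy sketch of scoring, softmax and prediction. The parameter matrix Theta (one column per class) and the two instances are hypothetical values chosen for illustration; subtracting the row maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: exponentiate every score, then divide by the row sum."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # stability; result unchanged
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical parameters: shape (n_features, n_classes), column k is theta^(k).
Theta = np.array([[0.5, -1.0, 0.2],
                  [1.0, 0.3, -0.7]])
X = np.array([[1.0, 2.0],
              [-1.0, 0.5]])      # two instances, two features

scores = X @ Theta               # s_k(x) for every instance and class
proba = softmax(scores)          # estimated class probabilities
y_pred = proba.argmax(axis=1)    # class with the highest probability
print(proba.round(3), y_pred)
```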
Softmax Regression
Training

• The objective is to have a model that estimates a high probability for the target class (and consequently a low probability for the other classes).
• Minimizing the cost function shown in Equation 4-22, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class.
• Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes.
Cross-Entropy (optional)
• Cross entropy originated from Claude Shannon’s information theory.
• Suppose you want to efficiently transmit information about the weather every day. If there are eight
options (sunny, rainy, etc.), you could encode each option using 3 bits, because 2^3 = 8.
• However, if you think it will be sunny almost every day, it would be much more efficient to code
“sunny” on just one bit (0) and the other seven options on four bits (starting with a 1).
• Cross entropy measures the average number of bits you actually send per option.
• If your assumption about the weather is perfect, cross entropy will be equal to the entropy of the
weather itself (i.e., its intrinsic unpredictability).
• But if your assumption is wrong (e.g., if it rains often), cross entropy will be greater by an amount
called the Kullback–Leibler (KL) divergence.
• The cross entropy between two discrete probability distributions 𝑝 and 𝑞 is defined as
𝐻(𝑝, 𝑞) = – ∑𝑥 𝑝(𝑥) log 𝑞(𝑥). For more details, check out this video on the subject:
https://www.youtube.com/watch?v=ErfnhcEV1O8
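The weather example can be worked through numerically. In this sketch the probabilities (sunny 80% of the time, or all eight options equally likely) and the assumed distribution q are made-up values for illustration.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p(x) log2 p(x), in bits."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x), in bits."""
    return -np.sum(p * np.log2(q))

# Code from the example: "sunny" on 1 bit, the other seven options on 4 bits.
code_lengths = np.array([1] + [4] * 7)

p_sunny = np.array([0.8] + [0.2 / 7] * 7)  # sunny most days (assumption)
p_uniform = np.full(8, 1 / 8)              # all options equally likely (assumption)

avg_bits_sunny = p_sunny @ code_lengths      # 0.8*1 + 0.2*4 = 1.6 bits, better than 3
avg_bits_uniform = p_uniform @ code_lengths  # 3.625 bits, worse than the 3-bit code

# Cross entropy exceeds the entropy by the KL divergence when the assumption q is wrong.
q = np.array([0.65] + [0.05] * 7)
kl = cross_entropy(p_uniform, q) - entropy(p_uniform)
print(avg_bits_sunny, avg_bits_uniform, entropy(p_uniform), kl)
```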
Softmax Regression
Training - Optimization

The gradient vector of this cost function with regard to 𝜃^(𝑘) is:
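A sketch of batch gradient descent for softmax regression, assuming the standard cross-entropy gradient: for class k it is the average over the training set of (p̂k − yk)·x, computed below for all classes at once as Xᵀ(P − Y)/m. The three-cluster dataset, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(Theta, X_b, Y_onehot):
    P = softmax(X_b @ Theta)
    return -np.mean(np.sum(Y_onehot * np.log(P), axis=1))

def cross_entropy_gradient(Theta, X_b, Y_onehot):
    """Column k holds the gradient with regard to theta^(k)."""
    m = len(X_b)
    P = softmax(X_b @ Theta)
    return X_b.T @ (P - Y_onehot) / m

# Synthetic data (assumption): three Gaussian clusters, one per class.
rng = np.random.default_rng(1)
m = 300
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
y = rng.integers(0, 3, size=m)
X = centers[y] + rng.normal(size=(m, 2))
X_b = np.c_[np.ones(m), X]  # add the bias column
Y_onehot = np.eye(3)[y]     # one-hot target matrix

Theta = np.zeros((3, 3))    # (n_features + 1, n_classes)
initial_cost = cross_entropy(Theta, X_b, Y_onehot)
for _ in range(2000):
    Theta -= 0.1 * cross_entropy_gradient(Theta, X_b, Y_onehot)
final_cost = cross_entropy(Theta, X_b, Y_onehot)
accuracy = np.mean(softmax(X_b @ Theta).argmax(axis=1) == y)
print(initial_cost, final_cost, accuracy)
```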
Softmax Regression
Example: Iris Dataset

• Use softmax regression to classify the iris plants into all three classes.
• Scikit-Learn’s LogisticRegression classifier uses softmax regression automatically
when you train it on more than two classes (assuming you use solver="lbfgs", which is the
default).
• It also applies ℓ2 regularization by default, which you can control using the hyperparameter C
Softmax Regression
Example: Iris Dataset

• If you find an iris with petals that are 5 cm long and 2 cm wide, you can ask
your model to tell you what type of iris it is
• It will answer Iris virginica (class 2) with 96% probability (or Iris versicolor with 4%
probability):
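A minimal sketch of this example. Selecting petal length and width by column index, C=30 and random_state=42 are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 2:4]  # petal length (cm), petal width (cm)
y = iris.target        # 0 = setosa, 1 = versicolor, 2 = virginica

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With three classes and the default lbfgs solver, this trains a softmax model.
softmax_reg = LogisticRegression(C=30, random_state=42)
softmax_reg.fit(X_train, y_train)

# An iris with petals 5 cm long and 2 cm wide.
pred = softmax_reg.predict([[5, 2]])
proba = softmax_reg.predict_proba([[5, 2]])
print(pred, proba.round(2))
```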
Softmax Regression
Example: Iris Dataset

• The decision boundaries between any two classes are linear.


• The probabilities for the Iris versicolor class are represented by the curved lines (e.g., the line labeled with 0.30
represents the 30% probability boundary).
• Notice that the model can predict a class that has an estimated probability below 50%. For example, at the point
where all decision boundaries meet, all classes have an equal estimated probability of 33%.
Summary

• Linear Regression
• Closed-form solution
• Gradient descent (batch, stochastic, mini-batch)
• Polynomial regression
• Regularization (Ridge, Lasso, Elastic Net)
• Learning curves
• Early stopping
• Logistic regression (binary classifier, linear classifier)
• Softmax regression (multiclass classifier, linear classifier)
