
Model training

(Anything done while we train the model)

1
List of topics
• Overfitting and learning curves
Validation loss

▪ Your validation loss will first get worse during training because the model gets
overconfident, and only later will get worse because it is incorrectly memorizing the data.

▪ We care in practice about only the latter issue.

▪ REMEMBER, our loss function is something that we use to allow our optimizer to have
something it can differentiate and optimize; it’s not the thing we care about in practice.

Howard, Jeremy, and Sylvain Gugger. Deep Learning for Coders


with fastai and PyTorch. O'Reilly Media, 2020. 3
Training vs. scoring

▪ When we apply a machine learning algorithm to a dataset in order to obtain a model, we


talk about model training.

▪ When we apply a trained model to an input example (or, sometimes, a sequence of


examples) in order to obtain a prediction (or, predictions) or to somehow transform an
input, we talk about scoring.

https://www.dropbox.com/s/wxybbtbiv64yf0j/Chapter1.pdf?dl=0 4
Parameters vs. hyperparameters

▪ Hyperparameters are inputs of machine learning algorithms or pipelines that influence the
performance of the model. They don’t belong to the training data and cannot be learned
from it. Example: the maximum depth of the tree in the decision tree learning algorithm.

▪ Parameters are variables that define the model trained by the learning algorithm.
Parameters are directly modified by the learning algorithm based on the training data. The
goal of learning is to find such values of parameters that make the model optimal in a
certain sense. Example: parameters are w and b in the equation of linear regression.

https://www.dropbox.com/s/wxybbtbiv64yf0j/Chapter1.pdf?dl=0 5
Hyperparameter tuning: #1

▪ Method #1: Luckily, there is a correct way of tuning the hyperparameters and it does not
touch the test set at all. The idea is to split our training set in two: a slightly smaller training
set, and what we call a validation set.

▪ Method #2: In cases where the size of your training data (and therefore also the validation
data) might be small, people sometimes use a more sophisticated technique for
hyperparameter tuning called cross-validation.

https://cs231n.github.io/classification/
6
Hyperparameter tuning: #2

▪ Common data splits. A training and test set is given. The training set is split into folds (for example 5
folds here). The folds 1-4 become the training set. One fold (e.g. fold 5 here in yellow) is denoted as
the Validation fold and is used to tune the hyperparameters. Cross-validation goes a step further and
iterates over the choice of which fold is the validation fold, separately from 1-5. This would be
referred to as 5-fold cross-validation. In the very end once the model is trained and all the best
hyperparameters were determined, the model is evaluated a single time on the test data (red).

https://cs231n.github.io/classification/ 7
Hyperparameter tuning: #2.2

▪ Essentially, when:
▪ the dataset is small,
▪ or you want to get the mean performance
(keeping in mind that the k estimates are NOT independent),
▪ you can use CV instead of a separate validation and test set.
8
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec9.pdf
Hyperparameter tuning: #2.1
▪ How to split the data:

▪ The training set is used by the machine learning algorithm to train the model.

▪ The validation set is needed to find the best values for the hyperparameters of the
machine learning pipeline. Most importantly this set is not seen by the machine
learning algorithm.

▪ The test set is used for reporting: once you have your best model, you test its
performance on the test set and report the results.

https://www.dropbox.com/s/3l9cp75esgvqpxp/Chapter3.pdf?dl=0 9
Hyperparameter tuning: #3

▪ Split your training data randomly into train/val splits.

▪ As a rule of thumb, between 70-90% of your data usually goes to the train split.

▪ This setting depends on how many hyperparameters you have and how much of an
influence you expect them to have.

▪ If there are many hyperparameters to estimate, you should err on the side of having a larger
validation set to estimate them effectively.

https://cs231n.github.io/classification/ 10
Hyperparameter tuning: #4

https://scikit-learn.org/stable/modules/cross_validation.html 11
Hyperparameter tuning: #5

12
Hyperparameter tuning: #6

13
Hyperparameter tuning (HPO): the cost issue

▪ The issue: HPO techniques are very expensive and resource-heavy in practice, especially for
applications of deep learning.

▪ Partial solutions exist, but even approaches like Hyperband or Bayesian optimization, which are
specifically designed to minimize the number of training cycles needed, are still not able to
deal with certain problems due to the complexity of the models or the size of the datasets.

Paleyes, Andrei, Raoul-Gabriel Urma, and Neil D. Lawrence. "Challenges in deploying machine
learning: a survey of case studies." arXiv preprint arXiv:2011.09926 (2020). 14
HPO: available methods
▪ To solve HPO problems, we need to use data-driven methods. Manual methods are not
effective.

▪ Grid search: all combinations of hyper-parameters are evaluated. Thus, grid search is
computationally expensive, infeasible and suffers from the curse of dimensionality. It is easy
to parallelise.

▪ Random search: samples configurations at random until a particular budget B is exhausted.
It is also easy to parallelise.
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 15
▪ Grid search: Given a finite set of discrete values for each hyperparameter, exhaustively cross-
validate all combinations.
▪ Random search: Given a discrete or continuous distribution for each hyperparameter,
randomly sample from the joint distribution. Generally more efficient than exhaustive grid
search.
▪ Bayesian optimization: Sample like random search, but update the search space you sample
from as you go, based on outcomes of prior searches.
▪ Gradient-based optimization: Attempt to estimate the gradient of the cross-validation metric
with respect to the hyperparameters and ascend/descend the gradient.
▪ Evolutionary optimization: Sample the search space, discard combinations with poor metrics,
and genetically evolve new combinations based on the successful combinations.
▪ Population-based training: A method of performing hyperparameter optimization at the
same time as training.

https://towardsdatascience.com/beyond-grid-search-hypercharge-hyperparameter-tuning-for-xgboost-7c78f7a2929d 16
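▪ A minimal sketch of the first two methods in the list above, using scikit-learn's GridSearchCV and RandomizedSearchCV (the model, dataset and parameter ranges are illustrative assumptions, not taken from the cited article):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search: every combination of the discrete values below is cross-validated.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=5,
)
grid.fit(X_train, y_train)

# Random search: sample 20 configurations from (possibly continuous) distributions.
rand = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=20,
    cv=5,
    random_state=0,
)
rand.fit(X_train, y_train)

print("grid  :", grid.best_params_, grid.best_score_)
print("random:", rand.best_params_, rand.best_score_)
```

Random search touches only n_iter configurations drawn from the distributions, which is why it usually scales better than the exhaustive grid when there are many hyperparameters.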
Bayesian Optimization for HPO: #1
▪ Random search eventually converges to the optimal answer, but it is a blind search! Is there
a smarter way to search? Yes – Bayesian optimization.

▪ In essence, Bayesian optimization is a probabilistic model that learns an expensive
objective function based on previous observations.

▪ It has two powerful components:

▪ Surrogate model: models the objective function and can be evaluated cheaply.
▪ Acquisition function: measures the value that would be generated by evaluating
the objective function at a new point.

https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 17
Bayesian Optimization for HPO: #2
[Figure: the dashed line is the true objective function; the continuous line is the surrogate model (a regression model); the blue tube around the black line is our uncertainty; the new observation is marked on the curves.]

We also have an acquisition function, which is the way we explore the search space to find the
next optimum value to observe. In other words, the acquisition function helps us improve our
surrogate model and select the next value. In the image above, the acquisition function is
shown as an orange curve: its maximum lies where the uncertainty is high and the predicted
value is low.

https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 18
Pros & cons of Bayesian optimisation for HPO

▪ Pros: the most important advantage of Bayesian optimization is that it can operate very
well with black-box functions. BO is also data-efficient and robust to noise.

▪ Cons: it does not work well with parallel resources, because the optimization process is
sequential.

https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 19
A bit more on Bayesian optimisation

▪ Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a


search of a GLOBAL optimization problem that is efficient and effective.

▪ It works by building a probabilistic model of the objective function, called the surrogate
function, that is then searched efficiently with an acquisition function before candidate
samples are chosen for evaluation on the real objective function.

https://machinelearningmastery.com/what-is-bayesian-optimization/ 20
Surrogate function & acquisition function
▪ The surrogate function is a technique used to best approximate the mapping of input
examples to an output score. Probabilistically, it summarizes the conditional probability of
an objective function (f), given the available data. This can be done via a Gaussian process
(GP), a model that constructs a joint probability distribution over the variables, assuming a
multivariate Gaussian distribution. As such, it is capable of efficient and effective
summarization of a large number of functions and of a smooth transition as more
observations are made available to the model. This smooth structure and smooth transition
to new functions based on data are desirable properties as we sample the domain. An
important aspect in defining the GP model is the kernel, and RBF seems to be the most
common choice.

▪ The acquisition function is responsible for scoring or estimating the likelihood that a given
candidate sample (input) is worth evaluating with the real objective function.

https://machinelearningmastery.com/what-is-bayesian-optimization/ 21
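▪ A toy sketch of the loop described above (the 1-D objective and all settings are made up for illustration, not the article's code): a Gaussian-process surrogate with an RBF kernel, plus an expected-improvement acquisition function that picks the next point to evaluate.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):                       # hypothetical expensive objective (we maximise it)
    return -(np.sin(3 * x) + 0.5 * x)

rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 5, size=(3, 1))  # a few initial observations
y_obs = objective(X_obs).ravel()
candidates = np.linspace(0, 5, 500).reshape(-1, 1)

for _ in range(10):
    # Cheap surrogate of the objective, refit on everything observed so far.
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-6, normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement over the best observation so far.
    best = y_obs.max()
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # The acquisition function chooses where to spend the next expensive evaluation.
    x_next = candidates[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next))

print("best x:", X_obs[np.argmax(y_obs)], "best y:", y_obs.max())
```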
Some examples of regression models used in
Bayesian optimisation for HPO

Acquisition functions
define a balance between
exploring new areas in
the objective space and
exploiting areas that are
already known to have
favourable values

https://static.sigopt.com/b/20a144d208ef255d3b981ce419667ec25d8412e2/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf
22
Multi-fidelity optimization for HPO

▪ In the Bayesian method, estimation of the objective function is very expensive. Is there any
cheaper way to estimate the objective function?

▪ Multi-fidelity optimization methods are the answer. Some of them are:

▪ Successive halving
▪ HyperBand
▪ BOHB

https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 23
Successive halving for HPO
▪ Imagine you have N different configurations
and a budget (for example time).

▪ In each iteration (see the figure), successive halving keeps the better half of the
configurations and discards the half that performs worst.

▪ It continues until just one configuration remains, or until the maximum budget is reached.

▪ Issue: in successive halving there is a trade-off between selecting the number of
configurations and allocating the budget per configuration.
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 24
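▪ A hedged sketch of successive halving with scikit-learn's HalvingRandomSearchCV (the estimator, parameter ranges and budgets below are arbitrary choices for illustration); here the budget grown each round is the number of trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401, enables the class below
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": [3, 5, 10, None],
                         "min_samples_split": [2, 5, 10]},
    factor=2,                  # keep the best half of the configurations each round
    resource="n_estimators",   # the budget that is doubled each round
    min_resources=16,
    max_resources=256,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```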
HyperBand
▪ This method is an extension of the successive halving algorithm.

▪ To address the trade-off above, HyperBand repeatedly runs successive halving with
different budgets in order to find the best configurations.

▪ Python implementation here

https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 25
BOHB: Robust and Efficient Hyperparameter
Optimization at Scale
▪ BOHB is a state-of-the-art hyperparameter optimization algorithm

▪ The idea behind the BOHB algorithm is based on one simple question – why do we run
successive halving repeatedly? 

▪ Instead of a blind repetition method on top of successive halving, BOHB uses the Bayesian
Optimization algorithm. In fact, BOHB combines HyperBand and BO to use both of these
algorithms in an efficient way.

▪ A Python implementation of BOHB is here

https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 26
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?

• Why did we come up with the idea of CV in the first place? Cross-validation is a form of
model checking (I used this word on purpose) which attempts (so maybe it does not work all
the time?) to improve on the basic method of hold-out checking (training/test) by leveraging
subsets of our data and an understanding of the bias/variance trade-off, in order to gain a
better understanding of how our models will actually perform when applied outside of the
data they were trained on.
• In this sentence I have substituted "validation" with "checking" as I feel this is more
appropriate.
• How many methods can we use? Hold-out, k-fold, LOOCV

27
https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?

• Why do I have an issue when we say "model validation"? Because we are accustomed to
the following split: training, validation and test. When we say "model validation" we
actually mean model checking. It is unfortunate that we use the same name to describe
two things that are actually different.
• So what does model validation (checking) entail? It is the process that allows you
to predict how your model will perform on datasets (properly called test sets)
not used in the training.
• Why is it so important to get it right? It is partially linked to the concept of data
leakage. We don't actually care how well the model predicts data we trained it
on, because we already know the target values for those examples. We do care
about the error it makes on newly seen data, i.e. test data.
28
https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?

• Hold-out method? Aka the train and test split. Some of the issues
the method does not address are the following: we validated our
model once. What if the split we made just happened to be very
conducive to this model? What if the split we made introduced a
large skew into the data? Didn't we significantly reduce the size of
our training dataset by splitting it like that?
• How do bias and variance fit into all of this? When creating a
model, we account for a few types of error: validation error,
testing error, error due to bias, and error due to variance, in a
relationship known as the bias-variance trade-off. As you can see
here the author rightfully makes a distinction between validation and
test error. Bias tells you how far away we are from the target,
whereas variance tells you the spread of the predictions.

29
https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?

• CV is the answer to all the questions we had about the
hold-out method. The first method we are covering is
called k-fold CV and can be seen in the figure on the
right. I'd like to point out that the name of the orange
cubes is test! On the histogram you can see how the
score varies between folds, so you take the mean of
those scores as an indication of the goodness of the
learning.
• How do we choose k? k should be large enough to be
representative of the model and small enough to be
computed in a reasonable amount of time. As a general
rule, as k increases, bias decreases while variance and
the computational requirements increase.

30
https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
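▪ A minimal sketch of the k-fold idea just described (dataset and model are toy assumptions): report the mean of the fold scores as the cross-validation score, and look at their spread as well.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("per-fold:", scores)
print("mean  :", scores.mean())   # the cross-validation score
print("spread:", scores.std())    # how much the folds disagree
```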
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?

• Leave One Out Cross Validation (LOOCV) can be considered a type of
K-fold validation where k=n, given n is the number of rows in the
dataset. Other than that the methods are quite similar. You will notice,
however, that running LOOCV takes much longer than the
previous methods.
• When to use each method?
• Hold out method - The hold out method can be effective and computationally
inexpensive on very large datasets, or on limited computational resources.
• K-Fold can be very effective on medium sized datasets
• LOOCV is most useful in small datasets as it allows for the smallest amount of
data to be removed from the training data in each iteration.
31
https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?

Method          | Dataset size   | Pros                                                                                       | Cons
Hold-out method | Big dataset    | Easy to understand; easy implementation                                                    | Does not preserve the statistical integrity
K-fold CV       | Medium dataset | Accuracy score increases as k is increased; this decreases bias while increasing variance | Very costly

32
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?

• To answer the original questions:

• If I have a big dataset, I will split my data into test and training data and perform
validation on the test data. If I have a small dataset I would rather use cross-validation,
and then the validation is already performed within it.

• So validation has nothing to do with hyperparameterisation here; it is used as in
"checking" that the model is neither overfitting nor underfitting.

• What is confusing? What puzzles me is that lots of people split data, perform training
on training data with cross-validation, and then perform validation on the test dataset.
Is this correct? To add (apparent) confusion, you can also do cross-validation to select
the hyper-parameters of your model.

https://stats.stackexchange.com/questions/148688/cross-validation-with-test-data-set 33
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?

• Let's look at three different approaches:

• In the simplest scenario one would collect one dataset and train your model via cross-validation to
create your best model. Then you would collect another completely independent dataset and test
your model. However, this scenario is not possible for many researchers given time or cost
limitations.
• If you have a sufficiently large dataset, you would want to take a split of your data and leave it to the
side (completely untouched by the training). This simulates a completely independent dataset even
though it comes from the same dataset, because model training won't take any information from
those samples. You would then build your model on the remaining training samples and then test on
these left-out samples.
• If you have a smaller dataset, you may not be able to afford to simply ignore a chunk of your data for
model building. As such, validation is performed on every fold (k-fold CV) and your validation
metric is aggregated across each validation. You feed the whole dataset to the k-fold method.
There is no need to split into training and test, as doing so would negate the very reason you do
k-fold validation: the dataset is small, so you cannot afford to split it.

https://stats.stackexchange.com/questions/148688/cross-validation-with-test-data-set 34
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing? Solved with a
respectable reference – part #1
▪ You should have (unless you have a tiny dataset) three sets:
▪ Training set – model fitting
▪ Test set – final reporting
▪ Validation set – hyperparameter tuning

▪ You then have two scenarios:

▪ When using a train-test split: you're actually using all three sets: train, validation and test.
This means the training set is further split again into train and validation.
▪ When using cross-validation: GridSearchCV will use cross-validation in place of the split into a
training and validation set that we used before. However, we still need to split the data into a
training and a test set, to avoid overfitting the parameters:

Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 35
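▪ A short sketch of the first scenario, in the spirit of the Guido & Müller example (the SVC grid and the iris dataset are assumptions here, not the book's exact code): the training data is split again into train and validation, and the test set is touched only once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# First split: keep the test set completely untouched until the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
# Second split: carve a validation set out of the remaining training data.
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)

best_score, best_C = -1, None
for C in [0.01, 0.1, 1, 10, 100]:      # hyperparameter search scored on the validation set
    score = SVC(C=C).fit(X_train, y_train).score(X_valid, y_valid)
    if score > best_score:
        best_score, best_C = score, C

# Retrain on train+validation with the chosen hyperparameter, report on the test set once.
final = SVC(C=best_C).fit(X_trainval, y_trainval)
print("validation best:", best_score, "test:", final.score(X_test, y_test))
```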
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing? Solved with a
respectable reference – part #2

The best score on the validation set is 96%: slightly lower than before, probably
because we used less data to train the model (X_train is smaller now because
we split our dataset twice). However, the score on the test set—the score that
actually tells us how well we generalize—is even lower, at 92%.

Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 36
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing? Solved with a
respectable reference – part #3
Fitting the GridSearchCV object not only searches for the best parameters,
but also automatically fits a new model on the whole training dataset with
the parameters that yielded the best cross-validation performance. This is
equivalent to:

Choosing the parameters using cross-validation, we actually found a model that achieves
97% accuracy on the test set. The important thing here is that we did not use the test set
to choose the parameters. The parameters that were found are stored in the best_params_
attribute, and the best cross-validation accuracy (the mean accuracy over the different
splits for this parameter setting) is stored in best_score_:

Guido, Sarah, and Andreas Müller. Introduction to machine


37
learning with python. Vol. 282. O'Reilly Media, 2016.
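▪ A minimal sketch of the GridSearchCV workflow discussed in parts #1 to #3 (the grid and dataset are assumed, not copied from the book): cross-validation replaces the explicit validation split, while a separate test set is still held out for the final report.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)   # searches the grid AND refits on the whole training set

print("best params       :", grid.best_params_)
print("best CV accuracy  :", grid.best_score_)            # mean over the 5 folds
print("test-set accuracy :", grid.score(X_test, y_test))  # reported once, at the end
```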
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing? Solved with a
respectable reference – part #4
▪ While the method of splitting the data into a
training, a validation, and a test set that we just
saw is workable, and relatively commonly used,
it is quite sensitive to how exactly the data is
split.
▪ For a better estimate of the generalization
performance, instead of using a single split into a
training and a validation set, we can use cross-
validation to evaluate the performance of each
parameter combination.
▪ OF COURSE: GridSearchCV will use cross-
validation in place of the split into a training and
validation set that we used before. However, we
still need to split the data into a training and a
test set, to avoid overfitting the parameters:

Guido, Sarah, and Andreas Müller. Introduction to machine 38


learning with python. Vol. 282. O'Reilly Media, 2016.
What is the main purpose of CV? #1
▪ Is the main purpose of CV to get a feeling for the mean model performance, or to do
hyperparameter tuning?

▪ In general, the purpose of CV is NOT to do hyperparameter optimisation. The purpose is to
evaluate the performance of the model-building procedure. This means we can also do
hyperparameter tuning within CV, and this is in fact the standard procedure!

▪ What is the difference from a basic train/test split? A basic train/test split is conceptually
identical to a 1-fold CV (with a custom split size, in contrast to the 1/k fold size in k-fold CV).
The advantage of doing more splits (i.e. k>1 CV) is to get more information about the
estimate of the generalisation error: you get the error together with its statistical uncertainty.

https://stackoverflow.com/questions/46456381/cross-validation-in-lightgbm 39
What is the main purpose of CV? #2
▪ You want to train a model with a set of parameters on some data and evaluate each
variation of the model on an independent (validation) set. Then you intend to choose the
best parameters by picking the variant that gives the best evaluation metric of your choice.
▪ This can be done with a simple train/test split. But the evaluated performance, and thus the
choice of the optimal model parameters, might just be a fluctuation of a particular split.
▪ Thus, you can evaluate each of those models in a more statistically robust way by averaging
the evaluation over several train/test splits, i.e. k-fold CV.
▪ Then you can take a step further and say that you had an additional hold-out set, separated
before the hyperparameter optimisation was started. This way you can evaluate the chosen
best model on that set to measure the final generalisation error.
▪ However, you can go even a step further and, instead of having a single test sample, you can
have an outer CV loop, which brings us to nested CV.

https://stackoverflow.com/questions/46456381/cross-validation-in-lightgbm 40
Always ask the following question: When
should you use cross-validation?
▪ The guideline is as follows:
▪ For small datasets, where extra computational burden isn't a big deal, you should run cross-
validation.
▪ For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have
enough data that there's little need to re-use some of it for holdout.
▪ There's no simple threshold for what constitutes a large vs. small dataset. But if your model takes a
couple minutes or less to run, it's probably worth switching to cross-validation.
▪ Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each
experiment yields the same results, a single validation set is probably sufficient.

https://www.kaggle.com/alexisbcook/cross-validation 41
The training, validation & test set: #1
▪ How to split the data:

▪ The training set is used by the machine learning algorithm to train the model.

▪ The validation set is needed to find the best values for the hyperparameters of the machine learning pipeline. Most
importantly this set is not seen by the machine learning algorithm.

▪ The test set is used for reporting: once you have your best model (chosen in the previous step), you test its
performance (say accuracy) on the test set and report the results.

https://www.dropbox.com/s/3l9cp75esgvqpxp/Chapter3.pdf?dl=0
https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set 42
Do You Re-train on the Whole Dataset After
Validating the Model?
▪ Method A) ▪ Method B)
▪ Train on 80% ▪ Train on 80%
▪ Validate on 20% ▪ Validate on 20%
▪ Model is good, train on 100%. ▪ Model is good, use this model as is.
▪ Predict test set. ▪ Predict test set.

▪ There seems to be enough evidence for option A) because:


▪ More data is better. In the case of time series, including more recent data is always
better.
▪ Cross validation is used to validate the hyper-parameters to train a model, rather than
the model itself. You then pick the best parameters to re-train a model.

http://www.chioka.in/do-you-re-train-on-the-whole-dataset-after-validating-the-model/ 43
Issue with train & test split

▪ One test score is not reliable because splitting the data into different training and test sets
would give different results.

▪ In effect, splitting the data into a training set and a test set is arbitrary, and a different
random_state will give a different error.

Corey Wade. “Hands-On Gradient Boosting with XGBoost and scikit-learn. 44


When is a random subset in the train-test
split not good enough?

https://www.fast.ai/2017/11/13/validation-sets/
Time series data: a random subset is a poor choice for your validation set. It is too easy to
fill in the gaps, and it is not indicative of what you'll need in production, because you can
look at the data both before and after the dates you are trying to predict. A better choice is
to use the earlier data as your training set and the later data for the validation set.
45
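▪ A hedged sketch of such a time-based split (the DataFrame and column names are made up): train on the earlier period and validate on the later one, instead of sampling rows at random. scikit-learn's TimeSeriesSplit offers a cross-validated version of the same idea.

```python
import pandas as pd

# A made-up daily time series: 100 consecutive days of features and targets.
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=100, freq="D"),
    "feature": range(100),
    "target": range(100),
})

# Use the last 20% of the timeline as the validation period.
cutoff = df["date"].iloc[int(len(df) * 0.8)]
train = df[df["date"] < cutoff]
valid = df[df["date"] >= cutoff]
print(len(train), "training rows before", cutoff.date())
print(len(valid), "validation rows from the cutoff onwards")
```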
How (and why) to create a good validation set: #1
▪ sklearn offers a train_test_split method that takes a random subset of the data, which is
a poor choice for many real-world problems.
▪ Also keep in mind that sklearn does not have a train_validation_test_split. The reason
is that it is assumed you will often be using cross-validation, in which different subsets of the
training set serve as the validation set.

▪ There is a lot of confusion around this because people keep using the wrong nomenclature. An
example? In the DL community, "test-time inference" is often used to refer to evaluating on
data in production, which is NOT the technical definition of a test set.
▪ The ultimate goal is for the model to be accurate on new data, not just the data you are using to
build it.

https://www.fast.ai/2017/11/13/validation-sets/ 46
How (and why) to create a
good validation set: #2
▪ Why is that? If you were to gather some new data points, they most
likely would not be on that curve in the graph on the right, but would
be closer to the curve in the middle graph.
▪ The underlying idea is that:
▪ The training set is used to train a given model
▪ The validation set is used to choose between models (for instance,
does a random forest or a neural net work better for your
problem? do you want a random forest with 40 trees or 50 trees?)
▪ The test set tells you how you’ve done. If you’ve tried out a lot of
different models, you may get one that does well on your
validation set just by chance, and having a test set helps make sure
that is not the case.

https://www.fast.ai/2017/11/13/validation-sets/ 47
Cross-validation: #1

▪ The validation set is needed to find the best values for the hyperparameters of the
machine learning pipeline.

▪ When you don’t have a decent validation set to tune your hyperparameters on, the
common technique that can help you is called cross-validation.

▪ When you have few training examples, it could be prohibitive to have both a validation and a
test set. You would prefer to use more data to train the model. In such a case, you only split
your data into a training and a test set. Then you use cross-validation on the training set
to simulate a validation set.

48
Cross-validation: #2

▪ A noticeable problem with the train/test set split is that you’re actually introducing bias into your testing
because you’re reducing the size of your in-sample training data. When you split your data, you may be
actually keeping some useful examples out of training.

▪ Moreover, sometimes your data is so complex that a test set, though apparently similar to the training set,
is not really similar because combinations of values are different (which is typical of high-dimensional
datasets).

▪ These issues add to the instability of sampling results when you don’t have many examples. The risk of
splitting your data in an unfavourable way also explains why the train/test split isn’t the favoured solution
by machine learning practitioners when you have to evaluate and tune a machine learning solution.

▪ Cross-validation based on k-folds is actually the answer.

Mueller, John Paul, and Luca Massaron. Machine learning for dummies. John Wiley & Sons, 2016. 49
Cross-validation: #3
▪ The process continues until all the k-folds
are used once as a test set and you have
a k number of error estimates that you
can compute into a mean error estimate
(the cross-validation score) and a
standard error of the estimates.

▪ Important: there is going to be overlap in


the training sets, but not the test sets.

▪ Averaging the results: at the end, there


will be k different scores evaluating the
model against k different test sets. Taking
the mean score of the k folds gives a
more reliable score than any single fold.

Mueller, John Paul, and Luca Massaron. Machine learning for dummies. John Wiley & Sons, 2016. 50
Cross-validation: #3

Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 51
Cross-validation: #4

▪ K-fold cross-validation breaks the dataset into k partitions. Each partition gets held out as the
testing data, while the model is trained on the remaining k − 1 partitions.

▪ The average of all these performances indicates how good a model you were fitting.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 52
Cross-validation: #5
▪ The extreme version of this is “leave-one-out” cross-validation, where k is equal to the
number of datapoints you have and every testing dataset is only one point.

▪ This is impractical to do with all but the smallest datasets.

Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 53
Another reason to use cross-validation

▪ If test sets can provide unstable results because of sampling, the solution is to
systematically sample a certain number of test sets and then average the results.

▪ It is a statistical approach (to observe many results and take an average of them), and that’s
the basis of cross‐validation.

Mueller, John Paul, and Luca Massaron. Python for data science for dummies. John Wiley & Sons, 2019. 54
Interpreting cross-validation score
▪ When comparing scores from a cross-validation test vs. a simple train-test split it is
important not to look at the number as a simple "this is lower than that, thus it must be
better!"

▪ The point here is not whether the score is better or worse.

▪ The point is that it's a better estimation of how linear regression will perform on unseen
data. Using cross-validation is always recommended for a better estimate of the score.

Corey Wade, Hands-On Gradient Boosting with XGBoost and scikit-learn: Perform
accessible machine learning and extreme gradient boosting with Python
55
Cross-validation methods: #1

▪ k-fold cross-validation
▪ stratified k-fold cross-validation = keeps the ratio of labels in each fold constant.
▪ hold-out based validation
▪ leave-one-out cross-validation
▪ group k-fold cross-validation

Thakur, Abhishek. Approaching (almost) any machine learning


problem. Abhishek Thakur, 2020.
56
Cross-validation methods: #1.1
▪ SIMPLE HOLD-OUT VALIDATION, which at the end of the day means: training, validation
and testing.

▪ One flaw: if little data is available, then your validation and test sets may contain too few
samples to be statistically representative of the data at hand.

▪ How can you recognise this is the case?: if different random shuffling rounds of the data
before splitting end up yielding very different measures of model performance, then you’re
having this issue.

Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018. 57
Cross-validation methods: #2
▪ If you have a skewed dataset for binary classification with 90% positive samples and only 10%
negative samples, you don't want to use random k-fold cross-validation. Using simple k-fold
cross-validation for a dataset like this can result in folds with all negative samples. Then just
use stratified k-fold cross-validation which keeps the ratio of labels in each fold constant.

▪ If you do not have a lot of samples, you can use a simple rule like Sturge’s Rule to calculate the
appropriate number of bins.

▪ What if we have a large amount of data? A 5 fold cross-validation would be too expensive. In
these cases, we can opt for a hold-out based validation. For a dataset which has 1 million
samples, we can create ten folds instead of 5 and keep one of those folds as hold-out. This
means we will have 100k samples in the hold-out, and we will always calculate loss, accuracy
and other metrics on this set and train on 900k samples.
Thakur, Abhishek. Approaching (almost) any machine learning
problem. Abhishek Thakur, 2020.
58
Cross-validation methods: #3

▪ Cross-validation is a resampling procedure used to evaluate machine learning models on a


limited data sample.

▪ The procedure has a single parameter called k that refers to the number of groups that a
given data sample is to be split into. As such, the procedure is often called k-fold cross-
validation.

59
Cross-validation methods: #4
▪ There are a number of variations on the k-fold cross-validation procedure. Three commonly used
variations are as follows:

▪ Train/Test Split: Taken to one extreme, k may be set to 1 such that a single train/test split is created
to evaluate the model.
▪ LOOCV: Taken to another extreme, k may be set to the total number of observations in the dataset
such that each observation is given a chance to be the held out of the dataset. This is called leave-
one-out cross-validation, or LOOCV for short.
▪ Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold
has the same proportion of observations with a given categorical value, such as the class outcome
value. This is called stratified cross-validation.
▪ Repeated: This is where the k-fold cross-validation procedure is repeated n times, where
importantly, the data sample is shuffled prior to each repetition, which results in a different split of
the sample.

60
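▪ A short sketch of the scikit-learn counterparts of the variations listed above (toy dataset, arbitrary settings): the same model is scored with k-fold, LOOCV, stratified and repeated k-fold splitters.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, RepeatedKFold,
                                     StratifiedKFold, cross_val_score)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

splitters = {
    "k-fold (k=5)": KFold(n_splits=5, shuffle=True, random_state=0),
    "LOOCV": LeaveOneOut(),
    "stratified k-fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "repeated k-fold": RepeatedKFold(n_splits=5, n_repeats=3, random_state=0),
}
for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name:18s} mean accuracy = {scores.mean():.3f} ({len(scores)} fits)")
```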
Cross-validation methods: #5

▪ Generally k-fold cross-validation is the gold-standard for evaluating the performance of a


machine learning algorithm on unseen data with k set to 3, 5, or 10.

▪ Use stratified cross-validation to enforce class distributions when there are a large number
of classes or an imbalance in instances for each class.

▪ Using a train/test split is good for speed when using a slow algorithm and produces
performance estimates with lower bias when using large datasets.

▪ If in doubt use 10-fold cross validation

61
Cross-validation methods: #6

62
Cross-validation methods: #7
▪ Another approach to evaluating the model is as follows.

▪ Divide our data up into a training set and a test set: 80% in the training and 20% in the test.
Fit the model on the training set, then look at the mean squared error on the test set and
compare it to that on the training set.

▪ If the mean squared errors are approximately the same, then our model generalizes well
and we’re not in danger of overfitting.

O'Neil, Cathy, and Rachel Schutt. Doing data science: Straight talk from the frontline. " O'Reilly Media, Inc.", 2013.
63
Cross-validation methods: #6

▪ The holdout method is the most common method, which starts by dividing the dataset into
two partitions called train and test (80% and 20%, respectively). The train set is used for
training the model, and the test data tests its predictive power.

▪ The k-fold cross-validation technique is used to verify that the model is not over-fitted. The
dataset is randomly partitioned into k mutually exclusive subsets, where each partition has
equal size. One partition is for testing and the other partitions are for training. Iterate
throughout the whole of the k-folds.

Python 3 for machine learning, O. Compesato 64


Stratified k-fold cross-validation
▪ Sometimes the data comes in several different non-overlapping categories (men and
women, machines purchased from one vendor versus another, etc.) called “strata.”

▪ In these cases we may want to make sure that all the strata are appropriately represented
in both the testing and training data.

▪ To do this we use “stratified sampling,” where each stratum is divided into training/testing
data, in the same proportions.

▪ stratified k-fold cross-validation = keeps the ratio of labels in each fold constant.

Data Science: The Executive Summary - A Technical Book for Non-Technical Professionals 65
Comparison of all possible ways to split the dataset: #1

First, we must understand the structure of our data. It has 100 randomly generated input
datapoints, 3 classes split unevenly across datapoints, and 10 "groups" split evenly across
datapoints.

As you can see, by default the KFold cross-validation iterator does not take either datapoint
class or group into consideration. We can change this by using StratifiedKFold. In this case,
the cross-validation retains the same ratio of classes across each CV split.

https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py 66
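▪ A hedged sketch in the spirit of the scikit-learn example (the label vector below is a made-up imbalanced toy, not the example's exact data): compare the class counts inside each validation fold for KFold versus StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 60 + [1] * 30 + [2] * 10)   # 3 classes, split unevenly

for splitter in (KFold(n_splits=5), StratifiedKFold(n_splits=5)):
    print(type(splitter).__name__)
    for _, val_idx in splitter.split(X, y):
        # StratifiedKFold keeps the 60/30/10 ratio in every fold; plain KFold does not.
        print("  validation fold class counts:", np.bincount(y[val_idx], minlength=3))
```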
Comparison of all possible ways to split the dataset: #2

67

https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py
Comparison of all possible ways to split the dataset: #3

68

https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py
The truth about CV: #1
▪ In this work, we show that the estimand of CV is not the accuracy of the model fit on the
data at hand, but is instead the average accuracy over many hypothetical data sets.
Specifically, we show that the CV estimate of error has larger mean squared error (MSE)
when estimating the prediction error of the final model than when estimating the average
prediction error of models across many unseen data sets for the special case of linear
regression.

▪ Turning to confidence intervals for prediction error, we show that naïve intervals based on
CV can fail badly, giving coverage far below the nominal level; we provide a simple example
soon in Section 1.1.

Bates, Stephen, Trevor Hastie, and Robert Tibshirani. "Cross-validation: what does it estimate
and how well does it do it?" 69
The truth about CV: #2
▪ The source of this behaviour is the estimation of the variance used to compute the width of
the interval: it does not account for the correlation between the error estimates in different
folds, which arises because each data point is used for both training and testing. As a result,
the estimate of variance is too small and the intervals are too narrow.

▪ To address this issue, we develop a modification of cross-validation, nested cross-validation
(NCV), that achieves coverage near the nominal level, even in challenging cases where the
usual cross-validation intervals have miscoverage rates two to three times larger than the
nominal rate.

Bates, Stephen, Trevor Hastie, and Robert Tibshirani. "Cross-validation: what does it estimate
and how well does it do it?" 70
Nested versus non-nested cross-validation: #1
▪ NON-NESTED = estimates the generalization error of the underlying model and its (hyper)parameter
search. Model selection without nested CV uses the same data to tune model parameters and evaluate
model performance. Information may thus “leak” into the model and overfit the data. The magnitude of
this effect is primarily dependent on the size of the dataset and the stability of the model. Choosing the
parameters that maximize non-nested CV biases the model to the dataset, yielding an overly-optimistic
score.

▪ NESTED = cross-validation (CV) is often used to train a model in which hyperparameters also need to be
optimized. To avoid this problem, nested CV effectively uses a series of train/validation/test set splits. In
the inner loop (here executed by GridSearchCV), the score is approximately maximized by fitting a model
to each training set, and then directly maximized in selecting (hyper)parameters over the validation set.
In the outer loop (here in cross_val_score), generalization error is estimated by averaging test set scores
over several dataset splits.

https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py 71
Nested versus non-nested cross-validation: #2

Preferred way: Under this procedure,


hyperparameter search does not have an
opportunity to overfit the dataset as it is
only exposed to a subset of the dataset
provided by the outer cross-validation
procedure. This reduces, if not eliminates,
the risk of the search procedure
overfitting the original dataset and should
provide a less biased estimate of a tuned
model’s performance on the dataset.

https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py 72
Nested versus non-nested cross-validation: #3
▪ The k-fold cross-validation procedure is used to estimate the performance of machine
learning models. It is less biased than some other techniques, such as a single train-test split,
for small- to modestly-sized datasets.
▪ This procedure can be used both when optimizing the hyperparameters of a model on a
dataset, and when comparing and selecting a model for the dataset. When the same cross-
validation procedure and dataset are used to both tune and select a model, it is likely to lead
to an optimistically biased evaluation of the model performance. In other words, if the k-fold
cross-validation procedure is used both in the selection of model hyperparameters and in the
selection of configured models, it can lead to overfitting.
▪ One approach to overcoming this bias is to nest the hyperparameter optimization procedure
under the model selection procedure. This is called double cross-validation or nested cross-
validation and is the preferred way to evaluate and compare tuned machine learning models

https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ 73
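▪ A minimal nested-CV sketch (toy dataset and grid): GridSearchCV in the inner loop tunes the hyperparameters, while cross_val_score in the outer loop estimates the generalisation error of the whole tuning procedure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
tuned_model = GridSearchCV(SVC(), param_grid, cv=inner_cv)   # inner loop: tuning

# Each outer fold re-runs the whole search on its training part only,
# then scores the refit best model on the held-out outer fold.
outer_scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```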
Nested versus non-nested cross-validation: #4
▪ Each time a model with different model hyperparameters is evaluated on a dataset, it provides
information about the dataset. Specifically, an often noisy model performance score. This
knowledge about the model on the dataset can be exploited in the model configuration procedure
to find the best performing configuration for the dataset. The k-fold cross-validation procedure
attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing
or overfitting of the model hyperparameters to the dataset will be performed. This is the normal
case for hyperparameter optimization.

▪ The problem is that if this score alone is used to then select a model, or the same dataset is used to
evaluate the tuned models, then the selection process will be biased by this inadvertent
overfitting. The result is an overly optimistic estimate of model performance that does not
generalize to new data.

▪ One approach to solve it is called nested cross-validation.

https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ 74
Nested versus non-nested cross-validation: #5

▪ What Is the Cost of Nested Cross-Validation? To make this concrete, you might use k=5 for
the hyperparameter search and test 100 combinations of model hyperparameters. A
traditional hyperparameter search would, therefore, fit and evaluate 5 * 100 or 500
models. Nested cross-validation with k=10 folds in the outer loop would fit and evaluate
5,000 models. A 10x increase in this case.

▪ How Do You Set k? It is common to use k=10 for the outer loop and a smaller value of k for
the inner loop, such as k=3 or k=5.

https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ 75
How to interpret CV score values
▪ Using the mean cross-validation we can conclude
that we expect the model to be around 96%
accurate on average.
▪ Looking at all five scores produced by the five-fold
cross-validation, we can also conclude that there is
a relatively high variance in the accuracy between
folds, ranging from 100% accuracy to 90% accuracy.
▪ This is quite a range, and it provides us with an idea
about how the model might perform in the worst
case and best case scenarios when applied to new
data.
▪ Further, this could imply that
1. the model is very dependent on the particular folds
used for training,
2. OR it could also just be a consequence of the small
size of the dataset.

Guido, Sarah, and Andreas Müller. Introduction to machine


learning with python. Vol. 282. O'Reilly Media, 2016. 76
Another advantages of cross validation

▪ Another benefit of cross-validation as compared to using a single split of the data is that we
use our data more effectively.
▪ When using train_test_split, we usually use 75% of the data for training and 25% of the
data for evaluation.
▪ When using five-fold cross-validation, in each iteration we can use four-fifths of the data
(80%) to fit the model.
▪ When using 10-fold cross-validation, we can use nine-tenths of the data (90%) to fit the
model. More data will usually result in more accurate models.

Guido, Sarah, and Andreas Müller. Introduction to machine


learning with python. Vol. 282. O'Reilly Media, 2016. 77
The dangers of cross-validation
▪ Consider a 3-fold cross validation, the data is divided into 3 sets: A, B, and C. A model is
first trained on A and B combined as the training set, and evaluated on the validation set C.
Next, a model is trained on A and C combined as the training set, and evaluated on
validation set B. And so on, with the model performance from the 3 folds being averaged in
the end.

▪ Issue? The problem with CV is that it is rarely applicable to real-world problems, for all the
reasons described in the above sections. Cross-validation only works in the same cases where
you can randomly shuffle your data to choose a validation set.

▪ The link also explains how Kaggle works in terms of CV.

https://www.fast.ai/2017/11/13/validation-sets/ 78
Handling Imbalanced Datasets: #1
▪ In many practical situations, your labelled dataset will have the examples of some class
underrepresented.

▪ This is the case, for example, when your classifier has to distinguish between genuine and
fraudulent e-commerce transactions: the examples of genuine transactions are much more
frequent. If you use SVM with soft margin, you can define a cost for misclassified examples.

▪ Because noise is always present in the training data, there are high chances that many
examples of genuine transactions would end up on the wrong side of the decision
boundary by contributing to the cost.

Burkov, Andriy. "The Hundred-Page Machine Learning Book 79


Handling Imbalanced Datasets: #2
▪ The SVM algorithm will try to move the hyperplane to avoid as much as possible misclassified
examples.
▪ The “fraudulent” examples, which are in the minority, risk being misclassified in order to classify
more numerous examples of the majority class correctly.

80
Burkov, Andriy. "The Hundred-Page Machine Learning Book
Handling Imbalanced Datasets: #3
▪ Some solutions:

▪ Some SVM implementations (including SVC in scikit-learn) allow you to provide weights for every
class. The learning algorithm takes this information into account when looking for the best
hyperplane.
▪ If your learning algorithm doesn’t allow weighting classes, you can try to increase the importance of
examples of some class by making multiple copies of the examples of this class (this is called
oversampling).
An opposite approach is to randomly remove from the training set some examples of the majority
class (undersampling).
▪ You might also try to create synthetic examples by randomly sampling feature values of several
examples of the minority class and combining them to obtain a new example of that class. There are
two popular algorithms that oversample the minority class by creating synthetic examples: the synthetic
minority oversampling technique (SMOTE) and the adaptive synthetic sampling method (ADASYN).

81
Burkov, Andriy. "The Hundred-Page Machine Learning Book
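▪ A hedged sketch of two of the remedies above, class weighting in SVC and synthetic oversampling with SMOTE (the latter assumes the third-party imbalanced-learn package is installed; the dataset is a made-up skewed problem):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# A skewed binary problem: roughly 95% "genuine", 5% "fraudulent".
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Option 1: per-class weights, so minority-class mistakes cost more in the SVM objective.
weighted_svm = SVC(class_weight="balanced").fit(X, y)

# Option 2: SMOTE creates synthetic minority examples before training.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
plain_svm = SVC().fit(X_res, y_res)

print("original  class counts:", np.bincount(y))      # heavily skewed
print("resampled class counts:", np.bincount(y_res))  # balanced after oversampling
```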
k-fold validation [warning]

▪ Cross validation IS OFTEN NOT USED for evaluating deep learning models because of the
greater computational expense.

▪ For example k-fold cross validation is often used with 5 or 10 folds. As such, 5 or 10 models
must be constructed and evaluated, greatly adding to the evaluation time of a model.

▪ Nevertheless, when the problem is small enough or if you have sufficient compute
resources, k-fold cross validation can give you a less biased estimate of the performance of
your model.

82
Shuffling the training and validation set

▪ Shuffling the training data is important to prevent correlation between batches and
overfitting.

▪ On the other hand, the validation loss will be identical whether we shuffle the validation
set or not. Since shuffling takes extra time, it makes no sense to shuffle the validation data.

▪ To be clear, it makes no sense because it does not make a difference; the fact that it takes more
time is a secondary reason. You would not do it even if the action were super fast. It simply
does not make sense.

https://pytorch.org/tutorials/beginner/nn_tutorial.html#refactor-using-optim
83
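▪ A minimal PyTorch sketch of this point (the tensors are placeholders): shuffle the training loader, leave the validation loader unshuffled.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
valid_ds = TensorDataset(torch.randn(200, 10), torch.randint(0, 2, (200,)))

train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)    # break batch correlation
valid_dl = DataLoader(valid_ds, batch_size=128, shuffle=False)  # order does not change the loss
```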
An alternative to CV
▪ Model selection is the problem of choosing one from among a set of candidate models.
▪ It is common to choose a model that performs the best on a hold-out test dataset or to
estimate model performance using a resampling technique, such as k-fold cross-validation.
▪ An alternative approach to model selection involves using probabilistic statistical measures
that attempt to quantify both the model performance on the training dataset and the
complexity of the model. Examples include the Akaike and Bayesian Information Criterion
and the Minimum Description Length.

▪ The benefit of these information criterion statistics is that they do not require a hold-out
test set, although a limitation is that they do not take the uncertainty of the models into
account and may end-up selecting models that are too simple.

https://machinelearningmastery.com/probabilistic-model-selection-measures/ 84
Early stopping: #1

The algorithm halts whenever overfitting begins to occur.

[Figure: three snapshots of the weights during training – early on the weights are almost all
close to zero, at the stopping point they have mid-size values, and with further training they
grow too large.]

85
http://www.deeplearningbook.org/slides/07_regularization.pdf
Early stopping: #2

▪ If we have lots of data and a big model, it's very expensive to keep re-training it with
different sized penalties on the weights or different architectures.

▪ It is much cheaper to start with very small weights and let them grow until the performance
on the validation set starts getting worse.
▪ But it can be hard to decide when performance is getting worse.

▪ The capacity of the model is limited because the weights have not had time to grow big.
▪ Smaller weights give the network less capacity. Why?

https://www.cs.toronto.edu/~hinton/coursera_slides.html
Early stopping: #3

▪ Why early stopping works?

▪ When the weights are very small, every hidden unit is in its linear range.
▪ So even with a large layer of hidden units it’s a linear model.
▪ It has no more capacity than a linear net in which the inputs are directly connected to the
outputs!

▪ As the weights grow, the hidden units start using their non-linear ranges so the capacity (ability
to fit a wide variety of functions) grows.

https://www.cs.toronto.edu/~hinton/coursera_slides.html
Early stopping: #4

▪ Generally training has two tasks: one is to optimise your cost function and the other is to
avoid overfitting. These two tasks are different and are therefore handled separately.

▪ However, when using early stopping it is impossible to separate the two tasks, and this is a
downside of the method.

▪ Early stopping is effectively another regularisation technique, much like L2 regularisation,
i.e. a different option; generally you can use L2 instead of early stopping.

88
Early stopping: #5

The model fitting is the same; we are just changing the way we measure the performance. However, using different criteria allows us to pick up the upward trend much more easily, because not all of them show it clearly.
89
Early stopping: #6
▪ Apart from the overfitting reason
explained earlier early stopping is useful
for another reason.

▪ This example illustrates how early stopping can be used in the GradientBoostingClassifier model to achieve almost the same accuracy as a model built without early stopping, using many fewer estimators (a minimal sketch follows the reference below).

▪ This can significantly reduce training time,


memory usage and prediction latency.

https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html#sphx-glr-auto-e 90
xamples-ensemble-plot-gradient-boosting-early-stopping-py
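A hedged sketch of the idea behind the scikit-learn example referenced above, using GradientBoostingClassifier's built-in early stopping (validation_fraction, n_iter_no_change, tol); the synthetic dataset and parameter values are illustrative and not those of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No early stopping: all 1000 trees are built.
full = GradientBoostingClassifier(n_estimators=1000, random_state=0).fit(X_tr, y_tr)

# Early stopping: training halts once the internal validation score stops improving.
early = GradientBoostingClassifier(n_estimators=1000, validation_fraction=0.1,
                                   n_iter_no_change=10, tol=1e-4,
                                   random_state=0).fit(X_tr, y_tr)

print("no early stopping:", full.n_estimators_, "trees, acc =", full.score(X_te, y_te))
print("early stopping   :", early.n_estimators_, "trees, acc =", early.score(X_te, y_te))
```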
Early stopping [overfitting]: #7
▪ Neural networks can get worse if you train them too much! The reality is that this is common in neural
networks.

▪ One way to view the weights of a neural network is as a high-dimensional shape. As you train, this shape
molds around the shape of the data, learning to distinguish one pattern from another.

▪ Unfortunately, if the testing dataset were slightly different from the patterns in the training dataset, the
network could fail on many of the testing examples. The network has overfitted the data, hence it did not
generalise well.

▪ A more official definition of a neural network that overfits is a neural network that has learned the noise
in the dataset instead of making decisions based only on the true signal.

Trask, Andrew W. "Deep learning." (2019). 91


Early stopping on valid or test sets?
▪ Good practice: set = train + test + validation
▪ In practice (not so good practice): set = train + test
▪ The question is: while using early stopping, shall I use the test or the validation set? If we are using early stopping as a training termination criterion then it becomes part of your learning algorithm, meaning the model has implicitly seen these data. Thus these data cannot be used for testing, otherwise you are leaking information when you are not supposed to.

▪ In practice you'll find that this leakage is difficult to spot, and people often simply do not bother with it.
▪ There are cases where you simply cannot have all three sets and then you have to rely on
other validation methods such as a cross validation etc.

https://stats.stackexchange.com/questions/56421/is-it-ok-to-determine-earl 92
y-stopping-using-the-validation-set-in-10-fold-cross-v
TPOT = Tree-based Pipeline Optimization Tool: #1
▪ TPOT uses a Genetic Programming stochastic global search procedure to efficiently discover
a top-performing model pipeline for a given dataset.

▪ TPOT will automate the most tedious part of machine learning by intelligently exploring
thousands of possible pipelines to find the best one for your data.

▪ AutoML algorithms aren't as simple as fitting one model on the dataset; they are
considering multiple machine learning algorithms (random forests, linear models, SVMs,
etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA,
feature selection, etc.), the hyperparameters for all of the models and preprocessing steps,
as well as multiple ways to ensemble or stack the algorithms within the pipeline.

https://epistasislab.github.io/tpot/ Olson, Randal S., et al. "Evaluation of a tree-based pipeline optimization tool for automating data 93
science." Proceedings of the Genetic and Evolutionary Computation Conference 2016. 2016.
TPOT = Tree-based Pipeline Optimization Tool: #2

https://epistasislab.github.io/tpot/ 94
TPOT other references
▪ https://www.kdnuggets.com/2021/05/machine-learning-pipeline-opti
mization-tpot.html

95
Automated machine learning tools

▪ TPOT
▪ Auto-Sklearn
▪ Auto-Weka
▪ Machine-JS
▪ DataRobot
▪ FLAML - Fast and Lightweight AutoML

96
Proper definition of bias & variance: #1
▪ Bias:
▪ It is a learner’s tendency to consistently learn the same wrong thing.
▪ Underfit = “high bias”
▪ Simple algorithms have high bias; having few internal parameters.
▪ high bias if it is not able to fully use the information in the data. It is
too reliant on general information, such as the most frequent case,
the mean of the response, or few powerful features.
▪ Wrong assumption when training
▪ Bias is telling you how far away we are from the target.

▪ Variance:
▪ The tendency to learn random things irrespective of the real signal.
▪ Overfit = “high variance”
▪ You can think of variance as a problem connected to memorization.
▪ High variance if it is using too much information from the data.
▪ The model performs well on training but poorly on the test set.
▪ Sensitive to fluctuations when training.
▪ Variance is telling you the spread of the predictions.

Simultaneously avoiding both requires learning a perfect classifier/regressor, and short of knowing it in advance there is no single technique that will always do best (no free lunch).

Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012): 78-87.
97
Proper definition of bias & variance: #2
[Figure: bias/variance quadrants. Low bias and low variance is the best scenario but impossible in practice, as there is a tradeoff (no free lunch): you almost always have both. High bias corresponds to the underfitting area, high variance to the overfitting area. High bias and high variance together: does it even happen?]
Proper definition of bias & variance: #3

A good trade-off would be to stay in the lower-left triangle defined by the green triangle. The green arrow points toward the best-case scenario.

https://stats.stackexchange.com/questions/204489/discussion-about-overfit-in-xgboost 99
Proper definition of bias & variance: #4

Is high variance & high bias possible? Yes it is, and it is best explained with a joke.

100
▪ Bias: expected deviation from the true value.
▪ Variance: deviation from the expected estimator.

https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 101
Proper definition of bias & variance

102
High bias and variance: #0

Underfit = “high bias” “just right” overfit = “high variance”

Overfitting: if we have too many features, the learned hypothesis may fit the training set very well (cost function almost zero), but fail to generalise (this is the key term) to new examples, e.g. to predict prices on new examples.
103
High bias and variance: #1

104
High bias and variance: #1

▪ The bias–variance dilemma is the conflict in trying to simultaneously minimize these two
sources of error that prevent supervised learning algorithms from generalizing beyond their
training set:

▪ The bias error is an error from erroneous assumptions in the learning algorithm. High
bias can cause an algorithm to miss the relevant relations between features and target
outputs (underfitting).

▪ The variance is an error from sensitivity to small fluctuations in the training set. High
variance can cause an algorithm to model the random noise in the training data, rather
than the intended outputs (overfitting).

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
105
High bias and variance: #1.1

106
High bias and variance: #1.2

Ideally, we need to find a golden mean between these two. 107
High bias and variance: #1.2.1

Changing degree or lambda 108


High bias and variance: #1.3
▪ High bias: if a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
▪ High variance: if a learning algorithm is suffering from high variance, getting more training data is likely to help.
109
High bias and variance: #1.4

▪ At the left end of the graph, training error and


generalization error are both high. This is the
underfitting regime.

▪ As we increase capacity, training error decreases,


but the gap between training and generalization
error increases.

▪ Eventually, the size of this gap outweighs the


decrease in training error, and we enter the
overfitting regime, where capacity is too large,
above the optimal capacity

http://www.deeplearningbook.org/contents/ml.html 110
High bias and variance: #1.5

[XGBoost pima-indians-diabetes dataset] The model fitting is the same; we are just changing the way we measure the performance. So you should be aware that different loss functions show overfitting to different degrees.
111
High bias and variance: #2
▪ For the overfitting problem, just to spell it out more clearly:

▪ It makes accurate predictions for examples in the training set, but it does not
generalise well to make accurate predictions on new, unseen examples.

▪ Something we want to avoid is the following (overfitting):

▪ It's almost as though our network is merely memorizing the training set, without
understanding digits well enough to generalize to the test set.

112
High bias and variance: #3

113
High bias and variance: #4 [solution to overfitting]

▪ There are two main options to address the issue of overfitting:

▪ 1) Reduce the number of features:


▪ Manually select which features to keep.
▪ Use a model selection algorithm.

▪ 2) Regularization
▪ Keep all the features, but reduce the magnitude of the cost function parameters​.
▪ Regularization works well when we have a lot of slightly useful features

114
High bias and variance: #5 [solution to overfitting]

Aka: weight decay (because it makes the weights smaller) or L2 regularization.

115
High bias and variance: #5.1 [solution to overfitting]
▪ Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other
things being equal.

▪ Large weights will only be allowed if they considerably improve the first part of the cost function.

▪ Put another way, regularization can be viewed as a way of compromising between finding small weights and
minimizing the original cost function.

▪ The relative importance of the two elements of the compromise depends on the value of λ: when λ is small we
prefer to minimize the original cost function, but when λ is large we prefer small weights.

▪ So regularisation helps the model generalise better via a reduction in overfitting likelihood. What overfitting is really doing is learning the noise in the data. We do not want that, but we could also use it to learn the noise in the data: at least then we know where it is.
http://neuralnetworksanddeeplearning.com/chap3.html
116
Train/Val accuracy

▪ The blue validation error curve shows very


small validation accuracy compared to the
training accuracy, indicating strong overfitting
(note, it's possible for the validation accuracy
to even start to go down after some point).

▪ When you see this in practice you probably


want to increase regularization (stronger L2
weight penalty, more dropout, etc.)

https://cs231n.github.io/neural-networks-3/ 117
Bayes Optimal error
This is a theoretical error.

▪ Usually, human and Bayes error are quite close, especially for natural perception problems, and there is little scope for improvement after surpassing human-level performance; thus, learning slows down considerably.

▪ The following methods can be used to improve performance:
1) Get labelled data from humans
2) Gain insights from manual error analysis, e.g. understand why a human got this right
3) Better analysis of Bias/Variance

118
Model score vs. its complexity
▪ The training score is everywhere higher than the
validation score. This is generally the case: the model
will be a better fit to data it has seen than to data it
has not seen.
▪ For very low model complexity (a high-bias model), the
training data is underfit, which means that the model
is a poor predictor both for the training data and for
any previously unseen data.
▪ For very high model complexity (a high-variance
model), the training data is overfit, which means that
the model predicts the training data very well, but fails
for any previously unseen data.
▪ For some intermediate value, the validation curve has
a maximum. This level of complexity indicates a
suitable trade-off between bias and variance.

VanderPlas, Jake. Python data science handbook: Essential tools for working with data. " O'Reilly Media, Inc.", 2016. 119
▪ In high dimensions, we cannot draw decision curves to inspect bias-variance. Instead, we
calculate error values to infer the source of errors on the training set, as well as on the
validation set.

▪ To determine bias, we need a baseline, such as human-level performance.

▪ The Bayes error is the minimum error for any classifier we could obtain with a perfect
model stripped of all avoidable bias and variance.

https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 120
▪ We can analyze and compare the errors on the training and validation sets in order to
deduce the cause of the error, whether it is due to avoidable bias or avoidable variance (it
is unlikely that we ever obtain the Bayes error rate)

https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 121
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 122
Why is the baseline performance important?
▪ Create a very simple model that beats the baseline score. For instance, in classification
tasks, the baseline accuracy should be higher than 0.5 (better than random) and our simple
model should be able to beat this score.

▪ If we are not able to beat the baseline score, then maybe the input data does not hold the
necessary information required to make the necessary prediction.

▪ Remember not to introduce any regularization or dropouts at this step

Subramanian, Vishnu. Deep Learning with PyTorch: A practical approach to


building neural network models using PyTorch. Packt Publishing Ltd, 2018. 123
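A minimal scikit-learn sketch of the idea: compare a trivial baseline (here DummyClassifier with the most-frequent strategy, an illustrative assumption) against a simple unregularized model on a synthetic, slightly imbalanced dataset. If the simple model cannot beat the baseline, the input data may not hold the necessary information.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)  # predicts the majority class
simple = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)           # no regularisation tuning, no dropout

print("baseline accuracy:", baseline.score(X_te, y_te))
print("simple model     :", simple.score(X_te, y_te))
```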
Overfitting and learning curves
Why do we have overfitting?
▪ Overfitting is caused by having too few samples to learn from, rendering you unable to train
a model that can generalize to new data.

▪ Given infinite data, your model would be exposed to every possible aspect of the data
distribution at hand: you would never overfit.

▪ Is that the only reason why we have overfitting?

Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018.
125
How can you avoid overfitting?
▪ Cross-validation: Cross-validation is a technique used to assess how well a model performs on a new
independent dataset. The simplest example of cross-validation is when you split your data into two
groups: training data and testing data, where you use the training data to build the model and the testing
data to test the model.
▪ Regularization: Overfitting occurs when models have higher degree polynomials. Thus, regularization
reduces overfitting by penalizing higher degree polynomials.
▪ Reduce the number of features: You can also reduce overfitting by simply reducing the number of input
features. You can do this by manually removing features, or you can use a technique, called Principal
Component Analysis, which projects higher dimensional data (eg. 3 dimensions) to a smaller space (eg. 2
dimensions).
▪ Ensemble Learning Techniques: Ensemble techniques take many weak learners and converts them into a
strong learner through bagging and boosting. Through bagging and boosting, these techniques tend to
overfit less than their alternative counterparts.

https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e
When do we want to have overfitting?

▪ After you have a simple implementation of your model, try to overfit a small amount of
training data and run evaluation on the same data to make sure that it gets to the smallest
possible loss.

▪ If it can't overfit a small amount of data, there's something wrong with your
implementation.

https://huyenchip.com/machine-learning-systems-design/design-a-machine-learning-system.html

127
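A minimal PyTorch sanity-check sketch (random toy data standing in for a small slice of your real training set): train on a tiny batch and verify the loss can be driven close to zero; if it cannot, something is likely wrong with the implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)
# A tiny batch; in practice this would be a small slice of your real training data.
X_small = torch.randn(16, 20)
y_small = torch.randint(0, 2, (16,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X_small), y_small)
    loss.backward()
    opt.step()

# If the implementation is sound, the loss should be close to zero and the
# model should classify the tiny batch perfectly.
print("final loss    :", loss.item())
print("train accuracy:", (model(X_small).argmax(1) == y_small).float().mean().item())
```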
Misconception about overfitting

▪ A common misconception about overfitting is that it is caused by noise, like training


examples labelled with the wrong class. This can indeed aggravate overfitting.

▪ For example, in a classification example, the overfitting can make the learner draw a
capricious frontier to keep those examples on what it thinks is the right side.

▪ But severe overfitting can occur even in the absence of noise.

Domingos, Pedro. "A few useful things to know about machine learning."
128
Communications of the ACM 55.10 (2012): 78-87.
Overfitting: some observations

▪ If you do not have much data, you should use a simple model, because a complex one will
overfit.
▪ This is true.
▪ But only if you assume that fitting a model means choosing a single best setting of the
parameters.

▪ If you use the full posterior distribution over parameter settings, overfitting disappears.
▪ When there is very little data, you get very vague predictions because many different
parameters settings have significant posterior probability.

http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec10.pdf 129
Alternative view of overfitting
▪ Deep learning’s greatest weakness: Overfitting

▪ Error is shared among all the weights. If a particular configuration of weights accidentally
creates perfect correlation between the prediction and the output dataset (such that error
== 0) without giving the heaviest weight to the best inputs, the neural network will stop
learning. Does it mean we should be avoiding low error? No, but explaining why is a bit complicated!

▪ As long as you don’t train exclusively on the first example, the rest of the training examples will help the network avoid getting stuck in these edge-case configurations that exist for any one training example. This also makes the case for having a large amount of data.

Trask, Andrew W. "Deep learning." (2019).


130
Observation about overfitting
▪ The greatest challenge in neural networks is that of
overfitting, when a neural network memorizes a dataset
instead of learning useful abstractions that generalize to
unseen data. In other words, the neural network learns to
predict based on noise in the dataset as opposed to relying
on the fundamental signal.

▪ What is noise? Everything that makes these pictures


unique beyond what captures the essence of “dog”
is included in the term noise.

131
Trask, Andrew W. "Deep learning." (2019).
Learning curves: #0
▪ Definition: learning curves are plots that show model performance as a function of dataset size.
▪ Train Learning Curve: Learning curve calculated from the training dataset that gives an idea of
how well the model is learning.
▪ Validation (I prefer the name test) Learning Curve: Learning curve calculated from a hold-out
validation dataset that gives an idea of how well the model is generalizing.

▪ In some cases, it is also common to create learning curves for multiple metrics:
▪ Optimization Learning Curves: Learning curves calculated on the metric by which the
parameters of the model are being optimized, e.g. loss. This is justified from a numerical point
of view.
▪ Performance Learning Curves: Learning curves calculated on the metric by which the model
will be evaluated and selected, e.g. accuracy. This is justified from an interpretability point of view.
https://machinelearningmastery.com/learning-curves-for-dia
gnosing-machine-learning-model-performance/ 132
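A small sketch using scikit-learn's learning_curve utility to compute train and validation scores as a function of training set size (the GaussianNB estimator and digits dataset are arbitrary illustrative choices); the resulting arrays are what you would plot as the learning curves discussed in the following slides.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Train/validation accuracy at 5 increasing training-set sizes, 5-fold CV each.
train_sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 5))

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:5d} samples  train={tr:.3f}  validation={va:.3f}")
```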
Learning curves: #1 [ideal]
▪ The performance of training samples
should be above what is desired.

▪ As the number of training samples


increases, the model performance
on testing samples improves.

▪ Eventually the testing becomes close


to that on training.

Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 133
Learning curves: #2 [overfitting]

▪ When testing converges at a value


far from the training, overfitting can
be concluded.

▪ In this case, the model fails to


generalise to instances that are not
seen.

Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 134
Learning curves: #3 [underfitting]

▪ Both training and testing converge to values that are below what is desired.

▪ In this case, the model does not


even fit well on the training samples.

Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 135
Learning curves: #4
▪ The training error tends to decrease whenever we
increase the model complexity, that is, whenever we
fit the data harder.

▪ However with too much fitting, the model adapts


itself too closely to the training data, and will not
generalize well (i.e., have large test error). In
that case the predictions will have large variance.

▪ If the model is not complex enough, it will underfit


and may have large bias, again resulting in poor
generalization.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of


statistical learning. Vol. 1. No. 10. New York: Springer series in statistics, 2001. 136
Learning curves: #5
▪ An underfit model can be identified from the learning curve of the training loss only.
▪ It may show a flat line or noisy values of relatively high loss, indicating that the model was unable to learn the training dataset at all. [Figure: An Underfit Model That Does Not Have Sufficient Capacity]

▪ An underfit model may also be identified by a training loss that is decreasing and continues to decrease at the end of the plot. This indicates that the model is capable of further learning and possible further improvements and that the training process was halted prematurely. [Figure: An Underfit Model That Requires Further Training]

https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/ 137
Learning curves: #6
▪ Overfitting refers to a model that has learned the training dataset too well,
including the statistical noise or random fluctuations in the training
dataset.
▪ The problem with overfitting, is that the more specialized the model
becomes to training data, the less well it is able to generalize to new data,
resulting in an increase in generalization error. This increase in
generalization error can be measured by the performance of the model on
the validation dataset.
▪ This often occurs if the model has more capacity than is required for the
problem, and, in turn, too much flexibility. It can also occur if the model is
trained for too long.
▪ A plot of learning curves shows overfitting if:
▪ The plot of training loss continues to decrease with experience.
▪ The plot of validation loss decreases to a point and begins increasing
again.

https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/ 138
Learning curves: #7
▪ A good fit is identified by a training and validation loss
that decreases to a point of stability with a minimal gap
between the two final loss values.
▪ The loss of the model will almost always be lower on
the training dataset than the validation dataset. This
means that we should expect some gap between the
train and validation loss learning curves. This gap is
referred to as the “generalization gap.”
▪ A plot of learning curves shows a good fit if:
▪ The plot of training loss decreases to a point of stability.
▪ The plot of validation loss decreases to a point of stability
and has a small gap with the training loss.
▪ Continued training of a good fit will likely lead to an
overfit.

https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/ 139
Learning curves: #8
▪ The main indicator of a bias problem is a high validation error. But how can we establish whether the MSE is high or low? Domain knowledge is necessary here. First, keep in mind that MSE squares the units as well. So if we have an MSE of 20 MW^2, to make an engineering judgment we need to take the square root, hence about 4.5 MW.
▪ But is it a low bias problem or a high bias
problem? To find the answer, we need to
look at the training error.

https://www.dataquest.io/blog/learning-curves-machine-learning/ 140
Learning curves: #9
▪ Estimating variance can be done in at least two ways:
▪ By examining the gap between the validation
learning curve and training learning curve.
▪ By examining the training error: its value and its
evolution as the training set sizes increase.

▪ High training MSE scores are also a quick way to detect


low variance. If the variance of a learning algorithm is low,
then the algorithm will come up with simplistic and
similar models as we change the training sets. Because
the models are overly simplified, they cannot even fit the
training data well (they underfit the data). So we should
expect high training MSEs. Hence, high training MSEs can
be used as indicators of low variance.

141
https://www.dataquest.io/blog/learning-curves-machine-learning/
Learning curves: #10

https://www.dataquest.io/blog/learning-curves-machine-learning/ 142
Learning curves: #11
▪ In our case, the training MSE plateaus at around 20, and we’ve already
concluded that’s a high value. So besides the narrow gap, we now have
another confirmation that we have a low variance problem. So far, we can
conclude that:
▪ Our learning algorithm suffers from high bias and low variance,
underfitting the training data.
▪ Adding more instances (rows) to the training data is hugely unlikely to
lead to better models under the current learning algorithm.
▪ One solution at this point is to change to a more complex learning algorithm.
This should decrease the bias and increase the variance. A mistake would be
to try to increase the number of training instances. Generally, these other
two fixes also work when dealing with a high bias and low variance problem:
▪ Training the current learning algorithm on more features (to avoid
collecting new data, you can generate easily polynomial features). This
should lower the bias by increasing the model’s complexity.
▪ Decreasing the regularization of the current learning algorithm, if
that’s the case. In a nutshell, regularization prevents the algorithm
from fitting the training data too well. If we decrease regularization,
the model will fit training data better, and, as a consequence, the
variance will increase and the bias will decrease.

https://www.dataquest.io/blog/learning-curves-machine-learning/ 143
Learning curves: #12
▪ When we build a model to map the relationship between the features X and the target Y, we assume that there is such a relationship in the first place. Provided the assumption is true, there is a true model f that describes perfectly the relationship between X and Y, like so:

Y = f(X) + irreducible error

▪ But why is there an error?! Haven’t we just said that f describes the relationship between X and Y perfectly? There’s an error there because Y is not only a function of our limited number of features X. There could be many other features that influence the value of Y, features we don’t have. It might also be the case that X contains measurement errors. So, besides X, Y is also a function of the irreducible error. Now let’s explain why this error is irreducible. When we estimate f(X) with a model f̂(X), we introduce another kind of error, called reducible error:

f(X) = f̂(X) + reducible error

▪ Replacing f(X) we get:

Y = f̂(X) + reducible error + irreducible error

▪ Error that is reducible can be reduced by building better models; however, the irreducible error will always be there, and what is more, its value is almost never known.

https://www.dataquest.io/blog/learning-curves-machine-learning/
144
Learning curves: #13
training (solid line) and validation (dotted line)

(A) Training and validation losses do not decrease; the model is not learning due to no information in the data or insufficient capacity of the model.
(B) Training loss decreases while validation loss increases: overfitting.

Stevens, Eli, Luca Antiga, and Thomas Viehmann. Deep learning 145
with PyTorch. Manning Publications Company, 2020
Learning curves: #14
training (solid line) and validation (dotted line)

(C) Training and validation losses decrease exactly in tandem. Performance may be improved further as the model is not at the limit of overfitting.
(D) Training and validation losses have different absolute values but similar trends: overfitting is under control.

Stevens, Eli, Luca Antiga, and Thomas Viehmann. Deep learning 146
with PyTorch. Manning Publications Company, 2020
Learning curves: #15
[Figure: the full data set is split into a training set, a training validation set, a test validation set, and a test set.]

http://josh-tobin.com/assets/pdf/troubleshooting-deep-neural-networks-01-19.pdf
147
[Figure: strategies for addressing under-fitting (i.e., reducing bias) and for addressing over-fitting (i.e., reducing variance).]

http://josh-tobin.com/assets/pdf/troubleshooting-deep-neural-networks-01-19.pdf
148
Diagnosing Unrepresentative Datasets: #1
▪ Learning curves can also be used to diagnose properties of a dataset and whether it is
relatively representative.

▪ There are two common cases that could be observed; they are:
▪ Training dataset is relatively unrepresentative.
▪ Validation dataset is relatively unrepresentative.

▪ It is a bit more difficult to distinguish these two cases.

https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
149
Diagnosing Unrepresentative
Datasets: #2
▪ Unrepresentative Train Dataset: This situation can be
identified by a learning curve for training loss that shows
improvement and similarly a learning curve for validation loss
that shows improvement, but a large gap remains between
both curves.

▪ Unrepresentative Validation Dataset: This case can be


identified by a learning curve for training loss that looks like a
good fit and a learning curve for validation loss that shows
noisy movements around the training loss. It may also be
identified by a validation loss that is lower than the training
loss. In this case, it indicates that the validation dataset may
be easier for the model to predict than the training dataset.

https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/ 150
Bias/variance tradeoff and MSE: #1
▪ Let us assume we have a regression model and we want to measure the error of its estimate ŷ via MSE:

MSE(ŷ) = E[(y - ŷ)^2]

▪ Where E denotes the expectation. This error can be decomposed into bias and variance components:

MSE(ŷ) = Bias(ŷ)^2 + Var(ŷ)

151
Bias/variance tradeoff and MSE: #2
▪ The bias measures the error of estimations, and the variance describes how much the
estimation moves around its mean.

▪ The more complex the learning model is and the larger the training size, the lower the bias
will be.

▪ However, the variance will increase.

▪ We use cross-validation to find an optimal balance between bias and variance while diminishing overfitting.

Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 152
Solution for reducing overfitting
▪ Overfitting: Training a model in such a way that it remembers specific features of the input data, rather than generalizing well to data not seen during training.

▪ L2 regularisation

▪ Dropout

▪ Data augmentation

153
Bias-variance tradeoff: #1

https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
Thakur, Abhishek. Approaching (almost) any machine learning problem. Abhishek Thakur, 2020.
154
Bias-variance tradeoff: #2

[Figure: three fits of increasing complexity. Left: provides only a poor fit = high bias. Middle: almost perfect fit. Right: fits the training data perfectly but does not fit the true function very well, i.e. it is very sensitive to varying training data (high variance).]

For the high-variance case: option #1, select hyperparameters appropriately; option #2, use more training data. However, you should only collect more training data if the true function is too complex to be approximated by an estimator with a lower variance.

155
Bias-variance tradeoff: #3

156
Bias, variance & noise

▪ Every estimator has its advantages and drawbacks. Its generalization error can be
decomposed in terms of:

▪ The bias of an estimator is its average error for different training sets.

▪ The variance of an estimator indicates how sensitive it is to varying training sets.

▪ Noise is a property of the data.

https://scikit-learn.org/stable/modules/learning_curve.html 157
Ways to reduce overfitting

▪ A large number of different methods have been developed.

▪ Weight-decay
▪ Weight-sharing
▪ Early stopping
▪ Model averaging
▪ Bayesian fitting of neural nets
▪ Dropout
▪ Generative pre-training

https://www.cs.toronto.edu/~hinton/coursera_slides.html
Underfitting vs. overfitting [network layout]

159
Recognising overfitting

▪ The polynomial fit is very complex


and even drops off at the extremes.

▪ This is an indicator that the


polynomial regression model is
overfitting the training data and will
not generalize well.

160
Definition of regularisation
▪ Regularisation is a subset of methods used to encourage generalization in learned models,
often by increasing the difficulty for a model to learn the fine-grained details of training data.

▪ The simplest regularization: Early stopping. How do you know when to stop? The only real
way to know is to run the model on data that isn’t in the training dataset. This is typically
done using a second test dataset called a validation set. In some circumstances, if you used
the test set for knowing when to stop, you could overfit to the test set. As a general rule, you
don’t use it to control training. You use a validation set instead.

▪ Industry standard regularization: Dropout. Dropout makes a big network act like a little one
by randomly training little subsections of the network at a time, and little networks don’t
overfit.

Trask, Andrew W. "Deep learning." (2019). 161


Why does regularisation help?
▪ If you use regularisation weights have smaller values.

▪ Smaller weights means that behaviour of the network will not change much if we change a few
random inputs here and there.

▪ That makes it difficult for a regularized network to learn the effects of local noise in the data.

▪ Instead, a regularized network learns to respond to types of evidence which are seen often across
the training set and it becomes resistant to learning peculiarities of the noise in the training data.

▪ The hope is that this will force our networks to do real learning about the phenomenon at hand,
and to generalize better from what they learn.
http://neuralnetworksanddeeplearning.com/chap3.html

162
What does regularisation do in practice?

▪ Regularisation helps prevent overfitting.

▪ The regularisation term puts a penalty on the overall cost.

▪ As the magnitude of the model parameter increases the penalty increases as well.

▪ Note that regularization hurts training set performance! This is because it limits the ability
of the network to overfit to the training set. But since it ultimately gives better test
accuracy, it is helping your system.

163
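A minimal PyTorch sketch of the penalty term described above, with an explicit L2 term added to the data loss (the toy data and the value of lam are illustrative): the penalty grows with the magnitude of the parameters, so large weights are only kept if they buy a real reduction in the data loss.

```python
import torch
from torch import nn

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
lam = 1e-2  # regularisation strength (illustrative value)

for _ in range(200):
    opt.zero_grad()
    data_loss = nn.functional.mse_loss(model(X), y)
    l2_penalty = sum((p ** 2).sum() for p in model.parameters())
    (data_loss + lam * l2_penalty).backward()   # penalty grows with the weight magnitudes
    opt.step()

# A common shortcut: torch.optim.SGD(..., weight_decay=2 * lam) applies essentially
# the same L2 pull directly in the optimizer (it also decays the bias).
```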
Lasso & Ridge L^n regularisation
▪ The regularization also acts on overcorrelated features — smoothing and combining their
contribution, thus stabilizing the results and reducing the consequent variance of the
estimates:

▪ L1 (also called Lasso): Shrinks some coefficients to zero, making your coefficients
sparse. This means some features are entirely ignored by the model. This can be seen
as a form of automatic feature selection. Having some coefficients be exactly zero often
makes a model easier to interpret, and can reveal the most important features of your
model.
▪ L2 (also called Ridge): Reduces the coefficients of the most problematic features,
making them smaller, but never equal to zero. All coefficients keep participating in the
estimate, but many become small and irrelevant.

Mueller, John Paul, and Luca Massaron. Python for data science for dummies. John Wiley & Sons, 2019.
Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 164
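A short scikit-learn sketch contrasting the two penalties on a synthetic "wide" problem (the dataset and alpha values are chosen only for illustration): Lasso drives many coefficients exactly to zero, while Ridge shrinks them but keeps all of them non-zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Many features, few of them actually informative.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("ols       ", LinearRegression()),
                    ("ridge (L2)", Ridge(alpha=1.0)),
                    ("lasso (L1)", Lasso(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    n_zero = int(np.sum(np.abs(model.coef_) < 1e-8))   # coefficients shrunk exactly to zero
    print(f"{name}  R^2 test = {model.score(X_te, y_te):.3f}  zero coefficients = {n_zero}")
```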
Lasso & Ridge L^n regularisation
▪ L2 and L1 penalize weights differently:
▪ L2 penalizes weight^2.
▪ L1 penalizes |weight|.
▪ Consequently, L2 and L1 have different derivatives:
▪ The derivative of L2 is 2 * weight.
▪ The derivative of L1 is k (a constant, whose value is independent of weight).

▪ You can think of the derivative of L2 as a force that removes x% of the weight every time. At any rate, L2 does not normally drive weights to zero.

▪ You can think of the derivative of L1 as a force that subtracts some constant from the weight every time. However, thanks to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out. For example, if subtraction would have forced a weight from +0.1 to -0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed out the weight. L1 regularization (penalizing the absolute value of all the weights) turns out to be quite efficient for wide models.
https://developers.google.com/machine-learning/crash-course/regularizatio 165
n-for-sparsity/l1-regularization
LASSO vs. Regression: #1
▪ Variations on ordinary linear regression can help address some problems that come up when working with real data:

▪ LASSO helps when you have too many predictors by favouring weights of exactly zero.
▪ Ridge regression can help with reducing the variance of your weights and predictions by shrinking the weights toward 0. Also known as L2 regularization, and sometimes called Tikhonov regularization.

https://da5030.weebly.com/uploads/8/6/5/9/8659576/120_interview_questions.pdf

166
LASSO vs. Regression: #2
VanderPlas, Jake. Python data science handbook: Essential tools for working with data. " O'Reilly Media, Inc.", 2016.

The α parameter is essentially a knob controlling the complexity of the resulting model. In the limit α -> 0, we recover the standard linear regression result; in the limit α -> oo, all model responses will be suppressed. One advantage of ridge regression in particular is that it can be computed very efficiently, at hardly more computational cost than the original linear regression model.

[Figure, left: the coefficients for an overly complex model. Right: with the lasso regression penalty, the majority of the coefficients are exactly zero, with the functional behaviour being modelled by a small subset of the available basis functions.]

167
RIDGE vs. LASSO in practice
▪ In practice, ridge regression is usually the first choice between these two models.

▪ However, if you have a large amount of features and expect only a few of them to be
important, Lasso might be a better choice.

▪ ElasticNet combines the penalties of Lasso and Ridge. In practice, this combination works
best, though at the price of having two parameters to adjust: one for the L1 regularization,
and one for the L2 regularization.

Guido, Sarah, and Andreas Müller. Introduction to machine


learning with python. Vol. 282. O'Reilly Media, 2016. 168
Lasso regression
▪ Lasso regression is a variant of logistic regression. One of the problems with logistic
regression is that you can have many different features all with modest weights, instead of
a few clearly meaningful features with large weights.

▪ This makes it harder to extract real-world meaning from the model. It is also an insidious
form of overfitting, which is begging to have the model generalize poorly.

▪ In lasso regression p(x) has the same functional form of 𝜎(w ⋅ x + b), where we assign every
feature a weight and then plug their weighted sum into a logistic function. However, we
train it in a different way that penalizes modest-sized weights.

Data Science: The Executive Summary - A Technical Book for Non-Technical Professionals 169
Another important observation on RIDGE
▪ The lesson here is that with enough
training data, regularization becomes less
important, and given enough data, ridge
and linear regression will have the same
performance (the fact that this happens
here when using the full dataset is just by
chance).

▪ Another interesting aspect is the decrease


in training performance for linear
regression. If more data is added, it
becomes harder for a model to overfit, or
memorize the data.

Guido, Sarah, and Andreas Müller. Introduction to machine


learning with python. Vol. 282. O'Reilly Media, 2016.
170
Weight penalties vs. weight constraints

▪ We usually penalize the squared value of each weight separately.
▪ Instead, we can put a constraint on the maximum squared length of the incoming weight vector of each unit.
▪ If an update violates this constraint, we scale down the vector of incoming weights to the allowed length.

▪ Weight constraints have several advantages over weight penalties:
▪ It's easier to set a sensible value.
▪ They prevent hidden units getting stuck near zero.
▪ They prevent weights exploding.
▪ When a unit hits its limit, the effective weight penalty on all of its weights is determined by the big gradients. This is more effective than a fixed penalty at pushing irrelevant weights towards zero.

https://www.cs.toronto.edu/~hinton/coursera_slides.html
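A hedged PyTorch sketch of a max-norm weight constraint (the helper name, layer size, and max_norm value are assumptions for illustration): after each optimizer step, each unit's incoming weight vector is rescaled if its length exceeds the allowed maximum.

```python
import torch
from torch import nn

def apply_max_norm(module: nn.Linear, max_norm: float = 3.0) -> None:
    """Rescale each unit's incoming weight vector if it exceeds max_norm."""
    with torch.no_grad():
        # One row of `weight` holds the incoming weights of one output unit.
        norms = module.weight.norm(dim=1, keepdim=True)
        desired = norms.clamp(max=max_norm)
        module.weight *= desired / (norms + 1e-12)

layer = nn.Linear(100, 50)
# ... during training, call this after each optimizer.step():
apply_max_norm(layer, max_norm=3.0)
```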
Don’t let the regularization overwhelm the data
▪ It is often the case that a loss function is a sum of the data loss and the regularization loss
(e.g. L2 penalty on weights).

▪ One danger to be aware of is that the regularization loss may overwhelm the data loss, in
which case the gradients will be primarily coming from the regularization term (which
usually has a much simpler gradient expression).

▪ This can mask an incorrect implementation of the data loss gradient.

▪ Therefore, it is recommended to turn off regularization and check the data loss alone first,
and then the regularization term second and independently

https://cs231n.github.io/neural-networks-3/ 172
Why don’t we regularise the bias?

▪ In theory: it would be easy to modify the regularization procedure to regularize the biases.

▪ Empirically, doing this often doesn't change the results very much, so to some extent it's
merely a convention whether to regularize the biases or not.

▪ In fact, it's worth noting that having a large bias doesn't make a neuron sensitive to its
inputs in the same way as having large weights.

http://neuralnetworksanddeeplearning.com/chap3.html 173
Other techniques for regularization
▪ Other techniques are:
▪ L2 - has the intuitive interpretation of heavily penalizing peaky
weight vectors and preferring diffuse weight vectors. Due to
multiplicative interactions between weights and inputs this has the
appealing property of encouraging the network to use all of its
inputs a little rather than some of its inputs a lot.
▪ L1 - neurons with L1 regularization end up using only a sparse
subset of their most important inputs and become nearly invariant
to the “noisy” inputs. L2 regularization can be expected to give
superior performance over L1.
▪ Dropout: we randomly modify the network itself by randomly
dropping node in each layer.
▪ Artificially expanding the training data = data augmentation

http://neuralnetworksanddeeplearning.com/chap3.html#other_techniques_for_regularization 174
Dropout: #0
▪ Neural networks are extremely flexible models due to their large number of parameters,
which is beneficial for learning, but also highly redundant.

▪ This makes it possible to compress (via dropout) neural networks without having a drastic
effect on performance.

▪ However adding dropout brings other issues: it can be added between different layers, and
finding the best place is usually done through experimentation. The percentage of dropout
to be added is also tricky, as it is purely dependent on the problem statement we are trying
to solve. It is often good practice to start with a small number such as 0.2.

Gomez, Aidan N., et al. "Targeted dropout." (2018).
Subramanian, Vishnu. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd, 2018. 175
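A minimal PyTorch sketch of where dropout layers are commonly placed, starting from the small p = 0.2 suggested above (the layer sizes are illustrative); note that model.train() activates dropout and model.eval() disables it.

```python
import torch
from torch import nn

# A small classifier with dropout between the fully connected layers.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.2),       # drop 20% of activations during training
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(128, 10),
)

model.train()                # dropout is active
train_out = model(torch.randn(32, 784))

model.eval()                 # dropout is a no-op at evaluation/test time
with torch.no_grad():
    test_out = model(torch.randn(32, 784))
```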
Dropout: #1
▪ Dropout is a radically different technique for regularization.

▪ Unlike L1 and L2 regularization, dropout doesn't rely on modifying the cost function. Instead,
in dropout we modify the network itself.

▪ Since test-time performance is so critical, it is always preferable to use inverted dropout,


which performs the scaling at train time, leaving the forward pass at test time untouched.

▪ One issue is that the cost function J is less well defined, or is certainly harder to calculate, so you lose the debugging tool of plotting it against iterations. What you can do is first turn off dropout, run the code and make sure that J is monotonically decreasing, and then turn dropout back on.

176
Dropout: #2
▪ Generally use a small dropout value of 20%-50% of neurons with 20% providing a good starting point. A probability too
low has minimal effect and a value too high results in under-learning by the network.

▪ Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model
more of an opportunity to learn independent representations.

▪ Use dropout on input (visible) as well as hidden layers. Application of dropout at each layer of the network has shown
good results.

▪ Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a
high momentum value of 0.9 or 0.99.

▪ Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint
on the size of network weights such as max-norm regularization with a size of 4 or 5 has been shown to improve results.

177
Dropout: #3
▪ It is a regularisation technique to combat overfitting. It is employed at the training time
only and eliminates some units randomly. By doing so the knowledge is spread across all of
the neurons to help obtain a more robust network. Remember that at testing time all the
neuros a re present.

▪ Even thought dropout is itself a regularisation technique it can still be used in conjunction
with other techniques such as L2.

https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 178
Dropout: #4

https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 179
Dropout: #5
With normal dropout, at test time you have to scale activations by the dropout rate p because you are not dropping out any of the neurons, so you need to match the expected value at training.

With inverted dropout, scaling is applied at training time, but inversely: first, drop out activations with dropout factor p, and second, scale the survivors by the inverse dropout factor 1/p. Inverted dropout has an advantage: you don’t have to do anything at test time, which makes inference faster.

https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 180
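A NumPy sketch of inverted dropout for illustration; here p_drop denotes the fraction of activations dropped, so the survivors are scaled by 1/(1 - p_drop), i.e. the inverse of the keep probability, and nothing needs to change at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(activations: np.ndarray, p_drop: float = 0.2,
                     training: bool = True) -> np.ndarray:
    """Inverted dropout: scale at training time so test time needs no change."""
    if not training or p_drop == 0.0:
        return activations                        # nothing to do at test time
    keep_prob = 1.0 - p_drop
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob         # rescale to preserve the expected value

a = rng.standard_normal((4, 5))
print(inverted_dropout(a, p_drop=0.5, training=True))
print(inverted_dropout(a, training=False))        # identical to the input
```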
Dropout: #6

▪ It turns out that the smaller a neural network is, the less it’s able to overfit. Why? Well,
small neural networks don’t have much expressive power. They can’t latch on to the more
granular details (noise) that tend to be the source of overfitting. They have room to capture
only the big, obvious, high-level features.

▪ How do you get the power of a large neural network with the resistance to overfitting of
the small neural network? Take the big neural network and turn off nodes randomly. What
happens when you take a big neural network and use only a small part of it? It behaves like
a small neural network. But when you do this randomly over potentially millions of
different subnetworks, the sum total of the entire network still maintains its expressive
power.

Trask, Andrew W. "Deep learning." (2019). 181


Dropout: #7
Although it’s likely that large, unregularized neural networks will overfit to noise, it’s unlikely they will
overfit to the same noise.

▪ Why don’t they overfit to the same noise? Because they start randomly, and they stop training once they’ve
learned enough noise to disambiguate between all the images in the training set. The MNIST network needs
to find only a handful of random pixels that happen to correlate with the output labels, to overfit. But this is
contrasted with, perhaps, an even more important concept:

Neural networks, even though they’re randomly generated, still start by learning the biggest, most broadly
sweeping features before learning much about the noise.

▪ The takeaway is this: if you train 100 neural networks (all initialized randomly), they will each tend to latch
onto different noise but similar broad signal. Thus, when they make mistakes, they will often make differing
mistakes. If you allowed them to vote equally, their noise would tend to cancel out, revealing only what they
all learned in common: the signal.

Trask, Andrew W. "Deep learning." (2019). 182


Dropout: #8
▪ Dropout layer which can help correct overfitting.
▪ Overfitting is caused by the network learning spurious patterns in the training data.
▪ To recognize these spurious patterns a network will often rely on very specific combinations of weights, a kind
of "conspiracy" of weights. Being so specific, they tend to be fragile: remove one and the conspiracy falls apart.
▪ This is the idea behind dropout. To break up these conspiracies, we randomly drop out some fraction of a
layer's input units every step of training, making it much harder for the network to learn those spurious
patterns in the training data.
▪ Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust.
▪ You could also think about dropout as creating a kind of ensemble of networks.
▪ The predictions will no longer be made by one big network, but instead by a committee of smaller networks.
▪ Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making
the committee as a whole better than any individual.
▪ If you're familiar with random forests as an ensemble of decision trees, it's the same idea.

https://www.kaggle.com/ryanholbrook/dropout-and-batch-normalization 183
Dropout and co-adaptation
▪ It is common for two or more neurons to begin to detect the same feature repeatedly. This
is called co-adaptation. It implies the DNN is not utilizing its full capacity efficiently, in
effect wasting computational resources.
▪ Dropout discourages co-adaptations of hidden neurons by dropping out a fixed fraction of
the activation of the neurons during the feed forward phase of training.

▪ Interesting fact: Dropout can also be applied to inputs. In this case, the algorithm randomly
ignores a fixed proportion of input attributes.
▪ One of life’s lessons is that dropping out is not necessarily ruinous to future performance,
but neither is it a guarantee of future success. The same is true for dropout in DNNs; there
is no absolute guarantee that it will enhance performance, but it is often worth a try.

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 184
Some points on dropout

▪ Dropout can reduce the likelihood of co-adaptation in noisy samples by creating multiple
paths to correct classification throughout the DNN.
▪ The larger the dropout fraction the more noise is introduced during training; this slows
down learning.
▪ Dropout appears to offer the most benefit on very large DNN models.

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 185
What if we just don’t train a random bunch of nodes?

▪ Because they won’t be updated, they won’t have the chance to


overfit to the input data, and because it’s random, each training cycle
will ignore a different selection of the input, which should help
generalization even further.

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and


Deploying Deep Learning Applications. " O'Reilly Media, Inc.", 2019. 186
Pruning: #0
▪ Model pruning seeks to induce sparsity (by removing sections that provide no
improvement) in a deep neural network’s various connection matrices, thereby reducing
the number of nonzero-valued parameters in the model.

▪ The word ‘pruning’ comes from its use in decision trees where branches of the tree are
pruned as a form of model regularization. Analogously, weights in a neural network that are
considered unimportant or rarely fire can be removed from the network with little to no
consequence. In fact, the majority of neurons have a relatively small impact on the model
performance, meaning we can achieve high accuracy even when eliminating large numbers
of parameters.

https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 187
Pruning: #1
▪ Usually combined with retraining

▪ Compressed models have amplified


sensitivity to adversarial examples and
common corruptions.

▪ Important Questions
▪ Which weights should we remove
first?
▪ How many weights can we remove?
▪ What about the order of removal?

https://stanford-cs329s.github.io/slides/cs329s_12_slides_sara_google.pdf 188
Pruning: #2 [motivation]
▪ In Humans we have synaptic pruning
▪ This removes redundant connections in the brain

189
https://stanford-cs329s.github.io/slides/cs329s_12_slides_sara_google.pdf
Pruning: #3

190
https://stanford-cs329s.github.io/slides/cs329s_12_slides_sara_google.pdf
Pruning: #4

https://stanford-cs329s.github.io/slides/cs329s_12_slides_sara_google.pdf 191
Pruning: #5
▪ The most common way is weight pruning. Weight pruning rank-orders the weights by their
magnitude since parameters with larger weights are more likely to fire and thus more likely to
be important.

▪ Another way is unit pruning, which sets entire columns in the weight matrix to zero, in effect
deleting the corresponding output neuron. Here to achieve sparsity of k% we rank the
columns of a weight matrix according to their L2-norm and delete the smallest k%.

▪ A third and more advanced method is to use Fisher pruning, which relies on the Fisher information. This generates a norm known as the Fisher-Rao norm, which can then be used to rank-order parameters. It is suspected there is a link between the Fisher information and the redundancy of parameters, and this is why this technique seems to produce good results.

https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 192
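A hedged PyTorch sketch of the first two approaches using torch.nn.utils.prune: l1_unstructured for magnitude-based weight pruning and ln_structured for unit pruning by the L2 norm of each weight row (the layer sizes and the 30% amount are illustrative).

```python
import torch
from torch import nn
from torch.nn.utils import prune

layer = nn.Linear(100, 50)

# Weight pruning: zero out the 30% of weights with the smallest magnitude (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Unit pruning: remove whole output units, ranking rows of the weight matrix by their L2 norm.
other = nn.Linear(100, 50)
prune.ln_structured(other, name="weight", amount=0.3, n=2, dim=0)

sparsity = float((layer.weight == 0).float().mean())
print(f"fraction of zeroed weights: {sparsity:.2f}")

# Make the pruning permanent (removes the mask and the re-parametrisation).
prune.remove(layer, "weight")
```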
Pruning: #6
▪ If we prune too much at once, the network might be
damaged so much it won’t be able to recover.
▪ So in practice this is an iterative process, often called ‘Iterative Pruning’.

https://jacobgil.github.io/deeplearning/pruning-deep-learning 193
Pruning: #7
▪ Sounds good, why isn’t this more popular?

▪ Which is surprising considering all the effort on running deep learning on mobile devices. I
guess the reason is a combination of:
▪ The ranking methods weren’t good enough until now, resulting in too big of an
accuracy drop.
▪ It’s a pain to implement.
▪ Those who do use pruning, keep it for themselves as a secret sauce advantage.

https://jacobgil.github.io/deeplearning/pruning-deep-learning 194
