Model Training: (Anything Done While We Train The Model)
1
List of topics
• Overfitting and learning curves
Validation loss
▪ Your validation loss will first get worse during training because the model gets
overconfident, and only later will get worse because it is incorrectly memorizing the data.
▪ REMEMBER, our loss function is something that we use to allow our optimizer to have
something it can differentiate and optimize; it’s not the thing we care about in practice.
https://www.dropbox.com/s/wxybbtbiv64yf0j/Chapter1.pdf?dl=0 4
Parameters vs. hyperparameters
▪ Hyperparameters are inputs of machine learning algorithms or pipelines that influence the
performance of the model. They don’t belong to the training data and cannot be learned
from it. Example: the maximum depth of the tree in the decision tree learning algorithm.
▪ Parameters are variables that define the model trained by the learning algorithm.
Parameters are directly modified by the learning algorithm based on the training data. The
goal of learning is to find such values of parameters that make the model optimal in a
certain sense. Example: parameters are w and b in the equation of linear regression.
https://www.dropbox.com/s/wxybbtbiv64yf0j/Chapter1.pdf?dl=0 5
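▪ To make the distinction concrete, here is a minimal sketch (my own illustration with scikit-learn, not taken from the slides): max_depth is a hyperparameter chosen before training, while coef_ and intercept_ are the parameters w and b learned from the data.

```python
# Hyperparameters vs. parameters: a minimal, illustrative sketch.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3  # synthetic regression data

# Hyperparameter: set by us before training, never learned from the data.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Parameters: learned by the algorithm from the training data (w and b).
lin = LinearRegression().fit(X, y)
print("w =", lin.coef_, "b =", lin.intercept_)
```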
Hyperparameter tuning: #1
▪ Method #1: Luckily, there is a correct way of tuning the hyperparameters and it does not
touch the test set at all. The idea is to split our training set in two: a slightly smaller training
set, and what we call a validation set.
▪ Method #2: In cases where the size of your training data (and therefore also the validation
data) might be small, people sometimes use a more sophisticated technique for
hyperparameter tuning called cross-validation.
https://cs231n.github.io/classification/
6
Hyperparameter tuning: #2
▪ Common data splits. A training and test set is given. The training set is split into folds (for example 5
folds here). The folds 1-4 become the training set. One fold (e.g. fold 5 here in yellow) is denoted as
the Validation fold and is used to tune the hyperparameters. Cross-validation goes a step further and
iterates over the choice of which fold is the validation fold, separately from 1-5. This would be
referred to as 5-fold cross-validation. In the very end once the model is trained and all the best
hyperparameters were determined, the model is evaluated a single time on the test data (red).
https://cs231n.github.io/classification/ 7
Hyperparameter tuning: #2.2
▪ Essentially, when:
▪ the dataset is small,
▪ or you want an estimate of the mean performance (keeping in mind that the k estimates are NOT independent),
▪ you can use CV instead of separate validation and test sets.
8
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec9.pdf
Hyperparameter tuning: #2.1
▪ How to split the data:
▪ The training set is used by the machine learning algorithm to train the model.
▪ The validation set is needed to find the best values for the hyperparameters of the
machine learning pipeline. Most importantly this set is not seen by the machine
learning algorithm.
▪ The test set is used for reporting: once you have your best model, you test its
performance on the test set and report the results.
https://www.dropbox.com/s/3l9cp75esgvqpxp/Chapter3.pdf?dl=0 9
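▪ A minimal sketch of one common way to obtain the three sets with scikit-learn (the 60/20/20 proportions and dataset are illustrative): call train_test_split twice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the test set, kept untouched until the final report.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2
```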
Hyperparameter tuning: #3
▪ As a rule of thumb, between 70-90% of your data usually goes to the train split.
▪ This setting depends on how many hyperparameters you have and how much of an
influence you expect them to have.
▪ If there are many hyperparameters to estimate, you should err on the side of having a larger
validation set to estimate them effectively.
https://cs231n.github.io/classification/ 10
Hyperparameter tuning: #4
https://scikit-learn.org/stable/modules/cross_validation.html 11
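▪ The slide points to the scikit-learn cross-validation page; a minimal usage sketch (estimator and dataset are illustrative, not from the slide) looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Five-fold cross-validation: returns one score per fold.
scores = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=5)
print(scores, scores.mean(), scores.std())
```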
Hyperparameter tuning: #5
12
Hyperparameter tuning: #6
13
Hyperparameter optimization (HPO): the cost issue
▪ The issue: HPO techniques are very expensive and resource-heavy in practice, especially for
applications of deep learning
▪ Partial solution: even approaches like Hyperband or Bayesian optimization, which are
specifically designed to minimize the number of training cycles needed, are still not able to
deal with certain problems due to the complexity of the models or the size of the datasets
Paleyes, Andrei, Raoul-Gabriel Urma, and Neil D. Lawrence. "Challenges in deploying machine
learning: a survey of case studies." arXiv preprint arXiv:2011.09926 (2020). 14
HPO: available methods
▪ To solve HPO problems, we need to use data-driven methods. Manual methods are not effective.
▪ Grid search: all combinations of hyper-parameters are evaluated. Thus, grid search is
computationally expensive, infeasible and suffers from the curse of dimensionality. It is,
however, easy to parallelise.
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 15
▪ Grid search: Given a finite set of discrete values for each hyperparameter, exhaustively cross-
validate all combinations.
▪ Random search: Given a discrete or continuous distribution for each hyperparameter,
randomly sample from the joint distribution. Generally more efficient than exhaustive grid
search.
▪ Bayesian optimization: Sample like random search, but update the search space you sample
from as you go, based on outcomes of prior searches.
▪ Gradient-based optimization: Attempt to estimate the gradient of the cross-validation metric
with respect to the hyperparameters and ascend/descend the gradient.
▪ Evolutionary optimization: Sample the search space, discard combinations with poor metrics,
and genetically evolve new combinations based on the successful combinations.
▪ Population-based training: A method of performing hyperparameter optimization at the
same time as training.
https://towardsdatascience.com/beyond-grid-search-hypercharge-hyperparameter-tuning-for-xgboost-7c78f7a2929d 16
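▪ As a sketch of the first two methods in the list above (the estimator, grid and distributions are illustrative), scikit-learn exposes both directly:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively cross-validate every combination.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Random search: sample a fixed number of combinations from distributions.
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20, cv=5, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```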
Bayesian Optimization for HPO: #1
▪ Random search eventually converges to the optimal answer, but it is a blind search! Is there
a smarter way to search? Yes – Bayesian optimization.
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 17
Bayesian Optimization for HPO: #2
The figure in the referenced post shows the observations, the true objective function (dashed
line), the surrogate model, i.e. a regression model (continuous line), and a blue tube around
the black line, which is our uncertainty.
We also have an acquisition function, which is the way we explore the search space for finding
the new optimum value. In other words, the acquisition function helps us improve our surrogate
model and select the next value. In the image above, the acquisition function is shown as an
orange curve. An acquisition maximum is where the uncertainty is high and the predicted value
is low.
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 18
Pros & cons of Bayesian optimisation for HPO
▪ Pros: The most important advantage of Bayesian optimization is that it can operate very
well with black-box functions. BO is also data-efficient and robust to noise.
▪ Cons: It does not work well with parallel resources, because the optimization process is
sequential.
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 19
A bit more on Bayesian optimisation
▪ It works by building a probabilistic model of the objective function, called the surrogate
function, that is then searched efficiently with an acquisition function before candidate
samples are chosen for evaluation on the real objective function.
https://machinelearningmastery.com/what-is-bayesian-optimization/ 20
Surrogate function & acquisition function
▪ The surrogate function is a technique used to best approximate the mapping of input
examples to an output score. Probabilistically, it summarizes the conditional probability of
an objective function (f), given the available data. This can be done via a Gaussian process (GP),
which is a model that constructs a joint probability distribution over the variables, assuming a
multivariate Gaussian distribution. As such, it is capable of efficient and effective summarization
of a large number of functions and a smooth transition as more observations are made available
to the model. This smooth structure and smooth transition to new functions based on data are
desirable properties as we sample the domain. An important aspect in defining the GP model is
the kernel, and the RBF kernel seems to be the most common choice.
▪ The acquisition function is responsible for scoring or estimating the likelihood that a given
candidate sample (input) is worth evaluating with the real objective function.
https://machinelearningmastery.com/what-is-bayesian-optimization/ 21
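▪ To make the surrogate/acquisition loop concrete, here is a small from-scratch sketch (my own illustration, not code from the referenced post): a Gaussian-process surrogate fitted with scikit-learn plus an expected-improvement acquisition that picks the next point to evaluate.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):
    # The "black-box" function we pretend is expensive to evaluate.
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

def expected_improvement(X_cand, gp, y_best, xi=0.01):
    # Score candidates by how much improvement over y_best we expect.
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    imp = mu - y_best - xi
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X_sample = rng.uniform(-2, 2, size=(3, 1))        # a few initial observations
y_sample = objective(X_sample).ravel()
X_cand = np.linspace(-2, 2, 400).reshape(-1, 1)   # candidate points to score

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
    gp.fit(X_sample, y_sample)                    # surrogate (regression) model
    ei = expected_improvement(X_cand, gp, y_sample.max())
    x_next = X_cand[np.argmax(ei)].reshape(1, 1)  # acquisition maximum
    X_sample = np.vstack([X_sample, x_next])
    y_sample = np.append(y_sample, objective(x_next).ravel())

print("best x:", X_sample[np.argmax(y_sample)], "best y:", y_sample.max())
```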
Some examples of regression models used in
Bayesian optimization for HPO
Acquisition functions
define a balance between
exploring new areas in
the objective space and
exploiting areas that are
already known to have
favourable values
https://static.sigopt.com/b/20a144d208ef255d3b981ce419667ec25d8
412e2/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf
22
Multi Fidelity Optimization for HPO
▪ In the Bayesian method, estimation of the objective function is very expensive. Is there any
cheaper way to estimate the objective function?
▪ Multi Fidelity optimization methods are the answer. Some of them are:
▪ Successive halving
▪ HyperBand
▪ BOHB
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 23
Successive halving for HPO
▪ Imagine you have N different configurations
and a budget (for example time).
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 25
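▪ A tiny sketch of the successive-halving idea in plain Python (the scoring function is a stand-in, not a real training run): keep the best half of the configurations at each round and double the budget for the survivors.

```python
import random

def train_and_score(config, budget):
    # Stand-in for training `config` for `budget` epochs and returning a score.
    random.seed(hash((config, budget)) % (2 ** 32))
    return random.random() + 0.01 * budget

configs = [("lr", 10 ** -random.uniform(1, 4)) for _ in range(16)]  # N = 16 configurations
budget = 1
while len(configs) > 1:
    scored = [(train_and_score(c, budget), c) for c in configs]
    scored.sort(reverse=True)                              # higher score is better
    configs = [c for _, c in scored[: len(scored) // 2]]   # keep the top half
    budget *= 2                                            # survivors get more budget
print("winner:", configs[0], "final budget:", budget)
```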
BOHB: Robust and Efficient Hyperparameter
Optimization at Scale
▪ BOHB is a state-of-the-art hyperparameter optimization algorithm
▪ The idea behind the BOHB algorithm is based on one simple question – why do we run
successive halving repeatedly?
▪ Instead of a blind repetition method on top of successive halving, BOHB uses the Bayesian
Optimization algorithm. In fact, BOHB combines HyperBand and BO to use both of these
algorithms in an efficient way.
https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms 26
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?
• Why did we come up with the idea of CV in the first place? Cross
validation is a form of model checking (I used this word on purpose)
which attempts (so maybe it does not work all the time?) to improve
on the basic methods of hold-out checking (training/test) by leveraging
subsets of our data and an understanding of the bias/variance trade-off
in order to gain a better understanding of how our models will actually
perform when applied outside of the data it was trained on.
• In this sentence I have substituted ”validation” with “checking” as I feel
this is more appropriate.
• How many methods can we use? Hold-out, k-fold, LOOCV
27
https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?
• Hold-out method? Aka the train and test split. Some of the issues
the method does not address are the following: We validated our
model once. What if the split we made just happened to be very
conducive to this model? What if the split we made introduced a
large skew into the data? Didn’t we significantly reduce the size of
our training dataset by splitting it like that?
• How does bias and variance fit in all of this? When creating a
model, we account for a few types of error: validation error,
testing error, error due to bias, and error due to variance in a
relationship known as the bias–variance trade-off. As you can see
here, the author rightfully makes a distinction between validation and
test error. Bias tells you how far away we are from the target,
whereas variance tells you the spread of the predictions.
29
https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?
30
https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?
▪ Hold-out method — Dataset size: big dataset. Pros: easy to understand, easy to implement.
Cons: does not preserve the statistical integrity.
▪ K-fold CV — Dataset size: medium dataset. Pros: the accuracy score increases as k is increased,
which decreases bias while increasing variance. Cons: very costly.
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?
• What is confusing? What puzzles me is that lots of people split data, perform training
on training data with cross-validation, and then perform validation on the test dataset.
Is this correct? To add (apparent) confusion, you can also do cross-validation to select
the hyper-parameters of your model.
https://stats.stackexchange.com/questions/148688/cross-validation-with-test-data-set 33
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing?
https://stats.stackexchange.com/questions/148688/cross-validation-with-test-data-set 34
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing? Solved with a
respectable reference – part #1
▪ You should have (unless you have a tiny dataset) three sets:
▪ Training set – used to fit the model
▪ Test set – used only once at the end, to report the generalisation performance
▪ Validation set – hyperparameter tuning
Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 35
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing? Solved with a
respectable reference – part #2
The best score on the validation set is 96%: slightly lower than before, probably
because we used less data to train the model (X_train is smaller now because
we split our dataset twice). However, the score on the test set—the score that
actually tells us how well we generalize—is even lower, at 92%.
Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 36
If CV (cross validation) is doing the same thing as a naïve training and
test, shouldn’t it be called cross training and testing? Solved with a
respectable reference – part #3
Fitting the GridSearchCV object not only searches for the best parameters,
but also automatically fits a new model on the whole training dataset with
the parameters that yielded the best cross-validation performance. This is
equivalent to:
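▪ The code the book refers to is not reproduced on the slide; roughly, the automatic refit amounts to the following sketch (the estimator, grid and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)

# Roughly what GridSearchCV's automatic refit does after the search:
best_model = SVC(**grid_search.best_params_).fit(X_train, y_train)
print(best_model.score(X_test, y_test), grid_search.score(X_test, y_test))
```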
▪ What is the difference with a basic train/test split? A basic train/test split is conceptually
identical to a 1-fold CV (with a custom size of the split, in contrast to the 1/K train size in the
k-fold CV). The advantage of doing more splits (i.e. k>1 CV) is to get more information about
the estimate of the generalisation error: you get the error estimate plus its statistical
uncertainty.
https://stackoverflow.com/questions/46456381/cross-validation-in-lightgbm 39
What is the main purposes of CV? #2
▪ You want to train a model with a set of parameters on some data and evaluate each
variation of the model on an independent (validation) set. Then you intend to choose the
best parameters by choosing the variant that gives the best evaluation metric of your choice.
▪ This can be done with a simple train/test split. But evaluated performance, and thus the
choice of the optimal model parameters, might be just a fluctuation on a particular split.
▪ Thus, you can evaluate each of those models in a more statistically robust way by averaging the
evaluation over several train/test splits, i.e. k-fold CV.
▪ Then you can go a step further and say that you had an additional hold-out set, separated
before hyperparameter optimisation was started. This way you can evaluate the chosen best
model on that set to measure the final generalisation error.
▪ However, you can go even a step further and, instead of having a single test sample, you can
have an outer CV loop, which brings us to nested CV.
https://stackoverflow.com/questions/46456381/cross-validation-in-lightgbm 40
Always ask the following question: When
should you use cross-validation?
▪ The guideline is as follows:
▪ For small datasets, where extra computational burden isn't a big deal, you should run cross-
validation.
▪ For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have
enough data that there's little need to re-use some of it for holdout.
▪ There's no simple threshold for what constitutes a large vs. small dataset. But if your model takes a
couple minutes or less to run, it's probably worth switching to cross-validation.
▪ Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each
experiment yields the same results, a single validation set is probably sufficient.
https://www.kaggle.com/alexisbcook/cross-validation 41
The training, validation & test set: #1
▪ How to split the data:
▪ The training set is used by the machine learning algorithm to train the model.
▪ The validation set is needed to find the best values for the hyperparameters of the machine learning pipeline. Most
importantly this set is not seen by the machine learning algorithm.
▪ The test set is used for reporting: once you have your best model (chosen in the previous step), you test its
performance (say accuracy) on the test set and report the results.
https://www.dropbox.com/s/3l9cp75esgvqpxp/Chapter3.pdf?dl=0 https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set 42
Do You Re-train on the Whole Dataset After
Validating the Model?
▪ Method A): train on 80%; validate on 20%; the model is good, so train on 100%; predict the test set.
▪ Method B): train on 80%; validate on 20%; the model is good, so use this model as is; predict the test set.
http://www.chioka.in/do-you-re-train-on-the-whole-dataset-after-validating-the-model/ 43
Issue with train & test split
▪ One test score is not reliable because splitting the data into different training and test sets
would give different results.
▪ In effect, splitting the data into a training set and a test set is arbitrary, and a different
random_state will give a different error.
https://www.fast.ai/2017/11/13/validation-sets/
▪ Time series data: a random subset is a poor choice for your validation set (it is too easy to fill
in the gaps, and not indicative of what you’ll need in production, since you can look at the data
both before and after the dates you are trying to predict). A better choice is to use the earlier
data as your training set (and the later data for the validation set).
45
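▪ For time-ordered data, a sketch with scikit-learn's TimeSeriesSplit (assuming that iterator suits your problem) always trains on earlier samples and validates on later ones:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # samples ordered in time
y = np.arange(20)

for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    # Training indices always precede validation indices.
    print("train:", train_idx, "val:", val_idx)
```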
How (and why) to create a good validation set: #1
▪ sklearn offers a train_test_split method that takes a random subset of the data,
which is a poor choice for many real-world problems.
▪ Also keep in mind that sklearn does not have a train_validation_test_split.
The reason is that it is assumed you will often be using cross-validation, in which different
subsets of the training set serve as the validation set.
▪ There is a lot of confusion around this because people keep using the wrong nomenclature. An
example? In the DL community, “test-time inference” is often used to refer to evaluating on
data in production, which is NOT the technical definition of a test set.
▪ The ultimate goal is for it to be accurate on new data, not just the data you are using to
build it.
https://www.fast.ai/2017/11/13/validation-sets/ 46
How (and why) to create a
good validation set: #2
▪ Why is that? If you were to gather some new data points, they most
likely would not be on that curve in the graph on the right, but would
be closer to the curve in the middle graph.
▪ The underlying idea is that:
▪ The training set is used to train a given model
▪ The validation set is used to choose between models (for instance,
does a random forest or a neural net work better for your
problem? do you want a random forest with 40 trees or 50 trees?)
▪ The test set tells you how you’ve done. If you’ve tried out a lot of
different models, you may get one that does well on your
validation set just by chance, and having a test set helps make sure
that is not the case.
https://www.fast.ai/2017/11/13/validation-sets/ 47
Cross-validation: #1
▪ The validation set is needed to find the best values for the hyperparameters of the
machine learning pipeline.
▪ When you don’t have a decent validation set to tune your hyperparameters on, the
common technique that can help you is called cross-validation.
▪ When you have few training examples, it could be prohibitive to have both a validation and a
test set. You would prefer to use more data to train the model. In such a case, you only split
your data into a training and a test set. Then you use cross-validation on the training set
to simulate a validation set.
48
Cross-validation: #2
▪ A noticeable problem with the train/test set split is that you’re actually introducing bias into your testing
because you’re reducing the size of your in-sample training data. When you split your data, you may be
actually keeping some useful examples out of training.
▪ Moreover, sometimes your data is so complex that a test set, though apparently similar to the training set,
is not really similar, because combinations of values are different (which is typical of highly dimensional
datasets).
▪ These issues add to the instability of sampling results when you don’t have many examples. The risk of
splitting your data in an unfavourable way also explains why the train/test split isn’t the favoured solution
by machine learning practitioners when you have to evaluate and tune a machine learning solution.
Mueller, John Paul, and Luca Massaron. Machine learning for dummies. John Wiley & Sons, 2016. 49
Cross-validation: #3
▪ The process continues until all the k-folds
are used once as a test set and you have
a k number of error estimates that you
can compute into a mean error estimate
(the cross-validation score) and a
standard error of the estimates.
Mueller, John Paul, and Luca Massaron. Machine learning for dummies. John Wiley & Sons, 2016. 50
Cross-validation: #3
Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 51
Cross-validation: #4
▪ K-fold cross-validation breaks the dataset into k partitions. Each partition gets held out as the
testing data, while the model is trained on the remaining k − 1 partitions.
▪ The average of all these performances indicates how good a model you were fitting.
Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 52
Cross-validation: #5
▪ The extreme version of this is “leave-one-out” cross-validation, where k is equal to the
number of datapoints you have and every testing dataset is only one point.
Data Science: The Executive Summary – A Technical Book for Non–Technical Professionals 53
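▪ A minimal LOOCV sketch with scikit-learn (dataset and estimator are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# One fold per datapoint: 150 fits for the 150-sample iris dataset.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())
```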
Another reason to use cross-validation
▪ If test sets can provide unstable results because of sampling, the solution is to
systematically sample a certain number of test sets and then average the results.
▪ It is a statistical approach (to observe many results and take an average of them), and that’s
the basis of cross‐validation.
Mueller, John Paul, and Luca Massaron. Python for data science for dummies. John Wiley & Sons, 2019. 54
Interpreting cross-validation score
▪ When comparing scores from a cross-validation test vs. a simple train-test split, it is
important not to look at the number as a simple “this is lower than that, thus it must be
better!”
▪ The point is that it's a better estimation of how linear regression will perform on unseen
data. Using cross-validation is always recommended for a better estimate of the score.
Corey Wade, Hands-On Gradient Boosting with XGBoost and scikit-learn: Perform
accessible machine learning and extreme gradient boosting with Python
55
Cross-validation methods: #1
▪ k-fold cross-validation
▪ stratified k-fold cross-validation = keeps the ratio of labels in each fold constant.
▪ hold-out based validation
▪ leave-one-out cross-validation
▪ group k-fold cross-validation
▪ One flaw: if little data is available, then your validation and test sets may contain too few
samples to be statistically representative of the data at hand.
▪ How can you recognise this is the case?: if different random shuffling rounds of the data
before splitting end up yielding very different measures of model performance, then you’re
having this issue.
Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018. 57
Cross-validation methods: #2
▪ If you have a skewed dataset for binary classification with 90% positive samples and only 10%
negative samples, you don't want to use random k-fold cross-validation. Using simple k-fold
cross-validation for a dataset like this can result in folds with all negative samples. Then just
use stratified k-fold cross-validation which keeps the ratio of labels in each fold constant.
▪ If you do not have a lot of samples, you can use a simple rule like Sturges’ rule to calculate the
appropriate number of bins.
▪ What if we have a large amount of data? A 5-fold cross-validation would be too expensive. In
these cases, we can opt for a hold-out based validation. For a dataset which has 1 million
samples, we can create ten folds instead of 5 and keep one of those folds as hold-out. This
means we will have 100k samples in the hold-out, and we will always calculate loss, accuracy
and other metrics on this set and train on 900k samples.
Thakur, Abhishek. Approaching (almost) any machine learning
problem. Abhishek Thakur, 2020.
58
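▪ A sketch of the stratified split described above (the synthetic 90/10 labels are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(1000, 5)
y = np.array([1] * 900 + [0] * 100)   # 90% positive, 10% negative samples

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps roughly the same 90/10 label ratio.
    print("negatives in fold:", int((y[val_idx] == 0).sum()), "of", len(val_idx))
```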
Cross-validation methods: #3
▪ The procedure has a single parameter called k that refers to the number of groups that a
given data sample is to be split into. As such, the procedure is often called k-fold cross-
validation.
59
Cross-validation methods: #4
▪ There are a number of variations on the k-fold cross-validation procedure. Commonly used
variations are as follows:
▪ Train/Test Split: Taken to one extreme, k may be set to 2 such that a single train/test split is created
to evaluate the model.
▪ LOOCV: Taken to another extreme, k may be set to the total number of observations in the dataset
such that each observation is given a chance to be the held out of the dataset. This is called leave-
one-out cross-validation, or LOOCV for short.
▪ Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold
has the same proportion of observations with a given categorical value, such as the class outcome
value. This is called stratified cross-validation.
▪ Repeated: This is where the k-fold cross-validation procedure is repeated n times, where
importantly, the data sample is shuffled prior to each repetition, which results in a different split of
the sample.
60
Cross-validation methods: #5
▪ Use stratified cross-validation to enforce class distributions when there are a large number
of classes or an imbalance in instances for each class.
▪ Using a train/test split is good for speed when using a slow algorithm and produces
performance estimates with lower bias when using large datasets.
61
Cross-validation methods: #6
62
Cross-validation methods: #7
▪ Another approach to evaluating the model is as follows.
▪ Divide our data up into a training set and a test set: 80% in the training and 20% in the test.
Fit the model on the training set, then look at the mean squared error on the test set and
compare it to that on the training set.
▪ If the mean squared errors are approximately the same, then our model generalizes well
and we’re not in danger of overfitting.
O'Neil, Cathy, and Rachel Schutt. Doing data science: Straight talk from the frontline. " O'Reilly Media, Inc.", 2013.
63
Cross-validation methods: #6
▪ The holdout method is the most common method, which starts by dividing the dataset into
two partitions called train and test (80% and 20%, respectively). The train set is used for
training the model, and the test data tests its predictive power.
▪ The k-fold cross-validation technique is used to verify that the model is not over-fitted. The
dataset is randomly partitioned into k mutually exclusive subsets, where each partition has
equal size. One partition is for testing and the other partitions are for training. Iterate
through all of the k folds.
▪ In these cases we may want to make sure that all the strata are appropriately represented
in both the testing and training data.
▪ To do this we use “stratified sampling,” where each stratum is divided into training/testing
data, in the same proportions.
▪ stratified k-fold cross-validation = keeps the ratio of labels in each fold constant.
Data Science: The Executive Summary - A Technical Book for Non-Technical Professionals 65
Comparison of all possible of split the dataset: #1
First, we must understand the structure of our data. It has 100 randomly generated input
datapoints, 3 classes split unevenly across datapoints, and 10 “groups” split evenly across
datapoints.
As you can see, by default the KFold cross-validation iterator does not take either datapoint
class or group into consideration. We can change this by using the StratifiedKFold. In this case,
the cross-validation retained the same ratio of classes across each CV split.
https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-a 66
uto-examples-model-selection-plot-cv-indices-py
Comparison of all possible of split the dataset: #2
67
https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-a
uto-examples-model-selection-plot-cv-indices-py
Comparison of all possible of split the dataset: #3
68
https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-a
uto-examples-model-selection-plot-cv-indices-py
The truth about CV: #1
▪ In this work, we show that the estimand of CV is not the accuracy of the model fit on the
data at hand, but is instead the average accuracy over many hypothetical data sets.
Specifically, we show that the CV estimate of error has larger mean squared error (MSE)
when estimating the prediction error of the final model than when estimating the average
prediction error of models across many unseen data sets for the special case of linear
regression.
▪ Turning to confidence intervals for prediction error, we show that naïve intervals based on
CV can fail badly, giving coverage far below the nominal level; we provide a simple example
soon in Section 1.1.
Bates, Stephen, Trevor Hastie, and Robert Tibshirani. “Cross-validation: what does it estimate
and how well does it do it?” 69
The truth about CV: #2
▪ The source of this behaviour is the estimation of the variance used to compute the width of
the interval: it does not account for the correlation between the error estimates in different
folds, which arises because each data point is used for both training and testing. As a result,
the estimate of variance is too small and the intervals are too narrow.
Bates, Stephen, Trevor Hastie, and Robert Tibshirani. “Cross-validation: what does it estimate
and how well does it do it?” 70
Nested versus non-nested cross-validation: #1
▪ NON-NESTED = estimates the generalization error of the underlying model and its (hyper)parameter
search. Model selection without nested CV uses the same data to tune model parameters and evaluate
model performance. Information may thus “leak” into the model and overfit the data. The magnitude of
this effect is primarily dependent on the size of the dataset and the stability of the model. Choosing the
parameters that maximize non-nested CV biases the model to the dataset, yielding an overly-optimistic
score.
▪ NESTED = cross-validation (CV) is often used to train a model in which hyperparameters also need to be
optimized. To avoid this problem, nested CV effectively uses a series of train/validation/test set splits. In
the inner loop (here executed by GridSearchCV), the score is approximately maximized by fitting a model
to each training set, and then directly maximized in selecting (hyper)parameters over the validation set.
In the outer loop (here in cross_val_score), generalization error is estimated by averaging test set scores
over several dataset splits.
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-a
uto-examples-model-selection-plot-nested-cross-validation-iris-py 71
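▪ A condensed sketch of the pattern described above, in the spirit of the referenced scikit-learn example (the estimator and grid are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)  # tunes the hyperparameters
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)  # estimates generalisation error

clf = GridSearchCV(SVC(), param_grid, cv=inner_cv)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```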
Nested versus non-nested cross-validation: #2
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-a
uto-examples-model-selection-plot-nested-cross-validation-iris-py 72
Nested versus non-nested cross-validation: #3
▪ The k-fold cross-validation procedure is used to estimate the performance of machine
learning models. It is less biased than some other techniques, such as a single train-test split,
for small- to modestly-sized datasets.
▪ This procedure can be used both when optimizing the hyperparameters of a model on a
dataset, and when comparing and selecting a model for the dataset. When the same cross-
validation procedure and dataset are used to both tune and select a model, it is likely to lead
to an optimistically biased evaluation of the model performance. In other words, if the k-fold
cross-validation procedure is used both to select the model hyperparameters and to select
among the configured models, it can lead to overfitting.
▪ One approach to overcoming this bias is to nest the hyperparameter optimization procedure
under the model selection procedure. This is called double cross-validation or nested cross-
validation and is the preferred way to evaluate and compare tuned machine learning models
https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ 73
Nested versus non-nested cross-validation: #4
▪ Each time a model with different model hyperparameters is evaluated on a dataset, it provides
information about the dataset. Specifically, an often noisy model performance score. This
knowledge about the model on the dataset can be exploited in the model configuration procedure
to find the best performing configuration for the dataset. The k-fold cross-validation procedure
attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing
or overfitting of the model hyperparameters to the dataset will be performed. This is the normal
case for hyperparameter optimization.
▪ The problem is that if this score alone is used to then select a model, or the same dataset is used to
evaluate the tuned models, then the selection process will be biased by this inadvertent
overfitting. The result is an overly optimistic estimate of model performance that does not
generalize to new data.
https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ 74
Nested versus non-nested cross-validation: #5
▪ What Is the Cost of Nested Cross-Validation? To make this concrete, you might use k=5 for
the hyperparameter search and test 100 combinations of model hyperparameters. A
traditional hyperparameter search would, therefore, fit and evaluate 5 * 100 or 500
models. Nested cross-validation with k=10 folds in the outer loop would fit and evaluate
5,000 models. A 10x increase in this case.
▪ How Do You Set k? It is common to use k=10 for the outer loop and a smaller value of k for
the inner loop, such as k=3 or k=5.
https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ 75
How to interpret CV score values
▪ Using the mean cross-validation score we can conclude
that we expect the model to be around 96%
accurate on average.
▪ Looking at all five scores produced by the five-fold
cross-validation, we can also conclude that there is
a relatively high variance in the accuracy between
folds, ranging from 100% accuracy to 90% accuracy.
▪ This is quite a range, and it provides us with an idea
about how the model might perform in the worst
case and best case scenarios when applied to new
data.
▪ Further, this could imply that
1. the model is very dependent on the particular folds
used for training,
2. OR it could also just be a consequence of the small
size of the dataset.
▪ Another benefit of cross-validation as compared to using a single split of the data is that we
use our data more effectively.
▪ When using train_test_split, we usually use 75% of the data for training and 25% of the
data for evaluation.
▪ When using five-fold cross-validation, in each iteration we can use four-fifths of the data
(80%) to fit the model.
▪ When using 10-fold cross-validation, we can use nine-tenths of the data (90%) to fit the
model. More data will usually result in more accurate models.
▪ Issue? CV is rarely applicable to real-world problems, for all the reasons described in the
sections above. Cross-validation only works in the same cases where you can randomly
shuffle your data to choose a validation set.
▪ The link also explains how Kaggle works in terms of CV.
https://www.fast.ai/2017/11/13/validation-sets/ 78
Handling Imbalanced Datasets: #1
▪ In many practical situations, your labelled dataset will have the examples of some class
underrepresented.
▪ This is the case, for example, when your classifier has to distinguish between genuine and
fraudulent e-commerce transactions: the examples of genuine transactions are much more
frequent. If you use SVM with soft margin, you can define a cost for misclassified examples.
▪ Because noise is always present in the training data, there are high chances that many
examples of genuine transactions would end up on the wrong side of the decision
boundary by contributing to the cost.
80
Burkov, Andriy. The Hundred-Page Machine Learning Book. 2019.
Handling Imbalanced Datasets: #3
▪ Some solutions:
▪ Some SVM implementations (including SVC in scikit-learn) allow you to provide weights for every
class. The learning algorithm takes this information into account when looking for the best
hyperplane.
▪ If your learning algorithm doesn’t allow weighting classes, you can try to increase the importance of
examples of some class by making multiple copies of the examples of this class (this is called
oversampling).
An opposite approach is to randomly remove from the training set some examples of the majority
class (undersampling).
▪ You might also try to create synthetic examples by randomly sampling feature values of several
examples of the minority class and combining them to obtain a new example of that class. There are two
popular algorithms that oversample the minority class by creating synthetic examples: the synthetic
minority oversampling technique (SMOTE) and the adaptive synthetic sampling method (ADASYN).
81
Burkov, Andriy. The Hundred-Page Machine Learning Book. 2019.
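▪ As a sketch of the oversampling options above (this assumes the third-party imbalanced-learn package, which is not named on the slide):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# Naive oversampling: duplicate minority-class examples.
X_dup, y_dup = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE: create synthetic minority-class examples instead of exact copies.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_sm))
```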
k-fold validation [warning]
▪ Cross validation IS OFTEN NOT USED for evaluating deep learning models because of the
greater computational expense.
▪ For example k-fold cross validation is often used with 5 or 10 folds. As such, 5 or 10 models
must be constructed and evaluated, greatly adding to the evaluation time of a model.
▪ Nevertheless, when the problem is small enough or if you have sufficient compute
resources, k-fold cross validation can give you a less biased estimate of the performance of
your model.
82
Shuffling the training and validation set
▪ Shuffling the training data is important to prevent correlation between batches and
overfitting.
▪ On the other hand, the validation loss will be identical whether we shuffle the validation
set or not. Since shuffling takes extra time, it makes no sense to shuffle the validation data.
▪ To be clear, it makes no sense because it does not make a difference; the fact that it takes more
time is a secondary reason. You would not do it even if the action were super fast. It simply
does not make sense.
https://pytorch.org/tutorials/beginner/nn_tutorial.html#refactor-using-optim
83
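▪ In PyTorch terms this is just the shuffle flag on the DataLoader, along the lines of the linked tutorial (the tensors below are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
valid_ds = TensorDataset(torch.randn(200, 10), torch.randint(0, 2, (200,)))

train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)    # shuffle training batches
valid_dl = DataLoader(valid_ds, batch_size=128, shuffle=False)  # no point shuffling validation
```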
An alternative to CV
▪ Model selection is the problem of choosing one from among a set of candidate models.
▪ It is common to choose a model that performs the best on a hold-out test dataset or to
estimate model performance using a resampling technique, such as k-fold cross-validation.
▪ An alternative approach to model selection involves using probabilistic statistical measures
that attempt to quantify both the model performance on the training dataset and the
complexity of the model. Examples include the Akaike and Bayesian Information Criterion
and the Minimum Description Length.
▪ The benefit of these information criterion statistics is that they do not require a hold-out
test set, although a limitation is that they do not take the uncertainty of the models into
account and may end up selecting models that are too simple.
https://machinelearningmastery.com/probabilistic-model-selection-measures/ 84
Early stopping: #1
▪ If we have lots of data and a big model, it’s very expensive to keep re-training it with
different sized penalties on the weights or different architectures.
▪ It is much cheaper to start with very small weights and let them grow until the performance
on the validation set starts getting worse.
▪ But it can be hard to decide when performance is getting worse.
▪ The capacity of the model is limited because the weights have not had time to grow big.
▪ Smaller weights give the network less capacity. Why?
https://www.cs.toronto.edu/~hinton/coursera_slides.html
Early stopping: #3
▪ When the weights are very small, every hidden unit is in its linear range.
▪ So even with a large layer of hidden units it’s a linear model.
▪ It has no more capacity than a linear net in which the inputs are directly connected to the
outputs!
▪ As the weights grow, the hidden units start using their non-linear ranges so the capacity (ability
to fit a wide variety of functions) grows.
https://www.cs.toronto.edu/~hinton/coursera_slides.html
Early stopping: #4
▪ Generally, training has two tasks: one is to optimise your cost function and the other is to
avoid overfitting. These two tasks are different and therefore separate.
▪ However, when using early stopping it is impossible to separate the two tasks, and this is a
downside of the method.
88
Early stopping: #5
The model fitting is the same; we are just changing the way
we measure the performance! However, using different
criteria allows us to pick up the upward trend much more
easily, because not all of them show it clearly.
89
Early stopping: #6
▪ Apart from the overfitting reason
explained earlier, early stopping is useful
for another reason – it can also save training time (see the sketch below).
https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html#sphx-glr-auto-e 90
xamples-ensemble-plot-gradient-boosting-early-stopping-py
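▪ A sketch of the kind of early stopping the linked example demonstrates (parameter values are illustrative): with n_iter_no_change set, scikit-learn's gradient boosting stops adding trees once the score on an internal validation fraction stops improving.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *make_classification(n_samples=2000, random_state=0), random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=1000,          # upper bound on boosting rounds
    validation_fraction=0.2,    # internal hold-out used to monitor the score
    n_iter_no_change=10,        # stop if no improvement for 10 consecutive rounds
    random_state=0,
).fit(X_train, y_train)

print("rounds used:", gbc.n_estimators_, "test score:", gbc.score(X_test, y_test))
```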
Early stopping [overfitting]: #7
▪ Neural networks can get worse if you train them too much! The reality is that this is common in neural
networks.
▪ One way to view the weights of a neural network is as a high-dimensional shape. As you train, this shape
molds around the shape of the data, learning to distinguish one pattern from another.
▪ Unfortunately, if the testing dataset were slightly different from the patterns in the training dataset, the
network could fail on many of the testing examples. The network has overfitted the data, hence it did not
generalise well.
▪ A more official definition of a neural network that overfits is a neural network that has learned the noise
in the dataset instead of making decisions based only on the true signal.
▪ While practicing you’ll see that it is difficult to spot this leakage, and people simply do not
bother with it.
▪ There are cases where you simply cannot have all three sets and then you have to rely on
other validation methods such as a cross validation etc.
https://stats.stackexchange.com/questions/56421/is-it-ok-to-determine-earl 92
y-stopping-using-the-validation-set-in-10-fold-cross-v
TPOT = Tree-based Pipeline Optimization Tool: #1
▪ TPOT uses a Genetic Programming stochastic global search procedure to efficiently discover
a top-performing model pipeline for a given dataset.
▪ TPOT will automate the most tedious part of machine learning by intelligently exploring
thousands of possible pipelines to find the best one for your data.
▪ AutoML algorithms aren't as simple as fitting one model on the dataset; they are
considering multiple machine learning algorithms (random forests, linear models, SVMs,
etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA,
feature selection, etc.), the hyperparameters for all of the models and preprocessing steps,
as well as multiple ways to ensemble or stack the algorithms within the pipeline.
https://epistasislab.github.io/tpot/ Olson, Randal S., et al. "Evaluation of a tree-based pipeline optimization tool for automating data 93
science." Proceedings of the Genetic and Evolutionary Computation Conference 2016. 2016.
TPOT = Tree-based Pipeline Optimization Tool: #2
https://epistasislab.github.io/tpot/ 94
TPOT other references
▪ https://www.kdnuggets.com/2021/05/machine-learning-pipeline-opti
mization-tpot.html
95
Automated machine learning tools
▪ TPOT
▪ Auto-Sklearn
▪ Auto-Weka
▪ Machine-JS
▪ DataRobot
▪ FLAML - Fast and Lightweight AutoML
96
Proper definition of bias & variance: #1
▪ Bias:
▪ It is a learner’s tendency to consistently learn the same wrong thing.
▪ Underfit = “high bias”
▪ Simple algorithms have high bias; having few internal parameters.
▪ A model has high bias if it is not able to fully use the information in the data. It is
too reliant on general information, such as the most frequent case,
the mean of the response, or a few powerful features.
▪ Wrong assumptions when training
▪ Bias tells you how far away we are from the target.
▪ Variance:
▪ The tendency to learn random things irrespective of the real signal.
▪ overfit = “high variance”
▪ You can think of variance as a problem connected to memorization.
▪ A model has high variance if it is using too much information from the data.
▪ Model performs well on training but poorly on the test set
▪ Sensitive to fluctuations when training
▪ Variance tells you the spread of the predictions
Simultaneously avoiding both requires learning a perfect classifier/regressor, and short of
knowing it in advance there is no single technique that will always do best (no free lunch).
Domingos, Pedro. “A few useful things to know about machine learning.”
Communications of the ACM 55.10 (2012): 78-87.
97
Proper definition of bias & variance: #2
Figure annotations: “Underfitting area”; “Does it even happen?”
https://stats.stackexchange.com/questions/204489/discussion-about-overfit-in-xgboost 99
Proper definition of bias & variance: #4
100
▪ Bias: expected deviation from the true value.
▪ Variance: deviation from the expected estimator.
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 101
Proper definition of bias & variance
102
High bias and variance: #0
Overfitting: if we have too many features, the learned hypothesis may fit the
training set very well (cost function almost zero), but fail to generalise (this is the
key term) to new examples, i.e., to predict prices on new examples.
103
High bias and variance: #1
104
High bias and variance: #1
▪ The bias–variance dilemma is the conflict in trying to simultaneously minimize these two
sources of error that prevent supervised learning algorithms from generalizing beyond their
training set:
▪ The bias error is an error from erroneous assumptions in the learning algorithm. High
bias can cause an algorithm to miss the relevant relations between features and target
outputs (underfitting).
▪ The variance is an error from sensitivity to small fluctuations in the training set. High
variance can cause an algorithm to model the random noise in the training data, rather
than the intended outputs (overfitting).
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
105
High bias and variance: #1.1
106
High bias and variance: #1.2
http://www.deeplearningbook.org/contents/ml.html 110
High bias and variance: #1.5
▪ It makes accurate predictions for examples in the training set, but it does not
generalise well to make accurate predictions on new, unseen examples.
▪ It's almost as though our network is merely memorizing the training set, without
understanding digits well enough to generalize to the test set.
112
High bias and variance: #3
113
High bias and variance: #4 [solution to overfitting]
▪ 2) Regularization
▪ Keep all the features, but reduce the magnitude of the model parameters.
▪ Regularization works well when we have a lot of slightly useful features
114
High bias and variance: #5 [solution to overfitting]
115
High bias and variance: #5.1 [solution to overfitting]
▪ Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other
things being equal.
▪ Large weights will only be allowed if they considerably improve the first part of the cost function.
▪ Put another way, regularization can be viewed as a way of compromising between finding small weights and
minimizing the original cost function.
▪ The relative importance of the two elements of the compromise depends on the value of λ: when λ is small we
prefer to minimize the original cost function, but when λ is large we prefer small weights.
▪ So regularisation helps the model generalise better via a reduction in the likelihood of overfitting. What overfitting is
really doing is learning the noise in the data. We do not want that, but we could also use it to learn the noise in
the data! At least we know where it is.
http://neuralnetworksanddeeplearning.com/chap3.html
116
Train/Val accuracy
https://cs231n.github.io/neural-networks-3/ 117
Bayes optimal error (a theoretical quantity)
VanderPlas, Jake. Python data science handbook: Essential tools for working with data. " O'Reilly Media, Inc.", 2016. 119
▪ In high dimensions, we cannot draw decision curves to inspect bias-variance. Instead, we
calculate error values to infer the source of errors on the training set, as well as on the
validation set.
▪ The Bayes error is the minimum error for any classifier we could obtain with a perfect
model stripped of all avoidable bias and variance.
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 120
▪ We can analyze and compare the errors on the training and validation sets in order to
deduce the cause of the error, whether it is due to avoidable bias or avoidable variance (it
is unlikely that we ever obtain the Bayes error rate)
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 121
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 122
Why is the baseline performance important?
▪ Create a very simple model that beats the baseline score. For instance, in classification
tasks, the baseline accuracy should be higher than 0.5 (better than random) and our simple
model should be able to beat this score.
▪ If we are not able to beat the baseline score, then maybe the input data does not hold the
information required to make the prediction.
▪ Given infinite data, your model would be exposed to every possible aspect of the data
distribution at hand: you would never overfit.
Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018.
125
How can you avoid overfitting?
▪ Cross-validation: Cross-validation is a technique used to assess how well a model performs on a new
independent dataset. The simplest example of cross-validation is when you split your data into two
groups: training data and testing data, where you use the training data to build the model and the testing
data to test the model.
▪ Regularization: Overfitting occurs when models have higher degree polynomials. Thus, regularization
reduces overfitting by penalizing higher degree polynomials.
▪ Reduce the number of features: You can also reduce overfitting by simply reducing the number of input
features. You can do this by manually removing features, or you can use a technique, called Principal
Component Analysis, which projects higher dimensional data (eg. 3 dimensions) to a smaller space (eg. 2
dimensions).
▪ Ensemble Learning Techniques: Ensemble techniques take many weak learners and converts them into a
strong learner through bagging and boosting. Through bagging and boosting, these techniques tend to
overfit less than their alternative counterparts.
https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e
When do we want to have overfitting?
▪ After you have a simple implementation of your model, try to overfit a small amount of
training data and run evaluation on the same data to make sure that it gets to the smallest
possible loss.
▪ If it can't overfit a small amount of data, there's something wrong with your
implementation.
https://huyenchip.com/machine-learning-systems-design/design-a-machine-learning-system.html
127
Misconception about overfitting
▪ For example, in a classification setting, overfitting can make the learner draw a
capricious frontier to keep noisy examples on what it thinks is the right side.
Domingos, Pedro. "A few useful things to know about machine learning."
128
Communications of the ACM 55.10 (2012): 78-87.
Overfitting: some observations
▪ If you do not have much data, you should use a simple model, because a complex one will
overfit.
▪ This is true.
▪ But only if you assume that fitting a model means choosing a single best setting of the
parameters.
▪ If you use the full posterior distribution over parameter settings, overfitting disappears.
▪ When there is very little data, you get very vague predictions because many different
parameter settings have significant posterior probability.
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec10.pdf 129
Alternative view of overfitting
▪ Deep learning’s greatest weakness: Overfitting
▪ Error is shared among all the weights. If a particular configuration of weights accidentally
creates perfect correlation between the prediction and the output dataset (such that error
== 0) without giving the heaviest weight to the best inputs, the neural network will stop
learning. Does it mean we should be avoiding low error? No, but explaining why is a bit
complicated!
▪ As long as you don’t train exclusively on the first example, the rest of the training examples
will help the network avoid getting stuck in these edge-case configurations that exist for
any one training example. This also makes the case for having a large amount of data.
131
Trask, Andrew W. "Deep learning." (2019).
Learning curves: #0
▪ Definition: learning curves are plots that show model performance as a function of dataset size.
▪ Train Learning Curve: Learning curve calculated from the training dataset that gives an idea of
how well the model is learning.
▪ Validation (I prefer the name test) Learning Curve: Learning curve calculated from a hold-out
validation dataset that gives an idea of how well the model is generalizing.
▪ In some cases, it is also common to create learning curves for multiple metrics:
▪ Optimization Learning Curves: Learning curves calculated on the metric by which the
parameters of the model are being optimized, e.g. loss. This is justified from a numerical point
of view.
▪ Performance Learning Curves: Learning curves calculated on the metric by which the model
will be evaluated and selected, e.g. accuracy. This is justified from an interpretability point of
view.
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/ 132
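▪ A sketch of how such curves are produced with scikit-learn (estimator and training sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

# Mean score on the training folds vs. on the held-out folds at each size.
print(train_sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```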
Learning curves: #1 [ideal]
▪ The performance of training samples
should be above what is desired.
Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 133
Learning curves: #2 [overfitting]
Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 134
Learning curves: #3 [underfitting]
Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 135
Learning curves: #4
▪ The training error tends to decrease whenever we
increase the model complexity, that is, whenever we
fit the data harder.
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/ 138
Learning curves: #7
▪ A good fit is identified by a training and validation loss
that decreases to a point of stability with a minimal gap
between the two final loss values.
▪ The loss of the model will almost always be lower on
the training dataset than the validation dataset. This
means that we should expect some gap between the
train and validation loss learning curves. This gap is
referred to as the “generalization gap.”
▪ A plot of learning curves shows a good fit if:
▪ The plot of training loss decreases to a point of stability.
▪ The plot of validation loss decreases to a point of stability
and has a small gap with the training loss.
▪ Continued training of a good fit will likely lead to an
overfit.
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/ 139
Learning curves: #8
▪ The main indicator of a bias problem is a
high validation error. But how can we establish
whether the MSE is high or low? Domain
knowledge is necessary here. First of all, keep
in mind that MSE squares the units as well.
So if we have an MSE of 20 MW^2, to
make an engineering judgement we need to
take the square root, hence roughly 4.5 MW.
▪ But is it a low bias problem or a high bias
problem? To find the answer, we need to
look at the training error.
https://www.dataquest.io/blog/learning-curves-machine-learning/ 140
Learning curves: #9
▪ Estimating variance can be done in at least two ways:
▪ By examining the gap between the validation
learning curve and training learning curve.
▪ By examining the training error: its value and its
evolution as the training set sizes increase.
141
https://www.dataquest.io/blog/learning-curves-machine-learning/
Learning curves: #10
https://www.dataquest.io/blog/learning-curves-machine-learning/ 142
Learning curves: #11
▪ In our case, the training MSE plateaus at around 20, and we’ve already
concluded that’s a high value. So besides the narrow gap, we now have
another confirmation that we have a low variance problem. So far, we can
conclude that:
▪ Our learning algorithm suffers from high bias and low variance,
underfitting the training data.
▪ Adding more instances (rows) to the training data is highly unlikely to
lead to better models under the current learning algorithm.
▪ One solution at this point is to change to a more complex learning algorithm.
This should decrease the bias and increase the variance. A mistake would be
to try to increase the number of training instances. Generally, these other
two fixes also work when dealing with a high bias and low variance problem:
▪ Training the current learning algorithm on more features (to avoid
collecting new data, you can easily generate polynomial features, as in the
sketch after this slide). This should lower the bias by increasing the model’s complexity.
▪ Decreasing the regularization of the current learning algorithm, if
regularization is being used. In a nutshell, regularization prevents the algorithm
from fitting the training data too well. If we decrease regularization, the model
will fit the training data better and, as a consequence, the variance will
increase and the bias will decrease.
https://www.dataquest.io/blog/learning-curves-machine-learning/ 143
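A brief sketch of the "more features" fix mentioned above (illustrative synthetic data; the gains depend entirely on the real dataset): polynomial features are generated and regularization is reduced, and both pipelines are compared by cross-validated MSE.

```python
# Hypothetical illustration of lowering bias by adding polynomial features
# and weakening regularization; alphas and degree are arbitrary choices.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=0)

simple = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
richer = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), Ridge(alpha=1.0))

for name, model in [("simple / strong reg.", simple),
                    ("poly features / weaker reg.", richer)]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.1f}")
```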
Learning curves: #12
▪ When we build a model to map the relationship between the features X and the target Y, we assume that there is such a
relationship in the first place. Provided the assumption is true, there is a true model f that describes perfectly the relationship
between X and Y, like so:
Y = f(X) + irreducible error
▪ But why is there an error?! Haven’t we just said that f describes the relationship between X and Y perfectly? There’s an error
there because Y is not only a function of our limited number of features X. There could be many other features that influence
the value of Y, features we don’t have. It might also be the case that X contains measurement errors. So, besides X, Y is also a
function of the irreducible error. Now let’s explain why this error is irreducible. When we estimate f with a model f̂, we
introduce another kind of error, called reducible error:
error(Y, f̂(X)) = reducible error + irreducible error
▪ Error that is reducible can be reduced by building better models; the irreducible error, however, will always be there and,
what is more, its value is almost never known.
https://www.dataquest.io/blog/learning-curves-machine-learning/
144
Learning curves: #13
training (solid line) and validation (dotted line)
Stevens, Eli, Luca Antiga, and Thomas Viehmann. Deep learning with PyTorch. Manning Publications Company, 2020. 145
Learning curves: #14
training (solid line) and validation (dotted line)
Stevens, Eli, Luca Antiga, and Thomas Viehmann. Deep learning with PyTorch. Manning Publications Company, 2020. 146
Learning curves: #15
[Figure: decomposition of the test error across the training set, validation set, and test set]
http://josh-tobin.com/assets/pdf/troubleshooting-deep-neural-networks-01-19.pdf
147
[Figure: strategies for addressing under-fitting (i.e., reducing bias) vs. addressing over-fitting (i.e., reducing variance)]
http://josh-tobin.com/assets/pdf/troubleshooting-deep-neural-networks-01-19.pdf
148
Diagnosing Unrepresentative Datasets: #1
▪ Learning curves can also be used to diagnose properties of a dataset and whether it is
relatively representative.
▪ There are two common cases that could be observed; they are:
▪ Training dataset is relatively unrepresentative.
▪ Validation dataset is relatively unrepresentative.
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
149
Diagnosing Unrepresentative Datasets: #2
▪ Unrepresentative Train Dataset: This situation can be
identified by a learning curve for training loss that shows
improvement and similarly a learning curve for validation loss
that shows improvement, but a large gap remains between
both curves.
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/ 150
Bias/variance tradeoff and MSE: #1
▪ Let’s assume we have a regression model and we want to measure the error via the MSE:
MSE = E[(y − ŷ)^2]
▪ where E denotes the expectation. This error can be decomposed into bias and variance
components:
MSE = Bias(ŷ)^2 + Var(ŷ) + irreducible error
151
Bias/variance tradeoff and MSE: #2
▪ The bias measures the error of the estimation, and the variance describes how much the
estimation moves around its mean.
▪ The more complex the learning model is and the larger the training size, the lower the bias
will be.
▪ We use cross-validation to find an optimal balance between bias and variance while
diminishing overfitting (a rough numerical sketch of the decomposition follows this slide).
Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 152
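To make the decomposition above concrete, here is a rough numerical sketch (an assumed toy example, not from the cited book): the same polynomial-fitting "learning algorithm" is trained on many resampled training sets, and the squared bias and the variance of its predictions are estimated at fixed test points.

```python
# Toy bias/variance estimate: refit a degree-3 polynomial on many noisy
# training sets and measure squared bias and variance of its predictions.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 50)
degree, n_repeats, n_train, noise = 3, 200, 30, 0.3

preds = np.empty((n_repeats, x_test.size))
for r in range(n_repeats):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = true_f(x_tr) + rng.normal(0, noise, n_train)
    coefs = np.polyfit(x_tr, y_tr, degree)      # the "learning algorithm"
    preds[r] = np.polyval(coefs, x_test)

avg_pred = preds.mean(axis=0)
bias_sq = np.mean((avg_pred - true_f(x_test)) ** 2)   # (E[f_hat] - f)^2
variance = preds.var(axis=0).mean()                   # E[(f_hat - E[f_hat])^2]
print(f"bias^2 ≈ {bias_sq:.4f}, variance ≈ {variance:.4f}, noise^2 = {noise**2:.4f}")
```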
Solutions for reducing overfitting
▪ Overfitting: training a model in such a way that it remembers specific features of the training
input data rather than generalizing well to data not seen during training. Common fixes
(combined in the sketch after this slide) include:
▪ L2 regularisation
▪ Dropout
▪ Data augmentation
153
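As a loose illustration of how these three fixes appear in code, here is a hedged PyTorch sketch; the layer sizes, dropout rate, and weight-decay value are arbitrary placeholders rather than recommendations.

```python
# Hedged sketch: L2 regularisation (weight decay), dropout, and data
# augmentation in PyTorch. All hyperparameter values are placeholders.
import torch
import torch.nn as nn
from torchvision import transforms

# Data augmentation: random transforms applied only to the training pipeline.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256),
    nn.ReLU(),
    nn.Dropout(p=0.2),          # dropout between layers
    nn.Linear(256, 10),
)

# L2 regularisation via the optimizer's weight_decay term.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```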
Bias-variance tradeoff: #1
156
Bias, variance & noise
▪ Every estimator has its advantages and drawbacks. Its generalization error can be
decomposed in terms of bias, variance and noise:
▪ The bias of an estimator is its average error for different training sets.
▪ The variance of an estimator indicates how sensitive it is to varying training sets.
▪ Noise is a property of the data itself.
https://scikit-learn.org/stable/modules/learning_curve.html 157
Ways to reduce overfitting
▪ Weight-decay
▪ Weight-sharing
▪ Early stopping
▪ Model averaging
▪ Bayesian fitting of neural nets
▪ Dropout
▪ Generative pre-training
https://www.cs.toronto.edu/~hinton/coursera_slides.html
Underfitting vs. overfitting [network layout]
159
Recognising overfitting
160
Definition of regularisation
▪ Regularisation is a subset of methods used to encourage generalization in learned models,
often by increasing the difficulty for a model to learn the fine-grained details of training data.
▪ The simplest regularization: Early stopping. How do you know when to stop? The only real
way to know is to run the model on data that isn’t in the training dataset. This is typically
done using a second test dataset called a validation set. In some circumstances, if you used
the test set for knowing when to stop, you could overfit to the test set. As a general rule, you
don’t use it to control training. You use a validation set instead.
▪ Industry standard regularization: Dropout. Dropout makes a big network act like a little one
by randomly training little subsections of the network at a time, and little networks don’t
overfit.
▪ Smaller weights mean that the behaviour of the network will not change much if we change a few
random inputs here and there.
▪ That makes it difficult for a regularized network to learn the effects of local noise in the data.
▪ Instead, a regularized network learns to respond to types of evidence which are seen often across
the training set and it becomes resistant to learning peculiarities of the noise in the training data.
▪ The hope is that this will force our networks to do real learning about the phenomenon at hand,
and to generalize better from what they learn.
http://neuralnetworksanddeeplearning.com/chap3.html
162
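Relating to the early-stopping bullet above, a minimal early-stopping loop might look like the following sketch (toy data and model, purely illustrative): training stops once the validation loss has not improved for a set number of epochs, and the best checkpoint is restored.

```python
# Minimal early-stopping sketch in PyTorch with toy synthetic data.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(600, 10)
y = X @ torch.randn(10, 1) + 0.5 * torch.randn(600, 1)
X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

best_val, best_state, patience, bad = float("inf"), None, 5, 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_tr), y_tr).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val = loss_fn(model(X_val), y_val).item()

    if val < best_val:                       # validation improved: keep a checkpoint
        best_val, best_state, bad = val, copy.deepcopy(model.state_dict()), 0
    else:
        bad += 1
        if bad >= patience:                  # no improvement for `patience` epochs
            break

model.load_state_dict(best_state)            # keep the best model, not the last one
```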
What does regularisation do in practice?
▪ As the magnitude of the model parameters increases, the penalty increases as well.
▪ Note that regularization hurts training set performance! This is because it limits the ability
of the network to overfit to the training set. But since it ultimately gives better test
accuracy, it is helping your system.
163
Lasso & Ridge L^n regularisation
▪ The regularization also acts on overcorrelated features — smoothing and combining their
contribution, thus stabilizing the results and reducing the consequent variance of the
estimates:
▪ L1 (also called Lasso): Shrinks some coefficients to zero, making your coefficients
sparse. This means some features are entirely ignored by the model. This can be seen
as a form of automatic feature selection. Having some coefficients be exactly zero often
makes a model easier to interpret, and can reveal the most important features of your
model.
▪ L2 (also called Ridge): Reduces the coefficients of the most problematic features,
making them smaller, but never equal to zero. All coefficients keep participating in the
estimate, but many become small and irrelevant.
Mueller, John Paul, and Luca Massaron. Python for Data Science for Dummies. John Wiley & Sons, 2019.
Guido, Sarah, and Andreas Müller. Introduction to Machine Learning with Python. Vol. 282. O'Reilly Media, 2016. 164
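The difference is easy to see empirically; the following sketch (an illustrative example, not taken from the cited books) fits Lasso and Ridge on the same synthetic data and counts how many coefficients are exactly zero.

```python
# Illustrative comparison: L1 drives some coefficients to exactly zero,
# L2 only shrinks them. Dataset and alpha values are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0), "of", X.shape[1])
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0), "of", X.shape[1])
```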
Lasso & Ridge L^n regularisation
▪ L2 and L1 penalize weights differently:
▪ L2 penalizes weight^2.
▪ L1 penalizes |weight|.
▪ Consequently, L2 and L1 have different derivatives:
▪ The derivative of L2 is 2 * weight.
▪ The derivative of L1 is k (a constant, whose value is independent of weight).
▪ You can think of the derivative of L2 as a force that removes x% of the weight every time. At any rate, L2 does not normally
drive weights to zero.
▪ You can think of the derivative of L1 as a force that subtracts some constant from the weight every time. However, thanks
to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out. For
example, if subtraction would have forced a weight from +0.1 to -0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed
out the weight. L1 regularization (penalizing the absolute value of all the weights) turns out to be quite efficient for wide
models.
https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization 165
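The following toy sketch (made-up step sizes, purely illustrative) mimics the argument above: the L2 update removes a fraction of the weight each time and never reaches zero, while the L1 update subtracts a constant and clamps to exactly zero when the weight would cross it.

```python
# Toy numeric illustration of the derivative argument for L2 vs. L1 penalties.
def l2_step(w, lam=0.1):
    return w - lam * 2 * w                       # gradient of lam * w^2 is 2*lam*w

def l1_step(w, lam=0.3):
    new_w = w - lam * (1 if w > 0 else -1)       # gradient of lam*|w| is lam*sign(w)
    return 0.0 if new_w * w <= 0 else new_w      # crossing zero -> clamp to exactly 0

w2, w1 = 0.1, 0.1
for _ in range(3):
    w2, w1 = l2_step(w2), l1_step(w1)
    print(f"L2 weight: {w2:+.4f}   L1 weight: {w1:+.4f}")
```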
LASSO vs. Regression: #1
▪ Variations on ordinary linear regression can help address some problems that come up when
working with real data:
▪ LASSO helps when you have too many predictors by favouring weights of zero.
▪ Ridge regression can help with reducing the variance of your weights and predictions
by shrinking the weights toward 0. Also known as L2 regularization, and sometimes
called Tikhonov regularization.
https://da5030.weebly.com/uploads/8/6/5/9/8659576/120_interview_questions.pdf
166
VanderPlas, Jake. Python data science handbook: Essential tools
for working with data. " O'Reilly Media, Inc.", 2016.
LASSO vs. Regression: #2
▪ The α parameter is essentially a knob controlling the complexity of the resulting model. In
the limit α -> 0, we recover the standard linear regression result; in the limit α -> oo, all
model responses will be suppressed. One advantage of ridge regression in particular is that
it can be computed very efficiently, at hardly more computational cost than the original
linear regression model.
167
RIDGE vs. LASSO in practice
▪ In practice, ridge regression is usually the first choice between these two models.
▪ However, if you have a large amount of features and expect only a few of them to be
important, Lasso might be a better choice.
▪ ElasticNet combines the penalties of Lasso and Ridge. In practice, this combination works
best, though at the price of having two parameters to adjust: one for the L1 regularization,
and one for the L2 regularization.
▪ This makes it harder to extract real-world meaning from the model. It is also an insidious
form of overfitting, which is begging to have the model generalize poorly.
▪ In lasso regression p(x) has the same functional form as 𝜎(w ⋅ x + b), where we assign every
feature a weight and then plug their weighted sum into a logistic function. However, we
train it in a different way that penalizes modest-sized weights.
Data Science: The Executive Summary - A Technical Book for Non-Technical Professionals 169
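A short sketch of ElasticNet in scikit-learn (values purely illustrative): l1_ratio mixes the L1 and L2 penalties, and ElasticNetCV tunes both knobs by cross-validation.

```python
# Illustrative ElasticNet example: two knobs (alpha and l1_ratio) tuned by CV.
from sklearn.linear_model import ElasticNetCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=40, n_informative=8,
                       noise=5.0, random_state=0)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], alphas=[0.1, 1.0, 10.0], cv=5)
enet.fit(X, y)
print("chosen alpha:", enet.alpha_, "chosen l1_ratio:", enet.l1_ratio_)
```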
Another important observation on RIDGE
▪ The lesson here is that with enough
training data, regularization becomes less
important, and given enough data, ridge
and linear regression will have the same
performance (the fact that this happens
here when using the full dataset is just by
chance).
Weight penalties vs. weight constraints
▪ We usually penalize the squared value of each weight separately.
▪ Instead, we can put a constraint on the maximum squared length of the incoming weight
vector of each unit.
▪ If an update violates this constraint, we scale down the vector of incoming weights to
the allowed length.
▪ Weight constraints have several advantages over weight penalties:
▪ It’s easier to set a sensible value.
▪ They prevent hidden units getting stuck near zero.
▪ They prevent weights exploding.
▪ When a unit hits its limit, the effective penalty on all of its weights is determined by the
big gradients. This is more effective than a fixed penalty at pushing irrelevant weights
towards zero.
https://www.cs.toronto.edu/~hinton/coursera_slides.html
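A small PyTorch sketch of such a constraint (assuming the convention that each row of a Linear layer's weight matrix holds one unit's incoming weights; the max-norm value is arbitrary): after each update, any incoming weight vector longer than the allowed length is scaled back down.

```python
# Max-norm constraint sketch: rescale each unit's incoming weight vector
# whenever its norm exceeds the allowed maximum.
import torch
import torch.nn as nn

def apply_max_norm(layer: nn.Linear, max_norm: float = 3.0) -> None:
    with torch.no_grad():
        norms = layer.weight.norm(dim=1, keepdim=True)    # incoming-vector norms
        scale = (max_norm / norms).clamp(max=1.0)         # only shrink, never grow
        layer.weight.mul_(scale)                          # rescale violating units

layer = nn.Linear(100, 10)
layer.weight.data *= 10                                   # pretend the weights exploded
apply_max_norm(layer, max_norm=3.0)
print(layer.weight.norm(dim=1))                           # all norms now <= 3.0
```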
Don’t let the regularization overwhelm the data
▪ It is often the case that a loss function is a sum of the data loss and the regularization loss
(e.g. L2 penalty on weights).
▪ One danger to be aware of is that the regularization loss may overwhelm the data loss, in
which case the gradients will be primarily coming from the regularization term (which
usually has a much simpler gradient expression).
▪ Therefore, it is recommended to turn off regularization and check the data loss alone first,
and then check the regularization term second and independently (see the sketch after this slide).
https://cs231n.github.io/neural-networks-3/ 172
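A hedged PyTorch sketch of that sanity check: the data loss and the L2 regularisation loss are computed as separate terms, so either can be inspected (or switched off) independently before they are summed. The model, data, and λ value are placeholders.

```python
# Sketch: keep the data loss and the regularisation loss as separate terms
# so each can be checked on its own.
import torch
import torch.nn as nn

model = nn.Linear(20, 2)
x, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
reg_lambda = 1e-2                      # set to 0.0 to check the data loss alone

data_loss = nn.functional.cross_entropy(model(x), y)
reg_loss = reg_lambda * sum((p ** 2).sum() for p in model.parameters())

print(f"data loss = {data_loss.item():.4f}, reg loss = {reg_loss.item():.4f}")
total_loss = data_loss + reg_loss      # the term you would backpropagate
```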
Why don’t we regularise the bias?
▪ In theory: it would be easy to modify the regularization procedure to regularize the biases.
▪ Empirically, doing this often doesn't change the results very much, so to some extent it's
merely a convention whether to regularize the biases or not.
▪ In fact, it's worth noting that having a large bias doesn't make a neuron sensitive to its
inputs in the same way as having large weights.
http://neuralnetworksanddeeplearning.com/chap3.html 173
Other techniques for regularization
▪ Other techniques are:
▪ L2 - has the intuitive interpretation of heavily penalizing peaky
weight vectors and preferring diffuse weight vectors. Due to
multiplicative interactions between weights and inputs this has the
appealing property of encouraging the network to use all of its
inputs a little rather than some of its inputs a lot.
▪ L1 - neurons with L1 regularization end up using only a sparse
subset of their most important inputs and become nearly invariant
to the “noisy” inputs. L2 regularization can be expected to give
superior performance over L1.
▪ Dropout: we modify the network itself by randomly dropping
nodes in each layer.
▪ Artificially expanding the training data = data augmentation
http://neuralnetworksanddeeplearning.com/chap3.html#other_techniques_for_regularization 174
Dropout: #0
▪ Neural networks are extremely flexible models due to their large number of parameters,
which is beneficial for learning, but also highly redundant.
▪ This makes it possible to compress (via dropout) neural networks without having a drastic
effect on performance.
▪ However adding dropout brings other issues: it can be added between different layers, and
finding the best place is usually done through experimentation. The percentage of dropout
to be added is also tricky, as it is purely dependent on the problem statement we are trying
to solve. It is often good practice to start with a small number such as 0.2.
▪ Unlike L1 and L2 regularization, dropout doesn't rely on modifying the cost function. Instead,
in dropout we modify the network itself.
▪ One issue is that the cost function J is less well defined, or at least harder to calculate, so
you lose the debugging tool of plotting it against the number of iterations. What you can do
is first turn off dropout, run the code and make sure that J is monotonically decreasing, and
then turn dropout back on.
176
Dropout: #2
▪ Generally use a small dropout value of 20%-50% of neurons with 20% providing a good starting point. A probability too
low has minimal effect and a value too high results in under-learning by the network.
▪ Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model
more of an opportunity to learn independent representations.
▪ Use dropout on input (visible) as well as hidden layers. Application of dropout at each layer of the network has shown
good results.
▪ Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a
high momentum value of 0.9 or 0.99.
▪ Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint
on the size of network weights such as max-norm regularization with a size of 4 or 5 has been shown to improve results.
177
Dropout: #3
▪ It is a regularisation technique to combat overfitting. It is employed at the training time
only and eliminates some units randomly. By doing so the knowledge is spread across all of
the neurons to help obtain a more robust network. Remember that at testing time all the
neuros a re present.
▪ Even thought dropout is itself a regularisation technique it can still be used in conjunction
with other techniques such as L2.
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 178
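A minimal PyTorch illustration of this point: nn.Dropout zeroes units only in training mode and is automatically disabled by model.eval(), so all neurons are present at test time.

```python
# Dropout is active only in training mode; eval() restores all neurons.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()
print(layer(x))   # roughly half the units zeroed, survivors scaled by 1/(1-p)

layer.eval()
print(layer(x))   # identity: all neurons present at test time
```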
Dropout: #4
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 179
Dropout: #5
▪ With normal dropout at test time, you have to scale activations by the dropout rate p
because you are not dropping out any of the neurons, so you need to match the expected
value seen at training.
▪ With inverted dropout, scaling is applied at training time, but inversely. First, drop out
activations with dropout factor p, and second, scale them by the inverse dropout factor 1/p.
Inverted dropout has the advantage that you don’t have to do anything at test time, which
makes inference faster.
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 180
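A small NumPy sketch of inverted dropout (the keep probability is an arbitrary choice): the surviving activations are scaled by the inverse of the keep probability at training time, so the test-time forward pass needs no change.

```python
# Inverted dropout sketch: scale survivors at training time, do nothing at test time.
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(activations, keep_p=0.8, training=True):
    if not training:
        return activations                          # nothing to do at test time
    mask = rng.random(activations.shape) < keep_p   # keep each unit with prob keep_p
    return activations * mask / keep_p              # scale survivors by 1/keep_p

a = np.ones((2, 5))
print(inverted_dropout(a, training=True))    # some zeros, survivors scaled to 1.25
print(inverted_dropout(a, training=False))   # unchanged
```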
Dropout: #6
▪ It turns out that the smaller a neural network is, the less it’s able to overfit. Why? Well,
small neural networks don’t have much expressive power. They can’t latch on to the more
granular details (noise) that tend to be the source of overfitting. They have room to capture
only the big, obvious, high-level features.
▪ How do you get the power of a large neural network with the resistance to overfitting of
the small neural network? Take the big neural network and turn off nodes randomly. What
happens when you take a big neural network and use only a small part of it? It behaves like
a small neural network. But when you do this randomly over potentially millions of
different subnetworks, the sum total of the entire network still maintains its expressive
power.
▪ Why don’t they overfit to the same noise? Because they start randomly, and they stop training once they’ve
learned enough noise to disambiguate between all the images in the training set. The MNIST network needs
to find only a handful of random pixels that happen to correlate with the output labels, to overfit. But this is
contrasted with, perhaps, an even more important concept:
Neural networks, even though they’re randomly generated, still start by learning the biggest, most broadly
sweeping features before learning much about the noise.
▪ The takeaway is this: if you train 100 neural networks (all initialized randomly), they will each tend to latch
onto different noise but similar broad signal. Thus, when they make mistakes, they will often make differing
mistakes. If you allowed them to vote equally, their noise would tend to cancel out, revealing only what they
all learned in common: the signal.
https://www.kaggle.com/ryanholbrook/dropout-and-batch-normalization 183
Dropout and co-adaptation
▪ It is common for two or more neurons to begin to detect the same feature repeatedly. This
is called co-adaptation. It implies the DNN is not utilizing its full capacity efficiently, in
effect wasting computational resources.
▪ Dropout discourages co-adaptations of hidden neurons by dropping out a fixed fraction of
the activation of the neurons during the feed forward phase of training.
▪ Interesting fact: Dropout can also be applied to inputs. In this case, the algorithm randomly
ignores a fixed proportion of input attributes.
▪ One of life’s lessons is that dropping out is not necessarily ruinous to future performance,
but neither is it a guarantee of future success. The same is true for dropout in DNNs; there
is no absolute guarantee that it will enhance performance, but it is often worth a try.
Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 184
Some points on dropout
▪ Dropout can reduce the likelihood of co-adaptation in noisy samples by creating multiple
paths to correct classification throughout the DNN.
▪ The larger the dropout fraction, the more noise is introduced during training; this slows
down learning.
▪ Dropout appears to offer the most benefit on very large DNN models.
Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 185
What if we just don’t train a random bunch of nodes ?
▪ The word ‘pruning’ comes from its use in decision trees where branches of the tree are
pruned as a form of model regularization. Analogously, weights in a neural network that are
considered unimportant or rarely fire can be removed from the network with little to no
consequence. In fact, the majority of neurons have a relatively small impact on the model
performance, meaning we can achieve high accuracy even when eliminating large numbers
of parameters.
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 187
Pruning: #1
▪ Usually combined with retraining
▪ Important Questions
▪ Which weights should we remove
first?
▪ How many weights can we remove?
▪ What about the order of removal?
https://stanford-cs329s.github.io/slides/cs329s_12_slides_sara_google.pdf 188
Pruning: #2 [motivation]
▪ In humans we have synaptic pruning
▪ This removes redundant connections in the brain
189
https://stanford-cs329s.github.io/slides/cs329s_12_slides_sara_google.pdf
Pruning: #3
190
https://stanford-cs329s.github.io/slides/cs329s_12_slides_sara_google.pdf
Pruning: #4
https://stanford-cs329s.github.io/slides/cs329s_12_slides_sara_google.pdf 191
Pruning: #5
▪ The most common way is weight pruning. Weight pruning rank-orders the weights by their
magnitude since parameters with larger weights are more likely to fire and thus more likely to
be important.
▪ Another way is unit pruning, which sets entire columns in the weight matrix to zero, in effect
deleting the corresponding output neuron. Here to achieve sparsity of k% we rank the
columns of a weight matrix according to their L2-norm and delete the smallest k%.
▪ A third and more advanced method is to use Fisher pruning, which relies on the Fisher
information. This generates a norm known as the Fisher-Rao norm, which can then be used to
rank-order parameters. It is suspected there is a link between the Fisher information and the
redundancy of parameters, and this is why this technique seems to produce good results.
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae 192
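As a rough illustration of the first two options, PyTorch ships magnitude-based pruning utilities in torch.nn.utils.prune; the sketch below (pruning fractions chosen arbitrarily) applies unstructured L1 weight pruning and L2-norm structured unit pruning to toy layers.

```python
# Magnitude-based pruning sketch with torch.nn.utils.prune; amounts are arbitrary.
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(64, 32)

# Weight pruning: zero the 30% of individual weights with smallest |value|.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Unit pruning: zero entire output rows (one per output neuron) ranked by L2 norm.
other = nn.Linear(64, 32)
prune.ln_structured(other, name="weight", amount=0.5, n=2, dim=0)

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zeroed weights: {sparsity:.2f}")

# Make the pruning permanent (removes the mask re-parametrisation).
prune.remove(layer, "weight")
```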
Pruning: #6
▪ If we prune too much at once, the network might be
damaged so much it won’t be able to recover.
▪ So in practice this is an iterative process, often called
‘iterative pruning’.
https://jacobgil.github.io/deeplearning/pruning-deep-learning 193
Pruning: #7
▪ Sounds good, so why isn’t pruning more popular?
▪ This is surprising considering all the effort on running deep learning on mobile devices. I
guess the reason is a combination of:
▪ The ranking methods weren’t good enough until now, resulting in too big of an
accuracy drop.
▪ It’s a pain to implement.
▪ Those who do use pruning, keep it for themselves as a secret sauce advantage.
https://jacobgil.github.io/deeplearning/pruning-deep-learning 194