Cross Validation
Machine Learning
by Tom M. Mitchell
Cross Validation
• The idea in machine learning is to not use the entire data set when training
a learner; some of the data is held back and later used to test the trained model.
Cross Validation cont…
• A method of estimating the expected prediction error of a model.
Cross Validation cont…
Types
1) Holdout method
2) K-Fold CV
3) Leave one out CV
4) Bootstrap methods
Holdout method
• The holdout cross validation method is the simplest of all. In this
method, you randomly assign data points to two sets, a training set and
a test set. Their sizes are arbitrary, though the test set is typically the
smaller of the two.
K-fold
• K-fold cross validation is one way to improve over the holdout
method. The data set is divided into k subsets, and the holdout
method is repeated k times
• Each time, one of the k subsets is used as the test set and the
other k-1 subsets are put together to form a training set
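To make the fold mechanics concrete, here is a minimal sketch (not on the original slides) showing how sklearn's KFold assigns every point to exactly one test fold:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # ten toy data points
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # each point lands in exactly one test fold across the k iterations
    print("fold %d: train=%s test=%s" % (fold, train_idx, test_idx))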
Leave one out CV
• Leave-one-out cross validation is K-fold cross validation taken to its
logical extreme, with K equal to N, the number of data points in
the set
• That means that N separate times, the function approximator is
trained on all the data except for one point, and a prediction is
made for that point.
Leave one out cont…
• A special case of K-fold cross validation, with K equal to the number of data points.
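A quick sketch (not on the slides) confirming that equivalence with sklearn's splitters:

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(5)
# unshuffled KFold with n_splits = N produces exactly the leave-one-out splits
for (tr1, te1), (tr2, te2) in zip(LeaveOneOut().split(X), KFold(n_splits=len(X)).split(X)):
    assert np.array_equal(tr1, tr2) and np.array_equal(te1, te2)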
Bootstrap
• Randomly draw datasets, with replacement, from the training sample
• Each sample is the same size as the training sample
• Refit the model on each bootstrap sample
• Examine the fitted models across the bootstrap samples
Bootstrap cont…
• Example of bootstrap
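The worked example itself is not reproduced on the slide; below is a minimal bootstrap sketch, assuming a feature/target pair x1, y1 like the diabetes arrays used later in the deck:

import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

accs = []
for i in range(100):
    # draw a bootstrap sample, with replacement, the same size as the training sample
    xb, yb = resample(x1, y1, replace=True, n_samples=len(y1), random_state=i)
    model = LogisticRegression(max_iter=300).fit(xb, yb)  # refit on the bootstrap sample
    accs.append(model.score(x1, y1))  # examine each refitted model
print("bootstrap accuracy: %.3f (%.3f)" % (np.mean(accs), np.std(accs)))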
Logistic Regression with Cross Validation
Logistic Regression
# evaluate a logistic regression model using k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
Logistic Regression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv1 = KFold(n_splits=10, random_state=12, shuffle=True)
# create model
model = LogisticRegression(max_iter=300)
Logistic Regression
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv1, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Repeated k-Fold Cross-Validation
• The estimate of model performance via k-fold cross-validation can be
noisy.
• This means that each time the procedure is run, a different split of the
dataset into k-folds can be implemented, and in turn, the distribution
of performance scores can be different, resulting in a different mean
estimate of model performance.
Repeated k-Fold Cross-Validation
# evaluate a logistic regression model using repeated k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
Repeated k-Fold Cross-Validation
# create dataset
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
Repeated k-Fold Cross-Validation
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Repeated k-Fold Cross-Validation
• Running the example creates the dataset, then evaluates a logistic
regression model on it using 10-fold cross-validation with three
repeats. The mean classification accuracy on the dataset is then
reported.
Stratified k-Fold Cross-Validation
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Stratified k-Fold Cross-Validation
# Import necessary modules
#from sklearn.metrics import mean_squared_error
#from math import sqrt
# from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
Stratified k-Fold Cross-Validation
df = pd.read_csv('./dataset/diabetes.csv')
x1 = df.drop('Outcome', axis=1)
y1 = df['Outcome']
Stratified k-Fold Cross-Validation
skfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=100)  # shuffle must be True when random_state is set
model_skfold = LogisticRegression()
results_skfold = cross_val_score(model_skfold, x1, y1, scoring='accuracy', cv=skfold)
print("Accuracy: %.2f%%" % (results_skfold.mean()*100.0))
Stratified k-Fold Cross-Validation Classification Report
>> from sklearn.metrics import classification_report, accuracy_score, make_scorer
>> model_skfold = LogisticRegression()
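The next slide calls classification_report_with_accuracy_score, which is never defined on these slides; a minimal sketch of such a scorer would print the report for each split and return accuracy as the score:

def classification_report_with_accuracy_score(y_true, y_pred):
    print(classification_report(y_true, y_pred))  # report for this split
    return accuracy_score(y_true, y_pred)  # scalar consumed by cross_val_score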
Stratified k-Fold Cross-Validation Classification Report
>> # Nested CV with parameter optimization
>> nested_score = cross_val_score(model_skfold, X=x1, y=y1, cv=skfold, scoring=make_scorer(classification_report_with_accuracy_score))
>> print(nested_score)
Multi-class classification
>> from sklearn import datasets
>> from sklearn.metrics import confusion_matrix
>> from sklearn.model_selection import train_test_split
>> # loading the iris dataset
>> iris = datasets.load_iris()
Multi-class classification
>> # dividing X, y into train and test data
>> X, y = iris.data, iris.target
>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
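The slide that trains the classifier is missing from the deck; a minimal sketch, assuming the linear-kernel SVC that the names svm_model_linear and svm_predictions on the next slide suggest:

>> from sklearn.svm import SVC
>> # train the classifier and predict on the held-out test set
>> svm_model_linear = SVC(kernel='linear', C=1).fit(X_train, y_train)
>> svm_predictions = svm_model_linear.predict(X_test)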
Multi-class classification
>> # model accuracy for X_test
>> accuracy = svm_model_linear.score(X_test, y_test)
>> print(classification_report(y_test, svm_predictions))
Multi-class classification – cross validation
# Nested CV with parameter optimization
>> nested_score = cross_val_score(model_skfold, X=X, y=y, cv=skfold, scoring=make_scorer(classification_report_with_accuracy_score))
>> print(nested_score)
All methods of Cross Validation
All methods of Cross Validation - Introduction
• Building machine learning models is an important element of
predictive modelling. However, without proper model validation, the
confidence that the trained model will generalize well on the unseen
data can never be high.
All methods of Cross Validation - Introduction
• Hold Out Validation
• K-fold Cross-Validation
• Stratified K-fold Cross-Validation
• Leave One Out Cross-Validation
• Repeated Random Test-Train Splits
All methods of Cross Validation – Diabetes data set details
• pregnancies - Number of times pregnant.
• glucose - Plasma glucose concentration.
• diastolic - Diastolic blood pressure (mm Hg).
• triceps - Skinfold thickness (mm).
• insulin - 2-Hour serum insulin (mu U/ml).
• bmi - BMI (weight in kg/(height in m)²).
• dpf - Diabetes pedigree function.
• age - Age in years.
• diabetes - “1” represents the presence of diabetes while “0”
represents the absence of it. This is the target variable.
All methods of Cross Validation - Introduction
Steps
• In this guide, we will follow these steps:
• Step 1 - Loading the required libraries and modules.
• Step 2 - Reading the data and performing basic data checks.
• Step 3 - Creating arrays for the features and the response variable.
• Step 4 - Trying out different model validation techniques.
All methods of Cross Validation - Introduction
Step 1 - Loading the Required Libraries and Modules
All methods of Cross Validation - Introduction
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn import model_selection  # used below as model_selection.train_test_split, model_selection.KFold, etc.
from sklearn.linear_model import LogisticRegression
All methods of Cross Validation - Introduction
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import LeavePOut
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold
All methods of Cross Validation - Introduction
Step 2 - Reading the Data and Performing Basic Data Checks
• The first line of code below reads in the data as a pandas dataframe,
while the second line prints the shape - 768 observations of 9
variables. The third line gives the transposed summary statistics of the
variables.
All methods of Cross Validation - Introduction
dat = pd.read_csv('diabetes.csv')
print(dat.shape)
dat.describe().transpose()
Output: (768, 9), followed by the transposed summary statistics of the variables.
All methods of Cross Validation - Introduction
• Looking at the summary for the 'diabetes' variable, we observe that
the mean value is 0.35, which means that around 35 percent of the
observations in the dataset have diabetes. Therefore, the baseline
accuracy is 65 percent, and the model we build should definitely beat
this baseline benchmark.
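A one-line sanity check of that baseline, assuming the dat frame from Step 2 (the mean of a 0/1 column is the fraction of positives):

print("Baseline accuracy: %.2f%%" % ((1 - dat['diabetes'].mean()) * 100.0))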
All methods of Cross Validation - Introduction
• Step 3 - Creating Arrays for the Features and the Response Variable
• The two lines of code below create arrays for the features and the
response variable, respectively.
x1 = dat.drop('diabetes', axis=1).values
y1 = dat['diabetes'].values
All methods of Cross Validation - Introduction
• Step 4 - Trying out Different Model Validation Techniques
• With the arrays of the features and the response variable created, we
will start discussing the various model validation strategies.
Holdout Validation Approach - Train and Test Set Split
• The holdout validation approach refers to creating the training and the
holdout sets, also referred to as the 'test' or the 'validation' set.
• The training data is used to train the model while the unseen data is
used to validate the model performance.
• The common split ratio is 70:30, while for small datasets, the ratio can
be 90:10.
Holdout Validation Approach - Train and Test Set Split
# Evaluate using a train and a test set
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x1, y1,
test_size=0.30, random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
Holdout Validation Approach - Train and Test Set Split
Output:
Accuracy: 74.46%
• We can see that the accuracy of the model on the test data is
approximately 74 percent. The above technique is useful, but it has
pitfalls.
• The split is very important and, if it goes wrong, it can lead to the
model overfitting or underfitting the new data. This problem can be
rectified with resampling methods.
K-fold Cross-Validation
• In k-fold cross-validation, the data is divided into k folds. The model
is trained on k-1 folds with one fold held back for testing.
• This process is repeated so that each fold of the dataset gets
a chance to be the held-back test set.
K-fold Cross-Validation
kfold = model_selection.KFold(n_splits=8, shuffle=True, random_state=100)  # shuffle must be True when random_state is set
model_kfold = LogisticRegression()  # create the model object
results_kfold = model_selection.cross_val_score(model_kfold, x1, y1, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))
K-fold Cross-Validation
Output:
Accuracy: 76.95%
Stratified K-fold Cross-Validation
• Stratified K-Fold approach is a variation of k-fold cross-validation
that returns stratified folds, i.e., each set containing approximately
the same ratio of target labels as the complete data.
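The code for this step is not shown in this part of the deck; a minimal sketch mirroring the earlier StratifiedKFold example, using the x1/y1 arrays from Step 3:

skfold = model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=100)
model_skfold = LogisticRegression()
results_skfold = model_selection.cross_val_score(model_skfold, x1, y1, cv=skfold)
print("Accuracy: %.2f%%" % (results_skfold.mean()*100.0))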
Stratified K-fold Cross-Validation
• The mean accuracy for the model using stratified k-fold cross-
validation is 76.96 percent.
Leave One Out Cross-Validation (LOOCV)
• LOOCV is the cross-validation technique in which the size of the fold
is “1” with “k” being set to the number of observations in the data.
This variation is useful when the training data is of limited size and
the number of parameters to be tested is not high.
loocv = model_selection.LeaveOneOut()
model_loocv = LogisticRegression()
results_loocv = model_selection.cross_val_score(model_loocv, x1, y1,
cv=loocv)
print("Accuracy: %.2f%%" % (results_loocv.mean()*100.0))
Leave One Out Cross-Validation (LOOCV)
Output
Accuracy: 76.82%
• The mean accuracy for the model using the leave-one-out cross-
validation is 76.82 percent.
Repeated Random Test-Train Splits
• This technique is a hybrid of traditional train-test splitting and the k-
fold cross-validation method.
• In this technique, we create random train-test splits of the data
and then repeat the process of splitting and evaluating the
algorithm multiple times, just like the cross-validation method.
Repeated Random Test-Train Splits
>> kfold2 = model_selection.ShuffleSplit(n_splits=10, test_size=0.30, random_state=100)
>> model_shufflecv = LogisticRegression()
>> results_4 = model_selection.cross_val_score(model_shufflecv, x1, y1, cv=kfold2)
>> print("Accuracy: %.2f%% (%.2f%%)" % (results_4.mean()*100.0, results_4.std()*100.0))
Repeated Random Test-Train Splits
Output
Accuracy: 74.76% (2.52%)
• The mean accuracy for the model using the repeated random train-
test split method is 74.76 percent.
Cross validation for classification reports, confusion
matrices, individual split reports, etc.
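A minimal sketch of that idea, assuming the x1, y1 arrays and the skfold splitter from the earlier slides: cross_val_predict collects the out-of-fold predictions, from which a single confusion matrix and classification report can be built.

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix

# out-of-fold predictions: each row is predicted by the model that did not train on it
y_pred = cross_val_predict(LogisticRegression(max_iter=300), x1, y1, cv=skfold)
print(confusion_matrix(y1, y_pred))
print(classification_report(y1, y_pred))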