KNN, Bias-Variance Trade-off, and Classification Metrics
K Nearest Neighbors Classifier
• How can we find the label of a new data point?
• Old adage: beware of statistics; something that walks and talks like a peacock may still be a hen
• kNN idea: if something walks and talks like a peacock, it is highly likely to be a peacock, not a hen
K Nearest Neighbors Classifier
x: Class I and 0: Class II
[Figure: scatter plot of Class I (x) and Class II (0) points, with a new query point (+)]
• kNN classifier
• Training data: {(x1, y1), (x2, y2), …, (xn, yn)}
• A distance metric
• Number of neighbors: K
K Nearest Neighbors Classifier
Algorithm
1. Given training data {(x1, y1), (x2, y2), …, (xn, yn)} and a chosen K
2. For a new data point x0, compute its distance to every training point
3. Find the K nearest point(s) and assign x0 the majority label among them
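A minimal sketch of this algorithm in Python (NumPy-based; the function name, Euclidean distance, and majority vote are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=3):
    """Classify x0 by majority vote among its k nearest training points."""
    # Euclidean distance from x0 to every training point
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k neighbor labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```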
K Nearest Neighbors Classifier
Example:
x: Class I and 0: Class II
[Figure: the same scatter plot; the K = 3 nearest neighbors of the query point x0 are two Class I points and one Class II point]
• K = 3
• Compute the conditional probability of each class among the K neighbors:
• P(Y = Class I | X = x0) = 2/3 ≈ 0.67
• P(Y = Class II | X = x0) = 1/3 ≈ 0.33
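These conditional probabilities are just neighbor fractions, so the earlier sketch extends directly (reusing its imports; `knn_proba` is a hypothetical helper name):

```python
def knn_proba(X_train, y_train, x0, k=3):
    """Estimate P(Y = c | X = x0) as the fraction of the k nearest neighbors in class c."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return dict(zip(labels, counts / k))   # e.g. {Class I: 0.67, Class II: 0.33}
```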
K Nearest Neighbors Classifier
[Figures: a 2-class classification problem with 2 features, shown on three consecutive slides (decision boundary plots)]
K Nearest Neighbors Classifier
• Choice of K
• Large K value: less flexible model (smoother decision boundary)
• Small K value: flexible model, but sensitive to noisy data points
K Nearest Neighbors Classifier
How do we decide the “K”?
[Figures: kNN fits for different values of K¹]
¹ James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning, 2021
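In practice, K is usually chosen by comparing held-out error across candidate values. A minimal scikit-learn sketch (the helper name and candidate range are illustrative assumptions; cross-validation itself is covered later in this deck):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, candidate_ks=range(1, 21), cv=10):
    """Return the K with the highest mean cross-validated accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
              for k in candidate_ks}
    return max(scores, key=scores.get)
```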
Flexible vs Inflexible Models
[Figures: three fits of y against x with increasing flexibility]
Flexibility and Interpretability of Models
[Figure: complexity of models, arranged by flexibility (x-axis, low to high) against interpretability (y-axis, low to high)¹]
• High interpretability, low flexibility: Subset selection, LASSO
• Intermediate: Least-squares
• Low interpretability, high flexibility: Boosting, Bagging, Support vector machines, Deep Learning
¹ James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning, 2021
Irreducible and Reducible Errors
Mean square error between the actual and predicted y, using the fit $\hat{f}(x, \hat{p})$:

$$E\big[(y - \hat{f}(x,\hat{p}))^2\big] = \underbrace{\big[f(x) - \hat{f}(x,\hat{p})\big]^2}_{\text{Reducible Error}} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{Irreducible Error}}$$
Bias-Variance Trade-off
[Figures: the bias-variance trade-off illustrated for the kNN classifier and for linear regression]

Bias-Variance Trade-off and Prediction Error
[Figure: the kNN MSE]
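The kNN MSE referenced above can be written out explicitly. For a kNN regression fit $\hat{f}_k$ at a test point $x_0$ with noise variance $\sigma^2$, a standard decomposition (e.g. Hastie et al., The Elements of Statistical Learning) is

$$E\big[(y_0 - \hat{f}_k(x_0))^2\big] = \sigma^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma^2}{k},$$

where $x_{(\ell)}$ are the $k$ nearest neighbors of $x_0$: the irreducible error, the squared bias (which grows with $k$), and the variance (which shrinks with $k$).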
Bias-Variance Trade-off
[Figure: error vs. model flexibility. The total error curve is the sum of bias², variance, and the irreducible error; underfitting lies on the left, overfitting on the right, and the optimal model at the minimum of the total error]
Model Selection and Assessment
• Model selection is important in multiple linear and nonlinear models
• Data-rich situation: randomly divide the data into three parts
Ideal scenario: the data-rich situation
[Figure: data split into training, validation, and test sets]
Resampling Methods
• Consider the following data set
• Training set: {(x1, y1), (x2, y2), …, (xn, yn)}
• Test point: (x0, y0)
• Training error rate: not of interest for the predictive ability of the model
• Test error rate: of interest; it reflects the irreducible error, variance, and bias
Validation Set Approach
• Enough data: (1) training set, (2) validation set, and (3) test set
• Not enough data: generate validation sets from the training set
• Validation set approach: divide (often randomly) the n training observations into two parts:
• a training set (nt observations)
• a validation set, or hold-out set (nv observations)
• Use the training set to fit the model
• Use the validation set to compute the validation set error
This provides an estimate of the test error rate (a minimal sketch follows)
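A minimal sketch of the random split (the 50/50 default and the function name are illustrative assumptions):

```python
import numpy as np

def validation_split(X, y, frac_train=0.5, seed=0):
    """Randomly divide (X, y) into a training part and a validation (hold-out) part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))       # random reordering of the observation indices
    cut = int(frac_train * len(y))
    train, val = idx[:cut], idx[cut:]
    return X[train], y[train], X[val], y[val]
```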
Validation Set Approach: Example
• Example: mileage ~ horsepower¹
• Nonlinear model: mileage ~ f(horsepower)
[Figures: validation-set MSE vs. polynomial degree over many random splits, showing high variability in the MSE estimates of the test error]
¹ Tibshirani et al. (2013)
Leave-one-out-cross-validation (LOOCV)
• Build the model using (n−1) samples and predict the response (yi) for the remaining sample
• Repeat so that each of the n observations is held out exactly once
[Figure: the n observations, with a different single observation held out in each iteration]
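A hedged LOOCV sketch for a regression model (`fit` and `predict` stand in for user-supplied model callbacks; these names are assumptions):

```python
import numpy as np

def loocv_mse(X, y, fit, predict):
    """LOOCV estimate of test MSE: average squared error over n single-point hold-outs."""
    n = len(y)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                   # everything except observation i
        model = fit(X[mask], y[mask])
        pred = predict(model, X[i:i+1])[0]         # predict the held-out response
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)
```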
LOOCV: Example
• Example: mileage ~ horsepower¹
• Nonlinear model: mileage ~ f(horsepower)
[Figures: MSE vs. polynomial degree for LOOCV (a single curve) and for the validation set approach (many variable curves)]
¹ Tibshirani et al. (2013)
Leave-one-out-cross-validation (LOOCV)
• Advantages
• Far less bias in comparison to the validation set approach: the training set contains (n−1) observations in each iteration
• Yields the same result every time: there is no randomness in the training/validation splits
• Does not overestimate the test error rate as much as the validation set approach
• Disadvantages
• Expensive to implement, since the model is fit n times
• Asymptotically inconsistent: as n tends to infinity, it does not necessarily choose the correct model
• It may select a model of excessive size (more variables than the optimal model)
k-Fold Cross Validation
• Split the training data into k disjoint samples of (roughly) equal size: Z1, Z2, …, Zk
• For each validation sample Zi:
• Use the remaining data to fit the model
• Predict the responses for the validation sample Zi and compute its mean square error (MSEi)
• Repeat for all k samples
• The k-fold CV estimate is the average: $\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$
[Figure: the data split into k folds, with each fold held out once]
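A hedged sketch of the procedure (again with assumed `fit`/`predict` callbacks):

```python
import numpy as np

def kfold_mse(X, y, fit, predict, k=10, seed=0):
    """k-fold CV estimate of test MSE: CV_(k) = (1/k) * sum of per-fold MSEs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k disjoint index sets
    mses = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False                               # hold out this fold
        model = fit(X[mask], y[mask])
        pred = predict(model, X[fold])
        mses.append(np.mean((y[fold] - pred) ** 2))      # MSE_i on the held-out fold
    return np.mean(mses)
```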
k-fold Validation
• For k = n, k-fold CV reduces to leave-one-out cross-validation (LOOCV)
• In practice, k = 5 or 10 is taken:
• Lower computational cost
• This matters for computationally intensive learning methods:
• LOOCV fits the model n times
• k-fold CV fits the model only k times
k-fold CV: Example
• Example: mileage ~ horsepower¹
• Nonlinear model: mileage ~ f(horsepower)
[Figures: MSE vs. degree of polynomial for LOOCV and for 10-fold CV]
¹ Tibshirani et al. (2013)
k-fold CV: Example
• Example: mileage ~ horsepower¹
• Nonlinear model: mileage ~ f(horsepower)
[Figures: MSE vs. degree of polynomial for 10-fold CV and for the validation set approach]
k-fold CV has lower variability in comparison to the validation set approach
¹ Tibshirani et al. (2013)
k-fold CV: Bias-Variance Trade-off
• Bias reduction in the test error estimate: LOOCV is preferred
• LOOCV provides nearly unbiased estimates: (n−1) observations in each training set
• k-fold CV provides an intermediate level of bias: (k−1)n/k observations in each training set
• Variance reduction in the test error estimate: k-fold CV is preferred
• LOOCV leads to higher variance: the n fitted models are trained on almost identical sets of (n−1) observations, so their errors are highly correlated
• k-fold CV (k < n) leads to lower variance: the models are trained on (k−1)n/k observations, and the overlap between the training sets of the fitted models is smaller
5- or 10-fold CV yields test error rate estimates with moderate bias and variance
Cross-validation: Classification Problems
• Regression problems have a quantitative outcome yi
• In CV, the MSE is used to quantify the test error
• In classification problems, yi is qualitative; how do we apply CV?
• Use the number of misclassified observations instead
• LOOCV error rate: $\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$
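The same hold-out loop as before, with the squared error swapped for the misclassification indicator (a hedged sketch; `fit`/`predict` are assumed callbacks):

```python
import numpy as np

def loocv_error_rate(X, y, fit, predict):
    """LOOCV misclassification rate: CV_(n) = (1/n) * sum of I(y_i != yhat_i)."""
    n = len(y)
    errs = 0
    for i in range(n):
        mask = np.arange(n) != i
        model = fit(X[mask], y[mask])
        errs += int(predict(model, X[i:i+1])[0] != y[i])   # 1 if misclassified
    return errs / n
```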
Bootstrap
[Figure: the training sample Z = {z1, z2, z3, …, zn} is resampled into bootstrap samples Z*1, Z*2, …, Z*m; the statistic S(·) (e.g. model parameters) is computed on each, giving the bootstrap replications S(Z*1), S(Z*2), …, S(Z*m)]
Bootstrap
• Normally used for quantifying the uncertainty associated with a given estimator
• Training set: Z = {z1, z2, …, zn}, where zi = (xi, yi)
• Draw samples with replacement from the training set such that each sample size equals the original training size
• Repeat the sampling m times, giving m data sets Z*1, …, Z*m
• Compute the quantity of interest (e.g. regression parameters) from each data set
• Can also be used for the estimation of prediction errors (a sketch follows)
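A minimal sketch of the resampling loop (`statistic` is an assumed user-supplied callable, such as a regression-coefficient estimator):

```python
import numpy as np

def bootstrap_statistic(Z, statistic, m=1000, seed=0):
    """m bootstrap replications of `statistic`; their spread estimates its uncertainty."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    # Each replication: resample n rows of Z with replacement, then apply S(.)
    reps = [statistic(Z[rng.integers(0, n, size=n)]) for _ in range(m)]
    return np.array(reps)
```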
Bootstrap
• Estimation of prediction error (each observation is evaluated only on the bootstrap samples that do not contain it):

$$\widehat{\mathrm{Err}} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|C^{-i}|}\sum_{m \in C^{-i}} L\big(y_i, \hat{f}^{*m}(x_i)\big)$$

where C⁻ⁱ is the set of indices of the bootstrap samples m that do not contain the ith observation.
Bootstrap: Example
• Two instruments: A and B
• Property C = αA + (1 − α)B, where α is a parameter
• Variability is associated with each instrument
• Objective: choose α such that the variance of C is minimized
• The α value at minimum var(C) is given by

$$\alpha = \frac{\sigma_B^2 - \sigma_{AB}}{\sigma_A^2 + \sigma_B^2 - 2\sigma_{AB}}$$
Bootstrap: Example
• n = 100 observations
• m = n bootstrap samples
• Compute the estimates $\hat{\sigma}_A^2$, $\hat{\sigma}_B^2$, $\hat{\sigma}_{AB}$, and $\hat{\alpha}$ for each bootstrap sample using the formula above
• Result: $\hat{\alpha} = 0.5964$
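A hedged sketch of this example end to end, reusing `bootstrap_statistic` from above (the data layout, one (A, B) pair per row, is an assumption):

```python
import numpy as np

def alpha_hat(Z):
    """Minimum-variance alpha estimated from a sample Z of (A, B) pairs."""
    A, B = Z[:, 0], Z[:, 1]
    cov = np.cov(A, B)   # cov[0,0]=var(A), cov[1,1]=var(B), cov[0,1]=cov(A,B)
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

# reps = bootstrap_statistic(Z, alpha_hat, m=100)
# reps.mean() estimates alpha; reps.std() estimates its standard error
```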
Conclusion: Choosing the Optimal Model?
[Figure: 10-fold CV error curves used for model selection]
Conclusion: Choosing the Optimal Model?
One-standard-error rule:
• Compute the standard error of the test MSE for each model
• Select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve
[Figure: 10-fold CV error curve with the one-standard-error band]
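A hedged sketch of the rule (model "size" is taken to be the dictionary key; all names are illustrative):

```python
import numpy as np

def one_standard_error_choice(cv_errors):
    """cv_errors: {model_size: array of per-fold MSEs}.
    Pick the smallest model whose mean CV error is within one SE of the best mean."""
    stats = {size: (np.mean(e), np.std(e, ddof=1) / np.sqrt(len(e)))
             for size, e in cv_errors.items()}
    best_mean, best_se = min(stats.values(), key=lambda t: t[0])
    eligible = [size for size, (mean, _) in stats.items() if mean <= best_mean + best_se]
    return min(eligible)
```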
Classification Models
• Data: (x1, y1), (x2, y2), …, (xn, yn)
• Binary-class problems
• Multi-class problems
• Underlying true distribution P(X, y)
• How well is the underlying distribution learnt by a classifier?
• Questions:
• How do we estimate the true performance of a classifier?
• How good are the parameter estimates in the classifier?
Evaluation Metrics: Binary Classification
True label (T):   +   +   +   +   -   -   -   -   -   -
Prediction (P):   +   -   +   -   +   -   -   -   -   -
Outcome:          TP  FN  TP  FN  FP  TN  TN  TN  TN  TN
Evaluation Metrics: Binary Classification
Confusion Matrix (Contingency table)

                    Predicted +             Predicted -
True +              True Positive (TP)      False Negative (FN)
True -              False Positive (FP)     True Negative (TN)

Accuracy = (TP + TN) / (P + N), where P and N are the numbers of actual positive and negative samples
Misclassification rate = 1 - Accuracy
Evaluation Metrics: Binary Classification
Classifier 1:                      Classifier 2:
          Pred +   Pred -                    Pred +   Pred -
True +      15        5            True +      18        2
True -      30      950            True -      20      960

Compute the accuracy:
Accuracy = (15 + 950) / (20 + 980) = 0.965    Accuracy = (18 + 960) / (20 + 980) = 0.978
Confusion Matrix (Contingency table)

                    Predicted +             Predicted -
True +              True Positive (TP)      False Negative (FN)
True -              False Positive (FP)     True Negative (TN)

Precision = TP / (TP + FP)
          = (positive samples classified as positive by the classifier) / (total samples predicted as positive by the classifier)
Evaluation Metrics: Binary Classification
Confusion Matrix (Contingency table)

                    Predicted +             Predicted -
True +              True Positive (TP)      False Negative (FN)
True -              False Positive (FP)     True Negative (TN)

Recall = TP / (TP + FN)
       = the fraction of the actual positive samples that the classifier predicts as positive
Evaluation Metrics: Binary Classification
Classifier 1:                      Classifier 2:
          Pred +   Pred -                    Pred +   Pred -
True +      15        5            True +      18        2
True -      30      950            True -      10      970

Precision = 15 / (15 + 30) = 0.33    Precision = 18 / (18 + 10) = 0.64
Recall    = 15 / (15 + 5)  = 0.75    Recall    = 18 / (18 + 2)  = 0.90
Evaluation Metrics: Binary Classification
Confusion Matrix (Contingency table)

                    Predicted +             Predicted -
True +              True Positive (TP)      False Negative (FN)
True -              False Positive (FP)     True Negative (TN)

Specificity = TN / (FP + TN)
Recall (sensitivity) = TP / (TP + FN)
Evaluation Metrics: Binary Classification
[Figures: worked examples of specificity = TN / (FP + TN) and recall (sensitivity) = TP / (TP + FN)]
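All four counts-based metrics in one hedged helper, checked against Classifier 1 from the example above:

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall (sensitivity), and specificity from confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }

print(binary_metrics(tp=15, fn=5, fp=30, tn=950))
# accuracy 0.965, precision 0.33, recall 0.75, specificity ~0.97
```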
Evaluation Metrics: Binary Classification
Receiver Operating Characteristic (ROC) Curve

True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)

ROC curve: a graph of TPR against FPR, traced out as the classifier's decision threshold is varied
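A hedged sketch of how the curve is traced: sort by classifier score and sweep the threshold, recording (FPR, TPR) after each point (`scores`/`labels` are assumed inputs; labels are 0/1):

```python
import numpy as np

def roc_curve_points(scores, labels):
    """(FPR, TPR) pairs for every threshold, sorted by descending score."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)        # most confident positives first
    labels = labels[order]
    tps = np.cumsum(labels)            # true positives at each cutoff
    fps = np.cumsum(1 - labels)        # false positives at each cutoff
    tpr = tps / labels.sum()                   # TP / (TP + FN)
    fpr = fps / (len(labels) - labels.sum())   # FP / (FP + TN)
    return fpr, tpr
```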
Evaluation Metrics: Binary Classification
[Figures: construction of the ROC curve, plotting TPR = TP / (TP + FN) against FPR = FP / (FP + TN)]
Evaluation Metrics: Binary Classification
Comparing Receiver Operating Characteristic (ROC) Curves
[Figure: ROC curves (TPR vs. FPR) of several classifiers; curves closer to the top-left corner indicate better classifiers]
Evaluation Metrics: Binary Classification
Precision-Recall Curve
[Figure: precision plotted against recall as the decision threshold is varied]
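The same threshold sweep yields the precision-recall curve (a hedged sketch mirroring roc_curve_points above):

```python
import numpy as np

def pr_curve_points(scores, labels):
    """(recall, precision) pairs for every threshold, sorted by descending score."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)
    labels = labels[order]
    tps = np.cumsum(labels)
    precision = tps / np.arange(1, len(labels) + 1)   # TP / predicted positives so far
    recall = tps / labels.sum()                       # TP / (TP + FN)
    return recall, precision
```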
Evaluation Metrics: Multi-Class Classification
[Figures: multi-class evaluation metrics (the slide content here was image-based)]
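Since the multi-class slides were image-based, here is one hedged illustration of a standard multi-class metric; this specific choice (macro-averaged one-vs-rest precision and recall) is an assumption, not necessarily what the original slides showed:

```python
import numpy as np

def macro_precision_recall(y_true, y_pred, classes):
    """Macro-averaged one-vs-rest precision and recall for a multi-class problem."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precs, recs = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precs.append(tp / (tp + fp) if tp + fp else 0.0)
        recs.append(tp / (tp + fn) if tp + fn else 0.0)
    return np.mean(precs), np.mean(recs)
```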
References:
1. Fawcett, T., An Introduction to ROC Analysis, Pattern Recognition Letters, 2006, pp. 861-874
2. Tharwat, A., Classification Assessment Methods, Applied Computing and Informatics, Vol. 17, No. 1, 2021, pp. 168-192