
K-NN, Bias Variance Trade-off

and Classification Metrics


Nirav P Bhatt
Department of Data Science and AI
Robert Bosch Centre for Data Science and Artificial Intelligence
Indian Institute of Technology Madras, Chennai – 600036, India
K Nearest Neighbors Classifier
• Data: {(x1, y1), (x2, y2), ….., (xn, yn)}
• Features: xi = (xi1, xi2, …, xip), a p-dimensional feature vector for sample i
• Label: yi
• New test data xo
• What is the corresponding label?

• Instance-based classifier

• Uses the training data directly for classification (no explicit model is fit)
• Non-parametric method

K Nearest Neighbors Classifier
• How can we find the label of the new point?
• Old adage: something that walks and talks like a peacock may, statistically, still turn out to be a hen
• kNN idea: something that walks and talks like a peacock is highly likely to be a peacock, not a hen
K Nearest Neighbors Classifier
x: Class I and 0: Class II
[Figure: scatter plot of the training points from the two classes (x: Class I, 0: Class II) with a new point marked + to be classified]
• A kNN classifier needs:
  • Training data: {(x1, y1), (x2, y2), …, (xn, yn)}
  • A distance metric
  • The number of neighbors, K
K Nearest Neighbors Classifier
Algorithm
1. Data {(x1, y1), (x2, y2), ….., (xn, yn)}
2. For new data point, xo
3. Find the K nearest training point(s) to xo

4. Assign the label yo by majority vote among the K nearest neighbors (see the sketch below)

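A minimal sketch of the algorithm above in Python. The function name knn_predict and the use of Euclidean distance are illustrative choices, not something specified in the slides; any distance metric could be substituted.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=3):
    """Predict the label of a new point x0 by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # distance from x0 to every training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest training points
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote over neighbor labels
```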
K Nearest Neighbors Classifier
Example:
x: Class I and 0: Class II
[Figure: the scatter plot of the two classes, with the new test point xo and its 3 nearest neighbors highlighted]
• K = 3
• Estimate the conditional class probabilities from the K neighbors (2 of the 3 nearest neighbors belong to Class I):
  • P(Y = Class I | X = xo) = 0.67
  • P(Y = Class II | X = xo) = 0.33
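Continuing the illustrative knn_predict sketch above, the conditional probabilities in this example can be estimated as the fraction of the K neighbors falling in each class. This is a minimal sketch, not code from the slides.

```python
import numpy as np

def knn_class_probs(X_train, y_train, x0, k=3):
    """Estimate P(Y = c | X = x0) as the fraction of the k nearest neighbors in class c."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return dict(zip(labels, counts / k))   # e.g. {"Class I": 0.67, "Class II": 0.33} for K = 3
```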
K Nearest Neighbors Classifier
2-class classification problem with 2 features
[Figures: kNN results on a two-feature, two-class data set; figures not reproduced]
K Nearest Neighbors Classifier
• Choice of K
• Large K value
  • Less flexible model (smoother decision boundary)
• Small K value
  • More flexible model
  • But sensitive to noisy data points
K Nearest Neighbors Classifier
How do we decide the “K”?

K Nearest Neighbors Classifier
How do we decide the “K”?

[Figure: training and test error rates as a function of K; the test error rate is roughly U-shaped and is minimized at some K*, while the training error rate keeps decreasing as the model becomes more flexible]
¹James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning, 2021
K Nearest Neighbors Classifier
How do we decide the “K”?

¹James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning, 2021
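One practical way to answer this question, sketched here under the assumption that a labelled hold-out set is available: evaluate the kNN error over a grid of K values and keep the best one. knn_predict is the illustrative helper defined earlier and choose_k is a hypothetical name.

```python
import numpy as np

def choose_k(X_train, y_train, X_val, y_val, k_values=(1, 3, 5, 7, 9, 15)):
    """Pick K by minimizing the misclassification rate on a held-out validation set."""
    errors = {}
    for k in k_values:
        preds = np.array([knn_predict(X_train, y_train, x, k=k) for x in X_val])
        errors[k] = np.mean(preds != y_val)   # validation error rate for this K
    return min(errors, key=errors.get), errors
```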
Flexible vs Inflexible Models

[Figure: three panels plotting y versus x, showing fits of increasing flexibility from an inflexible (e.g. linear) fit to a highly flexible fit]
Flexibility and Interpretability of Models
[Figure: interpretability (high to low) versus flexibility (low to high) of common methods. Subset selection, LASSO, and least squares sit at high interpretability and low flexibility; generalized additive models and decision trees are intermediate; bagging, boosting, support vector machines, and deep learning sit at low interpretability and high flexibility]
¹James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning, 2021
Irreducible and Reducible Errors
Mean squared error between the actual and predicted y, using the fit f̂(x, p̂):

E[(y − f̂(x, p̂))²] = E[(f(x, p) − f̂(x, p̂))²] + Var(ε)

Reducible error: E[(f(x, p) − f̂(x, p̂))²], which can be lowered by choosing a better model
Irreducible error: Var(ε), the noise variance that no model can remove
Bias-Variance Trade-off

Bias-Variance Trade-off
kNN Classifier vs Linear Regression
Bias-Variance Trade-off and Prediction error
kNN MSE

Bias-Variance Trade-off
[Figure: error versus model complexity (# parameters): the bias² curve falls and the variance curve rises with complexity; their sum plus the irreducible error gives the total error, which is minimized at the optimal model between underfitting and overfitting]
Model Selection and Assessment
• Model selection is important for both linear and nonlinear models
• Data-rich situation: randomly divide the data into three parts

Ideal scenario (data-rich situation):

  Train (50%)       Validate (25%)      Test (25%)
  Fit the models    Model selection     Model assessment

Practice: Limited Amount of Data


Best Model in Practice? Need a Criterion
Resampling Methods
• Validate models by repeatedly drawing random samples from the training set
  • K-fold cross validation
  • Bootstrap
• Objective:
  • Predict the performance of the model(s) on test sets using only the training set
• Resampling methods are useful in data-scarce situations

Resampling Methods
• Consider the following data set
• Training set: {(x1, y1);(x2, y2);…; (xn, yn)}
• Test points: (x0, y0); suppose there are nt such observations
• Training error rate: not of interest for judging the predictive ability of the model
• Test error rate (computed on the nt test observations): this is what we care about

Data scarcity: separate test data are often not available
Resampling Methods
• Expected test error at a point x0:

  E[(y0 − f̂(x0))²] = Var(ε) + Var(f̂(x0)) + [Bias(f̂(x0))]²
                     (irreducible error + variance + squared bias)

• Interpretation of variance: the amount by which f̂ would change if we estimated it using a different training set
• Interpretation of bias: the amount of error introduced by approximating the problem with a simpler model
• Select the model that achieves low variance and low bias
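A hedged simulation sketch of these interpretations. The data-generating process (the true function, noise level) and the use of kNN regression here are assumptions made purely for illustration: refitting on many training sets lets us estimate the variance and squared bias of the fit at a fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)             # assumed true regression function
x0, n, reps, k = 0.5, 100, 500, 5       # test point, sample size, repetitions, neighbors

preds = []
for _ in range(reps):
    X = rng.uniform(0, 1, n)                        # a fresh training set
    y = f(X) + rng.normal(0, 0.3, n)                # noisy responses
    nearest = np.argsort(np.abs(X - x0))[:k]        # k nearest neighbors of x0
    preds.append(y[nearest].mean())                 # kNN estimate of f(x0)

preds = np.array(preds)
variance = preds.var()                              # Var(f_hat(x0)) across training sets
bias_sq = (preds.mean() - f(x0)) ** 2               # [Bias(f_hat(x0))]^2
```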

Validation Set Approach
• Enough data: (1) Training set, (2) Validation set, and (3) Test
set
• Not enough data: Generate validation sets from a training set
• Validation set approach: divide (often randomly) the available observations into two parts
  [Schematic: the n observations 1, 2, …, n are split into a training part of nt observations and a validation (hold-out) part of nv observations]
  • A training set of nt observations
  • A validation set (or hold-out set) of nv observations
• Use the training set to fit the model
• Use the validation set to compute the validation-set error, which provides an estimate of the test error rate (see the sketch below)

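A minimal sketch of the validation-set approach, assuming caller-supplied fit and predict functions (both hypothetical names) for whatever model is being assessed:

```python
import numpy as np

def validation_set_mse(X, y, fit, predict, val_frac=0.5, seed=0):
    """Randomly split the data, fit on one part, and estimate the test MSE on the other."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, train = idx[:n_val], idx[n_val:]
    model = fit(X[train], y[train])                 # fit on the training part
    resid = y[val] - predict(model, X[val])         # predict the hold-out part
    return np.mean(resid ** 2)                      # validation MSE as a test-error estimate
```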
Validation Set Approach: Example
• Example: mileage ~ horsepower¹
• Nonlinear model: mileage ~ f(horsepower)
[Figure: validation-set MSE versus degree of the polynomial for repeated random splits; the resulting estimates of the test error show high variability]
¹Tibshirani et al. (2013)
Leave-one-out-cross-validation (LOOCV)
• Build model using (n-1) samples and predict
the response (yi) for the remaining sample

[Schematic: each of the n observations is left out in turn; the model is fit on the remaining n−1 observations and used to predict the one held out]

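A sketch of LOOCV under the same assumptions as above (fit and predict are hypothetical caller-supplied functions):

```python
import numpy as np

def loocv_mse(X, y, fit, predict):
    """Leave each observation out in turn, refit, and average the squared prediction errors."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                    # leave observation i out
        model = fit(X[mask], y[mask])
        errors[i] = (y[i] - predict(model, X[i:i+1])[0]) ** 2
    return errors.mean()                            # CV(n): average of the n squared errors
```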
LOOCV: Example
• Example: mileage ~ horsepower¹
• Nonlinear model: mileage ~ f(horsepower)
[Figure: MSE versus degree of the polynomial for LOOCV and for the validation set approach, shown side by side]
¹Tibshirani et al. (2013)
Leave-one-out-cross-validation (LOOCV)
• Advantages
• Far less bias in comparison to the validation set approach
  (the training set contains n−1 observations in each iteration)
• Yields the same result every time it is run
  (no randomness in the training/validation splits)
• Does not overestimate the test error rate as much as the validation set approach
• Disadvantages
• Expensive to implement: the model has to be fit n times
• Asymptotically inconsistent: as n tends to infinity it need not choose the correct model
• It may select a model of larger size (more variables) than the optimal model

k-Fold Cross Validation
• Split the training data into k disjoint samples (folds) of roughly equal size: Z1, Z2, …, Zk
  [Schematic: the n observations are partitioned into folds 1, 2, …, k]
• For each validation sample Zi:
  • Use the remaining data to fit the model
  • Predict the responses for the validation sample Zi and compute its mean squared error (MSEi)
• Repeat for all k samples
• The k-fold CV estimate: CV(k) = (1/k) Σᵢ MSEi (a sketch follows below)

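A sketch of k-fold CV with the same hypothetical fit/predict conventions used earlier:

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """Split into k folds, hold each out once, and average the fold MSEs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)    # disjoint folds Z1, ..., Zk
    mses = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(y)), val)
        model = fit(X[train], y[train])
        mses.append(np.mean((y[val] - predict(model, X[val])) ** 2))
    return float(np.mean(mses))                            # CV(k) = (1/k) * sum of MSE_i
```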
k-fold Validation
• For k = n, k-fold CV reduces to leave-one-out cross-validation (LOOCV)
• In practice, k = 5 or 10 is used
  • Lower computational cost
• For computationally intensive learning methods
• LOOCV fits the model n times
• k-fold CV fits the model k times

k-fold CV: Example
• Example: mileage ~ horsepower¹
• Nonlinear model: mileage ~ f(horsepower)
[Figure: MSE versus degree of the polynomial for LOOCV and for k-fold CV; the two curves are very similar]
¹Tibshirani et al. (2013)
k-fold CV: Example
• Example: mileage ~ horsepower¹
• Nonlinear model: mileage ~ f(horsepower)
[Figure: MSE versus degree of the polynomial for repeated 10-fold CV runs and for repeated validation-set splits]
k-fold CV has lower variability in comparison to the validation set approach
¹Tibshirani et al. (2013)
k-fold CV: Bias-Variance Trade-off
• Bias reduction in the test-error estimate: LOOCV is preferred
  • LOOCV provides nearly unbiased estimates: each training set contains n−1 observations
  • k-fold CV gives an intermediate level of bias: each training set contains (k−1)n/k observations
• Variance reduction in the test-error estimate: k-fold CV is preferred
  • LOOCV has higher variance: the n models are trained on almost identical sets of n−1 observations, so their outputs are highly correlated
  • k-fold CV (k < n) has lower variance: the overlap between the training sets used for the k models is smaller
5- or 10-fold CV yields test-error estimates with moderate bias and variance
Cross-validation: Classification Problems
• Regression problems: quantitative outcome yi
  • In CV, the MSE is used to quantify the test error
• Classification problems: yi is qualitative
  • How do we do CV?
  • Use the number of misclassified observations
• LOOCV error rate:

  CV(n) = (1/n) Σᵢ Erri, with Erri = I(yi ≠ ŷi), where I is the indicator function (a sketch follows below)

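For a classifier, the LOOCV loop simply replaces the squared error with the misclassification indicator. A sketch using the illustrative knn_predict helper from earlier:

```python
import numpy as np

def loocv_error_rate_knn(X, y, k=3):
    """LOOCV misclassification rate for a kNN classifier."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        y_hat = knn_predict(X[mask], y[mask], X[i], k=k)
        errs[i] = float(y_hat != y[i])              # Err_i = I(y_i != y_hat_i)
    return errs.mean()                              # CV(n) error rate
```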
Bootstrap
[Schematic: from the training sample Z = {z1, z2, …, zn}, draw m bootstrap samples Z*1, Z*2, …, Z*m; computing the statistic of interest S(·) on each gives the bootstrap replications S(Z*1), S(Z*2), …, S(Z*m)]
Bootstrap
• Normally used for quantifying the uncertainty associated with a
given estimator
• Training set: Z={z1,z2,…,zn} where zi=(xi,yi)
• Draw samples with replacement from the training set, each of the same size as the original training set
• Repeat the sampling m times to obtain m data sets Z*1, …, Z*m
• Compute the quantity of interest (e.g. regression parameters) from each data set
• Estimation of prediction errors

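A sketch of the basic bootstrap loop just described, used to quantify the uncertainty of a statistic S(·). The statistic is a caller-supplied function and bootstrap_se is a hypothetical name:

```python
import numpy as np

def bootstrap_se(Z, statistic, m=1000, seed=0):
    """Draw m bootstrap samples of Z (with replacement) and return the SE of the statistic."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    reps = np.array([statistic(Z[rng.integers(0, n, size=n)])  # resample n rows with replacement
                     for _ in range(m)])                       # m replications S(Z*j)
    return reps.std(ddof=1), reps                              # bootstrap SE and the replications
```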
Bootstrap
• Estimation of prediction error

• MSEboot does not provide a good estimate. Why?
  • The original training set is effectively acting as the test set
  • The bootstrap sets are very close to the training set
• A better bootstrap estimate of the prediction error evaluates each observation i only with the bootstrap models fit on samples that do not contain it,
  where C⁻ⁱ denotes the set of indices, among the m bootstrap samples, of those not containing the ith observation
Bootstrap: Example
• Two instruments: A and B
• Property C= αA+(1- α)B, α is a parameter
• Variability associated with each instrument
• Objective : Choose α such that variance of C is minimized
• The value of α that minimizes Var(C) is given by

  α = (σB² − σAB) / (σA² + σB² − 2σAB)

• σA², σB², and σAB are unknown
• Estimate them using past data sets
Bootstrap: Example

[Figure: distribution of the estimates of α from repeated simulated data sets; the simulated (true) value is α = 0.6]


Bootstrap: Example

• n = 100 observations
• m bootstrap samples, each of size n
• For each bootstrap sample, compute the estimates σ̂A², σ̂B², σ̂AB and, from the formula above, the corresponding α̂
• The resulting bootstrap estimate is α̂ = 0.5964 (the simulated value was α = 0.6)
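A hedged sketch of this instrument example. The arrays a and b (past measurements from instruments A and B) and the function names are illustrative; the plug-in formula is the variance-minimizing α given above.

```python
import numpy as np

def alpha_hat(a, b):
    """Plug-in estimate of alpha from the sample variances and covariance of A and B."""
    cov = np.cov(a, b)                              # 2x2 sample covariance matrix
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

def bootstrap_alpha(a, b, m=1000, seed=0):
    """Bootstrap the estimate of alpha and report its mean and standard error."""
    rng = np.random.default_rng(seed)
    n, reps = len(a), []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)            # resample observations with replacement
        reps.append(alpha_hat(a[idx], b[idx]))
    return float(np.mean(reps)), float(np.std(reps, ddof=1))
```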
Conclusion:
Choosing the Optimal Model?

[Figure: model-selection criteria compared: validation-set error, 10-fold CV error, and BIC plotted against model size]
Conclusion:
Choosing the Optimal Model?

[Figure: the same criteria (validation-set error, BIC, 10-fold CV) with the one-standard-error rule applied]

One-standard-error rule:
• Compute the standard error of the estimated test MSE for each model size
• Select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve
Classification Models
• Data: (x1,y1), (x2,y2),…,(xn,yn)
• Binary class problems
• Multi-class problems
• Underlying true distribution P(X, y)
• How well is the underlying distribution learnt by a classifier?
• Questions
• How do we estimate the true performance of a
classifier?
• How good are the parameter estimates in the classifier?
Evaluation Metrics: Binary Classification
True (T):      +   +   +   +   -   -   -   -   -   -
Predicted (P): +   -   +   -   +   -   -   -   -   -
Outcome:       TP  FN  TP  FN  FP  TN  TN  TN  TN  TN

TP: True Positive (positive sample classified as the positive class)
FN: False Negative (positive sample classified as the negative class)
FP: False Positive (negative sample classified as the positive class)
TN: True Negative (negative sample classified as the negative class)

Evaluation Metrics: Binary Classification
Confusion Matrix (Contingency table)

                 Predicted +           Predicted -
True +           True Positive (TP)    False Negative (FN)
True -           False Positive (FP)   True Negative (TN)

Accuracy = (TP + TN) / (P + N)
Misclassification rate = 1 − Accuracy
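A small sketch computing the confusion-matrix counts and accuracy for a binary problem (labels are assumed to be encoded so that the value of positive marks the positive class; function names are illustrative):

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary classification problem."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == positive) & (y_pred == positive)))
    fn = int(np.sum((y_true == positive) & (y_pred != positive)))
    fp = int(np.sum((y_true != positive) & (y_pred == positive)))
    tn = int(np.sum((y_true != positive) & (y_pred != positive)))
    return tp, fn, fp, tn

def accuracy(y_true, y_pred):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + fp + tn)          # (TP + TN) / (P + N)
```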
Evaluation Metrics: Binary Classification
          Classifier 1                         Classifier 2
          Pred +   Pred -                      Pred +   Pred -
True +      15        5                True +    18        2
True -      30      950                True -    20      960

Compute the accuracy:

Accuracy = (15 + 950) / (20 + 980) = 0.965        Accuracy = (18 + 960) / (20 + 980) = 0.978

The accuracies of the two classifiers are nearly identical.
For highly imbalanced data, accuracy is not a good measure.
Evaluation Metrics: Binary Classification
Confusion Matrix (Contingency table)

                 Predicted +           Predicted -
True +           True Positive (TP)    False Negative (FN)
True -           False Positive (FP)   True Negative (TN)

Precision = TP / (TP + FP)
          = (positive samples correctly classified as positive by the classifier) / (total samples predicted as positive by the classifier)

Precision is the fraction of the classifier's positive predictions that are truly positive.
Evaluation Metrics: Binary Classification
Confusion Matrix (Contingency table)

                 Predicted +           Predicted -
True +           True Positive (TP)    False Negative (FN)
True -           False Positive (FP)   True Negative (TN)

Recall = TP / (TP + FN)

Recall is the fraction of the truly positive samples that the classifier predicts as positive.
Evaluation Metrics: Binary Classification
          Classifier 1                         Classifier 2
          Pred +   Pred -                      Pred +   Pred -
True +      15        5                True +    18        2
True -      30      950                True -    10      970

Precision = 15 / (15 + 30) ≈ 0.33              Precision = 18 / (18 + 10) ≈ 0.64
Recall    = 15 / (15 + 5)  = 0.75              Recall    = 18 / (18 + 2)  = 0.90

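A sketch of precision and recall built on the illustrative confusion_counts helper above:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN); returns 0 when a denominator is empty."""
    tp, fn, fp, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```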
Evaluation Metrics: Binary Classification
Confusion Matrix (Contingency table)

                 Predicted +           Predicted -
True +           True Positive (TP)    False Negative (FN)
True -           False Positive (FP)   True Negative (TN)

Specificity = TN / (FP + TN)
Recall (sensitivity) = TP / (TP + FN)
Evaluation Metrics: Binary Classification
Specificity = TN / (FP + TN)        Recall (sensitivity) = TP / (TP + FN)

RT-PCR vs Cancer (Mammogram) Test
Evaluation Metrics: Binary Classification
Receiver Operating Characteristic (ROC) Curve

True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)

ROC curve: a plot of TPR (y-axis) against FPR (x-axis) as the classification threshold is varied
Evaluation Metrics: Binary Classification
Receiver Operating Characteristic (ROC) Curve

True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)

[Figure: an example ROC curve, TPR (y-axis) versus FPR (x-axis)]
Evaluation Metrics: Binary Classification
Receiver Operating Characteristic (ROC) Curve
True Positive Rate (TPR) = TP / (TP + FN)        False Positive Rate (FPR) = FP / (FP + TN)

Point   Probability   Threshold   TPR   FPR
  +        0.90
  +        0.85
  +        0.60
  -        0.49
  -        0.40
  -        0.35
  -        0.34
  +        0.32
  -        0.20
  -        0.10

(The TPR and FPR columns are filled in by sweeping the threshold down through the predicted probabilities.)
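A hedged sketch of computing the ROC points by sweeping the threshold down through the scores, using the labels and probabilities from the table above (roc_points is an illustrative name, not a library call):

```python
import numpy as np

def roc_points(y_true, scores):
    """Return the (FPR, TPR) points traced out as the classification threshold is lowered."""
    order = np.argsort(scores)[::-1]                # sort points by decreasing score
    y = np.asarray(y_true)[order]
    P, N = y.sum(), len(y) - y.sum()
    tpr, fpr, tp, fp = [0.0], [0.0], 0, 0
    for label in y:                                 # admit one more point as "positive" each step
        tp += label
        fp += 1 - label
        tpr.append(tp / P)
        fpr.append(fp / N)
    return np.array(fpr), np.array(tpr)

# The ten points from the table (1 = positive, 0 = negative)
scores = [0.9, 0.85, 0.6, 0.49, 0.4, 0.35, 0.34, 0.32, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
fpr, tpr = roc_points(labels, scores)
```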
Evaluation Metrics: Binary Classification
Receiver Operating Characteristic (ROC) Curve
[Figures: the resulting ROC curves; figures not reproduced]
Evaluation Metrics: Binary Classification
Comparing Receiver Operating Characteristic (ROC) Curves

True Positive Rate (TPR) = TP / (TP + FN)        False Positive Rate (FPR) = FP / (FP + TN)

[Figure: ROC curves of several classifiers plotted on the same TPR/FPR axes; curves closer to the top-left corner correspond to better classifiers]
Evaluation Metrics: Binary Classification
Receiver Operating Characteristic (ROC) Curves
[Figures: a sequence of further ROC-curve examples and comparisons; figures not reproduced]
Evaluation Metrics: Binary Classification
Precision Recall Curve

Evaluation Metrics: Multi-Class Classification Metrics
[Slide content for multi-class classification metrics not reproduced]
References:
1. Tom Fawcett, "An Introduction to ROC Analysis", Pattern Recognition Letters, 2006, pp. 861-874.
2. Alaa Tharwat, "Classification Assessment Methods", Applied Computing and Informatics, Vol. 17, No. 1, 2021, pp. 168-192.
