Lecture 5: Evaluation of Classifiers


EVALUATION OF MACHINE

LEARNING CLASSIFIERS
Machine Learning

Dr. Dinesh K Vishwakarma


Professor, Department of Information Technology
Delhi Technological University, Delhi-110042, India
Email: dinesh@dtu.ac.in
Mobile:9971339840
Webpage: http://www.dtu.ac.in/web/departments/informationtechnology/faculty/dkvishwakarma.php

1
Outline: Evaluation Parameters
 Precision
 Recall
 Accuracy
 F-Measure
 True Positive Rate
 False Positive Rate
 Sensitivity
 ROC
Experiment: Training and Testing
 Objective: obtain an unbiased estimate of accuracy.
 The simplest way to split the data is to use the train-test split method.
 To get an unbiased estimate of the model’s performance, we need to evaluate it on data we didn’t use for training.
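A minimal sketch of the train-test split described above, assuming scikit-learn is available; the synthetic dataset and logistic-regression model are placeholders, not part of the original slides:

```python
# Hedged sketch: hold out part of the data so the accuracy estimate is unbiased.
from sklearn.datasets import make_classification      # placeholder dataset
from sklearn.linear_model import LogisticRegression   # placeholder model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70/30 train-test split; the test set is never used during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```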
Experiment: Training and Testing…
 How can we get an unbiased estimate of the accuracy of a learned model?
 When learning a model, you should pretend that you don’t have the test data yet (it is “in the mail”)*
 If the test-set labels influence the learned model in any way, accuracy estimates will be biased
 * In some applications it is reasonable to assume that you have access to the feature vector (i.e. x) but not the y part of each test instance
4
Learning Curve
 How does the accuracy of a learning method change as a function of the training-set size?
 This can be assessed by plotting learning curves.

# Given a training/test set partition
• for each sample size s on the learning curve
  • (optionally) repeat n times
    • randomly select s instances from the training set
    • learn a model
    • evaluate the model on the test set to determine accuracy a
    • plot (s, a), or (s, avg. accuracy) with error bars
5
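A minimal sketch of the learning-curve procedure above, assuming scikit-learn; the dataset, model, and sample sizes are illustrative placeholders:

```python
# Hedged sketch: accuracy vs. training-set size, averaged over a few random draws.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
for s in [50, 100, 200, 400, 800]:          # sample sizes on the learning curve
    accs = []
    for _ in range(5):                       # (optionally) repeat n times
        idx = rng.choice(len(X_train), size=s, replace=False)   # randomly select s instances
        model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        accs.append(accuracy_score(y_test, model.predict(X_test)))
    print(s, np.mean(accs), np.std(accs))    # plot (s, avg. accuracy, error bars)
```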
Validation (Tuning) Set
 Suppose we want unbiased estimates of accuracy during the learning process (e.g. to choose the best level of decision-tree pruning). This is done by holding out a portion (subset) of the training data; the method is called the validation set approach.
 Partition the training data into separate training/validation sets.


Limitation of Single Training/Test Partition
 We may not have enough data to make sufficiently large training and test sets:
 a larger test set gives us a more reliable estimate of accuracy (i.e. a lower-variance estimate)
 but… a larger training set will be more representative of how much data we actually have for the learning process
 A single training set doesn’t tell us how sensitive accuracy is to a particular training sample
7
Random Sampling
 The second issue can be addressed by repeatedly and randomly partitioning the available data into training and test sets.

8
Random Sampling…
 When randomly selecting training or validation sets, we may want to ensure that class proportions are maintained in each selected set.
 This can be done via stratified sampling: first stratify (divide) the instances by class, then randomly select instances from each class proportionally.

9
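A minimal sketch of stratified sampling using scikit-learn's train_test_split (the stratify argument keeps class proportions; the dummy data below are placeholders):

```python
# Hedged sketch: stratified split on an imbalanced label vector.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)                 # dummy features
y = np.array([0] * 900 + [1] * 100)                # 90% / 10% class imbalance

# stratify=y keeps the 90/10 class proportions in both partitions
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print("train class-1 fraction:", y_tr.mean())      # ~0.10
print("valid class-1 fraction:", y_va.mean())      # ~0.10
```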
Cross Validation
 The train-test split has limitations: when the dataset is small, the method is prone to high variance.
 Due to the random partition, the results can be entirely different for different test sets: in some partitions, samples that are easy to classify get into the test set, while in others the test set receives the ‘difficult’ ones.
 To deal with this issue, we use cross-validation to evaluate the performance of a machine learning model.
10
Cross Validation…
 K-Fold Cross Validation
 In k-fold CV, we first divide our dataset into k equally sized subsets. Then we repeat the train-test procedure k times, such that each time one of the k subsets is used as the test set and the remaining k−1 subsets are used together as the training set.
 Finally, we compute the estimate of the model’s performance by averaging the scores over the k trials.

11
K-Fold Cross Validation
 Example of 3-fold
 For example, let’s suppose that we have a dataset S = {x1, x2, x3, x4, x5, x6}.
 First we divide the samples into 3 folds: S1 = {x1, x2}, S2 = {x3, x4}, S3 = {x5, x6}. Then we evaluate the model as

   Overall Score = (Score1 + Score2 + Score3) / 3

12
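A minimal sketch of the 3-fold procedure above using scikit-learn's KFold; the dataset and model are placeholders, and cross_val_score performs a similar loop in one call:

```python
# Hedged sketch: 3-fold cross validation, averaging the per-fold scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # Score_k on fold k
print("Overall Score:", np.mean(scores))                    # (Score1 + Score2 + Score3) / 3

# A similar one-liner (note: for classifiers it defaults to stratified folds):
print(cross_val_score(model, X, y, cv=3).mean())
```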
5-fold Cross Validation
 Partition the data into n subsamples (folds S1, …, Sn).
 Iteratively leave one subsample out for the test set, and train on the rest.
13
5-fold Cross Validation…
 Suppose we have 100 instances, and we want to estimate accuracy with cross validation.
 10-fold cross validation is common, but smaller numbers of folds are often used when learning takes a lot of time.
Leave-One-Out Cross-Validation
 LOOCV
• We train our machine-learning model n times, where n is the dataset size.
• Each time, only one sample is used as the test set while the rest are used to train our model.
• LOO on the previous example S = {x1, x2, x3, x4, x5, x6}:
  • S1 = {x1},  S2 = {x2},  S3 = {x3},  S4 = {x4},  S5 = {x5},  S6 = {x6}
15
LOOCV…

   Overall Score = (Score1 + Score2 + Score3 + Score4 + Score5 + Score6) / 6
16
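A minimal LOOCV sketch with scikit-learn's LeaveOneOut, using the same averaging as above; the dataset and model are placeholders:

```python
# Hedged sketch: leave-one-out cross validation on a small dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

X, y = make_classification(n_samples=30, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = []
for train_idx, test_idx in LeaveOneOut().split(X):        # n iterations, one test sample each
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # 1 if the held-out sample is correct
print("Overall Score:", np.mean(scores))
```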
Cross Validation… Summary
 When the dataset size is small, LOOCV is more appropriate since it will use more training samples in each iteration.
 Conversely, we use k-fold cross-validation to train a model on a large dataset, which reduces the training time.
 Preferring k-fold cross-validation over LOOCV is one example of the bias-variance trade-off: it reduces the variance shown by LOOCV and introduces some bias by holding out a substantially larger validation set.
17
Confusion Matrix
 It is also called a prediction table.
 It is an N × N matrix used for evaluating the performance of a classification model, where N is the number of target classes.
 It compares the actual target values with those predicted by the model.
 The columns represent the actual values of the target variable.
 The rows represent the predicted values of the target variable.
Confusion Matrix…

19
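A minimal sketch of building a confusion matrix with scikit-learn; note that sklearn's convention puts actual values on rows and predictions on columns, the transpose of the layout described on the previous slide. The example labels are placeholders:

```python
# Hedged sketch: confusion matrix for a binary problem.
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]

# Rows = actual class, columns = predicted class (sklearn convention):
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```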
Type-I and Type-II Error

20

Precision
 Precision measures the correctness achieved in positive prediction, i.e. it tells us how many of all the predicted positives are actually positive. Precision should be high (ideally 1).
 “Precision is a useful metric in cases where False Positives are a higher concern than False Negatives.”
 Precision / Positive Predictive Value: P = tp / (tp + fp)
 Recall: R = tp / (tp + fn)

21
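A minimal sketch computing precision and recall both from the formulas above and with scikit-learn; the example labels are placeholders:

```python
# Hedged sketch: P = tp/(tp+fp), R = tp/(tp+fn).
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall:   ", tp / (tp + fn), recall_score(y_true, y_pred))
```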
Issues with “Precision & Recall”

 (Two confusion matrices from the slide, each laid out as TP FP / FN TN.)
 Both classifiers give the same precision and recall values of 66.7% and 40% (note: the data sets are different).
 They exhibit very different behaviours:
 Same positive recognition rate
 Extremely different negative recognition rate: strong on the left / nil on the right
 Note: Accuracy has no problem catching this!
22

A combined measure: F
 The combined measure that assesses the precision/recall tradeoff is the F measure (a weighted harmonic mean):

   F = 1 / ( α·(1/P) + (1−α)·(1/R) ) = (β² + 1)·P·R / (β²·P + R)

• high F1 score if both Precision and Recall are high
• low F1 score if both Precision and Recall are low
• medium F1 score if one of Precision and Recall is low and the other is high

 People usually use the balanced F1 measure
 i.e., with β = 1 (equivalently, α = ½)
 The harmonic mean is a conservative average.
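A minimal sketch of the F measure from the formula above, compared against scikit-learn's f1_score / fbeta_score; the labels are placeholders:

```python
# Hedged sketch: F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives F1.
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

P = precision_score(y_true, y_pred)
R = recall_score(y_true, y_pred)

def f_beta(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

print("F1 (formula):", f_beta(P, R, beta=1.0))
print("F1 (sklearn):", f1_score(y_true, y_pred))
print("F2 (sklearn):", fbeta_score(y_true, y_pred, beta=2.0))
```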
Accuracy
 Accuracy measures the fraction of correct predictions.
 The accuracy metric is not suited for imbalanced classes.
 Accuracy has its own disadvantages: for imbalanced data, if the model predicts that every point belongs to the majority class label, the accuracy will be high, yet the model is not accurate.
 Accuracy is a valid evaluation choice for classification problems that are well balanced, i.e. not skewed and with no class imbalance.
24

Accuracy Measure
 The accuracy of an engine is the fraction of its classifications that are correct:

   Accuracy(%) = (tp + tn) / (tp + tn + fn + fp) × 100

25
Accuracy Measure
  y (label: 0 = Negative, 1 = Positive)   ŷ (predicted value)   Output at threshold 0.5
  0                                        0.3                   0
  1                                        0.4                   0
  0                                        0.7                   1
  1                                        0.8                   1
  0                                        0.4                   0
  1                                        0.7                   1

  Confusion matrix:  TP = 2   FP = 1
                     FN = 1   TN = 2

  Accuracy = 4/6 = 0.666
  Recall = TP / (TP + FN) = 2/3 = 0.666
  Precision = TP / (TP + FP) = 2/3 = 0.666
26
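A minimal sketch reproducing the worked example above: thresholding the predicted scores at 0.5 and computing accuracy, precision, and recall (scikit-learn assumed):

```python
# Hedged sketch: the six labelled points from the slide, thresholded at 0.5.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true  = [0, 1, 0, 1, 0, 1]
y_score = [0.3, 0.4, 0.7, 0.8, 0.4, 0.7]

y_pred = [1 if s > 0.5 else 0 for s in y_score]        # output at threshold 0.5

print("accuracy: ", accuracy_score(y_true, y_pred))    # 4/6 ~ 0.666
print("precision:", precision_score(y_true, y_pred))   # 2/3 ~ 0.666
print("recall:   ", recall_score(y_true, y_pred))      # 2/3 ~ 0.666
```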
Issues with Accuracy
 Consider a 2-class problem
 Number of Class 0 examples = 9990
 Number of Class 1 examples = 10
 If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
 Accuracy is misleading because the model does not detect any class 1 example



Issues with Accuracy…

 Both classifiers give 60% accuracy.
 They exhibit very different behaviors:
 On the left: weak positive recognition rate / strong negative recognition rate
 On the right: strong positive recognition rate / weak negative recognition rate
28
Is accuracy an adequate measure?
 Accuracy may not be a useful measure in cases where
 there is a large class skew
 Is 98% accuracy good if 97% of the instances are negative?
 there are differential misclassification costs – say,
getting a positive wrong costs more than getting a
negative wrong.
 Consider a medical domain in which a false positive results in
an extraneous test but a false negative results in a failure to
treat a disease
 we are most interested in a subset of high-confidence
predictions
29
Misclassification Error
 Recognition rate = accuracy = success rate
 Misclassification rate = failure rate

 Misclassification Error = (FN + FP) / (TP + FP + TN + FN) = (5 + 10) / (50 + 10 + 5 + 100) = 0.09

 Error in percentage = (FN + FP) / (TP + FP + TN + FN) × 100
Sensitivity & Specificity
 Sensitivity is the metric that evaluates a
model’s ability to predict true positives of each
available category.
 Specificity is the metric that evaluates a
model’s ability to predict true negatives of each
available category.

31
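A minimal sketch computing sensitivity and specificity from a confusion matrix, using the standard definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP); the example labels are placeholders:

```python
# Hedged sketch: sensitivity (true positive rate) and specificity (true negative rate).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # ability to find the positives
specificity = tn / (tn + fp)   # ability to find the negatives
print("sensitivity:", sensitivity, "specificity:", specificity)
```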
Find Sensitivity and Specificity

32
Other form of Accuracy Metrics

33
ROC/AUC
 A Receiver Operating Characteristic (ROC) curve / Area Under Curve (AUC) plots the TP-rate vs. the FP-rate as a threshold on the confidence of an instance being positive is varied.
 Different methods can work better in different parts of ROC space; this depends on the cost of false positives vs. false negatives.
 The diagonal is the expected curve for random guessing.
34
Area Under the Receiver Operating Characteristics (AUC-ROC)
 The AUC-ROC curve measures the performance at various threshold settings.
 ROC is a probability curve and AUC represents the degree or measure of separability.
 AUC tells us the model’s capability of distinguishing between classes.
 The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1.
 The ROC curve is plotted between TPR & FPR, where TPR is on the y-axis and FPR is on the x-axis.

35
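A minimal sketch of computing ROC points and the AUC with scikit-learn's roc_curve and roc_auc_score; the scores below are placeholders:

```python
# Hedged sketch: ROC points (FPR, TPR) over all thresholds, plus the AUC.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 1, 0, 1, 0, 1]
y_score = [0.3, 0.55, 0.75, 0.8, 0.4, 0.7]   # model confidence that each instance is positive

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

print("AUC:", roc_auc_score(y_true, y_score))
```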
ROC curves & Misclassification costs
 Best operating point when FN costs 10× FP
 Best operating point when the cost of misclassifying positives and negatives is equal
 Best operating point when FP costs 10× FN
36
Create ROC of a model
 Consider a prediction table at different threshold settings:

  y (label: 0 = Negative, 1 = Positive)   ŷ (predicted value)   @0.5   @0.6   @0.72   @0.8
  0                                        0.3                   0      0      0       0
  1                                        0.55                  1      0      0       0
  0                                        0.75                  1      1      1       0
  1                                        0.8                   1      1      1       1
  0                                        0.4                   0      0      0       0
  1                                        0.7                   1      1      0       0

  Threshold setting (0.5):  TP=3  FP=1  TN=2  FN=0   TPR = 3/(3+0) = 1   FPR = 1/(1+2) = 0.33

37
Create ROC of a model…
 Threshold setting (0.6) — using the same prediction table as above:

  Threshold setting (0.6):  TP=2  FP=1  TN=2  FN=1   TPR = 2/(2+1) = 0.66   FPR = 1/(1+2) = 0.33

38
Create ROC of a model…
 Threshold setting (0.72) — using the same prediction table as above:

  Threshold setting (0.72):  TP=1  FP=1  TN=2  FN=2   TPR = 1/(1+2) = 0.33   FPR = 1/(1+2) = 0.33

39
Create ROC of a model…
 Threshold setting (0.80) — using the same prediction table as above:

  Threshold setting (0.80):  TP=1  FP=0  TN=3  FN=2   TPR = 1/(1+2) = 0.33   FPR = 0

40
Plot of ROC
 Threshold setting (0.5):   TP=3  FP=1  TN=2  FN=0   TPR = 3/3 = 1     FPR = 1/3 = 0.33
 Threshold setting (0.6):   TP=2  FP=1  TN=2  FN=1   TPR = 2/3 = 0.66  FPR = 1/3 = 0.33
 Threshold setting (0.72):  TP=1  FP=1  TN=2  FN=2   TPR = 1/3 = 0.33  FPR = 1/3 = 0.33
 Threshold setting (0.80):  TP=1  FP=0  TN=3  FN=2   TPR = 1/3 = 0.33  FPR = 0

 [ROC plot: the (FPR, TPR) points above, with TPR on the y-axis and FPR on the x-axis, both ranging from 0 to 1.]
Steps to create ROC
 Sort test-set predictions according to the confidence that each instance is positive.
 Step through the sorted list from high to low confidence:
 locate a threshold between instances with opposite classes (keeping instances with the same confidence value on the same side of the threshold)
 compute TPR and FPR for instances above the threshold
 output the (FPR, TPR) coordinate

42
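A minimal sketch implementing the steps above by hand: sweep a threshold over the sorted confidence values and emit an (FPR, TPR) point at each step. The scores reuse the earlier worked example:

```python
# Hedged sketch: manual ROC-point computation by sweeping a threshold over the scores.
y_true  = [0, 1, 0, 1, 0, 1]
y_score = [0.3, 0.55, 0.75, 0.8, 0.4, 0.7]

P = sum(y_true)            # total positives
N = len(y_true) - P        # total negatives

# Step through candidate thresholds from high to low confidence.
for th in sorted(set(y_score), reverse=True):
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= th and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= th and y == 0)
    print(f"threshold={th:.2f}  (FPR, TPR) = ({fp / N:.2f}, {tp / P:.2f})")
```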
Example of ROC Plot

43
Example of ROC Plot …
 Rearrange the samples according to class:

  Correct class   Instance   Confidence positive
  +               Ex 9       0.99
  +               Ex 7       0.98
  +               Ex 2       0.70     (positive class)
  +               Ex 6       0.65
  +               Ex 5       0.24
  -               Ex 1       0.72
  -               Ex 10      0.51
  -               Ex 3       0.39     (negative class)
  -               Ex 4       0.11
  -               Ex 8       0.01
44
Example of ROC Plot …
 For threshold 0.72:

  Correct class   Instance   Confidence positive   Predicted class
  +               Ex 9       0.99                  +
  +               Ex 7       0.98                  +
  +               Ex 2       0.70                  -
  +               Ex 6       0.65                  -
  +               Ex 5       0.24                  -
  -               Ex 1       0.72                  +
  -               Ex 10      0.51                  -
  -               Ex 3       0.39                  -
  -               Ex 4       0.11                  -
  -               Ex 8       0.01                  -

  Confidence ≥ threshold → positive class, else → negative class
  TP = 2, FP = 1, TN = 4, FN = 3
  TPR = TP/(TP + FN) = 2/5
  FPR = FP/(FP + TN) = 1/5
Example of ROC Plot …
 For threshold 0.65:

  Correct class   Instance   Confidence positive   Predicted class
  +               Ex 9       0.99                  +
  +               Ex 7       0.98                  +
  +               Ex 2       0.70                  +
  +               Ex 6       0.65                  +
  +               Ex 5       0.24                  -
  -               Ex 1       0.72                  +
  -               Ex 10      0.51                  -
  -               Ex 3       0.39                  -
  -               Ex 4       0.11                  -
  -               Ex 8       0.01                  -

  Confidence ≥ threshold → positive class, else → negative class
  TP = 4, FP = 1, TN = 4, FN = 1
  TPR = TP/(TP + FN) = 4/5
  FPR = FP/(FP + TN) = 1/5
Significance of ROC

 This is an ideal situation: the two class distributions don’t overlap at all, which means the model has an ideal measure of separability.
 It is perfectly able to distinguish between the positive class and the negative class.

47
Significance of ROC…

 When the two distributions overlap, type 1 and type 2 errors are introduced.
 Depending upon the threshold, these errors can be minimized or maximized. When AUC is 0.7, it means there is a 70% chance that the model will be able to distinguish between the positive class and the negative class.
Significance of ROC…

 This is the worst situation.
 When AUC is approximately 0.5, the model has no discrimination capacity to distinguish between the positive class and the negative class.

49
Significance of ROC…

 When AUC is approximately 0, the model is actually reciprocating the classes: it is predicting the negative class as the positive class and vice versa.
 (Figure annotation: TPR↑, FPR↑ and TPR↓, FPR↓.)


Issues with ROC/AUC
 AUC/ROC has been adopted as a replacement for accuracy, but it has also drawn some criticism:
 The ROC curves on which the AUCs of different classifiers are based may cross, thus not giving an accurate picture of what is really happening.
 The misclassification cost distributions used by the AUC are different for different classifiers.
 Therefore, we may be comparing “apples and oranges”, as the AUC may give more weight to misclassifying a point by classifier A than it does by classifier B. (A proposed answer: the H-measure.)
51
Other Accuracy Metrics

52
Precision/recall curves
 A precision/recall curve plots the precision vs.
recall (TP-rate) as a threshold on the confidence
of an instance being positive is varied.

53
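A minimal sketch of a precision/recall curve with scikit-learn's precision_recall_curve; the scores are placeholders:

```python
# Hedged sketch: precision and recall at every confidence threshold.
from sklearn.metrics import precision_recall_curve

y_true  = [0, 1, 0, 1, 0, 1]
y_score = [0.3, 0.55, 0.75, 0.8, 0.4, 0.7]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f}  recall={r:.2f}")
```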
Comment on ROC/PR Curve
 Both
 allow predictive performance to be assessed at various levels of
confidence
 assume binary classification tasks
 sometimes summarized by calculating area under the curve
 ROC curves
 insensitive to changes in class distribution (ROC curve does not
change if the proportion of positive and negative instances in the
test set are varied)
 can identify optimal classification thresholds for tasks with
differential misclassification costs
 Precision/Recall curves
 show what fraction of the positive predictions are false positives (via precision)
 well suited for tasks with lots of negative instances
54
Loss Function
 Mean Square Error Loss Function
 It is used for regression problems.
 The mean square error loss for m data points is defined as

   L_SE = (1/m) · Σ_{i=1..m} (y_i − ŷ_i)²

 For a single point, L_SE⁽¹⁾ = (y − ŷ)².
 Binary Cross Entropy Loss Function
 It is used for classification problems.
 BCE loss is defined as

   L_CE = −(1/m) · Σ_{i=1..m} [ y_i ln ŷ_i + (1 − y_i) ln(1 − ŷ_i) ]

55
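A minimal NumPy sketch of the two loss functions above; a small epsilon guards the logarithm, and the sample vectors are placeholders:

```python
# Hedged sketch: mean square error and binary cross entropy over m data points.
import numpy as np

def mse_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def bce_loss(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([0, 1, 1, 0])
y_hat = np.array([0.1, 0.8, 0.6, 0.3])
print("L_SE:", mse_loss(y, y_hat))
print("L_CE:", bce_loss(y, y_hat))
```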
Example
 Consider a 2-class problem. When the ground truth is y = 0, L_SE⁽¹⁾ = ŷ², and when y = 1, L_SE⁽¹⁾ = (1 − ŷ)².
 Similarly, L_CE⁽¹⁾ = −ln(1 − ŷ) for y = 0, and −ln(ŷ) for y = 1.
 Consider the example y = 0 and ŷ = 0.9:
 L_SE = 0.81
 Similarly, L_CE = −ln(0.1) ≈ 2.3
 Gradients: ∂L_SE/∂ŷ = 2ŷ = 1.8 and ∂L_CE/∂ŷ = 1/(1 − ŷ) = 10.0
 The cross entropy loss penalizes the model more.
56
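A quick numeric check of the example above (y = 0, ŷ = 0.9), comparing the analytic gradients with finite differences; this verification is illustrative and not part of the original slides:

```python
# Hedged sketch: verify L_SE, L_CE and their gradients at y = 0, y_hat = 0.9.
import numpy as np

y, y_hat, h = 0.0, 0.9, 1e-6
L_se = lambda p: (y - p) ** 2                                   # squared error for one point
L_ce = lambda p: -(y * np.log(p) + (1 - y) * np.log(1 - p))     # cross entropy for one point

print("L_SE =", L_se(y_hat))                                              # 0.81
print("L_CE =", L_ce(y_hat))                                              # ~2.303
print("dL_SE/dy_hat ~", (L_se(y_hat + h) - L_se(y_hat - h)) / (2 * h))    # ~1.8
print("dL_CE/dy_hat ~", (L_ce(y_hat + h) - L_ce(y_hat - h)) / (2 * h))    # ~10.0
```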
Practice question
Q1. Suppose a computer program for recognizing dogs in photographs
identifies eight dogs in a picture containing 12 dogs and some cats. Of the
eight dogs identified, five actually are dogs while the rest are cats.
Compute the Precision and recall of the computer program.
Solution:
                    Actual dogs    Actual cats
Predicted dogs      TP = 5         FP = 3
Predicted cats      FN = 7         TN (not given)

P = TP/(TP + FP) = 5/(5 + 3) = 5/8
R = TP/(TP + FN) = 5/(5 + 7) = 5/12
57
Practice question
Q2. A database contains 80 records on a particular topic of which 55 are
relevant to a certain investigation. A search was conducted on that topic
and 50 records were retrieved. Of the 50 records retrieved, 40 were
relevant. Construct the confusion matrix for the search and calculate the
precision and recall scores for the search. Each record may be assigned a
class label “relevant" or “not relevant”.
Solution: All the 80 records were tested for relevance. The test classified 50
records as “relevant”. But only 40 of them were actually relevant.
                          Actual relevant    Actual not relevant
Predicted relevant        40 (TP)            10 (FP)
Predicted not relevant    15 (FN)            15 (TN)

58
Practice question
 TP = 40
 FP = 10
 FN = 15
 TN = 15

 The precision P is P = TP/(TP + FP) = 40/(40 + 10) = 4/5

 The recall R is R = TP/(TP + FN) = 40/(40 + 15) = 40/55

59
Practice question
 Using the data in the confusion matrix of a classifier on a two-class dataset, several other measures of performance can be calculated as well.
 Accuracy = (TP + TN)/(TP + TN + FP + FN) = 55/80
 Error rate = 1 − Accuracy = 25/80
 Sensitivity = TP/(TP + FN) = 40/55
 Specificity = TN/(TN + FP) = 15/25
 F-measure = (2 × TP)/(2 × TP + FP + FN) = 80/105

60
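A quick check of the measures above, computed from the confusion-matrix counts implied by the problem statement (80 records, 55 relevant, 50 retrieved of which 40 relevant, so TP=40, FP=10, FN=15, TN=15); illustrative only:

```python
# Hedged sketch: derived measures for the search example.
TP, FP, FN, TN = 40, 10, 15, 15

accuracy    = (TP + TN) / (TP + TN + FP + FN)     # 55/80
error_rate  = 1 - accuracy                        # 25/80
sensitivity = TP / (TP + FN)                      # 40/55
specificity = TN / (TN + FP)                      # 15/25
f_measure   = (2 * TP) / (2 * TP + FP + FN)       # 80/105

print(accuracy, error_rate, sensitivity, specificity, f_measure)
```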
Practice question
Q3. Let there be 10 balls (6 white and 4 red) in a box, and let it be required to pick out the red balls. Suppose we pick 7 balls as red balls, of which only 2 are actually red. What are the values of precision and recall in picking red balls?

Solution:
 TP = 2
 FP = 7 − 2 = 5
 FN = 4 − 2 = 2
The precision P is P = TP/(TP + FP) = 2/(2 + 5) = 2/7
The recall R is R = TP/(TP + FN) = 2/(2 + 2) = 1/2
