Classification Error: Training Errors and Generalization Errors


Classification error

 The errors committed by a classification model are generally divided into two types:
 Training errors
 Generalization errors
 Training error is the number of misclassification
errors committed on training records.
 Training error is also known as resubstitution error or
apparent error.
 Generalization error is the expected error of the
model on previously unseen records.

1
Classification error
 A good classification model should
 Fit the training data well. (low training error)
 Accurately classify records it has never seen
before. (low generalization error)
 A model that fits the training data too well can
have a poor generalization error.
 This is known as model overfitting.

2
Classification error
 We consider the 2-D data set in the following
figure.
 The data set contains data points that belong
to two different classes.
 30% of the points are chosen for training,
while the remaining 70% are used for testing.
 A decision tree classifier is built using the
training set.
 Different levels of pruning are applied to the
tree to investigate the effect of overfitting.
3
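The slide's 2-D data set is not reproduced here, so the sketch below uses a synthetic stand-in (scikit-learn's make_classification) to illustrate the same experiment: a 30%/70% train/test split and decision trees of varying size standing in for different levels of pruning. The leaf counts and random seeds are illustrative assumptions, not values from the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2-D, two-class data with some label noise (stand-in for the slide's data set).
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

# 30% of the points for training, the remaining 70% for testing (as on the slide).
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=0)

# Vary the tree size (number of leaf nodes) to mimic different levels of pruning.
for leaves in (2, 4, 8, 16, 32, 64, 128):
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(f"{leaves:4d} leaves: train error {train_err:.3f}, test error {test_err:.3f}")
```

Plotting these two error rates against the tree size reproduces the underfitting/overfitting behaviour discussed on the following slides.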
Classification error

[Figure: the 2-D data set containing data points from two classes]
4
Classification error
 The following figure shows the training and
test error rates of the decision tree.
 Both error rates are large when the size of
the tree is very small.
 This situation is known as model underfitting.
 Underfitting occurs because the model
cannot learn the true structure of the data.
 It performs poorly on both the training and
test sets.
5
Classification error

[Figure: training and test error rates of the decision tree versus tree size]
6
Classification error
 When the tree becomes too large
 The training error rate continues to decrease.
 However, the test error rate begins to increase.
 This phenomenon is known as model
overfitting.

7
Overfitting
 The training error can be reduced by
increasing the model complexity.
 However, the test error can be large because
the model may accidentally fit some of the
noise points in the training data.
 In other words, the performance of the model
on the training set does not generalize well to
the test examples.

8
Overfitting
 We consider a training and test set for a
mammal classification problem.
 Two of the ten training records are mislabeled.
 Bats and whales are labeled as non-
mammals instead of mammals.

9
Training set
Name             Body Temperature   Gives Birth   Four-Legged   Hibernates   Class Label
porcupine        warm-blooded       yes           yes           yes          yes
cat              warm-blooded       yes           yes           no           yes
bat              warm-blooded       yes           no            yes          no*
whale            warm-blooded       yes           no            no           no*
salamander       cold-blooded       no            yes           yes          no
komodo dragon    cold-blooded       no            yes           no           no
python           cold-blooded       no            no            yes          no
salmon           cold-blooded       no            no            no           no
eagle            warm-blooded       no            no            no           no
guppy            cold-blooded       yes           no            no           no
(* mislabeled records)

10
Test set
Name             Body Temperature   Gives Birth   Four-Legged   Hibernates   Class Label
human            warm-blooded       yes           no            no           yes
pigeon           warm-blooded       no            no            no           no
elephant         warm-blooded       yes           yes           no           yes
leopard shark    cold-blooded       yes           no            no           no
turtle           cold-blooded       no            yes           no           no
penguin          warm-blooded       no            no            no           no
eel              cold-blooded       no            no            no           no
dolphin          warm-blooded       yes           no            no           yes
spiny anteater   warm-blooded       no             yes          yes          yes
gila monster     cold-blooded       no             yes          yes          no

11
Overfitting
 A decision tree that perfectly fits the training
data is shown in the following figure.
 The training error rate for the tree is 0%.
 However, its error rate on the test set is 30%.

12
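A minimal sketch of this experiment, assuming a 0/1 encoding of the four attributes (warm-blooded = 1, yes = 1, no = 0) and scikit-learn's decision tree learner. A fully grown tree reaches 0% training error on these records; the exact test error depends on the greedy splits chosen, so it may not match the slide's 30% precisely.

```python
from sklearn.tree import DecisionTreeClassifier

# Records: (name, body_temp, gives_birth, four_legged, hibernates, label), label 1 = mammal.
train = [
    ("porcupine", 1, 1, 1, 1, 1), ("cat", 1, 1, 1, 0, 1),
    ("bat", 1, 1, 0, 1, 0),       # mislabeled as non-mammal
    ("whale", 1, 1, 0, 0, 0),     # mislabeled as non-mammal
    ("salamander", 0, 0, 1, 1, 0), ("komodo dragon", 0, 0, 1, 0, 0),
    ("python", 0, 0, 0, 1, 0), ("salmon", 0, 0, 0, 0, 0),
    ("eagle", 1, 0, 0, 0, 0), ("guppy", 0, 1, 0, 0, 0),
]
test = [
    ("human", 1, 1, 0, 0, 1), ("pigeon", 1, 0, 0, 0, 0),
    ("elephant", 1, 1, 1, 0, 1), ("leopard shark", 0, 1, 0, 0, 0),
    ("turtle", 0, 0, 1, 0, 0), ("penguin", 1, 0, 0, 0, 0),
    ("eel", 0, 0, 0, 0, 0), ("dolphin", 1, 1, 0, 0, 1),
    ("spiny anteater", 1, 0, 1, 1, 1), ("gila monster", 0, 0, 1, 1, 0),
]
X_train, y_train = [r[1:5] for r in train], [r[5] for r in train]
X_test, y_test = [r[1:5] for r in test], [r[5] for r in test]

# Grow the tree until it fits the (partly mislabeled) training data perfectly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training error:", 1 - tree.score(X_train, y_train))  # 0.0
print("test error:", 1 - tree.score(X_test, y_test))        # roughly the 30% reported on the slide
```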
Overfitting

[Figure: decision tree that perfectly fits the training data]
13
Overfitting
 Both humans and dolphins are misclassified
as non-mammals.
 Their attribute values for Body Temperature,
Gives Birth and Four-legged are identical to
the mislabeled records in the training set.
 On the other hand, the spiny anteater represents an
exceptional case.
 The class label of the test record contradicts
the class labels of other similar records in the
training set.

14
Overfitting
 In contrast, the simpler decision tree in the
following figure has
 A somewhat higher training error rate (20%)
but
 A lower test error rate (10%).
 It can be seen that the Four-legged attribute
test condition in the first model is spurious.
 It fits the mislabeled training records, which
leads to the misclassification of records in the
test set.

15
Overfitting

[Figure: the simpler decision tree]
16
Generalization error estimation
 The ideal classification model is the one that
produces the lowest generalization error.
 The problem is that the model has no
knowledge of the test set.
 It has access only to the training set.
 We consider the following approaches to
estimate the generalization error
 Resubstitution estimate
 Estimates incorporating model complexity
 Using a validation set
17
Resubstitution estimate
 The resubstitution estimate approach
assumes that the training set is a good
representation of the overall data.
 However, the training error is usually an
optimistic estimate of the generalization error.

18
Resubstitution estimate
 We consider the two decision trees shown in
the following figure.
 The left tree T_L is more complex than the right tree T_R.
 The training error rate for T_L is e(T_L) = 4/24 = 0.167.
 The training error rate for T_R is e(T_R) = 6/24 = 0.25.
 Based on the resubstitution estimate, T_L is considered better than T_R.
19
Resubstitution estimate

[Figure: the two decision trees T_L (left) and T_R (right)]
20
Estimates incorporating model
complexity
 The chance for model overfitting increases as
the model becomes more complex.
 As a result, we should prefer simpler models.
 Based on this principle, we can estimate the
generalization error as the sum of
 Training error and
 A penalty term for model complexity.

21
Estimates incorporating model
complexity
 In the case of a decision tree, let
 L be the number of leaf nodes,
 n_l be the l-th leaf node,
 m(n_l) be the number of training records classified by node n_l,
 e(n_l) be the number of misclassified records by node n_l,
 ζ(n_l) be a penalty term associated with node n_l.
 The resulting error rate e_c of the decision tree can be estimated as follows:

e_c = [ Σ_{l=1}^{L} ( e(n_l) + ζ(n_l) ) ] / [ Σ_{l=1}^{L} m(n_l) ]

22
Estimates incorporating model
complexity
 We consider the previous two decision trees T_L and T_R.
 We assume that the penalty term is equal to 0.5 for each leaf node.
 The error rate estimate for T_L is
e_c(T_L) = (4 + 7 × 0.5) / 24 = 7.5 / 24 = 0.3125
 The error rate estimate for T_R is
e_c(T_R) = (6 + 4 × 0.5) / 24 = 8 / 24 = 0.3333

23
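The worked example above reduces to a one-line formula when, as on the slide, ζ(n_l) is the same constant for every leaf. The helper below is a small sketch of that computation; the function name is illustrative.

```python
def generalization_estimate(total_errors, num_leaves, num_records, penalty=0.5):
    # e_c = (training errors + penalty per leaf * number of leaves) / training records
    return (total_errors + penalty * num_leaves) / num_records

print(generalization_estimate(4, 7, 24))               # T_L: 0.3125
print(generalization_estimate(6, 4, 24))               # T_R: 0.3333...
print(generalization_estimate(4, 7, 24, penalty=1.0))  # T_L with penalty 1: 0.4583...
print(generalization_estimate(6, 4, 24, penalty=1.0))  # T_R with penalty 1: 0.4166...
```

The last two values correspond to the penalty term of 1 discussed two slides below.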
Estimates incorporating model
complexity
 Based on this penalty term, T_L is better than T_R.
 For a binary tree, a penalty term of 0.5 means that a node
should always be expanded into its two child nodes if doing so
improves the classification of at least one training record.
 This is because expanding a node adds only 0.5 to the
error estimate (one extra leaf), which is less costly than
committing one additional training error.

24
Estimates incorporating model
complexity
 Suppose the penalty term is equal to 1 for all
the leaf nodes.
 The error rate estimate for T_L becomes 0.458.
 The error rate estimate for T_R becomes 0.417.
 Based on this penalty term, T_R is better than T_L.
 A penalty term of 1 means that, for a binary
tree, a node should not be expanded unless it
reduces the classification error by more than
one training record.
25
Using a validation set
 In this approach, the original training data is
divided into two smaller subsets.
 One of the subsets is used for training.
 The other, known as the validation set, is
used for estimating the generalization error.

26
Using a validation set
 This approach can be used in the case where
the complexity of the model is determined by
a parameter.
 We can adjust the parameter until the
resulting model attains the lowest error on the
validation set.
 This approach provides a better way for
estimating how well the model performs on
previously unseen records.
 However, less data are available for training.

27
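A hedged sketch of the validation-set approach, assuming the complexity parameter being tuned is the tree size (max_leaf_nodes) and that 30% of the original training data is held out for validation; both choices, and the function name, are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def tune_tree_size(X, y, candidate_sizes=(2, 4, 8, 16, 32)):
    # Split the original training data into a smaller training set and a validation set.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    best_size, best_err = None, float("inf")
    for size in candidate_sizes:
        tree = DecisionTreeClassifier(max_leaf_nodes=size, random_state=0).fit(X_tr, y_tr)
        err = 1 - tree.score(X_val, y_val)  # validation error as a generalization estimate
        if err < best_err:
            best_size, best_err = size, err
    # Refit on all of the original training data using the selected complexity.
    return DecisionTreeClassifier(max_leaf_nodes=best_size, random_state=0).fit(X, y)
```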
Handling overfitting in decision trees
 There are two approaches for avoiding model
overfitting in decision trees
 Pre-pruning
 Post-pruning

28
Pre-pruning
 In this approach, the tree growing algorithm is
halted before generating a fully grown tree
that perfectly fits the training data.
 To do this, an alternative stopping condition
could be used.
 For example, we can stop expanding a node
when the observed change in impurity
measure falls below a certain threshold.

29
Pre-pruning
 The advantage of this approach is that it
avoids generating overly complex sub-trees
that overfit the training data.
 However, it is difficult to choose the right
threshold for the change in impurity measure.
 A threshold which is too high will result in
underfitted models.
 A threshold which is too low may not be
sufficient to overcome the model overfitting
problem.

30
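As a concrete illustration of such a stopping condition, scikit-learn's DecisionTreeClassifier exposes a min_impurity_decrease parameter: a node is split only if the (weighted) impurity decrease is at least this threshold. The value 0.01 below is only an example; choosing it well is exactly the difficulty described above.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop expanding a node when the impurity improvement falls below 0.01.
pre_pruned = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0)
# pre_pruned.fit(X_train, y_train)  # X_train, y_train: any training set, e.g. the mammal data above
```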
Post-pruning
 In this approach, the decision tree is initially
grown to its maximum size.
 This is followed by a tree pruning step, which
trims the fully grown tree.

31
Post-pruning
 Trimming can be done by replacing a sub-
tree with a new leaf node whose class label is
determined from the majority class of records
associated with the node.
 The tree pruning step terminates when no
further improvement is observed.

32
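One widely used post-pruning scheme is cost-complexity pruning, available in scikit-learn through ccp_alpha; it is not the same subtree-replacement procedure described above, but it follows the same grow-then-trim idea. The sketch below keeps the pruned tree that does best on held-out data; the function name is illustrative.

```python
from sklearn.tree import DecisionTreeClassifier

def post_pruned_tree(X_train, y_train, X_val, y_val):
    # Grow the tree to its maximum size first.
    full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # Candidate pruning strengths for trimming the fully grown tree.
    alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
    # Keep the pruned tree that performs best on the held-out data.
    return max(
        (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
         for a in alphas),
        key=lambda t: t.score(X_val, y_val),
    )
```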
Post-pruning
 Post-pruning tends to give better results than
pre-pruning because it makes pruning
decisions based on a fully grown tree.
 On the other hand, pre-pruning can suffer
from premature termination of the tree
growing process.
 However, for post-pruning, the additional
computations for growing the full tree may be
wasted when some of the sub-trees are
pruned.

33
Classifier evaluation
 There are a number of methods to evaluate
the performance of a classifier
 Hold-out method
 Cross validation

34
Hold-out method
 In this method, the original data set is
partitioned into two disjoint sets.
 These are called the training set and test set
respectively.
 The classification model is constructed from
the training set.
 The performance of the model is evaluated
using the test set.

35
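A minimal sketch of the hold-out method, assuming a one-third hold-out for testing and the iris data purely as a stand-in for "the original data set"; both are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # build on the training set
print("hold-out error estimate:", 1 - model.score(X_test, y_test))    # evaluate on the test set
```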
Hold-out method
 The hold-out method has a number of well-known limitations.
 First, fewer examples are available for
training.
 Second, the model may be highly dependent
on the composition of the training and test
sets.

36
Cross validation
 In this approach, each record is used the same
number of times for training, and exactly once for
testing.
 To illustrate this method, suppose we partition the
data into two equal-sized subsets.
 First, we choose one of the subsets for training and
the other for testing.
 We then swap the roles of the subsets so that the
previous training set becomes the test set, and vice
versa.

37
Cross validation
 The estimated error is obtained by averaging
the errors on the test sets for both runs.
 In this example, each record is used exactly
once for training and once for testing.
 This approach is called two-fold cross-
validation.

38
Cross validation
 The k-fold cross validation method
generalizes this approach by segmenting the
data into k equal-sized partitions.
 During each run
 One of the partitions is chosen for testing.
 The rest of them are used for training.
 This procedure is repeated k times so that
each partition is used for testing exactly once.
 The estimated error is obtained by averaging
the errors on the test sets for all k runs.
39
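A sketch of k-fold cross-validation with k = 5 (an illustrative choice), again using the iris data as a stand-in. Each record falls in the test partition exactly once, and the estimated error is the average over the k runs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # One partition is used for testing; the remaining k-1 partitions are used for training.
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))
print("estimated error:", np.mean(errors))  # average of the test errors over all k runs
```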
