Classification Error: Training Errors and Generalization Errors
Classification error
A good classification model should
Fit the training data well. (low training error)
Accurately classify records it has never seen
before. (low generalization error)
A model that fits the training data too well can nevertheless have a high generalization error.
This is known as model overfitting.
We consider the 2-D data set in the following
figure.
The data set contains data points that belong
to two different classes.
30% of the points are chosen for training,
while the remaining 70% are used for testing.
A decision tree classifier is built using the
training set.
Different levels of pruning are applied to the
tree to investigate the effect of overfitting.
[Figure: the two-class 2-D data set used in the experiment.]
Classification error
The following figure shows the training and
test error rates of the decision tree.
Both error rates are large when the size of
the tree is very small.
This situation is known as model underfitting.
Underfitting occurs because the model is too simple to learn the true structure of the data.
It performs poorly on both the training and
test sets.
[Figure: training and test error rates as a function of the size of the decision tree.]
When the tree becomes too large, the training error rate continues to decrease, but the test error rate begins to increase.
This phenomenon is known as model overfitting.
Overfitting
The training error can be reduced by
increasing the model complexity.
However, the test error can be large because
the model may accidentally fit some of the
noise points in the training data.
In other words, the performance of the model
on the training set does not generalize well to
the test examples.
We consider a training set and a test set for a mammal classification problem.
Two of the ten training records are mislabeled.
Bats and whales are labeled as non-
mammals instead of mammals.
Training set
Name | Body Temperature | Gives Birth | Four-Legged | Hibernates | Class Label
porcupine | warm-blooded | yes | yes | yes | yes
cat | warm-blooded | yes | yes | no | yes
bat | warm-blooded | yes | no | yes | no*
whale | warm-blooded | yes | no | no | no*
salamander | cold-blooded | no | yes | yes | no
komodo dragon | cold-blooded | no | yes | no | no
python | cold-blooded | no | no | yes | no
salmon | cold-blooded | no | no | no | no
eagle | warm-blooded | no | no | no | no
guppy | cold-blooded | yes | no | no | no
(Class label: yes = mammal. * Mislabeled records: bats and whales are in fact mammals.)
Test set
Name | Body Temperature | Gives Birth | Four-Legged | Hibernates | Class Label
human | warm-blooded | yes | no | no | yes
pigeon | warm-blooded | no | no | no | no
elephant | warm-blooded | yes | yes | no | yes
leopard shark | cold-blooded | yes | no | no | no
turtle | cold-blooded | no | yes | no | no
penguin | warm-blooded | no | no | no | no
eel | cold-blooded | no | no | no | no
dolphin | warm-blooded | yes | no | no | yes
spiny anteater | warm-blooded | no | yes | yes | yes
gila monster | cold-blooded | no | yes | yes | no
(Class label: yes = mammal.)
Overfitting
A decision tree that perfectly fits the training
data is shown in the following figure.
The training error rate for the tree is 0%.
However, its error rate on the test set is 30%.
[Figure: a decision tree that perfectly fits the training data.]
Both humans and dolphins are misclassified
as non-mammals.
Their attribute values for Body Temperature,
Gives Birth and Four-legged are identical to
the mislabeled records in the training set.
On the other hand, the spiny anteater represents an exceptional case: the class label of this test record contradicts the class labels of other similar records in the training set.
In contrast, the simpler decision tree in the following figure has a somewhat higher training error rate (20%) but a lower test error rate (10%).
It can be seen that the Four-legged attribute
test condition in the first model is spurious.
It fits the mislabeled training records, which
leads to the misclassification of records in the
test set.
[Figure: the simpler decision tree with the lower test error rate.]
Generalization error estimation
The ideal classification model is the one that
produces the lowest generalization error.
The problem is that the model has no
knowledge of the test set.
It has access only to the training set.
We consider the following approaches to estimating the generalization error:
Resubstitution estimate
Estimates incorporating model complexity
Using a validation set
Resubstitution estimate
The resubstitution estimate approach assumes that the training set is a good representation of the overall data, and therefore uses the training error as an estimate of the generalization error.
However, the training error is usually an optimistic estimate of the generalization error.
We consider the two decision trees shown in
the following figure.
The left tree TL is more complex than the right
tree TR.
The training error rate for TL is
e(TL)=4/24=0.167.
The training error rate for TR is
e(TR)=6/24=0.25.
Based on the resubstitution estimate, TL is
considered better than TR.
[Figure: two decision trees, TL (left) and TR (right), built from the same 24 training records.]
Estimates incorporating model complexity
The chance of model overfitting increases as the model becomes more complex.
As a result, we should prefer simpler models.
Based on this principle, we can estimate the generalization error as the sum of the training error and a penalty term for model complexity.
In the case of a decision tree, let
L be the number of leaf nodes,
n_l be the l-th leaf node,
m(n_l) be the number of training records classified by node n_l,
e(n_l) be the number of records misclassified by node n_l, and
ζ(n_l) be a penalty term associated with node n_l.
The resulting error rate e_c of the decision tree can be estimated as follows:

e_c = \frac{\sum_{l=1}^{L} \left[ e(n_l) + \zeta(n_l) \right]}{\sum_{l=1}^{L} m(n_l)}
We consider the previous two decision trees TL and TR.
We assume that the penalty term is equal to 0.5 for
each leaf node.
The error rate estimate for TL is

e_c(T_L) = \frac{4 + 7 \times 0.5}{24} = \frac{7.5}{24} = 0.3125

The error rate estimate for TR is

e_c(T_R) = \frac{6 + 4 \times 0.5}{24} = \frac{8}{24} = 0.3333
Based on this penalty term, TL is better than
TR.
For a binary tree, a penalty term of 0.5 means
that a node should always be expanded into
its two child nodes if it improves the
classification of at least one training record.
This is because expanding a node, which adds 0.5 to the overall penalty, is less costly than committing one additional training error.
Suppose the penalty term is equal to 1 for all
the leaf nodes.
The error rate estimate for TL becomes 0.458.
The error rate estimate for TR becomes 0.417.
Based on this penalty term, TR is better than
TL.
A penalty term of 1 means that, for a binary
tree, a node should not be expanded unless it
reduces the classification error by more than
one training record.
Using a validation set
In this approach, the original training data is
divided into two smaller subsets.
One of the subsets is used for training.
The other, known as the validation set, is
used for estimating the generalization error.
This approach can be used in the case where
the complexity of the model is determined by
a parameter.
We can adjust the parameter until the
resulting model attains the lowest error on the
validation set.
This approach provides a better way of estimating how well the model performs on previously unseen records.
However, less data are available for training.
Handling overfitting in decision trees
There are two approaches for avoiding model overfitting in decision trees:
Pre-pruning
Post-pruning
Pre-pruning
In this approach, the tree growing algorithm is
halted before generating a fully grown tree
that perfectly fits the training data.
To do this, an alternative stopping condition
could be used.
For example, we can stop expanding a node
when the observed change in impurity
measure falls below a certain threshold.
The advantage of this approach is that it
avoids generating overly complex sub-trees
that overfit the training data.
However, it is difficult to choose the right
threshold for the change in impurity measure.
A threshold which is too high will result in
underfitted models.
A threshold which is too low may not be
sufficient to overcome the model overfitting
problem.
Post-pruning
In this approach, the decision tree is initially
grown to its maximum size.
This is followed by a tree pruning step, which
trims the fully grown tree.
Trimming can be done by replacing a sub-
tree with a new leaf node whose class label is
determined from the majority class of records
associated with the node.
The tree pruning step terminates when no
further improvement is observed.
Post-pruning tends to give better results than
pre-pruning because it makes pruning
decisions based on a fully grown tree.
On the other hand, pre-pruning can suffer
from premature termination of the tree
growing process.
However, for post-pruning, the additional
computations for growing the full tree may be
wasted when some of the sub-trees are
pruned.
Classifier evaluation
There are a number of methods for evaluating the performance of a classifier:
Hold-out method
Cross validation
Hold-out method
In this method, the original data set is
partitioned into two disjoint sets.
These are called the training set and test set
respectively.
The classification model is constructed from
the training set.
The performance of the model is evaluated
using the test set.
The hold-out method has a number of well-known limitations.
First, fewer examples are available for
training.
Second, the model may be highly dependent
on the composition of the training and test
sets.
Cross validation
In this approach, each record is used the same
number of times for training, and exactly once for
testing.
To illustrate this method, suppose we partition the
data into two equal-sized subsets.
First, we choose one of the subsets for training and
the other for testing.
We then swap the roles of the subsets so that the
previous training set becomes the test set, and vice
versa.
The estimated error is obtained by averaging
the errors on the test sets for both runs.
In this example, each record is used exactly
once for training and once for testing.
This approach is called two-fold cross-
validation.
The k-fold cross validation method
generalizes this approach by segmenting the
data into k equal-sized partitions.
During each run
One of the partitions is chosen for testing.
The rest of them are used for training.
This procedure is repeated k times so that
each partition is used for testing exactly once.
The estimated error is obtained by averaging
the errors on the test sets for all k runs.