Estimating Bias and Variance From Data: Unpublished Draft
Geoffrey I. Webb
Paul Conilione
School of Computer Science and Software Engineering
Monash University
Vic. 3800, Australia
1. Introduction
The bias plus variance decomposition of error has proved a useful tool
for analyzing supervised learning algorithms. While initially developed
in the context of numeric regression (specifically, of squared error loss,
Geman, Bienenstock, & Doursat, 1992), a number of variants have been
developed for classification learning (zero-one loss) (Breiman, 1996b;
Kohavi & Wolpert, 1996; Kong & Dietterich, 1995; Friedman, 1997;
Domingos, 2000; Webb, 2000; James, 2003). This analysis decomposes
error into three terms, derived with reference to the performance of
a learner when trained with different training sets drawn from some
reference distribution of training sets: a bias term, reflecting the
systematic error of the learner over those training sets; a variance term,
reflecting the sensitivity of the learner's predictions to the choice of
training set; and an irreducible error term, reflecting noise inherent in
the task itself.
Kohavi and Wolpert (1996) define the bias and variance decomposition
of error as follows.
$$\mathrm{bias}^2_x = \frac{1}{2}\sum_{y \in Y}\left[P_{Y,X}(Y{=}y \mid X{=}x) - P_T(L(T)(x){=}y)\right]^2 \qquad (1)$$

$$\mathrm{variance}_x = \frac{1}{2}\left[1 - \sum_{y \in Y} P_T(L(T)(x){=}y)^2\right] \qquad (2)$$

$$\sigma^2_x = \frac{1}{2}\left[1 - \sum_{y \in Y} P_{Y,X}(Y{=}y \mid X{=}x)^2\right] \qquad (3)$$
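To make these definitions concrete, the following Python sketch (ours, not part of Kohavi and Wolpert's presentation) computes the three terms for a single test object x, assuming the true conditional class distribution and the distribution of the learner's classifications over training sets are supplied as probability vectors:

```python
import numpy as np

def kw_terms(p_true, p_pred):
    """Kohavi-Wolpert decomposition terms for a single test object x.

    p_true[y] : P_{Y,X}(Y=y | X=x), the true conditional class distribution.
    p_pred[y] : P_T(L(T)(x)=y), the proportion of training sets T under which
                the learner L labels x as class y.
    """
    p_true = np.asarray(p_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    bias2 = 0.5 * np.sum((p_true - p_pred) ** 2)    # equation (1)
    variance = 0.5 * (1.0 - np.sum(p_pred ** 2))    # equation (2)
    sigma2 = 0.5 * (1.0 - np.sum(p_true ** 2))      # equation (3)
    return bias2, variance, sigma2

# Example: two classes; the true class is the first with probability 0.9,
# and the learner predicts the first class for 70% of training sets.
print(kw_terms([0.9, 0.1], [0.7, 0.3]))
```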
When estimating these terms from data, Kohavi and Wolpert (1996)
recommend the use of a correction to unbias the estimates. The corrected
estimate of bias², their equation (4), is computed from P̂(·), the estimate
of P(·) derived from the observed frequency of the argument over repeated
sample training sets.
Further, as it is infeasible to estimate σ from sample data, this term
is aggregated into the bias term by assuming that $P_{Y,X}(Y{=}y \mid X{=}x)$ is
always either 0.0 or 1.0. Hence,

$$\widehat{\mathrm{variance}}_x = \hat{P}_{Y,X}(L(T)(X) \neq Y \mid X{=}x) - \widehat{\mathrm{bias}^2_x} \qquad (5)$$
That is, the estimate of variance equals the estimate of error minus the
estimate of bias.
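As an illustrative sketch of these estimates for a single test object (our own construction, using the plug-in form of equation (1) without Kohavi and Wolpert's unbiasing correction, and treating the object's class as deterministic so that equation (5) applies):

```python
from collections import Counter
import numpy as np

def estimate_bias_variance(true_class, predictions, classes):
    """Plug-in bias/variance estimates for one test object with a
    deterministic true class.

    true_class  : the (assumed noise-free) class of the object.
    predictions : labels assigned to the object by the learner when trained
                  on each of the repeated sample training sets.
    classes     : the set of possible class labels Y.
    """
    counts = Counter(predictions)
    n = len(predictions)
    p_pred = np.array([counts[y] / n for y in classes], dtype=float)
    p_true = np.array([1.0 if y == true_class else 0.0 for y in classes])
    bias2 = 0.5 * np.sum((p_true - p_pred) ** 2)   # plug-in form of equation (1)
    error = 1.0 - counts[true_class] / n           # estimated P(L(T)(x) != Y | X=x)
    variance = error - bias2                       # equation (5)
    return error, bias2, variance

# Example: the object's class is 'a'; over 10 training sets the learner
# labels it 'a' seven times and 'b' three times.
print(estimate_bias_variance('a', ['a'] * 7 + ['b'] * 3, ['a', 'b']))
```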
We draw attention to the manner in which bias and variance are
presented as functions with a single parameter, x. From careful analysis,
however, it can be seen that there are two further parameters, L and T.
That is, the terms should more correctly be written as $\mathrm{bias}^2_{x,L,T}$ and
$\mathrm{variance}_{x,L,T}$. In most cases there is little harm in dropping L as it will
usually be very clear from the context. One of our contentions, however,
is that the failure to recognize T as an important parameter has led
to a serious failure to understand the significance of the distribution in
determining bias-variance results.
Note that bias and variance are defined here with respect to a single
test object. In practice we evaluate these terms over all of a set of test
objects and present the mean value of each term.
Note also that Kohavi and Wolpert call their bias term bias², following
the convention set in the context of numeric regression (Geman
et al., 1992). In this paper we use bias to refer to the bias term in a
classification context.
Kohavi and Wolpert (1996) present the following holdout procedure for
estimating the bias and variance of a learner L from a dataset D.
better basis for comparing error also make it a superior process for
estimating bias and variance. In particular, we hypothesize that a cross-
validation based technique for estimating bias and variance, such as
that proposed by Webb (2000), will provide more stable estimates than
will a sampling-based technique such as that proposed by Kohavi and
Wolpert (1996).
Webb’s (2000) procedure repeats k-fold cross validation l times. This
ensures that each element x of the dataset D is classified l times.
The $\mathrm{bias}_x$ and $\mathrm{variance}_x$ can be estimated from the resulting set of
classifications. The bias and variance with respect to the distribution
from which D is drawn can be estimated from the average of each term
over all x ∈ D.
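A minimal sketch of this collection step, assuming numpy arrays and a scikit-learn-style classifier with fit and predict methods (that interface is our assumption, not part of Webb's specification). The per-object records can then be reduced to bias and variance estimates, for instance with a routine such as estimate_bias_variance above, and the terms averaged over all x ∈ D:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def repeated_cv_predictions(estimator, X, y, l=10, k=10, seed=0):
    """Collect l predicted labels per object via l repetitions of k-fold CV."""
    rng = np.random.RandomState(seed)
    records = [[] for _ in range(len(y))]   # records[i]: labels assigned to object i
    for _ in range(l):
        folds = KFold(n_splits=k, shuffle=True, random_state=rng.randint(2 ** 31))
        for train_idx, test_idx in folds.split(X):
            model = clone(estimator).fit(X[train_idx], y[train_idx])
            for i, label in zip(test_idx, model.predict(X[test_idx])):
                records[i].append(label)
    return records   # each object is classified exactly l times
```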
This procedure has a number of advantages over Kohavi and
Wolpert’s. First, like Valentini and Dietterich’s (2003) bootstrap pro-
cedure, all data are used as both training and test data. This can be
expected to lead to far greater stability in the estimates of bias and
variance that are derived, as selection of different training and test sets
can be expected to substantially alter the estimates that are derived.
A second advantage, that is also an advantage over Valentini and
Dietterich's procedure, is that it allows greater control over the
training-set sizes and inter-training-set variability. Let |·| represent the size
of a data set. In general, k-fold cross validation will result in training
sets of size $\frac{k-1}{k}|D|$. Hence changes to k will result in changes to the
training set size.
Bias and variance are evaluated with respect to individual objects.
Under a single run of k-fold cross-validation, each object is classified
once. Therefore, if there are l repetitions of k-fold cross-validation, each
object will be classified l times. The training sets used to classify an
object o under cross-validation cannot contain o. Hence, the l training
sets used to classify o are drawn at random from D − {o}. In consequence,
each object o′ ≠ o, o′ ∈ D has probability $\frac{k-1}{k} \times \frac{|D|}{|D|-1}$ of being a
member of any training set used to classify o, and so the inter-training-set
variability of the distribution used to generate the bias and variance
estimates is as follows¹.
$$\delta = 1 - \frac{k-1}{k} \times \frac{|D|}{|D|-1} \qquad (6)$$
¹ Note that we are here discussing the distribution of training sets used to classify
a single object. The training sets generated by successive folds of a single cross-
validation run will have a different distribution to the one we describe. However,
each test object will only be classified using one training set from this distribution,
and hence the distribution of training sets generated by a single cross-validation
run is quite distinct from the distribution of interest.
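To make equation (6) and the associated training-set size concrete, a small worked calculation (our own illustration) for a given k and |D|:

```python
def cv_training_set_properties(k, n):
    """Training-set size and inter-training-set variability (equation (6))
    for k-fold cross-validation over a dataset of n objects."""
    train_size = (k - 1) / k * n              # size of each training set
    delta = 1 - (k - 1) / k * n / (n - 1)     # equation (6)
    return train_size, delta

# For example, with |D| = 1000: 2-fold CV gives training sets of 500 objects
# and delta close to 0.5, while 5-fold CV gives 800 objects and delta close to 0.2.
print(cv_training_set_properties(2, 1000))
print(cv_training_set_properties(5, 1000))
```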
ssCV (D, l, m, δ)
D: The data to be used.
l: The number of times each test item is to be classified, an integer
greater than zero.
m: The size of the training sets, an integer 0 < m < |D|.
δ: The average proportion of objects to be shared in common between
any pair of training sets, m/(|D| + 1) ≤ δ < 1.
For i = 1 to q do
  a) Partition E_i into k random subsets F_1 ... F_k.
  b) For j = 1 to k do
     i) Select a random sample S of size m from E_i − F_j.
     ii) For each x ∈ F_j, record L(S)(x).
     iii) If i = 1 and j = 1 then for each x ∈ E_{q+1}, record L(S)(x).
7. Calculate the estimates of bias, variance, and error from the records
of the repeated classifications for each object.
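The following Python sketch mirrors only the loop listed above; it assumes that the earlier steps of ssCV, which derive q, k and the partitions E_1, ..., E_{q+1} from l, m and δ, have already been carried out, and that the learner maps a training sample to a classification function (these are our assumptions for illustration, not the published specification):

```python
import numpy as np

def sscv_core_loop(learner, data, labels, partitions, q, k, m, seed=0):
    """Core loop of ssCV: for each partition E_i, split it into k folds,
    train on a random sample S of size m drawn from E_i minus the current
    fold, and record the classification of every object in that fold.
    On the very first fold (i = 1, j = 1), the objects in E_{q+1} are
    also classified.

    partitions : list of index arrays E_1, ..., E_{q+1} into `data`.
    learner    : callable taking (training data, training labels) and
                 returning a function that classifies a single instance.
    Returns a dict mapping object index -> list of recorded labels.
    """
    rng = np.random.RandomState(seed)
    records = {i: [] for i in range(len(labels))}
    for i in range(q):                                    # For i = 1 to q
        E_i = np.asarray(partitions[i])
        folds = np.array_split(rng.permutation(E_i), k)   # F_1 ... F_k
        for j, F_j in enumerate(folds):
            pool = np.setdiff1d(E_i, F_j)
            S = rng.choice(pool, size=m, replace=False)   # random sample of size m
            model = learner(data[S], labels[S])           # L(S)
            for x in F_j:                                 # record L(S)(x) for x in F_j
                records[int(x)].append(model(data[x]))
            if i == 0 and j == 0:                         # if i = 1 and j = 1
                for x in partitions[q]:                   # also classify each x in E_{q+1}
                    records[int(x)].append(model(data[x]))
    return records
```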
the size of the training pool and the training pool cannot be larger
than D. Hence, |D|/2 is an upper limit on the size of the training sets
under the holdout approach. In contrast, under cross validation, the
training-set size is |D| − |D|/k and the maximum possible training-set
size is |D| − 1. So, for example, with 5-fold cross-validation and m set
to the maximum value possible, each training set will contain 0.8 of the
data. We believe that it will often be desirable to have larger training
sets, as this will more accurately reflect the distribution of training sets
to which the learner will be applied in practice.
A final respect in which we expect the cross-validation and bootstrap
procedures to be superior to the holdout procedure is that they use all
objects in D as test objects rather than a sub-sample. This means that if
objects are of varying degrees of difficulty to classify, all will be classified
each time. This can be expected to produce much more stable estimates
than the holdout method where the once-off selection of a test set E
can be expected to impact the estimates obtained. We believe that such
stability is a critical feature of a bias-variance estimation procedure. If
estimates vary greatly by chance then it follows that any one estimate
is unlikely to be accurate. Comparisons between learners on the basis
of such unreliable estimates are, in turn, likely to be inaccurate.
3. Evaluation
All four of the factors identified in Section 2.6 are important. The
first three do not require evaluation. From inspection of the procedures
it is possible to determine that the specified features of the respective
procedures indeed exist. However, the final of these factors does warrant
investigation. To what extent, in practice, does the use of all available
data in the role of test data lead to more stable estimates than the
use of only a single holdout set? A second, and in our assessment more
significant, issue that we seek to evaluate is what effect, if any, there is
from varying the type of distribution from which bias and variance are
estimated.
To these ends we implemented the cross-validation technique in the
Weka machine learning environment (Witten & Frank, 2000), which
already contains the holdout approach. Unfortunately, we did not have
access to an implementation of the bootstrap procedure. However, we
judged it less important to include this approach in our evaluation, as
its addition could contribute only to further understanding of the first
and lesser issue, the relative stability of the estimates.
We applied the two approaches to the nine data sets used by Kohavi
and Wolpert (1996). These are described in Table I. We use the hold-
out approach with the training set sizes used by Kohavi and Wolpert.
To compare the relative stability of the estimates we apply the cross-
validation technique to each data set using the same training set size
and inter-training-set variability (δ = 0.50) as Kohavi and Wolpert.
We used l = 50, the number of repetitions used by Kohavi and
Wolpert. To explore the consequences of reducing l we also applied our
procedure with l = 10.
Each of the estimation processes was applied to each data set using
each of two different learning algorithms, J48 and NB. J48 is a Weka
reimplementation of C4.5 (Quinlan, 1993). NB is the Weka implemen-
tation of naive Bayes. In order to assess the stability of each estimation
process, we repeated each ten times, using a different random seed on
each trial. Thus we obtained ten values for each measure (error, bias
and variance) for each combination of an evaluation process and a data
set. Tables II to VII present the mean and standard deviation of these
ten values for every such combination of process and data set.
3.1. Stability
combination there are nine outcomes to compare: the outcomes for each
data set. The probability of the cross-validation method having a lower
value than the holdout method for every one of those comparisons if
each method has equal chance of obtaining the lower value is 0.0020
(one-tailed binomial sign test). This result remains significant at the
0.05 level even if a Bonferroni adjustment for multiple comparisons is
applied by dividing the alpha by 18. We therefore conclude that there is
very strong support for our hypothesis that the cross-validation method
that we have proposed is more stable in its estimates than the holdout
method. This means that the experimenter can have greater confidence
in the values that it produces.
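For reference, the sign-test probability quoted above can be recomputed directly: under the null hypothesis that each method is equally likely to yield the lower value, the chance of the cross-validation method winning all nine comparisons is 0.5^9 ≈ 0.0020. A minimal check, here using scipy (our choice of tool):

```python
from scipy.stats import binomtest

# Nine data sets, with cross-validation obtaining the lower standard deviation
# on all nine; one-tailed binomial sign test.
result = binomtest(k=9, n=9, p=0.5, alternative='greater')
print(result.pvalue)   # 0.001953125, i.e. 0.5 ** 9
```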
The greater stability is produced at greater computational cost,
however, as the cross-validation method learns l × k × q models in
comparison to the l models learned by the holdout method. To assess
whether so much computation is required by ssCV, we compare
it with l = 10 against holdout with l = 50. The standard deviation
for ssCV is still lower than that of the holdout method on all compar-
isons, demonstrating that more stable estimates can be derived at more
modest computational cost.
² We consider only the bias and variance means, as error is derived therefrom.
Table VIII. Summary for error with J48 over differing training set distributions
Table IX. Summary for bias with J48 over differing training set distributions
Table XI. Summary for error with NB over differing training set distributions
Table XII. Summary for bias with NB over differing training set distributions
Figure 1. Comparison of error, bias and variance as δ is changed for J48.
Figure 2. Comparison of error, bias and variance as δ is changed for Naive Bayes.
As δ increases, the average error remains constant, the average bias
decreases, and the average variance increases.
4. Discussion
5. Conclusions
Acknowledgements
We are very grateful to Kai Ming Ting, Alexander Nedoboi and Shane
Butler for valuable comments on drafts of this paper.
References