Boos and Stefanski 2010 Efron's Bootstrap
The bootstrap was introduced by Brad Efron in the late 1970s. It is a computer-intensive method for approximating
the sampling distribution of any statistic derived from a random sample. Here Dennis Boos and Leonard Stefanski
give simple examples to show how the bootstrap is used and help to explain its enormous success as a tool of
statistical inference.
Bootstrap basics

A fundamental problem in statistics is assessing the variability of an estimate derived from sample data. Consider, for example, a simple survey in which a newspaper with a circulation of 300 000 (the population) randomly samples 100 of its subscribers (the sample) and asks their preference as to whether front-page stories should continue on the second page or on the back page of the section. Suppose that in the sample of 100 readers 64% favoured the back page. If this study were repeated with a new random sample of 100 readers, then the results would be unlikely to be 64% again, but would probably be something else, say 59%. And if the study were repeated over and over, the results would be a large set of percentages, say {64, 59, 65, 70, 52, …}. This hypothetical set of possible study results represents the sampling distribution of the sample proportion statistic. With it one can assess the variability in the real-sample estimate (e.g., attach a margin of error to it, say 64% ± 9%), and rigorously address questions such as whether more than half the readers prefer stories to continue on the back page.

The catch is, of course, that it is impractical to repeat studies, and thus the set of possible percentages described above is never more than hypothetical. The solution to this dilemma, before the widespread availability of fast computing, was to derive the sampling distribution mathematically. This is easy to do for simple estimates such as the sample proportion, but not so easy for more complicated statistics.

Fast computing opened a new door to the problem of determining the sampling distribution of a statistic. On the other side of that door was Efron's bootstrap, or what is now known simply as the bootstrap. In broad strokes, the bootstrap substitutes computing power for mathematical prowess in determining the sampling distribution of a statistic. In practice, the bootstrap is a computer-based technique that mimics the core concept of random sampling from a set of numbers and thereby estimates the sampling distribution of virtually any statistic computed from the sample. The only way it differs from the hypothetical resampling described above is that the repeated samples are not drawn from the population, but rather from the sample itself, because the population is not accessible.

Examples

To illustrate these ideas we use two simple examples where the statistics are the sample mean and median. Consider the data set in Table 1 of n = 25 adult male yearly incomes (in thousands of dollars) collected from a fictitious county in North Carolina.

Table 1. Random sample of 25 yearly incomes in thousands of dollars (ordered from lowest to highest)

 1   4   6  12  13  14  18  19  20  22  23  24  26
31  34  37  46  47  56  61  63  65  70  97  385

The sample mean

The sample mean of the Table 1 data is Ȳ = 47.76. Statistical theory tells us that if these values were independently drawn from a population of incomes having mean µ and variance σ², then the sampling distribution of Ȳ has mean µ, variance σ²/n (here n = 25), and standard deviation σ/√n.
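For readers who want to follow along, the Table 1 summaries used in this article (mean 47.76, median 26, and the standard error estimate discussed below) can be reproduced with a short Python sketch; only the standard library is assumed, and the variable names are ours:

```python
import math
import statistics

# Table 1 incomes (thousands of dollars), n = 25
incomes = [1, 4, 6, 12, 13, 14, 18, 19, 20, 22, 23, 24, 26,
           31, 34, 37, 46, 47, 56, 61, 63, 65, 70, 97, 385]

n = len(incomes)
mean = statistics.mean(incomes)      # 47.76, the sample mean
median = statistics.median(incomes)  # 26, the sample median
s = statistics.stdev(incomes)        # s_{n-1}, the unbiased-variance version
se = s / math.sqrt(n)                # estimated standard error, about 14.8
print(mean, median, round(se, 1))    # prints: 47.76 26 14.8
```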
Figure 1. (Left) Histogram of 1000 sample means from repeated sampling of a theoretical lognormal population. (Right) Histogram of 1000 bootstrap sample means from
randomly sampling with replacement from Table 1 data
The sampling distribution of a statistic computed from a random sample is the distribution of the statistic in repeated sampling from that population. Usually we do not know the population and cannot repeatedly sample, and thus we estimate µ with Ȳ and also estimate the sampling standard deviation of Ȳ (often called the standard error) by s_{n-1}/√n, where s_{n-1}^2 = (n-1)^{-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2 is the unbiased version of the sample variance. Statistical inference proceeds by relying on the fact that Ȳ is approximately normally distributed due to the central limit theorem.

So that we know what the bootstrap should be estimating, we generated the data in Table 1 as Y_i = 30 exp(Z_i) (×$1000), i = 1, …, 25, where Z_1, …, Z_25 are independently distributed standard normal random variables. Thus our fictitious sample is known to come from a lognormal population or distribution. Since we know the population distribution, we can also generate the true sampling distribution of Ȳ by creating independent random samples in the same manner, and then computing Ȳ for each one. We did this for 1000 random samples and plotted a histogram of the Ȳ values in the left panel of Figure 1.

The bootstrap can be used to approximate the sampling distribution of Ȳ when we do not know the population from which the sample was obtained (always the case with real data). The nonparametric bootstrap proceeds by treating the data in Table 1 as a population and drawing random samples from it. A bootstrap random sample (also called a resample) is drawn from the Table 1 pseudo-population by randomly choosing 25 values with replacement from the values in Table 1. Table 2 displays two such samples.

Table 2. Bootstrap resamples from Table 1

Sample 1    1   4   4   6  18  22  22  23  23  23  24  26  31
           37  46  47  47  56  56  61  61  63  65  65  65

Sample 2    1   4   6  13  14  14  18  19  22  23  23  23  24
           26  26  37  46  46  47  47  63  63  70  70  97

Note that repeated values of the original data appear within each resample because the sampling is with replacement (as opposed to without replacement). The only sample of size n = 25 that could be drawn without replacement is the original sample itself. The right panel in Figure 1 is a histogram of the 1000 sample means computed from 1000 resamples. It is the bootstrap estimate of the distribution in the left panel. Remember that we have the left panel in this case only because we generated the sample from a known probability distribution. In any real application we cannot produce the left panel, but the bootstrap can always produce the right panel. The two panels are similar, but there are differences resulting from the bootstrap step that uses the sample as if it were the population.

An important use of the bootstrap is calculation of the standard error of an estimate (the essential component of the margin of error associated with a statistical estimate). For our toy example, the bootstrap standard error of the mean estimate, 47.76, is

\left[ \frac{1}{1000-1} \sum_{i=1}^{1000} (\bar{Y}_i - \bar{\bar{Y}})^2 \right]^{1/2} = 13.8,

where Ȳ_i is the mean of the ith resample and \bar{\bar{Y}} is the average of the 1000 resample means. In this case we can also use the theoretically derived formula to get the non-bootstrap standard error estimate s_{n-1}/√n = 14.8 for the Table 1 data. The difference between the two estimated standard errors (13.8 versus 14.8) has two components. The random component is due to the fact that the bootstrap estimate is based on 1000 resamples; had we used a much larger number of resamples, the bootstrap standard error would approximate s_n/√n = 14.5. The second component is due to the difference in the denominators between s_n and s_{n-1}. These are relatively minor discrepancies, and most analysts are usually willing to accept a small amount of variation in bootstrap standard errors due to the Monte Carlo simulation, that is, using 1000 resamples rather than, say, 1 million resamples. (And of course the fact that means from even 1000 resamples must be calculated implies the bootstrap's practical need for a computer. Happily, it was developed just as computing power became widely available.)

In some situations, we might feel comfortable making a guess at the type of distribution that the data came from, that is, the basic shape of the underlying population. For example, the data in Table 1 were actually generated from a normal distribution and then exponentiated to get lognormal data. Another way to do bootstrap sampling is therefore to estimate the parameters of the assumed distribution and then generate bootstrap samples from the estimated population. This is called parametric bootstrapping, and is best used when the distribution type is reasonably well known.
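Both the nonparametric resampling just described and the parametric variant can be sketched in a few lines of Python. This is a stdlib-only sketch (the seed and B = 1000 are illustrative choices); because of Monte Carlo variation the printed standard errors will be near, not exactly equal to, the values quoted above:

```python
import math
import random
import statistics

# Table 1 incomes (thousands of dollars), n = 25
incomes = [1, 4, 6, 12, 13, 14, 18, 19, 20, 22, 23, 24, 26,
           31, 34, 37, 46, 47, 56, 61, 63, 65, 70, 97, 385]

rng = random.Random(1)  # fixed seed so the sketch is reproducible
B = 1000                # number of bootstrap resamples
n = len(incomes)

# Nonparametric bootstrap: resample the data with replacement and
# record the mean of each resample (the right panel of Figure 1).
boot_means = [statistics.mean(rng.choices(incomes, k=n)) for _ in range(B)]

# Bootstrap standard error of the mean: the standard deviation of the
# B resample means (compare with s_{n-1}/sqrt(n) = 14.8).
se_boot = statistics.stdev(boot_means)

# Parametric bootstrap: assume a lognormal population, estimate its
# parameters from the log incomes, then resample from the fitted model.
logs = [math.log(y) for y in incomes]
mu_hat, sigma_hat = statistics.mean(logs), statistics.stdev(logs)
par_means = [statistics.mean([rng.lognormvariate(mu_hat, sigma_hat)
                              for _ in range(n)]) for _ in range(B)]
se_par = statistics.stdev(par_means)

print(round(se_boot, 1), round(se_par, 1))
```

Increasing B shrinks the Monte Carlo variation of the nonparametric value toward s_n/√n = 14.5; the parametric value can differ noticeably, since it reflects the fitted lognormal model rather than the data directly.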
The sample median

A histogram of the Table 1 data (not displayed) reveals that it is quite skewed to the right. This skewness is also clear from the fact that the sample mean, 47.8, is much larger than the sample median, 26. In situations with such skewness it is typical to use the median to measure central tendency instead of the mean.
Figure 2. (Left) Histogram of 1000 sample medians from repeated sampling of a theoretical lognormal population. (Right) Histogram of 1000 bootstrap sample medians
from Table 1 data
Not only is the median more representative of typical data values, but the sampling distribution standard deviation is much smaller for the sample median than for the sample mean (small is good for standard deviations of estimators!). Thus, for example, the US Census Bureau routinely uses medians to summarize income data.

Unfortunately the sampling distribution of the sample median is difficult to analyse theoretically. In fact, there is no simple expression for the standard deviation of the sample median like the expression σ/√n for the sample mean. We can of course study the distribution by Monte Carlo sampling from a true population when it is known.

The left panel of Figure 2 gives a histogram of sample median values from the same 1000 lognormal samples as in the left panel of Figure 1. This histogram of 1000 medians approximates the true sampling distribution of the sample median. However, in real life we only know the sample, not the population. Thus the right panel of Figure 2 gives the histogram of 1000 sample medians computed from the same resamples as used in the right panel of Figure 1.

Note that the vertical scales are different in Figure 2. Because of the discreteness of the bootstrap pseudo-population and the nature of the median, the estimated sampling distribution is very discrete, with most of the sample medians concentrated on the Table 1 central values 22, 23, 24, 26, 31 and 34. For most purposes this discreteness is not a problem.

Comparing Figures 1 and 2 visually suggests that the bootstrap distribution for the sample mean is a better estimate of the true sampling distribution of the sample mean than it is for the sample median. This reflects the fact that the sampling distribution of the median is more difficult to estimate than the sampling distribution of the mean. However, the bootstrap still estimates the sampling distribution well enough, and in particular provides a valid standard error estimate for the median, whereas the best-known other computationally-based method for estimating standard errors, the jackknife, does not.

Using the 1000 bootstrap medians depicted in the right panel of Figure 2, the bootstrap standard error (of the median estimate 26) is

\left[ \frac{1}{1000-1} \sum_{i=1}^{1000} (M_i - \bar{M})^2 \right]^{1/2} = 7.3,

where M_i is the median of the ith resample and \bar{M} is the average of the 1000 bootstrap medians. Comparing this bootstrap standard error, 7.3, to that for the sample mean, 13.8, empirically supports the claim that the median is a less variable statistic than the mean for skewed data.

The 1000 bootstrap median values are commonly used for other purposes as well. The plots in Figure 2 suggest that the sampling distribution of the median is mildly skewed to the right. (In large samples the sampling distribution approximates a normal distribution.) Thus, we might be interested in the bias in the sample median, that is, the difference between the mean of the sampling distribution and the true population median. The bootstrap estimate of that bias is

(1000)^{-1} \sum_{i=1}^{1000} M_i - 26 = 28.4 - 26 = 2.4,

leading to the bootstrap bias-adjusted median estimate 26 − 2.4 = 23.6.

Another important statistical technique amenable to the bootstrap is confidence interval construction. The simplest bootstrap approach to confidence intervals is first to order the 1000 bootstrap medians displayed in the right panel of Figure 2, say M_{(1)} ≤ M_{(2)} ≤ … ≤ M_{(1000)}. Then (M_{(25)}, M_{(975)}) = (19, 46) is called the 95% bootstrap percentile interval. In this case, an exact nonparametric confidence interval for the median is available, given by (19, 47) with exact coverage probability 0.957. Efron¹ pointed out the close similarity between the bootstrap percentile interval and this nonparametric confidence interval.

Conclusion

The power of the bootstrap lies in the fact that the method applies to (almost) any estimator, no matter how complicated. The only requirement is a computer program to calculate the estimator from a sample and a method to draw resamples. We have described only the case of simple random sampling. However, the bootstrap method applies to any type of probability-based data collection, provided that it can be imitated via a computer program to generate resamples that relate statistically to the real sample in the same way that the real sample relates to the population from which it was selected. For example, economic data is often in the form of time series where all the sample data are correlated. A parametric bootstrap would assume a specific model such as a normal autoregressive process. After estimating the unknown parameters of the model, many independent bootstrap time series would be generated from the estimated autoregressive process.

There are literally thousands of articles on the bootstrap and many expository reviews. For starters, though, the book by Efron and Tibshirani² is a good introduction, and those by Efron¹ and Shao and Tu³ can be consulted for more technical accounts.

References

1. Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics.
2. Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. New York: Chapman and Hall.
3. Shao, J. and Tu, D. (1996) The Jackknife and Bootstrap. New York: Springer.

Dennis Boos and Leonard Stefanski are at the Department of Statistics, North Carolina State University.
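As a closing illustration, the median calculations above (bootstrap standard error, bias adjustment, and percentile interval) fit in one short Python sketch. Standard library only; with a different seed the numbers will drift around 7.3, 2.4 and (19, 46) rather than match them exactly:

```python
import random
import statistics

# Table 1 incomes (thousands of dollars), n = 25
incomes = [1, 4, 6, 12, 13, 14, 18, 19, 20, 22, 23, 24, 26,
           31, 34, 37, 46, 47, 56, 61, 63, 65, 70, 97, 385]

rng = random.Random(1)
B = 1000

# 1000 bootstrap medians, ordered -- the right panel of Figure 2
boot_medians = sorted(statistics.median(rng.choices(incomes, k=len(incomes)))
                      for _ in range(B))

# Bootstrap standard error of the median (article value: about 7.3)
se_median = statistics.stdev(boot_medians)

# Bootstrap bias estimate: average bootstrap median minus the sample
# median (article value: 28.4 - 26 = 2.4), and the adjusted estimate
bias = statistics.mean(boot_medians) - statistics.median(incomes)
adjusted = statistics.median(incomes) - bias

# 95% bootstrap percentile interval: 25th and 975th ordered medians
lo, hi = boot_medians[24], boot_medians[974]
print(round(se_median, 1), round(bias, 1), round(adjusted, 1), (lo, hi))
```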