Computing Science
The Bootstrap
Cosma Shalizi
Figure 1. A series of log returns from the Standard and Poor's 500 stock index from October 1, 1999, to October 20, 2009 (left), can be used to illustrate a classical approach to probability. A financial model that assumes the series are sequences of independent, identically distributed Gaussian random variables yields the distribution function shown at center. The model-based estimate of the smallest 1 percent of daily returns (denoted as q0.01) is −0.0326, with standard error 0.00104; its theoretical sampling distribution (right) is what lets us determine the uncertainty of this estimate.
Results of probability theory (the laws of large numbers, the ergodic theorem, the central limit theorem and so on) describe limits in which all stochastic processes in broad classes of models display the same asymptotic behavior. The central limit theorem (CLT), for instance, says that if we average more and more independent random quantities with a common distribution, and if that common distribution is not too pathological, then the distribution of their means approaches a Gaussian. (The non-Gaussian parts of the distribution wash away under averaging, but the average of two Gaussians is another Gaussian.) Typically, as in the CLT, the limits involve taking more and more data from the source, so statisticians use the theorems to find the asymptotic, large-sample distributions of their estimates. We have been especially devoted to rewriting our estimates as averages of independent quantities, so that we can use the CLT to get Gaussian asymptotics. Refinements to such results would consider, say, the rate at which the error of the asymptotic Gaussian approximation shrinks as the sample sizes grow.

To illustrate the classical approach and the modern alternatives, I'll introduce some data: the daily closing prices of the Standard and Poor's 500 stock index from October 1, 1999, to October 20, 2009. (I use these data because they happen to be publicly available and familiar to many readers, not to impart any kind of financial advice.) Professional investors care more about changes in prices than their level, specifically the log returns, the log of the price today divided by the price yesterday. For this time period of 2,529 trading days, there are 2,528 such values (see Figure 1). The efficient market hypothesis from financial theory says the returns can't be predicted from any public information, including their own past values. In fact, many financial models assume such series are sequences of independent, identically distributed (IID) Gaussian random variables. Fitting such a model yields the distribution function in the center graph of Figure 1.
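For concreteness, here is a minimal sketch in Python of the log-return calculation and the Gaussian fit. The file name sp500_close.csv and its format are assumptions for illustration; the article does not specify a data file.

```python
import numpy as np

# Hypothetical input: one S&P 500 closing price per line, covering the
# 2,529 trading days from October 1, 1999, to October 20, 2009.
# (File name and format are assumptions, not from the article.)
prices = np.loadtxt("sp500_close.csv")

# Log returns: the log of today's price divided by yesterday's price.
# 2,529 prices yield 2,528 returns.
returns = np.log(prices[1:] / prices[:-1])

# Fitting the IID Gaussian model amounts to estimating its mean and
# standard deviation from the returns.
mu_hat = returns.mean()
sigma_hat = returns.std(ddof=1)
```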
Figure 2. A schematic for model-based bootstrapping (left) shows that simulated values are generated from the fitted model, and then they are treated like the original data, yielding a new parameter estimate. Alternately, a schematic for nonparametric bootstrapping (right) shows that new data are simulated by resampling from the original data (allowing repeated values); parameters are then calculated directly from the empirical distribution.
Figure 3. An empirical distribution (left, in red, smoothed for visual clarity) of the log returns from a stock-market index is more peaked and has substantially more large-magnitude returns than a Gaussian fit (blue). The black marks on the horizontal axis show all the observed values. The distribution of q0.01 based on 100,000 nonparametric replications is very non-Gaussian (right, in red). The empirical estimate is marked by the blue dashed line.
An investor might want to know, for instance, how bad the returns could be. The lowest conceivable log return is negative infinity (with all the stocks in the index losing all value), but most investors worry less about an apocalyptic end of American capitalism than about large-but-still-typical losses: say, how bad are the smallest 1 percent of daily returns? Call this number q0.01; if we know it, we know that we will do better about 99 percent of the time, and we can see whether we can handle occasional losses of that magnitude. (There are about 250 trading days in a year, so we should expect two or three days at least that bad in a year.)

From the fitted distribution, we can calculate that q0.01 = −0.0326 or, undoing the logarithm, a 3.21 percent loss. How uncertain is this point estimate? The Gaussian assumption lets us calculate the asymptotic sampling distribution of q0.01, which turns out to be another Gaussian (see the right graph in Figure 1), implying a standard error of 0.00104. The 95 percent confidence interval is (−0.0347, −0.0306): Either the real q0.01 is in that range, or our data set is one big fluke (at 1-in-20 odds), or the IID-Gaussian model is wrong.
Fitting Models

From its origins in the 19th century through about the 1960s, statistics was split between developing general ideas about how to draw and evaluate statistical inferences, and working out the properties of inferential procedures in tractable special cases (like the one we just went through) or under asymptotic approximations. This yoked a very broad and abstract theory of inference to very narrow and concrete practical formulas, an uneasy combination often preserved in basic statistics classes.
The arrival of (comparatively) cheap and fast computers made it feasible for scientists and statisticians to record lots of data and to fit models to them.

Figure 4. [Caption not recovered; the figure plots tomorrow's return against today's return, with the spline fit referred to in Figure 5.]
By the 1970s statistics faced the problem of quantifying the uncertainty of inferences without using either implausibly helpful assumptions or asymptotics; all of the solutions turned out to demand even more computation. Perhaps the most successful was a proposal by Stanford University statistician Bradley Efron, in a now-famous 1977 paper, to combine estimation with simulation. Over the last three decades, Efron's bootstrap has spread into all areas of statistics, sprouting endless elaborations; here I'll stick to its most basic forms.
Remember that the key to dealing with uncertainty in parameters is the sampling distribution of estimators. Knowing what distribution we'd get for our estimates on repeating the experiment would give us quantities, such as standard errors. Efron's insight was that we can simulate replication. After all, we have already fitted a model to the data, which is a guess at the mechanism that generated the data. Running that mechanism generates simulated data that, by hypothesis, have nearly the same distribution as the real data. Feeding the simulated data through our estimator gives us one draw from the sampling distribution; repeating this many times yields the sampling distribution as a whole. Because the method gives itself its own uncertainty, Efron called this "bootstrapping"; unlike Baron von Münchhausen's plan for getting himself out of a swamp by pulling himself out by his bootstraps, it works.

Figure 5. The same spline fit from the previous figure (black line) is combined with 800 splines fit to bootstrapped resamples of the data (blue curves) and the resulting 95 percent confidence limits for the true regression curve (red lines).
Let's see how this works with the stock-index returns. Figure 2 shows the overall process: Fit a model to data, use the model to calculate the parameter, then get the sampling distribution by generating new, synthetic data from the model and repeating the estimation on the simulation output. The first time I recalculate q0.01 from a simulation, I get −0.0323. Replicated 100,000 times, I get a standard error of 0.00104, and a 95 percent confidence interval of (−0.0347, −0.0306), matching the theoretical calculations to three significant digits. This close agreement shows that I simulated properly! But the point of the bootstrap is that it doesn't rely on the Gaussian assumption, just on our ability to simulate.
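A minimal sketch of this model-based bootstrap, continuing with the names from the earlier sketches (returns, n, mu_hat, sigma_hat) and with B matching the 100,000 replications in the text:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)  # seed chosen arbitrarily

def estimate_q01(x, p=0.01):
    # The estimator being bootstrapped: fit the IID Gaussian model to x,
    # then read off its 1 percent quantile.
    return norm.ppf(p, loc=x.mean(), scale=x.std(ddof=1))

# Model-based bootstrap: simulate data sets from the fitted Gaussian,
# re-run the estimation on each, and collect the re-estimates.
B = 100_000  # as in the text; fewer replications give a rougher answer
boot = np.array([estimate_q01(rng.normal(mu_hat, sigma_hat, size=n))
                 for _ in range(B)])

se_boot = boot.std(ddof=1)                   # bootstrap standard error
ci_boot = np.percentile(boot, [2.5, 97.5])   # simple percentile interval
print(se_boot, ci_boot)
```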
Bootstrapping

The bootstrap approximates the sampling distribution, with three sources of approximation error. First there's simulation error, using finitely many replications to stand for the full sampling distribution. Clever simulation design can shrink this, but brute force (just using enough replications) can also make it arbitrarily small. Second, there's statistical error: The sampling distribution of the bootstrap reestimates under our fitted model is not exactly the same as the sampling distribution of estimates under the true data-generating process. The sampling distribution changes with the parameters, and our initial fit is not completely accurate. But it often turns out that the distribution of estimates around the truth is more nearly invariant than the distribution of estimates themselves, so subtracting the initial estimate from the bootstrapped values helps reduce the statistical error (one common form of this correction is sketched after this paragraph); there are many subtler tricks to the same end. The final source of error in bootstrapping is specification error: The data source doesn't exactly follow our model at all. Simulating the model then never quite matches the actual sampling distribution.
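One common version of that recentering correction, often called the basic or pivotal bootstrap interval, is sketched below; boot and q01 carry over from the previous sketches, and the specific recipe is a standard one rather than anything spelled out in the article.

```python
import numpy as np

# Recentre the bootstrap distribution on the initial estimate, so that
# errors of the re-estimates around the fit stand in for errors of the
# fit around the truth.
deltas = boot - q01
lo, hi = np.percentile(deltas, [2.5, 97.5])
ci_basic = (q01 - hi, q01 - lo)
print(ci_basic)
```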
Here Efron had a second brilliant idea, which is to address specification error by replacing simulation from the model with resampling from the data. After all, our initial collection of data gives us a lot of information about the relative probabilities of different values, and in certain senses this empirical distribution is actually the least prejudiced estimate possible of the underlying distribution; anything else imposes biases or preconceptions, which are possibly accurate but also potentially misleading. We could estimate q0.01 directly from the empirical distribution, without the mediation of the Gaussian model. Efron's "nonparametric bootstrap" treats the original data set as a complete population and draws a new, simulated sample from it, picking each observation with equal probability (allowing repeated values) and then re-running the estimation (as shown in Figure 2).
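A minimal sketch of the nonparametric bootstrap for q0.01, resampling the returns array from the first sketch with replacement:

```python
import numpy as np

rng = np.random.default_rng(2)  # seed chosen arbitrarily
n = len(returns)

# The estimator is now the empirical 1 percent quantile, with no
# Gaussian model in between.
q01_emp = np.quantile(returns, 0.01)

# Nonparametric bootstrap: resample the data with equal probability,
# allowing repeats, and re-run the estimation on each resample.
B = 100_000
boot_np = np.array([
    np.quantile(rng.choice(returns, size=n, replace=True), 0.01)
    for _ in range(B)
])

print(q01_emp, boot_np.std(ddof=1), np.percentile(boot_np, [2.5, 97.5]))
```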
This new method matters here because the Gaussian model is inaccurate; the true distribution is more sharply peaked around zero and has substantially more large-magnitude returns, in both directions, than the Gaussian (see the left graph in Figure 3). For the empirical distribution, q0.01 = −0.0392. This may seem close to our previous point estimate of −0.0326, but it's well beyond the confidence interval, and under the Gaussian model we should see values that negative only 0.25 percent of the time.
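As a check on that last figure, the sketch below asks the fitted Gaussian how often it produces a return at least as bad as the empirical q0.01, reusing mu_hat, sigma_hat and q01_emp from the earlier sketches:

```python
from scipy.stats import norm

# Probability, under the fitted Gaussian, of a log return at or below
# the empirical 1 percent quantile; the text puts this near 0.25 percent.
tail_prob = norm.cdf(q01_emp, loc=mu_hat, scale=sigma_hat)
print(tail_prob)
```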