Sim Slides 2 Handout
Summer, 2011
Resampling
Resampling methods share many similarities with Monte Carlo
simulations; in fact, some refer to resampling methods as a
type of Monte Carlo simulation.
Resampling methods use a computer to generate a large
number of simulated samples.
Patterns in these samples are then summarized and analyzed.
However, in resampling methods, the simulated samples are
drawn from the existing sample of data you have in your
hands and NOT from a theoretically defined (researcher-defined) DGP.
Thus, in resampling methods, the researcher DOES NOT
know or control the DGP, but the goal of learning about the
DGP remains.
Resampling Principles
Begin with the assumption that there is some population
DGP that remains unobserved.
That DGP produced the one sample of data you have in your
hands.
Now, draw a new sample of data that consists of a
different mix of the cases in your original sample. Repeat
that many times so you have a lot of new simulated
samples.
The fundamental assumption is that all information about
the DGP contained in the original sample of data is also
contained in the distribution of these simulated samples.
If so, then resampling from the one sample you have is
equivalent to generating completely new random samples
from the population DGP.
Bootstrap
Jackknife
Permutation/randomization tests
Posterior sampling
Cross-validation
The Bootstrap
sample(Names,N,replace=FALSE)
 [1] Sung-Geun William   Tom       Michael   Jeffrey
 [6] Andrew    Rosie     Jeff      Kate      Ahmed

sample(Names,N,replace=FALSE)
 [1] Jeffrey   Andrew    Tom       William   Ahmed
 [6] Jeff      Sung-Geun Michael   Rosie     Kate

sample(Names,N,replace=TRUE)
 [1] Tom       Tom       Kate      Jeff      Sung-Geun
 [6] Sung-Geun Michael   Andrew    Sung-Geun Ahmed

sample(Names,N,replace=TRUE)
 [1] Sung-Geun Kate      William   Andrew    Kate
 [6] Jeff      Jeff      Sung-Geun William   Tom
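The calls above presumably follow a setup along these lines (a sketch; the handout does not show how Names and N were defined, so this reconstruction is an assumption based on the output):

# Hypothetical setup for the sample() demonstrations above
Names <- c("Sung-Geun", "William", "Tom", "Michael", "Jeffrey",
           "Andrew", "Rosie", "Jeff", "Kate", "Ahmed")
N <- length(Names)                 # 10 names

sample(Names, N, replace = FALSE)  # a reshuffling: every name appears exactly once
sample(Names, N, replace = TRUE)   # a bootstrap resample: some names repeat, others drop out

With replace = FALSE each draw is just a permutation of the original ten names; with replace = TRUE some names appear more than once and others not at all, which is exactly the logic of a bootstrap resample.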
What to Resample?
In the single variable case, you must resample from the data
itself.
However, in something like OLS, you have a choice.
Remember the "Xs fixed in repeated samples" assumption?
So, you can resample from the data, thus getting a new mix
of Xs and Y each time for your simulations.
Or you can leave the Xs fixed, resample from the residuals of
the model, and use those to generate simulated samples of Y
to regress on the same fixed Xs every time.
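Below is a minimal sketch of both choices for a bivariate OLS model; the data and variable names are hypothetical, not from the handout.

# Sketch: two ways to bootstrap an OLS model of y on x (hypothetical data)
set.seed(1234)
n <- 100
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n)
fit <- lm(y ~ x)

B <- 1000
coef.cases <- coef.resid <- matrix(NA, B, 2)
for (b in 1:B) {
  # Option 1: resample the (x, y) pairs, so the Xs change each time
  idx <- sample(1:n, n, replace = TRUE)
  coef.cases[b, ] <- coef(lm(y[idx] ~ x[idx]))

  # Option 2: hold the Xs fixed, resample the residuals, rebuild Y
  e.star <- sample(resid(fit), n, replace = TRUE)
  y.star <- fitted(fit) + e.star
  coef.resid[b, ] <- coef(lm(y.star ~ x))
}
apply(coef.cases, 2, sd)   # bootstrap SEs from resampling cases
apply(coef.resid, 2, sd)   # bootstrap SEs from resampling residuals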
Simple Example
Figure: Density plot of the Estimated Mean of X across resamples
Percentile Bootstrap CI
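As a minimal sketch of the idea (hypothetical data, not the X from the figure above): resample the data many times, compute the mean each time, and take the 2.5th and 97.5th percentiles of those bootstrap means as the 95% CI.

# Sketch: percentile bootstrap CI for a sample mean (hypothetical data)
set.seed(98765)
X <- rnorm(50, mean = 10, sd = 5)       # stand-in sample
B <- 2000
boot.means <- numeric(B)
for (b in 1:B) {
  boot.means[b] <- mean(sample(X, length(X), replace = TRUE))
}
quantile(boot.means, c(0.025, 0.975))   # 95% percentile bootstrap CI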
An Example
Londregan and Snyder (1994) compare the preferences of
legislative committees with those of the entire legislative chamber to
test whether committees are preference outliers.
Competing theories:
Committees will be preference outliers due to self-selection
and candidate-centered incentives to win re-election.
Committees will NOT be preference outliers because the floor
assigns members to develop expertise for the floor to follow.
Benoit et al. (cont.)
The Jackknife
The Jackknife emerged before the bootstrap.
Its primary use has been to compute standard errors and
confidence intervals, just like the bootstrap.
It is a resampling method, but it is based on drawing n
resamples, each of size n - 1, because each time you drop out a
different observation.
The notion is that each sub-sample provides an estimate of
the parameter of interest on a sample that can easily be
viewed as a random sample from the population (if the
original sample was), since it only drops one case at a time.
NOTE: You can leave out groups rather than individual
observations if the sampling/data structure is complex (e.g.,
clustered data).
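A minimal sketch of the leave-one-out logic for a jackknife standard error of a sample mean (hypothetical data); each pass drops one observation and re-computes the estimate.

# Sketch: jackknife standard error of the mean (hypothetical data)
set.seed(321)
samp <- rexp(30)                      # stand-in sample, n = 30
n <- length(samp)
theta.jack <- numeric(n)
for (i in 1:n) {
  theta.jack[i] <- mean(samp[-i])     # estimate with observation i dropped
}
se.jack <- sqrt((n - 1) / n * sum((theta.jack - mean(theta.jack))^2))
se.jack
sd(samp) / sqrt(n)                    # for the mean, this matches the usual SE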
Jackknife (2)
A Digression to Cook's D
The jackknife works very similarly to Cook's Distance (or
Cook's D), which is a measure of how influential individual
observations are on statistical estimates (in OLS).
D_i = \frac{\sum_j (\hat{Y}_j - \hat{Y}_{j(i)})^2}{k\,\hat{\sigma}^2} \qquad (1)

Where:
k = the number of parameters in the model
\hat{Y}_j = the predicted value for the j-th observation from the full model
\hat{Y}_{j(i)} = the predicted value for the j-th observation after the i-th observation has been dropped
\hat{\sigma}^2 = the estimated error variance of the full model
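In practice you rarely compute equation (1) by hand; for a fitted lm object, R's cooks.distance() returns D_i for every observation. A minimal sketch with hypothetical data (not the state poverty data shown below):

# Sketch: Cook's Distance for an OLS model (hypothetical data)
set.seed(555)
income <- rnorm(50, mean = 40000, sd = 6000)
poverty <- 30 - 0.0004 * income + rnorm(50, sd = 2)
fit <- lm(poverty ~ income)
d <- cooks.distance(fit)      # one D_i per observation
plot(d, type = "h", ylab = "Cook's Distance")
which(d > 4 / length(d))      # a common rule-of-thumb cutoff for influence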
Figure: Plot of Poverty and Per Capita Income Using State Postal Codes (y-axis: Percent in Poverty; x-axis: Per Capita Income)
Figure: Cook's Distance Plot from Model of State-Level Poverty Rate as a Function of Per Capita Income (observations labeled by state postal code)
Jackknife-after-Bootstrap
Permutation/Randomization
Just another form of resampling, but in this case it is done
without replacement.
These tests have been around since Fisher introduced them in the
1930s.
They are often used to conduct hypothesis tests where the null
hypothesis is a relationship of zero.
Rather than assume a distribution for the null hypothesis, we
simulate what it would be by randomly reconfiguring our
sample many times (e.g., 1,000) in a way that breaks the
relationship in our sample data.
The question then is: how often do these permutations, or
randomly reshuffled data sets, produce a relationship as large
as or larger than the one we saw in our original sample?
Additional Considerations
A Simple Example
I have data on the weight of chicks and what they were fed.
The samples are small, and the distributions unknown.
Still, I want to know if their weights differ based on what
they were fed.
In a parametric world, I'd do a two-sample difference-of-means
t-test.
But that is only appropriate if the distributional assumption
holds.
Randomization Example
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
x
 [1] 158 171 193 199 230 243 248 248 250 267 271 316 327 329
y
 [1] 141 148 169 181 203 213 229 244 257 260 271 309
Sample.T <- t.test(x, y)
Sample.T

data: x and y
t = 1.3246, df = 23.63, p-value = 0.198
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -15.48547  70.84262
sample estimates:
mean of x mean of y
 246.4286  218.7500
set.seed(6198)
R <- 999                        # number of permutation replicates
z <- c(x, y)                    # pooled sample
K <- seq(1:length(z))           # indices of the pooled sample
reps <- numeric(R)              # storage for the replicates
t0 <- t.test(x, y)$statistic    # observed t statistic

for (i in 2:R) {
  k <- sample(K, size = 14, replace = FALSE)   # indices assigned to the first group
  x1 <- z[k]
  y1 <- z[-k]                                  # the remaining observations
  reps[i] <- t.test(x1, y1)$statistic
}
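A minimal sketch of how these replicates might then be summarized (note that because the loop above starts at i = 2, reps[1] is left at 0; setting it to the observed statistic is one way to include the original grouping in the reference distribution):

# Sketch: two-sided permutation p-value from the replicates above
reps[1] <- t0                          # include the observed statistic itself
p.perm <- mean(abs(reps) >= abs(t0))   # proportion of |t| values at least as extreme
p.perm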
Results
Figure: Density of the simulated values of t from the permutation replications
It does not look like the means of these two groups differ
significantly (at least at the .05 level of significance).
We can compare any aspect of these two samples the same
way: compute the statistic for each of a thousand
replications and then look at their distribution.
In fact, there are tests to evaluate whether the two
distributions are statistically significantly different or not.
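One example of such a test (not necessarily the one the handout has in mind) is the Kolmogorov-Smirnov test, which compares the two empirical distributions directly:

# Sketch: testing whether the two weight distributions differ
ks.test(x, y)    # Kolmogorov-Smirnov test using the soybean (x) and linseed (y) samples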
Figure: Density plots of chick weight in grams for the Soybean and Linseed feed groups
Kirkland (2011)
Modularity is bounded between -1 and 1, but it has no known
distribution.
Kirkland simulates that distribution by randomly partitioning
the network 25,000 times (basically, randomly assigning
legislators to two teams).
The population PDF is then estimated by the 25,000
modularity statistics computed on the randomly partitioned
networks.
Use a percentile method to compute a 95% confidence
interval, and compare the observed modularity in a chamber
to this null distribution.
Party matters.
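A minimal sketch of the percentile step, with stand-in values (the actual 25,000 modularity statistics come from Kirkland's random partitions, which are not reproduced here):

# Sketch: percentile-based null region for modularity (stand-in values)
set.seed(2011)
null.mod <- rnorm(25000, mean = 0.04, sd = 0.005)   # placeholder for the simulated null modularities
null.region <- quantile(null.mod, c(0.025, 0.975))  # 95% null region
observed.mod <- 0.35                                # hypothetical observed chamber modularity
observed.mod > null.region[2]                       # does the observed value exceed the null region?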
Figure: Party modularity by state legislative chamber, compared with the null modularity region and the average U.S. House modularity (Kirkland 2011)
Comparing Methods
Bootstrap is the most flexible and most powerful. It can be
extended to any statistic or calculation you might make
using sample data.
Bootstrapping does NOT make the exchangeability
assumption that randomization tests make.
Jackknife is limited by sample size.
Permutation/randomization methods break all relationships
in the data, so they don't let you produce a covariance matrix. [But what
if we reshuffled just on Y?]
Advantages of Posterior Sampling (PS)
Limitations of PS
Computational intensity
Large models can produce lots of uncertainty around quantities
of interest
How can you avoid using the same data to build and then
test your theory?
Develop your theory, specify your model, etc. before looking
at your data, then run the statistical analysis you planned
one time and write it up.
Use the data you have to build your model, then collect fresh
data to test it.
Divide the data you have so you use some of it to build your
model and some of it to independently test it.
This last option is cross-validation.
Cross-Validation (CV)
Stagflation
CV Examples
General CV Steps
Two Types of CV
Split-sample CV:
Partition into 50% training, 50% testing (could also do
75/25, 80/20, etc.)
Usually want to maximize the size of the training set
Particularly common in time series analysis, where the testing
data are generally the most recent years for which data are
available.
Leave-one-out CV:
Iterative method with number of iterations = sample size
Each observation becomes the testing set one time
Note the parallel to the Jackknife and Cook's D.
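A minimal sketch of both types for a simple regression model (hypothetical data):

# Sketch: split-sample and leave-one-out CV for an OLS model (hypothetical data)
set.seed(42)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Split-sample CV: 50% training, 50% testing
train <- sample(1:n, n / 2)
fit <- lm(y ~ x, data = dat[train, ])
pred <- predict(fit, newdata = dat[-train, ])
mean((dat$y[-train] - pred)^2)          # test-set mean squared error

# Leave-one-out CV: n iterations, each observation held out once
errs <- numeric(n)
for (i in 1:n) {
  fit.i <- lm(y ~ x, data = dat[-i, ])
  errs[i] <- dat$y[i] - predict(fit.i, newdata = dat[i, ])
}
mean(errs^2)                            # LOO-CV mean squared error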
Leave-One-Out CV
Limitations of CV