A First Course in Causal Inference
Lecture notes for my “Causal Inference” course
at the University of California Berkeley
Contents
Preface
Acronyms
I Introduction
1 Correlation, Association, and the Yule–Simpson Paradox
1.1 Traditional view of statistics
1.2 Some commonly-used measures of association
1.2.1 Correlation and regression
1.2.2 Contingency tables
1.3 An example of the Yule–Simpson Paradox
1.3.1 Data
1.3.2 Explanation
1.3.3 Geometry of the Yule–Simpson Paradox
1.4 The Berkeley graduate school admission data
1.5 Homework Problems
2 Potential Outcomes
2.1 Experimentalists' view of causal inference
2.2 Formal notation of potential outcomes
2.2.1 Causal effects, subgroups, and the non-existence of the Yule–Simpson Paradox
2.2.2 Subtlety of experimental unit
2.3 Treatment assignment mechanism
2.4 Homework Problems
II Randomized experiments
3 The Completely Randomized Experiment and the Fisher Randomization Test
3.1 CRE
3.2 FRT
3.3 Canonical choices of the test statistic
3.4 A case study of the LaLonde experimental data
3.5 Some history of randomized experiments and FRT
3.5.1 James Lind's experiment
7 Matched-Pairs Experiment
7.1 Design of the experiment and potential outcomes
7.2 FRT
7.3 Neymanian inference
7.4 Covariate adjustment
7.4.1 FRT
7.4.2 Regression adjustment
7.5 Examples
7.5.1 Darwin's data comparing cross-fertilizing and self-fertilizing on the height of corns
7.5.2 Children's television workshop experiment data
7.6 Comparing the MPE and CRE
7.7 Extension to the general matched experiment
7.7.1 FRT
7.7.2 Estimating the average of the within-strata effects
7.7.3 A more general causal estimand
7.8 Homework Problems
Acronyms
Part I
Introduction
1 Correlation, Association, and the Yule–Simpson Paradox
This book has a very different view: statistics is crucial for understanding
causality. The main focus of this book is to introduce the formal language for
causal inference and develop statistical methods to estimate causal effects in
randomized experiments and observational studies.
        Y = 1   Y = 0
Z = 1   p11     p10
Z = 0   p01     p00
$$ \text{rd} = \text{pr}(Y = 1 \mid Z = 1) - \text{pr}(Y = 1 \mid Z = 0) = \frac{p_{11}}{p_{11}+p_{10}} - \frac{p_{01}}{p_{01}+p_{00}}, $$
the risk ratio as
$$ \text{rr} = \frac{\text{pr}(Y = 1 \mid Z = 1)}{\text{pr}(Y = 1 \mid Z = 0)} = \frac{p_{11}}{p_{11}+p_{10}} \Big/ \frac{p_{01}}{p_{01}+p_{00}}, $$
and the odds ratio as
$$ \text{or} = \frac{\text{pr}(Y = 1 \mid Z = 1)/\text{pr}(Y = 0 \mid Z = 1)}{\text{pr}(Y = 1 \mid Z = 0)/\text{pr}(Y = 0 \mid Z = 0)} = \frac{p_{11}/p_{10}}{p_{01}/p_{00}}, $$
where the odds of an event is the probability that the event happens over the probability that
the event happens over the probability that the event does not happen.
2 This book uses the notation $\perp\!\!\!\perp$ to denote independence or conditional independence of
random variables. The notation is due to Dawid (1979).
I leave the proofs of statements (1) and (2) as a homework problem. Statement (3) is informal. The approximation holds because the odds p/(1 − p)
is close to the probability p for rare diseases with p ≈ 0: by Taylor expansion, p/(1 − p) = p + p² + · · · ≈ p. In epidemiology, if the outcome represents the occurrence of a rare disease, then it is reasonable to assume that
pr(Y = 1 | X = 1) and pr(Y = 1 | X = 0) are small.
We can also define conditional versions of the rd, rr, and or if the prob-
abilities are replaced by the conditional probabilities given another variable
X, i.e., pr(Y = 1 | Z = 1, X = x) and pr(Y = 1 | Z = 0, X = x).
With frequencies nzy = #{i : Zi = z, Yi = y}, we can summarize the
observed data in the following two-by-two table:
        Y = 1   Y = 0
Z = 1   n11     n10
Z = 0   n01     n00
We can estimate rd, rr, and or by replacing the true probabilities with the sample proportions. In R, the function fisher.test performs an exact test and chisq.test performs an asymptotic test of $Z \perp\!\!\!\perp Y$ based on a two-by-two table of observed data.
        0 (no callback)   1 (callback)
black   2278              157
white   2200              235
The two rows have the same total count, so it is apparent that White names
received more callbacks. Fisher’s exact test below shows that this difference is
statistically significant.
> fisher.test(Alltable)

data:  Alltable
p-value = 4.759e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.249828 1.925573
sample estimates:
odds ratio
  1.549732

1.3 An example of the Yule–Simpson Paradox
        Y = 1   Y = 0
Z = 1   273     77
Z = 0   289     61
The estimated rd is
$$ \hat{\text{rd}} = \frac{273}{273+77} - \frac{289}{289+61} = 78\% - 83\% = -5\% < 0. $$
Treatment 0 seems better, that is, the small puncture leads to a higher success rate compared to the open surgical procedure.
However, the data were not from a randomized controlled trial (RCT).³
Patients receiving treatment 1 can be very different from patients receiving
treatment 0. A “lurking variable” in this study is the severity of the case:
some patients have smaller stones but some patients have larger stones. We
can split the data according to the size of the stones.
For patients with smaller stones, the treatment and outcome data can be summarized in the following two-by-two table:

        Y = 1   Y = 0
Z = 1   81      6
Z = 0   234     36

For patients with larger stones, the treatment and outcome data can be summarized in the following two-by-two table:

        Y = 1   Y = 0
Z = 1   192     71
Z = 0   55      25
3 In an RCT, patients are randomly assigned to the treatment arms. Part II of this book focuses on randomized experiments.
1.3.2 Explanation
Let X be the binary indicator with X = 1 for smaller stones and X = 0 for
larger stones. Let us first take a look at the X–Z relationship by comparing
the probabilities of receiving treatment 1 among patients with smaller and
larger stones:
$$ \hat{\text{pr}}(Z = 1 \mid X = 1) - \hat{\text{pr}}(Z = 1 \mid X = 0) = \frac{81+6}{81+6+234+36} - \frac{192+71}{192+71+55+25} = 24\% - 77\% = -53\% < 0. $$
So patients with larger stones tend to take treatment 1. Statistically, X and
Z have negative association.
Let us then take a look at the X–Y relationship by comparing the probabil-
ities of success among patients with smaller and larger stones: under treatment
1,
$$ \hat{\text{pr}}(Y = 1 \mid Z = 1, X = 1) - \hat{\text{pr}}(Y = 1 \mid Z = 1, X = 0) = \frac{81}{81+6} - \frac{192}{192+71} = 93\% - 73\% = 20\% > 0; $$
[Figure 1.1 here: a diagram with arrows X → Z (labeled −), X → Y (labeled +), and Z → Y (labeled +).]
FIGURE 1.1: A diagram for the kidney stone example. The signs indicate the associations of two variables, conditioning on other variables pointing to the downstream variable.
under treatment 0,
$$ \hat{\text{pr}}(Y = 1 \mid Z = 0, X = 1) - \hat{\text{pr}}(Y = 1 \mid Z = 0, X = 0) = \frac{234}{234+36} - \frac{55}{55+25} = 87\% - 69\% = 18\% > 0. $$
So under both treatment levels, patients with smaller stones have higher suc-
cess probabilities. Statistically, X and Y have positive association conditional
on both treatment levels.
We can summarize the qualitative associations in the diagram in Figure
1.1. In technical terms, the treatment has a positive direct path and a more
negative indirect path to the outcome, so the overall association is negative
between the treatment and outcome. In plain English, when less effective
treatment 0 is applied more frequently to the less severe cases, it can appear
to be a more effective treatment.
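To see the sign reversal numerically, here is a short R sketch using the counts from the tables above:

## success proportions computed from the kidney stone tables above
rd = function(s1, n1, s0, n0) s1 / n1 - s0 / n0
rd(81, 87, 234, 270)                        # smaller stones: 93% - 87% > 0
rd(192, 263, 55, 80)                        # larger stones:  73% - 69% > 0
rd(81 + 192, 87 + 263, 234 + 55, 270 + 80)  # aggregated:     78% - 83% < 0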
whole population:

        Y = 1   Y = 0
Z = 1   n11     n10
Z = 0   n01     n00
Figure 1.2 shows the geometry of the Yule–Simpson Paradox. The y-axis
shows the count of successes with Y = 1 and the x-axis shows the count of
failures with Y = 0. The two parallelograms correspond to aggregating the counts of successes and failures under two treatment levels. The slope of OA1 is larger than that of OB1, and the slope of OA0 is larger than that of OB0.
So the treatment seems beneficial to the outcome within both levels of X.
However, the slope of OA is smaller than that of OB. So the treatment seems
harmful to the outcome for the whole population. The Yule–Simpson Paradox
arises.
, , Dept = A

       Admit
Gender   Admitted Rejected
  Male        512      313
  Female       89       19
, , Dept = B
Admit
Gender Admitted Rejected
Male 353 207
Female 17 8
, , Dept = C
Admit
Gender Admitted Rejected
Male 120 205
Female 202 391
, , Dept = D
Admit
Gender Admitted Rejected
Male 138 279
Female 131 244
, , Dept = E
Admit
Gender Admitted Rejected
Male 53 138
Female 94 299
, , Dept = F
Admit
Gender Admitted Rejected
Male 22 351
Female 24 317
$ pv
[1] 1.055797e-21
Stratifying on the departments, we find smaller and insignificant differences
between the admission rates of male and female students. In department A,
the difference is significant but negative.
> P.diff = rep(0, 6)
> PV = rep(0, 6)
> for (dd in 1:6)
+ {
+   department = risk.difference(UCBAdmissions[, , dd])
+   P.diff[dd] = department$p.diff
+   PV[dd] = department$pv
+ }
>
> round(P.diff, 2)
[1] -0.20 -0.05  0.03 -0.02  0.04 -0.01
> round(PV, 2)
[1] 0.00 0.77 0.43 0.64 0.37 0.64
The correlation coefficient between X and Y is ρXY. There are many equivalent definitions of the partial correlation coefficient. For a multivariate Normal vector, let ρXY|Z denote the partial correlation coefficient between X and Y given Z, which is defined as their correlation coefficient in the conditional distribution (X, Y) | Z. Show that
$$ \rho_{XY\mid Z} = \frac{\rho_{XY} - \rho_{XZ}\rho_{YZ}}{\sqrt{1-\rho_{XZ}^{2}}\sqrt{1-\rho_{YZ}^{2}}}. $$
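A quick numerical sanity check of this identity (a sketch, not a proof; the correlation matrix below is hypothetical) can be done by simulating a multivariate Normal vector in R:

## compare the formula with the partial correlation computed from residuals
library(MASS)
set.seed(0)
Sigma = matrix(c(1.0, 0.5, 0.3,
                 0.5, 1.0, 0.4,
                 0.3, 0.4, 1.0), nrow = 3)
dat = mvrnorm(10^5, mu = rep(0, 3), Sigma = Sigma)
x = dat[, 1]; y = dat[, 2]; z = dat[, 3]

rho = cor(dat)
rhs = (rho[1, 2] - rho[1, 3] * rho[2, 3]) /
  sqrt((1 - rho[1, 3]^2) * (1 - rho[2, 3]^2))     # right-hand side of the formula
lhs = cor(resid(lm(x ~ z)), resid(lm(y ~ z)))     # partial correlation via residuals
c(lhs, rhs)                                       # the two values agree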
2 Potential Outcomes
Example 2.1 If we are interested in the effect of taking aspirin or not on the relief of headache, the intervention is taking aspirin.
Example 2.4 Gerber et al. (2008) were interested in the effect of different get-out-the-vote messages on voting behavior. The intervention is different get-out-the-vote messages.
Example 2.5 Pearl (2018) claimed that we could infer the effect of obesity on life span. A popular measure of obesity is the body mass index (BMI), defined as the body mass divided by the square of the body height, in units of kg/m². So the intervention can be BMI.
                   callback   no callback
African-American   157        2278
White              235        2200
From the above, we can compare the probabilities of being called back among African-American- and White-sounding names:
$$ \frac{157}{2278+157} - \frac{235}{2200+235} = 6.45\% - 9.65\% = -3.20\% < 0, $$
with p-value from the Fisher exact test much smaller than 0.001.
Assumption 2.1 (no interference) Unit i’s potential outcomes do not de-
pend on other units’ treatments. This is sometimes called the no-interference
assumption.
Assumption 2.2 (consistency) There are no other versions of the treat-
ment. Equivalently, we require that the treatment level be well defined, or have
no ambiguity at least for the outcome of interest. This is sometimes called the
consistency assumption.
Assumption 2.1 can be violated in infectious diseases or network experiments. For instance, if some of my friends receive flu shots, my chance of getting the flu decreases even if I do not receive the flu shot; if my friends see an ad on Facebook, my chance of buying that product increases even if I do not see the ad. It is an active research area to study situations with interfering units in the modern causal inference literature.
Assumption 2.2 can be violated for treatment with complex components.
For instance, when studying the effect of cigarette smoking on lung cancer, the
type of cigarettes may matter; when studying the effect of college education
on income, the type and major of college education may matter.
Rubin (1980) called Assumptions 2.1 and 2.2 above together the Stable Unit Treatment Value Assumption (SUTVA).
Assumption 2.3 (SUTVA) Both Assumptions 2.1 and 2.2 hold.
Under SUTVA, Rubin (2005) called the n×2 matrix of potential outcomes
the Science Table:
i      Yi(1)    Yi(0)
1      Y1(1)    Y1(0)
2      Y2(1)    Y2(0)
...    ...      ...
n      Yn(1)    Yn(0)
Due to Neyman and Rubin’s fundamental contribution to statistical causal
inference, the potential outcomes framework is sometimes called the Neyman
model, the Neyman–Rubin model, or the Rubin Causal Model.
Causal effects are functions of the Science Table. Inferring individual causal effects
$$ \tau_i = Y_i(1) - Y_i(0) $$
is fundamentally challenging because we can only observe either Yi(1) or Yi(0) for each unit i, that is, we can observe only half of the Science Table. As a starting point, most parts of the book focus on the average causal effect (ACE):
$$ \tau = n^{-1}\sum_{i=1}^{n} \{Y_i(1) - Y_i(0)\} = n^{-1}\sum_{i=1}^{n} Y_i(1) - n^{-1}\sum_{i=1}^{n} Y_i(0). $$
But we can easily extend our discussion to many other parameters (also called
estimands).
where X ∼ N(µ, σ²), and ϕ(·) and Φ(·) are the probability density and cumulative distribution functions of a standard Normal random variable.
which is, in general, different from the median of the individual treatment effect
$$ \delta_2 = \text{median}\{Y_i(1) - Y_i(0)\}_{i=1}^{n}. $$
1. Give numerical examples which have δ1 = δ2 , δ1 > δ2 , and δ1 < δ2 .
2. Which estimand makes more sense, δ1 or δ2 ? Why? Use examples
to justify your conclusion. If you feel that both δ1 and δ2 can make
sense in different applications, you can also give examples to justify
both estimands.
Part II
Randomized experiments
3 The Completely Randomized Experiment and the Fisher Randomization Test
3.1 CRE
Consider an experiment with n units, with n1 receiving the treatment and n0
receiving the control. We can define the CRE based on its treatment assignment mechanism.¹
3.2 FRT
Fisher (1935) was interested in testing the following null hypothesis:
$$ H_{0f}: Y_i(1) = Y_i(0) \text{ for all units } i = 1, \ldots, n. $$
Rubin (1980) called it the sharp null hypothesis in the sense that it can determine all the potential outcomes based on the observed data: Y(1) = Y(0) = Y = (Y1, . . . , Yn), the vector of the observed outcomes. It is also called the strong null hypothesis (e.g., Wu and Ding, 2021).
Conceptually, under H0f, the FRT works for any test statistic
$$ T = T(Z, Y). \quad (3.1) $$
The treatment vector Z takes values in {z^1, . . . , z^M}, where M = $\binom{n}{n_1}$ and the z^m's are all possible vectors with n1 1's and n0 0's. In R, we can enumerate these vectors as follows:
> permutation10 = function(n, n1){
+   M = choose(n, n1)
+   treat.index = combn(n, n1)
+   Z = matrix(0, n, M)
+   for (m in 1:M){
+     treat = treat.index[, m]
+     Z[treat, m] = 1
+   }
+   Z
+ }
>
> permutation10(5, 3)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    1    1    1    1    1    0    0    0     0
[2,]    1    1    1    0    0    0    1    1    1     0
[3,]    1    0    0    1    1    0    1    1    0     1
[4,]    0    1    0    1    0    1    1    0    1     1
[5,]    0    0    1    0    1    1    0    1    1     1
As a consequence, T is uniform over the set (with possible duplications)
{T(z^1, Y), . . . , T(z^M, Y)}.
That is, the distribution of T is known due to the design of the CRE. We will
call this distribution of T the randomization distribution.
If larger values are more extreme for T , we can use the following tail
probability to measure the extremeness of the test statistic with respect to its
randomization distribution:
$$ p_{\text{frt}} = M^{-1}\sum_{m=1}^{M} I\{T(z^m, Y) \geq T(Z, Y)\}, \quad (3.2) $$
which is called the p-value by Fisher. Figure 3.1 illustrates the computational
process of pfrt .
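For a tiny illustration (a sketch with made-up data, using the permutation10 function above and the difference in means as the test statistic), pfrt in (3.2) can be computed by complete enumeration:

## a tiny made-up example: n = 5 units, n1 = 3 treated
z = c(1, 1, 1, 0, 0)
y = c(5, 3, 4, 1, 2)
Tobs = mean(y[z == 1]) - mean(y[z == 0])

Zall = permutation10(5, 3)        # all 10 possible treatment vectors
Tall = apply(Zall, 2, function(zm) mean(y[zm == 1]) - mean(y[zm == 0]))
pfrt = mean(Tall >= Tobs)         # exact randomization p-value
pfrt                              # equals 1/10 = 0.1 for these data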
The p-value, pfrt , in (3.2) works for any choice of test statistic and any
outcome-generating process. It also extends naturally to any experiments,
2 This is the standard definition of the p-value in mathematical statistics. The inequality
is often due to discreteness of the test statistic, and when the equality holds, the p-value is
Uniform(0, 1) under the null hypothesis. Let F (·) be the distribution function of T (Z, Y ).
Even though it is a step function, we assume that it is continuous and strictly increasing as
if it is the distribution function of a continuous random variable taking values on the whole
real line. So pfrt = 1 − F (T ), and
pr(pfrt ≤ u) = pr{1 − F(T) ≤ u} = pr{T ≥ F^{-1}(1 − u)} = 1 − F(F^{-1}(1 − u)) = u.
The discreteness of T does cause some technical issues in the proof, yielding an inequality
instead of an equality. I leave the technical details in Problem 3.1.
3.3 Canonical choices of the test statistic
is the sample mean of the outcomes under the control, respectively. Under H0f ,
it has mean
$$ E(\hat\tau) = n_1^{-1}\sum_{i=1}^{n} E(Z_i)Y_i - n_0^{-1}\sum_{i=1}^{n} E(1-Z_i)Y_i = 0 $$
and variance
$$ \begin{aligned} \text{var}(\hat\tau) &= \text{var}\left\{ n_1^{-1}\sum_{i=1}^{n} Z_i Y_i - n_0^{-1}\sum_{i=1}^{n} (1-Z_i)Y_i \right\} \\ &= \text{var}\left( \frac{n}{n_0 n_1}\sum_{i=1}^{n} Z_i Y_i \right) \\ &\overset{*}{=} \frac{n^2}{n_0^2}\left(1 - \frac{n_1}{n}\right)\frac{s^2}{n_1} \\ &= \frac{n}{n_1 n_0}\,s^2, \end{aligned} $$
where $\overset{*}{=}$ follows from Lemma A3.2 for simple random sampling, with
$$ \bar{Y} = n^{-1}\sum_{i=1}^{n} Y_i, \quad s^2 = (n-1)^{-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2. $$
The observed data are {Yi : Zi = 1} and {Yi : Zi = 0}, so the problem
is essentially a two-sample problem. Under the assumption of IID Normal
outcomes (see Section A1.4.1), the classic two-sample t-test assuming equal variance is based on
$$ \frac{\hat\tau}{\sqrt{ \dfrac{n}{n_1 n_0 (n-2)}\left[ \sum_{Z_i=1}\{Y_i - \hat{\bar Y}(1)\}^2 + \sum_{Z_i=0}\{Y_i - \hat{\bar Y}(0)\}^2 \right] }} \sim t_{n-2}. \quad (3.6) $$
With a large sample size n, we can ignore the difference between N(0, 1) and t_{n−2} and the difference between n − 1 and n − 2. Moreover, under H0f, τ̂ converges to zero in probability, so the term $n_1 n_0 \hat\tau^2/n$ can be ignored asymptotically. Therefore, under H0f, the approximate p-value in Example 3.1 is close to the p-value from the classic two-sample t-test assuming equal variance, which can be calculated by t.test with var.equal = TRUE. Under alternative hypotheses with nonzero τ, the additional term $n_1 n_0 \hat\tau^2/n$ in the above expansion can make the FRT less powerful than the usual t-test.
Based on the above discussion, the FRT with τ̂ effectively uses a pooled
variance ignoring the heteroskedasticity between these two groups. In classical
statistics, the two-sample problem with heteroskedastic Normal outcomes is
called the Behrens–Fisher problem (see Section A1.4.1). In the Behrens–Fisher
problem, a standard choice of the test statistic is the studentized statistic
below.
$$ t = \frac{\hat\tau}{\sqrt{ \hat{S}^2(1)/n_1 + \hat{S}^2(0)/n_0 }}, $$
where $\hat{S}^2(1) = (n_1-1)^{-1}\sum_{Z_i=1}\{Y_i - \hat{\bar Y}(1)\}^2$ and $\hat{S}^2(0) = (n_0-1)^{-1}\sum_{Z_i=0}\{Y_i - \hat{\bar Y}(0)\}^2$
are the sample variances of the observed outcomes under the treatment and control, respectively.
control, respectively. Under H0f , the finite population central limit theorem
again implies that t is asymptotically Normal:
t → N(0, 1)
Ri = #{j : Yj ≤ Yi }.
The Wilcoxon rank sum statistic is the sum of the ranks under treatment:
$$ W = \sum_{i=1}^{n} Z_i R_i. $$
For algebraic simplicity, we assume that there are no ties in the outcomes,
although the FRT can be applied regardless of the existence of ties. For the
case with ties, see Lehmann (1975, Chapter 1, Section 4). Because the sum of the ranks of the pooled samples is fixed at 1 + 2 + · · · + n = n(n + 1)/2, the Wilcoxon statistic is equivalent to the difference in the means of the ranks under treatment and control. Under H0f, the Ri's are fixed, so W has mean
$$ E(W) = \sum_{i=1}^{n} E(Z_i)R_i = \frac{n_1}{n}\sum_{i=1}^{n} i = \frac{n_1}{n}\times\frac{n(n+1)}{2} = \frac{n_1(n+1)}{2} $$
and variance
$$ \begin{aligned} \text{var}(W) &= \text{var}\left( n_1 \cdot \frac{1}{n_1}\sum_{i=1}^{n} Z_i R_i \right) \\ &\overset{*}{=} n_1^2\left(1 - \frac{n_1}{n}\right)\frac{1}{n_1}\,\frac{1}{n-1}\sum_{i=1}^{n}\left(R_i - \frac{n+1}{2}\right)^2 \\ &= \frac{n_1 n_0}{n(n-1)}\sum_{i=1}^{n}\left(i - \frac{n+1}{2}\right)^2 \\ &= \frac{n_1 n_0}{n(n-1)}\left\{\sum_{i=1}^{n} i^2 - n\left(\frac{n+1}{2}\right)^2\right\} \\ &= \frac{n_1 n_0}{n(n-1)}\left\{\frac{n(n+1)(2n+1)}{6} - n\left(\frac{n+1}{2}\right)^2\right\} \\ &= \frac{n_1 n_0(n+1)}{12}, \end{aligned} $$
where $\overset{*}{=}$ follows from Lemma A3.2. Furthermore, under H0f, the finite population central limit theorem ensures that the randomization distribution of W is approximately Normal:
$$ \frac{\sum_{i=1}^{n} Z_i R_i - \frac{n_1(n+1)}{2}}{\sqrt{\frac{n_1 n_0(n+1)}{12}}} \to \text{N}(0, 1). \quad (3.8) $$
Figure 3.2 shows the histograms of the outcomes under the treatment and
control.
The following code computes the observed values of the test statistics using
existing functions:
> tauhat = t.test(y[z == 1], y[z == 0],
+                 var.equal = TRUE)$statistic
> tauhat
       t
2.835321
> student = t.test(y[z == 1], y[z == 0],
+                  var.equal = FALSE)$statistic
> student
       t
2.674146
> W = wilcox.test(y[z == 1], y[z == 0])$statistic
> W
      W
27402.5
> D = ks.test(y[z == 1], y[z == 0])$statistic
> D
        D
0.1321206
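The randomization distributions Tauhat, Student, Wilcox, and Ks used below are not shown in this excerpt; a minimal Monte Carlo sketch that would produce them by permuting z is:

## Monte Carlo approximation of the randomization distributions
## (a sketch: the original code producing these vectors is not shown here)
MC = 10^4
Tauhat  = rep(0, MC)
Student = rep(0, MC)
Wilcox  = rep(0, MC)
Ks      = rep(0, MC)
for (mc in 1:MC) {
  zperm = sample(z)        # a random permutation of the treatment vector
  Tauhat[mc]  = t.test(y[zperm == 1], y[zperm == 0], var.equal = TRUE)$statistic
  Student[mc] = t.test(y[zperm == 1], y[zperm == 0], var.equal = FALSE)$statistic
  Wilcox[mc]  = wilcox.test(y[zperm == 1], y[zperm == 0])$statistic
  Ks[mc]      = ks.test(y[zperm == 1], y[zperm == 0])$statistic
}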
The one-sided p-values based on the FRT are all smaller than 0.05:
> exact.pv = c(mean(Tauhat >= tauhat),
+              mean(Student >= student),
+              mean(Wilcox >= W),
+              mean(Ks >= D))
> round(exact.pv, 3)
[1] 0.002 0.002 0.006 0.040
Without using Monte Carlo, we can also compute the asymptotic p-values
which are all smaller than 0.05:
> asym.pv = c(t.test(y[z == 1], y[z == 0],
+                    var.equal = TRUE)$p.value,
+             t.test(y[z == 1], y[z == 0],
+                    var.equal = FALSE)$p.value,
+             wilcox.test(y[z == 1], y[z == 0])$p.value,
+             ks.test(y[z == 1], y[z == 0])$p.value)
> round(asym.pv, 3)
[1] 0.005 0.008 0.011 0.046
The differences between the p-values are due to the asymptotic approxi-
mations as well as the fact that the default choices for t.test and wilcox.test
are two-sided tests.
Figure 3.3 shows the histograms of the randomization distributions of the four test statistics, as well as their corresponding observed values. For the first three test statistics, the Normal approximations work quite well even though the underlying outcome data distribution is far from Normal. In general, a figure like Figure 3.3 can give very clear information for testing the sharp null hypothesis. Recently, Bind and Rubin (2020) proposed, in the title of their paper, that "when possible, report a Fisher-exact p-value and display its underlying null randomization distribution."
After six days, patients in the fifth group recovered, but patients in other
groups did not. If we simplify the treatment as
This is the most extreme possible 2 × 2 table we can observe under this experiment, and the data contain strong evidence for the positive effect of citrus fruits for curing scurvy. Statistically, how do we measure the strength of the evidence?
Following the logic of the FRT, if the treatment has no effect at all (under
H0f ), the extreme 2 × 2 table will occur with probability
$$ \frac{1}{\binom{12}{2}} = \frac{1}{66} = 0.015, $$
which is the pfrt. This seems surprising under H0f: we can easily reject H0f at the level 0.05.
If Lind had assigned only one patient to each of the six groups, then the smallest possible p-value would be
$$ \frac{1}{\binom{6}{1}} = \frac{1}{6} = 0.167; $$
if Fisher had made only 6 cups of tea, 3 with milk added first and the other 3 with tea added first, then the smallest possible p-value would be
$$ \frac{1}{\binom{6}{3}} = \frac{1}{20} = 0.05. $$
We can never reject the null hypotheses at the level of 0.05. This highlights
the second Fisherian principle of experiments: replications.
Chapter 5 will discuss the third Fisherian principle of experiments: block-
ing.
3.6 Discussion
3.6.1 Other sharp null hypotheses and confidence intervals
I focus on the sharp null hypothesis H0f above. In fact, the logic of the FRT
also works for other sharp null hypotheses. For instance, we can test
$$ H_0(\boldsymbol{\tau}): Y_i(1) - Y_i(0) = \tau_i \text{ for all } i = 1, \ldots, n $$
for a known vector τ = (τ1, . . . , τn). Because the individual causal effects are
all known under H0 (τ ), we can impute all missing potential outcomes based on
the observed data. With known potential outcomes, the distribution of any test
statistic is completely determined by the treatment assignment mechanism,
and therefore, we can compute the corresponding pfrt as a function of τ ,
denoted by pfrt (τ ). If we can specify all possible τ ’s, then we can compute
a series of pfrt (τ )’s. By duality of hypothesis testing and confidence set (see
Section A1.2.5), we can obtain a (1 − α)-level confidence set for the average causal effect:
$$ \left\{ \tau = n^{-1}\sum_{i=1}^{n} \tau_i : p_{\text{frt}}(\boldsymbol{\tau}) \geq \alpha \right\}. $$
for a known constant c. Given c, we can compute pfrt(c). By duality, we can obtain a (1 − α)-level confidence set for the average causal effect:
$$ \{ c : p_{\text{frt}}(c) \geq \alpha \}. $$
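As an illustration of this duality (a sketch with hypothetical inputs z, y, and a hypothetical grid of candidate constants, using the difference in means), the confidence set can be obtained by testing H0(c) over a grid and retaining the values not rejected:

## FRT-based confidence set by inverting constant-effect nulls H0(c)
frt_ci = function(z, y, cgrid, MC = 10^3, alpha = 0.05) {
  pvals = sapply(cgrid, function(c) {
    y0 = y - c * z                        # impute Y_i(0) under H0(c)
    Tobs = mean(y[z == 1]) - mean(y[z == 0]) - c
    Tperm = replicate(MC, {
      zp = sample(z)                      # permuted assignment
      yp = y0 + c * zp                    # imputed observed outcomes
      mean(yp[zp == 1]) - mean(yp[zp == 0]) - c
    })
    mean(abs(Tperm) >= abs(Tobs))         # two-sided p-value for H0(c)
  })
  cgrid[pvals >= alpha]                   # values of c not rejected at level alpha
}
## example usage: frt_ci(z, y, cgrid = seq(-2, 2, by = 0.05))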
because power depends on the alternative hypothesis. The four test statistics
in Section 3.3 are motivated by different alternative hypotheses. For instance,
τ̂ and t are motivated by an alternative hypothesis with nonzero average
treatment effect; W is motivated by an alternative hypothesis with a constant
causal effect with outliers. Specifying a working alternative hypothesis is often
helpful for constructing a test statistic although it does not have to be precise
to guarantee the validity of the FRT. Problems 3.6 and 3.7 illustrate the idea
of using a working alternative hypothesis or statistical model to construct test
statistics.
and
$$ \text{var}_{\text{mc}}(\hat{p}_{\text{frt}}) \leq \frac{1}{4R}, $$
where the subscript “mc” signifies the randomness due to Monte Carlo, that
is, p̂frt is random because z r ’s are R independent random draws from all
possible values of Z.
Remark: pfrt is random because Z is random. But in this problem, we
condition on data, so pfrt becomes a fixed number. p̂frt is random because
the z r ’s are random permutations of Z.
Problem 3.2 shows that p̂frt is unbiased for pfrt over the Monte Carlo
randomness and gives an upper bound on the variance of p̂frt . Luo et al.
(2021, Theorem 2) gives a more delicate bound on the Monte Carlo error.
4 The Completely Randomized Experiment and Neymanian Inference
In his seminal paper, Neyman (1923) not only proposed to use the notation of
potential outcomes but also derived rigorous mathematical results for making
inference of the average causal effect under a CRE. In contrast to Fisher’s idea
of calculating the p-value under the sharp null hypothesis, Neyman (1923)
proposed an unbiased point estimator and a conservative confidence interval
based on the sampling distribution of the point estimator. This chapter will
introduce Neyman (1923)’s fundamental results, which are very important for
understanding later chapters in Part II of this book.
variances¹
$$ S^2(1) = (n-1)^{-1}\sum_{i=1}^{n} \{Y_i(1) - \bar Y(1)\}^2, \quad S^2(0) = (n-1)^{-1}\sum_{i=1}^{n} \{Y_i(0) - \bar Y(0)\}^2, $$
and covariance
$$ S(1, 0) = (n-1)^{-1}\sum_{i=1}^{n} \{Y_i(1) - \bar Y(1)\}\{Y_i(0) - \bar Y(0)\}. $$
1 Here the divisor n − 1 makes the formulas more elegant. Changing the divisor to n
complicates the formulas but does not change the results fundamentally. With large n, the
difference is minor.
and variance
$$ S^2(\tau) = (n-1)^{-1}\sum_{i=1}^{n} (\tau_i - \tau)^2. $$
We have the following relationship between the variances and covariance.
Lemma 4.1 $2S(1, 0) = S^2(1) + S^2(0) - S^2(\tau)$.
The proof of Lemma 4.1 follows from elementary algebra. I leave it as
Problem 4.1.
These fixed quantities are functions of the Science Table {Yi (1), Yi (0)}ni=1 .
We are interested in estimating the average causal effect τ based on the data
(Zi , Yi )ni=1 from a CRE.
But there are no sample versions of S(1, 0) and S 2 (τ ) because the potential
outcomes Yi (1) and Yi (0) are never jointly observed for each unit i. Neyman
(1923) proved the following theorem.
Theorem 4.1 Under a CRE,
$$ \hat{V} = \frac{\hat{S}^2(1)}{n_1} + \frac{\hat{S}^2(0)}{n_0} $$
is conservative for estimating var(τ̂):
$$ E(\hat{V}) - \text{var}(\hat\tau) = \frac{S^2(\tau)}{n} \geq 0, $$
with the equality holding if and only if τi = τ for all units.
1. the FRT works for any test statistic but Neyman (1923)’s theorem
is only about the difference in means. Although we could derive the
properties of other estimators similar to Neyman (1923)’s theorem,
this mathematical exercise is often quite challenging for general es-
timators;
2. in Figure 3.1 , the observed outcome vector Y is fixed but in Figure
4.1, the observed outcome vector Y (z m ) changes as z m changes;
3. the T (z m , Y )’s are all computable based on the observed data, but
the τ̂ m ’s are hypothetical values because not all potential outcomes
are known.
The point estimator is standard but it has a non-trivial variance under
the potential outcomes framework with a CRE. The variance formula (4.1)
differs from the classic variance formula for difference in means2 because it
not only depends on the finite population variances of the potential outcomes
but also depends on the finite population variance of the individual effects,
or, equivalently, the finite population covariance of the potential outcomes.
² In the classic two-sample problem, the outcomes under treatment are IID draws from a distribution with mean µ1 and variance σ1², and the outcomes under control are IID draws from a distribution with mean µ0 and variance σ0². Under this assumption, we have
$$ \text{var}(\hat\tau) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}. $$
Here, var(·) is over the randomness of the outcomes. This variance formula does not involve a third term that depends on the variance of the individual causal effects.
Unfortunately, S 2 (τ ) and S(1, 0) are not identifiable from the data because
Yi (1) and Yi (0) are never jointly observed.
Due to the fundamental problem of missing one potential outcome, we can
at most obtain a conservative variance estimator. In statistics, the definition
of the confidence interval allows for overcoverage and thus conservativeness in variance estimation. This may not be a good idea in some applications, for example, studies on the side effects of drugs.
The formula (4.1) is a little puzzling in that the more heterogeneous the individual effects are, the smaller the variability of τ̂ is. Section 4.5.1 will use numerical examples to verify (4.1). What is the intuition here? I give an
explanation based on the equivalent form (4.2). Compare the case with pos-
itively correlated potential outcomes and the case with negatively correlated
potential outcomes. Although the treatment group is a simple random sample
from the finite population of n units, it is possible to observe relatively large
treatment potential outcomes in a realized experiment. If this happens, then
those control units have relatively small treatment potential outcomes. Con-
sequently, if S(1, 0) > 0, then the control potential outcomes are relatively
small; if S(1, 0) < 0, then the control potential outcomes are relatively large.
Therefore, τ̂ tends to be larger when the potential outcomes are positively correlated, resulting in more extreme values of τ̂. So the variance of τ̂ is larger when the potential outcomes are positively correlated.
Li and Ding (2017, Theorem 5 and Proposition 3) further proved the fol-
lowing asymptotic Normality of τ̂ based on the finite population central limit
theorem.
Theorem 4.2 Let n → ∞ and n1 → ∞. If n1/n has a limiting value in (0, 1), {S²(1), S²(0), S(1, 0)} have limiting values, and $\max_{1\leq i\leq n}|Y_i(1) - \bar Y(1)|^2/n \to 0$ and $\max_{1\leq i\leq n}|Y_i(0) - \bar Y(0)|^2/n \to 0$, then
$$ \frac{\hat\tau - \tau}{\sqrt{\text{var}(\hat\tau)}} \to \text{N}(0, 1) $$
in distribution, and $\hat S^2(1) \to S^2(1)$ and $\hat S^2(0) \to S^2(0)$
in probability.
The proof of Theorem 4.2 is technical and beyond the scope of this book.
It ensures that the sampling distribution of τ̂ can be approximated by Normal
distribution with large sample size and some regularity conditions. Moreover,
it ensures that the sample variances of the outcomes are consistent for the
population variances, which further ensures that the probability limit of Ney-
man (1923)’s variance estimator is larger than the true variance of τ̂ . This
justifies a conservative large-sample confidence interval for τ :
$$ \hat\tau \pm z_{1-\alpha/2}\sqrt{\hat{V}}, $$
which is the same as the confidence interval for the standard two-sample problem asymptotically. This confidence interval covers τ with probability at least as large as 1 − α when the sample size is large enough. By duality, the confidence interval implies a test for H0n: τ = 0.
The conservativeness of Neyman (1923)'s confidence interval for τ is not a big problem if under-reporting the treatment effect is not a big concern. It can be problematic if the outcomes measure the side effects of a treatment: in medical experiments, under-reporting the side effects of a new drug can have severe consequences.
4.3 Proofs
In this section, I will prove Theorem 4.1.
First, the unbiasedness of τ̂ follows from the representation
$$ \begin{aligned} \hat\tau &= n_1^{-1}\sum_{i=1}^{n} Z_i Y_i - n_0^{-1}\sum_{i=1}^{n} (1-Z_i)Y_i \\ &= n_1^{-1}\sum_{i=1}^{n} Z_i Y_i(1) - n_0^{-1}\sum_{i=1}^{n} (1-Z_i)Y_i(0) \end{aligned} $$
Similarly, E{Ŝ²(0)} = S²(0). Therefore, V̂ is unbiased for the first two terms in (4.1).
4.4 Regression analysis of the CRE
and use the coefficient of the treatment β̂ as the estimator for the average
causal effect. We can show the coefficient β̂ equals the difference in means:
β̂ = τ̂ . (4.3)
However, the usual variance estimator from the OLS, e.g., the output from the lm function of R, equals
$$ \begin{aligned} \hat{V}_{\text{ols}} &= \frac{N(N_1-1)}{(N-2)N_1 N_0}\hat{S}^2(1) + \frac{N(N_0-1)}{(N-2)N_1 N_0}\hat{S}^2(0) \quad (4.4) \\ &\approx \frac{\hat{S}^2(1)}{N_0} + \frac{\hat{S}^2(0)}{N_1}, \end{aligned} $$
where the approximation holds with large N1 and N0. It differs from V̂ even with large N1 and N0.
Fortunately, the Eicker–Huber–White (EHW) robust variance estimator is close to V̂:
$$ \begin{aligned} \hat{V}_{\text{ehw}} &= \frac{\hat{S}^2(1)}{N_1}\,\frac{N_1-1}{N_1} + \frac{\hat{S}^2(0)}{N_0}\,\frac{N_0-1}{N_0} \quad (4.5) \\ &\approx \frac{\hat{S}^2(1)}{N_1} + \frac{\hat{S}^2(0)}{N_0}, \end{aligned} $$
where the approximation holds with large N1 and N0. It is almost identical to V̂. Moreover, the so-called HC2 variant of the EHW robust variance estimator is identical to V̂. The hccm function in the car package returns the EHW robust variance estimator as well as its HC2 variant.
Problem 4.3 provides more technical details for (4.3)–(4.5).
4.5 Examples
4.5.1 Simulation
I first choose the sample size as n = 100 with 60 treated and 40 control units,
and generate the potential outcomes with constant individual causal effects.
n = 100
n1 = 60
n0 = 40
y0 = rexp(n)
y0 = sort(y0, decreasing = TRUE)
y1 = y0 + 1
With the Science Table fixed, I repeatedly generate completely randomized experiments and apply Theorem 4.1 to obtain the point estimator, the conservative variance estimator, and the confidence interval based on the Normal approximation. The first panel of Figure 4.2 shows the histogram of τ̂ − τ over 10^4 simulations.
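The simulation loop itself (the R code NeymanCR.R mentioned later) is not reproduced in this excerpt; a minimal sketch of what it might look like is:

## repeatedly realize a CRE over the fixed Science Table (y1, y0)
## (a sketch; the 95% CI coverage uses the Normal approximation)
MC = 10^4
tau = mean(y1 - y0)
TauHat = rep(0, MC)
Cover  = rep(0, MC)
for (mc in 1:MC) {
  z = sample(c(rep(1, n1), rep(0, n0)))   # a completely randomized assignment
  y = z * y1 + (1 - z) * y0               # observed outcomes
  tauhat = mean(y[z == 1]) - mean(y[z == 0])
  vhat   = var(y[z == 1]) / n1 + var(y[z == 0]) / n0
  TauHat[mc] = tauhat
  Cover[mc]  = (tauhat - 1.96 * sqrt(vhat) <= tau) &
               (tau <= tauhat + 1.96 * sqrt(vhat))
}
hist(TauHat - tau)            # compare with the first panel of Figure 4.2
c(var(TauHat), mean(Cover))   # true variance and CI coverage rate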
I then change the potential outcomes by sorting the control potential outcomes in reverse order
y0 = sort(y0, decreasing = FALSE)
and repeat the above simulation. The second panel of Figure 4.2 shows the histogram of τ̂ − τ over 10^4 simulations.
I finally permute the control potential outcomes
y0 = sample(y0)
and repeat the above simulation. The third panel of Figure 4.2 shows the histogram of τ̂ − τ over 10^4 simulations.
Importantly, in the above three sets of simulations, the correlations be-
tween potential outcomes are different but the marginal distributions are the
same. The following table compares the true variances, the conservative esti-
mated variances, and the coverage rates of the 95% confidence intervals.
                 constant   negative   independent
var              0.036      0.007      0.020
estimated var    0.036      0.036      0.036
coverage rate    0.947      1.000      0.989
The true variance depends on the correlation between the potential outcomes,
with positively correlated potential outcomes corresponding to a larger sam-
pling variance. This verifies (4.2). The estimated variances are almost identical
because the formula of V̂ depends only on the marginal distributions of the
potential outcomes. Due to the discrepancy between the true and estimated
variances, the coverage rates differ across the three sets of simulations. Only
with constant causal effects, the estimated variance is identical to the true
variance, verifying point 3 of Theorem 4.1.
Figure 4.2 also shows the Normal density curves based on the central limit
theorem for τ̂ . They are very close to the histogram over simulations, verifying
Theorem 4.2.
Figures 4.3 and 4.4 show two realizations of the histograms of τ̂ −τ with the
corresponding Normal approximations. With heavy-tailed potential outcomes,
the Normal approximations are quite poor. Moreover, unlike Figure 4.2, the
histograms are quite sensitive to the random seed of the simulation.
4.5.3 Application
I again use the lalonde data to illustrate the theory.
> library(Matching)
> data(lalonde)
> z = lalonde$treat
> y = lalonde$re78
We can easily calculate the point estimator and standard error based on
the formulas in Theorem 4.1:
> n1 = sum(z)
> n0 = length(z) - n1
> tauhat = mean(y[z == 1]) - mean(y[z == 0])
> vhat = var(y[z == 1]) / n1 + var(y[z == 0]) / n0
> sehat = sqrt(vhat)
> tauhat
[1] 1794.343
> sehat
[1] 670.9967
Practitioners often use ordinary least squares (OLS) to estimate the aver-
age causal effect which also gives a standard error.
> olsfit = lm(y ~ z)
> summary(olsfit)$coef[2, 1:2]
  Estimate Std. Error
 1794.3431   632.8536
However, the above standard error seems too small compared to the one based
on Theorem 4.1. Fortunately, this can be easily solved by using the Eicker–Huber–White robust standard error.
> library(car)
> sqrt(hccm(olsfit)[2, 2])
[1] 672.6823
> sqrt(hccm(olsfit, type = "hc0")[2, 2])
[1] 669.3155
> sqrt(hccm(olsfit, type = "hc2")[2, 2])
[1] 670.9967
Different versions of the robust standard error exist. They yield similar results
if the sample size is large, with hc2 yielding a standard error identical to
Theorem 4.1. Problem 4.3 gives a theoretical explanation for the possible
failure of the standard error based on OLS and the asymptotic validity of the
Eicker–Huber–White robust standard error.
Give a counterexample with S²(1) > S²(0) but S(Y(0), τ) < 0.
Remark: The first result states that no treatment effect heterogeneity im-
plies equal variances in the treated and control potential outcomes. But the
converse is not true. The second result states that if the treated potential
outcome has larger variance than the control potential outcome, then the in-
dividual treatment effect is negatively correlated with the control potential
outcome. But the converse is not true. Gerber and Green (2012, page 293) and Ding et al. (2019, Appendix B.3) gave related discussions.
Section 4.5.1 used V̂ in the simulation with R code NeymanCR.R. Repeat the
simulation with additional comparison with the variance estimator V̂ ′ and
the associated confidence interval.
Remark: The upper bound (4.6) can be further improved. Aronow et al. (2014) derived the sharp upper bound for var(τ̂) using the Fréchet–Hoeffding inequality. Those improvements are rarely used in practice mainly for two
reasons. First, they are more complicated than V̂ which can be conveniently
implemented by OLS. Second, the confidence interval based on V̂ also works
under other formulations, for example, under a true linear model of the out-
come on the treatment, but those improvements do not. Although they are
theoretically interesting, those improvements have little practical impact.
where Vi (1) and Vi (0) are the potential outcomes of V for unit i. The Neyman-
type estimator for τV is the difference between the sample mean vectors of
the observed outcomes under treatment and control:
$$ \hat\tau_V = \bar{V}_1 - \bar{V}_0 = \frac{1}{n_1}\sum_{i=1}^{n} Z_i V_i - \frac{1}{n_0}\sum_{i=1}^{n} (1 - Z_i)V_i. $$
Consider a CRE. Show that τbV is unbiased for τV . Find the covariance
matrix of τbV . Find a (possibly conservative) estimator for the variance.
5 Stratification and Post-Stratification
5.1 Stratification
A CRE may generate an undesired treatment allocation. Let us start with a
completely randomized experiment with a discrete covariate Xi ∈ {1, . . . , K},
and define n[k] = #{i : Xi = k} and π[k] = n[k] /n as the number and pro-
portion of units in stratum k (k = 1, . . . , K). A CRE assigns n1 units to the
treatment group and n0 units to the control group, which results in
n[k]1 = #{i : Xi = k, Zi = 1}, n[k]0 = #{i : Xi = k, Zi = 0}
units in the treatment and control groups within stratum k. With positive
probability, n[k]1 or n[k]0 is zero for some k, that is, it is possible that some
strata only have treated or control units. Even if none of the n[k]1's or n[k]0's are zero, with high probability
$$ \frac{n_{[k]1}}{n_1} - \frac{n_{[k]0}}{n_0} \neq 0, \quad (5.1) $$
and the magnitude can be quite large. So the proportions of units in stratum k are different across the treatment and control groups, although on average their difference is zero:
$$ E\left( \frac{n_{[k]1}}{n_1} - \frac{n_{[k]0}}{n_0} \right) = E\left\{ n_1^{-1}\sum_{i=1}^{n} Z_i 1(X_i = k) - n_0^{-1}\sum_{i=1}^{n} (1 - Z_i)1(X_i = k) \right\} = 0.
1 His most famous quote is “all models are wrong but some are useful.”
When n[k]1 /n1 − n[k]0 /n0 is large for some strata with X = k, the treatment
and control groups have undesirable covariate imbalance. Such covariate im-
balance deteriorates the quality of the experiment, making it difficult to in-
terpret the results of the experiment since the difference in the outcomes may
be attributed to the treatment or the covariate imbalance.
How can we actively avoid covariate imbalance in the experiment? We
can fix the n[k]1 ’s or n[k]0 ’s in advance and conduct stratified randomized
experiments (SRE).
Definition 5.1 (SRE) We conduct K independent CREs within the K
strata of a discrete covariate X.
In agricultural experiments, the SRE is also called the randomized block
design, with the strata called the blocks. Analogously, stratified randomization
is also called block randomization. The total number of randomizations in an
SRE equals
$$ \prod_{k=1}^{K} \binom{n_{[k]}}{n_{[k]1}}, $$
and each feasible randomization has equal probability. Within stratum k, the proportion of units receiving the treatment is
$$ e_{[k]} = \frac{n_{[k]1}}{n_{[k]}}, $$
which is also called the propensity score, a concept that will play a central role in Part III of this book. An SRE is different from a CRE: first, all feasible randomizations in an SRE form a subset of all feasible randomizations in a CRE, so
$$ \prod_{k=1}^{K} \binom{n_{[k]}}{n_{[k]1}} < \binom{n}{n_1}; $$
second, e[k] is fixed in an SRE but random in a CRE.
For every unit i, we have potential outcomes Yi (1) and Yi (0), and individual
causal effect τi = Yi (1)−Yi (0). For stratum k, we have stratum-specific average
causal effect
$$ \tau_{[k]} = n_{[k]}^{-1}\sum_{X_i = k}\tau_i. $$
The average causal effect is
$$ \tau = n^{-1}\sum_{i=1}^{n} \tau_i = n^{-1}\sum_{k=1}^{K}\sum_{X_i = k}\tau_i = \sum_{k=1}^{K} \pi_{[k]}\tau_{[k]}, $$
which is also the weighted average of the stratum-specific average causal effects.
If we are interested in τ[k] , then we can use the methods in Chapters 3 and
4 for the CRE within stratum k. Below I will discuss statistical inference for
τ.
5.2 FRT
5.2.1 Theory
In parallel with the discussion of a CRE, I will start with the FRT in an SRE.
The sharp null hypothesis is still
$$ H_{0f}: Y_i(1) = Y_i(0) \text{ for all units } i = 1, \ldots, n. $$
where
$$ \hat\tau_{[k]} = n_{[k]1}^{-1}\sum_{i=1}^{n} I(X_i = k, Z_i = 1)Y_i - n_{[k]0}^{-1}\sum_{i=1}^{n} I(X_i = k, Z_i = 0)Y_i. $$
We can simulate the exact distributions of the above test statistics under
the SRE. We can also calculate their means and variances and obtain the
p-values based on Normal approximations.
After searching for a while, I failed to find detailed discussion of the
Kolmogorov–Smirnov statistic for the SRE. Below is my proposal.
or
$$ D_{\max} = \max_{1\leq k\leq K} c_{[k]} D_{[k]}, $$
where $c_{[k]} = \sqrt{n_{[k]1} n_{[k]0}/n_{[k]}}$ is motivated by the limiting distribution of D[k] as n[k]1 and n[k]0 approach infinity (Van der Vaart, 2000). The statistics DS and Dmax are more appropriate when all strata have large sample sizes.
Another reasonable choice is
$$ D = \max_y \left| \sum_{k=1}^{K} \pi_{[k]}\{\hat F_{[k]1}(y) - \hat F_{[k]0}(y)\} \right|, $$
where F̂[k]1 (y) and F̂[k]0 (y) are the stratum-specific empirical distribution func-
tions of the outcomes under treatment and control, respectively. The statistic
D is appropriate in both the cases with large strata and the cases with many
small strata.
5.2.2 An application
I use the Penn Bonus experiment as an example to illustrate the FRT in the SRE. The dataset used by Koenker and Xiao (2002) is from a job training program stratified on quarter, with the outcome being the duration before employment.
penndata = read.table("Penn46_ascii.txt")
z = penndata$treatment
y = log(penndata$duration)
block = penndata$quarter
I will focus on τ̂S and WS, and leave the FRT with other statistics as an exercise. The following function computes τ̂S and V:
stat_SRE = function(z, y, x)
{
  xlevels = unique(x)
  K = length(xlevels)
  PiK = rep(0, K)
  TauK = rep(0, K)
  WilcoxK = rep(0, K)
  for (k in 1:K)
  {
    xk = xlevels[k]
    zk = z[x == xk]
    yk = y[x == xk]
    PiK[k] = length(zk) / length(z)
    TauK[k] = mean(yk[zk == 1]) - mean(yk[zk == 0])
    WilcoxK[k] = wilcox.test(yk[zk == 1], yk[zk == 0])$statistic
  }
  ## the original return statement is not fully shown in this excerpt;
  ## return the two stratified statistics, weighted by the stratum proportions
  c(sum(PiK * TauK), sum(PiK * WilcoxK))
}
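The helper zRandomSRE, used in the Monte Carlo loop below, is not defined in this excerpt; a minimal sketch that permutes the treatment indicators within each stratum is:

## a sketch of zRandomSRE: stratum-wise permutation of the treatment vector
zRandomSRE = function(z, x) {
  xlevels = unique(x)
  zrandom = z
  for (k in 1:length(xlevels)) {
    xk = xlevels[k]
    zrandom[x == xk] = sample(z[x == xk])   # permute within stratum k
  }
  zrandom
}
## the observed statistics below would then be, e.g., stat.obs = stat_SRE(z, y, block)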
Based on the above data and functions, we can easily simulate the ran-
domization distributions of the test statistics (shown in Figure 5.1 with 10^4 Monte Carlo draws) and compute the p-values.
> MC = 10^4
> statSREMC = matrix(0, MC, 2)
> for (mc in 1:MC)
+ {
+   zrandom = zRandomSRE(z, block)
+   statSREMC[mc, ] = stat_SRE(zrandom, y, block)
+ }
> mean(statSREMC[, 1] <= stat.obs[1])
[1] 0.0019
> mean(statSREMC[, 2] <= stat.obs[2])
[1] 5e-04
FIGURE 5.1: The randomization distributions of τ̂S and V based on the Penn
Bonus experiment.
The stratified estimator $\hat\tau_S = \sum_{k=1}^{K} \pi_{[k]}\hat\tau_{[k]}$ is unbiased for $\tau = \sum_{k=1}^{K} \pi_{[k]}\tau_{[k]}$, with variance
$$ \text{var}(\hat\tau_S) = \sum_{k=1}^{K} \pi_{[k]}^2 \,\text{var}(\hat\tau_{[k]}). $$
If n[k]1 ≥ 2 and n[k]0 ≥ 2, then we can obtain the sample variances $\hat S^2_{[k]}(1)$ and $\hat S^2_{[k]}(0)$ of the outcomes within stratum k and construct a conservative variance estimator
$$ \hat{V}_S = \sum_{k=1}^{K} \pi_{[k]}^2\left( \frac{\hat S^2_{[k]}(1)}{n_{[k]1}} + \frac{\hat S^2_{[k]}(0)}{n_{[k]0}} \right), $$
where $\hat S^2_{[k]}(1)$ and $\hat S^2_{[k]}(0)$ are the stratum-specific sample variances of the outcomes under treatment and control, respectively. Based on a Normal approximation of τ̂S, we can construct a Wald-type 1 − α confidence interval for τ:
$$ \hat\tau_S \pm z_{1-\alpha/2}\sqrt{\hat{V}_S}. $$
From a hypothesis testing perspective, under H0n: τ = 0, we can compare $t_S = \hat\tau_S/\sqrt{\hat{V}_S}$ with the standard Normal quantiles to obtain asymptotic p-values. The statistic tS has appeared in Example 5.2 for the FRT. Similar to the discussion for the CRE, using tS in the FRT yields a finite-sample exact p-value under H0f and an asymptotically valid p-value under H0n. Wu and Ding (2021) provided a justification for this claim.
Here I omit the technical details for the central limit theorem of τ̂S. See Liu and Yang (2020) for a proof, which includes the two regimes with a few large strata and many small strata. I will illustrate these theoretical issues using a numerical example in Section 5.3.2.
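The function Neyman_SRE used in the code below (and later for the Penn Bonus data) is not defined in this excerpt; a minimal sketch based on the formulas for τ̂S and V̂S is:

## a sketch of Neyman_SRE: stratified point estimate and conservative variance
Neyman_SRE = function(z, y, x) {
  xlevels = unique(x)
  K = length(xlevels)
  PiK  = rep(0, K)
  TauK = rep(0, K)
  VarK = rep(0, K)
  for (k in 1:K) {
    xk = xlevels[k]
    zk = z[x == xk]
    yk = y[x == xk]
    PiK[k]  = length(zk) / length(z)
    TauK[k] = mean(yk[zk == 1]) - mean(yk[zk == 0])
    VarK[k] = var(yk[zk == 1]) / sum(zk) + var(yk[zk == 0]) / sum(1 - zk)
  }
  c(sum(PiK * TauK), sum(PiK^2 * VarK))   # point estimate and variance estimate
}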
> MC = 10^4
> TauHat = rep(0, MC)
> VarHat = rep(0, MC)
> for (mc in 1:MC)
+ {
+   z = replicate(K, sample(zb))
+   z = as.vector(z)
+   y = z * y1 + (1 - z) * y0
+   est = Neyman_SRE(z, y, x)
+   TauHat[mc] = est[1]
+   VarHat[mc] = est[2]
+ }
>
> hist(TauHat, xlab = expression(hat(tau)[S]),
+      ylab = "", main = "many small strata",
+      border = FALSE, col = "grey",
+      breaks = 30, yaxt = 'n',
+      xlim = c(0.8, 1.2))
> abline(v = 1)
>
> var(TauHat)
[1] 0.001443111
> mean(VarHat)
[1] 0.001473616
The lower panel of Figure 5.2 shows the histogram of the point estimator,
which is symmetric and bell-shaped around the true parameter.
We finally use the Penn Bonus Experiment to illustrate the Neymanian
inference in an SRE. Applying the function Neyman_SRE to the dataset, we
obtain:
> est = Neyman_SRE(z, y, block)
> est[1]
[1] -0.08990646
> sqrt(est[2])
[1] 0.03079775
So the job training program significantly shortens the duration time before
employment.
and similarly,
$$ S^2(0) = \sum_{k=1}^{K} \left\{ \frac{n_{[k]}-1}{n-1}S^2_{[k]}(0) + \frac{n_{[k]}}{n-1}\{\bar Y_{[k]}(0) - \bar Y(0)\}^2 \right\}, $$
$$ S^2(\tau) = \sum_{k=1}^{K} \left\{ \frac{n_{[k]}-1}{n-1}S^2_{[k]}(\tau) + \frac{n_{[k]}}{n-1}(\tau_{[k]} - \tau)^2 \right\}. $$
$$ \begin{aligned} \text{var}_{\text{CRE}}(\hat\tau) &= \frac{S^2(1)}{n_1} + \frac{S^2(0)}{n_0} - \frac{S^2(\tau)}{n} \\ &\approx \sum_{k=1}^{K} \left\{ \frac{\pi_{[k]}}{n_1}S^2_{[k]}(1) + \frac{\pi_{[k]}}{n_0}S^2_{[k]}(0) - \frac{\pi_{[k]}}{n}S^2_{[k]}(\tau) \right\} \\ &\quad + \sum_{k=1}^{K} \left\{ \frac{\pi_{[k]}}{n_1}\{\bar Y_{[k]}(1) - \bar Y(1)\}^2 + \frac{\pi_{[k]}}{n_0}\{\bar Y_{[k]}(0) - \bar Y(0)\}^2 - \frac{\pi_{[k]}}{n}(\tau_{[k]} - \tau)^2 \right\}. \end{aligned} $$
π[k] /n[k]1 = 1/(ne), π[k] /n[k]0 = 1/{n(1 − e)}, π[k] /n[k] = 1/n,
which is non-negative. The difference is zero only in the extreme case that
$$ \sqrt{\frac{n_0}{n_1}}\{\bar Y_{[k]}(1) - \bar Y(1)\} + \sqrt{\frac{n_1}{n_0}}\{\bar Y_{[k]}(0) - \bar Y(0)\} = 0 $$
$$ \text{pr}_{\text{CRE}}(Z = z \mid n) = \frac{\text{pr}_{\text{CRE}}(Z = z, n)}{\text{pr}_{\text{CRE}}(n)} = \frac{1}{\prod_{k=1}^{K} \binom{n_{[k]}}{n_{[k]1}}}, \quad (5.2) $$
the FRT becomes a conditional FRT, and the Neymanian analysis becomes
post-stratification:
$$ \hat\tau_{\text{PS}} = \sum_{k=1}^{K} \pi_{[k]}\hat\tau_{[k]}, $$
The following table shows the estimates for two strata separately, the post-
stratified estimator, and the crude estimator ignoring the binary covariate, as
well as the corresponding standard errors.
stratum 1 stratum 2 post-stratification crude
est −0.034 −0.036 −0.035 −0.045
se 0.031 0.060 0.032 0.033
The treatment and control group sizes vary across five strata:
> table(dat_physician$z,
+       dat_physician$class_level)

          1  2  3  4  5
  FALSE  15 19 16 12 10
  TRUE   17 20 15 11 10
We can use the Neyman_SRE function defined before to compute the stratified
estimator and its estimated variance.
tauS = with(dat_physician,
            Neyman_SRE(z, gradesq34, class_level))
An important additional covariate is the baseline anemic indicator, which is quite predictive of the outcome. Further conditioning on the baseline anemic indicator, we have an experiment with 5 × 2 = 10 strata, with the treatment and control group sizes shown below.
> table(dat_physician$z,
+       dat_physician$class_level,
+       dat_physician$anemic_base_re)
, , = No

          1  2  3  4  5
  FALSE   6 14 12  7  4
  TRUE    8 12  9  5  6

, , = Yes

          1  2  3  4  5
  FALSE   9  5  4  5  6
  TRUE    9  8  6  6  4
Again we can use the Neyman_SRE function defined before to compute the post-
stratified estimator and its estimated variance.
tauSPS = with(dat_physician,
              {
                sps = interaction(class_level, anemic_base_re)
                Neyman_SRE(z, gradesq34, sps)
              })
The following table compares these two estimators. The post-stratified
estimator yields a much smaller p-value.
est se t.stat p.value
stratify 0.406 0.202 2.005 0.045
stratify and post-stratify 0.463 0.190 2.434 0.015
This example illustrates that post-stratification can be used not only in
the CRE but also in the SRE with additional discrete covariates.
5.3 FRT for the Project STAR data in Imbens and Rubin (2015)
Reanalyze the Project STAR data using the Fisher randomization test. Note
that I use Z for the treatment indicator but Imbens and Rubin (2015) use
W. Use τ̂S , V and the aligned rank statistic in the Fisher randomization test.
Compare the p-values.
treatment = list(c(1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 0, 0),
                 c(1, 1, 0, 0))
This is an SRE with centers being the strata. The trial was conducted
to study the efficacy and tolerability of finasteride, a drug for treating benign
prostatic hyperplasia. Within each of the 29 centers, patients were randomized
into three arms: control, finasteride 1mg, and finasteride 5mg. The above
dataset provides summary statistics for the outcome, which is the change
from baseline in total symptom score. The total symptom score is the sum of
the responses to nine questions (score 0 to 4) about symptoms pertaining to
various aspects of impaired urinary ability. The meanings of the columns are:
1. center: number of the center;
2. n0, n1, n5: sample sizes of the three arms;
3. mean0, mean1, mean5: mean of the outcome;
4. sd0, sd1, sd5: standard deviation of the outcome.
The individual-level outcomes are not reported so we cannot implement
the FRT. However, the Neymanian inference only requires the summary statis-
tics. Report the point estimators and variance estimators for comparing “fi-
nasteride 1mg” and “finasteride 5mg” to “control”, separately.
6 Rerandomization and Regression Adjustment

                     design             analysis
discrete covariate   stratification     post-stratification
general covariate    rerandomization    regression adjustment
6.1 Rerandomization
6.1.1 Experimental design
Again we consider a finite population of n units, where n1 of them receive the
treatment and n0 of them receive the control. Let Z = (Z1 , . . . , Zn ) be the
treatment vector for these units. Unit i has covariate Xi ∈ RK which can have
continuous or binary components. Concatenate them as X = (X1, . . . , Xn) and center them at mean zero, $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i = 0$, without loss of generality.
The CRE balances the covariates in the treatment and control groups on
average, for instance, the difference in means of the covariates
$$ \hat\tau_X = n_1^{-1}\sum_{i=1}^{n} Z_i X_i - n_0^{-1}\sum_{i=1}^{n} (1 - Z_i)X_i $$
has mean zero under the CRE. However, the CRE can result in undesirable covariate imbalance across the treatment and control groups in the realized treatment allocation, that is, the realized value of τ̂X is often not zero. Using the vector form of Neyman (1923) in Problem 4.6, we can show that
$$ \text{cov}(\hat\tau_X) = \frac{1}{n_1}S^2_X + \frac{1}{n_0}S^2_X = \frac{n}{n_1 n_0}S^2_X, $$
where $S^2_X = (n-1)^{-1}\sum_{i=1}^{n} X_i X_i^{T}$. The following Mahalanobis distance measures the difference between the treatment and control groups:
$$ M = \hat\tau_X^{T}\,\text{cov}(\hat\tau_X)^{-1}\hat\tau_X = \hat\tau_X^{T}\left( \frac{n}{n_1 n_0}S^2_X \right)^{-1}\hat\tau_X. $$
Technically, the above formula of M is meaningful only if $S^2_X$ is invertible, which means that the columns of the covariate matrix are linearly independent. If a column can be represented by a linear combination of other columns, it is redundant and should be dropped before the experiment. A nice feature of M is that it is invariant under non-degenerate linear transformations of X. Lemma 6.1 below summarizes the result, with the proof relegated to Problem 6.2.
Lemma 6.1 M remains the same if we transform Xi to α+BXi for all units
i = 1, . . . , n where α ∈ RK and B ∈ RK×K is invertible.
The finite population central limit theorem (Li and Ding, 2017) ensures
that with large n, the Mahalanobis distance M is approximately χ²_K under the
CRE. Therefore, it is likely that M has a large realized value under the CRE
with asymptotic mean K and variance 2K. Rerandomization avoids covariate
imbalance by discarding the treatment allocations with large values of M .
Below I give a formal definition of the rerandomization using the Mahalanobis
distance (ReM), which was proposed by Cox (1982) and Morgan and Rubin
(2012).
Definition 6.1 (ReM) Draw Z from the CRE and accept it if and only if
$$ M \leq a $$
for some predetermined constant a > 0.
An alternative criterion accepts Z if and only if
$$ \frac{|\hat\tau_{x_k}|}{\sqrt{\frac{n}{n_1 n_0}S^2_{x_k}}} \leq a \quad (k = 1, \ldots, K), \quad (6.1) $$
where a can be, for example, some upper quantile of a standard Normal distribution. ReM has many desirable properties. As
of a standard Normal distribution. ReM has many desirable properties. As
mentioned above, it is invariant to linear transformation of the covariates.
Moreover, it has nice geometric properties and elegant mathematical theory.
This chapter will focus on ReM. See Zhao and Ding (2021b) for the theory
for the rerandomization based on criterion (6.1) as well as other criteria.
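As an illustration of Definition 6.1 (a sketch with a hypothetical centered covariate matrix X and threshold a, not code from the text), ReM can be implemented by repeatedly drawing a complete randomization and accepting the first draw with M ≤ a:

## rerandomization using the Mahalanobis distance (ReM)
rem_draw = function(X, n1, a) {
  n  = nrow(X)
  n0 = n - n1
  S2X = cov(X)                                  # (n-1)^{-1} sum X_i X_i^T for centered X
  repeat {
    z = sample(c(rep(1, n1), rep(0, n0)))       # a complete randomization
    tauX = colMeans(X[z == 1, , drop = FALSE]) -
           colMeans(X[z == 0, , drop = FALSE])  # difference in covariate means
    M = n1 * n0 / n * as.numeric(t(tauX) %*% solve(S2X) %*% tauX)
    if (M <= a) return(z)                       # accept only balanced allocations
  }
}
## example usage with hypothetical covariates:
# X = scale(matrix(rnorm(100 * 3), 100, 3), scale = FALSE)
# z = rem_draw(X, n1 = 50, a = qchisq(0.1, df = 3))  # accept roughly the best 10%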
Condition 6.1 As n → ∞,
1. n1/n and n0/n have positive limits;
2. the finite population covariance of {Xi, Yi(1), Yi(0), τi} has a limit;
3. $\max_{1\leq i\leq n}|Y_i(1) - \bar Y(1)|^2/n \to 0$, $\max_{1\leq i\leq n}|Y_i(0) - \bar Y(0)|^2/n \to 0$, and $\max_{1\leq i\leq n}\|X_i\|^2/n \to 0$.
$L_{K,a} \sim D_1 \mid D^{T}D \leq a$
where
$$ \text{var}(\hat\tau) = \frac{S^2(1)}{n_1} + \frac{S^2(0)}{n_0} - \frac{S^2(\tau)}{n} $$
is Neyman (1923)'s variance formula proved in Chapter 4, and
¹ The notation "A $\overset{\cdot}{\sim}$ B" means that A and B have the same asymptotic distributions.
is the squared multiple correlation coefficient2 between τ̂ and τ̂X under the
CRE.
Although the proof of Li et al. (2018b) is technical, the asymptotic distri-
bution in Theorem 6.1 has clear geometric interpretation, as shown in Figure
6.1. It shows that τ̂ decomposes into a component that is a linear combination
of τ̂X and a component that is orthogonal to τ̂X . Geometrically, cos2 θ = R2 ,
where θ is the angle between τ̂ and τ̂X . ReM affects the first component but
does not change the second component. The truncated Normal distribution
LK,a is due to the restriction of ReM on the first component.
When a = ∞, the asymptotic distribution simplifies to the one under the CRE:
$$ \hat\tau - \tau \overset{\cdot}{\sim} \sqrt{\text{var}(\hat\tau)}\,\varepsilon. $$
When the threshold a is close to zero, the asymptotic distribution simplifies to
$$ \hat\tau - \tau \overset{\cdot}{\sim} \sqrt{\text{var}(\hat\tau)(1 - R^2)}\,\varepsilon. $$
So with a small threshold a, the efficiency gain due to ReM depends on R2 ,
which has the following equivalent form.
Proposition 6.1 Under the CRE,
$$ R^2 = \text{corr}^2(\hat\tau, \hat\tau_X) = \frac{n_1^{-1}S^2(1\mid x) + n_0^{-1}S^2(0\mid x) - n^{-1}S^2(\tau\mid x)}{n_1^{-1}S^2(1) + n_0^{-1}S^2(0) - n^{-1}S^2(\tau)}, $$
² The squared multiple correlation coefficient between a random variable y and a random vector X is defined as
$$ R^2_{yX} = \text{corr}^2(y, X) = \frac{\text{cov}(y, X)\,\text{cov}(X)^{-1}\,\text{cov}(X, y)}{\text{var}(y)}. $$
It extends the definition of the Pearson correlation coefficient and measures the linear dependence of y on X.
where {S 2 (1), S 2 (0), S 2 (τ )} are the finite population variances of {Yi (1), Yi (0), τi }ni=1 ,
and {S 2 (1 | x), S 2 (0 | x), S 2 (τ | x)} are the corresponding finite population
variances of their linear projections on (1, Xi ). 3 Under the constant causal
effect assumption with τi = τ , R2 reduces to the finite population squared
multiple correlation between Yi (0) and Xi .
I leave the proof of Proposition 6.1 to Problem 6.4.
When 0 < a < ∞, the asymptotic distribution has a more complicated
form, but it is more concentrated around τ; thus the difference in means is more
precise under ReM than under the CRE.
If we ignore the design of ReM and still use the confidence interval based
on Neyman (1923)’s variance formula and the Normal approximation, it is
overly conservative and overcovers τ even if the individual causal effects are
constant. Li et al. (2018b) described how to construct confidence intervals
based on Theorem 6.1. We omit the discussion here but will come back to the
inference issue in Section 6.3.
In strategy one, we only need to run regression once, but in strategy two,
we need to run regression many times. In the above, “regression” is a generic
term, which can be linear regression, logistic regression, or even machine learn-
ing algorithms. The FRT with any test statistics from these two strategies will
be finite-sample exact under H0f although they differ under alternative hy-
potheses.
(1923)'s classic one. For coherence, we can also use the HC2 correction for Lin (2013)'s
estimator with covariate adjustment. When the number of covariates is small compared
to the sample size and the covariates do not contain outliers, the variants of the EHW
standard error perform similarly to the original one. When the number of covariates is large
compared to the sample size or the covariates contain outliers, the variants can outperform
the original one. In those cases, Lei and Ding (2021) recommend using the HC3 variant of
the EHW standard error. See Chapter A2 for more details of the EHW standard errors.
a conservative estimator for the true standard error of τ̂F under the
CRE.
Consider the covariate-adjusted estimator
τ̂(β1, β0) = {Ȳˆ(1) − β1ᵀ X̄ˆ(1)} − {Ȳˆ(0) − β0ᵀ X̄ˆ(0)},
where {Ȳˆ(1), Ȳˆ(0)} are the sample means of the outcomes, and {X̄ˆ(1), X̄ˆ(0)}
are the sample means of the covariates.
are the sample means of the covariates. This covariate-adjusted estimator
τ̂ (β1 , β0 ) tries to reduce the variance of τ̂ by residualizing the potential out-
comes. It reduces to τ̂ with β1 = β0 = 0. It has mean τ for any fixed values
of β1 and β0 because X̄ = 0. We are interested in finding the (β1, β0) that
minimize the variance of τ̂(β1, β0). This estimator is essentially the difference
in means of the adjusted potential outcomes {Yi(1) − β1ᵀ Xi, Yi(0) − β0ᵀ Xi}ⁿᵢ₌₁.
Applying Neyman (1923)'s result, this estimator has variance
var{τ̂(β1, β0)} = S²(1; β1)/n1 + S²(0; β0)/n0 − S²(τ; β1, β0)/n,
with the conservative variance estimator
V̂(β1, β0) = Ŝ²(1; β1)/n1 + Ŝ²(0; β0)/n0,
where
Ŝ²(1; β1) = (n1 − 1)⁻¹ Σ_{i=1}^n Zi {Yi − γ1 − β1ᵀ Xi}²,
Ŝ²(0; β0) = (n0 − 1)⁻¹ Σ_{i=1}^n (1 − Zi) {Yi − γ0 − β0ᵀ Xi}²
are the sample variances of the adjusted potential outcomes with γ1 and γ0
being the sample means of Yi − βT1 Xi under treatment and Yi − βT0 Xi under
control. To minimize V̂ (β1 , β0 ), we need to solve two OLS problems:
min_{γ1, β1} Σ_{i=1}^n Zi {Yi − γ1 − β1ᵀ Xi}²,    min_{γ0, β0} Σ_{i=1}^n (1 − Zi) {Yi − γ0 − β0ᵀ Xi}².
We run OLS of Yi on Xi for the treatment and control groups separately and
obtain (γ̂1 , β̂1 ) and (γ̂0 , β̂0 ). The final estimator is
τ̂(β̂1, β̂0) = n1⁻¹ Σ_{i=1}^n Zi (Yi − β̂1ᵀ Xi) − n0⁻¹ Σ_{i=1}^n (1 − Zi)(Yi − β̂0ᵀ Xi)
           = {Ȳˆ(1) − β̂1ᵀ X̄ˆ(1)} − {Ȳˆ(0) − β̂0ᵀ X̄ˆ(0)}.
Based on quite technical calculations, Lin (2013) further showed that the
EHW standard error from the OLS in Proposition 6.2 is almost identical to
V̂ (β̂1 , β̂0 ) which is a conservative estimator of the true standard error of τ̂L
under the CRE. Intuitively, this is because we do not assume that the linear
model is correctly specified, and the EHW standard error is robust to model
misspecification.
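As a concrete illustration, here is a minimal R sketch of Lin (2013)'s estimator via the fully interacted OLS fit with centered covariates, assuming a binary treatment vector z, an outcome vector y, and a covariate matrix x are in the workspace; it uses the sandwich package for the EHW (HC2) standard error.
library(sandwich)
xc = scale(x, center = TRUE, scale = FALSE)             # center covariates so that X-bar = 0
fit.lin = lm(y ~ z * xc)                                # fully interacted OLS fit
est.lin = coef(fit.lin)["z"]                            # Lin (2013)'s estimator
se.lin = sqrt(vcovHC(fit.lin, type = "HC2")["z", "z"])  # EHW/HC2 standard error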
There is a subtle issue with the discussion above. The variance formula
var{τ̂ (β1 , β0 )} works for fixed (β1 , β0 ), but the estimator τ̂ (β̂1 , β̂0 ) uses two
estimated coefficients (β̂1 , β̂0 ). The additional uncertainty in the estimated
coefficients may cause finite-sample bias in the final estimator. Lin (2013)
showed that the issue goes away asymptotically. However, his theory requires
a large sample size and some regularity conditions on the potential outcomes
and covariates.
Similarly, we build a prediction model for Y (0) based on X using the data
from the control group:
If we predict the missing potential outcomes, then we have the following pre-
dictive estimator:
τ̂_pred = n⁻¹ { Σ_{Zi=1} Yi + Σ_{Zi=0} µ̂1(Xi) − Σ_{Zi=1} µ̂0(Xi) − Σ_{Zi=0} Yi }.   (6.7)
We can verify that with (6.5) and (6.6), the predictive estimator equals Lin
(2013)’s estimator:
If we predict all potential outcomes even if they are observed, we have the
following projective estimator:
τ̂_proj = n⁻¹ Σ_{i=1}^n {µ̂1(Xi) − µ̂0(Xi)}.   (6.9)
We can verify that with (6.5) and (6.6), the projective estimator equals Lin
(2013)’s estimator:
where γ̂ = (n0/n) β̂1 + (n1/n) β̂0. I leave the proofs of (6.11) and (6.12) to Problem 6.7.
The forms (6.11) and (6.12) are the mathematical statements of “adjusting for
the covariate imbalance.” They essentially subtract some linear combinations
of the difference in means of the covariates. Since τ̂ and τ̂X are correlated, the
covariate adjustment with an appropriate γ reduces the variance of τ̂ . Another
interesting feature of (6.11) and (6.12) is that the final estimators depend only
on γ or γ̂, so the choice of the β-coefficients is not unique. Therefore, Lin
(2013)’s estimator is just one of the optimal estimators, but it can be easily
implemented via the standard OLS with the EHW standard error.
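For illustration, here is a hypothetical R sketch of the predictive and projective estimators in (6.7) and (6.9), assuming mu1.hat and mu0.hat are vectors of fitted values µ̂1(Xi) and µ̂0(Xi) for all n units (e.g., from separate lm fits on the treated and control groups).
tau.pred = mean(ifelse(z == 1, y, mu1.hat) - ifelse(z == 1, mu0.hat, y))   # predictive estimator (6.7)
tau.proj = mean(mu1.hat - mu0.hat)                                         # projective estimator (6.9)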
The first form of τ̂ (1, 1) justifies the name gain score because it is essentially
the difference in means of the gain score gi = Yi − Xi . The second form of
τ̂ (1, 1) justifies the name difference-in-difference because it is the difference
between two differences in means. This estimator is different from Lin (2013)’s
estimator: it fixes β1 = β0 = 1 in advance while Lin (2013)’s estimator involves
two estimated β’s. It is unbiased with a conservative variance estimator
V̂(1, 1) = {n1(n1 − 1)}⁻¹ Σ_{i=1}^n Zi {gi − ḡˆ(1)}² + {n0(n0 − 1)}⁻¹ Σ_{i=1}^n (1 − Zi) {gi − ḡˆ(0)}²,
where ḡˆ(1) and ḡˆ(0) are the sample means of the gain score gi = Yi − Xi under
treatment and control, respectively. When the lagged outcome is a strong
predictor of the outcome, the gain score gi = Yi − Xi often has much smaller
variance than the outcome itself. In this case, τ̂ (1, 1) often greatly reduces the
variance of the simple difference in means of the outcome.
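A minimal R sketch of the gain-score estimator τ̂(1, 1), assuming z is the binary treatment indicator and x is the lagged outcome measured on the same scale as y:
g = y - x                                                    # gain scores
tau.gain = mean(g[z == 1]) - mean(g[z == 0])                 # difference in means of the gain scores
v.gain = var(g[z == 1])/sum(z) + var(g[z == 0])/sum(1 - z)   # conservative variance estimator
c(est = tau.gain, se = sqrt(v.gain))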
                              analysis
                 unadjusted τ̂                 covariate-adjusted τ̂L
design   CRE     τ̂ (Neyman, 1923)      −→     τ̂L (Lin, 2013)
         ReM     τ̂ (Li et al., 2018b)  −→     τ̂L (Li and Ding, 2020)
τ̂_{L,S} = Σ_{k=1}^K π_[k] τ̂_{L,[k]}.
where V̂ehw,[k] is the EHW variance estimator from the OLS fit of the out-
come on the treatment indicator, the covariates, and their interactions within
stratum k. Importantly, we need to center covariates by their stratum-specific
means.
6.4 Simulation
Angrist et al. (2009) conducted an experiment to evaluate different strategies
to improve academic performance among college freshmen. Here I use a subset
of the original data, focusing on the control group and the treatment group
offered academic support services and financial incentives for good grades.
The outcome is the GPA at the end of the first year, and two covariates
are the gender and baseline GPA. The following table summarizes the results
based on the unadjusted and adjusted estimators. The adjusted estimator has
smaller standard error although it gives the same insignificant result as the
unadjusted estimator.
I also use this dataset to conduct simulation studies to evaluate the four
design and analysis strategies summarized in Table 6.2. I fit quadratic func-
tions of the outcome on the covariates and use them to impute all the missing
potential outcomes, separately for the treated and control groups. To show the
improvement of ReM and regression adjustment, I also rescale the error terms
by 0.1 and 0.25 to increase the signal to noise ratio. With the imputed Science
Table, I generate 2000 treatments, obtain the observed data, and calculate the
estimators. In the simulation, the “true” outcome model is nonlinear, but we
still use linear adjustment for estimation. By doing this, we can evaluate the
properties of the estimators when the linear model is misspecified.
Figure 6.2 shows the violin plots of the four combinations, subtracting the
true τ from the estimates. As predicted by the theory, all estimators are nearly
unbiased, and both ReM and regression adjustment improve efficiency. They
are more effective when the noise level is smaller.
FIGURE 6.2: Simulation with 2000 Monte Carlo replicates and a = 0.05 for the ReM.
ratio which may not be the parameter of interest; when the logistic model is
incorrect, it is even harder to interpret the coefficient. From the discussion
above, if the parameter of interest is the average causal effect, we can still use
Lin (2013)’s estimator to analyze the binary outcome data in the CRE. Guo
and Basse (2023) extend Lin (2013)’s theory to allow for using generalized
linear models to construct estimators for the average causal effect under the
potential outcomes framework.
Other extensions of Lin (2013)’s theory focus on high dimensional covari-
ates. Bloniarz et al. (2016) focus on the regime with more covariates than the
sample size, and under the sparsity assumption, they suggest replacing the
OLS fits by the least absolute shrinkage and selection operator (LASSO) fits
(Tibshirani, 1996) of the outcome on the treatment, covariates and their inter-
actions. Lei and Ding (2021) focus on the regime with a diverging number of
covariates without assuming sparsity, and under certain regularity conditions,
they show that Lin (2013)’s estimator is still consistent and asymptotically
Normal. Wager et al. (2016) propose to use machine learning methods to
analyze high dimensional experimental data.
V̂ (1, 1) is a conservative estimator for the true variance of τ̂ (1, 1). When does
E{V̂ (1, 1)} = var{τ̂ (1, 1)} hold?
Compare the variances of τ̂(0, 0) and τ̂(1, 1) to show that
var{τ̂(1, 1)} ≤ var{τ̂(0, 0)}
if and only if
2(n0/n) β1 + 2(n1/n) β0 ≥ 1,
where
β1 = Σ_{i=1}^n (Xi − X̄){Yi(1) − Ȳ(1)} / Σ_{i=1}^n (Xi − X̄)²,    β0 = Σ_{i=1}^n (Xi − X̄){Yi(0) − Ȳ(0)} / Σ_{i=1}^n (Xi − X̄)²
are the coefficients of Xi in the OLS fits of Yi (1) and Yi (0) on (1, Xi ), respec-
tively.
Remark: Gerber and Green (2012, page 28) discussed a special case of this
problem with n1 = n0 .
The matched-pairs experiment (MPE) is the most extreme version of the SRE
with only one treated unit and one control unit within each stratum. In this
case, the strata are also called pairs. Although this type of experiment is a
special case of the SRE discussed in Chapter 5, it has its own estimation and
inference strategy. Moreover, it has many new features and it is closely related
to the “matching” strategy in observational studies which will be covered in
Chapter 15 later. So we discuss the MPE in this separate chapter.
and
Yi2 = Zi Yi2(0) + (1 − Zi) Yi2(1) = Yi2(0) if Zi = 1, and Yi2(1) if Zi = 0.
So the observed data are (Zi, Yi1, Yi2)ⁿᵢ₌₁.
7.2 FRT
Similar to the discussion before, we can always use the FRT to test the sharp
null hypothesis:
H0f : Yij(1) = Yij(0) for all i = 1, . . . , n and j = 1, 2.
When conducting the FRT, we need to simulate the distribution of
(Z1, . . . , Zn) from (7.1). I will discuss some canonical choices of test statistics
based on the within-pair differences between the treated and control outcomes:
τ̂i = outcome under treatment − outcome under control (within pair i)
= (2Zi − 1)(Yi1 − Yi2 )
= Si (Yi1 − Yi2 ),
where the Si = 2Zi − 1 are IID random signs with mean 0 and variance
1, for i = 1, . . . , n. Since the pairs with zero τ̂i ’s do not contribute to the
randomization distribution, we drop those pairs in the discussion of the FRT.
Example 7.1 (paired t statistic) The average of the within-pair differ-
ences is
τ̂ = n⁻¹ Σ_{i=1}^n τ̂i.
Under H0f ,
E(τ̂ ) = 0
and
var(τ̂) = n⁻² Σ_{i=1}^n var(τ̂i) = n⁻² Σ_{i=1}^n var(Si)(Yi1 − Yi2)² = n⁻² Σ_{i=1}^n τ̂i².
Based on the CLT for the sum of independent random variables, we have the
Normal approximation:
τ̂ / √{ n⁻² Σ_{i=1}^n τ̂i² }  →d  N(0, 1).
In classic statistics, the motivation for using tpair comes from a different frame-
work. When τ̂i ∼IID N(0, σ²), we can show that tpair ∼ t(n − 1), i.e., the exact
distribution of tpair is t with n − 1 degrees of freedom, which is close to N(0, 1)
with a large n. The R function t.test with paired=TRUE can implement this
test. With a large n, these procedures give similar results. The discussion in
Example 7.1 gives another justification of the classic paired t test without
assuming the Normality of the data.
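A minimal R sketch of the FRT for the MPE, assuming tau.hat.i is the vector of observed within-pair differences τ̂i (with zero differences dropped); under H0f the randomization distribution is obtained by flipping random signs:
frt_mpe = function(tau.hat.i, n.perm = 10^4) {
  n = length(tau.hat.i)
  stat = function(d) mean(d) / sqrt(sum(d^2) / n^2)   # studentized mean of differences
  t.obs = stat(tau.hat.i)
  t.perm = replicate(n.perm, {
    s = sample(c(-1, 1), n, replace = TRUE)           # random signs under H0f
    stat(s * tau.hat.i)
  })
  mean(abs(t.perm) >= abs(t.obs))                     # two-sided Monte Carlo p-value
}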
Under H0f,
E(W) = (1/2) Σ_{i=1}^n Ri = (1/2) Σ_{i=1}^n i = n(n + 1)/4
and
var(W) = (1/4) Σ_{i=1}^n Ri² = (1/4) Σ_{i=1}^n i² = n(n + 1)(2n + 1)/24.
The CLT for the sum of independent random variables ensures the following
Normal approximation:
{W − n(n + 1)/4} / √{n(n + 1)(2n + 1)/24}  →d  N(0, 1).
be the empirical distribution of −(τ̂1 , . . . , τ̂n ), where F̂ (−t−) is the left limit
of the function F̂ (·) at −t. A Kolmogorov–Smirnov-type statistic is then
D = max_t |F̂(t) + F̂(−t−) − 1|.
Butler (1969) proposed this test statistic and derived its exact and asymp-
totic distributions. Unfortunately, this is not implemented in standard software
packages. Nevertheless, we can simulate its exact distribution and compute the
p-value based on the FRT. 1
Example 7.4 (sign statistic) The sign statistic uses only the signs of the
within-pair differences:
∆ = Σ_{i=1}^n I(τ̂i > 0).
Under H0f,
I(τ̂i > 0) ∼IID Bernoulli(1/2)
and therefore
∆ ∼ Binomial(n, 1/2).
Based on this we have an exact Binomial test, which is implemented in the
R function binom.test with p=1/2. Using the CLT, we can also conduct a test
based on the following Normal approximation of the Binomial distribution:
(∆ − n/2) / √{n/4}  →d  N(0, 1).
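For illustration, a hypothetical snippet assuming tau.hat.i holds the nonzero within-pair differences:
n = length(tau.hat.i)
binom.test(sum(tau.hat.i > 0), n, p = 1/2)        # exact Binomial (sign) test
(sum(tau.hat.i > 0) - n/2) / sqrt(n/4)            # Normal-approximation z statistic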
1 Butler (1969) proposed this test statistic under a slightly different framework. Given
IID draws of (τ̂1 , . . . , τ̂n ) from a distribution F (y), if they are symmetrically distributed
around 0, then
F (t) = pr(τ̂i ≤ t) = pr(−τ̂i ≤ t) = 1 − pr(τ̂i < −t) = 1 − F (−t−).
Therefore, F̂ (t) + F̂ (−t−) − 1 measures the deviation from the null hypothesis of symmetry,
which motivates the definition of D. A naive definition of the Kolmogorov–Smirnov-type
statistic is to compare the empirical distributions of the outcomes under treatment and
control as in Example 3.4. Using that definition, we effectively break the pairs. Although it
can still be used in the FRT for the MPE, it does not capture the matched-pairs structure
of the experiment.
Both the exact FRT and the asymptotic test do not depend on m11 or m00 .
Only the numbers of discordant pairs matter in these tests.
The discussion also extends to the independent but not IID setting; see Prob-
lem A1.1 in Chapter A1. The above discussion seems a digression from the
MPE which has completely different statistical assumptions. But at least it
motivates a variance estimator V̂ , which uses the between-pair variance of τ̂i
to estimate variance of τ̂ . Of course, it is derived under different assumptions.
Does it work for the MPE? Theorem 7.1 below is a positive result.
Theorem 7.1 Under the MPE, V̂ is a conservative estimator for the true
variance of τ̂ :
E(V̂) − var(τ̂) = {n(n − 1)}⁻¹ Σ_{i=1}^n (τi − τ)² ≥ 0.
Therefore,
E(V̂) = var(τ̂) + {n(n − 1)}⁻¹ Σ_{i=1}^n (τi − τ)² ≥ var(τ̂).
□
Similar to the discussions for other experiments, the Neymanian approach
relies on the large-sample approximation:
(τ̂ − τ) / √{var(τ̂)}  →d  N(0, 1).
Proposition 7.1 τ̂ and V̂ are identical to the coefficient and variance es-
timator of the intercept from the OLS fit of the vector (τ̂1 , . . . , τ̂n )T on the
intercept only.
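For illustration, a hypothetical check of Proposition 7.1, assuming tau.hat.i holds the within-pair differences:
fit.mpe = lm(tau.hat.i ~ 1)            # intercept-only OLS fit
summary(fit.mpe)$coef[1, 1:2]          # point estimate tau-hat and standard error sqrt(V-hat)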
In a realized MPE, cov(τ̂X ) is not zero unless all the τ̂X,i ’s are zero. With an
unlucky draw of (Z1 , . . . , Zn ), it is possible that τ̂X differs substantially from
zero. Similar to the discussion in the CRE, adjusting for the imbalance of the
covariate means is likely to improve estimation efficiency.
Consider a class of estimators indexed by γ:
τ̂(γ) = τ̂ − γᵀ τ̂X,
which has mean τ for any fixed γ. We want to choose γ to minimize the
variance of τ̂(γ). Its variance is a quadratic function of γ:
var{τ̂(γ)} = var(τ̂) − 2γᵀ cov(τ̂X, τ̂) + γᵀ cov(τ̂X) γ,
which is minimized at
γ̃ = {cov(τ̂X)}⁻¹ cov(τ̂X, τ̂).
We have obtained the formula of cov(τ̂X) in the above, which can also be
written as
cov(τ̂X) = n⁻² Σ_{i=1}^n τ̂_{X,i} τ̂_{X,i}ᵀ,
The sample analog γ̂ of γ̃ is approximately the coefficient of the τ̂_{X,i} in the OLS fit of the τ̂i's on
the τ̂_{X,i}'s with an intercept. The final estimator is
τ̂_adj = τ̂(γ̂) = τ̂ − γ̂ᵀ τ̂X,
which, by the property of OLS, is approximately the intercept in the OLS fit
of the τ̂i's on the τ̂_{X,i}'s with an intercept.
A conservative variance estimator V̂_adj for τ̂_adj can be constructed accordingly; see Proposition 7.2 below for a convenient regression formulation.
A subtle technical issue is whether τ̂ (γ̂) has the same optimality as τ̂ (γ̃).
With large samples, we can show τ̂ (γ̂) − τ̂ (γ̃) = −(γ̂ − γ̃)T τ̂X is of higher order
since it is the product of two “small” terms γ̂ − γ̃ and τ̂X . I omit the tedious
details for asymptotic analysis, but hope the result makes some intuitive sense
to the readers.
Moreover, Fogarty (2018b) discussed the asymptotically equivalent regres-
sion formulation of the above covariate-adjusted procedure, and gave a rigor-
ous proof of the associated CLT. I summarize the regression formulation below
without giving the regularity conditions.
Proposition 7.2 Under the MPE, the covariate-adjusted estimator τ̂adj and
the associated variance estimator V̂adj can be conveniently approximated by
the intercept and the associated variance estimator from the OLS fit of the
vector of the τ̂i ’s on the 1’s and the matrix of the τ̂X,i ’s.
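For illustration, a hypothetical sketch of Proposition 7.2, assuming tau.hat.i is the vector of within-pair differences and tau.hat.X.i is the matrix of within-pair covariate differences:
fit.adj = lm(tau.hat.i ~ tau.hat.X.i)    # OLS of differences on covariate differences
summary(fit.adj)$coef[1, 1:2]            # covariate-adjusted estimate and its standard error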
7.5 Examples
7.5.1 Darwin’s data comparing cross-fertilizing and self-
fertilizing on the height of corns
This is a classical example from Fisher (1935). It contains 15 pairs of corns with
either cross-fertilizing or self-fertilizing, with the height being the outcome.
The R package HistData provides the original data, where cross and self are
the heights under cross-fertilizing and self-fertilizing, respectively, and diff
denotes their difference.
> library("HistData")
> ZeaMays
pair pot cross self diff
1 1 1 23.500 17.375 6.125
2 2 1 12.000 20.375 -8.375
3 3 1 21.000 20.000 1.000
4 4 2 22.000 20.000 2.000
5 5 2 19.125 18.375 0.750
6 6 2 21.500 18.625 2.875
7 7 3 22.125 18.625 3.500
8 8 3 20.375 15.250 5.125
9 9 3 18.250 16.500 1.750
10 10 3 21.625 18.000 3.625
11 11 3 23.250 16.250 7.000
12 12 4 21.000 18.000 3.000
13 13 4 22.125 12.750 9.375
14 14 4 23.000 15.500 7.500
15 15 4 12.000 18.000 -6.000
In total, the MPE has 2^15 = 32768 possible treatment assignments, which
is a tractable number in R. The following function enumerates all possible
treatment assignments for the MPE:
MP_enumerate = function(i, n.pairs)
{
  ## map the index i (1, ..., 2^n.pairs) to a vector of +1/-1 assignments
  if (i > 2^n.pairs) print("i is too large.")
  a = 2^((n.pairs - 1):0)
  b = 2*a
  2*sapply(i - 1,
           function(x)
             as.integer((x %% b) >= a)) - 1
}
[Figure: randomization distribution of the paired t-statistic (p-value = 0.026), and randomization distributions of τ̂ and τ̂_adj.]
to reject the sharp null hypothesis in the MPE but it is possible in the CRE.
Even if the covariates are perfect predictors of the outcome, the MPE is not
superior to the CRE based on the FRT.
7.7.1 FRT
As usual, we can always use the FRT to test the sharp null hypothesis
H0f : Yij (1) = Yij (0) for all i = 1, . . . , n; j = 1, . . . , Mi + 1.
Because the general matched experiment is a special case of the SRE with
many small strata, we can use the test statistics defined in Examples 5.4, 5.5,
7.2, 7.3, 7.4, as well as the estimators and the corresponding t-statistics from
the following two subsections.
Interestingly, we can show that Theorem 7.1 holds for the general matched
experiment, and so do other results for the MPE. In particular, we can use the
OLS fit of the τ̂i ’s on the intercept to obtain the point and variance estimators
for τ . With covariates, we can use the OLS fit of the τ̂i ’s on the intercept and
the τ̂X,i ’s, where
M
Xi +1 n
X
τ̂X,i = Zij Xij − Mi−1 (1 − Zij )Xij
j=1 i=1
However, estimating the variance of this estimator is quite tricky because the
τ̂i ’s are independent random variable without any replicates. This is a famous
problem in theoretical statistics studied by Hartley et al. (1969) and Rao
(1970). Fogarty (2018a) also discussed this problem without recognizing these
previous works. I will give the final form of the variance estimator without
detailing the motivation:
V̂w = Σ_{i=1}^n ci (τ̂i − τ̂w)²,
where
ci = { wi²/(1 − 2wi) } / { 1 + Σ_{i=1}^n wi²/(1 − 2wi) }.
Theorem 7.3 Under the general matched experiment with varying Mi's, we
have
E(V̂w) − var(τ̂w) = Σ_{i=1}^n ci (τi − τw)² ≥ 0,
so E(V̂w) ≥ var(τ̂w).
The outcomes are the Bagrut passing rates in years 2001 and 2002, with the
Bagrut passing rates in 1999 and 2000 as pretreatment covariates. Re-analyze
the data based on the Neymanian inference with and without covariates. In
particular, how do you deal with the missing outcome in pair 25?
Previous chapters cover both the Fisherian and Neymanian inferences for dif-
ferent types of experiments. The Fisherian perspective focuses on the finite-
sample exact p-value for testing the strong null hypothesis of no causal effects
for any units whatsoever, and the Neymanian perspective focuses on unbi-
ased estimation with a conservative large-sample confidence interval for the
average causal effect. Both of them are justified by the physical randomiza-
tion of the experiments. They are the two important forms of design-based
or randomization-based inference for causal effects. They are related but also
have distinct features.
In 1935, Neyman presented his seminal paper on randomization-based in-
ference to the Royal Statistical Society. His paper (Neyman, 1935) was at-
tacked by Fisher in the discussion session. Sabbaghi and Rubin (2014) re-
viewed this famous Neyman–Fisher controversy and presented some new re-
sults for this old problem. Instead of going into philosophical issues, this chapter
provides a unified discussion.
based on
t = τ̂ / √V̂ = √{var(τ̂)/V̂} × τ̂/√{var(τ̂)}  →d  C × N(0, 1),
τ̂  ·∼  N(0, S²(1)/n1 + S²(0)/n0 − S²(τ)/n).
The FRT pretends that the Science Table is (Yi , Yi )ni=1 , so the permutation
distribution of τ̂ is
(τ̂)^π  ·∼  N(0, s²/n1 + s²/n0),
where (·)π denotes the permutation distribution and s2 is the sample variance
of the observed outcomes. Based on (3.7) in Chapter 3, we can approximate
the asymptotic variance of (τ̂ )π under H0f as
s²/n1 + s²/n0 = n/(n1 n0) { (n1 − 1)/(n − 1) Ŝ²(1) + (n0 − 1)/(n − 1) Ŝ²(0) + n1 n0/{n(n − 1)} τ̂² }
              ≈ Ŝ²(1)/n0 + Ŝ²(0)/n1
              ≈ S²(1)/n0 + S²(0)/n1,
which does not match the asymptotic variance of τ̂ . Ideally, we should com-
pute the p-value under H0n based on the true distribution of τ̂, which, however,
depends on the unknown potential outcomes. In contrast, we use the FRT to
compute the pfrt based on the permutation distribution (τ̂ )π , which does not
match the true distribution of τ̂ under H0n even with large samples. There-
fore, the FRT with τ̂ may not control the type one error rate under H0n even
with large samples.
Fortunately, the undesired property of the FRT with τ̂ goes away if we
replace the test statistic τ̂ with the studentized version t. Under H0n, we have
t  ·∼  N(0, C²)
with C ≤ 1, whereas the permutation distribution is (t)^π ·∼ N(0, 1),
where the variance equals 1 because the Science Table used by the FRT has
zero individual causal effects. Under H0n, because the true distribution of t
is less dispersed than the corresponding permutation distribution, the pfrt
based on t is asymptotically conservative.
The analysis of ReM is trickier. Zhao and Ding (2021a) show that the FRT
with t does not have the dual guarantees in Section 8.1, but the FRT with tL
still has the guarantees in Section 8.2. This highlights the importance of both
covariate adjustment and studentization in ReM.
Similar results hold for the MPE. Without covariates, we recommend using
the FRT with the t-statistic for the intercept in the OLS fit of τ̂i on 1; with
covariates, we recommend using the FRT with the t-statistic for the intercept
in the OLS fit of τ̂i on 1 and τ̂_{X,i}. Figure 7.2 in Chapter 7 is based on these
recommended FRTs.
Overall, the FRTs with studentized statistics are safer choices. When the
large-sample Normal approximations to the studentized statistics are accu-
rate, the FRTs give pfrt ’s that are almost identical to those based on Normal
approximations. When the large-sample approximations are inaccurate, the
FRTs at least guarantee valid p-values under the strong null hypotheses.
This is the recommendation of this book.
One outcome of interest is the average grades in the third and fourth
quarters of 2009, and an important background covariate was the anemia
status at baseline. We make pairwise comparisons of the “soccer” arm versus
the “control” arm and the “physician” arm versus the “control” arm. We
also compare the FRTs with and without using the covariate indicating the
baseline anemia status. We use their dataset to illustrate the FRTs in complete
randomization and stratified randomization. The ten subgroup analyses within
the same class levels use the FRTs with t and tL for the CRE and the two
overall analyses averaging over all class levels use the FRTs with tS and tL,S
for the SRE.
Table 8.1 shows the point estimators, standard errors, the p-value based
on the Normal approximation of the robust t-statistics, and the p-value based
on the FRTs. In most strata, covariate adjustment decreases the standard er-
ror since the baseline anemia status is predictive of the outcome. Table 8.1
also exhibits two exceptions: within class 2, covariate adjustment increases
the standard error when comparing “soccer” and “control”; in class 4, covari-
ate adjustment increases the standard error when comparing “physician” and
“control”. This is due to the small group sizes within these strata, which make the
asymptotic approximation dubious. Nevertheless, in these two scenarios, the
standard errors differ only in the third decimal place. The p-values from the
Normal approximation and the FRT are close with the latter being slightly
larger in most cases. Based on the theory, the p-values based on the FRT
should be trusted since they have an additional guarantee of being finite-sample
exact under the sharp null hypothesis. This becomes important in this exam-
ple since the group sizes are quite small within strata.
We echo Bind and Rubin (2020)’s suggestion that when conducting the
FRTs, not only the p-values but also the randomization distributions of the
test statistics should be reported. Figure 8.1 compares the histograms of the
randomization distributions of the robust t-statistics with the asymptotic ap-
proximations. In the subgroup analysis, we can observe discrepancies between
the randomization distributions and N(0, 1); averaging over all class levels, the
discrepancy becomes unnoticeable. Overall, in this application, the p-values
based on the Normal approximation do not differ substantially from those
based on the FRTs. The two approaches yield coherent conclusions: the video
with a physician explaining the benefits of iron supplements improved the aca-
demic performance, and the effect was most significant among students in class
3; in contrast, the video with a famous soccer player explaining the benefits of the
iron supplements did not have any significant effect.
[Figure 8.1: histograms of the randomization distributions of the robust t-statistics (Neyman and Lin), by class level (1–5 and all); (a) soccer versus control, (b) physician versus control.]
9
Bridging Finite and Super Population Causal Inference
9.1 CRE
Assume
{Zi, Yi(1), Yi(0), Xi}ⁿᵢ₌₁ ∼IID {Z, Y(1), Y(0), X}
from a super population. With a little abuse of notation, we define the popu-
lation average causal effect as τ = E{Y(1) − Y(0)}.
Under the super population framework, we can formulate the CRE as below.
Definition 9.1 (CRE under the super population framework) Z ⫫ {Y(1), Y(0), X}.
we have
τ = E{Y (1) − Y (0)} = γ1 − γ0 + (β1 − β0 )T E(X),
since the residuals ε(1) and ε(0) have mean zero due to the inclusion of the intercepts.
1 In causal inference, we say that a parameter is nonparametrically identifiable if it can
be determined by the distribution of the observed variables without imposing further para-
metric assumptions.
We can use the OLS with the treated and control data to estimate
the coefficients in (9.2) and (9.3), respectively. The sample versions of the
coefficients are γ̂1 , β̂1 , γ̂0 , β̂0 , so a covariate-adjusted estimator for τ is
(β̂1 − β̂0)ᵀ S²_X (β̂1 − β̂0)/n.
9.2 SRE
We can extend the discussion in Section 9.1 to the SRE since it is equivalent to
independent CREs within strata. The notation below will be slightly different
from that in Chapter 5.
Assume that
{Zi, Yi(1), Yi(0), Xi} ∼IID {Z, Y(1), Y(0), X}.
Definition 9.2 (SRE under the super population framework) Z ⫫ {Y(1), Y(0)} | X.
Under Definition 9.2, the conditional average causal effect can be rewritten
as
The discussion in Section 9.1 holds within all strata, so we can derive the super
population analog for the SRE. When there are more than two treatment
and control units within each stratum, we can use V̂S as an unbiased variance
estimator for var(τ̂S).
Y = α0 + αZ Z + αTX X + αTZX XZ + ε
where
That is,
Observational studies
10
Observational Studies, Selection Bias, and
Nonparametric Identification of Causal
Effects
Example 10.3 (school meal program and body mass index) Chan et al.
(2016) used a subsample of the data from NHANES 2007–2008 to study
whether participation in school meal programs lead to an increase in BMI
for school children. They documented the data as nhanes_bmi in the package
ATE. The dataset has the following important covariates:
age age
ChildSex gender (1: Male, 0: Female)
black race (1: Black, 0: otherwise)
mexam race (1: Hispanic, 0: otherwise)
pir200 plus Family above 200% of the federal poverty level
WIC Participation in the special supplemental nutrition program
Food Stamp Participation in food stamp program
fsdchbi Childhood food security
AnyIns Any insurance
RefSex Gender of the adult respondent (1: Male, 0: Female)
RefAge Age of the adult respondent
and
In the above two formulas of τT and τC , the quantities E(Y | Z = 1) and E(Y |
Z = 0) are directly observable from the data, but the quantities E{Y (0) | Z =
1} and E{Y (1) | Z = 0} are not. The latter two are counterfactuals because
they are the means of the potential outcomes corresponding to the treatment
level that is the opposite of the actual received treatment.
The simple difference in means, also known as the prima facie causal effect,
is generally biased for the causal effects defined above. For example,
τPF − τT = E{Y(0) | Z = 1} − E{Y(0) | Z = 0}
and
τPF − τC = E{Y(1) | Z = 1} − E{Y(1) | Z = 0}
are not zero in general, and they quantify the selection bias. They measure
and control groups.
Why is randomization so important? Rubin (1978) first used potential
outcomes to quantify the benefit of randomization. We have used the fact in
Chapter 9 that
Z ⫫ {Y(1), Y(0)}     (10.1)
in the CRE, which implies that the selection bias terms are both zero:
τPF − τT = E{Y (0) | Z = 1} − E{Y (0) | Z = 0} = 0
and
τPF − τC = E{Y (1) | Z = 1} − E{Y (1) | Z = 0} = 0.
So under complete randomization (10.1),
τ = τT = τC = τPF .
From the above discussion, the fundamental benefit of randomization is to
balance the distributions of the potential outcomes across the treatment and
control groups, which is more important than to balance the distributions of
the observed covariates.
Without randomization, the selection bias terms can be arbitrarily large
especially for unbounded outcomes. This highlights the fundamental difficulty
of causal inference with observational studies.
Definition 10.1 is too abstract at the moment. I will use more concrete
examples in later chapters to illustrate its meaning. It is often neglected in
standard statistics problems. For instance, the mean θ = E(Y ) is nonpara-
metrically identifiable if we have IID draws of Yi ’s; the Pearson correlation
coefficient θ = corr(X, Y ) is nonparametrically identifiable if we have IID
draws of the pairs (Xi , Yi )’s. In those examples, the parameters are nonpara-
metrically identifiable automatically. However, Definition 10.1 is fundamental
in causal inference with observational studies. In particular, the parameter of
interest τ = E{Y (1) − Y (0)} depends on some unobserved random variables,
so it is unclear whether it is nonparametrically identifiable based on observed
data. Under the assumptions in (10.2) and (10.3), it is nonparametrically
identifiable, as detailed below.
Because τPF (X) depends only on the observables, it is nonparametrically
identified by definition. Moreover, (10.2) and (10.3) ensure that the three
causal effects are the same as τPF (X), so τ (X), τT (X) and τC (X) are all
nonparametrically identified. Consequently, the unconditional versions are also
nonparametrically identified under (10.2) and (10.3) due to the law of total
expectation:
From now on, we focus on τ unless stated otherwise. The following theorem
summarizes the identification formulas of τ.
Theorem 10.1 Under (10.2) and (10.3), the average causal effect τ is iden-
tified by
by the law of total probability. Comparing (10.7) and (10.8), we can see that
although both formulas compare the conditional expectations E(Y | Z =
1, X = x) and E(Y | Z = 0, X = x), they average over different distributions of
the covariates. The causal parameter τ averages the conditional expectations
over the common distribution of the covariates, but the difference in means
τPF averages the conditional expectations over two different distributions of
the covariates in the treated and control groups.
Usually, we impose a stronger assumption:
which is called strong ignorability (Rosenbaum and Rubin, 1983b). If the pa-
rameter of interest is τ , then the stronger assumptions (10.9) and (10.10) are
just imposed for notational simplicity. They are not necessary in this case.
However, they cannot be relaxed if the parameter of interest is the causal
effects on other scales (for example, distribution, quantile, or some transfor-
mation of the outcome). The strong ignorability assumption requires that the
potential outcomes vector be independent of the treatment given covariates,
but the ignorability assumption only requires each potential outcome be in-
dependent of the treatment given covariates. The former is stronger than the latter.
Y (1) = f1 (X, V1 ),
Y (0) = f0 (X, V0 ),
Z = 1{g(X, V ) ≥ 0}
with (V1, V0) ⫫ V, then (10.9) and (10.10) hold. In the above data generating
process, the “common causes” X of the treatment and the outcome are all
observed, the remaining random components are independent. If the data
generating process changes to
Y (1) = f1 (X, U, V1 ),
Y (0) = f0 (X, U, V0 ),
Z = 1{g(X, U, V ) ≥ 0}
with (V1, V0) ⫫ V, then (10.9) or (10.10) does not hold in general. The un-
measured “common cause” U induces dependence between the treatment and
potential outcomes even conditioning on the observed covariates X. If we do
not have access to U and analyze the data based only on (Z, X, Y ), the final
estimator will be biased for the causal parameter in general. This type of bias
is called the omitted variable bias in econometrics.
The ignorability assumption can be reasonable if we observe a rich set
of covariates X that affect the treatment and the outcome simultaneously. I
start with this assumption, discussing identification and estimation strategies
in Part III of this book. However, it is fundamentally untestable. We may
justify it based on the scientific background knowledge, but we are often not
sure whether it holds or not. Parts IV and V of this book will discuss other
strategies when this assumption is not plausible.
Therefore, if ignorability holds and the outcome model is linear, then the aver-
age causal effect equals the coefficient of Z. This is one of the most important
applications of the linear model. However, the causal interpretation of the co-
efficient of Z is valid only under two strong assumptions: ignorability and the
linear model.
As we have discussed in Chapter 6, the above procedure is suboptimal even in
randomized experiments, because it ignores the treatment effect heterogeneity
induced by the covariates. If we assume
we have
The estimator for τ is then β̂z + β̂Tzx X̄, where β̂z is the regression coefficient
and X̄ is the sample mean of X. If we center the covariates to ensure X̄ = 0,
then the estimator is simply the regression coefficient of Z. To simplify the
procedure, we usually center the covariates at the beginning; also recall Lin
(2013)’s estimator introduced in Chapter 6. Rosenbaum and Rubin (1983b)
and Hirano and Imbens (2001) discussed this estimator.
In general, we can use other more complex models to estimate the causal
effects. For example, if we build two predictors µ̂1 (X) and µ̂0 (X) based on
the treated and control data, respectively, then we have an estimator for the
conditional average causal effect
The estimator τ̂ above has the same form as the projective estimator discussed
in Chapter 6. It is sometimes called the outcome imputation estimator. For
example, we may model a binary outcome using a logistic model
E(Y | Z, X) = pr(Y = 1 | Z, X) = exp(β0 + βz Z + βxᵀ X) / {1 + exp(β0 + βz Z + βxᵀ X)},
then based on the estimators of the coefficients β̂0 , β̂z , β̂x , we have the following
estimator for the average causal effect:
τ̂ = n⁻¹ Σ_{i=1}^n [ exp(β̂0 + β̂z + β̂xᵀ Xi) / {1 + exp(β̂0 + β̂z + β̂xᵀ Xi)} − exp(β̂0 + β̂xᵀ Xi) / {1 + exp(β̂0 + β̂xᵀ Xi)} ].
This estimator is not simply the coefficient of the treatment in the logistic
model.¹ It is a nonlinear function of all the coefficients as well as the
empirical distribution of the covariates. In econometrics, this estimator is
called the average partial effect or average marginal effect of the treatment
in the logistic model. Many econometric software packages can report this
estimator associated with the standard error. Similarly, we can also derive
the corresponding estimator based on a fully interacted logistic model; see
Problem 10.2.
For all the estimators discussed above, we can use the nonparametric boot-
strap to estimate the standard errors. See Chapter A1.5.
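For illustration, a minimal R sketch of this outcome-imputation estimator with a logistic outcome model, assuming a binary outcome y, binary treatment z, and a covariate matrix x; the bootstrap standard error can be added by resampling rows and repeating the calculation.
dat = data.frame(y = y, z = z, x)
logit.fit = glm(y ~ ., family = binomial, data = dat)    # logistic outcome model
dat1 = dat; dat1$z = 1                                   # impute Y(1) for all units
dat0 = dat; dat0$z = 0                                   # impute Y(0) for all units
tau.hat = mean(predict(logit.fit, newdata = dat1, type = "response") -
               predict(logit.fit, newdata = dat0, type = "response"))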
The above predictors for the conditional means of the outcome can also be
other machine learning tools. In particular, Hill (2011) championed the use of
tree methods for estimating τ , and Wager and Athey (2018) proposed to use
them also for estimating τ̂ (X). Wager and Athey (2018) also combined the
tree methods with the ideas in the next chapter. Since then, the intersection of machine learning
and causal inference has been an active research area (e.g., Hahn et al., 2020;
Künzel et al., 2019).
The biggest problem of the above approach based on outcome regressions
is its sensitivity to the specification of the outcome model. Problem 1.3 gave
such an example. Depending on the incentive of empirical research and pub-
lications, people sometimes reported their favorable causal effects estimates
after searching over a wide set of candidate models, without confessing this
searching process. This is a major source of p-hacking in causal inference.
Rosenbaum and Rubin (1983b) proposed the key concept propensity score and
discussed its role in causal inference with observational studies. It is one of
the most cited papers in statistics, and Titterington (2013) listed it as the
second most cited paper published in Biometrika during the past 100 years.
Its citation count has been growing fast in recent years.
Under the IID sampling assumption, we have four random variables as-
sociated with each unit: {X, Z, Y (1), Y (0)}. Following the basic probability
rule, we can factorize the joint distribution as
where pr(X) is the covariate distribution, pr{Y (1), Y (0) | X} is the outcome
model, and pr{Z | X, Y (1), Y (0)} is the treatment assignment mechanism.
Usually, we do not want to model the covariates because they are background
information happening before the treatment and outcome. If we want to move
beyond the outcome model, then we must focus on the treatment assignment
mechanism, which leads to the definition of the propensity score.
the conditional probability of receiving the treatment given the observed
covariates.
pr{Z = 1 | e(X)} = E{Z | e(X)}
                 = E[ E{Z | e(X), X} | e(X) ]   (tower property)
                 = E{ E(Z | X) | e(X) }
                 = E{ e(X) | e(X) }
                 = e(X).
rather than their exact values, which makes it relatively robust compared to
other methods. This robustness property of propensity score stratification
appeared in many numerical examples but its rigorous quantification is still
missing in the literature.
An important practical question is how to choose K. If K is too small,
then the strong ignorability does not hold even approximately given ê′ (X).
If K is too large, then we do not have enough units within each stratum of
the estimated propensity score and many strata have only treated or control
units. Therefore, we face a trade-off in practice. Following Cochran (1968)’s
heuristics, Rosenbaum and Rubin (1983b) and Rosenbaum and Rubin (1984)
suggested K = 5 which removes a large amount of bias in many settings.
However, with an extremely large dataset, propensity score stratification leads
to biased estimators with a fixed K (Lunceford and Davidian, 2004). It is
thus reasonable to increase K as long as each stratum has enough treated and
control units. Wang et al. (2020) suggested an aggressive choice of K, which
is the maximum number of strata such that the stratified estimator is well
defined. But the rigorous theory for this procedure is not fully established.
Another important practical question is how to compute the standard er-
rors of the estimators based on propensity score stratification. Some researchers
conditioned on the discretized propensity scores ê′(X) and reported standard
errors based on the SRE. This effectively ignored the uncertainty in the esti-
mated propensity scores. Other researchers bootstrapped the whole procedure
to account for full uncertainty. However, the theory for the bootstrap is still
unclear due to the discreteness of this estimator.
11.1.3 Application
To illustrate the propensity score stratification method, I revisited Example
10.3. Figure 11.1 shows the histograms of the estimated propensity scores with
different numbers of bins (K = 5, 10, 30).
Based on propensity score stratification, we can calculate the point esti-
mators and the standard errors for different choices of K ∈ {5, 10, 20, 50, 80}
as follows (with the function Neyman_SRE defined in Chapter 5 for analyzing
the SRE):
> pscore = glm(z ~ x, family = binomial)$fitted.values
> n.strata = c(5, 10, 20, 50, 80)
> strat.res = sapply(n.strata, FUN = function(nn){
+   q.pscore = quantile(pscore, (1:(nn-1))/nn)
+   ps.strata = cut(pscore, breaks = c(0, q.pscore, 1),
+                   labels = 1:nn)
+   Neyman_SRE(z, y, ps.strata)})
>
> rownames(strat.res) = c("est", "se")
> colnames(strat.res) = n.strata
> round(strat.res, 3)
5 10 20 50 80
[Figure 11.1: histograms of the estimated propensity scores with breaks = 5, 10, and 30.]
and
τ = E{Y(1) − Y(0)} = E{ ZY/e(X) − (1 − Z)Y/(1 − e(X)) }.
The sample analog is
τ̂_ht = n⁻¹ Σ_{i=1}^n { Zi Yi/ê(Xi) − (1 − Zi)Yi/(1 − ê(Xi)) },
where ê(Xi) is the estimated propensity score. This is the inverse propensity
score weighting (IPW) estimator, which is also called the Horvitz–Thompson
(HT) estimator. Horvitz and Thompson (1952) proposed it in survey sampling
and Rosenbaum (1987a) used it in causal inference with observational studies.
However, the estimator τ̂ ht has many problems. In particular, it is not
invariant to location transformation of the outcome. For example, if we change
Yi to Yi + c with a constant c, then it becomes τ̂ ht + c(1̂T − 1̂C ), where
1̂T = n⁻¹ Σ_{i=1}^n Zi/ê(Xi),    1̂C = n⁻¹ Σ_{i=1}^n (1 − Zi)/{1 − ê(Xi)}
are two different estimates of the constant 1. I use the funny notation 1̂T
and 1̂C because with the true propensity score these two terms both have
expectation 1; see Problem 11.3. In general, 1̂T − 1̂C is not zero in finite
sample. Since adding a constant to every outcome should not change the
average causal effect, this estimator is not reasonable because of its dependence
on c. A simple fix to the problem is to normalize the weights by 1̂T and 1̂C
respectively, resulting in the following estimator
τ̂_hajek = { Σ_{i=1}^n Zi Yi/ê(Xi) } / { Σ_{i=1}^n Zi/ê(Xi) } − { Σ_{i=1}^n (1 − Zi)Yi/(1 − ê(Xi)) } / { Σ_{i=1}^n (1 − Zi)/(1 − ê(Xi)) }.
This is the Hajek estimator due to Hájek (1971). We can verify that the Hajek
estimator is invariant to the location transformation, that is, if we replace Yi
by Yi + c, then τ̂ hajek remains the same. Moreover, many numerical studies
have found that τ̂ hajek is much more stable than τ̂ ht in finite samples.
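For illustration, a minimal R sketch of the two weighting estimators, assuming pscore is a vector of fitted propensity scores:
w1 = z/pscore; w0 = (1 - z)/(1 - pscore)              # inverse propensity score weights
tau.ht = mean(w1*y) - mean(w0*y)                      # Horvitz-Thompson estimator
tau.hajek = sum(w1*y)/sum(w1) - sum(w0*y)/sum(w0)     # Hajek estimator with normalized weights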
that is, the true propensity score is bounded away from 0 and 1. However,
D’Amour et al. (2021) pointed out that this is a rather strong assumption
especially with many covariates. Chapter 20 will discuss this problem in detail.
Even if the strong overlap condition holds for the true propensity score,
the estimated propensity scores can be close to 0 or 1. When this happens,
the weighting estimators blow up to infinity resulting in extremely unstable
behaviors in finite samples. We can either truncate the estimated propensity
score by changing it to
max[ αL, min{ê(Xi), αU} ],
or trim the observations by dropping units with ê(Xi ) outside the interval
[αL , αU ]. Crump et al. (2009) suggested αL = 0.1 and αU = 0.9, and Kurth
et al. (2005) suggested αL = 0.05 and αU = 0.95. Yang and Ding (2018)
established some asymptotic theory for trimming.
11.2.4 Application
Revisiting Example 10.3, we can obtain the weighting estimators based on
different truncations of the estimated propensity scores. The following
results are the two weighting estimators with the bootstrap standard errors,
with truncations at (0, 1), (0.01, 0.99), (0.05, 0.95), and (0.1, 0.9):
$trunc0
        HT   Hajek
est -1.516  -0.156
se   0.495   0.238
$trunc.01
        HT   Hajek
est -1.516  -0.156
se   0.464   0.231
$trunc.05
        HT   Hajek
est -1.499  -0.152
se   0.472   0.248
$trunc.1
        HT   Hajek
est -0.713  -0.054
se   0.435   0.229
The HT estimator gives results far away from all other estimators we discussed
so far. The point estimates seem too large and they are negatively significant
unless we truncate the estimated propensity scores at (0.1, 0.9). This is an
example showing the instability of the HT estimator.
Z ⫫ X | e(X).
Following similar steps as the proof of Theorem 11.1, we can show that the
left-hand side of (11.3) equals
Therefore, we can check whether the covariate distributions are the same
across the treatment and control groups within each stratum of the discretized
estimated propensity score.
In propensity score weighting, we can view h(X) as a pseudo outcome and
estimate the average causal effect on h(X). Because the true average causal
effect on h(X) is 0, the estimate should not be significantly different from 0.
A canonical choice of h(X) is X.
Let us revisit Example 10.3 again. Based on propensity score stratification
with K = 5, all the covariates except Food_Stamp are well balanced across the
treatment and control groups. A similar result holds for the Hajek estimator.
Figure 11.2 shows the balance checking results.
FIGURE 11.2: Balance check: point estimates and 95% confidence intervals
of the average causal effect on covariates, based on stratification with K = 5
and based on weighting.
Rosenbaum (2020) and Rosenbaum and Rubin (2023) pointed out this result
and called e(X, Y (1), Y (0)) the principal unobserved covariate.
Theorem 11.4 b(X) is a balancing score if and only if b(X) is finer than
e(X) in the sense that e(X) = f (b(X)) for some function f (·).
because b(X) = {e(X), X1 } is finer than e(X) and thus a balancing score.
The conditional independence in (11.4) ensures ignorability holds given the
propensity score, within each level of X1 . Therefore, we can perform the same
analysis based on the propensity score, within each level of X1 , yielding esti-
mates for two subgroup effects.
With the above motivation in mind, now prove Theorem 11.4.
Show that
τ(x1) = E[ 1(X1 = x1) Z Y / e(X) − 1(X1 = x1)(1 − Z) Y / {1 − e(X)} ] / pr(X1 = x1).
where
µ1(X) = E(Y | Z = 1, X),    µ0(X) = E(Y | Z = 0, X)
are the two conditional mean functions of the outcome given covariates. Sec-
ond, the inverse propensity score weighting (IPW) formula is
τ = E{ ZY/e(X) } − E{ (1 − Z)Y/(1 − e(X)) },     (12.2)
where
e(X) = pr(Z = 1 | X)
is the propensity score introduced in Chapter 11.
The outcome imputation estimator requires fitting a model for the outcome
given the treatment and covariates. It is consistent if the outcome model
is correctly specified. The IPW estimator requires fitting a model for the
treatment given covariates. It is consistent if the propensity score model is
correctly specified.
Mathematically, we have many combinations of (12.1) and (12.2) that lead
to different identification formulas of the average causal effect. Below I will dis-
cuss a particular combination that has appealing theoretical properties. This
combination motivates an estimator that is consistent if either the propensity
score or the outcome model is correctly specified. It is call the doubly robust
estimator, championed by James Robins (Scharfstein et al., 1999; Bang and
Robins, 2005).
The formulas in (12.3) and (12.4) augment the outcome imputation estima-
tor by inverse propensity score weighting terms of the residuals. The formulas
in (12.5) and (12.6) augment the IPW estimator by the imputed outcomes. For
this reason, the doubly robust estimator is also called the augmented inverse
propensity score weighting (AIPW) estimator.
The augmentation strengthens the theoretical properties in the following
sense.
Therefore, µ̃1^dr − E{Y(1)} = 0 if either e(X, α) = e(X) or µ1(X, β1) = µ1(X).
□
µ̂1^dr = n⁻¹ Σ_{i=1}^n [ Zi{Yi − µ1(Xi, β̂1)} / e(Xi, α̂) + µ1(Xi, β̂1) ]
and
µ̂0^dr = n⁻¹ Σ_{i=1}^n [ (1 − Zi){Yi − µ0(Xi, β̂0)} / {1 − e(Xi, α̂)} + µ0(Xi, β̂0) ];
Analogous to (12.5) and (12.6), we can also rewrite µ̂1^dr and µ̂0^dr as
µ̂1^dr = n⁻¹ Σ_{i=1}^n [ Zi Yi / e(Xi, α̂) − {Zi − e(Xi, α̂)} / e(Xi, α̂) · µ1(Xi, β̂1) ],
µ̂0^dr = n⁻¹ Σ_{i=1}^n [ (1 − Zi)Yi / {1 − e(Xi, α̂)} − {e(Xi, α̂) − Zi} / {1 − e(Xi, α̂)} · µ0(Xi, β̂0) ].
which holds if the propensity score model is correct without assuming that
the outcome model is correct. Using a working model to improve efficiency
is an old idea from survey sampling. Little and An (2004) and Lumley et al.
(2011) pointed out its connection with the doubly robust estimator.
which may not be the same as µ1 since the outcome model may be wrong. The bias
of this estimator is E{µ1(X, β1) − Y(1)}, which can be estimated by an IPW
estimator
B = E[ Z{µ1(X, β1) − Y} / e(X) ]
if the propensity score model is correct. So a de-biased estimator is µ̃1 − B,
which is identical to (12.8).
12.3 Examples
12.3.1 Summary of some canonical estimators for τ
The following R code implements the outcome imputation, Horvitz–Thompson, Ha-
jek, and doubly robust estimators for τ . These estimators can be conveniently
implemented based on the fitted values of the glm function. The default choice
for the propensity score model is the logistic model, and the default choice
for the outcome model is the linear model with out.family = gaussian1 . For
binary outcomes, we can also specify out.family = binomial to fit the logistic
model.
OS_est = function(z, y, x, out.family = gaussian, truncpscore = c(0, 1))
{
  ## fitted propensity score, possibly truncated
  pscore = glm(z ~ x, family = binomial)$fitted.values
  pscore = pmax(truncpscore[1], pmin(truncpscore[2], pscore))
  ## fitted potential outcomes from separate outcome regressions
  outcome1 = glm(y ~ x, weights = z, family = out.family)$fitted.values
  outcome0 = glm(y ~ x, weights = 1 - z, family = out.family)$fitted.values
  ## outcome imputation, Horvitz-Thompson, Hajek, and doubly robust estimators
  ace.reg = mean(outcome1) - mean(outcome0)
  ace.ht = mean(z*y/pscore - (1 - z)*y/(1 - pscore))
  ace.hajek = mean(z*y/pscore)/mean(z/pscore) -
    mean((1 - z)*y/(1 - pscore))/mean((1 - z)/(1 - pscore))
  ace.dr = mean(z*(y - outcome1)/pscore + outcome1) -
    mean((1 - z)*(y - outcome0)/(1 - pscore) + outcome0)
  c(ace.reg, ace.ht, ace.hajek, ace.dr)
}

OS_ATE = function(z, y, x, n.boot = 200,
                  out.family = gaussian, truncpscore = c(0, 1))
{
  point.est = OS_est(z, y, x, out.family, truncpscore)
  ## nonparametric bootstrap for the standard errors
  n.sample = length(z)
  x = as.matrix(x)
  boot.est = replicate(n.boot,
    { id.boot = sample(1:n.sample, n.sample, replace = TRUE)
      OS_est(z[id.boot], y[id.boot], x[id.boot, ], out.family, truncpscore) })
  res = rbind(est = point.est, se = apply(boot.est, 1, sd))
  colnames(res) = c("reg", "HT", "Hajek", "DR")
  return(res)
}
12.3.2 Simulation
I will use simulation to evaluate the finite-sample properties of the estimators
under four scenarios:
1. both the propensity score and outcome models are correct;
2. the propensity score model is wrong but the outcome model is cor-
rect;
3. the propensity score model is correct but the outcome model is
wrong;
4. both the propensity score and outcome models are wrong.
I will report the average bias, the true standard error, and the average esti-
mated standard error of the estimators over simulation.
In case 1, the data generating process is
x = matrix(rnorm(n*2), n, 2)
x1 = cbind(1, x)
beta.z = c(0, 1, 1)
pscore = 1/(1 + exp(-as.vector(x1 %*% beta.z)))
z = rbinom(n, 1, pscore)
beta.y1 = c(1, 2, 1)
beta.y0 = c(1, 2, 1)
y1 = rnorm(n, x1 %*% beta.y1)
y0 = rnorm(n, x1 %*% beta.y0)
y = z*y1 + (1 - z)*y0
In case 4, I modify both the propensity score and the outcome model.
We set the sample size to be n = 500 and generate 500 independent data
sets according to the data generating processes above. In case 1,
reg HT Hajek DR
ave . bias 0.00 0.02 0.03 0.01
true . se 0.11 0.28 0.26 0.13
est . se 0.10 0.25 0.23 0.12
All estimators are nearly unbiased. The two weighting estimators have larger
variances. In case 2,
160 12 Doubly Robust Estimator
reg HT Hajek DR
ave . bias 0.00 -0.76 -0.75 -0.01
true . se 0.12 0.59 0.47 0.18
est . se 0.13 0.50 0.38 0.18
The two weighting estimators are severely biased due to the misspecification
of the propensity score model. The regression imputation and doubly robust
estimators are nearly unbiased. In case 3,
reg HT Hajek DR
ave . bias -0.05 0.00 -0.01 0.00
true . se 0.11 0.15 0.14 0.14
est . se 0.11 0.14 0.13 0.14
The regression imputation estimator has larger bias than the other three esti-
mators due to the misspecification of the outcome model. The weighting and
doubly robust estimators are nearly unbiased. In case 4,
reg HT Hajek DR
ave . bias -0.08 0.11 -0.07 0.16
true . se 0.13 0.32 0.20 0.41
est . se 0.13 0.25 0.16 0.26
All estimators are biased because both the propensity score and outcome
models are wrong. The Horvitz–Thompson and doubly robust estimators have
the largest biases. When both models are wrong, the doubly robust estimator
appears to be doubly fragile.
In all the cases above, the bootstrap standard errors are close to the true
ones when the estimators are nearly unbiased for the true average causal effect.
12.3.3 Applications
Revisiting Example 10.3, we obtain the following estimators and bootstrap
standard errors:
reg HT Hajek DR
est -0.017 -1.516 -0.156 -0.019
se 0.230 0.492 0.246 0.233
The two weighting estimators are much larger than the other two estimators.
Truncating the estimated propensity score at [0.1, 0.9], we obtain the following
estimators and bootstrap standard errors:
reg HT Hajek DR
est -0.017 -0.713 -0.054 -0.043
se 0.223 0.422 0.235 0.231
The Hajek estimator becomes much closer to the regression imputation and
doubly robust estimators, while the Horvitz–Thompson estimator is still an
outlier.
Because of the symmetry, this chapter focuses on τT and also includes exten-
sions to other estimands.
where the first term E(Y | Z = 1) is directly identifiable from the data and
the second term E{Y(0) | Z = 1} is counterfactual. The key assumptions
for identifying the second term are the following unconfoundedness and overlap
assumptions.
Because the key is to identify E{Y (0) | Z = 1}, we only need the “one-
sided” unconfoundedness and overlap assumptions. Under Assumption 13.1,
we have the following identification result for τT .
With a discrete X, the identification formula in Theorem 13.1 reduces to
E{Y(0) | Z = 1} = Σ_{k=1}^K E(Y | Z = 0, X = k) pr(X = k | Z = 1),
whose sample analogue replaces pr(X = k | Z = 1) by π̂_{[k]|1} = n_{[k]1}/n_1, the proportion of category k of X among the treated units.
For continuous X, we need to fit an outcome model for E(Y | Z = 0, X) using the control units. If the fitted values for the control potential outcomes are µ̂0(Xi), then the outcome regression estimator is
τ̂T = Ȳ̂(1) − n_1^{-1} Σ_{i=1}^n Zi µ̂0(Xi) = n_1^{-1} Σ_{i=1}^n Zi {Yi − µ̂0(Xi)}.
If we fit a linear outcome model E(Y | Z, X) = β0 + βz Z + βx^T X by OLS and use µ̂0(Xi) = β̂0 + β̂x^T Xi, then the above estimator reduces to τ̂T = β̂z, the OLS coefficient of Z.
By the property of the OLS, we can also write β̂z as the difference in means
of the adjusted outcome Yi − β̂Tx Xi , resulting in
τ̂T = {Ȳ̂(1) − β̂x^T X̄̂(1)} − {Ȳ̂(0) − β̂x^T X̄̂(0)}
    = {Ȳ̂(1) − Ȳ̂(0)} − β̂x^T {X̄̂(1) − X̄̂(0)}.    (13.2)
Therefore, τ̂T equals the simple difference in means of the outcome, adjusted
by the imbalance of the covariates in the treatment and control groups.
Section 10.4.2 shows that β̂z is an estimator for τ , and this example further
shows that β̂z is an estimator for τT . This is not surprising because the linear
model assumes constant causal effects across units.
If we run OLS with only the control units to obtain (β̂_{0|0}, β̂_{x|0}), then the estimator is
τ̂T = Ȳ̂(1) − β̂_{0|0} − β̂_{x|0}^T X̄̂(1),
which is similar to (13.2) with a different coefficient for the difference in means of the covariates.
As an algebraic fact, we can show that this estimator equals the coefficient
of Z in the OLS fit of the outcome on the treatment, covariates, and their
interactions, with the covariates centered by X̄ ˆ (1). See Problem 13.1 for more
details.
and
τT = E(Y | Z = 1) − E[ e(X)(1 − Z)Y / {e(1 − e(X))} ].    (13.4)
So (13.3) holds. □
We have two inverse propensity score weighting estimators
τ̂_T^ht = Ȳ̂(1) − n_1^{-1} Σ_{i=1}^n ô(Xi)(1 − Zi)Yi
and
τ̂_T^hajek = Ȳ̂(1) − { Σ_{i=1}^n ô(Xi)(1 − Zi)Yi } / { Σ_{i=1}^n ô(Xi)(1 − Zi) },
where ô(Xi) = ê(Xi)/{1 − ê(Xi)} is the fitted odds of the treatment given covariates.
The estimation of E(Y | Z = 1) is simple. We have a doubly robust
estimator for E{Y (0) | Z = 1} which combines the propensity score and the
outcome model. Define
µ̃_{0T}^dr = E[ o(X, α)(1 − Z){Y − µ0(X, β0)} + Zµ0(X, β0) ]/e.    (13.5)
Then
e [µ̃_{0T}^dr − E{Y(0) | Z = 1}]
= E[ o(X, α)(1 − Z){Y(0) − µ0(X, β0)} + Zµ0(X, β0) ] − E{ZY(0)}
= E[ o(X, α)(1 − Z){Y(0) − µ0(X, β0)} − Z{Y(0) − µ0(X, β0)} ]
= E[ {o(X, α)(1 − Z) − Z}{Y(0) − µ0(X, β0)} ]
= E[ {e(X, α) − Z}/{1 − e(X, α)} × {Y(0) − µ0(X, β0)} ]
= E[ E{ (e(X, α) − Z)/(1 − e(X, α)) | X } × E{Y(0) − µ0(X, β0) | X} ]
= E[ {e(X, α) − e(X)}/{1 − e(X, α)} × {µ0(X) − µ0(X, β0)} ].
Therefore, µ̃_{0T}^dr − E{Y(0) | Z = 1} = 0 if either e(X, α) = e(X) or µ0(X, β0) = µ0(X). □
From the population version of µ̃_{0T}^dr, we can construct the sample version by the following steps:
1. obtain the fitted values of the propensity scores e(Xi, α̂);
2. obtain the fitted values of the outcome mean under control µ0(Xi, β̂0);
3. construct the doubly robust estimator τ̂_T^dr = Ȳ̂(1) − µ̂_{0T}^dr, where
µ̂_{0T}^dr = n_1^{-1} Σ_{i=1}^n [ e(Xi, α̂)(1 − Zi){Yi − µ0(Xi, β̂0)}/{1 − e(Xi, α̂)} + Zi µ0(Xi, β̂0) ].
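As a small illustration of steps 1–3, the sample version can be computed along the following lines; here pscore.fit and mu0.fit are assumed to hold the fitted values e(Xi, α̂) and µ0(Xi, β̂0) from steps 1 and 2.

## a sketch of step 3; pscore.fit and mu0.fit are assumed fitted values from steps 1-2
n1 = sum(z)
odds.fit = pscore.fit / (1 - pscore.fit)
mu0T.dr = sum(odds.fit * (1 - z) * (y - mu0.fit) + z * mu0.fit) / n1
tauT.dr = mean(y[z == 1]) - mu0T.dr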
13.3 An example
The following R code implements two outcome regression estimators, two IPW estimators, and the doubly robust estimator for τT, as well as the bootstrap variance estimators. To avoid extreme estimated propensity scores, we can also truncate them from above.
ATT.est = function(z, y, x, out.family = gaussian, Utruncpscore = 1)
{
  ## sample size
  nn  = length(z)
  nn1 = sum(z)

  return(c(ace.reg0, ace.reg, ace.ipw0, ace.ipw, ace.dr))
}

## nonparametric bootstrap
n.sample = length(z)
x = as.matrix(x)
boot.est = replicate(n.boot,
  { id.boot = sample(1:n.sample, n.sample, replace = TRUE)
    ATT.est(z[id.boot], y[id.boot], x[id.boot, ],
            out.family, Utruncpscore) })

return(res)
}
Now we re-analyze the data in Example 10.3 to estimate τT . We obtain
reg0 reg HT Hajek DR
est 0.061 -0.351 -1.992 -0.351 -0.187
se 0.227 0.258 0.705 0.328 0.287
without truncating the estimated propensity scores, and
reg0 reg HT Hajek DR
est 0.061 -0.351 -0.597 -0.192 -0.230
se 0.223 0.255 0.579 0.302 0.276
by truncating the estimated propensity scores from above at 0.9. The HT estimator is sensitive to the truncation as expected. The regression estimator in Example 13.1 is quite different from the other estimators. It imposes an unnecessary assumption that the regression functions in the treatment and control groups share the same coefficient of X. The regression estimator in Example 13.2 is much closer to the Hajek and doubly robust estimators. The estimates above are slightly different from those in Section 12.3.3, suggesting some treatment effect heterogeneity, that is, a difference between τT and τ.
The proof of Theorem 13.4 is similar to those of Theorems 11.2 and 13.2, and is relegated to Problem 13.8. Based on Theorem 13.4, we can construct the corresponding IPW estimator.
By Theorem 13.4, each unit is associated with the weight due to the defini-
tion of the estimand as well as the weight due to the inverse of the propensity
score. Finally, the treated units are weighted by h(X)/e(X) and the control
units are weighted by h(X)/{1 − e(X)}. Li et al. (2018a, Table 1) summarized
several estimands, and I present a part of it below:
population h(X) estimand weights
combined 1 τ 1/e(X) and 1/{1 − e(X)}
treated e(X) τT 1 and e(X)/{1 − e(X)}
control 1 − e(X) τC {1 − e(X)}/e(X) and 1
overlap e(X){1 − e(X)} τO 1 − e(X) and e(X)
The overlap population and the corresponding estimand τO are new to us. This estimand has the largest weight for units with e(X) = 1/2 and downweights the units with extreme propensity scores. A nice feature of this estimand is that its IPW estimator is rather stable without the possibly extremely small values of e(X) and 1 − e(X) in the denominator. If e(X) is independent of τ(X), including the special case of a constant effect τ(X) = τ, the parameter τO reduces to τ. In general, however, the estimand τO may cause controversy because
it changes the initial population and depends on the propensity score which
may be misspecified in practice. Li et al. (2018a) and Li et al. (2019) gave
some justifications and numerical evidence. This estimand will appear again
in Chapter 14.
We can also construct the doubly robust estimator for τ h . I relegate the
details to Problem 13.9.
13.5 Homework Problems
13.2 Simulation for the average causal effect on the treated units
In OS_ATE.R in Chapter 12, I ran some simulation studies for τ . Run similar
simulation studies for τT with either correct or incorrect propensity score or
outcome models.
You can choose different model parameters and larger numbers of simulation and bootstrap replicates. Report your findings, including at least the bias,
variance, and variance estimator via the bootstrap. You can also report other
properties of the estimators, for example, the asymptotic Normality and the
coverage rates of the confidence intervals.
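A minimal sketch of such a simulation, assuming the ATT.est function above has been fully implemented and reusing the case 1 data generating process from Chapter 12, might look as follows.

## simulation sketch for tau_T; ATT.est is assumed to return the five estimators
n = 500
simT = replicate(200, {
  x = matrix(rnorm(n * 2), n, 2)
  x1 = cbind(1, x)
  pscore = 1 / (1 + exp(- as.vector(x1 %*% c(0, 1, 1))))
  z = rbinom(n, 1, pscore)
  y1 = rnorm(n, x1 %*% c(1, 2, 1))
  y0 = rnorm(n, x1 %*% c(1, 2, 1))
  y = z * y1 + (1 - z) * y0
  tauT.true = mean(y1[z == 1] - y0[z == 1])
  ATT.est(z, y, x) - tauT.true
})
rowMeans(simT)        # average biases of the five estimators
apply(simT, 1, sd)    # true standard errors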
show that
δi = Zi Yi / e − (1 − Zi)Yi / (1 − e)
is an unbiased predictor of the individual effect in the sense that
E(δi − τi) = 0 (i = 1, . . . , n).

δi = Zi Yi / e(Xi) − (1 − Zi)Yi / {1 − e(Xi)}
13.7 More on τO
Show that
τO = E[{1 − e(X)}τ(X) | Z = 1] / E{1 − e(X) | Z = 1} = E{e(X)τ(X) | Z = 0} / E{e(X) | Z = 0}.
Remark: Tao and Fu (2019) proved the above results. However, they hold
only for a given h(X). The most interesting cases of τT , τC and τO all have
weight depending on the propensity score e(X), which must be estimated in
the first place. The above formulas do not apply to constructing the doubly
robust estimators for τT and τC ; there does not exist a doubly robust estimator
for τO .
Since Rosenbaum and Rubin (1983b)’s seminal paper, many creative uses of
the propensity score have appeared in the literature (e.g., Bang and Robins,
2005; Robins et al., 2007; Van der Laan and Rose, 2011; Vansteelandt and
Daniel, 2014). This chapter discusses two simple methods to use the propensity
score: including the propensity score as a covariate in regressions and running
regressions weighted by the inverse of the propensity score. I choose to focus
on these two methods because
recalling that hO (X) = e(X){1 − e(X)} and τ (X) = E{Y (1) − Y (0) | X}.
cov{Z − e(X), Y }
= E[{Z − e(X)}Y ]
= E[{Z − e(X)}ZY (1)] + E[{Z − e(X)}(1 − Z)Y (0)]
(since Y = ZY (1) + (1 − Z)Y (0))
= E[{Z − Ze(X)}Y (1)] − E[e(X)(1 − Z)Y (0)]
= E[Z{1 − e(X)}Y (1)] − E[e(X)(1 − Z)Y (0)]
= E[e(X){1 − e(X)}µ1 (X)] − E[e(X){1 − e(X)}µ0 (X)]
(tower property and ignorability)
= E{hO (X)τ (X)}.
Z − e(X) = Z − 0 − 1 · e(X) − 0^T X
is the residual of the OLS fit of Z on {1, e(X), X}, since Z − e(X) is uncorrelated with any function of X. □
Theorem 14.1 motivates a two-step estimator for τO: first, fit a propensity score model to obtain ê(Xi); second, run OLS of Yi on (1, Zi, Xi, ê(Xi)) and take the coefficient of Zi. Corollary 14.1 motivates another two-step estimator for τO: first, fit a propensity score model to obtain ê(Xi); second, run OLS of Yi on Zi − ê(Xi) and take the coefficient of Zi − ê(Xi). Although OLS is convenient for obtaining point estimators, the corresponding standard errors are incorrect due to the uncertainty in the first-step estimation of the propensity score. We can use the bootstrap to approximate the standard errors.
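A minimal sketch of the two estimators, with the bootstrap left implicit, is below; the logistic propensity score model is an assumption for illustration.

## two-step estimators of tau_O; z, y, x are the data
pscore.fit = glm(z ~ x, family = binomial)$fitted.values
## Theorem 14.1: coefficient of z in the OLS fit of y on (1, z, x, ehat(x))
tauO.reg = coef(lm(y ~ z + x + pscore.fit))["z"]
## Corollary 14.1: coefficient in the OLS fit of y on z - ehat(x)
tauO.res = coef(lm(y ~ I(z - pscore.fit)))[2]
## bootstrap, re-fitting the propensity score in each replicate, gives standard errors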
Robins et al. (1992) discussed many OLS estimators based on the propensity score. The above results seem to be special cases of their general theory, although they did not point out the connection with the estimand under the overlap weight, which was resurrected by Li et al. (2018a). Lee (2018) proposed to regress Y on Z − e(X) from a different perspective without making connections to the existing results in Robins et al. (1992) and Li et al. (2018a).
Rosenbaum and Rubin (1983b) proposed to estimate the average causal effect based on the OLS fit of Y on {1, Z, e(X), Ze(X)}. When this outcome model is correct, their estimator is consistent for the average causal effect. However, when the model is incorrect, the corresponding estimator has a much more complicated interpretation. Little and An (2004) suggested constructing estimators based on the OLS of Y on Z and a flexible function of e(X) and showed that it enjoys a certain double robustness property. Due to the complexity of implementation, I omit the discussion.
which equals the difference between the weighted means of the outcomes in the treatment and control groups. Numerically, it is identical to the coefficient of Zi in the weighted least squares (WLS) fit of Yi on (1, Zi) with weights
wi = Zi/ê(Xi) + (1 − Zi)/{1 − ê(Xi)} = { 1/ê(Xi) if Zi = 1;  1/{1 − ê(Xi)} if Zi = 0. }    (14.1)
are correctly specified, then both OLS and WLS give consistent estimators for the coefficients, and the estimator of the coefficient of Z is consistent for τ. More interestingly, the estimator of the coefficient of Z based on WLS is also consistent for τ if the propensity score model is correct and the outcome model is incorrect. That is, the estimator based on WLS is doubly robust. Robins et al. (2007) discussed this property and attributed this result to M. Joffe's unpublished paper. I will give more details below.
Let ê(Xi ) be the fitted propensity score and (µ1 (Xi , β̂1 ), µ0 (Xi , β̂0 )) be the
fitted values of the outcome means based on the WLS. The outcome regression
estimator is
τ̂_wls^reg = n^{-1} Σ_{i=1}^n µ1(Xi, β̂1) − n^{-1} Σ_{i=1}^n µ0(Xi, β̂0)
and the doubly robust estimator for τ is
τ̂_wls^dr = τ̂_wls^reg + n^{-1} Σ_{i=1}^n Zi{Yi − µ1(Xi, β̂1)}/ê(Xi) − n^{-1} Σ_{i=1}^n (1 − Zi){Yi − µ0(Xi, β̂0)}/{1 − ê(Xi)}.
An interesting result is that this doubly robust estimator equals the outcome
regression estimator, which reduces to the coefficient of Zi in the WLS fit of
Yi on (1, Zi , Xi , Zi Xi ) if we use weights (14.1).
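As a sketch of this equivalence in code, assuming a fitted propensity score pscore.fit and centered covariates, the WLS fit with weights (14.1) can be computed as follows.

## WLS estimator of tau with weights (14.1); a sketch assuming pscore.fit is available
w = z / pscore.fit + (1 - z) / (1 - pscore.fit)
xc = scale(x, center = TRUE, scale = FALSE)
tau.wls = coef(lm(y ~ z * xc, weights = w))["z"]
## use the bootstrap for the standard error since pscore.fit is estimated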
Theorem 14.2 If X̄ = 0 and (µ1(Xi, β̂1), µ0(Xi, β̂0)) = (β̂10 + β̂1x^T Xi, β̂00 + β̂0x^T Xi) are from the WLS fits of Yi on (1, Xi) based on the treated and control data, respectively, with the weights in (14.1), then τ̂_wls^dr = τ̂_wls^reg, and both equal the coefficient of Zi in the WLS fit of Yi on (1, Zi, Xi, ZiXi) with the weights in (14.1).
and
Σ_{i=1}^n (1 − Zi)(Yi − β̂00 − β̂0x^T Xi) / {1 − ê(Xi)} = 0.
So the difference between τ̂^dr and τ̂^reg is exactly zero. Both reduce to
n^{-1} Σ_{i=1}^n (β̂10 + β̂1x^T Xi) − n^{-1} Σ_{i=1}^n (β̂00 + β̂0x^T Xi) = β̂10 − β̂00 + (β̂1x − β̂0x)^T X̄ = β̂10 − β̂00
with centered covariates. So they both equal the coefficient of Zi in the WLS fit of Yi on (1, Zi, Xi, ZiXi). □
Freedman and Berk (2008) discouraged the use of the WLS estimator above
based on some simulation studies. They showed that when the outcome model
is correct, the WLS estimator is worse than the OLS estimator since the WLS
estimator has large variability in their simulation setting with homoskedastic
outcomes. This may not be true in general. When the errors have variance
proportional to the inverse of the propensity scores, the WLS estimator will
be more efficient than the OLS estimator. They also showed that the estimated
standard error based on the WLS fit is not consistent for the true standard
error because it ignores the uncertainty in the estimated propensity score.
This can be easily fixed by using the bootstrap to approximate the variance
of the WLS estimator. Nevertheless, they found that “weighting may help
under some circumstances” because when the outcome model is incorrect, the
WLS estimator is still consistent if the propensity score model is correct.
I end this section with Table 14.1 summarizing the regression estimators
for causal effects in both randomized experiments and observational studies.
with ô(Xi) = ê(Xi)/{1 − ê(Xi)}, equals the coefficient of Zi in the WLS fit of Yi on (1, Zi) with weights
w_{Ti} = Zi + (1 − Zi)ô(Xi) = { 1 if Zi = 1;  ô(Xi) if Zi = 0. }    (14.2)
Proof of Theorem 14.3: Based on the WLS fits in the treatment and control
groups, we have
Σ_{i=1}^n Zi(Yi − β̂10 − β̂1x^T Xi) = 0,    (14.3)
Σ_{i=1}^n ô(Xi)(1 − Zi)(Yi − β̂00 − β̂0x^T Xi) = 0.    (14.4)
The second result (14.4) ensures that τ̂_{T,wls}^dr = τ̂_{T,wls}^reg. Both reduce to
Ȳ̂(1) − n_1^{-1} Σ_{i=1}^n Zi(β̂00 + β̂0x^T Xi) = n_1^{-1} Σ_{i=1}^n Zi(Yi − β̂00 − β̂0x^T Xi).
With covariates centered so that X̄̂(1) = 0, the first result (14.3) implies that Ȳ̂(1) = β̂10, which further simplifies the estimators to β̂10 − β̂00. □
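In code, the estimator in Theorem 14.3 can be sketched as below, again assuming a fitted propensity score pscore.fit.

## Hajek-type estimator of tau_T: coefficient of z in the WLS fit of y on (1, z)
## with weights (14.2); a sketch assuming pscore.fit is available
o.fit = pscore.fit / (1 - pscore.fit)
wT = z + (1 - z) * o.fit
tauT.wls = coef(lm(y ~ z, weights = wT))["z"]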
τ̂^pred = µ̂_1^pred − µ̂_0^pred,
where
µ̂_1^pred = n^{-1} Σ_{i=1}^n {Zi Yi + (1 − Zi)µ1(Xi, β̂1)}
and
µ̂_0^pred = n^{-1} Σ_{i=1}^n {Zi µ0(Xi, β̂0) + (1 − Zi)Yi}.
It differs from the outcome regression estimator discussed before in that it only predicts the counterfactual outcomes but not the observed outcomes. Show that the doubly robust estimator equals τ̂^pred if (µ1(Xi, β̂1), µ0(Xi, β̂0)) = (β̂10 + β̂1x^T Xi, β̂00 + β̂0x^T Xi) are from the WLS fits of Yi on (1, Xi) based on the treated and control data, respectively, with weights
wi = Zi/ô(Xi) + (1 − Zi)ô(Xi) = { 1/ô(Xi) = {1 − ê(Xi)}/ê(Xi) if Zi = 1;  ô(Xi) = ê(Xi)/{1 − ê(Xi)} if Zi = 0. }    (14.5)
Remark: Cao et al. (2009) and Vermeulen and Vansteelandt (2015) moti-
vated the weights in (14.5) from other more theoretical perspectives.
14.3 Homework problems
Recall that e(X) = pr(Z = 1 | X) is the propensity score, and define ẽ(X) =
γ0 + γT1 X as the OLS projection of A on X with
1. Show that
β1 = E[w̃(X){µ1(X) − µ0(X)}]/E{w̃(X)} + E[{e(X) − ẽ(X)}µ0(X)]/E{w̃(X)}.
(Diagram: treated units with covariates X1, X2, . . . , Xn1 matched to control units with covariates Xm(1), Xm(2), . . . , Xm(n1).)
Consider a simple case with the number of control units n0 being much
larger than the number of treated units n1 . For unit i = 1, . . . , n1 in the treated
group, we find a unit m(i) in the control group such that Xi = Xm(i). In the ideal case, we have exact matches. Therefore, the units within a matched pair have the same propensity score e(Xi) = e(Xm(i)). Consequently, conditioning on the event that one unit receives the treatment and the other receives the control, the probability that unit i receives the treatment and unit m(i) receives the control is 1/2.
with Ω being the sample covariance matrix of the Xi ’s from the whole popu-
lation or only the control group.
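For reference, a sketch of the Mahalanobis distance computation, estimating Ω from the control group as one of the two options above, is:

## Mahalanobis distance between two covariate vectors; Omega from the control group
Omega.inv = solve(cov(x[z == 0, , drop = FALSE]))
d.maha = function(xi, xk) sqrt(as.numeric(t(xi - xk) %*% Omega.inv %*% (xi - xk)))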
I review some subtle issues about matching below. See Stuart (2010) for a
review paper.
1. (one-to-one or one-to-M matching) The above discussion focused on one-to-one matching; more generally, each treated unit can be matched to M control units.
2. I focus on matching with replacement but some practitioners prefer matching without replacement. If the pool of control units is large, these two methods will not matter too much for the final result. Matching with replacement is computationally more convenient, whereas matching without replacement involves computationally intensive discrete optimization. Matching with replacement usually gives matches of higher quality, but it introduces dependence by using the same units multiple times. In contrast, the advantages of matching without replacement are the independence of matched units and the simplicity of the subsequent data analysis.
3. Because of the residual covariate imbalance within matched pairs,
it is crucial to use covariate adjustment when analyzing the data.
In this case, covariate adjustment is not only for efficiency gain but
also for bias correction.
4. If X is “high dimensional”, it is likely that d(Xi, Xk) is too large for some unit i in the treated group and for all choices of the units in the control group. In this case, we may have to drop some units for which it is hard to find matches. By doing this, we effectively change the study population of interest.
5. It is hard to avoid the above problem. For example, if Xi ∼ N(0, Ip), Xk ∼ N(0, Ip), and Xi is independent of Xk, then ∥Xi − Xk∥² is distributed as two times a chi-squared random variable with p degrees of freedom, which has mean 2p and variance 8p. Theory shows that with large p, imperfect matching causes large bias in causal effect estimation.
This suggests that if p is large, we must have some dimension reduc-
tion before matching. Rosenbaum and Rubin (1983b) proposed to
match based on the propensity score. With the estimated propen-
sity score, we find pairs of units {i, m(i)} with small values of
|ê(Xi ) − ê(Xm(i) )| or |logit{ê(Xi )} − logit{ê(Xm(i) )}|, i.e., we have
a one dimensional matching problem.
¹We define ∥v∥² = Σ_{j=1}^p v_j² for a vector v = (v1, . . . , vp)^T. It denotes the squared length of the vector v.
where Ji is the set of matched units from the control group for unit i. For
example, we can compute d(Xi , Xk ) for all k in the control group, and then
define Ji as the indices of k with the M smallest values of d(Xi , Xk ).
For a control unit i, we simply impute the potential outcome under control
as Ŷi (0) = Yi , and impute the potential outcome under treatment as
Ŷi(1) = M^{-1} Σ_{k∈Ji} Yk,
where Ji is the set of matched units from the treatment group for unit i.
The matching estimator is
τ̂^m = n^{-1} Σ_{i=1}^n {Ŷi(1) − Ŷi(0)}.
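In practice we rarely code the matching loop by hand; a sketch using the Matching package (the same Match function used in the case study below) is:

## one-to-M matching with replacement and bias correction; a sketch, with M = 1
library(Matching)
matchest = Match(Y = y, Tr = z, X = x, M = 1, estimand = "ATE", BiasAdjust = TRUE)
c(matchest$est, matchest$se)   # bias-corrected matching estimate and standard error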
with {µ̂1(Xi), µ̂0(Xi)} being the predicted outcomes from, for example, OLS fits. For a treated unit with Zi = 1, the estimated bias is
B̂i = M^{-1} Σ_{k∈Ji} {µ̂0(Xi) − µ̂0(Xk)}
τ̂ mbc = τ̂ m − B̂,
where
ψ̂i = µ̂1 (Xi ) − µ̂0 (Xi ) + (2Zi − 1)(1 + Ki /M ){Yi − µ̂Zi (Xi )}
The linear expansion in Proposition 15.1 follows from simple but tedious
algebra. I leave its proof as Problem 15.1. The linear expansion motivates a
simple variance estimator
V̂^mbc = n^{-2} Σ_{i=1}^n (ψ̂i − τ̂^mbc)²,
by viewing τ̂^mbc as the sample average of the ψ̂i's. In the literature, Abadie and
Imbens (2008) first showed that the simple bootstrap by resampling the origi-
nal data does not work for estimating the variance of the matching estimators,
but their proposed variance estimation procedure is not easy to implement.
Otsu and Rai (2017) proposed to bootstrap the ψ̂i ’s in the linear expansion,
which yields the variance estimator V̂ mbc .
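A sketch of this variance estimation, assuming the ψ̂i's have already been computed and stored in psi.hat, is:

## variance estimate from the linear expansion, and the Otsu-Rai style bootstrap
n = length(psi.hat)
V.mbc = sum((psi.hat - mean(psi.hat))^2) / n^2
boot.means = replicate(500, mean(sample(psi.hat, replace = TRUE)))
se.boot = sd(boot.means)   # should be close to sqrt(V.mbc)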
For the average causal effect τ, recall the outcome regression estimator
τ̂^reg = n^{-1} Σ_{i=1}^n {µ̂1(Xi) − µ̂0(Xi)}.
For the average causal effect on the treated units τT, we only need to impute the missing potential outcomes under control for all the treated units, resulting in the following estimator
τ̂_T^m = n_1^{-1} Σ_{i=1}^n Zi{Yi − Ŷi(0)}.
where
B̂_{T,i} = M^{-1} Σ_{k∈Ji} {µ̂0(Xi) − µ̂0(Xk)}
corrects the bias due to the mismatch of covariates for a treated unit with Zi = 1.
The final bias-corrected estimator is
where
ψ̂_{T,i} = Zi{Yi − µ̂0(Xi)} − (1 − Zi)(Ki/M){Yi − µ̂0(Xi)}.
I leave the proof as Problem 15.1. Motivated by Otsu and Rai (2017), we
can view τ̂Tmbc as n/n1 multiplied by the sample average of the ψ̂T,i ’s, so an
intuitive variance estimator is
V̂_T^mbc = (n/n1)² · n^{-2} Σ_{i=1}^n (ψ̂_{T,i} − τ̂_T^mbc n1/n)² = n_1^{-2} Σ_{i=1}^n (ψ̂_{T,i} − τ̂_T^mbc n1/n)².
Furthermore, we can verify that τ̂Tmbc has a form very similar to τ̂Tdr .
Both the unadjusted and adjusted estimators show significantly positive effects of the job training program. We can analyze the data as if it were an observational study, yielding the following results:
Both the point estimator and standard error increase, but qualitatively, the
conclusion remains the same.
If we use simple OLS estimators, we obtain results that are far from the
experimental benchmark:
> neymanols = lm ( y ~ z )
> neymanols $ coef [2]
z
-8506.495
> sqrt ( hccm ( neymanols , type = " hc2 " )[2 , 2])
[1] 583.4426
>
> xc = scale ( x )
> linols = lm ( y ~ z * xc )
> linols $ coef [2]
z
-4265.801
> sqrt ( hccm ( linols , type = " hc2 " )[2 , 2])
[1] 3211.772
However, if we use matching, the results almost recovers those based on the
experimental data:
> matchest = Match ( Y = y , Tr = z , X = x , BiasAdjust = TRUE )
Ignoring the ties in the matched data, we can also use the matched-pairs
analysis, which again yields results similar to those based on the experimental
data:
> diff = y [ matchest $ index . treated ] -
+ y [ matchest $ index . control ]
> round ( summary ( lm ( diff ~ 1)) $ coef [1 , ] , 2)
Estimate Std . Error t value Pr ( >| t |)
1581.44 558.55 2.83 0.01
>
> diff . x = x [ matchest $ index . treated , ] -
+ x [ matchest $ index . control , ]
> round ( summary ( lm ( diff ~ diff . x )) $ coef [1 , ] , 2)
Estimate Std . Error t value Pr ( >| t |)
1842.06 578.37 3.18 0.00
Call :
lm ( formula = z ~ x )
Residuals :
Min 1Q Median 3Q Max
-0.18508 -0.01057 0.00303 0.01018 1.01355
Coefficients :
Estimate Std . Error t value Pr ( >| t |)
( Intercept ) 1.404 e -03 6.326 e -03 0.222 0.8243
xage -4.043 e -04 8.512 e -05 -4.750 2.05 e -06 * * *
xeduc 3.220 e -04 4.073 e -04 0.790 0.4293
But after matching, the covariates are well balanced, as signified by the absence of stars for all coefficients.
> lm . after = lm ( z ~ x ,
+ subset = c ( matchest $ index . treated ,
+ matchest $ index . control ))
> summary ( lm . after )
Call :
lm ( formula = z ~ x , subset = c ( matchest $ index . treated , matchest $ index . control ))
Residuals :
Min 1Q Median 3Q Max
-0.66864 -0.49161 -0.03679 0.50378 0.65122
Coefficients :
Estimate Std . Error t value Pr ( >| t |)
( Intercept ) 6.003 e -01 2.427 e -01 2.474 0.0137 *
xage 3.199 e -03 3.427 e -03 0.933 0.3511
xeduc -1.501 e -02 1.634 e -02 -0.918 0.3590
xblack 6.141 e -05 7.408 e -02 0.001 0.9993
xhispan 1.391 e -02 1.208 e -01 0.115 0.9084
xmarried -1.328 e -02 6.729 e -02 -0.197 0.8437
xnodegree -3.023 e -02 7.144 e -02 -0.423 0.6723
xre74 6.754 e -06 9.864 e -06 0.685 0.4939
xre75 -9.848 e -06 1.279 e -05 -0.770 0.4417
xu74 2.179 e -02 1.027 e -01 0.212 0.8321
xu75 -2.642 e -02 8.327 e -02 -0.317 0.7512
15.6 Discussion
With many covariates, matching based on the original covariates may suffer from the curse of dimensionality. Rosenbaum and Rubin (1983b) suggested matching based on the estimated propensity score. Abadie and Imbens (2016) provided a formal theory for this strategy.
sity scores. You may consider fitting different propensity score and outcome
models, e.g., including some quadratic terms of the basic covariates. You can
even apply these estimators to the matched data.
This is a classic dataset and hundreds of papers have used it. You can read
some references (Dehejia and Wahba, 1999; Hainmueller, 2012) and you can
also be creative in your own data analysis.
Part III of this book discusses causal inference with observational studies
under two assumptions: unconfoundedness and overlap. Both are strong as-
sumptions and likely to be violated in practice. This chapter will discuss the
difficulties of the unconfoundedness assumption. Chapters 17–19 will discuss
various strategies for sensitivity analysis in observational studies with un-
measured confounding. Chapter 20 will discuss the difficulties of the overlap
assumption.
(Causal diagram: the observed covariates X and an unmeasured confounder U affect both the treatment Z and the outcome Y, and Z affects Y.)
we can read it as
X ∼ FX (x),
U ∼ FU (u),
Z = fZ (X, U, εZ ),
Y (z) = fY (X, U, z, εY (z)),
where εZ is independent of εY(z) for both z = 0, 1. We can easily read from the equations that Z ⊥⊥ Y(z) | (X, U) but Z is not independent of Y(z) given X alone, i.e., the unconfoundedness assumption holds conditioning on (X, U) but does not hold conditioning on X only. In this diagram, U is called an unmeasured confounder.
(Causal diagram including a negative outcome Yn that shares the confounding structure of Y but is not affected by Z.)
Example 16.1 Cornfield et al. (1959) studied the causal role of cigarette smoking on lung cancer based on observational studies. They controlled for many important background variables, but it is still possible that some unmeasured confounders biased the observed effects. To strengthen the evidence, they also reported the effect of cigarette smoking on car accidents, which was close to zero, the anticipated effect based on biology. So even if they could not rule out unmeasured confounding in the analysis, this supplementary analysis based on a negative outcome makes the evidence for the causal effect of cigarette smoking on lung cancer stronger.
Example 16.2 Imbens and Rubin (2015) suggested using the lagged outcome
as a negative outcome. In most cases, it is reasonable to believe that the lagged
outcome and the outcome have similar confounding structure. Since the lagged
outcome happens before the treatment, the average causal effect on it must be
0. However, their suggestion should be used with caution since in most studies
we simply treat lagged outcomes as an observed confounder.
In some sense, the covariate balance check in Chapter 11 is a special case
of using negative controls. Similar to the problem of using lagged outcomes
as negative controls, those covariates are usually a part of the ignorability
assumption. Therefore, the failure of the covariate balance check does not really falsify the ignorability assumption but rather the model specification of the propensity score.
Example 16.3 Observational studies in elderly persons have shown that vac-
cination against influenza remarkably reduces one’s risk of pneumonia/in-
fluenza hospitalization and all-cause mortality in the following season, after
adjustment for measured covariates. Jackson et al. (2006) were skeptical about
the large magnitude and thus conducted supplementary analysis on negative
outcomes. Vaccination often begins in autumn, but influenza transmission is
often minimal until winter. Based on biology, the effect of vaccination should
be most prominent during influenza season. But Jackson et al. (2006) found a greater effect before the influenza season, suggesting that the observed effect is due to unmeasured confounding.
Jackson et al. (2006)'s supplementary analysis seems the most convincing one since the influenza-related outcomes before and during the influenza season should have similar confounding patterns. Cornfield et al. (1959)'s additional evidence seems weaker since car accidents and lung cancer have very different causal mechanisms with respect to cigarette smoking. In fact, Fisher (1957)'s critique was that the relationship between cigarette smoking and lung cancer may be due to an unobserved genetic factor. Such a genetic factor might affect cigarette smoking and lung cancer simultaneously, but it seems unlikely that it also affects car accidents.
Lipsitch et al. (2010) is a recent article on negative outcomes. Rosenbaum
(1989) discussed the role of known effects in causal inference.
(Causal diagram including a negative exposure Zn that shares the confounding structure of Z but has no effect on the outcome Y.)
Example 16.4 Sanderson et al. (2017) give many examples of negative ex-
posures in determining the effect of intrauterine exposure on later outcomes
by comparing the association of a maternal exposure during pregnancy with
the outcome of interest, with the association of the paternal exposure with the
same outcome. They review studies on the effect of maternal and paternal
smoking on offspring outcomes, the effect of maternal and paternal BMI on
later offspring BMI and autism spectrum disorder. In these examples, we expect the association of the maternal exposure with the outcome to be larger than that of the paternal exposure with the outcome.
16.2.3 Summary
The unconfoundedness assumption is fundamentally untestable without ad-
ditional assumptions. Although negative outcomes and negative controls in
observational studies cannot prove or disprove unconfoundedness, using them
in supplementary analyses can strengthen the evidence for causation. However, it is often non-trivial to conduct this type of supplementary analysis.
16.3 Problems of over adjustment
16.3.1 M-bias
M-bias appears in the following causal diagram with an M-structure:
(Causal diagram with an M-structure: U1 → Z, U1 → X, U2 → X, U2 → Y.)
where (εX , εZ , εY ) are independent random error terms. In the above causal
diagram, X is observed, but U1 and U2 are unobserved. If we change the value
of Z, the value of Y will not change at all. So the true causal effect of Z on Y
must be 0. From the data-generating equations, we can easily read that Z ⊥⊥ Y.
This means that without adjusting for the covariate X, the simple estimator
is unbiased for the true parameter.
However, if we condition on X, then U1 and U2 become dependent given X, and consequently, Z and Y become dependent given X and
∫ {E(Y | Z = 1, X = x) − E(Y | Z = 0, X = x)} F(dx) ≠ 0
in general. To gain intuition, we consider the case with Gaussian linear models¹:
X = aU1 + bU2 + εX,
Z = cU1 + εZ,
Y = Y(z) = dU2 + εY,
where (U1, U2, εX, εZ, εY) are IID N(0, 1). We have
but by the result in Problem 1.2, the partial correlation coefficient between Z and Y given X is
ρ_{ZY|X} = (ρ_{ZY} − ρ_{ZX}ρ_{YX}) / {√(1 − ρ²_{ZX}) √(1 − ρ²_{YX})} ∝ −ρ_{ZX}ρ_{YX} ∝ −cov(Z, X)cov(Y, X) = −abcd,
1 It is not ideal for our discussion of binary Z, but it simplifies the derivations. Ding and
Miratrix (2015) gave detailed discussion with more natural models for binary Z.
> Z = ( Z >= 0)
> round ( summary ( lm ( Y ~ Z )) $ coef [2 , 1] , 3)
[1] -0.002
> round ( summary ( lm ( Y ~ Z + X )) $ coef [2 , 1] , 3)
[1] -0.421
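The output above comes from a data generating process that is not shown here; a self-contained sketch of an M-bias simulation, with illustrative coefficients a = b = c = d = 1, is:

## M-bias simulation sketch; the true effect of Z on Y is zero by construction
n = 10^6
U1 = rnorm(n)
U2 = rnorm(n)
X = U1 + U2 + rnorm(n)   # a = b = 1
Z = U1 + rnorm(n)        # c = 1
Y = U2 + rnorm(n)        # d = 1; Y does not depend on Z
Z = (Z >= 0)
round(summary(lm(Y ~ Z))$coef[2, 1], 3)       # approximately 0
round(summary(lm(Y ~ Z + X))$coef[2, 1], 3)   # nonzero: bias from adjusting for X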
16.3.2 Z-bias
Consider the following causal diagram:
(Causal diagram for Z-bias: X → Z with coefficient a, U → Z with coefficient b, U → Y with coefficient c, and Z → Y with effect τ; U is unmeasured.)
We need to solve for (τadj, α) from the above two linear equations. By Cramer's rule,
τadj = {E(ZY)var(X) − E(XZ)E(XY)} / {var(Z)var(X) − E(XZ)²}
     = {τ(a² + b² + 1) + bc − aτ · a} / {(a² + b² + 1) − a²}
     = {τ(b² + 1) + bc} / (b² + 1)
     = τ + bc/(b² + 1),
2 Again, we generate continuous Z from a linear model to simplify the derivations. Ding
et al. (2017b) extended the theory to more general causal models, especially for binary Z.
(Causal diagram with the treatment Z, the outcome Y, and covariates of different types: X, XZ, XY, and XI.)
The covariates above have different features:
1. X affects both the treatment and the outcome. Conditioning on X
ensures ignorability, so we should control for X.
2. XR is pure random noise not affecting either the treatment or the
outcome. Including it in analysis does not bias the estimate but it
introduces unnecessary variability in finite sample.
3. XZ is an instrumental variable that affects the outcome only through the treatment. In the diagram above, including it in analysis does not bias the estimate although it increases variability. However, with unmeasured confounding, including it in analysis amplifies the bias, as shown in Section 16.3.2.
4. XY affects the outcome but not the treatment. Without conditioning on it, ignorability still holds. Since it is predictive of the outcome, including it in the analysis often improves precision.
5. XI is affected by the treatment and outcome. It is a post-treatment
variable, not a pretreatment covariate. We should not include it if
the goal is to infer the effect of the treatment on the outcome. We
will discuss issues with post-treatment variables in causal inference
in Part VI of this book.
If we believe the above causal diagram, we should adjust for at least X to
remove bias and more ideally, further adjust for XY to reduce variance.
where ε̂, ê, v̂ are the residuals. Again, the last OLS fit means the
OLS fit of each column of X2 on X1 , and therefore the residual v̂ is
an n × L matrix.
Show that γ̂ = β̂1 + δ̂ β̂2 .
Remark: The product terms δβ2 and δ̂ β̂2 are often referred to as the
omitted-variable bias at the population level and sample level, respectively.
All the methods discussed in Part III rely crucially on the ignorability as-
sumption. They require controlling for all confounding between the treatment
and outcome. However, we cannot use the data to validate the ignorability
assumption. Observational studies are often criticized due to the possibility of
unmeasured confounding. The famous Yule–Simpson Paradox demonstrates
that an unmeasured binary confounder can completely overturn an observed
association between the treatment and outcome. However, to overturn a larger
observed association, this unmeasured confounder must have stronger associa-
tion with the treatment and the outcome. In other words, not all observational
studies are created equal. Some provide stronger evidence for causation than
others.
The following three chapters will discuss various sensitivity analysis tech-
niques that can quantify the evidence of causation based on observational
studies in the presence of unmeasured confounding. This chapter starts with
the E-value, introduced by VanderWeele and Ding (2017) based on the theory
in Ding and VanderWeele (2016). It is more useful for observational studies
using logistic regressions. Chapter 18 discusses sensitivity analysis for the av-
erage causal effect based on inverse probability weighting, outcome regression,
and doubly robust estimation. Chapter 19 discusses Rosenbaum’s framework
for sensitivity analysis for matched observational studies.
Z ⊥̸⊥ {Y(1), Y(0)} | X,
The technique in this chapter works the best for a binary outcome Y although
it can be extended to other non-negative outcomes (Ding and VanderWeele,
2016). Focus on binary Y now. The true conditional causal effect on the risk
ratio scale is defined as
rr^true_{ZY|x} = pr{Y(1) = 1 | X = x} / pr{Y(0) = 1 | X = x},
and the observed conditional risk ratio equals
rr^obs_{ZY|x} = pr(Y = 1 | Z = 1, X = x) / pr(Y = 1 | Z = 0, X = x).
In general, with an unmeasured confounder U,
rr^true_{ZY|x} ≠ rr^obs_{ZY|x}
because
rr^true_{ZY|x} = ∫ pr(Y = 1 | Z = 1, X = x, U = u) F(du | X = x) / ∫ pr(Y = 1 | Z = 0, X = x, U = u) F(du | X = x)
and
rr^obs_{ZY|x} = ∫ pr(Y = 1 | Z = 1, X = x, U = u) F(du | Z = 1, X = x) / ∫ pr(Y = 1 | Z = 0, X = x, U = u) F(du | Z = 0, X = x)
are averaged over different distributions of U.
Doll and Hill (1950) found that the risk ratio of cigarette smoking on lung cancer was 9 even after adjusting for many observed covariates X¹. Fisher (1957) criticized their result as noncausal because it is possible that a hidden gene simultaneously causes cigarette smoking and lung cancer although the true causal effect of cigarette smoking on lung cancer is absent. This is the common cause hypothesis, also discussed by Reichenbach (1957). Cornfield et al. (1959) took a more constructive perspective and asked: how strong must this unmeasured confounder be in order to explain away the observed association between cigarette smoking and lung cancer? Below we will use Ding and VanderWeele (2016)'s general formulation of the problem.
Consider the following causal diagram:
(Causal diagram with the treatment Z, the outcome Y, and an unmeasured confounder U affecting both.)
¹ ...of cigarette smoking on lung cancer. But the risk ratio is close to the odds ratio since lung cancer is a rare outcome.
rr_{ZU|x} = pr(U = 1 | Z = 1, X = x) / pr(U = 1 | Z = 0, X = x) ≡ f_{1,x}/f_{0,x},
rr_{UY|x} = pr(Y = 1 | U = 1, X = x) / pr(Y = 1 | U = 0, X = x),
Theorem 17.1 shows the upper bound on the observed risk ratio of the treatment on the outcome if the conditional independence Z ⊥⊥ Y | (X, U) holds. Under this conditional independence assumption, the association between the treatment and the outcome is purely due to the association between the treatment and the confounder, rr_{ZU|x}, and the association between the confounder and the outcome, rr_{UY|x}. The upper bound equals rr_{ZU|x} rr_{UY|x}/(rr_{ZU|x} + rr_{UY|x} − 1). A similar inequality appeared in Lee (2011). It is also related to Cochran's formula or the omitted-variable bias formula for linear models, which was reviewed in Problem 16.1.
Conversely, to generate a certain value of the observed risk ratio rr^obs_{ZY|x}, the two confounding measures rr_{ZU|x} and rr_{UY|x} cannot be arbitrary. Their function rr_{ZU|x} rr_{UY|x}/(rr_{ZU|x} + rr_{UY|x} − 1) must be at least as large as rr^obs_{ZY|x}.
I will give the proof of Theorem 17.1 below.
rr^obs_{ZY|x}
= pr(Y = 1 | Z = 1, X = x) / pr(Y = 1 | Z = 0, X = x)
= {pr(U = 1 | Z = 1, X = x) pr(Y = 1 | Z = 1, U = 1, X = x) + pr(U = 0 | Z = 1, X = x) pr(Y = 1 | Z = 1, U = 0, X = x)}
  / {pr(U = 1 | Z = 0, X = x) pr(Y = 1 | Z = 0, U = 1, X = x) + pr(U = 0 | Z = 0, X = x) pr(Y = 1 | Z = 0, U = 0, X = x)}
= {pr(U = 1 | Z = 1, X = x) pr(Y = 1 | U = 1, X = x) + pr(U = 0 | Z = 1, X = x) pr(Y = 1 | U = 0, X = x)}
  / {pr(U = 1 | Z = 0, X = x) pr(Y = 1 | U = 1, X = x) + pr(U = 0 | Z = 0, X = x) pr(Y = 1 | U = 0, X = x)}
= (f_{1,x} rr_{UY|x} + 1 − f_{1,x}) / (f_{0,x} rr_{UY|x} + 1 − f_{0,x})
= {(rr_{UY|x} − 1)f_{1,x} + 1} / {(rr_{UY|x} − 1)f_{1,x}/rr_{ZU|x} + 1}.
□
In the proof of Theorem 17.1, we have obtained an identity
rr^obs_{ZY|x} = {(rr_{UY|x} − 1)f_{1,x} + 1} / {(rr_{UY|x} − 1)f_{1,x}/rr_{ZU|x} + 1}.
But this identity involves three parameters {f_{1,x}, rr_{ZU|x}, rr_{UY|x}}; see Problem 17.2 for a related formula. In contrast, the upper bound in Theorem 17.1 involves only two parameters {rr_{ZU|x}, rr_{UY|x}}, which measure the strength of the confounder.
17.2 E-value
Lemma 17.1 below is useful for deriving interesting corollaries of Theorem
17.1.
Lemma 17.1 Define β(w1, w2) = w1w2/(w1 + w2 − 1) for w1 > 1 and w2 > 1.
rr_{ZU|x} ≥ rr^obs_{ZY|x},   rr_{UY|x} ≥ rr^obs_{ZY|x},
or, equivalently,
min(rr_{ZU|x}, rr_{UY|x}) ≥ rr^obs_{ZY|x}.
Therefore, to explain away the observed relative risk, both confounding measures rr_{ZU|x} and rr_{UY|x} must be at least as large as rr^obs_{ZY|x}. Cornfield et al. (1959) first derived the inequality rr_{ZU|x} ≥ rr^obs_{ZY|x}, also called the Cornfield inequality (Gastwirth et al., 1998). Schlesselman (1978) derived the inequality rr_{UY|x} ≥ rr^obs_{ZY|x}. These are related to the data processing inequality in information theory².
If we define w = max(rr_{ZU|x}, rr_{UY|x}), then we can use Theorem 17.1 and Lemma 17.1(4) to obtain
rr^obs_{ZY|x} ≤ β(w, w) = w²/(2w − 1)  ⟹  w² − 2 rr^obs_{ZY|x} w + rr^obs_{ZY|x} ≥ 0,
which is a quadratic inequality in w. One root, rr^obs_{ZY|x} − √{rr^obs_{ZY|x}(rr^obs_{ZY|x} − 1)}, is always smaller than or equal to 1, so we have
w = max(rr_{ZU|x}, rr_{UY|x}) ≥ rr^obs_{ZY|x} + √{rr^obs_{ZY|x}(rr^obs_{ZY|x} − 1)}.
Therefore, to explain away the observed relative risk, the maximum of the confounding measures rr_{ZU|x} and rr_{UY|x} must be at least as large as rr^obs_{ZY|x} + √{rr^obs_{ZY|x}(rr^obs_{ZY|x} − 1)}. Based on this result, VanderWeele and Ding (2017) introduced the following notion of E-value for measuring the evidence of causation with observational studies.
²In information theory, the mutual information
I(A, B) = ∫∫ p(a, b) log₂ {p(a, b)/(p(a)p(b))} da db
measures the dependence between two random variables A and B, where p(·) denotes the joint or marginal density of (A, B). The data processing inequality is a famous result: if Z ⊥⊥ Y | U, then I(Z, Y) ≤ I(Z, U) and I(Z, Y) ≤ I(U, Y). Lihua Lei and Bin Yu pointed out to me the connection between Cornfield's inequality and the data processing inequality.
Definition 17.1 (E-Value) With the observed conditional risk ratio rr^obs_{ZY|x}, define the E-Value as
rr^obs_{ZY|x} + √{rr^obs_{ZY|x}(rr^obs_{ZY|x} − 1)}.
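The E-value is trivial to compute; a one-line R function suffices.

## E-value from Definition 17.1
evalue = function(rr) rr + sqrt(rr * (rr - 1))
evalue(10.73)   # 20.95, matching Example 17.1 below
evalue(8.02)    # 15.52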
Example 17.1 Hammond and Horn (1958) used the U.S. population to study
the cigarette smoking and lung cancer relationship. Ignoring covariates, their
data can be represented by a 2 × 2 table:
Based on the data, they obtained an estimate of the risk ratio 10.73 with a 95% confidence interval [8.02, 14.36]. To explain away the point estimate, the E-value is
10.73 + √{10.73 × (10.73 − 1)} = 20.95;
to explain away the lower confidence limit, the E-value is
8.02 + √{8.02 × (8.02 − 1)} = 15.52.
[Figure 17.1: joint values of (RRZU, RRUY) needed to explain away the observed risk ratio, with the points (20.95, 20.95) and (15.52, 15.52) marked on the two boundary curves.]
Figure 17.1 shows the joint values of the two confounding measures to
explain away the point estimate and lower confidence limit of the risk ratio.
In particular, to explain away the point estimate, they must lie in the area above
the solid curve; to explain away the lower confidence limit, they must lie in
the area above the dashed curve.
17.4 Extensions
17.4.1 E-value and Bradford Hill’s criteria for causation
The E-value provides evidence for causation. But evidence is not a proof.
With a larger E-value, we need a stronger unmeasured confounder to explain
away the observed risk ratio; the evidence for causation is stronger. With a
smaller E-value, we need a weaker unmeasured confounder to explain away
the observed risk ratio; the evidence for causation is weaker. Coupled with the
discussion in Section 17.5.1, a larger observed risk ratio provides stronger evidence
for causation. This is closely related to Sir Bradford Hill’s first criterion for
causation: strength of the association (Bradford Hill, 1965). Theorem 17.1
provides a mathematical quantification of his heuristic argument.
In a famous paper, Bradford Hill (1965) proposed a set of nine criteria to
provide evidence for causation between a presumed cause and outcome. His
criteria are
1. strength;
2. consistency;
3. specificity;
4. temporality;
5. biological gradient;
6. plausibility;
7. coherence;
8. experiment;
9. analogy.
The E-value is a way to justify his first criterion. That is, stronger association often provides stronger evidence for causation because to explain away a stronger association, we need stronger confounding measures. We have discussed randomized experiments in Part II, which corroborates his eighth criterion. Due to the space limit, I omit the detailed discussion of his other criteria and encourage the readers to read Bradford Hill (1965). This paper was recently reprinted as Bradford Hill (2020) with insightful comments from many leading researchers in causal inference.
β1 = log [ {pr(Yi = 1 | Zi = 1, Xi = x)/pr(Yi = 0 | Zi = 1, Xi = x)} / {pr(Yi = 1 | Zi = 0, Xi = x)/pr(Yi = 0 | Zi = 0, Xi = x)} ].
Importantly, the logistic model assumes a common odds ratio across all values of the covariates. Moreover, when the outcome is rare in that pr(Yi = 1 | Zi = z, Xi = x) is small, the odds ratio approximates the risk ratio:
β1 ≈ log { pr(Yi = 1 | Zi = 1, Xi = x) / pr(Yi = 1 | Zi = 0, Xi = x) } = log rr^obs_{ZY|x}.
Therefore, based on the estimated logistic regression coefficient and the corresponding confidence limits, we can calculate the E-value immediately. This is the leading application of the E-value.
Example 17.2 The NCHS2003.txt contains the National Center for Health
Statistics birth certificate data, with the following binary indicator variables
useful for us:
PTbirth pre-term birth
preeclampsia pre-eclampsia3
ageabove35 an older mother with age ≥ 35 (the treatment)
somecollege college education
mar marital status
smoking smoking status
drinking drinking status
hispanic mother’s ethnicity
black mother’s ethnicity
nativeamerican mother’s ethnicity
asian mother’s ethnicity
This version of the data is from Valeri and Vanderweele (2014). This ex-
ample focuses on the outcome PTbirth and Problem 17.3. The following R code
computes the E-values after fitting a logistic regression. Based on the E-values,
we conclude that to explain away the point estimate, the maximum confound-
ing measure must be larger than 1.94, and to explain away the lower confidence
limit, the maximum confounding measure must be larger than 1.91. Although
these confounding measures are not as strong as those in Section 17.3, they
appear to be fairly large in epidemiologic studies.
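The original R code is not reproduced here; a sketch of how such a computation might look, using the variable names from the data description above (the exact model specification in the original analysis may differ), is:

## E-values after a logistic regression; a sketch, assuming NCHS2003.txt has a header row
nchs = read.table("NCHS2003.txt", header = TRUE)
logit.fit = glm(PTbirth ~ ageabove35 + somecollege + mar + smoking + drinking +
                  hispanic + black + nativeamerican + asian,
                family = binomial, data = nchs)
est = summary(logit.fit)$coef["ageabove35", 1:2]
rr = exp(c(est[1], est[1] - 1.96 * est[2]))   # point estimate and lower limit (rare outcome)
evalue = function(rr) rr + sqrt(rr * (rr - 1))
evalue(rr)   # should be close to the E-values reported above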
[Figure: the E-value as a function of the observed risk ratio RR, plotted over three ranges of RR.]
17.5.3 It works the best for a binary outcome and the risk
ratio
Theorem 17.1 works well for a binary outcome and the risk ratio. Ding and VanderWeele (2016) also proposed sensitivity analysis methods for other causal parameters, but they are not as elegant as the E-value for a binary outcome based on the risk ratio. The next chapter will propose a simple sensitivity analysis method for the average causal effect that includes several methods in Part III as special cases.
rr^obs_{ZY} / rr^true_{ZY} = {1 + (γ − 1) pr(U = 1 | Z = 1)} / {1 + (γ − 1) pr(U = 1 | Z = 0)}
assuming a common risk ratio of the treatment on the outcome within both U = 0 and U = 1:
rr^true_{ZY} = rr_{ZY|U=0} = rr_{ZY|U=1},
and also a common risk ratio of the confounder on the outcome within both Z = 0 and Z = 1:
γ = rr_{UY|Z=0} = rr_{UY|Z=1}.
This identity shows the collapsibility of the risk ratio. In epidemiology, the
risk ratio is a collapsible measure of association.
Remark: Schlesselman (1978)'s formula does not assume the conditional independence Z ⊥⊥ Y | U, but assumes homogeneity of the Z-Y and U-Y risk ratios. It is a classic formula for sensitivity analysis. It is an identity that is simple to implement with pre-specified sensitivity parameters.
rd^obs_{ZY} = rd_{ZU} × rd_{UY}    (17.1)
where rdobsZY is the observed risk difference of Z on Y , and rdZU and rdU Y
are the treatment-confounder and confounder-outcome risk differences, respec-
tively (recall the definition of the risk difference in Chapter 1.2.2).
Remark: Without loss of generality, assume that rd^obs_{ZY}, rd_{ZU}, rd_{UY} are all positive. Then (17.1) implies that
rd_{ZU} ≥ rd^obs_{ZY},   rd_{UY} ≥ rd^obs_{ZY},
and
max(rd_{ZU}, rd_{UY}) ≥ √(rd^obs_{ZY}).
These are the Cornfield inequalities for the risk difference with a binary
confounder. They show that for an unmeasured confounder to explain away
an observed risk difference rdobs
ZY , the treatment-confounder and confounder-
outcome risk differences must both be larger than rdobs ZY , and the maximum
of them must be larger than the square root of rdobsZY .
Cornfield et al. (1959) obtained (17.1) but did not appreciate its significance. Gastwirth et al. (1998) and Poole (2010) discussed the first Cornfield condition for the risk difference, and Ding and VanderWeele (2014) discussed the second.
Ding and VanderWeele (2014) also derived more general results without
assuming a binary U . Unfortunately, the results for a general U are weaker
than those above for a binary U , that is, the inequalities become looser with
more levels of U . This motivated Ding and VanderWeele (2016) to focus on the
Cornfield inequalities for the risk ratio, which do not deteriorate with more
levels of U .
Cornfield-type sensitivity analysis works the best for binary outcomes on the
risk ratio scale, conditioning on the observed covariates. Although Ding and
VanderWeele (2016) also proposed Cornfield-type sensitivity analysis methods
for the average causal effect, they are not general enough and are not con-
venient to apply. Below I give a more direct approach to sensitivity analysis
based on the conditional expectations of the potential outcomes. The idea
appeared in early work of Robins (1999) and Scharfstein et al. (1999). This
chapter is based on Lu and Ding (2023)’s recent formulation.
The approach is closely related to the idea of deriving worst-case bounds on the average potential outcomes. I will first review the simpler idea of bounds, and then discuss the approach to sensitivity analysis.
18.1 Introduction
Recall the canonical setup of an observational study with {Zi, Xi, Yi(1), Yi(0)}_{i=1}^n ∼ IID {Z, X, Y(1), Y(0)} and focus on the average causal effect
τ = E{Y (1) − Y (0)}.
It decomposes to
τ = [E(Y | Z = 1)pr(Z = 1) + E{Y (1) | Z = 0}pr(Z = 0)]
− [E{Y (0) | Z = 1}pr(Z = 1) + E(Y | Z = 0)pr(Z = 0)] .
So the fundamental difficulty is to estimate the counterfactual means
E{Y (1) | Z = 0}, E{Y (0) | Z = 1}.
There are in general two extreme strategies to estimate them.
We have discussed the first strategy in Part III, which relies on ignorability.
Assuming
E{Y (1) | Z = 1, X} = E{Y (1) | Z = 0, X},
E{Y (0) | Z = 1, X} = E{Y (0) | Z = 0, X},
TABLE 18.1: Science Table with bounded outcome [ℓ, u], where ℓ and u are two constants

Z    Y(1)       Y(0)        Lower Y(1)   Upper Y(1)   Lower Y(0)   Upper Y(0)
1    Y1(1)      ?           Y1(1)        Y1(1)        ℓ            u
⋮    ⋮          ⋮           ⋮            ⋮            ⋮            ⋮
1    Yn1(1)     ?           Yn1(1)       Yn1(1)       ℓ            u
0    ?          Yn1+1(0)    ℓ            u            Yn1+1(0)     Yn1+1(0)
⋮    ⋮          ⋮           ⋮            ⋮            ⋮            ⋮
0    ?          Yn(0)       ℓ            u            Yn(0)        Yn(0)
and, similarly,
The second strategy, in the next section, assumes nothing except that the outcomes are bounded between ℓ and u. This is natural for binary outcomes with ℓ = 0 and u = 1. With this assumption, the two counterfactual means are also bounded between ℓ and u, which implies worst-case bounds on τ. I will review this strategy below.
Combining these bounds, we can derive that the average causal effect τ =
E{Y (1)} − E{Y (0)} has lower bound
Define ε1(X) = E{Y(1) | Z = 1, X}/E{Y(1) | Z = 0, X} and ε0(X) = E{Y(0) | Z = 1, X}/E{Y(0) | Z = 0, X}, which are the sensitivity parameters. For simplicity, we can further assume that they are constants independent of X. In practice, we need to fix them or vary them over a pre-specified range. Recall that µ1(X) = E(Y | Z = 1, X) and µ0(X) = E(Y | Z = 0, X). We can identify the two counterfactual means and the average causal effect as follows.
and therefore
I leave the proof of Theorem 18.1 to Problem 18.1. With the fitted out-
come model, (18.1) and (18.2) motivate the following predictive and projective
estimators for τ :
( n n
)
X X
pred −1 −1
τ̂ = n Zi Yi + n (1 − Zi )µ̂1 (Xi )/ε1 (Xi )
i=1 i=1
( n n
)
X X
−1 −1
− n Zi µ̂0 (Xi )ε0 (Xi ) + n (1 − Zi )Yi ,
i=1 i=1
and
τ̂^proj = { n^{-1} Σ_{i=1}^n Zi µ̂1(Xi) + n^{-1} Σ_{i=1}^n (1 − Zi)µ̂1(Xi)/ε1(Xi) }
        − { n^{-1} Σ_{i=1}^n Zi µ̂0(Xi)ε0(Xi) + n^{-1} Σ_{i=1}^n (1 − Zi)µ̂0(Xi) }.
The terminologies “predictive” and “projective” are from the survey sampling
literature (Firth and Bennett, 1998; Ding and Li, 2018). The estimators τ̂ pred
and τ̂ proj differ slightly: the former uses the observed outcomes when available;
in contrast, the latter replaces the observed outcomes with the fitted values.
More interesting, we can also identify τ by an inverse probability weighting
formula.
Theorem 18.2 With known ε1(X) and ε0(X), we have
E{Y(1)} = E{ w1(X) Z Y / e(X) },   E{Y(0)} = E{ w0(X) (1 − Z) Y / (1 − e(X)) },
where
w1(X) = e(X) + {1 − e(X)}/ε1(X),   w0(X) = e(X)ε0(X) + 1 − e(X).
I leave the proof of Theorem 18.2 to Problem 18.2. Theorem 18.2 modi-
fies the classic inverse probability weighting formulas with two extra factors
w1 (X) and w0 (X) depending on both the propensity score and the sensitivity
parameters. With the fitted propensity score model, Theorem 18.2 motivates
the following estimators for τ :
τ̂^ht = n^{-1} Σ_{i=1}^n {ê(Xi)ε1(Xi) + 1 − ê(Xi)} Zi Yi / {ε1(Xi)ê(Xi)}
      − n^{-1} Σ_{i=1}^n {ê(Xi)ε0(Xi) + 1 − ê(Xi)} (1 − Zi) Yi / {1 − ê(Xi)}
and
τ̂^haj = [ Σ_{i=1}^n {ê(Xi)ε1(Xi) + 1 − ê(Xi)} Zi Yi / {ε1(Xi)ê(Xi)} ] / [ Σ_{i=1}^n Zi/ê(Xi) ]
       − [ Σ_{i=1}^n {ê(Xi)ε0(Xi) + 1 − ê(Xi)} (1 − Zi) Yi / {1 − ê(Xi)} ] / [ Σ_{i=1}^n (1 − Zi)/{1 − ê(Xi)} ].
More interestingly, with fitted propensity score and outcome models, the following estimator for τ is doubly robust:
τ̂^dr = τ̂^ipw − n^{-1} Σ_{i=1}^n {Zi − ê(Xi)} [ µ̂1(Xi)/{ê(Xi)ε1(Xi)} + µ̂0(Xi)ε0(Xi)/{1 − ê(Xi)} ].
That is, with known ε1 (Xi ) and ε0 (Xi ), the estimator τ̂ dr is consistent for τ if
either the propensity score model or the outcome model is correctly specified.
We can use the bootstrap to approximate the variance of the above estimators.
See Lu and Ding (2023) for technical details.
When ε1 (Xi ) = ε0 (Xi ) = 1, the above estimators reduce to the predic-
tive estimator, inverse probability weighting estimator, and the doubly robust
estimators introduced in Part III.
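As a sketch, the Horvitz–Thompson-type estimator above can be computed for a grid of sensitivity parameters as follows; pscore.fit denotes a fitted propensity score, and the ε's are taken to be constants for illustration.

## sensitivity analysis sketch: HT-type estimator of tau for constant eps1, eps0
sens.ht = function(z, y, pscore.fit, eps1 = 1, eps0 = 1) {
  w1 = (pscore.fit * eps1 + 1 - pscore.fit) / (eps1 * pscore.fit)
  w0 = (pscore.fit * eps0 + 1 - pscore.fit) / (1 - pscore.fit)
  mean(w1 * z * y) - mean(w0 * (1 - z) * y)
}
## e.g. sapply(c(1/2, 1/1.5, 1, 1.5, 2), function(e) sens.ht(z, y, pscore.fit, e, e))
## with the bootstrap for standard errors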
18.4 Example
We revisit Example 10.3. With ε1(X) = ε0(X) ∈ {1/2, 1/1.7, 1/1.5, 1/1.3, 1, 1.3, 1.5, 1.7, 2}, we re-estimate the average causal effect.
The signs of the estimates are not sensitive to sensitivity parameters larger than 1, but they are quite sensitive to sensitivity parameters smaller than 1.
When the participants of the meal plan tend to have higher BMI, the average
causal effect of the meal plan on BMI is negative. However, this conclusion
can be quite sensitive if the participants of the meal plan tend to have lower
BMI.
18.3 Sensitivity analysis for the average causal effect on the treated units τT
This problem extends Chapter 13 to allow for unmeasured confounding for
estimating
τT = E{Y (1) − Y (0) | Z = 1} = E(Y | Z = 1) − E{Y (0) | Z = 1}.
We can easily estimate E(Y | Z = 1) by the sample moment. The only coun-
terfactual term is E{Y (0) | Z = 1}. Therefore, we only need the sensitivity
parameter ε0 (X). We have the following two identification formulas with a
known ε0 (X).
Theorem 18.3 With known ε0(X), we have
E{Y(0) | Z = 1} = E{Zµ0(X)ε0(X)}/e = E{ e(X)ε0(X) (1 − Z)Y/(1 − e(X)) }/e,
where e = pr(Z = 1).
Prove Theorem 18.3.
Remark: Theorem 18.3 motivates using τ̂_t^* = µ̂_{t1} − µ̂_{t0}^* to estimate τ_t, where µ̂_{t1} = Σ_{i=1}^n Zi Yi / Σ_{i=1}^n Zi and
µ̂_{t0}^reg = n_1^{-1} Σ_{i=1}^n Zi ε0(Xi) µ̂0(Xi),
µ̂_{t0}^ht = n_1^{-1} Σ_{i=1}^n ε0(Xi) ô(Xi) (1 − Zi) Yi,
µ̂_{t0}^haj = Σ_{i=1}^n ε0(Xi) ô(Xi) (1 − Zi) Yi / Σ_{i=1}^n ô(Xi) (1 − Zi),
with ô(Xi) = ê(Xi)/{1 − ê(Xi)} being the estimated conditional odds of the treatment. Moreover, we can construct the doubly robust estimator τ̂_t^dr = µ̂_{t1} − µ̂_{t0}^dr for τ_t, where
µ̂_{t0}^dr = µ̂_{t0}^ht − n_1^{-1} Σ_{i=1}^n ε0(Xi) {ê(Xi) − Zi}/{1 − ê(Xi)} µ̂0(Xi).
Lu and Ding (2023) provide more details and also propose a doubly robust
estimator for τT .
18.4 R code
Implement the estimators in Problem 18.3.
Let Si = {Yi1 (1), Yi1 (0), Yi2 (1), Yi2 (0)} denote the set of all potential outcomes
within pair i. Conditioning on the event that Zi1 + Zi2 = 1, we have
Under ignorability, eij is only a function of Xi , and therefore, ei1 = ei2 and
πi1 = 1/2. Thus the treatment assignment mechanism conditioning on covari-
ates and potential outcomes is equivalent to that from an MPE with equal
treatment and control probabilities. This is the strategy for analyzing matched observational studies that we discussed in Section 15.1.
In general, eij is also a function of the unobserved potential outcomes, and
it can range from 0 to 1. Rosenbaum (1987b)’s model for sensitivity analysis
imposes bounds on the odds ratio oi1 /oi2 .
Under Assumption 19.1, we have a biased MPE with unequal and varying treatment and control probabilities across pairs. When Γ = 1, we have πi1 = 1/2 and thus a standard MPE. Therefore, Γ > 1 measures the deviation from the ideal MPE due to the omitted variables in matching.
where qi ≥ 0 is a function of (|τ̂1|, . . . , |τ̂n|). Special cases include the sign statistic, the pair t statistic (up to a constant shift), and the Wilcoxon signed rank statistic:
T = Σ_{i=1}^n Si,   T = Σ_{i=1}^n Si|τ̂i|,   T = Σ_{i=1}^n Si Ri,
Here, the FRT with T has the largest p-value under the "worst case" distribution, under which the Si's are iid Bernoulli(Γ/(1 + Γ)). The corresponding distribution has mean
\[
E_\Gamma(T) = \frac{\Gamma}{1+\Gamma}\sum_{i=1}^n q_i
\]
and variance
\[
\mathrm{var}_\Gamma(T) = \frac{\Gamma}{(1+\Gamma)^2}\sum_{i=1}^n q_i^2,
\]
with a Normal approximation
\[
\frac{T - \frac{\Gamma}{1+\Gamma}\sum_{i=1}^n q_i}{\sqrt{\frac{\Gamma}{(1+\Gamma)^2}\sum_{i=1}^n q_i^2}}
\stackrel{d}{\longrightarrow} \mathrm{N}(0,1).
\]
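The worst-case p-value is easy to compute from these formulas. Below is a minimal sketch, assuming tauhat is the vector of within-pair differences τ̂i and the statistic is the Wilcoxon signed rank statistic; it returns the one-sided p-value from the Normal approximation above.

worst.case.pvalue = function(tauhat, Gamma) {
  q    = rank(abs(tauhat))                 # q_i: rank of |tauhat_i|
  Tobs = sum((tauhat > 0) * q)             # observed statistic
  m    = Gamma / (1 + Gamma) * sum(q)      # worst-case mean
  v    = Gamma / (1 + Gamma)^2 * sum(q^2)  # worst-case variance
  1 - pnorm((Tobs - m) / sqrt(v))          # Normal approximation
}
## e.g. sapply(c(1, 1.1, 1.3), function(g) worst.case.pvalue(tauhat, g))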
FIGURE 19.1: Distributions of T = Σ_{i=1}^n Si|τ̂i| with Si iid Bernoulli(Γ/(1 + Γ)), based on the LaLonde data. The worst-case p-values are 0.002 for Γ = 1, 0.011 for Γ = 1.1, and 0.084 for Γ = 1.3.
where
\[
C = \frac{(e-\eta)(1-e-\eta)}{e^2(1-e)^2\,\eta(1-\eta)}
\]
is a positive constant depending only on (e, η), and λ1 and λ0 are the maximum
eigenvalues of the covariance matrices cov(X | Z = 1) and cov(X | Z = 0),
respectively.
then it suffices to control for r(X), a lower dimensional summary of the original
covariates. Due to the dimension reduction, the strict overlap condition on
r(X) can be much weaker than the strict overlap condition on X. This is
conceptually straightforward, but the corresponding theory and methods are
missing.
In the sharp regression discontinuity design, the treatment is a deterministic function of the running variable X:
Z = 1(X ≥ x0),
which implies Z ⫫ {Y(1), Y(0)} | X.
Example 20.2 Bor et al. (2014) used regression discontinuity to study the effect of the timing of starting antiretroviral therapy for HIV patients on their mortality, where the treatment is determined by whether the patients' CD4 counts were below 200 cells/µL.¹
Example 20.3 Carpenter and Dobkin (2009) studied the effect of alcohol consumption on mortality, leveraging the minimum legal drinking age as a source of discontinuity in alcohol consumption.
1 CD4 cells are white blood cells that fight infection.
FIGURE 20.1: A graph from Thistlethwaite and Campbell (1960) with minor
modifications of the unclear text in the original paper
They derived mortality data from the National Center for Health Statistics, including the decedent's date of birth and date of death, and computed the age profile of deaths per 100,000 person-years, with outcomes measured by the following nine variables:
all: all deaths, the sum of internal and external
internal: deaths due to internal causes
external: deaths due to external causes, the sum of the rest
homicide: homicides
suicide: suicides
mva: motor vehicle accidents
alcohol: deaths with a mention of alcohol
drugs: deaths with a mention of drug use
externalother: deaths due to other external causes
Figure 20.2 plots the number of deaths per 100,000 person years for nine
measures based on the data used by Angrist and Pischke (2014). From the
jumps at age 21, it seems obvious that there is an increase of mortality at age
21, primarily due to motor vehicle accidents. I leave the formal analysis as
Problem 20.3.
Theorem 20.2 Assume that E{Y(1) | X = x} is continuous from the right at x0 and E{Y(0) | X = x} is continuous from the left at x0. Then the local average treatment effect at X = x0 is identified by
\[
\tau(x_0) = E\{Y(1)-Y(0)\mid X=x_0\}
= \lim_{x\downarrow x_0} E(Y\mid X=x) - \lim_{x\uparrow x_0} E(Y\mid X=x).
\]
Since the right-hand side of the above equation only involves observables, the parameter τ(x0) is nonparametrically identified. However, the form of the identification formula is totally different from what we derived before. In particular, it involves the limits of two conditional expectation functions.
Yi ∼ {1, Zi, Ri, Li},   (20.6)
where Ri = max(Xi − x0, 0) and Li = min(Xi − x0, 0) indicate the right and left parts of Xi − x0, respectively. I leave the algebraic details to Problem 20.1.
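As a minimal sketch, the fit (20.6) is a single call to lm; the setup below assumes y and x are the outcome and running variable and x0 is the cutoff. The coefficient of z estimates the jump at the cutoff under the global linear specification.

rd.global = function(y, x, x0) {
  z = as.numeric(x >= x0)
  r = pmax(x - x0, 0)        # right part of x - x0
  l = pmin(x - x0, 0)        # left part of x - x0
  coef(lm(y ~ z + r + l))["z"]
}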
However, this approach may be sensitive to violations of the linear model. Theory suggests running the regression using only the local observations near the cutoff point². However, the rule for choosing the "local points" is quite involved. Fortunately, the rdrobust function in the rdrobust package in R implements various choices of the "local points." Since choosing the "local points" is the key issue in regression discontinuity, it seems more sensible to report estimates and confidence intervals based on various choices of the "local points."
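A minimal sketch of this reporting strategy, assuming y, x, x0 as before, restricts the fit (20.6) to a window |X − x0| < h and repeats it over several window sizes; hccm is from the car package, which the later analyses in this book also use.

library(car)
rd.local = function(y, x, x0, h) {
  keep = abs(x - x0) < h
  z = as.numeric(x >= x0)[keep]
  r = pmax(x - x0, 0)[keep]
  l = pmin(x - x0, 0)[keep]
  fit = lm(y[keep] ~ z + r + l)
  est = unname(coef(fit)["z"])
  se  = sqrt(hccm(fit, type = "hc0")["z", "z"])
  c(est = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
}
## report over a range of windows, e.g.
## t(sapply(c(0.05, 0.1, 0.2, 0.4), function(h) rd.local(y, x, x0, h)))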
20.2.4 An example
Lee (2008) gave a famous example of using regression discontinuity to study
the incumbency advantage in the U.S. House. He wrote that “incumbents are,
by definition, those politicians who were successful in the previous election.
If what makes them successful is somewhat persistent over time, they should
be expected to be somewhat more successful when running for re-election.”
Therefore, this is a fundamentally challenging causal inference problem. The
regression discontinuity is a clever study design to study this problem.
The running variable is the lagged vote in the previous election centered at 0, and the outcome is the vote in the current election, with units being the congressional districts. The treatment is the binary indicator for being the current incumbent party in a district, determined by the lagged vote. Figure 20.4 shows the raw data.
The rdrobust function gives three sets of the point estimate and confidence
intervals. They all suggest positive incumbency advantage.
> library(rdrobust)
> library(rddtools)
> data(house)
> RDDest = rdrobust(house$y, house$x)
[1] "Mass points detected in the running variable."
> cbind(RDDest$coef, RDDest$ci)
                    Coeff   CI Lower   CI Upper
Conventional   0.06372533 0.04224798 0.08520269
Bias-Corrected 0.05937028 0.03789292 0.08084763
Robust         0.05937028 0.03481238 0.08392818
FIGURE 20.4: The raw data in Lee (2008): the vote in the current election (y) against the lagged vote centered at 0.
Figure 20.5 shows the point estimates and the confidence intervals based on OLS using only the local observations with |X| < h, for a range of window sizes h. While the point estimates and the confidence intervals are sensitive to the choice of h, the qualitative result remains the same as above.
The continuity assumptions on the conditional expectations of the potential outcomes can be violated in practice. For instance, if other determinants of mortality also change discontinuously at the age of 21, then the jumps in Figure 20.2 may not be due to the change in drinking behavior induced by the legal drinking age. However, it is hard to check the violation of the continuity condition empirically.
McCrary (2008) proposed an indirect test for the validity of the regression
discontinuity. He suggested checking the density of the running variable at
the cutoff point. The discontinuity in the density of the running variable at
the cutoff point may suggest that some units were able to manipulate their
treatment status perfectly.
Instrumental variables

21 An Experimental Perspective
where “a” is for “always taker,” “c” is for “complier,” “d” is for “defier,”
and “n” is for “never taker.” Because we cannot observe Di (1) and Di (0)
simultaneously, Ui is a latent variable for the compliance behavior of unit i.
Based on U , we can use the law of total probability to decompose the
average causal effect on Y into four terms:
τY = E{Y (1) − Y (0) | U = a}pr(U = a)
+E{Y (1) − Y (0) | U = c}pr(U = c)
+E{Y (1) − Y (0) | U = d}pr(U = d)
+E{Y (1) − Y (0) | U = n}pr(U = n). (21.1)
Therefore, τY is a weighted average of four latent subgroup effects. We will
look into more details of the latent groups below.
Assumption 21.2 below restricts the third term in (21.1) to be zero.
But Assumption 21.2 is much stronger than the inequality above. The former
restricts Di (1) and Di (0) at the individual level and the latter restricts them
only on average. Nevertheless, when this testable implication holds, we cannot
use the observed data to refute Assumption 21.2.
Assumption 21.3 below restricts the first and last terms in (21.1) to be zero, based on the idea that the treatment assignment can affect the outcome only through the treatment received.
Assumption 21.3 (exclusion restriction) Yi (1) = Yi (0) for always takers
with Ui = a and never takers with Ui = n.
Assumption 21.3 requires that the treatment assignment affects the outcome only if it affects the treatment received. In a double-blind clinical trial¹, it is biologically plausible because the outcome only depends on the actual
it is biologically plausible because the outcome only depends on the actual
treatment received. That is, if the treatment assignment does not change the
treatment received, it does not change the outcome either. It can be violated
if the treatment assignment has direct effects on the outcome not through
the treatment received. For example, some randomized controlled trials are
not double blinded, and the treatment assignment can have some unknown
pathways to the outcome.
Under Assumptions 21.2 and 21.3, the decomposition (21.1) reduces to the second term:
τY = E{Y(1) − Y(0) | U = c} pr(U = c).   (21.2)
Similarly, we can decompose the average causal effect on D into four terms, which, under Assumption 21.2, reduce to
πc = pr(U = c) = E{D(1) − D(0)} = τD.   (21.3)
placebo effects, patients' expectations, etc. In double-blind trials, neither the doctors nor the patients know the treatment; in single-blind trials, the patients do not know the treatment but the doctors do. Sometimes it is impossible to conduct double- or even single-blind trials; those trials are called open trials.
It is an interesting fact that the proportion of the compliers πc equals the
average causal effect of the treatment assigned on D, an identifiable quantity
under complete randomization. Although we do not know all the compliers
based on the observed data, we can identify their proportion in the whole
population based on (21.3). Combining (21.2) and (21.3), we have the following
result.
Following Imbens and Angrist (1994) and Angrist et al. (1996), we define a new causal effect,
τc = E{Y(1) − Y(0) | U = c},
called the "complier average causal effect (CACE)" or the "local average treatment effect (LATE)". Under Assumptions 21.2 and 21.3, it equals τc = τY/τD, which has the alternative form
τc = {E(Y | Z = 1) − E(Y | Z = 0)} / {E(D | Z = 1) − E(D | Z = 0)}.
21.2.2 Estimation
Based on Corollary 21.1, we can estimate τc by the simple ratio
τ̂c = τ̂Y/τ̂D,
which is called the Wald estimator (Wald, 1940) or the IV estimator. In the above discussion, Z acts as the IV for D.
We can obtain the variance estimator based on the following heuristics
(see Example A1.3):
and
\[
\hat\tau_Y/\hat\tau_D \pm z_{1-\alpha/2}\sqrt{\hat V_Y}\big/\hat\tau_D
= \big(\hat\tau_Y \pm z_{1-\alpha/2}\sqrt{\hat V_Y}\big)\big/\hat\tau_D.
\]
These confidence intervals give the same qualitative conclusions since they
will both cover zero or not. In some sense, the IV analysis provides the same
qualitative information as the ITT analysis of Y although it involves more
complicated procedures.
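A minimal R sketch of the Wald estimator with a delta-method standard error is below, assuming z, d, y are the vectors of assignments, treatments received, and outcomes in a completely randomized encouragement design; the variance formula is one standard delta-method choice, not necessarily the exact one used elsewhere in this book.

wald.iv = function(z, d, y) {
  n1 = sum(z); n0 = sum(1 - z)
  tauY = mean(y[z == 1]) - mean(y[z == 0])
  tauD = mean(d[z == 1]) - mean(d[z == 0])
  est  = tauY / tauD
  ## delta method: treat y - est*d as the outcome and use a Neyman-type variance
  a1 = y[z == 1] - est * d[z == 1]
  a0 = y[z == 0] - est * d[z == 0]
  se = sqrt(var(a1) / n1 + var(a0) / n0) / abs(tauD)
  c(est = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
}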
21.3 Covariates
21.3.1 Covariate adjustment in complete randomization
We now consider completely randomized experiments with covariates, and
assume Z ⫫ {D(1), D(0), Y(1), Y(0), X}. With covariates X, we can obtain Lin (2013)'s estimators τ̂D,L and τ̂Y,L for both D and Y, resulting in τ̂c,L = τ̂Y,L/τ̂D,L. Recall that
\[
\hat\tau_{D,\mathrm{L}} = \big\{\hat{\bar D}(1) - \hat\beta_{D1}^{\top}\hat{\bar X}(1)\big\} - \big\{\hat{\bar D}(0) - \hat\beta_{D0}^{\top}\hat{\bar X}(0)\big\},
\]
\[
\hat\tau_{Y,\mathrm{L}} = \big\{\hat{\bar Y}(1) - \hat\beta_{Y1}^{\top}\hat{\bar X}(1)\big\} - \big\{\hat{\bar Y}(0) - \hat\beta_{Y0}^{\top}\hat{\bar X}(0)\big\},
\]
where β̂D1 and β̂Y 1 are the coefficients of X in the OLS fits of D and Y in
the treated group, and β̂D0 and β̂Y 0 are the coefficients of X in the OLS fits
of D and Y in the control group. We can approximate the standard error of
τ̂c,L based on the following heuristics (again see Example A1.3):
If the IV Z is randomized only conditionally on the covariates X, then we must adjust for covariates to avoid bias. The analysis is also straightforward since we have already discussed many estimators in Part III for estimating the effects of Z on D and Y, respectively. We can just use them in the ratio formula τ̂c = τ̂Y/τ̂D and use the bootstrap to approximate the asymptotic variance.
21.4 Weak IV
Even if τD > 0, there is a positive probability that τ̂D equals zero, so the variance of τ̂c is infinite. The variance from the Normal approximation discussed before is not the variance of τ̂c but rather the variance of its asymptotic distribution.
This is a subtle technical point. When τD is close to 0, which is referred to
as the weak IV case, the ratio estimator τ̂c = τ̂Y /τ̂D has poor finite-sample
properties. Under this scenario, τ̂c has finite sample bias and non-Normal
asymptotic distribution, and the corresponding Wald-type confidence intervals
have poor coverage properties2 . In the simple case with a binary outcome Y ,
we know that τY must be bounded between −1 and 1, but there is no guarantee
that τ̂c is bounded between −1 and 1. How do we deal with a weak IV?
From a testing perspective, there is an easy solution. Because τc = τY/τD, the following two null hypotheses are equivalent:
H0: τc = 0 ⇐⇒ H0′: τY = 0.
Therefore, we simply test H0′, i.e., that the average causal effect of Z on Y is zero. This echoes our discussion in Section 21.2.2.
From an estimation perspective, we can focus on the confidence inter-
val although the point estimator has poor finite-sample properties. Because
τc = τY /τD , this is similar to the classical Fieller–Creasy problem in statis-
tics. Below we discuss a strategy for constructing a confidence interval for τc, motivated by Fieller (1954); see Section A1.4.2. Given the true value τc, we have
τY − τc τD = 0.
² The theory often assumes that τD has the order n^{−1/2}. Under this regime, the proportion of compliers goes to 0 as n goes to infinity. The IV method can only identify a subgroup average causal effect with the proportion shrinking to 0. This is a contrived regime for theoretical analysis, and it is hard to justify this assumption in practice. The following discussion does not assume it.
Define the auxiliary outcome Ai(b) = Yi − bDi for a candidate value b, and let τA(b) denote its average causal effect. Since τY − τcτD = 0, testing τc = b is equivalent to testing
H0(b): τA(b) = 0.
Let τ̂A (b) be a generic estimator for τA(b) with the associated variance
estimator V̂A (b). In the CRE without covariates, τ̂A (b) is the difference in
means of the outcome Ai (b) and V̂A (b) is the Neyman-type variance estimator.
In the CRE with covariates, τ̂A (b) is Lin (2013)’s estimator for the outcome
Ai (b) and V̂A (b) is the EHW variance estimator in the associated OLS fit
of Yi − bDi on (Zi , Xi , Zi Xi ). In unconfounded observational studies, we can
obtain the estimator for the average causal effect on Ai (b) and the associated
variance estimator based on many existing strategies in Part III.
Based on τ̂A(b) and V̂A(b), we can construct a Wald-type test for H0(b). Inverting the tests, we can construct the following confidence set for τc:
\[
\Big\{ b : \frac{\hat\tau_A^2(b)}{\hat V_A(b)} \le z_\alpha^2 \Big\}.
\]
In the CRE without covariates, it reduces to the set of b values satisfying
\[
(\hat\tau_Y - b\hat\tau_D)^2 \le z_\alpha^2 \Big[ n_1^{-1}\{\hat S_Y^2(1) + b^2\hat S_D^2(1) - 2b\hat S_{YD}(1)\} + n_0^{-1}\{\hat S_Y^2(0) + b^2\hat S_D^2(0) - 2b\hat S_{YD}(0)\} \Big].
\]
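A minimal sketch of this test-inversion confidence set in the CRE without covariates is below, assuming z, d, y as before; the grid of candidate values b is illustrative and should be widened if the resulting interval hits its endpoints.

far.ci = function(z, d, y, bgrid = seq(-2, 2, by = 0.01), alpha = 0.05) {
  accept = sapply(bgrid, function(b) {
    a  = y - b * d                        # auxiliary outcome A_i(b)
    a1 = a[z == 1]; a0 = a[z == 0]
    tstat = (mean(a1) - mean(a0)) /
      sqrt(var(a1) / length(a1) + var(a0) / length(a0))
    abs(tstat) <= qnorm(1 - alpha / 2)
  })
  range(bgrid[accept])                    # assumes the accepted set is an interval
}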
21.5 Application
The mediation package contains a dataset jobs from Job Search Intervention
Study (JOBS II), which was a randomized field experiment that investigates
the efficacy of a job training intervention on unemployed workers. The variable
treat is the indicator for whether a participant was randomly selected for the
JOBS II training program, and the variable comply is the indicator for whether
a participant actually participated in the JOBS II program. An outcome of
interest is job_seek for measuring the level of job-search self-efficacy with
values from 1 to 5. A few standard covariates are sex, age, marital, nonwhite,
educ, and income.
Without using covariates, the point estimate and the confidence intervals based on the delta method and the bootstrap are
> est
[1] 0.1087904
> c(est - 1.96*dse, est + 1.96*dse)
[1] -0.05002163 0.26760235
> c(est - 1.96*bse, est + 1.96*bse)
[1] -0.04657384 0.26415455
Adjusting for covariates, they become
> est
[1] 0.1176332
> c(est - 1.96*dse, est + 1.96*dse)
[1] -0.03638421 0.27165070
> c(est - 1.96*bse, est + 1.96*bse)
[1] -0.03926737 0.27453386
The confidence interval based on inverting tests is
> ARCI
[1] -0.050 0.267
without using covariates; adjusting for covariates, it is
> ARCI
[1] -0.046 0.281
Based on the causal graph below, Assumption 21.4 rules out the direct arrow from Z to Y. In such a case, Z is an IV for D.
[Causal graph: Z → D → Y, with the unmeasured U pointing to both D and Y and no arrow from Z to Y.]
Under these assumptions, Theorem 21.2 shows that the complier average causal effect of the treatment received,
τc = E{Y(d = 1) − Y(d = 0) | U = c},
satisfies τc = τY/τD.
The proof is almost identical to the proof of Theorem 21.1 with modifications of the notation. I leave it as Problem 21.2. With the notation Yi(d), it is more convenient to interpret τc as the average causal effect of the treatment received on the outcome for compliers.
21.2 Proof of the main theorem of Imbens and Angrist (1994) and Angrist
et al. (1996)
Prove Theorem 21.2.
where
\[
w_j = \frac{\mathrm{pr}\{D(1)\ge j > D(0)\}}{\sum_{j'=1}^J \mathrm{pr}\{D(1)\ge j' > D(0)\}}.
\]
to obtain
\[
D(1) - D(0) = \sum_{j=0}^J j\,[1\{D(1)=j\} - 1\{D(0)=j\}]
\]
and
\[
Y(D(1)) - Y(D(0)) = \sum_{j=0}^J Y(j)\,[1\{D(1)=j\} - 1\{D(0)=j\}].
\]
Then use the following Abel's lemma, also called summation by parts:
\[
\sum_{j=0}^J f_j(g_{j+1}-g_j) = f_J g_{J+1} - f_0 g_0 - \sum_{j=1}^J g_j(f_j - f_{j-1}).
\]
21.5 Data analysis: a flu shot encouragement design (McDonald et al., 1992)
The dataset in fludata.txt is from a randomized encouragement design of
McDonald et al. (1992), which was also re-analyzed by Hirano et al. (2000).
It contains the following variables:
assign: binary encouragement to receive the flu shot
receive: binary indicator for receiving the flu shot
outcome: binary outcome for flu-related hospitalization
age: age of the patient
sex: sex of the patient
race: race of the patient
copd, dm, heartd, renal, liverd: various disease background covariates
Analyze the data with and without adjusting for the covariates.
22 Disentangle Mixture Distributions and Instrumental Variable Inequalities

TABLE 22.1: Observed groups and latent groups under Assumption 21.2
Z = 1, D = 1: D(1) = 1, U = c or a
Z = 1, D = 0: D(1) = 0, U = n
Z = 0, D = 1: D(0) = 1, U = a
Z = 0, D = 0: D(0) = 0, U = c or n
πn = pr(D = 0 | Z = 1),
πa = pr(D = 1 | Z = 0),
πc = E(D | Z = 1) − E(D | Z = 0).
By Table 22.1 and the randomization of Z,
pr(D = 0 | Z = 1) = pr(U = n | Z = 1) = pr(U = n) = πn,
pr(D = 1 | Z = 0) = pr(U = a | Z = 0) = pr(U = a) = πa,
and thus
πc = pr(U = c) = 1 − πn − πa = 1 − pr(D = 0 | Z = 1) − pr(D = 1 | Z = 0) = E(D | Z = 1) − E(D | Z = 0) = τD,
which is coherent with our discussion before. Although we do not know indi-
vidual latent compliance types for all units, we can identify the proportions
of never takers, always takers, and compliers.
Part II: We then identify the means of the potential outcomes within latent
compliance types. Under Assumption 21.3,
□
Based on the formulas of µ1c and µ0c in Theorem 22.1, we have
E(DY | Z = 1) − E(DY | Z = 0) ≥ 0,
E(DY | Z = 1) − E(DY | Z = 0) ≤ E(D | Z = 1) − E(D | Z = 0),
E{(1 − D)Y | Z = 0} − E{(1 − D)Y | Z = 1} ≥ 0,
E{(1 − D)Y | Z = 0} − E{(1 − D)Y | Z = 1} ≤ E(D | Z = 1) − E(D | Z = 0).
22.3 Examples
For a binary outcome, we can estimate all the parameters by the method of moments as below.
## function for binary data (Z, D, Y)
## n_{zdy}'s are the counts from the 2x2x2 table
IVbinary = function(n111, n110, n101, n100,
                    n011, n010, n001, n000){
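  ## The remaining body is a minimal sketch (assumed, not the original code):
  ## it computes the proportions and means by the method of moments, where
  ## n_{zdy} counts units with Z = z, D = d, Y = y.
  n1 = n111 + n110 + n101 + n100           # number of units with Z = 1
  n0 = n011 + n010 + n001 + n000           # number of units with Z = 0
  pi_n = (n101 + n100)/n1                  # pr(D = 0 | Z = 1)
  pi_a = (n011 + n010)/n0                  # pr(D = 1 | Z = 0)
  pi_c = (n111 + n110)/n1 - (n011 + n010)/n0
  mu_n  = n101/(n101 + n100)               # E(Y | Z = 1, D = 0)
  mu_a  = n011/(n011 + n010)               # E(Y | Z = 0, D = 1)
  mu_c1 = (n111/n1 - n011/n0)/pi_c         # complier mean under treatment
  mu_c0 = (n001/n0 - n101/n1)/pi_c         # complier mean under control
  list(pi_c = pi_c, pi_n = pi_n, pi_a = pi_a,
       mu_n = mu_n, mu_a = mu_a,
       mu_c1 = mu_c1, mu_c0 = mu_c0,
       tau_c = mu_c1 - mu_c0)
}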
              Z = 1            Z = 0
          D = 1  D = 0     D = 1  D = 0
Y = 1       107     68        24    131
Y = 0        42     42         8     79

              Z = 1            Z = 0
          D = 1  D = 0     D = 1  D = 0
Y = 1        31     85        30     99
Y = 0       424    944       237   1041
$mu_c1
[1] 0.7086064

$mu_c0
[1] 0.6292042

$mu_c0
[1] 0.1200094
\[
\mathrm{rr}_c = \frac{\mathrm{pr}\{Y(1)=1\mid U=c\}}{\mathrm{pr}\{Y(0)=1\mid U=c\}}.
\]
\[
\mathrm{rr}_c = \frac{E(DY\mid Z=1) - E(DY\mid Z=0)}{E\{(D-1)Y\mid Z=1\} - E\{(D-1)Y\mid Z=0\}}.
\]
for both y = 0, 1.
for all y.
Remark: Imbens and Rubin (1997) and Kitagawa (2015) discussed similar results. For instance, we can test the first inequality based on an analog of the Kolmogorov–Smirnov statistic:
\[
\mathrm{KS}_1 = \max_y \left\{ \frac{\sum_{i=1}^n Z_i D_i 1(Y_i\le y)}{\sum_{i=1}^n Z_i D_i} - \frac{\sum_{i=1}^n (1-Z_i) D_i 1(Y_i\le y)}{\sum_{i=1}^n (1-Z_i) D_i} \right\}.
\]
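Computing KS1 is straightforward; a minimal sketch, assuming z, d, y are the observed vectors, is:

KS1 = function(z, d, y) {
  ys = sort(unique(y))
  F1 = sapply(ys, function(t) sum(z * d * (y <= t)) / sum(z * d))
  F0 = sapply(ys, function(t) sum((1 - z) * d * (y <= t)) / sum((1 - z) * d))
  max(F1 - F0)
}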
Show that
\[
\tau_{\mathrm{at}} = \frac{\pi_a\mu_a + \mathrm{pr}(Z=1)\,\pi_c\mu_{1c}}{\mathrm{pr}(D=1)} - \frac{\pi_n\mu_n + \mathrm{pr}(Z=0)\,\pi_c\mu_{0c}}{\mathrm{pr}(D=0)}.
\]
2. The per-protocol analysis compares the units who comply with the treatment assigned in the treatment and control groups, yielding τpp = E(Y | Z = 1, D = 1) − E(Y | Z = 0, D = 0). Show that
\[
\tau_{\mathrm{pp}} = \frac{\pi_a\mu_a + \pi_c\mu_{1c}}{\pi_a + \pi_c} - \frac{\pi_n\mu_n + \pi_c\mu_{0c}}{\pi_n + \pi_c}.
\]
3. We may also want to compare the outcomes among units receiving the treatment and control, conditioning on their treatment assignment, yielding

δ = E{Y(d = 1) − Y(d = 0)}.
They satisfy
\[
\delta = \sum_{u=a,n,c} \pi_u (m_{1u} - m_{0u}).
\]
Section 22.1 identifies πa , πn , πc , m1a = µ1a , m0n = µ0n , m1c = µ1c and
m0c = µ0c . But the data do not contain any information about m0a and m1n .
Therefore, we cannot identify δ. With a bounded outcome, we can bound δ.
Show the following result:
\[
\underline{\delta} = \delta' - \bar{y}\,\mathrm{pr}(D=1\mid Z=0) + \underline{y}\,\mathrm{pr}(D=0\mid Z=1)
\]
and
\[
\bar{\delta} = \delta' - \underline{y}\,\mathrm{pr}(D=1\mid Z=0) + \bar{y}\,\mathrm{pr}(D=0\mid Z=1),
\]
with δ′ = E(DY | Z = 1) − E(Y − DY | Z = 0), where the outcome is bounded between the lower limit \underline{y} and the upper limit \bar{y}, so that \underline{\delta} ≤ δ ≤ \bar{\delta}.
Remark: In the special case with a binary outcome, the bounds simplify to
\[
\underline{\delta} = E(DY\mid Z=1) - E(D + Y - DY\mid Z=0)
\]
and
\[
\bar{\delta} = E(DY + 1 - D\mid Z=1) - E(Y - DY\mid Z=0).
\]
Zi = 0 =⇒ Di = 0 (i = 1, . . . , n).
              Z = 1            Z = 0
          D = 1  D = 0     D = 1  D = 0
Y = 1     9663   2385         0  11514
Y = 0       12     34         0     74
Re-analyze it.
Remark: Bloom (1984) first discussed one-sided noncompliance and pro-
posed the IV estimator τ̂c = τ̂Y /τ̂D . His notation is different from this chapter.
Example 23.2 Hearst et al. (1986) reported that men with low lottery num-
ber in the Vietnam Era draft lottery had higher mortality rates afterwards.
They attributed this to the negative effect of military service. Angrist (1990)
further reported that men with low lottery number in the Vietnam Era draft
lottery had lower subsequent earnings. He attributed this to the negative effect
of military service. These explanations are plausible because the lottery num-
bers were randomly generated, men with low lottery number were more likely
to have military service, and the lottery numbers were unlikely to affect the
subsequent mortality or earnings. That is, Figure 23.1 is plausible. Angrist
et al. (1996) reanalyzed the data using the IV framework. Here, the lottery
number is the IV, military service is the treatment, and mortality or earnings
is the outcome.
Example 23.3 Angrist and Krueger (1991) studied the return to years of schooling on earnings, using the quarter of birth as an IV. This IV is plausible because of the pseudo-randomization of the quarter of birth. It affected the years of schooling because (1) most states required students to enter school in the calendar year in which they turned six, and (2) compulsory schooling laws typically required students to remain in school until their sixteenth birthday. More importantly, it is plausible that the quarter of birth did not affect earnings directly.
Example 23.4 Angrist and Evans (1998) studied the effect of family size on
mother’s employment and work, using the sibling sex composition as an IV.
This IV is plausible because of the pseudo randomization of the sibling sex
composition. Moreover, parents in the US with two children of the same sex
are more likely to have a third child than those parents with two children of
different sex. It is also plausible that the sibling sex composition does not affect
mother’s employment and work directly.
Example 23.5 Card (1993) studied the effect of schooling on wage, using the geographic variation in college proximity as an IV. In particular, Z contains dummy variables for whether a subject grew up near a two-year college or a four-year college.
Example 23.6 Voight et al. (2012) studied the causal effect of plasma high-
density lipoprotein (HDL) cholesterol on the risk of heart attack based on
Mendelian randomization. They used some single-nucleotide polymorphisms
(SNPs) as genetic IVs for HDL, which are random with respect to the unmeasured confounders between HDL and heart attack by Mendel's second law, and affect heart attack only through HDL. I will give more details of Mendelian randomization in Chapter 25.
Consider the linear model
Y = D^T β + ε.   (23.1)
The OLS estimator of β is
\[
\hat\beta = \Big(\sum_{i=1}^n D_i D_i^{\top}\Big)^{-1} \sum_{i=1}^n D_i Y_i.
\]
Because
\[
\hat\beta = \Big(\sum_{i=1}^n D_i D_i^{\top}\Big)^{-1} \sum_{i=1}^n D_i (D_i^{\top}\beta + \varepsilon_i)
= \beta + \Big(\sum_{i=1}^n D_i D_i^{\top}\Big)^{-1} \sum_{i=1}^n D_i \varepsilon_i,
\]
the OLS estimator β̂ is consistent for β provided that E(Dε) = 0.
We can also view
Y = D^T β + ε   (23.2)
as a true model for the data-generating process. That is, given the random variables (D, ε), we generate Y based on the linear equation (23.2). Importantly, in the data-generating process, ε and D may be correlated in that E(Dε) ≠ 0. Figure 23.2 gives such an example. This is the fundamental difference compared with the first view, where E(εD) = 0 holds by the definition of the population OLS. Consequently, the OLS estimator can be inconsistent:
\[
\hat\beta \rightarrow \beta + \{E(DD^{\top})\}^{-1} E(D\varepsilon) \neq \beta
\]
in probability.
I end this section with definitions of endogenous and exogenous regressors
based on (23.2), although their definitions are not unique in econometrics.
Y = DT β + ε,
FIGURE 23.2: Causal diagrams for the data generating process (23.2): (a) E(Dε) ≠ 0; (b) marginalized over ε.
The linear IV model assumes Y = D^T β + ε with
E(εZ) = 0.   (23.3)
The above linear IV model allows that E(εD) ̸= 0 but requires an alterna-
tive moment condition (23.3). With E(ε) = 0 by incorporating the intercept,
the new condition states that Z is uncorrelated with the error term ε. But
any randomly generated noise is uncorrelated with ε, so an additional condition must hold to ensure that Z is useful for estimating β. Intuitively, the additional condition requires that Z is correlated with D, with more technical
details stated below.
The mathematical requirement (23.3) seems simple. However, it is a key
challenge in empirical research to find such a variable or variables Z that
satisfies (23.3). Since the condition (23.3) involves the unobservable ε, it is
generally untestable.
When Z has a lower dimension than D, the equation E(ZY) = E(ZD^T)β has infinitely many solutions. This is the under-identified case in which the coefficient β cannot be uniquely determined even with Z. It is a challenging case beyond the scope of this book. In practice, we need at least as many IVs as endogenous regressors.
When Z has higher dimension than D and E(ZDT ) has full column rank,
we have many ways to determine β from E(ZY ) = E(ZDT )β. What is more,
the sample analog
\[
n^{-1}\sum_{i=1}^n Z_i Y_i = n^{-1}\sum_{i=1}^n Z_i D_i^{\top}\beta
\]
may not have any solution because the number of equations is larger than the
number of unknown parameters.
In this case, a useful computational device is the two-stage least squares (TSLS) estimator (Theil, 1953; Basmann, 1957), which has two steps: first, run the OLS fit of Di on Zi and obtain the fitted values D̂i; second, run the OLS fit of Yi on D̂i and use the resulting coefficient as β̂tsls. To see why TSLS works, we need more algebra. Write it more explicitly as
\[
\hat\beta_{\mathrm{tsls}} = \Big(\sum_{i=1}^n \hat D_i \hat D_i^{\top}\Big)^{-1}\sum_{i=1}^n \hat D_i Y_i \qquad (23.5)
\]
\[
= \Big(\sum_{i=1}^n \hat D_i \hat D_i^{\top}\Big)^{-1}\sum_{i=1}^n \hat D_i (D_i^{\top}\beta + \varepsilon_i)
\]
\[
= \Big(\sum_{i=1}^n \hat D_i \hat D_i^{\top}\Big)^{-1}\sum_{i=1}^n \hat D_i D_i^{\top}\beta
+ \Big(\sum_{i=1}^n \hat D_i \hat D_i^{\top}\Big)^{-1}\sum_{i=1}^n \hat D_i \varepsilon_i.
\]
being a zero square matrix with the same dimension as Di. The orthogonality (23.6) implies
\[
\sum_{i=1}^n \hat D_i D_i^{\top} = \sum_{i=1}^n \hat D_i \hat D_i^{\top},
\]
Based on (23.9), we can see the consistency of the TSLS estimator because the term n^{−1} Σ_{i=1}^n Zi εi has probability limit E(Zε) = 0. We can also use (23.9) to show that when Z and D have the same dimension, β̂tsls is numerically identical to β̂iv defined in Section 23.4, which is left as Problem 23.1.
Based on (23.7), we can obtain the standard error as follows. We first obtain the residuals ε̂i = Yi − β̂tsls^T Di, and then obtain the robust variance estimator as
\[
\hat V_{\mathrm{tsls}} = \Big(\sum_{i=1}^n \hat D_i \hat D_i^{\top}\Big)^{-1}
\Big(\sum_{i=1}^n \hat\varepsilon_i^2 \hat D_i \hat D_i^{\top}\Big)
\Big(\sum_{i=1}^n \hat D_i \hat D_i^{\top}\Big)^{-1}.
\]
Importantly, the ε̂i's are not the residuals from the second stage OLS, Yi − β̂tsls^T D̂i, so V̂tsls differs from the robust variance estimator from the second stage OLS.
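A minimal self-contained sketch of TSLS with this robust variance estimator is below, assuming y is the outcome vector, D the matrix of regressors (endogenous plus exogenous), and Z the matrix of instruments plus exogenous regressors; an intercept is added to both internally.

tsls.fit = function(y, D, Z) {
  D = as.matrix(cbind(1, D)); Z = as.matrix(cbind(1, Z))
  Dhat = Z %*% solve(t(Z) %*% Z, t(Z) %*% D)       # first-stage fitted values
  beta = solve(t(Dhat) %*% Dhat, t(Dhat) %*% y)    # second-stage coefficients
  res  = as.vector(y - D %*% beta)                 # residuals use D, not Dhat
  meat  = t(Dhat * res) %*% (Dhat * res)
  bread = solve(t(Dhat) %*% Dhat)
  V = bread %*% meat %*% bread                     # robust variance estimator
  list(coef = as.vector(beta), se = sqrt(diag(V)))
}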
Yi = β0 + β1 Di + βT2 Xi + εi ,
(23.10)
Di = γ0 + γ1 Zi + γT2 Xi + ε2i ,
23.6.3 Weak IV
The following inferential procedure is simpler, more transparent, and more
robust to weak IV. It is more computationally intensive though. The reduced
form (23.11) also implies that
Yi − bDi = (Γ0 − bγ0) + (Γ1 − bγ1)Zi + (Γ2 − bγ2)^T Xi + (ε1i − bε2i).   (23.12)
Under the true value b = β1, the coefficient of Zi in (23.12) is zero. This motivates the confidence set obtained by inverting tests:
{b : |tZ(b)| ≤ zα},
where tZ (b) is the t-statistic for the coefficient of Z based on the OLS fit
of (23.12) with the EHW standard error. This confidence interval is more
robust than the Wald-type confidence interval based on the TSLS estimator.
It is similar to the Fieller–Anderson–Rubin confidence interval discussed in
Chapter 21. This procedure makes the TSLS estimator unnecessary, and what
is more, we only need to run the OLS fit of Y based on the reduced form if
the goal is to test β1 = 0 under (23.10).
23.7 Application
Card (1993) used the National Longitudinal Survey of Young Men to estimate
the causal effect of education on earnings. The data set contains 3010 men
with age between 14 and 24 in the year 1966, and Card (1993) leveraged the
geographic variation in college proximity as an IV for education. Here, Z is
the indicator of growing up near a four-year college, D measures the years of
education, and the outcome Y is the log wage in the year 1976, ranging from
4.6 to 7.8. Additional covariates are race, age and squared age, a categorical variable indicating living with both parents, a single mom, or a stepparent, and variables summarizing the living areas in the past.
> library ( " car " )
>
> # # Card Data
> card . data = read . csv ( " card1995 . csv " )
> Y = card . data [ , " lwage " ]
> D = card . data [ , " educ " ]
> Z = card . data [ , " nearc4 " ]
> X = card . data [ , c ( " exper " , " expersq " , " black " , " south " ,
+ " smsa " , " reg661 " , " reg662 " , " reg663 " ,
+ " reg664 " , " reg665 " , " reg666 " ,
+ " reg667 " , " reg668 " , " smsa66 " )]
> X = as . matrix ( X )
Based on TSLS, the point estimator is 0.132 and the 95% confidence in-
terval is [0.026, 0.237].
> Dhat = lm(D ~ Z + X)$fitted.values
> tslsreg = lm(Y ~ Dhat + X)
> tslsest = coef(tslsreg)[2]
> ## correct se by changing the residuals
> res.correct = Y - cbind(1, D, X) %*% coef(tslsreg)
> tslsreg$residuals = as.vector(res.correct)
> tslsse = sqrt(hccm(tslsreg, type = "hc0")[2, 2])
> res = c(tslsest, tslsest - 1.96*tslsse, tslsest + 1.96*tslsse)
> names(res) = c("est", "l.ci", "u.ci")
> round(res, 3)
  est  l.ci  u.ci
0.132 0.026 0.237
Figure 23.3 shows the p-values for a sequence of tests for the coefficient of
D. It also implies the 95% confidence interval for the coefficient of D based
on inverting tests, which is [0.028, 0.282].
> BetaAR = seq(-0.1, 0.4, 0.001)
> PvalueAR = sapply(BetaAR,
+   function(b){
+     Y_b = Y - b*D
+     ARreg = lm(Y_b ~ Z + X)
+     coefZ = coef(ARreg)[2]
+     seZ = sqrt(hccm(ARreg)[2, 2])
+     Tstat = coefZ/seZ
+     (1 - pnorm(abs(Tstat)))*2
+   })
> point.est = BetaAR[which.max(PvalueAR)]
> point.est
[1] 0.132
> ARCI = range(BetaAR[PvalueAR >= 0.05])
> ARCI
[1] 0.028 0.282
Comparing the above two methods, the lower confidence limits are very
close but the upper confidence limits are slightly different due to the possibly
heavy right tail of the distribution of the TSLS estimator.
FIGURE 23.3: The p-values of a sequence of tests for the coefficient of D.
23.8 Homework
23.1 More algebra for TSLS in Section 23.5
1. Show that the Γ̂ in (23.8) equals
\[
\hat\Gamma = \Big(\sum_{i=1}^n Z_i Z_i^{\top}\Big)^{-1}\sum_{i=1}^n Z_i D_i^{\top}.
\]
24 Application of the Instrumental Variable Method: Fuzzy Regression Discontinuity
family income. To receive this grant, the students must apply first. Therefore,
the eligibility and the application status jointly determined the final treatment
status. The running variable alone did not determine the treatment status al-
though it changed the treatment probability at the cutoff point zero.
Example 24.3 Amarante et al. (2016) estimated the impact of in utero expo-
sure to a social assistance program on children’s birth outcomes. They used a
regression discontinuity induced by the Uruguayan Plan de Atención Nacional
a la Emergencia Social. It was a temporary social assistance program targeted
to the poorest 10 percent of households, implemented between April 2005 and
December 2007. Households with a predicted low income score below a prede-
termined threshold were assigned to the program. The predicted income score
did not determine whether the mother received at least one program trans-
fer during the pregnancy but it changed the probability of the final treatment
received. The birth outcomes included birth weight, weeks of gestation, etc.
signed, we can define potential outcomes {Di (1), Di (0), Yi (1), Yi (0)}. The
sharp regression discontinuity of Z allows for identification of
and
24.3 Application
24.3.1 Re-analyzing Asher and Novosad (2020)’s data
Figure 24.2 shows the result using occupation_index_andrsn as the outcome.
The package rdrobust selects the bandwidth automatically. The results
suggest that receiving a new road did not affect the outcome significantly.
> road_dat = read.csv("indianroad.csv")
> road_dat$runv = road_dat$left + road_dat$right
> library("rdrobust")
> frd_road = with(road_dat,
+   {
+     rdrobust(y = occupation_index_andrsn,
+              x = runv,
+              c = 0,
+              fuzzy = r2012)
+   })
> res = cbind(frd_road$coef, frd_road$se)
> round(res, 3)
                Coeff Std. Err.
Conventional   -0.253     0.301
Bias-Corrected -0.283     0.301
Robust         -0.283     0.359
FIGURE 24.2: Re-analyzing Asher and Novosad (2020)’s data, with point
estimates and standard errors from TSLS.
FIGURE 24.3: Re-analyzing Li et al. (2015)’s data, with point estimates and
standard errors from TSLS.
+   })
> res = cbind(frd_italy$coef, frd_italy$se)
> round(res, 3)
                Coeff Std. Err.
Conventional   -0.149     0.101
Bias-Corrected -0.155     0.101
Robust         -0.155     0.121
24.4 Discussion
Both Chapter 20 and this chapter formulate regression discontinuity based on
the continuity of the conditional expectations of the potential outcomes given
the running variables. This perspective is mathematically simpler but it only
identifies the local effects precisely at the cutoff point of the running variable.
Hahn et al. (2001) started this line of literature.
An alternative, not so dominant perspective is based on local randomiza-
tion (Cattaneo et al., 2015; Li et al., 2015). If we view the running variable as
a noisy measure of some underlying truth and the cutoff point is somewhat
arbitrarily chosen, the units near the cutoff point do not differ systematically.
This suggests that in a small neighborhood of the cutoff point, the units receive
the treatment and the control in a random fashion just as in a randomized
experiment. Similar to the issue of choosing h in the first perspective, it is
crucial to decide how local the randomized experiment should be under the
regression discontinuity. It is not easy to quantify the intuition mathemat-
ically, and again conducting sensitivity analysis with a range of h seems a
reasonable approach in the second perspective as well.
See Sekhon and Titiunik (2017) for more conceptual discussion of regres-
sion discontinuity.
Katan (1986) was concerned with the observational studies suggesting that
low serum cholesterol levels were associated with the risk of cancer. As we
have discussed, however, observational studies suffer from unmeasured con-
founding. Consequently, it is difficult to interpret the apparent association as
causality. In the particular problem studied by Katan (1986), it is even pos-
sible that early stages of cancer reversely cause low serum cholesterol levels.
Disentangling the causal effect of the serum cholesterol level on cancer seems
a hard problem using standard epidemiologic studies. Katan (1986) argued
that Apolipoprotein E genes are associated with the serum cholesterol levels
but do not directly affect the cancer status. So if low serum cholesterol levels cause cancer, we should observe differences in cancer risks among people
with and without the genotype that leads to different serum cholesterol lev-
els. Using our language for causal inference, Katan (1986) proposed to use
Apolipoprotein E genes as IVs.
Katan (1986) did not conduct any data analysis but just proposed a con-
ceptual design that could address not only unmeasured confounding but also
reverse causality. Since then, more complicated and sophisticated studies have
been conducted thanks to the modern genome-wide association studies. These
studies used genetic information as IVs for exposures in epidemiologic stud-
ies to estimate causal effects of exposures on outcomes. They were all moti-
vated by Mendel’s second law, the law of random assortment, which suggests
the inheritance of one trait is independent of the inheritance of other traits.
Therefore, the method of using genetic information as IV is called Mendelian
Randomization (MR).
IVs have direct effect on the outcome of interest, so Figure 25.1 also allows
for the violation of the exclusion restriction assumption.
The standard linear IV model assumes away the direct effects of the IVs on the outcome. Definition 25.1 below gives both the structural and reduced forms.
Y = β0 + βD + βu U + εY , (25.1)
D = γ0 + γ1 G1 + · · · + γp Gp + γu U + εD , (25.2)
Definition 25.2 below allows for the violation of exclusion restriction. Then,
G1 , . . . , Gp are not valid IVs.
Definition 25.2 (linear model with possibly invalid IVs) The linear model
Y = β0 + βD + α1 G1 + · · · + αp Gp + βu U + εY , (25.5)
D = γ0 + γ1 G1 + · · · + γp Gp + γu U + εD , (25.6)
combining multiple estimates β̂j = Γ̂j/γ̂j for the common parameter β. Using the delta method (see Example A1.3), we can obtain an approximate squared standard error for each β̂j. With valid IVs, the reduced-form and first-stage coefficients satisfy
Γj = βγj (j = 1, . . . , p).
This is thus a classic OLS problem of {Γ̂j}_{j=1}^p on {γ̂j}_{j=1}^p: we can fit an OLS of Γ̂j on γ̂j, with or without an intercept, possibly weighted by wj, to estimate β. The following results hold thanks to the algebraic properties of the WLS reviewed in Section A2.5.
where γ̂w = Σ_{j=1}^p γ̂j wj / Σ_{j=1}^p wj and Γ̂w = Σ_{j=1}^p Γ̂j wj / Σ_{j=1}^p wj are the weighted averages of the γ̂j's and Γ̂j's, respectively. Even without assuming that all the αj's are zero under Definition 25.2, we have
\[
\hat\beta_{\mathrm{egger0}} \rightarrow
\frac{\sum_{j=1}^p (\gamma_j-\gamma_w)(\Gamma_j-\Gamma_w)w_j}{\sum_{j=1}^p (\gamma_j-\gamma_w)^2 w_j}
= \beta + \frac{\sum_{j=1}^p (\gamma_j-\gamma_w)(\alpha_j-\alpha_w)w_j}{\sum_{j=1}^p (\gamma_j-\gamma_w)^2 w_j}
\]
and
Γw − βγw = αw,
where γw, Γw, and αw denote the corresponding weighted averages of the true values.
25.3 An example
I use the bmi.sbp data in the mr.raps package to illustrate the Egger regres-
sions.
> library ( " mr . raps " )
> bmisbp = subset ( bmi . sbp ,
30625 Application of the Instrumental Variable Method: Mendelian Randomization
The Egger regressions with or without the intercept give very similar re-
sults.
> mr.egger = lm(beta.outcome ~ 0 + beta.exposure,
+               data = bmisbp,
+               weights = 1/se.outcome^2)
> summary(mr.egger)

Call:
lm(formula = beta.outcome ~ 0 + beta.exposure, data = bmisbp,
   weights = 1/se.outcome^2)

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-5.6999 -1.1691 -0.0199  1.0073 11.3449

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
beta.exposure   0.3173     0.1106   2.869  0.00468 **
>
> mr.egger.w = lm(beta.outcome ~ beta.exposure,
+                 data = bmisbp,
+                 weights = 1/se.outcome^2)
> summary(mr.egger.w)

Call:
lm(formula = beta.outcome ~ beta.exposure, data = bmisbp, weights = 1/se.outcome^2)

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-5.7099 -1.1774 -0.0296  0.9969 11.3393

Coefficients:
               Estimate  Std. Error t value Pr(>|t|)
(Intercept)   0.0001133   0.0020794   0.055  0.95660
beta.exposure 0.3172989   0.1109485   2.860  0.00481 **
FIGURE 25.2: Scatter plot of Γ̂ against γ̂, proportional to the inverse of the variance, with the Egger regression line
Figure 25.2 shows the raw data as well as the fitted Egger regression line.
It is possible that these IVs have direct effects on the confounders. It is also
possible that some unmeasured genes affect both the IVs and the confounders.
Mendel’s second law does not ensure the exclusion restriction assumption ei-
ther. It is possible that the IVs have other causal pathways to the outcome,
beyond the pathway through the treatment of interest.
Technically, the statistical assumptions for MR are quite strong. Clearly,
the linear IV model is a strong modeling assumption. The independence of
the γ̂j 's and the Γ̂j 's is also quite strong. Other issues in the data collection process can further complicate the interpretation of the IV assumptions. For instance, the treatments and outcomes are often measured with errors, and genome-wide association studies are often based on the case-control design.
VanderWeele et al. (2014) is an excellent review paper that discusses the
methodological challenges in MR.
(Figure 26.1: causal diagrams relating Z, M, and Y with unmeasured confounding U; panel (b): M is not on the causal pathway from Z to Y, with Z randomized and U representing unmeasured confounding.)
Conditioning on M = m, we compare
pr(Y | Z = 1, M = m)
and
pr(Y | Z = 0, M = m).
This comparison seems intuitive: it measures the difference in the outcome distributions between the treated and control groups given the same value of the post-treatment variable. When M is a pre-treatment covariate, this comparison yields a reasonable subgroup effect. However, when M is a post-treatment variable, the interpretation of this comparison is problematic. Under Assumption 26.1, we can re-write them as

pr(Y | Z = 1, M = m) = pr{Y (1) | M (1) = m}

and

pr(Y | Z = 0, M = m) = pr{Y (0) | M (0) = m}.

Therefore, we are comparing the distributions of Y (1) and Y (0) for different subsets of units, because the units with M (1) = m are different from the units with M (0) = m if Z affects M. Consequently, the comparison conditioning on M = m does not have a causal interpretation in general unless M (1) = M (0).1
Revisit Example 26.1. Comparing pr(Y | Z = 1, M = 1) and pr(Y |
Z = 0, M = 1) is equivalent to comparing the treated potential outcomes for
compliers and always-takers and control potential outcomes for always-takers,
under the monotonicity assumption that M (1) ≥ M (0). Part 3 of Problem
22.7 has pointed out the drawbacks of this analysis.
Revisit Example 26.2. If the treatment improves the survival status, the
treatment can save more weak patients than the control. In this case, units
with M (1) = 1 are weaker than units with M (0) = 1, so the naive comparison
gives results that are biased in favor of the control.
1 Based on the causal diagrams, we can reach the same conclusion. In Figure 26.1, even though Z ⫫ U by the randomization of Z, conditioning on M introduces the "collider bias" that renders Z and U dependent.
A more meaningful strategy is to compare

pr{Y (1) | M (1) = m1 , M (0) = m0 }

and

pr{Y (0) | M (1) = m1 , M (0) = m0 }

for some (m1 , m0 ). This is a comparison of the potential outcomes under treatment and control for the same subset of units with M (1) = m1 and M (0) = m0 . Frangakis and Rubin (2002) called this strategy principal stratification, viewing {M (1), M (0)} as a pre-treatment covariate. Based on this idea, we can define

τ (m1 , m0 ) = E{Y (1) − Y (0) | M (1) = m1 , M (0) = m0 }

as the principal stratification average causal effect for the subgroup with M (1) = m1 , M (0) = m0 . For a binary M , we have four subgroups
τ (1, 1) = E{Y (1) − Y (0) | M (1) = 1, M (0) = 1},
τ (1, 0) = E{Y (1) − Y (0) | M (1) = 1, M (0) = 0},
(26.1)
τ (0, 1) = E{Y (1) − Y (0) | M (1) = 0, M (0) = 1},
τ (0, 0) = E{Y (1) − Y (0) | M (1) = 0, M (0) = 0}.
The parameter τ (1, 1) is called the survivor average causal effect (Rubin, 2006a). It is the aver-
age causal effect of the treatment on the outcome for those units who survive
regardless of the treatment status.
methods for analyzing selective samples." His model contains two stages. First, the employment status is determined by a latent linear model

Mi = 1(XiT β + ui ≥ 0).
and

E(Y | Z = 1, M = 1) = {π(1,1) /(π(1,1) + π(1,0) )} E{Y (1) | M (1) = 1, M (0) = 1} + {π(1,0) /(π(1,1) + π(1,0) )} E{Y (1) | M (1) = 1, M (0) = 0}.
Theorem 26.1 Under Assumptions 26.1 and 26.2 with a binary Y , we have

[{π(1,1) + π(1,0) }E(Y | Z = 1, M = 1) − π(1,0) ] / π(1,1) − E(Y | Z = 0, M = 1)
≤ τ (1, 1)
≤ [{π(1,1) + π(1,0) }E(Y | Z = 1, M = 1)] / π(1,1) − E(Y | Z = 0, M = 1).
In most truncation by death problems, the lower and upper bounds are
quite different, and they are bounded away from the extreme values −1 and 1.
So we can use Imbens and Manski (2004)'s confidence interval for τ (1, 1), which involves two steps: first, we obtain the estimated lower and upper bounds [l̂, û] with estimated standard errors (se_l , se_u ); second, we construct the confidence interval as [l̂ − zα se_l , û + zα se_u ], where zα is the 1 − α quantile of the standard Normal distribution.
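The following R snippet sketches this construction with hypothetical estimated bounds and standard errors; the numbers below are placeholders for illustration only, not estimates from any data set.

## hypothetical estimated bounds and standard errors (placeholders)
l.hat = -0.10; u.hat = 0.15
se.l  = 0.03;  se.u  = 0.04
alpha = 0.05
z.alpha = qnorm(1 - alpha)            # the 1 - alpha quantile of N(0, 1)
c(lower = l.hat - z.alpha * se.l,     # interval covering tau(1,1)
  upper = u.hat + z.alpha * se.u)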
To summarize, this is a challenging problem since we cannot identify the
parameter based on the observed data even with infinite sample size. We can
derive large-sample bounds for τ (1, 1) but the statistical inference based on
the bounds are not standard. If we do not have monotonicity, the large-sample
bounds have even more complex forms (Zhang and Rubin, 2003; Jiang et al.,
2016).
26.4.2 An application
I use the data in Yang and Small (2016) from the Acute Respiratory Distress
Syndrome Network study involving 861 patients with lung injury and acute
respiratory distress syndrome. Patients were randomized to receive mechanical
ventilation with either lower tidal volumes or traditional tidal volumes. The
outcome is the binary indicator for whether patients could breathe without
assistance by day 28. Table 26.1 summarizes the observed data.
TABLE 26.1: Data truncated by death with * indicating the outcomes for
dead patients
                Treatment Z = 1                      Control Z = 0
            Y = 1   Y = 0   total                Y = 1   Y = 0   total
   M = 1       54     268     322       M = 1       59     218     277
   M = 0        *       *     109       M = 0        *       *     152
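As a small sketch, the bounds in Theorem 26.1 can be estimated from the counts in Table 26.1: under randomization and monotonicity, π(1,1) + π(1,0) and π(1,1) are estimated by the survival proportions in the two arms.

## estimated bounds on tau(1,1) from Table 26.1 (a sketch)
p1  = 322 / (322 + 109)       # pr(M = 1 | Z = 1) = pi(1,1) + pi(1,0)
p0  = 277 / (277 + 152)       # pr(M = 1 | Z = 0) = pi(1,1) under monotonicity
pi11 = p0
pi10 = p1 - p0
mu1 = 54 / 322                # E(Y | Z = 1, M = 1)
mu0 = 59 / 277                # E(Y | Z = 0, M = 1)
c(lower = (p1 * mu1 - pi10) / pi11 - mu0,
  upper =  p1 * mu1 / pi11 - mu0)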
26.4.3 Extensions
Zhang and Rubin (2003) started the literature of large-sample bounds. Imai
(2008a) and Lee (2009) were two follow-up papers. Cheng and Small (2006)
derived the bounds with multiple treatment arms. Yang and Small (2016) used
a secondary outcome to sharpen the bounds on the survivor average causal
effect.
additional assumptions are not testable, and their plausibility depends on the
application. A line of research parallels causal inference with unconfounded
observational studies. For simplicity, I focus on the case with strong mono-
tonicity.
Theorem 26.2 Under Assumptions 26.1, 26.3 and 26.4, the principal stratification average causal effects can be identified by

τ (1, 0) = E(Y | Z = 1, M = 1) − E{π(X)Y | Z = 0}/π

and

τ (0, 0) = E(Y | Z = 1, M = 0) − E[{1 − π(X)}Y | Z = 0]/(1 − π),

where

π(X) = pr(M = 1 | Z = 1, X)

and

π = pr(M = 1 | Z = 1).
□
Theorem 26.2 motivates the following simple estimators for τ (1, 0) and
τ (0, 0), respectively:
1. fit a logistic regression of M on X using only data from the treated
group to obtain π̂(Xi );
2. estimate π by π̂ = Σ_{i=1}^n Zi Mi / Σ_{i=1}^n Zi ;
3. obtain moment estimators:

τ̂ (1, 0) = Σ_{i=1}^n Zi Mi Yi / Σ_{i=1}^n Zi Mi − Σ_{i=1}^n (1 − Zi )π̂(Xi )Yi / { π̂ Σ_{i=1}^n (1 − Zi ) }

and

τ̂ (0, 0) = Σ_{i=1}^n Zi (1 − Mi )Yi / Σ_{i=1}^n Zi (1 − Mi ) − Σ_{i=1}^n (1 − Zi ){1 − π̂(Xi )}Yi / { (1 − π̂) Σ_{i=1}^n (1 − Zi ) }.
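A minimal R implementation of these three steps is sketched below on simulated data; the data-generating process is hypothetical and only serves to illustrate the computations under strong monotonicity.

## simulated data satisfying strong monotonicity M(0) = 0 (hypothetical)
set.seed(1)
n  = 2000
X  = rnorm(n)
Z  = rbinom(n, 1, 0.5)                     # randomized treatment
M1 = rbinom(n, 1, plogis(0.5 + X))         # potential value M(1)
M  = Z * M1                                # observed M; M(0) = 0
Y  = rnorm(n, mean = X + Z * M1)
## step 1: principal score pi(X) fitted on the treated group only
ps.fit = glm(M ~ X, family = binomial, subset = (Z == 1))
pi.X   = predict(ps.fit, newdata = data.frame(X = X), type = "response")
## step 2: pi = pr(M = 1 | Z = 1)
pi.hat = sum(Z * M) / sum(Z)
## step 3: moment estimators
tau10 = sum(Z * M * Y) / sum(Z * M) -
  sum((1 - Z) * pi.X * Y) / (pi.hat * sum(1 - Z))
tau00 = sum(Z * (1 - M) * Y) / sum(Z * (1 - M)) -
  sum((1 - Z) * (1 - pi.X) * Y) / ((1 - pi.hat) * sum(1 - Z))
c(tau10 = tau10, tau00 = tau00)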
26.5.2 Extensions
Follmann (2000), Hill et al. (2002), Jo and Stuart (2009), Jo et al. (2011)
and Stuart and Jo (2015) started the literature of using the principal score to
identify causal effects within principal strata. Ding and Lu (2017) provided
a theoretical foundation for this strategy. They proved Theorem 26.2 as well as a more general version under monotonicity; see Problem 26.1. Jiang et al. (2022) gave a unified discussion of this strategy for observational studies and proposed multiply robust estimators for causal effects within principal strata.
(Causal diagram for mediation: Z → M → Y , with a direct arrow from Z to Y .)
Example 27.2 Rudolph et al. (2018) studied the causal mechanism from
neighborhood poverty to adolescent substance use, mediated by the school and
peer environment. They used data from the National Comorbidity Survey
Replication Adolescent Supplement, a nationally representative survey of U.S.
adolescents conducted during 2001–2004. The treatment is the binary indi-
cator of neighborhood disadvantage, defined as living in the lowest tertile of
neighborhood socioeconomic status based on data from the 2000 U.S. Census.
Four binary mediators are measures of school and peer environments, and six
binary outcomes are measures of substance use. Baseline covariates included
the adolescent’s sex, age, race, immigration generation, family income, etc.
{Y (z, m) : z = 0, 1; m ∈ M},
where M contains all possible values of m. Robins and Greenland (1992) and
Pearl (2001) further consider the nested potential outcomes corresponding to
intervention on z and m = M (z ′ ) ≡ Mz′ :
{Y (z, Mz′ ) : z = 0, 1; z ′ = 0, 1}
Definition 27.1 (total, direct and indirect effects) Define the total effect of the treatment on the outcome as

τ = E{Y (1) − Y (0)},

the natural direct effect as

nde = E{Y (1, M0 ) − Y (0, M0 )},

and the natural indirect effect as

nie = E{Y (1, M1 ) − Y (1, M0 )}.

The total effect is the standard average causal effect of Z on Y . The natural direct effect measures the effect of the treatment on the outcome if the mediator were set at the natural value M0 without the intervention. The natural indirect effect measures the effect of the treatment through changing the mediator if the treatment itself were set at z = 1. Under the composition assumption, the natural direct and indirect effects reduce to

nde = E{Y (1, M0 )} − E{Y (0)},   nie = E{Y (1)} − E{Y (1, M0 )},

and therefore, we can decompose the total effect as the sum of the natural direct and indirect effects.
(Figure: two parallel worlds, one with intervention z = 0 and one with intervention z = 1; cross-world communications set the mediator to m = M1 in the z = 0 world and to m = M0 in the z = 1 world, yielding Y (0) = Y (0, M0 ), Y (0, M1 ), Y (1) = Y (1, M1 ), and Y (1, M0 ).)
1 By probability theory, given the marginal distributions pr{Y (1) ≤ y1 } and pr{Y (0) ≤ y0 }, we can bound the joint distribution pr{Y (1) ≤ y1 , Y (0) ≤ y0 } by the Frechet–Hoeffding inequality:

max{0, pr(Y (1) ≤ y1 ) + pr(Y (0) ≤ y0 ) − 1} ≤ pr(Y (1) ≤ y1 , Y (0) ≤ y0 ) ≤ min{pr(Y (1) ≤ y1 ), pr(Y (0) ≤ y0 )}.

This is often a loose inequality. Unfortunately, we do not have any information beyond this inequality without imposing additional assumptions.
Z ⫫ Y (z, m) | X
M ⫫ Y (z, m) | (X, Z)
Assumptions 27.2 and 27.3 together are often called sequential ignorability.
They are equivalent to the assumption that (Z, M ) are jointly randomized
conditioning on X:
Z ⫫ M (z) | X
for all z.
Y (z, m) ⫫ M (z ′ ) | X
Assumptions 27.2–27.4 are very strong, but at least they hold under exper-
iments with randomized treatment and mediator. Assumption 27.5 is stronger
because no physical experiment can ensure it. Because we can never observe
Y (z, m) and M (z ′ ) in any experiment if z ̸= z ′ , Assumption 27.5 can never
be validated so it is fundamentally meta-physical.
I give an example below in which Assumptions 27.2–27.5 all hold.
We can verify that Assumptions 27.2–27.5 hold under this data generating
process; see Problem 27.2.
Pearl (2001) proved the following key result for mediation analysis.
and therefore,

E{Y (z, Mz′ )} = Σ_x E{Y (z, Mz′ ) | X = x} pr(X = x).
Theorem 27.1 assumes that both M and X are discrete. With general M and X, the mediation formulas become

E{Y (z, Mz′ ) | X = x} = ∫ E(Y | Z = z, M = m, X = x) fM (m | Z = z ′ , X = x) dm

and

E{Y (z, Mz′ )} = ∫ E{Y (z, Mz′ ) | X = x} fX (x) dx.
From Theorem 27.1, the identification formulas for the means of the nested
potential outcomes depend on the conditional mean of the outcome given the
treatment, mediator, and covariates, as well as the conditional mean of the
mediator given the treatment and covariates. We need to evaluate these two
conditional means at different treatment levels if the nested potential outcome
involves cross-world interventions.
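For a binary mediator, the R sketch below implements the resulting plug-in estimator with two illustrative working models, a linear model for the outcome and a logistic model for the mediator, and a single numeric covariate X; these modeling choices are assumptions for the sketch, not part of Theorem 27.1.

## plug-in estimator of E{Y(z, M_{z'})} for a binary M (a sketch)
mediation.formula = function(Z, M, Y, X) {
  dat = data.frame(Z = Z, M = M, Y = Y, X = X)
  out.fit = lm(Y ~ Z * M + X, data = dat)                   # E(Y | Z, M, X)
  med.fit = glm(M ~ Z + X, family = binomial, data = dat)   # pr(M = 1 | Z, X)
  Ey = function(z, m) predict(out.fit, newdata = transform(dat, Z = z, M = m))
  pM = function(zp)   predict(med.fit, newdata = transform(dat, Z = zp),
                              type = "response")
  ## E{Y(z, M_{z'})}: average over units of
  ##   sum over m of E(Y | z, m, X) pr(M = m | z', X)
  EY = function(z, zp) mean(Ey(z, 1) * pM(zp) + Ey(z, 0) * (1 - pM(zp)))
  c(nde = EY(1, 0) - EY(0, 0), nie = EY(1, 1) - EY(1, 0))
}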
If we drop the cross-world independence assumption, we can modify the
definition of the natural direct and indirect effects and the same formulas hold.
See Problem 27.8 for more details.
I give the proof below.
Proof of Theorem 27.1: By the tower property, E{Y (z, Mz′ )} = E[E{Y (z, Mz′ ) | X}], so we need only to prove the formula for E{Y (z, Mz′ ) | X = x}. Starting with the law of total probability, we have

E{Y (z, Mz′ ) | X = x}
= Σ_m E{Y (z, Mz′ ) | Mz′ = m, X = x} pr(Mz′ = m | X = x)
= Σ_m E{Y (z, m) | Mz′ = m, X = x} pr(Mz′ = m | X = x)
= Σ_m E{Y (z, m) | X = x} pr(M = m | Z = z ′ , X = x)        (by Assumptions 27.5 and 27.4)
= Σ_m E(Y | Z = z, M = m, X = x) pr(M = m | Z = z ′ , X = x).  (by Assumptions 27.2 and 27.3)

□
The above proof is actually trivial from a mathematical perspective. It
illustrates the necessity of Assumptions 27.2–27.5.
Conditioning on X = x, the mediation formulas for Y (1, M1 ) and Y (0, M0 ) simplify to

E{Y (1, M1 ) | X = x} = Σ_m E(Y | Z = 1, M = m, X = x) pr(M = m | Z = 1, X = x) = E(Y | Z = 1, X = x)

and

E{Y (0, M0 ) | X = x} = Σ_m E(Y | Z = 0, M = m, X = x) pr(M = m | Z = 0, X = x) = E(Y | Z = 0, X = x)

based on the law of total probability; the mediation formula for Y (1, M0 ) simplifies to

E{Y (1, M0 ) | X = x} = Σ_m E(Y | Z = 1, M = m, X = x) pr(M = m | Z = 0, X = x),
and the mediation formula for Y (0, M1 ) | X = x follows by symmetry, with pr(M = m | Z = 1, X = x) in place of pr(M = m | Z = 0, X = x).
I leave the proof of Corollary 27.2 as Problem 27.4. Corollary 27.2 gives
a simple formula in the case of a binary M . With randomized Z conditional
on X, we can view τZ→M (x) as the conditional average causal effect of Z
on M . With randomized M conditional on (X, Z), we can view τM →Y (z, x)
as the conditional average causal effect of M on Y . The conditional natural
indirect effect equals their product. This is coherent with our intuition that
the indirect effect acts from Z to M and then from M to Y .
FIGURE 27.3: The Baron–Kenny method for mediation under linear models (Z → M with coefficient β1 , M → Y with coefficient θ2 , Z → Y with coefficient θ1 ; indirect effect β1 θ2 , direct effect θ1 )
E(M | Z, X) = β0 + β1 Z + βT2 X,
E(Y | Z, M, X) = θ0 + θ1 Z + θ2 M + θT4 X.
Under these linear models, the formulas for the natural direct and indirect effects simplify to functions of the coefficients:

nde = θ1 ,   nie = θ2 β1 .
and

nie(x) = Σ_m (θ0 + θ1 + θ2 m + θ4T x){pr(M = m | Z = 1, X = x) − pr(M = m | Z = 0, X = x)}
       = θ2 {E(M | Z = 1, X = x) − E(M | Z = 0, X = x)}
       = θ2 β1 ,
which do not depend on x. Therefore, they are also the formulas for the
unconditional natural direct and indirect effects. □
If we obtain OLS estimators of these coefficients, we can estimate the direct and indirect effects by

ndeˆ = θ̂1 ,   nieˆ = θ̂2 β̂1 ,
which is called the Baron–Kenny method (Judd and Kenny, 1981; Baron and
Kenny, 1986) although it had several antecedents (e.g., Hyman, 1955; Alwin
and Hauser, 1975; Judd and Kenny, 1981; Sobel, 1982).
Standard software packages report the standard error of ndeˆ from OLS. Sobel (1982, 1986) used the delta method to obtain the standard error of nieˆ. Based on the formula in Example A1.2, the asymptotic variance of θ̂2 β̂1 equals var(θ̂2 )β1² + θ2² var(β̂1 ). So the estimated variance is

v̂ar(θ̂2 )β̂1² + θ̂2² v̂ar(β̂1 ).

Testing the null hypothesis that nie = 0 based on θ̂2 β̂1 and the estimated variance above is called Sobel's test in the literature of mediation analysis.
27.4.2 An Example
We can easily implement the Baron–Kenny method via the following code.
library("car")
BKmediation = function(Z, M, Y, X)
{
  ## two regressions and coefficients
  mediator.reg   = lm(M ~ Z + X)
  mediator.Zcoef = mediator.reg$coef[2]
  mediator.Zse   = sqrt(hccm(mediator.reg)[2, 2])

  outcome.reg    = lm(Y ~ Z + M + X)
  outcome.Zcoef  = outcome.reg$coef[2]
  outcome.Zse    = sqrt(hccm(outcome.reg)[2, 2])
  outcome.Mcoef  = outcome.reg$coef[3]
  outcome.Mse    = sqrt(hccm(outcome.reg)[3, 3])

  ## point estimates: nde = theta1, nie = theta2 * beta1
  NDE = outcome.Zcoef
  NIE = outcome.Mcoef * mediator.Zcoef
  ## standard errors: OLS for NDE, Sobel's delta-method formula for NIE
  NDE.se = outcome.Zse
  NIE.se = sqrt(outcome.Mse^2 * mediator.Zcoef^2 +
                  outcome.Mcoef^2 * mediator.Zse^2)

  res = matrix(c(NDE, NIE,
                 NDE.se, NIE.se,
                 NDE/NDE.se, NIE/NIE.se),
               2, 3)
  rownames(res) = c("NDE", "NIE")
  colnames(res) = c("est", "se", "t")
  res
}
Revisiting Example 27.3, we obtain the following estimates for the direct
and indirect effects:
> library(mediation)
> Z = jobs$treat
> M = jobs$job_seek
> Y = jobs$depress2
> getX = lm(treat ~ econ_hard + depress1 +
+            sex + age + occp + marital +
+            nonwhite + educ + income,
+            data = jobs)
> X = model.matrix(getX)[, -1]
> res = BKmediation(Z, M, Y, X)
> round(res, 3)
       est    se      t
NDE -0.037 0.042 -0.885
NIE -0.014 0.009 -1.528
Both the estimated direct and indirect effects are negative, although neither is statistically significant.
{Y (z, m), M (z ′ )} ⫫ Z | X

and

Y (z, m) ⫫ M (z ′ ) | (Z = z ′ , X)

for all z, z ′ , m.
E(M | Z, X) = β0 + β1 Z + βT2 X,
E(Y | Z, M, X) = θ0 + θ1 Z + θ2 M + θ3 ZM + θT4 X,
where the outcome model has the interaction term between the treatment and
the mediator.
Under the above linear models, show that
the linear models, the average causal effect of Z on M equals β1 , and the aver-
age causal effect of M on Y equals θ2 + θ3 E(Z). Therefore, it is possible that
both of these effects are positive, but the natural indirect effect is negative.
For instance:
This is somewhat paradoxical, and can be called the mediator paradox. Chen
et al. (2007) reported a related surrogate endpoint paradox or intermediate
variable paradox.
E(Y | Z, M, X) = θ0 + θ1 Z + θ2 M + θT4 X,
Express nde and nie in terms of the model parameters and the distribution
of X. How do we estimate nde and nie with IID data?
as the potential outcome under treatment z and a random draw from the dis-
tribution of Mz′ | X. The key difference between Y (z, Mz′ ) and Y (z, FMz′ |X )
is that Mz′ is the potential mediator for the same unit whereas FMz′ |X is a
random draw from the conditional distribution of the potential mediator in
the whole population. Define the natural direct and indirect effects as
nde = E{Y (1, FM0 |X )−Y (0, FM0 |X )}, nie = E{Y (1, FM1 |X )−Y (1, FM0 |X )}.
or, equivalently,

(Z, M ) ⫫ Y (z, m) | X.
The following theorem extends the results for observational studies with a binary treatment, identifying μzm = E{Y (z, m)} by

μzm = E{μzm (X)}

or

μzm = E[ I(Z = z, M = m)Y / ezm (X) ].

Moreover, based on the working models ezm (X, α) and μzm (X, β), we have the doubly robust formula

μdr_zm = E{μzm (X, β)} + E[ I(Z = z, M = m){Y − μzm (X, β)} / ezm (X, α) ],

which equals μzm if either ezm (X, α) = ezm (X) or μzm (X, β) = μzm (X).
The proof of Theorem 28.1 is similar to those for the standard uncon-
founded observational studies. Problem 28.2 gives a general result. Based on
the outcome mean model, we can obtain µ̂zm (x) for µzm (x). Based on the
treatment model, we can obtain êz (x) for pr(Z = z | X = x); based on the
intermediate variable model, we can obtain êm (z, x) for pr(M = m | Z =
z, X = x). We can then estimate µzm by outcome regression

μ̂reg_zm = n−1 Σ_{i=1}^n μ̂zm (Xi ),
We can then estimate cde(m) by µ̂1m − µ̂0m and use the bootstrap to ap-
proximate the standard error.
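Below is a small R sketch of the doubly robust estimator of μzm on simulated data with a hypothetical data-generating process; the working models (logistic models for Z and M, a linear outcome model) are illustrative assumptions.

## doubly robust estimation of mu_zm and cde(m), binary Z and M (a sketch)
set.seed(2)
n = 5000
X = rnorm(n)
Z = rbinom(n, 1, plogis(X))
M = rbinom(n, 1, plogis(-0.5 + Z + X))
Y = rnorm(n, mean = 1 + Z + 2 * M + X)
dat = data.frame(X, Z, M, Y)
out = lm(Y ~ Z * M + X, data = dat)                 # outcome model mu_zm(X)
eZ  = glm(Z ~ X, family = binomial, data = dat)     # treatment model
eM  = glm(M ~ Z + X, family = binomial, data = dat) # intermediate variable model
mu.dr = function(z, m) {
  mu.x = predict(out, newdata = transform(dat, Z = z, M = m))
  pz   = predict(eZ, type = "response"); pz = ifelse(z == 1, pz, 1 - pz)
  pm   = predict(eM, newdata = transform(dat, Z = z), type = "response")
  pm   = ifelse(m == 1, pm, 1 - pm)
  ind  = (dat$Z == z) * (dat$M == m)
  mean(mu.x) + mean(ind * (dat$Y - mu.x) / (pz * pm))
}
c(cde1 = mu.dr(1, 1) - mu.dr(0, 1),                 # cde(1)
  cde0 = mu.dr(1, 0) - mu.dr(0, 0))                 # cde(0)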
If we are willing to assume a linear outcome model, the controlled direct
effect simplifies to the coefficient of the treatment. Example 28.1 below gives
the details.
Example 28.1 Under Assumption 28.1 and a linear outcome model,
E(Y | Z, M, X) = θ0 + θ1 Z + θ2 M + θT4 X,
we can show that cde(m) equals the coefficient θ1 , which coincides with the
natural direct effect in the Baron–Kenny method. I relegate the proof to Prob-
lem 28.3.
28.2 Discussion
The formulation of the controlled direct effect does not involve nested or a
priori counterfactual potential outcomes, and its identification does not re-
quire the cross-world counterfactual independence assumption. The parameter
cde(m) can capture the direct effect of the treatment holding the mediator at
m. However, this formulation cannot capture the indirect effect. I summarize
the causal frameworks for intermediate variables below.
chapter   framework                    direct effect        indirect effect
26        principal stratification     τ (1, 1), τ (0, 0)   ?
27        mediation analysis           nde                  nie
28        controlled direct effect     cde(m)               ?
The mediation analysis framework can decompose the total effect into
natural direct and indirect effects, but it requires nested potential outcomes
and cross-world independence. The principal stratification and controlled di-
rect effect frameworks cannot define indirect effects but they do not involve
which reduces to

nde(x) = Σ_m cde(m | x) pr(M0 = m | X = x).
Therefore, the key is to identify and estimate the means of the potential
outcomes µk = E{Y (k)} under the ignorability assumption below based on
IID data of (Zi , Xi , Yi ) for i = 1, . . . , n.
μk (X) = E(Y | Z = k, X),

μk = E{μk (X)}

or

μk = E[ I(Z = k)Y / ek (X) ].

Moreover, based on the working models ek (X, α) and μk (X, β), we have the doubly robust formula

μdr_k = E{μk (X, β)} + E[ I(Z = k){Y − μk (X, β)} / ek (X, α) ],

which equals μk if either ek (X, α) = ek (X) or μk (X, β) = μk (X).
cde(m) = θ1 + θ3 m.
then
if
logit{pr(Y = 1 | Z, M, X)} = θ0 + θ1 Z + θ2 M + θ3 ZM + θT4 X,
then
X0 → Z1 → X1 → Z2 → Y
where
• X0 denotes the baseline pre-treatment covariates;
Y (z1 , z2 ) for z1 , z2 = 0, 1.
I will focus on the canonical setting with sequential ignorability, that is, the
treatments are sequentially randomized given the observed history.
The following result extends it to the setting with a treatment at two time
points.
Compare (29.2) with the formula based on the law of total probability to gain
more insights:
E(Y ) = Σ_{x0} Σ_{z1} Σ_{x1} Σ_{z2} E(Y | z2 , z1 , x1 , x0 ) pr(z2 | z1 , x1 , x0 ) pr(x1 | z1 , x0 ) pr(z1 | x0 ) pr(x0 ).   (29.4)
Erasing the probabilities of z2 and z1 in (29.4), we can obtain the formula
(29.3). This is intuitive because the potential outcome Y (z1 , z2 ) has the mean-
ing of fixing Z1 and Z2 at z1 and z2 , respectively.
Robins called (29.2) and (29.3) the g-formulas. Now I will prove Theorem
29.1.
Proof of Theorem 29.1: By the tower property,

E{Y (z1 , z2 )} = E[ E{Y (z1 , z2 ) | X0 } ],

By Assumption 29.1(2),

E{Y (z1 , z2 ) | X0 } = E[ E{Y (z1 , z2 ) | z2 , z1 , X1 , X0 } | z1 , X0 ]
                     = E[ E{Y | z2 , z1 , X1 , X0 } | z1 , X0 ].
Define

E{X1 (z1 )} = Σ_{x0} E(X1 | z1 , x0 ) pr(x0 ).   (29.5)
Therefore, we can estimate the effect of (Z1 , Z2 ) on Y based on the above for-
mulas by first estimating the regression coefficients βs and the average causal
effect of Z1 on X1 using standard methods.
method for estimation. Start from the inner conditional expectation, denoted
by
Ỹ2 (z1 , z2 ) = E(Y | Z2 = z2 , Z1 = z1 , X1 , X0 ).
We can fit a model of Y on (X1 , X0 ) using the subset of the data with (Z2 =
z2 , Z1 = z1 ), and obtain the fitted values Ŷ2i (z1 , z2 ) for all units. Move on to the outer conditional expectation, denoted by

Ỹ1 (z1 , z2 ) = E{Ỹ2 (z1 , z2 ) | Z1 = z1 , X0 }.
We can fit a model of Ŷ2 (z1 , z2 ) on X0 using the subset of data with Z1 = z1 ,
and obtain the fitted values Ŷ1i (z1 , z2 ) for all units. The final estimator for
E{Y (z1 , z2 )} is then

Ê{Y (z1 , z2 )} = n−1 Σ_{i=1}^n Ŷ1i (z1 , z2 ).
The above recursive estimation does not involve fitting a model for X1 and
avoids the g-null paradox. See Problem 29.2 for a special case.
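The recursive procedure is easy to code. The sketch below uses simulated data from a hypothetical data-generating process and linear working models at both steps.

## recursive (regression-based) estimator of E{Y(z1, z2)} (a sketch)
set.seed(3)
n  = 5000
X0 = rnorm(n)
Z1 = rbinom(n, 1, plogis(X0))
X1 = rnorm(n, mean = X0 + Z1)
Z2 = rbinom(n, 1, plogis(X1))
Y  = rnorm(n, mean = X1 + Z1 + Z2)
dat = data.frame(X0, Z1, X1, Z2, Y)
g.recursive = function(z1, z2) {
  ## inner regression: model for E(Y | Z2 = z2, Z1 = z1, X1, X0)
  fit2 = lm(Y ~ X1 + X0, data = dat, subset = (Z1 == z1 & Z2 == z2))
  dat$Y2 = predict(fit2, newdata = dat)      # fitted values for all units
  ## outer regression: model for E(Y2 | Z1 = z1, X0)
  fit1 = lm(Y2 ~ X0, data = dat, subset = (Z1 == z1))
  mean(predict(fit1, newdata = dat))
}
g.recursive(1, 1) - g.recursive(0, 0)        # estimate of E{Y(1,1)} - E{Y(0,0)}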
The following result extends it to the setting with a treatment at two time
points. Define
e(z1 , X0 ) = pr(Z1 = z1 | X0 )
and
e(z2 , Z1 , X1 , X0 ) = pr(Z2 = z2 | Z1 , X1 , X0 )
as the propensity scores at time points 1 and 2, respectively.
for all z1 and z2 . If some propensity scores are 0 or 1, then the identification
formula (29.6) blows up to infinity.
Proof of Theorem 29.2: Conditioning on (Z1 , X1 , X0 ) and using Assump-
tion 29.1(2), we can simplify the right-hand side of (29.6) as
E[ 1(Z1 = z1 )1(Z2 = z2 )Y (z1 , z2 ) / {pr(Z1 = z1 | X0 ) pr(Z2 = z2 | Z1 , X1 , X0 )} ]
= E[ 1(Z1 = z1 ) pr(Z2 = z2 | Z1 , X1 , X0 ) E{Y (z1 , z2 ) | Z1 , X1 , X0 } / {pr(Z1 = z1 | X0 ) pr(Z2 = z2 | Z1 , X1 , X0 )} ]
= E[ 1(Z1 = z1 ) E{Y (z1 , z2 ) | Z1 , X1 , X0 } / pr(Z1 = z1 | X0 ) ]
= E[ 1(Z1 = z1 ) Y (z1 , z2 ) / pr(Z1 = z1 | X0 ) ],   (29.7)
where, again, the last line follows from the tower property. □
The estimator based on IPW is much simpler, as it only involves modeling two binary treatment indicators. First, we can fit a model of Z1 on X0 to obtain the fitted
values ê1 (z1 , X0i ) and fit a model of Z2 on (Z1 , X1 , X0 ) to obtain the fitted
values ê2 (z2 , Z1i , X1i , X0i ) for all units. Then, we obtain the following IPW
estimator:

Ê_ht {Y (z1 , z2 )} = n−1 Σ_{i=1}^n 1(Z1i = z1 )1(Z2i = z2 )Yi / { ê1 (z1 , X0i ) ê2 (z2 , Z1i , X1i , X0i ) }.
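The corresponding R sketch, again on simulated data from a hypothetical data-generating process, fits the two propensity score models by logistic regression and averages the weighted outcomes.

## IPW (Horvitz-Thompson) estimator of E{Y(z1, z2)} (a sketch)
set.seed(3)
n  = 5000
X0 = rnorm(n); Z1 = rbinom(n, 1, plogis(X0))
X1 = rnorm(n, X0 + Z1); Z2 = rbinom(n, 1, plogis(X1))
Y  = rnorm(n, X1 + Z1 + Z2)
dat = data.frame(X0, Z1, X1, Z2, Y)
e1 = glm(Z1 ~ X0, family = binomial, data = dat)
e2 = glm(Z2 ~ Z1 + X1 + X0, family = binomial, data = dat)
ipw.ht = function(z1, z2) {
  p1 = predict(e1, type = "response"); p1 = ifelse(z1 == 1, p1, 1 - p1)
  p2 = predict(e2, type = "response"); p2 = ifelse(z2 == 1, p2, 1 - p2)
  mean((dat$Z1 == z1) * (dat$Z2 == z2) * dat$Y / (p1 * p2))
}
ipw.ht(1, 1) - ipw.ht(0, 0)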
For simplicity, I focus on the least squares formulation. We can also extend
the discussion to a general loss function.
Under sequential ignorability, we can solve β from the following minimiza-
tion problem that only involves observables.
Theorem 29.3 (IPW under MSM) Under Assumption 29.1 and Definition 29.2,

β = arg min_b E[ Σ_{z2} Σ_{z1} 1(Z1 = z1 )1(Z2 = z2 ) / {e(z1 , X0 ) e(z2 , Z1 , X1 , X0 )} · {Y − f (z1 , z2 , X0 ; b)}² ].
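In practice the population minimization translates into a weighted least squares fit: each unit is weighted by the inverse of the product of its two fitted propensity scores. The sketch below assumes the illustrative MSM f (z1 , z2 , X0 ; b) = b0 + b1 z1 + b2 z2 + b3 X0 and uses simulated data from a hypothetical data-generating process.

## IPW estimation of a marginal structural model (a sketch)
set.seed(3)
n  = 5000
X0 = rnorm(n); Z1 = rbinom(n, 1, plogis(X0))
X1 = rnorm(n, X0 + Z1); Z2 = rbinom(n, 1, plogis(X1))
Y  = rnorm(n, X1 + Z1 + Z2)
dat = data.frame(X0, Z1, X1, Z2, Y)
e1 = glm(Z1 ~ X0, family = binomial, data = dat)
e2 = glm(Z2 ~ Z1 + X1 + X0, family = binomial, data = dat)
p1 = ifelse(dat$Z1 == 1, predict(e1, type = "response"),
            1 - predict(e1, type = "response"))
p2 = ifelse(dat$Z2 == 1, predict(e2, type = "response"),
            1 - predict(e2, type = "response"))
msm = lm(Y ~ Z1 + Z2 + X0, data = dat, weights = 1 / (p1 * p2))
coef(msm)                      # estimated MSM coefficients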
g1 (0, X0 ; β) = 0
and
g2 (0, z1 , X1 , X0 ; β) = 0 for all z1 .
Two leading choices of Definition 29.3 are below.
Compare Definitions 29.2 and 29.3. The structural nested model allows
for adjusting for the time-varying covariates whereas the marginal structural
model only allows for adjusting for baseline covariates. The estimation under
Definition 29.3 is more involved. A strategy is to estimate the parameter based
on estimating equations.
I first introduce two important building blocks for discussing the estimation. Define
U2 (β) = Y − g2 (Z2 , Z1 , X1 , X0 ; β)
and
U1 (β) = Y − g2 (Z2 , Z1 , X1 , X0 ; β) − g1 (Z1 , X0 ; β).
They are not directly computable from the data because they depend on the
true value of the parameter β. At the true value, they have the following
properties.
Lemma 29.1 Under Assumption 29.1 and Definition 29.3, we have
E{U2 (β) | Z2 , Z1 , X1 , X0 } = E{U2 (β) | Z1 , X1 , X0 }
= E{Y (Z1 , 0) | Z1 , X1 , X0 }
and
E{U1 (β) | Z1 , X0 } = E{U1 (β) | X0 }
= E{Y (0, 0) | X0 }.
Lemma 29.1 involves a subtle notation Y (Z1 , 0) because Z1 is random.
It should be read as Y (Z1 , 0) = Z1 Y (1, 0) + (1 − Z1 )Y (0, 0). Based on the
definitions and Lemma 29.1, U1 (β) acts as the control potential outcome before
receiving any treatment and U2 (β) acts as the control potential outcome after
receiving the treatment at time point 1.
Proof of Lemma 29.1: First, we have
E{U2 (β) | Z2 = 1, Z1 , X1 , X0 } = E{Y (Z1 , 1) − g2 (1, Z1 , X1 , X0 ; β) | Z2 = 1, Z1 , X1 , X0 }
= E{Y (Z1 , 0) | Z2 = 1, Z1 , X1 , X0 }
E{U2 (β) | Z2 = 0, Z1 , X1 , X0 } = E{Y (Z1 , 0) − g2 (0, Z1 , X1 , X0 ; β) | Z2 = 0, Z1 , X1 , X0 }
= E{Y (Z1 , 0) | Z2 = 0, Z1 , X1 , X0 }
so
E{U2 (β) | Z2 , Z1 , X1 , X0 } = E{Y (Z1 , 0) | Z2 , Z1 , X1 , X0 } = E{Y (Z1 , 0) | Z1 , X1 , X0 }
where the last identity follows from sequential ignorability. Since the last term
does not depend on Z2 , we also have
E{U2 (β) | Z2 , Z1 , X1 , X0 } = E{U2 (β) | Z1 , X1 , X0 }.
Using the above results, we have
E{U1 (β) | Z1 , X0 } = E{U2 (β) − g1 (Z1 , X0 ; β) | Z1 , X0 }
= E [E{U2 (β) − g1 (Z1 , X0 ; β) | X1 , Z1 , X0 } | Z1 , X0 ]
= E [E{Y (Z1 , 0) − g1 (Z1 , X0 ; β) | X1 , Z1 , X0 } | Z1 , X0 ]
= E{Y (Z1 , 0) − g1 (Z1 , X0 ; β) | Z1 , X0 }
= E{Y (0, 0) | Z1 , X0 }
= E{Y (0, 0) | X0 }
where the last identity follows from sequential ignorability. Since the last term
does not depend on Z1 , we also have
E{U1 (β) | Z1 , X0 } = E{U1 (β) | X0 }.
□
With Lemma 29.1, we can prove Theorem 29.4 below.
Theorem 29.4 Under Assumption 29.1 and Definition 29.3,
E[ h2 (Z1 , X1 , X0 ){Z2 − e(1, Z1 , X1 , X0 )}U2 (β) ] = 0

and

E[ h1 (X0 ){Z1 − e(1, X0 )}U1 (β) ] = 0

for any functions h1 and h2 , provided that the moments exist.
Proof of Theorem 29.4: Use the tower property by conditioning on
(Z2 , Z1 , X1 , X0 ) and Lemma 29.1 to obtain
E [h2 (Z1 , X1 , X0 ){Z2 − e(1, Z1 , X1 , X0 )}E{U2 (β) | Z2 , Z1 , X1 , X0 }]
= E [h2 (Z1 , X1 , X0 ){Z2 − e(1, Z1 , X1 , X0 )}E{U2 (β) | Z1 , X1 , X0 }] .
Use the tower property by conditioning on (Z1 , X1 , X0 ) to show that the last
identity equals 0.
Similarly, use the tower property by conditioning on (Z1 , X0 ) and Lemma
29.1 to obtain
E [h1 (X0 ){Z1 − e(1, X0 )}E{U1 (β) | Z1 , X0 }]
= E [h1 (X0 ){Z1 − e(1, X0 )}E{U1 (β) | X0 }] .
Use the tower property by conditioning on X0 to show that the last identity
equals 0. □
To use Theorem 29.4, we must specify h1 and h2 to ensure that there are
enough equations for solving β. Example 29.4 below revisits Example 29.2.
Example 29.4 Under Example 29.2, we can choose h1 = 1 and h2 = (1, Z1 )
to obtain
E [{Z2 − e(1, Z1 , X1 , X0 )}{Y − (β2 + β3 Z1 )Z2 }] = 0,
E [Z1 {Z2 − e(1, Z1 , X1 , X0 )}{Y − (β2 + β3 Z1 )Z2 }] = 0,
E [{Z1 − e(1, X0 )}{Y − (β2 + β3 Z1 )Z2 − β1 Z1 }] = 0.
We can then solve for the β’s from the above linear equations; see Problem
29.5. A natural question is whether alternative choices of (h1 , h2 ) can lead to more efficient estimators. The answer is yes. For example, we can choose many (h1 , h2 ) and use the generalized method of moments (Hansen, 1982). The technical details are beyond the scope of this book.
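Because the estimating equations in Example 29.4 are linear in (β1 , β2 , β3 ) once the propensity scores are fitted, they can be solved by simple linear algebra. The R sketch below does this on simulated data from a hypothetical data-generating process that satisfies the structural nested model of Example 29.2.

## solving the estimating equations of Example 29.4 (a sketch)
set.seed(3)
n  = 5000
X0 = rnorm(n); Z1 = rbinom(n, 1, plogis(X0))
X1 = rnorm(n, X0 + Z1); Z2 = rbinom(n, 1, plogis(X1))
Y  = rnorm(n, X1 + Z1 + Z2)
r1 = Z1 - predict(glm(Z1 ~ X0, family = binomial), type = "response")
r2 = Z2 - predict(glm(Z2 ~ Z1 + X1 + X0, family = binomial), type = "response")
## with h1 = 1 and h2 = (1, Z1), the three equations are linear in beta
A = rbind(c(0,             mean(r2 * Z2),      mean(r2 * Z1 * Z2)),
          c(0,             mean(Z1 * r2 * Z2), mean(Z1 * r2 * Z1 * Z2)),
          c(mean(r1 * Z1), mean(r1 * Z2),      mean(r1 * Z1 * Z2)))
b = c(mean(r2 * Y), mean(Z1 * r2 * Y), mean(r1 * Y))
solve(A, b)   # estimates of (beta1, beta2, beta3); about (2, 1, 0) here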
Naimi et al. (2017) and Vansteelandt and Joffe (2014) provided tutorials
on the structural nested models.
Definition 29.4 (structural nested model with a single time point) The
conditional mean of the individual effect is
2. We have

E[ h(X){Z − e(X)}{Y − g(Z, X; β)} ] = 0   (29.9)
That is, (β0 , β1 ) equal the coefficients in the two-stage least squares fit of Y on (Z, XZ), with (Z − e(X), {Z − e(X)}X) as the instrumental variables for (Z, XZ).
e(z1 , X0 ) = pr(Z1 = z1 | X0 ),
...
e(zk , Z̄k−1 , X̄k−1 ) = pr(Zk = zk | Z̄k−1 , X̄k−1 ),
...
e(zK , Z̄K−1 , X̄K−1 ) = pr(ZK = zK | Z̄K−1 , X̄K−1 ).
Theorem 29.6 (IPW with multiple time points) Under Assumption 29.2,

E{Y (z̄K )} = E[ 1(Z1 = z1 ) · · · 1(ZK = zK )Y / {e(z1 , X0 ) · · · e(zK , Z̄K−1 , X̄K−1 )} ].
Based on Theorem 29.7, construct the Horvitz–Thompson and Hajek es-
timators.
E{Y (z K ) | X0 } = f (z K , X0 ; β).
Theorem 29.7 below shows that under Assumption 29.2, we can solve β from
a minimization problem that only involves observables.
Theorem 29.7 (IPW for MSM with multiple time points) Under Assumption 29.2,

β = arg min_b E[ Σ_{z̄K} 1(Z1 = z1 ) · · · 1(ZK = zK ) / {e(z1 , X0 ) · · · e(zK , Z̄K−1 , X̄K−1 )} · {Y − f (z̄K , X0 ; b)}² ].
for all k = 1, . . . , K.
Appendices
A1
Probability and Statistics
A1.1 Probability
A1.1.1 Tower property and variance decomposition
Given random variables or vectors A, B, C, we have
E(A) = E{E(A | B)}
and
E(A | C) = E{E(A | B, C) | C}.
Given a random variable A and random variables or vectors B, C, we have
var(A) = E{var(A | B)} + var{E(A | B)}
and
var(A | C) = E{var(A | B, C) | C} + var{E(A | B, C) | C}.
Similarly, we can decompose the covariance as
cov(A1 , A2 ) = E {cov(A1 , A2 | B)} + cov{E(A1 | B), E(A2 | B)}
and
cov(A1 , A2 | C) = E {cov(A1 , A2 | B, C) | C}+cov{E(A1 | B, C), E(A2 | B, C) | C}.
The law of large numbers in Theorem A1.1 states that the sample average
is close to the population mean in the limit.
Theorem A1.2 (central limit theorem) If X1 , . . . , Xn are IID copies of X with var(X) < ∞, then

{X̄ − E(X)} / √{var(X)/n} → N(0, 1)

in distribution.
The central limit theorem in Theorem A1.2 states that the standardized
sample average is close to a standard Normal random variable in the limit.
Theorems A1.1 and A1.2 assume IID random variables for convenience. There are also many laws of large numbers and central limit theorems for the sample mean of independent random variables (e.g., Durrett, 2019).
in distribution.
I will omit the proof of Theorem A1.3. It is intuitive based on the first-order
Taylor expansion:
g(Xn ) − g(µ) ≈ ∇g(µ)T (Xn − µ).
E(θ̂) = θ
for all possible values of θ and η.
θ̂ → θ
in probability as the sample size approaches infinity, for all possible values of θ and η.
Unbiasedness requires that the mean of the estimator is identical to the parameter of interest. Consistency requires that the estimator is close to the true parameter in the limit. Unbiasedness does not imply consistency, and consistency does not imply unbiasedness either. Unbiasedness can be restrictive because it is impossible to achieve even in some simple statistics problems. Consistency is often the basic requirement in most statistics problems.
Definition A1.7 When H0 holds, define the type one error rate of the test
ϕ as the maximum possible value of the probability
pr(ϕ = 1).
A standard choice is to make sure that the type one error rate is below
α = 0.05. The type two error rate of the test is the probability of no rejection
if the null hypothesis does not hold. I review the definition below.
Definition A1.8 When H0 does not hold, define the type two error rate of
the test ϕ as the probability
pr(ϕ = 0).
Given the control of the type one error rate under H0 , we hope the type
two error rate is as low as possible when H0 does not hold.
pr(θ ∈ Θ̂) = 1 − α.
Then we can reject the null hypothesis H0 (c) : θ = c if c is not in the set Θ̂.
This is a valid test because when θ indeed equals c, we have correct type one
error rate pr(θ ̸∈ Θ̂) = α. Conversely, if we test a sequence of null hypotheses
He argued that the sum n11 + n01 ≡ n·1 contains little information for the
difference between p1 and p0 , and n11 conditioning on the sum has Hypergeo-
metric distribution that does not depend on the unknown parameter p1 = p0
under H0 :
pr(n11 = k) = ( n·1 choose k )( n − n·1 choose n1 − k ) / ( n choose n1 ).
Therefore, we can estimate the risk difference, log risk ratio, and log odds ratio

rd = p1 − p0 ,
log rr = log(p1 /p0 ),
log or = log[ {p1 /(1 − p1 )} / {p0 /(1 − p0 )} ]

by their sample analogs

r̂d = p̂1 − p̂0 ,
log r̂r = log(p̂1 /p̂0 ),
log ôr = log[ {p̂1 /(1 − p̂1 )} / {p̂0 /(1 − p̂0 )} ] = log{ n11 n00 / (n10 n01 ) }.
where
are the sample variances of the outcomes under the treatment and control,
respectively. Unfortunately, the exact distribution of tunequal depends on the unknown variances. Testing H0 without assuming equal variances is the famous Behrens–Fisher problem. With large sample sizes n1 and n0 , the central limit theorem ensures that tunequal is approximately N(0, 1). So we can construct an approximate test for H0 .
A1.5 Bootstrap
It is often very tedious to derive the variance formulas for complex estimators.
Efron (1979) proposed the bootstrap as a general tool for variance estimation.
There are many versions of the bootstrap (Davison and Hinkley, 1997). In this
book, we only need the most basic one: the nonparametric bootstrap, which
will be simply called the bootstrap.
Consider the generic setting with
IID
Y1 , . . . , Yn ∼ Y,
where Yi can be a general random element denoting the observed data for
unit i. An estimator θ̂ is a function of the observed data: θ̂ = T (Y1 , . . . , Yn ).
When T is a complex function, it may not be easy to obtain the variance or
asymptotic variance of θ̂.
The uncertainty of θ̂ is driven by the IID sampling of Y1 , . . . , Yn from the
true distribution. Although the true distribution is unknown, it can be well
approximated by its empirical version

F̂n (y) = n−1 Σ_{i=1}^n I(Yi ≤ y),
where θ̄∗ = B−1 Σ_{b=1}^B θ̂∗_b . The bootstrap confidence interval based on the Normal approximation is then

θ̂ ± z1−α/2 √V̂boot ,

where z1−α/2 is the 1 − α/2 upper quantile of N(0, 1).
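A minimal implementation of the nonparametric bootstrap, for a generic estimator that maps a data frame to a scalar, is sketched below; the illustrative data and estimator are hypothetical.

## nonparametric bootstrap standard error for a generic estimator (a sketch)
boot.se = function(data, estimator, B = 500) {
  n = nrow(data)
  boot.est = replicate(B, {
    index = sample(n, n, replace = TRUE)     # resample units with replacement
    estimator(data[index, , drop = FALSE])
  })
  sd(boot.est)                               # bootstrap standard error
}
## example: Normal-approximation confidence interval for a correlation
set.seed(4)
dat = data.frame(x = rnorm(100), y = rnorm(100))
est = cor(dat$x, dat$y)
se  = boot.se(dat, function(d) cor(d$x, d$y))
est + c(-1, 1) * qnorm(0.975) * se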
which equals

γ = E(xy)/E(x2 ).
When x has mean zero, β = γ in the above two population OLS.
We can also rewrite (A2.1) as
y = xT β + ε, (A2.2)
which holds by the definition of the population OLS coefficient and resid-
ual without any modeling assumption. We call (A2.2) the population OLS
decomposition.
With X denoting the matrix with rows xT1 , . . . , xTn and Y = (y1 , . . . , yn )T , we have
β̂ = (XT X)−1 XT Y
Ŷ = X β̂ = X(XT X)−1 XT Y.
H = X(XT X)−1 XT .
= β.
Moreover, the population OLS coefficient does not depend on the distribution
of x. The asymptotic inference in Section A2.1 applies to this model too.
In the special case with var(ε | x) = σ 2 , the asymptotic variance of the
OLS coefficient reduces to
V = σ 2 {E(xxT )}−1 ,

so a simpler moment estimator for the asymptotic variance of β̂ is

V̂ols = σ̂ 2 ( Σ_{i=1}^n xi xTi )−1 ,

where σ̂ 2 = (n − p)−1 Σ_{i=1}^n ε̂2i . This is the standard covariance estimator from the lm function.
which satisfies
E{wx(y − xT βw )} = 0
and thus equals
βw = {E(wxxT )}−1 E(wxy)
if E(wxxT ) is invertible.
At the sample level, we can define the WLS coefficient as

β̂w = arg min_b Σ_{i=1}^n wi (yi − xTi b)2 ,

which satisfies

Σ_{i=1}^n wi xi (yi − xTi β̂w ) = 0

and thus equals

β̂w = ( n−1 Σ_{i=1}^n wi xi xTi )−1 ( n−1 Σ_{i=1}^n wi xi yi )

if Σ_{i=1}^n wi xi xTi is invertible.
logit{pr(yi = 1 | xi )} ≡ log[ pr(yi = 1 | xi ) / {1 − pr(yi = 1 | xi )} ] = xTi β.
where · · · contains all the other regressors xi2 , . . . , xip . Therefore, the coefficient β1 equals the log odds ratio of xi1 on yi conditional on the other regressors.
Let β̂ denote the maximizer, which is called the maximum likelihood estimate
(MLE). Taking the log of L(β) and differentiating it with respect to β, we can
show that the MLE must satisfy the first order condition:

Σ_{i=1}^n xi {yi − π(xi , β̂)} = 0;
that is, the average of the observed yi ’s must be identical to the average of
the fitted probabilities π(xi , β̂)’s.
Using the general theory for the MLE, we can show that it is consistent
for the true parameter β and is asymptotically Normal:

√n (β̂ − β) → N(0, V )

in distribution, where V = [ E{π(x, β)(1 − π(x, β))xxT } ]−1 . So we can approximate the covariance matrix of β̂ by

[ Σ_{i=1}^n π(xi , β̂){1 − π(xi , β̂)}xi xTi ]−1 .
In R, the glm function can find the MLE and report the estimated covariance
matrix.
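The following R sketch, on simulated data, verifies the first-order condition and compares the covariance formula above with the covariance matrix reported by glm.

## logistic MLE: first-order condition and covariance matrix (a sketch)
set.seed(5)
n = 1000
x = rnorm(n)
y = rbinom(n, 1, plogis(0.5 + x))
fit = glm(y ~ x, family = binomial)
X = cbind(1, x)                        # regressor matrix with intercept
p = fitted(fit)                        # pi(x_i, beta.hat)
colSums(X * (y - p))                   # approximately zero
solve(t(X) %*% (p * (1 - p) * X))      # covariance formula above
vcov(fit)                              # essentially the same matrix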
pr(si = 1 | xi , yi ) = pr(si = 1 | yi )
and
That is, if the regressor is binary in the univariate WLS, the coefficient of the
regressor equals the difference in the weighted means.
Hint: Think about an appropriate reparametrization of the WLS problem.
Otherwise, the derivation can be tedious.
Assume X1 and X2 are orthogonal, that is, X1T X2 = 0. Show that β̂1
equals the coefficient from OLS of Y on X1 and β̂2 equals the coefficient from
OLS of Y on X2 , respectively.
A3.1 Lemmas
Simple random sampling is a basic topic in standard survey sampling text-
books (e.g., Cochran, 1953). Below I review some results for simple random
sampling that are useful for design-based inference in the CRE in Chapters 3
and 4.
A simple random sample of size n1 consists of a subset from a finite popula-
tion of n units indexed by i = 1, . . . , n. Let Z = (Z1 , . . . , Zn ) be the inclusion
indicators of the n units with
Zi = 1 if unit i is sampled and Zi = 0 otherwise.
The vector Z can take ( n choose n1 ) possible permutations of a vector of n1 1's and n0 0's, and each has equal probability. The following lemma summarizes the first two moments of the inclusion indicators.
their covariance is

Scd = (n − 1)−1 Σ_{i=1}^n (ci − c̄)(di − d̄).
Lemma A3.2 below gives the moments of the sample means c̄ˆ and d̄ˆ.

Lemma A3.2 The sample means are unbiased for the population means:

E(c̄ˆ) = c̄ ,   E(d̄ˆ) = d̄ .
Lemma A3.3 The sample variances and covariance are unbiased for their
population versions:
then

(c̄ˆ − c̄) / √{ n0 Sc2 /(n n1 ) } → N(0, 1),

where z1−α/2 is the 1 − α/2 upper quantile of the standard Normal random variable.
A3.2 Proofs
Proof of Lemma A3.1: By symmetry, the Zi 's have the same mean, so

n1 = Σ_{i=1}^n Zi = E( Σ_{i=1}^n Zi ) = nE(Zi ) =⇒ E(Zi ) = n1 /n.
□
Proof of Lemma A3.2: The unbiasedness of the sample mean follows from linearity. For example,

E(c̄ˆ) = E( n1−1 Σ_{i=1}^n Zi ci ) = n1−1 Σ_{i=1}^n E(Zi )ci = c̄.
Because

0 = { Σ_{i=1}^n (ci − c̄) }{ Σ_{i=1}^n (di − d̄) } = Σ_{i=1}^n (ci − c̄)(di − d̄) + Σ_{i≠j} (ci − c̄)(dj − d̄),
Show that

E(c̄ˆ) = c̄ ,   var(c̄ˆ) = n0 Sc2 /(n n1 ) ,   E(Ŝc2 ) = Sc2 .
Bibliography
Amarante, V., Manacorda, M., Miguel, E., and Vigorito, A. (2016). Do cash
transfers improve birth outcomes? evidence from matched vital statistics,
program, and social security data. American Economic Journal: Economic
Policy, 8:1–43.
Angrist, J. D. and Evans, W. N. (1998). Children and their parents’ labor sup-
ply: Evidence from exogenous variation in family size. American Economic
Review, 88:450–477.
Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation
of average causal effects in models with variable treatment intensity. Journal
of the American Statistical Association, 90:431–442.
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of
causal effects using instrumental variables (with discussion). Journal of the
American Statistical Association, 91:444–455.
Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance
affect schooling and earnings? Quarterly Journal of Economics, 106:979–
1014.
Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An
Empiricist’s Companion. Princeton: Princeton University Press.
Angrist, J. D. and Pischke, J.-S. (2014). Mastering 'Metrics: The Path from Cause to Effect. Princeton: Princeton University Press.
Aronow, P. M., Green, D. P., and Lee, D. K. K. (2014). Sharp bounds on the
variance in randomized experiments. Annals of Statistics, 42:850–871.
Asher, S. and Novosad, P. (2020). Rural roads and local economic develop-
ment. American Economic Review, 110:797–823.
Baker, S. G. and Lindeman, K. S. (1994). The paired availability design: a pro-
posal for evaluating epidural analgesia during labor. Statistics in Medicine,
13:2269–2278.
Balke, A. and Pearl, J. (1997). Bounds on treatment effects from studies
with imperfect compliance. Journal of the American Statistical Association,
92:1171–1176.
Ball, S., Bogatz, G., Rubin, D., and Beaton, A. (1973). Reading with television: An evaluation of The Electric Company. A report to the Children's Television Workshop, Volumes 1 and 2.
Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data
and causal inference models. Biometrics, 61:962–973.
Barnard, G. A. (1947). Significance tests for 2 × 2 tables. Biometrika, 34:123–
138.
Baron, R. M. and Kenny, D. A. (1986). The moderator-mediator variable dis-
tinction in social psychological research: Conceptual, strategic, and statisti-
cal considerations. Journal of Personality and Social Psychology, 51:1173–
1182.
Bazzano, L. A., He, J., Muntner, P., Vupputuri, S., and Whelton, P. K. (2003).
Relationship between cigarette smoking and novel risk factors for cardiovas-
cular disease in the United States. Annals of Internal Medicine, 138:891–
897.
Berk, R., Pitkin, E., Brown, L., Buja, A., George, E., and Zhao, L. (2013).
Covariance adjustments for the analysis of randomized field experiments.
Evaluation Review, 37:170–196.
Bertrand, M. and Mullainathan, S. (2004). Are Emily and Greg more em-
ployable than Lakisha and Jamal? A field experiment on labor market dis-
crimination. American Economic Review, 94:991–1013.
Bickel, P. J., Hammel, E. A., and O’Connell, J. W. (1975). Sex bias in graduate
admissions: Data from Berkeley. Science, 187:398–404.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Ef-
ficient and Adaptive Estimation for Semiparametric Models. Baltimore:
Johns Hopkins University Press.
Bind, M.-A. C. and Rubin, D. B. (2020). When possible, report a fisher-
exact p value and display its underlying null randomization distribution.
Proceedings of the National Academy of Sciences of the United States of
America, 117:19151–19158.
Bowden, J., Spiller, W., Del Greco M, F., Sheehan, N., Thompson, J., Minelli,
C., and Davey Smith, G. (2018). Improving the visualization, interpretation
Charig, C. R., Webb, D. R., Payne, S. R., and Wickham, J. E. (1986). Compar-
ison of treatment of renal calculi by open surgery, percutaneous nephrolitho-
tomy, and extracorporeal shockwave lithotripsy. British Medical Journal,
292:879–882.
Chen, H., Geng, Z., and Jia, J. (2007). Criteria for surrogate end points.
Journal of the Royal Statistical Society: Series B (Statistical Methodology),
69:919–932.
D’Amour, A., Ding, P., Feller, A., Lei, L., and Sekhon, J. (2021). Overlap in
observational studies with high-dimensional covariates. Journal of Econo-
metrics, 221:644–654.
Davey Smith, G. and Ebrahim, S. (2003). “Mendelian randomization”: can
genetic epidemiology contribute to understanding environmental determi-
nants of disease? International Journal of Epidemiology, 32:1–22.
Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their
Application. Cambridge: Cambridge University Press.
Dawid, A. P. (1979). Conditional independence in statistical theory. Journal
of the Royal Statistical Society: Series B (Methodological), 41:1–15.
Dawid, A. P. (2000). Causal inference without counterfactuals (with discus-
sion). Journal of the American Statistical Association, 95:407–424.
Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental stud-
ies: Reevaluating the evaluation of training programs. Journal of the Amer-
ican statistical Association, 94:1053–1062.
Ding, P. (2016). A paradox from randomization-based causal inference (with
discussion). Statistical Science, 32:331–345.
Ding, P. (2021). The Frisch–Waugh–Lovell theorem for standard errors. Statis-
tics and Probability Letters, 168:108945.
Ding, P. and Dasgupta, T. (2016). A potential tale of two by two tables
from completely randomized experiments. Journal of American Statistical
Association, 111:157–168.
Ding, P. and Dasgupta, T. (2017). A randomization-based perspective on
analysis of variance: a test statistic robust to treatment effect heterogeneity.
Biometrika, 105:45–56.
Ding, P., Feller, A., and Miratrix, L. (2019). Decomposing treatment effect
variation. Journal of the American Statistical Association, 114:304–317.
Ding, P., Geng, Z., Yan, W., and Zhou, X.-H. (2011). Identifiability and esti-
mation of causal effects by principal stratification with outcomes truncated
by death. Journal of the American Statistical Association, 106:1578–1591.
Ding, P. and Li, F. (2018). Causal inference: A missing data perspective.
Statistical Science, 33:214–237.
Ding, P., Li, X., and Miratrix, L. W. (2017a). Bridging finite and super
population causal inference. Journal of Causal Inference, 5:20160027.
Ding, P. and Lu, J. (2017). Principal stratification analysis using principal
scores. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology), 79:757–777.
Hahn, J., Todd, P., and Van der Klaauw, W. (2001). Identification and esti-
mation of treatment effects with a regression-discontinuity design. Econo-
metrica, 69:201–209.
Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2020). Bayesian regression
tree models for causal inference: regularization, confounding, and heteroge-
neous effects. Bayesian Analysis, 15:965–1056.
Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate
reweighting method to produce balanced samples in observational studies.
Political Analysis, 20:25–46.
Hájek, J. (1960). Limiting distributions in simple random sampling from a fi-
nite population. Publications of the Mathematics Institute of the Hungarian
Academy of Science, 5:361–74.
Hájek, J. (1971). Comment on “an essay on the logical foundations of survey
sampling, part one”. The foundations of survey sampling, 236.
Hearst, N., Newman, T. B., and Hulley, S. B. (1986). Delayed effects of the
military draft on mortality. New England Journal of Medicine, 314:620–624.
Lee, M.-J. (2018). Simple least squares estimator for treatment effects using
propensity score residuals. Biometrika, 105:149–164.
Lee, W.-C. (2011). Bounding the bias of unmeasured factors with confounding
and effect-modifying potentials. Statistics in Medicine, 30:1007–1017.
Li, F., Mattei, A., and Mealli, F. (2015). Evaluating the causal effect of uni-
versity grants on student dropout: evidence from a regression discontinuity
design using principal stratification. Annals of Applied Statistics, 9:1906–
1931.
Li, F., Morgan, K. L., and Zaslavsky, A. M. (2018a). Balancing covariates via
propensity score weighting. Journal of the American Statistical Association,
113:390–400.
Li, F., Thomas, L. E., and Li, F. (2019). Addressing extreme propensity scores
via the overlap weights. American Journal of Epidemiology, 188:250–257.
Li, X. and Ding, P. (2016). Exact confidence intervals for the average causal
effect on a binary outcome. Statistics in Medicine, 35:957–960.
Li, X. and Ding, P. (2017). General forms of finite population central limit
theorems with applications to causal inference. Journal of the American
Statistical Association, 112:1759–1769.
Li, X. and Ding, P. (2020). Rerandomization and regression adjustment.
Journal of the Royal Statistical Society, Series B (Statistical Methodology),
82:241–268.
Li, X., Ding, P., and Rubin, D. B. (2018b). Asymptotic theory of reran-
domization in treatment-control experiments. Proceedings of the National
Academy of Sciences of the United States of America, 115:9157–9162.
Lin, W. (2013). Agnostic notes on regression adjustments to experimental
data: Reexamining Freedman’s critique. Annals of Applied Statistics, 7:295–
318.
Lin, Z., Ding, P., and Han, F. (2023). Estimation based on nearest neighbor
matching: from density ratio to average treatment effect. Econometrica.
Lind, J. (1753). A treatise of the scurvy. Three Parts. Containing an Inquiry
into the Nature, Causes and Cure, of that Disease. Together with a Critical
and Chronological View of what has been Published on the Subject.
Lipsitch, M., Tchetgen Tchetgen, E., and Cohen, T. (2010). Negative con-
trols: a tool for detecting confounding and bias in observational studies.
Epidemiology, 21:383–388.
Little, R. and An, H. (2004). Robust likelihood-based analysis of multivariate
data with missing values. Statistica Sinica, 14:949–968.
Liu, H. and Yang, Y. (2020). Regression-adjusted average treatment effect
estimates in stratified randomized experiments. Biometrika, 107:935–948.
Long, J. S. and Ervin, L. H. (2000). Using heteroscedasticity consistent stan-
dard errors in the linear regression model. American Statistician, 54:217–
224.
Lu, S. and Ding, P. (2023). Flexible sensitivity analysis for causal in-
ference in observational studies subject to unmeasured confounding.
https://arxiv.org/abs/2305.17643.
Lumley, T., Shaw, P. A., and Dai, J. Y. (2011). Connections between sur-
vey calibration estimators and semiparametric models for incomplete data.
International Statistical Review, 79:200–220.
McGrath, S., Young, J. G., and Hernán, M. A. (2021). Revisiting the g-null
paradox. Epidemiology, 33:114–120.
Mealli, F. and Pacini, B. (2013). Using secondary outcomes to sharpen in-
ference in randomized experiments with noncompliance. Journal of the
American Statistical Association, 108:1120–1131.
Robins, J., Sued, M., Lei-Gomez, Q., and Rotnitzky, A. (2007). Comment:
Performance of double-robust estimators when inverse probability weights
are highly variable. Statistical Science, 22:544–559.
Robins, J. M. (1999). Association, causation, and marginal structural models.
Synthese, 121:151–179.
Robins, J. M. and Greenland, S. (1992). Identifiability and exchangeability
for direct and indirect effects. Epidemiology, 3:143–155.
Robins, J. M., Hernan, M. A., and Brumback, B. (2000). Marginal structural
models and causal inference in epidemiology. Epidemiology, 11:550–560.
Rudolph, K. E., Goin, D. E., Paksarian, D., Crowder, R., Merikangas, K. R.,
and Stuart, E. A. (2018). Causal mediation analysis with observational data:
considerations and illustration examining mechanisms linking neighborhood
poverty to adolescent substance use. American Journal of Epidemiology,
188:598–608.
Sekhon, J. S. (2009). Opiates for the matches: Matching methods for causal
inference. Annual Review of Political Science, 12:487–508.
Sekhon, J. S. (2011). Multivariate and propensity score matching software
with automated balance optimization: The matching package for R. Journal
of Statistical Software, 47:1–52.
Sekhon, J. S. and Titiunik, R. (2017). On interpreting the regression discon-
tinuity design as a local experiment. In Regression Discontinuity Designs,
volume 38. Emerald Publishing Limited.
Shinozaki, T. and Matsuyama, Y. (2015). Doubly robust estimation of stan-
dardized risk difference and ratio in the exposed population. Epidemiology,
26:873–877.
Sobel, M. E. (1986). Some new results on indirect effects and their standard
errors in covariance structure models. Sociological Methodology, 16:159–186.
Sommer, A. and Zeger, S. L. (1991). On estimating efficacy from clinical trials.
Statistics in Medicine, 10:45–52.
Tao, Y. and Fu, H. (2019). Doubly robust estimation of the weighted average
treatment effect for a target population. Statistics in Medicine, 38:315–325.
Theil, H. (1953). Estimation and simultaneous correlation in complete equa-
tion systems. central planning bureau. Technical report, Mimeo, The Hague.
Wager, S., Du, W., Taylor, J., and Tibshirani, R. J. (2016). High-dimensional
regression adjustments in randomized experiments. Proceedings of the Na-
tional Academy of Sciences of the United States of America, 113:12673–
12678.
Wald, A. (1940). The fitting of straight lines if both variables are subject to
error. Annals of Mathematical Statistics, 11:284–300.
Wang, L., Zhang, Y., Richardson, T. S., and Zhou, X.-H. (2020). Robust
estimation of propensity score weights via subclassification. arXiv preprint
arXiv:1602.06366.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator
and a direct test for heteroskedasticity. Econometrica, 48:817–838.
Wooldridge, J. (2016). Should instrumental variables be used as matching
variables? Research in Economics, 70:232–237.
Wooldridge, J. M. (2015). Control function methods in applied econometrics.
Journal of Human Resources, 50:420–445.
Wu, J. and Ding, P. (2021). Randomization tests for weak null hypotheses in
randomized experiments. Journal of the American Statistical Association,
116:1898–1913.
Yang, F. and Small, D. S. (2016). Using post-outcome measurement infor-
mation in censoring-by-death problems. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 78:299–318.
Yang, S. and Ding, P. (2018). Asymptotic causal inference with observational
studies trimmed by the estimated propensity scores. Biometrika, 105:487–
493.
Zelen, M. (1979). A new design for randomized clinical trials. New England
Journal of Medicine, 300:1242–1245.
Zhang, J. L. and Rubin, D. B. (2003). Estimation of causal effects via principal
stratification when some outcomes are truncated by “death”. Journal of
Educational and Behavioral Statistics, 28:353–368.
Zhang, J. L., Rubin, D. B., and Mealli, F. (2009). Likelihood-based analysis of
causal effects of job-training programs using principal stratification. Journal
of the American Statistical Association, 104:166–176.
Zhang, M. and Ding, P. (2022). Interpretable sensitivity analysis for the baron-
kenny approach to mediation with unmeasured confounding. arXiv preprint
arXiv:2205.08030.
Zhao, A. and Ding, P. (2021a). Covariate-adjusted Fisher randomization tests
for the average treatment effect. Journal of Econometrics, 225:278–294.
Zhao, A. and Ding, P. (2021b). No star is good news: A unified look at reran-
domization based on p-values from covariate balance tests. arXiv preprint
arXiv:2112.10545.
Zhao, Q., Wang, J., Hemani, G., Bowden, J., and Small, D. (2020). Statisti-
cal inference in two-sample summary-data Mendelian randomization using
robust adjusted profile score. Annals of Statistics, 48:1742–1769.