Horowitz Sinander Notes
Ludvig Sinander
Northwestern University
This version: 12 December 2018
These notes are based on an econometrics course for first-year PhD students
taught by Joel Horowitz at Northwestern in winter 2016. The topics are limit
theory and the asymptotic properties of extremum estimators.
I thank Joel for teaching a great class and for agreeing to let me share these
notes, and Ahnaf Al Rafi, Bence Bardóczy, Ricardo Dahis, Michael Gmeiner
and Joe Long for reporting errors.
Copyright © 2020 Carl Martin Ludvig Sinander.
Contents

1 Outline 5
2 Probability theory 8
2.1 Measurable spaces 8
2.2 Measures 9
2.3 Measurable functions 11
2.4 Independence 13
2.5 The Lebesgue integral 14
2.6 The Radon–Nikodým theorem 16
2.7 Conditional probability 18
2.8 Inequalities 22
3 Modes of convergence 26
3.1 Convergence of random sequences 26
3.2 Convergence of random functions 29
3.3 Convergence of measures 32
3.4 Relationships between modes of convergence 34
3.5 The Borel–Cantelli lemmata 38
3.6 Convergence of moments 39
3.7 Characteristic functions 40
3.8 The continuous mapping theorem 44
3.9 Stochastic order notation 47
3.10 The delta method 48
7 Asymptotic properties of extremum estimators 75
7.1 Preliminaries 75
7.2 Measurability 76
7.3 Consistency 78
7.4 Asymptotic normality 84
7.5 Estimating the asymptotic variance 88
7.6 Asymptotic normality with a nonsmooth objective 90
References 135
1 Outline
Econometrics is about inferring functional relations among variables from
data. Schematically, we observe a set of realisations of the random vector (x, y),
and wish to learn the function f(·, 0) that satisfies y = f(x, u) for some
unknown random vector u.1 One intuitive way of thinking about this problem
is to divide it into two parts: learning the parametric form (‘shape’) of f (·, 0),
and learning the values of its parameters. (This intuition underlies parametric
methods of estimation. Many nonparametric methods do not divide things
up in this way.)
We’ll need to sharpen up the question a bit in order to answer it. In
general, we cannot learn f (·, 0). As we know from Manski’s course, the most
that we can hope to learn is the joint distribution P(x, y) of (x, y). Often,
we are interested in some feature of the joint distribution, such as P(y|x),
E(y|x) or some quantile of y conditional on x. These objects can be thought
of as features of the function f when convenient.
A natural approach to estimating P(x, y) or P(y|x) when the support is
finite is to use the empirical distribution. By a law of large numbers, this
gives a pointwise consistent estimate.2 When the support is uncountable, we
could of course discretise the outcome space and apply the same reasoning.
One drawback to this approach is that we’ll often need a fine grid to provide
a good approximation, in which case we’ll need an astronomical dataset in
order to have more than zero or one observation per cell. Another drawback
is that we lose the tractability of analysis. (Discrete maths can be ugly.)
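To fix ideas, here is a minimal sketch (mine, not from the course) of estimating a discrete joint distribution by cell frequencies; the simulated data-generating process is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n draws of (x, y) with finite support {0, 1} x {0, 1}.
n = 10_000
x = rng.integers(0, 2, size=n)
y = (x + rng.integers(0, 2, size=n)) % 2

# Empirical joint distribution: the frequency of each support cell.
P_hat = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        P_hat[a, b] = np.mean((x == a) & (y == b))

print(P_hat)        # each cell is consistent for P(x = a, y = b) by a LLN
print(P_hat.sum())  # 1.0: the estimate is itself a distribution
```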
Another issue is that on any finite dataset, there is an infinite number of
lines that you can fit through the data. In order for a fitted line to approximate
the true relationship more and more closely as the sample size increases, we
will therefore require some assumptions on the distribution of (y, x). For
concreteness, consider the conditional mean function g(x) := E(y|x). In
this case, we can estimate g using nonparametric regression provided that
g is continuous. We avoid discretisation by taking local averages, using a
bandwidth that shrinks as the sample size increases. Continuity guarantees
that whatever weighted local average you take (e.g. which kernel you use in
kernel regression), the fitted line will get close to g as the sample size gets
large. Continuity is often a pretty weak assumption in economics.3
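As a concrete illustration of local averaging, here is a hedged sketch of Nadaraya–Watson kernel regression; the Gaussian kernel and the bandwidth h = n^(-1/5) are illustrative choices of mine, not prescriptions from the course.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1_000
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + 0.3 * rng.standard_normal(n)   # so g(x) = E(y|x) = sin(x)

def kernel_regression(x0, x, y, h):
    """Weighted local average of y near x0, with Gaussian kernel weights."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

h = n ** (-1 / 5)                    # bandwidth shrinks as the sample grows
grid = np.linspace(-1.5, 1.5, 7)
g_hat = [kernel_regression(x0, x, y, h) for x0 in grid]
print(np.round(g_hat, 3))            # close to sin(grid), by continuity of g
print(np.round(np.sin(grid), 3))
```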
1. It is wlog to say that we want to learn f(·, 0). If we want to learn f(·, ξ) for ξ ≠ 0, just reparameterise as u′ := u − ξ.
2. In fact, this estimator is uniformly strongly consistent by the Glivenko–Cantelli theorem (Billingsley, 1995, p. 269).
3. In other fields, discontinuity has to be allowed for explicitly. Joel gave the example of image denoising.
A fundamental problem with nonparametric estimation techniques is
the curse of dimensionality. Roughly speaking, this is that the sample size
required to get a given level of estimator precision is exponentially increasing
in the dimension of x. It can be proved that without stronger assumptions,
the curse of dimensionality is unavoidable. Very loosely, the idea is that
you’re asking a finite dataset to tell you about an infinite-dimensional object.
One way of strengthening the assumptions to avoid the curse of dimensionality
is to assume that g belongs to a finite-dimensional family of functions.
Schematically, we assume that g(x) = G(x, θ) where G is a known function
and θ is a finite-dimensional, unknown constant, i.e. a parameter. (In this
parametric case, it is often natural to do inference directly on θ rather than
trying to learn g directly.) It turns out (unsurprisingly) that parametrisation
defeats the curse of dimensionality. The price we pay for this victory is the
need to specify the function G. If we misspecify G, the math won’t break,
but the interpretation of results may be way off.
The obvious next concern is the ‘accuracy’ of our estimate of θ (or g). (If
we didn’t care about accuracy, there would be no reason to use the data!)
Since an estimator of θ is a function of the (random) data, an estimator is a
random variable. To characterise accuracy, we have to study this random
variable. The problem is that the distribution of an estimator depends on
the unknown distribution of (x, y). (If we knew the population distribution,
we would once again have no use for a dataset.) Except under very stringent
conditions, we cannot consistently estimate (never mind infer with certainty)
the distribution of an estimator.
The way we get around this is by using approximations to the distribution
of an estimator. If we are to trust these approximations, they must become
increasingly good as the data gets increasingly good, in some sense to be
made precise. The leading example will be asymptotic approximations, which
are approximations that (usually) become increasingly good as the sample
size grows. There are many kinds of asymptotic approximation, but the
general idea is easily illustrated using the simplest central limit theorem.
Suppose that our estimator θ̂ is an average of n iid random variables (many
estimators are), where n is the sample size. Then the central limit theorem
says that n^{1/2}(θ̂ − θ) converges in distribution to N(0, σ²), so that in a
large sample the distribution of θ̂ is approximately N(θ, σ²/n), a
two-dimensional family of distributions!
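A quick Monte Carlo check (my sketch, with an arbitrary Bernoulli design) of this approximation: across repeated samples, the sample mean is centred at θ with variance close to σ²/n.

```python
import numpy as np

rng = np.random.default_rng(2)

theta, n, reps = 0.3, 400, 20_000               # Bernoulli(theta) data
means = rng.binomial(n, theta, size=reps) / n   # one sample mean per replication

sigma2 = theta * (1 - theta)                    # per-observation variance
print(means.mean(), theta)                      # approximately equal
print(means.var(), sigma2 / n)                  # approximately equal: the N(theta, sigma^2/n) scale
```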
Our focus will be on asymptotic theory for parametric estimators. Besides
this restriction, our treatment will be general, though common special cases
will be mentioned along the way. We’ll first cover basic (measure-theoretic)
probability theory, then limit theorems. Once the machinery is in place, we
will develop the asymptotic theory of extremum estimators.
The course starts off with background probability theory, emphasising
concepts required for asymptotic theory. We then state and prove several
laws of large numbers and central limit theorems. With the technical
machinery in place, we establish the consistency and asymptotic normality of
general extremum estimators. We apply these results to maximum likelihood
and generalised-method-of-moments estimators, and cover efficiency and
specification tests while we’re at it. Finally, we study hypothesis testing in
the setting of extremum estimation.
The main references for the course are Amemiya (1985) and Newey and
McFadden (1994). These texts (unlike most others in econometrics) raise and
address all the important technical issues. Joel will also sometimes refer to
Rao (1973) and Serfling (1980), two statistics texts that are worth studying
for anyone interested in research in econometric theory. Several other texts
will be mentioned along the way, e.g. White (2001).
2 Probability theory
Official reading: Rao (1973, ch. 2).
This section covers basic (measure-theoretic) probability theory. There is
a very large number of good texts on measure and probability theory; Joel
mentioned Kolmogorov and Fomin (1975) in particular.
A σ-algebra A on Ω is a collection of subsets of Ω such that
(1) Ω ∈ A.
(2) If A ∈ A, then Ω \ A ∈ A.
(3) If Aj ∈ A for each j ∈ N, then ⋃_{j∈N} Aj ∈ A.
One immediate consequence is that A is closed under countable intersections,
by properties (2) and (3). Another one is ∅ ∈ A, which follows from (1) and
(2).
We do not simply take A = 2^Ω, because for general uncountable Ω, this leads
to paradoxes. This will be less of a problem than it may first appear to be,
because the subsets of Ω missing from the σ-algebras we will be working with
are very strange sets. But it can still give rise to difficulties:
measurability problems arise fairly frequently in econometric and economic
theory.
In the context of probability theory, we sometimes call the measurable
sets ‘events’, and interpret them as ‘something that happens’. To illustrate,
suppose we draw one coloured ball from an urn, formalised by the measurable
space ({red, blue, green}, 2^{red, blue, green}). In ordinary language, one
‘event’ I might describe is ‘I pick a red or a blue ball’. In the formalism,
this corresponds to the event (measurable set) {red, blue}.
We might wonder how to choose our σ-algebra. An important criterion is
that our σ-algebra contain enough subsets of Ω to allow us to study conver-
gence (of measures, of measurable functions and of integrals). Convergence
is a topological notion, so let’s equip Ω with a topology. In order to obtain
convergence results, we will need the σ-algebra to contain enough
topologically interesting subsets of Ω; at the very least, it should contain all of the
open subsets of Ω. It turns out that this minimal requirement is enough for
most purposes, leading to the following definition.
2.2 Measures
Let R̄ := R ∪ {−∞, ∞} be the extended real line.
(2) If Aj ∈ A for each j ∈ N are disjoint, then
µ(⋃_{j∈N} Aj) = Σ_{j∈N} µ(Aj).
A property P is said to hold µ-almost everywhere (µ-a.e.) iff it holds
everywhere outside a set of µ-measure zero. For example, it can be shown
that any monotone function f : R → R is continuous λ-a.e., where λ is
Lebesgue measure on (R, B). When µ is a probability measure and P holds
µ-a.e., we usually say that P holds µ-almost surely (µ-a.s.) or that P holds
with probability 1. More generally, ‘µ-almost’ is used flexibly as an adverb,
e.g. ‘µ-almost all’ or ‘µ-almost every’.
f −1 (B) := {x ∈ F : f (x) ∈ B} .
10. For the special case of functions f : Rn → R with the Borel σ-algebras Bn and B, Bn/B-measurability is sometimes called Borel-measurability.
11. Some confusing (in my view) terminology was used at this point in the lecture. Consider two measurable spaces (F, F) and (G, G), a measure µ on (F, F), and a function f : F → G. Above, I defined the property of F/G-measurability of f. Joel called this same property µ-measurability of f. But as I pointed out, the measure µ has nothing to do with it!
It turns out that in this case, a function f : Ω → Rk is measurable iff
{ω ∈ Ω : f(ω) ≤ z} ∈ A for each z ∈ Rk.
We can now define random elements, which are principal characters in
the sequel.
Definition 7. Let (Ω, A, P) be a probability space, and let (S, S) be a
measurable space. A random element of (S, S) defined on (Ω, A, P) is an
A/S-measurable function X : Ω → S.
The set S in which a random element takes values can be entirely arbitrary;
it need not be a topological or metric space, for example. But often, S will be
a metric space. In this case, we will sometimes abuse terminology by saying
‘random element of (S, ρ) (defined on (Ω, A, P))’, on the understanding that
S is equipped with a σ-algebra, usually the Borel σ-algebra generated by the
topology induced by the metric ρ.
Definition 8. A random variable is a random element of (R, B). A random
n-vector is a random element of (Rn, Bn). A random n × m matrix is a random
element of (Rn×m, Bn×m).
For a random element X : Ω → S and a measurable subset B of S,
X −1 (B) is the set of states of the world ω ∈ Ω at which X(ω) lies in B. We
know that X −1 (B) ∈ A since X is a measurable function. But A may contain
lots of other events that do not correspond to X⁻¹(B) for any B ∈ S. These
other events are not interesting for the study of X, so we sometimes wish to
use the smaller σ-algebra that contains all the sets of interest for X but no
others. In our previous jargon, what we want is the σ-algebra generated by
the interesting sets, viz. σ({X⁻¹(B) : B ∈ S}). This object is usually
denoted σ(X) and called the σ-algebra generated by X.
Definition 10. Let X be a random vector. The cumulative distribution
function (CDF) of X is FX : Rn → [0, 1] defined by
FX (x1 , . . . , xn ) := LX ((−∞, x1 ] × · · · × (−∞, xn ])
for each (x1 , . . . , xn ) ∈ Rn .
It is intuitive (but not quite obvious, I think) that FX fully characterises
the law of a random vector. Precisely stated: for random vectors X and Y ,
LX = LY (setwise) iff FX = FY (pointwise). See Rosenthal (2006, Proposition
6.0.2) for a (very easy) proof.
Some properties of CDFs are that they are right-continuous, nondecreasing,
and satisfy lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. (For n = 1, we
have a converse: any function F : R → [0, 1] with these four properties is the
CDF of some random variable on some probability space.)
We can also define random elements X : Ω → R̄ⁿ that are like random
vectors but can take infinite values. If LX(Rn) < 1, then the distribution LX
of X is said to be defective. Conversely, if LX(Rn) = 1 then the distribution
is called proper. Though we rarely want to work with random vectors that
take infinite values with positive probability, we sometimes obtain a random
vector with a defective distribution as the limit of a sequence of random
vectors with proper distributions.
2.4 Independence
Probability theory is basically measure theory plus independence. This will
become increasingly clear: all of the interesting theorems that we will state
specifically for probability measures (rather than general measures) assume
(some weakened form of) independence.
Definition 11. For a probability space (Ω, A, P), events A, B ∈ A are
independent iff P(A ∩ B) = P(A)P(B).
Sometimes, we wish to impose independence for two whole classes of
events. Call F a sub-σ-algebra of the σ-algebra A iff it is a σ-algebra of
subsets of Ω and F ⊆ A.
Definition 12. For a probability space (Ω, A, P), sub-σ-algebras F and G
of A are independent iff P(F ∩ G) = P(F )P(G) for every F ∈ F and G ∈ G.
Fix a probability space (Ω, A, P) and a measurable space (S, S). Two
random elements X and Y of (S, S) are called independent iff their generated
σ-algebras σ(X) and σ(Y ) are independent. This just means that
P(X ∈ BX , Y ∈ BY ) = P(X ∈ BX )P(Y ∈ BY )
for any measurable subsets BX and BY of S. (So for random vectors, BX
and BY are any Borel sets.)
where 1S is the indicator function for S.13 The Lebesgue integral does not
exist for every measurable function: a boundedness condition is also needed
to avoid the undefined expression ∞ − ∞.
A measurable function is called Lebesgue-integrable iff its integral exists
and is finite.14 Any Riemann-integrable function is also Lebesgue-integrable,
and in such cases the two integrals coincide. But many functions are Lebesgue-
but not Riemann-integrable.15,16
In constructing the Lebesgue integral, it quickly becomes apparent that
we can replace the Lebesgue measure λ with whatever measure µ we like,
leading to the generalised Lebesgue integral ∫_S f dµ. Of course, the conditions
under which f is integrable depend on what measure we're integrating with
respect to. If ∫_S f dµ exists, we say that f is µ-integrable.
Importantly, the Lebesgue integral may be defined for measurable functions
f : Ω → R on any measurable space (Ω, A), not just (say) R or Rn. This
provides a powerful generalisation of Riemann integration; for example, we
can integrate over functional spaces.
Now consider the case in which µ is a probability measure on a measurable
space (Ω, A), so that the measurable function f is a random variable. Let’s
13. I.e. 1S(ω) = 1 iff ω ∈ S, 0 otherwise.
14. A necessary and sufficient condition for a measurable function f to be integrable is that ∫_Ω |f| dλ < ∞.
15. For example, consider the function f : [0, 1] → R such that f(x) = 1(x ∉ Q). This function is not Riemann-integrable, but it is Lebesgue-integrable, with ∫_{[0,1]} f dλ = 1 as we would hope. More generally, there are certain kinds of discontinuity that Lebesgue integration can handle but Riemann integration cannot.
16. In terms of how they are constructed, the difference between the two integrals is that while the Riemann integral considers the limit of a sequence of approximations to the ‘area under f’ constructed by discretising the x axis, the Lebesgue integral considers approximations constructed by discretising the y axis.
use the more familiar notation of P for the measure and X for the random
variable. In this setting, the Lebesgue integral ∫_Ω X dP is also called the
expected value of X (or the expected value of the distribution LX). Since not
all measurable functions admit an integral, there are random variables whose
expectation is undefined. (One example is a Cauchy-distributed random
variable.) Even when the expectation exists, it may be infinite.
There are various kinds of notation for the expected value, including:
E(X) = ∫_Ω X dP = ∫_Ω X(ω) dP(ω) = ∫_Ω X(ω) P(dω).
Finally, we can rewrite the integral in terms of the CDF FX rather than
the law LX. This is unsurprising in view of the fact that CDFs coincide iff
the laws do. The integral w.r.t. a CDF is called a Stieltjes integral, and can
be written in various ways:
E(X) = ∫_R x dFX(x) = ∫_R x FX(dx).
The Stieltjes integral coincides with the Lebesgue integral, but is defined
differently (in terms of the CDF).
Let X be a random variable; then X n for n ∈ N is also a random variable
(easy proof). E (X n ) is called the nth moment of X, and E ((X − E(X))n )
is called the nth central moment. (Of course, a given moment of X need not
exist or be finite.) The second central moment is called the variance of X,
denoted Var(X).
While we’re at it, here are two related concepts. Let X and Y be random
variables. Their covariance is
There are a lot of algebraic shortcuts involving variances, covariances and
correlations that are hopefully familiar from undergrad. I won’t list them
here, but I will make use of them!
since 1A(ω) = 0 for all ω ∈ Ω outside a set of µ-measure zero (viz. the set
A). So we've learned that if (1) holds, then µ ≪ ν. The Radon–Nikodým
theorem is the (far less obvious) converse to this result: if µ ≪ ν, then there
exists a nonnegative f such that (1) holds. Actually, there's a caveat: in
order for the converse to hold, both measures must be σ-finite.
In words, a measure defined on (Ω, A) is σ-finite iff there is a countable,
measurable cover of Ω such that every piece of the cover is assigned finite
measure. It should be obvious that every probability measure is σ-finite. For
reference, here’s a schematic definition:
18. There's some unhelpful confusion of terminology here. Most authors (e.g. Kolmogorov and Fomin (1975) and Billingsley (1995)) use ‘absolutely continuous’ synonymously with ‘dominates’, as in my definition. But Rosenthal (2006, p. 143) defines ‘ν absolutely continuous w.r.t. µ’ to mean that there exists a nonnegative f such that representation (1) holds. By the Radon–Nikodým theorem, the two turn out to be equivalent for σ-finite measures, but they are still distinct properties that should have distinct names.
Definition 14. Let (Ω, A, µ) be a measure space. The measure µ is σ-finite
iff there is a countable collection {Aj}_{j∈N} of sets in A such that
⋃_{j∈N} Aj = Ω and µ(Aj) < ∞ for every j ∈ N.
For example, consider the counting measure c on (N, 2^N), given by
c(A) := #A. Integration w.r.t. counting measure is equivalent to summation:
formally, for any c-integrable (and measurable) function f : N → R,
∫_A f dc = Σ_{n∈A} f(n) for every A ∈ A.
(If you know how the Lebesgue integral is defined, then this should be trivial.)
Now let’s apply the Radon–Nikodým theorem to the σ-finite measures
P and c. The only set to which c assigns measure zero is ∅, and P(∅) = 0;
hence P c. So by the Radon–Nikodým theorem, there is nonnegative and
c-integrable function f : N → R such that
Z X
P(A) = f dc = f (n) for every A ∈ A.
A n∈A
Often, we wish to condition on an event G that occurs with probability zero,
i.e. P(G) = 0. (For example, the realisation of a normally distributed signal.)
The ratio formula does not apply in this case, so a subtler construction is
called for.
To build some intuition, consider a finite probability space (Ω, 2^Ω, P) in
which P({ω}) > 0 ∀ω ∈ Ω. Let Q(A|G) := P(A ∩ G)/P(G) be the ‘ordinary’
probability of A conditional on G. Notice that Q(A|{ω}) = 1(ω ∈ A). It
follows that for events A and G,
P(A ∩ G) = Σ_{ω∈A∩G} P({ω}) = Σ_{ω∈G} 1(ω ∈ A) P({ω})
= Σ_{ω∈G} Q(A|{ω}) P({ω}) = ∫_G Q(A|{ω}) P(dω).
(The Lebesgue integral is used because unlike the sum, it remains defined
when we move to uncountable probability spaces.) We'd like our definition
of conditional probability to respect P(A ∩ G) = ∫_G Q(A|{ω}) P(dω). Since
this property does not involve division by P(G), we can require it to hold
even when P(G) = 0.
We require an additional bit of machinery. Recalling that A is a σ-algebra of
subsets of Ω, we call G a sub-σ-algebra of A iff G is a σ-algebra and G ⊆ A.
Since G contains only a subset of the events in A, it provides a ‘coarser’
description of the state of the world ω ∈ Ω.
Here’s an analogy: I am trying to convey to you the colour of the sky.
The sky can be any colour ω in Ω = [0, 1], where 0 is ‘totally blue’ and 1
is ‘totally red’ (say). I have at my disposal a small vocabulary of English
phrases that I can use to communicate with, consisting of (any combination
of) ‘blue’ (= [0, 13 ]), ‘purple’ (= ( 13 , 23 ]) and ‘red’ (= ( 23 , 1]). Formally, my
language is the σ-algebra generated by the events ‘blue’, ‘purple’ and ‘red’:
n o n o
A=σ [0, 13 ], ( 13 , 32 ], ( 23 , 1] = ∅, [0, 31 ], ( 13 , 23 ], ( 23 , 1], [0, 32 ], ( 13 , 1], [0, 1] .
Now suppose that my English deteriorates: I forget the words ‘blue’ and
‘purple’, and am left only with the coarser term ‘blurple’ (= [0, 23 ]). My newly
worsened language is
n o n o
G=σ [0, 23 ], ( 23 , 1] = ∅, [0, 32 ], ( 23 , 1], [0, 1] .
Back to conditional probability. In general, we are interested in the
probability of A conditional on several different events G. For example,
suppose that we want to condition on some random variable X being realised
in a Borel set B, i.e. the event GB = X −1 (B) = {ω ∈ Ω : X(ω) ∈ B}; usually
we’d like to be able to condition on any Borel event of this sort, not just a
particular one. The rigorous construction of conditional probabilities requires
us to specify in advance what collection of events we want to be able to
condition on. It is perhaps not surprising that we require this collection of
conditioning events to be a σ-algebra. But we also need it to not lead to
measurability problems, and that requires the conditioning σ-algebra to be a
sub-σ-algebra G of A.25
To make the sub-σ-algebra G explicit, we can write the probability of
A conditional on G ∈ G as QG(A|G), which (for fixed A) is a mapping
G → R. It turns out, however, to be technically more convenient to define
the conditional probability as a G-measurable mapping Ω → R, denoted
P(A|G)(ω). Since it's G-measurable, P(A|G)(ω) = P(A|G)(ω′) whenever ω
and ω′ lie in all the same sets G ∈ G; in this sense, the conditional probability
can only vary between states of the world that are distinguishable using G.
Let's try to clarify this using the example of conditioning on a random
variable. We formalise ‘conditioning on a random variable’ as conditioning
on its generated σ-algebra σ(X). Since P(A|σ(X)) is σ(X)-measurable,
P(A|σ(X))(ω) = P(A|σ(X))(ω′) whenever X(ω) = X(ω′). Conversely, if
X(ω) ≠ X(ω′) then in general P(A|σ(X))(ω) ≠ P(A|σ(X))(ω′). Informally,
although P(A|σ(X))(·) is really a function of ω, σ(X)-measurability means
(precisely) that it behaves as if it were a function of (the realised value
of) X. So this construction allows us to condition on events like X = x.26
But importantly, this conditional probability does not give us statements
about the probability of A conditional on (say) X ≤ x. To get conditional
probabilities like that, we need a coarser σ-algebra, since we're conditioning
on ‘larger’ events.
All this chatting has established that we want our conditional probability
P(A|G) to be a G-measurable mapping Ω → R that satisfies
P(A ∩ G) = ∫_G P(A|G) dP for every G ∈ G. The last condition obviously
requires that P(A|G) be P-integrable. (And implies that P(A|G) ∈ [0, 1]
P-a.s.) Let's put it all together!
25. If it isn't clear why measurability problems would arise without this requirement, think about it until it's clear!
26. To construct this probability explicitly, pick some ωx ∈ {ω ∈ Ω : X(ω) = x} for each x ∈ R (any will do), and define the ‘intuitive conditional probability’ Qσ(X)(A|X = x) := P(A|σ(X))(ωx).
Definition 15. Consider a probability space (Ω, A, P) and a sub-σ-algebra
G of A. A random variable P(A|G) : Ω → R is a conditional probability of
A on G iff it is (1) G-measurable, (2) P-integrable, and (3) satisfies
P(A ∩ G) = ∫_G P(A|G) dP ∀G ∈ G.
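On a finite probability space, Definition 15 can be checked by hand. Below is a minimal sketch (mine) in which G is generated by a partition of Ω, so that P(A|G) is the ratio-formula probability on the block containing ω; the defining property then holds by direct computation.

```python
from fractions import Fraction as F

p = {w: F(1, 6) for w in range(6)}           # Omega = {0,...,5}, uniform P
blocks = [{0, 1, 2}, {3, 4}, {5}]            # partition generating G
A = {1, 3, 5}                                # the event A

def prob(S):
    return sum(p[w] for w in S)

def cond_prob_A(w):
    """P(A|G)(w): ratio formula on the block containing w (blocks have P > 0)."""
    B = next(b for b in blocks if w in b)
    return prob(A & B) / prob(B)

# Check property (3) of Definition 15 on a set G in the sub-sigma-algebra
# (any union of blocks will do):
G = {0, 1, 2} | {5}
lhs = prob(A & G)                             # P(A and G)
rhs = sum(cond_prob_A(w) * p[w] for w in G)   # the integral of P(A|G) over G
print(lhs, rhs, lhs == rhs)                   # 1/3 1/3 True
```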
That is, there exists a random function µ(·|G) s.t. every µ(·|G)(ω) is a
probability measure and every µ(A|G)(·) is a conditional probability of A on
G (Billingsley, 1995, Theorem 33.3). Such a µ is called a regular conditional
probability.
Conditional expectation is defined in a similar way.
Definition 16. Consider a probability space (Ω, A, P), a sub-σ-algebra G
of A, and a P-integrable random variable Y.29 A random variable E(Y|G) :
Ω → R is a conditional expectation of Y on G iff it is (1) G-measurable, (2)
P-integrable, and (3) satisfies
∫_G Y dP = ∫_G E(Y|G) dP ∀G ∈ G.
The proof of existence is similar to the one for conditional probability; see
Rosenthal (2006, Proposition 13.1.7). The interpretational subtleties outlined
above also apply to conditional expectation. Also, fun fact: for G = Ω we get
∫_Ω Y dP = ∫_Ω E(Y|G) dP,
meaning that the ‘law of iterated expectation’ is actually part of the definition
of conditional expectation.
2.8 Inequalities
Probability theory is full of inequalities. The main ones are concentration
inequalities, which bound the probability that a random variable deviates
away from some value (usually zero or its mean). These are often useful for
establishing the convergence of sequences of random variables, the topic of
the next section. The two most basic concentration inequalities are Markov’s
and Chebychev’s; both apply to random variables, but extend easily to
random vectors.
Proposition 2 (Markov’s inequality). Let X be a nonnegative random
variable on (Ω, A, P). Then P(X ≥ ε) ≤ E(X)/ε ∀ε > 0.
Proof. Fix ε > 0. Then
E(X) = ∫_Ω X dP = ∫_{X≥ε} X dP + ∫_{X<ε} X dP
≥ ∫_{X≥ε} X dP ≥ ε ∫_{X≥ε} dP = ε P(X ≥ ε).
29. Recall that Y is P-integrable iff E(Y) exists and is finite.
Corollary 1 (Chebychev’s inequality). Let X be a random variable on
(Ω, A, P) such that E(X) exists and is finite. Then P(|X − E(X)| ≥ ε) ≤
Var(X)/ε2 for every ε > 0.
Proof. Fix ε > 0 and define Y := (X − E(X))². Y is nonnegative, so by
Markov's inequality
P(|X − E(X)| ≥ ε) = P(Y ≥ ε²) ≤ E(Y)/ε² = Var(X)/ε².
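Both inequalities are easy to sanity-check numerically; the sketch below (mine, with exponential data as an arbitrary nonnegative example) verifies them on the empirical distribution of a large sample, where they hold exactly because the empirical distribution is itself a probability measure.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.exponential(scale=1.0, size=1_000_000)   # nonnegative; E(X) = Var(X) = 1

for eps in (1.0, 2.0, 4.0):
    markov_ok = np.mean(X >= eps) <= X.mean() / eps
    cheb_ok = np.mean(np.abs(X - X.mean()) >= eps) <= X.var() / eps**2
    print(eps, markov_ok, cheb_ok)               # True, True at every eps
```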
Proposition 4 (generalised Markov inequality). Let X be a nonnegative
random variable on (Ω, A, P), and let g : R+ → R+ be strictly increasing.
Then P(X ≥ ε) ≤ E(g(X))/g(ε) ∀ε > 0.
Proposition 5 (Cantelli’s inequality). Let X be a random variable on
(Ω, A, P) such that E(X) exists and is finite. Then
≤ Var(X)
Var(X)+λ2
for λ > 0
P(X − E(X) ≥ λ) λ2
≥ for λ < 0.
Var(X)+λ2
The next two concentration inequalities are terribly ugly, but very useful.
The former (Kolmogorov’s) is a special case of the latter (Hájek–Rényi). We
will use the Hájek–Rényi inequality to prove Kolmogorov’s first SLLN in
section 4.1 (p. 53).
Theorem 2 (Kolmogorov’s inequality). Let {Xn } be a sequence of mean-
zero independent random variables on (Ω, A, P). Then for any m < n and
ε > 0,
k n
!
X 1 X
P max Xi ≥ ε ≤ 2 Var(Xi ).
k∈[m,n]
i=1
ε i=1
Theorem 5 (Hölder’s inequality). Let X and Y be random variables on
(Ω, A, P). Then for any p, q ≥ 1 with p−1 + q −1 ≤ 1,
25
3 Modes of convergence
Official reading: Amemiya (1985, ch. 3), Rao (1973, ch. 2) and Serfling (1980,
ch. 1).
Since this section concerns convergence, it will be important that all
topologically interesting sets are measurable. So unless otherwise specified,
every set will be equipped with (a superset of) its Borel σ-algebra.
Definition 17. Let {xn} be a sequence in a metric space (S, ρ). {xn}
converges to x0 ∈ S iff for any ε > 0, there exists Nε ∈ N such that
ρ(xn, x0) < ε whenever n ≥ Nε. Convergence is typeset as xn → x0 or
lim_{n→∞} xn = x0.
We will instead proceed in an ad-hoc way, dreaming up new convergence
concepts with intuition as our guide.
The most obvious way of weakening ordinary convergence is almost
sure convergence. Ordinary convergence in the sup metric required uniform
convergence of the measurable functions {Xn}. We could weaken this to
requiring only pointwise convergence of {Xn}: Xn(ω) → X(ω) for all ω ∈ Ω
(we might call this ‘sure convergence’).31 Almost sure convergence weakens
this one step further by requiring that Xn(ω) → X(ω) for almost all ω ∈ Ω,
i.e. for all ω outside a set of measure zero. Formally:
lim_{n→∞} Xn = X P-a.s.32
Almost sure convergence is typeset Xn →a.s. X or Xn → X a.s., and is also
known as convergence with probability 1 (w.p. 1) or (in general measure
theory) convergence almost everywhere (a.e.).
A.s. convergence turns out to be stronger than necessary for most of
econometric theory. To weaken it, look at Lemma 1: if we drop the sup
operator, we obviously get a less stringent condition. This weaker condition
is called convergence in probability.
Definition 19. Let {Xn} and X be random elements of a metric space
(S, ρ) defined on (Ω, A, P). {Xn} converges in probability to X iff for any
ε > 0,
lim_{n→∞} P(ρ(Xn, X) > ε) = 0.
Convergence in probability is typeset Xn →p X or plim_{n→∞} Xn = X, and is
also known as convergence with probability approaching 1.
Finally, we introduce a third convergence concept. Like a.s. convergence,
it is stronger than convergence in probability, though it neither implies nor
is implied by a.s. convergence. Let ‖·‖₂ denote the Euclidean norm.
Definition 20. Let {Xn} and X be random vectors defined on (Ω, A, P).
{Xn} converges in mean square (m.s.) to X iff E[(‖Xn − X‖₂)²] exists for
each n ∈ N and
lim_{n→∞} E[(‖Xn − X‖₂)²] = 0.
Convergence in mean square is typeset Xn →m.s. X.
We will not use this concept very much because we don't want to as-
sume that the second moment of an estimator exists. But convergence in
mean square turns out to be useful for studying the convergence of random
functions.
It's clear that convergence in mean square can at most be extended to
random elements of normed vector spaces (such as Rn). It cannot be defined
for random elements of general metric spaces.
It turns out that there is also a metric ρ such that Xn →p X iff Xn →ρ X.33
On the other hand, unless (Ω, A) is trivial, there is no metric ρ on S such
that Xn →a.s. X iff Xn →ρ X (see Dudley (2004, p. 289)).
[Figure: graphs of the functions f1, f2, f5 and f20.]
We’re defining these convergence concepts using the metric on the image
space S 0 . It is perhaps more natural to define convergence of functions by
endowing the functional space F with a topology such that the appropriate
convergence concept coincides with convergence in that topology. Convergence
in the product topology (a.k.a. the topology of pointwise convergence) on F
is equivalent to pointwise convergence as defined above. Convergence in the
topology on F induced by the sup metric
Definition 22. Let (S, ρ) and (S′, ρ′) be metric spaces, and let {fn} and f
be random functions S → S′.
(1) fn →a.s. f pointwise iff ρ′(fn(x), f(x)) →a.s. 0 for every x ∈ S.
(2) fn →a.s. f uniformly iff sup_{x∈S} ρ′(fn(x), f(x)) →a.s. 0.
(3) fn →p f pointwise iff ρ′(fn(x), f(x)) →p 0 for every x ∈ S.
(4) fn →p f uniformly iff sup_{x∈S} ρ′(fn(x), f(x)) →p 0.
Remark 2. Obvious equivalences: fn →a.s. f pointwise iff fn(x) →a.s. f(x) for
every x ∈ S, and fn →p f pointwise iff fn(x) →p f(x) for every x ∈ S.
3.3 Convergence of measures
All three of the convergence concepts we've given have a similar flavour:
they require the random elements Xn to get close to X as n increases. But
we might also care about the distributions of {Xn} getting close to the
distribution of X. For example, suppose Xn ∼ N(0, 1) and X ∼ N(0, 1), all
independent.34 No matter how large n gets, P(Xn ≠ X) = 1. Nevertheless, it
seems that this is a (trivial) case in which the distributions of {Xn} converge
to the distribution of X.
For a probability space (Ω, A, P) where Ω is endowed with a topology, call
A ∈ A a P-continuity set iff P(∂A) = 0.35
Definition 23. Let {Xn} and X be random elements of a metric space (S, ρ)
defined on (Ω, A, P). {Xn} converges in distribution to X iff LXn(A) →
LX(A) for every LX-continuity set A. Convergence in distribution is typeset
Xn →d X or Xn ⇒ X.
In the case of random vectors, this obviously reduces to the (perhaps more
familiar) definition that FXn → FX (pointwise) at every continuity point of
FX. The following example illustrates why we do not require convergence at
discontinuity points of FX.
To see why, let Xn := 1/n with probability 1, so that FXn = 1([1/n, ∞)).
The pointwise limit of these CDFs is 1((0, ∞)), which is not right-continuous
at 0 and hence is not a CDF. Intuitively, though, the distributions of {Xn}
converge to the point mass at 0, whose CDF 1([0, ∞)) agrees with the
pointwise limit everywhere except at its discontinuity point 0; that point
mass is what {Xn} ought to converge in distribution to. And since convergence
in distribution does not require convergence of the CDFs at discontinuity
points, we have Xn →d X where X is any random variable with this CDF.
In fact, convergence in distribution can be characterised in several equivalent
ways; the following two are part of the Portmanteau lemma. For probability
measures {µn} and µ on Ω, µn ⇒ µ iff either of the following holds:
(1) µn(A) → µ(A) for every µ-continuity set A.
(2) ∫_Ω f dµn → ∫_Ω f dµ for every continuous and bounded f : Ω → R.
3.4 Relationships between modes of convergence
In this section, we will establish the implication relationships between the
modes of convergence we're considering. In particular, we will show that
Xn →a.s. X ⇒ Xn →p X ⇒ Xn →d X and Xn →m.s. X ⇒ Xn →p X.
We'll also show that for a constant α, Xn →d α ⇒ Xn →p α.
We begin with the first two implications: that a.s. convergence and
convergence in m.s. imply convergence in probability. Both have nice, short
proofs.
Proposition 6. If Xn →a.s. X, then Xn →p X.
Proof. Let Xn →a.s. X. Fix an ε > 0. Obviously ρ(XN, X) > ε implies that
sup_{n≥N} ρ(Xn, X) > ε. Together with nonnegativity, this yields
0 ≤ P(ρ(XN, X) > ε) ≤ P(sup_{n≥N} ρ(Xn, X) > ε).
Since the RHS converges to 0 as N → ∞ by Xn →a.s. X, it follows that
lim_{N→∞} P(ρ(XN, X) > ε) = 0. Since ε > 0 was arbitrary, Xn →p X.
Proposition 7. If Xn →m.s. X, then Xn →p X.
Proof. Let Xn →m.s. X. (‖Xn − X‖₂)² is a nonnegative random variable, so
Markov's inequality (p. 22) applies. Together with the fact that probabilities
are nonnegative, we have for any ε > 0 that
0 ≤ P(‖Xn − X‖₂ > ε) = P((‖Xn − X‖₂)² > ε²) ≤ ε^{-2} E[(‖Xn − X‖₂)²].
The RHS converges to 0 since Xn →m.s. X. Hence P(‖Xn − X‖₂ > ε) → 0
for every ε > 0, i.e. Xn →p X.
A natural question you might now ask is: convergence in probability plus
what property is equivalent to convergence in mean square? The answer is a
boundedness property called uniform integrability; see e.g. Williams (1991,
sec. 13.7).
To show that a.s. convergence and convergence in m.s. do not imply each
other, we give counterexamples.
Example 5 (m.s. convergence without a.s. convergence). Let {Xn} be
independent with
P(Xn = 0) = 1 − 1/n and P(Xn = 1) = 1/n for each n ∈ N.
Then E(Xn²) = 1/n → 0 as n → ∞, so Xn →m.s. 0.
It's obvious (but we didn't prove) that if {Xn} is a.s.-convergent then
the limit must be 0. A.s. convergence to 0 would require that
lim_{N→∞} P(sup_{n≥N} |Xn| < ε) = 1 for every ε > 0.
So choose ε ∈ (0, 1). Then |Xn| < ε iff Xn = 0, so for any N ∈ N we have
P(sup_{n≥N} |Xn| < ε) = P(Xn = 0 ∀n ≥ N) = ∏_{n=N}^∞ (1 − 1/n).
Hence, using the inequality ln(1 − 1/n) ≤ −1/n,36
ln P(sup_{n≥N} |Xn| < ε) = Σ_{n=N}^∞ ln(1 − 1/n) ≤ − Σ_{n=N}^∞ n^{-1} = −∞,
so that P(sup_{n≥N} |Xn| < ε) = 0 for every N ∈ N: {Xn} does not converge
a.s.
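A simulation (my sketch) makes Example 5 vivid: second moments vanish, yet along a typical path the value 1 keeps recurring, however far out one looks, so the path does not converge.

```python
import numpy as np

rng = np.random.default_rng(4)

N = 200_000
n = np.arange(1, N + 1)
path = (rng.random(N) < 1 / n).astype(int)   # one realisation of X_1, ..., X_N

ones = np.flatnonzero(path) + 1              # the indices n at which X_n = 1
print(len(ones))        # roughly the sum of 1/n up to N, i.e. about log(N)
print(ones[-3:])        # ones still occur very late: sup over n >= N of |X_n| is 1
```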
Example 6 (a.s. convergence without m.s. convergence). Let {Xn} be
independent with
P(Xn = 0) = 1 − 1/n² and P(Xn = n) = 1/n² for each n ∈ N.
Then E(Xn²) = n²/n² = 1 for every n ∈ N, so Xn does not converge to 0 in
mean square. It's fairly obvious (but we didn't prove) that we cannot have
convergence in m.s. to anything other than 0.
Again, a.s. convergence to 0 requires that
lim_{N→∞} P(sup_{n≥N} |Xn| < ε) = 1 for every ε > 0.
36
Since ln is concave, it must lie below all its tangents, i.e. for any x, x0 > 0, ln(x) ≤
ln(x ) + (x0 )−1 (x − x0 ). Setting x = 1 − n1 and x0 = 1 yields ln 1 − n1 ≤ − n1 .
0
35
Following the steps in the previous example, for small ε > 0 and any N ∈ N
we have
ln P(sup_{n≥N} |Xn| < ε) = Σ_{n=N}^∞ ln(1 − 1/n²).
Using the inequality ln(1 − 1/n²) ≥ −1/n² − 1/n⁴ for n ≥ 2,37 we obtain (for
N ≥ 2)
ln P(sup_{n≥N} |Xn| < ε) ≥ − Σ_{n=N}^∞ (1/n² + 1/n⁴).
The p-series Σ_{n=N}^∞ n^{-p} is convergent iff p > 1 (regardless of N), so the RHS
is finite for each N ≥ 2 and converges to 0 as N → ∞. So by continuity of ln(·)
we obtain
lim_{N→∞} P(sup_{n≥N} |Xn| < ε) = 1.
Proposition 8. If Xn →p X, then Xn →d X.
The statement is true for general random elements, but our proof restricts
attention to random variables in order to make use of the simpler CDF-based
definition of convergence in distribution.
Proof for random variables. Let Xn →p X. Define Zn := X − Xn, so that
Zn →p 0. Fix some ε > 0 and a continuity point t ∈ R of FX. We must show
that FXn(t) → FX(t).
Using the fact that A ⊆ B implies P(A) ≤ P(B) and a few other basic
37. For u ∈ [0, 1/2], ln(1 − u) ≥ −u − u²: the function h(u) := ln(1 − u) + u + u² satisfies h(0) = 0 and h′(u) = −(1 − u)^{-1} + 1 + 2u = u(1 − 2u)/(1 − u) ≥ 0 on [0, 1/2]. Now set u = 1/n², which lies in [0, 1/2] for n ≥ 2.
facts about probabilities,
FXn(t) = P(X − (X − Xn) ≤ t)
= P(X ≤ t + Zn)
= P(X ≤ t + Zn, Zn < ε) + P(X ≤ t + Zn, Zn ≥ ε)
≤ P(X ≤ t + ε, Zn < ε) + P(X ≤ t + Zn, Zn ≥ ε)
≤ P(X ≤ t + ε) + P(X ≤ t + Zn, Zn ≥ ε)
≤ P(X ≤ t + ε) + P(X ≤ ∞, Zn ≥ ε)
= P(X ≤ t + ε) + P(Zn ≥ ε)
= FX(t + ε) + P(Zn ≥ ε).
Since Zn →p 0, P(Zn ≥ ε) → 0 as n → ∞. It follows that
lim sup_{n→∞} FXn(t) ≤ FX(t + ε).
Taking ε → 0 and using the fact that t is a continuity point of FX,
lim sup_{n→∞} FXn(t) ≤ FX(t).
Now go through exactly the same steps, replacing FXn with 1 − FXn and
ε with −ε: this yields
lim sup_{n→∞} (1 − FXn(t)) ≤ 1 − FX(t − ε), i.e. lim inf_{n→∞} FXn(t) ≥ FX(t − ε),
since Zn →p 0. Taking ε → 0 and using the fact that t is a continuity point
of FX then gives us
lim inf_{n→∞} FXn(t) ≥ FX(t).
Putting together the pieces,
FX(t) ≤ lim inf_{n→∞} FXn(t) ≤ lim sup_{n→∞} FXn(t) ≤ FX(t),
so FXn(t) → FX(t). Since t was an arbitrary continuity point of FX, we
conclude that Xn →d X.
Proposition 9. For a constant α, Xn →d α iff Xn →p α.
The result is pretty obvious, but the proofs I've seen use parts of the
Portmanteau lemma that I haven't stated, so I won't bother.
For example, suppose that E(Xn) = µ and Var(Xn) = σ²/n^α for some α > 0.
Then Chebychev's inequality gives
P(|Xn − µ| > ε) ≤ σ²/(n^α ε²) → 0 for any ε > 0,
so Xn →p µ. (This is how we will prove Chebychev's WLLN in section 4.1
(p. 53).)
It would be nice to have a similarly tractable sufficient condition for
almost sure convergence. That is exactly what the first Borel–Cantelli lemma
gives us. And there's more: the second Borel–Cantelli lemma says that our
sufficient condition is also necessary when the sequence is independent.
(1) If Σ_{n=1}^∞ P(ρ(Xn, X) > ε) < ∞ for every ε > 0, then Xn →a.s. X.
(2) If Σ_{n=1}^∞ P(ρ(Xn, X) > ε) = ∞ for some ε > 0 and {Xn} are independent,
then Xn does not converge a.s. to X.
The Borel–Cantelli lemmata are actually much more general than what
we stated here. If you care, see e.g. Rosenthal (2006, Theorem 3.4.2).
3.6 Convergence of moments
Suppose Xn →d X for random variables {Xn} and X. It might seem reason-
able to conjecture that E(Xn) → E(X). But upon reflection, it's not a very
good conjecture: by the Portmanteau lemma (p. 33), Xn →d X is equivalent
to
∫_R f dLXn → ∫_R f dLX for every continuous and bounded f,
but we want
∫_R I dLXn → ∫_R I dLX
where I is the definitely-not-bounded identity function I(x) := x! So {LXn}
are going to have to be appropriately bounded if the moments are to converge.
I'll give two counterexamples. In the first, no moments exist along the
sequence, but the limit distribution has moments. In the (perhaps less trivial)
second example, moments exist along the entire sequence, but fail to converge
nonetheless.
I’ll give two counterexamples. In the first, no moments exist along the
sequence, but the limit distribution has moments. In the (perhaps less trivial)
second example, moments exist along the entire sequence, but fail to converge
nonetheless.
where Φ is the standard normal CDF and C is the standard Cauchy CDF
C(x) := 1
2 + π −1 arctan(x).
d
It’s obvious that FXn − → Φ pointwise, so Xn −→ N (0, 1). The expectation of
the limit is therefore 0. But Xn has no mean for any n ∈ N since the Cauchy
distribution has no moments. So the sequence {E(Xn )} does not even exist,
hence certainly cannot be said to converge to zero.
Example 8. Let {Xn} be random variables with P(Xn = 1) = 1 − 1/n and
P(Xn = n) = 1/n, and let X := 1 with probability 1, so that Xn →d X and
E(X²) = 1. Writing Yn := Xn²,
E(Yn) = E(Xn²) = (1 − 1/n) + (1/n) n² → ∞,
so the second moments exist along the entire sequence, but fail to converge
to E(X²).
But even if Xn →d X is not sufficient for E(Xn) → E(X), surely
Xn →a.s. X is sufficient? No again! We can still get the sort of pathological
behaviour exhibited by the examples above. To rule this out, we need a
boundedness condition on {Xn} and X to rule out nonexistence or explosive
behaviour.
There are several important theorems giving conditions under which
Xn →a.s. X implies E(Xn) → E(X). These include (in order from strongest
to weakest assumptions) the monotone convergence theorem, the bounded
convergence theorem, the (Lebesgue) dominated convergence theorem and
the (Vitali) uniform integrability convergence theorem. The proofs of the last
few rely heavily on Fatou's lemma. All of this is covered well by Rosenthal
(2006, mainly ch. 9). We'll need the dominated convergence theorem later
on, but I won't give a proof.
Definition 26. The characteristic transform is the mapping µ ↦ φµ from
probability measures µ on (Rn, Bn) to characteristic functions φµ : Rn → C,
defined by
φµ(t) := ∫_{Rn} exp(i t^⊤ x) µ(dx) for each t ∈ Rn.39
This definition might make you wonder what the space of characteristic
functions is. It is by no means the case that every function Rn → C is the
characteristic transform of some probability measure on (Rn , B)! It is possible
to state ‘primitive’ necessary and sufficient conditions for a function Rn → C
to be a characteristic function. Bochner’s theorem (e.g. Rao (1973, p. 141))
gives one set of necessary and sufficient conditions. Tractable sufficient (but
not necessary) conditions are given by Pólya’s theorem, which I’ll only state
for the univariate case.
Theorem 9 (Pólya’s theorem). Suppose ϕ : R → C is R-valued, even,40
continuous, convex on R++ , and satisfies ϕ(0) = 1 and limt→∞ ϕ(t) = 0.
Then ϕ = φµ for some probability measure µ on (R, B) that is absolutely
continuous w.r.t. Lebesgue measure and symmetric about 0.
It turns out that the space of characteristic functions is a dual of the
space of probability measures in the following sense. First, the characteristic
mapping is a bijection: φµ = φν (pointwise) iff µ = ν (setwise). Second,
the characteristic transform has a closed-form inverse, and there are many
convenient equivalences between properties of probability measures and
properties of characteristic functions. Third, the characteristic mapping is
continuous in a certain sense.
Let’s state two important bits of that formally. Proofs can be found in
e.g. Rosenthal (2006, ch. 10).
Theorem 10 (Fourier uniqueness theorem). Let µ and ν be probability
measures on (Rn, Bn). Then φµ = φν (pointwise) iff µ = ν (setwise).
Theorem 11 (Lévy's continuity theorem). Let {µn} and µ be probability
measures on (Rn, Bn). Then µn ⇒ µ iff φµn → φµ pointwise.41
39. The characteristic transform is also sometimes known as (a version of) the Fourier transform. But what exactly is meant by ‘Fourier transform’ varies hugely between authors and fields, so I won't use this term at all.
40. A function f is even iff f(−x) = f(x) for every x in its domain.
41. It's called the continuity theorem because when the space of probability measures is endowed with the topology of weak convergence (the weak* topology) and the space of characteristic functions is endowed with the topology of pointwise convergence (the product topology), the theorem says precisely that the characteristic transform and its inverse are continuous mappings.
These properties mean that any results we prove about characteristic
functions, including convergence results, translate directly into results about
probability measures (and vice versa). When we face a difficult question
about probability measures on (Rn , B), we will often translate it into a
question about characteristic functions, easily find the answer, then translate
the answer back into probability-measure space.
The leading example of this strategy is the proof of the Lindeberg–
Lévy central limit theorem (p. 63). But we’ll also use it to prove part of
the continuous mapping theorem on p. 45, and to establish an interesting
property of the Cauchy distribution in an example on p. 42.
I mentioned equivalences between properties of measures and of their
characteristic transforms. There are many, and they are easy to look up, but
here are a few important ones.
Proposition 10. Let X and Y be random variables.
(1) φX(0) = 1.
(2) |φX(t)| ≤ 1 for every t ∈ R.
(3) φ_{aX+b}(t) = exp(itb) φX(at) for any a, b, t ∈ R.
(4) φ_{X+Y} = φX φY if X and Y are independent. (The converse is not true!)
(5) If E(X^j) exists and is finite, then φX^{(j)}(0) exists. (There's a partial
converse.) Whenever they exist and are finite, φX^{(j)}(0) = i^j E(X^j).
Now consider a sequence {Xn} of independent standard-Cauchy-distributed
random variables, and write Sn := n^{-1} Σ_{i=1}^n Xi. Then
φ_{Sn}(t) = E(exp(it n^{-1} Σ_{i=1}^n Xi))
= E(∏_{i=1}^n exp(it n^{-1} Xi))
= ∏_{i=1}^n E(exp(it n^{-1} Xi))
= E(exp(it n^{-1} X1))^n
= φ_{X1}(t/n)^n
= exp(−|t|/n)^n
= exp(−|t|)
= φ_{X1}(t).
By the Fourier uniqueness theorem, then, the sample mean Sn is itself
standard Cauchy for every n, so it does not converge to a constant: the law
of large numbers fails for Cauchy random variables. (When an LLN fails, the
reason
is usually that the variance is infinite.) As we will see when we prove LLNs,
moment restrictions are required in order to avoid this sort of problem. (At
the very least, we’ll require the first moment to exist.)
Theorem (continuous mapping theorem, due to Mann and Wald). Let {Xn}
and X be random elements of a metric space (S, ρ) defined on (Ω, A, P),
and let g : S → S′ be a measurable function that is continuous at every
point of a set C ⊆ S with P(X ∈ C) = 1. Then:
(1) Xn →a.s. X implies g(Xn) →a.s. g(X).
(2) Xn →p X implies g(Xn) →p g(X).
(3) Xn →d X implies g(Xn) →d g(X).
Proof of (1). We know that there are measurable Ω′, Ω′′ ⊆ Ω such that
Xn(ω) → X(ω) for all ω ∈ Ω′, g is continuous at all X(ω) s.t. ω ∈ Ω′′, and
P(Ω′) = P(Ω′′) = 1. Firstly, Ω′ ∩ Ω′′ is measurable with P(Ω′ ∩ Ω′′) = 1,
since
P(Ω′ ∩ Ω′′) = 1 − P((Ω′)ᶜ ∪ (Ω′′)ᶜ) ≥ 1 − P((Ω′)ᶜ) − P((Ω′′)ᶜ) = 1.
Proof of (2) for constant X. Let X = α, a constant. By continuity of g at α,
for each ε > 0 there is a δ > 0 such that ρ(Xn, α) < δ implies
ρ′(g(Xn), g(α)) < ε. So
P(ρ′(g(Xn), g(α)) < ε) ≥ P(ρ(Xn, α) < δ).
Since Xn →p α, the right-hand side converges to 1 regardless of δ. It follows
that P(ρ′(g(Xn), g(α)) < ε) → 1 for each ε > 0, i.e. g(Xn) →p g(α).
For (3), the cleanest general proof that I’ve seen uses Skorokhod’s theorem,
then follows the argument for (1). This would take us too far afield, so let’s
restrict attention to the case in which {Xn } and X are random `-vectors and
g : R` → Rm , so that we can use characteristic functions.
Theorem (Slutsky's theorem). Let {Xn} and X be random m × k matrices
and {Yn} random k × k matrices such that Xn →d X and Yn →p A, where
A is a constant k × k matrix. Then
(1) Xn + Yn →d X + A (when m = k),
(2) Xn Yn →d XA, and
(3) Xn Yn^{-1} →d XA^{-1} provided A is invertible.46
Proof. (X, A) and each (Xn, Yn) are random elements of the metric space
R^{m×k} × R^{k×k}. Yn →p A implies Yn →d A by Proposition 9 (p. 38). The
mappings (x, y) ↦ x + y and (x, y) ↦ xy are continuous, and (x, y) ↦ xy^{-1}
is continuous whenever y is invertible. The result then follows from part (3)
of the continuous mapping theorem.
Remark 4. Since the proof of Slutsky's theorem is via the Mann–Wald CMT,
the result obviously still holds if we replace →d with →p or →a.s. But be careful
here: it's important that {Yn} converges to a constant rather than to a random
matrix. When Xn and Yn both converge to random elements X and Y, it
need not be that Xn + Yn →d X + Y, Xn Yn →d XY or Xn Yn^{-1} →d XY^{-1}.
The following example illustrates.
Example 10 (weak convergence of marginals vs. joint). Let {Xn}, {Yn}, X
and Y be random variables distributed
(Xn, Yn) iid∼ N((0, 0), Σρ) and (X, Y) ∼ N((0, 0), Σr),
where Σρ and Σr have unit variances and correlations ρ and r respectively,
with ρ ≠ r. Then Xn →d X and Yn →d Y marginally, but Xn + Yn ∼
N(0, 2 + 2ρ) for every n, which does not converge in distribution to
X + Y ∼ N(0, 2 + 2r).
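A short simulation (mine, with illustrative values of ρ and r) confirms the failure: the variance of Xn + Yn is pinned down by the joint distribution, not the marginals.

```python
import numpy as np

rng = np.random.default_rng(7)
rho, r, reps = 0.9, 0.0, 500_000              # any rho != r will do

XnYn = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=reps)
Xn, Yn = XnYn[:, 0], XnYn[:, 1]

print(np.var(Xn + Yn), 2 + 2 * rho)   # X_n + Y_n is (exactly) N(0, 2 + 2*rho)
print(2 + 2 * r)                      # but X + Y is N(0, 2 + 2*r): different
```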
3.9 Stochastic order notation
When we use approximations, we have to control the approximation error.
Usually, we want the error to vanish as the sample size grows large. The
notation introduced here offers a compact way of keeping track of approx-
imation error. This section will treat sequences in R, on the understanding
that the extension to Rn is straightforward.
Let’s start out with order notation from analysis.
Definition 27. Let {xn} and {an} be sequences in R.
(1) xn = O(an) iff ∃M0 > 0 s.t. |xn/an| ≤ M0 for n sufficiently large.
(2) xn = o(an) iff xn/an → 0.
The stochastic analogues replace boundedness and convergence with their
in-probability versions.
Definition 28. Let {Xn} be a sequence of random variables and {an} a
sequence in R.
(1) Xn = Op(an) iff for every δ > 0 there exists Mδ > 0 such that
P(|Xn/an| > Mδ) < δ for all n sufficiently large.
(2) Xn = op(an) iff Xn/an →p 0.
Evidently, a deterministic bound implies a stochastic one: O(an) implies
Op(an), and o(an) implies op(an).
Of course, the converse is not true, i.e. Op (an ) = O(an ) and op (an ) = o(an )
are false in general. (A sequence may be bounded/vanishing in probability
without being bounded/vanishing for sure.)
We will do a lot of algebra involving Op and op once we start studying
estimators, so here’s a collection of facts about how Op and op can be
manipulated. Except for the last one, they are all easily proved from the
definitions.
3.10 The delta method
Suppose that an(Xn − b) →d W for constants {an} (with an → ∞) and b.
The delta method converts this into a limit distribution for g(Xn) via a
Taylor expansion. Its basic, first-order version is the following.
Proposition 12 (delta method). Let {Xn} be a sequence of random k-vectors
such that an(Xn − b) →d W, and let g : Rk → Rm be differentiable at b with
Jacobian Dg(b). Then an(g(Xn) − g(b)) →d Dg(b)W.
The theorem extends immediately to any ℓ times differentiable function
g : Rk → Rm, but the notation becomes ugly fast (tensor products). For
g : Rk → R, we can go to second order without notational trouble:
g(x) − g(b) = ∇g(b)^⊤(x − b) + ½(x − b)^⊤ ∇²g(b) (x − b) + o((‖x − b‖₂)²).
Although the first-order delta method above is valid when Dg(b) = 0,
it isn’t very helpful in that case. Unless g is a constant function, g(Xn ) is
still going to be random, so we’d like our approximating distribution to be
nondegenerate. The obvious remedy is to use a second-order Taylor expansion.
As noted above, this would require heavy notation for the case g : Rk → Rm ,
so we’ll just state it for the case g : Rk → R.
Proposition 13 (second-order delta method). Let {Xn} be a sequence of
random k-vectors such that an(Xn − b) →d W for some constants {an} and
b, and let g : Rk → R be twice differentiable in an open neighbourhood of b,
with derivatives ∇g(b) = 0 and ∇²g(b) at b. Then
a²n(g(Xn) − g(b)) →d ½ W^⊤ ∇²g(b) W.
Proof. By Taylor's theorem,
g(Xn) − g(b) = ½(Xn − b)^⊤ ∇²g(b) (Xn − b) + op((‖Xn − b‖₂)²),
so
a²n(g(Xn) − g(b)) = ½[an(Xn − b)]^⊤ ∇²g(b) [an(Xn − b)] + op((‖an(Xn − b)‖₂)²).
Since an(Xn − b) →d W,
op((‖an(Xn − b)‖₂)²) = op(Op(1)²) = op(Op(1)) = op(1),
and
½[an(Xn − b)]^⊤ ∇²g(b) [an(Xn − b)] →d ½ W^⊤ ∇²g(b) W
by the continuous mapping theorem. Hence
a²n(g(Xn) − g(b)) = ½[an(Xn − b)]^⊤ ∇²g(b) [an(Xn − b)] + op(1) →d ½ W^⊤ ∇²g(b) W
by Slutsky's theorem (p. 45).
Remark 8. Even if ∇g(b) 6= 0, we could use a second-order Taylor expansion
to approximate the distribution of g(Xn ). But this makes the approximation
so complicated that it’s rarely worthwhile.
Of course, there’s nothing special about the second order: if the first
` − 1 derivatives are zero, we can use the `th derivative to approximate the
distribution of g(Xn ). To duck notational difficulties, I’ll only state this for
the case g : R → R.
Proposition 14 (ℓth-order delta method). Let {Xn} be a sequence of random
variables such that an(Xn − b) →d W for some constants {an} and b, and let
g : R → R be ℓ times differentiable in an open neighbourhood of b with
g^{(j)}(b) = 0 for j = 1, . . . , ℓ − 1. Then
a^ℓ_n(g(Xn) − g(b)) →d (g^{(ℓ)}(b)/ℓ!) W^ℓ.
Proof. By Taylor’s theorem,
g (`) (b)
g(Xn ) − g(b) = (Xn − b)` + op |Xn − b|` ,
`!
so
g (`) (b)
a`n (g(Xn ) − g(b)) = [an (Xn − b)]` + op |an (Xn − b)|` .
`!
d
Since an (Xn − b) −→ W ,
op |an (Xn − b)|` = op Op (1)` = op (Op (1)) = op (1),
and [an (Xn − b)]` −→ W ` by the continuous mapping theorem (p. 44). So by
d
As an example, consider the exponential distribution exp(α) with rate α > 0,
whose density w.r.t. Lebesgue measure is f(x) = α exp(−αx) on R+. The
mean and variance of this distribution are α^{-1} and α^{-2}.
Suppose we have n iid random variables {Xi}_{i=1}^n drawn from the exp(α)
distribution, and wish to estimate α. The obvious analogy estimator, which
turns out to also be the maximum likelihood estimator, is
α̂n := (n^{-1} Σ_{i=1}^n Xi)^{-1}.
By Kolmogorov's second SLLN, n^{-1} Σ_{i=1}^n Xi →a.s. E(X1) = α^{-1},
so by the continuous mapping theorem α̂n →a.s. α, i.e. the estimator is strongly
consistent. So α̂n will be ‘close’ to α in a large sample.
But how close? To answer this question, we need to approximate the
distribution of α̂n in a large sample. The Lindeberg–Lévy CLT (p. 63) gives
us
n^{-1/2} Σ_{i=1}^n (Xi − α^{-1})/√(α^{-2}) →d N(0, 1),
which we can rewrite as
n^{1/2} α (α̂n^{-1} − α^{-1}) →d N(0, 1).
Now use the delta method with g(x) = 1/x (so g′(x) = −1/x²), an = n^{1/2} α
and b = α^{-1} to obtain
n^{1/2} α (α̂n − α) →d (−1/α^{-2}) N(0, 1),
or equivalently
n^{1/2} (α̂n − α) →d N(0, α²).
So for n large, the distribution of α̂n is well-approximated by N(α, n^{-1}α²).
4 Laws of large numbers
Official reading: Amemiya (1985, ch. 3), Rao (1973, ch. 2) and White (2001,
ch. 3).
A law of large numbers (LLN) gives conditions under which the average of
n random variables converges as n → ∞. Such laws are called weak laws
(WLLNs) if convergence is in probability, and strong laws (SLLNs) if
convergence is almost sure.48
There are a lot of different laws of large numbers. The common theme is
that the volatility of the average must be controlled by combining two kinds
of restriction. On the one hand, we restrict the individual variances to keep
them from getting too large. On the other hand, we restrict the dependence
between the random variables, so that one extreme realisation doesn’t make
further extreme realisations likely. Each LLN imposes some mix of the two,
and often we can weaken the one at the expense of strengthening the other.
Theorem 14 (Chebychev's WLLN). Let {Xn} be a sequence of uncorrelated
random variables such that n^{-2} Σ_{i=1}^n Var(Xi) → 0 as n → ∞. Then
n^{-1} Σ_{i=1}^n (Xi − E(Xi)) →p 0.
Remark. The restriction on the variances implies that each variance is finite,
hence that each mean exists and is finite.
Proof. Write Sn := n^{-1} Σ_{i=1}^n (Xi − E(Xi)); we want to show that Sn →p 0.
By uncorrelatedness,
Var(Sn) = n^{-2} Var(Σ_{i=1}^n (Xi − E(Xi))) = n^{-2} Σ_{i=1}^n Var(Xi) → 0,
so by Chebychev's inequality, P(|Sn| > ε) ≤ Var(Sn)/ε² → 0 for every ε > 0,
i.e. Sn →p 0.
Now for an easy strong law. It isn’t actually used very often, but it plays
an important role in the proof of Kolmogorov’s second SLLN.
Theorem 15 (Kolmogorov’s first SLLN). Let {Xn } be a sequence of inde-
pendent random variables with
∞
X Var(Xi )
< ∞.
i=1
i2
Pn a.s.
Then n−1 i=1 (Xi − E(Xi )) −−→ 0 as n −
→ ∞.
Remark 10. The Kolmogorov variance condition
∞
X
Var(Xi )/i2 < ∞
i=1
Proof of Kolmogorov’s first SLLN. Write
n
Sn := n−1
X
(Xi − E(Xi ));
i=1
a.s.
we want to show that Sn −−→ 0. Fix ε > 0. Using the Hájek–Rényi inequality
with ci = i−1 ,
k
! !
−1
X
P max |Sk | ≥ ε =P max k (Xi − E(Xi )) ≥ ε
k∈[m,n] k∈[m,n]
i=1
m n
1
≤ 2 m−2 i−2 Var(Xi ) .
X X
Var(Xi ) +
ε i=1 i=m+1
P∞ 2
Taking n −
→ ∞ and using the fact that i=1 Var(Xi )/i converges,
m ∞
1
P max|Sk | ≥ ε ≤ 2 m−2 i−2 Var(Xi ) .
X X
Var(Xi ) +
k≥m ε i=1 i=m+1
Now taking m −
→ ∞,
!
lim P sup |Sk | ≥ ε
m→∞ k≥m
m ∞
1
≤ 2 lim m−2 i−2 Var(Xi ) .
X X
Var(Xi ) + lim
ε m→∞
i=1
m→∞
i=m+1
P∞ 2
Since i=1 Var(Xi )/i exists and is finite, Kronecker’s lemma with ci = i2
yields
m m
Var(Xi )
lim m−2 Var(Xi ) = lim m−2
X X
i2 = 0,
m→∞
i=1
m→∞
i=1
i2
i.e. the first term is zero. For the second term,
∞ ∞ m
!
X Var(Xi ) X Var(Xi ) X Var(Xi )
lim = lim −
m→∞
i=m+1
i2 m→∞
i=1
i2 i=1
i2
∞ ∞
X Var(Xi ) X Var(Xi )
= − = 0.
i=1
i2 i=1
i2
55
Hence limm→∞ P supk≥m |Sk | ≥ ε ≤ 0. Since probabilities are nonnegative,
!
lim P sup |Sk | ≥ ε = 0.
m→∞ k≥m
a.s.
Since ε > 0 was arbitrary, we’ve shown that Sn −−→ 0.
There are several refinements of Chebychev’s WLLN and Kolmogorov’s
SLLN. One of these is Markov’s SLLN.
The variance restriction is very similar to (but weaker than) the one
in Kolmogorov’s first SLLN. The main novelty comes from the fact that
we’ve replaced independence with bounded autocorrelation. Notice that we
only need to rule out large and persistent positive autocorrelation. Negative
autocorrelation is actually helpful: it speeds up ‘mixing’, leading to faster
convergence!
Again we won’t give a proof; you can find one in Serfling (1970, Corollary
2.2.1). But we will give an example to verify that there are interesting
sequences of random variables that satisfy the hypotheses of the theorem.
As an example, let {εn} be iid with mean zero and variance σε², fix |r| < 1,
and define Xn := Σ_{j=0}^{n−1} r^j ε_{n−j}. Then
Var(Xn) = σε² Σ_{j=0}^{n−1} r^{2j} = σε² (1 − r^{2n})/(1 − r²),
where we used |r²| < 1, which follows from |r| < 1. Notice that Var(Xn) ≥
Var(Xm) whenever n ≥ m, and that Var(Xn) < σε²/(1 − r²) for every n.
For n ≥ m,
Cov(Xn, Xm) = Cov(Σ_{j=0}^{n−1} r^j ε_{n−j}, Σ_{j=0}^{m−1} r^j ε_{m−j})
= Cov(Σ_{j=1}^n r^{n−j} εj, Σ_{j=1}^m r^{m−j} εj)
= Cov(Σ_{j=1}^m r^{n−j} εj, Σ_{j=1}^m r^{m−j} εj)
= r^{n−m} Var(Σ_{j=1}^m r^{m−j} εj)
= r^{n−m} Var(Σ_{j=0}^{m−1} r^j ε_{m−j})
= r^{n−m} Var(Xm).
Since Var(Xn) ≥ Var(Xm), it follows that
Corr(Xn, Xm) = Cov(Xn, Xm)/(√Var(Xm) √Var(Xn))
= r^{n−m} √Var(Xm)/√Var(Xn)
≤ r^{n−m}.
So we have constants {ρj} := {r^j} in [0, 1] for which Σ_{j=1}^∞ ρj < ∞ and
Corr(Xn, Xm) ≤ ρ_{n−m}. Moreover,
Σ_{i=1}^∞ (ln(i)/i)² Var(Xi) ≤ (σε²/(1 − r²)) Σ_{i=1}^∞ (ln(i)/i)²,
which can be shown to converge using the integral test. (Or Wolfram Alpha!)
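The covariance formula is easy to verify by simulation; the sketch below (mine, with illustrative values of r, n and m) checks Cov(Xn, Xm) = r^(n−m) Var(Xm) for the process Xn = Σ_{j=0}^{n−1} r^j ε_{n−j}.

```python
import numpy as np

rng = np.random.default_rng(9)
r, n, m, reps = 0.8, 12, 5, 400_000

eps = rng.standard_normal((reps, n))         # iid shocks with variance 1
Xn = eps @ (r ** np.arange(n)[::-1])         # coefficient on eps_j is r^(n-j)
Xm = eps[:, :m] @ (r ** np.arange(m)[::-1])  # coefficient on eps_j is r^(m-j)

cov = np.mean(Xn * Xm)                       # both means are zero
print(cov, r ** (n - m) * np.var(Xm))        # approximately equal
```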
such that for each θ ∈ Θ, |n^{-1} Σ_{i=1}^n gi(θ)| < ε whenever n ≥ N_{ε,θ}.
More explicitly, the conclusion of the theorem is that
lim_{n→∞} sup_{θ∈Θ} |n^{-1} Σ_{i=1}^n gi(θ)| = 0 a.s.
The idea of the proof is to cover the compact set Θ with finitely many balls
Bδ(θ1), . . . , Bδ(θ_{J(δ)}) of radius δ > 0; write Θ_j^δ := Bδ(θj) ∩ Θ. For each
ball, the centre Σ_{i=1}^n gi(θj) obeys Kolmogorov's second SLLN, and for other
points θ ∈ Bδ(θj) it must be that Σ_{i=1}^n gi(θ) is close to Σ_{i=1}^n gi(θj) by
continuity a.s. We then take δ ↓ 0.
Painfully but straightforwardly, compute
P(sup_{n≥N} sup_{θ∈Θ} |n^{-1} Σ_{i=1}^n gi(θ)| > ε)
≤ P(⋃_{j=1}^{J(δ)} {sup_{n≥N} sup_{θ∈Θ_j^δ} |n^{-1} Σ_{i=1}^n gi(θ)| > ε})
≤ Σ_{j=1}^{J(δ)} P(sup_{n≥N} sup_{θ∈Θ_j^δ} |n^{-1} Σ_{i=1}^n gi(θ)| > ε)
≤ Σ_{j=1}^{J(δ)} P(sup_{n≥N} sup_{θ∈Θ_j^δ} (|n^{-1} Σ_{i=1}^n gi(θj)| + |n^{-1} Σ_{i=1}^n (gi(θ) − gi(θj))|) > ε)
≤ Σ_{j=1}^{J(δ)} P({sup_{n≥N} |n^{-1} Σ_{i=1}^n gi(θj)| > ε/3}
∪ {sup_{n≥N} sup_{θ∈Θ_j^δ} |n^{-1} Σ_{i=1}^n (gi(θ) − gi(θj))| > 2ε/3})
≤ Σ_{j=1}^{J(δ)} P(sup_{n≥N} |n^{-1} Σ_{i=1}^n gi(θj)| > ε/3)
+ Σ_{j=1}^{J(δ)} P(sup_{n≥N} sup_{θ∈Θ_j^δ} |n^{-1} Σ_{i=1}^n (gi(θ) − gi(θj))| > 2ε/3)
≤ Σ_{j=1}^{J(δ)} P(sup_{n≥N} |n^{-1} Σ_{i=1}^n gi(θj)| > ε/3)
+ Σ_{j=1}^{J(δ)} P(sup_{n≥N} n^{-1} Σ_{i=1}^n sup_{θ∈Θ_j^δ} |gi(θ) − gi(θj)| > 2ε/3).
60
second SLLN (p. 56) then implies
$$\lim_{N\to\infty} P\left(\sup_{n\ge N}\left|n^{-1}\sum_{i=1}^n g_i(\theta_j)\right| > \frac{\varepsilon}{3}\right) = 0 \quad\text{for each } j\in\{1,\dots,J(\delta)\}.$$
Moreover,
$$\sup_{\theta\in\Theta_j^\delta}|g_1(\theta) - g_1(\theta_j)| \le 2\sup_{\theta\in\Theta}|g_1(\theta)|,$$
and the right-hand side has finite expectation by assumption. Hence by the dominated convergence theorem (p. 40), $\mu_j^\delta := \mathrm{E}\left(\sup_{\theta\in\Theta_j^\delta}|g_1(\theta) - g_1(\theta_j)|\right) \to 0$ as $\delta \downarrow 0$. So there exists a $\delta(\varepsilon) > 0$ such that $\mu_j^{\delta(\varepsilon)} < \varepsilon/3$.
Now,
$$\sup_{\theta\in\Theta_j^{\delta(\varepsilon)}}|g_n(\theta) - g_n(\theta_j)|$$
is a sequence of iid random variables with finite mean $\mu_j^{\delta(\varepsilon)}$, so by Kolmogorov's second SLLN (p. 56),
$$\lim_{N\to\infty} P\left(\sup_{n\ge N}\left|n^{-1}\sum_{i=1}^n \sup_{\theta\in\Theta_j^{\delta(\varepsilon)}}|g_i(\theta) - g_i(\theta_j)| - \mu_j^{\delta(\varepsilon)}\right| > \frac{\varepsilon}{3}\right) = 0.$$
Since $\mu_j^{\delta(\varepsilon)} < \varepsilon/3$, it follows that
$$\lim_{N\to\infty} P\left(\sup_{n\ge N} n^{-1}\sum_{i=1}^n \sup_{\theta\in\Theta_j^{\delta(\varepsilon)}}|g_i(\theta) - g_i(\theta_j)| > \frac{2\varepsilon}{3}\right) = 0,$$
hence
$$\lim_{N\to\infty}\sum_{j=1}^{J(\delta(\varepsilon))} P\left(\sup_{n\ge N} n^{-1}\sum_{i=1}^n \sup_{\theta\in\Theta_j^{\delta(\varepsilon)}}|g_i(\theta) - g_i(\theta_j)| > \frac{2\varepsilon}{3}\right) = 0.$$
Putting this all together, we’ve shown that for any ε > 0, there is a
δ(ε) > 0 such that
n
!
−1
X
P sup sup n gi (θ) > ε
n≥N θ∈Θ i=1
J(δ(ε)) n
! !
X
−1
X ε
≤ P sup n |gi (θj )| >
j=1 n≥N i=1
3
J(δ(ε)) n
2ε
P sup n−1
X X
+ sup |gi (θ) − gi (θj )| >
3
j=1 n≥N i=1 θ∈Θδ(ε)
j
−
→0 as N −
→ ∞.
(The inequality holds for any δ > 0 you like, and the first term on the
RHS vanishes as N − → ∞ for any δ > 0 you like, but the second term only
vanishes when δ is chosen appropriately.) Since probabilities are nonnegative,
it follows that
n
!
−1
X
lim P sup sup n gi (θ) > ε = 0.
N →∞ n≥N θ∈Θ i=1
Remark 11. The proof extends without much difficulty to the non-iid
case. The monstrous inequality holds regardless of how {gn } are distributed.
Convergence of the first term requires only that {gn } obeys a SLLN pointwise.
Convergence of the second term requires E (supθ∈Θ |gn (θ)|) < ∞ for each n.
As far as I can make out, these are the only tweaks that are required to
obtain a uniform SLLN for the non-iid case.
As you’d expect, there are many other uniform LLNs. For the iid case,
slightly different assumptions can be used to obtain a uniform SLLN, and
somewhat weaker assumptions suffice for a uniform WLLN. As I indicated,
uniform SLLNs for the non-iid case are also fairly straightforward.
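To make the uniform part vivid, here is a numerical sketch (my own choice of example, not the course's): take $g_i(\theta) = \cos(\theta y_i) - \mathrm{E}\cos(\theta y_1)$ with $y_i$ iid $N(0,1)$, so that $\mathrm{E}\cos(\theta y_1) = e^{-\theta^2/2}$, and track the supremum of the sample mean over a grid on the compact set $\Theta = [0,3]$.

```python
import numpy as np

# Sketch of a uniform SLLN: g_i(theta) = cos(theta*y_i) - exp(-theta^2/2)
# with y_i iid N(0,1). The sup over theta of |n^{-1} sum_i g_i(theta)|
# should shrink to zero as n grows.
rng = np.random.default_rng(2)
grid = np.linspace(0.0, 3.0, 61)    # grid standing in for compact Theta

for n in (10**2, 10**3, 10**4, 10**5):
    y = rng.normal(size=n)
    sup = max(abs(np.cos(t * y).mean() - np.exp(-t * t / 2)) for t in grid)
    print(f"n = {n:>7,}: sup over grid = {sup:.5f}")
```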
5 Central limit theorems
Official reading: Amemiya (1985, ch. 3), Rao (1973, ch. 2), Billingsley (1995,
sec. 27) and White (2001, ch. 5).
In a finite sample, we’d like to have an idea of how far our (consistent)
estimator is likely to be from the truth. The exact distribution of an estimator
(across repeated samples) will depend on the unknown distribution of the data,
and will anyway be extremely complicated. So we’d like an approximation to
its distribution that uses only what the econometrician observes, and which
is a good approximation in the sense that it becomes arbitrarily accurate as
the sample size increases. This may sound like too much to hope for, but it
is in fact possible to do precisely this by using the magic of the central limit
theorems (CLTs).
A central limit theorem has the following form. Take any sequence {Xn }
of random variables that satisfy some conditions; then there are sequences of
constants {an } and {bn } such that
$$a_n\sum_{i=1}^n (X_i - b_i) \xrightarrow{d} N(0,1),$$
Remark 12. Setting E(X1 ) = 0 and Var(X1 ) = 1 is wlog, since for E(Xn ) =
µ and Var(Xn ) = σ 2 , we can apply the theorem to Yn := (Xn − µ)/σ. The
importance of these restrictions is that the mean exists and is finite and that
the variance is finite and nonzero.
52 There are many generalisations of CLTs that don't quite fit this format. A very important example for econometrics is functional central limit theorems, which give conditions under which the random function $S_n(\tau) := n^{-1/2}\sum_{i=1}^{\lfloor\tau n\rfloor} X_i$ converges weakly to a Brownian motion. Another example is 'generalised central limit theorems', which give conditions for convergence to a stable law (see footnote 43 on p. 43).
The proof will make use of characteristic functions. In particular, we’ll
−1/2 Pn X converges pointwise
show that the characteristic
function of n i=1 i
1 2
to φN (0,1) (t) = exp − 2 t , then appeal to Lévy’s continuity theorem.
Proof. Write n
Zn := n−1/2
X
Xi .
i=1
Fix an arbitrary t ∈ R.
n
φZn (t) = E exp itn−1/2
X
Xj
j=1
n
exp itn−1/2 Xj
Y
= E
j=1
n
E exp itn−1/2 Xj
Y
=
j=1
n
= E exp itn−1/2 X1
n
= φX1 t n1/2 .
where we used independence in the third equality and identical distribution
in the fourth.
Since t n1/2 −
→ 0 for fixed t ∈ R, only the behaviour of φX in a shrinking
neighbourhood of 0 will matter. Formally, we use Taylor expansion around 0
to approximate φX1 t n1/2 as n grows large:
φX1 t n1/2 = φX1 (0) + φ0X1 (0)t n1/2 + 21 φ00X1 (0)t2 n + o (1/n) ,
where the derivatives exist since E(X1 ) and E X12 exist and are finite (see
part (5) of Proposition 10 (p. 42)). Again by Proposition 10 (p. 42), we have
φX1 (0) = 1, φ0X1 (0) = i−1 E(X1 ) = 0 and E X12 = i−2 Var(X1 ) = i−2 = −1,
64
Before moving on, let’s give an (unusual) example of how the Lindeberg–
Lévy CLT can be used.
Example 13. Let {Xn } be independently distributed Xn ∼ N (0, 1), and
define Sn := ni=1 Xi2 . In this case, we don’t really need to approximate the
P
E (exp (iλ> Xn )) −
→ E (exp (iλ> X)) for every λ ∈ Rk .
But the LHS equals φXn (λ), and the RHS equals φX (λ)! So we’ve shown
d
that φXn −→ φX pointwise, which implies Xn −→ X by Lévy’s continuity
theorem (p. 41).
65
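Here is a simulation of Example 13 (mine, not from the lectures): the exact distribution of $S_n$ is $\chi^2(n)$, and the CLT approximation $N(n, 2n)$ is already very good at moderate $n$.

```python
import numpy as np

# Sketch of Example 13: S_n = sum of n squared N(0,1) draws is chi^2(n).
# The CLT applied to {X_i^2} (mean 1, variance 2) says
# (S_n - n)/sqrt(2n) is approximately N(0,1).
rng = np.random.default_rng(3)
n, reps = 200, 50_000

s = (rng.normal(size=(reps, n)) ** 2).sum(axis=1)
z = (s - n) / np.sqrt(2 * n)

print("mean of Z:", z.mean())                  # should be near 0
print("var  of Z:", z.var())                   # should be near 1
print("P(Z <= 1.645):", (z <= 1.645).mean())   # should be near 0.95
```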
Corollary 5 (vector Lindeberg–Lévy CLT). Let $\{X_n\}$ be a sequence of iid random $k$-vectors with $\mathrm{E}(X_1) = 0$ and $\mathrm{Var}(X_1) = I$. Then $n^{-1/2}\sum_{i=1}^n X_i \xrightarrow{d} N(0, I)$.
Proof. Write
$$Z_n := n^{-1/2}\sum_{i=1}^n X_i,$$
and let $Z$ denote some (any) $k$-vector distributed $N(0, I)$. By a property of the normal distribution, $\lambda^\top Z$ is then distributed univariate $N(0, \lambda^\top\lambda)$ for any $\lambda \in \mathbb{R}^k$.
By the univariate Lindeberg–Lévy CLT,
$$\lambda^\top Z_n \xrightarrow{d} N(0, \lambda^\top\lambda) \overset{d}{=} \lambda^\top Z \quad\text{for any }\lambda\in\mathbb{R}^k.^{55}$$
Hence $Z_n \xrightarrow{d} Z \overset{d}{=} N(0, I)$ by the Cramér–Wold device.
The rest of the central limit theorems in this section will be stated for random variables only. But all of them can easily be extended to random vectors using the Cramér–Wold device in the manner just demonstrated.
of restrictions on the variances. (There’s a similarity here with Kolmogorov’s
first SLLN, which also imposes independence and a variance restriction.
Continuing the analogy, the Lindeberg–Lévy CLT and Kolmogorov’s second
SLLN both impose iid.)
The most general theorem along these lines is the following.
$$\lim_{n\to\infty}\max_{i\in[1,n]}\frac{\mathrm{Var}(X_i)}{c_n^2} = 0 \quad\text{and}\quad c_n^{-1}\sum_{i=1}^n (X_i - \mathrm{E}(X_i)) \xrightarrow{d} N(0,1)$$
hold iff
$$\lim_{n\to\infty} c_n^{-2}\sum_{i=1}^n \mathrm{E}\left[(X_i - \mathrm{E}(X_i))^2\,\mathbf{1}\left(|X_i - \mathrm{E}(X_i)| > \varepsilon c_n\right)\right] = 0 \quad\text{for any }\varepsilon > 0.$$
Remark 14. The last condition is called the Lindeberg condition; it restricts the thickness of the tails. The 'only if' part means that the Lindeberg condition is the weakest possible sufficient condition for weak convergence to a normal law when independence and
$$\lim_{n\to\infty}\max_{i\in[1,n]}\mathrm{Var}(X_i)\big/c_n^2 = 0$$
are imposed.
Proof. Write $Y_n := X_n - \mathrm{E}(X_n)$. We wish to show that the Lindeberg condition holds, so fix $\varepsilon > 0$. Then
$$c_n^{-2}\sum_{i=1}^n \mathrm{E}\left[Y_i^2\,\mathbf{1}(|Y_i| > \varepsilon c_n)\right] = c_n^{-2}\sum_{i=1}^n \int_{\{|y|>\varepsilon c_n\}} y^2\, L_{Y_i}(\mathrm{d}y) = c_n^{-2}\sum_{i=1}^n \int_{\{|y|>\varepsilon c_n\}} \frac{1}{|y|}\,|y|^3\, L_{Y_i}(\mathrm{d}y).$$
Observe that this expression must be nonnegative. To show that it can't be strictly positive, use the nonnegativity of the integrand (together with $c_n > 0$ and $\tau_n < \infty$) to obtain
$$c_n^{-2}\sum_{i=1}^n \int_{\{|y|>\varepsilon c_n\}} \frac{1}{|y|}\,|y|^3\, L_{Y_i}(\mathrm{d}y) \le \frac{1}{\varepsilon c_n^3}\sum_{i=1}^n \int_{\{|y|>\varepsilon c_n\}} |y|^3\, L_{Y_i}(\mathrm{d}y) \le \frac{1}{\varepsilon c_n^3}\sum_{i=1}^n \tau_i = \frac{1}{\varepsilon}\left(\frac{b_n}{c_n}\right)^3 \to 0 \quad\text{as } n\to\infty.$$
Since $\varepsilon > 0$ was arbitrary, we've shown that the Lindeberg condition holds. Hence the conclusion follows by the Lindeberg–Feller CLT.
Remark 15. It’s clear from the proof that we don’t really need the third
moment to exist; it’s enough for the (2 + α)th moment to exist for some
α > 0.
Before moving on, let’s give an example of how these theorems are used
in econometrics.
Example 14. In the linear model,
$$n^{1/2}\left(\hat\beta - \beta\right) = \left(n^{-1}X^\top X\right)^{-1}\left(n^{-1/2}X^\top\varepsilon\right).$$
Showing that $n^{-1/2}X^\top\varepsilon \xrightarrow{d} N(0,\Sigma)$ is easy when $\{X_i\}$ and $\{\varepsilon_i\}$ are iid and independent of each other, for then $\{X_i\varepsilon_i\}$ are iid random vectors and the vector Lindeberg–Lévy CLT can be applied.
But suppose that we want the asymptotic distribution conditional on $X$, or equivalently that $X$ is nonstochastic ('fixed regressors') with $n^{-1}X^\top X \to A$ for some nonsingular $A$. In this case, more work is required to show that $n^{-1/2}X^\top\varepsilon$ converges in distribution. The observations are still independent
since {εi } are, but they are no longer identically distributed, since each
term in the sum is a different linear combination of the elements of ε. We
therefore need to make assumptions sufficient for the Lindeberg condition
to be satisfied; loosely, we require that X is sufficiently bounded that no
single observation can dominate the variance of the sum. As mentioned
above, it is rather hard to check the Lindeberg condition, but in this case
the demonstration can be found in Amemiya (1985).
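Here's a sketch of the fixed-regressor case (mine; Amemiya's verification of the Lindeberg condition is analytical, not numerical). With bounded nonstochastic $x_i$ and iid errors, the studentised $n^{-1/2}X^\top\varepsilon$ is close to normal even though its summands are not identically distributed.

```python
import numpy as np

# Sketch: fixed (nonstochastic) scalar regressors x_i, iid non-normal errors.
# The summands x_i * eps_i are independent but not identically distributed;
# boundedness of x_i delivers the Lindeberg condition, so
# n^{-1/2} X'eps is approximately N(0, n^{-1} X'X) for unit-variance errors.
rng = np.random.default_rng(4)
n, reps = 1_000, 10_000

x = 1 + np.sin(np.arange(1, n + 1))          # deterministic, bounded regressors
scale = np.sqrt((x**2).mean())               # sqrt of n^{-1} X'X

eps = rng.standard_t(df=5, size=(reps, n)) / np.sqrt(5 / 3)  # unit variance
z = (eps @ x) / np.sqrt(n) / scale           # studentised n^{-1/2} X'eps

print("mean:", z.mean(), " var:", z.var())   # approximately 0 and 1
```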
Definition 29. A stochastic process {Xn } is strictly stationary iff for any
finite collection of indices (n1 , . . . , nT ), the (joint) distribution of the random
vector (Xn1 +m , . . . , XnT +m ) does not depend on m ∈ N.
The object that we’re taking the supremum of is the degree of ‘independ-
ence failure’. Since we’re taking the supremum (over a very large set), the
condition says that the degree of independence failure between ‘blocks’ k
periods apart is uniformly bounded by α(k) for each k. Since α(k) − → 0, the
degree of independence failure must vanish uniformly as blocks are pulled
further apart.
The following is one of many ‘time-series CLTs’. Joel said that it can be
found in White (2001), though I haven’t been able to locate it.
Remark 16. There is an explicit tradeoff here between how heavy tails and
how much dependence that can be accommodated: for γ small (heavy tails),
β must be large (low dependence).
Theorem 24 (Berry–Esseen). Let $\{X_n\}$ be a sequence of iid random variables with $\mathrm{E}(X_1) = 0$ and $\mathrm{Var}(X_1) = 1$ whose third moment $\mathrm{E}(X_1^3)$ exists.

(For concreteness, you can think of $a_n = n^{-1/2}$ and $b_n = \mathrm{E}(X_n)$.) Then
$$n^{-1}\sum_{i=1}^n (X_i - b_i) = O_p(1/(na_n)).$$
If $1/(na_n) = o_p(1)$, it follows that $\{X_n\}$ satisfy a weak law of large numbers:
$$n^{-1}\sum_{i=1}^n (X_i - b_i) = O_p(o_p(1)) = o_p(1).$$
so a fortiori $X_n \xrightarrow{p} X$, but $a_n\sum_{i=1}^n X_i = 0$ a.s. for any sequence of constants
$$n^{-1}\sum_{i=1}^n X_i \xrightarrow{a.s.} 0 \quad\text{and}\quad n^{-1/2}\sum_{i=1}^n X_i \xrightarrow{d} N(0,1).^{57}$$
But what happens if we use scaling constants that increase at a rate slower than $O(n)$ but faster than $O(n^{1/2})$? In particular, how much slower than $O(n)$ can we make the rate while keeping almost every convergent subsequence of $\sum_{i=1}^n X_i$ bounded? The Hartman–Wintner LIL says that the answer is $O\left([n\ln(\ln(n))]^{1/2}\right)$. (For any sequence $\{X_n\}$!)
Remark 18. A consequence of this LIL is that
$$[n\ln(\ln(n))]^{-1/2}\sum_{i=1}^n X_i$$
An intuitive way of putting this into words is that $\{A_n \text{ i.o.}\}$ obtains iff infinitely many of the events $\{A_n\}$ occur. (This follows from the de Morgan law; see e.g. Rosenthal (2006, sec. 3.4).)
It turns out (see e.g. Billingsley (1995, pp. 154–6)) that the Hartman–Wintner LIL is equivalent to the following.
Corollary 6. Let $\{X_n\}$ be a sequence of iid random variables with $\mathrm{E}(X_1) = 0$ and $\mathrm{Var}(X_1) = 1$. Then for every $\varepsilon > 0$,
$$P\left(\sum_{i=1}^n X_i \ge (1+\varepsilon)\sqrt{2n\ln(\ln(n))} \ \text{ i.o.}\right) = 0 \quad\text{and}\quad P\left(\sum_{i=1}^n X_i \ge (1-\varepsilon)\sqrt{2n\ln(\ln(n))} \ \text{ i.o.}\right) = 1.$$
In words, this reformulated LIL says that $\sum_{i=1}^n X_i$ exceeds $(1+\varepsilon)\sqrt{2n\ln(\ln(n))}$ only finitely many times (with probability 1), but exceeds $(1-\varepsilon)\sqrt{2n\ln(\ln(n))}$ infinitely many times (with probability 1). The LIL therefore bounds the extreme fluctuations of $\sum_{i=1}^n X_i$: fluctuations reaching almost to the iterated-log bound occur infinitely often w.p. 1, and fluctuations big enough to jump outside the iterated-log bound occur at most finitely many times w.p. 1.
It may look like the LIL gives us a usable 100% confidence interval for $\sum_{i=1}^n X_i$, but this is not really the case. The LIL says that along $n \in \mathbb{N}$, $\sum_{i=1}^n X_i$ escapes $\pm(1-\varepsilon)\sqrt{2n\ln(\ln(n))}$ infinitely many times with probability 1. It doesn't say anything about the probability that $\sum_{i=1}^n X_i$ lies in this interval for any given (possibly large) $n$, though!
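For what it's worth, here's a simulation (mine) counting how often one random-walk path pokes above the two envelopes; bear in mind that genuine LIL behaviour only emerges at sample sizes far beyond anything simulable.

```python
import numpy as np

# Sketch: count crossings of the (1 +/- eps) iterated-log envelopes
# sqrt(2 n ln(ln(n))) by the partial sums of iid N(0,1) draws.
rng = np.random.default_rng(5)
n, eps = 10**7, 0.1

s = np.cumsum(rng.normal(size=n))
i = np.arange(3, n + 1)                      # need ln(ln(i)) > 0, so i >= 3
env = np.sqrt(2 * i * np.log(np.log(i)))
partial = s[i - 1]

print("crossings of (1-eps) envelope:", (partial >= (1 - eps) * env).sum())
print("crossings of (1+eps) envelope:", (partial >= (1 + eps) * env).sum())
```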
As with LLNs and CLTs, LILs are also available for independent but not
identically distributed random variables, as well as for dependent random
variables. Some of these can be found in Serfling (1980, sec. 1.10).
7 Asymptotic properties of extremum estimators
Official reading: Amemiya (1985, sec. 4.1.1–4.1.2) and Newey and McFadden
(1994, sec. 2–3).
7.1 Preliminaries
The setting is as follows. There is an $\mathbb{R}^r$-valued stochastic process $\{y_n\}$ defined
on a probability space (Ω, A, P).58 We call this stochastic process the data-
generating process (DGP), and write µ0 for its (unknown) law.59 A dataset
of size n is a realisation {yi (ω)}ni=1 of the first n coordinates of the DGP.
We wish to use a dataset to learn about the law µ0 of the data-generating
process (‘the distribution of the data’). In particular, we want to learn about
(estimate) a parameter of the law. Formally, a parameter is a mapping
τ : M → Θ, where M is a set of probability measures to which µ0 is assumed
to belong, and Θ is called the parameter space. Intuitively, τ captures some
‘aspect’ of the DGP’s distribution.60 Nonparametric econometrics is concerned
with the case in which Θ is infinite-dimensional (e.g. a function space). We
will focus on parametric econometrics, meaning that we will be concerned
with the finite-dimensional case Θ ⊆ Rk for some k ∈ N.
When studying extremum estimators, we will not say anything about
the shape of the map τ : M → Θ. Instead, we will study maximisers of a
dataset-dependent function of θ ∈ Θ. (So θ is properly called a parameter
value. It is not a parameter.) When we study consistency, we will define a
θ0 ∈ Θ to which the maximisers converge in probability/a.s. as the size of the
dataset grows. We leave for the applied researcher the task of establishing
that θ0 as defined below is in fact equal to τ (µ0 ) for the parameter τ that
she wishes to estimate.
An estimator is a mapping from datasets (of arbitrary size) into Θ. As the
language suggests, the idea is usually that (for large datasets), the estimator
will be close to the true value τ (µ0 ) of some interesting parameter τ . But
don’t let the lingo confuse you: an estimator is just a mapping from datasets
into Θ, which may or may not be useful for learning about some parameter
τ.
58
Reminder: a (discrete-time) stochastic process is a collection {yn }n∈N of random
variables defined on some (common) probability space.
59
A stochastic process (taken as a whole) is a random element of a sequence space, so
the law of the process is just the law of this random element.
60 Simple example: if the DGP is $\mathbb{R}$-valued and iid with marginal distribution $\mu_0^1$, then the mean (provided it exists) is a parameter: $\tau(\mu_0) := \int_{\mathbb{R}} x\,\mu_0^1(\mathrm{d}x)$.
An extremum estimator is an estimator constructed by maximising a
data-dependent criterion function. Formally, it is a family of mappings
$\tilde\theta_n : \mathbb{R}^{n\times r} \to \Theta$, one for each sample size $n \in \mathbb{N}$, such that
7.2 Measurability
We’ve defined the criterion function and extremum estimator as deterministic
functions of the data. But for the purposes of asymptotic theory, we’d like to
be able to treat them as a random function and a random vector, respectively.
(Otherwise concepts like convergence in probability are not defined!) To this
end, redefine the criterion function and extremum estimator as mappings
directly from the underlying probability space (Ω, A, P): for each n ∈ N and
$\omega \in \Omega$,
$$Q_n(\omega)(\cdot) := \tilde Q_n\left(\{y_i(\omega)\}_{i=1}^n,\ \cdot\right) \quad\text{and}\quad \hat\theta_n(\omega) \in \arg\max_{\theta\in\Theta} Q_n(\omega)(\theta).$$
The necessary and sufficient condition for $Q_n$ to be measurable (hence a random function) is easy: we require $\tilde Q_n$ to be measurable in its first argument w.r.t. your desired σ-algebras on $\mathbb{R}^{n\times r}$ and on the space of functions $\mathbb{R}^{n\times r} \to \Theta$. But the existence of a measurable selection $\hat\theta_n$ from the argmax is not so obvious. The following gives one set of sufficient conditions.
Proposition 15. Suppose that $\tilde Q_n(\cdot, \theta)$ is measurable for each $\theta \in \Theta$, that $\tilde Q_n(\{y_i\}_{i=1}^n, \cdot)$ is continuous for each $y^n \in \mathbb{R}^{n\times r}$, and that $\Theta$ is compact. Then the argmax correspondence $G(\cdot) := \arg\max_{\theta\in\Theta} Q_n(\cdot)(\theta)$ admits a measurable selection $\hat\theta_n$.
Remark 19. The main role of continuity and compactness is to ensure that $G$ is nonempty-valued; otherwise $G$ may not admit any selection, measurable or not.
The result is a corollary of the following lemma, which I’ve adapted from
Aliprantis and Border (2006, Theorem 18.19).61
Lemma 4 (measurable maximum lemma). Let (Ω, A) be a measurable space,
and let X ⊆ Rk be compact. Let f be a function Ω × X → R such that
f (·, x) is measurable for each x ∈ X and f (ω, ·) is continuous for each ω ∈ Ω.
Define $v : \Omega \to \mathbb{R}$ and $G : \Omega \rightrightarrows X$ by
$$v(\cdot) := \max_{x\in X} f(\cdot, x) \quad\text{and}\quad G(\cdot) := \arg\max_{x\in X} f(\cdot, x).$$
$$= \left\{\omega\in\Omega : f(\omega,x) \le c\ \forall x\in X'\right\} = \bigcap_{x\in X'}\left\{\omega\in\Omega : f(\omega,x)\le c\right\} = \bigcap_{x\in X'\cap\mathbb{Q}^k}\left\{\omega\in\Omega : f(\omega,x)\le c\right\} \in \mathcal{A},$$
61 For now, we only require the second part (the measurability of the argmax). But later on we'll want to use the maximised value of the criterion function as (part of) a test statistic, and a test statistic had better be a random variable!
where the final equality holds since $f(\omega, \cdot)$ is continuous and $\mathbb{Q}^k$ is dense in $\mathbb{R}^k$, and inclusion in $\mathcal{A}$ holds because $f(\cdot, x)$ is measurable and σ-algebras are closed under countable intersection. So $F_{X'}$ is $\mathcal{A}/\mathcal{B}_{\mathbb{R}}$-measurable, no matter what (nonempty) $X' \subseteq X$ you choose. Letting $X' = X$, it follows that $v = F_X$ is $\mathcal{A}/\mathcal{B}_{\mathbb{R}}$-measurable.
Next, we want to show that some selection $g : \Omega \to X$ from $G$ is measurable. Since $X \subseteq \mathbb{R}^k$, this requires precisely that $\{\omega\in\Omega : g(\omega) \le c\} \in \mathcal{A}$ for every $c \in \mathbb{R}^k$. Write $C := \{x\in X : x \le c\}$. If $C = X$ or $C = \emptyset$ then the result follows trivially, so let $c$ be such that neither $C$ nor $X\setminus C$ is empty. Then
$$\{\omega\in\Omega : g(\omega)\le c\} = \left\{\omega\in\Omega : \max_{x\in C} f(\omega,x) \ge \sup_{x\in X\setminus C} f(\omega,x)\right\} = \left\{\omega\in\Omega : \sup_{x\in C} f(\omega,x) \ge \sup_{x\in X\setminus C} f(\omega,x)\right\} = \left\{\omega\in\Omega : F_C(\omega) - F_{X\setminus C}(\omega) \ge 0\right\} \in \mathcal{A},$$
where the inclusion follows from the fact that $F_C$ and $F_{X\setminus C}$ are both measurable and that the difference of measurable functions is measurable.
From this point on, we will always treat Qn and θbn as random elements,
without specifying particular primitive assumptions that guarantee meas-
urability. If you like concreteness, maintain the sufficient conditions given
above.62
7.3 Consistency
We say that an extremum estimator $\hat\theta_n$ is weakly consistent for $\theta_0 \in \Theta$ iff it converges in probability to $\theta_0$. We say that it is strongly consistent iff
the convergence is almost sure. As mentioned above, this section will give
conditions under which extremum estimators are consistent for a θ0 that we
will define from the criterion functions. There is no reason why θ0 should be
equal to the true value of some interesting parameter of the DGP’s law µ0 !
(1) Θ ⊆ Rk is compact.
(2) Condition (3) is a high-level assumption. Later on, we will see how
a uniform law of large numbers can be used to derive (3) from more
primitive conditions on Qn and the distribution of the data.
Proof. To establish convergence in probability, we have to show that $P\left(\hat\theta_n \in S\right) \to 1$ as $n \to \infty$ for any open neighbourhood $S$ of $\theta_0$. So fix an arbitrary open neighbourhood $S$ in $\mathbb{R}^k$ of $\theta_0$. Since $\Theta$ is compact and $S$ is open, $S^c \cap \Theta$ is compact. Since each $Q_n$ is continuous, $Q$ is continuous. Hence $\max_{\theta\in\Theta\cap S^c} Q(\theta)$ exists by the Weierstrass theorem, so we can define
and
$$n^{-1}Q_n(\theta_0) > Q(\theta_0) - \varepsilon/2. \tag{5}$$
Using $Q_n\left(\hat\theta_n\right) \ge Q_n(\theta_0)$, (4) implies
Adding (5) and (6) and cancelling $n^{-1}Q_n(\theta_0)$, we see that $A_n$ implies
$$Q\left(\hat\theta_n\right) > Q(\theta_0) - \varepsilon = \max_{\theta\in\Theta\cap S^c} Q(\theta)$$
(1) Θ ⊆ Rk is compact.
The same method of proof should work, but we will pursue a different
argument that was not available for weak consistency. The proof below is
very slow because there are subtleties in this argument that were not obvious
to me without elaboration.
Proof. Here’s the outline of what we’re going to do. For fixed ω ∈ Ω, θbn (ω)
is a sequence in Rk . From real analysis, we know that if this sequence
has convergent subsequences, and if all of these
convergent subsequences
have the same limit, then the full sequence θn (ω) is convergent with the
b
same limit. Since Θ is compact, θbn (ω) does have at least one convergent
subsequence. So we just have to show that for almost all ω ∈ Ω, every
convergent subsequence has limit θ0 .
80
Fix an arbitrary $\omega\in\Omega$, and pick an arbitrary convergent subsequence $\left(\hat\theta_{n_i^\omega}(\omega)\right)$; call its limit $\theta^\omega$. Fix $\varepsilon > 0$. By the triangle inequality,
$$\begin{aligned}
\left|(n_i^\omega)^{-1} Q_{n_i^\omega}\left(\hat\theta_{n_i^\omega}(\omega)\right) - Q(\theta^\omega)\right|
&= \left|(n_i^\omega)^{-1} Q_{n_i^\omega}\left(\hat\theta_{n_i^\omega}(\omega)\right) - Q\left(\hat\theta_{n_i^\omega}(\omega)\right) + Q\left(\hat\theta_{n_i^\omega}(\omega)\right) - Q(\theta^\omega)\right|\\
&\le \left|(n_i^\omega)^{-1} Q_{n_i^\omega}\left(\hat\theta_{n_i^\omega}(\omega)\right) - Q\left(\hat\theta_{n_i^\omega}(\omega)\right)\right| + \left|Q\left(\hat\theta_{n_i^\omega}(\omega)\right) - Q(\theta^\omega)\right|
\end{aligned}$$
a.s. uniformly to $Q$. So (regardless of what the deterministic sequence $\left(\hat\theta_{n_i^\omega}(\omega)\right)$ happens to look like,) there exists $N_2 \in \mathbb{N}$ such that
$$\left|(n_i^\omega)^{-1} Q_{n_i^\omega}\left(\hat\theta_{n_i^\omega}(\omega)\right) - Q\left(\hat\theta_{n_i^\omega}(\omega)\right)\right| < \varepsilon/2 \quad\text{a.s. for all } i \ge N_2.^{63}$$
Since $\varepsilon > 0$ was arbitrary, we've shown that there is $\Omega' \subseteq \Omega$ such that $P(\Omega') = 1$ and
$$(n_i^\omega)^{-1} Q_{n_i^\omega}(\omega')\left(\hat\theta_{n_i^\omega}(\omega)\right) \to Q(\theta^\omega) \quad\text{for all }\omega'\in\Omega'.$$
If $\omega$ lies in $\Omega'$, the LHS converges to $Q(\theta^\omega)$. For the RHS, the fact that $(n_i^\omega)^{-1}Q_{n_i^\omega} \xrightarrow{a.s.} Q$ uniformly implies that there is $\Omega'' \subseteq \Omega$ such that $P(\Omega'') = 1$ and $(n_i^\omega)^{-1}Q_{n_i^\omega}(\omega)(\theta_0) \to Q(\theta_0)$ for all $\omega\in\Omega''$. So taking the subsequential limit $i \to \infty$ on both sides, we obtain
Since $\theta_0$ is the unique maximiser of $Q$, $Q(\theta^\omega) \ge Q(\theta_0)$ implies $\theta^\omega = \theta_0$; so we have $\theta^\omega = \theta_0$ for every $\omega\in\Omega'\cap\Omega''$. Moreover, $\Omega'\cap\Omega''$ has measure one:
$$P(\Omega'\cap\Omega'') \ge 1 - P\left((\Omega')^c\right) - P\left((\Omega'')^c\right) = 1.$$
We've now shown that there is a probability-1 event $\Omega'\cap\Omega''$ such that whenever $\omega\in\Omega'\cap\Omega''$, every convergent subsequence of $\left(\hat\theta_n(\omega)\right)$ converges to $\theta_0$. Moreover, at least one convergent subsequence is guaranteed to exist for any $\omega$ by compactness of $\Theta$. It follows that the full sequence $\left(\hat\theta_n(\omega)\right)$ is convergent with limit $\theta_0$ whenever $\omega\in\Omega'\cap\Omega''$. Since $P(\Omega'\cap\Omega'') = 1$, this implies that $\hat\theta_n \xrightarrow{a.s.} \theta_0$ as desired.
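To make all this concrete, here is a sketch (mine, not from the course) of an extremum estimator computed by brute force: the Gaussian log-likelihood for the mean, maximised over a grid on a compact Θ. All the hypotheses above hold, and $\hat\theta_n$ visibly settles on $\theta_0$.

```python
import numpy as np

# Sketch: Q_n(theta) = -0.5 * sum_i (y_i - theta)^2 for iid N(theta0, 1) data,
# maximised by grid search over the compact set Theta = [-5, 5].
# n^{-1} Q_n converges uniformly to Q(theta) = -0.5*(theta - theta0)^2 + const,
# which has a unique maximum at theta0 -- so theta_hat is consistent.
rng = np.random.default_rng(6)
theta0 = 1.3
grid = np.linspace(-5.0, 5.0, 2001)

for n in (10, 1_000, 100_000):
    y = rng.normal(loc=theta0, size=n)
    qn = -0.5 * (np.sum(y**2) - 2.0 * grid * y.sum() + n * grid**2)
    print(f"n = {n:>7,}: theta_hat = {grid[qn.argmax()]: .3f}")
```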
and
$$\tilde Q\left(\{y_i\}_{i=1}^n, m\right) := -\frac{1}{2}\ln(2\pi) - \lim_{n\to\infty}\frac{1}{2n}\sum_{i=1}^n (\mu_i - m_i)^2.$$
All the assumptions in our consistency theorems appear to be satisfied.
We can take µn ∈ M for some compact M ⊆ R, so that µ ∈ M ∞ , which
is compact by Tychonoff’s theorem. Qn is measurable and continuous as
required. n−1 Qn converges a.s. uniformly to Q by a uniform LLN, though we
will not prove this. Finally, Q has a unique maximum at µ, the true means.
But as a matter of fact, the maximum-likelihood estimator is inconsistent
in this case. The reason is that the parameters µ live in an infinite-dimensional
space, whereas we required Θ to be a subset of the finite-dimensional space
Rk . More intuitively, the number of parameters increases as o(n), whereas
we required them to stay fixed at some k. It is perhaps intuitive that we
cannot consistently estimate n parameters using n data points!
This is called the incidental parameters problem, and it originated with Neyman and Scott (1948). It shows up in many other contexts. An obvious
one is the fixed effects model for panel data, where the number of fixed effects
is (by construction) equal to the number of cross-sectional observations, so
that we cannot estimate them consistently under short-panel asymptotics.
It should be clear that this is a problem of identification: no matter
how large your dataset is, you cannot precisely learn the values of the
parameters. We could formalise this in the Manski way by looking at the
joint distribution of the data directly. The approach we took above of looking
at identification indirectly via what can be consistently estimated from data
is the old-fashioned approach to identification.
Example 17 (consistency of MLE for uniform). Let $\{y_i\}_{i=1}^n$ be independent draws from $U[0,\theta_0]$. The likelihood is
$$L\left(\theta, \{y_i\}_{i=1}^n\right) = \theta^{-n}\prod_{i=1}^n \mathbf{1}(y_i\in[0,\theta]) = \theta^{-n}\,\mathbf{1}(y_i\in[0,\theta]\ \forall i).$$
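A numerical companion (mine): the likelihood above is zero below $\max_i y_i$ and decreasing in $\theta$ above it, so the MLE is $\hat\theta_n = \max_i y_i$, which converges to $\theta_0$ remarkably fast.

```python
import numpy as np

# Sketch of Example 17: for U[0, theta0] data, the likelihood
# theta^{-n} 1(all y_i <= theta) is maximised at theta_hat = max_i y_i.
rng = np.random.default_rng(7)
theta0 = 2.0

for n in (10, 1_000, 100_000):
    y = rng.uniform(0, theta0, size=n)
    print(f"n = {n:>7,}: theta_hat = {y.max():.6f}, "
          f"error = {theta0 - y.max():.2e}")
```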
The log-likelihood is
$$\tilde Q_n\left(\{y_i\}_{i=1}^n, \mu, \sigma^2\right) = -\frac{n}{2}\ln(2\pi) + \sum_{i=1}^n \ln\left[\lambda\sigma^{-1}\exp\left(-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2\right) + (1-\lambda)\exp\left(-\frac{1}{2}y_i^2\right)\right]$$
(3) $A$ defined by
$$A := \lim_{n\to\infty}\mathrm{E}\left(n^{-1}\nabla^2 Q_n(\theta_0)\right)$$
is nonsingular, and $n^{-1}\nabla^2 Q_n(\theta_n) \xrightarrow{p} A$ for any sequence of random vectors $\{\theta_n\}$ such that $\theta_n \xrightarrow{p} \theta_0$.
(4) $n^{-1/2}\nabla Q_n(\theta_0) \xrightarrow{d} N(0, B)$, where
$$B := \lim_{n\to\infty}\mathrm{E}\left(n^{-1}[\nabla Q_n(\theta_0)][\nabla Q_n(\theta_0)]^\top\right).$$
Then $n^{1/2}\left(\hat\theta_n - \theta_0\right) \xrightarrow{d} N\left(0, A^{-1}BA^{-1}\right)$.
Remark 21. θ0 is defined by assumption (1). We’re not taking a stand on
why θbn is consistent for θ0 , but if we decided to justify it using our weak
consistency theorem, then θ0 would of course be the maximiser of Q.
where the mean value θen lies between θbn and θ0 (hence in H by convexity).
Rearranging,
h i+ h i
n1/2 θbn − θ0 = − n−1 ∇2 Qn θen n−1/2 ∇Qn (θ0 ) ,
64 Under our assumptions, $n^{-1}\nabla^2 Q_n\left(\tilde\theta_n\right)$ may be singular, in which case the ordinary matrix inverse is undefined. That's why we use the Moore–Penrose pseudo-inverse: it is always (uniquely) defined, and coincides with the ordinary inverse for nonsingular matrices. For details, see e.g. Rao (1973, sec. 1.b.5 & 1.c.5).
65 This statement only makes sense if the mean value $\tilde\theta_n$ is a random vector, i.e. is measurable! It turns out that it is; Newey and McFadden (1994, p. 2141, footnote 25) indicate why, and refer the reader to our friend Jennrich (1969) for a formal proof.
The Moore–Penrose pseudo-inverse operator is continuous at $A$ since $A$ is nonsingular,66 so
$$\left[n^{-1}\nabla^2 Q_n\left(\tilde\theta_n\right)\right]^{+} \xrightarrow{p} A^{-1}$$
where the final equality used the fact that $A^{-1}$ is symmetric by Young's theorem.
(1) It should be clear from the proof that we don’t actually need θbn to be
a global maximiser; we only need it to satisfy
the first-order condition.
The result will therefore go through if θbn is a sequence of local
minima and/or maxima. This raises two issues, though. First, we’ll
need to find sufficient conditions for such an object to be measurable.
Second, we can no longer appeal to our consistency theorems above
to justify assumption (1). But it turns out that there are consistency
theorems for local maxima/minima; see e.g. Amemiya (1985, Theorem
4.1.2).
(2) The existence of the second derivative is actually not required for
asymptotic normality, though the proof is much harder without this
assumption. The kind of proof employed here definitely requires the
first derivative, since it is based on the first-order condition.
The LAD estimator is an example of an asymptotically normal ex-
tremum estimator for which even the first derivative fails to exist. In
section 7.6 (p. 90), we’ll give an idea of how asymptotic normality
can be proved without the use of derivatives for certain estimators,
including the LAD estimator. The maximum score estimator (Manski,
66
The ordinary matrix inverse operator is everywhere continuous: for invertible matrices
{An } and A, An −→ A implies A−1 n − → A−1 . (We used this to prove Slutsky’s theorem.)
But the Moore–Penrose pseudo-inverse is not actually continuous! However, it turns out to
be continuous at invertibility points, which is all we need.
86
1975) is an example in which ∇Qn does not exist, and the limiting
distribution is nonnormal (Kim & Pollard, 1990).67
You may wonder what can happen when $\theta_0$ lies on the boundary of the parameter space. The following example shows how this can give rise to a nonnormal limiting distribution, even for very simple and otherwise well-behaved estimators.
i.e. the rate of convergence is slower. This can be formalised by using a
Pitman drift, meaning that we let µ drift toward zero as n increases and
study how fast the drift can be without breaking asymptotic normality.68
(See section 9.4 (p. 117) for details on what a Pitman drift is and how it can
be used.)
True parameters on the boundary arise frequently in more sophistic-
ated econometric models. One example (currently fashionable) is moment-
inequality models, where we’re on the boundary whenever an inequality binds
in the population.
$$\tilde Q_n\left(\{y_i\}_{i=1}^n, \theta\right) = \sum_{i=1}^n \tilde q(y_i, \theta)$$
for some function $\tilde q$. We'll want to work directly with the random functions $\Theta \to \mathbb{R}$ defined by $q_i(\omega)(\theta) := \tilde q(y_i(\omega), \theta)$, so that $Q_n = \sum_{i=1}^n q_i$. Observe
by Kolmogorov’s second SLLN. Similarly, {[∇qi (θ0 )][∇qi (θ0 )]> } is a sequence
of iid k × k matrices with mean E ([∇q1 (θ0 )][∇q1 (θ0 )]> ), so by Kolmogorov’s
second SLLN we have
n
n−1 [∇Qn (θ0 )][∇Qn (θ0 )]> = n−1
X
[∇qi (θ0 )][∇qi (θ0 )]>
i=1
a.s.
−−→ E ([∇q1 (θ0 )][∇q1 (θ0 )]> ) .
Now that we have clean expressions for $A$ and $B$, we can think about estimating them. Consider the random functions $\Theta \to \mathbb{R}^{r\times r}$
$$\hat A_n := n^{-1}\nabla^2 Q_n = n^{-1}\sum_{i=1}^n \nabla^2 q_i \qquad\text{and}\qquad \hat B_n := n^{-1}[\nabla Q_n][\nabla Q_n]^\top = n^{-1}\sum_{i=1}^n [\nabla q_i][\nabla q_i]^\top.$$
We have just shown that $\hat A_n(\theta_0) \xrightarrow{a.s.} A$ and $\hat B_n(\theta_0) \xrightarrow{a.s.} B$. But these are infeasible estimators because they require knowledge of $\theta_0$. The obvious remedy is to plug in our consistent estimator $\hat\theta_n$. By continuous differentiability, $\hat A_n$ and $\hat B_n$ are continuous at $\theta_0$. Hence
$$\hat A_n\left(\hat\theta_n\right) \xrightarrow{a.s.} A \qquad\text{and}\qquad \hat B_n\left(\hat\theta_n\right) \xrightarrow{a.s.} B$$
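Here is a sketch (mine) of these plug-in estimators in the simplest case, $\tilde q(y,\theta) = -(y-\theta)^2/2$, where $\hat\theta_n$ is the sample mean: then $\nabla q_i = y_i - \theta$ and $\nabla^2 q_i = -1$, so $\hat A_n = -1$, $\hat B_n(\hat\theta_n)$ is the sample variance, and the sandwich $\hat A^{-1}\hat B\hat A^{-1}$ estimates $\mathrm{Var}(y_1)$.

```python
import numpy as np

# Sketch: analogy estimators of A and B for q(y, theta) = -(y - theta)^2/2.
# Here A_hat = -1, B_hat(theta_hat) = n^{-1} sum (y_i - theta_hat)^2, and
# the sandwich A^{-1} B A^{-1} estimates Var(y_1), the asymptotic variance
# of n^{1/2}(theta_hat - theta_0).
rng = np.random.default_rng(8)
y = rng.exponential(scale=2.0, size=100_000)   # E(y) = 2, Var(y) = 4

theta_hat = y.mean()
A_hat = -1.0
B_hat = np.mean((y - theta_hat) ** 2)
print("theta_hat:", theta_hat)                    # approx 2
print("sandwich A^-1 B A^-1:", B_hat / A_hat**2)  # approx 4
```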
7.6 Asymptotic normality with a nonsmooth objective
This section is basically an aside, so generality and rigour will be sacrificed
on the altar of clarity.
Our asymptotic normality result in section 7.4 required the existence and
continuity of derivatives of the criterion function in a neighbourhood of θ0 .
The second derivative assumption turns out to be unnecessary, but proving
this in generality is hard. Getting rid of the first derivative is harder still,
but important in certain applications (e.g. auctions). We’ll give an example
to indicate what can be done.
Suppose that the DGP {yi } is iid and that the parameter θ0 satisfies the
moment condition E (ge(y1 , θ0 )) = 0. Estimation based on moment conditions
such as these can be done using the generalised method of moments (GMM)
covered in section 10, a special case of extremum estimation. We use it here
merely as an example. Restrict attention to the univariate case $y_i \in \mathbb{R}$ and $\Theta \subseteq \mathbb{R}$, and write $F$ for the CDF of each $y_i$.
The population moment can be written $\int_{\mathbb{R}} \tilde g(y, \theta_0)\,F(\mathrm{d}y) = 0$. Define the empirical distribution function (EDF) by
$$\hat F_n(y) := n^{-1}\sum_{i=1}^n \mathbf{1}(y_i \le y).$$
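In code, integrating against the EDF is nothing but a sample average. A sketch (mine), with $\tilde g(y, \theta) = y - \theta$ so that the empirical moment condition is solved by the sample mean:

```python
import numpy as np

# Sketch: the integral of g(y, theta) = y - theta against the EDF F_hat_n is
# n^{-1} sum_i (y_i - theta); setting it to zero gives theta_hat = sample mean.
rng = np.random.default_rng(9)
y = rng.normal(loc=0.7, size=10_000)

def empirical_moment(theta):
    return np.mean(y - theta)       # integral of g w.r.t. the EDF

print("moment at the sample mean:", empirical_moment(y.mean()))  # exactly ~0
print("moment at theta = 0:     ", empirical_moment(0.0))        # approx 0.7
```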
For now, we’re merely asserting that the final term is op (1); more on that
below. The convergence in distribution of the first term is immediate by the
Lindeberg–Lévy CLT.
The usual modus operandi is as follows. Assuming that ge(y, ·) is dif-
ferentiable with derivative ge2 , we mean-value expand the remaining term
as
Z Z h i
n1/2 ge y, θbn F (dy) = n1/2
ge(y, θ0 ) + ge2 y, θen θbn − θ0 F (dy)
R R Z
h i
1/2
= n θbn − θ0 ge2 y, θen F (dy)
R
where the mean value θen lies between θbn and θ0 , and so converges in probab-
ility to θ0 . Assuming that the second derivative ge2 (y, ·) is continuous at θ0 ,
it follows by the continuous mapping theorem that
Z Z
p
ge2 y, θen F (dy) −→ ge2 (y, θ0 )F (dy) = E (ge2 (y1 , θ0 )) ,
R R
.
=N 0, E ge(y, θ0 )2 E (ge2 (y1 , θ0 ))2 .
d
91
around $\theta_0$ as
$$n^{1/2}\int_{\mathbb{R}}\tilde g\left(y,\hat\theta_n\right)F(\mathrm{d}y) = n^{1/2}\int_{\mathbb{R}}\tilde g(y,\theta_0)F(\mathrm{d}y) + n^{1/2}\left[\hat\theta_n - \theta_0\right]\frac{\partial}{\partial\theta}\int_{\mathbb{R}}\tilde g\left(y,\tilde\theta_n\right)F(\mathrm{d}y) = n^{1/2}\left[\hat\theta_n - \theta_0\right]\frac{\partial}{\partial\theta}\int_{\mathbb{R}}\tilde g\left(y,\tilde\theta_n\right)F(\mathrm{d}y).$$
Assuming that the derivative of the integral is continuous at $\theta_0$, plus some additional boundedness condition to keep the derivative in check, we get
$$\frac{\partial}{\partial\theta}\int_{\mathbb{R}}\tilde g\left(y,\tilde\theta_n\right)F(\mathrm{d}y) \xrightarrow{p} A$$
$$\overset{d}{=} N\left(0,\ \mathrm{E}\left(\tilde g(y_1,\theta_0)^2\right)\big/A^2\right).$$
This approach works for the LAD estimator, for example. There $\tilde g$ is the discontinuous function
$$\tilde g(y,\theta) = 2\cdot\mathbf{1}(y\le\theta) - 1,$$
Another estimator that admits this kind of argument is the maximum rank
correlation estimator. (I don’t pretend to know what that is.)
Recall that to get to this point, we simply asserted that
$$n^{1/2}\int_{\mathbb{R}}\left[\tilde g\left(y,\hat\theta_n\right) - \tilde g(y,\theta_0)\right][F_n - F](\mathrm{d}y) = o_p(1).$$
This is not so obvious when $\tilde g$ is nonsmooth, since we'd usually use a mean-value expansion of the integrand to show this. But here empirical process theory comes to the rescue. An empirical process is a random functional, i.e. a random function mapping from a functional space. We can write our troublesome integral as
$$\begin{aligned}
n^{1/2}\int_{\mathbb{R}}\left[\tilde g\left(y,\hat\theta_n\right) - \tilde g(y,\theta_0)\right][F_n - F](\mathrm{d}y)
&= -n^{1/2}\int_{\mathbb{R}}\tilde g\left(y,\hat\theta_n\right)F(\mathrm{d}y) - n^{1/2}\int_{\mathbb{R}}\tilde g(y,\theta_0)\,F_n(\mathrm{d}y)\\
&= -n^{1/2}\int_{\mathbb{R}}\tilde g(y,\theta_0)F(\mathrm{d}y) - n^{1/2}\int_{\mathbb{R}}\tilde g(y,\theta_0)F_n(\mathrm{d}y) + o_p(1)
\end{aligned}$$
8 The (quasi-)maximum-likelihood estimator
Official reading: Amemiya (1985, sec. 4.2.1–4.2.3) and Newey and McFadden
(1994, sec. 2.2.1, 2.4, 3.2, 3.3, and 4.2).
8.1 Preliminaries
Recall the estimation setting laid out in section 7. The data-generating pro-
cess {yn } is a Rr -valued stochastic process defined on a probability space
(Ω, A, P); its law is denoted µ0 . We observe a dataset, meaning a realisation
{yi (ω)}ni=1 of the first n coordinates of the DGP.
We do not know the law µ0 ∈ M ; a priori we only know that it lies in
a set M of probability measures. There is some parameter τ : M → Θ of
interest (Θ ⊆ Rk ) whose true value τ (µ0 ) we wish to learn about using the
data.
In our study of extremum estimators, we did not specify a particular
parameter τ ; instead we looked at estimators that are consistent for some
point in θ0 ∈ Θ which may or may not equal τ (µ0 ) for some parameter of
interest τ . Now that we’re looking at a specific class of extremum estimators,
we will be able to ensure that θ0 = τ (µ0 ).
So fix a parameter τ : M → Ω of interest. Since we’re only interested in
estimating τ (µ0 ), there is no need to distinguish between distinct measures
in M that give rise to the same parameter value; so wlog, we will treat τ as a
bijection (one-to-one and onto). This means that we can write M as {µθ }θ∈Θ ,
a parametric (finite-dimensional) family of distributions. The true value of
the parameter is written θ0 := τ (µ0 ).69 To re-iterate, θ0 is a feature of the
unknown population distribution now: it is no longer some hard-to-interpret
probability limit of an extremum estimator.
Suppose that there is a measure $\nu$ with respect to which every $\mu^\theta$ possesses a density (Radon–Nikodým derivative). Let $\mu_n^\theta$ and $\nu_n$ denote the marginal distributions that $\mu^\theta$ and $\nu$ induce over the first $n$ coordinates, and write $f_n^\theta := \mathrm{d}\mu_n^\theta/\mathrm{d}\nu_n$ for the density governing a sample of size $n$ when the true parameter is $\theta$ (the true law is $\mu^\theta$).
The likelihood function L is the random function Θ → R defined by
$L_n(\theta) = 0$.) A maximum likelihood estimator is an extremum estimator whose criterion function ($Q_n$) is $\ell_n$.
We will focus on the case of independently and identically distributed
data. In this case,
$$f_n^\theta\left(\{y_i\}_{i=1}^n\right) = \prod_{i=1}^n f^\theta(y_i)$$
Then we can write the log-likelihood as $\ell_n = \sum_{i=1}^n \ell_i^1$, a sum of iid random functions to which we can apply Jennrich's uniform SLLN. Pretty much all
the results we will derive for the MLE can be extended to the non-iid case,
but we won’t do it.
Now suppose that our model of $\mu_0$ is misspecified: the $f_n^\theta$ we use to form the likelihood is not actually equal to $\mathrm{d}\mu_n^\theta/\mathrm{d}\nu_n$. The log-likelihood $\ell_n$ is still well-defined, so we can still obtain an extremum estimator by maximising it.
Such an estimator is called a quasi-maximum-likelihood estimator (QMLE).
We will see that under appropriate conditions, QMLEs are consistent for
θ0 and asymptotically normal, but that they do not share the efficiency
properties of the MLE.
Proposition 19 (information inequality). Let $f$ and $g$ be densities w.r.t. a measure $\nu$ on $(\Omega, \mathcal{A})$. Then $\int_\Omega \ln(g/f)\,f\,\mathrm{d}\nu \le 0$, with equality iff $g = f$ $\nu$-a.e.
where $\gamma$ (a function!) lies between $g/f$ and 1.70 So
$$\int_\Omega \ln(g/f)\,f\,\mathrm{d}\nu = \int_\Omega (g/f - 1)\,f\,\mathrm{d}\nu - \int_\Omega \frac{1}{2\gamma^2}(g/f - 1)^2\,f\,\mathrm{d}\nu = \int_\Omega g\,\mathrm{d}\nu - \int_\Omega f\,\mathrm{d}\nu - \int_\Omega \frac{1}{2\gamma^2}\frac{(g-f)^2}{f}\,\mathrm{d}\nu = -\int_\Omega \frac{1}{2\gamma^2}\frac{(g-f)^2}{f}\,\mathrm{d}\nu.$$
The RHS is evidently nonpositive and equal to zero iff $g = f$ $\nu$-a.e.
$\{\ell_i^1\}$ are iid random functions $\Theta \to \mathbb{R}$. Assume that $\Theta$ is compact, that $\ell_1^1$ is continuous and that $\mathrm{E}\left(\sup_{\theta\in\Theta}\left|\ell_1^1(\theta)\right|\right) < \infty$. Then by Jennrich's uniform SLLN,
$$n^{-1}\ell_n = n^{-1}\sum_{i=1}^n \ell_i^1 \xrightarrow{a.s.} Q \quad\text{uniformly over }\Theta,$$
where $Q : \Theta \to \mathbb{R}$ is the nonstochastic function
$$Q(\theta) := \mathrm{E}\left(\ln f^\theta(y_1)\right) = \int_{\mathbb{R}^r}\ln\left(f^\theta(y_1)\right)\mu_1^{\theta_0}(\mathrm{d}y_1) = \int_{\mathbb{R}^r}\ln\left(f^\theta(y_1)\right)f^{\theta_0}(y_1)\,\nu_1(\mathrm{d}y_1),$$
using the fact that $\mu_1^{\theta_0}$ is the true distribution and that $f^{\theta_0}(y_1)$ is its density w.r.t. $\nu_1$. The information inequality tells us precisely that $Q$ attains a unique
maximum at θ0 . Lo and behold, all the assumptions of our strong consistency
result (p. 80) are satisfied, so the MLE is strongly consistent for θ0 . And as
promised, θ0 here was defined as τ (µ0 ), the true value of the parameter τ .
‘Consistency for the truth’, if you will. To summarise:
Proposition 20 (consistency for the truth). Suppose that $\Theta$ is compact, that $\ell_1^1$ is continuous and that $\mathrm{E}\left(\sup_{\theta\in\Theta}\left|\ell_1^1(\theta)\right|\right) < \infty$. Then the MLE $\hat\theta_n$ is
Q has a unique maximum, and certainly not that this maximum is equal
to τ (µ0 ). It is possible, however, to give additional conditions restricting
the degree of misspecification in such a way that $Q$ is guaranteed to have a unique maximum at $\tau(\mu_0)$. A QMLE satisfying these extra conditions will also be consistent for the truth.
and
$$\begin{aligned}
B &:= \lim_{n\to\infty}\mathrm{E}\left(n^{-1}[\nabla\ell_n(\theta_0)][\nabla\ell_n(\theta_0)]^\top\right)\\
&= \lim_{n\to\infty}\mathrm{E}\left(\left[n^{-1/2}\sum_{i=1}^n \nabla\ell_i^1(\theta_0)\right]\left[n^{-1/2}\sum_{i=1}^n \nabla\ell_i^1(\theta_0)\right]^\top\right)\\
&= \lim_{n\to\infty}\mathrm{E}\left(n^{-1}\sum_{i=1}^n \left[\nabla\ell_i^1(\theta_0)\right]\left[\nabla\ell_i^1(\theta_0)\right]^\top\right)\\
&= \mathrm{E}\left(\left[\nabla\ell_1^1(\theta_0)\right]\left[\nabla\ell_1^1(\theta_0)\right]^\top\right).
\end{aligned}$$
We assume that both of these expectations exist and are finite, and fur-
thermore that A is nonsingular (negative definite will do). The information
matrix equality below will tell us that −A = B when the model is correctly
specified. −A is called the Hessian form of the information matrix, and B is
called the outer product form of the information matrix.
$\left\{\nabla^2\ell_i^1(\theta_0)\right\}$ is an iid sequence of random matrices whose mean we assumed exists, so by Khinchine's WLLN (p. 56) we have
$$n^{-1}\nabla^2\ell_n(\theta_0) = n^{-1}\sum_{i=1}^n \nabla^2\ell_i^1(\theta_0) \xrightarrow{p} A.$$
Since $\nabla^2\ell_n$ is continuous in a neighbourhood of $\theta_0$, any $\{\theta_n\}$ with $\theta_n = \theta_0 + o_p(1)$ satisfies
$$\mathrm{E}\left(\nabla\ell_1^1(\theta_0)\right) = \mathrm{E}\left(\frac{\nabla L_1^1(\theta_0)}{L_1^1(\theta_0)}\right) = \int_{\mathbb{R}^r}\frac{\frac{\partial}{\partial\theta}f^{\theta_0}(y)}{f^{\theta_0}(y)}\,f^{\theta_0}(y)\,\nu_1(\mathrm{d}y) = \int_{\mathbb{R}^r}\frac{\partial}{\partial\theta}f^{\theta_0}(y)\,\nu_1(\mathrm{d}y) = \frac{\partial}{\partial\theta}\int_{\mathbb{R}^r}f^{\theta_0}(y)\,\nu_1(\mathrm{d}y) = \frac{\partial(1)}{\partial\theta} = 0.$$
In words, the expected score is zero at the truth. So $\left\{\nabla\ell_i^1(\theta_0)\right\}$ is an iid sequence of random vectors with mean zero and finite variance $B$. Hence by the multivariate Lindeberg–Lévy CLT we have
$$n^{-1/2}\nabla\ell_n(\theta_0) = n^{-1/2}\sum_{i=1}^n \nabla\ell_i^1(\theta_0) \xrightarrow{d} N(0, B).$$
Since $\hat\theta_n \xrightarrow{p} \theta_0$ by the arguments in the previous section, our asymptotic normality result for extremum estimators (p. 84) applies, giving us
$$n^{1/2}\left(\hat\theta_n - \theta_0\right) \xrightarrow{d} N\left(0, A^{-1}BA^{-1}\right).$$
Actually, it turns out that $-A = B$ when the model is correctly specified, so that the asymptotic variance simplifies to $-A^{-1} = -\left[\mathrm{E}\left(\nabla^2\ell_1^1(\theta_0)\right)\right]^{-1}$. This result is called the information matrix equality.
Proof. The proof is similar to the demonstration above that the expected score is zero at the true parameter.
$$\begin{aligned}
\mathrm{E}\left(\nabla^2\ell_1^1(\theta_0)\right)
&= \mathrm{E}\left(\frac{\partial}{\partial\theta^\top}\nabla\ell_1^1(\theta_0)\right) = \mathrm{E}\left(\frac{\partial}{\partial\theta^\top}\left(\frac{\nabla L_1^1(\theta_0)}{L_1^1(\theta_0)}\right)\right)\\
&= \int_{\mathbb{R}^r}\frac{\partial}{\partial\theta^\top}\left(\frac{\frac{\partial}{\partial\theta}f^{\theta_0}(y)}{f^{\theta_0}(y)}\right)f^{\theta_0}(y)\,\nu_1(\mathrm{d}y)\\
&= \int_{\mathbb{R}^r}\left(\frac{\frac{\partial^2}{\partial\theta\partial\theta^\top}f^{\theta_0}(y)}{f^{\theta_0}(y)} - \frac{\left[\frac{\partial}{\partial\theta}f^{\theta_0}(y)\right]\left[\frac{\partial}{\partial\theta^\top}f^{\theta_0}(y)\right]}{f^{\theta_0}(y)^2}\right)f^{\theta_0}(y)\,\nu_1(\mathrm{d}y)\\
&= \int_{\mathbb{R}^r}\frac{\partial^2}{\partial\theta\partial\theta^\top}f^{\theta_0}(y)\,\nu_1(\mathrm{d}y) - \int_{\mathbb{R}^r}\left[\frac{\frac{\partial}{\partial\theta}f^{\theta_0}(y)}{f^{\theta_0}(y)}\right]\left[\frac{\frac{\partial}{\partial\theta}f^{\theta_0}(y)}{f^{\theta_0}(y)}\right]^\top f^{\theta_0}(y)\,\nu_1(\mathrm{d}y)\\
&= \int_{\mathbb{R}^r}\frac{\partial^2}{\partial\theta\partial\theta^\top}f^{\theta_0}(y)\,\nu_1(\mathrm{d}y) - \mathrm{E}\left(\left[\nabla\ell_1^1(\theta_0)\right]\left[\nabla\ell_1^1(\theta_0)\right]^\top\right)\\
&= \frac{\partial^2}{\partial\theta\partial\theta^\top}\int_{\mathbb{R}^r}f^{\theta_0}(y)\,\nu_1(\mathrm{d}y) - \mathrm{E}\left(\left[\nabla\ell_1^1(\theta_0)\right]\left[\nabla\ell_1^1(\theta_0)\right]^\top\right)\\
&= \frac{\partial^2(1)}{\partial\theta\partial\theta^\top} - \mathrm{E}\left(\left[\nabla\ell_1^1(\theta_0)\right]\left[\nabla\ell_1^1(\theta_0)\right]^\top\right) = -\mathrm{E}\left(\left[\nabla\ell_1^1(\theta_0)\right]\left[\nabla\ell_1^1(\theta_0)\right]^\top\right).
\end{aligned}$$
When the model is not correctly specified (the QMLE case), we can still
obtain asymptotic normality. The only part of the argument that relied on
correct specification was our demonstration that the expected score is zero
at the truth; this will have to be assumed to ensure asymptotic normality
of the QMLE. The information matrix equality does not hold in this case,
so we have to stick with the sandwich form A−1 BA−1 for the asymptotic
variance.
8.4 Estimating the asymptotic variance
The matrices A and B are unknown parameters of the DGP. If we’re going
to use our asymptotic normality result to approximate the distribution of
the MLE, we had better be able to estimate them consistently!
Recall from section 7.5 (p. 88) that we already know how to estimate $A$ and $B$ consistently in the general framework of extremum estimation. In particular, define the random functions $\Theta \to \mathbb{R}^{r\times r}$ by
$$\hat A_n := n^{-1}\nabla^2\ell_n = n^{-1}\sum_{i=1}^n \nabla^2\ell_i^1 \qquad\text{and}\qquad \hat B_n := n^{-1}[\nabla\ell_n][\nabla\ell_n]^\top = n^{-1}\sum_{i=1}^n \left[\nabla\ell_i^1\right]\left[\nabla\ell_i^1\right]^\top;$$
then $\hat A_n\left(\hat\theta_n\right)$ and $\hat B_n\left(\hat\theta_n\right)$ are strongly consistent for $A$ and $B$, respectively, under the conditions of the consistency and asymptotic normality results plus a dominance condition on the derivatives to allow the interchange of integration and differentiation.
These were the analogy estimators. Analogy estimation just means using sample averages to estimate population averages (expectations). To motivate an alternative way of estimating $A$ and $B$, let's restate that in fancier language. Suppose we wish to estimate
$$\mathrm{E}(\psi(y_1, \theta_0)) = \int_{\mathbb{R}^r}\psi(y, \theta_0)\,\mu_1^{\theta_0}(\mathrm{d}y) = \int_{\mathbb{R}^r}\psi(y, \theta_0)\,F^{\theta_0}(\mathrm{d}y),$$
integrate $\psi(\cdot, \theta_0)$ w.r.t. $\hat F_n$ rather than the unknown $F^{\theta_0}$. We don't know $\theta_0$, but if $\psi(y, \cdot)$ is continuous then we can replace $\theta_0$ with $\hat\theta_n$ without affecting
consistency. The estimator we obtain in this way is (fairly obviously) precisely the analogy estimator:
$$\int_{\mathbb{R}^r}\psi\left(y, \hat\theta_n\right)\hat F_n(\mathrm{d}y) = n^{-1}\sum_{i=1}^n \psi\left(y_i, \hat\theta_n\right).$$
This class of estimators has the virtue that it’s easy to prove consistency
under weak conditions using a law of large numbers. It is a semiparametric
approach: there’s a nonparametric step (the EDF Fbn ) and a parametric step
(the consistent estimator θbn ).
But in the MLE context, we already have a full parametric model of $F^{\theta_0}$; why not make use of it? In particular, instead of integrating w.r.t. the nonparametric EDF $\hat F_n$, why not integrate w.r.t. $F^{\hat\theta_n}$, the plug-in estimate of $F^{\theta_0}$ obtained by making use of our model $\theta \mapsto F^\theta$ of the DGP? This suggestion yields a parametric estimator of $\int_{\mathbb{R}^r}\psi(y, \theta_0)F^{\theta_0}(\mathrm{d}y)$:
$$\int_{\mathbb{R}^r}\psi\left(y, \hat\theta_n\right)F^{\hat\theta_n}(\mathrm{d}y).$$
numerical integration, which is several orders of magnitude slower than averages (for a given level of accuracy). When analytical derivatives are not available, $\hat A_n\left(\hat\theta_n\right)$ suddenly becomes much heavier than $\hat B_n\left(\hat\theta_n\right)$ because accurate second derivatives are computationally expensive compared to first derivatives. But $\tilde B\left(\hat\theta_n\right)$ is still more expensive than $\hat A_n\left(\hat\theta_n\right)$ because accurate numerical integration is a lot slower than numerical differentiation.
Next, how close do these estimators actually tend to be to $A$ and $B$ in a finite sample, assuming that we've computed them (very) accurately? To answer this question, we have to look at Monte Carlo studies. The lightning lit review is that $\hat A_n\left(\hat\theta_n\right)$ is worst, $\hat B_n\left(\hat\theta_n\right)$ is better, and $\tilde B\left(\hat\theta_n\right)$ is best.
However, notice that the latter is only consistent for $B$ (hence for $-A$) under correct specification. In the misspecified (QMLE) case, it will not consistently estimate either $-A$ or $B$ since the integrating density is wrong! By contrast, $\hat A_n\left(\hat\theta_n\right)$ and $\hat B_n\left(\hat\theta_n\right)$ consistently estimate $A$ and $B$ under incorrect specification because we integrate w.r.t. the EDF, whose consistency does not depend on the correctness or otherwise of the parametric model. It follows that the asymptotic variance estimate
$$\hat A_n\left(\hat\theta_n\right)^{-1}\hat B_n\left(\hat\theta_n\right)\hat A_n\left(\hat\theta_n\right)^{-1}$$
is robust to misspecification, whereas variance estimates that rely on the information matrix equality (such as ones based on $\tilde B\left(\hat\theta_n\right)$) are not.
We assume that $\tilde p$, and hence each $p_i$, are differentiable. Then the score of the $i$th log-likelihood contribution is
$$= p_i(\theta)\left[1 - p_i(\theta)\right]x_i.$$
This is what e.g. Stata uses by default to estimate the asymptotic variance
of the MLE for the logit model. It is numerically different from $\hat A_n\left(\hat\theta_n\right)$, of
course. (To give a closed form for the latter we would have to compute some
monstrous derivatives, so I’d rather not.)
close to each other in a large sample if the model is correctly specified. To
formalise what we mean by ‘close’, we need an asymptotic distribution. It
turns out that under the null hypothesis of correct specification, the elements
of the $k \times k$ matrix
$$W_n\left(\hat\theta_n\right) := n^{1/2}\left[\hat A_n\left(\hat\theta_n\right) + \hat B_n\left(\hat\theta_n\right)\right]$$
are jointly normally distributed in the limit, with mean zero and a covariance matrix that can be estimated consistently. $W_n\left(\hat\theta_n\right)$ has $k(k+1)/2$ independent elements (since it's symmetric), so we can construct a test statistic as a function of these. Such tests are called information matrix (IM) tests, introduced by White (1982).
There are many ways of turning $W_n\left(\hat\theta_n\right)$ into a test statistic. A simple one is to take a quadratic form
$$q_n\left(\hat\theta_n\right)^\top W_n\left(\hat\theta_n\right)q_n\left(\hat\theta_n\right)$$
where the vector $q_n\left(\hat\theta_n\right)$ is chosen as a function of the estimated covariance. Under the null, this quadratic form converges to a $\chi^2$ distribution with degrees of freedom depending on how many independent elements of $W_n\left(\hat\theta_n\right)$ are given positive weight. Many variants have been proposed, some of which are asymptotically equivalent but much easier to compute.
Monte Carlo evidence suggests that the χ2 approximation to the distri-
bution of IM statistics is very poor. When using the 5% χ2 critical values,
the test often rejects under the null as often as 95% of the time for moderate
sample sizes! Fortunately, the bootstrap approximation to the distribution of
IM statistics is very accurate, so we can use bootstrap critical values instead.
scenario in which some arbitrary $\theta \in \Theta$ is the true value:
$$\mathrm{E}^\theta(T_n) := \int_{\mathbb{R}^{n\times r}}\tilde T_n\left(\{y_i\}_{i=1}^n\right)\mathrm{d}\mu_n^\theta$$
$$\mathrm{Var}^\theta(T_n) := \int_{\mathbb{R}^{n\times r}}\left[\tilde T_n\left(\{y_i\}_{i=1}^n\right) - \mathrm{E}^\theta(T_n)\right]\left[\tilde T_n\left(\{y_i\}_{i=1}^n\right) - \mathrm{E}^\theta(T_n)\right]^\top\mathrm{d}\mu_n^\theta.$$
(Recall that $\mu_n^\theta$ denotes the law of $\{y_i\}_{i=1}^n$ when $\theta$ is the true parameter value.) With this notation, we can easily define a function $I : \Theta \to \mathbb{R}^{k\times k}$ that maps an arbitrary $\theta\in\Theta$ into what the information matrix $-A = B$ would be if $\theta$ were the true value:
$$I(\theta) := \mathrm{E}^\theta\left(\left[\nabla\ell_1^1(\theta)\right]\left[\nabla\ell_1^1(\theta)\right]^\top\right) = -\mathrm{E}^\theta\left(\nabla^2\ell_1^1(\theta)\right).$$
We’ve seen that when θ is the true value, the asymptotic variance of the
MLE is I(θ)−1 . Let V : Θ → Rk×k be the asymptotic variance (at each θ) of
some alternative estimator that is also consistent and asymptotically normal.
We say that the MLE is asymptotically (weakly) more efficient at θ ∈ Θ
than the alternative estimator iff V (θ) − I(θ)−1 is positive semidefinite (psd).
$V(\theta) - I(\theta)^{-1}$ being psd says precisely that every linear combination of $\hat\theta_n$
has weakly lower variance than the same linear combination of the alternative
estimator. We say that the MLE is asymptotically more efficient (simpliciter)
than the alternative estimator iff it is asymptotically more efficient at every
θ ∈ Θ. Finally, we say that the MLE is asymptotically efficient within some
class of estimators iff it is asymptotically more efficient than each other
estimator in the class.
A natural conjecture is that the MLE is asymptotically efficient within
the class of consistent and asymptotically normal estimators. This conjecture
was long believed to be true, but the following (species of) example, due to
Hodges, shows that it is not.
Example 22 (super-efficiency). Let $\Theta \subseteq \mathbb{R}$ and let the true value be $\theta_0$. Let $\hat\theta_n$ be the (consistent and asymptotically normal) MLE. Define another estimator $\tilde\theta_n$ by
$$\tilde\theta_n = \begin{cases} 0 & \text{if } |\hat\theta_n| < n^{-1/4}\\ \hat\theta_n & \text{if } |\hat\theta_n| \ge n^{-1/4}.\end{cases}$$
For $\theta_0 \ne 0$, $\tilde\theta_n$ is asymptotically equivalent to the MLE $\hat\theta_n$, so has the same asymptotic variance. But when $\theta_0 = 0$, consistency means that for large $n$, with high probability, $|\hat\theta_n| < n^{-1/4}$, in which case the asymptotic variance is zero. $\theta_0 = 0$ is called a point of super-efficiency of the estimator $\tilde\theta_n$, and an estimator with super-efficiency points is called a super-efficient estimator.
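A simulation of Hodges' construction (mine), taking the sample mean of $N(\theta_0, 1)$ data as the 'MLE' (so $\hat\theta_n \sim N(\theta_0, 1/n)$ exactly): at $\theta_0 = 0$ the thresholded estimator collapses to zero, while away from zero it just tracks $\hat\theta_n$.

```python
import numpy as np

# Sketch: Hodges' super-efficient estimator built from theta_hat ~ N(theta0, 1/n),
# the exact distribution of the sample mean of n iid N(theta0, 1) draws.
# We compare n * MSE, a stand-in for the asymptotic variance.
rng = np.random.default_rng(10)
n, reps = 10_000, 20_000

for theta0 in (0.0, 1.0):
    mle = rng.normal(loc=theta0, scale=1 / np.sqrt(n), size=reps)
    hodges = np.where(np.abs(mle) < n**-0.25, 0.0, mle)
    print(f"theta0 = {theta0}: n*MSE(MLE) = {n * np.mean((mle - theta0)**2):.3f}, "
          f"n*MSE(Hodges) = {n * np.mean((hodges - theta0)**2):.3f}")
```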
So the MLE is not efficient: there exists another estimator θen whose
variance is at least as low for every θ0 ∈ Θ and strictly lower for some θ0 ∈ Θ.
In particular, it is always possible to construct a super-efficient estimator
that efficiency-dominates the MLE.
But we can salvage a great deal here. Le Cam showed that the set $\tilde\Theta \subseteq \Theta$
of super-efficiency points of any given super-efficient estimator must have
Lebesgue measure zero. This is a formal sense in which we might call super-
efficiency a pathology. Moreover, super-efficient estimators turn out to be
the only obstacle to calling the MLE asymptotically efficient: Le Cam also
showed that the MLE is asymptotically efficient within the class of non-
super-efficient, consistent and asymptotically normal estimators. Similarly,
if we redefine ‘asymptotically more efficient than’ to require only that the
variance is smaller for a subset of Θ of full Lebesgue measure, then the MLE
is asymptotically efficient within the class of all consistent and asymptotically
normal estimators.
There’s a bunch of other results along the same lines. In efficiency settings,
I −1 is called the Cramér–Rao lower bound. A basic result is that no unbiased
estimator can achieve asymptotic variance lower than the Cramér–Rao bound
at every θ ∈ Θ. The MLE is not unbiased in general, but it is asymptotically
unbiased under regularity conditions, so this result gives us efficiency of the
MLE in the class of consistent, asymptotically normal and asymptotically
unbiased estimators. Another result is that the Cramér–Rao bound is a lower
bound on the variance of any consistent and uniformly asymptotically
normal estimator. The latter means that weak convergence to a normal is
uniform in a certain sense.
The most general class of theorems on asymptotic efficiency is exemplified
by the Hájek–Le Cam asymptotic minimax theorem (see e.g. Ibragimov and
Has’minskii (1981, Theorem 12.1)). Such results give a set of conditions
under which an estimator is efficient within some class, and the MLE satisfies
these assumptions under regularity conditions.
9 Hypothesis testing
Official reading: Amemiya (1985, sec. 4.5.1) and Newey and McFadden (1994,
sec. 9).
9.1 Preliminaries
So far, we’ve focused on the problem of point estimation: given DGP para-
meterised by θ0 , we try to find a way of learning the value of θ0 . Now let’s
turn that on its head: we start with a subset Θ0 ⊆ Θ, and want to learn
whether θ0 ∈ Θ0 . Intuitively, we should be able to do this using our consistent
and asymptotically normal extremum estimator θbn , for if the hypothesis is
true then θbn ∈ Θ0 with high probability, and if the hypothesis is false then
θbn ∈
/ Θ0 with high probability.
The formal setup is as follows. We wish to test the null hypothesis
(H0 ) that θ0 ∈ Θ0 against the alternative hypothesis (H1 ) that θ0 ∈ / Θ0 .
To implement this, we use a test statistic. A real-valued statistic is just a
measurable function Tn : Rn×r → R mapping data into the real line; as usual
we will work directly with the random variables Tn (ω) := Ten ({yi (ω)}ni=1 ).
A rejection region for the statistic Tn is some subset Rn of R. Our testing
procedure is as follows:
If Tn ∈ Rn then we reject H0 ;
otherwise we fail to reject H0 .
In many cases, the rejection regions {Rn } will take the form
The tests discussed below have rejection regions of the former kind; the t test
has a rejection region of the latter kind. In either case, we call cn a critical
value for the test.
We’ll want a sensible way of choosing the rejection region, or else the test
will be useless. There are two problems we have to worry about: rejecting H0
when it’s actually true (type-I error), and failing to reject H0 when it’s false
(type-II error). It is much easier to control the probability of type-I error of
a test, as we’ll see momentarily, so that’s what we’ll focus on in determining
our critical values.
So fix a desired probability α of type-I error; α is called the (desired)
size of the test. Then for any fixed θ ∈ Θ0 , we can read off the rejection
region RTαn (θ) from the (approximate) distribution of Tn under θ0 = θ. For
the approximate distributions we’ll consider, we can summarise the rejection
region using a critical value $c_{T_n}^\alpha(\theta_0)$. (E.g. for rejection regions of the form $\left[c_{T_n}^\alpha(\theta_0), \infty\right)$, the critical value is the $(1-\alpha)$th quantile of the (approximate) distribution of $T_n$.)
Without further restrictions, this test is infeasible because the critical
values depend on the unknown value of θ0 . The reason why type-I error is
easy to control is that since Θ0 is generally a fairly small set, it will often
be the case that our approximate distribution of Tn under θ0 = θ ∈ Θ0 will
be the same for each θ ∈ Θ0 . That is, we have an approximate distribution
that holds under H0 irrespective of which particular θ ∈ Θ0 is the true value.
This gives us critical values cαn that can be obtained without knowledge of
θ0 .
Useful jargon: a statistic Tn is pivotal under H0 iff its distribution under
H0 (for finite n) does not depend on θ0 . It is asymptotically pivotal under
H0 iff its asymptotic distribution under H0 does not depend on θ0 . We are
interested in the latter case, since we rarely encounter statistics whose finite-
sample distribution can be derived, never mind shown to be independent of
$\theta_0$. The previous paragraph says, in short, that our tests will be based on statistics that are asymptotically pivotal under $H_0$.
Actually, we will be able to simplify the critical values further. In general,
our critical values cαn will depend on n since our approximate distribution for
Tn might vary with the sample size. But since our approximating distribution
will be the asymptotic distribution (obviously independent of $n$), we can use critical values $c^\alpha$ that do not depend on $n$. However, our asymptotic distribution
actually justifies the use of any critical values cα + op (1) (anything that is
asymptotically equivalent to cα ). If we choose to add some n-dependent,
asymptotically vanishing term to the critical values, we may obtain a better
approximation to the distribution of Tn under H0 in a finite sample. Many
such finite-sample corrections have been proposed for well-known tests,
usually justified by Monte Carlo evidence.
So far, we have only dealt with type-I error. But type-II error is also a
concern: a test with known (asymptotic) size α but high probability of type-II
error will not be able to discriminate between the null and alternative hypo-
theses. The power Pnα (θ) of a test of size α against the (specific) alternative
θ0 = θ ∈ Θc0 is the probability of rejecting H0 when θ0 = θ (a particular way
in which H1 can be true). (So Pnα (θ) is one minus the probability of type-II
error.) We say that our test is consistent against the alternative θ0 = θ iff
Pnα (θ) − → 1 as n −→ ∞. The test is consistent iff it is consistent against every
[Figure: the criterion function $Q_n$ plotted over $\Theta$, marking the maximiser $\hat\theta_n$ with value $Q_n\left(\hat\theta_n\right)$ and the hypothesised value $\theta_0$ with value $Q_n(\theta_0)$.]
Formally, the three test statistics for a simple null hypothesis with $\Theta_0 = \{\theta_0\}$ are
$$\begin{aligned}
LR_n &= 2\left[Q_n\left(\hat\theta_n\right) - Q_n(\theta_0)\right]\\
LM_n &= \left[n^{-1/2}\nabla Q_n(\theta_0)\right]^\top\left[\hat B_n(\theta_0)\right]^{+}\left[n^{-1/2}\nabla Q_n(\theta_0)\right]\\
W_n &= \left[n^{1/2}\left(\hat\theta_n - \theta_0\right)\right]^\top\left[\hat A_n\left(\hat\theta_n\right)^{+}\hat B_n\left(\hat\theta_n\right)\hat A_n\left(\hat\theta_n\right)^{+}\right]^{+}\left[n^{1/2}\left(\hat\theta_n - \theta_0\right)\right],
\end{aligned}$$
In other words, each of the test stats is op (n). This is the formal sense in
which the three test stats must be close to 0 with high probability in a large
sample.
To obtain critical values that we can use to perform these tests, we need
to derive the asymptotic distributions of the three test statistics. For LMn
and Wn , it should be clear that we just have to apply Slutsky’s theorem.
(For $LR_n$, additional structure will be needed.)
Proposition 22. Assume the hypotheses of our asymptotic normality result for extremum estimators (p. 84), and further suppose that $A^{-1}BA^{-1}$ is positive definite.73 Then $LM_n \xrightarrow{d} \chi^2(k)$ and $W_n \xrightarrow{d} \chi^2(k)$.
Proof. $n^{-1/2}\nabla Q_n(\theta_0) \xrightarrow{d} N_k(0, B)$ by assumption, and $\hat B_n(\theta_0) \xrightarrow{p} B$. Hence by Slutsky's theorem,
$$LM_n \xrightarrow{d} [N_k(0, B)]^\top B^{-1}[N_k(0, B)] \overset{d}{=} [N_k(0, I)]^\top[N_k(0, I)] \overset{d}{=} \chi^2(k).$$
$n^{1/2}\left(\hat\theta_n - \theta_0\right) \xrightarrow{d} N_k\left(0, A^{-1}BA^{-1}\right)$ by the asymptotic normality proposition, and $\hat A_n\left(\hat\theta_n\right) \xrightarrow{p} A$ and $\hat B_n\left(\hat\theta_n\right) \xrightarrow{p} B$. $A^{-1}BA^{-1}$ is symmetric and positive definite, hence nonsingular. So by Slutsky's theorem,
$$W_n \xrightarrow{d} \left[N_k\left(0, A^{-1}BA^{-1}\right)\right]^\top\left[A^{-1}BA^{-1}\right]^{-1}\left[N_k\left(0, A^{-1}BA^{-1}\right)\right] \overset{d}{=} [N_k(0, I)]^\top[N_k(0, I)] \overset{d}{=} \chi^2(k).$$
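To see the two statistics agree in practice, here's a sketch (mine) in the scalar Gaussian-mean problem with $Q_n(\theta) = -\frac{1}{2}\sum_i (y_i - \theta)^2$, where $A = -1$ and $B = \mathrm{Var}(y_1)$: both statistics reduce to squared t-type statistics and reject at close to the nominal rate under the null.

```python
import numpy as np

# Sketch: LM and Wald statistics for H0: theta = theta0 with
# Q_n(theta) = -0.5 * sum (y_i - theta)^2. Here
#   n^{-1/2} grad Q_n(theta0) = sqrt(n)(ybar - theta0),  A_hat = -1,
#   B_hat(theta) = n^{-1} sum (y_i - theta)^2,
# so both statistics are approximately chi^2(1) under the null.
rng = np.random.default_rng(11)
n, reps, theta0 = 500, 20_000, 0.0

y = rng.normal(loc=theta0, size=(reps, n))
ybar = y.mean(axis=1)
score = np.sqrt(n) * (ybar - theta0)

lm = score**2 / np.mean((y - theta0) ** 2, axis=1)
wald = n * (ybar - theta0) ** 2 / y.var(axis=1)

print("5% rejection rate, LM:  ", (lm > 3.841).mean())    # approx 0.05
print("5% rejection rate, Wald:", (wald > 3.841).mean())  # approx 0.05
print("max |LM - Wald|:", np.abs(lm - wald).max())        # small: op(1)
```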
In fact, something much stronger is true. Not only are the asymptotic
distributions the same; the two statistics are actually numerically close in
large samples with high probability.
73
This just means that the limiting distribution is nondegenerate: no linear combination
has variance zero.
Proposition 23. Assume the hypotheses of our asymptotic normality result for extremum estimators (p. 84), and further suppose that $A^{-1}BA^{-1}$ is positive definite. Then $LM_n - W_n = o_p(1)$.
Partial proof. We will treat the case in which $\hat\theta_n$ lies in the neighbourhood of $\theta_0$ in which the derivatives exist and are continuous. This occurs with probability approaching 1 as $n\to\infty$, but a rigorous proof would proceed more cautiously.
Since $\theta_0 \in \operatorname{int}\Theta$, $\hat\theta_n$ lies in the interior of $\Theta$. Hence it must satisfy the FOC $\nabla Q_n\left(\hat\theta_n\right) = 0$. Expanding the derivative around $\theta_0$ using the mean value theorem,
$$0 = \nabla Q_n(\theta_0) + \nabla^2 Q_n\left(\tilde\theta_n\right)\left[\hat\theta_n - \theta_0\right],$$
where the mean value $\tilde\theta_n$ lies between $\theta_0$ and $\hat\theta_n$, so that $\tilde\theta_n \xrightarrow{p} \theta_0$ since $\hat\theta_n$ is consistent. Rearranging and using the definition $\hat A_n = n^{-1}\nabla^2 Q_n$ from section 7.5 (p. 88),
$$-\left[n^{-1/2}\nabla Q_n(\theta_0)\right] = \hat A_n\left(\tilde\theta_n\right)\left[n^{1/2}\left(\hat\theta_n - \theta_0\right)\right].$$
On the one hand, we can premultiply this by $\hat B_n\left(\tilde\theta_n\right)^{+}$ to get
$$-\hat B_n\left(\tilde\theta_n\right)^{+}\left[n^{-1/2}\nabla Q_n(\theta_0)\right] = \hat B_n\left(\tilde\theta_n\right)^{+}\hat A_n\left(\tilde\theta_n\right)\left[n^{1/2}\left(\hat\theta_n - \theta_0\right)\right]. \tag{7}$$
(using the symmetry of the second derivative, which holds by Young's theorem). Premultiplying (7) by (8),
$$\left[n^{-1/2}\nabla Q_n(\theta_0)\right]^\top\hat B_n\left(\tilde\theta_n\right)^{+}\left[n^{-1/2}\nabla Q_n(\theta_0)\right] = \left[n^{1/2}\left(\hat\theta_n - \theta_0\right)\right]^\top\hat A_n\left(\tilde\theta_n\right)\hat B_n\left(\tilde\theta_n\right)^{+}\hat A_n\left(\tilde\theta_n\right)\left[n^{1/2}\left(\hat\theta_n - \theta_0\right)\right].$$
By Slutsky's theorem, the LHS is
$$= LM_n + o_p(1).$$
Also by Slutsky's theorem, the RHS is

[n^{1/2}(θ̂n − θ0)]^⊤ Ân(θ̃n) B̂n(θ̃n)^+ Ân(θ̃n)[n^{1/2}(θ̂n − θ0)]
 = [n^{1/2}(θ̂n − θ0)]^⊤ Ân(θ̂n) B̂n(θ̂n)^+ Ân(θ̂n)[n^{1/2}(θ̂n − θ0)] + op(1)
 = [n^{1/2}(θ̂n − θ0)]^⊤ [Ân(θ̂n)^+ B̂n(θ̂n) Ân(θ̂n)^+]^+ [n^{1/2}(θ̂n − θ0)] + op(1)
 = Wn + op(1).

Together, this says that LMn + op(1) = Wn + op(1), or equivalently LMn − Wn = op(1). ∎
What about the likelihood ratio stat? Let's just see how far we can get. As in the previous two proofs, let's proceed by simply assuming that θ̂n lies in the neighbourhood of θ0 in which the derivatives exist and are continuous (which is the case with probability approaching 1 as n → ∞). A second-order mean value expansion of Qn(θ0) around θ̂n yields

Qn(θ0) − Qn(θ̂n) = ∇Qn(θ̂n)^⊤[θ0 − θ̂n] + (1/2)[θ0 − θ̂n]^⊤ ∇²Qn(θ̃n)[θ0 − θ̂n],

where the mean value θ̃n lies between θ̂n and θ0, hence θ̃n →p θ0. By interiority and differentiability, the first-order condition must hold, eliminating the first-order term. Rearranging and using the definition of Ân,

LRn = 2[Qn(θ̂n) − Qn(θ0)]
 = [n^{1/2}(θ̂n − θ0)]^⊤ [−n^{−1}∇²Qn(θ̃n)] [n^{1/2}(θ̂n − θ0)]
 = [n^{1/2}(θ̂n − θ0)]^⊤ [−Ân(θ̃n)] [n^{1/2}(θ̂n − θ0)]
 = [n^{1/2}(θ̂n − θ0)]^⊤ [−A^{−1}]^{−1} [n^{1/2}(θ̂n − θ0)] + op(1).

But this is the end of the line. The asymptotic variance of n^{1/2}(θ̂n − θ0) is A^{−1}BA^{−1}, which differs from −A^{−1} in general, so this quadratic form need not be asymptotically χ². If, however, the information matrix equality B = −A holds (as it does for correctly specified MLE), then A^{−1}BA^{−1} = −A^{−1}, and

LRn = [n^{1/2}(θ̂n − θ0)]^⊤ [Ân(θ̂n)^+ B̂n(θ̂n) Ân(θ̂n)^+]^+ [n^{1/2}(θ̂n − θ0)] + op(1)
 = Wn + op(1).
All of our dreams have come true.
The key to getting the LR stat to be asymptotically equivalent to the
Wald and LM stats was the information matrix equality. As we’ll see when
studying efficient GMM in section 10, generalisations of the information
matrix equality are available for certain estimators outside the MLE class.
(As in the MLE context, the information matrix equality is tightly linked to efficiency.) We therefore summarise our result in a way that extends beyond the MLE context: under the hypotheses of our asymptotic normality result (p. 84), LMn − Wn = op(1); if in addition the (generalised) information matrix equality B = −A holds, then LRn is also within op(1) of LMn and Wn, and all three converge in distribution to χ²(k).
For a composite null hypothesis, Θ0 = Θ ∩ {θ ∈ Rk : h(θ) = 0} for a continuously differentiable restriction h : Rk → Rq whose Jacobian Dh(θ0) has full rank q, and θ̃n denotes the restricted estimator, which maximises Qn over Θ0. The test statistics are

LRn = 2[Qn(θ̂n) − Qn(θ̃n)]

LMn = [n^{−1/2}∇Qn(θ̃n)]^⊤ [B̂n(θ̃n)]^+ [n^{−1/2}∇Qn(θ̃n)]

Wn = [n^{1/2}h(θ̂n)]^⊤ [Dh(θ̂n) Ân(θ̂n)^+ B̂n(θ̂n) Ân(θ̂n)^+ Dh(θ̂n)^⊤]^+ [n^{1/2}h(θ̂n)].
Putting this all together using Slutsky's theorem,

Wn = [n^{1/2}h(θ̂n)]^⊤ [Dh(θ0) A^{−1}BA^{−1} Dh(θ0)^⊤]^{−1} [n^{1/2}h(θ̂n)] + op(1)
 →d [Nq(0, Dh(θ0)A^{−1}BA^{−1}Dh(θ0)^⊤)]^⊤ [Dh(θ0)A^{−1}BA^{−1}Dh(θ0)^⊤]^{−1} [Nq(0, Dh(θ0)A^{−1}BA^{−1}Dh(θ0)^⊤)]
 =d [Nq(0, I)]^⊤ [Nq(0, I)]
 =d χ²(q).
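As a sketch of how this composite-hypothesis Wald test might be carried out in practice (hypothetical names; the user supplies h(θ̂n), its Jacobian at θ̂n, and the variance ingredients):

import numpy as np
from scipy import stats

def wald_composite(h_val, Dh, n, A_hat, B_hat):
    # W_n for H0: h(theta_0) = 0, using the estimated variance of n^{1/2} h(theta_hat).
    A_plus = np.linalg.pinv(A_hat)
    V = Dh @ A_plus @ B_hat @ A_plus @ Dh.T      # q x q variance estimate
    Wn = float(n * h_val @ np.linalg.pinv(V) @ h_val)
    return Wn, float(stats.chi2.sf(Wn, df=len(h_val)))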
For the LM stat, first observe that θ̃n is consistent because we're maintaining the assumptions of our consistency result for extremum estimators (p. 78). To show this carefully, note that we can write

Θ0 = Θ ∩ {θ ∈ Rk : h(θ) = 0}.

Θ is compact and {θ ∈ Rk : h(θ) = 0} is closed, so Θ0 is compact. Qn is continuous on Θ, hence continuous on Θ0. n^{−1}Qn →p Q uniformly on Θ, hence also on Θ0. Finally, Q has a unique maximum on Θ at θ0, hence a fortiori has a unique maximum on Θ0 at θ0. So the conditions of the consistency proposition (p. 78) hold on Θ0, whence it follows that θ̃n is consistent for θ0.

By consistency of θ̃n and continuity of B̂n, the continuous mapping theorem yields B̂n(θ̃n) = B̂n(θ0) + op(1). Moreover, B̂n(θ0) →p B by Khinchine's WLLN (p. 56).
Since we're maintaining the hypotheses of the asymptotic normality result for extremum estimators (p. 84), we have that Qn is differentiable near θ0 and that θ0 is interior to Θ. Since θ̃n is consistent for θ0, it follows that for large n, with high probability, the FOC holds:

∇Qn(θ̃n) = Dh(θ̃n)^⊤ λn,

where λn is a (random) q-vector of Lagrange multipliers. We will proceed in the same informal manner as in our proof of asymptotic normality (p. 84) by behaving as if the FOC always holds.

Since θ̃n is consistent for θ0, and since ∇Qn and Dh are continuous at θ0 by assumption, the continuous mapping theorem lets us write

∇Qn(θ0) = Dh(θ0)^⊤ λn + op(1),

and hence

n^{−1/2}∇Qn(θ0) = Dh(θ0)^⊤ [n^{−1/2}λn] + op(n^{−1/2}).
One of the hypotheses of our asymptotic normality result is that the LHS converges in distribution to Nk(0, B). Hence the RHS must do the same:

Dh(θ0)^⊤ [n^{−1/2}λn] →d Nk(0, B),

whence it follows that n^{−1/2}λn →d Nq(0, V) for some V satisfying B = Dh(θ0)^⊤ V Dh(θ0). Now here's a fun fact that you can (very easily) prove at home: because V is invertible (since it's a nondegenerate variance matrix) and Dh(θ0)Dh(θ0)^⊤ has full rank q (since Dh(θ0) has rank q), we have

Dh(θ0) [Dh(θ0)^⊤ V Dh(θ0)]^{−1} Dh(θ0)^⊤ = V^{−1}.   (9)
Hence

Dh(θ0) B^{−1} Dh(θ0)^⊤ = V^{−1}.   (10)

Putting together the pieces and using (10),

LMn = [n^{−1/2}∇Qn(θ̃n)]^⊤ [B̂n(θ̃n)]^+ [n^{−1/2}∇Qn(θ̃n)]
 = [Dh(θ0)^⊤ n^{−1/2}λn]^⊤ B^{−1} [Dh(θ0)^⊤ n^{−1/2}λn] + op(1)
 = [n^{−1/2}λn]^⊤ Dh(θ0) B^{−1} Dh(θ0)^⊤ [n^{−1/2}λn] + op(1)
 = [n^{−1/2}λn]^⊤ V^{−1} [n^{−1/2}λn] + op(1).

Since n^{−1/2}λn →d Nq(0, V), Slutsky's theorem then gives LMn →d χ²(q).
Finally, let's turn to the likelihood ratio statistic. All the steps in our derivation in the previous section (for a simple hypothesis) still go through, giving us

LRn = [n^{1/2}(θ̂n − θ̃n)]^⊤ [−A^{−1}]^{−1} [n^{1/2}(θ̂n − θ̃n)] + op(1).

A first-order mean value expansion of h(θ̂n) around θ̃n yields

h(θ̂n) = h(θ̃n) + Dh(θ̄n)[θ̂n − θ̃n] = Dh(θ̄n)[θ̂n − θ̃n]

(since h(θ̃n) = 0), where the mean value θ̄n lies between θ̂n and θ̃n, so is consistent for θ0. So

LRn = [n^{1/2}h(θ̂n)]^⊤ [Dh(θ̄n) (−A^{−1}) Dh(θ̄n)^⊤]^+ [n^{1/2}h(θ̂n)] + op(1)
 = [n^{1/2}h(θ̂n)]^⊤ [Dh(θ̂n) (−A^{−1}) Dh(θ̂n)^⊤]^+ [n^{1/2}h(θ̂n)] + op(1).
9.4 Power
We’ve now figured out the asymptotic distributions of our statistics under the
null. But as mentioned above, the tests are not much use if their distributions
under the alternative are very similar to their null distributions: then power
will be low, so we won’t be very likely to reject the null even when it is false.
First, we'd like to show that our tests are consistent (consistent against every alternative θ0 = θ ∈ Θ0^c).74 Since the asymptotic null distribution is χ², this will require us to show that our test statistics explode under the alternative; then in a large sample, with high probability, they will be larger than we would expect a χ²-distributed random variable to be, leading us to reject the null.
Secondly, we’d like to know how powerful they are. We could approach
this by studying the rate at which the test stats explode, but that turns out
not to be fruitful. Instead, we’ll consider the behaviour of the test stat when
the null is nearly true, using a formal device called a Pitman drift. This will
allow us to derive the asymptotic distribution of each test stat explicitly,
and then we can read off our tests’ asymptotic rejection probabilities from
the asymptotic distribution under the alternative. This limiting rejection
probability (under local-DGP asymptotics) is called the local power.
The exposition in this section will be a little looser. We’ll consider only
the case of a simple hypothesis, and we’ll limit our derivations to the Wald
74
This is a nice property to have, and we will have it here. But there are many well-known
tests that are not consistent against all alternatives. One of these is the information matrix
test, which is inconsistent against DGPs for which the information matrix equality holds
despite misspecification. (Which can happen.)
statistic. The null hypothesis is that θ0 = θ?, but actually the true value θ0 ≠ θ?. The Wald statistic for a simple hypothesis from section 9.2 can be written

Wn = [n^{1/2}(θ̂n − θ0) + n^{1/2}(θ0 − θ?)]^⊤ [Ân(θ̂n)^+ B̂n(θ̂n) Ân(θ̂n)^+]^+ [n^{1/2}(θ̂n − θ0) + n^{1/2}(θ0 − θ?)]
 = [n^{1/2}(θ̂n − θ0) + n^{1/2}(θ0 − θ?)]^⊤ [A^{−1}BA^{−1}]^{−1} [n^{1/2}(θ̂n − θ0) + n^{1/2}(θ0 − θ?)] + op(1)
 =d [Nk(0, A^{−1}BA^{−1}) + n^{1/2}(θ0 − θ?)]^⊤ [A^{−1}BA^{−1}]^{−1} [Nk(0, A^{−1}BA^{−1}) + n^{1/2}(θ0 − θ?)] + op(1)
 =d χ²(k) + n(θ0 − θ?)^⊤ [A^{−1}BA^{−1}]^{−1} (θ0 − θ?) + 2n^{1/2}(θ0 − θ?)^⊤ Nk(0, [A^{−1}BA^{−1}]^{−1}) + op(1).

The middle term explodes as n → ∞ since A^{−1}BA^{−1} is positive definite, and it dominates the remaining terms, so Wn →p ∞: the test is consistent against the fixed alternative θ0.
A DGP law µθ0 for which the null is hard to reject is precisely one such that θ0 (the truth) is close to the null hypothesis θ? being tested. So we need to let θ0 drift toward θ? as n increases. To that end, consider a sequence of DGP laws µθ0,n such that

θ0,n := θ? + n^{−1/2}µ + o(n^{−1/2})

for some fixed µ ∈ Rk. (Such a sequence is called a Pitman drift.) The algebra is very easy: rearrange to get n^{1/2}(θ0,n − θ?) = µ + o(1), then substitute:

Wn = [n^{1/2}(θ̂n − θ0,n) + n^{1/2}(θ0,n − θ?)]^⊤ [Ân(θ̂n)^+ B̂n(θ̂n) Ân(θ̂n)^+]^+ [n^{1/2}(θ̂n − θ0,n) + n^{1/2}(θ0,n − θ?)]
 = [n^{1/2}(θ̂n − θ0,n) + µ]^⊤ [A^{−1}BA^{−1}]^{−1} [n^{1/2}(θ̂n − θ0,n) + µ] + op(1)
 →d [Nk(0, A^{−1}BA^{−1}) + µ]^⊤ [A^{−1}BA^{−1}]^{−1} [Nk(0, A^{−1}BA^{−1}) + µ]
 =d χ²(k, λ(µ)),

where λ(µ) := µ^⊤ [A^{−1}BA^{−1}]^{−1} µ is the noncentrality parameter.
Writing cα for the 1 − α quantile of the (central) χ²(k) distribution, the local power Qα(µ) := P(χ²(k, λ(µ)) > cα) is the limiting rejection probability for size α along the sequence of local DGP laws.
75
A triangular array is a double-indexed collection {y_{n,N} : 1 ≤ n ≤ N < ∞} of random variables. In our case, the triangular array is 'iid within rows', meaning that {y_{n,N}}_{n=1}^{N} is an iid sequence for each N ∈ N. Most of our central limit theorems apply to triangular arrays; for example, the Lindeberg–Feller CLT extends without change to triangular arrays that are independent within rows.
76
I’ve tried to make this clear, but it’s worth repeating: the Pitman drift is just a
technique that lets us approximate the power of a test. We’re not actually interested in
something weird like the asymptotic behaviour of a test when the DGP changes with the
sample size to make your life harder.
Recall that what we want is to approximate the power function Pnα for a fixed n. Expressed more long-windedly, for any θ0 ≠ θ?, we want an approximation to the power Pnα(θ0) of our test of size α of the null θ0 = θ? when the sample size is n. So for fixed n and θ0, choose µ ∈ Rk so that the nth DGP law µθ0,n in the Pitman sequence has θ0,n = θ0:

θ0 = θ? + n^{−1/2}µ,   i.e.   µ := n^{1/2}(θ0 − θ?).

The local power Qα(µ) then serves as our approximation to Pnα(θ0).
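Numerically, the resulting approximation is just a noncentral-χ² tail probability. A minimal sketch (hypothetical names; scipy's ncx2 is the noncentral χ² distribution):

import numpy as np
from scipy import stats

def local_power(theta_0, theta_star, n, A, B, alpha=0.05):
    # Approximate P_n^alpha(theta_0) for the Wald test of H0: theta_0 = theta_star.
    k = len(theta_star)
    mu = np.sqrt(n) * (np.asarray(theta_0) - np.asarray(theta_star))
    V = np.linalg.inv(A) @ B @ np.linalg.inv(A)   # asymptotic variance A^{-1} B A^{-1}
    lam = float(mu @ np.linalg.inv(V) @ mu)       # noncentrality lambda(mu)
    c_alpha = stats.chi2.ppf(1 - alpha, df=k)     # size-alpha critical value
    return float(stats.ncx2.sf(c_alpha, df=k, nc=lam))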
[Figure 3: approximate power curves P̂50α, P̂200α and P̂1000α plotted against θ0, each equal to α at θ0 = θ? and rising towards 1 as θ0 moves away from θ?.]
Finally, the derivations in section 9.2 showing that the LR, LM and Wald
statistics are asymptotically equivalent (i.e. within op (1) of each other) apply
with almost no changes to this new asymptotic environment. It follows that
the LR and LM statistics are also noncentral-χ2 -distributed, with the same
noncentrality parameter, under local-DGP asymptotics. Their local power
is therefore the same as that of the Wald test. Moreover, all of the results
extend to (smooth) composite hypotheses: the trinity tests are still consistent,
and their power can be approximated using a Pitman drift.
77
In general, the local power function need not be symmetric about θ?, nor need it be monotonic on either side of θ?. These properties do hold for the Wald test, however, because the noncentrality varies monotonically and symmetrically with θ0.
78
It is important that Figure 3 depicts the local power envelopes. The true power envelope
family {Pnα }n∈N may not be as well-behaved, since the DGP could be weird. The true
rejection probability under the null will not be α, though it should be ‘close’ to α when
n is large. Moreover, the power curves need not have the nice symmetric and monotonic
shape depicted, though again they will ‘nearly’ have these properties for n large.
10 The generalised method of moments estimator
Official reading: Amemiya (1985, sec. 8.1.1 and 8.2.2) and Newey and Mc-
Fadden (1994, sec. 2.2.3, 2.5, 3.3 and 4.3).
10.1 Preliminaries
The generalised method of moments (GMM), introduced by Hansen (1982), is
a pretty general (haha) technique for estimation and inference that subsumes
most parametric methods as special cases. It occupies an important place in
the history of econometrics: before GMM, there were many disparate methods
such as 2SLS, 3SLS and so on (as you can see by looking at Amemiya (1985),
which was written before GMM was introduced). GMM subsumed these
classical methods as special cases.
We will treat the case in which the DGP {yi} is iid. We have some (economic) model that gives us the moment conditions E(g̃(y1, θ0)) = 0 for some known function g̃ : Rr × Θ → Rq.79 Even if the model imposes additional structure on the data, GMM makes use only of these moment conditions.80 As usual we define the random functions gi : Θ → Rq by gi(ω)(θ) := g̃(yi(ω), θ), allowing us to write the moment conditions as E(g1(θ0)) = 0.
So we have q restrictions on the k-dimensional parameter θ0 . We assume
that θ0 is point-identified, meaning that θ0 is the unique solution in Θ to
E(g1 (θ)) = 0. A necessary condition for this is (obviously) q ≥ k. The
case q > k is called overidentification; in the lingo, the model gives us
overidentifying restrictions on θ0 in this case.81 When the moment conditions
are misspecified and q > k, it should be clear that E(g1 (θ)) = 0 may not
have a solution in Θ. This indicates that we can formulate a specification
test for the model by exploiting the overidentifying restrictions; we will work
out the details in section 10.5 below.82
79
A rather nice thing is that a lot of economic models give rise to moment conditions;
Euler equations are one example. By contrast, it’s very rare that a convincing economic
model gives rise to a more restrictive statistical model such as a density (as required for
maximum likelihood).
80
This is another advantage of GMM over e.g. ML. Suppose we have a model that gives
us a likelihood from which we can derive moment conditions. If the likelihood is misspecified
then our estimates will in general be inconsistent. But if the moment conditions hold (a
much weaker condition in general), then GMM will give us consistent estimates.
81
Correspondingly, the case q = k is called ‘exact identification’. Then we have enough
restrictions to identify θ0 , but no more.
82
More generally, we could allow E(g1(θ)) = 0 to have multiple solutions, in which case we obtain partial identification. Another form of partially-identified GMM comes from replacing the moment equalities with moment inequalities. The latter is currently a hot research area.
Every parametric estimator that we have mentioned so far is a GMM estimator for some choice of g̃. In particular, any extremum estimator for which the FOC ∇Qn(θ̂n) = 0 holds is a GMM estimator. Since we're only covering GMM for the iid case, restrict attention to separable criterion functions, so that the FOC can be written

Σ_{i=1}^n ∇qi(θ̂n) = 0.

One example is the nonlinear least squares estimator, where

g̃((y, x), θ) = [y − f(x, θ)] ∇θ f(x, θ),

and the moment condition is derived from the assumption E(y|x) = f(x, θ0). Another is the LAD estimator, where

g̃((y, x), θ) = 2·1(y ≤ f(x, θ)) − 1,

and the moment condition is derived from the assumption that the median of the distribution of y conditional on x is f(x, θ).
When q = k and g̃ is well-behaved, the sample moment condition

n^{−1} Σ_{i=1}^n gi(θ) = 0

will have a unique solution. This value is called the method-of-moments estimator, and the sample moment condition is sometimes called an estimating equation in this context. But what can we do when q > k or g̃ is ill-behaved? One possibility is to throw away q − k moment conditions and choose θ̃n to make n^{−1} Σ_{i=1}^n gi(θ) as close as possible to zero in some metric (exactly zero being unattainable in general). The GMM estimator measures the distance from zero by a quadratic form in the sample moments.
Writing Gn := n^{−1/2} Σ_{i=1}^n gi, the GMM objective function is

Qn(θ) := (1/2) Gn(θ)^⊤ Wn Gn(θ) = (1/2) [n^{−1/2} Σ_{i=1}^n gi(θ)]^⊤ Wn [n^{−1/2} Σ_{i=1}^n gi(θ)],83

where {Wn} is a sequence of symmetric, positive semidefinite q × q weight matrices.
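As a concrete sketch (hypothetical names; g_tilde, data and W are supplied by the user), the objective can be coded directly from this definition and handed to a generic optimiser:

import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, g_tilde, data, W):
    # Q_n(theta) = (1/2) G_n(theta)' W G_n(theta), G_n = n^{-1/2} sum_i g(y_i, theta).
    m = np.array([g_tilde(y, theta) for y in data])   # n x q moment contributions
    G = m.sum(axis=0) / np.sqrt(len(data))
    return 0.5 * float(G @ W @ G)

# e.g. theta_hat = minimize(gmm_objective, theta_init, args=(g_tilde, data, W)).x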
10.2 Consistency
Conditions for consistency of GMM can be obtained (essentially) from our
consistency results for extremum estimators (pp. 78 and 80). We’ll give
primitive conditions for strong consistency in the iid case.
Assume that Θ is compact and that each gi is continuous (so that Qn is). Further assume that E(sup_{θ∈Θ} |g1(θ)|) < ∞. Then Jennrich's uniform SLLN (p. 58) applies, giving us

n^{−1/2} Gn = n^{−1} Σ_{i=1}^n gi →a.s. g uniformly over Θ,

where g := E(g1). This is the uniform convergence that our consistency results (pp. 78 and 80) require.
10.3 Asymptotic normality
Unsurprisingly, we will appeal to our asymptotic normality result for ex-
tremum estimators (p. 84). Again we treat the iid case and give primitive
conditions.
So maintain the assumptions we imposed to obtain strong consistency in
the previous section, and assume that θ0 ∈ int Θ. Further assume that each
gi is twice continuously differentiable in a neighbourhood of θ0 , so that Qn
is too. Write Dgi for the q × k first derivative, and D2 gi for the q × k × k
second derivative. The latter is a three-dimensional array!
The derivatives of Qn are

∇Qn(θ) = [DGn(θ)]^⊤ Wn Gn(θ)
∇²Qn(θ) = [DGn(θ)]^⊤ Wn [DGn(θ)] + [D²Gn(θ)]^⊤ Wn Gn(θ),

where in the second term the three-dimensional array D²Gn(θ) is transposed and multiplied with a matrix, yielding a matrix. My notation for this is not ideal (e.g. it doesn't tell the reader along what dimensions we're transposing the array), but it won't matter because this term is going to vanish. The first derivative of Gn is of course

DGn = n^{−1/2} Σ_{i=1}^n Dgi.
We handle DGn as in the previous section: we already have that {Dgi} are iid and continuous and that Θ is compact, so we only have to add the assumption that E(sup_{θ∈Θ} |Dg1(θ)|) < ∞. Jennrich's uniform SLLN (p. 58) then tells us that

n^{−1/2} DGn(θ) = n^{−1} Σ_{i=1}^n Dgi(θ) →a.s. Dg uniformly over Θ,

where Dg := E(Dg1).
It follows by the dominated convergence theorem and independence that

B = lim_{n→∞} E{n^{−1} [∇Qn(θ0)][∇Qn(θ0)]^⊤}
 = lim_{n→∞} E{[n^{−1/2}DGn(θ0)]^⊤ Wn Gn(θ0) [[n^{−1/2}DGn(θ0)]^⊤ Wn Gn(θ0)]^⊤}
 = lim_{n→∞} E{[n^{−1/2}DGn(θ0)]^⊤ Wn Gn(θ0) Gn(θ0)^⊤ Wn^⊤ [n^{−1/2}DGn(θ0)]}
 = [Dg(θ0)]^⊤ W {lim_{n→∞} E(Gn(θ0) Gn(θ0)^⊤)} W [Dg(θ0)]
 = [Dg(θ0)]^⊤ W {lim_{n→∞} E(n^{−1} Σ_{i=1}^n Σ_{j=1}^n gi(θ0) gj(θ0)^⊤)} W [Dg(θ0)]
 = [Dg(θ0)]^⊤ W {lim_{n→∞} E(n^{−1} Σ_{i=1}^n gi(θ0) gi(θ0)^⊤)} W [Dg(θ0)]
 = [Dg(θ0)]^⊤ W E(g1(θ0) g1(θ0)^⊤) W [Dg(θ0)].
The uniform convergence just established gives us that for any sequence {θn} of random k-vectors with θn →p θ0, we have

n^{−1}∇²Qn(θn) →p [Dg(θ0)]^⊤ W [Dg(θ0)] = A.

You may recall that this is what assumption (3) in the asymptotic normality result for extremum estimators (p. 84) requires.
To verify the fourth and final hypothesis of the asymptotic normality result, first observe that

n^{−1/2}∇Qn(θ0) = [n^{−1/2}DGn(θ0)]^⊤ Wn [Gn(θ0)]
 = [Dg(θ0)]^⊤ W [n^{−1/2} Σ_{i=1}^n gi(θ0)] + op(1).

{gi(θ0)} are iid with mean zero and variance E(g1(θ0) g1(θ0)^⊤), so by the multivariate Lindeberg–Lévy CLT and Slutsky's theorem,

n^{−1/2}∇Qn(θ0) →d [Dg(θ0)]^⊤ W Nq(0, E(g1(θ0) g1(θ0)^⊤))
 =d Nk(0, [Dg(θ0)]^⊤ W E(g1(θ0) g1(θ0)^⊤) W [Dg(θ0)])
 =d Nk(0, B).
It remains to choose Wn. Of course only the probability limit W of {Wn} matters, so really we're choosing W.

It turns out that an (infeasible) optimal choice of weight matrix is W = [E(g1(θ0) g1(θ0)^⊤)]^{−1}. (We won't show this, but it's straightforward matrix algebra.) So the asymptotic variance of an efficient GMM estimator is

A^{−1}BA^{−1} = {[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1} [Dg(θ0)]^⊤ W W^{−1} W [Dg(θ0)] {[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1}
 = {[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1}
 = {[Dg(θ0)]^⊤ [E(g1(θ0) g1(θ0)^⊤)]^{−1} [Dg(θ0)]}^{−1}
 = A^{−1} = B^{−1}.
There’s a close analogy with the efficiency of (correctly-specified) MLE
here. The efficient choice of weight matrix (resp. density) causes a bunch
of cancellations in the asymptotic variance, leaving us with A−1 . (It’s A
rather than −A because we’re minimising Qn , so ∇2 Qn is positive definite
here.) Moreover, we obtain A = B, a generalisation of the information matrix
equality.84 Clearly B −1 is the lower bound on the variance of any GMM
estimator using these moment conditions, analogous to the Cramér–Rao
bound.
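The cancellation is easy to check numerically. Here is a sketch with a randomly drawn Dg(θ0) and a positive definite stand-in Omega for E(g1(θ0) g1(θ0)^⊤):

import numpy as np

rng = np.random.default_rng(0)
q, k = 5, 3
Dg = rng.normal(size=(q, k))                  # full column rank almost surely
S = rng.normal(size=(q, q))
Omega = S @ S.T + np.eye(q)                   # stand-in for E(g1 g1'), positive definite
W = np.linalg.inv(Omega)                      # optimal weight matrix

A = Dg.T @ W @ Dg
B = Dg.T @ W @ Omega @ W @ Dg                 # reduces to A when W = Omega^{-1}
avar = np.linalg.inv(A) @ B @ np.linalg.inv(A)

assert np.allclose(A, B)                      # generalised information matrix equality
assert np.allclose(avar, np.linalg.inv(B))    # A^{-1} B A^{-1} = B^{-1}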
So far we only have an infeasible procedure, since it requires knowledge of [E(g1(θ0) g1(θ0)^⊤)]^{−1}. But provided we can consistently estimate the latter, we can obtain a feasible estimator. The standard way of doing this (proposed by Hansen (1982)) is called two-step GMM. First pick some arbitrary weight matrix Wn,85 and obtain the GMM estimate θ̃n. Define the Θ → R^{q×q} function

Ŵn := (n^{−1} Σ_{i=1}^n gi gi^⊤)^+.

In the second step, re-estimate using the weight matrix Ŵn(θ̃n). Since θ̃n is consistent, Ŵn(θ̃n) consistently estimates [E(g1(θ0) g1(θ0)^⊤)]^{−1}, which makes the second-step estimator
asymptotically equivalent to the infeasible efficient GMM estimator above.
The two-step GMM estimator θbn is therefore asymptotically efficient within
the class of GMM estimators that use these moment conditions.
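A minimal sketch of the two-step procedure (hypothetical names; gmm_objective as in the sketch of section 10.1):

import numpy as np
from scipy.optimize import minimize

def two_step_gmm(g_tilde, data, theta_init):
    q = len(g_tilde(data[0], theta_init))
    # Step 1: arbitrary (identity) weight matrix.
    step1 = minimize(gmm_objective, theta_init, args=(g_tilde, data, np.eye(q)))
    # Estimate the optimal weight matrix at the first-step estimate.
    m = np.array([g_tilde(y, step1.x) for y in data])
    W_hat = np.linalg.pinv(m.T @ m / len(data))   # (n^{-1} sum_i g_i g_i')^+
    # Step 2: re-estimate with the estimated optimal weights.
    return minimize(gmm_objective, step1.x, args=(g_tilde, data, W_hat)).x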
There are serious finite-sample problems with the two-step procedure. The
intermediate step of estimating E (g1 (θ0 )g1 (θ0 )> )−1 lowers the asymptotic
variance, but at the cost of introducing an additional source of noise into
the estimator. Worse, the noise in estimating the optimal weight matrix
is generally correlated with the noise in the sample moments Gn , which
introduces bias into the two-step GMM estimator. In Monte Carlo studies,
this bias is quite severe for nonlinear DGPs, even in large samples.
There are two obvious ways of dealing with finite-sample bias. On the
one hand, we could just eschew two-step GMM in favour of one-step GMM,
which is asymptotically less efficient but allows for more reliable inference.
(And may be more efficient in a finite sample!) On the other hand, we
could incorporate finite-sample corrections to our estimates. Many have been
proposed, and some of them are quite helpful.
Another solution is to use a one-step procedure that nevertheless delivers an estimator that is asymptotically equivalent to the infeasible efficient GMM estimator. The continuously-updated (CUE) GMM estimator minimises

Gn(θ)^⊤ Ŵn(θ) Gn(θ),
obviating the need for a second step. And it turns out (perhaps unsurprisingly)
to be asymptotically equivalent to the infeasible efficient GMM estimator that
uses weight matrix E (g1 (θ0 )g1 (θ0 )> )−1 . The lack of an initial step turns out
to make a big difference: the Monte Carlo evidence is that the finite-sample
behaviour of the CUE GMM estimator is much better than that of two-step
GMM. Unfortunately, obtaining the CUE GMM estimator is in general a
pretty hard problem: even when ge is linear, it is a nonlinear optimisation
problem.
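For comparison with the two-step sketch above, here is a sketch of the continuously-updated criterion, in which the weight matrix is recomputed at every trial value of θ:

import numpy as np

def cue_objective(theta, g_tilde, data):
    # CUE criterion: G_n(theta)' W_hat(theta) G_n(theta), weights updated at each theta.
    m = np.array([g_tilde(y, theta) for y in data])
    G = m.sum(axis=0) / np.sqrt(len(data))
    W_hat = np.linalg.pinv(m.T @ m / len(data))
    return float(G @ W_hat @ G)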
Yet another one-step method that is asymptotically equivalent to efficient
GMM is the empirical likelihood (EL) estimator. The EL estimator is the
first part of the argmax in the problem
max_{(θ,p) ∈ Θ×(0,1)^n} Σ_{i=1}^n ln(pi)   subject to   Σ_{i=1}^n pi gi(θ) = 0 and Σ_{i=1}^n pi = 1.
Again, the asymptotics are the same as for efficient GMM, but the finite-
sample properties are much better. And again, this estimator is computa-
tionally troublesome, as we’re now maximising over k + n variables with an
additional constraint. There’s a broader class of estimators called generalised
empirical likelihood (GEL) estimators which share these good finite-sample
properties. We won’t delve into (G)EL estimation here; see Imbens (2002)
for a nice intro-level survey.
10.5 The J test

When q > k, the model can be tested by checking whether the minimised criterion is close to zero: the J statistic is

Jn := Gn(θ̂n)^⊤ Wn Gn(θ̂n) = 2Qn(θ̂n).86

Though it has the flavour of an LR statistic, note that we are not testing the null hypothesis that the true parameter satisfies some restriction. Instead, our null is that Qn(θ0) is op(1), meaning that all the moment conditions are satisfied at the truth.
To derive the asymptotic distribution under the null, begin with a mean-value expansion:

Gn(θ̂n) = Gn(θ0) + [n^{−1/2}DGn(θ̃n)][n^{1/2}(θ̂n − θ0)]
 = Gn(θ0) + [Dg(θ0)][n^{1/2}(θ̂n − θ0)] + op(1),
86
The information matrix test is another specification test. The J test is conceptually
distinct from the information matrix test, however. Given that we derived a generalised
information matrix equality for efficient GMM, we could formulate an information matrix
test for GMM if desired.
where θ̃n is the mean value. Recall from our proof of asymptotic normality for extremum estimators (p. 84) that

n^{1/2}(θ̂n − θ0) = −[n^{−1}∇²Qn(θ̃n)]^+ [n^{−1/2}∇Qn(θ0)]
 = −{[n^{−1/2}DGn(θ̃n)]^⊤ Wn [n^{−1/2}DGn(θ̃n)]}^+ [n^{−1/2}DGn(θ0)]^⊤ Wn Gn(θ0)
 = −{[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1} [Dg(θ0)]^⊤ W Gn(θ0) + op(1).

You may be tempted to use the matrix algebra result in equation (9) (p. 116) here to write

[Dg(θ0)] {[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1} [Dg(θ0)]^⊤ = W^{−1},

but this is a mistake! The reason is that [Dg(θ0)][Dg(θ0)]^⊤ must have full rank in order for that identity to hold. But since Dg(θ0) is q × k and q > k by assumption, [Dg(θ0)][Dg(θ0)]^⊤ can have rank at most k!

Instead, premultiply Gn(θ̂n) by W^{1/2}:

W^{1/2} Gn(θ̂n) = W^{1/2} Gn(θ0) − W^{1/2} [Dg(θ0)] {[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1} [Dg(θ0)]^⊤ W Gn(θ0) + op(1)
 = (I − W^{1/2} [Dg(θ0)] {[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1} [Dg(θ0)]^⊤ W^{1/2}) W^{1/2} Gn(θ0) + op(1)
 = M W^{1/2} Gn(θ0) + op(1),

where

M := I − W^{1/2} [Dg(θ0)] {[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1} [Dg(θ0)]^⊤ W^{1/2}

is symmetric and idempotent (check!).
So the test statistic is

Jn = [Wn^{1/2} Gn(θ̂n)]^⊤ [Wn^{1/2} Gn(θ̂n)]
 = [W^{1/2} Gn(θ̂n)]^⊤ [W^{1/2} Gn(θ̂n)] + op(1)
 = [M W^{1/2} Gn(θ0)]^⊤ [M W^{1/2} Gn(θ0)] + op(1)
 = [W^{1/2} Gn(θ0)]^⊤ M^⊤ M [W^{1/2} Gn(θ0)] + op(1)
 = [W^{1/2} Gn(θ0)]^⊤ M [W^{1/2} Gn(θ0)] + op(1).

Now suppose (this is important!) that we choose the weight matrix optimally: W = [E(g1(θ0) g1(θ0)^⊤)]^{−1}. Then we obtain

W^{1/2} Gn(θ0) →d Nq(0, W^{1/2} W^{−1} W^{1/2}) =d Nq(0, I).
Now here's a fact for you: for ξ ~ Nq(0, I) and any symmetric and idempotent q × q matrix M,

ξ^⊤ M ξ ~ χ²(rank M).

To see why, diagonalise M = P^⊤ Λ P, where the rows of the orthogonal matrix P are eigenvectors of M and Λ is diagonal; since M is idempotent, its eigenvalues are all 0 or 1, and exactly rank M of them equal 1. Ordering the eigenvalues so that the unit ones come first,

ξ^⊤ M ξ = (Pξ)^⊤ Λ (Pξ) = Σ_{j=1}^{rank M} ((Pξ)j)².

Since Pξ is a linear combination of normals, it is normally distributed. Since the eigenvectors are orthogonal and the components of ξ are independent, the components of Pξ are independent. Moreover

Var((Pξ)j) = P_{j,·} Var(ξ) P_{j,·}^⊤ = P_{j,·} P_{j,·}^⊤ = 1

since P is an orthogonal matrix. So we've shown that {(Pξ)j}_{j=1}^{rank M} are independent standard-normal random variables. Therefore

ξ^⊤ M ξ = Σ_{j=1}^{rank M} ((Pξ)j)² ~ χ²(rank M).
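The fact is also easy to check by simulation; a sketch in which M is a projection matrix of rank q − k:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
q, k = 5, 3
X = rng.normal(size=(q, k))
M = np.eye(q) - X @ np.linalg.inv(X.T @ X) @ X.T   # symmetric, idempotent, rank q - k
xi = rng.normal(size=(100_000, q))                 # draws of xi ~ N_q(0, I)
quad = np.einsum('ij,jk,ik->i', xi, M, xi)         # xi' M xi for each draw
print(np.quantile(quad, [0.5, 0.9, 0.95]))         # simulated quantiles...
print(stats.chi2.ppf([0.5, 0.9, 0.95], df=q - k))  # ...match chi-squared(q - k)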
Using the fact that the rank and trace of an idempotent matrix are equal, and writing Im for an m × m identity matrix to make the dimensions explicit,

rank M = tr M
 = tr Iq − tr(W^{1/2} [Dg(θ0)] {[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1} [Dg(θ0)]^⊤ W^{1/2})
 = tr Iq − tr({[Dg(θ0)]^⊤ W [Dg(θ0)]}^{−1} [Dg(θ0)]^⊤ W^{1/2} W^{1/2} [Dg(θ0)])
 = tr Iq − tr Ik
 = q − k.

It follows that Jn →d χ²(q − k) under the null.
That was the distribution under the null (correct specification). It’s pretty
clear that the test is consistent, for the same sort of reason as the trinity
tests in section 9 were. We can also construct a Pitman drift such that the
asymptotic distribution under local-DGP asymptotics is noncentral χ2 .
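Putting the pieces together, here is a sketch of how the J test might be run (hypothetical names; theta_hat should be the efficient GMM estimate and W_hat the estimated optimal weight matrix):

import numpy as np
from scipy import stats

def j_test(g_tilde, data, theta_hat, W_hat):
    # Hansen's J test of overidentifying restrictions; null distribution chi-squared(q - k).
    m = np.array([g_tilde(y, theta_hat) for y in data])   # n x q moments at theta_hat
    G = m.sum(axis=0) / np.sqrt(len(m))                   # G_n(theta_hat)
    Jn = float(G @ W_hat @ G)
    q, k = m.shape[1], len(np.atleast_1d(theta_hat))
    return Jn, float(stats.chi2.sf(Jn, df=q - k))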
References
Aliprantis, C. D., & Border, K. C. (2006). Infinite dimensional analysis: A
hitchhiker’s guide (3rd). Berlin: Springer.
Amemiya, T. (1985). Advanced econometrics. Cambridge, MA: Harvard
University Press.
Billingsley, P. (1995). Probability and measure (3rd). New York, NY: Wiley.
Billingsley, P. (1999). Convergence of probability measures (2nd). New York,
NY: Wiley.
Dudley, R. M. (2004). Real analysis and probability. Cambridge: Cambridge
University Press.
Durrett, R. (2010). Probability: Theory and examples (4th). Cambridge:
Cambridge University Press.
Gnedenko, B. V., & Kolmogorov, A. N. (1954). Limit distributions for sums
of independent random variables. Cambridge, MA: Addison-Wesley.
Hansen, L. P. (1982). Large sample properties of generalized method of
moments estimators. Econometrica, 50(4), 1029–1054.
Ibragimov, I. A., & Has’minskii, R. (1981). Statistical estimation: Asymptotic
theory. Berlin: Springer.
Imbens, G. W. (2002). Generalized method of moments and empirical likeli-
hood. Journal of Business & Economic Statistics, 20(4), 493–506.
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares
estimators. Annals of Mathematical Statistics, 40(2), 633–643.
Kim, J., & Pollard, D. (1990). Cube root asymptotics. Annals of Statistics,
18(1), 191–219.
Kolmogorov, A. N., & Fomin, S. V. (1975). Introductory real analysis. New
York, NY: Dover.
Manski, C. F. (1975). Maximum score estimation of the stochastic utility
model of choice. Journal of Econometrics, 3(3), 205–228.
Newey, W. K., & McFadden, D. L. (1994). Large sample estimation and
hypothesis testing. In R. F. Engle & D. L. McFadden (Eds.), Handbook
of econometrics (Chap. 36, Vol. 4, pp. 2111–2245). Amsterdam: North-
Holland.
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially
consistent observations. Econometrica, 16(1), 1–32.
Rao, C. R. (1973). Linear statistical inference and its applications (2nd).
New York, NY: Wiley.
Rosenthal, J. S. (2006). A first look at rigorous probability theory (2nd).
Singapore: World Scientific.
Serfling, R. J. (1970). Convergence properties of Sn under moment restrictions.
Annals of Mathematical Statistics, 41(4), 1235–1248.
Serfling, R. J. (1980). Approximation theorems of mathematical statistics.
New York, NY: Wiley.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1–25.
White, H. (2001). Asymptotic theory for econometricians (Revised). San
Diego, CA: Academic Press.
Williams, D. (1991). Probability with martingales. Cambridge: Cambridge
University Press.