
Econometrics II

Taught by Joel Horowitz


Northwestern University, winter 2016

Ludvig Sinander
Northwestern University
This version: 12 December 2018

These notes are based on an econometrics course for first-year PhD students
taught by Joel Horowitz at Northwestern in winter 2016. The topics are limit
theory and the asymptotic properties of extremum estimators.

I thank Joel for teaching a great class and for agreeing to let me share these
notes, and Ahnaf Al Rafi, Bence Bardóczy, Ricardo Dahis, Michael Gmeiner
and Joe Long for reporting errors.

Copyright © 2020 Carl Martin Ludvig Sinander.

Permission is granted to copy, distribute and/or modify this


document under the terms of the GNU Free Documentation
License, Version 1.3 or any later version published by the Free
Software Foundation; with no Invariant Sections, no Front-Cover
Texts, and no Back-Cover Texts. A copy of the license is included
in the section entitled ‘GNU Free Documentation License’.

This is a ‘copyleft’ licence. Visit gnu.org/licenses/copyleft to learn more.

Contents
1 Outline 5

2 Probability theory 8
2.1 Measurable spaces . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Measurable functions . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 The Lebesgue integral . . . . . . . . . . . . . . . . . . . . . . 14
2.6 The Radon–Nikodým theorem . . . . . . . . . . . . . . . . . . 16
2.7 Conditional probability . . . . . . . . . . . . . . . . . . . . . 18
2.8 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Modes of convergence 26
3.1 Convergence of random sequences . . . . . . . . . . . . . . . . 26
3.2 Convergence of random functions . . . . . . . . . . . . . . . . 29
3.3 Convergence of measures . . . . . . . . . . . . . . . . . . . . . 32
3.4 Relationships between modes of convergence . . . . . . . . . . 34
3.5 The Borel–Cantelli lemmata . . . . . . . . . . . . . . . . . . . 38
3.6 Convergence of moments . . . . . . . . . . . . . . . . . . . . . 39
3.7 Characteristic functions . . . . . . . . . . . . . . . . . . . . . 40
3.8 The continuous mapping theorem . . . . . . . . . . . . . . . . 44
3.9 Stochastic order notation . . . . . . . . . . . . . . . . . . . . 47
3.10 The delta method . . . . . . . . . . . . . . . . . . . . . . . . . 48

4 Laws of large numbers 53


4.1 Uncorrelated/independent random variables . . . . . . . . . . 53
4.2 iid random variables . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Dependent random variables . . . . . . . . . . . . . . . . . . . 56
4.4 Uniform laws of large numbers . . . . . . . . . . . . . . . . . 58

5 Central limit theorems 63


5.1 iid random variables . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Independent random variables . . . . . . . . . . . . . . . . . . 66
5.3 Dependent random variables . . . . . . . . . . . . . . . . . . . 69
5.4 The rate of convergence . . . . . . . . . . . . . . . . . . . . . 70

6 Some more limit theory 71


6.1 Connections between CLTs and LLNs . . . . . . . . . . . . . 71
6.2 Laws of the iterated logarithm . . . . . . . . . . . . . . . . . 72

7 Asymptotic properties of extremum estimators 75
7.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.2 Measurability . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.4 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . 84
7.5 Estimating the asymptotic variance . . . . . . . . . . . . . . . 88
7.6 Asymptotic normality with a nonsmooth objective . . . . . . 90

8 The (quasi-)maximum-likelihood estimator 94


8.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.2 Consistency for the truth . . . . . . . . . . . . . . . . . . . . 95
8.3 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . 97
8.4 Estimating the asymptotic variance . . . . . . . . . . . . . . . 100
8.5 The information matrix test . . . . . . . . . . . . . . . . . . . 103
8.6 Asymptotic efficiency . . . . . . . . . . . . . . . . . . . . . . . 104

9 Hypothesis testing 107


9.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.2 Simple hypotheses . . . . . . . . . . . . . . . . . . . . . . . . 109
9.3 Composite hypotheses . . . . . . . . . . . . . . . . . . . . . . 113
9.4 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

10 The generalised method of moments estimator 123


10.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
10.2 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.3 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . 126
10.4 Asymptotic efficiency . . . . . . . . . . . . . . . . . . . . . . . 128
10.5 The J test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

References 135

1 Outline
Econometrics is about inferring functional relations among variables from
data. Schematically, we observe a set of realisations of a random vector (x, y),
and wish to learn the function f(·, 0) that satisfies y = f(x, u) for some
unknown random vector u.1 One intuitive way of thinking about this problem
is to divide it into two parts: learning the parametric form (‘shape’) of f (·, 0),
and learning the values of its parameters. (This intuition underlies parametric
methods of estimation. Many nonparametric methods do not divide things
up in this way.)
We’ll need to sharpen up the question a bit in order to answer it. In
general, we cannot learn f (·, 0). As we know from Manski’s course, the most
that we can hope to learn is the joint distribution P(x, y) of (x, y). Often,
we are interested in some feature of the joint distribution, such as P(y|x),
E(y|x) or some quantile of y conditional on x. These objects can be thought
of as features of the function f when convenient.
A natural approach to estimating P(x, y) or P(y|x) when the support is
finite is to use the empirical distribution. By a law of large numbers, this
gives a pointwise consistent estimate.2 When the support is uncountable, we
could of course discretise the outcome space and apply the same reasoning.
One drawback to this approach is that we’ll often need a fine grid to provide
a good approximation, in which case we’ll need an astronomical dataset in
order to have more than zero or one observation per cell. Another drawback
is that we lose analytical tractability. (Discrete maths can be ugly.)
Another issue is that on any finite dataset, there is an infinite number of
lines that you can fit through the data. In order for a fitted line to approximate
the true relationship more and more closely as the sample size increases, we
will therefore require some assumptions on the distribution of (y, x). For
concreteness, consider the conditional mean function g(x) := E(y|x). In
this case, we can estimate g using nonparametric regression provided that
g is continuous. We avoid discretisation by taking local averages, using a
bandwidth that shrinks as the sample size increases. Continuity guarantees
that whatever weighted local average you take (e.g. which kernel you use in
kernel regression), the fitted line will get close to g as the sample size gets
large. Continuity is often a pretty weak assumption in economics.3
1. It is wlog to say that we want to learn f(·, 0). If we want to learn f(·, ξ) for ξ ≠ 0, just reparameterise as u′ := u − ξ.
2. In fact, this estimator is uniformly strongly consistent by the Glivenko–Cantelli theorem (Billingsley, 1995, p. 269).
3. In other fields, discontinuity has to be allowed for explicitly. Joel gave the example of image denoising.
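To make the local-averaging idea concrete, here is a minimal Python sketch of a Nadaraya–Watson kernel regression estimator. The data-generating process, the Gaussian kernel and the bandwidth constant are illustrative assumptions, not something specified in the course.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)  # g(x) = sin(2*pi*x)

    def nw_estimate(x0, x, y, h):
        """Kernel-weighted local average of y around x0 (Nadaraya-Watson)."""
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)  # Gaussian kernel weights
        return np.sum(w * y) / np.sum(w)

    h = 0.3 * n ** (-1 / 5)            # a bandwidth that shrinks with n
    print(nw_estimate(0.25, x, y, h))  # roughly sin(pi/2) = 1 for large n

Because g is continuous, any such weighted local average gets close to g(x0) as n grows and h shrinks; the curse of dimensionality discussed next is about how quickly this breaks down as x gains dimensions.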
A fundamental problem with nonparametric estimation techniques is
the curse of dimensionality. Roughly speaking, this is that the sample size
required to get a given level of estimator precision is exponentially increasing
in the dimension of x. It can be proved that without stronger assumptions,
the curse of dimensionality is unavoidable. Very loosely, the idea is that
you’re asking a finite dataset to tell you about an infinite-dimensional object.
One way of strengthening the assumptions to avoid the curse of dimension-
ality is to assume that g belongs to a finite-dimensional family of functions.
Schematically, we assume that g(x) = G(x, θ) where G is a known function
and θ is a finite-dimensional, unknown constant, i.e. a parameter. (In this
parametric case, it is often natural to do inference directly on θ rather than
trying to learn g directly.) It turns out (unsurprisingly) that parametrisation
defeats the curse of dimensionality. The price we pay for this victory is the
need to specify the function G. If we misspecify G, the math won’t break,
but the interpretation of results may be way off.
The obvious next concern is the ‘accuracy’ of our estimate of θ (or g). (If
we didn’t care about accuracy, there would be no reason to use the data!)
Since an estimator of θ is a function of the (random) data, an estimator is a
random variable. To characterise accuracy, we have to study this random
variable. The problem is that the distribution of an estimator depends on
the unknown distribution of (x, y). (If we knew the population distribution,
we would once again have no use for a dataset.) Except under very stringent
conditions, we cannot consistently estimate (never mind infer with certainty)
the distribution of an estimator.
The way we get around this is by using approximations to the distribution
of an estimator. If we are to trust these approximations, they must become
increasingly good as the data gets increasingly good, in some sense to be
made precise. The leading example will be asymptotic approximations, which
are approximations that (usually) become increasingly good as the sample
size grows. There are many kinds of asymptotic approximation, but the
general idea is easily illustrated using the simplest central limit theorem.
Suppose that our estimator θ̂ is an average of n iid random variables (many
estimators are), where n is the sample size. Then the central limit theorem
says that n^{1/2} θ̂ converges in distribution to N(θ, σ²), a two-dimensional
family of distributions!
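As a quick numerical illustration of this kind of approximation, here is a Monte Carlo sketch in Python; the exponential data-generating process is an arbitrary assumption chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 200, 100_000
    samples = rng.exponential(scale=1.0, size=(reps, n))  # mean 1, variance 1
    z = np.sqrt(n) * (samples.mean(axis=1) - 1.0)         # centred, scaled sample means

    # The normal approximation puts probability 0.05 outside +/- 1.96.
    print((np.abs(z) > 1.96).mean())  # close to 0.05 despite the skewed DGP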
Our focus will be on asymptotic theory for parametric estimators. Besides
this restriction, our treatment will be general, though common special cases
will be mentioned along the way. We’ll first cover basic (measure-theoretic)
probability theory, then limit theorems. Once the machinery is in place, we
will develop the asymptotic theory of extremum estimators.
The course starts off with background probability theory, emphasising
concepts required for asymptotic theory. We then state and prove several
laws of large numbers and central limit theorems. With the technical ma-
chinery in place, we establish the consistency and asymptotic normality of
general extremum estimators. We apply these results to maximum likelihood
and generalised-method-of-moments estimators, and cover efficiency and
specification tests while we’re at it. Finally, we study hypothesis testing in
the setting of extremum estimation.
The main references for the course are Amemiya (1985) and Newey and
McFadden (1994). These texts (unlike most others in econometrics) raise and
address all the important technical issues. Joel will also sometimes refer to
Rao (1973) and Serfling (1980), two statistics texts that are worth studying
for anyone interested in research in econometric theory. Several other texts
will be mentioned along the way, e.g. White (2001).

2 Probability theory
Official reading: Rao (1973, ch. 2).
This section covers basic (measure-theoretic) probability theory. There is
a very large number of good texts on measure and probability theory; Joel
mentioned Kolmogorov and Fomin (1975) in particular.

2.1 Measurable spaces


Let Ω be an arbitrary set. In the context of probability theory, we call Ω the
sample space, and interpret it as the set of all possible states of nature, or
outcomes of an experiment.

Definition 1. A collection A of subsets of Ω is called a σ-algebra (of subsets
of Ω) iff

(1) Ω ∈ A.

(2) If A ∈ A then A^c ∈ A.⁴

(3) If Aj ∈ A for each j ∈ N, then ⋃_{j∈N} Aj ∈ A.⁵

Some additional properties of σ-algebras are easily derived. For example,
if Aj ∈ A for each j ∈ N, then

    ⋂_{j∈N} Aj = ( ⋃_{j∈N} Aj^c )^c ∈ A

by properties (2) and (3). Another one is ∅ ∈ A, which follows from (1) and
(2).

Definition 2. Let Ω be an arbitrary set, and let A be a σ-algebra of subsets


of Ω. We call (Ω, A) a measurable space. The elements of A are called
the measurable subsets of Ω; subsets of Ω that are not in A are called
non-measurable.

The idea is that when we start assigning measure to subsets of Ω, we


will only assign measure to the subsets of Ω that lie in A; this is why we
call these subsets measurable. It might seem like we could make our lives
easier by choosing A = 2^Ω, the set of all subsets of Ω. We do not do this
4. A^c := Ω \ A denotes the complement of A in Ω.
5. N = {1, 2, . . .} denotes the natural numbers.

because for general uncountable Ω, it leads to paradoxes. This will be less of
a problem than it may first appear to be because the subsets of Ω missing
from the σ-algebras we will be working with are very strange sets. But they can
still give rise to difficulties: measurability problems arise fairly frequently in
econometric and economic theory.
In the context of probability theory, we sometimes call the measurable
sets ‘events’, and interpret them as ‘something that happens’. To illustrate,
suppose we draw one coloured ball from an urn, formalised by the measurable
space ({red, blue, green}, 2^{{red, blue, green}}). In ordinary language, one ‘event’
I might describe is ‘I pick a red or a blue ball’. In the formalism, this
corresponds to the event (measurable set) {red, blue}.
We might wonder how to choose our σ-algebra. An important criterion is
that our σ-algebra contain enough subsets of Ω to allow us to study conver-
gence (of measures, of measurable functions and of integrals). Convergence
is a topological notion, so let’s equip Ω with a topology. In order to obtain
convergence results, we will need the σ-algebra to contain enough topologic-
ally interesting subsets of Ω; at the very least, it should contain all of the
open subsets of Ω. It turns out that this minimal requirement is enough for
most purposes, leading to the following definition.

Definition 3. Let (Ω, T ) be a topological space. The Borel σ-algebra of


subsets of Ω (relative to topology T ) is the smallest σ-algebra of subsets of
Ω that contains T (all of the open sets). When Ω and its topology are clear
from the context, we’ll write B for the Borel σ-algebra.

More broadly, we will sometimes be interested in a certain collection X of


subsets of Ω. In order to say measure-theoretic things about them, we need
them to be measurable! To this end, we write σ(X ) for the smallest σ-algebra
of subsets of Ω that contains X ; σ(X ) is called the σ-algebra generated by
X . Note that for a topological space (Ω, T ), the σ-algebra σ(T ) generated
by the open sets is exactly the Borel σ-algebra.

2.2 Measures
Let R̄ = R ∪ {−∞, ∞} be the extended real line.

Definition 4. Given a measurable space (Ω, A), a function µ : A → R̄ is a


measure iff

(1) µ(A) ≥ 0 for each A ∈ A.

9
(2) If Aj ∈ A for each j ∈ N are disjoint, then

    µ( ⋃_{j∈N} Aj ) = Σ_{j∈N} µ(Aj).

(This property is called countable additivity.)


If in addition µ(Ω) = 1, then µ is a probability measure. We often use P to
denote a probability measure.
Definition 5. Let (Ω, A) be a measurable space, and let µ be a measure on
(Ω, A). We call (Ω, A, µ) a measure space. If µ is a probability measure, we
call it a probability (measure) space.
Example 1 (Lebesgue measure). Let Ω = [0, 1], let B be the Borel σ-algebra,
and let I be the intervals of [0, 1]. Let λ : I → R assign to each interval
its length, e.g. λ((a, b]) = b − a. It turns out that there’s a unique extension
of λ to the rest of B that respects the measure axioms.⁶,⁷ This measure is
called Lebesgue measure on ([0, 1], B). We can similarly define Lebesgue
measure on other subsets of R^n for n ∈ N.
One property of this measure is that any countable set has Lebesgue
measure zero. To show this, take a collection of distinct points xj ∈ [0, 1] for
each j ∈ N. Since λ({xj }) = 0 for each j ∈ N (each is an interval of length
zero), countable additivity yields

    λ( ⋃_{j∈N} {xj} ) = Σ_{j∈N} λ({xj}) = Σ_{j∈N} 0 = 0.

It follows, for example, that λ(Q ∩ [0, 1]) = 0.⁸


Let (Ω, A, µ) be a measure space, and consider some property P that
does or does not hold at each ω ∈ Ω. We say that property P holds µ-almost
everywhere (µ-a.e.) on Ω iff it holds everywhere on Ω except possibly on a
set of µ-measure zero.9 For example, a standard real analysis result states
6. This follows from an extension theorem. Extension theorems provide conditions for the existence, and sometimes uniqueness, of an extension of a measure from a small set (e.g. the intervals) to a larger set (e.g. the Borel σ-algebra).
7. Actually, there’s a unique extension to ([0, 1], L), where L is a larger σ-algebra called the Lebesgue-measurable sets.
8. Q denotes the rational numbers.
9. That is, there is some A ∈ A such that P holds at every ω ∈ A and µ(A^c) = 0.

that any monotone function f : R → R is continuous λ-a.e., where λ is
Lebesgue measure on (R, B). When µ is a probability measure and P holds
µ-a.e., we usually say that P holds µ-almost surely (µ-a.s.) or that P holds
with probability 1. More generally, ‘µ-almost’ is used flexibly as an adverb,
e.g. ‘µ-almost all’ or ‘µ-almost every’.

2.3 Measurable functions


For a function f : F → G and a subset B ⊆ G, write

    f^{-1}(B) := {x ∈ F : f(x) ∈ B}.

(This is standard notation, but worth writing down just in case.)

Definition 6. Let (F, F) and (G, G) be measurable spaces. Then f : F → G


is F/G-measurable iff for any B ∈ G, f −1 (B) ∈ F. When one or both
σ-algebras are clear from the context, we sometimes shorten this to ‘F-
measurable’ or simply ‘measurable’.10

In words, measurable sets of values are generated by measurable sets of


arguments. This is very similar to the definition of continuity from topology,
where a function is continuous iff open sets of values are generated by open
sets of arguments. Notice that the measurability of a function depends only
on the measurable spaces; it has nothing to do with measures defined on
those spaces.11
Measurability is needed because the whole point of measure theory is to
assign measure to things. If a function maps from one measurable space to
another, we’d like to be able to say that the measure of a measurable set of
values in G of the function is equal to the measure of the set of arguments
in F that generate those values. But if our function is not measurable, then
there will be sets of values in G that are counted as measurable according
to (G, G), but which are generated by a set of arguments in F that is not
measurable according to (F, F).
A special case of interest is where (F, F) = (Ω, A) is an arbitrary meas-
urable space and (G, G) = (R^k, B), where B is the Borel σ-algebra on R^k. It
10. For the special case of functions f : R^n → R with the Borel σ-algebras B^n and B, B^n/B-measurability is sometimes called Borel-measurability.
11. Some confusing (in my view) terminology was used at this point in the lecture. Consider two measurable spaces (F, F) and (G, G), a measure µ on (F, F), and a function f : F → G. Above, I defined the property of F/G-measurability of f. Joel called this same property µ-measurability of f. But as I pointed out, the measure µ has nothing to do with it!

turns out that in this case, a function f : Ω → R^k is measurable iff
{ω ∈ Ω : f(ω) ≤ z} ∈ A for each z ∈ R^k.
We can now define random elements, which are principal characters in
the sequel.
Definition 7. Let (Ω, A, P) be a probability space, and let (S, S) be a
measurable space. A random element of (S, S) defined on (Ω, A, P) is an
A/S-measurable function X : Ω → S.
The set S in which a random element takes values can be entirely arbitrary;
it need not be a topological or metric space, for example. But often, S will be
a metric space. In this case, we will sometimes abuse terminology by saying
‘random element of (S, ρ) (defined on (Ω, A, P))’, on the understanding that
S is equipped with a σ-algebra, usually the Borel σ-algebra generated by the
topology induced by the metric ρ.
Definition 8. A random variable is a random element of (R, B). A random
n-vector is a random element of (Rn , B). A random n×m matrix is a random
element of (Rn×m , B).
For a random element X : Ω → S and a measurable subset B of S,
X^{-1}(B) is the set of states of the world ω ∈ Ω at which X(ω) lies in B. We
know that X^{-1}(B) ∈ A since X is a measurable function. But A may contain
lots of other events that do not correspond to X^{-1}(B) for some B ∈ S. These
other events are not interesting for the study of X, so we sometimes wish to
use the smaller σ-algebra that contains all the sets of interest for X but no
others. In our previous jargon, what we want is the σ-algebra generated by
the interesting sets, viz. σ({X^{-1}(B)}_{B∈S}). The name of this object is often
shortened to ‘the σ-algebra generated by X’, or σ(X).


Definition 9. Let X be a random element of (S, S) defined on (Ω, A, P).
The law (or distribution) of X is the function LX : S → R given by
LX (B) := P(X ∈ B) for each B ∈ S.12
If we are interested only in the behaviour of the random element X,
then all of the information we need is contained in its law LX . It does not
matter what probability space it is defined on! In fact, (S, S, LX ) is itself a
probability space, and the random element Y defined by Y (s) := s for each
s ∈ S on this probability space has law LY = LX .
When X is a random vector, there is an alternative (more tractable)
object that fully describes the behaviour of X: the CDF.
12. P(X ∈ B) is shorthand for P({ω ∈ Ω : X(ω) ∈ B}).

Definition 10. Let X be a random vector. The cumulative distribution
function (CDF) of X is FX : Rn → [0, 1] defined by
FX (x1 , . . . , xn ) := LX ((−∞, x1 ] × · · · × (−∞, xn ])
for each (x1 , . . . , xn ) ∈ Rn .
It is intuitive (but not quite obvious, I think) that FX fully characterises
the law of a random vector. Precisely stated: for random vectors X and Y ,
LX = LY (setwise) iff FX = FY (pointwise). See Rosenthal (2006, Proposition
6.0.2) for a (very easy) proof.
Some properties of CDFs are that they are right-continuous, nondecreas-
ing, and satisfy limx→−∞ F (x) = 0 and limx→∞ F (x) = 1. (For n = 1, we
have a converse: any function F : R → [0, 1] with these four properties is the
CDF of some random variable on some probability space.)
We can also define random elements X : Ω → R̄^n that are like random
vectors but can take infinite values. If LX(R^n) < 1, then the distribution LX
of X is said to be defective. Conversely, if LX (Rn ) = 1 then the distribution
is called proper. Though we rarely want to work with random vectors that
take infinite values with positive probability, we sometimes obtain a random
vector with a defective distribution as the limit of a sequence of random
vectors with proper distributions.

2.4 Independence
Probability theory is basically measure theory plus independence. This will
become increasingly clear: all of the interesting theorems that we will state
specifically for probability measures (rather than general measures) assume
(some weakened form of) independence.
Definition 11. For a probability space (Ω, A, P), events A, B ∈ A are
independent iff P(A ∩ B) = P(A)P(B).
Sometimes, we wish to impose independence for two whole classes of
events. Call F a sub-σ-algebra of the σ-algebra A iff it is a σ-algebra of
subsets of Ω and F ⊆ A.
Definition 12. For a probability space (Ω, A, P), sub-σ-algebras F and G
of A are independent iff P(F ∩ G) = P(F )P(G) for every F ∈ F and G ∈ G.
Fix a probability space (Ω, A, P) and a measurable space (S, S). Two
random elements X and Y of (S, S) are called independent iff their generated
σ-algebras σ(X) and σ(Y ) are independent. This just means that
P(X ∈ BX , Y ∈ BY ) = P(X ∈ BX )P(Y ∈ BY )

for any measurable subsets BX and BY of S. (So for random vectors, BX
and BY are any Borel sets.)

2.5 The Lebesgue integral


A problem with the Riemann integral is that many interesting functions are
not Riemann-integrable. This defect is addressed by the Lebesgue integral.
When Ω = R, the standard Lebesgue integral of a measurable function
f : Ω → R is written ∫_Ω f dλ, where λ is Lebesgue measure on R. To
integrate over a measurable subset S ⊆ R, define

    ∫_S f dλ := ∫_Ω f 1_S dλ

where 1S is the indicator function for S.13 The Lebesgue integral does not
exist for every measurable function: a boundedness condition is also needed
to avoid the undefined expression ∞ − ∞.
A measurable function is called Lebesgue-integrable iff its integral exists
and is finite.14 Any Riemann-integrable function is also Lebesgue-integrable,
and in such cases the two integrals coincide. But many functions are Lebesgue-
but not Riemann-integrable.15,16
In constructing the Lebesgue integral, it quickly becomes apparent that
we can replace the Lebesgue measure λ with whatever measure µ we like,
leading to the generalised Lebesgue integral ∫_S f dµ. Of course, the conditions
under which f is integrable depend on what measure we’re integrating with
respect to. If ∫_S f dµ exists, we say that f is µ-integrable.
Importantly, the Lebesgue integral may be defined for measurable functions
f : Ω → R on any measurable space (Ω, A), not just (say) R or Rn . This
provides a powerful generalisation of Riemann integration; for example, we
can integrate over functional spaces.
Now consider the case in which µ is a probability measure on a measurable
space (Ω, A), so that the measurable function f is a random variable. Let’s
13. I.e. 1_S(ω) = 1 iff ω ∈ S, 0 otherwise.
14. A necessary and sufficient condition for a measurable function f to be integrable is that ∫_Ω |f| dλ < ∞.
15. For example, consider the function f : R → R such that f(x) = 1(x ∉ Q). This function is not Riemann-integrable, but it is Lebesgue-integrable (over [0, 1], say), with ∫_{[0,1]} f dλ = 1 as we would hope. More generally, there are certain kinds of discontinuity that Lebesgue integration can handle but Riemann integration cannot.
16. In terms of how they are constructed, the difference between the two integrals is that while the Riemann integral considers the limit of a sequence of approximations to the ‘area under f’ constructed by discretising the x axis, the Lebesgue integral considers approximations constructed by discretising the y axis.

use the more familiar notation of P for the measure and X for the random
variable. In this setting, the Lebesgue integral ∫_Ω X dP is also called the
expected value of X (or the expected value of the distribution LX). Since not
all measurable functions admit an integral, there are random variables whose
expectation is undefined. (One example is a Cauchy-distributed random
variable.) Even when the expectation exists, it may be infinite.
There are various kinds of notation for the expected value, including:

    E(X) = ∫_Ω X dP = ∫_Ω X(ω) dP(ω) = ∫_Ω X(ω) P(dω).

Sometimes, we wish to work with LX rather than X and P; in these cases we
sometimes write E(LX) (confusingly!). It is simple to show (e.g. Rosenthal
(2006, Theorem 6.1.1)) that the expected value can be written as an integral
with respect to the law of X,¹⁷ so that

    E(X) = E(LX) = ∫_R x dLX(x) = ∫_R x LX(dx).

Finally, we can rewrite the integral in terms of the CDF FX rather than
the law LX . This is unsurprising in view of the fact that CDFs coincide iff
the laws do. The integral w.r.t. a CDF is called a Stieltjes integral, and can
be written in various ways:
    E(X) = ∫_R x dFX(x) = ∫_R x FX(dx).

The Stieltjes integral coincides with the Lebesgue integral, but is defined
differently (in terms of the CDF).
Let X be a random variable; then X n for n ∈ N is also a random variable
(easy proof). E (X n ) is called the nth moment of X, and E ((X − E(X))n )
is called the nth central moment. (Of course, a given moment of X need not
exist or be finite.) The second central moment is called the variance of X,
denoted Var(X).
While we’re at it, here are two related concepts. Let X and Y be random
variables. Their covariance is

Cov(X, Y ) := E ([X − E(X)][Y − E(Y )]) ,

and their correlation is

    Corr(X, Y) := Cov(X, Y) / ( √Var(X) √Var(Y) ).
17. Recall that (R, B, LX) is a probability space.

There are a lot of algebraic shortcuts involving variances, covariances and
correlations that are hopefully familiar from undergrad. I won’t list them
here, but I will make use of them!

2.6 The Radon–Nikodým theorem


The Radon–Nikodým theorem gives conditions under which a measure can
be represented as a Lebesgue integral w.r.t. another measure. In particular,
given measures µ and ν on a measurable space (Ω, A), we’re interested in
representations of the form
Z
ν(A) = f dµ for each A ∈ A (1)
A

for some nonnegative, µ-integrable (hence A/B-measurable) function f : Ω → R.
The following concept will turn out to be the key.
Definition 13. Let µ and ν be measures defined on a measurable space
(Ω, A). We say that µ dominates ν, written µ ≫ ν (equivalently ν ≪ µ), iff for any A ∈ A,
µ(A) = 0 implies ν(A) = 0. We also sometimes say that ν is absolutely
continuous w.r.t. µ.18
Suppose that the representation is possible: there is a nonnegative f such
that (1) holds. Then if µ(A) = 0 for some A ∈ A, it follows that
    ν(A) = ∫_A f(ω) µ(dω) = ∫_Ω f(ω) 1_A(ω) µ(dω) = 0

since 1_A(ω) = 0 for all ω ∈ Ω outside a set of µ-measure zero (viz. the set
A). So we’ve learned that if (1) holds, then µ ≫ ν. The Radon–Nikodým
theorem is the (far less obvious) converse to this result: if µ ≫ ν, then there
exists a nonnegative f such that (1) holds. Actually, there’s a caveat: in
order for the converse to hold, both measures must be σ-finite.
In words, a measure defined on (Ω, A) is σ-finite iff there is a countable,
measurable cover of Ω such that every piece of the cover is assigned finite
measure. It should be obvious that every probability measure is σ-finite. For
reference, here’s a schematic definition:
18. There’s some unhelpful confusion of terminology here. Most authors (e.g. Kolmogorov and Fomin (1975) and Billingsley (1995)) use ‘absolutely continuous’ synonymously with ‘dominates’, as in my definition. But Rosenthal (2006, p. 143) defines ‘ν absolutely continuous w.r.t. µ’ to mean that there exists a nonnegative f such that representation (1) holds. By the Radon–Nikodým theorem, the two turn out to be equivalent for σ-finite measures, but they are still distinct properties that should have distinct names.

Definition 14. Let (Ω, A, µ) be a measure space. The measure µ is σ-finite
iff there is a countable collection {Aj}_{j∈N} of subsets of Ω such that

(1) Aj ∈ A for each j ∈ N

(2) ⋃_{j∈N} Aj = Ω

(3) µ(Aj) < ∞ for each j ∈ N.

With definitions in place, we’re ready to state the theorem.


Theorem 1 (Radon–Nikodým). Let µ and ν be σ-finite measures on some
measurable space (Ω, A), and suppose that µ ≫ ν. Then there is a nonneg-
ative, µ-integrable function f such that ν(A) = ∫_A f dµ for each A ∈ A.
f in the theorem is usually called the density of ν with respect to µ. (The
densities familiar from undergrad are densities w.r.t. Lebesgue measure λ.)
It is also called a Radon–Nikodým derivative, denoted dν/dµ. This name is
obviously motivated by the fact that

    ν(A) = ∫_A (dν/dµ) dµ  ∀A ∈ A,

an expression analogous to the fundamental theorem of calculus for ordinary
derivatives and the Riemann integral.¹⁹
Another property of representation (1) is that the density f is unique
up to sets of measure zero: if f and g are both densities of ν w.r.t. µ then
µ(f ≠ g) = 0.²⁰ This is a corollary to a basic property of the Lebesgue
integral; a proof can be found in Billingsley (1995, Theorem 16.10).
Example 2 (density w.r.t. counting measure). As already mentioned, the
probability density functions familiar from undergrad are Radon–Nikodým
derivatives w.r.t. Lebesgue measure. In this example, we’ll see that probability
mass functions (for discrete random variables) are in fact Radon–Nikodým
derivatives w.r.t. a measure called counting measure.
Let (N, A) be a measurable space,21 and let P be a probability measure
on it. Let c : A → R be defined by c(A) := |A| for A ∈ A.22 c is obviously a σ-
finite measure on (N, A); it is called counting measure. Perhaps unsurprisingly,
19. There are other properties which Radon–Nikodým derivatives share with ordinary derivatives. One of these is the chain rule: for σ-finite measures ν ≪ µ ≪ τ on some measurable space, we have dν/dτ = (dν/dµ)(dµ/dτ).
20. That is, µ({ω ∈ Ω : f(ω) ≠ g(ω)}) = 0.
21. Recall that N = {1, 2, . . .}.
22. |A| denotes the number of elements in the set A. For A infinite, |A| = ∞; moreover |∅| = 0.

integration w.r.t. counting measure is equivalent to summation: formally, for
any c-integrable (and measurable) function f : N → R,

    ∫_A f dc = Σ_{n∈A} f(n) for every A ∈ A.

(If you know how the Lebesgue integral is defined, then this should be trivial.)
Now let’s apply the Radon–Nikodým theorem to the σ-finite measures
P and c. The only set to which c assigns measure zero is ∅, and P(∅) = 0;
hence P ≪ c. So by the Radon–Nikodým theorem, there is a nonnegative and
c-integrable function f : N → R such that

    P(A) = ∫_A f dc = Σ_{n∈A} f(n) for every A ∈ A.

Moreover, f is unique: we already knew it to be unique up to sets of c-measure


zero, but the only set of c-measure zero is ∅.
Let’s suppose our σ-algebra is rich enough that it contains all the single-
tons: {n} ∈ A for each n. (Using such a rich σ-algebra would cause problems
on an uncountable sample space, but it’s fine here since N is countable.) Then
for each n ∈ N we have P({n}) = f (n), showing that f is the probability
mass function.23
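A small numerical sanity check of this example in Python, taking a Geometric(p) law on N as an illustrative assumption: summing the pmf over a set A is exactly integrating the density f against counting measure, and recovers P(A).

    def pmf(n, p=0.5):
        # Radon-Nikodym derivative of the Geometric(p) law w.r.t. counting
        # measure on N: the familiar probability mass function
        return p * (1 - p) ** (n - 1)

    A = {1, 3, 5, 7}
    prob_A = sum(pmf(n) for n in A)             # integral of f over A w.r.t. c
    total = sum(pmf(n) for n in range(1, 200))  # integral over (nearly all of) N
    print(prob_A, total)                        # total is essentially 1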

The Radon–Nikodým theorem turns out to have various uses. It is used to


establish the existence of useful constructs such as conditional probabilities
and conditional expectations (see below). In estimation, if the data are
distributed according to νθ for some parameter θ, then a clever choice of µ
can give us a convenient likelihood function L(θ) := dνθ /dµ.

2.7 Conditional probability


This section may be hard to follow. I recommend Billingsley (1995, sec. 33).
Consider two events A and G of a probability space (Ω, A, P). Imagine
that we learn that event G obtains, and would like to revise the probability
of event A in light of this information. (This scenario is obviously ubiquitous
in economics and econometrics.) How do we formalise this?
When P(G) > 0, the answer is intuitive and easy: we say that the
conditional probability is P(A ∩ G)/P(G). (Draw a Venn diagram.) The
problem is that we (very) often wish to condition on a probability-zero event,
23. The Radon–Nikodým theorem doesn’t tell us what the function f is, only that it exists. Characterising f will generally require an argument specific to the measure space at hand.

i.e. P(G) = 0. (For example, the realisation of a normally distributed signal.)
The ratio formula does not apply in this case, so a subtler construction is
called for.
To build some intuition, consider a finite probability space (Ω, 2^Ω, P) in
which P({ω}) > 0 ∀ω ∈ Ω. Let Q(A|G) := P(A ∩ G)/P(G) be the ‘ordinary’
probability of A conditional on G. Notice that Q(A|{ω}) = 1(ω ∈ A). It
follows that for events A and G,
    P(A ∩ G) = Σ_{ω∈A∩G} P({ω}) = Σ_{ω∈G} 1(ω ∈ A) P({ω})
             = Σ_{ω∈G} Q(A|{ω}) P({ω}) = ∫_G Q(A|{ω}) P(dω).

(The Lebesgue integral is used because unlike the sum, it remains defined
when we move to uncountable probability spaces.) We’d like our definition
of conditional probability to respect P(A ∩ G) = ∫_G Q(A|{ω}) P(dω). Since
this property does not involve division by P(G), we can require it to hold
even when P(G) = 0.
We require an additional bit of machinery. Recall that A is a σ-algebra of
subsets of Ω; we call G a sub-σ-algebra of A iff G is a σ-algebra and G ⊆ A.
Since G contains only a subset of the events in A, it provides a ‘coarser’
description of the state of the world ω ∈ Ω.
Here’s an analogy: I am trying to convey to you the colour of the sky.
The sky can be any colour ω in Ω = [0, 1], where 0 is ‘totally blue’ and 1
is ‘totally red’ (say). I have at my disposal a small vocabulary of English
phrases that I can use to communicate with, consisting of (any combination
of) ‘blue’ (= [0, 1/3]), ‘purple’ (= (1/3, 2/3]) and ‘red’ (= (2/3, 1]). Formally, my
language is the σ-algebra generated by the events ‘blue’, ‘purple’ and ‘red’:

    A = σ({[0, 1/3], (1/3, 2/3], (2/3, 1]})
      = {∅, [0, 1/3], (1/3, 2/3], (2/3, 1], [0, 2/3], (1/3, 1], [0, 1/3] ∪ (2/3, 1], [0, 1]}.

Now suppose that my English deteriorates: I forget the words ‘blue’ and
‘purple’, and am left only with the coarser term ‘blurple’ (= [0, 2/3]). My newly
worsened language is

    G = σ({[0, 2/3], (2/3, 1]}) = {∅, [0, 2/3], (2/3, 1], [0, 1]}.

Clearly G is a sub-σ-algebra of A. The example should illustrate the sense in


which G is a coarser language than A.24
24. Aside: a sequence of increasing σ-algebras (each one a sub-σ-algebra of the next one) is called a filtration. A filtration provides a formal way to talk about possible histories. They are therefore important for the study of stochastic processes.
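For a finite Ω, the generation of a σ-algebra can be sketched directly in code: close the generating collection under complements and (here finite, hence trivially countable) unions until a fixed point is reached. The six-point Python discretisation of the colour example below is purely illustrative.

    from itertools import combinations

    Omega = frozenset(range(6))
    blue, purple, red = frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})

    def sigma(generators):
        """Smallest collection containing the generators that is closed
        under complements and pairwise (hence all finite) unions."""
        sets = {frozenset(), Omega} | set(generators)
        while True:
            new = {Omega - A for A in sets}
            new |= {A | B for A, B in combinations(sets, 2)}
            if new <= sets:
                return sets
            sets |= new

    A = sigma({blue, purple, red})   # the full language: 2^3 = 8 events
    G = sigma({blue | purple, red})  # the 'blurple' language: 4 events
    print(len(A), len(G), G <= A)    # 8 4 True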

Back to conditional probability. In general, we are interested in the
probability of A conditional on several different events G. For example,
suppose that we want to condition on some random variable X being realised
in a Borel set B, i.e. the event GB = X −1 (B) = {ω ∈ Ω : X(ω) ∈ B}; usually
we’d like to be able to condition on any Borel event of this sort, not just a
particular one. The rigorous construction of conditional probabilities requires
us to specify in advance what collection of events we want to be able to
condition on. It is perhaps not surprising that we require this collection of
conditioning events to be a σ-algebra. But we also need it to not lead to
measurability problems, and that requires the conditioning σ-algebra to be a
sub-σ-algebra G of A.25
To make the sub-σ-algebra G explicit, we can write the probability of
A conditional on G ∈ G as QG (A|G), which (for fixed A) is a mapping
G → R. It turns out, however, to be technically more convenient to define
the conditional probability as a G-measurable mapping Ω → R, denoted
P(A|G)(ω). Since it’s G-measurable, P(A|G)(ω) = P(A|G)(ω′) whenever ω
and ω′ lie in all the same sets G ∈ G; in this sense, the conditional probability
can only vary between states of the world that are distinguishable using G.
Let’s try to clarify this using the example of conditioning on a random
variable. We formalise ‘conditioning on a random variable’ as conditioning
on its generated σ-algebra σ(X). Since P(A|σ(X)) is σ(X)-measurable,
P(A|σ(X))(ω) = P(A|σ(X))(ω′) whenever X(ω) = X(ω′). Conversely, if
X(ω) ≠ X(ω′) then in general P(A|σ(X))(ω) ≠ P(A|σ(X))(ω′). Informally,
although P(A|σ(X))(·) is really a function of ω, σ(X)-measurability means
(precisely) that it behaves as if it were a function of (the realised value
of) X. So this construction allows us to condition on events like X = x.26
But importantly, this conditional probability does not give us statements
about the probability of A conditional on (say) X ≤ x. To get conditional
probabilities like that, we need a coarser σ-algebra since we’re conditioning
on ‘larger’ events.
All this chatting has established that we want our conditional probability
P(A|G) to be a G-measurable mapping Ω → R that satisfies P(A ∩ G) =
∫_G P(A|G) dP for every G ∈ G. The last condition obviously requires that
P(A|G) be P-integrable. (And implies that P(A|G) ∈ [0, 1] P-a.s.) Let’s put
it all together!
25. If it isn’t clear why measurability problems would arise without this requirement, think about it until it’s clear!
26. To construct this probability explicitly, pick some ω_x ∈ {ω ∈ Ω : X(ω) = x} for each x ∈ R (any will do), and define the ‘intuitive conditional probability’ Q_{σ(X)}(A|X = x) := P(A|σ(X))(ω_x).
Definition 15. Consider a probability space (Ω, A, P) and a sub-σ-algebra
G of A. A random variable P(A|G) : Ω → R is a conditional probability of
A on G iff it is (1) G-measurable, (2) P-integrable, and (3) satisfies

    P(A ∩ G) = ∫_G P(A|G) dP  ∀G ∈ G.

Proposition 1. Consider a probability space (Ω, A, P), a sub-σ-algebra G of


A, and an event A ∈ A. Then there exists a conditional probability P(A|G)
of A on G.

Proof. µ_A(G) := P(A ∩ G) ≤ P(G) for every G ∈ G, so µ_A ≪ P. Since both
are σ-finite measures, the Radon–Nikodým theorem implies that there exists
a (nonnegative,) G-measurable and P-integrable function f_A : Ω → R such
that

    P(A ∩ G) = µ_A(G) = ∫_G f_A dP  ∀G ∈ G.    (2)

So f_A is a conditional probability of A on G. □

The use of the Radon–Nikodým theorem highlights that conditional


probabilities are generally not unique, though they are unique up to sets
of measure zero. This is not ideal, but we cannot fix it in full generality.
Instead, we must pick the ‘right’ conditional probability in each individual
application.27
So far, we have defined a conditional probability for a single, fixed event
A ∈ A. What we’d really like is a conditional probability measure µ(·|G) that
gives the conditional probability of every event in A. We’ll obviously construct
the random function µ(·|G) : A → R according to µ(A|G) := P(A|G) for
each A ∈ A,28 for some collection {P(A|G)}A∈A of conditional probabilities.
Unsurprisingly, µ(·|G)(ω) will not be a measure for all ω ∈ Ω unless the
collection {P(A|G)}A∈A is ‘consistent’ in some way. Happily, it turns out that
it is always possible to choose {P(A|G)}A∈A ‘consistently’ in this manner.
27. Aside: in game theory, off-equilibrium beliefs (beliefs following events that have probability zero on the equilibrium path) are important for constructing equilibria in dynamic games, since ‘fearful’ off-equilibrium beliefs can ‘deter’ players from deviating from equilibrium play. Most equilibrium concepts require that players update their beliefs in accordance with conditional probability. But since conditional probabilities are indeterminate, lots of belief-updating protocols are allowed, leading to equilibrium multiplicity (‘fearful’ equilibria supported by ‘fearful’ off-equilibrium beliefs). Much of the refinements literature is concerned with imposing additional, intuitive restrictions on how beliefs are revised in order to kill off implausible equilibria of this sort.
28. That is, µ(·|G)(·) : A × Ω → R is a function defined by µ(A|G)(ω) := P(A|G)(ω) for each A ∈ A and ω ∈ Ω.
That is, there exists a random function µ(·|G) s.t. every µ(·|G)(ω) is a
probability measure and every µ(A|G)(·) is a conditional probability of A on
G (Billingsley, 1995, Theorem 33.3). Such a µ is called a regular conditional
probability.
Conditional expectation is defined in a similar way.
Definition 16. Consider a probability space (Ω, A, P), a sub-σ-algebra G
of A, and a P-integrable random variable Y.²⁹ A random variable E(Y|G) :
Ω → R is a conditional expectation of Y on G iff it is (1) G-measurable, (2)
P-integrable, and (3) satisfies

    ∫_G Y dP = ∫_G E(Y|G) dP  ∀G ∈ G.

The proof of existence is similar to the one for conditional probability; see
Rosenthal (2006, Proposition 13.1.7). The interpretational subtleties outlined
above also apply to conditional expectation. Also, fun fact: for G = Ω we get

    ∫_Ω Y dP = ∫_Ω E(Y|G) dP,

meaning that the ‘law of iterated expectation’ is actually part of the definition
of conditional expectation.
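On a finite probability space the definition is easy to verify directly: when G is generated by a partition, E(Y|G) averages Y within each cell, and the defining ‘partial averaging’ property holds cell by cell. All of the numbers in this Python sketch are illustrative assumptions.

    import numpy as np

    P = np.array([0.1, 0.2, 0.3, 0.4])   # P({omega}) for omega = 0, 1, 2, 3
    Y = np.array([5.0, 1.0, 2.0, 8.0])
    cells = [[0, 1], [2, 3]]             # the partition generating G

    E_Y_given_G = np.empty_like(Y)
    for cell in cells:
        # cell-conditional mean: (integral of Y over the cell) / P(cell)
        E_Y_given_G[cell] = np.dot(P[cell], Y[cell]) / P[cell].sum()

    for cell in cells:  # partial averaging over each generating G in G
        print(np.dot(P[cell], Y[cell]), "=", np.dot(P[cell], E_Y_given_G[cell]))
    print(np.dot(P, Y), "=", np.dot(P, E_Y_given_G))  # the G = Omega case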

2.8 Inequalities
Probability theory is full of inequalities. The main ones are concentration
inequalities, which bound the probability that a random variable deviates
away from some value (usually zero or its mean). These are often useful for
establishing the convergence of sequences of random variables, the topic of
the next section. The two most basic concentration inequalities are Markov’s
and Chebychev’s; both apply to random variables, but extend easily to
random vectors.
Proposition 2 (Markov’s inequality). Let X be a nonnegative random
variable on (Ω, A, P). Then P(X ≥ ε) ≤ E(X)/ε ∀ε > 0.
Proof. Fix ε > 0.

    E(X) = ∫_Ω X dP = ∫_{X≥ε} X dP + ∫_{X<ε} X dP
         ≥ ∫_{X≥ε} X dP ≥ ε ∫_{X≥ε} dP = ε P(X ≥ ε).  □
29. Recall that Y is P-integrable iff E(Y) exists and is finite.
Corollary 1 (Chebychev’s inequality). Let X be a random variable on
(Ω, A, P) such that E(X) exists and is finite. Then P(|X − E(X)| ≥ ε) ≤
Var(X)/ε² for every ε > 0.
Proof. Fix ε > 0 and define Y := (X − E(X))². Y is nonnegative, so by
Markov’s inequality

    P(|X − E(X)| ≥ ε) = P(Y ≥ ε²) ≤ E(Y)/ε² = Var(X)/ε².  □

As an illustration, consider a standard-normal-distributed random vari-
able X. Chebychev’s inequality gives us P(|X| ≥ 1.96) ≤ 1/1.96² ≈ 0.26.
But we know that P(|X| ≥ 1.96) ≈ 0.05, so the Chebychev bound is not
very tight in this case. This is typical, and perhaps not very surprising since
the bound applies to all probability distributions, even really badly-behaved
ones.
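The numbers above can be reproduced in a couple of lines of Python (the exact tail is computed via the complementary error function):

    import math

    eps = 1.96
    chebychev = 1 / eps ** 2               # Var(X) = 1 for a standard normal
    exact = math.erfc(eps / math.sqrt(2))  # P(|X| >= eps) for X ~ N(0, 1)
    print(round(chebychev, 3), round(exact, 3))  # 0.26 versus 0.05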
Another useful corollary to Markov’s inequality is the following.
Proposition 3 (Chernoff bounds). Let X be a random variable on (Ω, A, P).
Then for every ε > 0,

    P(X ≥ ε) ≤ E(exp(tX))/exp(tε) for every t > 0, and
    P(X ≤ ε) ≤ E(exp(tX))/exp(tε) for every t < 0.
You need to do a bit of work to obtain a useful Chernoff bound. The
quantity fX (t) := E(exp(tX)) is called the moment-generating function
(MGF) of X (it is a cousin of the characteristic function introduced in
section 3.7). The MGFs of all commonly-used distributions can be looked up
on Wikipedia, so when the distribution is known we can compute a Chernoff
bound for various values of t. We can also make use of generic properties of
MGFs: for example, we can use the fact that

    f_{Σ_{i=1}^n X_i}(t) = ∏_{i=1}^n f_{X_i}(t)

for {Xi } independent to derive a Chernoff bound for sums of independent


random variables. The fact that the generic bound holds for any t is helpful,
since it allows us to pick a t for which the bound is tractable (and tight,
hopefully).
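A minimal Python sketch for X ~ N(0, 1), whose MGF is exp(t²/2): the bound exp(t²/2 − tε) is minimised at t = ε, and even the optimised bound is loose relative to the exact tail, though much tighter than Chebychev’s 0.26 above.

    import math

    def chernoff_bound(eps, t):
        # E(exp(tX)) / exp(t * eps), with E(exp(tX)) = exp(t^2 / 2) for N(0, 1)
        return math.exp(t ** 2 / 2 - t * eps)

    eps = 1.96
    print(chernoff_bound(eps, 1.0))           # a valid but loose t: ~0.23
    print(chernoff_bound(eps, eps))           # optimal t = eps: ~0.15
    print(math.erfc(eps / math.sqrt(2)) / 2)  # exact one-sided tail: ~0.025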
There are many refinements of Markov’s and Chebychev’s inequalities
that give tighter bounds than this under additional assumptions. The next
two are easy concentration inequalities similar to Markov’s above.

Proposition 4 (generalised Markov inequality). Let X be a nonnegative
random variable on (Ω, A, P), and let g : R+ → R+ be strictly increasing.
Then P(X ≥ ε) ≤ E(g(X))/g(ε) ∀ε > 0.
Proposition 5 (Cantelli’s inequality). Let X be a random variable on
(Ω, A, P) such that E(X) exists and is finite. Then

    P(X − E(X) ≥ λ) ≤ Var(X)/(Var(X) + λ²) for λ > 0, and
    P(X − E(X) ≥ λ) ≥ λ²/(Var(X) + λ²) for λ < 0.

The next two concentration inequalities are terribly ugly, but very useful.
The former (Kolmogorov’s) is a special case of the latter (Hájek–Rényi). We
will use the Hájek–Rényi inequality to prove Kolmogorov’s first SLLN in
section 4.1 (p. 53).
Theorem 2 (Kolmogorov’s inequality). Let {Xn} be a sequence of mean-
zero independent random variables on (Ω, A, P). Then for any m < n and
ε > 0,

    P( max_{k∈[m,n]} |Σ_{i=1}^k X_i| ≥ ε ) ≤ (1/ε²) Σ_{i=1}^n Var(X_i).

Theorem 3 (Hájek–Rényi inequality). Let {Xn} be a sequence of mean-
zero independent random variables on (Ω, A, P), and let {cn} be a weakly
decreasing sequence in R+. Then for any m < n in N and any ε > 0,

    P( max_{k∈[m,n]} c_k |Σ_{i=1}^k X_i| ≥ ε ) ≤ (1/ε²) ( c_m² Σ_{i=1}^m Var(X_i) + Σ_{i=m+1}^n c_i² Var(X_i) ).
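A Monte Carlo sanity check of Kolmogorov’s inequality in Python, with mean-zero uniforms as an illustrative assumption (taking m = 1; the Hájek–Rényi bound with c_k ≡ 1 reduces to the same right-hand side):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps, eps = 100, 50_000, 12.0
    X = rng.uniform(-1, 1, size=(reps, n))        # Var(X_i) = 1/3
    max_abs_S = np.abs(X.cumsum(axis=1)).max(axis=1)

    lhs = (max_abs_S >= eps).mean()  # P(max_k |X_1 + ... + X_k| >= eps)
    rhs = n * (1 / 3) / eps ** 2     # sum of the variances over eps^2
    print(lhs, "<=", rhs)            # roughly 0.07 <= 0.23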

Next up are two inequalities from the theory of Lp spaces, a branch of


functional analysis (i.e. these are not concentration inequalities).30 The first
is the triangle inequality for Lp spaces; the second is a generalisation of the
Cauchy–Schwarz inequality.
Theorem 4 (Minkowski’s inequality). Let X and Y be random variables
on (Ω, A, P). Then for any p ≥ 1,

    E(|X + Y|^p)^{1/p} ≤ E(|X|^p)^{1/p} + E(|Y|^p)^{1/p}

provided the moments exist.


30. (It’s not important, but) an Lp space is a set of measurable functions on some measurable space (Ω, A) whose pth power is integrable w.r.t. some measure µ on (Ω, A). Lp spaces turn out to be normed vector spaces (you can easily verify this using Minkowski’s inequality).
Theorem 5 (Hölder’s inequality). Let X and Y be random variables on
(Ω, A, P). Then for any p, q ≥ 1 with p⁻¹ + q⁻¹ ≤ 1,

    E(|XY|) ≤ E(|X|^p)^{1/p} E(|Y|^q)^{1/q}

provided the moments exist.

Corollary 2 (Cauchy–Schwarz inequality). Let X and Y be random vari-
ables on (Ω, A, P). Then

    E(|XY|) ≤ E(|X|²)^{1/2} E(|Y|²)^{1/2}

provided the moments exist.

We’ll finish with Jensen’s inequality, which is probably familiar.

Theorem 6 (Jensen’s inequality). Let X be a random variable on (Ω, A, P),


and let g : R → R be convex. Then g (E(X)) ≤ E (g(X)) provided the
moments exist, with equality iff either g is linear or LX is a point mass.

3 Modes of convergence
Official reading: Amemiya (1985, ch. 3), Rao (1973, ch. 2) and Serfling (1980,
ch. 1).
Since this section concerns convergence, it will be important that all
topologically interesting sets are measurable. So unless otherwise specified,
every set will be equipped with (a superset of) its Borel σ-algebra.

3.1 Convergence of random sequences


In econometric theory, we use an estimator θ̂n to estimate some unknown
parameter θ. θ̂n is a function of the data (which has sample size n), and
so is random. As the sample size grows large, we would like the sequence
of estimators θ̂n to get close to θ in some well-defined sense. This is one
of the most basic adequacy criteria for estimators, called consistency. To
study consistency, we have to define what it means for a sequence of random
variables to converge to a point, or more generally to a random variable.
As a starting point, let’s review what ‘convergence’ means in analysis.
We’ll restrict attention to convergence in metric spaces (rather than general
topological spaces) to make our lives easier.

Definition 17. Let {xn} be a sequence in a metric space (S, ρ). {xn}
converges to x0 ∈ S iff for any ε > 0, there exists Nε ∈ N such that
ρ(xn, x0) < ε whenever n ≥ Nε. Convergence is typeset as xn → x0 or
lim_{n→∞} xn = x0.

This notion of convergence can easily be applied to random elements. Let


S be the set of all random elements of some measurable space (T, T ) defined
on (Ω, A, P), and let ρ be some metric on S. Let {Xn } and X be elements
of S. Then the ordinary notion of convergence is perfectly intelligible. For
example, if the outcome space T is equipped with a metric d, we can use the
sup metric

    ρ(X, Y) := sup_{ω∈Ω} d(X(ω), Y(ω))

on S, in which case Xn → X means uniform convergence of the functions
{Xn} to the function X.
This concept is so strong, however, that many useful convergence results
(to be proved later) are unavailable if we use it. Weaker concepts are therefore
called for. One approach would be to try to find a cleverer choice of metric
ρ on S, but we will avoid this route because it is less intuitive and because
some of our convergence concepts are not metrisable in this manner anyway.

26
We will instead proceed in an ad-hoc way, dreaming up new convergence
concepts with intuition as our guide.
The most obvious way of weakening the ordinary convergence is almost
sure convergence. Ordinary convergence in the sup metric required uniform
convergence of the measurable functions {Xn }. We could weaken this to
requiring only pointwise convergence of {Xn}: Xn(ω) → X(ω) for all ω ∈ Ω
(we might call this ‘sure convergence’).³¹ Almost sure convergence weakens
this one step further by requiring that Xn(ω) → X(ω) for almost all ω ∈ Ω,
i.e. for all ω outside a set of measure zero. Formally:

Definition 18. Let {Xn} and X be random elements of a metric space
(S, ρ) defined on (Ω, A, P). {Xn} converges almost surely (a.s.) to X iff

    lim_{n→∞} Xn = X  P-a.s.³²

Almost sure convergence is typeset Xn →a.s. X or Xn → X a.s., and is also
known as convergence with probability 1 (w.p. 1) or (in general measure
theory) convergence almost everywhere (a.e.).

This definition of a.s. convergence is intuitive, but it is often difficult


to work with. The following lemma gives a more tractable condition. I’ll
omit the proof because it requires machinery I haven’t introduced (tail events,
continuity of measures).

Lemma 1. Let {Xn} and X be random elements of a metric space (S, ρ)
defined on (Ω, A, P). Then Xn →a.s. X iff for every ε > 0,

    lim_{N→∞} P( sup_{n≥N} ρ(Xn, X) > ε ) = 0.

Remark 1. Often, we’re interested in the notion that a sequence of random
elements converges to a point, not to a random element. To fit this idea into
the definition above, simply take the limiting random element X to be a
constant function. We’ll use the shorthand Xn →a.s. α for a point α ∈ S to
denote a.s. convergence of {Xn} to a constant function everywhere equal
to α. (Similarly, we will write Xn →p α and Xn →m.s. α for the other two
convergence concepts in this section.)
31. This corresponds to convergence in the product topology on S (a.k.a. the topology of pointwise convergence). This topology is not metrisable!
32. More explicitly, P({ω ∈ Ω : lim_{n→∞} Xn(ω) = X(ω)}) = 1. The lim operator here is the ordinary one from analysis, since {Xn(ω)} is a sequence in a metric space.
A.s. convergence turns out to be stronger than necessary for most of
econometric theory. To weaken it, look at Lemma 1: if we drop the sup
operator, we obviously get a less stringent condition. This weaker condition
is called convergence in probability.
Definition 19. Let {Xn} and X be random elements of a metric space
(S, ρ) defined on (Ω, A, P). {Xn} converges in probability to X iff for any
ε > 0,

    lim_{n→∞} P(ρ(Xn, X) > ε) = 0.

Convergence in probability is typeset Xn →p X or plim_{n→∞} Xn = X, and is
also known as convergence with probability approaching 1.
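A Monte Carlo illustration in Python, using the sample mean of iid Uniform(0, 1) draws as an arbitrary example: the deviation probability shrinks as n grows, i.e. the sample mean converges in probability to 1/2 (a preview of the laws of large numbers of section 4).

    import numpy as np

    rng = np.random.default_rng(0)
    eps, reps = 0.05, 20_000
    for n in (10, 100, 1000):
        means = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
        print(n, (np.abs(means - 0.5) > eps).mean())  # -> 0 as n grows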
Finally, we introduce a third convergence concept. Like a.s. convergence,
it is stronger than convergence in probability, though it neither implies nor
is implied by a.s. convergence. Let ‖·‖₂ denote the Euclidean norm.
Definition 20. Let {Xn} and X be random vectors defined on (Ω, A, P).
{Xn} converges in mean square (m.s.) to X iff E((‖Xn − X‖₂)²) exists for
each n ∈ N and

    lim_{n→∞} E((‖Xn − X‖₂)²) = 0.

Convergence in mean square is typeset Xn →m.s. X.
We will not use this concept very much because we don’t want to as-
sume that the second moment of an estimator exists. But convergence in
mean square turns out to be useful for studying the convergence of random
functions.
It’s clear that convergence in mean square can at most be extended to
random elements of normed vector spaces (such as Rn ). It cannot be defined
for random elements of general metric spaces.

Aside on metrisability. Let’s return to the question of whether we can
cleverly choose a metric on the set of random elements such that our new
convergence concepts are equivalent to convergence in the ordinary sense.
Fix a probability space (Ω, A, P) and a metric space (T, d), let S be the set
of random elements of (T, d) defined on this space, and let {Xn} and X lie
in S. Endow S with a metric ρ, and let →ρ denote ordinary convergence in
the metric space (S, ρ).
It should be obvious that Xn →m.s. X is equivalent to Xn →ρ X in the
metric

    ρ(X, Y) := E((‖X − Y‖₂)²).


28
p
→ρ X.33
It turns out that there is also a metric ρ such that Xn −→ X iff Xn −
On the other hand, unless (Ω, A) is trivial, there is no metric ρ on S such
a.s.
that Xn −−→ X iff Xn − →ρ X (see Dudley (2004, p. 289)).

3.2 Convergence of random functions


In analysis, we make a distinction between pointwise and uniform convergence
of functions. As in the previous section, we’ll make our lives easier by
restricting attention to metrisable topologies.
Definition 21. Let (S, ρ) and (S 0 , ρ0 ) be metric spaces, and let {fn } and f
be functions S → S 0 .
→ f pointwise iff ρ0 (fn (x), f (x)) −
(1) fn − → 0 for every x ∈ S.
→ f uniformly iff supx∈S ρ0 (fn (x), f (x)) −
(2) fn − → 0.
I’ve stated both in terms of convergence (in R) of the distance ρ0 because
uniform convergence is most naturally expressed in that way, but clearly
fn −→ f pointwise is equivalent to fn (x) −→ f (x) for every x ∈ S.
Uniform convergence obviously implies pointwise convergence. What
uniform convergence adds to pointwise convergence is that there must be
some rate of convergence that applies to every point. In particular, pointwise
convergence says that for each ε > 0 and each x ∈ S, there is Nε,x such
that ρ0 (fn (x), f (x)) < ε for all n ≥ Nε,x ; uniform convergence adds the
requirement that the set {Nε,x }x∈S has a finite upper bound N ε .
This should make it sound like uniform convergence is quite a lot stronger
than pointwise convergence. Indeed, many pointwise convergent sequences of
well-behaved functions fail to converge uniformly, as the following example
shows.
Example 3. Let S = [0, 2] and let fn : S → R be given by

nx

 for x ∈ [0, 1/n)
fn (x) = 2 − nx for x ∈ [1/n, 2/n)

for x ∈ [2/n, 2].

0

A few functions in the sequence are drawn in Figure 1. Define f : S → R by


f (x) := 1(x = 0). Obviously fn − → f pointwise. But {fn } does not converge
to f uniformly: supx∈[0,2] |fn (x) − f (x)| = 1 no matter how large n gets.
33
For example, ρ(X, Y ) := E ( |X − Y |/ [1 + |X − Y |]) does the trick. We were asked to
prove this in question 3 on problem set 2; a proof can also be found in Dudley (2004,
Theorem 9.2.2).

29
f20 f5 f2 f1

Figure 1 – Some elements of the sequence {fn } from Example 3.

30
We’re defining these convergence concepts using the metric on the image
space S 0 . It is perhaps more natural to define convergence of functions by
endowing the functional space F with a topology such that the appropriate
convergence concept coincides with convergence in that topology. Convergence
in the product topology (a.k.a. the topology of pointwise convergence) on F
is equivalent to pointwise convergence as defined above. Convergence in the
topology on F induced by the sup metric

d(f, g) := sup ρ0 (f (x), g(x))


x∈S 0

is equivalent to uniform convergence.


The task in this section is to extend uniform and pointwise convergence
to random functions. A random function is simply a random element that
takes values in a functional space. As usual, we are using the Borel σ-algebra
corresponding to whatever topology we’ve endowed this functional space
with. (There are many topologies we might want to endow a functional space
with. We’ve already seen two, the product topology and the topology induced
by the sup metric.)
Here’s the (entirely straightforward) extension. We won’t bother with
convergence in m.s.

Definition 22. Let (S, ρ) and (S 0 , ρ0 ) be metric spaces, and let {fn } and f
be random functions S → S 0 .
a.s. a.s.
(1) fn −−→ f pointwise iff ρ0 (fn (x), f (x)) −−→ 0 for every x ∈ S.
a.s. a.s.
(2) fn −−→ f uniformly iff supx∈S ρ0 (fn (x), f (x)) −−→ 0.
p p
(3) fn −→ f pointwise iff ρ0 (fn (x), f (x)) −→ 0 for every x ∈ S.
p p
(4) fn −→ f uniformly iff supx∈S ρ0 (fn (x), f (x)) −→ 0.
a.s. a.s.
Remark 2. Obvious equivalences: fn −−→ f pointwise iff fn (x) −−→ f (x) for
p p
every x ∈ R, and fn −→ f pointwise iff fn (x) −→ f (x) for every x ∈ R.

Some final (dull) remarks on terminology. Sometimes we’re interested in


convergence a.s./in probability pointwise/uniformly on some subset T ⊆ S;
in this case, simply replace ‘for every x ∈ S with ‘for every x ∈ T ’ and
‘supx∈S ’ with ‘supx∈T ’ in the definitions above. When we’re holding an
a.s.
argument fixed, as in ‘fn (·, y) −−→ f (·, y) uniformly on T ⊆ S’, we sometimes
a.s.
say ‘fn (x, y) −−→ f (x, y) uniformly in x ∈ T ’ instead; similarly for uniform
convergence in probability.

31
3.3 Convergence of measures
All three of the convergence concepts we’ve given have a similar flavour:
they require the random elements Xn to get close to X as n increases. But
we might also care about the distributions of {Xn } getting close to the
distribution of X. For example, suppose Xn ∼ N (0, 1) and X ∼ N (0, 1), all
independent.34 No matter how large n gets, P(Xn = 6 X) = 1. Nevertheless, it
seems that this is a (trivial) case in which the distributions of {Xn } converge
to the distribution of X.
For a probability space (Ω, A, P) with Ω is endowed with a topology, call
A ∈ A a P-continuity set iff P(∂A) = 0.35

Definition 23. Let {Xn } and X be random elements of a metric space (S, ρ)
defined on (Ω, A, P). {Xn } converges in distribution to X iff LXn (A) − →
LX (A) for every LX -continuity set A. Convergence in distribution is typeset
d
Xn −→ X or Xn X.

In the case of random vectors, this obviously reduces to the (perhaps more
familiar) definition that FXn −
→ FX (pointwise) at every continuity point of
FX . The following example illustrates why we do not require convergence at
discontinuity points of FX .

Example 4. Let {Xn } be a sequence of logistically distributed random


variables. In particular, let them have CDFs

FXn (x) = (1 + exp (−x/θn ))−1 ∀x ∈ R

where the sequence {θn } satsifies θn −


→ 0. The sequence of functions {FXn }
converges pointwise to

0

 x<0
G(x) := 0.5 x = 0


1 x > 0.

G is not a CDF since it isn’t right-continuous. But the function F given


by F (x) := G(x) at x 6= 0 and F (0) := 1 is a CDF, corresponding to a
point mass at 0. This is intuitively what the sequence {Xn } should converge
34

N µ, σ 2 denotes the normal distribution; in particular I sometimes use it to mean
a normally distributed random variable, sometimes a normal law on (R, B). ‘∼’ reads ‘is
distributed as’.
35
∂A denotes the boundary of A. It is measurable since it’s closed and we’re using (a
superset of) the Borel σ-algebra.

32
in distribution to. And since convergence in distribution does not require
d
convergence of the CDFs at discontinuity points, we have Xn −→ X where
X is any random variable with this CDF.

We say that a sequence of random variables converges in distribution, but


it is really more natural to think of the sequence of laws {LXn } as converging.
In this language, convergence in distribution is called weak convergence.

Definition 24. Let {µn } and µ be measures defined on a measurable space


(Ω, A), and equip Ω with a topology. {µn } converges weakly to µ iff µn (A) −

µ(A) for every µ-continuity set A. Weak convergence is typeset µn ⇒ µ.
(Occasionally, I may sloppily say that the random variables {Xn } converge
weakly to X.)

This quite intuitive notion of convergence turns out to be equivalent


to several other tractable (but less intuitive) properties. The equivalence is
given by the Portmanteau lemma, a small part of which is stated below. As
it happens, property (2) below is conventionally taken as the definition of
weak convergence.

Lemma 2 (partial Portmanteau lemma). Let {µn } and µ be measures on


(Ω, A). The following are equivalent.

(1) µn (A) −
→ µ(A) for every µ-continuity set A.
R R
(2) Ω f dµn −
→ Ω f dµ for every continuous and bounded f : Ω → R.

Weak convergence is equivalent to convergence in the weak? topology


on the set of probability measures on (Ω, A). (This is immediate from the
definition of the weak? topology!) Moreover, this topology is metrisable (by
the Prohorov metric; see Billingsley (1999, pp. 72–3)), so weak convergence
corresponds to ordinary convergence in a certain metric on the space of
probability measures.
Our interest in weak convergence is motivated by central limit theorems,
which concern weak convergence of the laws of normalised sums of random
vectors to a normal law. It turns out that for the special case of random
vectors, the theory of weak convergence can be studied using the characteristic
transform introduced in section 3.7 below. We therefore won’t delve any
deeper into the theory of weak convergence in general metric spaces here.
(For the curious, Billingsley (1999) is a standard book.)

33
3.4 Relationships between modes of convergence
In this section, we will establish the implication relationships between the
modes of convergence we’re considering. In particular, we will show that
 
a.s.
Xn −−→ X 
p
 
d

  ⇒ Xn −→ X ⇒ Xn −→ X .
m.s.
Xn −−→ X
d  p 
We’ll also show that for a constant α, Xn −→ α ⇒ Xn −→ α .
We begin with the first two implications: that a.s. convergence and
convergence in m.s. imply convergence in probability. Both have nice, short
proofs.
a.s. p
Proposition 6. If Xn −−→ X, then Xn −→ X.
a.s.
Proof. Let Xn −−→ X. Fix an ε > 0. Obviously ρ(XN , X) > ε implies that
supn≥N ρ(Xn , X) > ε. Together with nonnegativity, this yields
!
0 ≤ P (ρ(XN , X) > ε) ≤ P sup ρ(Xn , X) > ε .
n≥N

a.s.
Since the RHS converges to 0 as N − → ∞ by Xn −−→ X, it follows that
p
limN →∞ P (ρ(XN , X) > ε) = 0. Since ε > 0 was arbitrary, Xn −→ X. 
m.s. p
Proposition 7. If Xn −−→ X, then Xn −→ X.
Proof. Let Xn −−→ X. (kXn − Xk2 )2 is a nonnegative random variable, so
m.s.

Markov’s inequality (p. 22) applies. Together with the fact that probabilities
are nonnegative, we have for any ε > 0 that
 
0 ≤ P (kXn − Xk2 > ε) = P (kXn − Xk2 )2 > ε2
 
≤ ε−2 E (kXn − Xk2 )2 .
m.s.
The RHS converges to 0 since Xn −−→ X. Hence P(kXn − Xk2 > ε) −
→0
p
for every ε > 0, i.e. Xn −→ X. 

A natural question you might now ask is: convergence in probability plus
what property is equivalent to convergence in mean square? The answer is a
boundedness property called uniform integrability; see e.g. Williams (1991,
sec. 13.7).
To show that a.s. convergence and convergence in m.s. do not imply each
other, we give counterexamples.

34
m.s. a.s.
Example 5 (−−→ without −−→). Let {Xn } be independent with
1 1
P(Xn = 0) = 1 − n and P(Xn = 1) = n for each n ∈ N.
m.s.
Then E Xn2 = 1/n −

→ 0 as n −→ ∞, so Xn −−→ 0.
It’s obvious (but we didn’t prove) that if {Xn } is a.s.-convergent then
the limit must be 0. A.s. convergence to 0 would require that
!
lim P sup |Xn | < ε = 1 for every ε > 0.
N →∞ n≥N

So choose ε ∈ (0, 1). Then |Xn | < ε iff Xn = 0, so for any N ∈ N we have
∞ 
!
Y 
1
P sup |Xn | < ε = P(Xn = 0 ∀n ≥ N ) = 1− n .
n≥N n=N

where the final


 equality
 used independence. Taking logs and using the in-
1 1 36
equality ln 1 − n ≤ − n ,

∞ ∞
!!
 
n−1 = −∞
X X
1
ln P sup |Xn | < ε = ln 1 − n ≤−
n≥N n=N n=N

series ∞ −1 diverges for every N . So by continuity


P
since the harmonic
 n=N n
of ln(·), P supn≥N |Xn | < ε = 0 for every N , hence
!
lim P sup |Xn | < ε = 0 6= 1.
N →∞ n≥N

a.s. m.s.
Example 6 (−−→ without −−→). Let {Xn } be independent with
1 1
P(Xn = 0) = 1 − n2
and P(Xn = n) = n2
for each n ∈ N.

E Xn2 = n2 /n2 = 1 for every n ∈ N, so we don’t have convergence to 0 in




mean square. It’s fairly obvious (but we didn’t prove) that we cannot have
convergence in m.s. to anything other than 0.
Again, a.s. convergence to 0 requires that
!
lim P sup |Xn | < ε = 1 for every ε > 0.
N →∞ n≥N

36
Since ln is concave, it must lie below all its tangents, i.e. for any x, x0 > 0, ln(x) ≤
ln(x ) + (x0 )−1 (x − x0 ). Setting x = 1 − n1 and x0 = 1 yields ln 1 − n1 ≤ − n1 .
0

35
Following the steps in the previous example, for small ε > 0 and any N ∈ N
we have

!!
X  
1
ln P sup |Xn | < ε = ln 1 − n2
.
n≥N n=N
 
1
Using the inequality ln 1 − n2
≥ − n12 − 1 37
2n4
, we obtain

∞ 
!!
1 1
X 
ln P sup |Xn | < ε ≥− 2
+ 4 .
n≥N n=N
n 2n
P∞
The p-series n=N n−p is convergent iff p > 1 (regardless of N ), so the RHS
is finite for each N and converges to 0 as N → ∞. So by continuity of ln(·)
we obtain !
lim P sup |Xn | < ε = 1.
N →∞ n≥N

Now for another proposition: convergence in probability implies conver-


gence in distribution.
p d
Proposition 8. If Xn −→ X, then Xn −→ X.

The statement is true for general random elements, but our proof restricts
attention to random vectors in order to make use of the simpler CDF-based
definition of convergence in distribution.
p
Proof for random vectors. Let Xn −→ X. Define Zn := X − Xn , so that
p
Zn −→ 0. Fix some ε > 0 and a continuity point t ∈ R of FX . We must show
that FXn (t) −
→ FX (t).
Using the fact that A ⊆ B implies P(A) ≤ P(B) and a few other basic
37
This inequality follows from the fact that in the Taylor series
∞ ∞
X (−1)n+1 x2 X (−1)n+1 n
ln(1 + x) = xn = x − + x ,
n 2 n
n=1 n=3

P∞ n+1
(−1)
the remainder n=3 n
xn can be shown to be nonnegative. Now set x = −1/n2 .

36
facts about probabilities,

FXn (t) = P (X − (X − Xn ) ≤ t)
= P (X ≤ t + Zn )
= P (X ≤ t + Zn , Zn < ε) + P (X ≤ t + Zn , Zn ≥ ε)
≤ P (X ≤ t + ε, Zn < ε) + P (X ≤ t + Zn , Zn ≥ ε)
≤ P (X ≤ t + ε) + P (X ≤ t + Zn , Zn ≥ ε)
≤ P (X ≤ t + ε) + P (X ≤ ∞, Zn ≥ ε)
≤ P (X ≤ t + ε) + P (Zn ≥ ε)
= FX (t + ε) + P (Zn ≥ ε) .
p
Since Zn −→ 0, P (Zn ≥ ε) −
→ 0 as n −
→ ∞. It follows that

lim sup FXn (t) ≤ FX (t + ε).


n→∞

Taking ε −
→ 0 and using the fact that t is a continuity point of FX ,

lim sup FXn (t) ≤ FX (t).


n→∞

Now go through the exactly same steps, replacing FXn with 1 − FXn and
ε with −ε:

1 − FXn (t) = P (X − (X − Xn ) > t)


= P (X > t + Zn )
= P (X > t + Zn , Zn ≤ −ε) + P (X > t + Zn , Zn > −ε)
≤ P (X > t + Zn , Zn ≤ −ε) + P (X > t − ε, Zn > −ε)
≤ P (X > t + Zn , Zn ≤ −ε) + P (X > t − ε)
≤ P (X > −∞, Zn ≤ −ε) + P (X > t − ε)
≤ P (Zn ≤ −ε) + P (X > t − ε)
= P (Zn ≤ −ε) + [1 − FX (t − ε)] .

Rearranging, FXn (t) ≥ FX (t − ε) − P (Zn ≤ −ε), which yields

lim inf FXn (t) ≥ FX (t − ε)


n→∞

p
since Zn −→ 0. Taking ε −
→ 0 and using the fact that t is a continuity point
of FX then gives us
lim inf FXn (t) ≥ FX (t).
n→∞

37
Putting together the pieces,

lim sup FXn (t) ≤ FX (t) ≤ lim inf FXn (t).


n→∞ n→∞

Hence {FXn (t)} is convergent and has limit FX (t). 

There is a special case in which the converse is true:


d p
Proposition 9. If Xn −→ X for X constant, then Xn −→ X.

The result is pretty obvious, but the proof I’ve seen use parts of the
Portmanteau lemma that I haven’t stated, so I won’t bother.

3.5 The Borel–Cantelli lemmata


The concentration inequalities in section 2.8 (p. 22) can be used to prove
that a sequence of random elements converges in probability. For example,
if {Xn } are random variables with means µ and variances n−α σ 2 for some
α > 0, then Chebychev’s inequality (p. 23) yields

σ2
P(|Xn − µ| > ε) ≤ −
→0 for any ε > 0,
n α ε2
p
so Xn −→ µ. (This is how we will prove Chebychev’s WLLN in section 4.1
(p. 53).)
It would be nice to have a similarly tractable sufficient condition for
almost sure convergence. That is exactly what the first Borel–Cantelli lemma
gives us. And there’s more: the second Borel–Cantelli lemma says that our
sufficient condition is also necessary when the sequence is independent.

Theorem 7 (Borel–Cantelli lemmata). Let {Xn } and X be random elements


of a metric space (S, ρ) defined on (Ω, A, P).
P∞ a.s.
(1) If n=1 P(ρ(Xn , X) > ε) < ∞ for all ε > 0, then Xn −−→ X.

(2) If ∞ n=1 P(ρ(Xn , X) > ε) = ∞ for some ε > 0 and {Xn } are independ-
P

ent, then Xn does not converge a.s. to X.

The Borel–Cantelli lemmata are actually much more general than what
we stated here. If you care, see e.g. Rosenthal (2006, Theorem 3.4.2).

38
3.6 Convergence of moments
d
Suppose Xn −→ X for random variables {Xn } and X. It might seem reason-
able to conjecture that E(Xn ) −
→ E(X). But upon reflection, it’s not a very
d
good conjecture: by the Portmanteau lemma (p. 33), Xn −→ X is equivalent
to
Z Z
f dLXn −
→ f dLX for every continuous and bounded f ,
R R

but we want Z Z
IdLXn −
→ IdLX
R R
where I is the definitely-not-bounded identity function I(x) := x! So {LXn }
are going to have to be appropriately bounded if the moments are to converge.
I’ll give two counterexamples. In the first, no moments exist along the
sequence, but the limit distribution has moments. In the (perhaps less trivial)
second example, moments exist along the entire sequence, but fail to converge
nonetheless.

Example 7. Let {Xn } be independent random variables with CDFs


 
FXn (x) := n1 C(x) + 1 − 1
n Φ(x)

where Φ is the standard normal CDF and C is the standard Cauchy CDF

C(x) := 1
2 + π −1 arctan(x).
d
It’s obvious that FXn − → Φ pointwise, so Xn −→ N (0, 1). The expectation of
the limit is therefore 0. But Xn has no mean for any n ∈ N since the Cauchy
distribution has no moments. So the sequence {E(Xn )} does not even exist,
hence certainly cannot be said to converge to zero.

Example 8. Consider random variables {Xn } and X such that


1 1
P(Xn = 1) = 1 − n and P(Xn = n) = n

and X ∼ N (0, 1), with X independent of {Xn }. Define Yn := (Xn X)2 .


p d
Evidently Xn2 −→ 1, so by Slutsky’s theorem Yn = Xn2 X 2 −→ X 2 . But
E X 2 = 1, whereas (using independence)


       
E (Yn ) = E Xn2 E X 2 = E Xn2 = 1 − 1
n + n1 n2 −
→ ∞.

39
d
But even if Xn −→ X is not sufficient for E(Xn ) − → E(X), surely
a.s.
Xn −−→ X is sufficient? No again! We can still get the sort of pathological
behaviour exhibited by the examples above. To rule this out, we need a
boundedness condition on {Xn } and X to rule out nonexistence or explosive
behaviour.
There are several important theorems giving conditions under which
a.s.
Xn −−→ X implies E(Xn ) − → E(X). These include (in order from strongest
to weakest assumptions) the monotone convergence theorem, the bounded
convergence theorem, the (Lebesgue) dominated convergence theorem and
the (Vitali) uniform integrability convergence theorem. The proofs of the last
few rely heavily on Fatou’s lemma. All of this is covered well by Rosenthal
(2006, mainly ch. 9). We’ll need the dominated convergence theorem later
on, but I won’t give a proof.

Theorem 8 (dominated convergence theorem). Let {Xn }, X and Y be


a.s.
random variables such that Xn −−→ X, |Xn | ≤ Y (pointwise) for each n ∈ N,
and E(Y ) exists and is finite. Then E(Xn ) −
→ E(X).
a.s. R
R
In other words, when Xn −−→ X, a sufficient condition for Ω Xn dP −→
Ω XdP is that {Xn } is dominated by an integrable function (random vari-
able) Y .

3.7 Characteristic functions


When we’re working with random vectors, we have access to the following
highly convenient object. Let C denote the complex plane.

Definition 25. Let X be a random n-vector on (Ω, A, P). The characteristic


function of X is φX : Rn → C given by
Z
φX (t) := E (exp (it> X)) = exp (it> x) LX (dx) for each t ∈ Rn .38
Rn

Above, we took the random variable X as the primitive. Although this is


often natural, most of probability theory is concerned with measures, not
random variables. It is therefore instructive to study the mapping µ 7→ φµ
from probability measures to the characteristic functions of random variables
distributed according to those probability measures:
38

Since |exp(ic)| = 1 for any c ∈ R, exp it> X is P-integrable for any t. Hence φX is
always well-defined, unlike the otherwise similar moment-generating function.

40
Definition 26. The characteristic transform is the mapping µ 7→ φµ from
probability measures µ on (Rn , B) to characteristic functions φµ : Rn → C,
defined by
Z
φµ (t) := exp (it> x) µ(dx) for each t ∈ Rn .39
Rn
This definition might make you wonder what the space of characteristic
functions is. It is by no means the case that every function Rn → C is the
characteristic transform of some probability measure on (Rn , B)! It is possible
to state ‘primitive’ necessary and sufficient conditions for a function Rn → C
to be a characteristic function. Bochner’s theorem (e.g. Rao (1973, p. 141))
gives one set of necessary and sufficient conditions. Tractable sufficient (but
not necessary) conditions are given by Pólya’s theorem, which I’ll only state
for the univariate case.
Theorem 9 (Pólya’s theorem). Suppose ϕ : R → C is R-valued, even,40
continuous, convex on R++ , and satisfies ϕ(0) = 1 and limt→∞ ϕ(t) = 0.
Then ϕ = φµ for some probability measure µ on (R, B) that is absolutely
continuous w.r.t. Lebesgue measure and symmetric about 0.
It turns out that the space of characteristic functions is a dual of the
space of probability measures in the following sense. First, the characteristic
mapping is a bijection: φµ = φν (pointwise) iff µ = ν (setwise). Second,
the characteristic transform has a closed-form inverse, and there are many
convenient equivalences between properties of probability measures and
properties of characteristic functions. Third, the characteristic mapping is
continuous in a certain sense.
Let’s state two important bits of that formally. Proofs can be found in
e.g. Rosenthal (2006, ch. 10).
Theorem 10 (Fourier uniqueness theorem). Let µ and ν be probability
measures on (Rn , B). Then φµ = φν (pointwise) iff µ = ν (setwise).
Theorem 11 (Lévy’s continuity theorem). Let {µn } and µ be probability
measures on (Rn , B). Then µn ⇒ µ iff φµn −
→ φµ pointwise.41
39
The characteristic transform is also sometimes known as (a version of) the Fourier
transform. But what exactly is meant by ‘Fourier transform’ varies hugely between authors
and fields, so I won’t use this term at all.
40
A function f is even iff f (−x) = f (x) for every x in its domain.
41
It’s called the continuity theorem because when the space of probability measures
is endowed with the topology of weak convergence (the weak? topology) and the space
of characteristic functions is endowed with the topology of pointwise convergence (the
product topology), the theorem says precisely that the characteristic transform and its
inverse are continuous mappings.

41
These properties mean that any results we prove about characteristic
functions, including convergence results, translate directly into results about
probability measures (and vice versa). When we face a difficult question
about probability measures on (Rn , B), we will often translate it into a
question about characteristic functions, easily find the answer, then translate
the answer back into probability-measure space.
The leading example of this strategy is the proof of the Lindeberg–
Lévy central limit theorem (p. 63). But we’ll also use it to prove part of
the continuous mapping theorem on p. 45, and to establish an interesting
property of the Cauchy distribution in an example on p. 42.
I mentioned equivalences between properties of measures and of their
characteristic transforms. There are many, and they are easy to look up, but
here are a few important ones.
Proposition 10. Let X and Y be random variables.
(1) φX (0) = 1.
(2) |φX (t)| = 1 for every t ∈ Rn .
(3) φaX+b (t) = exp(itb)φX (at) for any a ∈ R and b, t ∈ R.
(4) φX+Y = φX φY if X and Y are independent. (The converse is not true!)
(j)
(5) If E X j exists and is finite then φX (0) exists. (There’s a partial

(j)
converse.) Whenever they exist and are finite, φX (0) = ij E X j .


(6) φX is uniformly continuous.


(7) (Riemann–Lebesgue lemma) If LX has a density w.r.t. Lebesgue meas-
ure, then φX (t) −
→ 0 as |t| −
→ ∞.
Finally, an illustration.
Example 9 (the Cauchy law is stable). A Cauchy-distributed random
variable X is one whose density w.r.t. Lebesgue measure is
dLX 1 1
=√ .
dλ π 1 + x2
A patient reader can verify that the corresponding characteristic function is
φX (t) = exp(−|t|). We know that the Cauchy distribution has no moments,
so it shouldn’t surprise us that φX is not differentiable at 0.42
42
Above, we stated the result that when a moment exists and is finite, the corresponding
derivative exists. We did not state the partial converse. So this does not constitute a proof
that the Cauchy distribution has no moments!

42
Now consider a sequence {Xn } of independent Cauchy-distributed random
variables, and write Sn := n−1 ni=1 Xi . Then
P

n
!!
−1
X
φSn (t) = E exp itn Xi
i=1
n
!
 
−1
Y
=E exp itn Xi
i=1
n   
E exp itn−1 Xi
Y
=
i=1
  n
= E exp itn−1 X1
= φX1 (t/n)n
= exp (−|t|/n)n
= exp (−|t|)
= φX1 (t) .

So the average of n Cauchy-distributed random variables is itself Cauchy-


distributed! n-fold convolution of the Cauchy distribution is itself Cauchy! A
distribution with the property that aSn for some a has the same distribution
as X1 is called a (Lévy or α) stable distribution. Another stable law is the
normal distribution (you already knew that—think about it). The theory of
stable laws is a very interesting branch of probability theory, I think.43
This example also serves as a prelude to our study of laws of large numbers,
which give conditions under which Sn converges (a.s. or in probability) to a
constant. It should be obvious that Sn converges weakly to a Cauchy law; we
don’t even need Lévy’s continuity theorem to prove this. Convergence to a
point fails to happen here because the Cauchy distribution has ‘heavy tails’,
i.e. lots of probability mass in the tails. (The formal definition of ‘heavy tail’
43
Think of a sequence of distributions of Sn as a path in the space of probability
distributions. This path is governed by a law of motion. A stable distribution is a steady
state of this law of motion: once you’re there, you don’t leave. Some of these steady states
may be attractors in some region: if you start in this region, the sequence converges weakly
to the stable law. One theorem in the theory of stable laws is (loosely) that only stable
laws can be attractors.
Moreover, there are generalisations of the central limit theorems. CLTs give (large)
regions in which the normal law is an attractor; ‘generalised central limit theorems’ give
large regions in which the CLT fails (due to infinite variance), but in which there is another
attractor. By the previous result, this attractor must be a stable distribution, but it will
not be normal. This material can be found in e.g. Gnedenko and Kolmogorov (1954, ch. 7)
and Durrett (2010, sec. 3.7).

43
is usually that the variance is infinite.) As we will see when we prove LLNs,
moment restrictions are required in order to avoid this sort of problem. (At
the very least, we’ll require the first moment to exist.)

3.8 The continuous mapping theorem


One characterisation of continuity in metric spaces is that a continuous
mapping is one that ‘preserves convergence’: f is continuous at x0 iff f (xn ) −

f (x0 ) for any sequence {xn } s.t. xn −→ x0 . The Mann–Wald continuous
mapping theorem (CMT) is the analog for random variables of the ‘only if’
part of this characterisation: it says that a.s. convergence, convergence in
probability and convergence in distribution are all preserved under almost-
everywhere continuous transformations.

Theorem 12 (Mann–Wald CMT). Let (S, ρ) and (S 0 , ρ0 ) be metric spaces,


let {Xn } and X be random elements of (S, ρ), and let g : S → S 0 be
measurable and continuous LX -a.e.44 Then
a.s. a.s.
(1) Xn −−→ X implies g(Xn ) −−→ g(X).
p p
(2) Xn −→ X implies g(Xn ) −→ g(X).
d d
(3) Xn −→ X implies g(Xn ) −→ g(X).

Remark 3. We did not mention convergence in mean square because it


turns out not to be preserved under arbitrary a.e.-continuous mappings!
Something much stronger is needed, e.g. g linear.

Proof of (1). We know that there are measurable Ω0 , Ω00 ⊆ Ω such that
→ X(ω) for all ω ∈ Ω0 , g is continuous at all X(ω) s.t. ω ∈ Ω00 , and
Xn (ω) −
P(Ω ) = P(Ω00 ) = 1. Firstly, Ω0 ∩ Ω00 is measurable with P (Ω0 ∩ Ω00 ) = 1
0

since
 c   c   c 
P Ω0 ∩ Ω00 = 1 − P Ω0 ∪ Ω00 Ω0 Ω00
 c
≥1−P −P = 1.

→ g(X(ω)) at all ω ∈ Ω0 ∩ Ω00 .


Secondly, g(Xn (ω)) − 

We won’t bother proving (2) in full generality, though it is not hard.


p
Instead, we will content ourselves with the case in which Xn −→ α for a
constant α. In this case, continuity LX -a.e. of g reduces to continuity of g at
α.
44
Recall that LX is the law of X. So the final requirement is that the underlying
probability space (Ω, A, P) satisfies P ({ω ∈ Ω : g continuous at X(ω)}) = 1.

44
Proof of (2) for constant X. By continuity of g at α, for each ε > 0, there
is a δ > 0 such that ρ(Xn , α) < δ implies ρ0 (g(Xn ), g(α)) < ε. So

1 ≥ P ρ0 (g(Xn ), g(α)) < ε ≥ P(ρ(Xn , α) < δ).




p
Since Xn −→ α, the right-hand side converges to 1 regardless of δ. It follows
p
that P (ρ0 (g(Xn ), g(α)) < ε) −
→ 1 for each ε > 0, i.e. g(Xn ) −→ g(α). 

For (3), the cleanest general proof that I’ve seen uses Skorokhod’s theorem,
then follows the argument for (1). This would take us too far afield, so let’s
restrict attention to the case in which {Xn } and X are random `-vectors and
g : R` → Rm , so that we can use characteristic functions.

Proof of (3) for S = R` and S 0 = Rm . Fix t ∈ R` ; we wish to show that


φg(Xn ) (t) −
→ φg(X) (t). We have
Z
φg(Xn ) (t) = exp (it> g(y)) LXn (dy)
`
ZR Z
>
= cos (t g(y)) LXn (dy) + i sin (t> g(y)) LXn (dy).
` R`
ZR Z

→ cos (t> g(y)) LX (dy) + i sin (t> g(y)) LX (dy)
ZR` R`

= exp (it> g(y)) LX (dy)


R`
= φg(X) (t)

where convergence follows by the Portmanteau lemma (p. 33),45 since y 7→


cos (t> g(y)) and y 7→ sin (t> g(y)) are bounded and continuous mappings.
d
Hence g(Xn ) −→ g(X) by Lévy’s continuity theorem (p. 41). 

The following result is an oft-used corollary to the continuous mapping


theorem. It states that the elementary algebraic operations of addition,
multiplication and division are preserved under weak convergence. (It’s a
corollary because these operations are continuous.)

Corollary 3 (Slutsky’s theorem). Let {Xn } and X be m × k random


matrices, let {Yn } be k × k random matrices, and let A be a k × k (constant)
d p
matrix. Suppose that Xn −→ X and Yn −→ A. Then
d
(1) Xn + Yn −→ X + A.
45
Joel actually appealed to the Helly–Bray theorem, which is a special case of the
Portmanteau lemma.

45
d
(2) Xn Yn −→ XA.
d
(3) Xn Yn−1 −→ XA−1 provided A is invertible.46
Proof. (X, A) and each (Xn , Yn ) are random elements of the metric space
p d
Rm×k × Rk×k . Yn −→ A implies Yn −→ A by Proposition 9 (p. 38). The
mappings (x, y) 7→ x + y and (x, y) 7→ xy are continuous, and (x, y) 7→ xy −1
is continuous whenever y is invertible. The result then follows from part (3)
of the continuous mapping theorem. 

Remark 4. Since the proof of Slutsky’s theorem is via the Mann–Wald CMT,
d p a.s.
the result obviously still holds if we replace −→ with −→ or −−→. But be careful
here: it’s important that Y converges to a constant rather than to a random
matrix. When Xn and Yn both converge to random elements X and Y , it
d d d
need not be that Xn + Yn −→ X + Y , Xn Yn −→ XY or Xn Yn−1 −→ XY −1 .
The following example illustrates.
Example 10 (weak convergence of marginals vs. joint). Let {Xn }, {Yn }, X
and Y be random variables distributed
! ! !! ! ! !!
Xn iid 0 1 ρ X 0 1 r
∼N , and ∼N , .
Yn 0 ρ 1 Y 0 r 1

The marginal distributions of Xn , Yn , X and Y are all N (0, 1). Hence


d d
(trivially) we have Xn −→ X and Yn −→ Y . But

Xn + Yn ∼ N (0, 2(1 + ρ)) and X + Y ∼ N (0, 2(1 + r)),


d
so it is generally not the case that Xn + Yn −→ X + Y ! The reason is clear:
although the marginal distributions converge weakly, the joint distribution
does not, as evidenced by the fact that ρ may differ from r.
In the example, it’s clear that if Xn , Yn are independent (ρ = 0) and X, Y
d
are also independent (r = 0) then we do in fact have Xn + Yn −→ X + Y .
This is true in general, since then

φXn +Yn (t) = φXn (t)φYn (t) −


→ φX (t)φY (t) = φX+Y (t) for arbitrary t ∈ R,
d
whence Xn + Yn −→ X + Y follows by Lévy’s continuity theorem.
46
If some {Yn } in (3) are singular, we can replace Yn−1 with a Moore–Penrose pseudo-
inverse and still obtain convergence. The Moore–Penrose pseudo-inverse is continuous at
invertibility points, so the continuous mapping theorem applies. (But note that unlike the
ordinary matrix inverse, the Moore–Penrose pseudo-inverse is not continuous at all points.)

46
3.9 Stochastic order notation
When we use approximations, we have to control the approximation error.
Usually, we want the error to vanish as the sample size grows large. The
notation introduced here offers a compact way of keeping track of approx-
imation error. This section will treat sequences in R, on the understanding
that the extension to Rn is straightforward.
Let’s start out with order notation from analysis.
Definition 27. Let {xn } and {an } be sequences in R.
(1) xn = O(an ) iff ∃M0 > 0 s.t. |Xn /an | ≤ M0 for n sufficiently large.

(2) xn = o(an ) iff xn /an −


→ 0.
Intuitively, xn = O(an ) means that {xn } increases no faster than {an },
while xn = o(an ) means that {xn } increases at a slower rate than {an }.
Unsurprisingly, these concepts are not well-suited for use with random
variables. We therefore use analogous ‘in probability’ definitions.
Definition 28. Let {Xn } be a sequence of random variables and {an } be a
sequence in R.
(1) Xn = Op (an ) iff ∀ε > 0, ∃Mε > 0 s.t. P (|Xn /an | ≤ Mε ) ≥ 1 − ε for n
sufficiently large.
p
(2) Xn = op (an ) iff Xn /an −→ 0.
The parallel with O and o is clear. We’re weakening them in the ‘in
probability’ way, as opposed to in the ‘almost sure’ way because the latter
would be too strong (but easier, really).
To compare Op and op , use the definition of convergence in probability
to see that Xn = op (an ) iff ∀ε > 0, ∀M0 > 0, P (|Xn /an | ≤ M0 ) ≥ 1 − ε for
n sufficiently large. The latter contains ‘for all M0 ’ rather than ‘there exists
an Mε ’. This should make it clear that Xn = op (an ) implies Xn = Op (an ).
An unfortunate feature of order notation is that it breaks the symmetry
of the equality symbol. Concisely put, xn = O(an ) says that xn is of order
an ; it does not say that the object O(an ) is equal to xn . So xn = O(an ) must
be read left-to-right, not right-to-left. This is the convention, and I will be
using it. Be forewarned!
Notice that anything that that is bounded (vanishing) is also bounded
(vanishing) in probability:

O(an ) = Op (an ) and o(an ) = op (an ).

47
Of course, the converse is not true, i.e. Op (an ) = O(an ) and op (an ) = o(an )
are false in general. (A sequence may be bounded/vanishing in probability
without being bounded/vanishing for sure.)
We will do a lot of algebra involving Op and op once we start studying
estimators, so here’s a collection of facts about how Op and op can be
manipulated. Except for the last one, they are all easily proved from the
definitions.

Proposition 11. Some facts about Op and op :

(1) op (an ) = an op (1) and Op (an ) = an Op (1).

(2) op (Op (1)) = op (1).

(3) op (1) + Op (1) = Op (1).

(4) op (1)Op (1) = op (1).

(5) (1 + op (1))−1 = Op (1).

(6) If R(0) = 0, R(h) = o (khkp ) as h ↓ 0, and an = op (1), then R(an ) =


op (kan kp ).

3.10 The delta method


Suppose you know that a random vector Xn is approximately distributed as
d
a−1
n W +b for large n, where W is a random vector (formally an (Xn −b) − → W ),
but that you’re actually interested in approximating the distribution of some
function g(Xn ) of this random vector (e.g. a test statistic). The delta method
provides a way of doing this whenever g is smooth near b. Formally, it is
based on a Taylor expansion.

Theorem 13 (Taylor’s theorem). Let g : R → R be ` times differentiable


in an open neighbourhood of b.47 Then
`
g (j) (b)  
(x − b)j + o |x − b|` ,
X
g(x) − g(b) =
j=1
j!

where g (j) denotes the jth derivative.


47
Some authors state Taylor’s theorem requiring only differentiability at b, but the proof
seems to require differentiability in a neighbourhood.

48
The theorem extends immediately to any ` times differentiable function
g : Rk → Rm , but the notation becomes ugly fast (tensor products). For
g : Rk → R, we can go to second order without notational trouble:
 
g(x) − g(b) = ∇g(b)> (x − b) + 12 (x − b)> ∇2 g(b)(x − b) + o (kx − bk2 )2 .

For g : Rn → Rm , only a first-order expansion is easy to write down:


g(x) − g(b) = Dg(b)(x − b) + o (kx − bk2 ) .
Proposition 12 (delta method). Let {Xn } be a sequence of random k-
d
vectors such that an (Xn − b) −→ W for some constants {an } and b, and
let g : Rk → Rm be differentiable in an open neighbourhood of b, with
derivative Dg(b) at b. Then
d
an (g(Xn ) − g(b)) −→ Dg(b)W.
Remark 5. Notice that we did not require Dg(b) to be nonsingular (or even
nonzero), nor did we require Dg to be continuous at b. Although W will be
normally distributed in the vast majority of applications (by a central limit
theorem; see section 5), that is not required, either.
Proof. By Taylor’s theorem,
g(Xn ) − g(b) = Dg(b)(Xn − b) + op (kXn − bk2 ) ,
so
an (g(Xn ) − g(b)) = Dg(b)an (Xn − b) + op (kan (Xn − b)k2 ) .
d
Since an (Xn − b) −→ W ,
op (kan (Xn − b)k2 ) = op (Op (1)) = op (1),
d
and Dg(b)an (Xn − b) −→ Dg(b)W by Slutsky’s theorem (p. 45). Hence
d
an (g(Xn ) − g(b)) = Dg(b)an (Xn − b) + op (1) −→ Dg(b)W. 
Remark 6. Suppose instead that we have a sequence {bn } of k-vectors such
d
that an (Xn − bn ) −→ W , and that bn −
→ b. Add the assumption that Dg is
continuous at b. Then Dg(bn ) = Dg(b) + o(1), so the proof above still goes
through, giving us
d
an (g(Xn ) − g(bn )) −→ Dg(b)W.
(The same extension is available for the second- and `th-order delta methods
below.)

49
Although the first-order delta method above is valid when Dg(b) = 0,
it isn’t very helpful in that case. Unless g is a constant function, g(Xn ) is
still going to be random, so we’d like our approximating distribution to be
nondegenerate. The obvious remedy is to use a second-order Taylor expansion.
As noted above, this would require heavy notation for the case g : Rk → Rm ,
so we’ll just state it for the case g : Rk → R.
Proposition 13 (second-order delta method). Let {Xn } be a sequence of
d
random k-vectors such that an (Xn − b) −→ W for some constants {an } and
b, and let g : Rk → R be twice differentiable in an open neighbourhood of b,
with derivatives ∇g(b) = 0 and ∇2 g(b) at b. Then
d
a2n (g(Xn ) − g(b)) −→ 12 W > ∇2 g(b)W.

Proof. By Taylor’s theorem and ∇g(b) = 0,


 
g(Xn ) − g(b) = 12 (Xn − b)> ∇2 g(b)(Xn − b) + op (kXn − bk2 )2 ,

so

a2n (g(Xn ) − g(b))


 
= 21 [an (Xn − b)]> ∇2 g(b)[an (Xn − b)] + op (kan (Xn − b)k2 )2 .

d
Since an (Xn − b) −→ W ,
   
op (kan (Xn − b)k2 )2 = op Op (1)2 = op (Op (1)) = op (1),

and
d
1
2 [an (Xn − b)]> ∇2 g(b)[an (Xn − b)] −→ 21 W > ∇2 g(b)W
by Slutsky’s theorem (p. 45). Hence

a2n (g(Xn ) − g(b)) = 21 [an (Xn − b)]> ∇2 g(b)[an (Xn − b)] + op (1)
d
−→ 21 W > ∇2 g(b)W. 

Remark 7. Combining the first- and second-order delta methods, we get


p d
an (g(Xn ) − g(b)) −→ 0 and a2n (g(Xn ) − g(b)) −→ 12 W > ∇2 g(b)W.
p d
(I can write −→ rather than −→ by Proposition 9 (p. 38).) There is no
contradiction between the two: we get different behaviour because we’re
using different scaling factors ({an } vs. {a2n }).

50
Remark 8. Even if ∇g(b) 6= 0, we could use a second-order Taylor expansion
to approximate the distribution of g(Xn ). But this makes the approximation
so complicated that it’s rarely worthwhile.

Of course, there’s nothing special about the second order: if the first
` − 1 derivatives are zero, we can use the `th derivative to approximate the
distribution of g(Xn ). To duck notational difficulties, I’ll only state this for
the case g : R → R.

Proposition 14 (`th-order delta method). Let {Xn } be a sequence of


d
random variables such that an (Xn − b) −→ W for some constants {an } and
b, and let g : R → R be ` times differentiable in an open neighbourhood of
b, with derivatives g 0 (b) = · · · = g (`−1) (b) = 0 and g (`) (b) at b. Then

d g (`) (b) `
a`n (g(Xn ) − g(b)) −→ W .
`!
Proof. By Taylor’s theorem,

g (`) (b)  
g(Xn ) − g(b) = (Xn − b)` + op |Xn − b|` ,
`!
so

g (`) (b)  
a`n (g(Xn ) − g(b)) = [an (Xn − b)]` + op |an (Xn − b)|` .
`!
d
Since an (Xn − b) −→ W ,
   
op |an (Xn − b)|` = op Op (1)` = op (Op (1)) = op (1),

and [an (Xn − b)]` −→ W ` by the continuous mapping theorem (p. 44). So by
d

Slutsky’s theorem (p. 45),

g (`) (b) g (`) (b) `


[an (Xn − b)]` + op (1) −→
d
a`n (g(Xn ) − g(b)) = W . 
`! `!
Before we move on, here’s an illustration of how the delta method can
be used in econometrics. The example makes use of a law of large numbers
and a central limit theorem which will not be covered until sections 4 and 5.

Example 11 (exp(α) ML estimator). The exponential distribution with


parameter α > 0 (denoted exp(α)) is any distribution on (R, B) whose density

51
w.r.t. Lebesgue measure is f (x) = α exp(−αx). The mean and variance of
this distribution are α−1 and α−2 .
Suppose we have n iid random variables {Xi }ni=1 drawn from the exp(α)
distribution, and wish to estimate α. The obvious analogy estimator, which
turns out to also be the maximum likelihood estimator, is
n
!−1
n−1
X
b n :=
α Xi .
i=1

By Kolmogorov’s second SLLN (p. 56),


n
n−1 Xi −−→ E(Xi ) = α−1 ,
a.s.
X

i=1

a.s.
so by the continuous mapping theorem α bn −−→ α, i.e. the estimator is strongly
consistent. So αb n will be ‘close’ to α in a large sample.
But how close? To answer this question, we need to approximate the
distribution of α
b n in a large sample. The Lindeberg–Lévy CLT (p. 63) gives
us n
Xi − α−1 d
n−1/2
X
√ −→ N (0, 1),
i=1 α−2
which we can rewrite as
−1
− α−1 ) −→ N (0, 1).
d
n1/2 α(α
bn

Now use the delta method with g(x) = 1/x (so g 0 (x) = −1/x2 ), an = n1/2 α
and b = α−1 to obtain
 
b n − α −→ −1/α−2 N (0, 1),
d
n1/2 α α


or equivalently  
d
n1/2 α
b n − α −→ N 0, α2 .


b n is well-approximated by N α, n−1 α2 .

So for n large, the distribution of α

52
4 Laws of large numbers
Official reading: Amemiya (1985, ch. 3), Rao (1973, ch. 2) and White (2001,
ch. 3).
A law of large numbers (LLN) gives conditions under which the average of
n random variables converges as n − → ∞. They are called weak laws (WLLNs)
if convergence is in probability, and strong laws (SLLNs) if convergence is
almost sure.48
There are a lot of different laws of large numbers. The common theme is
that the volatility of the average must be controlled by combining two kinds
of restriction. On the one hand, we restrict the individual variances to keep
them from getting too large. On the other hand, we restrict the dependence
between the random variables, so that one extreme realisation doesn’t make
further extreme realisations likely. Each LLN imposes some mix of the two,
and often we can weaken the one at the expense of strengthening the other.

4.1 Uncorrelated/independent random variables


We begin with an easy-to-prove weak law.
Theorem 14 (Chebychev’s WLLN). Let {Xn } be a sequence of uncorrelated
random variables with
n
−2
X
lim n Var(Xi ) = 0.
n→∞
i=1
Pn p
Then n−1 i=1 (Xi − E(Xi )) −→ 0 as n −
→ ∞.
Remark 9. Three separate remarks, really.
(1) Neither {n−1 ni=1 Xi } nor {n−1 ni=1 E(Xi )} need converge to any-
P P

thing; they could be ‘exploding together’, for example.

(2) The restriction on the variances implies that each variance is finite,
hence that each mean exists and is finite.

(3) The variance condition can be weakened.


Proof. Write n
Sn := n−1
X
(Xi − E(Xi ));
i=1
48
As indicated, we will state our results for random variables. They can of course be
applied element-wise to random vectors.

53
p
we want to show that Sn −→ 0.
n n
!
−2
= n−2
X X
Var(Sn ) = n Var (Xi − E(Xi )) Var (Xi )
i=1 i=1

by uncorrelatedness. Hence Var(Sn ) −


→ 0 by the variance condition. By
nonnegativity and Chebychev’s inequality, we have for any ε > 0 that

0 ≤ P (|Sn | > ε) ≤ Var(Sn )/ε2 .

Since the RHS converges to 0, it follows that P (|Sn | > ε) −


→ 0 for every
p
ε > 0, i.e. Sn −→ 0. 

Now for an easy strong law. It isn’t actually used very often, but it plays
an important role in the proof of Kolmogorov’s second SLLN.
Theorem 15 (Kolmogorov’s first SLLN). Let {Xn } be a sequence of inde-
pendent random variables with

X Var(Xi )
< ∞.
i=1
i2
Pn a.s.
Then n−1 i=1 (Xi − E(Xi )) −−→ 0 as n −
→ ∞.
Remark 10. The Kolmogorov variance condition

X
Var(Xi )/i2 < ∞
i=1

implies the Chebychev variance condition


n
lim n−2
X
Var(Xi ) = 0
n→∞
i=1

by Kronecker’s lemma (below). So the Kolmogorov SLLN strengthens both the


variance restriction and the dependence restriction (from uncorrelatedness to
independence). Our reward is a stronger result, viz. almost sure convergence.
Our proof will make use of two horrendous inequalities: the Hájek–Rényi
inequality (p. 24), and Kronecker’s lemma. The latter is
Lemma 3 (Kronecker’s lemma). Let {xn } be a sequence in R such that
P∞
n=1 xi exists and is finite. Then for any weakly increasing sequence {cn }
−1 Pn
in R++ such that cn − → ∞, limn→∞ cn i=1 ci xi = 0.

54
Proof of Kolmogorov’s first SLLN. Write
n
Sn := n−1
X
(Xi − E(Xi ));
i=1

a.s.
we want to show that Sn −−→ 0. Fix ε > 0. Using the Hájek–Rényi inequality
with ci = i−1 ,
k
! !
−1
X
P max |Sk | ≥ ε =P max k (Xi − E(Xi )) ≥ ε
k∈[m,n] k∈[m,n]
i=1
 
m n
1
≤ 2 m−2 i−2 Var(Xi ) .
X X
Var(Xi ) +
ε i=1 i=m+1

P∞ 2
Taking n −
→ ∞ and using the fact that i=1 Var(Xi )/i converges,
 
m ∞
1
 
P max|Sk | ≥ ε ≤ 2 m−2 i−2 Var(Xi ) .
X X
Var(Xi ) +
k≥m ε i=1 i=m+1

Now taking m −
→ ∞,
!
lim P sup |Sk | ≥ ε
m→∞ k≥m
 
m ∞
1
≤ 2  lim m−2 i−2 Var(Xi ) .
X X
Var(Xi ) + lim
ε m→∞
i=1
m→∞
i=m+1

P∞ 2
Since i=1 Var(Xi )/i exists and is finite, Kronecker’s lemma with ci = i2
yields
m m
Var(Xi )
lim m−2 Var(Xi ) = lim m−2
X X
i2 = 0,
m→∞
i=1
m→∞
i=1
i2
i.e. the first term is zero. For the second term,

∞ ∞ m
!
X Var(Xi ) X Var(Xi ) X Var(Xi )
lim = lim −
m→∞
i=m+1
i2 m→∞
i=1
i2 i=1
i2
∞ ∞
X Var(Xi ) X Var(Xi )
= − = 0.
i=1
i2 i=1
i2

55
 
Hence limm→∞ P supk≥m |Sk | ≥ ε ≤ 0. Since probabilities are nonnegative,
!
lim P sup |Sk | ≥ ε = 0.
m→∞ k≥m
a.s.
Since ε > 0 was arbitrary, we’ve shown that Sn −−→ 0. 
There are several refinements of Chebychev’s WLLN and Kolmogorov’s
SLLN. One of these is Markov’s SLLN.

4.2 iid random variables


Kolmogorov’s second SLLN features a different mix of restrictions on variances
and dependence. Relative to Kolmogorov’s first SLLN, we drop the variance
restriction. But to make up for this, we impose identical distributions. The
combination of independence and identical distribution is usually shortened
to ‘iid’; it is very common in (micro)econometrics.
Theorem 16 (Kolmogorov’s second SLLN). Let {Xn } be a sequence of iid
a.s.
random variables. Then n−1 ni=1 Xi −−→ µ if and only if E(X1 ) exists, is
P

finite and equals µ.


Observe that this LLN gives conditions that are necessary as well as
sufficient for a.s. convergence! We won’t provide a proof; you can find one in
Rao (1973, pp. 115–6). An obvious corollary is
Corollary 4 (Khinchine’s WLLN). Let {Xn } be a sequence of iid random
p
variables such that E(X1 ) exist and is finite. Then n−1 ni=1 Xi −→ E(X1 ).
P

4.3 Dependent random variables


Finally, we’ll present a substantial refinement of Kolmogorov’s first SLLN
(p. 54) which weakens the variance condition and requires bounded auto-
correlation instead of independence. This theorem is useful for time-series
econometrics.
Theorem 17 (Serfling (1970) SLLN). Let {Xn } be a sequence of random
variables with finite variance. Assume that there exist constants {ρj }j∈N
in [0, 1] such that ∞ j=1 ρj < ∞ and Corr(Xn , Xm ) ≤ ρn−m for all n ≥ m.
P

Further assume that


∞ 
ln(i) 2
X 
Var(Xi ) < ∞.
i=1
i
Pn a.s.
Then n−1 i=1 (Xi − E(Xi )) −−→ 0 as n −
→ ∞.

56
The variance restriction is very similar to (but weaker than) the one
in Kolmogorov’s first SLLN. The main novelty comes from the fact that
we’ve replaced independence with bounded autocorrelation. Notice that we
only need to rule out large and persistent positive autocorrelation. Negative
autocorrelation is actually helpful: it speeds up ‘mixing’, leading to faster
convergence!
Again we won’t give a proof; you can find one in Serfling (1970, Corollary
2.2.1). But we will give an example to verify that there are interesting
sequences of random variables that satisfy the hypotheses of the theorem.

Example 12 (AR(1) model). Let X0 = 0 and Xn = rXn−1 + εn for n ∈ N,


where |r| < 1 and {εn } is a white noise process (iid with zero mean and finite
variance). The sequence {Xn } is called an AR(1) process.
Iterating backward and using X0 = 0, Xn = n−1 j
P
j=0 r εn−j . It follows that
2 2

E(Xn ) = 0. Moreover, writing σε := E εn , we have
n−1 n−1
X X 1 − r2n 2
Var(Xn ) = r2j Var (εn−j ) = σε2 r2j = σ
j=0 j=0
1 − r2 ε

where we used |r2 | < 1, which follows from |r| < 1. Notice that Var(Xn ) ≥
Var(Xm ) whenever n ≥ m, and that Var(Xn ) < σε2 1 − r2 for every n.
For n ≥ m,
 
n−1
X m−1
X
Cov(Xn , Xm ) = Cov  rj εn−j , rj εm−j 
j=0 j=0
 
n
X m
X
= Cov  rn−j εj , rm−j εj 
j=1 j=1
 
m
X m
X
= Cov  rn−j εj , rm−j εj 
j=1 j=1
 
m
X
= rn−m Var  rm−j εj 
j=1
 
m−1
X
= rn−m Var  rj εm−j 
j=0
n−m
=r Var (Xm ) .

57
Since Var(Xn ) ≥ Var(Xm ), it follows that

Cov(Xn , Xm )
Corr(Xn , Xm ) = p p
Var(Xm ) Var(Xn )
p
n−m pVar (Xm )
=r
Var(Xn )
n−m
≤r .

So we have constants {ρj } := {rj } in [0, 1] for which ∞j=1 ρj < ∞ and
P

Corr(Xn , Xm ) ≤ ρn−m for all n ≥ m, as required. 


As for the second condition, since Var(Xn ) < σε2 1 − r2 , we obtain


∞  ∞ 
ln(i) 2 σε2 X ln(i) 2
X  
Var(Xi ) ≤ ,
i=1
i 1 − r2 i=1 i

which can be shown to converge using the integral test. (Or Wolfram Alpha!)

4.4 Uniform laws of large numbers


In this section, we’re interested in laws of large numbers for random functions.
In particular, consider an iid sequence {gn } of random functions Θ → R,
and assume E(g1 (θ)) = 0 for each θ ∈ Θ.49 Kolmogorov’s second SLLN tells
a.s.
us that n−1 ni=1 gi −−→ 0 pointwise, i.e. for any ε > 0, there is {Nε,θ }θ∈Θ
P

such that for each θ ∈ Θ, n−1 ni=1 gi (θ) < ε whenever n ≥ Nε,θ .
P

If Θ is finite, the convergence is automatically uniform: for any θ ∈ Θ you


like, n−1 ni=1 gi (θ) < ε whenever n ≥ maxθ∈Θ Nε,θ , where the maximum
P

is attained since Θ is finite. But to obtain uniform a.s. convergence of {gn }


without assuming that Θ is finite, we need a new theorem. Such theorems
are called uniform laws of large numbers. A useful uniform SLLN for the iid
case is the following.

Theorem 18 (Jennrich (1969) uniform SLLN). Let {gn } be an iid sequence


of random functions Θ → R with E(g1 (θ)) = 0 for each θ ∈ Θ. Assume that
Θ ⊆ Rk is compact, that g1 is continuous a.s., and that E (supθ∈Θ |g1 (θ)|) <
∞. Then n
n−1
a.s.
X
gi −−→ 0 uniformly over Θ.
i=1
49
We really just need the mean to exist and be finite. Setting it to zero is a normalisation,
for if the mean is µ then we consider egn (·) := gn (·) − µ(·).

58
More explicitly, the conclusion of the theorem is that
n
−1
X
lim sup n gi (θ) = 0 a.s.
n→∞ θ∈Θ
i=1

The method of proof is called a chaining argument, which is used a lot in


empirical process theory.50 The idea is that by compactness, we can cover Θ
with a finite number of open balls
J(δ)
{Bδ (θj )}j=1

of radius δ > 0. For each ball, the centre ni=1 gi (θj ) obeys Kolmogorov’s
P

second SLLN, and for other points θ ∈ Bδ (θj ) it must be that ni=1 gi (θ) is
P
Pn
close to i=1 gi (θj ) by continuity a.s. We then take δ ↓ 0.

Proof. We want to show that for any ε > 0,


n
!
−1
X
lim P sup sup n gi (θ) > ε = 0.
N →∞ n≥N θ∈Θ i=1

Begin by fixing an arbitrary δ > 0. (Further down, we will choose a particular


value δ(ε) determined by ε). {Bδ (θ)}θ∈Θis obviously an open cover of Θ, so
by compactness it has a finite subcover Bδ (θ1 ), . . . , Bδ (θJ(δ) ) . Then
 δ J(δ) J(δ)
Θj j=1 := {cl Bδ (θj )}j=1

is a finite cover of Θ,51 and each Θδj is compact.


50
Empirical process theory is the asymptotic theory of certain functions of random data,
providing (vast) generalisations of classical asymptotic theory. (‘Stochastic process’ is
another name for a random function.) One topic is P uniform LLNs: uniform convergence
n
of partial-average functions such as Gn (·) = n−1 i=1 gi (·) to a nonstochastic limit.
Another topic is functional CLTs: weak convergence of scaled partial-average processes
Pbτ nc
such as Gn (τ ) = n−1/2 i=1 Xi to Brownian motion. Yet another topic is extensions of
the Glivenko–Cantelli theorem: uniform convergence of empirical measures (analogs of
empirical CDFs) to the population probability measure.
51
cl A denotes the closure of the set A.

59
Painfully but straightforwardly, compute

n
!
−1
X
P sup sup n gi (θ) > ε
n≥N θ∈Θ i=1
  
J(δ)   n
sup sup n−1
[ X
≤ P gi (θ) > ε 
j=1 n≥N θ∈Θj
 δ 
i=1
 
J(δ) n
P  sup sup n−1
X X
≤ gi (θ) > ε
j=1 n≥N θ∈Θδ i=1
j
 
J(δ) n n
!
n−1 gi (θj ) + n−1
X X X
≤ P  sup sup (gi (θ) − gi (θj )) > ε
j=1 n≥N θ∈Θδ i=1 i=1
j
 
J(δ) n n
P  sup n−1 gi (θj ) + sup sup n−1
X X X
≤ (gi (θ) − gi (θj )) > ε
j=1 n≥N i=1 n≥N θ∈Θδ i=1
j

J(δ) ( n
)
X
−1
X ε
≤ P sup n gi (θj ) >
j=1 n≥N i=1
3
n
( )!

sup sup n−1
X
∪ (gi (θ) − gi (θj )) >
n≥N θ∈Θδ i=1
3
j

J(δ) n
!
ε
P sup n−1
X X
≤ gi (θj ) >
j=1 n≥N i=1
3
 
J(δ) n

P  sup sup n−1
X X
+ (gi (θ) − gi (θj )) > 
j=1 n≥N θ∈Θδ i=1
3
j

J(δ) n
!
ε
P sup n−1
X X
≤ gi (θj ) >
j=1 n≥N i=1
3
   
J(δ) n

P  sup n−1
X X
+ sup |gi (θ) − gi (θj )| >  .
j=1 n≥N i=1 θ∈Θ
δ 3
j

We’ll establish separately that both terms on the RHS vanish as N − → ∞.


For the first term, recall that we assumed E (supθ∈Θ |g1 (θ)|) < ∞; hence a
fortiori E (|g1 (θj )|) < ∞. Since {gn (θj )} are iid with mean zero, Kolmogorov’s

60
second SLLN (p. 56) then implies
n
!
−1
X ε
lim P sup n gi (θj ) > = 0 for each j ∈ {1, . . . , J(δ)},
N →∞ n≥N i=1
3

whence it follows that the first term converges to zero:


J(δ) n
!
X
−1
X ε
lim P sup n gi (θj ) > = 0.
N →∞
j=1 n≥N i=1
3

Notice that this argument works for any fixed δ > 0.


For the second term, we will need to choose a sufficiently small value
of δ to ensure convergence. Observe that by decreasing δ, we can shrink
Θδj enough to ensure that any θ ∈ Θδj is arbitrarily close to θj . Since g1 is
continuous a.s., we can therefore choose δ > 0 small enough to ensure that
supθ∈Θδ |g1 (θ) − g1 (θj )| is arbitrarily small a.s., i.e.
j

sup |g1 (θ) − g1 (θj )| −


→0 a.s. as δ ↓ 0.
θ∈Θδj

Moreover,
sup |g1 (θ) − g1 (θj )| ≤ 2 sup|g1 (θ)|,
θ∈Θδj θ∈Θ

and the right-hand side has finite expectation by assumption. Hence by the
dominated convergence theorem (p. 40),
 

µδj := E  sup |g1 (θ) − g1 (θj )| −


→ 0 as δ ↓ 0.
θ∈Θδj

δ(ε)
So there exists a δ(ε) > 0 such that µj < ε/3.
Now,  

 

sup |gn (θ) − gn (θj )|
θ∈Θδ(ε)
 

j

δ(ε)
is a sequence of iid random variables with finite mean µj , so by Kolmog-
orov’s second SLLN (p. 56),
   
n
δ(ε)  ε
lim P  sup n−1
X
sup |gi (θ) − gi (θj )| − µj >  = 0.
 
N →∞ n≥N i=1 θ∈Θδ(ε)
3
j

61
δ(ε)
Since µj < ε/3, it follows that
   
n
2ε 
lim P  sup n−1
X
sup |gi (θ) − gi (θj )| >  = 0,
  
N →∞ n≥N i=1 θ∈Θδ(ε)
3
j

hence
   
J(δ(ε)) n
2ε 
P  sup n−1
X X
lim sup |gi (θ) − gi (θj )| >  = 0.
  
N →∞
j=1 n≥N i=1 θ∈Θδ(ε)
3
j

Putting this all together, we’ve shown that for any ε > 0, there is a
δ(ε) > 0 such that
n
!
−1
X
P sup sup n gi (θ) > ε
n≥N θ∈Θ i=1
J(δ(ε)) n
! !
X
−1
X ε
≤ P sup n |gi (θj )| >
j=1 n≥N i=1
3
   
J(δ(ε)) n
2ε 
P  sup n−1
X X
+ sup |gi (θ) − gi (θj )| >
  
3

j=1 n≥N i=1 θ∈Θδ(ε)
j


→0 as N −
→ ∞.
(The inequality holds for any δ > 0 you like, and the first term on the
RHS vanishes as N − → ∞ for any δ > 0 you like, but the second term only
vanishes when δ is chosen appropriately.) Since probabilities are nonnegative,
it follows that
n
!
−1
X
lim P sup sup n gi (θ) > ε = 0. 
N →∞ n≥N θ∈Θ i=1
Remark 11. The proof extends without much difficulty to the non-iid
case. The monstrous inequality holds regardless of how {gn } are distributed.
Convergence of the first term requires only that {gn } obeys a SLLN pointwise.
Convergence of the second term requires E (supθ∈Θ |gn (θ)|) < ∞ for each n.
As far as I can make out, these are the only tweaks that are required to
obtain a uniform SLLN for the non-iid case.
As you’d expect, there are many other uniform LLNs. For the iid case,
slightly different assumptions can be used to obtain a uniform SLLN, and
somewhat weaker assumptions suffice for a uniform WLLN. As I indicated,
uniform SLLNs for the non-iid case are also fairly straightforward.

62
5 Central limit theorems
Official reading: Amemiya (1985, ch. 3), Rao (1973, ch. 2), Billingsley (1995,
sec. 27) and White (2001, ch. 5).
In a finite sample, we’d like to have an idea of how far our (consistent)
estimator is likely to be from the truth. The exact distribution of an estimator
(across repeated samples) will depend on the unknown distribution of the data,
and will anyway be extremely complicated. So we’d like an approximation to
its distribution that uses only what the econometrician observes, and which
is a good approximation in the sense that it becomes arbitrarily accurate as
the sample size increases. This may sound like too much to hope for, but it
is in fact possible to do precisely this by using the magic of the central limit
theorems (CLTs).
A central limit theorem has the following form. Take any sequence {Xn }
of random variables that satisfy some conditions; then there are sequences of
constants {an } and {bn } such that
n
d
X
an (Xi − bi ) −→ N (0, 1),
i=1

a standard-normal-distributed random variable. The conditions on {Xn } are


analogous to the ones for LLNs: they restrict the variances and the degree of
dependence. Generally, the theorem will include a characterisation of a set
of sequences {an } and {bn } for which the result holds; in the simplest CLTs,
bi = E(Xi ) and an = n−1/2 .52

5.1 iid random variables


Theorem 19 (Lindeberg–Lévy CLT). Let {Xn } be a sequence of iid random
d
variables with E(X1 ) = 0 and Var(X1 ) = 1. Then n−1/2 ni=1 Xi −→ N (0, 1).
P

Remark 12. Setting E(X1 ) = 0 and Var(X1 ) = 1 is wlog, since for E(Xn ) =
µ and Var(Xn ) = σ 2 , we can apply the theorem to Yn := (Xn − µ)/σ. The
importance of these restrictions is that the mean exists and is finite and that
the variance is finite and nonzero.
52
There are many generalisations of CLTs that don’t quite fit this format. A very
important example for econometrics is functional central limit theorems, which give
Pbτ nc
conditions under which the random function Sn (τ ) := n−1/2 i=1 Xi converges weakly
to a Brownian motion. Another example is ‘generalised central limit theorems’, which give
conditions for convergence to a stable law (see footnote 43 on p. 43).

63
The proof will make use of characteristic functions. In particular, we’ll
−1/2 Pn X converges pointwise
show that the characteristic
  function of n i=1 i
1 2
to φN (0,1) (t) = exp − 2 t , then appeal to Lévy’s continuity theorem.

Proof. Write n
Zn := n−1/2
X
Xi .
i=1
Fix an arbitrary t ∈ R.
  
n
φZn (t) = E exp itn−1/2
X
Xj 
j=1
 
n  
exp itn−1/2 Xj 
Y
= E
j=1
n   
E exp itn−1/2 Xj
Y
=
j=1
  n
= E exp itn−1/2 X1
 n
= φX1 t n1/2 .
where we used independence in the third equality and identical distribution
in the fourth.
Since t n1/2 −

→ 0 for fixed t ∈ R, only the behaviour of φX in a shrinking
neighbourhood of 0 will matter. Formally, we use Taylor expansion around 0
to approximate φX1 t n1/2 as n grows large:
 
 
φX1 t n1/2 = φX1 (0) + φ0X1 (0)t n1/2 + 21 φ00X1 (0)t2 n + o (1/n) ,
 

where the derivatives exist since E(X1 ) and E X12 exist and are finite (see


part (5) of Proposition 10 (p. 42)). Again by Proposition 10 (p. 42), we have
φX1 (0) = 1, φ0X1 (0) = i−1 E(X1 ) = 0 and E X12 = i−2 Var(X1 ) = i−2 = −1,


we can write the Taylor expansion as


 
φX1 t n1/2 = 1 − 21 t2 n + o (1/n) .


So using the fact that limn→∞ (1 + x/n)n = exp(x), we get


 n  n
φZn (t) = φX1 t n1/2 = 1 − 21 t2 n + o (1/n)

 
→ exp − 12 t2 = φN (0,1) (t).

d
Hence Zn −→ N (0, 1) by Lévy’s continuity theorem (p. 41). 

64
Before moving on, let’s give an (unusual) example of how the Lindeberg–
Lévy CLT can be used.
Example 13. Let {Xn } be independently distributed Xn ∼ N (0, 1), and
define Sn := ni=1 Xi2 . In this case, we don’t really need to approximate the
P

distribution of Sn since we know that it is χ2 (n). But the Lindeberg–Lévy


CLT still applies, so let’s apply it to get an approximation to the distribution
of Sn . 2
Compute E Zn2 = 1 and Var Zn2 = E Zn4 − E Zn2 = 3 − 1 = 2.53
  

Hence by the Lindeberg–Lévy CLT,


n
Xi2 − 1
(2n)−1/2 (Sn − n) = n−1/2
d
X
√ −→ N (0, 1).
i=1 2
a
The large-n approximation (2n)−1/2 (Sn − n) ∼ N (0, 1) lets us approximate
a
the distribution of Sn for n large as Sn ∼ N (n, 2n).54
In most of our applications, we’ll actually have a random vector (not
variable) whose distribution we’d like to approximate using a multivariate
normal distribution. The univariate Lindeberg–Lévy theorem can be applied
to any given linear combination of the elements of our random vector, giving
weak convergence of a particular marginal distribution, but it isn’t obvious
that this is sufficient for convergence of the joint distribution. The Cramér–
Wold device tells that it is in fact sufficient.
Theorem 20 (Cramér–Wold device). Let {Xn } and X be random k-vectors.
d d
Then Xn −→ X iff λ> Xn −→ λ> X for every λ ∈ Rk .
d
Proof. Suppose Xn −→ X and fix λ ∈ Rk . x 7→ λ> x is a continuous mapping,
d
so by the continuous mapping theorem we get λ> Xn −→ λ> X.
d
Suppose λ> Xn −→ λ> X for every λ ∈ Rk . By Lévy’s continuity theorem,
this implies φλ> Xn (1) −
→ φλ> X (1), i.e.

E (exp (iλ> Xn )) −
→ E (exp (iλ> X)) for every λ ∈ Rk .

But the LHS equals φXn (λ), and the RHS equals φX (λ)! So we’ve shown
d
that φXn −→ φX pointwise, which implies Xn −→ X by Lévy’s continuity
theorem (p. 41). 

Using the Cramér–Wold device, we can easily extend the Lindeberg–Lévy


CLT to random vectors.
53
It is a fact that the fourth moment of the standard normal distribution is 3.
54 a
∼ reads ‘is approximately distributed as’.

65
Corollary 5 (vector Lindeberg–Lévy CLT). Let {Xn } be a sequence of iid
d
random k-vectors with E(X1 ) = 0 and Var(X1 ) = I. Then n−1/2 ni=1 Xi −→
P

N (0, I).

Proof. Write n
Zn := n−1/2
X
Xi ,
i=1

and let Z denote some (any) k-vector distributed N (0, I). By a property of
the normal distribution, λ> Z is then distributed univariate N (0, λ> λ) for
any λ ∈ Rk .
By the univariate Lindeberg–Lévy CLT,
d d
λ> Zn −→ N (0, λ> λ) = λ> Z for any λ ∈ Rk .55
d d
Hence Zn −→ Z = N (0, I) by the Cramér–Wold device. 

Remark 13. Again, the importance of the restrictions E(X1 ) = 0 and


Var(X1 ) = 1 is that the mean and covariance matrix exist, that E(X1 ) is
finite, and that Var(X1 ) is nonsingular and finite. Nonsingularity of Var(X1 )
is the multivariate analog of our previous requirement that Var(X1 ) > 0. It
fails iff there is a linear combination of X1 whose variance is zero (i.e. the
distribution is degenerate).
To use the theorem for random vectors {Xn } with E(X1 ) = µ and
Var(X1 ) = V nonsingular, just apply it to Yn := V −1/2 (Xn − µ), where
V −1/2 is the unique Choleski factor of V −1 .56

The rest of the central limit theorems in this section will be stated for
random variables only. But all of them can easily be extended to random
vectors using Cramér–Wold device in the manner just demonstrated.

5.2 Independent random variables


The CLTs in this section retain the independence assumption of the Lindeberg–
Lévy theorem, but drop the requirement of identical distribution in favour
55 d d
= denotes equality in distribution, i.e. X = Y iff LX = LY .
56
A Choleski decomposition of a matrix A is A = A1/2 (A1/2 )> where A1/2 is lower-
triangular with positive diagonal entries. To show that it exists and is unique for V −1 , reason
as follows. V is real, symmetric and positive semidefinite (p.s.d.) since it’s a covariance
matrix. A real, symmetric and p.s.d. matrix is nonsingular iff it is positive definite (p.d.),
so V is p.d. The inverse of a p.d. matrix is p.d., so V −1 is p.d. A matrix has a unique
Choleski decomposition iff it is p.d., so V −1 has a unique Choleski decomposition.

66
of restrictions on the variances. (There’s a similarity here with Kolmogorov’s
first SLLN, which also imposes independence and a variance restriction.
Continuing the analogy, the Lindeberg–Lévy CLT and Kolmogorov’s second
SLLN both impose iid.)
The most general theorem along these lines is the following.

Theorem 21 (Lindeberg–Feller CLT). Let {Xn } be a sequence of inde-


pendent random variables with E(Xn ) finite and 0 < Var(Xn ) < ∞. Write
cn := ( ni=1 Var(Xi ))1/2 . Then
P

n
Var(Xi ) −1
X d
lim max 2
= 0 and cn (Xi − E(Xi )) −→ N (0, 1)
n→∞ i∈[1,n] cn i=1

hold iff
n h i
lim c−2
X
n E (Xi − E(Xi ))2 1 (|Xi − E(Xi )| > εcn ) = 0 for any ε > 0.
n→∞
i=1

Remark 14. The last condition is called the Lindeberg condition; it restricts
the thickness of the tails. The ‘only if’ part means that the Lindeberg
condition is the weakest possible sufficient condition for weak convergence to
a normal law when independence and
.
lim max Var(Xi ) c2n = 0
n→∞ i∈[1,n]

hold! But unfortunately, the Lindeberg condition is usually difficult to check.

To deal with the intractability of the Lindeberg condition, we can use


the stronger but simpler Liapunov condition.

Theorem 22 (Liapunov CLT). Let {Xn } be a sequence of independent


random variables with E(Xn ) finite, 0 < Var(Xn ) < ∞, and
 
τn := E |Xn − E(Xn )|3 < ∞.

Write cn := ( ni=1 Var(Xi ))1/2 and bn := ( 1/3


P Pn
i=1 τi ) , and assume that
limn→∞ bn /cn = 0. Then
n
c−1
d
X
n (Xi − E(Xi )) −→ N (0, 1).
i=1

67
Proof. Write Yn := Xn − E(Xn ). We wish to show that the Lindeberg
condition holds, so fix ε > 0.
n h i n Z
c−2 E Yi2 1 (|Yi | > εcn ) = c−2
X X
n n y 2 LYi (dy)
i=1 i=1 {|y|>εcn }
n Z
1 3
= c−2
X
n |y| LYi (dy).
i=1 {|y|>εcn }
|y|
Observe that this expression must be nonnegative. To show that it can’t
be strictly positive, use the nonnegativity of the integrand (together with
cn > 0 and τn < ∞) to obtain
n Z n Z
1 3 1 X
c−2
X
n |y| LYi (dy) ≤ 3 |y|3 LYi (dy)
i=1 {|y|>εcn }
|y| εcn i=1 {|y|>εcn }
n
1 X
≤ τi
εc3n i=1
3
1 bn

= −
→0 as n −
→ ∞.
ε cn
Since ε > 0 was arbitrary, we’ve shown that the Lindeberg condition holds.
Hence the conclusion follows by the Lindeberg–Feller CLT. 

Remark 15. It’s clear from the proof that we don’t really need the third
moment to exist; it’s enough for the (2 + α)th moment to exist for some
α > 0.
Before moving on, let’s give an example of how these theorems are used
in econometrics.
Example 14. In the linear model,
 −1  
n1/2 βb − β = n−1 X > X n−1/2 X > ε .


We assume independent observations, and impose conditions s.t. n−1 X > X


obeys a WLLN, converging in probability to a nonsingular constant matrix
A. Then if n−1/2 X > ε converges in distribution to N (0, Σ), Slutsky’s theorem
implies that    >
n1/2 βb − β −→ N 0, A−1 Σ A−1
 d
,

giving us the useful approximation


  > 
a −1 −1 −1
βb ∼ N β, n A Σ A .

68
d
Showing that n−1/2 X > ε −→ N (0, Σ) is easy when {Xi } and {εi } are iid
and independent of each other, for then {Xi εi } are iid random vectors and
the vector Lindeberg–Lévy CLT can be applied.
But suppose that we want the asymptotic distribution conditional on X, or
equivalently that X is nonstochastic (‘fixed regressors’) with n−1 X > X −
→A
for some nonsingular A. In this case, more work is required to show that
n−1/2 X > ε converges in distribution. The observations are still independent
since {εi } are, but they are no longer identically distributed, since each
term in the sum is a different linear combination of the elements of ε. We
therefore need to make assumptions sufficient for the Lindeberg condition
to be satisfied; loosely, we require that X is sufficiently bounded that no
single observation can dominate the variance of the sum. As mentioned
above, it is rather hard to check the Lindeberg condition, but in this case
the demonstration can be found in Amemiya (1985).

5.3 Dependent random variables


A sequence of random elements is also called a (discrete-time) stochastic
process. In this section, we’re mainly thinking about time-series applica-
tions in which n is a time index, so we’ll use the language of stochastic
processes. We’ll actually consider stochastic processes with no starting date,
i.e. sequences {Xn }∞−∞ .
‘Time-series CLTs’ do away with independence, which tends to make
things a lot uglier. Many of them also allow for some degree of heterogeneity
of distribution. The CLT we will state (which is one of many) replaces replaces
identical distribution with strict stationarity and replaces independence with
α-mixing.

Definition 29. A stochastic process {Xn } is strictly stationary iff for any
finite collection of indices (n1 , . . . , nT ), the (joint) distribution of the random
vector (Xn1 +m , . . . , XnT +m ) does not depend on m ∈ N.

Definition 30. A strictly stationary stochastic process {Xn } is α-mixing


(or strongly mixing) iff there is α : N → R such that limk→∞ α(k) = 0 and
n
sup |P(B ∩ C) − P(B)P(C)| :
o
B ∈ σ(. . . , Xn−1 , Xn ), C ∈ σ(Xn+k , Xn+k+1 , . . . ), n ∈ (−∞, ∞)
≤ α(k) for each k ∈ N.

69
The object that we’re taking the supremum of is the degree of ‘independ-
ence failure’. Since we’re taking the supremum (over a very large set), the
condition says that the degree of independence failure between ‘blocks’ k
periods apart is uniformly bounded by α(k) for each k. Since α(k) − → 0, the
degree of independence failure must vanish uniformly as blocks are pulled
further apart.
The following is one of many ‘time-series CLTs’. Joel said that it can be
found in White (2001), though I haven’t been able to locate it.

Theorem 23 (α-mixing CLT). Let {Xn } be a real-valued, strictly stationary


process with mean zero. Assume that E (|Xn |γ ) < ∞ for some γ > 2, and
that the process is α-mixing with α(k) = ak −β for a > 0 and β > γ/(γ − 2).
Then   −1
∞ n
d
X X
 Cov(X0 , Xk ) Xi −→ N (0, 1).
k=−∞ i=1

Remark 16. There is an explicit tradeoff here between how heavy tails and
how much dependence that can be accommodated: for γ small (heavy tails),
β must be large (low dependence).

We (definitely) won’t give a proof, but here’s a rough indication as to


why it’s true. We know from the proof of the Lindeberg–Lévy CLT that
independence allows us to factor the characteristic function into the product
of n characteristic functions. While this no longer holds exactly without
independence, α-mixing is sufficient for it to remain a good approximation in
the sense that the approximation error vanishes sufficiently fast as n → ∞.
This is because for ‘blocks’ of random variables very far apart, independence
‘nearly’ holds, and as n −→ ∞ there are many blocks that are very far apart.
Formalising this is a delicate business, however!

5.4 The rate of convergence


Central limit theorems tell us that a normalised sum of random variables
converges weakly to N (0, 1), but they do no tell us the rate of convergence.
If convergence is really slow, CLTs won’t provide very good approximations!
Ideally, we’d like a uniform bound on the approximation error from
using the standard normal CDF Φ instead of the true (unknown) CDF. The
Berry–Esseen theorem does exactly this for the iid case. Its assumptions are
exactly those of the Lindeberg–Lévy CLT, except that the existence of the
third moment is assumed.

70
Theorem 24 (Berry–Esseen). Let {Xn } be a sequence of iid random vari-
ables with E(X1 ) = 0 and Var(X1 ) = 1 whose third moment E X13 exists.


Then for every n ∈ R,


n
!
 
−1/2
Xi ≤ x − Φ(x) ≤ 3n−1/2 E X13 .
X
sup P n
x∈R i=1

There’s a cottage industry in probability theory devoted to refining


this theorem. On the one hand, it’s possible to replace the 3 with some
other constant which may be smaller. On the other hand, probabilists have
extended the Berry–Esseen theorem to non-iid sequences.

6 Some more limit theory


6.1 Connections between CLTs and LLNs
In this little section, we’ll show that whenever a CLT-type property holds
for some scaling constants {an } that don’t shrink too fast, a WLLN follows.
We’ll also show that the converse is false, and that we cannot strengthen the
result to obtain a SLLN.
Suppose that random variables {Xn } satisfy a central limit theorem for
some constants constants {an } and {bn }:
n
d
X
an (Xi − bi ) −→ N (0, 1).
i=1

(For concreteness, you can think of an = n−1/2 and bn = E(Xn ).) Then
n
−1
X
n (Xi − bi ) = Op (1/nan ).
i=1

If 1/nan = op (1), it follows that {Xn } satisfy a weak law of large numbers:
n
−1
X
n (Xi − bi ) = Op (op (1)) = op (1).
i=1

In particular, this implication holds if an = n−1/2 as in e.g. the Lindeberg–


Lévy CLT. (The argument goes through if an ni=1 (Xi − bi ) converges weakly
P

to any proper distribution, even if it’s nonnormal.)


The converse is false: a CLT result does not follow from a WLLN property.
a.s.
For example, consider {Xn } and X with X = Xn = 0 a.s. Then Xn −−→ X,

71
p
so a fortiori Xn −→ X, but an ni=1 Xi = 0 a.s. for any sequence of constants
P

{an }. So we cannot obtain a CLT-type result no matter how we choose our


scaling constants.
A CLT property does not imply a SLLN property, however. I haven’t
been able to think up a counterexample to illustrate this, sadly. It’s not
super-intuitive!

6.2 Laws of the iterated logarithm


My understanding of this topic is poor! Proceed with caution.
Laws of the iterated logarithm (LILs) operate ‘between’ LLNs and CLTs.
Consider a sequence of iid random variables with E(X1 ) = 0 and Var(X1 ) = 1.
Kolmogorov’s second SLLN and the Lindeberg–Lévy CLT tell us how the
Pn
partial sums i=1 Xi behave when scaled (respectively) by O(n) and
O n1/2 :


n n
−1 a.s. −1/2 d
X X
n Xi −−→ 0 and n Xi −→ N (0, 1).57
i=1 i=1

But what happens if we use scaling constants that increase at a rate slower
than O(n) but faster than O n1/2 ? In particular, how much slower than
O(n)
Pn
can we make rate while keeping almost every convergent subsequence
of i=1 Xi bounded? The Hartman–Wintner LIL says that the answer is
1/2

O [n ln(ln(n))] . (For any sequence {Xn }!)

Theorem 25 (Hartman–Wintner LIL). Let {Xn } be a sequence of iid


random variables with E(X1 ) = 0 and Var(X1 ) = 1. Then
n √
−1/2
X
lim sup[n ln(ln(n))] Xi = 2 a.s.
n→∞
i=1

Remark 17. The conclusion of the theorem is equivalent to


n √
lim inf [n ln(ln(n))]−1/2
X
Xi = − 2 a.s.
n→∞
i=1

To see this, just replace {Xn } with {−Xn }.


57
If P
you know Skorokhod’s theorem, it will be more helpful to consider that
−1/2 n a.s.
n i=1
Xi −−→ X for some X ∼ N (0, 1) (possibly on a different probability space).

72
Remark 18. A consequence of this LIL is that
n
[n ln(ln(n))]−1/2
X
Xi
i=1

is bounded a.s. It is therefore obviously bounded in probability:


n  
Xi = Op [n ln(ln(n))]−1/2 .
X

i=1

But observe that our LIL gives us more: it tells


√ us not just the rate of increase,
but also the constant of proportionality ( 2)!
There’s a nice equivalent statement in terms of tail events. For events
{An }, the event ‘An occurs infinitely often (i.o.)’ is
∞ [
\ ∞
{An i.o.} := Am .
n=1 m=n

An intuitive way of putting this into words is that {An i.o.} obtains iff all
but finitely many of the events {An } occur. (This follows from the deMorgan
law; see e.g. Rosenthal (2006, sec. 3.4)).
It turns out (see e.g. Billingsley (1995, pp. 154–6)) that the Hartman–
Wintner LIL is equivalent to the following.
Corollary 6. Let {Xn } be a sequence of iid random variables with E(X1 ) = 0
and Var(X1 ) = 1. Then for every ε > 0,
n
!
X q
P Xi ≥ (1 + ε) 2n ln(ln(n)) i.o. = 0 and
i=1
n
!
X q
P Xi ≥ (1 − ε) 2n ln(ln(n)) i.o. = 1.
i=1
Pn
In words, this reformulated LIL says that i=1 Xi lies in
q
±(1 − ε) 2n ln(ln(n))

infinitely often (with probability 1), and lies outside


q
±(1 + ε) 2n ln(ln(n))

only finitely many times (with probability 1). The LIL therefore bounds the
Pn
extreme fluctuations of i=1 Xi : fluctuations inside the iterated-log bound

73
occur infinitely often w.p. 1, and fluctuations big enough to jump outside
the iterated-log bound occur at most finitely many times w.p. 1.
It may look like the LIL gives us a usable 100% confidence interval for
Pn
Xi , but this is not really the case. The LIL says that along n ∈ N,
Pi=1
n
i=1 Xi lies in q
±(1 − ε) 2n ln(ln(n))
infinitely many times with probability 1. It doesn’t say anything about the
probability that ni=1 Xi lies in this interval for any given (possibly large) n,
P

though!
As with LLNs and CLTs, LILs are also available for independent but not
identically distributed random variables, as well as for dependent random
variables. Some of these can be found in Serfling (1980, sec. 1.10).

74
7 Asymptotic properties of extremum estimators
Official reading: Amemiya (1985, sec. 4.1.1–4.1.2) and Newey and McFadden
(1994, sec. 2–3).

7.1 Preliminaries
The setting is as follows. There is a Rr -valued stochastic process {yn } defined
on a probability space (Ω, A, P).58 We call this stochastic process the data-
generating process (DGP), and write µ0 for its (unknown) law.59 A dataset
of size n is a realisation {yi (ω)}ni=1 of the first n coordinates of the DGP.
We wish to use a dataset to learn about the law µ0 of the data-generating
process (‘the distribution of the data’). In particular, we want to learn about
(estimate) a parameter of the law. Formally, a parameter is a mapping
τ : M → Θ, where M is a set of probability measures to which µ0 is assumed
to belong, and Θ is called the parameter space. Intuitively, τ captures some
‘aspect’ of the DGP’s distribution.60 Nonparametric econometrics is concerned
with the case in which Θ is infinite-dimensional (e.g. a function space). We
will focus on parameteric econometrics, meaning that we will be concerned
with the finite-dimensional case Θ ⊆ Rk for some k ∈ N.
When studying extremum estimators, we will not say anything about
the shape of the map τ : M → Θ. Instead, we will study maximisers of a
dataset-dependent function of θ ∈ Θ. (So θ is properly called a parameter
value. It is not a parameter.) When we study consistency, we will define a
θ0 ∈ Θ to which the maximisers converge in probability/a.s. as the size of the
dataset grows. We leave for the applied researcher the task of establishing
that θ0 as defined below is in fact equal to τ (µ0 ) for the parameter τ that
she wishes to estimate.
An estimator is a mapping from datasets (of arbitrary size) into Θ. As the
language suggests, the idea is usually that (for large datasets), the estimator
will be close to the true value τ (µ0 ) of some interesting parameter τ . But
don’t let the lingo confuse you: an estimator is just a mapping from datasets
into Θ, which may or may not be useful for learning about some parameter
τ.
58
Reminder: a (discrete-time) stochastic process is a collection {yn }n∈N of random
variables defined on some (common) probability space.
59
A stochastic process (taken as a whole) is a random element of a sequence space, so
the law of the process is just the law of this random element.
60
Simple example: if the DGP is R-valued and iid withR marginal distribution µ10 , then
1
the mean (provided it exists) is a parameter: τ (µ0 ) := R xµ0 (dx).

75
An extremum estimator is an estimator constructed by maximising a
data-dependent criterion function. Formally, it is a family of mappings
θen : Rn×r → Θ, one for each sample size n ∈ N, such that

θen ({yi }ni=1 ) ∈ arg max Q


e n ({yi }n , θ)
i=1
θ∈Θ

of criterion functions Rn×r × Θ → R. It turns out



for some family Q en
n∈N
that the vast majority of estimators in econometrics are extremum estimators.
Example 15 (common extremum estimators). The ordinary least squares,
nonlinear least squares, least absolute deviations and maximum likelihood
estimators can be written as extremum estimators:
n
2
X
βenOLS ({yi , xi }ni=1 ) := arg min (yi − x>
i β)
β∈Rk i=1
n
(yi − f (xi , β))2
X
βenNLS ({yi , xi }ni=1 ) := arg min
β∈Rk i=1
n
X
βenLAD ({yi , xi }ni=1 ) := arg min |yi − x>
i β|.
β∈Rk i=1
βenML ({yi , xi }ni=1 ) := arg max L ({yi }ni=1 , β)
β∈Rk

The generalised-method-of-moments estimator is also in the extremum class.


Notice that several of these models are conditional models, i.e. they
concern a regression of yi on xi (the mean regression in the case of OLS, the
median regression for LAD). As the example makes clear, there is nothing
special about dependent and independent variables: they’re all just part of
the dataset.

7.2 Measurability
We’ve defined the criterion function and extremum estimator as deterministic
functions of the data. But for the purposes of asymptotic theory, we’d like to
be able to treat them as a random function and a random vector, respectively.
(Otherwise concepts like convergence in probability are not defined!) To this
end, redefine the criterion function and extremum estimator as mappings
directly from the underlying probability space (Ω, A, P): for each n ∈ N and
ω ∈ Ω,
e n ({yi (ω)}n , ·)
Qn (ω)(·) := Q and θbn (ω) ∈ arg max Qn (ω)(θ).
i=1
θ∈Θ

76
The necessary and sufficient condition for Qn to be measurable (hence
a random function) is easy: we require Q e n to be measurable in its first
argument w.r.t. your desired σ-algebras on Rn×r and on the space of functions
Rn×r → Θ. But the existence of a measurable selection θbn from the argmax
is not so obvious. The following gives one set of sufficient conditions.
Proposition 15. Suppose that Q e n (·, θ) is measurable for each θ ∈ Θ, that
e n ({yi }n , ·) is continuous for each yn ∈ Rn×r , and that Θ is compact.
Q i=1
Then the argmax correspondence G(·) := arg maxθ∈Θ Qn (·)(θ) admits a
measurable selection θbn .
Remark 19. The main role of continuity and compactness it to ensure that
G is nonempty-valued; otherwise G may not admit any selection, measurable
or not.
The result is a corollary of the following lemma, which I’ve adapted from
Aliprantis and Border (2006, Theorem 18.19).61
Lemma 4 (measurable maximum lemma). Let (Ω, A) be a measurable space,
and let X ⊆ Rk be compact. Let f be a function Ω × X → R such that
f (·, x) is measurable for each x ∈ X and f (ω, ·) is continuous for each ω ∈ Ω.
Define v : Ω → R and G : Ω ⇒ X by
v(·) := max f (·, x) and G(·) := arg max f (·, x).
x∈X x∈X

Then v is A/BR -measurable and G admits a A/BX -measurable selection,


where B denotes the respective Borel σ-algebras.
Proof. Take any nonempty X 0 ⊆ X. Define FX 0 (·) := supx∈X 0 f (·, x). Then
for any c ∈ R,
( )
{ω ∈ Ω : FX 0 (ω) ≤ c} = ω ∈ Ω : sup f (ω, x) ≤ c
x∈X 0

= ω ∈ Ω : f (ω, x) ≤ c ∀x ∈ X 0

\
= {ω ∈ Ω : f (ω, x) ≤ c}
x∈X 0
\
= {ω ∈ Ω : f (ω, x) ≤ c}
x∈X 0 ∩Qk

∈A
61
For now, we only require the second part (the measurability of the argmax). But
later on we’ll want to use the maximised value of the criterion function as (part of) a test
statistic, and a test statistic had better be a random variable!

77
where the final equality holds since f (ω, ·) is continuous and Qk is dense in
Rk , and inclusion in A holds because f (·, x) is measurable and σ-algebras
are closed under countable intersection. So FX 0 is A/BR -measurable, no
matter what (nonempty) X 0 ⊆ X you choose. Letting X 0 = X, it follows
that v = FX is A/BR -measurable.
Next, we want to show that some selection g : Ω → X from G is
measurable. Since X ⊆ Rk , this requires precisely that {ω ∈ Ω : g(ω) ≤ c} ∈
A for every c ∈ Rk . Write C := {x ∈ X : x ≤ c}. If C = X or C = ∅ then
the result follows trivially, so let c be such that neither C nor X\C is empty.
Then
( )
{ω ∈ Ω : g(ω) ≤ c} = ω ∈ Ω : max f (ω, x) ≥ sup f (ω, x)
x∈C x∈X\C
( )
= ω ∈ Ω : sup f (ω, x) ≥ sup f (ω, x)
x∈C x∈X\C
n o
= ω ∈ Ω : FC (ω) − FX\C (ω) ≥ 0
∈A

where the inclusion follows from the fact that FC and FX\C are both meas-
urable and that the difference of measurable functions is measurable. 

From this point on, we will always treat Qn and θbn as random elements,
without specifying particular primitive assumptions that guarantee meas-
urability. If you like concreteness, maintain the sufficient conditions given
above.62

7.3 Consistency
We say that an extremum estimator θbn is weakly consistent for θ0 ∈ θ
iff it converges in probability θ0 . We say that it is strongly consistent iff
the convergence is almost sure. As mentioned above, this section will give
conditions under which extremum estimators are consistent for a θ0 that we
will define from the criterion functions. There is no reason why θ0 should be
equal to the true value of some interesting parameter of the DGP’s law µ0 !

Proposition 16 (weak consistency). Assume


62
Aside: when we move into more complicated econometric problems, measurability
sometimes becomes prohibitively difficult to verify. As a result, much of empirical process
theory has abandoned ordinary probability measures in favour of outer measure.

78
(1) Θ ⊆ Rk is compact.

(2) Qn is continuous for each n ∈ N.


p
(3) n−1 Qn −→ Q uniformly over Θ for some nonstochastic Q : Θ → R.

(4) Q has a unique maximum on Θ at θ0 .


p
Then θbn −→ θ0 .

Remark 20. Three things.

(1) It is possible to relax compactness by requiring that Q be ‘sufficiently


concave’; then the argmax eventually lies in some compact set with
high probability.

(2) Condition (3) is a high-level assumption. Later on, we will see how
a uniform law of large numbers can be used to derive (3) from more
primitive conditions on Qn and the distribution of the data.

(3) Assumption (4) does two things. First, it assumes identification: if


there were multiple maxima, we’d have partial identification. Second,
it defines θ0 .

Proof.

To establish convergence in probability, we have to show that P θbn ∈
S − → 1 as n −→ ∞ for any open neighbourhood S of θ0 . So fix an arbitrary
open neighbourhood S in Rk of θ0 . Since Θ is compact and S is open,
S c ∩ Θ is compact. Since each Qn is continuous, Q is continuous. Hence
maxθ∈Θ∩S c Q(θ) exists by the Weierstrass theorem, so we can define

ε := Q(θ0 ) − max c Q(θ). (3)


θ∈Θ∩S

Let An be the event


( )
An := sup n−1 Qn (θ) − Q(θ) < ε/2 .
θ∈Θ

Notice that P(An ) −→ 1 as n − → ∞ by assumption (3).


Since θbn and θ0 lie in Θ, An implies

Q θbn > n−1 Qn θbn − ε/2


 
(4)

and
n−1 Qn (θ0 ) > Q(θ0 ) − ε/2. (5)

79

Using Qn θbn ≥ Qn (θ0 ), (4) implies

Q θbn > n−1 Qn (θ0 ) − ε/2.



(6)

Adding (5) and (6) and cancelling n−1 Qn (θ0 ), we see that An implies

Q θbn > Q(θ0 ) − ε = max c Q(θ)
θ∈Θ∩S

by the definition (3) of ε. It follows that θbn ∈


/ Θ ∩ S c , so θbn ∈ S.
So An implies θbn ∈ S, hence P(An ) ≤ P θbn ∈ S . Since P(An ) − → 1, it

follows that P θbn ∈ S − → 1. Since S was an arbitrarily chosen neighbourhood
p
of θ0 , this establishes that θbn −→ θ0 . 

Perhaps unsurprisingly, strengthening assumption (3) to require uniform


almost sure convergence of n−1 Qn yields strong consistency.

Proposition 17 (strong consistency). Assume

(1) Θ ⊆ Rk is compact.

(2) Qn is continuous for each n ∈ N.


a.s.
(3) n−1 Qn −−→ Q uniformly over Θ for some nonstochastic Q : Θ → R.

(4) Q has a unique maximum on Θ at θ0 .


a.s.
Then θbn −−→ θ0 .

The same method of proof should work, but we will pursue a different
argument that was not available for weak consistency. The proof below is
very slow because there are subtleties in this argument that were not obvious
to me without elaboration.

Proof. Here’s the outline of what we’re going to do. For fixed ω ∈ Ω, θbn (ω)
is a sequence in Rk . From real analysis, we know that if this sequence
has convergent subsequences, and if all of these

convergent subsequences
have the same limit, then the full sequence θn (ω) is convergent with the
b

same limit. Since Θ is compact, θbn (ω) does have at least one convergent
subsequence. So we just have to show that for almost all ω ∈ Ω, every
convergent subsequence has limit θ0 .

80
Fix an arbitrary ω ∈ Ω, and pick an arbitrary convergent subsequence
θbnωi (ω) ; call its limit θω . Fix ε > 0. By the triangle inequality,


 
(nωi )−1 Qnωi θbnωi (ω) − Q(θω )
     
= (nωi )−1 Qnωi θbnωi (ω) − Q θbnωi (ω) + Q θbnωi (ω) − Q(θω )
     
≤ (nωi )−1 Qnωi θbnωi (ω) − Q θbnωi (ω) + Q θbnωi (ω) − Q(θω )

→ θω and Q is continuous, there exists N1 ∈ N such that


Since θbnωi (ω) −
 
Q θbnωi (ω) − Q(θω ) < ε/2 for all i ≥ N1 .
a.s.
Since n−1 Qn −−→ Q uniformly, the subsequence (nωi )−1 Qnωi also converges


a.s.

uniformly to Q. So (regardless of the what the deterministic sequence
θni (ω) happens to look like,) there exists N2 ∈ N such that
b ω

   
(nωi )−1 Qnωi θbnωi (ω) − Q θbnωi (ω) < ε/2 a.s. for all i ≥ N2 .63

Putting this together, there exists N (= N1 ∨ N2 ) such that


 
(nωi )−1 Qnωi θbnωi (ω) − Q (θω ) < ε a.s. for all i ≥ N .

Since ε > 0 was arbitrary, we’ve shown that there is Ω0 ⊆ Ω such that
P(Ω0 ) = 1 and
 
(nωi )−1 Qnωi (ω 0 ) θbnωi (ω) −
→ Q (θω ) for all ω 0 ∈ Ω0 .

Next, observe that θbn maximises Qn by definition:


 
(nωi )−1 Qnωi (ω) θbnωi (ω) ≥ (nωi )−1 Qnωi (ω)(θ0 ).

If ω lies in Ω0 , the LHS converges to Q(θω ). For the RHS, the fact that
a.s.
(nωi )−1 Qnωi −−→ Q uniformly implies that there is Ω00 ⊆ Ω such that P(Ω00 ) = 1
and (nωi )−1 Qnωi (ω)(θ0 ) −
→ Q(θ0 ) for all ω ∈ Ω00 . So taking the subsequential
limit i − → ∞ on both sides, we obtain

Q(θω ) ≥ Q(θ0 ) for every ω ∈ Ω0 ∩ Ω00 .


63
In this expression, θbnωi is evaluated at a fixed ω ∈ Ω, but the functions Qnωi are still
random; the ‘a.s.’ is w.r.t. random variation in the functions.

81
Since θ0 is the unique maximiser of Q, Q(θω ) ≥ Q(θ0 ) implies θω = θ0 ; so
we have θω = θ0 for every ω ∈ Ω0 ∩ Ω00 . Moreover, Ω0 ∩ Ω00 has measure one:

P(Ω0 ∩ Ω00 ) ≤ P(Ω0 ) = 1 − P (Ω0 )c ∪ (Ω00 )c




≥ 1 − P (Ω0 )c − P (Ω00 )c = 1.
 

We’ve now shown that there is a probability-1 event Ω0 ∩ Ω00 such that
whenever ω ∈ Ω0 ∩ Ω00 , every convergent subsequence of θbn (ω) converges
to θ0 . Moreover, at least one convergent subsequence is guaranteed

to exist
for any ω by compactness of Θ. It follows that the full sequence θbn (ω) is
convergent with limit θ0 whenever ω ∈ Ω0 ∩ Ω00 . Since P(Ω0 ∩ Ω00 ) = 1, this
a.s.
implies that θbn −−→ θ0 as desired. 

Example 16 (the incidental parameters problem). Let {yn } be independ-


ent random variables with yn ∼ N (µn , 1). The average log-likelihood at
parameter m is
n
e n ({yi }n , m) := − 1 ln(2π) − 1
X
Q i=1 (yi − mi )2 ,
2 2n i=1

and n
n 1 1 X
Q({y
e }
i i=1 , m) := − ln(2π) − lim (µi − mi )2 .
2 n→∞ 2n
i=1
All the assumptions in our consistency theorems appear to be satisfied.
We can take µn ∈ M for some compact M ⊆ R, so that µ ∈ M ∞ , which
is compact by Tychonoff’s theorem. Qn is measurable and continuous as
required. n−1 Qn converges a.s. uniformly to Q by a uniform LLN, though we
will not prove this. Finally, Q has a unique maximum at µ, the true means.
But as a matter of fact, the maximum-likelihood estimator is inconsistent
in this case. The reason is that the parameters µ live in an infinite-dimensional
space, whereas we required Θ to be a subset of the finite-dimensional space
Rk . More intuitively, the number of parameters increases as o(n), whereas
we required them to stay fixed at some k. It is perhaps intuitive that we
cannot consistently estimate n parameters using n data points!
This is called this incidental parameters problem, and originated with
Neyman and Scott (1948). It shows up in many other contexts. An obvious
one is the fixed effects model for panel data, where the number of fixed effects
is (by construction) equal to the number of cross-sectional observations, so
that we cannot estimate them consistently under short-panel asymptotics.
It should be clear that this is a problem of identification: no matter
how large your dataset is, you cannot precisely learn the values of the

82
parameters. We could formalise this in the Manski way by looking at the
joint distribution of the data directly. The approach we took above of looking
at identification indirectly via what can be consistently estimated from data
is the old-fashioned approach to identification.
Example 17 (consistency of MLE for uniform). Let {yi }ni=1 be independent
draws from U[0, θ0 ]. The likelihood is
n
L (θ, {yi }ni=1 ) = θ−n 1(yi ∈ [0, θ]) = θ−n 1(yi ∈ [0, θ] ∀i).
Y

i=1

The likelihood is discontinuous with a unique maximum at maxi∈{1,...,n} yi .


Continuity of the criterion function is violated here, so our consistency
theorems do not apply. But as a matter of fact, the maximum likelihood
estimator is consistent. The reason is (basically) that despite the discontinuity,
the maximum is always attained in this case. It turns out that the MLE is
not efficient in this case, however.
Example 18 (consistency of LAD). Suppose {yi }ni=1 are iid. The least
absolute deviations (LAD) estimator of the median minimises
n
X
e n ({yi }n , θ) :=
Q |yi − θ|.
i=1
i=1

The criterion function is clearly not differentiable everywhere: the derivative


fails to exist for θ at which yi = θ for some i. When the derivative exists, it is
n
X
e n,2 ({yi }n , θ) =
Q [2 · 1(yi ≤ θ) − 1].
i=1
i=1

When n is even, there will in general be a continuum of minima, but in fact


the measure of minimisers decreases sufficiently quickly with n to ensure
than any measurable selection from the argmin is consistent.
Example 19. Let {yi }ni=1 be independent draws from a mixture distribution
with density
2 !
1 y−µ 1
 −1/2   
f (y) := λ 2πσ 2
exp − + (1 − λ) (2π)−1/2 exp − y 2
2 σ 2

Mixture distributions arise naturally in models of choice by heterogeneous


agents, e.g. in IO (both theoretical and empirical). Let’s suppose that we
know λ in this case, and wish only to estimate µ and σ 2 .

83
The log-likelihood is
 
e n {yi }n , µ, σ 2
 n
Q i=1 = − ln(2π)
2
n
"  ! #
1 yi − µ 2 1 2
 
−1
X
+ ln λσ exp − + (1 − λ) exp − yi
i=1
2 σ 2

This criterion function is discontinuous: if we set µ = yi for some i and


decrease σ toward 0, the log-likelihood increases without bound. So in order
to satisfy the hypotheses of our consistency theorems, we must rule out σ = 0.
To ensure that the parameter space remains compact, this means we must
restrict σ to be bounded away from zero by some σ > 0, perhaps not a very
appealing restriction in applications. If we don’t do this, consistency fails.

7.4 Asymptotic normality


Suppose θbn is consistent for θ0 . We would then like to know how far it is likely
to be from θ0 in our sample. The obvious way to do this is to approximate
1/2

the distribution of n θn − θ0 with a normal by appeal to a central limit
b
theorem. The asymptotic normality result in this section formalises this idea.
The proposition below is almost exactly Theorem 3.1 in Newey and
McFadden (1994); the exception is that condition (3) above is slightly different
(in a way that makes the proof easier).

Proposition 18 (asymptotic normality). Assume


p
(1) θbn −→ θ0 ∈ int Θ.

(2) There is a neighbourhood of θ0 in Θ in which ∇Qn and ∇2 Qn exist


and are continuous for every n ∈ N.

(3) A defined by  
A := lim E n−1 ∇2 Qn (θ0 )
n→∞
p
is nonsingular, and n−1 ∇2 Qn (θn ) −→ A for any sequence of random
p
vectors {θn } such that θn −→ θ0 .
d
(4) n−1/2 ∇Qn (θ0 ) −→ N (0, B), where
 
B := lim E n−1 [∇Qn (θ0 )][∇Qn (θ0 )]> .
n→∞

 d
Then n1/2 θbn − θ0 −→ N 0, A−1 BA−1 .


84
Remark 21. θ0 is defined by assumption (1). We’re not taking a stand on
why θbn is consistent for θ0 , but if we decided to justify it using our weak
consistency theorem, then θ0 would of course be the maximiser of Q.

Partial proof. Assumption (2) says that there is neighbourhood of θ0 on


which each Qn is twice continuously differentiable. Since θ0 is interior, it
follows that there exists a convex open neighbourhood H ⊆ int Θ of θ0 on
which each Qn is twice continuously differentiable. Let En be the event that
p
θbn ∈ H. Since θbn −→ θ0 , P(En ) −
→ 1.
We will fudge the proof by behaving as though En obtains for all n
sufficiently large. This does not follow from P(En ) − → 1, but proceeding
in this manner turns out to be valid nonetheless. For a rigorous proof that
avoids this fudge, see Newey and McFadden (1994, ‘A complete proof of
Theorem 3.1’, p. 2152).
When En obtains, θbn is an interior maximiser on H of the differentiable
function Qn , so must satisfy the first-order condition for a local maximum:

∇Qn θbn = 0.

Since Qn is twice continuously differentiable on H, the mean value theorem


lets us replace the left-hand side with an exact Taylor expansion around θ0 :

∇Qn (θ0 ) + ∇2 Qn θen θbn − θ0 = 0,


 

where the mean value θen lies between θbn and θ0 (hence in H by convexity).
Rearranging,
h i+ h i
n1/2 θbn − θ0 = − n−1 ∇2 Qn θen n−1/2 ∇Qn (θ0 ) ,
 

where + denotes the Moore–Penrose pseudo-inverse.64


p p
Since θbn −→ θ0 and θen lies between θbn and θ0 , θen −→ θ0 .65 Hence by
assumption (3),
n−1 ∇2 Qn θen −→ A.
 p

Under our assumptions, n−1 ∇2 Qn θen may be singular, in which case the ordinary
64

matrix inverse is undefined. That’s why we use the Moore–Penrose pseudo-inverse: it is
always (uniquely) defined, and coincides with the ordinary inverse for nonsingular matrices.
For details, see e.g. Rao (1973, sec. 1.b.5 & 1.c.5).
65
This statement only makes sense if the mean value θen is a random vector, i.e. is
measurable! It turns out that it is; Newey and McFadden (1994, p. 2141, footnote 25)
indicate why, and refer the reader to our friend Jennrich (1969) for a formal proof.

85
The Moore–Penrose pseudo-inverse operator is continuous at A since A is
nonsingular,66 so h i +
n−1 ∇2 Qn θen −→ A−1
 p

by the continuous mapping theorem. Combining this with assumption (4)


and Slutsky’s theorem, we obtain
h i+ h i
n1/2 θbn − θ0 = − n−1 ∇2 Qn θen n−1/2 ∇Qn (θ0 )
 
 
−→ −A−1 N (0, B) = N 0, A−1 BA−1
d d

where the final equality used the fact that A−1 is symmetric by Young’s
theorem. 

Remark 22. Two things:

(1) It should be clear from the proof that we don’t actually need θbn to be
a global maximiser; we only need it to satisfy

the first-order condition.
The result will therefore go through if θbn is a sequence of local
minima and/or maxima. This raises two issues, though. First, we’ll
need to find sufficient conditions for such an object to be measurable.
Second, we can no longer appeal to our consistency theorems above
to justify assumption (1). But it turns out that there are consistency
theorems for local maxima/minima; see e.g. Amemiya (1985, Theorem
4.1.2).

(2) The existence of the second derivative is actually not required for
asymptotic normality, though the proof is much harder without this
assumption. The kind of proof employed here definitely requires the
first derivative, since it is based on the first-order condition.
The LAD estimator is an example of an asymptotically normal ex-
tremum estimator for which even the first derivative fails to exist. In
section 7.6 (p. 90), we’ll give an idea of how asymptotic normality
can be proved without the use of derivatives for certain estimators,
including the LAD estimator. The maximum score estimator (Manski,
66
The ordinary matrix inverse operator is everywhere continuous: for invertible matrices
{An } and A, An −→ A implies A−1 n − → A−1 . (We used this to prove Slutsky’s theorem.)
But the Moore–Penrose pseudo-inverse is not actually continuous! However, it turns out to
be continuous at invertibility points, which is all we need.

86
1975) is an example in which ∇Qn does not exist, and the limiting
distribution is nonnormal (Kim & Pollard, 1990).67

You may wonder what can happen when θ0 lies on the boundary of
the parameter space. The following example shows how this can give rise
a nonnormal limiting distribution, even for very simple and otherwise well-
behaved estimators.

Example 20 (nonnormality on the boundary). Let {yn } be independent


random variables, each with mean µ ∈ Θ := [0, K] and variance σ 2 ∈ (0, ∞).
Choose K large enough that the sample average is never above K; then the
least-squares estimator is
n n
( )
−1 −1
X X
2
bn := arg min n
µ (yi − m) = max n yi , 0 ,
m∈[0,K] i=1 i=1

what we might call a ‘censored average’.


The Lindeberg–Lévy CLT implies that
n  
n−1/2
d
X
(yi − µ) −→ N 0, σ 2
i=1

as usual. If µ > 0 then


 
d
n1/2 µ
bn − µ −→ N 0, σ 2 ,


but if µ = 0 then the limiting distribution of n1/2 µ bn is N 0, σ 2 with all




negative values censored to zero (the censored normal distribution).


Intuitively, what’s going on here is that as n gets large, with high probab-
ility the sample average is close to µ by Kolmogorov’s first SLLN, so we can
restrict attention to the behaviour of µ bn in an arbitrarily small neighbour-
hood of µ. When µ > 0, we can choose a neighbourhood that doesn’t include
zero, and it’s as if µ
bn were an ordinary uncensored average. But when µ = 0,
every neighbourhood of µ contains negative values at which censoring takes
place, so that censoring has a first-order effect on the distribution of µ bn no
matter how large n gets.
This raises another issue. If µ is positive but small, then we’ll need a
larger n in order for the normal distribution to be a good approximation,
67
In particular, Qn is a multidimensional step function, and the limit distribution is the
maximum of a multidimensional Gaussian process with quadratic drift that depends on
nuisance parameters. Not nice!

87
i.e. the rate of convergence is slower. This can be formalised by using a
Pitman drift, meaning that we let µ drift toward zero as n increases and
study how fast the drift can be without breaking asymptotic normality.68
(See section 9.4 (p. 117) for details on what a Pitman drift is and how it can
be used.)
True parameters on the boundary arise frequently in more sophistic-
ated econometric models. One example (currently fashionable) is moment-
inequality models, where we’re on the boundary whenever an inequality binds
in the population.

7.5 Estimating the asymptotic variance


If we are to use our asymptotic normality result from the previous section to
approximate the distribution of θbn , we had better be able to obtain consistent
estimates of A and B. This is possible under fairly weak assumptions. In this
section, we’ll restrict attention to iid data and additively separable criterion
functions Qn .
In particular, assume that the DGP {yi } is iid and that Qe n can be written

n
X
e n ({yi }n , θ) =
Q qe(yi , θ)
i=1
i=1

for some function qe. We’ll want work directly with the random functions
Θ → R defined by qi (ω)(θ) := qe(yi (ω), θ), so that Qn = ni=1 qi . Observe
P

that {qi } is an iid sequence of random functions.


Assume that the conditions of the weak consistency and asymptotic
normality results (pp. 78 & 84) hold. In the current setting, this implies
in particular that each qi is twice continuously differentiable near θ0 and
that the moments E([∇q1 (θ0 )][∇q1 (θ0 )]> ) and E ∇2 q1 (θ0 ) exist. Further


suppose that ∇q1 and ∇2 q1 are each bounded by some finite-expectation


random
 2
variable (so that the dominated convergence theorem applies).
∇ qi (θ0) is a sequence of iid k × k random matrices each with mean
E ∇2 q1 (θ0 ) , so
n  
n−1 ∇2 Qn (θ0 ) = n−1
a.s.
X
∇2 qi (θ0 ) −−→ E ∇2 q1 (θ0 )
i=1
68
The last point raises a more general issue. There’s a fashion in econometrics now for
studying estimators whose good properties are uniform in the true parameters. Uniformly
good estimators have the nice feature that we don’t need to be worried about the true
parameter being close to a bad region (e.g. a boundary, a nonidentification region); when
it’s in a good region, its properties are as good as anywhere else in that region.

88
by Kolmogorov’s second SLLN. Similarly, {[∇qi (θ0 )][∇qi (θ0 )]> } is a sequence
of iid k × k matrices with mean E ([∇q1 (θ0 )][∇q1 (θ0 )]> ), so by Kolmogorov’s
second SLLN we have
n
n−1 [∇Qn (θ0 )][∇Qn (θ0 )]> = n−1
X
[∇qi (θ0 )][∇qi (θ0 )]>
i=1
a.s.
−−→ E ([∇q1 (θ0 )][∇q1 (θ0 )]> ) .

Hence by the dominated convergence theorem,


   
A = lim E n−1 ∇2 Qn (θ0 ) = E ∇2 q1 (θ0 )
n→∞
 
−1
B = lim E n [∇Qn (θ0 )][∇Qn (θ0 )]> = E ([∇q1 (θ0 )][∇q1 (θ0 )]> ) .
n→∞

Now that we have clean expressions for A and B, we can think about
estimating them. Consider the random functions Θ → Rr×r
n
Abn := n−1 ∇2 Qn = n−1
X
∇ 2 qi
i=1
n
bn := n−1 [∇Qn ] [∇Qn ]> = n−1
X
B [∇qi ] [∇qi ]> .
i=1

a.s. a.s.
We have just shown that Abn (θ0 ) −−→ A and B bn (θ0 ) −
−→ B. But these are
infeasible estimators because they require knowledge of θ0 . The obvious rem-
edy is to plug in our consistent estimator θbn . By continuous differentiability,
Abn and B bn are continuous at θ0 . Hence
 a.s.  a.s.
Abn θbn −−→ A and B
bn θbn −−→ B

by the continuous mapping theorem. It follows by Slutsky’s theorem that


−1
bn θbn −1
 
Abn θbn B
bn θbn A

is a consistent estimator of the asymptotic variance of our extremum estim-


ator.
As a reward for our toil, we can finally do inference. In particular, what
we have learned is the approximate distributional result
−1
bn θbn −1 .
 
θbn ∼ N θ0 , n−1 Abn θbn
a  
B
bn θbn A

89
7.6 Asymptotic normality with a nonsmooth objective
This section is basically an aside, so generality and rigour will be sacrificed
on the altar of clarity.
Our asymptotic normality result in section 7.4 required the existence and
continuity of derivatives of the criterion function in a neighbourhood of θ0 .
The second derivative assumption turns out to be unnecessary, but proving
this in generality is hard. Getting rid of the first derivative is harder still,
but important in certain applications (e.g. auctions). We’ll give an example
to indicate what can be done.
Suppose that the DGP {yi } is iid and that the parameter θ0 satisfies the
moment condition E (ge(y1 , θ0 )) = 0. Estimation based on moment conditions
such as these can be done using the generalised method of moments (GMM)
covered in section 10, a special case of extremum estimation. We use it here
merely as an example. Restrict attention to the univariate case yi ∈ R and
Θ ⊆ R, and write F for the CDF of each yiR.
The population moment can be written R ge(y, θ0 )F (dy) = 0. Define the
empirical distribution function (EDF) by
n
Fbn (y) := n−1
X
1(y ≤ yi ).
i=1

It should be clear that an integral w.r.t. Fn is simply a sample average. We


use the method of moments estimator θbn which satsifies the analogous sample
moment Z
ge y, θbn )Fn (dy) = 0.
R
(We simply assume that a solution exists; there are workarounds.)
Now we do some simple algebra:
Z Z
1/2 1/2 
0=n ge(y, θ0 )Fn (dy) + n ge y, θbn F (dy)
R Z R
Z
1/2 1/2 
−n g (y, θ0 )Fn (dy) − n
e ge y, θbn F (dy)
Z R Z R
1/2 1/2 
=n ge(y, θ0 )Fn (dy) + n ge y, θn F (dy)
b
R Z h R
i
+ n1/2

ge y, θbn − ge(y, θ0 ) [Fn − F ](dy)
Z R Z
1/2 1/2 
= n g (y, θ0 )Fn (dy) +n
e ge y, θbn F (dy) + op (1).
R R
| {z }
d
−→N (0,E(eg(y,θ0 )2 ))

90
For now, we’re merely asserting that the final term is op (1); more on that
below. The convergence in distribution of the first term is immediate by the
Lindeberg–Lévy CLT.
The usual modus operandi is as follows. Assuming that ge(y, ·) is dif-
ferentiable with derivative ge2 , we mean-value expand the remaining term
as
Z Z h i
n1/2 ge y, θbn F (dy) = n1/2
  
ge(y, θ0 ) + ge2 y, θen θbn − θ0 F (dy)
R R Z
h i
1/2 
= n θbn − θ0 ge2 y, θen F (dy)
R

where the mean value θen lies between θbn and θ0 , and so converges in probab-
ility to θ0 . Assuming that the second derivative ge2 (y, ·) is continuous at θ0 ,
it follows by the continuous mapping theorem that
Z Z
 p
ge2 y, θen F (dy) −→ ge2 (y, θ0 )F (dy) = E (ge2 (y1 , θ0 )) ,
R R

assumed to be nonzero. Finally, rearrange and use Slutsky’s theorem to


obtain
Z −1  Z 
1/2   1/2
n θbn − θ0 = − ge2 y, θn F (dy)
e n ge(y, θ0 )Fn (dy)
R R
  
−→ − E (ge2 (y1 , θ0 ))−1 N 0, E ge(y, θ0 )2
d

  . 
=N 0, E ge(y, θ0 )2 E (ge2 (y1 , θ0 ))2 .
d

This is basically the strategy we followed in proving out asymptotic normality


result above, and clearly requires the existence of the first derivative. (We
made do without the second derivative by exploiting the moment; for a
general exteremum estimator we need the second derivative as well for this
strategy to work.)
But when the first derivative does not exist, as in e.g. the LAD case,
this strategy is not available to us. Fortunately, the integral is a smoothing
operator: the Lebesgue integral of a nonsmooth function may be smooth.
(This should
R
be intuitive.) So we can make assumptions sufficient for the
integral R g (y, ·)F (dy) to be differentiable without requiring ge(y, ·) to be.
e
Then we can mean-value expand the integral (rather than the integrand)

91
around θ0 as
Z
n1/2

ge y, θbn F (dy)
R
i ∂
Z h Z
= n1/2 ge(y, θ0 )F (dy) + n1/2 θbn − θ0

ge2 y, θen F (dy)
R ∂θ R
h i ∂ Z
= n1/2 θbn − θ0
 
ge2 y, θen F (dy).
∂θ R
Assuming that the derivative of the integral is continuous at θ0 , plus some
additional boundedness condition to keep the derivative in check, we get

Z
 p
ge y, θen F (dy) −→ A
∂θ R

for some nonstochastic A, assumed nonzero. Now we can rearrange as before


to get
−1 

 Z Z 
1/2   1/2
n θbn − θ0 = − ge y, θn F (dy)
e n ge(y, θ0 )Fn (dy)
∂θ
R R
  
−→ − A−1 N 0, E ge(y, θ0 )2
d

  . 
d
=N 0, E ge(y, θ0 )2 A2 .

This approach works for the LAD estimator, for example. There ge is the
discontinuous function

ge(y, θ) = 2 · 1 (y ≤ θ) − 1,

but its integral is


Z Z θen
 
ge y, θen F (dy) = 2 dF − 1 = 2F θen − 1,
R −∞

which is differentiable provided the data are continuously distributed. In


that case, calling the density f , we have A = 2f (θ0 ). Provided f is strictly
positive at θ0 , the LAD estimator has asymptotic distribution
  . 
n1/2 θbn − θ0 −→N 0, E [2 · 1 (y1 ≤ θ0 ) − 1]2
 d
4f (θ0 )2
 
d
=N 0, 1/4f (θ0 )2 .

Another estimator that admits this kind of argument is the maximum rank
correlation estimator. (I don’t pretend to know what that is.)

92
Recall that to get to this point, we simply asserted that
Z h i
1/2 
n ge y, θbn − ge(y, θ0 ) [Fn − F ](dy) = op (1).
R

This is not so obvious when ge is nonsmooth, since we’d usually use a mean-
value expansion of the integrand to show this. But here empirical process
theory comes to the rescue. An empirical process is a random functional,
i.e. a random function mapping from a functional space. We can write our
troublesome integral as
Z h i
n1/2

ge y, θbn − ge(y, θ0 ) [Fn − F ](dy)
R Z Z
1/2
ge y, θbn F (dy) − n1/2

= −n ge(y, θ0 )Fn (dy)
ZR ZR
= − n1/2 ge(y, θ0 )F (dy) − n1/2 ge(y, θ0 )Fn (dy) + op (1)
ZR R

= − n1/2 ge(y, θ0 )Fn (dy) + op (1)


R

= − vn ge(·, θ0 ) + op (1)

where vn is the empirical process vn (f ) := n−1/2 ni=1 f (yi ). Provided that


P

ge lives in an appropriate family of functions,



theorems in empirical process
theory allow us to assert that vn ge(·, θ0 ) = op (1). The requirements involve a
concept called stochastic equicontinuity that plays a central role in empirical
process theory.
A word of caution is needed here. There are important applications in
which the appropriate stochastic equicontinuity condition is violated, in
which case the remainder term above does not vanish. In that case we may
have a nonnormal asymptotic distribution. An example of this is Manski’s
maximum score estimator.

93
8 The (quasi-)maximum-likelihood estimator
Official reading: Amemiya (1985, sec. 4.2.1–4.2.3) and Newey and McFadden
(1994, sec. 2.2.1, 2.4, 3.2, 3.3, and 4.2).

8.1 Preliminaries
Recall the estimation setting laid out section 7. The data-generating pro-
cess {yn } is a Rr -valued stochastic process defined on a probability space
(Ω, A, P); its law is denoted µ0 . We observe a dataset, meaning a realisation
{yi (ω)}ni=1 of the first n coordinates of the DGP.
We do not know the law µ0 ∈ M ; a priori we only know that it lies in
a set M of probability measures. There is some parameter τ : M → Θ of
interest (Θ ⊆ Rk ) whose true value τ (µ0 ) we wish to learn about using the
data.
In our study of extremum estimators, we did not specify a particular
parameter τ ; instead we looked at estimators that are consistent for some
point in θ0 ∈ Θ which may or may not equal τ (µ0 ) for some parameter of
interest τ . Now that we’re looking at a specific class of extremum estimators,
we will be able to ensure that θ0 = τ (µ0 ).
So fix a parameter τ : M → Ω of interest. Since we’re only interested in
estimating τ (µ0 ), there is no need to distinguish between distinct measures
in M that give rise to the same parameter value; so wlog, we will treat τ as a
bijection (one-to-one and onto). This means that we can write M as {µθ }θ∈Θ ,
a parametric (finite-dimensional) family of distributions. The true value of
the parameter is written θ0 := τ (µ0 ).69 To re-iterate, θ0 is a feature of the
unknown population distribution now: it is no longer some hard-to-interpret
probability limit of an extremum estimator.
Suppose that there is a measure ν with respect to which every µθ possesses
a density (Radon–Nikodým derivative). Let µθn and νnθ denote the marginal
distributions that µθ and ν induce over the first n coordinates, and write
f θn := dµθn /dν n for the density governing a sample of size n when the true
parameter is θ (the true law is µθ ).
The likelihood function L is the random function Θ → R defined by

Ln (ω)(θ) := f θn ({yi (ω)}ni=1 ) for each ω ∈ Ω and θ ∈ Θ.

The log-likelihood function is `n := ln (Ln ). (Set `n (θ) = −∞ whenever


By the way, µ0 and µθ0 denote the same thing in my notation. I don’t think this should
69

cause any confusion.

94
Ln (θ) = 0.) A maximum likelihood estimator is an extremum estimator
whose criterion function (Qn ) is `n .
We will focus on the case of independently and identically distributed
data. In this case,
n
Y
f θn ({yi }ni=1 ) = f θ (yi )
i=1

where fθ is the marginal distribution. Define the (log-)likelihood contribution


of observation i as
 
`1i (ω)(θ) = ln f θ (yi (ω)) for each ω ∈ Ω and θ ∈ Θ.

Then we can write the log-likelihood as `n = ni=1 `1i , a sum of iid random
P

functions to which we can apply Jennrich’s uniform SLLN. Pretty much all
the results we will derive for the MLE can be extended to the non-iid case,
but we won’t do it.
Now suppose that our model of $\mu_0$ is misspecified: the $f_n^\theta$ we use to form the likelihood is not actually equal to $d\mu_n^\theta/d\nu_n$. The log-likelihood $\ell_n$ is still
well-defined, so we can still obtain an extremum estimator by maximising it.
Such an estimator is called a quasi-maximum-likelihood estimator (QMLE).
We will see that under appropriate conditions, QMLEs are consistent for
θ0 and asymptotically normal, but that they do not share the efficiency
properties of the MLE.

8.2 Consistency for the truth


In this section, we’ll establish the consistency of the MLE for θ0 = τ (µ0 ).
The (entirely straightforward) argument hinges on the following result.

Proposition 19 (information inequality). Let $f$ and $g$ be densities w.r.t. a measure $\nu$ on $(\Omega, \mathcal{A})$. Then $\int_\Omega \ln(g/f)\, f\, d\nu \le 0$, with equality iff $g = f$ $\nu$-a.e.

Proof. A first-order mean-value theorem expansion of $\ln(g/f)$ around $g/f = 1$ yields

$$\ln(g/f) = (g/f - 1) - \frac{1}{2\gamma^2}(g/f - 1)^2,$$

where $\gamma$ (a function!) lies between $g/f$ and 1.70 So

$$\begin{aligned}
\int_\Omega \ln(g/f)\, f\, d\nu
&= \int_\Omega (g/f - 1)\, f\, d\nu - \int_\Omega \frac{1}{2\gamma^2}(g/f - 1)^2 f\, d\nu \\
&= \int_\Omega g\, d\nu - \int_\Omega f\, d\nu - \int_\Omega \frac{1}{2\gamma^2}\frac{(g-f)^2}{f}\, d\nu \\
&= -\int_\Omega \frac{1}{2\gamma^2}\frac{(g-f)^2}{f}\, d\nu.
\end{aligned}$$

The RHS is evidently nonpositive and equal to zero iff $g = f$ $\nu$-a.e. ∎

70 We need $\gamma$ to be $\mathcal{A}$-measurable; otherwise we cannot integrate it. It is in fact measurable, as discussed in footnote 65 (p. 85).
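For intuition, the integral $\int_\Omega \ln(g/f)\,f\,d\nu$ is minus the Kullback–Leibler divergence from $f$ to $g$. Here is a minimal numerical sanity check (my own illustration, not from the lectures; the two normal densities are arbitrary choices):

```python
# Hedged illustration: the information inequality says E_f[ln(g/f)] <= 0,
# i.e. minus the KL divergence is nonpositive, with equality iff g = f a.e.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=200_000)  # draws from f = N(0, 1)

f = stats.norm(0.0, 1.0).pdf
g = stats.norm(0.5, 1.3).pdf  # any other density g

mc = np.mean(np.log(g(y) / f(y)))  # Monte Carlo estimate of the integral
print(mc)  # strictly negative here, since g differs from f
```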

The maximum likelihood estimator for independent data is an extremum estimator whose criterion function is $\ell_n = \sum_{i=1}^n \ell_i^1$. When the data are iid, $\{\ell_i^1\}$ are iid random functions $\Theta \to \mathbb{R}$. Assume that $\Theta$ is compact, that $\ell_1^1$ is continuous and that $E\bigl(\sup_{\theta\in\Theta} \ell_1^1(\theta)\bigr) < \infty$. Then by Jennrich's uniform SLLN,

$$n^{-1}\ell_n = n^{-1}\sum_{i=1}^n \ell_i^1 \xrightarrow{a.s.} Q \quad \text{uniformly over } \Theta,$$

where $Q : \Theta \to \mathbb{R}$ is the nonstochastic function

$$Q(\theta) := E\bigl(\ln f^\theta(y_1)\bigr)
= \int_{\mathbb{R}^r} \ln\bigl(f^\theta(y_1)\bigr)\, \mu_1^{\theta_0}(dy_1)
= \int_{\mathbb{R}^r} \ln\bigl(f^\theta(y_1)\bigr)\, f^{\theta_0}(y_1)\, \nu_1(dy_1),$$

using the fact that $\mu_1^{\theta_0}$ is the true distribution and that $f^{\theta_0}(y_1)$ is its density w.r.t. $\nu_1$. The information inequality tells us precisely that $Q$ attains a unique maximum at $\theta_0$. Lo and behold, all the assumptions of our strong consistency result (p. 80) are satisfied, so the MLE is strongly consistent for $\theta_0$. And as promised, $\theta_0$ here was defined as $\tau(\mu_0)$, the true value of the parameter $\tau$. 'Consistency for the truth', if you will. To summarise:
Proposition 20 (consistency for the truth). Suppose that $\Theta$ is compact, that $\ell_1^1$ is continuous and that $E\bigl(\sup_{\theta\in\Theta} \ell_1^1(\theta)\bigr) < \infty$. Then the MLE $\hat\theta_n$ is consistent for $\theta_0 = \tau(\mu_0)$.


A similar argument can be applied for the QMLE. But in this case, the information inequality cannot be used to establish that the limit function $Q$ has a unique maximum, and certainly not that this maximum is equal to $\tau(\mu_0)$. It is possible, however, to give additional conditions restricting the degree of misspecification in such a way that $Q$ is guaranteed to have a unique maximum at $\tau(\mu_0)$. A QMLE satisfying these extra conditions will also be consistent for the truth.

8.3 Asymptotic normality


Asymptotic normality is even easier than consistency: we just apply our asymptotic normality result for extremum estimators out of the box. Assume that $\ell_1^1$ is twice continuously differentiable in a neighbourhood of $\theta_0$. The $A$ and $B$ matrices from the asymptotic normality proposition become (using the fact that the data are iid)

$$A := \lim_{n\to\infty} E\bigl(n^{-1}\nabla^2 \ell_n(\theta_0)\bigr)
= \lim_{n\to\infty} E\Bigl(n^{-1}\sum_{i=1}^n \nabla^2\ell_i^1(\theta_0)\Bigr)
= E\bigl(\nabla^2 \ell_1^1(\theta_0)\bigr)$$

and

$$\begin{aligned}
B &:= \lim_{n\to\infty} E\Bigl(n^{-1}\bigl[\nabla \ell_n(\theta_0)\bigr]\bigl[\nabla \ell_n(\theta_0)\bigr]^\top\Bigr) \\
&= \lim_{n\to\infty} E\Biggl(\Bigl[n^{-1/2}\sum_{i=1}^n \nabla\ell_i^1(\theta_0)\Bigr]\Bigl[n^{-1/2}\sum_{i=1}^n \nabla\ell_i^1(\theta_0)\Bigr]^\top\Biggr) \\
&= \lim_{n\to\infty} E\Bigl(n^{-1}\sum_{i=1}^n \bigl[\nabla\ell_i^1(\theta_0)\bigr]\bigl[\nabla\ell_i^1(\theta_0)\bigr]^\top\Bigr)
= E\Bigl(\bigl[\nabla\ell_1^1(\theta_0)\bigr]\bigl[\nabla\ell_1^1(\theta_0)\bigr]^\top\Bigr).
\end{aligned}$$

We assume that both of these expectations exist and are finite, and furthermore that $A$ is nonsingular (negative definite will do). The information matrix equality below will tell us that $-A = B$ when the model is correctly specified. $-A$ is called the Hessian form of the information matrix, and $B$ is called the outer product form of the information matrix.

$\{\nabla^2\ell_i^1(\theta_0)\}$ is an iid sequence of random matrices whose mean we assumed exists, so by Khinchine's WLLN (p. 56) we have

$$n^{-1}\nabla^2\ell_n(\theta_0) = n^{-1}\sum_{i=1}^n \nabla^2\ell_i^1(\theta_0) \xrightarrow{p} A.$$

Since $\nabla^2\ell_n$ is continuous in a neighbourhood of $\theta_0$, any $\{\theta_n\}$ with $\theta_n = \theta_0 + o_p(1)$ satisfies

$$n^{-1}\nabla^2\ell_n(\theta_n) = n^{-1}\nabla^2\ell_n(\theta_0) + o_p(1) \xrightarrow{p} A$$

by the continuous mapping theorem.


Assume that $\nabla\ell_1^1(\theta_0)$ is dominated by some random vector with finite expectation. Then the dominated convergence theorem applies, so we can exchange the order of differentiation and integration (see e.g. Rosenthal (2006, sec. 9.2)). Hence

$$E\bigl(\nabla\ell_1^1(\theta_0)\bigr)
= E\Biggl(\frac{\nabla L_1^1(\theta_0)}{L_1^1(\theta_0)}\Biggr)
= \int_{\mathbb{R}^r} \frac{\frac{\partial}{\partial\theta} f^{\theta_0}(y)}{f^{\theta_0}(y)}\, f^{\theta_0}(y)\, \nu_1(dy)
= \int_{\mathbb{R}^r} \frac{\partial}{\partial\theta} f^{\theta_0}(y)\, \nu_1(dy)
= \frac{\partial}{\partial\theta}\int_{\mathbb{R}^r} f^{\theta_0}(y)\, \nu_1(dy)
= \frac{\partial(1)}{\partial\theta} = 0.$$

In words, the expected score is zero at the truth. So $\{\nabla\ell_i^1(\theta_0)\}$ is an iid sequence of random vectors with mean zero and finite variance $B$. Hence by the multivariate Lindeberg–Lévy CLT we have

$$n^{-1/2}\nabla\ell_n(\theta_0) = n^{-1/2}\sum_{i=1}^n \nabla\ell_i^1(\theta_0) \xrightarrow{d} N(0, B).$$

Since $\hat\theta_n \xrightarrow{p} \theta_0$ by the arguments in the previous section, our asymptotic normality result for extremum estimators (p. 84) applies, giving us

$$n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr) \xrightarrow{d} N\bigl(0, A^{-1}BA^{-1}\bigr).$$

Actually, it turns out that $-A = B$ when the model is correctly specified, so that the asymptotic variance simplifies to $-A^{-1} = -\bigl[E\bigl(\nabla^2\ell_1^1(\theta_0)\bigr)\bigr]^{-1}$. This result is called the information matrix equality.

Proposition 21 (information matrix equality). Assume that $\nabla\ell_1^1(\theta_0)$ is dominated by some random vector with finite expectation and that the model is correctly specified. Then

$$-A = -E\bigl(\nabla^2\ell_1^1(\theta_0)\bigr) = E\Bigl(\bigl[\nabla\ell_1^1(\theta_0)\bigr]\bigl[\nabla\ell_1^1(\theta_0)\bigr]^\top\Bigr) = B.$$

Proof. The proof is similar to the demonstration above that the expected score is zero at the true parameter.

$$\begin{aligned}
E\bigl(\nabla^2\ell_1^1(\theta_0)\bigr)
&= E\biggl(\frac{\partial}{\partial\theta^\top}\nabla\ell_1^1(\theta_0)\biggr)
= E\Biggl(\frac{\partial}{\partial\theta^\top}\Biggl(\frac{\nabla L_1^1(\theta_0)}{L_1^1(\theta_0)}\Biggr)\Biggr) \\
&= \int_{\mathbb{R}^r} \frac{\partial}{\partial\theta^\top}\Biggl(\frac{\frac{\partial}{\partial\theta} f^{\theta_0}(y)}{f^{\theta_0}(y)}\Biggr) f^{\theta_0}(y)\,\nu_1(dy) \\
&= \int_{\mathbb{R}^r}\Biggl(\frac{\frac{\partial^2}{\partial\theta\,\partial\theta^\top} f^{\theta_0}(y)}{f^{\theta_0}(y)}
- \frac{\bigl[\frac{\partial}{\partial\theta} f^{\theta_0}(y)\bigr]\bigl[\frac{\partial}{\partial\theta^\top} f^{\theta_0}(y)\bigr]}{f^{\theta_0}(y)^2}\Biggr) f^{\theta_0}(y)\,\nu_1(dy) \\
&= \int_{\mathbb{R}^r} \frac{\partial^2}{\partial\theta\,\partial\theta^\top} f^{\theta_0}(y)\,\nu_1(dy)
- \int_{\mathbb{R}^r}\Biggl[\frac{\frac{\partial}{\partial\theta} f^{\theta_0}(y)}{f^{\theta_0}(y)}\Biggr]\Biggl[\frac{\frac{\partial}{\partial\theta} f^{\theta_0}(y)}{f^{\theta_0}(y)}\Biggr]^\top f^{\theta_0}(y)\,\nu_1(dy) \\
&= \frac{\partial^2}{\partial\theta\,\partial\theta^\top}\int_{\mathbb{R}^r} f^{\theta_0}(y)\,\nu_1(dy)
- E\Bigl(\bigl[\nabla\ell_1^1(\theta_0)\bigr]\bigl[\nabla\ell_1^1(\theta_0)\bigr]^\top\Bigr) \\
&= \frac{\partial^2(1)}{\partial\theta\,\partial\theta^\top}
- E\Bigl(\bigl[\nabla\ell_1^1(\theta_0)\bigr]\bigl[\nabla\ell_1^1(\theta_0)\bigr]^\top\Bigr)
= -E\Bigl(\bigl[\nabla\ell_1^1(\theta_0)\bigr]\bigl[\nabla\ell_1^1(\theta_0)\bigr]^\top\Bigr),
\end{aligned}$$

where the exchanging of integration and differentiation in the third-last equality is permissible by the dominance condition. ∎
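A quick Monte Carlo sanity check of the information matrix equality (my own illustration, not from the lectures), using the correctly specified model $y \sim N(\theta_0, 1)$, where the score is $y - \theta$ and the Hessian is $-1$:

```python
# Hedged check: under correct specification, E[score score^T] = -E[Hessian].
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0
y = rng.normal(theta0, 1.0, size=500_000)

score = y - theta0          # d/dtheta of ln f = -(y - theta)^2/2 + const
hessian = -np.ones_like(y)  # second derivative is -1 for every observation

print(np.mean(score**2), -np.mean(hessian))  # both ≈ 1, the Fisher information
```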

When the model is not correctly specified (the QMLE case), we can still obtain asymptotic normality. The only part of the argument that relied on correct specification was our demonstration that the expected score is zero at the truth; this will have to be assumed to ensure asymptotic normality of the QMLE. The information matrix equality does not hold in this case, so we have to stick with the sandwich form $A^{-1}BA^{-1}$ for the asymptotic variance.

8.4 Estimating the asymptotic variance
The matrices A and B are unknown parameters of the DGP. If we’re going
to use our asymptotic normality result to approximate the distribution of
the MLE, we had better be able to estimate them consistently!
Recall from section 7.5 (p. 88) that we already know how to estimate $A$ and $B$ consistently in the general framework of extremum estimation. In particular, define the random functions $\Theta \to \mathbb{R}^{k\times k}$ by

$$\hat A_n := n^{-1}\nabla^2\ell_n = n^{-1}\sum_{i=1}^n \nabla^2\ell_i^1
\quad\text{and}\quad
\hat B_n := n^{-1}\sum_{i=1}^n \bigl[\nabla\ell_i^1\bigr]\bigl[\nabla\ell_i^1\bigr]^\top;$$

then $\hat A_n(\hat\theta_n)$ and $\hat B_n(\hat\theta_n)$ are strongly consistent for $A$ and $B$, respectively, under the conditions of the consistency and asymptotic normality results plus a dominance condition on the derivatives to allow the interchange of integration and differentiation.
These were the analogy estimators. Analogy estimation just means using sample averages to estimate population averages (expectations). To motivate an alternative way of estimating $A$ and $B$, let's restate that in fancier language. Suppose we wish to estimate

$$E(\psi(y_1, \theta_0)) = \int_{\mathbb{R}^r} \psi(y, \theta_0)\, \mu_1^{\theta_0}(dy) = \int_{\mathbb{R}^r} \psi(y, \theta_0)\, F^{\theta_0}(dy),$$

where $F^\theta$ is the CDF corresponding to the law $\mu_1^\theta$ and $\psi : \mathbb{R}^r \times \Theta \to \mathbb{R}^\ell$ is some interesting function. ($\psi$ is $\nabla^2\ell_i^1$ for $A$ and $[\nabla\ell_i^1][\nabla\ell_i^1]^\top$ for $B$.)

The natural nonparametric estimator of $F^{\theta_0}$ is the EDF (empirical distribution function)

$$\hat F_n(y) := n^{-1}\sum_{i=1}^n \mathbf{1}(y_i \le y).$$

The EDF is consistent for $F^{\theta_0}$ in a strong sense: the Glivenko–Cantelli theorem (e.g. Billingsley (1995, p. 269)) says that $\hat F_n$ converges uniformly a.s. to $F^{\theta_0}$. The estimator is nonparametric in the sense that it does not require us to first estimate $\theta_0$, then plug in our estimate to compute our estimate of $F^{\theta_0}$: we just estimate the function $F^{\theta_0}$ directly.

If we knew $\theta_0$, a natural way to estimate $\int_{\mathbb{R}^r}\psi(y,\theta_0)\,F^{\theta_0}(dy)$ would be to integrate $\psi(\cdot,\theta_0)$ w.r.t. $\hat F_n$ rather than the unknown $F^{\theta_0}$. We don't know $\theta_0$, but if $\psi(y,\cdot)$ is continuous then we can replace $\theta_0$ with $\hat\theta_n$ without affecting consistency. The estimator we obtain in this way is (fairly obviously) precisely the analogy estimator:

$$\int_{\mathbb{R}^r} \psi\bigl(y, \hat\theta_n\bigr)\, \hat F_n(dy) = n^{-1}\sum_{i=1}^n \psi\bigl(y_i, \hat\theta_n\bigr).$$

This class of estimators has the virtue that it's easy to prove consistency under weak conditions using a law of large numbers. It is a semiparametric approach: there's a nonparametric step (the EDF $\hat F_n$) and a parametric step (the consistent estimator $\hat\theta_n$).

But in the MLE context, we already have a full parametric model of $F^{\theta_0}$; why not make use of it? In particular, instead of integrating w.r.t. the nonparametric EDF $\hat F_n$, why not integrate w.r.t. $F^{\hat\theta_n}$, the plug-in estimate of $F^{\theta_0}$ obtained by making use of our model $\theta \mapsto F^\theta$ of the DGP? This suggestion yields a parametric estimator of $\int_{\mathbb{R}^r}\psi(y,\theta_0)\,F^{\theta_0}(dy)$:

$$\int_{\mathbb{R}^r} \psi\bigl(y, \hat\theta_n\bigr)\, F^{\hat\theta_n}(dy).$$

Provided $\theta \mapsto F^\theta$ is continuous, this also provides a consistent estimate, and it may be more efficient than the analogy estimator. (But it isn't necessarily more efficient!)
It should be pretty obvious where this is going now. Instead of estimating $B$ by integrating w.r.t. the EDF $\hat F_n$ (the analogy estimator $\hat B_n(\hat\theta_n)$ above), we could integrate w.r.t. $F^{\hat\theta_n}$. To implement this, define the (deterministic) function $\tilde B : \Theta \to \mathbb{R}^{k\times k}$ by

$$\tilde B(\theta) := \int_{\mathbb{R}^r} \biggl[\frac{\partial}{\partial\theta} \ln f^\theta(y)\biggr]\biggl[\frac{\partial}{\partial\theta} \ln f^\theta(y)\biggr]^\top \underbrace{f^\theta(y)\,\nu_1(dy)}_{=\mu_1^\theta(dy)=F^\theta(dy)} \quad\text{for each } \theta \in \Theta.$$

Clearly $\tilde B(\theta_0) = B$. Our new estimator of $B$ is the parametric plug-in estimator $\tilde B(\hat\theta_n)$. It is consistent by the continuous mapping theorem provided $\tilde B$ is continuous at $\theta_0$.71
How do these three estimators compare? Consider computational issues first. When analytical derivatives of the log-likelihood are available, $\hat A_n(\hat\theta_n)$ and $\hat B_n(\hat\theta_n)$ are very fast to compute accurately. By contrast, $\tilde B(\hat\theta_n)$ requires numerical integration, which is several orders of magnitude slower than averages (for a given level of accuracy). When analytical derivatives are not available, $\hat A_n(\hat\theta_n)$ suddenly becomes much heavier than $\hat B_n(\hat\theta_n)$ because accurate second derivatives are computationally expensive compared to first derivatives. But $\tilde B(\hat\theta_n)$ is still more expensive than $\hat A_n(\hat\theta_n)$ because accurate numerical integration is a lot slower than numerical differentiation.

Next, how close do these estimators actually tend to be to $A$ and $B$ in a finite sample, assuming that we've computed them (very) accurately? To answer this question, we have to look at Monte Carlo studies. The lightning lit review is that $\hat A_n(\hat\theta_n)$ is worst, $\hat B_n(\hat\theta_n)$ is better, and $\tilde B(\hat\theta_n)$ is best.

71 Defining $\tilde A$ analogously, we have $-\tilde A(\theta) = \tilde B(\theta)$ for any $\theta \in \Theta$ (not just $\theta_0$). This holds by the information matrix equality because $\tilde A$ and $\tilde B$ use the same value of $\theta$ in the score/Hessian and the integrating density. Hence there's no point in defining $\tilde A$ formally: just use $-\tilde B$.
However, notice that the latter is only consistent for $B$ (hence for $-A$) under correct specification. In the misspecified (QMLE) case, it will not consistently estimate either $-A$ or $B$ since the integrating density is wrong! By contrast, $\hat A_n(\hat\theta_n)$ and $\hat B_n(\hat\theta_n)$ consistently estimate $A$ and $B$ under incorrect specification because we integrate w.r.t. the EDF, whose consistency does not depend on the correctness or otherwise of the parametric model. It follows that the asymptotic variance estimate

$$\hat A_n\bigl(\hat\theta_n\bigr)^{-1}\, \hat B_n\bigl(\hat\theta_n\bigr)\, \hat A_n\bigl(\hat\theta_n\bigr)^{-1}$$

is robust to misspecification of the likelihood, whereas the alternatives

$$-\hat A_n\bigl(\hat\theta_n\bigr)^{-1}, \quad \hat B_n\bigl(\hat\theta_n\bigr)^{-1} \quad\text{and}\quad \tilde B\bigl(\hat\theta_n\bigr)^{-1}$$

are not.
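To make the analogy estimators concrete, here is a minimal sketch (my own, with hypothetical inputs) of the sandwich and its two shortcuts, computed from arrays of per-observation scores and Hessians evaluated at $\hat\theta_n$:

```python
# Hedged sketch: variance estimates for an MLE, given per-observation scores
# s_i (an n x k array) and Hessians h_i (an n x k x k array) at theta_hat.
import numpy as np

def variance_estimates(scores, hessians):
    n = scores.shape[0]
    A_hat = hessians.mean(axis=0)                        # analogy estimate of A
    B_hat = (scores[:, :, None] @ scores[:, None, :]).mean(axis=0)  # of B
    A_inv = np.linalg.inv(A_hat)
    sandwich = A_inv @ B_hat @ A_inv / n     # robust to misspecification
    hessian_form = -A_inv / n                # valid only if -A = B
    outer_form = np.linalg.inv(B_hat) / n    # valid only if -A = B
    return sandwich, hessian_form, outer_form
```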

Example 21 (binary choice). Suppose we have data $\{y_i, x_i\}_{i=1}^n$ where $y_i$ takes values in $\{0,1\}$ and $x_i \in \mathbb{R}^k$ is a covariate vector. A single-index model specifies $P(y = 1 \mid x, \theta) = \tilde p(\theta^\top x)$ for some known function $\tilde p : \mathbb{R} \to [0,1]$ and unknown parameter $\theta \in \Theta \subseteq \mathbb{R}^k$.

Write $p_i(\theta) := \tilde p(\theta^\top x_i)$ for short. The $i$th likelihood contribution (w.r.t. counting measure) is $p_i(\theta)^{y_i}[1 - p_i(\theta)]^{1-y_i}$. (Why?) So the log-likelihood is

$$\ell_n(\theta) = \sum_{i=1}^n \bigl(y_i \ln(p_i(\theta)) + (1 - y_i)\ln(1 - p_i(\theta))\bigr).$$

We assume that $\tilde p$, and hence each $p_i$, are differentiable. Then the score of the $i$th log-likelihood contribution is

$$\begin{aligned}
\nabla\ell_i^1(\theta) &= y_i \frac{\nabla p_i(\theta)}{p_i(\theta)} - (1 - y_i)\frac{\nabla p_i(\theta)}{1 - p_i(\theta)} \\
&= \biggl(\frac{y_i[1 - p_i(\theta)] - (1 - y_i)p_i(\theta)}{p_i(\theta)[1 - p_i(\theta)]}\biggr)\nabla p_i(\theta)
= \biggl(\frac{y_i - p_i(\theta)}{p_i(\theta)[1 - p_i(\theta)]}\biggr)\nabla p_i(\theta).
\end{aligned}$$

The Hessian is a lot uglier, so let's not even bother.

Now consider a particular single-index model for binary choice, the logit model:

$$\tilde p(\theta^\top x) = \frac{\exp(\theta^\top x)}{1 + \exp(\theta^\top x)}.$$

The derivative is

$$\begin{aligned}
\nabla p_i(\theta) = \frac{\partial}{\partial\theta}\tilde p(\theta^\top x_i)
&= \frac{\exp(\theta^\top x_i)\, x_i\,[1 + \exp(\theta^\top x_i)] - \exp(\theta^\top x_i)^2\, x_i}{[1 + \exp(\theta^\top x_i)]^2} \\
&= \frac{\exp(\theta^\top x_i)}{1 + \exp(\theta^\top x_i)}\cdot\frac{1}{1 + \exp(\theta^\top x_i)}\, x_i
= p_i(\theta)[1 - p_i(\theta)]\, x_i.
\end{aligned}$$

So the score of the $i$th log-likelihood contribution is

$$\nabla\ell_i^1(\theta) = \bigl(y_i - p_i(\theta)\bigr)\, x_i.$$

Letting $\hat\theta_n$ be the MLE, our analogy estimator of $B$ is therefore

$$\hat B_n\bigl(\hat\theta_n\bigr) = n^{-1}\sum_{i=1}^n \bigl(y_i - p_i(\hat\theta_n)\bigr)^2\, x_i x_i^\top.$$

This is what e.g. Stata uses by default to estimate the asymptotic variance of the MLE for the logit model. It is numerically different from $\hat A_n(\hat\theta_n)$, of course. (To give a closed form for the latter we would have to compute some monstrous derivatives, so I'd rather not.)
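As a hedged illustration of the example (my own code, not part of the notes), the following computes the logit MLE by Newton's method on the FOC and then forms the analogy estimator $\hat B_n(\hat\theta_n)$ on simulated data:

```python
# Hedged sketch: logit MLE via Newton's method, then the analogy estimate
# of B from the example.  All data are simulated; nothing here is from the
# notes beyond the formulas for the score and B-hat.
import numpy as np

rng = np.random.default_rng(2)
n, k = 5_000, 2
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # include a constant
theta0 = np.array([0.5, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x @ theta0)))

theta = np.zeros(k)
for _ in range(25):                        # Newton iterations on the FOC
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    score = x.T @ (y - p)                  # sum_i (y_i - p_i(theta)) x_i
    hess = -(x * (p * (1 - p))[:, None]).T @ x
    theta -= np.linalg.solve(hess, score)

p = 1.0 / (1.0 + np.exp(-x @ theta))
B_hat = (((y - p) ** 2)[:, None, None] * (x[:, :, None] @ x[:, None, :])).mean(axis=0)
print(theta, B_hat)
```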

8.5 The information matrix test


The availability of consistent estimators of $A$ and $B$ raises the possibility of testing whether the model is correctly specified. (These kinds of tests are called specification tests.) Intuitively, $-\hat A_n(\hat\theta_n)$ and $\hat B_n(\hat\theta_n)$ should be close to each other in a large sample if the model is correctly specified. To formalise what we mean by 'close', we need an asymptotic distribution. It turns out that under the null hypothesis of correct specification, the elements of the $k \times k$ matrix

$$W_n\bigl(\hat\theta_n\bigr) := n^{1/2}\bigl[\hat A_n\bigl(\hat\theta_n\bigr) + \hat B_n\bigl(\hat\theta_n\bigr)\bigr]$$

are jointly normally distributed in the limit, with mean zero and a covariance matrix that can be estimated consistently. $W_n(\hat\theta_n)$ has $k(k+1)/2$ independent elements (since it's symmetric), so we can construct a test statistic as a function of these. Such tests are called information matrix (IM) tests, introduced by White (1982).

There are many ways of turning $W_n(\hat\theta_n)$ into a test statistic. A simple one is to take a quadratic form

$$q_n\bigl(\hat\theta_n\bigr)^\top\, W_n\bigl(\hat\theta_n\bigr)\, q_n\bigl(\hat\theta_n\bigr),$$

where the vector $q_n(\hat\theta_n)$ is chosen as a function of the estimated covariance. Under the null, this quadratic form converges to a $\chi^2$ distribution with degrees of freedom depending on how many independent elements of $W_n(\hat\theta_n)$ are given positive weight. Many variants have been proposed, some of which are asymptotically equivalent but much easier to compute.
Monte Carlo evidence suggests that the χ2 approximation to the distri-
bution of IM statistics is very poor. When using the 5% χ2 critical values,
the test often rejects under the null as often as 95% of the time for moderate
sample sizes! Fortunately, the bootstrap approximation to the distribution of
IM statistics is very accurate, so we can use bootstrap critical values instead.
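A minimal sketch (my own; the inputs are hypothetical) of the raw ingredient of an IM test, the matrix $W_n(\hat\theta_n)$, built from per-observation scores and Hessians; turning it into an actual statistic requires an estimate of its covariance or a bootstrap, which is omitted here:

```python
# Hedged sketch: the IM-test matrix W_n = n^{1/2} (A_hat + B_hat) from
# scores s_i (n x k) and Hessians h_i (n x k x k) evaluated at the MLE.
import numpy as np

def im_matrix(scores, hessians):
    n = scores.shape[0]
    A_hat = hessians.mean(axis=0)
    B_hat = (scores[:, :, None] @ scores[:, None, :]).mean(axis=0)
    return np.sqrt(n) * (A_hat + B_hat)  # ≈ 0 under correct specification
```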

8.6 Asymptotic efficiency


Consider the class of estimators that are weakly consistent for θ0 and asymp-
totically normal. How do we choose between them? The obvious concern is to
minimise variance. We don’t have formulae for the variance of an estimator
in a finite sample, but we do have the asymptotic variance (the variance of
the limiting normal distribution).
Since $\theta_0$ could be any point in $\Theta$, we'd like our analysis to be valid no matter what its value happens to be. So we can't treat it as fixed the way we did above; we have to let it vary. To this end, it will be useful to have symbols for the mean and variance of a statistic $\tilde T_n : \mathbb{R}^{n\times r} \to T \subseteq \mathbb{R}^\ell$ in the scenario in which some arbitrary $\theta \in \Theta$ is the true value:

$$\begin{aligned}
E^\theta T_n &:= \int_{\mathbb{R}^{n\times r}} \tilde T_n\bigl(\{y_i\}_{i=1}^n\bigr)\, d\mu_n^\theta \\
\operatorname{Var}^\theta T_n &:= \int_{\mathbb{R}^{n\times r}} \bigl[\tilde T_n\bigl(\{y_i\}_{i=1}^n\bigr) - E^\theta T_n\bigr]\bigl[\tilde T_n\bigl(\{y_i\}_{i=1}^n\bigr) - E^\theta T_n\bigr]^\top d\mu_n^\theta.
\end{aligned}$$

(Recall that $\mu_n^\theta$ denotes the law of $\{y_i\}_{i=1}^n$ when $\theta$ is the true parameter value.) With this notation, we can easily define a function $I : \Theta \to \mathbb{R}^{k\times k}$ that maps an arbitrary $\theta \in \Theta$ into what the information matrix $-A = B$ would be if $\theta$ were the true value:

$$I(\theta) := E^\theta\Bigl(\bigl[\nabla\ell_1^1(\theta)\bigr]\bigl[\nabla\ell_1^1(\theta)\bigr]^\top\Bigr) = -E^\theta\bigl(\nabla^2\ell_1^1(\theta)\bigr).$$

We’ve seen that when θ is the true value, the asymptotic variance of the
MLE is I(θ)−1 . Let V : Θ → Rk×k be the asymptotic variance (at each θ) of
some alternative estimator that is also consistent and asymptotically normal.
We say that the MLE is asymptotically (weakly) more efficient at θ ∈ Θ
than the alternative estimator iff V (θ) − I(θ)−1 is positive semidefinite (psd).
V (θ) − I(θ)−1 being psd says precisely that every linear combination of θbn
has weakly lower variance than the same linear combination of the alternative
estimator. We say that the MLE is asymptotically more efficient (simpliciter)
than the alternative estimator iff it is asymptotically more efficient at every
θ ∈ Θ. Finally, we say that the MLE is asymptotically efficient within some
class of estimators iff it is asymptotically more efficient than each other
estimator in the class.
A natural conjecture is that the MLE is asymptotically efficient within
the class of consistent and asymptotically normal estimators. This conjecture
was long believed to be true, but the following (species of) example, due to
Hodges, shows that it is not.
Example 22 (super-efficiency). Let $\Theta \subseteq \mathbb{R}$ and let the true value be $\theta_0$. Let $\hat\theta_n$ be the (consistent and asymptotically normal) MLE. Define another estimator $\tilde\theta_n$ by

$$\tilde\theta_n = \begin{cases} 0 & \text{if } |\hat\theta_n| < n^{-1/4} \\ \hat\theta_n & \text{if } |\hat\theta_n| \ge n^{-1/4}. \end{cases}$$

For $\theta_0 \ne 0$, $\tilde\theta_n$ is asymptotically equivalent to the MLE $\hat\theta_n$, so has the same asymptotic variance. But when $\theta_0 = 0$, consistency means that for large $n$, with high probability, $|\hat\theta_n| < n^{-1/4}$, in which case the asymptotic variance is zero. $\theta_0 = 0$ is called a point of super-efficiency of the estimator $\tilde\theta_n$, and an estimator with super-efficiency points is called a super-efficient estimator.
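A small simulation (my own illustration, assuming the normal-location model $y \sim N(\theta_0, 1)$, in which the MLE is the sample mean and is exactly $N(\theta_0, 1/n)$-distributed) makes the super-efficiency point vivid:

```python
# Hedged simulation of Hodges' estimator: at theta0 = 0 the truncation fires
# with high probability, so n * Var(theta_tilde) collapses toward zero; away
# from zero the estimator behaves exactly like the MLE.
import numpy as np

rng = np.random.default_rng(3)

def scaled_variance(theta0, n, reps=50_000):
    theta_hat = rng.normal(theta0, 1.0 / np.sqrt(n), size=reps)
    theta_tilde = np.where(np.abs(theta_hat) < n ** -0.25, 0.0, theta_hat)
    return n * np.var(theta_tilde)

for n in (100, 10_000):
    print(n, scaled_variance(0.0, n), scaled_variance(1.0, n))  # ≈ 0 vs ≈ 1
```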

So the MLE is not efficient: there exists another estimator $\tilde\theta_n$ whose variance is at least as low for every $\theta_0 \in \Theta$ and strictly lower for some $\theta_0 \in \Theta$. In particular, it is always possible to construct a super-efficient estimator that efficiency-dominates the MLE.

But we can salvage a great deal here. Le Cam showed that the set $\tilde\Theta \subseteq \Theta$ of super-efficiency points of any given super-efficient estimator must have Lebesgue measure zero. This is a formal sense in which we might call super-efficiency a pathology. Moreover, super-efficient estimators turn out to be the only obstacle to calling the MLE asymptotically efficient: Le Cam also showed that the MLE is asymptotically efficient within the class of non-super-efficient, consistent and asymptotically normal estimators. Similarly, if we redefine 'asymptotically more efficient than' to require only that the variance is smaller for a subset of $\Theta$ of full Lebesgue measure, then the MLE is asymptotically efficient within the class of all consistent and asymptotically normal estimators.

There's a bunch of other results along the same lines. In efficiency settings, $I^{-1}$ is called the Cramér–Rao lower bound. A basic result is that no unbiased estimator can achieve asymptotic variance lower than the Cramér–Rao bound at every $\theta \in \Theta$. The MLE is not unbiased in general, but it is asymptotically unbiased under regularity conditions, so this result gives us efficiency of the MLE in the class of consistent, asymptotically normal and asymptotically unbiased estimators. Another result is that the Cramér–Rao bound is a lower bound on the variance of any consistent and uniformly asymptotically normal estimator. The latter means that weak convergence to a normal is uniform in a certain sense.
The most general class of theorems on asymptotic efficiency is exemplified
by the Hájek–Le Cam asymptotic minimax theorem (see e.g. Ibragimov and
Has’minskii (1981, Theorem 12.1)). Such results give a set of conditions
under which an estimator is efficient within some class, and the MLE satisfies
these assumptions under regularity conditions.

9 Hypothesis testing
Official reading: Amemiya (1985, sec. 4.5.1) and Newey and McFadden (1994,
sec. 9).

9.1 Preliminaries
So far, we’ve focused on the problem of point estimation: given DGP para-
meterised by θ0 , we try to find a way of learning the value of θ0 . Now let’s
turn that on its head: we start with a subset Θ0 ⊆ Θ, and want to learn
whether θ0 ∈ Θ0 . Intuitively, we should be able to do this using our consistent
and asymptotically normal extremum estimator θbn , for if the hypothesis is
true then θbn ∈ Θ0 with high probability, and if the hypothesis is false then
θbn ∈
/ Θ0 with high probability.
The formal setup is as follows. We wish to test the null hypothesis (H0) that $\theta_0 \in \Theta_0$ against the alternative hypothesis (H1) that $\theta_0 \notin \Theta_0$. To implement this, we use a test statistic. A real-valued statistic is just a measurable function $\tilde T_n : \mathbb{R}^{n\times r} \to \mathbb{R}$ mapping data into the real line; as usual we will work directly with the random variables $T_n(\omega) := \tilde T_n(\{y_i(\omega)\}_{i=1}^n)$. A rejection region for the statistic $T_n$ is some subset $R_n$ of $\mathbb{R}$. Our testing procedure is as follows:

    If $T_n \in R_n$ then we reject H0;
    otherwise we fail to reject H0.

In many cases, the rejection regions $\{R_n\}$ will take the form

$$R_n = [c_n, \infty) \quad\text{or}\quad R_n = (-\infty, -c_n] \cup [c_n, \infty).$$

The tests discussed below have rejection regions of the former kind; the t test
has a rejection region of the latter kind. In either case, we call cn a critical
value for the test.
We’ll want a sensible way of choosing the rejection region, or else the test
will be useless. There are two problems we have to worry about: rejecting H0
when it’s actually true (type-I error), and failing to reject H0 when it’s false
(type-II error). It is much easier to control the probability of type-I error of
a test, as we’ll see momentarily, so that’s what we’ll focus on in determining
our critical values.
So fix a desired probability $\alpha$ of type-I error; $\alpha$ is called the (desired) size of the test. Then for any fixed $\theta \in \Theta_0$, we can read off the rejection region $R_{T_n}^\alpha(\theta)$ from the (approximate) distribution of $T_n$ under $\theta_0 = \theta$. For the approximate distributions we'll consider, we can summarise the rejection region using a critical value $c_{T_n}^\alpha(\theta_0)$. (E.g. for rejection regions of the form $[c_{T_n}^\alpha(\theta_0), \infty)$, the critical value is the $(1-\alpha)$th quantile of the (approximate) distribution of $T_n$.)
Without further restrictions, this test is infeasible because the critical values depend on the unknown value of $\theta_0$. The reason why type-I error is easy to control is that since $\Theta_0$ is generally a fairly small set, it will often be the case that our approximate distribution of $T_n$ under $\theta_0 = \theta \in \Theta_0$ will be the same for each $\theta \in \Theta_0$. That is, we have an approximate distribution that holds under H0 irrespective of which particular $\theta \in \Theta_0$ is the true value. This gives us critical values $c_n^\alpha$ that can be obtained without knowledge of $\theta_0$.

Useful jargon: a statistic $T_n$ is pivotal under H0 iff its distribution under H0 (for finite $n$) does not depend on $\theta_0$. It is asymptotically pivotal under H0 iff its asymptotic distribution under H0 does not depend on $\theta_0$. We are interested in the latter case, since we rarely encounter statistics whose finite-sample distribution can be derived, never mind shown to be independent of $\theta_0$. The previous paragraph says, in short, that our tests will be based on statistics that are asymptotically pivotal under H0.
Actually, we will be able to simplify the critical values further. In general, our critical values $c_n^\alpha$ will depend on $n$ since our approximate distribution for $T_n$ might vary with the sample size. But since our approximating distribution will be the asymptotic distribution (obviously independent of $n$), we can use critical values $c^\alpha$ that do not depend on $n$. However, our asymptotic distribution actually justifies the use of any critical values $c^\alpha + o_p(1)$ (anything that is asymptotically equivalent to $c^\alpha$). If we choose to add some $n$-dependent, asymptotically vanishing term to the critical values, we may obtain a better approximation to the distribution of $T_n$ under H0 in a finite sample. Many such finite-sample corrections have been proposed for well-known tests, usually justified by Monte Carlo evidence.
So far, we have only dealt with type-I error. But type-II error is also a concern: a test with known (asymptotic) size $\alpha$ but high probability of type-II error will not be able to discriminate between the null and alternative hypotheses. The power $P_n^\alpha(\theta)$ of a test of size $\alpha$ against the (specific) alternative $\theta_0 = \theta \in \Theta_0^c$ is the probability of rejecting H0 when $\theta_0 = \theta$ (a particular way in which H1 can be true). (So $P_n^\alpha(\theta)$ is one minus the probability of type-II error.) We say that our test is consistent against the alternative $\theta_0 = \theta$ iff $P_n^\alpha(\theta) \to 1$ as $n \to \infty$. The test is consistent iff it is consistent against every alternative $\theta \in \Theta_0^c$.72

[Figure 2 – Graphic illustration of the trinity of test statistics: the criterion $Q_n$ plotted against $\theta$, with $Q_n(\hat\theta_n)$, $Q_n(\theta_0)$, $\theta_0$ and $\hat\theta_n$ marked.]


The power function $P_n^\alpha$ is generally not constant; intuitively, when H0 is 'nearly' true ($\theta_0$ is close to $\Theta_0$), power will tend to be low. Another thing: you might wonder why we allow type-I errors at all; why not set $\alpha = 0$? The answer is that then power would be zero, making the test useless.

We can classify null hypotheses into two varieties. A simple hypothesis is one for which $\Theta_0$ is a singleton; unsurprisingly, this makes things a lot easier. A composite hypothesis is one for which $\Theta_0$ is not a singleton.

9.2 Simple hypotheses


The broad idea behind the three tests in this section is as follows. Suppose you have a consistent and asymptotically normal extremum estimator $\hat\theta_n$ of $\theta_0$, formed by maximising $Q_n$. Your simple null hypothesis is that the true value is $\theta_0$. (Recall that $\theta_0$ is whatever value uniquely maximised $Q$, the nonstochastic function satisfying $n^{-1}Q_n \xrightarrow{a.s.} Q$ uniformly on $\Theta$.) This situation is depicted in Figure 2.

Under the null, $\hat\theta_n - \theta_0$ (the horizontal distance in the figure) had better be small; this forms the basis of the Wald test. Similarly, $Q_n(\hat\theta_n) - Q_n(\theta_0)$ had better be small; this forms the basis of the likelihood ratio (LR) test. Finally, the gradient $\nabla Q_n(\theta_0)$ at $\theta_0$ had better be close to zero; this is the intuition behind the Lagrange multiplier (LM) test. (The latter is also known as the score test or Rao test.)

72 Notice that the consistency of a test is a pointwise concept. Uniform consistency is something stronger.

Formally, the three test statistics for a simple null hypothesis with $\Theta_0 = \{\theta_0\}$ are

$$\begin{aligned}
LR_n &= 2\bigl[Q_n\bigl(\hat\theta_n\bigr) - Q_n(\theta_0)\bigr] \\
LM_n &= \bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr]^\top \bigl[\hat B_n(\theta_0)\bigr]^+ \bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr] \\
W_n &= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top \bigl[\hat A_n\bigl(\hat\theta_n\bigr)^+ \hat B_n\bigl(\hat\theta_n\bigr) \hat A_n\bigl(\hat\theta_n\bigr)^+\bigr]^+ \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr],
\end{aligned}$$

where $^+$ denotes the Moore–Penrose pseudo-inverse. Under the hypotheses of our asymptotic normality result for extremum estimators,

$$n^{-1}LR_n \xrightarrow{p} 0, \quad n^{-1}LM_n \xrightarrow{p} 0 \quad\text{and}\quad n^{-1}W_n \xrightarrow{p} 0.$$

In other words, each of the test stats is $o_p(n)$. This is the formal sense in which the three test stats must be close to 0 with high probability in a large sample.
To obtain critical values that we can use to perform these tests, we need to derive the asymptotic distributions of the three test statistics. For $LM_n$ and $W_n$, it should be clear that we just have to apply Slutsky's theorem. (For $LR_n$, additional structure will be needed.)
Proposition 22. Assume the hypotheses of our asymptotic normality result for extremum estimators (p. 84), and further suppose that $A^{-1}BA^{-1}$ is positive definite.73 Then $LM_n \xrightarrow{d} \chi^2(k)$ and $W_n \xrightarrow{d} \chi^2(k)$.

Proof. $n^{-1/2}\nabla Q_n(\theta_0) \xrightarrow{d} N_k(0, B)$ by assumption, and $\hat B_n(\theta_0) \xrightarrow{p} B$. Hence by Slutsky's theorem,

$$LM_n \xrightarrow{d} [N_k(0,B)]^\top B^{-1} [N_k(0,B)] \overset{d}{=} [N_k(0,I)]^\top [N_k(0,I)] \overset{d}{=} \chi^2(k).$$

$n^{1/2}(\hat\theta_n - \theta_0) \xrightarrow{d} N_k(0, A^{-1}BA^{-1})$ by the asymptotic normality proposition, and $\hat A_n(\hat\theta_n) \xrightarrow{p} A$ and $\hat B_n(\hat\theta_n) \xrightarrow{p} B$. $A^{-1}BA^{-1}$ is symmetric and positive definite, hence nonsingular. So by Slutsky's theorem,

$$W_n \xrightarrow{d} \bigl[N_k\bigl(0, A^{-1}BA^{-1}\bigr)\bigr]^\top \bigl[A^{-1}BA^{-1}\bigr]^{-1} \bigl[N_k\bigl(0, A^{-1}BA^{-1}\bigr)\bigr] \overset{d}{=} [N_k(0,I)]^\top [N_k(0,I)] \overset{d}{=} \chi^2(k). \quad ∎$$
In fact, something much stronger is true. Not only are the asymptotic distributions the same; the two statistics are actually numerically close in large samples with high probability.

73 This just means that the limiting distribution is nondegenerate: no linear combination has variance zero.

Proposition 23. Assume the hypotheses of our asymptotic normality result for extremum estimators (p. 84), and further suppose that $A^{-1}BA^{-1}$ is positive definite. Then $LM_n - W_n = o_p(1)$.

Partial proof. We will treat the case in which $\hat\theta_n$ lies in the neighbourhood of $\theta_0$ in which the derivatives exist and are continuous. This occurs with probability approaching 1 as $n \to \infty$, but a rigorous proof would proceed more cautiously.

Since $\theta_0 \in \operatorname{int}\Theta$, $\hat\theta_n$ lies in the interior of $\Theta$. Hence it must satisfy the FOC $\nabla Q_n(\hat\theta_n) = 0$. Expanding the derivative around $\theta_0$ using the mean value theorem,

$$0 = \nabla Q_n(\theta_0) + \nabla^2 Q_n\bigl(\tilde\theta_n\bigr)\bigl(\hat\theta_n - \theta_0\bigr),$$

where the mean value $\tilde\theta_n$ lies between $\theta_0$ and $\hat\theta_n$, so that $\tilde\theta_n \xrightarrow{p} \theta_0$ since $\hat\theta_n$ is consistent. Rearranging and using the definition $\hat A_n = n^{-1}\nabla^2 Q_n$ from section 7.5 (p. 88),

$$-\bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr] = \hat A_n\bigl(\tilde\theta_n\bigr)\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr].$$

On the one hand, we can premultiply this by $\hat B_n(\tilde\theta_n)^+$ to get

$$-\hat B_n\bigl(\tilde\theta_n\bigr)^+\bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr] = \hat B_n\bigl(\tilde\theta_n\bigr)^+ \hat A_n\bigl(\tilde\theta_n\bigr)\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]. \tag{7}$$

On the other hand, we can transpose it to get

$$-\bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr]^\top = \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top \hat A_n\bigl(\tilde\theta_n\bigr) \tag{8}$$

(using the symmetry of the second derivative, which holds by Young's theorem). Premultiplying (7) by (8),

$$\bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr]^\top \hat B_n\bigl(\tilde\theta_n\bigr)^+ \bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr]
= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top \hat A_n\bigl(\tilde\theta_n\bigr)\hat B_n\bigl(\tilde\theta_n\bigr)^+\hat A_n\bigl(\tilde\theta_n\bigr)\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr].$$

By Slutsky's theorem, the LHS is

$$\bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr]^\top \hat B_n\bigl(\tilde\theta_n\bigr)^+ \bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr]
= \bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr]^\top \hat B_n(\theta_0)^+ \bigl[n^{-1/2}\nabla Q_n(\theta_0)\bigr] + o_p(1)
= LM_n + o_p(1).$$

Also by Slutsky's theorem, the RHS is

$$\begin{aligned}
&\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top \hat A_n\bigl(\tilde\theta_n\bigr)\hat B_n\bigl(\tilde\theta_n\bigr)^+\hat A_n\bigl(\tilde\theta_n\bigr)\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr] \\
&\quad= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top \hat A_n\bigl(\hat\theta_n\bigr)\hat B_n\bigl(\hat\theta_n\bigr)^+\hat A_n\bigl(\hat\theta_n\bigr)\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr] + o_p(1) \\
&\quad= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top \bigl[\hat A_n\bigl(\hat\theta_n\bigr)^+\hat B_n\bigl(\hat\theta_n\bigr)\hat A_n\bigl(\hat\theta_n\bigr)^+\bigr]^+\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr] + o_p(1) \\
&\quad= W_n + o_p(1).
\end{aligned}$$

Together, this says that $LM_n + o_p(1) = W_n + o_p(1)$, or equivalently $LM_n - W_n = o_p(1)$. ∎

What about the likelihood ratio stat? Let's just see how far we can get. As in the previous two proofs, let's proceed by simply assuming that $\hat\theta_n$ lies in the neighbourhood of $\theta_0$ in which the derivatives exist and are continuous (which is the case with probability approaching 1 as $n \to \infty$). A second-order mean value expansion of $Q_n(\theta_0)$ around $\hat\theta_n$ yields

$$Q_n(\theta_0) - Q_n\bigl(\hat\theta_n\bigr) = \nabla Q_n\bigl(\hat\theta_n\bigr)^\top\bigl(\theta_0 - \hat\theta_n\bigr) + \frac12\bigl(\theta_0 - \hat\theta_n\bigr)^\top \nabla^2 Q_n\bigl(\tilde\theta_n\bigr)\bigl(\theta_0 - \hat\theta_n\bigr),$$

where the mean value $\tilde\theta_n$ lies between $\hat\theta_n$ and $\theta_0$, hence $\tilde\theta_n \xrightarrow{p} \theta_0$. By interiority and differentiability, the first-order condition must hold, eliminating the first-order term. Rearranging and using the definition of $\hat A_n$,

$$\begin{aligned}
LR_n &= 2\bigl[Q_n\bigl(\hat\theta_n\bigr) - Q_n(\theta_0)\bigr] \\
&= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top\bigl[-n^{-1}\nabla^2 Q_n\bigl(\tilde\theta_n\bigr)\bigr]\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr] \\
&= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top\bigl[-\hat A_n\bigl(\tilde\theta_n\bigr)\bigr]\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr] \\
&= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top\bigl[-A^{-1}\bigr]^{-1}\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr] + o_p(1).
\end{aligned}$$

But this is the end of the line. The asymptotic variance of $n^{1/2}(\hat\theta_n - \theta_0)$ is $A^{-1}BA^{-1}$, which is not equal to $-A^{-1}$ in general.

But suppose we're in the land of correctly specified MLE. In this case, $A^{-1}BA^{-1} = -A^{-1}$ by the information matrix equality, so we get $LR_n \xrightarrow{d} \chi^2(k)$. Moreover, we then have

$$LR_n = \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr]^\top \bigl[\hat A_n\bigl(\hat\theta_n\bigr)^+\hat B_n\bigl(\hat\theta_n\bigr)\hat A_n\bigl(\hat\theta_n\bigr)^+\bigr]^+\bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)\bigr] + o_p(1)
= W_n + o_p(1) = LM_n + o_p(1).$$
All of our dreams have come true.
The key to getting the LR stat to be asymptotically equivalent to the
Wald and LM stats was the information matrix equality. As we’ll see when
studying efficient GMM in section 10, generalisations of the information
matrix equality are available for certain estimators outside the MLE class.
(As in the MLE context, the information matrix equality is tightly linked to
efficiency.) We therefore summarise our result in a way that extends beyond
the MLE context:

Proposition 24. Assume the hypotheses of our asymptotic normality result for extremum estimators (p. 84), and further suppose that $-A = B$. Then $LR_n = LM_n + o_p(1) = W_n + o_p(1)$, and each converges weakly to $\chi^2(k)$.
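To fix ideas, here is a hedged toy computation (my own, not from the notes) of the three statistics for the simple null $\theta = \theta^\star$ in the model $y \sim N(\theta, 1)$, where $Q_n$ is the log-likelihood; the three values are numerically close, as Proposition 24 predicts:

```python
# Hedged sketch of the trinity in a toy model where A = -1 and B = 1, so
# the sandwich A^{-1} B A^{-1} reduces to the relevant B-hat estimate.
import numpy as np

rng = np.random.default_rng(4)
n, theta_star = 400, 0.0
y = rng.normal(0.1, 1.0, size=n)        # truth 0.1, a nearby alternative

theta_hat = y.mean()                     # the MLE
Q = lambda t: -0.5 * np.sum((y - t) ** 2)

LR = 2 * (Q(theta_hat) - Q(theta_star))
score = np.sum(y - theta_star)                 # gradient of Q at theta_star
LM = (score / np.sqrt(n)) ** 2 / np.mean((y - theta_star) ** 2)
W = n * (theta_hat - theta_star) ** 2 / np.mean((y - theta_hat) ** 2)

print(LR, LM, W)  # within o_p(1) of each other; ≈ chi^2(1) under the null
```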

9.3 Composite hypotheses


Conceptually, not much changes when the null hypothesis is composite. Our
null hypothesis is θ0 ∈ Θ0 . (Θ0 is singleton iff the hypothesis is simple.) To get
a handle on this, define the function h : Θ → Rq by h(θ) = 0 iff θ ∈ Θ0 . We
impose q ≤ k: you can’t have more restrictions than Θ has dimensions. This
is wlog since if q > k then either some constraints don’t bind at the optimum,
or there is no solution. Our null hypothesis can therefore be expressed as
h(θ0 ) = 0.
We make the crucial (and restrictive) assumption that h is smooth; in
particular that it is continuously differentiable in a neighbourhood of θ0 ,
with q × k derivative Dh. We further assume that Dh(θ0 ) has rank q (full
rank). If this were not the case, then there would be redundant constraints
encoded in h, so this assumption is wlog.
Define $\tilde\theta_n$ to be a constrained extremum estimator, meaning a solution to

$$\max_{\theta\in\Theta}\; Q_n(\theta) \quad\text{s.t.}\quad h(\theta) = 0.$$

If the null is true, then with high probability the constraint won't be very binding in a large sample. One implication is that $Q_n(\hat\theta_n) - Q_n(\tilde\theta_n)$ should be small; this is what the LR stat examines. Another implication is that the Lagrange multiplier on the constraint should be small; the LM test checks this. (Hence the name, obviously.) A final implication is that $h(\hat\theta_n)$ should be close to $h(\theta_0) = 0$; this is the basis for the Wald test.

The test statistics are

$$\begin{aligned}
LR_n &= 2\bigl[Q_n\bigl(\hat\theta_n\bigr) - Q_n\bigl(\tilde\theta_n\bigr)\bigr] \\
LM_n &= \bigl[n^{-1/2}\nabla Q_n\bigl(\tilde\theta_n\bigr)\bigr]^\top \bigl[\hat B_n\bigl(\tilde\theta_n\bigr)\bigr]^+ \bigl[n^{-1/2}\nabla Q_n\bigl(\tilde\theta_n\bigr)\bigr] \\
W_n &= \bigl[n^{1/2} h\bigl(\hat\theta_n\bigr)\bigr]^\top \bigl[Dh\bigl(\hat\theta_n\bigr)\,\hat A_n\bigl(\hat\theta_n\bigr)^+ \hat B_n\bigl(\hat\theta_n\bigr)\hat A_n\bigl(\hat\theta_n\bigr)^+\, Dh\bigl(\hat\theta_n\bigr)^\top\bigr]^+ \bigl[n^{1/2} h\bigl(\hat\theta_n\bigr)\bigr].
\end{aligned}$$

As far as the asymptotic theory is concerned, the filling in the LM and Wald sandwiches can be any consistent estimators. The advantage of these particular choices is that we can compute $LM_n$ without having to estimate the unrestricted model, and we can compute $W_n$ without having to estimate the restricted model. This flexibility can be a godsend when one of $\hat\theta_n$ and $\tilde\theta_n$ is hard to compute. In contrast, we have to estimate both the restricted and unrestricted models to compute $LR_n$.
Unluckily for us, the Monte Carlo evidence is that the $\chi^2$ approximation is much better for $LR_n$ than it is for either $LM_n$ or $W_n$. There's some intuition behind this: the latter two involve an intermediate step of variance estimation, which introduces an additional source of noise into the test statistics. This noise vanishes asymptotically (of course), but in a finite sample it affects the distribution. Moreover, this noise in the filling of the LM and Wald sandwiches might be correlated with the bread. Again this correlation vanishes asymptotically since the filling converges in probability to a nonstochastic limit, but it may be important in finite samples.
The asymptotic distribution of $W_n$ is basically immediate from the delta method. As in the previous section, consistency of $\hat\theta_n$ and continuity of $\hat A_n$ and $\hat B_n$ yield $\hat A_n(\hat\theta_n) = \hat A_n(\theta_0) + o_p(1)$ and $\hat B_n(\hat\theta_n) = \hat B_n(\theta_0) + o_p(1)$ by the continuous mapping theorem, and the right-hand sides converge in probability to $A$ and $B$ (respectively) by Khinchine's WLLN (p. 56). By the delta method (p. 49),

$$n^{1/2} h\bigl(\hat\theta_n\bigr) = n^{1/2}\bigl[h\bigl(\hat\theta_n\bigr) - h(\theta_0)\bigr] \xrightarrow{d} N_q\bigl(0,\, Dh(\theta_0)A^{-1}BA^{-1}Dh(\theta_0)^\top\bigr).$$

Putting this all together using Slutsky's theorem,

$$\begin{aligned}
W_n &= \bigl[n^{1/2}h\bigl(\hat\theta_n\bigr)\bigr]^\top\bigl[Dh(\theta_0)A^{-1}BA^{-1}Dh(\theta_0)^\top\bigr]^{-1}\bigl[n^{1/2}h\bigl(\hat\theta_n\bigr)\bigr] + o_p(1) \\
&\xrightarrow{d} N_q\bigl(0, Dh(\theta_0)A^{-1}BA^{-1}Dh(\theta_0)^\top\bigr)^\top \bigl[Dh(\theta_0)A^{-1}BA^{-1}Dh(\theta_0)^\top\bigr]^{-1} N_q\bigl(0, Dh(\theta_0)A^{-1}BA^{-1}Dh(\theta_0)^\top\bigr) \\
&\overset{d}{=} N_q(0,I)^\top N_q(0,I) \overset{d}{=} \chi^2(q).
\end{aligned}$$

For the LM stat, first observe that $\tilde\theta_n$ is consistent because we're maintaining the assumptions of our consistency result for extremum estimators (p. 78). To show this carefully, note that we can write

$$\Theta_0 = \Theta \cap \{\theta \in \mathbb{R}^k : h(\theta) = 0\}.$$

$\Theta$ is compact and $\{\theta \in \mathbb{R}^k : h(\theta) = 0\}$ is closed, so $\Theta_0$ is compact. $Q_n$ is continuous on $\Theta$, hence continuous on $\Theta_0$. $n^{-1}Q_n \xrightarrow{p} Q$ uniformly on $\Theta$, hence also on $\Theta_0$. Finally, $Q$ has a unique maximum on $\Theta$ at $\theta_0$, hence a fortiori has a unique maximum on $\Theta_0$ at $\theta_0$. So the conditions of the consistency proposition (p. 78) hold on $\Theta_0$, whence it follows that $\tilde\theta_n$ is consistent for $\theta_0$.

By consistency of $\tilde\theta_n$ and continuity of $\hat B_n$, the continuous mapping theorem yields $\hat B_n(\tilde\theta_n) = \hat B_n(\theta_0) + o_p(1)$. Moreover, $\hat B_n(\theta_0) \xrightarrow{p} B$ by Khinchine's WLLN (p. 56).

Since we're maintaining the hypotheses of the asymptotic normality result for extremum estimators (p. 84), we have that $Q_n$ is differentiable near $\theta_0$ and that $\theta_0$ is interior to $\Theta$. Since $\tilde\theta_n$ is consistent for $\theta_0$, it follows that for large $n$, with high probability, the FOC holds:

$$\nabla Q_n\bigl(\tilde\theta_n\bigr) = Dh\bigl(\tilde\theta_n\bigr)^\top \lambda_n,$$

where $\lambda_n$ is a (random) $q$-vector of Lagrange multipliers. We will proceed in the same informal manner as in our proof of asymptotic normality (p. 84) by behaving as if the FOC always holds.

Since $\tilde\theta_n$ is consistent for $\theta_0$, and since $\nabla Q_n$ and $Dh$ are continuous at $\theta_0$ by assumption, the continuous mapping theorem lets us write

$$\nabla Q_n(\theta_0) = Dh(\theta_0)^\top \lambda_n + o_p(1),$$

and hence

$$n^{-1/2}\nabla Q_n(\theta_0) = Dh(\theta_0)^\top\bigl(n^{-1/2}\lambda_n\bigr) + o_p\bigl(n^{-1/2}\bigr).$$

One of the hypotheses of our asymptotic normality result is that the LHS converges in distribution to $N_k(0, B)$. Hence the RHS must do the same:

$$Dh(\theta_0)^\top\bigl(n^{-1/2}\lambda_n\bigr) \xrightarrow{d} N_k(0, B),$$

whence it follows that $n^{-1/2}\lambda_n \xrightarrow{d} N_q(0, V)$ for some $V$ that satisfies

$$Dh(\theta_0)^\top V Dh(\theta_0) = B.$$

$B$ is a nondegenerate (i.e. positive definite) variance matrix, so is invertible; therefore

$$\bigl[Dh(\theta_0)^\top V Dh(\theta_0)\bigr]^{-1} = B^{-1}.$$

It follows that

$$Dh(\theta_0)\bigl[Dh(\theta_0)^\top V Dh(\theta_0)\bigr]^{-1} Dh(\theta_0)^\top = Dh(\theta_0) B^{-1} Dh(\theta_0)^\top.$$

Now here's a fun fact that you can (very easily) prove at home: because $V$ is invertible (since it's a nondegenerate variance matrix) and $Dh(\theta_0)Dh(\theta_0)^\top$ has full rank $q$ (since $Dh(\theta_0)$ has rank $q$), we have

$$Dh(\theta_0)\bigl[Dh(\theta_0)^\top V Dh(\theta_0)\bigr]^{-1} Dh(\theta_0)^\top = V^{-1}. \tag{9}$$

Hence

$$Dh(\theta_0) B^{-1} Dh(\theta_0)^\top = V^{-1}. \tag{10}$$
Putting together the pieces and using (10),

$$\begin{aligned}
LM_n &= \bigl[n^{-1/2}\nabla Q_n\bigl(\tilde\theta_n\bigr)\bigr]^\top \bigl[\hat B_n\bigl(\tilde\theta_n\bigr)\bigr]^+ \bigl[n^{-1/2}\nabla Q_n\bigl(\tilde\theta_n\bigr)\bigr] \\
&= \bigl[Dh(\theta_0)^\top\bigl(n^{-1/2}\lambda_n\bigr)\bigr]^\top B^{-1} \bigl[Dh(\theta_0)^\top\bigl(n^{-1/2}\lambda_n\bigr)\bigr] + o_p(1) \\
&= \bigl(n^{-1/2}\lambda_n\bigr)^\top Dh(\theta_0) B^{-1} Dh(\theta_0)^\top \bigl(n^{-1/2}\lambda_n\bigr) + o_p(1) \\
&= \bigl(n^{-1/2}\lambda_n\bigr)^\top V^{-1} \bigl(n^{-1/2}\lambda_n\bigr) + o_p(1).
\end{aligned}$$

Applying Slutsky's theorem then yields

$$LM_n \xrightarrow{d} N_q(0,V)^\top V^{-1} N_q(0,V) \overset{d}{=} N_q(0,I)^\top N_q(0,I) \overset{d}{=} \chi^2(q).$$

Finally, let’s turn to the likelihood ratio statistic. All the steps in our
derivation in the previous section (for a simple hypothesis) still go through,
giving us
h i>  −1 h i
LRn = n1/2 θbn − θen −A−1 n1/2 θbn − θen + op (1).
 

116

A first-order mean-value expansion of h θbn around θen yields
     
h θbn = h θen + Dh θn θbn − θen = Dh θn θbn − θen

where the mean value θn lies between θbn and θen , so is consistent for θ0 . So
h i> h  > i+ h 1/2
 i
LRn = n1/2 h θbn −A−1 Dh θn

Dh θn n h θbn + op (1)
h i> h   > i+ h 1/2 i
= n1/2 h θbn Dh θbn −A−1 Dh θbn n h θbn + op (1).

As before, this is not asymptotically χ2 in general. When −A = B (an


information matrix equality holds, e.g. MLE), the asymptotic variance of
θbn − θen is −A−1 , which immediately gives us LRn = Wn + op (1), hence
d
LRn −→ χ2 (q).

9.4 Power
We’ve now figured out the asymptotic distributions of our statistics under the
null. But as mentioned above, the tests are not much use if their distributions
under the alternative are very similar to their null distributions: then power
will be low, so we won’t be very likely to reject the null even when it is false.
First, we’d like to show that our tests are consistent (consistent against
every alternative θ0 = θ ∈ Θc0 ).74 Since the asymptotic null distribution is
χ2 , this will require us to show that our test statistics explode under the
null; then in a large sample, with high probability, they will be larger than
we would expect a χ2 -distributed random variable to be, leading us to reject
the null.
Secondly, we’d like to know how powerful they are. We could approach
this by studying the rate at which the test stats explode, but that turns out
not to be fruitful. Instead, we’ll consider the behaviour of the test stat when
the null is nearly true, using a formal device called a Pitman drift. This will
allow us to derive the asymptotic distribution of each test stat explicitly,
and then we can read off our tests’ asymptotic rejection probabilities from
the asymptotic distribution under the alternative. This limiting rejection
probability (under local-DGP asymptotics) is called the local power.
The exposition in this section will be a little looser. We’ll consider only
the case of a simple hypothesis, and we’ll limit our derivations to the Wald
74
This is a nice property to have, and we will have it here. But there are many well-known
tests that are not consistent against all alternatives. One of these is the information matrix
test, which is inconsistent against DGPs for which the information matrix equality holds
despite misspecification. (Which can happen.)

statistic. The null hypothesis is that $\theta_0 = \theta^\star$, but actually the true value $\theta_0$ is $\ne \theta^\star$. The Wald statistic for a simple hypothesis from section 9.2 can be written

$$\begin{aligned}
W_n &= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr) + n^{1/2}(\theta_0 - \theta^\star)\bigr]^\top \bigl[\hat A_n\bigl(\hat\theta_n\bigr)^+\hat B_n\bigl(\hat\theta_n\bigr)\hat A_n\bigl(\hat\theta_n\bigr)^+\bigr]^+ \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr) + n^{1/2}(\theta_0 - \theta^\star)\bigr] \\
&= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr) + n^{1/2}(\theta_0 - \theta^\star)\bigr]^\top \bigl[A^{-1}BA^{-1}\bigr]^{-1} \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr) + n^{1/2}(\theta_0 - \theta^\star)\bigr] + o_p(1) \\
&= \bigl[N_k\bigl(0, A^{-1}BA^{-1}\bigr) + n^{1/2}(\theta_0 - \theta^\star)\bigr]^\top \bigl[A^{-1}BA^{-1}\bigr]^{-1} \bigl[N_k\bigl(0, A^{-1}BA^{-1}\bigr) + n^{1/2}(\theta_0 - \theta^\star)\bigr] + o_p(1) \\
&= \chi^2(k) + n\,(\theta_0 - \theta^\star)^\top\bigl[A^{-1}BA^{-1}\bigr]^{-1}(\theta_0 - \theta^\star)
+ 2\,n^{1/2}(\theta_0 - \theta^\star)^\top\bigl[A^{-1}BA^{-1}\bigr]^{-1} N_k\bigl(0, A^{-1}BA^{-1}\bigr) + o_p(1).
\end{aligned}$$

This is a χ2 plus something deterministic that explodes plus something


stochastic that explodes. So Wn definitely explodes, and hence the Wald test
is consistent. The consistency of the LR and LM tests can be demonstrated
in pretty much the same way.
To show consistency, we used fixed-DGP asymptotics: we held the truth $\theta_0$ (and the null hypothesis $\theta^\star$) fixed and studied the behaviour of $W_n$ as $n$ grew large. Consistency means precisely that for any fixed DGP, the rejection probability converges to unity as $n \to \infty$. This is analogous to the consistency of an estimator: it eventually ends up in the right place.
But we’d like an asymptotic approximation that tells us how close it is
to the right place. For estimators, we did this by blowing up the estimator
by n1/2 and showing that this blown-up object converges to a normal rather
than a point; we then used this normal to approximate the finite-sample
distribution of the estimator. By analogy, we’d like to find a way to mess
with our test statistic in such a way that it doesn’t explode.
The heuristic reason why Wn explodes under fixed-alternative asymptotics
is that as n grows larger, any given false null becomes increasingly easy to
reject because of lower sampling variation. So to prevent it from exploding,
we need the DGP’s law to drift with n in such a way that the (fixed) null
θ? becomes increasingly hard to reject as n increases. If we make it harder
at just the right rate, we might hope that our test stat will converge in
distribution rather than vanishing or exploding.

A DGP law $\mu^{\theta_0}$ for which the null is hard to reject is precisely one such that $\theta_0$ (the truth) is close to the null hypothesis $\theta^\star$ being tested. So we need to let $\theta_0$ drift toward $\theta^\star$ as $n$ increases. To that end, consider a sequence of DGP laws $\{\mu^{\theta_{0,n}}\}$ such that

$$\theta_{0,n} := \theta^\star + n^{-1/2}\mu + o\bigl(n^{-1/2}\bigr)$$

for some fixed $\mu \in \mathbb{R}^k$. (Such a sequence is called a Pitman drift.) The algebra is very easy: rearrange to get $n^{1/2}(\theta_{0,n} - \theta^\star) = \mu + o(1)$, then substitute:

$$\begin{aligned}
W_n &= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_{0,n}\bigr) + n^{1/2}(\theta_{0,n} - \theta^\star)\bigr]^\top \bigl[\hat A_n\bigl(\hat\theta_n\bigr)^+\hat B_n\bigl(\hat\theta_n\bigr)\hat A_n\bigl(\hat\theta_n\bigr)^+\bigr]^+ \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_{0,n}\bigr) + n^{1/2}(\theta_{0,n} - \theta^\star)\bigr] \\
&= \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_{0,n}\bigr) + \mu\bigr]^\top \bigl[A^{-1}BA^{-1}\bigr]^{-1} \bigl[n^{1/2}\bigl(\hat\theta_n - \theta_{0,n}\bigr) + \mu\bigr] + o_p(1) \\
&\xrightarrow{d} \bigl[N_k\bigl(0, A^{-1}BA^{-1}\bigr) + \mu\bigr]^\top \bigl[A^{-1}BA^{-1}\bigr]^{-1} \bigl[N_k\bigl(0, A^{-1}BA^{-1}\bigr) + \mu\bigr] \\
&\overset{d}{=} \chi^2(k, \lambda(\mu)), \quad\text{where } \lambda(\mu) := \mu^\top\bigl[A^{-1}BA^{-1}\bigr]^{-1}\mu,
\end{aligned}$$

and $\chi^2(k, \lambda)$ denotes the noncentral $\chi^2$ distribution with $k$ degrees of freedom and noncentrality parameter $\lambda$. Note that the convergence in distribution above requires a CLT for triangular arrays.75 What this tells us is that when the truth is close to but not equal to $\theta^\star$, the distribution of $W_n$ is well-approximated by a noncentral $\chi^2$.76
The first step toward approximating the power of our test is to figure out the asymptotic rejection probability for a given $\mu$. This is straightforward: for a given size $\alpha$, obtain the critical value $c^\alpha$ from the $\chi^2(k)$ quantile function (inverse CDF), then plug it into the $\chi^2(k, \lambda(\mu))$ CDF:

$$Q^\alpha(\mu) := 1 - F_{\chi^2(k,\lambda(\mu))}\Bigl(F_{\chi^2(k)}^{-1}(1-\alpha)\Bigr).$$

$Q^\alpha(\mu)$ is the limiting rejection probability for size $\alpha$ along the sequence of local DGP laws.

75 A triangular array is a double-indexed sequence $\{\{y_{n,N}\}_{n=1}^N\}_{N=1}^\infty$ of random variables. In our case, the triangular array is 'iid within rows', meaning that $\{y_{n,N}\}_{n=1}^N$ is an iid sequence for each $N \in \mathbb{N}$. Most of our central limit theorems apply to triangular arrays; for example, the Lindeberg–Feller CLT extends without change to triangular arrays that are independent within rows.

76 I've tried to make this clear, but it's worth repeating: the Pitman drift is just a technique that lets us approximate the power of a test. We're not actually interested in something weird like the asymptotic behaviour of a test when the DGP changes with the sample size to make your life harder.

Recall that what we want is to approximate the power function $P_n^\alpha$ for a fixed $n$. Expressed more longwindedly, for any $\theta_0 \ne \theta^\star$, we want an approximation to the power $P_n^\alpha(\theta_0)$ of our test of size $\alpha$ of the null $\theta_0 = \theta^\star$ when the sample size is $n$. So for fixed $n$ and $\theta_0$, choose $\mu \in \mathbb{R}^k$ so that the $n$th DGP law $\mu^{\theta_{0,n}}$ in the Pitman sequence has $\theta_{0,n} = \theta_0$:

$$\theta_0 = \theta^\star + n^{-1/2}\mu, \quad\text{or}\quad \mu = n^{1/2}(\theta_0 - \theta^\star).$$

Then our local-DGP asymptotics tell us that when $\theta_0$ is the truth and the sample size is $n$, the approximate distribution of the Wald stat is $\chi^2(k, \lambda)$, where the noncentrality parameter is

$$\lambda = \bigl[n^{1/2}(\theta_0 - \theta^\star)\bigr]^\top\bigl[A^{-1}BA^{-1}\bigr]^{-1}\bigl[n^{1/2}(\theta_0 - \theta^\star)\bigr].$$

So the approximate distribution of the Wald stat under the alternative $\theta_0$ will be very different from the central $\chi^2$ distribution from which we compute critical values whenever $n$ is large, $\theta_0$ is very different from the null $\theta^\star$, and the asymptotic variance $A^{-1}BA^{-1}$ is small. Intuitive!
The local power against the alternative $\theta_0$ of our test of size $\alpha$ of the null $\theta^\star$ with sample size $n$ is defined

$$\hat P_n^\alpha(\theta_0) := Q^\alpha\bigl(n^{1/2}(\theta_0 - \theta^\star)\bigr)
= 1 - F_{\chi^2(k,\lambda(n^{1/2}(\theta_0 - \theta^\star)))}\Bigl(F_{\chi^2(k)}^{-1}(1-\alpha)\Bigr) \quad\text{for each } \theta_0 \in \Theta.$$

By varying $\theta_0$, we trace out a function $\hat P_n^\alpha : \Theta \to [0,1]$ that approximates the power envelope $P_n^\alpha$. As we would hope, $\hat P_n^\alpha(\theta^\star) = \alpha$, the asymptotic rejection probability when the null is true; formally this is because $\theta_0 = \theta^\star$ corresponds to $\mu = 0$, which is how we did the asymptotics under the null. As we vary $n$, we trace out a family $\{\hat P_n^\alpha\}_{n\in\mathbb{N}}$ of approximate power curves corresponding to different sample sizes. (We can also vary $\alpha$ and $\theta^\star$ as desired. I did not index $P_n^\alpha$ and $\hat P_n^\alpha$ by $\theta^\star$, but that was only to avoid clutter.)
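The local power function is easy to compute; here is a minimal sketch (my own, using scipy's noncentral $\chi^2$; the function name is mine):

```python
# Hedged sketch of the local power approximation for the Wald test.
# V_inv stands for the inverse of the sandwich A^{-1} B A^{-1}.
import numpy as np
from scipy import stats

def local_power(theta0, theta_star, n, V_inv, alpha=0.05):
    k = len(theta_star)
    d = np.sqrt(n) * (np.asarray(theta0) - np.asarray(theta_star))
    lam = d @ V_inv @ d                      # noncentrality parameter
    c = stats.chi2.ppf(1 - alpha, df=k)      # central chi^2(k) critical value
    return 1 - stats.ncx2.cdf(c, df=k, nc=lam)

# e.g. a scalar parameter with asymptotic variance 1:
print(local_power([0.2], [0.0], n=200, V_inv=np.eye(1)))
```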
To get an idea of what we've obtained, a typical family $\{\hat P_n^\alpha\}_{n\in\mathbb{N}}$ of local power envelopes is depicted in Figure 3. The rejection probability at the null is $\alpha$ because we constructed our test to control size asymptotically. The power against alternatives close to the null is low; formally this is because the noncentrality parameter is then small, so our noncentral $\chi^2$ distribution is close to the (central) $\chi^2$ null distribution. Power increases as we move away from the null, and also increases with sample size.77,78

[Figure 3 – A family of local power envelopes ($\hat P_{50}^\alpha$, $\hat P_{200}^\alpha$, $\hat P_{1000}^\alpha$) for $\Theta \subseteq \mathbb{R}$ for a given $\alpha$ and $\theta^\star$. Power is high when $\theta_0$ is far from the null and $n$ is large.]
Finally, the derivations in section 9.2 showing that the LR, LM and Wald
statistics are asymptotically equivalent (i.e. within op (1) of each other) apply
with almost no changes to this new asymptotic environment. It follows that
the LR and LM statistics are also noncentral-χ2 -distributed, with the same
noncentrality parameter, under local-DGP asymptotics. Their local power
is therefore the same as that of the Wald test. Moreover, all of the results
extend to (smooth) composite hypotheses: the trinity tests are still consistent,
and their power can be approximated using a Pitman drift.

77 The local power envelope of an arbitrary test need not be symmetric about $\theta^\star$, nor need it be monotonic on either side of $\theta^\star$. These properties do hold for the Wald test, however, because the noncentrality varies monotonically and symmetrically with $\theta_0$.

78 It is important that Figure 3 depicts the local power envelopes. The true power envelope family $\{P_n^\alpha\}_{n\in\mathbb{N}}$ may not be as well-behaved, since the DGP could be weird. The true rejection probability under the null will not be $\alpha$, though it should be 'close' to $\alpha$ when $n$ is large. Moreover, the power curves need not have the nice symmetric and monotonic shape depicted, though again they will 'nearly' have these properties for $n$ large.

10 The generalised method of moments estimator
Official reading: Amemiya (1985, sec. 8.1.1 and 8.2.2) and Newey and Mc-
Fadden (1994, sec. 2.2.3, 2.5, 3.3 and 4.3).

10.1 Preliminaries
The generalised method of moments (GMM), introduced by Hansen (1982), is
a pretty general (haha) technique for estimation and inference that subsumes
most parametric methods as special cases. It occupies an important place in
the history of econometrics: before GMM, there were many disparate methods
such as 2SLS, 3SLS and so on (as you can see by looking at Amemiya (1985),
which was written before GMM was introduced). GMM subsumed these
classical methods as special cases.
We will treat the case in which the DGP $\{y_i\}$ is iid. We have some (economic) model that gives us the moment conditions $E\bigl(\tilde g(y_1, \theta_0)\bigr) = 0$ for some known function $\tilde g : \mathbb{R}^r \times \Theta \to \mathbb{R}^q$.79 Even if the model imposes additional structure on the data, GMM makes use only of these moment conditions.80 As usual we define the random functions $g_i : \Theta \to \mathbb{R}^q$ by $g_i(\omega)(\theta) := \tilde g(y_i(\omega), \theta)$, allowing us to write the moment conditions as $E(g_1(\theta_0)) = 0$.
So we have $q$ restrictions on the $k$-dimensional parameter $\theta_0$. We assume that $\theta_0$ is point-identified, meaning that $\theta_0$ is the unique solution in $\Theta$ to $E(g_1(\theta)) = 0$. A necessary condition for this is (obviously) $q \ge k$. The case $q > k$ is called overidentification; in the lingo, the model gives us overidentifying restrictions on $\theta_0$ in this case.81 When the moment conditions are misspecified and $q > k$, it should be clear that $E(g_1(\theta)) = 0$ may not have a solution in $\Theta$. This indicates that we can formulate a specification test for the model by exploiting the overidentifying restrictions; we will work out the details in section 10.5 below.82
79 A rather nice thing is that a lot of economic models give rise to moment conditions; Euler equations are one example. By contrast, it's very rare that a convincing economic model gives rise to a more restrictive statistical model such as a density (as required for maximum likelihood).

80 This is another advantage of GMM over e.g. ML. Suppose we have a model that gives us a likelihood from which we can derive moment conditions. If the likelihood is misspecified then our estimates will in general be inconsistent. But if the moment conditions hold (a much weaker condition in general), then GMM will give us consistent estimates.

81 Correspondingly, the case $q = k$ is called 'exact identification'. Then we have enough restrictions to identify $\theta_0$, but no more.

82 More generally, we could allow $E(g_1(\theta)) = 0$ to have multiple solutions, in which case we obtain partial identification. Another form of partially-identified GMM comes from replacing the moment equalities with moment inequalities. The latter is currently a hot topic in econometric theory.
Every parametric estimator that we have mentioned so far is a GMM estimator for some choice of $\tilde g$. In particular, any extremum estimator for which the FOC $\nabla Q_n(\hat\theta_n) = 0$ holds is a GMM estimator. Since we're only covering GMM for the iid case, restrict attention to separable criterion functions, so that the FOC can be written

$$\sum_{i=1}^n \nabla q_i\bigl(\hat\theta_n\bigr) = 0.$$

Any such estimator is an exactly-identified ($q = k$) GMM estimator with $g_i = \nabla q_i$.

One example is the MLE, for which $g_i = \nabla\ell_i^1$. Another is the nonlinear least squares estimator, for which

$$\tilde g((y, x), \theta) = (y - f(x, \theta))\, x,$$

and the moment condition is derived from the assumption $E(y|x) = f(x, \theta_0)$. Yet another is the LAD estimator, where

$$\tilde g((y, x), \theta) = 2 \cdot \mathbf{1}(y \le f(x, \theta)) - 1,$$

and the moment condition is derived from the assumption that the median of the distribution of $y$ conditional on $x$ is $f(x, \theta)$.
When $q = k$ and $\tilde g$ is well-behaved, the sample moment condition

$$n^{-1}\sum_{i=1}^n g_i(\theta) = 0$$

will have a unique solution. This value is called the method-of-moments estimator, and the sample moment condition is sometimes called an estimating equation in this context. But what can we do when $q > k$ or $\tilde g$ is ill-behaved? One possibility is to throw away $q - k$ moment conditions and choose $\tilde\theta_n$ to make $n^{-1}\sum_{i=1}^n g_i(\theta)$ as close as possible to zero in some metric (exactly equal to zero if $\tilde g$ is well-behaved). More generally, we could set $k$ linear combinations of the sample moments as close as possible to zero (in some metric).
GMM is similar to the latter suggestion. We keep all $q$ moment conditions (instead of combining them into $k$ conditions), and we minimise their distance from zero in some metric. GMM uses a particular metric: a quadratic form
in the sample moments. Writing $G_n := n^{-1/2}\sum_{i=1}^n g_i$, the GMM objective function is

$$Q_n(\theta) := \frac12 G_n(\theta)^\top W_n G_n(\theta) = \frac12\Bigl[n^{-1/2}\sum_{i=1}^n g_i(\theta)\Bigr]^\top W_n \Bigl[n^{-1/2}\sum_{i=1}^n g_i(\theta)\Bigr].^{83}$$

The $q \times q$ weight matrix $W_n$ is allowed to be stochastic (a function of the data), but must satisfy $W_n = W + o_p(1)$ for some symmetric, positive definite, nonstochastic $q \times q$ matrix $W$. The GMM estimator (for given moment conditions $E(g_1(\theta_0)) = 0$ and given weight matrix $W_n$) minimises $Q_n$. (So the GMM estimator is an extremum estimator.)
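A hedged sketch (my own; the moment function and data are toy choices, and the solver is an arbitrary pick) of a generic GMM estimator that minimises this quadratic form numerically:

```python
# Hedged sketch of GMM for iid data: g_tilde(y_i, theta) returns the q
# moment functions; W is the (possibly estimated) weight matrix.
import numpy as np
from scipy.optimize import minimize

def gmm_estimate(g_tilde, data, W, theta_init):
    def objective(theta):
        G = np.sqrt(len(data)) * np.mean(
            [g_tilde(y, theta) for y in data], axis=0)  # n^{-1/2} sum_i g_i
        return 0.5 * G @ W @ G
    return minimize(objective, theta_init, method="Nelder-Mead").x

# Toy example: overidentified estimation of a normal mean theta from the
# two moments E(y - theta) = 0 and E(y^2 - theta^2 - 1) = 0 (unit variance).
rng = np.random.default_rng(5)
y = rng.normal(1.5, 1.0, size=2_000)
g = lambda yi, t: np.array([yi - t[0], yi**2 - t[0]**2 - 1.0])
print(gmm_estimate(g, y, np.eye(2), np.array([0.0])))
```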

10.2 Consistency
Conditions for consistency of GMM can be obtained (essentially) from our consistency results for extremum estimators (pp. 78 and 80). We'll give primitive conditions for strong consistency in the iid case.

Assume that $\Theta$ is compact and that each $g_i$ is continuous (so that $Q_n$ is). Further assume that $E\bigl(\sup_{\theta\in\Theta}|g_1(\theta)|\bigr) < \infty$. Then Jennrich's uniform SLLN (p. 58) applies, giving us

$$n^{-1/2}G_n = n^{-1}\sum_{i=1}^n g_i \xrightarrow{a.s.} g \quad\text{uniformly over }\Theta,$$

where $g : \Theta \to \mathbb{R}^q$ is the nonstochastic function $g(\theta) := E(g_1(\theta))$. Hence

$$n^{-1}Q_n = \frac12\bigl[n^{-1/2}G_n\bigr]^\top W_n \bigl[n^{-1/2}G_n\bigr] \xrightarrow{a.s.} \frac12 g^\top W g =: Q \quad\text{uniformly over }\Theta.$$

We assumed point identification, which says precisely that

$$g(\theta) = E(g_1(\theta)) = 0 \quad\text{iff}\quad \theta = \theta_0.$$

Hence $Q(\theta) = \frac12 g(\theta)^\top W g(\theta) > 0$ for $\theta \ne \theta_0$ (remember that $W$ is positive definite) and $Q(\theta_0) = 0$, so $Q$ is uniquely minimised at $\theta_0$.

We've now verified all of the conditions of our strong consistency result for extremum estimators (p. 80), so $\hat\theta_n \xrightarrow{a.s.} \theta_0$.
83 Most authors, including Joel, scale the objective function differently. I think that my scaling makes by far the most sense. First, having $n^{-1/2}\sum_{i=1}^n g_i(\theta)$ instead of $n^{-1}\sum_{i=1}^n g_i(\theta)$ or $\sum_{i=1}^n g_i(\theta)$ allows us to directly apply our results for extremum estimators. Second, the $\frac12$ means that $Q_n$ is similar to the likelihood: the efficient GMM estimator will satisfy a generalised information matrix equality, and the LR stat will be $2\bigl[Q_n(\tilde\theta_n) - Q_n(\hat\theta_n)\bigr]$.
10.3 Asymptotic normality
Unsurprisingly, we will appeal to our asymptotic normality result for ex-
tremum estimators (p. 84). Again we treat the iid case and give primitive
conditions.
So maintain the assumptions we imposed to obtain strong consistency in
the previous section, and assume that θ0 ∈ int Θ. Further assume that each
gi is twice continuously differentiable in a neighbourhood of θ0 , so that Qn
is too. Write $Dg_i$ for the $q \times k$ first derivative, and $D^2 g_i$ for the $q \times k \times k$
second derivative. The latter is a three-dimensional array!
The derivatives of $Q_n$ are
\[ \nabla Q_n = [DG_n]^\top W_n G_n \]
\[ \nabla^2 Q_n = [DG_n]^\top W_n [DG_n] + \left[ D^2 G_n \right]^\top W_n G_n. \]
Don't forget that $\left[ D^2 G_n \right]^\top W_n G_n$ is the product of a three-dimensional array
with a matrix, yielding a matrix. My notation for this is not ideal (e.g. it
doesn't tell the reader along what dimensions we're transposing the array),
but it won't matter because this term is going to vanish. The first derivative
of $G_n$ is of course
\[ DG_n = n^{-1/2} \sum_{i=1}^n Dg_i. \]

We'll want $n^{-1/2} DG_n$ to converge uniformly to $Dg$, where
\[ g(\theta) = E(g_1(\theta)) \]
as in the previous section. We already have that $\{Dg_i\}$ are iid and continu-
ous and that $\Theta$ is compact, so we only have to add the assumption that
$E\left( \sup_{\theta\in\Theta} |Dg_1(\theta)| \right) < \infty$. Jennrich's uniform SLLN (p. 58) then tells us that
\[ n^{-1/2} DG_n(\theta) = n^{-1} \sum_{i=1}^n Dg_i(\theta) \xrightarrow{a.s.} Dg \quad \text{uniformly over } \Theta. \]
The uniform boundedness assumption on $g_1$ (that we used to derive consist-
ency) is sufficient for the dominated convergence theorem, and hence for the
interchanging of integration and differentiation. Therefore
\[ Dg(\theta) = \frac{\partial}{\partial\theta^\top} E(g_1(\theta)) = E\left( \frac{d}{d\theta^\top} g_1(\theta) \right) = E(Dg_1(\theta)). \]
Good to know.

It follows by the dominated convergence theorem and independence that
\begin{align*}
B &= \lim_{n\to\infty} E\left( n^{-1} [\nabla Q_n(\theta_0)][\nabla Q_n(\theta_0)]^\top \right) \\
&= \lim_{n\to\infty} E\left( \left[ n^{-1/2} DG_n(\theta_0) \right]^\top W_n G_n(\theta_0) \left( \left[ n^{-1/2} DG_n(\theta_0) \right]^\top W_n G_n(\theta_0) \right)^\top \right) \\
&= \lim_{n\to\infty} E\left( \left[ n^{-1/2} DG_n(\theta_0) \right]^\top W_n G_n(\theta_0) G_n(\theta_0)^\top W_n^\top \left[ n^{-1/2} DG_n(\theta_0) \right] \right) \\
&= [Dg(\theta_0)]^\top W \left\{ \lim_{n\to\infty} E\left( G_n(\theta_0) G_n(\theta_0)^\top \right) \right\} W [Dg(\theta_0)] \\
&= [Dg(\theta_0)]^\top W \left\{ \lim_{n\to\infty} E\left( n^{-1} \sum_{i=1}^n \sum_{j=1}^n g_i(\theta_0) g_j(\theta_0)^\top \right) \right\} W [Dg(\theta_0)] \\
&= [Dg(\theta_0)]^\top W \left\{ \lim_{n\to\infty} E\left( n^{-1} \sum_{i=1}^n g_i(\theta_0) g_i(\theta_0)^\top \right) \right\} W [Dg(\theta_0)] \\
&= [Dg(\theta_0)]^\top W \, E\left( g_1(\theta_0) g_1(\theta_0)^\top \right) W [Dg(\theta_0)].
\end{align*}
(The penultimate equality holds by independence.)


Now let's do something similar for $\nabla^2 Q_n$. Add the Jennrich bounded-
ness condition $E\left( \sup_{\theta\in\Theta} |D^2 g_1(\theta)| \right) < \infty$ on the second derivative; then
$n^{-1/2} D^2 G_n \xrightarrow{a.s.} D^2 g$ uniformly by Jennrich's uniform SLLN. So
\[ n^{-1} \nabla^2 Q_n = \left[ n^{-1/2} DG_n \right]^\top W_n \left[ n^{-1/2} DG_n \right] + \left[ n^{-1/2} D^2 G_n \right]^\top W_n \left[ n^{-1/2} G_n \right] \xrightarrow{a.s.} [Dg]^\top W [Dg] + \left[ D^2 g \right]^\top W g \quad \text{uniformly over } \Theta. \]

So using $g(\theta_0) = 0$, we have
\[ n^{-1} \nabla^2 Q_n(\theta_0) \xrightarrow{a.s.} [Dg(\theta_0)]^\top W [Dg(\theta_0)] + \left[ D^2 g(\theta_0) \right]^\top W g(\theta_0) = [Dg(\theta_0)]^\top W [Dg(\theta_0)]. \]
Hence by the dominated convergence theorem,
\[ A = \lim_{n\to\infty} E\left( n^{-1} \nabla^2 Q_n(\theta_0) \right) = [Dg(\theta_0)]^\top W [Dg(\theta_0)]. \]

The fact that
\[ n^{-1} \nabla^2 Q_n \xrightarrow{a.s.} [Dg]^\top W [Dg] + \left[ D^2 g \right]^\top W g \quad \text{uniformly over } \Theta \]
gives us that for any sequence $\{\theta_n\}$ of random $k$-vectors with $\theta_n \xrightarrow{p} \theta_0$, we
have
\[ n^{-1} \nabla^2 Q_n(\theta_n) \xrightarrow{p} [Dg(\theta_0)]^\top W [Dg(\theta_0)] = A. \]
You may recall that this is what assumption (3) in the asymptotic normality
result for extremum estimators (p. 84) requires.
To verify the fourth and final hypothesis of the asymptotic normality
result, first observe that
\begin{align*}
n^{-1/2} \nabla Q_n(\theta_0) &= \left[ n^{-1/2} DG_n(\theta_0) \right]^\top W_n \left[ G_n(\theta_0) \right] \\
&= [Dg(\theta_0)]^\top W \left[ n^{-1/2} \sum_{i=1}^n g_i(\theta_0) \right] + o_p(1).
\end{align*}
$\{g_i(\theta_0)\}$ are iid with mean zero and variance $E\left( g_1(\theta_0) g_1(\theta_0)^\top \right)$, so by the
multivariate Lindeberg–Lévy CLT and Slutsky's theorem,
\begin{align*}
n^{-1/2} \nabla Q_n(\theta_0) &\xrightarrow{d} [Dg(\theta_0)]^\top W \, N_q\left( 0, E\left( g_1(\theta_0) g_1(\theta_0)^\top \right) \right) \\
&\overset{d}{=} N_k\left( 0, [Dg(\theta_0)]^\top W E\left( g_1(\theta_0) g_1(\theta_0)^\top \right) W [Dg(\theta_0)] \right) \\
&\overset{d}{=} N_k(0, B).
\end{align*}
(Note that the limit is $k$-dimensional: the $q$-dimensional normal is premultiplied
by the $k \times q$ matrix $[Dg(\theta_0)]^\top W$.)

So we've satisfied all the conditions of our asymptotic normality result
(p. 84). Hence
\[ n^{1/2} \left( \hat\theta_n - \theta_0 \right) \xrightarrow{d} N\left( 0, A^{-1} B A^{-1} \right). \]
The asymptotic variance is the beast
\begin{align*}
A^{-1} B A^{-1} = {}& \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W E\left( g_1(\theta_0) g_1(\theta_0)^\top \right) \\
& \times W [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1}.
\end{align*}
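As a concrete companion to this formula, here is a plug-in computation of the sandwich variance (a sketch, not from the notes; Dg_hat and Omega_hat stand for consistent estimates of $Dg(\theta_0)$ and $E(g_1(\theta_0)g_1(\theta_0)^\top)$, which the user must supply):

import numpy as np

def gmm_sandwich_variance(Dg_hat, Omega_hat, W):
    # A = Dg' W Dg (k x k), B = Dg' W Omega W Dg (k x k),
    # asymptotic variance = A^{-1} B A^{-1}.
    A = Dg_hat.T @ W @ Dg_hat
    B = Dg_hat.T @ W @ Omega_hat @ W @ Dg_hat
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv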

10.4 Asymptotic efficiency


In general, there’s no reason to think that the GMM estimator is asymptot-
ically efficient within (say) the class of consistent and asymptotically normal
estimators. For if the moment conditions are uninformative then there’s no
way to obtain a low-variance estimator.
Instead, we fix the moment conditions and search for the asymptotically
most efficient estimator that uses only this information. Since the only
parameter in GMM estimation that can be varied is the weight matrix Wn ,
the theory of efficient GMM amounts to the theory of how to optimally

choose $W_n$. Of course only the probability limit $W$ of $\{W_n\}$ matters, so really
we're choosing $W$.
It turns out that an (infeasible) optimal choice of weight matrix is
$W = E\left( g_1(\theta_0) g_1(\theta_0)^\top \right)^{-1}$. (We won't show this, but it's straightforward
matrix algebra.) So the asymptotic variance of an efficient GMM estimator is
\begin{align*}
A^{-1} B A^{-1} &= \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W W^{-1} \\
&\quad \times W [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} \\
&= \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} \\
&= \left( [Dg(\theta_0)]^\top E\left( g_1(\theta_0) g_1(\theta_0)^\top \right)^{-1} [Dg(\theta_0)] \right)^{-1} \\
&= A^{-1} = B^{-1}.
\end{align*}
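A quick numerical sanity check of this collapse, reusing the gmm_sandwich_variance sketch above (all matrices here are made-up placeholders):

rng = np.random.default_rng(0)
q, k = 5, 2
Dg_hat = rng.normal(size=(q, k))
S = rng.normal(size=(q, q))
Omega_hat = S @ S.T + np.eye(q)  # a positive definite stand-in for E(g1 g1')
W = np.linalg.inv(Omega_hat)     # the (infeasible) optimal weight matrix
V = gmm_sandwich_variance(Dg_hat, Omega_hat, W)
assert np.allclose(V, np.linalg.inv(Dg_hat.T @ W @ Dg_hat))  # V = A^{-1} = B^{-1}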
There’s a close analogy with the efficiency of (correctly-specified) MLE
here. The efficient choice of weight matrix (resp. density) causes a bunch
of cancellations in the asymptotic variance, leaving us with A−1 . (It’s A
rather than −A because we’re minimising Qn , so ∇2 Qn is positive definite
here.) Moreover, we obtain A = B, a generalisation of the information matrix
equality.84 Clearly B −1 is the lower bound on the variance of any GMM
estimator using these moment conditions, analogous to the Cramér–Rao
bound.
So far we only have an infeasible procedure, since it requires knowledge
of $E\left( g_1(\theta_0) g_1(\theta_0)^\top \right)^{-1}$. But provided we can consistently estimate the latter,
we can obtain a feasible estimator. The standard way of doing this (proposed
by Hansen (1982)) is called two-step GMM. First pick some arbitrary weight
matrix $W_n$,$^{85}$ and obtain the GMM estimate $\tilde\theta_n$. Define the $\Theta \to \mathbb{R}^{q\times q}$
function
\[ \widehat{W}_n := \left( n^{-1} \sum_{i=1}^n g_i g_i^\top \right)^+, \]
and estimate $E\left( g_1(\theta_0) g_1(\theta_0)^\top \right)^{-1}$ by $\widehat{W}_n\left( \tilde\theta_n \right)$. This estimator is obviously
consistent under the maintained assumptions. Now do GMM again using
the estimated optimal weight matrix to obtain the two-step GMM estim-
ate $\hat\theta_n$. Since $\widehat{W}_n$ is asymptotically equivalent to $E\left( g_1(\theta_0) g_1(\theta_0)^\top \right)^{-1}$, $\hat\theta_n$ is
asymptotically equivalent to the infeasible efficient GMM estimator above.
The two-step GMM estimator $\hat\theta_n$ is therefore asymptotically efficient within
the class of GMM estimators that use these moment conditions.

$^{84}$ If you remember how we made use of the information matrix equality to derive the
asymptotic distribution of the LR statistic, then you should realise that the LR-type
statistic $2\left( Q_n(\tilde\theta_n) - Q_n(\hat\theta_n) \right)$ will be asymptotically $\chi^2$ in the efficient GMM setting just
as in correctly specified MLE.
$^{85}$ There's some evidence on what is and is not a good idea for a first-step weight matrix.
The identity is very bad; the 2SLS weight matrix is pretty good. (The 2SLS weight matrix
is consistent for $B^{-1}$ in the linear homoskedastic case.)
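Schematically, the two-step procedure looks as follows (a sketch under the same hypothetical setup as the earlier gmm_objective snippet; the moment function g and starting value are user-supplied, and scipy's general-purpose optimiser stands in for whatever minimiser one prefers):

import numpy as np
from scipy.optimize import minimize

def two_step_gmm(g, data, theta_init):
    n = data.shape[0]
    q = g(data[0], theta_init).shape[0]
    # Step 1: a deliberately arbitrary first-step weight (see footnote 85
    # for why the identity is a poor choice in practice).
    step1 = minimize(lambda t: gmm_objective(t, g, data, np.eye(q)), theta_init)
    # Estimate E(g1 g1') at the first-step estimate and (pseudo-)invert it.
    G = np.stack([g(row, step1.x) for row in data])  # (n, q)
    W_hat = np.linalg.pinv(G.T @ G / n)
    # Step 2: re-minimise with the estimated optimal weight matrix.
    step2 = minimize(lambda t: gmm_objective(t, g, data, W_hat), theta_init)
    return step2.x, W_hat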
There are serious finite-sample problems with the two-step procedure. The
intermediate step of estimating E (g1 (θ0 )g1 (θ0 )> )−1 lowers the asymptotic
variance, but at the cost of introducing an additional source of noise into
the estimator. Worse, the noise in estimating the optimal weight matrix
is generally correlated with the noise in the sample moments Gn , which
introduces bias into the two-step GMM estimator. In Monte Carlo studies,
this bias is quite severe for nonlinear DGPs, even in large samples.
There are two obvious ways of dealing with finite-sample bias. On the
one hand, we could just eschew two-step GMM in favour of one-step GMM,
which is asymptotically less efficient but allows for more reliable inference.
(And may be more efficient in a finite sample!) On the other hand, we
could incorporate finite-sample corrections to our estimates. Many have been
proposed, and some of them are quite helpful.
Another solution is to use a one-step procedure that nevertheless delivers
an estimator that is asymptotically equivalent to the infeasible efficient GMM
estimator. The continuously-updated (CUE) GMM estimator minimises
\[ G_n(\theta)^\top \widehat{W}_n(\theta) G_n(\theta), \]
obviating the need for a second step. And it turns out (perhaps unsurprisingly)
to be asymptotically equivalent to the infeasible efficient GMM estimator that
uses weight matrix $E\left( g_1(\theta_0) g_1(\theta_0)^\top \right)^{-1}$. The lack of an initial step turns out
to make a big difference: the Monte Carlo evidence is that the finite-sample
behaviour of the CUE GMM estimator is much better than that of two-step
GMM. Unfortunately, obtaining the CUE GMM estimator is in general a
pretty hard problem: even when $\tilde g$ is linear, it is a nonlinear optimisation
problem.
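In the same schematic style, the CUE objective re-estimates the weight matrix at every candidate $\theta$ (again a hypothetical sketch, not the notes' own code):

import numpy as np

def cue_objective(theta, g, data):
    n = data.shape[0]
    G = np.stack([g(row, theta) for row in data])  # (n, q)
    Gn = G.sum(axis=0) / np.sqrt(n)
    W_hat = np.linalg.pinv(G.T @ G / n)  # weight evaluated at theta itself
    return Gn @ W_hat @ Gn

Because W_hat moves with theta, this is a nonlinear function of theta even when the moments are linear, which is exactly the computational difficulty noted above.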
Yet another one-step method that is asymptotically equivalent to efficient
GMM is the empirical likelihood (EL) estimator. The EL estimator is the
first part of the argmax in the problem
\[ \max_{(\theta, p) \in \Theta \times (0,1)^n} \sum_{i=1}^n \ln(p_i) \quad \text{s.t.} \quad n^{-1} \sum_{i=1}^n p_i g_i(\theta) = 0 \ \text{and} \ \sum_{i=1}^n p_i = 1. \]

Again, the asymptotics are the same as for efficient GMM, but the finite-
sample properties are much better. And again, this estimator is computa-
tionally troublesome, as we’re now maximising over k + n variables with an
additional constraint. There's a broader class of estimators called generalised
empirical likelihood (GEL) estimators which share these good finite-sample
properties. We won’t delve into (G)EL estimation here; see Imbens (2002)
for a nice intro-level survey.

10.5 The J test


Recall that we defined θ0 to be the unique solution to the population moment
condition E(g1 (θ0 )) = 0. So we know that the model is misspecified (in the
precise sense that the moment conditions are inconsistent with each other) if
E(g1 (θ)) = 0 has no solution in Θ.
When q = k (exact identification), provided g1 is moderately well-behaved,
our GMM procedure will force all q sample moment conditions to hold.
Intuitively, whenever a moment condition fails, we will have a degree of
freedom (a parameter whose estimate we have not yet determined) that we
can play with until the last sample moment is satisfied.
But consider the overidentified case q > k. Intuitively, we can only force k
of the sample moment conditions to hold, leaving another q − k free to do as
they please. If the model is correctly specified (the moment conditions hold
in the population) then the additional moments should be close to zero. This
provides the basis for a specification test, i.e. a test of the null hypothesis that
the model is correctly specified.86 The test is sometimes called the Hansen J
test, or else simply the test of overidentifying restrictions.
To that end, consider the LR-type statistic
\[ J_n := 2 Q_n\left( \hat\theta_n \right) = G_n\left( \hat\theta_n \right)^\top W_n G_n\left( \hat\theta_n \right). \]
Though it has the flavour of an LR statistic, note that we are not testing the
null hypothesis that the true parameter satisfies some restriction. Instead,
our null is that $Q_n(\theta_0)$ is $o_p(1)$, meaning that all the moment conditions are
satisfied at the truth.
To derive the asymptotic distribution under the null, begin with a mean-
value expansion:
\begin{align*}
G_n\left( \hat\theta_n \right) &= G_n(\theta_0) + \left[ n^{-1/2} DG_n\left( \tilde\theta_n \right) \right] \left[ n^{1/2} \left( \hat\theta_n - \theta_0 \right) \right] \\
&= G_n(\theta_0) + [Dg(\theta_0)] \left[ n^{1/2} \left( \hat\theta_n - \theta_0 \right) \right] + o_p(1),
\end{align*}
where $\tilde\theta_n$ is the mean value. Recall from our proof of asymptotic normality
for extremum estimators (p. 84) that
\begin{align*}
n^{1/2} \left( \hat\theta_n - \theta_0 \right) &= -\left[ n^{-1} \nabla^2 Q_n\left( \tilde\theta_n \right) \right]^+ n^{-1/2} \nabla Q_n(\theta_0) \\
&= -\left( \left[ n^{-1/2} DG_n\left( \tilde\theta_n \right) \right]^\top W_n \left[ n^{-1/2} DG_n\left( \tilde\theta_n \right) \right] \right)^+ \\
&\quad \times \left[ n^{-1/2} DG_n(\theta_0) \right]^\top W_n G_n(\theta_0) \\
&= -\left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W G_n(\theta_0) + o_p(1),
\end{align*}
where $\tilde\theta_n$ is also a mean value. So
\[ G_n\left( \hat\theta_n \right) = G_n(\theta_0) - [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W G_n(\theta_0) + o_p(1). \]

$^{86}$ The information matrix test is another specification test. The J test is conceptually
distinct from the information matrix test, however. Given that we derived a generalised
information matrix equality for efficient GMM, we could formulate an information matrix
test for GMM if desired.

You may be tempted to use the matrix algebra result in equation (9) (p.
116) here to write
\[ [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top = W^{-1}, \]
but this is a mistake! The reason is that $[Dg(\theta_0)][Dg(\theta_0)]^\top$ must have full
rank in order for this identity to hold. But since $Dg(\theta_0)$ is $q \times k$ and $q > k$
by assumption, $[Dg(\theta_0)][Dg(\theta_0)]^\top$ can have rank at most $k$!
Instead, premultiply $G_n\left( \hat\theta_n \right)$ by $W^{1/2}$:
\begin{align*}
W^{1/2} G_n\left( \hat\theta_n \right) &= W^{1/2} G_n(\theta_0) - W^{1/2} [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} \\
&\quad \times [Dg(\theta_0)]^\top W G_n(\theta_0) + o_p(1) \\
&= \left( I - W^{1/2} [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W^{1/2} \right) \\
&\quad \times W^{1/2} G_n(\theta_0) + o_p(1) \\
&= M W^{1/2} G_n(\theta_0) + o_p(1),
\end{align*}
where
\[ M := I - W^{1/2} [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W^{1/2}. \]
$M$ is symmetric (obvious) and idempotent (trivial to verify, just compute
$MM$ and cancel terms to recover $M$).$^{87}$

$^{87}$ A square matrix $M$ is idempotent iff $MM = M$.
132
So the test statistic is
\begin{align*}
J_n &= \left[ W_n^{1/2} G_n\left( \hat\theta_n \right) \right]^\top \left[ W_n^{1/2} G_n\left( \hat\theta_n \right) \right] \\
&= \left[ W^{1/2} G_n\left( \hat\theta_n \right) \right]^\top \left[ W^{1/2} G_n\left( \hat\theta_n \right) \right] + o_p(1) \\
&= \left[ M W^{1/2} G_n(\theta_0) \right]^\top \left[ M W^{1/2} G_n(\theta_0) \right] + o_p(1) \\
&= \left[ W^{1/2} G_n(\theta_0) \right]^\top M^\top M \left[ W^{1/2} G_n(\theta_0) \right] + o_p(1) \\
&= \left[ W^{1/2} G_n(\theta_0) \right]^\top M \left[ W^{1/2} G_n(\theta_0) \right] + o_p(1).
\end{align*}
It's immediate by the Lindeberg–Lévy CLT that
\[ G_n(\theta_0) \xrightarrow{d} N_q\left( 0, E\left( g_1(\theta_0) g_1(\theta_0)^\top \right) \right). \]

Now suppose (this is important!) that we choose the weight matrix optimally:
$W = E\left( g_1(\theta_0) g_1(\theta_0)^\top \right)^{-1}$. Then we obtain
\[ W^{1/2} G_n(\theta_0) \xrightarrow{d} N_q\left( 0, W^{1/2} W^{-1} W^{1/2} \right) \overset{d}{=} N_q(0, I). \]
It then follows by Slutsky's theorem that
\[ J_n = \left[ W^{1/2} G_n(\theta_0) \right]^\top M \left[ W^{1/2} G_n(\theta_0) \right] + o_p(1) \xrightarrow{d} [N_q(0, I)]^\top M [N_q(0, I)]. \]

Now here's a fact for you: for $\xi \sim N_q(0, I)$ and any symmetric and
idempotent $q \times q$ matrix $M$,
\[ \xi^\top M \xi \sim \chi^2(\operatorname{rank} M). \]
To see why, eigen-decompose $M$ as $P^\top \Lambda P$, where $\Lambda$ is a diagonal matrix
with the eigenvalues of $M$ on the diagonal and $P$ contains the eigenvectors.
Wlog, arrange the rows so that the zero eigenvalues are last. Then
\[ \xi^\top M \xi = (P\xi)^\top \Lambda (P\xi) = \sum_{j=1}^q \lambda_j \left( (P\xi)_j \right)^2 = \sum_{j=1}^{\operatorname{rank} M} \lambda_j \left( (P\xi)_j \right)^2. \]
Since $M$ is idempotent, all its eigenvalues are either zero or unity, so
\[ \xi^\top M \xi = \sum_{j=1}^{\operatorname{rank} M} \left( (P\xi)_j \right)^2. \]
Since $P\xi$ is a linear combination of normals, it is normally distributed. Since
the eigenvectors are orthogonal and the components of $\xi$ are independent,
the components of $P\xi$ are independent. Moreover
\[ \operatorname{Var}\left( (P\xi)_j \right) = P_{j,\cdot} \operatorname{Var}(\xi) P_{j,\cdot}^\top = P_{j,\cdot} P_{j,\cdot}^\top = 1 \]
since $P$ is an orthogonal matrix. So we've shown that $\left\{ (P\xi)_j \right\}_{j=1}^{\operatorname{rank} M}$ are
independent standard-normal-distributed random variables. Therefore
\[ \xi^\top M \xi = \sum_{j=1}^{\operatorname{rank} M} \left( (P\xi)_j \right)^2 \sim \chi^2(\operatorname{rank} M). \]
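This fact is easy to check by simulation (a throwaway sketch; the dimension q = 4 and rank r = 2 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
q, r = 4, 2
# A symmetric idempotent matrix of rank r: the projection U U' onto the
# span of r orthonormal columns U.
U, _ = np.linalg.qr(rng.normal(size=(q, r)))
M = U @ U.T
xi = rng.normal(size=(100_000, q))
stats = np.einsum('ij,jk,ik->i', xi, M, xi)  # xi' M xi, draw by draw
print(stats.mean(), stats.var())  # should be close to r = 2 and 2r = 4

The sample mean and variance should land near the $\chi^2(r)$ values $r$ and $2r$.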

So let's compute the rank of our matrix
\[ M = I - W^{1/2} [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W^{1/2}. \]
Using the fact that the rank and trace of an idempotent matrix are equal,
and writing $I_m$ for an $m \times m$ identity matrix to make the dimensions explicit,
\begin{align*}
\operatorname{rank} M &= \operatorname{tr} M \\
&= \operatorname{tr} I_q - \operatorname{tr}\left( W^{1/2} [Dg(\theta_0)] \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W^{1/2} \right) \\
&= \operatorname{tr} I_q - \operatorname{tr}\left( \left( [Dg(\theta_0)]^\top W [Dg(\theta_0)] \right)^{-1} [Dg(\theta_0)]^\top W^{1/2} W^{1/2} [Dg(\theta_0)] \right) \\
&= \operatorname{tr} I_q - \operatorname{tr} I_k \\
&= q - k.
\end{align*}

So using our fun fact, we obtain
\[ J_n \xrightarrow{d} [N_q(0, I)]^\top M [N_q(0, I)] \overset{d}{=} \chi^2(\operatorname{rank} M) \overset{d}{=} \chi^2(q - k). \]
That was the distribution under the null (correct specification). It's pretty
clear that the test is consistent, for the same sort of reason as the trinity
tests in section 9 were. We can also construct a Pitman drift such that the
asymptotic distribution under local-DGP asymptotics is noncentral $\chi^2$.
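Carrying out the test is then mechanical once an efficient GMM estimate is in hand (a sketch in the same hypothetical setup as the earlier snippets; scipy.stats.chi2 supplies the reference distribution):

import numpy as np
from scipy.stats import chi2

def j_test(g, data, theta_hat, W_hat):
    # J_n = G_n(theta_hat)' W_n G_n(theta_hat), referred to a chi2(q - k)
    # distribution; valid only with an (estimated) optimal weight matrix.
    n, k = data.shape[0], np.size(theta_hat)
    G = sum(g(row, theta_hat) for row in data) / np.sqrt(n)
    q = G.shape[0]
    Jn = G @ W_hat @ G
    return Jn, chi2.sf(Jn, df=q - k)  # statistic and asymptotic p-value

Rejecting for small p-values amounts to rejecting the null that all $q$ moment conditions can hold simultaneously in the population.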

References
Aliprantis, C. D., & Border, K. C. (2006). Infinite dimensional analysis: A
hitchhiker’s guide (3rd). Berlin: Springer.
Amemiya, T. (1985). Advanced econometrics. Cambridge, MA: Harvard
University Press.
Billingsley, P. (1995). Probability and measure (3rd). New York, NY: Wiley.
Billingsley, P. (1999). Convergence of probability measures (2nd). New York,
NY: Wiley.
Dudley, R. M. (2004). Real analysis and probability. Cambridge: Cambridge
University Press.
Durrett, R. (2010). Probability: Theory and examples (4th). Cambridge:
Cambridge University Press.
Gnedenko, B. V., & Kolmogorov, A. N. (1954). Limit distributions for sums
of independent random variables. Cambridge, MA: Addison-Wesley.
Hansen, L. P. (1982). Large sample properties of generalized method of
moments estimators. Econometrica, 50(4), 1029–1054.
Ibragimov, I. A., & Has’minskii, R. (1981). Statistical estimation: Asymptotic
theory. Berlin: Springer.
Imbens, G. W. (2002). Generalized method of moments and empirical likeli-
hood. Journal of Business & Economic Statistics, 20(4), 493–506.
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares
estimators. Annals of Mathematical Statistics, 40(2), 633–643.
Kim, J., & Pollard, D. (1990). Cube root asymptotics. Annals of Statistics,
18(1), 191–219.
Kolmogorov, A. N., & Fomin, S. V. (1975). Introductory real analysis. New
York, NY: Dover.
Manski, C. F. (1975). Maximum score estimation of the stochastic utility
model of choice. Journal of Econometrics, 3(3), 205–228.
Newey, W. K., & McFadden, D. L. (1994). Large sample estimation and
hypothesis testing. In R. F. Engle & D. L. McFadden (Eds.), Handbook
of econometrics (Chap. 36, Vol. 4, pp. 2111–2245). Amsterdam: North-
Holland.
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially
consistent observations. Econometrica, 16(1), 1–32.
Rao, C. R. (1973). Linear statistical inference and its applications (2nd).
New York, NY: Wiley.
Rosenthal, J. S. (2006). A first look at rigorous probability theory (2nd).
Singapore: World Scientific.

135
Serfling, R. J. (1970). Convergence properties of Sn under moment restrictions.
Annals of Mathematical Statistics, 41(4), 1235–1248.
Serfling, R. J. (1980). Approximation theorems of mathematical statistics.
New York, NY: Wiley.
White, H. (1982). Maximum likelihood estimation of misspecified models.
Econometrica, 50(1), 1–25.
White, H. (2001). Asymptotic theory for econometricians (Revised). San
Diego, CA: Academic Press.
Williams, D. (1991). Probability with martingales. Cambridge: Cambridge
University Press.

