$$E_\theta\big(T(X)\big) = \theta \quad\text{for all values of }\theta \qquad (2.1)$$
Here
$$E_\theta(X_1) = E_\theta(\bar X) = \frac{1}{n}\sum_{k=1}^n E_\theta(X_k) = \int x\,f(x,\theta)\,dx = \theta \qquad (2.2)$$
and both of the statistics $T_1 = X_1$ and $T_2 = \bar X$ are unbiased estimators of $\theta$.
If the density $f(x,\theta)$ is discrete instead of continuous, the integral in (2.2) is
replaced by a sum.
The relation (2.1) implies that if we had a large number of different samples $X^{(m)}$, each of size $n$, then the estimates $T(X^{(m)})$ should cluster around the true value of $\theta$. However, it says nothing about the sizes of the errors $T(X^{(m)}) - \theta$, which are likely to be more important.

The errors of $T(X)$ as an estimator of $\theta$ can be measured by a loss function $L(x,\theta)$, where $L(x,\theta) \ge 0$ and $L(\theta,\theta) = 0$ (see Larsen and Marx, page 419). The risk is the expected value of this loss, or
$$R(T,\theta) = E_\theta\Big(L\big(T(X),\theta\big)\Big)$$
The most common choice of loss function is the quadratic loss function $L(x,\theta) = (x-\theta)^2$, for which the risk is
$$R(T,\theta) = E_\theta\Big(\big(T(X)-\theta\big)^2\Big) \qquad (2.3)$$
Another choice is the absolute-value loss function $L(x,\theta) = |x-\theta|$, for which the risk is $R(T,\theta) = E_\theta\big(|T(X)-\theta|\big)$.
If $T(X)$ is an unbiased estimator and $L(x,\theta) = (x-\theta)^2$, then the risk (2.3) is the same as the variance
$$R(T,\theta) = \mathrm{Var}_\theta\big(T(X)\big)$$
but not if $T(X)$ is biased (that is, not unbiased).

Assume $E_\theta\big(T(X)\big) = \mu(\theta)$ for a possibly biased estimator $T(X)$. That is, $\mu(\theta) \ne \theta$ for some or all $\theta$. Let $S = T - \theta$, so that $E_\theta(S) = \mu(\theta) - \theta$. Then $R(T,\theta) = E_\theta\big((T-\theta)^2\big) = E_\theta(S^2)$ and it follows from the relation $\mathrm{Var}(S) = E(S^2) - E(S)^2$ that
$$R(T,\theta) = E_\theta\Big(\big(T(X)-\theta\big)^2\Big) = \mathrm{Var}_\theta\big(T(X)\big) + \big(\mu(\theta)-\theta\big)^2, \qquad \mu(\theta) = E_\theta\big(T(X)\big) \qquad (2.4)$$
In principle, we might be able to find a biased estimator $T(X)$ that outperforms an unbiased estimator $T_0(X)$ if the biased estimator has a smaller variance that more than offsets the term $\big(\mu(\theta)-\theta\big)^2$ in (2.4).
Example (1). Suppose that $X_1,\dots,X_n$ are normally distributed $N(\mu,\sigma^2)$ and we want to estimate $\mu$. Then one might ask whether the biased estimator
$$T(X_1,\dots,X_n) = \frac{X_1 + X_2 + \dots + X_n}{n+1} \qquad (2.5)$$
could have $R(T,\mu) < R(\bar X,\mu)$ for the MLE $\bar X = (X_1+\dots+X_n)/n$. While $T(X)$ is biased, it should also have a smaller variance since we divide by a larger number. As in (2.4)
$$R(\bar X,\mu) = E_\mu\big((\bar X-\mu)^2\big) = \mathrm{Var}(\bar X) = \frac{\sigma^2}{n} \qquad (2.6)$$
$$R(T,\mu) = E_\mu\big((T-\mu)^2\big) = \mathrm{Var}(T) + \big(E(T)-\mu\big)^2 = \mathrm{Var}\Big(\frac{X_1+\dots+X_n}{n+1}\Big) + \Big(\frac{n\mu}{n+1}-\mu\Big)^2 = \frac{n\sigma^2}{(n+1)^2} + \frac{\mu^2}{(n+1)^2}$$
Comparing $R(T,\mu)$ with $R(\bar X,\mu)$:
$$R(T,\mu) - R(\bar X,\mu) = \frac{n\sigma^2}{(n+1)^2} + \frac{\mu^2}{(n+1)^2} - \frac{\sigma^2}{n} = \frac{1}{(n+1)^2}\Big(\mu^2 - \frac{(n+1)^2 - n^2}{n}\,\sigma^2\Big) = \frac{1}{(n+1)^2}\Big(\mu^2 - \frac{2n+1}{n}\,\sigma^2\Big) \qquad (2.7)$$
Thus $R(T,\mu) < R(\bar X,\mu)$ if $\mu^2 < \big((2n+1)/n\big)\sigma^2$, which is guaranteed by $\mu^2 < 2\sigma^2$. In that case, $T(X)$ is less risky than $\bar X$ (in the sense of having smaller expected squared error) even though it is biased.
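The comparison in (2.6)-(2.7) is easy to check by simulation. The following is a minimal sketch, assuming NumPy is available; the values of $n$, $\mu$, and $\sigma$ are illustrative choices, not taken from the text.

```python
# Numerical check of (2.6)-(2.7) for the biased estimator (2.5).
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 10, 1.0, 1.0            # here mu^2 < 2*sigma^2, so T should win
reps = 200_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)                   # the MLE, X-bar
T = X.sum(axis=1) / (n + 1)             # the biased estimator (2.5)

risk_xbar = np.mean((xbar - mu) ** 2)   # estimates sigma^2 / n
risk_T = np.mean((T - mu) ** 2)         # estimates (n*sigma^2 + mu^2) / (n+1)^2
theory_diff = (mu**2 - (2*n + 1) / n * sigma**2) / (n + 1) ** 2   # (2.7)

print(risk_xbar, risk_T, risk_T - risk_xbar, theory_diff)
```

With $\mu^2 < 2\sigma^2$ as in this sketch, the estimated risk of $T$ should come out smaller than that of $\bar X$, with a difference close to the theoretical value from (2.7).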
2.1. Shrinkage Estimators. The estimator $T(X)$ in (2.5) can be written
$$T(X_1,\dots,X_n) = \frac{n}{n+1}\,\bar X + \frac{1}{n+1}\cdot 0$$
which is a convex combination of $\bar X$ and 0. A more general estimator is
$$T(X_1,\dots,X_n) = c\,\bar X + (1-c)\,a \qquad (2.8)$$
where $a$ is an arbitrary number and $0 < c < 1$. Estimators of the form (2.5) and (2.8) are called shrinkage estimators. While shrinkage estimators are biased unless $E(X_i) = \mu = a$, the calculation above shows that they have smaller risk if $\mu^2 < 2\sigma^2$ for (2.5) or $(\mu - a)^2 < \big((1+c)/(1-c)\big)(\sigma^2/n)$ for (2.8).

On the other hand, $R(T,\mu)$ and $R(\bar X,\mu)$ are of order $1/n$ and, by arguing as in (2.6) and (2.7), the difference between the two is of order $1/n^2$ for fixed $\mu$, $a$, and $0 < c < 1$. (Exercise: Prove this.) Thus one cannot go too far wrong by using $\bar X$ instead of a shrinkage estimator.
2.2. Ridge Regression. In ridge regression (which is discussed in other courses), the natural estimator $T_1(X)$ of certain parameters is unbiased, but $\mathrm{Var}(T_1)$ is very large because $T_1(X)$ depends on the inverse of a matrix that is very close to being singular.

The method of ridge regression finds biased estimators $T_2(X)$ that are similar to $T_1(X)$ such that $E\big(T_2(X)\big)$ is close to $E\big(T_1(X)\big)$ but $\mathrm{Var}\big(T_2(X)\big)$ is of moderate size. If this happens, then (2.4) with $T(X) = T_2(X)$ implies that the biased ridge regression estimator $T_2(X)$ can be a better choice than the unbiased estimator $T_1(X)$, since it can have much lower risk and give much more reasonable estimates.
2.3. Relative Efficiency. Let $T(X)$ and $T_0(X)$ be estimators of $\theta$, where $T_0(X)$ is viewed as a standard estimator such as $\bar X$ or the MLE (maximum likelihood estimator) of $\theta$ (see below). Then, the relative risk or relative efficiency of $T(X)$ with respect to $T_0(X)$ is
$$RR(T,\theta) = \frac{R(T_0,\theta)}{R(T,\theta)} = \frac{E_\theta\big((T_0(X)-\theta)^2\big)}{E_\theta\big((T(X)-\theta)^2\big)} \qquad (2.9)$$
Note that $T_0(X)$ appears in the numerator, not the denominator, and $T(X)$ appears in the denominator, not the numerator. If $RR(T,\theta) < 1$, then $R(T_0,\theta) < R(T,\theta)$ and $T(X)$ can be said to be less efficient, or more risky, than $T_0(X)$. Conversely, if $RR(T,\theta) > 1$, then $T(X)$ is more efficient (and less risky) than the standard estimator $T_0(X)$.
2.4. MLEs are Not Always Sample Means even if $E_\theta(X) = \theta$.
The most common example with $\hat\theta_{\mathrm{MLE}} = \bar X$ is the normal family $N(\theta,1)$. In that case, $\mathrm{Var}_\theta(X) = 1$ and $E_\theta(X) = \theta$.

A second example is the Laplace family $L(\theta,c)$ with density
$$f(x,\theta) = \frac{1}{2c}\,e^{-|x-\theta|/c}, \qquad -\infty < x < \infty \qquad (2.10)$$
where $c > 0$ is a scale parameter; we write $X \sim L(\theta,c)$ when $X$ has this density, and $E_\theta(X) = \theta$ by symmetry.

If $Y$ has the density (2.10), then $Y$ has the same distribution as $\theta + cY_0$ where $Y_0 \sim L(0,1)$. (Exercise: Prove this.) Thus the Laplace family (2.10) is a shift-and-scale family like the normal family $N(\mu,\sigma^2)$, and is similar to $N(\mu,\sigma^2)$ except that the probability density of $X \sim L(\theta,c)$ decays exponentially for large $x$ instead of faster than exponentially as is the case for the normal family. (It also has a non-differentiable cusp at $x = \theta$.)

In any event, one might expect that the MLE of $\theta$ might be less willing to put as much weight on large sample values than does the sample mean $\bar X$, since these values may be less reliable due to the relatively heavy tails of the Laplace distribution. In fact
Lemma 2.1. Let $X = (X_1,\dots,X_n)$ be an independent sample of size $n$ from the Laplace distribution (2.10) for unknown $\theta$ and $c$. Then
$$\hat\theta_{\mathrm{MLE}}(X) = \mathrm{median}\{\,X_1,\dots,X_n\,\} \qquad (2.11)$$
Remark. That is, if
$$X_{(1)} < X_{(2)} < \dots < X_{(n)} \qquad (2.12)$$
are the order statistics of the sample $X_1,\dots,X_n$, then
$$\hat\theta_{\mathrm{MLE}}(X) = \begin{cases} X_{(k+1)} & \text{if } n = 2k+1 \text{ is odd} \\[4pt] \big(X_{(k)}+X_{(k+1)}\big)/2 & \text{if } n = 2k \text{ is even} \end{cases} \qquad (2.13)$$
Thus $\hat\theta_{\mathrm{MLE}} = X_{(2)}$ if $n = 3$ and $X_{(1)} < X_{(2)} < X_{(3)}$, and $\hat\theta_{\mathrm{MLE}} = \big(X_{(2)}+X_{(3)}\big)/2$ if $n = 4$ and $X_{(1)} < X_{(2)} < X_{(3)} < X_{(4)}$.
Proof of Lemma 2.1. By (2.10), the likelihood of $\theta$ is
$$L(\theta, X_1,\dots,X_n) = \prod_{i=1}^n \Big(\frac{1}{2c}\,e^{-|X_i-\theta|/c}\Big) = \frac{1}{(2c)^n}\exp\Big(-\sum_{i=1}^n \frac{|X_i-\theta|}{c}\Big)$$
It follows that the likelihood $L(\theta,X)$ is maximized whenever the sum
$$M(\theta) = \sum_{i=1}^n |X_i - \theta| = \sum_{i=1}^n |X_{(i)} - \theta| \qquad (2.14)$$
is minimized, where $X_{(i)}$ are the order statistics in (2.12).

The function $M(\theta)$ in (2.14) is continuous and piecewise linear. If $X_{(m)} \le \theta \le X_{(m+1)}$ (that is, if $\theta$ lies between the $m$th and the $(m+1)$st order statistics of $\{X_i\}$), then $X_{(i)} \le X_{(m)} \le \theta$ if $i \le m$ and $\theta \le X_{(m+1)} \le X_{(i)}$ if $m+1 \le i \le n$. Thus
$$M(\theta) = \sum_{i=1}^n |X_{(i)}-\theta| = \sum_{i=1}^m (\theta - X_{(i)}) + \sum_{i=m+1}^n (X_{(i)} - \theta)$$
and if $X_{(m)} < \theta < X_{(m+1)}$
$$\frac{d}{d\theta}M(\theta) = M'(\theta) = m - (n-m) = 2m - n$$
It follows that $M'(\theta) < 0$ (and $M(\theta)$ is decreasing) if $m < n/2$, and $M'(\theta) > 0$ (and $M(\theta)$ is increasing) if $m > n/2$. If $n = 2k+1$ is odd, then $n/2 = k + (1/2)$ and $M(\theta)$ is strictly decreasing if $\theta < X_{(k+1)}$ and is strictly increasing if $\theta > X_{(k+1)}$. It follows that the minimum value of $M(\theta)$ is attained at $\theta = X_{(k+1)}$.

If $n = 2k$ is even, then, by the same argument, $M(\theta)$ is minimized at any point in the interval $\big(X_{(k)}, X_{(k+1)}\big)$, so that any value in that interval maximizes the likelihood. When that happens, the usual convention is to set the MLE equal to the center of the interval, which is the average of the endpoints. Thus $\hat\theta_{\mathrm{MLE}} = X_{(k+1)}$ if $n = 2k+1$ is odd and $\hat\theta_{\mathrm{MLE}} = \big(X_{(k)}+X_{(k+1)}\big)/2$ if $n = 2k$ is even, which implies (2.13). $\square$
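A quick numerical sanity check of Lemma 2.1 is to minimize $M(\theta)$ on a fine grid and compare the minimizer with the sample median. This is a sketch assuming NumPy; the sample size and Laplace parameters are illustrative.

```python
# Check that the sample median minimizes M(theta) = sum_i |X_i - theta|.
import numpy as np

rng = np.random.default_rng(1)
X = rng.laplace(loc=2.0, scale=1.5, size=11)          # n = 11 is odd

grid = np.linspace(X.min() - 1, X.max() + 1, 20001)
M = np.abs(X[:, None] - grid[None, :]).sum(axis=0)    # M(theta) on the grid

theta_grid = grid[np.argmin(M)]
print(theta_grid, np.median(X))    # the two values should agree closely
```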
A third example of a density with $E_\theta(X) = E_\theta(\bar X) = \theta$ is
$$f(x,\theta) = \tfrac12\, I_{(\theta-1,\,\theta+1)}(x) \qquad (2.15)$$
which we can call the centered uniform distribution of length 2. If $X$ has density (2.15), then $X$ is uniformly distributed between $\theta-1$ and $\theta+1$ and $E_\theta(X) = \theta$. The likelihood of $\theta$ is
$$L(\theta,X) = \prod_{i=1}^n\Big(\tfrac12\,I_{(\theta-1,\,\theta+1)}(X_i)\Big) = \frac{1}{2^n}\prod_{i=1}^n I_{(X_i-1,\,X_i+1)}(\theta) = \frac{1}{2^n}\,I_{(X_{\max}-1,\,X_{\min}+1)}(\theta) \qquad (2.16)$$
since (i) $\theta-1 < X_i < \theta+1$ if and only if $X_i-1 < \theta < X_i+1$, so that $I_{(\theta-1,\theta+1)}(X_i) = I_{(X_i-1,X_i+1)}(\theta)$, and (ii) the product of the indicator functions is non-zero if and only if $X_i < \theta+1$ and $\theta-1 < X_i$ for all $i$, which is equivalent to $\theta-1 < X_{\min} \le X_{\max} < \theta+1$ or $X_{\max}-1 < \theta < X_{\min}+1$.

Thus the likelihood is zero except for $\theta \in (X_{\max}-1,\,X_{\min}+1)$, where the likelihood has the constant value $1/2^n$. Following the same convention as in (2.13), we set
$$\hat\theta_{\mathrm{MLE}}(X) = \frac{X_{\max}+X_{\min}}{2} \qquad (2.17)$$
(Exercise: Note that normally $X_{\min} < X_{\max}$. Prove that the interval $(X_{\max}-1,\,X_{\min}+1)$ is generally nonempty for the density (2.15).)
2.5. Relative Efficiencies of Three Sample Estimators. We can use computer simulation to compare the relative efficiencies of the sample mean, the sample median, and the average of the sample minima and maxima for the three distributions in the previous subsection. Recall that, while all three distributions are symmetric about a shift parameter $\theta$, the MLEs of $\theta$ are the sample mean, the sample median, and the average of the sample minimum and maximum, respectively, and are not the same.

It is relatively easy to use a computer to do random simulations of $n$ random samples $X^{(j)}$ ($1 \le j \le n$) for each of these distributions, where each random sample $X^{(j)} = \big(X^{(j)}_1,\dots,X^{(j)}_m\big)$ is of size $m$. Thus the randomly simulated data for each distribution will involve generating $n\,m$ random numbers.

For each set of simulated data and each sample estimator $T(X)$, we estimate the risk by $(1/n)\sum_{j=1}^n \big(T(X^{(j)})-\theta\big)^2$. Analogously with (2.9), we estimate the relative risk with respect to the sample mean $\bar X$ by
$$RR(T,\theta) = \frac{(1/n)\sum_{j=1}^n \big(\bar X^{(j)}-\theta\big)^2}{(1/n)\sum_{j=1}^n \big(T(X^{(j)})-\theta\big)^2}$$
Then $RR(T,\theta) < 1$ means that the sample mean has less risk, while $RR(T,\theta) > 1$ implies that it is riskier. Since all three distributions are shift invariant in $\theta$, it is sufficient to assume $\theta = 0$ in the simulations.

The simulations show that, in each of the three cases, the MLE is the most efficient of the three estimators of $\theta$. Recall that the MLE is the sample mean only for the normal family. Specifically, we find
Table 2.1: Estimated relative efficiencies with respect to the sample
mean for n = 1,000,000 simulated samples, each of size m = 10:
Distrib     Mean   Median   AvMinMax   Most Efficient
CentUnif    1.0    0.440    2.196      AvMinMax
Normal      1.0    0.723    0.540      Mean
Laplace     1.0    1.379    0.243      Median
The results are even more striking for samples of size 30:
Table 2.2: Estimated relative efficiencies with respect to the sample
mean for n = 1,000,000 simulated samples, each of size m = 30:
Distrib     Mean   Median   AvMinMax   Most Efficient
CentUnif    1.0    0.368    5.492      AvMinMax
Normal      1.0    0.666    0.265      Mean
Laplace     1.0    1.571    0.081      Median
Table 2.2 shows that the sample mean has a 3:2 advantage over the sample median for normal samples, but a 3:2 deficit for the Laplace distribution. Averaging the sample minimum and maximum is 5-fold better than the sample mean for the centered uniforms, but is 12-fold worse for the Laplace distribution. Of the three distributions, the Laplace has the largest probability of large values.
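A sketch of the simulation behind Tables 2.1-2.2 is given below, assuming NumPy is available. The number of simulated samples is reduced here for speed; with $n = 10^6$ the values should be close to those in the tables.

```python
# Estimated relative efficiencies of mean, median, and (min+max)/2, theta = 0.
import numpy as np

def rel_eff(draw, m=10, n=100_000, seed=0):
    rng = np.random.default_rng(seed)
    X = draw(rng, (n, m))
    mean = X.mean(axis=1)
    median = np.median(X, axis=1)
    avminmax = 0.5 * (X.min(axis=1) + X.max(axis=1))
    risk = lambda T: np.mean(T ** 2)              # theta = 0
    r0 = risk(mean)                               # risk of the sample mean
    return r0 / risk(mean), r0 / risk(median), r0 / risk(avminmax)

dists = {
    "CentUnif": lambda rng, s: rng.uniform(-1.0, 1.0, s),
    "Normal":   lambda rng, s: rng.normal(0.0, 1.0, s),
    "Laplace":  lambda rng, s: rng.laplace(0.0, 1.0, s),
}
for name, draw in dists.items():
    print(name, [round(v, 3) for v in rel_eff(draw)])
```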
3. Scores and Fisher Information. Let $X_1, X_2,\dots,X_n$ be an independent sample of observations from a density $f(x,\theta)$ where $\theta$ is an unknown parameter. Then the likelihood function of the parameter $\theta$ given the data $X_1,\dots,X_n$ is
$$L(\theta, X_1,\dots,X_n) = f(X_1,\theta)\,f(X_2,\theta)\cdots f(X_n,\theta) \qquad (3.1)$$
where the observations $X_1,\dots,X_n$ are used in (3.1) instead of dummy variables $x_k$. Since the data $X_1,\dots,X_n$ is assumed known, $L(\theta,X_1,\dots,X_n)$ depends only on the parameter $\theta$.

The maximum likelihood estimator of $\theta$ is the value $\theta = \hat\theta(X)$ that maximizes the likelihood (3.1). This can often be found by forming the partial derivative of the logarithm of the likelihood
$$\frac{\partial}{\partial\theta}\log L(\theta, X_1,\dots,X_n) = \sum_{k=1}^n \frac{\partial}{\partial\theta}\log f(X_k,\theta) \qquad (3.2)$$
and setting this expression equal to zero. The sum in (3.2) is sufficiently important in statistics that not only the individual terms in the sum, but also their variances, have names.

Specifically, the scores of the observations $X_1,\dots,X_n$ for the density $f(x,\theta)$ are the terms
$$Y_k(\theta) = \frac{\partial}{\partial\theta}\log f(X_k,\theta) \qquad (3.3)$$
Under appropriate assumptions on $f(x,\theta)$ (see Lemma 3.1 below), the scores $Y_k(\theta)$ have mean zero. (More exactly, $E_\theta\big(Y_k(\theta)\big) = 0$, where the same value of $\theta$ is used in both parts of the expression.)

The Fisher information of an observation $X_1$ from $f(x,\theta)$ is the variance of the scores
$$I(f,\theta) = \mathrm{Var}_\theta\big(Y_k(\theta)\big) = \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)^2 f(x,\theta)\,dx \qquad (3.4)$$
Under an additional hypothesis (see Lemma 3.2 below), we also have
$$I(f,\theta) = -\int\Big(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\Big) f(x,\theta)\,dx \qquad (3.5)$$
which is often easier to compute since it involves a mean rather than a second moment.
For example, assume $X_1,\dots,X_n$ are normally distributed with unknown mean $\theta$ and known variance $\sigma_0^2$. Then
$$f(x,\theta) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\,e^{-(x-\theta)^2/2\sigma_0^2}, \qquad -\infty < x < \infty \qquad (3.6)$$
Thus
$$\log f(x,\theta) = -\frac12\log(2\pi\sigma_0^2) - \frac{(x-\theta)^2}{2\sigma_0^2}$$
It follows that $(\partial/\partial\theta)\log f(x,\theta) = (x-\theta)/\sigma_0^2$, and hence the $k$th score is
$$Y_k(\theta) = \frac{\partial}{\partial\theta}\log f(X_k,\theta) = \frac{X_k-\theta}{\sigma_0^2} \qquad (3.7)$$
In particular $E_\theta\big(Y_k(\theta)\big) = 0$ as expected since $E_\theta(X_k) = \theta$, and, since $E_\theta\big((X_k-\theta)^2\big) = \sigma_0^2$, the scores have variance
$$I(f,\theta) = E_\theta\big(Y_k(\theta)^2\big) = \frac{E_\theta\big((X_k-\theta)^2\big)}{(\sigma_0^2)^2} = \frac{1}{\sigma_0^2} \qquad (3.8)$$
In this case, the relation
$$\frac{\partial^2}{\partial\theta^2}\log f(X,\theta) = -\frac{1}{\sigma_0^2}$$
from (3.7) combined with (3.5) gives an easier derivation of (3.8).

The Fisher information $I(f,\theta) = 1/\sigma_0^2$ for (3.6) is large (that is, each $X_k$ has lots of information) if $\sigma_0^2$ is small (so that the error in each $X_k$ is small), and similarly the Fisher information is small if $\sigma_0^2$ is large. This may have been one of the original motivations for the term information.
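The value $I(f,\theta) = 1/\sigma_0^2$ in (3.8) can be checked numerically by simulating the scores (3.7) and computing their sample mean and variance. A minimal sketch, assuming NumPy is available; the parameter values are arbitrary.

```python
# Numerical check of (3.7)-(3.8): the scores have mean 0 and variance 1/sigma0^2.
import numpy as np

rng = np.random.default_rng(2)
theta, sigma0 = 1.5, 2.0
X = rng.normal(theta, sigma0, size=1_000_000)

scores = (X - theta) / sigma0**2           # Y_k(theta) from (3.7)
print(scores.mean())                       # approximately 0
print(scores.var(), 1 / sigma0**2)         # approximately the Fisher information
```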
We give examples below of the importance of scores and Fisher information. First, we give a proof that $E_\theta\big(Y_k(\theta)\big) = 0$ under certain conditions.

Lemma 3.1. Suppose that $K = \{\,x : f(x,\theta) > 0\,\}$ is the same bounded or unbounded interval for all $\theta$, that $f(x,\theta)$ is smooth enough that we can interchange the derivative and integral in the first line of the proof, and that $(\partial/\partial\theta)\log f(x,\theta)$ is integrable on $K$. Then
$$E_\theta\big(Y_k(\theta)\big) = \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big) f(x,\theta)\,dx = 0 \qquad (3.9)$$
Proof. Since $\int f(x,\theta)\,dx = 1$ for all $\theta$, we can differentiate
$$\frac{d}{d\theta}\int f(x,\theta)\,dx = 0 = \int\frac{\partial}{\partial\theta} f(x,\theta)\,dx = \int\frac{(\partial/\partial\theta)f(x,\theta)}{f(x,\theta)}\,f(x,\theta)\,dx = \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)f(x,\theta)\,dx = 0 \qquad\square$$

Lemma 3.2. Suppose that $f(x,\theta)$ satisfies the same conditions as in Lemma 3.1 and that $\log f(x,\theta)$ has two partial derivatives in $\theta$ that are continuous and bounded on $K$. Then
$$I(f,\theta) = E_\theta\big(Y_k(\theta)^2\big) = -\int\Big(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\Big)f(x,\theta)\,dx \qquad (3.10)$$
Proof. Extending the proof of Lemma 3.1,
$$\frac{d^2}{d\theta^2}\int f(x,\theta)\,dx = 0 = \frac{d}{d\theta}\int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)f(x,\theta)\,dx \qquad (3.11)$$
$$= \int\Big(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\Big)f(x,\theta)\,dx + \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)\frac{\partial}{\partial\theta}f(x,\theta)\,dx$$
The last term equals
$$\int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)\frac{(\partial/\partial\theta)f(x,\theta)}{f(x,\theta)}\,f(x,\theta)\,dx = \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)^2 f(x,\theta)\,dx = I(f,\theta)$$
by (3.4). Since the left-hand side of (3.11) is equal to zero, Lemma 3.2 follows. $\square$
Remarks. The hypotheses of Lemma 3.1 are satisfied for the normal density (3.6), for which $E_\theta\big(Y_k(\theta)\big) = 0$ by (3.7). However, the hypotheses are not satisfied for the uniform density $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$ since the supports $K(\theta) = (0,\theta)$ depend on $\theta$.

For $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$, the scores $Y_k(\theta) = -(1/\theta)I_{(0,\theta)}(X_k)$ have means $E_\theta\big(Y_k(\theta)\big) = -1/\theta \ne 0$, so that the proof of Lemma 3.1 breaks down at some point. (Exercise: Show that this is the correct formula for the score for the uniform density and that this is the mean value.)
4. The Cramer-Rao Inequality. Let $X_1,X_2,\dots,X_n$ be an independent random sample from the density $f(x,\theta)$, where $f(x,\theta)$ satisfies the conditions of Lemma 3.1. In particular,
(i) The set $K = \{\,x : f(x,\theta) > 0\,\}$ is the same for all values of $\theta$ and
(ii) The function $\log f(x,\theta)$ has two continuous partial derivatives in $\theta$ that are integrable on $K$.
We then have

Theorem 4.1. (Cramer-Rao Inequality) Let $T(X_1,X_2,\dots,X_n)$ be an arbitrary unbiased estimator of $\theta$. Then, under the assumptions above,
$$E_\theta\big((T-\theta)^2\big) \ge \frac{1}{n\,I(f,\theta)} \qquad (4.1)$$
for all values of $\theta$, where $I(f,\theta)$ is the Fisher information defined in (3.4).

Remark. Note that (4.1) need not hold if $T(X_1,\dots,X_n)$ is a biased estimator of $\theta$, nor if the assumptions (i) or (ii) fail.
Proof of Theorem 4.1. Let $T = T(X_1,\dots,X_n)$ be any unbiased estimator of $\theta$. Then
$$\theta = E_\theta\big(T(X_1,\dots,X_n)\big) = \int\!\!\cdots\!\!\int T(y_1,\dots,y_n)\,f(y_1,\theta)\cdots f(y_n,\theta)\,dy_1\cdots dy_n \qquad (4.2)$$
Differentiating (4.2) with respect to $\theta$,
$$1 = \int\!\!\cdots\!\!\int T(y_1,\dots,y_n)\,\frac{\partial}{\partial\theta}\Big(\prod_{k=1}^n f_k\Big)\,dy_1\cdots dy_n$$
where $f_k = f(y_k,\theta)$. By the chain rule
$$1 = \int\!\!\cdots\!\!\int T\sum_{k=1}^n\Big[\Big(\prod_{j=1}^{k-1}f_j\Big)\Big(\frac{\partial}{\partial\theta}f_k\Big)\Big(\prod_{j=k+1}^{n}f_j\Big)\Big]\,dy_1\cdots dy_n$$
for $T = T(y_1,y_2,\dots,y_n)$ and
$$1 = \int\!\!\cdots\!\!\int T\sum_{k=1}^n\Big[\Big(\prod_{j=1}^{k-1}f_j\Big)\Big(\frac{(\partial/\partial\theta)f_k}{f_k}\Big)f_k\Big(\prod_{j=k+1}^{n}f_j\Big)\Big]\,dy_1\cdots dy_n$$
$$= \int\!\!\cdots\!\!\int T\Big(\sum_{k=1}^n\frac{\partial}{\partial\theta}\log f(y_k,\theta)\Big)\Big(\prod_{j=1}^n f_j\Big)\,dy_1\cdots dy_n = E_\theta\Big(T(X_1,\dots,X_n)\Big(\sum_{k=1}^n Y_k\Big)\Big) \qquad (4.3)$$
where $Y_k = (\partial/\partial\theta)\log f(X_k,\theta)$ are the scores defined in (3.3). Since $E_\theta\big(Y_k(\theta)\big) = 0$ by Lemma 3.1, it follows by subtraction from (4.3) that
$$1 = E_\theta\Big(\big(T(X_1,\dots,X_n)-\theta\big)\Big(\sum_{k=1}^n Y_k\Big)\Big) \qquad (4.4)$$
By Cauchy's inequality (see Lemma 4.1 below),
$$E(XY) \le \sqrt{E(X^2)}\,\sqrt{E(Y^2)}$$
for any two random variables $X,Y$ with $E(|XY|) < \infty$. Equivalently $E(XY)^2 \le E(X^2)E(Y^2)$. Applying this in (4.4) implies
$$1 \le E_\theta\Big(\big(T(X_1,\dots,X_n)-\theta\big)^2\Big)\,E_\theta\Big(\Big(\sum_{k=1}^n Y_k\Big)^2\Big) \qquad (4.5)$$
The scores $Y_k = (\partial/\partial\theta)\log f(X_k,\theta)$ are independent with the same distribution, and have mean zero and variance $I(f,\theta)$ by Lemma 3.1 and (3.4). Thus
$$E_\theta\Big(\Big(\sum_{k=1}^n Y_k\Big)^2\Big) = \mathrm{Var}_\theta\Big(\sum_{k=1}^n Y_k\Big) = n\,\mathrm{Var}_\theta(Y_1) = n\,I(f,\theta)$$
Hence $1 \le E_\theta\big((T-\theta)^2\big)\,n\,I(f,\theta)$, which implies the lower bound (4.1). $\square$
Definition. The efficiency of an estimator $T(X_1,\dots,X_n)$ is
$$RE(T,\theta) = \frac{1/\big(nI(f,\theta)\big)}{E_\theta\big((T-\theta)^2\big)} = \frac{1}{n\,I(f,\theta)\,E_\theta\big((T-\theta)^2\big)} \qquad (4.6)$$
Note that this is the same as the relative risk or relative efficiency (2.9) with $R(T_0,\theta)$ replaced by the Cramer-Rao lower bound (4.1). Under the assumptions of Theorem 4.1, $RE(T,\theta) \le 1$.

An unbiased estimator $T(X_1,\dots,X_n)$ is called efficient if $RE(T,\theta) = 1$; that is, if its variance attains the lower bound in (4.1). This means that any other unbiased estimator of $\theta$, no matter how nonlinear, must have an equal or larger variance.

An estimator $T(X)$ of a parameter $\theta$ is super-efficient if its expected squared error $E_\theta\big((T(X)-\theta)^2\big)$ is strictly less than the Cramer-Rao lower bound. Under the assumptions of Theorem 4.1, this can happen only if $T(X)$ is biased, and typically holds for some parameter values but not for others. For example, the shrinkage estimator of Section 2.1 is super-efficient for parameter values $\theta$ that are reasonably close to the value $a$ but not for other $\theta$.
Examples (1). Assume $X_1,X_2,\dots,X_n$ are $N(\theta,\sigma_0^2)$ (that is, normally distributed with unknown mean $\theta$ and known variance $\sigma_0^2$). Then $E_\theta(X_k) = \theta$ and
$$E_\theta\big((\bar X-\theta)^2\big) = \mathrm{Var}_\theta(\bar X) = \frac{\sigma_0^2}{n}$$
By (3.8), the Fisher information is $I(f,\theta) = 1/\sigma_0^2$, so that $1/\big(nI(f,\theta)\big) = \sigma_0^2/n$. Thus $\bar X$ is an unbiased estimator that attains the Cramer-Rao lower bound (4.1), and so is efficient.

(2). Now assume $X_1,\dots,X_n$ are uniformly distributed on $(0,\theta)$, with density $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$. By the remarks at the end of Section 3, the scores are $Y_k(\theta) = -1/\theta$ (a constant) with probability one, so that $E_\theta\big(Y_k(\theta)^2\big) = 1/\theta^2$ but $\mathrm{Var}_\theta\big(Y_k(\theta)\big) = 0$. Hence $I(f,\theta) = \mathrm{Var}_\theta\big(Y_k(\theta)\big) = 0$, so that the Cramer-Rao lower bound for unbiased estimators in Theorem 4.1 is $\infty$. (If we can use Lemma 3.2, then $I(f,\theta) = -1/\theta^2$ and the lower bound is negative. These are not contradictions, since the density $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$ does not satisfy the hypotheses of either Lemma 3.1 or 3.2.)

Ignoring these awkwardnesses for the moment, let $X_{\max} = \max_{1\le i\le n} X_i$. Then $E_\theta(X_{\max}) = \big(n/(n+1)\big)\theta$, so that if $T_{\max} = \big((n+1)/n\big)X_{\max}$ then
$$E_\theta(2\bar X) = E_\theta(T_{\max}) = \theta$$
Thus both $T_1 = 2\bar X$ and $T_2 = T_{\max}$ are unbiased estimators of $\theta$. However, one can show
$$\mathrm{Var}_\theta(2\bar X) = \frac{\theta^2}{3n} \quad\text{and}\quad \mathrm{Var}_\theta(T_{\max}) = \frac{\theta^2}{n(n+2)}$$
Assuming $I(f,\theta) = 1/\theta^2$ for definiteness, this implies that
$$RE(2\bar X,\theta) = 3 \quad\text{and}\quad RE(T_{\max},\theta) = n+2$$
both of which exceed 1 (the second by a large margin if $n$ is large). Thus the conclusions of Theorem 4.1 are either incorrect or else make no sense for either unbiased estimator in this case.
We end this section with a proof of Cauchy's inequality.

Lemma 4.1 (Cauchy-Schwarz-Bunyakovsky). Let $X,Y$ be any two random variables such that $E(|XY|) < \infty$. Then
$$E(XY) \le \sqrt{E(X^2)}\,\sqrt{E(Y^2)} \qquad (4.7)$$
Proof. Note $\big(\sqrt a\,x - (1/\sqrt a)\,y\big)^2 \ge 0$ for arbitrary real numbers $x,y,a$ with $a > 0$. Expanding the binomial implies $ax^2 - 2xy + (1/a)y^2 \ge 0$, or
$$xy \le \frac12\Big(ax^2 + \frac1a y^2\Big)$$
for all real $x,y$ and any $a > 0$. It then follows that for any values of the random variables $X,Y$
$$XY \le \frac12\Big(aX^2 + \frac1a Y^2\Big)$$
In general, if $Y_1 \le Y_2$ for two random variables $Y_1,Y_2$, then $E(Y_1) \le E(Y_2)$. This implies
$$E(XY) \le \frac12\Big(aE(X^2) + \frac1a E(Y^2)\Big), \qquad\text{any } a > 0 \qquad (4.8)$$
If we minimize the right-hand side of (4.8) as a function of $a$, for example by setting the derivative with respect to $a$ equal to zero, we obtain $a^2 = E(Y^2)/E(X^2)$ or $a = \sqrt{E(Y^2)/E(X^2)}$. Evaluating the right-hand side of (4.8) with this value of $a$ implies (4.7). $\square$
5. Maximum Likelihood Estimators are Asymptotically Efficient.
Let $X_1,X_2,\dots,X_n,\dots$ be independent random variables with the same distribution. Assume $E(X_j^2) < \infty$ and $E(X_j) = \mu$. Then the central limit theorem implies
$$\lim_{n\to\infty} P\Big(\frac{X_1+X_2+\dots+X_n - n\mu}{\sqrt{n\sigma^2}} \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx \qquad (5.1)$$
for all real values of $y$. Is there something similar for MLEs (maximum likelihood estimators)? First, note that (5.1) is equivalent to
$$\lim_{n\to\infty} P\Big(\sqrt{\frac{n}{\sigma^2}}\,\Big(\frac{X_1+X_2+\dots+X_n}{n} - \mu\Big) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx \qquad (5.2)$$
If $X_1,\dots,X_n$ were normally distributed with mean $\mu$ and variance $\sigma^2$, then $\hat\theta_n(X) = \bar X = (X_1+\dots+X_n)/n$. This suggests that we might have a central limit theorem for MLEs $\hat\theta_n(X)$ of the form
$$\lim_{n\to\infty} P\Big(\sqrt{n\,c(\theta)}\,\big(\hat\theta_n(X)-\theta\big) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx$$
where $\theta$ is the true value of $\theta$ and $c(\theta)$ is a constant depending on $\theta$. In fact
Theorem 5.1. Assume
(i) The set $K = \{\,x : f(x,\theta) > 0\,\}$ is the same for all values of $\theta$,
(ii) The function $\log f(x,\theta)$ has two continuous partial derivatives in $\theta$ that are integrable on $K$,
(iii) $E(Z) < \infty$ for $Z = \sup_\theta\big|(\partial^2/\partial\theta^2)\log f(X,\theta)\big|$, and
(iv) the MLE $\hat\theta(X)$ is attained in the interior of $K$.
Let $I(f,\theta)$ be the Fisher information (3.4) in Section 3. Then
$$\lim_{n\to\infty} P\Big(\sqrt{n\,I(f,\theta)}\,\big(\hat\theta_n(X)-\theta\big) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx \qquad (5.3)$$
for all real values of $y$. (Condition (iii) is more than is actually required.)

Remarks (1). The relation (5.3) says that the MLE $\hat\theta_n(X)$ is approximately normally distributed with mean $\theta$ and variance $1/\big(nI(f,\theta)\big)$, or symbolically
$$\hat\theta_n(X) \approx N\Big(\theta,\ \frac{1}{n\,I(f,\theta)}\Big) \qquad (5.4)$$
If (5.4) held exactly, then $E_\theta\big(\hat\theta_n(X)\big) = \theta$ and $\mathrm{Var}_\theta\big(\hat\theta_n(X)\big) = 1/\big(nI(f,\theta)\big)$, and $\hat\theta_n(X)$ would be an unbiased estimator whose variance was equal to the Cramer-Rao lower bound. We interpret (5.3)-(5.4) as saying that $\hat\theta_n(X)$ is asymptotically normal, is asymptotically unbiased, and is asymptotically efficient in the sense of Section 4, since its asymptotic variance is the Cramer-Rao lower bound. However, (5.3) does not exclude the possibility that $E_\theta\big(|\hat\theta_n(X)|\big) = \infty$ for all finite $n$, so that $\hat\theta_n(X)$ need not be unbiased nor efficient nor even have finite variance in the usual senses for any value of $n$.

(2). If $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$, so that $X_i \sim U(0,\theta)$, then the order of the rate of convergence in the analog of (5.3) is $n$ instead of $\sqrt n$ and the limit is a one-sided exponential, not a normal distribution. (Exercise: Prove this.) Thus the conditions of Theorem 5.1 are essential.
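As an illustration of (5.3)-(5.4), consider the Laplace family of Section 2.4, whose MLE is the sample median (Lemma 2.1). Using the fact, not derived in the text, that $I(f,\theta) = 1/c^2$ for the Laplace density $L(\theta,c)$, the standardized MLE $\sqrt{nI(f,\theta)}\,(\hat\theta_n - \theta)$ should be approximately standard normal. A simulation sketch, assuming NumPy; the values of $\theta$, $c$, and $n$ are illustrative.

```python
# The sample median of a Laplace sample, standardized by sqrt(n/c^2),
# should be approximately N(0, 1) for large n.
import numpy as np

rng = np.random.default_rng(5)
theta, c, n, reps = 0.0, 1.0, 200, 50_000
X = rng.laplace(theta, c, size=(reps, n))

Z = np.sqrt(n / c**2) * (np.median(X, axis=1) - theta)
print(Z.mean(), Z.var())          # approximately 0 and 1
```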
Asymptotic Confidence Intervals. We can use (5.3) to find asymptotic confidence intervals for the true value of $\theta$ based on the MLE $\hat\theta_n(X)$. It follows from (5.3) and properties of the standard normal distribution that
$$\lim_{n\to\infty} P\Big(-\frac{1.96}{\sqrt{nI(f,\theta)}} < \hat\theta_n - \theta < \frac{1.96}{\sqrt{nI(f,\theta)}}\Big) \qquad (5.5)$$
$$= \lim_{n\to\infty} P\Big(\hat\theta_n(X) - \frac{1.96}{\sqrt{nI(f,\theta)}} < \theta < \hat\theta_n(X) + \frac{1.96}{\sqrt{nI(f,\theta)}}\Big) = 0.95$$
Under the assumptions of Theorem 5.1, we can approximate the Fisher information $I(f,\theta)$ in (3.4) by $I\big(f,\hat\theta_n(X)\big)$, which does not depend explicitly on $\theta$. The expression $I\big(f,\hat\theta_n(X)\big)$ is called the empirical Fisher information of $\theta$ depending on $X_1,\dots,X_n$. This and (5.5) imply that
$$\Big(\hat\theta_n(X) - \frac{1.96}{\sqrt{n\,I\big(f,\hat\theta_n(X)\big)}},\ \ \hat\theta_n(X) + \frac{1.96}{\sqrt{n\,I\big(f,\hat\theta_n(X)\big)}}\Big) \qquad (5.6)$$
is an asymptotic 95% confidence interval for the true value of $\theta$.
Examples (1). Let $f(x,p) = p^x(1-p)^{1-x}$ for $x = 0,1$ for the Bernoulli distribution. (That is, tossing a biased coin.) Then
$$\log f(x,p) = x\log(p) + (1-x)\log(1-p)$$
$$\frac{\partial}{\partial p}\log f(x,p) = \frac{x}{p} - \frac{1-x}{1-p} \quad\text{and}\quad \frac{\partial^2}{\partial p^2}\log f(x,p) = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}$$
Thus by Lemma 3.2 the Fisher information is
$$I(f,p) = -E\Big(\frac{\partial^2}{\partial p^2}\log f(X,p)\Big) = \frac{E(X)}{p^2} + \frac{E(1-X)}{(1-p)^2} = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac1p + \frac{1}{1-p} = \frac{1}{p(1-p)}$$
This implies
$$\frac{1}{\sqrt{n\,I(f,p)}} = \sqrt{\frac{p(1-p)}{n}}$$
Hence in this case (5.6) is exactly the same as the usual (approximate) 95% confidence interval for the binomial distribution.
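For concreteness, here is the interval (5.6) for the Bernoulli family computed on a small illustrative data set (not from the text), assuming NumPy is available.

```python
# Wald-type interval (5.6) for a Bernoulli sample: p-hat +/- 1.96*sqrt(p-hat(1-p-hat)/n).
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1])
n = len(x)
p_hat = x.mean()                                    # MLE of p
half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)      # 1.96 / sqrt(n I(f, p-hat))
print(p_hat, (p_hat - half, p_hat + half))
```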
(2). Let $f(x,\theta) = \theta x^{\theta-1}$ for $0 \le x \le 1$. Then
$$Y_k(\theta) = (\partial/\partial\theta)\log f(X_k,\theta) = (1/\theta) + \log(X_k)$$
$$W_k(\theta) = -(\partial^2/\partial\theta^2)\log f(X_k,\theta) = 1/\theta^2$$
Since $(\partial/\partial\theta)\log L(\theta,X) = \sum_{k=1}^n Y_k(\theta) = (n/\theta) + \sum_{k=1}^n \log(X_k)$, it follows that
$$\hat\theta_n(X) = -\frac{n}{\sum_{k=1}^n \log(X_k)} \qquad (5.7)$$
Similarly, $I(f,\theta) = E_\theta\big(W_k(\theta)\big) = 1/\theta^2$ by Lemma 3.2. Hence by (5.6)
$$\Big(\hat\theta_n(X) - \frac{1.96\,\hat\theta_n(X)}{\sqrt n},\ \ \hat\theta_n(X) + \frac{1.96\,\hat\theta_n(X)}{\sqrt n}\Big) \qquad (5.8)$$
is an asymptotic 95% confidence interval for $\theta$.
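The MLE (5.7) and the interval (5.8) can be checked on simulated data, since $F(x) = x^\theta$ on $[0,1]$ makes inverse-CDF sampling easy. A sketch assuming NumPy; the values of $\theta$ and $n$ are illustrative.

```python
# MLE (5.7) and asymptotic 95% interval (5.8) for f(x, theta) = theta * x^(theta - 1).
import numpy as np

rng = np.random.default_rng(3)
theta_true, n = 2.5, 400
X = rng.uniform(size=n) ** (1 / theta_true)       # inverse-CDF sampling: F(x) = x^theta

theta_hat = -n / np.sum(np.log(X))                # (5.7)
half = 1.96 * theta_hat / np.sqrt(n)              # (5.8): 1/sqrt(n I) = theta/sqrt(n)
print(theta_hat, (theta_hat - half, theta_hat + half))
```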
Proof of Theorem 5.1. Let
$$M(\theta) = \frac{\partial}{\partial\theta}\log L(\theta, X_1,\dots,X_n) \qquad (5.9)$$
where $L(\theta,X_1,\dots,X_n)$ is the likelihood function defined in (3.1). Let $\hat\theta_n(X)$ be the maximum likelihood estimator of $\theta$. Since $\hat\theta_n(X)$ is attained in the interior of $K$ by condition (iv),
$$M(\hat\theta_n) = \frac{\partial}{\partial\theta}\log L(\hat\theta_n, X) = 0$$
and by Lemma 3.1
$$M(\theta) = \frac{\partial}{\partial\theta}\log L(\theta,X) = \sum_{k=1}^n \frac{\partial}{\partial\theta}\log f(X_k,\theta) = \sum_{k=1}^n Y_k(\theta)$$
where $Y_k(\theta)$ are the scores defined in Section 3. By the mean value theorem
$$M(\hat\theta_n) - M(\theta) = (\hat\theta_n-\theta)\,\frac{d}{d\theta}M(\theta^*_n) = (\hat\theta_n-\theta)\,\frac{\partial^2}{\partial\theta^2}\log L(\theta^*_n, X) = (\hat\theta_n-\theta)\sum_{k=1}^n (\partial^2/\partial\theta^2)\log f\big(X_k,\theta^*_n(X)\big)$$
where $\theta^*_n(X)$ is a value between $\theta$ and $\hat\theta_n(X)$. Since $M(\hat\theta_n) = 0$
$$\hat\theta_n - \theta = \frac{-M(\theta)}{(d/d\theta)M(\theta^*_n)} = \frac{-\sum_{k=1}^n Y_k(\theta)}{\sum_{k=1}^n (\partial^2/\partial\theta^2)\log f(X_k,\theta^*_n)} \qquad (5.10)$$
Thus
$$\sqrt{nI(f,\theta)}\,\big(\hat\theta_n - \theta\big) = \frac{\dfrac{1}{\sqrt{nI(f,\theta)}}\displaystyle\sum_{k=1}^n Y_k(\theta)}{-\dfrac{1}{nI(f,\theta)}\displaystyle\sum_{k=1}^n (\partial^2/\partial\theta^2)\log f\big(X_k,\theta^*_n(X)\big)} \qquad (5.11)$$
By Lemma 3.1, the $Y_k(\theta)$ are independent with the same distribution with $E_\theta\big(Y_k(\theta)\big) = 0$ and $\mathrm{Var}_\theta\big(Y_k(\theta)\big) = I(f,\theta)$. Thus by the central limit theorem
$$\lim_{n\to\infty} P\Big(\frac{1}{\sqrt{nI(f,\theta)}}\sum_{k=1}^n Y_k(\theta) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx \qquad (5.12)$$
Similarly, by Lemma 3.2, $W_k(\theta) = -(\partial^2/\partial\theta^2)\log f(X_k,\theta)$ are independent with $E_\theta\big(W_k(\theta)\big) = I(f,\theta)$. Thus by the law of large numbers
$$\lim_{n\to\infty}\Big(-\frac{1}{nI(f,\theta)}\sum_{k=1}^n \frac{\partial^2}{\partial\theta^2}\log f(X_k,\theta)\Big) = 1 \qquad (5.13)$$
in the sense of convergence in the law of large numbers. One can show that, under the assumptions of Theorem 5.1, we can replace $\theta^*_n(X)$ on the right-hand side of (5.11) by $\theta$ as $n\to\infty$. It can then be shown from (5.11)-(5.13) that
$$\lim_{n\to\infty} P\Big(\sqrt{nI(f,\theta)}\,\big(\hat\theta_n(X)-\theta\big) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx$$
for all real values of $y$. This completes the proof of Theorem 5.1. $\square$
6. The Most Powerful Hypothesis Tests are Likelihood Ratio Tests.
The preceding sections have been concerned with estimation and interval estimation. These are concerned with finding the most likely value or range of values of a parameter $\theta$, given an independent sample $X_1,\dots,X_n$ from a probability density $f(x,\theta)$ for an unknown value of $\theta$.

In contrast, hypothesis testing has a slightly different emphasis. Suppose that we want to use data $X_1,\dots,X_n$ to decide between two different hypotheses, which by convention are called hypotheses $H_0$ and $H_1$. The hypotheses are not treated in a symmetrical manner. Specifically,

$H_0$: What one would believe if one had no additional data
$H_1$: What one would believe if the data $X_1,\dots,X_n$ makes the alternative hypothesis $H_1$ significantly more likely.

Rather than estimate a parameter, we decide between two competing hypotheses, or more exactly decide (yes or no) whether the data $X_1,\dots,X_n$ provide sufficient evidence to reject the conservative hypothesis $H_0$ in favor of a new hypothesis $H_1$.

This is somewhat like an estimation procedure with $D(X) = D(X_1,\dots,X_n) = 1$ for hypothesis $H_1$ and $D(X_1,\dots,X_n) = 0$ for $H_0$. However, this doesn't take into account the question of whether we have sufficient evidence to reject $H_0$.

A side effect of the bias towards $H_0$ is that choosing $H_1$ can be viewed as proving $H_1$ in some sense, while choosing $H_0$ may just mean that we do not have enough evidence one way or the other and so stay with the more conservative hypothesis.
Example. (Modified from Larsen and Marx, pages 428-431.) Suppose that it is generally believed that a certain type of car averages 25.0 miles per gallon (mpg). Assume that measurements $X_1,\dots,X_n$ of the miles per gallon are normally distributed with distribution $N(\mu,\sigma_0^2)$ with $\sigma_0 = 2.4$. The conventional wisdom is then $\mu = \mu_0 = 25.0$.

A consumers' group suspects that the current production run of cars actually has a higher mileage rate. In order to test this, the group runs $n = 30$ cars through a typical course intended to measure miles per gallon. The results are observations of mpg $X_1,\dots,X_{30}$ with sample mean $\bar X = (1/n)\sum_{i=1}^{30} X_i = 26.50$. Is this sufficient evidence to conclude that mileage per gallon has improved?

In this case, the conservative hypothesis is
$$H_0:\ X_i \sim N(\mu_0,\sigma_0^2) \qquad (6.1)$$
for $\mu_0 = 25.0$ and $\sigma_0 = 2.40$. The alternative hypothesis is
$$H_1:\ X_i \sim N(\mu,\sigma_0^2)\ \text{ for some } \mu > \mu_0 \qquad (6.2)$$
A standard statistical testing procedure is, in this case, first to choose a level of significance $\alpha$ that represents the degree of confidence that we need to reject $H_0$ in favor of $H_1$. The second step is to choose a critical value $\lambda = \lambda(\alpha)$ with the property that
$$P(\bar X \ge \lambda) = P\big(\bar X \ge \lambda(\alpha)\big) = \alpha \qquad (6.3)$$
Given $\alpha$, the value $\lambda = \lambda(\alpha)$ in (6.3) can be determined from the properties of normal distributions and the parameters in (6.1), and is in fact $\lambda = 25.721$ for $\alpha = 0.05$ and $n = 30$. (See below.)

The final step is to compare the measured $\bar X = 26.50$ with $\lambda$. If $\bar X \ge \lambda$, we reject $H_0$ and conclude that the mpgs of the cars have improved. If $\bar X < \lambda$, we assume that, even though $\bar X > \mu_0$, we do not have sufficient evidence to conclude that mileage has improved. Since $\bar X = 26.50 > 25.721$, we reject $H_0$ in favor of $H_1$ for this value of $\alpha$, and conclude that the true $\mu > 25.0$.
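The critical value 25.721 and the decision above can be reproduced as follows; this is a sketch assuming SciPy is available for the normal quantile and tail functions.

```python
# Critical value and decision for the mpg example (6.1)-(6.3).
from scipy.stats import norm

mu0, sigma0, n, alpha = 25.0, 2.4, 30, 0.05
xbar = 26.50

lam = mu0 + norm.ppf(1 - alpha) * sigma0 / n**0.5    # critical value, about 25.72
p_value = norm.sf((xbar - mu0) / (sigma0 / n**0.5))  # one-sided P-value
print(lam, xbar >= lam, p_value)
```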
Before determining whether or not this is the best possible test, we first need to discuss what is a test, as well as a notion of best.

6.1. What is a Test? What Do We Mean by the Best Test?
The standard test procedure leading up to (6.3) leaves open a number of questions. Why should the best testing procedure involve $\bar X$ and not a more complicated function of $X_1,\dots,X_n$? Could we do better if we used more of the data? Even if the best test involves only $\bar X$, why necessarily the simple form $\bar X > \lambda$?

More importantly, what should we do if the data $X_1,\dots,X_n$ are not normal under $H_0$ and $H_1$, and perhaps involve a family of densities $f(x,\theta)$ for which the MLE is not the sample mean? Or if $H_0$ is expressed in terms of one family of densities (such as $N(\mu,\sigma_0^2)$) and $H_1$ in terms of a different family, such as gamma distributions?
Before proceeding, we need a general definition of a test, and later a definition of best.

Assume for definiteness that $\theta, X_1,\dots,X_n$ are all real numbers. We then define (an abstract) test to be an arbitrary subset $C \subseteq \mathbb R^n$, with the convention that we choose $H_1$ if the data $X = (X_1,\dots,X_n) \in C$ and otherwise choose $H_0$. (The set $C \subseteq \mathbb R^n$ is sometimes called the critical region of the test.) Note that the decision rule $D(X)$ discussed above is now the indicator function $D(X) = I_C(X)$.

In the example (6.1)-(6.3), $C = \big\{\,\vec x : \bar x = \frac1n\sum_{i=1}^n x_i \ge \lambda\,\big\}$ for $\vec x = (x_1,\dots,x_n)$, so that $X \in C$ if and only if $\bar X \ge \lambda$.

Later we will derive a formula that gives the best possible test in many circumstances. Before continuing, however, we need some more definitions.
6.2. Simple vs. Composite Tests. In general, we say that a hypothesis ($H_0$ or $H_1$) is a simple hypothesis or is simple if it uniquely determines the density of the random variables $X_i$. The hypothesis is composite otherwise.

For example, suppose that the $X_i$ are known to have density $f(x,\theta)$ for unknown $\theta$ for a family of densities $f(x,\theta)$, as in (6.1)-(6.2) for a normal family with known variance. Then
$$H_0:\ \theta = \theta_0 \quad\text{and}\quad H_1:\ \theta = \theta_1 \qquad (6.4)$$
are both simple hypotheses. If as in (6.1)-(6.2)
$$H_0:\ \theta = \theta_0 \quad\text{and}\quad H_1:\ \theta > \theta_0 \qquad (6.5)$$
then $H_0$ is simple but $H_1$ is composite.

Fortunately, it often turns out that the best test for $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$ is the same test for all $\theta_1 > \theta_0$, so that it is also the best test against $H_1: \theta > \theta_0$. Thus, in this case, it is sufficient to consider simple hypotheses as in (6.4).
6.3. The Size and Power of a Test. If we make a decision between two hypotheses $H_0$ and $H_1$ on the basis of data $X_1,\dots,X_n$, then there are two types of error that we can make.

The first type (called a Type I error) is to reject $H_0$ and decide on $H_1$ when, in fact, the conservative hypothesis $H_0$ is true. The probability of a Type I error (which can only happen if $H_0$ is true) is called the false positive rate. The reason for this is that deciding on the a priori less likely hypothesis $H_1$ is called a positive result. (Think of proving $H_1$ as the first step towards a big raise, or perhaps towards getting a Nobel prize. On the other hand, deciding on $H_1$ could mean that you have a dread disease, which you might not consider a positive result at all. Still, it is a positive result for the test, if not necessarily for you.)

Suppose that $H_0$ and $H_1$ are both simple as in (6.4). Then the probability of a Type I error for the test $C$, or equivalently the false positive rate, is
$$\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}) = P(\text{choose } H_1 \mid H_0) = P(X \in C \mid H_0) = \int_C f(\vec x,\theta_0)\,d\vec x \qquad (6.6)$$
where
$$f(\vec x,\theta) = f(x_1,\theta)\,f(x_2,\theta)\cdots f(x_n,\theta) \qquad (6.7)$$
is the joint probability density of the sample $X = (X_1,\dots,X_n)$ and $\int_C f(\vec x,\theta_0)\,d\vec x$ is an $n$-fold integral.

As the form of (6.6) indicates, $\alpha$ depends only on the hypothesis $H_0$ and not on $H_1$, since it is given by the integral of $f(\vec x,\theta_0)$ over $C$ and does not involve $\theta_1$. Similarly, the critical value $\lambda = \lambda(\alpha)$ in (6.3) in the automobile example depends only on $\alpha$ and $n$ and the parameters involved in $H_0$.

The value $\alpha$ in (6.6) is also called the level of significance of the test $C$ (or, more colloquially, of the test with critical region $C$). As mentioned above, $\alpha$ depends only on the hypothesis $H_0$ and is given by the integral of a probability density over $C$. For this reason, $\alpha$ is also called the size of the test $C$. That is,
$$\mathrm{Size}(C) = \int_C f(\vec x,\theta_0)\,d\vec x \qquad (6.8)$$
Note that we have just given four different verbal definitions for the value $\alpha$ in (6.6) or the value of the integral in (6.8). This illustrates the importance of $\alpha$ for hypothesis testing.
Similarly, a Type II error is to reject $H_1$ and choose $H_0$ when the alternative $H_1$ is correct. The probability of a Type II error is called the false negative rate, since it amounts to failing to detect $H_1$ when $H_1$ is correct. This is
$$\beta = P(\text{reject } H_1 \mid H_1) = \int_{\mathbb R^n \setminus C} f(\vec x,\theta_1)\,d\vec x \qquad (6.9)$$
for $\theta_1$ in (6.4) and $f(\vec x,\theta_1)$ in (6.7). Note that $\beta$ depends only on $H_1$ and not on $H_0$.

The power of a test is the probability of deciding correctly on $H_1$ if $H_1$ is true, and is called the true positive rate. It can be written
$$\mathrm{Power}(\theta_1) = 1 - \beta = P(\text{choose } H_1 \mid H_1) = \int_C f(\vec x,\theta_1)\,d\vec x \qquad (6.10)$$
The power $\mathrm{Power}(\theta)$ is usually written as a function of $\theta$ since the hypothesis $H_1$ is more likely to be composite. Note that both the level of significance $\alpha$ and the power $\mathrm{Power}(\theta_1)$ involve integrals over the same critical region $C$, but with different densities.

To put these definitions in a table:
Table 6.1. Error Type and Probabilities
                         What We Decide
Which is True            H0                  H1
H0                       OK                  Type I (prob. alpha)
H1                       Type II (prob. beta)  OK (prob. = Power)
If $H_0$ and/or $H_1$ are composite, then $\alpha$, $\beta$, and the power are replaced by their worst possible values. That is, if for example
$$H_0:\ X_i \sim f_0(x)\ \text{ for some density } f_0 \in \mathcal T_0$$
$$H_1:\ X_i \sim f_1(x)\ \text{ for some density } f_1 \in \mathcal T_1$$
for two classes of densities $\mathcal T_0,\mathcal T_1$ on $\mathbb R$, then
$$\alpha = \sup_{f_0\in\mathcal T_0}\int_C f_0(\vec x)\,d\vec x, \qquad \beta = \sup_{f_1\in\mathcal T_1}\int_{\mathbb R^n\setminus C} f_1(\vec x)\,d\vec x$$
and
$$\mathrm{Power} = \inf_{f_1\in\mathcal T_1}\int_C f_1(\vec x)\,d\vec x$$
6.4. The Neyman-Pearson Lemma. As suggested earlier, a standard approach is to choose a highest acceptable false positive rate $\alpha$ (for rejecting $H_0$) and restrict ourselves to tests $C$ with that false positive rate or smaller.

Among this class of tests, we would like to find the test that has the highest probability of detecting $H_1$ when $H_1$ is true. This is called (reasonably enough) the most powerful test of $H_0$ against $H_1$ among tests $C$ of a given size $\alpha$ or smaller.

Assume for simplicity that $H_0$ and $H_1$ are both simple hypotheses, so that
$$H_0:\ X_i \sim f(x,\theta_0) \quad\text{and}\quad H_1:\ X_i \sim f(x,\theta_1) \qquad (6.11)$$
where $X_i \sim f(x)$ means that the observations $X_i$ are independently chosen from the density $f(x)$ and $f(x,\theta)$ is a family of probability densities. As mentioned above, both the size and power of a test $C \subseteq \mathbb R^n$ can be expressed as $n$-dimensional integrals over $C$:
$$\mathrm{Size}(C) = \int_C f(\vec x,\theta_0)\,d\vec x \quad\text{and}\quad \mathrm{Power}(C) = \int_C f(\vec x,\theta_1)\,d\vec x \qquad (6.12)$$
The next result uses (6.12) to find the most powerful tests of one simple hypothesis against another at a fixed level of significance $\alpha$.

Theorem 6.1. (Neyman-Pearson Lemma) Assume that the set
$$C_0 = C_0(\lambda) = \Big\{\,\vec x\in\mathbb R^n : \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} \ge \lambda\,\Big\} \qquad (6.13)$$
has $\mathrm{Size}(C_0) = \alpha$ for some constant $\lambda > 0$. Then
$$\mathrm{Power}(C) \le \mathrm{Power}\big(C_0(\lambda)\big) \qquad (6.14)$$
for any other subset $C \subseteq \mathbb R^n$ with $\mathrm{Size}(C) \le \alpha$.

Remarks (1). This means that $C_0(\lambda)$ is the most powerful test of $H_0$ against $H_1$ with size $\mathrm{Size}(C) \le \alpha$.
(2). If $\vec x = X$ for data $X = (X_1,\dots,X_n)$, then the ratio in (6.13)
$$L(\vec x,\theta_1,\theta_0) = \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} = \frac{L(\theta_1,X)}{L(\theta_0,X)} \qquad (6.15)$$
is a ratio of likelihoods. In this sense, the tests $C_0(\lambda)$ in Theorem 6.1 are likelihood-ratio tests.

(3). Suppose that the likelihood $L(\theta,X) = f(X_1,\theta)\cdots f(X_n,\theta)$ has a sufficient statistic $S(X) = S(X_1,\dots,X_n)$. That is,
$$L(\theta,X) = f(X_1,\theta)\cdots f(X_n,\theta) = g\big(S(X),\theta\big)\,A(X) \qquad (6.16)$$
Then, since the factors $A(\vec x)$ cancel out in the likelihood ratio, the most-powerful tests
$$C_0(\lambda) = \Big\{\,\vec x\in\mathbb R^n : \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} \ge \lambda\,\Big\} = \Big\{\,\vec x\in\mathbb R^n : \frac{g\big(S(\vec x),\theta_1\big)}{g\big(S(\vec x),\theta_0\big)} \ge \lambda\,\Big\}$$
depend only on the sufficient statistic $S(X)$.

(4). By (6.12) and (6.15)
$$\mathrm{Size}(C) = \int_C f(\vec x,\theta_0)\,d\vec x \quad\text{and}\quad \mathrm{Power}(C) = \int_C L(\vec x,\theta_1,\theta_0)\,f(\vec x,\theta_0)\,d\vec x$$
for $L(\vec x,\theta_1,\theta_0)$ in (6.15). Intuitively, the set $C$ that maximizes $\mathrm{Power}(C)$ subject to $\mathrm{Size}(C) \le \alpha$ should be the set of size $\alpha$ with the largest values of $L(\vec x,\theta_1,\theta_0)$. This is essentially the proof of Theorem 6.1 given below.
Before giving a proof of Theorem 6.1, let's give some examples.

Example (1). Continuing the example (6.1)-(6.2) where $f(x,\mu)$ is the normal density $N(\mu,\sigma_0^2)$, the joint density (or likelihood) is
$$f(X_1,\dots,X_n,\mu) = \Big(\frac{1}{\sqrt{2\pi\sigma_0^2}}\Big)^{\!n}\exp\Big(-\frac{1}{2\sigma_0^2}\sum_{i=1}^n (X_i-\mu)^2\Big) = C_1(\mu,\sigma_0,n)\,\exp\Big(-\frac{1}{2\sigma_0^2}\Big(\sum_{i=1}^n X_i^2 - 2\mu\sum_{i=1}^n X_i\Big)\Big)$$
Since the factor containing $\sum_{i=1}^n X_i^2$ is the same in both likelihoods, the likelihood ratio is
$$\frac{f(X_1,\dots,X_n,\mu_1)}{f(X_1,\dots,X_n,\mu_0)} = C_2\,\exp\Big(\frac{(\mu_1-\mu_0)}{\sigma_0^2}\sum_{j=1}^n X_j\Big) \qquad (6.17)$$
where $C_2 = C_2(\mu_1,\mu_0,\sigma_0,n)$. If $\mu_0 < \mu_1$ are fixed, the likelihood-ratio sets $C_0(\lambda)$ in (6.13) are
$$C_0(\lambda) = \Big\{\,\vec x : C_2\exp\Big(\frac{(\mu_1-\mu_0)}{\sigma_0^2}\sum_{j=1}^n x_j\Big) \ge \lambda\,\Big\} \qquad (6.18a)$$
$$= \Big\{\,\vec x : \frac1n\sum_{i=1}^n x_i \ge \lambda_m\,\Big\} \qquad (6.18b)$$
where $\lambda_m$ is a monotonic function of $\lambda$. Thus the most powerful tests of $H_0$ against $H_1$ for any $\mu_1 > \mu_0$ are tests of the form $\bar X \ge \lambda_m$. As in (6.3), the constants $\lambda_m = \lambda_m(\alpha)$ are determined by
$$\mathrm{Size}\big(C(\alpha)\big) = \alpha = P_{\mu_0}\big(\bar X \ge \lambda_m(\alpha)\big)$$
Since $X_i \sim N(\mu_0,\sigma_0^2)$ and $\bar X \sim N(\mu_0,\sigma_0^2/n)$, this implies
$$\lambda_m(\alpha) = \mu_0 + \frac{\sigma_0}{\sqrt n}\,z_\alpha \qquad (6.19)$$
where $P(Z \ge z_\alpha) = \alpha$ for a standard normal random variable $Z$. Since the critical region in (6.18b) does not depend on $\mu_1$, the same test is most powerful against every $\mu_1 > \mu_0$; such a test is called uniformly most powerful (UMP).

Example (2). Let $f(x,\theta) = \theta x^{\theta-1}$ for $0 \le x \le 1$ and $\theta > 0$, as in Example (2) of Section 5. Then the likelihood is
$$\prod_{j=1}^n f(x_j,\theta) = \prod_{j=1}^n \theta x_j^{\theta-1} = \theta^n\Big(\prod_{j=1}^n x_j\Big)^{\theta-1}$$
In general if $\theta_0 < \theta_1$, the likelihood ratio is
$$\frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} = \frac{\theta_1^n\big(\prod_{j=1}^n x_j\big)^{\theta_1-1}}{\theta_0^n\big(\prod_{j=1}^n x_j\big)^{\theta_0-1}} = C\Big(\prod_{j=1}^n x_j\Big)^{\theta_1-\theta_0} \qquad (6.20)$$
for $C = C(\theta_0,\theta_1,n)$. Thus the most powerful tests of $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$ for $\theta_0 < \theta_1$ are
$$C_0(\lambda) = \Big\{\,\vec x : C\Big(\prod_{j=1}^n x_j\Big)^{\theta_1-\theta_0} \ge \lambda\,\Big\} \qquad (6.21a)$$
$$= \Big\{\,\vec x : \prod_{j=1}^n x_j \ge \lambda_m\,\Big\} \qquad (6.21b)$$
where $\lambda_m$ is a monotonic function of $\lambda$.

Note that the function $\lambda_m = \lambda_m(\alpha)$ in (6.21b) depends on $H_0$ but not on $H_1$. Thus the tests (6.21b) are UMP for $\theta_1 > \theta_0$ as in Example 1.
Exercise. For $H_0: \theta = \theta_0$, prove that the tests
$$C_0(\lambda) = \Big\{\,\vec x : \prod_{j=1}^n x_j \le \lambda_m\,\Big\} \qquad (6.22)$$
are UMP against $H_1: \theta = \theta_1$ for all $\theta_1 < \theta_0$.
6.5. P-values. The nested structure of the likelihood-ratio sets in (6.13) and (6.14)
$$C(\lambda_\alpha) = \Big\{\,\vec x\in\mathbb R^n : \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} \ge \lambda_\alpha\,\Big\} \quad\text{where}\quad \mathrm{Size}\big(C(\lambda_\alpha)\big) = P_{\theta_0}\Big(\Big\{\,X : \frac{f(X,\theta_1)}{f(X,\theta_0)} \ge \lambda_\alpha\,\Big\}\Big) = \alpha \qquad (6.23)$$
means that we can give a single number that describes the outcome of the tests (6.23) for all $\alpha$. Specifically, let
$$P = P\Big(\frac{f(X,\theta_1)}{f(X,\theta_0)} \ge T_0 \,\Big|\, H_0\Big) \qquad (6.24)$$
where $T_0 = T_0(X) = f(X,\theta_1)/f(X,\theta_0)$ for the observed sample. Note that the $X$ in (6.24) is random with distribution $H_0$, but the $X$ in $T_0(X)$ is the observed sample and assumed constant. Then

Lemma 6.1. Suppose that $X = (X_1,\dots,X_n)$ is an independent sample with density $f(x,\theta)$. Suppose that we can find constants $\lambda_\alpha$ such that $\mathrm{Size}\big(C(\lambda_\alpha)\big) = \alpha$ for each $0 < \alpha < 1$. If $P < \alpha$, then the observed $X \in C(\lambda_\alpha)$ and we reject $H_0$. If $P > \alpha$, then the observed $X \notin C(\lambda_\alpha)$ and we accept $H_0$.

Proof. If $P < \alpha$, then the observed $T_0(X) > \lambda_\alpha$, so that $X \in C(\lambda_\alpha)$. Hence we reject $H_0$. If $P > \alpha$, then the observed $T_0(X) < \lambda_\alpha$, so that $X \notin C(\lambda_\alpha)$ and we accept $H_0$. $\square$
7. Generalized Likelihood Ratio Tests. Assume as before that $X_1,\dots,X_n$ is an independent sample from a density $f(x,\theta)$, and that the hypotheses are $H_0: \theta\in\Omega_0$ and $H_1: \theta\in\Omega_1$ for two sets $\Omega_0$ and $\Omega_1$ of parameter values. If $\theta_0\in\Omega_0$ and $\theta_1\in\Omega_1$ were single known values, the most powerful tests of Section 6 would have critical regions
$$C_\lambda = \Big\{\,\vec x\in\mathbb R^n : \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} \ge \lambda\,\Big\} \qquad (7.4)$$
where $\vec x = (x_1,\dots,x_n)$. That is, we reject $H_0$ in favor of $H_1$ if $X = (X_1,\dots,X_n) \in C_\lambda$.
The idea behind generalized likelihood-ratio tests (abbreviated GLRTs) is that we use the likelihood-ratio test (7.4) with our best guesses for $\theta_0\in\Omega_0$ and $\theta_1\in\Omega_1$. That is, we define
$$\widehat{LR}_n(X) = \frac{\max_{\theta\in\Omega_1} L(\theta, X_1,\dots,X_n)}{\max_{\theta\in\Omega_0} L(\theta, X_1,\dots,X_n)} = \frac{L\big(\hat\theta_{H_1}(X), X_1,\dots,X_n\big)}{L\big(\hat\theta_{H_0}(X), X_1,\dots,X_n\big)} \qquad (7.5)$$
where $\hat\theta_{H_0}(X)$ and $\hat\theta_{H_1}(X)$ are the maximum-likelihood estimates for $\theta\in\Omega_0$ and $\theta\in\Omega_1$, respectively. Note that $\widehat{LR}_n(X)$ depends on $X$ but not on $\theta$ (except indirectly from the sets $\Omega_0$ and $\Omega_1$). We then use the tests with critical regions
$$C_\lambda = \big\{\,\vec x\in\mathbb R^n : \widehat{LR}_n(\vec x) \ge \lambda\,\big\} \quad\text{with}\quad \mathrm{Size}(C_\lambda) = \alpha \qquad (7.6)$$
Since the maximum likelihood estimates $\hat\theta_{H_0}(X)$, $\hat\theta_{H_1}(X)$ in (7.5) depend on $X = (X_1,\dots,X_n)$, the Neyman-Pearson lemma does not guarantee that (7.6) provides the most powerful tests. However, the asymptotic consistency of the MLEs (see Theorem 5.1 above) suggests that $\hat\theta_{H_0}, \hat\theta_{H_1}$ may be close to the correct values.
Warning: Some statisticians, such as the authors of the textbook Larsen and Marx, use an alternative version of the likelihood ratio
$$\widehat{LR}^{\mathrm{alt}}_n(X) = \frac{\max_{\theta\in\Omega_0} L(\theta, X_1,\dots,X_n)}{\max_{\theta\in\Omega_1} L(\theta, X_1,\dots,X_n)} = \frac{L\big(\hat\theta_{H_0}(X), X_1,\dots,X_n\big)}{L\big(\hat\theta_{H_1}(X), X_1,\dots,X_n\big)} \qquad (7.7)$$
with the maximum for $H_1$ in the denominator instead of the numerator and the maximum for $H_0$ in the numerator instead of the denominator. One then tests for small values of the GLRT statistic instead of large values. Since $\widehat{LR}^{\mathrm{alt}}_n(X) = 1/\widehat{LR}_n(X)$, the critical tests for (7.7) are
$$C^{\mathrm{alt}}_\lambda = \big\{\,\vec x\in\mathbb R^n : \widehat{LR}^{\mathrm{alt}}_n(\vec x) \le \lambda\,\big\} = \big\{\,\vec x\in\mathbb R^n : \widehat{LR}_n(\vec x) \ge 1/\lambda\,\big\} \quad\text{with}\quad \mathrm{Size}(C^{\mathrm{alt}}_\lambda) = \alpha \qquad (7.8)$$
Thus the critical regions (7.8) are exactly the same as those for (7.6) except for a transformation $\lambda \to 1/\lambda$.
Examples (1). Take $\Omega_0 = \{1\}$ and $\Omega_1 = (0,1)$. This corresponds to the hypotheses $H_0: \theta = 1$ and $H_1: \theta < 1$ where $X_1,\dots,X_n$ are $U(0,\theta)$. Then one can show
$$\widehat{LR}_n(X) = (1/X_{\max})^n, \qquad X_{\max} = \max_{1\le j\le n} X_j$$
(Argue as in Section 6.5 in the text, Larsen and Marx.) The GLRTs (7.6) in this case are equivalent to
$$C_\lambda(X) = \{\,X : \widehat{LR}_n(X) \ge \lambda\,\} = \{\,X : (1/X_{\max})^n \ge \lambda\,\} = \{\,X : X_{\max} \le \lambda_m\,\} \quad\text{where}\quad P(X_{\max} \le \lambda_m \mid H_0) = \alpha \qquad (7.9)$$
In Example 2 (see (7.2b) above), $\Omega_0 = \{\,(\mu,\sigma^2) : \mu = \mu_0\,\}$ and $\Omega_1 = \{\,(\mu,\sigma^2) : \mu \ne \mu_0\,\}$. This corresponds to the hypotheses $H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$ where $X_1,\dots,X_n$ are $N(\mu,\sigma^2)$ with $\sigma^2$ unspecified. One can show in this case that
$$\widehat{LR}_n(X) = \Big(1 + \frac{T(X)^2}{n-1}\Big)^{n/2} \qquad (7.10)$$
$$\text{where}\qquad T(X) = \frac{\sqrt n\,\big(\bar X - \mu_0\big)}{S(X)}, \qquad S(X)^2 = \frac{1}{n-1}\sum_{j=1}^n (X_j - \bar X)^2$$
(See Appendix 7.A.4, pages 519-521, in the textbook, Larsen and Marx. They obtain (7.10) with $-n/2$ in the exponent instead of $n/2$ because they use (7.7) instead of (7.5) to define the GLRT statistic, which is $\widehat{LR}_n(X)$ here but $\Lambda$ in their notation.)

Since $\widehat{LR}_n(X)$ is a monotonic function of $|T(X)|$, the GLRT test (7.6) is equivalent to
$$C_\lambda(X) = \{\,X : |T(X)| \ge \lambda_m\,\} \quad\text{where}\quad P\big(|T(X)| \ge \lambda_m \,\big|\, H_0\big) = \alpha \qquad (7.11)$$
This is the same as the classical two-sided one-sample Student-t test.
There is a useful large-sample asymptotic version of the GLRT, for which it is easy to find the critical values. Suppose that
$$H_0:\ \theta\in\Omega_0 \quad\text{and}\quad H_1:\ \theta\in\Omega_1 \qquad (7.12)$$
where $\Omega_0$ and $\Omega_1$ are parameter sets with $m_0$ and $m_1$ free parameters, respectively (for example, subsets of $\mathbb R^4$).

Since $\Omega_0\subseteq\Omega_1$, a test of $H_0$ against $H_1$ cannot be of the form either-or as in Section 6, since $\theta\in\Omega_0$ implies $\theta\in\Omega_1$. Instead, we view (7.12) with $\Omega_0\subseteq\Omega_1$ as a test of whether we really need the additional $d = m_1 - m_0$ parameter or parameters. That is, if the data $X = (X_1,\dots,X_n)$ does not fit the hypothesis $H_1$ sufficiently better than $H_0$ (as measured by the relative size of the fitted likelihoods in (7.5)) to provide evidence for rejecting $H_0$, then, to be conservative, we accept $H_0$ and conclude that there is not enough evidence for the more complicated hypothesis $H_1$.

A test of the form (7.12) with $\Omega_0\subseteq\Omega_1$ is called a nested hypothesis test. Note that, if $\Omega_0\subseteq\Omega_1$, then (7.5) implies that $\widehat{LR}_n(X) \ge 1$.
Under the following assumptions for a nested hypothesis test, we have the following general theorem. Assume as before that $X_1,\dots,X_n$ is an independent sample with density $f(x,\theta)$ where $f(x,\theta)$ satisfies the conditions of the Cramer-Rao lower bound (Theorem 4.1 in Section 4 above) and of the asymptotic normality of the MLE (Theorem 5.1 in Section 5 above). Then we have

Theorem 7.1. (Twice the Log-Likelihood Theorem) Under the above assumptions, assume that $\Omega_0\subseteq\Omega_1$ in (7.12), that $d = m_1 - m_0 > 0$, and that the two maximum-likelihood estimates $\hat\theta_{H_0}(X)$ and $\hat\theta_{H_1}(X)$ in (7.5) are attained in the interior of the sets $\Omega_0$ and $\Omega_1$, respectively. Then, for $\widehat{LR}_n(X)$ in (7.5),
$$\lim_{n\to\infty} P\Big(2\log\big(\widehat{LR}_n(X)\big) \le y \,\Big|\, H_0\Big) = P\big(\chi^2_d \le y\big) \qquad (7.13)$$
for $y \ge 0$, where $\chi^2_d$ represents a random variable with a $\chi^2$ distribution with $d = m_1 - m_0$ degrees of freedom.

Proof. The proof is similar to the proof of Theorem 5.1 in Section 5, but uses an $m$-dimensional central limit theorem for vector-valued random variables in $\mathbb R^m$ and Taylor's Theorem in $\mathbb R^m$ instead of in $\mathbb R^1$. $\square$
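Theorem 7.1 can be illustrated with the nested normal-mean test of (7.10) above, for which $2\log\widehat{LR}_n(X) = n\log\big(1 + T(X)^2/(n-1)\big)$ and $d = 1$. The following simulation sketch works under $H_0$ and assumes NumPy and SciPy are available; the parameter values are illustrative.

```python
# Under H0, 2*log(LR_n) from (7.10) should be approximately chi^2 with 1 df.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, reps, mu0 = 50, 100_000, 0.0
X = rng.normal(mu0, 1.0, size=(reps, n))           # data generated under H0

T = np.sqrt(n) * (X.mean(axis=1) - mu0) / X.std(axis=1, ddof=1)
two_log_LR = n * np.log1p(T**2 / (n - 1))          # 2 log LR_n for the test (7.10)

# The upper tail of 2 log LR_n should match the chi^2_1 tail probability.
print(np.mean(two_log_LR > 3.84), chi2.sf(3.84, df=1))
```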
Remarks (1). The analog of Theorem 7.1 for the alternative (upside-down) definition of $\widehat{LR}^{\mathrm{alt}}_n(X)$ in (7.7) has $-2\log\big(\widehat{LR}^{\mathrm{alt}}_n(X)\big)$ instead of $2\log\big(\widehat{LR}_n(X)\big)$.

(2). There is no analog of Theorem 7.1 if the hypotheses $H_0$ and $H_1$ are not nested. Finding a good asymptotic test for general non-nested composite hypotheses is an open question in Statistics that would have many important applications.
8. Fisher's Meta-Analysis Theorem. Suppose that we are interested in whether we can reject a hypothesis $H_0$ in favor of a hypothesis $H_1$. Assume that six different groups have carried out statistical analyses based on different datasets with mixed results. Specifically, assume that they have reported the six P-values (as in Section 6.5 above)
$$0.06 \quad 0.02 \quad 0.13 \quad 0.21 \quad 0.22 \quad 0.73 \qquad (8.1)$$
While only one of the six groups rejected $H_0$ at level $\alpha = 0.05$, and that was with borderline significance ($0.01 < P < 0.05$), five of the six P-values are rather small. Is there a way to assign an aggregated P-value to the six P-values in (8.1)? After reading these six studies (and finding nothing wrong with them), should we accept $H_0$ or reject $H_0$ in favor of $H_1$ at level $\alpha = 0.05$?

The first step is to find the random distribution of P-values that independent experiments or analyses of the same true hypothesis $H_0$ should attain. Suppose that each experimenter used a likelihood-ratio test of the Neyman-Pearson form (6.23) or GLRT form (7.6) where it is possible to find a value $\lambda_\alpha$ with $\mathrm{Size}\big(C(\lambda_\alpha)\big) = \alpha$ for each $0 < \alpha < 1$. Then, under $H_0$, the P-value satisfies $P(P < \alpha) = \alpha$ for $0 < \alpha < 1$. This means that P is uniformly distributed in (0, 1).

Given that the numbers in (8.1) should be uniformly distributed if $H_0$ is true, do these numbers seem significantly shifted towards smaller values, as they might be if $H_1$ were true? The first step towards answering this is to find a reasonable alternative distribution of the P-values given $H_1$.
Fisher most likely considered the family of distributions $f(p,\theta) = \theta p^{\theta-1}$ for $0 < \theta \le 1$, so that $H_0$ corresponds to $\theta = 1$. For $\theta < 1$, not only is $E(P) = \theta/(\theta+1) < 1/2$, but the density $f(p,\theta)$ has an infinite cusp at $p = 0$.

For this family, the likelihood of random P-values $P_1,\dots,P_n$ given $\theta$ is
$$L(\theta, P_1,\dots,P_n) = \prod_{j=1}^n f(P_j,\theta) = \theta^n\Big(\prod_{j=1}^n P_j\Big)^{\theta-1}$$
Thus $Q = \prod_{j=1}^n P_j$ is a sufficient statistic for $\theta$, and we have at least a single number to summarize the six values in (8.1).

Moreover, it follows as in Example 2 in Section 6.4 above that tests of the form $\{\,P : \prod_{j=1}^n P_j \le \lambda_m\,\}$ are UMP for $H_0: \theta = 1$ against $H_1: \theta < 1$. To calibrate such a test we need the null distribution of $\prod_{j=1}^n P_j$, or equivalently of $\sum_{j=1}^n 2\log(1/P_j)$.

Lemma 8.1. Let $U, U_1,\dots,U_n$ be independent and uniformly distributed in $(0,1)$. Then
(a) for $A > 0$, $Y = -A\log(U)$ has an exponential distribution with rate $1/A$,
(b) $2\log(1/U)$ has a $\chi^2_2$ distribution, and
(c) $\sum_{j=1}^n 2\log(1/U_j) \sim \chi^2_{2n}$ has a chi-square distribution with $2n$ degrees of freedom.
Proof. (a) For $A, t > 0$
$$P(Y > t) = P\big(-A\log(U) > t\big) = P\big(\log(U) < -t/A\big) = P\big(U < \exp(-t/A)\big) = \exp(-t/A)$$
This implies that $Y$ has a probability density $f_Y(t) = -(d/dt)\exp(-t/A) = (1/A)\exp(-t/A)$, which is exponential with rate $1/A$.

(b) A $\chi^2_d$ distribution is gamma$(d/2,\,1/2)$, so that $\chi^2_2 \sim$ gamma$(1,\,1/2)$. By the form of the gamma density, gamma$(1,\lambda)$ is exponential with rate $\lambda$. Thus, by part (a), $2\log(1/U) \sim$ gamma$(1,\,1/2) \sim \chi^2_2$.

(c) Each $2\log(1/P_j) \sim \chi^2_2$, which implies that
$$Q = \sum_{j=1}^n 2\log(1/P_j) \sim \chi^2_{2n} \qquad (8.2)$$
Putting these results together,
Theorem 8.1 (Fisher). Assume independent observations $U_1,\dots,U_n$ have density $f(x,\theta) = \theta x^{\theta-1}$ for $0 \le x \le 1$, and in particular are independent and uniformly distributed in $(0,1)$ if $\theta = 1$. Then, the P-value of the UMP test for $H_0: \theta = 1$ against $H_1: \theta < 1$ is
$$P = P\big(\chi^2_{2n} \ge Q_0\big)$$
where $Q_0$ is the observed value of $2\sum_{j=1}^n \log(1/U_j)$ and $\chi^2_{2n}$ represents a chi-square distribution with $2n$ degrees of freedom.

Proof. By Lemma 8.1 and (8.2). $\square$
Example. The numbers $P_1,\dots,P_6$ in (8.1) satisfy
$$\sum_{j=1}^6 2\log(1/P_j) = 5.63 + 7.82 + 4.08 + 3.12 + 3.03 + 0.63 = 24.31$$
Thus the P-value in Theorem 8.1 is
$$P = P\big(\chi^2_{12} \ge 24.31\big) = 0.0185$$
Thus the net effect of the six tests with P-values in (8.1) is $P = 0.0185$, which is significant at $\alpha = 0.05$ but not at $\alpha = 0.01$.
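The arithmetic in this example is easy to reproduce, assuming NumPy and SciPy are available:

```python
# Fisher's combined test (Theorem 8.1) applied to the six P-values in (8.1).
import numpy as np
from scipy.stats import chi2

p = np.array([0.06, 0.02, 0.13, 0.21, 0.22, 0.73])
Q0 = np.sum(2 * np.log(1 / p))           # about 24.31
print(Q0, chi2.sf(Q0, df=2 * len(p)))    # combined P-value, about 0.0185
```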
9. Two Contingency-Table Tests. Consider the following contingency table for n = 1033 individuals with two classifications A and B:
Table 9.1. A Contingency Table for A and B
B: 1 2 3 4 5 6 Sums:
1 29 11 95 78 50 47 310
A: 2 38 17 106 105 74 49 389
3 31 9 60 49 29 28 206
4 17 13 35 27 21 15 128
Sums: 115 50 296 259 174 139 1033
It is assumed that the data in Table 9.1 come from independent observations $Y_i = (A_i, B_i)$ for $n = 1033$ individuals, where $A_i$ is one of 1, 2, 3, 4 and $B_i$ is one of 1, 2, 3, 4, 5, 6. Rather than write out the $n = 1033$ values, it is more convenient to represent the data as 24 counts for the $4\times 6$ possible $(A,B)$ values, as we have done in Table 9.1.

Suppose we want to test the hypothesis that the $Y_i$ are sampled from a population for which $A$ and $B$ are independent. (Sometimes this hypothesis is stated as "rows and columns are independent", but this doesn't make very much sense if you analyze it closely.)
If the sample is homogeneous, each observation $Y_i = (A_i, B_i)$ has a multivariate Bernoulli distribution with probability function $P\big(Y = (a,b)\big) = p_{ab}$ for $1 \le a \le s$ and $1 \le b \le t$, where $s = 4$ is the number of rows in Table 9.1 and $t = 6$ is the number of columns, and $\sum_{a=1}^s\sum_{b=1}^t p_{ab} = 1$. If the random variables $A$ and $B$ are independent, then $P\big(Y = (a,b)\big) = P(A=a)P(B=b)$. If $P(A=a) = p^A_a$ and $P(B=b) = p^B_b$, then $p_{ab} = p^A_a p^B_b$. This suggests the two nested hypotheses
$$H_1:\ p_{ab} > 0\ \text{are arbitrary subject to}\ \sum_{a=1}^s\sum_{b=1}^t p_{ab} = 1 \qquad (9.1)$$
$$H_0:\ p_{ab} = p^A_a\,p^B_b \ \text{ where }\ \sum_{a=1}^s p^A_a = \sum_{b=1}^t p^B_b = 1$$
9.1. Pearson's Chi-Square Test.
We first consider the GLRT test for (9.1). Writing $p$ for the matrix $p = (p_{ab})$ ($1\le a\le s$, $1\le b\le t$), the likelihood of $Y = (Y_1,Y_2,\dots,Y_n)$ is
$$L(p, Y) = \prod_{i=1}^n \{\,q_i = p_{ab} : Y_i = (a,b)\,\} = \prod_{a=1}^s\prod_{b=1}^t p_{ab}^{X_{ab}} \qquad (9.2)$$
where $X_{ab}$ are the counts in Table 9.1. The MLE $\hat p_{H_1}$ for hypothesis $H_1$ can be found by the method of Lagrange multipliers by solving
$$\frac{\partial}{\partial p_{ab}}\log L(p,Y) = 0 \quad\text{subject to}\quad \sum_{a=1}^s\sum_{b=1}^t p_{ab} = 1$$
This leads to $(\hat p_{H_1})_{ab} = X_{ab}/n$. The MLE $\hat p_{H_0}$ can be found similarly as the solution of
$$\frac{\partial}{\partial p^A_a}\log L(p,Y) = \frac{\partial}{\partial p^B_b}\log L(p,Y) = 0 \quad\text{subject to}\quad \sum_{a=1}^s p^A_a = \sum_{b=1}^t p^B_b = 1$$
This implies $\hat p^A_a = X_{a+}/n$ and $\hat p^B_b = X_{+b}/n$, where $X_{a+} = \sum_{c=1}^t X_{ac}$ and $X_{+b} = \sum_{c=1}^s X_{cb}$. This in turn implies $(\hat p_{H_0})_{ab} = (X_{a+}/n)(X_{+b}/n)$. Thus the GLRT statistic for (9.1) is
$$\widehat{LR}_n(Y) = \frac{L(\hat p_{H_1}, Y)}{L(\hat p_{H_0}, Y)} = \frac{\prod_{a=1}^s\prod_{b=1}^t \big(X_{ab}/n\big)^{X_{ab}}}{\prod_{a=1}^s\big(X_{a+}/n\big)^{X_{a+}}\ \prod_{b=1}^t\big(X_{+b}/n\big)^{X_{+b}}} \qquad (9.3)$$
Note that hypothesis $H_1$ in (9.1) has $m_1 = st - 1$ free parameters, while hypothesis $H_0$ has $m_0 = (s-1) + (t-1)$ free parameters. The difference is
$$d = m_1 - m_0 = st - 1 - (s-1) - (t-1) = st - s - t + 1 = (s-1)(t-1)$$
Thus by Theorem 7.1 in Section 7
$$\lim_{n\to\infty} P\Big(2\log\big(\widehat{LR}_n(X)\big) \le y \,\Big|\, H_0\Big) = P\big(\chi^2_d \le y\big) \qquad (9.4)$$
where $d = (s-1)(t-1)$. The test of $H_0$ against $H_1$ based on (9.4) is often called the G-test.
Pearson's "Sum of (Observed $-$ Expected)$^2$/Expected" statistic is
$$D_n(Y) = \sum_{a=1}^s\sum_{b=1}^t \frac{\big(X_{ab} - n\,\hat p^A_a\,\hat p^B_b\big)^2}{n\,\hat p^A_a\,\hat p^B_b} = \sum_{a=1}^s\sum_{b=1}^t \frac{\big(X_{ab} - (X_{a+}X_{+b}/n)\big)^2}{(X_{a+}X_{+b}/n)}$$
It was proven in class in a more general context that
$$E\Big(\big|2\log\widehat{LR}_n(Y) - D_n(Y)\big|\Big) \le \frac{C}{\sqrt n}$$
for $n \ge 1$. It can be shown that this in combination with (9.4) implies
$$\lim_{n\to\infty} P\big(D_n(Y) \le y \,\big|\, H_0\big) = P\big(\chi^2_d \le y\big) \qquad (9.5)$$
Thus the GLRT test for $H_0$ within $H_1$ in (9.1) is asymptotically equivalent to a test on $D_n(Y)$, for which the P-value can be written asymptotically
$$P = P\big(\chi^2_d \ge D_n(Y)_{\mathrm{Obs}}\big)$$
where "Obs" stands for "Observed value of."

For the data in Table 9.1, $D_n(Y) = 19.33$ and $P = 0.199$ for $d = (4-1)(6-1) = 15$ degrees of freedom. Thus, the data in Table 9.1 is not significant using Pearson's chi-square test.
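The statistic $D_n(Y)$, the G-statistic $2\log\widehat{LR}_n(Y)$ from (9.3), and their chi-square P-values for Table 9.1 can be computed directly. A sketch assuming NumPy and SciPy are available:

```python
# Pearson chi-square statistic (9.5) and G-test statistic (9.4) for Table 9.1.
import numpy as np
from scipy.stats import chi2

X = np.array([[29, 11,  95,  78, 50, 47],
              [38, 17, 106, 105, 74, 49],
              [31,  9,  60,  49, 29, 28],
              [17, 13,  35,  27, 21, 15]])
n = X.sum()
expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / n    # n * p^A_a * p^B_b under H0
d = (X.shape[0] - 1) * (X.shape[1] - 1)

D_n = np.sum((X - expected) ** 2 / expected)             # Pearson statistic
G = 2 * np.sum(X * np.log(X / expected))                 # 2 log LR_n from (9.3)
print(D_n, chi2.sf(D_n, d))                              # compare with 19.33 and 0.199
print(G, chi2.sf(G, d))
```

Both statistics should give P-values near 0.2 for these data, in agreement with the value quoted above.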
9.2. The Pearson Test is an Omnibus Test.
The GLRT test of (9.1) is sometimes called a test of $H_0$ against an omnibus alternative, since it is designed to have power against any alternative $p_{ab}$ for which $A$ and $B$ fail to be independent.

A test that is sensitive to a particular way in which $H_0$ may fail can have much greater power against that alternative than an omnibus test, which must guard against any possible failure of $H_0$. Conversely, a test that is tuned towards a particular alternative can fail miserably when $H_0$ is false for other reasons.

The shrinkage estimator in Section 2.1 provides a somewhat similar example. If we make even a crude guess what the true mean of a normal sample might be, then a shrinkage estimator towards that value can have smaller expected squared error than the sample mean estimator, which is the minimum-variance unbiased estimator for all possible true means. Conversely, if we guess wrongly about the true mean, the shrinkage estimator may have a much larger expected squared error.
9.3. The Mantel-Haenszel Trend Test.
Suppose that one suspects that the random variables $A$ and $B$ in Table 9.1 are correlated, as opposed to being independent. In particular, we would like a test of $H_0$ in (9.1) whose implicit alternative is that $A$, $B$ are correlated, which may have greater power if $A$ and $B$ are in fact correlated. We understand that this test may have much less power against an alternative to independence in which $A$ and $B$ are close to being uncorrelated.

The Mantel-Haenszel trend test does exactly this. (Note: This test is also called the Mantel trend test. The "trend" is necessary here because there is a contingency table test for stratified tables that is also called the Mantel-Haenszel test.)

Specifically, let $r$ be the sample Pearson correlation coefficient of $A_i$ and $B_i$ for the sample $Y_i = (A_i,B_i)$. That is,
$$r = \frac{\sum_{i=1}^n (A_i - \bar A)(B_i - \bar B)}{\sqrt{\sum_{i=1}^n (A_i-\bar A)^2}\ \sqrt{\sum_{i=1}^n (B_i-\bar B)^2}} \qquad (9.6)$$
Recall that $A_i$ takes on integer values with $1 \le A_i \le s$ and $B_i$ takes on integer values with $1 \le B_i \le t$. Then

Theorem 9.1 (Mantel-Haenszel). Under the assumptions of this section, using a permutation test based on permuting the values $B_i$ in $Y_i = (A_i, B_i)$ for $1\le i\le n$ among themselves while holding $A_i$ fixed,
$$\lim_{n\to\infty} P\big((n-1)\,r^2 \le y \,\big|\, H_0\big) = P\big(\chi^2_1 \le y\big) \qquad (9.7)$$
Remarks. The limits in Theorem 7.1 and (9.4) are based on a probability space that supports independent random variables with a given probability density $f(x,\theta)$.

In contrast, the underlying probability space in Theorem 9.1, in common with permutation tests in general, is defined by a set of permutations of the data under which the distribution of a sample statistic is the same as if $H_0$ is true. For example, in this case, if we choose $A_i$ at random from $A_1,\dots,A_n$ and match it with a randomly permuted $B_i$ at that $i$, then
$$P(A_i = a, B_i = b) = P(A_i = a)\,P(B_i = b)$$
and $A_i, B_i$ are independent. (In contrast, $B_1$ and $B_2$ are not independent. If $B_1$ happened to be a large value, then the value $B_2$ at a different offset in the permuted values, conditional on $B_1$ already having been chosen, would be drawn from values with a smaller mean. Thus $B_1$ and $B_2$ are negatively correlated.)

Since the pairs $A_i, B_i$ are independent in this permutation probability space, if the observed value of $r$ in (9.6) is far out on the tail of the statistics $r$ calculated by randomly permuting the $B_i$ in this manner, then it is likely that the observed $A_i$ and $B_i$ were not chosen from a distribution in which $A$ and $B$ were independent. We needn't worry that the set of possible P-values is overly discrete if $n$ is large, since in that case the number of permutations ($n!$) is truly huge. Since the test statistic in (9.7) is based on the sample correlation itself, if we reject $H_0$ then it is likely that $A$ and $B$ are correlated.
Example. For the data in Table 9.1, the sample correlation coefficient is $r = 0.071$ and $X_{\mathrm{obs}} = (n-1)r^2 = (1032)(0.071)^2 = 5.2367$. The P-value is $P = P(\chi^2_1 \ge 5.2367) = 0.0221$. Thus Table 9.1 shows a significant departure from independence by the Mantel test, but not by the standard Pearson test.

In general, one can get P-values for a $\chi^2_1$ distribution from a standard normal table, since it is the square of a standard normal. Thus
$$P = P(\chi^2_1 \ge 5.2367) = P(Z^2 \ge 5.2367) = P\big(|Z| \ge 2.2884\big) = 2\,P(Z \ge 2.2884) = 0.0221$$
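The Mantel-Haenszel trend statistic can be computed directly from the cell counts in Table 9.1, since the sample correlation (9.6) depends on the data only through the counts. A sketch assuming NumPy and SciPy are available:

```python
# Mantel-Haenszel trend statistic (n-1)*r^2 and its chi^2_1 P-value for Table 9.1.
import numpy as np
from scipy.stats import chi2

X = np.array([[29, 11,  95,  78, 50, 47],
              [38, 17, 106, 105, 74, 49],
              [31,  9,  60,  49, 29, 28],
              [17, 13,  35,  27, 21, 15]])
n = X.sum()
a_vals = np.arange(1, X.shape[0] + 1)          # row scores A_i = 1..4
b_vals = np.arange(1, X.shape[1] + 1)          # column scores B_i = 1..6

A_mean = (a_vals @ X.sum(axis=1)) / n
B_mean = (b_vals @ X.sum(axis=0)) / n
cov = np.sum(X * np.outer(a_vals - A_mean, b_vals - B_mean)) / n
var_A = np.sum(X.sum(axis=1) * (a_vals - A_mean) ** 2) / n
var_B = np.sum(X.sum(axis=0) * (b_vals - B_mean) ** 2) / n

r = cov / np.sqrt(var_A * var_B)               # sample correlation (9.6)
stat = (n - 1) * r**2
print(r, stat, chi2.sf(stat, df=1))            # compare with r = 0.071, 5.24, 0.0221
```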