$$E_\theta\big(T(X)\big) = \theta \quad\text{for all values of }\theta \qquad (2.1)$$
Here
$$E_\theta(X_1) = E_\theta(\bar X) = \frac{1}{n}\sum_{k=1}^n E_\theta(X_k) = \int x\,f(x,\theta)\,dx = \theta \qquad (2.2)$$
and both of the statistics $T_1 = X_1$ and $T_2 = \bar X$ are unbiased estimators of $\theta$.
If the density $f(x,\theta)$ is discrete instead of continuous, the integral in (2.2) is
replaced by a sum.
The relation (2.1) implies that if we had a large number of different samples $X^{(m)}$, each of size $n$, then the estimates $T(X^{(m)})$ should cluster around the true value of $\theta$. However, it says nothing about the sizes of the errors $T(X^{(m)}) - \theta$, which are likely to be more important.

The errors of $T(X)$ as an estimator of $\theta$ can be measured by a loss function $L(x,\theta)$, where $L(x,\theta) \ge 0$ and $L(\theta,\theta) = 0$ (see Larsen and Marx, page 419). The risk is the expected value of this loss, or
$$R(T,\theta) = E_\theta\Big(L\big(T(X),\theta\big)\Big)$$
The most common choice of loss function is the quadratic loss function $L(x,\theta) = (x-\theta)^2$, for which the risk is
$$R(T,\theta) = E_\theta\Big(\big(T(X)-\theta\big)^2\Big) \qquad (2.3)$$
Another choice is the absolute-value loss function $L(x,\theta) = |x-\theta|$, for which the risk is $R(T,\theta) = E_\theta\big(|T(X)-\theta|\big)$.
If $T(X)$ is an unbiased estimator and $L(x,\theta) = (x-\theta)^2$, then the risk (2.3) is the same as the variance
$$R(T,\theta) = \mathrm{Var}_\theta\big(T(X)\big)$$
but not if $T(X)$ is biased (that is, not unbiased).

Assume $E_\theta\big(T(X)\big) = \mu(\theta)$ for a possibly biased estimator $T(X)$. That is, $\mu(\theta) \ne \theta$ for some or all $\theta$. Let $S = T - \theta$, so that $E_\theta(S) = \mu(\theta) - \theta$. Then $R(T,\theta) = E_\theta\big((T-\theta)^2\big) = E_\theta(S^2)$ and it follows from the relation $\mathrm{Var}(S) = E(S^2) - E(S)^2$ that
$$R(T,\theta) = E_\theta\Big(\big(T(X)-\theta\big)^2\Big) = \mathrm{Var}_\theta\big(T(X)\big) + \big(\mu(\theta)-\theta\big)^2, \qquad \mu(\theta) = E_\theta\big(T(X)\big) \qquad (2.4)$$
In principle, we might be able to find a biased estimator $T(X)$ that outperforms an unbiased estimator $T_0(X)$ if the biased estimator has a smaller variance that more than offsets the term $\big(\mu(\theta)-\theta\big)^2$ in (2.4).
Example (1). Suppose that $X_1,\dots,X_n$ are normally distributed $N(\mu,\sigma^2)$ and we want to estimate $\mu$. Then one might ask whether the biased estimator
$$T(X_1,\dots,X_n) = \frac{X_1 + X_2 + \dots + X_n}{n+1} \qquad (2.5)$$
could have $R(T,\mu) < R(\bar X,\mu)$ for the MLE $\bar X = (X_1+\dots+X_n)/n$. While $T(X)$ is biased, it should also have a smaller variance since we divide by a larger number. As in (2.4)
$$R(\bar X,\mu) = E_\mu\big((\bar X-\mu)^2\big) = \mathrm{Var}(\bar X) = \frac{\sigma^2}{n} \qquad (2.6)$$
$$R(T,\mu) = E_\mu\big((T-\mu)^2\big) = \mathrm{Var}(T) + \big(E(T)-\mu\big)^2 = \mathrm{Var}\Big(\frac{X_1+\dots+X_n}{n+1}\Big) + \Big(\frac{n\mu}{n+1}-\mu\Big)^2 = \frac{n\sigma^2}{(n+1)^2} + \frac{\mu^2}{(n+1)^2}$$
Comparing $R(T,\mu)$ with $R(\bar X,\mu)$:
$$R(T,\mu) - R(\bar X,\mu) = \frac{n\sigma^2}{(n+1)^2} + \frac{\mu^2}{(n+1)^2} - \frac{\sigma^2}{n} = \frac{1}{(n+1)^2}\Big(\mu^2 - \frac{(n+1)^2 - n^2}{n}\,\sigma^2\Big) = \frac{1}{(n+1)^2}\Big(\mu^2 - \frac{2n+1}{n}\,\sigma^2\Big) \qquad (2.7)$$
Thus $R(T,\mu) < R(\bar X,\mu)$ if $\mu^2 < \big((2n+1)/n\big)\sigma^2$, which is guaranteed by $\mu^2 < 2\sigma^2$. In that case, $T(X)$ is less risky than $\bar X$ (in the sense of having smaller expected squared error) even though it is biased.
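The comparison in (2.6)-(2.7) is easy to check by simulation. The following is a minimal sketch, assuming NumPy is available; the values of $n$, $\mu$, and $\sigma$ are illustrative choices, not taken from the text.

```python
# Numerical check of (2.6)-(2.7) for the biased estimator (2.5).
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 10, 1.0, 1.0            # here mu^2 < 2*sigma^2, so T should win
reps = 200_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)                   # the MLE, X-bar
T = X.sum(axis=1) / (n + 1)             # the biased estimator (2.5)

risk_xbar = np.mean((xbar - mu) ** 2)   # estimates sigma^2 / n
risk_T = np.mean((T - mu) ** 2)         # estimates (n*sigma^2 + mu^2) / (n+1)^2
theory_diff = (mu**2 - (2*n + 1) / n * sigma**2) / (n + 1) ** 2   # (2.7)

print(risk_xbar, risk_T, risk_T - risk_xbar, theory_diff)
```

With $\mu^2 < 2\sigma^2$ as in this sketch, the estimated risk of $T$ should come out smaller than that of $\bar X$, with a difference close to the theoretical value from (2.7).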
2.1. Shrinkage Estimators. The estimator $T(X)$ in (2.5) can be written
$$T(X_1,\dots,X_n) = \frac{n}{n+1}\,\bar X + \frac{1}{n+1}\cdot 0$$
which is a convex combination of $\bar X$ and 0. A more general estimator is
$$T(X_1,\dots,X_n) = c\,\bar X + (1-c)\,a \qquad (2.8)$$
where $a$ is an arbitrary number and $0 < c < 1$. Estimators of the form (2.5) and (2.8) are called shrinkage estimators. While shrinkage estimators are biased unless $E(X_i) = \mu = a$, the calculation above shows that they have smaller risk if $\mu^2 < 2\sigma^2$ for (2.5) or $(\mu - a)^2 < \big((1+c)/(1-c)\big)(\sigma^2/n)$ for (2.8).

On the other hand, $R(T,\mu)$ and $R(\bar X,\mu)$ are of order $1/n$ and, by arguing as in (2.6) and (2.7), the difference between the two is of order $1/n^2$ for fixed $\mu$, $a$, and $0 < c < 1$. (Exercise: Prove this.) Thus one cannot go too far wrong by using $\bar X$ instead of a shrinkage estimator.
2.2. Ridge Regression. In ridge regression (which is discussed in other courses), the natural estimator $T_1(X)$ of certain parameters is unbiased, but $\mathrm{Var}(T_1)$ is very large because $T_1(X)$ depends on the inverse of a matrix that is very close to being singular.

The method of ridge regression finds biased estimators $T_2(X)$ that are similar to $T_1(X)$ such that $E\big(T_2(X)\big)$ is close to $E\big(T_1(X)\big)$ but $\mathrm{Var}\big(T_2(X)\big)$ is of moderate size. If this happens, then (2.4) with $T(X) = T_2(X)$ implies that the biased ridge regression estimator $T_2(X)$ can be a better choice than the unbiased estimator $T_1(X)$, since it can have much lower risk and give much more reasonable estimates.
2.3. Relative Efficiency. Let $T(X)$ and $T_0(X)$ be estimators of $\theta$, where $T_0(X)$ is viewed as a standard estimator such as $\bar X$ or the MLE (maximum likelihood estimator) of $\theta$ (see below). Then, the relative risk or relative efficiency of $T(X)$ with respect to $T_0(X)$ is
$$RR(T,\theta) = \frac{R(T_0,\theta)}{R(T,\theta)} = \frac{E_\theta\big((T_0(X)-\theta)^2\big)}{E_\theta\big((T(X)-\theta)^2\big)} \qquad (2.9)$$
Note that $T_0(X)$ appears in the numerator, not the denominator, and $T(X)$ appears in the denominator, not the numerator. If $RR(T,\theta) < 1$, then $R(T_0,\theta) < R(T,\theta)$ and $T(X)$ can be said to be less efficient, or more risky, than $T_0(X)$. Conversely, if $RR(T,\theta) > 1$, then $T(X)$ is more efficient (and less risky) than the standard estimator $T_0(X)$.
2.4. MLEs are Not Always Sample Means even if $E_\theta(X) = \theta$.
The most common example with $\hat\theta_{\mathrm{MLE}} = \bar X$ is the normal family $N(\theta,1)$. In that case, $\mathrm{Var}_\theta(X) = 1$ and $E_\theta(X) = \theta$.

A second example is the Laplace family $L(\theta,c)$ with density
$$f(x,\theta) = \frac{1}{2c}\,e^{-|x-\theta|/c}, \qquad -\infty < x < \infty \qquad (2.10)$$
where $c > 0$ is a scale parameter; we write $X \sim L(\theta,c)$ when $X$ has this density, and $E_\theta(X) = \theta$ by symmetry.

If $Y$ has the density (2.10), then $Y$ has the same distribution as $\theta + cY_0$ where $Y_0 \sim L(0,1)$. (Exercise: Prove this.) Thus the Laplace family (2.10) is a shift-and-scale family like the normal family $N(\mu,\sigma^2)$, and is similar to $N(\mu,\sigma^2)$ except that the probability density of $X \sim L(\theta,c)$ decays exponentially for large $x$ instead of faster than exponentially as is the case for the normal family. (It also has a non-differentiable cusp at $x = \theta$.)

In any event, one might expect that the MLE of $\theta$ might be less willing to put as much weight on large sample values than does the sample mean $\bar X$, since these values may be less reliable due to the relatively heavy tails of the Laplace distribution. In fact
Lemma 2.1. Let $X = (X_1,\dots,X_n)$ be an independent sample of size $n$ from the Laplace distribution (2.10) for unknown $\theta$ and $c$. Then
$$\hat\theta_{\mathrm{MLE}}(X) = \mathrm{median}\{\,X_1,\dots,X_n\,\} \qquad (2.11)$$
Remark. That is, if
$$X_{(1)} < X_{(2)} < \dots < X_{(n)} \qquad (2.12)$$
are the order statistics of the sample $X_1,\dots,X_n$, then
$$\hat\theta_{\mathrm{MLE}}(X) = \begin{cases} X_{(k+1)} & \text{if } n = 2k+1 \text{ is odd} \\[4pt] \big(X_{(k)}+X_{(k+1)}\big)/2 & \text{if } n = 2k \text{ is even} \end{cases} \qquad (2.13)$$
Thus $\hat\theta_{\mathrm{MLE}} = X_{(2)}$ if $n = 3$ and $X_{(1)} < X_{(2)} < X_{(3)}$, and $\hat\theta_{\mathrm{MLE}} = \big(X_{(2)}+X_{(3)}\big)/2$ if $n = 4$ and $X_{(1)} < X_{(2)} < X_{(3)} < X_{(4)}$.
Proof of Lemma 2.1. By (2.10), the likelihood of $\theta$ is
$$L(\theta, X_1,\dots,X_n) = \prod_{i=1}^n \Big(\frac{1}{2c}\,e^{-|X_i-\theta|/c}\Big) = \frac{1}{(2c)^n}\exp\Big(-\sum_{i=1}^n \frac{|X_i-\theta|}{c}\Big)$$
It follows that the likelihood $L(\theta,X)$ is maximized whenever the sum
$$M(\theta) = \sum_{i=1}^n |X_i - \theta| = \sum_{i=1}^n |X_{(i)} - \theta| \qquad (2.14)$$
is minimized, where $X_{(i)}$ are the order statistics in (2.12).

The function $M(\theta)$ in (2.14) is continuous and piecewise linear. If $X_{(m)} \le \theta \le X_{(m+1)}$ (that is, if $\theta$ lies between the $m$th and the $(m+1)$st order statistics of $\{X_i\}$), then $X_{(i)} \le X_{(m)} \le \theta$ if $i \le m$ and $\theta \le X_{(m+1)} \le X_{(i)}$ if $m+1 \le i \le n$. Thus
$$M(\theta) = \sum_{i=1}^n |X_{(i)}-\theta| = \sum_{i=1}^m (\theta - X_{(i)}) + \sum_{i=m+1}^n (X_{(i)} - \theta)$$
and if $X_{(m)} < \theta < X_{(m+1)}$
$$\frac{d}{d\theta}M(\theta) = M'(\theta) = m - (n-m) = 2m - n$$
It follows that $M'(\theta) < 0$ (and $M(\theta)$ is decreasing) if $m < n/2$, and $M'(\theta) > 0$ (and $M(\theta)$ is increasing) if $m > n/2$. If $n = 2k+1$ is odd, then $n/2 = k + (1/2)$ and $M(\theta)$ is strictly decreasing if $\theta < X_{(k+1)}$ and is strictly increasing if $\theta > X_{(k+1)}$. It follows that the minimum value of $M(\theta)$ is attained at $\theta = X_{(k+1)}$.

If $n = 2k$ is even, then, by the same argument, $M(\theta)$ is minimized at any point in the interval $\big(X_{(k)}, X_{(k+1)}\big)$, so that any value in that interval maximizes the likelihood. When that happens, the usual convention is to set the MLE equal to the center of the interval, which is the average of the endpoints. Thus $\hat\theta_{\mathrm{MLE}} = X_{(k+1)}$ if $n = 2k+1$ is odd and $\hat\theta_{\mathrm{MLE}} = \big(X_{(k)}+X_{(k+1)}\big)/2$ if $n = 2k$ is even, which implies (2.13). $\square$
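A quick numerical sanity check of Lemma 2.1 is to minimize $M(\theta)$ on a fine grid and compare the minimizer with the sample median. This is a sketch assuming NumPy; the sample size and Laplace parameters are illustrative.

```python
# Check that the sample median minimizes M(theta) = sum_i |X_i - theta|.
import numpy as np

rng = np.random.default_rng(1)
X = rng.laplace(loc=2.0, scale=1.5, size=11)          # n = 11 is odd

grid = np.linspace(X.min() - 1, X.max() + 1, 20001)
M = np.abs(X[:, None] - grid[None, :]).sum(axis=0)    # M(theta) on the grid

theta_grid = grid[np.argmin(M)]
print(theta_grid, np.median(X))    # the two values should agree closely
```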
A third example of a density with $E_\theta(X) = E_\theta(\bar X) = \theta$ is
$$f(x,\theta) = \tfrac12\, I_{(\theta-1,\,\theta+1)}(x) \qquad (2.15)$$
which we can call the centered uniform distribution of length 2. If $X$ has density (2.15), then $X$ is uniformly distributed between $\theta-1$ and $\theta+1$ and $E_\theta(X) = \theta$. The likelihood of $\theta$ is
$$L(\theta,X) = \prod_{i=1}^n\Big(\tfrac12\,I_{(\theta-1,\,\theta+1)}(X_i)\Big) = \frac{1}{2^n}\prod_{i=1}^n I_{(X_i-1,\,X_i+1)}(\theta) = \frac{1}{2^n}\,I_{(X_{\max}-1,\,X_{\min}+1)}(\theta) \qquad (2.16)$$
since (i) $\theta-1 < X_i < \theta+1$ if and only if $X_i-1 < \theta < X_i+1$, so that $I_{(\theta-1,\theta+1)}(X_i) = I_{(X_i-1,X_i+1)}(\theta)$, and (ii) the product of the indicator functions is non-zero if and only if $X_i < \theta+1$ and $\theta-1 < X_i$ for all $i$, which is equivalent to $\theta-1 < X_{\min} \le X_{\max} < \theta+1$ or $X_{\max}-1 < \theta < X_{\min}+1$.

Thus the likelihood is zero except for $\theta \in (X_{\max}-1,\,X_{\min}+1)$, where the likelihood has the constant value $1/2^n$. Following the same convention as in (2.13), we set
$$\hat\theta_{\mathrm{MLE}}(X) = \frac{X_{\max}+X_{\min}}{2} \qquad (2.17)$$
(Exercise: Note that normally $X_{\min} < X_{\max}$. Prove that the interval $(X_{\max}-1,\,X_{\min}+1)$ is generally nonempty for the density (2.15).)
2.5. Relative Efficiencies of Three Sample Estimators. We can use computer simulation to compare the relative efficiencies of the sample mean, the sample median, and the average of the sample minima and maxima for the three distributions in the previous subsection. Recall that, while all three distributions are symmetric about a shift parameter $\theta$, the MLEs of $\theta$ are the sample mean, the sample median, and the average of the sample minimum and maximum, respectively, and are not the same.

It is relatively easy to use a computer to do random simulations of $n$ random samples $X^{(j)}$ ($1 \le j \le n$) for each of these distributions, where each random sample $X^{(j)} = \big(X^{(j)}_1,\dots,X^{(j)}_m\big)$ is of size $m$. Thus the randomly simulated data for each distribution will involve generating $n\,m$ random numbers.

For each set of simulated data and each sample estimator $T(X)$, we estimate the risk by $(1/n)\sum_{j=1}^n \big(T(X^{(j)})-\theta\big)^2$. Analogously with (2.9), we estimate the relative risk with respect to the sample mean $\bar X$ by
$$RR(T,\theta) = \frac{(1/n)\sum_{j=1}^n \big(\bar X^{(j)}-\theta\big)^2}{(1/n)\sum_{j=1}^n \big(T(X^{(j)})-\theta\big)^2}$$
Then $RR(T,\theta) < 1$ means that the sample mean has less risk, while $RR(T,\theta) > 1$ implies that it is riskier. Since all three distributions are shift invariant in $\theta$, it is sufficient to assume $\theta = 0$ in the simulations.

The simulations show that, in each of the three cases, the MLE is the most efficient of the three estimators of $\theta$. Recall that the MLE is the sample mean only for the normal family. Specifically, we find
Table 2.1: Estimated relative efficiencies with respect to the sample
mean for n = 1,000,000 simulated samples, each of size m = 10:
Distrib     Mean   Median   AvMinMax   Most Efficient
CentUnif    1.0    0.440    2.196      AvMinMax
Normal      1.0    0.723    0.540      Mean
Laplace     1.0    1.379    0.243      Median
The results are even more striking for samples of size 30:
Table 2.2: Estimated relative efficiencies with respect to the sample
mean for n = 1,000,000 simulated samples, each of size m = 30:
Distrib     Mean   Median   AvMinMax   Most Efficient
CentUnif    1.0    0.368    5.492      AvMinMax
Normal      1.0    0.666    0.265      Mean
Laplace     1.0    1.571    0.081      Median
Table 2.2 shows that the sample mean has a 3:2 advantage over the sample median for normal samples, but a 3:2 deficit for the Laplace distribution. Averaging the sample minimum and maximum is 5-fold better than the sample mean for the centered uniforms, but is 12-fold worse for the Laplace distribution. Of the three distributions, the Laplace has the largest probability of large values.
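A sketch of the simulation behind Tables 2.1-2.2 is given below, assuming NumPy is available. The number of simulated samples is reduced here for speed; with $n = 10^6$ the values should be close to those in the tables.

```python
# Estimated relative efficiencies of mean, median, and (min+max)/2, theta = 0.
import numpy as np

def rel_eff(draw, m=10, n=100_000, seed=0):
    rng = np.random.default_rng(seed)
    X = draw(rng, (n, m))
    mean = X.mean(axis=1)
    median = np.median(X, axis=1)
    avminmax = 0.5 * (X.min(axis=1) + X.max(axis=1))
    risk = lambda T: np.mean(T ** 2)              # theta = 0
    r0 = risk(mean)                               # risk of the sample mean
    return r0 / risk(mean), r0 / risk(median), r0 / risk(avminmax)

dists = {
    "CentUnif": lambda rng, s: rng.uniform(-1.0, 1.0, s),
    "Normal":   lambda rng, s: rng.normal(0.0, 1.0, s),
    "Laplace":  lambda rng, s: rng.laplace(0.0, 1.0, s),
}
for name, draw in dists.items():
    print(name, [round(v, 3) for v in rel_eff(draw)])
```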
3. Scores and Fisher Information. Let $X_1, X_2,\dots,X_n$ be an independent sample of observations from a density $f(x,\theta)$ where $\theta$ is an unknown parameter. Then the likelihood function of the parameter $\theta$ given the data $X_1,\dots,X_n$ is
$$L(\theta, X_1,\dots,X_n) = f(X_1,\theta)\,f(X_2,\theta)\cdots f(X_n,\theta) \qquad (3.1)$$
where the observations $X_1,\dots,X_n$ are used in (3.1) instead of dummy variables $x_k$. Since the data $X_1,\dots,X_n$ is assumed known, $L(\theta,X_1,\dots,X_n)$ depends only on the parameter $\theta$.

The maximum likelihood estimator of $\theta$ is the value $\theta = \hat\theta(X)$ that maximizes the likelihood (3.1). This can often be found by forming the partial derivative of the logarithm of the likelihood
$$\frac{\partial}{\partial\theta}\log L(\theta, X_1,\dots,X_n) = \sum_{k=1}^n \frac{\partial}{\partial\theta}\log f(X_k,\theta) \qquad (3.2)$$
and setting this expression equal to zero. The sum in (3.2) is sufficiently important in statistics that not only the individual terms in the sum, but also their variances, have names.

Specifically, the scores of the observations $X_1,\dots,X_n$ for the density $f(x,\theta)$ are the terms
$$Y_k(\theta) = \frac{\partial}{\partial\theta}\log f(X_k,\theta) \qquad (3.3)$$
Under appropriate assumptions on $f(x,\theta)$ (see Lemma 3.1 below), the scores $Y_k(\theta)$ have mean zero. (More exactly, $E_\theta\big(Y_k(\theta)\big) = 0$, where the same value of $\theta$ is used in both parts of the expression.)

The Fisher information of an observation $X_1$ from $f(x,\theta)$ is the variance of the scores
$$I(f,\theta) = \mathrm{Var}_\theta\big(Y_k(\theta)\big) = \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)^2 f(x,\theta)\,dx \qquad (3.4)$$
Under an additional hypothesis (see Lemma 3.2 below), we also have
$$I(f,\theta) = -\int\Big(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\Big) f(x,\theta)\,dx \qquad (3.5)$$
which is often easier to compute since it involves a mean rather than a second moment.
For example, assume $X_1,\dots,X_n$ are normally distributed with unknown mean $\theta$ and known variance $\sigma_0^2$. Then
$$f(x,\theta) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\,e^{-(x-\theta)^2/2\sigma_0^2}, \qquad -\infty < x < \infty \qquad (3.6)$$
Thus
$$\log f(x,\theta) = -\frac12\log(2\pi\sigma_0^2) - \frac{(x-\theta)^2}{2\sigma_0^2}$$
It follows that $(\partial/\partial\theta)\log f(x,\theta) = (x-\theta)/\sigma_0^2$, and hence the $k$th score is
$$Y_k(\theta) = \frac{\partial}{\partial\theta}\log f(X_k,\theta) = \frac{X_k-\theta}{\sigma_0^2} \qquad (3.7)$$
In particular $E_\theta\big(Y_k(\theta)\big) = 0$ as expected since $E_\theta(X_k) = \theta$, and, since $E_\theta\big((X_k-\theta)^2\big) = \sigma_0^2$, the scores have variance
$$I(f,\theta) = E_\theta\big(Y_k(\theta)^2\big) = \frac{E_\theta\big((X_k-\theta)^2\big)}{(\sigma_0^2)^2} = \frac{1}{\sigma_0^2} \qquad (3.8)$$
In this case, the relation
$$\frac{\partial^2}{\partial\theta^2}\log f(X,\theta) = -\frac{1}{\sigma_0^2}$$
from (3.7) combined with (3.5) gives an easier derivation of (3.8).

The Fisher information $I(f,\theta) = 1/\sigma_0^2$ for (3.6) is large (that is, each $X_k$ has lots of information) if $\sigma_0^2$ is small (so that the error in each $X_k$ is small), and similarly the Fisher information is small if $\sigma_0^2$ is large. This may have been one of the original motivations for the term information.
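The value $I(f,\theta) = 1/\sigma_0^2$ in (3.8) can be checked numerically by simulating the scores (3.7) and computing their sample mean and variance. A minimal sketch, assuming NumPy is available; the parameter values are arbitrary.

```python
# Numerical check of (3.7)-(3.8): the scores have mean 0 and variance 1/sigma0^2.
import numpy as np

rng = np.random.default_rng(2)
theta, sigma0 = 1.5, 2.0
X = rng.normal(theta, sigma0, size=1_000_000)

scores = (X - theta) / sigma0**2           # Y_k(theta) from (3.7)
print(scores.mean())                       # approximately 0
print(scores.var(), 1 / sigma0**2)         # approximately the Fisher information
```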
We give examples below of the importance of scores and Fisher information. First, we give a proof that $E_\theta\big(Y_k(\theta)\big) = 0$ under certain conditions.

Lemma 3.1. Suppose that $K = \{\,x : f(x,\theta) > 0\,\}$ is the same bounded or unbounded interval for all $\theta$, that $f(x,\theta)$ is smooth enough that we can interchange the derivative and integral in the first line of the proof, and that $(\partial/\partial\theta)\log f(x,\theta)$ is integrable on $K$. Then
$$E_\theta\big(Y_k(\theta)\big) = \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big) f(x,\theta)\,dx = 0 \qquad (3.9)$$
Proof. Since $\int f(x,\theta)\,dx = 1$ for all $\theta$, we can differentiate
$$\frac{d}{d\theta}\int f(x,\theta)\,dx = 0 = \int\frac{\partial}{\partial\theta} f(x,\theta)\,dx = \int\frac{(\partial/\partial\theta)f(x,\theta)}{f(x,\theta)}\,f(x,\theta)\,dx = \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)f(x,\theta)\,dx = 0 \qquad\square$$

Lemma 3.2. Suppose that $f(x,\theta)$ satisfies the same conditions as in Lemma 3.1 and that $\log f(x,\theta)$ has two partial derivatives in $\theta$ that are continuous and bounded on $K$. Then
$$I(f,\theta) = E_\theta\big(Y_k(\theta)^2\big) = -\int\Big(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\Big)f(x,\theta)\,dx \qquad (3.10)$$
Proof. Extending the proof of Lemma 3.1,
$$\frac{d^2}{d\theta^2}\int f(x,\theta)\,dx = 0 = \frac{d}{d\theta}\int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)f(x,\theta)\,dx \qquad (3.11)$$
$$= \int\Big(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\Big)f(x,\theta)\,dx + \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)\frac{\partial}{\partial\theta}f(x,\theta)\,dx$$
The last term equals
$$\int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)\frac{(\partial/\partial\theta)f(x,\theta)}{f(x,\theta)}\,f(x,\theta)\,dx = \int\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)^2 f(x,\theta)\,dx = I(f,\theta)$$
by (3.4). Since the left-hand side of (3.11) is equal to zero, Lemma 3.2 follows. $\square$
Remarks. The hypotheses of Lemma 3.1 are satisfied for the normal density (3.6), for which $E_\theta\big(Y_k(\theta)\big) = 0$ by (3.7). However, the hypotheses are not satisfied for the uniform density $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$ since the supports $K(\theta) = (0,\theta)$ depend on $\theta$.

For $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$, the scores $Y_k(\theta) = -(1/\theta)I_{(0,\theta)}(X_k)$ have means $E_\theta\big(Y_k(\theta)\big) = -1/\theta \ne 0$, so that the proof of Lemma 3.1 breaks down at some point. (Exercise: Show that this is the correct formula for the score for the uniform density and that this is the mean value.)
4. The Cramer-Rao Inequality. Let $X_1,X_2,\dots,X_n$ be an independent random sample from the density $f(x,\theta)$, where $f(x,\theta)$ satisfies the conditions of Lemma 3.1. In particular,
(i) The set $K = \{\,x : f(x,\theta) > 0\,\}$ is the same for all values of $\theta$ and
(ii) The function $\log f(x,\theta)$ has two continuous partial derivatives in $\theta$ that are integrable on $K$.
We then have

Theorem 4.1. (Cramer-Rao Inequality) Let $T(X_1,X_2,\dots,X_n)$ be an arbitrary unbiased estimator of $\theta$. Then, under the assumptions above,
$$E_\theta\big((T-\theta)^2\big) \ge \frac{1}{n\,I(f,\theta)} \qquad (4.1)$$
for all values of $\theta$, where $I(f,\theta)$ is the Fisher information defined in (3.4).

Remark. Note that (4.1) need not hold if $T(X_1,\dots,X_n)$ is a biased estimator of $\theta$, nor if the assumptions (i) or (ii) fail.
Proof of Theorem 4.1. Let $T = T(X_1,\dots,X_n)$ be any unbiased estimator of $\theta$. Then
$$\theta = E_\theta\big(T(X_1,\dots,X_n)\big) = \int\!\!\cdots\!\!\int T(y_1,\dots,y_n)\,f(y_1,\theta)\cdots f(y_n,\theta)\,dy_1\cdots dy_n \qquad (4.2)$$
Differentiating (4.2) with respect to $\theta$,
$$1 = \int\!\!\cdots\!\!\int T(y_1,\dots,y_n)\,\frac{\partial}{\partial\theta}\Big(\prod_{k=1}^n f_k\Big)\,dy_1\cdots dy_n$$
where $f_k = f(y_k,\theta)$. By the chain rule
$$1 = \int\!\!\cdots\!\!\int T\sum_{k=1}^n\Big[\Big(\prod_{j=1}^{k-1}f_j\Big)\Big(\frac{\partial}{\partial\theta}f_k\Big)\Big(\prod_{j=k+1}^{n}f_j\Big)\Big]\,dy_1\cdots dy_n$$
for $T = T(y_1,y_2,\dots,y_n)$ and
$$1 = \int\!\!\cdots\!\!\int T\sum_{k=1}^n\Big[\Big(\prod_{j=1}^{k-1}f_j\Big)\Big(\frac{(\partial/\partial\theta)f_k}{f_k}\Big)f_k\Big(\prod_{j=k+1}^{n}f_j\Big)\Big]\,dy_1\cdots dy_n$$
$$= \int\!\!\cdots\!\!\int T\Big(\sum_{k=1}^n\frac{\partial}{\partial\theta}\log f(y_k,\theta)\Big)\Big(\prod_{j=1}^n f_j\Big)\,dy_1\cdots dy_n = E_\theta\Big(T(X_1,\dots,X_n)\Big(\sum_{k=1}^n Y_k\Big)\Big) \qquad (4.3)$$
where $Y_k = (\partial/\partial\theta)\log f(X_k,\theta)$ are the scores defined in (3.3). Since $E_\theta\big(Y_k(\theta)\big) = 0$ by Lemma 3.1, it follows by subtraction from (4.3) that
$$1 = E_\theta\Big(\big(T(X_1,\dots,X_n)-\theta\big)\Big(\sum_{k=1}^n Y_k\Big)\Big) \qquad (4.4)$$
By Cauchy's inequality (see Lemma 4.1 below),
$$E(XY) \le \sqrt{E(X^2)}\,\sqrt{E(Y^2)}$$
for any two random variables $X,Y$ with $E(|XY|) < \infty$. Equivalently $E(XY)^2 \le E(X^2)E(Y^2)$. Applying this in (4.4) implies
$$1 \le E_\theta\Big(\big(T(X_1,\dots,X_n)-\theta\big)^2\Big)\,E_\theta\Big(\Big(\sum_{k=1}^n Y_k\Big)^2\Big) \qquad (4.5)$$
The scores $Y_k = (\partial/\partial\theta)\log f(X_k,\theta)$ are independent with the same distribution, and have mean zero and variance $I(f,\theta)$ by Lemma 3.1 and (3.4). Thus
$$E_\theta\Big(\Big(\sum_{k=1}^n Y_k\Big)^2\Big) = \mathrm{Var}_\theta\Big(\sum_{k=1}^n Y_k\Big) = n\,\mathrm{Var}_\theta(Y_1) = n\,I(f,\theta)$$
Hence $1 \le E_\theta\big((T-\theta)^2\big)\,n\,I(f,\theta)$, which implies the lower bound (4.1). $\square$
Definition. The efficiency of an estimator $T(X_1,\dots,X_n)$ is
$$RE(T,\theta) = \frac{1/\big(nI(f,\theta)\big)}{E_\theta\big((T-\theta)^2\big)} = \frac{1}{n\,I(f,\theta)\,E_\theta\big((T-\theta)^2\big)} \qquad (4.6)$$
Note that this is the same as the relative risk or relative efficiency (2.9) with $R(T_0,\theta)$ replaced by the Cramer-Rao lower bound (4.1). Under the assumptions of Theorem 4.1, $RE(T,\theta) \le 1$.

An unbiased estimator $T(X_1,\dots,X_n)$ is called efficient if $RE(T,\theta) = 1$; that is, if its variance attains the lower bound in (4.1). This means that any other unbiased estimator of $\theta$, no matter how nonlinear, must have an equal or larger variance.

An estimator $T(X)$ of a parameter $\theta$ is super-efficient if its expected squared error $E_\theta\big((T(X)-\theta)^2\big)$ is strictly less than the Cramer-Rao lower bound. Under the assumptions of Theorem 4.1, this can happen only if $T(X)$ is biased, and typically holds for some parameter values but not for others. For example, the shrinkage estimator of Section 2.1 is super-efficient for parameter values $\theta$ that are reasonably close to the value $a$ but not for other $\theta$.
Examples (1). Assume $X_1,X_2,\dots,X_n$ are $N(\theta,\sigma_0^2)$ (that is, normally distributed with unknown mean $\theta$ and known variance $\sigma_0^2$). Then $E_\theta(X_k) = \theta$ and
$$E_\theta\big((\bar X-\theta)^2\big) = \mathrm{Var}_\theta(\bar X) = \frac{\sigma_0^2}{n}$$
By (3.8), the Fisher information is $I(f,\theta) = 1/\sigma_0^2$, so that $1/\big(nI(f,\theta)\big) = \sigma_0^2/n$. Thus $\bar X$ is an unbiased estimator that attains the Cramer-Rao lower bound (4.1), and so is efficient.

(2). Now assume $X_1,\dots,X_n$ are uniformly distributed on $(0,\theta)$, with density $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$. By the remarks at the end of Section 3, the scores are $Y_k(\theta) = -1/\theta$ (a constant) with probability one, so that $E_\theta\big(Y_k(\theta)^2\big) = 1/\theta^2$ but $\mathrm{Var}_\theta\big(Y_k(\theta)\big) = 0$. Hence $I(f,\theta) = \mathrm{Var}_\theta\big(Y_k(\theta)\big) = 0$, so that the Cramer-Rao lower bound for unbiased estimators in Theorem 4.1 is $\infty$. (If we can use Lemma 3.2, then $I(f,\theta) = -1/\theta^2$ and the lower bound is negative. These are not contradictions, since the density $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$ does not satisfy the hypotheses of either Lemma 3.1 or 3.2.)

Ignoring these awkwardnesses for the moment, let $X_{\max} = \max_{1\le i\le n} X_i$. Then $E_\theta(X_{\max}) = \big(n/(n+1)\big)\theta$, so that if $T_{\max} = \big((n+1)/n\big)X_{\max}$ then
$$E_\theta(2\bar X) = E_\theta(T_{\max}) = \theta$$
Thus both $T_1 = 2\bar X$ and $T_2 = T_{\max}$ are unbiased estimators of $\theta$. However, one can show
$$\mathrm{Var}_\theta(2\bar X) = \frac{\theta^2}{3n} \quad\text{and}\quad \mathrm{Var}_\theta(T_{\max}) = \frac{\theta^2}{n(n+2)}$$
Assuming $I(f,\theta) = 1/\theta^2$ for definiteness, this implies that
$$RE(2\bar X,\theta) = 3 \quad\text{and}\quad RE(T_{\max},\theta) = n+2$$
both of which exceed 1 (the second by a large margin if $n$ is large). Thus the conclusions of Theorem 4.1 are either incorrect or else make no sense for either unbiased estimator in this case.
We end this section with a proof of Cauchy's inequality.

Lemma 4.1 (Cauchy-Schwarz-Bunyakovsky). Let $X,Y$ be any two random variables such that $E(|XY|) < \infty$. Then
$$E(XY) \le \sqrt{E(X^2)}\,\sqrt{E(Y^2)} \qquad (4.7)$$
Proof. Note $\big(\sqrt a\,x - (1/\sqrt a)\,y\big)^2 \ge 0$ for arbitrary real numbers $x,y,a$ with $a > 0$. Expanding the binomial implies $ax^2 - 2xy + (1/a)y^2 \ge 0$, or
$$xy \le \frac12\Big(ax^2 + \frac1a y^2\Big)$$
for all real $x,y$ and any $a > 0$. It then follows that for any values of the random variables $X,Y$
$$XY \le \frac12\Big(aX^2 + \frac1a Y^2\Big)$$
In general, if $Y_1 \le Y_2$ for two random variables $Y_1,Y_2$, then $E(Y_1) \le E(Y_2)$. This implies
$$E(XY) \le \frac12\Big(aE(X^2) + \frac1a E(Y^2)\Big), \qquad\text{any } a > 0 \qquad (4.8)$$
If we minimize the right-hand side of (4.8) as a function of $a$, for example by setting the derivative with respect to $a$ equal to zero, we obtain $a^2 = E(Y^2)/E(X^2)$ or $a = \sqrt{E(Y^2)/E(X^2)}$. Evaluating the right-hand side of (4.8) with this value of $a$ implies (4.7). $\square$
5. Maximum Likelihood Estimators are Asymptotically Efficient.
Let $X_1,X_2,\dots,X_n,\dots$ be independent random variables with the same distribution. Assume $E(X_j^2) < \infty$ and $E(X_j) = \mu$. Then the central limit theorem implies
$$\lim_{n\to\infty} P\Big(\frac{X_1+X_2+\dots+X_n - n\mu}{\sqrt{n\sigma^2}} \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx \qquad (5.1)$$
for all real values of $y$. Is there something similar for MLEs (maximum likelihood estimators)? First, note that (5.1) is equivalent to
$$\lim_{n\to\infty} P\Big(\sqrt{\frac{n}{\sigma^2}}\,\Big(\frac{X_1+X_2+\dots+X_n}{n} - \mu\Big) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx \qquad (5.2)$$
If $X_1,\dots,X_n$ were normally distributed with mean $\mu$ and variance $\sigma^2$, then $\hat\theta_n(X) = \bar X = (X_1+\dots+X_n)/n$. This suggests that we might have a central limit theorem for MLEs $\hat\theta_n(X)$ of the form
$$\lim_{n\to\infty} P\Big(\sqrt{n\,c(\theta)}\,\big(\hat\theta_n(X)-\theta\big) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx$$
where $\theta$ is the true value of $\theta$ and $c(\theta)$ is a constant depending on $\theta$. In fact
Theorem 5.1. Assume
(i) The set $K = \{\,x : f(x,\theta) > 0\,\}$ is the same for all values of $\theta$,
(ii) The function $\log f(x,\theta)$ has two continuous partial derivatives in $\theta$ that are integrable on $K$,
(iii) $E(Z) < \infty$ for $Z = \sup_\theta\big|(\partial^2/\partial\theta^2)\log f(X,\theta)\big|$, and
(iv) the MLE $\hat\theta(X)$ is attained in the interior of $K$.
Let $I(f,\theta)$ be the Fisher information (3.4) in Section 3. Then
$$\lim_{n\to\infty} P\Big(\sqrt{n\,I(f,\theta)}\,\big(\hat\theta_n(X)-\theta\big) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx \qquad (5.3)$$
for all real values of $y$. (Condition (iii) is more than is actually required.)

Remarks (1). The relation (5.3) says that the MLE $\hat\theta_n(X)$ is approximately normally distributed with mean $\theta$ and variance $1/\big(nI(f,\theta)\big)$, or symbolically
$$\hat\theta_n(X) \approx N\Big(\theta,\ \frac{1}{n\,I(f,\theta)}\Big) \qquad (5.4)$$
If (5.4) held exactly, then $E_\theta\big(\hat\theta_n(X)\big) = \theta$ and $\mathrm{Var}_\theta\big(\hat\theta_n(X)\big) = 1/\big(nI(f,\theta)\big)$, and $\hat\theta_n(X)$ would be an unbiased estimator whose variance was equal to the Cramer-Rao lower bound. We interpret (5.3)-(5.4) as saying that $\hat\theta_n(X)$ is asymptotically normal, is asymptotically unbiased, and is asymptotically efficient in the sense of Section 4, since its asymptotic variance is the Cramer-Rao lower bound. However, (5.3) does not exclude the possibility that $E_\theta\big(|\hat\theta_n(X)|\big) = \infty$ for all finite $n$, so that $\hat\theta_n(X)$ need not be unbiased nor efficient nor even have finite variance in the usual senses for any value of $n$.

(2). If $f(x,\theta) = (1/\theta)I_{(0,\theta)}(x)$, so that $X_i \sim U(0,\theta)$, then the order of the rate of convergence in the analog of (5.3) is $n$ instead of $\sqrt n$ and the limit is a one-sided exponential, not a normal distribution. (Exercise: Prove this.) Thus the conditions of Theorem 5.1 are essential.
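As an illustration of (5.3)-(5.4), consider the Laplace family of Section 2.4, whose MLE is the sample median (Lemma 2.1). Using the fact, not derived in the text, that $I(f,\theta) = 1/c^2$ for the Laplace density $L(\theta,c)$, the standardized MLE $\sqrt{nI(f,\theta)}\,(\hat\theta_n - \theta)$ should be approximately standard normal. A simulation sketch, assuming NumPy; the values of $\theta$, $c$, and $n$ are illustrative.

```python
# The sample median of a Laplace sample, standardized by sqrt(n/c^2),
# should be approximately N(0, 1) for large n.
import numpy as np

rng = np.random.default_rng(5)
theta, c, n, reps = 0.0, 1.0, 200, 50_000
X = rng.laplace(theta, c, size=(reps, n))

Z = np.sqrt(n / c**2) * (np.median(X, axis=1) - theta)
print(Z.mean(), Z.var())          # approximately 0 and 1
```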
Asymptotic Confidence Intervals. We can use (5.3) to find asymptotic confidence intervals for the true value of $\theta$ based on the MLE $\hat\theta_n(X)$. It follows from (5.3) and properties of the standard normal distribution that
$$\lim_{n\to\infty} P\Big(-\frac{1.96}{\sqrt{nI(f,\theta)}} < \hat\theta_n - \theta < \frac{1.96}{\sqrt{nI(f,\theta)}}\Big) \qquad (5.5)$$
$$= \lim_{n\to\infty} P\Big(\hat\theta_n(X) - \frac{1.96}{\sqrt{nI(f,\theta)}} < \theta < \hat\theta_n(X) + \frac{1.96}{\sqrt{nI(f,\theta)}}\Big) = 0.95$$
Under the assumptions of Theorem 5.1, we can approximate the Fisher information $I(f,\theta)$ in (3.4) by $I\big(f,\hat\theta_n(X)\big)$, which does not depend explicitly on $\theta$. The expression $I\big(f,\hat\theta_n(X)\big)$ is called the empirical Fisher information of $\theta$ depending on $X_1,\dots,X_n$. This and (5.5) imply that
$$\Big(\hat\theta_n(X) - \frac{1.96}{\sqrt{n\,I\big(f,\hat\theta_n(X)\big)}},\ \ \hat\theta_n(X) + \frac{1.96}{\sqrt{n\,I\big(f,\hat\theta_n(X)\big)}}\Big) \qquad (5.6)$$
is an asymptotic 95% confidence interval for the true value of $\theta$.
Examples (1). Let $f(x,p) = p^x(1-p)^{1-x}$ for $x = 0,1$ for the Bernoulli distribution. (That is, tossing a biased coin.) Then
$$\log f(x,p) = x\log(p) + (1-x)\log(1-p)$$
$$\frac{\partial}{\partial p}\log f(x,p) = \frac{x}{p} - \frac{1-x}{1-p} \quad\text{and}\quad \frac{\partial^2}{\partial p^2}\log f(x,p) = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}$$
Thus by Lemma 3.2 the Fisher information is
$$I(f,p) = -E\Big(\frac{\partial^2}{\partial p^2}\log f(X,p)\Big) = \frac{E(X)}{p^2} + \frac{E(1-X)}{(1-p)^2} = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac1p + \frac{1}{1-p} = \frac{1}{p(1-p)}$$
This implies
$$\frac{1}{\sqrt{n\,I(f,p)}} = \sqrt{\frac{p(1-p)}{n}}$$
Hence in this case (5.6) is exactly the same as the usual (approximate) 95% confidence interval for the binomial distribution.
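For concreteness, here is the interval (5.6) for the Bernoulli family computed on a small illustrative data set (not from the text), assuming NumPy is available.

```python
# Wald-type interval (5.6) for a Bernoulli sample: p-hat +/- 1.96*sqrt(p-hat(1-p-hat)/n).
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1])
n = len(x)
p_hat = x.mean()                                    # MLE of p
half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)      # 1.96 / sqrt(n I(f, p-hat))
print(p_hat, (p_hat - half, p_hat + half))
```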
(2). Let $f(x,\theta) = \theta x^{\theta-1}$ for $0 \le x \le 1$. Then
$$Y_k(\theta) = (\partial/\partial\theta)\log f(X_k,\theta) = (1/\theta) + \log(X_k)$$
$$W_k(\theta) = -(\partial^2/\partial\theta^2)\log f(X_k,\theta) = 1/\theta^2$$
Since $(\partial/\partial\theta)\log L(\theta,X) = \sum_{k=1}^n Y_k(\theta) = (n/\theta) + \sum_{k=1}^n \log(X_k)$, it follows that
$$\hat\theta_n(X) = -\frac{n}{\sum_{k=1}^n \log(X_k)} \qquad (5.7)$$
Similarly, $I(f,\theta) = E_\theta\big(W_k(\theta)\big) = 1/\theta^2$ by Lemma 3.2. Hence by (5.6)
$$\Big(\hat\theta_n(X) - \frac{1.96\,\hat\theta_n(X)}{\sqrt n},\ \ \hat\theta_n(X) + \frac{1.96\,\hat\theta_n(X)}{\sqrt n}\Big) \qquad (5.8)$$
is an asymptotic 95% confidence interval for $\theta$.
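The MLE (5.7) and the interval (5.8) can be checked on simulated data, since $F(x) = x^\theta$ on $[0,1]$ makes inverse-CDF sampling easy. A sketch assuming NumPy; the values of $\theta$ and $n$ are illustrative.

```python
# MLE (5.7) and asymptotic 95% interval (5.8) for f(x, theta) = theta * x^(theta - 1).
import numpy as np

rng = np.random.default_rng(3)
theta_true, n = 2.5, 400
X = rng.uniform(size=n) ** (1 / theta_true)       # inverse-CDF sampling: F(x) = x^theta

theta_hat = -n / np.sum(np.log(X))                # (5.7)
half = 1.96 * theta_hat / np.sqrt(n)              # (5.8): 1/sqrt(n I) = theta/sqrt(n)
print(theta_hat, (theta_hat - half, theta_hat + half))
```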
Proof of Theorem 5.1. Let
$$M(\theta) = \frac{\partial}{\partial\theta}\log L(\theta, X_1,\dots,X_n) \qquad (5.9)$$
where $L(\theta,X_1,\dots,X_n)$ is the likelihood function defined in (3.1). Let $\hat\theta_n(X)$ be the maximum likelihood estimator of $\theta$. Since $\hat\theta_n(X)$ is attained in the interior of $K$ by condition (iv),
$$M(\hat\theta_n) = \frac{\partial}{\partial\theta}\log L(\hat\theta_n, X) = 0$$
and by Lemma 3.1
$$M(\theta) = \frac{\partial}{\partial\theta}\log L(\theta,X) = \sum_{k=1}^n \frac{\partial}{\partial\theta}\log f(X_k,\theta) = \sum_{k=1}^n Y_k(\theta)$$
where $Y_k(\theta)$ are the scores defined in Section 3. By the mean value theorem
$$M(\hat\theta_n) - M(\theta) = (\hat\theta_n-\theta)\,\frac{d}{d\theta}M(\theta^*_n) = (\hat\theta_n-\theta)\,\frac{\partial^2}{\partial\theta^2}\log L(\theta^*_n, X) = (\hat\theta_n-\theta)\sum_{k=1}^n (\partial^2/\partial\theta^2)\log f\big(X_k,\theta^*_n(X)\big)$$
where $\theta^*_n(X)$ is a value between $\theta$ and $\hat\theta_n(X)$. Since $M(\hat\theta_n) = 0$
$$\hat\theta_n - \theta = \frac{-M(\theta)}{(d/d\theta)M(\theta^*_n)} = \frac{-\sum_{k=1}^n Y_k(\theta)}{\sum_{k=1}^n (\partial^2/\partial\theta^2)\log f(X_k,\theta^*_n)} \qquad (5.10)$$
Thus
$$\sqrt{nI(f,\theta)}\,\big(\hat\theta_n - \theta\big) = \frac{\dfrac{1}{\sqrt{nI(f,\theta)}}\displaystyle\sum_{k=1}^n Y_k(\theta)}{-\dfrac{1}{nI(f,\theta)}\displaystyle\sum_{k=1}^n (\partial^2/\partial\theta^2)\log f\big(X_k,\theta^*_n(X)\big)} \qquad (5.11)$$
By Lemma 3.1, the $Y_k(\theta)$ are independent with the same distribution with $E_\theta\big(Y_k(\theta)\big) = 0$ and $\mathrm{Var}_\theta\big(Y_k(\theta)\big) = I(f,\theta)$. Thus by the central limit theorem
$$\lim_{n\to\infty} P\Big(\frac{1}{\sqrt{nI(f,\theta)}}\sum_{k=1}^n Y_k(\theta) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx \qquad (5.12)$$
Similarly, by Lemma 3.2, $W_k(\theta) = -(\partial^2/\partial\theta^2)\log f(X_k,\theta)$ are independent with $E_\theta\big(W_k(\theta)\big) = I(f,\theta)$. Thus by the law of large numbers
$$\lim_{n\to\infty}\Big(-\frac{1}{nI(f,\theta)}\sum_{k=1}^n \frac{\partial^2}{\partial\theta^2}\log f(X_k,\theta)\Big) = 1 \qquad (5.13)$$
in the sense of convergence in the law of large numbers. One can show that, under the assumptions of Theorem 5.1, we can replace $\theta^*_n(X)$ on the right-hand side of (5.11) by $\theta$ as $n\to\infty$. It can then be shown from (5.11)-(5.13) that
$$\lim_{n\to\infty} P\Big(\sqrt{nI(f,\theta)}\,\big(\hat\theta_n(X)-\theta\big) \le y\Big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-(1/2)x^2}\,dx$$
for all real values of $y$. This completes the proof of Theorem 5.1. $\square$
6. The Most Powerful Hypothesis Tests are Likelihood Ratio Tests.
The preceding sections have been concerned with estimation and interval estimation. These are concerned with finding the most likely value or range of values of a parameter $\theta$, given an independent sample $X_1,\dots,X_n$ from a probability density $f(x,\theta)$ for an unknown value of $\theta$.

In contrast, hypothesis testing has a slightly different emphasis. Suppose that we want to use data $X_1,\dots,X_n$ to decide between two different hypotheses, which by convention are called hypotheses $H_0$ and $H_1$. The hypotheses are not treated in a symmetrical manner. Specifically,

$H_0$: What one would believe if one had no additional data
$H_1$: What one would believe if the data $X_1,\dots,X_n$ makes the alternative hypothesis $H_1$ significantly more likely.

Rather than estimate a parameter, we decide between two competing hypotheses, or more exactly decide (yes or no) whether the data $X_1,\dots,X_n$ provide sufficient evidence to reject the conservative hypothesis $H_0$ in favor of a new hypothesis $H_1$.

This is somewhat like an estimation procedure with $D(X) = D(X_1,\dots,X_n) = 1$ for hypothesis $H_1$ and $D(X_1,\dots,X_n) = 0$ for $H_0$. However, this doesn't take into account the question of whether we have sufficient evidence to reject $H_0$.

A side effect of the bias towards $H_0$ is that choosing $H_1$ can be viewed as proving $H_1$ in some sense, while choosing $H_0$ may just mean that we do not have enough evidence one way or the other and so stay with the more conservative hypothesis.
Example. (Modified from Larsen and Marx, pages 428-431.) Suppose that it is generally believed that a certain type of car averages 25.0 miles per gallon (mpg). Assume that measurements $X_1,\dots,X_n$ of the miles per gallon are normally distributed with distribution $N(\mu,\sigma_0^2)$ with $\sigma_0 = 2.4$. The conventional wisdom is then $\mu = \mu_0 = 25.0$.

A consumers' group suspects that the current production run of cars actually has a higher mileage rate. In order to test this, the group runs $n = 30$ cars through a typical course intended to measure miles per gallon. The results are observations of mpg $X_1,\dots,X_{30}$ with sample mean $\bar X = (1/n)\sum_{i=1}^{30} X_i = 26.50$. Is this sufficient evidence to conclude that mileage per gallon has improved?

In this case, the conservative hypothesis is
$$H_0:\ X_i \sim N(\mu_0,\sigma_0^2) \qquad (6.1)$$
for $\mu_0 = 25.0$ and $\sigma_0 = 2.40$. The alternative hypothesis is
$$H_1:\ X_i \sim N(\mu,\sigma_0^2)\ \text{ for some } \mu > \mu_0 \qquad (6.2)$$
A standard statistical testing procedure is, in this case, first to choose a level of significance $\alpha$ that represents the degree of confidence that we need to reject $H_0$ in favor of $H_1$. The second step is to choose a critical value $\lambda = \lambda(\alpha)$ with the property that
$$P(\bar X \ge \lambda) = P\big(\bar X \ge \lambda(\alpha)\big) = \alpha \qquad (6.3)$$
Given $\alpha$, the value $\lambda = \lambda(\alpha)$ in (6.3) can be determined from the properties of normal distributions and the parameters in (6.1), and is in fact $\lambda = 25.721$ for $\alpha = 0.05$ and $n = 30$. (See below.)

The final step is to compare the measured $\bar X = 26.50$ with $\lambda$. If $\bar X \ge \lambda$, we reject $H_0$ and conclude that the mpgs of the cars have improved. If $\bar X < \lambda$, we assume that, even though $\bar X > \mu_0$, we do not have sufficient evidence to conclude that mileage has improved. Since $\bar X = 26.50 > 25.721$, we reject $H_0$ in favor of $H_1$ for this value of $\alpha$, and conclude that the true $\mu > 25.0$.
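The critical value 25.721 and the decision above can be reproduced as follows; this is a sketch assuming SciPy is available for the normal quantile and tail functions.

```python
# Critical value and decision for the mpg example (6.1)-(6.3).
from scipy.stats import norm

mu0, sigma0, n, alpha = 25.0, 2.4, 30, 0.05
xbar = 26.50

lam = mu0 + norm.ppf(1 - alpha) * sigma0 / n**0.5    # critical value, about 25.72
p_value = norm.sf((xbar - mu0) / (sigma0 / n**0.5))  # one-sided P-value
print(lam, xbar >= lam, p_value)
```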
Before determining whether or not this is the best possible test, we first need to discuss what is a test, as well as a notion of best.

6.1. What is a Test? What Do We Mean by the Best Test?
The standard test procedure leading up to (6.3) leaves open a number of questions. Why should the best testing procedure involve $\bar X$ and not a more complicated function of $X_1,\dots,X_n$? Could we do better if we used more of the data? Even if the best test involves only $\bar X$, why necessarily the simple form $\bar X > \lambda$?

More importantly, what should we do if the data $X_1,\dots,X_n$ are not normal under $H_0$ and $H_1$, and perhaps involve a family of densities $f(x,\theta)$ for which the MLE is not the sample mean? Or if $H_0$ is expressed in terms of one family of densities (such as $N(\mu,\sigma_0^2)$) and $H_1$ in terms of a different family, such as gamma distributions?
Before proceeding, we need a general definition of a test, and later a definition of best.

Assume for definiteness that $\theta, X_1,\dots,X_n$ are all real numbers. We then define (an abstract) test to be an arbitrary subset $C \subseteq \mathbb R^n$, with the convention that we choose $H_1$ if the data $X = (X_1,\dots,X_n) \in C$ and otherwise choose $H_0$. (The set $C \subseteq \mathbb R^n$ is sometimes called the critical region of the test.) Note that the decision rule $D(X)$ discussed above is now the indicator function $D(X) = I_C(X)$.

In the example (6.1)-(6.3), $C = \big\{\,\vec x : \bar x = \frac1n\sum_{i=1}^n x_i \ge \lambda\,\big\}$ for $\vec x = (x_1,\dots,x_n)$, so that $X \in C$ if and only if $\bar X \ge \lambda$.

Later we will derive a formula that gives the best possible test in many circumstances. Before continuing, however, we need some more definitions.
6.2. Simple vs. Composite Tests. In general, we say that a hypothesis ($H_0$ or $H_1$) is a simple hypothesis or is simple if it uniquely determines the density of the random variables $X_i$. The hypothesis is composite otherwise.

For example, suppose that the $X_i$ are known to have density $f(x,\theta)$ for unknown $\theta$ for a family of densities $f(x,\theta)$, as in (6.1)-(6.2) for a normal family with known variance. Then
$$H_0:\ \theta = \theta_0 \quad\text{and}\quad H_1:\ \theta = \theta_1 \qquad (6.4)$$
are both simple hypotheses. If as in (6.1)-(6.2)
$$H_0:\ \theta = \theta_0 \quad\text{and}\quad H_1:\ \theta > \theta_0 \qquad (6.5)$$
then $H_0$ is simple but $H_1$ is composite.

Fortunately, it often turns out that the best test for $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$ is the same test for all $\theta_1 > \theta_0$, so that it is also the best test against $H_1: \theta > \theta_0$. Thus, in this case, it is sufficient to consider simple hypotheses as in (6.4).
6.3. The Size and Power of a Test. If we make a decision between two hypotheses $H_0$ and $H_1$ on the basis of data $X_1,\dots,X_n$, then there are two types of error that we can make.

The first type (called a Type I error) is to reject $H_0$ and decide on $H_1$ when, in fact, the conservative hypothesis $H_0$ is true. The probability of a Type I error (which can only happen if $H_0$ is true) is called the false positive rate. The reason for this is that deciding on the a priori less likely hypothesis $H_1$ is called a positive result. (Think of proving $H_1$ as the first step towards a big raise, or perhaps towards getting a Nobel prize. On the other hand, deciding on $H_1$ could mean that you have a dread disease, which you might not consider a positive result at all. Still, it is a positive result for the test, if not necessarily for you.)

Suppose that $H_0$ and $H_1$ are both simple as in (6.4). Then the probability of a Type I error for the test $C$, or equivalently the false positive rate, is
$$\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}) = P(\text{choose } H_1 \mid H_0) = P(X \in C \mid H_0) = \int_C f(\vec x,\theta_0)\,d\vec x \qquad (6.6)$$
where
$$f(\vec x,\theta) = f(x_1,\theta)\,f(x_2,\theta)\cdots f(x_n,\theta) \qquad (6.7)$$
is the joint probability density of the sample $X = (X_1,\dots,X_n)$ and $\int_C f(\vec x,\theta_0)\,d\vec x$ is an $n$-fold integral.

As the form of (6.6) indicates, $\alpha$ depends only on the hypothesis $H_0$ and not on $H_1$, since it is given by the integral of $f(\vec x,\theta_0)$ over $C$ and does not involve $\theta_1$. Similarly, the critical value $\lambda = \lambda(\alpha)$ in (6.3) in the automobile example depends only on $\alpha$ and $n$ and the parameters involved in $H_0$.

The value $\alpha$ in (6.6) is also called the level of significance of the test $C$ (or, more colloquially, of the test with critical region $C$). As mentioned above, $\alpha$ depends only on the hypothesis $H_0$ and is given by the integral of a probability density over $C$. For this reason, $\alpha$ is also called the size of the test $C$. That is,
$$\mathrm{Size}(C) = \int_C f(\vec x,\theta_0)\,d\vec x \qquad (6.8)$$
Note that we have just given four different verbal definitions for the value $\alpha$ in (6.6) or the value of the integral in (6.8). This illustrates the importance of $\alpha$ for hypothesis testing.
Similarly, a Type II error is to reject $H_1$ and choose $H_0$ when the alternative $H_1$ is correct. The probability of a Type II error is called the false negative rate, since it amounts to failing to detect $H_1$ when $H_1$ is correct. This is
$$\beta = P(\text{reject } H_1 \mid H_1) = \int_{\mathbb R^n \setminus C} f(\vec x,\theta_1)\,d\vec x \qquad (6.9)$$
for $\theta_1$ in (6.4) and $f(\vec x,\theta_1)$ in (6.7). Note that $\beta$ depends only on $H_1$ and not on $H_0$.

The power of a test is the probability of deciding correctly on $H_1$ if $H_1$ is true, and is called the true positive rate. It can be written
$$\mathrm{Power}(\theta_1) = 1 - \beta = P(\text{choose } H_1 \mid H_1) = \int_C f(\vec x,\theta_1)\,d\vec x \qquad (6.10)$$
The power $\mathrm{Power}(\theta)$ is usually written as a function of $\theta$ since the hypothesis $H_1$ is more likely to be composite. Note that both the level of significance $\alpha$ and the power $\mathrm{Power}(\theta_1)$ involve integrals over the same critical region $C$, but with different densities.

To put these definitions in a table:
Table 6.1. Error Type and Probabilities
                         What We Decide
Which is True            H0                  H1
H0                       OK                  Type I (prob. alpha)
H1                       Type II (prob. beta)  OK (prob. = Power)
If $H_0$ and/or $H_1$ are composite, then $\alpha$, $\beta$, and the power are replaced by their worst possible values. That is, if for example
$$H_0:\ X_i \sim f_0(x)\ \text{ for some density } f_0 \in \mathcal T_0$$
$$H_1:\ X_i \sim f_1(x)\ \text{ for some density } f_1 \in \mathcal T_1$$
for two classes of densities $\mathcal T_0,\mathcal T_1$ on $\mathbb R$, then
$$\alpha = \sup_{f_0\in\mathcal T_0}\int_C f_0(\vec x)\,d\vec x, \qquad \beta = \sup_{f_1\in\mathcal T_1}\int_{\mathbb R^n\setminus C} f_1(\vec x)\,d\vec x$$
and
$$\mathrm{Power} = \inf_{f_1\in\mathcal T_1}\int_C f_1(\vec x)\,d\vec x$$
6.4. The Neyman-Pearson Lemma. As suggested earlier, a standard approach is to choose a highest acceptable false positive rate $\alpha$ (for rejecting $H_0$) and restrict ourselves to tests $C$ with that false positive rate or smaller.

Among this class of tests, we would like to find the test that has the highest probability of detecting $H_1$ when $H_1$ is true. This is called (reasonably enough) the most powerful test of $H_0$ against $H_1$ among tests $C$ of a given size $\alpha$ or smaller.

Assume for simplicity that $H_0$ and $H_1$ are both simple hypotheses, so that
$$H_0:\ X_i \sim f(x,\theta_0) \quad\text{and}\quad H_1:\ X_i \sim f(x,\theta_1) \qquad (6.11)$$
where $X_i \sim f(x)$ means that the observations $X_i$ are independently chosen from the density $f(x)$ and $f(x,\theta)$ is a family of probability densities. As mentioned above, both the size and power of a test $C \subseteq \mathbb R^n$ can be expressed as $n$-dimensional integrals over $C$:
$$\mathrm{Size}(C) = \int_C f(\vec x,\theta_0)\,d\vec x \quad\text{and}\quad \mathrm{Power}(C) = \int_C f(\vec x,\theta_1)\,d\vec x \qquad (6.12)$$
The next result uses (6.12) to find the most powerful tests of one simple hypothesis against another at a fixed level of significance $\alpha$.

Theorem 6.1. (Neyman-Pearson Lemma) Assume that the set
$$C_0 = C_0(\lambda) = \Big\{\,\vec x\in\mathbb R^n : \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} \ge \lambda\,\Big\} \qquad (6.13)$$
has $\mathrm{Size}(C_0) = \alpha$ for some constant $\lambda > 0$. Then
$$\mathrm{Power}(C) \le \mathrm{Power}\big(C_0(\lambda)\big) \qquad (6.14)$$
for any other subset $C \subseteq \mathbb R^n$ with $\mathrm{Size}(C) \le \alpha$.

Remarks (1). This means that $C_0(\lambda)$ is the most powerful test of $H_0$ against $H_1$ with size $\mathrm{Size}(C) \le \alpha$.
(2). If $\vec x = X$ for data $X = (X_1,\dots,X_n)$, then the ratio in (6.13)
$$L(\vec x,\theta_1,\theta_0) = \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} = \frac{L(\theta_1,X)}{L(\theta_0,X)} \qquad (6.15)$$
is a ratio of likelihoods. In this sense, the tests $C_0(\lambda)$ in Theorem 6.1 are likelihood-ratio tests.

(3). Suppose that the likelihood $L(\theta,X) = f(X_1,\theta)\cdots f(X_n,\theta)$ has a sufficient statistic $S(X) = S(X_1,\dots,X_n)$. That is,
$$L(\theta,X) = f(X_1,\theta)\cdots f(X_n,\theta) = g\big(S(X),\theta\big)\,A(X) \qquad (6.16)$$
Then, since the factors $A(\vec x)$ cancel out in the likelihood ratio, the most-powerful tests
$$C_0(\lambda) = \Big\{\,\vec x\in\mathbb R^n : \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} \ge \lambda\,\Big\} = \Big\{\,\vec x\in\mathbb R^n : \frac{g\big(S(\vec x),\theta_1\big)}{g\big(S(\vec x),\theta_0\big)} \ge \lambda\,\Big\}$$
depend only on the sufficient statistic $S(X)$.

(4). By (6.12) and (6.15)
$$\mathrm{Size}(C) = \int_C f(\vec x,\theta_0)\,d\vec x \quad\text{and}\quad \mathrm{Power}(C) = \int_C L(\vec x,\theta_1,\theta_0)\,f(\vec x,\theta_0)\,d\vec x$$
for $L(\vec x,\theta_1,\theta_0)$ in (6.15). Intuitively, the set $C$ that maximizes $\mathrm{Power}(C)$ subject to $\mathrm{Size}(C) \le \alpha$ should be the set of size $\alpha$ with the largest values of $L(\vec x,\theta_1,\theta_0)$. This is essentially the proof of Theorem 6.1 given below.
Before giving a proof of Theorem 6.1, let's give some examples.

Example (1). Continuing the example (6.1)-(6.2) where $f(x,\mu)$ is the normal density $N(\mu,\sigma_0^2)$, the joint density (or likelihood) is
$$f(X_1,\dots,X_n,\mu) = \Big(\frac{1}{\sqrt{2\pi\sigma_0^2}}\Big)^{\!n}\exp\Big(-\frac{1}{2\sigma_0^2}\sum_{i=1}^n (X_i-\mu)^2\Big) = C_1(\mu,\sigma_0,n)\,\exp\Big(-\frac{1}{2\sigma_0^2}\Big(\sum_{i=1}^n X_i^2 - 2\mu\sum_{i=1}^n X_i\Big)\Big)$$
Since the factor containing $\sum_{i=1}^n X_i^2$ is the same in both likelihoods, the likelihood ratio is
$$\frac{f(X_1,\dots,X_n,\mu_1)}{f(X_1,\dots,X_n,\mu_0)} = C_2\,\exp\Big(\frac{(\mu_1-\mu_0)}{\sigma_0^2}\sum_{j=1}^n X_j\Big) \qquad (6.17)$$
where $C_2 = C_2(\mu_1,\mu_0,\sigma_0,n)$. If $\mu_0 < \mu_1$ are fixed, the likelihood-ratio sets $C_0(\lambda)$ in (6.13) are
$$C_0(\lambda) = \Big\{\,\vec x : C_2\exp\Big(\frac{(\mu_1-\mu_0)}{\sigma_0^2}\sum_{j=1}^n x_j\Big) \ge \lambda\,\Big\} \qquad (6.18a)$$
$$= \Big\{\,\vec x : \frac1n\sum_{i=1}^n x_i \ge \lambda_m\,\Big\} \qquad (6.18b)$$
where $\lambda_m$ is a monotonic function of $\lambda$. Thus the most powerful tests of $H_0$ against $H_1$ for any $\mu_1 > \mu_0$ are tests of the form $\bar X \ge \lambda_m$. As in (6.3), the constants $\lambda_m = \lambda_m(\alpha)$ are determined by
$$\mathrm{Size}\big(C(\alpha)\big) = \alpha = P_{\mu_0}\big(\bar X \ge \lambda_m(\alpha)\big)$$
Since $X_i \sim N(\mu_0,\sigma_0^2)$ and $\bar X \sim N(\mu_0,\sigma_0^2/n)$, this implies
$$\lambda_m(\alpha) = \mu_0 + \frac{\sigma_0}{\sqrt n}\,z_\alpha \qquad (6.19)$$
where $P(Z \ge z_\alpha) = \alpha$ for a standard normal random variable $Z$. Since the critical region in (6.18b) does not depend on $\mu_1$, the same test is most powerful against every $\mu_1 > \mu_0$; such a test is called uniformly most powerful (UMP).

Example (2). Let $f(x,\theta) = \theta x^{\theta-1}$ for $0 \le x \le 1$ and $\theta > 0$, as in Example (2) of Section 5. Then the likelihood is
$$\prod_{j=1}^n f(x_j,\theta) = \prod_{j=1}^n \theta x_j^{\theta-1} = \theta^n\Big(\prod_{j=1}^n x_j\Big)^{\theta-1}$$
In general if $\theta_0 < \theta_1$, the likelihood ratio is
$$\frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} = \frac{\theta_1^n\big(\prod_{j=1}^n x_j\big)^{\theta_1-1}}{\theta_0^n\big(\prod_{j=1}^n x_j\big)^{\theta_0-1}} = C\Big(\prod_{j=1}^n x_j\Big)^{\theta_1-\theta_0} \qquad (6.20)$$
for $C = C(\theta_0,\theta_1,n)$. Thus the most powerful tests of $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$ for $\theta_0 < \theta_1$ are
$$C_0(\lambda) = \Big\{\,\vec x : C\Big(\prod_{j=1}^n x_j\Big)^{\theta_1-\theta_0} \ge \lambda\,\Big\} \qquad (6.21a)$$
$$= \Big\{\,\vec x : \prod_{j=1}^n x_j \ge \lambda_m\,\Big\} \qquad (6.21b)$$
where $\lambda_m$ is a monotonic function of $\lambda$.

Note that the function $\lambda_m = \lambda_m(\alpha)$ in (6.21b) depends on $H_0$ but not on $H_1$. Thus the tests (6.21b) are UMP for $\theta_1 > \theta_0$ as in Example 1.
Exercise. For $H_0: \theta = \theta_0$, prove that the tests
$$C_0(\lambda) = \Big\{\,\vec x : \prod_{j=1}^n x_j \le \lambda_m\,\Big\} \qquad (6.22)$$
are UMP against $H_1: \theta = \theta_1$ for all $\theta_1 < \theta_0$.
6.5. P-values. The nested structure of the likelihood-ratio sets in (6.13) and (6.14)
$$C(\lambda_\alpha) = \Big\{\,\vec x\in\mathbb R^n : \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} \ge \lambda_\alpha\,\Big\} \quad\text{where}\quad \mathrm{Size}\big(C(\lambda_\alpha)\big) = P_{\theta_0}\Big(\Big\{\,X : \frac{f(X,\theta_1)}{f(X,\theta_0)} \ge \lambda_\alpha\,\Big\}\Big) = \alpha \qquad (6.23)$$
means that we can give a single number that describes the outcome of the tests (6.23) for all $\alpha$. Specifically, let
$$P = P\Big(\frac{f(X,\theta_1)}{f(X,\theta_0)} \ge T_0 \,\Big|\, H_0\Big) \qquad (6.24)$$
where $T_0 = T_0(X) = f(X,\theta_1)/f(X,\theta_0)$ for the observed sample. Note that the $X$ in (6.24) is random with distribution $H_0$, but the $X$ in $T_0(X)$ is the observed sample and assumed constant. Then

Lemma 6.1. Suppose that $X = (X_1,\dots,X_n)$ is an independent sample with density $f(x,\theta)$. Suppose that we can find constants $\lambda_\alpha$ such that $\mathrm{Size}\big(C(\lambda_\alpha)\big) = \alpha$ for each $0 < \alpha < 1$. If $P < \alpha$, then the observed $X \in C(\lambda_\alpha)$ and we reject $H_0$. If $P > \alpha$, then the observed $X \notin C(\lambda_\alpha)$ and we accept $H_0$.

Proof. If $P < \alpha$, then the observed $T_0(X) > \lambda_\alpha$, so that $X \in C(\lambda_\alpha)$. Hence we reject $H_0$. If $P > \alpha$, then the observed $T_0(X) < \lambda_\alpha$, so that $X \notin C(\lambda_\alpha)$ and we accept $H_0$. $\square$
7. Generalized Likelihood Ratio Tests. Assume as before that $X_1,\dots,X_n$ is an independent sample from a density $f(x,\theta)$, and that the hypotheses are $H_0: \theta\in\Omega_0$ and $H_1: \theta\in\Omega_1$ for two sets $\Omega_0$ and $\Omega_1$ of parameter values. If $\theta_0\in\Omega_0$ and $\theta_1\in\Omega_1$ were single known values, the most powerful tests of Section 6 would have critical regions
$$C_\lambda = \Big\{\,\vec x\in\mathbb R^n : \frac{f(\vec x,\theta_1)}{f(\vec x,\theta_0)} \ge \lambda\,\Big\} \qquad (7.4)$$
where $\vec x = (x_1,\dots,x_n)$. That is, we reject $H_0$ in favor of $H_1$ if $X = (X_1,\dots,X_n) \in C_\lambda$.
The idea behind generalized likelihood-ratio tests (abbreviated GLRTs) is that we use the likelihood-ratio test (7.4) with our best guesses for $\theta_0\in\Omega_0$ and $\theta_1\in\Omega_1$. That is, we define
$$\widehat{LR}_n(X) = \frac{\max_{\theta\in\Omega_1} L(\theta, X_1,\dots,X_n)}{\max_{\theta\in\Omega_0} L(\theta, X_1,\dots,X_n)} = \frac{L\big(\hat\theta_{H_1}(X), X_1,\dots,X_n\big)}{L\big(\hat\theta_{H_0}(X), X_1,\dots,X_n\big)} \qquad (7.5)$$
where $\hat\theta_{H_0}(X)$ and $\hat\theta_{H_1}(X)$ are the maximum-likelihood estimates for $\theta\in\Omega_0$ and $\theta\in\Omega_1$, respectively. Note that $\widehat{LR}_n(X)$ depends on $X$ but not on $\theta$ (except indirectly from the sets $\Omega_0$ and $\Omega_1$). We then use the tests with critical regions
$$C_\lambda = \big\{\,\vec x\in\mathbb R^n : \widehat{LR}_n(\vec x) \ge \lambda\,\big\} \quad\text{with}\quad \mathrm{Size}(C_\lambda) = \alpha \qquad (7.6)$$
Since the maximum likelihood estimates $\hat\theta_{H_0}(X)$, $\hat\theta_{H_1}(X)$ in (7.5) depend on $X = (X_1,\dots,X_n)$, the Neyman-Pearson lemma does not guarantee that (7.6) provides the most powerful tests. However, the asymptotic consistency of the MLEs (see Theorem 5.1 above) suggests that $\hat\theta_{H_0}, \hat\theta_{H_1}$ may be close to the correct values.
Warning: Some statisticians, such as the authors of the textbook Larsen and Marx, use an alternative version of the likelihood ratio
$$\widehat{LR}^{\mathrm{alt}}_n(X) = \frac{\max_{\theta\in\Omega_0} L(\theta, X_1,\dots,X_n)}{\max_{\theta\in\Omega_1} L(\theta, X_1,\dots,X_n)} = \frac{L\big(\hat\theta_{H_0}(X), X_1,\dots,X_n\big)}{L\big(\hat\theta_{H_1}(X), X_1,\dots,X_n\big)} \qquad (7.7)$$
with the maximum for $H_1$ in the denominator instead of the numerator and the maximum for $H_0$ in the numerator instead of the denominator. One then tests for small values of the GLRT statistic instead of large values. Since $\widehat{LR}^{\mathrm{alt}}_n(X) = 1/\widehat{LR}_n(X)$, the critical tests for (7.7) are
$$C^{\mathrm{alt}}_\lambda = \big\{\,\vec x\in\mathbb R^n : \widehat{LR}^{\mathrm{alt}}_n(\vec x) \le \lambda\,\big\} = \big\{\,\vec x\in\mathbb R^n : \widehat{LR}_n(\vec x) \ge 1/\lambda\,\big\} \quad\text{with}\quad \mathrm{Size}(C^{\mathrm{alt}}_\lambda) = \alpha \qquad (7.8)$$
Thus the critical regions (7.8) are exactly the same as those for (7.6) except for a transformation $\lambda \to 1/\lambda$.
Examples (1). Take $\Omega_0 = \{1\}$ and $\Omega_1 = (0,1)$. This corresponds to the hypotheses $H_0: \theta = 1$ and $H_1: \theta < 1$ where $X_1,\dots,X_n$ are $U(0,\theta)$. Then one can show
$$\widehat{LR}_n(X) = (1/X_{\max})^n, \qquad X_{\max} = \max_{1\le j\le n} X_j$$
(Argue as in Section 6.5 in the text, Larsen and Marx.) The GLRTs (7.6) in this case are equivalent to
$$C_\lambda(X) = \{\,X : \widehat{LR}_n(X) \ge \lambda\,\} = \{\,X : (1/X_{\max})^n \ge \lambda\,\} = \{\,X : X_{\max} \le \lambda_m\,\} \quad\text{where}\quad P(X_{\max} \le \lambda_m \mid H_0) = \alpha \qquad (7.9)$$
In Example 2 (see (7.2b) above), $\Omega_0 = \{\,(\mu,\sigma^2) : \mu = \mu_0\,\}$ and $\Omega_1 = \{\,(\mu,\sigma^2) : \mu \ne \mu_0\,\}$. This corresponds to the hypotheses $H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$ where $X_1,\dots,X_n$ are $N(\mu,\sigma^2)$ with $\sigma^2$ unspecified. One can show in this case that
$$\widehat{LR}_n(X) = \Big(1 + \frac{T(X)^2}{n-1}\Big)^{n/2} \qquad (7.10)$$
$$\text{where}\qquad T(X) = \frac{\sqrt n\,\big(\bar X - \mu_0\big)}{S(X)}, \qquad S(X)^2 = \frac{1}{n-1}\sum_{j=1}^n (X_j - \bar X)^2$$
(See Appendix 7.A.4, pages 519-521, in the textbook, Larsen and Marx. They obtain (7.10) with $-n/2$ in the exponent instead of $n/2$ because they use (7.7) instead of (7.5) to define the GLRT statistic, which is $\widehat{LR}_n(X)$ here but $\Lambda$ in their notation.)

Since $\widehat{LR}_n(X)$ is a monotonic function of $|T(X)|$, the GLRT test (7.6) is equivalent to
$$C_\lambda(X) = \{\,X : |T(X)| \ge \lambda_m\,\} \quad\text{where}\quad P\big(|T(X)| \ge \lambda_m \,\big|\, H_0\big) = \alpha \qquad (7.11)$$
This is the same as the classical two-sided one-sample Student-t test.
There is a useful large-sample asymptotic version of the GLRT, for which it is easy to find the critical values. Suppose that
$$H_0:\ \theta\in\Omega_0 \quad\text{and}\quad H_1:\ \theta\in\Omega_1 \qquad (7.12)$$
where $\Omega_0$ and $\Omega_1$ are parameter sets with $m_0$ and $m_1$ free parameters, respectively (for example, subsets of $\mathbb R^4$).

Since $\Omega_0\subseteq\Omega_1$, a test of $H_0$ against $H_1$ cannot be of the form either-or as in Section 6, since $\theta\in\Omega_0$ implies $\theta\in\Omega_1$. Instead, we view (7.12) with $\Omega_0\subseteq\Omega_1$ as a test of whether we really need the additional $d = m_1 - m_0$ parameter or parameters. That is, if the data $X = (X_1,\dots,X_n)$ does not fit the hypothesis $H_1$ sufficiently better than $H_0$ (as measured by the relative size of the fitted likelihoods in (7.5)) to provide evidence for rejecting $H_0$, then, to be conservative, we accept $H_0$ and conclude that there is not enough evidence for the more complicated hypothesis $H_1$.

A test of the form (7.12) with $\Omega_0\subseteq\Omega_1$ is called a nested hypothesis test. Note that, if $\Omega_0\subseteq\Omega_1$, then (7.5) implies that $\widehat{LR}_n(X) \ge 1$.
Under the following assumptions for a nested hypothesis test, we have the following general theorem. Assume as before that $X_1,\dots,X_n$ is an independent sample with density $f(x,\theta)$ where $f(x,\theta)$ satisfies the conditions of the Cramer-Rao lower bound (Theorem 4.1 in Section 4 above) and of the asymptotic normality of the MLE (Theorem 5.1 in Section 5 above). Then we have

Theorem 7.1. (Twice the Log-Likelihood Theorem) Under the above assumptions, assume that $\Omega_0\subseteq\Omega_1$ in (7.12), that $d = m_1 - m_0 > 0$, and that the two maximum-likelihood estimates $\hat\theta_{H_0}(X)$ and $\hat\theta_{H_1}(X)$ in (7.5) are attained in the interior of the sets $\Omega_0$ and $\Omega_1$, respectively. Then, for $\widehat{LR}_n(X)$ in (7.5),
$$\lim_{n\to\infty} P\Big(2\log\big(\widehat{LR}_n(X)\big) \le y \,\Big|\, H_0\Big) = P\big(\chi^2_d \le y\big) \qquad (7.13)$$
for $y \ge 0$, where $\chi^2_d$ represents a random variable with a $\chi^2$ distribution with $d = m_1 - m_0$ degrees of freedom.

Proof. The proof is similar to the proof of Theorem 5.1 in Section 5, but uses an $m$-dimensional central limit theorem for vector-valued random variables in $\mathbb R^m$ and Taylor's Theorem in $\mathbb R^m$ instead of in $\mathbb R^1$. $\square$
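Theorem 7.1 can be illustrated with the nested normal-mean test of (7.10) above, for which $2\log\widehat{LR}_n(X) = n\log\big(1 + T(X)^2/(n-1)\big)$ and $d = 1$. The following simulation sketch works under $H_0$ and assumes NumPy and SciPy are available; the parameter values are illustrative.

```python
# Under H0, 2*log(LR_n) from (7.10) should be approximately chi^2 with 1 df.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, reps, mu0 = 50, 100_000, 0.0
X = rng.normal(mu0, 1.0, size=(reps, n))           # data generated under H0

T = np.sqrt(n) * (X.mean(axis=1) - mu0) / X.std(axis=1, ddof=1)
two_log_LR = n * np.log1p(T**2 / (n - 1))          # 2 log LR_n for the test (7.10)

# The upper tail of 2 log LR_n should match the chi^2_1 tail probability.
print(np.mean(two_log_LR > 3.84), chi2.sf(3.84, df=1))
```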
Remarks (1). The analog of Theorem 7.1 for the alternative (upside-down) definition of $\widehat{LR}^{\mathrm{alt}}_n(X)$ in (7.7) has $-2\log\big(\widehat{LR}^{\mathrm{alt}}_n(X)\big)$ instead of $2\log\big(\widehat{LR}_n(X)\big)$.

(2). There is no analog of Theorem 7.1 if the hypotheses $H_0$ and $H_1$ are not nested. Finding a good asymptotic test for general non-nested composite hypotheses is an open question in Statistics that would have many important applications.
8. Fisher's Meta-Analysis Theorem. Suppose that we are interested in whether we can reject a hypothesis $H_0$ in favor of a hypothesis $H_1$. Assume that six different groups have carried out statistical analyses based on different datasets with mixed results. Specifically, assume that they have reported the six P-values (as in Section 6.5 above)
$$0.06 \quad 0.02 \quad 0.13 \quad 0.21 \quad 0.22 \quad 0.73 \qquad (8.1)$$
While only one of the six groups rejected $H_0$ at level $\alpha = 0.05$, and that was with borderline significance ($0.01 < P < 0.05$), five of the six P-values are rather small. Is there a way to assign an aggregated P-value to the six P-values in (8.1)? After reading these six studies (and finding nothing wrong with them), should we accept $H_0$ or reject $H_0$ in favor of $H_1$ at level $\alpha = 0.05$?

The first step is to find the random distribution of P-values that independent experiments or analyses of the same true hypothesis $H_0$ should attain. Suppose that each experimenter used a likelihood-ratio test of the Neyman-Pearson form (6.23) or GLRT form (7.6) where it is possible to find a value $\lambda_\alpha$ with $\mathrm{Size}\big(C(\lambda_\alpha)\big) = \alpha$ for each $0 < \alpha < 1$. Then, under $H_0$, the P-value satisfies $P(P < \alpha) = \alpha$ for $0 < \alpha < 1$. This means that P is uniformly distributed in (0, 1).

Given that the numbers in (8.1) should be uniformly distributed if $H_0$ is true, do these numbers seem significantly shifted towards smaller values, as they might be if $H_1$ were true? The first step towards answering this is to find a reasonable alternative distribution of the P-values given $H_1$.
Fisher most likely considered the family of distributions $f(p,\theta) = \theta p^{\theta-1}$ for $0 < \theta \le 1$, so that $H_0$ corresponds to $\theta = 1$. For $\theta < 1$, not only is $E(P) = \theta/(\theta+1) < 1/2$, but the density $f(p,\theta)$ has an infinite cusp at $p = 0$.

For this family, the likelihood of random P-values $P_1,\dots,P_n$ given $\theta$ is
$$L(\theta, P_1,\dots,P_n) = \prod_{j=1}^n f(P_j,\theta) = \theta^n\Big(\prod_{j=1}^n P_j\Big)^{\theta-1}$$
Thus $Q = \prod_{j=1}^n P_j$ is a sufficient statistic for $\theta$, and we have at least a single number to summarize the six values in (8.1).

Moreover, it follows as in Example 2 in Section 6.4 above that tests of the form $\{\,P : \prod_{j=1}^n P_j \le \lambda_m\,\}$ are UMP for $H_0: \theta = 1$ against $H_1: \theta < 1$. To calibrate such a test we need the null distribution of $\prod_{j=1}^n P_j$, or equivalently of $\sum_{j=1}^n 2\log(1/P_j)$.

Lemma 8.1. Let $U, U_1,\dots,U_n$ be independent and uniformly distributed in $(0,1)$. Then
(a) for $A > 0$, $Y = -A\log(U)$ has an exponential distribution with rate $1/A$,
(b) $2\log(1/U)$ has a $\chi^2_2$ distribution, and
(c) $\sum_{j=1}^n 2\log(1/U_j) \sim \chi^2_{2n}$ has a chi-square distribution with $2n$ degrees of freedom.
Proof. (a) For $A, t > 0$
$$P(Y > t) = P\big(-A\log(U) > t\big) = P\big(\log(U) < -t/A\big) = P\big(U < \exp(-t/A)\big) = \exp(-t/A)$$
This implies that $Y$ has a probability density $f_Y(t) = -(d/dt)\exp(-t/A) = (1/A)\exp(-t/A)$, which is exponential with rate $1/A$.

(b) A $\chi^2_d$ distribution is gamma$(d/2,\,1/2)$, so that $\chi^2_2 \sim$ gamma$(1,\,1/2)$. By the form of the gamma density, gamma$(1,\lambda)$ is exponential with rate $\lambda$. Thus, by part (a), $2\log(1/U) \sim$ gamma$(1,\,1/2) \sim \chi^2_2$.

(c) Each $2\log(1/P_j) \sim \chi^2_2$, which implies that
$$Q = \sum_{j=1}^n 2\log(1/P_j) \sim \chi^2_{2n} \qquad (8.2)$$
Putting these results together,
Theorem 8.1 (Fisher). Assume independent observations $U_1,\dots,U_n$ have density $f(x,\theta) = \theta x^{\theta-1}$ for $0 \le x \le 1$, and in particular are independent and uniformly distributed in $(0,1)$ if $\theta = 1$. Then, the P-value of the UMP test for $H_0: \theta = 1$ against $H_1: \theta < 1$ is
$$P = P\big(\chi^2_{2n} \ge Q_0\big)$$
where $Q_0$ is the observed value of $2\sum_{j=1}^n \log(1/U_j)$ and $\chi^2_{2n}$ represents a chi-square distribution with $2n$ degrees of freedom.

Proof. By Lemma 8.1 and (8.2). $\square$
Example. The numbers $P_1,\dots,P_6$ in (8.1) satisfy
$$\sum_{j=1}^6 2\log(1/P_j) = 5.63 + 7.82 + 4.08 + 3.12 + 3.03 + 0.63 = 24.31$$
Thus the P-value in Theorem 8.1 is
$$P = P\big(\chi^2_{12} \ge 24.31\big) = 0.0185$$
Thus the net effect of the six tests with P-values in (8.1) is $P = 0.0185$, which is significant at $\alpha = 0.05$ but not at $\alpha = 0.01$.
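The arithmetic in this example is easy to reproduce, assuming NumPy and SciPy are available:

```python
# Fisher's combined test (Theorem 8.1) applied to the six P-values in (8.1).
import numpy as np
from scipy.stats import chi2

p = np.array([0.06, 0.02, 0.13, 0.21, 0.22, 0.73])
Q0 = np.sum(2 * np.log(1 / p))           # about 24.31
print(Q0, chi2.sf(Q0, df=2 * len(p)))    # combined P-value, about 0.0185
```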
9. Two Contingency-Table Tests. Consider the following contingency table for n = 1033 individuals with two classifications A and B:
Table 9.1. A Contingency Table for A and B
B: 1 2 3 4 5 6 Sums:
1 29 11 95 78 50 47 310
A: 2 38 17 106 105 74 49 389
3 31 9 60 49 29 28 206
4 17 13 35 27 21 15 128
Sums: 115 50 296 259 174 139 1033
It is assumed that the data in Table 9.1 come from independent observations $Y_i = (A_i, B_i)$ for $n = 1033$ individuals, where $A_i$ is one of 1, 2, 3, 4 and $B_i$ is one of 1, 2, 3, 4, 5, 6. Rather than write out the $n = 1033$ values, it is more convenient to represent the data as 24 counts for the $4\times 6$ possible $(A,B)$ values, as we have done in Table 9.1.

Suppose we want to test the hypothesis that the $Y_i$ are sampled from a population for which $A$ and $B$ are independent. (Sometimes this hypothesis is stated as "rows and columns are independent", but this doesn't make very much sense if you analyze it closely.)
If the sample is homogeneous, each observation $Y_i = (A_i, B_i)$ has a multivariate Bernoulli distribution with probability function $P\big(Y = (a,b)\big) = p_{ab}$ for $1 \le a \le s$ and $1 \le b \le t$, where $s = 4$ is the number of rows in Table 9.1 and $t = 6$ is the number of columns, and $\sum_{a=1}^s\sum_{b=1}^t p_{ab} = 1$. If the random variables $A$ and $B$ are independent, then $P\big(Y = (a,b)\big) = P(A=a)P(B=b)$. If $P(A=a) = p^A_a$ and $P(B=b) = p^B_b$, then $p_{ab} = p^A_a p^B_b$. This suggests the two nested hypotheses
$$H_1:\ p_{ab} > 0\ \text{are arbitrary subject to}\ \sum_{a=1}^s\sum_{b=1}^t p_{ab} = 1 \qquad (9.1)$$
$$H_0:\ p_{ab} = p^A_a\,p^B_b \ \text{ where }\ \sum_{a=1}^s p^A_a = \sum_{b=1}^t p^B_b = 1$$
9.1. Pearson's Chi-Square Test.
We first consider the GLRT test for (9.1). Writing $p$ for the matrix $p = (p_{ab})$ ($1\le a\le s$, $1\le b\le t$), the likelihood of $Y = (Y_1,Y_2,\dots,Y_n)$ is
$$L(p, Y) = \prod_{i=1}^n \{\,q_i = p_{ab} : Y_i = (a,b)\,\} = \prod_{a=1}^s\prod_{b=1}^t p_{ab}^{X_{ab}} \qquad (9.2)$$
where $X_{ab}$ are the counts in Table 9.1. The MLE $\hat p_{H_1}$ for hypothesis $H_1$ can be found by the method of Lagrange multipliers by solving
$$\frac{\partial}{\partial p_{ab}}\log L(p,Y) = 0 \quad\text{subject to}\quad \sum_{a=1}^s\sum_{b=1}^t p_{ab} = 1$$
This leads to $(\hat p_{H_1})_{ab} = X_{ab}/n$. The MLE $\hat p_{H_0}$ can be found similarly as the solution of
$$\frac{\partial}{\partial p^A_a}\log L(p,Y) = \frac{\partial}{\partial p^B_b}\log L(p,Y) = 0 \quad\text{subject to}\quad \sum_{a=1}^s p^A_a = \sum_{b=1}^t p^B_b = 1$$
This implies $\hat p^A_a = X_{a+}/n$ and $\hat p^B_b = X_{+b}/n$, where $X_{a+} = \sum_{c=1}^t X_{ac}$ and $X_{+b} = \sum_{c=1}^s X_{cb}$. This in turn implies $(\hat p_{H_0})_{ab} = (X_{a+}/n)(X_{+b}/n)$. Thus the GLRT statistic for (9.1) is
$$\widehat{LR}_n(Y) = \frac{L(\hat p_{H_1}, Y)}{L(\hat p_{H_0}, Y)} = \frac{\prod_{a=1}^s\prod_{b=1}^t \big(X_{ab}/n\big)^{X_{ab}}}{\prod_{a=1}^s\big(X_{a+}/n\big)^{X_{a+}}\ \prod_{b=1}^t\big(X_{+b}/n\big)^{X_{+b}}} \qquad (9.3)$$
Note that hypothesis $H_1$ in (9.1) has $m_1 = st - 1$ free parameters, while hypothesis $H_0$ has $m_0 = (s-1) + (t-1)$ free parameters. The difference is
$$d = m_1 - m_0 = st - 1 - (s-1) - (t-1) = st - s - t + 1 = (s-1)(t-1)$$
Thus by Theorem 7.1 in Section 7
$$\lim_{n\to\infty} P\Big(2\log\big(\widehat{LR}_n(X)\big) \le y \,\Big|\, H_0\Big) = P\big(\chi^2_d \le y\big) \qquad (9.4)$$
where $d = (s-1)(t-1)$. The test of $H_0$ against $H_1$ based on (9.4) is often called the G-test.
Pearson's "Sum of (Observed $-$ Expected)$^2$/Expected" statistic is
$$D_n(Y) = \sum_{a=1}^s\sum_{b=1}^t \frac{\big(X_{ab} - n\,\hat p^A_a\,\hat p^B_b\big)^2}{n\,\hat p^A_a\,\hat p^B_b} = \sum_{a=1}^s\sum_{b=1}^t \frac{\big(X_{ab} - (X_{a+}X_{+b}/n)\big)^2}{(X_{a+}X_{+b}/n)}$$
It was proven in class in a more general context that
$$E\Big(\big|2\log\widehat{LR}_n(Y) - D_n(Y)\big|\Big) \le \frac{C}{\sqrt n}$$
for $n \ge 1$. It can be shown that this in combination with (9.4) implies
$$\lim_{n\to\infty} P\big(D_n(Y) \le y \,\big|\, H_0\big) = P\big(\chi^2_d \le y\big) \qquad (9.5)$$
Thus the GLRT test for $H_0$ within $H_1$ in (9.1) is asymptotically equivalent to a test on $D_n(Y)$, for which the P-value can be written asymptotically
$$P = P\big(\chi^2_d \ge D_n(Y)_{\mathrm{Obs}}\big)$$
where "Obs" stands for "Observed value of."

For the data in Table 9.1, $D_n(Y) = 19.33$ and $P = 0.199$ for $d = (4-1)(6-1) = 15$ degrees of freedom. Thus, the data in Table 9.1 is not significant using Pearson's chi-square test.
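The statistic $D_n(Y)$, the G-statistic $2\log\widehat{LR}_n(Y)$ from (9.3), and their chi-square P-values for Table 9.1 can be computed directly. A sketch assuming NumPy and SciPy are available:

```python
# Pearson chi-square statistic (9.5) and G-test statistic (9.4) for Table 9.1.
import numpy as np
from scipy.stats import chi2

X = np.array([[29, 11,  95,  78, 50, 47],
              [38, 17, 106, 105, 74, 49],
              [31,  9,  60,  49, 29, 28],
              [17, 13,  35,  27, 21, 15]])
n = X.sum()
expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / n    # n * p^A_a * p^B_b under H0
d = (X.shape[0] - 1) * (X.shape[1] - 1)

D_n = np.sum((X - expected) ** 2 / expected)             # Pearson statistic
G = 2 * np.sum(X * np.log(X / expected))                 # 2 log LR_n from (9.3)
print(D_n, chi2.sf(D_n, d))                              # compare with 19.33 and 0.199
print(G, chi2.sf(G, d))
```

Both statistics should give P-values near 0.2 for these data, in agreement with the value quoted above.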
9.2. The Pearson Test is an Omnibus Test.
The GLRT test of (9.1) is sometimes called a test of $H_0$ against an omnibus alternative, since it is designed to have power against any alternative $p_{ab}$ for which $A$ and $B$ fail to be independent.

A test that is sensitive to a particular way in which $H_0$ may fail can have much greater power against that alternative than an omnibus test, which must guard against any possible failure of $H_0$. Conversely, a test that is tuned towards a particular alternative can fail miserably when $H_0$ is false for other reasons.

The shrinkage estimator in Section 2.1 provides a somewhat similar example. If we make even a crude guess what the true mean of a normal sample might be, then a shrinkage estimator towards that value can have smaller expected squared error than the sample mean estimator, which is the minimum-variance unbiased estimator for all possible true means. Conversely, if we guess wrongly about the true mean, the shrinkage estimator may have a much larger expected squared error.
9.3. The Mantel-Haenszel Trend Test.
Suppose that one suspects that the random variables $A$ and $B$ in Table 9.1 are correlated, as opposed to being independent. In particular, we would like a test of $H_0$ in (9.1) whose implicit alternative is that $A$, $B$ are correlated, which may have greater power if $A$ and $B$ are in fact correlated. We understand that this test may have much less power against an alternative to independence in which $A$ and $B$ are close to being uncorrelated.

The Mantel-Haenszel trend test does exactly this. (Note: This test is also called the Mantel trend test. The "trend" is necessary here because there is a contingency table test for stratified tables that is also called the Mantel-Haenszel test.)

Specifically, let $r$ be the sample Pearson correlation coefficient of $A_i$ and $B_i$ for the sample $Y_i = (A_i,B_i)$. That is,
$$r = \frac{\sum_{i=1}^n (A_i - \bar A)(B_i - \bar B)}{\sqrt{\sum_{i=1}^n (A_i-\bar A)^2}\ \sqrt{\sum_{i=1}^n (B_i-\bar B)^2}} \qquad (9.6)$$
Recall that $A_i$ takes on integer values with $1 \le A_i \le s$ and $B_i$ takes on integer values with $1 \le B_i \le t$. Then

Theorem 9.1 (Mantel-Haenszel). Under the assumptions of this section, using a permutation test based on permuting the values $B_i$ in $Y_i = (A_i, B_i)$ for $1\le i\le n$ among themselves while holding $A_i$ fixed,
$$\lim_{n\to\infty} P\big((n-1)\,r^2 \le y \,\big|\, H_0\big) = P\big(\chi^2_1 \le y\big) \qquad (9.7)$$
Remarks. The limits in Theorem 7.1 and (9.4) are based on a probability space that supports independent random variables with a given probability density $f(x,\theta)$.

In contrast, the underlying probability space in Theorem 9.1, in common with permutation tests in general, is defined by a set of permutations of the data under which the distribution of a sample statistic is the same as if $H_0$ is true. For example, in this case, if we choose $A_i$ at random from $A_1,\dots,A_n$ and match it with a randomly permuted $B_i$ at that $i$, then
$$P(A_i = a, B_i = b) = P(A_i = a)\,P(B_i = b)$$
and $A_i, B_i$ are independent. (In contrast, $B_1$ and $B_2$ are not independent. If $B_1$ happened to be a large value, then the value $B_2$ at a different offset in the permuted values, conditional on $B_1$ already having been chosen, would be drawn from values with a smaller mean. Thus $B_1$ and $B_2$ are negatively correlated.)

Since the pairs $A_i, B_i$ are independent in this permutation probability space, if the observed value of $r$ in (9.6) is far out on the tail of the statistics $r$ calculated by randomly permuting the $B_i$ in this manner, then it is likely that the observed $A_i$ and $B_i$ were not chosen from a distribution in which $A$ and $B$ were independent. We needn't worry that the set of possible P-values is overly discrete if $n$ is large, since in that case the number of permutations ($n!$) is truly huge. Since the test statistic in (9.7) is based on the sample correlation itself, if we reject $H_0$ then it is likely that $A$ and $B$ are correlated.
Example. For the data in Table 9.1, the sample correlation coefficient is $r = 0.071$ and $X_{\mathrm{obs}} = (n-1)r^2 = (1032)(0.071)^2 = 5.2367$. The P-value is $P = P(\chi^2_1 \ge 5.2367) = 0.0221$. Thus Table 9.1 shows a significant departure from independence by the Mantel test, but not by the standard Pearson test.

In general, one can get P-values for a $\chi^2_1$ distribution from a standard normal table, since it is the square of a standard normal. Thus
$$P = P(\chi^2_1 \ge 5.2367) = P(Z^2 \ge 5.2367) = P\big(|Z| \ge 2.2884\big) = 2\,P(Z \ge 2.2884) = 0.0221$$
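The Mantel-Haenszel trend statistic can be computed directly from the cell counts in Table 9.1, since the sample correlation (9.6) depends on the data only through the counts. A sketch assuming NumPy and SciPy are available:

```python
# Mantel-Haenszel trend statistic (n-1)*r^2 and its chi^2_1 P-value for Table 9.1.
import numpy as np
from scipy.stats import chi2

X = np.array([[29, 11,  95,  78, 50, 47],
              [38, 17, 106, 105, 74, 49],
              [31,  9,  60,  49, 29, 28],
              [17, 13,  35,  27, 21, 15]])
n = X.sum()
a_vals = np.arange(1, X.shape[0] + 1)          # row scores A_i = 1..4
b_vals = np.arange(1, X.shape[1] + 1)          # column scores B_i = 1..6

A_mean = (a_vals @ X.sum(axis=1)) / n
B_mean = (b_vals @ X.sum(axis=0)) / n
cov = np.sum(X * np.outer(a_vals - A_mean, b_vals - B_mean)) / n
var_A = np.sum(X.sum(axis=1) * (a_vals - A_mean) ** 2) / n
var_B = np.sum(X.sum(axis=0) * (b_vals - B_mean) ** 2) / n

r = cov / np.sqrt(var_A * var_B)               # sample correlation (9.6)
stat = (n - 1) * r**2
print(r, stat, chi2.sf(stat, df=1))            # compare with r = 0.071, 5.24, 0.0221
```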