Institute of Mathematical Statistics — The Annals of Statistics
Defining the Curvature of a Statistical Problem (with Applications to Second Order Efficiency)
Author(s): Bradley Efron
Source: The Annals of Statistics, Vol. 3, No. 6 (Nov., 1975), pp. 1189-1242
Published by: Institute of Mathematical Statistics
Stable URL: https://www.jstor.org/stable/2958246
Accessed: 09-05-2019 19:05 UTC
The Annals of Statistics
1975, Vol. 3, No. 6, 1189-1242
DEFINING THE CURVATURE OF A STATISTICAL PROBLEM (WITH APPLICATIONS TO SECOND ORDER EFFICIENCY)
BY BRADLEY EFRON
Stanford University
breakdown of this favorable situation. For example, if γ_{θ0}² is large, the locally most powerful test of θ = θ₀ versus θ > θ₀ can be expected to have poor operating characteristics. Similarly, the variance of the maximum likelihood estimator (MLE) exceeds the Cramér-Rao lower bound in approximate proportion to γ_θ². (See Sections 8 and 10.)
For nonexponential families the MLE is not, in general, a sufficient statistic.
How much information does it lose, compared with all the data x? The answer
can be expressed in terms of γ_θ². This theory goes back to Fisher (1925) and Rao (1961, 1962, 1963). They attempted to show that if ℱ is a one-parameter subset of the k-category multinomial distributions, indexed say by the vector of probabilities f_θ(x) = P_θ(X ∈ category x), x = 1, 2, ···, k, the following result holds: let i_θ be the Fisher information in an independent sample of size n from f_θ, î_θ the Fisher information in the maximum likelihood estimator θ̂(x₁, x₂, ···, x_n) based on that sample, and i_θ^(1) the Fisher information in a sample of size one (so i_θ = n·i_θ^(1)). Then
There are really two halves to this paper: Sections 3-7 introduce the notion of statistical curvature, and Sections 8-10 apply curvature to hypothesis testing, partial sufficiency, and estimation. Section 2 consists of a brief review of the notion of the geometrical curvature of a line.
(2.1)  γ_X = [ (Y″)² / (1 + (Y′)²)³ ]^{1/2}

is defined to be the curvature of ℒ at X, where Y′ ≡ dY/dX and Y″ ≡ d²Y/dX² are assumed to exist continuously in a neighborhood of the value X where the curvature is being evaluated. In particular, if Y′ = 0 then γ_X = |Y″|. An exercise in differential calculus shows that γ_X is the rate of change of direction of ℒ with respect to arc-length along the curve. The "radius of curvature", ρ_X ≡ 1/γ_X, is the radius of the circle tangent to ℒ at (X, Y) whose Taylor expansion about (X, Y) agrees up to the quadratic term with that of ℒ. Struik (1950) is a good elementary reference for curvature and related concepts.
The concept of curvature extends to curved lines in Euclidean k-space, E^k, say ℒ = {η_θ, θ ∈ Θ}, where Θ is an interval of the real line. For each θ, η_θ is a vector in E^k whose componentwise derivatives with respect to θ we denote η̇_θ ≡ (∂/∂θ)η_θ, η̈_θ ≡ (∂²/∂θ²)η_θ. These derivatives are assumed to exist continuously in a neighborhood of a value of θ where we wish to define the curvature. Suppose also that a k × k symmetric nonnegative definite matrix Σ_θ is defined continuously in θ. Let M_θ be the 2 × 2 matrix, with entries denoted ν₂₀(θ), ν₁₁(θ), ν₀₂(θ) as shown, defined by

(2.2)  M_θ = [ ν₂₀(θ)  ν₁₁(θ) ; ν₁₁(θ)  ν₀₂(θ) ] = [ η̇_θ′Σ_θη̇_θ  η̇_θ′Σ_θη̈_θ ; η̇_θ′Σ_θη̈_θ  η̈_θ′Σ_θη̈_θ ] ,

and define

(2.3)  γ_θ = [ (ν₂₀(θ)ν₀₂(θ) − ν₁₁(θ)²) / ν₂₀(θ)³ ]^{1/2} = [ det M_θ / ν₂₀(θ)³ ]^{1/2} .

Then γ_θ is the curvature of ℒ at θ with respect to the inner product determined by Σ_θ. If we take k = 2, θ = X, η_θ = (X, Y(X))′ and Σ_θ = I, then (2.3) reduces to (2.1). Again it can be shown that γ_θ is the rate of change of direction of ℒ with respect to arc-length along the curve,
(2.4)  γ_θ = dα/ds ,

(2.5)  ds/dθ = [ν₂₀(θ)]^{1/2} ,

where s denotes arc-length along ℒ and α the direction of the tangent.
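As a numerical illustration of (2.3), the entries of M_θ can be formed by finite differences and the result checked against the plane-curve formula (2.1). The sketch below is mine, not the paper's; the function name, the finite-difference step, and the default Σ_θ = I are illustrative choices only.

```python
import math

def curvature(eta, theta, sigma=None, h=1e-5):
    """Statistical curvature of the curve eta(theta) in E^k, per (2.3):
    gamma^2 = det(M_theta) / nu20^3, with M_theta built from the first and
    second derivatives of eta and an inner-product matrix Sigma_theta
    (identity by default). A hedged sketch, not code from the paper."""
    k = len(eta(theta))
    if sigma is None:
        sigma = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    # central finite differences for the componentwise derivatives
    e0, ep, em = eta(theta), eta(theta + h), eta(theta - h)
    d1 = [(ep[i] - em[i]) / (2 * h) for i in range(k)]
    d2 = [(ep[i] - 2 * e0[i] + em[i]) / h**2 for i in range(k)]
    quad = lambda u, v: sum(u[i] * sigma[i][j] * v[j]
                            for i in range(k) for j in range(k))
    nu20, nu11, nu02 = quad(d1, d1), quad(d1, d2), quad(d2, d2)
    return math.sqrt((nu20 * nu02 - nu11**2) / nu20**3)

# check against (2.1) for Y = X^2 at X = 1: gamma = |Y''|/(1+Y'^2)^(3/2)
print(curvature(lambda t: [t, t**2], 1.0), 2 / 5**1.5)
```

The two printed values agree to the accuracy of the finite differences, consistent with the remark that (2.3) reduces to (2.1) for a plane curve with Σ = I.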
(3.3)  A = {η : ψ(η) < ∞} ,

the natural parameter space of the full exponential family. Now suppose that

(3.4)  ℱ = {f_θ = g_{η_θ} : θ ∈ Θ}

is a one-parameter subset of the full family, with η_θ a twice continuously differentiable function of θ. The statistical curvature γ_θ of ℱ at θ is then defined by (2.3), with Σ_θ taken to be the covariance matrix of x under f_θ.
(3.7)  x ∼ 𝒩₂(η_θ, I) ,

(3.8)  M_θ = [ 1 + γ₀²θ²   γ₀²θ ; γ₀²θ   γ₀² ] ,

so

(3.9)  γ_θ² = γ₀² / (1 + γ₀²θ²)³ .
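The closed form (3.9) is easy to verify numerically for the bivariate normal curve η_θ = (θ, γ₀θ²/2) with Σ_θ = I. The following sketch, with hypothetical names, is mine and not the paper's:

```python
def gamma_sq(theta, g0):
    """gamma_theta^2 for Example 1 via M_theta: eta_theta = (theta, g0*theta^2/2),
    Sigma_theta = I.  A hand-computed sketch, not code from the paper."""
    d1 = (1.0, g0 * theta)   # eta-dot
    d2 = (0.0, g0)           # eta-ddot
    nu20 = d1[0]**2 + d1[1]**2
    nu11 = d1[0]*d2[0] + d1[1]*d2[1]
    nu02 = d2[0]**2 + d2[1]**2
    return (nu20 * nu02 - nu11**2) / nu20**3

# compare with the closed form (3.9): g0^2 / (1 + g0^2 theta^2)^3
g0, theta = 0.5, 0.7
print(gamma_sq(theta, g0), g0**2 / (1 + g0**2 * theta**2)**3)
```

Note that det M_θ = γ₀² exactly, so all of the θ-dependence of γ_θ² comes through the speed term ν₂₀³ in the denominator.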
For the Poisson regression model, xᵢ independent Poisson(aᵢ + θbᵢ),

(3.10)  ν₂₀(θ) = Σᵢ₌₁ᵏ bᵢ²/(aᵢ + θbᵢ) ,
        ν₁₁(θ) = −Σᵢ₌₁ᵏ bᵢ³/(aᵢ + θbᵢ)² ,
        ν₀₂(θ) = Σᵢ₌₁ᵏ bᵢ⁴/(aᵢ + θbᵢ)³ .
That the entries of M_θ are summations follows from the independence of x₁, x₂, ···, x_k, as mentioned in Section 6. A very similar formula holds for the analogous binomial regression model.
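Since ν₂₀, ν₁₁, ν₀₂ are second moments of the pair (l̇_θ, l̈_θ), the summation formulas can be cross-checked by Monte Carlo. The sketch below is illustrative only; the Poisson sampler, the constants a, b, and the tolerances are my own choices:

```python
import math, random

def nus(theta, a, b):
    """M_theta entries for x_i independent Poisson(a_i + theta*b_i),
    following the summation pattern of (3.10) (a reconstruction sketch)."""
    mu = [ai + theta * bi for ai, bi in zip(a, b)]
    nu20 = sum(bi**2 / m for bi, m in zip(b, mu))        # Var of l-dot
    nu11 = -sum(bi**3 / m**2 for bi, m in zip(b, mu))    # Cov(l-dot, l-ddot)
    nu02 = sum(bi**4 / m**3 for bi, m in zip(b, mu))     # Var of l-ddot
    return nu20, nu11, nu02

def rpois(mu, rng):
    # Knuth's multiplicative method; adequate for the small means used here
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def mc_nus(theta, a, b, reps=60000, seed=0):
    # Monte Carlo moments of (l-dot, l-ddot) as a cross-check
    rng = random.Random(seed)
    mu = [ai + theta * bi for ai, bi in zip(a, b)]
    s = [0.0] * 5   # sums of ld, ldd, ld^2, ldd^2, ld*ldd
    for _ in range(reps):
        x = [rpois(m, rng) for m in mu]
        ld = sum((xi / m - 1) * bi for xi, m, bi in zip(x, mu, b))
        ldd = -sum(xi * bi**2 / m**2 for xi, m, bi in zip(x, mu, b))
        s[0] += ld; s[1] += ldd; s[2] += ld*ld; s[3] += ldd*ldd; s[4] += ld*ldd
    m1, m2 = s[0] / reps, s[1] / reps
    return s[2]/reps - m1*m1, s[4]/reps - m1*m2, s[3]/reps - m2*m2

exact = nus(0.5, (1.0, 2.0, 3.0), (1.0, 1.0, 2.0))
approx = mc_nus(0.5, (1.0, 2.0, 3.0), (1.0, 1.0, 2.0))
print(exact)
print(approx)
```

The simulated moments should agree with the summation formulas to Monte Carlo accuracy.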
The Neyman-Davies model, x₁, x₂, ···, x_k independent scaled χ₁² random variables, xᵢ ∼ (1 + θβᵢ)·χ₁², with β₁, β₂, ···, β_k known constants, has the same structure. (Davies (1969) uses this model, which originates in an application due to Neyman, to investigate the power of the locally most powerful test of θ = 0 versus θ > 0. We compare our results with his in Section 8.) By direct calculation or by the
(3.14)  ν₂₀(θ) = Σᵢ βᵢ²/[2(1 + θβᵢ)²] ,  ν₁₁(θ) = −Σᵢ βᵢ³/(1 + θβᵢ)³ ,  ν₀₂(θ) = 2Σᵢ βᵢ⁴/(1 + θβᵢ)⁴ .

(3.16)  l̇_θ(x) ≡ (∂/∂θ) l_θ(x) ,  l̈_θ(x) ≡ (∂²/∂θ²) l_θ(x) .
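As with the Poisson model, the M_θ entries here are sums over independent components. The following sketch is mine; the ν₂₀ formula is derived directly from the stated model (score per component βᵢ(xᵢ − μᵢ)/(2μᵢ²), with μᵢ = 1 + θβᵢ and Var xᵢ = 2μᵢ²) rather than taken from the paper, and the check is by simulation:

```python
import math, random

def nu20_nd(theta, beta):
    """Fisher information nu20(theta) for the Neyman-Davies model
    x_i ~ (1 + theta*beta_i) * chi^2_1 -- a derived sketch, not the paper's."""
    return sum(b**2 / (2 * (1 + theta * b)**2) for b in beta)

# Monte Carlo check that Var(l-dot) matches the formula
rng = random.Random(1)
theta, beta = 0.3, (0.5, 1.0, 2.0)
mu = [1 + theta * b for b in beta]
s = s2 = 0.0
reps = 80000
for _ in range(reps):
    x = [m * rng.gauss(0.0, 1.0)**2 for m in mu]   # scaled chi^2_1 draws
    ld = sum(b * (xi - m) / (2 * m**2) for xi, m, b in zip(x, mu, beta))
    s += ld; s2 += ld * ld
var = s2 / reps - (s / reps)**2
print(var, nu20_nd(theta, beta))
```

The simulated variance of the score should match the closed-form sum to Monte Carlo accuracy.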
The moment relationships
(3.19)  i_θ = η̇_θ′ Σ_θ η̇_θ .
As a matter of fact the covariance matrix
(The constants in (4.2) are necessary to satisfy (2.2).) We will use standard form to simplify proofs in Sections 9 and 10. If ℱ is not in standard form at θ₀, the above transformation makes it so, and by property (ii) M_θ, and hence all information and curvature properties, remain unchanged. We could use property (i) to further standardize the situation so that i_{θ0} = 1, ν₁₁(θ₀) = 0, but that does not simplify any of the theoretical calculations which follow. Property (i) is useful for calculating curvatures, as will be shown in Section 7.
(5.1)  ℱ ≡ {f_θ(x), θ ∈ Θ}

be an arbitrary family of density functions indexed by the single parameter θ ∈ Θ, a possibly infinite interval of the real line. The sample space 𝒳 and carrier measure for the densities can be anything at all, so we have not excluded the possibility that ℱ consists of discrete distributions. Let
U_{θ0} is the version of Fisher's score statistic l̇_{θ0} that is the best locally unbiased estimator for θ near θ₀: Var_{θ0} U_{θ0} = 1/i_{θ0}, the Cramér-Rao lower bound, with E_{θ0} U_{θ0} = θ₀ and dE_θ U_{θ0}/dθ|_{θ=θ0} = 1. Therefore
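The defining properties of U_{θ0} — mean θ₀ and variance equal to the Cramér-Rao bound — can be simulated directly for the Cauchy translation family, where the per-observation information is 1/2 and the n-observation bound is 2/n. This sketch is my own construction (names and constants are illustrative):

```python
import math, random

def U(xs, t0):
    """U_{theta0} = theta0 + l-dot_{theta0}/i_{theta0} for a Cauchy
    translation sample; i_{theta0} = n/2 for n observations (a sketch)."""
    ld = sum(2 * (x - t0) / (1 + (x - t0)**2) for x in xs)
    return t0 + ld / (len(xs) / 2.0)

rng = random.Random(2)
n, t0, reps = 25, 0.0, 40000
vals = [U([t0 + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)], t0)
        for _ in range(reps)]
m = sum(vals) / reps
v = sum((u - m)**2 for u in vals) / reps
print(m, v, 2.0 / n)   # mean near theta0, variance near the CR bound 2/n
```

Note that U_{θ0} attains the bound exactly at θ = θ₀ even though no unbiased estimator of the Cauchy center attains it globally; U is only locally unbiased.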
(6.2)  γ_θ = γ*_θ / n^{1/2} .

The curvature goes to zero at rate 1/n^{1/2} under repeated sampling. This makes sense, since we know that linear methods work better in large samples.

In curved exponential families, (3.18)-(3.19) combine with l_θ(x) = Σᵢ l_θ(xᵢ) to give
densities given are with respect to Lebesgue measure on the real line, i.e., that we have just one observation of a continuous variable. For an independent, identically distributed (i.i.d.) sample of size n the curvature is obtained from formula (6.2). This last remark applies also to Example 7, and to the examples of Section 3.
we calculate

(7.4)
    f      1      2       5      10     20     ∞
    γ_θ²   2.5    1.063   .306   .107   .0334  18/f²
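The f = 1 (Cauchy) entry of (7.4) can be reproduced by one-dimensional quadrature: with the substitution x = tan u, a Cauchy expectation of g(x) becomes (1/π)∫g(tan u) du over (−π/2, π/2), and ν₁₁ = 0 by the symmetry of the density. The sketch below is mine, not the paper's computation:

```python
import math

def cauchy_curvature_sq(N=20000):
    """gamma^2 for the Cauchy translation family at theta = 0, by midpoint
    quadrature.  l-dot = 2x/(1+x^2), l-ddot = (2x^2-2)/(1+x^2)^2, and
    gamma^2 = Var(l-ddot)/i^2 since nu11 = 0.  (Illustrative sketch.)"""
    ld2 = ldd = ldd2 = 0.0
    du = math.pi / N
    for j in range(N):
        u = -math.pi / 2 + (j + 0.5) * du
        x = math.tan(u)
        s = 2 * x / (1 + x**2)                 # l-dot
        t = (2 * x**2 - 2) / (1 + x**2)**2     # l-ddot
        ld2 += s * s * du / math.pi
        ldd += t * du / math.pi
        ldd2 += t * t * du / math.pi
    nu20 = ld2                 # Fisher information i = 1/2
    nu02 = ldd2 - ldd**2       # Var(l-ddot) = 5/8
    return (nu20 * nu02) / nu20**3

print(cauchy_curvature_sq())
```

The quadrature gives (1/2 · 5/8)/(1/2)³ = 2.5, the first entry of the table; with (6.2), an i.i.d. Cauchy sample of size n then has curvature 2.5/n.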
(7.6) gives M_θ explicitly; its entries involve the factors a − 2, a − 3 and a − 4 in their denominators. (For a < 4, ν₀₂ is infinite.)
(7.8)
    c      1/4     1/2     1       2       4      ∞
    γ_θ²   .0370   .0625   .0740   .0640   .0439  1/(4c)
(7.12) gives the corresponding M_θ. If g and h are normal densities, say g the 𝒩(0, 1) density and h the 𝒩(μ, 1) density, we have r(x) = exp(μx − μ²/2).
(8.1)  θ₁ = θ₀ + 2/i_{θ0}^{1/2} ,

so that i_{θ0}^{1/2}(θ₁ − θ₀) = 2. From the discussion (5.5)-(5.7) this means that, approximately, (8.2) holds (where in (5.7) we have used ds/dθ|_{θ=θ0} = i_{θ0}^{1/2}). The locally most powerful level α test of H₀ : θ = θ₀ versus θ > θ₀, LMP_α for short, rejects for large values of U_{θ0}. From (8.2) we would expect LMP_α to have reasonable power at θ₁ for the customary values of α. That is, θ₁ should be a "statistically reasonable" alternative to θ₀.

The discussion following (5.8) shows that the unexplained fraction of the variance of U_{θ1} after linear regression on U_{θ0}, calculated under f_{θ0}, is approximately 4γ_{θ0}². If this quantity is large, say 4γ_{θ0}² ≥ 1/2, then U_{θ1} differs considerably from U_{θ0}, and the test of H₀ based on U_{θ1} will substantially differ from that based on U_{θ0}. Under these circumstances it is reasonable to question the use of LMP_α. Based on these very rough calculations, a value of γ_{θ0}² ≥ 1/8 is "large".
(8.3)  n₀ = 8γ*_{θ0}²

makes γ_{θ0}² = γ*_{θ0}²/n ≤ 1/8 for n ≥ n₀, and n₀ can therefore be taken as a rough dividing point. For the Cauchy translation problem n₀ = 20; for the shape parameter problem, Example 6, and the coefficient of variation problem, Example 5, n₀ is small. For the normal mixture problem, however, n₀ is so large that we cannot expect linear methods to work well in the last example, even for large samples.
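The arithmetic of (8.3) is trivial but worth making concrete; the helper below (a hypothetical name of my own) computes the breakpoint n₀ from a single-observation curvature:

```python
import math

def n0(gamma_star_sq):
    """Sample size at which gamma^2 = gamma*^2 / n falls to the rough
    threshold 1/8, per (8.3): n0 = 8 * gamma*^2.  (Sketch of the rule.)"""
    return math.ceil(8 * gamma_star_sq)

print(n0(2.5))   # Cauchy translation family (gamma*^2 = 2.5): n0 = 20
```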
FIG. 2. Bivariate normal, Example 1, testing θ = 0 versus θ > 0. The rejection region for the locally most powerful level α test, LMP_α, is compared with that for the most powerful level α test of 0 versus θ₁, MP_α(θ₁).
FIG. 3. Power curves compared with the power envelope (vertical axis: power, 0 to 1.00; horizontal axis: 0 to 3.5).
Define L_θ to be the set of those values of the sufficient statistic x̄ for which θ is a solution to the likelihood equations l̇_θ(x̄) = 0. Then, since l̇_θ = nη̇_θ′(x̄ − λ_θ),

(9.4)  L_θ = {x̄ : η̇_θ′(x̄ − λ_θ) = 0} ,

the (r − 1)-dimensional hyperplane through λ_θ, orthogonal to η̇_θ.
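The geometric content of (9.4) — maximizing the likelihood leaves the residual x̄ − λ_θ̂ orthogonal to the tangent η̇_θ̂ — can be illustrated with Example 1. This sketch maximizes the likelihood by grid search (names, constants, and the grid are mine) and then checks orthogonality:

```python
g0 = 0.5                          # curvature constant of Example 1
eta = lambda t: (t, g0 * t**2 / 2)

def loglik(t, x):
    # l(t) = eta_t' x - |eta_t|^2 / 2 for a single x ~ N2(eta_t, I)
    e = eta(t)
    return e[0] * x[0] + e[1] * x[1] - (e[0]**2 + e[1]**2) / 2

def mle(x):
    # crude grid search for the maximizing theta (illustrative only)
    return max((i / 1000.0 for i in range(-5000, 5001)),
               key=lambda t: loglik(t, x))

x = (0.3, -0.2)
t = mle(x)
e = eta(t)
# eta-dot_t' (x - lambda_t): should vanish at the MLE, per (9.4)
resid_dot_tangent = (x[0] - e[0]) * 1.0 + (x[1] - e[1]) * (g0 * t)
print(t, resid_dot_tangent)
```

The residual inner product is zero up to the grid resolution, i.e. x̄ lies on the hyperplane L_θ̂.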
Figure 4 illustrates the situation for the case r = 2. (Notice that the sample space, the space of possible x̄ values, has been superimposed on A, the space of possible mean vectors λ.) Actually this two-dimensional picture is appropriate for any dimension, since curvature is locally a two-dimensional property, as pointed out at the end of Section 2. A heuristic proof of (9.1) based on this picture now follows in five easy steps:
Since (9.10) would make 2^{1/2} a rational number, x(1) must equal x(2). In short, there is at most one possible x̄ value corresponding to any θ̂, and so the MLE is a sufficient statistic in ℱ, implying i_θ − î_θ = 0 for all n. But γ_θ² must be positive for all θ values since η̇_θ is always changing direction. This completes the counterexample.

REMARK 2. Instead of working with the MLE θ̂ itself we can consider the coarser statistic which only records which interval θ̂ lies in, among intervals of the form (iε_n, (i + 1)ε_n), i = 0, ±1, ±2, ···. The line L_θ̂ in Figure 4 is now replaced by a pair of lines L_{iε_n}, L_{(i+1)ε_n}, and step (v) can be weakened to say only that the conditional distribution of x̄₂, given that x̄ is between the two lines, has variance 1/n + o(1/n). However, in order for statement (iv) to still have meaning we need to take ε_n = o(1/n) (so that the conditional variance of θ̂ will still be due mainly to the slope of the lines L_{iε_n}, L_{(i+1)ε_n}, and not to the distance between them). It turns out (Efron and Truax (1968)) to be possible to choose ε_n in this way and to get the proper convergence of the conditional variance if f_θ is nonlattice, |φ(t)| < 1 for all t ≠ 0. (This excludes the multinomial.) In this case it is possible to show that lim sup_{n→∞} (i_θ − î_θ) ≤ i_θ^(1) γ_θ².
of ℳ near λ̂. The details are almost identical to those of Section 10 and will not be given here. (See (10.25).)

REMARK 4. It is possible for two of the surfaces (9.4), say L₀ and L_θ, to intersect. If x̄ ∈ L₀ ∩ L_θ then both 0 and θ are solutions to the likelihood equation. As θ decreases to zero in Figure 4, L₀ ∩ L_θ converges to a point (in general an (r − 2)-dimensional flat) on L₀ = {ce₂}, a distance ρ₀ ≡ 1/γ₀ above 0. Values of x̄ on L₀ which lie above this point are local maxima of the likelihood function, while those lying below are local minima.

REMARK 5. Rao (1961, 1962, 1963) uses a different definition of the information i_{θT} in a statistic T which avoids the difficulty illustrated by the counterexample. (9.3) can be written as i_θ − i_{θT} = inf_h E_θ{l̇_θ(x) − h(T(x))}², the infimum being over all choices of the function h(·). Rao redefines i_{θT} by restricting the function h to be quadratic. Rao states that he believes the two definitions to be equivalent, but the counterexample can be used to show that they are not.
10. Estimation with squared error loss. Suppose we wish to estimate the parameter θ in a curved exponential family on the basis of an i.i.d. sample x₁, x₂, ···, x_n, using a squared error loss function to evaluate possible estimators. We will only consider estimators that are smooth functions of the sufficient statistic x̄ and are consistent and efficient in the usual sense (see (10.5)-(10.7) below). The following result will be discussed: let θ̂(x̄) be such an estimator, the form of θ̂ not depending on n, where U_{θ0}(x̄) ≡ l̇_{θ0}/i_{θ0} + θ₀ is the best locally unbiased estimator of θ near θ₀. Let b_θ ≡ E_θ θ̂ − θ be the bias of θ̂, a quantity which will turn out to be O(1/n).
FIG. 6. Variance of the MLE minus the Cramér-Rao lower bound, for estimating the Cauchy translation parameter (n from 8 to 20). Theoretical value from (10.1) compared with the Monte Carlo results of Barnett (± s.d.) and Andrews (± s.d.).
Monte Carlo studies of Barnett (1966) and also of Andrews et al. (1972) are shown in Figure 6. The theoretical values are obviously too small for n ≤ 11, but seem to be more accurate than the Monte Carlo results for n ≥ 13. For n = 40, Andrews et al. estimate Var θ̂ − 1/(n i_{θ0}) = .0025 ± .0017, while (10.1) gives .0031.
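The comparison in Figure 6 can be imitated crudely by simulation. The sketch below is my own (the two-stage grid-search MLE, the seed, and the sample sizes are ad hoc choices); it estimates the variance of the Cauchy translation MLE at n = 20 and compares it with the Cramér-Rao bound 2/n:

```python
import math, random

def cauchy_mle(xs):
    # two-stage grid search around the median; adequate at this sample size
    med = sorted(xs)[len(xs) // 2]
    def ll(t):
        return -sum(math.log(1 + (x - t)**2) for x in xs)
    coarse = max((med - 3 + 0.1 * i for i in range(61)), key=ll)
    return max((coarse - 0.1 + 0.005 * i for i in range(41)), key=ll)

rng = random.Random(3)
n, reps = 20, 1500
est = []
for _ in range(reps):
    xs = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
    est.append(cauchy_mle(xs))
m = sum(est) / reps
v = sum((e - m)**2 for e in est) / reps
print(v, 2.0 / n, v - 2.0 / n)   # sample variance of the MLE vs the bound 2/n
```

With this seed the estimated variance lies a little above 2/n = 0.1, of the same order as the excess shown in Figure 6 for n = 20, though the Monte Carlo error here is substantial.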
6) For estimating a translation parameter Pitman's estimator is known to have smaller variance than the MLE. However, (10.1) suggests that this effect must be of magnitude at most O(1/n³).
7) Nothing in (10.1), except the application to general curved exponential families, is new. Rao (1963) states the result for curved multinomial families, and notes that for the MLE it was previously derived by Haldane and Smith (1956). The identification of the bracketed terms with curvatures is new, as is the line of proof which leads to a rigorous verification.

8) The similarity of (9.1) and (10.1) can be viewed as a vindication of the belief that Fisher information is an accurate measure of the information contained in a given statistic. This conclusion is premature; the squared error estimation problem is very closely related to the information calculation, a fact which would be more obvious if we had presented a geometric argument below, as in Section 9, instead of using analytic methods. It is more reasonable to say
The proof of the lemma is based on two simple facts: in order for an estimator θ̂(x̄) to be consistent it must have "Fisher consistency",

(10.6)  θ̂(λ_θ) = θ ,

since x̄ → λ_θ under repeated independent sampling from f_θ;
λ_θ = i₀^{1/2} θ e₁ + O(θ²) ,

(10.11) expands E_θ θ̂ accordingly, implying

(10.13)  a = c₀ i₀^{1/2} e₁ ,

and, continuing the expansion, implying

(10.17)  a₀ = 0 ,  c₀ = 1/i₀ ,  c₀ν₁₁ + ρ₁₁ + i₀λ₁₁ = 0 ,

with (10.18) giving the corresponding expression for the matrix A (involving A₁₁ and A₂₁).
(10.19)  E_{θ0} θ̂² = 1/(n i₀) + E_{θ0}(θ̂ − U_{θ0})² + ··· ,

(10.24)  Δ = (··· − tr A₂₁)/2 ,

and so equals 0 for the MLE.
Several more remarks can now be made about (10.1).

9) The bias of the MLE up to O(1/n) is, by (10.21), equal to −ρ₁₁/(2i₀n). If θ̂ is unbiased to O(1/n), as it is for example in any translation parameter estimation problem involving a symmetric density, then we must have ρ₁₁ = 0. By
Acknowledgment. Much of this work was done while I was visiting Imperial
College, London, Department of Mathematics. I appreciate the assistance of
Margaret Ansell in carrying out the more difficult numerical computations. The
Associate Editor provided extensive help, especially with the Appendix.
APPENDIX
PROOF. Let F_n(y) ≡ P{x̄_n > y}, so F_n(y) ≤ [φ(s)e^{−sy}]^n for |s| sufficiently small. We have
(A7)  g_n(z) ≤ O(1) · e^{−(z/8) min{2c_n, z}} + o_n(1) ,  c_n = o(n^{1/2}) , c_n → ∞ .

Here g_{n/2}(z), the density of (n/2)^{1/2} x̄_{n/2}, is known to exist and to converge uniformly to (2π)^{−1/2} exp(−z²/2); see page 244 of Gnedenko and Kolmogorov (1954). Thus M_{n/2} ≡ sup_z g_{n/2}(z) = (2π)^{−1/2} + o_n(1), so for 0 < z ≤ c_n

h(z) ≤ M_{n/2} {∫_{−∞}^{z/2} g_{n/2}(z − w) dw + ∫_{z/2}^{∞} g_{n/2}(w) dw}
     ≤ 2M_{n/2} e^{−(z/8) min{2c_n, z} + o_n(1)} ,

where we have used the bound P{n^{1/2} x̄_n > z} ≤ exp[−(z/2) min{c_n, z} + o_n(1)], obtained by setting y = z/n^{1/2} and s = min{z/n^{1/2}, c_n/n^{1/2}} in (A2). But g_n(z) = 2^{1/2} h(2^{1/2} z), giving (A7). The same proof with trivial modifications works for n odd. For
REFERENCES
[1] ANDREWS, D. F., BICKEL, P. J., HAMPEL, F. R., HUBER, P. J., ROGERS, W. H. and TUKEY, J. W. (1972). Robust Estimates of Location: Survey and Advances. Princeton Univ. Press.
[2] BARNETT, V. D. (1966). Evaluation of the maximum likelihood estimator when the like-
lihood equation has multiple roots. Biometrika 53 151-165.
[3] CHERNOFF, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based
on the sum of observations. Ann. Math. Statist. 23 493-507.
[4] DAVIES, R. B. (1969). Beta optimal tests and an application to the summary evaluation of
experiments. J. Roy. Statist. Soc. Ser. B 31 524-538.
[5] DAVIES, R. B. (1971). Rank tests for Lehmann's alternative. J. Amer. Statist. Assoc. 66
879-883.
[6] EFRON, B. and TRUAX, D. (1968). Large deviations theory in exponential families. Ann.
Math. Statist. 39 1402-1424.
[7] FISHER, R. A. (1925). Theory of statistical estimation. Proc. Cambridge Philos. Soc. 22 700-725.
[8] GNEDENKO, B. V. and KOLMOGOROV, A. N. (1954). (Translated by K. L. Chung.) Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Cambridge, Mass.
[9] HALDANE, J. B. S. and SMITH, S. M. (1956). The sampling distribution of a maximum
likelihood estimate. Biometrika 43 96-103.
[10] RAO, C. R. (1961). Asymptotic efficiency and limiting information. (J. Neyman, Ed.).
Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1 531-545. Univ. of California Press.
[11] RAO, C. R. (1962). Efficient estimates and optimum inference procedures in large samples.
J. Roy. Statist. Soc. Ser. B. 24 46-72.
DEPARTMENT OF STATISTICS
STANFORD UNIVERSITY
STANFORD, CALIFORNIA 94305
C. R. RAO
I am delighted to see the paper by Bradley Efron and also the paper by J. K. Ghosh and K. Subrahmaniam (Sankhya A, 1975 36 325-358) on the subject of second order efficiency. Having worked for some time on second order efficiency of estimators, I was aware of the importance of measures of how closely a given model can be approximated by an exponential family {f : f = C(θ) exp[K(θ)T(x)]}. Measures of this sort are of course closely related to what Professor Efron calls the curvature of a statistical problem. What is quite new about Professor Efron's measure is its invariance under smooth one-to-one transformations and the elegant geometric interpretation, which makes the term so apt and illuminating and provides new tools and insights into the subject.
My endeavour in this area was motivated by two results in the literature on estimation which seemed to contradict Fisher's claims about MLE's (maximum likelihood estimators). One is the concept of super efficiency, according to which the MLE is not efficient in the sense defined by Fisher. Another is the concept of BANE (best asymptotically normal estimator), according to which ML is only one out of a very wide class of estimation procedures.

The first task was to redefine the concept of efficiency of an estimator, since its asymptotic variance is a poor indicator of its performance in statistical inference. To do this it is necessary to see how well an optimum inference procedure based on a given estimator T_n alone compares with that based on all the observations. Following Fisher's ideas, I thought it relevant, at least in large samples, to consider the score function l̇_θ (see Efron's paper for notations) as basic to all inference problems. Then the problem reduces to examining how closely l̇_θ and T_n are related. Under the additional condition that T_n is consistent for θ, T_n was defined to be first order efficient if
(i) The results due to Fisher and me were confined to multinomial distributions. Efron, and also Ghosh and Subrahmaniam, extend the results to a wider class of distributions.

(ii) Efron relates second order efficiency to what he calls the curvature of a statistical problem, which appears to be natural and throws further light on problems of inference (providing, for instance, an intimate connection between curvature and properties of test criteria).

(iii) Efron provides a decomposition of E₂(θ) in (5), which is extremely interesting.

(iv) Efron suggests the use of a most powerful test at a suitably chosen alternative in preference to a locally most powerful test, which seems to be an attractive idea worth pursuing.
DON A. PIERCE
I think that I am not alone in having had great difficulty with the reasoning of Fisher's 1925 paper. Professor Efron's elegant contribution to clarifying these ideas is very helpful.
The part of Fisher's paper which has intrigued and puzzled me most is the final section, in which he suggests the use of −l̈_θ̂(x), in Efron's notation, as an ancillary statistic. I would like to indicate here how the geometry of this paper helps clarify this, although there are many details yet unclear to me.

It is characteristic of "curvature" that −l̈_θ(x) ≢ i_θ. In fact, one can always parameterize so that Cov_{θ0}(l̇_{θ0}, l̈_{θ0}) = 0, and then γ_{θ0}² = Var_{θ0}(l̈_{θ0})/i_{θ0}². Fisher seems to suggest using −l̈_θ̂(x), rather than i_θ̂, as a post-data measure of the precision of θ̂. This is also suggested by standard asymptotic Bayesian arguments, but the sampling theory justification has never been clear to me. Such use of l̈ would be significant relative to the order of n^{−2} of approximation to Var(θ̂) considered in this paper, for −l̈_θ̂ = i_θ̂ + O_p(n^{1/2}) and thus −1/l̈_θ̂ = 1/i_θ̂ + O_p(n^{−3/2}).
The geometrical structure exposed in this paper is indeed very helpful in understanding the role of l̈ as an ancillary statistic. For a curved exponential family of dimension k, think of the projection from the sample point x̄ ∈ E^k to the MLE θ̂, where λ_θ = E_θ(x̄), as an orthogonal projection (relative to Σ^{−1}) first to a point λ̂ in the local osculating plane of the curve λ_θ, and then a projection from λ̂ to λ_θ̂. The argument below suggests that (−l̈_θ̂(x) − i_θ̂)/i_θ̂ is a useful measure of the signed distance from λ̂ to the curve λ_θ, positive when λ̂ is on the outside of the curve. This is useful ancillary information because the projection from λ̂ to λ_θ̂ is a contraction (resp. expansion) mapping when λ̂ is on the outside (resp. inside) of the curve λ_θ. The extent of this contraction is a function of the distance from λ̂ to the curve λ_θ, as measured by the above statistic. Thus the conditional precision of θ̂, given l̈_θ̂(x), is either greater or less than the unconditional precision. Furthermore, it appears plausible that the component of λ̂ orthogonal to the curve λ_θ at λ_θ̂ is itself uninformative regarding the value of θ.
More precisely, consider the situation of Figure 4, with the additional assumption that θ is a choice of parameter such that Cov_θ(l̇_θ, l̈_θ) = 0. The point (x̄₁, x̄₂) corresponds to the λ̂ of the above discussion. It follows directly from (6.3) and the relations given in the second paragraph after (9.2) that

x̄₁ = l̇₀/(n i₀^{1/2}) ,  x̄₂ = −(−l̈₀ − n i₀)/(γ₀ n i₀) .

Near the origin the curve λ_θ is approximately a segment of a circle with center at e₂/γ₀, and the arc distance of λ_θ from the origin is to first order θ i₀^{1/2}. Proportionality of arc lengths to radii gives

θ̂ i₀^{1/2} = x̄₁ (1 − γ₀ x̄₂)^{−1} ,

so

(1)  θ̂ = (x̄₁/i₀^{1/2})(1 − γ₀x̄₂)^{−1} = (x̄₁/i₀^{1/2})[1 + (−l̈₀ − n i₀)/(n i₀)]^{−1} = (x̄₁/i₀^{1/2})[n i₀/(−l̈₀)] .

Equation (1) can be seen to agree with the rigorously established (10.5) of the paper, where ρ₁₁ = 0 since ν₁₁ = 0.
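Pierce's approximation (1) can be checked directly on Example 1 with i₀ = 1: solve the score equation exactly and compare with x̄₁(1 − γ₀x̄₂)^{−1}. The sketch below is mine, with hypothetical names, and relies on the score being monotone decreasing on the bracket used:

```python
g0 = 0.5   # gamma_0; i_0 = 1 in this standardized version of Example 1

def score(t, x1, x2):
    # l-dot(t) = eta-dot_t' (x - eta_t) with eta_t = (t, g0 * t^2 / 2)
    return x1 - t + g0 * t * (x2 - g0 * t**2 / 2)

def solve_mle(x1, x2):
    lo, hi = -4.0, 4.0        # score is decreasing here, so bisection works
    for _ in range(60):
        mid = (lo + hi) / 2
        if score(mid, x1, x2) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Pierce's (1): theta-hat ~ x1 * (1 - g0*x2)^(-1) when i_0 = 1
for x1, x2 in [(0.2, 0.1), (0.1, -0.3), (0.05, 0.2)]:
    print(solve_mle(x1, x2), x1 / (1 - g0 * x2))
```

For small (x̄₁, x̄₂) the exact root and the approximation agree closely; the discrepancy is the cubic term of the score, of smaller order.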
Thus we have
D. R. Cox
REFERENCES
FISHER, R. A. (1934). Two new properties of mathematical likelihood. Proc. Roy. Soc. Ser. A.
144 285-307.
D. V. LINDLEY
These criticisms have less force before the data, x, are to hand. If it is a question of experimental design, or choice of a survey sample, then naturally one has to consider what data might be obtained, and integration becomes natural and necessary. Hence curvature could have a place in these fields, and it would be interesting to see whether, in some sense, linear designs were better than "curved" ones. However, the argument of my first paragraph would show that if a terminal (as distinct from design) decision problem is contemplated after the experimentation, then the choice of design would again involve a loss function, so that no general measure seems possible. Some experiments are not associated with terminal decisions and are genuinely inferential in character. In these one is collecting information about parameters, and Shannon's measure is essentially the only one to use. I have tried to see whether some second-order expansion of it might lead to anything analogous to Efron's curvature, but without success.
REFERENCES
LINDLEY, D. V. (1961). The use of prior probability distributions in statistical inference and
decisions. Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1 453-468.
LUCIEN LE CAM
REFERENCES
J. K. GHOSH
         θ = 5    θ = 13
  δ₀     .8       .06
  δ₁     .2       .95

If α = .2, δ₀ and δ₁ ··· which is the alternative ··· any conclusion.
It is not difficult to come up with analogues of curvature when one has more than one parameter. Extension of the results due to Rao and Fisher to multiparameter families is provided in Ghosh and Subramanyam (1974). But it is now necessary to study testing problems for composite hypotheses along the lines of investigation carried out by Efron and Pfanzagl for simple hypotheses.
How relevant is curvature for a Bayesian? Ghosh and Subramanyam (1974)
have shown how one can construct a Bayesian proof of the second order efficiency
of the MLE. What is lacking and would be useful to have is a study of relevance
of curvature in Bayesian analysis. The difficulty here is that one cannot think
of any simple and convincing reason why a Bayesian would prefer the linear
exponential families to nonexponential ones. All is grist that comes to the mill
of the lucky man who not only has a prior but knows what it is.
It is a little disappointing, though not really surprising in retrospect, that curvature has nothing to do with the geometrical curvature of the likelihood curves. Curvature is, however, useful in the problems that Sprott (1973) discusses. For it is easy to show that his two approaches of minimizing F_E(θ) or F(θ) (in his notation) coincide iff one has a linear exponential family. (This statement is true provided the MLE satisfies the likelihood equation with probability one for all θ.) For example (2.3) of Sprott (1973), the curvature is fairly large for x near .5, and so Sprott's transformation which minimizes F_E(θ) may not be efficient in normalising the likelihood for x near .5. Incidentally, I suspect that for small curvature one can reparametrize in such a way that the approach of the posterior to normality, guaranteed by the Bernstein-von Mises theorem, would
be faster with the new parameter than with the original. (This may be an answer
to the question of relevance of curvature for a Bayesian.)
It may be worth pointing out here that the results of Pfanzagl (1973) and those
of Fisher and Rao (i.e. results like (10.1) of Efron) are not really comparable.
In fact for all the efficient estimators considered by Efron or Ghosh and
Subramanyam (1974), inequality (6.4) of Pfanzagl (1973, page 1005) reduces to
an equality. This result, which is not very hard to show, will appear in Ghosh
and Srinivasan (1975).
Finally, a question suggested by the beautiful counter example of Professor
Efron. Is there any example such that among the Fisher consistent efficient
estimators the MLE does not minimize the loss in Fisher's information for all
values of 0? It seems reasonable to expect that such examples do exist.
REFERENCES
[1] GHOSH, J. K. and SUBRAMANYAM, K. (1974). Second order efficiency of maximum likelihood estimators. Sankhyā Ser. A. (To appear.)
[2] GHOSH, J. K. and SRINIVASAN, C. (1975). Asymptotic sufficiency and second order efficiency. Unpublished.
[3] PFANZAGL, J. (1973). Asymptotic expansions related to minimum contrast estimators. Ann. Statist. 1 993-1026.
[4] PFANZAGL, J. (1974). Nonexistence of tests with deficiency zero. Preprint in Statistics No. 8, Univ. of Cologne.
[5] SPROTT, D. A. (1973). Normal likelihoods and their relation to large sample theory of estimation. Biometrika 60 457-465.
J. PFANZAGL
University of Cologne
power can be neglected if the sample size exceeds 8γ_{θ_0}² (see (8.3)).
Since this rule is rather arbitrary, the reader should be aware of other results which make the role of γ_{θ_0} more clear. These results concern the case of n i.i.d. variables, the distribution of which is nonatomic and sufficiently regular (as a function of θ). To define for a given level-α test the "deficiency at rejection level β" we determine first the alternative closest to the hypothesis which can be rejected with probability β by some level-α test. (The test for which this is achieved is called β-optimal.) In order to reach rejection probability β for this alternative with the given test, the sample size has to be increased. The additional number of observations needed for this purpose is the "deficiency at rejection level β."
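This notion of deficiency can be illustrated with a toy computation of my own (not Pfanzagl's LMP setting): for a one-sided normal location problem, compare the β-optimal mean-based test with an asymptotically less efficient median-based test (asymptotic relative efficiency 2/π), and count the extra observations the weaker test needs to reach rejection probability β at the same alternative.

```python
import math
from statistics import NormalDist

# Hypothetical illustration of "deficiency at rejection level beta":
# the extra observations a given test needs, relative to the beta-optimal
# test, to reject a fixed alternative with probability beta.
N = NormalDist()
alpha, beta, delta = 0.05, 0.9, 0.5      # level, target power, alternative (sd units)
z = N.inv_cdf(1 - alpha)                 # critical value of the one-sided z-test

def power(n, eff=1.0):
    # Approximate power at the alternative; eff is the test's
    # asymptotic relative efficiency (1 for the mean-based test).
    return 1 - N.cdf(z - delta * (eff * n) ** 0.5)

def n_needed(eff):
    n = 1
    while power(n, eff) < beta:
        n += 1
    return n

n_opt = n_needed(1.0)           # beta-optimal (mean-based) test
n_med = n_needed(2 / math.pi)   # median-based test, ARE = 2/pi
print(n_opt, n_med, n_med - n_opt)  # -> 35 54 19
```

The difference of 19 observations plays the role of the deficiency in this stylized comparison.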
For the LMP α-test the deficiency at rejection level β is asymptotically equal to
NIELS KEIDING
University of Copenhagen
2. The relation (10.1) for the asymptotic variance of any consistent and efficient estimator θ̂ contains the term Δ_θ, being always nonnegative and zero for the MLE. This quantity was computed by Rao (1963) for several estimation methods in the multinomial distribution, as noted by Efron. It would be interesting if some geometrical interpretation, or at least a bit more transparent expression than (10.24), could be given for this quantity, which must be related to the intuitive discussion by Fisher (1958, Section 57) of "the contribution to χ² of errors of estimation".
[display: formula for γ_λ², the statistical curvature of the birth process; garbled in the scan]
In the spirit of the paper, we quote some values of γ_λ² (x_0 = 1) in Table 1.
Two asymptotic schemes are inviting: large initial population size (x_0 → ∞) for fixed t, and large observation period (t → ∞) for fixed x_0. Being a branching process, a birth process with X_0 = x_0 may be interpreted as a sum of x_0 birth processes with X_0 = 1 and the same λ. Therefore the first scheme is still within the realm of independent identical replications, and may be treated with the methods of Efron's paper. This was done by Beyer, Keiding and Simonsen (1975) for this case as well as for the life-testing situation outlined above.
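The decomposition into independent replications is easy to probe by simulation. The sketch below is my own (not from Beyer, Keiding and Simonsen): it runs one pure birth process with a large x_0 on [0, t] and recovers λ from the minimal sufficient statistic (B_t, S_t) via the MLE λ̂ = B_t/S_t.

```python
import random

def simulate_birth(lam, t, x0, rng):
    """Pure birth process with rate lam and X_0 = x0, observed on [0, t].
    Returns (B_t, S_t): number of births and the integral of X_s ds."""
    k, now, births, s = x0, 0.0, 0, 0.0
    while True:
        dt = rng.expovariate(lam * k)   # holding time in state k has rate lam*k
        if now + dt >= t:
            s += k * (t - now)
            return births, s
        now += dt
        s += k * dt
        k += 1
        births += 1

rng = random.Random(42)
B, S = simulate_birth(1.0, 1.0, 2000, rng)
lam_hat = B / S        # MLE from the minimal sufficient statistic (B_t, S_t)
print(lam_hat)         # close to the true lambda = 1 for large x_0
```

With x_0 = 2000 initial ancestors the estimate is within a few percent of the true rate, as the i.i.d.-replication asymptotics predict.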
The second scheme, however, is a "real" stochastic process situation, and we
encounter here the trouble that the minimal sufficient statistic is not consistent,
TABLE 1
Statistical curvature for the birth process with x_0 = 1
λt:   0   0.1   0.5   1   2   5   ∞
[the row of γ_λ² values is garbled in the scan]
FIG. 1. The canonical sample space of the birth process estimation problem, with axes B_t and S_t and the points E(B_t), E(S_t) marked. The curve is the statistical model corresponding to 0 < λ < ∞ (mean value parametrisation). The full-drawn line is the set of points for which B_t = λS_t, where λ is the "true" value, and the broken line is the set where B_t = λ̂S_t.
in fact, as t → ∞,
e^{−λt}(B_t, S_t) → (1, λ^{−1})W
almost surely, where the random variable W is gamma distributed with form parameter x_0 and expectation x_0. Nevertheless λ̂ → λ a.s., as illustrated in Figure 1. Here λ^{−1} is the slope of the full-drawn line, λ̂^{−1} is the slope of the broken line (connecting the observed (B_t, S_t) and the origin). Normalising with e^{−λt}, the minimal sufficient statistic will converge towards some (1, λ^{−1})W (shown by arrows), but the empirical line will always converge towards the correct line.
In the standard situation the asymptotic normality of θ̂ is based upon the asymptotic normality of the minimal sufficient statistic combined with pure differential geometry, as noted by Efron in Section 9. It is therefore no surprise that asymptotic normality breaks down here. Notice also that γ_λ² → 1 (not 0) as t → ∞. However, for given "nuisance statistic" W, the minimal sufficient statistic is asymptotically normal with asymptotic variance proportional to W^{−1}, and hence also λ̂ is asymptotically normal. (Marginally, the distribution of e^{λt/2}(λ̂ − λ) converges towards a Student distribution with 2x_0 d.f., which may be interpreted as the mixture of the normal distributions over the gamma distributed inverse variances.)
It is thus tempting to investigate the problem obtained by conditioning on W = w, replacing the "nuisance statistic" W by a nuisance parameter w; see Keiding (1974). The resulting "conditional" maximum likelihood estimator λ̂* has the same first-order efficiency properties as λ̂. A comparison of second-order efficiencies is not yet completed.
4. A more general aspect of the last example is: can curved exponential families be "avoided"? In the birth process situation a stopping rule like "sample until X_t = n" will make the minimal sufficient statistic one-dimensional, in fact equal to S_{τ_n}, where τ_n = inf {t : X_t = n}. Also it should be mentioned that conditioning on statistics which are in some sense ancillary (see Barndorff-Nielsen (1973) for a survey of ancillarity) may completely change the curvature properties of the problem.
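A quick way to see why the stopping rule removes the curvature (my own sketch, not part of the discussion): starting from X_0 = 1, the holding time in state k is exponential with rate λk, so k times that holding time is Exp(λ), and S_{τ_n} is a sum of n − 1 i.i.d. Exp(λ) variables, i.e. Gamma(n − 1, λ), a one-parameter linear exponential family.

```python
import random

def s_tau(lam, n, rng):
    # Inverse sampling: run the birth process from X_0 = 1 until X_t = n,
    # accumulating S = integral of X_s ds. In state k the holding time is
    # Exp(lam * k), so k times it is Exp(lam); summing gives Gamma(n-1, lam).
    return sum(k * rng.expovariate(lam * k) for k in range(1, n))

rng = random.Random(1)
lam, n, reps = 2.0, 50, 2000
xs = [s_tau(lam, n, rng) for _ in range(reps)]
mean = sum(xs) / reps
var = sum((x - mean) ** 2 for x in xs) / reps
# Gamma(n-1, rate lam): mean (n-1)/lam = 24.5, variance (n-1)/lam^2 = 12.25
print(mean, var)
```

The simulated mean and variance match the Gamma(n − 1, λ) values, confirming that the stopped sufficient statistic lives in a zero-curvature family.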
A. P. DAWID
may be seen from Efron's Figure 1: α is the angle between (i) the tangent at θ_0 and (ii) the tangent at θ_1, displaced parallel to itself along the curve to θ_0. This depends on our connexion.
Let us try to frame Statistics within Differential Geometry as follows (ignoring obvious technical difficulties): let 𝒫 be the family of all distributions over the sample space 𝒳
Let 𝒯_P be the vector space of random variables T(x) having E_P[T(X)] = 0. For given P, there is a natural isomorphism between ℳ and 𝒯_P: dm = T(x) dP. Then dP_θ/dθ maps into l̇_θ(x), which may again be identified with the tangent at P_θ.
Now let P_{θ_0}, P_{θ_1} ∈ 𝒞, with tangent spaces 𝒯_0, 𝒯_1, and let T_0 ∈ 𝒯_0, T_1 ∈ 𝒯_1. To be able to talk about the angle between T_0 and T_1 we must put them into the same space. We may do this by a parallel displacement of T_0 along 𝒞 to θ_1, where it becomes T_0' ∈ 𝒯_1.
The parallel displacement used implicitly by Efron, what I propose to call the "Efron connexion", has
(1) T_0 → T_0' = T_0 − E_{θ_1}(T_0).
This happens to be independent of the curve 𝒞, which is not always so. Noting (d/dθ)E_θ[T] = E_θ[T l̇_θ] for fixed T, we can generate (1) by the infinitesimal displacement rule (having θ_1 = θ_0 + dθ):
(3) T_0 → T_0' = T_0 − E_{θ_0}(T_0 l̇_{θ_0}) dθ,
while the "mixture connexion" instead has
(4) T_0 → T_0' = (dP_{θ_0}/dP_{θ_1}) T_0.
(6) T_0 → T_0' = T_0 − ½[T_0 l̇_{θ_0} + E_{θ_0}(T_0 l̇_{θ_0})] dθ
yields a connexion, the "information connexion", that is compatible with the information metric. Curvature for this connexion (which is the geodesic curvature associated with the information metric) uses the covariance matrix of l̇_θ and l̈_θ + ½ l̇_θ².
We can calculate the torsion and curvature tensors (Hicks, page 59) for the above connexions. We find that all have zero torsion (equivalently: are symmetric, or affine). There is a unique symmetric connexion compatible with a given metric; hence (6) supplies it for the information metric.
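The three displacement rules fit together neatly. The following is my own sketch, not part of the discussion: to first order in dθ the mixture rule multiplies by the likelihood ratio, the Efron rule recenters, the information rule is their average, and that average preserves the information inner product ⟨S, T⟩_θ = E_θ(ST) for zero-mean S, T.

```latex
% Mixture rule to first order, using dP_{\theta_0}/dP_{\theta_1} = 1 - \dot l\,d\theta + O(d\theta^2):
T_0 \mapsto T_0 - T_0\,\dot l\,d\theta \quad\text{(mixture)}, \qquad
T_0 \mapsto T_0 - E_{\theta_0}\!\bigl(T_0\,\dot l\bigr)\,d\theta \quad\text{(Efron)}.

% Their average is the information rule:
T_0' = T_0 - \tfrac{1}{2}\bigl[T_0\,\dot l + E_{\theta_0}(T_0\,\dot l)\bigr]\,d\theta .

% Metric compatibility: with E_{\theta_1}(U) = E_{\theta_0}(U) + E_{\theta_0}(U\,\dot l)\,d\theta
% and E_{\theta_0}(S) = E_{\theta_0}(T) = 0,
E_{\theta_1}(S'T')
  = E_{\theta_0}(ST) - E_{\theta_0}(ST\,\dot l)\,d\theta
    + E_{\theta_0}(ST\,\dot l)\,d\theta + O(d\theta^2)
  = E_{\theta_0}(ST) + O(d\theta^2).
```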
We find zero curvature for the Efron and mixture connexions, while the curvature tensor R associated with the information connexion has
(9) [equation garbled in the scan].
Solutions of (9) are closed curves, parametrized by an angle θ, having an angle-valued sufficient statistic t, with density of the form
(10) f(t | θ) = 1 + cos (t − θ)
with respect to a probability measure P over the unit circle for which ∫_0^{2π} e^{it} dP(t) = 0. Such curves have i_θ ≡ 1 and total length 2π. Thus 𝒫 is rather like the surface of a sphere of radius 2, opposite points being identified.
REFERENCES
JIM REEDS
Harvard University
"Var (T_n) ≥ a/n + b/n²"
THEOREM. Let T depend only on x̄_n, the sufficient statistic for a curved q-parameter exponential family. Suppose T is smooth in some neighborhood of E(x̄_n), and suppose T grows (as a function of x̄_n) no faster than exponentially.
If T is a consistent and first order efficient estimate of θ, the variance of T possesses an asymptotic expansion
Var (T(x̄_n)) = CRLB/n + (A + B + 2C)/n² + O(n⁻³).
C depends only on the function T, and vanishes for the particular choice T = the maximum likelihood estimate.
B_{ij} = Σ_{k,l,m,n} [expression garbled in the scan].
If, at θ_0, the Fisher matrices of both θ and η are equal to identity matrices, this simplifies to
B_{ij} = [expression garbled in the scan]
and
[display garbled in the scan]
we may form the linear regression of l̈ on l̇ as follows:
l̈_{jk} = [expression garbled in the scan],
and we may calculate the regression-residual variance:
[expression garbled in the scan].
REPLY TO DISCUSSION
[risk expansion garbled in the scan]
which equals to order 1/n² the squared error risk of the bias-corrected MLE at θ = θ_0. (This result follows, with some effort, from (10.19).)
Professor Le Cam's warning about over-reliance on local methods is well
taken. As a matter of fact, my paper is most concerned with curvature as a
check on the appropriateness of first order local properties such as Fisher's information and the Cramér-Rao lower bound. In the situation of Figure 6, curvature can be used quantitatively to improve the first order approximation. I
hope, but of course am not certain, that other situations will be similarly obliging.
Le Cam's criticism of the MLE as a point estimator should not be confused
di_θ/dθ = 2ν_{11} + ν_{30}
as given in (10.11). Therefore the naming curvature Γ_θ² will not be zero for the arc-length parameter unless ν_{30} = 0. (That is, unless Fisher's score function has third moment zero.)
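For completeness, the derivative of the information can be obtained by differentiating under the integral sign; here ν_{11} = E_θ(l̇_θ l̈_θ) and ν_{30} = E_θ(l̇_θ³). This is a standard computation, sketched on my own account:

```latex
% Differentiate i_\theta = E_\theta[\dot l_\theta^{\,2}] under the integral sign,
% using df_\theta/d\theta = \dot l_\theta f_\theta:
\frac{d}{d\theta}\, i_\theta
  = \frac{d}{d\theta}\int \dot l_\theta^{\,2}\, f_\theta\, dx
  = \int 2\,\dot l_\theta\, \ddot l_\theta\, f_\theta\, dx
    + \int \dot l_\theta^{\,2}\,\dot l_\theta\, f_\theta\, dx
  = 2\,\nu_{11} + \nu_{30}.

% For an arc-length parametrization (i_\theta \equiv 1) the left side vanishes,
% so \nu_{11} = -\tfrac12\,\nu_{30}: it can be zero only when \nu_{30} = 0.
```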
It is not clear to me whether or not one can always choose a reparameterization for ℱ which has naming curvature identically zero, even in the one-parameter case. We probably wouldn't want to estimate such a parameter anyway unless it had something more to recommend it than Γ_θ² = 0. I didn't mean to imply that naming curvature is less important than statistical curvature, only that it depends on the name.
Finally, I would like to thank the Editor for arranging this discussion which
involved a large amount of extra work on his part. I hope the Annals of Sta-
tistics will continue the entertaining and enlightening policy of providing occa-
sional discussion papers.
REFERENCES
[3] RAO, C. R. (1945a). Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37 81-91.
[4] RAO, C. R. (1945b). On the distance between two populations. Sankhyā 9 246-248.