Institute of Mathematical Statistics The Annals of Statistics

Download as pdf or txt
Download as pdf or txt
You are on page 1of 55

Defining the Curvature of a Statistical Problem (with Applications to Second Order

Efficiency)
Author(s): Bradley Efron
Source: The Annals of Statistics, Vol. 3, No. 6 (Nov., 1975), pp. 1189-1242
Published by: Institute of Mathematical Statistics
Stable URL: https://www.jstor.org/stable/2958246
Accessed: 09-05-2019 19:05 UTC

REFERENCES
Linked references are available on JSTOR for this article:
https://www.jstor.org/stable/2958246?seq=1&cid=pdf-reference#references_tab_contents
You may need to log in to JSTOR to access the linked references.

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org.

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms

Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and


extend access to The Annals of Statistics

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
The Annals of Statistics
1975, Vo1. 3, No. 6, 1189-1242

DEFINING THE CURVATURE OF A STATISTICAL PROBLEM


(WITH APPLICATIONS TO SECOND ORDER EFFICIENCY)

BY BRADLEY EFRON

Stanford University

Statisticians know that one-parameter exponential families have very


nice properties for estimation, testing, and other inference problems. Fun-
damentally this is because they can be considered to be "straight lines"
through the space of all possible probability distributions on the sample
space. We consider arbitrary one-parameter families W and try to quantify
how nearly "exponential" they are. A quantity called "the statistical cur-
vature of &'w" is introduced. Statistical curvature is identically zero for ex-
ponential families, positive for nonexponential families. Our purpose is to
show that families with small curvature enjoy the good properties of ex-
ponential families. Large curvature indicates a breakdown of these prop-
erties. Statistical curvature turns out to be closely related to Fisher and
Rao's theory of second order efficiency.

1. Introduction. Suppose we have a statistical problem involving a one-pa-


rameter family of probability density functions . = {f0(x)}. Statistician
that if _ is an exponential family then standard linear methods will usually
solve the problem in neat fashion. For example, the locally most powerful test
of 0 = 00 versus 0 > 00 is uniformly most powerful in an exponential family.
The maximum likelihood estimator for 0 is a sufficient statistic in an exponential
family, and achieves the Cramer-Rao lower bound if we have chosen the right
function of 0 to estimate.
In this paper we consider arbitrary one-parameter families -Sand try to quan-
tify how nearly "exponential" they are. A quantity To called "the statistical
curvature of .W at 0" is introduced such that r, is identically zero if J7 is ex-
ponential and greater than zero, for at least some 0 values, otherwise.
Our purpose is to show that families with small curvature enjoy, nearly, the
good statistical properties of exponential families. Large curvature indicates a

breakdown of this favorable situation. For example, if Too is large, the locall
most powerful test of 0 = 0, versus 0 > 00 can be expected to have poor oper
ing characteristics. Similarly the variance of the maximum likelihood estimator
(MLE) exceeds the Cramer-Rao lower bound in approximate proportion to To2.
(See Sections 8 and 10.)
For nonexponential families the MLE is not, in general, a sufficient statistic.
How much information does it lose, compared with all the data x? The answer

Received March 1974; revised May 1975.


AMS 1970 subject classifications. 62B10, 62F20.
Key words and phrases. Curvature, exponential families, Cramer-Rao lower bound, locally
most powerful tests, Fisher information, second order efficiency, deficiency, maximum likeliho
estimation.

1189

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1190 BRADLEY EFRON (AND DISCUSSANTS)

can be expressed in terms of r92. This theory goes back to Fisher (1925) and
Rao (1961, 1962, 1963). They attempted to show that if J is a one-parameter
subset of the k-category multinomial distributions, indexed say by the vector of
probabilities f0(x) = P0(X e category x), x = 1, 2, * *, k, the following result
holds: let io be the Fisher information in an independent sample of size n from
fo' io0 the Fisher information in the maximum likelihood estimator O(xl, x2, * * , x")
based on that sample, and i0 the Fisher information in a sample of size one (so
i, = ni0). Then

(I. 1) limn-. (i0 - i 0) - i {1102 -2121 + 140 - 1 - 11 ? 30 211u3


where

(1.2) ( X), (fo(x)

the dot indicating differentiation wi


consistent, efficient estimator T(x1, x2
lim"O* (j - i 0T) is equal or greater t
the term "second order efficiency" f
preferred place in the class of "first order efficient" estimators T, those which
satisfy the weaker condition limrn0o i,T/li = 1.
It turns out that the unpleasant looking bracketed term in (1.1) equals r.o2
This leads to a straightforward geometrical "proof" of (1.1). The quotes are
necessary here since, as the counter-example of Section 9 shows, the result is
actually not true for multinomial families. However, the difficulty arises only
because of the discrete nature of the multinomial, and can be overcome by deal-
ing with less lumpy distributions. More importantly, a similar result of Rao's
for squared error estimation risk holds even for the multinomial, as discussed
in Section 10.
Under our definition an exponential family has zero curvature everywhere
so in some sense it is a "straight line through the space of possible probability
distributions." (This is intuitively plausible since linear methods, that is, meth-
ods based on linear approximations to the log likelihood function, tend to work
perfectly in exponential families. The fact that locally most powerful tests are
uniformly most powerful is an example of this.) We will make this notion precise
by considering families 5 which are subsets of multi-parameter exponential
families. If the subset is a straight line in the natural parameter space of the
bigger family then Jw is a one-parameter exponential family. If the subset is a
curved line through the natural parameter space then 5Y is not exponential, and
it turns out that the statistical curvature exactly equals the ordinary geometric
curvature of the line, the rate of change of direction with respect to arc-length.
For the sake of exposition we actually start with this latter definition in Section
3 and show in Section 5 how it leads to a sensible definition of statistical curva-
ture in the general case.

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1191

There are really two halves to this paper. Sections 3-7 introduce the notion
of statistical curvature, Sections 8-10 apply curvature to hypothesis testing,
partial sufficiency, and estimation. Section 2 consists of a brief review of the
notion of the geometrical curvature of a line.

2. Curvature. If Y = Y(X) defines a curved line _9 in the (X, Y) plane then

(2.1) rx = F (Y")2 ]1
is defined to be the curvature of -2' at X, where Y' dY/dX, Y" d2Y/dX2
are assumed to exist continuously in a neighborhood of the value X where the
curvature is being evaluated. In particular if Y' = 0 then rx = I Y"'l. An exer-
cise in differential calculus shows that rx is the rate of change of direction of
2' with respect to arc-length along the curve. The "radius of curvature", Px-
l/rx, is the radius of the circle tangent to X' at (X, Y) whose Taylor expansion
about (X, Y) agrees up to the quadratic term with that of S9. Struik (1950) is
a good elementary reference for curvature and related concepts.
The concept of curvature extends to curved lines in Euclidean k-space, Ek,
say 2? = {vj7, 8 e 9}, where e is an interval of the real line. For each 8, r, is
a vector in Ek whose componentwise derivatives with respect to 8 we denote
-(a/O8)r8, i_ (a2/a842)i0. These derivatives are assumed to exist continu-
ously in a neighborhood of a value of 8 where we wish to define the curvature.
Suppose also that a k x k symmetric nonnegative definite matrix T0 is defined
continuously in 8. Let M. be the 2 x 2 matrix, with entries denoted L20(O)' V11(8),
Vo2(8) as shown, defined by

(2.2) M-(2l() U02(8)') \ TO ('; o)


and let

(2.3) To
Then ro is "the curvature of
take k = 2, 8 = X, r0 = (
Again it can be shown th
respect to arc-length alon
1, where the arc-length fr

7,

FI. S.oh cvuo a i 0


FIG. 1. The curvature of -2 at Oo is daoldsolo=0

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1192 BRADLEY EFRON (AND DISCUSSANTS)

between and (, called "a,". Then

(2.4) To da=

or equivalently T. = d sin ao/ds0


inner product Z0,,

(2.5) ds 'o 0)
dO

(2.6) sin a, [1- (_ _ C __? , X __ .

(Z00 can be replaced by 42l anywhere in (2.6).) As Figure 1 indicate


purpose of evaluating T.0 the k-dimensional curve 9 can be considere
as a two-dimensional curve in the plane through (6, spanned by (00 an
3. Curved exponential families. In this section we define statistical curvature
for one parameter families S which are curved subsets of a larger k-parameter
exponential family, "curved exponential families" for short. Denote the multi-
parameter family by

(3.1) g,(x) =_ g(x)e6tx-OW


a family of densities with respect to some given measure m
crete, on Euclidean k-space Ek. Here ( e ; the subset of Ek for which
Ek g(x)e1'x dm(x) < oo. The convex set -x1- is called the natural param
of the exponential family. If we define
(3.2) 2('2) =_E, x

the components of i can be obtain


Moreover the covariance matrix Z
a2b(()/86t0j. We denote by A the

(3.3) A = ((): e I
The mapping (3.2) from -4' to A is
of i(r)), recognizing that i indexes
has the same rank r for all '2, and
Now suppose that

(3.4) 1-{(: el
is a one-parameter subset
differentiable function of
fo to be

(3.5) f&(x) _ g,(x) = g(x)eo'x-Oo


where 0b0 _b(Y),). (Likewise 20 _(y2O), Zo =Z T(y2O).) It is easy to

(3.6) AO = oZO 9=, p 0a = E' = ,'x.

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1193

Swill stand for the family of densities {ff(x): 0 C 81, our cu


family.

DEFINITION. T,o the statistical curvature of S at 0, is the geometrical curva-


ture of &' = 1{i: 0 e 81 at 0 with respect to the covariance inner product To,
as defined in (2.2) and (2.3).

EXAMPLE 1. Bivariate normal. x is a bivariate normal random vector with


covariance matrix I and mean vector )77 = (0, (To/2)02)" 0 C = (-oo, co),

(3. 7) x ~ -42(r7 , I)

Then ,0 = (1, r0)', i0 = (0, r,)', and

(3.8) M ( 1 + TO 202 )
io20 TO
so

2o2
(3.9) To (1 +To202)3

In particular To2 = ro2, justifying


curved exponential family will be

EXAMPLE 2. Poisson regression. x


variables, xi having mean a + Obi,
eters. e is the interval of 0 val
Since x = (xl, ..., Xk)' has a k param
the k means are unconstrained, we
MO9,

b__ _2 b .3
(3.10) p20) = aa +b )'

>0o2(0)
V02(0)= zk=' (a+i)3
(a+ Ob Ob, (a ()=bi_ 2

The formula (2.3) for T02 simplifies at 0 = 0 to

(3.11) T02 =? FZ= b4 ( - 1b 3)21


aL (= b 2) (Ek=, bi2)3 i

That the entries of M. are summations follows from the independence of xI,
x2, . Xk, as mentioned in Section 6. A very similar formula holds for the
analogous binomial regression model.
The Neyman-Davies model, x1, X2, ..., xk independent scaled X12 random vari-
ables, xi i (1 + 031)i , 82, . . ., * kknown constants, has the same structure.
(Davies (1969) uses this model, which originates in an application due to Neyman,
to investigate the power of the locally most powerful test of 0 = 0 versus 0 > 0.
We compare our results with his in Section 8.) By direct calculation or by the

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1194 BRADLEY EFRON (AND DISCUSSANTS)

remark at the end of Section 6 we get that M0 has elements

(3.12) 12(0() = a X=1 6i X 1(0) = -v0 2 , va(0


and so

(3.13) To2 = 8 [ = 6i _ (Z=1


( 1 ai2)2 1 bi2)

EXAMPLE 3. Autoregressive process. yo'yl,


regressive process Yo = Uo, Yt+i = 0Yt + (1
ut l x(O, 1), t 0, 1, * , T and 8 = (-1, 1). Writing out the likelihood
function Of (YO' I.. YT) shows that this is a curved exponential family with
k = 3, the v vector being i#' = (-(1 + 02)/a, 0, -')/(I - 02), with correspond
ing sufficient statistics x'-= (hT'1y2, ET YtYt-l,y O2 + YT2). For 0 = 0 the cal-
culations are easy, yielding

(3.14) M O 08T-
T 0 6}rO T'
To 2

Much messier expressions are found fo


cO/T + O(l/T2) as T-+ oo, with co = 8
any T, r-o = To' ia = i6 since the ma
takes 0 into -0 while preserving the curvature and Fisher information.) This
family is least like a one-parameter exponential family at 0 = 0.
If a9 is a straight line through XA- 8 = a + br(0) where a and b are known
vectors and -r(0) some real-valued twice differentiable function of 0, then T,, = 0
for all 0 since the curvature of a straight line is zero. In this case f0(x) =
(g(x)ea'z) exp [zr(0)b'x - /'] is a one-parameter exponential family with natural
parameter z(0) and sufficient statistic b'x. Under our definition all one-parameter
exponential families X and only such families, have statistical curvature every-
where equal to zero. This desirable property would still hold if we defined the
curvature with respect to an inner product other than Z,,, say X.-' or I. The
following discussion and Section 4 add support to the choice T,.
Let 14(x) denote the logarithm of f0(x),

(3.15) l(x) -logf,0(x)


and denote the first and second partial deriv

a a
(3.16) l (x) = - (x), l((x) - 92 I@X).
The moment relationships

(3.17) E04 =0 , E 192- -E4iq _ i,


where i, is Fisher's information, hold because the exponential family
(3.1)-(3.5) allows us to differentiate under integral signs with impunity. (We
will suppress the random element "x" in much of the subsequent notation.)

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1195

Notice that 1(x) = - lx b0, + log g(x) so that

(3.18) io(x) =0'(x 20) - i,(x) = AX- 2,) - VTO a


where we have made use of (3.6) in taking the derivatives. Remembering
T,0 is the covariance matrix of x, we see that (3.17) holds with

(3.19) io = 0'z0o .
As a matter of fact the covariance ma

(3.20) E0 (.. io )4 i + i@) = (-?0 8 (? X8 Z )


lo + lo ' ?0 o'o # 07;O
which is just the matrix MO defined at (2.2). Therefore

(3.21) P20(o) = io = E io2, V11(8) = E0 44 Cove


V02(8) = E0 12 - i2 = Var0 10 .

These definitions make no explicit reference to the geometrical structu


curved exponential family. We will use them in Section 5 to provide th
ture definition for an arbitrary one-parameter family.

4. Invariance properties of the curvature. The two definitions o


geometrical one following (3.6) and the statistical one (3.21) give two useful
invariance properties of the curvature y..
i) Statistical curvature is an intrinsic property of the family S and does not
depend on the particular parameterization used to index If we let 0- g(O)
where g is any strictly monotone twice differentiable function, and fi(x) =
f9_0(j)(x), then = Tg_(8 ) for every a e- g(e). This follows from the same
property of the geometrical curvature (2.3). [Note: this is not true for the Fisher
information: i, = i9_1(b)(d01d#)1.]
ii) If t = T(x), is sufficient for 8 then 4T(t) a/D8 logf0T(t) = C(X), where
fj indicates the density of T, implying by (3.18) that M.T M. and roT - .
The statistical curvature is invariant under any mapping to a sufficient statistic,
including of course all one-to-one mappings of the sample space. This property
would not hold if we had chosen an inner product other than T0 in the definition
of statistical curvature.
We can use property (ii) to transform an arbitrary curved exponential family
into a form particularly convenient for theoretical calculations. Let 80 be some
value of a at which we wish to investigate the local behavior of X Write
TOO=_A'DA, D an r X r diagonal matrix with positive diagonal elements and
A an r x k matrix with orthonormal rows, AA' = -I (rank , - r, I the r x r
identity matrix). Let x- =rD-iA(x - 2O) where r is an as yet unspecified r x r
orthogonal matrix. x? is an r-dimensional sufficient statistic for the family (3.1).
For 8 e 9 it has a curved exponential family of densities where we can take
go = rD'A(7m - ,9 ). (These statements are easily shown in the full rank case
r = k and are not difficult for r < k.)

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1196 BRADLEY EFRON (AND DISCUSSANTS)

Notice that ;0= 0 -, 0, and T,,g = I. Proper choice


trix F makes -,9 proportional to e, = (1, 0, * *, 0)'
of e1 and e2 (0, 1, 0, * *, 0)'. By (3.6), 400 is then a
DEFINITION. The family is in standard form at 0 = 0
sion of

(4.1) IN,= 20o,-0 =OO=ir


and

(4.2) ''o = -0 = i.ie,e,


0 0 Y=
0 ii
0 e, + i0Or6oe2
00e,+,re2
i00

(The constants in (4.2) are necessary to satisfy (2.2).) We will use standard form
to simplify proofs in Sections 9 and 10. If - is not in standard form at 00 t
above transformation makes it so, and by property (ii) M. and hence all infor
tion and curvature properties remain unchanged. We could use property (i) to
further standardize the situation so that io = 1, V11(00) = 0, but that does no
simplify any of the theoretical calculations which follow. Property (i) is useful
for calculating curvatures, as will be shown in Section 7.

5. General definition of statistical curvature. Leaving exponential families,


let

(5.1) 5_{ff(x),0e0}
be an arbitrary family of density functions indexed by the single parameter
0 e 0, a possibly infinite interval of the real line. The sample space z' and
carrier measure for the densities can be anything at all so we have not excluded
the possibility that .w consists of discrete distributions. Let

(5.2) 1(x) log(x)1(x), 10(x) a (x)


AX) lofg(x) lo(x) ao '(ox
as in (3.15), (3.16). We assume the derivatives exist continuously and can be
uniformly dominated by integrable functions in a neighborhood of the given 0,
so that E,1,9 = 0, E0, 2 =-E,io _ i as in (3.17). Finally, as in (3.20)-(3.21)
we let M. be the covariance matrix of (i, la),

(5.3) V =(20(0) ill(0) _ E( 2 E, 1,9 1)


'll(0) v02(0) E6 10 10 E,1,2 -i02
and define the statistical curvature of _ at 0 to be

(5.4) ro (IMoII9)W - [ i2: - i?3]T

In making this definition we assume 0 < i0, < oo and V,2(O


and (ii) of Section 4 are verified to hold for 7,9 as defined
~~~~~~~f If
I0 = f/fg, la = 1f0
term in (1.1), the crucial quantity in the Fisher-Rao theory.

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1197

What does ro measure in this general situation? It is a


Fisher's score statistic is changing (more precisely, "tur
argument along those lines is given next, further suppor
tions of Section 8.
Comparing (5.3) with (2.2), we can connect the two definitions by thinking
of _= {[l, 0 G i} as a curve through the space of random variables on ct-. The
inner product <u, V>Q _ u'O v of (2.2) is taken to be the covariance inner product
in (5.3). (Section 3 makes the analogy precise in the exponential family case.)
All of the quantities in Figure 1 can now be given a statistical interpretation.
The element of arc length along X , by analogy with (2.5), is ds0/d0 =
(Eo 102)i - i_,. Define

(5.5) U<() ()+ a.


100

UOO is the version of Fisher's score statistic 1o% that is the best locally unbiased
estimator for 0 near 00: Varo0 U0 = /ioo, the Cramer-Rao lower bound, and
E0OU0= 00, dEoU00/d0j0=00 = 1. Therefore

(5.6) (d/dO)E0U0o _ dso


(Var00 Uod), 0=00 dO 0=00

(The quantity on the left of (5.6) is called the "efficacy" of the st


We see that

(5.7) (0- - ) -dO 0o=e (Varo U00-) 0?0( 00)2

Therefore sH of Figure 1 can be interpreted locally as the number of (00) st


deviations from E0o U0 to Eo Uoo.
By analogy with (2.6)

(5.8) sin ao = [Iov- (~' [ - corr~ 'o lP


(5.)[i Varo i6 Var,0 10] L 00(I00ol0)], I
so sin2 a0 is interpreted as the unexplained fraction of the variance i
linear regression on U,O(x), under density fOo.
From (2.4) we get the following interpretation of the statistical
roo is the derivative at 0 -00 of the unexplained fraction of the standard
of UO given U0o, the derivative being taken with respect to the effi
(E0 U0O - EoO UB90)/(Var00 U00)X along . If this quantity is large th
best estimator (also the locally best test statistic) is changing quickly
and J is highly curved in a statistical sense. At the opposite extrem
parameter exponential families for which a0 _ 0, so U0 is statistical
to U00 for all 0 and 00. We pursue this interpretation of To in Sectio
what constitutes a seriously large value of the curvature.
In a certain sense any smooth one-parameter family X7 can be em
a suitably large exponential family. Suppose at some point 00 in e, 1

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1198 BRADLEY EFRON (AND DISCUSSANTS)

differentiable. Consider the k-parameter exponential family

(5.9) g,(x) _ exp[loo(x) + 14I0(x) + r,2100(x) + + )2100(X) -


l0(x) = (YjaOk)lo(x)1oo0(() being chosen to make (5.9) integrate to one over
' with respect to the carrying measure for S Choosing

)7 = ((-) (-2 010' _ .0) .o _ ' ) k

gives a one-parameter family of densities g,2, approximatingf0 near 0 =


If the Taylor expansion for lo converges at 60 this approximation becomes in-
creasingly accurate as k - so. For any value of k ? 2 definitions (5.3) and
(3.21) show that = M00, so 00 = i00 and = r It is reasonable to expect
results proved in the context of curved exponential families to hold for sufficiently
smooth nonexponential families, though no justifying theorem has been proved
to this effect. This is in the same spirit as approximating an arbitrary family
by a multinomial with a large number of categories, as in Fisher (1925) and
Barnett (1966), but seems to make the approximation in a smoother way.

6. Repeated sampling. Suppose we sample xl, x2, ..*, x, independently and


identically distributed with density fg. We will use boldface letters to indica
quantities connected with the repeated sample, x (x1, x2, . *, x")', 10(x)
= l4(xi), Uo(x) = i(x)/io + 6, etc, In particular
(6.1) MO=nMW
since M0 is the covariance m
the familiar relationship i. = ni

(6.2) ro= r*
ni

The curvature goes to zero at rate l/ni under repeated sampling. This makes
sense since we know that linear methods work better in large samples.
In curved exponential families, (3.1 8)-(3. 19) combine with lo(x) = l (xi)
to give

(6.3) I,(x) = nu70'(X - 0) , lo(x) = n{( 2x- ,) - ni.}


x--=E,"-, xi/n being the sufficient statistic for the complete family (3.1).
If the xi are independent but not necessarily identically distributed we s
have 1,(x) I lali)(xi), the superscript indicating the distribution for x
so MO = 1 M0h). This explains the simple form of Mo in Exampl
Section 3.

7. Some examples. Before discussing the statistical properties of rT we w


expand our catalog of examples to include several nonexponential families.
Those results illustrate some simple principles that make To easy to calculat
familiar statistical situations. In the first three examples we assume that the

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1199

densities given are with respect to Lebesgue measure on the real line, i.e., that
we have just one observation of a continuous variable. For an independent,
identically distributed (i.i.d) sample of size n the curvature is obtained from
formula (6.2). This last remark applies also to Example 7, and to the examples
of Section 3.

EXAMPLE 4. Translation families. Let g(x) be a probability density function


and f0(x) -g(x - 0). Also let h(x) _ log g(x). Then i(x) =-h( - (x9-),
l0(x) = h - 0(x- ), where h(i)(x) = dih(x)/dx%, so E01.ii'j = - [-h1'1(x)]i x
[h(2)(x)]jg(x) dx. Obviously Ma and rO do not depend on 0 in a translation family.
For the t translation family, f degrees of freedom,

(7.1) g(x) F (f - 1) (i + x)f

we calculate

(7.2) V20(0) = i(8) - f +

2(a) = f + -1 [(f + 2)(f2 + 8f + 19) _ f + 1


f +3L -f(f +5)(f +7) f+3i
and P1,(0) = 0 (by symmetry), giving

(7.3) yo2 = 6[3f2 + 18f + 19]


f(f + 1)(f + 5)(f + 7)
a monotone decreasing function of f. Some values are as follows:

f 1 2 5 10 20 0o
(7.4)
(74 2 2.5 1.063 .306 .107 .0334 18/f2

The case f = 1 is the Cauchy translation family, and


with a closely related calculation in Fisher (1925).
For the Gamma translation family

(7.5) fxx x- )l(O > 0


fo (x) - r(a)
a > 4 a fixed constant, we calculate

1 ~~~2
(7.6) M 1 l (a -3)
a -2(2 4a -10
(a-3) (a-2)(a-3)(a-4)
2 2 a-I
(a - 3)2 a - 4
(For a < 4, V02 iS infinite.)

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1200 BRADLEY EFRON (AND DISCUSSANTS)

EXAMPLE 5. Scale families. x 0 a . z where z has a


(0, oo). If z is a positive random variable then log x =
tion family. By Section 4 the curvature will be the same for this family as for
the original one, and by Example 4 it will not depend on 0: For scale families
rO does not depend on 0. (The argument above applied separately to the positive
and negative axes gives the result in general. It can also be derived directly
from (5.3).)

A particular example is the normal with known coefficient of variation, x


>7'(0, c02), c known. Here x O Oz, z axF(l, c). We calculate io = 2(c + -)1(cO2)
and

(7.7) o2. 4(=


4(c + )3
(Notice that x , J/_-(O, CO)2 is a curved exponential family, k = 2.) The curva-
ture is near 0 for all values of c, taking its maximum at c = 1:

C 4 2 1 2 4 oo
4 2(-
(7.8) o2 .0370 .0625 .0740 .0640 .0439 1/4c

EXAMPLE 6. Weibull shape parameter. fo(x) = Oxo-ie-x9 for x > 0, 0 e e =


(0, oo). That is x , zll/ where P{z < zo} = 1-e-zo for z0 ? 0. The transfor-
mation log x 1 I/O log z makes this a scale family in 1/0, so once again ro2 does
not depend on 9. Taking 0 = 1 for convenience gives i4(x) = (1 - x) log x + 1,
11(x) =-(x log2 x + 1), El Jilli -' [1,(x)]p[11(x)]ie-z dx. Numerical integration
gives

(7.9) r02 = .704.


EXAMPLE 7. Mixture problems. fo(x) (1 - 0)g(x) + Oh(x), g and h known
densities on an arbitrary space . The parameter space E3 contains [0, 1]. We
see that

(7.10) = h-g) 102


g -F 0(h - g)
and for 0 = 0

(7.11) 10 =_ r -1I, lo =-(r-)

where r(x) h(x)/g(x). Defining a, EO(r - l)i

(7.12) MO = (r- _2 32
- a a4 a2! ao2 a3I
2 ~~~2 a2
If g and h are normal densities, say g
we have r(x) = exp(,ax - p2/2) and

(7.13) MI=( e- ?2] )43 3 8 -


MO R3- 32 + 9 2] - 43 22 , 8Q e 4

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1201

when e el2. Therefore io = - 1 and

(7.14) To2 = 3( + 1).


The curvature approaches 2 for p near

p .5 .832 1 1.048 1.180 0oo

(7.15) To2 4.84 24 74.68 108 320 e4p2


io .284 1 1.718 2 3 - eP2

8. Hypothesis testing. So far we have not tried to say what constitutes a


"large" curvature-a value of ro (or, in repeated sampling situations of ?,,, the
curvature based on all the data) of sufficient magnitude to undermine techniques
based on linear approximations to the log likelihood function. We can obtain
a rough idea of this value by considering the problem of testing Ho: 0 = 00 versu
AO: 0 > S0.
Define

(8.1) 0a= 0 2
looI
00

so that iooi(01 - 0) =2. From the discussion (5.5)-(5.7) this means that,
approximately,

(8.2) Eo1 U00 - EF0 U00 = 2


(Varoo U0O)

(where in (5.7) we have used ds/d01J=00 = i0oi). The locally most powerful level
a test of Ho versus AO, LMPar, for short, rejects for large values of U00. From
(8.2) we would expect LMPa to have reasonable power at 01 for the customary
values of a. That is 01 should be a "statistically reasonable" alternative to 00.
The discussion following (5.8) shows that the unexplained fraction of the
variance of U01 after linear regression on U00, calculated under foo, is approxi-
mately 4Tr0. If this quantity is large, say 4T 90 -2, then UO1 differs considerably
from UO, and the test of Ho based on U0l will substantially differ from that based
on U0O. Under these circumstances it is reasonable to question the use of LMP,.
Based on those very rough calculations a value of rTO 2 8 is ";large".
00= 8

In the repeated sampling situation of Section 6 a sample of size n > no,

(8.3) no= 8 0
makes 2= T2O/n < 8, and there
point. For the Cauchy translat
shape parameter, Example 6, n
of variation, Example 5, no < 1
normal mixture problem, Ex
expect linear methods to work
in the last example, even for large samples.

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1202 BRADLEY EFRON (AND DISCUSSANTS)

Moving from the vague to the specific, consider Example 1, Section 3, a


bivariate normal vector x = (x1, x2)' with mean (8, To82/2) and covariance m
I. Assume we wish to test Ho: 8 = 0 versus A,: 8 > 0 on the basis of observ
x. The LMPa, which rejects for large values of x1, has power function (probability
of rejection) 1 - &o() = $(8 - Za), where Za and (P are the upper a point and
cdf of a standard normal variate.
The Neyman-Pearson lemma says that the most powerful level a test of 8 = 0
versus some specific positive alternative 8 = 01, MPa,(81) for short, rejects for
large values of rj1x. It has power function

(8.4) 1 - i31(8) - (8(l + r,282/4)i cos (AO1 - AO) - Za),


AO being the angle from the xl axis to i0, illustrated in Figure 2. As 81 approaches
0, j,3(8) approaches p0(8) for all 8, justifying the notation 1 - p0() for the
power function of LMPa.
For a given value of 8 > 0 the power is maximized by taking 81 8 0, giving
'power envelope"

(8.5) 1-19*(a) = (ID(8(l + rO282/4)i - za)

Figure 3 compares the power envelope function, for four va


power function of LMP,a, a = .05 (which does not depend o
the difference between 1 - 13*(a) and 1 - p0() increases wit
In this case we can actually see that ro0 measures how fast
test statistic U,O(x) becomes nonoptimal as the alternative
Also according to prediction the LMPa has reasonable power properties for
To2 =i1' and poor properties for ro2 > .

i X2 \ Rejection Rejection
\ Region Region
\MP (01) LMPF

0 I Za \ Xl

FIG. 2. Bivariate Normal, Example 1, testing 8 0 versus 8 > 0. The rejection re-
gion for the locally most powerful level a test, LMPa, is compared with that for the
most powerful level a test of 0 versus Oo, MP,(Oi).

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1203

1.00 _ =
- Power Envelope

0.80 - y):l= 25 5

0.60 -

0.40-/ 0.40~ <~ ~~ ~/ / Power of L0P<,,

0.20

00 0.5 1.0 1.5 2.0 2.5 3.0


8
FIG. 3. Power of LMPa, a .05, compared with power envelope function, Example 1.
TABLE 1
Power comparison, Example I
a) Power envelope b) Power MP.05(2) c) Power LMP.o5

0
ro
0 .5 1.0 1.5 2.0 2.5 3.0 3.5

.25 .05& .126 .262 .453 .662 .835 .941 .985


.25 .05b .125 .260 .452 .662 .834 .938 .983
.25 .05c .126 .260 .442 .639 .804 .912 .968

.5 .05 .127 .269 .483 .723 .904 .982 .999


.5 .05 .121 .261 .479 .723 .901 .980 .998
.5 .05 .126 .260 .442 .639 .804 .912 .968

1.0 .05 .129 .300 .591 .882 .991 1.000 1.000


1.0 .05 .115 .280 .583 .882 .990 1.000 1.000
1.0 .05 .126 .260 .442 .639 .804 .912 .968

2.0 .05 .139 .409 .855 .998 1.000 1.000 1.000


2.0 .05 .115 .381 .850 .998 1.000 1.000 1.000
2.0 .05 .126 .260 .442 .639 .804 .912 .968

Of course no level a test can achieve t


value of 0 > 0. MPar(0i) achieves it for
0 in the sense that dj30(0)/d0j=,0 = d/*(0
in choosing 01 we get a test which ma
be a statistically interesting value of 0
not unreasonably high. In our example
Table 1 shows that 1 - P2(0) stays r
MP.05(2) has better power characteristi
of r".
Davies performs similar evaluations for the Neyman-Davies model of Example
2. The curvatures for the upper and lower cases graphed on page 532 of Davies
(1969) are To2 = .488 and To2 = .244 respectively, while the two on page 533 are
To2 = .00629 and ro2 = .0364. Ignoring the "Wald's test" curve, one sees that

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1204 BRADLEY EFRON (AND DISCUSSANTS)

the magnitude of To2 is indeed a good predictor o


LMPa compared to MP,(Oi). His results are quite
ample 1. (Davies chooses 01 so that 1 - 3*(91) = .8. This is a more precise
way of accomplishing what (8.1) is intended to do, but is computationally dif-
ficult in most situations.)
Section 10 shows that the Cramer-Rao lower bound for the variance of an
unbiased estimator errs roughly by a factor of 1 + ro2' rejustifying the definition
of ro2 > 8 as a "large" curvature.

9. The Fisher-Rao theorem. We again assume an i.i.d. sample x1, x2, .. , x,


as in Section 6. Result (1.1), originally stated by Fisher in his fundamental
paper on estimation theory (1925) can be restated as

(9. 1) lim ._. (io _ io ) = jo 702


since To2 equals the bracketed term in (1.1). (9.1) is derived
by means of the relationships V2o(0) = 20 = i0, Vl(o) =
[0O2 - 2[12, + 4o -_20 these following from (1.2), (5.3) an

(9.2) lo = Jo/fo 'f = fo/fo- (Jo/fe)2


To use Fisher's evocative language, asymptotically the MLE #(
extracts all but joT02 of the information in the sample x = (xl, *
a single observation contains an amount i0 of information this is eq
a reduction in effective sample size from n to n - To2, for example
n - 5 in the Cauchy translation parameter problem. This is the price one pays
for a one-dimensional summary of the data and, also according to Fisher, any
summary statistic other than the MLE would pay a greater price. (Rao's substan-
tial contributions to this argument are discussed toward the end of the section.)
The geometrical argument which follows shows clearly why the curvature To
plays the role that it does in (9.1). It also leads quickly to a counterexample
to (9.1) and shows that by working within multinomial families, Fisher and Rao
chose perhaps the least tractable curved exponential families. We will work
with a general curved exponential family in the standardform (4.1)-(4.2). For
notational convenience we let 00, a particular value of 0 where we wish to
evaluate liml . (io - i0), equal 0. Then we have (= = = 0, 5 = Ir, ( =
-O = ioie_, and 20 = (V11(0)/i0M)e, + i0ore2
Fisher's argument depends on two useful results which we borrow:
1) If T(x) is any statistic, with density say foT(t) and score function (log
derivative) 10T(t) &/DO log fj(t), then 10T(t) = Eo0to(x) I T = t} (where we recall
from Section 6 that 1i(x) is the score based on all the data). This implies that
the loss of information in going from x to T(x) is

(9.3) jo-_ T = E0 Varo {1i(x) I T}


since io - = Var io - Var0 jOT
2) Let Lo be the set of values of x-- = 1 xi/n for which i0(x) = 0;

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1205

e2

A and the Sample Space of x

L0

M@\

77

FIG. 4. A curved exponential family of dimension r 2. La is the set of x for which


6 is a solution to the maximum likelihood equations. Ma is the level curve for another
consistent efficient estimator.

of those values of the sufficient statistic x for which 0 is a solution to the like-
lihood equations 1i(x) = 0. Then, since 1i = nC0,'(x - 2),
(9.4) Lo = {X: (e - e) 0 O}
the r - 1-dimensional hyperplane through 20, ortho
Figure 4 illustrates the situation for the case r = 2. (Notice that the sample
space, the space of possible 5x values, has been superimposed on A, the space of
possible mean vectors i.) Actually this two-dimensional picture is appropriate
for any dimension since curvature is locally a two-dimensional property, as
pointed out at the end of Section 2. A heuristic proof of (9.1) based on this
picture now follows in five easy steps:

(i) io(x) = n(io)iX1 (by (6.3)).


(ii) nix --* r?(O, I) as n -, oo if 0 = 0 (since 20 = 0, TO I, and central limit
conditions are satisfied inside an exponential family).
(iii) Let 0 be the MLE and d the angle between (^ and C = i01el. Then d
i0i-00 + 0(02). (Since rId = O& + 0(02) - ij l0e + 0(02), the element of arclength
in Figure 1 is s- = 1j + 0(02) = ? 0(02). By (2.4) we have a_ a =
ioiroo + 0(02).)
(iv) Var0 {io(x) I09} = n2i0 tan2 a * Var0 {X2 1 2}. (In the case r = 2 this
immediately from (i) and the geometry of the situation. For r > 2, x2 is replaced
by v'111vil where v = _-jjJ cos a *( the part of ( orthogonal to .)
(v) VarO {X2 1 } = 1 /n + o (l /n). (This is plausible because of (ii) and the fact
that near 0 = 0 the partition of the sample space generated by the "lines" L.
looks like the partition generated by lines parallel to L-.)

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1206 BRADLEY EFRON (AND DISCUSSANTS)

Steps (iii) and (iv) together give Var0 {i10 I } =


which, combined with (v), gives

(9.5) Var0 {10(x) | 0} = niolrO202(1 + OQ())(1 + o"(1))


o,.(1) 0 as n -* oo, 0(a) -O0 as 0 - 0. The heuristic proof of (9. 1) is complete
by (9.3), giving
A*

(9.6) lim" n iO i ' = lim,.00 E0 Varo10 {I0} = r02


Here we have used

(9.7) lim ,--,O nE0J01' = O lim,-.0 nE0S =1 io-l,


which one might hope for in view of ni0 -+ A/V(0, io-').
All of the weak links in this chain of reasoning can be made solid except for
(v). Its fatal flaw is shown by a counterexample to (9. 1) based on the trinomial
distribution

(9.8) P{observed object is in category j} = 2, j = 1, 2, 3


(so 2 >? 0 21 + 22 + 23 = 1) .

The trinomial can be considered as an exponential family of form (3.1) with


k - r = 2; i = (21, 2)t' ) = (11 '2)' (j = log[/jl(1 - 21- 2)] = 1, 2, and
0(i) = log (1 + erl + e72). The x vector takes on three possible values: (1,
(0, 1), (0, 0), corresponding to the observed object being in the first, second, or
third categories respectively. The carrier measure m(.) puts mass one at each
of these three x values.
The counterexample is a one parameter family .Y with L. passing through
the fixed point c = (-2i, -1) as illustrated in Figure 5, the parameter 0 being

La
(0,1) /

/ / aso~0 (0 ,0) (1,0)

=(sin(8+AO), cos (8+AO)) /


B00 /o Arct ,)n io
\,n /1/3
Vj2+1/3

C:-(-1 ,-1 )

FIG. 5. Counterexample to (9.1) based on trinomial. Each line Lo contains at most


one possible sample point x.

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1207

the angle between Lo and L.. Such a family does exist,


struction shows: let 20 (3, ) and

(9.9) '8 = 20 + SgOo,(Z.0f)d8

where ft. =_ 112 -cl /(l IZ0 a0 I sin BO), T0 is the co


the vector 0. and the angle B0 being defined as
gives 2, e La and also that, by (3.6), #Occo the nor
sitated by (9.4).

J is a curved exponential family having the following property: if 9(,) and


x(2, are two values of x = xiln giving the same MLE 0, then both x(-) and
X(2) lie on L-. But 9(i) = (n1(j)/n, n2(i)/n), i = 1, 2 the n being nonnegative in-
tegers. This implies either x(,) = x(2) or

(9.10) n2(2) -n2(1 - n21) + 1 . n


n1(2,-np(l) nj() + 2i * n

Since (9.10) would make 2i a rational number, x(1) must equal x(2), In short
there is at most one possible x value corresponding to any #, and so the MLE
is a sufficient statistic in J. implying i -i = 0 for all n. But To2 must be
positive for all 0 values since i, is always changing direction. This completes
the counterexample.

REMARK 1. Let 9p(t)- Eeit'x be the characteristic function of f,. If kp(t)p is


integrable for some p ? 1 then nix has a density function converging uniformly to
(27r)-k/2 exp (- I IXI 12/2). See Efron and Truax (1968), Gnedenko and Kolmogorov
(1954). Under those conditions (9.1) can be verified. The technical details,
which depend on an exponential bound to the density of x, are indicated in the
Appendix.

REMARK 2. Instead of working with the MLE a itself we can consider the
coarser statistic which only records which interval a lies in, among intervals of
the form (ies, (i + 1)E?) i = 0, ?1 +2, ? 2 The line La in Figure 4 is now
replaced by a pair of lines L L and step (v) can be weakened to say only
that the conditional distribution of :?2, given that x is between the two lines, has
variance 1/n + o(l/n). However in order for statement (iv) to still have meaning
we need to take E_ = o(l/n) (so that the conditional variance of fo will still be
due mainly to the slope of the lines LiIn, L(i+l,E , and not to the distance between
them). It turns out (Efron and Truax (1968)) to be possible to choose en in this
way and to get the proper convergence of the conditional variance if fo is non-
lattice, 1(t)l < 1 for all t t 0. (This excludes the multinomial.) In this case
it is possible to show that lim sup,,_. (io - i 0) < j0 To2.

REMARK 3. If #(x) is any other consistent efficient estimator of 0, and Mo is


the set of x values having #(x) = 6, then as in Figure 4, Ma passes through
4- and is tangent to La at that point. See Section 10. The increment of
[lim%..-+. (i0,- j0T) -_ To2] above zero is due to the quadratic term in the expansion

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1208 BRADLEY EFRON (AND DISCUSSANTS)

of M- near 4. The details are almost identical to those of Section 10 and will
not be given here. (See (10.25).)

REMARK 4. It is possible for two of the surfaces (9.4), say Lo and La, to in-
tersect. If x C Lo n L- then both 0 and a are solutions to the likelihood equation.
As 0 decreases to zero in Figure 4, Lo n L, converges to a point (in general an
r - 2 dimensional flat) on Lo {ce2} a distance po _ 1/7, above 0. Values of
on Lo which lie above this point are local maxima of the likelihood function,
while those lying below are local minima.

REMARK 5. Rao (1961, 1962, 1963) uses a different definition of the informa-
tion which avoids the difficulty illustrated by the counterexample. (9.3) can be

written as i _-iT = inf E0{1,(x) - h(T(x))}2, the infimum being over all choic
of the function h(.). Rao redefines 10T by restricting the function h to be quad-
ratic. Rao states that he believes the two definitions to be equivalent, but the
counterexample can be used to show that they are not.

REMARK 6. Is (9.1) a useful fact, assuming it is true? Fisher seemed to think


of Fisher information as a perfect measure of the amount of information available
to the statistician. For ordinary "first order efficiency" calculations in large
samples this is true enough, in the following sense: let T(x) be a statistic having
Fisher information 10T. Then in a neighborhood of any given value 0 of 0 we
can construct, under suitable regularity conditions, a function T(T), that is ap-
proximately I(, l/iOT), as compared with _Z1(0, 1/i00) for the MLE. If
ioT/i_ = .8 for example, then any statistic h(T) will have almost the same dis-
tribution as h(6) with 0 based on a sample 80% as large.

This argument breaks down for information discrepancies as small as those


contemplated in (9.1), since the central limit theorem is in general not capable
of supporting such fine distinctions. To give substance to Fisher and Rao's
theorem we must demonstrate that in specific statistical problems the Fisher in-
formation determines relative performance at the level of accuracy suggested
by (9.1). Rao (1963) showed that this indeed was the case for the problem of
estimating 0 with squared error loss. We review his results from the point of
view of this paper in Section 10.

10. Estimation with squared error loss. Suppose we wish to estimate the
parameter 0 in a curved exponential family on the basis of an i.i.d. sample x1,
x2, * , xn, using a squared error loss function to evaluate possible estimators.
We will only consider estimators that are smooth functions of the sufficient
statistic x and are consistent and efficient in the usual sense (see (10.5)-(10.7)
below). The following result will be discussed: let 0(x) be such an estimator,
the form of 0 not depending on n, and let 0(a) --E,9 UOO() where as
UO (x) _l %/iO + 00 is the best locally unbiased estimator of 0 near 0.. A
bo _ E0() - 0 be the bias of 0, a quantity which will turn out to be o

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1209

O(1/n) in the theory below. Then

(10.1) Varo - + 1 2 + 4 i?+0 + 2 O + o (1)


nj0 n2j 0To 0 0
where Ai > 0 and for the ML
curvature at 6 = 60 of the two
Before verifying (10.1) several remarks are pertinent.
1) The term 1/ni00 is the Cramer-Rao lower bound for the variance of an
unbiased estimator. The bracketed quantity in (10.1) expresses the coefficient
of the 1 /n2i00 term as the sum of three nonnegative quantities: r2o, the statistical
curvature, which is invariant under transformations of 6; 4f20/io0, the "naming
curvature", which depends on how J is parametrized (however, notice that
4]F2oli,o is invariant under linear reparametrizations 6 -* a + p0); and AO, which
can be made zero by using the MLE. Taken literally (10.1) says that the MLE
is superior to other efficient estimations with the same bias structure.
2) The estimators 0 will generally be biased by an amount of order 1 /n. This
affects mean square error to order l/n2. A simple adjustment, noted below at
Remark l1, produces estimators biased only to order 1/n2; (10.1), with the bias
term 2b,o/ni0o removed, is valid for such estimators. Among such bias corrected
estimators, (10.1) says that the MLE has asymptotically smallest variance.
3) The Fisher information is essentially invariant under reparametrizations
of JX, in the sense that if , = j(6) is a differentiable monotonic function then
iT = i,T(d6/d1)2 for every statistic T(x). The squared error estimation problem
is not invariant under reparametrization and this accounts for the presence of
the 4r2 term in (10.1). For a given 60 the "best" parametrization is in term
of 0(a), the expectation of the best locally unbiased estimator of 6. (Notice that
/ will be the same, except for scale and translation constants, no matter what
"6" we begin with.) It will turn out that if the MLE 0 is unbiased for 6 then
6_ a for all choices of 60, so we are automatically using the best parametrization.
4) (10.1) is not a special case of the Bhattacharyya lower bounds. The second
Bhattacharyya bound, applying to estimators biased by amount O(1/n2) or les
is of the form

(10.2) Var90 >-ni + 1+F { }


0 nio n .i0 ion

and the higher Bhattacharyya bounds ar


bounds relate only to the naming part of the estimation problem. It is possible
for an estimator to achieve equality in (10.2), but then it cannot be efficient in
a neighborhood of 60, so (10.1) is not contradicted.
5) Even if - is not a curved exponential family we can use (10.1) to get an
improved approximation to Varo0 6, compared with the Cramer-Rao lower bound
1/nIWO. The Cauchy translation family discussed at (7.4) has iO = 1 2 5
So F2~~~~~~'TO
The MLE a is unbiased in this case, so a00 = 0 and (10.1) is of the form Varo6o0 =
l/nio + TI /n2i0 + O(1/n3). Numerical comparisons of this formula with the

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1210 BRADLEY EFRON (AND DISCUSSANTS)

0.12

Barnett ? S.D.
0.10 1

Andrews? S.D.
[90.08 1

0.06-

Theoretic al
0. 04 -Value \

0.02

o < 1 t I I I I I I I
0 8 10 12 14 16 18 20
n

FIG. 6. Variance of MLE minus Cramer-Rao lower bound, for estimating the Cauchy
translation parameter. Theoretical value from (10.1) compared with Monte Carlo
results.

Monte Carlo studies of Barnett (1966) and also of Andrews et al. (1972) are
shown in Figure 6. The theoretical values are obviously too small for n ? 11,
but seem to be more accurate than the Monte Carlo results for n > 13. For
n = 40 Andrews et al. estimate Var,0 - 1/nio0 = .0025 + .0017 while (1
gives .003 1.
6) For estimating a translation parameter Pitman's estimator is known to
have smaller variance than the MLE. However, (10.1) suggests that this effect
must be of magnitude at most O(l/n3).
7) Nothing in (10.1), except the application to general curved exponential
families, is new. Rao (1963) states the result for curved multinomial families,
and notes that for the MLE it was previously derived by Haldane and Smith
(1956). The identification of the bracketed terms with curvatures is new, as
well as the line of proof which leads to a rigorous verification.
8) The similarity of (9.1) and (10.1) can be viewed as a vindication of the
belief that Fisher information is an accurate measure of the information con-
tained in a given statistic. This conclusion is premature; the squared error es-
timation problem is very closely related to the information calculation, a fact
which would be more obvious if we had presented a geometric argument below,
as in Section 9, instead of using analytic methods. It is more reasonable to say

that the curvature T. is the leading term defining the nonlinearity of a f


, and must play a central role in all calculations like (9.1) and (10.1). On
the other hand in the absence of evidence to the contrary it seems difficult to
dispute Fisher and Rao's assertion that the MLE provides the most informative
one-dimensional summary statistic even when there is no one-dimensional suf-
ficient statistic.
Our derivation of (10.1) will be done with the curved exponential family _

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1211

in standard form, and assuming 60 = 0. Neither of the


generality of the result. (The transformation to stan
mator into an estimator having the same variance, and
Too, and Fo0 unchanged.) We assume that the estimator
partial derivatives with respect to the components of
has the Taylor's series expansion

(10.3) #(x) = ao + a'x + (x'Ax)/2 + O(X3),

where a. is a scalar, a is a r X 1 vector, and A an r x r matr


dimension of the full exponential family containing S.
Here O(X3) indicates a term that near the origin is bounded in
by some polynomial in the components of x containing only term
Differentiating (10.3) with respect to the components of x give
vector

(10.4) VO(ix) = a + Ax + O(x2).

In order for a to be consistent and efficient, (10.3) m


shown in the lemma:

LEMMA. A consistent, efficient estimator #(x), having continuous third partial de


rivatives near x = 0, has the Taylor series expansion

(10.5) o6) - = x + Yo x1x2 + A -3


ioX io2 2 i 12 2
assuming is in standard form at 0 = 0. Here xi indicates the ith compon
X 1 X() _ 5,3 ..., * 'r) and A(1) is the matrix A with its first row and c
removed. For the MLEO(x), A(1,) = 0. As in (1.1), 4a11 = E0f0f0/f02.

The proof of the lemma is based on two simple facts: in order for a con
estimator #({) to be consistent it must have "Fisher consistency",

(10.6) 0(20) = 0 ,
since x -p 20 under repeated independent

(10.7) lim 1 (';)') V0)


i0 Var0 0 =(V'X'o 20(,V0'Z

so 0 will be first order efficient at

(10.8) V- -V#(9)j-=j. = co(o


for some scalar c0. Taken together (10.6) and (10.8) say t
MOa_=-{x: #(x) = 01 of an efficient consistent estimator 0
at 4, and at that point must be parallel to the level surfac
as shown in Figure 4. (10.7) merely says that the linear ter
of 0(x) about 4,9, 0 = 0 + V,'( - 20) + O((x -_ )2), must

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1212 BRADLEY EFRON (AND DISCUSSANTS)

the score statistic Io- = n#0'( - 2,) in order to ge


follows from a greatly simplified version of the argument below, but the result
is well known and will not be derived here.
The proof of (10.5) is obtained by seeing what form of (10.3) is necessary in
order that (10.6) and (10.8) hold for L, near 0. We will need the Taylor series
expansions

(10.9) 0 = iole1 + [-" el + ioroe2]0 + o(0)


io1

,0 = io0el0 + 0(02)

and a more accurate expansion for the first component of 2o,

(10.10) e f2 o = iot + Pii' 2 + ? (02)


i0~

(10.9) follows from the standard form relationships (4.1)-(4.2). To prove


(10.10) notice that el'20 = E0 x, = (l/ioi)E0ol(x) (see (3.18)). Formally

(1 0.1 1) Ee o =0?fo
57A(x) [OX+ f(x)
'~f\ox+ 2x++ 2 to(x) + o (a) mx

=io0 + 2 + 0(02),

a result which is easy to verify rigorously in an exponential family.


(10.4), (10.9), and (10.8) combine to give (writing co = co + eo0 +
(10.12) a + A(iiel 0) + 0(02)

= coi0le, + coioiel + C0 j (0) el + C0 io oe2] 0 + o (0)


l0

implying

(10.13) a = coio0ei
and

(10.14) ioiAe1 = (oi + c 3ll(0)) e1 + coio0oe2


0i

Notice that (10.14) shows that

(10.15) A31 = A41 Arl = 0

(10.9), (10.10), (10.13), (10.6), and (10.3) combine to give

(10.16) 0 = ao + coioiLioi0 + /i o] + i0A11 02 + o(02),

implying

(10.17) ao = 0,
co = 1l/io, and co11 pl+ ioAl1 = 0.

(10.18) a = j I
e,,_
Alli_
2 A21
A
10
=-.I
o
10

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1213

the first of these following from (10.13), the last from


(10.15), (10.17) and (10.18) are equivalent to (10.5). F
0((0, xl) = 0, implying A(1) = 0. This completes the
Two more simple results give (10.1) from (10.5). First
lower bound for the variance of a possibly biased estima
as an equality in the following useful form:

(10.19) Eo T2 = I +E0(T- x) +2 no
nio ioi0 '0

((10. 19) follows from Covo (T, io) =1


so this statistic is just the best locally
at (5.5). For an unbiased estimator, (1 0
Rao lower bound by the expected squared error of T in predicting U0. In a
curved exponential family the regularity conditions necessary for (10.19) are
satisfied if E0, T2 < oo for 6 in a neighborhood of 0. The second fact needed is
that if z is standard multivariate normal, z W Kr(0, I), and A is an r X r sym-
metric matrix, then E(z'Az)/2 = tr A/2 and
z'Az
(10.20) Var - tr A2 .
2 2

As n - cc, zn_- nix r 4A?(0, I), and becausef


the moments of zn converge to the moments
term, an omission justified (under an addition
below, (10.3) and (10.5) give

(10.21) Eo0= Eo xA2 1 trA = (1 ( i + tr At,


2 n 2 n\ 2i 2 21

Moreover (10.5) combines with

(10.22) E 2 -I+ 1 -I2


nio0 nl2\ 2i 4 2 4 } ni0 n2
? M
Therefore,

(10.23) Var0o 1 1 o 'O + _~?tA _ + 2/- /i0


nio n2 (io 2i04 21 nio (n2)

Finally, (10.11) gives 5b(6) = 6 + ([ei1/2i4)62 + 0


E0 io(x)/io, and then (2.1) gives the curvature squar
at 0 = 0. This completes the proof of (10.1). We se
00

(10.24) A= 4_ trA21)/2
and so equals 0 for the MLE.
Several more remarks can now be made about (10.1).
9) The bias of the MLE up to 0(1/n) is, by (10.21), equal to -p,11/(2ion). If
0 is unbiased to O(1/n), as it is for example in any translation parameter estima-
tion problem involving a symmetric density, then we must have jl, = 0. By

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
1214 BRADLEY EFRON (AND DISCUSSANTS)

(10.23) we then have Var₀θ̂ = 1/(ni₀) + γ₀²/(n²i₀) + (tr A₂²)/(2n²) + o(1/n²); the μ₁₁ term disappears from (10.1) in this case, so θ must be equivalent to the best name at every point in Θ.

10) The expression (10.24) for the excess variance of θ̂ over the MLE also occurs in the theory of Section 9,

(10.25)    lim_{n→∞} [n i₀ − i_{θ̂}] = i₀γ₀² + i₀²Δ₀;

see Rao (1963).

11) Let A(θ₀) be the matrix A in the Taylor expansion (10.3) when we put ℱ into standard form at θ = θ₀, and define B_{θ₀} ≡ tr A(θ₀)/2. Then to O(1/n), B_{θ₀}/n is the bias of θ̂ when θ = θ₀. It is easy to show, by calculations similar to those in Remark 12 below, that θ̃ ≡ θ̂ − B_{θ̂}/n has bias of order O(1/n²) and variance as given in (10.23) but with the term 2Ḃ₀/(n²i₀) removed; see Rao (1963). For the MLE θ̂, B_{θ₀} = −μ₁₁(θ₀)/(2i_{θ₀}²). The estimator θ̂ − B_{θ̂}/n has variance as given in (10.1) but with the term Δ_θ removed. The point is that by modifying the MLE we can obtain an estimator with the same bias structure and smaller variance than any other consistent, efficient estimator θ̃.
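The mechanism in Remark 11 can be illustrated with a toy sketch (Python; the exponential-rate model and the plug-in estimate of B are illustrative choices, not Efron's construction, picked because the bias of the MLE 1/x̄ is known exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 20, 400_000
x = rng.exponential(1.0 / theta, size=(reps, n))
mle = 1.0 / x.mean(axis=1)        # E(mle) = n*theta/(n-1): bias about theta/n

B_hat = mle                       # here B(theta) = theta, estimated by the MLE
corrected = mle - B_hat / n       # the analogue of theta_hat - B/n

print(mle.mean() - theta)         # bias of order 1/n (about 0.105)
print(corrected.mean() - theta)   # bias of smaller order (here exactly zero)
```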
12) We have ignored the O(x̄³) term in (10.3) in the derivation of (10.23) and (10.1). To justify this requires the following result: let 𝒞_n be the cube {z : |z_i| ≤ nᵃ, i = 1, 2, ⋯, r}, 0 < a < 1/6, and I_n(z) the indicator function of 𝒞_n. Define z_n ≡ n^{1/2}x̄ (so z_n →_L N_r(0, I)) and let p(z_n) be a polynomial of degree l in the coordinates of z_n. Then

(10.26)    E₀ p(z_n)[1 − I_n(z_n)] = O(n^{la} exp{−n^{2a}/2}),

as discussed in the Appendix.

Now write (10.5) as θ̂ − x̄₁/i₀^{1/2} = Q + R, where Q is the quadratic term x̄′Ax̄/2, A having the special form indicated in the lemma, and R is the remainder term O(x̄³). Also define S(x̄) ≡ Q(x̄)I_n(n^{1/2}x̄), T(x̄) ≡ Q(x̄)[1 − I_n(n^{1/2}x̄)], and V ≡ T + R (so Q = S + T, θ̂ − x̄₁/i₀^{1/2} = S + V). Notice that

(10.27)    |V| = |O(x̄³)| ≤ Kn^{−3(1/2−a)}    for n^{1/2}x̄ ∈ 𝒞_n,

for some positive constant K. (We use below the same symbol K to represent any bounding constant.) To the assumptions of the lemma we now add that |θ̂ − x̄₁/i₀^{1/2}| is uniformly bounded, giving

(10.28)    |V| ≤ K    for n^{1/2}x̄ ∉ 𝒞_n.

(With only slightly greater effort below, the boundedness assumption can be relaxed to |θ̂| ≤ K(n^{1/2}|x̄|)^k for n^{1/2}x̄ ∉ 𝒞_n, for some positive integer k.) From (10.26) and (10.27),

(10.29)    E₀|V|^l = O(n^{−3l(1/2−a)})

for any l > 0, while

(10.30)    E₀|T|^l = O(n^{2al} e^{−n^{2a}/2}).


Formulas (10.21) and (10.23) were derived assuming θ̂ − x̄₁/i₀^{1/2} = Q. But

|E₀Q − E₀S| ≤ E₀|T| = O(n^{2a}e^{−n^{2a}/2})

and

|E₀(θ̂ − x̄₁/i₀^{1/2}) − E₀S| ≤ E₀|V| = O(n^{−3(1/2−a)}).

Since a < 1/6 this shows that E₀θ̂ = E₀Q + o(1/n), so (10.21) is valid. Likewise

|E₀(θ̂ − x̄₁/i₀^{1/2})² − E₀Q²| = |E₀(S + V)² − E₀(S + T)²| = |E₀[2SV + V² − T²]|

(since ST ≡ 0), which is ≤ 2E₀|SV| + E₀|V|² + E₀|T|². The last two terms are o(n^{−2}) by (10.29) and (10.30). Notice that SV = O(x̄⁵) and SV = 0 for n^{1/2}x̄ ∉ 𝒞_n, and so

(10.31)    |SV| ≤ Kn^{−5(1/2−a)}.

Taking a < 1/10 makes E₀|SV| = o(n^{−2}), completing the proof.

We remark that a more careful proof, assuming θ̂(x̄) sufficiently differentiable, allows one to replace o(1/n²) by an explicit error bound.

Acknowledgment. Much of this work was done while I was visiting Imperial
College, London, Department of Mathematics. I appreciate the assistance of
Margaret Ansell in carrying out the more difficult numerical computations. The
Associate Editor provided extensive help, especially with the Appendix.

APPENDIX

Complete proofs of the statements made in Sections 9 and 10 require large


deviation results of the type discussed in Chernoff (1952) and the references
therein. Suppose x₁, x₂, ⋯, x_n, ⋯ are independent, identically distributed real valued random variables such that Ex_i = 0, Var x_i = 1, and φ(s) ≡ Ee^{sx} exists for |s| ≤ s₀, s₀ some positive constant. Then φ(s) = 1 + s²/2 + O(s³) for s near 0, so

(A1)    log φ(s) = s²/2 + O(s³).

Define I_{[y,∞)}(z) = 1 or 0 according as z ≥ y or z < y. Since e^{ns(x̄_n − y)} ≥ I_{[y,∞)}(x̄_n) for all values of x̄_n ≡ Σ₁ⁿ x_i/n, we have, for 0 ≤ s ≤ s₀,

(A2)    P{x̄_n ≥ y} ≤ Ee^{ns(x̄_n − y)} = [φ(s)e^{−sy}]ⁿ.
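(A2) can be checked numerically; the sketch below (Python) uses centred and scaled Bernoulli variables, an illustrative choice for which both φ(s) and the exact tail probability are available, and minimizes the bound over a grid of s values:

```python
import numpy as np
from scipy.stats import binom

p, n, y = 0.5, 100, 0.3
sig = np.sqrt(p * (1 - p))
# x_i = (b_i - p)/sig with b_i ~ Bernoulli(p): mean 0, variance 1
phi = lambda s: np.exp(-s * p / sig) * (1 - p + p * np.exp(s / sig))

s = np.linspace(1e-3, 2.0, 2000)
bound = np.min((phi(s) * np.exp(-s * y)) ** n)

# exact tail: xbar_n >= y  iff  the sum of the b_i is >= n*(p + y*sig)
k = int(np.ceil(n * (p + y * sig)))
exact = binom.sf(k - 1, n, p)
print(exact, bound)                 # about 0.0018 <= 0.0104
```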


LEMMA. For c_n a sequence of numbers going to infinity, c_n = o(n^{1/6}), and l a nonnegative integer,

(A3)    E{(n^{1/2}x̄_n)^l I_{[c_n,∞)}(n^{1/2}x̄_n)} ≤ c_n^l e^{−c_n²/2 + o_n(1)}.

PROOF. Let F_n(y) ≡ P{n^{1/2}x̄_n ≥ y}, so F_n(y) ≤ [φ(s)e^{−sy/n^{1/2}}]ⁿ for 0 ≤ s ≤ s₀. We have

E{(n^{1/2}x̄_n)^l I_{[c_n,∞)}(n^{1/2}x̄_n)} = −∫_{c_n}^∞ x^l dF_n(x),


and integration by parts gives

−∫_{c_n}^∞ x^l dF_n(x) = c_n^l F_n(c_n) + l ∫_{c_n}^∞ x^{l−1} F_n(x) dx
    ≤ c_n^l [φ(s)e^{−sc_n/n^{1/2}}]ⁿ + l ∫_{c_n}^∞ x^{l−1} [φ(s)e^{−sx/n^{1/2}}]ⁿ dx.

Taking s = c_n/n^{1/2} gives

(A4)    E{(n^{1/2}x̄_n)^l I_{[c_n,∞)}(n^{1/2}x̄_n)} ≤ [φ(c_n/n^{1/2})]ⁿ e^{−c_n²} c_n^l {1 + l c_n^{−2} E(1 + G/c_n²)^{l−1}},

where G has density e^{−g} for g > 0, 0 otherwise. Finally

(A5)    [φ(c_n/n^{1/2})]ⁿ = e^{n log φ(c_n/n^{1/2})} = e^{c_n²/2 + O(c_n³/n^{1/2})}

by (A1). Combining (A4) and (A5) gives (A3) with

(A6)    o_n(1) = O(c_n³/n^{1/2}) + log{1 + l c_n^{−2} E(1 + G/c_n²)^{l−1}},

where we now use c_n = o(n^{1/6}), c_n → ∞.


Now let x₁, x₂, ⋯, x_n, ⋯ be independent identically distributed random vectors, dimension k, Ex_i = 0, Cov x_i = I, such that φ(t) ≡ Ee^{t′x_i} exists for ||t|| ≤ t₀, some positive constant. For any unit vector v define x_{iv} ≡ v′x_i. Then (A3) holds with x̄_n replaced by x̄_{nv}. The term o_n(1) is defined as in (A6), with the big O term being the one in the expression log φ(t) = ||t||²/2 + O(||t||³). (Notice that o_n(1) does not depend on v.) (10.26) now follows easily.

LEMMA. If |E₀e^{it′x}|^p is integrable as a function of t for some p ≥ 1, then g_n(z), the density of z = n^{1/2}x̄_n, exists and satisfies

(A7)    g_n(z) ≤ 2^{(k+2)/2}(2π)^{−k/2} e^{−(||z||/8) min{2c_n, ||z||} + o_n(1)},

c_n = o(n^{1/6}), c_n → ∞.

PROOF. Consider the univariate case, with n even. Define

(A8)    h(z) ≡ ∫ g_{n/2}(w) g_{n/2}(z − w) dw = ∫_{w≤z/2} g_{n/2}(w)g_{n/2}(z − w) dw + ∫_{w>z/2} g_{n/2}(w)g_{n/2}(z − w) dw.

Here g_{n/2}(z), the density of (n/2)^{1/2}x̄_{n/2}, is known to exist and to converge uniformly to (2π)^{−1/2} exp(−z²/2); see page 244 of Gnedenko and Kolmogorov (1954). Thus M_{n/2} ≡ sup_z g_{n/2}(z) = (2π)^{−1/2} + o_n(1), so for 0 < z ≤ c_n

h(z) ≤ M_{n/2}{∫_{−∞}^{z/2} g_{n/2}(z − w) dw + ∫_{z/2}^∞ g_{n/2}(w) dw}
    ≤ 2M_{n/2} e^{−(z/8) min{2c_n, z} + o_n(1)},

where we have used the bound P{n^{1/2}x̄_n ≥ z} ≤ exp[−(z/2) min{c_n, z} + o_n(1)], obtained by setting y = z/n^{1/2} and s = min{z/n^{1/2}, c_n/n^{1/2}} in (A2). But g_n(z) = 2^{1/2}h(2^{1/2}z), giving (A7). The same proof with trivial modifications works for n odd. For


the multivariate case the integrals in (A8) are over R₁ = {w : z′w ≤ ||z||²/2} and R₂ = {w : z′w > ||z||²/2}.
Remark 1 of Section 9 follows because (A7) makes step (v) of the heuristic
proof valid. All the other approximations involved in the proof are handled by
power series expansions and the bounding arguments of Remark 12, Section 10.

REFERENCES

[1] ANDREWS, D. F., BICKEL, P. J., HAMPEL, F. R., HUBER, P. J., ROGERS, W. H., and TUKEY, J. W. (1972).
Robust Estimates of Location. Princeton Univ. Press.
[2] BARNETT, V. D. (1966). Evaluation of the maximum likelihood estimator when the like-
lihood equation has multiple roots. Biometrika 53 151-165.
[3] CHERNOFF, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based
on the sum of observations. Ann. Math. Statist. 23 493-507.
[4] DAVIES, R. B. (1969). Beta optimal tests and an application to the summary evaluation of
experiments. J. Roy. Statist. Soc. Ser. B 31 524-538.
[5] DAVIES, R. B. (1971). Rank tests for Lehmann's alternative. J. Amer. Statist. Assoc. 66
879-883.
[6] EFRON, B. and TRUAX, D. (1968). Large deviations theory in exponential families. Ann.
Math. Statist. 39 1402-1424.
[7] FISHER, R. A. (1925). Theory of statistical estimation. Proc. Cambridge Philos. Soc. 22
700-725.
[8] GNEDENKO, B. V., and KOLMOGOROV, A. N. (1954). (Translated by K. Chung.) Limit Dis-
tributions for Sums of Independent Random Variables. Addison-Wesley, Cambridge,
Mass.
[9] HALDANE, J. B. S. and SMITH, S. M. (1956). The sampling distribution of a maximum
likelihood estimate. Biometrika 43 96-103.
[10] RAO, C. R. (1961). Asymptotic efficiency and limiting information. (J. Neyman, Ed.).
Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1 531-545. Univ. of California Press.
[11] RAO, C. R. (1962). Efficient estimates and optimum inference procedures in large samples.
J. Roy. Statist. Soc. Ser. B. 24 46-72.

[12] RAO, C. R. (1963). Criteria of estimation in large samples. Sankhya 25 189-206.


[13] STRUIK, D. J. (1950). Differential Geometry. Addison-Wesley, Reading, Mass.

DEPARTMENT OF STATISTICS
STANFORD UNIVERSITY
STANFORD, CALIFORNIA 94305

DISCUSSION ON PROFESSOR EFRON'S PAPER

Professor Efron's paper was presented at the 1974 Annual Meeting of the Institute of Mathematical Statistics at Edmonton, Alberta. Professors D. R. Cox, A. P. Dawid, J. K. Ghosh, N. Keiding, L. M. Le Cam, D. V. Lindley, D. A. Pierce, C. R. Rao and J. Reeds were invited discussants. The Editor greatly appreciates the willing assistance of Professor Efron as well as the discussants in arranging this discussion paper. Professor Rao's remarks arrived after the author's reply to the discussion was received and are not referred to for that reason.


C. R. RAO

Indian Statistical Institute, New Delhi

I am delighted to see the paper by Bradley Efron and also the paper by J. K.
Ghosh and K. Subrahmaniam (Sankhya A, 1975 36 325-358) on the subject of
second order efficiency. Having worked for some time on second order efficiency of estimators, I was aware of the importance of measures of how closely a given model can be approximated by an exponential family {f : f = C(θ) exp[K(θ)T(x)]}.
Measures of this sort are of course closely related to what Professor Efron calls
the curvature of a statistical problem. What is quite new about Professor Efron's
measure is its invariance under smooth 1 - 1 transformations and the elegant
geometric interpretation which makes the term so apt and illuminating and pro-
vides new tools and insights into the subject.
My endeavour in this area was motivated by two results in the literature on
estimation which seemed to contradict Fisher's claims about MLE's (maximum likelihood estimators). One is the concept of super efficiency, according to which
MLE is not efficient in the sense defined by Fisher. Another is the concept of
BANE (best asymptotically normal estimator), according to which ML is only
one out of a very wide class of estimation procedures.
The first task was to redefine the concept of efficiency of an estimator since
its asymptotic variance is a poor indicator of its performance in statistical in-
ference. To do this it is necessary to see how well an optimum inference pro-
cedure based on a given estimator Tn alone compares with that based on all the
observations. Following Fisher's ideas, I thought it is relevant, at least in large
samples, to consider the score function l̇(θ) (see Efron's paper for notations) as basic to all inference problems. Then the problem reduces to examining how closely l̇(θ) and T_n are related. Under the additional condition that T_n is consistent for θ, T_n was defined to be first order efficient if

(1)    plim_{n→∞} n^{−1}|l̇(θ) − nα − nβ(T_n − θ)| = 0.


There are a large number of estimators which are first order efficient. To distinguish among them, it is natural to examine the rapidity of convergence in (1), which led to the consideration of the random variable (rv)

(2)    l̇(θ) − nα − nβ(T_n − θ),

which is n times the rv in (1). The asymptotic variance of (2) was defined as the second order efficiency. Instead of (2) we may consider

(3)    l̇(θ) − nα − nβ(T_n − θ) − (λn/2)(T_n − θ)²

and define its minimum asymptotic variance for a proper choice of λ as the second order efficiency. Fisher suggested the use of

(4)    lim_{n→∞} (n i − i_{T_n})

to distinguish between alternate estimators, but the evaluation of (4) is extremely difficult.


The definition arising out of (3) was criticised as not being directly related to an inference problem, although it attempts to examine how closely T_n approximates l̇(θ). This led to another definition of second order efficiency through the expansion (under some conditions) of the variance of T_n after correction for bias,

(5)    V(T_n) = 1/(ni) + φ(θ)/n² + o(n^{−2}).

The quantity φ(θ) was considered as a measure of second order efficiency; the major component of φ(θ) was the measure based on (3).
With this background, the work of Efron is valuable in many ways.
With this background, the work of Efron is valuable in m

(i) The results due to Fisher and me were confined to multinomial distribu-
tions. Efron, and also Ghosh and Subrahmaniam extend the results to a wider
class of distributions.
(ii) Efron relates second order efficiency to what he calls curvature of a
statistical problem, which appears to be natural and throws further light on
problems of inference (providing, for instance, an intimate connection between
curvature and properties of test criteria).
(iii) Efron provides a decomposition of Ob(O) in (5), which is extremely
interesting.
(iv) Efron suggests the use of a most powerful test at a suitably chosen al-
ternative in preference to a locally most powerful test, which seems to be an
attractive idea worth pursuing.

No doubt Efron's work has led to considerable clarification of second order


efficiency and its relevance in problems of inference. However, there are many
problems which require deeper investigation.

(i) Efron shows by an example that measures of second order efficiency


based on (3) and (4) can be different. In fact, as he observes, it may be shown
(from definition) that the measure based on (4) is smaller than that on (3). But
the question remains: under what conditions are the two measures the same,
and is the MLE efficient under the measure (4)?
(ii) I have considered Fisher's score function l̇(θ) as basic in problems of inference. Perhaps, following Barnard and Sprott, one should consider the likelihood l(θ) itself. How should efficiency of T_n be defined in such a case?
(iii) How can the result based on quadratic loss function as in (5) be extended
to more general loss functions?

DON A. PIERCE

Oregon State University

I think that I am not alone in having had great difficulty with the reasoning of Fisher's 1925 paper. Professor Efron's elegant contribution to clarifying these ideas is very helpful.


The part of Fisher's paper which has intrigued and puzzled me most is the final section in which he suggests the use of l̈_θ̂(x), in Efron's notation, as an ancillary statistic. I would like to indicate here how the geometry of this paper helps clarify this, although there are many details yet unclear to me.

It is characteristic of "curvature" that −l̈_θ̂(x) ≠ ni_θ̂. In fact, one can always parameterize so that Cov_θ(l̇_θ, l̈_θ) = 0, and then γ_θ² = Var_θ(l̈_θ)/i_θ². Fisher seems to suggest using −l̈_θ̂(x), rather than i_θ̂, as a post-data measure of precision of θ̂. This is also suggested by standard asymptotic Bayesian arguments, but the sampling theory justification has never been clear to me. Such use of l̈ would be significant relative to the order of n^{−2} of approximation to Var(θ̂) considered in this paper, for −l̈_θ̂ = ni_θ̂ + O(n^{1/2}) and thus −1/l̈_θ̂ = 1/(ni_θ̂) + O(n^{−3/2}).
The geometrical structure exposed in this paper is indeed very helpful in understanding the role of l̈_θ̂ as an ancillary statistic. For a curved exponential family of dimension k think of the mapping from the sample point x̄ ∈ E_k to the MLE point λ_θ̂, where λ_θ = E_θ(x), as an orthogonal projection (relative to Σ^{−1}) first to a point x̂ in the local osculating plane of the curve λ_θ, and then a projection from x̂ to λ_θ̂. The argument below suggests that (−l̈_θ̂(x) − ni_θ̂)/ni_θ̂ is a useful measure of the signed distance from x̂ to the curve λ_θ, positive when x̂ is on the outside of the curve. This is useful ancillary information because the projection from x̂ to λ_θ̂ is a contraction (resp. expansion) mapping when x̂ is on the outside (resp. inside) of the curve λ_θ. The extent of this contraction is a function of the distance from x̂ to the curve λ_θ, as measured by the above statistic. Thus the conditional precision of θ̂, given l̈_θ̂(x), is either greater or less than the unconditional precision. Furthermore, it appears plausible that the component of x̂ orthogonal to the curve λ_θ at λ_θ̂ is itself uninformative regarding the value of θ.

More precisely, consider the situation of Figure 4 with the additional assumption that θ is a choice of parameter such that Cov_θ(l̇_θ, l̈_θ) = 0. The point (x̄₁, x̄₂) corresponds to the x̂ of the above discussion. It follows directly from (6.3) and the relations given in the second paragraph after (9.2) that

x̄₁ = l̇₀/(n i₀^{1/2}),    x̄₂ = −[−l̈₀ − n i₀]/(n i₀ γ₀).

Near the origin the curve λ_θ is approximately a segment of a circle with center at e₂/γ₀, and the arc distance of λ_θ̂ from the origin is to first order i₀^{1/2}θ̂. Proportionality of arc lengths to radii gives

i₀^{1/2}θ̂/x̄₁ ≈ (1/γ₀)/(1/γ₀ − x̄₂) = (1 − γ₀x̄₂)^{−1},

so

(1)    θ̂ ≈ (x̄₁/i₀^{1/2})(1 − γ₀x̄₂)^{−1} = (x̄₁/i₀^{1/2})[1 + (−l̈₀ − ni₀)/ni₀]^{−1} = (x̄₁/i₀^{1/2})[ni₀/(−l̈₀)].

Equation (1) can be seen to agree with the rigorously established (10.5) of the paper, where μ₁₁ = 0 since ν₁₁ = 0.


Thus we have

(2)    Var(θ̂ | l̈₀) ≈ (1/ni₀)[ni₀/(−l̈₀)]² = [ni₀/(−l̈₀)][1/(−l̈₀)].

Since −l̈₀ = ni₀ + O(n^{1/2}) this expression can be either greater or less than 1/(ni₀) by an amount O(n^{−3/2}).

I do not know the effect of conditioning on l̈_θ̂ rather than l̈₀, nor can I see whether 1/(−l̈_θ̂), as suggested by Fisher, is a good approximation to Var(θ̂ | l̈_θ̂). Note that the expression in (2) differs by O(n^{−3/2}) from 1/(−l̈₀). I also do not know the effect of relaxing the assumption that one has parameterized so that Cov_θ(l̇_θ, l̈_θ) = 0.

It appears, then, that the curvature γ_θ is essentially the standard deviation of an approximately ancillary statistic. This interpretation might have a number of advantages over that furnished by relations such as (1.1) and (10.1). Loosely put, the degree of curvature relates to the amount of information in the sample which is not captured by the MLE; information in a sense regarding not θ but rather the precision of θ̂. Moreover, this information can be largely recovered through appropriate use of l̈_θ̂.

D. R. Cox

Imperial College, London

Dr. Efron's impressive paper throws much light on a longstanding problem.


I will confine my comments to one aspect that he has not treated. For an ap-
proach to statistical inference in which evidence in unique sets of data is inter-
preted via frequencies in hypothetical repetitions, appropriate conditioning is
important, at least theoretically, in making the hypothetical repetitions relevant
to the data under study. Thus for the translation family, Example 4, Fisher
(1934) provided a simple definitive solution to inference about 0 by conditioning
on the ancillary statistic, the set of differences among order statistics. This leads
to the use of normalized likelihood as giving confidence limits. Curvature here
measures the variation among the different kinds of likelihood functions that
can arise. It would be useful to make this more specific and to draw any im-
plications about the comparison of conditional and unconditional inference.
More importantly, what are the implications of conditional inference for some
of the other problems, for instance Example 1? Here, if x = (x₁, x₂), then x₂ − γ₀(x₁² − 1) is approximately ancillary in some sense, at least for small γ₀. Existence of an approximate ancillary must be connected with the approximate constancy of γ_θ as a function of θ; it would be good to have the connexions explored.

REFERENCES

FISHER, R. A. (1934). Two new properties of mathematical likelihood. Proc. Roy. Soc. Ser. A.
144 285-307.


D. V. LINDLEY

The University of Iowa

My first comment is to repeat the point made in discussing C. R. Rao's (1962)


paper, namely that it is doubtful whether any general measure of second-order
efficiency is possible. The reason for suggesting this is that an admissible estimate
is typically, to order n-1, equivalent to the maximum likelihood estimate, for a
wide class of loss functions: but to order n-2 its asymptotic form depends on
some features of the loss structure. Consequently the second-order "correction"
to the maximum likelihood estimate typically depends on the loss structure, as
does its efficiency. The point is discussed more fully in Lindley (1961).
Efron's thought-provoking paper does not introduce curvature solely for
second-order efficiency properties; nevertheless the definition of curvature he
proposes suffers from a defect in some statistical problems. The defect arises
from the fact that it involves an integration over sample space and thereby vio-
lates the likelihood principle. Put it this way: suppose we have some data x
and its associated likelihood function, l_θ(x), then, according to Efron, we have
to consider what other data we might have had, but did not, before any inference
can be made. These data are needed before the integrations, symbolized by Eo
in the paper, can be performed. That such data are needed is puzzling and any
reasonable axiomatization of inference seems to deny their relevance. The author
tacitly assumes that the other data are samples of the same size, but many prac-
tical problems do not naturally fit into this framework. Even the notation helps
to reinforce this view. Likelihood is a function of 0 for fixed x and yet Efron
lowers the status of the variable to that of a subscript and the constant appears
in the place customarily reserved for the argument. The notation l(θ | x) is surely
to be preferred.
An example of the misuse of the integration is provided by the discussion of
the t-translation family [Example 4 of Section 7: see also the remark after (8.3)].
If samples are taken from a t-distribution with low degrees of freedom, then it
will be found that a substantial majority of them look very like samples from a
normal distribution-the comparison being made through the t- and normal
likelihoods. It is only rarely (how rarely depends on f and n) that a sample
arises which is clearly nonnormal and its log-likelihood is markedly not quadratic.
But because of the integration, or averaging, over all samples, these "peculiar"
samples get put in with the "normal" ones and nonstandard estimates proposed.
Looked at without prejudice, I think you will find this is a surprising thing to
do. The argument can be extended to query whether it is reasonable to look for
a point estimate in the "peculiar" cases: for example, when the likelihood is
bimodal. I would go further and suggest that point estimation is not a good
model for any inference procedure, though it does occasionally occur in a decision
context. Estimation is solved by describing the likelihood function or the pos-
terior distribution.


These criticisms have less force before the data, x, are to hand. If it is a
question of experimental design, or choice of a survey sample, then naturally
one has to consider what data might be obtained, and integration becomes natural
and necessary. Hence curvature could have a place in these fields and it would
be interesting to see whether, in some sense, linear designs were better than
"curved" ones. However, the argument of my first paragraph would show that
if a terminal (as distinct from design) decision problem is contemplated after the
experimentation, then the choice of design would again involve a loss function,
so that no general measure seems possible. Some experiments are not associated
with terminal decisions and are genuinely inferential in character. In these one
is collecting information about parameters and Shannon's measure is essentially
the only one to use. I have tried to see whether some second-order expansion
of it might lead to anything analogous to Efron's curvature, but without success.

REFERENCES

LINDLEY, D. V. (1961). The use of prior probability distributions in statistical inference and
decisions. Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1 453-468.

LUCIEN LE CAM

University of California, Berkeley

Professor Bradley Efron is to be congratulated for a clear and informative


discussion of the differential properties of families of measures. The paper is
certainly a step in the right direction. However, as I shall try to explain below,
much remains to be done.
The paper tends to give the impression that the curvature measures the loss
of information sustained by using a one dimensional summary of the data. This
is perhaps so if "information" is measured by Fisher's number. However, one
can define other measures of loss of information more directly in terms of per-
formance in testing or other decision problems. See for instance E. N. Torgersen
(1970). These definitions are usable for arbitrary families, whether or not they
are smoothly differentiable.
It can probably be shown that these other measures of loss of information are
related to Fisher's numbers in certain special situations, but not in general. One
could roughly say that Torgersen's formula for testing deficiencies relies on finite
differences instead of relying on the first and second derivatives used to compute
curvatures. Efron's curvature has the merit of being easily computable, but one
should not take it for granted that computations with differences, which may
be difficult, should not be attempted.
The part of the paper which relates to the presumed excellency of maximum
likelihood estimates should be taken with a great deal of caution. It is easy to
modify Bahadur's example (1958) to construct one parameter families of densities
which are infinitely differentiable, satisfy all kinds of reasonable conditions
locally but are such that, when the number of observations tends to infinity,


the maximum likelihood estimate always converges to infinity, no matter what


the true value of 0 is.
It is also easy to find exponential families where, for reasonable numbers of
observations, maximum likelihood estimates are difficult to compute and defi-
nitely worse (in the sense of expected square deviations) than some readily
available alternatives. An example occurs in bioassay using the logit method (see
Berkson (1951)). Another example with an interesting discussion is given by
T. S. Ferguson (1958).
Finally, it seems that the entire asymptotic argument relies essentially on a
replacement of the actual logarithm of likelihood ratio by a suitable approxima-
tion which is quadratic in 0.
If this is indeed the case, the technique of using a preliminary estimate, fitting
a quadratic around the estimated value and then maximizing the quadratic should
give the same asymptotic results. Preliminary considerations suggest that this
technique may well work better than straight maximum likelihood estimation
in the finite sample situation.

REFERENCES

[1] BAHADUR, R. R. (1958). Examples of inconsistency of maximum likelihood estimates.


Sankhya 20 207-210.
[2] BERKSON, J. (1951). Relative precision of minimum chi-square and maximum likelihood
estimates of regression coefficients. Proc. Second Berkeley Symp. Math. Statist. Prob.
471-479. Univ. of California Press.
[3] FERGUSON, T. S. (1958). A method of generating best asymptotically normal estimates with
application to the estimation of bacterial densities. Ann. Math. Statist. 29 1046-1062.
[4] TORGERSEN, E. N. (1970). Comparison of experiments when the parameter space is finite.
Z. Wahrscheinlichkeitstheorie und Verw. Gebiete 16 219-249.

J. K. GHOSH

Indian Statistical Institute, Calcutta

Thanks to my work on second order efficiency, I was aware of the significance


of the quantity which Professor Efron calls the curvature of a statistical problem.
What enhances the importance of it is the elegant geometric interpretation of it,
which affords new techniques and deeper insight into the problem.
It is natural to expect that this quantity also plays an important role in as-
ymptotic problems of testing hypotheses. By considering a number of examples
of curved exponential families, Professor Efron has shown that this is indeed
the case and unless curvature is small such commonly used methods as maximis-
ing the local power perform rather poorly for moderate sample sizes. Pfanzagl
(1974) has arrived at the same conclusion. (Pfanzagl's D = (curvature)²/4.)
Probably even more interesting than this is the suggestion by both Pfanzagl
and Efron to use a suitable most powerful test instead of a locally most power-
ful test when the curvature is appreciable. Following Davies, Professor Efron
suggests the use of a most powerful test against an alternative O1 such that its

This content downloaded from 141.211.4.224 on Thu, 09 May 2019 19:05:26 UTC
All use subject to https://about.jstor.org/terms
CURVATURE OF A STATISTICAL PROBLEM 1225

power at 81 is about .8 and recommends the thumb rule


These suggestions must be tried out in lots of problems involving curved
families to see if one does get reasonable tests this way even for moderate sam-
ples. (Pfanzagl (1974) provides some criteria for comparing two tests.) I report
below some calculations for a curved nonexponential family, namely, the Cauchy
with unknown location parameter. To make matters worse, I take sample size
N= 1.

Suppose then that I have a random variable X with density 1/π[1 + (x − θ)²] and want to test H₀(θ = 0) vs. H₁(θ > 0). Let φ₀ be the β-optimal test of Davies and φ₁ the test: reject H₀ iff X > C. The test φ₁ is most powerful against θ = 2C and seems to me a reasonable one. For α = .05, φ₀ is most powerful against θ = 5 (approximately) and φ₁ is most powerful against θ = 13 (approximately). The following table compares φ₀ and φ₁.

            θ = 5     θ = 13
    φ₀        .8        .06
    φ₁        .2        .95

If α = .2, φ₀ and φ₁ essentially coincide: θ = 2C is then both the alternative against which φ₁ is most powerful and the alternative at which its power is about .8, and it is hard to draw any conclusion.
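The comparison can be checked numerically. In the sketch below (Python), φ₀ is taken to be the Neyman-Pearson most powerful level-.05 test against θ = 5 (a stand-in for Davies' test, not his exact construction), with its likelihood-ratio cutoff found on an ad hoc grid; the computed powers come out near, though not exactly at, the values quoted above:

```python
import numpy as np
from scipy.stats import cauchy
from scipy.optimize import brentq

alpha = 0.05
x = np.linspace(-2000.0, 2000.0, 2_000_001)   # integration grid (ad hoc range)
dx = x[1] - x[0]
f0 = cauchy.pdf(x)

def np_region(theta1):
    # Neyman-Pearson rejection region {x: f(x - theta1)/f(x) > k},
    # with k chosen so the region has probability alpha under theta = 0
    lr = cauchy.pdf(x - theta1) / f0
    size = lambda k: np.sum(f0 * (lr > k)) * dx - alpha
    k = brentq(size, 1e-6, lr.max() - 1e-9)
    return lr > k

def power(region, theta):
    return np.sum(cauchy.pdf(x - theta) * region) * dx

phi0 = np_region(5.0)              # stand-in for phi_0: MP against theta = 5
C = cauchy.isf(alpha)              # phi_1: reject iff X > C (about 6.31)

print(power(phi0, 5.0), power(phi0, 13.0))      # about .73 and .04
print(cauchy.sf(C - 5.0), cauchy.sf(C - 13.0))  # about .21 and .95
```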
It is not difficult to come up with analogues of curvature when one has more
parameters than one. Extension of the results due to Rao and Fisher to multi-
parameter families is provided in Ghosh and Subramanyam (1974). But it is
now necessary to study testing problems of composite hypotheses along the lines
of investigation carried out by Efron and Pfanzagl for simple hypotheses.
How relevant is curvature for a Bayesian? Ghosh and Subramanyam (1974)
have shown how one can construct a Bayesian proof of the second order efficiency
of the MLE. What is lacking and would be useful to have is a study of relevance
of curvature in Bayesian analysis. The difficulty here is that one cannot think
of any simple and convincing reason why a Bayesian would prefer the linear
exponential families to nonexponential ones. All is grist that comes to the mill
of the lucky man who not only has a prior but knows what it is.
It is a little disappointing, though not really surprising in retrospect, that
curvature has nothing to do with the geometrical curvature of the likelihood
curves. Curvature is, however, useful in the problems that Sprott (1973) discusses. For it is easy to show that his two approaches of minimizing F_E(θ) or F(θ) (in his notations) coincide iff one has a linear exponential family. (This statement is true provided the MLE satisfies the likelihood equation with probability one for all θ.) For example (2.3) of Sprott (1973), the curvature is fairly large for x̄ near .5 and so Sprott's transformation which minimizes F_E(θ) may not be efficient in normalising the likelihood for x̄ near .5. Incidentally, I suspect that
for small curvature one can reparametrize in such a way that the approach of
a posterior to normality, guaranteed by the Bernstein-von Mises theorem, would


be faster with the new parameter than with the original. (This may be an answer
to the question of relevance of curvature for a Bayesian.)
It may be worth pointing out here that the results of Pfanzagl (1973) and those
of Fisher and Rao (i.e. results like (10.1) of Efron) are not really comparable.
In fact for all the efficient estimators considered by Efron or Ghosh and
Subramanyam (1974), inequality (6.4) of Pfanzagl (1973, page 1005) reduces to
an equality. This result, which is not very hard to show, will appear in Ghosh
and Srinivasan (1975).
Finally, a question suggested by the beautiful counter example of Professor
Efron. Is there any example such that among the Fisher consistent efficient
estimators the MLE does not minimize the loss in Fisher's information for all
values of 0? It seems reasonable to expect that such examples do exist.

REFERENCES

[1] GHOSH, J. K. and SUBRAMANYAM, K. (1974). Second order efficiency of maximum likelihood
estimators. Sankhya Ser. A. (To appear).
[2] GHOSH, J. K. and SRINIVASAN, C. (1975). Asymptotic sufficiency and second order efficiency.
Unpublished.
[3] PFANZAGL, J. (1973). Asymptotic expansions related to minimum contrast estimators. Ann.
Statist. 1 993-1026.
[4] PFANZAGL, J. (1974). Nonexistence of tests with deficiency zero. University of Cologne,
preprint in Statistics #8.
[5] SPROTT, D. A. (1973). Normal likelihoods and their relation to large sample theory of
estimation. Biometrika 60 457-465.

J. PFANZAGL

University of Cologne

In hypothesis testing, one-parameter exponential families are distinguished by


the fact that for one-sided alternatives uniformly most powerful tests exist for
arbitrary sample sizes. For other families, the test has to be chosen with par-
ticular alternatives in mind. It is intuitively clear that the dependence of the
test on these particular alternatives will be weak if the family is close to an ex-
ponential one. Is it possible to measure "nonexponentiality" (for this and other
purposes) by a single quantity? Mr. Efron's suggestion to use the "curvature"
γ_θ for this purpose is based on a geometric analogy. Therefore, its usefulness
for statistical theory is not obvious in advance. It is the purpose of this note
to draw attention to some results of asymptotic theory where the function γ_θ
has been in use already for some time. Whether curvature admits an easy statis-
tical interpretation in nonasymptotic theory seems doubtful.
"Nonexponentiality" implies in particular that a LMP (locally most powerful)
test will not be MP (most powerful) against the statistically reasonable alterna-
tives. The author uses a particular example to support his claim (see end of
Section 8) that γ_θ² is a good predictor for the relative performance of the LMP
test compared to the test which is MP against a specific alternative. In this


connection he suggests that the difference in power is often negligible. For the case of a sample of n i.i.d. variables this entails that the difference in power can be neglected if the sample size exceeds 8γ_θ² (see 8.3).
Since this rule is rather arbitrary, the reader should be aware of other results
which make the role of rO more clear. These results concern the case of n i.i.d.
variables, the distribution of which is nonatomic and sufficiently regular (as a function of θ). To define for a given level α-test the "deficiency at rejection level β" we determine first the alternative closest to the hypothesis which can be rejected with probability β by some level α-test. (The test for which this is achieved is called β-optimal.) In order to reach rejection probability β for this alternative with the given test, the sample size has to be increased. The additional number of observations needed for this purpose is the "deficiency at rejection level β."
For the LMP α-test the deficiency at rejection level β is asymptotically equal to

(1)    (1/4)γ_θ²(N_β − N_α)² + o(n⁰),

where N_β is the β-quantile of the standard normal distribution. (See Chibisov (1973, Corollary 2) and Pfanzagl (1973, Section 8, formula 24; 1975, Proposition 1, formula 6.2).)
This result enables one to check whether the rule suggested by (8.3) is reasonable. For α = .01 and β = .99 the deficiency is 5.4γ_θ² + o(n⁰). Mr. Efron suggests in (8.3) not to worry about curvature if n ≥ 8γ_θ². To follow this suggestion and to use a LMP test instead of a β-optimal test could mean to waste more than half of the sample.
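The arithmetic is quickly verified; the sketch below (Python) assumes formula (1) with the constant 1/4 as reconstructed above:

```python
from scipy.stats import norm

alpha, beta = 0.01, 0.99
N = norm.ppf                         # quantile function of the standard normal
deficiency_per_gamma2 = 0.25 * (N(beta) - N(alpha)) ** 2
print(deficiency_per_gamma2)         # about 5.41: deficiency ~ 5.4 * gamma^2

# Efron's rule (8.3) stops worrying once n >= 8*gamma^2; the LMP deficiency
# is then about 5.4/8, i.e. roughly two thirds, of that sample size.
print(deficiency_per_gamma2 / 8)
```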
The following is another asymptotic result (for nonatomic families) illustrating the statistical relevance of curvature. If a sequence of tests is β₀-optimal, then its deficiency at rejection level β is at least

(2)    (1/4)γ_θ²(N_β − N_{β₀})² + o(n⁰)

(see Pfanzagl 1975, Corollary 2, formula 6.5). Hence tests which have asymptotic deficiency zero for more than one alternative exist only if the curvature is zero.
In another attempt to demonstrate the statistical relevance of "curvature,"
Mr. Efron refers to a result of Fisher (see (9.1)). Mr. Efron is careful enough
not to follow Fisher's abuse of language using a suggestive word for a mathe-
matical construct (such as "information" or "likelihood") without paying any
attention to the question whether the interpretation thus suggested is meaningful
from the operational point of view.
A statement like "Since a single observation contains an amount i_θ of information this [namely the use of a MLE instead of the whole sample] is equivalent to a reduction in effective sample size from n to n − γ_θ² ⋯" (see beginning of Section 9) is misleading, at least, since for nonatomic families the level α-test based on the MLE has asymptotic deficiency zero at the rejection level (1 − α), and not asymptotic deficiency γ_θ², as the statement quoted above might suggest.


(See Chibisov 1973, Corollary 3, or Pfanzagl 1973, Section 8, or Pfanzagl 1975, end of Section 6.) Probably the statement quoted above is
meant as the interpretation Fisher himself would give to (9.1). Since this
interpretation is unjustified, how can (9.1) convince the reader that "curvature"
is statistically significant?

REFERENCES

[1] CHIBISOV, D. M. (1973). Asymptotic expansions for some asymptotically optimal tests. Proc.
Prague Symp. Asymptotic Statist. 2 37-68.
[2] PFANZAGL, J. (1973). Asymptotically optimum estimation and test procedures. Proc. Prague
Symp. Asymptotic Statist. 1 201-272.
[3] PFANZAGL, J. (1975). On asymptotically complete classes. Statistical Inference 2 1-43, M.
Puri, ed. Academic Press.

NIELS KEIDING

University of Copenhagen

1. An important feature of Efron's paper is the study of the loss of information resulting from summarizing the data in n replications X₁, ⋯, X_n of a multivariate random variable into a one-dimensional statistic T(X) = T(X₁, ⋯, X_n). In most of the paper it is assumed that the X_i's are observable and that their distribution belongs to an exponential family of which the statistical model forms a "curved subset", in the sense of the mean value parametrization. The basic result in this connection is formula (9.3), stating that the information loss from n replications is

n i_θ − i_θ^T = E_θ Var_θ {l̇_θ(X) | T},

where for T = θ̂, the right hand side is i_θγ_θ², independent of n. (Notice that it is an implicit consequence of this that θ̂ cannot itself have the form Σ t(X_i).)
A somewhat related problem is that of incomplete observation of an exponential
family, where the statistician is "forced" to work with nonsufficient reduction
of data. It is here assumed that the statistical problem is specified in terms of
an exponential family where only a function Y = Y(X) of each component may
be observed. If Y is a linear function of the canonical statistic X, there seems
to be a canonical way of decomposing the parameter vector into an efficiently
estimable part and a nonidentifiable part, using the concepts of "mixed para-
metrization" and "cut" introduced and further studied by Barndorff-Nielsen
(1973, 1974) and Barndorff-Nielsen and Blaesild (1975), and in the case of con-
tinuously distributed random variables this seems to hold as soon as the level
curves of Y are hyperplanes. Asymptotic results for arbitrary "curved" func-
tions Y were given by Sundberg (1974) who points out that the same formula as
above applies for the information loss, which here in general will be of order n.
It is clear that the two situations might be combined: a "curved" model with
incomplete observation. An example of this was discussed by Fisher (1958,
Section 57.1).


2. The relation (10.1) for the asymptotic variance of any consistent and efficient estimator θ̂ contains the term Δ_θ, which is always nonnegative and zero for the MLE. This quantity was computed by Rao (1963) for several estimation methods in the multinomial distribution, as noted by Efron. It would be interesting if some geometrical interpretation, or at least a bit more transparent expression than (10.24), could be given for this quantity, which must be related to the intuitive discussion by Fisher (1958, Section 57) of "the contribution to χ² of errors of estimation".

3. Curved exponential families occur frequently in population process and life testing models leading to occurrence/exposure estimates of birth or death intensities. One familiar example is that of estimating the mean λ^{−1} of an exponential distribution from a sample of n, censored at a fixed point t. If D is the number of variables less than t, and S the sum of these + (n − D)t, then the likelihood function is λ^D e^{−λS}, yielding λ̂ = D/S.
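This censored-exponential MLE is easy to simulate; a minimal sketch (Python; λ, t, n and the sample sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t, n, reps = 1.5, 1.0, 50, 20_000
x = rng.exponential(1.0 / lam, size=(reps, n))
D = (x < t).sum(axis=1)                  # failures observed before time t
S = np.minimum(x, t).sum(axis=1)         # total time on test, censored at t
lam_hat = D / S
print(lam_hat.mean(), lam_hat.std())     # mean near 1.5, O(n^{-1/2}) spread
```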
We shall here comment a little upon the similar example of estimating the birth intensity λ in a pure (linear) birth process (X_t) from continuous observation of the process in [0, t]. See Keiding (1974) for details of the problem. Assuming X₀ = x₀, degenerate, the likelihood is

λ^{X_t − x₀} e^{−λS_t}

with S_t = ∫₀^t X_u du. Setting B_t ≡ X_t − x₀, the maximum likelihood estimator


is λ̂ = B_t/S_t. It is readily seen that the Fisher information is

i_λ = x₀(e^{λt} − 1)/λ²

and the statistical curvature γ_λ is given by

γ_λ² = (1/x₀) · e^{λt}[(e^{λt} − 1)² − (λt)²e^{λt}]/(e^{λt} − 1)³.

In the spirit of the paper, we quote some values of γ_λ² (x₀ = 1) in Table 1.
Two asymptotic schemes are inviting: large initial population size (x₀ → ∞)
for fixed t and large observation period (t → ∞) for fixed x₀. Being a branching
process, a birth process with X0 = x0 may be interpreted as a sum of x0 birth
processes with X0 = 1 and the same A. Therefore the first scheme is still within
the realm of independent identical replications, and may be treated with the
methods of Efron's paper. This was done by Beyer, Keiding and Simonsen (1975)
for this case as well as for the life-testing situation outlined above.
The second scheme, however, is a "real" stochastic process situation, and we
encounter here the trouble that the minimal sufficient statistic is not consistent,

TABLE 1
Statistical curvature for the birth process with x₀ = 1

λt       0      0.1      0.5      1        2        5        ∞
γ_λ²     0      0.009    0.052    0.125    0.319    0.835    1
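Table 1 can be recomputed from the closed form displayed above (itself a reconstruction); a short sketch (Python), with u = λt and x₀ = 1:

```python
import numpy as np

def gamma2(u, x0=1.0):
    # curvature^2 of the pure birth process as a function of u = lambda*t
    eu = np.exp(u)
    return eu * ((eu - 1.0) ** 2 - u * u * eu) / (x0 * (eu - 1.0) ** 3)

for u in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(u, round(float(gamma2(u)), 3))  # 0.009, 0.052, 0.125, 0.319, 0.835
```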


FIG. 1. The canonical sample space of the birth process estimation problem. The
curve is the statistical model corresponding to 0 < λ < ∞ (mean value parametrisa-
tion). The full-drawn line is the set of points for which B_t = λS_t, where λ is the
"true" value, and the broken line is the set where B_t = λ̂S_t.

in fact, as t → ∞,

e^{−λt}(B_t, S_t) → (1, λ^{−1})W

almost surely, where the random variable W is gamma distributed with form parameter x₀ and expectation x₀. Nevertheless λ̂ → λ a.s., as illustrated in Figure 1. Here λ^{−1} is the slope of the full-drawn line, λ̂^{−1} is the slope of the broken line (connecting the observed (B_t, S_t) and the origin). Normalising with e^{−λt}, the minimal sufficient statistic will converge towards some (1, λ^{−1})W (shown by arrows), but the empirical line will always converge towards the correct line. In the standard situation the asymptotic normality of θ̂ is based upon the asymptotic normality of the minimal sufficient statistic combined with pure differential geometry, as noted by Efron in Section 9. It is therefore no surprise that asymptotic normality breaks down here. Notice also that γ_λ² → 1/x₀ (not 0) as t → ∞. However, for given "nuisance statistic" W, the minimal sufficient statistic is asymptotically normal with asymptotic variance proportional to W^{−1}, and hence also λ̂ is asymptotically normal. (Marginally, the distribution of e^{λt/2}(λ̂ − λ), suitably normalized, converges towards a Student distribution with 2x₀ d.f., which may be interpreted as the mixture of the normal distributions over the gamma distributed inverse variances.)

It is thus tempting to investigate the problem obtained by conditioning on W = w, replacing the "nuisance statistic" W by a nuisance parameter w; see Keiding (1974). The resulting "conditional" maximum likelihood estimator λ̂* has the same first-order efficiency properties as λ̂. A comparison of second-order efficiencies is not yet completed.


4. A more general aspect of the last example is: can curved exponential families be "avoided"? In the birth process situation a stopping rule like "sample until X_t = n" will make the minimal sufficient statistic one-dimensional, in fact equal to S_{τ_n}, τ_n = inf{t : X_t = n}. Also it should be mentioned that conditioning on statistics which are in some sense ancillary (see Barndorff-Nielsen (1973) for a survey of ancillarity) may completely change the curvature properties of the problem.

REFERENCES

BARNDORFF-NIELSEN, O. (1973). Exponential Families and Conditioning. Univ. of Copenhagen.
BARNDORFF-NIELSEN, O. (1974). Factorization of likelihood functions for exponential families.
J. Roy. Statist. Soc. Ser. B. (Submitted).
BARNDORFF-NIELSEN, O. and BLAESILD, P. (1975). S-ancillarity in exponential families. To
appear in Sankhya A 37.
BEYER, J. E., KEIDING, N. and SIMONSEN, W. (1975). The exact behaviour of the maximum
likelihood estimator in the pure birth process and the pure death process. Scand. J.
Statist. 2. To appear.
FISHER, R. A. (1958). Statistical Methods for Research Workers, 13th ed. Oliver & Boyd,
Edinburgh.
KEIDING, N. (1974). Estimation in the birth process. Biometrika 61 71-80 and 647.
RAO, C. R. (1963). Criteria of estimation in large samples. Sankhya A 25 189-206.
SUNDBERG, R. (1974). Maximum likelihood theory for incomplete data from an exponential
family. Scand. J. Statist. 1 49-58.

A. P. DAWID

University College London

With his introduction of the concept of statistical curvature, Professor Efron


has provided, not merely a valuable theoretical tool, but a new way of looking
at statistical problems which at once unifies what has gone before and opens up
new territory.
The general study of curvature belongs to Differential Geometry, a subject
which has proved an invaluable tool in Physics, both Newtonian and Einsteinian.
It may have much to offer Statistics. A good introduction is Laugwitz (1965)
while Hicks (1965) emphasises a coordinate-free approach more suitable for
Statistics.
In general differentiable spaces, we cannot talk about curvature until we have chosen, somewhat arbitrarily, a linear connexion: this defines what we mean by "displacement of a vector parallel to itself along a curve." For example, consider an observer who lives and measures on a plane inverted in its unit circle. To him, a circle through the origin looks like a straight line, and he would consider its tangents as parallel; to us they are not. The need for the parallelism concept may be seen from Efron's Figure 1: the angle there is the angle between (i) l̇_{θ₁} and (ii) l̇_{θ₀} displaced parallel to itself along ℱ to θ₁. This depends on our connexion.
Let us try to frame Statistics within Differential Geometry as follows (ignoring obvious technical difficulties): Let 𝒫 be the family of all distributions over 𝒳 equivalent to a carrier measure μ. A curve ℱ in 𝒫, say ℱ = {P_θ}, has densities {f_θ} with suitable regularity properties. If ℳ is the vector space of signed measures m on 𝒳, with m ≪ μ and m(𝒳) = 0, we may define the tangent to ℱ at P = P_θ as ṁ_θ ∈ ℳ with ṁ_θ = "lim_{δ→0}" (P_{θ+δ} − P_θ)/δ. (Equivalently, dṁ_θ/dμ = ḟ_θ.) Conversely any m ∈ ℳ is the tangent to some curve at P.

Let 𝒯_P be the vector space of random variables T(x) having E_P[T(X)] = 0. For given P, there is a natural isomorphism between ℳ and 𝒯_P: dm = T(x) dP. Then ṁ_θ maps into l̇_θ(x), which may again be identified with the tangent to ℱ at P_θ.

Now let P_{θ₀}, P_{θ₁} ∈ ℱ, with tangent spaces 𝒯₀, 𝒯₁, and let T₀ ∈ 𝒯₀, T₁ ∈ 𝒯₁. To be able to talk about the angle between T₀ and T₁ we must put them into the same space. We may do this by a parallel displacement of T₀ along ℱ to θ₁, where it becomes T₀′ ∈ 𝒯₁.

The parallel displacement used implicitly by Efron (what I propose to call the "Efron connexion") has

(1)    T₀′ = T₀ − E_{θ₁}(T₀).

This happens to be independent of the curve ℱ, which is not always so. Noting (d/dθ)E_θ[T] = E_θ[T l̇_θ] for fixed T, we can generate (1) by the infinitesimal displacement rule (taking θ₁ = θ₀ + dθ):

(2)    T₀′ = T₀ − E_{θ₀}(T₀ l̇_{θ₀}) · dθ.

For curvature, we look at the angle between l̇_{θ₀}′ and l̇_{θ₁}. We may measure this by any convenient inner product, but in our statistical set-up there appears to be only one natural inner product in 𝒯_P, namely ⟨T, U⟩ = E_P(TU). (For any parametric family {P_θ}, this yields the information inner product, with matrix (E_θ[(∂l/∂θ_i)(∂l/∂θ_j)]); hence we may call this the information metric.) This leads to Efron's measurement of angle and of curvature.

The "straight lines" have a characterisation independent of the metric: l̇_{θ₀} must displace to become a scalar multiple of l̇_{θ₁}. By reparametrisation, the multiple may be taken as unity. This leads to the differential equation

(3)    l̈_θ + i_θ = 0,

characterising exponential families.


The Efron connexion is not, however, the only available one (although it probably is the only one that fits in neatly with repeated sampling, as in Efron's Section 6). An alternative obvious definition of parallel displacement considers ℳ as the tangent space and uses the identity transformation (again, independent of path). This is equivalent to transforming 𝒯₀ into 𝒯₁ with

(4)    T₀′ = T₀ (dP_{θ₀}/dP_{θ₁}),

yielding the infinitesimal displacement

(5)    T₀′ = T₀ − T₀ l̇_{θ₀} · dθ.

To measure curvature with this connexion, using the information metric, M_θ in Efron's (2.3) must be replaced by the covariance matrix of l̇_θ and (l̈_θ + l̇_θ²). The "straight lines" now have l̈_θ + l̇_θ² = 0, which yields mixture families: P_θ = (1 − θ)P₀ + θP₁. Thus the above connexion may be termed the "mixture connexion".
Now the information metric makes 𝒫 into a Riemannian space, and from this point of view there is a serious deficiency in both connexions above: they are not compatible with the metric. That is, the length of T₀ at P_{θ₀} (viz. [E_{θ₀}(T₀²)]^{1/2}) is not the same as that of its parallel translate T₀′ at P_{θ₁}. It may be checked that the infinitesimal displacement

(6)    T₀′ = T₀ − ½[T₀ l̇_{θ₀} + E_{θ₀}(T₀ l̇_{θ₀})] · dθ

yields a connexion (the "information connexion") that is compatible with the information metric. Curvature for this connexion (which is the geodesic curvature associated with the information metric) uses the covariance matrix of l̇_θ and l̈_θ + ½l̇_θ².

We can calculate the torsion and curvature tensors (Hicks, page 59) for the above connexions. We find that all have zero torsion (equivalently: are symmetric, or affine). There is a unique affine connexion compatible with a given metric, hence (6) supplies it for the information metric. We find zero curvature for the Efron and mixture connexions, while the curvature tensor R associated with the information connexion has

(7)    R(T, U)V = ¼[T · E(UV) − U · E(TV)].


The Riemann-Christoffel curvature tensor K of type (0, 4) (Hicks, page 72) is then given by:

(8)    K(T, U, V, W) = ¼[E(TV)E(UW) − E(TW)E(UV)].

From this we find that the space 𝒫, with the information metric, has constant, positive, Riemannian curvature ¼.

The geodesics (shortest paths) for the information metric are the "straight lines" of the information connexion, satisfying

(9)    l̈_θ + ½l̇_θ² + ½i_θ = 0.

Solutions of (9) are closed curves, parametrized by an angle θ, having an angle-valued sufficient statistic t, with density of the form

(10)    f(t | θ) = 1 + cos(t − θ)

with respect to a probability measure P̃ over the unit circle for which ∫₀^{2π} e^{it} dP̃(t) = 0. Such curves have i_θ = 1, and total length 2π. Thus 𝒫 is rather like the surface of a sphere of radius 2, opposite points being identified.
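That i_θ = 1 for the family (10) can be checked numerically; the sketch below (Python) takes the carrier measure to be uniform on the circle, one choice satisfying ∫₀^{2π} e^{it} dP̃(t) = 0, with θ = 0.7 an arbitrary parameter value:

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 1_000_000, endpoint=False)
dt = t[1] - t[0]
theta = 0.7
f = (1.0 + np.cos(t - theta)) / (2.0 * np.pi)  # density w.r.t. dt on the circle
score = np.sin(t - theta) / (1.0 + np.cos(t - theta))   # d/dtheta log f
print(np.sum(score ** 2 * f) * dt)             # Fisher information, ~1.0
```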


The nonvanishing of (7) means that the information parallel displacement


depends on path, which makes it less immediately intelligible than the Efron
and mixture displacements. Can we give any interesting statistical interpretation
to the information connexion, and its associated families (10)?

REFERENCES

[1] HICKS, N. J. (1965). Notes on Differential Geometry. Van Nostrand, Princeton.


[2] LAUGWITZ, D. (1965). Differential and Riemannian Geometry. Academic Press, New York.

JIM REEDS

Harvard University

1. Ideas of geometrical curvature are not completely new to statistics. Efron's


paper is the logical successor to papers applying the differential geometric point
of view to statistical estimation. Rao (1945) and Bhattacharyya (1943) viewed
the multiparameter Fisher information as defining a local (Riemannian) metric
(Eisenhart (1926 and 1960), Spivak (1970)) on the parameter space; the inte-
grated arc length of a geodesic connecting two parameter values then defines a
global metric or distance function on parameter space. Holland (1973), Huzur-
bazar (1950 and 1956) and Mitchell (1962) exploited transformation properties
of the Fisher information viewed as a Riemannian metric. Holland, for instance,
sought covariance stabilizing transformations (like the square root transforma-
tion of univariate Poissons). Such a transformation makes the Fisher information
matrix, expressed in transformed coordinates, a constant matrix. "When can
it be found?" is the question "When is a given Riemannian manifold locally
isometric to a Euclidean space?" Riemann gave the answer: "When the Rieman-
nian curvature (or, in two dimensions, the Gaussian curvature) vanishes iden-
tically." This always happens only in dimension one. In all higher dimensions
non-Euclidean manifolds-and noncovariance stabilizable parameter spaces-
occur.
Recent unpublished work of Tadashi Yoshizawa (1971) makes explicit use of
the inherent Riemannian structure in parameter estimation problems. He shows
how one can isometrically embed the parameter space into a Euclidean space
of sufficiently high dimension, and then read off the (first order) asymptotic
properties of the estimation problem by inspecting the parameter space as a
curved submanifold of a Euclidean space.
Thus curvature of one sort is not new to statistics. But Efron's curvature is
of a different sort-not the Riemannian or "intrinsic" curvature but instead the
curvature of embedding, associated with the particular way a parameter space
is placed inside a higher-dimensional "natural parameter" space. Riemann cur-
vature-measured by the curvature tensor-is determined solely by the "first
fundamental form" or metric tensor, the physicists' metric ground form, the
statisticians' Fisher information matrix. Efron's curvature, curvature of embed-
ding, is measured by the "second fundamental form" and depends on more than


Fisher information. The distinction is illustrated by the cylinder in Euclidean 3-space. This surface has curvature of embedding but no Riemann curvature, for any piece of it can be unrolled without distorting lengths. A
curvature, for any piece of it can be unrolled without distorting lengths. A
sphere in 3-space has both sorts of curvature; a parabola in the plane has only
curvature of embedding. No submanifolds of Euclidean space have Riemann
curvature without curvature of embedding.
Efron takes the natural parameter space as Euclidean, with constant metric
given by the Fisher information evaluated at the true value of the parameter,
θ₀. The actual parameter space is a submanifold of natural parameter space; its
curvature of embedding is calculated with respect to this constant Euclidean
structure on the natural parameter space. Efron's discussion in the second para-
graph of Section 2 is unclear; one might falsely assume that the natural parameter
space was endowed with the (nonconstant) metric provided by the Fisher in-
formation as a function of 0.
The point of Efron's paper is that the curvature of embedding, calculated in
this way, has an effect on statistical procedures, an effect amenable to quanti-
tative study.

2. The main result of Section 10 may be generalized to a multivariate curved
exponential family. Both this result and Efron's suffer from a defect which
might be overcome in future work. The defect is that both make statements
about the coefficients of the asymptotic expansions of the variance, not about
the variance itself. Thus, the conclusions are of the form

\[
\operatorname{Var}(T_n) = \frac{a}{n} + \frac{b}{n^2} + o(n^{-2}) \quad (\text{or } O(n^{-3})),
\qquad\text{and}\quad a \ge \alpha, \ \text{and if } a = \alpha, \ b \ge \beta,
\]

where α and β are certain theoretical lower bounds. Contrast this
with a stronger type of conclusion:

\[
\operatorname{Var}(T_n) \;\ge\; \frac{\alpha}{n} + \frac{\beta}{n^2},
\]

where α and β have the same meaning as above. (If Var(T_n)
has an asymptotic expansion at all, the second conclusion implies the first.) Both
the Cramer-Rao and the Bhattacharyya inequalities provide conclusions of the
second type. In a sense, we can trace the difference to the different methods
used to prove the various inequalities. The classical proof of the Cramer-Rao
bound proceeds by constructing a certain variance-covariance matrix, and using
its positive semidefiniteness to get the desired results. This is to be contrasted
with the method used in the present theorems: Taylor expansions of the functional
form of the estimate, coupled with systematic discarding of negligible
terms.

It is conceivable that a proof of the theorem of Section 10 could be constructed
by the classical method, by considering the joint covariance of the estimate, the
first derivative of the log likelihood, the square of the first derivative of the log
likelihood, and the product of the first and second derivatives of the log likelihood.
This is conjectured on the grounds of the simple form the covariance
matrix takes when only terms of order up through 1/n² are considered.
We may define a curved q-parameter exponential family by means of a smooth
map η: Θ → H, where Θ is some open subset of R^q, and H is the natural parameter
space of a k-variate exponential family. To simplify the discussion that
follows, we will assume that η is an embedding in the sense of differential geometry:
η is a C^∞ injection, with differential of full rank at each point, and
"smooth" (whenever it appears in this discussion) means C^∞. Note that according
to this set-up, Θ is not a submanifold of H, but η(Θ) is. An estimate is a
function T: 𝒳 → Θ, mapping the space of the sufficient statistic to the parameter
space.
If we restrict ourselves to estimates T that depend only on the sufficient statistic
x̄_n = n⁻¹(x₁ + ⋯ + x_n) (and not on n), and which satisfy certain regularity
conditions, we may prove:

THEOREM. Let T depend only on x̄_n, the sufficient statistic for a curved q-parameter
exponential family. Suppose T is smooth in some neighborhood of E(x̄_n), and
suppose T grows (as a function of x̄_n) no faster than exponentially.
If T is a consistent and first order efficient estimate of θ, the variance of T possesses
an asymptotic expansion

\[
\operatorname{Var}\bigl(T(\bar{x}_n)\bigr) = \frac{\mathrm{CRLB}}{n} + \frac{A + B + C}{n^2} + O(n^{-3}).
\]

(Here CRLB denotes the Cramer-Rao Lower Bound;

A denotes the "naming" or "Bhattacharyya" curvature term, which can be made
zero by an appropriate reparameterization of the parameter space. It is independent
of T.

B is the "Efron excess," or statistical curvature term, and is independent of
T.

C depends only on the function T, and vanishes for the particular choice
T = the maximum likelihood estimate.

All these quantities are q by q positive semidefinite symmetric matrices.)


The proof of this multivariate theorem parallels Efron's univariate arguments.
It shares the use of affine transformations to bring the problem into "standard
form," calculations with Taylor expansions to exhibit the consequences of consistency
and first order efficiency, and finally, replacement of T by a Taylor
approximation, and the calculation of expectations and variances of the Taylor
approximant.
The key quantity of interest in the conclusion of this theorem is the term B,
the "Efron" or "statistical curvature" excess. It is the multivariate generalization
of γ₀²/i₀, and (like γ₀²/i₀) may be defined in several ways.


(1) Let η(θ), in the vicinity of θ₀, have an expansion

\[
\eta(\theta) = a + \sum_i b_i(\theta_i - \theta_{0i}) + \tfrac12 \sum_{j,k} c_{jk}(\theta_j - \theta_{0j})(\theta_k - \theta_{0k}) + \cdots,
\]

where θ has coordinates (θ₁, θ₂, …, θ_q). Let (g^{ij}) denote the inverse of the
Fisher information matrix for θ, and let (G_{rs}) denote the Fisher information
matrix for the natural parameter η. Let

\[
D_{ih} = \sum_{r,s} b_i^r G_{rs} b_h^s, \qquad
E_{i,mn} = \sum_{r,s} b_i^r G_{rs} c_{mn}^s, \qquad
F_{jk,mn} = \sum_{r,s} c_{jk}^r G_{rs} c_{mn}^s.
\]

Let the inverse of D = (D_{ih}) be D⁻¹ = (D^{ih}). Let

\[
\tilde{F}_{jk,mn} = F_{jk,mn} - \sum_{i,h} E_{i,jk}\, E_{h,mn}\, D^{ih}.
\]

Then

\[
B^{ij} = \sum_{k,l,m,n} g^{im} g^{jn} g^{kl}\, \tilde{F}_{mk,nl}.
\]

If, at θ₀, the Fisher matrices of both θ and η are equal to identity matrices, this
simplifies to

\[
B_{ij} = \sum_{r,k} c^r_{ik}\, c^r_{kj},
\]

where the summation extends over r ≥ q + 1.
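A minimal numerical sketch of this simplified contraction, in Python with made-up coefficient arrays (the array c and the dimensions below are hypothetical, chosen only to illustrate the index pattern; zero-based indexing turns the condition r ≥ q + 1 into r ≥ q):

```python
import numpy as np

q, kdim = 2, 4                               # parameter and natural-parameter dimensions
rng = np.random.default_rng(1)

# Hypothetical quadratic coefficients c[r, i, k] of the embedding eta(theta),
# with both Fisher matrices taken to be identities at theta_0.
c = rng.normal(size=(kdim, q, q))
c = (c + c.transpose(0, 2, 1)) / 2           # each c[r] symmetric in its lower indices

# B_ij = sum over normal directions r (>= q) and inner index k of c^r_ik c^r_kj
B = np.einsum('rik,rkj->ij', c[q:], c[q:])
print(B)                                     # q x q, symmetric positive semidefinite
```

As a sum of squares of symmetric matrices, B is automatically positive semidefinite, in agreement with the theorem.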


(2) Let l be the log likelihood function. If

\[
\dot{l}_i = \frac{\partial l}{\partial \theta_i}\bigg|_{\theta_0}
\qquad\text{and}\qquad
\ddot{l}_{ij} = \frac{\partial^2 l}{\partial \theta_i\, \partial \theta_j}\bigg|_{\theta_0},
\]

we may form the linear regression of l̈ on l̇ as follows:

\[
\hat{l}_{jk} = \sum_i \beta_{jk}^{\,i}\, \dot{l}_i,
\]

and we may calculate the regression-residual covariance:

\[
\operatorname{Cov}\bigl(\ddot{l}_{ij} - \hat{l}_{ij},\ \ddot{l}_{mn} - \hat{l}_{mn}\bigr) = a_{ij,mn}.
\]

Then

\[
B^{ij} = \sum_{k,l,m,n} g^{im} g^{jn} g^{kl}\, a_{mk,nl}.
\]
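In the univariate case this regression-residual construction reduces to Efron's γ_θ². A Monte Carlo sketch for the Cauchy translation family at θ = 0, assuming the normalization γ² = (ν₀₂ − ν₁₁²/ν₂₀)/ν₂₀² (the residual variance of l̈ on l̇ divided by the squared Fisher information); the output can be checked against the exact values ν₂₀ = 1/2, ν₁₁ = 0, ν₀₂ = 5/8, giving γ² = 5/2:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_cauchy(1_000_000)          # u = x - theta, sampled at theta = 0

ldot = 2 * u / (1 + u**2)                   # (d/d theta) log f, a bounded function
lddot = (2 * u**2 - 2) / (1 + u**2)**2      # (d^2/d theta^2) log f

nu20 = np.mean(ldot**2)                     # E(ldot^2): Fisher information, exactly 1/2
nu11 = np.mean(ldot * lddot)                # Cov(ldot, lddot), since E(ldot) = 0
nu02 = np.var(lddot)                        # Var(lddot), exactly 5/8

gamma2 = (nu02 - nu11**2 / nu20) / nu20**2
print(gamma2)                               # approximately 2.5
```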
(3) Let Ω^r_{ij} be the components of the second fundamental form of the imbedding
η: Θ → H (see Eisenhart (1926 and 1960)), where H has the Euclidean
structure induced by the Fisher information evaluated at η(θ₀). Then

\[
B^{ij} = \sum_{m,n} \sum_{k,l} \sum_r g^{im}\, \Omega^r_{mk}\, g^{kl}\, \Omega^r_{ln}\, g^{nj}.
\]


Similar formulas hold for the naming curvature term A. In the special case
where both the Fisher information matrices are equal to identity matrices (at θ₀)
and where the bᵢ are the coordinate unit vectors, the ijth term of the naming
curvature is

\[
A_{ij} = \sum_{a,b} c^a_{ib}\, c^b_{ja},
\]

where the summation extends over 1 ≤ a, b ≤ q.
Notice that in the univariate case the naming curvature term A can always
be made to vanish identically by a suitable reparameterization, but in the mul-
tivariate case this cannot in general be done. It can always be made to vanish
at isolated points, but there need not in general exist reparameterizations which
make the naming curvature vanish globally. This is related to the general
nonexistence of multivariate covariance stabilizing transformations. In the
univariate case, the naming curvature vanishes identically exactly when we
parameterize the curve by arc length: that is, it vanishes when the variance is
stabilized. In the multivariate setting, however, we cannot in general covariance
stabilize, and we cannot in general make the naming curvature identically zero.
Perhaps the easiest example is provided by the trivariate normal distribution,
with unit covariance matrix, with the mean vector constrained to have unit
length (and, to avoid global topological problems, with first coordinate positive);
see the check below. Thus, in the multivariate case the naming curvature term
takes on added significance, and must be viewed as an object of study as serious
as the statistical curvature term itself.
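That example can be checked against Riemann's criterion directly. For X ∼ N₃(μ, I) the Fisher information for μ is the identity matrix, so the metric induced on the constraint set {μ ∈ R³: ‖μ‖ = 1} is the round metric of the unit sphere S², whose Gaussian curvature is

\[
K \equiv 1 \neq 0,
\]

and the nonvanishing curvature rules out any locally flat, that is covariance stabilizing, reparameterization.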

REFERENCES

[1] BHATTACHARYYA, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35 99-109.
[2] EISENHART, L. (1926 and 1960). Riemannian Geometry. Princeton Univ. Press.
[3] HOLLAND, P. (1973). Covariance stabilizing transformations. Ann. Statist. 1 84-92.
[4] HUZURBAZAR, V. (1950). Probability distributions and orthogonal parameters. Proc. Cambridge Philos. Soc. 46 281.
[5] HUZURBAZAR, V. (1956). Sufficient statistics and orthogonal parameters. Sankhyā 17 217-220.
[6] MITCHELL, A. (1962). Sufficient statistics and orthogonal parameters. Proc. Cambridge Philos. Soc. 58 326-337.
[7] RAO, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37 81-91.
[8] SPIVAK, M. (1970). Differential Geometry. Publish or Perish, Boston.
[9] YOSHIZAWA, T. (1971). A Geometrical Interpretation of Location and Scale Parameters. Memorandum TYH-2, Statist. Dept., Harvard Univ., Cambridge.

REPLY TO DISCUSSION

The discussants are (almost) uniformly constructive and informative in their
comments. They point out many important facts, and even whole areas, that
the paper misses. Only two of them consider me basically deranged in my
thought processes. In what follows I have tried to answer a few specific points,
without exploring much further the bigger questions raised.
Professors Cox and Pierce suggest that the distance from (x̄₁, x̄₂) to the curve is a
useful approximate ancillary statistic. (See Figure 4. It is simplest to assume
that the family is in standard form at θ = 0, and that we are considering θ
values near zero.) I particularly like Pierce's suggestion that the ancillary information
has to do with the precision of θ̂ and not its location. To make things
really easy, consider repeated sampling in Example 1, and suppose that we happen
to get θ̂ = 0, that is x̄₁ = 0. (See Figure 2.) The likelihood function for θ is
proportional to

\[
\exp\Bigl\{-\frac{n}{2}\Bigl[1 - \gamma_0 \bar{x}_2 - \frac{\gamma_0^2 \theta^2}{4}\Bigr]\theta^2\Bigr\},
\]

which for θ in the interval 0 ± c/n^{1/2} behaves like exp{−(n/2)[1 − γ₀x̄₂]θ²}.
That is, the likelihood function for θ is approximately 𝒩(0, [1 − γ₀x̄₂]⁻¹/(n i₀)).
The distance from (x̄₁, x̄₂) to the curve, x̄₂ in this case, modifies the unconditional
variance (n i₀)⁻¹ by the factor [1 − γ₀x̄₂]⁻¹.
It is probably possible to extend this likelihood analysis to a genuine conditional
variance statement, as Pierce suggests.
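The approximation step above is a routine order assessment, under the quadratic form as reconstructed here: for |θ| ≤ c/n^{1/2},

\[
-\frac{n}{2}\Bigl[1 - \gamma_0 \bar{x}_2 - \frac{\gamma_0^2\theta^2}{4}\Bigr]\theta^2
= -\frac{n}{2}\bigl(1 - \gamma_0 \bar{x}_2\bigr)\theta^2 + O(c^4/n),
\]

since nθ⁴ = O(c⁴/n) in this range; the log likelihood is then quadratic in θ, which is exactly the stated normal form with variance [n(1 − γ₀x̄₂)]⁻¹ (recall i₀ = 1 in standard form).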
Bayesians and other nonfrequentist statisticians do not like averages taken
over the sample space 𝒳 with θ fixed. Professor Lindley raises this objection
to the curvature γ_θ², as it has been raised to the Fisher information i_θ itself.
Those who believe in direct interpretation of likelihood functions prefer −l̈_θ̂(x),
the actual curvature of the log likelihood function at its maximum, to the average
value i_θ̂. (Incidentally, I use θ as a subscript rather than an argument to save
writing parentheses!) I find some force in these kinds of considerations but,
perhaps because of my training, can never be convinced without the support
of some relevant averaging property, be it frequentist, conditional frequentist,
Bayesian, or otherwise. (See my discussion following Blyth (1970).)
If a Cauchy translation sample of size 10 yields a very normal looking likelihood
function, say 𝒩(θ̂, .3), should we behave as if the MLE has variance .3?
Professor Lindley answers "yes" on Bayesian grounds, in the absence of prior
information. Professor Pierce's remarks indicate that the curvature may have
something helpful to say to frequentists about such problems.
Returning to less slippery ground, here is a calculation of asymptotic Bayes
risk that makes use of the curvature. In a curved exponential family with an i.i.d.
sample of size n, let θ have prior distribution 𝒩(θ₀, c_n/n), where c_n is going
sufficiently slowly to infinity. Then the Bayes risk is asymptotically

\[
\frac{1}{n\, i_{\theta_0}}\Bigl(1 + \frac{\gamma_{\theta_0}^2}{n}\Bigr) + o(n^{-2}),
\]

which equals to order 1/n² the squared error risk of the bias-corrected MLE
at θ = θ₀. (This result follows, with some effort, from (10.19).)
Professor Le Cam's warning about over-reliance on local methods is well
taken. As a matter of fact, my paper is most concerned with curvature as a
check on the appropriateness of first order local properties such as Fisher's in-
formation and the Cramer-Rao lower bound. In the situation of Figure 6, cur-
vature can be used quantitatively to improve the first order approximation. I
hope, but of course am not certain, that other situations will be similarly obliging.
Le Cam's criticism of the MLE as a point estimator should not be confused
with Fisher's preference for it as an information gatherer. A function of the
MLE may be better than the MLE itself for any specific estimation problem.
This is the case in the Berkson example quoted. Berkson's estimator does better
than the MLE, but is eventually improved by Rao-Blackwellizing it on the
sufficient statistics. This gives a function of the MLE! (It has to, because the
situation involves a genuine uncurved exponential family.) Figure 4 becomes
more convincing the more you study it. Locally the straight level line L_θ seems
intuitively preferable to any curved competitor M_θ. (See Dr. Keiding's remarks
and my reply.)
Quadratic approximations to the log likelihood function have been used suc-
cessfully by many authors, notably Professor Le Cam himself. They are the
basis of Rao's work in second order efficiency. They can be used to produce
estimators other than the MLE which are second order efficient. Whether there
is a corresponding theory of third order efficiency, and whether the MLE is still
the champion, is an interesting open question.
After a long fallow period there seems to be a revival of interest in second
order efficiency and related topics. I am eager to see Professor Ghosh's work
with Subrahmaniam and Srinivasan. (Also, I must apologize for not having been
aware of Pfanzagl and Chibisov's papers, which demonstrate rigorously the relevance
of what I have called curvature to hypothesis testing problems, even outside
an exponential family framework.) As Ghosh suggests, and as I mentioned
in discussing Pierce's comments, there is some connection between γ₀² and the
geometrical curvature of the likelihood function, but not one I understand clearly
yet. Professor Ghosh's last question can be partially answered in the affirmative:
in the counter-example of Figure 5, change c to (−⅔, ⅓). Then the MLE of
any x vector with x̄₁ = ⅓ is zero, but if x̄₁ ≠ ⅓ each x corresponds to a unique
θ̂. For n any multiple of 3, θ̂ will lose information because of the grouping of
those x vectors with x̄₁ = ⅓. It is easy to curve the level lines of another consistent
efficient estimator θ̃, a la Figure 4, so that the vectors with x̄₁ = ⅓ are
separated, and θ̃(x) is different for all different x vectors, so no information is
lost. This works for any fixed n divisible by 3, but I am less certain about finding
a θ̃ that works for all values of n.
There is less difference between Professor Pfanzagl and me than the tone of
his comments indicates. His results (1) and (2) follow from (8.4). I should have
said earlier that a rescaled version of this equation holds as an approximation
when testing θ = 0 versus θ > 0 under i.i.d. sampling in any curved exponential
family,

\[
1 - \hat{\beta}_\theta(0) \doteq \Phi\bigl(\lambda(1 + \gamma^2 \lambda^2/4)^{1/2} \cos(\Delta_\lambda - \Delta) - z_\alpha\bigr),
\]

where λ = (n i₀)^{1/2} θ, γ = γ₀/n^{1/2}, and Δ_λ = tan⁻¹(γλ/2). In order for this approximation
to be sufficiently accurate to yield Pfanzagl's asymptotic results, the
family must be nonatomic. However, the type of power comparisons presented
in Table 3 are less sensitive as well as more familiar. For α = .01, power = .99,
γ₀²/n = 1/8, the case Pfanzagl discusses, the locally most powerful test has approximate
power .94 compared with the envelope value .99. I consider this
borderline acceptable, and will stick to my suggestion of γ₀²/n ≥ 1/8 as a rough
indicator of nonnegligible curvature effects.
Fisher defined γ₀² as the loss of information in using θ̂ instead of the whole
sample. Rao's results on estimation with squared error loss partially vindicate
this definition. Pfanzagl's own work shows that γ₀² plays a key role in the loss
of effective sample size in hypothesis testing problems. Then why does he seem
to say that γ₀² has no statistical significance? The fact that the level α test based
on the MLE is asymptotically equivalent to the optimal test with power 1 − α
has nothing to do with the existence of curvature effects. There still is no uniformly
most powerful test. The global deviations of any attainable power curve
from the power envelope are still ruled by the magnitude of γ₀².
I was happy to see that Dr. Keiding had found a definite use for curved exponential
families in his work on birth processes. Time series problems offer
many other examples, of which my Example 3 is close to the simplest. (With
Dr. Reeds' multiparameter theory available we are now in a position to analyze
the second order asymptotics of higher autoregressive schemes.) The geometric
interpretation of the penalty for not using the MLE is simple in the case
k = 2. Comparing (10.24) with (10.5) shows that it equals one-half of the squared
curvature of the level curve M_{θ₀} = {x̄: θ̂(x̄) = θ₀}. See Figure 4.
Dr. Dawid raises a deep question: why have I chosen to represent families of
probability distributions by their log densities rather than, say, the density functions
themselves? This latter representation would make mixture families rather
than exponential families straight lines, as he points out. What I have called
the matrix M_θ then has elements ν_{ij} as at (1.2) rather than as at (3.21).
Dawid makes the interesting observation that still another definition is needed
to make straight lines into geodesics in the information metric. (Rao, 1945a and
1945b, has proposed using this type of geodesic distance to measure the separation
of probability distributions. Atkinson and Mitchell have calculated Rao
distances for many familiar distribution families.) I can't answer Dr. Dawid's
deep question except to say that my definition was motivated by what seemed
to be the most pressing statistical considerations. He makes a good case for
other definitions also yielding useful results for the statistician.
My paper considers only one-parameter families. Dr. Reeds gives a convincing
extension to the multiparameter case. Having been frustrated myself by the
intricacies of the higher order differential geometry, I am impressed! Hopefully,
his "B", the analogue of γ₀², will also play the corresponding role vis-a-vis
Fisher information and hypothesis testing.
Two technical comments: (i) a version of the usual super-efficiency examples
prevents Reeds' formula (2) from holding generally. In my Example 1, Figure 2,
let θ̃(x̄) = x̄₁ except in a band on either side of the curve; within this
band modify θ̃ so that it is consistent and first order efficient. Then (10.19) can
be used to show that θ̃ satisfies (10.1) with the same constants as the MLE
at θ₀ = 0. (ii) It is not true in general, even in the one-parameter case, that
the "arc-length parameter" has naming curvature equal to zero. Let σ(θ) be
this parameter measured from σ = θ = 0, where we assume for convenience
that i₀ = 1. By definition

\[
\sigma(\theta) = \int_0^\theta i_t^{1/2}\, dt,
\]

so that dσ(θ)/dθ = i_θ^{1/2} and d²σ(θ)/dθ² = (di_θ/dθ)/(2 i_θ^{1/2}). It is easy to show,
by an expansion similar to (10.10), that in terms of the quantities ν_{jk} defined at (1.2),

\[
\frac{di_\theta}{d\theta} = 2\nu_{11} - \nu_{30}.
\]

This gives the Taylor expansion about zero

\[
\sigma(\theta) = \theta + \frac{2\nu_{11} - \nu_{30}}{4}\,\theta^2 + O(\theta^3),
\]

ν₁₁ and ν₃₀ being evaluated at θ = 0.
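A quick verification of the derivative formula, assuming (as the usage at (1.2) suggests) that ν_{jk} denotes the moments E{(ḟ_θ/f_θ)^j (f̈_θ/f_θ)^k}, dots denoting θ-derivatives: writing i_θ = ∫ ḟ_θ²/f_θ dx and differentiating under the integral,

\[
\frac{di_\theta}{d\theta}
= \int \Bigl(\frac{2\dot{f}_\theta \ddot{f}_\theta}{f_\theta} - \frac{\dot{f}_\theta^3}{f_\theta^2}\Bigr) dx
= 2E\Bigl[\frac{\dot{f}_\theta}{f_\theta}\,\frac{\ddot{f}_\theta}{f_\theta}\Bigr] - E\Bigl[\Bigl(\frac{\dot{f}_\theta}{f_\theta}\Bigr)^3\Bigr]
= 2\nu_{11} - \nu_{30}.
\]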


The parameter φ(θ) which figures in the definition of naming curvature has
the expansion

\[
\phi(\theta) = \theta + \frac{\nu_{11}}{2}\,\theta^2 + O(\theta^3),
\]

as given in (10.11). The θ² coefficients of σ and φ thus differ by ν₃₀/4, so the
naming curvature will not be zero for the arc-length parameter unless ν₃₀ = 0.
(That is, unless Fisher's score function has third moment zero.)
It is not clear to me whether or not one can always choose a reparameterization
for the family which has naming curvature identically zero, even in the one-parameter
case. We probably wouldn't want to estimate such a parameter anyway
unless it had something more to recommend it than zero naming curvature. I didn't
mean to imply that naming curvature is less important than statistical curvature, only
that it depends on the name.
Finally, I would like to thank the Editor for arranging this discussion which
involved a large amount of extra work on his part. I hope the Annals of Sta-
tistics will continue the entertaining and enlightening policy of providing occa-
sional discussion papers.

REFERENCES

[1] ATKINSON, C. and MITCHELL, A. F. Rao's distance measure. Unpublished manuscript.
[2] BLYTH, C. R. (1970). On the inference and decision models of statistics. Ann. Math. Statist. 41 1034-1058.
[3] RAO, C. R. (1945a). Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37 81-91.
[4] RAO, C. R. (1945b). On the distance between two populations. Sankhyā 9 246-248.
