Springer Series in Statistics: Perspectives in Statistics

Johann Pfanzagl

Mathematical Statistics
Essays on History and Methodology

More information about this series at http://www.springer.com/series/1383
Johann Pfanzagl
Mathematical Institute
University of Cologne
Cologne
Germany
Contents

1 Introduction
  References
2 Sufficiency
  2.1 The Intuitive Idea
  2.2 Exhaustive Statistics
  2.3 Sufficient Statistics—Sufficient σ-Fields
  2.4 The Factorization Theorem
  2.5 Completeness
  2.6 Minimal Sufficiency
  2.7 Trivially Sufficient Statistics
  2.8 Sufficiency and Exponentiality
  2.9 Characterizations of Sufficiency
  References
3 Descriptive Statistics
  3.1 Introduction
  3.2 Parameters and Functionals
  3.3 Estimands and Estimators
  3.4 Stochastic Order
  3.5 Spread
  3.6 Unimodality; Logconcave Distributions
  3.7 Concentration
  3.8 Anderson’s Theorem
  3.9 The Spread of Convolution Products
  3.10 Interpretation of Convolution Products
  3.11 Loss Functions
  3.12 Pitman Closeness
  References
Between 1940 and 1970, say, mathematical statistics underwent a dramatic change. Not only did the volume of publications increase by a factor of 10 (starting from about 1,500 pages a year in 1940); it is even more the level of mathematical sophistication which changed. Comparing a 1970 volume of the Annals of Mathematical Statistics with a volume of 1940, one would not believe it to be the same journal.
The following essays try to portray the historical development of selected areas of mathematical statistics during this period. Our emphasis is on conceptual issues; this justifies the restriction to basic models, mostly based on independent and identically distributed observations. Describing what happened fifty years ago is, however, not the main purpose. What is still of interest today is to see how the emergence of refined mathematical techniques influenced the subject of mathematical statistics. Whether these refined techniques were fully understood, and their power fully exploited, by all contemporary statisticians is an interesting second aspect.
Around 1940, mathematical statistics was mainly restricted to parametric families, and the highlights were theorems about unbiased estimators and families with monotone likelihood ratios. As a new generation of statisticians, familiar with the techniques of measure theory, entered the stage, it soon became clear that compelling intuitive ideas like “sufficiency” or “asymptotic optimality” were not so easy to translate into constructs accessible to a mathematical treatment. It turned out that the solution of seemingly meaningful problems can be hampered by difficulties of a purely mathematical nature (such as the existence of regular conditional probabilities), difficulties which bear no inherent relationship to the problem itself, and which became, even so, the favourite subjects of certain statisticians.
The following essays are restricted to subjects which show some inherent rela-
tionship: Descriptive statistics, sufficiency, estimation, asymptotics. The omission of
important topics like “Bayesian theory”, “decision theory”, “robustness” takes the
restricted competence of the author into account. Before going into the details, we
try to set forth the basic ideas of our approach.
A Methodological Manifesto
Assertions and assumptions on statistical procedures should be of operational sig-
nificance, in other words: Their relations to reality should be expressible in terms
of probabilities and, perhaps, costs. This requires converting inexact, intuitive con-
cepts (like independence, concentration, information) into concepts which are oper-
ationally meaningful. The invention of such constructs is limited by what is feasible
from the mathematical point of view. Insisting on operational significance excludes
the sole reliance on “principles” (like “maximum likelihood”) which rest upon au-
thority or metaphysics.
Assertions based on a particular loss function are operationally significant if the
loss function represents reality. Results based on the quadratic loss function will,
therefore, be operationally significant only under special circumstances. That they
are easy to prove, or nice looking, does not give them any meaning.
Assertions on posterior distributions are operationally significant if the prior dis-
tribution admits an interpretation in terms of reality, be it as a “physical probability”
based on past experience (as, for instance, in acceptance sampling) or a description of
the state of mind (hopefully based on some empirical evidence). Prior distributions
justified by formal arguments might result in nice posterior distributions. This by
itself does not give them any meaning, though.
Converting an intuitive concept into a meaningful mathematical construct may
turn out to be difficult. An example of a successful conversion is the concept of a
“sufficient statistic, containing all information in the sample”. It is made precise by
the idea that for any function which could be computed from the sample there exists
a (randomized) function of the sufficient statistic which has the same distribution, for
every probability measure in the underlying family. Like it or not: For mathematical
reasons, this interpretation is confined to functions taking their values in a complete
separable metric space. (See Sect. 2.2 for details.)
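A sketch of how such a randomization may look in the simplest case (our illustration; compare the Kumar and Pathak example cited in Sect. 2.1): for the family {N(ϑ, 1)^n : ϑ ∈ R} with sufficient statistic S(x) = x̄_n, draw (Z_1, ..., Z_n) from N(0, 1)^n, independently of the data, and set
\[
Y_\nu := \bar x_n + Z_\nu - \bar Z_n, \qquad \nu = 1, \dots, n.
\]
Since E\,Y_\nu = \vartheta and \operatorname{Cov}(Y) = n^{-1}\mathbf 1\mathbf 1^\top + (I - n^{-1}\mathbf 1\mathbf 1^\top) = I, the vector (Y_1, ..., Y_n) has distribution N(ϑ, 1)^n for every ϑ: a randomized function of S alone has the same distribution as the full sample.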
An example of a condition without operational significance is the requirement that
the estimator for the sample size n should be an element of a “convergent” sequence.
“Robustness” is an example of a convincing idea which is difficult to cast in mathe-
matical terms: Since models are never exact, the performance of statistical procedures
should be insensitive to small departures from the model. Yet, what are small de-
partures from the model, and what are small changes in the performance? Consider
e.g. a real-valued functional κ of a distribution P in some family P and a sequence of estimators κ^(n), n ∈ N, based on n i.i.d. observations, with joint distribution P^n. The suggestion of Hampel (1971, p. 1890) to define “qualitative robustness” of κ^(n), n ∈ N, by equicontinuity (with respect to n) of the maps P → P^n ∘ κ^(n) proved inadequate for grasping the phenomenon of “robustness” in its complexity. Hampel’s
concept of qualitative robustness was supplemented by more detailed concepts of
departures from the model (like ε-contamination), additional measures of robustness
(like breakdown point and gross error sensitivity), and, finally, principles for finding
a balance between robustness and efficiency. Although this network of concepts and
principles did not result in a coherent theory, it attracted the attention of compe-
tent scholars. For information about the present state of the art see Rieder (1994) and Jurečková and Sen (1996).
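Spelled out in the usual formulation (a paraphrase, with d the Prokhorov metric, rather than Hampel’s exact wording): the sequence κ^(n), n ∈ N, is qualitatively robust at P if for every ε > 0 there is a δ > 0 such that for all probability measures Q and all n ∈ N,
\[
d(P, Q) < \delta \;\Longrightarrow\; d\bigl(P^n \circ \kappa^{(n)},\, Q^n \circ \kappa^{(n)}\bigr) < \varepsilon .
\]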
Here is an example of a totally misleading intuitive idea. Assume that (x_1, ..., x_n) is a realization from some normal distribution N(μ, 1)^n with unknown mean μ and variance 1. If the sample mean x̄_n equals 10, it seems much more likely that μ lies in the interval (8, 12) than in the interval (157, 161), say. Evident as this appears, the mathematician will soon discover the impossibility of expressing this in a mathematically correct way without introducing a prior probability for μ. Starting with his paper on inverse probability (Fisher 1930), R.A. Fisher
took numerous occasions to present his idea of a “fiducial probability”. That means:
Given a sample from P_ϑ^n, ϑ ∈ Θ ⊂ R, it is possible to compute a distribution of the unknown parameter ϑ. If Fisher had tried to describe such a distribution in mathematical terms, he would have run into insuperable difficulties. Let Φ denote the standard normal distribution function. For samples (x_1, ..., x_n) from N(μ, 1)^n with μ ∈ R unknown, the probability of μ < x̄_n + n^{-1/2}Φ^{-1}(α) is equal to α. Applying this relation for α = Φ(n^{1/2}(t − x̄_n)), one arrives at the conclusion that μ < t holds with probability Φ(n^{1/2}(t − x̄_n)), or that t → Φ(n^{1/2}(t − x̄_n)) is the (fiducial) distribution function of μ. This sounds plausible. If one tries to write this down in formulas (in particular: replacing the word “probability” by N(μ, 1)^n), the absurdity of this argument becomes patent. The fallacy in Fisher’s reasoning was elaborated by Neyman¹ (1941, Sects. 4 and 5); see also Lehmann (1995). Neither Neyman nor
any other critic was able to convince Fisher of his mistake. Even in Fisher (1959),
when mathematical rigor was standard already, Fisher says (p. 56):
The treatment in this book ... does ... rely on a property inherent in the semantics of the word
“probability”.
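The following sketch (ours, in the notation above) indicates where the formalization breaks down. The coverage statement behind the fiducial argument is
\[
N(\mu, 1)^n\bigl\{(x_1, \dots, x_n) : \mu < \bar x_n + n^{-1/2}\Phi^{-1}(\alpha)\bigr\} = \alpha \quad \text{for every } \mu \in \mathbb R,
\]
a statement about the random variable x̄_n, with μ fixed. Reading t → Φ(n^{1/2}(t − x̄_n)) as a distribution function of μ would require a probability measure on the parameter space under which {μ < t} has probability Φ(n^{1/2}(t − x̄_n)); under N(μ, 1)^n itself the event {μ < t} has probability 0 or 1, since μ is a constant.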
¹ Curiously, the very first sentence in Neyman’s paper is: “The theory of confidence intervals was started by the author in 1930.” However, forerunners are Laplace (1812) and Poisson (1837). If there had been any doubts about the interpretation of “covering probability”, they were settled by Wilson (1927).
Fisher’s abuse of language was carried on by some of his adherents, for instance by C.R. Rao, who gives the following definition (1962, p. 77) for i.i.d. observations (x_1, ..., x_n) with density p(·, ϑ):

A statistic is said to be efficient if its asymptotic correlation with the derivative of log likelihood is unity. The efficiency of any statistic may be measured by ρ², where ρ is the asymptotic correlation of the statistic with the derivative of log likelihood.
One cannot but sympathize with Lindley when he says, in the discussion of C.R. Rao’s paper (p. 68):

Professor Rao follows in the footsteps of Fisher in basing his thesis on intuitive considerations of estimation that people like myself, who lack such penetrating intuition, cannot aspire to.
This and other attempts at defining first and second order efficiency will be discussed
in Sects. 5.9 and 5.18.
Refinements of the mathematical techniques not only exposed gaps in the preceding literature. They also turned up technical assumptions the relevance of which
is hard to understand from the intuitive point of view, for instance the conditions for
the existence of a regular conditional probability.
For scholars with a firm background in mathematics it was natural to consider
statistical theory as a part of mathematics. From this point of view it is legitimate to
generalize theorems with a clear statistical interpretation to a more abstract frame-
work. This is not without danger, though. Isolating the abstract core of an argument
might kill the original idea, thus causing a handicap for generalizations in other direc-
tions. Assertions valid in the abstract framework might be without an interpretation
in terms of the original problem and, therefore, without operational significance.
Realizing the shape which Hájek’s Convolution Theorem took in the publications of
Le Cam (see e.g. 1979), the statistician cannot avoid thinking of Lucretius:
When a thing changes its nature, at that moment comes the death of what it was before.
The following example from test theory illustrates the usefulness of refined mathe-
matical techniques, and their limitations.
Example. Given a family P of probability measures which is interpreted as a “hypothesis”, and a probability measure P_0 outside P, the “alternative”, the question arises whether a most powerful test exists, i.e., whether there is a critical function ϕ_0 which maximizes the expectation ∫ϕ dP_0 in the class Φ_α of all critical functions ϕ fulfilling ∫ϕ dP ≤ α for P ∈ P. The mathematical problem is to show that for any sequence of critical functions ϕ_n : X → [0, 1], n ∈ N, there exists a subsequence N_0 and a critical function ϕ_0 : X → [0, 1] such that (∫ϕ_n dP)_{n∈N_0} → ∫ϕ_0 dP for P ∈ P and for P = P_0. This so-called “weak compactness theorem” occurs already
in Banach (1932). It was proved independently by Lehmann (1959, p. 354, Theo-
rem 3) under the assumption that the underlying σ -field A is countably generated
and P|A is dominated. The same result is obtained by Nölle and Plachky (1967, p.
182, Satz) for arbitrary A (by considering the sub-σ-field generated by a sequence ϕ_n, n ∈ N, with ∫ϕ_n dP_0, n ∈ N, approaching the supremum), and by Landers and Rogge (1972, p. 339, Theorem) for arbitrary A and without domination of P, using that Φ_α is convex, and that, therefore, the P_0-weak closure of Φ_α coincides with the
P0 -strong closure. Even mathematically advanced textbooks like Schmetterer (1974,
p. 14, Theorem XI), Witting (1985, p. 207, Korollar 2.15) or Lehmann (1986, p. 576,
Theorem 3) withhold this result from the reader, so that their existence theorems for
most powerful (or most stringent) tests are confined to dominated hypotheses.
Whether the supremum is attained or not is irrelevant from the practical point of view; in any case, it can be approximated as closely as one likes. Moreover, the results mentioned above assert the existence of a critical function for which the supremum is attained, without giving any advice how such an optimal critical function can be obtained. After all, the real problems lie somewhere else: In a real testing problem, there is not one, but a whole class of alternatives, and the question is: For which kind of models does there exist a critical function ϕ_0 in Φ_α such that ∫ϕ_0 dP_0 = sup{∫ϕ dP_0 : ϕ ∈ Φ_α}, simultaneously for every P_0 in the class of
² Having been undecided between probability and statistics, Halmos made up his mind as early as 1937: “I’ll take probability, and to hell with Fisher” (see Halmos 1985, p. 65). See, however, his fundamental contribution to the concept of “sufficiency” in Halmos and Savage (1949).
nothing to statistics which is unique in the sense that “nobody else could have done
it”.
Universal Theories
Statisticians with a strong preference for abstract results might be disappointed by
the fact that attempts at developing a universal theory which “solves” all statistical
problems—like Wald’s “Decision Theory”, or Le Cam’s “Theory of Experiments”—
are not accepted by statisticians interested in problems of practical relevance.
Since the problems dealt with in statistical theory are so diverse, no experienced
statistician would seriously consider the possibility of placing them all in the Pro-
crustean bed of a coherent theory. It required the courage of a mathematician with
limited experience in statistics to build a theory on three principles, each of which is
unreasonable by itself:
(i) Given a decision space (D, D), the consequence drawn on the basis of a sample x from a distribution P can be evaluated by a loss function a → ℓ(a, P).
(ii) The performance of a randomized decision function D : X × D → [0, 1] at P is evaluated by the expected loss, or risk, R(D, P) = ∫∫ ℓ(a, P) D(x, da) P(dx).
(iii) The overall performance of a statistical procedure on a family P of probability measures is evaluated by the maximal expected loss sup_{P∈P} R(D, P).
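For a nonrandomized estimator, (ii) reduces to the familiar risk integral: with D(x, ·) the point mass at ϑ̂(x) and quadratic loss ℓ(a, P_ϑ) = (a − ϑ)², for instance,
\[
R(D, P_\vartheta) = \int \bigl(\hat\vartheta(x) - \vartheta\bigr)^2\, P_\vartheta(dx),
\]
the mean squared error.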
Wald’s conception of “decision making” as the prototype of a statistician’s usual activity was obviously inspired by hypothesis testing (more precisely: by acceptance sampling). This impression is confirmed by the content of the lectures Wald gave in 1941 at Notre Dame University (Wald 1941a): About 25 pages are on testing and confidence intervals, and 4 on estimation. Even in the final shape Wald gave to his decision theory in 1950, his endeavours to include estimation in this framework seem inadequate. Around 1941 (see Wald 1941b), Wald’s methodological background on estimation theory was confined to some papers by Fisher; in the finished version of his decision theory, Wald (1950), he added just one more paper: Pitman (1939). Motivated by Pitman’s paper (p. 401) he suggests evaluating an estimator ϑ̂ by the loss function 1_{|ϑ̂ − ϑ| > t}. This ignores other aspects which might be relevant for the evaluation of estimators, such as unbiasedness, or measures of concentration other than the probability of |ϑ̂ − ϑ| ≤ t. Even now, textbooks which introduce the conceptual
framework of decision theory (like Heyer 1982, pp. 16–24 or Witting 1985, pp.
1–17) make no use of it as soon as it comes to estimation theory.
Among the “principles” constituting the conceptual framework of decision theory, the “minimax principle” is the most unreasonable one. In his endeavours to justify a mathematically fruitful idea from the methodological point of view, Wald seems to be ill at ease. In Wald (1939), his approach to the evaluation of statistical procedures is Bayesian. In Wald (1943) he changes his position. Since an “a priori distribution ... is usually unknown ... it seems of interest to consider a decision function which minimizes the maximum risk” (p. 267), and similarly in Wald (1950, p. 27): “it is perhaps not unreasonable for the experimenter to behave as if Nature wanted to maximize the risk”. As Jimmy Savage never stopped telling, Wald’s attitude towards
the minimax principle was just “let’s try whether something reasonable comes out
of it”, a fact confirmed by Wolfowitz (1952, p. 8).
The minimax principle was refuted by philosophers on philosophical grounds (Carnap 1952, Sect. 25, pp. 81–90) and criticized by statisticians on statistical grounds: Hodges and Lehmann (1950, pp. 190/1) determine the minimax estimator of the parameter p in the binomial distribution under quadratic loss. It turns out that the risk of the minimax estimator is larger than the risk of the usual mean-unbiased estimator k/n, except on a small interval about p = 1/2 (which shrinks to {1/2} as n tends to infinity); a numerical sketch of this comparison is given below, after Fisher’s quote. For more on the history of the minimax principle
see L.D. Brown (1964). We conclude our remarks on “Decision Theory” with an
important voice, Fisher (1959, p. 101):
The idea that this responsibility [i.e., the interpretation of observations] can be delegated to
a giant computer programmed with Decision Functions belongs to the phantasy of circles
rather remote from scientific research.
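To make the Hodges–Lehmann comparison concrete, here is a minimal numerical sketch (our illustration, not part of the original text). It relies on the classical closed forms: under quadratic loss the minimax estimator of a binomial proportion is (k + √n/2)/(n + √n), with constant risk 1/(4(1 + √n)²), while k/n has risk p(1 − p)/n.

```python
import numpy as np

n = 25
p = np.linspace(0.0, 1.0, 11)

# Risk (mean squared error) of the unbiased estimator k/n:
risk_unbiased = p * (1 - p) / n

# The Hodges-Lehmann minimax estimator (k + sqrt(n)/2)/(n + sqrt(n))
# has risk constant in p under quadratic loss:
risk_minimax = np.full_like(p, 1.0 / (4.0 * (1.0 + np.sqrt(n)) ** 2))

for pi, ru, rm in zip(p, risk_unbiased, risk_minimax):
    better = "minimax" if rm < ru else "k/n"
    print(f"p = {pi:.1f}   risk(k/n) = {ru:.5f}   risk(minimax) = {rm:.5f}   smaller: {better}")
```

For n = 25 the minimax estimator still has the smaller risk on a fairly wide interval around p = 1/2 (roughly 0.22 < p < 0.78); the interval shrinks to {1/2} as n grows, in line with the criticism quoted above.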
References
Aitken, A. C., & Silverstone, H. (1942). On the estimation of statistical parameters. Proceedings of
the Royal Society of Edinburgh Section A, 61, 186–194.
Banach, S. (1932). Théorie des opérations linéaires. Monografie Matematyczne 1, Subwencji Fun-
duszu Kultury Narodowej, Warszawa.
Brown, L. D. (1964). Sufficient statistics in the case of independent random variables. The Annals
of Mathematical Statistics, 35, 1456–1474.
Carnap, R. (1952). The continuum of inductive methods. Chicago: University Chicago Press.
Cramér, H. (1946). Mathematical methods of statistics. Princeton: Princeton University Press.
Darmois, G. (1945). Sur les limites de la dispersion de certaines estimations. Revue de l’Institut
International de Statistique, 13, 9–15.
Fisher, R. A. (1930). Inverse probability. Mathematical Proceedings of the Cambridge Philosophical
Society, 26, 528–535.
Fisher, R. A. (1935). The logic of inductive inference (with discussion). Journal of the Royal
Statistical Society, 98, 39–82.
Fisher, R. A. (1959). Statistical methods and scientific inference (2nd ed.). Edinburgh: Oliver and
Boyd.
Fréchet, M. (1943). Sur l’extension de certaines évaluations statistiques au cas de petits échantillons.
Revue de l’Institut International de Statistique, 11, 182–205.
Halmos, P. R. (1985). I want to be a mathematician: An automathography. Berlin: Springer.
Halmos, P. R., & Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory
of sufficient statistics. The Annals of Mathematical Statistics, 20, 225–241.
Hampel, F. R. (1971). A general qualitative definition of robustness. The Annals of Mathematical
Statistics, 42, 1887–1896.
Heyer, H. (1982). Theory of statistical experiments, Springer series in statistics. Berlin: Springer.
Hodges, J. L, Jr., & Lehmann, E. L. (1950). Some problems in minimax point estimation. The
Annals of Mathematical Statistics, 21, 182–197.
Jurečková, J., & Sen, P. K. (1996). Robust statistical procedures: Asymptotics and interrelations. New York: Wiley.
Landers, D., & Rogge, L. (1972). Existence of most powerful tests for undominated hypotheses. Z.
Wahrscheinlichkeitstheorie verw. Gebiete, 24, 339–340.
Laplace, P. S. (1812). Théorie analytique des probabilités. Paris: Courcier.
Le Cam, L. (1979). On a theorem of J. Hájek. In J. Jurečkova (Ed.), Contributions to statistics: J.
Hájek memorial volume (pp. 119–137). Prague: Akademia.
Lehmann, E. L. (1959). Testing statistical hypotheses. New York: Wiley.
Lehmann, E. L. (1986). Testing statistical hypotheses (2nd ed.), Wiley series in probability and
mathematical statistics: Probability and mathematical statistics. New York: Wiley.
Lehmann, E. L. (1995). Neyman’s statistical philosophy. Probability and Mathematical Statistics, 15, 29–36.
Lehmann, E. L., & Stein, C. (1948). Most powerful tests of composite hypotheses. I. Normal
distributions. The Annals of Mathematical Statistics, 19, 495–516.
Merton, R. K. (1973). The sociology of science: Theoretical and empirical investigations. Chicago:
University of Chicago press.
Neyman, J. (1941). Fiducial argument and the theory of confidence intervals. Biometrika, 32, 128–
150.
Nölle, G., & Plachky, D. (1967). Zur schwachen Folgenkompaktheit von Testfunktionen. Z.
Wahrscheinlichkeitstheorie verw. Gebiete, 8, 182–184.
Pitman, E. J. G. (1939). Tests of hypotheses concerning location and scale parameters. Biometrika, 31(1/2), 200–215.
Poisson, S.-D. (1837). Recherches sur la probabilité des jugements. Paris: Bachelier.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters.
Bulletin of Calcutta Mathematical Society, 37, 81–91.
Rao, C. R. (1962). Apparent anomalies and irregularities in maximum likelihood estimation (with
discussion). Sankhyā: The Indian Journal of Statistics: Series A, 24, 73–102.
Rieder, H. (1994). Robust asymptotic statistics. Berlin: Springer.
Schmetterer, L. (1974). Introduction to mathematical statistics (translation of the 2nd German ed.). Berlin: Springer.
Slutsky, E. (1925). Über stochastische Asymptoten und Grenzwerte. Metron, 5(3), 3–89.
Stigler, S. M. (1999). Statistics on the table: The history of statistical concepts and methods.
Cambridge: Harvard University Press.
Wald, A. (1939). Contributions to the theory of statistical estimation and testing hypotheses. The
Annals of Mathematical Statistics, 10, 299–326.
Wald, A. (1941a). On the Principles of Statistical Inference. Four Lectures Delivered at the Univer-
sity of Notre Dame, February 1941. Published as Notre Dame Mathematical Lectures 1, 1952.
Wald, A. (1941b). Asymptotically most powerful tests of statistical hypotheses. The Annals of
Mathematical Statistics, 12, 1–19.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wald, A. (1950). Statistical decision functions. New York: Wiley.
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal
of the American Statistical Association, 22, 209–212.
Witting, H. (1985). Mathematische Statistik I: Parametrische Verfahren bei festem Stichprobenumfang. Stuttgart: Teubner.
Wolfowitz, J. (1952). Abraham Wald, 1902–1950. The Annals of Mathematical Statistics, 23, 1–14.
Wolfowitz, J. (1969). Reflections on the future of mathematical statistics. In R. C. Bose, et al. (Eds.),
Essays in probability and statistics (pp. 739–750). Chapel Hill: University of North Carolina Press.
Chapter 2
Sufficiency
Comparisons between estimators based on the asymptotic variance have a long his-
tory, dating back to Laplace and Gauss. To express in mathematical terms the idea
that “S contains all information included in T ”, one needs the joint distribution of S
and T . In this way, Laplace (1818) came across the example of two estimators S and
T such that no linear combination of S and T has a smaller asymptotic variance than
S itself (Stigler 1973). This could be considered as a restricted version of Fisher’s
idea that “S contains all information included in T ”.
How did Fisher arrive at his idea of “sufficiency”? Unaware (as often) of what
others had done before him, Fisher considered the estimators
\[
S(x_1, \dots, x_n) = \Bigl(n^{-1} \sum_{\nu=1}^{n} (x_\nu - \bar x_n)^2\Bigr)^{1/2}
\]
and
\[
T(x_1, \dots, x_n) = \sqrt{\pi/2}\;\, n^{-1} \sum_{\nu=1}^{n} |x_\nu - \bar x_n|
\]
with
\[
\mu_n(x_1, \dots, x_n) = n^{-1} \sum_{\nu=1}^{n} x_\nu
\]
and
\[
s_n(x_1, \dots, x_n) = \Bigl(n^{-1} \sum_{\nu=1}^{n} \bigl(x_\nu - \mu_n(x_1, \dots, x_n)\bigr)^2\Bigr)^{1/2}.
\]
For ν = 1, ..., n let
\[
\psi_\nu(u_1, \dots, u_n) := \bigl(u_\nu - \mu_n(u_1, \dots, u_n)\bigr) / s_n(u_1, \dots, u_n).
\]
under N(μ, σ²)^n × N(0, 1)^n is N(μ, σ²)^n. (Example 2 in Kumar and Pathak 1977; see Pfanzagl 1981 for a generalization.)
We now discuss the mathematical difficulties connected with the generalization of Fisher’s idea to a more general framework. Let (X, A), (Y, B) and (Z, C) be measurable spaces and S : X → Y and T : X → Z measurable maps. Let P|A be a probability measure. Speaking of the conditional distribution of T within the partition {x ∈ X : S(x) = y} requires, in mathematical terms, the existence of a regular conditional probability, i.e. a Markov kernel M|Y × C such that, for every C ∈ C, M(·, C) is a conditional expectation of 1_{T^{-1}C}, given S, under P, or in other words:
\[
\int M(S(x), C)\, 1_B(S(x))\, P(dx) = P(S^{-1}B \cap T^{-1}C) \quad \text{for every } B \in \mathscr B. \tag{2.1.1}
\]
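A standard instance (our illustration): let X = R², P = N(0, 1)², S(x_1, x_2) = x_1 + x_2 and T(x_1, x_2) = x_1. Then
\[
M(y, C) := N\bigl(y/2, \tfrac12\bigr)(C), \qquad C \in \mathscr C,
\]
fulfills (2.1.1): the conditional distribution of x_1, given x_1 + x_2 = y, is normal with mean y/2 and variance 1/2.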
The question whether the idea of the conditional distribution of T, given S, can always be expressed by means of a Markov kernel was answered in the affirmative by Doob (1938, Theorem 3.1, pp. 95/8) for T(x) = x under the assumption that A is countably generated, but his proof is valid only for (X, A) = (R, B), with B the Borel σ-field. It appears that the existence of a Markov kernel was not a trivial problem then. Even Halmos published an erroneous result (1941, p. 390, Theorem 1), based on Doob’s Theorem. A counterexample by Dieudonné¹ (1948, p. 42) demonstrated that the existence of a Markov kernel can be guaranteed only under additional restrictions, such as the compact approximation of P. For the more general case of maps T : (X, A) → (Z, C), the restrictive assumptions have to be
¹ It is interesting to observe how the authorship of a nontrivial example was lost as time went on. Dieudonné’s example appears in various textbooks as an exercise with reference to the author (e.g. Doob 1953, p. 624; Halmos 1950, p. 210, Exercise 4 and p. 292, Reference to Sect. 48; Dudley 1989, p. 275, Problem 6 and Note Sect. 10.2, p. 298). It also appears in Ash 1972, p. 267, Problem 4, without reference to Dieudonné, and in Romano and Siegel (1986, pp. 138/9, Example 6.13) as an “Example given in Ash”. In Lehmann and Casella (1998, p. 35) it was presented as an “Example due to Ash, presented by Romano and Siegel (1986)”.
Remark In Definition 2.1.2, the σ-field B occurs three times: (i) in the A,B-measurability of S, (ii) in the B,B-measurability of ψ_A, and (iii) in relation (2.1.2). If Y is a Euclidean space, the pertaining Borel field is a natural choice for B. If Y is an abstract space, the presence of an unspecified σ-field B is irritating. This, presumably, motivated Bahadur to introduce the concept of a “sufficient transformation”, which uses for B the σ-field induced on Y by S, i.e., B_S := {B ⊂ Y : S^{-1}B ∈ A}, the largest σ-field rendering x → S(x) A-measurable.
Most authors ignore the σ-field B which shows up in Definition 2.1.2 of sufficiency. (An exception is Witting 1966, who discusses this problem in footnote
1, p. 121.) Does the sufficiency of S really depend on the (more or less arbitrary) σ-field B?
The operational significance of sufficiency depends on relation (2.2.2). Using Proposition 1.10.25 in Pfanzagl (1994, p. 60), which connects (2.1.2) with (2.2.2), it can be shown that (2.1.2) holds for every countably generated σ-field B ⊂ P(Y) containing {y} for P∘S-a.a. y ∈ Y if it holds for some σ-field sharing these properties. Presumably this result also holds true under less restrictive regularity conditions.
The concept of a sufficient transformation was almost entirely neglected in the
literature. Sato (1996) seems to have been the first author to reflect upon the rela-
tionship between a sufficient statistic and a sufficient transformation. He shows
(p. 283, Lemma) that the two concepts are identical if (X, A ) and (Y, B) are
Euclidean and P is dominated. This results from the special nature of the spaces
(X, A ) and (Y, B): If f : (Rm , Bm ) → (Rk , Bk ), then for every B ⊂ Rk with
f −1 B ∈ Bm , there exists B0 ∈ Bk with B0 ⊂ B and λm (( f −1 B) ( f −1 B0 )) = 0,
so that {B ∈ Rk : f −1 B ∈ Bm } is not much larger than f −1 Bk . (See also Bahadur
1955b, p. 493, Lemma 5.)
We shall call S exhaustive if relation (2.2.1) holds with T(x) = x, i.e., if P itself can be obtained by a randomization procedure based on S(x). More precisely: if there exists a Markov kernel M|Y × A such that
\[
\int M(S(x), A)\, P(dx) = P(A) \quad \text{for } A \in \mathscr A \text{ and } P \in \mathscr P. \tag{2.2.2}
\]
there are none: If relation (2.2.2) is true (no matter where the Markov kernel comes from), the statistic S is sufficient. This follows from a result obtained by Sacksteder (1967, p. 788, Theorem 2.1) on the comparison of experiments (see also Heyer 1969, p. 39, Satz 5.2.1). Without domination this follows from Roy and Ramamoorthi (1979, p. 50, Theorem 2) under the assumption that B is countably generated. For dominated families, the simplest way to show that exhaustivity implies sufficiency is to use a result of Pfanzagl (1974, p. 197, Theorem) which implies, in particular, that S is sufficient if for every A ∈ A there exists a critical function ϕ_A such that ∫ϕ_A(S(x)) P(dx) = P(A) for every P ∈ P.
It is tempting to conclude from (2.2.2) that M(·, A) itself is a conditional expectation of 1_A, given S, i.e., that (2.1.2) holds with ψ_A = M(·, A). This is, however, not generally true.
Example Let X = {−1, 0, 1} × {−1, 1} be endowed with the σ-field P(X) of all subsets of X. For ϑ ∈ (0, 1) let
\[
P_\vartheta(\{(\eta, \varepsilon)\}) := (1 - |\eta|)(1 - \vartheta)/2 + |\eta|\vartheta/4 \quad \text{for } \eta \in \{-1, 0, 1\},\ \varepsilon \in \{-1, 1\}.
\]
for ρ ∈ {0, 1}, η ∈ {−1, 0, 1} and ε, δ ∈ {−1, 1}, fulfills relation (2.2.2), which is to say that
\[
\int M\bigl(y, \{(\eta, \varepsilon)\}\bigr)\, P \circ S(dy) = (1 - |\eta|)(1 - \vartheta)/2 + |\eta|\vartheta/4 = P_\vartheta(\{(\eta, \varepsilon)\}),
\]
Relation (2.1.2) implies that (2.2.3) is, in fact, true for P ◦ S-a.a. y ∈ Y . If B is
countably generated, relation (2.2.3) implies
The relevant point is the converse: (2.2.3) and (2.2.2) together imply (2.1.2), i.e., M(·, A) is necessarily a conditional expectation of 1_A, given S, with respect to P. (Blackwell and Dubins 1975, pp. 741/2. For a detailed proof see Pfanzagl 1994, p. 60, Lemma 1.10.24 and Proposition 1.10.25.)
Of course, one would prefer to have a version of M such that (2.2.3) holds for all
(rather than just for P ◦ S-a.a.) y ∈ Y . According to Blackwell and Ryll-Nardzewski
(1963, p. 223, Theorem 1) this cannot be generally achieved, even if (X, A ) is Polish.
The idea that random variables with distribution P can be obtained from random
variables with distribution P ◦ S by means of a randomization device (not depending
on P) if S is sufficient was extended to the comparison of “experiments” by Blackwell
(1951, 1953). Let X = (X, A , {Pi : i ∈ I }) and Y = (Y, B, {Q i : i ∈ I }) be two
“experiments” with an arbitrary index set I . Then Y is called “sufficient for X ” if
there exists a Markov kernel M|Y × A such that
\[
P_i(A) = \int M(y, A)\, Q_i(dy) \quad \text{for } A \in \mathscr A \text{ and } i \in I.
\]
Sufficient sub-σ -fields have been introduced by Bahadur (1954, p. 430), following
a suggestion of L.J. Savage (see p. 431):
That means: For every A ∈ A there exists a function ψ A : (X, A0 ) → (R, B) such
that
\[
\int \psi_A(x)\, 1_{A_0}(x)\, P(dx) = P(A_0 \cap A) \quad \text{for every } A_0 \in \mathscr A_0.
\]
Fisher’s idea of sufficiency was related to statistics S that are estimators (of some
parameter ϑ). As soon as it became clear that an interpretation of S as an estimator is
irrelevant for the idea of “sufficiency”, it was also clear that the values taken by the
statistic S are irrelevant, too: Any statistic Ŝ which is one-one with S serves the same
purpose. If the image-space of S is irrelevant, the same is true for the σ -field with
which the image-space of S might be endowed. Consequently, Bahadur suggested the concept of a sufficient transformation as a map from X to Y, basing the definition of sufficiency on A_0 := A ∩ S^{-1}(P(Y)).
The evident notational simplifications which result from studying a statistic in terms of the
subfield induced by it suggest the possibility of taking a sufficient subfield rather than a
sufficient statistic to be the basic concept in the formal exposition. (Bahadur 1954, p. 430.)
When Bahadur wrote these lines, he was not fully aware of the troubles connected
with the duplicity between “sufficient statistics” and “sufficient sub-σ -fields”, still
present in many textbooks. Obviously, S:(X, A ) → (Y, B) is sufficient iff S −1 B is
sufficient. The problem lies with the sufficiency of sub-σ -fields that are not inducible
by means of a statistic. Bahadur (1954) was still uncertain about this point, which,
however, was soon clarified: According to a Lemma of Blackwell (see Lemma 1
in Bahadur and Lehmann 1955, p. 139) a subfield A0 ⊂ A cannot be induced by
a statistic if {x} ∈ A0 for every x ∈ X . The situation is much more favourable if
A is countably generated. Under the additional assumption that P is dominated,
Bahadur (1955, Lemmas 3 and 4, pp. 492–493) proved for any sub-σ -field A0 ⊂ A
the existence of a statistic f : (X, A ) → (R, B) such that A0 = f −1 (B) (P).
More important in connection with sufficiency is a result of Burkholder (1961, p.
1200, Theorem 7): If A is countably generated, then for any sufficient sub-σ -field
A0 ⊂ A there exists a statistic f :(X, A ) → (R, B) such that A0 = f −1 (B).
Hence for A countably generated, the concepts of “sufficient statistic” and “suf-
ficient sub-σ -field” are equivalent. Yet even in this case, puzzling things may occur.
If a sufficient statistic S : (X, A) → (Y, B) is the contraction of a statistic S′ : (X, A) → (Y′, B′) (i.e., S = g ∘ S′ with g : (Y′, B′) → (Y, B)), then S′ is sufficient, too. In contrast, a sub-σ-field containing a sufficient sub-σ-field is not necessarily sufficient itself: For (X, A) = (R, B), the σ-field B_0 of all sets in B
symmetric about 0 is sufficient for the family of all probability measures on B which
are symmetric about 0. Burkholder (1961, pp. 1192/3, Example 1) constructs a sub-
σ -field of B containing B0 , which fails to be sufficient. Such a paradox is impossible
if P is dominated (Bahadur 1954, p. 440, Theorem 6.4).
Instead of making the theory of sufficiency more elegant, the introduction of
sufficient sub-σ -fields gave rise to mathematical problems which have nothing to do
with the idea of sufficiency as such. Whereas the interpretation of a sufficient statistic
is clear, neither Bahadur (1954, pp. 431/2) nor any of his followers have so far been
able to explain the operational significance of a sufficient sub-σ -field. Bahadur’s
argument (1954, pp. 431/2) is based on the partitions induced by the σ -fields A
and A0 , say π(x) := {A ∈ A : x ∈ A} and π0 (x) := {A ∈ A0 : x ∈ A},
respectively: “If A0 is sufficient, a statistician who knows only π0 (x) is as well off
as one who knows π(x).” That this argument works if A0 is the sub-σ -field induced
by S (so that π_0(x) = S^{-1}{S(x)}) is not a convincing argument for replacing the concept of a sufficient statistic by the concept of a sufficient sub-σ-field; and what if, in the general case, π_0(x) = π(x) = {x}? Leaving aside all mathematical aspects,
we still encounter a problem with the interpretation of sufficiency. If one thinks of
S(x) ∈ Y as “containing all information contained in x” one might feel uneasy if the
definition of sufficiency refers to some σ -field of subsets of Y .
Let S and T be two real-valued statistics with a joint density pϑ (s, t). According to
Fisher (1922, p. 331) “the factorization of pϑ into factors involving (ϑ, s) and (s, t),
respectively, is merely a mathematical expression of the condition of sufficiency”.
More precisely (Fisher 1934, p. 288): “If pϑ (s, t) = gϑ (s)h(s, t), the conditional
distribution of T , given S, will be independent of ϑ” (hence S is sufficient in Fisher’s
sense).
One of the important achievements of Neyman was to show that this factorization
is even necessary. Starting from the definition that S is sufficient if the conditional
distribution of any other statistic T is independent of P (Neyman 1935, p. 325),
Neyman asserts in Teorema II, p. 326, that the factorization is necessary and sufficient
for the sufficiency of S. His theorem refers to a parametric family, and to i.i.d. samples.
Written in our notations this is
Neyman was unaware that Fisher had already shown that the factorization implies
sufficiency. Neyman’s proof of the nontrivial necessity part of the theorem (Sect. 7,
pp. 328–332) is spoiled by numerous regularity conditions (including, for instance,
the differentiability of S). Obviously, Neyman was not familiar with the measure-
theoretic tools developed by Kolmogorov (1933). The first mathematically “up to
date” paper was that of Halmos and Savage (1949). Their Theorem 1 (p. 233) asserts
that for a dominated family P, a necessary and sufficient condition for the sufficiency
of S : (X, A ) → (Y, B) is the existence of a dominating measure, say μ0 , such
that every P ∈ P admits an S −1 B-measurable version of d P/dμ0 . Expressed with
reference to an arbitrary dominating measure μ, and with explicit use of the sufficient
statistic S, this is their Corollary 1, p. 234.
Factorization Theorem Assume that P is dominated by μ (P ≪ μ). Then S is sufficient if and only if there exists a nonnegative A-measurable function h : (X, A) → ([0, ∞), B ∩ [0, ∞)) and, for every P ∈ P, a nonnegative B-measurable function g_P : (Y, B) → [0, ∞) such that dP/dμ(x) = g_P(S(x)) h(x) for μ-a.a. x ∈ X.
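As an illustration of the theorem (ours, not part of the original text): for P = {N(ϑ, 1)^n : ϑ ∈ R}, μ = λ^n and S(x) = Σ_{ν=1}^n x_ν,
\[
\frac{dP_\vartheta}{d\lambda^n}(x) = \underbrace{\exp\Bigl(\vartheta \sum_{\nu=1}^{n} x_\nu - \tfrac{n\vartheta^2}{2}\Bigr)}_{g_\vartheta(S(x))}\;\underbrace{(2\pi)^{-n/2} \exp\Bigl(-\tfrac12 \sum_{\nu=1}^{n} x_\nu^2\Bigr)}_{h(x)},
\]
so that S is sufficient.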
2.5 Completeness
A first version of this theorem appears in Lehmann and Scheffé (1955, pp. 223/4,
Theorem 7.3). In Lehmann and Scheffé (1950, pp. 313–315) the completeness for
various exponential families had been proved by means of ad hoc arguments using
power series expansions, Laplace and Mellin transformations.
The condition that {(a1 (P), . . . , am (P)) : P ∈ P} has a nonempty interior is
sufficient but not necessary for the completeness of {P ◦ (T1 , . . . , Tm ) : P ∈ P}.
Examples of complete “curved” exponential families can be found in Messig and
Strawderman (1993). For a particularly simple example see Pfanzagl (1994, p. 96,
Example 2.7.3).
What matters for applications is the completeness of sufficient statistics for i.i.d.
products. Since i.i.d. products of exponential families are exponential, too, this prob-
lem is solved by the theorem above.
Another important result is the [bounded] completeness of the order statistic for certain nonparametric families, referred to as “symmetric [bounded] completeness”. A precise proof may be found in Heyer (1982, p. 41, Theorem 6.15).
(A simpler proof for convex—rather than weakly convex—families can be found in Pfanzagl 1994, p. 20, Lemma 1.5.9.)
Since the [bounded] completeness of P|A implies the [bounded] completeness of {P_1 × ··· × P_n : P_i ∈ P, i = 1, ..., n} (see Landers and Rogge 1976, p. 139, Theorem; improving an earlier result of Plachky 1977 concerning bounded completeness), the Polarization Lemma implies the following theorem (see Mandelbaum
and Rüschendorf 1987, p. 1239, Theorem 7):
If P is [boundedly] complete and weakly convex, then {P n : P ∈ P} is symmet-
rically [boundedly] complete for every n ∈ N.
According to Mattner (1996, p. 1267, Theorem 3), this implies that {P n : P ∈ P}
is symmetrically complete if P|B contains all P with unimodal Lebesgue densities.
In this connection we mention a result of Hoeffding (1977) concerning families that are symmetrically incomplete, but symmetrically boundedly complete, which generalizes an earlier result of Fraser (1954, p. 48, Theorem 2.1).
If ∫u dP = 0 for some u : (X, A) → (R, B) and some P|A, then
\[
f_n(x_1, \dots, x_n) := \sum_{\nu=1}^{n} u(x_\nu)\, h_{n-1}(x_{n\cdot\nu}) \tag{2.5.1}
\]
is symmetric and fulfills ∫f_n dP^n = 0 if h_{n−1} : X^{n−1} → R is a P^{n−1}-integrable, symmetric function and x_{n·ν} := (x_1, ..., x_{ν−1}, x_{ν+1}, ..., x_n).
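The relation ∫f_n dP^n = 0 is a one-line check (our addition): since the observations are i.i.d. under P^n, each summand factorizes, so
\[
\int f_n\, dP^n = \sum_{\nu=1}^{n} \Bigl(\int u\, dP\Bigr)\Bigl(\int h_{n-1}\, dP^{n-1}\Bigr) = 0.
\]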
Let now P_u be the family of all P dominated by μ that fulfill ∫u dP = 0. A result of Hoeffding (1977, p. 279, Theorem 1B) implies that any symmetric function f_n fulfilling ∫f_n dP^n = 0 for P ∈ P_u can be represented by (2.5.1). Hence (Theorem 2B, p. 280) the family {P^n : P ∈ P_u} is symmetrically boundedly complete if u is unbounded.
Refined mathematical techniques have been used to obtain a great variety of
complete and/or boundedly complete families (for instance Bar-Lev and Plachky
1989 or Isenbeck and Rüschendorf 1992). Referring, so to speak, to the sample size
1, these results seem to be answers waiting for questions.
If sufficient statistics can be used for a reduction of the data, the aspiration to maximal reduction leads to the concept of a “minimal sufficient statistic”, which is the contraction of any sufficient statistic.
Definition 2.6.1 The sufficient statistic S0 : (X, A ) → (Y0 , B0 ) is minimal suffi-
cient for P if for any sufficient statistic S : (X, A ) → (Y, B) there exists a function
H : (Y, B) → (Y0 , B0 ) such that S0 = H ◦ S (P).
The concept of a minimal sufficient statistic was introduced by Lehmann and Scheffé (1950, Sect. 6). Their interest in minimal sufficient statistics was motivated by applications to the theory of similar tests and unbiased estimators. The theory in this field becomes particularly clear if there exists a sufficient statistic, say S, such that P ∘ S is (boundedly) complete. Lehmann and Scheffé (1950, p. 316, Theorem 3.1) show that “S sufficient and P ∘ S boundedly complete” implies that S is minimal sufficient, provided a minimal sufficient statistic exists. Bahadur (1957, p. 217, Theorem 3) shows that this proviso is dispensable: “S sufficient and P ∘ S boundedly complete” implies that S is minimal sufficient.
In their Sect. 6, pp. 327ff, Lehmann and Scheffé suggest a workable technique for obtaining a minimal sufficient statistic. Their basic assumption is that there exists a countable subfamily P_0 ⊂ P which is dense with respect to the sup-distance. This implies the existence of a dominating measure, say μ, equivalent to P. (Recall that, conversely, the existence of such a subfamily follows for dominated families if A is countably generated.) To determine a minimal sufficient statistic, Lehmann and Scheffé introduce the “operation ϑ”, which amounts to determining for x_0 ∈ X the set of all x such that q_P(x)/q_P(x_0) is independent of P ∈ P. (The fact that this is formulated in terms of a parametric family is of no relevance.) This means determining a function S|X such that S(x) = S(x_0) implies that q_P(x)/q_P(x_0) is independent of P ∈ P. According to Theorem 6.3, p. 336, the resulting statistic S is minimal sufficient if the “operation ϑ” is applied with P replaced by P_0. The restriction to a countable subfamily is required since the densities are unique only μ-a.e.
Presuming that a sufficient statistic S : (X, A) → (Y, B) has been found, i.e., that there exists a μ-density of P of the form x → g_P(S(x))h(x), the basic idea of Lehmann and Scheffé can be put into the following form:
Assume that A is countably generated and (Y, B) Polish. If there exists a countable dense subfamily P_0 ⊂ P such that
\[
g_P(y') = g_P(y'') \ \text{for } P \in \mathscr P_0 \quad \text{implies} \quad y' = y'',
\]
then S is minimal sufficient.
family, Sato’s result reads as follows: Let (X, A) and (Y, B) be Euclidean spaces, P a dominated family on A and P_0 a countable dense subfamily. Assume that the densities are continuous in the sense that d(P_n, P_0) → 0 implies q_{P_n} → q_{P_0} μ-a.e. If for arbitrary x′, x″ the relation S(x′) = S(x″) implies that q_P(x′)/q_P(x″) is independent of P ∈ P, then S is minimal sufficient.
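A sketch of how the technique works in the simplest case (our illustration): for P = {N(ϑ, 1)^n : ϑ ∈ R},
\[
\frac{q_\vartheta(x)}{q_\vartheta(x_0)} = \exp\Bigl(\vartheta \sum_{\nu=1}^{n} (x_\nu - x_{0\nu}) - \tfrac12 \sum_{\nu=1}^{n} (x_\nu^2 - x_{0\nu}^2)\Bigr)
\]
is independent of ϑ if and only if Σ_{ν=1}^n x_ν = Σ_{ν=1}^n x_{0ν}; hence x̄_n is minimal sufficient.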
The duplicity between sufficient statistics and sufficient sub-σ -fields extends to
the concept of “minimality”.
(Σ_{ν=1}^n T_1(x_ν), ..., Σ_{ν=1}^n T_k(x_ν)) is sufficient and complete, hence minimal sufficient, if the set {(a_1(P), ..., a_k(P)) : P ∈ P} ⊂ R^k has a nonempty interior. (For curved exponential families, this statistic may be minimal sufficient without being complete.)
An interesting result of a general nature is provided by Mattner (2001, p. 3402, Theorem 1.5). If P is a dominated convex family admitting the minimal sufficient σ-field A_0, then the σ-field of all permutation invariant subsets of A_0^n is minimal sufficient for {P^n : P ∈ P}. (Recall the analogous result mentioned in Sect. 2.5 on (not necessarily dominated) convex families that are [boundedly] complete.)
Concerning location and/or scale parameter families on B, one could, roughly
speaking, say that under suitable regularity conditions on the densities, the order
statistic is minimal sufficient, except for particular families, like {N (μ, σ 2 )n : μ ∈
R, σ 2 > 0} or {(μ, σ 2 )n : μ ∈ R, σ 2 > 0}. A recent result in this direction that
requires the use of more subtle mathematics was put forward by Mattner (2000, pp.
1122/3, Theorem 1.1) who shows under a weak regularity condition on the Lebesgue
density of P that for the location and scale parameter family generated by P the order
statistic is minimal sufficient for the sample size n, unless log p is a polynomial of
degree less than n.
If the vague notion that S is sufficient if “S(x) contains all information contained
in x” is taken seriously, it only produces new problems. If the problem is just to
regain from S(x) a random variable equivalent to x: Why use the randomization
device instead of using for S a bijective function from X to R, say? If X = Rn and
S : R^n → R is a bijective map, it is possible to regain from S(x_1, ..., x_n) the original sample (x_1, ..., x_n) itself (rather than a random variable with the same distribution as (x_1, ..., x_n)). Why do statisticians use S(x_1, ..., x_n) = (Σ_{ν=1}^n x_ν, Σ_{ν=1}^n x_ν²) as a sufficient statistic for the family {N(μ, σ²)^n : μ ∈ R, σ > 0} rather than an injective map S : R^n → R, which is trivially sufficient for the larger family of all probability measures on B^n, and is 1- rather than 2-dimensional? Of course, S should
be “regular” in some sense, continuous at least, and it should be interpretable in some
operational sense.
Denny (1964, p. 95, Theorem) proves the existence of a uniformly continuous function S_n : R^n → (0, 1) with nondecreasing partial functions x_ν → S_n(x_1, ..., x_n), ν = 1, ..., n, such that S_n is injective on a subset D_n ⊂ R^n with λ^n(D_n^c) = 0. Such a function S_n is sufficient for the family of all probability measures on B^n with λ^n-density. This result poses a serious problem for the concept of
“sufficiency”. Yet it is ignored in virtually all textbooks on mathematical statistics.
Romano and Siegel (1986, Sect. 7.1, pp. 158/9) even give examples for which, as
they think, “no single [i.e., real-valued] continuous sufficient statistic” exists.
If attention is confined to i.i.d. products, the order statistic is always sufficient.
Mattner (1999, p. 399, Theorem 2.2) constructs a uniformly continuous and strictly
are the most familiar example of families admitting a sufficient statistic for every
sample size:
\[
(x_1, \dots, x_n) \mapsto \Bigl(\sum_{\nu=1}^{n} T_1(x_\nu), \dots, \sum_{\nu=1}^{n} T_m(x_\nu)\Bigr) \quad \text{is sufficient for } \{P^n : P \in \mathscr P\}.
\]
This, of course, is not the only example: If P_ϑ|B ∩ (0, ∞) has Lebesgue density ϑ^{-1} 1_{(0,ϑ)}, then (x_1, ..., x_n) → x_{n:n} is sufficient for {P_ϑ^n : ϑ > 0}.
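The factorization behind this example (our sketch): the joint density is
\[
\prod_{\nu=1}^{n} \vartheta^{-1} 1_{(0,\vartheta)}(x_\nu) = \underbrace{\vartheta^{-n}\, 1_{[0,\vartheta)}(x_{n:n})}_{g_\vartheta(x_{n:n})}\;\underbrace{1_{(0,\infty)}(x_{1:n})}_{h(x)},
\]
which depends on ϑ through x_{n:n} only, so that the sufficiency of x_{n:n} follows from the Factorization Theorem of Sect. 2.4.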
In spite of such isolated examples, the idea soon came up that under certain regu-
larity conditions (in particular if all members of the family P have the same support),
a sufficient statistic for every sample size exists only if the family is exponential. This
idea occurs in vague form in Fisher (1934, Sect. 2.5, pp. 293/4). Readers who are
not satisfied with the outline of a proof offered by Fisher may consult Hald (1998,
p. 728).
Pitman (1936, p. 569) was among the first statisticians to attempt something
approaching a detailed proof. To illustrate the level of mathematical sophistication
at this time we follow Pitman’s argument more closely.
Let P = {P_ϑ : ϑ ∈ Θ}, Θ ⊂ R, be a family with Lebesgue densities p(·, ϑ). Following Fisher’s arguments related to the idea of “maximum likelihood”, Pitman starts from the assumption that Σ_{ν=1}^n ∂_ϑ log p(x_ν, ϑ) is a function of the sufficient statistic S_n : R^n → R, i.e., there exists a function ψ_ϑ : R → (0, ∞) such that
\[
\sum_{\nu=1}^{n} \varphi_\vartheta(x_\nu) = \psi_\vartheta\bigl(S_n(x_1, \dots, x_n)\bigr) \quad \text{with } \varphi_\vartheta(x) := \partial_\vartheta \log p(x, \vartheta). \tag{2.8.1}
\]
Assuming (tacitly) that ψ_ϑ has for some ϑ_0 a differentiable inverse, say f_0, Pitman concludes that
\[
S_n(x_1, \dots, x_n) = f_0\Bigl(\sum_{\nu=1}^{n} \varphi_{\vartheta_0}(x_\nu)\Bigr),
\]
Since this relation holds for every ν_0 = 1, ..., n, it follows that Ψ′_ϑ(y), considered as a function of y, is constant, i.e., Ψ′_ϑ(y) = a(ϑ), hence Ψ_ϑ(y) = a(ϑ)y + b(ϑ).
From (2.8.2), written for n = 1,
\[
\prod_{\nu=1}^{n} p(x_\nu) = g_P\bigl(S_n(x_1, \dots, x_n)\bigr)\, h(x_1, \dots, x_n) \tag{2.8.3}
\]
implies
\[
\sum_{\nu=1}^{n} \varphi_P(x_\nu) = \psi_P\bigl(S_n(x_1, \dots, x_n)\bigr) \tag{2.8.4}
\]
with φ_P(x) = log(p(x)/p_0(x)) and ψ_P(y) = log(g_P(y)/g_{P_0}(y)). Hence it is not
necessary to assume that P is a parametric family in order to obtain a relation like
(2.8.1), and to assume differentiability with respect to ϑ in order to get rid of the
factor h from (2.8.3). Unjustified, too, is the assumption that relation (2.8.3), or the
resulting relation (2.8.4), holds for all (rather than μn -a.a.) (x1 , . . . , xn ) ∈ X n .
There are several papers that disregard this point and try to solve the functional equation (2.8.4) under minimal regularity conditions. It suffices to consider the case n = 2. Let ϕ : X → R and S : X² → R be functions such that
\[
\varphi(x_1) + \varphi(x_2) = \psi\bigl(S(x_1, x_2)\bigr). \tag{2.8.5}
\]
The problem is: Given S, what can be said about pairs of functions (ϕ, ψ) for which
(2.8.5) holds true? It is clear that this can be solved under restrictive conditions on
the functions S, ϕ and ψ only. Considering the origins of this problem, conditions of
operational significance can be placed upon ϕ (corresponding to conditions on the
densities), and conditions on the sufficient statistic S. Conditions on the function ψ
can hardly be justified from an operational point of view.
The intended result is that the solution ϕ is unique up to a linear transformation. More precisely: If (ϕ_i, ψ_i), i = 0, 1, are two solutions, then ϕ_1 = a + bϕ_0 for some constants a, b ∈ R.
Example For P = N(μ, σ²) and P_0 = N(0, 1), the linear space generated by the functions
\[
\ell\bigl(x, (\mu, \sigma^2)\bigr) = -\log\sigma - \frac{\mu^2}{2}\,\sigma^{-2} + \mu\sigma^{-2}\, x + \frac12\bigl(1 - \sigma^{-2}\bigr) x^2, \qquad x \in \mathbb R,
\]
is spanned by the constants and the functions x and x².
S(x) = (T_1(x), ..., T_k(x)), and the map (x_1, ..., x_n) → (Σ_{ν=1}^n T_1(x_ν), ..., Σ_{ν=1}^n T_k(x_ν)) is sufficient for {P^n : P ∈ P}.
In Dynkin’s paper it remains unclear where the dimension (k + 1) really comes from and why the functions T_i can be chosen as ℓ(·, P_i) (see p. 24: “Actually we have...”). It is just the case k ≥ n where the local properties of the functions ℓ(·, P_i) are used to show that the sufficient statistic for the sample size n is locally bijective if the functions ℓ(·, P_i) have continuous derivatives (Dynkin 1951, p. 24, Theorem 2). Perhaps the reader will get on with what Schmetterer (1966, Satz 7.4, pp. 257/8 and 1974, pp. 215/6) has to say about Dynkin’s Theorem 2. A variant of Dynkin’s Theorem 2 is Theorem A in Brown (1964, p. 1461).
An approach that derives the minimal dimension of “regular” sufficient statistics
from local properties of the densities q P is more enlightening. It was first explored by
Barankin and Katz (1959), and continued by Barankin (1961), and later Barankin and
Maitra (1963). The paper from Barankin and Katz (1959), original in its approach
compared with Dynkin, is of poor technical quality and Barankin’s paper (1961) is
mainly a correction note. Hence readers interested in this approach should start with
Barankin and Maitra (1963). Yet, even this paper is outdated. Written for parametric
families it uses derivatives of the densities with respect to the parameters. According
to Shimizu (1966), this is avoidable.
The essence of these papers becomes more transparent if (i) we restrict the con-
siderations to i.i.d. products, and (ii) generalize the framework from parametric to
general families. The basic assumption is that the densities q P have continuous deriv-
atives.
Let x_n = (x_1, ..., x_n) ∈ X^n, where X ⊂ R is an interval. For arbitrary m ∈ N and arbitrary P_i ∈ P, i = 1, ..., m, let M(x_n; P_1, ..., P_m) denote the rank of the matrix
\[
\bigl(\ell'(x_\nu, P_i)\bigr)_{\nu = 1, \dots, n;\; i = 1, \dots, m} \tag{2.8.7}
\]
and let
\[
\rho(x_n) := \sup\bigl\{M(x_n; P_1, \dots, P_m) : m \in \mathbb N,\ P_i \in \mathscr P\bigr\}.
\]
To simplify the presentation, we assume that ρ(x_n) (which will turn out to be the minimal dimension of “regular” sufficient statistics) is the same for every x_n ∈ X^n, say r. By definition, r ≤ n. For arbitrary P_i ∈ P, i = 1, ..., r, let
\[
U(P_1, \dots, P_r) := \{x_n \in X^n : M(x_n; P_1, \dots, P_r) = r\}.
\]
Since ℓ′(·, P) is continuous, the set U(P_1, ..., P_r) is open. The main results of this
approach are
on a subset of X n means in this connection that the factorization holds for all xn in
this subset.)
(ii) If x_n → (S_1(x_n), ..., S_k(x_n)) is sufficient for {P^n : P ∈ P}, then k ≥ r, provided every S_i has continuous partial derivatives: The rank r is a lower bound for the “dimension” of any sufficient statistic with continuous partial derivatives.
Remark When some statisticians speak of the “dimension” of a sufficient statistic, they simply mean that S(x) can be written as (S_1(x), ..., S_k(x)) ∈ R^k. Since there exists a continuous map from R^k to R, say T, which is bijective λ_k-a.e. (see Denny 1964, p. 95, Theorem), it is clear that the continuity of the functions S_i is not enough to define the “dimension” of S: If S is a continuous k-dimensional sufficient statistic, T ∘ S is a continuous one-dimensional sufficient statistic, provided P ∘ S ≪ λ_k for P ∈ P. A meaningful concept of dimension can, therefore, be defined only for sufficient statistics subject to a condition stronger than continuity.
What has been stated under (i) and (ii) is essentially Lemma 3, pp. 50/1 in Shimizu
(1966). A forerunner of this result is Theorem 3.2 in Barankin and Katz (1959, p.
228), repeated as Theorem 2.1 in Barankin and Maitra (1963, p. 222). All of these
results are based on the assumption that the Factorization Theorem holds everywhere.
This is justified by a Lemma of Barankin and Katz (see Lemma 2.8.1 below) which
seems to have been overlooked by Shimizu.
The arrangement of Shimizu's proof is not very lucid. Perhaps one could argue as follows. For x̄_n ∈ U(P_1, …, P_r), the rank of the matrix (2.8.7) is r. To simplify our notations, we assume that (ℓ′(x̄_ν, P_i)) i=1,…,r; ν=1,…,r is nonsingular.
By definition of r, the following matrix is singular for every x ∈ X and every P ∈ P:

⎛ ℓ′(x̄_1, P_1) … ℓ′(x̄_1, P_r) ℓ′(x̄_1, P) ⎞
⎜ ⋮ ⋮ ⋮ ⎟
⎜ ℓ′(x̄_r, P_1) … ℓ′(x̄_r, P_r) ℓ′(x̄_r, P) ⎟
⎝ ℓ′(x, P_1) … ℓ′(x, P_r) ℓ′(x, P) ⎠
Hence

ℓ′(x, P) = ∑_{i=1}^r a_i(P) ℓ′(x, P_i)

and therefore

ℓ(x, P) = a_0(P) + ∑_{i=1}^r a_i(P) ℓ(x, P_i).
with

ψ_P(y_1, …, y_r) = a_0(P) + ∑_{i=1}^r a_i(P) y_i,
Lemma 2.8.1 Assume that relation (2.8.8) holds λ^n-a.e., and that the matrix of partial derivatives (∂S_i(x)/∂x_ν) i=1,…,k; ν=1,…,n has rank k ≤ n for every x ∈ X. Then relation (2.8.8) holds for every x ∈ X if ϕ is continuous on X.
Proof Let x_0, y_0 ∈ X be such that S(x_0) = S(y_0) (= s_0, say). We have to show that ϕ(x_0) = ϕ(y_0). Let N ∈ B_n denote the exceptional λ^n-nullset for relation (2.8.8). We shall show that for any open U ∋ x_0 and V ∋ y_0 there exist x ∈ U ∩ N^c and y ∈ V ∩ N^c such that S(x) = S(y), whence ϕ(x) = ϕ(y). Since U and V are arbitrary and ϕ continuous, this implies ϕ(x_0) = ϕ(y_0).
Using the Implicit Function Theorem, we may assume w.l.g. that S(U) and S(V) are open in R^k. Since S(U) ∩ S(V) ≠ ∅, we have λ^k(S(U) ∩ S(V)) > 0. Since the S_i have continuous derivatives, S fulfills Lusin's condition. Hence λ^n(N) = 0 implies λ^k(S(N)) = 0 and therefore

λ^k(S(U ∩ N^c) ∩ S(V ∩ N^c)) > 0,

so that the required points x ∈ U ∩ N^c and y ∈ V ∩ N^c with S(x) = S(y) exist.
This implies in particular that the statistical procedure based on M has, for any loss function ℓ(·, P), the same risk as the original procedure based on T, i.e.,

∫∫ ℓ(z, P) M(y, dz) P ◦ S(dy) = ∫ ℓ(z, P) P ◦ T(dz) for P ∈ P.
then S is sufficient.
Bahadur (1955a) proves this result (p. 288, Theorem) for dominated families P containing more than one probability measure, and for (Z, C) = (R^k, B^k) (including the case k = ∞). The essential point for this converse is for which loss functions relation (2.9.1) is required. Bahadur gets along with a single loss function fulfilling a not very appealing condition (see p. 287), namely: for arbitrary P_i ∈ P, i = 0, 1, inf_z {(1 − u) ℓ(z, P_0) + u ℓ(z, P_1)} is attained for a unique value z = z(u), and z(u′) = z(u″) implies u′ = u″ (a condition not fulfilled for ℓ(z, P_ϑ) = (z − ϑ)²).
At first glance, results of this kind would seem to satisfy all needs, at least if the
conditions on the loss function could be somehow improved. Yet, not all statistical
problems fit easily into this framework. This holds, in particular, if equivariance or
unbiasedness of estimators has to be taken into account.
This is trivial for unbiasedness: let κ̂ : (X, A) → (R, B) be unbiased and of minimal convex risk for κ(P) := ∫ κ̂(x) P(dx). That means: for any unbiased estimator there exists an equivalent or better unbiased estimator depending on x through κ̂(x) only, namely κ̂ itself. Yet κ̂ is not necessarily sufficient: this can happen if there exists a sufficient statistic S : (X, A) → (Y, B) such that P ◦ S|B is complete, and one chooses a functional contraction of S for κ̂. (See also Chap. 4.)
In connection with equivariant estimators, we mention the following result of Fieger (1978, p. 39, Satz 1). For ϑ ∈ R let P_ϑ|B be the probability measure with λ-density x → p(x − ϑ). A shift equivariant function S_n : X^n → R^k is sufficient iff for every convex loss function ℓ there exists a shift equivariant function k_ℓ : R^k → R such that

∫ ℓ(k_ℓ(S_n(x)) − ϑ) P_ϑ^n(dx) ≤ ∫ ℓ(ϑ^{(n)}(x) − ϑ) P_ϑ^n(dx) (2.9.2)

for every ϑ ∈ R and every shift equivariant estimator ϑ^{(n)}.
Note that this theorem depends on the fact that (2.9.2) is required for a large class of loss functions. If p(x) = 1_{[−1/2,1/2]}(x), the statistic S_n(x_n) = (x_{1:n}, x_{n:n}) is sufficient for the pertaining location parameter family. The function k(y_1, y_2) = (y_1 + y_2)/2 fulfills (2.9.2) for every loss function which is convex and symmetric about 0; yet (x_{1:n} + x_{n:n})/2 is not sufficient (Fieger 1978, p. 39).
Then S is sufficient for P. (See Pfanzagl 1974, p. 198, Theorem; see also Pfanzagl 1994, p. 10, Theorem 1.3.9.)
Characterization by Concentration of Mean Unbiased Estimators
If S : (X, A) → (Y, B) is sufficient, then for any unbiased estimator κ̂ : (X, A) → (R, B) there exists a function k : (Y, B) → (R, B) (which is the conditional expectation of κ̂, given S) such that

∫ k(y) P ◦ S(dy) = ∫ κ̂(x) P(dx), (2.9.3)

∫ C(k(y)) P ◦ S(dy) ≤ ∫ C(κ̂(x)) P(dx) (2.9.4)

for every P ∈ P and every convex function C. If condition (2.9.4) is strengthened in the sense that for every bounded κ̂ there exists k such that (2.9.4) holds for every κ̂′ fulfilling (2.9.3), then (2.9.4) with C(u) = u² is enough to infer the sufficiency of S. Observe, however, that this condition is somewhat artificial: it requires the existence of an optimal unbiased estimator whenever an unbiased estimator exists at all. (We mention this just as a variant of a theorem of Bahadur discussed in Sect. 4.2.) Observe that the extended condition mentioned above does not imply that P ◦ S is complete. Completeness of P ◦ S follows if k ◦ S ∈ K for every k : Y → R.
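Relations (2.9.3) and (2.9.4) are easy to observe numerically via the conditional expectation k = E(κ̂ | S). The following sketch (our example, not the book's) takes an i.i.d. Bernoulli(p) sample, the unbiased estimator κ̂(x) = x_1, and the sufficient statistic S = ∑ x_ν, for which k(S) = S/n:

# Monte Carlo check of (2.9.3) and (2.9.4) with the convex function
# C(u) = (u - p)^2; the Bernoulli example is ours, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 10, 200_000
x = rng.binomial(1, p, size=(reps, n))

kappa_hat = x[:, 0]                  # crude unbiased estimator
k_of_S = x.sum(axis=1) / n           # conditional expectation given S

print(kappa_hat.mean(), k_of_S.mean())     # both approx p = 0.3 (2.9.3)
print(((kappa_hat - p) ** 2).mean())       # approx p(1-p)   = 0.21
print(((k_of_S - p) ** 2).mean())          # approx p(1-p)/n = 0.021 (2.9.4)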
Characterization by Measures of Information
Let P_i, i = 1, 2, be probability measures with μ-densities p_i. For a convex function C : [0, ∞) → R, the C-divergence between P_1 and P_2, introduced by Csiszár (1963, p. 86, relation (4)), is

I_C(P_1, P_2) := ∫ C(p_1(x)/p_2(x)) P_2(dx).
Csiszár (1963, p. 90, Satz 1 and Ergänzung) implies the following assertions. For any statistic S : (X, A) → (Y, B),

I_C(P_1 ◦ S, P_2 ◦ S) ≤ I_C(P_1, P_2). (2.9.5)

If S is sufficient for {P_1, P_2}, equality holds in (2.9.5) for every convex function C. Conversely, equality in (2.9.5) for some strictly convex C implies that S is sufficient for {P_1, P_2}, provided I_C(P_1, P_2) < ∞.
This generalizes an earlier result of Kullback and Leibler (1951) for the special
case C(u) = u log u.
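For readers who want to see (2.9.5) at work, here is a small sketch (our example) with C(u) = u log u, the Kullback-Leibler case: for two i.i.d. Bernoulli observations, the sufficient statistic S(x_1, x_2) = x_1 + x_2 preserves the C-divergence, while the non-sufficient statistic T(x_1, x_2) = x_1 strictly decreases it.

# Numerical check of (2.9.5) for C(u) = u*log(u), i.e. the
# Kullback-Leibler case; the Bernoulli example is ours.
import math
from itertools import product

def I_C(p1, p2):
    """I_C(P1, P2) = sum over atoms of C(p1/p2) * p2, with C(u) = u log u."""
    return sum(p2[a] * (p1[a] / p2[a]) * math.log(p1[a] / p2[a]) for a in p1)

def bern2(p):
    """Joint distribution of two i.i.d. Bernoulli(p) observations."""
    return {(a, b): (p if a else 1 - p) * (p if b else 1 - p)
            for a, b in product((0, 1), repeat=2)}

def image(P, stat):
    """Distribution of stat(X) for X distributed according to P."""
    Q = {}
    for a, pr in P.items():
        Q[stat(a)] = Q.get(stat(a), 0.0) + pr
    return Q

P1, P2 = bern2(0.3), bern2(0.6)
S = lambda a: a[0] + a[1]            # sufficient statistic
T = lambda a: a[0]                   # not sufficient

print(I_C(P1, P2))                           # 0.3676...
print(I_C(image(P1, S), image(P2, S)))       # equal: S is sufficient
print(I_C(image(P1, T), image(P2, T)))       # strictly smaller: 0.1838...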
References
Ash, R. B. (1972). Real Analysis and Probability. New York: Academic Press.
Bahadur, R. R. (1954). Sufficiency and statistical decision functions. The Annals of Mathematical
Statistics, 25, 423–462.
Bahadur, R. R. (1955a). A characterization of sufficiency. The Annals of Mathematical Statistics,
26, 289–293.
Bahadur, R. R. (1955b). Statistics and subfields. The Annals of Mathematical Statistics, 26, 490–497.
Bahadur, R. R. (1957). On unbiased estimates of uniformly minimum variance. Sankhyā, 18, 211–
224.
Bahadur, R. R., & Lehmann, E. L. (1955). Two comments on "sufficiency and statistical decision
functions". The Annals of Mathematical Statistics, 26, 139–142.
Barankin, E. W. (1960). Sufficient parameters: solution of the minimal dimensionality problem.
Annals of the Institute of Statistical Mathematics, 12, 91–118.
Barankin, E. W. (1961). A note on functional minimality of sufficient statistics. Sankhyā Series A,
23, 401–404.
Barankin, E. W., & Katz, M., Jr. (1959). Sufficient statistics of minimal dimension. Sankhyā, 21, 217–246.
Halmos, P. R., & Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory
of sufficient statistics. The Annals of Mathematical Statistics, 20, 225–241.
Hasegawa, M., & Perlman, M. D. (1974). On the existence of a minimal sufficient subfield. Annals
of Statistics, 2, 1049–1055; Correction Note, 3, 1371/2 (1975).
Heyer, H. (1969). Erschöpftheit und Invarianz beim Vergleich von Experimenten. Z. Wahrschein-
lichkeitstheorie verw. Gebiete, 12, 21–55.
Heyer, H. (1972). Zum Erschöpftheitsbegriff von D. Blackwell. Metrika, 19, 54–67.
Heyer, H. (1982). Theory of statistical experiments. Springer series in statistics. Berlin: Springer.
Hipp, Ch. (1972). Characterization of exponential families by existence of sufficient statistics.
Thesis: University of Cologne.
Hipp, Ch. (1974). Sufficient statistics and exponential families. Annals of Statistics, 2, 1283–1292.
Hoeffding, W. (1977). Some incomplete and boundedly complete families of distributions. Annals
of Statistics, 5, 278–291.
Hoel, P. G. (1951). Conditional expectation and the efficiency of estimates. The Annals of Mathe-
matical Statistics, 22, 299–301.
Isenbeck, M., & Rüschendorf, L. (1992). Completeness in location families. Probability and Math-
ematical Statistics, 13, 321–343.
Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitstheorie. Berlin: Springer.
Koopman, B. O. (1936). On distributions admitting a sufficient statistic. Transactions of the Amer-
ican Mathematical Society, 39, 399–409.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical
Statistics, 22, 79–86.
Kumar, A., & Pathak, P. K. (1977). Two applications of Basu’s lemma. Scandinavian Journal of
Statistics, 4, 37–38.
Landers, D., & Rogge, L. (1972). Minimal sufficient σ -fields and minimal sufficient statistics. Two
counterexamples. The Annals of Mathematical Statistics, 43, 2045–2049.
Landers, D., & Rogge, L. (1976). A note on completeness. Scandinavian Journal of Statistics, 3,
139.
Laplace, P. S. (1818). Deuxième supplément à la théorie analytique des probabilités. Paris: Courcier.
Laube, G., & Pfanzagl, J. (1971). A remark on the functional equation ∑_{i=1}^n ϕ(x_i) = ψ(T(x_1, …, x_n)). Aequationes Mathematicae, 6, 241–242.
Le Cam, L. (1986). Asymptotic methods in statistical decision theory. New York: Springer.
Lehmann, E. L. (1947). On families of admissible tests. The Annals of Mathematical Statistics, 18,
97–104.
Lehmann, E. L. (1955). Ordered families of distributions. The Annals of Mathematical Statistics,
26, 399–419.
Lehmann, E. L. (1959). Testing statistical hypotheses. New York: Wiley.
Lehmann, E. L., & Casella, G. (1998). Theory of point estimation (2nd ed.). Berlin: Springer.
Lehmann, E. L., & Scheffé, H. (1950, 1955, 1956). Completeness, similar regions and unbiased
estimation. Sankhyā, 10, 305–340; 15, 219–236; Correction, 17, 250.
Mandelbaum, A., & Rüschendorf, L. (1987). Complete and symmetrically complete families of
distributions. Annals of Statistics, 15, 1229–1244.
Mattner, L. (1996). Complete order statistics in parametric models. Annals of Statistics, 24, 1265–
1282.
Mattner, L. (1999). Sufficiency, exponential families, and algebraically independent numbers. Math-
ematical Methods of Statistics, 8, 397–406.
Mattner, L. (2000). Minimal sufficient statistics in location-scale parameter models. Bernoulli, 6,
1121–1134.
Mattner, L. (2001). Minimal sufficient order statistics in convex models. Proceedings of the Amer-
ican Mathematical Society, 129, 3401–3411.
Messig, M. A., & Strawderman, W. E. (1993). Minimal sufficiency and completeness for dichoto-
mous quantal response models. Annals of Statistics, 21, 2149–2157.
Neyman, J. (1935). Sur un teorema concernente le cosiddette statistiche sufficienti. Giorn. Ist. Ital.
Attuari, 6, 320–334.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of
probability. Philosophical Transactions of the Royal Society of London. Series A, 236, 333–380.
Pfanzagl, J. (1964). On the topological structure of some ordered families of distributions. The
Annals of Mathematical Statistics, 35, 1216–1228.
Pfanzagl, J. (1970). On a functional equation related to families of exponential probability measures.
Aequationes Mathematicae, 4, 139–142; Correction Note, 6, 120.
Pfanzagl, J. (1971a). A counterexample to a lemma of L. D. Brown. The Annals of Mathematical
Statistics, 42, 373–375.
Pfanzagl, J. (1971b). On the functional equation ϕ(x) + ϕ(y) = ψ(T (x, y)). Aequationes Mathe-
maticae, 6, 202–205.
Pfanzagl, J. (1974). A characterization of sufficiency by power functions. Metrika, 21, 197–199.
Pfanzagl, J. (1981). A special representation of a sufficient randomization kernel. Metrika, 28,
79–81.
Pfanzagl, J. (1994). Parametric statistical theory. Berlin: De Gruyter.
Pitcher, T. S. (1957). Sets of measures not admitting necessary and sufficient statistics or subfields.
The Annals of Mathematical Statistics, 28, 267–268.
Pitcher, T. S. (1965). A more general property than domination for sets of probability measures.
Pacific Journal of Mathematics, 15, 597–611.
Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Proceedings of the Cambridge
Philosophical Society, 32, 567–579.
Plachky, D. (1977). A characterization of bounded completeness in the undominated case. In Trans-
actions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions,
Random Processes A (pp. 477–480). Reidel.
Rogge, L. (1972). The relations between minimal sufficient statistics and minimal sufficient σ -fields.
Z. Wahrscheinlichkeitstheorie verw. Gebiete, 23, 208–215.
Romano, J. P., & Siegel, A. F. (1986). Counterexamples in probability and statistics. Monterey:
Wadsworth and Brooks.
Roy, K. K., & Ramamoorthi, R. V. (1979). Relationship between Bayes, classical and decision
theoretic sufficiency. Sankhyā Series A, 41, 48–58.
Sacksteder, R. (1967). A note on statistical equivalence. The Annals of Mathematical Statistics, 38,
787–794.
Sato, M. (1996). A minimal sufficient statistic and representations of the densities. Scandinavian
Journal of Statistics, 23, 381–384.
Schmetterer, L. (1966). On the asymptotic efficiency of estimates. In F. N. David (Ed.), Research papers in statistics: Festschrift for J. Neyman (pp. 301–317). New York: Wiley.
Schmetterer, L. (1974). Introduction to mathematical statistics. Translation of the 2nd German
edition 1966. Berlin: Springer.
Shimizu, R. (1966). Remarks on sufficient statistics. Annals of the Institute of Statistical Mathe-
matics, 18, 49–55.
Stigler, S. M. (1973). Laplace, Fisher, and the discovery of the concept of sufficiency. Biometrika,
60, 439–445. Reprinted in Kendall and Plackett (1977).
Strasser, H. (1985). Mathematical theory of statistics. Berlin: De Gruyter.
Stuart, A., & Ord, K. (1991). Kendall’s advanced theory of statistics. Classical inference and
relationship (5th ed., Vol. 2). London: Edward Arnold.
Sverdrup, E. (1953). Similarity, unbiasedness, minimaxity and admissibility of statistical test pro-
cedures. Skand. Aktuarietidskr., 36, 64–86.
Torgersen, E. N. (1991). Comparison of statistical experiments (Vol. 36). Encyclopedia of mathe-
matics and its applications. Cambridge: Cambridge University Press.
Wijsman, R. A. (1958). Incomplete sufficient statistics and similar tests. The Annals of Mathematical
Statistics, 29, 1028–1045.
Witting, H. (1966). Mathematische Statistik. Eine Einführung in Theorie und Methoden. Stuttgart: Teubner.
Zacks, S. (1971). The theory of statistical inference. New York: Wiley.
Chapter 3
Descriptive Statistics
3.1 Introduction
compatibility with a meaningful (partial) order relation guarantee that the functional
itself is meaningful. Moreover, a comparison based on the functional κ will be less
informative than a comparison with respect to the partial order, assuming such a
comparison is possible. In particular: the comparison of estimators should not be
restricted to the comparison of variances only if they are comparable with respect to
their concentration on intervals. In their papers on descriptive functionals of location
and dispersion, Bickel and Lehmann introduce another aspect for the selection of a
descriptive functional: its robustness and, in this context, its estimability.
In its simplest form, estimation theory starts from a family P of probability
measures P defined on a measurable space (X, A ), and a functional κ|P, taking its
values in a measurable space (Y, B). In the period following 1950, a typical case
was Y = R^k, or perhaps Y a function space if the problem was to estimate the density of P ∈ P. Estimators κ^{(n)} : X^n → Y were obtained from i.i.d. samples governed by P^n, for some (unknown) P ∈ P.
The big problem is the choice of the basic family. If the P chosen is larger than
necessary, then an estimator which is optimal for this family will be suboptimal
for a subfamily P0 ⊂ P. If the P chosen is too small, then an estimator, though
reasonable for every P ∈ P, might go wild if the true probability measure is not
in P.
There are few situations where the knowledge of the random process generating
the observations x1 , x2 , . . . suggests for P a particular parametric family. Radioactive
decay is one of these exceptions. Usually, there will be not more than a vague general
experience leaving the choice between a variety of parametric models. The situation
is no better for fully “nonparametric” theory. Even if P is intended to be a “general”
family, say a large family on B with Lebesgue densities, any mathematical treatment
depends on certain nontestable smoothness assumptions about the densities, and
smoothness assumptions which are roughly equivalent from an intuitive point of view
(such as Hölder- or Sobolev-classes) produce different results. These are important
problems which can only be addressed using asymptotic techniques.
There are some obvious conditions on the estimators like measurability of
(x1 , . . . , xn ) → κ (n) (x1 , . . . , xn ) which is indispensable, or continuity of this map
which is natural and not really restrictive. Other obvious requirements are permu-
tation invariance in the case of an i.i.d. sample, or equivariance if the family P is
closed under certain transformations.
The requirement that the estimators should be concentrated about the estimand
as closely as possible is usually subject to the further condition that the estimators
should be properly centered.
Consider, for instance, the exponential distribution of lifetime with density x → ϑ exp[−ϑx], x > 0. The task to estimate the parameter is not fully specified unless the intended use of the estimator is known. The parameter ϑ has an interpretation as decay rate, and one might be interested in an estimator which is median unbiased. If the task is to estimate the expected lifetime, this is 1/ϑ, and one will be interested in an estimator with expectation 1/ϑ. Perhaps the problem is to obtain an (unbiased) estimator of the density itself, or an (unbiased) estimator of exp[−ϑt] (= P_ϑ(t, ∞))?
Even in the case of a parametric family, it is a functional (expressed in terms of
the parameter) which is the estimand. Regrettably, many statisticians use the term
“parameter” though they really mean “functional”.
In the opinion of prominent statisticians (like von Mises 1912, p. 16), parameters
should have an intuitive interpretation (as functionals). Even if this is impossible,
as is the case of the shape parameter of the Gamma-distribution (what, after all, is an operational definition of shape?), there are functions of the parameters, like ∫_0^∞ x Γ_{a,b}(dx) = ab, that have operational significance.
If we take a sensible view, the real problem will always be to estimate the functional
of an unknown distribution, and the parametric family enters the problem just as an
approximation to the unknown distribution (hopefully supported by some knowledge
of the process generating the unknown distribution).
In many (nonparametric) applications, the functional κ : P → R is not uniquely
determined: κ should, perhaps, be some “measure of location”. It appears doubtful
whether the use of refined mathematical techniques is adequate in such a case. Not all
scholars share this opinion. Lehmann (1983, p. 365) claims that: “If each [functional
κ] is an equally valued measure of location, what matters is how efficient the quantity
[κ(P)] can be estimated.” In this connection, Bickel and Lehmann speak of the
“robustness” of the functional, which means: Small changes in the distribution imply
only small changes in the value of the functional. In other words: The functional
should be continuous (with respect to some metric on P). Such a requirement is
supported by the fact that κ cannot be estimated locally uniformly at P unless κ is
continuous at P. Yet there continue to be some difficulties with functionals that are
distinguished by being accurately estimable rather than by the nature of the problem,
and one could even ask how important the use of particularly accurate estimators for
only vaguely defined functionals really is. To say that an estimator estimates what it
estimates, but does so efficiently, is tantamount to saying, “anything goes”.
If the functional κ(P) is given, the relation (3.3.1) is a condition on the estimator. It confirms that κ̂ does, in fact, do what it should: it estimates κ(P) (and not some other functional). Applied in asymptotic theory, this would also lead to the unpleasant result that the estimand is different for different sample sizes.
Relation (3.3.1) raises various problems: Unless loss is measured on a cardinal
scale, the functional κ defined by (3.3.1) will depend on the particular scale. More-
over, different functions κ̂ will define different estimands. Hence the definition of
κ(P) by (3.3.1) will usually only make sense in asymptotic considerations. (See
Chap. 5.)
Beyond the formal definition of the estimand using (3.3.1), there may be various
intuitive requirements: If P is symmetric about κ(P), and κ̂ is an estimator symmetric
about κ(P), one will be willing to accept κ̂ as an estimator of κ(P) even if condition
(3.3.1) is not fulfilled with μ = κ(P). Applied to families P|B, the loss function ℓ(u) = u² defines κ(P) as the mean of P, and ℓ(u) = |u| defines κ(P) as the median. Depending on the basic model, conditions of mean unbiasedness or median unbiasedness might have an intuitive appeal of their own.
For multivariate functionals κ : P → R^k, mean unbiasedness of κ^{(n)} : X^n → R^k poses no problem. It is defined componentwise:

∫ κ_i^{(n)}(x) P^n(dx) = κ_i(P) for i = 1, …, k.
This, too, is a sort of unbiasedness, but the unbiasedness of ϑ (n) as an estimator for
ϑ has nothing to do with condition (3.3.2).
Unbiasedness of an estimator for a parameter ϑ can be justified only if this para-
meter can be interpreted as a functional for which, by its nature, unbiasedness is
desirable. This certainly applies to location parameters. It is doubtful whether unbi-
asedness is adequate for scale parameters. One could argue that a deviation of σ̂ /σ
from 1 by the factor 2 is of the same “weight” as a deviation by the factor 1/2—and
this does not easily go together with unbiasedness.
Even statisticians who admire the beautiful result of Olkin and Pratt (1958, p.
202, relation (2.3)) on the existence of unbiased estimators for the correlation coef-
ficient of a two-dimensional normal distribution will agree that, for this parameter,
unbiasedness is without operational significance.
In spite of the fact that convincing arguments for mean unbiasedness as a uni-
versal requirement for every estimator have never been brought forward, estimation
theory was almost exclusively concerned with mean unbiased estimators in the years
between 1950 and 1970. This can, perhaps, be traced back to the fact that the math-
ematical tools for dealing with mean unbiasedness, like integrals and conditional
expectations, were available at this time.
Compared with the mathematical appeal of the theory of mean unbiased estimators
(see Chap. 4), the obvious shortcomings of this concept were neglected. The doubts
raised by prominent statisticians concerning mean unbiasedness had no effect on the
predominance of these concepts in textbooks—elementary ones and others. Here are
some critical voices:
C.R. Rao (1945, p. 82): “... the inevitable arbitrariness of these postulates of unbiasedness
and minimum variance needs no emphasis.”
L.J. Savage (1954, Chapter 7, p. 244): “... it is now widely agreed that a serious reason to
prefer [mean] unbiased estimators seems never to have been proposed.”
Barnard (1974, p. 4): “Often one begins with the concept of an unbiased estimator—in spite
of the fifty year old condemnation of this idea, as a foundation for theory, on the part of R.A.
Fisher.”
Fraser (1957, p. 49): “median unbiasedness has found little application in estimation theory
primarily because it does not lend itself to the mathematical analysis needed to find minimum
risk estimates.”
The opinion put forward by Aitken and Silverstone (1942, p. 189) on unbiasedness
is downright absurd: “... To obtain an unbiased estimator with minimum variance
we must in most cases estimate not ϑ but some function of ϑ.” Similarly, Barton
(1956) suggests a method “which may only be reasonably applied when the property
of unbiasedness is of more importance than the functional form of the parameter
estimated” (see p. 202). In spite of the absurdity of this idea: What can one do in the
case of a parametric family where no function of the parameter admits an unbiased
estimator?
A basic shortcoming of mean unbiased estimators: as opposed to median unbiased estimators, they are not invariant under arbitrary monotone transformations of the functional. If P = {N(μ, σ²)^n : μ ∈ R, σ² > 0}, the square roots of unbiased estimators for σ² are not unbiased for σ.
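This non-invariance is easy to check numerically; the following sketch (our example) compares the sample variance (unbiased for σ²) with its square root, whose expectation is c_4 σ with c_4 < 1 for normal samples:

# The sample variance (divisor n-1) is unbiased for sigma^2, but its
# square root underestimates sigma: E sqrt(S^2) = c4 * sigma for normal
# samples.  The numerical check is ours.
import math
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps = 5, 2.0, 400_000
x = rng.normal(0.0, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)

c4 = math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)
print(s2.mean())             # approx sigma^2 = 4.0
print(np.sqrt(s2).mean())    # approx c4 * sigma = 1.88, not 2.0
print(c4 * sigma)            # exact: c4 = 0.9400 for n = 5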
The bad news is: Even if the model requires an unbiased estimator for some func-
tional, unbiased estimators may not exist at all, or they may show certain disturbing
properties (e.g. by not being proper). We will abstain from presenting an example; a
plethora of them can be found in textbooks like Lehmann and Casella (1998).
Even if unbiasedness is not required by the nature of the model, one might surmise
that unbiasedness somehow ensures that the estimator is “impartial”, that it does not
favour some members of P to the disadvantage of others. Yet, there are numerous
examples available in the literature where mean unbiased estimators spectacularly
fail to meet this requirement. The first example of this phenomenon was provided
by D. Basu (1955, p. 346), a variant of which can be found in Pfanzagl (1994, p. 71,
Example 2.2.9). There are estimators ϑ^{(n)} of ϑ which are mean unbiased in the family {N(ϑ, 1)^n : ϑ ∈ R}, but N(ϑ, 1)^n{ϑ^{(n)} = 0} > 0 for every ϑ ∈ R. Zacks (1971, p. 119, Example 3.9) presents an unbiased estimator for ϑ in the family with density (x_1, x_2) → ϑ^{−2} 1_{(ϑ,2ϑ)}(x_1) 1_{(ϑ,2ϑ)}(x_2) with variance 0 for ϑ ∈ {2^k : k = 0, ±1, …}.
Historical Remark
Let P be a family of probability measures on (X, A ), and κ : P → R a functional.
One of the basic qualities of an estimator κ̂ : X → R is its concentration about κ(P).
One aspect of this is that it should be properly centered, i.e., without a systematic
error which over- or underestimates κ(P) for every P ∈ P. The question is how
these ideas could be made precise.
It appears that the concept of “bias” originally referred to the observations, not
to estimators. Working on the assumption that there is a “true” value and that the
observations differ from this true value by a degree of error, Bowley (1897, p. 859)
distinguishes between “biased errors” going all in the same direction, and “unbiased
errors”, which are equally likely to be positive or negative. His interest is in the
influence of these errors of the observations on the error of an estimator (taken
as an average of the observations), and his conclusion (based on some elementary
computations) is: “Unbiassed errors can be neglected in comparison with biassed”.
Basically the same attitude can be found in Markov (1912, p. 202) who requires as a
starting point for his computations the absence of a systematic error (of the individual
observations).
A precise definition of an “unbiased estimator” (in the sense of ∫ ϑ̂(x) P_ϑ(dx) = ϑ) first occurs in David and Neyman (1938, p. 106, Definition 1). One could, perhaps,
say that the concept of an unbiased estimator grew out of a compelling condition
formulated as the “absence of a systematic error”, but meaning, in fact, that the
model underlying the analysis should be correct. There is no conclusive argument
leading from this condition to the requirement of “unbiasedness” for estimators. In
the paper by David and Neyman, dealing with the Markov Theorem on least squares,
unbiasedness is an essential ingredient for this particular theory. It is not meant as a
general condition to be imposed on all estimators.
The idea of the median as an important descriptive functional is as old as prob-
ability theory (see Huygens 1657, de Moivre 1756, Problem 5), and median unbi-
asedness as a requirement for the location of estimators was suggested as early as
1774 (Laplace, p. 363).
For probability measures on B, various order relations have been suggested to express
that the random variable corresponding to P1 is in a stochastic sense larger than the
random variable corresponding to P0 . To define the following order relations, let Fi
denote the distribution function corresponding to Pi .
A: P_0 ≤ P_1 if F_1(t) ≤ F_0(t) for every t ∈ R, or, equivalently,

∫ m(u) P_0(du) ≤ ∫ m(u) P_1(du) for every nondecreasing function m.
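A small numerical sketch (our example) of the two characterizations: for P_0 = N(0, 1) and P_1 = N(1, 1), F_1 ≤ F_0 pointwise, and every nondecreasing m satisfies ∫ m dP_0 ≤ ∫ m dP_1.

# Check of the two characterizations of the stochastic order for
# P0 = N(0,1), P1 = N(1,1); the example is ours.
import numpy as np
from scipy.stats import norm

t = np.linspace(-6.0, 7.0, 1001)
print(np.all(norm.cdf(t, loc=1) <= norm.cdf(t, loc=0)))   # F1 <= F0: True

rng = np.random.default_rng(3)
z = rng.normal(size=500_000)        # z ~ P0, z + 1 ~ P1 (same sample paths)
for m in (np.tanh, lambda u: np.maximum(u, 0.0), np.exp):  # nondecreasing m
    print(m(z).mean() <= m(z + 1.0).mean())                # True each time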
Here is a natural example of order A: For any location parameter family {Pϑ : ϑ ∈ R},
the family {Pϑn ◦ Tn : ϑ ∈ R} is ordered A if (x1 , . . . , xn ) → Tn (x1 , . . . , xn ) is
equivariant under shifts.
Measures κ of location compatible with the stochastic order are studied in Bickel
and Lehmann (1975–1979). In addition to the compatibility with the stochastic order,
i.e., κ(P0 ) ≤ κ(P1 ) if P0 ≤ P1 , they require that
and that
It was Rubin who, in an abstract (1951, p. 608), pointed out that it is the monotonic-
ity of likelihood ratios that accounts for certain attractive properties of exponential
families. The theory of m.l.r. families was fully developed by Karlin and Rubin (1956,
p. 279, Theorem 1) and Lehmann (1959). Various complete-class and optimality results by Allen (1953), Sobel (1953) and, in particular, Blackwell and Girshick (1954, p. 182, Lemma 7.4.1), which hold for m.l.r. families, are formulated for the special case of an exponential family only.
It can easily be seen that order C implies order B, and that (3.4.1) as well as (3.4.2)
imply A. Orders A and C are useful in statistical theory. Order B was introduced in
Pfanzagl (1964, p. 217) because it is strong enough to imply certain topological
properties of P. If P is B-ordered, pointwise convergence of the distribution func-
tions implies uniform convergence on B. If the probability measures have continu-
ous Lebesgue densities, pointwise convergence of the distribution functions implies
pointwise convergence a.e. of the densities. (See Pfanzagl 1964, p. 1220, Theorem
1, and 1969, p. 61, Theorem 2.12.) This result applies in particular to exponential
families (which have monotone likelihood ratios and continuous densities). As a
consequence: Pϑn ⇒ Pϑ0 weakly implies ϑn → ϑ0 . (See in this connection also
Barndorff-Nielsen 1969.)
A Side Remark on m.l.r. Families and the Power of Tests
For m.l.r. families, Karlin and Rubin (1956) state, among many other results, that
the class of all critical functions ϕ fulfilling

ϕ(t) = 1 for t > t_0, ϕ(t) = 0 for t < t_0 (3.4.4)

is complete. (See p. 282, Theorem 4, and Lehmann 1959, p. 72, Theorem 3, for a more accessible version.)
This is a special case of a more general result by Karlin and Rubin (1956, p.
279, Theorem 1) which asserts that the class of all monotone procedures is complete
with respect to any subconvex loss function. An improved version of this result is
Theorem 2.1, p. 714, in Brown et al. (1976).
Lehmann (1959, p. 68, Theorem 2) expresses for a parametric family {P_ϑ : ϑ ∈ Θ}, Θ ⊂ R, what is characteristic for the optimality of tests in m.l.r. families. To come closer to applications, assume now the existence of a function T : X → R such that {P_ϑ : ϑ ∈ Θ} has “monotone likelihood ratios in T”, i.e., that for any ϑ_1 < ϑ_2 there exists a nondecreasing function H_{1,2} such that

p_{ϑ_2}(x)/p_{ϑ_1}(x) = H_{1,2}(T(x)). (3.4.5)

Under this assumption, for any critical function ϕ of type (3.4.4), x → ϕ(T(x)) is most powerful for any of the test problems P_{ϑ_1} : P_{ϑ_2}. Or, in other words:
If

∫ ψ(x) P_{ϑ_0}(dx) = ∫ ϕ(T(x)) P_{ϑ_0}(dx),

then

∫ ψ(x) P_ϑ(dx) ≤ ∫ ϕ(T(x)) P_ϑ(dx) for ϑ > ϑ_0, and ≥ for ϑ < ϑ_0. (3.4.6)
Warning: The optimality of the critical functions ϕ ◦ T depends on the fact that the family {P_ϑ : ϑ ∈ Θ} has m.l.r. in T in the sense of (3.4.5), which implies that T is sufficient. It is not enough that the family {P_ϑ ◦ T : ϑ ∈ Θ} has m.l.r.
M.l.r. families provide tests which not only maximize the power for every ϑ > ϑ0 .
They also minimize the power for every ϑ < ϑ0 . This is more than is usually required
for testing the hypothesis ϑ ≤ ϑ0 against alternatives ϑ > ϑ0 . The usual requirement
on the critical function is Pϑ (ϕ) ≤ α for ϑ ≤ ϑ0 and Pϑ (ϕ) as large as possible for
ϑ > ϑ0 . The existence of an optimal critical function in this class does not imply
that the family has m.l.r.—not even if such a critical function exists for every ϑ0 ∈ Θ
and every α ∈ (0, 1). (For an example see Pfanzagl 1960 and 1962, p. 112.)
The existence of tests with the strong optimum property indicated under (3.4.6) is
more or less confined to m.l.r. families. Assume that a dominated family P is ordered
by a relation ≤ with the following property: For every P0 ∈ P and every α ∈ (0, 1)
there exists a critical function ϕ such that

∫ ϕ(x) P_0(dx) = α

and

∫ ψ(x) P(dx) ≤ ∫ ϕ(x) P(dx) for P > P_0, and ≥ for P < P_0,

for any critical function ψ fulfilling ∫ ψ(x) P_0(dx) = α. Then there exists a function T : X → R and, for any pair P_i ∈ P, i = 1, 2, with P_1 ≤ P_2, a nondecreasing function H_{1,2} such that

p_2(x)/p_1(x) = H_{1,2}(T(x)).

(See Pfanzagl 1962, p. 110, Satz. For a generalization see Mussmann 1987.)
Exponential families with density C(ϑ)h(x) exp[a(ϑ)T(x)] have m.l.r. if the function a is increasing. For such families the existence of most powerful critical functions of the type (3.4.4) was already remarked by Lehmann (1947, p. 99, Theorem 1). Relevant for applications are families with m.l.r. for every sample size. Exponential families are obviously of this type (with m.l.r. in ∑_{ν=1}^n T(x_ν)). According to Borges and Pfanzagl (1963, p. 112, Theorem 1), a family of mutually absolutely continuous probability measures which has m.l.r. for every sample size is necessarily exponential.
The results concerning the relationship between m.l.r. families and the existence
of most powerful tests, obtained in the fifties, are not yet familiar to all statisticians.
Berger (1980, p. 369) writes:
The most important class of distributions for which [uniformly most powerful tests] some-
times [!] exist is the class of distributions with monotone likelihood ratio.
3.5 Spread
F_0^{-1}(β) − F_0^{-1}(α) ≤ F_1^{-1}(β) − F_1^{-1}(α) for 0 < α < β < 1. (3.5.2)

pp. 78/79), unaware of the paper by Bickel and Lehmann (1975–1979), define a dispersion order as follows.
Q_1 is more dispersed than Q_0 if for every α ∈ (0, 1),

F_0(F_0^{-1}(α) + x) ≥ F_1(F_1^{-1}(α) + x) for x > 0, and ≤ for x < 0. (3.5.3)
The equivalence between (3.5.2) and (3.5.3) is already indicated in a somewhat vague
version in Saunders and Moran 1978, p. 427, relation 1.3.
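A quick numerical check of (3.5.2) (our example): for Q_0 = N(0, 1) and Q_1 = N(μ, σ²) with σ ≥ 1, every quantile difference of F_1 dominates the corresponding difference of F_0; the location μ is irrelevant.

# Check of (3.5.2) for Q0 = N(0,1) against Q1 = N(3, 1.5^2); the choice
# of parameters is ours.  Quantile differences of F1 dominate those of F0.
import numpy as np
from scipy.stats import norm

alpha = np.linspace(0.01, 0.98, 200)
beta = alpha + 0.01
d0 = norm.ppf(beta) - norm.ppf(alpha)
d1 = norm.ppf(beta, loc=3, scale=1.5) - norm.ppf(alpha, loc=3, scale=1.5)
print(np.all(d0 <= d1))     # True: Q1 is more spread out than Q0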
According to Shaked (1982, p. 312, Theorem 2.1), relation (3.5.3) is equivalent
to the following:
It speaks for the power of this construct that it was discovered in different versions
by so many authors. Though the equivalence of the different definitions can easily
be seen, the authors were not always aware of the other related papers. It appears
that the equivalence between (3.5.1) and (3.5.2) remained unnoticed until 1983 (see
Deshpande and Kochar 1983, p. 686). One might even say that the spread order was
discovered once more: The property of estimator sequences asserted in Hájek (1970,
p. 329, Corollary 2), followed by Roussas (1972, p. 145, Proposition 4.1) is just a
version of (3.5.3).
Probability measures which are comparable in the spread order emerge from the
Convolution Theorem. According to the results presented in Sect. 5.13, Q 1 = Q 0 ∗ R
is “more” spread out than Q 0 if Q 0 is logconcave. That regularly attainable limit
distributions on B are comparable with the optimal limit distribution in the spread
order may also be shown directly (see Sect. 5.11).
To obtain from Q_0 ≼ Q_1 (in the spread order) an assertion about the concentration on intervals, one has to distinguish a certain center μ. The following result is straightforward:
3.7 Concentration
Though the intention of estimation theory is to find estimators that are concentrated
about the estimand as closely as possible, there were no efforts prior to 1950, say,
to find a suitable mathematical construct corresponding to the intuitive concept of
“concentration”. The statisticians settled for comparing estimators on the basis of
their quadratic risk, even though Fisher had already expressed his doubts. L.J. Savage (1954, p. 224) made the obvious suggestion to consider the concentration of probability measures Q_0, Q_1 on B about a given center μ, and to define Q_0 ≥ Q_1 if

Q_0(μ − t′, μ + t″) ≥ Q_1(μ − t′, μ + t″) for t′, t″ ≥ 0. (3.7.1)
for all C ∈ Bk which are convex and symmetric about 0. Half a century later,
the same order relation appears in Witting and Müller-Funk (1995, p. 439) as the
“Anderson-Halbordnung”.
As can easily be seen, the peak order is equivalent to

Q_1 ≥ Q_2 if ∫ g dQ_1 ≥ ∫ g dQ_2

for all gain functions g : R^k → [0, ∞) which are symmetric and unimodal. (See Witting and Müller-Funk 1995, p. 439, Hilfssatz 6.214.)
Equivalent to (3.7.2) is the statement that the distribution of subconvex loss is
stochastically smaller under Q 1 than under Q 2 .
Notice that Q 1 ≥ Q 2 in the peak order on Bk implies that
There is an obvious but technically useful extension of this equivalence to the class, say M, of all symmetric gain functions that are approximable (from below) by functions ∑ a_i g_i with a_i > 0 and g_i ≥ 0, symmetric and unimodal.
Generalizing Birnbaum’s Lemma 1 (p. 77), Sherman’s Lemma 3, p. 766, asserts
that P1 ≥ P2 and Q 1 ≥ Q 2 in the peak order imply P1 ∗ Q 1 ≥ P2 ∗ Q 2 , if all
probability measures are symmetric and unimodal. In fact, a stronger result holds
true:
P_1 ≥ P_2 implies Q ∗ P_1 ≥ Q ∗ P_2 (3.7.3)

provided Q is unimodal and symmetric about 0. This follows directly from the extended version of Anderson's Theorem, which asserts that y → Q(C + y) is in M if C is convex and symmetric about 0.
In this connection one should also mention an early result of Z.W. Birnbaum (1948,
p. 79, Theorem 1) on unimodal distributions with a Lebesgue density symmetric about
0: If Q 1 is more concentrated than Q 2 on symmetric intervals about 0, then the same
is true of all n-fold convolution products. (Notice that for symmetric distributions,
“more concentrated on all intervals symmetric about 0” is the same as saying “more
concentrated on all intervals containing 0”.)
The Löwner Order
The family N (0, Σ)|Bk is an instance where the peak order applies. Löwner (1934)
introduced an order relation between positive definite matrices Σi by Σ1 ≤ L Σ2
if Σ_2 − Σ_1 is positive semidefinite. Since N(0, Σ_2) = N(0, Σ_1) ∗ N(0, Σ_2 − Σ_1), Anderson's Theorem implies that N(0, Σ_1) ≥ N(0, Σ_2) in the peak order whenever Σ_1 ≤_L Σ_2.
case. Example 6.1 in Pfanzagl (2000b, p. 7) shows that for any t′, t″ > 0, t′ ≠ t″, there exists R with expectation 0 such that
For the interpretation of the Convolution Theorem (Sect. 5.13) with convolution kernel Q, one needs conditions on Q|B^k which imply

Q ∗ R(B) ≤ Q(B) for B ∈ B (3.8.1)

for a relevant family B of sets B|R^k. Since relation (3.8.1) is required to hold for any R|B^k, this amounts to finding conditions on Q and B such that

Q(B + y) ≤ Q(B) for every y ∈ R^k, B ∈ B.
The usual textbooks provide for this purpose a shortened version of Anderson’s
Theorem, which asserts (3.8.1) under the condition that Q is unimodal and symmetric
about 0, and B is convex and symmetric about 0. In fact, Anderson’s Theorem offers
a stronger result: Rewritten in our notation it asserts the following.
Anderson's Theorem. If Q|B^k is unimodal and symmetric about 0, then for every convex set C symmetric about 0 the function y → Q(C + y) is star-down, i.e., Q(C + ty) ≥ Q(C + y) for 0 ≤ t ≤ 1.
As already remarked by Anderson (1955, p. 171; see Sherman 1955, p. 764, for a counterexample), this function is not necessarily unimodal if k > 1. (Sherman's example is reproduced in Das Gupta 1976, Example 1, pp. 90/1, and in Dharmadhikari and Joag-Dev 1988, p. 65.)
Though y → Q(C + y) is not unimodal in general, it can be approximated by a
sequence of unimodal functions, which serves nearly the same purpose. To see this,
we may go back to an alternative proof of Anderson’s Theorem offered by Sherman
(1955). According to his Lemma 1, p. 764, the function

y → ∫ 1_C(x + y) 1_K(x) λ^k(dx) (3.8.3)

is unimodal if C and K are convex and symmetric about 0.
Q ∗ P1 ≤ Q ∗ P2 if P1 ≤ P2 . (3.8.5)
Since relation (3.8.3) already occurs in Fáry and Rédei (1950), Anderson’s Theorem
could have been obtained five years earlier, but nobody was interested. Even so,
Anderson’s Theorem had to wait more than ten years to find its pivotal role in the
interpretation of the Convolution Theorem.
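In one dimension the unimodality asserted for (3.8.3) is easy to inspect; in the following sketch (our construction) C and K are symmetric intervals and the overlap length is computed on a grid:

# One-dimensional check of (3.8.3) for C = [-c, c], K = [-k, k]; the
# grid construction is ours.  The overlap function is symmetric about 0
# and unimodal (a trapezoid).
import numpy as np

def overlap(y, c=1.0, k=2.0):
    """lambda(C intersected with (K - y)) for intervals C, K as above."""
    lo = np.maximum(-c, -k - y)
    hi = np.minimum(c, k - y)
    return np.maximum(hi - lo, 0.0)

y = np.linspace(-5.0, 5.0, 2001)
f = overlap(y)
peak = np.argmax(f)
print(np.all(np.diff(f[:peak + 1]) >= 0))   # nondecreasing up to the mode
print(np.all(np.diff(f[peak:]) <= 0))       # nonincreasing afterwards
print(np.allclose(f, f[::-1]))              # symmetric about 0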
In connection with the characterization of maximum likelihood (ML) sequences
as “asymptotically Bayes”, Le Cam (1953, p. 315) introduces the concept of gain
functions g such that

y → ∫ g(x) N(y, Σ)(dx) (3.8.6)
his readers by unusual references, had missed the chance to cite Satz 3 in Fáry and
Rédei (1950, p. 207), from which Anderson’s Theorem easily follows.
The unimodality and symmetry of Q are the basic conditions of Anderson’s Theo-
rem. For interpretations of the Convolution Theorem, it is only the case Q = N (0, Σ)
where symmetry applies. Yet, N (0, Σ) has an important feature going beyond uni-
modality: Its density is logconcave. Though logconcavity had already turned out to
be an important property in Ibragimov’s paper from 1956, the question was never
taken up whether there is a stronger version of Anderson’s Theorem for the partic-
ular case Q = N (0, Σ). The key for such a possible improvement is the following
theorem1 (Prékopa 1973, p. 342, Theorem 6).
Prékopa’s Theorem. If f : Rm × Rk → [0, ∞) is logconcave, then
y → ∫ f(x, y) λ^m(dx) is logconcave.
If Q and C are symmetric about 0, this implies that the function y → Q(C + y), too,
is symmetric about 0. Being logconcave, it is, therefore, unimodal about 0. This is a
property which does not apply, in general, if Q is merely unimodal and symmetric.
(Recall the example in Sherman 1955, p. 764.)
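Prékopa's Theorem can be checked numerically in a simple case (our example): the logconcave density f(x, y) = exp[−(x² + xy + y²)] has a marginal over x whose logarithm has nonpositive second differences.

# Numerical check of Prekopa's Theorem; the joint density is ours.
# f(x, y) = exp(-(x^2 + x*y + y^2)) is logconcave on R^2, and the
# second differences of log of its x-marginal are nonpositive.
import numpy as np

x = np.linspace(-12.0, 12.0, 4001)
y = np.linspace(-3.0, 3.0, 301)
X, Y = np.meshgrid(x, y)
f = np.exp(-(X**2 + X * Y + Y**2))
g = f.sum(axis=1) * (x[1] - x[0])     # marginal over x (Riemann sum)
d2 = np.diff(np.log(g), 2)            # discrete second derivative of log g
print(np.all(d2 <= 1e-10))            # True: the marginal is logconcave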
Remark. Obviously Prékopa was unaware of the relation between his result and both Anderson's Theorem and the Convolution Theorem. He also seems to have overlooked a forerunner of his result in the paper by Davidovič et al. (1969), which states that y → ∫ f_1(x + y) f_2(x) λ^k(dx) is logconcave if the functions f_i are logconcave. These authors, in turn, were unaware of a just slightly weaker result by Lekkerkerker (1953, p. 505/6, Theorem 1), which asserts that y → ∫ f_1(x + y) f_2(x) dx is decreasing and logconcave on y ∈ (0, ∞) if the functions f_i have this property.
Readers interested in this field, with a surprising number of multiple discoveries,
would be well advised to consult Dharmadhikari and Joag-Dev (1988), Pečarić et al.
(1992), and Das Gupta (1980). More may be found in Bertin et al. (1997), but even
readers ready to follow the abstract approach of these authors will be daunted by
their forbidding notation.
1 According to Pfanzagl (1994, p. 86, Corollary 2.4.10), the function y → ∫ f(x, y) λ^k(dx) is concave if f is unimodal. This stronger assertion is obviously wrong: it would imply that any subconvex function is concave, since (x, y) → 1_{[0,1]^k}(x) ℓ(y) is subconvex if ℓ is subconvex, and ℓ(y) = ∫ 1_{[0,1]^k}(x) ℓ(y) λ^k(dx). The proof uses that y → λ^k(C_y) is concave, which is true only on {y ∈ R^m : λ^k(C_y) > 0}, not on R^m. None of the reviewers mentioned this blunder.
The extension of the most fundamental result is immediate: if y → ∫ 1_C(x + y) Q(dx) is star-down for every convex and symmetric set C, then y → ∫ g(x + y) Q(dx) is star-down for every symmetric and unimodal function g. More complex properties of y → ∫ 1_C(x + y) Q(dx), like unimodality or logconcavity, require more subtle arguments. If a density q of Q with respect to λ^k and g are logconcave, the function

y → ∫ g(x + y) Q(dx) (3.8.7)

is logconcave (by Prékopa's Theorem).
Of course, one would like to have (3.9.2) for a larger class of distributions Q. This
is ruled out by a result of Droste and Wefelmeyer (1985, p. 237, Proposition 2): The
validity of (3.9.2) for arbitrary R already characterizes Q as logconcave. (See also
Klaassen 1985, p. 905, Theorem 1.1.) The fact that relation (3.9.1) is, with respect
to the spread order, restricted to logconcave Q, does not limit its usefulness for the
interpretation of the Convolution Theorem.
Now we will discuss the consequences which the spread order has for the concentration on intervals. Since spread is shift invariant, the relation Q_0 ≼ Q_1 leads to a comparison of the respective concentrations only if Q_0 and Q_1 are comparable with respect to their location. Lemma 3.5.1 implies the following.
If Q|B is logconcave, then
This leads to the question whether for k > 1, too, there is always a shifted version
Q 0 of Q = N (0, Σ) ∗ R such that Q 0 (B) ≤ N (0, Σ)(B) for a class of sets B more
general than the convex and symmetric ones, say for all convex sets containing 0.
Kaufman (1966, Sect. 6, pp. 176–178) presents for k = 2 the example of a dis-
tribution Q = N (0, Σ) ∗ R (in fact a regularly attainable limit distribution) with
the following property. For every shifted version Q 0 there are rectangles I1 × I2
containing 0 such that
When Kaufman presented his fundamental result (1966, p. 157, Theorem 2.1) he
simply said: A uniformly attainable limit distribution cannot be more concentrated—
on convex sets symmetric about zero—than the limit distribution of the ML-sequence.
It was an idea of Inagaki (1970, p. 10, Theorem 3.1), followed by Hájek (1970, p.
324, Theorem), to express limit distributions as convolutions with a factor N (0, Σ),
say N (0, Σ) ∗ R. Even though they supply their result with an interpretation based
on Anderson’s Theorem, one could question whether it was a good idea to use
the convolution form to express an optimality result. It appears that certain authors
consider the convolution form as some sort of a final result. This happens already
in Le Cam's paper (1972, p. 259, Examples 2 and 3), where he presents an abstract
Convolution Theorem. Though he points to convolutions with an exponential factor
as a special case, he feels no need for an interpretation in terms of probabilities.
Ibragimov and Has'minskii present a Convolution Theorem containing the factor N(0, Σ) (1981, Theorem II.9.1, p. 154). Then they use four pages (155–158) to
prove Anderson’s Theorem, which they need for the interpretation of this result (see
Theorem II.11.2, p. 160). In Theorem V.5.2, p. 278, they present a Convolution
Theorem containing an exponential factor which
is analogous to Theorem II.9.1 with the normal distribution replaced by an exponential one.
was revived by Wald, who introduced (1939, p. 304) the misleading term “risk”. We
prefer to speak of “expected loss” or “average loss”. The use of “gain functions”
by Le Cam (1953), which is more convenient since “gain” is in direct relation to
“concentration”, was not accepted by the statistical community.
Remark. Some authors accept the evaluation of estimator sequences by ℓ(c_n(κ^{(n)} − κ(P))) only with certain reservations. After all, u → ℓ(c_n(u − κ(P))) evaluates the loss using a different loss function for every n ∈ N. Arguments brought forward
the loss using a different loss function for every n ∈ N. Arguments brought forward
by some scholars to justify the application of the loss function to the standardized
estimator sequence are spurious. As an example we might mention Millar (1983, p.
145):
[Since n^{1/2} is the best possible rate of convergence] it is reasonable to specify the loss by ℓ(n^{1/2}(ϑ^{(n)} − ϑ)).
The justification for the use of ℓ(c_n(κ^{(n)} − κ(ϑ))) is more convincing in the special case ℓ(u) = 1_{[−t,t]}(u): the concentration of estimator sequences κ^{(n)} is then compared on sets (κ(P) − c_n^{-1} t, κ(P) + c_n^{-1} t) containing these estimator sequences with a probability which is neither negligible nor close to 1.
With ℓ fixed, ∫ ℓ dQ defines a total order between all probability measures Q|B^k. Applied with an arbitrarily chosen ℓ, this order says nothing about the reality. One way out is to consider a large family L_0 of all potential loss functions (including the unknown true one). Yet, is there any chance that the order based on the expected loss will be the same for every ℓ ∈ L_0? Since ∫ ℓ dQ = ∫_0^∞ Q{u ∈ R^k : ℓ(u) > r} dr, it seems unlikely that

∫ ℓ dQ_0 ≤ ∫ ℓ dQ_1 (3.11.1)

for a large number of loss functions, unless there is an inherent relationship between the probability measures Q_i and the loss functions ℓ, which implies that
If this is the case, relation (3.11.1) will be true for all loss functions ℓ such that
whence

∫ ℓ d(Q ∗ P) ≥ ∫ ℓ dQ for every P|B^k.
This argument was used by Dvoretzky et al. (1956, p. 666) and Beran (1977, p. 402).
If ℓ is subconvex, then m ◦ ℓ is subconvex for any nondecreasing function m with m(0) = 0. Apart from this special type of loss functions, it is a reasonable requirement on any class of loss functions to be closed under monotone transformations.
Some authors think of “loss” as measured on a cardinal scale. Yet convincing examples are lacking. A more realistic assumption is that the loss is measured on an ordinal scale only, and this, too, requires that m ◦ ℓ be a possible loss function if ℓ is one.
Proposition 3.11.1 For families L of loss functions which are closed under monotone transformations, the following relations are equivalent.
(i) ∫ ℓ dQ_0 ≤ ∫ ℓ dQ_1 for every ℓ ∈ L.
(ii) Q_0 ◦ ℓ is stochastically smaller than Q_1 ◦ ℓ for every ℓ ∈ L.
Proof If ∫ m ◦ ℓ dQ_0 ≤ ∫ m ◦ ℓ dQ_1 for any nondecreasing m with m(0) = 0, then this relation holds in particular with m_s(u) = 1_{[s,∞)}(u) for any s > 0. Since

∫ m_s ◦ ℓ dQ = ∫_0^∞ Q{u ∈ R^k : m_s(ℓ(u)) ≥ r} dr = Q{u ∈ R^k : ℓ(u) ≥ s},

the inequality ∫ m_s ◦ ℓ dQ_0 ≤ ∫ m_s ◦ ℓ dQ_1 for every s > 0 implies that Q_0 ◦ ℓ is stochastically smaller than Q_1 ◦ ℓ.
If the convergence

P^n ◦ c_n(κ_i^{(n)} − κ(P)) → Q_i

is evaluated by a comparison between lim_{n→∞} ∫ ℓ(c_n(κ_i^{(n)} − κ(P))) dP^n, i = 0, 1, it might turn out that, for large n,

∫ ℓ(c_n(κ_0^{(n)} − κ(P))) dP^n < ∫ ℓ(c_n(κ_1^{(n)} − κ(P))) dP^n,

yet

∫ ℓ(c̄_n(κ_0^{(n)} − κ(P))) dP^n > ∫ ℓ(c̄_n(κ_1^{(n)} − κ(P))) dP^n,

corresponding to

∫ ℓ(u) Q_0(du) < ∫ ℓ(u) Q_1(du)

and

∫ ℓ(au) Q_0(du) > ∫ ℓ(au) Q_1(du).
(To obtain an example with risks that can be computed explicitly, choose Q_0 = N(0, 1), Q_1 = ½N(−μ, σ²) + ½N(μ, σ²), and ℓ(u) = 1 − exp[−u²].)
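For this example the risks admit a closed form, since ∫ exp[−(au)²] dN(m, s²)(u) = (1 + 2a²s²)^{−1/2} exp[−a²m²/(1 + 2a²s²)]; the following sketch evaluates them (the parameter values μ = 0.67, σ = 0.7 and the scalings a = 0.2, 5 are our choice) and exhibits the reversal.

# Closed-form risks for the parenthetical example; mu = 0.67, sigma = 0.7
# and the scalings a = 0.2, 5.0 are our choice.  For U ~ N(m, s^2),
# E exp(-(aU)^2) = (1 + 2a^2 s^2)^(-1/2) * exp(-a^2 m^2 / (1 + 2a^2 s^2)).
import math

def risk_normal(a, m, s):
    """integral of (1 - exp(-(a u)^2)) with respect to N(m, s^2)."""
    t = 1.0 + 2.0 * a * a * s * s
    return 1.0 - math.exp(-a * a * m * m / t) / math.sqrt(t)

def risk_mixture(a, mu=0.67, sigma=0.7):
    return 0.5 * (risk_normal(a, -mu, sigma) + risk_normal(a, mu, sigma))

for a in (0.2, 5.0):
    print(a, risk_normal(a, 0.0, 1.0), risk_mixture(a))
# a = 0.2: Q1 has the smaller risk; a = 5.0: Q0 has the smaller risk.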
To summarize: Comparisons based on the expected loss under a single loss func-
tion are significant only if this is the true loss function. Consistent results for a large
class of loss functions can be expected under special conditions only. In such cases,
the comparison of losses leads to the same result as the comparison of the distribu-
tion of losses or the concentration of the estimators in certain sets. It therefore adds
nothing to the results already known from the comparison of concentrations.
The mathematical argument for applying a loss function ℓ (attaining its minimum at 0) to c_n(κ^{(n)} − κ(P)) rather than to κ^{(n)} − κ(P) is clear: since κ^{(n)} − κ(P) → 0 (P^n), the asymptotic performance of ∫ ℓ(κ^{(n)} − κ(P)) dP^n, n ∈ N, depends on the local properties of ℓ at 0, and with a twice differentiable loss function one ends up with ∫ (κ^{(n)} − κ(P))² dP^n as an approximation for the risk.
It is adequate to evaluate the accuracy of an estimator according to the length of an interval containing the estimate with high probability. (Recall the time-honoured comparison based on the “probable error”.) This amounts to considering the concentration of κ^{(n)} in intervals (κ(P) − c_n^{-1} t′, κ(P) + c_n^{-1} t″), or the concentration of c_n(κ^{(n)} − κ(P)) in (−t′, t″). The possibility of expressing this in terms of the loss function ℓ(u) = 1 − 1_{(−t′,t″)}(u), applied to c_n(κ^{(n)} − κ(P)), does not imply that the evaluation of c_n(κ^{(n)} − κ(P)) by means of other loss functions yields a meaningful result.
If the evaluation using loss functions serves any purpose beyond comparing the
asymptotic concentration on intervals, then that purpose is to obtain a global measure
for this difference. It seems questionable whether this purpose is achieved. Depending
on the loss function ℓ, the value of ∫ ℓ(κ̂(x) − κ(P)) P(dx) might depend mainly on the tails of P ◦ (κ̂ − κ(P)). The objection against expressing concentration by
means of loss functions gathers momentum if asymptotic concentration is the point:
estimator sequences which are asymptotically equivalent (in terms of concentration
in intervals) may widely diverge in terms of risks if the loss function is unbounded.
There is yet another problem, resulting from the standardization by cn . The purpose
of this standardization is to ensure that P (n) ◦ cn (κ (n) − κ(P)), n ∈ N, converges to
a non-degenerate limit distribution. If this is achieved by means of standardization
with the rate (cn )n∈N , it is also achieved using standardization with the rate ĉn = acn ,
for any a > 0. The results are not necessarily the same (see Sect. 3.11.)
Few authors admit that comparisons of expected loss are operationally significant
only if based on the true loss function. The “mathematical convenience” of a loss
function has nothing to do with the very nature of the particular problem. Choosing
a loss function on the basis of mathematical convenience, therefore, means —if
judged on the basis of its suitability for a particular problem—choosing it ad libitum.
Hence the optimality of an estimator with respect to such a loss function bears no
relationship to reality. Here are a few important opinions on this problem.
C.R. Rao (1962, p. 74) is skeptical that the “criterion of minimum expected
squared error”, used—among many others—by Berkson (1956), is justified “unless
he believes or makes us believe that the loss to society is proportional to the square
of the error in his estimate”.
Zellner (1986, p. 450) on the linex loss functions defined in (3.11.6) below: “The analytic ease
with which results can be obtained makes them attractive for the use in applied problems...”
Another example of the casual approach to dealing with loss functions is (in our
notations)
M.M. Rao (1965, p. 135): “It is more convenient [!] ... to consider the optimality of an (unbiased) estimator κ̂ ... at P_0 as the minimum value of N(κ̂ − κ(P_0)) instead of that of ∫ C(κ̂ − κ(P_0)) dP_0”, where

N(f) := inf{k > 0 : ∫ C(f/k) dλ ≤ 1}
Since the dilation order is not very convincing from an intuitive point of view (observe that the convex function C is not required to be nonnegative), it is of interest to relate it to other order relations. According to Witting and Müller-Funk (1995, p. 493, Satz 7.24), relation (3.11.3) follows if the (necessary) condition (3.11.4) is fulfilled, and if (3.11.5) holds true for some μ.
Moreover, Q_0 ≼ Q_1 (in the spread order) implies

∫ C(u − μ_0) Q_0(du) ≤ ∫ C(u − μ_1) Q_1(du) with μ_i = ∫ u Q_i(du)

for every convex function C (since Q_0 ≼ Q_1 implies Q_0* ≼ Q_1* and ∫ u Q_i*(du) = 0 if Q_i* := Q_i − μ_i; from this, Q_0* and Q_1* have a common quantile, for which (3.11.5) applies). (See Shaked 1982, p. 313.)
Equality in (3.11.3) for C(u) = |u| implies Q_0 = Q_1 (Pfanzagl 2000b, p. 8, Lemma 6.1).
The fact that relations like (3.11.3) are compatible with meaningful order relations
(based on probabilities) does not entail that they are operationally significant in cases
where an underlying meaningful order relation is not available. Moreover: In view of
the enormous number of functions ℓ or C for which these relations hold true, it would need additional arguments to distinguish one of the expressions ∫ ℓ(u − μ) Q(du) (say the one with ℓ(u) = |u| or ℓ(u) = u²) as “the” global measure of concentration.
Loss Functions and Unbiasedness
Lehmann (1951b, p. 587) suggests a general concept of unbiasedness, based on a
given loss function. For the problem of estimation, this reads as follows:
Given a functional κ and a loss function ℓ(·, Q) with minimum ℓ(κ(Q), Q), the estimator κ̂ is unbiased for κ at P if the function Q → ∫ ℓ(κ̂(x), Q) P(dx) attains its minimum at Q = P. (Lehmann does not mention that κ(Q) should minimize ℓ(·, Q).)
Specialized to ℓ_P(u) = (u − κ(P))² and ℓ_P(u) = |u − κ(P)|, this leads to mean unbiasedness and median unbiasedness, respectively. Yet, it appears that mean
and median unbiasedness have a strong appeal of their own, and not because they
come from some loss function. Loss functions are a rather artificial construct, and
it seems questionable whether one should sacrifice concepts like mean or median
unbiasedness with a clear operational significance for an unbiasedness concept based
on a loss function whose only selling point is that it is “not totally unreasonable”. For
a discussion of various concepts of unbiasedness see also H.R. van der Vaart (1961).
Even if one abandons the idea of using an unbiasedness concept derived from a
loss function, there should be no inherent conflict between, say, mean unbiasedness
and the loss function. As a point of departure, let us assume that ℓ_P(u) = ℓ(u − κ(P)), where ℓ is subconvex, attaining its minimum at 0. Unbiasedness supposes, implicitly, that the deviation from κ(P) by the amount Δ has the same weight, whether it is to κ(P) + Δ or to κ(P) − Δ. Correspondingly, ℓ should be symmetric about 0.
Assume now that {Pa : a ∈ (0, ∞)} is a scale parameter family, and that κ(Pa ) =
a. In this case it is natural to express the deviation of an estimate â from a by the
deviation of â/a from 1, and to suppose that a deviation of â/a from 1 by the factor 2, say, has the same weight as a deviation by the factor 1/2. If this is accepted, it makes no sense to require that â should be unbiased. If â is evaluated by ℓ(â/a), with a loss function ℓ attaining its minimum at 1, then one would, if anything, require that ℓ(1/u) = ℓ(u) rather than symmetry of ℓ(u) about u = 1. A loss function with this property is ℓ(u) = (log u)², suggested by Ferguson (1967, p. 179) as “more appropriate for scale parameters than the squared error loss”. His opinion is, however, not incontrovertible: on p. 191 he himself uses the squared error loss for a scale parameter when doing so leads to a nicer result.
One would hesitate to write down such obvious remarks but for the fact that the
textbooks are full of suggestions to measure the deviation of â(x) from a by means
of loss functions like (â(x)/a − 1)². A loss function not downright absurd for scale parameters is

ℓ0(u) = u − log u − 1.
This loss function (suggested by James and Stein 1961, p. 376, relation (72), for matrix-valued estimators) is convex and attains its minimum at u = 1. Yet, it appears that this loss function harbours an inherent contradiction: ℓ0 attributes different weights to â(x)/a = 1 + Δ and â(x)/a = 1 − Δ. If this is the adequate description of a real situation, one would not think that—at the same time—mean unbiasedness is a natural requirement. Nevertheless, it is the mean unbiased estimator which minimizes the risk for the loss function ℓ0. The function

a → ∫ ℓ0(â(x)/a) Pa0(dx)

attains its minimum at a = a0 iff ∫ â(x) Pa0(dx) = a0. It follows from a result of Brown (1968, p. 35, Theorem 3.1) that ℓ0 is, up to the transformation ℓ0 → αℓ0 + β, the only loss function with this property (except for u → (u − a)², of course).
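This equivalence can be checked directly; here is a minimal sketch, assuming that differentiation under the integral is permitted:

```latex
% Risk under \ell_0(u) = u - \log u - 1 as a function of the parameter a:
\[
  \frac{d}{da}\int \ell_0\!\Big(\frac{\hat a(x)}{a}\Big)\,P_{a_0}(dx)
  = \frac{d}{da}\int \Big(\frac{\hat a(x)}{a}-\log\hat a(x)+\log a-1\Big)\,P_{a_0}(dx)
  = -\frac{1}{a^{2}}\int \hat a(x)\,P_{a_0}(dx)+\frac{1}{a}.
\]
% The derivative is negative for a < \int\hat a\,dP_{a_0} and positive for
% a > \int\hat a\,dP_{a_0}, so the risk has its unique minimum at
% a = \int\hat a\,dP_{a_0}; it sits at a_0 iff \int\hat a\,dP_{a_0} = a_0.
```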
For real estate assessment, Varian (1975, p. 196) suggests evaluating the difference
κ̂(x) − κ(P) by means of the so-called linex (for “linear-exponential”) loss function
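In its standard form (as in Zellner 1986) the linex loss, for Δ = κ̂(x) − κ(P), reads:

```latex
% Linex ("linear-exponential") loss; a and b are its shape constants:
\[
  \ell(\Delta) \;=\; b\left(e^{a\Delta}-a\Delta-1\right),
  \qquad a\neq 0,\; b>0 .
\]
% Convex, minimal (= 0) at \Delta = 0; approximately linear for deviations
% of one sign and exponentially growing for deviations of the other,
% depending on the sign of a.
```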
does not tell us much about the joint distribution. One might, however, expect that κ̂0
is closer to κ(P) than κ̂1 if, say, |κ̂0 −κ(P)| is stochastically smaller than |κ̂1 −κ(P)|.
Yet this is not the case. An example by Blyth (1972, p. 367, the “clocking paradox”),
referring to a parametric family, shows that P{|κ̂0 − κ(P)| < |κ̂1 − κ(P)|} might be
close to zero, even though κ̂0 and κ̂1 have the same distribution. A slight modification
of this example in Blyth and Pathak (1985, p. 46, Example 1) shows that the same
effect may occur if |κ̂0 − κ(P)| is stochastically smaller than |κ̂1 − κ(P)|.
Yet such examples are highly artificial, and the question arises whether the concept
of closeness might be of some use in a more natural context. One such possibility is
the following: If κ̂0 and κ̂1 − κ̂0 are stochastically independent, then κ̂0 is closer to
κ(P) than κ̂1 , provided κ̂0 is median unbiased. This is (a special case of) Pitman’s
Comparison Theorem (see 1937, p. 214), proved under somewhat fuzzy conditions.
The Convolution Theorem suggests that this Comparison Theorem could be used to show that an asymptotically optimal estimator sequence κ̂0(n), n ∈ N, is asymptotically closer to κ(P) than any other “regular” estimator sequence.
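Pitman’s Comparison Theorem is easy to illustrate numerically. The following Monte Carlo sketch (an illustration of my own, with arbitrarily chosen normal distributions) generates a median unbiased κ̂0 and a κ̂1 such that κ̂1 − κ̂0 is independent of κ̂0:

```python
import numpy as np

# Monte Carlo illustration of Pitman's Comparison Theorem: if kappa0 is
# median unbiased and kappa1 - kappa0 is independent of kappa0, then
# kappa0 is closer to kappa(P) with probability at least 1/2.
rng = np.random.default_rng(0)
reps = 200_000
kappa = 0.0                              # true value kappa(P)

k0 = kappa + rng.standard_normal(reps)   # median unbiased: symmetric about kappa
k1 = k0 + rng.standard_normal(reps)      # k1 - k0 independent of k0

closeness = np.mean(np.abs(k0 - kappa) < np.abs(k1 - kappa))
print(f"P(|k0 - kappa| < |k1 - kappa|) ~ {closeness:.3f}")  # well above 1/2
```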
We first consider a result for probability measures Q|Bk , which will later be
applied to limit distributions. If Q is a normal distribution, multidimensional median
unbiasedness is a natural assumption.
Lemma 3.12.1 Assume that Q|Bk is median unbiased in the sense that Q ◦ (u → a⊤u) has median 0 for every a ∈ Rk. Then the following holds true for every positive definite k × k-matrix M and every probability measure R|Bk.
occurs as Theorem 1 in M. Ghosh and Sen (1989, p. 1089). Relation (3.12.2) is, in
fact, a special case of Pitman’s “Comparison Theorem”.
Ghosh and Sen mainly consider applications for parametric families and a median unbiased estimator ϑ̂0 which is a function of a complete sufficient statistic. If ϑ̂1 is another estimator such that ϑ̂1 − ϑ̂0 is ancillary, then ϑ̂0 − ϑ and ϑ̂1 − ϑ̂0 are stochastically independent (according to Basu’s Theorem), so that
he shows that
lim_{n→∞} Pϑn{ n^{1/2}(ϑ̂0(n) − ϑ)⊤ Σ0^{−1} n^{1/2}(ϑ̂0(n) − ϑ) ≤ n^{1/2}(ϑ̂(n) − ϑ)⊤ Σ0^{−1} n^{1/2}(ϑ̂(n) − ϑ) } ≥ 1/2.  (3.12.4)
Recall that (u − v)⊤ Σ0^{−1} (u − v) has a traditional interpretation as the Mahalanobis distance between u and v.
Theorem 6.2.1 in Keating et al. (1993, p. 182) asserts, in fact, that

lim_{n→∞} Pϑn{ |ϑ̂0(n) − ϑ| ≤ |ϑ̂(n) − ϑ| } ≥ 1/2.
References
Aitken, A. C., & Silverstone, H. (1942). On the estimation of statistical parameters. Proceedings of
the Royal Society of Edinburgh Section A, 61, 186–194.
Allen, S. G. (1953). A class of minimax tests for one-sided composite hypotheses. The Annals of
Mathematical Statistics, 24, 295–298.
Anderson, T. W. (1955). The integral of a symmetric unimodal function over a symmetric convex
set and some probability inequalities. Proceedings of the American Mathematical Society, 6,
170–176.
Bahadur, R. R. (1964). On Fisher’s bound for asymptotic variances. The Annals of Mathematical
Statistics, 35, 1545–1552.
Barnard, G. A. (1974). Can we all agree on what we mean by estimation? Util. Mathematics, 6,
3–22.
Barndorff-Nielsen, O. (1969). Lévy homeomorphic parametrization and exponential families. Z.
Wahrscheinlichkeitstheorie verw. Gebiete, 12, 56–58.
Barton, D. E. (1956). A class of distributions for which the maximum-likelihood estimator is
unbiased and of minimum variance for all sample sizes. Biometrika, 43, 200–202.
Basu, D. (1955). A note on the theory of unbiased estimation. The Annals of Mathematical Statistics,
26, 345–348.
Beran, R. (1977). Robust location estimates. The Annals of Statistics, 5, 431–444.
Berger, J. O. (1980). Statistical decision theory: Foundations, concepts, and methods. Springer Series in Statistics. New York: Springer.
Berkson, J. (1956). Estimation by least squares and maximum likelihood. In Proceedings of the Berkeley symposium on statistics and probability I (pp. 1–11). Berkeley: University of California Press.
Bertin, E. M. J., Cuculescu, I., & Theodorescu, R. (1997). Unimodality of probability measures.
Dordrecht: Kluwer Academic Publishers.
Bickel, P. J., & Lehmann, E. L. (1975–1979). Descriptive statistics for nonparametric models. I: The
Annals of Statistics, 3, 1038–1045; II: The Annals of Statistics, 3, 1045–1069; III: The Annals of
Statistics, 4, 1139–1159. IV: In Contributions to Statistics, Hájek Memorial Volume (pp. 33–40).
Academia, Prague.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models. Baltimore: Johns Hopkins University Press (1998 Springer Paperback).
Birnbaum, Z. W. (1948). On random variables with comparable peakedness. The Annals of Math-
ematical Statistics, 19, 76–81.
Blackwell, D., & Girshick, M. A. (1954). Theory of games and statistical decisions. New York: Wiley.
Blyth, C. R.,& Pathak, P. K. (1985). Does an estimator’s distribution suffice? In L. M. Le Cam &
R. A. Olshen (Eds.), Proceedings of the Berkeley conference in honor of Jerzy Neyman and Jack
Kiefer (Vol. 1, pp. 45–52). Wadsworth.
Blyth, C. R. (1972). Some probability paradoxes in choice from among random alternatives. With
discussion. Journal of the American Statistical Association, 67, 366–381.
Borges, R., & Pfanzagl, J. (1963). A characterization of the one-parameter exponential family of
distributions by monotonicity of likelihood ratios. Z. Wahrscheinlichkeitstheorie verw. Gebiete,
2, 111–117.
Bowley, A. L. (1897). Relations between the accuracy of an average and that of its constituent parts.
Journal of the Royal Statistical Society, 60, 855–866.
Brown, L. D. (1968). Inadmissibility of the usual estimators of scale parameters in problems with
unknown location and scale parameters. The Annals of Mathematical Statistics, 39, 29–48.
Brown, L. D., Cohen, A., & Strawderman, W. E. (1976). A complete class theorem for strict
monotone likelihood ratio with applications. Annals of Statistics, 4, 712–722.
Chung, K. L. (1953). Sur les lois de probabilité unimodales. C. R. Acad. Sci. Paris, 236, 583–584.
Inagaki, N. (1970). On the limiting distribution of a sequence of estimators with uniformity property.
Annals of the Institute of Statistical Mathematics, 22, 1–13.
James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the 4th Berkeley symposium on mathematical statistics and probability I (pp. 361–379). Berkeley: University of California Press.
Karlin, S., & Rubin, H. (1956). The theory of decision procedures for distributions with monotone
likelihood ratios. The Annals of Mathematical Statistics, 27, 272–291.
Kaufman, S. (1966). Asymptotic efficiency of the maximum likelihood estimator. Annals of the
Institute of Statistical Mathematics, 18, 155–178. See also Abstract. The Annals of Mathematical
Statistics, 36, 1084 (1965).
Keating, J. P., Mason, R. L., & Sen, P. K. (1993). Pitman’s measure of closeness: A comparison of statistical estimators. Philadelphia: SIAM.
Khintchine, A. Y. (1938). On unimodal distributions. Izv. Nauchno. Issled. Inst. Mat. Mech. Tomsk.
Gos. Univ., 2, 1–7.
Klaassen, C. A. J. (1985). Strong unimodality. Advances in Applied Probability, 17, 905–907.
Klebanov, L. B. (1974). Unbiased estimators and sufficient statistics. Theory of Probability and Its
Applications, 19, 379–383.
Laplace, P. S. (1774). Mémoire sur les suites récurro-récurrentes et sur leurs usages dans la théorie des hasards. Mémoires de l’Académie Royale des Sciences de Paris, 6, 353–371.
Laplace, P. S. (1820). Théorie analytique des probabilités (3rd ed.). Paris: Courcier.
Le Cam, L. (1972). Limits of experiments. In Sixth Berkeley symposium on mathematical statistics
and probability I (pp. 249–261).
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related
Bayes’ estimates. University of California Publications in Statistics, 1, 277–330.
Lehmann, E. L. (1959). Testing statistical hypotheses. New York: Wiley.
Lehmann, E. L. (1983). Theory of point estimation. New York: Wiley.
Lehmann, E. L., & Casella, G. (1998). Theory of point estimation (2nd ed.). Berlin: Springer.
Lehmann, E. L. (1947). On families of admissible tests. The Annals of Mathematical Statistics, 18,
97–104.
Lehmann, E. L. (1951b). A general concept of unbiasedness. The Annals of Mathematical Statistics,
22, 587–592.
Lehmann, E. L. (1951a). Consistency and unbiasedness of certain nonparametric tests. The Annals
of Mathematical Statistics, 22, 165–179.
Lehmann, E. L. (1952). Testing multiparameter hypotheses. The Annals of Mathematical Statistics,
23, 541–552.
Lehmann, E. L. (1955). Ordered families of distributions. The Annals of Mathematical Statistics,
26, 399–419.
Lehmann, E. L. (1966). Some concepts of dependence. The Annals of Mathematical Statistics, 37,
1137–1153.
Lekkerkerker, C. G. (1953). A property of logarithmic concave functions. I and II. Indagationes Mathematicae, 15, 505–513 and 514–521.
Lewis, T., & Thompson, J. W. (1981). Dispersive distributions, and the connection between disper-
sivity and strong unimodality. Journal of Applied Probability, 18, 76–90.
Löwner, K. (1934). Über monotone Matrixfunktionen. Mathematische Zeitschrift, 38, 177–216.
Lynch, J., Mimmack, G., & Proschan, F. (1983). Dispersive ordering results. Advances in Applied
Probability, 15, 889–891.
Mann, H. B., & Whitney, D. R. (1947). On tests whether one of two random variables is stochastically
larger than the other. The Annals of Mathematical Statistics, 18, 50–60.
Markov, A. A. (1912). Wahrscheinlichkeitsrechnung. Translation by Liebmann, H. of the 2nd ed. of the Russian original. Teubner.
Millar, P. W. (1983). The minimax principle in asymptotic statistical theory. In P. L. Hennequin (Ed.), École d’Été de Probabilités de Saint-Flour XI-1981, Lecture Notes in Mathematics 976 (pp. 75–265). New York: Springer.
Mises, R. von (1912). Über die Grundbegriffe der Kollektivmaßlehre. Jahresbericht der Deutschen
Mathematiker-Vereinigung, 21, 9–20.
Müller, A., & Stoyan, D. (2002). Comparison methods for stochastic models and risks. Sussex:
Wiley.
Muñoz-Perez, J., & Sanchez-Gomez, A. (1990). Dispersive ordering by dilation. Journal of Applied
Probability, 27, 440–444.
Mussmann, D. (1987). On a characterization of monotone likelihood ratio experiments. Annals of
the Institute of Statistical Mathematics, 39, 263–274.
Neyman, J. (1938). L’estimation statistique traitée comme un problème classique de probabilité.
Actualités Sci. Indust., 739, 25–57.
Olkin, I., & Pratt, J. W. (1958). Unbiased estimation of certain correlation coefficients. The Annals
of Mathematical Statistics, 29, 201–211.
Pečarić, J. E., Proschan, F., & Tong, Y. L. (1992). Convex functions, partial orderings, and statistical
applications. Cambridge: Academic Press.
Pfanzagl, J. (1960). Über die Existenz überall trennscharfer Tests. Metrika 3, 169–176. Correction
Note, 4, 105–106.
Pfanzagl, J. (1994). Parametric statistical theory. Berlin: De Gruyter.
Pfanzagl, J. (1962). Überall trennscharfe Tests und monotone Dichtequotienten. Z. Wahrschein-
lichkeitstheorie verw. Gebiete, 1, 109–115.
Pfanzagl, J. (1964). On the topological structure of some ordered families of distributions. The
Annals of Mathematical Statistics, 35, 1216–1228.
Pfanzagl, J. (1969). Further remarks on topology and convergence in some ordered families of
distributions. The Annals of Mathematical Statistics, 40, 51–65.
Pfanzagl, J. (1995). On local and global asymptotic normality. Mathematical Methods of Statistics,
4, 115–136.
Pfanzagl, J. (2000b). Subconvex loss functions, unimodal distributions, and the convolution theorem.
Mathematical Methods of Statistics, 9, 1–18.
Pitman, E. J. G. (1937). The "closest" estimates of statistical parameters. Mathematical Proceedings
of the Cambridge Philosophical Society, 33, 212–222.
Pitman, E. J. G. (1939). Tests of hypotheses concerning location and scale parameters. Biometrika,
31, 200–215.
Prékopa, A. (1973). On logarithmic concave measures and functions. Acta Scientiarum Mathemati-
carum, 34, 335–343.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters.
Bulletin of Calcutta Mathematical Society, 37, 81–91.
Rao, C. R. (1947). Minimum variance estimation of several parameters. Proceedings of the Cam-
bridge Philosophical Society, 43, 280–283.
Rao, C. R. (1962). Efficient estimates and optimum inference procedures in large samples. Journal
of the Royal Statistical Society. Series B, 24, 46–72.
Rao, M. M. (1965). Existence and determination of optimal estimators relative to convex loss.
Annals of the Institute of Statistical Mathematics, 17, 133–147.
Reiss, R. D. (1989). Approximate distributions of order statistics. With applications to nonpara-
metric statistics. Berlin: Springer.
Roussas, G. G. (1968). Some applications of asymptotic distribution of likelihood functions to
the asymptotic efficiency of estimates. Zeitschrift für Wahrscheinlichkeitsheorie und verwandte
Gebiete, 10, 252–260.
Roussas, G. G. (1972). Contiguity of probability measures: Some applications in statistics. Cam-
bridge: Cambridge University Press.
Saunders, I. W., & Moran, P. A. P. (1978). On the quantiles of the gamma and the F distributions.
Journal of Applied Probability, 15, 426–432.
Savage, L. J. (1954). The foundations of statistics. New York: Wiley.
Schmetterer, L. (1974). Introduction to mathematical statistics. Translation of the 2nd German
edition 1966. Berlin: Springer.
Schmetterer, L. (1956). Einführung in die Mathematische Statistik. Wien: Springer (2nd ed. 1966).
Schweder, T. (1982). On the dispersion of mixtures. Scandinavian Journal of Statistics, 9, 165–169.
Sen, P. K. (1986). Are BAN estimators the Pitman-closest ones too? Sankhyā Series A 48, 51–58.
Serfling, R. J. (1980). Approximation theorems of mathematical statistics. New York: Wiley.
Shaked, M. (1980). On mixtures from exponential families. Journal of the Royal Statistical Society. Series B, 42, 192–198.
Shaked, M. (1982). Dispersive ordering of distributions. Journal of Applied Probability, 19, 310–
320.
Sherman, S. (1955). A theorem on convex sets with applications. The Annals of Mathematical
Statistics, 26, 763–766.
Sobel, M. (1953). An essentially complete class of decision functions for certain standard sequential
problems. The Annals of Mathematical Statistics, 24, 319–337.
Stein, Ch. (1964). Inadmissibility of the usual estimator for the variance of a normal distribution
with unknown mean. Annals of the Institute of Statistical Mathematics, 16, 155–160.
Strasser, H. (1985). Mathematical theory of statistics. Berlin: De Gruyter.
Vaart, H. R. van der (1961). Some extensions of the idea of bias. The Annals of Mathematical
Statistics, 32, 436–447.
van Zwet, W. R. (1964). Convex Transformations of Random Variables. Mathematical Centre Tracts
7. Mathematisch Centrum, Amsterdam.
Varian, H. R. (1975). A Bayesian approach to real estate assessment. In S. E. Fienberg & A. Zellner
(Eds.), Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage (pp. 195–
208). North-Holland.
Wald, A. (1939). Contributions to the theory of statistical estimation and testing hypotheses. The
Annals of Mathematical Statistics, 10, 299–326.
Wefelmeyer, W. (1985). A counterexample concerning monotone unimodality. Statistics and Prob-
ability Letters, 3, 87–88.
Wells, D. R. (1978). A monotone unimodal distribution which is not central convex unimodal. The
Annals of Statistics, 6, 926–931.
Wilks, S. S. (1943). Mathematical statistics. Princeton: Princeton University Press.
Wintner, A. (1938). Asymptotic distributions and infinite convolutions. Ann Arbor: Edwards Broth-
ers.
Witting, H. (1985). Mathematische Statistik I. Parametrische Verfahren bei festem Stichprobenumfang. Stuttgart: Teubner.
Witting, H., & Müller-Funk, U. (1995). Mathematische Statistik II. Asymptotische Statistik: Parametrische Modelle und nichtparametrische Funktionale. Stuttgart: Teubner.
Zacks, S. (1971). The theory of statistical inference. New York: Wiley.
Zellner, A. (1986). Bayesian estimation and prediction using asymmetric loss functions. Journal of
the American Statistical Association, 81, 446–451.
Chapter 4
Optimality of Unbiased Estimators:
Nonasymptotic Theory
among all estimators κ̂ fulfilling ∫ κ̂ dP = κ(P) for P ∈ P. We speak of convex optimality if this relation holds for every convex loss function, and of quadratic optimality if it holds for ℓ(u) = u².
Certain results become more transparent if we call κ̂0 quadratically [convex] optimal if it is an ℓ-optimal estimator of its expectation (without an explicit reference to the functional P → ∫ κ̂ dP).
Unlike the case of median unbiased estimators, a comparison of mean unbiased
estimators with respect to the concentration on intervals is impossible. Hence the
following considerations are based on the comparison of estimators by means of
certain loss functions only.
If a functional κ : P → R admits a mean unbiased estimator, all that one could
expect in general is to find an estimator that minimizes the risk for a given loss
function at a certain P0 ∈ P. It is of mainly mathematical interest whether the
infimum of ∫ ℓ(κ̂(x) − κ(P0)) P0(dx) over all unbiased estimators κ̂ is attained or not. For ℓ(u) = |u|^s, s > 1, the attainment was proved by Barankin (1949, p. 483, Theorem 2(iii)) under the assumption that

∫ ( p(x)/ p0(x))^{s/(s−1)} P0(dx) < ∞ for P ∈ P.
Of some practical interest are criteria which guarantee that a given unbiased estimator, say κ̂0, does, in fact, minimize κ̂ → ∫ ℓP0(κ̂(x)) P0(dx). (Recall that there is at most one such estimator if ℓP0 is strictly convex and bounded from below.)
Starting with the quadratic loss function, i.e., ℓP(u) = (u − κ(P))², Rao (1952) suggests the following
Criterion. The estimator κ̂0 minimizes the quadratic risk among all mean unbiased estimators at P0 iff ∫ v(x)κ̂0(x) P0(dx) = 0 holds for every function v ∈ V2, where V2 is the set of all functions v : X → R with

∫ v(x) P(dx) = 0 and ∫ v²(x) P(dx) < ∞ for P ∈ P.
This criterion is more important for general considerations than for particular appli-
cations, since the class V2 is not always easy to characterize. (See the examples 2.112,
p. 301 and 2.113, p. 302 in Witting 1985.)
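The reasoning behind the criterion is a simple perturbation argument; a sketch, assuming all integrals involved are finite:

```latex
% Every mean unbiased estimator with finite quadratic risk can be written
% as \hat\kappa_0 + v with v \in V_2; for such a competitor,
\[
  \int(\hat\kappa_0+v-\kappa(P_0))^{2}\,dP_0
  = \int(\hat\kappa_0-\kappa(P_0))^{2}\,dP_0
    + 2\int v\,\hat\kappa_0\,dP_0
    + \int v^{2}\,dP_0 ,
\]
% since \int v\,dP_0 = 0 removes the term involving \kappa(P_0).  The right
% side exceeds the risk of \hat\kappa_0 for every v \in V_2 iff the cross
% term \int v\,\hat\kappa_0\,dP_0 vanishes for every v \in V_2 (replace v
% by tv with small t of suitable sign to see the necessity).
```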
Rao’s criterion was generalized from the power s = 2 to higher powers s ≥ 2: The estimator κ̂0 minimizes the risk for ℓ(u) := |u|^s at P0 iff

∫ v(x) |κ̂0(x) − κ(P0)|^{s−1} sgn(κ̂0(x) − κ(P0)) P0(dx) = 0

holds for every function v ∈ Vs, where Vs now denotes all functions v ∈ V2 fulfilling ∫ |v(x)|^s P(dx) < ∞ for P ∈ P.
See Schmetterer (1960, p. 1155, Theorem 3). (Correct three misprints in the statement of this theorem.) For a more precise proof of this result see Heyer (1982, pp. 124/5, Theorem 17.3). This result was further generalized to convex functions ℓ: The estimator κ̂0 minimizes the risk for a convex function ℓ at P0 iff

∫ v(x) ℓ′(κ̂0(x) − κ(P0)) P0(dx) = 0
a sample of size n > 1 which minimizes the quadratic risk simultaneously for all
ϑ ∈ R. For an example of a parametric family in which an unbiased estimator exists
for every sample size, but none of them minimizes the risk for any strictly convex
loss function, simultaneously for every ϑ ∈ Θ, see Pfanzagl (1994, p. 103, Example
3.1.6).
In fact one would expect that the existence of an optimal unbiased estimator (i.e.,
one that minimizes the risk for a given loss function simultaneously for all P0 ∈ P)
is a rare exception. It might, therefore, come as a surprise that there is an important
type of families in which for every functional admitting an unbiased estimator there
exists an unbiased estimator that minimizes the risk, simultaneously for all convex
loss functions and all P0 ∈ P. These are the families P admitting a sufficient statistic
S : X → Y for which P ◦ S is complete. Historically, it was this finding that sparked
interest in a theory of unbiased estimators around 1950, and which is still responsible
for its presence in textbooks.
Now let κ̂ be unbiased for κ : P → Rk in the family P. If S|X is sufficient, there exists a conditional expectation of κ̂, given S, say k ◦ S, which is independent of P, hence an estimator again. Since ∫ k(S(x)) P(dx) = ∫ κ̂(x) P(dx) for P ∈ P, the estimator k ◦ S is unbiased, too. According to Jensen’s inequality for conditional expectations, for every convex function ℓ,

ℓ(k(S(x))) ≤ EP(ℓ ◦ κ̂ | S)(x)  P-a.e., (4.1.1)

hence

∫ ℓ(k(S(x))) P(dx) ≤ ∫ ℓ(κ̂(x)) P(dx). (4.1.2)

(If ℓ is strictly convex, the inequality (4.1.1) is strict unless κ̂ = k ◦ S P-a.e.) Together with unbiasedness, (4.1.2) implies

∫ ℓ(k(S(x)) − κ(P)) P(dx) ≤ ∫ ℓ(κ̂(x) − κ(P)) P(dx).
Rao (1945, p. 83) proves relation (4.1.2) for ℓ(u) = u², using an argument specific to this quadratic loss function. His argument refers to a one-parameter family of probability measures Pϑ over Rn, with κ(Pϑ) = ϑ. Presumably, he had a real-valued sufficient statistic in mind. Rewritten in our notation, his argument, given in his equation (3.8), reads as follows:

∫ (κ̂(x) − κ(P))² P(dx) = ∫ (κ̂(x) − k(S(x)))² P(dx) + ∫ (k(S(x)) − κ(P))² P(dx)
≥ ∫ (k(S(x)) − κ(P))² P(dx).
without further comment. In Rao (1947, pp. 280/1, Theorem 1) this argument is
extended to k-parameter families.
Independently of C.R. Rao, the same result was obtained by Blackwell (1947, p.
106, Theorem 2). Relation (4.1.3), used by Rao without further ado, is found worthy of a careful proof by Blackwell (see p. 105, Theorem 1).
The extension to convex loss functions is due to Hodges and Lehmann (1950,
p. 188, Theorem 3.3). Their proof is based on Jensen’s inequality for conditional
expectations (p. 195, Lemma 3.1), corresponding to (4.1.1). The authors are aware
of the fact that this inequality is straightforward if a “regular conditional probability”
exists. Yet they take the trouble to give a proof which goes through without the
conditions needed to ensure the existence of a regular conditional probability in
general. The role of Barankin’s paper (1950), which appeared in the same journal but after the paper by Hodges and Lehmann, remains unclear. Barankin’s Theorem on p. 281 gives inequality (4.1.1) for ℓ(u) = |u|^s, with a proof attributed to “the referee”, and he applies this inequality in Corollary 1, p. 283, to obtain (4.1.2) with ℓ(u) = |u|^s. Unclear, too, is the purpose of Barankin’s paper (1951). His Theorem on p. 168 is just another proof of Jensen’s inequality for conditional expectations, and its application repeats the result of Hodges and Lehmann.
The idea that unbiased estimators can be improved by taking the conditional expectation with respect to a sufficient statistic met with some reserve. First of all,
“improved” just means that the risk is decreased, simultaneously for every convex
loss function; it does not mean that the improved estimator is more concentrated
on intervals containing the estimand. Moreover, the improved estimator may have
certain properties uncalled for (and not shared by the original estimator). It might, for
instance, fail to be proper. Finally, there are cases where the conditional expectation,
given S, cannot be expressed in closed form.
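In friendlier cases the conditional expectation is explicit. A classical illustration (a sketch of my own, not taken from the text) is the Poisson family: the naive unbiased estimator of e^{−λ} = Pλ{x1 = 0} is 1{x1 = 0}, and since x1 given S = Σ xν = s is binomial (s, 1/n), its conditional expectation is ((n − 1)/n)^S:

```python
import numpy as np

# Rao-Blackwell improvement in the Poisson family: estimating exp(-lam)
# from n i.i.d. Poisson(lam) observations.
rng = np.random.default_rng(1)
lam, n, reps = 2.0, 10, 200_000

x = rng.poisson(lam, size=(reps, n))
naive = (x[:, 0] == 0).astype(float)   # unbiased, but uses x_1 only
s = x.sum(axis=1)                      # complete sufficient statistic
rb = ((n - 1) / n) ** s                # E(naive | S) = ((n-1)/n)^S

print(f"target    : {np.exp(-lam):.4f}")
print(f"means     : naive {naive.mean():.4f}, improved {rb.mean():.4f}")
print(f"variances : naive {naive.var():.5f}, improved {rb.var():.5f}")
```

Since S is complete here, the improved estimator is, by the result discussed next, even optimal in the class of all unbiased estimators.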
The role of the improvement procedure is accentuated by a result of Lehmann and
Scheffé: The improved estimator is optimal (in the sense of minimizing the convex
risk in the class of all unbiased estimators) if the conditioning is taken with respect
to a sufficient statistic S for which P ◦ S is complete. Theorem 5.1 in Lehmann and
Scheffé (1950, p. 321) asserts that an unbiased estimator is of minimal quadratic
risk iff it is the contraction of a sufficient statistic S with P ◦ S complete. This
formulation does not exhibit the core of the argument (which by no means depends on the assumption that ℓ(u) = u²): If P ◦ S is complete, there is at most one
unbiased estimator that is a contraction of S. Recall a forerunner of this result, due
to Halmos (1946).
In their final form, these results are an inevitable topic in any textbook under the
title
Theorem of Rao–Blackwell–Lehmann–Scheffé. Let P be a family admitting a
sufficient statistic S such that P ◦ S is complete. Then for every mean unbiased
estimator κ̂, its conditional expectation k ◦ S is optimal in the sense that it minimizes
the convex risk in the class of all mean unbiased estimators.
If we consider the steps leading to this result, we find that two essential points
are already present in Rao (1945, in particular p. 83): (i) There is a bound for the
quality of mean unbiased estimators, and (ii) this bound can be achieved by taking
conditional expectations with respect to a certain sufficient statistic.
Yet it needed five authors (and five years) to bring these vague ideas to their final shape. The reason: Rao had difficulties coping with the concept of a conditional expectation (see 1945, p. 83, relation 3.7, and 1947, p. 281). If he had used the stochastic independence between S and κ̂ − k ◦ S (rather than their being uncorrelated) he could have obtained the convex optimality (rather than the quadratic optimality). Taking conditional expectations leads to an improvement, but not necessarily to optimality. The optimality follows from the completeness of P ◦ S, introduced by Lehmann and Scheffé in 1950. Something close to “completeness” is foreshadowed in Rao (1945, p. 83, lines 4–6 from below) and (1947, p. 281, lines 1–2).
The use of “uncorrelated” rather than “stochastically independent” leads to a serious disadvantage when Rao extends his results to k-parameter families (see 1945, pp. 84–86). Instead of arriving at “minimal convex risk” he ends up with the result that the covariance matrix of the improved estimators is minimal in the Löwner-order among all covariance matrices of mean unbiased estimators—a result which follows immediately, since (u1, . . . , uk) → (Σ_{i=1}^k αi ui)² is convex for every (α1, . . . , αk) ∈ Rk.
Remark Various optimality results for unbiased estimators are of the type: κ̂0 minimizes ∫ ℓ(κ̂) dP for κ̂ in a certain class of unbiased estimators, simultaneously for all convex functions ℓ : R → [0, ∞). This is not a convincing optimum property, since the location of P ◦ κ̂ enters through the condition ∫ κ̂ dP = κ(P) only. The loss function itself makes no allowance for location (say by the property ℓ(u) = 0 for u = κ(P)), nor does it distinguish between estimators that are optimal for the particular loss function, and estimators that are optimal for every convex loss function.
Recall that optimality with respect to all subconvex loss functions is equivalent
to the following statements:
(i) The distribution of subconvex losses is minimal in the stochastic order.
(ii) The concentration is maximal on all intervals containing κ(P).
As against that, minimality of the risk of κ̂ with respect to every convex loss function says nothing about the distribution of the convex losses ℓ ◦ κ̂. Moreover, the improvement of an estimator by taking a conditional expectation reduces the convex risk, but the distribution of the convex losses of the improved estimator is not necessarily stochastically smaller than that of the original estimator. Finally, an estimator which is of minimal convex risk in the class of all mean unbiased estimators may be inferior to other mean unbiased estimators if evaluated by a subconvex loss function.
Bahadur’s paper is fairly poorly arranged, containing six Theorems and seven Propositions. The presentation of Bahadur’s results in Schmetterer (1966, pp. 332–352) requires 6 Theorems, and it had not become much simpler twenty years later: the presentation in Strasser (1985, pp. 168–172) consists of 4 Lemmas, 3 Theorems and one Corollary. In other textbooks like Eberl and Moeschlin (1982), Heyer (1973, 1982) and Witting (1985), Bahadur’s result is not even mentioned. In roughly a dozen papers dealing with Bahadur’s approach (see Eberl 1984, for further references) one misses what I would consider the main result of Bahadur (1957): that a bounded quadratically optimal estimator is convex optimal. This result, based on
Bahadur’s approach, was explicitly put forward by Padmanabhan (1970, p. 109,
Theorem 3.1). See also Schmetterer and Strasser (1974, p. 60). (For more examples
and counterexamples see the papers by Bomze 1986, 1990; Eberl 1984; Heizmann
1989.)
To make a long story short: All these papers were based on an ingenious idea of Bahadur (1957, p. 218). Given a family P of probability measures P|(X, A), let V2 be the set of all functions ν : X → R fulfilling the conditions ∫ ν² dP < ∞ and ∫ ν dP = 0 for P ∈ P. Bahadur introduces

AP := {A ∈ A : ∫ 1_A ν dP = 0 for ν ∈ V2 and P ∈ P}, (4.2.1)

κ̂^k ν ∈ V2 for every k ∈ N.
Following the basic idea of Bahadur (1957, p. 218, proof of Theorem 5(i)), this implies that ∫ 1_B(κ̂) ν dP = 0 for ν ∈ V2 and P ∈ P. Hence κ̂^{−1}(B) ∈ AP for every B ∈ B, which implies the AP-measurability of κ̂.
A detailed proof can be found in Strasser (1972, p. 110, Theorem 5.6). See also
Strasser (1985, p. 170, Theorem 35.14) or Pfanzagl (1994, p. 121).
By definition of AP,

∫ 1_A ν dP = 0 for ν ∈ V2 and P ∈ P. (4.2.3)
The essential point in the proof of Proposition 4.2.2 is that every AP -measurable
unbiased estimator is the conditional expectation of any unbiased estimator. This
idea occurs first in Padmanabhan (1970, p. 109, Theorem 1), under the redundant
assumption that the AP -measurable estimator minimizes the quadratic risk. Without
the redundant assumption this assertion occurs in Schmetterer (1974, p. 61, Satz 1).
The idea that AP -measurable estimators are conditional expectations and therefore
optimal for every convex loss function was not obvious from the beginning. This may
be seen from Theorem 7 in Schmetterer (1960, p. 1161) which asserts the optimality
For such loss functions, ∫ ℓ0(κ̂0) ν dP = 0 for ν ∈ V2 and P ∈ P (see relation 6, p. 62), which implies ∫ ℓ(κ̂0) dP ≤ ∫ ℓ(κ̂) dP for every convex loss function ℓ and every P ∈ P, whence

∫ ℓ(κ̂0 − μ) dP ≤ ∫ ℓ(κ̂ − μ) dP

for every μ ∈ R. If ∫ κ̂0 dP = ∫ κ̂ dP = κ(P), this implies

∫ ℓ(κ̂0 − κ(P)) dP ≤ ∫ ℓ(κ̂ − κ(P)) dP for every P ∈ P. (4.2.5)
The restrictive condition on the loss function ℓ0 is, perhaps, responsible for the fact that this result is neglected in the literature.
There is another point in the paper by Schmetterer and Strasser which might cause
some irritation: Optimality with respect to a loss function is defined by (4.2.5). For
the quadratic loss function,
∫ (κ̂0 − κ(P))² dP ≤ ∫ (κ̂ − κ(P))² dP

is, under the condition ∫ κ̂0 dP = κ(P) and ∫ κ̂ dP = κ(P), equivalent to
∫ κ̂0² dP ≤ ∫ κ̂² dP.
Schmetterer and Strasser start from the condition (their relation (2), p. 60)

∫ ℓ(κ̂0) dP ≤ ∫ ℓ(κ̂) dP for ∫ κ̂ dP = ∫ κ̂0 dP = κ(P) and P ∈ P,

which is not the same as (4.2.5) unless ℓ(u) = u². Yet, since the final result,

∫ ℓ(κ̂0) dP ≤ ∫ ℓ(κ̂) dP for every convex ℓ,

refers to each P ∈ P separately, the proof by Schmetterer and Strasser can be carried through with P fixed, i.e., with P = {P0}, in which case the loss function may be taken to be u → ℓ0(u − κ(P0)).
The propositions stated above have nothing to do with the sufficiency of AP. However, if a 2-complete sufficient sub-σ-field of P does exist, then it is the σ-field AP defined by (4.2.1): AP “recovers” the σ-field underlying the Rao–Blackwell–Lehmann–Scheffé Theorem.
The following Proposition states Bahadur’s converse: If for every unbiasedly
estimable functional there is a quadratically optimal unbiased estimator, then there
exists a sufficient sub-σ -field. With this result, Bahadur answers a question which no
statistician had ever asked. What the statistician is interested in is an optimal unbiased
estimator for a given functional. Whether every unbiasedly estimable functional
admits an optimal unbiased estimator is of no relevance for his problem.
Bahadur’s proof (1957), followed by Strasser (1985, pp. 171/2), uses the existence
of optimal estimators for bounded densities.
Sufficiency, which was the main point in Bahadur’s paper, is neglected by
Schmetterer: “It is more difficult to prove under some more conditions that AP
is also sufficient” (see 1966, p. 252 and 1974, p. 289).
Then M̂n(y, A) := ∫ 1_A(ξ) p̂(ξ, y) μ(dξ) fulfills for every A ∈ A the relation

∫ M̂n(y, A) Q_P(dy) = P(A) for P ∈ P.
Needless to say that here the natural choice is ℓ(u) = |u|. Since

(1/2) ∫ | p(ξ) − q(ξ)| μ(dξ) = sup_{A∈A} |P(A) − Q(A)|,
this implies that Mn (Sn (·), ·)|A , evaluated as a probability measure, minimizes the
sup-distance in the class of all unbiased estimators of P|A . Yet, even more is true: The
Rao–Blackwell–Lehmann–Scheffé Theorem implies that Mn (Sn (·), A) minimizes—
for every A ∈ A —the convex risk in the class of all unbiased estimators of P →
P(A).
The question remains how an unbiased estimator of q P (ξ ) can be obtained. Gener-
alizing the ideas applied by Kolmogorov (1950, Sect. 9, pp. 389–392) for the normal
distribution, Lumel’skii and Sapozhnikov (1969, p. 357, Theorem 1) suggest the
following general procedure:
Let P be dominated by μ. Assume that for every P ∈ P, Pⁿ ◦ Sn|B has a ν-density, say h_P^(n), and that the joint distribution of (x1, Sn(x1, . . . , xn)) under Pⁿ has μ × ν-density (ξ, y) → h_P^(n)(ξ, y). Then pn(ξ, y) := h_P^(n)(ξ, y)/h_P^(n)(y) is, thanks to the sufficiency of Sn, independent of P, and pn(ξ, Sn(·)) is unbiased for q_P(ξ). The computation of h_P^(n) becomes simple if Sn(x1, . . . , xn) = Σ_{ν=1}^n xν. (See Pfanzagl 1994, p. 118.)
The literature provides numerous examples of unbiased estimators of probabilities
and the pertaining estimators of densities. A somewhat disturbing phenomenon, to
be found in all these examples: If P is a parametric family, the optimal unbiased
estimator is not a member of this family. As an example we mention that the optimal
unbiased estimator of N(μ, 1)(A) in the family {N(μ, 1) : μ ∈ R} is

(x1, . . . , xn) → N(x̄n, (n − 1)/n)(A).
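That this estimator is unbiased follows from a one-line convolution argument (sketched here; x̄n denotes the sample mean):

```latex
% Under N(\mu,1)^n the sample mean satisfies \bar x_n \sim N(\mu, 1/n); hence
\[
  \int N\!\Big(\bar x_n,\tfrac{n-1}{n}\Big)(A)\;N(\mu,1)^n(dx)
  \;=\;\Big(N\big(0,\tfrac{n-1}{n}\big)*N\big(\mu,\tfrac1n\big)\Big)(A)
  \;=\;N(\mu,1)(A),
\]
% because variances add under convolution: (n-1)/n + 1/n = 1.
```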
with sn = n⁻¹ Σ_{ν=1}^n (xν − x̄n)² and

p^(n)(ξ; μ, σ) = cn (1/σ) [1 − (ξ − μ)²/((n − 1)σ²)]^{(n−4)/2}.
(Find out which of the versions of cn offered in the literature comes closest to the truth:
Kolmogorov 1950, pp. 391/2; Barton 1961, p. 228; Basu 1964, p. 219; Lumel’skii
and Sapozhnikov (1969), p. 360, specialized for p = 1.)
The first general result on mean unbiased estimators is the so-called Cramér–Rao
bound, a classical example of multiple discoveries. With a straightforward proof,
this result has an unusual number of fathers: Aitken and Silverstone (1942), Fréchet
(1943, p. 185), Darmois (1945, p. 9, with a reference to Fréchet 1943), Rao (1945,
p. 83 and 1947, p. 281) and Cramér (1946, p. 480, relation 23.3.3a). Following
“Stigler’s law of eponymy” it was named after the last two of these. Savage (1954,
p. 238) therefore suggested the now widely used name “information bound”.
The straightforward argument: Let {Pϑ : ϑ ∈ Θ}, Θ ⊂ R, be a parametric family with p(·, ϑ) ∈ dPϑ/dμ. Write p•(x, ϑ) = ∂ϑ p(x, ϑ) and ℓ•(x, ϑ) = ∂ϑ log p(x, ϑ) = p•(x, ϑ)/ p(x, ϑ). If the estimator ϑ̂ : X → R is unbiased for ϑ, then

∫ (ϑ̂(x) − ϑ) p(x, ϑ) μ(dx) = 0 for ϑ ∈ Θ. (4.4.1)
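The steps that follow are standard; a sketch in modern notation: differentiating (4.4.1) with respect to ϑ and applying the Cauchy–Schwarz inequality gives

```latex
% Differentiating (4.4.1) under the integral sign:
\[
  \int(\hat\vartheta(x)-\vartheta)\,p^{\bullet}(x,\vartheta)\,\mu(dx)
  \;=\;\int p(x,\vartheta)\,\mu(dx)\;=\;1 ,
\]
% i.e. \int(\hat\vartheta-\vartheta)\,\ell^{\bullet}(\cdot,\vartheta)\,dP_\vartheta = 1,
% so that by the Cauchy-Schwarz inequality
\[
  1\;\le\;\int(\hat\vartheta-\vartheta)^{2}\,dP_\vartheta
          \cdot\int \ell^{\bullet}(\cdot,\vartheta)^{2}\,dP_\vartheta ,
  \qquad\text{i.e.}\qquad
  \operatorname{Var}_\vartheta(\hat\vartheta)\;\ge\;\frac{1}{I(\vartheta)} .
\]
```

The whole argument thus hinges on the validity of the differentiated relation,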
i.e., on the interchange of differentiation and integration. This relation was neither
taken for granted, nor considered as a regularity condition in the early papers. It was
neglected in Rao (1945). Rao (1949, p. 216, Theorem 3) gives conditions on the
family that imply relation (4.4.4).
A weak condition that ensures the validity of (4.4.4) at ϑ0 (even for k-parameter families) can be found in Witting (1985, p. 319, Satz 2.136): L2-differentiability of the family {Pϑ : ϑ ∈ Θ} at ϑ0, and the existence of a function ϑ̂ such that ∫ ϑ̂ dPϑ = ϑ and ϑ → ∫ ϑ̂² dPϑ is locally bounded at ϑ0. Simons and Woodroofe (1983, p. 76, Corollary 2) show, under slightly weaker conditions, that (4.4.3) holds μ-a.e. (see also Witting 1985, p. 327, Aufgabe 2.33).
In “Kendall’s Advanced Theory of Statistics”, Stuart and Ord (1991) pass over such points. They confine themselves to the statement that ∫ p•(x, ϑ) dx = 0 is the only [!] condition for the validity of the Cramér–Rao bound (see p. 616). This slip persists in all editions, even though it was pointed out already in Polfeldt (1970, p. 23).
The Cramér–Rao bound is attainable in one special case only: if the family {Pϑ : ϑ ∈ Θ} is exponential with density

p(x, ϑ) = C(ϑ) exp[ϑT(x)] and ∫ T(x) Pϑ(dx) = ϑ.

This is implicitly already contained in Aitken and Silverstone (1942, p. 188), who show that an unbiased estimator ϑ(n) achieves the minimal variance iff

Σ_{ν=1}^n ℓ•(xν, ϑ) = C(ϑ)(ϑ(n)(x) − ϑ).
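For instance (a standard illustration, not from the text), in the normal location family one has ℓ•(x, ϑ) = x − ϑ, so:

```latex
% Attainment of the Cramer-Rao bound in the simplest exponential family,
% P_\vartheta = N(\vartheta, 1):
\[
  \sum_{\nu=1}^{n}\ell^{\bullet}(x_\nu,\vartheta)
  \;=\;\sum_{\nu=1}^{n}(x_\nu-\vartheta)
  \;=\;n\,(\bar x_n-\vartheta),
\]
% which is of the required form C(\vartheta)(\vartheta^{(n)}(x)-\vartheta)
% with C(\vartheta) = n and \vartheta^{(n)} = \bar x_n: the sample mean
% attains the bound.
```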
quality of an estimator is more than doubtful. Hence, we will abstain from discussing
similar bounds put forward by Hammersley (1950) or Chapman and Robbins (1951).
Some scholars hold the opinion that bounds are meaningful even if they are
not attainable, but they withhold their arguments supporting this opinion. Ander-
sen (1970, p. 85, with respect to another bound which is unattainable, too) claimed:
“... in situations where the lower bound is not attained, [it] provides us with a denom-
inator for an efficiency measure”. Similarly, Barnett (1975, p. 126): “[The Cramér–
Rao bound] provides an absolute standard against which to measure estimators in a
wide range of situations.” Fisz (1963, p. 470, Definition 13.5.1) defines an unbiased estimator as “most efficient” if it attains the Cramér–Rao bound, neglecting the existence of models in which the unbiased estimator of minimal convex risk has a variance larger than the Cramér–Rao bound.
Some scholars think that the Cramér–Rao bound is at least valid asymptotically.
This opinion results from the fact that in highly regular parametric families the
Cramér–Rao bound happens to coincide with the asymptotic bound provided by the
Convolution Theorem for regular estimator sequences. Even respected authors like
Witting and Müller-Funk (1995, p. 198) cannot resist the temptation to pretend a
connection which does not exist: The nature of these bounds is totally different,
and so are the proofs. This can be seen from examples where the bound for the
asymptotic variance of regular estimator sequences is attained, whereas the best
sequences of unbiased estimators have a larger asymptotic variance (see Portnoy
1977, for a somewhat artificial, and Pfanzagl 1993, pp. 74–76, for a more natural
example). This refutes a widely held opinion that the Cramér–Rao bound is always
asymptotically attainable.
Fϑ (u) := Pϑ {x ∈ X : S(x) ≤ u}
is, for fixed u, decreasing in ϑ. To avoid technicalities, we assume for the moment
that u → Fϑ (u) is increasing and continuous. If u(ϑ) is such that Fϑ (u(ϑ)) = 1/2,
then x → u −1 (S(x)) is a median unbiased estimator of ϑ.
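A minimal numerical sketch of this construction (an illustration of my own) for the exponential scale family, where S(x) = Σ xν has a Gamma(n) distribution with scale ϑ, so that u(ϑ) is ϑ times the median of Gamma(n, 1):

```python
import numpy as np
from scipy.stats import gamma

# Median unbiased estimation by inverting the median function u(theta):
# x_1, ..., x_n i.i.d. exponential with mean theta; S = sum(x) ~ Gamma(n, scale=theta),
# so F_theta(u) is continuous, increasing in u and decreasing in theta.
rng = np.random.default_rng(2)
theta, n, reps = 3.0, 5, 100_000

m_n = gamma.median(n)          # median of Gamma(n, scale=1); u(theta) = theta * m_n
x = rng.exponential(theta, size=(reps, n))
est = x.sum(axis=1) / m_n      # u^{-1}(S): median unbiased estimator of theta

print(f"P(est >= theta) ~ {np.mean(est >= theta):.3f}")   # ~ 0.5 by construction
```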
The median unbiased estimators thus obtained will be maximally concentrated on arbitrary intervals containing ϑ if S is sufficient and the densities p(x, ϑ) = h(x)g(S(x), ϑ) have monotone likelihood ratios in S, i.e., if y → g(y, ϑ2)/g(y, ϑ1) is increasing for ϑ1 < ϑ2. Monotonicity of likelihood ratios implies that the family {Pϑ ◦ S : ϑ ∈ Θ} is stochastically increasing, so that median unbiased estimators can be obtained as indicated above.
Stochastic monotonicity suffices for the existence of a median unbiased estimator.
It does, however, not guarantee that this estimator is reasonable (see Pfanzagl 1972,
p. 160, Example 3.16). If S is sufficient for {Pϑ : ϑ ∈ Θ} and if {Pϑ ◦ S : ϑ ∈
Θ} has monotone likelihood ratios, then a median unbiased estimator is maximally
Θ} has monotone likelihood ratios, then a median unbiased estimator is maximally concentrated on all intervals (ϑ − t′, ϑ + t″) if it is a monotone function of S. Because of the m.l.r. property, any set {x ∈ X : m(S(x)) ≥ ϑ′}, with m increasing, is, according to the Neyman–Pearson Lemma, most powerful for every testing problem Pϑ′ : Pϑ with ϑ > ϑ′. Since

Pϑ′{m ◦ S ≥ ϑ′} = 1/2 = Pϑ′{ϑ̂ ≥ ϑ′},

it follows that

Pϑ{m ◦ S ≥ ϑ′} ≥ Pϑ{ϑ̂ ≥ ϑ′},

hence

Pϑ{ϑ′ ≤ m ◦ S < ϑ} ≥ Pϑ{ϑ′ ≤ ϑ̂ < ϑ}.
What has been said so far is essentially presented in Lehmann (1959, Sect. 5, pp.
78–83). Lehmann also indicates how randomization can be used to obtain median
unbiased estimators in the more general case where Pϑ ◦ S may contain atoms.
Lehmann is aware of the fact that critical regions {x ∈ X : S(x) ≥ c} do not
only maximize the power for testing ϑ against ϑ > ϑ; they also minimize the
power for alternatives ϑ < ϑ (see his pp. 68/9, Theorem 2). Yet he arrives at the
optimality assertion (see p. 83) by means of a different argument: He considers
median unbiased estimators as a boundary case of two-sided confidence intervals (ϑ̲, ϑ̄) with Pϑ{ϑ < ϑ̲} = α1 and Pϑ{ϑ > ϑ̄} = α2 for α1 = α2 = 1/2, using a loss function like

ℓ(ϑ; ϑ̲, ϑ̄) = ϑ̲ − ϑ if ϑ < ϑ̲,
             ϑ̄ − ϑ̲ if ϑ̲ ≤ ϑ ≤ ϑ̄,
             ϑ − ϑ̄ if ϑ̄ < ϑ.
Birnbaum (1964, p. 27) attributes the optimality result for median unbiased estimators to Birnbaum (1961), where it is contained implicitly in Lemma 2, p. 121. Pfanzagl (1970, p. 33, Theorem 1.12) treats the general case of randomized estimators.
According to Brown et al. (1976, p. 719, Corollary 4.1), the following is true under
the m.l.r. conditions indicated above: For any median unbiased estimator which is not
a contraction of S (almost everywhere), there exists a median unbiased contraction of
S which is strictly better. This result is an analogue to the Rao–Blackwell–Lehmann–
Scheffé Theorem for mean unbiased estimators.
A straightforward application of these results is to exponential families with den-
sities of the form
C(ϑ)h(x) exp[a(ϑ)S(x)] (4.5.1)
with a monotone function a. These families have monotone likelihood ratios for
every sample size. In fact, this is almost the only application. According to a result
obtained by Borges and Pfanzagl (1963, p. 112, Theorem 1) a one-parameter family
of mutually absolutely continuous probability measures with monotone likelihood
ratios for every sample size is necessarily of the type (4.5.1).
Relevant for applications are results on the existence of optimal median unbiased
estimators for families with nuisance parameters. Such a result may be found in
Pfanzagl (1979, p. 188, Theorem). Here is an important special case.
Assume that Pϑ,η, ϑ ∈ Θ (an interval in R), η ∈ H (an abstract parameter space), has densities

C(ϑ, η) h(x) exp[a(ϑ)S(x) + Σ_{i=1}^k ai(ϑ, η)Ti(x)]
{x ∈ X : t ∈ K (x)} ∈ A for t ∈ Rk .
Pτ {ϑ ∈ K } ≥ 1 − α for τ ∈ Hϑ , ϑ ∈ Θ.
The first formally correct statement of confidence sets as random objects contain-
ing the fixed parameter with prescribed probability is given by Wilson (1927). Other
examples of conceptually precise confidence statements prior to Neyman’s general
theory are Working and Hotelling (1929), Hotelling (1931) and Clopper and Pearson
(1934).
When Neyman entered the scene (1934, Appendix, Note 1, and 1937, 1938) he
was obviously not aware of Wilson’s paper. His intention was to make clear that
the concept of confidence, being based on the usual concept of probability, was something other than Fisher’s concept of “fiducial distributions” (1930), a concept
based on principles that cannot be deduced from the rules of ordinary logic. (See in
particular Neyman 1941.)
Lehmann (1959) was the first textbook dealing with confidence sets. As a book
with emphasis on test theory, it treats confidence intervals more or less as an appendix
to test theory. It obtains confidence sets by inverting critical regions (Lemma 5.5.1,
p. 179). This accounts for the restriction to confidence intervals for the parameter of
a univariate family. As an appendix to test theory, the author borrows the concepts for describing properties of confidence intervals (such as unbiasedness) from the corresponding properties of tests. This more or less prevents the development of a genuine theory of confidence sets. For the case of univariate families, something like optimality was obtained for exponential (more generally: m.l.r.) families.
The theory of confidence sets has virtually disappeared from the more advanced textbooks on mathematical statistics—Bickel et al. (1993) has two references to confidence sets; Strasser (1985) has none. That the theory of confidence sets never found an adequate treatment is perhaps due to the inadequate starting point in Lehmann (1959), where confidence procedures are considered as an appendix to test theory. Even the most recent book which contains a longer chapter on confidence sets (“The theory of confidence sets”, Chap. IV, pp. 254–267 in Schmetterer 1974) still deals with confidence sets as an appendix to test theory. The number of references to confidence procedures is two in Lehmann and Casella (1998) and zero in Bickel et al. (1993) and in Strasser. See, however, Shao (2003), Chap. 7.
A confidence procedure K∗ with confidence level 1 − α is called uniformly most accurate (for H) if for every confidence procedure K with confidence level 1 − α we have

Pτ(ϑ ∈ K∗) ≤ Pτ(ϑ ∈ K) for τ ∉ Hϑ, ϑ ∈ Θ.
The idea to obtain confidence sets by inversion of acceptance regions meets the
following problem: Whereas the shape of an acceptance region is of no relevance for
the power of a test, the shape of a confidence set is crucial for its interpretation. Think
of examples where confidence sets obtained by inversion of (optimal) tests are not
connected.
Pratt (1961) and Ghosh (1961) relate the false covering probabilities of confidence
sets K (x) to their expected size. With Θ ⊂ Rk and λ denoting Lebesgue measure
on Rk , Fubini’s Theorem implies
λ(K (x)) P(d x) = P{x ∈ X : τ ∈ K (x)}λ(dτ )
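This is Fubini’s theorem applied to the indicator of the set {(x, τ) : τ ∈ K(x)}; in detail:

```latex
% Both sides of the Pratt-Ghosh identity are the product-measure integral
% of the indicator of {(x, tau) : tau in K(x)}:
\[
  \int\lambda(K(x))\,P(dx)
  \;=\;\iint 1\{\tau\in K(x)\}\,\lambda(d\tau)\,P(dx)
  \;=\;\int P\{x\in X:\tau\in K(x)\}\,\lambda(d\tau).
\]
```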
We now consider the problem of an upper confidence bound with covering prob-
ability β ∈ (0, 1) for the parameter ϑ in a univariate family Pϑ |A , ϑ ∈ Θ ⊂ R, i.e.
a function ϑβ : X → R such that

Pϑ{x ∈ X : ϑβ(x) ≥ ϑ} = β for ϑ ∈ Θ. (4.6.1)
For a given ϑ the ideal answer would be qβ(ϑ), the β-quantile of Pϑ. A function x → ϑβ(x) which meets this requirement for every ϑ ∈ Θ should be close to qβ(ϑ) for every ϑ ∈ Θ. If we consider ϑβ as an estimator of qβ(ϑ), then ϑβ should be concentrated about qβ(ϑ) as closely as possible. Under special conditions there is a precise answer to this vague question. Let us say that a functional f0 is more concentrated about q(ϑ) than the functional f1 if

Pϑ{x ∈ X : f0(x) ∈ I} ≥ Pϑ{x ∈ X : f1(x) ∈ I}

for every interval I containing q(ϑ). With this terminology, no confidence bound ϑβ fulfilling condition (4.6.1) can be more concentrated about qβ(ϑ) than a confidence bound depending on x through T(x) only. An earlier version of this result occurs in Pfanzagl (1994, p. 173, Theorem 5.4.3).
References
Aitken, A. C., & Silverstone, H. (1942). On the estimation of statistical parameters. Proceedings of
the Royal Society of Edinburgh Section A, 61, 186–194.
Andersen, E. B. (1970). Asymptotic properties of conditional maximum-likelihood estimators.
Journal of the Royal Statistical Society. Series B, 32, 283–301.
Bahadur, R. R. (1957). On unbiased estimates of uniformly minimum variance. Sankhyā, 18, 211–
224.
Barankin, E. W. (1949). Locally best unbiased estimates. The Annals of Mathematical Statistics,
20, 477–501.
Barankin, E. W. (1950). Extension of a theorem of Blackwell. The Annals of Mathematical Statistics,
21, 280–284.
Barankin, E. W. (1951). Conditional expectation and convex functions. In Proceedings of the Sec-
ond Berkeley Symposium on Mathematical Statistics and Probability (pp. 167–169). Berkeley:
University of California Press.
Barnett, V. (1975). Comparative statistical inference. New York: Wiley.
Barton, D. E. (1961). Unbiased estimation of a set of probabilities. Biometrika, 48, 227–229.
Basu, A. P. (1964). Estimators of reliability for some distributions useful in life testing. Technomet-
rics, 6, 215–219.
Basu, D. (1955). A note on the theory of unbiased estimation. The Annals of Mathematical Statistics,
26, 345–348.
Bednarek-Kozek, B., & Kozek, A. (1978). Two examples of strictly convex non-universal loss functions. Preprint 133, Institute of Mathematics, Polish Academy of Sciences.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive esti-
mation for semiparametric models. Baltimore: Johns Hopkins University Press (1998 Springer
Paperback).
Birnbaum, A. (1961). A unified theory of estimation, I. The Annals of Mathematical Statistics, 32,
112–135.
Birnbaum, A. (1964). Median-unbiased estimators. Bulletin of Mathematical Statistics, 11, 25–
34.
Blackwell, D. (1947). Conditional expectation and unbiased sequential estimation. The Annals of
Mathematical Statistics, 18, 105–110.
Bomze, I. M. (1986). Measurable supports, reducible spaces and the structure of the optimal σ -field
in unbiased estimation. Monatshefte für Mathematik, 101, 27–38.
Bomze, I. M. (1990). A functional analytic approach to statistical experiments (Vol. 237). Pitman
research notes in mathematics series. Harlow: Longman.
Borges, R., & Pfanzagl, J. (1963). A characterization of the one-parameter exponential family of
distributions by monotonicity of likelihood ratios. Z. Wahrscheinlichkeitstheorie verw. Gebiete,
2, 111–117.
Brown, L. D., Cohen, A., & Strawderman, W. E. (1976). A complete class theorem for strict
monotone likelihood ratio with applications. The Annals of Statistics, 4, 712–722.
Chapman, D. G., & Robbins, H. (1951). Minimum variance estimation without regularity assump-
tions. The Annals of Mathematical Statistics, 22, 581–586.
Clopper, C. J., & Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the
case of the binomial. Biometrika, 26, 404–413.
Cournot, A. A. (1843). Exposition de la théorie des chances et des probabilités. Paris: Hachette.
Cramér, H. (1946). Mathematical methods of statistics. Princeton: Princeton University Press.
Darmois, G. (1945). Sur les lois limites de la dispersion de certain estimations. Revue de l’Institut
International de Statistique, 13, 9–15.
Eberl, W. (1984). On unbiased estimation with convex loss functions. In E. J. Dudewitz, D. Plachky,
& P. K. Sen (Eds.), Recent results in estimation theory and related topics (pp. 177–192). Statistics
and Decisions, Supplement, Issue 1.
Eberl, W., & Moeschlin, O. (1982). Mathematische Statistik. Berlin: De Gruyter.
Fisher, R. A. (1930). Inverse probability. Mathematical Proceedings of the Cambridge Philosophical
Society, 26, 528–535.
Fisz, M. (1963). Probability theory and mathematical statistics (3rd ed.). (1st ed. 1954 in Polish,
2nd ed. 1958 in Polish and German). New York: Wiley.
Fourier, J. B. J. (1826). Mémoire sur les résultats moyens déduits d’un grand nombre d’observations.
Recherches statistiques sur la ville de Paris et le département de la Seine. Reprinted in Œuvres,
2, 525–545.
Fréchet, M. (1943). Sur l’extension de certaines évaluations statistiques de petits échantillons. Revue
de l’Institut International de Statistique, 11, 182–205.
Gauss, C. F. (1816). Bestimmung der Genauigkeit der Beobachtungen. Carl Friedrich Gauss Werke
4, 109–117, Königliche Gesellschaft der Wissenschaften, Göttingen.
Ghosh, J. K. (1961). On the relation among shortest confidence intervals of different types. Calcutta
Statistical Association Bulletin, 10, 147–152.
Halmos, P. R. (1946). The theory of unbiased estimation. The Annals of Mathematical Statistics,
17, 34–43.
Hammersley, J. M. (1950). On estimating restricted parameters. Journal of the Royal Statistical
Society. Series B, 12, 192–229; Discussion, 230–240.
Heizmann, H.-H. (1989). UMVU-Schätzer und ihre Struktur. Systems in Economics, 112.
Athenäum-Verlag.
Heyer, H. (1973). Mathematische Theorie statistischer Experimente. Hochschultext. Berlin:
Springer.
Heyer, H. (1982). Theory of statistical experiments. Springer series in statistics. Berlin: Springer.
Hodges, J. L, Jr., & Lehmann, E. L. (1950). Some problems in minimax point estimation. The
Annals of Mathematical Statistics, 21, 182–197.
Hotelling, H. (1931). The generalization of Student’s ratio. The Annals of Mathematical Statistics,
2, 360–378.
Kolmogorov, A. N. (1950). Unbiased estimators. Izvestiya Akademii Nauk S.S.S.R. Seriya Matem-
aticheskaya, 14, 303–326. Translation: pp. 369–394 in: Selected works of A.N. Kolmogorov (Vol.
II). Probability and mathematical statistics. A. N. Shiryaev (ed.), Mathematics and its applications
(Soviet series) (Vol. 26). Dordrecht: Kluwer Academic (1992).
Kozek, A. (1977). On the theory of estimation with convex loss functions. In R. Bartoszynski,
E. Fidelis, & W. Klonecki (Eds.), Proceedings of the Symposium to Honour Jerzy Neyman (pp.
177–202). Warszawa: PWN-Polish Scientific Publishers.
Kozek, A. (1980). On two necessary and sufficient σ -fields and on universal loss functions. Prob-
ability and Mathematical Statistics, 1, 29–47.
Krámli, A. (1967). A remark to a paper of L. Schmetterer. Studia Scientiarum Mathematicarum
Hungarica, 2, 159–161.
Laplace, P. S. (1812). Théorie analytique des probabilités. Paris: Courcier.
Lehmann, E. L. (1959). Testing statistical hypotheses. New York: Wiley.
Lehmann, E. L., & Casella, G. (1998). Theory of point estimation (2nd ed.). Berlin: Springer.
Lehmann, E. L., & Scheffé, H. (1950, 1955, 1956). Completeness, similar regions and unbiased
estimation. Sankhyā, 10, 305–340; 15, 219–236; Correction, 17 250.
Lexis, W. (1875). Einleitung in die Theorie der Bevölkerungsstatistik. Straßburg: Trübner.
Linnik, Yu V, & Rukhin, A. L. (1971). Convex loss functions in the theory of unbiased estimation.
Soviet Mathematics Doklady, 12, 839–842.
Lumel’skii, Ya. P., & Sapozhnikov, P. N., (1969). Unbiased estimates of density functions. Theory
of Probability and Its Applications, 14, 357–364.
Müller-Funk, U., Pukelsheim, F., & Witting, H. (1989). On the attainment of the Cramér-Rao bound
in L r -differentiable families of distributions. The Annals of Statistics, 17, 1742–1748.
Neyman, J. (1934). On the two different aspects of the representative method: The method of
stratified sampling and the method of purposive selection. The Journal of the Royal Statistical
Society, 97, 558–625.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of
probability. Philosophical Transactions of the Royal Society of London. Series A, 236, 333–380.
Neyman, J. (1938). L’estimation statistique traitée comme un problème classique de probabilité. Actualités Sci. Indust., 739, 25–57.
Neyman, J. (1941). Fiducial argument and the theory of confidence intervals. Biometrika, 32, 128–
150.
Padmanabhan, A. R. (1970). Some results on minimum variance unbiased estimation. Sankhyā Ser.
A, 32, 107–114.
Pfanzagl, J. (1970). Median-unbiased estimators for M.L.R.-families. Metrika, 15, 30–39.
Pfanzagl, J. (1972). On median-unbiased estimates. Metrika, 18, 154–173.
Pfanzagl, J. (1979). On optimal median-unbiased estimators in the presence of nuisance parameters.
The Annals of Statistics, 7, 187–193.
Pfanzagl, J. (1993). Sequences of optimal unbiased estimators need not be asymptotically optimal.
Scandinavian Journal of Statistics, 20, 73–76.
Pfanzagl, J. (1994). Parametric statistical theory. Berlin: De Gruyter.
Polfeldt, T. (1970). Asymptotic results in non-regular estimation. Skand. Aktuarietidskr. Supplement
1–2.
Portnoy, S. (1977). Asymptotic efficiency of minimum variance unbiased estimators. The Annals
of Statistics, 5, 522–529.
Pratt, J. W. (1961). Length of confidence intervals. Journal of the American Statistical Association,
56, 549–567.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters.
Bulletin of Calcutta Mathematical Society, 37, 81–91.
Rao, C. R. (1947). Minimum variance estimation of several parameters. Proceedings of the Cam-
bridge Philosophical Society, 43, 280–283.
Rao, C. R. (1949). Sufficient statistics and minimum variance estimates. Proceedings of the Cam-
bridge Philosophical Society, 45, 213–218.
Rao, C. R. (1952). Some theorems on minimum variance estimation. Sankhyā, 12, 27–42.
Savage, L. J. (1954). The foundations of statistics. New York: Wiley.
Schmetterer, L. (1960). On unbiased estimation. The Annals of Mathematical Statistics, 31, 1154–
1163.
Schmetterer, L. (1966). On the asymptotic efficiency of estimates. In Research Papers in Statistics.
Festschrift for J. Neyman & F. N. David (Eds.) (pp. 301–317). New York: Wiley.
Schmetterer, L. (1974). Introduction to mathematical statistics. Translation of the 2nd German
edition 1966. Berlin: Springer.
Schmetterer, L. (1977). Einige Resultate aus der Theorie erwartungstreuer Schätzungen. In Trans-
actions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions,
Random processes (Vol. B, pp. 489–503).
Schmetterer, L., & Strasser, H. (1974). Zur Theorie der erwartungstreuen Schätzungen. Anzeiger,
Österreichische Akademie der Wissenschaften, Mathematisch-Naturwissenschaftliche Klasse,
6, 59–66.
Seheult, A. H., & Quesenberry, C. P. (1971). On unbiased estimation of density functions. The
Annals of Mathematical Statistics, 42, 1434–1438.
Shao, J. (2003). Mathematical statistics (2nd ed.). Berlin: Springer.
Simons, G., & Woodroofe, M. (1983). The Cramér-Rao inequality holds almost everywhere. In M.
H. Rizvi, J. S. Rustagi, & D. Siegmund (Eds.), Papers in Honor of Herman Chernoff (pp. 69–83).
New York: Academic Press.
Stein, Ch. (1950). Unbiased estimates of minimum variance. The Annals of Mathematical Statistics,
21, 406–415.
Strasser, H. (1972). Sufficiency and unbiased estimation. Metrika, 19, 98–114.
Strasser, H. (1985). Mathematical theory of statistics. Berlin: De Gruyter.
Stuart, A., & Ord, K. (1991). Kendall’s advanced theory of statistics. Classical inference and
relationship (5th ed., Vol. 2). London: Edward Arnold.
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal
of the American Statistical Association, 22, 209–212.
Witting, H. (1985). Mathematische Statistik I. Parametrische Verfahren bei festem Stichprobenumfang. Teubner.
Witting, H., & Müller-Funk, U. (1995). Mathematische Statistik II. Asymptotische Statistik: Para-
metrische Modelle und nichtparametrische Funktionale. Teubner.
Working, H., & Hotelling, H. (1929). Applications of the theory of error to the interpretation of
trends. Journal of the American Statistical Association, 24(165A), 73–85.
Chapter 5
Asymptotic Optimality of Estimators
5.1 Introduction
for a fixed sample size n₀. Assertions about “what might happen” if the observations
were continued from x₁, …, x_{n₀} to x_{n₀+1}, … are not of primary interest.
The concept of consistency (as opposed to strong consistency) was easy to deal
with using n-fold products of Lebesgue densities. Dealing with convergence P^ℕ-a.e.
in a mathematically precise way was impossible prior to Kolmogorov’s trailblazing
book (1933). It appears that R.A. Fisher’s unwillingness to handle a mathematically
more complex concept like “strong consistency” lies at the root of his deviating
concept of consistency.
A statistic …is a consistent estimate of any parameter, if when calculated from an indefinitely
large sample it tends to be accurately equal to that parameter.
This definition appeared in Fisher (1925, p. 702), and still held firm in Fisher (1959,
pp. 143–146).
Consistent estimator sequences can be used as a starting point for the construction
of optimal estimator sequences. Consistency of the ML-sequences is needed to obtain
their limit distribution. Some authors require that one choose a consistent solution
from among the several solutions of the likelihood equations. Yet consistency is a
property of an estimator sequence, not of a particular estimator.
Rules that guarantee the consistency of an estimator sequence are of questionable
significance if they refer to sample sizes that will never occur. From the methodological
point of view, the situation is even worse in the case of Cramér’s Consistency
Theorem (see Sect. 5.2). Cramér’s Consistency Theorem is valid only if the likelihood
equation has, for every sample size and every (!) sample (x₁, …, xₙ), one solution
only. There is no theorem based on properties of the basic family {P_ϑ : ϑ ∈ Θ}
that guarantees that these conditions are fulfilled. (The same mistake occurs in
Schmetterer 1966, p. 373, Satz 3.5.)
How could a property (like consistency) that lacks operational significance contribute
to the proof of an operationally significant assertion (like the asymptotic
distribution of the ML-sequence)? An asymptotic assertion about the estimator κ⁽ⁿ⁾
should be based on properties for the sample size n only. That N(0, σ²(ϑ)) is the limit
distribution of the estimator sequence ϑ⁽ⁿ⁾, n ∈ ℕ, can be turned into the operational
assertion that the true distribution $P_\vartheta^{n_0} \circ n_0^{1/2}(\vartheta^{(n_0)} - \vartheta)$ can be approximated
by N(0, σ²(ϑ)). Such an assertion should be based on the observations (x₁, …, x_{n₀})
and not on properties of ϑ⁽ⁿ⁾ for samples (x₁, …, xₙ) that will never occur.
Limit Distributions
Consistency is just an ancillary concept. What makes asymptotic theory useful are
assertions about the (weak) convergence of the standardized estimator sequences to
a limit distribution, for example
$$Q_P^{(n)} := P^n \circ n^{1/2}(\kappa^{(n)} - \kappa(P)) \Rightarrow Q_P.$$
For parametric families, the main result on the asymptotic performance of ML-sequences
is usually stated as follows¹ (see Lehmann and Casella 1998, p. 447,
Theorem 6.3.7; Witting and Müller-Funk 1995, p. 202, Satz 6.3.5): Every consistent
sequence of ML estimators, suitably standardized, converges weakly to N(0, Λ(ϑ)).
Even when supplemented by conditions which guarantee the consistency of (every? some?)
ML-sequence, this assertion is different from what the statistician would like to know:
that N(0, Λ(ϑ)) is a good approximation to the distribution of the ML estimator under
$P_\vartheta^n$ when n is large. The latter property has nothing to do with the performance of
the ML-sequence as the sample size tends to infinity.
The condition that ϑ (n) should belong to a consistent sequence is meaningless.
To give operational significance to such a mathematical assertion, one must, again,
keep error bounds in one’s mind. Weak convergence to a limit distribution should
always include the connotation of error bounds. In ideal situations, there exists K > 0
such that
$$|Q_P^{(n)}(I) - Q_P(I)| \le K n^{-1/2} \quad \text{for every } P \in \mathcal{P}, \text{ every interval } I \text{ and every } n \in \mathbb{N}. \tag{5.1.1}$$
Even if such an assertion can be proved for a certain class of estimator sequences
(say ML), it will, in general, not be useful for practical purposes, unless K is given
explicitly and is not very large. The proof of an assertion like (5.1.1) for a general class
of estimators, say ML estimators, requires several intermediate steps which, taken
together, lead to a constant K which is far too large. In general, practically useful
information about the accuracy of $Q_P(I)$ as an estimate of $P^n\{n^{1/2}(\kappa^{(n)} - \kappa(P)) \in I\}$
can only be obtained using simulations. Refinements of the normal approximation
by means of asymptotic expansions are more useful than bounds for the error of the
normal approximation.
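To make this concrete, the following minimal simulation sketch (ours, not the author’s; the exponential model, the interval I, the sample size and the seed are arbitrary illustrative choices) estimates $P^n\{n^{1/2}(\kappa^{(n)} - \kappa(P)) \in I\}$ by Monte Carlo and compares it with the normal approximation:

```python
# A hedged sketch: Monte Carlo assessment of the normal approximation Q_P(I)
# for the sample mean of Exp(1), where kappa(P) = 1 and sigma(P) = 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 100_000
z = np.sqrt(n) * (rng.exponential(1.0, size=(reps, n)).mean(axis=1) - 1.0)

I = (-1.0, 1.0)
empirical = np.mean((z > I[0]) & (z < I[1]))          # simulated P^n{...}
normal = stats.norm.cdf(I[1]) - stats.norm.cdf(I[0])  # Q_P(I)
print(empirical, normal, abs(empirical - normal))     # error of order n**-0.5
```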
Basing the comparison of estimator sequences on the standardized version of the
limit distribution is generally accepted. One might question, of course, whether the
comparison of standardized multivariate estimators according to their concentration
on symmetric convex sets is more informative than the component-wise comparison
on intervals containing 0. But one might have reservations about using joint (limit)
distributions as the appropriate concept for judging the quality of estimator sequences
if the standardizing factors $c_n$ are different for different components.
Uniformity on P
For two different probability measures P₀ and P₁, the sup-distance between $P_0^n$
and $P_1^n$ tends to 1 as n tends to infinity. Hence the asymptotic performance of
$P_0^n \circ n^{1/2}(\kappa^{(n)} - \kappa(P_0))$, n ∈ ℕ, is unrelated to the asymptotic performance of
$P_1^n \circ n^{1/2}(\kappa^{(n)} - \kappa(P_1))$, n ∈ ℕ. One might change the asymptotic performance of a
given estimator sequence at P₀ without affecting its asymptotic performance for any
P ≠ P₀. This makes it impossible to speak of a limit distribution which is optimal
for every P ∈ P, unless one restricts the consideration to estimator sequences which
¹ A formulation of this result giving full attention to all details as in Schmetterer (1966, p. 388, Satz …), with
$$Q_P^{(n)} = P^n \circ n^{1/2}(\kappa^{(n)} - \kappa(P)),$$
The Prehistory
In 1912 (Fisher 1912), a self-confident undergraduate, R.A. Fisher, believed he had
discovered a new, “absolute” method of estimation, Maximum Likelihood (ML).
Surprisingly, his paper was published at a time when this “new” method had been
known under various names and with varying justifications for 150 years. That the
basic idea already occurs in Lambert (1760) was only discovered by Sheynin (1966).
But it was well known that Bernoulli (1778) had suggested as an estimator the parameter
value (the location in his case) “... qui maxima gaudet probabilitate” (“which enjoys
the greatest probability”; Todhunter 1865, pp. 236/7). In the 19th century, the method was well known to
astronomers (Encke 1832–34), could be found in various textbooks (Czuber 1891),
and was used by Pearson (1896) [pp. 262–265] to determine the sample correlation
coefficient as an estimator of the population correlation coefficient. The basic idea of
maximum likelihood is so obvious that it was “discovered” once again by Zermelo
(1929), a mathematician without any ties to statistics. Though some of the authors
came to the ML idea via a Bayesian approach using a uniform prior (see Hald 1999,
on this particular point), the opinion was widespread that it was the performance of
the estimators that counted, not the “metaphysical principle” on which the estimator
was grounded. As an example we can cite Hald (1998) [p. 499] :
Helmert (1876) borrowed the idea from inverse probability that the “best” estimate is obtained
by maximizing the probability density of the sample and having found the estimate, he
evaluated its properties by studying its sampling distribution.
Fisher (1912) [p. 155] claims that his new method leads to optimal estimators. It
appears that Fisher was at this time convinced of optimum properties even for finite
samples. The further development of his thought over the next decade is carefully
examined by Stigler (2005).
At the beginning of the 20th century, the time was ripe for a general theorem on
the asymptotic distribution of ML-sequences. The result (for an arbitrary number of
parameters) is basically contained in the paper by Pearson and Filon (1898), written
in the spirit of “inverse probability”, and with the (misguided) interpretation that their
result refers to moment estimators. Fisher (1922) [p. 329, footnote] heavily criticizes
these points. If he had been familiar with the literature of his time he could have cited
the papers by Edgeworth (1908/1909), which appeared not in some obscure journal,
but in the Journal of the Royal Statistical Society. In these papers, Edgeworth puts
forward a correct interpretation of the results of Pearson and Filon, and he shows that,
in a variety of examples, the asymptotic variance of the ML-sequence is not larger
than that of other estimator sequences:
1. For symmetric densities, the asymptotic variance of the ML-sequence is at least as small as that of the sample mean and the sample median.
2. For Pearson’s Type III distributions, the ML-sequence is at least as good as the moment estimators.
3. Among the estimating equations $\sum_{\nu=1}^n g(x_\nu - \vartheta) = 0$, the one with $g(x) = p'(x)/p(x)$ (corresponding to the likelihood equation for a shift parameter) yields the estimator sequence with the smallest asymptotic variance.
Instead of welcoming the results of Edgeworth (after they had become known to
him) as supporting his claim of the superiority of the ML method, Fisher continued to
quarrel about the distinction between “maximum likelihood” and “maximal inverse
probability”, though Edgeworth had stated explicitly (1908/1909, p. 82) that his
results are “free from the speculative character which attaches to inverse probability”.
Is it legitimate to refuse giving mathematical results proper recognition with the
argument that they are based on the wrong “philosophy”?
The relations between the results of Pearson and Filon (1898), Edgeworth
(1908/1909) and Fisher are discussed in detail by a number of competent schol-
ars, like Edwards (1974, 1992), Pratt (1976), Savage (1976), Hald (1998, Chap. 28)
and (most interesting) Stigler (2007).
Even though the ML principle (in one or the other version) was familiar, general
results about the asymptotic distribution of the ML-sequence were late to arrive. It
might be a promising project to find out how scientists determined the accuracy of
their ML estimators before such general results were available. As an example, we
can mention a paper by Schrödinger (1915) on measurements with particles subject
to Brownian motion. On p. 290, Schrödinger writes (in translation):
The task is now to compute from a series of observations ... those values [of the two parameters] which are made the most probable ones by this series of observations; and further to compute the errors which probably, or on the average, attach to the two values.
What Schrödinger (Sect. 6, pp. 294/5) does to determine the accuracy of his ML
estimators was, probably, the common practice among physicists at this time: With an
explicit formula for the ML estimator ϑ⁽ⁿ⁾ at hand, he computes
$$\sigma_n^2(\vartheta) := \int (\vartheta^{(n)} - \vartheta)^2\, dP_\vartheta^n,$$
and bases a measure of accuracy on σₙ(ϑ⁽ⁿ⁾(x)). In the particular case dealt
with by Schrödinger, σₙ(ϑ) is asymptotically equivalent to $n^{-1/2}\sigma(\vartheta)$.
In the following sections we deal with various proofs of consistency and asymp-
totic normality of the ML-sequences. At the technical level of Fisher’s writings,
consistency had no relevance. (Fisher’s concept of “consistency” meant something
else.) His derivation of the asymptotic variance of the ML-sequence (1922, pp. 328/9)
is intelligible only for readers familiar with Fisher’s style. The emphasis of Fisher’s
papers (1922, pp. 330–32, 1925, p. 707) is on the asymptotic optimality of the
ML-sequence. His intention is to show that the asymptotic variance of asymptotically
normal estimator sequences cannot be smaller than $\big(\int \ell^{\bullet}(x,\vartheta)^2\, P_\vartheta(dx)\big)^{-1}$.
Fisher’s efforts to support his claim to the superiority of the ML method by means
of mathematical arguments motivated mathematicians and statisticians with a back-
ground in probability theory to turn conjectures and assertions about ML-sequences
into mathematical theorems. The results will be discussed in the following sections.
What, in the end, is Fisher’s contribution to the theory of ML estimators? It is the
concept of “maximum likelihood”, which he introduced in (1922, p. 326).
When the asymptotic theory of ML-sequences was dealt with in a mathematically
precise way, it soon became clear that consistency is more difficult to establish
than the asymptotic distribution, given consistency. Occasionally, consistency
can be proved directly, if the estimators are given by explicit formulas. Hence it
is preferable to present the asymptotic theory of ML-sequences in two parts: (a) a
consistency theorem, (b) a theorem about the asymptotic distribution of consistent
ML-sequences.
Remark Assertions about the distribution of $n^{1/2}(\vartheta^{(n)} - \vartheta)$ under $P_\vartheta^n$ are possible
only if ϑ⁽ⁿ⁾(xₙ) is a measurable function of xₙ. If ϑ⁽ⁿ⁾(xₙ) is defined by
$$\prod_{\nu=1}^n p(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) = \sup_{\vartheta\in\Theta} \prod_{\nu=1}^n p(x_\nu, \vartheta),$$
say, the measurability of ϑ⁽ⁿ⁾ can be guaranteed by some kind of “Measurable Selection
Theorem”. An early example of such a theorem occurs in Schmetterer (1966)
[p. 375, Lemma 3.3], who seems to have been unaware of earlier selection theorems in various
connections. Now measurable selection theorems occur occasionally in advanced
textbooks on mathematical statistics (Strasser 1985, p. 34, Theorem 6.10.1; Pfanzagl
1994, p. 21, Theorem 6.7.22; Witting and Müller-Funk 1995, p. 173, Hilfssatz 6.7). A
paper on the measurability of ML estimators far off the mainstream is Reiss (1973).
Consistency of ML-Sequences
The consistency of an estimator sequence is just an intermediate step on the way to
its asymptotic distribution. Nevertheless, the literature on consistency is immense.
The following considerations are restricted to weak consistency of ML-sequences for
one-parameter families. The generalizations to strong consistency, perhaps locally
uniform, and to multi-parameter families are straightforward if the arguments for the
simplest case are understood.
$$\prod_{\nu=1}^n p(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) \ge \prod_{\nu=1}^n p(x_\nu, \vartheta) \quad \text{for } \vartheta \in \Theta.$$
When Θ ⊂ ℝᵏ and the supremum is attained at an interior point, ϑ⁽ⁿ⁾(xₙ) is necessarily a solution of the
likelihood equation
$$\sum_{\nu=1}^n \ell^{(i)}(x_\nu, \vartheta) = 0, \quad i = 1, \ldots, k.$$
It is possible that solutions of the estimating equation give reasonable results although
the maximizer of the likelihood function does not; see Kraft and Le Cam (1956).
Notice the basic distinction: To prove consistency based on the original definition,
only topological conditions (on Θ and the function ϑ → p(x, ϑ)) are required.
Obtaining consistency for solutions of the likelihood equation not only requires
ϑ → p(x, ϑ) to be differentiable; it leads to problems if the likelihood equation
has several solutions. (Recall the result of Reeds 1985, that for the location family
of Cauchy distributions, kₙ(xₙ) − 1 is asymptotically distributed according to the
Poisson distribution with parameter 1/π, where kₙ(xₙ) is the number of solutions of
the likelihood equation.)
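Reeds’ result is easy to visualize numerically. The following sketch (ours; sample size, grid resolution and seed are arbitrary, and counting sign changes on a grid may miss very close pairs of roots) counts the roots of the Cauchy likelihood equation:

```python
# A hedged sketch: number of roots k_n of the Cauchy location likelihood equation
# sum 2(x_v - t)/(1 + (x_v - t)^2) = 0; by Reeds (1985), k_n - 1 => Poisson(1/pi).
import numpy as np

rng = np.random.default_rng(1)

def num_roots(x, pts=50_001):
    grid = np.linspace(x.min() - 1.0, x.max() + 1.0, pts)  # all roots lie in this range
    d = x[:, None] - grid[None, :]
    score = (2 * d / (1 + d ** 2)).sum(axis=0)
    return int(np.sum(np.sign(score[:-1]) != np.sign(score[1:])))

n, reps = 50, 500
counts = np.array([num_roots(rng.standard_cauchy(n)) for _ in range(reps)])
print((counts - 1).mean(), 1 / np.pi)   # both should be close to 0.318...
```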
It does not make sense to state that among the roots of the estimating equation
for the sample size n₀ there is (at least) one which is an element of a consistent
estimator sequence. To say that ϑ^{(n₀)} is an element of a consistent estimator sequence
is meaningless anyway. What is needed is a guide as to which root to choose.
Some Technical Questions
As soon as mathematically oriented statisticians entered the scene, they discovered a
number of problems which nobody had thought about before. The most elementary
one: Assertions about consistency and asymptotic normality require the estimators to
be measurable. A key step in the proofs is the convergence
$$n^{-1}\sum_{\nu=1}^n f(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) \to \int f(\cdot, \vartheta_0)\, dP_{\vartheta_0} \quad (P_{\vartheta_0}^n) \tag{5.2.1}$$
if $\vartheta^{(n)} \to \vartheta_0\ (P_{\vartheta_0}^n)$. Not all authors had the right idea about which conditions on f,
going beyond continuity of ϑ → f(x, ϑ) for every x ∈ X, are needed for this step:
Continuity of ϑ → f(x, ϑ) uniformly in x, as required by Dugué (1936, 1937),
is much too strong. Let Θ ⊂ ℝ and $f^{\bullet}(x, \vartheta) = \partial_\vartheta f(x, \vartheta)$. In their “fundamental”
Lemma 2.1, Gong and Samaniego (1981, pp. 862/3) require, in addition to
$\int |f(\cdot, \vartheta_0)|\, dP_{\vartheta_0} < \infty$, a condition corresponding to $\int \sup_{\vartheta\in U} |f^{\bullet}(\cdot, \vartheta)|\, dP_{\vartheta_0} < \infty$
for an open subset U containing ϑ₀. In fact, the condition …
suffices. Since ϑ → f(x, ϑ) is continuous, this implies for every ε > 0 the existence
of $U_\varepsilon \ni \vartheta_0$ such that …
$$n^{-1}\sum_{\nu=1}^n f(x_\nu, \vartheta) = n^{-1}\sum_{\nu=1}^n f(x_\nu, \vartheta_0) + (\vartheta - \vartheta_0)\, n^{-1}\sum_{\nu=1}^n f^{\bullet}(x_\nu, \vartheta_0) + \frac{1}{2}(\vartheta - \vartheta_0)^2\, n^{-1}\sum_{\nu=1}^n f^{\bullet\bullet}(x_\nu, \bar\vartheta_n(\mathbf{x}_n)) \tag{5.2.4}$$
with ϑ̄ₙ(xₙ) between ϑ₀ and ϑ. However, to know the latter is not enough. Since
ϑ̄ₙ(xₙ) depends on ϑ₀ as well as on ϑ, one runs into trouble if (5.2.4) is applied
with ϑ replaced by some ϑ⁽ⁿ⁾(xₙ). For this purpose, the expansion
$$n^{-1}\sum_{\nu=1}^n f(x_\nu, \vartheta) = n^{-1}\sum_{\nu=1}^n f(x_\nu, \vartheta_0) + (\vartheta - \vartheta_0)\, n^{-1}\sum_{\nu=1}^n \int_0^1 f^{\bullet}(x_\nu, (1-u)\vartheta_0 + u\vartheta)\, du \tag{5.2.5}$$
provided $\int \ell(x, V)\, P_0(dx) < \infty$ for some V. Let $X_V \subset X^{\mathbb{N}}$ denote the set of all
$\mathbf{x} \in X^{\mathbb{N}}$ such that
$$\frac{1}{n}\sum_{\nu=1}^n \big(\ell(x_\nu, V) - \ell(x_\nu, \vartheta_0)\big) \ge 0 \quad \text{for infinitely many } n \in \mathbb{N}.$$
By the Strong Law of Large Numbers, relation (5.2.6) implies that $P_{\vartheta_0}^{\mathbb{N}}(X_V) = 0$.
For n ∈ ℕ, let ϑ⁽ⁿ⁾(x) be such that $\sum_{\nu=1}^n \ell(x_\nu, \vartheta^{(n)}(\mathbf{x})) \ge \sum_{\nu=1}^n \ell(x_\nu, \vartheta)$ for ϑ ∈ Θ.
Then ϑ⁽ⁿ⁾(x) ∈ V implies
$$n^{-1}\sum_{\nu=1}^n \ell(x_\nu, \vartheta_0) \le n^{-1}\sum_{\nu=1}^n \ell(x_\nu, \vartheta^{(n)}(\mathbf{x})) \le n^{-1}\sum_{\nu=1}^n \ell(x_\nu, V).$$
Hence the set of all x ∈ X^ℕ such that ϑ⁽ⁿ⁾(x) ∈ V for infinitely many n ∈ ℕ is a
subset of the $P_{\vartheta_0}^{\mathbb{N}}$-null set $X_V$. The immediate consequence: the ML estimator ϑ⁽ⁿ⁾(x)
cannot lie in a set V fulfilling (5.2.6) for infinitely many n (except for x in a set
of $P_{\vartheta_0}^{\mathbb{N}}$-measure zero). In particular: (ϑ⁽ⁿ⁾(x))_{n∈ℕ} cannot converge to some ϑ ≠ ϑ₀
except for x in a set of $P_{\vartheta_0}^{\mathbb{N}}$-measure zero.
If Θ is compact, the set {ϑ ∈ Θ : |ϑ − ϑ₀| ≥ ε} is compact and can, therefore,
be covered by a finite number of sets $V_{\vartheta_i}$, i = 1, …, m, fulfilling (5.2.6). Hence for
$P_{\vartheta_0}^{\mathbb{N}}$-a.a. x ∈ X^ℕ, the relation |ϑ⁽ⁿ⁾(x) − ϑ₀| ≥ ε holds for finitely many n only. That
means: ϑ⁽ⁿ⁾(x), n ∈ ℕ, converges to ϑ₀ except for x in a set of $P_{\vartheta_0}^{\mathbb{N}}$-measure 0.
This is Wald’s Consistency Theorem, the earliest consistency theorem which is
mathematically correct. It is widely ignored because it requires Θ to be compact.
Various examples show that the compactness condition cannot be omitted without
further ado. The compactness condition, however, may be replaced by the condition
that for every neighbourhood U0 of ϑ0 , the set U0c can be covered by a finite number
of sets $V_i$ fulfilling condition (5.2.6). (See e.g. the Covering Condition 6.3.8 in
Pfanzagl 1994, p. 194, or condition 6.1.57 in Witting and Müller-Funk 1995, p. 201, Satz
6.34.) Schmetterer (1966) [Satz 3.8, p. 384] is the first textbook presenting a (some-
what confused) version of Wald’s strong consistency theorem. Schmetterer does not
mention the compactness condition explicitly, but all his considerations (including
the definition of the ML estimator) are restricted to some compact subset of Θ (called
Γ0 ) which restricts the assertion of consistency to compact subsets of Θ.
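The mechanism of Wald’s theorem is easily illustrated numerically. The following sketch (ours; the N(ϑ, 1) model, the compact set Θ = [−5, 5], the grid and the seed are arbitrary choices) maximizes the average log-likelihood over a compact parameter set and shows the maximizer settling at ϑ₀:

```python
# A hedged sketch: ML over the compact set Theta = [-5, 5] for the N(theta, 1)
# model; the average log-likelihood is computed from the sufficient statistics.
import numpy as np

rng = np.random.default_rng(2)
theta0 = 1.5
x_all = theta0 + rng.normal(size=100_000)
grid = np.linspace(-5.0, 5.0, 2001)

def ml_on_grid(x):
    m1, m2 = x.mean(), (x ** 2).mean()
    avg_ll = -0.5 * (m2 - 2.0 * grid * m1 + grid ** 2)  # up to an additive constant
    return grid[np.argmax(avg_ll)]

for n in (100, 1_000, 10_000, 100_000):
    print(n, ml_on_grid(x_all[:n]))   # approaches theta0 = 1.5 as n grows
```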
Kendall and Stuart (1961) [pp. 39–41] think they can do better than Wald:
This direct proof of consistency is a simplified form of Wald’s proof. Its generality is clear
from the absence of any regularity conditions on the distribution [ p(x, ϑ)].
This is supplemented by their Exercise 18.35, p. 74: “... show that many inconsistent
ML estimators, as well as consistent ML estimators, exist”. The proof in Kendall and
Stuart (see p. 40) is based on their relations (18.20),
$$\lim_{n\to\infty} P_{\vartheta_0}^n\Big\{\sum_{\nu=1}^n \ell(x_\nu, \vartheta) < \sum_{\nu=1}^n \ell(x_\nu, \vartheta_0)\Big\} = 1 \quad \text{for every } \vartheta \ne \vartheta_0,$$
and (18.21),
$$\sum_{\nu=1}^n \ell(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) \ge \sum_{\nu=1}^n \ell(x_\nu, \vartheta_0) \quad \text{for } n \in \mathbb{N} \text{ and } \mathbf{x}_n \in \mathbb{R}^n,$$
“since, by (18.20), (18.21) only holds with probability zero for any ϑ ≠ ϑ₀ [?] it
follows that $\lim_{n\to\infty} P_{\vartheta_0}^{\mathbb{N}}\{\vartheta^{(n)} = \vartheta_0\} = 1$.”
The idea that, because of (18.20) and (18.21), the sequence ϑ (n) , n ∈ N, cannot
stay away from ϑ0 with positive probability seems to be obvious, but is not really true.
To make it precise one has to consider the performance of the estimator sequence
ϑ (n) (x), n ∈ N, on x ∈ X N , as done by Wald, an idea which was not understood by
the authors.
If the argument of Kendall and Stuart might have been acceptable in the thirties,
its weakness must have been clear to anybody who had understood Wald’s paper
from (1949). Thirty years later (in “Kendall’s Advanced Theory of Statistics” by
Stuart and Ord 1991) not much had been changed.
From (18.20) and (18.21) the authors now argue as follows: “as n → ∞,
$\sum_{\nu=1}^n \ell(x_\nu, \hat\vartheta^{(n)}(\mathbf{x}_n))$ cannot take any other value than $\sum_{\nu=1}^n \ell(x_\nu, \vartheta_0)$.
If $\sum_{\nu=1}^n \ell(x_\nu, \vartheta)$ is identifiable [?] this implies that
$P_{\vartheta_0}^{\mathbb{N}}\{\lim_{n\to\infty} \vartheta^{(n)} = \vartheta_0\} = 1$”.
A second approach to consistency, using the existence of $\ell^{\bullet\bullet}$ (see Kendall and
Stuart 1961, p. 41, and Stuart and Ord 1991, p. 660), follows Cramér’s lead, described below, both
in the proof and in its misinterpretation.
Cramér’s Consistency Proof
Even now, most textbooks refer to Cramér (1946a) as the author of the first consis-
tency proof. In fact, the basic idea is of tempting simplicity. A concise version of the
proof starts with the expansion
$$n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta) = n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta_0) + (\vartheta - \vartheta_0)\, n^{-1}\sum_{\nu=1}^n k(x_\nu, \vartheta_0, \vartheta) \tag{5.2.7}$$
with
$$k(x, \vartheta_0, \vartheta) = \int_0^1 \ell^{\bullet\bullet}(x, (1-u)\vartheta_0 + u\vartheta)\, du.$$
If $\vartheta \to \int \ell^{\bullet\bullet}(x, \vartheta)\, P_{\vartheta_0}(dx)$ is continuous at ϑ = ϑ₀, then $\vartheta \to \int k(x, \vartheta_0, \vartheta)\, P_{\vartheta_0}(dx)$
is continuous at ϑ = ϑ₀. Since $\int k(x, \vartheta_0, \vartheta_0)\, P_{\vartheta_0}(dx) = \int \ell^{\bullet\bullet}(x, \vartheta_0)\, P_{\vartheta_0}(dx) < 0$,
it follows that $n^{-1}\sum_{\nu=1}^n k(x_\nu, \vartheta_0, \vartheta)$ converges in $P_{\vartheta_0}^{\mathbb{N}}$-measure to a negative value if
ϑ is sufficiently close to ϑ₀. Together with $n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta_0) \to 0\ (P_{\vartheta_0})$, relation
(5.2.8) follows:
$$P_{\vartheta_0}^n\Big\{n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta_0 + \delta) < 0 < n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta_0 - \delta)\Big\} \to 1. \tag{5.2.8}$$
Since $\vartheta \to n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta)$ is continuous in a neighbourhood of ϑ₀, there is
a solution of $\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta) = 0$ in (ϑ₀ − δ, ϑ₀ + δ). In other words: For every
sufficiently small δ, the $P_{\vartheta_0}^{\mathbb{N}}$-probability that the likelihood equation has a solution in
the interval (ϑ₀ − δ, ϑ₀ + δ) converges to 1.
It is surprising to see how a mathematician of Cramér’s stature deals with this
approach (see p. 501). For the proof of (5.2.8) he requires the existence of functions
$H_i$, i = 1, 2, 3, such that $|\ell^{(i)}(x, \vartheta)| \le H_i(x)$ for x ∈ X and ϑ ∈ Θ. When Cramér says
that $H_i$, i = 1, 2, are “integrable over (−∞, +∞)”, he means $P_\vartheta$-integrable (for all
ϑ ∈ Θ, or for ϑ = ϑ₀?). Local versions of these conditions would suffice, and even
for Cramér’s crude proof, based on the expansion
$$\ell^{\bullet}(x, \vartheta) = \ell^{\bullet}(x, \vartheta_0) + (\vartheta - \vartheta_0)\,\ell^{\bullet\bullet}(x, \vartheta_0) + \frac{1}{2}\delta(\vartheta - \vartheta_0)^2 H_3(x) \quad \text{with } |\delta| < 1,$$
his condition $\sup_{\vartheta\in\Theta} \int H_3(x)\, P_\vartheta(dx) < \infty$ could be replaced by $\int H_3(x)\, P_{\vartheta_0}(dx) < \infty$.
Cramér did not take much advantage of what other authors had done before
him. Wald’s (1941a, b) papers were, perhaps, not accessible to Cramér when he wrote
the first version of his book, but he retained his unsatisfactory global regularity
conditions in all subsequent editions. More surprising than the technical infelicities
is the misleading interpretation Cramér gives to his result, namely: That there is a
sequence of solutions of the likelihood equation which is consistent for every ϑ ∈ Θ.
To recall: He just proved that for every ϑ0 ∈ Θ and every neighbourhood of ϑ0 , the
Pϑn0 -probability for a solution in this neighbourhood converges to 1 as n tends to
infinity. This does not deliver what Cramér had promised on p. 500: “It will be
shown that, under general conditions, the likelihood equation ... has a solution which
converges in probability to the true value of [ϑ] as n → ∞”. Cramér’s result implies
consistency under an additional condition: That the likelihood equation has one
solution only for every n ∈ N and all (x1 , . . . , xn ). It would certainly be reassuring
if this holds true for the actual sample size n 0 . But what is required is uniqueness
for every sample size, a condition which casts some doubts on the interpretation of
asymptotic results.
Though Cramér’s paralogism soon became known (Wald 1949, p. 595, footnote 3),
the message did not spread quickly. Cramér’s result was presented as a consistency
proof in a number of textbooks. As an example we might mention Schmetterer
(1956), at this time the mathematically most advanced textbook. In Satz 8, pp. 223/4,
Schmetterer asserts the existence of a sequence of solutions of the likelihood equation
which is consistent for every ϑ ∈ Θ, an open rectangle in Rk . His argument is
dubious. After having proved a k-dimensional version of Cramér’s result (following
Chanda 1954), he leaps to the interpretation as a consistency theorem (p. 231).
It is, by the way, not straightforward to transfer Cramér’s argument “the interval
(ϑ0 − δ, ϑ0 + δ) contains a solution of the likelihood equation” from R1 to Rk . An
error in Chanda’s proof was discovered by Tarone and Gruenhage (1975) [p. 903].
Foutz (1977) tries to simplify the technical details of this proof by means of the
Inverse Function Theorem, but he, too, follows Cramér in the misinterpretation of
his result.
In fact, it requires additional steps to transform Cramér’s result into an assertion
about the existence of a consistent sequence. Introducing
$$A_n(\mathbf{x}_n) := \Big\{\vartheta \in \Theta : \sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta) = 0\Big\},$$
Cramér’s result asserts that for every k ∈ ℕ there is $N_k$ such that
$P_{\vartheta_0}^n\{A_n \cap (\vartheta_0 - 1/k, \vartheta_0 + 1/k) \ne \emptyset\} > 1 - 1/k$ for $n \ge N_k$.
W.l.g.: $N_{k+1} > N_k$. With $k_n$ denoting the largest integer k such that $N_k \le n$, this
implies
$$P_{\vartheta_0}^n\{A_n \cap (\vartheta_0 - 1/k_n, \vartheta_0 + 1/k_n) \ne \emptyset\} \to 1.$$
is a continuous function of x, and $\log^+ p_n(\cdot, x)$ is P-integrable. Among his conclusions:
If $\limsup_{n\in\mathbb{N}_0} \int p(x, \vartheta^{(n)}(x))\, dx \le 1$, then $\lim_{n\in\mathbb{N}_0} p(x, \vartheta^{(n)}(x)) = p(x)$ for
x ∈ X₀ and P-a.a. x ∈ X.
This result can, at best, be considered as a lemma from which a consistency theo-
rem could be derived. Notice that it contains no assumptions on the parameter (like
continuity of ϑ → p(x, ϑ)), and that the distinguished probability measure P is
not necessarily a member of the family. Hence it needs conditions on the family
which imply the conditions of Doob’s Theorem 5, and a condition which leads from
p(x, ϑ (n) (x)) → p(x, ϑ0 ) to ϑ (n) (x) → ϑ0 . Doob confines himself to demonstrat-
ing the usefulness of his theorem by showing (see p. 768) that it establishes the
consistency of x n as an estimator for ϑ in the family {N (ϑ, 1) : ϑ ∈ R}.
Since Doob’s theorem needs further specifications in order to arrive at a consistency
theorem for ML-sequences, the question whether it is true in its own framework
appears to be of minor importance. Doubts are in order since in the proof of
Theorem 5, Doob makes use of Theorem 4, p. 766, which is obviously wrong. Doob
claims: “Its proof is simple and will be omitted.” He seems to have overlooked the
fact that the points of discontinuity may depend on the functions g. (See in this
connection also Wald 1949, p. 595, footnote 2.)
Starting in 1936 and 1937, Dugué made several attempts to deliver precise proofs
of the consistency and asymptotic normality of the ML-sequence (Dugué 1936,
1937). According to Le Cam (1953) [p. 279] “... the proofs are not rigorous and
the mistakes are apparent”. In a more transparent version, Dugué’s arguments are
presented (almost unchanged) in 1958, and here the weak points are easier to spot
(Dugué 1958).
From $\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x})) = 0$ for x ∈ X^ℕ, n ∈ ℕ, and $n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta_0) \to 0\ (P_{\vartheta_0}^{\mathbb{N}})$,
Dugué concludes (Theorem I, p. 140/1) that
$$n^{-1}\sum_{\nu=1}^n \big(\ell^{\bullet}(x_\nu, \vartheta_0) - \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}))\big) \to 0 \quad (P_{\vartheta_0}^{\mathbb{N}}).$$
provided |ϑ − ϑ₀| < δ_ε. Without further arguments, Dugué then states that, conversely,
for every ε > 0 there is δ_ε such that
$$\Big|n^{-1}\sum_{\nu=1}^n \big(\ell^{\bullet}(x_\nu, \vartheta) - \ell^{\bullet}(x_\nu, \vartheta_0)\big)\Big| < \delta_\varepsilon \quad \text{implies} \quad |\vartheta - \vartheta_0| < \varepsilon, \tag{5.2.9}$$
provided the function $\vartheta \to n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta)$ is one-one, so that (5.2.9) implies
that $\vartheta^{(n)} \to \vartheta_0\ (P_{\vartheta_0}^{\mathbb{N}})$.
Wilks (1962) seeks to provide more elegant consistency proofs (Theorem 12.3.2,
p. 360 for Θ ⊂ ℝ and 12.7.2, pp. 379/380 for Θ ⊂ ℝᵏ) by using the following auxiliary
result 4.3.8, p. 105: If $f_n(\cdot, \vartheta) \to g(\vartheta)\ (P^{\mathbb{N}})$ uniformly for ϑ in a neighbourhood
of ϑ₀, with g continuous at ϑ = ϑ₀, then $\vartheta^{(n)} \to \vartheta_0\ (P^{\mathbb{N}})$ implies $f_n(\cdot, \vartheta^{(n)}) \to g(\vartheta_0)\ (P^{\mathbb{N}})$.
Wilks’ proof contains the usual mistake: Presented in a more transparent way, he
concludes from
$$P_{\vartheta_0}^n\{|f_n(\cdot, \vartheta) - g(\vartheta)| < \varepsilon\} > 1 - \varepsilon \quad \text{for every } \vartheta \in U_\varepsilon \text{ and } n \ge n_\varepsilon$$
and
$$P_{\vartheta_0}^n\{\vartheta^{(n)} \in U_\varepsilon\} > 1 - \varepsilon \quad \text{for } n \ge n_\varepsilon$$
that $f_n(\cdot, \vartheta^{(n)})$ is close to $g(\vartheta_0)$ with high probability,
a mistake already noted in the review by Hoeffding (1962) [p. 1469]: The argument
requires $P_{\vartheta_0}^n\{\sup_{\vartheta\in U_\varepsilon} |f_n(\cdot, \vartheta) - g(\vartheta)| < \varepsilon\} > 1 - \varepsilon$ for $n \ge n_\varepsilon$. This is, by far, not
the only weak point in Wilks’ book. The 7-page review by Hoeffding contains an
impressive list of shortcomings.
A consistency theorem for a general family of probability measures (with condi-
tions for consistency which are necessary and sufficient) can be found in Pfanzagl
1969a, p. 258, Theorem 2.6. See also Landers (1972).
Examples of Inconsistent ML-Sequences
The intuitive appeal of the ML method is so strong that one might be inclined to
consider all conditions used for the consistency theorem (in particular the compact-
ness of Θ) as being artifacts of the technique of the proof. It is, therefore, important
to point to the numerous examples of nice-looking families where one or the other
condition is violated, and where the ML-sequence fails to be consistent. First, we
will concentrate our attention on counterexamples for parametric families and the
i.i.d. case.
The first counterexamples, provided by Kraft and Le Cam (1956), present families
where, with probability tending to unity, the ML estimator exists for every sample size
and yet fails to be consistent, though consistent estimator sequences exist. A slight
shortcoming of these examples: The parameter space Θ is an open set, but not an
interval, and the densities are not continuous functions. Bahadur (1958) has various
clever examples, but they are all nonparametric. Ferguson (1982) [Sects. 2 and 3]
presents examples of one-parameter families with Θ = [0, 1] satisfying Cramér’s
conditions such that every ML-sequence converges to 1 if ϑ ∈ [1/2, 1]. Observe
that these are more than examples of inconsistent ML-sequences: They are also
counterexamples to Cramér’s interpretation of his “consistency” theorem. Another
example of an inconsistent ML-sequence can be found in Pfanzagl (1994) [Example
6.6.2, pp. 209–210].
Many textbooks cite Neyman and Scott (1948) [p. 7] for an example of an incon-
sistent ML-sequence. Here is the simplest version of this example: Let (xν , yν ),
ν = 1, …, n, be distributed as $\bigotimes_{\nu=1}^n N(\vartheta_\nu, \sigma^2)^2$ with σ² > 0 and ϑ_ν ∈ ℝ for ν =
1, …, n. The ML estimators are
$$\vartheta_\nu^{(n)}((x_1, y_1), \ldots, (x_n, y_n)) = \frac{1}{2}(x_\nu + y_\nu) \quad \text{for } \nu = 1, \ldots, n$$
and
$$\sigma_n^2((x_1, y_1), \ldots, (x_n, y_n)) = \frac{1}{4n}\sum_{\nu=1}^n (x_\nu - y_\nu)^2.$$
The clue is that $(\sigma_n^2)_{n\in\mathbb{N}}$ converges to σ²/2. This is not really a “counterexample”,
since the usual consistency theorems refer to the i.i.d. case. In cases with varying
unknown nuisance parameters, inconsistency is not unexpected. Another example of
this kind can be found in M. Ghosh (1995) [p. 166].
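A few lines of simulation (ours; the values of n, σ² and the nuisance means are arbitrary illustrative choices) confirm the limit σ²/2:

```python
# A hedged sketch of the Neyman-Scott example: pairs (x_v, y_v) ~ N(theta_v, sigma^2)^2
# with a fresh nuisance mean theta_v for every pair; the ML estimator of sigma^2
# converges to sigma^2 / 2 rather than sigma^2.
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 100_000, 4.0
theta = rng.uniform(-10.0, 10.0, size=n)              # unknown, pair-specific means
x = theta + rng.normal(0.0, np.sqrt(sigma2), size=n)
y = theta + rng.normal(0.0, np.sqrt(sigma2), size=n)

sigma2_ml = ((x - y) ** 2).sum() / (4 * n)            # the ML estimator from the text
print(sigma2_ml)                                      # close to 2.0 = sigma^2 / 2
```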
The asymptotic distribution of a consistent ML-sequence is usually derived from the expansion
$$n^{-1/2}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) = n^{-1/2}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta_0) + n^{1/2}(\vartheta^{(n)}(\mathbf{x}_n) - \vartheta_0)\, n^{-1}\sum_{\nu=1}^n k(x_\nu, \vartheta_0, \vartheta^{(n)}(\mathbf{x}_n)) \tag{5.2.10}$$
with
$$k(x, \vartheta_0, \vartheta) := \int_0^1 \ell^{\bullet\bullet}(x, (1-u)\vartheta_0 + u\vartheta)\, du.$$
This expansion is valid if $\vartheta \to \ell^{\bullet\bullet}(x, \vartheta)$ is continuous in a neighbourhood of ϑ₀.
If $\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) = 0$ for $\mathbf{x}_n \in X^n$, relation (5.2.10) implies that
$$n^{1/2}(\vartheta^{(n)}(\mathbf{x}_n) - \vartheta_0) = \frac{-n^{-1/2}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta_0)}{n^{-1}\sum_{\nu=1}^n k(x_\nu, \vartheta_0, \vartheta^{(n)}(\mathbf{x}_n))} + o(n^0, P_{\vartheta_0}^n). \tag{5.2.11}$$
Simple though this proof may be, various authors were unable to find reasonable
conditions which guarantee (5.2.12) for the sequence of their remainder terms. Recall
that Dugué (1937, pp. 328/331) requires that $\vartheta \to \ell^{\bullet\bullet}(x, \vartheta)$ is continuous at ϑ₀,
uniformly in x ∈ X (a condition which excludes, for instance, the application to
{N(0, σ²) : σ > 0}). The fact that Dugué still uses this condition in 1958, Theorem
II, p. 143, more than 17 years after Wald (1941b), is hard to explain.
Dugué’s paper (1937) is just a simplified version of what Doob (1934) [Theorem
6, pp. 770/1], who presented the first valid proof, had already written. Yet Doob’s
proof is also unsatisfactory, since he starts from a Taylor expansion of ϑ → (x, ϑ),
which he uses to derive the expansion of ϑ → • (x, ϑ) by means of differentiation,
a procedure which requires a more complex condition on the remainder term.
Remark Under suitable regularity conditions on the family {P_ϑ : ϑ ∈ Θ}, the relations
(5.2.11)–(5.2.13) imply that
$$n^{1/2}(\vartheta^{(n)}(\mathbf{x}_n) - \vartheta_0) - \sigma^2(\vartheta_0)\, n^{-1/2}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta_0) \to 0 \quad (P_{\vartheta_0}^n)_{n\in\mathbb{N}}.$$
This implies that the ML-sequences are regular, a usually neglected fact which is
needed to give operational significance to the assertion that the limit distribution
N (0, σ 2 (ϑ0 )) is optimal. (For more see Sect. 5.13.)
Maximum Probability Estimators
The MP estimator (= maximum probability estimator) is based on a smoothed version
of the density $p^{(n)}(\cdot, \vartheta)$ of the observations, defined as
$$p_r^{(n)}(\mathbf{x}_n, \vartheta) := \frac{c_n}{2r}\int 1_{(\vartheta - c_n^{-1} r,\; \vartheta + c_n^{-1} r)}(\tau)\, p^{(n)}(\mathbf{x}_n, \tau)\, d\tau.$$
The MP estimator $\vartheta_r^{(n)}$ is the value of ϑ which maximizes $\vartheta \to p_r^{(n)}(\mathbf{x}_n, \vartheta)$ (approximately).
In an elementary but poorly structured proof, Weiss and Wolfowitz (1967b,
Theorem, pp. 196/7 and 1974, Theorem 3.1, p. 17 and pp. 20/21) obtain a result which,
in a simplified version, reads as follows: If $\vartheta_r^{(n)}$, n ∈ ℕ, standardized by $c_n$, n ∈ ℕ,
converges regularly to some limit distribution, say $Q_r$, then $Q(-r, r) \le Q_r(-r, r)$
for any regularly attainable limit distribution Q.
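A numerical sketch of the definition may help (ours; the N(ϑ, 1) model, cₙ = √n, r = 1 and the grids are arbitrary choices; in this regular model the MP estimate practically coincides with the sample mean):

```python
# A hedged sketch: the MP estimator maximizes the smoothed likelihood
# theta -> (c_n / 2r) * integral of p^(n)(x_n, tau) over (theta - r/c_n, theta + r/c_n),
# here with c_n = sqrt(n) for the N(theta, 1) location model.
import numpy as np

rng = np.random.default_rng(4)
n, r = 200, 1.0
x = 0.5 + rng.normal(size=n)
c_n = np.sqrt(n)

tau = np.linspace(x.mean() - 1.0, x.mean() + 1.0, 4001)        # integration grid
log_lik = -0.5 * n * ((x ** 2).mean() - 2 * tau * x.mean() + tau ** 2)
lik = np.exp(log_lik - log_lik.max())                          # rescaled to avoid underflow

def smoothed(theta):
    window = np.abs(tau - theta) < r / c_n
    return lik[window].sum()                                   # proportional to the integral

grid = tau[::10]
theta_mp = grid[np.argmax([smoothed(t) for t in grid])]
print(theta_mp, x.mean())                                      # nearly identical here
```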
In spite of the analogy between MP sequences and the ML-sequence in the defi-
nition, the result is of a different nature:
(i) There are no general conditions under which $\vartheta_r^{(n)}$, n ∈ ℕ, is a regular estimator
sequence in the sense discussed in Sect. 5.12.
(ii) There is no general result expressing $Q_r$ in terms of inherent properties of the
model. To obtain the “optimal” limit distribution $Q_r$ one needs to determine the MP
estimator $\vartheta_r^{(n)}$ for every n ∈ ℕ and to find its asymptotic distribution.
(iii) The optimum property of $Q_r$ refers to the interval (−r, r) only.
In (1967) and (1974), Weiss and Wolfowitz present various examples for the
successful applications of the MP theory. Grossmann (1979, 1981) shows that the
shortcomings indicated above do not arise in regular models. His main result is
Theorem 3.2 in (1981, pp. 99/100): If for s, t ∈ ℝ, …
with $L(\vartheta) = \int \ell^{\bullet}(\cdot, \vartheta)^2\, dP_\vartheta$, and
$$\lim_{n\to\infty} n P_{\vartheta + c_n^{-1} s}\{p(x, \vartheta + c_n^{-1} t)/p(x, \vartheta + c_n^{-1} s) < 1 - \varepsilon\} = 0 \quad \text{for every } \varepsilon > 0,$$
then $c_n(\vartheta_r^{(n)} - \vartheta)$, n ∈ ℕ, converges for every r > 0 regularly to the optimal limit
distribution given by the underlying LAN condition.
The following example, put forward by Akahira and Takeuchi (1979), demon-
strates some shortcomings of the MP theory.
Example Let {P_ϑ : ϑ ∈ ℝ} be the shift parameter family, generated by the truncated
distribution …
The authors determine the MP estimator $\vartheta_r^{(n)}$ and they show that …
with $k = c/\sqrt{e}$. Applied with r = t, relation (5.2.14) yields … but …
It is a particular property of this model (not the outgrowth of a general theorem) that
an estimator sequence attaining the bound (5.5.1) does exist, namely
$$\hat\vartheta^{(n)}(\mathbf{x}_n) := \frac{1}{2}(x_{1:n} + x_{n:n}).$$
We might mention in passing that the asymptotic efficiency of the ML-sequence
is 1/2. According to Akahira and Takeuchi (1979) [p. 137] the ML estimator is
$$\vartheta^{(n)}(x_1, \ldots, x_n) = \begin{cases} x_{1:n} + 1 & \text{if } \bar x_n > x_{1:n} + 1,\\ \bar x_n & \text{if } x_{n:n} - 1 \le \bar x_n \le x_{1:n} + 1,\\ x_{n:n} - 1 & \text{if } \bar x_n < x_{n:n} - 1,\end{cases}$$
hence
$$\lim_{n\to\infty} P_\vartheta^n\{n|\vartheta^{(n)} - \vartheta| \le t\} = 1 - \exp[-kt].$$
For n ∈ ℕ let $P_n|(X, \mathcal{A})$ be a probability measure. The following concepts for the
convergence of $P_n$, n ∈ ℕ, to $P_0$, setwise and uniform, are in use: …
and
$$\sup_{A\in\mathcal{A}} |P_n(A) - P_0(A)| \to 0. \tag{5.3.2}$$
a special case of Vitali’s Theorem. It goes back to Riesz (1928/9) [I, p. 350 and II,
p. 182.]
Let X be a metric space, endowed with its Borel-algebra A . Let P and Pn , n ∈ N,
be probability measures on A . We define weak convergence of Pn to P as follows:
Among the relations equivalent to (5.3.3), the following is, perhaps, the most
useful:
$$P_n(B_r) \to P(B_r) \quad \ldots$$
Let $\mathbb{L}$ denote the set of all subconvex functions $\ell : \mathbb{R}^k \to [0, 1]$, and $\mathbb{L}_u$ the subset
of uniformly continuous functions. These sets are weak convergence determining
classes. We will often express weak convergence of $P_n$ to P by
$$\int \ell\, dP_n \to \int \ell\, dP \quad \text{for } \ell \in \mathbb{L}_u. \tag{5.3.6}$$
Proof Since $\{x \in \mathbb{R}^k : \ell(x) \le t\}$ is convex for every t ≥ 0, for every ε > 0 there is
$n_\varepsilon$ such that for $n \ge n_\varepsilon$, t > 0 and $\ell \in \mathbb{L}_u$,
$$\big|P_n\{x \in \mathbb{R}^k : \ell(x) \le t\} - P\{x \in \mathbb{R}^k : \ell(x) \le t\}\big| < \varepsilon.$$
… implies
$$\Big|\int \ell\, dP_n - \int \ell\, dP\Big| < \varepsilon \quad \text{for } n \ge n_\varepsilon.$$
Proposition 5.3.2 Let n ∈ ℕ be fixed. If $P \to \kappa(P)$ is continuous, then $P \to \int \ell\, dQ_P^{(n)}$
is continuous for $\ell \in \mathbb{L}_u$.
Proof We have
$$\Big|\int \ell(n^{1/2}(\kappa^{(n)} - \kappa(P)))\, dP^n - \int \ell(n^{1/2}(\kappa^{(n)} - \kappa(P_0)))\, dP_0^n\Big| \le \int \big|\ell(n^{1/2}(\kappa^{(n)} - \kappa(P))) - \ell(n^{1/2}(\kappa^{(n)} - \kappa(P_0)))\big|\, dP_0^n + d(P^n, P_0^n),$$
and $d(P^n, P_0^n) \le n\, d(P, P_0)$ by Hoeffding and Wolfowitz (1958). Since ℓ is uniformly
continuous, there exists for every ε > 0 a $\delta_\varepsilon > 0$ such that $n^{1/2}|\kappa(P) - \kappa(P_0)| < \delta_\varepsilon$
implies …
for every set A the boundary of which is of $Q_\vartheta$-measure zero (e.g. for every convex
set A if $Q_\vartheta \ll \lambda^k$) for every ϑ ∈ Θ. Uniformity in f or A is not required.
Proposition 5.3.5 The following statements are equivalent.
(i) $Q_\vartheta^{(n)}$ converges to $Q_\vartheta$ locally uniformly weakly at ϑ₀.
(ii) $\rho(\vartheta_n, \vartheta_0) \to 0$ implies $Q_{\vartheta_n}^{(n)} \Rightarrow Q_{\vartheta_0}$.
and the smallest eigenvalue of $\Sigma(\vartheta) := \int h(\cdot, \vartheta)\, h(\cdot, \vartheta)^\top\, dP_\vartheta$ is bounded away from
0 on Θ. Then $\tilde h_n(\mathbf{x}_n) = n^{-1/2}\sum_{\nu=1}^n h(x_\nu, \vartheta)$ fulfills
$$\lim_{n\to\infty} \sup_{\vartheta\in\Theta} \sup_{C\in\mathcal{C}} \big|P_\vartheta^n\{\tilde h_n(\cdot, \vartheta) \in C\} - N(0, \Sigma(\vartheta))(C)\big| = 0.$$
Let
$$a_{n,m} := \sup_{\vartheta\in U_m} h_n(\vartheta).$$
Hint: If $n(m_0)$ is defined, choose $n > n(m_0)$ such that $a_{n, m_0+1} > \alpha/2$, i.e. …
With $\vartheta_m \in U_m$ chosen such that $h_{n(m)}(\vartheta_m) > \sup_{\vartheta\in U_m} h_n(\vartheta) - 1/m$, we have …
Lemma 5.3.9 Assume that δ(h n (ϑ), h 0 (ϑ)) → 0 for every ϑ in a neighbourhood
of ϑ0 , and that this convergence is locally uniform at ϑ0 . If every h n is continuous at
ϑ0 , then h 0 is continuous at ϑ0 .
Proof We have
$$\delta(h_0(\vartheta), h_0(\vartheta_0)) \le \delta(h_0(\vartheta), h_n(\vartheta)) + \delta(h_n(\vartheta), h_n(\vartheta_0)) + \delta(h_n(\vartheta_0), h_0(\vartheta_0)).$$
Now $\delta(h_n(\vartheta_0), h_0(\vartheta_0)) < \varepsilon$ for $n \ge n_\varepsilon'$ and $\delta(h_0(\vartheta), h_n(\vartheta)) < \varepsilon$ for $\vartheta \in U_\varepsilon'$ and
$n \ge n_\varepsilon''$, and, with $n_\varepsilon := n_\varepsilon' \vee n_\varepsilon''$, $\delta(h_{n_\varepsilon}(\vartheta), h_{n_\varepsilon}(\vartheta_0)) < \varepsilon$ for $\vartheta \in U_\varepsilon''$. It follows that
$\delta(h_0(\vartheta), h_0(\vartheta_0)) < 3\varepsilon$ for $\vartheta \in U_\varepsilon' \cap U_\varepsilon''$.
Continuous convergence was used in certain parts of mathematics (see e.g. Hahn
1932). It was introduced in asymptotic statistical theory by Schmetterer (1966). To
motivate this step (see p. 301), he does not say more than
it seems better to introduce the idea of continuous convergence. When the limit of a sequence
of functions is continuous, the idea of continuous convergence is even more general [?] than
the idea of uniform convergence.
Continuous convergence was used by Roussas (1972, Sects. 5.6 and 5.7). J.K. Ghosh
(1985) mentions continuous convergence as “interesting and useful” (p. 315) without
any further arguments.
At a surface inspection, continuous convergence looks more natural than reg-
ular convergence. Does it really make sense to restrict the condition of conver-
gence to sequences ϑn = ϑ0 + n −1/2 a? It does, from the technical point of view,
since these are the sequences for which $(P_{\vartheta_n}^n)_{n\in\mathbb{N}}$ is contiguous to $(P_{\vartheta_0}^n)_{n\in\mathbb{N}}$. For
sequences $(\vartheta_n)_{n\in\mathbb{N}}$ converging to ϑ₀ more slowly than $n^{-1/2}$ (i.e., sequences with
$n^{1/2}|\vartheta_n - \vartheta_0| \to \infty$), convergence of $P_{\vartheta_n}^n \circ n^{1/2}(\vartheta^{(n)} - \vartheta_n)$, n ∈ ℕ, is no effective
restriction on the asymptotic performance of the estimator sequences. Pay attention to
examples where the asymptotic performance of $P_{\vartheta + n^{-1/4}}^n \circ n^{1/2}(\vartheta^{(n)} - \vartheta)$ is distinct
from $P_{\vartheta + n^{-1/2}}^n \circ n^{1/2}(\vartheta^{(n)} - \vartheta)$. (See, e.g., Lehmann and Casella 1998, pp. 442/443,
Example 2.7.)
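The cited example can be illustrated by a Hodges-type estimator (the concrete choice, ϑ̂ = x̄·1{|x̄| > n^{−1/4}} in the N(ϑ, 1) model, is our assumption for the sketch):

```python
# A hedged sketch: the standardized error of a Hodges-type estimator behaves
# well along theta_n = n^(-1/2) but degenerates along theta_n = n^(-1/4).
# The sample mean is simulated directly as theta_n + Z / sqrt(n), Z ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(5)
n, reps = 10_000, 20_000

def standardized_errors(theta_n):
    xbar = theta_n + rng.normal(size=reps) / np.sqrt(n)
    est = np.where(np.abs(xbar) > n ** -0.25, xbar, 0.0)   # Hodges-type truncation
    return np.sqrt(n) * (est - theta_n)

for rate in (0.5, 0.25):
    e = standardized_errors(n ** -rate)
    print(f"theta_n = n^(-{rate}): mean {e.mean():.2f}, sd {e.std():.2f}")
```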
Schmetterer’s conclusion about the relation between “uniform convergence” and
“continuous convergence” reads as follows (see 1966, p. 303):
Continuous convergence and uniform convergence to a continuous function on a compact
set are equivalent.
Proof Let $(\vartheta_m)_{m\in\mathbb{N}} \to \vartheta_0$. By assumption, for every m ∈ ℕ there is $n_m$ such that
$$|h_n(\vartheta_m) - h_0(\vartheta_m)| < \frac{1}{m} \quad \text{for } n \ge n_m.$$
W.l.g., $n_{m+1} > n_m$. For N ∈ ℕ, let M(N) be the largest integer m such that $n_m \le N$.
We have $M(N) \to \infty$ and $|h_N(\vartheta_{M(N)}) - h_0(\vartheta_{M(N)})| < 1/M(N)$, hence …
5.4 Consistency and √n-Consistency of Estimator Sequences
$$\sup_{P\in\mathcal{P}_0} \int \Delta(P_{\mathbf{x}_n}^{(n)}, P)\, P^n(d\mathbf{x}_n) \to 0,$$
where $Q_{\mathbf{x}_n}^{(n)}$ is the empirical distribution, defined by
$$Q_{\mathbf{x}_n}^{(n)}(A) := n^{-1}\sum_{\nu=1}^n 1_A(x_\nu), \quad A \in \mathcal{A}.$$
$$\Delta(P_{\mathbf{x}_n}^{(n)}, Q_{\mathbf{x}_n}^{(n)}) < 2 \inf\{\Delta(P, Q_{\mathbf{x}_n}^{(n)}) : P \in \mathcal{P}\},$$
say, we obtain
$$\Delta(P_{\mathbf{x}_n}^{(n)}, P) \le \Delta(P_{\mathbf{x}_n}^{(n)}, Q_{\mathbf{x}_n}^{(n)}) + \Delta(Q_{\mathbf{x}_n}^{(n)}, P) \le 3\Delta(Q_{\mathbf{x}_n}^{(n)}, P),$$
which implies … If
$$\sup_{\vartheta\in\Theta_m} P_\vartheta^n\{\mathbf{x}_n \in X^n : \rho_*(\vartheta^{(n)}(\mathbf{x}_n), \vartheta) > \varepsilon\} \to 0 \quad \text{for } m \in \mathbb{N} \text{ and } \varepsilon > 0, \tag{5.4.7}$$
then there exists an estimator sequence $(\hat\vartheta^{(n)})_{n\in\mathbb{N}}$ such that
$$\sup_{\vartheta\in\Theta_m} P_\vartheta^n\{\mathbf{x}_n \in X^n : \rho(\hat\vartheta^{(n)}(\mathbf{x}_n), \vartheta) > \varepsilon\} \to 0 \quad \text{for } m \in \mathbb{N} \text{ and } \varepsilon > 0. \tag{5.4.8}$$
enough. This point is observed by Le Cam (1956, p. 136, or 1986, p. 605), but missed
by other authors (Bickel et al. 1993, p. 43). For example, R is the union of the compact
sets Θm = [−m, 1 − m −1 ] ∪ [1, m], m ∈ N, but none of the Θm covers the compact
set [0, 1].
Proof of Lemma 5.4.1 Let …
W.l.g. we assume that $N_{m+1} > N_m$. With $m_n$ denoting the largest integer m such that
$N_m \le n$, we obtain …
whence
$$\rho_*(\hat\vartheta^{(n)}(\mathbf{x}_n), \vartheta^{(n)}(\mathbf{x}_n)) \le \rho_*(\vartheta, \vartheta^{(n)}(\mathbf{x}_n)) \quad \text{if } \vartheta \in \Theta_{m_n}.$$
$$\sum_{\nu=1}^n \ell(x_\nu, \hat\vartheta^{(n)}) \ge \sup_{\vartheta\in U_n(\mathbf{x}_n)} \sum_{\nu=1}^n \ell(x_\nu, \vartheta) - n^{-1}.$$
Le Cam’s Lemma 5 asserts that the estimator sequence thus defined is √n-consistent
(under suitable regularity conditions on the densities p(·, ϑ)). In this Lemma, X is
Euclidean; the conditions on Θ remain vague (locally convex and without isolated
points on p. 130, locally compact and σ-compact on p. 137). Le Cam makes no use of
the fact that his $(\hat\vartheta^{(n)})_{n\in\mathbb{N}}$ is a consistent sequence of asymptotic ML estimators, and
therefore asymptotically efficient. (To obtain an asymptotically efficient estimator
sequence, he applies in Lemma 6, p. 138, the usual one-step improvement procedure
to this √n-consistent estimator sequence.)
In 1966 (p. 183, Lemma 2), 1969 (pp. 103–107) and 1986 (p. 608, Proposition 1),
Le Cam uses a different argument in the construction of √n-consistent estimator
sequences. The starting point is √n-consistency of the empirical distribution, i.e.,
relation (5.4.2) is replaced by the stronger relation
$$\sup_{\vartheta\in\Theta} P^n\{\mathbf{x}_n \in X^n : n^{1/2}\rho_*(\vartheta^{(n)}(\mathbf{x}_n), \vartheta) > t_n\} \to 0 \quad \text{for } (t_n)_{n\in\mathbb{N}} \uparrow \infty. \tag{5.4.13}$$
Since (5.4.13) is stronger than (5.5.1), the same construction as in Lemma 5.4.1
applies, thus leading to an estimator sequence $(\hat\vartheta^{(n)})_{n\in\mathbb{N}}$ fulfilling (5.4.8) and (5.4.9).
This implies that $(\hat\vartheta^{(n)})_{n\in\mathbb{N}}$ is ρ-consistent (by (5.4.8)), and √n-consistent with respect
to ρ∗ (by (5.4.13) and (5.4.9)), uniformly on every Θₘ. The following Lemma asserts
that such estimator sequences are √n-consistent with respect to ρ, uniformly on
every Θₘ, provided the following condition on the relation between ρ and ρ∗ holds
true. For some ε > 0, …
and
$$\sup_{\vartheta\in\Theta_m} P_\vartheta^n\{\mathbf{x}_n \in X^n : n^{1/2}\rho_*(\hat\vartheta^{(n)}(\mathbf{x}_n), \vartheta) > t_n\} \to 0 \quad \text{for } (t_n)_{n\in\mathbb{N}} \uparrow \infty,$$
then
$$\sup_{\vartheta\in\Theta_m} P_\vartheta^n\{\mathbf{x}_n \in X^n : n^{1/2}\rho(\hat\vartheta^{(n)}(\mathbf{x}_n), \vartheta) > t_n\} \to 0 \quad \text{for } (t_n)_{n\in\mathbb{N}} \uparrow \infty.$$
$$\rho_*(\vartheta', \vartheta'') := \Delta(P_{\vartheta'} \circ \chi, P_{\vartheta''} \circ \chi).$$
Apart from Le Cam (1986) there are not more than three textbooks which contain
general theorems on the existence of consistent estimator sequences. Ibragimov and
Has’minskii (1981, p. 31, Theorem 4.1) prove the existence of √n-consistent estimator
sequences for Euclidean Θ under the condition that $\vartheta \to P_\vartheta$ is continuous with
respect to the sup-distance, and that $\inf\{d(P_\vartheta, P_{\vartheta_0}) : \|\vartheta - \vartheta_0\| > \delta\} > 0$ if δ > 0.
Rüschendorf (1988) [p. 68, Proposition 3.7] proves the existence of √n-consistent
estimator sequences assuming Fréchet-differentiability of $\vartheta \to \Delta_k(P_\vartheta, P_{\vartheta_0})$. Bickel
et al. (1993) [p. 42, Theorem 1] asserts for regular parametric models $P_\vartheta|\mathcal{B}^p$, ϑ ∈ Θ,
the existence of uniformly √n-consistent estimator sequences, provided ϑ is identifiable.
“Regular parametric model” means, roughly speaking, that the density has a
continuous derivative. (See their Definition 2, p. 12, for a precise definition.) Checking
the proof it becomes clear that Θ is meant as an open subset of a Euclidean space,
and that “uniformly √n-consistent” means “√n-consistent, uniformly on every compact
subset of Θ”. Since this is the only easily accessible place where this result can
be found, it seems in order to mention that the proof of Theorem 1 is somewhat
sketchy on p. 43. With applications in mind, the authors forgo proving the existence
of consistent (rather than √n-consistent) estimator sequences for more general
models.
The theorems mentioned above are more than “existence theorems”: The estimator
sequences are constructed explicitly. Yet, many steps in these constructions include
arbitrary elements. While the maximum likelihood estimator is uniquely determined
for every sample size (at least in the usual cases), the result of the foregoing construction
for a particular sample is vague, which calls into question the operational
significance of the asymptotic assertion.
5.5 Asymptotically Linear Estimator Sequences
Many estimator sequences with limit distribution are asymptotically linear. Asymptotically
optimal estimator sequences in the sense of the Convolution Theorem are
necessarily of this type.
Let P be an arbitrary family of probability measures P|A. For P ∈ P set
$$L_*(P) := \Big\{g \in L_2(P) : \int g\, dP = 0\Big\}.$$
… with
$$\tilde K(\mathbf{x}_n, P) = n^{-1/2}\sum_{\nu=1}^n K(x_\nu, P).$$
for paths $P_{tg}$ whose density $p_{tg}$ fulfills a suitable representation $p_{tg}/p = 1 + tg + r_t$,
t ∈ (−ε, ε), with $g \in L_*(P)$. Then asymptotic linearity (5.5.1) holds with P replaced
by $P_{n^{-1/2}g}$ iff
Hence
$$\int \big(\ell^{\bullet\bullet}(x, \vartheta) + \ell^{\bullet}(x, \vartheta)\,\ell^{\bullet}(x, \vartheta)^\top\big)\, P_\vartheta(dx) = 0.$$
With $L_{ij}(\vartheta) = \int \ell^{(ij)}(x, \vartheta)\, P_\vartheta(dx)$ and $L_{i,j}(\vartheta) = \int \ell^{(i)}(x, \vartheta)\,\ell^{(j)}(x, \vartheta)\, P_\vartheta(dx)$
this is the well known relation
occurs by Bahadur’s Lemma automatically for $\lambda^p$-a.a. ϑ, for all rational a, along some
subsequence. Hence one cannot escape the question of conditions on K(·, P) which
imply relation (5.5.3).
The results presented so far are based on asymptotic linearity of κ⁽ⁿ⁾ and the
differentiability of κ. If we include properties of $P_{n^{-1/2}g}^n \circ n^{1/2}(\kappa^{(n)} - \kappa(P_{n^{-1/2}g}))$ in
our considerations, we obtain the following Proposition.
Proposition 5.5.1 Assume that an estimator sequence of κ is at P asymptotically
linear with influence function K . This estimator sequence is regular at P iff κ is
differentiable at P, and K is a gradient of κ at P.
Weaker versions of this result occur in Witting and Müller-Funk (1995, p. 422,
Satz 6.201). That the influence function is a gradient is expressed in a disguised form
as “Kopplungsbedingung (6.5.15)”. For related results see also Bickel et al. 1993,
p. 39, Proposition 2.4.3 A and p. 183, Theorem 5.2.3.
Proof Recall that $P^n \circ \tilde K(\cdot, P) \Rightarrow N(0, \Sigma(P))$. The relation …
Therefore, κ⁽ⁿ⁾ is regular iff the right-hand side of (5.5.7) is asymptotically independent
of g, i.e. iff
For the ML estimator $\hat\vartheta^{(n)}$, the likelihood equation $\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \hat\vartheta^{(n)}(\mathbf{x}_n)) = 0$ then
leads to the approximation
$$\hat\vartheta^{(n)}(\mathbf{x}_n) = \vartheta^{(n)}(\mathbf{x}_n) + \Big(-\sum_{\nu=1}^n \ell^{\bullet\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n))\Big)^{-1} \sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) = \vartheta^{(n)}(\mathbf{x}_n) + \Lambda(\vartheta^{(n)}(\mathbf{x}_n))\, n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) + o(n^{-1/2}, P_\vartheta^n).$$
Following his principle of “proof by example”, Fisher carries this through for the
Cauchy distribution, starting from the median as a preliminary estimator. In a more
general context, the same idea occurs (independently?) in Le Cam (1956) [p. 139]:
If {Pϑ : ϑ ∈ Θ}, Θ ⊂ Rk , is sufficiently regular (something close to LAN), then
$$\hat\vartheta^{(n)}(\mathbf{x}_n) := \vartheta^{(n)}(\mathbf{x}_n) + \Lambda(\vartheta^{(n)}(\mathbf{x}_n))\, n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n))$$
is asymptotically linear with influence function $\Lambda(\vartheta)\,\ell^{\bullet}(\cdot, \vartheta)$, i.e.
$$n^{1/2}(\hat\vartheta^{(n)}(\mathbf{x}_n) - \vartheta) = \Lambda(\vartheta)\, n^{-1/2}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta) + o(n^0, P_\vartheta^n). \tag{5.5.8}$$
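For Fisher’s own example, the Cauchy location family, the construction takes the following form (a sketch of ours; Λ(ϑ) = 2 because the Fisher information of the standard Cauchy is 1/2, and sample size and seed are arbitrary):

```python
# A hedged sketch of the one-step improvement (5.5.8) for the Cauchy location
# model, starting from the sample median as the preliminary estimator.
import numpy as np

rng = np.random.default_rng(7)
theta0 = 0.7
x = theta0 + rng.standard_cauchy(500)

def score(x, t):                       # derivative of the log-density in theta
    d = x - t
    return 2 * d / (1 + d ** 2)

prelim = np.median(x)                  # sqrt(n)-consistent starting point
onestep = prelim + 2.0 * score(x, prelim).mean()   # Lambda = I(theta)^(-1) = 2
print(prelim, onestep)                 # the one-step estimator is asymptotically efficient
```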
More generally, one might consider
$$\hat\vartheta^{(n)}(\mathbf{x}_n) = \vartheta^{(n)}(\mathbf{x}_n) + \Lambda^{(n)}(\mathbf{x}_n)\, n^{-1}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)), \tag{5.5.9}$$
with some sequence $\Lambda^{(n)} \to \Lambda(\vartheta)\ (P_\vartheta^n)$, and one can ask whether …
or
$$\Lambda^{(n)}(\mathbf{x}_n) = \Big(-n^{-1}\sum_{\nu=1}^n \ell^{\bullet\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n))\Big)^{-1}.$$
Then
$$n^{1/2}(\hat\vartheta^{(n)}(\mathbf{x}_n) - \vartheta) = n^{1/2}(\vartheta^{(n)}(\mathbf{x}_n) - \vartheta) + \Lambda^{(n)}(\mathbf{x}_n)\, n^{-1/2}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)).$$
Since $\Lambda^{(n)} \to \Lambda(\vartheta)\ (P_\vartheta^n)$ and since $n^{-1/2}\sum_{\nu=1}^n \ell^{\bullet}(x_\nu, \vartheta)$ is stochastically bounded,
relation (5.5.8) follows if
$$n^{-1/2}\sum_{\nu=1}^n \big(\ell^{\bullet}(x_\nu, \vartheta^{(n)}(\mathbf{x}_n)) - \ell^{\bullet}(x_\nu, \vartheta)\big) = o(n^0, P_\vartheta^n),$$
$$\kappa^{(n)}(\mathbf{x}_n) := \kappa(P_{\mathbf{x}_{m_n}}) + (n - m_n)^{-1}\sum_{\nu=m_n+1}^n K(x_\nu, P_{\mathbf{x}_{m_n}}). \tag{5.5.10}$$
$$n^{1/2}(\vartheta_n - \vartheta) + n^{-1/2}\sum_{\nu=1}^n \big(K(x_\nu, \vartheta_n, \eta) - K(x_\nu, \vartheta, \eta)\big) = o(n^0, P_{\vartheta,\eta}^n). \tag{5.5.11}$$
$$n^{-1/2}\sum_{\nu=1}^n \big(K(x_\nu, \vartheta_n, \eta_n) - K(x_\nu, \vartheta_n, \eta)\big) = o(n^0, P_{\vartheta,\eta}^n). \tag{5.5.12}$$
Both (5.5.11) and (5.5.12) are necessary for the existence of estimator sequences
admitting the influence function K(·, ϑ, η) locally uniformly.
Klaassen (1987) was the first to consider the construction of asymptotically lin-
ear estimator sequences with an arbitrary influence function. He justifies condition
(5.5.11) (corresponding to his condition (2.2), p. 1550) by the remark that “it often
holds”. He does not mention that it is necessarily true under the condition of local uni-
formity. Condition (5.5.11) occurs in Bickel et al. 1993, p. 395, (iv), as “smoothness
condition” (!).
The paper of Schick (1986) is confined to K (·, ϑ, η) being a canonical gradient.
In this case, condition (5.5.11) follows from some kind of LAN-condition on the
family (see p. 1142, relation (2.5)), an idea going back to Bickel (1982) [p. 670,
relation (6.43)].
Condition (5.5.12) does not show up in the literature. Since this condition is neces-
sary too, this requires an explanation. The results in Schick (1986), Klaassen (1987),
and Bickel et al. (1993) are based on the following conditions (5.5.13) and (5.5.14): There
exists an estimator sequence $(x_1, \ldots, x_m) \to \hat K_m(\cdot, \vartheta, x_1, \ldots, x_m)$ of the function
K(·, ϑ, η) with the following properties. …
$$(n - m_n)^{-1/2}\sum_{\nu=m_n+1}^n \big(\hat K_{m_n}(x_\nu, \vartheta_{m_n}, x_1, \ldots, x_{m_n}) - K(x_\nu, \vartheta_{m_n}, \eta)\big) = o_p(n^0, P_{\vartheta,\eta}^n) \tag{5.5.15}$$
$$\begin{aligned} (n - m_n)^{-1/2}&\sum_{\nu=m_n+1}^n \big(\hat K_{m_n}(x_\nu, \vartheta_{m_n}, x_1, \ldots, x_{m_n}) - K(x_\nu, \vartheta_{m_n}, \eta)\big)\\ &= (n - m_n)^{-1/2}\sum_{\nu=m_n+1}^n \Big(\hat K_{m_n}(x_\nu, \vartheta_{m_n}, x_1, \ldots, x_{m_n}) - \int \hat K_{m_n}(y, \vartheta_{m_n}, x_1, \ldots, x_{m_n})\, P_{\vartheta_{m_n},\eta}(dy) - K(x_\nu, \vartheta_{m_n}, \eta)\Big)\\ &\quad + (n - m_n)^{1/2} \int \hat K_{m_n}(y, \vartheta_{m_n}, x_1, \ldots, x_{m_n})\, P_{\vartheta_{m_n},\eta}(dy). \end{aligned} \tag{5.5.16}$$
The first term on the right-hand side of (5.5.16) converges stochastically to 0 because
of (5.5.14), provided n − mₙ tends to infinity. The second term converges to 0
because of (5.5.13) if $n^{-1}m_n$ is bounded away from 0. Both conditions are fulfilled
if mₙ ∼ n.
Now we shall show how, under conditions (5.5.11) and (5.5.15), an estimator
sequence with influence function K (·, ϑ, η) can be constructed.
We first remark that relations (5.5.11) and (5.5.15) hold, in a modified form, with
$\vartheta_m$ replaced by a √m-consistent estimator sequence ϑ⁽ᵐ⁾. This is true without any
modifications if $(\vartheta^{(m)})_{m\in\mathbb{N}}$ is discretized. This was the choice in the papers by Bickel
(1982) and Schick (1986). Alternatively, one might, following Klaassen (1987), use
the splitting trick.
Let $1 < k_n < m_n < n$ be such that $n^{-1}k_n$ is bounded away from 0 and $n^{-1}m_n$
bounded away from 1. Relation (5.5.11) with $(x_1, \ldots, x_n)$ replaced by $(x_{m_n+1},
\ldots, x_n)$ yields
$$\ldots = o(n^0, P_{\vartheta,\eta}^n). \tag{5.5.17}$$
$$(n - m_n)^{-1/2}\sum_{\nu=m_n+1}^n \big(\hat K_{k_n}(x_\nu, \vartheta^{(m_n-k_n)}(x_{k_n+1}, \ldots, x_{m_n}), \mathbf{x}_{k_n}) - K(x_\nu, \vartheta^{(m_n-k_n)}(x_{k_n+1}, \ldots, x_{m_n}), \eta)\big) = o_p(n^0, P_{\vartheta,\eta}^n).$$
Let now …
$$\begin{aligned} \ldots + (n - m_n)^{-1/2}&\sum_{\nu=m_n+1}^n \hat K_{k_n}(x_\nu, \vartheta^{(m_n-k_n)}(x_{k_n+1}, \ldots, x_{m_n}), \mathbf{x}_{k_n})\\ &= (n - m_n)^{-1/2}\sum_{\nu=m_n+1}^n K(x_\nu, \vartheta, \eta)\\ &\quad + (n - m_n)^{-1/2}\sum_{\nu=m_n+1}^n \big(\hat K_{k_n}(x_\nu, \vartheta^{(m_n-k_n)}(x_{k_n+1}, \ldots, x_{m_n}), \mathbf{x}_{k_n}) - K(x_\nu, \vartheta^{(m_n-k_n)}(x_{k_n+1}, \ldots, x_{m_n}), \eta)\big) + o(n^0, P_{\vartheta,\eta}^n)\\ &= (n - m_n)^{-1/2}\sum_{\nu=m_n+1}^n K(x_\nu, \vartheta, \eta) + o(n^0, P_{\vartheta,\eta}^n). \end{aligned}$$
For an estimator sequence $\vartheta_2^{(n)}$ constructed in the same way as $\vartheta_1^{(n)}$, but with the roles
of $x_1, \ldots, x_{m_n}$ and $x_{m_n+1}, \ldots, x_n$ interchanged, we obtain in analogy to (5.5.18):
$$m_n^{1/2}(\vartheta_2^{(n)}(\mathbf{x}_n) - \vartheta) = m_n^{-1/2}\sum_{\nu=1}^{m_n} K(x_\nu, \vartheta, \eta) + o(n^0, P_{\vartheta,\eta}^n).$$
Hence
$$\hat\vartheta^{(n)} := (1 - n^{-1}m_n)\,\vartheta_1^{(n)} + n^{-1}m_n\,\vartheta_2^{(n)} \tag{5.5.19}$$
a rather stringent version of (5.5.13). (To the reader who has trouble in finding this
condition in Bickel’s paper: It is part of “condition H” on p. 653.)
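The arithmetic of the splitting trick can be sketched as follows (our illustration in the Cauchy location model, where the influence function K(x, ϑ) = 4(x − ϑ)/(1 + (x − ϑ)²) is known explicitly; in the semiparametric applications above, K̂ itself would be estimated from the first part of the sample):

```python
# A hedged sketch of the splitting trick: a preliminary estimator comes from one
# half of the sample, the influence function is averaged over the other half,
# the roles are swapped, and (5.5.19) recombines the two estimates.
import numpy as np

rng = np.random.default_rng(8)
theta0 = 0.7
x = theta0 + rng.standard_cauchy(1000)

def K(x, t):                                     # efficient influence function
    d = x - t
    return 4 * d / (1 + d ** 2)

def half_estimate(fit_part, avg_part):
    prelim = np.median(fit_part)                 # preliminary estimator from one part
    return prelim + K(avg_part, prelim).mean()   # influence function averaged over the other

m = x.size // 2                                  # m_n = n/2
th1 = half_estimate(x[:m], x[m:])
th2 = half_estimate(x[m:], x[:m])
theta_hat = 0.5 * th1 + 0.5 * th2                # (5.5.19) with weights 1/2, 1/2
print(theta_hat)
```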
with $\Psi(x_1, x_2) = \Psi(x_2, x_1)$ and $\int\!\!\int \Psi(x_1, x_2)^2\, P(dx_1)\, P(dx_2) < \infty$. A reasonable
estimator is
$$(x_1, \ldots, x_n) \to n^{-2}\sum_{\nu=1}^n \sum_{\mu=1}^n \Psi(x_\nu, x_\mu).$$
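For instance, with the kernel Ψ(x₁, x₂) = (x₁ − x₂)²/2 the functional is the variance of P, and the estimator can be sketched as follows (ours; the data are arbitrary):

```python
# A hedged sketch: the plug-in estimator n^(-2) * sum over all pairs of Psi(x_v, x_mu)
# (a V-statistic) for the functional kappa(P) = integral integral Psi dP dP.
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(0.0, 2.0, size=1000)

def v_statistic(x, psi):
    return psi(x[:, None], x[None, :]).mean()    # average over all n^2 pairs

psi = lambda a, b: 0.5 * (a - b) ** 2            # symmetric kernel; kappa(P) = Var(P)
print(v_statistic(x, psi))                       # close to 4.0 for N(0, 2)
```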
Stein’s Approach
Stein was the first scholar to strive for an answer to such questions. His paper (1956)
is generally considered as pioneering for the problems of nonparametric bounds.
Regrettably, Stein wrote his paper several years too early. He cites Le Cam (1953),
but he ignores the problem of “superefficiency”. Without a solid concept for the
optimality in parametric families, he takes ML-sequences as asymptotically efficient
straightaway, without entering the discussion about the asymptotic optimality of
statistical procedures. Stein never came back to this problem when the adequate
techniques were available. Recall that the idea to base the concept of asymptotic
optimality on local uniformity as in Rao (1963) and Bahadur (1964) had not yet
evolved, and that the optimality of ML-sequences based on such a solid concept of
asymptotic optimality occurred first in Wolfowitz (1965) and Kaufman (1966).
The main idea in Stein’s paper is to obtain the asymptotic bound for estima-
tor sequences of a functional κ : P → R at P0 ∈ P from the bound for estimator
sequences in the “least favourable” parametric subfamily passing through P0 , say
P0 . Assuming that ML-sequences are asymptotically optimal, the question remains
which parametric subfamily is the least favourable one. With a solid concept of
asymptotic optimality at our disposal, this is clear: P0 is least favourable if the best
possible estimator sequence which is regular in P0 is also regular in P, or con-
versely: If an estimator sequence which is regular and optimal in P is optimal in
P0 . Without a precise concept of asymptotic optimality, Stein’s argument (p. 187)
remains necessarily vague.
Clearly [!] a nonparametric problem is at least as difficult as any of the parametric problems
The reasoning underlying Stein’s idea contains certain weak points: (i) He assumes
that for parametric families the ML-sequence is asymptotically optimal. (ii) More-
over, it makes no sense to speak of the quality of an estimator sequence of κ(P)
at P = P0 without reference to the class of competing estimator sequences. Hence
it makes no sense to say what one would like to say, namely: That an estimator
sequence is optimal on some family P if it is optimal on some subfamily P0 ⊂ P.
To illustrate Stein’s approach, we consider the problem of estimating the quantile
of an unknown symmetric distribution.
Example Let P be the family of all symmetric distributions on B with a differentiable
Lebesgue density p. The problem is to estimate the β-quantile κβ (P), defined by
P(−∞, κβ (P)] = β. W.l.g. we assume β ≥ 1/2. The β-quantile κ (n) of the sample
is certainly a reasonable estimator; its limit distribution is normal with mean 0 and
variance β(1 − β)/ p(κβ (P))2 . Can the symmetry of the densities be utilized to
obtain a better estimator? According to Stein’s program, one has to find a parametric
subfamily in which the estimation of the β-quantile is particularly difficult. Let
P0 ∈ P be fixed. As starting point we consider the parametric family Q := {Q ϑ :
ϑ ∈ (−ε, ε)} with Lebesgue density
and
so that
L(Q 0 ) := (−0 + ψ)2 d P0 = (0 )2 d P0 + ψ 2 d P0 .
Hence the intrinsic bound for the asymptotic variance of estimator sequences of the
β-quantile in the family (5.6.1) is
2
1− Ψ 1(0,qβ ) d P0 p0 (qβ ) (0 )2 d P0 ) + ψ 2 d P0 . (5.6.5)
The problem is to choose ψ symmetric and subject to the conditions (5.6.1) such that
(5.6.5) is as large as possible.The solution of this task becomes more transparent if
we represent ψ = λψ0 , with ψ02 d P0 = 1. Given ψ0 ,
2
1 − λ ψ0 1(0,qβ ) d P0 p0 (qβ ) (0 )2 d P0 ) + λ2
The least favourable subfamily is of the type (5.6.1) with Ψ given by (5.6.6). This
leads to the variance bound
−1
(0 )2 d P0 + (β − 1/2)(1 − β)/ p0 (qβ )2 . (5.6.7)
Endowed with the concepts of “tangent space” and “gradient” introduced below, we
now complete this example of bounds for the β-quantile of a symmetric distribution
on B. According to (5.6.4), the tangent cone of this family contains all functions x →
−a0 (x) + ψ(x), where p0 is any density symmetric about 0, and ψ a (differentiable)
function fulfilling (5.6.2) and (5.6.3), which is symmetric about 0.A gradient of the
β-quantile at P0 is
κβ• (x, P0 ) = (β − 1(−∞,qβ ) (x)) p0 (qβ ),
This impression is due to the fact that Stein mingles the problems of “nonparametric
bounds” and “adaptivity”.
Most authors refer to Stein (1956) as a pioneering paper for the concept of non-
parametric bounds. It is, in fact, a paper on adaptivity. Readers who have difficulties
to understand Stein’s enigmatic “algebraic lemma” (p. 189) might consult Bickel
(1982) [p. 651, condition S] or Fabian and Hannan (1982) [p. 474, condition 8].
These authors offer simplified and more transparent versions of what Stein intended
to express in his “algebraic lemma”: That a certain orthogonality relation is necessary
for the existence of “adaptive” procedures.
The examples in Stein’s paper are all of the “adaptive” type, and in such cases it
is superfluous to search for “least favourable parametric subfamilies”: The intrinsic
nonparametric bound is the intrinsic bound for the submodel where the nonparametric
component is known.
The most interesting among Stein’s examples is the estimation of the median
of an unknown symmetric distribution. The estimation of an arbitrary β-quantile
would have offered the possibility to show how the method of least favourable one-
parameter subfamilies can be used to obtain a nonparametric bound. The restriction
to the adaptive case β = 1/2 in Stein’s paper deprives his example of its substance.
Tangent Sets
Shortly after Hájek’s paper (1972) on bounds for the concentration of estimator
sequences in parametric families, the problem of intrinsic bounds for functionals
on general families was taken up in a series of papers by Levit (1974, 1975) and
Koshevnik and Levit (1976). In these papers the ideas of Stein (1956) were trans-
formed into a serviceable technique for determining intrinsic bounds, based on the
concepts “tangent space” and “gradient”.
To keep the following considerations transparent, we assume that the probability
measures P ∈ P are mutually absolutely continuous. We consider paths Pt ∈ P for
t ∈ A = (−ε, ε) approximating P in the following sense: The density pt of Pt can
be represented as
t −1 ( pt / p − 1) = g + rt (5.6.8)
−1 2
lim t (( pt / p)1/2 − 1) − g d P = 0
t→0
160 5 Asymptotic Optimality of Estimators
−1 1 2
lim t (( pt / p)1/2 − 1) − g d P = 0.
t→0 2
n
n
−1/2 1
log( p n −1/2 a (xν )/ p(xν )) = an g(xν ) − a 2 g 2 d P0 (P0n ). (5.6.13)
ν=1 ν=1
2
In other words, (5.6.10)–(5.6.12) imply an LAN-condition. (In fact, rt2 d P → 0
suffices.) For a proof see (Pfanzagl 1985, Proposition 1.2.7, p. 22 ff).
For parametric families Pϑ : ϑ ∈ Θ with Θ ⊂ Rk we set Pϑ+ta in place of Pt and
obtain
d Pϑ+ta /d Pϑ = 1 + ta • (·, ϑ) + trt .
Then
n
log( p(xν , ϑ + n −1/2 at)/ p(xν , ϑ))
ν=1
n
1
= n −1/2 ta • (xν ) − t 2 a • (·, ϑ)• (·, ϑ) d Pϑ a (Pϑn ).
ν=1
2
5.6 Functionals on General Families 161
A representation (5.6.9) of the densities is natural from the intuitive point of view.
Technically useful for the determination of the concentration bound is the LAN-
condition (5.6.13). That (5.6.9) is almost necessary for (5.6.13) was suggested by
Le Cam (1984, 1985). See Pfanzagl (1985) [p. 22, Proposition 1.2.7] for details. Let
H : R → R be a function with continuous 2nd derivative in a neighbourhood of 0
such that
H (0) = 0 and H (0) = 1. (5.6.14)
1
H (h t ) = tg + t 2 (K + H (0)σ 2 ) + tst (5.6.16)
2
with st fulfilling (5.6.10)–(5.6.12).
This is a special case of Lemma 1.3.4 in Pfanzagl (1985) [pp. 30/1], applied with
a = b = 0. See also the remark on p. 21 concerning the condition on H . If G
is the inverse of H in a neighbourhood of 0, then the conditions (5.6.14) on H
imply G(0) = 0, G (0) = 1 and G (0) = −H (0). Hence, (5.6.16) and (5.6.15) are
equivalent.
If relation (5.6.9) holds true, relation (5.6.15) is fulfilled with
h t = pt / p − 1 and K = 0.
1
log pt / p = H (h t ) = tg + t 2 H (0)σ 2 + trt .
2
t2 2
log pt / p = tg − σ + trt . (5.6.17)
2
To establish LAN for i.i.d. products one may use relation (5.6.17). Since
n
n −1/2 rn −1/2 a (xν ) → 0
ν=1
t
( pt / p)1/2 = 1 + g + tst , with st2 d P → 0. (5.6.18)
2
We shall show that (5.6.18) is equivalent to (5.6.9) with (rt )t∈A fulfilling
condi-
tions (5.6.10)–(5.6.12). Notice that (5.6.18) implies gd P = 0 and g 2 d P < ∞.
According to Lemma 1.2.17 in Pfanzagl (1985) [p. 27], relation (5.6.18) is equivalent
to the condition that rt := st + 18 tσ 2 fulfills conditions (5.6.10)–(5.6.12). Hence
t 1
( pt / p)1/2 = 1 + g − t 2 σ 2 + trt (5.6.19)
2 8
is equivalent to Hellinger differentiability of t → pt / p with derivative g. As a con-
sequence of (5.6.19), relation (5.6.15) is fulfilled with
1
h t = 2(( pt / p)1/2 − 1) and K = − σ 2 .
4
Applied with H (u) = u(1 + u/4), relation (5.6.16) asserts that
1 2 1
pt / p − 1 = H (h t ) = tg + t − σ + H (0)σ + trt .
2 2
4 2
Since H (0) = 1/2, this is relation (5.6.9). Hence conditions (5.6.9), (5.6.17),
(5.6.18) and (5.6.19) are equivalent.
If condition (5.6.17) is the technically most convenient one, it is not so easy to
justify from the intuitive point of view for a given family {Pϑ : ϑ ∈ Θ}. Pfanzagl
(1982) is based on the more transparent condition (5.6.9) (see p. 23). Now generally
accepted is condition (5.6.18), i.e. Hellinger differentiability. What makes this con-
dition attractive is its provenance from a classical concept of differentiability. But is
this really a strong argument if the underlying distance function is rather artificial?
One reason for the general acceptance of Hellinger differentiability seems to be the
prestige of Le Cam who started its use in (1969) (see e.g. p. 94, Théorème).
Gradients
The functional κ : P → R is differentiable at P if there exists κ • ∈ L ∗ (P) such that
The tangent space describes the local properties of the family P; the gradient
describes the local properties of the functional κ. The Convolution Theorem (see
Sect. 5.13) shows that the canonical gradient of the functional κ determines the
bound for the concentration of regular estimator sequences.
Differentiability of the functional κ seems to be a natural condition for the exis-
tence of a reasonable estimator sequence. In fact, Bickel et al. (1993) [p. 183,
Theorem 5.2.3] claim the “equivalence of regularity and differentiability”, a result
going back to van der Vaart (1988). This claim is based on the joint convergence
of (n 1/2 (κ (n) − κ(P)), n −1/2 nν=1 g(xν )) for any g in the tangent set. Without this
popular but not operational assumption this is not true any more: Pfanzagl (2002a)
[pp. 266–268] presents a one-parameter family and a non-differentiable functional
admitting a regular estimator sequence with continuous limit distribution.
For parametric families {Pϑ : ϑ ∈ Θ}, Θ ⊂ Rk , it is natural to consider κ as a
functional on Θ instead of a functional defined on the family of probability measures
{Pϑ : ϑ ∈ Θ} with K (ϑ) := κ(Pϑ ). The canonical gradient of the functional κ may
now be written as
κ ∗ = K • Λ(ϑ)• (·, ϑ)
a functional investigated by von Mises in (1947) and in some earlier papers. After
some less satisfactory results of von Mises on the distribution of estimators of κ
(see Filippova, 1962, p. 24, for critical remarks), the following result was obtained
by Hoeffding (1948) in a paper “which was essentially completed before the paper
by von Mises (1947) was published” (see p. 306). Specialized for one-dimensional
functionals, Hoeffding’s Theorem 7.4, p. 309, proves that the estimator
−1
n
κ (n) (x1 , . . . , xn ) := ΣΨ (xi1 , . . . , xim ), (5.6.21)
m
with
(n) 1
Pϑ(n) ◦ log(d Pϑ+c
(n)
−1 d Pϑ ) ⇒ N − a
L(ϑ)a, a
L(ϑ)a
n a 2
and 1
(n) (n) (n)
Pϑ+c −1 ◦ log(d P −1 d Pϑ ) ⇒ N a L(ϑ)a, a (ϑ)a ,
n a ϑ+cn a 2
which is essential for certain asymptotic results on the concentration of estimator
sequences obtained from the Neyman–Pearson Lemma (see Sect. 5.11).
Le Cam’s beautiful Lemma inspired Hájek and Šidák (1967) [Sects. VI.1.2–
VI.1.4] to introduce what they called Le Cam’s 1st, 2nd and 3rd Lemma, an adaptation
useful in their particular framework, and it is this version in which Le Cam’s Lemma
now usually occurs in the literature. Le Cam’s original Lemma was extended to h n
with values in a metric space by Bickel et al. (1993) [p. 480, Lemma A.8.6]. Wit-
ting and Müller-Funk (1995) present the “three Lemmas”, and two versions of Le
Cam’s original Lemma on p. 326 as Satz 6.138. See also Bening (2000) [Sect. A, pp.
149–156].
Since Le Cam’s proof is opaque and the proof in Witting and Müller-Funk (with
5 references to auxiliary results) rather technical, we present the following lemma
which shows the basic idea.
Lemma 5.6.3 Let (Y, B) be a topological space, endowed with its Borel alge-
bra, and Q n |B, n ∈ N, a sequence of probability measures converging weakly to
5.6 Functionals on General Families 165
Proof The corollary follows from Lemma (5.6.3), applied with Q n = Pn ◦ (gn , pn )
and Q n = Pn ◦ (gn , pn ), since
for any measurable function
f : Rm × R+ → R. The condition q(u, v)Q
(d(u, v)) = 1 now becomes vQ(d(u, v)) = 1.
The essential point of Le Cam’s idea is to determine the limit distribution of a function
Sn under Pn from the limit distribution under Pn without knowing where this limit
distribution comes from.
For functions h, g ∈ L ∗ (P0 ), an elementary computation shows that
Pnn−1/2 g ◦ h̃ n ⇒ N hgd P0 , h 2 d P0 . (5.6.22)
This relation is basic for the g-regularity of asymptotically linear estimator sequences
with influence function h.
To illustrate Le Cam’s idea from 1960 (Le Cam 1960), we prove the following
generalization of (5.6.22), in which h̃ n is replaced by a function Sn .
166 5 Asymptotic Optimality of Estimators
Then the limit distribution of (Sn , g̃n ) under Pnn−1/2 g is N (μ, Σ), with
The second marginal of N (μ, Σ) i.e. the limit distribution of g̃n under Pnn−1/2 g , is
the well known N (Σ22 , Σ22 ). Of interest is the first marginal, N (Σ12 , Σ11 ), the limit
distribution of Sn under Pnn−1/2 g .
Applied for the special case Sn = h̃ n , the covariance matrix becomes
Σ11 = h d P0 , Σ22 =
2
g d P0 , Σ12 =
2
hgd P0 .
Proof According to (5.6.22), the density of Pnn−1/2 g with respect to P0n can be approx-
imated by exp[g̃n − 21 Σ22 ]. Therefore the density of Pnn−1/2 g ◦ g̃n with respect to
P0n ◦ g̃n is approximable by v → exp[v 21 Σ22 ]. If P0n ◦ (Sn , g̃n ) ⇒ Q, then Pnn−1/2 g ◦
(Sn , g̃n ) → Q with Q-density (u, v) → (u, exp[v − 21 Σ22 ]). Hence, if Q = N
(0, Σ), this leads to Q = N (μ, Σ) with μ1 = Σ12 , μ2 = Σ22 .
The conclusion from N (0, Σ) to N (μ, Σ) requires an elementary but some-
what tedious computation. Hint: Rewrite the 2-dimensional normal distribution with
covariance matrix Σ as
1 1
c exp − A11 u 2 + A12 uv − A22 v2
2 2
with
1 1 ρ 1
A11 = , A22 = , A12 = · .
(1 − ρ 2 ) (1 − ρ 2 )Σ22 1 − ρ 2 (Σ11 Σ22 )1/2
5.7 Adaptivity
The estimator sequence sn2 (x) = n −1 nν=1 (xν − x n )2 is asymptotically optimal for
estimating σ 2 in the family P = {N (μ, σ 2 ) : μ ∈ R, σ 2 > 0}. It is still asymptot-
ically optimal in any of the subfamilies Pμ := {N μ, σ 2 ) : σ 2 > 0}, μ ∈ R. That
means: Knowing μ does not help to obtain an estimator sequence for σ 2 which
is asymptotically better than sn2 . Within the realm of parametric families there
are many such examples. (Another example is the estimation of ρ in the family
5.7 Adaptivity 167
{N (μ1 , μ2 , σ12 σ22 , ρ) : μi ∈ R, σi2 > 0, ρ ∈ (−1, 1)}. Knowing that σ12 = σ22 does
not help to obtain an estimator sequence asymptotically superior to the correlation
coefficient. (See Pfanzagl 1982, p. 219, Example 13.2.4.)
This phenomenon of “adaptivity” did not find particular attention as long as it
occurred in parametric families. It became a worthwhile subject of statistical theory
as soon as it occurred in a general family. Let P be the family of all distributions on
B with a sufficiently smooth density; The problem is to estimate the median. Here
is a natural idea as to how a “good” estimator for the median may be obtained. The
simplest idea: Take the median of the sample, or: estimate the unknown density and
take the median of the estimated density as an estimator of the “true” median. If the
family of probability measures is large, no regular estimator sequence can be better
than the sample median. The situation changes dramatically as soon as it is known
that all densities are symmetric. Stein (1956) gave in his Sect. 4, pp. 190/1, a rough
sketch of how “adaptive” tests for the position of the median of a symmetric density
and for the difference in location and scale between two distributions of the same
unknown shape could be obtained.
It took almost fifteen years until Stein’s sketch was turned into a mathematically
solid result. The (equivalent) two-sample problem was solved independently by van
Eeden (1970) [p. 175, Theorem 2.1], Weiss and Wolfowitz (1970, Sect. 4, pp. 144/5)
and Beran (1974) [Theorem 3.1, p. 70]. The papers by van Eeden and Beran are based
on Hájek’s (1962) theorem on adaptive rank tests. They also assert the existence of
adaptive estimator sequences for the median of a symmetric distribution (their p.
180, Theorem 4.1, and p. 73, Theorem 4.1, respectively). See also the preliminary
version of van Eeden’s result, dating from (1968).
These basic papers were followed by a number of papers, doing what is usual in
mathematics: To prove a stronger assertion under weaker assumptions. (See Fabian,
1973, Sacks, 1975, and Stone, 1975.) There is also a questionable paper by Takeuchi
(1971): “We do not understand his argument” is what Weiss and Wolfowitz say (1970,
p. 149). The most impressive of these results is Theorem 1.1 in Stone (1975) [p. 268]
which asserts the existence of an adaptive translation and scale invariant estimator
of the median without any regularity conditions on the density except for absolute
continuity (and symmetry, of course).
Perplexing as the phenomenon of adaptivity in nonparametric families is, it obvi-
ously was not easy to find an appropriate conceptual framework. To discuss the results
mentioned above we use the conceptual framework of “semiparametric” models
introduced later by Bickel (1982). Let
P = {Pϑ,η : ϑ ∈ Θ, η ∈ H },
on some kind of locally uniform convergence to this limit distribution. Yet, he did not
care: “... in a sense, [the ML estimate] is the asymptotically best possible estimate”
(p. 188). Similarly, his definition of adaptivity (p. 192) “... if it is as difficult when the
form of the distribution is known as it is when the form of the distribution depends
in a regular way [?] on an unknown parameter.”
The succeeding authors, too, assume that the limit distribution of the ML-sequence
(or the Cramér–Rao bound) are bounds for the quality of estimator sequences, without
mentioning that such bounds are valid for “regular” estimator sequences only.
Most of the authors mentioned above are satisfied with estimator sequences
(ϑ (n) )n∈N attaining for every η ∈ H the optimal limit distribution N (0, ση2 (ϑ)) and
ignore the assumption of regularity (which is essential for the validity of ση2 (ϑ) as a
bound in Pη .) None of these authors pays attention to the preconditions under which
adaptivity is possible. In spite of the heuristic character of his paper, Stein arrives
with his “algebraic Lemma” (Sect. 3, pp. 188–190) somehow at the conclusion that
“orthogonality” is necessary for “adaptivity”. Bickel presents Stein’s “proof” that
“adaptivity” requires “orthogonality” in an intelligible form (see 1982, p. 651).
Bickel’s presentation is repeated in Fabian and Hannan, 1982, p. 474, Theorem 9
and in Bickel et al. 1993, pp. 28/9.
To arrive at a precise concept of “adaptivity”, let Pη := {Pϑ,η : ϑ ∈ Θ}, and let
N (0, ση2 (ϑ)) be the optimal limit distribution for estimator sequences of ϑ which are
“ϑ-regular” within Pη . Bickel’s definition of adaptivity (see 1982, p. 649) reads as
follows. The estimator sequence ϑ (n) , n ∈ N, is “adaptive” if it converges ϑ-regularly
to N (0, ση2 (ϑ)) for every η ∈ H , i.e. if
The same definition is accepted by Bickel et al. (1993) [p. 29, Definition 2.4.1]. It
is, roughly speaking, equivalent to “(ϑ, η)-regularity”, i.e. regularity w.r.t. P.
5.7 Adaptivity 169
n
(n) −1/2
n 1/2
(ϑ − ϑ) = n ση2 (ϑ)• (xν , ϑ, η) (Pϑ,η
n
n
).
ν=1
then
n
−1/2
n
Pϑ,η n
◦n ση2 (ϑ)• (xν , ϑ, η) ⇒ N (0, ση2 (ϑ)). (5.7.3)
ν=1
Yet,
n
−1/2
n
Pϑ,η+n −1/2 u ◦n ση2 (ϑ)• (xν , ϑ, η) ⇒ N (uση2 (ϑ)L 12 (ϑ, η), ση2 (ϑ)). (5.7.4)
ν=1
Hence (5.7.3) holds iff L 1,2 (ϑ, η) = 0, which is the orthogonality called for.
Conversely, if (ϑ (n) )n∈N is adaptive and ϑ-regular, we obtain from (5.7.1) and
(5.7.4) under the condition L 1,2 (ϑ, η) = 0 the relations (5.7.3) and (5.7.2). Hence
under this orthogonality condition any (in Bickel’s sense) adaptive and ϑ-regular
estimator sequence is automatically also η-regular. In a disguised form this occurs
in Fabian and Hannan 1982, p. 474, Theorem 7.10.
Applied to the problem of estimating the parameter ϑ of a family with den-
sity x → p(x − ϑ) this implies that any ϑ-equivariant (hence ϑ-regular) estimator
sequence which is efficient for every p is “robust” in the sense that its limit distri-
bution remains unchanged if the observations come from a density x → pn (x − ϑ)
where pn , n ∈ N, tends to p from an orthogonal direction. Beran (1974, Remark on
p. 74, 1978, p. 306, Theorem 4 and various other papers) seems to have been the first
170 5 Asymptotic Optimality of Estimators
This is the case if κ ∗ (·, P) = κ0∗ (·, P), i.e. if κ ∗ (·, P) ∈ T (P, P0 ).
The idea that some kind of “orthogonality” is necessary for adaptivity goes back
to Stein (1956). It occurs in the papers by Bickel and by Fabian and Hannan, mainly
in connection with a special type of models: The estimation of the parameter ϑ for
families Pϑ,η , with η known or unknown.
Does “orthogonality” also play a similar role in general models? Let P be a
general family, and κ : P → R the functional to be estimated. Can we characterize
the subfamilies P0 on which the estimation of κ is as difficult as on P? The answer
is affirmative if the restriction from P to P0 is based on a side condition of the
following type: There is a differentiable functional γ : P → H such that
An estimator sequence which is optimal for κ on P (i.e. T (P0 , P)-regular and asymp-
totically linear with influence function κ ∗ (·, P)), is optimal on the family P0 iff
∗
κ (·, P) ∈ T (P, P), i.e. γ ∗ (·, P)κ ∗ (·, P)d P = 0.
This is the orthogonality condition which is necessary and sufficient for adaptivity,
or: Optimal estimator sequences on P cannot be improved if it is known that P0
belongs to the smaller subfamily P0 .
In general, the knowledge that the “true” probability measure belongs to a given
subfamily P0 , can be utilized to obtain an asymptotically better estimator sequence
for κ(P). But again, if P ∈ P0 and
κ ∗ (·, P) ∈ T (P, P0 ),
then κ ∗ (·, P), the canonical gradient of κ in P, is also the canonical gradient in P0 ,
and N (0, σ 2 (P)) (with the same σ 2 (P)) is also the optimal limit distribution for
P0 -regular estimator sequences. In this case, reducing the condition of P-regularity
to P0 -regularity is not effective. A P-regular estimator sequence (which is a fortiori
P0 -regular) cannot be replaced by a better P0 -regular estimator sequence.
Let now κ0(n) , n ∈ N, be a P0 -regular estimator sequence which is optimal in P0 .
This implies
n 1/2 (κ0(n) − κ(P)) = κ̃0n
∗
(·, P) + o(n 0 , P n ),
This implies that the estimator sequence κ0(n) , n ∈ N, is regular with respect to every
h ∈ T (P, P): Any estimator sequence which is P0 -regular and optimal in P0 is also
P-regular (and therefore optimal in P).
For the purpose of illustration, we specialize the general considerations to a
semiparametric model in which P consists of all probability measures Pϑ,η |B with
Lebesgue density
x → pη (x − ϑ)
172 5 Asymptotic Optimality of Estimators
with ϑ ∈ R, where pη is a density symmetric about 0. For this model, the tangent
space T (Pϑ,η , P) consists of the functions of the form aη (x − ϑ) + Ψ (x − ϑ),
where η (x) = ∂x log pη (x) and Ψ is symmetric about 0 with Ψ (x)Pϑ,η (d x) = 0.
The functional κ(Pϑ,η ) = ϑ has in P the canonical gradient κ ∗ (·, ϑ, η) = σ (η)2 η
(x − ϑ) with σ (η)2 = ( (η )2 d Pϑ0 ,η )−1 .
If the density pη0 is known, then the tangent space T (Pϑ,η0 , Pη0 ) reduces to
{aη0 (· − ϑ) : a ∈ R}, which still contains x → σ 2 (η0 )η0 (x − ϑ), the canonical
gradient of the functional κ(Pϑ,η ) = ϑ in T (Pϑ,η0 , P). If an estimator sequence ϑ (n) ,
n ∈ N, is ϑ-equivariant and optimal in Pη0 , it is regular in P, and even locally
robust for certain directions, not necessarily in T (P, P), which are orthogonal to
x → η0 (x − ϑ).
The Convolution Theorem provides precise information about the optimal limit
distribution, based on the concepts of “tangent space” and “gradient”. The con-
cepts “tangent space” and “gradient” are somehow descendants of Stein’s ideas.
Yet it appears that something of Stein’s intuitive ideas is lost, namely: An estima-
tor sequence which is regular and optimal in the least favourable subfamily is also
regular, hence also optimal, in the whole family.
To verify Stein’s idea, and to elaborate on an aspect neglected by Stein, we apply
his ideas to a “large” parametric family, and we study the performance of estimator
sequences in one-parameter subfamilies. Let P = {Pϑ : ϑ ∈ Θ}, Θ ⊂ Rk , be an
LAN family with tangent space T (Pϑ , P) spanned by (1) (·, ϑ), . . . , (k) (·, ϑ), i.e.
n
n
log(d Pϑ+n ˜• 1 a L(ϑ)a + o(n 0 , Pϑn ).
−1/2 a d Pϑ ) = a (·, ϑ)n −
2
k
T (P, P̃) = a ϑi (i) (·, ϑ) : a ∈ R
i=1
k
1
k
log(d P̄nn−1/2 b d Pϑn ) = b ϑi ˜(i) (·, ϑ) − b2 ϑi ϑ j L i, j (ϑ) + o(n 0 , Pϑn ).
i=1
2 i, j=1
We now consider estimation of κ̃(t) := κ(ϑ1 (t), . . . , ϑk (t)) within the family
P̃. According to the Convolution Theorem, the optimal estimator sequence among
regular estimator sequences in this family has at t = 0 the stochastic expansion
n
n 1/2 (κ (n) − κ̃(0)) = n −1/2 κ0∗ (xν ) + o(n 0 , Pϑn )
ν=1
5.7 Adaptivity 173
with
k −1
k
k
κ0∗ = ϑi ϑ j L i, j (ϑ) κ (i) ϑi ϑr (r ) (·, ϑ). (5.7.6)
i, j=1 i=1 r =1
k −1
k 2
ϑi ϑ j L i, j (ϑ) κ (i) ϑi .
i, j=1 i=1
k
The variance attains its maximal value r,s=1 κ (r ) κ (s) Λr s (ϑ) if
k −1
k
ϑi = κ (r ) κ (s) Λr s (ϑ) Λi j (ϑ)κ ( j) . (5.7.7)
r,s=1 i=1
k
This follows from the Schwarz inequality if we rewrite i=1 κ (i) ϑi as
k
k
ϑi (i) (·, ϑ) Λr s (ϑ)κ (r ) (s) (·, ϑ)d Pϑ .
i=1 r,s=1
Hence the subfamily ϑ(t) with ϑi given by (5.7.7) is the least favorable one.
Since T (P, P̃) ⊂ T (P, P), the minimal asymptotic variance of T (P, P)-regular
estimator sequences is “larger” than the minimal asymptotic variance of any one-
dimensional subfamily, hence in particular larger than i,k j=1 κ (i) κ ( j) Λi j (ϑ), the
minimal asymptotic variance of the least favourable subfamily.
Now comes the point neglected by Stein: The estimator sequence which is optimal
among the estimator sequences regular in the least favorable subfamily is T (P, P-
regular, i.e., regular in the whole subfamily.
The canonical gradient κ0∗ for the subfamily { P̄t : t ∈ R} given by (5.7.6) reduces
for the least favorable subfamily defined by (5.7.7) to
k
κ∗ = Λi j (ϑ)κ (i) ( j) (·, ϑ).
i, j=1
Since
(r )
κ = κ ∗ (r ) (·, ϑ)d Pϑ for r = 1, . . . , k,
κ ∗ is also the canonical gradient of κ in T (P, P). Therefore any estimator sequence
with stochastic expansion
174 5 Asymptotic Optimality of Estimators
n
n 1/2 (κ (n) − κ(ϑ)) = n −1/2 κ ∗ (xν , ϑ)
ν=1
is T (P, P)-regular.
With ML-sequences in mind one might think that regularity conditions on the family
of probability measures is all one needs for assertions on the asymptotic performance
of the estimator sequences. If one turns to arbitrary estimator sequences it becomes
evident that meaningful results can be obtained only under conditions on the estimator
sequences (such as uniform convergence) which are automatically fulfilled for ML-
sequences under conditions on the family of probability measures.
The purpose of such regularity conditions on the estimator sequences is to make
sure that the limit distribution provides information on the true distribution of the
estimator for large samples. Ideally, this means that the convergence of
Q (n)
P := P
(n)
◦ cn (κ (n) − κ(P)), n ∈ N,
is the residual (Le Cam would, perhaps, say a ghost) of the operationally meaning-
ful condition of “uniform convergence”. It is convenient from the technical point of
view, and it suffices to establish bounds for the concentration of estimator sequences.
Remark Hájek’s concept of “regular convergence” might also be interpreted as a
localized version of the idea to consider the performance of an estimator sequence
ϑ (n) within a one-dimensional subfamily Pϑ(t) : t ∈ R, approaching Pϑ from the
direction a ∈ Rk if t −1 (ϑi (t) − ϑi ) → ai for i = 1, . . . , k.
The choice of such a family is straightforward. If Pϑ is sufficiently regular, the
function ϑ(t) should be sufficiently smooth, say differentiable at t = 0. How should
one choose a one-parameter subfamily, say {Pt : t ∈ A} with Pt in a general family
P? The essential point: For assertions about the asymptotic distribution of estimator
sequences ϑ (n) requires (in the i.i.d. case) an approximation to (Ptn )n∈N for t close
to 0.
How could the idea of “regular convergence”, defined for parametric families, be
carried over to general families? Should we consider estimator sequences which are
regularly convergent on all (sufficiently regular) subfamilies, or should we consider
estimator sequences which converge regularly with respect to all directions in the
tangent space T (P, P)? (See Sect. 5.6 for details.)
A natural way to introduce such a tangent space is to start from a family of
measures Pt , the P0 -density of which can be approximated by
1 2 2
d Pnn−1/2 t /d P0n = exp t g̃0 − t σ with σ = g02 d P0 .
2
2
ϑ → ϑ0 implies f d Q (n)
ϑ − f d Q (n)
ϑ0 → 0
Versions of Theorem 5.8.1 and Corollary 5.8.2 for general families are given in
Pfanzagl (2003) [p. 109].
Regularly Attainable Limit Distributions are Continuous
In the following section, we consider functions P → Q P from P to the family of
probability measures on Bm . By continuity we mean that the map P → f d Q P
is continuous for every bounded and continuous function f, if P is endowed
with
the sup-distance. In other words, that d(Pn , P0 ) → 0 implies
f d Q Pn → f d Q P0 .
Recall that continuity of P → Q P implies that P → f d Q P is lower [upper] semi-
continuous if f is bounded and lower [upper] semicontinuous. (Ash 2000, p. 122,
Theorem 2.8.1.)
If P is a parametric family {Pϑ : ϑ ∈ Θ}, Θ ⊂ Rk , we write Q ϑ for Q Pϑ , and the
continuity of ϑ → Q ϑ follows if ϑ → Pϑ is continuous with respect to the Euclidean
distance in Θ and the sup-distance in P, i.e., if
ϑ → Q (n)
ϑ = Pϑ ◦ n
n 1/2
(ϑ (n) − ϑ)
was first established by Rao (1963) [p. 196, Lemma 2(i)] under the unnecessarily
restrictive condition that Pϑ has a density such that ϑ → p(x, ϑ) is continuous for
every x ∈ X . From this he inferred that a (normal) limit distribution of Q (n)
ϑ , n ∈ N, is
continuous if the convergence is uniform. Similar results occur in Wolfowitz (1965)
[p. 254, Lemma 2].
Since continuity of the limit distribution is important in various connections, we
present the following result on the continuity of P → Q (n) P in greater generality.
The Convolution Theorem (Sect. 5.13) gives a satisfactory bound for the asymptotic
concentration of “regular” estimator sequences. Before we present it, we give a survey
of earlier attempts.
A Metaphysical Approach
We omit the attempts of Fisher at this problem that culminate in statements like (see
Fisher 1925, p. 714):
The efficiency of a statistic is the ratio of the intrinsic accuracy of its random sampling
distribution to the amount of information in the data from which it has been derived.
(i) If
lim sup (n 1/2 (ϑ (n) − ϑ0 ))d Pϑn0 < d N (0, 1/I (ϑ0 ))
n→∞
lim sup sup (n 1/2 (ϑ (n) − ϑ))d Pϑn > d N (0, 1/I (ϑ0 )).
n→∞ |ϑ−ϑ0 |<δ
(n)
lim sup (n 1/2
(ϑ − ϑ))d Pϑn ≤ d N (0, 1/I (ϑ)) for every ϑ ∈ Θ,
n→∞
(5.9.1)
then equality holds in (5.9.1) for λ-a.a. ϑ ∈ Θ. (Le Cam 1953, p. 314, Corol-
lary 8.1.)
Though Rao cites Le Cam (1953) in each of his papers, he ignores the main results
for this paper, except for the Hodges-example of superefficiency. His declared goal
was to find a concept of asymptotic efficiency which excludes superefficiency, but
he had no clear idea how this could be done. Ignoring the main corpus of Le Cam’s
paper, Rao started where Fisher had stopped almost forty years ago.
Rao (1961) [p. 537] suggested five equivalent (?) definitions of asymptotic effi-
ciency. Definition (ii) of asymptotic efficiency requires Iϑ (n) (ϑ) → I (ϑ). This defi-
nition is based on the inequality
• 2
with Iϑ (n) (ϑ) := qn (y, ϑ) qn (y, ϑ) Q (n)
ϑ (dy), which is, in Fisher’s interpreta-
tion, the “amount of information per observation” contained in ϑ (n) . Here qn (·, ϑ)
is the density of Q (n)
ϑ := Pϑ ◦ n
n 1/2
(ϑ (n) − ϑ). The inequality (5.9.2) is based on the
•
fact that qn (·, ϑ)/qn (·, ϑ) is a conditional expectation of
n
(x1 , . . . , xn ) → p • (xν , ϑ)/ p(xν , ϑ),
ν=1
given n 1/2 (ϑ (n) − ϑ), with respect to Pϑn . Doob (1936) [p. 415, Theorem 2] gave a
precise but somewhat disorganized proof of this inequality. Rao’s proof (1961, p.
534, Lemma 1 (iii)) is less precise, but more transparent.
Not being a master of concepts like “conditional expectation”, Fisher had to resort to a notion
like “summing over all (x1 , . . . , xn ) for which Tn (x1 , . . . , xn ) attains the same value”. That
Doob makes no use of “conditional expectations” is the more surprising since just two years
later he presented a paper including a detailed chapter on “conditional probability” (1938,
Sect. 3). This, by the way, is the paper notorious for its wrong theorem. On p. 96, Theorem
3.1, Doob tries to generalize Kolmogorov’s theorem the existence of a regular conditional
5.9 Bounds for the Asymptotic Concentration of Estimator Sequences 179
probability for probability measures on (R, B) to more general measurable spaces. In this
paper he had overlooked that the proof requires some kind of compact approximability
(which is, for instance, given in Polish spaces).
for some functions α and β (with α = 0 in most of the subsequent papers). From
then on Rao takes (5.9.3) as the definition of asymptotic efficiency. This corresponds
to the principle that (see Rao 1961, p. 532)
the efficiency of a statistic has to be judged by the degree to which the estimate provides an
approximation to [˜• (·, ϑ)].
In 1963, p. 200, Rao praises his concept for being “... not explicitly linked with any
loss functions” (Rao 1963). This is certainly true: It fails to reflect any meaningful
property of the estimator sequence (ϑ (n) )n∈N .
Remark One can all but admire Rao’s courage to define 2nd order efficiency before
he had established
a plausible concept of 1st order efficiency. From aformal relation
like ˜• (·, ϑ) − α(ϑ) + β(ϑ)n 1/2 (ϑ (n) − ϑ) + λ(ϑ)n 1/2 (ϑ (n) − ϑ)2 → 0 (see Rao
1961, p. 532, 1962a, p. 49, 1963, p. 199), one can hardly expect a refined information
about the performance of estimator sequences, let alone a result like “first order
efficiency implies second order efficiency ”. Le Cam (1974) [p. 233] politely says:
The reader may also have noticed that we did not mention the concept introduced by C.R.
Rao under the name of “second order efficiency”.
without specifying the value of β(ϑ). It appears that if [this condition] is satisfied for various
values of β(ϑ), then it is desirable to choose an estimator
for which β(ϑ) is a minimum
which is shown to be ( • (·, ϑ)2 d Pϑ )−1/2 [recte ( • (·, ϑ)2 d Pϑ )−1 ]
It escaped Rao’s attention that under the condition of uniform convergence, the
factor β(ϑ) is uniquely determined (as 1/I (ϑ)) (provided I is continuous). (Hint:
Use that ˜• (·, ϑ + n −1/2 a) − ˜• (·, ϑ) → a • (·, ϑ)2 d Pϑn , and that Pϑ+n
n
−1/2 a and
n
Pϑ are mutually absolutely continuous.) Rao’s misconception of the role of uniform
convergence is the more surprising since, in the very same paper (Rao 1963), he
uses uniform convergence to establish (by means of the Neyman–Pearson Lemma)
1/I (ϑ) as the optimal asymptotic variance (see p. 196, Lemma 2 (ii)).
180 5 Asymptotic Optimality of Estimators
There is a number of authors who mention relation (5.9.3) (together with its
extension to second order efficiency) (Zacks 1971, p. 207 and p. 243; Schmetterer
1966, p. 416/7 and 1974, p. 341; Hájek 1971, p. 161, and 1972, p. 178; Ghosh and
Subramanyam 1974, p. 331; Ibragimov and Has’minskii 1981, p. 102), but they make
no use of it, nor do they question its interpretation, in particular the role of β.
Rao’s Definition (5.9.3) requires basing the concept of asymptotic efficiency on
ρ(ϑ), the asymptotic correlation between n 1/2 (ϑ (n) − ϑ) and ˜• (·, ϑ). Rao suggests
to take ρ(ϑ)2 as a measure of asymptotic efficiency, hence ρ(ϑ) = 1 as a criterion of
optimality. Whatever “approximation by ˜• (·, ϑ) of the estimate” means—the corre-
lation between n 1/2 (ϑ (n) − ϑ) and α(ϑ) + β(ϑ)˜• (·, ϑ) is certainly not an adequate
expression for this approximation, nor is this correlation in a meaningful relation to
the asymptotic concentration of n 1/2 (ϑ (n) − ϑ).
To see that neither the relation Iϑ (n) (ϑ) → I (ϑ) nor ρ(ϑ) = 1 is a meaningful
concept of asymptotic efficiency, one might consider the example of the Hodges
estimator ϑ̂a(n) . A somewhat tedious computation shows that the Lebesgue density of
N (ϑ, 1)n ◦ ϑ̂a(n) is
n 1/2 ϕ(n 1/2 (y − ϑ)) > −1/4
qn (y, ϑ) := if |y| n .
n ϕ(n 1/2 ( ay − ϑ))
1 1/2
a
<
Remark There is, by the way, an even more mystical concept of “optimality”.
n
Godambe (1960) [and many more papers] calls an estimating2equation ν=1 g(x ν,
ϑ) = 0 with g(·, ϑ)d Pϑ = 0 “optimal” if the variance σ (ϑ) of g(·, ϑ)/ ∂ϑ
g(·, ϑ)d Pϑ is minimal. Accordingly, estimators derived from an optimal estimat-
ing equation are “optimal”—by definition. In this sense, the estimating equation
based on g(·, ϑ) = ∂ϑ log p(·, ϑ) is optimal, and this establishes the optimality of
the ML-sequence (for every finite sample size!). One can hardly disagree with Hájek
(1971) [p. 161], who says that
5.9 Bounds for the Asymptotic Concentration of Estimator Sequences 181
Professor Godambe’s suggestion how to prove the “optimality” of the maximum likelihood
estimate for any finite n is ... not convincing enough.
Rao (1961) [p. 196, Lemma 2] gives conditions under which Pϑn ◦ n 1/2 (ϑ (n) −
ϑ) ⇒ N (0, σ 2 (ϑ)) locally uniformly implies that σ 2 is continuous and σ 2 (ϑ) ≥
1/I (ϑ). The inequality is proved using test theory. Though Rao considered the con-
dition of uniform convergence as an important idea (see Sect. 5.6 for the question of
priority), he did not fully exploit its impact. With a more subtle use of uniform con-
vergence, he could have obtained the inequality σ 2 (ϑ) ≥ 1/I (ϑ) earlier and without
recourse to test theory. In (Rao 1961, p. 534, assumption 4, 1962b, p. 80, and 1963, p.
198) he assumes that Pϑn ◦ (n 1/2 (ϑ (n) − ϑ), ˜• (·, ϑ)), n ∈ N, converges to a normal
limit distribution N (0, Σ(ϑ)) of the form
σ 2 σ12
Σ= , (5.9.4)
σ12 I
with I (ϑ) := (• (·, ϑ))2 d Pϑ .
Remark The existence of a limit distribution for Pϑn ◦ (n 1/2 (ϑ (n) − ϑ), ˜• (·, ϑ)),
n ∈ N, is guaranteed if ϑ (n) , n ∈ N, is regular (see van der Vaart 1991, p. 181,
Theorem 2.1 and p. 198, Lemma A.1). Restrictive is the assumption that this joint
distribution is normal, an assumption stronger than the asymptotic normality of n 1/2
(ϑ (n) − ϑ).
Assumption (5.9.4) is used in the somewhat mysterious Lemma 4, p. 198, to show
that the correlation between n 1/2 (ϑ (n) − ϑ) and ˜• (·, ϑ) is unity iff n 1/2 (ϑ (n) − ϑ) −
˜• (·, ϑ)/I (ϑ) → 0 (Pϑn ). A more subtle use of assumption (5.9.4) entails a much
stronger result, namely: n 1/2 (ϑ (n) − ϑ) − ˜• (·, ϑ)/I (ϑ) and ˜• (·, ϑ) are asymptot-
ically stochastically independent. Though confined to asymptotically normal esti-
mator sequences, this indicates what the essence of the Convolution Theorem is in
general: The stochastic independence between n 1/2 (ϑ (n) − ϑ) − ˜• (·, ϑ)/I (ϑ) and
˜• (·, ϑ) (see (5.13.2)). It would have been easy for Rao to prove, at a moderate level
of rigor, that (5.9.4) implies
n
Pϑ+n −1/2 a ◦ (n
1/2
(ϑ (n) − ϑ), ˜• (·, ϑ)) ⇒ N ((aσ12 , a 2 I ) , Σ). (5.9.5)
Beyond that, Rao (1963) cites Le Cam (1960), who provides in Theorem 2.1 (6), p.
40, the basis for a precise proof. (This theorem is, by the way, the source from which
Hájek and Šidák (1967) [p. 208] extracted “Le Cam’s 3rd Lemma”.)
Relation (5.9.5) implies
n
Pϑ+n −1/2 a ◦ n
1/2
(ϑ (n) − (ϑ + n −1/2 a)) ⇒ N (a(σ12 − 1), σ 2 ). (5.9.6)
with 2
σ − I −1 0
Σ̄ = ,
0 I
which establishes I −1 as a lower bound for the asymptotic variance of regular (asymp-
totically normal) estimator sequences. At the same time, it asserts the asymptotic sto-
chastic independence between n 1/2 (ϑ (n) − ϑ) − ˜• (·, ϑ)/I (ϑ) and ˜• (·, ϑ), n ∈ N.
Moreover, σ12 = 1 in (5.9.4) implies that ρ(ϑ), the asymptotic correlation
between n 1/2 (ϑ (n) − ϑ) and ˜• (·, ϑ), is I (ϑ)−1/2 /σ (ϑ). Hence
is an adequate measure for asymptotic efficiency. This justifies Rao’s claim (1962b,
p. 77, Definitions) for the case of regular, asymptotically normal estimator sequences.
First solid results
Intrinsic bounds for the asymptotic concentration of estimator sequences for a func-
tional κ at P0 ∈ P depend on the local properties of P and κ at P0 . As the example
of Hodges convincingly demonstrated, regularity conditions on the family P and
on the functional κ are not enough to find a reasonable concept of “asymptotic effi-
ciency”: It would need conditions on the estimator sequence to avoid the “evil of
superefficiency” (Has’minskii and Ibragimov 1979, p. 100).
The now common concept of “regularity” appears as a deus ex machina. It is,
in fact, the outcome of the conceptual development that took place in the decade
between 1960 and 1970. From the beginning it was clear that the convergence of
Q (n)
ϑ = Pϑ ◦ n
n 1/2
(ϑ̂ (n) − ϑ) to a limit distribution Q ϑ (which was usually assumed
to be normal) needed to be “uniform” in some asymptotic sense. It was, however,
not clear how this uniformity could be expressed adequately.
To show that ML-sequences are asymptotically optimal was the purpose of the
asymptotic considerations; but there was no clear concept for the optimality of mul-
tivariate limit distributions. The outcome of these endeavors was that “regularity” in
the sense of definition is the adequate expression for locally uniform performance
of the estimator sequences. The maximal concentration on convex sets symmetric
about the origin came out as quite a surprise.
Following the course of history, we start with the estimation of ϑ in one-parameter
families {Pϑ : ϑ ∈ Θ}, Θ ⊂ R. The intention initially was to find an asymptotic
bound for the concentration of estimators; a side result was that the limit distribution
depends continuously on ϑ.
The first result is due to Rao: If Pϑn ◦ n 1/2 (ϑ (n) − ϑ) ⇒ N (0, σ 2 (ϑ)) uniformly
on compact subsets of Θ, then (Rao 1963, p. 196, Lemma 2(i)) the map ϑ → σ 2 (ϑ)
is continuous if the Lebesgue density p(·, ϑ) of Pϑ is a continuous function of ϑ, and
furthermore (p. 196, Lemma 2(ii)) we have σ 2 (ϑ) ≥ 1/I (ϑ). (Observe a misprint
5.9 Bounds for the Asymptotic Concentration of Estimator Sequences 183
in Lemma 2(ii). Read ≥ instead of ≤.) According to Rao’s Lemma 3, p. 197, the
equality σ 2 (ϑ) = 1/I (ϑ) is achieved if
extends to general families P endowed with the sup-metric d and estimator sequences
κ (n) for a d-continuous functional κ : P → R p fulfilling
(see p. 251, relation (2.1)) for any estimator sequence such that Q (n)
ϑ (−∞, t], n ∈ N,
converges to Q ϑ (−∞, t], uniformly in ϑ and t. Unlike Rao and Schmetterer he shows
that ML-sequences converge in this sense to N (0, 1/I (ϑ)) (see p. 253). Hence ML-
sequences are optimal in the class of all estimator sequences fulfilling Wolfowitz’s
conditions of uniform convergence.
When Wolfowitz submitted his paper in 1965 he was still unaware that lower and
upper medians are identical, a result which had been obtained in the meantime by
Kaufman (see Wolfowitz 1965, p. 259, footnote). In our presentation of the results
of Schmetterer and Wolfowitz this simplification has been taken into account. Sur-
prisingly, Roussas (1972) still struggles with lower and upper medians (see p. 130).
184 5 Asymptotic Optimality of Estimators
Schmetterer (1966) claims that Theorem 2.2, p. 308, based on the condition of
regular convergence, generalizes Rao’s result. In fact, his assumption of regular
convergence for every ϑ ∈ Θ is, on a locally compact parameter set Θ, equivalent
to Rao’s assumption of uniform convergence on compact subsets. This seems to
have escaped Schmetterer’s attention. Schmetterer uses the condition of continuous
convergence for two different purposes: To show that estimator sequences converge to
N (0, σ 2 (ϑ)) continuously (less would have done) and to show that σ 2 is continuous
(which follows from the fact that all limit distributions of uniformly convergent
estimator sequences are continuous). Schmetterer requires (see p. 307) that ϑ →
• (x, ϑ) is continuous, a fact already familiar to Rao (1963) [p. 196, Lemma 2(i)]
and Wolfowitz (1965) [p. 254, Lemma 2].
Still, the achievements of Rao and Wolfowitz are confined to one-parameter
families. The next step followed soon: This was the paper by Kaufman (1966),
which made all preceding results look poor. He shows that n 1/2 (ϑ̂ (n) − ϑ)n∈N and
n 1/2 (ϑ (n) − ϑ̂ (n) )n∈N are asymptotically independent if the estimator sequence ϑ (n) ,
n ∈ N is sufficiently regular and (ϑ̂ (n) )n∈N is a ML-sequence. This discovery is the
basis for an operational concept of multivariate optimality.
Some Properties of Limit Distributions
In this section we summarize properties of limit distributions which had been obtained
by various authors as side results. The conditions in these papers are not easy to
compare since continuity of the limit distribution usually occurs as a side result, and
the regularity conditions are chosen with a different main result in mind.
Regularly attainable limit distributions have a Lebesgue density. Convolution
products inherit certain properties of the factors: Q 1 ∗ Q 2 is nonatomic or absolutely
continuous with respect to the Lebesgue measure if at least one of the factors has
this property (see e.g. Lukacs 1960, p. 45, Theorem 3.3.2). Hence limit distributions
which are regularly attainable are, as a consequence of the Convolution Theorem,
absolutely continuous with respect to the Lebesgue measure. Of historical interest are
properties of limit distributions which were known prior to the Convolution Theorem.
Wolfowitz (1965) [p. 253, Lemma 1] asserts that uniformly attainable limit distri-
butions have continuous distribution functions (for Θ ⊂ R). According to Kaufman
(1966) [p. 173, Lemma 5.3 and p. 174, Lemma 5.4], they have a positive Lebesgue
density for Θ ⊂ Rk .
The following more general result can be found in Pfanzagl (1994) [p. 229],
Proposition 7.1.11): Let Θ be an open subset of Rk , and κ : Θ → R p a differentiable
functional. Then
Q (n)
ϑ+c−1 a
(n)
= Pϑ+c −1 ◦ cn (κ
a
(n)
− κ(ϑ + cn−1 a)) ⇒ Q ϑ
n n
(n) (n)
for every a ∈ Rk implies that Q ϑ λk if (Pϑ+c −1 )n∈N is contiguous to (Pϑ )n∈N ,
n a
for every a ∈ Rk .
5.9 Bounds for the Asymptotic Concentration of Estimator Sequences 185
Since am ≤ ε, there exists a convergent subsequence, say (am )m∈N0 → a. Because
of (5.9.10), a = ε. We have
D(Q ϑm , Q ϑ0 ) ≤ D(Q ϑm , Q (n m) (n m )
ϑm ) + D(Q ϑm , Q ϑ0 ).
hence
When Kaufman wrote his fundamental paper (1966) the problem was clear: There
was an intuitively convincing method for the construction of estimators, the ML
method. ML estimators were generally considered as “optimal”. Yet, outstanding
scholars had failed in their attempts to prove this optimality. In fact, nobody had an
idea what optimality could mean for multivariate estimator sequences.
Kaufman gave regularity conditions for the (uniform) convergence of ML-
sequences of a k-dimensional parameter ϑ to the (already well known) normal limit
distribution with covariance matrix Λ(ϑ) = L(ϑ)−1 , and he showed that Λ(ϑ) is the
optimal covariance matrix for estimator sequences converging uniformly on compact
sets of Θ. His paper established the concept of multidimensional optimality of limit
distributions as maximal concentration on sets that are convex and symmetric about
the origin.
In his paper from 1970, Hájek approached the problem of asymptotic optimality of
estimator sequences from a different point of view: Instead of presenting an estimator
sequence and proving its asymptotic optimality, he started from “regularity” as the
essential property of estimator sequences, and he gave a bound for the concentration
of regular estimator sequences. (His Theorem refers to general LAN-sequences and
is, therefore, not restricted to the i.i.d. case.)
In Kaufman’s paper, the optimality of the ML-sequence follows from the fact
that any convergent estimator sequence is a convolution product involving the limit
distribution of the ML-sequence. Hájek’s Theorem provides a bound for the con-
centration of regular estimator sequences; yet it remains to be shown that regular
estimator sequences attaining this bound do exist.
The idea to replace “uniform convergence on compact subsets of Θ” by “regular-
ity”, i.e. by the convergence of Q (n)ϑ along sequences ϑ + n
−1/2
a is now generally
accepted. (See Ibragimov and Has’minskii 1970; Bickel et al. 1993; Witting and
Müller-Funk 1995.) It was, in fact, regular convergence that was implicitly used by
many authors. Starting from uniform convergence of Pϑn ◦ n 1/2 (ϑ (n) − ϑ) to a limit
distribution Q ϑ , they used convergence on paths ϑ + n −1/2 a only. (C.R. Rao, 1963,
p. 196, Lemma 2, and Bahadur 1964, p. 1546, Proposition 1, and many more.) In
test theory, the use of Pϑ+n −1/2 a for approximations to the power function turned up
quite early (see e.g. Eisenhart 1938, p. 32).
According to the Convolution Theorem, “regular convergence” is strong enough
for obtaining a bound for the concentration of estimator sequences. Yet regularly
convergent sequences Q (n) ϑ , n ∈ N, may have unpleasant properties.
(i) Regular convergence of Q (n)ϑ to Q ϑ at ϑ0 does not imply that Q ϑ is continuous
at ϑ0 .
That regular convergence is not strong enough to imply continuity of the limit dis-
tribution was first observed by Tierney (1987) √ [p. 430]. Beware of a misprint in
line 20 of p. 430: Replace ϑ by ϑn in L ( n(Tn − ϑ)|ϑn )). For P = {N (ϑ, 1) :
ϑ ∈ R}, and the estimator ϑ (n) (xn ) := (1 − n −1/2 )x n + n −1/2 x1 if |x n | < 1/ log n
188 5 Asymptotic Optimality of Estimators
and ϑ (n) (xn ) := x n otherwise, one easily finds that limn→∞ Q (n)ϑ0 +n −1/2 a is N (0, 2) for
ϑ0 = 0 if a ∈ R, and N (0, 1) otherwise.
(ii) Even if Q ϑ is continuous at ϑ0 , regular convergence of Q (n) ϑ to Q ϑ at ϑ0 does
(n)
not imply that Q ϑn → Q ϑ0 for arbitrary sequences ϑn → ϑ0 .
In spite of the fact that Q (n)
ϑ0 +n −1/2 a (A) → Q ϑ (A) for every set A ∈ B with boundary
zero under Q ϑ0 , it is not excluded that
Q (n)
ϑn (A) → 0 for every bounded set A ∈ B
k
and
Pϑnn ◦ (n 1/2 (ϑ (n) − ϑn ) − u n ) ⇒ N0 ifn 1/2 ϑn > 2ωn . (5.10.2)
From this the assertions follow easily. (To prove the discontinuity of the regularly
attainable limit distribution use u n = u for n ∈ N.)
For convenience we introduce the random variable ξ := n 1/2 (ϑn − ϑ) which is
under Nϑn distributed as N0 . Hence the distribution of n 1/2 (ϑ (n) (x) − ϑ) under Nϑn
is the same as the distribution of
ξ ≤
Hϑ(n) (ξ ) := if |n 1/2 ϑ + ξ | ωn
ξ + un >
under N0 . Since
N0 {ξ : Hϑ(n)
n
(ξ ) = ξ } → 1 ifn 1/2 ϑn , n ∈ N, is bounded
and
N0 {ξ : Hϑ(n)
n
(ξ ) = ξ + u n } → 1 ifn 1/2 ϑn > 2ωn ,
lim inf P n {n 1/2 (κ (n) − κ(P))/σ (P) < t} = 1/2 for every t > 0,
n→∞ P∈P0
The following version of the Neyman–Pearson Lemma refers to the case of mutually
absolutely continuous probability measures Pi |(X, A ), i = 1, 2, with densities pi .
Definition 5.11.1 assumes nothing about the value of ϕ0 (x) if p1 (x) = cp0 (x).
190 5 Asymptotic Optimality of Estimators
It is easily seen that for any α ∈ (0, 1) there exists a critical function ϕ0 of
Neyman–Pearson type such that
ϕ0 (x)P0 (x) = α.
Neyman–Pearson Lemma. A critical function ϕ0 with ϕ0 (x)P0 (d x) ∈ (0, 1) is of
Neyman–Pearson type for P0 : P1 iff for every critical function ϕ,
ϕ(x)P0 (d x) = ϕ0 (x)P0 (d x)
implies
Observe that the value of ϕ(x) is irrelevant for x with p1 (x) = cp0 (x).
According to Schmetterer (1974, footnote on p. 166), a first version of the
Neyman–Pearson Lemma occurs in Neyman and Pearson (1936) [p. 207, relation
(4)]. The basic idea of this Lemma occurs earlier in Neyman and Pearson (1933) [p.
300, p. 151, relation (25)].
In Wolfowitz’s opinion, (Wolfowitz 1970, p. 767, footnote 4) “... the “Neyman–
Pearson fundamental Lemma”, which, no matter how “fundamental” it may be, is
pretty trivial to prove and not difficult to discover”.
Since the textbook by Schmetterer (1974) [Theorem 3.1, p. 166] needs four pages
for the proof of the Neyman–Pearson Lemma, it is worthwhile to look for alternatives.
Such can be found in Witting (1985) [p. 193, Satz 25] or Pfanzagl (1994) [p. 133,
Lemma 4.3.3]. The basic idea of these proofs is the relation
and
lim sup Pϑ(n)
n
{ϑ (n) ≥ ϑn } ≤ 1/2
n→∞
1
Λnt = log(d Pϑ(n)+c−1 t d Pϑ(n) ) = tΔn (·, ϑ0 ) − t 2 L(ϑ0 ) + o(n 0 , Pϑ(n) ).
0 n 0
2 0
1
Cn (t) := Λnt ≤ t 2 L(ϑ0 )2 .
2
We have
1
Pϑ(n) ◦ Λnt ⇒ N (− t 2 L(ϑ0 )2 , t 2 L(ϑ0 )2 ) (5.11.2)
0
2
and, consequently, by Le Cam’s 3rd Lemma,
1
Pϑ(n)+c−1 t ◦ Λnt ⇒ N ( t 2 L(ϑ0 )2 , t 2 L(ϑ0 )2 ). (5.11.3)
0 n 2
This implies
Pϑ(n)+c−1 t (Cn (t)) → 1/2
0 n
192 5 Asymptotic Optimality of Estimators
and
Pϑ(n)
0
(Cn (t)) → Φ(t L(ϑ0 )).
Since
lim sup Pϑ(n)+c−1 t {ϑ (n) ≤ ϑ0 + cn−1 t} ≤ 1/2 = lim Pϑ(n)+c−1 t (Cn (t)),
n→∞ 0 n n→∞ 0 n
With the same techniques, but under stronger regularity conditions, Michel (1974)
[p. 207, Theorem] obtains the following result of Berry-Esseen type. If (ϑ (n) )n∈N is
approximately median unbiased in the sense that for every compact K ⊂ Θ there is
a K such that, for ϑ ∈ K ,
Pϑn {ϑ (n) < ϑ} ≤ 1/2 + a K n −1/2 and Pϑn {ϑ (n) > ϑ} ≤ 1/2 + a K n −1/2 ,
Remark For some authors, the distinction between symmetric and arbitrary intervals
(or loss functions) seems to be nothing to speak of; see Witting and Müller-Funk
(1995) [p. 423]. Strasser (1985), presents a result concerning the concentration of
median unbiased estimators on arbitrary intervals containing 0 on p. 162, Lemma
34.2, and a corresponding result for symmetric loss functions on p. 362, Lemma 72.1.
Rüschendorf (1988) [p. 207, Satz 6.10] is on symmetric loss functions.
To derive asymptotic median unbiasedness from regular or locally uniform con-
vergence to a limit distribution poses no problems if the limit distribution is of the
form N (0, σ 2 (ϑ0 )) (as in the papers by C.R. Rao, Schmetterer and Bahadur). Any
estimator sequence converging regularly to N (0, σ 2 (ϑ0 )) is asymptotically median
unbiased, and relation (5.11.4) can now be rewritten as
which means
σ 2 (ϑ0 ) ≥ Λ(ϑ0 ).
(See C.R. Rao 1963, p. 196, Lemma 2(ii) and Bahadur 1964, p. 1546, Proposition 1;
notice the misprint in Rao’s Lemma 2(ii) which says σ 2 (ϑ0 ) ≤ Λ(ϑ0 ).)
That regularly attainable limit distributions have, under the usual regularity con-
ditions, a positive Lebesgue density and therefore a unique median was established
by Kaufman (1966, p. 174, Corollary). Since this result was not yet available to Wol-
fowitz (1965), and unknown to Schmetterer (1966) and Roussas (1968 and 1972),
these authors had to formulate a result with the median replaced by the lower or
upper bound of the median interval, respectively.
Problems connected with the nonuniqueness of the median vanish if the optimality
assertions are confined to intervals symmetric about 0. For regular one-parameter
families, Weiss and Wolfowitz (1966) [p. 61] show that for any regularly attainable
limit distribution Q ϑ ,
The argument in this paper is Bayesian, but the Neyman–Pearson Lemma suffices
(see below).
The idea to obtain a bound from the Neyman–Pearson Lemma also works in "almost regular" cases. For shift parameter families based on a probability measure with support (a, b), Akahira (1975) [II] presents in Theorem 3.1, p. 106, a result which implies, in particular, the following. Let the quantities A and B be given as in that theorem. If these quantities are finite and positive, then the M.L. sequence of the location parameter, standardized by c_n = n^{1/2}(log n)^{1/2}, is asymptotically normal with variance ½(A²/A + B²/B). According to Theorem 3.2, p. 107, this is the best possible limit distribution for estimator sequences (ϑ^{(n)})_{n∈N} converging to their limit distribution Q_ϑ locally uniformly. Akahira's proof is based on the Neyman–Pearson Lemma. Roughly speaking, his Lemma 3.1, p. 102, implies the validity of conditions (5.11.2) and (5.11.3) with c_n = n^{1/2}(log n)^{1/2} and L(ϑ) = 2(A²/A + B²/B)^{−1}.
Though Akahira is aware of Woodroofe's related paper (1972) (see Akahira 1975, I, p. 26), he desists from discussing the connection of his results with those of Woodroofe and with the pertaining optimality results of Weiss and Wolfowitz (1973).
By the same technique, bounds for asymptotically median unbiased estimator
sequences may also be obtained for certain non-regular parametric families where
the rate of convergence is cn = n, and the optimal limit distribution is non-Gaussian.
As a typical example we mention a result of Grossmann (1981) [p. 106, Corollary 4.3 and p. 107, Proposition 4.4]. Let p|R be a density such that p(x) = 0 for x ≤ 0 and p(x) > 0 for x > 0 with lim_{x↓0} p(x) = A.
Let P_ϑ|B be the distribution with density x → p(x − ϑ). Then for any estimator sequence (ϑ^{(n)})_{n∈N} fulfilling the condition of regular convergence with c_n = n,
lim sup_{n→∞} P^n_{ϑ0}{ϑ0 − n^{−1}t′ < ϑ^{(n)} < ϑ0 + n^{−1}t″} ≤ ½(exp[At′] − exp[−At″]).
This bound is attained (for t′, t″ sufficiently small) by the median unbiased estimator sequence x_{1:n} − n^{−1}A^{−1} log 2.
Various examples of this kind (with t′ = t″) can be found in Akahira (1982) and the references there. Notice that in non-regular cases the bounds obtained from the Neyman–Pearson Lemma are not necessarily attainable for every t > 0.
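The attainability claim can be probed numerically. The sketch below is an added illustration under the assumption p(x) = e^{−x} for x > 0 (so A = 1); it exploits that the minimum of n standard exponentials is exactly exponential with scale 1/n:

```python
# Monte Carlo sketch (illustration only) of the Grossmann-type bound for the
# shift family with p(x) = exp(-x), x > 0, hence A = 1 and rate c_n = n.
import numpy as np

rng = np.random.default_rng(1)
n, reps, A = 500, 200000, 1.0
t1, t2 = 0.3, 0.5                           # t', t'' with t' < log(2)/A

x_min = rng.exponential(1.0 / n, size=reps)   # exact law of x_{1:n} - theta0
est_err = x_min - np.log(2) / (A * n)         # estimator minus theta0

coverage = ((est_err > -t1 / n) & (est_err < t2 / n)).mean()
print("coverage:", coverage,
      " bound:", 0.5 * (np.exp(A * t1) - np.exp(-A * t2)))
print("P{est < theta0}:", (est_err < 0).mean(), "(median unbiasedness, ~1/2)")
```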
A Multidimensional Bound
By nature, results on median unbiased estimator sequences are restricted to one-
dimensional functionals. They might, however, be used to derive some results on
multidimensional functionals.
Let Θ ⊂ R^k. Assume that ϑ^{(n)} : X^n → R^k, n ∈ N, is an estimator sequence for ϑ which converges regularly to some limit distribution Q_ϑ|B^k. If this limit distribution is normal, say Q_ϑ = N(0, Σ(ϑ)), the estimator sequence κ^{(n)} := a′ϑ^{(n)}, n ∈ N, for κ(ϑ) = a′ϑ is asymptotically normal: P^n_ϑ ∘ c_n(κ^{(n)} − κ(ϑ)), n ∈ N, converges regularly to N(0, a′Σ(ϑ)a). Hence κ^{(n)} is asymptotically median unbiased, and relation (5.11.4), applied with K = a, leads to
N(0, a′Σ(ϑ0)a)(−t, t) ≤ N(0, a′Λ(ϑ0)a)(−t, t) for every t > 0,
hence
a′Σ(ϑ0)a ≥ a′Λ(ϑ0)a.
Since this relation holds for arbitrary a ∈ R^k, the matrix Σ(ϑ0) − Λ(ϑ0) is positive semidefinite. This argument occurs in Bahadur (1964) [p. 1550, relation (26)] and is repeated in Roussas (1968) [p. 255, Theorem 3.2 (iii)]. Both authors are satisfied with this formal relation between Σ(ϑ0) and Λ(ϑ0). Since Anderson's Theorem was well known at this time (see in particular Anderson 1955, p. 173, Corollary 3), they could have turned this relation into the inequality
N(0, Σ(ϑ0))(C) ≤ N(0, Λ(ϑ0))(C)
for all sets C which are convex and symmetric about 0, a relation which anticipates the essence of the Convolution Theorem for the particular case of normal limit distributions.
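Anderson's inequality behind this step is easy to check by simulation. The following sketch is an added illustration with arbitrarily chosen matrices Λ and Σ (with Σ − Λ positive semidefinite) and a box-shaped convex symmetric set C:

```python
# Monte Carlo sketch (illustration only) of Anderson's inequality:
# Sigma - Lambda psd implies N(0,Sigma)(C) <= N(0,Lambda)(C) for convex,
# symmetric C.  Matrices and the box C below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(2)
Lam = np.array([[1.0, 0.3], [0.3, 0.5]])
Sig = Lam + np.array([[0.5, 0.1], [0.1, 0.4]])   # difference is psd
assert np.linalg.eigvalsh(Sig - Lam).min() >= -1e-12

z_lam = rng.multivariate_normal(np.zeros(2), Lam, size=200000)
z_sig = rng.multivariate_normal(np.zeros(2), Sig, size=200000)

def in_box(z, a=1.0, b=0.8):                     # C = [-a,a] x [-b,b]
    return (np.abs(z[:, 0]) <= a) & (np.abs(z[:, 1]) <= b)

print("N(0,Lambda)(C) ~", in_box(z_lam).mean())
print("N(0,Sigma)(C)  ~", in_box(z_sig).mean())  # should not exceed the above
```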
Remark The idea to pass from the distributions of the estimators a′ϑ^{(n)} to the distribution of ϑ^{(n)} occurs already in Kallianpur and Rao (1955) [p. 342], foreshadowed in a paper by Rao (1947) [p. 281, Corollary 1.1], which also contains the idea of an intrinsic bound for the asymptotic variance of (unbiased) estimator sequences. Surprisingly, it is not applied by Rao (1963). It was taken up by Roussas (1968, p. 255, Theorem 3, and 1972, p. 161, Theorem 7.1)—post festum. Though Roussas (1968) was aware of Kaufman's result from 1966, he obviously misunderstood its relevance (see Roussas 1968, p. 259, and 1972). Using the Neyman–Pearson Lemma, Roussas still struggles with upper and lower medians, a problem settled already by Kaufman (1966) [p. 174, Corollary]. It is not only the relevance of Kaufman's pre-Convolution Theorem which escaped Roussas' attention: in the very same book he presents, as Theorem 3.1, p. 136, a full version of the Convolution Theorem from which the above-mentioned results follow immediately.
Remark Observe that Roussas’ paper deals with Markov chains fulfilling an LAN-
condition. It seems to be the first paper at this level of generality, prior to Hájek (1970).
A comparable result by Schmetterer (1966) [p. 316, Theorem 4.2] on Markov chains
with a 1-dimensional parameter is technically less advanced. Schmetterer’s paper
was criticized by Roussas (1968) [p. 252] for being technically obsolete, using on p.
309 results by Daniels (1961) rather than Le Cam (1960) [p. 40, Theorem 2.1 (6)].
Roussas could have added that Schmetterer’s Lemma 2.2, attributed by Schmetterer
to Daniels (1961), occurs there (p. 156, relation (3.8)) with an incorrect proof. (Since
this paper of Daniels is cited by many authors (Rao 1963, Bickel 1974, Lehmann and
Casella 1998 and Le Cam in various places), it might be worth mentioning that this
is not the only slip in Daniels’ paper. See also G. Huber 1967, p. 211 and Williamson
1984.)
Since Roussas avoids the use of Anderson's Theorem (for unknown reasons), the relation Λ(ϑ0) ≤ Σ(ϑ0) between the covariance matrices (in the Löwner order) is used to derive relations between concentrations on hyperplanes. For the case of an arbitrary multivariate limit distribution, Roussas (1968) [p. 257, Theorem 4] presents a result concerning the concentration about median hyperplanes.
Call {u ∈ R^k : a′u = c} a median hyperplane for the probability measure Q|B^k if Q{u : a′u ≤ c} ≥ 1/2 and Q{u : a′u ≥ c} ≥ 1/2. The optimal limit distribution N(0, Λ(ϑ0)) is for any a ∈ R^k "more" concentrated about the median hyperplane {u ∈ R^k : a′u = 0} than any regularly attainable distribution Q_{ϑ0} about its median hyperplane. More precisely: If {u ∈ R^k : a′u = m_a(Q_{ϑ0})} is a median hyperplane of Q_{ϑ0}, then
Q_{ϑ0}{u ∈ R^k : m_a(Q_{ϑ0}) − t′ ≤ a′u ≤ m_a(Q_{ϑ0}) + t″} ≤ N(0, Λ(ϑ0)){u ∈ R^k : −t′ ≤ a′u ≤ t″} for t′, t″ ≥ 0.
This result of Roussas generalizes the result of Wolfowitz (1965) [pp. 258/9, Theorem] from k = 1 to arbitrary k. It is, in fact, a trivial consequence of the Convolution Theorem: since a′u is under Q_{ϑ0} more spread out than under N(0, Λ(ϑ0)), it is (see 2.3.5) less concentrated—on arbitrary intervals—about its median under Q_{ϑ0} than about its median under N(0, Λ(ϑ0)).
According to the Convolution Theorem, N(0, JΛJ′) is the optimal limit distribution of regular estimator sequences for a real-valued functional κ with gradient J. Since JΛℓ•(·, ϑ) is a gradient of κ, the stochastic expansion (5.11.6) implies that the convergence of any optimal median unbiased estimator sequence to N(0, JΛJ′) is regular. Hence these estimator sequences are a fortiori optimal in the class of all regular estimator sequences for κ. It appears that this automatic regularity of asymptotically optimal median unbiased estimator sequences was overlooked in the papers by Rao, Wolfowitz, Schmetterer, etc.
Using the concepts developed in Sect. 5.13, these results can easily be extended to general families. If
lim_{n→∞} P^{(n)}_ϑ{c_n|κ^{(n)} − κ(ϑ)| ≤ t} = N(0, JΛJ′)(−t, t) for every t > 0,
then
c_n(κ^{(n)} − κ(ϑ)) − JΔ_n → 0 (P^{(n)}_ϑ). (5.11.6)
Remark The scholars who tried to prove certain optimality properties of ML esti-
mators by means of the Neyman–Pearson Lemma could have used an asymptotic
expansion like (5.11.6) to show that an asymptotically normal estimator sequence
with optimal marginals has the same (joint) limit distribution as the ML-sequences.
This does, however, not yet establish ML-sequences as asymptotically optimal in
some multivariate sense. Estimator sequences with marginals inferior to the mar-
ginals of the ML-sequences could, in principle, be asymptotically superior to the
ML-sequences in some multidimensional sense. (See Sect. 5.13 for a discussion of
this question.)
Proof of Theorem 5.11.3. (i) Let a ∈ R^k be fixed. If κ^{(n)}, n ∈ N, is asymptotically median unbiased and κ is differentiable, the relation
lim sup_{n→∞} ∫ϕ_n dP^{(n)}_{ϑ+c_n^{-1}a} ≤ 1/2
is fulfilled for ϕ_n = 1_{{c_n(κ^{(n)} − κ(ϑ)) ≤ Ja}}. Hence Corollary 5.11.8 below implies

To make the upper bound as sharp as possible, one needs to minimize a′La subject to the condition Ja ≥ t. This is achieved by choosing a = ā with
ā := ΛJ′/(JΛJ′). (5.11.8)
Together with the corresponding relation for t < 0, this implies (5.11.5).
(ii) According to Corollary 5.11.8, applied with β = 1/2 and a replaced by tā, the relation
∫ϕ_n dP^{(n)}_{ϑ+c_n^{-1}tā} → 1/2 (5.11.9)
is fulfilled for ϕ_n = 1_{{JΔ_n < t}}.
Proposition 5.5.1 implies that asymptotically median unbiased estimator sequences that attain at ϑ the maximal value N(0, JΛJ′)(−t, t) for every t > 0 converge regularly to a limit distribution, which is N(0, JΛJ′).
Leaving median unbiasedness aside, we turn to estimator sequences for κ : Θ → R with a regularly attainable limit distribution, say Q. The following Theorem 5.11.4 asserts that Q is more spread out than N(0, JΛJ′). This assertion is weaker than the Convolution Theorem, since the spread relation follows from Q = N(0, JΛJ′) ∗ R (see Sect. 2.8). It might be worth mentioning that this spread relation could thus have been obtained years before the Convolution Theorem.
Theorem 5.11.4 Assume that {(P^{(n)}_ϑ)_{n∈N} : ϑ ∈ Θ}, Θ ⊂ R^k, is an LAN-family and κ : Θ → R a differentiable functional with gradient J. Then Q_ϑ is more spread out than N(0, JΛJ′) for any regularly attainable limit distribution Q_ϑ of estimator sequences for κ.
∫ϕ_n(·, a) dP^{(n)}_{ϑ+c_n^{-1}a} → β
and

By Lemma 5.11.5,

This relation holds for every β ∈ (0, 1) and every t ≥ 0. According to a well-known lemma on the spread order (see (2.4.3)), this implies that Q_ϑ is more spread out than N(0, JΛJ′).
Some Auxiliary Results
Lemma 5.11.5 Let F and F_1 be increasing distribution functions of probability measures Q and Q_1, respectively. If

hence
F(F^{−1}(β) + t) ≤ F_σ(F_σ^{−1}(β) + t) for β ∈ (0, 1), t ∈ R.
Lemma 5.11.6 For n ∈ N, let f_n, g_n be real functions defined on (X, A), and P_n|A a probability measure. If
1_{{f_n ≤ t}} − 1_{{g_n ≤ t}} → 0 (P_n) for t ∈ R,
then
f_n − g_n → 0 (P_n).
Proof Since
|1_{{f_n ≤ t}} − 1_{{g_n ≤ t}}| = 1_{{f_n ≤ t, g_n > t}} + 1_{{f_n > t, g_n ≤ t}},

If f_n(x) − g_n(x) > ε, the interval (g_n(x) + ε/2, f_n(x) − ε/2) is nonempty, and there is t ∈ Q such that g_n(x) + ε/2 < t < f_n(x) − ε/2. Hence
{x ∈ X : f_n(x) − g_n(x) > ε} ⊂ ⋃_{t∈Q} {x ∈ X : g_n(x) < t − ε/2, f_n(x) > t + ε/2}.
Since
implies
(ii) If equality holds in (5.11.13) for some r > 0, then lim_{n→∞} ∫ϕ_n dP_n = Q[0, r] and therefore
ϕ_n − 1_{{q_n < r}} → 0 (P_n). (5.11.14)

∫|χ_n| · |r − q_n| dP_n → 0,

P_n{|r − q_n| < ε} + (1/ε)∫|χ_n| · |r − q_n| dP_n, (5.11.16)
log(dP^{(n)}_{ϑ+c_n^{-1}a}/dP^{(n)}_ϑ) = a′Δ_n(·, ϑ) − ½a′L(ϑ)a + o(n⁰, P^{(n)}_ϑ),
and
P^{(n)}_ϑ ∘ a′Δ_n(·, ϑ) ⇒ N(0, a′L(ϑ)a),
P^{(n)}_{ϑ+c_n^{-1}a} ∘ a′Δ_n(·, ϑ) ⇒ N(a′L(ϑ)a, a′L(ϑ)a).
If
lim sup_{n→∞} ∫ϕ_n dP^{(n)}_{ϑ+c_n^{-1}a} ≤ β for some a ∈ R^k,
then
lim sup_{n→∞} ∫ϕ_n dP^{(n)}_ϑ ≤ Φ(Φ^{−1}(β) + (a′L(ϑ)a)^{1/2}).
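In the Gaussian limit experiment the content of this bound is elementary: among all tests ϕ with acceptance probability at most β under N(σ, 1), where σ stands for (a′L(ϑ)a)^{1/2}, the probability under N(0, 1) is maximized by a one-sided test, giving exactly Φ(Φ^{−1}(β) + σ). A minimal numeric sketch, added here for illustration:

```python
# Numeric sketch (illustration only): Neyman-Pearson bound in the Gaussian
# limit experiment.  If E_{N(sigma,1)} phi <= beta, then E_{N(0,1)} phi <=
# Phi(Phi^{-1}(beta) + sigma), attained by the one-sided test 1{X <= c}.
from scipy.stats import norm

beta, sigma = 0.5, 1.2        # sigma stands for (a'L(theta)a)^{1/2}
c = sigma + norm.ppf(beta)    # chosen so that P_{N(sigma,1)}{X <= c} = beta
print("level under N(sigma,1):", norm.cdf(c - sigma))
print("size under N(0,1)     :", norm.cdf(c))
print("bound Phi(Phi^{-1}(beta)+sigma):", norm.cdf(norm.ppf(beta) + sigma))
```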
∫ϕ_n dP^{(n)}_{ϑ+n^{-1/2}a} → 1/2
and

then
ϕ_n − 1_{{a′Δ_n < a′La}} → 0 (P^{(n)}_ϑ).
Proof (i) The assertion follows from Lemma 5.11.7(i), applied with P_n = P^{(n)}_ϑ and P′_n = P^{(n)}_{ϑ+c_n^{-1}a} and with r_a = exp[(a′La)^{1/2}Φ^{−1}(β) + ½a′La]. Since

we have
P^{(n)}_{ϑ+c_n^{-1}a}{Λ_{na} ≤ r_a} → β
and therefore
lim sup_{n→∞} P^{(n)}_ϑ{Λ_{na} < r_a} ≤ Φ(Φ^{−1}(β) + (a′La)^{1/2}).
(ii) The addendum follows from Lemma 5.11.7(ii).
Symmetric Optimality
We consider a parametric family {(P^{(n)}_ϑ)_{n∈N} : ϑ ∈ Θ}, Θ ⊂ R^k, fulfilling an LAN-condition
log(dP^{(n)}_{ϑ+c_n^{-1}a}/dP^{(n)}_ϑ) = a′Δ_n(·, ϑ) − ½a′L(ϑ)a + o(n⁰, P^{(n)}_ϑ).
Asymptotic bounds for the concentration of estimator sequences for differentiable
functionals κ : Θ → R p are necessarily restricted to the concentration on symmetric
sets if p > 1.
For functionals κ : Θ → R this is not necessarily so. Among median unbiased
estimator sequences there may be estimator sequences which are asymptotically
maximally concentrated on arbitrary intervals containing 0. For irregular models,
estimator sequences with this strong optimum property do not necessarily exist. In
such cases, optimality on intervals symmetric about 0 may be a useful surrogate.
A first result on symmetric optimality was presented by Weiss and Wolfowitz (1966) [p. 61]. To establish the asymptotic optimality of ML-sequences under Cramér-type regularity conditions, they show that
Q_ϑ(−t, t) ≤ N(0, Λ(ϑ))(−t, t) for every t > 0
for any regularly attainable limit distribution Q_ϑ. Since corresponding results with not necessarily symmetric intervals were already available at this time, one can only guess why they were interested in symmetric intervals. Perhaps their intention was to avoid trouble with the median. They use Bayesian arguments, though a proof based on the Neyman–Pearson Lemma would have been simpler (see below).
The idea of symmetric optimality was taken up by Akahira and/or Takeuchi (see
e.g. 1982), combined with the requirement of asymptotic median unbiasedness.
Is an optimality concept based on the concentration on symmetric intervals
really required? In the regular cases considered above, there exist median unbi-
ased estimator sequences which are optimal on all intervals containing 0. In certain
non-regular models one encounters estimator sequences with an optimum property
limited to symmetric intervals. Such limited optimum properties are of interest also
from another point of view: They are neither related to the spread order nor to a
convolution property.
Let P^{(n)}_ϑ ∘ c_n(ϑ^{(n)} − ϑ) ⇒ Q_ϑ|B^k with Q_ϑ nonatomic. The following proposition gives bounds for Q_ϑ(−t′, t″), expressed by the asymptotic performance of
p_n(·, ϑ + c_n^{-1}t)/p_n(·, ϑ) ∈ dP^{(n)}_{ϑ+c_n^{-1}t}/dP^{(n)}_ϑ.
Proof By assumption,
P^{(n)}_{ϑ+c_n^{-1}t′}{ϑ^{(n)} ≤ ϑ} = P^{(n)}_{ϑ+c_n^{-1}t′}{c_n(ϑ^{(n)} − (ϑ + c_n^{-1}t′)) ≤ −t′} → Q_ϑ(−∞, −t′)
and
P^{(n)}_{ϑ−c_n^{-1}t″}{ϑ^{(n)} ≤ ϑ} = P^{(n)}_{ϑ−c_n^{-1}t″}{c_n(ϑ^{(n)} − (ϑ − c_n^{-1}t″)) ≤ t″} → Q_ϑ(−∞, t″),
hence
Q_ϑ(−t′, t″) = lim_{n→∞} ∫1_{{ϑ^{(n)}≤ϑ}}(p_n(·, ϑ − c_n^{-1}t″) − p_n(·, ϑ + c_n^{-1}t′)) dμ_n.
Since
(1_{{ϑ^{(n)}≤ϑ}} − 1_{{p_n(·, ϑ−c_n^{-1}t″) > p_n(·, ϑ+c_n^{-1}t′)}})(p_n(·, ϑ − c_n^{-1}t″) − p_n(·, ϑ + c_n^{-1}t′)) ≤ 0,
Under corresponding regularity conditions (using (5.11.2) and (5.11.3) again), Proposition 5.11.9 yields the bound
lim_{n→∞} P^{(n)}_{ϑ0}{−t′ ≤ c_n(ϑ^{(n)} − ϑ0) ≤ t″} ≤ N(0, Λ(ϑ0))(−½(t′ + t″), ½(t′ + t″)) (5.11.19)
(now under the assumption that the lim on the left-hand side exists).
Whereas the bound given in (5.11.4) is attainable, this is not the case with the bound given in (5.11.19). Observe that
N(0, σ²)(−t′, t″) ≤ N(0, σ²)(−½(t′ + t″), ½(t′ + t″)).
For t′ = t″ = t, this yields
lim_{n→∞} P^{(n)}_{ϑ0}{c_n|ϑ^{(n)} − ϑ0| ≤ t} ≤ N(0, Λ(ϑ0))(−t, t), (5.11.20)
with two different interpretations, corresponding to the two different side conditions: median unbiasedness, and convergence to a limit distribution, respectively.
There is a third approach leading to the same bound, the Convolution Theorem, which requires regular convergence of P^{(n)}_ϑ ∘ c_n(ϑ^{(n)} − ϑ), n ∈ N, to some limit distribution Q_ϑ, and which implies, corresponding to (5.11.20), that

P^{(n)}_{ϑ+c_n^{-1}a}{c_n(ϑ^{(n)} − (ϑ + c_n^{-1}a)) ∈ C}, n ∈ N,

Observe that the set C is fixed. The proof follows from Proposition 5.11.9, applied with a′ϑ^{(n)} as an estimator of a′ϑ, a ∈ R^k.
Do we really need an independent concept of symmetric optimality? The following example presents a limit distribution which (i) is optimal on all symmetric intervals, and (ii) whose optimality is not a consequence of the Convolution Theorem.
The operational significance of a representation Q̃ = Q ∗ R depends on properties of the factor Q. In the case k = 1 this is the logconcavity of Q, and the result is Q̃(−t′, t″) ≤ Q(−t′, t″) for t′, t″ ≥ 0. The following example presents two regularly attainable limit distributions Q and Q̃ such that Q̃(−t, t) ≤ Q(−t, t) for every t > 0, but Q̃(−t′, t″) > Q(−t′, t″) for some t′, t″ > 0. Since Q is logconcave, Q̃ cannot be a convolution product involving the factor Q. Hence the optimality of Q on all symmetric intervals is not a consequence of the Convolution Theorem.
Example For ϑ ∈ R let P_ϑ be the probability measure with density p(x, ϑ) := 1_{[−1/2,1/2]}(x − ϑ). By (5.11.18),
Q_ϑ(−t′, t″) ≤ 1 − e^{−(t′+t″)}. (5.11.21)
Here 1 − e^{−(t′+t″)} is just a bound for Q_ϑ(−t′, t″); equality in (5.11.21) for arbitrary t′, t″ is impossible. For t′ = t″ = t, we obtain
Q_ϑ(−t, t) ≤ 1 − e^{−2t},
and this symmetric bound, 1 − e^{−2t}, is attainable: for the estimator sequence
ϑ̂^{(n)}(x_n) := ½(x_{1:n} + x_{n:n}),
P^n_ϑ ∘ n(ϑ̂^{(n)} − ϑ) converges to the Laplace distribution with scale parameter 1/2, hence
lim_{n→∞} P^n_ϑ{n|ϑ̂^{(n)} − ϑ| ≤ t} = 1 − e^{−2t}.
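This limit is easy to reproduce by simulation; the following sketch is an added illustration (uniform samples, midrange estimator, ϑ = 0):

```python
# Monte Carlo sketch (illustration only): uniform shift family on
# [theta-1/2, theta+1/2]; the midrange satisfies n(est - theta) => Laplace(1/2),
# hence P{n|est - theta| <= t} -> 1 - exp(-2t).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 400, 20000
x = rng.uniform(-0.5, 0.5, size=(reps, n))        # theta = 0
mid = 0.5 * (x.min(axis=1) + x.max(axis=1))
for t in (0.25, 0.5, 1.0, 2.0):
    emp = (n * np.abs(mid) <= t).mean()
    print(f"t={t}: empirical {emp:.4f}, limit {1 - np.exp(-2*t):.4f}")
```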
For the median unbiased estimator sequence
ϑ̃^{(n)}(x_n) := x_{1:n} + ½ − n^{−1} log 2,
P^n_ϑ ∘ n(ϑ̃^{(n)} − ϑ) converges to a limit distribution Q̃ with
Q̃(−t′, t″) = min{1, ½e^{t′}} − ½e^{−t″} for −log 2 < −t′ ≤ 0 < t″ < ∞.
We have Q̃(−t, t) < Q(−t, t) for t > 0 by Proposition 5.11.9, and Q̃(−t′, t″) > Q(−t′, t″) for t′ close to log 2 and t″ small.
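The crossover between Q̃ and Q can be verified numerically from the two closed-form expressions (Q being the Laplace distribution with scale 1/2). A small added sketch:

```python
# Numeric check (illustration only) of the crossover between the two limit
# distributions: Q is Laplace with scale 1/2, Qtilde as given above.
import numpy as np

def Q(t1, t2):        # Laplace(0, 1/2): P(-t1 < X < t2)
    return 1 - 0.5 * np.exp(-2 * t1) - 0.5 * np.exp(-2 * t2)

def Qtilde(t1, t2):   # limit of n(est - theta) for est = x_{1:n}+1/2-log(2)/n
    return min(1.0, 0.5 * np.exp(t1)) - 0.5 * np.exp(-t2)

print("symmetric t = 0.5   :", Qtilde(0.5, 0.5), "<", Q(0.5, 0.5))
print("t' = 0.6, t'' = 0.05:", Qtilde(0.6, 0.05), ">", Q(0.6, 0.05))
```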
We conclude this section with some side results. Let, more generally, Q denote a family of distributions on B which contains an element Q_0 with median 0 which is optimal on all symmetric intervals, i.e., for every Q ∈ Q,
Proof (i) Assume that Q is minimal in the spread order. W.l.o.g. we may assume that Q has median 0. Then Q(−t′, t″) ≥ Q_0(−t′, t″) for all t′, t″ ≥ 0. Since Q_0 is optimal on all symmetric intervals, Q(−t, t) = Q_0(−t, t) for t > 0. By Lemma 5.11.12 this implies Q = Q_0.
(ii) Assume now that Q is minimal in the convolution order. Without assuming anything about Q, it is not permitted to conclude that Q is also minimal in the spread order (since Q ∗ R is not necessarily "more" spread out than Q in general). Yet the relation Q = Q_0 follows from Lemma 5.11.12.
Proof We have

Relation (5.11.22) with Q̂ in place of Q implies ∫u² Q̂(du) ≥ ∫w² Q_0(dw). Hence ∫v² R(dv) = 0, and Q̂ = Q_0.
(See Pfanzagl 2001, p. 507, Theorem 3.1, improving an earlier result by Liu and Brown (1993). Notice a misprint: c_n in (3.6) has to be replaced by n^{1/2}.)
Of course, relation (5.11.23) follows from the Convolution Theorem if P^n_{ϑ_n} ∘ n^{1/2}(ϑ^{(n)} − ϑ_n), n ∈ N, converges to a limit distribution. The essence of this theorem is that the local uniformity condition refers to mean unbiasedness only; convergence to a limit distribution is not required.
The idea to take the LAN-condition as a basis for asymptotic results developed slowly. The starting point was the paper by Wald (1943), suggesting the idea to approximate a k-parameter family of i.i.d. products for large n by a family of k-dimensional normal distributions, and to obtain an approximate solution for the original problem (say an optimal test) from the solution for the approximating normal distribution. The most interesting result in this paper seems to be Lemma 2, p. 443, asserting the existence of a map W_n : B_n → B^k such that
|P^n_ϑ(B) − N(0, Λ(ϑ))(W_n(B))| → 0 (5.12.1)
uniformly in ϑ and B ∈ B_n.
In this relation, W_n(B) is something like ϑ^{(n)}(B), with ϑ^{(n)} : X^n → R^k an ML estimator. With W_n a set-transformation, the representation (5.12.1) is not easy to deal with (though Wald presents various applications to asymptotic test theory). When Wolfowitz (1952) [p. 10] says that Wald wrote papers "without too much thought of elegance", this paper by Wald (of 57 pages) is a convincing example. Probably nobody ever had the patience to study it in detail.
Global Asymptotic Normality
Wald's approach was superseded by the following result of Le Cam (1956) [p. 140, Theorem 1]:
Under suitable regularity conditions on the family {P_ϑ|A : ϑ ∈ Θ}, Θ ⊂ R^k, dominated by μ|A, there exists an estimator sequence (ϑ^{(n)})_{n∈N} and a sequence of probability measures Q_{n,ϑ}|A^n with μ^n-density
K_n(ϑ) h_n 1_{B_n}(ϑ^{(n)} − ϑ) exp[−½ n(ϑ^{(n)} − ϑ)′L(ϑ)(ϑ^{(n)} − ϑ)] (5.12.2)
such that
d(P^n_ϑ, Q_{n,ϑ}) → 0. (5.12.3)
Asymptotic sufficiency as explained above means that P^{(n)}_ϑ(B_n) can be approximated by Q_{n,ϑ}(B_n), without randomization.
Relation (5.12.3) is used in the papers by Kaufman (1966) and Inagaki (1970).
Kaufman (p. 170, Theorem 4.3) gives an independent proof of relation (5.12.3), with
Q n,ϑ similar to (5.12.2). Under his strong regularity conditions, the ML estimator
may be taken for ϑ (n) .
In Michel and Pfanzagl (1970, p. 188, Theorem) it was shown that the factor 1_{B_n}(ϑ^{(n)}(x_n) − ϑ) in definition (5.12.2) may be omitted. In Pfanzagl (1972b) [p. 177, Theorem 1] it was shown that for every compact subset K ⊂ Θ there exists a constant a_K such that
sup_{ϑ∈K} d(P^n_ϑ, Q_{n,ϑ}) ≤ a_K n^{−1/2}.
Sharper rates are impossible, even in a case as simple as p(x, ϑ) = ϑ^{−1} exp[−ϑ^{−1}x], x > 0, with Θ = (0, ∞).
After the general LAN-condition (see below) had been firmly established as a technically useful tool for asymptotic theory, it was natural to reconsider "global asymptotic normality", originally confined to i.i.d. models, at this more general level. This was done in papers by Milbrodt (1983), Droste (1985) and Pfanzagl (1995). For n ∈ N let (X_n, A_n) be a measurable space. The family of sequences of distributions {(Q_{n,ϑ})_{n∈N} : ϑ ∈ Θ}, Θ ⊂ R^k, is quasi-normal if Q_{n,ϑ} has, with respect to some σ-finite measure, a density
x_n → K_n(ϑ) exp[−½ c_n²(ϑ^{(n)}(x_n) − ϑ)′L(ϑ)(ϑ^{(n)}(x_n) − ϑ)]
with K_n(ϑ) → 1.
The family of sequences {(P^{(n)}_ϑ)_{n∈N} : ϑ ∈ Θ} is called asymptotically normal if there exists a quasi-normal family such that
d(P^{(n)}_ϑ, Q_{n,ϑ}) → 0.
The result which establishes the equivalence is due to Droste (1985) [p. 47, Satz 5.5]. Pfanzagl (1995) [p. 117, Theorem] asserts the same result with an improved proof. The papers mentioned above consider, in fact, a locally uniform version of this equivalence. Since Θ is an open subset of R^k, the results hold then also uniformly on compact subsets of Θ; hence the name "global asymptotic normality".
That intuitively equivalent conditions on the remainder term, like r_n(·, ϑ, a_n) → 0 and r_n(·, ϑ, a) → 0, have a decisive influence on an asymptotic result is somewhat irritating.
Global asymptotic normality of i.i.d. families was used by Kaufman and Inagaki in the proof of the Convolution Theorem. The extension of global asymptotic normality from i.i.d. models to LAN-families is an interesting, nontrivial achievement of asymptotic theory. From the technical point of view, the proof of the Convolution Theorem is easier if it uses the LAN-condition directly (rather than taking the intuitively appealing digression via global asymptotic normality).
Local Asymptotic Normality
Unlike Kaufman (1966) and Inagaki (1970), Hájek starts in his papers (Hájek 1970, 1972) not from relation (5.12.2), but from the following LAN-condition:
LAN-condition. Let (X_n, A_n), n ∈ N, be a sequence of measurable spaces. For ϑ ∈ Θ ⊂ R^k let (P^{(n)}_ϑ)_{n∈N} be a sequence of probability measures such that
log(dP^{(n)}_{ϑ+c_n^{-1}a}/dP^{(n)}_ϑ) = a′Δ_n(·, ϑ) − ½a′L(ϑ)a + r_n(·, ϑ, a) (5.12.4)
with
P^{(n)}_{ϑ+c_n^{-1}a} ∘ Δ_n(·, ϑ) ⇒ N(L(ϑ)a, L(ϑ)) (5.12.5)
and
r_n(·, ϑ, a) → 0 (P^{(n)}_ϑ) for every a ∈ R^k. (5.12.6)
Δ_n(x_n, ϑ) = n^{−1/2}(log n)^{−1/2} Σ_{ν=1}^n (2(x_ν − ϑ) − (x_ν − ϑ)^{−1}).
(See Pfanzagl 2002b, p. 484, Example 1. See also Ibragimov and Has'minskii 1973, p. 249, Theorem 1, and 1981, p. 134, Theorem 5.1.)
There are, however, almost regular models with Δ_n(x_n, ϑ) = Σ_{ν=1}^n h_n(x_ν, ϑ), where h_n cannot be replaced by c_n^{−1}ℓ•. (See Pfanzagl 2002b, p. 481, Proposition 2.2 and Example 2.)
As a consequence of (5.12.5),

According to a Lemma of Hájek (1970) [p. 327, Lemma 1], Δ_n may be replaced by a truncated version such that, in addition to (5.12.4),
∫g_n dP^{(n)}_ϑ → 1, (5.12.10)
and
∫exp[a′u − ½a′L(ϑ)a] N(0, L(ϑ))(du) = 1,
so that
Kaufman’s Paper
In 1966, the optimality of ML estimators for i.i.d. observations was still an open
problem. Since Le Cam’s paper (1953) it was clear that there are problems with the
concept of an optimal limit distribution, and it took about 10 years to find a solution
in the restriction to estimator sequences which attain their limit distribution (locally)
uniformly (Rao 1963, Wolfowitz 1965).
In the fundamental paper by Kaufman (1966) and the related paper by Inagaki (1970), the main result reads as follows: Under suitable regularity conditions, the ML-sequence (ϑ̂^{(n)})_{n∈N} converges on the family {P_ϑ : ϑ ∈ Θ}, Θ ⊂ R^k, uniformly on compact subsets of Θ, to N(0, Λ(ϑ)), where Λ(ϑ) is the inverse of the matrix
L(ϑ) := ∫ℓ•(·, ϑ)ℓ•(·, ϑ)′ dP_ϑ.
Leaving the relation to the ML-sequence aside (which was the main subject of interest at this time), the essence of the Convolution Theorem can be described as follows:
If P^n_ϑ ∘ n^{1/2}(ϑ^{(n)} − ϑ), n ∈ N, converges to Q_ϑ|B^k, uniformly on Θ, then n^{1/2}(ϑ̂^{(n)} − ϑ) and n^{1/2}(ϑ^{(n)} − ϑ̂^{(n)}) are asymptotically independent (Kaufman 1966, p. 164, Lemma 3.4, and Inagaki 1970, p. 8, Lemma 2.3). By Anderson's Theorem, this implies (5.13.1), which is Kaufman's Theorem 2.1, p. 157.
Inagaki presents this result in a more elegant make-up, as a Convolution Theorem (p. 10, Theorem 3.1):
Q_ϑ = N(0, Λ(ϑ)) ∗ R_ϑ. (5.13.2)
There can be no question that Kaufman could have arrived at the convolution version,
but he preferred the direct way via Anderson’s Theorem (see p. 166).
The use of Anderson’s Theorem was not obvious from the beginning. In the
abstract of Kaufman’s paper from 1965, the optimality assertion (5.13.1) was
confined to ellipsoids C which are concentric to the concentration ellipsoid of
N (0, Λ(ϑ)).
Compared with later versions of the Convolution Theorem, Kaufman's paper is overcrowded with regularity conditions on the parametric family {P_ϑ : ϑ ∈ Θ}, Θ ⊂ R^k. These regularity conditions are to ensure the usual properties of the ML-sequence. Moreover, they are needed to prove that the ML-sequence is "asymptotically sufficient" in the following sense: There exist sequences of functions g_n(·, ϑ), ϑ ∈ Θ, and h_n on X^n such that, uniformly on compact subsets of Θ,
∫|Π_{ν=1}^n p(x_ν, ϑ) − h_n(x_n)g_n(ϑ̂^{(n)}(x_n), ϑ)| μ^n(dx_n) → 0.
Kaufman’s proof has a clear underlying idea: For asymptotic considerations, the
sequence of i.i.d.-products Pϑn , n ∈ N, can be replaced by a family, say Q (n) ϑ , with
μn -density xn → h n (xn )gn (ϑ̂ (n) (xn ), ϑ). Since ϑ̂ (n) is sufficient for the family {Q (n)
ϑ :
(n) (n)
ϑ ∈ Θ}, any estimator ϑ can be obtained from ϑ̂ by randomization, and this
suggests that ϑ (n) is less accurate than ϑ̂ (n) . Plausible as this idea is, it is difficult
to carry through in a mathematically precise way. (For this purpose, (ϑ̂ (n) )n∈N and
216 5 Asymptotic Optimality of Estimators
proof are concerned with limit distributions, not with the estimators themselves.
(What has not survived from Hájek's approach are the Bayesian techniques.) The comments by C.R. Rao (p. 160) and Godambe (p. 159) on Hájek's paper from 1971 show how progressive Hájek's approach was.
Obviously, Hájek considers his Convolution Theorem as a generalization of Kaufman's Theorem. Yet he is rather terse about the consequences of his theorem for the optimality of ML-sequences.
Remark Some authors name the "Convolution Theorem" after Hájek and Le Cam. This eponymy takes into account that it was Le Cam who had blazed the trail for Hájek's version of the Convolution Theorem by his papers on asymptotic normality, local and global, mainly Le Cam (1956, 1960 and 1966). If it is true that Hájek's publication of the Convolution Theorem (1970) came to Le Cam as "a bolt out of the blue" (a personal communication of Beran, cited in van der Vaart 2002, p. 643), this confirms that Le Cam was, at this time, not focused on what his fellow statisticians considered as relevant problems. In his last paper prior to Hájek's path-breaking paper from 1970, Le Cam (1969), entitled Théorie asymptotique de la décision statistique, still plays around with various subtle points related to "asymptotic normality". The emphasis of this paper can be illustrated by the references, which include Alexiewicz (differentiation of vector-valued functions), Dudley (convergence of Baire measures), Pettis (differentiation in Banach spaces), Saks (theory of the integral), but not the paper by Kaufman (1966), which was based on Le Cam (1956). If Le Cam says that he could prove the Convolution Theorem immediately after he had seen it (Yang 1999, p. 236), this confirms his ability to perceive the abstract structure of a problem (the local translation invariance, in this case). Two years after Hájek's paper, Le Cam, using the techniques of Le Cam (1964), offers an abstract version of the Convolution Theorem (Le Cam 1972, p. 256, Proposition 8) which asserts the existence of a transition between the limit distribution of a distinguished statistic and an arbitrary limit distribution. In Proposition 10, pp. 257/8, he gives conditions under which this transition may be represented as a convolution. In Sect. 6 he indicates certain applications of this result, including convolutions with an exponential factor, cases which are ignored by Hájek. Ibragimov and Has'minskii obtained for the same models the same results by elementary techniques. (See Ibragimov and Has'minskii (1981), p. 278, Theorem 5.2. Hint: In relation 5.5, the letter n must be replaced by u, a misprint dating from the Russian original, 1979, p. 370.)
We shall not follow Le Cam’s path to more and more abstract versions, a path
which ends up with the idea that the essence of the Convolution Theorem was there
long before statisticians had any notion of it: No, it is not the unpublished thesis
of Boll (1955); it is a paper by Wendel (1952) on the representation of bounded
linear transformations on integrable functions on locally compact groups with right
invariant Haar measure. Such transformations can be represented by a convolution
if they commute with all operations of left multiplication. (See the reference in Le
Cam 1994, p. 405 or 1998, p. 27.)
Proof The multidimensional version follows from the one-dimensional one, applied with
κ = Σ_{i=1}^p a_i κ_i, κ^{(n)} = Σ_{i=1}^p a_i κ_i^{(n)}, g = Σ_{i=1}^p a_i g_i.
In the following proof we write κ∗ for κ∗(·, P). Since the sequence of probability measures
P^n ∘ (g̃, κ̃∗, n^{1/2}(κ^{(n)} − κ(P))), n ∈ N,
Since the limits turn out to be independent of N_0, the reference to N_0 will be omitted throughout. By assumption, the marginal on the first two components of Π is the normal distribution N(0, Σ(P)) with
Σ_{11} = ∫u² Π(d(u, v, w)), Σ_{12} = ∫uv Π(d(u, v, w)), Σ_{22} = ∫v² Π(d(u, v, w)).
Since
n^{1/2}(κ(P_{n^{-1/2}(ag+bκ∗)}) − κ(P)) → ∫κ∗ · (ag + bκ∗) dP = aΣ_{12} + bΣ_{22},

∫H(w − aΣ_{12} − bΣ_{22}) exp[(au + bv) − ½(a²Σ_{11} + 2abΣ_{12} + b²Σ_{22})] Π(d(u, v, w))
= ∫H(w) Π(dw) for a, b ∈ R (5.13.7)
for every bounded and continuous function H. With H(w) = exp[itw] we obtain from (5.13.7) that
∫exp[it(w − aΣ_{12} − bΣ_{22})] exp[(au + bv) − ½(a²Σ_{11} + 2abΣ_{12} + b²Σ_{22})] Π(d(u, v, w))
= ∫exp[itw] Π(dw) for a, b ∈ R. (5.13.8)

∫exp[isu + it(w − v)] Π(d(u, v, w)) = ∫exp[isu] Π(du) ∫exp[it(w − v)] Π(d(v, w)),
n^{1/2}(κ(P_{n^{-1/2}g}) − κ(P)) → ∫κ∗(·, P) g dP for g ∈ T(P, P).

is the same for every g ∈ T(P, P). Then M is a convolution product of the factor N(0, Σ(P)), with
i.e.
n^{1/2}(κ_i^{(n)} − κ_i(P)) = κ̃_i∗(·, P) + o(n⁰, P^n);
hence κ_i^{(n)} is asymptotically linear with influence function κ_i∗(·, P). By Proposition 5.5.1, such estimators are automatically regular.
Proof of the Theorem. Since

we obtain from Proposition 5.13.1 that g̃ and R^{(n)}(·, P) are asymptotically stochastically independent for every g ∈ T(P, P). Applied with g = Σ_i α_i κ_i∗(·, P), this yields the asymptotic independence between κ̃∗(·, P) and R^{(n)}(·, P). Hence the limit distribution of n^{1/2}(κ^{(n)} − κ(P)) is a convolution product of the limit distribution of κ̃∗(·, P) and that of R^{(n)}(·, P).
The Convolution Theorem Applied to Parametric Families
If P = {P_ϑ : ϑ ∈ Θ}, Θ ⊂ R^k, then the tangent space is (under suitable regularity conditions on the densities) the linear subspace of L∗(P) spanned by the components of ℓ•(·, ϑ). The canonical gradient of the functional κ(ϑ) := ϑ is Λ(ϑ)ℓ•(·, ϑ), and the optimal limit distribution of regular estimators of ϑ is N(0, Λ(ϑ)). Regular estimator sequences (ϑ^{(n)})_{n∈N} attaining this limit distribution have the stochastic expansion
n^{1/2}(ϑ^{(n)} − ϑ) = Λ(ϑ)ℓ̃•(·, ϑ) + o(n⁰, P^n_ϑ).
It suggests itself to use κ^{(n)} := κ(ϑ^{(n)}) as an estimator of κ(ϑ). If ϑ^{(n)} is regular for ϑ, then κ(ϑ^{(n)}) is also regular, by the expansion
n^{1/2}(κ(ϑ^{(n)}) − κ(ϑ)) = J(ϑ)n^{1/2}(ϑ^{(n)} − ϑ) + o(n⁰, P^n_ϑ); (5.13.12)
this leads to N(0, Σ(ϑ)) with Σ(ϑ) = J(ϑ)Λ(ϑ)J(ϑ)′ as the optimal limit distribution of estimators of κ(ϑ).
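A quick Monte Carlo sketch of expansion (5.13.12), added for illustration: in the normal location family with Λ(ϑ) = 1 and κ(ϑ) = ϑ², the gradient is J(ϑ) = 2ϑ, so the plug-in estimator κ(x̄) should have asymptotic variance JΛJ′ = 4ϑ²:

```python
# Monte Carlo sketch (illustration only) of the delta-method expansion
# (5.13.12): x_i ~ N(theta, 1), kappa(theta) = theta^2, J(theta) = 2*theta,
# Lambda = 1, so sqrt(n)(xbar^2 - theta^2) should have variance ~ 4*theta^2.
import numpy as np

rng = np.random.default_rng(4)
n, reps, theta = 1000, 100000, 2.0
xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)  # exact law of the mean
stat = np.sqrt(n) * (xbar**2 - theta**2)
print("empirical variance:", stat.var(), "  J Lambda J' =", 4 * theta**2)
```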
Proposition 5.13.1 implies that ℓ̃•_n(·, ϑ) and n^{1/2}(κ^{(n)} − κ(ϑ)) − J(ϑ)Λ(ϑ)ℓ̃•(·, ϑ) are asymptotically independent. Hence the optimal limit distribution for estimator sequences (κ^{(n)})_{n∈N} that are regular with respect to all directions is N(0, Σ(ϑ)). Applied with p = k and κ(ϑ) = ϑ, this leads to J = I, the k × k unit matrix, i.e. to Σ(ϑ) = Λ(ϑ).
For readers who dislike the condition of regular convergence on which the Convolution Theorem is based, we offer an alternative which is based on the continuity of Σ(ϑ):
Theorem 5.13.3 Assume that P^n_ϑ ∘ n^{1/2}(κ^{(n)} − κ(ϑ)), n ∈ N, converges to a limit distribution Q_ϑ, and that ϑ → Q_ϑ is continuous at ϑ0. If Σ(ϑ) is continuous at ϑ0, then

implies

Then
n^{1/2}(G(κ^{(n)}) − G(κ(P))) = J(κ(P))n^{1/2}(κ^{(n)} − κ(P)) + o(n⁰, P^n). (5.13.14)
This makes it easy to link the asymptotic concentration of G(κ^{(n)}) in certain subsets of B^q, say C_q, to the asymptotic concentration of κ^{(n)} in corresponding subsets C_p of B^p.
Convexity is a natural requirement for the sets in B^p and B^q. According to the Convolution Theorem, concentrations of regularly attainable limit distributions are comparable on convex sets that are symmetric about the origin. This is a property inherited from κ^{(n)} by G(κ^{(n)}).
A particular consequence: If κ̂^{(n)} is better than κ^{(n)} on C_p, then G(κ̂^{(n)}) is better than G(κ^{(n)}) on C_q. This does not, however, imply that optimality of κ^{(n)} entails optimality of G(κ^{(n)}) among all regular estimator sequences of G(κ(P)). For this, the Convolution Theorem is indispensable.
Under suitable regularity conditions, the canonical gradient of G ∘ κ is of the form J(κ(P))κ∗(·, P). Hence the covariance matrix of the optimal limit distribution of regular estimator sequences of G(κ(P)) is J(κ(P))Σ(P)J(κ(P))′, where Σ(P) = ∫κ∗(·, P)κ∗(·, P)′ dP is the covariance matrix of the optimal limit distribution of regular estimator sequences of κ(P).
For i = 1, 2, the probability measure Q_{0i} has the median log 2, and Q_i has the same median if ∫e^{−y} R(dy) = 1. Moreover,
Q((t′, t″) × (t′, t″)) = ∫Q_0((t′ + y, t″ + y) × (t′ + y, t″ + y)) R(dy)
= (e^{−t′} − e^{−t″})² ∫e^{−2y} R(dy) if log 2 ≤ t′ < t″.
Since ∫e^{−2y} R(dy) > ∫e^{−y} R(dy) = 1 (unless R{0} = 1), this implies

with strict inequality if R is chosen appropriately. Since the (strict) inequality holds for every t′ ≥ log 2, it also holds for t′ < log 2 close to log 2, in which case (t′, t″) × (t′, t″) contains the medians (log 2, log 2).
The following example shows that "componentwise better" does not always imply "jointly better".
Example There are Q_0 = N(0, Σ_0) and Q = N(0, Σ)|B² such that

but
N(0, I){u′u ≤ t} < Q{u′u ≤ t} for 0 < t < 1.
(See Pfanzagl 1994, p. 84, Example 2.4.3.) If we interpret N and Q as limit distributions of estimator sequences (ϑ_1^{(n)}, ϑ_2^{(n)})_{n∈N} and (ϑ̂_1^{(n)}, ϑ̂_2^{(n)})_{n∈N}, then we obtain that every κ(ϑ̂_1^{(n)}, ϑ̂_2^{(n)}) is better than κ(ϑ_1^{(n)}, ϑ_2^{(n)}), though (ϑ_1^{(n)}, ϑ_2^{(n)}) is more concentrated than (ϑ̂_1^{(n)}, ϑ̂_2^{(n)}) on small circles around (ϑ_1, ϑ_2).
We conclude this section with a result on the relation between multivariate concentration and the concentration of the marginals, the inequality
N(0, Σ)([−t_1, t_1] × ··· × [−t_k, t_k]) ≥ Π_{i=1}^k N(0, σ_{ii})[−t_i, t_i] for all t_i > 0, i = 1, ..., k. (5.13.16)
This relation was proved by Dunn (1958) under rather restrictive conditions. After an abortive attempt by Scott (1967), it was proved by Šidák (1967, p. 628, Corollary 1), using Anderson's Theorem. According to Jogdeo (1970) [p. 408, Theorem 3], equality in (5.13.16) for some t_1, ..., t_k implies that Σ is a diagonal matrix.
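Relation (5.13.16) can be probed by simulation. The sketch below is an added illustration with an arbitrary non-diagonal Σ; the joint probability of a centred box is compared with the product of the marginal probabilities:

```python
# Monte Carlo sketch (illustration only) of Sidak's inequality (5.13.16)
# with an arbitrary non-diagonal covariance matrix.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 0.8]])
t = np.array([1.0, 1.2, 0.7])

z = rng.multivariate_normal(np.zeros(3), Sigma, size=300000)
joint = (np.abs(z) <= t).all(axis=1).mean()
marginals = np.prod(2 * norm.cdf(t / np.sqrt(np.diag(Sigma))) - 1)
print("joint box probability ~", joint)      # should not fall below ...
print("product of marginals  :", marginals)
```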
Q^{(n)}_P := P^n ∘ c_n(κ^{(n)} − κ(P)).

Q_P ≤ Q̄_P for P ∈ P_1. (5.14.1)
where κ∗(·, ϑ) is the canonical gradient of κ with respect to the tangent space {a′ℓ•(·, ϑ) : a ∈ R^k}. It is easy to see that the proof of the Convolution Theorem (yielding the optimality of Q̄_ϑ = N(0, Λ(ϑ)) among all regularly attainable limit distributions) remains valid if the convergence of Q^{(n)}_ϑ to Q_ϑ holds just for a subsequence N_0 and if Q^{(n)}_{ϑ+n^{-1/2}a} ⇒ Q_ϑ holds for a in a dense subset of R^k. According to Bahadur's Lemma, these assumptions are fulfilled for ϑ in a dense subset Θ_0 ⊂ Θ. Hence for ℓ ∈ L_s,
More precisely,
Q_P ≤ N(0, Σ∗(P))
if Q^{(n)}_P, n ∈ N, converges to Q_P regularly in the parametric subfamily spanned by the tangent set {κ_1∗(·, P), ..., κ_p∗(·, P)}.
For the following we need a metric ρ on P such that P → ∫ℓ dQ^{(n)}_P is continuous for ℓ ∈ L_u. By the Addendum to Lemma 5.14.1, applied for X = P with f_n(P) = ∫ℓ dQ^{(n)}_P and f_0(P) = ∫ℓ dQ_P, the convergence of ∫ℓ dQ^{(n)}_P, n ∈ N, to ∫ℓ dQ_P is locally uniform (hence also regular) except for a set P_1 of first category in (P, ρ). If ρ is the sup-metric d and κ is continuous, this holds by Proposition 5.3.2 for ℓ ∈ L_u. The class L_u contains a countable weak-convergence determining class, obtained for example by appropriately truncating the convergence determining class x → exp[it′x], t ∈ Q^k. We obtain Q_P ≤ N(0, Σ∗(P)) for P ∈ P − P_1.
In the case of a parametric family, the exceptional set, being of λk -measure 0,
could legitimately be considered “small”. This does not necessarily apply to the set
P1 . However: If (P, ρ) is complete, P − P1 is dense in P by Lemma 5.14.2.
Instances of a metric ρ with the desired properties can be found in Pfanzagl (2002a) [Sect. 7]. Here are two examples of general families with a natural metric ρ such that no estimator sequence can be superefficient on some open subset of (P, ρ).
Example 1. Let P be the family of all probability measures P|B with a Lebesgue density such that ∫|x| P(dx) < ∞. Let κ(P) := ∫x P(dx). Then P, endowed with the metric ρ(P, Q) := ∫(1 + |x|)|p(x) − q(x)| dx, is complete, and κ is continuous.
Example 2. Let P be the family of all probability measures P|B with a positive, bounded and continuous Lebesgue density. Let κ(P) be the median of P. Then P, endowed with the metric ρ(P, Q) := ‖p − q‖_1 + ‖p − q‖_∞, is complete, and κ is continuous.
(ii) Here is a suggestion how the inequality Q_P ≤ Q̄_P can be extended to limit distributions Q_P which are continuous in all parametric subfamilies. Let Π_a := Q_{P_a}. This notation is to distinguish between Q̄_P (the optimal limit distribution in the family P), applied with P = P_a, and Π̄_a, the optimal limit distribution within the path a → P_a. If P → Q_P is continuous in all parametric subfamilies, a → Π_a is continuous. Since P_a is a path with P_a = P_0 for a = 0, we have Π_0 = Q_{P_0}.
Let Π̄_a denote the optimal limit distribution within the path P_a, a ∈ A. Since a → Π_a is continuous, we obtain from the results obtained in (i) that
Π_a ≤ Π̄_a for a ∈ A. (5.14.3)

Q_{P_0} = Π_0 ≤ Π̄_0 = Q̄_{P_0}.
Hence the inequality Q_{P_0} ≤ Q̄_{P_0} holds for all limit distributions Q_P which are continuous in all parametric subfamilies.
σ_β²(P) := (∫p′(x)²/p(x) dx)^{−1} + (β − ½)(1 − β)/(p(κ_β(P)))².
Regular estimator sequences attaining the optimal limit distribution do exist. The
optimal limit distribution depends continuously on P along any sufficiently reg-
ular one-parameter subfamily. Hence, no limit distribution sharing this continuity
property can be superefficient at some P ∈ P.
(Warning: The reader who tries to find a clear-cut description of this procedure
in Pfanzagl (2002a) will be disappointed. There are various examples where this
is carried through, but the ingredients of this procedure need to be collected from
Lemma 4.1, Theorem 6.1, etc. See also Pfanzagl 1999a, p. 74, Theorem.)
Auxiliary Results
In the following we show that under various conditions on a sequence of functions fn :
X → R, “convergence everywhere” implies “continuous convergence somewhere”.
We first present results for a general set X , and then stronger results for a Euclidean X .
then there exists a set X 1 ⊂ X which is of first category in X such that for every
x0 ∈ X − X 1 and for every sequence xn → x0 ,
then there exists a set X 1 ⊂ X of first category in X such that for every x0 ∈ X −
X 1 and every sequence xn → x0 ,
Lemma 5.14.2 Let (X, ρ) be a metric space which is complete or locally compact.
Let X 0 ⊂ X be an open subset. If X 1 is of first category in (X 0 , ρ), then X 0 − X 1 is
dense in X 0 .
Proof A convenient proof for the case of a complete metric space can be found in
Pfanzagl (2002a) [p. 96, Lemma 9.2]. In order to establish the assertion for locally
compact metric spaces, observe that for any compact C ⊂ X_0 with C° ≠ ∅, the set
X 1 ∩ C is of first category in C if X 1 is of first category in X 0 . Since (C, ρ) is
complete, C − X 1 is dense in C. The assertion follows by applying this with C a
compact neighbourhood of an arbitrary element of X . (See also Kelley 1955, p. 200,
Theorem 34.)
Then for every sequence (yn )n∈N → 0 there exists an infinite subsequence N0 and a
λk -null set X 1 such that
Addendum. Together with the analogous version with lim inf and ≥ replaced
by lim sup and ≤, this yields that limn→∞ f n (x) = f 0 (x) for x ∈ X 0 implies
This Addendum is the well known Lemma of Bahadur (1964) [p. 1549, Lemma
4]. See also Droste and Wefelmeyer (1984) [p. 140, Proposition 3.7]. For a proof of
Lemma 5.14.3 see Pfanzagl (2003) [p. 107, Proposition 5.1].
In contrast to Lemma 5.14.1, Lemma 5.14.3 does not require the functions f_n to be continuous. On the other hand, Bahadur's Lemma asserts a version of local uniformity which is weaker than continuous convergence in two respects: it asserts f_n(x_n) → f_0(x) (i) not for all sequences x_n → x, but only for x_n = x + y_n, for a countable family of sequences (y_n)_{n∈N} → 0, and (ii) only for some subsequence N_0 ⊂ N. Yet this is all one needs for applications in statistical theory: The proof of the Convolution Theorem requires Q^{(n)}_{ϑ_n} ⇒ Q_{ϑ0} just for the sequences ϑ_n = ϑ0 + c_n^{−1}a with a ∈ Q^k.
Though the occurrence of some subsequence N0 in Bahadur’s Lemma does not
impair its usefulness in statistical applications, Bahadur (1964) [p. 1550] raised the
question whether limn∈N0 f n (x + yn ) = f 0 (x) for λk -a.a. x ∈ Rk for some subse-
quence N0 could be strengthened to limn→∞ f n (x + yn ) = f 0 (x) for λk -a.a. x ∈ Rk .
An example by Rényi and Erdős (see Schmetterer 1966, pp. 304/5) exhibits a
sequence of functions with f n (x) → 0 for λ-a.a. x ∈ (0, 1), but lim supn→∞ f n (x +
n −1/2 ) = 1 for every x ∈ (0, 1). (Be aware of a misprint on p. 306). Since the func-
tions f n in this example are discontinuous, this does not exclude a sharper result for
continuous functions.
In the case of a parametric family P := {P_ϑ : ϑ ∈ Θ}, Θ ⊂ R^k, the natural application is with Lemma 5.14.3. As an alternative one might entertain the use of Lemma 5.14.1, applied with X = Θ and f_n(ϑ) = ∫ℓ(c_n(ϑ^{(n)} − ϑ)) dP^n_ϑ (rather than X = {P_ϑ : ϑ ∈ Θ}). If ϑ → P_ϑ is continuous with respect to the sup-distance on P, the functions f_n are continuous. (The proof is about the same as that of Lemma 5.8.3.) Hence the Addendum to Lemma 5.14.1 implies continuous convergence on Θ except for a set Θ_1 of first category. Lemma 5.14.3 yields the weaker (but equally useful) regular convergence along some subsequence, except for a set Θ_2 which is of Lebesgue measure zero. Obviously, Θ_2 ⊂ Θ_1, i.e., the set of continuous convergence is smaller than the set of regular convergence. If one is satisfied with the weaker assertion, then one can do without Bahadur's Lemma, provided Θ is complete. Yet it adds much to the interpretation of this result if we know that the set of regular convergence is not just dense in Θ but equal to Θ up to a set of Lebesgue measure zero. (Recall, in this connection, that a set of first category in R^k may be of positive Lebesgue measure.)
If X = R^k and the functions f_n, n = 0, 1, ..., are continuous, both versions of the lemmas apply. To make use of the fact that the exceptional set X_1 in Lemma 5.14.1 is of first category, one might assume that X_0 is complete or locally compact. Then one could conclude (see Lemma 5.14.2) that convergence of f_n(x) to f_0(x) for every x ∈ X_0 implies locally uniform convergence on a dense subset of X_0.
Lemma 5.14.3 implies some sort of "qualified" regularity only, i.e., the convergence of f_n(x + c_n^{−1}a), n ∈ N_0, to f_0(x) along a subsequence N_0 for λ-a.a. x ∈ X_0 and all a in a countable subset of R^k. For certain proofs this qualified regularity suffices. In such cases, one will prefer a result for λ-a.a. x ∈ X_0 over a result for all x in a dense subset of X_0.
Q^{(n)}_ϑ := P^{(n)}_ϑ ∘ c_n(κ^{(n)} − κ(ϑ)).

ϑ → ∫ℓ dQ^{(n)}_ϑ is continuous for every n ∈ N.

a → ∫ℓ dQ^{(n)}_{ϑ+c_n^{−1}a}, n ∈ N,

lim sup_{n→∞} |a − a_0|^{−1} d(P^{(n)}_{ϑ+c_n^{−1}a}, P^{(n)}_{ϑ+c_n^{−1}a_0}) < ∞ (5.14.6)

a → ∫ℓ dQ^{(n)}_{ϑ+c_n^{−1}a}, n ∈ N,
Proof Since
|∫ℓ dQ^{(n)}_{ϑ+c_n^{−1}a} − ∫ℓ dQ^{(n)}_{ϑ+c_n^{−1}a_0}|
≤ |∫ℓ(c_n(κ^{(n)} − κ(ϑ + c_n^{−1}a))) dP^{(n)}_{ϑ+c_n^{−1}a} − ∫ℓ(c_n(κ^{(n)} − κ(ϑ + c_n^{−1}a))) dP^{(n)}_{ϑ+c_n^{−1}a_0}|
+ ∫|ℓ(c_n(κ^{(n)} − κ(ϑ + c_n^{−1}a))) − ℓ(c_n(κ^{(n)} − κ(ϑ + c_n^{−1}a_0)))| dP^{(n)}_{ϑ+c_n^{−1}a_0}
≤ d(P^{(n)}_{ϑ+c_n^{−1}a}, P^{(n)}_{ϑ+c_n^{−1}a_0}) + sup_{u∈R^k} |ℓ(u + c_n(κ(ϑ + c_n^{−1}a) − κ(ϑ + c_n^{−1}a_0))) − ℓ(u)|,
Addendum 1. We have
lim sup_{n→∞} sup_{x∈A} f_n(x) ≤ lim sup_{n∈N_0} f_n(x_0) ≤ lim sup_{n→∞} f_n(x_0) ≤ sup_{x∈A} lim sup_{n→∞} f_n(x).
Chernoff (1956) [p. 12, Theorem 1] presents, for regular one-parameter families and i.i.d. observations, a result which seems to be the forefather of various asymptotic "minimax" theorems. Rewritten in our notation, with ϑ^{(n)} an estimator of ϑ and
Q^{(n)}_ϑ = P^n_ϑ ∘ n^{1/2}(ϑ^{(n)} − ϑ),
and assuming that Λ(ϑ) := 1/I(ϑ) is continuous at ϑ0, Chernoff's Theorem (attributed by Hájek 1972, p. 177, to an unpublished paper by Stein and Rubin) asserts that for every estimator sequence ϑ^{(n)},
then
Q^{(n)}_{ϑ0} ⇒ N(0, Λ(ϑ0)) = Q̄_{ϑ0}.
The interrelation between the length of the interval |ϑ − ϑ0 | ≤ n −1/2 r over which
the sup is taken, and the loss function u → min{u 2 , r 2 }, makes the interpretation
of such results difficult; the restriction to the quadratic loss function impairs their
relevance.
Motivated by the results of Chernoff (1956) and Huber (1966), Hájek (1972) felt the need of supplementing the Convolution Theorem for estimator sequences converging regularly to a limit distribution by a result for arbitrary estimator sequences. This led him to the so-called "Local Asymptotic Minimax Theorem": For regular estimator sequences, the asymptotic risk ∫ℓ dQ^{(n)}_ϑ is constant in shrinking neighborhoods |ϑ − ϑ0| ≤ n^{−1/2}r for every r. The Minimax Theorem says that even for non-regular estimators, the maximal risk in such neighborhoods, or more precisely

cannot be lower than this constant. Chernoff already seemed to be ill at ease with the interpretation of his theorem: "For an arbitrary estimate [Λ(ϑ0)] is 'essentially' asymptotically a lower bound for the asymptotic variance ..." After more than 50 years this construct is still lacking an operationally significant interpretation. For Le Cam, Hájek's Minimax Theorem is another chance to invent an abstract version (see e.g. Le Cam 1979, p. 124, Theorem 1).
The following refers to a general family {(P^{(n)}_ϑ)_{n∈N} : ϑ ∈ Θ}, Θ ⊂ R^k, fulfilling an LAN condition
log(dP^{(n)}_{ϑ+c_n^{−1}a}/dP^{(n)}_ϑ) = a′Δ_n − ½a′L(ϑ)a + o(n⁰, P^{(n)}_ϑ).
We write Λ(ϑ) = L(ϑ)^{−1} for the "optimal" asymptotic covariance matrix for estimators of ϑ.
Hájek’s Minimax Theorem For one-parameter families fulfilling LAN, the follow-
ing is true for loss functions ∈ Ls .
(i) For any estimator sequence ϑ (n) ,
(ii) If
rather than
lim lim inf sup ,
r →∞ n→∞ |ϑ−ϑ |≤c r
0 n
but the latter version is indicated in his Remark 1, p. 189. (Beware of a misprint
in Hájek’s relation 4.17.) Hájek’s remark does not explicitly refer to part(ii), but
Ibragimov and Has’minskii 1981, p. 168, claim (without proof) that part (ii) holds
for neighbourhoods |ϑ − ϑ0 | ≤ cn r as well.) Fabian and Hannan (1982) discuss
the distinction between fixed and shrinking neighbourhoods in detail. In Sect. 3,
pp. 463/4, they point out that estimator sequences fulfilling relation (5.15.2) with
sup|ϑ−ϑ0 |<δ may not exist.
Hájek’s Minimax Theorem is restricted to the case Θ ⊂ R, with a brief remark (p.
188, Theorem 4.2) concerning the bound for estimators of ϑ1 if ϑ = (ϑ1 , . . . , ϑk ).
The full k-dimensional version of part (i), referring to loss functions which are
symmetric, subconvex, and which increase with u → ∞ not too quickly, is due to
Ibragimov and Has’minskii (1981) [p. 162, Theorem 12.1, with the Russian original
dating from 1979].
For readers who are amazed at finding on p. 162 in Ibragimov and Has'minskii the assertion "... is possible if and only if ...": the "if" is an extra of the translator. The Russian original says, correctly: "... is possible only if ...".
This is not the only slip in the translation of this theorem. The theorem refers to the class W_{e,2} of all loss functions growing more slowly than u → exp[εu²] for every ε > 0. In the translation, this class appears as W_{ε,2}, a slip which is not without the danger of confusion, since ε corresponds in the model of Ibragimov and Has'minskii to n^{−1}, and the asymptotic assertion refers explicitly to ε → 0. Hence the reader might be confused by an asymptotic assertion for ε → 0 which holds for every loss function in the class W_{ε,2}.
In their Theorem 12.1, the authors present the version of (5.15.1) with
A detailed proof of the latter version may be found in Witting and Müller-Funk
(1995) [p. 457, Satz 6.229].
Observe that lim_{r→∞} in relation (5.15.1) is essential. For the Hodges estimator with α = 1/2, we have

as long as r < √3. (See Lehmann and Casella 1998, p. 440, Example 2.5 and p. 442, Example 2.7.)
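The role of lim_{r→∞} can be made concrete with the Hodges estimator itself. The sketch below is an added illustration for the normal location family (α = 1/2, truncation at n^{−1/4}): at ϑ_n = rn^{−1/2} the normalized quadratic risk approaches (1 + r²)/4, which stays below the regular bound 1 exactly for r < √3:

```python
# Monte Carlo sketch (illustration only): Hodges estimator in the normal
# location family, alpha = 1/2: est = xbar if |xbar| > n^{-1/4}, else xbar/2.
# At theta_n = r/sqrt(n) the normalized risk tends to (1 + r^2)/4.
import numpy as np

rng = np.random.default_rng(6)
n, reps = 10000, 100000
for r in (0.5, 1.0, 3**0.5, 2.5):
    theta_n = r / np.sqrt(n)
    xbar = rng.normal(theta_n, 1.0 / np.sqrt(n), size=reps)  # law of the mean
    est = np.where(np.abs(xbar) > n**-0.25, xbar, 0.5 * xbar)
    risk = (n * (est - theta_n)**2).mean()
    print(f"r={r:.3f}: local risk ~ {risk:.3f}, (1+r^2)/4 = {(1 + r*r)/4:.3f}")
```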
When Hájek wrote his paper on the Minimax Theorem, N(0, Λ(ϑ0)) was known as a bound for regular estimator sequences. When he tried in 1972 to characterize N(0, Λ(ϑ0)) as a bound for arbitrary estimator sequences (Hájek 1972), he obviously was not familiar with the details of Le Cam's fundamental paper from 1953. In Theorem 14, page 327, Le Cam had shown, for Θ ⊂ R, bounded and continuous loss functions ℓ, and under Cramér-type regularity conditions, that
implies
which is the original version of Hájek’s relation (5.15.1). Hence in part (i) of his
Minimax Theorem, Hájek had just rediscovered earlier results of Le Cam. Part (ii)
contains something new.
With the bound N (0, Λ(ϑ0 )) determined by (5.15.1), it is natural to define “opti-
mality” (or “local asymptotic minimaxity”) by equality in (5.15.1). One could object
that Hájek’s definition (5.15.1) presumes the existence of
for every r > 0. This suggests replacing the definition of optimality by the apparently
stronger but more intuitive condition
With
F_n(r) := sup_{|ϑ−ϑ0|≤c_n^{−1}r} (∫ℓ dQ^{(n)}_ϑ − ∫ℓ dQ̄_{ϑ0}), (5.15.5)
Le Cam writes this relation with neighbourhoods |ϑ − ϑ0| < δ, but his proof is valid for |ϑ − ϑ0| < c_n^{−1}r as well. Relation (5.15.4), rewritten as lim sup_{n→∞} F_n(r) ≤ 0 for every r > 0, therefore implies lim inf_{n→∞} F_n(0) ≥ 0. Hence

so that
lim_{n→∞} F_n(r) = 0 for every r > 0,
implies

hence also
P^{(n)}_{ϑ0+c_n^{−1}a_n} ∘ c_n(ϑ^{(n)} − (ϑ0 + c_n^{−1}a_n)) ⇒ N(0, Λ(ϑ0)),

lim sup_{n→∞} ∫ℓ dQ^{(n)}_{ϑ+c_n^{−1}a} ≤ ∫ℓ dN(0, Λ(ϑ)) for every a ∈ R.
For the origin of this result, Strasser gives unspecified references to Hájek (1972)
and Le Cam (1953), without emphasizing the improvement from “sup|a|≤r for all r ”
to “for all a”. (The reader who tries to discover traces of this result in Le Cam (1953),
will fail. A rather abstract version of Hájek (ii) appears in Le Cam (1972).)
Observe that the implication from (5.15.4) to (5.15.3) does not extend to families with Θ ⊂ R^k, k ≥ 3. Stein's estimator (1956) (see Sect. 5.14, Example 4, below) fulfills at ϑ0 = 0 relation (5.15.4), hence also (5.15.2), for the loss function ℓ(u) = |u|², but not (5.15.3).
For Θ ⊂ R^k with k > 1, Hájek does not mention that a k-dimensional estimator sequence ϑ^{(n)} = (ϑ_1^{(n)}, ..., ϑ_k^{(n)}) which admits a stochastic expansion

has the joint limit distribution N(0, Λ(ϑ0)). For a discussion of "component-wise optimality" versus "joint optimality" see Sect. 5.13.
Since the result of Hájek (ii) does not extend from Θ ⊂ R to arbitrary dimensions k (think of Stein's shrinking estimators), the adequate generalization is to one-dimensional functionals defined on Θ ⊂ R^k. A result of this kind appears in Strasser (1997) [p. 372, Theorem (iii)]. Specialized to the present framework of parametric LAN-families with Θ ⊂ R^k and continuously differentiable functionals κ : Θ → R with gradient K(ϑ) at ϑ, Strasser's result reads as follows. Let L_s denote the set of subconvex and symmetric functions ℓ : R^k → [0, 1].
If for some estimator sequence κ^{(n)}, n ∈ N,
lim sup_{n→∞} ∫ℓ(c_n(κ^{(n)} − κ(ϑ + c_n^{−1}a))) dP^{(n)}_{ϑ+c_n^{−1}a} ≤ ∫ℓ dN(0, K(ϑ)Λ(ϑ)K(ϑ)′) for every a ∈ R^k, (5.15.6)
then
c_n(κ^{(n)} − κ(ϑ)) − K(ϑ)Λ(ϑ)Δ_n(·, ϑ) → 0 (P^{(n)}_ϑ). (5.15.7)

lim_{r→∞} lim inf_{n→∞} sup_{|a|≤r} ∫ℓ dQ^{(n)}_{ϑ+c_n^{−1}a} ≥ ∫ℓ dQ̄_ϑ for every ℓ ∈ L_s (5.15.8)

lim sup_{n→∞} sup_{|a|≤r} ∫ℓ dQ^{(n)}_{ϑ+c_n^{−1}a} ≤ ∫ℓ dQ̄_ϑ for some ℓ ∈ L_s and every r > 0 (5.15.9)
implies
c_n(ϑ^{(n)} − ϑ) − Λ(ϑ)Δ_n(·, ϑ) → 0 (P^{(n)}_ϑ). (5.15.10)
With the help of this relation, one obtains from Hájek's results (5.15.6) and (5.15.7) or (5.15.8) that

implies (5.15.10).
Corresponding results are true if Q^{(n)}_ϑ = P^{(n)}_ϑ ∘ c_n(ϑ^{(n)} − ϑ) is replaced by Q^{(n)}_ϑ = P^{(n)}_ϑ ∘ c_n(κ^{(n)} − κ(ϑ)).
Asymptotic Optimality is Independent of the Loss Function
The Convolution Theorem provides an asymptotic bound for the concentration of an important class of estimator sequences: those converging regularly to some limit distribution. If one takes the concept of a loss function seriously, one might ask whether there are estimator sequences, adjusted to the "true" loss function, that are asymptotically better than the best regular estimator sequences. The Convolution Theorem, after all, assumes regular convergence of ∫ℓ dQ^{(n)}_ϑ, n ∈ N, to ∫ℓ dQ_ϑ for some Q_ϑ and every ℓ ∈ L.
In order to bring in a certain regularity if the evaluation is restricted to some ℓ_0, one might require the existence of a function ℓ̄_0 : Θ → [0, ∞) which is approached by ∫ℓ_0 dQ^{(n)}_ϑ in the sense that
lim_{δ↓0} lim sup_{n→∞} sup_{|ϑ−ϑ0|≤δ} |∫ℓ_0 dQ^{(n)}_ϑ − ℓ̄_0(ϑ)| = 0. (5.15.11)
Recall that continuity of ϑ → ∫ℓ_0 dQ^{(n)}_ϑ implies continuity of ϑ → ℓ̄_0(ϑ) by Lemma 5.3.9. Together with (5.15.1), relation (5.15.11) implies that
ℓ̄_0(ϑ0) ≥ ∫ℓ_0 dQ̄_{ϑ0}.
and
lim sup_{n→∞} F_n(r) ≤ 0 for every r ≥ 0. (5.15.13)
(ii) If, for some subsequence N_0, lim_{n∈N_0} F_n(r) exists for every r ≥ 0, then
Observe that (5.15.12) is based on relation (5.15.1), i.e., part (i) of Hájek’s Mini-
max Theorem, which is not restricted to one-parameter families.
Proof (i) implies (ii). Let N_0 be a subsequence such that lim_{n∈N_0} F_n(r) exists for every r ≥ 0. Since

hence
lim_{r→∞} lim_{n∈N_0} F_n(r) = 0.
Equation (5.15.12): If lim_{r→∞} lim inf_{n→∞} F_n(r) < 0, there exists a sequence (r_n)_{n∈N_1} → ∞ such that lim_{n∈N_1} F_n(r_n) < 0. Since there exists a subsequence N_0 ⊂ N_1 such that lim_{r→∞} lim_{n∈N_0} F_n(r) = 0, the relation lim_{n∈N_0} F_n(r) ≤ lim_{n∈N_0} F_n(r_n) < 0 for r ≥ 0 leads to a contradiction.
Equation (5.15.13): If lim sup_{n→∞} F_n(r_0) > 0 for some r_0 ≥ 0, there exists a subsequence N_1 such that lim_{n∈N_1} F_n(r_0) > 0. This is in contradiction to the existence of a subsequence N_0 ⊂ N_1 such that lim_{n∈N_0} F_n(r) exists for every r > 0, whence lim_{r→∞} lim_{n∈N_0} F_n(r) = 0.
Lemma 5.15.2 The assertions (i), (ii) and (iii) are equivalent.
(i) For every subsequence N_0,

Proof (i) implies (ii). If lim inf_{n∈N_0} F_n(0) < 0, there exists N_1 ⊂ N_0 such that

hence
lim_{r→∞} lim sup_{n∈N_1} F_n(r) ≤ 0.
5.16 Superefficiency
In Hájek’s Minimax-Theorem, d N (0, Σ∗ (ϑ)) occurs as a lower bound for
All these results are based on some kind of local uniformity condition. Even if
a condition like regular convergence to a limit distribution may be justified from
the operational point of view (as an outgrowth of locally uniform convergence), it
is of interest whether a bound like N (0, Σ∗ (ϑ)) keeps its role under less restrictive
conditions, say as a bound for limit distributions which are attained for every ϑ ∈
Θ (but not necessarily in a locally uniform sense). This problem is dealt with in
Sect. 5.14. The present section is on the phenomenon of “superefficiency”.
Definition 5.16.1 The family P_0 ⊂ P is, for the estimator sequence (κ^{(n)})_{n∈N} and the loss function ℓ, a set of superefficiency if there exists a subsequence N_0 such that

(ii) Q̄_P is optimal in the sense that for every "regular" estimator sequence (κ^{(n)})_{n∈N}

The question is whether there are estimator sequences κ^{(n)} such that the risk ∫ℓ(c_n(κ^{(n)} − κ(P))) dP^{(n)} remains, for some subsequence and some loss function ℓ, smaller than ∫ℓ dQ̄_P for P in a substantial subset of P.
The common definition requires (5.16.1) with N in place of N0 . The present
definition takes into account that the statistician will be satisfied with superefficiency
along some infinite subsequence from which the sample size could be chosen.
Some authors (see e.g. Ibragimov and Has'minskii 1981, p. 170) use lim inf rather than lim sup in the definition of superefficiency. This seems to be questionable. Smallness of lim inf_{n→∞} ∫ℓ(c_n(κ^{(n)} − κ(P))) dP^n for every P is of no relevance if the sample sizes corresponding to small values of ∫ℓ(c_n(κ^{(n)} − κ(P))) dP^{(n)} depend on P. This is convincingly demonstrated by the following example, due to van der Vaart (1997, p. 407).
Example 1. (i) Let u_n ∈ (0, 1), n ∈ N, be such that n^{1/2}|u_n − u|, n ∈ N, has, for every u ∈ (0, 1), the accumulation point zero. If u_{2^m+k} := k2^{−m} for k = 1, …, 2^m, then |u_n − u| < n^{−1} for n = 2^m + k_m, if k_m ∈ {1, …, 2^m} is chosen such that (k_m − 1)2^{−m} < u ≤ k_m 2^{−m}.
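The claim in (i) can be checked mechanically. The following sketch (not part of the book; the function u and the target points are illustrative choices) enumerates the dyadic sequence and evaluates n^{1/2}|u_n − u| along the subsequence n = 2^m + k_m:

```python
import math

# Sketch (not from the book): the dyadic sequence u_{2^m + k} = k 2^{-m},
# k = 1, ..., 2^m, written as a function of the single index n.
def u(n):
    m = (n - 1).bit_length() - 1      # 2^m < n <= 2^{m+1}
    k = n - 2**m                      # k in {1, ..., 2^m}
    return k * 2.0**(-m)

for target in (0.3, 1 / math.pi, 0.925):
    for m in range(1, 30):
        k_m = math.ceil(target * 2**m)   # (k_m - 1) 2^{-m} < u <= k_m 2^{-m}
        n = 2**m + k_m
        val = math.sqrt(n) * abs(u(n) - target)
    # |u_n - u| <= 2^{-m} and n <= 2^{m+1}, so sqrt(n)|u_n - u| <= 2^{(1-m)/2} -> 0
    print(f"u = {target:.4f}: sqrt(n)|u_n - u| = {val:.2e} at m = 29")
```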
(ii) Let {Pϑ : ϑ ∈ (0, 1)} be a parametric family. Then the estimator sequence
ϑ (n) (xn ) := u n (which is independent of xn ) fulfills
If, for m > 1, I_{δ_1,…,δ_{m−1}} = [a_{δ_1,…,δ_{m−1}}, b_{δ_1,…,δ_{m−1}}], we define

The length of I_{δ_1,…,δ_m} is 2^{−m(m+1)/2}, and the minimum distance between two distinct intervals I_{δ_1,…,δ_m} and I_{δ'_1,…,δ'_m} is
(iv) If ϑ ∉ ∩_{m=1}^∞ S_m, there exists m_0 ∈ N such that ϑ ∉ S_{m_0}, whence d(ϑ, S_{m_0}) > 0. Let m_1 be such that ½Δ_{m_1} < d(ϑ, S_{m_0}). As (x̄_{n_m})_{m∈N} → ϑ N(ϑ, 1)^N-a.e., there exists m(x_n) ≥ max{m_0, m_1} (depending on ϑ) such that d(x̄_{n_m}, ϑ) ≤ d(ϑ, S_{m_0}) − ½Δ_{m_1} for all m ≥ m(x_n). Hence m ≥ m(x_n) implies

d(x̄_{n_m}, S_m) ≥ d(x̄_{n_m}, S_{m_0}) ≥ d(ϑ, S_{m_0}) − d(x̄_{n_m}, ϑ) ≥ ½Δ_{m_1} ≥ ½Δ_m.
Hence we have ϑ^{(n_m)}(x_n) = x̄_{n_m} for all m ≥ m(x_n) and N(ϑ, 1)^N-a.a. x_n ∈ R^N. This, however, implies that N(ϑ, 1)^{n_m} ∘ n_m^{1/2}(ϑ^{(n_m)} − ϑ), m ∈ N, has the same limit distribution as N(ϑ, 1)^{n_m} ∘ n_m^{1/2}(x̄_{n_m} − ϑ), m ∈ N, namely N(0, 1).
(v) If ϑ ∈ ∩_{m=1}^∞ S_m, there exists (δ_i)_{i∈N} ∈ {0, 1}^N such that ϑ ∈ I_{δ_1,…,δ_m} for all m ∈ N. Then

since a_{δ_1,…,δ_m} < ϑ < b_{δ_1,…,δ_m}. As ½Δ_m n_m^{1/2} = 2^{m/2−1}(1 − 2^{−m+1}), we have

≤ ((1 − α)/α) 2^{−m(m+1)/2} 2^{m²/2} = ((1 − α)/α) 2^{−m/2} ↓ 0.
Therefore,

N(ϑ, 1)^{n_m} ∘ n_m^{1/2}(ϑ^{(n_m)} − ϑ) ⇒ N(0, α²).
(vi) For every sequence (δ_i)_{i∈N} ∈ {0, 1}^N there exists a point r_{(δ_i)_{i∈N}} in the set ∩_{m=1}^∞ I_{δ_1,…,δ_m}. (Since (I_{δ_1,…,δ_m})_{m∈N} is a decreasing sequence of compact sets with diameter converging to zero, this intersection consists of exactly one point.) As {0, 1}^N is uncountable, ∩_{m=1}^∞ S_m is uncountable. Being a set of superefficiency, ∩_{m=1}^∞ S_m is necessarily of Lebesgue measure zero. This also follows directly from λ(∩_{m=1}^∞ S_m) ≤ λ(S_m) = 2^m 2^{−m(m+1)/2} = 2^{−m(m−1)/2} for m ∈ N.
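The bookkeeping in (vi) can be verified with exact rational arithmetic; the following sketch (not from the book) checks the identity λ(S_m) = 2^m · 2^{−m(m+1)/2} = 2^{−m(m−1)/2}:

```python
from fractions import Fraction

# Sketch (not from the book): S_m is the union of 2^m intervals, each of
# length 2^{-m(m+1)/2}; its Lebesgue measure is 2^{-m(m-1)/2}.
for m in range(1, 9):
    measure = 2**m * Fraction(1, 2**(m * (m + 1) // 2))
    assert measure == Fraction(1, 2**(m * (m - 1) // 2))
    print(f"m = {m}: lambda(S_m) = 2^(-{m * (m - 1) // 2})")
```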
Whereas it requires some effort (as in Example 2) to construct an estimator sequence which is superefficient on an uncountable subset of R, superefficient estimator sequences on uncountable subsets of R^k, k > 1, are straightforward.
Example 3. For Θ = R² let P_{(ϑ_1,ϑ_2)} := N(ϑ_1, 1) × N(ϑ_2, 1). The problem is to estimate κ(P_{(ϑ_1,ϑ_2)}) = ϑ_1. The optimal limit distribution of regular estimator sequences is N(0, 1). The estimator sequence

ϑ^{(n)}((x_{1ν}, x_{2ν})_{ν=1,…,n}) := x̄_{1n} if |x̄_{1n} − x̄_{2n}| > n^{−1/4}, and (x̄_{1n} + x̄_{2n})/2 if |x̄_{1n} − x̄_{2n}| ≤ n^{−1/4},
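A small simulation (not from the book; sample size, replication count and seed are arbitrary choices) illustrates the behaviour of this estimator on and off the diagonal ϑ_1 = ϑ_2:

```python
import random, statistics

def estimate(theta1, theta2, n, rng):
    # n observations from N(theta1,1) x N(theta2,1); pool the two sample
    # means whenever they differ by at most n^{-1/4}, else use x1bar alone.
    m1 = statistics.fmean(rng.gauss(theta1, 1) for _ in range(n))
    m2 = statistics.fmean(rng.gauss(theta2, 1) for _ in range(n))
    return (m1 + m2) / 2 if abs(m1 - m2) <= n**-0.25 else m1

rng, n, reps = random.Random(0), 2500, 1000
for theta1, theta2 in [(1.0, 1.0), (1.0, 2.0)]:     # on / off the diagonal
    mse = statistics.fmean((estimate(theta1, theta2, n, rng) - theta1) ** 2
                           for _ in range(reps))
    print(f"theta = ({theta1}, {theta2}): n * E(error^2) ~ {n * mse:.3f}")
# On the diagonal the normalized risk is near 1/2 (superefficient); off the
# diagonal the pooling event becomes negligible and the risk is near 1.
```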
Since the interesting results on superefficiency are negative (in the sense that “sets of superefficiency are necessarily small”), they are stronger if based on Definition 5.16.1, which requires nothing about the performance of the estimator sequences for ϑ outside Θ_0.
In his Corollary 8:1, p. 314, Le Cam asserts for Θ ⊂ R^k under Cramér-type regularity conditions that the set of superefficiency is necessarily of Lebesgue measure 0. His result is based on Bayesian arguments, which were in the air at that time. Wolfowitz, too (1953a, p. 116), gave an (informal) Bayesian proof of the fact that superefficiency is possible on a set of Lebesgue measure zero only, but he assumes that the estimator sequence converges to a normal limit distribution. An elegant proof under the same assumption was later given by Bahadur (1964) in a paper which is remarkable in various respects.
The best result concerning sets of superefficiency for k-parameter LAN-families is implicitly contained in Strasser (1978a). His Proposition, p. 37, adjusted to the present framework, reads as follows.

For any probability measure Π|B^k ≪ λ^k|B^k and any subconvex loss function ℓ ∈ L_s,

lim inf_{n→∞} ∫∫ℓ dQ^{(n)}_ϑ Π(dϑ) ≥ ∫∫ℓ dN(0, Σ_*(ϑ)) Π(dϑ).    (5.16.2)

This implies

lim_{n→∞} ∫ℓ dQ^{(n)}_ϑ ≥ ∫ℓ dN(0, Σ_*(ϑ)) for λ^k-a.a. ϑ ∈ Θ,
An intelligible version of Le Cam’s original proof can be found in van der Vaart
(1997, pp. 398–401).
For parametric families, “smallness” of a set can be expressed as being “of
Lebesgue measure 0”. For general families, a set of superefficiency may be shown
to be of “first category” (see Pfanzagl (2003), p. 97, Theorem 2.1). Conditions under
which sets of first category can be considered as “small” are discussed in Sect. 5.14.
Does Superefficiency Exclude Local Uniformity?
After having shown that sets of superefficiency are necessarily of Lebesgue measure
zero, the next step for Le Cam would naturally have been to find operationally mean-
ingful conditions on the estimator sequence which preclude the irritating phenom-
enon of superefficiency. Le Cam approached this problem indirectly by exhibiting
the irregularity of superefficient estimator sequences for one-dimensional parame-
ters. He proved (Le Cam 1953, p. 327, Theorem 14) under Cramér-type conditions
the following.
Lemma 5.16.2 Let Θ ⊂ R. For every loss function ℓ which is sufficiently smooth, bounded and symmetric about 0,

implies
Based on this result, it would have been easy to show that uniformly convergent estimator sequences of real parameters cannot be superefficient; more precisely: If

lim_{n→∞} ∫ℓ dQ^{(n)}_{ϑ_n} = ∫ℓ dQ_{ϑ_0} for every sequence ϑ_n → ϑ_0,    (5.16.5)

then

∫ℓ dQ_{ϑ_0} ≥ ∫ℓ dN(0, Σ_*(ϑ_0)).    (5.16.6)

Proof If

∫ℓ dQ_{ϑ_0} < ∫ℓ dN(0, Σ_*(ϑ_0)),    (5.16.7)
then relation (5.16.5), applied with ϑ_n ≡ ϑ_0, implies (5.16.3), which, in turn, implies (5.16.4), i.e.

∫ℓ dQ_{ϑ_0} = lim_{n→∞} ∫ℓ dQ^{(n)}_{ϑ_n} > ∫ℓ dQ_{ϑ_0},

a contradiction.
It is clear from Le Cam’s proof that Lemma 5.16.2 holds true already with ϑn =
ϑ0 + n −1/2 a. Hence regularity in the sense of (5.16.5) with ϑn = ϑ0 + n −1/2 a suffices
to exclude superefficiency. Since Le Cam missed this opportunity, such results did
not appear until 1963 (C.R. Rao, Wolfowitz; see Sect. 5.8).
The implication from (5.16.3) to (5.16.4) was established by Le Cam for Θ ⊂ R,
and he remarks that its extension to arbitrary dimensions of Θ poses certain problems.
This is confirmed by Stein’s shrinkage estimator.
Example 4. We consider for k ≥ 3 the family {N (ϑ, Ik ) : ϑ ∈ Rk }, where Ik is the
unit-matrix in Rk . The problem is to estimate ϑ = (ϑ1 , . . . , ϑk ). The optimal limit
distribution for regular estimator sequences is N (0, Ik ), which is attained locally
uniformly by the sample mean x̄n := (x 1n , . . . , x kn ) for every n ∈ N. Evaluated by
the loss function (u) = u2 , we have
we have
Since the presentation in Stein (1956) is not very transparent, compare Ibragimov and Has'minskii (1981, p. 27) or Lehmann and Casella (1998, p. 355, Theorem 5.5.1) for details. According to Casella and Hwang (1982, p. 306, Lemma 1),

1/((k − 2) + n‖ϑ‖²) < n^{−1} ∫‖x̄_n‖^{−2} N(ϑ, I_k)^n(dx_n) ≤ (1/(k − 2)) · (1/(1 + n‖ϑ‖²/k)),

hence

∫ℓ(n^{1/2}(ϑ^{(n)} − ϑ)) dN(ϑ, I_k)^n < ∫ℓ dN(0, I_k) − 1/((k − 2) + n‖ϑ‖²).
Hence Stein's estimator, evaluated by the loss function ℓ(u) = ‖u‖², is superefficient at ϑ = 0, and nowhere inefficient. Stein was interested in the performance of ϑ^{(n)} for finite sample sizes. This might explain why he overlooked the relevance his invention has in connection with Le Cam's Theorem 14.
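A Monte Carlo sketch (not from the book) makes the superefficiency at ϑ = 0 visible in the one-observation case, where the risk of Stein's estimator at 0 equals k − (k − 2) = 2, while the risk of the ML estimator is k everywhere:

```python
import random

def stein_risk(theta, reps, rng):
    # E||(1 - (k-2)/||x||^2) x - theta||^2 for one observation x ~ N(theta, I_k)
    k, total = len(theta), 0.0
    for _ in range(reps):
        x = [t + rng.gauss(0, 1) for t in theta]
        shrink = 1 - (k - 2) / sum(v * v for v in x)
        total += sum((shrink * v - t) ** 2 for v, t in zip(x, theta))
    return total / reps

rng, k = random.Random(1), 5
for scale in (0.0, 1.0, 5.0):
    risk = stein_risk([scale] * k, 200_000, rng)
    print(f"||theta||^2 = {k * scale**2:5.1f}: Stein risk ~ {risk:.3f} (ML risk = {k})")
```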
With Stein's example restricted to dimensions k ≥ 3, the problem arises whether Le Cam's Theorem 14 is valid for k = 2. Proposition 6 in Le Cam (1974, p. 187) seems to give an affirmative answer to this question, but I was unable to follow the proof.
The reader who is aware of the problem raised by Le Cam's Theorem 14 will be disappointed by the treatment Le Cam gives to this problem in (1986, p. 144). He presents a hitherto unknown version of Stein's estimator which reads in our notations as

(1 − (k − 2)/(1 + ‖x̄_n‖²)) x̄_n.

Omitting all details, he gives his opinion about the limit distribution of this estimator sequence, but he avoids discussing what is of interest here, namely the performance of these estimators near ϑ = 0, the point of superefficiency.
According to Sect. 5.15, regular estimator sequences are optimal with respect to all
loss functions in Ls if they are optimal with respect to one of these. This implies, in
particular, that the components of an optimal multidimensional estimator sequence
are optimal themselves. In contrast to that, superefficiency may be tied to a particular
loss function.
If an estimator sequence (ϑ_1^{(n)}, …, ϑ_k^{(n)}) is superefficient for (ϑ_1, …, ϑ_k) with respect to some loss function ℓ_0, then at least one of the components ϑ_i^{(n)}, i = 1, …, k, fails to converge regularly to the optimal limit distribution for ϑ_i.
Remark That Le Cam's paper is mainly known for Hodges' example and not for its many deep results is, perhaps, due to the fact that it is not easy to read. (It contains 14 theorems, 4 corollaries and addenda, and 8 lemmas, all of which are somehow interrelated.) Its mathematical deficiencies were criticized by Wolfowitz (1965, p. 249); Hájek (1972, p. 177) politely says that this paper “contains some omissions”. Le Cam (1974, p. 254) explains why his paper is “rather incorrect”. As a consequence of the intricate presentation, it escaped notice that Le Cam's results do not require convergence to a limit distribution. Miraculously, Le Cam himself published his superefficiency result once more for estimator sequences with limit distribution (1958, p. 33). Bahadur (1964) suggests an easier way to this result. Though he explicitly refers to Le Cam's papers of 1953 and 1958 (Le Cam 1953, 1958), he overlooked the fact that the paper of 1953 is on arbitrary estimator sequences. The same misunderstanding occurs in many textbooks. (See e.g. Stuart and Ord 1991, pp. 660/1; Witting and Müller-Funk 1995, p. 200, Satz 6.33 and p. 417, Satz 6.199; Lehmann and Casella 1998, p. 440.)
Remark on a Concept of Global Superefficiency

Brown et al. (1997, p. 2612) suggest a new concept of superefficiency. To illustrate the difficulties with the interpretation of this concept, we consider the following problem: Let {P_ϑ : ϑ ∈ Θ}, Θ = (ϑ_0, ∞), be a family of probability measures on some measurable space (X, A). For any estimator ϑ^{(n)} let

According to Brown et al.'s new concept, the estimator sequence (ϑ̂^{(n)})_{n∈N} is (asymptotically) superefficient at ϑ if

lim sup_{n→∞} R_ϑ^{(n)}(ϑ̂^{(n)}) < lim sup_{n→∞} inf_{ϑ^{(n)}} sup{R_τ^{(n)}(ϑ^{(n)}) : τ ∈ Θ, τ ≤ ϑ}.

Let

r_ϑ^{(n)} := inf_{ϑ^{(n)}} R_ϑ^{(n)}(ϑ^{(n)}) and r(ϑ) := lim sup_{n→∞} r_ϑ^{(n)}.
Since

If (ϑ̂^{(n)})_{n∈N} is asymptotically efficient with respect to the quadratic loss function, then lim_{n→∞} R_ϑ^{(n)}(ϑ̂^{(n)}) = r(ϑ). Hence any asymptotically efficient estimator sequence is superefficient at ϑ if r(ϑ) < sup{r(τ) : τ ∈ Θ, τ ≤ ϑ}. Therefore, any estimator sequence which is asymptotically efficient for every ϑ ∈ Θ is by definition automatically superefficient on Θ if the function r is decreasing. An example of this kind is the family {N(ϑ, ϑ^{−1}) : ϑ ∈ (1, ∞)}, where r(ϑ) = 2ϑ²/(1 + 2ϑ³). It is straightforward to modify these considerations in such a way that estimator sequences which are asymptotically inefficient for every ϑ ∈ Θ are asymptotically superefficient on a large subset of Θ.
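For the family {N(ϑ, ϑ^{−1})} the value r(ϑ) = 2ϑ²/(1 + 2ϑ³) is just the inverse Fisher information; a Monte Carlo sketch (not from the book; sample counts and seed are arbitrary) confirms this and the monotone decrease on (1, ∞):

```python
import math, random

# Sketch (not from the book): for N(theta, 1/theta) the score is
# d/dtheta log p(x, theta) = 1/(2 theta) - (x - theta)^2/2 + theta (x - theta),
# and 1/I(theta) = 1/E[score^2] should equal r(theta) = 2 theta^2/(1 + 2 theta^3).
rng = random.Random(2)
for theta in (1.5, 2.0, 3.0):
    sd = math.sqrt(1 / theta)
    mc, reps = 0.0, 400_000
    for _ in range(reps):
        x = rng.gauss(theta, sd)
        score = 1 / (2 * theta) - (x - theta) ** 2 / 2 + theta * (x - theta)
        mc += score * score
    print(f"theta = {theta}: MC 1/I = {reps / mc:.4f},"
          f" r(theta) = {2 * theta**2 / (1 + 2 * theta**3):.4f}")
```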
κ(P)) remains stochastically bounded for a rate (cn )n∈N which tends to infinity more
slowly than n 1/2 ; even if P n ◦ cn (κ (n) − κ(P)) converges to a limit distribution, the
optimality of this limit distribution remains open.
For the purpose of illustration we consider the estimation of a density at a given
point. This is also the problem where the question of optimal rates took its origin. The
natural framework: a family of probability measures on B, with a Lebesgue density
p fulfilling a certain smoothness condition. The early papers in this area were just
concerned with the construction of “good” estimators. Histogram estimators were a
natural choice: For a fixed value of ξ , p(ξ ) is estimated by
p^{(n)}(ξ, x_n) := n^{−1} h_n^{−1} Σ_{ν=1}^n 1_{[ξ−h_n/2, ξ+h_n/2]}(x_ν).
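A minimal sketch (not from the book; the normal model, the point ξ = 0.5 and the bandwidth h_n = n^{−1/3} are illustrative choices) implements this histogram estimator:

```python
import math, random

def hist_estimate(xs, xi):
    # histogram estimator with bandwidth h_n = n^{-1/3}: for a Lipschitz
    # density this balances bias O(h_n) against sampling error O((n h_n)^{-1/2})
    n = len(xs)
    h = n ** (-1 / 3)
    return sum(1 for x in xs if xi - h / 2 <= x <= xi + h / 2) / (n * h)

rng, xi = random.Random(3), 0.5
p_true = math.exp(-xi * xi / 2) / math.sqrt(2 * math.pi)
for n in (10**3, 10**4, 10**5, 10**6):
    xs = [rng.gauss(0, 1) for _ in range(n)]
    print(f"n = {n:>7}: p_hat = {hist_estimate(xs, xi):.4f} (p(xi) = {p_true:.4f})")
```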
That the estimator sequence p (n) (ξ, ·) is consistent for p(ξ ) under mild conditions on
p (continuity at ξ ) provided h n → 0 is usually attributed to Fix and Hodges (1951)
[p. 244, Lemma 3]. It occurs, from what is heard, already in Glivenko’s “Course in
Probability Theory” (1939). In a paper unknown to Fix and Hodges, Smirnov (1950,
p. 191, Theorem 3) had already obtained certain results on the rate at which
one has

lim_{n→∞} n^{2r/(2r+1)} ∫(p^{(n)}(ξ, x_n) − p(ξ))² P^n(dx_n)
Parzen’s result refers to a particular class of kernels, and he neglects the question of
how to choose a sequence h n not depending on p.
It is not the purpose of the present section to deal with the theory of kernel
estimators in more detail. What is essential for our problem is the fact that other
techniques of density estimation like orthogonal expansions (Čentsov 1962), Fourier
series (Kronmal and Tarter 1968) or polynomial algorithms (Wahba 1971) lead under
intuitively comparable regularity conditions on the densities to the same rates of
convergence.
We just mention one more paper which was at that time almost entirely ignored by scholars working on density estimation. Prakasa Rao (1969, Theorem 6.3, p. 35) obtained for the family of all distributions on B ∩ (0, ∞) with a nonincreasing Lebesgue density the following result for the distribution of the ML-sequence

implies

for every estimator sequence p^{(n)}(ξ, ·). In this relation, P_r is (somewhat simplified) the family of all probability measures in a Lipschitz neighbourhood of a given measure P_0 admitting a density with a bounded r-th derivative.
At the time Farrell wrote his paper, it was clear from the study of parametric families that a meaningful concept of asymptotic optimality had to be built upon a condition of (locally) uniform convergence. Farrell requires uniformity on P_r without giving second thoughts to this question. To establish the rate c_n = n^{r/(2r+1)} as optimal, it was therefore necessary to exhibit an estimator sequence p̂^{(n)}(ξ, ·), n ∈ N, such that

lim sup_{n→∞} sup_{P∈P_r} n^{2r/(2r+1)} ∫(p̂^{(n)}(ξ, x_n) − p(ξ))² P^n(dx_n) < ∞.
According to Farrell's Lemma 1.4, p. 173, this holds true for kernel estimators. In the proof of his Theorem 1.2, Farrell uses (see p. 174) what corresponds to “least favourable paths” (p_n)_{n∈N}, converging to p such that p_n(ξ) − p_0(ξ) is large, and P_n is close to P_0. Farrell's complicated construction on pp. 174–177 shows that the invention of a least favourable path may be a nontrivial task. A better arranged version of Farrell's proof can be found in Wahba (1975, pp. 27–29).
Farrell (1972) was also aware of a problem which did not find due attention until
twenty years later: The problem of rate adaptivity. His generally neglected Theorem
1.3, p. 173, is a somewhat vague expression of the fact that estimator sequences
attaining the optimal rate n r/(2r +1) uniformly on Pr cannot, at the same time, attain
the better rate n r̄ /(2r̄ +1) locally uniformly on the smaller family Pr̄ if r̄ > r .
Farrell's paper (1972) has a forerunner in his paper (1967), giving a lower bound for the quadratic risk of sequential estimator sequences. In this paper, Farrell suspects (see pp. 471/2) that for densities with a continuous second derivative,

(On p. 472, line 1, this statement occurs without the factor n^{2/3}. It makes sense only if one assumes that this factor fell victim to a misprint.) As remarked by Kiefer (1982, p. 424), Farrell's proof is not entirely correct since he uses a least favourable subfamily which is not in P.
The best result now available is Theorem 5.1 in Ibragimov and Has'minskii (1981, p. 237) saying that for every ξ ∈ R

lim inf_{n→∞} inf_{p^{(n)}(ξ,·)} sup_{P∈P_{r,L}} ∫ℓ(n^{r/(2r+1)}(p^{(n)}(ξ, x_n) − p(ξ))) P^n(dx_n) > 0    (5.17.1)

and

lim sup_{n→∞} sup_{P∈P_{r,L}} sup_{ξ∈R} ∫ℓ(n^{r/(2r+1)}(p^{(n)}(ξ, x_n) − p(ξ))) P^n(dx_n) < ∞    (5.17.2)

for symmetric loss functions ℓ increasing not too quickly (u → ℓ(u) exp[−εu²] bounded). In these relations, P_{r,L} is the class of all probability measures on B the densities of which have r − 1 derivatives with |p^{(r−1)}(x) − p^{(r−1)}(y)| ≤ L|x − y|.
Prior to Farrell (1972) there was a paper by Weiss and Wolfowitz (1967a) asserting the existence of an estimator sequence p^{(n)}(ξ, ·), n ∈ N, such that P^n ∘ n^{2/5}(p^{(n)}(ξ, ·) − p(ξ)) is asymptotically maximally concentrated in intervals symmetric about zero (see p. 331, relation (2.22)). Here p^{(n)}(ξ, ·) is some kind of maximum probability estimator, and the assertion refers to densities admitting a certain Taylor expansion of order 2. For the proof the authors refer to Theorem 3.1 in Weiss and Wolfowitz (1966, p. 65), which, however, is on the estimation of a real parameter; the optimality assertion in that theorem holds for one particular symmetric interval (involved in the construction of the estimator) and (of course) for estimator sequences fulfilling a certain local uniformity condition.
What Is an Optimal Rate?

To discuss the idea of an “optimal rate” in a more general (i.i.d.) context, let now P be an arbitrary family of probability measures P on a measurable space (X, A), let κ : P → R be a functional and κ^{(n)} : X^n → R an estimator sequence. To deal with (local) uniformity, we introduce a sequence P_n ⊂ P which could mean P_n = P or P_n ↓ {P_0}.

Definition 5.17.1 The estimator sequence (κ^{(n)})_{n∈N} attains the rate (c_n)_{n∈N} uniformly on (P_n)_{n∈N} if

The rate (c_n)_{n∈N} is attained iff c_n(κ^{(n)} − κ(P)) is stochastically bounded, uniformly on P_n. This is in particular the case if P^n ∘ c_n(κ^{(n)} − κ(P)) converges to a limit distribution, uniformly on P_n.
We define an order relation between rates by

(c'_n)_{n∈N} ⪯ (c_n)_{n∈N} if lim sup_{n→∞} c'_n/c_n < ∞.
If (c_n)_{n∈N} is attainable, then, by Definition 5.17.1, any rate (c'_n)_{n∈N} ⪯ (c_n)_{n∈N} is attainable, too. The interest, therefore, is in attainable rates which are as large as possible. How can we determine whether an attainable rate is the best possible one? For this purpose, we introduce the concept of a “rate bound”. Roughly speaking, (c_n)_{n∈N} is a rate bound if no better rate is attainable. A second thought suggests a more restrictive definition: An estimator sequence attaining a better rate for infinitely many n ∈ N would certainly be considered as an improvement. This suggests the following definition.
Definition 5.17.2 (c_n)_{n∈N} is a rate bound for uniform convergence on (P_n)_{n∈N} if the following is true. If the rate (c_n)_{n∈N} is attained along an infinite subsequence N_0, i.e., if

lim_{n∈N_0} sup_{P∈P_n} P^n{c_n|κ^{(n)} − κ(P)| > u_n} = 0    (5.17.4)
Hence an attainable rate is optimal if it is, at the same time, a rate bound.
Definition 5.17.2 is not so easy to handle if it comes to proving that a given rate
(cn )n∈N is a rate bound. Here is another, equivalent, definition expressing the idea of
a rate bound.
Definition 5.17.3 (cn )n∈N is a rate bound for uniform convergence on (Pn )n∈N if
for every estimator sequence (κ (n) )n∈N ,
lim inf_{n→∞} sup_{P∈P_n} P^n{c_n|κ^{(n)} − κ(P)| > u_n} > 0 for every u_n → 0.    (5.17.5)
The following Lemma implies the equivalence of Definitions 5.17.2 and 5.17.3.
Lemma 5.17.4 For any sequence of nonincreasing functions Hn : [0, ∞) →
[0, ∞), the following assertions are equivalent.
(i) For every N_0 ⊂ N,

then

lim_{n∈N_0} H_n(δ_n^{1/2}) = lim_{n∈N_0} H_n(δ_n · δ_n^{−1/2}) = 0.

Hence (ii) is violated for u_n = δ_n^{1/2}.

(ii) If (u_n)_{n∈N} → 0 and lim inf_{n→∞} H_n(u_n) = 0, then lim_{n∈N_0} H_n(u_n) = 0 for some N_0 ⊂ N. Since H_n(u_n u'_n) ≤ H_n(u_n) eventually if u'_n → ∞, relation (i) is violated for δ_n = u_n.
If (c_n)_{n∈N} is a rate bound, then (c'_n)_{n∈N} with (c_n)_{n∈N} ⪯ (c'_n)_{n∈N} is a rate bound, too. As a side result: If every (c'_n)_{n∈N} ⪯ (c_n)_{n∈N} is a rate bound, then (c_n)_{n∈N} is a rate bound itself.

If (c_n)_{n∈N} is attainable, and (c'_n)_{n∈N} is a rate bound, then (c_n)_{n∈N} ⪯ (c'_n)_{n∈N}. This implies (c_n)_{n∈N} ≈ (c'_n)_{n∈N} if both sequences are attainable rate bounds. Hence an optimal rate is unique up to equivalence.
This can be seen as follows. Assume that lim sup c_n/c'_n = ∞, and let N_0 be such that lim_{n∈N_0} c_n/c'_n = ∞. Since (c_n)_{n∈N} is attainable, (5.17.3) implies

lim_{n∈N_0} sup_{P∈P_n} P^n{c_n|κ_0^{(n)} − κ(P)| > (c_n/c'_n)^{1/2}} = 0 for some (κ_0^{(n)})_{n∈N}.

Since (c'_n)_{n∈N} is a rate bound, (5.17.5) implies

lim inf_{n∈N_0} sup_{P∈P_n} P^n{c'_n|κ^{(n)} − κ(P)| > (c'_n/c_n)^{1/2}} > 0

for every (κ^{(n)})_{n∈N}, hence in particular for (κ_0^{(n)})_{n∈N}. Since

{c'_n|κ_0^{(n)} − κ(P)| > (c'_n/c_n)^{1/2}} = {c_n|κ_0^{(n)} − κ(P)| > (c_n/c'_n)^{1/2}},

this is a contradiction.
lim sup_{n→∞} sup_{P∈P_n} P^n{c_n|κ^{(n)} − κ(P)| > u_n} > 0 for u_n → 0.    (5.17.7)
Though the definitions (5.17.4) and (5.17.6) as well as (5.17.5) and (5.17.7) are conceptually distinct, the difference will usually be irrelevant. If (5.17.6) holds true, then in all practical cases the stronger condition (5.17.4) will be fulfilled, too. That means: If (c_n)_{n∈N} cannot be improved for a.a. n ∈ N, then, usually, it cannot be improved along an infinite subsequence.
The concept of an attainable rate defined by (5.17.3) is generally accepted. (See Farrell 1972, p. 172, relation (1.4); Stone 1980, p. 1348, relation (1.3); Kiefer 1982, p. 420; Hall and Welsh 1984, Sect. 3, pp. 1083–1084.) With c_n = n^{1/2}, relation (5.17.3) occurs as √n-consistency in Bickel et al. (1993, p. 18, Definition 2). Akahira and Takeuchi (1995, p. 77, Definition 3.5.1) call the property defined by (5.17.3) “consistency of order (c_n)_{n∈N}”.
There is less agreement on the concept of a rate bound (often occurring only
implicitly in the proof that a certain rate is optimal). Since (5.17.5) is equivalent to
(5.17.4), this is the weakest condition on the sequence (cn )n∈N which guarantees that
an estimator sequence attaining a better rate for infinitely many n ∈ N is impossible.
The literature offers a plethora of intuitively plausible conditions for a rate bound.
As an example of such a condition we mention the following:

lim inf_{n→∞} sup_{P∈P_n} P^n{c_n|κ^{(n)} − κ(P)| > u} > 0 for every (κ^{(n)})_{n∈N} and every u > 0.    (5.17.8)

This condition, which implies (5.17.5), occurs in Stone (1980, p. 1348) and in Hall (1989, p. 50, (3.3); see also p. 51, Example 3.1). In addition to (5.17.8), Stone requires one more condition, (1.2) (see also Kiefer 1982, p. 420), which is equivalent to
lim_{n→∞} sup_{P∈P_n} P^n{c_n|κ^{(n)} − κ(P)| > u_n} = 1 for every (κ^{(n)})_{n∈N} and u_n → 0.    (5.17.9)
Needless to say, (5.17.9), too, implies (5.17.5).
In (1983, p. 393) Stone requires, in a different context, a condition which corresponds to

In this paper, Stone had a deviating definition of attainability (p. 394, relation (2)), namely

lim_{n→∞} sup_{P∈P_n} P^n{c_n|κ^{(n)} − κ(P)| > u} = 0 for some u > 0.
That a competent author like Stone is at variance with himself illustrates how vague the idea of an optimal rate is unless intuition is guided by methodological principles. His argument (p. 394) that “these definitions...were formulated this way mainly because they could be verified in the present context” is not compelling. Stone abstains from relating his concept(s) of a rate bound to the concept Farrell had used ten years earlier.
Farrell (1972), p. 173 (see also Hall and Welsh 1984, p. 1080, and Carroll and Hall 1988, p. 1185) considers the rate (c_n)_{n∈N} as optimal if

lim_{n→∞} sup_{P∈P_n} P^n{|κ^{(n)} − κ(P)| > a_n} = 0 for every (κ^{(n)})_{n∈N} implies lim_{n→∞} a_nc_n = ∞,

lim inf_{n→∞} sup_{P∈P_n} ∫ℓ(c_n(κ^{(n)} − κ(P))) dP^n > 0 for every (κ^{(n)})_{n∈N}.    (5.17.11)

lim inf_{n→∞} inf_{κ^{(n)}} sup_{P∈P_n} ∫ℓ(c_n(κ^{(n)} − κ(P))) dP^n > 0,

the proof of which is not more difficult than the proof of (5.17.11), which suffices to establish the optimality of an attainable rate.
In order to express that (c_n)_{n∈N} is attainable, relation (5.17.10) with ℓ(u) = u² was used from the beginning. In Theorem 1.2, p. 173, Farrell (1972) (see also Wahba 1975, p. 16) defines an attainable rate (c_n)_{n∈N} as optimal if

lim sup_{n→∞} sup_{P∈P_n} a_n^{−2} ∫(κ^{(n)} − κ(P))² dP^n < ∞ implies lim inf_{n→∞} c_na_n > 0.

This is a rather circumstantial way of saying that (5.17.11) holds true for ℓ(u) = u².
Recall that relations (5.17.1) and (5.17.2), proved by Ibragimov and Has'minskii (1981, pp. 236/7) to establish n^{r/(2r+1)} as an optimal rate for density estimators, imply (5.17.10) and (5.17.11).

In the following we discuss the connection between rate concepts based on u → P^n{c_n|κ^{(n)} − κ(P)| > u}, and rate concepts based on ∫ℓ(c_n(κ^{(n)} − κ(P))) dP^n. The connection between these concepts is conveyed by the following lemma, applied with Q_P^{(n)} := P^n ∘ c_n(κ^{(n)} − κ(P)).
Lemma 5.17.5 Let L_0 be the class of all symmetric loss functions ℓ : R → [0, ∞] which are continuous, and positive on (0, ∞]. Given probability measures Q_P^{(n)} on B ∩ [0, ∞), we consider the following statements.

Observe that nothing is assumed about P_n. Hence the assertions hold in particular with P_n = {P_0} for n ∈ N.

Proof Since ℓ is nondecreasing, we have

Since ℓ is bounded by 1,

Hence

ℓ(u)Q_P^{(n)}[u, ∞) ≤ ∫ℓ(v)Q_P^{(n)}(dv) ≤ ℓ(u) + Q_P^{(n)}[u, ∞) for u ≥ 0.    (5.17.14)

in contradiction to (5.17.12). (Hint: For every m there exists n(m) > n(m − 1) such that sup_{P∈P_{n(m)}} Q_P^{(n(m))}[1/m, ∞) < 1/m.)
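Inequality (5.17.14) is elementary; a quick numeric check (not from the book; the loss ℓ(u) = min(u, 1) and the half-normal Q are illustrative choices) may still be welcome:

```python
import random

# Check of (5.17.14) for a nondecreasing loss ell with 0 <= ell <= 1:
# ell(u) Q[u, inf) <= int ell dQ <= ell(u) + Q[u, inf) for u >= 0.
rng = random.Random(4)
sample = [abs(rng.gauss(0, 1)) for _ in range(100_000)]     # Q = law of |Z|
mean_loss = sum(min(v, 1.0) for v in sample) / len(sample)  # int ell dQ
for u in (0.25, 0.5, 1.0, 2.0):
    tail = sum(1 for v in sample if v >= u) / len(sample)   # Q[u, inf)
    assert min(u, 1.0) * tail <= mean_loss <= min(u, 1.0) + tail
    print(f"u = {u}: {min(u, 1.0) * tail:.4f} <= {mean_loss:.4f}"
          f" <= {min(u, 1.0) + tail:.4f}")
```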
for P in some dense subset of P_0. (Observe that this subset depends on (κ^{(n)})_{n∈N}.)

Proof Assume that for some (κ^{(n)})_{n∈N} and some ℓ ∈ L_0 the relation

holds for every P in some open subset P_1 ⊂ P_0. Since P → ∫ℓ(c_n(κ^{(n)} − κ(P))) dP^n is continuous for every n ∈ N, there exists (by Lemmas 5.14.1 and 5.14.2) a dense subset P_2 ⊂ P_1 such that, for every P_0 ∈ P_2, the relation lim_{n→∞} ρ(P_0, P_n) = 0 implies

If (P_n(P_0))_{n∈N} shrinks to P_0 with respect to the metric ρ, this implies

in contradiction to (5.17.15).
For examples fulfilling the conditions of these assertions (such as the existence of a metric ρ for which P is complete) see Pfanzagl (2002a).
For families with a Euclidean parameter, the optimal rate usually is n^{1/2} (or n^{1/2}(log n)^a), and estimator sequences converging at this rate to a limit distribution are the custom. Hence assertions on the pointwise validity of locally uniform rate bounds are, in this case, of minor interest. Just for the sake of completeness we mention that, as a consequence of Bahadur's Lemma, locally uniform rate bounds are valid pointwise for all ϑ ∈ Θ except for a set of Lebesgue measure 0.
Optimal Rates and Optimal Limit Distributions

In most cases, there are several estimator sequences converging to the estimand with the same, optimal, rate. As an example, we mention the estimation of the center of symmetry of an (unknown) distribution on B. Examples of estimator sequences with the optimal rate n^{1/2} are the Bickel and Hodges estimator (1967, Sect. 3), the estimator of Rao et al. (1975, Theorem 4, p. 866) based on the Kolmogorov distance, or the (adaptive) estimator by van Eeden (1970). Under natural conditions (see Sect. 5.5), the center of symmetry is a differentiable functional, so that the LAN-theory yields the asymptotic concentration bound, which is attained by various estimator sequences (see Sect. 5.13).
There are many more nonparametric models of this kind: a differentiable func-
tional and an LAN-condition based on a rate n 1/2 L(n) (usually with L(n) = (log n)a ),
which distinguishes certain estimator sequences as asymptotically maximally con-
centrated.
Typical for nonparametric models are, however, optimal rates slower than n 1/2 .
[This is shorthand for a rate n a L(n) with a < 1/2.] Even if estimator sequences
attaining this optimal rate converge to a limit distribution, there is no asymptotic
bound for the quality of these estimator sequences. This is not a lacuna in the theory;
such bounds are impossible.
First we mention a result which illustrates the principal difficulties relating to
limit distributions attained with an optimal rate slower than n 1/2 . Then we turn to
more specific conditions which determine the optimal rate, and which, at the same
time, exclude the existence of estimator sequences converging locally uniformly to
some limit distribution with the optimal rate, if the latter is slower than n 1/2 .
If there exists an estimator sequence converging with a rate n a L(n), a ∈ [0, 1/2)
locally uniformly to a limit distribution with expectation 0 and finite variance, then
there exists an estimator sequence converging with a better rate locally uniformly
to a normal limit distribution with mean 0. (See Pfanzagl 1999b, p. 759, Theorem
2.1 (ii).)
For a precise formulation of this result we need the concept of locally uniform
convergence, holding uniformly on some subset P0 ⊂ P. (The formulation of the
Theorem cited above is somewhat careless in this respect.) Given for every P0 ∈ P a
nondecreasing sequence P_n(P_0), n ∈ N, we say that Q_P^{(n)}, n ∈ N, converges locally uniformly to Q_P, uniformly on P_0, if sup_{P∈P_n(P_0)} D(Q_P^{(n)}, Q_P), n ∈ N, converges to 0, uniformly for P_0 ∈ P_0. (This is, in particular, the case if sup_{P∈P_0} D(Q_P^{(n)}, Q_P), n ∈ N, converges to 0.)
If there exists an estimator sequence κ^{(n)} such that P^n ∘ c_n(κ^{(n)} − κ(P)), n ∈ N, converges in this sense with c_n = n^aL(n), a ∈ [0, 1/2), to a limit distribution Q_P, then there is an estimator sequence κ̂^{(n)} such that P^n ∘ ĉ_n(κ̂^{(n)} − κ(P)), n ∈ N, converges in this sense to N(0, σ²(P)), with a rate (ĉ_n)_{n∈N} such that lim_{n→∞} ĉ_n/c_n = ∞. This holds under the assumption that Q_P has expectation 0, that u² is Q_P-integrable uniformly for P ∈ P_0, and σ²(P) = ∫u² Q_P(du) is bounded and bounded away from 0 on P_0. A special, yet more concise version is this: If P^n ∘ c_n(κ^{(n)} − κ(P)), n ∈ N, converges with c_n = n^aL(n), a ∈ [0, 1/2), to N(0, σ²(P)), then there exists an estimator sequence (κ̂^{(n)})_{n∈N} such that P^n ∘ ĉ_n(κ̂^{(n)} − κ(P)), n ∈ N, converges to the same normal limit distribution with a rate (ĉ_n)_{n∈N} better than (c_n)_{n∈N}, provided σ²(P) is bounded and bounded away from 0 for P ∈ P_0.
The conclusion: In situations which are typical for nonparametric models, it is
impossible to define the concept of a maximally concentrated limit distribution in
the usual sense, based on a local uniformity.
This result does not exclude the existence of estimator sequences which converge locally uniformly to a limit distribution with a rate slower than the optimal one, and it does not exclude the existence of estimator sequences converging to a limit distribution with this rate for every P ∈ P.
Observe the connection with what has been said above: If it is possible to find
a metric ρ “stronger” than d, such that P is complete and κ continuous, then con-
vergence of P n ◦ cn (κ (n) − κ(P)) to a limit distribution Q P for every P in an open
subset P0 implies locally uniform convergence on a dense subset of P0 . However,
this local uniformity is with respect to ρ, and locally ρ-uniform convergence at some
P0 does not imply uniform convergence on {P ∈ P : d(P0n , P n ) ≤ α}.
How to Determine the Optimal Rate
The following theorem shows explicitly how optimal rate bounds can be determined,
and that estimator sequences converging with this optimal rate locally uniformly to
some limit distribution do not exist if this optimal rate is slower than n 1/2 .
Theorem 5.17.6 Assume that for (c_n)_{n∈N} there exists a sequence P_n ∈ P, n ∈ N, such that

lim sup_{n→∞} d(P_0^n, P_n^n) < 1    (5.17.16)

and

lim inf_{n→∞} c_n|κ(P_n) − κ(P_0)| > 0.    (5.17.17)

(ii) If

lim_{m→∞} m^{−1/2} lim sup_{n→∞} c_{mn}/c_n = 0,    (5.17.18)

then there exists no estimator sequence converging with this rate uniformly on P_{n,α}(P_0) for some α ∈ (0, 1) to a non-degenerate limit distribution.

Proof (i) By assumptions (5.17.16) and (5.17.17) there are α < 1 and λ > 0 such that for n ∈ N,

d(P_0^n, P_n^n) < α    (5.17.19)

and

c_n|κ(P_n) − κ(P_0)| > λ.    (5.17.20)

P_0^n{c_n(κ^{(n)} − κ(P_0)) > λ/2} + P_n^n{c_n(κ^{(n)} − κ(P_n)) ≤ −λ/2} > 1 − α.

Hence

sup_{P∈P_{n,α}(P_0)} P^n{c_n|κ^{(n)} − κ(P)| ≥ λ/2} > ½(1 − α) > 0.
Hence, by Definition 5.17.3, (c_n)_{n∈N} is a rate bound for uniform convergence on (P_{n,α̂})_{n∈N}.

(iii) Assume now that P^n ∘ c_n(κ^{(n)} − κ(P)), n ∈ N, converges weakly to Q_{P_0}|B, uniformly for P ∈ P_{n,β} for some β > 0. By definition of uniform weak convergence, there exists a dense subset B ⊂ R such that

lim_{n→∞} sup_{P∈P_{n,β}} |P^n{c_n(κ^{(n)} − κ(P)) > u} − Q_{P_0}(u, ∞)| = 0 for every u ∈ B.    (5.17.22)
Relation (5.17.24), applied with P replaced by P_0^n and Q replaced by P_{mn}^n (with m fixed), yields

d(P_0^n, P_{mn}^n) ≤ m^{−1/2} (2/(1 − α)) d(P_0^{mn}, P_{mn}^{mn}) ≤ m^{−1/2} 2α/(1 − α).    (5.17.23)
From (5.17.20),

c_n|κ(P_{mn}) − κ(P_0)| ≥ (c_n/c_{mn}) λ.

Let

ε_m := m^{−1/2} lim sup_{n→∞} c_{mn}/c_n.

hence

c_{mn}/c_n ≤ 2m^{1/2}ε_m for n ≥ n_ε.

This implies

c_n|κ(P_{mn}) − κ(P_0)| ≥ (c_n/c_{mn}) λ ≥ (λ/2) m^{−1/2} ε_m^{−1}.
Since P_{mn} ∈ P_{n,β} for every β > 0 (by (5.17.23)), relation (5.17.22) implies

Q_{P_0}(u + (λ/2) m^{−1/2} ε_m^{−1}, ∞) ≥ −m^{−1/2} 2α/(1 − α) + Q_{P_0}(u, ∞),

i.e.

Q_{P_0}(u, u + (λ/2) m^{−1/2} ε_m^{−1}] ≤ m^{−1/2} 2α/(1 − α) for u ∈ B and m ∈ N.
Since the convergence of m^{−1/2}ε_m^{−1} to 0 is slower than the convergence of m^{−1/2}, it is plausible that this is incompatible with the assumption that Q_{P_0} is non-degenerate. This can be made precise by the following lemma (see Pfanzagl 2000b, p. 39, Lemma 4.1): For nondecreasing functions F : R → [0, 1], and t, s > 0, the relation F(u + t) − F(u) ≤ s for all u in a dense subset implies

(1 − u)^{1/m} ≥ 1 − (u/(1 − α)) · (1/m) for every u ∈ [0, α].
Proof Let

f(u) := (1 − u) exp[u/(1 − α)].

Then f(0) = 1 and f′(u) ≥ 0 for u ∈ [0, α] imply f(u) ≥ 1 for u ∈ [0, α], hence

(1 − u) ≥ exp[−u/(1 − α)] ≥ (1 − (u/(1 − α)) · (1/m))^m.
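A grid check (not from the book; α = 0.8 is an arbitrary choice) of this inequality:

```python
# Check (not from the book) of (1 - u)^(1/m) >= 1 - u/((1 - alpha) m)
# for u in [0, alpha], alpha < 1, on a grid of u and several m.
alpha = 0.8
for m in (1, 2, 5, 50):
    for i in range(101):
        u = alpha * i / 100
        assert (1 - u) ** (1 / m) >= 1 - u / ((1 - alpha) * m) - 1e-12
print("Lemma 5.17.7 inequality verified on a grid (alpha = 0.8)")
```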
Lemma 5.17.8 (i) H(P^m, Q^m)² ≤ α < 1 implies

H(P, Q)² ≤ (m^{−1}/(1 − α)) H(P^m, Q^m)².

(1 − H(P^m, Q^m)²)^{1/m} ≥ 1 − (m^{−1}/(1 − α)) H(P^m, Q^m)².
L_{ijk}(ϑ) = ∫ℓ^{(ijk)}(·, ϑ) dP_ϑ,  L_{ij,k}(ϑ) = ∫ℓ^{(ij)}(·, ϑ)ℓ^{(k)}(·, ϑ) dP_ϑ,
This is not just a matter of higher order smoothness of the densities, say. Properties which were irrelevant for approximations of order o(n^0) now enter the stage: Assertions of order o(n^{−1/2}) about the concentration on even rather simple sets are impossible for families of discrete distributions. If we consider asymptotic assertions as approximations to reality, we must not ignore that families of probability measures reflect reality to a degree which hardly ever justifies taking seriously differences between statistical procedures which are of the order n^{−1} only.
In contrast to an asymptotic theory of order o(n −1 ), some basic results following
from an asymptotic theory of order o(n −1/2 ) seem to be of practical relevance. One
would expect that an analysis of order o(n −1/2 ) reveals differences of order n −1/2
between asymptotically efficient estimator sequences which are indistinguishable by
an analysis of order o(n 0 ). This is, however, not the case. For a large class of estimator
sequences, asymptotic efficiency of order o(n 0 ) implies efficiency of order o(n −1/2 ).
It contributes to the relevance of this result that “efficiency of order o(n −1/2 )” holds
simultaneously for all loss functions in the case of one-parameter families, and for
all symmetric loss functions in the general case.
In this section we write o p (an ) for o(an , Pϑn ). We also use the Einstein convention
and sum over pairs of indices.
Asymptotic theory going beyond the approximation by limit distributions started with papers on the distributions of ML-sequences, say (ϑ^{(n)})_{n∈N}. For highly regular parametric families {P_ϑ^n : ϑ ∈ Θ}, Θ ⊂ R^k, the ML estimator sequence has a stochastic expansion (S-expansion) of order o(n^{−1/2}),

with

S^{(n)}(·, ϑ) = λ̃(·, ϑ) + n^{−1/2} (½ Λ^•(ϑ)L_{ij}(ϑ)λ̃_i(·, ϑ)λ̃_j(·, ϑ) + Λ^{•i}(ϑ)ℓ̃_{(ij)}(·, ϑ)λ̃_j(·, ϑ)).    (5.18.2)
given explicitly. For the ML estimator of a vector parameter see Michel (1975)
[p. 70, Theorem 1]. Stochastic expansions for Bayes estimators and for estimators
obtained by maximizing the posterior density can be found in Gusev (1975) [p.
476, Theorem 1, p. 489, Theorem 5] and Strasser (1977) [p. 32, Theorem 4]; for
estimators obtained as the median of the posterior distribution see Strasser (1978b)
[pp. 872–873, Lemma 2].
The Convolution Theorem was the outcome of long-lasting endeavors to prove the
asymptotic optimality of ML-sequences. The result was twofold. (i) Under conditions
on the basic family, there is a “bound” for regularly attainable limit distribution, and
(ii) under additional conditions on the family, estimator sequences attaining this
bound do exist.
From the beginning, E-expansions for the distribution of certain sequences of esti-
mators and tests had a less ambitious goal: To obtain more accurate approximations
to Pϑn ◦ n 1/2 (ϑ (n) − ϑ), in particular: to distinguish between estimator sequences
with identical limit distributions.
One would expect the class of all asymptotically efficient estimator sequences to split up under a closer investigation. If we restrict attention to estimator sequences with distributions approximable by an E-expansion E_ϑ^{(n)} with λ^k-density ϕ_{Λ(ϑ)}(u)(1 + n^{−1/2}G(u, ϑ)), i.e.,

sup_{C∈C} |P_ϑ^n{n^{1/2}(ϑ^{(n)} − ϑ) ∈ C} − ∫_C ϕ_{Λ(ϑ)}(u)(1 + n^{−1/2}G(u, ϑ)) λ^k(du)| = o(n^{−1/2}),

this would mean a variety of n^{−1/2}-terms G, and the best one could hope for is the existence of an E-expansion Ē^{(n)} such that, for every n ∈ N, Ē^{(n)}(B) ≥ E^{(n)}(B) for a large class of sets B, say all convex and symmetric sets. It came as a surprise to all of us that this disaster did not occur. It turned out that asymptotically efficient estimator sequences have the same E-expansion of order o(n^{−1/2}), except for a difference of order n^{−1/2} in their location.
Remark That different asymptotically efficient statistical procedures agree in their efficiency to a higher order was already observed by Welch (1965, p. 6): confidence procedures based on different statistics like ML estimators or ℓ̃^• have up to o(n^{−1/2}) the same covering probability for ϑ − n^{−1/2}t if the covering probabilities for ϑ agree up to o(n^{−1/2}). Hartigan (1966, p. 56) observes that central Bayes intervals agree in their size for various prior distributions up to the order n^{−1}. For more results of this kind see Strasser (1978b, p. 874, Theorem 3). Sharma (1973, p. 974) observes for certain exponential families: If the minimum variance unbiased estimator sequence has the same asymptotic distribution as the ML-sequence, then the difference between these distributions is of an order (almost) smaller than n^{−1/2}. (Readers interested in questions of priority should be aware of J.K. Ghosh, 1991, p. 506, who claims to have been the first to observe the phenomenon of 2nd order efficiency of certain 1st order efficient tests, but gives no information about the date of this happening.)
When Hodges and Lehmann (1970) introduced the concept of “deficiency” to characterize the differences between asymptotically efficient statistical
There are no further conditions on the structure of the sequence (ϑ (n) )n∈N .
In contrast to the situation for univariate families, there is no “absolute” bound
of order o(n −1/2 ) for the concentration of multivariate estimator sequences. This
motivates the study of estimator sequences with S-expansion.
Estimator Sequences With S-Expansions

The next step was to consider estimator sequences ϑ^{(n)}, n ∈ N, with general S-expansion

n^{1/2}(ϑ^{(n)} − ϑ) = λ̃(·, ϑ) + n^{−1/2}Q(λ̃(·, ϑ), g̃(·, ϑ), ϑ) + o_p(n^{−1/2}),    (5.18.4)

with functions g(·, ϑ) ∈ L_*(ϑ)^m uncorrelated to λ(·, ϑ). In the usual cases, the components Q_1(·, ϑ), …, Q_k(·, ϑ) of Q(·, ϑ) are polynomials with coefficients depending on ϑ.
Considering the general S-expansion (5.18.4), one would expect a great variety
of possible E-expansions. This is, however, not the case. First of all, the generality
of the functions Q 1 , . . . , Q k is spurious. If the S-expansion (5.18.4) holds locally
uniformly, more precisely: with ϑ replaced by ϑ + n −1/2 a, this restricts the possible
forms of the functions Q r . Replacing ϑ by ϑ + n −1/2 a and considering the resulting
relation as an identity in a leads to the following canonical representation (see
Pfanzagl and Wefelmeyer 1978, p. 18, Lemma 5.12 for more details):
where S (n) is given by (5.18.2) and the remainder term R (n) is of the type
with functions f (·, ϑ) ∈ L ∗ (ϑ)m that are uncorrelated to λ(·, ϑ). With the seemingly
more general S-expansion (5.18.4) we are in the end back to the S-expansion of the
ML-sequence except for the remainder term R (n) .
According to Pfanzagl (1979, p. 182, Theorem 7.4), P_ϑ^n ∘ S^{(n)}(·, ϑ) has an E-expansion with λ^k-density

where

G(u, ·) = a_iu^i + c_{ijk}u^iu^ju^k    (5.18.8)

with

a_i = −(½ L_{i,j,k} + L_{ij,k}) Λ^{jk}    (5.18.9)

and

c_{ijk} = −((1/3) L_{i,j,k} + ½ L_{i,jk}).    (5.18.10)
Let E_0 denote the class of all estimator sequences for ϑ admitting an E-expansion with λ^k-density (5.18.7), (5.18.8) with c_{ijk} given by (5.18.10) and with generic constants a_1, …, a_k. These constants become unique under additional conditions on the estimator sequence (say, ML or componentwise median unbiased).

By introducing the class E_0, we avoid building an optimality concept which is confined to “standardized” estimator sequences. After all, there is no convincing standardization of order o(n^{−1/2}) for multidimensional estimators (except for component-wise median unbiasedness). Moreover, a standardization of ϑ^{(n)} as an estimator for ϑ does not carry over to a corresponding standardization for κ ∘ ϑ^{(n)} as an estimator for κ(ϑ).
E_0 contains all o(n^{−1/2})-optimal estimator sequences with S-expansion. Under suitable regularity conditions, any estimator sequence in E_0 can be transformed into any other estimator sequence in this class.

Remark The constants a_1, …, a_k given by (5.18.9) are specific for ML-sequences. They will be different for other asymptotically efficient estimator sequences.
For the following, it will be of interest that an estimator sequence ϑ^{(n)}, n ∈ N, with an n^{−1/2}-term G(u, ·) = a_iu^i + c_{ijk}u^iu^ju^k in the density, can be transformed into an estimator sequence ϑ̂^{(n)}, n ∈ N, with an n^{−1/2}-term â_iu^i + c_{ijk}u^iu^ju^k by
Since estimator sequences with a stochastic expansion (5.18.4) differ, in the end, by a term n^{−1/2}R^{(n)} only, what is the influence of this term on the E-expansion? Theorem 7.4 in Pfanzagl (1979, p. 182) asserts that such estimator sequences have, except for a shift of order n^{−1/2}, the same E-expansion, hence the same efficiency of order o(n^{−1/2}). This was the result which justified the phrase “first order efficiency implies second order efficiency”. (The paper Pfanzagl 1979 was already presented at the Meeting of Dutch Statisticians in Lunteren in 1975, but was not published until 1979 because of a delay in the publication of the Hájek Memorial Volume.)
In the following, this result (i.e., the o(n −1/2 )-equivalence of all o(n 0 )-efficient
estimator sequences) will be established under more general conditions.
We consider estimator sequences with the S-expansion
Observe that relation (5.18.13) is, in particular, true for remainder terms of the special
type (5.18.6). Since f i (·, ϑ) are uncorrelated to λ1 (·, ϑ), . . . , λk (·, ϑ), this implies
f˜i (·, ϑ + n −1/2 a) = f˜i (·, ϑ) + o p (n 0 ). If R (n) is a continuous function of all its argu-
ments, relation (5.18.13) follows.
We shall show that all estimator sequences with a representation (5.18.12) have the same E-expansion, except for a shift of order n^{−1/2}. What makes the whole thing tick is the asymptotic independence between S^{(n)} and R^{(n)}. This independence follows since S^{(n)} is a function of an asymptotically sufficient and complete statistic, and R^{(n)} is asymptotically invariant according to (5.18.13) (compare Basu's Theorem).
For the proof of the following Theorem 5.18.1 we need in addition to (5.18.13) that the sequence R^{(n)}(·, ϑ + n^{−1/2}a), n ∈ N, is uniformly P^n_{ϑ+n^{−1/2}a}-integrable, a property which is not inherent in the role of R^{(n)} in the representation (5.18.12). A proof under weaker conditions on R^{(n)} is wanted.
More precisely: Under the conditions indicated above, there exist ρ(ϑ) ∈ Rk such
that Pϑn ◦ (S (n) (·, ϑ) + n −1/2 R (n) (·, ϑ)) and Pϑn ◦ (S (n) (·, ϑ) + n −1/2 ρ(ϑ)), n ∈ N,
have the same E-expansion.
Remark If ρ is a continuous function of ϑ, then the estimator sequence
has the same E-expansion as the estimator sequence ϑ (n) with the expansion
(5.18.12). This shows, in particular, that the class of all estimator sequences which
are modifications of the ML-sequence is complete of order o(n −1/2 ) in the class of
all estimator sequences with S-expansions. (Earlier completeness theorems, some of
these extending to an order going beyond o(n −1/2 ), can be found in Pfanzagl 1973,
Sect. 6, Pfanzagl 1974, Sect. 6 Pfanzagl 1975, p. 34, Theorem 6, and Pfanzagl and
Wefelmeyer 1978, p. 7, Theorem 1 (iii).)
s_n²(x_n) := n^{−1} Σ_{ν=1}^n (x_ν − x̄_n)²  and  ŝ_n²(x_n) := n^{−1} Σ_{ν=1}^n x_ν²

with R^{(n)}(x_n) = nx̄_n². This is an S-expansion of n^{1/2}(s_n² − σ²) with S^{(n)}(x_n, σ²) = n^{1/2}(ŝ_n²(x_n) − σ²) and R^{(n)}(x_n) = (f̃(x_n))² with f(x) = x; observe that the correlation between S^{(n)}(·, σ²) and R^{(n)} is of order o(n^0), which follows in this case immediately from the stochastic independence between n^{1/2}(s_n² − σ²) and R^{(n)}. The distribution of n^{1/2}(ŝ_n² − σ²) differs from the distribution of n^{1/2}(s_n² − σ²) by a stochastic term of order o_p(n^{−1/2}). N(0, σ²)^n ∘ R^{(n)} = σ²χ_1² is non-degenerate, with ∫R^{(n)} dN(0, σ²)^n = σ².
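A short simulation (not from the book; n, σ² and the replication count are arbitrary) illustrates the last claim: R^{(n)} = nx̄_n² has mean σ² and variance 2σ⁴, as a σ²χ_1² variable should:

```python
import random, statistics

# Sketch (not from the book): under N(0, sigma^2), R^(n) = n * xbar_n^2
# is distributed as sigma^2 * chi_1^2 (mean sigma^2, variance 2 sigma^4).
rng, n, sigma2, reps = random.Random(5), 200, 2.0, 20_000
r_vals = []
for _ in range(reps):
    xbar = statistics.fmean(rng.gauss(0, sigma2 ** 0.5) for _ in range(n))
    r_vals.append(n * xbar * xbar)
print(f"mean of R^(n) ~ {statistics.fmean(r_vals):.3f} (sigma^2 = {sigma2})")
print(f"variance     ~ {statistics.variance(r_vals):.2f} (2 sigma^4 = {2 * sigma2**2})")
```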
Proof of Theorem 5.18.1. Since R^{(n)}(·, ϑ), n ∈ N, is uniformly P_ϑ^n-integrable, the sequence ∫|R^{(n)}(·, ϑ)| dP_ϑ^n, n ∈ N, is bounded. Hence ρ_n(ϑ) := ∫R^{(n)}(·, ϑ) dP_ϑ^n is bounded, too, and converges to some ρ(ϑ) ∈ R^k for some subsequence N_0.

Let g : R^k → R be a bounded function with bounded and continuous derivatives g^{(i)}, i = 1, …, k, so that

∫g(S^{(n)}(·, ϑ) + n^{−1/2}(R^{(n)}(·, ϑ) − ρ_n(ϑ))) dP_ϑ^n = ∫g(S^{(n)}(·, ϑ)) dP_ϑ^n + o(n^{−1/2}).

Since P_ϑ^n ∘ g(S^{(n)} + n^{−1/2}R^{(n)}) admits an approximation by an E-sequence, the convergence of (ρ_n)_{n∈N_0} to ρ holds, in fact, for the whole sequence.

So far, the equality (5.18.14) holds for certain functions g only. Yet, the class of these functions is large enough to imply that the E-expansions of P_ϑ^n ∘ (S^{(n)} + n^{−1/2}R^{(n)}) and P_ϑ^n ∘ (S^{(n)} + n^{−1/2}ρ) are identical.
The results indicated so far hold for estimator sequences with an S-expansion. The approximation of n^{1/2}(ϑ^{(n)} − ϑ) by an S-expansion is an important device to obtain an approximation to P_ϑ^n ∘ n^{1/2}(ϑ^{(n)} − ϑ) by an E-expansion. The S-expansion itself is of no operational significance. Hence it suggests itself to look for estimator sequences with E-expansion which are better than the best estimator sequences with S-expansion. The purpose of the following section is to show that there are no E-expansions with a better n^{−1/2}-term. This assertion will be supported by two different results.

(i) All sufficiently regular E-expansions have an n^{−1/2}-term of the particular type (5.18.7), and the one with b_{ij} ≡ 0 is optimal in this class.
(ii) Another result is based on the “absolute” concept of o(n^{−1/2})-optimality for univariate functionals. It will be shown that

(i) Expansion (5.18.11) is the only E-expansion for estimator sequences (ϑ^{(n)})_{n∈N} with o(n^{−1/2})-median unbiased marginals.

(ii) For sufficiently regular functionals κ : R^k → R, the estimator sequence κ ∘ ϑ^{(n)}, n ∈ N, is o(n^{−1/2})-optimal if (ϑ^{(n)})_{n∈N} has an E-expansion (5.18.8).
with some function G(·, ϑ). The application of this idea leads to the following result: Assume that for every loss function ℓ ∈ L_* and some δ > 1/2,

n^δ |∫ℓ(n^{1/2}(ϑ^{(n)} − ϑ)) dP_ϑ^n − ∫ℓ(u)(1 + n^{−1/2}G(u, ϑ)) ϕ_{Λ(ϑ)}(u) λ^k(du)| = o(n^0)    (5.18.15)

holds uniformly with ϑ replaced by ϑ + n^{−1/2}a. (Condition (5.18.15) with δ > 1/2 is needed in the proof. What corresponds to “approximation of order o(n^{−1/2})” is δ = 1/2.)
The use of loss functions with bounded and continuous first and second order
derivatives is needed since approximations of this order on (say convex) sets would be
possible only under additional smoothness conditions on the family (which exclude
lattice distributions).
If such an estimator sequence admits an E-expansion (5.18.7), then the function
G is necessarily of the following type.
where the matrix (b_{ij})_{i,j=1,…,k} is positive semidefinite. Recall that the factor c_{ijk} given by (5.18.10) does not depend on the estimator sequence.

This is the parametric version of a more general theorem proved in Pfanzagl (1985, Sect. 9, pp. 288–332) under the assumption that G(·, ϑ) is bounded by some polynomial and that it fulfills certain smoothness conditions, such as

∫|G(u, ϑ_n)ϕ_{Λ(ϑ_n)}(u) − G(u, ϑ)ϕ_{Λ(ϑ)}(u)| λ^k(du) = o(n^0) for ϑ_n = ϑ + n^{−1/2}a

and

∫|G(u + y, ϑ) − G(u, ϑ)| ϕ_{Λ(ϑ)}(u) λ^k(du) = O(|y|) if y → 0.
If we accept the regularity conditions under which this result was proved, then one might say that for the class of all estimator sequences admitting an E-expansion (5.18.7), G(·, ϑ) is given by (5.18.16). This class of estimator sequences is, therefore, (i) more general than the class of all estimator sequences with S-expansion, and (ii) the optimal estimator sequences fulfill (5.18.16) with b_{ij} = 0 for i, j = 1, …, k. Hence, any estimator sequence with S-expansion is optimal in the larger class of estimator sequences the distributions of which are approximable up to the order o(n^{−1/2}) in the sense described by (5.18.15).
The optimality of E-distributions with b_{ij} = 0 for i, j = 1, …, k is based on Lemma 5.18.2. Since the matrix (b_{ij}), i, j = 1, …, k, is positive semidefinite, Lemma 5.18.2 implies that b_{ij} ∫ℓ(u)(u_iu_j − Λ_{ij}(ϑ)) ϕ_{Λ(ϑ)}(u) λ^k(du) ≥ 0 for any symmetric and subconvex loss function ℓ.

Lemma 5.18.2 (i) For any symmetric and subconvex loss function ℓ : R^k → [0, ∞), the matrix

∫ℓ(u)(u_iu_j − Σ_{ij}) ϕ_Σ(u) λ^k(du), i, j = 1, …, k,

is positive semidefinite (see Pfanzagl and Wefelmeyer 1978, p. 16, Lemma 5.8 or Pfanzagl 1985, p. 454, Lemma 13.2.4).

(ii) If k = 1,

holds for all loss functions ℓ which are subconvex about 0 (Pfanzagl 1985, Sect. 7.11, p. 231).
avoid technicalities, we consider only sample sizes n for which k_n and m_n are integers.) With

x̄_i := m_n^{−1} Σ_{ν=1}^{m_n} x_{(i−1)m_n+ν}

we define

κ^{(n)}(x_n) := k_n^{−1} Σ_{i=1}^{k_n} (m_n − 1)^{−1} Σ_{ν=1}^{m_n} (x_{(i−1)m_n+ν} − x̄_i)².
It is easy to check that the coefficient of u 3 agrees with that given in (5.18.10).
The important aspect of this example is the occurrence of a quadratic term with
positive coefficient, a4 σ −4 . Hence the estimator sequence is first-order efficient but
not second-order efficient. The following simpler, if somewhat artificial, example
shows the same effect.
Example For P = {N(ϑ, 1) : ϑ ∈ R}, the sample mean x̄_n is the asymptotically optimal regular estimator sequence for ϑ. The S-expansion of n^{1/2}(x̄_n − ϑ) is n^{−1/2} Σ_{ν=1}^n (x_ν − ϑ); the pertaining E-expansion of order o(n^{−1/2}) has λ-density ϕ. The estimator sequence ϑ^{(n)}(x_n) := x̄_n + n^{−3/4}δ_n(x_n) is regularly approximated by the E-expansion with density
It was already observed by Fisher (1925) [pp. 716–717] that in this case the
distribution of the ML estimator approaches its limiting distribution particularly
slowly. Comparing the variance of the ML estimator with the variance of the limiting
distribution, he found that the deficiency is of order n 1/2 (whereas it is of order n 0 in
the regular cases).
This example was taken up by Takeuchi (1974, pp. 188–193; see also Akahira 1976a, pp. 621–622). For the ML-sequence they obtain the E-expansion with density

ϕ(u)(1 + n^{−1/2}|u|(u²/2 − 1))

and the distribution function

t → Φ(t) − ½ n^{−1/2} t² sign(t) ϕ(t) + o(n^{−1/2}).

Since the ML estimators are median unbiased, it suggests itself to compare this distribution function with the bound of order o(n^{−1/2}) for median unbiased estimator sequences, given as

t → Φ(t) − (1/6) n^{−1/2} t² sign(t) ϕ(t) + o(n^{−1/2}).
This bound is attainable in the following sense: For each s ∈ R there exists an
estimator sequence, median unbiased of order o(n −1/2 ), such that its distribution
function coincides with the bound up to o(n −1/2 ) for t = s. Hence we have a whole
family of first-order efficient estimator sequences with distributions differing by
amounts of order n −1/2 . (For regular families, this phenomenon does not occur until
the order n −1 .)
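The gap of order n^{−1/2} between the two expansions quoted above is easy to tabulate; a sketch (not from the book; n = 100 is an arbitrary choice):

```python
import math

# Sketch (not from the book): the ML expansion (factor 1/2) versus the
# median-unbiased bound (factor 1/6) at n = 100; for t > 0 the bound is
# the larger distribution function, i.e. the more concentrated one.
Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
phi = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
n = 100
for t in (0.5, 1.0, 2.0):
    f_ml = Phi(t) - (1 / 2) * n ** -0.5 * t * t * phi(t)
    f_bd = Phi(t) - (1 / 6) * n ** -0.5 * t * t * phi(t)
    print(f"t = {t}: F_ML = {f_ml:.4f}, bound = {f_bd:.4f}, gap = {f_bd - f_ml:.4f}")
```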
“Absolute” Optimality Based on Tests
The purpose of this section is to establish the optimality of E-expansions in E0 . Our
starting point is the “absolute” optimality concept for univariate estimator sequences
based on the Neyman–Pearson Lemma. Starting from this “absolute” optimality-
concept for univariate estimator sequences, a concept of multivariate o(n −1/2 )-
optimality will be developed.
The following theorem is a first step towards this “absolute” optimality concept.
It provides an “absolute” bound of order o(n −1/2 ) for the concentration of estimator
sequences κ (n) , n ∈ N, for κ(ϑ) which are median unbiased of order o(n −1/2 ).
σ² := Λ_{ij}κ^{(i)}κ^{(j)} = L_{i,j}K_iK_j,  K_i := Λ_{ij}κ^{(j)},

c̄ := σ^{−3}(c_{ijk}K_iK_jK_k + ½ κ^{(ij)}K_iK_j).
This theorem provides a bound for the concentration of confidence bounds with covering probability β. Of interest for our problem is the case β = 1/2: For estimator sequences κ^{(n)}, n ∈ N, which are median unbiased of order o(n^{−1/2}), relation (5.18.18) implies that o(n^{−1/2})-optimal estimator sequences have an E-expansion with distribution function

Φ(t/σ − n^{−1/2} c̄ t²/σ²).    (5.18.19)

For later use we note the pertaining density

ϕ_{σ²}(t)(1 + n^{−1/2} c̄ (−2t/σ + t³/σ³)).    (5.18.20)
For o(n^{−1/2})-median unbiased estimator sequences κ^{(n)}, n ∈ N, relation (5.18.19) implies

For k = 1, relation (5.18.22) occurs in Pfanzagl (1973, p. 1005, Theorem 6.1, relation (6.4)). A relation more general than (5.18.22) is given in Pfanzagl (1985, pp. 250/1, Theorem 8.2.3; observe a misprint in 8.2.4: replace u by n^{−1/2}u).
Proof of Theorem 5.18.3. By condition (5.18.17), assumption (5.18.26) of Theorem 5.18.6 below is fulfilled for

a_r = t (K_r/σ²)(1 − n^{−1/2} (t/2) κ^{(ij)}K_iK_j/σ⁴).

This choice of a_r, r = 1, …, k, minimizes L_{i,j}a_ia_j under the condition
The bound in (5.18.21) for the concentration of o(n −1/2 )-optimal estimator
sequences presumes a certain smoothness of κ(ϑ). For such functionals, o(n −1/2 )-
optimal estimator sequences κ (n) can easily be obtained as κ (n) = κ ◦ ϑ (n) , if ϑ (n) ,
n ∈ N, is in E0 . The bound refers to estimator sequences κ (n) which are median unbi-
ased of order o(n −1/2 ). It may be achieved by κ ◦ ϑ (n) if ϑ (n) ∈ E0 is appropriately
chosen. The important point: If ϑ (n) is in E0 , then κ ◦ ϑ (n) will be o(n −1/2 )-optimal.
Observe the following detail: The bound given by (5.18.21) is “absolute”, i.e., it
holds for any estimator sequence which is median unbiased of order o(n −1/2 ). To
show that κ ◦ ϑ (n) is o(n −1/2 ) optimal we make use of the fact that, for sufficiently
regular families, the estimator sequences ϑ (n) admit an S-expansion.
G(u) = ai u i + ci jk u i u j u k .
(n) 1
E (n) := E ◦ (u → κ (i) u i + n −1/2 κ (i j) u i u j )
2
(n)
if E is the E-expansion of n 1/2 (ϑ (n) − ϑ). By Lemma 5.18.7, the E-expansion E (n)
has the λ-density
t t 3
ϕσ 2 (t) 1 + n −1/2 ā + c̄ 3 (5.18.23)
σ σ
with
σ 2 = L i, j K i K j ,
ā = σ −3 (ai K i + 3ci jk (Λi j K k − K i K j K k ),
1
c̄ = σ −3 (ci jk K i K j K k + κ (i j) K i K j ).
2
The estimator sequence κ ∘ ϑ^{(n)} is median unbiased of order o(n^{−1/2}) if (5.18.23) holds with ā = −2c̄. This can be achieved by the choice of (a₁, …, a_k). That means: by choosing (ϑ^{(n)})_{n∈ℕ} ∈ E₀ appropriately, one obtains an estimator sequence κ ∘ ϑ^{(n)} which is median unbiased of order o(n^{−1/2}). Any of these estimator sequences has an E-expansion with λ-density (5.18.20) and is, therefore, o(n^{−1/2})-optimal.
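The role of ā = −2c̄ becomes transparent upon integrating the n^{−1/2}-term of (5.18.23) over (−∞, 0]: with t = σu,

∫_{−∞}^0 ϕ_{σ²}(t) ( ā t/σ + c̄ t³/σ³ ) dt = ∫_{−∞}^0 ( ā u + c̄ u³ ) ϕ(u) du = −( ā + 2c̄ ) / (2π)^{1/2},

since ∫_{−∞}^0 u ϕ(u) du = −(2π)^{−1/2} and ∫_{−∞}^0 u³ ϕ(u) du = −2(2π)^{−1/2}. Hence the median is 0 up to o(n^{−1/2}) iff ā = −2c̄, in which case (5.18.23) reduces to (5.18.20).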
Recall that our purpose is to find an operational concept of o(n^{−1/2})-optimality for estimators of multivariate functionals. What is it that qualifies the estimator sequences in E₀ as o(n^{−1/2})-optimal? According to Theorem 5.18.4, the estimator sequences in E₀ are a reservoir for generating univariate estimator sequences which are o(n^{−1/2})-optimal in an absolute sense. Conversely, any estimator sequence ϑ^{(n)}, n ∈ ℕ, with this property belongs to E₀.
It would be nice to have a result of the following kind: if, for every α ∈ ℝ^k, the estimator sequence α_i ϑ_i^{(n)} is o(n^{−1/2})-optimal for α_i ϑ_i, then ϑ^{(n)} is in E₀. Yet this does not work with an optimality concept based on median unbiasedness of order o(n^{−1/2}): in general, there is no sequence ϑ^{(n)} such that

P_ϑ^n { α_i ϑ_i^{(n)} ≤ α_i ϑ_i } = 1/2 + o(n^{−1/2})   for every α ∈ ℝ^k.

(See the example in Pfanzagl 1979, pp. 183/4, or Pfanzagl 1985, p. 218, Example 7.8.2.)
Even though there is no estimator sequence ϑ^{(n)} such that α_i ϑ_i^{(n)} is o(n^{−1/2})-median unbiased for α_i ϑ_i for every α ∈ ℝ^k, there are estimator sequences fulfilling this condition for α = δ_r, r = 1, …, k, i.e., estimator sequences with o(n^{−1/2})-median unbiased components.
An interesting converse supporting the interpretation of E₀ as the class of all o(n^{−1/2})-optimal estimator sequences reads as follows: assume that the estimator
Lemma 5.18.5
P_ϑ^n { Σ_{ν=1}^n log( p(x_ν, ϑ + n^{−1/2} a) / p(x_ν, ϑ) ) ≤ t }   (5.18.24)

= Φ( t/τ + τ/2 + n^{−1/2} ( a_i a_j a_k / τ³ )( A_{ijk} + t B_{ijk} + t² C_{ijk} ) ) + o(n^{−1/2}),

P_{ϑ+n^{−1/2}a}^n { Σ_{ν=1}^n log( p(x_ν, ϑ + n^{−1/2} a) / p(x_ν, ϑ) ) ≤ t }   (5.18.25)

= Φ( t/τ − τ/2 + n^{−1/2} ( a_i a_j a_k / τ³ )( Ā_{ijk} + t B̄_{ijk} + t² C_{ijk} ) ) + o(n^{−1/2}).
In these relations, the following notations are used.
τ² = L_{i,j} a_i a_j,   C_{ijk} = −(1/6) L_{i,j,k} / τ²,

A_{ijk} = (1/6) L_{i,j,k} + (1/12) L_{ij,k} − (1/2) L_{ijk} τ²,   Ā_{ijk} = (1/6) L_{i,j,k} − (1/12) L_{ij,k} − (1/2) L_{ijk} τ²,

B_{ijk} = (1/6) L_{ijk},   B̄_{ijk} = −(1/6) ( L_{i,j,k} − L_{ijk} ).
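The relation between (5.18.24) and (5.18.25) reflects that the statistic in question is the log-likelihood ratio S_n, so that dP_{ϑ+n^{−1/2}a}^n = e^{S_n} dP_ϑ^n. To first order, S_n is asymptotically N(−τ²/2, τ²) under P_ϑ^n, and

e^s · τ^{−1} ϕ( (s + τ²/2)/τ ) = τ^{−1} ϕ( (s − τ²/2)/τ ),

which turns the term +τ/2 in (5.18.24) into −τ/2 in (5.18.25); the barred coefficients collect the corresponding n^{−1/2}-corrections.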
β_n := ∫ ϕ_n dP_{ϑ+n^{−1/2}a}^n,  n ∈ ℕ,  is bounded away from 0 and 1.   (5.18.26)
Q_n ∘ ( u → α_i u_i + n^{−1/2} β_{ij} u_i u_j )
where
Ḡ(v) = ā v/σ + c̄ v³/σ³   (5.18.28)

with

σ² = Λ_{ij} α_i α_j,   (5.18.29)

ā σ³ = Λ_{ij} a_i α_j + 3 c_{ijk} ( Λ_{ij} Λ_{kr} α_r − Λ_{ir} Λ_{js} Λ_{kt} α_r α_s α_t ),   (5.18.30)

c̄ σ³ = c_{ijk} Λ_{ir} Λ_{js} Λ_{kt} α_r α_s α_t + β_{ij} Λ_{ir} Λ_{js} α_r α_s.   (5.18.31)
holds for every continuous N(0, Λ)-integrable function h. The relations (5.18.27)–(5.18.31) follow from an application with h(v) = exp[itv]. (For a similar proof see Pfanzagl 1985, p. 468, Addendum to Lemma 13.5.12.)
Proof Since

{ S_i^{(n)} ≤ t_i for i = 1, …, k } Δ { Ŝ_i^{(n)} ≤ t_i for i = 1, …, k } ⊂ ⋃_{i=1}^k { S_i^{(n)} ≤ t_i } Δ { Ŝ_i^{(n)} ≤ t_i },
Hence the pertaining E-expansions E^{(n)} and Ē^{(n)} agree for every n ∈ ℕ on all rectangles. For B ∈ 𝔅^k let

Since the signed measures μ and μ̂ agree on all rectangles, and the rectangles form an ∩-stable generator of 𝔅^k, they agree on 𝔅^k.
i.e., p(T_n(·), a) is asymptotically a P_0^n-density of P_a^n|𝒜^n. Moreover, the family with P_0^n-densities p(T_n(·), a), a ∈ A, is assumed to be asymptotically complete in the sense that, for Q = P_0^n ∘ T_n and each measurable g: ℝ^m → ℝ,
then

∫ f_n h ∘ T_n dP_0^n = o(n⁰)

for every bounded and continuous function h: ℝ^m → ℝ.   (5.18.34)
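For orientation, recall the classical notion behind this: a family of probability measures {Q_a : a ∈ A} with Q-densities q(·, a) is complete if ∫ g(u) q(u, a) Q(du) = 0 for every a ∈ A implies g = 0 Q-a.e. Condition (5.18.33) is an asymptotic version of this property for the densities p(T_n(·), a), a ∈ A.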
For the proof we need, in addition to the essential condition (5.18.33), the following technical condition: f_n, n ∈ ℕ, is uniformly integrable with respect to P_a^n, i.e., for every ε > 0 there exists t_{a,ε} such that ∫ |f_n| 1_{[t_{a,ε},∞)}(|f_n|) dP_a^n < ε for n ∈ ℕ.
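A convenient sufficient condition: if sup_{n∈ℕ} ∫ |f_n|^{1+δ} dP_a^n < ∞ for some δ > 0, then f_n, n ∈ ℕ, is uniformly P_a^n-integrable, since

∫ |f_n| 1_{[t,∞)}(|f_n|) dP_a^n ≤ t^{−δ} ∫ |f_n|^{1+δ} dP_a^n,

which is smaller than ε for t large enough, uniformly in n.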
Remark Notice that Lemma 5.18.9 is closely related to Basu’s Theorem, which asserts
that completeness implies stochastic independence of ancillary statistics. Asymptotic
ancillarity could be defined as
relation (5.18.34), applied with f_n replaced by f_n − ∫ f_n dP_0^n, yields
Proof Let f̃_n^+ := f_n 1_{(0,∞)}(f_n) and f̃_n^− := −f_n 1_{(−∞,0)}(f_n). Let ℕ₀ be a subsequence such that c_n^+ := ∫ f̃_n^+ dP_0^n, n ∈ ℕ, converges to some c^+ ≥ 0. If c^+ > 0, we define the probability measures Q_n^+|𝔅^m and Q_n^−|𝔅^m by

Q_n^+(B) := ∫ 1_B f̃_n^+ dP_0^n / c_n^+,   B ∈ 𝔅^m,

and

Q_n^−(B) := ∫ 1_B f̃_n^− dP_0^n / c_n^−,   B ∈ 𝔅^m.
∫ f_n^+ dP_0^n = ∫ f_n^− dP_0^n = 1.
Let G_n^+|𝔅^m be the probability measure defined by

G_n^+(B) = ∫ f_n^+(x) 1_B(T_n(x)) P_0^n(dx),   B ∈ 𝔅^m.   (5.18.35)
If f_n^+ is uniformly P_a^n-integrable for every a ∈ A, there exists for every ε > 0 a number s_{a,ε} > 0 such that
Since

∫ f_n^+ 1_{[t,∞)}(|T_n|) dP_a^n = ∫ f_n^+ 1_{{f_n^+ > s_{a,ε}}} 1_{[t,∞)}(|T_n|) dP_a^n + ∫ f_n^+ 1_{{f_n^+ ≤ s_{a,ε}}} 1_{[t,∞)}(|T_n|) dP_a^n,
G_n^+ { u ∈ ℝ^m : |u| > t_ε } < ε   for n ∈ ℕ,

G_n^+ ⇒_{n∈ℕ₀} G_*^+.
By relation (5.18.35), relation (5.18.36) holds for every bounded and continuous function h|ℝ^m.
Since ∫ f_n^+ dP_a^n − ∫ f_n^− dP_a^n = o(n⁰), relation (5.18.36), applied with f_n^+ and f_n^−, implies

| ∫ f_n^+ 1_{[−t,t]}(T_n) dP_a^n − ∫ f_n^− 1_{[−t,t]}(T_n) dP_a^n | < 2ε   (5.18.37)

for n ∈ ℕ if t > t_ε.
Since

for n ∈ ℕ and t > t_ε.
Since u → p(u, a) 1_{[−t,t]}(u) is bounded and continuous λ^m-a.e.,

hence

| ∫ p(u, a) 1_{[−t,t]}(u) G_*^+(du) − ∫ p(u, a) 1_{[−t,t]}(u) G_*^−(du) | < 3ε
lim_{n∈ℕ₀} ( ∫ f_n^+(x) h(T_n(x)) P_0^n(dx) − ∫ f_n^−(x) h(T_n(x)) P_0^n(dx) ) = 0
Limit distributions provide first information about the quality of estimator sequences for large samples. Let 𝔓|𝒜 be a family of probability measures, and let κ: 𝔓 → ℝ^k be a k-dimensional functional. Let κ^{(n)} and κ̂^{(n)} be estimators for κ(P) such that P^n ∘ c_n(κ^{(n)} − κ(P)) ⇒ Q_P and P^n ∘ c_n(κ̂^{(n)} − κ(P)) ⇒ Q̂_P. If the limit distribution Q̂_P is more concentrated about 0 than Q_P, this suggests using κ̂^{(n)} rather than κ^{(n)}. Since we do not know the true P, the situation is unclear if it happens that Q̂_P is better than Q_P for P in some subset 𝔓₀ of 𝔓 but worse for P outside 𝔓₀.
If an estimate κ^{(n)}(x_n) has been observed, what does this tell us about the “true” value of the functional κ(P)? Even if the true distribution of n^{1/2}(κ^{(n)} − κ(P)) under P^n is known, there is no satisfactory answer, because a satisfactory answer must be independent of the unknown P. In special cases, e.g. for real-valued functionals κ: 𝔓 → ℝ, a satisfactory answer can be given by a confidence bound K^{(n)}: X^n → ℝ such that {κ(P) ≤ K^{(n)}(x_n)} holds under P^n with high probability. An asymptotic solution to this problem can be based on the limit distribution Q_P|𝔅.
(i) Given β ∈ (0, 1), determine the quantile t_β(P) defined by

Q_P(−∞, t_β(P)] = β.
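In the simplest case, k = 1 and c_n = n^{1/2}, step (i) leads, at least heuristically, to the bound K^{(n)}(x_n) := κ^{(n)}(x_n) − n^{−1/2} t_{1−β}(P̂_n), with P̂_n some consistent estimator of P (a hypothetical ingredient; continuity of P → t_{1−β}(P) and of Q_P at t_{1−β}(P) are needed). Then

P^n { κ(P) ≤ K^{(n)} } = P^n { n^{1/2}( κ^{(n)} − κ(P) ) ≥ t_{1−β}(P̂_n) } → 1 − Q_P(−∞, t_{1−β}(P)] = β,

a sketch rather than a theorem: the requirement that the covering probability be independent of P is, of course, only met asymptotically.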
minimal. Different such sets need, of course, not be inclusion ordered. If Q_P is convex unimodal, the sets C_P are convex.
For LAN families and regular estimators κ^{(n)}, an efficient estimator leads to the usual confidence procedure with convex C_P of minimal asymptotic volume.
References
Edwards, A. W. F. (1974). The history of likelihood. International Statistical Review, 42, 9–15.
Edwards, A. W. F. (1992). Likelihood. Baltimore: Johns Hopkins University.
Eisenhart, Ch. (1938). Abstract 18. Bulletin of the American Mathematical Society, 44, 32.
Elstrodt, J. (2000). Maß- und Integrationstheorie. Berlin: Springer.
Encke, J.F. (1832–34). Über die Methode der kleinsten Quadrate. Berliner Astronomisches Jahrbuch
1834, 249–312; 1835, 253–320; 1836, 253–308.
Fabian, V. (1970). On uniform convergence of measures. Zeitschrift für Wahrscheinlichkeitstheorie
und verwandte Gebiete, 15, 139–143.
Fabian, V. (1973). Asymptotically efficient stochastic approximation; the RM case. Annals of Sta-
tistics, 1, 486–495.
Fabian, V., & Hannan, J. (1982). On estimation and adaptive estimation for locally asymptotically normal families. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 59, 459–478.
Farrell, R. H. (1967). On the lack of a uniformly consistent sequence of estimators of a density
function in certain cases. The Annals of Mathematical Statistics, 38, 471–474.
Farrell, R. H. (1972). On the best obtainable asymptotic rates of convergence in estimation of a
density function at a point. The Annals of Mathematical Statistics, 43, 170–180.
Ferguson, T. S. (1982). An inconsistent maximum likelihood estimate. Journal of the American
Statistical Association, 77, 831–834.
Filippova, A. A. (1962). Mises’ theorem on the asymptotic behavior of functionals of empirical
distribution functions and its statistical applications. Theory of Probability and Its Applications,
7, 24–57.
Fisher, R. A. (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathemat-
ics, 41, 155–160.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222, 309–368.
Fisher, R. A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical
Society, 22, 700–725.
Fisher, R. A. (1959). Statistical methods and scientific inference (2nd ed.). Edinburgh: Oliver and Boyd.
Fix, E., & Hodges, J. L., Jr. (1951). Discriminatory analysis—nonparametric discrimination: Con-
sistency properties. Report 4, USAF School of Aviation Medicine. Reprinted in International
Statistical Review 57, 238–247 (1989).
Foutz, R. V. (1977). On the unique consistent solution of the likelihood equations. Journal of the
American Statistical Association, 72, 147–148.
Gänssler, P., & Stute, W. (1976). On uniform convergence of measures with applications to uniform
convergence of empirical distributions. In Empirical Distributions and Processes (Vol. 566, pp.
45–56). Lecture notes in mathematics.
Gauss, C. F. (1816). Bestimmung der Genauigkeit der Beobachtungen. Carl Friedrich Gauss Werke,
4, 109–117, Königliche Gesellschaft der Wissenschaften, Göttingen.
Ghosh, J. K. (1985). Efficiency of estimates. Part I. Sankhyā Series A, 47, 310–325.
Ghosh, J. K. (1991). Higher order asymptotics for the likelihood ratio, Rao’s and Wald’s tests.
Statistics and Probability Letters, 12, 505–509.
Ghosh, M. (1995). Inconsistent maximum likelihood estimators for the Rasch model. Statistics and
Probability Letters, 23, 165–170.
Ghosh, J. K., & Subramanyam, K. (1974). Second order efficiency of maximum likelihood estima-
tors. Sankhyā Series A, 36, 325–358.
Glivenko, V. I. (1939). Course in probability theory (Russian). Moscow: ONTI.
Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. The
Annals of Mathematical Statistics, 31, 1208–1211.
Gong, G., & Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: theory and applica-
tions. Annals of Statistics, 9, 861–869.
Grossmann, W. (1979). Einige Bemerkungen zur Theorie der Maximum Probability Schätzer. Metrika, 26, 129–137.
Grossmann, W. (1981). Efficiency of estimates in nonregular cases. In P. Révész, L. Schmetterer, & V. M. Zolotarev (Eds.), First Pannonian Symposium on Mathematical Statistics. Lecture notes in statistics (Vol. 8, pp. 94–109). Berlin: Springer.
Gurland, J. (1954). On regularity conditions for maximum likelihood estimation. Skandinavisk
Aktuarietidskrift, 37, 71–76.
Gusev, S. I. (1975). Asymptotic expansions that are connected with certain statistical estimates
in the smooth case. I. Expansions of random variables (Russian). Teoriya Veroyatnostei i ee
Primeneniya, 20, 488–514.
Hahn, H. (1932). Reelle Funktionen. Leipzig: Akademische Verlagsgesellschaft.
Hájek, J. (1962). Asymptotically most powerful rank-order tests. The Annals of Mathematical
Statistics, 33, 1124–1147.
Hájek, J. (1970). A characterization of limiting distributions of regular estimates. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete, 14, 323–330.
Hájek, J. (1971). Limiting properties of likelihood and inference. In V. P. Godambe & D. A. Sprott (Eds.), Foundations of Statistical Inference (pp. 142–162). Toronto: Holt, Rinehart and Winston.
Hájek, J. (1972). Local asymptotic minimax and admissibility in estimation. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability I (pp. 175–194). Berkeley & Los Angeles: University of California Press.
Hájek, J., & Šidák, Z. (1967). Theory of rank tests (1st edn.). Prague: Academia Publishing House.
Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.
Hald, A. (1999). On the history of maximum likelihood in relation to inverse probability and least
squares. Statistical Science, 14, 214–222.
Hall, P. (1989). On convergence rates in nonparametric problems. International Statistical Review,
57, 45–58.
Hall, P., & Welsh, A. M. (1984). Best attainable rates of convergence for estimates of parameters
of regular variation. Annals of Statistics, 12, 1079–1084.
Hartigan, J. A. (1966). Note on the confidence-prior of Welch and Peers. Journal of the Royal
Statistical Society Series B, 28, 55–56.
Has’minskii, R. Z., & Ibragimov, I. A. (1979). On the nonparametric estimation of functionals. In P. Mandl & M. Hušková (Eds.), Proceedings of the Second Prague Symposium on Asymptotic Statistics (pp. 41–51). Amsterdam: North-Holland.
Helmert, C. F. (1876). Über die Wahrscheinlichkeit der Potenzsummen der Beobachtungsfehler und
über einige damit in Zusammenhang stehende Fragen. Zeitschrift für Mathematik und Physik,
21, 192–219.
Hewitt, E., & Stromberg, K. (1965). Real and abstract analysis. New York: Springer.
Hodges, J. L, Jr., & Lehmann, E. L. (1970). Deficiency. The Annals of Mathematical Statistics, 41,
783–801.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of
Mathematical Statistics, 19, 293–325.
Hoeffding, W. (1962). Review of Wilks: Mathematical statistics. The Annals of Mathematical Sta-
tistics, 33, 1467–1473.
Hoeffding, W., & Wolfowitz, J. (1958). Distinguishability of sets of distributions. (The case of
independent and identically distributed chance variables). The Annals of Mathematical Statistics,
29, 700–718.
Hotelling, H. (1930). The consistency and ultimate distribution of optimum statistics. Transactions
of the American Mathematical Society, 32, 847–859.
Huber, P. J. (1966). Strict efficiency excludes superefficiency. Preprint, ETH Zürich. Abstract 25, The Annals of Mathematical Statistics, 37, 1425–1426.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 221–233). Berkeley & Los Angeles: University of California Press.
Huzurbazar, V. S. (1948). The likelihood equation, consistency and the maximum likelihood func-
tion. Annals of Eugenics, 14, 185–200.
Ibragimov, I. A. (1956). On the composition of unimodal distributions. Theory of Probability and
Its Applications, 1, 255–260.
Ibragimov, I. A., & Has’minskii, R. Z. (1981). Statistical estimation: Asymptotic theory. New York:
Springer. Translation of the Russian original (1979).
Ibragimov, I. A., & Has’minskii, R. Z. (1970). On the asymptotic behavior of generalized Bayes
estimators. Doklady Akademii Nauk SSSR, 194, 257–260.
Ibragimov, I. A., & Has’minskii, R. Z. (1973). Asymptotic analysis of statistical estimators in the
"almost smooth" case. Theory of Probability and Its Applications, 18, 241–252.
Inagaki, N. (1970). On the limiting distribution of a sequence of estimators with uniformity property.
Annals of the Institute of Statistical Mathematics, 22, 1–13.
Janssen, A., & Ostrovski, V. (2013). The convolution theorem of Hájek and Le Cam, revisited. Preprint. Düsseldorf: Mathematical Institute.
Jogdeo, K. (1970). An inequality for a strict unimodal function with an application to a characteri-
zation of independence. Sankhyā Series A, 32, 405–410.
Kallianpur, G., & Rao, C. R. (1955). On Fisher’s lower bound to asymptotic variance of a consistent
estimate. Sankhyā Series A, 15, 331–342.
Kati, S. (1983). Comment on Bar-Lev and Reiser (1983). Sankhyā Series B, 45, 302.
Kaufman, S. (1966). Asymptotic efficiency of the maximum likelihood estimator. Annals of the Institute of Statistical Mathematics, 18, 155–178. See also Abstract, The Annals of Mathematical Statistics, 36, 1084 (1965).
Kelley, J. L. (1955). General topology. Toronto: Van Nostrand.
Kendall, M. G., & Stuart, A. (1961). The advanced theory of statistics (Vol. 1). London: Griffin.
Kersting, G. D. (1978). Die Geschwindigkeit der Glivenko-Cantelli-Konvergenz gemessen in der
Prohorov-Metrik. Mathematische Zeitschrift, 163, 65–102.
Kiefer, J. (1982). Optimum rates for nonparametric density and regression estimates under order
restrictions. In C. R. Rao, G. Kallianpur, P. R. Krishnaiah, & J. K. Ghosh (Eds.), Statistics and
Probability: Essays in Honor (pp. 419–429). Amsterdam: North-Holland.
Klaassen, C. A. J. (1979). Nonuniformity of the convergence of location estimators. In P. Mandl, &
M. Hušková (Eds.), Proceedings of the Second Prague Symposium on Asymptotic Statistics (pp.
251–258).
Klaassen, C. A. J. (1981). Statistical performance of location estimators. In Mathematical Centre
Tracts (Vol. 133). Amsterdam: Mathematisch Centrum.
Klaassen, C. A. J. (1987). Consistent estimation of the influence function of locally asymptotically
linear estimators. Annals of Statistics, 15, 1548–1562.
Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitstheorie. Berlin: Springer.
Koshevnik, Yu. A., & Levit, B. Ya. (1976). On a nonparametric analogue of the information matrix.
Theory of Probability and Its Applications, 21, 738–753.
Kraft, C., & Le Cam, L. (1956). A remark on the roots of the maximum likelihood equation. The
Annals of Mathematical Statistics, 27, 1174–1177.
Kronmal, R., & Tarter, M. (1968). The estimation of probability densities and cumulatives by Fourier
series methods. Journal of the American Statistical Association, 63, 925–952.
Kulldorff, G. (1956). On the condition for consistency and asymptotic efficiency of maximum
likelihood estimates. Skandinavisk Aktuarietidskrift, 40, 129–144.
Lambert, J. H. (1760). Photometria, sive de mensura et gradilus luminis colorum et umbrae. Augus-
tae Vindelicorum.
Landers, D. (1972). Existence and consistency of modified minimum contrast estimates. The Annals
of Mathematical Statistics, 43, 74–83.
Laplace, P. S. (1812). Théorie analytique des probabilités. Paris: Courcier.
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related
Bayes’ estimates. University of California Publications in Statistics, 1, 277–330.
Le Cam, L. (1956). On the asymptotic theory of estimation and testing hypotheses. In Proceedings
of the Third Berkeley Symposium on Mathematical Statistics and Probability (pp. 129–156).
Berkeley & Los Angeles: University of California Press.
Le Cam, L. (1958). Les propriétés asymptotiques des solutions de Bayes. Publication Institute
Statistics University of Paris VII, 17–35.
Neyman, J., & Pearson, E. S. (1936). Sufficient statistics and uniformly most powerful tests of statistical hypotheses. Statistical Research Memoirs, 1, 113–137.
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations.
Econometrica, 16, 1–32.
Osgood, W. F. (1897). Non-uniform convergence and the integration of series term by term. American Journal of Mathematics, 19, 155–190.
Parke, W. R. (1986). Pseudo-maximum-likelihood estimation: The asymptotic distribution. Annals
of Statistics, 14, 355–357.
Parthasarathy, K. R. (1967). Probability measures on metric spaces. New York: Academic Press.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Math-
ematical Statistics, 33, 1065–1076.
Pearson, K. (1896). Mathematical contributions to the theory of evolution III. Regression, heredity
and panmixia. Philosophical Transactions of the Royal Society of London (A), 187, 253–318.
Pearson, K., & Filon, L. N. G. (1898). Mathematical contributions to the theory of evolution IV. On
the probable errors of frequency constants and on the influence of random selection on variation
and correlation. Philosophical Transactions of the Royal Society of London (A), 191, 229–311.
Pfanzagl, J. (1969a). On the measurability and consistency of minimum contrast estimators. Metrika,
14, 249–272.
Pfanzagl, J. (1969b). On the existence of product measurable densities. Sankhyā, 31, 13–18.
Pfanzagl, J. (1970). On the asymptotic efficiency of median-unbiased estimates. The Annals of
Mathematical Statistics, 41, 1500–1509.
Pfanzagl, J. (1972a). On median-unbiased estimates. Metrika, 18, 154–173.
Pfanzagl, J. (1972b). Further results on asymptotic normality I. Metrika, 18, 174–198.
Pfanzagl, J. (1972c). Further results on asymptotic normality II. Metrika, 18, 89–97.
Pfanzagl, J. (1973). Asymptotic expansions related to minimum contrast estimators. Annals of
Statistics, 1, 993–1026.
Pfanzagl, J. (1974). Asymptotically optimum estimation and test procedures. In J. Hájek (Ed.),
Proceedings of the Prague Symposium on Asymptotic Statistics (Vol. 1, pp. 201–272). Prague:
Charles University.
Pfanzagl, J. (1975). On asymptotically complete classes. In Statistical Inference and Related Topics (Vol. 2, pp. 1–43). New York: Academic Press.
Pfanzagl, J. (1979). First order efficiency implies second order efficiency. In J. Jurečková (Ed.),
Contribution to Statistics (pp. 167–196). Prague: Academia.
Pfanzagl, J. (1982). (with the assistance of W. Wefelmeyer). Contributions to a general asymptotic
statistical theory. Lecture Notes in Statistics (Vol. 13). Berlin: Springer.
Pfanzagl, J. (1985) (with the assistance of W. Wefelmeyer). Asymptotic expansions for general statistical models. Lecture notes in statistics (Vol. 31). Berlin: Springer.
Pfanzagl, J. (1990). Estimation in semiparametric models. Some recent developments (Vol. 63).
Lecture notes in statistics. New York: Springer.
Pfanzagl, J. (1994). Parametric Statistical Theory. Berlin: De Gruyter.
Pfanzagl, J. (1995). On local and global asymptotic normality. Mathematical Methods of Statistics,
4, 115–136.
Pfanzagl, J. (1998). The nonexistence of confidence sets for discontinuous functionals. Journal of
Statistical Planning and Inference, 75, 9–20.
Pfanzagl, J. (1999a). On the optimality of limit distributions. Mathematical Methods of Statistics,
8, 69–83. Erratum 8, 550.
Pfanzagl, J. (1999b). On rates and limit distributions. Annals of the Institute of Statistical Mathe-
matics, 51, 755–778.
Pfanzagl, J. (2000a). Subconvex loss functions, unimodal distributions, and the convolution theorem.
Mathematical Methods of Statistics, 9, 1–18.
Pfanzagl, J. (2000b). On local uniformity for estimators and confidence limits. Journal of Statistical
Planning and Inference, 84, 27–53.
Takeuchi, K., & Akahira, M. (1976). On the second order asymptotic efficiencies of estimators. In G. Maruyama & Yu. V. Prohorov (Eds.), Proceedings of the Third Japan–USSR Symposium on Probability Theory (Vol. 550, pp. 604–638). Lecture notes in mathematics. Berlin: Springer.
Takeuchi, K. (1971). A uniformly asymptotically efficient estimator of a location parameter. Journal
of the American Statistical Association, 66, 292–301.
Takeuchi, K. (1974). Tokei-teki Suitei no Zenkinriron (Asymptotic theory of statistical estimation)
(Japanese). Tokyo: Kyoiku-Shuppan.
Tarone, R. E., & Gruenhage, G. (1975). A note on the uniqueness of roots of the likelihood equations for vector-valued parameters. Journal of the American Statistical Association, 70, 903–904.
Tierney, L. (1987). An alternative regularity condition for Hájek’s representation theorem. Annals
of Statistics, 15, 427–431.
Todhunter, I. (1865). A history of mathematical theory of probability. Cambridge: Macmillan.
Tong, Y. L. (1990). The multivariate normal distribution. Berlin: Springer.
van der Vaart, A. W. (1988). Statistical estimation in large parameter spaces. CWI Tract 44. Amsterdam: Centrum voor Wiskunde en Informatica.
van der Vaart, A. W. (1991). On differentiable functionals. Annals of Statistics, 19, 178–204.
van der Vaart, A. W. (1997). Super efficiency. In D. Pollard, E. Torgersen, & G. L. Yang (Eds.),
Festschrift for Lucien Le Cam (pp. 397–410). Research Papers in Probability and Statistics.
Springer.
van der Vaart, A. W. (2002). The statistical work of Lucien Le Cam. Annals of Statistics, 30, 631–682.
van Eeden, C. (1968). Nonparametric estimation. Séminaire de Mathématiques Supérieures (Vol.
35). Montréal: Les Presses de l’Université de Montréal.
van Eeden, C. (1970). Efficiency-robust estimation of location. The Annals of Mathematical Statis-
tics, 41, 172–181.
von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. The
Annals of Mathematical Statistics, 18, 309–348.
Wahba, G. (1971). A polynomial algorithm for density estimation. The Annals of Mathematical
Statistics, 42, 1870–1886.
Wahba, G. (1975). Optimal convergence properties of variable knot, kernel, and orthogonal series
methods for density estimation. Annals of Statistics, 3, 15–29.
Wald, A. (1941a). On the principles of statistical inference. Four Lectures Delivered at the University
of Notre Dame, February 1941. Published as Notre Dame Mathematical Lectures (Vol. 1), 1952.
Wald, A. (1941b). Asymptotically most powerful tests of statistical hypotheses. The Annals of
Mathematical Statistics, 12, 1–19.
Wald, A. (1943). Test of statistical hypotheses concerning several parameters when the number of
observations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of
Mathematical Statistics, 20, 595–601.
Weiss, L., & Wolfowitz, J. (1966). Generalized maximum likelihood estimators. Theory of Probability and Its Applications, 11, 58–83.
Weiss, L., & Wolfowitz, J. (1967a). Estimating a density function at a point. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 7, 327–335.
Weiss, L., & Wolfowitz, J. (1967b). Maximum probability estimators. Annals of the Institute of Statistical Mathematics, 19, 193–206.
Weiss, L., & Wolfowitz, J. (1970). Asymptotically efficient non-parametric estimators of location and scale parameters. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 16, 134–150.
Weiss, L., & Wolfowitz, J. (1973). Maximum likelihood estimation of a translation parameter of a truncated distribution. Annals of Statistics, 1, 944–947.
Weiss, L., & Wolfowitz, J. (1974). Maximum probability estimators and related topics. Lecture notes in mathematics (Vol. 424). Berlin: Springer.
Welch, B. L. (1965). On comparisons between confidence point procedures in the case of a single
parameter. Journal of the Royal Statistical Society Series B, 27, 1–8.
Wendel, J. G. (1952). Left centralizers and isomorphisms of group algebras. Pacific Journal of
Mathematics, 2, 251–261.
Wilks, S. S. (1962). Mathematical statistics. New York: Wiley.
Williamson, J. A. (1984). A note on the proof of H.E. Daniels of the asymptotic efficiency of a
maximum likelihood estimator. Biometrika, 71, 651–653.
Witting, H. (1985). Mathematische Statistik I: Parametrische Verfahren bei festem Stichprobenumfang. Stuttgart: Teubner.
Witting, H. (1998). Nichtparametrische Statistik: Aspekte ihrer Entwicklung 1957–1997. Jahres-
bericht der Deutschen Mathematiker-Vereinigung, 100, 209–237.
Witting, H., & Müller-Funk, U. (1995). Mathematische Statistik II. Asymptotische Statistik: para-
metrische Modelle und nichtparametrische Funktionale. Wiesbaden: Teubner.
Wolfowitz, J. (1952). Consistent estimators of the parameters of a linear structural relation. Skandinavisk Aktuarietidskrift, 35, 132–151.
Wolfowitz, J. (1953a). The method of maximum likelihood and the Wald theory of decision functions. Indagationes Mathematicae, 15, 114–119.
Wolfowitz, J. (1953b). Estimation by the minimum distance method. Annals of the Institute of Statistical Mathematics, 5, 9–23.
Wolfowitz, J. (1957). The minimum distance method. The Annals of Mathematical Statistics, 28, 75–88.
Wolfowitz, J. (1963). Asymptotic efficiency of the maximal likelihood estimator. Seventh All-Soviet Union Conference on Probability and Mathematical Statistics, Tbilisi.
Wolfowitz, J. (1965). Asymptotic efficiency of the maximum likelihood estimator. Theory of Probability and Its Applications, 10, 247–260.
Wolfowitz, J. (1970). Reflections on the future of mathematical statistics. In R. C. Bose et al. (Eds.), Essays in Probability and Statistics (pp. 739–750). University of North Carolina Press.
Woodroofe, M. (1972). Maximum likelihood estimation of a translation parameter of a truncated
distribution. The Annals of Mathematical Statistics, 43, 113–122.
Yang, G. L. (1999). A conversation with Lucien Le Cam. Statistical Science, 14, 223–241.
Zacks, S. (1971). The theory of statistical inference. New York: Wiley.
Zermelo, E. (1929). Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der
Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29, 436–460.
Author Index
Cramér, H., 6, 43, 48, 58, 95, 96, 118, 120–122, 125, 225
Csiszár, I., 38
Cuculescu, I., 61
Czuber, E., 113

D
Daniels, H.E., 196
Darmois, G., 6, 95
Das Gupta, S., 58, 60, 61
David, F.N., 49
Davidovič, Ju.S., 55, 61
Denny, J.L., 26, 29, 30, 33, 34
Deshpande, J.V., 54
Dharmadhikari, S.W., 55, 56, 59–61
Dieudonné, J., 13
Does, R.J.M.M., 275
Doksum, K., 53
Doob, J.L., 6, 13, 108, 123, 127, 178
Droste, W., 63, 210, 211, 234
Dubins, L.E., 17
Dudley, R.M., 13, 20, 218
Dugué, D., 108, 117, 118, 123, 124, 127
Dunn, O.J., 229
Dvoretzky, A., 68
Dynkin, E.B., 31, 32

E
Eberl, W., 89
Edgeworth, F.Y., 114
Edwards, A.W.F., 114
Eeden, C. van, 167
Eisenhart, Ch., 187
Elstrodt, J., 130
Encke, J.F., 113
Erdős, P., 234

F
Fáry, I., 60, 61
Fabian, V., 131, 159, 167–170, 188, 238, 242
Farrell, R.H., 258–260, 262–264
Ferguson, T.S., 73, 125
Fieger, W., 36, 37
Filippova, A.A., 163
Filon, L.N.G., 114
Fisher, R.A., 3, 4, 6–8, 11–14, 18, 19, 27, 30, 47, 56, 101, 109, 113–115, 123, 148, 149, 177, 178, 216, 225, 283
Fisz, M., 97
Fix, E., 257
Fourier, J.B.J., 100
Foutz, R.V., 121
Fraser, D.A.S., 23, 47, 50, 53
Fréchet, M., 6, 94, 96

G
Gänssler, P., 130
Gauss, C.F., 11, 55, 66, 71, 100, 110
Ghosh, J.K., 67, 102, 136, 180
Ghosh, M., 75, 125
Girshick, M.A., 51
Glivenko, V.I., 257
Gnedenko, B.W., 55
Godambe, V.P., 180, 218
Gong, G., 117
Grossmann, W., 127, 194
Gruenhage, G., 121
Gurland, J., 122
Gusev, S.I., 274

H
Hahn, H., 135, 136
Hájek, J., 5, 54, 64, 76, 159, 163, 164, 167, 174, 175, 180, 181, 187, 196, 211, 212, 214, 216–219, 237–245, 247, 255, 277
Hald, A., 27, 55, 113, 114
Hall, P., 262, 263
Halmos, P.R., 6, 12–14, 19–21, 23, 25, 86
Hammersley, J.M., 96
Hampel, F.R., 2
Hannan, J., 159, 168–170, 188, 238, 242
Hartigan, J.A., 274
Hasegawa, M., 25
Has’minskii, R.Z., 64, 66, 76, 112, 144, 150, 155, 163, 174, 180, 182, 187, 212, 218, 238, 239, 241, 248, 254, 259, 264
Heizmann, H.-H., 89
Helguero, F. de, 55
Helmert, C.F., 113
Hewitt, E., 132
Heyer, H., 7, 15, 17, 22, 48, 84, 88
Hipp, Ch., 30
Hlawka, E., 130
Hodges, J.L., Jr., 8, 86, 182, 249, 255, 257, 267, 274
Hoeffding, W., 23, 125, 132, 163
Hoel, P.G., 21
Hotelling, H., 101, 102, 123
Huber, P., 196, 237
Huygens, C., 49
Huzurbazar, V.S., 122
Hwang, J.T., 254

I
Ibragimov, I.A., 55, 56, 61, 64, 66, 76, 112, 144, 150, 155, 163, 174, 180, 182, 187, 212, 218, 238, 239, 241, 248, 254, 259, 264
Inagaki, N., 64, 76, 130, 183, 210, 211, 215–217
Isenbeck, M., 23

J
James, W., 73
Janssen, A., 225
Joag-Dev, K., 55, 56, 59–61, 229
Jurečková, J., 3

K
Kallianpur, G., 195
Karlin, S., 51
Kati, S., 225
Katz, M., Jr., 30, 32–35
Kaufman, S., 64, 112, 130, 156, 183–185, 187, 193, 195, 210, 211, 215–219, 226
Keating, J.P., 76
Kelley, J.L., 233
Kendall, M.G., 118–120
Kersting, G., 133, 144
Khintchine, A.Y., 55
Kiefer, J., 259, 262, 263
Klaassen, C.A.J., 55, 63, 151–154, 189
Klebanov, L.B., 73
Kochar, S.C., 54
Kolmogorov, A.N., 6, 19, 55, 93, 94, 109, 178
Koopman, B.O., 30, 31
Koshevnik, Yu.A., 159, 163, 241
Kozek, A., 84, 90
Krámli, A., 84
Kraft, C., 116, 125
Kronmal, R., 258
Kullback, S., 38
Kulldorff, G., 122
Kumar, A., 13

L
Lambert, J.H., 113
Landers, D., 5, 23, 25, 125
Lapin, A.I., 55
Laplace, P.S., 3, 11, 12, 49, 66, 100, 110
Laube, G., 29
Le Cam, L., 5–7, 17, 60, 64, 67, 116, 123, 125, 138–144, 148, 156, 161, 162, 164, 165, 167, 177–179, 181, 196, 209, 214, 216, 218, 238–240, 242, 249, 251–255
Lehmann, E.L., 3, 5, 6, 8, 13, 18, 20–25, 43–45, 48, 50–54, 56, 66, 71, 72, 84, 86, 87, 98, 101, 108, 111, 129, 136, 196, 219, 239, 254, 255, 274
Leibler, R.A., 38
Lekkerkerker, C.G., 61
Levit, B.Ya., 150, 159, 160, 163, 164, 241
Lewis, T., 53, 62
Lexis, W., 101
Lindley, D.V., 4, 180
Linnik, Yu.V., 6, 84, 93, 273
Liu, R.C., 208
Löwner, K., 58
Lukacs, E., 184
Lumel’skii, Ya.P., 94
Lynch, J., 59

M
Maitra, A.P., 32–34
Mandelbaum, A., 23
Mann, H.B., 50
Markov, A.A., 49, 58
Mattner, L., 23, 26
Merton, R.K., 6
Messig, M.A., 22
Michel, R., 193, 210, 274
Milbrodt, H., 210
Millar, P.W., 65, 67, 158, 241, 245
Mises, R. von, 45, 163
Mitrofanova, N.M., 273
Moeschlin, O., 89
Moivre, A. de, 49
Moran, P.A.P., 53, 54
Müller, A., 43
Müller-Funk, U., 43, 57, 63, 71, 96, 97, 111, 115, 119, 130, 146, 147, 164, 187, 193, 217, 219, 225, 239, 241, 255
Muñoz-Perez, J., 71
Mussmann, D., 52

N
Neyman, J., 3, 19, 21, 49, 101, 102, 125, 180, 190
Nölle, G., 5
Subject Index

C
Canonical gradient, 112, 151, 158, 162, 163, 164, 170, 173, 220, 222–224, 230
Complete family, 20
Complete sufficient statistic, 96, 107, 177
Concentration order, 56
Conditional expectation, 13
Confidence bound, 102, 293
Confidence interval, 100

D
DCC-differentiable family, 160, 162
Decision function, 35
Decision theory, 7
Density estimator, 93
Differentiable functional, 147, 199, 203, 222
Dilation order, 71
Dispersion order, 54
A
A_P, 89

B
B_{α,β}, Beta distribution, 54

C
∗, convolution, 55

D
≪, dominated, 19

E
E₀, 276

F
f^•(x, ϑ), 117
f^{••}(x, ϑ), 118

G
Γ(μ, σ²), Gamma distribution, 26

I
I(ϑ) = ∫ ℓ^•(x, ϑ)² P_ϑ(dx), Fisher information, 126

K
κ^•, gradient, 162
κ^∗, canonical gradient, 162
K̃(x_n, P) = n^{−1/2} Σ_{ν=1}^n K(x_ν, P), 145

L
ℓ(x, P), log density, 31
λ^n, Lebesgue density, 27
L_∗(P), 145
L₀, 73
L₁, linex loss function, 73
L_{i,j}(ϑ) = ∫ ℓ^{(i)}(x, ϑ) ℓ^{(j)}(x, ϑ) P_ϑ(dx), 147
L_{ij}(ϑ) = ∫ ℓ^{(ij)}(x, ϑ) P_ϑ(dx), 147
L_{ij,k}(ϑ) = ∫ ℓ^{(ij)}(x, ϑ) ℓ^{(k)}(x, ϑ) P_ϑ(dx), 272
L_{ijk}(ϑ) = ∫ ℓ^{(ijk)}(x, ϑ) P_ϑ(dx), 272
Λ(ϑ) = L(ϑ)^{−1}, 148
λ(·, ϑ) = Λ(ϑ) ℓ^•(·, ϑ), 273
L(ϑ) = (L_{i,j}(ϑ))_{i,j=1,…,k}, 148
L, 131
≤_L, Löwner order, 58
L_s, 243
L_u, 131

N
N(μ, σ²), normal distribution, 3

O
o(a_n, P^n), 108
o_p(a_n) = o(a_n, P^n), 273

P
p^•(x, ϑ), 95
Q
Q_P^{(n)} = P^n ∘ c_n(κ^{(n)} − κ(P)), 174

S
S^{(n)}(·, ϑ), 273
≺, spread order, 53
Φ, standard normal distribution function, 3

T
T(P, 𝔓), tangent set, 159

V
V₂, 89

W
⇒, weak convergence, 130

X
x̄_n, sample mean, 3
x = (x₁, x₂, …), 108
x_n = (x₁, …, x_n), 108