Stein's Paradox: Dr Richard J. Samworth, Statslab Cambridge
Perhaps the most surprising result in Statistics arises in a remarkably simple estimation problem. Let X1, …, Xp be independent random variables, with Xi ∼ N(θi, 1) for i = 1, …, p. Writing X = (X1, …, Xp)ᵀ, suppose we want to find a good estimator θ̂ = θ̂(X) of θ = (θ1, …, θp)ᵀ. To define more precisely what is meant by a good estimator, we use the language of statistical decision theory. We introduce a loss function L(θ̂, θ), which measures the loss incurred when the true value of our unknown parameter is θ and we estimate it by θ̂. We will be particularly interested in the squared error loss function L(θ̂, θ) = ‖θ̂ − θ‖², where ‖·‖ denotes the Euclidean norm, but other choices, such as the absolute error loss L(θ̂, θ) = ∑_{i=1}^p |θ̂i − θi|, are of course perfectly possible.

Now L(θ̂, θ) is a random quantity, which is not ideal for comparing the overall performance of two different estimators (as opposed to the losses they each incur on a particular data set). We therefore introduce the risk function

    R(θ̂, θ) = E_θ{L(θ̂(X), θ)}.

If θ̂ and θ̃ are both estimators of θ, we say θ̂ strictly dominates θ̃ if R(θ̂, θ) ≤ R(θ̃, θ) for all θ, with strict inequality for some value of θ. In this case, we say θ̃ is inadmissible. If θ̂ is not strictly dominated by any estimator of θ, it is said to be admissible. Notice that admissible estimators are not necessarily sensible: for instance, in our problem above with p = 1 and the squared error loss function, the estimator θ̂ = 37 (which ignores the data!) is admissible. On the other hand, decision theory dictates that inadmissible estimators can be discarded, and that we should restrict our choice of estimator to the set of admissible ones.

This discussion may seem like overkill in this simple problem, because there is a very obvious estimator of θ: since all the components of X are independent, and E(Xi) = θi (in other words, Xi is an unbiased estimator of θi), why not just use θ̂0(X) = X? Indeed, this estimator appears to have several desirable properties (for example, it is the maximum likelihood estimator and the uniform minimum variance unbiased estimator), and by the early 1950s, three proofs had emerged to show that θ̂0 is admissible for squared error loss when p = 1. Nevertheless, Stein (1956) stunned the statistical world when he proved that, although θ̂0 is admissible for squared error loss when p = 2, it is inadmissible when p ≥ 3. In fact, James and Stein (1961) showed that the estimator

    θ̂_JS = θ̂_JS(X) = (1 − (p − 2)/‖X‖²) X

strictly dominates θ̂0. The proof of this remarkable fact is relatively straightforward, and is given in the Appendix.
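The improvement is easy to see numerically. The following Python sketch estimates the risks of θ̂0 and θ̂_JS by Monte Carlo; the choice of θ and the dimension p = 10 are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
p, reps = 10, 100_000
theta = np.linspace(-2.0, 2.0, p)   # arbitrary true mean vector (illustrative)

# reps independent draws of X ~ N(theta, I_p)
X = rng.normal(loc=theta, scale=1.0, size=(reps, p))

# Usual estimator theta_hat_0(X) = X: Monte Carlo risk, close to p
risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))

# James-Stein estimator (1 - (p - 2)/||X||^2) X
shrink = 1.0 - (p - 2) / np.sum(X ** 2, axis=1, keepdims=True)
js = shrink * X
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))

print(f"risk of theta_hat_0: {risk_mle:.3f}")
print(f"risk of James-Stein: {risk_js:.3f}")
```

With these settings the James–Stein risk comes out strictly below p, in line with the strict domination result; moving θ closer to the origin widens the gap.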
…let ni and Zi denote the number of times at bat and batting average of the ith player during the 1990 season. Further, let πi denote the player's true batting average, taken to be his career batting average. (Each player had at least 3000 at bats in his career.) We consider the model where Z1, …, Zp are independent, with Zi ∼ ni⁻¹ Bin(ni, πi). We make the transformation

    Xi = ni^{1/2} sin⁻¹(2Zi − 1),

and let θi = ni^{1/2} sin⁻¹(2πi − 1), which means that Xi is approximately distributed as N(θi, 1). A heuristic argument (which can be made rigorous) to justify this is that by a Taylor expansion applied to the function g(x) = sin⁻¹(2x − 1), we have

    Xi − θi = ni^{1/2}{g(Zi) − g(πi)} ≈ ni^{1/2}(Zi − πi)g′(πi) = (Zi − πi){ni/(πi(1 − πi))}^{1/2},

which is approximately N(0, 1) by the central limit theorem, since Var(Zi) = πi(1 − πi)/ni.

Another important problem that is closely related to estimation is that of constructing a confidence set for θ, the aim being to give an idea of the uncertainty in our estimate of θ. Given α ∈ (0, 1), an exact (1 − α)-level confidence set is a subset C = C(X) of Rp such that, whatever the true value of θ, the confidence set contains it with probability exactly 1 − α. The usual, exact (1 − α)-level confidence set for θ in our original normal distribution set-up is a sphere centred at X. More precisely, it is

    C0 = {θ ∈ Rp : ‖θ − X‖² ≤ χ²_p(α)},

where χ²_p(α) denotes the upper α-point of the χ²_p distribution (in other words, if Z ∼ χ²_p, then P{Z > χ²_p(α)} = α). But in the light of what we have seen in the estimation problem, it is natural to consider confidence sets that are spheres centred at θ̂⁺_JS (or θ̂⁺_{JS,θ0}, for some θ0 ∈ Rp). Since the distri- […] -vide an improvement in this case.
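The claim that Xi is approximately N(θi, 1) can be checked by simulation. In this Python sketch the number of at bats n and the true average π are hypothetical round numbers, not values from the 1990 data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single player: n at bats, true average pi_true (both assumed)
n, pi_true, reps = 500, 0.300, 200_000

Z = rng.binomial(n, pi_true, size=reps) / n      # season batting averages
X = np.sqrt(n) * np.arcsin(2 * Z - 1)            # variance-stabilised values
theta = np.sqrt(n) * np.arcsin(2 * pi_true - 1)  # transformed true average

print(f"mean of X: {X.mean():.3f} (theta = {theta:.3f})")
print(f"variance of X: {X.var():.3f} (close to 1)")
```

The sample variance settles near 1 whatever value of π is used, which is exactly the point of the arcsine transformation.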
Appendix
First note that since ‖X − θ‖² ∼ χ²_p, we have R(θ̂0, θ) = p for all θ ∈ Rp. To compute the risk of the James–Stein estimator, note that we can write

    R(θ̂_JS, θ) = E‖(X − θ) − (p − 2)X/‖X‖²‖²
                = E‖X − θ‖² − 2(p − 2) ∑_{i=1}^p E{Xi(Xi − θi)/‖X‖²} + (p − 2)² E(1/‖X‖²).

Consider the expectation inside the sum when i = 1. We can simplify this expectation by writing it out as a p-fold integral, and computing the inner integral by parts:

    E{X1(X1 − θ1)/‖X‖²} = ∫_{Rp} (x1/‖x‖²)(x1 − θ1) (2π)^{−p/2} exp(−‖x − θ‖²/2) dx
                        = ∫_{Rp} ∂/∂x1 (x1/‖x‖²) (2π)^{−p/2} exp(−‖x − θ‖²/2) dx
                        = E{(‖X‖² − 2X1²)/‖X‖⁴},

since the integrated term vanishes. Repeating virtually the same calculation for components i = 2, …, p, we obtain

    ∑_{i=1}^p E{Xi(Xi − θi)/‖X‖²} = E{(p‖X‖² − 2‖X‖²)/‖X‖⁴} = (p − 2) E(1/‖X‖²),

and hence

    R(θ̂_JS, θ) = p − 2(p − 2)² E(1/‖X‖²) + (p − 2)² E(1/‖X‖²) = p − (p − 2)² E(1/‖X‖²) < p = R(θ̂0, θ)

for all θ ∈ Rp, as required.
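As a numerical sanity check on the final identity, the Monte Carlo risk of θ̂_JS can be compared with p − (p − 2)² E(1/‖X‖²), the expectation being estimated from the same draws (θ and p below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
p, reps = 5, 400_000
theta = np.full(p, 1.0)            # arbitrary true mean (illustrative)

X = rng.normal(theta, 1.0, size=(reps, p))
sq_norm = np.sum(X ** 2, axis=1)   # ||X||^2 for each draw

# Direct Monte Carlo estimate of the James-Stein risk
js = (1.0 - (p - 2) / sq_norm)[:, None] * X
lhs = np.mean(np.sum((js - theta) ** 2, axis=1))

# Closed-form risk p - (p - 2)^2 E(1/||X||^2), with the
# expectation also estimated by Monte Carlo
rhs = p - (p - 2) ** 2 * np.mean(1.0 / sq_norm)

print(f"direct risk estimate: {lhs:.3f}")
print(f"p - (p-2)^2 E(1/||X||^2): {rhs:.3f}")
```

The two estimates agree to Monte Carlo accuracy, and both fall strictly below p, the risk of θ̂0.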