Lecture Notes For Mathematical Statistics
Vladislav Kargin
Contents

1 Point Estimators
  1.1 Basic problem of statistical estimation
  1.2 An estimator, its bias and variance
  1.3 Consistency
  1.4 Some common unbiased estimators
    1.4.1 An estimator for the population mean µ
    1.4.2 An estimator for the population proportion p
    1.4.3 An estimator for the difference in population means µ1 − µ2
    1.4.4 An estimator for the difference in population proportions p1 − p2
    1.4.5 An estimator for the variance
  1.5 The existence of unbiased estimators
  1.6 The error of estimation and the 2-standard-error bound
2 Interval estimators
  2.1 Confidence intervals and pivotal quantities
  2.2 Asymptotic confidence intervals
  2.3 How to determine the sample size
  2.4 Small-sample confidence intervals
    2.4.1 Small sample CIs for µ and µ1 − µ2
    2.4.2 Small sample CIs for population variance σ2
  3.1 More about consistency of estimators
  3.2 Asymptotic normality
  3.3 Risk functions and comparison of point estimators
  3.4 Relative efficiency
  3.5 Sufficient statistics
  3.6 Rao-Blackwell Theorem and Minimum-Variance Unbiased Estimator
4 Methods of estimation
  4.1 Method of Moments Estimation
  4.2 Maximum Likelihood Estimation (MLE)
  4.3 Cramer-Rao Lower Bound and large sample properties of MLE
5 Hypothesis testing
  5.1 Basic definitions
  5.2 Calculating the Level and Power of a Test
    5.2.1 Basic examples
    5.2.2 Additional examples
  5.3 Determining the sample size
  5.4 Relation with confidence intervals
  5.5 p-values
  5.6 Small-sample hypothesis tests for population means
  5.7 Hypothesis testing for population variances
  5.8 Neyman-Pearson Lemma and Uniformly Most Powerful Tests
  5.9 Likelihood ratio test
    5.9.1 An Additional Example
  5.10 Quizzes
    6.2.2 Properties of LS estimator
    6.2.3 Confidence intervals and hypothesis tests for coefficients
    6.2.4 Statistical inference for the regression mean
    6.2.5 Prediction interval
    6.2.6 Correlation and R-squared
  6.3 Multiple linear regression
    6.3.1 Estimation
    6.3.2 Properties of least squares estimators
    6.3.3 Confidence interval for linear functions of parameters
    6.3.4 Prediction
  6.4 Goodness of fit and a test for a reduced model
Chapter 1
Point Estimators
1.1 Basic problem of statistical estimation

Suppose that we observe a sequence of random variables

X1, X2, . . . , Xn.
The main assumption of mathematical statistics is that this sequence has
a cumulative distribution function F(⃗x, θ), where θ is an unknown parameter
that can be any number (or a vector) in a region Θ. The main task is to
obtain some information about this parameter.
As an example, we can think of Xi as the number of Covid deaths on
day i, or the GPA of student i, and so on.
In this course, we assume for simplicity that the random variables Xi
are i.i.d. (independent and identically distributed), that is, every datapoint
has the same distribution as the others and they are independent of each
other. This is a very restrictive requirement: for Covid data, for example, it
is doubtful that Xi+1 is independent of Xi. However, this is the simplest
setting in which we can develop the statistical theory.
For example, we can look at the sample Xi, i = 1, . . . , n, where Xi is
the lifetime of a smartphone, and model Xi as an exponential random variable
with mean θ. Potentially, this θ can be any number in Θ = (0, ∞). Our
task is, for a specific realization of the random variables Xi, to derive a conclusion
about the parameter θ.
Our assumption means that the density of X1 is
\[ f_{X_1}(x_1) = \frac{1}{\theta} e^{-x_1/\theta}, \]
the density of X2 is
\[ f_{X_2}(x_2) = \frac{1}{\theta} e^{-x_2/\theta}, \]
and so on.
The joint density of independent datapoints is simply the product of the
individual densities for each datapoint. In our example,
\[ f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \frac{1}{\theta} e^{-x_1/\theta} \times \frac{1}{\theta} e^{-x_2/\theta} \times \cdots \times \frac{1}{\theta} e^{-x_n/\theta} = \frac{1}{\theta^n} e^{-(\sum_{i=1}^n x_i)/\theta}. \]
In statistics, if we think about this joint density as a function of the
model parameter θ, we call it the likelihood function and denote it by the letter
L. So, in our example, we have
\[ L(\theta \mid \vec{x}) = \frac{1}{\theta^n} e^{-(\sum_{i=1}^n x_i)/\theta}, \]
where we used the notation ⃗x to denote the vector of observed datapoints: ⃗x =
(x1, . . . , xn).
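As an illustration, here is a minimal R sketch (the data vector x is made up for illustration) that evaluates this likelihood on a grid of θ values:

# Hypothetical smartphone lifetimes (in years)
x <- c(1.2, 0.7, 3.1, 2.4, 0.9)
n <- length(x)
# Likelihood of an i.i.d. exponential sample with mean theta
lik <- function(theta) theta^(-n) * exp(-sum(x) / theta)
theta.grid <- seq(0.5, 5, by = 0.1)
plot(theta.grid, sapply(theta.grid, lik), type = "l",
     xlab = "theta", ylab = "likelihood")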
Now, we want to get some information about the parameter θ from the
vector (x1 , . . . , xn ). For example, we could look for a function of (x1 , . . . , xn )
which would be close to θ. This is called the point estimation problem
because we can try to find a point (an estimator) which would be close to
θ. We will discuss it in the next section.
1.2 An estimator, its bias and variance
One of the main goals in statistics is to guess the value of an unknown
parameter θ, given the realization of the data sample. Namely, we are given
the realization of random variables X1 , . . . , Xn , and we want to guess θ.
Mathematically, this means that we look for a function of the X1 , . . . , Xn ,
f (X1 , . . . , Xn ), which we call an estimator.
A function of the data sample is called a statistic, so an estimator is
a statistic. It can be any function whatsoever, but naturally we want it
to be a good guess for the true value of the parameter.
Note on notation: If θ is a parameter to be estimated, then θ̂ denotes
its estimator or a value of the estimator for a given sample. More carefully,
it is a function of the data: θ̂ = θ̂(X1, . . . , Xn).
Examples of estimators: θ̂ = X̄ := (X1 + . . . + Xn)/n or θ̂ = X(n) :=
max(X1, . . . , Xn). Even very unnatural functions such as sin(X1 × X2 × . . . ×
Xn) can be thought of as estimators. So how do we distinguish between good
and bad estimators?
What do we mean by saying that θ̂ is a good guess for θ?
Note that θ̂ = θ̂(X1, . . . , Xn) is random, since its value changes from sample
to sample. The distribution of this random variable θ̂ depends on the
true value of the parameter θ. One thing that we can ask from the
estimator is that its expected value equal the true value of the parameter.
This is called unbiasedness. The second useful property is that when
we increase the size of the sample, the estimator converges to the true value
of the parameter in the sense of convergence in probability. This is called
consistency. We will deal with these two concepts one by one.
Bias of an estimator:
Def: Bias(θ̂) = Eθ̂ − θ. (The bias of an estimator is its expected value minus
the true value of the parameter.)
Note that the bias can depend on the true value of the parameter. A good
estimator should have zero or at least small bias for all values of the true
parameter.
Definition 1.2.1. An estimator θ̂ = θ̂(X1, . . . , Xn) is called unbiased if
\[ E\hat\theta(X_1, \ldots, X_n) = \theta \]
for every θ ∈ Θ.
In other words, the estimator θ̂ is unbiased if its bias is zero for every
value of the true parameter θ ∈ Θ.
Example 1.2.2. Consider our previous example about the lifetime of smartphones.
What is the bias of the following two estimators: θ̂ = X̄ and
θ̂ = X1?
Why does X̄ appear to be better than X1 as an estimator?
The reason is that the variance of X̄ decreases as the sample size grows,
while the variance of X1 does not depend on the size of the sample.
Variance of an estimator
Def: Var(θ̂) = E(θ̂ − Eθ̂)² = Eθ̂² − (Eθ̂)².
We want Var(θ̂) to be small for all values of the true parameter θ.
Ideally, both the bias and the variance of the estimator should be small.
Sometimes we value unbiasedness more than anything else: we first make
sure that an estimator is unbiased, and only after this condition is satisfied
do we start to look for estimators with low variance among these unbiased
estimators.
However, sometimes we can tolerate that an estimator is a bit biased.
Moreover, in some cases it is very difficult or even impossible to find an
unbiased estimator. In this case, it is useful to define a combined measure
of the quality of an estimator.
Def: The Mean Squared Error of an estimator is defined as
\[ MSE(\hat\theta) = E\{(\hat\theta - \theta)^2\}. \]
Theorem:
\[ MSE(\hat\theta) = \mathrm{Var}(\hat\theta) + [\mathrm{Bias}(\hat\theta)]^2. \]
Proof. By using the linearity of expectation:
\[ E[(\hat\theta - \theta)^2] = E[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2] = E(\hat\theta - E\hat\theta)^2 + 2E[(\hat\theta - E\hat\theta)(E\hat\theta - \theta)] + (E\hat\theta - \theta)^2 = \mathrm{Var}(\hat\theta) + [\mathrm{Bias}(\hat\theta)]^2 + 2E[(\hat\theta - E\hat\theta)(E\hat\theta - \theta)]. \]
But in the last term we can take Eθ̂ − θ outside of the expectation sign,
since it is not random, and we find that this last term is zero:
\[ E[(\hat\theta - E\hat\theta)(E\hat\theta - \theta)] = (E\hat\theta - \theta)\, E[\hat\theta - E\hat\theta] = 0, \]
because E[θ̂ − Eθ̂] = Eθ̂ − Eθ̂ = 0.
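A quick simulation sketch checking this identity (the exponential model and the deliberately biased estimator X̄/2 are made up for illustration):

set.seed(1)
theta <- 2
est <- replicate(10000, mean(rexp(30, rate = 1 / theta)) / 2)
mse <- mean((est - theta)^2)
mse - (var(est) + (mean(est) - theta)^2)   # close to 0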
• What is Bias(θ̂), in terms of a, b and θ?
Comment: if you find a biased estimator θ̂, you can sometimes easily
correct the bias to get an unbiased estimator. However, suppose, e.g., that we
tried an estimator θ̂ and found that it has Eθ̂ = √θ. Then we cannot correct
the bias by simply taking the square of θ̂: the estimator θ̃ = θ̂² will not be
unbiased for θ! If we recall the formula for the second moment of a
random variable, then in this particular example we can even compute the
bias:
\[ E\hat\theta^2 = \mathrm{Var}(\hat\theta) + (E\hat\theta)^2 = \mathrm{Var}(\hat\theta) + \theta, \]
so the bias of the estimator θ̂² equals Var(θ̂). In general, it is often quite
difficult to find an unbiased estimator.
Let us look at a couple of examples.
Example 1.2.5. The reading on a voltage meter connected to a test circuit
is uniformly distributed over the interval (θ, θ + 1), where θ is the true but
unknown voltage of the circuit. Suppose that Y1 , Y2 , . . . , Yn denote a random
sample of such readings. We are going to try two estimators of θ, θb = Y
and θb = min{Y1 , . . . , Yn }. First, consider θb = Y .
• Find MSE(Ȳ).
\[ \mathrm{bias}(\bar Y) = E(\bar Y) - \theta = \frac{1}{n}\sum_{i=1}^n E(Y_i) - \theta = E(Y_1) - \theta = 1/2, \]
where we used the fact that the expectation of a r.v. uniformly distributed on [θ, θ + 1]
equals θ + 1/2.
Then MSE(Ȳ) = bias(Ȳ)² + Var(Ȳ), and since the Yi are independent,
\[ \mathrm{Var}(\bar Y) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(Y_i) = \frac{1}{n}\mathrm{Var}(Y_1) = \frac{1}{12n}. \]
Reminder about the distribution of the minimum. Recall the notation Y(1) = min{Y1, . . . , Yn}. Then for the CDF of
Y(1) we have:
\[ F_{Y_{(1)}}(y) = 1 - P(Y_1 > y, \ldots, Y_n > y) = 1 - (1 - F_{Y_1}(y))^n. \]
(The integral can be calculated by doing integration by parts or by using a
very useful formula for Beta integrals:
\[ \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}, \]
where Γ(x) is the Gamma function. For integer argument x, Γ(x) = (x − 1)!.)
Hence the bias of Y(1) is EY(1) − θ = EX(1) = 1/(n + 1). Note that in this
example the bias → 0 as the sample size increases. In addition, we can easily
correct the bias by using θ̂ = Y(1) − 1/(n + 1).
What is the MSE of θ̂ = Y(1) − 1/(n + 1)?
Since there is no bias, we only need to calculate Var(θ̂) = Var(Y(1)) =
Var(X(1)).
From the facts about the Beta distribution we have
\[ \mathrm{Var}(X_{(1)}) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{n}{(n+1)^2(n+2)} \sim \frac{1}{n^2} \]
for large n. (We could also calculate it directly from the density.)
This is the MSE of θ̂ = Y(1) − 1/(n + 1). For large n, it is much smaller than
the MSE of the estimator Ȳ − 1/2, which we calculated as 1/(12n).
Example 1.2.7. Calculate the distribution of the minimum for the sample
X1 , . . . , Xn from the exponential distribution with parameter θ. Use the
minimum to obtain an unbiased estimate of the parameter θ. What is the
variance of this estimator?
Solution. First, we calculate the CDF of each observation as
\[ F_{X_i}(x) = \int_0^x \frac{1}{\theta} e^{-t/\theta}\,dt = 1 - e^{-x/\theta}. \]
If we set θ̂ = nX(1), then the expectation of this estimator is θ, so it
gives an unbiased estimator of θ.
What is its variance?
\[ \mathrm{Var}(nX_{(1)}) = n^2\,\mathrm{Var}(X_{(1)}) = n^2 (\theta/n)^2 = \theta^2, \]
so the variance does not decrease as the sample size grows.
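A small simulation sketch (θ and the sample sizes are arbitrary) illustrating that nX(1) is unbiased but its variance stays near θ² for every n:

set.seed(42)
theta <- 2
for (n in c(10, 100, 1000)) {
  est <- replicate(5000, n * min(rexp(n, rate = 1 / theta)))
  cat("n =", n, " mean =", round(mean(est), 2),
      " var =", round(var(est), 2), "\n")   # variance stays near theta^2
}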
1.3 Consistency
Suppose again that we have a sample (X1, . . . , Xn) from a probability distribution
that depends on a parameter θ. Note that although we speak about
an estimator θ̂ = θ̂(X1, . . . , Xn), in fact the distribution of the estimator depends
on n, so it would be more correct to speak about a sequence of random
variables θ̂n.
Usually, we expect that when the size of the sample becomes larger, that
is, n grows, the distribution of the estimator θ̂n becomes concentrated more
and more around the true value of the parameter θ. This is the minimal
requirement that we can impose on a family of estimators that depends
on the sample size. If this requirement is not satisfied, as in Example 1.2.7
above, then the estimator is not very useful. Technically, this property of an
estimator is called consistency, and we give its mathematical definition
below.
Before that, let us look at some pictures. The plots show a simulation study.
A sample X1, X2, . . . from the distribution N(θ, 1/4) was generated with
θ = 10, and we computed θ̂k = (X1 + . . . + Xk)/k. Figure 1.1 shows a path of
θ̂k. It suggests that as we get more and more data, θ̂k converges to the true
value of θ. In fact, this is a consequence of the strong law of large numbers,
which says that this behavior is observed with probability 1.
[Figure 1.1 and Figure 1.2: simulated paths of θ̂k.]
What about several different samples? Figure 1.2 shows the situation
where the sample X1, X2, . . . was generated 10 times and 10 paths of θ̂k
were plotted. This picture suggests that when the sample size grows, the
distribution of θ̂k concentrates around the true value of the parameter θ. Mathematically,
this is a consequence of the weak law of large numbers.
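A sketch in R reproducing this kind of picture (seed and plotting details arbitrary; recall that N(θ, 1/4) has standard deviation 1/2):

set.seed(7)
theta <- 10; n <- 1000
# Ten running-mean paths, as in Figure 1.2
paths <- sapply(1:10, function(j) cumsum(rnorm(n, theta, sd = 1/2)) / (1:n))
matplot(paths, type = "l", lty = 1, xlab = "k", ylab = "running mean")
abline(h = theta, lwd = 2)   # true parameter value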
In order to define consistency, recall what it means for a sequence of
random variables to converge to another random variable: a sequence Xn
converges in probability to X if, for every ϵ > 0,
\[ P(|X_n - X| < \epsilon) \to 1, \]
that is,
\[ \lim_{n\to\infty} P(|X_n - X| < \epsilon) = 1. \]
This is denoted either as Xn →P X or as plim_{n→∞} Xn = X.
Definition 1.3.2 (Consistency). An estimator θ̂n is a consistent estimator
of θ if θ̂n converges in probability to θ:
\[ \hat\theta_n \xrightarrow{P} \theta, \]
that is, for every ε > 0,
\[ P(|\hat\theta_n - \theta| < \varepsilon) \to 1. \]
The consistency of the estimator means that as the sample size goes to
infinity, we become more and more sure that the distance between θ̂n
and θ is smaller than any positive ε!
Consistency describes a property of the estimator in the n → ∞ limit.
Unlike unbiasedness, it is NOT meant to describe a property of the estimator
for a fixed n.
An unbiased estimator can be inconsistent, as we can see in Example
1.2.7, and a biased estimator can be consistent (as Y(1) in Example 1.2.6)!
Consistency is more important than unbiasedness, because it ensures that if
we collect enough data we will eventually learn the true value of the parameter.
So, how can we tell if an estimator is consistent? One way is to see how
the MSE changes with n: if both the bias and the variance of θ̂n converge to
zero, then θ̂n is consistent. Sketch of the argument: since the bias converges
to zero, for all sufficiently large n (say n > n0) we have |Eθ̂n − θ| < ε/2;
therefore the event |θ̂n − θ| > ε can occur only if |θ̂n − Eθ̂n| > ε/2 occurred.
Hence, for n > n0, P(|θ̂n − θ| > ε) ≤ P(|θ̂n − Eθ̂n| > ε/2).
Now apply the Chebyshev inequality:
\[ P(|\hat\theta_n - E\hat\theta_n| > \varepsilon/2) \le \frac{\mathrm{Var}(\hat\theta_n)}{(\varepsilon/2)^2}. \]
By our assumption, the right-hand side can be made arbitrarily small for all
sufficiently large n because Var(θ̂n) → 0. We have shown that P(|θ̂n − θ| > ε) → 0
for any ε > 0.
Example 1.3.5 (Biased and consistent estimator of the mean). For the parameter
θ = µ, consider the modified sample mean θ̂n = (n/(n−1)) Ȳ.
• Bias(θ̂n) = Eθ̂n − θ = (n/(n−1))µ − µ = µ/(n−1) → 0 as n → ∞. (θ̂n is a
biased estimator of µ, whenever µ ≠ 0, for every n. It is, however, asymptotically
unbiased.)
• Var(θ̂n) = (n²/(n−1)²)·σ²/n → 0.
Hence the MSE converges to zero and θ̂n is consistent.
• We have shown in Examples 1.2.5 and 1.2.6 that these estimators are
both unbiased.
• Consistency:
1.4 Some common unbiased estimators

1.4.1 An estimator for the population mean µ
• Expectation and bias of the estimator:
E(Ȳ) = (1/n) Σᵢ EYi = EYi = µ, so bias(Ȳ) = 0.
• Variance: Var(Ȳ) = (1/n)Var(Yi) = σ²/n.
• MSE(Ȳ) = bias² + Var = σ²/n.
These observations show that the sample mean is an unbiased and consistent
estimator of the population mean µ.
Another notation for the variance and the standard error of an estimator
θ̂ is σ²_θ̂ and σ_θ̂, respectively.
So the variance of the sample mean Ȳ is σ²_Ȳ = σ²/n, and the standard
error is σ_Ȳ = σ/√n.
1.4.2 An estimator for the population proportion p

Suppose now that each observation is a Bernoulli random variable:
P(Yi = 1) = p, P(Yi = 0) = 1 − p.
This is a special case of the situation in the previous section, and we can
use the same estimator, the sample mean. In this case, the sample mean
has a special name, the sample proportion:
\[ \hat p = \frac{1}{n}\sum_{i=1}^n Y_i. \]
It is an unbiased and consistent estimator of the parameter p (i.e., of the population proportion).
1.4.3 An estimator for the difference in population means µ1 − µ2

Suppose we have two samples:
• {Y1⁽¹⁾, Y2⁽¹⁾, . . . , Y_{n1}⁽¹⁾} of size n1 from Population 1, with mean µ1 and variance σ1²;
• {Y1⁽²⁾, Y2⁽²⁾, . . . , Y_{n2}⁽²⁾} of size n2 from Population 2, with mean µ2 and variance σ2².
An unbiased point estimator for θ = µ1 − µ2 is the difference in sample means:
\[ \hat\theta = \bar Y_1 - \bar Y_2 = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_i^{(1)} - \frac{1}{n_2}\sum_{i=1}^{n_2} Y_i^{(2)}. \]
• Expectation: E(θ̂) = EȲ1 − EȲ2 = µ1 − µ2.
• Variance: σ²_θ̂ = Var(θ̂) = σ1²/n1 + σ2²/n2.
It follows that this estimator is unbiased (its expectation equals the estimated
parameter) and consistent (because its MSE converges to zero as n1
and n2 jointly grow).
1.4.4 An estimator for the difference in population proportions p1 − p2

An unbiased point estimator for the difference in population proportions θ = p1 − p2
(the parameter of interest) is the difference in sample proportions:
\[ \hat\theta = \hat p_1 - \hat p_2 = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_i^{(1)} - \frac{1}{n_2}\sum_{i=1}^{n_2} Y_i^{(2)}, \]
with variance
\[ \sigma_{\hat\theta}^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}, \]
where q1 = 1 − p1 and q2 = 1 − p2.
Obviously, this estimator is also unbiased and consistent.
1.4.5 An estimator for the variance

The sample variance S² = (1/(n−1)) Σᵢ (Yi − Ȳ)² is an unbiased estimator of the population variance σ².
Proof.
\[ E\sum_{i=1}^n (Y_i - \bar Y)^2 = E\Big[\sum_{i=1}^n Y_i^2 - 2\sum_{i=1}^n Y_i\bar Y + n\bar Y^2\Big] = E\Big[\sum_{i=1}^n Y_i^2 - n\bar Y^2\Big] = \sum_{i=1}^n EY_i^2 - nE\bar Y^2. \]
For the second term:
\[ nE\bar Y^2 = n[\mathrm{Var}\,\bar Y + (E\bar Y)^2] = n(\sigma^2/n + \mu^2) = \sigma^2 + n\mu^2. \]
Therefore,
\[ E\sum_{i=1}^n (Y_i - \bar Y)^2 = (n-1)\sigma^2. \]
Hence
\[ ES^2 = \frac{E\sum_{i=1}^n (Y_i - \bar Y)^2}{n-1} = \sigma^2. \]
Note also the identity
\[ \frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y)^2 = \frac{1}{n}\sum_{i=1}^n Y_i^2 - \bar Y^2. \]
1.5 The existence of unbiased estimators
In this example, each observation is taken from the Bernoulli distribution
with parameter p. That is, Xi = 1 with probability p and Xi = 0 with
probability q = 1 − p. Of course, we have seen that there is an unbiased
estimator for p, namely p̂ = X̄. The twist of this example is that we try to
estimate θ = − ln p ∈ Θ = (0, ∞). Suppose, seeking a contradiction, that
θ̂ = θ̂(X1, . . . , Xn) is an unbiased estimator of θ, and therefore Eθ̂ = θ =
− ln p. We write the expectation using the basic definition:
\[ E\hat\theta(X_1,\ldots,X_n) = \sum_{x_1=0}^1 \cdots \sum_{x_n=0}^1 \hat\theta(x_1,\ldots,x_n)\, P(X_1 = x_1, \ldots, X_n = x_n). \]
For a Bernoulli r.v., we can write P(Xi = xi) = p^{xi}(1 − p)^{1−xi}, where xi can
take only two values, 0 and 1. So, by independence of the random variables
X1, . . . , Xn, we have:
\[ P(X_1 = x_1, \ldots, X_n = x_n) = p^{\sum_{i=1}^n x_i} (1-p)^{n - \sum_{i=1}^n x_i}. \]
Unbiasedness then requires
\[ -\ln p = \sum_{x_1=0}^1 \cdots \sum_{x_n=0}^1 \hat\theta(x_1,\ldots,x_n)\, p^{\sum_{i=1}^n x_i} (1-p)^{n - \sum_{i=1}^n x_i}, \]
and this should be true for every p ∈ (0, 1), because the estimator is assumed
to be unbiased for every − ln p ∈ (0, ∞). However, this means that the
logarithmic function of p equals a polynomial in p. This is impossible:
for example, the limit of the left-hand side as p → 0 is ∞, while the limit of
the right-hand side is finite.
We got a contradiction, so there is no unbiased estimator
of θ = − ln p.
1.6 The error of estimation and the 2-standard-error bound
The error of estimation is a random quantity that changes from sample
to sample. We are often interested in a good bound on this quantity that
holds with large probability.
Recall that the standard error of the estimator, σ_θ̂, is another name for
the standard deviation of the estimator θ̂. That is, σ_θ̂ = √Var(θ̂).
By the Chebyshev inequality:
\[ P\{|\hat\theta - \theta| > k\sigma_{\hat\theta}\} \le \frac{1}{k^2}. \]
• For b = 2σ_θ̂ (that is, k = 2), the RHS of the Chebyshev inequality is 25%. [This
is a bound; the true probability that |ε| ≥ b is smaller, often as small
as 5%.]
Since the estimator for the mean, Ȳ, is such a sum (only divided by n),
it becomes approximately normal when n is large. So if the sample size is large,
then the estimation error |Ȳ − µ| is less than the 2-standard-error bound 2σ_Ȳ with
probability about 95% (instead of the 75% guaranteed by Chebyshev).
This observation also holds for the other standard estimators that we
considered in the previous section.
Example 1.6.2 (Titanic survivors). In a random sample of 136 Titanic first
class passengers that survived the Titanic ship accident, 91 were women. In
a random sample of 119 third class survivors, 72 were women. Assume that
these are small samples from two large populations of “survivors”: first-class
survivors and third-class survivors.
What is an unbiased estimate for the difference in proportions of females
in these populations? What is the two-standard error bound?
Solution. p̂1 = 91/136 = 66.9%; p̂3 = 72/119 = 60.5%; p̂1 − p̂3 = 6.4%.
The two-standard-error bound is
\[ 2\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_3(1-\hat p_3)}{n_3}} = 12.1\%. \]
Does the data suggest that the total first class and third class survivor
populations had approximately the same proportions of females?
Does the data suggest that women from the first and third classes had
approximately the same chances to survive?
Solution: Not really. These data say that the proportions of women among
the first- and third-class survivors are approximately the same, meaning
that the difference between the proportions is within the two-standard-error bound.
However, this does not really say anything about the chances of survival.
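These numbers can be reproduced in R (a quick sketch):

p1 <- 91 / 136; p3 <- 72 / 119    # women among 1st and 3rd class survivors
p1 - p3                            # about 0.064
2 * sqrt(p1 * (1 - p1) / 136 + p3 * (1 - p3) / 119)   # about 0.121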
Example 1.6.3 (Titanic survivors II). In a random sample of 95 female pas-
sengers in the first class, 91 survived the Titanic ship accident. In a random
sample of 145 women in the third class, 72 survived.
What is an unbiased estimate for the difference in proportions of sur-
vivors in the populations of the first and the third class female passengers?
What is the two-standard error bound?
Solution. p̂1 = 91/95 = 95.8%; p̂3 = 72/145 = 49.7%; p̂1 − p̂3 = 46.1%.
The two-standard-error bound is
\[ 2\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_3(1-\hat p_3)}{n_3}} = 9.3\%. \]
Does the data suggest that female passengers from the third class had
lower chances to survive than female passengers from the first class?
Solution: Yes. The difference is far larger than the two-standard-error
bound, suggesting that it is highly unlikely that this happened by chance.
Some additional interesting info about this example: The chances of a
man in the first class to survive: 36.9%. The chances of a man in the third
class: 13.5%.
Example 1.6.4 (Elementary school IQ). The article “A Longitudinal Study
of the Development of Elementary School Children’s Private Speech” by
Bivens and Berk (Merrill-Palmer Q., 1990: 443–463) reported on a study of
children talking to themselves (private speech).
[The study was motivated by theories in psychology that claim that pri-
vate speech plays an important role in a child mental development, so one
can investigate how private speech is related to IQ (that is, performance on
math or verbal tasks) or to changes in IQ. The study found some support-
ing evidence that task-related private speech is positively related to future
changes in IQ. Here we are only interested in IQ data.]
The study included 33 students whose first-grade IQ scores are given
here:
082 096 099 102 103 103 106 107 108 108 108 108 109 110 110 111 113
113 113 113 115 115 118 118 119 121 122 122 127 132 136 140 146
(a) Suppose we want an estimate of the average value of IQ for the
first graders served by this school. What is an unbiased estimate for this
parameter?
[Hint: Sum is 3753.]
Solution. µb = X = 3753/33 = 113.7273
(b) Calculate and interpret a point estimate of the population standard
deviation σ. [Hint: Sum of squared observations is 432, 015]
Solution.
\[ S^2 = \frac{1}{32}\big(432{,}015 - 33 \times (113.7273)^2\big) = 162.3856, \]
\[ \hat\sigma = S = \sqrt{162.3856} = 12.7431, \]
\[ 2S/\sqrt{n} = 2 \times 12.7431/\sqrt{33} = 4.4366. \]
Since the estimate of µ is 113.7273 and the two-standard-error bound for the error
of estimation is 4.4366, the data suggest that this is an above-average class,
because the nationwide IQ average is around 100.
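A sketch of these computations in R, using the summary statistics quoted above:

n <- 33; sum.x <- 3753; sum.x2 <- 432015
xbar <- sum.x / n                          # 113.7273
s2 <- (sum.x2 - n * xbar^2) / (n - 1)      # 162.3856
c(mean = xbar, sd = sqrt(s2), bound = 2 * sqrt(s2 / n))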
(d) Calculate a point estimate of the proportion of all such students
whose IQ exceeds 100. [Hint: Think of an observation as a “success” if it
exceeds 100.]
Solution.
The number of students with IQ above 100 is 30. So the point estimate
is pb = 30/33 = 90.91%.
Example 1.6.5 (Elementary school IQ II). The data set mentioned in the
previous example also includes these third grade verbal IQ observations for
males:
117 103 121 112 120 132 113 117 132 149 125 131 136 107 108 113 136
114
(18 observations) and females:
114 102 113 131 124 117 120 90 114 109 102 114 127 127 103
(15 observations)
Let the male values be denoted X1 , . . . , Xm and the female values Y1 , . . . , Yn .
(a) Calculate the point estimate for the difference between male and
female verbal IQ.
Solution.
\[ \bar X - \bar Y = \frac{2186}{18} - \frac{1707}{15} = 121.4444 - 113.8 = 7.6444. \]
(b) What is the standard error of the estimator?
Solution. First we calculate the sample variances Sx2 and Sy2 for these
two samples.
\[ S_x^2 = \frac{1}{17}\big(268{,}046 - 18 \times 121.4444^2\big) = 151.0964, \]
\[ S_y^2 = \frac{1}{14}\big(196{,}039 - 15 \times 113.8^2\big) = 127.3143. \]
Then we calculate the estimate of the standard error:
\[ \hat\sigma_{\hat\theta} = \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}} = \sqrt{\frac{151.0964}{18} + \frac{127.3143}{15}} = 4.1088. \]
So we see that the estimate of the difference is 7.6444. However, the
two-standard-error bound is 8.2176, so the data do not give evidence that
the difference is positive.
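The same numbers can be obtained from the raw data in R (a sketch; the vectors are typed from the observations above):

x <- c(117,103,121,112,120,132,113,117,132,149,125,131,136,107,108,113,136,114)
y <- c(114,102,113,131,124,117,120,90,114,109,102,114,127,127,103)
mean(x) - mean(y)                                   # about 7.64
2 * sqrt(var(x) / length(x) + var(y) / length(y))   # about 8.22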
Chapter 2
Interval estimators
Figure 2.1: Interval estimator for 100 different samples. The confidence interval
(θ̂ − 1.96σ_θ̂, θ̂ + 1.96σ_θ̂), centered at the point estimators θ̂ shown by red circles,
in 95% of cases covers the true parameter (shown by the purple line).
sample data and the parameter θ, but whose distribution does not depend
on the parameter θ!
Let X denote the data sample. (It is a vector of observations.)
1. Find a function T(X, θ) (the pivot) whose distribution does not depend on θ
and is known.
2. Use the distribution of T to find a pair L and U such that
\[ \Pr(L \le T \le U) = 1 - \alpha. \]
3. Convert the inequality L ≤ T(X, θ) ≤ U into an equivalent inequality of the form
\[ L^*(X) \le \theta \le U^*(X), \]
which gives the confidence interval.
For example, for a sample X1, . . . , Xn from the normal distribution with known
variance σ², the quantity
\[ T = \frac{\bar X - \mu}{\sigma/\sqrt n} \]
is pivotal: T ∼ N(0, 1) regardless of µ. Taking L = −z_{α/2} and U = z_{α/2}, we get
\[ -z_{\alpha/2} \le \frac{\bar X - \mu}{\sigma/\sqrt n} \le z_{\alpha/2} \]
with probability 1 − α.
This inequality can be converted to the desired confidence interval for the
parameter µ:
\[ \bar X - z_{\alpha/2}\frac{\sigma}{\sqrt n} \le \mu \le \bar X + z_{\alpha/2}\frac{\sigma}{\sqrt n}. \]
Alternatively, we can look for a “one-sided interval”. So, take L = −∞
and look for U such that P{Z > U} = α. By definition this U is denoted z_α,
and it can be found from a table or by using software.
Then the inequality is
\[ Z = \frac{\bar X - \mu}{\sigma/\sqrt n} \le z_\alpha, \]
and it can be transformed into the lower confidence bound on the parameter
µ:
\[ \bar X - z_\alpha \frac{\sigma}{\sqrt n} \le \mu. \]
Note the difference from the previous inequality: here we use the factor z_α
before the standard error σ/√n, while in the previous inequality we used z_{α/2}.
Similarly, by using U = ∞ and looking for L such that P{Z < L} = α,
we can derive the upper confidence bound on µ:
\[ \mu < \bar X + z_\alpha \frac{\sigma}{\sqrt n}. \]
Then, we get
\[ \chi^2_{1-\alpha/2}(n) \le \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 \le \chi^2_{\alpha/2}(n), \]
which can be converted to
\[ \frac{1}{\chi^2_{\alpha/2}(n)}\sum_{i=1}^n (X_i - \mu)^2 \le \sigma^2 \le \frac{1}{\chi^2_{1-\alpha/2}(n)}\sum_{i=1}^n (X_i - \mu)^2. \]
Similarly, the one-sided upper bound is
\[ \sigma^2 \le \frac{1}{\chi^2_{1-\alpha}(n)}\sum_{i=1}^n (X_i - \mu)^2. \]
This interval has confidence level 1 − α. It can be shown that the size of the
interval decreases when n grows.
Here is a somewhat less standard example, which shows that the pivot method
sometimes gives poor confidence intervals.
Example 2.1.5. Suppose we have a sample X1, . . . , Xn of random variables distributed
according to the exponential distribution with mean θ, and we want
to build a confidence interval for θ with α = 10%.
The quantity T = nX(1)/θ is pivotal. Indeed, let us write Y to denote
X(1). This is the minimum of n i.i.d. exponential random variables, and it is
easy to check that the density of Y is
\[ f_Y(y) = \frac{n}{\theta} e^{-ny/\theta}. \]
That is, Y is exponential with mean θ/n. Similarly to the previous example,
it is easy to calculate the density of T = nY/θ = Y/(θ/n) and check
that it is exponential with mean 1.
Now we look for L and U such that
\[ 0.90 = \Pr(L \le T \le U) = \int_L^U e^{-t}\,dt = e^{-L} - e^{-U}. \]
There are infinitely many combinations of L and U which satisfy this. One
possibility is to cut off probability 5% in each tail: e^{−L} = 0.95 and e^{−U} = 0.05,
which gives L = 0.051 and U = 2.996.
We manipulate these two inequalities to put θ in the middle:
\[ \frac{nX_{(1)}}{2.996} \le \theta \le \frac{nX_{(1)}}{0.051}. \]
Remark: The resulting confidence interval is not very good. Indeed, the
length of the interval is nX(1)(1/0.051 − 1/2.996), and nX(1) is always an exponential
random variable with mean θ, so we cannot expect the
length of this confidence interval to go to 0 as n grows.
2.2 Asymptotic confidence intervals

Often, the Central Limit Theorem (CLT) ensures that when the sample
size n is large enough, an appropriate estimator is approximately a normal
random variable. For example:
• for θ = µ, the estimator θ̂ = Ȳ is approximately ∼ N(µ, σ²/n);
• for θ = p1 − p2, the estimator θ̂ = p̂1 − p̂2 is approximately
∼ N(p1 − p2, p1(1−p1)/n1 + p2(1−p2)/n2).
In general, many estimators are asymptotically normal. We will discuss
asymptotic normality a bit later. Intuitively, this means that the distribution
of an estimator θ̂ is close to the normal distribution N(θ, σ²_θ̂), where
σ²_θ̂ = Var(θ̂).
Then we can write an approximate pivotal quantity:
\[ Z = \frac{\hat\theta - \theta}{\hat\sigma_{\hat\theta}}, \]
where σ̂_θ̂ is a consistent estimator of σ_θ̂. By a theorem which is called the
Slutsky theorem, the distribution of Z is close to the standard normal distribution.
Then we can proceed as usual and develop a confidence interval from
the pivotal quantity. Since Z is only an approximately pivotal quantity, the
resulting confidence interval will be only an asymptotic confidence interval;
that is, the probability that the interval covers θ equals 1 − α only if n is
large.
Example 2.2.2 (Two-sided asymptotic confidence interval for a parameter θ).
By using our results for normally distributed variables, we
write the (approximate) two-sided interval for θ, based on the point estimator
θ̂, which is assumed to be approximately distributed as N(θ, σ²_θ̂).
The two-sided confidence interval for θ with confidence coefficient 1 − α
is
\[ [\hat\theta - z_{\alpha/2}\sigma_{\hat\theta},\ \hat\theta + z_{\alpha/2}\sigma_{\hat\theta}]. \]
Example 2.2.3 (Upper and lower confidence bounds). The one-sided large-sample
confidence intervals are as follows:
• The upper-bound confidence interval with confidence coefficient 1 − α
is (−∞, θ̂ + z_α σ_θ̂].
• The lower-bound confidence interval with confidence coefficient 1 − α
is [θ̂ − z_α σ_θ̂, ∞).
Example 2.2.4. The shopping times of n = 64 randomly selected customers
at a local supermarket were recorded. The average and variance of the 64
shopping times were 33 minutes and 256 minutes, respectively. Estimate µ,
the true average shopping time per customer, with a confidence coefficient
of 1 − α = .90.
Solution. We have µ̂ = 33 and σ̂_θ̂ = √(256/64) = 2. We can get
z_{α/2} = z_{0.05} = qnorm(0.05, lower.tail = F) = 1.644854, so the CI is
\[ \hat\theta \pm z_{\alpha/2}\hat\sigma_{\hat\theta} = 33 \pm 1.645 \times 2 = 33 \pm 3.29. \]
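In R (a sketch):

xbar <- 33; se <- sqrt(256 / 64)
xbar + c(-1, 1) * qnorm(0.95) * se   # 90% CI, about (29.71, 36.29)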
• Note that
\[ \sigma_{\hat\theta} = \sqrt{\mathrm{Var}\,\hat\theta} = \sqrt{p_1(1-p_1)/n_A + p_2(1-p_2)/n_B}, \]
where p1 and p2 are unknown but can be approximated by p̂1 and p̂2.
So,
\[ \hat\sigma_{\hat\theta} = \sqrt{0.24(1-0.24)/50 + 0.2(1-0.2)/60} = 0.0795. \]
We used here the R function qnorm to find the value of z_{0.01}. The lower.tail
option is set to F (FALSE) because we want the z_{0.01} such that the
upper tail of the standard normal distribution is 0.01, that is, we want
P(Z > z_{0.01}) = 0.01. Alternatively, we could get the value of z_{0.01} from a
table.
For the exam, you are supposed to know how to calculate this confidence
interval. However, note that these calculations are already implemented in
R, although R uses the language of hypothesis testing here, which we will
learn later.
In particular, for this example, the confidence interval can be calculated
as follows.
prop.test(c(12, 12), c(50, 60), conf.level = 0.98, correct = F)

The first argument is the vector of the numbers of successes (or failures, in
our example), and the second argument is the vector of the sample sizes.

2-sample test for equality of proportions without continuity correction

data:  c(12, 12) out of c(50, 60)
X-squared = 0.25581, df = 1, p-value = 0.613
alternative hypothesis: two.sided
98 percent confidence interval:
 -0.1448629  0.2248629
sample estimates:
prop 1 prop 2
  0.24   0.20
Example 2.2.6. A study was done on 41 first-year medical students to see
if their anxiety levels changed during the first semester. One measure used
was the level of serum cortisol, which is associated with stress. For each
of the 41 students the level was compared during finals at the end of the
semester against the level in the first week of classes. The average difference
was 2.08 with a standard deviation of 7.88. Find a 95% lower confidence
bound for the population mean difference µ. Does the bound suggest that
the mean population stress change is necessarily positive?
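A sketch of the computation in R (treating n = 41 as large enough for the normal approximation):

2.08 - qnorm(0.95) * 7.88 / sqrt(41)   # 95% lower bound, about 0.056; barely above zero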
Example 2.2.7. A random sample of 539 households from a mid-western city
was selected, and it was determined that 133 of these households owned
at least one firearm (“The Social Determinants of Gun Ownership: Self-
Protection in an Urban Environment,” Criminology, 1997: 629–640). Using
a 95% confidence level, calculate a lower confidence bound for the proportion
of all households in this city that own at least one firearm.
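A sketch of the computation in R:

p <- 133 / 539
p - qnorm(0.95) * sqrt(p * (1 - p) / 539)   # 95% lower bound, about 0.216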
2.3 How to determine the sample size

• On the one hand, the more data you have, the more accurate your estimator θ̂ of θ is.
• On the other hand, collecting data is costly, so we want the smallest sample size that achieves the desired accuracy.
Rather than giving a bunch of formulas, we illustrate the method with examples.
Example 2.3.1. The reaction of an individual to a stimulus in a psycho-
logical experiment may take one of two forms, A or B. If an experimenter
wishes to estimate the probability p that a person will react in manner A,
how many people must be included in the experiment? Assume that the
experimenter will be satisfied if the error of estimation is less than .04 with
probability equal to .90. Assume also that he expects p to lie somewhere in
the neighborhood of .6.
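A sketch of the standard calculation for this example: with confidence .90 the relevant factor is z_{0.05} ≈ 1.645, and we need z_{0.05}·√(pq/n) ≤ .04 with p ≈ .6. In R:

qnorm(0.95)^2 * 0.6 * 0.4 / 0.04^2   # about 406 people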
Example 2.3.2. Telephone pollsters often interview between 1000 and 1500
individuals regarding their opinions on various issues. A survey question
asks if a person believes that the performance of their athletics teams has
a positive impact on the perceived prestige of the institutions. The goal
of the survey is to see if there is a difference between the opinions of men
and women on this issue. Suppose that you design the survey and wish to
estimate the difference in a pair of proportions, correct to within .02, with
probability .9. How many interviewees should be included in each sample?
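A sketch of the analogous calculation, assuming the worst case p1 = p2 = .5 in each group and equal sample sizes:

qnorm(0.95)^2 * (0.5 * 0.5 + 0.5 * 0.5) / 0.02^2   # about 3382 per sample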
1. What is the standard error of θ̂?
\[ \sigma_{\hat\theta} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
• The client is not sophisticated and does not formulate explicitly what
level of confidence is required. In this case, it is typical to set
the confidence level at 95% and use the 2-standard-error bound. If we
want the error of estimation to be less than 2, then
\[ 2 > 2\sigma_{\hat\theta} = 2\frac{\sigma}{\sqrt n} = 2\frac{10}{\sqrt n} \quad\Rightarrow\quad n > 100. \]
2.4 Small-sample confidence intervals

2.4.1 Small sample CIs for µ and µ1 − µ2
In addition, the parameter σ² can be consistently estimated by the sample
variance. Thus, the quantity
\[ Z = \frac{\bar X - \mu}{S/\sqrt n} \]
is still approximately pivotal and approximately standard normal when n is large.
Now suppose that the sample size n is small, say less than 30.
Then the quantity Z may have a distribution which is very different from
the standard normal distribution. Using the normal distribution in the
case of small samples leads to erroneous intervals!
What can we do? In general, there is no universal answer. The answer
depends on the distribution of the data points Xi .
If the data happen to be normally distributed and σ² is known, then
\[ Z = \frac{\bar X - \mu}{\sigma/\sqrt n} \tag{2.1} \]
is pivotal and normal. However, in most cases this fact cannot be used to
construct a confidence interval for µ, since σ² is not known.
We can try to use the sample variance S² instead of σ²; however, then
\[ T = \frac{\bar X - \mu}{S/\sqrt n} \tag{2.2} \]
is pivotal but its distribution is not normal. The reason is that S² is a random
quantity, and dividing the normal random variable X̄ − µ by a random
quantity instead of a deterministic coefficient breaks the normality of the
quotient.
What is the distribution of T ? Let us recall first some facts from the
probability theory.
Definition 2.4.1. If X1, . . . , Xn are i.i.d. and distributed as standard normal
r.v.s (∼ N(0, 1)), then the random variable Y ≡ X1² + · · · + Xn² has the χ²
distribution with n degrees of freedom.
Definition 2.4.2. Suppose the random variables Z ∼ N(0, 1) and Y ∼ χ²(n)
are independent. Then the random variable
\[ T \equiv \frac{Z}{\sqrt{Y/n}} \tag{2.3} \]
has the t distribution with n degrees of freedom.
Note that this random variable has the χ² distribution with one less degree
of freedom than if we added together n independent standard normal
random variables. The reason for this reduction is that the random variables
Xi − X̄ are not independent. Intuitively, they can be expressed in terms of
n − 1 independent normal random variables, hence the reduction in the
degrees of freedom. The most surprising part of this theorem is the independence
of X̄ and S².
By using this theorem, we can show that
\[ T = \frac{\bar X - \mu}{S/\sqrt n} \sim t(n-1), \]
that is, T has a t-distribution with df = n − 1 degrees of freedom.
Based on T, the two-sided confidence interval for µ with confidence coefficient 1 − α is
\[ \bar X \pm t_{\alpha/2}\frac{S}{\sqrt n}, \]
where t_α can be found using statistical software or the t-table (Table 5; look for
subscript α with df = n − 1).
Note that the statistic in both the large-sample and small-sample cases is
the same:
\[ T \equiv \frac{\bar X - \mu}{S/\sqrt n}. \]
The only difference is that in the case of a large sample it is distributed
as a normal random variable, while in the case of a small sample it is distributed
as a t random variable.
Remark 1: the small sample confidence intervals based on t distribu-
tion are longer than asymptotic confidence intervals based on the standard
normal distribution.
Remark 2: the small sample confidence intervals based on t distribution
are valid only if the data are normally distributed.
[Figure: n = 10. Confidence intervals using z (blue) and t (red) values.]
> x = c(78,66,65,63,60,60,58,56,52,50)
> x
 [1] 78 66 65 63 60 60 58 56 52 50
> mean(x)   # sample mean
[1] 60.8
> sum((x - 60.8)^2) / (10 - 1)   # sample variance
[1] 63.51111
> sqrt(sum((x - 60.8)^2) / (10 - 1))   # sample standard deviation
[1] 7.969386
> qt(0.975, 9)   # 0.025 percentage point for t distribution with df = 9
[1] 2.262157
> qt(0.975, 9) * sqrt(sum((x - 60.8)^2) / (10 - 1)) / sqrt(10)
[1] 5.700955

Answer: 60.8 ± 5.700955. Note that we have divided by √9 when we estimated
S and then again by √10 when we estimated σ_X̄. Do not forget the
second division.
Alternatively, one can use the sd function to calculate the sample
standard deviation, or the t.test function to produce the whole interval:

t.test(x)

data:  x
t = 24.126, df = 9, p-value = 1.727e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 55.09904 66.50096
sample estimates:
mean of x
     60.8
Exercise 2.4.5. The reaction time (RT) to a stimulus is the interval of time
commencing with stimulus presentation and ending with the first discernible
movement of a certain type. The article “Relationship of Reaction Time and
Movement Time in a Gross Motor Skill” (Percept. Motor Skills, 1973: 453–
454) reports that the sample average RT for 16 experienced swimmers to a
pistol start was .214 s and the sample standard deviation was .036 s.
Making any necessary assumptions, derive a 90% CI for true average RT
for all experienced swimmers.
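Assuming the RTs are approximately normally distributed, a t-based 90% CI in R (a sketch):

0.214 + c(-1, 1) * qt(0.95, df = 15) * 0.036 / sqrt(16)   # about (0.198, 0.230)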
Two-sample t-test

We have samples X1, . . . , X_{n1} and Y1, . . . , Y_{n2}, with observations distributed
according to N(µ1, σ1²) and N(µ2, σ2²), respectively. We assume
that n1 and n2 are small and want to find a CI for µ1 − µ2.
Here we consider only the simplest case, where it is assumed that
σ1 = σ2 = σ. Then we can define the pooled-sample estimator of the
common variance σ²:
\[ S_p^2 \equiv \frac{\sum_{i=1}^{n_1}(X_i - \bar X)^2 + \sum_{i=1}^{n_2}(Y_i - \bar Y)^2}{n_1+n_2-2} = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}. \]
In this case,
\[ T = \frac{\bar X - \bar Y - (\mu_1 - \mu_2)}{\hat\sigma_{\bar X - \bar Y}} = \frac{\bar X - \bar Y - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim t(n_1+n_2-2), \]
and the two-sided confidence interval is
\[ \Big(\bar X - \bar Y - t_{\alpha/2}^{(n_1+n_2-2)} S_p\sqrt{\tfrac{1}{n_1}+\tfrac{1}{n_2}},\ \ \bar X - \bar Y + t_{\alpha/2}^{(n_1+n_2-2)} S_p\sqrt{\tfrac{1}{n_1}+\tfrac{1}{n_2}}\Big). \]
Similarly, the lower-bound confidence interval for µ1 − µ2 is
\[ \Big(\bar X - \bar Y - t_\alpha S_p\sqrt{\tfrac{1}{n_1}+\tfrac{1}{n_2}},\ \infty\Big), \]
and the upper-bound confidence interval for µ1 − µ2 is
\[ \Big(-\infty,\ \bar X - \bar Y + t_\alpha S_p\sqrt{\tfrac{1}{n_1}+\tfrac{1}{n_2}}\Big). \]
In the more general case of unequal variances, the formulas are more complicated,
and one has to rely on software.
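In R, the pooled-variance interval is implemented in t.test; a sketch with made-up vectors x and y:

x <- rnorm(9, mean = 30, sd = 5)   # hypothetical assembly times, method 1
y <- rnorm(9, mean = 35, sd = 5)   # hypothetical assembly times, method 2
t.test(x, y, var.equal = TRUE)$conf.int   # pooled 95% CI for mu1 - mu2
# Without var.equal = TRUE, t.test uses the Welch correction for unequal variances.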
Example 2.4.6. To reach maximum efficiency in performing an assembly op-
eration in a manufacturing plant, new employees require approximately a
1-month training period. A new method of training was suggested, and a
test was conducted to compare the new method with the standard proce-
dure. Two groups of nine new employees each were trained for a period of
3 weeks, one group using the new method and the other following the stan-
dard training procedure. The length of time (in minutes) required for each
employee to assemble the device was recorded at the end of the 3-week pe-
riod. The resulting measurements are as shown in Table 8.3 (see the book).
Estimate the true mean difference (µ1 −µ2 ) with confidence coefficient .95.
Assume that the assembly times are approximately normally distributed,
that the variances of the assembly times are approximately equal for the
two methods, and that the samples are independent.
2.4.2 Small sample CIs for population variance σ²

Suppose now that we want a confidence interval for σ². The natural point
estimator is the sample variance
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2. \]
If the sample is large, S² is approximately normal around the population
variance. Then the standard method for large-sample confidence intervals
works. In practice, however, we are usually interested in confidence intervals
for σ² when we have a small data sample.
So assume that the data sample is small. In this case we must restrict
ourselves to the situation where the data are normally distributed.
Assume that all sample data points X1, X2, . . . , Xn ∼ N(µ, σ²).
The pivotal quantity (see Theorem 2.4.3) is
\[ T = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{(n-1)}. \]
We need to find L and U so that
\[ P\Big(L \le \frac{(n-1)S^2}{\sigma^2} \le U\Big) = 1 - \alpha. \]
A usual choice is L = χ²_{1−α/2} and U = χ²_{α/2}, both corresponding to (n − 1)
d.f.
Hence,
\[ P\Big(\chi^2_{1-\alpha/2}(n-1) \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2}(n-1)\Big) = 1 - \alpha \]
\[ \Leftrightarrow\quad P\Big(\frac{(n-1)S^2}{\chi^2_{\alpha/2}(n-1)} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}(n-1)}\Big) = 1 - \alpha. \]
Similarly, (n−1)S²/χ²_{1−α; n−1} is a (1 − α) upper confidence bound for σ².
What if we want to build a confidence interval for the standard deviation
σ = √σ², instead of σ²? This is simple: taking square roots,
\[ P\Bigg(\sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2}}} \le \sigma \le \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}}\Bigg) = 1 - \alpha. \]
Example 2.4.7. Suppose that you wished to describe the variability of the
carapace lengths of this population of lobsters. Find a 90% confidence in-
terval for the population variance σ 2 .
> x = c(78,66,65,63,60,60,58,56,52,50)
> x
[1] 78 66 65 63 60 60 58 56 52 50
> sum(( x-mean(x) )^2) # the numerator of sample variance and the CI LB and UB
[1] 571.6
> # and then we calculate the denominators of the CI LB and UB
> qchisq(0.05,9)
[1] 3.325113
> qchisq(0.95,9)
[1] 16.91898
The answer is (571.6/16.91898, 571.6/3.325113). Note that χ²_{0.95;9} = 3.325113 =
qchisq(0.05, 9) and χ²_{0.05;9} = 16.91898 = qchisq(0.95, 9).
Both can also be obtained from Table 6.
Example 2.4.8. An optical firm purchases glass to be ground into lenses. As
it is important that the various pieces of glass have nearly the same index
of refraction, the firm is interested in controlling the variability. A simple
random sample of size n = 20 measurements yields S 2 = (1.2)10−4 . From
previous experience, it is known that the normal distribution is a reasonable
model for the population of these measurements. Find a 95% CI for σ.
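A sketch of the computation in R: the CI for σ² is ((n−1)S²/χ²_{α/2}, (n−1)S²/χ²_{1−α/2}), and taking square roots gives the CI for σ.

n <- 20; s2 <- 1.2e-4
ci.var <- (n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)
sqrt(ci.var)   # 95% CI for sigma, roughly (0.0083, 0.016)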
Chapter 3
3.1 More about consistency of estimators
Theorem 3.1.1. Suppose that θ̂n →p θ and θ̂n′ →p θ′. Then:
1. θ̂n + θ̂n′ →p θ + θ′;
2. θ̂n × θ̂n′ →p θ × θ′;
3. if g is a continuous function, then g(θ̂n) →p g(θ).
This result is called the continuous mapping theorem for convergence
in probability.
We omit the proof.
Example 3.1.2 (S² is a consistent estimator of σ²). By definition,
\[ S_n^2 = \frac{1}{n-1}\Big(\sum_{i=1}^n Y_i^2 - n\bar Y^2\Big) = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n Y_i^2 - \bar Y^2\Big). \tag{3.1} \]
• By the LLN,
\[ \frac{1}{n}\sum_{i=1}^n Y_i^2 \xrightarrow{p} E(Y_i^2) = \sigma^2 + \mu^2. \tag{3.2} \]
• Also, by the LLN,
\[ \bar Y \xrightarrow{p} E(Y_1) = \mu. \tag{3.3} \]
• By an application of the continuous mapping theorem with g(u) = u²,
we have from (3.3):
\[ (\bar Y)^2 \xrightarrow{p} \mu^2. \]
Recall that
\[ S_n^2 = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n Y_i^2 - \bar Y_n^2\Big). \]
By the continuous mapping theorem applied to the difference,
\[ \frac{1}{n}\sum_{i=1}^n Y_i^2 - (\bar Y_n)^2 \xrightarrow{p} \sigma^2, \]
and since n/(n − 1) → 1, we conclude that
\[ S_n^2 = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n Y_i^2 - \bar Y_n^2\Big) \xrightarrow{p} 1 \times \sigma^2 = \sigma^2. \]
Example 3.1.4. Is S := √S² a consistent estimator of σ?
Yes! By the continuous mapping theorem, if plim Sn² = σ², then plim Sn =
√σ² = σ.
Is consistency really important?
Yes. If an estimator is not consistent, then it will not produce the correct
estimate even if we are given the luxury of getting an unlimited amount of
data for free. It's a shame if one cannot get the correct answer in this
situation. An inconsistent estimator is a waste of time.
Does consistency guarantee good performance?
Not necessarily. We still live in a finite-sample world. Something that
is ultimately good for very large samples may not be good enough for a
realistic sample size.
3.2 Asymptotic normality

Recall that by the CLT,
\[ \frac{\bar X - \mu}{\sigma/\sqrt n} \to Z \]
in distribution, where Z is a standard normal r.v.
Many other estimators are also asymptotically normal, but it might not be
so easy to find their asymptotic variance.
The following example is meant to illustrate that sometimes there are
estimators that have smaller asymptotic variance than the sample mean.
Example 3.2.3. Let X1, . . . , Xn be a sample from the Laplace distribution
shifted by θ, that is, from the distribution with density
\[ p_\theta(x) = \frac{1}{2} e^{-|x-\theta|}. \]
By symmetry, it is clear that E(Xi) = θ, so we can estimate θ by using either
the sample mean or the sample median. Let us compare the asymptotic
variances of the estimators X̄ and θ̂med.
It turns out that it is possible to prove that θ̂med is an asymptotically normal
estimator of θ with asymptotic variance 1/(4p(0)²) = 1/(4 × (1/2)²) = 1.
On the other hand, for X̄, the asymptotic variance equals the variance
of the Laplace distribution, which can be computed as 2.¹ It follows that in
this example the sample median has smaller asymptotic variance than the
sample mean.
Exercise 3.2.4. Show that if X1 , . . . , Xn is a sample from the standard normal
distribution, then the sample mean has smaller asymptotic variance than the
sample median.
These examples show that the answer to the question of which estimator
is better often depends on the distribution from which we draw the sample.
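A simulation sketch of this comparison (Laplace variates generated as the difference of two independent exponentials, one standard trick):

set.seed(1)
n <- 400
means <- replicate(2000, mean(rexp(n) - rexp(n)))
medians <- replicate(2000, median(rexp(n) - rexp(n)))
n * c(var(means), var(medians))   # approximately 2 and 1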
Now let us finish this section by proving the result that we used in the
section about asymptotic confidence intervals.
Example 3.2.5. Let X1, . . . , Xn be a sample from a distribution with mean
E(Xi) = µ and variance Var(Xi) = σ². Then the statistic
\[ T = \frac{\bar X - \mu}{S/\sqrt n} \]
converges to the standard normal distribution.
[Footnote 1:
\[ \sigma_X^2 = \frac{1}{2}\int_{-\infty}^{\infty} x^2 e^{-|x|}\,dx = \int_0^{\infty} x^2 e^{-x}\,dx = \Gamma(3) = 2! = 2. \]]
This is a direct consequence of the following result, which we give without
proof.
Theorem (Slutsky). Suppose Xn → X in distribution and Yn → c in probability, where c is a constant. Then:
• Xn + Yn → X + c in distribution;
• Xn Yn → cX in distribution.
3.3 Risk functions and comparison of point estimators

The important thing is that the MSE and, more generally, any risk function
depends on θ, which we do not know. Ideally, an estimator θ̂1 is better
than another estimator θ̂2 if its MSE is smaller for every θ ∈ Θ. However,
it might happen that the MSE of θ̂1 is smaller than the MSE of θ̂2 for one
value of the parameter θ and larger for another; see Figure 3.1.

[Figure 3.1: MSE (risk functions) for two different estimators, θ̂1 and θ̂2.]

In general, there are two approaches to deal with this situation. In the first
approach, we simply compute the average MSE of the estimators over the
set of possible parameters and compare these averages. This is called the
Bayesian approach, since it is popular in the branch of statistical theory
called Bayesian statistics.
In the other approach, one finds the values of θ which give the largest
MSE for each of the estimators. The better estimator is the one with the smaller
of these maximal MSEs. This is called the minimax approach.
Exercise 3.3.1. According to the minimax criterion, which of the estima-
tors is better for the situation pictured in Figure 3.1? Which one is better
according to the Bayesian criterion?
Example 3.3.2. Let X1, . . . , Xn be sampled from the exponential distribution
with mean θ. Consider estimators of θ that have the form θ̂ = µn(X1 + . . . +
Xn). Calculate the MSE of these estimators. Which of them is best according
to the Bayesian criterion? According to the minimax approach?
Let us calculate the MSE. In the calculation, we will use the fact that if
Xi ∼ Exp(θ), then E(Xi) = θ, Var(Xi) = θ², and therefore E(Xi²) = 2θ².
\[ MSE(\theta) := E\big[\mu_n(X_1 + \ldots + X_n) - \theta\big]^2 = E\Big[\mu_n^2\Big(\sum_{i=1}^n X_i^2 + \sum_{i\neq j} X_i X_j\Big) - 2\mu_n\theta\sum_{i=1}^n X_i + \theta^2\Big] = \theta^2\big(n(n+1)\mu_n^2 - 2n\mu_n + 1\big). \]
For every θ, this is minimized at µn = 1/(n + 1), which gives the estimator
\[ \hat\theta = \frac{1}{n+1}\sum_{i=1}^n X_i = \frac{n}{n+1}\bar X. \]
3.4 Relative efficiency

Usually, an estimator with smaller variance is preferable. One quantitative
measure that statisticians use to compare unbiased estimators is their
relative efficiency.
Definition 3.4.1. Given two unbiased estimators θ̂1 and θ̂0 of the same
parameter θ, the relative efficiency of θ̂1 relative to θ̂0 is defined to be the
ratio of their variances:
\[ \mathrm{eff}(\hat\theta_1, \hat\theta_0) = \frac{\mathrm{Var}(\hat\theta_0)}{\mathrm{Var}(\hat\theta_1)}. \]
We can think of θ̂0 as a reference estimator. An estimator θ̂1 with
relative efficiency greater than 1 is better than θ̂0, since its variance
is smaller, and so θ̂1 is a more accurate estimator of θ than θ̂0! Note that we
assumed from the outset that both estimators are unbiased.
• Var(θ̂1) = (1/9)Var(Y1) = 1/9;
• Var(θ̂2) = (1/2)Var(Y1) = 1/2;
• the estimator θ̂2 does not use all the information available, and so it is less
efficient.
Both are unbiased, and therefore we only need to compute their variances.
Var(θ̂1) = Var(2Ȳ) = 4Var(Ȳ)/n·... = θ²/(3n). (This is because Y = θX,
where X is a random variable that has the uniform distribution on [0, 1], and
the variance of X is 1/12.)
In order to compute Var(θ̂2) = Var((n+1)/n · max{Y1, Y2, . . . , Yn}), we note
that Yi = θXi, where Xi is uniformly distributed on [0, 1]. Then
\[ \mathrm{Var}\Big(\frac{n+1}{n}\max\{Y_1,\ldots,Y_n\}\Big) = \mathrm{Var}\Big(\frac{n+1}{n}\theta\max\{X_1,\ldots,X_n\}\Big) = \Big(\frac{n+1}{n}\Big)^2 \theta^2\, \mathrm{Var}(\max\{X_1,\ldots,X_n\}). \]
So we only need to compute Var(max{X1, X2, . . . , Xn}). To do this, we
need the density of X(n) = max{X1, X2, . . . , Xn}.
Recall that P(X(n) ≤ x) = P(X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x), so we
have for the cdf F_{X(n)}(x) = (F_{X1}(x))ⁿ, and for the density f_{X(n)}(x) =
n f_{X1}(x)(F_{X1}(x))^{n−1}. In our particular case, f_{X(n)}(x) = n x^{n−1}.
Then we calculate:
\[ EX_{(n)} = n\int_0^1 x\, x^{n-1}\,dx = \frac{n}{n+1}, \]
\[ EX_{(n)}^2 = n\int_0^1 x^2 x^{n-1}\,dx = \frac{n}{n+2}, \]
\[ \mathrm{Var}(X_{(n)}) = EX_{(n)}^2 - (EX_{(n)})^2 = \frac{n}{n+2} - \frac{n^2}{(n+1)^2} = \frac{n(n+1)^2 - n^2(n+2)}{(n+2)(n+1)^2} = \frac{n}{(n+2)(n+1)^2}. \]
It follows that
\[ \mathrm{Var}(\hat\theta_2) = \Big(\frac{n+1}{n}\Big)^2 \theta^2 \frac{n}{(n+2)(n+1)^2} = \frac{\theta^2}{n(n+2)}, \]
\[ \mathrm{eff}(\hat\theta_2, \hat\theta_1) = \frac{\mathrm{Var}(\hat\theta_1)}{\mathrm{Var}(\hat\theta_2)} = \frac{\theta^2}{3n} \Big/ \frac{\theta^2}{n(n+2)} = \frac{n+2}{3}. \]
Hence, the second estimator is much more efficient than the first one.
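A quick simulation sketch comparing the two estimators for a uniform[0, θ] sample (θ and n arbitrary):

set.seed(3)
theta <- 5; n <- 20
est1 <- replicate(10000, 2 * mean(runif(n, 0, theta)))
est2 <- replicate(10000, (n + 1) / n * max(runif(n, 0, theta)))
var(est1) / var(est2)   # close to (n + 2) / 3, about 7.3 here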
3.5 Sufficient statistics
There is a huge multitude of functions of the data that we can consider in
the search for a good estimator. So it is worthwhile to check whether we can reduce
the data to one or just a few summary statistics. This is the main idea
behind the concept of sufficient statistics.
A statistic T = T(X1, . . . , Xn) is sufficient for θ if the conditional distribution of the sample given T,
\[ P_\theta(X_1, \ldots, X_n \mid T(X_1, \ldots, X_n) = t), \]
does not depend on θ.
Example (Bernoulli sample, T = Σᵢ Xi):
\[ p(x_1, \ldots, x_n \mid T = t) = \frac{p(x_1, \ldots, x_n, T = t)}{p(T = t)} = \frac{p^{\sum_i x_i}(1-p)^{n-\sum_i x_i}}{\binom{n}{t} p^t (1-p)^{n-t}}\,\delta\Big(\sum_i x_i - t\Big) = \frac{1}{\binom{n}{t}}\,\delta\Big(\sum_i x_i - t\Big), \]
where δ(Σᵢ xi − t) equals 1 if Σᵢ xi = t and 0 otherwise. We can see that
this conditional probability does not depend on p.
In principle, a sufficient statistic can be a vector; that is, it can consist
of several functions. For example, if you take the vector of order statistics,
T = (X(1), X(2), . . . , X(n)), then it is a sufficient statistic. However, we don't
gain very much by considering such statistics, since they do not reduce the
data.
If we take a function of a vector of sufficient statistics and reduce the dimension,
then it can potentially break the sufficiency; however, in some cases
the resulting function is still sufficient. For example, any invertible function
of a sufficient statistic is sufficient: it does not lose any information.
A sufficient statistic is minimal if it can be written as a function of any
other sufficient statistic. (A minimal sufficient statistic exists under mild
conditions on the distribution of the data, but there are some counterexamples.)
Why do we care about sufficient statistics?
In some cases, we can find a good estimator of a parameter θ by a 2-step
procedure:
1. If X1, . . . , Xn are discrete random variables and p(x|θ) is the probability mass function of each of them, then
\[ L(\theta \mid x_1, \ldots, x_n) = p_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta). \]
2. If X1, . . . , Xn are continuous random variables and f(x|θ) is the density of each of them, then
\[ L(\theta \mid x_1, \ldots, x_n) = f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta). \]
• Likelihood:
\[ L(p) = \prod_{i=1}^n p^{y_i}(1-p)^{1-y_i} = p^{\sum_{i=1}^n y_i}(1-p)^{n-\sum_{i=1}^n y_i} = \Big(\frac{p}{1-p}\Big)^{\sum_{i=1}^n y_i}(1-p)^n \times 1. \]
• We can define the statistic T = Σᵢ Yi. Then we will have:
– g_p(t) = (p/(1−p))^t (1−p)^n, and h(y1, . . . , yn) = 1;
– the first factor depends only on p and T (or t);
– the second factor does not depend on p.
• Likelihood (Poisson sample):
\[ L(\lambda) = \prod_{i=1}^n e^{-\lambda}\frac{\lambda^{y_i}}{y_i!} = e^{-n\lambda}\frac{\lambda^{\sum_{i=1}^n y_i}}{\prod_{i=1}^n (y_i!)} = e^{-n\lambda}\lambda^{\sum_{i=1}^n y_i} \times \frac{1}{\prod_{i=1}^n (y_i!)}. \]
• Each observation is distributed with density
\[ f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(y-\mu)^2}{2\sigma^2}\Big\}. \]
• Likelihood:
\[ L(\cdot) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(y_i-\mu)^2}{2\sigma^2}\Big\} = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{\sum_{i=1}^n (y_i-\mu)^2}{2\sigma^2}\Big\}. \]
• Note that
\[ \sum_{i=1}^n (y_i-\mu)^2 = \sum_{i=1}^n (y_i-\bar y+\bar y-\mu)^2 = \sum_{i=1}^n (y_i-\bar y)^2 + n(\bar y-\mu)^2 = (n-1)s^2 + n(\bar y-\mu)^2. \]
• Thus we have
\[ L(\cdot) = (2\pi\sigma^2)^{-n/2}\cdot\exp\Big\{-\frac{(n-1)s^2}{2\sigma^2}\Big\}\cdot\exp\Big\{-\frac{n(\bar y-\mu)^2}{2\sigma^2}\Big\}. \]
• The argument in L(·) is not specified because there are two situations.
• Density of one random variable:
\[ f_{X_i}(x_i) = \begin{cases} 1/\theta, & 0 < x_i < \theta, \\ 0, & \text{otherwise} \end{cases} \;=\; \frac{1}{\theta}\,\mathbf{1}_{0<x_i<\theta}, \]
where 1_{0<xi<θ} is the indicator function of the event {0 < xi < θ}.
• Likelihood:
\[ L(\theta) = \prod_{i=1}^n \frac{1}{\theta}\,\mathbf{1}_{0\le x_i\le\theta}. \]
• Note that ∏ᵢ 1_{θ≥xi} = 1_{θ≥x(n)}. Not very obvious. Think!
• Thus the likelihood is L(θ) = 1_{θ≥x(n)} (1/θⁿ) = 1_{θ≥x(n)} (1/θⁿ) × 1.
• Note that previously we found that this estimator is much more ef-
ficient than another unbiased estimator 2X. This is a reflection of a
general fact which is called the Rao-Blackwell theorem.
• Sufficient statistic?
3.6 Rao-Blackwell Theorem and Minimum-Variance Unbiased Estimator
• We have learned two good qualities of an estimator θ̂ for a parameter θ:
– Unbiasedness: Eθ̂ = θ;
– Low variance: Var(θ̂) is small;
• Relative efficiency:
\[ \mathrm{Reff}(\hat\theta_1, \hat\theta_2) = \frac{\mathrm{Var}(\hat\theta_2)}{\mathrm{Var}(\hat\theta_1)}. \]
• Remarks:
– θ̂∗ is a function of T.
– θ̂∗ is random.
– If θ̂ is already a function of T, then E(θ̂ | T) = θ̂, i.e., taking the
conditional expectation does not change anything; in particular,
it does not improve the efficiency.
Since the second term is non-negative, we find that Var(θ̂∗) < Var(θ̂).
This raises questions about what a complete sufficient statistic is and
how one can check that a sufficient statistic is complete. We will not be
concerned with these questions in this course and simply promise that in all
our examples the sufficient statistics obtained by the factorization theorem will
be complete and sufficient.
Routine to find the MVUE
Chapter 4
Methods of estimation
Suppose that we are looking for a good estimator θ̂(X1, . . . , Xn) of a parameter
θ. The method in the previous section asks us to find a minimal
complete sufficient statistic T = T(X1, . . . , Xn) by using the factorization criterion
and then find a function of T which is an unbiased estimator
of θ. This leads to an MVUE. Unfortunately, even if T is known, it is
often difficult to find θ̂(T) which would be unbiased for θ. In fact, in some
cases no unbiased estimator for θ exists.
For this reason we look for other methods to construct an estimator,
which would be easy to compute and which would have a small MSE
in large samples.
We will consider two such methods: Method of Moments Estimation
(MME) and Maximum Likelihood Estimation (MLE).
4.1 Method of Moments Estimation

The k-th population moment µk(θ) is defined as the k-th power expectation
of an observation Xi, that is,
\[ \mu_k(\theta) := E(X_i^k) = \begin{cases} \int x^k f(x\mid\theta)\,dx & \text{if } X_i \text{ are continuous r.v.,} \\ \sum_x x^k p(x\mid\theta) & \text{if } X_i \text{ are discrete r.v..} \end{cases} \]
Note that the population moments are all functions of the parameter θ
(and do not depend on the sample data).
In contrast, the sample moments are functions of the data sample Xi.
(They are random quantities, and their distribution depends on θ.) We
denote the k-th sample moment by mk:
\[ m_k = m_k(X_1, \ldots, X_n) := \frac{1}{n}\sum_{i=1}^n X_i^k. \]
For example, the first sample moment equals the sample mean, m1 = X̄;
the second sample moment can be expressed in terms of the sample variance
and the sample mean: m2 = [(n − 1)S² + n(X̄)²]/n.
It should be emphasized that the sample moments are all functions of
the data, i.e., they can all be calculated using the data. And they are all
random.
The main idea behind the Method of Moments Estimator is that by the Law
of Large Numbers the sample moments converge to the population moments:
\[ m_k = \frac{1}{n}\sum_{i=1}^n X_i^k \xrightarrow{P} EX_i^k = \mu_k(\theta) \]
in probability as n → ∞.
So for large n we have
\[ m_k(X_1, \ldots, X_n) = \mu_k(\theta) + \varepsilon_k, \]
where εk is very small with probability close to 1. This means that the empirical
moments are consistent estimators of the population moments. In addition,
we know the form of the functions µk(θ), although we do not know the value
of θ. Hence we can invert this system of functions and get an estimator
of θ from the estimators µ̂k = mk(X1, . . . , Xn) of µk(θ).
So, the idea is to ignore ε_k and solve the system of equations
m_k(X₁, . . . , Xₙ) = µ_k(θ̂), k = 1, 2, . . . , s,
for θ̂. If the inverse function µ_k^{(−1)} is a continuous function, then we can use the continuous mapping theorem and show that θ̂ → θ in probability as n → ∞; in other words, θ̂ is a consistent estimator of θ.
How many moments do we need for estimation? Typically, if we need to estimate a vector that consists of s parameters θ₁, . . . , θ_s, we use the first s
moments. However it might happen that one of the first theoretical moments
µk (θ) actually does not depend on the parameter of interest. For example, if
for every θ the distribution Fθ (x) has the density function symmetric relative
to the origin x = 0, then the first population moment (population mean or
expectation) is zero for every θ, and therefore this first moment is not going
to help us in the estimation of parameters.
Practical steps:
1. Express the first s population moments µ_k(θ) as functions of the unknown parameter θ.
2. Compute the first s sample moments m_k(X⃗) from the data.
3. Match the moments: solve the system of s equations µ_k(θ̂) = m_k(X⃗) for θ̂.
These solutions give us the desired Method of Moments Estimator θ̂_MME, which, as discussed above, converges to θ in probability when n → ∞.
• The first sample moment is m₁ := (1/n) ∑_{i=1}^n X_i = X̄.
• The second sample moment is m₂ := (1/n) ∑_{i=1}^n X_i².
• Matching the moments gives the system
µ̂ = m₁ = X̄,
σ̂² + µ̂² = m₂ = (1/n) ∑_{i=1}^n X_i².
• Solving it, we get
µ̂ = X̄;
σ̂² = (1/n) ∑_{i=1}^n X_i² − (X̄)².
These are the MME estimators of the parameters. Note that the MME esti-
mator for the variance is different from the standard estimator, the sample
variance, which is
S² = (1/(n − 1)) [∑_{i=1}^n X_i² − n(X̄)²],
so that
σ̂²_MME = ((n − 1)/n) S².
In particular, the MME estimator is biased.
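To make the computation concrete, here is a minimal R sketch (R is the language these notes already use for qbinom, pt, and so on); the sample below is simulated, so all particular numbers are illustrative only:

    # MME for a sample with unknown mean and variance: match the first two moments
    set.seed(1)
    x <- rnorm(50, mean = 3, sd = 2)   # illustrative data
    n <- length(x)
    m1 <- mean(x)                      # first sample moment
    m2 <- mean(x^2)                    # second sample moment
    mu.mme     <- m1                   # solves mu-hat = m1
    sigma2.mme <- m2 - m1^2            # solves sigma2-hat + mu-hat^2 = m2
    # agrees with the relation sigma2.mme = (n-1)/n * S^2:
    all.equal(sigma2.mme, (n - 1) / n * var(x))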
Example 4.1.2 (Poisson with unusual parameter). The data sample X1 , . . . Xn
is distributed according to the Poisson distribution with parameter λ. Find
the estimator of the parameter θ = 1/λ.
• The population mean of X_i is λ = 1/θ, and the first sample moment is X̄.
• Match the two quantities above and solve for θ̂:
1/θ̂ = X̄.
• Hence,
θ̂_MME = 1/X̄.
If you examine the example above attentively, you will see that we could estimate the parameter λ (the population mean) by the sample mean X̄. This would give an MME estimator of λ. Since the parameters θ and λ are in one-to-one correspondence, solving the MME equations for θ̂ can be done by solving the MME equations for λ̂ and then using the one-to-one relation between the parameters. This gives the same MME estimator θ̂ = 1/λ̂ = 1/X̄.
This is a manifestation of the general principle valid for MME estimators.
Plug-in or Invariance property of MME: if ψ = h(θ), then
ψ̂_MME = h(θ̂_MME).
Note, however, that X̄ is not a sufficient statistic for θ. Indeed, the minimal sufficient statistic in this example is X_{(n)} = max{X₁, . . . , Xₙ}. So this is an example of an MME estimator which is not a function of a sufficient statistic, and hence it has no chance of being an MVUE. Even though it is unbiased, its variance could be reduced by conditioning on the sufficient statistic.
Reflection
• The four point estimators back in Chapter 8 (for the mean, proportion,
differences in means and proportions) were all MMEs.
4.2 Maximum Likelihood Estimation (MLE)
Definition 4.2.1. The maximum likelihood estimator θ̂ is the value of the parameter θ at which the likelihood function L(θ|x₁, . . . , xₙ) takes its maximum value.
The density of an individual observation is
f(x_i) = (1/θ) e^{−x_i/θ}.
Since the observations are independent, the likelihood is just the product of the density functions:
L(θ|x₁, . . . , xₙ) = (1/θⁿ) ∏_{i=1}^n e^{−x_i/θ} = (1/θⁿ) e^{−(∑_{i=1}^n x_i)/θ},
and the log-likelihood is
ℓ(θ|x₁, . . . , xₙ) = log L(θ|x₁, . . . , xₙ) = −n log θ − (1/θ) ∑_{i=1}^n x_i.
Setting ℓ′(θ) = −n/θ + (1/θ²) ∑_{i=1}^n x_i = 0 gives θ̂_MLE = X̄.
• With t = ∑_{i=1}^n y_i, the derivative of the log-likelihood is
(d/dp) ℓ(p|y⃗) = t/p − (n − t)/(1 − p).
• Set
ℓ′(p) = t/p − (n − t)/(1 − p) = 0.
We obtain
t/p = (n − t)/(1 − p) ⇒ t − tp = np − tp ⇒ p = t/n.
• Hence p̂_MLE = T/n.
• Likelihood function:
L(µ, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp{−(x_i − µ)²/(2σ²)}
= (2πσ²)^{−n/2} exp(−∑_{i=1}^n (x_i − µ)²/(2σ²)).
• Log-likelihood:
ℓ(µ, σ²) = log(L(µ, σ²)) = −(n/2) log(2πσ²) − ∑_{i=1}^n (x_i − µ)²/(2σ²)
= −(n/2) log(2π) − (n/2) log(σ²) − ∑_{i=1}^n (x_i − µ)²/(2σ²).
• Partial derivatives:
∂ℓ(µ, σ²)/∂µ = ∑_{i=1}^n (x_i − µ)/σ² = (x̄ − µ)/(σ²/n),
∂ℓ(µ, σ²)/∂σ² = −(n/2)(1/σ²) + (∑_{i=1}^n (x_i − µ)²/2)(1/σ⁴).
• Set both to zero.
(x̄ − µ)/(σ²/n) = 0 ⇒ µ = x̄,
−(n/2)(1/σ²) + (∑_{i=1}^n (x_i − µ)²/2)(1/σ⁴) = 0 ⇒ σ² = (1/n) ∑_{i=1}^n (x_i − µ)² = (1/n) ∑_{i=1}^n (x_i − x̄)².
• Hence, µ̂_MLE = X̄ and σ̂²_MLE = (1/n) ∑_{i=1}^n (X_i − X̄)² = S₀². These are exactly the sample mean and the alternative version of the sample variance (but not S², the original sample variance). We know that X̄ is unbiased and that both X̄ and S₀² are consistent. However, S₀² is a biased estimator of σ².
In fact, as in the previous example, the ML estimator coincides with the MM estimator.
Example 4.2.6 (Uniform on (0, θ)). Let X₁, . . . , Xₙ be i.i.d. observations from the uniform distribution on the interval [0, θ]. What is the ML estimator for θ?
We computed the likelihood above: L(θ) = (1/θⁿ) 1_{θ≥x_{(n)}}. Since L(θ) is zero for θ < x_{(n)} and decreasing in θ for θ ≥ x_{(n)}, it is maximized at the boundary point, so θ̂_MLE = X_{(n)} = max{X₁, . . . , Xₙ}.
The ML estimator is biased, but it is a function of a minimal sufficient statistic. In particular, the bias of the ML estimator can be corrected, and this will lead to an MVUE.
Note, by the way, that if the domain of the density function depends on
the parameter θ, it is often a warning sign that the boundary point of θ may
play a major role in finding the MLE.
MLE is always a function of a sufficient statistic!
• Indeed, by the factorization theorem L(θ) = g(t, θ) h(x) with t = T(x), so the maximizer of log(g(t, θ)), over all possible θ, has to depend on the data only through t.
The pmf of an individual observation is p(x_i|λ) = λ^{x_i} e^{−λ}/x_i!, where x_i can take values 0, 1, 2, . . .. Since the observations are independent, the likelihood function is simply the product of the pmfs of the individual observations:
L(λ|x₁, . . . , xₙ) = λ^{∑_{i=1}^n x_i} e^{−nλ} / ∏_{i=1}^n (x_i!).
Now it is rather obvious that if L(λ|x₁, . . . , xₙ) is maximized at λ = λ̂, then L(θ|x₁, . . . , xₙ) is maximized at the point that corresponds to this point, namely at θ = 1/λ̂. Therefore,
θ̂_ML = 1/λ̂_ML = 1/X̄.
The principle that relations between parameters are transferred to their estimates is called the invariance, or plug-in, principle.
Theorem 4.2.9. Suppose that X₁, . . . , Xₙ are observations from the distribution that depends on parameter θ. If θ̂_ML = θ̂_ML(X₁, . . . , Xₙ) is the maximum likelihood estimator for θ and g(·) is a one-to-one function, then g(θ̂_ML) is the maximum likelihood estimator for the parameter ψ := g(θ), i.e.,
ψ̂_ML = g(θ̂_ML).
Indeed, since g is one-to-one, we can write the likelihood of ψ as L(g^{−1}(ψ)): the expression on the right is the likelihood for ψ. If the MLE of ψ were ψ* ≠ g(θ̂_MLE), then it would follow that
L(g^{−1}(ψ*)) > L(g^{−1}(g(θ̂_MLE))) = L(θ̂_MLE).
But this would contradict the fact that θ̂_MLE maximizes L(θ).
Example 4.2.10. For example, suppose you want to estimate the probability that a random variable X > 2; you know that it is a Poisson r.v., and you have a data sample X_i, i = 1, . . . , n.
Note that the ML estimator of Pr{X > 2} is
1 − e^{−λ̂} − (λ̂/1!) e^{−λ̂} − (λ̂²/2!) e^{−λ̂},
where λ̂ = λ̂_ML is the maximum likelihood estimator for λ.
Since we know that λ̂_ML = X̄, the ML estimator for Pr{X > 2} is
1 − e^{−X̄} (1 + X̄/1! + (X̄)²/2!).
Comparison of MM and MLE
Additional examples:
Example 4.2.11. Let Y₁, . . . , Yₙ be taken from a distribution with the following density:
f_Y(y) = (1/θ) r y^{r−1} e^{−y^r/θ} 1_{y>0}, where θ > 0 and r is known.
Find a sufficient statistic for θ. Find the MLE of θ. Is it MVUE?
• L(θ) = ∏_{i=1}^n (1/θ) r y_i^{r−1} e^{−y_i^r/θ} = (1/θⁿ) rⁿ (∏_{i=1}^n y_i)^{r−1} e^{−(1/θ) ∑_{i=1}^n y_i^r}
• Clearly the sufficient statistic is ∑_{i=1}^n Y_i^r.
• L(θ) = C · (1/θⁿ) e^{−(1/θ) ∑_{i=1}^n y_i^r}, where C has nothing to do with θ.
• ℓ(θ) = log(C) − n log(θ) − (1/θ) ∑_{i=1}^n y_i^r.
• ℓ′(θ) = −n/θ + (1/θ²) ∑_{i=1}^n y_i^r. Note that log(C) disappears.
• Set ℓ′(θ) = 0 ⇒ n/θ = (1/θ²) ∑_{i=1}^n y_i^r ⇒ θ* = (1/n) ∑_{i=1}^n y_i^r.
• So, (1/n) ∑_{i=1}^n Y_i^r is the MLE for θ.
• Note that
E(Y_i^r) = ∫₀^∞ (1/θ) r y^{r−1} e^{−y^r/θ} · y^r dy = ∫₀^∞ e^{−y^r/θ} · y^r d(y^r/θ)
(let u = y^r/θ)
= θ ∫₀^∞ e^{−u} · u du.
• One can either calculate the integral explicitly or note that e^{−u} is the density of the exponential distribution with parameter 1, and the integral ∫₀^∞ e^{−u} · u du calculates its expectation, which, as we already know, equals 1. Hence, we have
E(Y_i^r) = θ.
• Therefore, E((1/n) ∑_{i=1}^n Y_i^r) = (1/n) ∑_{i=1}^n E(Y_i^r) = (1/n) nθ = θ.
It follows that the maximum likelihood estimator (1/n) ∑_{i=1}^n Y_i^r is the MVUE for θ: it is unbiased and it is a function of the (complete) sufficient statistic ∑_{i=1}^n Y_i^r.
Another example:
Example 4.2.12. Consider the situation when we have two samples. One of them is X₁, X₂, . . . , X_m from the normal distribution N(µ₁, σ²). The other is Y₁, Y₂, . . . , Yₙ from the normal distribution N(µ₂, σ²). Here we assume that the variance in both distributions is the same. What are the ML estimators for the parameters µ₁, µ₂ and σ²?
• Given the observations x₁, . . . , x_m, y₁, . . . , yₙ, the likelihood for µ₁, µ₂, σ² is the product of all the densities (including the X’s and the Y’s):
L(µ₁, µ₂, σ²) = ∏_{i=1}^m (1/√(2πσ²)) exp{−(x_i − µ₁)²/(2σ²)} × ∏_{i=1}^n (1/√(2πσ²)) exp{−(y_i − µ₂)²/(2σ²)}
= (2πσ²)^{−m/2} exp(−∑_{i=1}^m (x_i − µ₁)²/(2σ²)) × (2πσ²)^{−n/2} exp(−∑_{i=1}^n (y_i − µ₂)²/(2σ²)).
• So the log-likelihood is
ℓ(µ₁, µ₂, σ²) = −(m/2) log(2π) − (m/2) log(σ²) − ∑_{i=1}^m (x_i − µ₁)²/(2σ²)
− (n/2) log(2π) − (n/2) log(σ²) − ∑_{i=1}^n (y_i − µ₂)²/(2σ²).
• Setting the partial derivatives with respect to µ₁, µ₂ and σ² to zero gives
µ̂₁ = X̄, µ̂₂ = Ȳ,
σ̂² = [∑_{i=1}^m (x_i − X̄)² + ∑_{i=1}^n (y_i − Ȳ)²] / (m + n).
Exercise 4.2.13. Y₁, Y₂, . . . , Yₙ is a sample of observations from N(5, θ), where the variance θ is unknown and is the parameter of interest:
f(y) = (1/√(2πθ)) exp[−(y − 5)²/(2θ)].
(d). Show directly (without using the general theorem that MM estimators are consistent) that θ̂_MM is consistent for θ.
(f). Prove that θ̂_ML is the minimum-variance unbiased estimator (MVUE) for θ.
4.3 Cramer-Rao Lower Bound and large sample
properties of MLE
In this section we learn about the Cramer-Rao lower bound on the variance of any unbiased estimator of θ. It is not possible to get a smaller variance even if you use the MVUE. We also learn that in the limit, as n → ∞, the maximum likelihood estimator achieves this bound. In this sense, the ML estimator is an asymptotically Minimal Variance Unbiased Estimator.
The idea behind the Cramer-Rao bound is that if the likelihood function
is flat and does not depend on the parameter θ then it will be difficult
to estimate the parameter from the data. The measure of the likelihood
function flatness that the Cramer-Rao bound uses is the Fisher information.
Essentially, it is the average squared sensitivity of the log-likelihood function
to the parameter.
For the formal definition, let us define the score function s(x, θ) of a random variable X as log f(x, θ) if X is continuous with probability density f(x, θ), and as log p(x, θ) if it is discrete with probability mass function p(x, θ).
We are talking here about a single variable X; the log-likelihood function is the sum of the values of the score function at the x_i: ln L(θ) = ∑_{i=1}^n s(x_i, θ).
By definition,
I(θ) = E[((d/dθ)(−log θ − X/θ))²]
= E[(−1/θ + X/θ²)²]
= E[1/θ² − 2X/θ³ + X²/θ⁴].
Recollect that the exponential distribution with parameter θ has mean θ and variance θ². Hence EX = θ and EX² = θ² + θ² = 2θ². So after substitution we get
I(θ) = 1/θ² − 2θ/θ³ + 2θ²/θ⁴ = 1/θ².
Theorem 4.3.4 (Cramer-Rao bound). Let X₁, . . . , Xₙ be a sample of independent identically distributed observations from the distribution that depends on parameter θ. Under certain regularity conditions on the distribution, for every unbiased estimator θ̂,
Var(θ̂) ≥ 1/(n I_X(θ)).
Note 1: If we are able to find an unbiased estimator such that its variance
equals the Cramer-Rao bound (and the regularity conditions hold), then this
estimator is MVUE (minimal variance unbiased estimator).
Note 2: Sometimes the Cramer - Rao bound is not sharp. That is,
sometimes the variance of the MVUE will be larger than the bound given
by the Cramer - Rao inequality.
Let us calculate the Fisher information for this distribution. The density is 1/θ and the score function is −log(θ), so by definition:
I_X(θ) = E[((d/dθ) s(X, θ))²] = E[1/θ²] = 1/θ².
(If we try to use identity (4.1) instead, we get
I_X(θ) = −E[(d²/dθ²) s(X, θ)] = −E[1/θ²] = −1/θ²,
which is not the same. This is not a contradiction: the conditions of Lemma 4.3.3 that justified (4.1) are not satisfied in this example, so (4.1) simply does not apply here.) So the Cramer-Rao inequality predicts that every unbiased estimator should have variance ≥ θ²/n. We have calculated earlier the expectation and variance of the ML estimator in this example, which is X_{(n)}. This estimator is biased, but we can correct its bias and consider the
unbiased estimator θ̂ = ((n + 1)/n) X_{(n)}. Its variance is
Var(θ̂) = θ² · ((n + 1)²/n²) · n/((n + 2)(n + 1)²) = θ²/(n(n + 2)).
So this estimator clearly violates the Cramer-Rao bound. The reason is that
the conditions of the Theorem 4.3.4 are not satisfied.
We will explain some ideas behind the proof of the Cramer-Rao bound
below. Now let us turn to the asymptotic optimality of MLE. The main
result here is that under some regularity conditions,
n Var(θ̂_ML) → 1/I_X(θ)
as n → ∞.
The point is that the MLE in the limit attains the Cramer-Rao bound.
In this sense it is an asymptotically MVUE, or in other terminology it is
asymptotically efficient.
Ideas of the proof of the Cramer-Rao bound
We are going to prove the bound for the case when X is continuous and
its range does not depend on the parameter and when n = 1. The proof in
the general case is difficult and you can find it in graduate level textbooks.
Lemma 4.3.7. Assume the range of X does not depend on θ and the density is positive and continuously differentiable in θ. Then,
E[(d/dθ) s(X, θ)] = 0.
Proof. Note that by the chain rule:
(d/dθ) s(x, θ) = ((d/dθ) f(x, θ)) / f(x, θ).
Therefore,
E[(d/dθ) s(X, θ)] = ∫_a^b ((d/dθ) s(x, θ)) f(x, θ) dx = ∫_a^b (d/dθ) f(x, θ) dx
= (d/dθ) ∫_a^b f(x, θ) dx = (d/dθ) 1 = 0.
Corollary of Lemma 4.3.7: I(θ) = Var((d/dθ) s(X, θ)).
Proof of the Cramer-Rao bound for n = 1. Let θ̂(X) be an unbiased estimator of θ based on just one datapoint X. Let us write s′(θ) instead of (d/dθ) s(X, θ). By using Lemma 4.3.7 and the Cauchy-Schwarz inequality for covariance:
|E[s′(θ) θ̂]| = |Cov(s′(θ), θ̂)| ≤ √(Var(s′(θ)) Var(θ̂)).
Or, since Var(s′(θ)) = I(θ) by the corollary above,
Var(θ̂) ≥ (E[s′(θ) θ̂])² / I(θ).
All this would hold even if θ̂ were biased. The next step is crucial:
E[s′(θ) θ̂] = ∫_a^b ((d/dθ) f(x, θ)) θ̂(x) dx = (d/dθ) ∫_a^b f(x, θ) θ̂(x) dx
= (d/dθ) E θ̂ = (d/dθ) θ = 1,
where we used the unbiasedness of θ̂ in the last step. Together with the previous display, this gives Var(θ̂) ≥ 1/I(θ).
Example 4.3.8. • Y_i ∼ Bernoulli(p) with PMF p^{y_i}(1 − p)^{1−y_i}.
• We know that p̂ = ∑_{i=1}^n Y_i/n is the MLE for p.
• The Cramer-Rao bound is
1/(n I_Y(p)) = 1/(n E[−∂²{Y log(p) + (1 − Y) log(1 − p)}/∂p²]) = 1/(n E[Y/p² + (1 − Y)/(1 − p)²])
= 1/(n [p/p² + (1 − p)/(1 − p)²]) = 1/(n [1/p + 1/(1 − p)]) = p(1 − p)/n.
• We already know that Var(p̂) = p(1 − p)/n. So indeed, p̂ is the MVUE.
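A small simulation sketch in R (all numbers are illustrative) can be used to check that the variance of p̂ matches the Cramer-Rao bound p(1 − p)/n:

    set.seed(42)
    n <- 50; p <- 0.3
    p.hat <- replicate(10000, mean(rbinom(n, size = 1, prob = p)))  # 10000 simulated MLEs
    var(p.hat)        # simulated variance of p-hat
    p * (1 - p) / n   # Cramer-Rao bound; the two numbers should be close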
Chapter 5
Hypothesis testing
• This chapter corresponds to Chapter 10 of the textbook.
3. Are men more likely to run a stop sign than women?
4. Does chemotherapy really cure cancer?
5. Is a new medicine effective in increasing longevity?
3. Namely, if the data looks very improbable under the null hypothesis,
then you can conclude that the data contradicts the null hypothesis,
so it should be rejected and your theory should be accepted instead.
4. However, if the data does not look very improbable under the null hypothesis, then you cannot reject it, and so you don’t have enough evidence in support of your, alternative, point of view.
Terminology
• Hypothesis
• Null Hypothesis H0
– It is usually denoted by H0 and it is usually very specific. For
example it can state: “the treatment has no effect”.
• Alternative Hypothesis Ha
Besides the null and alternative hypothesis, the statistical test is defined
by a test statistic and a rejection region.
• A test statistic (TS) is a statistic, that is, a function of the data. Its intention is different from that of an estimator, which is also a function of the data. The test statistic should help us to answer the question: “how close is the data sample to what we would expect if the null hypothesis H0 were true?” A typical choice is
TS = (θ̂ − θ0)/σ_θ̂,
where θ̂ is an estimator of θ.
• In this case, if |T S| is large (so T S is far from 0) then it might indicate
that the data is not compatible with the null hypothesis.
• If the TS is not in the RR, then we fail to reject the null hypothesis
H0 .
• Other commonly used terms: do not reject, do not have enough evi-
dence to reject, etc.
              Do not reject H0     Reject H0 in favor of Ha
H0 is true    Correct decision     Type I Error
Ha is true    Type II Error        Correct decision
– In this case, a value for α is chosen before initiating a hypothesis test.
– Common values for α are 0.01, 0.05 and 0.10;
– Choose the rejection region so that P(TS ∈ RR | H0 is true) = α.
Eventually, the choice of balance between type I and type II errors depends on a cost-benefit analysis, which is outside of the area of statistics.
Summary Design of the test:
• Set up H0
• Set up Ha
• Choose a small significance level α (like 5%) and find a reject region
(RR) so that P(make a type I error) = P(T S ∈ RR|H0 is true) = α.
• Type I Error: rejecting H0 when H0 is true;
• Type II Error: failing to reject H0 when Ha is true.
Example 5.2.1. An experimenter has prepared a drug dosage level that she claims will induce sleep for 80% of people suffering from insomnia. After examining the dosage, we feel that her claims regarding the effectiveness of the dosage are inflated. In an attempt to disprove her claim, we administer her prescribed dosage to 20 insomniacs and we observe Y, the number for whom the drug dose induces sleep. We wish to test the hypothesis H0 : p = .8 versus the alternative, Ha : p < .8. Assume that the rejection region {y ≤ 12} is used.
(d) If we want the size of the test α ≈ 0.01, how should we choose the threshold r in the rejection region RR = {y ≤ r}?
(a) In this example Y is our test statistic (the complete data consists of observations for each insomniac). This statistic is distributed according to the binomial distribution with parameters n = 20 and p. If we assume H0, then p = 0.8. To find the probability of a type II error against the alternative p = 0.6, we need to calculate
β = P(Y ∉ RR | Ha : p = 0.6) = P(Y > 12 | p = 0.6) = 1 − P(Y ≤ 12 | p = 0.6).
(d) Now if we want to make α = 0.01, then we need to find r such that P(Y ≤ r | H0 : p = 0.8) = 0.01. Unfortunately, there is no r such that this equality is satisfied exactly. However, we can solve it approximately by using the R command qbinom. If we issue the command qbinom(0.01, size = 20, p = 0.8), it gives us r = 12, since by definition it produces the smallest r such that P(Y ≤ r) is ≥ 0.01. However, we know that P(Y ≤ 12) = pbinom(12, 20, 0.8) = 0.03214266, and this is not satisfactory if we want to ensure that the probability of type I error is smaller than 0.01. For this reason we should choose r = 11. In fact, for this choice we have α = P(Y ≤ 11) = pbinom(11, 20, 0.8) = 0.009981786, which is very close to 0.01.
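The relevant R session, collecting the commands quoted above:

    pbinom(12, size = 20, prob = 0.8)     # 0.03214266: alpha for RR = {y <= 12}
    qbinom(0.01, size = 20, prob = 0.8)   # 12: smallest r with P(Y <= r) >= 0.01
    pbinom(11, size = 20, prob = 0.8)     # 0.009981786: alpha for RR = {y <= 11}
    1 - pbinom(12, size = 20, prob = 0.6) # beta against the alternative p = 0.6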
In the previous example we were given the test statistic and the rejection region. How can we choose them in a typical example? Here is one method, which is useful if we are interested in testing a statement about a parameter θ and are given a desired probability of type I error (i.e., the significance level of the test α).
• Use the sample data to find an estimator of θ; denote it by θ̂;
– {|TS| > t} (two-sided test)
with the cutoff t chosen such that
P(TS is in RR | H0 is true) = P(TS is in RR | θ = θ0) = α.
• In our example the reject region (RR) is {TS > t}, with the cutoff t chosen such that P(TS > t | p = p0) = α. When H0 is true, TS is approximately standard normal, so
P(TS > t | p = p0) = α ⇒ P(Z > t) = α,
and therefore
RR : {TS > zα};
The problem asks for level α = 0.01, so z_{0.01} = 2.33, and the reject region is
RR : {TS > 2.33}.
By using the data provided in this problem, we calculate the observed value of the test statistic TS as
ts = (p̂ − p0)/σ_p̂ = (0.15 − 0.10)/√(0.1(1 − 0.1)/100) = 5/3 = 1.667.
Since ts is NOT in the reject region, we fail to reject H0 at the level α = 0.01, and come to the conclusion that there is NO sufficient evidence to support the statement that the machine must be repaired, at the significance level α = 0.01.
Note that if we used a different α, say α = 0.05, then z0.05 = 1.645, and
the reject region would be
RR : {T S > 1.645};
Then, ts would be in the reject region, and we would reject H0 at the level
α = 0.05. The conclusion would become: at the significance level α = 0.05,
there is sufficient evidence to support the statement that the machine must
be repaired.
So, the decision of the hypothesis test depends on the value of α – the
level of tolerance for the type I error. The report about the decision should
always specify the value of α.
Note that the decision of a hypothesis test has a random nature! It depends on the realized data. In particular, if H0 is true, we incorrectly reject it with probability α.
Now let us look at the calculation of the probability of type II error in
the situation when we have a large sample and an estimator of the parame-
ter which is distributed approximately normally. Let us consider the same
example as before.
Example 5.2.3. Let X1 , . . . , Xn be distributed according to the Bernoulli
distribution with parameter p.
(b) Alternative hypothesis Ha : p > 0.1.
(d) What can be said about the probability of type II error and the power
of this test?
and this quantity is easy to evaluate by using software or by referring to tables. If we use α = 0.01 and n = 100, then zα = 2.33, and we calculate that
(−0.05 + zα √(0.1(1 − 0.1)/n)) / √(0.15(1 − 0.15)/n) = 0.5573,
and
β = P(Z ≤ 0.5573) ≈ 0.71.
So the probability that we make an error of type II is rather large, 71%.
Note that the power of the test is by definition 1 − β, so we have a method to calculate both the probability of type II error and the power of the test. In this example the power of the test is 1 − 0.71 = 29%.
It is important to note that β (and the power) depends on the value of
the parameter under the alternative hypothesis. For example, if we changed
our alternative hypothesis to Ha : p = 0.2, then the probability of type II
error would be equal to
β = P(Z ≤ (−0.1 + zα √(0.1(1 − 0.1)/n)) / √(0.2(1 − 0.2)/n)) = P(Z ≤ −0.7525) = 22.6%.
• Find the test statistic
TS = (θ̂ − θ0)/σ_θ̂ = (p̂₁ − p̂₂ − 0) / √(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂) ≈ ?
An additional difficulty here is that the null hypothesis does not specify the exact values of p₁ and p₂. It only says that p₁ = p₂. For this reason, we need to estimate p₁ and p₂. We use the “pooled sample proportion”, suggested by the fact that H0 claims that p₁ = p₂:
p̃ := (Y₁ + Y₂)/(n₁ + n₂).
This is the best guess about p₁ and p₂ we can obtain when p₁ = p₂ (that is, under the assumption that H0 is true).
Using the data provided in this problem, we can calculate:
p̂₁ = 10834/203123 = 0.05333,
p̂₂ = 782/25536 = 0.03062,
p̂₁ − p̂₂ = 0.02271,
p̃ = (10834 + 782)/(203123 + 25536) = 0.05080,
σ_{p̂₁−p̂₂} = √(p̃(1 − p̃)(1/n₁ + 1/n₂)) = 0.001458,
and the observed value of the test statistic TS is
ts = (θ̂ − θ0)/σ_θ̂ = (0.02271 − 0)/0.001458 = 15.5789.
So, in this case it is obvious that the null hypothesis can be rejected at
α = 0.01. The data give strong support to the hypothesis that the mortality
rate in New York is higher than that in California.
The previous two examples were about testing hypotheses about popu-
lation proportions.
Now let us look at the hypotheses about population means. We still
maintain the assumption that the sample size is large and therefore we can
rely on the normality of the parameter estimator distribution.
Example 5.2.5. A random sample of 37 second graders who participated in
sports had manual dexterity scores with mean 32.19 and standard deviation
4.34. An independent sample of 37 second graders who did not participate in
sports had manual dexterity scores with mean 31.68 and standard deviation
4.56.
b. For the rejection region used in part (a), calculate β when µ1 −µ2 = 3.
Since we do not know the exact values of σ₁² and σ₂², we will use estimates for these variances. Then, the test statistic is
TS = (θ̂ − θ0)/σ̂_θ̂ = (X̄ − Ȳ − 0) / √(s₁²/n₁ + s₂²/n₂)
= (32.19 − 31.68) / √(4.34²/37 + 4.56²/37) = 0.4928.
Since the samples are relatively large (n1 = n2 > 30), the test statistic
is distributed as a standard normal random variable. Since α = 0.05 and
T S ≤ z0.05 = 1.645, the test statistic is not in the rejection region and we
are not able to reject the null hypothesis. The data does not give enough
evidence to indicate that second graders who participate in sports have a
higher mean dexterity score.
Now let us consider the second question. What is β if µ1 − µ2 = 3?
In this case, we know that
Z = (X̄ − Ȳ − 3) / √(s₁²/n₁ + s₂²/n₂)
is approximately a standard normal random variable. What we need to calculate is the probability that we do not reject the null hypothesis, that is, P(TS ≤ zα). So, we need to express TS in terms of Z:
TS = (X̄ − Ȳ) / √(s₁²/n₁ + s₂²/n₂) = (X̄ − Ȳ − 3 + 3) / √(s₁²/n₁ + s₂²/n₂) = Z + 3/√(s₁²/n₁ + s₂²/n₂).
Then, the desired probability is
β = P(TS ≤ zα) = P(Z + 3/√(s₁²/n₁ + s₂²/n₂) ≤ zα) = P(Z ≤ zα − 3/√(s₁²/n₁ + s₂²/n₂)).
More generally, for a test statistic
TS = (θ̂ − θ0)/σ̂_θ̂,
where the sample size is large, the estimator θ̂ has an approximately normal distribution, and the standard deviation of the estimator θ̂ is estimated from the data (not calculated on the basis of the hypothesis), we can write simple formulas for β.
If the alternative hypothesis is θa > θ0, and we use the rejection region TS > zα, then
β = P(Z ≤ zα − (θa − θ0)/σ̂_θ̂).
If the alternative hypothesis is θa < θ0 and the rejection region is TS < −zα, then
β = P(Z ≥ −zα − (θa − θ0)/σ̂_θ̂).
The null hypothesis is H0 : θ = θ0 (where θ = µ₁ − µ₂ and θ0 = 0). This is simply a different formulation of the hypothesis H0 : µ₁ = µ₂. The alternative hypothesis is Ha : θ ≠ θ0.
An estimator of θ = µ₁ − µ₂ is θ̂ = Ȳ₁ − Ȳ₂, the difference in sample means. So we take the test statistic
TS = (θ̂ − θ0)/σ_θ̂ = (θ̂ − θ0) / √(σ₁²/n₁ + σ₂²/n₂) ≈ (θ̂ − θ0) / √(s₁²/n₁ + s₂²/n₂).
The rejection region (RR) is {|TS| > t}, and we choose the cutoff t so that
P(|TS| > t | θ = θ0) = α.
When H0 is true, the TS is approximately N(0, 1), and we can take t = z_{α/2}.
ts = (θ̂ − θ0)/σ_θ̂ = (1.65 − 1.43 − 0) / √(0.26²/30 + 0.22²/35) = 3.65.
Since ts is in the reject region, our decision is: reject H0 at the level α = 0.01. We conclude that at the significance level α = 0.01, there is sufficient evidence that the soils differ with respect to average shear strength.
Example 5.2.7. A political researcher believes that the fraction p1 of Republicans
strongly in favor of the death penalty is greater than the fraction p2 of Democrats
strongly in favor of the death penalty. He acquired independent random samples of
200 Republicans and 200 Democrats and found 46 Republicans and 34 Democrats
strongly favoring the death penalty. Does this evidence provide statistical support
for the researcher’s belief? Use α = .05.
Using the data provided in this problem, we can calculate:
p̃ = (Y₁ + Y₂)/(n₁ + n₂) = (46 + 34)/400 = 0.2,
σ_{p̂₁−p̂₂} = √(p̃(1 − p̃)/n₁ + p̃(1 − p̃)/n₂) = √(2 × 0.2 × 0.8/200) = 0.04,
ts = (θ̂ − θ0)/σ_θ̂ = (46/200 − 34/200 − 0)/0.04 = 1.5.
Since ts = 1.5 < z_{0.05} = 1.645, the test statistic is not in the rejection region {TS > z_{0.05}}, and we fail to reject H0: the data does not provide sufficient support for the researcher’s belief at α = .05.
Note that 3 here represents θa − θ0, so the general formula is
n = (s₁² + s₂²) [(zα + zβ)/(θa − θ0)]².
This formula works for both right-tailed and left-tailed one-sided tests. However, there is no simple formula for two-sided hypotheses.
By plugging in the numbers from the example, we get
n = (4.34² + 4.56²) [(1.645 + 1.645)/3]² ≈ 47.66.
Since we cannot use a fraction as a sample size, we conclude that sample size n = 48 would be sufficient.
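For convenience, the formula can be wrapped in a small R function (a hypothetical helper, not part of base R; qnorm gives the normal quantiles zα and zβ):

    # sample size for a one-sided two-sample z-test with level alpha and type II error beta;
    # delta is theta_a - theta_0
    sample.size <- function(s1, s2, alpha, beta, delta) {
      n <- (s1^2 + s2^2) * ((qnorm(1 - alpha) + qnorm(1 - beta)) / delta)^2
      ceiling(n)  # round up, since a fraction cannot be a sample size
    }
    sample.size(4.34, 4.56, alpha = 0.05, beta = 0.05, delta = 3)  # 48, as in the example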
5.4 Relation with confidence intervals
Consider the one-sided (lower-bound) confidence interval
CI = (θ̂ − zα σ̂_θ̂, ∞).
Does it help us with hypothesis testing? Well, the confidence interval says that the true value of the parameter is likely to be larger than θ̂ − zα σ̂_θ̂. So if we test the null hypothesis H0 : θ = θ0, and it happens that θ0 is outside of the confidence interval, that is, if θ0 < θ̂ − zα σ̂_θ̂, then we should reject the null hypothesis.
The only provision here is that the CI should be in agreement with the alternative hypothesis, that is, the alternative hypothesis should be Ha : θ > θ0.
If our alternative hypothesis is that θ < θ0, then it is more appropriate to consider the upper-bound confidence interval:
CI = (−∞, θ̂ + zα σ̂_θ̂).
This confidence interval tells us that the true value of the parameter is likely to be smaller than θ̂ + zα σ̂_θ̂, so if θ0 is greater than this quantity we should reject the null hypothesis.
Similarly, if we use a two-sided alternative hypothesis Ha : θ ≠ θ0, then it is appropriate to use the two-sided confidence interval
CI = (θ̂ − z_{α/2} σ̂_θ̂, θ̂ + z_{α/2} σ̂_θ̂).
To check the claimed correspondence, suppose Ha : θ > θ0 and we reject H0 when
TS = (θ̂ − θ0)/σ̂_θ̂ > zα.
This inequality is equivalent to
θ0 < θ̂ − zα σ̂_θ̂,
and this is exactly the condition that θ0 is outside of the lower-bound confidence interval, as we claimed above. The other two cases can be done similarly.
5.5 p-values
The p-value of a test is useful if one wants to report how strongly the evidence
in the data speaks against the null hypothesis.
Recall that we saw several times the situation when for α = 0.01 we could not reject the null hypothesis (the evidence was not strong enough), but for α = 0.05 we could reject the null. (This is because with α = 0.05 we allow ourselves to make a type I error more frequently.)
For any data sample, if we consider a very large α, then the test statistic is likely to land in the rejection region, which is very wide in this case, and we are likely to reject H0. However, as we gradually decrease α, we become more conservative, the rejection region shrinks, and at some point we switch from rejecting H0 for this data sample to saying that there is not enough evidence in the data to support the rejection. This point is called the p-value of the test.
Note especially that unlike the level and the power of the test, the p-value depends both on the test (that is, on the way we calculate the test statistic and the rejection region) and on the data. If a data sample looks more unlikely under the null hypothesis than another sample, that is, if it has a larger test statistic, then the switch from rejection to non-rejection happens later, for smaller α, and the p-value for such a data sample is smaller!
It is very easy to calculate the p-value. We just set the threshold in the
rejection region equal to the observed value of the test statistic and calculate
the probability of this rejection region under the null hypothesis.
Say, let the test have the rejection region RR : {T S > t} and let ts be
the observed value of the test statistic. Then the p-value is Pr{T S > ts|H0 }.
In practice, for large-sample tests it often boils down to evaluating the cumulative distribution function of the standard normal distribution at the observed test statistic value ts.
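In R this is a one-liner; the value ts below is a placeholder for the observed test statistic:

    ts <- 1.667                  # observed z-test statistic (placeholder value)
    1 - pnorm(ts)                # p-value for the right-tailed test RR = {TS > t}
    pnorm(ts)                    # p-value for a left-tailed test
    2 * (1 - pnorm(abs(ts)))     # p-value for a two-sided test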
Benefits of the p-value:
• It answers the question: “Assuming that the null is true, what is the
chance of observing a sample like this, or even worse?”
This means that the null hypothesis would be rejected by tests with α = 5%, α = 1%, and even with α = 0.2%, although it could not be rejected at the level of α = 0.1%. We would conclude that the sample appears to be highly contradictory to the null hypothesis, and so there is compelling evidence that the population mean zinc mass exceeds 2.0 g.
p-values for large sample tests (aka z-tests)
4. The test statistic is
TS = (θ̂ − θ0)/σ_θ̂ ∼ N(0, 1) under H0,
and the observed test statistic using the given data is ts;
5.6 Small-sample hypothesis tests for population
means
If the sample size is small (n < 30), then we cannot hope that the Central Limit Theorem will ensure that the test statistic
TS = (θ̂ − θ0)/σ̂_θ̂
has the standard normal distribution. In this case the only way out is to make sure that the data is at least approximately normal, perhaps by applying an appropriate transformation to the data.
From now on, in this section we will assume that the data is normal. Even in this case, the distribution of the test statistic differs significantly from the normal distribution. This means that when we calculate the probabilities of type I and II errors, or when we calculate the p-values, we cannot calculate probabilities like P(TS > x) using the standard normal distribution. Instead, we use the fact that
TS = (θ̂ − θ0)/σ_θ̂ ∼ t-distribution if H0 is true (θ = θ0).
The degrees of freedom for the t-distribution depend on whether θ = µ or θ = µ₁ − µ₂.
If we are interested in testing H0 : µ = µ0, then we use the test statistic
TS = (Ȳ − µ0)/(S/√n) ∼ t_{n−1} if H0 is true (µ = µ0).
If we have two samples, X₁, . . . , X_{n₁} and Y₁, . . . , Y_{n₂}, with population means µ₁ and µ₂, respectively, then we are often interested in testing H0 : µ₁ − µ₂ = θ0.
Figure 5.1: Degrees of freedom for the test statistic when the variances are not
the same
Here, two different situations are possible. A simpler situation is when we can assume that the variances in the two samples are the same: σ₁² = σ₂² = σ². (We could check this assumption by an appropriate test!) Then we can use the test statistic
TS = (X̄ − Ȳ − θ0) / (S_p √(1/n₁ + 1/n₂)) ∼ t_{n₁+n₂−2} if H0 is true (θ = θ0),
where S_p is the pooled sample standard deviation used in the example below. If we cannot assume that the variances are equal, then we use
TS = (X̄ − Ȳ − θ0) / √(S₁²/n₁ + S₂²/n₂),
where S₁² and S₂² are the sample variances in the two samples. It turns out that the distribution of this TS is approximately a t-distribution, but the formula for the degrees of freedom is quite complicated. See Figure 5.1.
Note, however, that some researchers have suggested that this procedure should be used whenever there are doubts about whether the variances are the same.
After the distribution of the test statistic is determined, the rest is simple: one only needs to replace zα with the tα that is calculated for the t-distribution with the correct number of degrees of freedom.
• Ha : θ > θ0 ⇔ RR : {T S > tα }
The quantities tα can be found from the tables or by using the R com-
mand qt. In particular tα for ν degrees of freedom can be calculated as
qt(1 − α, ν).
The calculation of the probability of type II error β and the power 1−β is
in fact very similar to the calculations in the case of the normal distribution.
Again, one only needs to use the t-distribution with the correct number of
degrees of freedom instead of the standard normal distribution.
This is also true for p-values: for example, for a right-tailed test the p-value is P(T > ts), where the test statistic has the t-distribution with an appropriate number of degrees of freedom. The tables only give a range for the p-value. For a precise probability, one must use the R command pt(a, df), whose output is P(T < a) where T ∼ t(df).
Example 5.6.1. An article in American Demographics investigated consumer habits at the mall. We tend to spend the most money when shopping on weekends, particularly on Sundays between 4:00 and 6:00 PM, while Wednesday-morning shoppers spend the least.
Independent random samples of weekend and weekday shoppers were selected, and the amount spent per trip to the mall was recorded:
Weekends: sample size 20, sample mean 78, sample standard deviation 22.
Weekdays: sample size 20, sample mean 67, sample standard deviation 20.
• Is there sufficient evidence to claim that there is a difference in the
average amount spent per trip on weekends and weekdays? Use α =
0.05.
We use the pooled two-sample t statistic
TS = (X̄ − Ȳ) / (S_p √(1/n₁ + 1/n₂)),
where
S_p = √(S_p²) = √(((n₁ − 1)s₁² + (n₂ − 1)s₂²)/(n₁ + n₂ − 2)) = √((19 × 22² + 19 × 20²)/(20 + 20 − 2)) = 21.0238.
So,
TS = (78 − 67) / (21.0238 √(1/20 + 1/20)) = 1.654556.
For the normal distribution z_{0.05} = 1.645, so the test would reject the null hypothesis if the sample were large.
If we want to test H0 at the level α = 0.05 and use the t-distribution, we want tα for ν = 20 + 20 − 2 = 38 degrees of freedom.
We calculate it as qt(.95, 38) = 1.685954. Since 1.654556 < 1.685954, we conclude that the evidence is not sufficient to reject H0 at the level α = 0.05.
The p-value can be calculated as 1 − pt(1.654556, 38) = 5.31%.
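The whole calculation can be reproduced in a few lines of R working from the summary statistics (t.test itself would require the raw data, which we do not have):

    n1 <- 20; n2 <- 20; xbar <- 78; ybar <- 67; s1 <- 22; s2 <- 20
    sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))  # pooled sd, 21.0238
    ts <- (xbar - ybar) / (sp * sqrt(1/n1 + 1/n2))                   # 1.654556
    qt(0.95, df = n1 + n2 - 2)                                       # cutoff 1.685954
    1 - pt(ts, df = n1 + n2 - 2)                                     # p-value, about 5.31%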
5.7 Hypothesis testing for population variances
Occasionally, we are interested in testing variances. The most frequent example is when we test the equality of variances in two samples, in order to see if the corresponding populations are really different in a certain aspect. Sometimes we might be interested to see that the variance does not exceed a certain threshold. This problem arises in quality control.
Let us consider first testing the hypothesis H0 : σ² = σ0². If the sample is large, then we can use
TS = (S² − σ0²)/σ_{S²},
with a suitable estimator for σ_{S²}.
This approach does not generalize easily to small samples, since it is difficult to calculate the exact distribution of the ratio. So we look at an alternative method, which works both for large and small samples.
So let us assume that X1 , . . . , Xn are from a Normal distribution N (µ, σ 2 )
with unknown mean µ and unknown variance σ 2 .
We use the result that we know from the section about variance estimation:
TS = (n − 1)S²/σ0² ∼ χ²(n − 1) when H0 is true.
For the case when the alternative hypothesis is Ha : σ² > σ0², the rejection region is similar to the RR in z- and t-tests:
RR = {TS > χ²_α(n − 1)}.
For the alternative hypothesis Ha : σ² < σ0², there is some difference from the case of z- or t-tests, because the χ² distribution is not symmetric relative to zero. Instead of using −χ²_α(n − 1) as a threshold, we use χ²_{1−α}(n − 1). So the rejection region in this case is
RR = {TS < χ²_{1−α}(n − 1)} if Ha : σ² < σ0².
For the two-sided alternative, imagine gradually decreasing α until one of the two thresholds, χ²_{α/2} or χ²_{1−α/2}, hits the value ts that was realized in the sample. At this moment the test stops rejecting the null hypothesis. So if χ²_{α/2} hits ts first, then we conclude that this critical α* equals the p-value, and in this case the p-value equals 2P(TS > ts). Note that at that moment P(TS < ts) = 1 − α*/2, so 2P(TS < ts) = 2 − α* > 1.
If χ²_{1−α/2} hits ts first, then by a similar argument we find that the p-value equals 2P(TS < ts).
In case you pick the wrong one between P(TS > ts) and P(TS < ts), your answer will exceed 1, which is an immediate warning sign, because a probability cannot be greater than 1.
Example 5.7.1. An experimenter was convinced that the variability in his
measuring equipment results in a standard deviation of 2. Sixteen measure-
ments yielded s2 = 6.1. Do the data disagree with his claim? Determine
the p-value for the test. What would you conclude if you chose α = 0.05?
• H0 : σ² = 4 and Ha : σ² ≠ 4.
• Test statistic: TS = (n − 1)S²/4.
• Observed value of the test statistic: ts = (16 − 1) × 6.1/4 = 22.875.
• Since P(TS > 22.875) = 0.0868, the p-value = 2 × 0.0868 = 17.36% > 5%.
We conclude that the data do not give enough evidence to disagree with his
claim.
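In R, the same computation reads (pchisq is the base-R χ² cdf; taking the minimum of the two tails guards against picking the wrong one):

    ts <- (16 - 1) * 6.1 / 4                   # 22.875
    right <- 1 - pchisq(ts, df = 15)           # 0.0868
    2 * min(right, pchisq(ts, df = 15))        # two-sided p-value, 17.36%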
Now consider the test for equality of variances in two populations. In some situations, a researcher is interested to know whether the variation of the data in two samples indicates different variances in the corresponding populations. A natural test statistic is the ratio of the sample variances:
TS = S₁²/S₂².
Under the null hypothesis this ratio is distributed as so-called F -distribution
with n1 − 1 and n2 − 1 degrees of freedom. (F is for Fisher, who designed
the test.)
So we can use the Rejection Region
RR = {T S > Fα }
Notice also the degrees of freedom in the numerator and denominator of the thresholds!
Are you willing to assume that the underlying population variances are equal? Test σ₁² = σ₂² against σ₁² > σ₂² at α = 0.01. What is the p-value?
The test statistic is
TS = 0.017²/0.006² = 8.027778,
and
p-value = 1 − pf(8.027778, 9, 12) = 0.07%.
So we would reject the null at the 1% level.
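In R (the degrees of freedom 9 and 12 correspond to sample sizes n₁ = 10 and n₂ = 13):

    ts <- 0.017^2 / 0.006^2        # 8.027778
    1 - pf(ts, df1 = 9, df2 = 12)  # p-value, about 0.0007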
5.8 Neyman-Pearson Lemma and Uniformly Most Powerful Tests
For a test, the natural measure of its goodness is its power. If we have two tests with the same α, we will prefer the test that has the larger power.
Note, however, that the power of the test is a function of the parameter under the alternative hypothesis. So if one test has larger power than another under one value of the parameter θa, it can actually have smaller power under another value of θa.
Recall that the power of the test = 1 − β = Pr(reject H0 | Ha is true). It can be calculated only if Ha is a simple hypothesis.
First, let us not be too ambitious, and try to find the test with the maximum power when the significance level is α and when the alternative hypothesis is simple, Ha : θ = θa.
In other words, if the alternative hypothesis is simple, θ = θa, the best test statistic is
TS = L(θ0|X)/L(θa|X).
This TS measures how likely the data is under the null hypothesis compared with its likelihood under the alternative hypothesis. You reject H0 if the ratio is too small, with the threshold chosen so that the level of this test is α. The theorem says that this test has the largest power to reject H0 (among the tests with the same α) provided the alternative is fixed at θa.
In order to construct the Neyman-Pearson test, we need to know the distribution of the test statistic, which is not always easy. Here is an example where the distribution of L(θ0)/L(θa) is not too hard to find.
Example 5.8.3. Suppose we have just one observation in the sample, Y ∼ f(y|θ) = θ y^{θ−1} 1_{0<y<1}. Find the most powerful test for H0 : θ = 2 against Ha : θ = 1 at significance level α = 0.05.
The likelihood here is simply L(θ|y) = θ y^{θ−1}. (Only one observation, so no products.)
The ratio is
L(θ0|y)/L(θa|y) = (θ0/θa) y^{θ0−θa}.
Hence, the threshold in the test should be set to
t = (θ0/θa) α^{(θ0−θa)/θ0}.
In our example, the test statistic is (θ0/θa) Y^{θ0−θa} = 2Y, and
t = 2 × 0.05^{(2−1)/2} = 2 × 0.2236.
Equivalently, the most powerful test with α = 0.05 in this case has RR = {Y < 0.2236}.
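A quick numerical check of the level (and power) of this test: under θ, Y has cdf P(Y ≤ y) = y^θ on (0, 1), so in R

    thr <- 0.05^(1/2)  # the threshold 0.2236
    thr^2              # level: P(Y < thr | theta = 2) = thr^2 = 0.05
    thr                # power: P(Y < thr | theta = 1) = thr = 0.2236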
In N-P lemma, the test is guaranteed to be most powerful level-α test
against a specific alternative hypothesis. What if we try a different alterna-
tive hypothesis?
Consider the previous example. We found that the most powerful test has the rejection region
{(θ0/θa) Y^{θ0−θa} < (θ0/θa) α^{(θ0−θa)/θ0}} = {Y^{θ0−θa} < α^{(θ0−θa)/θ0}}.
If θa < θ0, then we can take the power 1/(θ0 − θa) on both sides and get that the rejection region is
{Y < α^{1/θ0}}.
It is the same for all θa < θ0. But if θa > θ0, then we would get a completely different test:
{Y > (1 − α)^{1/θ0}}.
it is actually a different test. We call it the Neyman-Pearson test because it was obtained by applying the Neyman-Pearson lemma to a specific θa < θ0. It just happened in this example that all these tests coincide for all θa < θ0.) However, in the case of the two-sided hypothesis, Ha : θ ≠ θ0, Neyman-Pearson is not helpful, because it gives two different tests depending on whether θa < θ0 or θa > θ0. In fact, in this case there is no uniformly most powerful test.
In many cases, even for one-sided hypotheses, UMP tests do not exist. However, they are especially rare if the alternative is two-sided, Ha : θ ≠ θ0, or if we test a vector parameter and the alternative hypothesis is not simple. (That is, the alternative hypothesis is not just a specific value θ⃗a, for which the Neyman-Pearson lemma would give us a most powerful test.)
Example 5.8.4 (Neyman-Pearson lemma applied to normal data). Y₁, . . . , Yₙ ∼ N(µ, σ²). Consider H0 : µ = µ0 versus Ha : µ = µ1. We assume that σ² is KNOWN (fixed). Otherwise the N-P lemma is not applicable: the hypothesis is not simple. What is the most powerful test with level α?
The likelihood is
L(µ) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) ∑_{i=1}^n (y_i − µ)²}.
The ratio of the likelihoods defined in the Neyman-Pearson lemma is:
L(µ0)/L(µ1) = [(2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^n (y_i − µ0)²)] / [(2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^n (y_i − µ1)²)]
= exp{−(1/(2σ²)) [∑_{i=1}^n (y_i − µ0)² − ∑_{i=1}^n (y_i − µ1)²]}
= exp{−(1/(2σ²)) ∑_{i=1}^n [2y_i(µ1 − µ0) + µ0² − µ1²]}.
The Neyman Pearson Lemma says that the most powerful test is the one
with some appropriate threshold t which rejects H0 when
exp{−(1/(2σ²)) ∑_{i=1}^n [2y_i(µ1 − µ0) + µ0² − µ1²]} < t
⇐⇒ −(1/(2σ²)) ∑_{i=1}^n [2y_i(µ1 − µ0) + µ0² − µ1²] < t′
⇐⇒ ∑_{i=1}^n [2y_i(µ1 − µ0) + µ0² − µ1²] > t″.
If µ1 > µ0, this is equivalent to ȳ > A, and if µ1 < µ0, it is equivalent to ȳ < B, where the thresholds A and B are chosen so that the level of the test is α.
Indeed, such a test is exactly what our intuition would have suggested. The NP lemma provides theoretical justification for this test. Is this test uniformly most powerful among all the tests against the composite alternative hypothesis Ha : µ > µ0? Against the composite Ha : µ < µ0? Can we construct uniformly most powerful tests for Ha : µ ≠ µ0?
5.9 Likelihood ratio test
One approach to the design of a statistical test is to find an estimator for a parameter that encapsulates our hypothesis, and then use our knowledge about the distribution of this estimator. For example, if we test the hypothesis θ1 = θ2, then we can find the ML estimator for the difference θ1 − θ2 and use the fact that in large samples this estimator is normal and that it is possible to calculate its variance. This is a very useful approach. Its deficiency is that we need to obtain the estimator and its variance before we are able to construct the test. In addition, the variance often depends on the true value of the parameter θ, so if our null hypothesis is not simple but has some nuisance parameters, then we are in trouble.
The second approach is based on the Neyman-Pearson lemma and uses the ratio L(θ⃗0|Y⃗)/L(θ⃗a|Y⃗) as the test statistic. This gives the most powerful test. The deficiency is that we need to find the distribution of this test statistic. Additionally, this approach is quite restrictive: it requires both the null and alternative hypotheses to be simple and not composite.
In this section we consider a third alternative, which is often very convenient since it works for composite hypotheses, and in the case of large samples requires essentially no calculation except the maximization of some likelihood functions.
Let Ω0 be the set of parameters that satisfy our null hypothesis. For
example, it can be that Ω0 = the set of all parameters θ⃗ such that θ1 = θ2 ,
and all other parameters can be arbitrary. Then, let Ωa be the set of all
possible alternative values for parameters. For example, our alternative can
be that θ1 > θ2 and all other θi are arbitrary. The alternative hypothesis is
typically a composite hypothesis in practical applications.
Define the total feasible parameter set Ω = Ω0 ∪ Ωa. Define the likelihood ratio statistic by
λ = max_{θ⃗∈Ω0} L(θ⃗|Y⃗) / max_{θ⃗∈Ω} L(θ⃗|Y⃗),
where L(θ⃗|Y⃗) is the likelihood of the vector parameter θ⃗ given that the observed data is Y⃗ = (Y₁, . . . , Yₙ).
Use the rejection region RR = {λ < k}, where the threshold k is deter-
mined by the requirement that the level of the test is α.
This appears to be not especially useful, since we need to do two constrained maximizations and we do not know the distribution of λ, so we cannot calculate the threshold k. Indeed, λ is a quite complicated function of the data: it is the ratio of two constrained maxima of the likelihoods, which are themselves complicated functions!
The power of this method is that we can do the maximization numerically, and this is relatively easy given enough computing power. The most important fact, however, is that k can be calculated efficiently when the data sample is large.
Conceptually, the likelihood ratio test makes a lot of sense.
The key result is Wilks’ Theorem: under suitable regularity conditions, if H0 is true, then in large samples −2 log λ is approximately distributed as a χ² random variable, whose number of degrees of freedom equals the excess in the number of equality constraints that define Ω0 over the number of equality constraints that define Ω. (Note that the inequality constraints do not count; they are not important for large sample analysis.)
Therefore, the rejection region for the likelihood ratio test in large samples has a very simple form:
RR = {−2 log λ > χ²_α(df)}.
The proof of Wilks’ Theorem is not easy and will not be given here.
Example 5.9.2. Suppose that an engineer wishes to compare the number
of complaints per week filed by union stewards for two different shifts at a
manufacturing plant. One hundred independent observations on the number
of complaints gave means x = 20 for shift 1 and y = 22 for shift 2. Assume
that the number of complaints per week on the i-th shift has a Poisson
distribution with mean λi , for i = 1, 2. Use the likelihood ratio method to
test H0 : λ1 = λ2 against Ha : λ1 ̸= λ2 with α ≈ 0.01.
By taking the product of the individual density functions we find the likelihood function:
L(λ1, λ2) = (1/C) e^{−nλ1} λ1^{∑_{i=1}^n x_i} × e^{−nλ2} λ2^{∑_{i=1}^n y_i},
where C = x1! . . . xn! y1! . . . yn! and n = 100.
Here we will be able to do maximizations analytically, although it could
also be done numerically.
Log-likelihood function:
ℓ(λ1, λ2) = −log C + ∑_{i=1}^n x_i log λ1 − nλ1 + ∑_{i=1}^n y_i log λ2 − nλ2.
Under H0 we have λ1 = λ2 = λ, and the constrained maximum likelihood estimator is
λ̂_ML = (x̄ + ȳ)/2 = 21.
If we do not assume that λ1 = λ2, then the unconstrained maximum likelihood estimator of the vector (λ1, λ2) is (after a calculation)
λ̂1,ML = x̄ = 20 and λ̂2,ML = ȳ = 22.
Then the calculation gives:
−2 log(likelihood ratio) = −2 [ℓ(λ̂_ML, λ̂_ML) − ℓ(λ̂1,ML, λ̂2,ML)]
= −2 [(−log C + n x̄ log λ̂_ML + n ȳ log λ̂_ML − 2n λ̂_ML) − (−log C + n x̄ log λ̂1,ML + n ȳ log λ̂2,ML − n λ̂1,ML − n λ̂2,ML)]
= 9.5274.
By Wilks’ theorem we should use the rejection region RR = {−2 log λ > χ²_{α=0.01, df=1} = 6.635}. Hence we reject H0 : λ1 = λ2 at significance level α = 0.01.
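The same number can be obtained in a couple of lines of R (dropping the constant log C, which cancels in the difference):

    n <- 100; xbar <- 20; ybar <- 22
    ll <- function(l1, l2) n * xbar * log(l1) - n * l1 + n * ybar * log(l2) - n * l2
    stat <- -2 * (ll(21, 21) - ll(xbar, ybar))  # 9.5274 (up to rounding)
    qchisq(0.99, df = 1)                        # 6.635, the cutoff
    1 - pchisq(stat, df = 1)                    # p-value, about 0.002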
In fact, in this example, we can also use the first method, based on the estimator of the parameter λ1 − λ2. Indeed, since the parameter λ is the mean of the Poisson distribution, the problem can be thought of as a problem about the equality of means in two samples. The difficulty is that the sample standard deviations are not given. However, we know the distribution of the data observations (Poisson).
Hence, under the null hypothesis, we have that the test statistic
TS = (ȳ − x̄)/σ_{ȳ−x̄} ∼ N(0, 1),
where under H0 the variance of ȳ − x̄ is estimated as λ̂/n + λ̂/n = 2 × 21/100. Therefore,
TS = (22 − 20)/√(2 × 21/100) = (2 × 10)/√42 = 3.086.
What are the likelihood functions under the hypotheses H0 and Ha, respectively?
Under H0, the likelihood is
(n choose x₁, . . . , x₆) · (1/6)ⁿ.
Under Ha, we calculate that p₁ = p₃ = p₅ = 1/3 − θ and p₂ = p₄ = p₆ = θ, and the likelihood is
(n choose x₁, . . . , x₆) · (1/3 − θ)^{x₁+x₃+x₅} θ^{x₂+x₄+x₆}.
Suppose that X = X₂ + X₄ + X₆ is the number of customers in the sample that select an even-numbered pump. What is the maximum likelihood estimator of the parameter θ under the alternative hypothesis Ha?
Maximization of the likelihood under Ha is equivalent to maximization of the log-likelihood, that is, of
ℓ(θ) = c + (n − X) log(1/3 − θ) + X log θ.
Then,
ℓ′(θ) = −(n − X)/(1/3 − θ) + X/θ = 0,
−θn + Xθ + X/3 − Xθ = 0,
θ̂_MLE = X/(3n).
Express the likelihood ratio statistic λ in terms of X.
Under Ha, the likelihood is
(n choose x₁, . . . , x₆) · (1/3 − θ)^{n−X} θ^X.
Substituting the MLE estimate of θ in the definition of λ, we get:
λ = (1/6)ⁿ / [(1/3 − X/(3n))^{n−X} (X/(3n))^X].
The rejection region for the likelihood ratio test is {λ ≤ t}, where t is a threshold. This is the same as {−log λ ≥ t′}, and we can rewrite this region in our case as
(n − X) log((n − X)/(3n)) + X log(X/(3n)) ≥ t″,
or
(n − X) log(n − X) + X log X ≥ t‴.
The second derivative of the function on the left-hand side is
1/(n − X) + 1/X > 0,
which means that this function is convex, and so the rejection region is a union of two tails, {X ≤ c1} ∪ {X ≥ c2}, for some constants c1 and c2.
Let n = 10 and c = 9. Determine the significance level α of the test and its power when θ = p₂ = p₄ = p₆ = 1/10.
Under Ha, the probability that one of the pumps 2, 4, or 6 is visited equals 3θ. Hence X (the number of visits to these pumps) is distributed as a binomial with success probability 3θ = 0.3. Under H0 it is binomial with probability 0.5. Hence α and the power can be computed from the corresponding binomial distributions (for example, with pbinom in R).
5.10 Quizzes
Quiz 5.10.3. We are interested in this problem: “Is the proportion of babies born male different from 50%?” In a sample of 200 births, we found that 96 babies born were male. We tested the claim using a test with the level of significance 1% and found that the conclusion is “Fail to reject H0.” What could we use as an interpretation?
Quiz 5.10.5. An educator is interested in determining the number of hours of TV watched by 4-year-old children. She wants to show that the average number of hours watched per day is more than 4 hours. To test her claim she took a random sample of 100 youngsters. Which of the following values for the sample mean would have the largest p-value associated with it?
A. 2
B. 3.9
C. 4
D. 5
Quiz 5.10.7. Suppose that I am interested in testing H0 : µ =
µ0 against Ha : µ ̸= µ0 . I calculate the type II error proba-
bility β using the alternative value of parameter µa . Then,
β will be smaller if I
A. Decrease the type I error probability α;
B. Decrease the sample size n;
C. Decrease the distance between µa and µ0 ;
D. None of the above is correct.
Chapter 6
Linear regression
yi = β0 + β1 xi + εi      (6.1)
Here β0 and β1 are unknown parameters that we want to estimate. The quantities xi, i = 1, . . . , n are known constants, which are called explanatory variables, or independent variables. The variables εi are error terms. They are responsible for randomness in the model. They are always assumed to have zero mean: E(εi) = 0. They are also often, but not always, assumed to have the same unknown variance σ² for every i: E(εi²) = σ². Even more restrictively, they are often assumed to be normally distributed: εi ∼ N(0, σ²).
The values yi are random since they are functions of εi. (We could write them Yi, following our usual convention about random variables.) They are usually called the response variable or dependent variable. So, y1, …, yn are n independent observations of the response variable Y.
If we want to relate this model to the models in previous chapters, assume that the error terms are normally distributed and note that then we can think about y1, . . . , yn as a sample in which yi is drawn from the distribution N(β0 + β1 xi, σ²). The observations in this sample are independent, but they are not identically distributed! Indeed, the mean of the i-th observation changes with i: E(yi) = β0 + β1 xi.
The model (6.1) is called the simple regression model. It is often written in a short form that omits the subscript i:
Y = β0 + β1 x + ε.
The multiple linear regression model allows several explanatory variables:
Y = β0 + β1 x^(1) + . . . + βk x^(k) + ε.
This model is very flexible and can be used to model non-linear dependencies as well. For example, if we believe that the response Y depends on a single explanatory variable x in a non-linear, say quadratic, way, we can take x^(1) = x and x^(2) = x².
Figure 6.1: A scatter plot for the weight (in pounds) and the miles per gallon
(MPG) of a car in a random sample of 25 vehicles.
Goals
As usual, our goals are to estimate the parameters β0, …, βk and test some hypotheses about their values.
There is another goal, which we have not seen before. We might be interested in predicting the response Y for some other values of x. In addition, we might be interested in having some kind of a confidence interval for our prediction.
Definition 6.2.1. The values of βb0 and βb1 which solve the problem (6.3)
are called the (ordinary) Least Squares estimators of the simple regression
model (6.1).
It is a bit simpler to do it for a modified model, in which the explanatory variables are centered by subtracting their mean x̄:
yi = α0 + β1 (xi − x̄) + εi.
Theorem 6.2.2. The least squares estimators are given by the following formulas:
α̂0 = ȳ,
β̂1 = Sxy/Sxx,
where
Sxy = ∑_{i=1}^n (xi − x̄)(yi − ȳ),
Sxx = ∑_{i=1}^n (xi − x̄)².
This implies that for our original problem, we also have the following least squares estimator for the parameter β0:
β̂0 = α̂0 − β̂1 x̄ = ȳ − β̂1 x̄.
Proof.
∂SSE/∂α̂0 = ∂{∑_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))]²}/∂α̂0
= 2 ∑_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))] · (−1).
• Since ∑_{i=1}^n (xi − x̄) = 0 (we have centered them, remember?), setting this derivative to zero gives 0 = ∑_{i=1}^n yi − n α̂0, so
α̂0 = ȳ.
• Similarly,
∂SSE/∂β̂1 = ∂{∑_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))]²}/∂β̂1
= 2 ∑_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))] · (−(xi − x̄)).
• Setting this to zero and using ∑_{i=1}^n (xi − x̄) = 0 to drop the α̂0 term, we get ∑_{i=1}^n yi (xi − x̄) = β̂1 ∑_{i=1}^n (xi − x̄)².
• Thus
β̂1 = ∑_{i=1}^n (yi − ȳ)(xi − x̄) / ∑_{i=1}^n (xi − x̄)².
Theorem 6.2.3. Assume that the error terms in the simple linear regression model yi = β0 + β1 xi + εi have the properties E εi = 0 and Var(εi) = σ². Then,
1. E β̂1 = β1.
2. Var(β̂1) = σ²/Sxx.
Then, we have
E β̂1 = E(Sxy/Sxx) = (1/Sxx) ∑ (xi − x̄) E yi
= (1/Sxx) ∑ (xi − x̄)(α0 + β1(xi − x̄)) = β1 (1/Sxx) ∑ (xi − x̄)(xi − x̄)
= β1,
where the α0 term vanishes because ∑ (xi − x̄) = 0.
Finally, if the εi are normal, then the yi are also normal. Note that β̂1 is a weighted sum of the yi and the coefficients in this sum are non-random. We know that this implies that the sum itself is normal. This shows that β̂1 is normal if the εi are normal.
Theorem 6.2.4. Assume that the error terms in the simple linear regression model yi = α0 + β1(xi − x̄) + εi have the properties E εi = 0 and Var(εi) = σ². Then,
1. E α̂0 = α0.
2. Var(α̂0) = σ²/n.
3. Cov(α̂0, β̂1) = 0.
If, in addition, the εi are normal, then α̂0 is also normal.
Since β̂0 = α̂0 − β̂1 x̄, we have a similar result for β0.
Theorem 6.2.5. Assume that the error terms in the simple linear regression model yi = β0 + β1 xi + εi have the properties E εi = 0 and Var(εi) = σ². Then,
1. E β̂0 = β0.
2. Var(β̂0) = σ² (1/n + (x̄)²/Sxx).
3. Cov(β̂0, β̂1) = −σ² x̄/Sxx.
This result can be obtained from the formula β̂0 = α̂0 − β̂1 x̄ and the previous theorems through an easy calculation, and is left as an exercise.
We are almost ready to write down the confidence intervals for the estimates β̂1, α̂0 and β̂0. However, the variances that we have just calculated include σ², which is not known and should be estimated. It turns out that our previous definition of S² as an estimator of σ² is inappropriate, because the yi are no longer identically distributed. A suitable estimator is as follows:
σ̂² := (1/(n − 2)) SSE ≡ (1/(n − 2)) ∑_{i=1}^n (yi − ŷi)² ≡ (1/(n − 2)) ∑_{i=1}^n (yi − β̂0 − β̂1 xi)².
6.2.3 Confidence intervals and hypothesis tests for coefficients
Once we know the variances of the parameters, it is easy to construct the confidence intervals. The procedure is essentially the same as what we did when we estimated the mean of a sample.
For example, a large-sample two-sided confidence interval for the parameter β1 can be written as follows:
(β̂1 − z_{α/2} σ̂/√Sxx, β̂1 + z_{α/2} σ̂/√Sxx),
where 1 − α is the confidence level.
If the sample is small but we assume that the error terms are normal,
we can use our previous theorems to conclude that
$$\frac{\hat\beta_1 - \beta_1}{\hat\sigma / \sqrt{S_{xx}}}$$
has the $t$ distribution with $n - 2$ degrees of freedom. To test the null
hypothesis $H_0: \beta_1 = \beta_1^{(0)}$, we form the statistic
$$T = \frac{\hat\beta_1 - \beta_1^{(0)}}{\hat\sigma / \sqrt{S_{xx}}}$$
and use this test statistic to test the null hypothesis against various alter-
natives. If the sample is large ($n > 30$), then $T$ is distributed approximately
as a standard normal random variable. If the sample is small, then we rely
on the assumption that the $\varepsilon_i$ have normal distribution, and then $T$ has the
$t$ distribution with $df = n - 2$.
Similar procedures can easily be established for the other parameters, that
is, for $\alpha_0$ or $\beta_0$. We only need to use the appropriate variance of the
estimator instead of $\hat\sigma^2 / S_{xx}$.
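A short R sketch of this procedure for $\beta_1$, continuing the simulated example (confint() and pt() are standard R functions):

    se.beta1 <- sqrt(sigma2.hat / Sxx)          # estimated standard error of beta1.hat
    tcrit <- qt(0.975, df = length(y) - 2)      # for alpha = 0.05
    beta1.hat + c(-1, 1) * tcrit * se.beta1     # 95% CI for beta1
    confint(fit, "x", level = 0.95)             # should agree

    T.stat <- beta1.hat / se.beta1              # test H0: beta1 = 0
    2 * pt(-abs(T.stat), df = length(y) - 2)    # two-sided p-value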
6.2.4 Statistical inference for the regression mean
In applications we sometimes want to make some inferences about linear
combinations of parameters. In this section we study a particular example
of this problem. Suppose we want to build the confidence interval for the
regression mean of $Y$, when $x$ is equal to a specific value $x^*$:
$$E(Y | x^*) = \beta_0 + \beta_1 x^*.$$
The natural estimator is $\hat y^* = \hat\beta_0 + \hat\beta_1 x^*$. This estimator is unbiased, because $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of $\beta_0$
and $\beta_1$. In order to build the confidence interval, we also need to calculate
its variance. It is more convenient to use the centered form of the regression
for this task:
$$\hat y^* = \hat\alpha_0 + \hat\beta_1 (x^* - \bar x).$$
Then,
$$\mathrm{Var}(\hat y^*) = \mathrm{Var}(\hat\alpha_0) + (x^* - \bar x)^2\, \mathrm{Var}(\hat\beta_1) + 2 (x^* - \bar x)\, \mathrm{Cov}(\hat\alpha_0, \hat\beta_1) = \sigma^2 \Big( \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}} \Big),$$
where we used our previous results about the variances and covariance of the
estimators $\hat\alpha_0$ and $\hat\beta_1$.
By using this information we can build the confidence intervals for $y^*$.
For example, if the sample size is large, then the two-sided confidence interval
with confidence level $1 - \alpha$ is
$$\hat y^* \pm z_{\alpha/2}\, \hat\sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}},$$
where $\hat\sigma = \sqrt{\mathrm{SSE}/(n-2)}$ is the estimate for $\sigma = \sqrt{\mathrm{Var}(\varepsilon_i)}$.
If the sample is small and the errors $\varepsilon_i$ are normal, then we can use
the $t$ distribution with $n - 2$ degrees of freedom and the confidence interval
becomes:
$$\hat y^* \pm t_{\alpha/2}^{(n-2)}\, \hat\sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}.$$
If we want to test the null hypothesis that the regression mean equals a
specific value, $H_0: E(Y | x^*) = y_0$, we can use the statistic
$$T = \frac{\hat y^* - y_0}{\hat\sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}}.$$
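Continuing the simulated example, a sketch of the interval for the regression mean at the hypothetical point $x^* = 5$, checked against R's predict():

    xstar <- 5
    yhat <- beta0.hat + beta1.hat * xstar
    se.mean <- sqrt(sigma2.hat * (1 / length(y) + (xstar - mean(x))^2 / Sxx))
    yhat + c(-1, 1) * tcrit * se.mean                                   # CI for E(Y | x = 5)
    predict(fit, newdata = data.frame(x = 5), interval = "confidence")  # should agree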
6.2.5 Prediction interval
Now suppose that we want to predict the value of a new observation $y^*$ at
$x = x^*$, and that we want an interval which covers $y^*$ with a prescribed
probability:
$$P(L \le y^* \le U) = 1 - \alpha.$$
Here $L$ and $U$ are some statistics, so they must be computable from data.
In order to construct the prediction interval we use the pivotal quantity
technique and consider
$$T = \frac{y^* - \hat y^*}{SE(y^* - \hat y^*)}.$$
Since $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of $\beta_0$ and $\beta_1$, we see that this
quantity has expectation 0.
Moreover, if the $\varepsilon_i$ are normal, then we see that $y^* - \hat y^*$ is also normal.
What is the standard error of $y^* - \hat y^*$? Note that we have
$$\mathrm{Var}(y^* - \hat y^*) = \mathrm{Var}(y^*) + \mathrm{Var}(\hat y^*),$$
because the “new” error term $\varepsilon^*$ is uncorrelated with the prediction $\hat y^*$. Indeed,
the coefficients $\hat\beta_0$ and $\hat\beta_1$ were estimated using the old error terms $\varepsilon_i$, and
$x^*$ is not random.
We calculated the variance of $\hat y^*$ in the previous section, and so we have
$$\mathrm{Var}(y^* - \hat y^*) = \sigma^2 + \sigma^2 \Big( \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}} \Big) = \sigma^2 \Big( 1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}} \Big).$$
It follows that
$$Z = \frac{y^* - \hat y^*}{\sigma \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}}$$
is a standard normal random variable. Replacing $\sigma$ with its estimate $\hat\sigma$, we
obtain the pivotal quantity
$$T = \frac{y^* - \hat y^*}{\hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}},$$
which, for normal errors, has the $t$ distribution with $n - 2$ degrees of freedom.
Hence the prediction interval is
$$\hat y^* \pm t_{\alpha/2}^{(n-2)}\, \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}.$$
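The same computation for the prediction interval, again at the hypothetical point $x^* = 5$:

    se.pred <- sqrt(sigma2.hat * (1 + 1 / length(y) + (xstar - mean(x))^2 / Sxx))
    yhat + c(-1, 1) * tcrit * se.pred                                   # prediction interval
    predict(fit, newdata = data.frame(x = 5), interval = "prediction")  # should agree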
6.2.6 Correlation and R-squared
Sometimes, xi can be interpreted as observed values of some random quan-
tity X. That is, we have n observations (xi , yi ) sampled from the joint
distribution of the random quantities X and Y . In this case, the coefficient
β1 in the regression yi = β0 + β1 xi + εi can be interpreted as a measure of
dependence between Y and X.
On the other hand, we know that another measure of dependence be-
tween Y and X is the correlation coefficient:
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.$$
Since $\hat\beta_1 = S_{xy}/S_{xx}$ and the sample correlation coefficient is $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$,
we have the following relation between the estimates of the correlation
coefficient $\rho$ and the linear regression parameter $\beta_1$:
$$r = \hat\beta_1 \sqrt{\frac{S_{xx}}{S_{yy}}}.$$
So there is a clear relationship between these two measures of association.
The statistic $r^2$ (called R-squared) has another useful interpretation,
which will later be generalized to the multiple linear regression model. Namely,
it measures the goodness of fit in simple linear regression.
Indeed, it is possible to derive the following useful formula:
$$\mathrm{SSE} := \sum_i (y_i - \hat y_i)^2 = \sum_i (y_i - \bar y - \hat\beta_1 (x_i - \bar x))^2 = \sum_i (y_i - \bar y)^2 - 2\hat\beta_1 \sum_i (y_i - \bar y)(x_i - \bar x) + \hat\beta_1^2 \sum_i (x_i - \bar x)^2 = S_{yy} - \hat\beta_1^2 S_{xx} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}.$$
Now, $S_{yy} = \sum_i (y_i - \bar y)^2$ can be thought of as the variation in the response
variable if no explanatory variable is used, and SSE is the variation in the
response after the explanatory variable is brought in. So the difference is
the reduction in the variation due to the explanatory variable $X$.
In particular, the proportion of the variation explained by the regression is
$$\frac{S_{yy} - \mathrm{SSE}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx} S_{yy}} = R^2.$$
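A quick numerical check of this identity in R, continuing the simulated example:

    Syy <- sum((y - mean(y))^2)
    Sxy^2 / (Sxx * Syy)        # R-squared from the formula above
    summary(fit)$r.squared     # should agree
    cor(x, y)^2                # the squared sample correlation, same value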
6.3 Multiple linear regression
6.3.1 Estimation
In the multiple linear regression model, the response depends on $p$ explanatory variables:
$$y_i = \beta_0 + \beta_1 x_i^{(1)} + \ldots + \beta_p x_i^{(p)} + \varepsilon_i,$$
where $i = 1, \ldots, n$.
In full glory, we have a big system of $n$ such equations, one for each
observation. In order to write down this system in a shorter way, we use
matrix notation.
We have (p + 1) coefficients (1 for the intercept term and p for the
independent variables), and n observations. Each observation is represented
by $y_i$ and $\mathbf{x}_i := [1, x_{i1}, x_{i2}, \ldots, x_{ip}]$. (We distinguish between column and
row vectors, and the vector $\mathbf{x}_i$ here is a row vector.)
Now stack all observations together to form a vector of responses and a
matrix of explanatory variables.
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1p} \\ 1 & x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{np} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{pmatrix},$$
so that $X$ has $p + 1$ columns, and
$$\beta = [\beta_0, \beta_1, \ldots, \beta_p]^T, \qquad \varepsilon = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n]^T.$$
The transposition can also be defined for matrices, and it has a useful property:
$(AB)^T = B^T A^T$.
Then we can rewrite the model as
$$\mathbf{y} = X\beta + \varepsilon.$$
The sum of squared errors can also be written very simply in the matrix
notation:
$$\mathrm{SSE}(\hat\beta) = \sum_{i=1}^n \{ y_i - (\hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_p x_{ip}) \}^2 = \sum_{i=1}^n \{ y_i - \mathbf{x}_i \hat\beta \}^2 = [\mathbf{y} - X\hat\beta]^T [\mathbf{y} - X\hat\beta].$$
Setting the partial derivatives with respect to the coefficients to zero, we get
$$\frac{\partial\, \mathrm{SSE}(\hat\beta)}{\partial \hat\beta_j} = -2 \sum_{i=1}^n x_{ij} \{ y_i - (\hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_p x_{ip}) \} = 0,$$
or, in matrix form,
$$\frac{\partial\, \mathrm{SSE}(\hat\beta)}{\partial \hat\beta} = -2 X^T [\mathbf{y} - X\hat\beta] = -2 X^T \mathbf{y} + 2 X^T X \hat\beta = 0.$$
Re-arranging the terms and simplifying, we obtain the normal equations
$$X^T X \hat\beta = X^T \mathbf{y},$$
and, provided that $X^T X$ is invertible,
$$\hat\beta_{LS} = (X^T X)^{-1} X^T \mathbf{y}.$$
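Here is a minimal R sketch (with made-up data for $p = 2$ predictors) that solves the normal equations directly and checks the result against lm():

    set.seed(2)
    n <- 50
    X <- cbind(1, matrix(rnorm(n * 2), n, 2))     # design matrix with intercept column
    beta <- c(1, 2, -1)
    y2 <- drop(X %*% beta) + rnorm(n)             # made-up responses

    beta.hat <- solve(t(X) %*% X, t(X) %*% y2)    # solves X^T X b = X^T y
    drop(beta.hat)

    d <- data.frame(y = y2, x1 = X[, 2], x2 = X[, 3])
    fit2 <- lm(y ~ x1 + x2, data = d)
    coef(fit2)                                    # should agree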
We can also define the (variance-)covariance matrix of a vector $\xi \in \mathbb{R}^p$
with $p$ components as the $p \times p$ matrix whose $ij$-th element is the covariance
between the $i$-th and $j$-th elements of $\xi$, i.e.
$$\mathbb{V}(\xi) = \begin{pmatrix} \mathrm{Cov}(\xi_1, \xi_1) & \mathrm{Cov}(\xi_1, \xi_2) & \ldots & \mathrm{Cov}(\xi_1, \xi_p) \\ \mathrm{Cov}(\xi_2, \xi_1) & \mathrm{Cov}(\xi_2, \xi_2) & \ldots & \mathrm{Cov}(\xi_2, \xi_p) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(\xi_p, \xi_1) & \mathrm{Cov}(\xi_p, \xi_2) & \ldots & \mathrm{Cov}(\xi_p, \xi_p) \end{pmatrix}.$$
For a fixed matrix $A$, the expectation and the covariance matrix transform by
the rules $E(A\xi) = A\, E\xi$ (rule (6.5)) and $\mathbb{V}(A\xi) = A\, \mathbb{V}(\xi)\, A^T$ (rule (6.6)).
Theorem 6.3.1. For the least squares estimator $\hat\beta = (X^T X)^{-1} X^T \mathbf{y}$, we have
$E\hat\beta = \beta$ and $\mathbb{V}(\hat\beta) = \sigma^2 (X^T X)^{-1}$.
Proof. We know that $E\mathbf{y} = X\beta$ and $\mathbb{V}\mathbf{y} = \mathbb{V}\varepsilon = \sigma^2 I_n$. Then, we can apply
rules (6.5) and (6.6) and get:
$$E\hat\beta = (X^T X)^{-1} X^T E\mathbf{y} = (X^T X)^{-1} X^T X \beta = \beta,$$
and
$$\mathbb{V}(\hat\beta) = (X^T X)^{-1} X^T\, \mathbb{V}(\mathbf{y})\, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$$
after cancellations.
In addition, if the $\varepsilon_i$ are normal, $\varepsilon_i \sim N(0, \sigma)$, then it can be shown that $\hat\beta$
is multivariate normal with mean $\beta$ and covariance matrix $\sigma^2 (X^T X)^{-1}$.
Now it is clear how to build confidence intervals and test hypotheses
for the parameters $\beta_i$. We simply notice that
$$\mathrm{Var}(\hat\beta_i) = \sigma^2 c_{ii},$$
where $c_{ii}$ is the $i$-th element on the main diagonal of the matrix $(X^T X)^{-1}$:
$$c_{ii} = \big[ (X^T X)^{-1} \big]_{ii}. \tag{6.7}$$
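Continuing the multiple-regression sketch, the standard errors $\hat\sigma \sqrt{c_{ii}}$ can be checked against R's summary(); here $\hat\sigma^2 = \mathrm{SSE}/(n - p - 1)$ is the estimator defined just below:

    XtX.inv <- solve(t(X) %*% X)
    SSE2 <- sum((y2 - drop(X %*% beta.hat))^2)
    sigma2.hat2 <- SSE2 / (n - 2 - 1)             # SSE/(n - p - 1), with p = 2
    sqrt(sigma2.hat2 * diag(XtX.inv))             # sqrt(sigma^2 c_ii) for each coefficient
    summary(fit2)$coefficients[, "Std. Error"]    # should agree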
Theorem 6.3.2. $\hat\sigma^2 := \mathrm{SSE}/(n - p - 1)$ is an unbiased estimator of $\sigma^2$.
Proof. Write $H = X(X^T X)^{-1} X^T$, so that $\hat{\mathbf{y}} = X\hat\beta = H\mathbf{y}$. Then
$$\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \| X\beta + \varepsilon - (X\beta + X(X^T X)^{-1} X^T \varepsilon) \|^2 = \| (I_n - H)\varepsilon \|^2 = \varepsilon^T (I_n - H)\varepsilon,$$
where the last equality uses the fact that $I_n - H$ is symmetric and idempotent.
Next, we use the fact that $E(\varepsilon \varepsilon^T) = \mathbb{V}(\varepsilon) = \sigma^2 I_n$, and the property that
taking expectations and taking the trace can be performed in any order, and
get:
$$E(\mathrm{SSE}) = \mathrm{tr}\big[ (I_n - X(X^T X)^{-1} X^T)\, E(\varepsilon \varepsilon^T) \big] = \sigma^2\, \mathrm{tr}\big[ I_n - X(X^T X)^{-1} X^T \big] = \sigma^2 \big( n - \mathrm{tr}\, X(X^T X)^{-1} X^T \big).$$
Now, in order to calculate $\mathrm{tr}\, X(X^T X)^{-1} X^T$, we use the cyclic property of the
trace once more and write:
$$\mathrm{tr}\, X(X^T X)^{-1} X^T = \mathrm{tr}\, (X^T X)^{-1} X^T X = \mathrm{tr}(I_{p+1}) = p + 1.$$
So,
$$E(\mathrm{SSE}) = \sigma^2 (n - p - 1),$$
and therefore $E\hat\sigma^2 = E\,\mathrm{SSE}/(n - p - 1) = \sigma^2$.
6.3.3 Confidence interval for linear functions of parameters
Suppose we want to estimate a linear combination of the parameters,
$a^t \beta = \sum_{i=0}^p a_i \beta_i$, for a fixed vector $a$. The natural estimator is $a^t \hat\beta$.
Indeed, by using one of the theorems from the probability theory course, we
have:
$$\mathrm{Var}\Big( \sum_{i=0}^p a_i \hat\beta_i \Big) = \sum_{i,j} a_i a_j\, \mathrm{Cov}(\hat\beta_i, \hat\beta_j).$$
In matrix form,
$$\mathrm{Var}(a^t \hat\beta) = a^t\, \mathbb{V}(\hat\beta)\, a = \sigma^2 a^t (X^t X)^{-1} a,$$
where we used the formula for the variance-covariance matrix of the estima-
tor $\hat\beta$.
It follows that the (large-sample) confidence interval for $a^t \beta$ can be written as
$$a^t \hat\beta \pm z_{\alpha/2}\, \hat\sigma \sqrt{a^t (X^t X)^{-1} a}.$$
An important special case is the confidence interval for the regression mean
$E(Y) = \beta_0 + \beta_1 x_*^{(1)} + \ldots + \beta_p x_*^{(p)}$ for a new observation $(x_*^{(1)}, \ldots, x_*^{(p)})$.
In this case we use the formulas we derived above by setting the vector
$$a = [1, x_*^{(1)}, \ldots, x_*^{(p)}]^t.$$
6.3.4 Prediction
Suppose we obtained a new observation with predictors (i.e., explanatory
variables) $x_*^{(1)}, x_*^{(2)}, \ldots, x_*^{(p)}$, and we want to predict the response variable $y_*$.
The natural predictor is
$$\hat y_* = \mathbf{x}_*^t \hat\beta,$$
where $\mathbf{x}_*$ is the column vector $[1, x_*^{(1)}, \ldots, x_*^{(p)}]^t$.
The expected value of this predictor equals the regression mean $E y_*$,
$$E \hat y_* = \mathbf{x}_*^t\, E\hat\beta = \mathbf{x}_*^t \beta.$$
Let us define the prediction error as the difference between the prediction
and the actual realization of the response variable,
$$e_* = y_* - \hat y_*.$$
Then the expected value of the error is zero, and it is easy to compute
its variance by using results from the previous section:
$$\mathrm{Var}(e_*) = \mathrm{Var}(y_* - \hat y_*) = \mathrm{Var}(\mathbf{x}_*^t \beta + \varepsilon_* - \mathbf{x}_*^t \hat\beta) = \mathrm{Var}(\varepsilon_*) + \mathrm{Var}(\mathbf{x}_*^t \hat\beta) = \sigma^2 + \sigma^2 \mathbf{x}_*^t (X^t X)^{-1} \mathbf{x}_*.$$
As in the simple regression case, this leads to the prediction interval
$$\hat y_* \pm t_{\alpha/2}^{(n-p-1)}\, \hat\sigma \sqrt{1 + \mathbf{x}_*^t (X^t X)^{-1} \mathbf{x}_*}.$$
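Continuing the multiple-regression sketch, a prediction interval at a hypothetical new point, checked against R's predict():

    x.new <- c(1, 0.5, -0.2)                      # hypothetical new predictor vector (with the 1)
    yhat.star <- drop(t(x.new) %*% beta.hat)
    se.star <- sqrt(sigma2.hat2 * (1 + drop(t(x.new) %*% XtX.inv %*% x.new)))
    yhat.star + c(-1, 1) * qt(0.975, df = n - 3) * se.star
    predict(fit2, newdata = data.frame(x1 = 0.5, x2 = -0.2),
            interval = "prediction")              # should agree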
6.4 Goodness of fit and a test for a reduced model
Suppose we want to compare two nested models: the reduced model
$$y = \beta_0 + \beta_1 x^{(1)} + \ldots + \beta_p x^{(p)} + \varepsilon,$$
and the complete model
$$y = \beta_0 + \beta_1 x^{(1)} + \ldots + \beta_p x^{(p)} + \beta_{p+1} x^{(p+1)} + \ldots + \beta_{p+q} x^{(p+q)} + \varepsilon.$$
In the complete model, we have q additional predictors.
Then it is reasonable to define a statistic that compares these two models.
We could define it as
$$\frac{\mathrm{SSE}_R - \mathrm{SSE}_C}{\mathrm{SSE}_R},$$
where SSER and SSEC are the sums of the squared errors computed, re-
spectively, for the reduced and complete models. This would be in complete
analogy to the definition of R2 above. However, traditionally another form
of the statistic is preferred, namely:
$$F = \frac{(\mathrm{SSE}_R - \mathrm{SSE}_C)/q}{\mathrm{SSE}_C/(n - p - q - 1)},$$
for the reason that it is useful for testing. Indeed, under the null hypothesis
that the reduced model is correct, this statistic is distributed according to
the Fisher distribution with $q$ and $n - p - q - 1$ degrees of freedom.
In particular, the null hypothesis can be rejected at significance level $\alpha$
if $F > F_\alpha$. Intuitively, the reduction in the size of the errors, as measured by
$\mathrm{SSE}_R - \mathrm{SSE}_C$, is then too large to be explained by pure chance.
Note, however, that the statement that F follows the Fisher distribution
assumes that the errors are normal and it is quite sensitive to this assump-
tion. (In other words, if the errors are not normal it can happen that the
probability of type I error is different from α, in particular it can happen
that we reject the null hypothesis too often.)
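In R, this comparison of nested models is performed by the anova() function; a sketch using the simulated data above, with $q = 1$ dropped predictor:

    fit.reduced <- lm(y ~ x1, data = d)           # reduced model: x2 removed
    anova(fit.reduced, fit2)                      # F = ((SSE_R - SSE_C)/q) / (SSE_C/(n - p - q - 1))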
Chapter 7
Categorical data
7.1 Experiment
Here we discuss experiments in which there are a finite number of out-
comes. So we have $n$ observations, and each observation $Y_i$, $i = 1, \ldots, n$, can
belong to one of $k$ possible categories.
You should recognize here the multinomial experiment with $n$ trials and
$k$ possible outcomes. We assume here that there are no predictors $x_i$. In this
case, the result of the experiment can be conveniently summarized by a list
of counts $n_1, \ldots, n_k$, where $n_j$ is the number of the $y_i$ that happened to be in
the $j$-th category. Clearly we have $n_1 + \ldots + n_k = n$.
The model that we use assumes that all $y_i$ are independent, so the counts
$n_j$, $j = 1, \ldots, k$, follow the multinomial distribution:
$$P(X_1 = n_1, \ldots, X_k = n_k) = \binom{n}{n_1, n_2, \ldots, n_k} p_1^{n_1} p_2^{n_2} \ldots p_k^{n_k},$$
where $p_j$ is the probability of the $j$-th category. Suppose we want to test the
null hypothesis that the category probabilities equal some specified values,
$H_0: p_j = p_j^{(0)}$ for all $j$, provided that the sample size $n$ is large. This can
be done by using the Pearson statistic and the $\chi^2$ goodness-of-fit test.
$$T = \sum_{j=1}^k \frac{(X_j - EX_j)^2}{EX_j} = \sum_{j=1}^k \frac{(n_j - n p_j)^2}{n p_j}.$$
Pearson proved that if the null hypothesis is correct (that is, if the data
indeed came from the multinomial distribution with the specified probabili-
ties), and if n is large, then this statistic is approximately distributed as a $\chi^2$
random variable with $k - 1$ degrees of freedom.
Usually the one-sided test is used, so that the null hypothesis is rejected
if $T > \chi^2_{\alpha, k-1}$. Alternatively, one can calculate the p-value as
$$p\text{-value} \approx P(\chi^2_{k-1} \ge T).$$
For calculations, it is sometimes convenient to expand the square and rewrite
the statistic as
$$T = \sum_{j=1}^k \frac{n_j^2}{n p_j} - n.$$
Example 7.2.1 (From Ross). If a famous person is in poor health and dying,
then perhaps anticipating his birthday would “cheer him up and therefore
improve his health and possibly decrease the chance that he will die shortly
before his birthday.” The data might therefore reveal that a famous person
is less likely to die in the months before his or her birthday and more likely
to die in the months afterward.
To test this, a sample of 1,251 (deceased) Americans was randomly cho-
sen from Who Was Who in America, and their birth and death days were
noted. (The data are taken from D. Phillips, “Death Day and Birthday: An
Unexpected Connection,” in Statistics: A Guide to the Unknown, Holden-
Day, 1972.)
We choose the following categories: outcome 1 = “death 6, 5, or 4 months
before the birthday”, outcome 2 = “death 3, 2, or 1 months before the birthday”,
outcome 3 = “death 0, 1, or 2 months after the birthday”, outcome 4 = “death
3, 4, or 5 months after the birthday”.
Outcome Number of times occurring
1 277
2 283
3 358
4 333
The null hypothesis is that all these outcomes have equal probabilities,
$p_i = 1/4$ for all $i = 1, \ldots, 4$, so each expected count is $1251/4 = 312.75$. So
we calculate the Pearson statistic:
$$T = \frac{(277 - 312.75)^2}{312.75} + \frac{(283 - 312.75)^2}{312.75} + \frac{(358 - 312.75)^2}{312.75} + \frac{(333 - 312.75)^2}{312.75} \approx 14.77.$$
Since $\chi^2_{0.05, 3} = 7.815 < 14.77$, the null hypothesis is rejected at the 5%
significance level.
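This computation can be reproduced in R with the built-in goodness-of-fit test:

    counts <- c(277, 283, 358, 333)
    chisq.test(counts, p = rep(1/4, 4))   # X-squared ~ 14.77, df = 3, p-value ~ 0.002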
The test can be extended to the case when the probabilities $p_i$ are not
completely specified by the null hypothesis but depend on unknown parameters,
provided that these parameters are estimated by the maximum likelihood
method. Namely, if the $p_i$ depend on $m$ parameters and if they were estimated
by maximum likelihood, then the statistic
$$T = \sum_{j=1}^k \frac{(n_j - n \hat p_j)^2}{n \hat p_j}$$
is approximately distributed as a $\chi^2$ random variable with $k - 1 - m$ degrees
of freedom.
For example, suppose that we recorded the weekly numbers of accidents
over a period of 30 weeks, with 95 accidents in total. We want to test the
hypothesis that the number of accidents follows the Poisson distribution.
We can divide the data into 5 categories: the outcome of the number of
accidents in a given week is in category 1 if there are 0 accidents, in category
2 if there is 1 accident, in category 3 if there are 2 or 3 accidents, in category
4 if there are 4 or 5 accidents, and in category 5 if there are more than 5
accidents. Then, if the parameter $\lambda$ of the Poisson distribution were known,
we could calculate the probability of each category:
$$p_1 = P(Y = 0) = e^{-\lambda},$$
$$p_2 = P(Y = 1) = \lambda e^{-\lambda},$$
$$p_3 = P(Y = 2) + P(Y = 3) = \frac{\lambda^2 e^{-\lambda}}{2} + \frac{\lambda^3 e^{-\lambda}}{6},$$
$$p_4 = \ldots,$$
$$p_5 = P(Y > 5) = 1 - e^{-\lambda} - \lambda e^{-\lambda} - \frac{\lambda^2 e^{-\lambda}}{2} - \frac{\lambda^3 e^{-\lambda}}{6} - \frac{\lambda^4 e^{-\lambda}}{24} - \frac{\lambda^5 e^{-\lambda}}{120}.$$
However, since $\lambda$ is unknown, it must be estimated from the data. The
maximum likelihood estimator here is the same as the method of moments
estimator:
$$\hat\lambda = \bar Y = \frac{95}{30} = 3.16667.$$
Then a computation gives
$$\hat p_1 = .04214, \quad \hat p_2 = .13346, \quad \hat p_3 = .43434, \quad \hat p_4 = .28841, \quad \hat p_5 = .10164,$$
and the Pearson statistic is
$$T = \sum_{j=1}^k \frac{(n_j - 30 \hat p_j)^2}{30 \hat p_j} = 21.99156.$$
Since one parameter was estimated ($m = 1$), the statistic should be compared
with the $\chi^2$ distribution with $k - 1 - m = 5 - 1 - 1 = 3$ degrees of freedom.
Since $\chi^2_{0.05, 3} = 7.815 < 21.99$, the hypothesis that the number of accidents
follows the Poisson distribution is rejected.
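A sketch of this computation in R (the observed category counts are not reproduced in the notes, so the statistic itself is left as a comment):

    lambda.hat <- 95 / 30
    p.hat <- c(dpois(0, lambda.hat),
               dpois(1, lambda.hat),
               dpois(2, lambda.hat) + dpois(3, lambda.hat),
               dpois(4, lambda.hat) + dpois(5, lambda.hat),
               1 - ppois(5, lambda.hat))
    round(p.hat, 5)                     # matches the probabilities above
    qchisq(0.95, df = 5 - 1 - 1)        # critical value 7.815
    # with the observed category counts n.j, the statistic would be:
    # sum((n.j - 30 * p.hat)^2 / (30 * p.hat))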
7.4 Testing independence in contingency tables
Suppose now that each observation is classified according to two criteria:
$X \in \{1, \ldots, r\}$ and $Y \in \{1, \ldots, s\}$, with joint probabilities $p_{ij} = P[X = i, Y = j]$,
and that we observe the counts $n_{ij}$ of observations in each cell $(i, j)$. Then
the marginal pmfs of $X$ and $Y$ are
$$p_i := P[X = i] = \sum_{j=1}^s p_{ij}, \qquad q_j := P[Y = j] = \sum_{i=1}^r p_{ij},$$
and the null hypothesis of independence is
$$H_0: \; p_{ij} = p_i q_j \text{ for all } i, j.$$
The marginal probabilities are estimated from the data as $\hat p_i = n_i / n$ and
$\hat q_j = m_j / n$, where $n_i$ and $m_j$ are the row and column totals, and the Pearson
statistic is
$$T = \sum_{i=1}^r \sum_{j=1}^s \frac{(n_{ij} - n \hat p_i \hat q_j)^2}{n \hat p_i \hat q_j}.$$
If $n$ is large, then the distribution of $T$ approximately equals the $\chi^2$ dis-
tribution. As usual, the number of degrees of freedom equals the number
of categories minus 1, that is $rs - 1$, reduced by the number of estimated
parameters. A calculation shows that the number of estimated parameters is
$(r - 1) + (s - 1) = r + s - 2$, and so the number of degrees of freedom is
$$df = rs - 1 - (r + s - 2) = (r - 1)(s - 1).$$
So, for a given significance level $\alpha$, the null hypothesis should be rejected if
$T > \chi^2_{\alpha, (r-1)(s-1)}$.
Example 7.4.1. A sample of 300 people was randomly chosen, and the sam-
pled individuals were classified as to their gender and political affiliation,
Democrat, Republican, or Independent. The following table displays the
resulting data.
Democrat Republican Independent Total
Women 68 56 32 156
Men 52 72 20 144
Total 120 128 52 300
The null hypothesis is that a randomly chosen individual’s gender and
political affiliation are independent.
We calculate the expected cell counts under the null hypothesis:
$$n \hat p_1 \hat q_1 = \frac{n_1 m_1}{n} = \frac{156 \times 120}{300} = 62.40, \qquad n \hat p_1 \hat q_2 = \frac{n_1 m_2}{n} = \frac{156 \times 128}{300} = 66.56,$$
$$n \hat p_1 \hat q_3 = \frac{n_1 m_3}{n} = \frac{156 \times 52}{300} = 27.04, \qquad n \hat p_2 \hat q_1 = \frac{n_2 m_1}{n} = \frac{144 \times 120}{300} = 57.60,$$
$$n \hat p_2 \hat q_2 = \frac{n_2 m_2}{n} = \frac{144 \times 128}{300} = 61.44, \qquad n \hat p_2 \hat q_3 = \frac{n_2 m_3}{n} = \frac{144 \times 52}{300} = 24.96.$$
Therefore, the test statistic is
$$T = \frac{(68 - 62.40)^2}{62.40} + \frac{(56 - 66.56)^2}{66.56} + \frac{(32 - 27.04)^2}{27.04} + \frac{(52 - 57.60)^2}{57.60} + \frac{(72 - 61.44)^2}{61.44} + \frac{(20 - 24.96)^2}{24.96} \approx 6.433.$$
This is larger than
$$\chi^2_{0.05, 2} = 5.991,$$
so we can reject the null hypothesis that gender and political affiliation are
independent. We can calculate the p-value as $P(\chi^2_2 \ge 6.433) \approx 0.040$.
In R, the table of counts can be computed using the function table,
    tb <- table(data$X, data$Y)
and the Pearson test of independence can be performed using the function chisq.test:
    chisq.test(tb)
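For completeness, here is a self-contained version of this example, entering the table of counts directly:

    tbl <- matrix(c(68, 56, 32,
                    52, 72, 20), nrow = 2, byrow = TRUE)
    chisq.test(tbl)    # X-squared ~ 6.43, df = 2, p-value ~ 0.040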
Chapter 8
Bayesian Inference
8.1 Estimation
Bayesian inference is a collection of statistical methods based on a
different statistical philosophy. The statistical model still consists of ob-
servations $X_1, \ldots, X_n$, which are random with the distribution $f(\vec X | \theta)$ that
depends on a vector of parameters $\theta$. However, while classical statistics
treats the parameters as fixed and unknown quantities, Bayesian statis-
tics models researchers' beliefs about the parameters by using probability
theory. This adds a second layer of randomness: now the parameters $\theta$ of
the data-generating distribution $f(\vec X | \theta)$ have their own probability distri-
butions, which model our beliefs about them.
In fact, the parameters are treated as random variables that have two
probability distributions: before and after the data is observed. Their distri-
bution before the data is observed is described by a prior distribution with
density (or mass) function p(θ). The posterior distribution is the distribu-
tion of parameters after the data is observed. It captures our beliefs after
they were modified by the observed data. The density (or mass) function
of the posterior distribution is the conditional density p(θ|x1 , . . . , xn ). It
can be calculated from the prior distribution and the data by using Bayes’
formula:
$$p(\theta | x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n | \theta)\, p(\theta)}{\int f(x_1, \ldots, x_n | \varphi)\, p(\varphi)\, d\varphi}.$$
The integral in the denominator is a normalizing constant: it does not
depend on $\theta$. Often, it is not written explicitly, and the formula is written
as
$$p(\theta | x_1, \ldots, x_n) \propto f(x_1, \ldots, x_n | \theta)\, p(\theta).$$
Example 8.1.1 (Coin Tosses). Suppose we toss a coin $n$ times and observe
$t$ heads, so that $f(x_1, \ldots, x_n | \theta) = \theta^t (1 - \theta)^{n-t}$, where $\theta$ is the probability
of heads. If the prior is uniform,
$$p(\theta) = 1, \quad 0 \le \theta \le 1,$$
then the posterior is
$$p(\theta | x_1, \ldots, x_n) \propto \theta^t (1 - \theta)^{n-t},$$
which is the Beta distribution with parameters $t + 1$ and $n - t + 1$. More
generally, if the prior is the Beta distribution with parameters $\alpha_1$ and $\alpha_2$,
$p(\theta) \propto \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_2 - 1}$, then the posterior is proportional to
$\theta^{t + \alpha_1 - 1} (1 - \theta)^{n - t + \alpha_2 - 1}$;
that is, it is still the beta distribution but with updated parameters $t + \alpha_1$ and
$n - t + \alpha_2$. Note that here $\alpha_1$ and $\alpha_2$ are parameters for the prior distribution
of the parameter $\theta$! Sometimes they are called hyper-parameters.
Definition 8.1.2. Let f (x|θ) be the distribution of data point x given the
parameter θ. Let p(θ) be a prior distribution for the parameter. If the
posterior distribution p(θ|x1 , . . . , xn ) has the same functional form as the
prior but with altered parameter values, then the prior p(θ) is said to be
conjugate to the distribution f (x|θ).
The conjugate priors are very convenient in modeling and are often used
in practice. Here is another example.
Exercise 8.1.3. Suppose $X_1, \ldots, X_n$ are normally distributed, $X_i \sim N(\mu, \sigma)$,
with unknown parameter $\mu$ and known $\sigma = 1$. Let the prior distribution for
$\mu$ be $N(0, \tau^{-2})$ for known $\tau^{-2}$. Then the posterior distribution for $\mu$ is
$$N\Big( \frac{\sum_{i=1}^n x_i}{n + \tau^2},\; \frac{1}{n + \tau^2} \Big).$$
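A minimal numerical illustration of this posterior update in R, with made-up data and an assumed value $\tau = 1$:

    set.seed(3)
    tau <- 1                        # assumed prior parameter
    x <- rnorm(25, mean = 0.7)      # made-up data with sigma = 1
    post.mean <- sum(x) / (length(x) + tau^2)
    post.var  <- 1 / (length(x) + tau^2)
    c(post.mean, post.var)          # posterior mean and variance for mu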
In many practical applications we need a point estimator and
not a posterior distribution of the parameter. Bayesian statistics addresses
this concern with the concept of a loss function. A loss function $L(\theta, a)$
is a measure of the loss incurred by estimating the value of the parameter
to be $a$ when its true value is $\theta$. The estimator $\hat\theta$ is chosen to minimize the
expected loss $E L(\theta, \hat\theta)$, where the expectation is taken over $\theta$ with respect
to the posterior distribution $p(\theta | \vec x)$:
$$\hat\theta = \arg\min_a E\, L(\theta, a).$$
Theorem 8.1.4. (a) Suppose that the loss function is quadratic in error:
L(θ, a) = (θ − a)2 . Then the expected loss is minimized by taking θb to be the
mean of the posterior distribution:
$$\hat\theta = \int \theta\, p(\theta | x_1, \ldots, x_n)\, d\theta.$$
(b) Suppose the loss function is the absolute value of the error: L(θ, a) =
|θ − a|. Then the expected loss is minimized by taking θb to be the median of
the posterior distribution.
Example 8.1.5 (Coin Tosses). Consider the setting of Example 8.1.1. The
posterior distribution is the Beta distribution with parameters $t + 1$ and $n - t + 1$.
So, by the properties of the Beta distribution, the posterior mean estimator is
$$\hat\theta = \frac{t + 1}{n + 2},$$
and the posterior median estimator needs to be calculated numerically. Note
that both are different from the standard estimator $\bar x = t/n$.
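A quick illustration in R with hypothetical coin-toss data (qbeta computes Beta quantiles, so it gives the posterior median):

    n.tosses <- 20; heads <- 14                      # hypothetical data: 14 heads in 20 tosses
    (heads + 1) / (n.tosses + 2)                     # posterior mean estimator
    qbeta(0.5, heads + 1, n.tosses - heads + 1)      # posterior median (numerical)
    heads / n.tosses                                 # classical estimator, for comparison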
Note that both the posterior mean and the posterior median estimators depend
on our choice of the prior distribution.
The Bayesian analogue of confidence intervals is credible sets. A set $A$
is a credible set with probability $1 - \alpha$ if the probability that the parameter
belongs to the set is $1 - \alpha$, when the probability is calculated with respect to
the posterior distribution. That is,
$$P(\theta \in A) = \int_A p(\theta | x_1, \ldots, x_n)\, d\theta = 1 - \alpha.$$
8.2 Hypothesis testing
Recall that the classical likelihood ratio test is based on the ratio of data
densities maximized over the null parameter set and over all parameter values:
$$\lambda = \frac{\sup_{\theta \in \Omega_0} f(x_1, \ldots, x_n | \theta)}{\sup_{\theta \in \Omega_0 \cup \Omega_a} f(x_1, \ldots, x_n | \theta)}.$$
In other words, we choose the value θ0 in the null hypothesis set Ω0 , which
gives the largest probability density of the data, and compare this density
with the maximum of the data probability density when the parameter is
allowed to vary over both the null (Ω0 ) and alternative (Ωa ) hypothesis sets.
We reject the null if the ratio of these two probabilities is smaller than a
threshold.
While this procedure is very reasonable, it does not have a clear proba-
bilistic justification.
In contrast, in the Bayesian inference, the null hypothesis is typically not
a single value but a big set of parameters H0 : θ ∈ Ω0 and the alternative
is the complement of this set, Ha : θ ∈ Ωa = Ωc0 . For example, we can have
the null hypothesis H0 : θ ≤ θ0 and the alternative Ha : θ > θ0 .
The null hypothesis is rejected by the Bayesian test if the ratio of pos-
terior probabilities of the hypotheses,
$$\lambda_B(x_1, \ldots, x_n) = \frac{P[\theta \in \Omega_0]}{P[\theta \in \Omega_a]} = \frac{\int_{\Omega_0} p(\theta | x_1, \ldots, x_n)\, d\theta}{\int_{\Omega_a} p(\theta | x_1, \ldots, x_n)\, d\theta} = \frac{\int_{\Omega_0} f(x_1, \ldots, x_n | \theta)\, p(\theta)\, d\theta}{\int_{\Omega_a} f(x_1, \ldots, x_n | \theta)\, p(\theta)\, d\theta},$$
is smaller than a threshold. For example, if the threshold for $\lambda_B$ is $1/t$, then
the null hypothesis is rejected when its posterior probability $P[\theta \in \Omega_0]$ is
smaller than $1/(1 + t)$.
For example, suppose that the observations $x_1, \ldots, x_n$ are exponentially
distributed with density $f(x | \theta) = \theta e^{-\theta x}$, and the prior for $\theta$ is the Gamma
distribution with shape parameter $\alpha$ and scale parameter $\beta$. Then the posterior
is again a Gamma distribution, with updated parameters
$$\alpha' = n + \alpha, \qquad \beta' = \frac{1}{\sum_{i=1}^n x_i + 1/\beta} = \frac{\beta}{\beta \sum_{i=1}^n x_i + 1}.$$
With $n = 10$, $\alpha = 3$, $\beta = 5$ and $\sum x_i = 1.26$, we calculate
$$\alpha' = n + \alpha = 10 + 3 = 13, \qquad 1/\beta' = \sum_{i=1}^n x_i + 1/\beta = 1.26 + 1/5 = 1.46,$$
and then the posterior probability of the null hypothesis can be computed from
the Gamma distribution with these updated parameters. Since this probability
is smaller than $1/(1 + t) = 1/2$, we can reject the null hypothesis.
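For illustration, here is a minimal R sketch of this test; the boundary $\theta_0$ of the null hypothesis set is not given in the notes, so a hypothetical value is assumed:

    alpha.post <- 13; rate.post <- 1.46          # posterior Gamma parameters from above
    theta0 <- 1                                  # hypothetical boundary for H0: theta <= theta0
    p0 <- pgamma(theta0, shape = alpha.post, rate = rate.post)
    p0                                           # posterior probability of H0
    p0 / (1 - p0)                                # posterior odds lambda_B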
One observation about Bayesian hypothesis testing is that the re-
sults of the tests depend on the choice of the prior distribution, and this
choice should be made carefully and be well justified. The second observation is
that in practice it is sometimes difficult to calculate the probabilities under the
posterior distribution. This calculation may involve difficult integrals.
In this respect, the classical approach is often computationally easier.