
Lecture Notes for Math 448 Statistics

Vladislav Kargin

December 23, 2022


Contents

1 Point Estimators
  1.1 Basic problem of statistical estimation
  1.2 An estimator, its bias and variance
  1.3 Consistency
  1.4 Some common unbiased estimators
    1.4.1 An estimator for the population mean µ
    1.4.2 An estimator for the population proportion p
    1.4.3 An estimator for the difference in population means µ1 − µ2
    1.4.4 An estimator for the difference in population proportions p1 − p2
    1.4.5 An estimator for the variance
  1.5 The existence of unbiased estimators
  1.6 The error of estimation and the 2-standard-error bound

2 Interval estimators
  2.1 Confidence intervals and pivotal quantities
  2.2 Asymptotic confidence intervals
  2.3 How to determine the sample size
  2.4 Small-sample confidence intervals
    2.4.1 Small sample CIs for µ and µ1 − µ2
    2.4.2 Small sample CIs for population variance σ²

3 Advanced properties of point estimators
  3.1 More about consistency of estimators
  3.2 Asymptotic normality
  3.3 Risk functions and comparison of point estimators
  3.4 Relative efficiency
  3.5 Sufficient statistics
  3.6 Rao-Blackwell Theorem and Minimum-Variance Unbiased Estimator

4 Methods of estimation
  4.1 Method of Moments Estimation
  4.2 Maximum Likelihood Estimation (MLE)
  4.3 Cramer-Rao Lower Bound and large sample properties of MLE

5 Hypothesis testing
  5.1 Basic definitions
  5.2 Calculating the Level and Power of a Test
    5.2.1 Basic examples
    5.2.2 Additional examples
  5.3 Determining the sample size
  5.4 Relation with confidence intervals
  5.5 p-values
  5.6 Small-sample hypothesis tests for population means
  5.7 Hypothesis testing for population variances
  5.8 Neyman-Pearson Lemma and Uniformly Most Powerful Tests
  5.9 Likelihood ratio test
    5.9.1 An Additional Example
  5.10 Quizzes

6 Linear statistical models and the method of least squares
  6.1 Linear regression model
  6.2 Simple linear regression
    6.2.1 Least squares estimator
    6.2.2 Properties of LS estimator
    6.2.3 Confidence intervals and hypothesis tests for coefficients
    6.2.4 Statistical inference for the regression mean
    6.2.5 Prediction interval
    6.2.6 Correlation and R-squared
  6.3 Multiple linear regression
    6.3.1 Estimation
    6.3.2 Properties of least squares estimators
    6.3.3 Confidence interval for linear functions of parameters
    6.3.4 Prediction
  6.4 Goodness of fit and a test for a reduced model

7 Categorical data
  7.1 Experiment
  7.2 Pearson's χ² test
  7.3 Goodness of fit tests when parameters are unspecified
  7.4 Independence test for contingency tables

8 Bayesian Inference
  8.1 Estimation
  8.2 Hypothesis testing
Chapter 1

Point Estimators

1.1 Basic problem of statistical estimation


Suppose we have a sample of data which was collected by observing a sequence of random experiments. Typically, this is a sequence of numbers $(x_1, x_2, \dots, x_n)$, where each $x_i$ is a real number, but more generally every observation (a datapoint) can be a vector of numerical characteristics. Since each experiment results in a random outcome, $x_i$ is a realization of a random variable $X_i$, so a random sample is the sequence of random variables

$$X_1, X_2, \dots, X_n.$$

The main assumption of mathematical statistics is that this sequence has a cumulative distribution function $F(\vec{x}, \theta)$, where θ is an unknown parameter, which can be any number (or a vector) in a region Θ. The main task is to obtain some information about this parameter.
As an example, we can think of $X_i$ as the number of Covid deaths on day i, or the GPA of student i, and so on.
In this course, we assume for simplicity that the random variables $X_i$ are i.i.d. (independent and identically distributed); that is, every datapoint has the same distribution as the others, and the datapoints are independent of each other. This is a very restrictive requirement; for example, for Covid data it is doubtful that $X_{i+1}$ is independent of $X_i$. However, this is the simplest setting in which we can develop the statistical theory.

For example, we can look at the sample of $X_i$, $i = 1, \dots, n$, where $X_i$ is the lifetime of a smartphone, and model $X_i$ as an exponential random variable with mean θ. Potentially, this θ can be any number in Θ = (0, ∞). Our task is, for a specific realization of the random variables $X_i$, to derive a conclusion about the parameter θ.
Our assumption means that the density of $X_1$ is
$$f_{X_1}(x_1) = \frac{1}{\theta} e^{-x_1/\theta},$$
the density of $X_2$ is
$$f_{X_2}(x_2) = \frac{1}{\theta} e^{-x_2/\theta},$$
and so on.
The joint density of independent datapoints is simply the product of the individual densities for each datapoint. In our example,
$$f_{X_1,\dots,X_n}(x_1, \dots, x_n) = \frac{1}{\theta} e^{-x_1/\theta} \times \frac{1}{\theta} e^{-x_2/\theta} \times \dots \times \frac{1}{\theta} e^{-x_n/\theta} = \frac{1}{\theta^n} e^{-(\sum_{i=1}^n x_i)/\theta}.$$
In statistics, if we think about this joint density as a function of the model parameter θ, we call it the likelihood function and denote it by the letter L. So, in our example, we have
$$L(\theta \mid \vec{x}) = \frac{1}{\theta^n} e^{-(\sum_{i=1}^n x_i)/\theta},$$
where we used the notation $\vec{x}$ to denote the vector of observed datapoints: $\vec{x} = (x_1, \dots, x_n)$.
Now, we want to get some information about the parameter θ from the vector $(x_1, \dots, x_n)$. For example, we could look for a function of $(x_1, \dots, x_n)$ which would be close to θ. This is called the point estimation problem, because we try to find a point (an estimator) which is close to θ. We will discuss it in the next section.
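As a quick aside (not part of the original notes), here is a minimal R sketch of this likelihood for a simulated sample; the sample size n = 20 and true mean θ = 3 are arbitrary choices. The curve peaks at the sample mean, anticipating maximum likelihood estimation.

set.seed(1)
x <- rexp(20, rate = 1/3)                # simulated sample, true mean theta = 3
n <- length(x)
loglik <- function(theta) -n * log(theta) - sum(x) / theta   # log L(theta | x)
thetas <- seq(0.5, 10, by = 0.1)
plot(thetas, sapply(thetas, loglik), type = "l",
     xlab = "theta", ylab = "log-likelihood")
abline(v = mean(x), lty = 2)             # the maximizer is the sample mean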

1.2 An estimator, its bias and variance
One of the main goals in statistics is to guess the value of an unknown parameter θ, given the realization of the data sample. Namely, we are given the realization of random variables $X_1, \dots, X_n$, and we want to guess θ. Mathematically, this means that we look for a function of $X_1, \dots, X_n$, namely $f(X_1, \dots, X_n)$, which we call an estimator.
A function of the data sample is called a statistic, so an estimator is a statistic. It can be any function whatsoever, but naturally we want it to be a good guess for the true value of the parameter.
Note on notation: if θ is a parameter to be estimated, then $\hat\theta$ denotes its estimator, or a value of the estimator for a given sample. More carefully, it is a function of the data: $\hat\theta = \hat\theta(X_1, \dots, X_n)$.
Examples of estimators: $\hat\theta = \overline{X} := (X_1 + \dots + X_n)/n$ or $\hat\theta = X_{(n)} := \max(X_1, \dots, X_n)$. Even very unnatural functions such as $\sin(X_1 \times X_2 \times \dots \times X_n)$ can be thought of as estimators. So how do we distinguish between good and bad estimators?
What do we mean by saying that $\hat\theta$ is a good guess for θ?
Note that $\hat\theta = \hat\theta(X_1, \dots, X_n)$ is random, since its value changes from sample to sample. The distribution of this random variable $\hat\theta$ depends on the true value of the parameter θ. One thing we can ask of the estimator is that its expected value equal the true value of the parameter. This is called unbiasedness. The second useful property is that, as we increase the size of the sample, the estimator converges to the true value of the parameter in the sense of convergence in probability. This is called consistency. We will deal with these two concepts one by one.
Bias of an estimator:
Def: $\mathrm{Bias}(\hat\theta) = E\hat\theta - \theta$ (the bias of an estimator is its expected value minus the true value of the parameter).
Note that the bias can depend on the true value of the parameter. A good estimator should have zero, or at least small, bias for all values of the true parameter.

Definition 1.2.1. An estimator $\hat\theta = \hat\theta(X_1, \dots, X_n)$ is called unbiased if
$$E\hat\theta(X_1, \dots, X_n) = \theta$$
for every θ ∈ Θ.

In other words, the estimator $\hat\theta$ is unbiased if its bias is zero for every value of the true parameter θ ∈ Θ.
Example 1.2.2. Consider our previous example about the lifetime of smartphones. What is the bias of the following two estimators: $\hat\theta = \overline{X}$ and $\hat\theta = X_1$?
Why does $\overline{X}$ appear to be better than $X_1$ as an estimator?
The reason is that the variance of $\overline{X}$ decreases as the sample size grows, while the variance of $X_1$ does not depend on the size of the sample.
Variance of an estimator
Def: $\mathrm{Var}(\hat\theta) = E(\hat\theta - E\hat\theta)^2 = E\hat\theta^2 - (E\hat\theta)^2$.
We want $\mathrm{Var}(\hat\theta)$ to be small for all values of the true parameter θ.
Ideally, both the bias and the variance of the estimator should be small. Sometimes we value unbiasedness more than anything else: we first make sure that an estimator is unbiased, and only after this condition is satisfied do we start to look for estimators with low variance among these unbiased estimators.
However, sometimes we can tolerate an estimator that is a bit biased. Moreover, in some cases it is very difficult or even impossible to find an unbiased estimator. In this case, it is useful to define a combined measure of the quality of an estimator.
Def: The Mean Squared Error of an estimator is defined as
$$MSE(\hat\theta) = E\big\{(\hat\theta - \theta)^2\big\}.$$

Theorem 1.2.3 (MSE decomposition).
$$MSE(\hat\theta) = \mathrm{Var}(\hat\theta) + [\mathrm{Bias}(\hat\theta)]^2$$

Proof. By using the linearity of expectation:
$$E\big[(\hat\theta - \theta)^2\big] = E\big[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2\big] = E(\hat\theta - E\hat\theta)^2 + 2E\big[(\hat\theta - E\hat\theta)(E\hat\theta - \theta)\big] + (E\hat\theta - \theta)^2$$
$$= \mathrm{Var}(\hat\theta) + [\mathrm{Bias}(\hat\theta)]^2 + 2E\big[(\hat\theta - E\hat\theta)(E\hat\theta - \theta)\big].$$
But in the last term we can take $E\hat\theta - \theta$ outside of the expectation sign, since it is not random, and we find that this last term is zero:
$$E\big[(\hat\theta - E\hat\theta)(E\hat\theta - \theta)\big] = (E\hat\theta - \theta)\,E\big[\hat\theta - E\hat\theta\big] = 0,$$
because $E\big[\hat\theta - E\hat\theta\big] = E\hat\theta - E\hat\theta = 0$.

Example 1.2.4. Suppose that for a certain estimator $\hat\theta$ of θ we know that
$$E\hat\theta = a\theta + b$$
for some constants a ≠ 0 and b ≠ 0.

• What is $\mathrm{Bias}(\hat\theta)$, in terms of a, b and θ?

• Find a function of $\hat\theta$ that is an unbiased estimator for θ.

Comment: if you find a biased estimator $\hat\theta$, you can sometimes easily correct the bias to get an unbiased estimator. However, suppose, e.g., that we tried an estimator $\hat\theta$ and found that it has $E\hat\theta = \sqrt{\theta}$; then we cannot correct the bias by simply taking the square of $\hat\theta$. The estimator $\tilde\theta = \hat\theta^2$ will not be unbiased for θ! If we recall the formula for the second moment of a random variable, then in this particular example we can even compute the bias:
$$E\hat\theta^2 = (E\hat\theta)^2 + \mathrm{Var}(\hat\theta) = \theta + \mathrm{Var}(\hat\theta),$$
so the bias of the estimator $\hat\theta^2$ equals $\mathrm{Var}(\hat\theta)$. In general, it is often quite difficult to find an unbiased estimator.
Let us look at a couple of examples.

Example 1.2.5. The reading on a voltage meter connected to a test circuit is uniformly distributed over the interval (θ, θ + 1), where θ is the true but unknown voltage of the circuit. Suppose that $Y_1, Y_2, \dots, Y_n$ denote a random sample of such readings. We are going to try two estimators of θ: $\hat\theta = \overline{Y}$ and $\hat\theta = \min\{Y_1, \dots, Y_n\}$. First, consider $\hat\theta = \overline{Y}$.

• Calculate the bias of $\overline{Y}$ as an estimator of θ.

• Find an unbiased estimator of θ (based on $\overline{Y}$).

• Find $MSE(\overline{Y})$.

Solution. It is straightforward to calculate the bias:
$$\mathrm{bias}(\overline{Y}) = E(\overline{Y}) - \theta = \frac{1}{n}\sum_{i=1}^n E(Y_i) - \theta = E(Y) - \theta = 1/2,$$
where we used the fact that the expectation of a r.v. uniformly distributed on [θ, θ + 1] equals θ + 1/2.
Then $MSE(\overline{Y}) = \mathrm{bias}(\overline{Y})^2 + \mathrm{Var}(\overline{Y})$, and since the $Y_i$ are independent,
$$\mathrm{Var}(\overline{Y}) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(Y_i) = \frac{1}{n}\mathrm{Var}(Y).$$
In order to calculate the variance of Y, which is uniform on [θ, θ + 1], note that the variance of a shifted variable is the same, so we can calculate the variance of X which is uniform on [−1/2, 1/2]:
$$\mathrm{Var}(X) = \int_{-1/2}^{1/2} x^2\,dx = \frac{x^3}{3}\Big|_{-1/2}^{1/2} = \frac{1}{12}.$$
Altogether, $MSE(\overline{Y}) = \frac{1}{4} + \frac{1}{12n}$.
Now we want to use the other estimator, $\hat\theta = \min\{Y_1, \dots, Y_n\}$.
Example 1.2.6. The sample values $Y_1, Y_2, \dots, Y_n$ are uniform on (θ, θ + 1). Consider the estimator $\hat\theta = Y_{(1)} := \min\{Y_1, \dots, Y_n\}$. Calculate the bias and variance of $\hat\theta$. Can we correct the bias?

Reminder about the distribution of the minimum and the maximum. Recall the notation $Y_{(1)} = \min\{Y_1, \dots, Y_n\}$. Then for the CDF of $Y_{(1)}$, we have:
$$F_{Y_{(1)}}(y) \equiv \Pr(Y_{(1)} \le y) = 1 - \Pr(Y_{(1)} > y) = 1 - \Pr(\text{all } Y_i\text{'s are} > y) = 1 - [1 - F(y)]^n,$$
and the PDF is
$$f_{Y_{(1)}}(y) = n[1 - F(y)]^{n-1} f(y).$$
Similarly, for the maximum we have the notation $Y_{(n)} = \max\{Y_1, \dots, Y_n\}$. The CDF is
$$F_{Y_{(n)}}(y) \equiv \Pr(Y_{(n)} \le y) = \Pr(Y_i \le y \text{ for all } i) = [F(y)]^n,$$
and the PDF is
$$f_{Y_{(n)}}(y) = n[F(y)]^{n-1} f(y).$$
Now let us return to the example.
We want to calculate $EY_{(1)}$. It is convenient to define the shifted variables $X_i = Y_i - \theta$, since then $EY_{(1)} = EX_{(1)} + \theta$, and it is easier to calculate $EX_{(1)}$ because the $X_i$ are simply uniform random variables on [0, 1]. (Of course, the expectation can be calculated without this transformation, but the formulas would be more cumbersome.)
Since the density and CDF of $X_i$ are $f_X(x) = 1$ and $F_X(x) = x$, supported on [0, 1], we can use the formulas above to calculate the PDF of the minimum $X_{(1)}$: $f_{X_{(1)}}(x) = n(1-x)^{n-1}$. In other words, $X_{(1)}$ has the Beta distribution with parameters α = 1 and β = n. By the facts about the Beta distribution, it follows that the expectation is $EX_{(1)} = \alpha/(\alpha+\beta) = 1/(n+1)$.
Alternatively, we can simply integrate using the density of $X_{(1)}$ and calculate
$$EX_{(1)} = n\int_0^1 x(1-x)^{n-1}\,dx = \frac{1}{n+1}.$$

(The integral can be calculated by doing integration by parts or by using a very useful formula for Beta integrals:
$$\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)},$$
where Γ(x) is the Gamma function. For an integer argument x, Γ(x) = (x − 1)!.)
Hence the bias of $Y_{(1)}$ is $EY_{(1)} - \theta = EX_{(1)} = \frac{1}{n+1}$. Note that in this example the bias → 0 as the sample size increases. In addition, we can easily correct the bias by using $\hat\theta = Y_{(1)} - \frac{1}{n+1}$.
What is the MSE of $\hat\theta = Y_{(1)} - \frac{1}{n+1}$?
Since there is no bias, we only need to calculate $\mathrm{Var}(\hat\theta) = \mathrm{Var}(Y_{(1)}) = \mathrm{Var}(X_{(1)})$.
From the facts about the Beta distribution we have
$$\mathrm{Var}(X_{(1)}) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{n}{(n+1)^2(n+2)} \sim \frac{1}{n^2}$$
for large n. (We could also calculate it directly from the density.)
This is the MSE of $\hat\theta = Y_{(1)} - \frac{1}{n+1}$. For large n, it is much smaller than the MSE of the estimator $\overline{Y} - \frac{1}{2}$, which we calculated as $\frac{1}{12n}$.
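To see the difference concretely, here is a minimal R simulation (illustrative; θ = 5 and n = 50 are arbitrary choices) comparing the two bias-corrected estimators.

set.seed(7)
theta <- 5; n <- 50; reps <- 1e4
est1 <- replicate(reps, mean(runif(n, theta, theta + 1)) - 1/2)
est2 <- replicate(reps, min(runif(n, theta, theta + 1)) - 1/(n + 1))
c(mse1 = mean((est1 - theta)^2),   # about 1/(12n) = 0.00167
  mse2 = mean((est2 - theta)^2))   # about n/((n+1)^2 (n+2)) = 0.00038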
Example 1.2.7. Calculate the distribution of the minimum for the sample $X_1, \dots, X_n$ from the exponential distribution with mean θ. Use the minimum to obtain an unbiased estimate of the parameter θ. What is the variance of this estimator?
Solution. First, we calculate the CDF of each observation as
$$F_{X_i}(x) = \int_0^x \frac{1}{\theta}e^{-t/\theta}\,dt = 1 - e^{-x/\theta}.$$
Then, by using the formulas above, we calculate the density of the minimum:
$$f_{X_{(1)}}(x) = n\left(e^{-x/\theta}\right)^{n-1} \times \frac{1}{\theta}e^{-x/\theta} = \frac{n}{\theta}e^{-nx/\theta}.$$
Hence, the minimum $X_{(1)} = \min\{X_1, \dots, X_n\}$ is distributed as an exponential random variable with mean θ/n.
If we set $\hat\theta = nX_{(1)}$, then the expectation of this estimator is θ, and it gives an unbiased estimator of θ.
What is its variance?
$$\mathrm{Var}\big(nX_{(1)}\big) = n^2\,\mathrm{Var}\,X_{(1)} = n^2(\theta/n)^2 = \theta^2,$$
so it is not a particularly good estimator of θ. Its variance does not decline as the sample size grows.
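A short R sketch (illustrative; θ = 2 is an arbitrary choice) makes this vivid: the mean of $nX_{(1)}$ stays near θ, but the variance stays near θ² = 4 no matter how large n is.

set.seed(11)
theta <- 2
for (n in c(10, 100, 1000)) {
  est <- replicate(1e4, n * min(rexp(n, rate = 1/theta)))
  cat("n =", n, " mean =", round(mean(est), 2),
      " var =", round(var(est), 2), "\n")   # var stays near theta^2 = 4
}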

Summary: In this section we introduced simple measures that help us to evaluate how good an estimator is: its bias, variance, and mean squared error.

1.3 Consistency
Suppose again that we have a sample $(X_1, \dots, X_n)$ from a probability distribution that depends on a parameter θ. Note that although we speak about an estimator $\hat\theta = \hat\theta(X_1, \dots, X_n)$, in fact the distribution of the estimator depends on n, so it would be more correct to speak about a sequence of random variables $\hat\theta_n$.
Usually, we expect that when the size of the sample becomes larger, that is, as n grows, the distribution of the estimator $\hat\theta_n$ becomes concentrated more and more around the true value of the parameter θ. This is the minimal requirement that we can impose on a family of estimators that depend on the sample size. If this requirement is not satisfied, as in Example 1.2.7 above, then the estimator is not very useful. Technically, this property of an estimator is called consistency, and we give its mathematical definition below.
Before that, let us look at some pictures. The plots show a simulation study. A sample $X_1, X_2, \dots$ from the distribution N(θ, 1/4) was generated with θ = 10, and we computed $\hat\theta_k = (X_1 + \dots + X_k)/k$. Figure 1.1 shows a path of $\hat\theta_k$. It suggests that if we get more and more data, $\hat\theta_k$ converges to the true value of θ. In fact, this is a consequence of the strong law of large numbers, which says that this behavior is observed with probability 1.

[Figure 1.1: a single simulated path of $\hat\theta_k$. Figure 1.2: ten simulated paths of $\hat\theta_k$.]

What about several different samples? Figure 1.2 shows the situation when the sample $X_1, X_2, \dots$ was generated 10 times and 10 paths of $\hat\theta_k$ were plotted. This picture suggests that when the sample size grows, the distribution of $\hat\theta_k$ concentrates around the true value of the parameter θ. Mathematically, this is a consequence of the weak law of large numbers.
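These figures are easy to reproduce; here is a minimal R sketch in the spirit of Figures 1.1 and 1.2 (ten running-mean paths from N(10, 1/4); the path length of 1000 is an arbitrary choice).

set.seed(3)
n <- 1000
paths <- sapply(1:10, function(j) cumsum(rnorm(n, mean = 10, sd = 0.5)) / (1:n))
matplot(paths, type = "l", lty = 1, xlab = "k", ylab = "estimate")
abline(h = 10, lwd = 2)   # the paths settle around the true value theta = 10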
In order to define the consistency, recall what it means for a sequence of
random variables to converge to another random variable.

Definition 1.3.1 (Convergence in probability). A sequence of random variables $X_1, X_2, \dots, X_n, \dots$ converges in probability to a random variable X if, for any ε > 0, as n → ∞,
$$P(|X_n - X| < \varepsilon) \to 1,$$
that is,
$$\lim_{n\to\infty} P(|X_n - X| < \varepsilon) = 1.$$
This is denoted either as $X_n \xrightarrow{P} X$ or as $\mathrm{plim}_{n\to\infty} X_n = X$.

Note that $a_n \equiv P(|X_n - X| < \varepsilon)$ is simply a number (it is not random). Hence $\{a_1, a_2, \dots, a_n, \dots\}$ form a sequence of numbers, and their limit is defined in the usual "calculus" sense.

Definition 1.3.2 (Consistency). An estimator $\hat\theta_n$ is a consistent estimator of θ if $\hat\theta_n$ converges in probability to θ:
$$\hat\theta_n \xrightarrow{P} \theta.$$

By writing out the definition of convergence in probability in detail, we see that this definition can also be written as saying that an estimator $\hat\theta_n$ is a consistent estimator of θ if, for any ε > 0, as n → ∞,
$$P(|\hat\theta_n - \theta| < \varepsilon) \to 1.$$

The consistency of the estimator means that as the sample size goes to infinity, we become more and more sure that the distance between $\hat\theta_n$ and θ is smaller than any positive ε!
Consistency describes a property of the estimator in the n → ∞ limit. Unlike unbiasedness, it is NOT meant to describe a property of the estimator for a fixed n.
An unbiased estimator can be inconsistent, as we saw in Example 1.2.7, and a biased estimator can be consistent (as $Y_{(1)}$ in Example 1.2.6)! Consistency is more important than unbiasedness, because it ensures that if we collect enough data we will eventually learn the true value of the parameter.
So, how can we tell if the estimator is consistent? One way is to see how
MSE changes with n.

Theorem 1.3.3. If $MSE(\hat\theta_n) \to 0$ as n → ∞, then the estimator $\hat\theta_n$ is consistent.

Proof. By using Theorem 1.2.3, we note that $MSE(\hat\theta_n) \to 0$ if and only if $\mathrm{bias}(\hat\theta_n) \to 0$ and $\mathrm{Var}(\hat\theta_n) \to 0$. Fix an ε > 0 and choose $n_0$ so that $|\mathrm{bias}(\hat\theta_n)| < \varepsilon/2$ for all $n > n_0$. Then, by the definition of bias, for all $n > n_0$, $|E\hat\theta_n - \theta| < \varepsilon/2$. Since
$$|\hat\theta_n - \theta| = |\hat\theta_n - E\hat\theta_n + E\hat\theta_n - \theta| \le |\hat\theta_n - E\hat\theta_n| + |E\hat\theta_n - \theta| < |\hat\theta_n - E\hat\theta_n| + \varepsilon/2,$$
the event $|\hat\theta_n - \theta| > \varepsilon$ can occur only if $|\hat\theta_n - E\hat\theta_n| > \varepsilon/2$ occurred. Hence, for $n > n_0$, $P(|\hat\theta_n - \theta| > \varepsilon) \le P(|\hat\theta_n - E\hat\theta_n| > \varepsilon/2)$.
Now apply the Chebyshev inequality:
$$P(|\hat\theta_n - E\hat\theta_n| > \varepsilon/2) \le \frac{\mathrm{Var}(\hat\theta_n)}{(\varepsilon/2)^2}.$$
By our assumption, the right-hand side can be made arbitrarily small for all sufficiently large n, because $\mathrm{Var}(\hat\theta_n) \to 0$. We have shown that $P(|\hat\theta_n - \theta| > \varepsilon) \to 0$ for any ε > 0.

• If $\mathrm{Bias}(\hat\theta_n) \to 0$ as n → ∞, then the estimator is called asymptotically unbiased.

• Another way to formulate the theorem is to say that any estimator which is asymptotically unbiased and has its variance converging to 0 as n → ∞ is a consistent estimator.

Example 1.3.4 (Sample mean is a consistent estimator of the population mean). Let $Y_1, Y_2, \dots$ be a sample from a population with mean µ and variance σ².

• Sample mean: $\overline{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i$. Its expectation is µ and its variance is $\mathrm{Var}(\overline{Y}_n) = \sigma^2/n \to 0$.

• So $\overline{Y}_n$ is an unbiased and consistent estimator of µ.

Example 1.3.5 (Biased and consistent estimator of the mean). For the parameter θ = µ, consider the modified sample mean $\hat\theta_n = \frac{n}{n-1}\overline{Y}$.

• $\mathrm{Bias}(\hat\theta_n) = E\hat\theta_n - \theta = \frac{n}{n-1}\mu - \mu = \frac{1}{n-1}\mu \to 0$ as n → ∞. ($\hat\theta_n$ is a biased estimator of µ ≠ 0 for every n. It is, however, asymptotically unbiased.)

• $\mathrm{Var}(\hat\theta_n) = \frac{n^2}{(n-1)^2}\,\sigma^2/n \to 0$.

• Conclusion: $\hat\theta_n$ is a biased but consistent estimator.

Example 1.3.6.

• $Y_i \sim \mathrm{Unif}[\theta, \theta + 1]$.

• $\hat\theta_1 = \overline{Y} - 1/2$; $\hat\theta_2 = Y_{(1)} - 1/(n+1)$.

• We have shown in Examples 1.2.5 and 1.2.6 that these estimators are both unbiased.

• In addition, we showed in these examples that $\mathrm{Var}(\hat\theta_1) = 1/(12n)$ and $\mathrm{Var}(\hat\theta_2) = n/[(n+1)^2(n+2)]$. Since both variances go to zero as n grows, both estimators are consistent.
Our main tool in establishing consistency of estimators was Theorem 1.3.3. However, it is sometimes cumbersome to calculate the MSE of an estimator. There are some other tools to establish the consistency of an estimator. We will talk about them later.
Unbiasedness and consistency

• Unbiasedness: concerns expectation; holds for fixed n.

• Consistency:

  – only cares about n → ∞;
  – concerns bias and variance (and whether they vanish for large n);
  – however, does not necessarily imply unbiasedness for finite n.

1. Can a biased estimator be consistent? Yes!

2. Can an unbiased estimator be inconsistent? Yes!

1.4 Some common unbiased estimators


1.4.1 An estimator for the population mean µ
Let $Y_1, Y_2, \dots, Y_n$ denote a random sample of n independent identically distributed observations from a population with mean µ (that is, in our statistical model one of the parameters is $EY_i = \mu$) and variance σ² (another parameter is $\mathrm{Var}(Y_i) = \sigma^2$). Then the most natural estimator for µ is the sample mean:
$$\hat\mu = \overline{Y} = \frac{1}{n}\sum_{i=1}^n Y_i.$$

• Expectation and bias of the estimator: $E(\overline{Y}) = \frac{1}{n}\sum_{i=1}^n EY_i = EY_i = \mu$; so $\mathrm{bias}(\overline{Y}) = 0$.

• Variance: $\mathrm{Var}(\overline{Y}) = \frac{1}{n}\mathrm{Var}(Y_i) = \frac{\sigma^2}{n}$.

• $MSE(\overline{Y}) = \mathrm{bias}^2 + \mathrm{Var} = \frac{\sigma^2}{n}$.

These observations show that the sample mean is an unbiased and consistent estimator for the population mean µ.

Definition 1.4.1. The standard error of an estimator is the square root of its variance: $SE(\hat\theta) = \sqrt{\mathrm{Var}(\hat\theta)}$.

Another notation for the variance and the standard error of an estimator $\hat\theta$ is $\sigma^2_{\hat\theta}$ and $\sigma_{\hat\theta}$, respectively.
So the variance of the sample mean $\overline{Y}$ is $\sigma^2_{\overline{Y}} = \sigma^2/n$ and the standard error is $\sigma_{\overline{Y}} = \frac{\sigma}{\sqrt{n}}$.

1.4.2 An estimator for the population proportion p


Let Y1 , Y2 , . . . , Yn denote a random sample of size n from a population with
a Bernoulli distribution

P (Yi = 1) = p, P (Yi = 0) = 1 − p.

This is a special case of the situation in the previous section and we can
use the same estimator, the sample mean. In this case, the sample mean
has a special name, the sample proportion. It is an unbiased and consistent
estimator for the parameter p (i.e., for the population proportion).

$$\hat p = \frac{1}{n}\sum_{i=1}^n Y_i,$$
that is, the proportion of $Y_i = 1$'s in the sample.
The estimator is unbiased since $E(\hat p) = EY_1 = p$. Its variance and standard error are $\sigma^2_{\hat p} = \frac{pq}{n}$ and $\sigma_{\hat p} = \sqrt{\frac{pq}{n}}$, respectively, with q = 1 − p.

1.4.3 An estimator for the difference in population means µ1 − µ2

Suppose we have two samples:

• $\{Y_1^{(1)}, Y_2^{(1)}, \dots, Y_{n_1}^{(1)}\}$ of size $n_1$ from Population 1, with mean $\mu_1$ and variance $\sigma_1^2$;

• $\{Y_1^{(2)}, Y_2^{(2)}, \dots, Y_{n_2}^{(2)}\}$ of size $n_2$ from Population 2, with mean $\mu_2$ and variance $\sigma_2^2$.

An unbiased estimator for the difference in population means θ = µ1 − µ2 (the parameter of interest) is the difference in sample means:
$$\hat\theta = \overline{Y}_1 - \overline{Y}_2 = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_i^{(1)} - \frac{1}{n_2}\sum_{i=1}^{n_2} Y_i^{(2)}.$$

• Expectation: $E(\hat\theta) = E\overline{Y}_1 - E\overline{Y}_2 = EY_1^{(1)} - EY_1^{(2)} = \mu_1 - \mu_2$.

• Variance: $\sigma^2_{\hat\theta} = \mathrm{Var}(\hat\theta) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$.

  – Proof: since the two samples are independent, $\mathrm{Var}(\overline{Y}_1 - \overline{Y}_2) = \mathrm{Var}(\overline{Y}_1) + \mathrm{Var}(\overline{Y}_2)$.

• Standard error: $\sigma_{\hat\theta} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$.

It follows that this estimator is unbiased (its expectation equals the estimated parameter) and consistent (because its MSE converges to zero as $n_1$ and $n_2$ jointly grow).

1.4.4 An estimator for the difference in population proportions p1 − p2

Suppose we have two samples:

• $\{Y_1^{(1)}, Y_2^{(1)}, \dots, Y_{n_1}^{(1)}\}$ of size $n_1$ from Population 1: $P(Y_i^{(1)} = 1) = p_1$ and $P(Y_i^{(1)} = 0) = 1 - p_1$;

• $\{Y_1^{(2)}, Y_2^{(2)}, \dots, Y_{n_2}^{(2)}\}$ of size $n_2$ from Population 2: $P(Y_i^{(2)} = 1) = p_2$ and $P(Y_i^{(2)} = 0) = 1 - p_2$.

An unbiased point estimator for the difference in population proportions θ = p1 − p2 (the parameter of interest) is the difference in sample proportions:
$$\hat\theta = \hat p_1 - \hat p_2 = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_i^{(1)} - \frac{1}{n_2}\sum_{i=1}^{n_2} Y_i^{(2)}.$$

• Expectation: $E(\hat\theta) = E(\hat p_1 - \hat p_2) = p_1 - p_2 = \theta$.

• Variance: $\sigma^2_{\hat\theta} = \mathrm{Var}(\hat\theta) = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$,

• Standard error: $\sigma_{\hat\theta} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$,

where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$.
Obviously, this estimator is also unbiased and consistent.

1.4.5 An estimator for the variance

Let $Y_1, Y_2, \dots, Y_n$ denote a sample of size n from a population with mean µ and variance σ². Then an unbiased estimator for σ² is the sample variance:
$$S^2 := \frac{\sum_{i=1}^n (Y_i - \overline{Y})^2}{n-1}.$$
Note that we divide by n − 1, not by n, as would seem most natural. If we divided by n, the estimator would not be unbiased.
The square root of S² is called the sample standard deviation and denoted, as could be expected, S.

Theorem 1.4.2. S² is an unbiased estimator for σ².

Proof.
$$E\sum_{i=1}^n (Y_i - \overline{Y})^2 = E\Big[\sum_{i=1}^n Y_i^2 - 2\overline{Y}\sum_{i=1}^n Y_i + n\overline{Y}^2\Big] = E\Big[\sum_{i=1}^n Y_i^2 - n\overline{Y}^2\Big] = \sum_{i=1}^n EY_i^2 - nE\overline{Y}^2.$$
For the first term, we have:
$$\sum_{i=1}^n EY_i^2 = \sum_{i=1}^n [\mu^2 + \sigma^2] = n\mu^2 + n\sigma^2.$$
For the second term:
$$nE\overline{Y}^2 = n[\mathrm{Var}\,\overline{Y} + (E\overline{Y})^2] = n(\sigma^2/n + \mu^2) = \sigma^2 + n\mu^2.$$
Therefore,
$$E\sum_{i=1}^n (Y_i - \overline{Y})^2 = (n-1)\sigma^2.$$
Hence
$$ES^2 = \frac{E\sum_{i=1}^n (Y_i - \overline{Y})^2}{n-1} = \sigma^2,$$
and therefore S² is unbiased for σ².

The variance of this estimator is more complicated to compute and we will not do it here. However, it turns out that it goes to zero as n → ∞. In particular, this estimator is consistent.
Note that although S² is an unbiased estimator for σ², the estimator S, that is, the square root of S², is NOT an unbiased estimator for σ. However, it turns out that S is still a consistent estimator for σ.
The identity used in the first line of the proof,
$$\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y})^2 = \frac{1}{n}\sum_{i=1}^n Y_i^2 - \overline{Y}^2,$$
is an "empirical" analogue of the identity $\mathrm{Var}(Y) = E(Y^2) - (EY)^2$.
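A minimal R check of Theorem 1.4.2 (an illustrative sketch, not part of the original notes; the normal model with σ² = 4 and n = 5 are arbitrary choices): R's var() divides by n − 1, and rescaling it to divide by n makes the bias visible.

set.seed(5)
n <- 5; reps <- 1e5
s2 <- replicate(reps, var(rnorm(n, mean = 0, sd = 2)))  # var() divides by n - 1
c(mean_s2 = mean(s2),                          # close to sigma^2 = 4
  mean_divided_by_n = mean(s2) * (n - 1) / n)  # biased downward, about 3.2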

1.5 The existence of unbiased estimators

We have seen above that several natural parameters have unbiased estimators. So one question is whether it is always possible to find an unbiased estimator for a parameter of interest. Surprisingly, the answer is "no". Here we present an example. It is a little bit artificial, but it shows that sometimes it is not simply difficult to find an unbiased estimator: one may not exist at all.
In this example, each observation is taken from the Bernoulli distribution with parameter p. That is, $X_i = 1$ with probability p and $X_i = 0$ with probability q = 1 − p. Of course, we have seen that there is an unbiased estimator for p, namely $\hat p = \overline{X}$. The twist of this example is that we try to estimate θ = − ln p ∈ Θ = (0, ∞). Suppose, seeking a contradiction, that $\hat\theta = \hat\theta(X_1, \dots, X_n)$ is an unbiased estimator of θ, and therefore $E\hat\theta = \theta = -\ln p$. We write the expectation using the basic definition:
$$E\hat\theta(X_1, \dots, X_n) = \sum_{x_1=0}^{1} \dots \sum_{x_n=0}^{1} \hat\theta(x_1, \dots, x_n)\, P(X_1 = x_1, \dots, X_n = x_n).$$
For a Bernoulli r.v., we can write $P(X_i = x_i) = p^{x_i}(1-p)^{1-x_i}$, where $x_i$ can take only two values, 0 and 1. So, by the independence of the random variables $X_1, \dots, X_n$, we have:
$$P(X_1 = x_1, \dots, X_n = x_n) = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}.$$
So if $\hat\theta$ is unbiased, then we have:
$$-\ln p = \sum_{x_1=0}^{1} \dots \sum_{x_n=0}^{1} \hat\theta(x_1, \dots, x_n)\, p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i},$$
and this should be true for every p ∈ (0, 1), because the estimator is assumed to be unbiased for every − ln p ∈ (0, ∞). However, this means that the logarithmic function of p equals a polynomial in p, which is impossible. For example, the limit of the left-hand side as p → 0 is ∞, while the limit of the right-hand side is finite.
We get a contradiction, so there is no unbiased estimator of θ = − ln p.

1.6 The error of estimation and the 2-standard-error bound

Definition 1.6.1. The error of estimation ε is the distance between an estimator and its target parameter. That is, $\varepsilon = |\hat\theta - \theta|$.

The error of estimation is a random quantity that changes from sample to sample. We are often interested in a good bound on this quantity that holds with large probability.
Recall that the standard error of the estimator, $\sigma_{\hat\theta}$, is another name for the standard deviation of the estimator $\hat\theta$. That is, $\sigma_{\hat\theta} = \sqrt{\mathrm{Var}(\hat\theta)}$.
By the Chebyshev inequality:
$$P\{|\hat\theta - \theta| > k\sigma_{\hat\theta}\} \le \frac{1}{k^2}.$$

• For $b = 2\sigma_{\hat\theta}$, the RHS of the Chebyshev inequality equals 25%. [This is only a bound; the true probability that |ε| ≥ b is smaller, often as small as 5%.]

• $2\sigma_{\hat\theta}$ is called the 2-standard-error bound on the error of the estimator. The meaning is that with large probability the error of estimation is smaller than $2\sigma_{\hat\theta}$. (This probability is bounded from below by 75% by the Chebyshev inequality, and is often significantly larger.)

The Central Limit Theorem for sums of independent random variables says that a sum of a large number of these variables has a distribution which is close to the normal distribution.
Since the estimator for the mean, $\overline{Y}$, is such a sum (only divided by n), it becomes approximately normal when n is large. So if the sample size is large, then the estimation error $|\overline{Y} - \mu|$ is less than the 2-standard-error bound $2\sigma_{\overline{Y}}$ with probability approximately 95% (instead of at least 75%).
This observation also holds for the other standard estimators that we considered in the previous section.
Example 1.6.2 (Titanic survivors). In a random sample of 136 Titanic first
class passengers that survived the Titanic ship accident, 91 were women. In
a random sample of 119 third class survivors, 72 were women. Assume that
these are small samples from two large populations of “survivors”: first-class
survivors and third-class survivors.
What is an unbiased estimate for the difference in proportions of females
in these populations? What is the two-standard-error bound?

Solution. $\hat p_1 = 66.9\%$; $\hat p_3 = 60.5\%$; $\hat p_1 - \hat p_3 = 6.4\%$.
The two-standard-error bound is:
$$2\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_3(1-\hat p_3)}{n_3}} = 12.1\%.$$
Does the data suggest that the total first class and third class survivor populations had approximately the same proportions of females?
Does the data suggest that women from the first and third classes had approximately the same chances to survive?
Solution: Not really. These data say that the proportions of women among the first and the third class survivors are approximately the same, meaning that the difference between the proportions is within the two-standard-error bound. However, this does not really say anything about the chances of survival.
Example 1.6.3 (Titanic survivors II). In a random sample of 95 female passengers in the first class, 91 survived the Titanic ship accident. In a random sample of 145 women in the third class, 72 survived.
What is an unbiased estimate for the difference in proportions of survivors in the populations of the first and the third class female passengers? What is the two-standard-error bound?
Solution. $\hat p_1 = 95.8\%$; $\hat p_3 = 49.7\%$; $\hat p_1 - \hat p_3 = 46.1\%$.
The two-standard-error bound is:
$$2\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_3(1-\hat p_3)}{n_3}} = 9.3\%.$$
Does the data suggest that female passengers from the third class had lower chances to survive than female passengers from the first class?
Solution: Yes. The difference is far larger than the two-standard-error bound, suggesting that it is highly unlikely that this happened by chance.
Some additional interesting info about this example: the chances of a man in the first class to survive were 36.9%; the chances of a man in the third class, 13.5%.
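Both Titanic calculations can be reproduced with a few lines of R. (An illustrative sketch; the helper function two_se is our own naming, not part of the notes.)

two_se <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1; p2 <- x2 / n2
  c(diff = p1 - p2, bound = 2 * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2))
}
two_se(91, 136, 72, 119)  # Example 1.6.2: diff about 0.064, bound about 0.121
two_se(91, 95, 72, 145)   # Example 1.6.3: diff about 0.461, bound about 0.093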
Example 1.6.4 (Elementary school IQ). The article “A Longitudinal Study
of the Development of Elementary School Children’s Private Speech” by

Bivens and Berk (Merrill-Palmer Q., 1990: 443–463) reported on a study of
children talking to themselves (private speech).
[The study was motivated by theories in psychology that claim that private speech plays an important role in a child's mental development, so one can investigate how private speech is related to IQ (that is, performance on math or verbal tasks) or to changes in IQ. The study found some supporting evidence that task-related private speech is positively related to future changes in IQ. Here we are only interested in the IQ data.]
The study included 33 students whose first-grade IQ scores are given
here:
082 096 099 102 103 103 106 107 108 108 108 108 109 110 110 111 113
113 113 113 115 115 118 118 119 121 122 122 127 132 136 140 146
(a) Suppose we want an estimate of the average value of IQ for the
first graders served by this school. What is an unbiased estimate for this
parameter?
[Hint: Sum is 3753.]
Solution. $\hat\mu = \overline{X} = 3753/33 = 113.7273$.
(b) Calculate and interpret a point estimate of the population standard deviation σ. [Hint: Sum of squared observations is 432,015.]
Solution.
$$S^2 = \frac{1}{32}\left(432{,}015 - 33 \times (113.7273)^2\right) = 162.3856,$$
$$\hat\sigma = S = \sqrt{162.3856} = 12.7431.$$
While S² is an unbiased estimator of σ², the estimator S for σ is biased.
(c) What is the two-standard-error bound on the error of estimation?
Solution. The two-standard-error bound is $2\sigma_{\hat\mu} = 2\sigma_{\overline{X}} = 2\sigma/\sqrt{n}$. We use the estimate of σ to calculate the bound:
$$2S/\sqrt{n} = 2 \times 12.7431/\sqrt{33} = 4.4366.$$
Since the estimate of µ is 113.7273 and the two-standard-error bound on the error of estimation is 4.4366, the data suggest that this is an above-average class, because the nationwide IQ average is around 100.
(d) Calculate a point estimate of the proportion of all such students whose IQ exceeds 100. [Hint: Think of an observation as a "success" if it exceeds 100.]
Solution. The number of students with IQ above 100 is 30. So the point estimate is $\hat p = 30/33 = 90.91\%$.
Example 1.6.5 (Elementary school IQ II). The data set mentioned in the
previous example also includes these third grade verbal IQ observations for
males:
117 103 121 112 120 132 113 117 132 149 125 131 136 107 108 113 136
114
(18 observations) and females:
114 102 113 131 124 117 120 90 114 109 102 114 127 127 103
(15 observations)
Let the male values be denoted X1 , . . . , Xm and the female values Y1 , . . . , Yn .
(a) Calculate the point estimate for the difference between male and
female verbal IQ.
Solution.
$$\overline{X} - \overline{Y} = \frac{2186}{18} - \frac{1707}{15} = 121.4444 - 113.8 = 7.6444.$$
(b) What is the standard error of the estimator?
Solution. First we calculate the sample variances $S_x^2$ and $S_y^2$ for the two samples:
$$S_x^2 = \frac{1}{17}\left(268{,}046 - 18 \times 121.4444^2\right) = 151.0964,$$
$$S_y^2 = \frac{1}{14}\left(196{,}039 - 15 \times 113.8^2\right) = 127.3143.$$
Then we calculate the estimate of the standard error:
$$\hat\sigma_{\hat\theta} = \sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}} = \sqrt{\frac{151.0964}{18} + \frac{127.3143}{15}} = 4.1088.$$
So we see that the estimate of the difference is 7.6444. However, the two-standard-error bound is 8.2176, and the data do not give evidence that the difference is positive.
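As a check, the same numbers come out of a direct R computation on the listed data (an illustrative sketch; the variable names are ours).

x <- c(117, 103, 121, 112, 120, 132, 113, 117, 132, 149, 125, 131, 136, 107,
       108, 113, 136, 114)
y <- c(114, 102, 113, 131, 124, 117, 120, 90, 114, 109, 102, 114, 127, 127, 103)
se <- sqrt(var(x) / length(x) + var(y) / length(y))
c(diff = mean(x) - mean(y), se = se, two_se_bound = 2 * se)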

Chapter 2

Interval estimators

2.1 Confidence intervals and pivotal quantities


A point estimator is a function of the data sample that gives a single number, our "best guess" for the parameter; for example, $\overline{Y}$ for µ and $\hat p$ for p. Often we want to know how far the estimator is from the true value of the parameter. While the standard error of the estimator gives some idea about its quality, in this section we will talk about a related and more precise concept.

Definition 2.1.1. A confidence interval with confidence level (1 − α) is a random interval $[\hat\theta_L, \hat\theta_U]$ such that
$$P(\theta \in [\hat\theta_L, \hat\theta_U]) = 1 - \alpha.$$
(The confidence level is sometimes called the confidence coefficient.)

The interval is random because both ends of the interval are functions of the data sample, and so they are random. We usually aim to make 1 − α large, like 95%, since this is the probability that the confidence interval covers the true value of the parameter; so α should be small, like 5%.
This is the definition of the two-sided confidence interval. The definitions of one-sided intervals are similar. For example, for the lower one-sided interval we require that
$$P(\theta \in [\hat\theta_L, \infty)) = 1 - \alpha.$$

Figure 2.1: Interval estimator for 100 different samples. The confidence interval $(\hat\theta - 1.96\sigma_{\hat\theta},\ \hat\theta + 1.96\sigma_{\hat\theta})$, centered at the point estimator $\hat\theta$ (shown by red circles), covers the true parameter (shown by the purple line) in 95% of cases.
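A simulation in the spirit of Figure 2.1 is easy to run; here is a minimal R sketch (illustrative settings: N(0, 1) data, n = 25, 100 replications) that estimates the coverage of the 95% z-interval.

set.seed(9)
n <- 25; reps <- 100
covered <- replicate(reps, {
  x <- rnorm(n)                      # true mean is 0, known sd = 1
  abs(mean(x)) <= 1.96 / sqrt(n)     # does the z-interval contain 0?
})
mean(covered)                        # close to 0.95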

Here are examples of confidence intervals: for µ, we could take $(\overline{Y} - 2S, \overline{Y} + 2S)$. Alternatively, we could take $(0.8\overline{Y}, 1.2\overline{Y})$. However, we do not know what the confidence levels are in these examples.
So, here is a question: for a given (1 − α), how can we find the desired confidence interval $(\hat\theta_L, \hat\theta_U)$? If there are several ways to do it, we would prefer to find an interval for which the length of the interval is the smallest.
The difficulty is that we do not know the true value of the parameter. So, for example, we cannot use the Chebyshev inequality to build the interval estimate: the interval $(\theta - 2\sigma_{\hat\theta}, \theta + 2\sigma_{\hat\theta})$ is not an interval estimator because we know neither θ nor $\sigma_{\hat\theta}$.
We can use $\hat\theta$ instead of θ in this interval and estimate $\sigma_{\hat\theta}$, but why is this OK? This is often fine, as we will see later, but only if the sample size is large.
Here we discuss the pivotal quantity method, which in some cases allows us to find confidence intervals exactly.

The pivotal quantity, or pivot, is a quantity which is a function of both the sample data and the parameter θ, but whose distribution does not depend on the parameter θ!
Let X denote the data sample. (It is a vector of observations.) Find a function T(X, θ) (the pivot) whose distribution does not depend on θ and is therefore known.
Use the distribution of T to find a pair of L and U such that

Pr(L ≤ T ≤ U ) = 1 − α

Manipulate the inequalities L ≤ T (X, θ) ≤ U so that they become

L∗ (X) ≤ θ ≤ U ∗ (X),

so that θ is in the middle!!!


Example 2.1.2 (A sample from a normal distribution). Let $X_1, \dots, X_n$ be a sample from a normal distribution N(µ, σ²), where the parameter σ is known, and we want to find a confidence interval for µ.
Note that $\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is a normally distributed random variable with mean µ and variance σ²/n.
Then the quantity
$$T = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}}$$
has the normal distribution with mean 0 and variance 1. In particular, its distribution does not depend on µ. (Here T depends not only on the data and the parameter µ but also on σ, but we assume that σ is known.) By convention, in this important example, the quantity T is denoted Z.
The next task is to look for L and U so that Pr(L ≤ Z ≤ U) = 1 − α. There are 3 standard ways to do it. One of them is to choose L and U so that P(Z < L) = α/2 and P(Z > U) = α/2. By the symmetry of the normal distribution, L = −U, and by convention this U is denoted $z_{\alpha/2}$.
Then we have
$$-z_{\alpha/2} \le Z = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}.$$
This inequality can be converted to the desired confidence interval for the parameter µ:
$$\overline{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \overline{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$
Alternatively, we can look for a "one-sided interval". So, take L = −∞ and look for U such that P{Z > U} = α. By definition this U is denoted $z_\alpha$, and it can be found from a table or by using software.
Then the inequality is
$$Z = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \le z_\alpha,$$
and it can be transformed into the lower confidence bound on the parameter µ:
$$\overline{X} - z_\alpha\frac{\sigma}{\sqrt{n}} \le \mu.$$
Note the difference from the previous inequality: here we use the factor $z_\alpha$ before the standard error $\frac{\sigma}{\sqrt{n}}$, while in the previous inequality we used $z_{\alpha/2}$.
Similarly, by using U = ∞ and looking for L such that P{Z < L} = α, we can derive the upper confidence bound on µ:
$$\mu < \overline{X} + z_\alpha\frac{\sigma}{\sqrt{n}}.$$

Example 2.1.3 (Confidence interval for σ²). Suppose that $X_1, \dots, X_n \sim N(\mu, \sigma^2)$, but now we know µ, and we are interested in deriving a confidence interval for σ².
The quantity $T = \sum_{i=1}^n \big((X_i - \mu)/\sigma\big)^2$ is known to be distributed according to the χ² distribution with n degrees of freedom, χ²(n). Therefore, it is a valid pivotal quantity and we can use it to derive a confidence interval for σ².
Again, it can be done in three possible ways. One of them is to find the quantities L and U such that P(T < L) = α/2 and P(T > U) = α/2. In this case, the distribution is not symmetric and we need to find 2 really different quantities. The quantity U is $\chi^2_{\alpha/2}(n)$ and the quantity L is $\chi^2_{1-\alpha/2}(n)$. They can be found from a table or by using software.
Then, we get
$$\chi^2_{1-\alpha/2}(n) \le \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 \le \chi^2_{\alpha/2}(n),$$
and, by putting σ² in the middle,
$$\frac{1}{\chi^2_{\alpha/2}(n)}\sum_{i=1}^n (X_i - \mu)^2 \le \sigma^2 \le \frac{1}{\chi^2_{1-\alpha/2}(n)}\sum_{i=1}^n (X_i - \mu)^2.$$
Note that for large n, $\chi^2_{\alpha/2}(n)$ is relatively close to n.
The upper and lower confidence bounds can be found similarly. For example, the (1 − α) upper confidence bound for σ² is
$$\sigma^2 \le \frac{1}{\chi^2_{1-\alpha}(n)}\sum_{i=1}^n (X_i - \mu)^2.$$
Note that in this inequality we use α instead of α/2.
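For concreteness, here is a minimal R sketch of the two-sided interval with α = 5% (the normal model with µ = 0, σ² = 4 and n = 40 is an arbitrary choice); qchisq(0.975, n) is the upper-tail critical value $\chi^2_{\alpha/2}(n)$.

set.seed(13)
n <- 40; mu <- 0
x <- rnorm(n, mean = mu, sd = 2)       # true sigma^2 = 4
ss <- sum((x - mu)^2)
c(lower = ss / qchisq(0.975, df = n),  # divide by chi^2_{alpha/2}(n)
  upper = ss / qchisq(0.025, df = n))  # divide by chi^2_{1-alpha/2}(n)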


Example 2.1.4. Suppose $X_1, \dots, X_n$ is a sample from the exponential distribution with mean θ. Find a confidence interval for θ.
Recall that the density of the exponential distribution with mean θ is $(1/\theta)e^{-x/\theta}$. Since we are looking for a sample statistic with a distribution that does not depend on θ, it is natural to remove the mean from the distribution of $X_i$ by dividing these random variables by θ. So let $Y_i = X_i/\theta$. Then it is easy to see (and easy to check by using one of the methods for calculating the density of $Y_i$) that $Y_i$ has the exponential distribution with mean 1. Hence it does not depend on the parameter θ.
Since we also want to incorporate the information from all of the observations, we are going to use the pivotal quantity $T = \sum_{i=1}^n (X_i/\theta)$. By the properties of the exponential distribution, T is distributed as a Gamma random variable with parameters n and 1. If $z^{(n)}_{1-\alpha/2}$ and $z^{(n)}_{\alpha/2}$ are critical values for this distribution, that is, if $P(T > z^{(n)}_{1-\alpha/2}) = 1 - \alpha/2$ and $P(T > z^{(n)}_{\alpha/2}) = \alpha/2$, then $P(z^{(n)}_{1-\alpha/2} < T \le z^{(n)}_{\alpha/2}) = 1 - \alpha$ and we get a confidence interval for θ:
$$\frac{\sum_{i=1}^n X_i}{z^{(n)}_{\alpha/2}} \le \theta \le \frac{\sum_{i=1}^n X_i}{z^{(n)}_{1-\alpha/2}}.$$
This interval has confidence level 1 − α. It can be shown that the size of the interval decreases when n grows.
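The Gamma critical values are available in R as qgamma; here is a minimal sketch of this interval with α = 5% (θ = 5 and n = 30 are arbitrary choices).

set.seed(17)
n <- 30; theta <- 5
x <- rexp(n, rate = 1/theta)
s <- sum(x)                              # T = s/theta ~ Gamma(n, 1)
c(lower = s / qgamma(0.975, shape = n),  # z^(n)_{alpha/2}
  upper = s / qgamma(0.025, shape = n))  # z^(n)_{1-alpha/2}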

Here is a bit less standard example, which shows that the pivot method sometimes gives poor confidence intervals.
Example 2.1.5. Suppose we have a sample $X_1, \dots, X_n$ of random variables distributed according to the exponential distribution with mean θ.
Suppose we want to build a confidence interval for θ with α = 10%.
The quantity $T = nX_{(1)}/\theta$ is pivotal. Indeed, let us write Y to denote $X_{(1)}$. This is a minimum of n i.i.d. exponential random variables, and it is easy to check that the density of Y is
$$f_Y(y) = \frac{n}{\theta}e^{-ny/\theta}.$$
That is, Y is exponential with mean θ/n. Similar to the previous example, it is easy to calculate the density of $T = nY/\theta = Y/(\theta/n)$ and check that it is exponential with mean 1.¹
Now we look for L and U so that
$$0.90 = \Pr(L \le T \le U) = \int_L^U e^{-t}\,dt = e^{-L} - e^{-U}.$$
There are infinitely many combinations of L and U which satisfy this. One possibility is to let
$$\Pr(T > U) = e^{-U} = 0.05 \quad\text{and}\quad \Pr(T < L) = 1 - e^{-L} = 0.05.$$
The solutions are L = 0.051 and U = 2.996. Now we have
$$0.051 \le T = n\frac{X_{(1)}}{\theta} \le 2.996.$$
We manipulate these two inequalities to put θ in the middle:
$$\frac{nX_{(1)}}{2.996} \le \theta \le \frac{nX_{(1)}}{0.051}.$$
Remark: The resulting confidence interval is not very good. Indeed, the length of the interval is $nX_{(1)}\left(\frac{1}{0.051} - \frac{1}{2.996}\right)$, and $nX_{(1)}$ is always an exponential random variable with mean θ, so we cannot expect that the length of this confidence interval goes to 0 as n grows.

¹ We are making the transformation T = nY/θ, which has the inverse transformation Y = (θ/n)T. Let us use the notation y(t) for the function y = (θ/n)t. Using the density transformation method, we calculate the density of T as follows:
$$f_T(t)\,dt = f_Y(y(t))\,\frac{dy(t)}{dt}\,dt = \frac{n}{\theta}e^{-n(\theta/n)t/\theta} \times \frac{\theta}{n}\,dt = e^{-t}\,dt.$$
So T has the exponential density with parameter 1.

2.2 Asymptotic confidence intervals


Finding an exact pivotal quantity for a parameter which results in a short
confidence interval is difficult! The good news is that for large sample we
can easily find an approximate pivotal quantity, meaning that a func-
tion of data and parameter has a distribution that converges to a fixed
probability distribution as the size of the sample, n, increases. Most often,
this distribution is the standard normal distribution.
By using this approximate pivotal quantity we can obtain the asymptotic
confidence intervals.

Definition 2.2.1. An asymptotic confidence interval with confidence


coefficient (1 − α) is a random interval [θbL , θbU ] such that
(n) (n)

lim P(θ ∈ [θbL , θbU ]) = 1 − α,


(n) (n)
n→∞

where n is the size of the data sample.

Often, the Central Limit Theorem (CLT) ensures that when the sample
size n is large enough an appropriate estimator is approximately normal
random variable.
2
• for θ = µ, the estimator θ̂ = Y is approximately ∼ N (µ, σn );

• for θ = p, the estimator θ̂ = p̂ is approximately ∼ N (p, p(1−p)


n );

• for θ = µ1 − µ2 , the estimator θ̂ = Y 1 − Y 2 is approximately ∼


σ2 σ2
N (µ1 − µ2 , n11 + n22 );

33
• for θ = p1 − p2 , the estimator θ̂ = p̂1 − p̂2 is approximately ∼ N (p1 −
p2 , p1 (1−p
n1
1)
+ p2 (1−p
n2
2)
);
In general, many estimators are asymptotically normal. We will discuss
asymptotic normality a bit later. Intuitively, this means that the distribu-
tion of an estimator θ̂ is close to the normal distribution, N (θ, σθ̂2 ), where
σθ̂2 = V ar(θ̂).
Then we can write an approximate pivotal quantity:
θb − θ
Z= ,
bθb
σ
Here σ bθb is a consistent estimator of σθ̂ . By a theorem which is called the
Slutsky theorem, the distribution of Z is close to the standard normal dis-
tribution.
Then we can proceed as usual and develop a confidence interval from
the pivotal quantity. Since Z is only an approximately pivotal quantity, the
resulting confidence interval will be only an asymptotic confidence interval,
that is, the probability that the interval covers θ equals 1 − α only if n is
large.
Example 2.2.2 (Two-sided asymptotic confidence interval for a parameter θ). By using our results for the normally distributed variables $X_1, \dots, X_n$, we write the (approximate) two-sided interval for θ, based on the point estimator $\hat\theta$, which is assumed to be approximately distributed as $N(\theta, \sigma_{\hat\theta}^2)$.
The two-sided confidence interval for θ with confidence coefficient 1 − α is
$$\big[\hat\theta - z_{\alpha/2}\sigma_{\hat\theta},\ \hat\theta + z_{\alpha/2}\sigma_{\hat\theta}\big].$$
Example 2.2.3 (Upper and lower confidence bounds). The one-sided large sample confidence intervals are as follows:

• The upper-bound confidence interval with confidence coefficient 1 − α is
$(-\infty,\ \hat\theta + z_\alpha\sigma_{\hat\theta}]$.

• The lower-bound confidence interval with confidence coefficient 1 − α is
$[\hat\theta - z_\alpha\sigma_{\hat\theta},\ +\infty)$.

Example 2.2.4. The shopping times of n = 64 randomly selected customers at a local supermarket were recorded. The average and variance of the 64 shopping times were 33 minutes and 256 minutes², respectively. Estimate µ, the true average shopping time per customer, with a confidence coefficient of 1 − α = .90.
Solution. We have $\hat\mu = 33$ and $\hat\sigma_{\hat\theta} = \sqrt{256/64} = 2$. We can get $z_{\alpha/2} = z_{0.05} = \texttt{qnorm(0.05, lower.tail = F)} = 1.644854$, so the CI is
$$(33 - 1.645 \times 2,\ 33 + 1.645 \times 2) = (29.71, 36.29).$$
We got $z_{0.05}$ by using the R function qnorm. We could also get it using an appropriate table.
Example 2.2.5. Two brands of refrigerators, denoted A and B, are each guaranteed for 1 year. In a random sample of 50 refrigerators of brand A, 12 were observed to fail before the guarantee period ended. An independent random sample of 60 brand B refrigerators also revealed 12 failures during the guarantee period. Estimate the true difference (p1 − p2) between proportions of failures during the guarantee period, with confidence coefficient approximately .98.
Solution.

• $n_A = 50$ and $Y_A = 12$, hence $\hat p_1 = 12/50 = 0.24$.

• $n_B = 60$ and $Y_B = 12$, hence $\hat p_2 = 12/60 = 0.2$.

• Use $\hat\theta = \hat p_1 - \hat p_2 = 0.04$ as the (point) estimator.

• $\hat\theta$ is approximately normal, and we have
$$\hat\theta \pm z_{\alpha/2}\sigma_{\hat\theta}$$
as the 100(1 − α)% confidence interval.

• Note that
$$\sigma_{\hat\theta} = \sqrt{\mathrm{Var}\,\hat\theta} = \sqrt{p_1(1-p_1)/n_A + p_2(1-p_2)/n_B},$$
where $p_1$ and $p_2$ are unknown but can be approximated by $\hat p_1$ and $\hat p_2$. So,
$$\hat\sigma_{\hat\theta} = \sqrt{0.24(1-0.24)/50 + 0.2(1-0.2)/60} = 0.0795.$$
We also have $z_{\alpha/2} = z_{0.01} = \texttt{qnorm(0.01, lower.tail = F)} = 2.326348$, and therefore the confidence interval is
$$(0.04 - 2.326348 \times 0.0795,\ 0.04 + 2.326348 \times 0.0795) = (-0.1449, 0.2249).$$

We used here the R function qnorm to find the value of $z_{0.01}$. The lower.tail option is set to F (FALSE) because we want the value $z_{0.01}$ such that the upper tail of the standard normal distribution is 0.01, that is, $P(Z > z_{0.01}) = 0.01$. Alternatively, we could get the value of $z_{0.01}$ from a table.
For the exam, you are supposed to know how to calculate this confidence interval. However, note that these calculations are already implemented in R, although R uses the language of hypothesis testing here, which we will learn later.
In particular, for this example, the confidence interval can be calculated as follows.

prop.test(c(12, 12), c(50, 60), conf.level = 0.98, correct = F)

The first argument is the vector of the numbers of successes (or failures, in our example), and the second argument is the vector of the sample sizes.

        2-sample test for equality of proportions without continuity correction

data:  c(12, 12) out of c(50, 60)
X-squared = 0.25581, df = 1, p-value = 0.613
alternative hypothesis: two.sided
98 percent confidence interval:
 -0.1448629  0.2248629
sample estimates:
prop 1 prop 2
  0.24   0.20
Example 2.2.6. A study was done on 41 first-year medical students to see
if their anxiety levels changed during the first semester. One measure used
was the level of serum cortisol, which is associated with stress. For each
of the 41 students the level was compared during finals at the end of the
semester against the level in the first week of classes. The average difference
was 2.08 with a standard deviation of 7.88. Find a 95% lower confidence
bound for the population mean difference µ. Does the bound suggest that
the mean population stress change is necessarily positive?
Example 2.2.7. A random sample of 539 households from a mid-western city was selected, and it was determined that 133 of these households owned at least one firearm ("The Social Determinants of Gun Ownership: Self-Protection in an Urban Environment," Criminology, 1997: 629–640). Using a 95% confidence level, calculate a lower confidence bound for the proportion of all households in this city that own at least one firearm.

2.3 How to determine the sample size


The sample size dilemma

• We need to collect samples to make inference about the population parameter. Question: how large should n be? 30? 40? 50?

• On the one hand, the more data you have, the more accurate your estimator $\hat\theta$ of θ.

• On the other hand, collecting samples is NOT free: it costs money, time, personnel...

Conclusion: We want the minimal sample size which allows us to achieve a given precision with a given level of confidence.
The formulas for confidence intervals provide an easy way to find the required size of the sample.

Precision + Confidence level → Required minimal sample size.

Rather than give a bunch of formulas, we illustrate the method with examples.
Example 2.3.1. The reaction of an individual to a stimulus in a psychological experiment may take one of two forms, A or B. If an experimenter wishes to estimate the probability p that a person will react in manner A, how many people must be included in the experiment? Assume that the experimenter will be satisfied if the error of estimation is less than .04 with probability equal to .90. Assume also that he expects p to lie somewhere in the neighborhood of .6.

• This is a problem of estimating p in a binomial distribution; n is to be found.

• Although n is yet to be found, let us assume that it is large enough, in which case $\hat\theta = \hat p$ is approximately normal; then $Z \equiv \frac{\hat p - p}{\sqrt{p(1-p)/n}} \approx \frac{\hat p - p}{\sqrt{\tilde p(1-\tilde p)/n}}$ is approximately standard normal, and hence $P\big(-z_{0.05} \le \frac{\hat p - p}{\sqrt{\tilde p(1-\tilde p)/n}} \le z_{0.05}\big) = 0.9$. This is to say that with probability 0.9,
$$|\hat p - p| < z_{0.05}\sqrt{\tilde p(1-\tilde p)/n}.$$

• If we want this error to be smaller than 0.04, we only need to have $z_{0.05}\sqrt{\tilde p(1-\tilde p)/n} \le 0.04$. Solving for n, we get n ≥ 406.

• The prior information $\tilde p = 0.6$ is used to approximate the standard error.

Example 2.3.2. Telephone pollsters often interview between 1000 and 1500
individuals regarding their opinions on various issues. A survey question
asks if a person believes that the performance of their athletics teams has
a positive impact on the perceived prestige of the institutions. The goal
of the survey is to see if there is a difference between the opinions of men
and women on this issue. Suppose that you design the survey and wish to
estimate the difference in a pair of proportions, correct to within .02, with
probability .9. How many interviewees should be included in each sample?

38
q
b σb =
1. What is the standard error of θ? p1 (1−p1 )
+ p2 (1−p2 )
θ n1 n1

2. θb = pb1 − pb2 is approximately normal with mean θ = p1 − p2 and


variance σ 2b;
θ

3. Hence, with probability 0.9, |θb − θ| < z0.05 σθb

4. Take n1 = n2 = n. For p1 and p2 , since we have no prior information,


we just replace them by 0.5, as the most conservative guess.

5. If we want this error to be smaller than 0.02, we only need to solve

z_{0.05} √(1/(4n) + 1/(4n)) < 0.02

6. n ≥ (1/2)(z_{0.05}/0.02)² = (1/2)(1.645/0.02)² = 3382.5
So we should take n = 3383.
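A quick R check of this arithmetic (a sketch using the same z_{0.05} = 1.645 and the conservative guess p1 = p2 = 0.5):

z <- 1.645                # z_{0.05}
B <- 0.02                 # desired bound on the error of estimation
ceiling(0.5 * (z / B)^2)  # 3383 interviewees in each sample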
Example 2.3.3. A state wildlife service wants to estimate the mean number
of days that each licensed hunter actually hunts during a given season, with
a bound on the error of estimation equal to 2 hunting days. If data collected
in earlier surveys have shown σ to be approximately equal to 10, how many
hunters must be included in the survey?

• The client is not sophisticated and does not formulate explicitly what level of confidence is required. In this case, it is typical to set the confidence level at 95% and use the 2-standard-error bound. If we want the error of estimation to be less than 2, then

2 > 2σ_θ̂ = 2σ/√n = 2 · 10/√n ⇒ n > 100.

2.4 Small-sample confidence intervals


2.4.1 Small sample CIs for µ and µ1 − µ2
Suppose the parameter of interest is the population mean µ and we have a
sample X1 , . . . , Xn . When the sample size is large, the Central Limit Theorem ensures that X̄ is approximately normal with distribution N(µ, σ²/n).

In addition, the parameter σ 2 can be consistently estimated by the sam-
ple variance. Thus, the quantity

Z = (X̄ − µ)/(S/√n)

is approximately pivotal with the asymptotic distribution N (0, 1). Based on Z, we can find

• 2-sided confidence interval for µ: [X̄ − z_{α/2} S/√n, X̄ + z_{α/2} S/√n]

• 1-sided lower confidence interval for µ: [X̄ − z_α S/√n, ∞)

• 1-sided upper confidence interval for µ: (−∞, X̄ + z_α S/√n];

Now suppose that the sample size n is small, say less than 30.
Then the quantity Z may have a distribution which is very different from
the standard normal distribution. Using the normal distribution in the
case of small samples leads to erroneous intervals!
What can we do? In general, there is no universal answer. The answer
depends on the distribution of the data points Xi .
If the data happen to be normally distributed and σ 2 is known, then

Z = (X̄ − µ)/(σ/√n)   (2.1)

is pivotal and normal. However, in most cases this fact cannot be used to
construct a confidence interval for µ since σ 2 is not known.
We can try to use the sample variance S 2 instead of σ 2 , however then

T = (X̄ − µ)/(S/√n)   (2.2)

is pivotal but its distribution is not normal. The reason is that S 2 is a ran-
dom quantity and dividing the normal random variable X − µ by a random
quantity instead of a deterministic coefficient breaks the normality of the
quotient.
What is the distribution of T ? Let us recall first some facts from the
probability theory.

Definition 2.4.1. If X1 , . . . , Xn are i.i.d. and distributed as standard normal r.v.s (∼ N (0, 1)), then the random variable Y ≡ X1² + · · · + Xn² has the χ² distribution with n degrees of freedom.
Definition 2.4.2. Suppose random variables Z ∼ N (0, 1) and Y ∼ χ2 (n)
are independent. Then the random variable
T ≡ Z/√(Y/n)   (2.3)

is distributed according to the t-distribution with n degrees of freedom,


denoted as T ∼ t(n).
This distribution is also often called Student’s t-distribution. It was
discovered by William Gosset who worked for Guinness brewery and wrote
his papers under the pen name Student.
For large n – say for n > 30 – the t(n) distribution is very close to the standard normal distribution N (0, 1). In general, the t distribution has many properties similar to those of the normal distribution. For example, its PDF is symmetric with respect to 0. However, the t distribution has heavier tails: when n is small, a r.v. with the t(n)-distribution takes large values with much higher probability than a standard normal random variable.
Now, note that T in formula (2.2) has an expression quite similar to (2.3). However, is it true that X̄ and S² in (2.2) are independent? And does S², suitably scaled, have a χ² distribution? The answer is almost entirely positive, by the following remarkable theorem, which we state without proof.
Theorem 2.4.3 (Joint distribution of sample mean and sample variance).
Let X1 , . . . , Xn ∼ N (µ, σ 2 ). Then the sample variance S 2 is independent of
the sample mean X and
(n − 1)S²/σ² = Σ_{i=1}^n ((X_i − X̄)/σ)²

has the χ2 distribution with n − 1 degrees of freedom.

Note that the random variable has a χ² distribution with one less degree of freedom than if we added together n independent standard normal random variables. The reason for this reduction is that the random variables X_i − X̄ are not independent. Intuitively, they can be expressed in terms of n − 1 independent normal random variables, hence the reduction in the degrees of freedom. The most surprising part of this theorem is the independence of X̄ and S².
By using this theorem, we can show that

T = (X̄ − µ)/(S/√n) ∼ t(n − 1),

that is, T has a t-distribution with df = n − 1 degrees of freedom.
Based on T ,

• 2-sided confidence interval for µ: [X̄ − t_{α/2} S/√n, X̄ + t_{α/2} S/√n]

• 1-sided lower confidence interval for µ: [X̄ − t_α S/√n, ∞)

• 1-sided upper confidence interval for µ: (−∞, X̄ + t_α S/√n];

tα can be found using statistical software or the t-table (Table 5, look for
subscript α with df = n − 1).
Note that the statistic in both the large-sample and small-sample cases is the same:

T ≡ (X̄ − µ)/(S/√n).

The only difference is that in the case of a large sample it is distributed as a normal random variable, and in the case of a small sample it is distributed as a t random variable.
Remark 1: the small sample confidence intervals based on t distribu-
tion are longer than asymptotic confidence intervals based on the standard
normal distribution.
Remark 2: the small sample confidence intervals based on t distribution
are valid only if the data are normally distributed.

Figure 2.2: Comparison of z and t confidence intervals for n = 10.

Figure 2.2 compares the z and t confidence intervals in samples with n = 10 observations.
• Red: using the tα/2 value – the correct one

• Blue: using the zα/2 value – the incorrect one

– Fail to take the uncertainty of S 2 into consideration


– Fail to deliver the promised coverage probability
– Shorter than the red one (for small α, zα/2 < tα/2 )

Example 2.4.4. The carapace lengths of ten lobsters examined in a study of


the infestation of the Thenus orientalis lobster by two types of barnacles,
Octolasmis tridens and O. lowei, are given in the following table. Find a
95% confidence interval for the mean carapace length (in millimeters, mm)
of T.orientalis lobsters caught in the seas in the vicinity of Singapore.

This is a small sample estimation problem for µ. Below are calculations


done in R.

> x = c(78, 66, 65, 63, 60, 60, 58, 56, 52, 50)
> x
[1] 78 66 65 63 60 60 58 56 52 50
> mean(x)                              # sample mean
[1] 60.8
> sum((x - 60.8)^2) / (10 - 1)         # sample variance
[1] 63.51111
> sqrt(sum((x - 60.8)^2) / (10 - 1))   # sample standard deviation
[1] 7.969386
> qt(0.975, 9)                         # 0.025 percentage point
                                       # for t distribution with df = 9
[1] 2.262157
> qt(0.975, 9) * sqrt(sum((x - 60.8)^2) / (10 - 1)) / sqrt(10)
[1] 5.700955

Answer: 60.8 ± 5.700955. Note that we divided by 9 (= n − 1) when we estimated S, and then again by √10 when we estimated σ_X̄. Do not forget the second division.
Alternatively, one can use the “sd” function to calculate the sample
standard deviation:

> mean(x) - qt(0.975, 9) * sd(x) / sqrt(10)
> mean(x) + qt(0.975, 9) * sd(x) / sqrt(10)

produce the required interval.


Finally, one can also get the confidence interval in a simpler way by using the function "t.test":

t.test(x)

produces the following output:

        One Sample t-test

data:  x
t = 24.126, df = 9, p-value = 1.727e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 55.09904 66.50096
sample estimates:
mean of x
     60.8

Exercise 2.4.5. The reaction time (RT) to a stimulus is the interval of time
commencing with stimulus presentation and ending with the first discernible
movement of a certain type. The article “Relationship of Reaction Time and
Movement Time in a Gross Motor Skill” (Percept. Motor Skills, 1973: 453–
454) reports that the sample average RT for 16 experienced swimmers to a
pistol start was .214 s and the sample standard deviation was .036 s.
Making any necessary assumptions, derive a 90% CI for true average RT
for all experienced swimmers.
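For such summary-statistics problems, the t interval can be computed directly in R; a minimal sketch (assuming, as needed here, that RTs are approximately normally distributed):

# 90% t-based CI from summary statistics: n = 16, mean = 0.214, sd = 0.036
n <- 16; xbar <- 0.214; s <- 0.036
xbar + c(-1, 1) * qt(0.95, df = n - 1) * s / sqrt(n)
# approximately (0.198, 0.230)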
Two sample t-test
We have samples X1 , . . . , X_{n1} and Y1 , . . . , Y_{n2}, with observations distributed according to N(µ1, σ1²) and N(µ2, σ2²), respectively. We assume that n1 and n2 are small and want to find a C.I. for µ1 − µ2.
Here we consider only the simplest case, where it is assumed that σ1 = σ2 = σ. Then we can define the pooled-sample estimator for the common variance σ²,

S_p² ≡ (Σ_{i=1}^{n1} (X_i − X̄)² + Σ_{i=1}^{n2} (Y_i − Ȳ)²) / (n1 + n2 − 2) = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2).

In this case,

T = (X̄ − Ȳ − (µ1 − µ2)) / σ̂_{X̄−Ȳ} = (X̄ − Ȳ − (µ1 − µ2)) / (S_p √(1/n1 + 1/n2)) ∼ t(n1 + n2 − 2),

that is, T has the t-distribution with df = n1 + n2 − 2;


So in this simple case, we have the following confidence interval for µ1 − µ2:

[X̄ − Ȳ − t_{α/2}^{(n1+n2−2)} S_p √(1/n1 + 1/n2), X̄ − Ȳ + t_{α/2}^{(n1+n2−2)} S_p √(1/n1 + 1/n2)].

Similarly, the lower bound confidence interval for µ1 − µ2 is

[X̄ − Ȳ − t_α S_p √(1/n1 + 1/n2), ∞),

and the upper bound confidence interval for µ1 − µ2 is

(−∞, X̄ − Ȳ + t_α S_p √(1/n1 + 1/n2)].
In the more general case, the formulas are more complicated, and one has to rely on software.
Example 2.4.6. To reach maximum efficiency in performing an assembly op-
eration in a manufacturing plant, new employees require approximately a
1-month training period. A new method of training was suggested, and a
test was conducted to compare the new method with the standard proce-
dure. Two groups of nine new employees each were trained for a period of
3 weeks, one group using the new method and the other following the stan-
dard training procedure. The length of time (in minutes) required for each
employee to assemble the device was recorded at the end of the 3-week pe-
riod. The resulting measurements are as shown in Table 8.3 (see the book).
Estimate the true mean difference (µ1 −µ2 ) with confidence coefficient .95.
Assume that the assembly times are approximately normally distributed,
that the variances of the assembly times are approximately equal for the
two methods, and that the samples are independent.
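In R, this pooled-variance interval is produced by t.test with var.equal = TRUE. The sketch below uses made-up assembly times, since the actual measurements are in Table 8.3 of the book:

# Hypothetical data standing in for Table 8.3 (9 employees per group)
new.method <- c(32, 37, 35, 28, 41, 44, 35, 31, 34)
standard   <- c(35, 31, 29, 25, 34, 40, 27, 32, 31)
t.test(new.method, standard, var.equal = TRUE, conf.level = 0.95)$conf.int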

2.4.2 Small sample CIs for population variance σ 2


Population variance σ 2 quantifies the amount of variability in the popula-
tion. We have already shown that if we observe the data sample (X1 , . . . , Xn ),
then σ 2 can be estimated by an unbiased point estimator

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².

How do we get a confidence interval for σ 2 ?


If the sample is large then S 2 is approximately normally distributed. It is
not difficult to derive a formula for Var(S 2 ) and develop an estimator for this

variance. Then the standard method for large sample confidence intervals
works. In practice, however, we are usually interested in confidence intervals
for σ 2 when we have a small data sample.
So assume that the data sample is small. In this case we must restrict ourselves to the situation where the data are normally distributed.
Assume that all sample data points X1 , X2 , . . . , Xn ∼ N (µ, σ 2 ).
The pivotal quantity (see Theorem 2.4.3) is
T = Σ_{i=1}^n (X_i − X̄)²/σ² = (n − 1)S²/σ² ∼ χ²(n−1)

We need to find L and U so that

P(L ≤ (n − 1)S²/σ² ≤ U) = 1 − α

A usual choice is L = χ²_{1−(α/2)} and U = χ²_{α/2}, both corresponding to (n − 1) d.f.
Hence,

P(χ²_{1−(α/2)}(n − 1) ≤ (n − 1)S²/σ² ≤ χ²_{α/2}(n − 1)) = 1 − α
⇔ P((n − 1)S²/χ²_{α/2}(n − 1) ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2}(n − 1)) = 1 − α

• Suppose we want a one-sided bound for σ², say a lower bound. We will make use of the pivotal quantity:

(n − 1)S²/σ² ∼ χ²(n−1)

• Note that if b = χ²_α for (n − 1) d.f., then

P((n − 1)S²/σ² ≤ b) = 1 − α

• Then we have, with 100(1 − α)% probability,

(n − 1)S²/χ²_{α,n−1} ≤ σ².

Hence (n − 1)S²/χ²_{α,n−1} is a (1 − α) confidence lower bound for σ².

Similarly, (n − 1)S²/χ²_{1−α,n−1} is a (1 − α) confidence upper bound for σ².
What if we want to build a confidence interval for the standard deviation σ = √σ², instead of σ²? This is simple:

• Since we know that

P((n − 1)S²/χ²_{α/2} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2}) = 1 − α,

• this is equivalent to

P(√((n − 1)S²/χ²_{α/2}) ≤ σ ≤ √((n − 1)S²/χ²_{1−α/2})) = 1 − α

• Therefore we can easily obtain a C.I. for σ.

Example 2.4.7. Suppose that you wished to describe the variability of the
carapace lengths of this population of lobsters. Find a 90% confidence in-
terval for the population variance σ 2 .

> x = c(78,66,65,63,60,60,58,56,52,50)
> x
[1] 78 66 65 63 60 60 58 56 52 50
> sum(( x-mean(x) )^2) # the numerator of sample variance and the CI LB and UB
[1] 571.6
> # and then we calculate the denominators of the CI LB and UB
> qchisq(0.05,9)
[1] 3.325113
> qchisq(0.95,9)
[1] 16.91898

The answer is (571.6/16.91898, 571.6/3.325113). Note that χ²_{0.95,9} = 3.325113 = qchisq(0.05,9) and χ²_{0.05,9} = 16.91898 = qchisq(0.95,9).
Both can also be obtained from Table 6.

Example 2.4.8. An optical firm purchases glass to be ground into lenses. As
it is important that the various pieces of glass have nearly the same index
of refraction, the firm is interested in controlling the variability. A simple
random sample of size n = 20 measurements yields S 2 = (1.2)10−4 . From
previous experience, it is known that the normal distribution is a reasonable
model for the population of these measurements. Find a 95% CI for σ.
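A minimal R sketch of the solution, using the chi-square bounds derived above (take square roots to pass from σ² to σ):

n <- 20; S2 <- 1.2e-4
c(lower = sqrt((n - 1) * S2 / qchisq(0.975, n - 1)),
  upper = sqrt((n - 1) * S2 / qchisq(0.025, n - 1)))
# roughly (0.0083, 0.0160)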

Chapter 3

Advanced properties of point estimators

3.1 More about consistency of estimators


We have seen in Chapter 1, that it is often difficult or even impossible to find
an unbiased estimator of a parameter θ. What about consistent estimators?
Can we find consistent estimators? We will see later that the answer is
positive and there are some useful methods to find a consistent estimator.
What we are going to do in this section is to study how one can prove that
an estimator is consistent.
Recall that the consistency of an estimator θb means that for every θ ∈ Θ
the estimator converges in probability to the true value of the parameter.
We have shown that θb is consistent if its Mean Square Error converges to
zero for every θ ∈ Θ as n → ∞. In fact, due to the MSE decomposition
theorem, it is enough to show that the bias and the variance of the estimator
converge to zero.
In practice, however, it is often difficult to calculate the variance of the
estimator. So, it is good news that in some cases consistency can be
established without actually calculating the variance.
Since consistency is all about convergence in probability, here are some
properties of this mode of convergence of random variables.

Theorem 3.1.1. Suppose that θ̂_n →_p θ and θ̂′_n →_p θ′. Then:

1. θ̂_n + θ̂′_n →_p θ + θ′;

2. θ̂_n × θ̂′_n →_p θ × θ′;

3. θ̂_n / θ̂′_n →_p θ/θ′, provided that θ′ ≠ 0;

4. For any continuous function g(u), g(θ̂_n) →_p g(θ);

5. For any continuous function g(u, v), g(θ̂_n, θ̂′_n) →_p g(θ, θ′);

6. For a sequence of numbers {a_n, n = 1, . . . }, a_n → a (in the calculus sense) implies that a_n →_p a (the a_n's are viewed as special random variables).

This result is called the continuous mapping theorem for the convergence
in probability.
We omit the proof.
Example 3.1.2 (S² is a consistent estimator of σ²). By definition

S_n² = (1/(n − 1)) (Σ_{i=1}^n Y_i² − n(Ȳ)²) = (n/(n − 1)) ((1/n) Σ_{i=1}^n Y_i² − (Ȳ)²)   (3.1)

We want to show that

S_n² →_p σ²

• By the law of large numbers, we have:

(1/n) Σ_{i=1}^n Y_i² →_p E(Y_i²) = σ² + µ²,   (3.2)

• Also, by the LLN,

Ȳ →_p E(Y_1) = µ   (3.3)

(Note that Ȳ is actually a sequence of random variables that depends on the sample size n. This dependence is suppressed in the notation, but we should remember it to understand what it means that Ȳ →_p E(Y_i).)

• By an application of the continuous mapping theorem with g(u) = u², we have from (3.3):

(Ȳ)² →_p µ²

• Consider g(u, v) = u − v and apply the continuous mapping theorem again:

(1/n) Σ_{i=1}^n Y_i² − (Ȳ)² →_p σ² + µ² − µ² = σ²

Recall that

S_n² = (n/(n − 1)) ((1/n) Σ_{i=1}^n Y_i² − (Ȳ)²),

and we have just shown that

(1/n) Σ_{i=1}^n Y_i² − (Ȳ)² →_p σ²

It remains to notice that

n/(n − 1) → 1 implies n/(n − 1) →_p 1,

and another application of the continuous mapping theorem (with function g(u, v) = uv, and variables u = n/(n − 1) and v = (1/n) Σ_{i=1}^n Y_i² − (Ȳ)²) shows that

S_n² = (n/(n − 1)) ((1/n) Σ_{i=1}^n Y_i² − (Ȳ)²) →_p 1 × σ² = σ².
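A quick simulation illustrates this convergence (a sketch; here the data are N(5, 2²), so σ² = 4):

set.seed(1)
for (n in c(10, 100, 10000)) {
  y <- rnorm(n, mean = 5, sd = 2)
  cat("n =", n, " S^2 =", var(y), "\n")   # var() uses the (n - 1) denominator
}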

Example 3.1.3 (Another estimator of σ²). Another estimator of σ² can be defined as

S_0² ≡ (1/n) (Σ_{i=1}^n Y_i² − n(Ȳ)²)

The denominator of S² is (n − 1) while that of its brother S_0² is n. Estimator S² is an unbiased estimator of σ², and estimator S_0² is biased. Is S_0² a consistent estimator of σ²? Yes! (By the continuous mapping theorem.)

52
Example 3.1.4. Is S := √S² a consistent estimator of σ?
Yes! By the continuous mapping theorem, if plim S_n² = σ², then plim √(S_n²) = √σ² = σ.
Is consistency really important?
Yes. If an estimator is not consistent, then it will not produce the correct estimate even if we are given the luxury of an unlimited amount of free data. It's a shame if one cannot get the correct answer in this situation. An inconsistent estimator is a waste of time.
Does consistency guarantee good performance?
Not necessarily. We still live in a finite sample world. Something that
is ultimately good for very large sample, may not be good enough for a
realistic sample size.

3.2 Asymptotic normality


Definition 3.2.1. An estimator θ̂_n is called asymptotically normal if (θ̂_n − θ)/√(Var(θ̂_n)) converges in distribution to the standard normal distribution.

Typically, Var(θ̂_n) ∼ σ²/n, and the constant σ² is called the asymptotic variance of the estimator. Intuitively, as n grows, the error of the estimator becomes more and more like a normal random variable with variance σ²/n. In particular, we can use the techniques that we learned in the previous chapter to build asymptotic confidence intervals.
In order to prove the asymptotic normality, we usually use the CLT
(Central Limit Theorem).
Example 3.2.2. Let X1 , . . . , Xn be a sample from a distribution with mean
E(Xi ) = µ and variance Var(Xi ) = σ 2 . Then X is asymptotically normal
with asymptotic variance σ 2 .
This is exactly the statement of the central limit theorem, namely: If
X1 , . . . , Xn are i.i.d. with mean µ and finite variance σ 2 , then

(X̄ − µ)/(σ/√n) → Z,

in distribution, where Z is the standard normal r.v.
Many other estimators are also asymptotically normal, but it might be
not so easy to find their asymptotic variance.
The following example is meant to illustrate that sometimes there are
estimators that have smaller asymptotic variance than sample mean.
Example 3.2.3. Let X1 , . . . , Xn be a sample from the Laplace distribution shifted by θ, that is, from the distribution with density

p_θ(x) = (1/2) e^{−|x−θ|}.

By symmetry, it is clear that E(X_i) = θ, so we can estimate θ by using either the sample mean or the sample median. Let us compare the asymptotic variances of the estimators X̄ and θ̂_med.
It turns out that it is possible to prove that θ̂_med is an asymptotically normal estimator of θ with asymptotic variance 1/(4p(0)²) = 1/(4 × (1/2)²) = 1.
On the other hand, for X̄, the asymptotic variance equals the variance of the Laplace distribution, which can be computed as 2.¹ It follows that in this example the sample median has smaller asymptotic variance than the sample mean.
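A simulation sketch of this comparison (Laplace draws are generated as differences of two independent exponentials, a standard trick; the true θ is 0):

set.seed(2)
n <- 200; reps <- 5000
means <- medians <- numeric(reps)
for (r in 1:reps) {
  x <- rexp(n) - rexp(n)              # Laplace(0, 1) sample
  means[r]   <- mean(x)
  medians[r] <- median(x)
}
n * var(means)     # close to 2, the asymptotic variance of the mean
n * var(medians)   # close to 1, the asymptotic variance of the median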
Exercise 3.2.4. Show that if X1 , . . . , Xn is a sample from the standard normal
distribution, then the sample mean has smaller asymptotic variance than the
sample median.
These examples show that the answer to the question of which estimator
is better often depends on the distribution from which we draw the sample.
Now let us finish this section by proving the result that we used in the
section about asymptotic confidence intervals.
Example 3.2.5. Let X1 , . . . , Xn be a sample from a distribution with mean E(X_i) = µ and variance Var(X_i) = σ². Then the statistic

T = (X̄ − µ)/(S/√n)
¹ σ²_X = (1/2) ∫_{−∞}^{∞} x² e^{−|x|} dx = ∫_0^{∞} x² e^{−x} dx = Γ(3) = 2! = 2.

converges to the standard normal distribution.
This is a direct consequence of the following result which we give without
proof.

Theorem 3.2.6 (Slutsky’s theorem). If Xn converges in distribution to a


variable X, and Yn converges in probability to a constant c, then

• Xn + Yn → X + c in distribution;

• Xn Yn → cX in distribution;

• Xn /Yn → X/c in distribution, provided that c ̸= 0.

3.3 Risk functions and comparison of point estimators
In Section 1.2, we defined the Mean Squared Error of a point estimator:

MSE_θ̂(θ) := E(θ̂ − θ)².

We wrote it here as a function of θ to emphasize that the MSE depends on


the true value of the parameter θ.
This is a particular case of the risk function of an estimator. More
generally, the risk function is

R_θ̂(θ) := E[u(θ̂ − θ)],

where u(x) is some non-negative function, which might depend on a particu-


lar application. The function u(x) is called the loss function. So, intuitively
the risk function is the expected loss from a mistake made while predict-
ing the parameter θ. In the case of the MSE, the loss function is simply a
quadratic function: u(x) = x2 .
In the following we will use the MSE, but everything also holds for other
risk functions.

The important thing is that the MSE and, more generally, any risk function depends on θ, which we do not know. Ideally, an estimator θ̂_1 is better than another estimator θ̂_2 if its MSE is smaller for every θ ∈ Θ. However, it might happen that the MSE of θ̂_1 is smaller than the MSE of θ̂_2 for one value of the parameter θ and larger for another. See the picture.

[Figure 3.1: MSE (risk functions) for two different estimators, θ̂_1 and θ̂_2.]

In general there are two approaches to deal with this situation. In the first approach we simply compute the average MSE of the estimators over the set of possible parameters and compare these averages. This is called the Bayesian approach, since it is popular in the branch of statistical theory called Bayesian statistics.
In the other approach one finds the values of θ which give the largest MSE for each of the estimators. The better estimator is the one with the smaller of these maximal MSEs. This is called the minimax approach.
Exercise 3.3.1. According to the minimax criterion, which of the estima-
tors is better for the situation pictured in Figure 3.1? Which one is better
according to the Bayesian criterion?
Example 3.3.2. Let X1 , . . . , Xn be sampled from the exponential distribution with mean θ. Consider estimators of θ that have the form θ̂ = µ_n(X1 + . . . + Xn). Calculate the MSE of these estimators. Which of them is best according to the Bayesian criterion? According to the minimax approach?
Let us calculate the MSE. In the calculation, we will use the fact that if

X_i ∼ Exp(θ) then E(X_i) = θ, Var(X_i) = θ² and therefore E(X_i²) = 2θ².

MSE(θ) := E[µ_n(X1 + . . . + Xn) − θ]²
= E[µ_n²(Σ_{i=1}^n X_i² + Σ_{i≠j} X_i X_j) − 2µ_n θ Σ_{i=1}^n X_i + θ²]
= µ_n²(2nθ² + n(n − 1)θ²) − 2µ_n(nθ²) + θ²
= θ²(n(n + 1)µ_n² − 2nµ_n + 1).

What is important here is that the MSE is θ² multiplied by a constant that depends on µ_n. So, it does not matter if we use the Bayesian or the minimax criterion; according to both, the best estimator is the estimator that minimizes the constant

n(n + 1)µ_n² − 2nµ_n + 1.

It is easy to check that the minimum is reached for

µ_n* = 2n/(2n(n + 1)) = 1/(n + 1).

[Figure 3.2: MSE (risk functions) for two different estimators, θ̂_1 and θ̂_2.]
So the best estimator is

θ̂ = (1/(n + 1)) Σ_{i=1}^n X_i = (n/(n + 1)) X̄.

Since we know that X̄ is an unbiased estimator of the mean, we have found that the best estimator of θ is biased!
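A simulation sketch confirming that the shrunken estimator beats the sample mean in MSE (θ = 3 and n = 10 are arbitrary choices):

set.seed(3)
theta <- 3; n <- 10; reps <- 100000
xbar <- replicate(reps, mean(rexp(n, rate = 1/theta)))
mean((xbar - theta)^2)                # MSE of X-bar: about theta^2/n = 0.9
mean((n/(n + 1) * xbar - theta)^2)    # about theta^2/(n + 1) = 0.82, smaller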

3.4 Relative efficiency


In this section, we suppose that the estimators that we evaluate are unbiased.
In this case, the MSE of an estimator equals its variance. Suppose we
compare two unbiased estimators. Which of these estimators is better?

Usually, an estimator with smaller variance is preferable. One quantitative measure that statisticians use to compare unbiased estimators is their relative efficiency.

Definition 3.4.1. Given two unbiased estimators θ̂_1 and θ̂_0 of the same parameter θ, the relative efficiency of θ̂_1 relative to θ̂_0 is defined to be the ratio of their variances:

eff(θ̂_1, θ̂_0) = Var(θ̂_0)/Var(θ̂_1).

We can think of θ̂_0 as a reference estimator. An estimator θ̂_1 with relative efficiency greater than 1 is better than θ̂_0, since its variance is smaller and so θ̂_1 is a more accurate estimator of θ than θ̂_0! Note that we assumed from the outset that both estimators are unbiased.

• Simple example: Y1 , . . . , Y9 ∼ N(µ_Y, 1). Want to estimate µ_Y.

  1. θ̂_1 = (1/9) Σ_{i=1}^9 Y_i,
  2. θ̂_2 = (1/2)(Y1 + Y2).

• Both θ̂_1 and θ̂_2 are unbiased estimators for θ!

  – Eθ̂_1 = θ and Eθ̂_2 = θ;

• Var(θ̂_1) = (1/9) Var(Y1) = 1/9;

• Var(θ̂_2) = (1/2) Var(Y1) = 1/2;

• eff(θ̂_1, θ̂_2) = 9/2 > 1;

• θ̂_1 is better than θ̂_2; why?

• The estimator θ̂_2 does not use all the information available, and so it is less efficient.

Example 3.4.2. Let Y1 , Y2 , . . . , Yn denote a random sample from the uniform


distribution on the interval [0, θ]. Two unbiased estimators for θ are
θ̂_1 = 2Ȳ and θ̂_2 = ((n + 1)/n) Y_(n),

where Y_(n) = max{Y1 , Y2 , . . . , Yn }. Find the efficiency of θ̂_1 relative to θ̂_2.

Both are unbiased and therefore we only need to compute their variances.
Var(θ̂_1) = Var(2Ȳ) = 4Var(Y1)/n = θ²/(3n). (This is because Y_i = θX_i, where X_i is a random variable that has the uniform distribution on [0, 1], and the variance of X_i is 1/12.)
In order to compute Var(θ̂_2) = Var(((n + 1)/n) max{X1θ, . . . , Xnθ}), we note that Y_i = θX_i, where X_i is uniformly distributed on [0, 1]. Then,

Var(((n + 1)/n) max{Y1 , . . . , Yn }) = Var(((n + 1)/n) θ max{X1 , . . . , Xn }) = θ² ((n + 1)/n)² Var(max{X1 , . . . , Xn }).
So we only need to compute Var(max{X1 , X2 , . . . , Xn }). To do this we
need to find the density of X(n) = max{X1 , X2 , . . . , Xn }.
Recall that P(X(n) ≤ x) = P(X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x), so we
have for the cdf, FX(n) (x) = (FX1 (x))n , and for the density, fX(n) (x) =
nfX1 (x)(FX1 (x))n−1 . In our particular case, fX(n) (x) = nxn−1 .
Then we calculate:

E X_(n) = n ∫_0^1 x · x^{n−1} dx = n/(n + 1),
E X_(n)² = n ∫_0^1 x² · x^{n−1} dx = n/(n + 2),
Var(X_(n)) = E X_(n)² − (E X_(n))² = n/(n + 2) − n²/(n + 1)² = (n(n + 1)² − n²(n + 2))/((n + 2)(n + 1)²) = n/((n + 2)(n + 1)²).

It follows that

Var(θ̂_2) = θ² ((n + 1)/n)² · n/((n + 2)(n + 1)²) = θ²/(n(n + 2)),

and the relative efficiency is

eff(θ̂_2, θ̂_1) = Var(θ̂_1)/Var(θ̂_2) = (θ²/(3n)) / (θ²/(n(n + 2))) = (n + 2)/3.

Hence, the second estimator is much more efficient than the first one.
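A simulation sketch of this relative efficiency (θ = 1 and n = 20, so (n + 2)/3 ≈ 7.3):

set.seed(4)
theta <- 1; n <- 20; reps <- 100000
est1 <- replicate(reps, 2 * mean(runif(n, 0, theta)))          # 2 * Y-bar
est2 <- replicate(reps, (n + 1)/n * max(runif(n, 0, theta)))   # (n+1)/n * Y_(n)
var(est1) / var(est2)   # close to (n + 2)/3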

3.5 Sufficient statistics
There is a huge multitude of functions of the data that we can consider in
the search for a good estimator. So it is worthwhile to check if we can reduce
the data to one or just a few of summary statistics. This is the main idea
behind the concept of sufficient statistics.

Definition 3.5.1. Let X1 , X2 , . . . , Xn denote a random sample from a dis-


tribution with unknown parameter θ. A statistic T = T (X1 , X2 , . . . , Xn ) is
said to be sufficient for θ if the conditional distribution of (X1 , X2 , . . . , Xn ),
given T = t, does not depend on θ. That is,

Pθ (X1 , . . . , Xn |T (X1 , . . . , Xn ) = t)

depends only on t but does not depend on θ.

Intuitively, if we know that T = t, then revealing the complete infor-


mation about X1 , X2 , . . . , Xn does not give us any additional information
about θ.
Example 3.5.2. Let X1 , . . . , Xn be an i.i.d. sample from the Bernoulli dis-
tribution with parameter θ = p. That is, Xi takes values 1 and 0 with prob-
abilities p and 1 − p respectively. Consider a statistic T = X1 + . . . + Xn .
Then

p(x1 , . . . , xn | T = t) = p(x1 , . . . , xn , T = t) / p(T = t)
= [p^{Σ_i x_i} (1 − p)^{n − Σ_i x_i} / (C(n, t) p^{Σ_i x_i} (1 − p)^{n − Σ_i x_i})] · δ(Σ_i x_i − t)
= (1/C(n, t)) · δ(Σ_i x_i − t),

where C(n, t) is the binomial coefficient and δ(Σ_i x_i − t) equals 1 if Σ_i x_i = t and 0 otherwise. We can see that this conditional probability does not depend on p.
In principle, a sufficient statistic can be a vector; that is, it can consist of several functions. For example, if you take the vector of order statistics, T = (X_(1), X_(2), . . . , X_(n)), then it is a sufficient statistic. However, we don't gain very much by considering such statistics, since they do not reduce the data.
If we take a function of a vector of sufficient statistics and reduce the dimension, then this can potentially break the sufficiency; however, in some cases the resulting function is still sufficient. For example, any invertible function of a sufficient statistic is sufficient – it does not lose any information.
A sufficient statistic is minimal if it can be written as a function of any other sufficient statistic. (A minimal sufficient statistic exists under mild conditions on the distribution of the data, but there are some counterexamples.)
Why do we care about sufficient statistics?
In some cases, we can find a good estimator of a parameter θ by a 2-step
procedure:

1. Find a sufficient statistic T (X1 , X2 , . . . Xn ) for parameter θ. Infor-


mally, it contains all information about the parameter which is avail-
able in the data.

2. Find an unbiased estimator of θ, which is a function of T . We can


hope that all relevant information in data was used and the estimator
cannot be further improved.

How can we find a good sufficient statistic? We should try to factorize


the likelihood function.
Recall that the likelihood function is the same thing as the joint distribu-
tion density or joint distribution pmf of data X1 , . . . , Xn . In this course we
consider only the situation when X1 , . . . , Xn are independent and identically
distributed, so

1. If X1 , . . . , Xn are discrete random variables, and p(x) := P(X = x|θ)


is the probability mass function of each of them, then

L(θ | x1 , . . . , xn ) = p_{X1,...,Xn}(x1 , . . . , xn | θ) = Π_{i=1}^n p(x_i|θ).

2. If X1 , . . . , Xn are continuous random variables and f (x|θ) is the den-
sity of each of them, then
L(θ | x1 , . . . , xn ) = f_{X1,...,Xn}(x1 , . . . , xn | θ) = Π_{i=1}^n f(x_i|θ).

To simplify the notation, we sometimes write L(θ) instead of L(θ |


x1 , . . . , xn ).

Theorem 3.5.3 (Fisher’s Factorization Criterion). A statistic T is a suffi-


cient statistic for parameter θ if and only if L(θ) can be factorized into two
nonnegative functions:

L(θ|x1 , . . . xn ) = g(θ, t(x1 , . . . , xn )) × h(x1 , . . . , xn )

Here g(θ, t) is a function only of t (the observed value of T ) and θ and


the function h(x1 , . . . , xn ) does not depend on θ at all.
Let us indicate how one can prove one direction of this theorem. Suppose
that the factorization holds. Then, one can write:
p(x1 , . . . , xn | T = t) = p(x1 , . . . , xn , T = t) / p(T = t)
= g(θ, t) h(x1 , . . . , xn ) δ(t(x1 , . . . , xn ) − t) / Σ_{x⃗: t(x⃗)=t} g(θ, t) h(x1 , . . . , xn )
= h(x1 , . . . , xn ) δ(t(x1 , . . . , xn ) − t) / Σ_{x⃗: t(x⃗)=t} h(x1 , . . . , xn ),

and the result does not depend on the parameter θ.

Example 3.5.4. • Y1 , . . . , Yn ∼ B(p). Find a sufficient statistic for p.

• Likelihood:

L(p) = Π_{i=1}^n p^{y_i}(1 − p)^{1−y_i} = p^{Σ_{i=1}^n y_i} (1 − p)^{n − Σ_{i=1}^n y_i} = (p/(1 − p))^{Σ_{i=1}^n y_i} (1 − p)^n × 1
• We can define the statistic T = Σ_{i=1}^n Y_i. Then we have

  – g_p(t) = (p/(1 − p))^t (1 − p)^n, and h(y1 , . . . , yn ) = 1
  – The first term depends only on p and T (or t)
  – The second term does not depend on p

Example 3.5.5. • Y_i ∼ iid Poisson(λ), i.e. ∼ p(y) = e^{−λ} λ^y / y!

• Likelihood (use independence):

L(λ) = Π_{i=1}^n e^{−λ} λ^{y_i} / y_i!

• Factorization:

L(λ) = e^{−nλ} λ^{Σ_{i=1}^n y_i} / Π_{i=1}^n (y_i!) = e^{−nλ} λ^{Σ_{i=1}^n y_i} × (1/Π_{i=1}^n (y_i!))

Hence T = Σ_{i=1}^n Y_i is a sufficient statistic for λ.

Example 3.5.6. Let X1 , X2 , . . . , Xn be a random sample in which Xi is ex-


ponentially distributed (remember life of smartphones?) with parameter θ.
Find a sufficient statistic for θ.

Example 3.5.7. • Y_i ∼ iid N(µ, σ²), i.e.

∼ f(y) = (1/√(2πσ²)) exp{−(y − µ)²/(2σ²)}

• Likelihood:

L(·) = Π_{i=1}^n (1/√(2πσ²)) exp{−(y_i − µ)²/(2σ²)} = (2πσ²)^{−n/2} exp{−Σ_{i=1}^n (y_i − µ)²/(2σ²)}

• Note that Σ_{i=1}^n (y_i − µ)² = Σ_{i=1}^n (y_i − ȳ + ȳ − µ)² = Σ_{i=1}^n (y_i − ȳ)² + n(ȳ − µ)² = (n − 1)s² + n(ȳ − µ)²

• Thus we have

L(·) = (2πσ²)^{−n/2} · exp{−(n − 1)s²/(2σ²)} · exp{−n(ȳ − µ)²/(2σ²)}

• The argument of L(·) is not specified because there are two situations.

1. µ is unknown and σ² is known ⇒ L(µ):

L(µ) = exp{−n(ȳ − µ)²/(2σ²)} · (2πσ²)^{−n/2} · exp{−(n − 1)s²/(2σ²)}

Ȳ is a sufficient statistic for µ.

2. Both µ and σ² are unknown ⇒ L(µ, σ²):

L(µ, σ²) = (2πσ²)^{−n/2} · exp{−(n − 1)s²/(2σ²)} · exp{−n(ȳ − µ)²/(2σ²)} × 1

The pair (Ȳ, S²) is a sufficient statistic for (µ, σ²).

Quiz 3.5.8. Every function of a sufficient statistic is a sufficient statistic.

A. True

B. False

Quiz 3.5.9. Every strictly decreasing function of a sufficient statistic is a sufficient statistic.

A. True

B. False

Even a minimal sufficient statistic is not unique!

Example 3.5.10 (Uniform distribution). Let X1 , X2 , . . . , Xn be uniformly


distributed in (0, θ). What is a sufficient statistic for θ?

• Density of one random variable:

f_{X_i}(x_i) = 1/θ for 0 < x_i < θ, and 0 otherwise; that is, f_{X_i}(x_i) = (1/θ) 1_{0<x_i<θ},

where 1_{0<x_i<θ} is the indicator function of the event {0 < x_i < θ}.
• Likelihood:

L(θ) = Π_{i=1}^n (1/θ) 1_{0≤x_i≤θ}

• Note that Π_{i=1}^n 1_{θ≥x_i} = 1_{θ≥x_(n)}. Not very obvious. Think!

• Thus the likelihood is L(θ) = 1_{θ≥x_(n)} (1/θ^n) = 1_{θ≥x_(n)} (1/θ^n) × 1

• Hence T = X(n) is a sufficient statistic for θ.

• Any 1-to-1 function of a sufficient statistic is also a sufficient statistic for the same parameter. ⇒ the unbiased estimator θ̂ = ((n + 1)/n) X_(n) is also a sufficient statistic for θ.

• Note that previously we found that this estimator is much more ef-
ficient than another unbiased estimator 2X. This is a reflection of a
general fact which is called the Rao-Blackwell theorem.

Exercise 3.5.11. • f (y) = exp[−(y − θ)] for y > θ

• Sufficient statistic?

3.6 Rao-Blackwell Theorem and Minimum-Variance


Unbiased Estimator
This section relates sufficiency with efficiency and unbiasedness. It will tell
us why a sufficient statistic is very useful in statistical inference.

• We have learned two good qualities of an estimator θ̂ for a parameter θ:

  – Unbiasedness: Eθ̂ = θ;
  – Low variance: Var(θ̂) is small;

• Relative efficiency of two estimators θ̂_1 and θ̂_2:

Reff(θ̂_1, θ̂_2) = Var(θ̂_2)/Var(θ̂_1);

whichever has the smaller variance is more efficient (better)!

Definition 3.6.1. An unbiased estimator θ̂ is called an MVUE (Minimum-Variance Unbiased Estimator) if for every other unbiased estimator θ̂′ and every value of the parameter θ ∈ Θ, Var(θ̂) ≤ Var(θ̂′).

(Sometimes it is called UMVUE, where the first U stands for “uniform”


to emphasize that the minimal variance property should hold for every θ ∈
Θ.)
Since unbiased estimators do not always exist, MVUEs are even rarer. However, if an MVUE exists, how can we find it?
The main idea of the following theorem is that we can always improve any
unbiased estimator by conditioning it on a sufficient statistic. In particular,
if an MVUE exists, it must be a function of a sufficient statistic.

Theorem 3.6.2 (Rao-Blackwell Theorem). Let θ̂ be an unbiased estimator for θ such that Var(θ̂) < ∞. If T is a sufficient statistic for θ, define θ̂* = E(θ̂|T). Then, for all θ, Eθ̂* = θ and Var(θ̂*) ≤ Var(θ̂).

• Given an unbiased estimator θb and a sufficient statistic T , we can find


a modified estimator θb∗ , which is improved in the sense that

– θb∗ is still unbiased;


– θ̂* has a smaller (or at least no larger) variance than θ̂;

• Remarks:

– θb∗ is a function of T
– θb∗ is random
– If θ̂ is already a function of T, then E(θ̂|T) = θ̂, i.e., taking the conditional expectation does not change anything; in particular, it does not improve the efficiency.

Proof. Because T is sufficient for θ, the conditional distribution of any statistic (including θ̂), given T, does not depend on θ. Thus, θ̂* = E(θ̂|T) does not depend on θ and is therefore a statistic. The fact that θ̂* is unbiased is almost obvious from the law of repeated expectation:

Eθ̂* = E[E(θ̂|T)] = Eθ̂ = θ.

For the variance we use another theorem about conditional expectations:

Var(θ̂) = Var[E(θ̂|T)] + E[Var(θ̂|T)] = Var(θ̂*) + E[Var(θ̂|T)].

Since the second term is non-negative, we find that Var(θ̂*) ≤ Var(θ̂).

This theorem implies that if an unbiased estimator θ̂ is NOT a function of a sufficient statistic T, then we can find another unbiased estimator θ̂* = E(θ̂|T), which is a function of T and is at least as good as θ̂. However, it does NOT imply that this estimator is an MVUE. Perhaps we can find another sufficient statistic T_2 and improve θ̂* by taking a conditional expectation with respect to T_2.
A natural conjecture is that if T is a minimal sufficient statistic and some function θ̂ = θ̂(T) is an unbiased estimator of θ, then θ̂ is an MVUE. This is again not quite correct. One has to impose a stronger requirement: that T is a complete sufficient statistic. In this case, T is also guaranteed to be a minimal sufficient statistic, and a function θ̂ = θ̂(T) which is unbiased for θ is indeed an MVUE.

This raises questions about what a complete sufficient statistic is and
how one can check that a sufficient statistic is complete. We will not be
concerned with these questions in this course and simply promise that in all
our examples the sufficient statistics obtained by factorization theorem will
be complete and sufficient.
Routine to find the MVUE

1. Factorize the likelihood function and find a (minimal) sufficient statis-


tic T ;

2. find a function of T which is unbiased for the parameter of interest θ;

Example 3.6.3 (Exponential). Suppose X1 , X2 , . . . , Xn are all from the exponential distribution with parameter θ. What is an MVUE for θ?
We showed that T = X1 + . . . + Xn is a sufficient statistic. In fact it is a minimal and complete statistic, and since X̄ = T/n is unbiased for θ, it is an MVUE.
Example 3.6.4 (Bernoulli). Let the X_i's be i.i.d. Bernoulli with parameter p. What is an MVUE for p?
The same argument works as in the previous example. We already proved that T = Σ_{i=1}^n X_i is a sufficient statistic for p. Hence p̂ = X̄ = (1/n) Σ_{i=1}^n X_i is an MVUE (for p).
Example 3.6.5 (Normal). X1 , . . . , Xn are i.i.d. from the normal distribution N(µ, σ²). What are the MVUEs for µ and σ²?
We have shown that X̄ and S² are joint sufficient statistics for µ and σ². In addition, we know that X̄ and S² are unbiased estimators of µ and σ². Therefore, X̄ and S² are the MVUEs for µ and σ², respectively.
Example 3.6.6. Y1 , . . . , Yn are i.i.d. from a distribution with density f(y|θ) = (2y/θ) e^{−y²/θ} for y > 0. MVUE for θ?


[Quizzes]

Chapter 4

Methods of estimation

Suppose that we are looking for a good estimator θ̂(X1 , . . . , Xn ) for a parameter θ. The method in the previous section asks us to find a minimal complete sufficient statistic T = T(X1 , . . . , Xn ) by using a factorization criterion and then find a function of T which would be an unbiased estimator of θ. This will lead us to an MVUE. Unfortunately, even if T is known, it is often difficult to find θ̂(T) which would be unbiased for θ. In fact, in some cases no unbiased estimator for θ exists.
For this reason we are looking for other methods to construct an estima-
tor, which would be easy to construct and which would have a small MSE
in a large sample.
We will consider two such methods, Method of Moments Estimation
(MME) and Maximum Likelihood Estimation (MLE).

4.1 Method of Moments Estimation


We consider our usual situation when data X1 , . . . , Xn is an i.i.d sample from
a distribution Fθ (x) which belongs to a family of distributions parameterized
by θ ∈ Θ. In general, parameter θ can be a vector θ = (θ1 , . . . , θs ) that
consists of several components. We want to estimate θ using the data sample.
Recall that the k-th population moment is simply the theoretical expectation of the k-th power of an observation X_i, that is,

µ_k(θ) := E(X_i^k) = ∫ x^k f(x|θ) dx if the X_i are continuous r.v.s, and Σ_x x^k p(x|θ) if the X_i are discrete r.v.s.

Note that the population moments are all functions of the parameter θ
(and do not depend on the sample data).
In contrast, the sample moments are functions of the data sample Xi .
(They are random quantities and their distribution depends on θ.) We
denote the k-th sample moment mk .

m_k = m_k(X1 , . . . , Xn ) := (1/n) Σ_{i=1}^n X_i^k.

For example, the first sample moment equals the sample mean, m_1 = X̄; the second sample moment can be expressed in terms of the sample variance and the sample mean: m_2 = ((n − 1)S² + n(X̄)²)/n.
It should be emphasized that the sample moments are all functions of
the data, i.e., they can all be calculated using the data. And they are all
random.
The main idea behind Method of Moments Estimator is that by the Law
of Large Numbers the sample moments converge to population moments:

m_k = (1/n) Σ_{i=1}^n X_i^k →_p E(X_i^k) = µ_k(θ),

in probability as n → ∞.
So for large n we have,

mk (X1 , . . . , Xn ) = µk (θ) + εk ,

where εk is very small with probability close to 1. This means that empirical
moments are consistent estimators of the population moments. In addition,
we know the form of the functions µk (θ), although we do not know the value
of θ. Hence we can invert the system of these functions and get an estimator
of θ from the estimator µbk = mk (X1 , . . . , Xn ) of µk (θ).

So, the idea is to ignore ε_k and solve the system of equations

m_k(X1 , . . . , Xn ) = µ_k(θ̂), k = 1, 2, . . . , s

for θ̂. (It might be just one equation if the parameter is a one-dimensional vector, that is, a real number.)
The solution is given by the inverse function θ̂ = µ_k^{(−1)}(m_k). If the inverse function µ_k^{(−1)} is continuous, then we can use the continuous mapping theorem and show that θ̂ → θ in probability as n → ∞; in other words, θ̂ is a consistent estimator of θ.
How many moments do we need for estimation? Typically, if we need to estimate a vector that consists of s parameters θ1 , . . . , θs , we use the first s
moments. However it might happen that one of the first theoretical moments
µk (θ) actually does not depend on the parameter of interest. For example, if
for every θ the distribution Fθ (x) has the density function symmetric relative
to the origin x = 0, then the first population moment (population mean or
expectation) is zero for every θ, and therefore this first moment is not going
to help us in the estimation of parameters.

Practical steps:

1. Calculate the first K population moments µ_k(θ⃗) as functions of the vector of unknown parameters θ⃗ = (θ1 , . . . , θs ). (Typically, if there are s unknown parameters, then one needs to calculate the first s moments: K = s.)

2. Write the first s sample moments m_k, k = 1, . . . , K, as functions of the data vector X_i.

3. Match the moments: solve the system of K equations µ_k(θ̂) = m_k(X⃗) for θ̂.

4. The solution is a (vector) function θb of the sample moments mk and


hence of the data Xi , because the sample moments are functions of
the data. Since we believe that the equations are approximately true,
we also believe that the solution θb is close to the true value of the

parameter θ. These solutions give us the desired Method of Moments
Estimator (θbM M E ).

MME and Consistency Under some mild regularity conditions on the


distribution of data, Method of Moment estimators are consistent. This is
roughly because

• Under the conditions, we have

m_k(X1 , . . . , Xn ) →_p µ_k(θ1 , . . . , θs ), for k = 1, . . . , s,

when n → ∞.

• The solution to the system of equations "m_k = µ_k(θ̂_1 , . . . , θ̂_s ), k = 1, . . . , s" is a continuous map

θ̂_k = θ̂_k(m_1 , . . . , m_s ),

where k = 1, . . . , s. It is the inverse of the moment transformation µ that sends the parameters θ1 , . . . , θs to the moments µ_1 , . . . , µ_s , so we could write (in vector form) θ̂(m_1 , . . . , m_s ) = µ^{(−1)}(m_1 , . . . , m_s ).

• By the continuous mapping theorem, we may conclude (again, for conciseness, using vector notation for the parameter θ and the estimator θ̂):

θ̂ = θ̂(m_1 , . . . , m_s ) →_p µ^{(−1)}(µ_1(θ), . . . , µ_s(θ)) = θ.

Example 4.1.1 (Normal). The data X1 , . . . , Xn are distributed according to the normal distribution N(µ, σ²).

• There are 2 parameters to estimate. The vector parameter θ has two


components, θ = (µ, σ 2 )

• The first population moment is µ_1(θ) := E(X_i) = µ. (There is a clash of notation here: µ_1(θ) is the first population moment and µ also denotes the first component of the parameter vector θ.)

• The second population moment is µ_2(θ) := E(X_i²) = Var(X_i) + (EX_i)² = σ² + µ².

• The first sample moment is m_1 := (1/n) Σ_{i=1}^n X_i = X̄.

• The second sample moment is m_2 := (1/n) Σ_{i=1}^n X_i².

By matching the population and sample moments we obtain the equations:

µ̂ = m_1 = X̄
σ̂² + µ̂² = m_2 = (1/n) Σ_{i=1}^n X_i²

After solving these equations we get:

µ̂ = X̄;
σ̂² = (1/n) Σ_{i=1}^n X_i² − (X̄)²

These are the MME estimators of the parameters. Note that the MME estimator for the variance is different from the standard estimator, the sample variance, which is

S² = (1/(n − 1)) (Σ_{i=1}^n X_i² − n(X̄)²)

It is easy to see that they are related by the formula:

σ̂²_MME = ((n − 1)/n) S².
In particular, the MME estimator is biased.
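In R, the two MME estimates are one line each; a sketch on simulated data (true µ = 10, σ = 3):

set.seed(5)
x <- rnorm(50, mean = 10, sd = 3)
mu.hat     <- mean(x)                 # MME of mu
sigma2.hat <- mean(x^2) - mean(x)^2   # MME of sigma^2, equals (n-1)/n * var(x)
c(mu.hat, sigma2.hat)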
Example 4.1.2 (Poisson with unusual parameter). The data sample X1 , . . . Xn
is distributed according to the Poisson distribution with parameter λ. Find
the estimator of the parameter θ = 1/λ.

• Only 1 parameter to be estimated.

• The first population moment is µ(θ) = EXi = λ = 1/θ

• The first sample moment is m1 = X

• Match the two quantities above and solve for θ̂:

1/θ̂ = X̄

• Hence,

θ̂_MME = 1/X̄

If you examine the example above attentively, you will see that we could estimate the parameter λ (the population mean) by the sample mean X̄. This would give an MME estimator of λ. Then, since the parameters θ and λ are in one-to-one correspondence with each other, solving the MME equations for θ̂ can be done by solving the MME equations for λ̂ and then using the one-to-one relation between the parameters. This gives us the MME estimator θ̂ = 1/λ̂ = 1/X̄.
This is a manifestation of the general principle valid for MME estimators.
Plug-in or Invariance property of MME

• If the parameter of interest ψ is a function of another parameter θ whose MME is relatively easy to find, i.e., if

ψ = h(θ)

and θ̂_MME is easy to obtain, then

ψ̂_MME = h(θ̂_MME),

i.e. we can apply the function h to θ̂_MME to obtain ψ̂_MME.

• This is often easier than the "standard" procedure for finding the MME of ψ, where you would need to redo the whole process.

Example 4.1.3. Let X1 , . . . , Xn be i.i.d. observations from the uniform distribution on the interval [0, θ]. What is the Method of Moments estimator of θ?

The first population moment is µ_1 = θ/2. The first sample moment is X̄. Therefore, the MM estimator is θ̂ = 2X̄. It is an unbiased estimator since Eθ̂ = 2EX̄ = 2EX_i = θ.

Note, however, that X is not a sufficient statistic for θ. Indeed, the
minimal sufficient statistic in this example is X(n) = max{X1 , . . . , Xn }.
So, this is an example of an MME estimator which is not a function of
a sufficient statistic. So it has no chance to be an MVUE. Even though it
is unbiased, its variance could be reduced by conditioning on the sufficient
statistic.

Reflection

• The four point estimators back in Chapter 8 (for the mean, proportion,
differences in means and proportions) were all MMEs.

• In practice, Method of Moments is a very intuitive way to find an


estimator. It requires only the ability to calculate the moments in
terms of the parameters and invert this relation.

• Sometimes, MME gives biased estimators but at least it is consistent


(under very mild conditions).

• One of its deficiencies is that it is not always a function of a sufficient statistic.

Here are a couple of additional examples.

Exercise 4.1.4. The X_i's have a Gamma(α, β) distribution; find MME estimators of α and β.

Exercise 4.1.5.

Y_i ∼ f(y) = (2/θ²)(θ − y) 1_{0≤y≤θ}
• Use MME to find an estimator for θ

• Is this estimator a function of a sufficient statistic?

4.2 Maximum Likelihood Estimation (MLE)


The most popular method to find an estimator is the method of maximum
likelihood, MLE.

Definition 4.2.1. The maximum likelihood estimator θb is the value of the
parameter θ, at which the likelihood function L(θ|x1 , . . . xn ) takes its maxi-
mum value.

Informally, if we know that the probability to observe data sample (x1 , . . . , xn )


is 90% if the value of the parameter were θ1 versus 10% if the value of the
parameter were θ2 , then we might prefer θ1 as an estimator of true unknown
θ.
The steps in finding the MLE are easy: write down the likelihood function and find its maximum with respect to the parameter θ. It turns out that in many cases it is technically easier to look for the maximum of the log-likelihood function ℓ(θ|x⃗) := log(L(θ|x⃗)) instead of L(θ). (It is maximized at the same value of θ.)
The main technical question in ML estimation is how to find the global
maximum.
The maximization can be done by a computer algorithm or analyti-
cally. If the maximization is done by computer, there are many specialized
algorithms suitable for statistical applications. One of them is the EM algorithm. If the maximization is done analytically, we aim to solve the equations (d/dθ) ℓ(θ|y⃗) = 0. (They are called the first order conditions, "FOC", for the extremal points.)
local extrema: local maxima, local minima and saddle points, so one should
be careful to choose the global maximum among all possible solutions. In
addition, the maximum can occur on the boundary of the set of all possible
values of θ. In this case, the FOC will not give you information about the
global maximum.
Consistency: The important fact about MLE is that this estimator
is consistent under the mild distributional conditions. It shares this good
property with the Method of Moments estimator.
Let us consider some examples.
Example 4.2.2 (Exponential). Let X1 , . . . , Xn be i. i. d. observations dis-
tributed according to the exponential distribution with parameter θ. What
is the ML estimator for θ?

The density of an individual observation is
f(x_i) = (1/θ) e^{−x_i/θ}.
Since the observations are independent, the likelihood is just the product of
the density functions:

L(θ|x1 , . . . , xn ) = (1/θ^n) Π_{i=1}^n e^{−x_i/θ} = (1/θ^n) e^{−(Σ_{i=1}^n x_i)/θ}

Hence the log-likelihood is

ℓ(θ|x1 , . . . , xn ) = log L(θ|x1 , . . . , xn ) = −n log θ − (1/θ) Σ_{i=1}^n x_i

The first-order condition equation is

(d/dθ) ℓ(θ|x1 , . . . , xn ) = 0,
−n(1/θ) + (1/θ²) Σ_{i=1}^n x_i = 0,
Σ_{i=1}^n x_i = nθ,

and the ML estimator is θb = X. Note that it coincides with the MM


estimator.
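The maximization can also be checked numerically; a minimal R sketch (simulated exponential data with true θ = 5; optimize does the one-dimensional search):

set.seed(6)
x <- rexp(100, rate = 1/5)    # exponential with mean theta = 5
loglik <- function(theta) -length(x) * log(theta) - sum(x)/theta
optimize(loglik, interval = c(0.01, 100), maximum = TRUE)$maximum
mean(x)                       # the analytic MLE; the two values agree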
Example 4.2.3. The data Y1 , . . . , Yn are from the Bernoulli distribution with
parameter p. What is the MLE for p?

• The likelihood function is


L(p|y⃗) = Π_{i=1}^n p^{y_i}(1 − p)^{1−y_i} = p^t (1 − p)^{n−t},

where t = Σ_{i=1}^n y_i.

• The log-likelihood function is

ℓ(p|⃗y ) := log(L(p|⃗y )) = t log(p) + (n − t) log(1 − p)

(d/dp) ℓ(p|y⃗) = t/p − (n − t)/(1 − p)

• Set

ℓ′(p) = t/p − (n − t)/(1 − p) = 0.

We obtain

t/p = (n − t)/(1 − p) ⇒ t − tp = np − tp ⇒ p = t/n

• Hence p̂_MLE = T/n

• Recall that we proved that this is an unbiased estimator, and since it


is a function of a minimal sufficient statistic, it is the MVUE. It is also
consistent since its variance goes to 0.

• Same as the MME estimator.

Exercise 4.2.4. A sample X1 , . . . , Xn is taken from the binomial distribution


with parameters (k, θ). Assume k is known. What is the MLE for θ?
Hint. The likelihood function is

L(θ|x1 , . . . , xn ) = Π_{i=1}^n C(k, x_i) θ^{x_i}(1 − θ)^{k−x_i}.

Example 4.2.5 (Normal). Let X1 , . . . , Xn be i.i.d observations from a normal


distribution, N (µ, σ 2 ). What are the ML estimators of µ and σ 2 ?

• Likelihood function:

L(µ, σ²) = Π_{i=1}^n (1/√(2πσ²)) exp{−(x_i − µ)²/(2σ²)} = (2πσ²)^{−n/2} exp{−Σ_{i=1}^n (x_i − µ)²/(2σ²)}

• Log-likelihood:

ℓ(µ, σ²) = log(L(µ, σ²)) = −(n/2) log(2πσ²) − Σ_{i=1}^n (x_i − µ)²/(2σ²)
= −(n/2) log(2π) − (n/2) log(σ²) − Σ_{i=1}^n (x_i − µ)²/(2σ²)

• There are two unknown variables in this function. We calculate the partial derivatives separately, set each to zero, and then solve for the unknowns:

∂ℓ(µ, σ²)/∂µ = 2Σ_{i=1}^n (x_i − µ)/(2σ²) = (Σ_{i=1}^n x_i − nµ)/σ² = (x̄ − µ)/(σ²/n)

∂ℓ(µ, σ²)/∂σ² = −(n/2)(1/σ²) + (1/2) Σ_{i=1}^n (x_i − µ)²/σ⁴
• Set both to zero:

(x̄ − µ)/(σ²/n) = 0 ⇒ µ = x̄
−(n/2)(1/σ²) + (1/2) Σ_{i=1}^n (x_i − µ)²/σ⁴ = 0 ⇒ σ² = (1/n) Σ_{i=1}^n (x_i − µ)² = (1/n) Σ_{i=1}^n (x_i − x̄)²
• Hence, µ̂_MLE = X̄ and σ̂²_MLE = (1/n) Σ_{i=1}^n (X_i − X̄)² = S_0². These are exactly the sample mean and the alternative version of the sample variance (but not S², the original sample variance). We know that X̄ is unbiased and that both X̄ and S_0² are consistent. However, S_0² is a biased estimator of σ².
In fact, as in the previous example, the ML estimators coincide with the MM estimators.
the MM estimator.

So far, all ML estimators coincided with MM estimators. Is there any


advantage in ML estimation? Here is an example, where we get an ML
estimator which is different from the MM estimator and which is better
than MM estimator. It also illustrates that it is sometimes not enough to
solve the first order conditions.

Example 4.2.6 (Uniform on (0, θ)). Let X1 , . . . , Xn be i.i.d observations from
the uniform distribution on the interval [0, θ]. What is the ML estimator for
θ.

• The density of Xi is f (xi ) = 1/θ if xi ∈ [0, θ], and 0 otherwise.

• The likelihood function is the product of the densities, so

L(θ|x1 , . . . , xn ) = 1/θ^n,

if min{x1 , . . . , xn} ≥ 0 and max{x1 , . . . , xn} ≤ θ, and 0 otherwise.

• We can also write this

L(θ) = θ−n 1{x(n) ≤θ} 1{x(1) ≥0}

• Or equivalently L(θ) = θ−n for θ ≥ x(n) , and 0, otherwise. ← better


here for MLE.

• The log-likelihood is ℓ(θ) = −n log(θ) for θ ≥ x(n) , and −∞, otherwise.

• ℓ′(θ) = −n/θ for θ ≥ x_(n)

• But −n/θ = 0 does not have a solution! What can be wrong?

• The maximum of the function ℓ(θ) is reached at the boundary point!

Indeed ℓ′(θ) = −n/θ < 0, hence ℓ(θ) is a decreasing function. The smallest possible value for θ in the domain θ ≥ x_(n) is its left boundary point x_(n), which is the maximum point for the likelihood.
So, θbM L = X(n) . Note that it is a function of a minimal sufficient statistic
X(n) . Is it biased? How does its variance compare with the MM estimator?
Let us repeat some information from Example 3.4.2. First, EX_(n) = (n/(n + 1))θ, so the ML estimator is biased, with bias −θ/(n + 1). However, its variance Var(X_(n)) = (n/((n + 2)(n + 1)²)) θ² ∼ θ²/n² is much smaller than the variance of the unbiased MM estimator (Var(2X̄) = 4Var(X_i)/n = θ²/(3n)). In particular, this estimator has smaller MSE for large n.
Here we see a clear difference between ML and MM estimators. The MM
estimator 2X is unbiased but it is not a function of a sufficient statistic. The

ML estimator is biased but it is a function of a minimal sufficient statistic.
In particular, the bias of the ML estimator can be corrected and this will
lead to an MVUE estimator.
Note, by the way, that if the domain of the density function depends on
the parameter θ, it is often a warning sign that the boundary point of θ may
play a major role in finding the MLE.
MLE is always a function of a sufficient statistic!

Theorem 4.2.7. Suppose T(X1 , . . . , Xn ) is a sufficient statistic for θ. Then θ̂_MLE can be written as a function of T: θ̂_MLE(X1 , . . . , Xn ) = θ̂(T).

In particular, θ̂(T) is a function of a complete minimal sufficient statistic as long as such a beast exists. This is rather appealing because it means that if we are able to find a function of θ̂_MLE which is unbiased, then this function is an MVUE.

Proof. If T is a sufficient statistic, then by the factorization criterion,

L(θ) = g(t, θ)h(x1 , . . . , xn ) and so ℓ(θ) = log(g(t, θ)) + log(h(x1 , . . . , xn ))

• Since log(h(x1 , . . . , xn )) has nothing to do with θ, as far as θ is con-


cerned, log(h(x1 , . . . , xn )) is a constant; hence the maximizer of ℓ(θ)
over θ is the same as the maximizer of log(g(t, θ)).

• The maximizer of log(g(t, θ)), over all possible θ, has to depend only
on t.

Thus the ML estimator is a function of T .

Plug-in property for MLE


Example 4.2.8 (Poisson with usual and unusual parameter). Suppose that
X1 , . . . , Xn are i.i.d. from the Poisson distribution with parameter λ. What
is the ML estimator for λ?
The Poisson distribution is discrete so we work with probability mass
functions (“pmf”s). The pmf of one observation Xi is
    p_{Xi}(xi|λ) := P[Xi = xi] = e^(−λ) λ^(xi)/xi!,

where xi can take values 0, 1, 2, . . .. Since the observations are independent,
the likelihood function is simply the product of pmfs of individual observa-
tions.
    L(λ|x1, . . . , xn) = λ^(Σ_{i=1}^n xi) e^(−nλ) / Π_{i=1}^n (xi!).

Hence, the log-likelihood is


    ℓ(λ) = (Σ_{i=1}^n xi) log λ − nλ − log(Π_{i=1}^n (xi!)).

So we can write the first order condition as


    ℓ′(λ) = (1/λ) Σ_{i=1}^n xi − n = 0

The solution gives us the ML estimator λ̂_ML = X̄, which coincides with the
MM estimator.
Now suppose we use a different parameter in the model θ = 1/λ and
want to find an ML estimator for θ. This simply means that now we write
the distribution function for the observations in terms of θ not λ:
    p_{Xi}(xi|θ) := e^(−1/θ) (1/θ)^(xi)/xi!,
So the likelihood function will be
    L(θ|x1, . . . , xn) = (1/θ)^(Σ_{i=1}^n xi) e^(−n/θ) / Π_{i=1}^n (xi!).

Now it is rather obvious that if L(λ|x1, . . . , xn) is maximized at λ = λ̂, then
L(θ|x1, . . . , xn) is maximized at the point that corresponds to this point,
namely at θ = 1/λ̂. Therefore,

    θ̂_ML = 1/λ̂_ML = 1/X̄.

The principle that relations between parameters are transferred to their
estimators is called the invariance, or plug-in, principle.

Theorem 4.2.9. Suppose that X1, . . . , Xn are observations from the dis-
tribution that depends on parameter θ. If θ̂_ML = θ̂_ML(X1, . . . , Xn) is the
maximum likelihood estimator for θ and g(·) is a one-to-one function, then
g(θ̂_ML) is the maximum likelihood estimator for the parameter ψ := g(θ), i.e.,

    ψ̂_ML = g(θ̂_ML)

Proof. We have the identity

    L(θ|y) = L(g^(−1)(ψ)|y),

and the expression on the right is the likelihood for ψ. If the MLE of ψ were
ψ* ≠ g(θ̂_MLE), then it would follow that

    L(g^(−1)(ψ*)) > L(g^(−1)(g(θ̂_MLE))) = L(θ̂_MLE)

But this would contradict the fact that θ̂_MLE maximizes L(θ).

• The invariance property actually holds for any function ψ of a pa-
  rameter θ (and not only for one-to-one functions), once we define
  appropriately what we mean by the maximum likelihood estimator
  of ψ(θ) in this case.

• A discussion and a proof can be seen in Casella and Berger (2002)


[Math 502]

Example 4.2.10. For example, suppose you want to estimate the probability
that a random variable X exceeds 2; you know that X is a Poisson r.v. and you
have a data sample Xi, i = 1, . . . , n.
Note that

Pr{X > 2} = 1 − Pr{X = 0} − Pr{X = 1} − Pr{X = 2}


             = 1 − e^(−λ) − e^(−λ) λ/1! − e^(−λ) λ²/2!
The invariance principle tells us that the ML estimator for Pr{X > 2} is
simply

    1 − e^(−λ̂) − e^(−λ̂) λ̂/1! − e^(−λ̂) λ̂²/2!,

where λ̂ = λ̂_ML is the maximum likelihood estimator for λ.
Since we know that λ̂_ML = X̄, the ML estimator for Pr{X > 2}
is

    1 − e^(−X̄) (1 + X̄/1! + X̄²/2!).
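Here is a short R sketch of this plug-in computation (the data vector is
hypothetical and serves only to illustrate the mechanics):

  x <- c(1, 0, 3, 2, 4, 1, 2)               # hypothetical Poisson sample
  lam <- mean(x)                            # lambda-hat, the MLE
  1 - exp(-lam) * (1 + lam + lam^2 / 2)     # ML estimate of Pr{X > 2}
  1 - ppois(2, lam)                         # the same value via the Poisson cdf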
Comparison of MM and MLE

1. MM estimator is usually easier to calculate.

2. In many cases MM estimator coincides with MLE.

3. MME might not be as efficient as MLE.

4. While MME is not as efficient as MLE, sometimes MM can be applied
   in situations when MLE is not available. (This happens when the
   likelihood function is not available but one can make some assumptions
   about the moments of random variables associated with the model.) In
   particular, a generalized method of moments (GMM) was developed
   in the 1980s by the econometrician Lars Peter Hansen, who received the
   Nobel Prize in Economics in 2013 in part for this work.

Additional examples:
Example 4.2.11. Let Y1 , . . . , Yn be taken from distribution with the following
density:
    f_Y(y) = (1/θ) r y^(r−1) e^(−y^r/θ) 1_{y>0}, where θ > 0 and r is known.
Find a sufficient statistic for θ. Find the MLE of θ. Is it MVUE?
• L(θ) = Π_{i=1}^n (1/θ) r y_i^(r−1) e^(−y_i^r/θ) = (1/θ^n) r^n (Π_{i=1}^n y_i)^(r−1) e^(−(1/θ) Σ_{i=1}^n y_i^r)

• Clearly the sufficient statistic is Σ_{i=1}^n Y_i^r

• L(θ) = C · (1/θ^n) e^(−(1/θ) Σ_{i=1}^n y_i^r), where C has nothing to do with θ.

• ℓ(θ) = log(C) − n log(θ) − (1/θ) Σ_{i=1}^n y_i^r

• ℓ′(θ) = −n/θ + (1/θ²) Σ_{i=1}^n y_i^r.
  Note that log(C) disappears.

• Set ℓ′(θ) = 0 ⇒ n/θ = (1/θ²) Σ_{i=1}^n y_i^r ⇒ θ* = (1/n) Σ_{i=1}^n y_i^r

• So, (1/n) Σ_{i=1}^n Y_i^r is the MLE for θ.

• MVUE??? We know that the estimator is a sufficient statistic. Need


to check unbiasedness.
• E((1/n) Σ_{i=1}^n Y_i^r) = (1/n) Σ_{i=1}^n E(Y_i^r)

• Note that

    E(Y_i^r) = ∫_0^∞ (1/θ) r y^(r−1) e^(−y^r/θ) · y^r dy
             = ∫_0^∞ e^(−y^r/θ) · y^r d(y^r/θ)
    (Let u = y^r/θ)
             = θ ∫_0^∞ e^(−u) · u du

• One can either calculate the integral explicitly or note that e^(−u) is
  the density of the exponential distribution with parameter 1, so the
  integral ∫_0^∞ e^(−u) · u du calculates its expectation, which, as we already
  know, equals 1. Hence, we have

    E(Y_i^r) = θ
• Therefore, E((1/n) Σ_{i=1}^n Y_i^r) = (1/n) Σ_{i=1}^n E(Y_i^r) = (1/n) nθ = θ

It follows that the maximum likelihood estimator (1/n) Σ_{i=1}^n Y_i^r is unbiased,
and hence it is the MVUE for θ.
Another example:
Example 4.2.12. Consider the situation when we have two samples. One of
them is X1 , X2 , . . . , Xm from normal distribution N (µ1 , σ 2 ). The other is
Y1, Y2, . . . , Yn from normal distribution N(µ2, σ²). Here we assumed that
the variance in both distributions is the same. What are the ML estimators
for the parameters µ1, µ2 and σ²?

• Given the observations x1, . . . , xm, y1, . . . , yn, the likelihood for µ1, µ2, σ²
  is the product of all the densities (including the X's and the Y's):

    L(µ1, µ2, σ²) = Π_{i=1}^m (2πσ²)^(−1/2) exp{−(xi − µ1)²/(2σ²)} × Π_{i=1}^n (2πσ²)^(−1/2) exp{−(yi − µ2)²/(2σ²)}
                  = (2πσ²)^(−m/2) exp{−Σ_{i=1}^m (xi − µ1)²/(2σ²)} × (2πσ²)^(−n/2) exp{−Σ_{i=1}^n (yi − µ2)²/(2σ²)}

• So the log-likelihood is

    ℓ(µ1, µ2, σ²) = −(m/2) log(2π) − (m/2) log(σ²) − Σ_{i=1}^m (xi − µ1)²/(2σ²)
                    − (n/2) log(2π) − (n/2) log(σ²) − Σ_{i=1}^n (yi − µ2)²/(2σ²)

Let us use the notation θ = σ². The partial derivatives of the log-
likelihood function with respect to the parameters are:

    ℓ_{µ1} = (x̄ − µ1)/(θ/m)
    ℓ_{µ2} = (ȳ − µ2)/(θ/n)
    ℓ_θ = (−(m/2)(1/θ) + (Σ_{i=1}^m (xi − µ1)²/2)(1/θ²)) + (−(n/2)(1/θ) + (Σ_{i=1}^n (yi − µ2)²/2)(1/θ²))
• Set all three to zero and get the solution:

    µ̂1 = X̄, µ̂2 = Ȳ,
    σ̂² = (Σ_{i=1}^m (xi − X̄)² + Σ_{i=1}^n (yi − Ȳ)²)/(m + n)

• Does σ̂²_ML look a bit familiar?
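Here is a minimal R sketch of these formulas applied to simulated data (the
sample sizes and parameter values are assumptions made for illustration):

  set.seed(2)
  m <- 15; n <- 12
  x <- rnorm(m, mean = 1, sd = 2)
  y <- rnorm(n, mean = 3, sd = 2)
  mu1.hat <- mean(x); mu2.hat <- mean(y)
  sigma2.hat <- (sum((x - mu1.hat)^2) + sum((y - mu2.hat)^2)) / (m + n)

Note that sigma2.hat divides by m + n rather than m + n − 2, so it is the
(biased) ML estimator and not the pooled sample variance.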

Exercise 4.2.13. Y1 , Y2 , . . . , Yn is a sample of observations from N (5, θ) where
the variance θ is unknown and is the parameter of interest:

    f(y) = (2πθ)^(−1/2) exp[−(y − 5)²/(2θ)].

(a). Find the sufficient statistic for θ.

(b). Find the Method of Moment Estimator (MME) θbM M for θ.

(c). Find the Maximum Likelihood Estimator (MLE) θbM L for θ

(d). Show directly (without using the general theorem that MM estimators
are consistent) that θbM M is consistent for θ.

(e). Show directly that θbM L is consistent for θ.

(f). Prove that θbM L is the minimal variance unbiased estimator (MVUE)
for θ

Exercise 4.2.14. Let Y1 , Y2 , Y3 be three i.i.d. observations from the distribu-


tion with density:
    f_Y(y) = (y e^(−y/θ)/θ²) 1_{y>0}
In a data sample these random variables are observed to be 120, 130 and
128, respectively.

• Find the ML estimator of θ

• Is the ML estimator unbiased in this model? Explain.

• What is the ML estimator for the variance of Y1 ?

Exercise 4.2.15. Let Y1 , . . . , Yn be from the distribution with density

f (y) = (θ + 1)y θ , 0 < y < 1,

where θ > −1. Find the MLE.

4.3 Cramer-Rao Lower Bound and large sample
properties of MLE
In this section we learn about the Cramer-Rao lower bound on the variance
of any unbiased estimator of θ. It is not possible to get smaller variance
even if you use the MVUE. We also learn that in the limit, as n → ∞, the
maximum likelihood estimator achieves this bound. In this sense, the ML
estimator is an asymptotically Minimal Variance Unbiased Estimator.
The idea behind the Cramer-Rao bound is that if the likelihood function
is flat and does not depend on the parameter θ then it will be difficult
to estimate the parameter from the data. The measure of the likelihood
function flatness that the Cramer-Rao bound uses is the Fisher information.
Essentially, it is the average squared sensitivity of the log-likelihood function
to the parameter.
For the formal definition, let us define the score function s(x, θ) of a
random variable X as log f(x, θ) if X is continuous with probability density
f(x, θ), and as log p(x, θ) if it is discrete with probability mass function p(x, θ).
We are talking here about a single variable X; the log-likelihood func-
tion is the sum of the values of the score function at the xi: ln L(θ) = Σ_{i=1}^n s(xi, θ).

Definition 4.3.1. The Fisher information of a random variable X is defined
as

    I_X(θ) = E[((d/dθ) s(X, θ))²]

Informally, the larger the Fisher information is, the more sensitive the
log-likelihood function is with respect to parameter θ.
Example 4.3.2. Let us calculate the Fisher information for an exponential
random variable with mean θ. The density is f(x, θ) = (1/θ) e^(−x/θ), so the score
is

    s(x, θ) = − log θ − x/θ.

By definition,

    I(θ) = E[((d/dθ)(− log θ − X/θ))²]
         = E[(−1/θ + X/θ²)²]
         = E[1/θ² − 2X/θ³ + X²/θ⁴]

Recollect that the exponential distribution with parameter θ has mean θ and
variance θ². Hence EX = θ and EX² = θ² + θ² = 2θ². So after substitution
we get

    I(θ) = 1/θ² − 2θ/θ³ + 2θ²/θ⁴ = 1/θ².

In many cases, it is easier to calculate the Fisher information by using a
different formula. Namely, under some regularity conditions, one has:

    I_X(θ) = −E[(d²/dθ²) s(X, θ)].    (4.1)
For example, the following result holds.

Lemma 4.3.3. Suppose that X is a continuous random variable and the


range of X (the set where the density is positive) does not depend on θ.
Suppose also that the density is continuously differentiable in θ on the range.
Then the equality (4.1) holds.

This lemma can be proved by rather challenging integral manipulations.
Identity (4.1) often gives a shorter path to calculating the Fisher information.
In the previous example, this identity gives

    I(θ) = −E[(d²/dθ²)(− log θ − X/θ)]
         = −E[1/θ² − 2X/θ³] = 1/θ².

We formulate the following result as a theorem although we do not spec-


ify the regularity conditions precisely. Check for them in graduate level
textbooks.

Theorem 4.3.4 (Cramer-Rao bound). Let X1, . . . , Xn be a sample of in-
dependent identically distributed observations from the distribution that de-
pends on parameter θ. Under certain regularity conditions on the distribu-
tion, for every unbiased estimator θ̂,

    Var(θ̂) ≥ 1/(n I_X(θ))

For example, the regularity conditions are satisfied if X1 , . . . , Xn are i.i.d.


continuous random variables with density f (x, θ) provided that the support
of the distribution (that is, the set of x where f (x, θ) is positive) does not
depend on θ and that the density is continuously differentiable in θ.

Note 1: If we are able to find an unbiased estimator such that its variance
equals the Cramer-Rao bound (and the regularity conditions hold), then this
estimator is MVUE (minimal variance unbiased estimator).

Note 2: Sometimes the Cramer-Rao bound is not sharp. That is,
sometimes the variance of the MVUE will be larger than the bound given
by the Cramer-Rao inequality.

Note 3: The regularity conditions are often violated if the parameter
involves the domain of the density. (Like estimation of θ for a random
variable uniformly distributed on [0, θ].) In this case, the Cramer-Rao bound
is invalid: there can be an estimator with smaller variance than the bound
predicts.
Example 4.3.5 (Exponential). Suppose that X1, . . . , Xn are i.i.d. observa-
tions from the exponential distribution with parameter θ. We have calcu-
lated that the Fisher Information of this distribution is I_X(θ) = 1/θ². Hence,
the Cramer-Rao inequality says that every unbiased estimator θ̂ has vari-
ance ≥ θ²/n. On the other hand, the estimator θ̂ = X̄ is unbiased and has
variance Var(Xi)/n = θ²/n. This fact means that this unbiased estimator
attains the Cramer-Rao bound and so it is MVUE.
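We can also check this numerically; here is a minimal simulation sketch in R
(θ = 2, n = 50, and the number of repetitions are assumed values):

  set.seed(3)
  theta <- 2; n <- 50; reps <- 10000
  xbar <- replicate(reps, mean(rexp(n, rate = 1 / theta)))  # rexp has mean 1/rate
  c(simulated = var(xbar), CRLB = theta^2 / n)              # the two should be close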
Example 4.3.6 (Uniform on (0, θ)). Now suppose that X1 , . . . , Xn are i.i.d.
observations from the uniform distribution on the interval [0, θ]. First, let

us calculate the Fisher Information for this distribution. The density is 1/θ
and the score function is − log(θ), so by definition:

    I_X(θ) = E[((d/dθ) s(X, θ))²] = E[1/θ²] = 1/θ²

(If we use identity (4.1), we get:

    I_X(θ) = −E[(d²/dθ²) s(X, θ)] = E[1/θ²] = 1/θ²,
which is the same. Note, however, that the conditions of Lemma 4.3.3 that
justified (4.1) are not satisfied in this example, so we are lucky that (4.1)
gives the correct result.) So the Cramer-Rao inequality predicts that every
unbiased estimator should have variance ≥ θ2 /n. We have calculated earlier
the expectation and variance of the ML estimator in this example which is
X(n) . This estimator is biased but we can correct its bias and consider the
unbiased estimator θ̂ = ((n + 1)/n) X_(n). Its variance is

    Var(θ̂) = θ² · ((n + 1)²/n²) · (n/((n + 2)(n + 1)²)) = θ²/(n(n + 2))
So this estimator clearly violates the Cramer-Rao bound. The reason is that
the conditions of Theorem 4.3.4 are not satisfied.
We will explain some ideas behind the proof of the Cramer-Rao bound
below. Now let us turn to the asymptotic optimality of MLE. The main
result here is that under some regularity conditions,

    n Var(θ̂_ML) → 1/I_X(θ)  as n → ∞.
The point is that the MLE in the limit attains the Cramer-Rao bound.
In this sense it is an asymptotically MVUE, or in other terminology it is
asymptotically efficient.
Ideas of the proof of the Cramer-Rao bound
We are going to prove the bound for the case when X is continuous and
its range does not depend on the parameter and when n = 1. The proof in
the general case is difficult and you can find it in graduate level textbooks.

Lemma 4.3.7. Assume the range of X does not depend on θ and the density
is positive and continuously differentiable in θ. Then,
    E[(d/dθ) s(X, θ)] = 0.

Proof. Note that by the chain rule:

    (d/dθ) s(x, θ) = ((d/dθ) f(x, θ)) / f(x, θ).

Hence, if (a, b) is the range of X,

    E[(d/dθ) s(X, θ)] = ∫_a^b ((d/dθ) s(x, θ)) f(x, θ) dx = ∫_a^b (d/dθ) f(x, θ) dx
                      = (d/dθ) ∫_a^b f(x, θ) dx = (d/dθ) 1 = 0.

Corollary of Lemma 4.3.7: I(θ) = Var((d/dθ) s(X, θ)).

Proof of the Cramer-Rao bound for n = 1. Let θ̂(X) be an unbiased esti-
mator of θ based on just one datapoint X. Let us write s′(θ) instead of
(d/dθ) s(X, θ). By using Lemma 4.3.7 and the Cauchy-Schwarz inequality for
covariance:

    |E[s′(θ)θ̂]| = |Cov(s′(θ), θ̂)| ≤ √(Var(s′(θ)) Var(θ̂)).

Or, squaring both sides and using the Corollary,

    Var(θ̂) ≥ (E[s′(θ)θ̂])² / I(θ)

All this would hold even if θ̂ were biased. The next step is crucial, and it is
where unbiasedness is used:

    E[s′(θ)θ̂] = ∫_a^b ((d/dθ) f(x, θ)) θ̂(x) dx = (d/dθ) ∫_a^b f(x, θ) θ̂(x) dx
              = (d/dθ) E[θ̂] = (d/dθ) θ = 1.

Hence Var(θ̂) ≥ 1/I(θ).

Example 4.3.8. • Yi ∼ Bernoulli(p) with pmf p^(yi) (1 − p)^(1−yi)

• We know that p̂ = Σ_{i=1}^n Yi/n is the MLE for p

• Try to derive the Cramer-Rao Lower Bound for p̂:

    1/(n E[−∂² log f(Y|p)/∂p²]) = 1/(n E[−∂² log(p^Y (1 − p)^(1−Y))/∂p²])
    = 1/(n E[−∂²(Y log(p) + (1 − Y) log(1 − p))/∂p²]) = 1/(n E[Y/p² + (1 − Y)/(1 − p)²])
    = 1/(n [p/p² + (1 − p)/(1 − p)²]) = 1/(n [1/p + 1/(1 − p)]) = p(1 − p)/n

• We already know that Var(p̂) = p(1 − p)/n. So indeed, p̂ is the MVUE.
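The same kind of numerical check works here; a small simulation sketch in R
(the values p = 0.3, n = 100, and the repetition count are assumptions made
for illustration):

  set.seed(4)
  p <- 0.3; n <- 100; reps <- 10000
  phat <- replicate(reps, mean(rbinom(n, 1, p)))    # sample proportions
  c(simulated = var(phat), CRLB = p * (1 - p) / n)  # the two should be close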

Chapter 5

Hypothesis testing

5.1 Basic definitions


• Chapter 8: make statistical inference about the population (parame-
ter) by

– point estimators (with unbiasedness & low variance), and


– confidence intervals.

• Chapter 9: different mathematical properties of point estimators, and


how to find good estimators.

• Chapter 10:

– Hypothesis tests: answer scientific questions using statistics


– Different philosophy and goal.
– Some connection with confidence interval.

Examples of “tests” in real life

• What people may want to know:

1. Does smoking cause lung cancer?


2. Is global warming real?

3. Are men more likely to run a stop sign than women?
4. Does chemotherapy really cure cancer?
5. Is a new medicine effective in increasing longevity?

• Beyond just scientific interest.

– Business decisions, military actions, political strategies.

The basic philosophy of statistical testing is “Proof by Contradiction”.

1. Identify the question of interest:


Does smoking lead to lung cancer?

2. Try to prove causality by assuming the opposite theory ("does not
   lead"), which is called the null hypothesis, and showing that it leads to a
   contradiction.

3. Namely, if the data looks very improbable under the null hypothesis,
then you can conclude that the data contradicts the null hypothesis,
so it should be rejected and your theory should be accepted instead.

4. However, if the data does not look very improbable under the null hypoth-
   esis, then you cannot reject it, and so you don't have enough evidence
   in support of your, alternative, point of view.

Terminology

• Hypothesis

– A statement about a population, usually of the form that a pa-


rameter takes a particular numerical value (e.g. θ = 2) or falls in
a certain range of values (e.g. θ > 2).

• Null Hypothesis H0

– The statement of no effect.


– This is the statement that we will assume as true when we will
try to show that it leads to improbable conclusions.

– It is usually denoted by H0 and it is usually very specific. For
example it can state: “the treatment has no effect”.

• Alternative Hypothesis Ha

– The statement of some effect.


– The statement that we actually want to confirm by showing that
H0 should be rejected.
– Usually it is denoted by Ha ; it can be specific like: the percentage
of recoveries after a medicine was used increased by 10%. This
is a point hypothesis. Or it can be less specific: the percent of
recoveries after the treatment increased by at least 10%. This is
called the composite alternative hypothesis.

• The hypotheses must be stated before collecting, viewing or analyzing


the data.

Besides the null and alternative hypothesis, the statistical test is defined
by a test statistic and a rejection region.

• Recall: a statistic is a function of random observations Yi (the data).
  A statistic cannot be defined in terms of the unknown θ.

• A test statistic (TS) is a statistic, that is, a function of the data. Its
  intention is different from that of an estimator, which is also a function of
  the data. The test statistic should help us to answer the question: "how
  close is the data sample to what we would expect if the null hypothesis
  H0 were true?"

• If the null hypothesis is expressed in terms of a parameter of the model,


for example if it has the form like θ = θ0 , where θ0 is a specific value,
then often you can use a test statistic in the form

    TS = (θ̂ − θ0)/σ_θ̂,

where θ̂ is an estimator of θ.

• In this case, if |T S| is large (so T S is far from 0) then it might indicate
that the data is not compatible with the null hypothesis.

Definition 5.1.1. The reject region RR is a set of possible values of a test
statistic TS(X1, . . . , Xn) such that if the value of TS for the observed data is in
the rejection region RR, then we reject the null hypothesis H0.

• Example: if we reject H0 when T S > 1, then RR = {x : x > 1}, that


is RR = (1, ∞).

• Often, it is more convenient to write the RR as some inequality, such


as RR : T S < a. This is equivalent to RR = {x : x < a} = (−∞, a).

• If the TS is not in the RR, then we fail to reject the null hypothesis
H0 .

• Other commonly used terms: do not reject, do not have enough evi-
dence to reject, etc.

• Next: How to find the reject region?

Example 5.1.2. Let X1 , . . . , Xn be distributed with Bernoulli distribution


with parameter p.

• Null hypothesis: H0 : p = 0.5

• Alternative hypothesis Ha : p > 0.5.

• Test the hypothesis H0 against alternative Ha .

• We can define our test statistic to be

    TS = (p̂ − 0.5)/SE(p̂),

  where p̂ is the sample proportion and SE(p̂) = √(0.5(1 − 0.5)/n)

• A suitable reject region seems to be TS > t, since a large TS indicates
  that p estimated from the data is much larger than 0.5.

• But which t should we choose?

                 Does not reject H0     Reject H0 in favor of Ha
  H0 is true     Correct decision       Type I Error
  Ha is true     Type II Error          Correct decision

Types of errors that a test can make

(a) Type I Error or False Positive or False Discovery

• occurs if we reject H0 and accept Ha when H0 is in fact true.


• Probability of making a Type I error, also called (significance)
LEVEL of the test:

Level of the test, α = P (Type I Error|t) = P (Reject H0 |H0 , t).

(b) Type II Error or False Negative

  • Occurs if we fail to reject H0 when H0 is false. We failed to make
    a discovery.
  • Probability of making a Type II error:

      β = P (Type II Error|t) = P (Does not reject H0 |θ ∈ Ha , t).

  • The quantity 1 − β is also called the POWER of the test

In the working example:

• As t > 0 increases, harder to reject H0 . Then α ↓, β ↑

• As t > 0 decreases, easier to reject H0 . Then α ↑, β ↓

• α and β are always inversely related;

– It is impossible to minimize both at the same time.

• In scientific practice and in drug development, researchers typically
  consider a Type I error ("False Discovery") a more serious error than a
  Type II error ("Failure to make a discovery"). In medical applications,
  however, such as testing for a disease, it is often more important to
  minimize the Type II error. Here we proceed with the assumption that
  the Type I error is more important.

– In this case, a value for α is chosen before initiating a hypothesis
test.
– Common values for α are 0.01, 0.05 and 0.10;
– Choose the rejection region so that

    P(Type I Error) = P(TS ∈ RR|H0 ) = α

• If α = 0.05, this choice of the rejection region means that in 5% of


the data samples from a population where H0 is actually true, the test
will reject the H0 .

Type I against Type II errors

• What is the consequence of a Type I error?

– Conclude that a drug is effective when in fact that it is not.


– Conclude that a foreign policy is working when in fact that it is
not.
– Ultimately: huge amount of money spent for nothing

• What is the consequence of a Type II error?

– Conclude that a drug is ineffective when in fact it is a good drug.


– Conclude that a potentially working foreign policy is not useful.
– Ultimately: Lost opportunity

Eventually, the choice of balance between type I and type II errors depends
on a cost-benefit analysis, which is outside of the area of statistics.
Summary Design of the test:

• Set up H0

• Set up Ha

• Define a reasonable statistic T S.

• Figure out the distribution of T S under H0

• Choose a small significance level α (like 5%) and find a reject region
(RR) so that P(make a type I error) = P(T S ∈ RR|H0 is true) = α.

Application of the test: If the observed value of the T S is in the RR,


then reject H0 in favor of Ha ; otherwise, decide that the test fails to reject
H0 , and conclude that there is no sufficient evidence at the significance level
α that Ha is true.

5.2 Calculating the Level and Power of a Test


5.2.1 Basic examples
Given a test statistic and a reject region (RR) of a test, how do we find the
probabilities of errors α and β?

• Type I Error:

– Occurs if we reject H0 when H0 is true.


– Probability of making a Type I error:

α = P (Type I Error) = P (rejecting H0 | H0 is true)


= P (T S ∈ RR | H0 is true).

• Type II Error:

– Occurs if we fail to reject H0 when H0 is false.


– Probability of making a Type II error:

β = P (Type II Error) = P (fail to reject H0 | θ ∈ Ha )


= P (T S ̸∈ RR | θ ∈ Ha ).

Example 5.2.1. An experimenter has prepared a drug dosage level that she
claims will induce sleep for 80% of people suffering from insomnia. After
examining the dosage, we feel that her claims regarding the effectiveness of
the dosage are inflated. In an attempt to disprove her claim, we administer
her prescribed dosage to 20 insomniacs and we observe Y , the number
for whom the drug dose induces sleep. We wish to test the hypothesis

H0 : p = .8 versus the alternative, Ha : p < .8. Assume that the rejection
region {y ≤ 12} is used.

(a) What is the probability of type I error (level of the test) α?

(b) What is the probability of type II error β if Ha : p = 0.6?

(c) What is the probability of type II error β if Ha : p = 0.4?

(d) If we want the size of the test α ≈ 0.01, how should we choose the
    threshold r in the rejection region RR = {y ≤ r}?

(a) In this example Y is our test statistic (the complete data consists of
observations for each insomniac). This statistic is distributed according
to the binomial distribution with parameters n = 20 and p. If we assume
H0 then p = 0.8. Then we need to calculate

α = P(Y ∈ RR|H0 : p = 0.8) = P(Y ≤ 12|H0 : p = 0.8).

We can do it using tables or by issuing the R command pbinom(12, 20, 0.8),
which gives us the result α = 0.03214266. This is the significance level
(or size) of this test.

(b) Under the alternative hypothesis Ha : p = 0.6, we have

    β = P(Y ∉ RR|Ha : p = 0.6) = P(Y > 12|Ha : p = 0.6)
      = 1 − P(Y ≤ 12|Ha : p = 0.6).

We can calculate it as β = 1 − pbinom(12, 20, 0.6) = 0.4158929. This is


a rather large probability of error. We can also say that the power of
the test 1 − β ≈ 58% is small under this alternative hypothesis.

(c) If Ha : p = 0.4, then β = 1 − pbinom(12, 20, 0.4) = 0.02102893, and the


power is 1 − β ≈ 98%. Here both the probabilities of type I and type II
errors are small because in this case it is easy to detect from the data
whether a null hypothesis or an alternative is true. The probability that
we encounter the data which would be likely under both alternatives is
small.

(d) Now if we want to make α = 0.01, then we need to find r such that
P(Y ≤ r|H0 : p = 0.8) = 0.01. Unfortunately, there is no r such
that this equality is satisfied exactly. However, we can solve it approx-
imately by using the R command qbinom. If we issue thhe command
qbinom(0.01, size = 20, p = 0.8), it gives us r = 12, since by definition
it produces the smallest r such that P(Y ≤ r) is ≥ 0.01. However, we
know that P(Y ≤ 12) = pbinom(12, 20, 0.8) = 0.03214266 and this is not
satisfactory if we want to ensure that the probability of type I error is
smaller than 0.01. For this reason we should choose r = 11. In fact, for
this choice we have α = P(Y ≤ 11) = pbinom(11, 20, 0.8) = 0.009981786
which is very close to 0.01.
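All four parts of this example can be reproduced with a few lines of R:

  pbinom(12, 20, 0.8)        # (a) level: alpha = 0.03214266
  1 - pbinom(12, 20, 0.6)    # (b) beta = 0.4158929 when p = 0.6
  1 - pbinom(12, 20, 0.4)    # (c) beta = 0.02102893 when p = 0.4
  pbinom(11, 20, 0.8)        # (d) alpha = 0.009981786 for RR = {y <= 11}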

In the previous example we were given the test statistic and the rejection
region. How can we choose them in a typical problem? Here is one method
which is useful if we are interested in testing a statement about a parameter
θ and are given a desired probability of type I error (i.e., the significance
level of the test α).

• The null hypothesis H0 : θ = θ0

• The alternative hypothesis could be one of the following

– Ha : θ > θ0 (one-sided test)


– Ha : θ < θ0 (one-sided test)
– Ha : θ ̸= θ0 (two-sided test)

• Using the sample data, find an estimator of θ; denote it by θ̂;

• Define the test statistic

    TS = (θ̂ − θ0)/σ_θ̂.

• For the reject region (RR):

– {T S > t} (one-sided test)


– {T S < t} (one-sided test)

– {|T S| > t} (two-sided test)

• Cutoff t is chosen such that

P (T S is in RR | H0 is true) = P (T S is in RR |θ = θ0 ) = α

Example 5.2.2. A machine in a factory must be repaired if it produces more


than 10% defectives among the large lot of items that it produces in a
day. A random sample of 100 items from the day’s production contains 15
defectives, and the supervisor says that the machine must be repaired. Does
the sample evidence support his decision? Use a test with level .01.

• The null hypothesis H0 : p = p0 (where p0 = 0.1 here.)

• The alternative hypothesis Ha : p > p0 ;

• An estimator of p is pb the sample proportion;

• Define the test statistic

    TS = (p̂ − p0)/σ_p̂ = (p̂ − p0)/√(p0(1 − p0)/n).

  Note that we use p0 = 0.1 to calculate σ_p̂, not the estimate p̂ = 0.15.
  This is because we are aiming to use fully the assumptions of the null
  hypothesis when calculating the test statistic.

• The reject region (RR): {TS > t}, with the cutoff t chosen such that
  P(TS > t | p = p0) = α.

Under the assumption that H0 is true, TS is approximately N(0, 1), because
the sample size is large (n = 100). So the equation becomes a statement
about a standard normal random variable:

    P(TS > t | p = p0) = α ⇒ P(Z > t) = α,

where Z is a standard normal random variable. We know that it holds when t = zα.
So the reject region RR becomes

    RR : {TS > zα};

The problem asks for level α = 0.01, so z0.01 = 2.33, and the reject region is

RR : {T S > 2.33};

By using the data provided in this problem, we calculate the observed value
of the test statistic TS as

    ts = (p̂ − p0)/σ_p̂ = (0.15 − 0.10)/√(0.1(1 − 0.1)/100) = 5/3 = 1.667.

Since ts is NOT in the reject region, we fail to reject H0 at the level α = 0.01,
and come to the conclusion that there is NO sufficient evidence to support
the statement that the machine must be repaired, at the significance level
α = 0.01.
Note that if we used a different α, say α = 0.05, then z0.05 = 1.645, and
the reject region would be

RR : {T S > 1.645};

Then, ts would be in the reject region, and we would reject H0 at the level
α = 0.05. The conclusion would become: at the significance level α = 0.05,
there is sufficient evidence to support the statement that the machine must
be repaired.
So, the decision of the hypothesis test depends on the value of α – the
level of tolerance for the type I error. The report about the decision should
always specify the value of α.
Note that the decision of a hypothesis test has a random nature! It
depends on the realized data. In particular, if H0 is true, we incorrectly
reject it with probability α.
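The calculations of the previous example can be summarized in a short R sketch:

  p0 <- 0.1; n <- 100; phat <- 15 / n
  ts <- (phat - p0) / sqrt(p0 * (1 - p0) / n)  # observed test statistic, 1.667
  ts > qnorm(1 - 0.01)                         # FALSE: cannot reject at alpha = 0.01
  ts > qnorm(1 - 0.05)                         # TRUE:  reject at alpha = 0.05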
Now let us look at the calculation of the probability of type II error in
the situation when we have a large sample and an estimator of the parame-
ter which is distributed approximately normally. Let us consider the same
example as before.
Example 5.2.3. Let X1 , . . . , Xn be distributed according to the Bernoulli
distribution with parameter p.

(a) Null hypothesis: H0 : p = 0.1

(b) Alternative hypothesis Ha : p > 0.1.

(c) Test the hypothesis H0 against alternative Ha at level α.

(d) What can be said about the probability of type II error and the power
of this test?

As in the previous example, we define the test statistic as

    TS = (p̂ − 0.1)/SE(p̂),

where p̂ is the sample proportion and SE(p̂) = √(0.1(1 − 0.1)/n). The rejec-
tion region is TS > zα, where α is the pre-specified level (= size) of the test,
which is the probability of type I error.
Now, what about the probability of type II error for this test? In order to
calculate this probability, we need to have a specific alternative hypothesis
Ha about the parameter p. Suppose, for example, that Ha : p = 0.15. This
is a natural choice since we observed 15 defective items out of 100. If we
assume that the alternative hypothesis is true, then we know that for large
n, the quantity

    Z = (p̂ − 0.15)/√(0.15(1 − 0.15)/n)

is distributed as a standard normal random variable. Therefore,

    p̂ = 0.15 + √(0.15(1 − 0.15)/n) Z,

and we re-write the test statistic as

    TS = (p̂ − 0.1)/√(0.1(1 − 0.1)/n) = (0.15 − 0.1 + Z √(0.15(1 − 0.15)/n))/√(0.1(1 − 0.1)/n)

So the probability of type II error for the test with level α is

    β = P(TS ≤ zα|Ha) = P((0.05 + Z √(0.15(1 − 0.15)/n))/√(0.1(1 − 0.1)/n) ≤ zα)
      = P(Z √(0.15(1 − 0.15)/n) ≤ −0.05 + zα √(0.1(1 − 0.1)/n))
      = P(Z ≤ (−0.05 + zα √(0.1(1 − 0.1)/n))/√(0.15(1 − 0.15)/n))

and this quantity is easy to evaluate by using software or by referring to
tables. If we use α = 0.01 and n = 100, then zα = 2.33 and we calculate that

    (−0.05 + zα √(0.1(1 − 0.1)/n))/√(0.15(1 − 0.15)/n) = 0.5573

and

    P(Z ≤ 0.5573) = 0.71.

So the probability that we make a type II error is rather large: 71%.
Note that the power of the test is by definition 1 − β, so we have a
method to calculate both the probability of type II error and the power of
the test. In this example the power of the test is 1 − 0.71 = 29%.
It is important to note that β (and the power) depends on the value of
the parameter under the alternative hypothesis. For example, if we changed
our alternative hypothesis to Ha : p = 0.2, then the probability of type II
error would be equal to
error would be equal to

    β = P(Z ≤ (−0.1 + zα √(0.1(1 − 0.1)/n))/√(0.2(1 − 0.2)/n)) = P(Z ≤ −0.7525) = 22.6%

This is certainly a much better outcome. Under this alternative hypothesis
the test has much more statistical power. The power equals 1 − 0.226 =
77.4%.
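These β computations are easy to wrap into a small R function; here is a
sketch (the function name and interface are mine, not standard R):

  beta.right.tailed <- function(p0, pa, n, alpha) {
    za <- qnorm(1 - alpha)   # cutoff z_alpha
    pnorm((-(pa - p0) + za * sqrt(p0 * (1 - p0) / n)) / sqrt(pa * (1 - pa) / n))
  }
  beta.right.tailed(0.10, 0.15, 100, 0.01)   # about 0.71
  beta.right.tailed(0.10, 0.20, 100, 0.01)   # about 0.226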
Example 5.2.4 (Covid-19 in New York and California). The total number
of cases of Covid-19 in New York State is ≈ 203, 123 with number of total
deaths 10,834 as of April 14. The corresponding numbers for California are
25, 536 and 782. Test the hypothesis that the mortality rate in New York is
higher than the mortality rate in California.

• The null hypothesis H0 : θ = θ0 (where θ = p1 − p2 and θ0 = 0.)

• The alternative hypothesis Ha : θ > θ0 ;

• An estimator of θ = p1 − p2 is θ̂ = p̂1 − p̂2, the difference in sample
  proportions;

• Find the test statistic

    TS = (θ̂ − θ0)/σ_θ̂ = (p̂1 − p̂2 − 0)/√(p1(1 − p1)/n1 + p2(1 − p2)/n2) ≈ ?

• Form the reject region (RR): T S > zα .

An additional difficulty here is that the null hypothesis does not specify
the exact values of p1 and p2 . It only says that p1 = p2 . For this reason, we
need to estimate p1 and p2. We use the "pooled sample proportion",
suggested by the fact that H0 claims that p1 = p2:

    p̃ := (Y1 + Y2)/(n1 + n2)
This is the best guess about p1 and p2 we can obtain when p1 = p2 (that is,
under the assumption that H0 is true.)
Using the data provided in this problem, we can calculate:

    p̂1 = 10834/203123 = 0.05333,
    p̂2 = 782/25536 = 0.03062,
    p̂1 − p̂2 = 0.02271,
    p̃ = (10834 + 782)/(203123 + 25536) = 0.05080,
    σ_{p̂1−p̂2} = √(p̃(1 − p̃)(1/n1 + 1/n2)) = 0.001457,

and the observed value of the test statistic TS is

    ts = (θ̂ − θ0)/σ_θ̂ = (0.02271 − 0)/0.001457 = 15.5789

So, in this case it is obvious that the null hypothesis can be rejected at
α = 0.01. The data give strong support to the hypothesis that the mortality
rate in New York is higher than that in California.
The previous two examples were about testing hypotheses about popu-
lation proportions.

Now let us look at the hypotheses about population means. We still
maintain the assumption that the sample size is large and therefore we can
rely on the normality of the parameter estimator distribution.
Example 5.2.5. A random sample of 37 second graders who participated in
sports had manual dexterity scores with mean 32.19 and standard deviation
4.34. An independent sample of 37 second graders who did not participate in
sports had manual dexterity scores with mean 31.68 and standard deviation
4.56.

a. Test to see whether sufficient evidence exists to indicate that second


graders who participate in sports have a higher mean dexterity score.
Use α = .05.

b. For the rejection region used in part (a), calculate β when µ1 −µ2 = 3.

The null hypothesis is H0 : µ1 = µ2 and the alternative is Ha : µ1 > µ2.
A suitable estimator for θ = µ1 − µ2 is θ̂ = X̄ − Ȳ, the difference of the
sample means. Its standard deviation is

    σ_θ̂ = √(σ1²/n1 + σ2²/n2).

Since we do not know the exact values of σ1² and σ2², we will use estimates
for these variances. Then, the test statistic is

    TS = (θ̂ − θ0)/σ̂_θ̂ = (X̄ − Ȳ − 0)/√(s1²/n1 + s2²/n2)
       = (32.19 − 31.68)/√(4.34²/37 + 4.56²/37) = 0.4928

Since the samples are relatively large (n1 = n2 > 30), the test statistic
is distributed as a standard normal random variable. Since α = 0.05 and
T S ≤ z0.05 = 1.645, the test statistic is not in the rejection region and we
are not able to reject the null hypothesis. The data does not give enough
evidence to indicate that second graders who participate in sports have a
higher mean dexterity score.
Now let us consider the second question. What is β if µ1 − µ2 = 3?

In this case, we know that

    Z = (X̄ − Ȳ − 3)/√(s1²/n1 + s2²/n2)

is approximately a standard normal random variable. What we need to cal-
culate is the probability that we do not reject the null hypothesis, that is,
P(TS ≤ zα). So, we need to express TS in terms of Z:

    TS = (X̄ − Ȳ)/√(s1²/n1 + s2²/n2) = (X̄ − Ȳ − 3 + 3)/√(s1²/n1 + s2²/n2) = Z + 3/√(s1²/n1 + s2²/n2)

Then, the desired probability is

    β = P(TS ≤ zα) = P(Z + 3/√(s1²/n1 + s2²/n2) ≤ zα)
      = P(Z ≤ zα − 3/√(s1²/n1 + s2²/n2))

After plugging in numbers, we get

    β = P(Z ≤ 1.645 − 3/√((4.34² + 4.56²)/37))
      = P(Z ≤ −1.2538) = pnorm(−1.2538) = 0.105...

So, β = 10.5% and the power of this test is 1 − β = 89.5%.


More generally, if we use the test statistic

    TS = (θ̂ − θ0)/σ̂_θ̂,

where the sample size is large, the estimator θ̂ has an approximately normal
distribution, and the standard deviation of the estimator θ̂ is estimated from
the data (not calculated on the basis of the hypothesis), then we can write
simple formulas for β.
If the alternative hypothesis is θa > θ0, and we use the rejection region
TS > zα, then

    β = P(Z ≤ zα − (θa − θ0)/σ̂_θ̂)

If the alternative hypothesis is θa < θ0 and the rejection region is TS < −zα,
then

    β = P(Z ≥ −zα − (θa − θ0)/σ̂_θ̂),

which can also be written as

    β = P(Z ≤ zα − (θ0 − θa)/σ̂_θ̂)

by the symmetry of the distribution of the normal random variable.


Finally, if the alternative hypothesis is θa ≠ θ0 and we decided to use
the symmetric rejection region |TS| ≥ zα/2, then

    β = P(Z ≤ zα/2 − (θa − θ0)/σ̂_θ̂) − P(Z ≤ −zα/2 − (θa − θ0)/σ̂_θ̂)

What happens if θa = θ0, or, for example, if θa = θ0 + ε, where ε is
very small? This is the situation when the alternative hypothesis is barely
distinguishable from the null hypothesis. It is easy to see that in this case
β = 1 − α, and the power = α. This is the worst case scenario for the test,
and we conclude that the power of a test cannot drop below its size.
Now what happens if |θa − θ0| becomes larger? In the case of one-sided
hypotheses, it is easy to see from the formulas that β declines. In fact, it is
possible to check that this also holds for the two-sided hypothesis. The more
the alternative differs from the null hypothesis, the smaller is the probability of
type II error β (for a fixed level α).

5.2.2 Additional examples


Example 5.2.6 (Shear strength of soils). Shear strength of soils is a quantity im-
portant in civil engineering. Shear strength measurements derived from unconfined
compression tests for two types of soils gave the results shown in the following table
(measurements in tons per square foot). Do the soils appear to differ with respect
to average shear strength, at the 1% significance level?

The null hypothesis is H0 : θ = θ0 (where θ = µ1 − µ2 and θ0 = 0; this is simply
a different formulation of the hypothesis H0 : µ1 = µ2). The alternative hypothesis
is Ha : θ ≠ θ0.
An estimator of θ = µ1 − µ2 is θ̂ = Ȳ1 − Ȳ2, the difference in sample means. So
we take the test statistic

    TS = (θ̂ − θ0)/σ_θ̂ = (θ̂ − θ0)/√(σ1²/n1 + σ2²/n2) ≈ (θ̂ − θ0)/√(s1²/n1 + s2²/n2).

The rejection region (RR) is {|TS| > t}, and we choose the cutoff t so that

    P(|TS| > t | θ = θ0) = α

When H0 is true, the TS is approximately N(0, 1), and we can take t = zα/2.

    RR : {|TS| > zα/2};

For α = 0.01, z0.01/2 = 2.575, which gives RR : {|TS| > 2.575}.


By using the data,

    ts = (θ̂ − θ0)/σ_θ̂ = (1.65 − 1.43 − 0)/√(0.26²/30 + 0.22²/35) = 3.65.

Since ts in the reject region, our decision: reject H0 at the level α = 0.01. We
conclude that at the significance level α = 0.01, there is sufficient evidence that the
soils appear to differ with respect to average shear strength.
Example 5.2.7. A political researcher believes that the fraction p1 of Republicans
strongly in favor of the death penalty is greater than the fraction p2 of Democrats
strongly in favor of the death penalty. He acquired independent random samples of
200 Republicans and 200 Democrats and found 46 Republicans and 34 Democrats
strongly favoring the death penalty. Does this evidence provide statistical support
for the researcher’s belief? Use α = .05.

Using the data provided in this problem, we can calculate:

    p̃ = (Y1 + Y2)/(n1 + n2) = (46 + 34)/400 = 0.2,
    σ_{p̂1−p̂2} = √(p̃(1 − p̃)/n1 + p̃(1 − p̃)/n2) = √(2 × 0.2 × 0.8/200) = 0.04,

and the observed value of the test statistic TS is

    ts = (θ̂ − θ0)/σ_θ̂ = ((46/200 − 34/200) − 0)/0.04 = 1.5

What should be the conclusion?
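Here is a sketch of the computation in R; since ts = 1.5 is less than z0.05 = 1.645,
the test statistic is not in the rejection region, so we fail to reject H0 at α = .05:

  y1 <- 46; y2 <- 34; n1 <- 200; n2 <- 200
  ptilde <- (y1 + y2) / (n1 + n2)                        # pooled proportion, 0.2
  se <- sqrt(ptilde * (1 - ptilde) * (1 / n1 + 1 / n2))  # 0.04
  ts <- (y1 / n1 - y2 / n2) / se                         # 1.5
  ts > qnorm(0.95)                                       # FALSE: fail to reject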

5.3 Determining the sample size


Now, consider Example 5.2.5 again. Suppose that we are not satisfied that
the probability of type II error β is around 10%. What sample size would
give β = 5%?
Recall that the formula for β is

    β = P(Z ≤ zα − 3/√(s1²/n1 + s2²/n2)).

We can use the symmetry of the normal distribution and write it as

    β = P(Z > −zα + 3/√(s1²/n1 + s2²/n2)),

which can be re-written as

    zβ = −zα + 3/√(s1²/n1 + s2²/n2)

If we assume that n1 = n2 = n, then this is the same as

    zβ + zα = 3√n/√(s1² + s2²),

and the formula for the appropriate sample size is

    n = (s1² + s2²) [(zβ + zα)/3]²,

Note that 3 here represents θa − θ0, so the general formula is

    n = (s1² + s2²) [(zβ + zα)/(θa − θ0)]²,

This formula works for both right-tailed and left-tailed one-sided tests. How-
ever, there is no simple formula for two-sided hypotheses.
By plugging in the numbers from the example, we get

    n = (4.34² + 4.56²) [(1.645 + 1.645)/3]² ≈ 47.66

Since we cannot use a fraction as a sample size, we conclude that sample size
n = 48 would be sufficient.
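The formula is easy to package as an R function; a sketch (the function name
is mine):

  n.one.sided <- function(s1, s2, delta, alpha, beta) {
    (s1^2 + s2^2) * ((qnorm(1 - beta) + qnorm(1 - alpha)) / delta)^2
  }
  ceiling(n.one.sided(4.34, 4.56, 3, 0.05, 0.05))   # 48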

5.4 Relation with confidence intervals


Suppose we are still working with large samples and we know that the
estimator θ̂ for a parameter θ has a normal distribution. In fact, let us assume
that we calculated the confidence interval with the confidence level 1 − α. For
concreteness, let us focus on the one-sided lower bound confidence interval

    CI = (θ̂ − zα σ̂_θ̂, ∞).

Does it help us with hypothesis testing? Well, the confidence interval says
that the true value of the parameter is likely to be larger than θ̂ − zα σ̂_θ̂. So
if we test the null hypothesis H0 : θ = θ0, and it happens that θ0 is outside
of the confidence interval, that is, if θ0 < θ̂ − zα σ̂_θ̂, then we should reject
the null hypothesis.
The only provision here is that the CI should be in agreement with the
alternative hypothesis, that is, the alternative hypothesis should be Ha :
θ > θ0 .
If our alternative hypothesis is that θ < θ0, then it is more appropriate to
consider the upper bound confidence interval:

    CI = (−∞, θ̂ + zα σ̂_θ̂).

This confidence interval tells us that the true value of the parameter is likely
to be smaller than θ̂ + zα σ̂_θ̂, so if θ0 is greater than this quantity we should
reject the null hypothesis.
Similarly, if we use a two-sided alternative hypothesis Ha : θ ≠ θ0, then
it is appropriate to use the two-sided confidence interval

    CI = (θ̂ − zα/2 σ̂_θ̂, θ̂ + zα/2 σ̂_θ̂),

and reject the null hypothesis H0 : θ = θ0 if θ0 is outside of the confidence
interval.
On the formal level, consider, say, the first case, when the alternative is
Ha : θ > θ0.
Then the rejection region is

    TS = (θ̂ − θ0)/σ̂_θ̂ > zα.

But this condition can be re-written as

    θ0 < θ̂ − zα σ̂_θ̂,

and this is exactly the condition that θ0 is outside of the lower-bound con-
fidence interval, as we claimed above. The other two cases can be done
similarly.

5.5 p-values
The p-value of a test is useful if one wants to report how strongly the evidence
in the data speaks against the null hypothesis.
Recall that we saw several times the situation when for α = 0.01 we
could not reject the null hypothesis, the evidence was not strong enough,
but for α = 0.05 we could reject the null. (This is because with α = 0.05
we allow ourselves to make a type I error more frequently.)
For any data sample, if we consider a very large α, then the test statistic
is likely to land in the rejection region, which is very wide in this case, and
we are likely to reject the null hypothesis. However, as we gradually decrease α, we

become more conservative, the rejection region shrinks, and at some point
we switch from rejecting H0 for this data sample to saying that there is not
enough evidence in the data to support the rejection. This point is called
the p-value of the test.
Note especially that unlike the level and the power of the test, the p-
value depends both on the test (that is, on the way to calculate the test
statistic and the rejection region) and on the data. If a data sample looks
more unlikely for the null hypothesis than another sample, that is, if it has a
larger test statistic, then the switch from rejection to non-rejection happens
later, for smaller α, and the p-value for such a data sample is smaller!

Definition 5.5.1. The p-value is the smallest significance level α at which


the observed data indicates that H0 should be rejected.

While the definition above is easy to use, it is a bit difficult to grasp or to
explain to a client who does not know what the significance level of a test is.
In this case, the following equivalent definition might be useful.

Definition 5.5.2. The p-value is the probability, – calculated assuming that


the null hypothesis is true, – of obtaining a value of the test statistic, which
is at least as contradictory to H0 as the value calculated from the available
sample.

It is very easy to calculate the p-value. We just set the threshold in the
rejection region equal to the observed value of the test statistic and calculate
the probability of this rejection region under the null hypothesis.
Say, let the test have the rejection region RR : {T S > t} and let ts be
the observed value of the test statistic. Then the p-value is Pr{T S > ts|H0 }.
In practice, for large sample tests it often boils down to calculating the
cumulative distribution function of the standard normal distribution at the
test statistic value ts.
Benefits of the p-value:

• It is a universal measure of the strength of the evidence.

• It describes how extreme the data would be if the H0 were true.

• It answers the question: “Assuming that the null is true, what is the
chance of observing a sample like this, or even worse?”

Example 5.5.3. Urban storm water can be contaminated by many sources,


including discarded batteries. When ruptured, these batteries release met-
als of environmental significance. The paper “Urban Battery Litter” (J.
Environ. Engr., 2009: 46–57) presented summary data for characteristics of
a variety of batteries found in urban areas around Cleveland. A sample of
51 Panasonic AAA batteries gave a sample mean zinc mass of 2.06 g. and
a sample standard deviation of .141 g. Does this data provide compelling
evidence for concluding that the population mean zinc mass exceeds 2.0 g.?
With m denoting the true average zinc mass for such batteries, the rel-
evant hypotheses are H0 : m = 2.0 versus Ha : m > 2.0. The sample size is
large enough so that a z-test can be used without making any specific as-
sumption about the shape of the population distribution. The test statistic
value is
    z = (x̄ − 2.0)/(s/√n) = (2.06 − 2.0)/(.141/√51) = 3.04

So, we calculate the p-value:

p − value = P(Z > 3.04) = 1 − pnorm(3.04) = 0.118%

This means that the null hypothesis would be rejected by tests with α = 5%,
α = 1%, and even with α = 0.2%, although it could not be rejected at the
level α = 0.1%. We would conclude that the sample appears to be highly
contradictory to the null hypothesis, and so there is compelling evidence
that the population mean zinc mass exceeds 2.0 g.
p-values for large sample tests (aka z-tests)

1. The parameter θ is one of the following: µ, p, µ1 − µ2 and p1 − p2 ;

2. The sample size n is large enough.

3. The null hypothesis is H0 : θ = θ0

4. The test statistic is

    TS = (θ̂ − θ0)/σ_θ̂ ∼ N(0, 1) under H0

and the observed test statistic using the given data is ts;

If the alternative hypothesis is

• Ha : θ > θ0 then p-value = P (T S > ts|θ = θ0 ) = 1 − Φ(ts)

• Ha : θ < θ0 then p-value = P (T S < ts|θ = θ0 ) = Φ(ts)

• Ha : θ ̸= θ0 then p-value = P (|T S| > |ts||θ = θ0 )


= P (T S > |ts||θ = θ0 ) + P (T S < −|ts||θ = θ0 ) = 2(1 − Φ(|ts|))
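In R, with ts denoting the observed value of the test statistic, these three
p-values are computed as:

  ts <- 1.667                   # observed test statistic (an example value)
  1 - pnorm(ts)                 # Ha: theta > theta0
  pnorm(ts)                     # Ha: theta < theta0
  2 * (1 - pnorm(abs(ts)))      # Ha: theta != theta0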

5.6 Small-sample hypothesis tests for population
means
If the sample size is small (n < 30) then we cannot hope that the Central
Limit Theorem will ensure that the test statistic

    TS = (θ̂ − θ0)/σ̂_θ̂

has the standard normal distribution. In this case the only way out is
to make sure that the data is at least approximately normal, perhaps by
applying an appropriate transformation to the data.
From now on, in this section we will assume that the data is normal. Even
in this case, the distribution of the test statistic differs significantly from the
normal distribution. This means that when we calculate the probabilities of
type I and II errors, or when we calculate the p-values, we cannot calculate
probabilities like

P(T S > x)

as if the TS were a standard normal random variable. This would result in


wrong probabilities.
Luckily, the distribution of this test statistic is still known and can be
calculated by a computer algorithm. It can also be found in tables.

    TS = (θ̂ − θ0)/σ_θ̂ ∼ t-distribution if H0 is true (θ = θ0).

The degrees of freedom of the t-distribution depend on whether θ = µ or
θ = µ1 − µ2.
If we are interested in testing H0 : µ = µ0, then we use the test
statistic

    TS = (Ȳ − µ0)/(S/√n) ∼ t_{n−1} if H0 is true (µ = µ0)
If we have two samples, X1, . . . , X_{n1} and Y1, . . . , Y_{n2}, with population
means µ1 and µ2, respectively, then we are often interested in testing H0 :
µ1 − µ2 = θ0.

Figure 5.1: Degrees of freedom for the test statistic when the variances are not
the same

Here, two different situations are possible. A simpler situation is when
we can assume that the variances in the two samples are the same: σ1² = σ2² = σ².
(We could check this assumption by an appropriate test!) Then we can use
the test statistic

    TS = (X̄ − Ȳ − θ0)/(Sp √(1/n1 + 1/n2)) ∼ t_{n1+n2−2} if H0 is true (θ = θ0);

where we use the pooled-sample standard deviation as an estimator for σ:

    Sp = √(Sp²) = √(((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2))

In this case the degrees of freedom of the t-distribution are df = n1 + n2 − 2.


A more difficult situation arises when we cannot assume that the vari-
ances σ1² and σ2² are equal. Then we have to use this test statistic:

    TS = (X̄ − Ȳ − θ0)/√(S1²/n1 + S2²/n2),

where S1² and S2² are the sample variances in the two samples. It turns out that
the distribution of this TS is approximately a t-distribution, but the formula
for the degrees of freedom is quite complicated. See Figure 5.1.

Note, however, that some researchers have suggested that this procedure should
be used whenever there are doubts about whether the variances are the same.
After the distribution of the test statistic is determined, the rest is simple:
one only needs to replace zα with the tα that is calculated for the t-
distribution with the correct number of degrees of freedom.

• Ha : θ > θ0 ⇔ RR : {T S > tα }

• Ha : θ < θ0 ⇔ RR : {T S < −tα }

• Ha : θ ̸= θ0 ⇔ RR : {|T S| > tα/2 }

The quantities tα can be found from the tables or by using the R com-
mand qt. In particular tα for ν degrees of freedom can be calculated as
qt(1 − α, ν).
The calculation of the probability of type II error β and the power 1−β is
in fact very similar to the calculations in the case of the normal distribution.
Again, one only needs to use the t-distribution with the correct number of
degrees of freedom instead of the standard normal distribution.
This is also true for p-values.

• Ha : θ > θ0 ⇔ p-value = P (T S > ts|θ = θ0 )

• Ha : θ < θ0 ⇔ p-value = P (T S < ts|θ = θ0 )

• Ha : θ ̸= θ0 ⇔ p-value = P (|T S| > |ts||θ = θ0 ) = 2P (T S > |ts|)

where the test statistic has the t-distribution with an appropriate number
of degrees of freedom. The tables only give a range for p-value. For precise
probability, one must use R command pt(a,df) whose output is P(T < a)
where T ∼ t(df ).
Example 5.6.1. An article in American Demographics investigated con-
sumer habits at the mall. We tend to spend the most money when shop-
ping on weekends, particularly on Sundays between 4:00 and 6:00 PM, while
Wednesday-morning shoppers spend the least.
Independent random samples of weekend and weekday shoppers were
selected and the amount spent per trip to the mall was recorded as shown
in the following table:

• Is there sufficient evidence to claim that there is a difference in the
average amount spent per trip on weekends and weekdays? Use α =
0.05.

• What is the attained significance level (p-value)?

• What if n1 = 10 and n2 = 10?

Let us do this example for n1 = n2 = 20 and assume that σ1² = σ2².
Our null hypothesis is that µ1 = µ2, so we want to calculate

    TS = (X̄ − Ȳ)/(Sp √(1/n1 + 1/n2))

where

    Sp = √(Sp²) = √(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)) = √((19 × 22² + 19 × 20²)/(20 + 20 − 2))
       = 21.0238

So,

    TS = (78 − 67)/(21.0238 √(1/20 + 1/20)) = 1.654556

For the normal distribution z0.05 = 1.645 so the test would reject the null
hypothesis if the sample were large.
If we want to test H0 at the level α = 0.05 and use the t-distribution we
want tα for ν = 20 + 20 − 2 = 38.
We calculate it as qt(.95, 38) = 1.685954, so we conclude that the evi-
dence is not sufficient to reject H0 at the level α = 0.05.
The p-value can be calculated as 1 − pt(1.654556, 38) = 5.31%.
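The whole calculation can be done in R, mirroring the numbers above:

  n1 <- 20; n2 <- 20; xbar <- 78; ybar <- 67; s1 <- 22; s2 <- 20
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))  # 21.0238
  ts <- (xbar - ybar) / (sp * sqrt(1 / n1 + 1 / n2))               # 1.654556
  qt(0.95, n1 + n2 - 2)                                            # cutoff 1.685954
  1 - pt(ts, n1 + n2 - 2)                                          # p-value, 5.31%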

5.7 Hypothesis testing for population variances
Occasionally, we are interested in testing variances. The most frequent ex-
ample is when we test the equality of variances in two samples, in order to see
if the corresponding populations are really different in a certain aspect.
Sometimes we might be interested to see that the variance does not exceed
a certain threshold. This problem arises in quality control.
Let us consider first testing the hypothesis H0 : σ² = σ0². If the sample is
large, then we can use

    TS = (S² − σ0²)/σ_{S²},

with a suitable estimator for σ_{S²}.
This approach does not generalize easily to small samples, since it is
difficult to calculate the exact distribution of the ratio. So we look at an
alternative method, which works both for large and small samples.
So let us assume that X1 , . . . , Xn are from a Normal distribution N (µ, σ 2 )
with unknown mean µ and unknown variance σ 2 .
We use the result that we know from the section about variance estima-
tion:

    TS = (n − 1)S²/σ0² ∼ χ²(n − 1) when H0 is true
For the case when the alternative hypothesis is Ha : σ² > σ0², the
rejection region is similar to the RR in z- and t-tests:

    RR : TS > χ²α(n − 1),

where χ2α (n − 1) solves the equation P(T S > x) = α if we know that TS is


distributed as a χ2 random variable with n − 1 degrees of freedom.
The R command for calculating this quantity is qchisq(1 − α, n − 1).

For the alternative hypothesis Ha : σ 2 < σ02 , there is some difference from
the case of z- or t-tests because the χ2 distribution is not symmetric relative
to zero. Instead of using −χ2α (n − 1) as a threshold, we use χ21−α (n − 1). So
the rejection region in this case is

RR : T S < χ21−α (n − 1),

Finally, if the alternative hypothesis is Ha : σ² ≠ σ0², then it is conventional
to use the following rejection region:

    RR : TS < χ²_{1−α/2}(n − 1) or TS > χ²_{α/2}(n − 1)

Correspondingly, the p-values for these alternative hypotheses are as


follows. If Ha : σ 2 > σ02 , then

p-value = P (T S > ts) = 1 − pchisq(ts, n − 1).

If Ha : σ 2 <σ02 , then

p-value = P (T S < ts) = pchisq(ts, n − 1)

If Ha : σ² ≠ σ0², then we need to think a bit harder. When we decrease α,
then at some point either χ²_{α/2} or χ²_{1−α/2} hits the value of the test statistic
ts that was realized in the sample. At this moment the test stops rejecting
the null hypothesis. So if χ²_{α/2} hits ts first, then we conclude that this is the
critical α* which equals the p-value. Hence in this case the p-value equals

    α* = 2P(TS > χ²_{α*/2}) = 2P(TS > ts)

Note that at that moment P(TS < ts) = 1 − α*/2, so 2P(TS < ts) = 2 − α* > 1.
If χ²_{1−α/2} hits ts first, then by a similar argument we find that the p-value
equals

    α* = 2P(TS < χ²_{1−α*/2}) = 2P(TS < ts)

and 2P(TS > ts) > 1.


So we have two candidates for the p-value: 2P (T S < ts) and 2P (T S >
ts), and we know that if one of them is indeed the p-value (and so less than 1
as a probability), then the other is greater than 1. So we can simply choose
the minimal of these two numbers. In summary,

p-value = 2 × min[P (T S > ts), P (T S < ts)]

In case you pick the wrong one between P(TS > ts) and P(TS < ts),
your answer will exceed 1, which is an immediate warning sign, because a
probability cannot be greater than 1.
Example 5.7.1. An experimenter was convinced that the variability in his
measuring equipment results in a standard deviation of 2. Sixteen measure-
ments yielded s2 = 6.1. Do the data disagree with his claim? Determine
the p-value for the test. What would you conclude if you chose α = 0.05?

• H0 : σ² = 4 and Ha : σ² ≠ 4

• Test statistic: (n − 1)S²/4

• Observed value of the test statistic is ts = (16 − 1) × 6.1/4 = 22.875

• R gives pchisq(22.875, 15) = 0.9132 and 1 − pchisq(22.875, 15) =
  0.0868.

• Hence p-value = 2 × 0.0868 = 17.36% > 5%.

We conclude that the data do not give enough evidence to disagree with his
claim.
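In R, the whole test takes three lines:

  n <- 16; s2 <- 6.1; sigma0.sq <- 4
  ts <- (n - 1) * s2 / sigma0.sq                       # 22.875
  2 * min(pchisq(ts, n - 1), 1 - pchisq(ts, n - 1))    # p-value, about 0.1736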

Now consider the test for equality of variances in two populations. In some
situations, a researcher is interested to know whether the data variation in
two samples indicates different variances in the corresponding populations.
For example:

• comparing precision of two measuring instruments;

• the variation in quality of a product at two locations or at two different


time periods;

• variation is scores for two test procedures.

• variation in outcomes for two medical procedures.

• variation in market returns for two time periods or in two different


countries.

Suppose that we have two samples with n1 and n2 observations, respectively, from normal distributions with variances σ1² and σ2², respectively. We want to test the hypothesis H0 : σ1² = σ2² against the alternative Ha : σ1² > σ2².

If S1² and S2² are the sample variances, then define the test statistic

TS = S1²/S2².
Under the null hypothesis this ratio is distributed as the so-called F-distribution with n1 − 1 and n2 − 1 degrees of freedom. (F is for Fisher, who designed the test.) So we can use the rejection region

RR = {TS > Fα}.

If the hypothesis is H0 : σ1² = σ2² but we test against the alternative Ha : σ1² < σ2² (instead of Ha : σ1² > σ2²), then we can simply use

TS = S2²/S1²

instead of

TS = S1²/S2².

The new test statistic is distributed as an F random variable with degrees of freedom n2 − 1 and n1 − 1.
What if we want to test H0 against the alternative hypothesis Ha : σ1² ≠ σ2²? It turns out that in this case we can use the test statistic

TS = S1²/S2²,

which is distributed as an F random variable with n1 − 1 and n2 − 1 degrees of freedom, and use the rejection region

RR = { 1/TS > F_{α/2}(n2 − 1, n1 − 1) or TS > F_{α/2}(n1 − 1, n2 − 1) }.

Notice the degrees of freedom in the numerator and denominator of the thresholds!

It is worthwhile to repeat: the test is very sensitive to the assumption that the data are normally distributed.
Example 5.7.2. A study was conducted by the Florida Game and Fish Com-
mission to assess the amounts of chemical residues found in the brain tissue
of brown pelicans. In a test for DDT, random samples of n1 = 10 juveniles
and n2 = 13 nestlings produced the results shown in the accompanying table
(measurements in parts per million, ppm).

Are you willing to assume that the underlying population variances are
equal? Test σ12 = σ22 against σ12 > σ22 at α = 0.01. What is the p-value?
The test statistic is

TS = 0.017²/0.006² = 8.027778.

The p-value is 1 − pf(8.027778, 9, 12) = 0.07%, so we would reject the null hypothesis at the 1% level.
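A minimal R check of this computation (the sample standard deviations 0.017 and 0.006 are taken from the example; with the raw data vectors one could instead call the built-in var.test):

    s1 <- 0.017; s2 <- 0.006; n1 <- 10; n2 <- 13
    ts <- s1^2 / s2^2            # 8.027778
    1 - pf(ts, n1 - 1, n2 - 1)   # about 0.0007, i.e. 0.07%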

5.8 Neyman - Pearson Lemma and Uniformly Most


Powerful Tests
So far we have talked about specific tests and have not been concerned with evaluating and comparing tests. Recall that for the estimation problem we had the concept of the Minimum-Variance Unbiased Estimator. If we can find such an estimator, then we can agree that it is the best among the available estimators.

For a test, the natural measure of its goodness is its power. If we have two tests with the same α, we will prefer the test that has the larger power. Note however that the power of the test is a function of the parameter under the alternative hypothesis. So if one test has larger power than another under one value of the parameter θa, it can actually have smaller power under another value of θa.

Recall that the power of the test = 1 − β = Pr(reject H0 | Ha is true). It can be calculated only if Ha is a simple hypothesis.

Definition 5.8.1. A hypothesis is said to be simple if this hypothesis


uniquely specifies the distribution of the population from which the sam-
ple is taken. Any hypothesis which is not simple is called a composite
hypothesis.

For example, the hypothesis θ = 2 is simple, while the hypothesis θ > 2 is composite. Power is a curve: it depends on the value of the parameter in the alternative hypothesis. Can we build a test with the “best” power curve?

First, let us not be too ambitious and try to find the test with the maximum power when the significance level is α and when the alternative hypothesis is simple, Ha : θ = θa.

Theorem 5.8.2 (The Neyman-Pearson Lemma). For testing H0 : θ = θ0 versus Ha : θ = θa, the test with the rejection region

RR = { L(θ0|X)/L(θa|X) < t },

where t is chosen so that P( L(θ0|X)/L(θa|X) < t | θ = θ0 ) = α, is the most powerful α-level test for H0 versus Ha.
In other words, if the alternative hypothesis is simple, θ = θa, the best test statistic is

TS = L(θ0|X)/L(θa|X).

This TS measures how likely the data are under the null hypothesis compared with their likelihood under the alternative hypothesis. You reject H0 if the ratio is too small, with the threshold chosen so that the level of this test is α. The theorem says that this test has the largest power to reject H0 (among the tests with the same α) provided the alternative is fixed at θa.
In order to construct the Neyman-Pearson test, we need to know the distribution of the test statistic, which is not always easy. Here is an example where the distribution of L(θ0|X)/L(θa|X) is not too hard to find.
Example 5.8.3. Suppose we have just one observation in the sample, Y ∼
f (y|θ) = θy θ−1 1{0<y<1} . Find the most powerful test for H0 : θ = 2 against
Ha : θ = 1 at significance level α = 0.05.
The likelihood here is simply L(θ|y) = θ y^{θ−1}. (Only one observation, so no products.)
The ratio is

L(θ0|y)/L(θa|y) = (θ0/θa) y^{θ0−θa}.

If we want the size of the test to be α, we should have

Pr[ L(θ0|Y)/L(θa|Y) < t | H0 ] = Pr[ (θ0/θa) Y^{θ0−θa} < t | H0 ] = Pr[ Y < (t θa/θ0)^{1/(θ0−θa)} | H0 ] = α.

Under the null hypothesis, Y has density θ0 y^{θ0−1}, so we can calculate the cumulative distribution function as F_Y(y) = y^{θ0}, and

Pr[ L(θ0|Y)/L(θa|Y) < t | H0 ] = Pr[ Y < (t θa/θ0)^{1/(θ0−θa)} | H0 ] = (t θa/θ0)^{θ0/(θ0−θa)} = α.

Hence, the threshold in the test should be set to

t = (θ0/θa) α^{(θ0−θa)/θ0}.

In our example, the test statistic is (θ0/θa) Y^{θ0−θa} = 2Y, and

t = 2 × 0.05^{(2−1)/2} = 2 × 0.2236.

Equivalently, the most powerful test with α = 0.05 in this case has RR = {Y < 0.2236}.
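A quick numeric sanity check of this example (a sketch; the closed-form values follow from F_Y(y) = y^{θ0}):

    c_star <- 0.05^(1/2)   # alpha^(1/theta0) = 0.2236, the cutoff for Y
    c_star^2               # size:  P(Y < c_star | theta = 2) = 0.05
    c_star                 # power: P(Y < c_star | theta = 1) = 0.2236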
In the N-P lemma, the test is guaranteed to be the most powerful level-α test against a specific alternative hypothesis. What if we try a different alternative hypothesis?

Consider the previous example: we found that the most powerful test has the rejection region

{ (θ0/θa) Y^{θ0−θa} < (θ0/θa) α^{(θ0−θa)/θ0} } = { Y^{θ0−θa} < α^{(θ0−θa)/θ0} }.

If θa < θ0, then we can take the power 1/(θ0 − θa) on both sides and get that the rejection region is

{ Y < α^{1/θ0} }.

It is the same for all θa < θ0. But if θa > θ0 then we would get a completely different test:

{ Y > (1 − α)^{1/θ0} }.

Say, if in the previous example we chose Ha : θ = 4, then we would get a test of the form RR = {Y > . . .}.
When a test (that is, a test statistic TS and a rejection region RR) maximizes the power for every value of θ ∈ Ωa, it is said to be a uniformly most powerful (or UMP) test for H0 : θ = θ0 versus the composite hypothesis Ha : θ ∈ Ωa.

For example, in our previous example, the Neyman-Pearson test is the UMP test for Ha : θ > θ0. It is also UMP for Ha : θ < θ0. (Note that in this case it is actually a different test. We call it the Neyman-Pearson test because it was obtained by applying the Neyman-Pearson lemma to a specific θa < θ0. It just happened in this example that all these tests coincide for all θa < θ0.) However, in the case of the two-sided hypothesis Ha : θ ≠ θ0, the Neyman-Pearson lemma is not helpful because it gives two different tests depending on whether θa < θ0 or θa > θ0. In fact, in this case there is no uniformly most powerful test.

In many cases, even for one-sided hypotheses, UMP tests do not exist. However, they are especially rare if the alternative is two-sided, Ha : θ ≠ θ0, or if we test a vector parameter and the alternative hypothesis is not simple. (That is, the alternative hypothesis is not just a specific value θ⃗a, for which the Neyman-Pearson lemma would give us a most powerful test.)
Example 5.8.4 (Neyman-Pearson lemma applied to normal data). Y1, . . . , Yn ∼ N(µ, σ²). Consider H0 : µ = µ0 versus Ha : µ = µ1. We assume that σ² is KNOWN (fixed). Otherwise the N-P lemma is not applicable: the hypothesis is not simple. What is the most powerful test with level α?

The likelihood is

L(µ) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{i=1}^n (yi − µ)² }.

The ratio of the likelihoods defined in the Neyman-Pearson lemma is:

L(µ0)/L(µ1) = exp{ −(1/(2σ²)) [ Σ_{i=1}^n (yi − µ0)² − Σ_{i=1}^n (yi − µ1)² ] }
            = exp{ −(1/(2σ²)) Σ_{i=1}^n [ 2yi(µ1 − µ0) + µ0² − µ1² ] }.

The Neyman-Pearson Lemma says that the most powerful test is the one with some appropriate threshold t which rejects H0 when

exp{ −(1/(2σ²)) Σ_{i=1}^n [ 2yi(µ1 − µ0) + µ0² − µ1² ] } < t
⇐⇒ −(1/(2σ²)) Σ_{i=1}^n [ 2yi(µ1 − µ0) + µ0² − µ1² ] < t′
⇐⇒ Σ_{i=1}^n [ 2yi(µ1 − µ0) + µ0² − µ1² ] > t′′
⇐⇒ 2n ȳ (µ1 − µ0) > t′′′.

That is, the test tells us to reject H0 when

ȳ > t′′′/(2n(µ1 − µ0)), if µ1 − µ0 > 0, or
ȳ < t′′′/(2n(µ1 − µ0)), if µ1 − µ0 < 0.

In other words, reject when

ȳ > A, if µ1 − µ0 > 0, or
ȳ < B, if µ1 − µ0 < 0,

where the thresholds A and B are chosen so that the level of the test is α.
Indeed, such a test is exactly what our intuition would have suggested. The NP lemma provides a theoretical justification for this test. Is this test uniformly most powerful among all tests against the composite alternative hypothesis Ha : µ > µ0? Against the composite Ha : µ < µ0? Can we construct uniformly most powerful tests for Ha : µ ≠ µ0?

5.9 Likelihood ratio test


Theoretical Question #2: How do we design a test?

Suppose we have a model with many parameters θ⃗ = (θ1, . . . , θk), and we want to test that one or more of the parameters is 0. Or maybe we want to test some relationship between parameters, such as θ1 = θ2, or more generally that c1θ1 + c2θ2 + . . . + ckθk = 0. How do we test such a hypothesis?

One approach to the design of a statistical test is to find an estimator for a parameter that encapsulates our hypothesis, and then use our knowledge about the distribution of this estimator. For example, if we test the hypothesis θ1 = θ2, then we can find the ML estimator for the difference θ1 − θ2 and use the fact that in large samples this estimator is normal and that it is possible to calculate its variance. This is a very useful approach. Its deficiency is that we need to obtain the estimator and its variance before we are able to construct the test. In addition, the variance often depends on the true value of the parameter θ, so if our null hypothesis is not simple but has some nuisance parameters, then we are in trouble.
The second approach is based on the Neyman-Pearson lemma and uses the ratio L(θ⃗0|Y⃗)/L(θ⃗a|Y⃗) as the test statistic. This gives the most powerful test. The deficiency is that we need to find the distribution of this test statistic. Additionally, this approach is quite restrictive: it requires both the null and alternative hypotheses to be simple and not composite.
In this section we consider the third alternative, which is often very
convenient since it works for composite hypotheses, and in case of large
samples requires essentially no calculation except the maximization of some
likelihood functions.
Let Ω0 be the set of parameters that satisfy our null hypothesis. For
example, it can be that Ω0 = the set of all parameters θ⃗ such that θ1 = θ2 ,
and all other parameters can be arbitrary. Then, let Ωa be the set of all
possible alternative values for parameters. For example, our alternative can
be that θ1 > θ2 and all other θi are arbitrary. The alternative hypothesis is
typically a composite hypothesis in practical applications.
Define the total feasible parameter set Ω = Ω0 ∪ Ωa. Define the likelihood ratio statistic by

λ = max_{θ⃗∈Ω0} L(θ⃗|Y⃗) / max_{θ⃗∈Ω} L(θ⃗|Y⃗),

where L(θ⃗|Y⃗) is the likelihood of the vector parameter θ⃗ given that the observed data is Y⃗ = (Y1, . . . , Yn).

Use the rejection region RR = {λ < k}, where the threshold k is determined by the requirement that the level of the test is α.
This appears not to be especially useful, since we need to do two constrained maximizations and we do not know the distribution of λ, so we cannot calculate the threshold k. In fact, λ is a quite complicated function of the data: it is the ratio of two constrained maxima of the likelihood, which are themselves complicated functions!

The power of this method is that we can do the maximization numerically, and this is relatively easy given enough computing power. The most important fact, however, is that k can be calculated efficiently when the data sample is large.
Conceptually, the likelihood ratio test makes a lot of sense.

• If H0 is true, then with high probability the constrained maximum likelihood (with the maximum over Ω0) would be close to the unconstrained maximum likelihood (the maximum over Ω), the denominator would give about the same result as the numerator, and λ would be around 1.

• If H0 is false while Ha is true, then the unconstrained maximum likelihood would be much larger than the constrained maximum likelihood, the denominator is likely to be much greater than the numerator, and this leads to a small value of λ.

• Therefore, we may use λ as a test statistic and reject H0 if λ < k.

But what threshold k should we choose if we want a level α test?


The practical applications of the likelihood ratio test are based on the following amazing theorem.

Theorem 5.9.1 (Wilks’ Theorem). Let X1, X2, . . . , Xn be i.i.d. random observations and let λ(X1, . . . , Xn) be the likelihood ratio for the hypothesis Ω0 against Ωa. Let r0 denote the number of free parameters that are specified by H0 : θ⃗ ∈ Ω0 and let r denote the number of free parameters specified by the statement θ⃗ ∈ Ω. Then, for large n, for all θ0 ∈ Ω0, the statistic −2 log λ has approximately a χ² distribution with r − r0 degrees of freedom.

The number of free parameters is the total number of parameters minus the number of equality constraints, so the difference r − r0 is simply the excess of the number of equality constraints that define Ω0 over the number of equality constraints that define Ω. (Note that inequality constraints do not count; they are not important for the large sample analysis.)

Therefore, the rejection region for the likelihood ratio test in large samples has a very simple form:

RR : {−2 log λ > χ²_α(r − r0)}.

The proof of Wilks’ Theorem is not easy and will not be given here.
Example 5.9.2. Suppose that an engineer wishes to compare the number
of complaints per week filed by union stewards for two different shifts at a
manufacturing plant. One hundred independent observations on the number
of complaints gave means x = 20 for shift 1 and y = 22 for shift 2. Assume
that the number of complaints per week on the i-th shift has a Poisson
distribution with mean λi , for i = 1, 2. Use the likelihood ratio method to
test H0 : λ1 = λ2 against Ha : λ1 ̸= λ2 with α ≈ 0.01.
By taking the product of the individual density functions we find the likelihood function:

L(λ1, λ2) = (1/C) e^{−nλ1} λ1^{Σ_{i=1}^n xi} × e^{−nλ2} λ2^{Σ_{i=1}^n yi},

where C = x1! . . . xn! y1! . . . yn! and n = 100.
Here we will be able to do maximizations analytically, although it could
also be done numerically.
The log-likelihood function is:

ℓ(λ1, λ2) = −log C + ( Σ_{i=1}^n xi ) log λ1 − nλ1 + ( Σ_{i=1}^n yi ) log λ2 − nλ2.

If it is assumed that λ1 = λ2 = λ, then the maximization of the log-likelihood function leads (after a calculation) to the constrained MLE estimator

λ̂_ML = (x̄ + ȳ)/2 = 21.

If we do not assume that λ1 = λ2, then the unconstrained maximum likelihood estimator of the vector (λ1, λ2) is (after a calculation)

λ̂1,ML = x̄ = 20 and λ̂2,ML = ȳ = 22.

Then, the log likelihood ratio is the difference of the log-likelihoods evaluated at the constrained and unconstrained MLE estimators:

−2 log(likelihood ratio) = −2 [ ℓ(λ̂_ML, λ̂_ML) − ℓ(λ̂1,ML, λ̂2,ML) ].

Calculation gives:

−2 log(likelihood ratio) = −2 [ −log C + (n x̄ + n ȳ) log λ̂_ML − 2n λ̂_ML
                               + log C − n x̄ log λ̂1,ML + n λ̂1,ML − n ȳ log λ̂2,ML + n λ̂2,ML ].

Some terms cancel out (since 2λ̂_ML = λ̂1,ML + λ̂2,ML) and we find

−2 log(likelihood ratio) = −2 [ (100 × 20 + 100 × 22) log 21 − 100 × 20 log 20 − 100 × 22 log 22 ]
                         = 9.5274.

By Wilks’ theorem we should use the rejection region RR = {−2 log λ > χ²_{α=0.01, df=1} = 6.635}. Hence we reject H0 : λ1 = λ2 at significance level α = 0.01.
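A short R sketch of this calculation (using only the means reported in the example; variable names are ours):

    n <- 100; xbar <- 20; ybar <- 22
    lam_hat <- (xbar + ybar) / 2                       # constrained MLE: 21
    stat <- -2 * ((n * xbar + n * ybar) * log(lam_hat) -
                  n * xbar * log(xbar) - n * ybar * log(ybar))
    stat                                               # 9.5274
    qchisq(0.99, df = 1)                               # 6.635: stat exceeds it, so reject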
In fact, in this example we can also use the first method, based on the estimator of the parameter λ1 − λ2. Indeed, since the parameter λ is the mean of the Poisson distribution, the problem can be thought of as a problem about the equality of means in two samples. The difficulty is that the sample standard deviations are not given. However, we know the distribution of the data observations (Poisson).

Note that x̄ is an estimator of λ1 which is approximately normal with distribution N(λ1, λ1/n). Similarly, ȳ is an approximately normal independent random variable with distribution N(λ2, λ2/n).

Hence, under the null hypothesis, we have that the test statistic

TS = (ȳ − x̄)/σ_{ȳ−x̄} ∼ N(0, 1).

Under the null hypothesis, a good estimator of σ_{ȳ−x̄} is

sqrt( λ̂/n + λ̂/n ),

where λ̂ = (x̄ + ȳ)/2 = 21. So, we calculate:

TS = (22 − 20)/sqrt(2 × 21/100) = (2 × 10)/√42 = 3.086.

This is greater than z_{0.01} = 2.33, so H0 : λ1 = λ2 should be rejected at level α = 0.01.

5.9.1 An Additional Example

This section gives an example in which the likelihood ratio test is designed explicitly, without using the approximation provided by Wilks’ theorem.

Example 5.9.3. A service station has six gas pumps. When no vehicles are at the station, let pi denote the probability that the next vehicle will select pump i (where i = 1, 2, . . . , 6). We have a sample of size n in which xi vehicles have chosen pump i. We wish to test H0 : p1 = . . . = p6 = 1/6 versus the alternative Ha : p1 = p3 = p5; p2 = p4 = p6 = θ ≠ 1/6.

What is the likelihood function of the parameters p1, p2, . . . , p6 given the sample data if no restrictions are imposed on the parameters?
This is a multinomial distribution, so

L(p⃗|x⃗) = [ n!/(x1! · · · x6!) ] Π_{i=1}^6 pi^{xi}.

What are the likelihood functions under the hypotheses H0 and Ha, respectively?

Under H0, the likelihood is

[ n!/(x1! · · · x6!) ] (1/6)^n.
Under Ha, we calculate that p1 = p3 = p5 = 1/3 − θ, and the likelihood is

[ n!/(x1! · · · x6!) ] (1/3 − θ)^{x1+x3+x5} θ^{x2+x4+x6}.
Suppose that X = X2 + X4 + X6 is the number of customers in the sample that select an even-numbered pump. What is the maximum likelihood estimator of the parameter θ under the alternative hypothesis Ha?

Maximization of the likelihood under Ha is equivalent to maximization of the log-likelihood, that is, of

ℓ(θ) = c + (n − X) log(1/3 − θ) + X log θ.

Then,

ℓ′(θ) = −(n − X)/(1/3 − θ) + X/θ = 0,
−θn + Xθ + X/3 − Xθ = 0,
θ̂_MLE = X/(3n).
Express the likelihood ratio statistic λ in terms of X.

In terms of X, the likelihood under Ha is

[ n!/(x1! · · · x6!) ] (1/3 − θ)^{n−X} θ^X.

Substituting the MLE estimate of θ into the definition of λ (the multinomial coefficients cancel), we get:

λ = (1/6)^n / [ (1/3 − X/(3n))^{n−X} (X/(3n))^X ].

The rejection region for the likelihood ratio test is {λ ≤ t}, where t is a threshold. This is the same as {−log λ ≥ t′}, and we can rewrite this region in our case as

(n − X) log( (n − X)/(3n) ) + X log( X/(3n) ) ≥ t′′,

or

(n − X) log(n − X) + X log X ≥ t′′′.
The second derivative of the function on the left (as a function of X) is

1/(n − X) + 1/X > 0,

which means that this function is convex, and so:

1. there can be at most two solutions to the corresponding equality;

2. by symmetry, if one of these solutions is c, then the other is n − c;

3. the inequality is satisfied only if X ≥ c or X ≤ n − c.

Let n = 10 and c = 9. Determine the significance level α of the test and its power when θ = p2 = p4 = p6 = 1/10.

Under Ha, the probability that one of the pumps 2, 4, or 6 is visited equals 3θ. Hence X (the number of visits to these pumps) is distributed as binomial with parameter 3θ = 0.3. Under H0 it is binomial with probability 0.5. Hence,

α = Pr(X ≥ 9|n = 10, p = 0.5) + Pr(X ≤ 1|n = 10, p = 0.5)
  = 1 − Pr(X ≤ 8|n = 10, p = 0.5) + Pr(X ≤ 1|n = 10, p = 0.5)
  = 1 − 0.989 + 0.011 = 0.022

and

power = 1 − β = Pr(X ≥ 9|n = 10, p = 0.3) + Pr(X ≤ 1|n = 10, p = 0.3)
      = 1 − Pr(X ≤ 8|n = 10, p = 0.3) + Pr(X ≤ 1|n = 10, p = 0.3)
      = 1 − 1.000 + 0.149 = 0.149
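These binomial probabilities are easy to check in R, for example (a sketch):

    alpha <- 1 - pbinom(8, 10, 0.5) + pbinom(1, 10, 0.5)   # about 0.022
    power <- 1 - pbinom(8, 10, 0.3) + pbinom(1, 10, 0.3)   # about 0.149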

5.10 Quizzes

Quiz 5.10.1. A Type I error is when:

A. We reject the null hypothesis when it is actually true


B. We obtain the wrong test statistic
C. We fail to reject the null hypothesis when it’s actually
false
D. We reject the alternate hypothesis when it’s actually
true

Quiz 5.10.2. A level of significance (or size of the test) of


5% means:
A. There’s a 5% chance there is an error in test decision.
B. There’s a 5% chance we’ll be wrong if we fail to reject
the null hypothesis
C. There’s a 5% chance we’ll be wrong if we reject the null
hypothesis.
D. The alternative hypothesis is not significant.

Quiz 5.10.3. We are interested in this problem: “Is the
proportion of babies born male different from 50%?” In
the sample of 200 births, we found that 96 babies born
were male. We tested the claim using a test with the level
of significance 1% and found that the conclusion is “Fail to
reject H0 .” What could we use as interpretation?

A. The proportion of babies born male is not 0.50.


B. There is not enough evidence to say that the proportion
of babies born male is different from 0.50.
C. There is not enough evidence to say that the proportion
of babies born male is 0.50.

Quiz 5.10.4. Suppose, we are interested in testing H0 : µ =


µ0 against Ha : µ > µ0 . We will reject H0 at level α = 0.05
if µ0 is

A. larger than 95% upper confidence bound for µ.


B. larger than 95% lower confidence bound for µ.
C. smaller than 95% upper confidence bound for µ.
D. smaller than 95% lower confidence bound for µ.

Quiz 5.10.5. An educator is interested in determining the
number of hours of TV watched by 4-year-old children. She
wants to show that the average number of hours watched
per day is more than 4 hours. To test her claim she took a
random sample of 100 youngsters. Which of the following
values for the sample mean would have the largest p-value
associated with it?

A. 2
B. 3.9
C. 4
D. 5

Quiz 5.10.6. Suppose that I have collected a random sample


to test H0 : µ = µ0 vs. Ha : µ > µ0 and I end up rejecting
H0 at level α = 0.05 based on my sample. If I decided to
change α from 0.05 to 0.01, then based on the same sample
that I have in my hand, I would
A. definitely fail to reject the H0 at level α = 0.01;
B. definitely reject the H0 at level α = 0.01;
C. either reject the H0 at level α = 0.01 or fail to reject
the H0 at level α = 0.01 depending on the sample;
D. have to toss a fair coin to decide what to do.

Quiz 5.10.7. Suppose that I am interested in testing H0 : µ =
µ0 against Ha : µ ̸= µ0 . I calculate the type II error proba-
bility β using the alternative value of parameter µa . Then,
β will be smaller if I
A. Decrease the type I error probability α;
B. Decrease the sample size n;
C. Decrease the distance between µa and µ0 ;
D. None of the above is correct.

Chapter 6

Linear statistical models and


the method of least squares

6.1 Linear regression model


In the previous chapter, we saw an example of two-sample problems, in which we observed two samples and tried to understand whether one sample is different from the other. One case of this problem is when we compare the control sample with the treatment sample and try to understand whether the treatment has a statistically significant effect.

You can imagine the situation when we have more than two samples corresponding to several treatments (maybe different doses of a drug, or maybe different drugs) and we try to understand what the effects of these treatments are. Models with more than two samples can be studied by the statistical method called ANOVA.

In this chapter we introduce another model, in which one variable X affects another variable Y and the relation is assumed to be linear up to a random error. In other words, we have n observations of variables X and Y: (x1, y1), (x2, y2), …, (xn, yn), and we assume that they satisfy the following model:

yi = β 0 + β 1 x i + ε i (6.1)
Here β0 and β1 are unknown parameters that we want to estimate. The quantities xi, i = 1, . . . , n are known quantities, which are called explanatory variables, or independent variables. The variables εi are error terms. They are responsible for the randomness in the model. They are always assumed to have zero mean: E(εi) = 0. They are also often (but not always) assumed to have unknown variance σ² that does not depend on i: Var(εi) = σ². Even more restrictively, they are often assumed to be normally distributed: εi ∼ N(0, σ²).

The values yi are random since they are functions of εi. (We could write them Yi following our usual convention about random variables.) They are usually called the response variable or dependent variable. So, y1, …, yn are n independent observations of the response variable Y.
If we want to relate this model to the models in previous chapters, assume that the error terms are normally distributed and note that then we can think of y1, . . . , yn as a sample from the distributions N(β0 + β1 xi, σ²). The observations in this sample are independent but they are not identically distributed! Indeed, the mean of the i-th observation changes with i: E(yi) = β0 + β1 xi.
The model (6.1) is called the simple regression model. It is often written
in a short form that omits the subscript i:

Y = β0 + β1 x + ε.

A general linear regression model includes more than one explanatory variable:

yi = β0 + β1 xi^(1) + . . . + βk xi^(k) + εi,   (6.2)
or in short notation:

Y = β0 + β1 x(1) + . . . + βk x(k) + ε

This model is very flexible and can be used to model non-linear depen-
dencies as well. For example, if we believe that the response Y depends on

explanatory variable X as a polynomial of degree three, we can add new explanatory variables that correspond to the squares and cubes of X:

xi^(1) = xi,
xi^(2) = (xi)²,
xi^(3) = (xi)³.

Then, we only need to estimate the regression model

Y = β0 + β1 x^(1) + β2 x^(2) + β3 x^(3) + ε

in order to get the desired non-linear relationship between Y and X.

In some cases we can also consider a transformation of the random variable Y so as to get a more suitable distribution for the random error terms ε.

Figure 6.1: A scatter plot for the weight (in pounds) and the miles per gallon (MPG) of a car in a random sample of 25 vehicles.
Examples

Linear regression is a workhorse of statistics, so there is an enormous number of examples. For instance,

• the energy efficiency of a vehicle and its weight;

• the wage of an individual and his or her education, age, and experience.

Note that one of the difficult problems in statistics is to distinguish which variables are explanatory variables and which are response variables.
Goals

As usual, our goals are to estimate the parameters β0, …, βk and to test some hypotheses about their values. There is another goal, which we have not seen before: we might be interested in predicting the response Y for some other values of x. In addition, we might be interested in having some kind of a confidence interval for our prediction.

6.2 Simple linear regression


6.2.1 Least squares estimator
Here we look at the simple linear regression y = β0 + β1 x + ε, although the
methods are also applicable to the general linear regression.
In order to estimate parameters β0 and β1 we could use the maximum
likelihood method by writing the likelihood of the random quantities yi and
maximizing it with respect to β0 and β1 . It turns out that for normally
distributed εi ∼ N (0, σ 2 ) this method gives the same estimates as a simpler
method described below.
This simpler method aims to minimize the deviation of the fitted values ŷi = β̂0 + β̂1 xi from the observed values yi, by a choice of the estimates β̂0 and β̂1. Specifically, the method of least squares aims to minimize the Sum of Squared Errors (SSE):

SSE := Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)² → min by a choice of β̂0, β̂1.   (6.3)

As usual, this minimization can be done by using the First Order Conditions.

Definition 6.2.1. The values of β̂0 and β̂1 which solve the problem (6.3) are called the (ordinary) Least Squares estimators of the simple regression model (6.1).

(The estimators are called ordinary LS estimators, because sometimes


in the definition of SSE the terms have different weights. In this case the
solution is called the weighted least squares estimators.)

It is a bit simpler to do it for a modified model, in which the explanatory variables are centered by subtracting their mean:

yi = α0 + β1 (xi − x̄) + εi.

Clearly, this model is equivalent to the original simple regression with β0 = α0 − β1 x̄. It is also clear that the least squares estimators in these regression problems are related by the similar equation: β̂0 = α̂0 − β̂1 x̄.

Theorem 6.2.2. The least squares estimators are given by the following formulas:

α̂0 = ȳ,
β̂1 = Sxy/Sxx,

where

Sxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ),
Sxx = Σ_{i=1}^n (xi − x̄)².

This implies that for our original problem, we also have the following least squares estimator for the parameter β0:

β̂0 = α̂0 − β̂1 x̄ = ȳ − β̂1 x̄.

Proof. We differentiate SSE with respect to α̂0:

∂SSE/∂α̂0 = ∂/∂α̂0 { Σ_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))]² }
          = 2 Σ_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))] · (−1).

• Set this to be 0. We have

0 = Σ_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))] = Σ_{i=1}^n yi − n α̂0 − β̂1 Σ_{i=1}^n (xi − x̄).

• Since Σ_{i=1}^n (xi − x̄) = 0 (we have centered them, remember?), we have 0 = Σ_{i=1}^n yi − n α̂0, so

α̂0 = ȳ.

Similarly,

∂SSE/∂β̂1 = ∂/∂β̂1 { Σ_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))]² }
          = 2 Σ_{i=1}^n [yi − (α̂0 + β̂1(xi − x̄))] · (−(xi − x̄)).

• Now plug in α̂0 = ȳ, which we just obtained, and set the whole thing to be 0:

0 = 2 Σ_{i=1}^n [(yi − ȳ) − β̂1(xi − x̄)] · (−(xi − x̄))
  = −2 { Σ_{i=1}^n (yi − ȳ)(xi − x̄) − β̂1 Σ_{i=1}^n (xi − x̄)² }.

• Thus

β̂1 = Σ_{i=1}^n (yi − ȳ)(xi − x̄) / Σ_{i=1}^n (xi − x̄)².
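These formulas are easy to verify numerically. Here is a hedged R sketch on simulated data (all names and the true parameter values are our own choices for illustration):

    set.seed(448)
    n <- 50
    x <- runif(n, 0, 10)
    y <- 2 + 0.5 * x + rnorm(n)   # simulated model with beta0 = 2, beta1 = 0.5
    Sxy <- sum((x - mean(x)) * (y - mean(y)))
    Sxx <- sum((x - mean(x))^2)
    beta1_hat <- Sxy / Sxx
    beta0_hat <- mean(y) - beta1_hat * mean(x)
    c(beta0_hat, beta1_hat)       # hand-computed LS estimates
    coef(lm(y ~ x))               # the built-in fit gives the same numbers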

6.2.2 Properties of LS estimator


We aim to calculate the expectation and the variance of the LS estimators β̂0 and β̂1. This information is important for calculating the bias of the estimators and for constructing confidence intervals. We start with β̂1, which is typically more useful in practice since β1 measures the effect of X on Y.

Theorem 6.2.3. Assume that the error terms in the simple linear regression
model yi = β0 + β1 xi + εi have the properties Eεi = 0 and Var(εi ) = σ 2 .
Then,

1. E β̂1 = β1.

2. Var(β̂1) = σ²/Sxx.

If, in addition, εi are normal, then βb1 is also normal.

Before proving this theorem, let us derive some consequences. First, we see that β̂1 is an unbiased estimator of β1. Second, if Sxx = Σ_{i=1}^n (xi − x̄)² → ∞ as n → ∞, then β̂1 is a consistent estimator of β1. The condition Sxx → ∞ as n → ∞ means that as the sample grows we continue to observe sufficient variation in the explanatory variables xi.

Proof. It is convenient to write the model in the form yi = α0 + β1(xi − x̄) + εi. Recall that the xi are not random. Also note that

Sxy = Σ (xi − x̄)(yi − ȳ) = Σ (xi − x̄) yi.

Then, we have

E β̂1 = E(Sxy/Sxx) = (1/Sxx) Σ (xi − x̄) E yi
     = (1/Sxx) Σ (xi − x̄)(α0 + β1(xi − x̄)) = (β1/Sxx) Σ (xi − x̄)²
     = β1,

which proves the first statement. Similarly, we calculate the variance. It is useful to note that Var(yi) = Var(εi) = σ².

Var(β̂1) = Var(Sxy/Sxx) = (1/Sxx²) Σ (xi − x̄)² Var(yi)
        = (1/Sxx²) Sxx σ² = σ²/Sxx.

Finally, if εi are normal, then yi are also normal. Note that β̂1 is a weighted sum of the yi and the coefficients in this sum are non-random. We know that this implies that the sum itself is normal. This shows that β̂1 is normal if εi are normal.

For other estimators we have similar results.

Theorem 6.2.4. Assume that the error terms in the simple linear regression model yi = α0 + β1(xi − x̄) + εi have the properties Eεi = 0 and Var(εi) = σ². Then,

1. E α̂0 = α0.

2. Var(α̂0) = σ²/n.

3. Cov(α̂0, β̂1) = 0.

If, in addition, εi are normal, then α̂0 is also normal.

Therefore, α̂0 is an unbiased and consistent estimator of α0.

Proof. For the expectation, we have:

E α̂0 = E ȳ = (1/n) Σ E yi = (1/n) Σ (α0 + β1(xi − x̄)) = α0.

For the variance,

Var(α̂0) = Var(ȳ) = (1/n²) Σ Var(yi) = (1/n²) n σ² = σ²/n.

For the covariance,

Cov(α̂0, β̂1) = Cov( (1/n) Σ yi, (1/Sxx) Σ (xi − x̄) yi )
            = (1/(n Sxx)) Σ (xi − x̄) Var(yi) = (σ²/(n Sxx)) Σ (xi − x̄) = 0.

Finally, if εi are normal, then yi are also normal, and since α̂0 is the average of the yi, α̂0 is also normal.

Since β̂0 = α̂0 − β̂1 x̄, we have a similar result for β0.

Theorem 6.2.5. Assume that the error terms in the simple linear regression model yi = β0 + β1 xi + εi have the properties Eεi = 0 and Var(εi) = σ². Then,

1. E β̂0 = β0.

2. Var(β̂0) = σ² ( 1/n + x̄²/Sxx ).

3. Cov(β̂0, β̂1) = −σ² x̄/Sxx.

If, in addition, εi are normal, then β̂0 is also normal.

This result can be obtained from the formula β̂0 = α̂0 − β̂1 x̄ and the previous theorems through an easy calculation; it is left as an exercise.

We are almost ready to write down the confidence intervals for the estimates β̂1, α̂0 and β̂0. However, the variances that we have just calculated include σ², which is not known and should be estimated. It turns out that our previous definition of S² as an estimator of σ² is inappropriate because the yi are no longer identically distributed. A suitable estimator is as follows:

σ̂² := (1/(n − 2)) SSE ≡ (1/(n − 2)) Σ_{i=1}^n (yi − ŷi)² ≡ (1/(n − 2)) Σ_{i=1}^n (yi − β̂0 − β̂1 xi)².

(This estimator is sometimes still denoted S 2 .)

Theorem 6.2.6. σ̂² := SSE/(n − 2) is an unbiased estimator of σ². If the error terms are normal, then σ̂² is independent from β̂1, α̂0 and β̂0, and (n − 2)σ̂²/σ² has the χ² distribution with n − 2 degrees of freedom.

The fact that we have n − 2 in the denominator, instead of n − 1 as in the previous definition of S², can be “explained” by noting that we now estimate two parameters instead of one and so lose two degrees of freedom. I omit the proof of this theorem.

6.2.3 Confidence intervals and hypothesis tests for coefficients

Once we know the variances of the parameters, it is easy to construct the confidence intervals. The procedure is essentially the same as what we did when we estimated the mean of a sample. For example, a large-sample two-sided confidence interval for the parameter β1 can be written as follows:

( β̂1 − z_{α/2} σ̂/√Sxx , β̂1 + z_{α/2} σ̂/√Sxx ),

where 1 − α is the confidence level.
If the sample is small but we assume that the error terms are normal, we can use our previous theorems to come to the conclusion that

(β̂1 − β1) / (σ̂/√Sxx)

is a pivotal quantity that has the t distribution with n − 2 degrees of freedom. In this case an appropriate confidence interval is

( β̂1 − t^{(n−2)}_{α/2} σ̂/√Sxx , β̂1 + t^{(n−2)}_{α/2} σ̂/√Sxx ).
Similarly, if the null hypothesis is β1 = β1^(0), we can form the test statistic

T = (β̂1 − β1^(0)) / (σ̂/√Sxx)

and use this test statistic to test the null hypothesis against various alternatives. If the sample is large (n > 30), then T is distributed as a standard normal random variable. If the sample is small, then we rely on the assumption that the εi have a normal distribution, and then T has the t distribution with df = n − 2.
Similar procedures can be easily established for the other parameters, that is, for α0 or β0. We only need to use the appropriate variance of the estimator instead of σ̂²/Sxx.
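In R, these intervals and tests are produced automatically by lm(). A hedged sketch on simulated data (names and values are ours):

    set.seed(1)
    x <- runif(30); y <- 1 + 2 * x + rnorm(30)   # hypothetical data
    fit <- lm(y ~ x)
    summary(fit)$coefficients    # estimates, standard errors, t statistics, p-values
    confint(fit, level = 0.95)   # t-based confidence intervals with n - 2 df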

6.2.4 Statistical inference for the regression mean
In applications we sometimes want to make some inferences about linear
combinations of parameters. In this section we study a particular example
of this problem. Suppose we want to build the confidence interval for the
regression mean of Y , when x is equal to a specific value x∗ :

E(Y |x∗ ) = β0 + β1 x∗ .

The natural estimator for this quantity is the predicted value:

yb∗ = βb0 + βb1 x∗ .

This estimator is unbiased, because β̂0 and β̂1 are unbiased estimators of β0 and β1. In order to build the confidence interval, we also need to calculate its variance. It is more convenient to use the other form of the regression for this task:

ŷ∗ = α̂0 + β̂1 (x∗ − x̄).

Then,

Var(ŷ∗) = Var(α̂0) + (x∗ − x̄)² Var(β̂1) + 2(x∗ − x̄) Cov(α̂0, β̂1)
        = σ² ( 1/n + (x∗ − x̄)²/Sxx ),

where we used our previous results about the variances and covariance of the estimators α̂0 and β̂1.
By using this information we can build confidence intervals for the regression mean. For example, if the sample size is large, then the two-sided confidence interval with significance level α is

ŷ∗ ± z_{α/2} σ̂ sqrt( 1/n + (x∗ − x̄)²/Sxx ),

where σ̂ = sqrt(SSE/(n − 2)) is the estimate of σ = sqrt(Var(εi)).

If the sample is small and the errors εi are normal, then we can use the t distribution with n − 2 degrees of freedom, and the confidence interval becomes:

ŷ∗ ± t^{(n−2)}_{α/2} σ̂ sqrt( 1/n + (x∗ − x̄)²/Sxx ).

And for testing the hypothesis H0 : y∗ = y0 we use the statistic:

T = (ŷ∗ − y0) / ( σ̂ sqrt( 1/n + (x∗ − x̄)²/Sxx ) ).

6.2.5 Prediction interval


When predicting Y we are often interested not in the variation of our prediction ŷ around the true regression mean but rather in the variation of the actual quantity y around the true regression mean. The random quantity y has larger variation than ŷ because, in addition to the uncertainty due to the error in parameter estimation, it also includes the variation due to the error terms εi.

We define the prediction interval with confidence level 1 − α as a (random) interval (L, U) such that

P(L ≤ y∗ ≤ U) = 1 − α.

Here L and U are some statistics, so they must be computable from data.
In order to construct the prediction interval we use the pivotal quantity technique and consider

T = (y∗ − ŷ∗) / SE(y∗ − ŷ∗),

where SE stands for “standard error”. Here y∗ is a new observation which we try to predict and ŷ∗ is the prediction.
Note that

y∗ − ŷ∗ = β0 + β1 x∗ + ε∗ − (β̂0 + β̂1 x∗) = (β0 − β̂0) + (β1 − β̂1) x∗ + ε∗.

Since β̂0 and β̂1 are unbiased estimators of β0 and β1, we see that this quantity has expectation 0.

Moreover, if εi are normal, then we see that y∗ − ŷ∗ is also normal. What is the standard error of y∗ − ŷ∗? Note that we have

Var(y∗ − ŷ∗) = Var(β0 + β1 x∗ + ε∗ − ŷ∗) = Var(ε∗) + Var(ŷ∗),

because the “new” error term ε∗ is uncorrelated with the prediction ŷ∗. Indeed, the coefficients β̂0 and β̂1 were estimated using the old error terms εi, and x∗ is not random.
We calculated the variance of ŷ∗ in the previous section, and so we have

Var(y∗ − ŷ∗) = σ² + σ² ( 1/n + (x∗ − x̄)²/Sxx ) = σ² ( 1 + 1/n + (x∗ − x̄)²/Sxx ).

It follows that

Z = (y∗ − ŷ∗) / ( σ sqrt( 1 + 1/n + (x∗ − x̄)²/Sxx ) )

has the standard normal distribution.


It can be shown that if we use the estimator σ̂ = sqrt(SSE/(n − 2)) instead of the unknown σ, then the quantity

T = (y∗ − ŷ∗) / ( σ̂ sqrt( 1 + 1/n + (x∗ − x̄)²/Sxx ) )

has the t distribution with n − 2 degrees of freedom.


So it follows that the prediction interval for y∗ can be written as

ŷ∗ ± t^{(n−2)}_{α/2} σ̂ sqrt( 1 + 1/n + (x∗ − x̄)²/Sxx ) = β̂0 + β̂1 x∗ ± t^{(n−2)}_{α/2} σ̂ sqrt( 1 + 1/n + (x∗ − x̄)²/Sxx ).

The interpretation is that with probability 1 − α the deviation of our prediction ŷ∗ from the actual realization of y∗ will be smaller than

t^{(n−2)}_{α/2} σ̂ sqrt( 1 + 1/n + (x∗ − x̄)²/Sxx ).
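In R, both the interval for the regression mean and the prediction interval are available through predict(). A hedged sketch with simulated data (names and values are ours):

    set.seed(2)
    x <- runif(40, 0, 10); y <- 1 + 0.3 * x + rnorm(40)
    fit <- lm(y ~ x)
    new_point <- data.frame(x = 5)
    predict(fit, new_point, interval = "confidence")   # CI for the regression mean
    predict(fit, new_point, interval = "prediction")   # wider interval for a new y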

6.2.6 Correlation and R-squared
Sometimes, xi can be interpreted as observed values of some random quan-
tity X. That is, we have n observations (xi , yi ) sampled from the joint
distribution of the random quantities X and Y . In this case, the coefficient
β1 in the regression yi = β0 + β1 xi + εi can be interpreted as a measure of
dependence between Y and X.
On the other hand, we know that another measure of dependence between Y and X is the correlation coefficient:

ρ = Cov(X, Y) / sqrt( Var(X) Var(Y) ),

and we can estimate it as

R = Sxy / sqrt( Sxx Syy ).

Since β̂1 = Sxy/Sxx, we have the following relation between the estimates of the correlation coefficient ρ and the linear regression parameter β1:

R = β̂1 sqrt( Sxx/Syy ).

So there is a clear relationship between these two measures of association.
The statistic R² (called R-squared) has another useful interpretation, which will later be generalized to the multiple linear regression model. Namely, it measures the goodness of fit in the simple linear regression.
Indeed, it is possible to derive the following useful formula:

SSE := Σ_i (yi − ŷi)² = Σ_i (yi − ȳ − β̂1(xi − x̄))²
     = Σ_i (yi − ȳ)² − 2β̂1 Σ_i (yi − ȳ)(xi − x̄) + β̂1² Σ_i (xi − x̄)²
     = Syy − β̂1² Sxx = Syy − Sxy²/Sxx.

Now, Syy = Σ_i (yi − ȳ)² can be thought of as the variation in the response variable if no explanatory variable is used, and SSE is the variation in the response after the explanatory variable is brought in. So the difference is the reduction in the variation due to the explanatory variable X.

In particular,

R² = Sxy²/(Sxx Syy)

is this reduction measured in percentage terms.


To summarize, R2 is the proportion of response variable variation that
is explained by the explanatory variable.

6.3 Multiple linear regression


6.3.1 Estimation
A more general version of linear regression reads:

y = β0 + β1 x^(1) + . . . + βp x^(p) + ε,

where we have p explanatory variables. In fact this stands for n separate equations, one for each observation:

yi = β0 + β1 xi^(1) + . . . + βp xi^(p) + εi,

where i = 1, . . . , n.
In full glory, we have a big system of equations:

y1 = β0 + β1 x11 + β2 x12 + · · · + βp x1p + ε1


y2 = β0 + β1 x21 + β2 x22 + · · · + βp x2p + ε2
..
.
yi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + εi
..
.
yn = β0 + β1 xn1 + β2 xn2 + · · · + βp xnp + εn

In order to write down this system more compactly, we use matrix notation. We have p + 1 coefficients (1 for the intercept term and p for the independent variables) and n observations. Each observation is represented by yi and xi := [1, xi1, xi2, . . . , xip]. (We distinguish between column and row vectors; the vector xi here is a row vector.) Now stack all the observations together to form a vector of responses and a matrix of explanatory variables:

     
y = [y1, y2, . . . , yn]^T,

X = [ 1  x11  x12  . . .  x1p
      1  x21  x22  . . .  x2p
      .   .    .   . . .   .
      1  xn1  xn2  . . .  xnp ],

so that the i-th row of X is xi, and X has p + 1 columns.

Also define the (column) vectors of coefficients and error terms:

β = [β0, β1, . . . , βp]^T,   ε = [ε1, ε2, . . . , εn]^T.

(Here the superscript T is the notation for the transposition operation. It indicates that a vector which we wrote as a row vector is in fact a column vector: for example, [β0, β1, . . . , βp]^T is the column vector with entries β0, . . . , βp. The transposition can also be defined for matrices and has a useful property: (AB)^T = B^T A^T.)
Then we can rewrite the model as

y = Xβ + ε.

The sum of squared errors can also be written very simply in matrix notation:

SSE(β̂) = Σ_{i=1}^n { yi − (β̂0 + β̂1 xi1 + β̂2 xi2 + · · · + β̂p xip) }²
       = Σ_{i=1}^n { yi − xi β̂ }²
       = [y − Xβ̂]^T [y − Xβ̂].

To minimize SSE(β̂), we need to write the first order conditions, which can also be written in matrix form. Namely, for each j = 0, 1, . . . , p we have:

∂SSE(β̂)/∂β̂j = −2 Σ_{i=1}^n xij { yi − (β̂0 + β̂1 xi1 + β̂2 xi2 + · · · + β̂p xip) } = 0

(with xi0 := 1 for the intercept term). If we stack these p + 1 equations together, we obtain the matrix form of this system of equations:

∂SSE(β̂)/∂β̂ = −2X^T [y − Xβ̂] = −2X^T y + 2X^T X β̂ = 0.

Or, re-arranging the terms and simplifying:

X^T X β̂ = X^T y.

This system of p + 1 equations in p + 1 unknowns β̂i is called the normal equations. In matrix form, its solution can be written as

β̂_LS = (X^T X)^{−1} X^T y.
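A hedged R sketch of the normal equations on simulated data (all names and values are ours; in practice one would simply call lm):

    set.seed(3)
    n <- 100
    X <- cbind(1, rnorm(n), rnorm(n))             # design matrix with intercept column
    beta <- c(1, 2, -3)
    y <- X %*% beta + rnorm(n)
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # solves X^T X beta = X^T y
    cbind(beta_hat, coef(lm(y ~ X[, -1])))        # matches the lm() coefficients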

6.3.2 Properties of least squares estimators


We are interested in learning about the expectations and variances of the estimated parameters β̂i, i = 0, . . . , p. It is convenient to express the results using the language of random vectors and matrices. In particular, if ξ = [ξ1, . . . , ξp] is a vector of random quantities, then its expectation is the vector of expectations: Eξ = [Eξ1, . . . , Eξp].
We can also define the (variance-)covariance matrix of a vector ξ ∈ R^p with p components as the p × p matrix whose ij-th element is the covariance between the i-th and j-th elements of ξ, i.e.

V(ξ) = [ Cov(ξ1, ξ1)  Cov(ξ1, ξ2)  . . .  Cov(ξ1, ξp)
         Cov(ξ2, ξ1)  Cov(ξ2, ξ2)  . . .  Cov(ξ2, ξp)
         . . .
         Cov(ξp, ξ1)  Cov(ξp, ξ2)  . . .  Cov(ξp, ξp) ].

(For example, V(ε) = σ² In, where In is the n × n identity matrix, which is a diagonal matrix with 1’s on the diagonal.)
Note that the expectation can also be defined for matrices, as the matrix of expectations of the entries. In this case, if the ξi all have zero mean, then V(ξ) = E(ξξ^T). More generally,

V(ξ) = E( (ξ − Eξ)(ξ − Eξ)^T ).   (6.4)

We also have the following simple rules. If A is an m × p (non-random) matrix, then we can calculate the new vector Aξ. For this vector, the expectation and variance can be calculated as follows:

E(Aξ) = A(Eξ).   (6.5)

(This is a consequence of the linearity of the expectation.) And from formula (6.4) it is easy to get the equality:

V(Aξ) = A V(ξ) A^T.   (6.6)

Now let us apply these properties to the least squares estimator β̂. Recall that β̂ is the (p + 1)-vector [β̂0, β̂1, . . . , β̂p]^T and that

β̂ = (X^T X)^{−1} X^T y.

Theorem 6.3.1. The least squares estimator of β is unbiased:

E β̂ = β.

Its variance matrix is the following (p + 1) × (p + 1) matrix:

V(β̂) = σ² (X^T X)^{−1}.
Proof. We know that Ey = Xβ and V(y) = V(ε) = σ² In. Then, we can apply rules (6.5) and (6.6) and get:

E β̂ = (X^T X)^{−1} X^T E y = (X^T X)^{−1} X^T X β = β,

and

V(β̂) = (X^T X)^{−1} X^T V(y) [(X^T X)^{−1} X^T]^T
     = (X^T X)^{−1} X^T σ² In X (X^T X)^{−1}
     = σ² (X^T X)^{−1},

after cancellations.

In addition, if the εi are normal, εi ∼ N(0, σ²), then it can be shown that β̂ is multivariate normal with mean β and variance σ²(X^T X)^{−1}.

Now it is clear how to build confidence intervals and test hypotheses for the parameters βi. We simply notice that

Var(β̂i) = σ² cii,

where cii is the i-th element on the main diagonal of the matrix (X^T X)^{−1}:

cii = [ (X^T X)^{−1} ]_{ii}.   (6.7)

So if σ² is known, then the confidence interval for βi is

β̂i ± z_{α/2} σ √cii.

In practice, σ² is not known and has to be estimated from the data. We can do it using SSE, which is defined similarly to the case of the simple linear regression:

SSE = Σ_i (yi − ŷi)²,

where ŷi are the fitted values for the response variable:

ŷ = Xβ̂ = X(X^T X)^{−1} X^T y = Xβ + X(X^T X)^{−1} X^T ε.
Theorem 6.3.2. σ̂² := SSE/(n − p − 1) is an unbiased estimator of σ².

Proof. (This proof is optional.)

Recall that for any vector a = [a1, . . . , an], we have the notation ∥a∥² := Σ_{i=1}^n ai². We can also write this sum as a^T a. Since y = Xβ + ε, we can do the following calculation for SSE:

SSE = Σ_{i=1}^n (yi − ŷi)² = ∥Xβ + ε − (Xβ + X(X^T X)^{−1} X^T ε)∥²
    = ∥(In − X(X^T X)^{−1} X^T) ε∥²
    = [ (In − X(X^T X)^{−1} X^T) ε ]^T [ (In − X(X^T X)^{−1} X^T) ε ]
    = ε^T (In − X(X^T X)^{−1} X^T)² ε.

A calculation gives (In − X(X^T X)^{−1} X^T)² = In − X(X^T X)^{−1} X^T.

Now, to move further, we need a fact from linear algebra. Recall that the trace of a matrix A is defined as the sum of its diagonal entries: tr(A) = Σ_i aii. An important property of the trace is that tr(AB) = tr(BA). (It is easy to prove this directly from the definition of the trace.) In our case ε^T (In − X(X^T X)^{−1} X^T) ε is a scalar (a 1 × 1 matrix), so it is equal to its trace. Hence, we can use this property of the trace and write:

ε^T (In − X(X^T X)^{−1} X^T) ε = tr[ (In − X(X^T X)^{−1} X^T) εε^T ].

Next, we use the fact that E(εε^T) = V(ε) = σ² In, and the property that taking expectations and taking the trace can be performed in any order, and get:

E(SSE) = tr[ (In − X(X^T X)^{−1} X^T) E(εε^T) ]
       = σ² tr[ (In − X(X^T X)^{−1} X^T) In ] = σ² ( tr(In) − tr( X(X^T X)^{−1} X^T ) )
       = σ² ( n − tr( X(X^T X)^{−1} X^T ) ).

Now, in order to calculate tr( X(X^T X)^{−1} X^T ), we use the property of the trace once more and write:

tr( X(X^T X)^{−1} X^T ) = tr( (X^T X)^{−1} X^T X ) = tr(I_{p+1}) = p + 1.

So,

E(SSE) = σ 2 (n − p − 1)

and SSE/(n − p − 1) is an unbiased estimator of σ 2 .

Moreover, if the εi are independent normal random variables, εi ∼ N(0, σ²), then it turns out that SSE/σ² has the χ² distribution with n − p − 1 degrees of freedom, and that this random variable is independent from β̂. It follows that

(β̂i − βi) / (σ̂ √cii)

has the t distribution with n − p − 1 degrees of freedom. (Here cii is as defined in (6.7).) Therefore, in this case the confidence interval is

β̂i ± t^{(n−p−1)}_{α/2} σ̂ √cii.

6.3.3 Confidence interval for linear functions of parameters


Sometimes, we want to test a hypothesis about a linear function of the parameters. For example, if we have a regression model

y = β0 + β1 x^(1) + β2 x^(2) + β3 x^(3) + ε,

then we might want to test the hypothesis that β1 = β2. This is equivalent to the hypothesis that β1 − β2 = 0. We can approach it by asking what is the confidence interval for β1 − β2. This confidence interval should be centered on the estimate β̂1 − β̂2, and the question is what is the standard error of this estimate.
More generally, if we have p parameters, we might be interested in finding the confidence interval for a linear combination a0β0 + a1β1 + . . . + apβp, which can be written in matrix notation as a^T β, where a is the column vector [a0, . . . , ap]^T. (In our previous example a = [0, 1, −1, 0]^T.) The confidence interval should be centered on a^T β̂, and the main question is about the standard error of this estimator. It turns out that it is easy to calculate by using matrices.
Indeed, by using one of the theorems from the probability theory course, we have:

Var( Σ_{i=0}^p ai β̂i ) = Σ_{i,j} ai aj Cov(β̂i, β̂j).

In matrix notation, this can be written as a very concise formula:

Var(a^T β̂) = a^T V(β̂) a = σ² a^T (X^T X)^{−1} a,

where we used the formula for the variance-covariance matrix of the estimator β̂. It follows that the confidence interval for a^T β can be written as

a^T β̂ ± z_{α/2} σ sqrt( a^T (X^T X)^{−1} a ).

Since σ is unknown, we substitute it with its estimator. In the case of normal errors, this gives the following confidence interval:

a^T β̂ ± t^{(n−p−1)}_{α/2} σ̂ sqrt( a^T (X^T X)^{−1} a ).

These calculations are useful, for example, if we want to calculate the confidence interval for the regression mean ŷ∗ = β̂0 + β̂1 x∗^(1) + . . . + β̂p x∗^(p) for a new observation (x∗^(1), . . . , x∗^(p)). In this case we use the formulas we derived above by setting the vector a = [1, x∗^(1), . . . , x∗^(p)]^T.

6.3.4 Prediction
Suppose we obtained a new observation with predictors (i.e., explanatory variables) x∗^(1), x∗^(2), . . . , x∗^(p), and we want to predict the response variable y∗. The natural predictor is

ŷ∗ = β̂0 + β̂1 x∗^(1) + . . . + β̂p x∗^(p) = x∗^T β̂,

where x∗ is the column vector [1, x∗^(1), . . . , x∗^(p)]^T.

The expected value of this predictor equals the regression mean Ey∗:

E ŷ∗ = x∗^T E β̂ = x∗^T β.

Let us define the prediction error as the difference between the prediction and the actual realization of the response variable,

e∗ = y∗ − ŷ∗.

Then the expected value of the error is zero, and it is easy to compute its variance by using the results from the previous section:

Var(e∗) = Var(y∗ − ŷ∗) = Var(x∗^T β + ε∗ − x∗^T β̂)
        = Var(ε∗) + Var(x∗^T β̂)
        = σ² + σ² x∗^T (X^T X)^{−1} x∗.
This allows us to write the prediction interval:

x∗^T β̂ ± t^{(n−p−1)}_{α/2} σ̂ sqrt( 1 + x∗^T (X^T X)^{−1} x∗ ).

6.4 Goodness of fit and a test for a reduced model


Recall that we defined the R² statistic earlier as

R² = (SST − SSE)/SST,

where

SST = Σ_i (yi − ȳ)², and SSE = Σ_i (yi − ŷi)².

Essentially, in this calculation we compare a given regression model with a model in which only a constant is allowed. The prediction of this reference model is always ȳ.

More generally, we can compare two models, a reduced one (often also called restricted) and a complete one (often called full or unrestricted):

y = β0 + β1 x^(1) + . . . + βp x^(p) + ε,
y = β0 + β1 x^(1) + . . . + βp x^(p) + βp+1 x^(p+1) + . . . + βp+q x^(p+q) + ε.
In the complete model, we have q additional predictors. Then it is reasonable to define a statistic that compares these two models. We could define it as

(SSER − SSEC)/SSER,

where SSER and SSEC are the sums of the squared errors computed, respectively, for the reduced and complete models. This would be in complete analogy with the definition of R² above. However, traditionally another form of the statistic is preferred, namely:

F = [ (SSER − SSEC)/q ] / [ SSEC/(n − p − q − 1) ],

for the reason that it is useful for testing. Indeed, under the null hypothesis that the reduced model is correct, this statistic is distributed according to the Fisher distribution with q and n − p − q − 1 degrees of freedom.

In particular, the null hypothesis can be rejected at significance level α if F > Fα. Intuitively, the reduction in the size of the errors, as measured by SSER − SSEC, is then too large to be explained by pure chance.

Note, however, that the statement that F follows the Fisher distribution assumes that the errors are normal, and it is quite sensitive to this assumption. (In other words, if the errors are not normal, it can happen that the probability of a type I error is different from α; in particular, it can happen that we reject the null hypothesis too often.)
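In R, this F test for nested models is what anova() reports when given two fitted models. A hedged sketch on simulated data (names and values are ours):

    set.seed(4)
    n <- 60
    x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
    y <- 1 + 2 * x1 + rnorm(n)            # x2 and x3 are actually irrelevant here
    fit_reduced  <- lm(y ~ x1)
    fit_complete <- lm(y ~ x1 + x2 + x3)
    anova(fit_reduced, fit_complete)      # reports F with q = 2 and n - p - q - 1 df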

Chapter 7

Categorical data

7.1 Experiment
Here we discuss an experiment in which there is a finite number of outcomes. So we have n observations, and each observation Yi, i = 1, . . . , n, can belong to one of k possible categories.

You should recognize here the multinomial experiment with n trials and k possible outcomes. We assume here that there are no predictors xi. In this case, the result of this experiment can be conveniently summarized by a list of counts. So nj is the number of the yi that happened to be in the j-th category. Clearly we have n1 + . . . + nk = n.
The model that we use assumes that all yi are independent, so the counts nj, j = 1, . . . , k follow the multinomial distribution:

P(X1 = n1, . . . , Xk = nk) = [ n!/(n1! n2! . . . nk!) ] p1^{n1} p2^{n2} . . . pk^{nk},

where p1, . . . , pk are unknown parameters, and we used the notation Xj to denote the random number of items in category j.
We are interested in statistical inferences about these parameters. For example, we might be interested in whether they are equal to some specific values. The exact inference is somewhat complicated here, and we are going to do the asymptotic inference, which gives a good approximation if n is sufficiently large. This can be done by using the Pearson statistic and the χ² goodness of fit test.

7.2 Pearson’s χ2 test


Our goal is to test that the probabilities of the categories are p1, . . . , pk. The key is the following statistic:

T = Σ_{j=1}^k (Xj − EXj)²/EXj = Σ_{j=1}^k (nj − n pj)²/(n pj).

Pearson proved that if the null hypothesis is correct (that is, if the data
indeed came from the multinomial distribution with the specified probabili-
ties), and if n is large, then this statistic is approximately distributed as χ2
random variable with k − 1 degrees of freedom.
Usually the one-sided test is used, so that the null hypothesis is rejected if T > χ²_{α, k−1}. Alternatively, one can calculate the p-value as

p-value ≈ P( χ²_{k−1} ≥ T )

and compare it with α.


A usual rule of thumb is that the approximation is good if npj > 1 for
each n and that at least for 80% of j this product is greater than 5.
A computationally easier formula can be obtained by expanding the
squares.

X
k
n2j
T = −n
npj
j=1

Example 7.2.1 (From Ross). If a famous person is in poor health and dying,
then perhaps anticipating his birthday would “cheer him up and therefore
improve his health and possibly decrease the chance that he will die shortly
before his birthday.” The data might therefore reveal that a famous person
is less likely to die in the months before his or her birthday and more likely
to die in the months afterward.

To test this, a sample of 1,251 (deceased) Americans was randomly chosen from Who Was Who in America, and their birth and death days were noted. (The data are taken from D. Phillips, “Death Day and Birthday: An Unexpected Connection,” in Statistics: A Guide to the Unknown, Holden-Day, 1972.)

We choose the following categories for the month of death relative to the birthday: outcome 1 = “6, 5, 4 months before the birthday”, outcome 2 = “3, 2, 1 months before”, outcome 3 = “0, 1, 2 months after”, outcome 4 = “3, 4, 5 months after”.

Outcome    Number of times occurring
1          277
2          283
3          358
4          333
The null hypothesis is that all these outcomes have equal probabilities
pi = 1/4 for all i = 1, . . . , 4. So we calculate the Pearson statistic:

$$T = \frac{277^2 + 283^2 + 358^2 + 333^2}{1251/4} - 1251 = 14.775.$$

The corresponding p-value is

$$p\text{-value} \approx P(\xi_3^2 \ge 14.775) = 0.002,$$

so it appears that the null hypothesis can be rejected.
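The same test can be run in R with the built-in chisq.test function (equal cell probabilities are in fact its default, so the p argument could be omitted):

    counts <- c(277, 283, 358, 333)
    chisq.test(counts, p = rep(1/4, 4))
    # X-squared = 14.775, df = 3, p-value = 0.002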

7.3 Goodness-of-fit tests when parameters are unspecified
The Pearson test can be applied to test whether the data came from a specified family of distributions. For example, we might ask whether the data came from a Poisson, geometric, or some other more complicated distribution.
In this situation, the true values of pi depend on other parameters, which
must be estimated from data. Somewhat surprisingly, the Pearson test can
still be applied if the parameters are estimated using the maximum likelihood

method. Namely, if the pi depend on m parameters and these were estimated by maximum likelihood, then the statistic

$$T = \sum_{j=1}^{k} \frac{(n_j - n\hat{p}_j)^2}{n \hat{p}_j}$$

is approximately distributed according to the χ2 distribution with k − 1 − m degrees of freedom.


Example 7.3.1. Suppose we have data on the number of accidents at an industrial plant over a 30-week period:

    8 0 0 1 3 4 0 2 12 5
    1 8 0 2 0 1 9 3 4 5
    3 3 4 7 4 0 1 2 1 2

We want to test the hypothesis that the number of accidents follows the Poisson distribution.
We can divide the data into 5 categories: the outcome of the number of
accidents in a given week is in category 1 if there are 0 accidents, in category
2 if there is 1 accident, in category 3 if there are 2 or 3 accidents, in category
4 if there are 4 or 5 accidents, and in category 5 if there are more than 5
accidents. Then, if the parameter λ of the Poisson distribution were known,
we could calculate the probability of each category:

$$\begin{aligned}
p_1 &= P(Y = 0) = e^{-\lambda}, \\
p_2 &= P(Y = 1) = \lambda e^{-\lambda}, \\
p_3 &= P(Y = 2) + P(Y = 3) = \frac{\lambda^2 e^{-\lambda}}{2} + \frac{\lambda^3 e^{-\lambda}}{6}, \\
p_4 &= P(Y = 4) + P(Y = 5) = \frac{\lambda^4 e^{-\lambda}}{24} + \frac{\lambda^5 e^{-\lambda}}{120}, \\
p_5 &= P(Y > 5) = 1 - e^{-\lambda} - \lambda e^{-\lambda} - \frac{\lambda^2 e^{-\lambda}}{2} - \frac{\lambda^3 e^{-\lambda}}{6} - \frac{\lambda^4 e^{-\lambda}}{24} - \frac{\lambda^5 e^{-\lambda}}{120}.
\end{aligned}$$
However, since λ is unknown, it must be estimated from the data. The maximum likelihood estimator here is the same as the method of moments estimator:

$$\hat{\lambda} = \bar{Y} = \frac{95}{30} = 3.16667.$$

Then a computation gives

$$\hat{p}_1 = 0.04214, \quad \hat{p}_2 = 0.13346, \quad \hat{p}_3 = 0.43434, \quad \hat{p}_4 = 0.28841, \quad \hat{p}_5 = 0.10164.$$

The category counts are n1 = 6, n2 = 5, n3 = 8, n4 = 6, n5 = 5, and we can calculate the statistic value as

$$T = \sum_{j=1}^{k} \frac{(n_j - 30\hat{p}_j)^2}{30 \hat{p}_j} = 21.99156.$$

The number of degrees of freedom is k − m − 1 = 5 − 1 − 1 = 3. Then the p-value is

$$p\text{-value} \approx P(\xi_3^2 \ge 21.99) = 0.000064,$$

and the hypothesis that the distribution is Poisson can be rejected.
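Here is a sketch of this computation in base R; note that chisq.test would not adjust the degrees of freedom for the estimated parameter, so we compute the statistic by hand:

    accidents <- c(8, 0, 0, 1, 3, 4, 0, 2, 12, 5,
                   1, 8, 0, 2, 0, 1, 9, 3, 4, 5,
                   3, 3, 4, 7, 4, 0, 1, 2, 1, 2)
    n <- length(accidents)              # 30 weeks
    lambda_hat <- mean(accidents)       # MLE, 3.16667
    # Estimated category probabilities under the Poisson model
    p_hat <- c(dpois(0, lambda_hat),
               dpois(1, lambda_hat),
               sum(dpois(2:3, lambda_hat)),
               sum(dpois(4:5, lambda_hat)),
               ppois(5, lambda_hat, lower.tail = FALSE))
    # Observed category counts
    n_j <- c(sum(accidents == 0), sum(accidents == 1),
             sum(accidents %in% 2:3), sum(accidents %in% 4:5),
             sum(accidents > 5))
    T_stat <- sum((n_j - n * p_hat)^2 / (n * p_hat))    # about 21.99
    pchisq(T_stat, df = 5 - 1 - 1, lower.tail = FALSE)  # about 0.000064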

7.4 Independence test for contingency tables


In this section we will talk about another application of the Pearson test.
We suppose now that every observation in the data has 2 characteristics,
X and Y . Both are categorical, so X can take one of r values, which we
code as i = 1, . . . , r and Y can take one of s values, j = 1, . . . , s.
We want to check if Y depends on X. More specifically, we want to test
the hypothesis that the random variables Y and X are independent.
We denote the joint pmf of X and Y as pij :

$$p_{ij} := P[X = i, Y = j].$$

Then the marginal pmfs of X and Y are

$$p_i := P[X = i] = \sum_{j=1}^{s} p_{ij}, \qquad q_j := P[Y = j] = \sum_{i=1}^{r} p_{ij},$$

and the null hypothesis states that

$$p_{ij} = p_i q_j,$$

for all i and j.


Now assume that the data are also presented in the form of counts. Namely, let nij be the number of observations that have X-characteristic equal to i and Y-characteristic equal to j. (Usually these numbers are organized as a table, which is called the contingency table.)
We can also define ni as the number of observations with X-characteristic
equal to i and mj as the number of observations with Y -characteristic equal
to j. Clearly,

$$n_i := \sum_{j=1}^{s} n_{ij}, \qquad m_j := \sum_{i=1}^{r} n_{ij}.$$

If we knew pi and qj , then we could write the Pearson statistic as

$$T = \sum_{i,j} \frac{(n_{ij} - n p_i q_j)^2}{n p_i q_j}.$$

However, since we don’t know these parameters, we need to estimate them from the data. Natural estimates are

$$\hat{p}_i = \frac{n_i}{n} \quad \text{and} \quad \hat{q}_j = \frac{m_j}{n},$$

and then we can write the Pearson statistic as

$$T = \sum_{i,j} \frac{(n_{ij} - n \hat{p}_i \hat{q}_j)^2}{n \hat{p}_i \hat{q}_j} = \sum_{i,j} \frac{n_{ij}^2}{n_i m_j / n} - n.$$

If n is large, then the distribution of T is approximately the χ2 distribution. As usual, the number of degrees of freedom equals the number of categories minus 1, that is, rs − 1, reduced by the number of estimations performed. A calculation shows that the number of estimated parameters is (r − 1) + (s − 1) = r + s − 2, and so the number of degrees of freedom is

df = rs − 1 − (r + s − 2) = (r − 1)(s − 1).

So, for a given significance level α, the null hypothesis should be rejected if

T > χ2α,(r−1)(s−1) .

The p-value can be calculated as the probability

$$p\text{-value} = P(\chi > T),$$

where χ is a random variable that has the χ2 distribution with (r − 1)(s − 1) degrees of freedom. In R, this can be computed as

    pchisq(T, df = (r - 1) * (s - 1), lower.tail = FALSE)

Example 7.4.1. A sample of 300 people was randomly chosen, and the sam-
pled individuals were classified as to their gender and political affiliation,
Democrat, Republican, or Independent. The following table displays the
resulting data.
             Democrat   Republican   Independent   Total
    Women      68          56           32          156
    Men        52          72           20          144
    Total     120         128           52          300
The null hypothesis is that a randomly chosen individual’s gender and
political affiliation are independent.

We calculate:

$$\begin{aligned}
n\hat{p}_1 \hat{q}_1 &= \frac{n_1 m_1}{n} = \frac{156 \times 120}{300} = 62.40, &
n\hat{p}_1 \hat{q}_2 &= \frac{n_1 m_2}{n} = \frac{156 \times 128}{300} = 66.56, \\
n\hat{p}_1 \hat{q}_3 &= \frac{n_1 m_3}{n} = \frac{156 \times 52}{300} = 27.04, &
n\hat{p}_2 \hat{q}_1 &= \frac{n_2 m_1}{n} = \frac{144 \times 120}{300} = 57.60, \\
n\hat{p}_2 \hat{q}_2 &= \frac{n_2 m_2}{n} = \frac{144 \times 128}{300} = 61.44, &
n\hat{p}_2 \hat{q}_3 &= \frac{n_2 m_3}{n} = \frac{144 \times 52}{300} = 24.96.
\end{aligned}$$
Therefore, the test statistic is

$$T = \frac{68^2}{62.40} + \frac{56^2}{66.56} + \frac{32^2}{27.04} + \frac{52^2}{57.60} + \frac{72^2}{61.44} + \frac{20^2}{24.96} - 300 = 6.432857.$$
The number of degrees of freedom is (2 − 1)(3 − 1) = 2, so we look at the χ2 distribution with 2 degrees of freedom. At α = 0.05 we have

$$\chi^2_{0.05, 2} = 5.991,$$

so we can reject the null hypothesis that gender and political affiliation are independent. We can calculate the p-value as

    pchisq(6.433, df = 2, lower.tail = FALSE)   # 0.04

so at α = 0.01 we cannot reject this hypothesis.


In R, if we have two variables X and Y as factors in a data frame “data”, then one can build the contingency table using the function table():

    tb <- table(data$X, data$Y)

and the Pearson test of independence can be performed using the function chisq.test():

    chisq.test(tb)
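For instance, a sketch reproducing Example 7.4.1, entering the table directly as a matrix instead of building it from raw data:

    tb <- matrix(c(68, 56, 32,
                   52, 72, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Women", "Men"),
                                 Party  = c("Democrat", "Republican", "Independent")))
    chisq.test(tb)
    # X-squared = 6.4329, df = 2, p-value = 0.04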

Chapter 8

Bayesian Inference

8.1 Estimation
The Bayesian inference is a collection of statistical methods based on a different statistical philosophy. The statistical model still consists of observations X1 , . . . , Xn , which are random with a distribution f (X|θ) that depends on a vector of parameters θ. However, while classical statistics treats the parameters as fixed and unknown quantities, Bayesian statistics models researchers’ beliefs about the parameters by using probability theory. This adds a second layer of randomness: now the parameters θ of the data-generating distribution f (X|θ) have their own probability distributions, which model our beliefs about them.
In fact, the parameters are treated as random variables that have two
probability distributions: before and after the data is observed. Their distri-
bution before the data is observed is described by a prior distribution with
density (or mass) function p(θ). The posterior distribution is the distribu-
tion of parameters after the data is observed. It captures our beliefs after
they were modified by the observed data. The density (or mass) function
of the posterior distribution is the conditional density p(θ|x1 , . . . , xn ). It
can be calculated from the prior distribution and the data by using Bayes’
formula:

$$p(\theta|x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n|\theta)\, p(\theta)}{\int f(x_1, \ldots, x_n|\varphi)\, p(\varphi)\, d\varphi}.$$
The integral in the denominator is a normalizing constant – it does not
depend on θ. Often, it is not written explicitly, and the formula is written
as

p(θ|x1 , . . . , xn ) ∝ f (x1 , . . . , xn |θ)p(θ).

The symbol ∝ means “proportional to”.


Example 8.1.1. A biased coin is tossed n times. Let Xi be 1 or 0 according to whether the i-th toss is a head or not. The probability of a head is θ. Suppose we have no
idea how biased the coin is, and we place a uniform prior distribution on θ,
to give a so-called “non-informative prior” of

p(θ) = 1, 0 ≤ θ ≤ 1.

Let t be the number of heads. Then, the posterior distribution of θ is

$$p(\theta|x_1, \ldots, x_n) \propto \theta^t (1 - \theta)^{n-t}.$$

By inspection we realize that, once the appropriate normalizing constant is supplied on the right-hand side, this is the density of the Beta distribution with parameters (t + 1, n − t + 1). This is the posterior distribution of θ given x.
After a bit of reflection, we realize that if we start with a prior distribution which is the Beta distribution with parameters α1 , α2 , that is, if p(θ) ∝ θα1 −1 (1 − θ)α2 −1 , then the posterior distribution is

$$p(\theta|x_1, \ldots, x_n) \propto \theta^{t+\alpha_1-1} (1 - \theta)^{n-t+\alpha_2-1},$$

that is, it is still the Beta distribution but with updated parameters t + α1 and n − t + α2 . Note that here α1 and α2 are parameters for the prior distribution of the parameter θ! Sometimes they are called hyper-parameters.
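A short R sketch of this conjugate update (the prior hyper-parameters and data below are illustrative, not taken from the text):

    alpha1 <- 2; alpha2 <- 2    # illustrative prior hyper-parameters
    n <- 10; t <- 7             # illustrative data: 7 heads in 10 tosses
    # The posterior is Beta(t + alpha1, n - t + alpha2) = Beta(9, 5)
    post_mean <- (t + alpha1) / (n + alpha1 + alpha2)   # 9/14, about 0.643
    curve(dbeta(x, t + alpha1, n - t + alpha2), from = 0, to = 1,
          xlab = expression(theta), ylab = "posterior density")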
Definition 8.1.2. Let f (x|θ) be the distribution of data point x given the
parameter θ. Let p(θ) be a prior distribution for the parameter. If the
posterior distribution p(θ|x1 , . . . , xn ) has the same functional form as the
prior but with altered parameter values, then the prior p(θ) is said to be
conjugate to the distribution f (x|θ).

The conjugate priors are very convenient in modeling and are often used
in practice. Here is another example.
Exercise 8.1.3. Suppose X1 , . . . , Xn are normally distributed Xi ∼ N (µ, σ)
with unknown parameter µ and known σ = 1. Let the prior distribution for
µ be N (0, τ −2 ) for known τ . Then the posterior distribution for µ is

$$N\left( \frac{\sum_{i=1}^{n} x_i}{n + \tau^2},\ \frac{1}{n + \tau^2} \right).$$
In many practical applications we need a point estimator and not a posterior distribution of the parameter. Bayesian statistics addresses this need with the concept of the loss function. A loss function L(θ, a) is a measure of the loss incurred by estimating the value of the parameter to be a when its true value is θ. The estimator $\hat\theta$ is chosen to minimize the expected loss $E L(\theta, \hat\theta)$, where the expectation is taken over θ with respect to the posterior distribution p(θ|x1 , . . . , xn ):

$$\hat{\theta} = \arg\min_a E\, L(\theta, a).$$

Theorem 8.1.4. (a) Suppose that the loss function is quadratic in the error: L(θ, a) = (θ − a)2 . Then the expected loss is minimized by taking $\hat\theta$ to be the mean of the posterior distribution:

$$\hat{\theta} = \int \theta\, p(\theta|x_1, \ldots, x_n)\, d\theta.$$

(b) Suppose the loss function is the absolute value of the error: L(θ, a) = |θ − a|. Then the expected loss is minimized by taking $\hat\theta$ to be the median of the posterior distribution.

Example 8.1.5 (Coin Tosses). Consider the setting of Example 8.1.1. The
posterior distribution is Beta distribution with parameters t+1 and n−t+1.
So, by properties of the Beta distribution, the posterior mean estimator is

$$\hat{\theta} = \frac{t+1}{n+2},$$

and the posterior median estimator needs to be calculated numerically. Note that both are different from the standard estimator $\bar{x} = t/n$.
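For illustration, assuming (hypothetically) t = 7 heads in n = 10 tosses, both estimators are one-liners in R:

    n <- 10; t <- 7                               # illustrative data
    post_mean   <- (t + 1) / (n + 2)              # 8/12, about 0.667
    post_median <- qbeta(0.5, t + 1, n - t + 1)   # about 0.676
    c(post_mean, post_median, t / n)              # compare with t/n = 0.7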

Note that both the posterior mean and posterior median estimators depend on our choice of the prior distribution.
The Bayesian analogue of confidence intervals is credible sets. A set A is a credible set with probability 1 − α if the probability that the parameter belongs to the set, calculated with respect to the posterior distribution, is 1 − α. That is,

$$P[\theta \in A] = \int_A p(\theta|x_1, \ldots, x_n)\, d\theta = 1 - \alpha.$$

Calculation of credible sets is straightforward when we know the posterior distribution. It is important to understand that credible sets depend on the choice of the prior distribution.
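For example, with the Beta(t + 1, n − t + 1) posterior of Example 8.1.1, an equal-tailed 95% credible interval comes directly from the quantile function (illustrative data again):

    n <- 10; t <- 7
    qbeta(c(0.025, 0.975), t + 1, n - t + 1)   # roughly (0.39, 0.89)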

8.2 Hypothesis testing


Bayesian inference also differs from classical statistics in its approach to hypothesis testing. In fact, hypothesis testing plays a less significant role in Bayesian inference, simply because the idea that a continuous parameter can be exactly equal to a specific value is at odds with the main idea of Bayesian inference: to model beliefs about parameters with probabilities. Still, it is possible to evaluate two competing hypotheses using Bayesian methods.
In the classical case, the setup consists of a pair of hypotheses: the null hypothesis H0 and the alternative hypothesis Ha . The rejection region is selected as the set of data samples for which we reject H0 in favor of Ha . This set is selected in such a way that the probabilities of making the wrong decisions (the probabilities of type I and type II errors, α and β) are small.
The deficiency of this method is that it works really well only if both the null and the alternative are simple hypotheses, so that we can calculate α and β unambiguously. In this case, the Neyman-Pearson lemma provides us with the most powerful test, that is, the test that has the smallest β for a given α. This test is based on the ratio of likelihood functions, that is, on the ratio

of data densities for the parameters θ0 and θa :

$$\frac{L(\theta_0|x_1, \ldots, x_n)}{L(\theta_a|x_1, \ldots, x_n)} \equiv \frac{f(x_1, \ldots, x_n|\theta_0)}{f(x_1, \ldots, x_n|\theta_a)}.$$

Sometimes, it is possible to develop the best test (a Uniformly Most Powerful test) even when the alternative hypothesis is composite. However, in
many cases, for example, for the two-sided alternative hypothesis there are
no UMP tests. In addition, in order to search for UMP tests, we have to
assume that the null hypothesis is simple. This is somewhat unsatisfactory
since in many practical situations it is difficult to justify a specific value for
the null hypothesis.
In practice, statisticians are satisfied with reasonable, although not UMP, tests, which allow us to do testing even if the null hypothesis is not simple. One of these tests, the likelihood ratio test, is based on the test statistic

$$\lambda(x_1, \ldots, x_n) = \frac{\max_{\theta \in \Omega_0} L(\theta|x_1, \ldots, x_n)}{\max_{\theta \in \Omega_0 \cup \Omega_a} L(\theta|x_1, \ldots, x_n)} \equiv \frac{\max_{\theta \in \Omega_0} f(x_1, \ldots, x_n|\theta)}{\max_{\theta \in \Omega_0 \cup \Omega_a} f(x_1, \ldots, x_n|\theta)}.$$

In other words, we choose the value θ0 in the null hypothesis set Ω0 , which
gives the largest probability density of the data, and compare this density
with the maximum of the data probability density when the parameter is
allowed to vary over both the null (Ω0 ) and alternative (Ωa ) hypothesis sets.
We reject the null if the ratio of these two probabilities is smaller than a
threshold.
While this procedure is very reasonable, it does not have a clear proba-
bilistic justification.
In contrast, in the Bayesian inference, the null hypothesis is typically not
a single value but a big set of parameters H0 : θ ∈ Ω0 and the alternative
is the complement of this set, $H_a: \theta \in \Omega_a = \Omega_0^c$. For example, we can have
the null hypothesis H0 : θ ≤ θ0 and the alternative Ha : θ > θ0 .
The null hypothesis is rejected by the Bayesian test if the ratio of pos-

terior probabilities of the hypotheses,

$$\lambda_B(x_1, \ldots, x_n) = \frac{P[\theta \in \Omega_0]}{P[\theta \in \Omega_a]} = \frac{\int_{\Omega_0} p(\theta|x_1, \ldots, x_n)\, d\theta}{\int_{\Omega_a} p(\theta|x_1, \ldots, x_n)\, d\theta} = \frac{\int_{\Omega_0} f(x_1, \ldots, x_n|\theta)\, p(\theta)\, d\theta}{\int_{\Omega_a} f(x_1, \ldots, x_n|\theta)\, p(\theta)\, d\theta},$$

is smaller than a certain threshold t, which measures the degree of our


conservatism. For example, if the threshold is set to 1/3, then we reject
the null hypothesis H0 only if the posterior probability of H0 is three times
smaller than the posterior probability of Ha .
This resembles the likelihood ratio test, except instead of maximizing
the data density over the set of parameters Ω0 and Ωa , we take the average
of the data densities by using the prior probability distribution p(θ).
It is in principle possible to define α and β of the Bayesian test as the average probabilities of making type I and type II errors, where the average is calculated with respect to the prior distribution. However, the definition is more complicated. In addition, the Bayesian analysis is most useful when the amount of data is not overwhelmingly large compared to our prior beliefs. In this situation, there is no analogue of Wilks’ theorem for the likelihood ratio, and so it is significantly more difficult to develop a test with a given α. For this reason, α and β are very rarely used in Bayesian inference.
Here is an example of how a Bayesian test applies in practice.
Example 8.2.1. Let X1 , . . . , Xn be a sample from an exponentially distributed population with density f (x|θ) = θe−θx . (Note that this is a slightly different parameterization of the exponential distribution; the mean of the distribution is µ = 1/θ.) Suppose the prior distribution is the Gamma distribution with parameters α and β. Test the hypothesis H0 : θ ≤ θ0 versus Ha : θ > θ0 .
The density of the data sample is

$$f(x_1, \ldots, x_n|\theta) = \theta^n e^{-\theta \sum_{i=1}^{n} x_i}.$$

The prior is

p(θ) ∝ θα−1 e−θ/β .

The posterior distribution is

$$p(\theta|x_1, \ldots, x_n) \propto \theta^{n+\alpha-1} e^{-\theta\left( \sum_{i=1}^{n} x_i + 1/\beta \right)}.$$

So the posterior distribution is the Gamma distribution with parameters

$$\alpha' = n + \alpha, \qquad \beta' = \frac{1}{\sum_{i=1}^{n} x_i + 1/\beta} = \frac{\beta}{\beta \sum_{i=1}^{n} x_i + 1}.$$

In particular, we showed that the Gamma distribution is the conjugate prior for the exponential distribution. We reject the null hypothesis only if

$$P[\theta \le \theta_0] < \frac{t}{1+t}.$$
In R, this can be checked by testing whether

    pgamma(theta0, shape = alpha_prime, rate = 1/beta_prime) < t/(1 + t)

where theta0, alpha_prime, and beta_prime hold the values of θ0 , α′ , and β ′ .
For example, let n = 10, $\sum x_i = 1.26$, α = 3, and β = 5. (These are perhaps obtained by reviewing prior studies about θ.) We want to test the null hypothesis H0 : µ > 0.12 against Ha : µ ≤ 0.12 using t = 1. (This is not a very conservative test: we reject the null hypothesis if its posterior probability is smaller than that of the alternative.) In terms of the parameter θ, the hypotheses are

$$H_0: \theta < 1/0.12 = 8.333 \quad \text{against} \quad H_a: \theta \ge 8.333.$$

We calculate

$$\alpha' = n + \alpha = 10 + 3 = 13, \qquad 1/\beta' = \sum_{i=1}^{n} x_i + 1/\beta = 1.26 + 1/5 = 1.46,$$

then

    pgamma(8.333, 13, 1.46)   # 0.4430332

Since this is smaller than t/(1 + t) = 1/2, we can reject the null hypothesis.
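The whole calculation fits in a few lines of R (a sketch using the values from this example; the variable names are ours):

    n <- 10; sum_x <- 1.26         # data summary from the example
    alpha <- 3; beta <- 5          # prior hyper-parameters
    t <- 1                         # threshold on the posterior odds
    alpha_post <- n + alpha        # posterior shape: 13
    rate_post  <- sum_x + 1/beta   # posterior rate 1/beta': 1.46
    theta0 <- 1/0.12               # 8.333...
    p_H0 <- pgamma(theta0, shape = alpha_post, rate = rate_post)  # 0.443
    p_H0 < t / (1 + t)             # TRUE, so we reject H0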
One observation about Bayesian hypothesis testing is that the results of the tests depend on the choice of the prior distribution, and this choice should be careful and well-justified. The second observation is that in practice it is sometimes difficult to calculate probabilities under the posterior distribution; this calculation may involve difficult integrations. In this respect, the classical approach is often computationally easier.
