MATH2901
HIGHER THEORY OF
STATISTICS
Contents
Probability – Revision
1 Random Variables
2 Common Distributions
3 Bivariate Distributions
4 Transformations
Statistics is all about making decisions based on data in the presence of uncer-
tainty. In order to deal with uncertainty we need to develop a language for dis-
cussing it – probability theory. We will also derive in this chapter some funda-
mental results from probability theory that will be useful to us later on.
References: Kroese and Chan (2013) chapter 1, Hogg et al (2005) sections 1.1-1.4,
Rice (2007) chapter 1.
Definition
An experiment is any process leading to recorded observations.
1. Tossing a die.
4. Selecting a random sample of fifty people and observing the number of left-
handers.
Definitions
An outcome is a possible result of an experiment.
The set Ω of all possible outcomes is the sample space of an experiment.
Ω is discrete if it contains a countable (finite or countably infinite) number
of outcomes.
Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (6, 6)}.
Ω = {0, 1, · · · } = Z+ .
Notice that for modelling purposes it is often easier to take the sample space larger
than necessary. For example the actual lifetime of a machine would certainly not
span the entire positive real axis.
Definitions
An event is a set of outcomes (a subset of Ω). An event occurs if the
result of the experiment is one of the outcomes in that event.
A = [0, 1000) .
3. The event that out of fifty selected people, five are left-handed,
A = {5} .
Definition
Events are mutually exclusive (disjoint) if they have no outcomes in
common; that is, if they cannot both occur.
Example
Experiment - toss a coin 3 times, record results. Let H denote ’head’, T denote
’tail’.
Ω = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}
Ω is discrete.
Let’s say HTT is the result of the experiment. Which of A, B and C have occurred?
..., then A and C have occurred, but not B.
Having defined¹ an event and the set of all possible outcomes Ω, the second ingredient
in the model is to specify the collection F, say, of all events “of interest”. That is,
¹ This section is not compulsory/assessable and should be read only by interested students to deepen their understanding.
the collection of all events to which we wish to assign a “probability” (which we will
define next). It is most tempting to take F equal to the collection of all subsets of
Ω (the power set). When Ω is countable this is fine, and this is the reason why you
will never encounter F in an elementary exposition of probability theory. However,
when Ω is uncountable, the power set of Ω is in general so large that one cannot
assign a proper “probability” to all subsets! The same difficulty arises when we wish
to assign a natural “length” to all subsets of R.
Thus, for an uncountable Ω we have to settle for a smaller collection F of events.
This collection should have nice properties. For example,
4. The set Ω itself should be an event, namely the “certain” event. Similarly ∅
should be an event, namely the “impossible” event.
Thinking this over, the minimal assumption that we impose on F is that it should
be an object called a σ-algebra:
1. Ω ∈ F ,
2. If A ∈ F then also Ac ∈ F ,
3. If A1, A2, . . . ∈ F, then ∪n An ∈ F.
Exercise 1 Prove that if F is a σ-algebra then, with A1, A2, . . . ∈ F, also ∩n An ∈ F.
Events A1 , A2 , . . . are called exhaustive if their union is the whole sample space Ω.
A sequence A1 , A2 , . . . of disjoint and exhaustive events is called a partition of Ω.
Exercise 2 Let A, B, C be a partition of Ω. Describe the smallest σ-algebra con-
taining the sets A, B and C.
Exercise 3 Let Ω be a sample space with n elements. If F is the collection of all
subsets of Ω, how many sets does F contain?
Borel σ-algebra
The Borel σ-algebra B on R is the σ-algebra generated by the intervals (−∞, x]; similarly, the Borel σ-algebra Bⁿ on Rⁿ is generated by the rectangles of the form
(−∞, x1] × · · · × (−∞, xn].
Instead of working with the real line, it will be convenient to work with the extended
real line R̄ := [−∞, ∞]. The natural extension of B is the σ-algebra B̄ which is
generated by the intervals [−∞, x]. It is called the Borel σ-algebra on R̄.
Similarly, the Borel σ-algebra B̄ n on R̄n (the meaning should be obvious) is defined
as the σ-algebra that is generated by the rectangles of the form
[−∞, x1 ] × · · · × [−∞, xn ] .
Before we proceed you are encouraged to recall the definition of union, intersection, and complement notation given in Figure 1 below and explained in detail in Section 1.3 of the recommended textbook, Kroese and Chan (2013).
Figure 1: Venn diagrams illustrating set operations. Each square represents the sample space Ω.
Given a sample space (Ω, F), a probability function P can be defined in the following
way. To every event A ∈ F we assign a number P(A), the probability that A
occurs. The function P must satisfy the axioms
(i) P(A) ≥ 0 for all A ∈ F,
(ii) P(Ω) = 1,
(iii) for any sequence of disjoint events A1, A2, . . ., P(∪_i Ai) = Σ_i P(Ai).
Notation
Note that we will use the following interchangeable notation for the com-
plement of event A
A^c ≡ Ā
It follows that
2. P(∅) = 0,
and, for example, if A = {s2, s5, s9, s11, s14}, then
P(A) = P({s2} ∪ {s5} ∪ {s9} ∪ {s11} ∪ {s14}) = P({s2}) + P({s5}) + P({s9}) + P({s11}) + P({s14}).
Thus, in the previous coin toss example, if we assume that the 8 possible outcomes
are equally likely,
Counting Rules
Counting Rule 1
If there are k experiments with ni possible outcomes in the ith (i = 1, 2, . . . , k), then the total number of possible outcomes for the k experiments is n1 n2 · · · nk = ∏_{i=1}^{k} ni.
Example
Toss a 6-sided die 3 times.
What are k and ni in this case?
What is the number of outcomes in the sample space?
Counting Rule 2
The number of possible permutations of r objects selected from n distinct objects is nPr = n!/(n − r)!, where n! = n(n − 1)(n − 2) · · · 3 · 2 · 1 for integers n ≥ 1 and 0! ≡ 1.
To see this, note that
nPr = n(n − 1)(n − 2) · · · (n − r + 1)
    = [n(n − 1)(n − 2) · · · 3 · 2 · 1] / [(n − r)(n − r − 1) · · · 3 · 2 · 1]
    = n!/(n − r)!
Example
A particular committee has four members. One member must chair the committee,
and a different committee member must take minutes from meetings.
How many different ways are there of choosing a Chair and a Minute-taker for this
committee?
First label the four people as a, b, c, d.
ab ac ad bc bd cd
ba ca da cb db dc
n = 4, r = 2,  4P2 = 4!/2! = 12.
Note: the number of possible permutations of r objects (n = r) is rPr = r!/0! = r!
Counting Rule 3
The number of ways of choosing r objects from n distinct objects is
n!/(r!(n − r)!) ≡ (n choose r), 0 ≤ r ≤ n.
Example
From a committee of four people, two committee members will need to present the
committee’s recommendations to the board of directors.
How many ways are there of choosing two committee members to report to the
board of directors?
Choose 2 letters from a, b, c, d. We ignore order, so that ab and ba, etc. each count
as one selection. The possibilities are: ab ac ad bc bd cd
n = 4, r = 2,  (4 choose 2) = 4!/(2! 2!) = 6.
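These two counting examples are easy to check numerically in R; a minimal sketch using the base functions factorial() and choose():
n <- 4; r <- 2
factorial(n) / factorial(n - r)   # ordered choices of (Chair, Minute-taker): 12
choose(n, r)                      # unordered pairs of presenters: 6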
Conditional Probability
Definition
The conditional probability that an event A occurs, given that an event
B has occurred is
P(A|B) = P(A ∩ B) / P(B),  provided P(B) ≠ 0.
Given that B has occurred, the total probability for possible results of the experi-
ment equals P(B), so that the probability that A occurs equals the total probability
for outcomes in A (only those in A ∩ B) divided by the total probability, P(B).
Independent Events
Events A and B are independent if P(A ∩ B) = P(A)P(B). For any two events A
and B, P(A ∩ B) = P(A|B)P(B), so A and B are independent
⇐⇒ P(A|B) = P(A)
⇐⇒ P(B|A) = P(B) (lemma 1)
Example
Toss two fair dice. There are 36 outcomes in the sample space Ω, each with probability 1/36. Let:
Intuition: A and B are independent, A and C are not independent. (B and C are
not independent since
We can write Ω as
Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (2, 6), . . . , (6, 6)}
or display it as follows:

             2nd die
             1   2   3   4   5   6
1st die  1   ·   ·   ·   ·   ·   ⊙
         2   ·   ·   ·   ·   ⊙   ·
         3   ·   ·   ·   ⊙   ·   ·
         4   ×   ×   ⊗   ×   ×   ×
         5   ·   ⊙   ·   ·   ·   ·
         6   ⊙   ·   ·   ·   ·   ·

(× marks outcomes in A, ⊙ marks outcomes in B, ⊗ marks the outcome in A ∩ B.)
P(A) = 6/36 = 1/6,  P(B) = 6/36 = 1/6,  P(A ∩ B) = 1/36,
and 1/36 = P(A ∩ B) = P(A)P(B). Thus A and B are independent.
P(C) = 5/36,  P(A ∩ C) = 1/36, so
1/36 = P(A ∩ C) ≠ P(A)P(C) = 1/6 × 5/36.
Thus A and C are not independent.
Also, P(A|C) = P(A ∩ C)/P(C) = (1/36)/(5/36) = 1/5 ≠ P(A), again confirming that A and C are not independent.
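The definitions of A, B and C are not restated above; assuming, for illustration, that A = ‘first die shows 4’, B = ‘the dice sum to 7’ and C = ‘the dice sum to 8’ (choices consistent with the probabilities quoted), the independence checks can be reproduced by enumerating Ω in R:
omega <- expand.grid(die1 = 1:6, die2 = 1:6)     # the 36 equally likely outcomes
A <- omega$die1 == 4                              # assumed definition of A
B <- omega$die1 + omega$die2 == 7                 # assumed definition of B
C <- omega$die1 + omega$die2 == 8                 # assumed definition of C
pr <- function(event) mean(event)                 # each outcome has probability 1/36
c(P_A = pr(A), P_B = pr(B), P_AB = pr(A & B), prodAB = pr(A) * pr(B))   # equal: independent
c(P_C = pr(C), P_AC = pr(A & C), prodAC = pr(A) * pr(C))                # unequal: not independent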
For a countable sequence of events {Ai }, the events are pairwise independent if
P(Ai ∩ Aj ) = P(Ai )P(Aj ) for all i ̸= j
and the events are (mutually) independent if for any collection
A i1 , A i2 , . . . , A in ,
P(Ai1 ∩ · · · ∩ Ain ) = P(Ai1 ) . . . P(Ain ).
Clearly, independence =⇒ pairwise independence, but not vice versa, as in the
following example.
Example
A coin is tossed twice. Let A be the event ‘head on the first toss’, B the event ‘head
on the second toss’ and C the event ‘exactly one head turned up’.
Example
A ball is drawn at random from an urn containing 4 balls numbered 1,2,3,4. Let
For events A1 , A2 , A3
P(A1 ∩ A2 ∩ A3 ) = P(A3 ∩ A2 ∩ A1 )
= P(A3 |A2 ∩ A1 )P(A2 ∩ A1 )
= P(A3 |A1 ∩ A2 )P(A2 |A1 )P(A1 ).
Example
To gain entry to a selective high school students must pass 3 tests. 20% fail the first
test and are excluded. Of the 80% who pass the first, 30% fail the second and are
excluded. Of those who pass the second, 60% pass the third.
SOME PROBABILITY LAWS 19
Question 1: What proportion of students pass the first two tests? Use the multi-
plicative law to answer this question.
Question 2: What proportion of students gain entry to the selective high school?
Let A1 be the event ‘pass first test’. Similarly A2 , A3 . Then
Question 3: What proportion pass the first two tests, but fail the third?
A B
Note that A and Ā ∩ B are mutually exclusive, and that A ∩ B and Ā ∩ B are mutually exclusive. So from the axioms
P(A ∪ B) = P(A) + P(Ā ∩ B)
and
P(B) = P(A ∩ B) + P(Ā ∩ B).
The first of these equations gives P(Ā ∩ B) = P(A ∪ B) − P(A). Substitution of this
expression into the second equation leads to the required result.
This is one of the original axioms:
Example
3 letters are placed at random into 3 addressed envelopes.
What is the probability that none is in the correct envelope?
Let A, B, C be the events that envelopes 1,2 and 3 contain the correct letters.
Then P(A) = P(B) = P(C) = 1/3 and P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/6 since, for example, P(A ∩ B) = P(B|A)P(A) = 1/2 × 1/3.
Also, P(A ∩ B ∩ C) = 1/6 since all 3 envelopes must contain the correct letters if any 2 envelopes contain the correct letters; that is,
Figure 4: A ∩ B = A ∩ C = B ∩ C = A ∩ B ∩ C
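The answer can be checked by brute force in R, enumerating all 3! = 6 ways of placing the letters into the envelopes:
perms <- rbind(c(1,2,3), c(1,3,2), c(2,1,3), c(2,3,1), c(3,1,2), c(3,2,1))
none_correct <- apply(perms, 1, function(p) all(p != 1:3))   # no letter in its own envelope
mean(none_correct)                                            # 1/3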
If A1, A2, . . . , Ak form a partition of Ω, then
P(B) = Σ_{i=1}^{k} P(B|Ai) P(Ai).
Proof:
B = ∪_{i=1}^{k} (B ∩ Ai)   (a disjoint union, since the Ai's are disjoint)
Example
Urn I contains 3 red and 4 white balls. Urn II contains 2 red balls and 4 white.
A ball is drawn from Urn I and placed unseen into Urn II. A ball is now drawn at
random from Urn II.
What is the probability that this second ball is red?
Let A1 be the event ‘1st ball drawn red’
A2 be the event ‘1st ball drawn white’
and let B be the event ‘2nd ball drawn red’.
A1 and A2 are mutually exclusive (they cannot both occur) and exhaustive (one of
them must occur) and so
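The calculation that the law of total probability gives here can be sketched numerically in R (the conditional probabilities are read off the urn compositions given above):
p_A1 <- 3/7; p_A2 <- 4/7            # colour of the first ball drawn from Urn I
p_B_given_A1 <- 3/7                 # Urn II then has 3 red out of 7 balls
p_B_given_A2 <- 2/7                 # Urn II then has 2 red out of 7 balls
p_B <- p_B_given_A1 * p_A1 + p_B_given_A2 * p_A2
p_B                                 # 17/49, about 0.347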
Bayes’ Formula
P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B).
Example
Recall the previous example.
Given that the second ball drawn is red, what is the probability that the first ball
was white?
P(A2|B) = P(B|A2)P(A2)/P(B) = (2/7 × 4/7)/(17/49) = 8/17.
Example
A diagnostic test for a certain disease is claimed to be 90% accurate because, if
a person has the disease, the test will show a positive result with probability 0.9
while if a person does not have the disease the test will show a negative result with
probability 0.9. Only 1% of the population has the disease.
If a person is chosen at random from the population and tests positive for the disease,
what is the probability that the person does in fact have the disease?
Let A be the event ‘person has disease’ and B be the event ‘person tests positive’.
P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ā)P(Ā)]
since A and Ā form a partition (they are mutually exclusive and exhaustive).
[Tree diagram: each person either has or does not have the disease, and then either tests positive or tests negative.]
P(disease) = 0.01 = 10/1000
P(diseased | test positive) = 9/(9 + 99) = 9/108 = 1/12
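The same posterior probability follows directly from Bayes' formula; a quick check in R:
p_disease   <- 0.01    # prevalence
p_pos_dis   <- 0.9     # P(test positive | disease)
p_pos_nodis <- 0.1     # P(test positive | no disease)
p_pos <- p_pos_dis * p_disease + p_pos_nodis * (1 - p_disease)   # total probability
p_pos_dis * p_disease / p_pos                                     # P(disease | positive) = 1/12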
Part One —
Probability and Distribution Theory
Chapter 1
Random Variables
Definition
A random variable is a function X : Ω → R that assigns a real number X(ω) to each outcome ω ∈ Ω.
The definition is depicted pictorially below, where Ω = {(1, 1), . . . , (6, 6)} is the sample space generated by two dice thrown consecutively and X((i, j)) = i + j is the random variable indicating the sum of the faces of the two dice.
Sometimes we may write I{A} or I_A, depending on which happens to be the most economical but unambiguous notation.
Example
Toss a fair coin 3 times and let X be the number of heads obtained. Then
P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, P(X = 3) = 1/8.
Definition
The cumulative distribution function (cdf) of a random variable X is FX(x) = P(A(x)), where A(x) = {ω ∈ Ω : X(ω) ≤ x}. Note that we can write this in abbreviated form as
FX(x) = P(X ≤ x).
Note that we may write F (x) if there is no ambiguity about whose cdf we refer to.
The following figure shows the cdf for the previous example:
[Figure: the cdf F(x) for this example, a step function jumping to 1/8 at x = 0, 1/2 at x = 1, 7/8 at x = 2 and 1 at x = 3.]
4. F is right-continuous:
lim_{n↑∞} F(x + 1/n) = F(x)
and
P(X = x) = F(x) − lim_{n↑∞} F(x − 1/n)
Proofs:
Set An = {X ≤ −n}, so that ∩_{n=1}^{∞} An = ∅. By the continuity property of probability,
lim_{n↑∞} F(−n) = lim_{n↑∞} P(X ≤ −n) = lim_{n↑∞} P(An) = P(∩_{n=1}^{∞} An) = P(∅) = 0.
Now for the other case set An = {x < X ≤ x + 1/n} and note that
Definition
The probability function of the discrete random variable X is the function
fX given by
fX (x) = P(X = x).
Results
The probability function of a discrete random variable X has the following
properties:
Example
Below is the probability function for the number of heads in three coin tosses:
[Figure: the probability function f(x), with f(0) = f(3) = 1/8 and f(1) = f(2) = 3/8.]
When X is discrete, FX (x) is the sum of the probabilities for possible values of X
less than or equal to x.
Example
A coin, with p=probability of a head on a single toss, is tossed until a head turns
up for the first time. Let X denote the number of tosses required.
Find the probability function and the cumulative distribution function of X.
If the tosses are independent, then
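The worked solution is not reproduced above; the standard result is P(X = x) = (1 − p)^{x−1} p for x = 1, 2, . . ., with cdf FX(x) = 1 − (1 − p)^x. A small numerical check in R (note that R's dgeom counts failures before the first success, hence the x − 1 offset):
p <- 0.5                     # chance of a head on one toss (fair coin, for illustration)
x <- 1:5
(1 - p)^(x - 1) * p          # P(X = x): first head occurs on toss x
dgeom(x - 1, p)              # the same values from R's geometric functions
1 - (1 - p)^x                # F_X(x)
pgeom(x - 1, p)              # matches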
Definition
The density function of a continuous random variable is a real-valued
function fX on R with the property
∫_A fX(x) dx = P(X ∈ A)
Results
The density function of a continuous random variable X has the following
properties:
The next two results show how FX may be found from fX and vice versa.
Result
The cumulative distribution function (cdf) FX of a continuous random vari-
able can be found from the density function fX via
FX(x) = ∫_{−∞}^{x} fX(t) dt.
Result
The density function fX of a continuous random variable can be found from the cumulative distribution function (cdf) FX via
fX(x) = d/dx FX(x)  (wherever this derivative exists).
Result
For any continuous random variable X and pair of numbers a ≤ b
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx = area under fX between a and b.
These results demonstrate the importance of fX (x) and FX (x): if you know or can
derive either fX (x) or FX (x), then you can derive any probability you want to about
X, and hence any property of X that is of interest.
Some particularly important properties of X are described later in this chapter.
Example
The lifetime (in thousands of hours) X of a light bulb has density fX (x) = e−x , x > 0.
Find the probability that a bulb lasts between 2 thousand and 3 thousand hours.
The probability that a bulb lasts between 2 thousand hours and 3 thousand hours
is
∫_2^3 e^{−x} dx = [−e^{−x}]_2^3 = e^{−2} − e^{−3} = (e − 1)/e^3 ≈ 0.0855.
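The same probability can be obtained numerically in R, either by integrating the density directly or via the built-in exponential cdf with rate 1:
f <- function(x) exp(-x)                        # density of X (in thousands of hours)
integrate(f, lower = 2, upper = 3)$value        # about 0.0855
pexp(3, rate = 1) - pexp(2, rate = 1)           # same answer using the exponential cdf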
and
It only ‘makes sense’ to talk about the probability of X lying in some subset of R.
A consequence of (1.1) is that, with continuous random variables, we don’t have
to worry about distinguishing between < and ≤ signs. The probabilities are not
affected. For example,
x0.5 is the median of FX (or f ). x0.25 , x0.75 are the lower and upper quar-
tiles of FX (or fX ).
Example
Let X be a random variable with cumulative distribution function FX (x) = 1 −
e−x , x > 0.
Find the median and quartiles of X.
xp = −ln(1 − p)
so
x0.5 = −ln(0.5) = ln 2.
The lower quartile is x0.25 = −ln(0.75) = ln(4/3) and the upper quartile is x0.75 = −ln(0.25) = 2 ln 2.
[Figure: the density f(x) = e^{−x} with the quartiles ln(4/3), ln 2 and 2 ln 2 marked on the x-axis; the area under the curve between consecutive quartiles is 1/4.]
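These quantiles can be verified in R, since FX here is the Exponential(rate = 1) cdf and qexp inverts it:
qexp(c(0.25, 0.5, 0.75), rate = 1)   # lower quartile, median, upper quartile
log(c(4/3, 2, 4))                     # -log(1 - p): ln(4/3), ln 2, 2 ln 2, the same values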
Expectation
that is, the sum of the possible values of X weighted by their probabilities.
Definition
The expected value or mean of a discrete random variable X is
EX = E[X] = Σ_{all x} x × P(X = x) = Σ_{all x} x fX(x),
Definition
The expected value or mean of a continuous random variable X is
E[X] = ∫_{−∞}^{∞} x fX(x) dx,
In both cases, E(X) has the interpretation of being the long run average of X – in
the long run, as you observe an increasing number of values of X, the average of
these values approaches E(X).
In both cases, E(X) has the physical interpretation of the centre of gravity of the function fX. So if a piece of thick wood or stone were carved in the shape of fX, it would balance on a fulcrum placed at E(X) (see accompanying figure).
Example
Let X be the number of females in a committee with three members. Assume that
there is a 50:50 chance of each committee member being female, and that committee
members are chosen independently of each other.
Find E[X].
x:                   0     1     2     3
fX(x) = P(X = x):   1/8   3/8   3/8   1/8

E[X] = 0 · 1/8 + 1 · 3/8 + 2 · 3/8 + 3 · 1/8 = 3/2
The interpretation of 3/2 is not that you expect X to be 3/2 (it can only be 0, 1, 2 or 3) but that if you repeated the experiment, say, 100 times, then the average of the 100 numbers observed should be about 3/2 (= 150/100).
That is, we expect to observe about 150 females in total in 100 committees. We
don’t expect to see exactly 1.5 females on each committee!
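A quick check of this expected value, and of the long-run-average interpretation, in R (here the number of females on a committee is simulated as a Bin(3, 0.5) count, consistent with the assumptions of the example):
x  <- 0:3
fx <- c(1, 3, 3, 1) / 8                          # probability function of X
sum(x * fx)                                      # E[X] = 1.5
mean(rbinom(10000, size = 3, prob = 0.5))        # average over many simulated committees, close to 1.5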
Example
Suppose X is a standard uniform random number generator (such as can be found
on most hand-held calculators).
X has density fX (x) = 1, 0 < x < 1.
Find E[X].
E[X] = ∫_{−∞}^{∞} x fX(x) dx = ∫_0^1 x · 1 dx = 1/2
[Figure: the Uniform(0, 1) density f(x) = 1 on (0, 1), with E(X) = 1/2 at its centre.]
Note that in both of the above examples, fX is symmetric and it is symmetric about
E[X]. This is not always the case, as in the following examples:
Example*
X has probability function P(X = x) = (1 − p)^{x−1} p, x = 1, 2, . . . (0 < p < 1).
Find E(X).
E(X) = Σ_{x=1}^{∞} x(1 − p)^{x−1} · p
     = −p Σ_{x=1}^{∞} d/dp (1 − p)^x = −p d/dp Σ_{x=1}^{∞} (1 − p)^x
     = −p d/dp (1/p − 1) = 1/p.
Example
X has density
fX (x) = e−x , x > 0
Find E(X).
E(X) = ∫_0^∞ x e^{−x} dx
     = [−x e^{−x}]_0^∞ + ∫_0^∞ e^{−x} dx = 1
Special case: if X is degenerate, that is, X = c with probability 1 for some constant c, then X is in fact just a constant and E[X] = Σ_{all x} x P(X = x) = c · 1 = c. Thus the expected value of a constant is the constant; that is, E[c] = c.
• We count the number of consecutive times (X) a pair of coins land “odds” (one
tail and one head) in a game of two-up. The variable of interest is whether or
not the casino wins all bets (which happens for X > 4).
Transformations are also of interest when studying the properties of a random vari-
able. For example, in order to understand X, it is often useful to look at the rth
moment of X about some constant a, defined as E[(X −a)r ]. This is another exam-
ple of a transformation of X, for which we wish to find E[g(X)], for some function
g(x).
The following is an important result regarding the expectation of a transformation
of a random variable:
Result
The expected value of a function g(X) of a random variable X is
Eg(X) = Σ_{all x} g(x) fX(x)             if X is discrete,
Eg(X) = ∫_{−∞}^{∞} g(x) fX(x) dx         if X is continuous.
proof: We prove this only in the discrete case. Set Y = g(X) and fX (x) = P(X =
x), then
Eg(X) = EY = Σ_y y P(Y = y)
       = Σ_y y P(g(X) = y) = Σ_y y P(X ∈ {x : g(x) = y})
       = Σ_y y Σ_{x: g(x)=y} P(X = x) = Σ_y Σ_{x: g(x)=y} y P(X = x)
       = Σ_y Σ_{x: g(x)=y} g(x) P(X = x) = Σ_x g(x) P(X = x)
where the last line follows from the fact that if x takes on all values in its domain, then y = g(x) takes on all values in its range and vice versa. □
Example
Let I denote the electric current through a particular circuit, where I has density function
fI(x) = 1/2 for 1 ≤ x ≤ 3, and fI(x) = 0 elsewhere,
and consider the power P = 3I^2.
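The question posed in this example is not reproduced above; as an illustration of the result Eg(X) = ∫ g(x) fX(x) dx, a sketch in R computing the expected power E(P) = E(3I^2) under the stated density:
f_I <- function(x) ifelse(x >= 1 & x <= 3, 1/2, 0)   # density of the current I
g   <- function(x) 3 * x^2                            # power as a function of current
integrate(function(x) g(x) * f_I(x), lower = 1, upper = 3)$value   # E(P) = 13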
Results
If a is a constant,
E[X + a] = Σ_{all x} (x + a) P(X = x) = E[X] + a
and E[aX] = Σ_{all x} a x P(X = x) = a E(X).
Definition
If we let µ = E(X), then the variance of X denoted by Var(X) is defined
as
Var(X) = E[(X − µ)2 ]
(which is the second moment of X about µ).
Definition
The standard deviation of a random variable X is the square-root of its
variance:
standard deviation of X = √Var(X).
Standard deviations are more readily interpreted, because they are measured in the
same units as the original variable X. So standard deviations are more commonly
used as measures of spread in applied statistics, and in reporting the results of
quantitative research.
Variances are a bit easier to work with theoretically, and so are more commonly
used in mathematical statistics (and in this course).
Result
Var(X) = E(X 2 ) − µ2 .
Proof:
Example
Assume the lifetime of a lightbulb (in thousands of hours) has density function
fX (x) = e−x .
Calculate Var(X).
We will use Var(X) = E(X^2) − [E(X)]^2. Recall from the earlier example that E(X) = 1.
E(X^2) = ∫_0^∞ x^2 e^{−x} dx
       = [−x^2 e^{−x}]_0^∞ − ∫_0^∞ (−2x e^{−x}) dx
       = 0 + 2E(X) = 2
so Var(X) = E(X^2) − [E(X)]^2 = 2 − 1 = 1.
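A numerical check of these moments in R:
f <- function(x) exp(-x)                                  # density of the lifetime
EX  <- integrate(function(x) x   * f(x), 0, Inf)$value    # 1
EX2 <- integrate(function(x) x^2 * f(x), 0, Inf)$value    # 2
EX2 - EX^2                                                # Var(X) = 1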
Example
Consider two random variables A and B.
A:   x       1      2      3      4      5
     fA(x)   0.15   0.25   0.20   0.25   0.15

B:   x       1      2      3      4      5
     fB(x)   0.10   0.10   0.60   0.10   0.10
Which of A and B is more variable?
E[A] = E[B] = 3
E[A^2] = 10.7,  E[B^2] = 10
Var(A) = 1.7,  Var(B) = 1
so A is the more variable of the two.
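These calculations are easily reproduced in R from the two probability tables:
x  <- 1:5
fA <- c(0.15, 0.25, 0.2, 0.25, 0.15)
fB <- c(0.1, 0.1, 0.6, 0.1, 0.1)
meanA <- sum(x * fA); varA <- sum(x^2 * fA) - meanA^2    # 3 and 1.7
meanB <- sum(x * fB); varB <- sum(x^2 * fB) - meanB^2    # 3 and 1
c(varA = varA, varB = varB)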
Results
Var(X + a) = Var(X)
Var(aX) = a2 Var(X).
Definition
The moment generating function (mgf) of a random variable X is
mX (u) = E(euX ).
The name “moment generating function” comes from the following result concerning
the rth moment of X, that is E(X r ):
Result
In general, E(X^r) = m_X^{(r)}(0) for r = 0, 1, 2, . . .
Derivation: Without going into details, the condition that the moment generating function is finite on some interval around zero guarantees the existence of all moments and that the derivative and the expectation can be interchanged, so that
m_X^{(r)}(u) = ∂_u^{(r)} E(e^{uX}) = E(∂_u^{(r)} e^{uX}) = E(X^r e^{uX}).
Given that the r-th derivative m_X^{(r)}(u) is smooth enough around zero, the r-th moment can be computed by setting u = 0: E(X^r) = m_X^{(r)}(0).
Example
X has mgf
mX(u) = (1 − u)^{−1} for u < 1, and mX(u) = ∞ for u ≥ 1.
Find an expression for the rth moment of X.
E(X^r) = m_X^{(r)}(0) = 1 · 2 · 3 · · · r (1 − u)^{−r−1}|_{u=0} = r!
(In particular, E(X) = 1 and E(X^2) = 2, so Var(X) = 2 − 1 = 1.)
Example*
X has probability function
P(X = x) = e^{−λ} λ^x / x!,  x = 0, 1, 2, . . . ;  λ > 0.
Find the mgf of X. Hence find E(X) and Var(X).
X has mgf
mX(u) = E(e^{uX}) = Σ_{x=0}^{∞} e^{ux} · e^{−λ} λ^x / x! = e^{−λ} Σ_{x=0}^{∞} (λe^u)^x / x!
       = e^{−λ} · e^{λe^u} = e^{λ(e^u − 1)}.
E(X) = m′X(0) = e^{λ(e^u − 1)} · λe^u |_{u=0} = λ
E(X^2) = m_X^{(2)}(0) = {λ e^{λ(e^u − 1)} · λe^u · e^u + e^{λ(e^u − 1)} · λe^u}|_{u=0} = λ(1 + λ)
∴ Var(X) = E(X^2) − {E(X)}^2 = λ.
Example*
X has probability function
fX(x) = P(X = x) = (n choose x) p^x (1 − p)^{n−x},  x = 0, 1, . . . , n;  0 < p < 1.
The Binomial Theorem states that
(a + b)^n = Σ_{x=0}^{n} (n choose x) a^x b^{n−x}.
Hence
mX(u) = E e^{uX} = Σ_{x=0}^{n} e^{ux} (n choose x) p^x (1 − p)^{n−x} = (1 − p + pe^u)^n.
E(X) = m′X(0) = n(1 − p + pe^u)^{n−1} · pe^u |_{u=0} = np
E(X^2) = m_X^{(2)}(0) = np{(1 − p + pe^u)^{n−1} e^u + (n − 1)(1 − p + pe^u)^{n−2} · pe^u · e^u}|_{u=0} = np{1 + (n − 1)p}.
∴ Var(X) = np(1 − p).
Example
X has density
fX (x) = e−x , x > 0
mX(u) = E(e^{uX}) = ∫_0^∞ e^{ux} · e^{−x} dx = ∫_0^∞ e^{(u−1)x} dx = (1 − u)^{−1} for u < 1, and +∞ for u ≥ 1.
The following results on uniqueness and convergence for moment generating func-
tions will be particularly important later on.
Result
Let X and Y be two random variables all of whose moments exist. If
mX(u) = mY(u)
for all u in a neighbourhood of 0 (i.e. for all |u| < ε for some ε > 0), then X and Y have the same distribution.
Result
Let {Xn : n = 1, 2, . . .} be a sequence of random variables, each with moment generating function mXn(u). Furthermore, suppose that mXn(u) → mX(u) for all u in a neighbourhood of 0, where mX is the mgf of a random variable X. Then the distribution of Xn converges to that of X, in the sense that FXn(x) → FX(x) at every x where FX is continuous.
The proofs rely on the theory of Laplace transforms but are not given in this refer-
ence. Instead, the reader is referred to
Widder, D.V. (1946). The Laplace Transform. Princeton, New Jersey: Princeton
University Press.
Result
Consider a random variable U with density function fU (x).
A location family of densities based on the random variable U is the family
of densities fX (x) where X = U + c for all possible c. fX (x) is given by:
fX (x) = fU (x − c)
Example
Consider the following families of density functions:
2. fX(x) = (1/(2β)) e^{−|(x−µ)/β|},  −∞ < x < ∞
For each of the above families of distributions, explain whether it is a location family,
a scale family, a location and scale family, or neither.
Example
Consider the random variable U which has density function
fU(x) = (1/√(2π)) e^{−x^2/2},  −∞ < x < ∞.
Bounding Probabilities
P(A1 ∪ A2 ∪ · · · ∪ An) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An)
Write B = A1 ∪ A2 ∪ · · · ∪ An. Then
I{B} = 1 − I{B^c}
     = 1 − I{(∪_{i=1}^n Ai)^c}
     = 1 − I{∩_{i=1}^n Ai^c}           (by De Morgan's law)
     = 1 − ∏_{i=1}^n I{Ai^c}           (since I{A ∩ B} = I{A}I{B})
     = 1 − ∏_{i=1}^n (1 − I{Ai})
P(∪_{i=1}^n Ai) ≤ Σ_i P(Ai)
P(∪_{i=1}^n Ai) ≤ Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak)
  ⋮
P(∪_{i=1}^n Ai) ≥ Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj)
P(∪_{i=1}^n Ai) ≥ Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − Σ_{i<j<k<m} P(Ai ∩ Aj ∩ Ak ∩ Am)
  ⋮
Chebychev’s Inequality
∴ P(|X − µ| > kσ) ≤ 1/k^2.
Example
The number of items a factory produces in 1 day has mean 500 and variance 100.
What is a lower bound for the probability that between 400 and 600 items will be
produced tomorrow?
Let X denote the number of items produced tomorrow.
P(X < 400) + P(X > 600) ≤ 1/10^2,
or P(400 ≤ X ≤ 600) ≥ 1 − 1/100 = 0.99.
Jensen’s Inequality
Another important inequality is called Jensen's inequality and it states the following: if h is a convex function, then E h(X) ≥ h(EX).
For example, suppose that h(x) = x^2; then Jensen's inequality implies that
E X^2 ≥ (EX)^2.
proof of Jensen's inequality: Since h is convex, we have that for some constants a and b:
h(x) ≥ ax + b for all x,
with h(µ) = aµ + b at x = µ (the line may be tangent to h at this point). Now put µ = EX, x = X and eliminate b = h(µ) − aµ to get
h(X) ≥ h(µ) + a(X − µ).
Hence,
E h(X) ≥ h(µ) + a E[X − µ] = h(µ).
Given a sample of data, {x1 , x2 , . . . , xn }, how would you summarise it, graphically
or numerically? Below we will quickly review some key tools.
You will not be expected to construct the following numerical and graphical sum-
maries by hand, but you should understand how they are produced, know how to
produce them using the statistics package R, and know how to interpret such sum-
maries.
The most important property to think about when constructing descriptive statistics
is whether each variable is categorical or quantitative.
If the sample {x1 , x2 , . . . , xn } comes from a quantitative variable, then the xi are
real numbers, xi ∈ ℜ. If it comes from a categorical variable, then each xi comes
from a finite set of categories or “levels”, xi ∈ {C1 , C2 , . . . , CK }.
Example
Consider the following questions:
1. Will more people vote the Liberal party ahead of Labour at the next election?
2. Does whether or not pregnant guinea pigs are given a nicotine treatment affect
the number of errors made in a maze by their offspring?
What are the variables of interest in these questions? Are each of these variables
categorical or quantitative?
Summary of descriptive methods
Useful descriptive methods for when we wish to summarise one variable, or the association between two variables, depend on whether these variables are categorical or quantitative.

Does the research question involve:    Numerics                      Graphs
One variable:   Categorical            Table of frequencies          Bar chart
                Quantitative           Mean/sd, median/quantiles     Dotplot, boxplot, histogram
Two variables:  Both categorical       Two-way table                 Clustered bar chart
                Both quantitative      Correlation                   Scatterplot
                One of each            Mean/sd per group             Boxplots, histograms, etc.

We will work through each of the methods mentioned in the above table.

Example
Consider again the research questions of the previous example. What method(s) would you use to construct a graph to answer each research question?
Categorical data
The main tool for summarising categorical data is a table of frequencies (or percent-
ages).
Example
We can summarise the NSW election poll as follows:
Example
Consider the question of whether there is an association between gender and whether
or not a passenger on the Titanic survived.
We can summarise the results from passenger records as follows:
                    Outcome
                    Survived   Died
Gender   Male       142        709
         Female     308        154
which suggests that a much higher proportion of females survived: their survival
rate was 67% vs 17%!
In the Titanic example, an alternative summary was the percentage survival for each
gender. Whenever one of the variables of interest has only two possible outcomes a
list (or table) of percentages is a useful alternative way to summarise the data.
If you are interested in an association between more than two categorical variables
you can extend the above ideas, e.g. construct a three-way table...
These frequency tables can be displayed graphically, as in the following picture:
[Figure: a bar chart of the election-poll frequencies (Liberal, Labour), and a clustered bar chart of the Titanic frequencies (Died, Survived) for male and female passengers.]
Pie charts are often used to graph categorical variables, however these are not gener-
ally recommended. It has been shown that readers of pie charts find it more difficult
to understand the information that is contained in them, e.g. comparing the relative
size of frequencies across categories. (For details, see the Wikipedia entry on pie
charts and references therein: http://en.wikipedia.org/wiki/Pie_chart)
Quantitative data
• Shape – other information about a variable apart from location and spread.
Skewness is an important example.
The most commonly used numerical summaries of a quantitative variable are the
sample mean, variance and standard deviation:
x̄ = (1/n) Σ_{i=1}^{n} xi,    s^2 = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)^2,    s = √(s^2).
The variance is a useful quantity for theoretical purposes, as we will see in the coming
chapters. The standard deviation however is of more practical interest because it is
on the same scale as the original variable and hence is more readily interpreted.
The sample mean and variance are very widely used and we will derive a range of
useful results about these estimators in this course.
Let’s say we order the n values in the dataset and write them in increasing order
as {x(1) , x(2) , . . . , x(n) }. For example, x(3) is the third smallest observation in the
dataset.
The sample median is
x̃0.5 = x((n+1)/2)                        if n is odd,
x̃0.5 = (1/2)(x(n/2) + x((n+2)/2))        if n is even.
Example
The following (ordered) dataset is the number of mistakes made when ten subjects
2 4 5 7 8 10 14 17 27 35
Find the 5th and 15th sample percentiles of the data. Hence find the 10th percentile.
There are ten observations in the dataset, so the 5th sample percentile is
x̃(1−0.5)/10 = x̃0.05 = 2.
Similarly, the 15th sample percentile is 4. The 10th sample percentile is the average of these two. So x̃0.1 can be estimated as
x̃0.1 = (1/2)(x(1) + x(2)) = (1/2)(2 + 4) = 3.
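R's quantile() function supports several conventions for sample quantiles; type = 5 corresponds to the (k − 0.5)/n convention used above, so it reproduces this calculation:
x <- c(2, 4, 5, 7, 8, 10, 14, 17, 27, 35)
quantile(x, probs = c(0.05, 0.15, 0.10), type = 5)   # 2, 4 and 3, as above
median(x)                                            # (8 + 10)/2 = 9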
Apart from x̃0.5, the two important quantiles are the first and third quartiles, x̃0.25 and x̃0.75 respectively. These are used to define the interquartile range, IQR = x̃0.75 − x̃0.25.
There are many ways to summarise a variable, and a key thing to consider when
choosing a graphical method is the sample size (n). Some common plots:
[Figure: three example displays of a quantitative variable, a dotchart (small n), a boxplot (moderate n) and a histogram (large n).]
A dotchart is a plot of each variable (x-axis) against its observation number, with
data labels (if available). This is useful for small samples (e.g. n < 20).
A boxplot concisely describes location, spread and shape via the median, quartiles
and extremes:
• The line in the middle of the box is the median, the measure of centre.
• The box is bounded by the upper and lower quartiles, so box width is a measure
of spread (the interquartile range, IQR).
• The whiskers extend until the most extreme value within one and a half in-
terquartile ranges (1.5IQR) of the nearest quartile.
• Any value farther than 1.5IQR from its nearest quartile is classified as an
extreme value (or “outlier”), and labelled as a dot or open circle.
Boxplots are most useful for moderate-sized samples (e.g. 10 < n < 50).
A histogram is a plot of the frequencies or relative frequencies of values within
different intervals or bins that cover the range of all observed values in the sample.
Note that this involves breaking the data up into smaller subsamples, and as such
it will only find meaningful structure if the sample is large enough (e.g. n > 30) for
the subsamples to contain non-trivial counts.
An issue in histogram construction is choice of number of bins. A useful rough rule
is to use
number of bins = √n.
A smoother alternative to the histogram is a density estimate of the form
f̂h(x) = (1/n) Σ_{i=1}^{n} wh(x − xi)
for some choice of weighting function wh (x) which includes a “bandwidth parameter”
h.
Usually, w(x) is chosen to be the normal density (defined in Chapter 4) with mean 0
and standard deviation h. A lot of research has studied the issue of how to choose a
bandwidth h, and most statistics packages are now able to automatically choose an
estimate of h that usually performs well. The larger h is, the larger the bandwidth
that is used i.e. the larger the range of observed values xi that influence estimation
of fbh (x) at any given point x.
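In R, histograms and density estimates of this kind are produced with hist() and density(); the bw argument of density() plays the role of h above and is chosen automatically if not supplied. A minimal sketch on an illustrative simulated sample:
set.seed(1)
x <- rexp(200)                                              # any numeric vector works here
hist(x, breaks = ceiling(sqrt(length(x))), freq = FALSE)    # roughly sqrt(n) bins
lines(density(x))                                           # automatic bandwidth
lines(density(x, bw = 0.5), lty = 2)                        # larger bandwidth, smoother estimate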
Shape of a distribution
Something we can see from a graph that is hard to see from numerical summaries
is the shape of a distribution. Shape properties, broadly, are characteristics of the
distribution apart from location and spread.
An example of an important shape property is skew – if the data tend to be asymmetric about their centre, they are skewed. We say data are “left-skewed” if the left tail is longer than the right; conversely, data are right-skewed if the right tail is longer.
There are some numerical measures of shape, e.g. the coefficient of skewness κ1:
κ̂1 = (1/((n − 1)s^3)) Σ_{i=1}^{n} (xi − x̄)^3
but they are rarely used – perhaps because of extreme sensitivity to outliers, and
perhaps because shape properties can be easily visualised as above.
Another important thing to look for in graphs is outliers – unusual observations
that might carry large weight in analysis. Such values need to be investigated – are
they errors, are they “special cases” that offer interesting insights, and how dependent are the results on these outliers?
Consider a pair of samples from two quantitative variables, {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}.
We would like to understand how the x and y variables are related.
An effective graphical display of the relationship between two quantitative variables
is a scatterplot – a plot of the yi against the xi .
Example
How did brain mass change as a function of body size in dinosaurs?
The sample correlation coefficient is
r = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / ((n − 1) sx sy),
where x̄ and sx are the sample mean and standard deviation of x, and similarly for y.
Results
1. |r| ≤ 1
Examples:
Example
Recall the guinea pig experiment – we want to explore whether there is an association
between a nicotine treatment (categorical) and number of errors made by offspring
(quantitative).
To summarise number of errors, we might typically use mean/sd and a boxplot. To
instead look at the association between number of errors and nicotine treatment,
we calculate mean/sd of number of errors for each of the two levels of treatment
(nicotine and no nicotine), and construct a boxplot for each level of treatment:
             x̄      s
Sample A    23.4    12.3
Sample B    44.3    21.5

[Figure: side-by-side boxplots of the number of errors for samples A and B.]
Note that in the above example the boxplots are presented on a common axis –
sometimes this is referred to as comparative boxplots or “side-by-side boxplots”.
An advantage of boxplots over histograms is that they can be quite narrow and
hence readily compared across many samples by stacking them side-by-side. Some
interesting extensions are reviewed in the article “40 years of boxplots” by Hadley
Wickham and Lisa Stryjewski at Rice University.
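Summaries and comparative boxplots like those above are produced in R along the following lines; the data here are simulated and the variable names errors and treatment are hypothetical, purely to illustrate the commands:
set.seed(1)
treatment <- rep(c("no nicotine", "nicotine"), each = 10)                  # hypothetical groups
errors    <- rpois(20, lambda = ifelse(treatment == "nicotine", 45, 25))   # simulated counts
tapply(errors, treatment, mean)                 # mean number of errors per group
tapply(errors, treatment, sd)                   # standard deviation per group
boxplot(errors ~ treatment, ylab = "Number of errors")   # side-by-side boxplots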
R is a free statistics package that can be used for data analysis, which will be useful
to us in this course. You can download R from cran.r-project.org.
To open R, learn how to upload data, and for more detailed instructions on some
of the below, please refer to the notes “A Brief Introduction to R” available on My
eLearning.
To enter the example dataset we considered in the previous section:
> x=c(2, 4, 5, 7, 8, 10, 14, 17, 27, 35)
To calculate the sample mean, variance, standard deviation and median (respectively):
> mean(x)
> var(x)
> sd(x)
> median(x)
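A few further base-R summaries that may be useful for the same vector:
> summary(x)                    # min, quartiles, median, mean, max
> quantile(x, c(0.25, 0.75))    # first and third quartiles
> IQR(x)                        # interquartile range
> boxplot(x)                    # boxplot
> hist(x)                       # histogram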
It¹ can be quite annoying to always deal with two separate definitions for discrete and continuous random variables. This is especially the case if, as is often the case, a random variable can be a mix of the two. For this reason, in this section we look to define expectation in a unique way that does not discriminate between the discrete and continuous nature of the variable, and yet gives the same results as our previous definition. The result of this exercise is better understanding and notation when writing probability arguments.
Before we proceed we recall the definition of the Riemann integral:
¹ This section is again non-assessable material, only for interested students.
Riemann integral
The Riemann integral of the function g over the closed interval [a, b] is defined as the limit of the Riemann sums:
∫_a^b g(x) dx = lim_{δn↓0} Σ_{k=0}^{n−1} g(x̃k)(x_{k+1} − x_k),
where a = x_0 < x_1 < · · · < x_n = b is a partition of [a, b], x̃k ∈ [x_k, x_{k+1}], and δn = max_k (x_{k+1} − x_k).
Lebesgue-Stieltjes integral
For a continuous random variable X with density f and cdf F,
E[g(X)] = ∫ g(x) f(x) dx
        := lim_{δn↓0} Σ_{k=0}^{n−1} g(x̃k) f(x̃k)(x_{k+1} − x_k)
        = lim_{δn↓0} Σ_{k=0}^{n−1} g(x̃k)[F(x_{k+1}) − F(x_k)],
where in the last line we have used the Mean Value Theorem, which states that for a given continuously differentiable F(x), there exists an x̃k ∈ (x_k, x_{k+1}) such that
(F(x_{k+1}) − F(x_k)) / (x_{k+1} − x_k) = F′(x̃k) = f(x̃k).
This is written as
E[g(X)] = ∫_X g(x) F{dx} = ∫_X g(x) dF(x).
You may now ask what the point of the new notation is. The answer is that the
definition
E[g(X)] := lim_{δn↓0} Σ_{k=0}^{n−1} g(x̃k)[F(x_{k+1}) − F(x_k)]
works in both the continuous and discrete (or mixed!) cases. As a result, there is
no need to treat the discrete or continuous cases differently by giving two separate
formulas for calculating probabilities and expectations. We saw above that the
definition works in the continuous case. We now show that it also works in the
discrete case.
Suppose now X is a discrete random variable, that is, X is countable and P(X =
x) > 0 for all x ∈ X . Then, letting n ↑ ∞, we can choose a subsequence
of x̃0 < x̃1 < x̃2 < · · · to equal all the possible values in the countable set X . By
properties of the cdf F in the discrete case2 , we then have
F(x_{k+1}) − F(x_k) = P(x_k < X ≤ x_{k+1}) = P(X = x̃k) if k ∈ {j0, j1, . . .}, and = 0 otherwise.
² The cdf is a right-continuous step/staircase function with a countable number of jumps.
Thus,
E[g(X)] = lim_{δn↓0} Σ_{k=0}^{n−1} g(x̃k)[F(x_{k+1}) − F(x_k)]
        = Σ_{k=0}^{∞} g(x̃_{jk}) P(X = x̃_{jk})
        = Σ_{x∈X} g(x) P(X = x)
Some students inquired about when we can interchange the order of differentiation
and summation/integration as in the following example for computing the mean.
If X has probability function P(X = x) = (1 − p)^{x−1} p, x = 1, 2, . . ., then
E(X) = Σ_{x=1}^{∞} x(1 − p)^{x−1} p
     = −p Σ_{x=1}^{∞} d/dp (1 − p)^x = −p d/dp Σ_{x=1}^{∞} (1 − p)^x
     = −p d/dp (1/p − 1) = 1/p.
To answer this question we now give the definition of uniform convergence and then point to its consequences. Recall the following definition: a sequence of functions fn converges uniformly to f on X if
sup_{x∈X} |fn(x) − f(x)| → 0 as n ↑ ∞.
Example
Consider the continuous functions fn(x) = x^n on X = [0, 1]. We have fn(x) → 0 for 0 ≤ x < 1 and fn(x) → 1 for x = 1. Thus, the limit depends on the value of x within X and we do not have uniform convergence on X. This can also be seen from sup_{x∈X} |fn(x) − 0| = 1, which is not less than ε.
Theorem 1.2 (Weierstrass M-test) Suppose that we can find an upper bound to fn such that
|fn(x)| ≤ Mn,  x ∈ X,
and Σ_{n=1}^{∞} Mn < ∞ (that is, the series converges). Then,
Sm(x) := Σ_{n=1}^{m} fn(x)
converges uniformly on X as m ↑ ∞.
Example
The series Σ_{x=1}^{∞} (1 − p)^x converges uniformly on p ∈ (0, 1), because we can find a 0 < ϱ < p such that (1 − p)^x ≤ (1 − ϱ)^x for p ∈ (0, 1) and Σ_{x=1}^{∞} (1 − ϱ)^x = (1 − ϱ)/ϱ < ∞.
Thus, if fn (x) are continuous functions, then so is the limiting function f (x).
Theorem 1.4 (Exchanging a limit with integration) Suppose that fn (x) con-
verges uniformly to f (x) on x ∈ X . Then,
lim_{n↑∞} ∫_X fn(x) dx = ∫_X lim_{n↑∞} fn(x) dx = ∫_X f(x) dx.
In particular, suppose that
Sm(x) := Σ_{n=1}^{m} fn(x)
converges uniformly to S∞(x) on x ∈ X and ∫_X fn(x) dx < ∞ for all n. Then,
∫_X Σ_{n=1}^{∞} fn(x) dx = Σ_{n=1}^{∞} ∫_X fn(x) dx.
This result justifies the exchange of expectation with Taylor expansion of the mo-
ment generating function of a bounded random variable X, in which we write
mX(u) = E[e^{uX}]
       = E[1 + uX/1! + (uX)^2/2! + · · ·]
       = 1 + E[X] · u/1! + E[X^2] · u^2/2! + E[X^3] · u^3/3! + · · ·
Suppose that:
• (d/dx) Sm(x) converges uniformly on x ∈ X;
• Sm(x) converges.
Then
lim_{m↑∞} (d/dx) Sm(x) = (d/dx) lim_{m↑∞} Sm(x),
which is the same as
Σ_{n=1}^{∞} (d/dx) fn(x) = (d/dx) Σ_{n=1}^{∞} fn(x).
Example
If we let Sm(p) = −Σ_{k=1}^{m} (1 − p)^k, then (d/dp) Sm(p) = Σ_{k=1}^{m} k(1 − p)^{k−1}. To see that (d/dp) Sm(p) converges uniformly, we recall the following result:
For a power series Σ_{k=0}^{∞} ak x^k, the radius of convergence R is defined to be R := sup{x : (ak x^k) is bounded}, and for any 0 < r < R the power series converges uniformly on every closed interval [−r, r].
To proceed, we note that for every p ∈ (0, 1] the function (k + 1)(1 − p)^{k−1} is bounded in k since
lim_{k→∞} k e^{(k−1) ln(1−p)} = lim_{k→∞} k / e^{−(k−1) ln(1−p)}
(L'Hôpital's rule) = lim_{k→∞} (1 − p)^{k−1} / (−ln(1 − p)) = 0.
Therefore the radius of convergence is R = 1 and for every p ∈ [0, 1), the sequence of partial sums (d/dp) Sm(p) converges uniformly. Hence, we are justified in differentiating term by term:
(d/dp)(−Σ_{k=1}^{∞} (1 − p)^k) = −Σ_{k=1}^{∞} (d/dp)(1 − p)^k.
Chapter 2
Common Distributions
Bernoulli Distribution
Example
The tossing of a coin is a Bernoulli trial. We may define:
‘success’ = heads
‘failure’ = tails
or vice versa.
Some more examples of Bernoulli trials:
• Dead or alive?
• Sick?
• Sold?
• Faulty?
There are only two possible responses to any of the questions in the above exam-
ple. Hence the above variables can all be modelled using the Bernoulli distribution,
defined below:
Definition
For a Bernoulli trial define the random variable
X = 1 if the trial results in success, and X = 0 otherwise.
Result
If X is a Bernoulli random variable defined according to a Bernoulli trial
with success probability 0 < p < 1 then the probability function of X is
fX(x) = p if x = 1, and fX(x) = 1 − p if x = 0.
Example
Consider coin-tossing as in the previous example. If the coin is fair, p = 1/2 and
fX(x) = (1/2)^x (1 − 1/2)^{1−x} = 1/2,  x = 0, 1.
This is consistent with the notion that heads and tails are equally likely for a fair
coin.
Definition
A constant like p above in a probability function or density is called a
parameter.
Binomial Distribution
The Binomial distribution arises when several Bernoulli trials are repeated in
succession.
Definition
Consider a sequence of n independent Bernoulli trials, each with success
probability p. If
X = total number of successes
then X is a Binomial random variable with parameters n and p. A common
shorthand is:
X ∼ Bin(n, p).
The previous definition uses the symbol “∼”. In Statistics this symbol has a special
meaning:
Notation
The symbol “∼” is commonly used in statistics for the phrase “is distributed as”. For example,
Y ∼ Bin(29, 0.72)
is usually read:
Y has a Binomial distribution with parameters n = 29 and p = 0.72.
Whenever summing the number of times we observe a particular binary outcome,
across n independent trials, we have a binomial distribution.
Example
Write down a distribution that could be used to model X when X is:
1. The number of patients who survive a new type of surgery, out of 12 patients
who each have 95% chance of surviving.
2. The number of patients who visit a doctor who are in fact sick, out of the 36
who visit the doctor. Assume that in general, 70% of people who visit their
doctor are actually sick.
3. The number of plants that are flowering, out of 52 randomly selected plants
in the wild, when the proportion of plants in the wild flowering at this time is
p.
4. The number of plasma screen televisions that a store sells in a month, out of
the store’s total stock of 18, each with probability 0.6 of being sold.
5. The number of plasma screen televisions that are returned to the store due to
faults, out of the 9 that are sold, when the return rate for this type of TV is
8%.
Note that in all of the above, we require the assumption of independence of responses
across the n units in order to use the binomial distribution. This assumption is
guaranteed to be satisfied if we randomly select units from some larger population
(as was done in the above plant example).
Result
If X ∼ Bin(n, p) then its probability function is given by
fX(x) = (n choose x) p^x (1 − p)^{n−x},  x = 0, . . . , n.
This result follows from the fact that there are (n choose x) ways by which X can take the value x, and each of these ways has probability p^x (1 − p)^{n−x} of occurring.
Results
If X ∼ Bin (n, p) then
1. E(X) = np,
Example
Adam pushed 10 pieces of toast off a table. Seven of these landed butter side down.
• What distribution could be used to model the number of slices of toast that
land butter side down? Assume that there is a 50:50 chance of each slice
landing butter side down.
• What is the expected number of pieces of toast that land butter side down,
and the standard deviation?
• What is the probability that exactly 7 slices land butter side down?
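Under the Bin(10, 0.5) model these quantities are easy to compute in R (dbinom gives the Binomial probability function):
n <- 10; p <- 0.5
n * p                            # expected number landing butter side down: 5
sqrt(n * p * (1 - p))            # standard deviation: about 1.58
dbinom(7, size = n, prob = p)    # P(exactly 7): about 0.117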
X ∼ Bin(1, p).
Geometric Distribution
The Geometric distribution arises when a Bernoulli trial is repeated until the
first ‘success’. In this case
Results
If X has a Geometric distribution with parameter p then
1. E(X) = 1/p,
2. Var(X) = (1 − p)/p^2.
Hypergeometric Distribution
Hypergeometric random variables arise when counting the number of binary re-
sponses, when objects are sampled independently from finite populations, and the
total number of “successes” in the population is known.
Suppose that a box contains N balls, m are red and N − m are black. n balls are drawn at random. Let X be the number of red balls drawn.
Note that this can be thought of as a finite population version of the binomial
distribution. Instead of assuming some constant probability p of “success” in the
population, we say that there are N units in the population of which m are successes.
Result
If X has a Hypergeometric distribution with parameters N , m and n
then its probability function is given by
fX(x; N, m, n) = (m choose x)(N − m choose n − x) / (N choose n),  0 ≤ x ≤ min(m, n).
Example
Lotto A machine contains 45 balls, and you select 6. Seven winning numbers are
then drawn (6 main, one supplementary), and you win a major prize ($10,000+) if
you pick six of the winning numbers.
What’s the chance that you win a major prize from playing one game?
Let X be the number of winning numbers. X is hypergeometric with N = 45,
m = 6, n = 7.
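In R, dhyper(x, m, n, k) takes m = number of “success” balls, n = number of other balls and k = number drawn, so the chance of a major prize is:
dhyper(6, m = 6, n = 39, k = 7)                 # about 8.6e-07, roughly 1 in 1.16 million
choose(6, 6) * choose(39, 1) / choose(45, 7)    # the same probability from first principles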
Example
1. The number of patients a town doctor sees who are in fact sick, when 800
people want to see a doctor, 500 of these are actually sick, and when the
doctor only has time to see 32 of the 800 people (who are selected effectively
at random from those that want to see the doctor).
2. The number of plants that are flowering, out of 52 randomly selected plants
in the wild, from a population of 800 plants of which 650 are flowering.
Answer:
X ∼ Hyp(52, 650, 800)
3. The number of faulty plasma screen televisions that are returned to a store, when 5 of the store's 18 TVs are faulty, and 6 of the 18 TVs were sold. Answer:
X ∼ Hyp(6, 5, 18)
Results
If X has a Hypergeometric distribution with parameters N , m and n then
1. E(X) = n · m/N,
2. Var(X) = n · (m/N)(1 − m/N) · (N − n)/(N − 1),
Poisson Distribution
Definition
The random variable X has a Poisson distribution with parameter λ > 0
if its probability function is
fX(x; λ) = P(X = x) = e^{−λ} λ^x / x!,  x = 0, 1, 2, . . .
A common abbreviation is
X ∼ Poisson(λ).
The Poisson distribution often arises when the variable of interest is a count. For
example, the number of traffic accidents in a city on any given day could be well-
described by a Poisson random variable.
The Poisson is a standard distribution for the occurrence of rare events. Such
events are often described by a Poisson process. A Poisson process is a model for
the occurrence of point events in a continuum, usually a time-continuum. The
occurrence or not of points in disjoint intervals is independent, with a uniform
probability rate over time. If the probability rate is λ, then the number of points
occurring in a time interval of length t is a random variable with a Poisson(λt)
distribution.
Results
If X ∼ Poisson(λ) then
1. E(X) = λ,
2. Var(X) = λ,
3. mX(u) = e^{λ(e^u − 1)}.
Example
Suggest a distribution that could be useful for studying X in each of the following
cases:
3. The number of ATM customers overnight when a bank is closed (when the
average number is 5.6).
Example
If, on average, 5 servers go offline during the day, what is the chance that no more
than 1 will go offline? (assume independence of servers going offline)
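Taking X ∼ Poisson(5), the required probability P(X ≤ 1) can be evaluated in R:
lambda <- 5
dpois(0, lambda) + dpois(1, lambda)   # P(X = 0) + P(X = 1)
ppois(1, lambda)                      # same thing via the cdf: about 0.040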
Exponential Distribution
Result
If X has an Exponential distribution with parameter β then
E(X) = β , Var(X) = β 2 .
Example
If, on average, 5 servers go offline during the day, what is the chance that no servers will go offline in the next hour? (Hint: Note that an hour is 1/24 of a day.)
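The worked answer is not shown above; treating the waiting time until the next offline event as Exponential with rate 5 per day (so mean β = 1/5 of a day), a sketch in R:
rate <- 5                        # events per day
1 - pexp(1/24, rate = rate)      # P(no event in the next 1/24 of a day): about 0.81
exp(-rate / 24)                  # the same value directly from the survival function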
In words, if the waiting time until the next event is exponential, then the waiting
time until the next event is independent of the time you’ve already been waiting.
Note that the exponential distribution is a special case of the Gamma distribution
described in a later section.
Uniform Distribution
Definition
A continuous random variable X that can take values in the interval (a, b)
with equal likelihood is said to have a uniform distribution on (a, b). A
common shorthand is:
X ∼ Uniform(a, b).
Definition
If X ∼ Uniform(a, b) then the density function of X is
fX(x; a, b) = 1/(b − a),  a < x < b;  a < b.
Note that fX (x; a, b) is simply a constant function over the interval (a, b), and zero
otherwise.
The following figure shows four different uniform density functions.
[Figure: density functions of Uniform(0, 1), Uniform(−2, 3), Uniform(3, 3.5) and Uniform(0.6, 2.8) random variables.]
Results
If X ∼ Uniform(a, b) then
1. E(X) = (a + b)/2,
Note that there is also a discrete version of the uniform distribution, useful for
modelling the outcome of an event that has k equally likely outcomes (such as the
roll of a die). This has different formulae for its expectation and variance than the
continuous case does, which can be derived from first principles.
There are three more special distributions that we will consider in this chapter –
the normal, gamma and beta distributions. But before discussing these distribu-
tions, we will need to define some special functions that are closely related to these
distributions.
The Gamma function is essentially an extension of the factorial function (e.g. 4!=24)
to general real numbers.
Definition
The Gamma function at x ∈ R is given by
Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt.
Results
Some basic results for the Gamma function are
1. Γ(x) = (x − 1)Γ(x − 1)
2. Γ(n) = (n − 1)!,  n = 1, 2, 3, . . .
3. Γ(1/2) = √π
The following result follows from (2.) of the above results. It is a useful result in a
variety of statistical applications.
Result
If m is a non-negative integer then
∫_0^∞ x^m e^{−x} dx = m!
Example
fX(x) = (1/120) x^5 e^{−x},  x > 0.
What is E(X)? Find the answer using the gamma function.
E(X) = ∫_{−∞}^{∞} x fX(x) dx = (1/120) ∫_0^∞ x^6 e^{−x} dx = 6!/120 = 6.
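R's gamma() function and numerical integration make this easy to verify:
integrate(function(x) x * x^5 * exp(-x) / 120, 0, Inf)$value   # E(X) = 6
gamma(7) / 120                                                 # 6!/120 = 6, using Γ(7) = 6!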
Definition
The Beta function at x, y ∈ R is given by
B(x, y) = ∫_0^1 t^{x−1} (1 − t)^{y−1} dt.
Result
For all x, y ∈ R,
B(x, y) = Γ(x)Γ(y) / Γ(x + y).
Example
fX(x) = 168 x^2 (1 − x)^5,  0 < x < 1.
What is E(X^2)? Use the beta function to find the answer.
E(X^2) = ∫_0^1 x^2 fX(x) dx = ∫_0^1 168 x^4 (1 − x)^5 dx = 168 × B(5, 6)
       = 168 × Γ(5)Γ(6)/Γ(11) = 168 × 4! 5!/10! = 168/1260 = 2/15.
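A quick numerical check using R's beta() function:
168 * beta(5, 6)                                                 # 2/15, about 0.1333
integrate(function(x) x^2 * 168 * x^2 * (1 - x)^5, 0, 1)$value   # same value by direct integration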
The digamma and trigamma functions will be required later in the notes, but we
will give their definition here since they are closely related to the Gamma function:
Definition
For all x ∈ R,
digamma(x) = (d/dx) ln{Γ(x)},
trigamma(x) = (d^2/dx^2) ln{Γ(x)}.
Unlike some other mathematical functions (such as sin, tan^{−1}, log10, . . .), the digamma
and trigamma functions are not available on ordinary hand-held calculators. How-
ever, they are just another set of special functions and are available in mathematics
and statistics computer software such as Maple, Matlab and R. The following figure
(constructed using R software) displays them graphically.
[Figure: plots of the digamma function and the trigamma function over 1 ≤ x ≤ 4.]
The final special function we will consider is routinely denoted by the ‘capital phi’
symbol Φ. It is defined as follows:
Definition
For all x ∈ R,
Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t^2/2} dt.
Note that Φ(x) cannot be simplified any further than in the above expression since e^{−t^2/2} does not have a closed-form ‘anti-derivative’.
This function gives the cumulative distribution function of the standard normal
distribution, considered in the following section.
Result
The Φ function has the following properties:
1. lim_{x→−∞} Φ(x) = 0,
2. lim_{x→∞} Φ(x) = 1,
3. Φ(0) = 1/2,
[Figure: plot of Φ(x) for −3 ≤ x ≤ 3, increasing from near 0 to near 1.]
It follows from the previous result that the inverse of Φ, Φ−1 (x), is well-defined for
all 0 < x < 1. Examples are
Φ^{−1}(1/2) = 0 and Φ^{−1}(0.975) = 1.95996 . . . ≈ 1.96.
We will see later (in chapters 8-10) that the Φ−1 function plays a particularly im-
portant role in statistical inference.
Normal Distribution
Definition
The random variable X is said to have a normal distribution with pa-
rameters µ and σ 2 (where −∞ < µ < ∞ and σ 2 > 0) if X has density
function
fX(x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)^2/(2σ^2)},  −∞ < x < ∞.
A common shorthand is
X ∼ N (µ, σ 2 ).
[Figure: four normal density functions with different means and variances, including N(4, 0.25) and N(4, 9).]
Results
If X ∼ N (µ, σ 2 ) then
1. E(X) = µ,
2. Var(X) = σ 2 ,
3. mX(u) = e^{µu + σ^2 u^2/2}.
If Z ∼ N(0, 1) (the standard normal random variable), its density is
fZ(x) = (1/√(2π)) e^{−x^2/2}.
The standard normal density function does not have a closed-form anti-derivative, so probabilities cannot be computed by direct integration in the usual way. Note, however:
Result
If Z ∼ N(0, 1) then FZ(x) = P(Z ≤ x) = Φ(x).
This result means that probabilities concerning Z ∼ N (0, 1) can be computed when-
ever the function Φ is available. Tables for Φ are available in the back of the Course
Pack (and on UNSW Blackboard). This can be used, for example, to show that:
For finding a probability such as P(Z > 0.81), we need to work with the complement
P(Z ≤ 0.81):
Result
If X ∼ N (µ, σ 2 ) then
Z = (X − µ)/σ ∼ N(0, 1).
Example
The distribution of young men’s heights is approximately normally distributed with
mean 174 cm and standard deviation 6.4 cm.
1. What percentage of these men are taller than six foot (182.9 cm)?
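The first part can be answered with pnorm; a sketch (standardising gives the same answer):
1 - pnorm(182.9, mean = 174, sd = 6.4)   # P(height > 182.9): about 0.082, i.e. roughly 8%
1 - pnorm((182.9 - 174) / 6.4)           # same calculation after standardising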
Gamma Distribution
Definition
A random variable X is said to have a Gamma distribution with param-
eters α and β (where α, β > 0) if X has density function:
fX(x; α, β) = e^{−x/β} x^{α−1} / (Γ(α) β^α),  x > 0.
A common shorthand is X ∼ Gamma(α, β).
Gamma density functions are skewed curves on the positive half-line. The following
figure shows four different Gamma density functions.
[Figure: density functions of Gamma(1.5, 0.2), Gamma(4, 3), Gamma(13, 5) and Gamma(13, 10) random variables.]
Results
If X ∼ Gamma(α, β) then
1. E(X) = αβ,
2. Var(X) = αβ^2,
3. mX(u) = (1/(1 − βu))^α,  u < 1/β.
Result
X has an Exponential distribution if and only if
X ∼ Gamma(1, β)
Beta Distribution
For completeness, we conclude with the Beta distribution. This generalises the
Uniform(0,1) distribution, which can be thought of as a beta distribution with a =
b = 1.
Definition
A random variable X is said to have a Beta distribution with parameters
α, β > 0 if its density function is
fX(x; α, β) = x^{α−1} (1 − x)^{β−1} / B(α, β),  0 < x < 1.
Results
If X has a Beta distribution with parameters α and β then
1. E(X) = α/(α + β),
2. Var(X) = αβ / ((α + β)²(α + β + 1)).
Quantile-Quantile Plots of Data
Consider the situation in which we have a sample {x1, x2, . . . , xn} of size n from some
unknown random variable, and we want to check whether these data appear to come
from a random variable with cumulative distribution function FX(x). This can be
achieved using a quantile-quantile plot (sometimes called a Q-Q plot) as defined
below.
Quantile-quantile plots
To check how well the sample {x1 , x2 , . . . , xn } approximates the distribu-
tion with cdf FX (x), plot the n sample quantiles against the corresponding
quantiles of FX (x). That is, plot the points
    (F_X^{−1}(p), x_(k)),   where p = (k − 0.5)/n,   for all k ∈ {1, 2, . . . , n}.
If the data come from the distribution FX (x), then the points will show no
systematic departure from the one-to-one line.
According to the above definition, we need to know the exact cdf FX (x) to construct
a quantile-quantile plot. However, for a location-scale family of distributions, a
family that does not change its essential shape as its parameters change, we can
construct the quantile-quantile plot using an arbitrary choice of parameters. In this
case, we only need to check for systematic departures from a straight line rather than
from the one-to-one line when assessing goodness-of-fit. This is the most common
application of quantile-quantile plots. It allows us to see how well data approximates
a whole family of location-scale distributions, without requiring any knowledge of
what the values of the parameters are.
Among the special distributions introduced in this chapter are several examples of
location or scale families.
Example
Recall the example dataset from Chapter 1:
2 4 5 7 8 10 14 17 27 35
Use a quantile-quantile plot to assess how well these data approximate a normal
distribution.
There are ten values in this dataset, so the values of p we will use to plot the data
are p = (k − 0.5)/10 for all k ∈ {1, 2, . . . , 10}, that is, for all
    p ∈ {0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95}.
The corresponding standard normal quantiles Φ⁻¹(p) are
    {−1.64, −1.04, −0.67, −0.39, −0.13, 0.13, 0.39, 0.67, 1.04, 1.64},
and so we plot these values against our ordered example dataset. This results in the
following plot:
This plot does not follow a straight line – it has a systematic concave-up curve, so
the data are clearly not normally distributed. In fact, because the curve is concave-
up, the data are right-skewed (since the larger values in the dataset are much larger
than expected for a normal distribution).
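A sketch of how this plot could be produced in R: the manual construction below mirrors the definition given above, while qqnorm is a built-in shortcut (its plotting positions differ slightly for small n):
> x = c(2, 4, 5, 7, 8, 10, 14, 17, 27, 35)
> n = length(x)
> p = ((1:n) - 0.5)/n
> plot(qnorm(p), sort(x))   # standard normal quantiles vs ordered data
> qqnorm(x)                 # built-in normal Q-Q plot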
Special distributions on R
R has some useful functions for working with special distributions, as below.
Consider the normally distributed random variable with mean 2 and standard devi-
ation 3, X ∼ N (2, 9).
To take a random sample of size 20 from X and store it in x:
> x = rnorm(20,2,3)
To find the density function of X at the value x = 4, i.e. to find fX (4):
> dnorm(4,2,3)
To find FX(4) = P(X ≤ 4):
> pnorm(4,2,3)
To find the 95th percentile of X:
> qnorm(0.95,2,3)
We can perform similar calculations for all of the special distributions considered in
this chapter, using the following functions:
Binomial rbinom, dbinom, pbinom, qbinom
Geometric rgeom, dgeom, pgeom, qgeom
Hypergeometric rhyper, dhyper, phyper, qhyper
Poisson rpois, dpois, ppois, qpois
Exponential rexp, dexp, pexp, qexp
Uniform runif, dunif, punif, qunif
Normal rnorm, dnorm, pnorm, qnorm
Gamma rgamma, dgamma, pgamma, qgamma
Beta rbeta, dbeta, pbeta, qbeta
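For example (a hedged sketch; note in particular that R's Gamma functions use a rate parameter by default, so the β of this chapter is passed as scale, and the parameter values below are arbitrary illustrations):
> pbinom(3, size = 10, prob = 0.3)     # P(X <= 3) for X ~ Binomial(10, 0.3)
> qgamma(0.95, shape = 4, scale = 3)   # 95th percentile of Gamma(4, 3)
> dbeta(0.5, 2, 5)                     # Beta(2, 5) density at 0.5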
Bivariate Distributions
Observations are often taken in pairs, leading to bivariate observations (X, Y ), i.e.
observations of two variables measured on the same subjects. For example, (height,
weight) can be measured on people, as can (age, blood pressure), (gender, promo-
tion) for employees, (sales, price) for supermarket products...
Often we are interested in exploring the nature of the relationship between two
variables that have been measured on the same set of subjects. In this chapter we
develop a notation for the study of relationship between two variables, and explore
some key concepts.
For further reading, consider Hogg et al (2005) Chapter 2 or Rice (2007) Chapter 3
and Chapter 3 of Kroese and Chan (2014).
Definition
If X and Y are discrete random variables then the joint probability func-
tion of X and Y is
    fX,Y(x, y) = P(X = x and Y = y).
Example
Suppose that X and Y have joint probability function as tabulated below.
fX,Y (x, y)
y
−1 0 1
0 1/8 1/4 1/8
x 1 1/8 1/16 1/16
2 1/16 1/16 1/8
Find P(X = 0 and Y = 0). Show that P(X = 0 and Y = 0) ̸= P(X = 0) · P(Y = 0).
Example
Let X = number of successes in the first of two Bernoulli trials each with success
probability p and let Y = total number of successes in the two trials. Then, for
example,
fX,Y (x, y)
y
0 1 2
0 (1 − p)2 p(1 − p) 0
x
1 0 p(1 − p) p2
Definition
The joint density function of continuous random variables X and Y is a
bivariate function fX,Y with the property that, for A ⊆ ℝ²,
    P((X, Y) ∈ A) = ∬_A fX,Y(x, y) dx dy.
For two continuous random variables, X and Y , probabilities have the following
geometrical interpretation: fX,Y is a surface over the plane R2 and probabilities
over subsets A ⊆ R2 correspond to the volume under fX,Y over A.
For example, if
    f(x, y) = (12/7)(x² + xy)   for x, y ∈ (0, 1)
then the joint density function looks like this:
[Figure: surface plot of the joint density f(x, y) over the unit square 0 < x < 1, 0 < y < 1.]
Example
(X, Y ) have joint density function
    f(x, y) = (12/7)(x² + xy)   for x, y ∈ (0, 1).
Find P(X < 1/2, Y < 2/3).
[Figure: the unit square in the (x, y) plane, with the region 0 < x < 1/2, 0 < y < 2/3 shaded.]
Example*
[Figure: sketch of the integration region, bounded by the line y = x and the lines x = 1/3 and y = 1/2.]
If we use horizontal strips as shown in the next figure then the integral needs to be
broken up into two terms; corresponding to the triangle and the rectangle compo-
nents of the trapezium.
[Figure: the same region divided into horizontal strips.]
If we use vertical strips as shown in the following figure then the integral can be
done in one piece as follows:
    P(X < 1/3, Y < 1/2) = ∫₀^{1/3} ∫ₓ^{1/2} 2(x + y) dy dx = 11/108.
[Figure: the same region divided into vertical strips.]
Many of the definitions and results for random variables, considered in chapter 1,
generalise directly to the bivariate case. We consider some of these in this section.
Essentially, all that changes from chapter 1 to here is that instead of doing a single
summation or integral, we now do a double summation or double integral, because
we are now dealing with two random variables rather than one.
Result
If X and Y are discrete random variables then
    Σ_{all x} Σ_{all y} fX,Y(x, y) = 1.
Next, we will consider a definition of the joint cumulative distribution function (cdf).
Definition
The joint cdf of X and Y is FX,Y(x, y) = P(X ≤ x, Y ≤ y).
Example
Consider again
    fX,Y(x, y) = (12/7)(x² + xy),   0 < x < 1, 0 < y < 1.
[Figure: the unit square in the (x, y) plane, the region where fX,Y > 0.]
Result
If g is any function of X and Y ,
    E{g(X, Y)} = Σ_{all x} Σ_{all y} g(x, y) P(X = x, Y = y)           (discrete)
    E{g(X, Y)} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dx dy       (continuous)
Note that this formula has the same form as that of E{g(X)} from chapter 1.
Example
fX,Y (x, y)
y
0 1 2
x 0 0.1 0.2 0.2
1 0.2 0.2 a
where a is a constant.
3. Find E(XY ).
Result
If X and Y are discrete, then fX (x) and fY (y) can be calculated from
fX,Y(x, y) as follows:
    fX(x) = Σ_{all y} fX,Y(x, y),
    fY(y) = Σ_{all x} fX,Y(x, y).
Example
y
0 1 2 fX (x)
x 0 0.1 0.2 0.2
1 0.2 0.2 0.1
fY (y)
For example, fX(0) = P(X = 0) = P(X = 0, Y = 0) + P(X = 0, Y = 1) + P(X = 0, Y = 2)
= 0.1 + 0.2 + 0.2 = 0.5.
Thus fX(0) = P(X = 0) = Σ_{all y} P(X = 0, Y = y).
And in fact for any value x,
    fX(x) = P(X = x) = Σ_{all y} P(X = x, Y = y) = Σ_{all y} fX,Y(x, y).
Result
If X and Y are continuous, then fX (x) and fY (y) can be calculated from
fX,Y (x, y) as follows:
    fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy,
    fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx.
Example
    fX,Y(x, y) = (12/7)(x² + xy),   0 < x < 1, 0 < y < 1.
Find fX(x) and fY(y).
    fX(x) = (12/7) ∫₀¹ (x² + xy) dy = (12/7)(x² + x/2),   0 < x < 1,
    fY(y) = (12/7) ∫₀¹ (x² + xy) dx = (12/7)(1/3 + y/2),   0 < y < 1.
Example
Definition
If X and Y are discrete, the conditional probability function of X given
Y = y is
    fX|Y(x|y) = P(X = x|Y = y) = fX,Y(x, y) / fY(y).
Similarly,
    fY|X(y|x) = P(Y = y|X = x) = fX,Y(x, y) / fX(x).
Definition
If X and Y are continuous, the conditional density function of X given
Y = y is
    fX|Y(x|y) = fX,Y(x, y) / fY(y).
Similarly,
    fY|X(y|x) = fX,Y(x, y) / fX(x).
Example
y
0 1 2 fX (x)
0 0.1 0.2 0.2 0.5
x
1 0.2 0.2 0.1 0.5
fY (y) 0.3 0.4 0.3 1
    P(X = 0|Y = 0) = 0.1/0.3 = 1/3,
    P(X = 1|Y = 0) = 0.2/0.3 = 2/3,
so that
    x                 0      1
    P(X = x|Y = 0)    1/3    2/3
Also,
    y                 0      1      2
    P(Y = y|X = 0)    1/5    2/5    2/5
Example
    fX,Y(x, y) = (12/7)(x² + xy),   0 < x < 1, 0 < y < 1.
Find fX|Y(x|y) and fY|X(y|x).
    fX|Y(x|y) = (x² + xy) / (1/3 + y/2),   0 < x < 1,
    fY|X(y|x) = (x² + xy) / (x² + x/2),   0 < y < 1.
Result
If X and Y are continuous then
    P(a ≤ Y ≤ b | X = x) = ∫_a^b fY|X(y|x) dy.
Result
If X and Y are discrete then
    P(Y ∈ A | X = x) = Σ_{y∈A} fY|X(y|x).
Similarly,
    E(Y|X = x) = Σ_{all y} y P(Y = y|X = x)     if Y is discrete,
    E(Y|X = x) = ∫_{−∞}^{∞} y fY|X(y|x) dy      if Y is continuous.
Note that this can be thought of as an application of the definition of E(X) from
chapter 1.
Example
Recall the example on page 82, in which
y
0 1 2 fX (x)
0 0.1 0.2 0.2 0.5
x
1 0.2 0.2 0.1 0.5
fY (y) 0.3 0.4 0.3 1
Find E(X|Y = 2) and E(Y |X = 0).
Recall the example from previous section for which it was established that:
    x                 0      1
    P(X = x|Y = 0)    1/3    2/3
Then
    E(X|Y = 0) = 0 × (1/3) + 1 × (2/3) = 2/3.
Example
Recall the example on page 83, in which
    fX,Y(x, y) = (12/7)(x² + xy),   0 < x < 1, 0 < y < 1.
Find E(X|Y).
Recall the example from the previous section, for which it was established that
    fX|Y(x|y) = (x² + xy) / (1/3 + y/2),   0 < x < 1.
where
    E(X²|Y = y) = Σ_{all x} x² P(X = x|Y = y)        if X is discrete,
    E(X²|Y = y) = ∫_{−∞}^{∞} x² fX|Y(x|y) dx         if X is continuous.
Example
Find Var(X|Y = 2) for the discrete data example on page 82.
Recall the example from previous section for which it was established that:
    x                 0      1
    P(X = x|Y = 0)    1/3    2/3
Example
Find Var(X|Y ) for the continuous data example on page 83.
Recall the example from previous section for which it was established that:
    fX|Y(x|y) = (x + y) / (y + 1/2),   0 < x < 1.
Then
    E(X²|Y = y) = ∫₀¹ x² (x + y)/(y + 1/2) dx = (4y + 3)/(12y + 6),
and
    Var(X|Y = y) = (6y² + 6y + 1) / (18(2y + 1)²).
Definition
Random variables X and Y are independent if and only if, for all x and y,
    fX,Y(x, y) = fX(x) fY(y).
Result
Random variables X and Y are independent if and only if for all x, y
fY |X (y|x) = fY (y)
or
fX|Y (x|y) = fX (x).
This result allows an interpretation that conforms with the ‘every day’ meaning of
the word independent. If X and Y are independent, then the probability structure
of Y is unaffected by the ‘knowledge’ that X takes on some value x (and vice versa).
Result
If X and Y are independent,
Example
y
−1 0 1 fX (x)
0 0.01 0.02 0.07 0.1
x 1 0.04 0.13 0.33 0.5
2 0.05 0.05 0.3 0.4
fY (y) 0.1 0.2 0.7 1
that is, every entry in the body of the table equals the product of the corresponding
row and column totals.
X and Y are not independent if fX,Y (x, y) ̸= fX (x)fY (y) for at least one pair of
values x, y.
Thus X and Y are not independent in this case since, for example
Example
X and Y have joint probability function
    fX,Y(x, y) = P(X = x, Y = y) = p² (1 − p)^{x+y},   x = 0, 1, . . . ; y = 0, 1, . . .
Then
    fX(x) = P(X = x) = Σ_{all y} fX,Y(x, y) = Σ_{y=0}^{∞} p² (1 − p)^{x+y} = p(1 − p)^x,   x = 0, 1, . . .
Similarly,
    fY(y) = Σ_{x=0}^{∞} p² (1 − p)^{x+y} = p(1 − p)^y,   y = 0, 1, . . .
Example
X and Y have joint density fX,Y(x, y) = 6xy², 0 < x < 1, 0 < y < 1. Then
    fY(y) = ∫₀¹ 6xy² dx = 3y²,   0 < y < 1.
Example
X and Y have joint density fX,Y(x, y) = 10xy², 0 < x < y < 1.
Are X and Y independent? Note that X cannot be larger than Y, so the value X
takes depends on Y (and vice versa).
    fX(x) = ∫ₓ¹ 10xy² dy = (10x/3)(1 − x³),   0 < x < 1,
    fY(y) = ∫₀^y 10xy² dx = 5y⁴,   0 < y < 1.
Result
If X and Y are independent,
Covariance
Definition
The covariance of X and Y is Cov(X, Y) = E[(X − µX)(Y − µY)].
Cov(X, Y ) measures not only how X and Y vary about their means, but also how
they vary together linearly. Cov(X, Y ) > 0 if X and Y are positively associated,
i.e. if X is likely to be large when Y is large and X is likely to be small when Y is
small. If X and Y are negatively associated, Cov(X, Y ) < 0.
Results
1. Cov(X, X) = Var(X)
2. Cov(X, Y ) = E(XY ) − µX µY .
Result
If X and Y are independent then Cov(X, Y ) = 0.
Results
Hence
Correlation
Definition
The Correlation between X and Y is
    Corr(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y)).
Definition
If Corr(X, Y ) = 0, then X and Y are said to be uncorrelated.
Independent random variables are uncorrelated, but uncorrelated variables are not
necessarily independent; for example, if X has a distribution which is symmetric
about zero and Y = X², then Cov(X, Y) = E(X³) − E(X)E(X²) = 0, so X and Y are
uncorrelated even though Y is a function of X.
Results
1. |Corr(X, Y)| ≤ 1.
Write ϱ = Corr(X, Y), where σX² = Var(X) and σY² = Var(Y). Then
    0 ≤ Var(X/σX + Y/σY) = 2 + 2 Cov(X/σX, Y/σY) = 2(1 + ϱ),
∴ ϱ ≥ −1. Also, 0 ≤ Var(X/σX − Y/σY) = 2(1 − ϱ), so ϱ ≤ 1.
2. If ϱ = −1, Var(X/σX + Y/σY) = 2(1 + ϱ) = 0. This means that X/σX + Y/σY is a
constant, i.e. P(X/σX + Y/σY = c) = 1 for some constant c. But
    X/σX + Y/σY = c  ⇐⇒  Y = −(σY/σX) X + c σY,
so P(Y = a + bX) = 1 for some constants a = cσY and b = −σY/σX < 0.
3. Similarly, for ϱ = 1, P(Y = a + bX) = 1 for some constant a and b = σY/σX > 0.
The most commonly used special type of bivariate distribution is the bivariate nor-
mal.
X and Y have the bivariate normal distribution if
    fX,Y(x, y) = 1/(2πσXσY√(1 − ϱ²)) ×
        exp{ −1/(2(1 − ϱ²)) [ ((x − µX)/σX)² − 2ϱ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² ] }
Result
1. X ∼ N(µX, σX²),
2. Y ∼ N(µY, σY²).
Result
ϱ = Corr(X, Y )
The bivariate normal density is a bivariate function fX,Y (x, y) with elliptical con-
tours. The following figure provides contour plots of the bivariate normal density
for
µX = 3, µY = 7, σX = 2, σY = 5
[Figure: contour plots of four bivariate normal densities with µX = 3, µY = 7, σX = 2, σY = 5.]
• X and Y uncorrelated.
Result
If X and Y are uncorrelated jointly normal variables, then X and Y are
independent.
Note that this is a special exception to the rule given on page 90 in the case of normal
random variables. You will prove this result in a Chapter 3 exercise.
Extension to n Random Variables
All of the definitions and results in this chapter extend to the case of more than two
random variables. For the general case of n random variables we now give some of
the most fundamental of these.
Definition
The joint probability function of X1, . . . , Xn is
    fX1,...,Xn(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn).
Definition
The joint cdf in both the discrete and continuous cases is
    FX1,...,Xn(x1, . . . , xn) = P(X1 ≤ x1, . . . , Xn ≤ xn).
Definition
The joint density of X1, . . . , Xn is
    fX1,...,Xn(x1, . . . , xn) = ∂ⁿ/(∂x1 · · · ∂xn) FX1,...,Xn(x1, . . . , xn).
Definition
X1, . . . , Xn are independent if
    fX1,...,Xn(x1, . . . , xn) = fX1(x1) · · · fXn(xn)
or
    FX1,...,Xn(x1, . . . , xn) = FX1(x1) · · · FXn(xn).
Chapter 4
Transformations
Definition
If X is a random variable, Y = h(X) for some function h is a transforma-
tion of X.
We often wish to transform a variable, for one of a few different reasons. One reason
is to calculate some summary statistic of interest to us. An example we will consider in
later chapters is the average of a sample (the “sample mean”): we observe n random
variables X1, X2, . . . , Xn and we are interested in their average X̄ = (1/n) Σ_{i=1}^{n} Xi. This
can be considered as a transformation of X1, X2, . . . , Xn. Another reason we
might want to transform a random variable is because the variable we measure is
not the one of primary interest. An example of this, introduced in a previous chapter,
is the measurement of tree circumference X, when what we are primarily interested
in is basal area, π(X/(2π))² = X²/(4π). Yet another example, often encountered in applied
statistics, is to transform a random variable in order to improve its properties (e.g.
to reduce skewness).
In this chapter, we will learn key methods for deriving the probability or density
function of a transformation of a random variable.
For further reading, consider Hogg et al (2005) sections 1.6.1, 1.7.1 and 2.2, or Rice
(2007) sections 2.3 and 3.6.
First, we will start with the discrete case.
Result
For discrete X,
    fY(y) = P(Y = y) = P{h(X) = y} = Σ_{x : h(x)=y} P(X = x).
Example
Suppose X has probability function
    x        −1     0      1      2
    fX(x)    1/8    1/4    1/2    1/8
and let Y = X². Then Y has probability function
    y        0      1      4
    fY(y)    1/4    5/8    1/8
Now we will consider the continuous case. The density function of a transformed
continuous variable is simple to determine when the transformation is monotonic.
Definition
Let h be a real-valued function defined over the set A where A is a subset of
R. Then h is a monotonic transformation if h is either strictly increasing
or strictly decreasing over A.
Example
For the following functions, classify them as monotonic or non-monotonic, for the
domain x ∈ R. If the function is non-monotonic, specify a domain over which this
function will be monotonic.
1. h1(x) = x²
2. h2(x) = 7(x − 4)³
3. h3(x) = sin(x)
Consider the function h1 (x) = x2 . Then h1 is not monotonic over the whole real
line R. However h1 is monotonic over the positive half-line {x : x > 0}. It is also
monotonic over the negative half-line {x : x < 0}
The function h2 (x) = 7(x − 4)3 is monotonic over R.
Result
For continuous X, if h is monotonic over the set {x : fX (x) > 0} then
    fY(y) = fX(x) |dx/dy| = fX(h⁻¹(y)) |dx/dy|,   where x = h⁻¹(y).
Example
fX (x) = 3x2 , 0 < x < 1.
Find fY (y) where Y = 2X − 1.
    x = (y + 1)/2,   dx/dy = 1/2,
and
    fY(y) = fX((y + 1)/2) × (1/2) = (3/8)(y + 1)²,   −1 < y < 1.
Example
Let X ∼ Exponential(β), i.e. fX(x) = (1/β) e^{−x/β}, x > 0; β > 0.
Example
    fX(x) = √(2/π) e^{−x²/2},   x > 0.
Let Y = X²/2. Find the density function of Y.
Then x = √(2y), dx/dy = (2y)^{−1/2} and
    fY(y) = fX(x) |dx/dy| = √(2/π) e^{−y} |(2y)^{−1/2}| = e^{−y}/√(πy),   y > 0.
Note: Γ(1/2) = ∫₀^∞ e^{−y} y^{−1/2} dy = √π.
Linear Transformations
h(x) = ax + b for a ̸= 0.
Result
For a continuous random variable X, if Y = aX +b is a linear transformation
of X with a ̸= 0, then
    fY(y) = (1/|a|) fX((y − b)/a)
for all y such that fX((y − b)/a) > 0.
This result is the formula for densities of location and scale families, discussed in
chapter 1. It follows directly from the result in the previous section for general
monotonic transformations. The implication is that linear transformations only
change the location and scale of a density function. They do not change its shape.
The following figure illustrates this for a bimodal density function fX , and different
choices of the linear transformation parameters.
[Figure: a bimodal density fX(x) and three linearly transformed densities fY(x), each a shifted and/or rescaled version of the same shape, plotted for 2 ≤ x ≤ 10.]
The Probability Integral Transformation
Result
If X is a continuous random variable with cdf FX, then Y = FX(X) ∼ Uniform(0, 1).
Proof: Let Y = FX(X). Then y = FX(x) and dy/dx = fX(x), so
    fY(y) = fX(x) |dx/dy| = fX(x) · (1/fX(x)) = 1,   0 < y < 1.
Example
The Probability Integral Transformation allows for easy simulation of random vari-
ables from any distribution for which the inverse cdf FX−1 is easily computed.
A computer can be used to generate U where U ∼ Uniform(0, 1). If we require an
observation X where X has cdf FX, then we can take X = FX⁻¹(U), since FX⁻¹(U) has cdf FX.
A notable exception is when the cdf cannot be written in closed form, as is the case
for the normal distribution, for example.
Example
We require an observation from the distribution with cdf FX(x) = 1 − e^{−x}, x > 0.
Explain how we can generate a random observation from this distribution. Let
y = FX (x). Then x = FX−1 (y) and y = 1 − e−x =⇒ x = − ln(1 − y), so
FX−1 (y) = − ln(1−y). Thus, if X = FX−1 (U ) = − ln(1−U ) where U ∼ Uniform (0, 1),
then X has cdf FX (x) = 1 − e−x , x > 0.
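A minimal R sketch of this inverse-cdf method for FX(x) = 1 − e^{−x} (the comparison with rexp is just an informal sanity check):
> u = runif(10000)
> x = -log(1 - u)          # X = F^{-1}(U) has cdf 1 - exp(-x)
> mean(x)                  # should be close to 1, the Exponential(1) mean
> qqplot(x, rexp(10000))   # compare with R's own exponential generator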
Bivariate Transformations
If X and Y have joint density fX,Y (x, y) and U is a function of X and Y , we can
find the density of U by calculating FU (u) = P(U ≤ u) and differentiating.
Example
fX,Y (x, y) = 1, 0 < x < 1, 0 < y < 1.
Let U = X + Y . Find fU (u).
    FU(u) = P(X + Y ≤ u)
          = ∫₀^u ∫₀^{u−y} 1 dx dy = u²/2,                          0 < u < 1,
          = 1 − ∫_{u−1}^{1} ∫_{u−y}^{1} 1 dx dy = 2u − u²/2 − 1,   1 < u < 2.
Thus
    fU(u) = u,       0 < u < 1,
    fU(u) = 2 − u,   1 < u < 2.
Result
If U and V are functions of continuous random variables X and Y, then
    fU,V(u, v) = fX,Y(x, y) |J|,
where x and y are expressed in terms of u and v, and
    J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ].
The Jacobian has an interpretation similar to, although more detailed than, that of
the factor dx/dy in the univariate density transformation formula. The transforma-
tions U, V transform a small rectangle, with area δxδy in the x, y plane, into a small
parallelogram with area δxδy/|J| in the u, v plane.
To find fU (u) by bivariate transformation:
Example*
X, Y independent Uniform(0, 1) variables. Find fU(u) for U = X + Y using a bivariate
transformation. Let V = Y, so that X = U − V, Y = V. Then
    ∂x/∂u = 1,  ∂x/∂v = −1,  ∂y/∂u = 0,  ∂y/∂v = 1,
and J = det [ 1  −1 ; 0  1 ] = 1.
Now 0 < x < 1 ⇔ 0 < u − v < 1 and 0 < y < 1 ⇔ 0 < v < 1
[Figure: the region in the (u, v) plane where fU,V > 0, bounded by the lines v = u and v = u − 1, for 0 < v < 1.]
    fU(u) = ∫₀^u 1 dv = u,             0 < u < 1,
    fU(u) = ∫_{u−1}^{1} 1 dv = 2 − u,   1 < u < 2.
Example
    X = (U − V)/2,   Y = (U + V)/2.
Now x = (u − v)/2, y = (u + v)/2 give
    ∂x/∂u = 1/2,  ∂x/∂v = −1/2,  ∂y/∂u = 1/2,  ∂y/∂v = 1/2,
∴ J = det [ 1/2  −1/2 ; 1/2  1/2 ] = 1/2, and
    0 < x < y ⇐⇒ 0 < (u − v)/2 < (u + v)/2 ⇐⇒ 0 < v < u,
    y < 1 ⇐⇒ (u + v)/2 < 1 ⇐⇒ u + v < 2.
[Figure: the triangular region 0 < v < u, u + v < 2 in the (u, v) plane where fU,V > 0, bounded by v = u and u + v = 2.]
    ∴ fU,V(u, v) = (3(u + v)/2) |1/2| = (3/4)(u + v),   0 < v < u, u + v < 2.
For 0 < u < 1,  fU(u) = ∫₀^u (3/4)(u + v) dv = 9u²/8.
For 1 < u < 2,  fU(u) = ∫₀^{2−u} (3/4)(u + v) dv = 3/2 − 3u²/8.
As an aside, note that fV(v) = ∫_v^{2−v} (3/4)(u + v) du = (3/2)(1 − v²),   0 < v < 1.
Example
The lifetimes X and Y of two brands of components of a system are independent
with density functions fX and fY. Suppose we want the density of the ratio U = Y/X.
Let V = X, so that X = V, Y = UV, and if x = v, y = uv,
    ∂x/∂u = 0,  ∂x/∂v = 1,  ∂y/∂u = v,  ∂y/∂v = u,  so
    J = det [ 0  1 ; v  u ] = −v.
Now x > 0 ⇐⇒ v > 0 and y > 0 ⇐⇒ uv > 0 =⇒ u > 0 since v > 0.
Example
Suppose X denotes the total time from arrival to exit from a service queue and Y
denotes the time spent in the queue before being served. Suppose also that we want
the density of U = X − Y, the amount of time spent being served, when
    fX,Y(x, y) = e^{−x},   0 < y < x < ∞.
Now U = X − Y . Let V = Y .
Find the density function of U , using a bivariate transformation.
Then
X = U + V , Y = V and if x = u + v, y = v,
then ∂x/∂u = 1, ∂x/∂v = 1, ∂y/∂u = 0, ∂y/∂v = 1,
∴ J = det [ 1  1 ; 0  1 ] = 1, and
0 < y < x < ∞ ⇐⇒ 0 < v < u + v < ∞
=⇒ v > 0 , u > 0 and u + v < ∞ =⇒ u < ∞, v < ∞.
fU,V (u, v) = e−(u+v) , u > 0, v > 0
Thus the time spent being served, U , and the time spent in the queue before service,
V , are independent random variables. Also,
Multivariate Transformations
Something similar holds for higher dimensions. We give the formula without proof.
Suppose we are given the n-dimensional integral
    ∫ · · · ∫_R f(x) dx.
Then, if we change the variable from x to z = g(x), where g maps the region R onto S,
we obtain
Change of variable formula
    ∫ · · · ∫_R f(x) dx = ∫ · · · ∫_S f(g⁻¹(z)) |det(J_{g⁻¹})(z)| dz,
where | det(Jg−1 )(z)| stands for the absolute value of the determinant of the Jacobian
matrix of the transformation g−1 evaluated at z.
The Jacobian matrix of a transformation g is defined as the matrix of partial deriva-
tives
    Jg(x) = [ ∂g1/∂x1(x)  · · ·  ∂g1/∂xn(x) ; ... ; ∂gn/∂x1(x)  · · ·  ∂gn/∂xn(x) ].
Note that |det(J_{g⁻¹})(z)| plays the same role as (g⁻¹(z))′ in the one-dimensional case
and satisfies
    |det(J_{g⁻¹})(z)| = 1 / |det(Jg)(x)|,
just like 1/g′(x) = 1/g′(g⁻¹(z)) = (g⁻¹(z))′ in the one-dimensional change of variable case.
As a result of this
Example
Suppose X ∈ ℝⁿ has pdf fX(x) and we transform the variable from X to Z using
Z = g(X). Then the pdf of Z is
    fZ(z) = fX(g⁻¹(z)) |det(J_{g⁻¹}(z))|.
Proof: We can write for any sufficiently nice region A ⊆ S that maps to B = g−1 (A)
under the transformation g:
    ∫_A fZ(z) dz = P(Z ∈ A)                                       by definition of pdf of Z
                 = P(g(X) ∈ A)
                 = P(X ∈ g⁻¹(A))
                 = ∫_{g⁻¹(A)} fX(x) dx                            by definition of pdf of X
                 = ∫_A fX(g⁻¹(z)) |det(J_{g⁻¹}(z))| dz            by change of variable formula
This suggests that the integrands are equivalent in some sense, because they give us
the same integral for any region A. Without getting into technicalities, we can see
why it is plausible that fZ(z) = fX(g⁻¹(z)) |det(J_{g⁻¹}(z))|.
Example
Suppose again that X ∈ ℝⁿ has pdf fX(x), and let Z = g(X) = AX for an invertible
n × n matrix A. Then g⁻¹(z) = A⁻¹z and
    J_{g⁻¹}(z) = A⁻¹,
so that:
Result
    fZ(z) = fX(A⁻¹z) |det(A⁻¹)| = fX(A⁻¹z) / |det(A)|.
You may use this transformation formula in your assignment without giving a proof
there. Note that we already know the special case of this formula when A = a is a
scalar and Z = aX:
    fZ(z) = fX(a⁻¹z) |det(a⁻¹)| = (1/|a|) fX(z/a).
Sums of Independent Random Variables
Result
Suppose that X and Y are independent random variables taking only non-
negative integer values, and let Z = X + Y . Then
    fZ(z) = Σ_{y=0}^{z} fX(z − y) fY(y),   z = 0, 1, . . .
Proof:
fZ (z) = P(X + Y = z)
= P(X = z, Y = 0) + P(X = z − 1, Y = 1) + . . . + P(X = 0, Y = z)
= P(X = z)P(Y = 0) + P(X = z − 1)P(Y = 1) + . . . + P(X = 0)P(Y = z)
= fX (z)fY (0) + fX (z − 1)fY (1) + . . . + fX (0)fY (z)
= Σ_{y=0}^{z} fX(z − y) fY(y).
Example
Example*
X ∼ Poisson (λ1 ), Y ∼ Poisson (λ2 ).
    P(X = k) = e^{−λ1} λ1^k / k!,   k = 0, 1, 2, . . .
    P(Y = k) = e^{−λ2} λ2^k / k!,   k = 0, 1, 2, . . .
Find the probability function of Z = X + Y.
    fX+Y(z) = Σ_{k=0}^{z} (e^{−λ1} λ1^k / k!) · (e^{−λ2} λ2^{z−k} / (z − k)!)
            = (e^{−(λ1+λ2)} / z!) Σ_{k=0}^{z} (z choose k) λ1^k λ2^{z−k}
            = e^{−(λ1+λ2)} (λ1 + λ2)^z / z!,   z = 0, 1, 2, . . . ,
by the binomial theorem, so Z = X + Y ∼ Poisson(λ1 + λ2).
Result
Suppose X and Y are independent continuous variables with X ∼ fX (x)
and Y ∼ fY (y). Then Z = X + Y has density
    fZ(z) = ∫_{all possible y} fX(z − y) fY(y) dy.
Proof:
    FZ(z) = P(X + Y ≤ z) = ∫_{all possible y} FX(z − y) fY(y) dy   (using independence).
To complete the proof we differentiate with respect to z in order to obtain the density
function fZ(z):
    fZ(z) = d/dz FZ(z) = ∫_{all possible y} fX(z − y) fY(y) dy.
Example
X and Y are independent variables and fX (x) = e−x , x > 0, fY (y) = e−y , y > 0.
Find the density function of Z = X + Y .
    Z = X + Y ∼ fZ(z) = ∫_{−∞}^{∞} fX(z − y) fY(y) dy.
If z > 0,
    fX(z − y) fY(y) = 0 for y < 0 and y > z,   and
    fX(z − y) fY(y) = e^{−(z−y)} · e^{−y} = e^{−z} for 0 < y < z.
∴ fZ(z) = ∫₀^z e^{−z} dy = z e^{−z},   z > 0.
Note that the answer is the density function of a Gamma(2,1) random variable.
Thus the sum of two independent exponential(1) random variables is a Gamma(2,1)
variable.
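This can be checked informally by simulation; a minimal R sketch:
> z = rexp(100000) + rexp(100000)        # sum of two independent Exponential(1)'s
> hist(z, breaks = 50, freq = FALSE)
> curve(dgamma(x, shape = 2, rate = 1), add = TRUE)  # overlay the Gamma(2,1) density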
Example*
Y1 and Y2 are independent variables and Y1 ∼ N (0, 1), Y2 ∼ N (0, 1).
Find the density of Z = Y1 + Y2. Recall that the N(0, 1) density is
    fX(x) = (1/√(2π)) e^{−x²/2},   −∞ < x < ∞,
so
    fY1(z − y) = (1/√(2π)) e^{−(z−y)²/2}   and   fY2(y) = (1/√(2π)) e^{−y²/2},   −∞ < y < ∞.
∴ fZ(z) = ∫_{−∞}^{∞} (1/√(2π)) e^{−(z−y)²/2} · (1/√(2π)) e^{−y²/2} dy
        = (1/(2π)) ∫_{−∞}^{∞} e^{−z²/2 + zy − y²} dy
        = (e^{−z²/2}/(2π)) ∫_{−∞}^{∞} e^{−(y² − zy + z²/4) + z²/4} dy
        = (e^{−z²/4}/(2π)) ∫_{−∞}^{∞} e^{−(y − z/2)²} dy
        = (e^{−z²/4}/(2π)) √π = (1/(2√π)) e^{−z²/4},   −∞ < z < ∞.
Thus Z ∼ N (0, 2). More generally, we can show that the sum of any two normal
random variables is also normal.
Example*
X ∼ Gamma(α1, 1) independently of Y ∼ Gamma(α2, 1), so that
    fX(x) = e^{−x} x^{α1−1} / Γ(α1),   x > 0; α1 ≥ 1,
    fY(y) = e^{−y} y^{α2−1} / Γ(α2),   y > 0; α2 ≥ 1.
First we will find the limits on the convolution integral, that is, the range of values of y for
which fX(z − y) and fY(y) are both non-zero. Both X and Y are only defined over
positive values, so the range of values of y of interest to us satisfies both z − y > 0
and y > 0, i.e. 0 < y < z. In particular,
    fX(z − y) = e^{−(z−y)} (z − y)^{α1−1} / Γ(α1)   for z − y > 0, i.e. y < z.
[Figure: for z > 0, a number line for y showing fY(y) > 0 for y > 0 and fX(z − y) > 0 for y < z, so both are positive only for 0 < y < z.]
Note: the Beta(α1, α2) density is x^{α1−1} (1 − x)^{α2−1} / B(α1, α2).
Result
Suppose that X and Y are independent random variables with moment gener-
ating functions mX and mY. Then mX+Y(u) = mX(u) mY(u).
Example
Recall from Chapter 2 that if X ∼ N(µ, σ²) then mX(u) = e^{µu + σ²u²/2}.
Use this result to find the distribution of Z = Y1 + Y2 where Y1 ∼ N (0, 1) indepen-
dently of Y2 ∼ N (0, 1).
More generally,
Result
If X1, X2, . . . , Xn are independent random variables, then Σ_{i=1}^{n} Xi has mo-
ment generating function
    m_{Σ_{i=1}^{n} Xi}(u) = Π_{i=1}^{n} mXi(u).
i=1
Proof:
    m_{Σ_{i=1}^{n} Xi}(u) = E(e^{u Σ_{i=1}^{n} Xi}) = E(Π_{i=1}^{n} e^{uXi}) = Π_{i=1}^{n} E(e^{uXi}) = Π_{i=1}^{n} mXi(u),
where the third equality uses independence.
This offers us a useful approach for deriving the distribution of the sum of inde-
pendent random variables, using the 1-1 correspondence between distributions and
moment generating functions. For this approach to work however we need to be
able to recognise the distribution of the sum from its moment generating function.
Example
X1 , X2 , . . . , Xn independent Bernoulli (p) random variables.
P
Use moment generating functions to show that ni=1 Xi ∼ Binomial(n, p).
Each Xi has probability function
    mX(u) = E(e^{uX}) = Σ_{x=0}^{1} e^{ux} P(X = x)
          = e^{u·0} P(X = 0) + e^{u·1} P(X = 1)
          = 1 − p + pe^u.
Therefore, the moment generating function of the sum Σ_{i=1}^{n} Xi is
    m_{Σ_{i=1}^{n} Xi}(u) = Π_{i=1}^{n} mXi(u) = (1 − p + pe^u)^n,
which is the mgf of a Binomial(n, p) random variable. Thus we can conclude that
Σ_{i=1}^{n} Xi ∼ Binomial(n, p).
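A quick simulation check of this conclusion (n = 10 and p = 0.3 are arbitrary illustrative values):
> n = 10; p = 0.3
> sums = replicate(100000, sum(rbinom(n, size = 1, prob = p)))  # sum of n Bernoulli(p) draws
> table(sums)/100000              # empirical probability function of the sum
> round(dbinom(0:n, n, p), 4)     # Binomial(n, p) probability function for comparison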
Example
X1 , X2 , . . . , Xn independent random variables with Xi ∼Poisson (λi ).
Pn
Find the mgf of Xi and hence deduce the distribution of i=1 Xi .
    mX(u) = E(e^{uX}) = Σ_{x=0}^{∞} e^{ux} · e^{−λ} λ^x / x!
          = e^{−λ} Σ_{x=0}^{∞} (λe^u)^x / x!
          = e^{−λ} · e^{λe^u} = e^{λ(e^u − 1)},
and Σ_{i=1}^{n} Xi has mgf
    m_{Σ_{i=1}^{n} Xi}(u) = E(e^{u Σ_{i=1}^{n} Xi}) = Π_{i=1}^{n} E(e^{uXi}) = e^{(Σ_{i=1}^{n} λi)(e^u − 1)}.
This has the form of the Poisson mgf as derived above, but now with parameter Σ_{i=1}^{n} λi.
    ∴ Σ_{i=1}^{n} Xi ∼ Poisson(Σ_{i=1}^{n} λi).
It can be shown (again using moment generating functions) that sums of independent
normal random variables are also normal; with the means and variances added.
Result
If X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) are independent then
    X + Y ∼ N(µX + µY, σX² + σY²).
Result
If for 1 ≤ i ≤ n, Xi ∼ N (µi , σi2 ) are independent then for any set of
constants a1 , . . . , an ,
    Σ_{i=1}^{n} ai Xi ∼ N( Σ_{i=1}^{n} ai µi ,  Σ_{i=1}^{n} ai² σi² ).
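An illustration of this result by simulation (the constants and parameters below are arbitrary choices for the sketch, not taken from the notes):
> x1 = rnorm(100000, mean = 1, sd = 2)   # X1 ~ N(1, 4)
> x2 = rnorm(100000, mean = 3, sd = 5)   # X2 ~ N(3, 25)
> u = 2*x1 - 3*x2                        # should be N(2*1 - 3*3, 4*4 + 9*25) = N(-7, 241)
> mean(u); var(u)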
Example
Suppose that
Using known results for sums of normal random variables, find the distribution of
U.
Linear combinations of normal random variables are normal. So U is normal with
mean
and
In summary,
U ∼ N(−15, 425).
Now we’ve met some common methods for modelling data, it’s time to think about
how to collect data in such a way that we can use these models. Whether or not
it is appropriate to do statistical modelling, using tools from the previous chapters,
depends critically on how the data were collected. For example, we have been
treating data as random variables. For this to be a reasonable thing to do, we need
to sample randomly in order to introduce randomness.
In this section we will work through some key ideas concerning the collection of data.
We will also meet the important idea that an appropriate study design enables the
examination of the properties of the statistics we calculate from samples, using
random variables.
Survey design
When collecting data in a survey it is critical to try to collect data that is repre-
sentative and random.
Representativeness
When we collect a sample from a population, typically we would like to use this
sample to make inferences about some property of the population at large. However,
this is only reasonable if the sample is representative of the population. If
this is not achieved then inferences about the population can be completely wrong.
We can formally define representativeness as below.
Definition
Consider a sample X1 , . . . , Xn from a random variable X which has proba-
bility or density function fX (x).
The sample is said to be representative if:
Example
“The poll that changed polling” http://historymatters.gmu.edu/d/5168
The Literary Digest correctly predicted the outcomes of each of the 1916-1932 US
presidential elections by conducting polls. These polls were a lucrative venture for
the magazine – readers liked them; they got a lot of news coverage; and they were
linked to subscription renewals. The 1936 postal card poll claimed to have asked
one fourth of the nation’s voters which candidate they intended to vote for. Based
on more than 2,000,000 returned post cards, it issued its prediction: Republican
presidential candidate Alfred Landon would win in a landslide. But this “largest
poll in history” couldn’t have been more wrong – the Democrat Roosevelt won
the election by the largest margin in history! (Roosevelt got more than 60% of the
vote, but was predicted to get only 43%.) The Literary Digest lost a lot of credibility
from this result and was soon discontinued.
The result was correctly predicted by a new pollster, George Gallup, based on
just 50,000 voters selected in a representative fashion and interviewed face-to-face.
Gallup not only predicted the election result, but before the Literary Digest poll
was released, he correctly predicted that it would get it wrong! This election made
“Gallup polls” famous, and formed a template for polling methods ever since.
What went wrong in the Literary Digest poll?
While some have claimed that more Republicans were Literary Digest subscribers,
an American Statistician article claimed that voluntary response was the main issue.
How do you ensure a sample is representative? One way to ensure this is to take a
simple random sample from the population of interest, as below.
Random samples
Definition
A random sample of size n is a set of random variables
    X1, . . . , Xn
that are mutually independent and all have the same distribution.
We often say that the Xi are iid (independently and identically distributed).
Example
Sampling with replacement
Consider sampling a variable X in a population of 10 subjects, which take the
following (sorted) values:
2 4 5 7 8 10 14 17 27 35
We sample three subjects randomly (with equal sampling probability for each sub-
ject), with replacement. Let these values be X1 , X2 and X3 .
Show that X1 , X2 and X3 are iid.
• Obtain a list of all subjects in the population, and assign each subject a number
from 1 to N
(The sample function generates N random numbers, assigns one to each subject,
then includes in the sample the n subjects with smallest n random numbers.)
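A sketch of how this could be done in R using the sample function (the population values below are just the earlier example dataset, used here as an illustrative population; sample draws without replacement by default):
> pop = c(2, 4, 5, 7, 8, 10, 14, 17, 27, 35)  # illustrative population values
> n = 3
> sample(pop, size = n)    # a simple random sample of size n, without replacement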
Strictly speaking, a simple random sample does not consist of iid random variables
– they are identically distributed, but they are dependent, since knowledge that
Xi = xi makes it less likely that Xj = xi because the ith subject can only be included
in the sample once. However this dependence is very weak when the population size
N is large compared to the sample size n (e.g. if N > 100n) and so in most instances
it can be ignored. See MATH3831 for “finite sample” survey methods when this
dependence cannot be ignored.
It is important in surveys, wherever possible, to ensure sampling is random. This is
important for a few reasons:
• It ensures the n values in the sample are iid, which is an important assumption
of most methods of statistical inference (as in the coming chapters).
• Random sampling removes selection bias – the choice of who is included in the
study is taken away from the experimenter, hence it is not possible for them
to (intentionally or otherwise) manipulate results through choice of subjects.
• Randomly sampling from the population of interest guarantees that the sample
is representative of the population.
Unfortunately, it is typically very hard to obtain a simple random sample from the
population of interest, so the best we can hope for is a “good approximation”.
Example
Consider polling NSW voters to predict the result of a state election. You do not
have access to the list of all registered voters (for privacy reasons).
How would you sample NSW voters?
It is difficult to think of a method that ensures a representative sample!
Definition
Let X1 , . . . , Xn be a random sample. A statistic is any real-valued function
of the random sample.
• X̄ = (1/n) Σ_{i=1}^{n} Xi, the sample mean
• S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)², the sample variance
A key advantage of random sampling is that the fact that the sample is random
implies that any statistic calculated from the sample is random. Hence we
can treat statistics as random variables and study their properties. Further, the iid
property makes it a lot easier to derive some important properties of statistics, as
in the important examples below.
    E(X̄) = µ   and   Var(X̄) = σ²/n,
and, for a sample proportion p̂ based on n independent Bernoulli(p) observations,
    E(p̂) = p   and   Var(p̂) = p(1 − p)/n.
Note that while the variance results given above require the variables to be iid
(hence a random sample), the expectation results only require the observations in
the sample to be identically distributed.
Because of the property that E(X̄) = µ, we say that sample means of random
samples are unbiased (we will come back to this later on). Similarly, pb is unbiased.
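These properties can be illustrated by simulation; a minimal R sketch with arbitrary choices µ = 2, σ = 3 and n = 20:
> xbars = replicate(50000, mean(rnorm(20, mean = 2, sd = 3)))
> mean(xbars)    # close to mu = 2
> var(xbars)     # close to sigma^2/n = 9/20 = 0.45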
There are many methods of survey sampling beyond taking a simple random sample.
Key considerations when choosing a sampling scheme are efficiency and effort –
ideally we would like an estimate that gives us a good estimate relative to the effort
invested in sampling.
Definition
Consider two unbiased alternative statistics, denoted as g(X1 , . . . , Xn )
and h(Y1 , . . . , Ym ). We say that g(X1 , . . . , Xn ) is more efficient than
h(Y1, . . . , Ym) if Var{g(X1, . . . , Xn)} < Var{h(Y1, . . . , Ym)}.
Note that the above notation implies that not only can the statistics that we are
using differ (g(·) vs h(·)), but the observations used to calculate the statistics can also
differ (X1 , . . . , Xn vs Y1 , . . . , Ym ). This reflects that there are two ways to achieve
efficiency – use a different statistic (discussed in later chapters) or by sampling
differently. The most obvious way that sampling differently can increase efficiency
is by increasing the sample size, but even for a fixed sample size (n = m) efficiency
varies with sampling method. Below are three common methods of sampling – for
more, and a deeper study of their properties, see MATH3831.
Stratified sampling This is useful when the population splits into strata across which
the variable of interest differs, e.g. a variable whose distribution
varies considerably with age, so a good survey design would involve sampling
separately within age strata (if possible).
Cluster sampling This is useful when subjects in the population arise in clusters,
and it takes less effort to sample within clusters than across clusters. Effort-
per-subject can be reduced by sampling clusters and then measuring all (or
sampling many) subjects within a cluster. e.g. Face-to-face interviews with
100 NSW household owners – it is easier logistically to sample ten postcodes,
then sample ten houses in each postcode, than to travel to a random sample
100 households spread across (potentially) 100 NSW postcodes!
Example
Consider estimating the average heart rate of students, µ. Males and females are
known to have different heart rates, µM and µF , but the same variance σ 2 .
Consider estimating the mean µ using a stratified random sample, as follows:
• Take a simple random sample of n male students and a simple random sample of
n female students, and compute the two sample means X̄M and X̄F.
• Since males and females occur with (approximately) equal frequency in the
student population, we can estimate the overall mean heart rate as
    X̄s = (X̄M + X̄F)/2.
1. Find Var(X̄s ).
2. Show that the marginal variance of heart rate across the student population
(ignoring gender) is Var(X) = σ 2 + (µM − µF )2 /4.
3. Hence show that stratified random sampling is more efficient than using a
simple random sample of size 2n, if µM ̸= µF .
But if X is the heart rate of an arbitrarily selected student (so that there is equal
chance of being male or female), then we are given that E[X | M] = µM and E[X | F] = µF, so that
    E[X] = E[X | M] P(M) + E[X | F] P(F) = (µM + µF)/2 = µ,
and similarly for the variance
    Var(X) = E(X − µ)² = E[(X − µ)² | M] P(M) + E[(X − µ)² | F] P(F)
           = (1/2) E(XM − µ)² + (1/2) E(XF − µ)²
           = (1/2) E(XM − µM + µM − µ)² + (1/2) E(XF − µF + µF − µ)²
           = σ² + (1/2)(µM − µ)² + (1/2)(µF − µ)²
           = σ² + (µM − µF)²/4.
Hence, if X̄ = (1/(2n)) Σ_{i=1}^{2n} Xi (that is, the sample mean based on 2n observations),
    Var(X̄) = σ²/(2n) + (µM − µF)²/(8n) > σ²/(2n) = Var(X̄s).
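A hedged simulation sketch of this comparison (µM = 80, µF = 70, σ = 10 and n = 25 are illustrative values only):
> muM = 80; muF = 70; sigma = 10; n = 25
> xbar.s = replicate(20000, (mean(rnorm(n, muM, sigma)) + mean(rnorm(n, muF, sigma)))/2)
> gender = function() sample(c(muM, muF), 2*n, replace = TRUE)   # random group means
> xbar = replicate(20000, mean(rnorm(2*n, mean = gender(), sd = sigma)))
> var(xbar.s)   # close to sigma^2/(2n) = 2
> var(xbar)     # close to sigma^2/(2n) + (muM - muF)^2/(8n) = 2.5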
Design of experiments
Often in science we would like to demonstrate causation, e.g. does smoking cause
learning difficulties? Does praying for a patient cause a better recovery following
heart surgery?
While surveying a population often provides valuable information, it is very diffi-
cult to demonstrate causation based on just observing an association between two
variables. The reason for this is that lurking variables can induce an association
between X and Y when there is actually no causal relationship, or when the causal
relationship has a completely different nature to what we observe.
Example
Student survey results demonstrate that students who walk to UNSW take a lot less
time than students who use public transport!
Does this mean that walking to UNSW is faster than using public transport?
i.e. Should we all walk to UNSW to save time?
Definitions
An observational study (or survey) is a study in which we observe vari-
ables (X, Y ) on subjects without manipulating them in any way.
An experiment is a study in which subjects are manipulated in some way
(changing X) and we observe their response (Y ).
Example
For each of the following experiments, what is X (the treatment variable) and Y
(the response variable)?
1. The great prayer experiment (popularised in Richard Dawkins’ book “The
God Delusion”)
Does praying for a patient influence their recovery from illness? A clinical trial was
conducted to answer this question (Benson et al. 2006, published in the American
Heart Journal ), in which 1201 patients due to undergo coronary bypass surgery were
randomly assigned to one of two groups – a group to receive daily prayers for 14
days following surgery, and a group who received no prayers. The study was double-
blind, meaning that neither the patient nor anyone treating them knew whether or
not they were being prayed for. The outcome variable of interest was whether or
not each patient had any complications during the first 30 days following surgery.
2. A guinea pig experiment
Does smoking while pregnant affect the cognitive development of the foetus?
Johns et al (1993) conducted a study to look at this question using guinea pigs as a
model. They injected nicotine tartate in a saline solution into ten pregnant guinea
pigs, injected saline solution with no nicotine into ten other pregnant guinea pigs,
and compared the cognitive development of offspring by getting them to complete
a simple maze where they look for food. “Cognitive development” was measured as
the number of errors made by the offspring when looking for food in a maze.
Note that both the above experiments (indeed any good experiment) is designed so
that the only thing allowed to vary across groups is the treatment variable of interest
(X) – so if a significant effect is detected in Y , the only plausible explanation would
be that it was caused by X.
Compare two or more treatment groups (where treatments are experimentally in-
duced changes in the value of X). These groups need to be carefully designed
so that the only thing that differs across groups is the treatment variable
X. Double-blinding is often used for this reason (e.g. the prayer experiment),
as is a “placebo” or “sham treatment” (e.g. saline-only injections in the guinea
pig experiment).
Randomise the allocation of subjects to treatment groups. This ensures that any
differences across groups, apart from those caused by treatment, are governed
by chance (which we can then model!).
Repeat the application of the treatment to the different subjects in each treat-
ment group. It is important that application of treatment is replicated (rather
than applied once, “in bulk”) in order that we can make inferences about the
effect of the treatment in general.
The above points may seem relatively obvious, but they can be difficult to implement
correctly – errors in design can be hard to spot.
Example
What error has been made in each of the following studies?
2. Greg was studying how mites affect the growth of cotton plants. He applied
a mite treatment to eight (randomly chosen) plants by crumpling up a mite-
infested leaf and leaving it at the base of each plant. He applied a no-mite
treatment by not putting any leaves at the base of each of eight “control”
plants.
(Surprisingly, plants in the mite treatment had faster growth!?)
3. The success of a course on road rules was assessed by using the RTA’s practice
driving test:
http://www.rta.nsw.gov.au/licensing/tests/driverknowledgetest/demonstrationdriverknowledgetest
Participants were asked to complete the test before the course, then again af-
terwards, and results were compared.
For more details on the above and other common types of experiment, see MATH3851.
There is an analogy between randomised comparative experiments and simple ran-
dom samples (which treat “all subjects as equal”), and between stratified random
sampling and randomised blocks experiments (which break subjects into blocks/strata
which are expected to differ in response). The terminology used is different but the
concept is similar!
Example
Does regularly taking vitamins guard against illness? Consider two experiments on
a set of 2n subjects:
B. All subjects are given a set of tablets (vitamins or placebo) and asked to take
them daily for three months. They are then given a different set of tablets
(placebo or vitamin, whichever they didn’t have last time) and are asked to
take these for three months. Number of illnesses are recorded and compared
over the two periods.
Let the number of illnesses in the two treatment groups be Yv and Yp . We are
interested in the mean difference in number of illnesses between takers of vitamin
tablets and takers of a placebo, estimated using the sample mean difference Ȳv − Ȳp .
Assume Var(Yv ) = Var(Yp ) = σ 2 .
    Var(Ȳv − Ȳp) = Var(Ȳv) + Var(Ȳp) = 2σ²/n
    Var(Yv − Yp) = Cov(Yv − Yp, Yv − Yp)
                 = Var(Yv) + Var(Yp) − 2 Cov(Yv, Yp)
                 = 2σ² − 2 Corr(Yv, Yp) √(Var(Yv) Var(Yp))
                 = 2σ² − 2 Corr(Yv, Yp) σ²
                 = 2σ²(1 − Corr(Yv, Yp)).
Therefore,
    Var(Ȳv − Ȳp) = (2σ²/n)(1 − Corr(Yv, Yp)).
4. Compared to study A, here we get a reduction in the variance if there is positive
correlation across the study periods (so that 1 − Corr(Yv, Yp) < 1). Note that the variance
is increased if there is negative correlation across the study periods (so that
1 − Corr(Yv, Yp) > 1).
Chapter 5
Convergence of Random Variables
The previous chapter dealt with the problem of finding the density function (or
probability function) of some transformation to a function of one or two random
variables. In practice we are usually interested in some function of many variables
– not just one or two. However, the calculations often become mathematically
intractable in this situation. An example is the problem of finding the exact density
function of the sum of 100 independent U(0, 1) random variables.
Because of the difficulties in obtaining exact results, a large portion of mathematical
statistics is concerned with approximations to density functions. These are based
on convergence results for sequences of random variables.
In this chapter, we will focus on some key convergence results that are useful in
statistics. Some of these (such as the law of large numbers and the central limit
theorem) relate to sums or averages of random variables. These results are particu-
larly useful, because sums and averages are typically used as summary statistics in
quantitative research.
Suggested further reading for this chapter is Chapter 3.7 in Kroese & Chan (2014),
Hogg et al (2005) Sections 4.2-4.4 or Rice (2007) Chapter 5.
Modes of Convergence
Sure convergence
This is the familiar type of convergence as that of a real function fn (x) → f (x) for
some point x. Here we only replace x with ω.
Sure convergence
We say that X1, X2, . . . converges surely to X if Xn(ω) → X(ω) as n → ∞ for every outcome ω ∈ Ω.
Example
If Xn (ω) = X(ω) + n1 , then Xn converges to X surely.
Sure convergence is no different from the convergence in ordinary first year Calculus.
We now introduce a special type of convergence that actually uses the concept of
probability in its definition.
Almost sure convergence: we say that Xn converges almost surely to X, written
    Xn --a.s.--> X,
if P(lim_{n→∞} Xn = X) = 1. It can be shown that this holds if and only if, for every ε > 0,
    lim_{n→∞} P( sup_{k>n} |Xk − X| > ε ) = 0.
Hence, the last limit can be used as an alternative definition of almost sure
convergence.
Here
    An(ε) = {ω : sup_{k>n} |Xk − X| > ε},
where sup is the smallest upper bound and is the same as max when the maximum of
the set exists¹, and
    A(ε) = ∩_{n=1}^{∞} An(ε) = {ω : |X∞ − X| > ε},
    {ω : lim_{n↑∞} Xn ≠ X} = ∪_{m=1}^{∞} A(1/m).
Note that sup_{k>n} |Xk − X| is decreasing in n, as seen from Figure 5.1.
¹ For example, the sup of [0, 1] is 1 and is the same as the maximum of the set. The maximum does
not exist for the set (0, 1), because 1 does not belong to the set, but the sup is still 1, because this
is the smallest number which bounds the set (0, 1) from above.
If Xn --a.s.--> X, then
    lim_{n→∞} P(sup_{k>n} |Xk − X| > ε) = lim_{n→∞} P(An(ε)) = P(lim_{n→∞} An(ε)) = P(A(ε))
        ≤ P(A(1/m))   for any m > ε⁻¹
        ≤ P(∪_{m=1}^{∞} A(1/m)) = P(lim_{n→∞} Xn ≠ X) = 0.
Conversely, if P(A(ε)) = 0 for every ε > 0, then
    P(lim_{n→∞} Xn ≠ X) = P(∪_{m=1}^{∞} A(1/m)) ≤ Σ_{m=1}^{∞} P(A(1/m)) = Σ_{m=1}^{∞} 0 = 0,
where in the last equation we used P(A(1/m)) = P(A(εm)) = 0 for all εm > 0. □
Convergence in Probability
Definition
The sequence of random variables X1 , X2 , . . . converges in probability
to a random variable X if, for all ε > 0, lim_{n→∞} P(|Xn − X| > ε) = 0.
Example
X1 , X2 , . . . are independent Uniform (0, θ) variables. Let Yn = max(X1 , . . . , Xn ) for
n = 1, 2, . . . . Then it can be shown that
    FYn(y) = (y/θ)^n for 0 < y < θ,   and   FYn(y) = 1 for y ≥ θ,
and
    fYn(y) = n y^{n−1} / θ^n,   0 < y < θ.
Show that Yn --P--> θ.
For 0 < ε < θ,
∴ Yn --P--> θ.
In the last example it is actually easy to show that Yn --a.s.--> θ as well. Consider the
fact that for any n,
    θ > Yn+1 > Yn.
Hence, the event supk>n |Yk − θ| > ε implies the event |Yn − θ| > ε and
Example
For n = 1, 2, . . . , Yn ∼ N(µ, σn²), and suppose lim_{n→∞} σn = 0.
Show that Yn --P--> µ.
For any ε > 0,
and
    Xn --a.s.--> 0  ⇐⇒  sup_{k>n} |Xk| --P--> 0.
The essential difference between almost sure convergence and convergence in proba-
bility is that almost sure convergence is a property of the entire sequence X1, X2, . . .,
because the distribution of sup_{k>n} |Xk − X| depends on the joint pdf of Xn, Xn+1, . . . ,
whereas convergence in probability only involves the distribution of each Xn on its own.
The difference can be visualized as follows. In Figure 5.2 we have depicted Xn --a.s.--> 0.
Here, as n gets larger and larger, the probability of the random noodle/snake/path
straying far away from the strip (the band −ε < Xn < ε) vanishes as n ↑ ∞.
In contrast, in Figure 5.3 we have depicted many different realizations of the random
path/snake X1, X2, . . .. Here Xn --P--> 0 means that the proportion of noodles leaving
the strip goes to zero as n ↑ ∞. This does not prevent a particular noodle from
straying far away from the bands. We only want the proportion of these rogue
noodles to get smaller and smaller with increasing n. There is no attempt to control
how far a particular noodle strays from the strip.
The next example is only for interested students and will not be assessable. It
is an example of a sequence of random variables which converges in probability, but
not almost surely.
Example
Consider an independent sequence X1, X2, . . . with P(Xn = 1) = 1/n = 1 − P(Xn = 0).
Then P(|Xn − 0| > ε) = 1/n → 0 for 0 < ε < 1, so Xn --P--> 0. However, it can be
shown that, with probability 1, Xn = 1 for infinitely many n.
Thus, it is not true that Xn --a.s.--> 0, that is,
Convergence in Distribution
The sequence X1, X2, . . . converges in distribution to X, written Xn --d--> X, if
lim_{n→∞} FXn(x) = FX(x) at every point x at which FX is continuous.
This differs subtly (but importantly) from the idea of random variables Xi con-
verging to the random variable X. Convergence in probability is concerned with
whether the actual values of the random variables (the xi ) converge. Convergence
in distribution, in contrast, is concerned with whether the distributions (the FXi (x))
converge.
Convergence in distribution allows us to make approximate probability statements
about Xn , for large n, if we can derive the limiting distribution FX (x).
Example
Suppose that P(Xn = x) = 1/n for x = 1, . . . , n. Set Yn = Xn /n.
Show that Yn --d--> Y ∼ U(0, 1).
Here we have, for 0 ≤ y ≤ 1,
Example
Let X1 , X2 , . . . be independent Bernoulli random variables with success probability
1/2, representing the outcomes of the fair coin tosses. Define new random variables
Y1 , Y2 , . . . as
    Yn = Σ_{k=1}^{n} Xk / 2^k,   n = 1, 2, . . . .
Show that Yn --d--> Y ∼ U(0, 1).
We have
    E e^{−uYn} = Π_{k=1}^{n} E[e^{−uXk/2^k}] = 2^{−n} Π_{k=1}^{n} (1 + e^{−u/2^k}).
Since
    (1 − e^{−u/2^n}) Π_{k=1}^{n} (1 + e^{−u/2^k}) = 1 − e^{−u}   (collapsing product!),
we obtain
    E e^{−uYn} = 2^{−n} (1 − e^{−u}) / (1 − e^{−u/2^n}) = (1 − e^{−u}) · (1/2^n) / (1 − e^{−u/2^n})
              → (1 − e^{−u})/u   as n ↑ ∞,
which is E e^{−uY} for Y ∼ U(0, 1).
Convergence in Lp -norm
Convergence in Lp -mean
The sequence of numerical random variables X1, X2, . . . is said to converge
in Lp-norm to a numerical random variable X, denoted Xn --Lp--> X, if
    E|Xn − X|^p → 0   as n → ∞.
Example
Suppose X1 , X2 , . . . are independent, each with mean µ and variance 0 < σ 2 < ∞.
Show that X̄n --L2--> µ.
We have
    E[(X̄n − µ)²] = Var(X̄n) = σ²/n → 0.
The next example shows that we can have a sequence converging in mean, but not
almost surely. Thus, the two types of convergences are quite distinct.
Example
Recall the example in which we have the independent sequence X1, X2, . . . with
    P(Xn = 1) = 1/n = 1 − P(Xn = 0).
We showed that Xn --a.s.--> 0 is not true. However, we do have Xn --L1--> 0:
    E|Xn − 0| = 1 × (1/n) + 0 × (1 − 1/n) = 1/n → 0.
Complete convergence
For completeness (no pun intended) we will mention one last type of convergence.
Complete Convergence
A sequence of random variables X1 , X2 , . . . is said to converge completely
to X, denoted
    Xn --cpl--> X,
if
    Σ_{n=1}^{∞} P(|Xn − X| > ε) < ∞   for all ε > 0.
Example
Suppose P(Xn = n⁵) = 1/n² and P(Xn = 0) = 1 − 1/n². Then, we have
    Σ_n P(|Xn − 0| > ε) = Σ_n P(Xn = n⁵) = Σ_{n=1}^{∞} 1/n² = π²/6 < ∞.
Hence, by definition, Xn --cpl--> 0.
Note that if Σ_n P(|Xn − X| > ε) < ∞, then by first-year Calculus we know that
P(|Xn − X| > ε) → 0. Thus, complete convergence implies convergence in proba-
bility.
It is not obvious, but it also implies almost sure convergence (which implies conver-
gence in probability).
Xn --cpl--> X  =⇒  Xn --a.s.--> X
Proof: By the union bound,
    P(sup_{k>n} |Xk − X| > ε) ≤ Σ_{k>n} P(|Xk − X| > ε)
        = Σ_{k=1}^{∞} P(|Xk − X| > ε) − Σ_{k=1}^{n−1} P(|Xk − X| > ε)
        = c − Σ_{k=1}^{n−1} P(|Xk − X| > ε) → c − c = 0,   n ↑ ∞,
where c = Σ_{k=1}^{∞} P(|Xk − X| > ε) < ∞ follows from Xn --cpl--> X.
Hence, by definition, Xn --a.s.--> X.
Convergence Relationships
The different types of convergence form a big family with close ties to each other.
The most general relationships among the various modes of convergence for numer-
ical random variables are shown on the following hierarchical diagram.
    Xn --cpl--> X  ⇒  Xn --a.s.--> X  ⇒  Xn --P--> X  ⇒  Xn --d--> X,
    Xn --Lp--> X  ⇒  Xn --Lq--> X  (p > q)  ⇒  Xn --P--> X.
We are now going to examine this diagram in some detail. We have already seen
that Xn --a.s.--> X implies Xn --P--> X.
First note that convergence in distribution does not imply convergence in probability
(unless additional assumptions are imposed).
Example
Suppose Xn = 1 − X, where X ∼ U(0, 1). Then,
so that FXn (x) is the cdf of the uniform distribution for all n. Trivially, we have
    Xn --d--> X ∼ U(0, 1).
144 CHAPTER 5. CONVERGENCE OF RANDOM VARIABLES
However,
Hence, Xn does not converge in probability to X.
    Xn --P--> X  =⇒  Xn --d--> X
Before we proceed with a proof, we will take note of the following curious conjunc-
tion fallacy made famous in psychology.
Linda is 31 years old, single, outspoken, and very bright. She majored
in philosophy.
As a student, she was deeply concerned with issues of discrimination and
social justice, and also participated in anti-nuclear demonstrations.
Which is more probable?
1. Linda is a bank teller.
2. Linda is a bank teller and is active in the feminist movement.
Most people will choose 2, but they would be wrong. The probability of two events
occurring together (in “conjunction”) is always less than or equal to the probability
of either one occurring alone. Formally,
    P(A ∩ B) ≤ P(A).
Now, in the arguments above we can switch the roles of Xn and X (there is a
symmetry) to deduce the analogous result:
    lim_{ε↓0} FX(x ± ε) = FX(x).
Hence, by taking the limit on both sides as ε ↓ 0 we deduce by the squeeze principle
that
    lim_{n↑∞} FXn(x) = FX(x)
at points x where FX(x) is continuous. The last agrees with the definition of con-
vergence in distribution. □
First note that convergence in probability does not imply convergence in L1 mean
(unless additional assumptions are imposed).
Example
Suppose P(Xn = n5 ) = 1/n2 and P(Xn = 0) = 1 − 1/n2 .
Show that Xn --P--> 0. Does Xn converge to 0 in L1-mean?
Solution: We have P(|Xn − 0| > ε) = 1/n² → 0.
Hence, by definition, Xn --P--> 0. However,
    E|Xn − 0| = n⁵ × (1/n²) + 0 × P(Xn = 0) = n³ ↛ 0.
    Xn --L1--> X  =⇒  Xn --P--> X
Proof: We have
    P(|Xn − X| > ε) ≤ E|Xn − X| / ε      (Chebyshev's inequality)
                    ≤ constant × E|Xn − X| → 0.
Example
Suppose X1 , X2 , . . . are independent, each with mean µ and variance 0 < σ 2 < ∞.
Then, we know
    X̄n --L2--> µ,
which implies
    X̄n --L2--> µ  =⇒  X̄n --L1--> µ  =⇒  X̄n --P--> µ.
We next give an example showing that, in general, if p > q > 1, then Xn --Lq--> X does
not imply Xn --Lp--> X.
Example
Assume p > q > 1. Let P(Xn = n) = 1/n^{(p+q)/2} = 1 − P(Xn = 0). Then
    E|Xn − 0|^q = n^q × n^{−(p+q)/2} = n^{(q−p)/2},
and we will get convergence to 0 for p > q, but at the same time
    E|Xn − 0|^p = n^p × n^{−(p+q)/2} = n^{(p−q)/2} ↛ 0.
In general, these two types of convergence are quite distinct, as the next example
shows.
Example
Consider again the example with P(Xn = n5 ) = 1/n2 and P(Xn = 0) = 1 − 1/n2 .
We showed that Xn --L1--> 0 is false, because Xn = n⁵ can take arbitrarily large values
as n ↑ ∞ and this forces the expectation to blow up!
Show that Xn --a.s.--> 0.
This is easy, since
    Σ_n P(|Xn − 0| > ε) = Σ_n 1/n² = π²/6 < ∞
implies that Xn --cpl--> 0, which we know in turn implies Xn --a.s.--> 0.
Converse Results on Modes of Convergence
In this section we explore conditions under which we can reverse the direction of the
⇒ in the diagram in the previous section.
Result
If Xn --d--> c, where c is a constant, then Xn --P--> c.
The proof is quite simple and all students should be able to follow it.
Proof: We are given that
    lim_{n↑∞} FXn(x) = lim_{n↑∞} P(Xn ≤ x) = 1 for x > c,   and   = 0 for x < c.
We now try to bound and squash to zero the criterion for convergence in probability:
Convergence in Probability
proof: First note that since $X_n \xrightarrow{d} X$ and the $|X_n|$ are bounded by a constant $c$, $X$ is also bounded in the sense that $P(|X| \le c) = 1$. Then, for any $\varepsilon > 0$,
$$\begin{aligned}
E|X_n - X|^p &= E[|X_n - X|^p I_{\{|X_n - X| > \varepsilon\}}] + E[|X_n - X|^p I_{\{|X_n - X| < \varepsilon\}}] \\
&\le E[(|X_n| + |X|)^p I_{\{|X_n - X| > \varepsilon\}}] + E[\varepsilon^p I_{\{|X_n - X| < \varepsilon\}}] \\
&\le (2c)^p P(|X_n - X| > \varepsilon) + \varepsilon^p P(|X_n - X| < \varepsilon) \\
&\le (2c)^p P(|X_n - X| > \varepsilon) + \varepsilon^p \to 0 + \varepsilon^p , \qquad n \uparrow \infty .
\end{aligned}$$
Since this is true for any $\varepsilon > 0$, no matter how small, we conclude that $E|X_n - X|^p \to 0$ as $n \uparrow \infty$.
Finally, of interest is when we can go from $X_n \xrightarrow{P} X$ to $X_n \xrightarrow{\text{a.s.}} X$. In general, this is a difficult problem, but we can easily show that if $X_n \xrightarrow{P} X$, then there is a subsequence of $X_1, X_2, X_3, \ldots$, call it $X_{k_1}, X_{k_2}, X_{k_3}, \ldots$, which converges almost surely to $X$.
Convergence of Subsequence
Suppose that $X_n \xrightarrow{P} X$, that is, for every $\varepsilon > 0$ and $\delta > 0$, we can find a large enough $n$ such that $P(|X_n - X| > \varepsilon) < \delta$. Then,
$$X_{k_n} \xrightarrow{\text{a.s.}} X, \qquad n \uparrow \infty ,$$
where $k_1 < k_2 < k_3 < \cdots$ is selected such that
$$P(|X_{k_j} - X| > \varepsilon) < \frac{1}{j^2} .$$
proof: Since
$$\sum_j P(|X_{k_j} - X| > \varepsilon) < \sum_j \frac{1}{j^2} = \frac{\pi^2}{6} < \infty ,$$
then $X_{k_n} \xrightarrow{\text{cpl.}} X$, which in turn implies $X_{k_n} \xrightarrow{\text{a.s.}} X$.
In the case of independent random variables, almost sure convergence implies complete convergence.
proof: Suppose, to the contrary, that $\sum_n P(|X_n - X| > \varepsilon) = \infty$ for some $\varepsilon > 0$. By the second Borel–Cantelli lemma (this is where independence is used), the events $\{|X_n - X| > \varepsilon\}$ then occur infinitely often with probability one, so that for every $n$
$$P\Big(\sup_{k \ge n} |X_k - X| > \varepsilon\Big) = 1$$
instead of the expected $0$. Hence it is not true that $P(\sup_{k \ge n} |X_k - X| > \varepsilon) \downarrow 0$, contradicting the assumption $X_n \xrightarrow{\text{a.s.}} X$. The only other logical possibility is that
$$\sum_n P(|X_n - X| > \varepsilon) < \infty ,$$
that is, $X_n \xrightarrow{\text{cpl.}} X$. $\square$
In summary, we have established the following partial converses (reverse arrow directions): $X_n \xrightarrow{d} c \Rightarrow X_n \xrightarrow{P} c$ when the limit $c$ is a constant; $X_n \xrightarrow{P} X \Rightarrow X_n \xrightarrow{L^p} X$ when the $X_n$ are uniformly bounded; $X_n \xrightarrow{P} X \Rightarrow X_{k_n} \xrightarrow{\text{a.s.}} X$ along a subsequence; and $X_n \xrightarrow{\text{a.s.}} X \Rightarrow X_n \xrightarrow{\text{cpl.}} X$ when the $X_n$ are independent.
Example
There are $n$ independent and identically distributed trials, each with probability $p$ of success. Consider the "sample proportion" $\hat{p}_n = X/n$, where $X$ is the number of successes in the $n$ trials.
Use the Weak Law of Large Numbers to show that $\hat{p}_n \xrightarrow{P} p$.
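A quick R sketch illustrates this convergence; the value $p = 0.3$ below is an arbitrary illustrative choice.

# Sample proportions settling down around p as the number of trials grows
p <- 0.3
for (n in c(10, 100, 1000, 10000)) {
  x <- rbinom(1, size = n, prob = p)   # number of successes in n trials
  cat("n =", n, " sample proportion =", x / n, "\n")
}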
The following result, known as Slutsky’s Theorem, is useful for establishing con-
vergence in distribution results.
Slutsky’s Theorem
Let X1 , X2 , . . . be a sequence of random variables that converges in distri-
bution to $X$, that is,
$$X_n \xrightarrow{d} X .$$
Let $Y_1, Y_2, \ldots$ be another sequence of random variables that converges in probability to a constant $c$, that is,
$$Y_n \xrightarrow{P} c .$$
Then,
1. $X_n + Y_n \xrightarrow{d} X + c$,
2. $X_n Y_n \xrightarrow{d} cX$.
The proof is omitted in these notes, but may be found in advanced texts such as:
Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics, New
York: John Wiley & Sons.
Example
Suppose that $X_1, X_2, \ldots$ converges in distribution to $X \sim N(0,1)$, i.e. $X_n \xrightarrow{d} N(0,1)$, and suppose that $nY_n \sim \text{Bin}(n, \tfrac{1}{2})$. Since $Y_n$ is a proportion of successes in $n$ trials, the Weak Law of Large Numbers gives $Y_n \xrightarrow{P} \tfrac{1}{2}$, and Slutsky's Theorem then yields
$$X_n + Y_n \xrightarrow{d} N(1/2, 1) \qquad \text{and} \qquad X_n Y_n \xrightarrow{d} N(0, 1/4) .$$
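A minimal simulation sketch of this; for simplicity the $X_n$ are generated as exactly standard normal (their limit).

# Empirical check of Slutsky's theorem
set.seed(1)
n <- 10000; reps <- 5000
x <- rnorm(reps)                 # stand-in for X_n, approximately N(0,1)
y <- rbinom(reps, n, 0.5) / n    # Y_n, close to 1/2
c(mean(x + y), var(x + y))       # should be near 1/2 and 1
c(mean(x * y), var(x * y))       # should be near 0 and 1/4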
The Weak Law corresponds to convergence in probability, while the Strong Law
corresponds to almost sure convergence. Since almost sure convergence implies
convergence in probability, it is in this sense that we talk about a Strong and Weak
Law. The difference between the Strong and Weak Law is the same as that between
a.s. convergence and convergence in probability.
proof: We give only a proof for the case $X_k \ge 0$ for all $k$. From the sequence extract the subsequence of averages $\bar{X}_{j^2}$, for $j = 1, 2, 3, \ldots$. Then
$$P(|\bar{X}_{j^2} - \mu| > \varepsilon) \le \frac{\operatorname{Var}(\bar{X}_{j^2})}{\varepsilon^2} = \frac{\sigma^2}{j^2 \varepsilon^2} ,$$
so that
$$\sum_{j=1}^{\infty} P(|\bar{X}_{j^2} - \mu| > \varepsilon) = \frac{\pi^2}{6}\,\frac{\operatorname{Var}(X)}{\varepsilon^2} < \infty .$$
Hence $\bar{X}_{j^2} \xrightarrow{\text{cpl.}} \mu$ and therefore $\bar{X}_{j^2} \xrightarrow{\text{a.s.}} \mu$.
Now take any $n$ and let $k$ be such that $k^2 \le n \le (k+1)^2$. Since the $X_i$ are non-negative,
$$k^2 \bar{X}_{k^2} \le n \bar{X}_n = X_1 + \cdots + X_n \le X_1 + \cdots + X_{(k+1)^2} = (k+1)^2 \bar{X}_{(k+1)^2} .$$
Rearranging the left inequality gives $\frac{k^2}{(k+1)^2}\bar{X}_{k^2} \le \bar{X}_n$, and similarly $\bar{X}_n \le \frac{(k+1)^2}{k^2}\bar{X}_{(k+1)^2}$, so that
$$\frac{k^2}{(k+1)^2}\,\bar{X}_{k^2} \;\le\; \bar{X}_n \;\le\; \frac{(k+1)^2}{k^2}\,\bar{X}_{(k+1)^2} .$$
Both bounds converge almost surely to $\mu$ as $n$ (and hence $k$) goes to infinity. Therefore, we conclude that $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$. $\square$
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} Z \sim N(0,1), \qquad \text{that is,} \qquad \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1) .$$
Note that
$$E(\bar{X}_n) = \mu \qquad \text{and} \qquad \operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n} .$$
So the Central Limit Theorem states that the limiting distribution of the standardised average of independent and identically distributed random variables (with finite variance) is the standard normal, or $N(0,1)$, distribution.
κ3 is called the skewness of the distribution of X, and κ4 is called the kurtosis of the
distribution of X. The skewness parameter indicates how symmetric the distribution
of X is and the kurtosis indicates how fast the tails of the density decay to zero.
Without loss of generality we can assume µ = 0 and σ = 1. Thus, in our case
κk = E[X k ]. Then, if
$$m_X(t) = E e^{tX}$$
is the MGF of $X$, it follows from the iid assumption that
$$m_{\sqrt{n}\,\bar{X}_n}(t) = \left[ m_X\!\left( \frac{t}{\sqrt{n}} \right) \right]^n .$$
Hence,
$$\zeta(t) \stackrel{\text{def}}{=} \ln m_{\sqrt{n}\,\bar{X}_n}(t) = n \ln m_X\!\left( \frac{t}{\sqrt{n}} \right) .$$
Now, consider the Taylor expansions
$$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots , \qquad |x| < 1$$
and
$$m_X\!\left( \frac{t}{\sqrt{n}} \right) = 1 + E[X]\frac{t}{\sqrt{n}} + E[X^2]\frac{t^2}{2!\,n} + E[X^3]\frac{t^3}{3!\,n^{3/2}} + \cdots
= 1 + \underbrace{\frac{t^2}{2n} + \kappa_3\frac{t^3}{6n^{3/2}} + \kappa_4\frac{t^4}{4!\,n^2} + \cdots}_{x} ,$$
where we choose $t$ such that $|t/\sqrt{n}| < \varepsilon$ with $\varepsilon > 0$ small enough that
$$|x| \le \sum_{k=2}^{\infty} |\kappa_k| \frac{\varepsilon^k}{k!} < 1 .$$
It follows that
$$\zeta(t) = \frac{t^2}{2} + \frac{\kappa_3 t^3}{6 n^{1/2}} + \frac{t^4}{n}\left( \frac{\kappa_4}{4!} - \frac{1}{8} \right) + \cdots$$
or alternatively
$$m_{\sqrt{n}\,\bar{X}_n}(t) = e^{\frac{t^2}{2} + \frac{\kappa_3 t^3}{6 n^{1/2}} + \frac{t^4}{n}\left( \frac{\kappa_4}{4!} - \frac{1}{8} \right) + \cdots} \;\to\; e^{\frac{t^2}{2}}, \qquad n \uparrow \infty .$$
Hence, $\sqrt{n}\,\bar{X}_n \xrightarrow{d} N(0,1)$. Note that we have used the fact that
$$\left| \frac{\kappa_3 t^3}{6 n^{1/2}} + \frac{t^4}{n}\left( \frac{\kappa_4}{4!} - \frac{1}{8} \right) + \cdots \right| \le \text{const.} \sum_{k=1}^{\infty} \frac{1}{n^{k/2}} = \text{const.}\left( \frac{1}{1 - n^{-1/2}} - 1 \right) \to 0 . \qquad \square$$
The Central Limit Theorem stated above provides the limiting distribution of
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} .$$
However, sometimes probabilities involving related quantities such as the sum $\sum_{i=1}^n X_i$ are required. Since $\sum_{i=1}^n X_i = n\bar{X}$, the Central Limit Theorem also applies to the sum of a sequence of random variables. The following result provides alternative forms of the Central Limit Theorem.
Result
Suppose X1 , X2 , . . . are independent and identically distributed random
variables with common mean µ = E(Xi ) and common variance σ 2 =
Var(Xi ) < ∞. Then the Central Limit Theorem may also be stated in
the following alternative forms:
1. $\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2)$,
2. $\dfrac{\sum_i X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1)$,
3. $\dfrac{\sum_i X_i - n\mu}{\sqrt{n}} \xrightarrow{d} N(0, \sigma^2)$.
In this section we provide some applications of the Central Limit Theorem (CLT).
Being the most important theorem in statistics, its applications are many and varied.
The first application is descriptive. The CLT tells us that the sum of many indepen-
dent small random variables has an approximate normal distribution. Therefore it is
plausible that any real-life random variable, formed by the combined effect of many
small independent random influences, will be approximately normally distributed.
Thus the CLT provides an explanation for the widespread empirical occurrence of
the normal distribution.
The CLT also provides a number of very useful normal approximations to common
distributions.
Example
It is known that Australians have an average weight of about 68kg, and the variance
of this quantity is about 256.
We randomly choose 10 Australians. What is an approximate distribution for the
average weight of these people? What is the chance that their average weight exceeds
80kg?
From the central limit theorem,
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1) ,$$
so $\bar{X}$ is approximately $N(68, 256/10) = N(68, 25.6)$. Hence
$$P(\bar{X} > 80) = P\!\left( Z > \frac{80 - 68}{\sqrt{25.6}} \right) \simeq P(Z > 2.37) \simeq 0.0089 .$$
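The same calculation can be reproduced directly in R (a quick sketch of the numbers above):

# Normal approximation to P(average weight of 10 people exceeds 80kg)
mu <- 68; sigma2 <- 256; n <- 10
1 - pnorm(80, mean = mu, sd = sqrt(sigma2 / n))   # approximately 0.0089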
The Central Limit Theorem also allows us to approximate some common distribu-
tions by the normal. An example is the binomial distribution.
If $X \sim \text{Bin}(n, p)$, then as $n \to \infty$,
$$\frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0,1) .$$
Proof:
Let X1 , . . . , Xn be a set of independent Bernoulli random variables with parameter
p. Then
$$X = \sum_i X_i ,$$
and by the second alternative form of the Central Limit Theorem,
$$\frac{X - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} Z ,$$
where $Z \sim N(0,1)$, $\mu = E(X_i) = p$ and $\sigma^2 = \operatorname{Var}(X_i) = p(1-p)$. The required result follows immediately. $\square$
The practical ramifications are that probabilities involving binomial random vari-
ables with large n can be approximated by normal probabilities. However a slight
adjustment, known as a continuity correction, is often used to improve the approximation: for integer $x$,
$$P(X \le x) \approx P\!\left( Z \le \frac{x + 0.5 - np}{\sqrt{np(1-p)}} \right) .$$
The continuity correction is based on the fact that a discrete random variable is being approximated by a continuous random variable.
Example
Adam tosses 25 pieces of toast off a roof, and 10 of them land butter side up.
Is this evidence that toast lands butter side down more often than butter side up?
i.e. is P(X 6 10) unusually small?
X ∼ Bin(25, 0.5).
We could answer this question by calculating the exact probability, but this would
be time consuming. Instead, we use the fact that
$$\frac{X/n - 1/2}{\sqrt{1/(4n)}} \xrightarrow{d} Z \sim N(0,1) .$$
Compare this with the exact answer, obtained from the binomial distribution:
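A quick check in R (a sketch, using the continuity correction described above):

# Normal approximation (with continuity correction) versus the exact binomial probability
n <- 25; p <- 0.5
pnorm((10 + 0.5 - n * p) / sqrt(n * p * (1 - p)))   # approximately 0.21
pbinom(10, size = n, prob = p)                      # exact value, also approximately 0.21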
How large does n need to be for the normal approximation to the binomial distri-
bution to be reasonable?
Recall that how well the central limit theorem works depends on the skewness of
the distribution of X, and its kurtosis. In the case of the Bernoulli and binomial distributions, the skewness and kurtosis are a function of p (and skewness is zero for
p = 0.5, with kurtosis also very small in this case). This means that how well the
normal approximation to the binomial works is a function of p, and it is a better
approximation as p approaches 0.5.
A useful “rule of thumb” is that the normal approximation to the binomial will work
well when n is large enough that both np > 5 and n(1 − p) > 5.
This rule of thumb means that we don’t actually need a very large value of n for
this “large sample” approximation to work well – if p = 0.5, we only need n = 10
for the normal approximation to work well. On the other hand, if p = 0.005, we
would need a sample size of n = 1000...
Result
Suppose X ∼ Poisson(λ). Then
$$\lim_{\lambda \to \infty} P\!\left( \frac{X - \lambda}{\sqrt{\lambda}} \le x \right) = P(Z \le x)$$
where Z ∼ N(0, 1).
This approximation works increasingly well as λ gets large, and it provides a reason-
able approximation to most Poisson probabilities for λ > 5. Note that because X
is discrete, a continuity correction will improve the accuracy of this approximation.
Example
Suppose X ∼ Poi(100). Then
$$P(X = x) = e^{-100}\,\frac{100^x}{x!}, \qquad x = 0, 1, 2, \ldots .$$
Use a normal approximation (with continuity correction) to calculate P(80 < X <
120).
Now $\frac{X - 100}{10}$ is approximately $N(0,1)$. Using the continuity correction, $P(80 < X < 120) = P(81 \le X \le 119)$, so
$$P(80 < X < 120) \approx P\!\left( \frac{80.5 - 100}{10} \le Z \le \frac{119.5 - 100}{10} \right) = P(-1.95 \le Z \le 1.95) \approx 0.9488 ,$$
where $Z \sim N(0,1)$.
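A short R sketch comparing the approximation with the exact Poisson probability:

# Normal approximation with continuity correction versus the exact Poisson probability
pnorm(1.95) - pnorm(-1.95)          # approximately 0.9488
ppois(119, 100) - ppois(80, 100)    # exact P(80 < X < 120) for X ~ Poisson(100)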
There is also a normal approximation to the gamma distribution, which works via
a similar method (but with no need for a continuity correction).
The Central Limit Theorem provides a large sample approximation to the distribu-
tion of X̄n . But what about other functions of a sequence X1 , X2 , . . .? Some special
examples include
1. Functions of $\bar{X}$ such as $(\bar{X}_n)^3$ and $\sin^{-1}(\sqrt{\bar{X}_n})$.
If
$$Y_n = \theta + \frac{1}{\sqrt{n}}\,\sigma Z_n + \left(\text{terms in } \tfrac{1}{n} \text{ or smaller}\right)$$
where $Z_n \xrightarrow{d} N(0,1)$, then
$$g(Y_n) = g(\theta) + \frac{1}{\sqrt{n}}\,\sigma g'(\theta) Z_n + \left(\text{terms in } \tfrac{1}{n} \text{ or smaller}\right) .$$
It is often useful in statistics to use this latter notation, where one expands a statistic
into a constant term and terms which vanish at different rates as n increases.
proof:
First recall the Mean Value Theorem from first year calculus, which states that if $f$ is continuous on the interval $[a,b]$ and differentiable on $(a,b)$, then there is a number $c \in (a,b)$ such that
$$f'(c) = \frac{f(b) - f(a)}{b - a} ,$$
see Figure 5.4. On this figure we can clearly see that there exists a tangent line at $c$ with slope $f'(c)$, which is parallel to the line through the points $(a, f(a))$ and $(b, f(b))$, whose slope is $\frac{f(b) - f(a)}{b - a}$.
First, note that
$$Y_n - \theta = \underbrace{\frac{\sigma}{\sqrt{n}}}_{\to 0} \times \underbrace{\frac{\sqrt{n}(Y_n - \theta)}{\sigma}}_{\xrightarrow{d} Z \sim N(0,1)} \;\xrightarrow{d}\; 0 \times Z = 0 \qquad \text{by Slutsky's theorem.}$$
In other words, $Y_n - \theta \xrightarrow{d} 0$, but convergence in distribution to a constant implies convergence in probability. Hence, we have established that
$$Y_n \xrightarrow{P} \theta .$$
$$g'(\vartheta_n) = \frac{g(Y_n) - g(\theta)}{Y_n - \theta} \tag{5.1}$$
for some $\vartheta_n$ between $\theta$ and $Y_n$. Note that the mean value result is valid whether $\vartheta_n \in (\theta, Y_n)$ or $\vartheta_n \in (Y_n, \theta)$, and this is why we simply say that $\vartheta_n$ is between $\theta$ and $Y_n$. Any such random $\vartheta_n$ satisfies
$$|\vartheta_n - \theta| \le |Y_n - \theta| .$$
From the last inequality it follows that $\vartheta_n \xrightarrow{P} \theta$ as well, and the result follows by combining (5.1) with Slutsky's theorem. $\square$
Yet another way to write the Delta Method result, which is more informal but useful
for practical purposes, is as follows:
If $Y_n \overset{\text{appr.}}{\sim} N\!\left( \theta, \frac{\sigma^2}{n} \right)$ then $g(Y_n) \overset{\text{appr.}}{\sim} N\!\left( g(\theta), [g'(\theta)]^2 \frac{\sigma^2}{n} \right)$.
These two expressions are statements about finite $n$, but $n$ must be large enough for the limiting ($n \to \infty$) results to offer a reasonable approximation to the distribution of $g(Y_n)$.
Hence we refer to the above as a large sample approximation to the distribution of
g(Yn ).
Example
Let X1 , X2 , . . . be a sequence of independently and identically distributed random
variables with mean 2 and variance 7.
Obtain a large sample approximation for the distribution of (X̄n )3 .
Answer: The Central Limit Theorem gives
$$\sqrt{n}(\bar{X}_n - 2) \xrightarrow{d} N(0, 7) .$$
Application of the Delta Method with $g(x) = x^3$ leads to $g'(x) = 3x^2$ and then
$$\sqrt{n}\{(\bar{X}_n)^3 - 2^3\} \xrightarrow{d} N(0, 7 \times (3 \times 2^2)^2) .$$
Simplification gives
$$\sqrt{n}\{(\bar{X}_n)^3 - 8\} \xrightarrow{d} N(0, 1008) .$$
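A Monte Carlo sketch of this approximation. The $N(2, 7)$ distribution used below is just one arbitrary choice with mean 2 and variance 7.

# Delta method check for (Xbar_n)^3
set.seed(1)
n <- 500; reps <- 10000
xbar3 <- replicate(reps, mean(rnorm(n, mean = 2, sd = sqrt(7)))^3)
var(sqrt(n) * (xbar3 - 8))   # should be roughly 1008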
Distributions Arising from a Normal Sample
It is common to model data such as these as a random sample from the normal
distribution. Validity of the assumption is often questionable, but because of the
central limit theorem, it may be true approximately.
Definition
Let X1 , . . . , Xn be a random sample with common distribution N(µ, σ 2 ) for
some µ and σ 2 . Then X1 , . . . , Xn is a normal random sample.
Result
If X1 , . . . , Xn is a random sample from the N(µ, σ 2 ) distribution then
1. $\sum_i X_i \sim N(n\mu, n\sigma^2)$,
2. $\bar{X} \sim N\!\left( \mu, \frac{\sigma^2}{n} \right)$.
Note that the Central Limit Theorem from Chapter 5 states that a sum or mean of a random sample from any distribution (with finite variance) is approximately normal. The above result states that this is exact if we have a normal random sample.
Definition
If X has density
$$f_X(x) = \frac{e^{-x/2}\, x^{\nu/2 - 1}}{2^{\nu/2}\,\Gamma(\nu/2)}, \qquad x > 0$$
then X has the χ2 (chi-squared) distribution with degrees of freedom
ν. A common shorthand is
X ∼ χ2ν .
Result
If X ∼ χ2ν then X ∼ Gamma(ν/2, 2).
This means that results for the gamma distribution can be used to obtain similar results for chi-squared random variables.
Results
If X ∼ χ2ν then
1. $E(X) = \nu$,
2. $\operatorname{Var}(X) = 2\nu$,
3. $m_X(u) = \left( \dfrac{1}{1 - 2u} \right)^{\nu/2}$, $u < 1/2$.
As the title of the chapter suggests, the $\chi^2$ distribution arises from a normal random sample. The key relation is this result: if $X_1, \ldots, X_n$ is a random sample from the $N(\mu, \sigma^2)$ distribution, then
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} .$$
The t Distribution
Definition
Suppose $Z \sim N(0,1)$ and $X \sim \chi^2_\nu$, with $Z$ and $X$ independent. Then
$$T = \frac{Z}{\sqrt{X/\nu}}$$
has the t distribution with $\nu$ degrees of freedom. A common shorthand is
$$T \sim t_\nu .$$
As $\nu \to \infty$, the $t_\nu$ density approaches the standard normal density.
Proof: $f_T(t) \propto \left( 1 + \frac{t^2}{\nu} \right)^{-\frac{\nu+1}{2}}$ and $\lim_{\nu \to \infty}\left( 1 + \frac{t^2}{\nu} \right)^{\nu} = e^{t^2}$, so
$$\lim_{\nu \to \infty} f_T(t) \propto e^{-\frac{t^2}{2}} ,$$
which is proportional to the $N(0,1)$ density.
[Figure: the $t_\nu$ density, with tail areas of size $\alpha$ marked.] Tabulated values are for right-tailed probabilities, although two-tailed probabilities can also be read off such a table.
The following result is frequently used in applied statistics and is the key reason
for the importance of the student’s distribution. We will use this result in later
chapters.
Result
Let X1 , . . . , Xn be a random sample from the N(µ, σ 2 ) distribution. Then
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} .$$
Suppose $X_1, X_2, \ldots, X_{n_X}$ are independent $N(\mu_X, \sigma_X^2)$ and $Y_1, Y_2, \ldots, Y_{n_Y}$ are independent $N(\mu_Y, \sigma_Y^2)$, and the two samples are independent of each other. When comparing the variances, or drawing inferences about $\sigma_X^2/\sigma_Y^2$, we use $S_X^2/S_Y^2$ (the ratio of the sample variances), and this leads us to the F distribution.
Definition
Suppose X ∼ χ2ν1 and Y ∼ χ2ν2 and X and Y are independent. Then
$$F = \frac{X/\nu_1}{Y/\nu_2}$$
has the F distribution with degrees of freedom $\nu_1$ and $\nu_2$. We write $F \sim F_{\nu_1, \nu_2}$.
Result
If F ∼ Fν1 ,ν2 then F has density function
$$f_F(u) = \frac{\left( \frac{\nu_1}{\nu_2} \right)^{\nu_1/2} u^{\nu_1/2 - 1} \left( 1 + \frac{\nu_1 u}{\nu_2} \right)^{-\frac{\nu_1 + \nu_2}{2}}}{B\!\left( \frac{\nu_1}{2}, \frac{\nu_2}{2} \right)} , \qquad u > 0$$
[Figure: the $F_{\nu_1,\nu_2}$ density, with lower- and upper-tail areas of size $\alpha$ marked.]
Fν1 ,ν2 ,α is the αth quantile of the Fν1 ,ν2 distribution; i.e., if U ∼ Fν1 ,ν2 , P(U ≤
Fν1 ,ν2 ,α ) = α. Fν1 ,ν2 ,1−α is the (1 − α)th quantile of the Fν1 ,ν2 distribution; i.e., if
U ∼ Fν1 ,ν2 , P(U ≤ Fν1 ,ν2 ,1−α ) = 1 − α.
Again, the Fisher distribution arises naturally when we consider statistics of normal
samples. The key is this result.
With
$$S_X^2 = \frac{1}{n_X - 1} \sum_{i=1}^{n_X} (X_i - \bar{X})^2 = \text{sample variance of the } X\text{'s},$$
$$S_Y^2 = \frac{1}{n_Y - 1} \sum_{i=1}^{n_Y} (Y_i - \bar{Y})^2 = \text{sample variance of the } Y\text{'s},$$
we have
$$\frac{S_X^2 / S_Y^2}{\sigma_X^2 / \sigma_Y^2} \sim F_{n_X - 1, n_Y - 1} .$$
(The F distribution was invented/discovered by Sir Ronald Fisher, who made fundamental contributions to both statistics and genetics.)
χ2, t, F and R
The statistics package R can be used to calculate cumulative probabilities and quan-
tiles from the distributions considered in this chapter, and more. The method of
use is the same as that described in chapter 2.
To find χ210,0.95 :
> qchisq(0.95,10)
To find P(t15 < 2.602):
> pt(2.602,15)
To find F5,10,0.05 :
> qf(0.05,5,10)
To take a random sample of size 20 from the t1 distribution:
> rt(20,1)
To plot fX (x) where X ∼ χ210 between 0 and 40:
> x = seq(0, 40, length = 200)
> f = dchisq(x, 10)
> plot(x, f, type = "l", ylab = "f_X(x)", main = "The chi-squared (10) distribution")
Primary Results
For a random sample $X_1, \ldots, X_n$ from the $N(\mu, \sigma^2)$ distribution,
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} \qquad \text{and} \qquad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} .$$
For independent samples
$$X_1, \ldots, X_{n_X} \text{ a random sample from } N(\mu_X, \sigma_X^2) \quad \text{and} \quad Y_1, \ldots, Y_{n_Y} \text{ a random sample from } N(\mu_Y, \sigma_Y^2),$$
we have
$$\frac{S_X^2 / S_Y^2}{\sigma_X^2 / \sigma_Y^2} \sim F_{n_X - 1, n_Y - 1} .$$
Secondary Results
$Z_1, \ldots, Z_m \sim N(0,1)$ and independent $\;\Longrightarrow\; \sum_{i=1}^m Z_i^2 \sim \chi^2_m$.
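A quick simulation sketch of the first primary result (the choice $\mu = 0$, $\sigma = 1$, $n = 5$ is arbitrary):

# The pivot (Xbar - mu)/(S/sqrt(n)) should follow a t distribution with n - 1 df
set.seed(1)
n <- 5; reps <- 10000
tstat <- replicate(reps, { x <- rnorm(n); mean(x) / (sd(x) / sqrt(n)) })
qqplot(qt(ppoints(reps), df = n - 1), tstat)   # points should lie near the line y = x
abline(0, 1)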
An Introduction to Statistical
Inference
So far, we have dealt only with the situation where we know the distribution of the
variable of interest to us, and we know the values of parameters in the distribution.
But in practice, we usually do not know the value of the parameters. We would like
to use sample data to make inferences about the parameters, which is the subject
of the next few chapters.
For further reading, consider Hogg et al (2005) sections 4.1, 5.1 and 5.4 or Rice
(2007) sections 7.1-7.3 (but ignore the finite population correction stuff).
Example
Consider the situation where we would like to know the cadmium concentration
in water from a particular dam. We know from Chapter 5 that if we take several
water samples and measure cadmium concentration, the average of these samples
will be normally distributed and will be centred on µ, the true average cadmium
concentration of water in the dam. But what is µ?
In this chapter we will learn how to make inferences about µ based on a sample.
Statistical Models
Example
Keating, Glaser and Ketchum (Technometrics, 1990) describe data on the life-
times (hours) of 20 pressure vessels constructed of fibre/epoxy composite materials
wrapped around metal liners. The data are:
274 28.5 1.7 20.8 871 363 1311 1661 236 828
458 290 54.9 175 1787 970 0.75 1278 776 126
Given the positive and right-skewed nature of the data, a plausible parametric model for the data is
$$\left\{ f_X(x; \beta) = \tfrac{1}{\beta} e^{-x/\beta}, \; x > 0; \; \beta > 0 \right\} .$$
Since this family of density functions is parameterised by the single parameter β > 0,
this is a parametric model.
We could check how well this parametric model fits the data using a quantile-quantile
plot.
In general, a parametric model is a family of probability/density functions of the form
$$\{ f_X(x; \theta) : \theta \in \Theta \} .$$
The set Θ ⊆ R is the set of possible values of θ and is known as the parameter
space. If this model is assumed for a random sample X1 , . . . , Xn then we write
X1 , . . . , Xn ∼ fX (x; θ), θ ∈ Θ.
Note that a model for a random sample induces a probability measure on its mem-
bers. However, these probabilities depend on the value of θ. This is sometimes
indicated using subscripted notation such as: Pθ , Eθ , Varθ to describe probabilities
and expectations according to the model and particular values of θ, although we
will minimise the use of this notation.
Estimation
Example
Let
p = proportion of UNSW students who watched the Cricket World Cup final.
be a parameter of interest.
Suppose that we survey 8 UNSW students, asking them whether they watched the
Cricket World Cup final.
Let X1 , . . . , X8 be such that
$$X_i = \begin{cases} 1, & \text{if the } i\text{th surveyed student watched the Cricket World Cup final} \\ 0, & \text{otherwise.} \end{cases}$$
An appropriate model is
$$X_1, \ldots, X_8 \sim f_X(x; p), \qquad p \in [0,1],$$
where
$$f_X(x; p) = p^x (1-p)^{1-x} = \begin{cases} p, & x = 1 \\ 1 - p, & x = 0. \end{cases}$$
An obvious estimator for $p$ is
$$\hat{p} = \frac{X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7 + X_8}{8} ,$$
corresponding to the proportion from the survey that watched the Cricket World Cup final. However, there are many other possible estimators for $p$, such as
$$\hat{p}_{\text{alt}} = \frac{X_2 + X_4 + X_6 + X_8}{4} ,$$
based on every second person in the survey. But even far cruder functions of the sample satisfy the definition of being an estimator for $p$! (They are not very good estimators, though...)
The previous definition permits any function of the sample to be an estimator for
a parameter θ. However, only certain functions have good properties. So how do
we identify an estimator of θ that has good properties? This is the subject of the
remainder of this section.
Before we start studying properties of estimators, we state a set of facts that are
fundamental to statistical inference:
An estimator $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$ is itself a random variable, and therefore has its own distribution, with probability/density function
$$f_{\hat{\theta}}$$
that depends on $\theta$. We use $f_{\hat{\theta}}$ to study the properties of $\hat{\theta}$ as an estimator of $\theta$.
This idea will be used repeatedly throughout the rest of these notes.
Bias
The first property of an estimator that we will study is bias. This corresponds to the difference between $E(\hat{\theta})$, the "centre of gravity" of $f_{\hat{\theta}}$, and the target parameter $\theta$:
Definition
Let $\hat{\theta}$ be an estimator of a parameter $\theta$. The bias of $\hat{\theta}$ is given by
$$\operatorname{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta .$$
Bias is a measure of the systematic error of an estimator – how far we expect the
estimator to be from its true value θ, on average.
Often, we want to use an estimator θb which is unbiased, or as close to zero bias as
possible.
In many practical situations, we can identify an estimator of θ that is unbiased. If
we cannot, then we would like an estimator that has as small a bias as possible.
Example
For the Cricket World Cup example
$$\hat{p} = \frac{Y}{8} ,$$
where $Y$ is the number of students who watched the Cricket World Cup final, and
$$Y \sim \text{Bin}(8, p) .$$
Now to find $\operatorname{bias}(\hat{p}) = E(\hat{p} - p)$, we will first find $E(\hat{p})$. Since $E(Y) = 8p$,
$$E(\hat{p}) = E(Y/8) = E(Y)/8 = 8p/8 = p ,$$
so
$$\operatorname{bias}(\hat{p}) = E(\hat{p} - p) = E(\hat{p}) - E(p) = 0 ,$$
that is, $\hat{p}$ is unbiased. Similarly, $4\hat{p}_{\text{alt}} \sim \text{Bin}(4, p)$, so $\hat{p}_{\text{alt}}$ has probability function
$$f_{\hat{p}_{\text{alt}}}(x) = \binom{4}{4x} p^{4x} (1-p)^{4 - 4x}, \qquad x = 0, \tfrac{1}{4}, \tfrac{2}{4}, \ldots, 1 ,$$
and is also unbiased.
Standard error
Definition
Let $\hat{\theta}$ be an estimator of a parameter $\theta$. The standard error of $\hat{\theta}$ is its standard deviation:
$$\operatorname{se}(\hat{\theta}) = \sqrt{\operatorname{Var}(\hat{\theta})} .$$
To obtain the estimated standard error $\widehat{\operatorname{se}}(\hat{\theta})$, we first derive $\operatorname{Var}(\hat{\theta})$, the variance of $\hat{\theta}$, and then we replace any unknown parameter $\theta$ by its estimator $\hat{\theta}$.
Like the bias, the standard error of an estimator is ideally as small as possible.
However, unlike the bias the standard error can never be made zero (except in
trivial cases).
Example
Consider, again, the Cricket World Cup example.
Find $\operatorname{se}(\hat{p})$ and $\operatorname{se}(\hat{p}_{\text{alt}})$. Comment.
$$\operatorname{Var}(\hat{p}) = \operatorname{Var}(Y/8) = (1/8^2)\operatorname{Var}(Y) = \frac{p(1-p)}{8} .$$
Therefore the standard error of $\hat{p}$ is
$$\operatorname{se}(\hat{p}) = \sqrt{\frac{p(1-p)}{8}} \qquad \text{and} \qquad \widehat{\operatorname{se}}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{8}} .$$
Similarly,
$$\operatorname{se}(\hat{p}_{\text{alt}}) = \sqrt{\frac{p(1-p)}{4}} \qquad \text{and} \qquad \widehat{\operatorname{se}}(\hat{p}_{\text{alt}}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{4}} .$$
Therefore the standard error of $\hat{p}_{\text{alt}}$ will always be larger than that of $\hat{p}$ by a factor of $\sqrt{2} \simeq 1.4$, which suggests that $\hat{p}$ is a better estimator of $p$ than $\hat{p}_{\text{alt}}$.
The bias and standard error of an estimator are fundamental measures of different aspects of the quality of $\hat{\theta}$ as an estimator of $\theta$: bias is concerned with the systematic error in $\hat{\theta}$, while the standard error is concerned with its inherent random (or sampling) error. But consider a situation in which we want to choose between two alternative estimators of $\theta$, where one has smaller bias, and the other has smaller standard error. How could we choose between these two estimators?
One approach is to use a combined measure of the quality of $\hat{\theta}$, which combines the bias and standard error in some way. The mean squared error is the most common way of doing this:
Definition
The mean squared error of $\hat{\theta}$ is given by
$$\operatorname{MSE}(\hat{\theta}) = E\{(\hat{\theta} - \theta)^2\} .$$
Result
$$\operatorname{MSE}(\hat{\theta}) = \operatorname{bias}(\hat{\theta})^2 + \operatorname{Var}(\hat{\theta}), \qquad \widehat{\operatorname{MSE}}(\hat{\theta}) = \operatorname{bias}(\hat{\theta})^2 + \widehat{\operatorname{se}}(\hat{\theta})^2 .$$
Proof:
$$\begin{aligned}
\operatorname{MSE}(\hat{\theta}) &= E\{(\hat{\theta} - \theta)^2\} = E[\{\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta\}^2] \\
&= E[\{\hat{\theta} - E(\hat{\theta})\}^2] + E[\{E(\hat{\theta}) - \theta\}^2] + 2E[\{\hat{\theta} - E(\hat{\theta})\}\{E(\hat{\theta}) - \theta\}] \\
&= \operatorname{Var}(\hat{\theta}) + \operatorname{bias}(\hat{\theta})^2 + 2\{E(\hat{\theta}) - \theta\}\{E(\hat{\theta}) - E(\hat{\theta})\} \\
&= \operatorname{Var}(\hat{\theta}) + \operatorname{bias}(\hat{\theta})^2 . \qquad \square
\end{aligned}$$
Definition
Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two estimators of a parameter $\theta$. Then $\hat{\theta}_1$ is better than $\hat{\theta}_2$ (with respect to MSE) at $\theta_0 \in \Theta$ if $\operatorname{MSE}(\hat{\theta}_1) < \operatorname{MSE}(\hat{\theta}_2)$ when $\theta = \theta_0$, and $\hat{\theta}_1$ is uniformly better than $\hat{\theta}_2$ if this holds for every $\theta_0 \in \Theta$.
Example
For the Cricket World Cup example, find $\operatorname{MSE}(\hat{p})$ and $\operatorname{MSE}(\hat{p}_{\text{alt}})$. Is $\hat{p}$ uniformly better than $\hat{p}_{\text{alt}}$?
From previous results,
$$\operatorname{MSE}(\hat{p}) = 0^2 + \frac{p(1-p)}{8} = \frac{p(1-p)}{8} , \qquad \text{while} \qquad \operatorname{MSE}(\hat{p}_{\text{alt}}) = \frac{p(1-p)}{4} .$$
Since
$$\operatorname{MSE}(\hat{p}) < \operatorname{MSE}(\hat{p}_{\text{alt}}) \quad \text{for all } 0 < p < 1 ,$$
$\hat{p}$ is uniformly better than $\hat{p}_{\text{alt}}$ (with respect to MSE).
Consistency
Definition
The estimator $\hat{\theta}_n$ is consistent for $\theta$ if
$$\hat{\theta}_n \xrightarrow{P} \theta .$$
Example
For the Cricket World Cup example, suppose that n people are surveyed and
$$\hat{p}_n = \frac{Y_n}{n} ,$$
where $Y_n$ is the number of the $n$ surveyed students who watched the Cricket World Cup final.
Application of the Weak Law of Large Numbers (to the binary $X_i$ random variables) leads to
$$\hat{p}_n \xrightarrow{P} p .$$
Result: if $\operatorname{MSE}(\hat{\theta}_n) \to 0$ as $n \to \infty$, then $\hat{\theta}_n$ is consistent for $\theta$.
Example
For the Cricket World Cup example, find $\operatorname{MSE}(\hat{p}_n)$. Hence show that $\hat{p}_n$ is consistent for $p$.
$$\operatorname{MSE}(\hat{p}_n) = \frac{p(1-p)}{n} ,$$
so
$$\lim_{n \to \infty} \operatorname{MSE}(\hat{p}_n) = p(1-p) \lim_{n \to \infty} \frac{1}{n} = 0 ,$$
and hence $\hat{p}_n$ is consistent for $p$.
Asymptotic normality
An estimator $\hat{\theta}$ is asymptotically normal if
$$\frac{\hat{\theta} - \theta}{\operatorname{se}(\hat{\theta})} \xrightarrow{d} N(0,1) .$$
In particular, we know already from the CLT that a sample mean $\hat{\mu} = \bar{X}$ and a sample proportion $\hat{p}$ are asymptotically normal.
Example
In the Cricket World Cup example, let X1 , . . . , Xn be defined by
$$X_i = \begin{cases} 1, & \text{if the } i\text{th surveyed student watched the Cricket World Cup final} \\ 0, & \text{otherwise.} \end{cases}$$
Assume the Xi are independent. Write pbn in terms of the Xi and hence use conver-
gence results from Chapter 5 to find the asymptotic distribution of pbn .
Example
Consider the sample mean X̄n of n independent random variables with mean µ and
variance σ 2 .
Show that X̄n is consistent. Is X̄n asymptotically normal?
Observed values
Throughout this section we have considered the random sample X1 , . . . , Xn and the
properties of the estimator $\hat{\theta}$, by treating it as a random variable. This allows us to do theoretical calculations concerning, say, the bias and large sample properties of $\hat{\theta}$.
In practice, we only take one sample, and observe a single value of $\hat{\theta}$, known as the observed value of $\hat{\theta}$. This is also sometimes called the estimate of $\theta$ (as opposed to the estimator, which is the random variable we use to obtain the estimate).
Example
Consider the pressure vessel example and let $\hat{\beta} = \bar{X}$. The observed data are:
274 28.5 1.7 20.8 871 363 1311 1661 236 828
458 290 54.9 175 1787 970 0.75 1278 776 126
Some statistics texts distinguish between random variables and their observed values via use of lower-case letters. For the previous example this would involve writing $\bar{X}$ for the estimator and $\bar{x} = 575.53$ for the observed value.
Good notation for the observed value of $\hat{\beta}$ is a bit trickier since $\beta$ is already lower-case (in Greek). These notes will not worry too much about making such distinctions. So $\hat{\beta}$ denotes the random variable $\bar{X}$ before the data are observed, but we will also write $\hat{\beta} = 575.53$ for the observed value. The meaning of $\hat{\beta}$ should be clear from the context.
Example
Eight students are surveyed and two watched the Cricket World Cup final.
Find $\hat{p}$ (defined previously) and its standard error, and write your answer in estimate (standard error) notation.
$$\hat{p} = 2/8 = 0.25 , \qquad \widehat{\operatorname{se}}(\hat{p}) = \sqrt{0.25(1 - 0.25)/8} = 0.153 .$$
So the estimate (standard error) notation would report the results for $p$ as follows: 0.25 (0.153).
Multiparameter models
So far in this chapter we have only considered models with a single parameter.
However, it is often the case that a model with more than one parameter is required
for the data at hand.
Example
For the pressure vessel data we have previously considered the single parameter model
$$\left\{ f_X(x; \beta) = \tfrac{1}{\beta} e^{-x/\beta}, \; x > 0; \; \beta > 0 \right\} .$$
A two-parameter model is
$$\left\{ f_X(x; \alpha, \beta) = \tfrac{1}{\beta^\alpha \Gamma(\alpha)} x^{\alpha - 1} e^{-x/\beta}, \; x > 0; \; \alpha, \beta > 0 \right\} .$$
Another common two-parameter model is the normal model, $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$.
Often inference is concerned with the original parameters: µ and σ for a N(µ, σ 2 )
model; α and β for a Gamma(α, β) model. However, sometimes a transformation of
the parameters is of interest. For example, in the Gamma(α, β) model a parameter
of interest is often $\tau = \alpha\beta$ (the mean of the distribution).
Confidence Intervals
An estimator θb of a parameter θ leads to a single number for inferring the true value
of θ. For example, in the Cricket World Cup example if we survey 50 people and 16
watched the Cricket World Cup final then the estimator pb has an observed value of
0.32.
However, the number 0.32 alone does not tell us much about the inherent variability
in the underlying estimator. Confidence intervals aim to improve this situation with
a range of plausible values, e.g.
p is likely to be in the range 0.19 to 0.45.
Definition
Let X1 , . . . , Xn be a random sample from a model that includes an unknown
parameter $\theta$. Let $L = L(X_1, \ldots, X_n)$ and $U = U(X_1, \ldots, X_n)$ be statistics such that
$$P(L \le \theta \le U) = 1 - \alpha \qquad \text{for all } \theta .$$
Then $(L, U)$ is called a $100(1-\alpha)\%$ confidence interval for $\theta$. Note that in this probability statement the quantity in the middle ($\theta$) is fixed, but the limits ($L$ and $U$) are random. This is the reverse situation from many probability statements that arise in earlier chapters, such as
$$P(2 \le X \le 7)$$
for a random variable $X$.
Example
Consider the pressure vessel example
X1 , . . . , X20 ∼ fX (x; β)
where
$$f_X(x; \beta) = \tfrac{1}{\beta} e^{-x/\beta}, \qquad x > 0; \; \beta > 0 .$$
Then it can be shown (details omitted) that
$$P\big( 0.52\,\bar{X} \le \beta \le 1.67\,\bar{X} \big) = 0.99 \qquad \text{for all } \beta > 0 .$$
Therefore
$$(0.52\,\bar{X},\; 1.67\,\bar{X})$$
is a 0.99 or 99% confidence interval for $\beta$. Since the observed value of $\bar{X}$ is $\bar{x} = 575.53$, the observed value of the 99% confidence interval for $\beta$ is
$$(352, 852) .$$
We now return to the topic of Chapter 6 corresponding to the special case where the
random sample can be reasonably modelled as coming from a normal distribution:
$$X_1, \ldots, X_n \sim N(\mu, \sigma^2) .$$
In this case we know that
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} \qquad \text{and} \qquad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} .$$
Result
Let X1 , . . . , Xn be a random sample from the N(µ, σ 2 ) distribution. Then a
100(1 − α)% confidence interval for µ is
$$\left( \bar{X} - t_{n-1, 1-\alpha/2}\,\frac{S}{\sqrt{n}},\;\; \bar{X} + t_{n-1, 1-\alpha/2}\,\frac{S}{\sqrt{n}} \right) .$$
The above is a particularly powerful result. It provides us with a method of making potentially quite specific statements about $\mu$, based on a sample: we can compute a range of values which we can be as confident as we like will contain $\mu$ (such intervals contain $\mu$ $100(1-\alpha)\%$ of the time).
This is a particularly useful result in practice, because many interesting research
questions can be phrased in terms of means (e.g. What is average recovery time for
patients given a new treatment? How much weight do people lose, on average, if
they exercise for 30 minutes every day?)
Recall from Chapter 5 that, even when the $X_i$ are not normal, $t_{n-1}$ is a reasonable approximation to the distribution of $\frac{\bar{X} - \mu}{S/\sqrt{n}}$, as long as $n$ is large enough for the Central Limit Theorem to come into play. This means that we can use the above method to construct a confidence interval for $\mu$ even when the data are not normal – this is where the above result becomes really handy in practice!
Confidence Intervals for Two Normal Random Samples
Suppose
$$X_1, \ldots, X_{n_X} \sim N(\mu_X, \sigma_X^2) \qquad \text{and} \qquad Y_1, \ldots, Y_{n_Y} \sim N(\mu_Y, \sigma_Y^2) .$$
As previously, the normality assumption is not critical when making inferences about $\mu_X - \mu_Y$, because some robustness to non-normality is inherited from the Central Limit Theorem.
Result
Let
$$X_1, \ldots, X_{n_X} \sim N(\mu_X, \sigma^2) \qquad \text{and} \qquad Y_1, \ldots, Y_{n_Y} \sim N(\mu_Y, \sigma^2)$$
be two independent normal random samples, each with the same variance $\sigma^2$. Then a $100(1-\alpha)\%$ confidence interval for $\mu_X - \mu_Y$ is
$$\bar{X} - \bar{Y} \pm t_{n_X + n_Y - 2,\, 1-\alpha/2}\; S_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}$$
where
$$S_p^2 = \frac{(n_X - 1)S_X^2 + (n_Y - 1)S_Y^2}{n_X + n_Y - 2} ,$$
known as the pooled sample variance.
The proof is analogous to that for the single sample confidence interval for µ.
Example
Peak expiratory flow is measured for 7 “normal” six-year-old children and 9 six-
year-old children with asthma. The data are as follows:
We will obtain a 95% confidence interval for µX − µY under the assumption that
the variances for each population are equal. The sample sizes, sample means and
sample variances are
$$n_X = 7, \quad \bar{x} \simeq 65.43, \quad s_X^2 \simeq 109.62 , \qquad n_Y = 9, \quad \bar{y} \simeq 47.78, \quad s_Y^2 \simeq 47.19 .$$
The pooled sample variance is
$$s_p^2 \simeq \frac{6 \times 109.62 + 8 \times 47.19}{14} \simeq 73.95 .$$
The appropriate t-distribution quantile is $t_{14, 0.975} = 2.145$. The confidence interval is then
$$(65.43 - 47.78) \pm 2.145\,\sqrt{73.95}\,\sqrt{\frac{1}{7} + \frac{1}{9}} = (8.4, 27.0) .$$
In conclusion, we can be 95% confident that the difference in mean peak expiratory
flow between the two groups is between 8.4 and 27.0 units.
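The same interval can be reproduced in R from the summary statistics above (a sketch):

# Pooled two-sample t confidence interval from summary statistics
nx <- 7; ny <- 9; xbar <- 65.43; ybar <- 47.78; s2x <- 109.62; s2y <- 47.19
sp2 <- ((nx - 1) * s2x + (ny - 1) * s2y) / (nx + ny - 2)
(xbar - ybar) + c(-1, 1) * qt(0.975, nx + ny - 2) * sqrt(sp2) * sqrt(1/nx + 1/ny)
# approximately (8.4, 27.0)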
If we have a paired normal random sample from $(X, Y)$, i.e. two normal random samples that are dependent, we can construct a confidence interval for the mean difference $\mu_D = E(X - Y)$ by applying the one-sample method to the differences $D_i = X_i - Y_i$.
Example
The number of people exceeding the speed limit, per month, was recorded before
and after the installation of speed cameras at four locations (data from the Sydney
Morning Herald, 22nd September 2003).
How large was the reduction in number of people speeding per month per camera,
on average?
Recall that, by the Central Limit Theorem,
$$\frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \xrightarrow{d} N(0,1) .$$
This means that we can use the normal distribution to construct a confidence interval
for p, based on a binomial sample.
Result
Let X ∼ Bin(n, p). Then an approximate 100(1 − α)% confidence interval
for p is
$$\left( \hat{p} - z_{1-\alpha/2}\,\widehat{\operatorname{se}}(\hat{p}),\;\; \hat{p} + z_{1-\alpha/2}\,\widehat{\operatorname{se}}(\hat{p}) \right),$$
where $\hat{p} = \frac{X}{n}$ and $\widehat{\operatorname{se}}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$.
Note that this is only an "approximate" confidence interval for two reasons: $\hat{p}$ is only approximately normal, and we are using $\widehat{\operatorname{se}}(\hat{p})$ in place of $\sqrt{p(1-p)/n}$.
Chapter 8
Methods for Parameter Estimation and Inference
Often key research questions can be phrased in terms of a statistical model, and in
particular, in terms of a specific parameter in the statistical model. We would like
to be able to make inferences about the parameter of interest, after collecting some
data – tools for parametric inference are desired. Historically, the majority of
statistical work (both theoretical and applied) has been concerned with developing
and applying tools for parametric inference. The most important contributions of
statistics to scientific research have been in this important area.
Chapter 7 introduced the basic statistical inference concepts of estimation and con-
fidence intervals. These ideas were worked through in the special cases of means of
(approximately) normal samples, and proportions for binomial samples. But what
about more general statistical models? How do you come up with a good estimator
in such situations? How can you construct a confidence interval for the parameter
of interest, in such situations? This chapter provides some tools for answering these
important questions.
Throughout this chapter it is useful to keep in mind the distinction between estimates and estimators. An estimate of a parameter $\theta$ is a function $\hat{\theta} = \hat{\theta}(x_1, \ldots, x_n)$ of observed values $x_1, \ldots, x_n$, whereas the corresponding estimator is the same function $\hat{\theta}(X_1, \ldots, X_n)$ of the observable random variables $X_1, \ldots, X_n$. Thus an estimator is a random variable whose properties may be examined and considered before the observation process occurs, whereas an estimate is an actual number, the realized value of the estimator, evaluated after the observations are available.
In deriving estimation formulas it is often easier to work with estimates, but in
considering theoretical properties, switching to estimators may be necessary.
For notational convenience, we usually denote the density or probability function
fX simply by f .
For further reading, consider Hogg et al (2005) sections 6.1-6.2 or Rice (2007) sec-
tions 8.4-8.5 or Chapter 6 of Kroese & Chan (2014).
Suppose the model density has $k$ parameters, $f = f(x; \theta_1, \ldots, \theta_k)$. The method of moments estimators are found by matching the first $k$ sample moments to the corresponding model moments.
Example
Consider a random sample with normal model:
X1 , . . . , Xn ∼ N(µ, σ 2 ).
Figure 8.1: Karl Pearson (27 March 1857 – 27 April 1936) is one of the fathers of statistics, meteorology, epidemiology, and biometrics. He founded the first department of statistics in the world. We owe the correlation coefficient and the method of moments to Pearson, amongst other things. His books on the philosophy of science influenced the young A. Einstein.
But
$$E(X) = \mu \qquad \text{and} \qquad E(X^2) = \operatorname{Var}(X) + \{E(X)\}^2 = \sigma^2 + \mu^2 ,$$
which leads to the system of equations
$$\mu = \bar{x} , \qquad \sigma^2 + \mu^2 = \frac{1}{n}\sum_i x_i^2 .$$
Solving gives the method of moments estimates $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2$.
Assuming that $\operatorname{Var}(X^k) < \infty$, the Weak Law of Large Numbers states that
$$\frac{1}{n}\sum_i X_i^j \xrightarrow{P} E(X^j), \qquad 1 \le j \le k ,$$
from which it follows that
$$\hat{\theta}_j \xrightarrow{P} \theta_j , \qquad 1 \le j \le k .$$
That is, method of moments leads to consistent estimation of the model parameters.
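A small R sketch of these method of moments estimators for the normal model, using simulated data with arbitrarily chosen true values:

# Method of moments for the N(mu, sigma^2) model: match E(X) and E(X^2)
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)      # simulated data; true mu = 5, sigma = 2 (arbitrary)
mu_mom <- mean(x)
sigma2_mom <- mean(x^2) - mean(x)^2    # equivalently (1/n) * sum((x - mean(x))^2)
c(mu_mom, sigma2_mom)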
Methods of moments estimation is useful in practice because it is a simple approach
that guarantees us a consistent estimator. However it is not always optimal, in the
sense that it does not always provide us with an estimator that has the smallest
possible standard errors and mean squared error. There is however a method that
is (usually) optimal...
Definition
Let $x_1, \ldots, x_n$ be observations from the pdf $f$, where $f(x) = f(x; \theta)$ for some unknown parameter $\theta$. The likelihood function of $\theta$ is
$$L(\theta) = \prod_{i=1}^n f(x_i; \theta) .$$
Note that the form of the likelihood function as a function of the observations $x_1, \ldots, x_n$ is the same as the joint density function, but the likelihood function is regarded as a function of $\theta$, for fixed values of $\{x_i\}$.
Example
Let
$$X_i = \begin{cases} 1, & \text{if the } i\text{-th surveyed student watched the World Cup Cricket final} \\ 0, & \text{otherwise.} \end{cases}$$
Find the likelihood and log-likelihood functions, i.e. $L(p)$ and $\ell(p)$, hence find expressions for $L(0.3)$ and $\ell(0.3)$.
The likelihood function for $p$ is
$$L(p) = \prod_{i=1}^{8} \left\{ p^{x_i}(1-p)^{1-x_i} \right\} = p^{\sum_{i=1}^{8} x_i} (1-p)^{8 - \sum_{i=1}^{8} x_i} .$$
Taking $p = 0.3$,
$$L(0.3) = (0.3)^{\sum_{i=1}^{8} x_i} (0.7)^{8 - \sum_{i=1}^{8} x_i}$$
and
$$\ell(0.3) = \left( \sum_{i=1}^{8} x_i \right) \ln(0.3) + \left( 8 - \sum_{i=1}^{8} x_i \right) \ln(0.7) .$$
Definition
Let $x_1, \ldots, x_n$ be observations from probability/density function $f$, where $f(x) = f(x; \theta)$ for some unknown $\theta \in \Theta$. The maximum likelihood estimate of $\theta$ is the value $\hat{\theta} \in \Theta$ that maximises the likelihood function $L(\theta)$.
Example
In the World Cup Cricket example ($n = 8$) the likelihood function has been shown to be
$$L(p) = \prod_{i=1}^{8}\left\{ p^{x_i}(1-p)^{1-x_i} \right\} = p^{\sum_{i=1}^{8} x_i}(1-p)^{8 - \sum_{i=1}^{8} x_i} = e^{\left(\sum_{i=1}^{8} x_i\right)\ln(p) + \left(8 - \sum_{i=1}^{8} x_i\right)\ln(1-p)} .$$
Find the maximum likelihood estimator of $p$ by finding the value of $p$ that maximises $L(p)$.
The first derivative of $L(p)$ with respect to $p$ is
$$\frac{d}{dp} L(p) = e^{\left(\sum_{i=1}^{8} x_i\right)\ln(p) + \left(8 - \sum_{i=1}^{8} x_i\right)\ln(1-p)} \left( \frac{\sum_{i=1}^{8} x_i}{p} - \frac{8 - \sum_{i=1}^{8} x_i}{1-p} \right) = L(p)\left( \frac{\sum_{i=1}^{8} x_i}{p} - \frac{8 - \sum_{i=1}^{8} x_i}{1-p} \right) .$$
Setting this to zero gives $p = \sum_{i=1}^{8} x_i / 8$. Further analysis (see next example) shows that this is the unique maximiser of $L(p)$ over $0 < p < 1$, so
$$\hat{p} = \frac{\sum_{i=1}^{8} x_i}{8} = \text{proportion of people in the survey who watched the World Cup Cricket final}$$
is the maximum likelihood estimate of $p$.
Example
Suppose that the observed data are:
x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 0, x7 = 1, x8 = 1.
[Figure: plot of the likelihood function $L(p)$ for these data; the maximum occurs at $\hat{p} = 6/8 = 0.75$.]
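This plot can be reproduced with a few lines of R (a sketch):

# Likelihood for the observed data x (six 1's out of eight trials)
x <- c(0, 1, 1, 1, 1, 0, 1, 1)
p <- seq(0.01, 0.99, length = 200)
L <- p^sum(x) * (1 - p)^(8 - sum(x))
plot(p, L, type = "l", ylab = "L(p)")
p[which.max(L)]    # close to 6/8 = 0.75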
As the previous definition shows, maximum likelihood estimation boils down to the
problem of determining where a function reaches its maximum. The mechanics of
determination of the maximum differ depending on the smoothness of $L(\theta)$. When $L(\theta)$ is a differentiable function of $\theta$, it is usually simpler to work with the log-likelihood function $\ell(\theta)$. Maximising $\ell(\theta)$ rather than $L(\theta)$ is justified by:
Result
The point at which $L(\theta)$ attains its maximum over $\theta \in \Theta$ is also the point at which
$$\ell(\theta) = \ln\{L(\theta)\} = \sum_i \ln\{f(x_i; \theta)\}$$
attains its maximum.
Example
Re-visit the World Cup Cricket example.
Find the maximum likelihood estimator of p by finding the value of p that maximises
ℓ(p). Then plot ℓ(p) when the observed data are
x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 0, x7 = 1, x8 = 1
$$\ell(p) = \ln\{L(p)\} = \left( \sum_{i=1}^{8} x_i \right)\ln(p) + \left( 8 - \sum_{i=1}^{8} x_i \right)\ln(1-p) .$$
Setting the first derivative to zero,
$$\frac{\sum_{i=1}^{8} x_i}{p} - \frac{8 - \sum_{i=1}^{8} x_i}{1-p} = 0 \iff p = \frac{\sum_{i=1}^{8} x_i}{8} ,$$
which is the same answer obtained previously using $L(p)$, but via simpler calculus!
Is this the unique maximiser of $\ell(p)$ over $p \in (0,1)$? The second derivative is
$$\frac{d^2}{dp^2}\ell(p) = \frac{-\sum_{i=1}^{8} x_i}{p^2} - \frac{8 - \sum_{i=1}^{8} x_i}{(1-p)^2} ,$$
which is negative for all $0 < p < 1$ and samples $x_i \in \{0,1\}$, $1 \le i \le 8$. Hence $\ell(p)$ is concave (downwards) over $0 < p < 1$ and the point at which $\frac{d}{dp}\ell(p) = 0$ must be a maximum.
For the case where the observed data are again
x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 0, x7 = 1, x8 = 1
the following figure shows the maximum likelihood estimation procedure via the log-
likelihood function ℓ(p). As in the case of L(p) (see previous figure), the maximiser
occurs at pb = 6/8 = 0.75.
[Figure: plot of the log-likelihood function $\ell(p)$ for these data; the maximum again occurs at $\hat{p} = 0.75$.]
Example
Consider an observed sample $x_1, \ldots, x_n$ with common density function $f$, where
$$f(x; \theta) = 2\theta x e^{-\theta x^2}, \qquad x > 0; \; \theta > 0 .$$
Write down the log-likelihood function for $\theta$. Show that any stationary point of this function maximises $\ell(\theta)$, hence find the maximum likelihood estimator of $\theta$.
Not all likelihood functions are differentiable, or even continuous, over θ ∈ Θ. In such
non-smooth cases calculus methods, alone, cannot be used to locate the maximiser,
and it is usually better to work directly with L(θ) rather than ℓ(θ). The following
notation is useful in non-smooth likelihood situations.
Definition
Let $\mathcal{P}$ be a logical condition. Then the indicator function of $\mathcal{P}$, $I(\mathcal{P})$, is given by
$$I(\mathcal{P}) = \begin{cases} 1 & \text{if } \mathcal{P} \text{ is true,} \\ 0 & \text{if } \mathcal{P} \text{ is false.} \end{cases}$$
Example
Some examples of use of I are
I(Cristina Keneally was the premier of New South Wales on 14th February, 2011) = 1.
I(4^2 = 16) = 1,
I(eπ = 17) = 0,
I(The Earth is bigger than the Moon) = 1,
I(The Earth is bigger than the Moon & The Moon is made of blue cheese) = 0.
The I notation allows one to write density functions in explicit algebraic terms. For
example, the Gamma(α, β) density function is usually written as
$$f(x; \alpha, \beta) = \frac{e^{-x/\beta} x^{\alpha - 1}}{\Gamma(\alpha)\beta^{\alpha}} , \qquad x > 0 .$$
However, it can also be written using $I$ as
$$f(x; \alpha, \beta) = \frac{e^{-x/\beta} x^{\alpha - 1}}{\Gamma(\alpha)\beta^{\alpha}}\, I(x > 0) .$$
The following result is useful for deriving maximum likelihood estimators when the
likelihood function is non-smooth:
Result
For any two logical conditions P and Q,
I(P ∩ Q) = I(P)I(Q).
Example
Suppose that the observations $x_1, \ldots, x_n$ come from the pdf $f$, where
$$f(x; \theta) = 5(x^4/\theta^5)\, I(0 < x < \theta) = 5(x^4/\theta^5)\, I(x > 0)\, I(\theta > x) .$$
Note that
$$\prod_{i=1}^n I(\theta > x_i) = I(\theta > x_1, \theta > x_2, \cdots, \theta > x_n) = I(\theta > \max(x_1, \ldots, x_n)) .$$
Also, $\prod_{i=1}^n I(x_i > 0) = 1$ with probability 1, since $P(X_i > 0) = 1$. Hence, the likelihood function is
$$L(\theta) = 5^n \left( \prod_{i=1}^n x_i \right)^{\!4} \theta^{-5n}\, I(\theta > \max(x_1, \ldots, x_n)) .$$
As a function of $\theta$ this is decreasing wherever the indicator equals one, so $L(\theta)$ is maximised by taking $\theta$ as small as possible subject to $\theta > \max(x_1, \ldots, x_n)$. The maximum likelihood estimator is therefore
$$\hat{\theta} = \max(X_1, \ldots, X_n) .$$
Consistency
Suppose
$$X_1, \ldots, X_n \overset{\text{iid}}{\sim} f(x; \theta^*)$$
for some unknown $\theta^*$ in the interior of the set $\Theta$, where $\Theta$ is the set of allowable values for $\theta$. Let us make the following assumptions:
1. The domain of x does not depend on θ∗ , that is, all pdfs f (·; θ) have common
support (are nonzero over the same fixed set, independent of θ∗ ).
2. If $\theta \ne \vartheta$, then $f(x; \theta) \ne f(x; \vartheta)$.
3. The MLE
$$\hat{\theta}_n = \operatorname*{argmax}_{\theta \in \Theta} f(\mathbf{X}; \theta) ,$$
where $f(\mathbf{X}; \theta) = \prod_{i=1}^n f(X_i; \theta)$, is unique and lies in the interior of $\Theta$.
Consistency
Under the assumptions above, the maximum likelihood estimator $\hat{\theta}_n$ of $\theta^*$ is consistent; i.e.
$$\hat{\theta}_n \xrightarrow{P} \theta^* .$$
proof (sketch): Write the scaled log-likelihood as
$$\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ln f(X_i; \theta) .$$
Note that $E_{\theta^*} \ln \frac{f(X; \theta)}{f(X; \theta^*)} = 0$ if and only if $f(X; \theta) = f(X; \theta^*)$. Hence, if $\theta \ne \theta^*$, then $E_{\theta^*} \ln \frac{f(X; \theta)}{f(X; \theta^*)} < 0$. Now note that by definition of the MLE and its assumed uniqueness (assumption 3) we have
$$\{\omega : \hat{\theta}_n(\omega)\} = \{\omega : \ell_n(\hat{\theta}_n) > \ell_n(\theta) \text{ for all } \theta \in \Theta\} .$$
Hence, for any $\varepsilon > 0$ it can be shown from these two facts that $P(|\hat{\theta}_n - \theta^*| > \varepsilon) \to 0$ (details omitted).
Equivariance
The maximum likelihood estimator has a convenient invariance (equivariance) property: if $\hat{\theta}$ is the maximum likelihood estimator of $\theta$ and $\tau = g(\theta)$, then $\hat{\tau} = g(\hat{\theta})$ is the maximum likelihood estimator of $\tau$.
Example
Let $X_1, \ldots, X_n$ be random variables each with density function $f$, where
$$f(x; \theta) = 2\theta x e^{-\theta x^2} , \qquad x > 0 .$$
The maximum likelihood estimator of $\theta$ is $\hat{\theta} = n / \sum_i X_i^2$, so for the parameter $\tau = 1/\theta$ the invariance property gives
$$\hat{\tau} = \frac{1}{\hat{\theta}} = \frac{1}{n}\sum_i X_i^2 .$$
Let
$$X_1, \ldots, X_n \overset{\text{iid}}{\sim} f(x; \theta) .$$
Definition
The Fisher score is defined as the derivative of the log-likelihood,
$$S_n(\theta) = \ell_n'(\theta) = \frac{\partial}{\partial \theta} \sum_{i=1}^n \ln f(X_i; \theta) .$$
Figure 8.2: Ronald Fisher: the inventor of the maximum likelihood method and
“the greatest biologist since Darwin”. Fisher and Pearson were bitter rivals as each
considered their method of estimation and inference superior.
Result: $E_\theta S_n(\theta) = 0$.
proof: We have
$$0 = \frac{\partial}{\partial \theta}(1) = \frac{\partial}{\partial \theta} \int f(x; \theta)\,dx = \int \frac{\partial}{\partial \theta} f(x; \theta)\,dx .$$
Hence,
$$0 = \int \left( \frac{\partial}{\partial \theta}\ln f(x; \theta) \right) f(x; \theta)\,dx = E_\theta\!\left[ \frac{\partial}{\partial \theta}\ln f(X; \theta) \right] .$$
Finally, summing over the $n$ independent observations,
$$E_\theta S_n(\theta) = \sum_{i=1}^n E_\theta\!\left[ \frac{\partial}{\partial \theta}\ln f(X_i; \theta) \right] = 0 . \qquad \square$$
It is now possible to show that a maximum likelihood estimator has asymptotic mean squared error and standard error that are functions of the Fisher information
$$I_n(\theta) = -E_\theta[\ell_n''(\theta)] = n I_1(\theta) ,$$
where $I_1(\theta)$ is the Fisher information based on a single observation.
Result
Let X1 , . . . , Xn be random variables with common density function f de-
pending on a parameter θ, and let θbn be the maximum likelihood estimator
of $\theta$. Then as $n \to \infty$,
$$I_n(\theta)\,\operatorname{Var}(\hat{\theta}_n) \longrightarrow 1 .$$
Example
Recall the previous example of a random sample x1 , . . . , xn from a common density
function $f$, where
$$f(x; \theta) = 2\theta x e^{-\theta x^2}, \qquad x \ge 0; \; \theta > 0 .$$
Find the Fisher information for $\theta$, and hence the approximate $\operatorname{se}(\hat{\theta})$.
From previously, the log-likelihood function, written as a random variable, is
$$\ell(\theta) = n\ln(2) + n\ln(\theta) + \sum_i \ln(X_i) - \theta \sum_i X_i^2 .$$
Differentiating twice gives $\ell''(\theta) = -n/\theta^2$, so $I_n(\theta) = -E[\ell''(\theta)] = n/\theta^2$ and
$$\operatorname{se}(\hat{\theta}) \simeq \frac{1}{\sqrt{I_n(\hat{\theta})}} = \frac{\hat{\theta}}{\sqrt{n}} .$$
Note that in the above example, ℓ′′ (θ) has no random variables present so the ex-
pected value operation is redundant. This will not be the case in general, as in the
example below.
Example
Consider the daily number of hospital admissions due to asthma attacks. This can
be considered as a count of rare events, which can be modelled using the Poisson
distribution, where
$$f(x; \lambda) = e^{-\lambda}\,\frac{\lambda^x}{x!}, \qquad x \in \{0, 1, 2, \ldots\} .$$
Let X1 , X2 , . . . , Xn be the observed number of hospital admissions due to asthma
attacks on n randomly selected days (which can be considered to be iid).
Find:
1. ℓ(λ).
Asymptotic normality
Result
Under appropriate regularity conditions, the maximum likelihood estimator $\hat{\theta}$ satisfies
$$\frac{\hat{\theta} - \theta}{\sqrt{\operatorname{Var}(\hat{\theta})}} \xrightarrow{d} N(0,1) \qquad \text{and} \qquad \frac{\hat{\theta} - \theta}{\operatorname{se}(\hat{\theta})} \xrightarrow{d} N(0,1) ,$$
where
$$\operatorname{se}(\hat{\theta}) = \frac{1}{\sqrt{I_n(\theta)}} .$$
proof: By the mean value theorem applied to $\ell_n'$,
$$\ell_n'(\hat{\theta}_n) = \ell_n'(\theta) + (\hat{\theta}_n - \theta)\,\ell_n''(\vartheta_n)$$
for some $\vartheta_n$ between $\theta$ and $\hat{\theta}_n$. By the consistency of the MLE we have $\hat{\theta}_n \xrightarrow{P} \theta$ and $\vartheta_n \xrightarrow{P} \theta$. By the Central Limit Theorem (applied to the iid terms of the score), we have
$$\frac{\ell_n'(\theta)}{\sqrt{n I_1(\theta)}} \xrightarrow{d} Z \sim N(0,1) .$$
Hence, since $\ell_n'(\hat{\theta}_n) = 0$ (due to uniqueness and differentiability, the MLE is the critical point of the log-likelihood), we have
$$\frac{\sqrt{n}(\hat{\theta}_n - \theta)}{I_1(\theta)^{-1/2}} = \frac{\sqrt{n}\,(-\ell_n'(\theta))}{I_1(\theta)^{-1/2}\,\ell_n''(\vartheta_n)} = \frac{I_1(\theta)}{-\ell_n''(\vartheta_n)/n} \times \frac{\ell_n'(\theta)}{\sqrt{n I_1(\theta)}} .$$
It seems plausible that $-\ell_n''(\vartheta_n)/n \xrightarrow{P} I_1(\theta)$, because $\vartheta_n \xrightarrow{P} \theta$ and $-\ell_n''(\theta)/n \xrightarrow{P} I_1(\theta)$ (by the Law of Large Numbers). If we accept this, then the result follows from Slutsky's theorem:
$$\frac{\sqrt{n}(\hat{\theta}_n - \theta)}{I_1(\theta)^{-1/2}} = \underbrace{\frac{I_1(\theta)}{-\ell_n''(\vartheta_n)/n}}_{\xrightarrow{P} 1} \times \underbrace{\frac{\ell_n'(\theta)}{\sqrt{n I_1(\theta)}}}_{\xrightarrow{d} Z} \;\xrightarrow{d}\; Z \sim N(0,1) . \qquad \square$$
Remark 8.1 (Convergence of $-\ell_n''(\vartheta_n)/n$) By the mean value theorem we can write
$$\ell_n''(\vartheta_n) = \ell_n''(\theta) + (\vartheta_n - \theta)\,\ell_n'''(\xi_n)$$
for some $\xi_n \xrightarrow{P} \theta$, which is between $\theta$ and $\vartheta_n$. Assume the existence of a third derivative of $\ln f(x; \vartheta)$ such that
$$\left| \frac{\partial^3}{\partial \vartheta^3}\ln f(x; \vartheta) \right| \le g(x) \qquad \text{for all } \vartheta \in \Theta ,$$
for some dominating function $g$ with finite expectation $E_\theta[g(X)] = \mu < \infty$. Hence, we can write
$$\left| \frac{\ell_n'''(\xi_n)}{n} \right| \le \frac{1}{n}\sum_{i=1}^n g(X_i) \xrightarrow{P} \mu < \infty ,$$
and therefore, by Slutsky's theorem and using the fact that $X_n \xrightarrow{d} \text{constant}$ implies $X_n \xrightarrow{P} \text{constant}$, we have
$$\left| (\vartheta_n - \theta)\,\frac{\ell_n'''(\xi_n)}{n} \right| = |\vartheta_n - \theta| \left| \frac{\ell_n'''(\xi_n)}{n} \right| \xrightarrow{P} 0 \times \mu = 0 .$$
Hence,
$$\frac{-\ell_n''(\vartheta_n)}{n} = \underbrace{\frac{-\ell_n''(\theta)}{n}}_{\xrightarrow{P} I_1(\theta)} - \underbrace{(\vartheta_n - \theta)\frac{\ell_n'''(\xi_n)}{n}}_{\xrightarrow{P} 0} \;\xrightarrow{P}\; I_1(\theta) .$$
This result is important because it means that maximum likelihood is not only
useful for estimation, but is also a method for making inferences about parameters.
Because we now know how to find the approximate distribution of any maximum likelihood estimator $\hat{\theta}$, we can calculate standard errors and construct confidence intervals for $\theta$, for data from any family of distributions of known form.
Example
Recall the previous example, where
$$I_n(\theta) = n/\theta^2 .$$
Then
$$\operatorname{Var}(\hat{\theta}) \simeq \frac{\theta^2}{n} \qquad \text{and} \qquad \widehat{\operatorname{se}}(\hat{\theta}) = \frac{1}{\sqrt{I_n(\hat{\theta})}} = \hat{\theta}/\sqrt{n} .$$
Thus
$$\frac{\hat{\theta} - \theta}{\theta/\sqrt{n}} \xrightarrow{d} N(0,1) .$$
This means that we can approximate the distribution of $\hat{\theta}$ using
$$\hat{\theta} \overset{\text{appr.}}{\sim} N(\theta, \theta^2/n) .$$
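A Monte Carlo sketch of this approximation for the density $2\theta x e^{-\theta x^2}$. Since $X^2 \sim \text{Exp}(\theta)$ under this model, simulation is easy; the value $\theta = 2$ is an arbitrary choice.

# Sampling distribution of thetahat = n / sum(X_i^2) versus N(theta, theta^2/n)
set.seed(1)
theta <- 2; n <- 100; reps <- 5000
thetahat <- replicate(reps, { x2 <- rexp(n, rate = theta); n / sum(x2) })
c(mean(thetahat), var(thetahat))   # compare with theta = 2 and theta^2/n = 0.04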
Example
Recall the n measurements of the daily number of hospital admissions due to asthma
attacks, X1 , X2 , . . . , Xn , which we will model using the Poisson distribution, i.e.
$$f(x; \lambda) = e^{-\lambda}\,\frac{\lambda^x}{x!}, \qquad x \in \{0, 1, 2, \ldots\} .$$
Here, λ has the interpretation of being the average number of hospital admissions
due to asthma attacks per day.
Find the approximate distribution of $\hat{\lambda}$, the maximum likelihood estimator of $\lambda$.
Recall also the running example with density $f(x; \theta) = 2\theta x e^{-\theta x^2}$. There we have
$$\ell(\theta) = n\ln(\theta) + \sum_i \ln(X_i) - \theta \sum_i X_i^2 + \text{const.}$$
From here it follows that
$$\dot{\ell}(\theta) = \frac{n}{\theta} - \sum_i X_i^2 ,$$
with solution $\hat{\theta} = n \big/ \sum_i X_i^2$. This is a maximum, because
$$\ddot{\ell}(\theta) = -n/\theta^2 < 0 .$$
The Delta Method permits extension of the asymptotic normality result to a general
smooth function of θ:
Result
Under appropriate regularity conditions, including the existence of two derivatives of $L(\theta)$, if $\tau = g(\theta)$ and $\hat{\tau} = g(\hat{\theta})$, where $g$ is differentiable and $g'(\theta) \ne 0$, then
$$\frac{\hat{\tau} - \tau}{\sqrt{\operatorname{Var}(\hat{\tau})}} \xrightarrow{d} N(0,1) ,$$
where
$$\operatorname{Var}(\hat{\tau}) \simeq \frac{\{g'(\theta)\}^2}{I_n(\theta)} .$$
This result follows directly from the delta method result given in Chapter 5.
Example
Recall the example with density $f(x; \theta) = 2\theta x e^{-\theta x^2}$. Suppose that the parameter of interest is $\omega = \ln(\theta)$. Use maximum likelihood estimation to find an estimator of $\omega$ and its approximate distribution.
From previous working, $\hat{\theta} = n / \sum_i X_i^2$, so the maximum likelihood estimator of $\omega$ is
$$\hat{\omega} = \ln(\hat{\theta}) = \ln(n) - \ln\!\left( \sum_i X_i^2 \right) ,$$
and the variance of $\hat{\omega}$ is
$$\operatorname{Var}(\hat{\omega}) \simeq \frac{|1/\theta|^2}{n/\theta^2} = \frac{1}{n} .$$
Thus
$$\frac{\hat{\omega} - \omega}{1/\sqrt{n}} \xrightarrow{d} N(0,1) .$$
Example
Recall the n measurements of the daily number of hospital admissions due to asthma
attacks, X1 , X2 , . . . , Xn , which we will model using the Poisson distribution, i.e.
$$f(x; \lambda) = e^{-\lambda}\,\frac{\lambda^x}{x!}, \qquad x \in \{0, 1, 2, \ldots\} .$$
The parameter β = 1/λ has the interpretation of being the average time (in days)
between hospital admissions due to asthma attacks.
Find the approximate distribution of the maximum likelihood estimator of β.
Asymptotic optimality
In the case of smooth likelihood functions where asymptotic normality results can
be derived it is possible to argue that, asymptotically, the maximum likelihood
estimator is optimal or best.
Let $\tilde{\theta}_n = g(X_1, \ldots, X_n)$ be any estimator of $\theta$, with
$$E_\theta\, g(X_1, \ldots, X_n) = \mu_n(\theta) .$$
Then, under suitable regularity conditions,
$$\operatorname{Var}_\theta(\tilde{\theta}_n) \ge \frac{(\mu_n'(\theta))^2}{n I_1(\theta)} .$$
proof (sketch): Note that the derivative of the log-likelihood can be written in terms of the joint pdf:
$$\ell_n'(\theta) = \sum_{i=1}^n \frac{\partial}{\partial \theta}\ln f(X_i; \theta) = \frac{\partial}{\partial \theta}\ln f(\mathbf{X}; \theta) .$$
We have
$$\begin{aligned}
\operatorname{Var}_\theta(\tilde{\theta}_n)\, I_n(\theta) &= E_\theta\big[(\tilde{\theta}_n - \mu_n)^2\big]\, E_\theta\!\left[ \left( \frac{\partial}{\partial \theta}\ln f(\mathbf{X}; \theta) \right)^{\!2} \right] \\
&\ge \left( E_\theta\!\left[ (\tilde{\theta}_n - \mu_n)\frac{\partial}{\partial \theta}\ln f(\mathbf{X}; \theta) \right] \right)^{\!2} \qquad \text{using } EX^2\,EY^2 \ge (EXY)^2 \\
&= \left( \int f(\mathbf{x}; \theta)(\tilde{\theta}_n - \mu_n)\,\frac{\frac{\partial}{\partial \theta} f(\mathbf{x}; \theta)}{f(\mathbf{x}; \theta)}\, d\mathbf{x} \right)^{\!2}
= \left( \int \frac{\partial f}{\partial \theta}(\mathbf{x}; \theta)\,(\tilde{\theta}_n - \mu_n)\, d\mathbf{x} \right)^{\!2} \\
&= \left( \int \frac{\partial f}{\partial \theta}(\mathbf{x}; \theta)\,\tilde{\theta}_n\, d\mathbf{x} - \mu_n \int \frac{\partial f}{\partial \theta}(\mathbf{x}; \theta)\, d\mathbf{x} \right)^{\!2} \\
&= \Bigg( \underbrace{\frac{d}{d\theta} E_\theta[\tilde{\theta}_n]}_{\mu_n'(\theta)} - \mu_n \underbrace{\frac{d}{d\theta}\int f(\mathbf{x}; \theta)\, d\mathbf{x}}_{\frac{d}{d\theta}(1) = 0} \Bigg)^{\!2} = (\mu_n'(\theta))^2 \qquad \text{via assumption 5.}
\end{aligned}$$
Rearranging gives
$$\operatorname{Var}_\theta(\tilde{\theta}_n) \ge \frac{(\mu_n'(\theta))^2}{I_n(\theta)} . \qquad \square$$
As a corollary, for an unbiased estimator ($\mu_n(\theta) = \theta$, so $\mu_n'(\theta) = 1$) we have $\operatorname{Var}_\theta(\tilde{\theta}_n) \ge 1/I_n(\theta)$: no unbiased estimator can have variance smaller than the approximate variance of the maximum likelihood estimator.
Example
Recall the example with density $f(x; \theta) = 2\theta x e^{-\theta x^2}$, $x > 0$. It is known that
$$E(X) = \frac{1}{2}\sqrt{\frac{\pi}{\theta}} \qquad \text{and} \qquad \operatorname{Var}(X) = \frac{1}{\theta}\left( 1 - \frac{\pi}{4} \right) .$$
1. Find the method of moments estimator $\tilde{\theta}$ of $\theta$.
2. Write down the asymptotic distribution of $\bar{X}$.
3. Hence use the delta method to find the approximate distribution of $\tilde{\theta}$.
4. Compare the asymptotic properties of the estimators $\tilde{\theta}$ and $\hat{\theta}$. Which is the better estimator?
solution: Rearranging $(EX)^2 = \frac{\pi}{4\theta}$ we obtain the method of moments estimator
$$\tilde{\theta} = \frac{\pi}{4(\bar{X})^2} = g(\bar{X}) ,$$
whose asymptotic distribution we obtain using the delta method. From the Central Limit Theorem,
$$\sqrt{n}\left( \bar{X} - \sqrt{\frac{\pi}{4\theta}} \right) \xrightarrow{d} N\!\left( 0, \frac{1 - \pi/4}{\theta} \right) .$$
Hence,
$$\sqrt{n}\left( g(\bar{X}) - g\!\left( \sqrt{\tfrac{\pi}{4\theta}} \right) \right) \xrightarrow{d} N\!\Bigg( 0, \underbrace{\operatorname{Var}(X) \times \frac{4^2 \theta^3}{\pi}}_{\theta^2\left( \frac{16}{\pi} - 4 \right)} \Bigg) .$$
Since the maximum likelihood estimator has approximate variance $\theta^2/n$ and
$$\frac{16}{\pi} - 4 = 1.09295817\ldots > 1 ,$$
we conclude that the MLE beats the method of moments estimator, because the method of moments estimator has asymptotic variance roughly 9% higher than the variance of the MLE.
Combining the asymptotic normality of the maximum likelihood estimator with the usual normal-quantile argument, the interval
$$\left( \hat{\theta} - z_{1-\alpha/2}\,\operatorname{se}(\hat{\theta}),\;\; \hat{\theta} + z_{1-\alpha/2}\,\operatorname{se}(\hat{\theta}) \right)$$
is an approximate $1 - \alpha$ confidence interval for $\theta$ for large $n$, where $\operatorname{se}(\hat{\theta}) = 1/\sqrt{I_n(\theta)}$.
Example
Recall the example of a sample $x_1, \ldots, x_n$ from the density function $f$, where $f(x; \theta) = 2\theta x e^{-\theta x^2}$, $x > 0$.
1. Write down a formula for an approximate 95% confidence interval for $\theta$. (Note that $z_{0.975} = 1.96$.)
2. Use the following data to find an approximate 95% confidence interval for θ,
assuming that it is a random sample from a r.v with the density function above.
0.366 0.568 0.300 0.115 0.204 0.128 0.277 0.391 0.328 0.451
0.412 0.190 0.207 0.147 0.116 0.326 0.256 0.524 0.217 0.485
0.265 0.375 0.267 0.360 0.250 0.258 0.583 0.413 0.481 0.468
0.406 0.336 0.305 0.321 0.268 0.361 0.632 0.283 0.258 0.466
0.276 0.232 0.133 0.316 0.468 0.496 0.573 0.523 0.256 0.491
0.127 0.054 0.440 0.228 0.249 0.754 0.430 0.111 0.459 0.233
0.257 0.640 0.147 0.273 0.112 0.389 0.126 0.356 0.273 0.296
0.433 0.253 0.234 0.514 0.177 0.221 0.534 0.509 0.510 0.269
0.262 0.625 0.183 0.541 0.705 0.078 0.847 0.149 0.031 0.453
0.299 0.226 0.069 0.211 0.195 0.381 0.317 0.467 0.289 0.593
For these data $(x_1, \ldots, x_{100})$ we have $\sum_{i=1}^{100} x_i^2 = 14.018$. So using the formula derived previously, an approximate 95% confidence interval for $\theta$ is
$$\left( \frac{100}{14.018} - 1.96\,\frac{\sqrt{100}}{14.018},\;\; \frac{100}{14.018} + 1.96\,\frac{\sqrt{100}}{14.018} \right) = (5.17, 9.09) .$$
Result
Under the same conditions as the previous result, with $\tau = g(\theta)$ and $\hat{\tau} = g(\hat{\theta})$,
$$\lim_{n \to \infty} P\big( \hat{\tau} - z_{1-\alpha/2}\operatorname{se}(\hat{\tau}) < \tau < \hat{\tau} + z_{1-\alpha/2}\operatorname{se}(\hat{\tau}) \big) = 1 - \alpha ,$$
where $\operatorname{se}(\hat{\tau}) = |g'(\theta)|/\sqrt{I_n(\theta)}$. Therefore,
$$\big( \hat{\tau} - z_{1-\alpha/2}\operatorname{se}(\hat{\tau}),\;\; \hat{\tau} + z_{1-\alpha/2}\operatorname{se}(\hat{\tau}) \big)$$
is an approximate $1 - \alpha$ confidence interval for $\tau$ for large $n$.
For models with more than one parameter, such as
$$X_1, \ldots, X_n \sim N(\mu, \sigma^2) \qquad \text{and} \qquad X_1, \ldots, X_n \sim \text{Gamma}(\alpha, \beta),$$
the maximum likelihood principle still applies. Instead of maximising over a single variable, the maximisation is performed simultaneously over several variables.
Example
Consider the model $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, with log-likelihood
$$\ell(\mu, \sigma) = -n\ln\sigma - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 + \text{const.}$$
Then
$$\frac{\partial}{\partial \mu}\ell(\mu, \sigma) = \frac{1}{\sigma^2}\sum_i (x_i - \mu) = \frac{n}{\sigma^2}(\bar{x} - \mu) = 0$$
if and only if
$$\mu = \bar{x} ,$$
regardless of the value of $\sigma$. Also,
$$\frac{\partial}{\partial \sigma}\ell(\mu, \sigma) = \sigma^{-3}\sum_i (x_i - \mu)^2 - n\sigma^{-1} = 0$$
if and only if
$$\sigma = \sqrt{\frac{1}{n}\sum_i (x_i - \mu)^2} .$$
The unique stationary point of $\ell(\mu, \sigma)$ is then
$$(\hat{\mu}, \hat{\sigma}) = \left( \bar{x},\; \sqrt{\frac{1}{n}\sum_i (x_i - \bar{x})^2} \right) .$$
Analysis of the second order partial derivatives can be used to show that this is the global maximiser of $\ell(\mu, \sigma)$ over $\mu \in \mathbb{R}$ and $\sigma > 0$. Hence, the maximum likelihood estimators of $\mu$ and $\sigma$ are
$$\hat{\mu} = \bar{X} \qquad \text{and} \qquad \hat{\sigma} = \sqrt{\frac{1}{n}\sum_i (X_i - \bar{X})^2} .$$
In the multi-parameter case the Fisher information matrix is defined through the Hessian of the log-likelihood, $I_n(\boldsymbol{\theta}) = -E[H]$, where
$$H_{ij} = \frac{\partial^2}{\partial \theta_i\,\partial \theta_j}\ell(\boldsymbol{\theta}) .$$
Example
Consider the model $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ again. Then
$$\frac{\partial^2}{\partial \mu^2}\ell(\mu, \sigma) = \frac{-n}{\sigma^2}, \qquad
\frac{\partial^2}{\partial \mu\,\partial \sigma}\ell(\mu, \sigma) = \frac{-2n}{\sigma^3}(\bar{x} - \mu), \qquad
\frac{\partial^2}{\partial \sigma^2}\ell(\mu, \sigma) = -3\sigma^{-4}\sum_i (x_i - \mu)^2 + n\sigma^{-2} .$$
Noting that $E(X_i) = \mu$ and $E(X_i - \mu)^2 = \sigma^2$ for each $i$, we then get
$$E\!\left[ \frac{\partial^2}{\partial \mu^2}\ell(\mu, \sigma) \right] = \frac{-n}{\sigma^2}, \qquad
E\!\left[ \frac{\partial^2}{\partial \mu\,\partial \sigma}\ell(\mu, \sigma) \right] = 0, \qquad
E\!\left[ \frac{\partial^2}{\partial \sigma^2}\ell(\mu, \sigma) \right] = -3\sigma^{-4}\,n\sigma^2 + n\sigma^{-2} = -2n\sigma^{-2} ,$$
so that
$$I_n(\mu, \sigma) = \begin{pmatrix} n/\sigma^2 & 0 \\ 0 & 2n/\sigma^2 \end{pmatrix} .$$
Given the Fisher information matrix, we can find the asymptotic distribution of any function $\hat{\tau} = g(\hat{\boldsymbol{\theta}})$ of a vector of maximum likelihood estimators. But first we need to define the gradient vector, as below.
Definition
Let θ = (θ1 , . . . , θk ) be the vector of parameters in a multi-parameter model
and g(θ) = g(θ1 , . . . , θk ) be a real-valued function. The gradient vector
of g is given by
$$\nabla g(\boldsymbol{\theta}) = \begin{pmatrix} \dfrac{\partial g(\boldsymbol{\theta})}{\partial \theta_1} \\ \vdots \\ \dfrac{\partial g(\boldsymbol{\theta})}{\partial \theta_k} \end{pmatrix}$$
Example
Find $\nabla g(\mu, \sigma)$ where $g(\mu, \sigma) = \frac{\mu}{\sigma}$.
$$\nabla g(\mu, \sigma) = \begin{pmatrix} \frac{1}{\sigma} \\ -\frac{\mu}{\sigma^2} \end{pmatrix}$$
Result
Let $\tau = g(\boldsymbol{\theta})$ be a real-valued function of $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$, with maximum likelihood estimator $\hat{\boldsymbol{\theta}}$ and $\hat{\tau} = g(\hat{\boldsymbol{\theta}})$. Under appropriate regularity conditions, including the existence of all second order partial derivatives of $L(\boldsymbol{\theta})$ and first order partial derivatives of $g$, as $n \to \infty$,
$$\frac{\hat{\tau} - \tau}{\operatorname{se}(\hat{\tau})} \xrightarrow{d} N(0,1)$$
where
$$\operatorname{se}(\hat{\tau}) = \sqrt{\nabla g(\boldsymbol{\theta})^\top I_n(\boldsymbol{\theta})^{-1}\,\nabla g(\boldsymbol{\theta})} .$$
This is a multi-parameter extension of the delta method result given on page 213.
Example
Let
X1 , . . . , Xn ∼ N(µ, σ 2 )
Derive a convergence result that gives us the approximate distribution of $\hat{\tau} = g(\hat{\mu}, \hat{\sigma}) = \hat{\mu}/\hat{\sigma}$.
Given the invariance property of maximum likelihood estimators, the maximum likelihood estimator of $\tau$ is
$$\hat{\tau} = \frac{\hat{\mu}}{\hat{\sigma}} = \frac{\bar{X}}{\sqrt{\frac{1}{n}\sum_i (X_i - \bar{X})^2}} .$$
We can use the asymptotic normality result previously stated to obtain an approx-
imate confidence interval for
τ = g(θ) = g(θ1 , . . . , θk ).
Result
Under appropriate regularity conditions, including the existence of all sec-
ond order partial derivatives of L(θ), if τ = g(θ) where each component of
g is differentiable then:
$$\lim_{n \to \infty} P\big( \hat{\tau} - z_{1-\alpha/2}\operatorname{se}(\hat{\tau}) < \tau < \hat{\tau} + z_{1-\alpha/2}\operatorname{se}(\hat{\tau}) \big) = 1 - \alpha ,$$
where
$$\operatorname{se}(\hat{\tau}) = \sqrt{\nabla g(\boldsymbol{\theta})^\top I_n(\boldsymbol{\theta})^{-1}\,\nabla g(\boldsymbol{\theta})} .$$
Therefore,
$$\big( \hat{\tau} - z_{1-\alpha/2}\operatorname{se}(\hat{\tau}),\;\; \hat{\tau} + z_{1-\alpha/2}\operatorname{se}(\hat{\tau}) \big)$$
is an approximate 1 − α confidence interval for τ for large n.
Example
Let
X1 , . . . , Xn ∼ N(µ, σ 2 )
Derive a formula for an approximate 99% confidence interval for τ = g(µ, σ) = µ/σ.
As shown previously, the maximum likelihood estimator for τ is
$$\hat{\tau} = \frac{\bar{X}}{\sqrt{\frac{1}{n}\sum_i (X_i - \bar{X})^2}} .$$
Hypothesis Testing
Hypothesis testing is the most commonly used formal statistical vehicle for mak-
ing decisions given some data.
For further reading, consider Hogg et al (2005) sections 5.5, 5.6 and 6.3 (likelihood
ratio and Wald-type tests only), or Rice (2007) sections 9.1-9.4 and 11.2.1. You may
also read through Section 5.3 in Kroese & Chan (2014).
There are many examples of situations in which we are interested in testing a specific
hypothesis using sample data. In fact, most science is done using this approach!
(And indeed other forms of quantitative research.)
Below are some examples that we will consider in this chapter.
Example
Recall that the Mythbusters were testing whether or not toast lands butter side
down more often than butter side up.
In 24 trials, they found that 14 slices of bread landed butter side down.
Is this evidence that toast lands butter-side down more often than butter side up?
Example
Before the installation of new machinery at a chemical plant, the daily yield of
fertilizer produced at the plant had a mean µ = 880 tonnes. Some new machinery
was installed, and we would like to know if the new machinery is more efficent (i.e.
if µ > 880).
During the first n = 50 days of operation of the new machinery, the yield of fertilizer
was recorded and the sample mean obtained was x̄ = 888 with a standard deviation
s = 21.
Is there evidence that the new machinery is more efficient?
Example
(Ecology 2005, 86:1057-1060)
Do ravens intentionally fly towards gunshot sounds (to scavenge on the carcass they
expect to find)? Crow White addressed this question by going to 12 locations, firing
a gun, then counting raven numbers 10 minutes later. He repeated the process at
12 different locations where he didn’t fire a gun. Results:
no gunshot 0 0 2 3 5 0 0 0 1 0 1 0
gunshot 2 1 4 1 0 5 0 1 0 3 5 2
A key first step in hypothesis testing is identifying the hypothesis that we would
like to test (the null hypothesis), and the alternative hypothesis of interest to us,
defined as below.
Definitions
The null hypothesis, labelled H0 , is a claim that a parameter of interest
to us (θ) takes a particular value (θ0 ). Hence H0 has the form θ = θ0 for
some pre-specified value θ0 .
The alternative hypothesis, labelled H1 , is a more general hypothesis
about the parameter of interest to us, which we will accept to be true if
the evidence against the null hypothesis is strong enough. The form of H1
tends to be one of the following:
H1 : θ ̸= θ0
H1 : θ > θ0
H1 : θ < θ0
In later statistics courses you will meet hypothesis tests concerning hypotheses that
have a more general form than the above. H0 does not actually need to have the
form H0 : θ = θ0 but instead can just specify some restriction on the range of
possible values for θ, i.e. H0 : θ ∈ ω0 for some set ω0 ⊂ Ω where Ω is the parameter
space. It is typical for the alternative hypothesis to be more general than the null,
e.g. H1 : θ ∉ ω0.
Example
Consider the Mythbusters example on page 225.
State the null and alternative hypotheses.
Here we have a sample from a binomial distribution, where the binomial parameter
p is the probability that a slice of toast will land butter side down. We are interested
in whether there is evidence that the parameter p is larger than 0.5.
Example
Consider the machinery example on page 226.
State the null and alternative hypotheses.
In all of the examples of the previous section, there is one hypothesis of primary
interest, which involves specifying a specific value for a parameter of interest to
us.
Can you find the null and alternative hypotheses for all the above situations?
1. State the null hypothesis (H0 ) and the alternative hypothesis (H1 ). By con-
vention, the null hypothesis is the more specific of the two hypotheses.
2. Weigh the evidence in the data against H0 :
i) Find a test statistic that measures how “far” our data are from what is
expected under the null hypothesis. (You must know the approximate
distribution of this test statistic assuming H0 to be true. This is called
the null distribution.)
ii) Calculate a P -value, a probability that measures how much evidence
there is against the null hypothesis, for the data we observed. A P -value
is defined as the probability of observing a test statistic value as or more
unusual than the one we observed, if the null hypothesis were true.
Example
The Mythbusters example can be used to illustrate the above steps of a hypothesis
test.
1. Let
p = P(Toast lands butter side down)
Then what we want to do is choose between the following two hypotheses:
H0 : p = 1/2 versus H1 : p > 1/2
2. We want to answer the question “How much evidence (if any) does our sample
(14 of 24 land butter side down) give us against the claim that p = 0.5?”
(a) To answer this question, we will consider p̂, the sample proportion, and
in particular we will look at the test statistic
Z = (p̂ − p) / √( p(1 − p)/n ) →d N(0, 1).
Under the null hypothesis,
Z = (p̂ − 0.5) / √( 0.5(1 − 0.5)/n ) →d N(0, 1).
This statistic measures how far our sample data are from what is expected under H0. The further
p̂ is from 0.5, the further Z is from 0.
(b) To find out if p̂ = 14/24 is unusually large, if p = 0.5, we can calculate
P( p̂ > 14/24 ) ≃ P( Z > (14/24 − 0.5) / √( 0.5(1 − 0.5)/n ) ) ≃ P(Z > 0.82) ≃ 0.2071
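As an added aside (not part of the original example), this calculation is easy to reproduce in R, and binom.test gives an exact version of the same test:
> phat = 14/24
> z = (phat - 0.5) / sqrt(0.5 * (1 - 0.5) / 24)
> 1 - pnorm(z)                                           # normal-approximation P-value, about 0.21
> binom.test(14, 24, p = 0.5, alternative = "greater")   # exact binomial test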
Interpreting P -values
The most common way that a hypothesis test is conducted and reported is via the
calculation of a P -value.
Example
When reading scientific research, you will often see comments such as the following:
• “Paired t-tests indicated that there was no change in attitude (P > 0.05)”
• “The Yahoo! search activity associated with specific cancers correlated with
their estimated incidence (Spearman rank correlation = 0.50, P = 0.015)”
• “There was no significant difference (p > 0.05) between the first and second
responses”
So what is a P -value? And how do you interpret a P -value once you’ve calculated
it?
Definition
The P -value of an observed test statistic is the probability of observing a test
statistic value as or more unusual than the one observed, if the null hypothesis
were true.
The following are some rough guidelines on interpreting P -values. Note though that
the interpretation of P -values should depend on the context to some extent, so the
below should be considered as a guide only and not as strict rules.
Example
Consider again the Mythbusters example.
We calculated that the P -value was about 0.22. (This P -value measured how often
you would get further from 0.5 than p̂ = 14/24.)
Draw a conclusion from this hypothesis test.
This P -value is large – it is in the “little or no evidence against H0 ” range. Hence
we conclude that there is little or no evidence against the claim that toast lands
butter side down just as often as it lands butter side up.
Example
Before the installation of new machinery, the daily yield of fertilizer produced by a
chemical plant had a mean µ = 880 tonnes. Some new machinery was installed, and
we would like to know if the new machinery is more efficient (i.e. if µ > 880).
During the first n = 50 days of operation of the new machinery, the yield of fertilizer
was recorded. The sample mean was x̄ = 888 with a standard deviation s = 21.
Is there evidence that the new machinery is more efficient? Use a hypothesis test to
answer this question, assuming that yield is approximately normal.
Using the hypothesis testing steps given previously:
1. H0 : µ = 880 versus H1 : µ > 880.
2. We estimate µ using the sample mean X̄. Now we want to know if our X̄ is
so far from µ = 880 that it provides evidence against H0 .
i) Our test statistic is
T = (X̄ − µ) / (S/√n) ∼ t_{n−1} ,
which under H0 becomes
T = (X̄ − 880) / (S/√n) ∼ t_{n−1} .
ii) We now would like to calculate a P -value that measures how unusual it
is to get a mean as large as x̄ = 888, if the true mean were µ = 880:
P( T > (x̄ − 880)/(s/√n) ) = P( T > (888 − 880)/(21/√50) ) = P(T > 2.69)
Alternatively, you could say that “the mean is significantly different from 880”.
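The tail probability P(T > 2.69) is not evaluated in the notes; as an added check, it can be computed in R from the quoted summary statistics (n = 50, hence 49 degrees of freedom):
> tstat = (888 - 880) / (21 / sqrt(50))
> tstat                                     # about 2.69
> pt(tstat, df = 49, lower.tail = FALSE)    # one-sided P-value, about 0.005
A P-value this small is very strong evidence against H0 , consistent with the conclusion above.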
One-sided and two-sided tests
A one-sided test is a test of
H0 : θ = θ0 versus H1 : θ < θ0
or
H0 : θ = θ0 versus H1 : θ > θ0 .
A two-sided test is a test of
H0 : θ = θ0 versus H1 : θ ≠ θ0 .
Example
A popular brand of yoghurt claims to contain 120 calories per serving. A consumer
watchdog group randomly sampled 14 servings of the yoghurt and obtained the
following numbers of calories per serving:
The hypotheses of interest are H0 : µ = 120 versus H1 : µ ≠ 120, where
µ = mean calorie value of a serving of the yoghurt.
so there is very strong evidence against H0 . The consumer watchdog group should
conclude that the manufacturer’s calorie claim is not correct.
Rejection regions
Definition
The rejection region is the set of values of the test statistic for which H0
is rejected in favour of H1 .
The term “rejection region” comes from the fact that often people speak of “rejecting
H0 ” (if our data provide evidence against H0 ) or “retaining H0 ” (if our data do not
provide evidence against H0 ). If our test statistic is in the rejection region we reject
H0 ; if the test statistic is not in the rejection region we retain H0 .
To determine a rejection region, we first choose a size or significance level for the
test, which can be defined as the P -value at which we would start rejecting H0 .
It should be set to a small number (typically 0.05), and by convention is usually
denoted by α.
Once we have determined the desired size of the test, we can then derive the rejection
region, as in the below examples.
Example
Recall the fertilizer example – we wanted to test
Example
Recall the yoghurt example – we wanted to test
We now have two different approaches. One approach involves computing the P -
value based on the observed test statistic, and deciding whether or not it is small
enough to reject H0 . The other approach is based on setting the significance level α
of the test to some number (like 0.05), then working out the range of values of our
test statistic for which H0 should be rejected at this significance level.
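As an illustration of the significance-level approach (an addition to the notes, assuming a 5% level, the fertilizer example with n = 50, and a two-sided test for the yoghurt example with n = 14), the critical values can be found in R as:
> qt(0.95, df = 49)     # one-sided test at the 5% level: reject H0 if T > 1.68
> qt(0.975, df = 13)    # two-sided test at the 5% level: reject H0 if |T| > 2.16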
We will mainly use the P -value approach because it is more informative, and because
it is more commonly used in practice. The significance level approach is however
useful in determining important properties of tests, such as their Type I and Type
II error.
Definition
Type I error corresponds to rejection of the null hypothesis when it is
really true.
                  reject H0            accept H0
H0 true           Type I error         correct decision
H0 false          correct decision     Type II error
Example
For the “chemical plant” example of page 226, a Type I error would correspond to
concluding that the machinery has a positive effect on yield, when in actual fact it
doesn’t. Type II error corresponds to concluding that the machinery does not have
a positive effect on yield, when in actual fact it does.
[Figure: sampling distributions of the test statistic with the critical value marked, illustrating the Type I error probability (if H0 were true) and the Type II error probability.]
Clearly we would like to avoid both Type I and Type II errors, however there is
always a chance that either one will occur so the idea is to reduce these chances as
much as possible. Unfortunately, making the probability of one type of error small
has the effect of making the probability of the other large.
In most practical situations, we can readily control Type I error, but Type II error
is more difficult to get a handle on.
Definition
The size or significance level of a test is the probability of committing a
Type I error. It is usually denoted by α. Therefore,
α = size
= significance level
= P(committing Type I error)
= P(reject H0 when H0 is true)
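The interpretation of α as a long-run rate of wrongful rejection can be checked by simulation. A minimal sketch (an addition to the notes, using the fertilizer set-up with H0 true: µ = 880, σ = 21, n = 50 and α = 0.05):
> nsim = 10000
> reject = replicate(nsim, {
+     x = rnorm(50, mean = 880, sd = 21)
+     tstat = (mean(x) - 880) / (sd(x) / sqrt(50))
+     tstat > qt(0.95, df = 49)        # TRUE if H0 is rejected at the 5% level
+ })
> mean(reject)                         # proportion of wrongful rejections, close to 0.05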
Type II error is usually quantified through a concept called power, which will be
discussed in detail in a later section.
A popular choice of α is α = 0.05. This corresponds to the following situation:
“We have set up our test in such a way that if we do reject H0 then there is only a
5% chance that we will wrongfully do so.”
There is nothing special about α = 0.05. Sometimes it might be better to have
α = 0.1 or α = 0.01, depending on the application.
Example
If we want to test the following hypotheses about a new vaccine:
H0 : vaccine is perfectly safe.
versus
H1 : vaccine has harmful side effects
then it is important to minimise Type II error – we want to detect any harmful side
effects that are present. To assist in minimising Type II error, we might be prepared
to accept a reasonably high Type I error (such as 0.1 or maybe even 0.25).
Example
Suppose we are testing a potential water source for toxic levels of heavy metals. If
the hypotheses are
H0 : the water has toxic levels of heavy metals
versus
H1 : the water is OK to drink
then it is important to minimise Type I error – we don't want to make the mistake
of saying that toxic water is OK to drink! So we would want to choose a low Type
I error, maybe 0.001 say.
Therefore, a test with high power is able to detect deviation from the null hypothesis
with high probability. The formal definition of the power of a test about µ is given
by:
Definition
Consider a test about a mean µ. Then
power(µ) = P(reject H0 when the true value of the mean is µ).
Therefore the power of a test is really a function, or curve, that depends on the true
value of µ. It is easy to check that
power(value of µ under H0 ) = P(Type I error) = α
But for all other values of µ we want power(µ) to be as close to 1 as possible.
The following graph shows the ‘ideal’ power curve as well as a typical actual power
curve.
[Figure: the ‘ideal’ power curve and a typical actual power curve, plotted as functions of µ with the null value µ = 880 marked.]
Also, note that power and Type II error are inversely related:
Result
power(µ) = 1 − P(Type II error when the true value of the mean is µ).
Example
Consider again the chemical plant example, where we are testing
That is, if mean production really was 882 then our test has only a 16% chance of
detecting this change and rejecting H0 .
The power seems low in the above scenario – this is because relative to sample
variation (σ = 21), an increase in yield of 2 is not particularly large. If, on the other
hand, yield had increased by 6 tonnes to 886, the power could be shown to be 0.64
in this case. And if yield had increased by 10 tonnes to 890, power would be 95%.
We can do such calculations for general µ > 880 and arrive at a function of power
values:
[Figure: the power curve power(µ) for the X̄ test, equal to 0.05 at µ = 880 and increasing towards 1 as µ increases to 890.]
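The power values quoted above can be reproduced with a normal approximation (an added check, treating σ as known and equal to 21, with a one-sided 5% test and n = 50):
> n = 50; sigma = 21; alpha = 0.05
> crit = 880 + qnorm(1 - alpha) * sigma / sqrt(n)      # reject H0 when the sample mean exceeds this
> power = function(mu) 1 - pnorm((crit - mu) / (sigma / sqrt(n)))
> power(c(882, 886, 890))                              # approximately 0.16, 0.64 and 0.95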
It is interesting to graphically compare the power curves from the X̄ test and the
barrel test for the “chemical plant” example:
[Figure (a): power curves of the X̄-based test and the barrel test for 880 ≤ µ ≤ 890; both equal 0.05 at µ = 880, and the X̄-based test curve lies above the barrel test curve.]
Here we see that the X̄ test has higher power than the Barrel test for all values of
µ, so is clearly superior.
Further notes on power are:
• For a good test of H0 : µ = 880 versus H1 : µ > 880, power is larger for higher values of µ and smaller σ.
• Power may be used to compare two or more tests for a given significance level.
For example, if we had two competing tests with significance level α = 0.05
then we would want to use the test that has the higher power. The previous
figure shows that the X̄ test is clearly superior to the Barrel test.
Hypothesis testing is a tool that can be applied in a range of settings. If the null
hypothesis is that some parameter equals a specific value (e.g. p = 0.5, µ = 880,
θ = 1, . . .) then all we need to construct a hypothesis test is an estimator of the
parameter of interest and an approximate distribution for the estimator, under the
null hypothesis.
We have focussed in previous sections on one-sample tests of the mean, e.g. the
chemical plant and hairdresser examples. In this section we consider some other
commonly encountered hypothesis testing situations.
Consider the situation where we have two independent normal random samples,
X1 , . . . , X_{nX} and Y1 , . . . , Y_{nY}, with means µX and µY and a common variance. To test
H0 : µX = µY versus H1 : µX ≠ µY
we can use the test statistic
(X̄ − Ȳ) / ( Sp √( 1/nX + 1/nY ) ) ∼ t_{nX+nY−2} under H0
This statistic assumes that the two samples have equal variance – as in the two-
sample confidence interval method of page 189.
Example
Do MATH2801 and MATH2901 students spend the same amount at the hairdressers,
on average?
A class survey obtained the following results:
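The class survey figures are not reproduced here, but with the two groups stored as vectors the equal-variance two-sample test can be carried out with t.test. The numbers below are placeholders, not the actual survey data:
> math2801 = c(30, 45, 0, 60, 25)       # hypothetical haircut costs ($)
> math2901 = c(20, 50, 15, 80, 40)      # hypothetical haircut costs ($)
> t.test(math2801, math2901, var.equal = TRUE)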
Wald Tests
So far we have only considered the special cases of normal or binomial data. How
about the general situation:
When the sample size is large the Wald test procedure often provides a satisfactory
solution:
The Wald Test
Consider the hypotheses
H0 : θ = θ0 versus H1 : θ ̸= θ0
Suppose θ̂ is an estimator of θ such that
(θ̂ − θ) / se(θ̂) →d N(0, 1).
Then, under H0 , the Wald test statistic
W = (θ̂ − θ0) / se(θ̂) ∼ N(0, 1) approximately.
Usually the estimator θb in the Wald test is the maximum likelihood estimator since,
for smooth likelihood situations, this estimator satisfies the asymptotic normality
requirement, and a formula for the (approximate) standard error is readily available.
Example
Consider the sample of size n = 100
0.366 0.568 0.300 0.115 0.204 0.128 0.277 0.391 0.328 0.451
0.412 0.190 0.207 0.147 0.116 0.326 0.256 0.524 0.217 0.485
0.265 0.375 0.267 0.360 0.250 0.258 0.583 0.413 0.481 0.468
0.406 0.336 0.305 0.321 0.268 0.361 0.632 0.283 0.258 0.466
0.276 0.232 0.133 0.316 0.468 0.496 0.573 0.523 0.256 0.491
0.127 0.054 0.440 0.228 0.249 0.754 0.430 0.111 0.459 0.233
0.257 0.640 0.147 0.273 0.112 0.389 0.126 0.356 0.273 0.296
0.433 0.253 0.234 0.514 0.177 0.221 0.534 0.509 0.510 0.269
0.262 0.625 0.183 0.541 0.705 0.078 0.847 0.149 0.031 0.453
0.299 0.226 0.069 0.211 0.195 0.381 0.317 0.467 0.289 0.593
H0 : θ = 6 versus H1 : θ ̸= 6
Here the estimated standard error of the maximum likelihood estimator is
se(θ̂) = θ̂ / √100 .
It may also be shown that θb is asymptotically normal so the Wald test applies. The
Wald test statistic is
W = (θ̂ − 6) / se(θ̂) = (θ̂ − 6) / (θ̂/√100) = ( 100/Σ_{i=1}^{100} Xi² − 6 ) / ( (100/Σ_{i=1}^{100} Xi²) / 10 ).
Since Σ_{i=1}^{100} xi² = 14.081 the observed value of W is
w = ( 100/14.081 − 6 ) / ( (100/14.081) / 10 ) = 1.5514.
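The corresponding two-sided P-value (quoted as about 0.12 when this example is revisited in the next section) can be obtained in R; this is an added check using θ̂ = 100/Σxi² as in the expression above:
> sumsq = 14.081
> theta.hat = 100 / sumsq                       # maximum likelihood estimate
> w = (theta.hat - 6) / (theta.hat / sqrt(100))
> w                                             # about 1.55
> 2 * pnorm(-abs(w))                            # two-sided P-value, about 0.12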
Example
(Ecology 2005, 86:1057-1060)
Do ravens intentionally fly towards gunshot sounds (to scavenge on the carcass they
expect to find)? Crow White addressed this question by going to 12 locations, firing
a gun, then counting raven numbers 10 minutes later. He repeated the process at
12 different locations where he didn’t fire a gun. Results:
no gunshot 0 0 2 3 5 0 0 0 1 0 1 0
gunshot 2 1 4 1 0 5 0 1 0 3 5 2
Is there evidence that ravens fly towards the location of gunshots? Answer this
question using an appropriate Wald test.
Here we might proceed by assuming that Bi and Ai are the numbers of ravens at the
i-th location (i = 1, . . . , 12 = n) before and after the firing of the gunshots. Note
that here A1 , . . . , An and B1 , . . . , Bn are pairwise dependent, because each Ai depends
on the corresponding Bi . We only have independence across locations; that is, the pairs
(A1 , B1 ), . . . , (An , Bn )
are independent with common mean (µa , µb ). Here, due to the dependence, we
cannot use a two-sample Student test. A crude alternative model is to consider the
difference
Xi = Ai − Bi
Then, X1 , . . . , Xn are independent and we could test
H0 : µa = µb versus H1 : µa > µb .
The observed differences have mean 1 and standard deviation 2.73, giving a P -value of approximately
2 × P( T > 1/(2.73/√12) ) ≈ 2 × 0.1153 ≈ 0.23.
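The figures above can be reproduced directly from the raven counts (an added check; it works with the differences Xi = Ai − Bi and a t distribution with 11 degrees of freedom):
> nogun = c(0, 0, 2, 3, 5, 0, 0, 0, 1, 0, 1, 0)
> gun   = c(2, 1, 4, 1, 0, 5, 0, 1, 0, 3, 5, 2)
> d = gun - nogun
> c(mean(d), sd(d))                              # 1 and about 2.73
> tstat = mean(d) / (sd(d) / sqrt(12))
> 2 * pt(tstat, df = 11, lower.tail = FALSE)     # about 0.23
> t.test(gun, nogun, paired = TRUE)              # gives the same two-sided P-value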
The Wald test described in the previous section is a general testing procedure for
the situation where an asymptotically normal estimator is available. An even more
general procedure, with good power properties, is the likelihood ratio test.
Consider testing
H0 : θ = θ0 versus H1 : θ ≠ θ0
using the likelihood ratio statistic Λ(θ̂n) = 2( ℓn(θ̂n) − ℓn(θ0) ). Under H0 , and under appropriate regularity conditions,
Λ(θ̂n) →d χ²₁ .
We will frequently drop the n's from the notation for convenience. Let λ be
the observed value of Λ. Then the approximate P -value is given by
P -value ≃ P_{θ0}(Λ > λ) = P(Q > λ) = 2Φ(−√λ)
where Q ∼ χ²₁.
proof: Assume the H0 hypothesis is true and that the data X1 , . . . , Xn are iid
random variables from the ‘true’ f (x; θ0 ). Then, recall that under some regularity
conditions
√( n I1(θ0) ) (θ0 − θ̂n) →d Z ∼ N(0, 1)
By Taylor’s expansion around θbn and the Mean Value Theorem we have
ℓn(θ0) = ℓn(θ̂n) + ℓ′n(θ̂n)(θ0 − θ̂n) + ½ (θ0 − θ̂n)² ℓ″n(ϑn)
for some ϑn between θ̂n and θ0 . Since ℓ′n(θ̂n) = 0, we obtain after rearrangement
Λ = 2( ℓn(θ̂n) − ℓn(θ0) ) = (θ0 − θ̂n)² { −ℓ″n(ϑn) } = n I1(θ0) (θ0 − θ̂n)² × { −ℓ″n(ϑn) / (n I1(θ0)) } →d Z² ∼ χ²₁ ,
where we used the result from Remark 8.1 and Slutsky's theorem. □
Remark 9.1 (Relation between Wald and Likelihood Ratio Test Statistics)
A Wald statistic uses the horizontal axis, for θ, to construct a test statistic – we
take θ̂ and compare it to θ0 , to see if θ̂ is significantly far from θ0 .
In contrast, a likelihood ratio statistic uses the vertical axis, for ℓ(θ), to construct
a test statistic – we take the maximised log-likelihood, ℓ(θ̂), and compare it to the
log-likelihood under the null hypothesis, ℓ(θ0), to see if ℓ(θ0) is significantly far from
the maximum.
Note that if W = (θ̂n − θ0)/se(θ̂n) is the Wald statistic, then
Λ = 2( ℓn(θ̂n) − ℓn(θ0) ) = { −ℓ″n(ϑn) se²(θ̂n) } W² ,
and since −ℓ″n(ϑn)/n →P I1(θ0) while n se²(θ̂n) → 1/I1(θ0), the factor in braces converges in probability to 1.
Thus, the Wald and likelihood ratio tests are asymptotically equivalent when the null
hypothesis is true – and in large samples, they typically return similar test statistics
hence similar conclusions. However, when the null hypothesis is not true, these tests
can have quite different properties, especially in small samples.
Example
Consider, one last time, the sample of size n = 100
0.366 0.568 0.300 0.115 0.204 0.128 0.277 0.391 0.328 0.451
0.412 0.190 0.207 0.147 0.116 0.326 0.256 0.524 0.217 0.485
0.265 0.375 0.267 0.360 0.250 0.258 0.583 0.413 0.481 0.468
0.406 0.336 0.305 0.321 0.268 0.361 0.632 0.283 0.258 0.466
0.276 0.232 0.133 0.316 0.468 0.496 0.573 0.523 0.256 0.491
0.127 0.054 0.440 0.228 0.249 0.754 0.430 0.111 0.459 0.233
0.257 0.640 0.147 0.273 0.112 0.389 0.126 0.356 0.273 0.296
0.433 0.253 0.234 0.514 0.177 0.221 0.534 0.509 0.510 0.269
0.262 0.625 0.183 0.541 0.705 0.078 0.847 0.149 0.031 0.453
0.299 0.226 0.069 0.211 0.195 0.381 0.317 0.467 0.289 0.593
Also,
ℓ(6) = 100 ln(2) + 100 ln(6) + Σ_{i=1}^{100} ln(Xi) − 6 Σ_{i=1}^{100} Xi²
and so the likelihood ratio statistic is
Λ = 2 [ 100{ln(100/6) − 1} − 100 ln( Σ_{i=1}^{100} Xi² ) + 6 Σ_{i=1}^{100} Xi² ].
Since Σ_{i=1}^{100} xi² = 14.081 the observed value of Λ is
λ = 2 [ 100{ln(100/6) − 1} − 100 ln(14.081) + 6 × 14.081 ] = 2.689.
Then
p-value = P_{θ=6}(Λ > λ)
        = P(Q > 2.689),   Q ∼ χ²₁
        = P(Z² > 2.689),  Z ∼ N(0, 1)
        = 2P(Z ≤ −√2.689)
        = 2Φ(−1.64) ≈ 0.10
which is close to the p-value of 0.12 obtained via the Wald test in the previous sec-
tion. The conclusion remains that there is little or no evidence against H0 and that
H0 should be retained.
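As with the Wald test, the likelihood ratio calculation is easy to check in R (an added aside, using Σxi² = 14.081 as above):
> sumsq = 14.081
> lambda = 2 * (100 * (log(100/6) - 1) - 100 * log(sumsq) + 6 * sumsq)
> lambda                                         # about 2.69
> pchisq(lambda, df = 1, lower.tail = FALSE)     # P-value, about 0.10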
The likelihood ratio test procedure can be extended to hypothesis tests involving
several parameters simultaneously. Such situations arise, for example, in the impor-
tant branch of statistics known as regression.
Consider a model with parameter vector θ and corresponding parameter space Θ.
A general class of hypotheses is:
H0 : θ ∈ Θ0 versus H1 : θ ∉ Θ0
We will now show how the multivariate version motivates the one- and two-sample
student hypothesis tests.
H0 : µ = µ0 versus H1 : µ ̸= µ0
The likelihood ratio statistic is Λ = 2( ℓ(µ̂, σ̂²) − ℓ(µ̂0, σ̂0²) ), where
µ̂ = X̄ ,
σ̂² = (1/n) Σ_{i} (Xi − X̄)² ,
µ̂0 = µ0 ,
σ̂0² = (1/n) Σ_{i} (Xi − µ0)²
H0 : µX = µY = µ versus H1 : µX ̸= µY
where µ, µX , µY , σ are all unknown. We now derive the two sample test statistic
T = (X̄ − Ȳ) / ( Sp √( 1/n + 1/m ) ) ∼ t_{m+n−2} ,
where
Sp² = [ Σ_{i=1}^{n} (Xi − X̄)² + Σ_{i=1}^{m} (Yi − Ȳ)² ] / (m + n − 2)
using the likelihood ratio test. The likelihood function of the joint data is
L(µX , µY , σ²) = (2πσ²)^{−(m+n)/2} exp( −[ Σ_{i=1}^{n} (Xi − µX)² + Σ_{i=1}^{m} (Yi − µY)² ] / (2σ²) )
Λ(θ) = 2( ℓ(µ̂X , µ̂Y , σ̂²) − ℓ(µ̂, µ̂, σ̃²) )
where
µ̂X = X̄ ,    µ̂Y = Ȳ ,
σ̂² = [ Σ_{i=1}^{n} (Xi − X̄)² + Σ_{i=1}^{m} (Yi − Ȳ)² ] / (m + n)
is the MLE of θ without any restrictions and
µ̂ = (nX̄ + mȲ)/(n + m) = [ n/(m + n) ] X̄ + [ m/(m + n) ] Ȳ ,
σ̃² = [ Σ_{i=1}^{n} (Xi − µ̂)² + Σ_{i=1}^{m} (Yi − µ̂)² ] / (m + n)
is the MLE of θ with the restriction µX = µY = µ. We now simplify the likelihood
ratio to obtain:
Λ(θ) = 2( ℓ(µ̂X , µ̂Y , σ̂²) − ℓ(µ̂, µ̂, σ̃²) )
     = −(m + n) ln( [ Σ_{i=1}^{n} (Xi − X̄)² + Σ_{i=1}^{m} (Yi − Ȳ)² ] / [ Σ_{i=1}^{n} (Xi − µ̂)² + Σ_{i=1}^{m} (Yi − µ̂)² ] )
and similarly
Σ_{i=1}^{m} (Yi − µ̂)² = Σ_{i=1}^{m} (Yi − Ȳ)² + [ mn²/(m + n)² ] (Ȳ − X̄)²
Therefore,
Λ = −(m + n) ln( (m + n − 2) Sp² / [ (m + n − 2) Sp² + (mn/(m + n)) (Ȳ − X̄)² ] )
  = (m + n) ln( [ m + n − 2 + (X̄ − Ȳ)² / ( Sp² (1/n + 1/m) ) ] / (m + n − 2) )
  = (m + n) ln( 1 + T²/(m + n − 2) )
Again, Λ is a monotonic function of T 2 and therefore instead of dealing with the
distribution of Λ when computing p-values, we could equivalently work with the
two-sample student statistic T .
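The monotone relationship between Λ and T² can be seen numerically. The sketch below uses made-up samples x and y (no data are given here); the χ²₁ approximation for Λ and the exact t distribution for T give similar P-values when m + n is large:
> x = rnorm(40, mean = 10, sd = 2)     # illustrative data only
> y = rnorm(50, mean = 11, sd = 2)
> n = length(x); m = length(y)
> tstat = t.test(x, y, var.equal = TRUE)$statistic
> lambda = (m + n) * log(1 + tstat^2 / (m + n - 2))
> 2 * pt(-abs(tstat), df = m + n - 2)            # two-sample t P-value
> pchisq(lambda, df = 1, lower.tail = FALSE)     # likelihood ratio (chi-squared) P-value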
Figure 9.1: John Maynard Keynes: the most famous and influential economist of
the 20th century and author of A Treatise on Probability (1921).
That’s it – good luck in your exams, and remember that all models in Statistics are
wrong but some are useful, or, as Keynes put it, “It is better to be roughly right
than precisely wrong”. You will consider statistical modelling and probability theory
in much more detail in your third year courses.
A Brief Introduction to R
This set of exercises introduces you to the statistics package R. R is a free and widely
used statistics program that you can download from http://cran.r-project.org/.
R is the statistics package that was used to construct most of the graphs in your
lecture notes. Some of your assignment questions will require computer-aided data
analysis, for which you are encouraged to use R.
By the end of this set of exercises you will know how to:
• Load R in Windows.
• Save R graphs as jpg files, and save your work before exiting R.
Starting R
Log onto a maths computer under the Windows environment. You can also use R
under Linux, by typing R at the command line, and you are welcome to explore this
option yourself. These notes however are written for the Windows environment, and
the methods are different for Linux (in particular, no drag-down menus are available,
and the saving of graphs is a little trickier).
Go to Start...Programs...R...R2.4.1.
Notice that R opens in the window called “R Console”. R is a command-line program,
which means that you need to type commands into the R Console line-by-line after
the red prompt “>” at the bottom of the window.
If at any time the prompt disappears or changes into a “+”, you’ve probably made
a mistake. In this situation, if you press the escape key Esc you will be back in
business.
If your dataset is small, you can enter it manually. For example, to enter the
following dataset and to store it as the object dat:
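The dataset used in the original handout is not reproduced here, so the numbers below are made up purely to show the syntax: small datasets are entered with the c() (combine) function.
> dat = c(23, 19, 30, 24, 27)     # hypothetical values for illustration
> dat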
When entering data into a computer, people typically use a spreadsheet such as
Excel, or some other program. We can take data from these other programs and
import it into R for analysis. We will learn how to do this for a dataset that has
been stored in tab-delimited format.
Download the file 1041_07.txt from My eLearning and open this file. This file
contains MATH1041 student responses to a survey that was administered at the
start of session 2, 2007. (If you are curious about what questions were asked in the
survey, you can find them in the file 1041 07Evaluation.pdf.) The 1041_07.txt
dataset is stored in a tab-delimited file, which means that the Tab key has been
used to distinguish between columns of the dataset. There are many columns in
the dataset, and each has a label in the first row identifying which variable it is
(e.g. travel.time). Notice that the column labels have no spaces in them – this is
important because R will have problems reading the dataset if there are spaces.
To import this dataset into R:
• First save 1041_07.txt in your personal drive. This drive is labelled using
your student number.
• Now change R’s working directory to the directory on your personal drive
that contains 1041_07.txt, by going to “File...Change dir...Browse...” and
navigating to the directory on your personal drive that contains 1041_07.txt.
The reason for doing this is to tell R where it will be able to find the file
1041_07.txt. Whenever you ask R to load a file, you need to also tell R which
directory to find the file in, and this is a simple way of doing that.
• To load the file 1041_07.txt and to store it under the name survey, type:
> survey = read.table("1041_07.txt", header=T)
(read.table is the R command that reads data in from a table that is stored
in a text file, and header=T tells R that the first row of the file is a “header
row” which contains the names for each column.)
To check that the data are in R, ask R to show you the first 10 rows of the dataset,
which you can do using the command survey[1:10,]
Numerical summaries on R
• To calculate the mean of dat, type the command mean(dat) at the prompt,
then press return.
• To calculate the 25th percentile (otherwise known as the first quartile), use
the command quantile(dat,0.25)
All of these commands will also work for quantitative variables in the survey data,
which you have stored in R as survey, although you should specify which variable
from this dataset you want to use when typing a command. For example, to access
the travel-time variable, you will need to refer to survey$travel.time in your
command, not just travel.time. So to calculate a five-number summary of travel
time, use the command summary(survey$travel.time).
Removing NA’s
In some cases, a MATH1041 student attempted the survey but did not complete all
questions. This means that there are some places in the dataset with no response,
which have been labelled as NA. You might have noticed that R tells you how many
NA values there are for a given variable, when you use the summary command. The
travel.time variable, for example, contains 12 NA values.
A problem with having NA’s in a quantitative variable is that when you try to
calculate some function of this variable, such as the mean, it will return NA. For ex-
ample, see what you get when you type mean(survey$travel.time). To overcome
this problem, there is an option in most functions called na.rm (which stands for
“remove NA’s”). To calculate the average travel time, ignoring the NA values, use
the command mean(survey$travel.time, na.rm=T).
Graphical summaries on R
To construct a bar chart of the gender variable from the survey dataset, use the
command plot(survey$gender)
To construct comparative boxplots for the hair cut costs of different genders, type
any of the following:
• plot(survey$hair.cost~survey$gender)
• boxplot(survey$hair.cost~survey$gender)
• plot(hair.cost~gender, data=survey)
Note that most functions have an optional data= argument so you don’t have to use
the “survey$” form every time you want to use a variable from within the dataset
survey.
The “~” stands for “against”, that is, plot(hair.cost~gender, data=survey)
means “plot hair.cost against gender, for the dataset survey.”
To save a graph as a jpeg image file, click on it (to make the graph window active),
then go to File...Save As... jpeg... 75% quality... and type a suitable filename.
Notice that it is possible to save your graphs in other formats too, such as PDF.
To save all objects currently in your workspace (if you have completed all the above
steps, your workspace currently consists of dat and survey) go to File...Save
Workspace... and suggest a filename. If you ever want to continue analyses from
the point you are currently at, you can simply find the workspace file on My Com-
puter and double-click on it (provided that it has been saved with the .RData file
extension).
Getting help on R
There are a few different ways to find help on R, if you want to explore it further...
• Click on the R Console (so that it is the active window) and you will find a
drag-down Help menu. You can browse through html help or an introductory
manual available at Manuals (In PDF)... An Introduction to R.
• If you know what function you want to find out more about, use the help
command. e.g. to find out more about the hist command, type help(hist)
at the command line.
• Buy a book that introduces you to R, or download a free book from the Internet.
There are a lot of free books around, see suggestions at http://cran.r-project.org/
under “Documentation – Contributed”.
Other stuff
There’s so much you can do on R. One particular example is doing arithmetic cal-
culations – you can use R like a calculator.
> 3*log(2.4)-3
> choose(5,2) * 0.4^2 * (1-0.4)^3
You can also store the results of previous commands, which is important for doing
more complex calculations or for retrieving results later on.
> mn.trav = mean(survey$travel.time, na.rm=T)
> sd.trav = sd(survey$travel.time, na.rm=T)
> mn.trav + 3*sd.trav
> x = 0:5
> p = choose(5,x) * 0.4^x * (1-0.4)^(5-x)
> plot(x,p)
Exiting R
Make the R Console the active window (by clicking on it) then go to File... Exit.
Statistical Tables
t distribution critical values
Key: Table entry for p and C is the critical value t∗ with probability p lying to its right
and probability C lying between −t∗ and t∗.
Standard normal cumulative probabilities
Key: Table entry for z is P(Z ≤ z), the area under the standard normal curve to the left of z; rows give z to one decimal place and columns give the second decimal place of z.
z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
−2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
−2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
−2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
−2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
−2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
−2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
−2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
−2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
−2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
−2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
−1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
−1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
−1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
−1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
−1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
−1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
−1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
−1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
−0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
−0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
−0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
−0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
−0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
−0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
−0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
−0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
χ2 distribution critical values
Key: Table entry for p is the critical value with probability p lying to its right.
F distribution critical values
Key: Table entry is the critical value of the F distribution with probability p lying to its right; rows are indexed by the denominator degrees of freedom and p, and the columns correspond to numerator degrees of freedom
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 24, 30, 40, 60, 120, ∞.
2 .05 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45 19.45 19.46 19.47 19.48 19.49 19.50
.025 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.37 39.39 39.40 39.41 39.43 39.45 39.46 39.46 39.47 39.48 39.49 39.50
.01 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 99.45 99.46 99.47 99.47 99.48 99.49 99.50
.005 199 199 199 199 199 199 199 199 199 199 199 199 199 200 200 200 200 200 200
3 .05 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
.025 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.47 14.42 14.34 14.25 14.17 14.12 14.08 14.04 13.99 13.95 13.90
.01 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.87 26.69 26.60 26.50 26.41 26.32 26.22 26.13
.005 55.55 49.80 47.47 46.19 45.39 44.84 44.43 44.13 43.88 43.69 43.39 43.08 42.78 42.62 42.47 42.31 42.15 41.99 41.83
4 .05 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
.025 12.22 10.65 9.98 9.60 9.36 9.20 9.07 8.98 8.90 8.84 8.75 8.66 8.56 8.51 8.46 8.41 8.36 8.31 8.26
.01 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.20 14.02 13.93 13.84 13.75 13.65 13.56 13.46
.005 31.33 26.28 24.26 23.15 22.46 21.97 21.62 21.35 21.14 20.97 20.70 20.44 20.17 20.03 19.89 19.75 19.61 19.47 19.32
5 .05 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.36
.025 10.01 8.43 7.76 7.39 7.15 6.98 6.85 6.76 6.68 6.62 6.52 6.43 6.33 6.28 6.23 6.18 6.12 6.07 6.02
.01 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11 9.02
.005 22.78 18.31 16.53 15.56 14.94 14.51 14.20 13.96 13.77 13.62 13.38 13.15 12.90 12.78 12.66 12.53 12.40 12.27 12.14
6 .05 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
.025 8.81 7.26 6.60 6.23 5.99 5.82 5.70 5.60 5.52 5.46 5.37 5.27 5.17 5.12 5.07 5.01 4.96 4.90 4.85
.01 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.06 6.97 6.88
.005 18.63 14.54 12.92 12.03 11.46 11.07 10.79 10.57 10.39 10.25 10.03 9.81 9.59 9.47 9.36 9.24 9.12 9.00 8.88
7 .05 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
.025 8.07 6.54 5.89 5.52 5.29 5.12 4.99 4.90 4.82 4.76 4.67 4.57 4.47 4.41 4.36 4.31 4.25 4.20 4.14
.01 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74 5.65
.005 16.24 12.40 10.88 10.05 9.52 9.16 8.89 8.68 8.51 8.38 8.18 7.97 7.75 7.64 7.53 7.42 7.31 7.19 7.08
8 .05 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
.025 7.57 6.06 5.42 5.05 4.82 4.65 4.53 4.43 4.36 4.30 4.20 4.10 4.00 3.95 3.89 3.84 3.78 3.73 3.67
.01 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95 4.86
.005 14.69 11.04 9.60 8.81 8.30 7.95 7.69 7.50 7.34 7.21 7.01 6.81 6.61 6.50 6.40 6.29 6.18 6.06 5.95
9 .05 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
.025 7.21 5.71 5.08 4.72 4.48 4.32 4.20 4.10 4.03 3.96 3.87 3.77 3.67 3.61 3.56 3.51 3.45 3.39 3.33
.01 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40 4.31
.005 13.61 10.11 8.72 7.96 7.47 7.13 6.88 6.69 6.54 6.42 6.23 6.03 5.83 5.73 5.62 5.52 5.41 5.30 5.19
10 .05 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
.025 6.94 5.46 4.83 4.47 4.24 4.07 3.95 3.85 3.78 3.72 3.62 3.52 3.42 3.37 3.31 3.26 3.20 3.14 3.08
.01 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00 3.91
.005 12.83 9.43 8.08 7.34 6.87 6.54 6.30 6.12 5.97 5.85 5.66 5.47 5.27 5.17 5.07 4.97 4.86 4.75 4.64
12 .05 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
.025 6.55 5.10 4.47 4.12 3.89 3.73 3.61 3.51 3.44 3.37 3.28 3.18 3.07 3.02 2.96 2.91 2.85 2.79 2.72
.01 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45 3.36
.005 11.75 8.51 7.23 6.52 6.07 5.76 5.52 5.35 5.20 5.09 4.91 4.72 4.53 4.43 4.33 4.23 4.12 4.01 3.90
15 .05 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
.025 6.20 4.77 4.15 3.80 3.58 3.41 3.29 3.20 3.12 3.06 2.96 2.86 2.76 2.70 2.64 2.59 2.52 2.46 2.40
.01 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96 2.87
.005 10.80 7.70 6.48 5.80 5.37 5.07 4.85 4.67 4.54 4.42 4.25 4.07 3.88 3.79 3.69 3.58 3.48 3.37 3.26
20 .05 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
.025 5.87 4.46 3.86 3.51 3.29 3.13 3.01 2.91 2.84 2.77 2.68 2.57 2.46 2.41 2.35 2.29 2.22 2.16 2.09
.01 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52 2.42
.005 9.94 6.99 5.82 5.17 4.76 4.47 4.26 4.09 3.96 3.85 3.68 3.50 3.32 3.22 3.12 3.02 2.92 2.81 2.69
24 .05 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
.025 5.72 4.32 3.72 3.38 3.15 2.99 2.87 2.78 2.70 2.64 2.54 2.44 2.33 2.27 2.21 2.15 2.08 2.01 1.94
.01 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31 2.21
.005 9.55 6.66 5.52 4.89 4.49 4.20 3.99 3.83 3.69 3.59 3.42 3.25 3.06 2.97 2.87 2.77 2.66 2.55 2.43
30 .05 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
.025 5.57 4.18 3.59 3.25 3.03 2.87 2.75 2.65 2.57 2.51 2.41 2.31 2.20 2.14 2.07 2.01 1.94 1.87 1.79
.01 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11 2.01
.005 9.18 6.35 5.24 4.62 4.23 3.95 3.74 3.58 3.45 3.34 3.18 3.01 2.82 2.73 2.63 2.52 2.42 2.30 2.18
40 .05 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
.025 5.42 4.05 3.46 3.13 2.90 2.74 2.62 2.53 2.45 2.39 2.29 2.18 2.07 2.01 1.94 1.88 1.80 1.72 1.64
.01 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92 1.80
.005 8.83 6.07 4.98 4.37 3.99 3.71 3.51 3.35 3.22 3.12 2.95 2.78 2.60 2.50 2.40 2.30 2.18 2.06 1.93
60 .05 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
.025 5.29 3.93 3.34 3.01 2.79 2.63 2.51 2.41 2.33 2.27 2.17 2.06 1.94 1.88 1.82 1.74 1.67 1.58 1.48
.01 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73 1.60
.005 8.49 5.79 4.73 4.14 3.76 3.49 3.29 3.13 3.01 2.90 2.74 2.57 2.39 2.29 2.19 2.08 1.96 1.83 1.69
120 .05 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
.025 5.15 3.80 3.23 2.89 2.67 2.52 2.39 2.30 2.22 2.16 2.05 1.94 1.82 1.76 1.69 1.61 1.53 1.43 1.31
.01 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53 1.38
.005 8.18 5.54 4.50 3.92 3.55 3.28 3.09 2.93 2.81 2.71 2.54 2.37 2.19 2.09 1.98 1.87 1.75 1.61 1.43
∞ .05 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00
.025 5.02 3.69 3.12 2.79 2.57 2.41 2.29 2.19 2.11 2.05 1.94 1.83 1.71 1.64 1.57 1.48 1.39 1.27 1.00
.01 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32 1.00
.005 7.88 5.30 4.28 3.72 3.35 3.09 2.90 2.74 2.62 2.52 2.36 2.19 2.00 1.90 1.79 1.67 1.53 1.36 1.00