Inference
2.1 Introduction
Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. The word inference means ‘conclusions’
or ‘decisions’. Statistical inference is about drawing conclusions and making
decisions based on observed data.
The data are only a sample from the population or mechanism of interest. We cannot possibly observe every outcome of the process, so we have to make do with the sample that we have observed.
The data give us imperfect insight into the population of interest. The role of statistical inference is to use these imperfect data to draw conclusions about the population, while simultaneously giving an honest reflection of the uncertainty in our conclusions.
Example 2: Political polling: how many people will vote for the NZ Labour Party?
• Population: all eligible voters in New Zealand.
• Sample: a random sample of voters, e.g. 1000.
• What do we want to make inference about? We want to know the support for Labour among all voters, but asking every voter is too expensive to carry out except on election night itself. Instead we aim to deduce the support for Labour by asking a smaller number of voters, while simultaneously reporting our uncertainty (the margin of error).
1. Hypothesis testing:
• I toss a coin ten times and get nine heads. How unlikely is that? Can we
continue to believe that the coin is fair when it produces nine heads out
of ten tosses?
4. Modelling:
• We have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or does it show little variability? Does it sometimes give results much higher than average, but never give results much lower (a long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.
2.2 Hypothesis testing
You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler
circumstances than these. The concept of the hypothesis test is at its easiest to
understand with the Binomial distribution in the following example. All other
hypothesis tests throughout statistics are based on the same idea.
What is ‘weird’?
If our coin is fair, the outcomes that are as weird or weirder than 9 heads out of 10 tosses are: 0, 1, 9, or 10 heads.
We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair. If X is the number of heads in 10 tosses of a fair coin, then X ∼ Binomial(10, 0.5) and
P(X = 0) + P(X = 1) + P(X = 9) + P(X = 10) = 0.021.
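For reference, this calculation can be reproduced in R. The following is a minimal sketch, assuming X ∼ Binomial(10, 0.5); dbinom() and pbinom() are R's built-in Binomial probability and cumulative-probability functions.

  # P(outcome at least as weird as 9 heads) = P(X = 0, 1, 9 or 10) for X ~ Binomial(10, 0.5)
  sum(dbinom(c(0, 1, 9, 10), size = 10, prob = 0.5))
  # Equivalently, double the lower tail P(X <= 1), since the two tails are symmetric:
  2 * pbinom(1, size = 10, prob = 0.5)
  # Both give approximately 0.021.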
Is this weird?
Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only
expect to see something as extreme as 9 heads on about 2.1% of occasions.
Is the coin fair?
Obviously, we can’t say. It might be: after all, on 2.1% of occasions that you
toss a fair coin 10 times, you do get something as weird as 9 heads or more.
However, 2.1% is a small probability, so it is still very unusual for a fair coin to
produce something as weird as what we’ve seen. If the coin really was fair, it
would be very unusual to get 9 heads or more.
We can deduce that, EITHER we have observed a very unusual event with a fair
coin, OR the coin is not fair.
In fact, this gives us some evidence that the coin is not fair.
The value 2.1% measures the strength of our evidence. The smaller this proba-
bility, the more evidence we have.
Alternative hypothesis: the second alternative, that the coin is NOT fair.
We write:
H0 : p = 0.5   and   H1 : p ≠ 0.5.
A p-value of 0.021 represents quite strong evidence against the null hypothesis. It states that, if the null hypothesis is TRUE, we would have only a 2.1% chance of observing something as extreme as 9 heads or 9 tails out of 10 tosses.
Some people might even see this as strong enough evidence to decide that the
null hypothesis is not true, but this is generally an over-simplistic interpretation.
Note: Be careful not to confuse the term p-value, which is 0.021 in our example, with the Binomial probability p. Our hypothesis test is designed to test whether the Binomial probability is p = 0.5. To test this, we calculate the p-value of 0.021 as a measure of the strength of evidence against the hypothesis that p = 0.5.
Interpreting the hypothesis test
There are different schools of thought about how a p-value should be interpreted.
• Most people agree that the p-value is a useful measure of the strength of
evidence against the null hypothesis. The smaller the p-value, the
stronger the evidence against H0 .
Statistical significance
You have probably encountered the idea of statistical significance in other courses. A test is usually said to be significant at the 5% level if its p-value is less than 0.05.
This means that the chance of seeing what we did see (9 heads out of 10), or something more extreme, is less than 5% if the null hypothesis is true.
Saying the test is significant is a quick way of saying that there is evidence
against the null hypothesis, usually at the 5% level.
In the coin example, we can say that our test of H0 : p = 0.5 against H1 : p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.
This means:
• we have some evidence that p ≠ 0.5.
Beware!
The p-value gives the probability of seeing something as weird as what we did
see, if H0 is true.
This means that 5% of the time, we will get a p-value < 0.05 WHEN H0 IS
TRUE!!
Similarly, about once in every thousand tests, we will get a p-value < 0.001,
when H0 is true!
Men in the class: would you like to have daughters? Then become a deep-sea
diver, a fighter pilot, or a heavy smoker.
The facts
• US presidents: 153 children in total, of whom only 65 were daughters.
• Deep-sea divers: 190 children in total, of whom 125 were daughters and only 65 were sons.
Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?
This is the same as the question in Section 2.2.
For the presidents: If I tossed a coin 153 times and got only 65 heads, could
I continue to believe that the coin was fair?
For the divers: If I tossed a coin 190 times and got only 65 heads, could I
continue to believe that the coin was fair?
Hypothesis test for the presidents
This would take a lot of calculator time! Instead, we use a computer with a
package such as R.
In R:
  pbinom(65, 153, 0.5)
  [1] 0.03748079
So the lower-tail probability is 0.0375, and doubling it for the two-sided test gives a p-value of about 0.075.
Note: In the R command pbinom(65, 153, 0.5), the order in which you enter the numbers 65, 153, and 0.5 is important. If you enter them in a different order, you will get the wrong answer or an error. An alternative is to use the longhand command pbinom(q=65, size=153, prob=0.5), in which case you can enter the arguments in any order.
Back to our hypothesis test. Recall that X was the number of daughters out of
153 presidential children, and X ∼ Binomial(153, p), where p is the probability
that each child is a daughter.
The p-value of 0.075 means that, if the presidents really were as likely to have daughters as sons, there would be only a 7.5% chance of observing something as unusual as only 65 daughters out of the total of 153 children.
For the deep-sea divers, there were 190 children: 65 sons and 125 daughters. Then X ∼ Binomial(190, p), where p is the probability that each child is a son. The two-sided p-value, calculated in the same way, is about 0.000016.
We conclude that it is extremely unlikely that this observation could have occurred by chance, if the deep-sea divers had equal probabilities of having sons and daughters.
We have very strong evidence that deep-sea divers are more likely to have daughters than sons. The data are not really compatible with H0.
What next?
p-values are often badly used in science and business. They are regularly treated
as the end point of an analysis, after which no more work is needed. Many
scientific journals insist that scientists quote a p-value with every set of results,
and often only p-values less than 0.05 are regarded as ‘interesting’. The outcome
is that some scientists do every analysis they can think of until they finally come
up with a p-value of 0.05 or less.
Don’t accept that Drug A is better than Drug B only because the p-value says
so: find a biochemist who can explain what Drug A does that Drug B doesn’t.
Don’t accept that sun exposure is a cause of skin cancer on the basis of a p-value
alone: find a mechanism by which skin is damaged by the sun.
The following text is taken from Malcolm Gladwell’s book Outliers. It describes
the play-by-play for the first goal scored in the 2007 finals of the Canadian ice
hockey junior league for star players aged 17 to 19. The two teams are the Tigers
and Giants. There’s one slight difference . . . instead of the players’ names, we’re
given their birthdays.
March 11 starts around one side of the Tigers’ net, leaving the puck
for his teammate January 4, who passes it to January 22, who flips it
back to March 12, who shoots point-blank at the Tigers’ goalie, April
27. April 27 blocks the shot, but it’s rebounded by Giants’ March 6.
He shoots! Tigers defensemen February 9 and February 14 dive to
block the puck while January 10 looks on helplessly. March 6 scores!
Here are some figures. There were 25 players in the Tigers squad, born between
1986 and 1990. Out of these 25 players, 14 of them were born in January,
February, or March. Is it believable that this should happen by chance, or do
we have evidence that there is a birthday-effect in becoming a star ice hockey
player?
Hypothesis test
Let X be the number of the 25 players who are born from January to March.
We need to set up hypotheses of the following form:
H0 : there is no birthday effect;   H1 : there is a birthday effect.
Under H0 , there is no birthday effect. So the probability that each player has a
birthday in Jan to March is about 1/4.
(3 months out of a possible 12 months).
Thus the distribution of X under H0 is X ∼ Binomial(25, 1/4).
Our observation: x = 14 of the 25 players were born from January to March.
The observed proportion of players born from Jan to March is 14/25 = 0.56.
Just using our intuition, we can make a guess, but we might be wrong. The
answer also depends on the sample size (25 in this case). We need the p-value
to measure the evidence properly.
Upper tail: X = 14, 15, . . . , 25, for even more Jan to March players.
Lower tail: an equal probability in the opposite direction, for too few Jan to
March players.
Note: We do not need to calculate the exact outcomes corresponding to our lower-tail p-value. It is more complicated in this example than in Section 2.3, because we do not have Binomial probability p = 0.5. In fact, the equivalent lower-tail cut-off lies somewhere between 0 and 1 players, so it cannot be specified exactly.
We get round this problem for calculating the p-value by just multiplying the
upper-tail p-value by 2.
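A sketch of this calculation in R, using the complement of pbinom() to get the upper tail:

  # Upper tail: P(X >= 14) for X ~ Binomial(25, 1/4)
  upper <- 1 - pbinom(13, size = 25, prob = 0.25)
  # Two-sided p-value, doubling the upper tail as described above
  2 * upper
  # roughly 0.002, i.e. less than 2 in 1000, as stated below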
[Figure: the probability function of X ∼ Binomial(25, 1/4), for x = 0 to 25.]
It means that if there really was no birthday effect, we would expect to see results
as unusual as 14 out of 25 Jan to March players less than 2 in 1000 times.
We conclude that we have strong evidence that there is a birthday effect in this
ice hockey team. Something beyond ordinary chance seems to be going on. The
data are barely compatible with H0 .
Why should there be a birthday effect?
These data are just one example of a much wider - and astonishingly strong - phenomenon. Professional sports players, not just in ice hockey but also in soccer, baseball, and other sports, show strong birthday clustering. Why?
It’s because these sports select talented players for age-class star teams at young
ages, about 10 years old. In ice hockey, the cut-off date for age-class teams is
January 1st. A 10-year-old born in December is competing against players who
are nearly a year older, born in January, February, and March. The age difference makes a big difference in terms of size, speed, and physical coordination.
Most of the ‘talented’ players at this age are simply older and bigger. But there
then follow years in which they get the best coaching and the most practice.
By the time they reach 17, these players really are the best.
So far, the hypothesis tests have only told us whether the Binomial probability p
might be, or probably isn’t, equal to the value specified in the null hypothesis.
They have told us nothing about the size, or potential importance, of the departure from H0.
For example, for the deep-sea divers, we found that it would be very unlikely to
observe as many as 125 daughters out of 190 children if the chance of having a
daughter really was p = 0.5.
Remember the p-value for the test was 0.000016. Do you think that:
1. p could be as big as 0.8?
No idea! The p-value does not tell us.
2. p could be as close to 0.5 as, say, 0.51?
The test doesn’t even tell us this much!
If there was a huge sample size (number of children), we COULD get a
p-value as small as 0.000016 even if the true probability was 0.51.
Common sense, however, gives us a hint. Because there were almost twice as many daughters as sons, my guess is that the probability of having a daughter is something close to p = 2/3. We need some way of formalizing this.
Estimation
In the case of the deep-sea divers, we wish to estimate the probability p that the child of a diver is a daughter. The common-sense estimate to use is the observed proportion of daughters: p̂ = 125/190 = 0.658.
However, there are many situations where our common sense fails us. For
example, what would we do if we had a regression-model situation (see Section
3.8) and wished to specify an alternative form for p, such as
p = α + β × (diver age).
How would we estimate the unknown intercept α and slope β, given known
information on diver age and number of daughters and sons?
We need a general framework for estimation that can be applied to any situation. The most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation.
Likelihood
Suppose first that p = 0.5. The probability of observing X = 125 daughters out of 190 children would then be
P(X = 125) = (190 choose 125) × (0.5)^125 × (0.5)^65 = 3.97 × 10^−6.
Now suppose instead that p = 0.6. The probability of the same observation becomes
P(X = 125) = (190 choose 125) × (0.6)^125 × (0.4)^65 ≈ 0.016.
This still looks quite unlikely, but it is almost 4000 times more likely than getting X = 125 when p = 0.5.
You can probably see where this is heading. If p = 0.6 is a better estimate than
p = 0.5, what if we move p even closer to our common-sense estimate of 0.658?
This is even more likely than for p = 0.6. So p = 0.658 is the best estimate yet.
Can we do any better? What happens if we increase p a little more, say to
p = 0.7?
This has decreased from the result for p = 0.658, so our observation of 125 is
LESS likely under p = 0.7 than under p = 0.658.
Overall, we can plot a graph showing how likely our observation of X = 125
is under each different value of p.
[Figure: the curve of P(X = 125) when X ∼ Binomial(190, p), plotted against p; vertical axis from 0.00 to 0.06.]
The graph reaches a clear maximum. This is a value of p at which the observation
X = 125 is MORE LIKELY than at any other value of p.
We can see that the maximum occurs somewhere close to our common-sense
estimate of p = 0.658.
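A graph like this can be produced in R with a few lines; a minimal sketch, where the grid of p values from 0.4 to 0.9 is an arbitrary choice:

  # Likelihood of the fixed observation X = 125 over a grid of values of p
  p <- seq(0.4, 0.9, by = 0.001)
  L <- dbinom(125, size = 190, prob = p)
  plot(p, L, type = "l", ylab = "P(X = 125) when X ~ Bin(190, p)")
  # Grid value of p at which the likelihood is largest:
  p[which.max(L)]     # close to 125/190 = 0.658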
The likelihood function
Horizontal axis: the different possible values of the unknown parameter p.
Vertical axis: the probability of our observation, X = 125, under this value of p.
This function is called the likelihood function.
For our fixed observation X = 125, the likelihood function shows how LIKELY the observation 125 is for every different value of p.
We write:
L(p) = P(X = 125) when X ∼ Binomial(190, p) = (190 choose 125) p^125 (1 − p)^65,   for 0 < p < 1.
The likelihood gives the probability of a FIXED observation x, for every possible
value of the parameter p.
Compare this with the probability function, which is the probability of every
different value of x, for a FIXED value of p.
[Figures: left, the likelihood function P(X = 125) for X ∼ Binomial(190, p), plotted against p from 0.50 to 0.80; right, the probability function P(X = x) for a fixed p, plotted against x from 90 to 150.]
dL/dp = (190 choose 125) × { 125 p^124 (1 − p)^65 + p^125 × 65 (1 − p)^64 × (−1) }    (Product Rule)
      = (190 choose 125) p^124 (1 − p)^64 { 125(1 − p) − 65p }
      = (190 choose 125) p^124 (1 − p)^64 { 125 − 190p }.
This gives:
dL/dp = (190 choose 125) p^124 (1 − p)^64 { 125 − 190p } = 0
   ⇒  125 − 190p = 0
   ⇒  p = 125/190 = 0.658.
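The calculus can be double-checked numerically in R; a sketch using the base function optimize() to maximise the likelihood over 0 < p < 1:

  # Numerically maximise L(p) = P(X = 125) for X ~ Binomial(190, p)
  optimize(function(p) dbinom(125, size = 190, prob = p),
           interval = c(0, 1), maximum = TRUE)
  # The component $maximum should be close to 125/190 = 0.658.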
For the diver example, the maximum likelihood estimate of 125/190 is the same as the common-sense estimate (page 53):
p̂ = 125/190.
The maximum likelihood estimate is the value p̂ at which the derivative of the likelihood is zero:
dL/dp = 0 at p = p̂   ⇒   p̂ = 125/190.
5. Solve for p̂: From the graph, we can see that p = 0 and p = 1 are not maxima.
∴ p̂ = 125/190.
In particular, care must be taken when the parameter has a restricted range like 0 < p < 1 (see later). In Stats 210, we will be relaxed about this: you will usually be told to assume that the MLE occurs in the interior of the parameter range. Where possible, it is always best to plot the likelihood function, as on page 55. This confirms that the maximum likelihood estimate exists and is unique.
Estimators
For the example above, we had observation X = 125, and the maximum likelihood estimate of p was
p̂ = 125/190.
It is clear that we could follow through the same working with any value of X, which we can write as X = x, and we would obtain
p̂ = x/190.
This means that even before we have made our observation of X, we can provide
a RULE for calculating the maximum likelihood estimate once X is observed:
Rule: Let
X ∼ Binomial(190, p).
Whatever value of X we observe, the maximum likelihood estimate of p will be
p̂ = X/190.
Note that this expression is now a random variable: it depends on the random
value of X .
A random variable specifying how an estimate is calculated from an observation
is called an estimator.
More generally, suppose X ∼ Binomial(n, p), where p is unknown. Follow the steps on page 59 to find the maximum likelihood estimator for p.
1. Write down the distribution of X:
X ∼ Binomial(n, p).    (n is known.)
2. Write down the observed value of X:
Observed data: X = x.
3. Write down the likelihood function for this observed value:
L(p) = P(X = x) when X ∼ Binomial(n, p) = (n choose x) p^x (1 − p)^(n−x),   for 0 < p < 1.
4. Differentiate the likelihood with respect to the parameter, and set to 0 for
the maximum:
dL/dp = (n choose x) p^(x−1) (1 − p)^(n−x−1) { x − np } = 0,   when p = p̂.    (Exercise)
5. Solve for p̂:
p̂ = x/n.
Example: Recall the president problem in Section 2.3. Out of 153 children, 65 were daughters. Let p be the probability that a presidential child is a daughter. What is the maximum likelihood estimate of p? Using the rule above, p̂ = x/n = 65/153 = 0.425.
Note: We showed in Section 2.3 that p was not significantly different from 0.5 in
this example.
However, the MLE of p is definitely different from 0.5.
This comes back to the meaning of significantly different in the statistical sense.
Saying that p is not significantly different from 0.5 just means that we can’t
DISTINGUISH any difference between p and 0.5 from routine sampling
variability.
We expect that p probably IS different from 0.5, just by a little. The maximum
likelihood estimate gives us the ‘best’ estimate of p.
Note: We have only considered the class of problems for which X ∼ Binomial(n, p)
and n is KNOWN. If n is not known, we have a harder problem: we have two
parameters, and one of them (n) should only take discrete values 1, 2, 3, . . ..
We will not consider problems of this type in Stats 210.
2.6 Random numbers and histograms
To generate (say) 100 random numbers from the Binomial(n = 190, p = 0.6) distribution in R, we use:
  rbinom(100, 190, 0.6)
or in long-hand,
  rbinom(n=100, size=190, prob=0.6)
Caution: the R inputs n and size are the opposite to what you might expect: n gives the required sample size, and size gives the Binomial parameter n!
Histograms
The usual graph used to visualise a set of random numbers is the histogram.
The height of each bar of the histogram shows how many of the random numbers
fall into the interval represented by the bar.
Here are histograms from applying the command rbinom(100, 190, 0.6)
three different times.
[Figure: three histograms, each with vertical axis ‘frequency of x’, produced by three separate calls to rbinom(100, 190, 0.6).]
Each graph shows 100 random numbers from the Binomial(n = 190, p = 0.6)
distribution.
Note: The histograms above have been specially adjusted so that each histogram
bar covers an interval of just one integer. For example, the height of the bar
plotted at x = 109 shows how many of the 100 random numbers are equal to
109.
Usually, histogram bars would cover a larger interval, and the histogram would be smoother. For example, the figure below shows a histogram using the default settings in R, obtained from the command hist(rbinom(100, 190, 0.6)). Each histogram bar covers an interval of 5 integers.
[Figure: ‘Histogram of rbinom(100, 190, 0.6)’, with x-axis from about 100 to 120 and vertical axis ‘Frequency’ from 0 to 30.]
In all the histograms above, the sum of the heights of all the bars is 100, because
there are 100 observations.
Histograms are useful because they show the approximate shape of the
underlying probability function.
They are also useful for exploring the effect of increasing sample size.
Eventually, with a large enough sample size, the histogram starts to look identical
to the probability function.
The histogram should have the same shape as the probability function, especially
as the sample size gets large.
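One way to see this in R is to draw a density-scaled histogram of a large sample and overlay the exact probability function; a sketch, where the sample size of 100000 and the plotting range 80 to 150 are arbitrary choices:

  # Large sample from the Binomial(190, 0.6) distribution
  x <- rbinom(100000, size = 190, prob = 0.6)
  # One histogram bar per integer, scaled to proportions rather than counts
  hist(x, breaks = seq(min(x) - 0.5, max(x) + 0.5, by = 1), freq = FALSE)
  # Overlay the exact probability function P(X = k)
  points(80:150, dbinom(80:150, size = 190, prob = 0.6), pch = 16)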
Sample size 1000: rbinom(1000, 190, 0.6)
[Figure: three histograms, each of 1000 random numbers from the Binomial(190, 0.6) distribution, with x running from about 80 to 140 and vertical axis ‘frequency of x’. Two further sets of three histograms show the same for much larger sample sizes; as the sample size increases, the histograms become smoother and more alike.]
2.7 Expectation
[Figure: the probability function P(X = x) when X ∼ Binomial(190, p = 0.6), for x from 90 to 150.]
Here are 30 random values generated from the Binomial(190, 0.6) distribution:
116 116 117 122 111 112 114 120 112 102
125 116 97 105 108 117 118 111 116 121
107 113 120 114 114 124 116 118 119 120
The average, or mean, of the first ten values is:
(116 + 116 + 117 + 122 + 111 + 112 + 114 + 120 + 112 + 102)/10 = 114.2.
The answers all seem to be close to 114. What would happen if we took the
average of hundreds of values?
Note: You will get a slightly different result every time you run such a command, because the random numbers change each time.
1000 values from Binomial(190, 0.6):
R command: mean(rbinom(1000, 190, 0.6))
Result: 114.02
The larger the sample size, the closer the average seems to get to 114.
If we kept going for larger and larger sample sizes, we would keep getting answers
closer and closer to 114. This is because 114 is the DISTRIBUTION MEAN:
the mean value that we would get if we were able to draw an infinite sample from
the Binomial(190, 0.6) distribution.
This distribution mean is called the expectation, or expected value, of the Binomial(190, 0.6) distribution.
It is a FIXED property of the Binomial(190, 0.6) distribution. This means it is
a fixed constant: there is nothing random about it.
µX = E(X) = Σ_x x fX(x) = Σ_x x P(X = x).
The expected value is a measure of the centre, or average, of the set of values that
X can take, weighted according to the probability of each value.
For example, for X ∼ Binomial(190, 0.6):
E(X) = Σ_x x P(X = x) = Σ_{x=0}^{190} x (190 choose x) (0.6)^x (0.4)^(190−x).
Although it is not obvious, the answer to this sum is n × p = 190 × 0.6 = 114.
We will see why in Section 2.10.
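The sum can be evaluated directly in R as a quick check (a sketch; 0:190 lists every possible value of X):

  # E(X) = sum over x of x * P(X = x), for X ~ Binomial(190, 0.6)
  x <- 0:190
  sum(x * dbinom(x, size = 190, prob = 0.6))
  # [1] 114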
We will move away from the Binomial distribution for a moment, and use a
simpler example.
Let the random variable X be defined as: X = 1 with probability 0.9, and X = −1 with probability 0.1.
X takes only the values 1 and −1. What is the ‘average’ value of X?
Using (1 + (−1))/2 = 0 would not be useful, because it ignores the fact that usually X = 1, and only occasionally is X = −1.
Instead we weight each value by its probability:
E(X) = 0.9 × 1 + 0.1 × (−1) = 0.8,
or in general,
E(X) = Σ_x P(X = x) × x.
For any constants a and b:   E(aX + b) = aE(X) + b.
Proof:
E(aX + b) = Σ_x (ax + b) P(X = x) = a Σ_x x P(X = x) + b Σ_x P(X = x) = a E(X) + b × 1 = a E(X) + b.
Example: Let X ∼ Binomial(3, 0.2). We have:
P(X = x) = (3 choose x) (0.2)^x (0.8)^(3−x)   for x = 0, 1, 2, 3.
x                    0      1      2      3
fX(x) = P(X = x)   0.512  0.384  0.096  0.008
Then
E(X) = Σ_{x=0}^{3} x fX(x) = 0 × 0.512 + 1 × 0.384 + 2 × 0.096 + 3 × 0.008 = 0.6.
Example: Let Y = 1 with probability p, and Y = 0 with probability 1 − p.
Find E(Y ).
y 0 1
P(Y = y) 1 − p p
E(Y ) = 0 × (1 − p) + 1 × p = p.
Expectation of a sum of random variables: E(X + Y)
For ANY random variables X and Y, E(X + Y) = E(X) + E(Y), and more generally E(X1 + X2 + . . . + Xn) = E(X1) + E(X2) + . . . + E(Xn).
This result holds for any random variables X1, . . . , Xn. It does NOT require X1, . . . , Xn to be independent.
Note: We can combine the result above with the linear property of expectation.
For any constants a1 , . . . , an , we have:
E (a1 X1 + a2 X2 + . . . + an Xn ) = a1 E(X1 ) + a2 E(X2 ) + . . . + an E(Xn ).
Expectation of a product of random variables: E(XY)
1. General case: we have to find E(XY) either using their joint probability function (see later), or using their covariance (see later).
2. Special case: when X and Y are INDEPENDENT, E(XY) = E(X)E(Y).
For discrete random variables, it is very easy to find the probability function for
Y = g(X), given that the probability function for X is known. Simply change
all the values and keep the probabilities the same.
Example 1: Let X ∼ Binomial(3, 0.2) and Y = X². The probability function for X is:
x           0      1      2      3
P(X = x)  0.512  0.384  0.096  0.008
so the probability function for Y = X² is:
y           0²     1²     2²     3²
P(Y = y)  0.512  0.384  0.096  0.008
This is because Y takes the value 0² whenever X takes the value 0, and so on.
We can find the expectation of a transformed random variable just like any other
random variable. For example, in Example 1 we had X ∼ Binomial(3, 0.2), and
Y = X².
The probability function for X is:
x           0      1      2      3
P(X = x)  0.512  0.384  0.096  0.008
and for Y = X²:
y           0      1      4      9
P(Y = y)  0.512  0.384  0.096  0.008
Thus the expectation of Y = X² is:
E(X²) = 0 × 0.512 + 1 × 0.384 + 4 × 0.096 + 9 × 0.008 = 0.84.
Note: E(X²) is NOT the same as {E(X)}². Check that {E(X)}² = 0.36.
To make the calculation quicker, we could cut out the middle step of writing down the probability function of Y. Because we transform the values and keep the probabilities the same, we have:
E{g(X)} = E(X²) = g(0) × 0.512 + g(1) × 0.384 + g(2) × 0.096 + g(3) × 0.008.
Clearly the same arguments can be extended to any function g(X) and any discrete random variable X:
E{g(X)} = Σ_x g(x) P(X = x).
Definition: For any function g and discrete random variable X, the expected value of g(X) is given by
E{g(X)} = Σ_x g(x) P(X = x) = Σ_x g(x) fX(x).
Example: Recall Mr Chance and his balloon-hire business from page 74. Let X be
the height of balloon selected by a randomly chosen customer. The probability
function of X is:
height, x (m) 2 3 4
P(X = x) 0.5 0.3 0.2
For example, the expected amount of gas needed per customer, if a balloon of height x requires x³/2 m³ of gas, is
E(X³/2) = (2³/2) × 0.5 + (3³/2) × 0.3 + (4³/2) × 0.2 = 12.45 m³ of gas.
(c) How much does Mr Chance expect to earn in total from his next 5 customers?
Let Z1, . . . , Z5 be the earnings from the next 5 customers. Each Zi has E(Zi) = 1080 by part (b). The total expected earning is
E(Z1 + . . . + Z5) = E(Z1) + . . . + E(Z5) = 5 × 1080 = 5400.
Suppose X = 3 with probability 3/4, and X = 8 with probability 1/4.
Then 3/4 of the time, X takes value 3, and 1/4 of the time, X takes value 8.
So E(X) = (3/4) × 3 + (1/4) × 8.
Similarly, E(√X) = (3/4) × √3 + (1/4) × √8.
Common mistakes
i)   E(√X) = √(E(X)) = √((3/4) × 3 + (1/4) × 8)     Wrong!
ii)  E(√X) = √((3/4) × 3) + √((1/4) × 8)            Wrong!
iii) E(√X) = √(3/4) × √3 + √(1/4) × √8              Wrong!
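A quick numerical check of this in R, a sketch using the two-point distribution above:

  x <- c(3, 8)
  p <- c(3/4, 1/4)
  sum(p * sqrt(x))    # correct: E(sqrt(X)) = (3/4)*sqrt(3) + (1/4)*sqrt(8), about 2.01
  sqrt(sum(p * x))    # mistake (i): sqrt(E(X)) = sqrt(4.25), about 2.06 -- not the same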
2.9 Variance
Example: Mrs Tractor runs the Rational Bank of Remuera. Every day she hopes to fill her cash machine with enough cash to see the well-heeled citizens of Remuera through the day. She knows that the expected amount of money withdrawn each day is $50,000. How much money should she load in the machine?
$50,000?
No: $50,000 is the average, near the centre
of the distribution. About half the time,
the money required will be GREATER
than the average.
How much money should Mrs Tractor put in the
machine if she wants to be 99% certain that there
will be enough for the day’s transactions?
Answer: it depends how much the amount withdrawn varies above and below
its mean.
For questions like this, we need the study of variance.
Variance is the average squared distance of a random variable from its own mean. For any function g,
Var(g(X)) = E[ {g(X) − E(g(X))}² ].
The variance is a measure of how spread out are the values that X can take.
It is the average squared distance between a value of X and the central (mean)
value, µX .
[Figure: possible values x1, x2, . . . , x6 of X marked on a line, with the central value µX shown and distances such as x2 − µX and x4 − µX indicated.]
Var(X) = E[ (X − µX)² ].
(1) Take the distance from each observed value of X to the central point, µX. Square it to balance positive and negative distances.
(2) Then take the average over all values X can take: i.e. if we observed X many times, find what would be the average squared distance between X and µX.
Note: The mean, µX, and the variance, σX², of X are just numbers: there is nothing random or variable about them.
Example: Let X = 3 with probability 3/4, and X = 8 with probability 1/4. Then
E(X) = µX = 3 × (3/4) + 8 × (1/4) = 4.25
Var(X) = σX² = (3/4) × (3 − 4.25)² + (1/4) × (8 − 4.25)² = 4.6875.
Var(X) = E[ (X − µX)² ] = Σ_x (x − µX)² fX(x) = Σ_x (x − µX)² P(X = x).
This leads to the useful alternative formula Var(X) = E(X²) − µX² = E(X²) − (EX)².
Proof:  Var(X) = E[ (X − µX)² ]    (by definition)
              = E[ X² − 2XµX + µX² ]    (X is a random variable; µX is a constant)
              = E(X²) − 2µX E(X) + µX²
              = E(X²) − 2µX² + µX²
              = E(X²) − µX².
Note: E(X²) = Σ_x x² fX(x) = Σ_x x² P(X = x). This is not the same as (EX)². For example, if X = 3 with probability 0.75 and X = 8 with probability 0.25, then E(X²) = 9 × 0.75 + 64 × 0.25 = 22.75, whereas (EX)² = 4.25² = 18.0625.
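These numbers are easy to check in R; a sketch for the two-point distribution above:

  x  <- c(3, 8)
  px <- c(0.75, 0.25)
  mu <- sum(px * x)            # E(X) = 4.25
  sum(px * (x - mu)^2)         # Var(X) from the definition: 4.6875
  sum(px * x^2) - mu^2         # Var(X) from E(X^2) - mu^2: also 4.6875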
i) Var(aX + b) = a² Var(X).
Proof (part (i)):
Var(aX + b) = E[ {(aX + b) − E(aX + b)}² ]
            = E[ {aX + b − aE(X) − b}² ]    (by Thm 2.7)
            = E[ {aX − aE(X)}² ]
            = E[ a² {X − E(X)}² ]
            = a² E[ {X − E(X)}² ]    (by Thm 2.7)
            = a² Var(X).
Note: These are very different from the corresponding expressions for expectations
(Theorem 2.7). Variances are more difficult to manipulate than expectations.
Let X ∼ Binomial(n, p). We have mentioned several times that E(X) = np. We now prove this and the additional result for Var(X):
E(X) = µX = np,
Var(X) = σX² = np(1 − p).
The easiest proof is to write X as a sum of independent Bernoulli trials:
X = Y1 + Y2 + . . . + Yn,
where each Yi = 1 with probability p, and Yi = 0 with probability 1 − p.
Thus,
E(Yi) = 0 × (1 − p) + 1 × p = p.
Also,
E(Yi²) = 0² × (1 − p) + 1² × p = p.
So
Var(Yi) = E(Yi²) − (E Yi)² = p − p² = p(1 − p).
Therefore:
E(X) = E(Y1) + E(Y2) + . . . + E(Yn) = p + p + . . . + p = n × p.
And, because Y1, . . . , Yn are independent,
Var(X) = Var(Y1) + Var(Y2) + . . . + Var(Yn) = n × p(1 − p).
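A simulation check of both formulae in R, a sketch using the Binomial(190, 0.6) distribution from earlier sections:

  # Simulated mean and variance of X ~ Binomial(190, 0.6)
  x <- rbinom(100000, size = 190, prob = 0.6)
  mean(x)   # should be close to n*p       = 190 * 0.6       = 114
  var(x)    # should be close to n*p*(1-p) = 190 * 0.6 * 0.4 = 45.6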
We show below how the Binomial mean and variance formulae can be derived
directly from the probability function.
E(X) = Σ_{x=0}^{n} x fX(x) = Σ_{x=0}^{n} x (n choose x) p^x (1 − p)^(n−x) = Σ_{x=0}^{n} x · n!/((n − x)! x!) · p^x (1 − p)^(n−x).

But x/x! = 1/(x − 1)!, and also the first term x fX(x) is 0 when x = 0. So, continuing,

E(X) = Σ_{x=1}^{n} n!/((n − x)! (x − 1)!) · p^x (1 − p)^(n−x).

Next: make n’s into (n − 1)’s and x’s into (x − 1)’s wherever possible, e.g.
n − x = (n − 1) − (x − 1),    p^x = p · p^(x−1),    n! = n(n − 1)!,   etc.

This gives

E(X) = Σ_{x=1}^{n} n(n − 1)!/([(n − 1) − (x − 1)]! (x − 1)!) · p · p^(x−1) (1 − p)^((n−1)−(x−1))
     = np Σ_{x=1}^{n} (n − 1 choose x − 1) p^(x−1) (1 − p)^((n−1)−(x−1)),

where np is what we want, and we need to show that the remaining sum equals 1. Substituting y = x − 1 and m = n − 1, the sum becomes a Binomial sum:

E(X) = np Σ_{y=0}^{m} (m choose y) p^y (1 − p)^(m−y)
     = np (p + (1 − p))^m    (Binomial Theorem)
     = np × 1^m
     = np.
To find Var(X), it is easiest to work with E[X(X − 1)] = E(X²) − E(X), using the same trick twice over. Here goes:

E[X(X − 1)] = Σ_{x=0}^{n} x(x − 1) (n choose x) p^x (1 − p)^(n−x)
            = Σ_{x=0}^{n} x(x − 1) · n(n − 1)(n − 2)!/([(n − 2) − (x − 2)]! (x − 2)! x(x − 1)) · p² · p^(x−2) (1 − p)^((n−2)−(x−2)).

The first two terms (x = 0 and x = 1) are 0 due to the x(x − 1) in the numerator. Thus

E[X(X − 1)] = p² n(n − 1) Σ_{x=2}^{n} (n − 2 choose x − 2) p^(x−2) (1 − p)^((n−2)−(x−2))
            = n(n − 1)p² Σ_{y=0}^{m} (m choose y) p^y (1 − p)^(m−y),    where m = n − 2 and y = x − 2,
            = n(n − 1)p²,    because the sum equals 1 by the Binomial Theorem.

Therefore

Var(X) = E(X²) − (EX)² = E[X(X − 1)] + E(X) − (EX)²
       = n(n − 1)p² + np − n²p²
       = np − np²
       = np(1 − p).

Note the steps: take out x(x − 1) and replace n by (n − 2), and x by (x − 2), wherever possible.
2.11 Mean and Variance of Estimators
Perhaps the most important application of mean and variance is in the context
of estimators:
• An estimator is a random variable.
• It has a mean and a variance.
• The mean tells us how accurate the estimator is: in particular, does it get
the right answer on average?
• The variance tells us how reliable the estimator is. If the variance is high,
it has high spread and we can get estimates a long way from the true answer.
Because we don’t know what the right answer is, we don’t know whether our
particular estimate is a good one (close to the true answer) or a bad one (a long
way from the true answer). So an estimator with high variance is unreliable:
sometimes it gives bad answers, sometimes it gives good answers; and we don’t
know which we’ve got.
An unreliable estimator is like a friend who often tells lies. Once you find them
out, you can’t believe ANYTHING they say!
[Figure: three examples of an estimator PDF, f(p̂), each showing the distribution of p̂ relative to the true value of p.]
If an estimator has a large bias, we probably don’t want to use it. However,
even if the estimator is unbiased, we still need to look at its variance to decide
how reliable it is.
Estimator Variance, Var(p̂)
We have:
Var(p̂) = Var(X/n)
       = (1/n²) Var(X)
       = (1/n²) × np(1 − p)    because Var(X) = np(1 − p) for X ∼ Binomial(n, p)
∴ Var(p̂) = p(1 − p)/n.    (⋆)
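Equation (⋆) can also be checked by simulation; a sketch in R, where the true value p = 0.6 is just an assumed value for illustration:

  n <- 190
  p <- 0.6
  # Many simulated values of the estimator p-hat = X/n
  phat <- rbinom(100000, size = n, prob = p) / n
  var(phat)          # simulated variance of p-hat
  p * (1 - p) / n    # theoretical value p(1-p)/n, about 0.00126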
To decide how reliable our estimator p̂ is, we would like to calculate the value of Var(p̂). But Var(p̂) = p(1 − p)/n, and we do not know the true value of p, so we cannot calculate the exact Var(p̂).
Instead, we have to ESTIMATE Var(p̂) by replacing the unknown p in equation (⋆) by p̂.
We call our estimated variance Vâr(p̂):   Vâr(p̂) = p̂(1 − p̂)/n.
The standard error of p̂ is defined as:   se(p̂) = √( Vâr(p̂) ).
Margin of error = 1.96 × se(p̂) = 1.96 × √( p̂(1 − p̂)/n ).
Example: For the deep-sea diver example in Section 2.3, we had X ∼ Binomial(190, p) with observation X = 125 daughters out of 190 children. So
p̂ = X/n = 125/190 = 0.658   ⇒   se(p̂) = √( 0.658 × (1 − 0.658)/190 ) = 0.034.
For our final answer, we should therefore quote:
p̂ = 0.658 ± 1.96 × 0.034 = 0.658 ± 0.067,   or   p̂ = 0.658 (0.591, 0.725).
Our estimate is fairly precise, although not extremely precise. We are pretty
sure that the daughter probability is somewhere between 0.59 and 0.73.
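These figures are easily reproduced in R; a sketch that matches the numbers above up to rounding:

  x <- 125
  n <- 190
  phat <- x / n                          # estimate: 0.658
  se   <- sqrt(phat * (1 - phat) / n)    # standard error: 0.034
  phat + c(-1, 1) * 1.96 * se            # approximate 95% interval: about (0.59, 0.73)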
So our new estimator q̂ is unbiased for p, but on the downside it also has higher variance than p̂, because Var(q̂) = ((n + 1)/n)² Var(p̂). So we might or might not prefer to use q̂ instead of p̂. As the sample size n grows large, we might prefer to accept a tiny bias with the lower variance and use p̂.
• Var(p̂) is a number that tells us about the reliability of our estimator. Unlike E(p̂), which we care about more as an abstract property, we would like to know the actual numeric value of Var(p̂) so we can calculate confidence intervals. Confidence intervals quantify our estimator reliability and should be included with our final report.
Unfortunately, we find that Var(p̂) depends upon the unknown value p: for example, Var(p̂) = p(1 − p)/n, so we can’t calculate it because we don’t know what p is. This is why we use Vâr(p̂), described next.
• Vâr(p̂) is our best attempt at getting a value for Var(p̂). We just take the expression for Var(p̂) and substitute p̂ for the unknown p everywhere. This means that Vâr(p̂) is an estimator for Var(p̂).
For example, if Var(p̂) = p(1 − p)/n, then Vâr(p̂) = p̂(1 − p̂)/n.
Because Vâr(p̂) is a function of the random variable p̂, Vâr(p̂) is also a random variable. Typically, we use it only for calculating its numerical value and transforming this into a standard error and a confidence interval as described on the previous page.