Inference
2.1 Introduction
Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. The word inference means ‘conclusions’
or ‘decisions’. Statistical inference is about drawing conclusions and making
decisions based on observed data.
The data are only a sample from the population or mechanism of interest. We cannot possibly observe every outcome of the process, so we have to make do with the sample that we have observed.
The data give us imperfect insight into the population of interest. The role of statistical inference is to use these imperfect data to draw conclusions about the population, while simultaneously giving an honest reflection of the uncertainty in our conclusions.
Example 2: Political polling: how many people will vote for the NZ Labour Party?
• Population: all eligible voters in New Zealand.
• Sample: a random sample of voters, e.g. 1000.
• What do we want to make inference about? We want to know the support for Labour among all voters, but asking every voter is too expensive to carry out except on election night itself. Instead we aim to deduce the support for Labour by asking a smaller number of voters, while simultaneously reporting our uncertainty (the margin of error).
1. Hypothesis testing:
• I toss a coin ten times and get nine heads. How unlikely is that? Can we
continue to believe that the coin is fair when it produces nine heads out
of ten tosses?
4. Modelling:
• We have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or does it show little variability? Does it sometimes give results much higher than average, but never give results much lower (a long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.
2.2 Hypothesis testing
You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler
circumstances than these. The concept of the hypothesis test is at its easiest to
understand with the Binomial distribution in the following example. All other
hypothesis tests throughout statistics are based on the same idea.
What is ‘weird’?
If our coin is fair, the outcomes that are as weird or weirder than 9 heads out of 10 tosses are: 0, 1, 9, or 10 heads.
We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair. If X is the number of heads in 10 tosses of a fair coin, then X ∼ Binomial(10, 0.5) and
P(X = 0) + P(X = 1) + P(X = 9) + P(X = 10) = 0.021.
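For reference, this calculation can be reproduced in R. The following is a minimal sketch, assuming X ∼ Binomial(10, 0.5); dbinom() and pbinom() are R's built-in Binomial probability and cumulative-probability functions.

  # P(outcome at least as weird as 9 heads) = P(X = 0, 1, 9 or 10) for X ~ Binomial(10, 0.5)
  sum(dbinom(c(0, 1, 9, 10), size = 10, prob = 0.5))
  # Equivalently, double the lower tail P(X <= 1), since the two tails are symmetric:
  2 * pbinom(1, size = 10, prob = 0.5)
  # Both give approximately 0.021.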
Is this weird?
Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only
expect to see something as extreme as 9 heads on about 2.1% of occasions.
Is the coin fair?
Obviously, we can’t say. It might be: after all, on 2.1% of occasions that you
toss a fair coin 10 times, you do get something as weird as 9 heads or more.
However, 2.1% is a small probability, so it is still very unusual for a fair coin to
produce something as weird as what we’ve seen. If the coin really was fair, it
would be very unusual to get 9 heads or more.
We can deduce that, EITHER we have observed a very unusual event with a fair
coin, OR the coin is not fair.
In fact, this gives us some evidence that the coin is not fair.
The value 2.1% measures the strength of our evidence. The smaller this proba-
bility, the more evidence we have.
Alternative hypothesis: the second alternative, that the coin is NOT fair.
We write:
H0 : p = 0.5   and   H1 : p ≠ 0.5.
A p-value of 0.021 represents quite strong evidence against the null hypothesis. It states that, if the null hypothesis is TRUE, we would have only a 2.1% chance of observing something as extreme as 9 heads or 9 tails out of 10 tosses.
Some people might even see this as strong enough evidence to decide that the
null hypothesis is not true, but this is generally an over-simplistic interpretation.
Note: Be careful not to confuse the term p-value, which is 0.021 in our example, with the Binomial probability p. Our hypothesis test is designed to test whether the Binomial probability is p = 0.5. To test this, we calculate the p-value of 0.021 as a measure of the strength of evidence against the hypothesis that p = 0.5.
Interpreting the hypothesis test
There are different schools of thought about how a p-value should be interpreted.
• Most people agree that the p-value is a useful measure of the strength of
evidence against the null hypothesis. The smaller the p-value, the
stronger the evidence against H0 .
Statistical significance
You have probably encountered the idea of statistical significance in other courses. A test is usually said to be significant at the 5% level if its p-value is less than 0.05.
This means that the chance of seeing what we did see (9 heads out of 10), or something more extreme, is less than 5% if the null hypothesis is true.
Saying the test is significant is a quick way of saying that there is evidence
against the null hypothesis, usually at the 5% level.
In the coin example, we can say that our test of H0 : p = 0.5 against H1 : p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.
This means:
• we have some evidence that p ≠ 0.5.
Beware!
The p-value gives the probability of seeing something as weird as what we did
see, if H0 is true.
This means that 5% of the time, we will get a p-value < 0.05 WHEN H0 IS
TRUE!!
Similarly, about once in every thousand tests, we will get a p-value < 0.001,
when H0 is true!
Men in the class: would you like to have daughters? Then become a deep-sea
diver, a fighter pilot, or a heavy smoker.
The facts
• US presidents: 153 children in total, of whom only 65 were daughters.
• Deep-sea divers: 190 children in total, of whom 125 were daughters and only 65 were sons.
Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?
This is the same as the question in Section 2.2.
For the presidents: If I tossed a coin 153 times and got only 65 heads, could
I continue to believe that the coin was fair?
For the divers: If I tossed a coin 190 times and got only 65 heads, could I
continue to believe that the coin was fair?
Hypothesis test for the presidents
This would take a lot of calculator time! Instead, we use a computer with a
package such as R.
In R:
  pbinom(65, 153, 0.5)
  [1] 0.03748079
So the lower-tail probability is 0.0375, and doubling it for the two-sided test gives a p-value of about 0.075.
Note: In the R command pbinom(65, 153, 0.5), the order in which you enter the numbers 65, 153, and 0.5 is important. If you enter them in a different order, you will get the wrong answer or an error. An alternative is to use the longhand command pbinom(q=65, size=153, prob=0.5), in which case you can enter the arguments in any order.
Back to our hypothesis test. Recall that X was the number of daughters out of
153 presidential children, and X ∼ Binomial(153, p), where p is the probability
that each child is a daughter.
The p-value of 0.075 means that, if the presidents really were as likely to have daughters as sons, there would be only a 7.5% chance of observing something as unusual as only 65 daughters out of the total of 153 children.
For the deep-sea divers, there were 190 children: 65 sons and 125 daughters. Then X ∼ Binomial(190, p), where p is the probability that each child is a son. The two-sided p-value, calculated in the same way, is about 0.000016.
We conclude that it is extremely unlikely that this observation could have occurred by chance, if the deep-sea divers had equal probabilities of having sons and daughters.
We have very strong evidence that deep-sea divers are more likely to have daughters than sons. The data are not really compatible with H0.
What next?
p-values are often badly used in science and business. They are regularly treated
as the end point of an analysis, after which no more work is needed. Many
scientific journals insist that scientists quote a p-value with every set of results,
and often only p-values less than 0.05 are regarded as ‘interesting’. The outcome
is that some scientists do every analysis they can think of until they finally come
up with a p-value of 0.05 or less.
Don’t accept that Drug A is better than Drug B only because the p-value says
so: find a biochemist who can explain what Drug A does that Drug B doesn’t.
Don’t accept that sun exposure is a cause of skin cancer on the basis of a p-value
alone: find a mechanism by which skin is damaged by the sun.
The following text is taken from Malcolm Gladwell’s book Outliers. It describes
the play-by-play for the first goal scored in the 2007 finals of the Canadian ice
hockey junior league for star players aged 17 to 19. The two teams are the Tigers
and Giants. There’s one slight difference . . . instead of the players’ names, we’re
given their birthdays.
March 11 starts around one side of the Tigers’ net, leaving the puck
for his teammate January 4, who passes it to January 22, who flips it
back to March 12, who shoots point-blank at the Tigers’ goalie, April
27. April 27 blocks the shot, but it’s rebounded by Giants’ March 6.
He shoots! Tigers defensemen February 9 and February 14 dive to
block the puck while January 10 looks on helplessly. March 6 scores!
Here are some figures. There were 25 players in the Tigers squad, born between
1986 and 1990. Out of these 25 players, 14 of them were born in January,
February, or March. Is it believable that this should happen by chance, or do
we have evidence that there is a birthday-effect in becoming a star ice hockey
player?
Hypothesis test
Let X be the number of the 25 players who are born from January to March.
We need to set up hypotheses of the following form:
H0 : there is no birthday effect;   H1 : there is a birthday effect.
Under H0 , there is no birthday effect. So the probability that each player has a
birthday in Jan to March is about 1/4.
(3 months out of a possible 12 months).
Thus the distribution of X under H0 is X ∼ Binomial(25, 1/4).
Our observation: x = 14 of the 25 players were born from January to March.
The observed proportion of players born from Jan to March is 14/25 = 0.56.
Just using our intuition, we can make a guess, but we might be wrong. The
answer also depends on the sample size (25 in this case). We need the p-value
to measure the evidence properly.
Upper tail: X = 14, 15, . . . , 25, for even more Jan to March players.
Lower tail: an equal probability in the opposite direction, for too few Jan to
March players.
Note: We do not need to calculate the exact outcomes corresponding to our lower-tail p-value. It is more complicated in this example than in Section 2.3, because we do not have Binomial probability p = 0.5. In fact, the equivalent lower-tail cut-off lies somewhere between 0 and 1 players, so it cannot be specified exactly.
We get round this problem for calculating the p-value by just multiplying the
upper-tail p-value by 2.
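A sketch of this calculation in R, using the complement of pbinom() to get the upper tail:

  # Upper tail: P(X >= 14) for X ~ Binomial(25, 1/4)
  upper <- 1 - pbinom(13, size = 25, prob = 0.25)
  # Two-sided p-value, doubling the upper tail as described above
  2 * upper
  # roughly 0.002, i.e. less than 2 in 1000, as stated below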
[Figure: the probability function of X ∼ Binomial(25, 1/4), for x = 0 to 25.]
It means that if there really was no birthday effect, we would expect to see results
as unusual as 14 out of 25 Jan to March players less than 2 in 1000 times.
We conclude that we have strong evidence that there is a birthday effect in this
ice hockey team. Something beyond ordinary chance seems to be going on. The
data are barely compatible with H0 .
Why should there be a birthday effect?
These data are just one example of a much wider - and astonishingly strong - phenomenon. Professional sports players, not just in ice hockey but also in soccer, baseball, and other sports, show strong birthday clustering. Why?
It’s because these sports select talented players for age-class star teams at young
ages, about 10 years old. In ice hockey, the cut-off date for age-class teams is
January 1st. A 10-year-old born in December is competing against players who
are nearly a year older, born in January, February, and March. The age difference makes a big difference in terms of size, speed, and physical coordination.
Most of the ‘talented’ players at this age are simply older and bigger. But there
then follow years in which they get the best coaching and the most practice.
By the time they reach 17, these players really are the best.
So far, the hypothesis tests have only told us whether the Binomial probability p
might be, or probably isn’t, equal to the value specified in the null hypothesis.
They have told us nothing about the size, or potential importance, of the departure from H0.
For example, for the deep-sea divers, we found that it would be very unlikely to
observe as many as 125 daughters out of 190 children if the chance of having a
daughter really was p = 0.5.
Remember the p-value for the test was 0.000016. Do you think that:
1. p could be as big as 0.8?
No idea! The p-value does not tell us.
2. p could be as close to 0.5 as, say, 0.51?
The test doesn’t even tell us this much!
If there was a huge sample size (number of children), we COULD get a
p-value as small as 0.000016 even if the true probability was 0.51.
Common sense, however, gives us a hint. Because there were almost twice as many daughters as sons, my guess is that the probability of having a daughter is something close to p = 2/3. We need some way of formalizing this.
Estimation
In the case of the deep-sea divers, we wish to estimate the probability p that the child of a diver is a daughter. The common-sense estimate to use is the observed proportion of daughters: p̂ = 125/190 = 0.658.
However, there are many situations where our common sense fails us. For
example, what would we do if we had a regression-model situation (see Section
3.8) and wished to specify an alternative form for p, such as
p = α + β × (diver age).
How would we estimate the unknown intercept α and slope β, given known
information on diver age and number of daughters and sons?
We need a general framework for estimation that can be applied to any situation. The most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation.
Likelihood
Suppose first that p = 0.5. The probability of observing X = 125 daughters out of 190 children would then be
P(X = 125) = (190 choose 125) × (0.5)^125 × (0.5)^65 = 3.97 × 10^−6.
Now suppose instead that p = 0.6. The probability of the same observation becomes
P(X = 125) = (190 choose 125) × (0.6)^125 × (0.4)^65 ≈ 0.016.
This still looks quite unlikely, but it is almost 4000 times more likely than getting X = 125 when p = 0.5.
You can probably see where this is heading. If p = 0.6 is a better estimate than
p = 0.5, what if we move p even closer to our common-sense estimate of 0.658?
This is even more likely than for p = 0.6. So p = 0.658 is the best estimate yet.
Can we do any better? What happens if we increase p a little more, say to
p = 0.7?
This has decreased from the result for p = 0.658, so our observation of 125 is
LESS likely under p = 0.7 than under p = 0.658.
Overall, we can plot a graph showing how likely our observation of X = 125
is under each different value of p.
[Figure: the curve of P(X = 125) when X ∼ Binomial(190, p), plotted against p; vertical axis from 0.00 to 0.06.]
The graph reaches a clear maximum. This is a value of p at which the observation
X = 125 is MORE LIKELY than at any other value of p.
We can see that the maximum occurs somewhere close to our common-sense
estimate of p = 0.658.
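A graph like this can be produced in R with a few lines; a minimal sketch, where the grid of p values from 0.4 to 0.9 is an arbitrary choice:

  # Likelihood of the fixed observation X = 125 over a grid of values of p
  p <- seq(0.4, 0.9, by = 0.001)
  L <- dbinom(125, size = 190, prob = p)
  plot(p, L, type = "l", ylab = "P(X = 125) when X ~ Bin(190, p)")
  # Grid value of p at which the likelihood is largest:
  p[which.max(L)]     # close to 125/190 = 0.658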
The likelihood function
Horizontal axis: the different possible values of the unknown parameter p.
Vertical axis: the probability of our observation, X = 125, under this value of p.
This function is called the likelihood function.
For our fixed observation X = 125, the likelihood function shows how LIKELY the observation 125 is for every different value of p.
We write:
L(p) = P(X = 125) when X ∼ Binomial(190, p) = (190 choose 125) p^125 (1 − p)^65,   for 0 < p < 1.
The likelihood gives the probability of a FIXED observation x, for every possible
value of the parameter p.
Compare this with the probability function, which is the probability of every
different value of x, for a FIXED value of p.
[Figures: left, the likelihood function P(X = 125) for X ∼ Binomial(190, p), plotted against p from 0.50 to 0.80; right, the probability function P(X = x) for a fixed p, plotted against x from 90 to 150.]
dL/dp = (190 choose 125) × { 125 p^124 (1 − p)^65 + p^125 × 65 (1 − p)^64 × (−1) }    (Product Rule)
      = (190 choose 125) p^124 (1 − p)^64 { 125(1 − p) − 65p }
      = (190 choose 125) p^124 (1 − p)^64 { 125 − 190p }.
This gives:
dL/dp = (190 choose 125) p^124 (1 − p)^64 { 125 − 190p } = 0
   ⇒  125 − 190p = 0
   ⇒  p = 125/190 = 0.658.
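The calculus can be double-checked numerically in R; a sketch using the base function optimize() to maximise the likelihood over 0 < p < 1:

  # Numerically maximise L(p) = P(X = 125) for X ~ Binomial(190, p)
  optimize(function(p) dbinom(125, size = 190, prob = p),
           interval = c(0, 1), maximum = TRUE)
  # The component $maximum should be close to 125/190 = 0.658.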
For the diver example, the maximum likelihood estimate of 125/190 is the same as the common-sense estimate (page 53):
p̂ = 125/190.
The maximum likelihood estimate is the value p̂ at which the derivative of the likelihood is zero:
dL/dp = 0 at p = p̂   ⇒   p̂ = 125/190.
5. Solve for p̂: From the graph, we can see that p = 0 and p = 1 are not maxima.
∴ p̂ = 125/190.
In particular, care must be taken when the parameter has a restricted range like 0 < p < 1 (see later). In Stats 210, we will be relaxed about this: you will usually be told to assume that the MLE occurs in the interior of the parameter range. Where possible, it is always best to plot the likelihood function, as on page 55. This confirms that the maximum likelihood estimate exists and is unique.
Estimators
For the example above, we had observation X = 125, and the maximum likelihood estimate of p was
p̂ = 125/190.
It is clear that we could follow through the same working with any value of X, which we can write as X = x, and we would obtain
p̂ = x/190.
This means that even before we have made our observation of X, we can provide
a RULE for calculating the maximum likelihood estimate once X is observed:
Rule: Let
X ∼ Binomial(190, p).
Whatever value of X we observe, the maximum likelihood estimate of p will be
p̂ = X/190.
Note that this expression is now a random variable: it depends on the random
value of X .
A random variable specifying how an estimate is calculated from an observation
is called an estimator.
More generally, suppose X ∼ Binomial(n, p), where p is unknown. Follow the steps on page 59 to find the maximum likelihood estimator for p.
1. Write down the distribution of X:
X ∼ Binomial(n, p).    (n is known.)
2. Write down the observed value of X:
Observed data: X = x.
3. Write down the likelihood function for this observed value:
L(p) = P(X = x) when X ∼ Binomial(n, p) = (n choose x) p^x (1 − p)^(n−x),   for 0 < p < 1.
4. Differentiate the likelihood with respect to the parameter, and set to 0 for
the maximum:
dL/dp = (n choose x) p^(x−1) (1 − p)^(n−x−1) { x − np } = 0,   when p = p̂.    (Exercise)
5. Solve for p̂:
p̂ = x/n.
Example: Recall the president problem in Section 2.3. Out of 153 children, 65 were daughters. Let p be the probability that a presidential child is a daughter. What is the maximum likelihood estimate of p? Using the rule above, p̂ = x/n = 65/153 = 0.425.
Note: We showed in Section 2.3 that p was not significantly different from 0.5 in
this example.
However, the MLE of p is definitely different from 0.5.
This comes back to the meaning of significantly different in the statistical sense.
Saying that p is not significantly different from 0.5 just means that we can’t
DISTINGUISH any difference between p and 0.5 from routine sampling
variability.
We expect that p probably IS different from 0.5, just by a little. The maximum
likelihood estimate gives us the ‘best’ estimate of p.
Note: We have only considered the class of problems for which X ∼ Binomial(n, p)
and n is KNOWN. If n is not known, we have a harder problem: we have two
parameters, and one of them (n) should only take discrete values 1, 2, 3, . . ..
We will not consider problems of this type in Stats 210.
2.6 Random numbers and histograms
To generate (say) 100 random numbers from the Binomial(n = 190, p = 0.6) distribution in R, we use:
  rbinom(100, 190, 0.6)
or in long-hand,
  rbinom(n=100, size=190, prob=0.6)
Caution: the R inputs n and size are the opposite to what you might expect: n gives the required sample size, and size gives the Binomial parameter n!
Histograms
The usual graph used to visualise a set of random numbers is the histogram.
The height of each bar of the histogram shows how many of the random numbers
fall into the interval represented by the bar.
Here are histograms from applying the command rbinom(100, 190, 0.6)
three different times.
[Figure: three histograms, each with vertical axis ‘frequency of x’, produced by three separate calls to rbinom(100, 190, 0.6).]
Each graph shows 100 random numbers from the Binomial(n = 190, p = 0.6)
distribution.
Note: The histograms above have been specially adjusted so that each histogram
bar covers an interval of just one integer. For example, the height of the bar
plotted at x = 109 shows how many of the 100 random numbers are equal to
109.
Usually, histogram bars would cover a larger interval, and the histogram would be smoother. For example, the figure below shows a histogram using the default settings in R, obtained from the command hist(rbinom(100, 190, 0.6)). Each histogram bar covers an interval of 5 integers.
[Figure: ‘Histogram of rbinom(100, 190, 0.6)’, with x-axis from about 100 to 120 and vertical axis ‘Frequency’ from 0 to 30.]
In all the histograms above, the sum of the heights of all the bars is 100, because
there are 100 observations.
Histograms are useful because they show the approximate shape of the
underlying probability function.
They are also useful for exploring the effect of increasing sample size.
Eventually, with a large enough sample size, the histogram starts to look identical
to the probability function.
The histogram should have the same shape as the probability function, especially
as the sample size gets large.
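One way to see this in R is to draw a density-scaled histogram of a large sample and overlay the exact probability function; a sketch, where the sample size of 100000 and the plotting range 80 to 150 are arbitrary choices:

  # Large sample from the Binomial(190, 0.6) distribution
  x <- rbinom(100000, size = 190, prob = 0.6)
  # One histogram bar per integer, scaled to proportions rather than counts
  hist(x, breaks = seq(min(x) - 0.5, max(x) + 0.5, by = 1), freq = FALSE)
  # Overlay the exact probability function P(X = k)
  points(80:150, dbinom(80:150, size = 190, prob = 0.6), pch = 16)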
Sample size 1000: rbinom(1000, 190, 0.6)
[Figure: three histograms, each of 1000 random numbers from the Binomial(190, 0.6) distribution, with x running from about 80 to 140 and vertical axis ‘frequency of x’. Two further sets of three histograms show the same for much larger sample sizes; as the sample size increases, the histograms become smoother and more alike.]
2.7 Expectation
[Figure: the probability function P(X = x) when X ∼ Binomial(190, p = 0.6), for x from 90 to 150.]
Here are 30 random values generated from the Binomial(190, 0.6) distribution:
116 116 117 122 111 112 114 120 112 102
125 116 97 105 108 117 118 111 116 121
107 113 120 114 114 124 116 118 119 120
The average, or mean, of the first ten values is:
(116 + 116 + 117 + 122 + 111 + 112 + 114 + 120 + 112 + 102)/10 = 114.2.
The answers all seem to be close to 114. What would happen if we took the
average of hundreds of values?
Note: You will get a slightly different result every time you run such a command, because the random numbers change each time.
1000 values from Binomial(190, 0.6):
R command: mean(rbinom(1000, 190, 0.6))
Result: 114.02
The larger the sample size, the closer the average seems to get to 114.
If we kept going for larger and larger sample sizes, we would keep getting answers
closer and closer to 114. This is because 114 is the DISTRIBUTION MEAN:
the mean value that we would get if we were able to draw an infinite sample from
the Binomial(190, 0.6) distribution.
This distribution mean is called the expectation, or expected value, of the Binomial(190, 0.6) distribution.
It is a FIXED property of the Binomial(190, 0.6) distribution. This means it is
a fixed constant: there is nothing random about it.
µX = E(X) = Σ_x x fX(x) = Σ_x x P(X = x).
The expected value is a measure of the centre, or average, of the set of values that
X can take, weighted according to the probability of each value.
For example, for X ∼ Binomial(190, 0.6):
E(X) = Σ_x x P(X = x) = Σ_{x=0}^{190} x (190 choose x) (0.6)^x (0.4)^(190−x).
Although it is not obvious, the answer to this sum is n × p = 190 × 0.6 = 114.
We will see why in Section 2.10.
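The sum can be evaluated directly in R as a quick check (a sketch; 0:190 lists every possible value of X):

  # E(X) = sum over x of x * P(X = x), for X ~ Binomial(190, 0.6)
  x <- 0:190
  sum(x * dbinom(x, size = 190, prob = 0.6))
  # [1] 114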
We will move away from the Binomial distribution for a moment, and use a
simpler example.
Let the random variable X be defined as: X = 1 with probability 0.9, and X = −1 with probability 0.1.
X takes only the values 1 and −1. What is the ‘average’ value of X?
Using (1 + (−1))/2 = 0 would not be useful, because it ignores the fact that usually X = 1, and only occasionally is X = −1.
Instead we weight each value by its probability:
E(X) = 0.9 × 1 + 0.1 × (−1) = 0.8,
or in general,
E(X) = Σ_x P(X = x) × x.
For any constants a and b:   E(aX + b) = aE(X) + b.
Proof:
E(aX + b) = Σ_x (ax + b) P(X = x) = a Σ_x x P(X = x) + b Σ_x P(X = x) = a E(X) + b × 1 = a E(X) + b.
Example: Let X ∼ Binomial(3, 0.2). We have:
P(X = x) = (3 choose x) (0.2)^x (0.8)^(3−x)   for x = 0, 1, 2, 3.
x                    0      1      2      3
fX(x) = P(X = x)   0.512  0.384  0.096  0.008
Then
E(X) = Σ_{x=0}^{3} x fX(x) = 0 × 0.512 + 1 × 0.384 + 2 × 0.096 + 3 × 0.008 = 0.6.
Example: Let Y = 1 with probability p, and Y = 0 with probability 1 − p.
Find E(Y ).
y 0 1
P(Y = y) 1 − p p
E(Y ) = 0 × (1 − p) + 1 × p = p.
Expectation of a sum of random variables: E(X + Y)
For ANY random variables X and Y, E(X + Y) = E(X) + E(Y), and more generally E(X1 + X2 + . . . + Xn) = E(X1) + E(X2) + . . . + E(Xn).
This result holds for any random variables X1, . . . , Xn. It does NOT require X1, . . . , Xn to be independent.
Note: We can combine the result above with the linear property of expectation.
For any constants a1 , . . . , an , we have:
E (a1 X1 + a2 X2 + . . . + an Xn ) = a1 E(X1 ) + a2 E(X2 ) + . . . + an E(Xn ).
Expectation of a product of random variables: E(XY)
1. General case: we have to find E(XY) either using their joint probability function (see later), or using their covariance (see later).
2. Special case: when X and Y are INDEPENDENT, E(XY) = E(X)E(Y).
For discrete random variables, it is very easy to find the probability function for
Y = g(X), given that the probability function for X is known. Simply change
all the values and keep the probabilities the same.
Example 1: Let X ∼ Binomial(3, 0.2) and Y = X². The probability function for X is:
x           0      1      2      3
P(X = x)  0.512  0.384  0.096  0.008
so the probability function for Y = X² is:
y           0²     1²     2²     3²
P(Y = y)  0.512  0.384  0.096  0.008
This is because Y takes the value 0² whenever X takes the value 0, and so on.
We can find the expectation of a transformed random variable just like any other
random variable. For example, in Example 1 we had X ∼ Binomial(3, 0.2), and
Y = X².
The probability function for X is:
x           0      1      2      3
P(X = x)  0.512  0.384  0.096  0.008
and for Y = X²:
y           0      1      4      9
P(Y = y)  0.512  0.384  0.096  0.008
Thus the expectation of Y = X² is:
E(X²) = 0 × 0.512 + 1 × 0.384 + 4 × 0.096 + 9 × 0.008 = 0.84.
Note: E(X²) is NOT the same as {E(X)}². Check that {E(X)}² = 0.36.
To make the calculation quicker, we could cut out the middle step of writing down the probability function of Y. Because we transform the values and keep the probabilities the same, we have:
E{g(X)} = E(X²) = g(0) × 0.512 + g(1) × 0.384 + g(2) × 0.096 + g(3) × 0.008.
Clearly the same arguments can be extended to any function g(X) and any discrete random variable X:
E{g(X)} = Σ_x g(x) P(X = x).
Definition: For any function g and discrete random variable X, the expected value of g(X) is given by
E{g(X)} = Σ_x g(x) P(X = x) = Σ_x g(x) fX(x).
Example: Recall Mr Chance and his balloon-hire business from page 74. Let X be
the height of balloon selected by a randomly chosen customer. The probability
function of X is:
height, x (m) 2 3 4
P(X = x) 0.5 0.3 0.2
For example, the expected amount of gas needed per customer, if a balloon of height x requires x³/2 m³ of gas, is
E(X³/2) = (2³/2) × 0.5 + (3³/2) × 0.3 + (4³/2) × 0.2 = 12.45 m³ of gas.
(c) How much does Mr Chance expect to earn in total from his next 5 customers?
Let Z1, . . . , Z5 be the earnings from the next 5 customers. Each Zi has E(Zi) = 1080 by part (b). The total expected earning is
E(Z1 + . . . + Z5) = E(Z1) + . . . + E(Z5) = 5 × 1080 = 5400.
Suppose X = 3 with probability 3/4, and X = 8 with probability 1/4.
Then 3/4 of the time, X takes value 3, and 1/4 of the time, X takes value 8.
So E(X) = (3/4) × 3 + (1/4) × 8.
Similarly, E(√X) = (3/4) × √3 + (1/4) × √8.
Common mistakes
i)   E(√X) = √(E(X)) = √((3/4) × 3 + (1/4) × 8)     Wrong!
ii)  E(√X) = √((3/4) × 3) + √((1/4) × 8)            Wrong!
iii) E(√X) = √(3/4) × √3 + √(1/4) × √8              Wrong!
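A quick numerical check of this in R, a sketch using the two-point distribution above:

  x <- c(3, 8)
  p <- c(3/4, 1/4)
  sum(p * sqrt(x))    # correct: E(sqrt(X)) = (3/4)*sqrt(3) + (1/4)*sqrt(8), about 2.01
  sqrt(sum(p * x))    # mistake (i): sqrt(E(X)) = sqrt(4.25), about 2.06 -- not the same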
2.9 Variance
Example: Mrs Tractor runs the Rational Bank of Remuera. Every day she hopes to fill her cash machine with enough cash to see the well-heeled citizens of Remuera through the day. She knows that the expected amount of money withdrawn each day is $50,000. How much money should she load in the machine?
$50,000?
No: $50,000 is the average, near the centre
of the distribution. About half the time,
the money required will be GREATER
than the average.
How much money should Mrs Tractor put in the
machine if she wants to be 99% certain that there
will be enough for the day’s transactions?
Answer: it depends how much the amount withdrawn varies above and below
its mean.
For questions like this, we need the study of variance.
Variance is the average squared distance of a random variable from its own mean. For any function g,
Var(g(X)) = E[ {g(X) − E(g(X))}² ].
The variance is a measure of how spread out are the values that X can take.
It is the average squared distance between a value of X and the central (mean)
value, µX .
[Figure: possible values x1, x2, . . . , x6 of X marked on a line, with the central value µX shown and distances such as x2 − µX and x4 − µX indicated.]
Var(X) = E[ (X − µX)² ].
(1) Take the distance from each observed value of X to the central point, µX. Square it to balance positive and negative distances.
(2) Then take the average over all values X can take: i.e. if we observed X many times, find what would be the average squared distance between X and µX.
Note: The mean, µX, and the variance, σX², of X are just numbers: there is nothing random or variable about them.
Example: Let X = 3 with probability 3/4, and X = 8 with probability 1/4. Then
E(X) = µX = 3 × (3/4) + 8 × (1/4) = 4.25
Var(X) = σX² = (3/4) × (3 − 4.25)² + (1/4) × (8 − 4.25)² = 4.6875.
Var(X) = E[ (X − µX)² ] = Σ_x (x − µX)² fX(x) = Σ_x (x − µX)² P(X = x).
This leads to the useful alternative formula Var(X) = E(X²) − µX² = E(X²) − (EX)².
Proof:  Var(X) = E[ (X − µX)² ]    (by definition)
              = E[ X² − 2XµX + µX² ]    (X is a random variable; µX is a constant)
              = E(X²) − 2µX E(X) + µX²
              = E(X²) − 2µX² + µX²
              = E(X²) − µX².
Note: E(X²) = Σ_x x² fX(x) = Σ_x x² P(X = x). This is not the same as (EX)². For example, if X = 3 with probability 0.75 and X = 8 with probability 0.25, then E(X²) = 9 × 0.75 + 64 × 0.25 = 22.75, whereas (EX)² = 4.25² = 18.0625.
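These numbers are easy to check in R; a sketch for the two-point distribution above:

  x  <- c(3, 8)
  px <- c(0.75, 0.25)
  mu <- sum(px * x)            # E(X) = 4.25
  sum(px * (x - mu)^2)         # Var(X) from the definition: 4.6875
  sum(px * x^2) - mu^2         # Var(X) from E(X^2) - mu^2: also 4.6875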
i) Var(aX + b) = a² Var(X).
Proof (part (i)):
Var(aX + b) = E[ {(aX + b) − E(aX + b)}² ]
            = E[ {aX + b − aE(X) − b}² ]    (by Thm 2.7)
            = E[ {aX − aE(X)}² ]
            = E[ a² {X − E(X)}² ]
            = a² E[ {X − E(X)}² ]    (by Thm 2.7)
            = a² Var(X).
Note: These are very different from the corresponding expressions for expectations
(Theorem 2.7). Variances are more difficult to manipulate than expectations.
Let X ∼ Binomial(n, p). We have mentioned several times that E(X) = np. We now prove this and the additional result for Var(X):
E(X) = µX = np,
Var(X) = σX² = np(1 − p).
The easiest proof is to write X as a sum of independent Bernoulli trials:
X = Y1 + Y2 + . . . + Yn,
where each Yi = 1 with probability p, and Yi = 0 with probability 1 − p.
Thus,
E(Yi) = 0 × (1 − p) + 1 × p = p.
Also,
E(Yi²) = 0² × (1 − p) + 1² × p = p.
So
Var(Yi) = E(Yi²) − (E Yi)² = p − p² = p(1 − p).
Therefore:
E(X) = E(Y1) + E(Y2) + . . . + E(Yn) = p + p + . . . + p = n × p.
And, because Y1, . . . , Yn are independent,
Var(X) = Var(Y1) + Var(Y2) + . . . + Var(Yn) = n × p(1 − p).
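A simulation check of both formulae in R, a sketch using the Binomial(190, 0.6) distribution from earlier sections:

  # Simulated mean and variance of X ~ Binomial(190, 0.6)
  x <- rbinom(100000, size = 190, prob = 0.6)
  mean(x)   # should be close to n*p       = 190 * 0.6       = 114
  var(x)    # should be close to n*p*(1-p) = 190 * 0.6 * 0.4 = 45.6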
We show below how the Binomial mean and variance formulae can be derived
directly from the probability function.
E(X) = Σ_{x=0}^{n} x fX(x) = Σ_{x=0}^{n} x (n choose x) p^x (1 − p)^(n−x) = Σ_{x=0}^{n} x · n!/((n − x)! x!) · p^x (1 − p)^(n−x).

But x/x! = 1/(x − 1)!, and also the first term x fX(x) is 0 when x = 0. So, continuing,

E(X) = Σ_{x=1}^{n} n!/((n − x)! (x − 1)!) · p^x (1 − p)^(n−x).

Next: make n’s into (n − 1)’s and x’s into (x − 1)’s wherever possible, e.g.
n − x = (n − 1) − (x − 1),    p^x = p · p^(x−1),    n! = n(n − 1)!,   etc.

This gives

E(X) = Σ_{x=1}^{n} n(n − 1)!/([(n − 1) − (x − 1)]! (x − 1)!) · p · p^(x−1) (1 − p)^((n−1)−(x−1))
     = np Σ_{x=1}^{n} (n − 1 choose x − 1) p^(x−1) (1 − p)^((n−1)−(x−1)),

where np is what we want, and we need to show that the remaining sum equals 1. Substituting y = x − 1 and m = n − 1, the sum becomes a Binomial sum:

E(X) = np Σ_{y=0}^{m} (m choose y) p^y (1 − p)^(m−y)
     = np (p + (1 − p))^m    (Binomial Theorem)
     = np × 1^m
     = np.
To find Var(X), it is easiest to work with E[X(X − 1)] = E(X²) − E(X), using the same trick twice over. Here goes:

E[X(X − 1)] = Σ_{x=0}^{n} x(x − 1) (n choose x) p^x (1 − p)^(n−x)
            = Σ_{x=0}^{n} x(x − 1) · n(n − 1)(n − 2)!/([(n − 2) − (x − 2)]! (x − 2)! x(x − 1)) · p² · p^(x−2) (1 − p)^((n−2)−(x−2)).

The first two terms (x = 0 and x = 1) are 0 due to the x(x − 1) in the numerator. Thus

E[X(X − 1)] = p² n(n − 1) Σ_{x=2}^{n} (n − 2 choose x − 2) p^(x−2) (1 − p)^((n−2)−(x−2))
            = n(n − 1)p² Σ_{y=0}^{m} (m choose y) p^y (1 − p)^(m−y),    where m = n − 2 and y = x − 2,
            = n(n − 1)p²,    because the sum equals 1 by the Binomial Theorem.

Therefore

Var(X) = E(X²) − (EX)² = E[X(X − 1)] + E(X) − (EX)²
       = n(n − 1)p² + np − n²p²
       = np − np²
       = np(1 − p).

Note the steps: take out x(x − 1) and replace n by (n − 2), and x by (x − 2), wherever possible.
2.11 Mean and Variance of Estimators
Perhaps the most important application of mean and variance is in the context
of estimators:
• An estimator is a random variable.
• It has a mean and a variance.
• The mean tells us how accurate the estimator is: in particular, does it get
the right answer on average?
• The variance tells us how reliable the estimator is. If the variance is high,
it has high spread and we can get estimates a long way from the true answer.
Because we don’t know what the right answer is, we don’t know whether our
particular estimate is a good one (close to the true answer) or a bad one (a long
way from the true answer). So an estimator with high variance is unreliable:
sometimes it gives bad answers, sometimes it gives good answers; and we don’t
know which we’ve got.
An unreliable estimator is like a friend who often tells lies. Once you find them
out, you can’t believe ANYTHING they say!
[Figure: three examples of an estimator PDF, f(p̂), each showing the distribution of p̂ relative to the true value of p.]
If an estimator has a large bias, we probably don’t want to use it. However,
even if the estimator is unbiased, we still need to look at its variance to decide
how reliable it is.
Estimator Variance, Var(p̂)
We have:
Var(p̂) = Var(X/n)
       = (1/n²) Var(X)
       = (1/n²) × np(1 − p)    because Var(X) = np(1 − p) for X ∼ Binomial(n, p)
∴ Var(p̂) = p(1 − p)/n.    (⋆)
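Equation (⋆) can also be checked by simulation; a sketch in R, where the true value p = 0.6 is just an assumed value for illustration:

  n <- 190
  p <- 0.6
  # Many simulated values of the estimator p-hat = X/n
  phat <- rbinom(100000, size = n, prob = p) / n
  var(phat)          # simulated variance of p-hat
  p * (1 - p) / n    # theoretical value p(1-p)/n, about 0.00126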
To decide how reliable our estimator p̂ is, we would like to calculate the value of Var(p̂). But Var(p̂) = p(1 − p)/n, and we do not know the true value of p, so we cannot calculate the exact Var(p̂).
Instead, we have to ESTIMATE Var(p̂) by replacing the unknown p in equation (⋆) by p̂.
We call our estimated variance Vâr(p̂):   Vâr(p̂) = p̂(1 − p̂)/n.
The standard error of p̂ is defined as:   se(p̂) = √( Vâr(p̂) ).
Margin of error = 1.96 × se(p̂) = 1.96 × √( p̂(1 − p̂)/n ).
Example: For the deep-sea diver example in Section 2.3, we had X ∼ Binomial(190, p) with observation X = 125 daughters out of 190 children. So
p̂ = X/n = 125/190 = 0.658   ⇒   se(p̂) = √( 0.658 × (1 − 0.658)/190 ) = 0.034.
For our final answer, we should therefore quote:
p̂ = 0.658 ± 1.96 × 0.034 = 0.658 ± 0.067,   or   p̂ = 0.658 (0.591, 0.725).
Our estimate is fairly precise, although not extremely precise. We are pretty
sure that the daughter probability is somewhere between 0.59 and 0.73.
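These figures are easily reproduced in R; a sketch that matches the numbers above up to rounding:

  x <- 125
  n <- 190
  phat <- x / n                          # estimate: 0.658
  se   <- sqrt(phat * (1 - phat) / n)    # standard error: 0.034
  phat + c(-1, 1) * 1.96 * se            # approximate 95% interval: about (0.59, 0.73)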
So our new estimator q̂ is unbiased for p, but on the downside it also has higher variance than p̂, because Var(q̂) = ((n + 1)/n)² Var(p̂). So we might or might not prefer to use q̂ instead of p̂. As the sample size n grows large, we might prefer to accept a tiny bias with the lower variance and use p̂.
• Var(p̂) is a number that tells us about the reliability of our estimator. Unlike E(p̂), which we care about more as an abstract property, we would like to know the actual numeric value of Var(p̂) so we can calculate confidence intervals. Confidence intervals quantify our estimator reliability and should be included with our final report.
Unfortunately, we find that Var(p̂) depends upon the unknown value p: for example, Var(p̂) = p(1 − p)/n, so we can’t calculate it because we don’t know what p is. This is why we use Vâr(p̂), described next.
• Vâr(p̂) is our best attempt at getting a value for Var(p̂). We just take the expression for Var(p̂) and substitute p̂ for the unknown p everywhere. This means that Vâr(p̂) is an estimator for Var(p̂).
For example, if Var(p̂) = p(1 − p)/n, then Vâr(p̂) = p̂(1 − p̂)/n.
Because Vâr(p̂) is a function of the random variable p̂, Vâr(p̂) is also a random variable. Typically, we use it only for calculating its numerical value and transforming this into a standard error and a confidence interval as described on the previous page.