Chapter 10, Probability and Stats
10.1 Introduction
Axiom 1: P(S) = 1.
This says it is certain that something must happen: if we toss a coin, the event A = {heads, tails} must occur, so P(A) = 1.
$$P(A_1 \cup A_2 \cup \ldots \cup A_n) = P(A_1) + P(A_2) + \ldots + P(A_n). \tag{10.1}$$
This axiom is more subtle than the first two, and is known as the additivity property of probability. It says we can calculate probabilities of complicated events by adding up the probabilities of smaller events, provided the smaller events are disjoint and together make up the entire complicated event. When we say disjoint we mean the events do not intersect. This axiom can be extended to a countable sequence of disjoint events $A_1, A_2, \ldots$; the extension is needed in general but not at MATH1002 level.
Since $A \cup B = A \cup (\bar{A} \cap B)$ and the events $A$ and $\bar{A} \cap B$ are disjoint, additivity gives
$$P(A \cup (\bar{A} \cap B)) = P(A) + P(\bar{A} \cap B) = P(A) + P(B) - P(A \cap B),$$
where the last step uses the fact that $B$ is the disjoint union of $A \cap B$ and $\bar{A} \cap B$, so $P(\bar{A} \cap B) = P(B) - P(A \cap B)$. Hence
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
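The identity is easy to verify on a small, equally likely sample space. The following sketch (the fair-die events are our own illustration, not an example from the text) checks it by direct enumeration:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}     # sample space: one roll of a fair die
A = {2, 4, 6}              # event: the roll is even
B = {4, 5, 6}              # event: the roll is at least 4

def P(E):
    """Probability of event E under equally likely outcomes."""
    return Fraction(len(E), len(S))

# P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))            # 2/3
```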
1. $A_1 \cup A_2 \cup \ldots \cup A_k = S$
2. $A_i \cap A_j = \emptyset$ for all $i \ne j$
$$P(A_j \mid B) = \frac{P(B \mid A_j)\,P(A_j)}{\sum_{i=1}^{k} P(B \mid A_i)\,P(A_i)}, \qquad j = 1, 2, \ldots, k.$$
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})}.$$
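Bayes' rule is simple to apply numerically. Here is a minimal sketch of the two-event form; the numbers (1% prevalence, a test that is positive 95% of the time given the condition and 10% of the time without it) are hypothetical values chosen only to exercise the formula:

```python
P_A = 0.01                # P(A): prior probability of the condition
P_B_given_A = 0.95        # P(B|A): probability of a positive test given A
P_B_given_notA = 0.10     # P(B|Ā): false-positive probability

# Two-event Bayes' rule, exactly as in the display above.
posterior = (P_B_given_A * P_A) / (
    P_B_given_A * P_A + P_B_given_notA * (1 - P_A)
)
print(posterior)          # ≈ 0.0876: even a positive test leaves P(A|B) small
```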
$$A(x) = \{s : X(s) = x\} \subseteq S.$$
$$p_X(x) \ge 0 \quad\text{and}\quad \sum_{\text{all } x} p_X(x) = 1.$$
$$F_X(t) = P(X \le t) = \sum_{x \le t} p_X(x).$$
1. $0 \le F_X(t) \le 1$
2. $F_X(t)$ is non-decreasing in $t$
3. $\lim_{t\to\infty} F_X(t) = 1$
4. $\lim_{t\to-\infty} F_X(t) = 0$
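As a concrete illustration of building $F_X$ from $p_X$, the sketch below (the fair-die p.m.f. is again our own example) evaluates the sum $\sum_{x \le t} p_X(x)$ directly and exhibits the properties above:

```python
# p.m.f. of a fair six-sided die: p_X(x) = 1/6 for x = 1, ..., 6.
p_X = {x: 1 / 6 for x in range(1, 7)}

def F_X(t):
    """c.d.f. of X at t: the sum of p_X(x) over all x <= t."""
    return sum(p for x, p in p_X.items() if x <= t)

print(F_X(0), F_X(3.5), F_X(6))   # 0, 0.5, ≈1.0
```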
$$E[X] = \sum_{x \in \text{range of } X} x\, p_X(x)$$
$$E[H(X)] = \sum_{x \in \text{range of } X} H(x)\, p_X(x)$$
1. $E[c] = c$,
2. $E[cX] = cE[X]$,
3. $E[X + Y] = E[X] + E[Y]$.
$$\mathrm{Var}(X) = E[(X - \mu_X)^2] = \sum_{x \in \text{range of } X} (x - \mu_X)^2\, p_X(x)$$
and the standard deviation is defined by $\sigma_X = \sqrt{\mathrm{Var}(X)}$.
Intuitively, $\sigma_X^2$ and $\sigma_X$ are measures of how spread out the distribution of X is, or how much it varies. As a measure of variability the variance is not so intuitive because it is measured in different units than the random variable, but it is convenient mathematically. Therefore the standard deviation $\sigma_X$ is also defined; it is the square root of the variance, providing a measure of variability in the same units as the random variable. The variance of a random variable satisfies the following properties:
1. $\mathrm{Var}(X) \ge 0$.
3. $\mathrm{Var}(X) = E[X^2] - \mu_X^2 = E[X^2] - E[X]^2$.
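The shortcut formula in property 3 is easy to check against the defining sum. A minimal sketch, reusing the illustrative fair-die p.m.f. from above:

```python
p_X = {x: 1 / 6 for x in range(1, 7)}   # fair die p.m.f.

mu  = sum(x * p for x, p in p_X.items())        # E[X] = 3.5
EX2 = sum(x**2 * p for x, p in p_X.items())     # E[X²]

var_definition = sum((x - mu)**2 * p for x, p in p_X.items())
var_shortcut   = EX2 - mu**2

print(var_definition, var_shortcut)             # both ≈ 2.9167
```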
$$p_Y(y) = p^y q^{1-y}, \qquad y = 0, 1.$$
$$\mu_Y = 1 \times p + 0 \times (1 - p) = p$$
(clear).
[Figure: two panels, each with horizontal axis x = 0, ..., 20; the left vertical axis is labelled y (0.0 to 1.0) and the right vertical axis p (0.00 to 0.20).]
Observe that the ratio is close to 3 for small p and close to 1 for large p, illustrating the general principle of reliability: redundancy improves system reliability when components are 'unreliable', but there is little advantage in having redundancy when the components are highly reliable.
Note that Theorem 10.7 and Theorem 10.8 hold also for continu-
ous random variables.
[Figure: two panels showing the exponential distribution for t from 0 to 5, with vertical axes labelled F (the c.d.f.) and f (the p.d.f.).]
$$f_T(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \\ 0 & x \le 0 \end{cases}$$
$$F_T(t) = \begin{cases} 1 - e^{-\lambda t} & t > 0 \\ 0 & t \le 0 \end{cases}$$
$$E[X] = \frac{1}{\lambda} \quad\text{and}\quad \mathrm{Var}(X) = \frac{1}{\lambda^2}.$$
$$P(T \ge t) = 1 - F_T(t) = e^{-\lambda t}.$$
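The survival probability $P(T \ge t) = e^{-\lambda t}$ can be checked against simulation. A minimal sketch using only the standard library; $\lambda = 0.5$ and $t = 2$ are arbitrary illustrative values:

```python
import math
import random

lam, t = 0.5, 2.0

exact = math.exp(-lam * t)                 # P(T >= t) = e^{-λt}

# Crude Monte Carlo estimate of the same probability.
n = 100_000
estimate = sum(random.expovariate(lam) >= t for _ in range(n)) / n

print(exact, estimate)                     # both ≈ 0.3679
```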
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}$$
[Figure: two panels of normal distribution curves plotted for x from −4 to 4.]
$$P(a \le X \le b) = P\left(Z \le \frac{b-\mu}{\sigma}\right) - P\left(Z \le \frac{a-\mu}{\sigma}\right) = F_Z\left(\frac{b-\mu}{\sigma}\right) - F_Z\left(\frac{a-\mu}{\sigma}\right).$$
Theorem 10.9 and Corollary 10.1 say that we can write probability statements about any normally distributed random variable in terms of the standard normal c.d.f. Section 10.4.3 at the end of this chapter contains approximations of $P(Z < z) = F_Z(z)$ for $z > 0$ for a standard normal random variable. Some equalities are useful to remember: $P(Z > z) = 1 - P(Z < z)$ and, for $z > 0$, $P(Z < -z) = P(Z > z) = 1 - P(Z < z)$.
In several problems in statistical inference the probability for a standard normal is given, e.g. $P(Z \le z) = 0.95$, and we are asked for the corresponding value of $z$, e.g. $z \approx 1.64$ in this case.
α = P ( Z > z α ),
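Both the forward look-up $F_Z(z)$ and the inverse look-up $z_\alpha$ are easy to check numerically. A minimal sketch using Python's standard library (the NormalDist class); nothing here goes beyond the definitions above:

```python
from statistics import NormalDist

Z = NormalDist()          # standard normal: mu = 0, sigma = 1

print(Z.cdf(1.64))        # P(Z < 1.64) ≈ 0.9495, matching the table in 10.4.3
print(Z.cdf(-1.64))       # P(Z < −1.64) = 1 − P(Z < 1.64) by symmetry
print(Z.inv_cdf(0.95))    # the z with P(Z ≤ z) = 0.95, i.e. z ≈ 1.6449
print(Z.inv_cdf(0.99))    # z_{0.01} ≈ 2.3263, the 2.33 used in the next example
```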
$P(X > a) = 0.01$. We know $P(X > a) = P(Z > (a - \mu)/\sigma)$. We also know from Section 10.4.3 that $P(Z > 2.33) \approx 0.01$, so
$$\frac{a - 75}{\sigma} = 2.33$$
and, with $\sigma = 0.1$, $a = 75.233$. Thus if 1% of screws get rejected then the smallest to be rejected would be approximately 75.233 mm.
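The same cutoff can be computed directly rather than read from the table. A minimal sketch, assuming (as in the reconstruction above) $\mu = 75$ and $\sigma = 0.1$:

```python
from statistics import NormalDist

mu, sigma = 75.0, 0.1     # assumed mean and standard deviation of screw length
a = mu + NormalDist().inv_cdf(0.99) * sigma   # cutoff leaving 1% in upper tail
print(a)                  # ≈ 75.2326, i.e. the 75.233 mm found above
```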
10.4.1 Estimation
A very basic concept in statistical inference is that of a random sample. By sample we mean only a fraction or portion of the whole – crash testing every car to inspect the effectiveness of airbags is not a good idea. By random we mean essentially that the portion taken is determined non-systematically. Heuristically, it is expected that a haphazardly selected random sample will be representative of the whole population because there is no selection bias. For the purposes of MATH1002 (and often in practice), we will assume every observation in a random sample is generated by the same probability distribution and each observation is made independently of the others. This motivates the following definition.
Also,
$$\bar{x}_n = (x_1 + x_2 + \ldots + x_n)/n$$
2.58 2.58 1.75 0.53 3.29 2.04 3.46 2.92 3.10 2.41
3.89 1.99 0.74 1.59 0.35 0.03 0.52 1.42 0.04 4.02
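For the record, the sample mean of these twenty observations can be computed directly (a minimal sketch):

```python
data = [2.58, 2.58, 1.75, 0.53, 3.29, 2.04, 3.46, 2.92, 3.10, 2.41,
        3.89, 1.99, 0.74, 1.59, 0.35, 0.03, 0.52, 1.42, 0.04, 4.02]

x_bar = sum(data) / len(data)   # x̄_n = (x_1 + ... + x_n)/n
print(x_bar)                    # 1.9625
```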
particularly since we know the distribution of $\bar{X}_n$ will shrink around the true mean, because $E[\bar{X}_n] = \mu$ and its variance decreases with $n$. Finally, in the case of a normal random sample we also have
$$\bar{X}_n \sim N(\mu_X, \sigma_X^2/n).$$
$$Z_n = \frac{\bar{X}_n - \mu_X}{\sigma_X/\sqrt{n}}, \qquad n = 1, 2, 3, \ldots$$
where
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Loosely, the CLT states that for any random sample, as n gets larger the distribution of the sample average $\bar{X}_n$, properly normalised, approaches a standard normal distribution. Even more loosely, the CLT states that, for large n, $\bar{X}_n \mathrel{\dot\sim} N(\mu_X, \sigma_X^2/n)$, where $\dot\sim$ denotes 'approximately distributed as'. The CLT is one of the most powerful results presented in probability theory. Obviously, knowing the distribution of the sample average is important because it provides a measure of how precise we believe our estimate of the model parameter to be. However, more powerfully, the CLT means that for any random variable we can always perform statistical inference on the mean or expected value. Figure 10.4 shows an approximate p.d.f. of an exponential random variable and of the standardised estimator $\bar{X}_n$ based on random samples of increasing size from this random variable. The figure shows that although the original density of each $X_i$ is exponential, the p.d.f. of the standardised estimator approaches the standard normal distribution as n increases.
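A sketch of the experiment behind a figure like Figure 10.4 (our own simulation, not the authors' code): draw exponential samples, standardise their means, and compare a tail probability with the standard normal value.

```python
import math
import random
from statistics import NormalDist

lam = 1.0                  # exponential rate, so mu_X = sigma_X = 1/lam = 1
reps = 2_000               # number of simulated values of Z_n per sample size

def z_n(n):
    """One draw of Z_n = sqrt(n) * (X̄_n − mu_X) / sigma_X."""
    xbar = sum(random.expovariate(lam) for _ in range(n)) / n
    return math.sqrt(n) * (xbar - 1 / lam) / (1 / lam)

for n in (2, 20, 2000):
    frac = sum(z_n(n) > 1.0 for _ in range(reps)) / reps
    print(n, frac)                      # approaches P(Z > 1) as n grows

print(1 - NormalDist().cdf(1.0))        # P(Z > 1) ≈ 0.1587
```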
By providing the p.d.f. of the estimator $\bar{X}_n$, Theorem 10.10 and Theorem 10.11 (in the case of large n) can be used to provide ranges of plausible values of the mean parameter – a confidence interval.
Definition 10.30. Suppose we have a random sample whose probability
model depends on a parameter θ. The two statistics, L1 and L2 , form a
100 × (1 − α)% confidence interval if
P ( L1 < θ < L2 ) = 1 − α
[Figure 10.4: the p.d.f. $f_X$ of an exponential random variable, together with the approximate p.d.f. of $Z_n$ when n = 2 (upper right panel), n = 20 (bottom left panel) and n = 2000 (bottom right panel). The thick black line shows the p.d.f. of the standard normal random variable.]
For an observed random sample, we substitute the value x̄n for X̄n
to calculate the confidence interval. It is important to note that for
this observed data set, we cannot say the probability that µ X lies in
this interval is 0.95. All we can say is that after repeated samples we
expect 95% of the confidence intervals constructed to contain the
true value of µ X . We will now do an example using the binomial
random variable.
Notice the confidence interval does not extend below 0.05 and we
should probably doubt the producer’s claim that no more than 5%
of their bullets misfire.
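The repeated-sampling interpretation above can itself be simulated. A minimal sketch for the interval $\bar{x}_n \pm 1.96\,\sigma/\sqrt{n}$ with known $\sigma$; the model (normal data with $\mu = 10$, $\sigma = 2$, $n = 25$) is an arbitrary illustration:

```python
import random
from statistics import NormalDist

mu, sigma, n = 10.0, 2.0, 25
z = NormalDist().inv_cdf(0.975)          # ≈ 1.96 for a 95% interval
half_width = z * sigma / n**0.5

trials = 10_000
covered = 0
for _ in range(trials):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    covered += (xbar - half_width < mu < xbar + half_width)

print(covered / trials)   # ≈ 0.95: about 95% of intervals contain the true mu
```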
The discussion to this point in the present section has been our
motivation for testing hypotheses; now we formalise some ideas.
By a statistical hypothesis we mean a statement about the value of a
specified parameter in a probability model. For example, based on
the probability model described by
given, apart from that in the sample data, regarding the direction of the departure from the null hypothesis. In different circumstances other possible alternative hypotheses are the one-sided hypotheses $H_1: \mu > 1000$ or $H_1: \mu < 1000$. Assessment of the null hypothesis is then made using the observed value of a suitable test statistic constructed from the random sample. For our purposes this is just a standardised version of an estimator of the relevant parameter – in our case $Z = \sqrt{n}(\bar{X}_n - \mu_X)/\sigma_X$. Based on the observed test statistic, $z$, we determine the P-value, this being the probability of obtaining a value of the test statistic at least as extreme as that observed. In determining the P-value we use one or both tails of the distribution, depending on whether the alternative hypothesis is one-sided ($P(Z > z)$ or $P(Z < z)$) or two-sided ($P(|Z| > z)$), respectively.
H0 : p = 0.05
H1 : p > 0.05
$$\frac{\bar{X}_n - 0.05}{\sqrt{0.05(1-0.05)/1000}} \sim N(0,1)$$
because the sample size is large enough for the CLT to apply. The observed value of the test statistic is
$$\frac{0.07 - 0.05}{\sqrt{0.05(1-0.05)/1000}} = 2.9019.$$
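Restating this computation in code (a sketch; the one-sided P-value comes from the standard normal c.d.f. as described above):

```python
import math
from statistics import NormalDist

p0, p_hat, n = 0.05, 0.07, 1000          # H0 value, observed proportion, rounds

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)        # one-sided: P(Z > z)

print(z, p_value)                        # z ≈ 2.9019, P-value ≈ 0.0019
```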
$H_0: \mu_X = 50$
$H_1: \mu_X \ne 50$
The test is two-sided because we do not have any extra information to specify which way we should test.
$$\frac{\bar{X}_n - 50}{\sqrt{16/25}} \sim N(0,1).$$
The observed value of the test statistic is
$$\frac{51.3 - 50}{4/5} = 1.625.$$
The P-value is $P(|Z| > 1.625) = 2(1 - F_Z(1.625)) \approx 0.104$, where $Z \sim N(0,1)$. See the figure on the right for an illustration of this P-value.
[Figure: the p.d.f. of $\bar{X}_n$ if the propellant is burning according to specification. The observed estimate is indicated with a black dot and does not lie far out in the tail of the p.d.f. The probability of lying further out in the tail in either direction is the shaded area.]
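The two-sided P-value can be verified directly (a minimal sketch using the numbers above):

```python
from statistics import NormalDist

z = (51.3 - 50) / (4 / 5)                # observed test statistic, = 1.625
p_value = 2 * (1 - NormalDist().cdf(z))  # two-sided: P(|Z| > 1.625)

print(z, p_value)                        # ≈ 0.104: weak evidence against H0
```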
1. To reject H0 does not mean that the null hypothesis is false, only that the data shows sufficient evidence to cast doubt on H0. To not reject H0 does not mean it is true, only that the data shows insufficient evidence against H0.
10.4.3 The standard normal c.d.f.: the table gives $F_Z(z) = P(Z \le z)$; the row gives $z$ to one decimal place and the column gives the second decimal place.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998