NUS ST2334 Lecture Notes
1 Introduction to Probability
1.1 Introduction
1.2 Probability Triples
1.2.1 Sample Space
1.2.2 σ-Fields
1.2.3 Probability
1.3 Conditional Probability
1.3.1 Bayes Theorem
1.3.2 Theorem of Total Probability
1.4 Independence
3 Introduction to Statistics
3.1 Introduction
3.2 Maximum Likelihood Estimation
3.2.1 Introduction
3.2.2 The Method
3.2.3 Examples of Computing the MLE
3.3 Hypothesis Testing
3.3.1 Introduction
3.3.2 Constructing Test Statistics
3.4 Bayesian Inference
3.4.1 Introduction
3.4.2 Bayesian Estimation
3.4.3 Examples
4 Miscellaneous Results
4.1 Set Theory
4.2 Derivatives and Functions
4.3 Summation
4.4 Exponential Function
4.5 Taylor Series
4.6 Integration Methods
4.7 Distributions
Course Information
Lecture times: This course is held from January 2015 to April 2015. Lectures are at 0800-1000 on Monday and Wednesday in LT25. The notes are available on my website: http://www.stat.nus.edu.sg/~staja/ (also IVLE). You should register as usual for tutorials; for manual registration, please email me with your name and matriculation number and I will attempt to put you into a convenient tutorial group.
Office Hour: My office hour is at 1500 on Wednesday in 04-02, Department of Statistics and Applied
Probability. I am not available at other times, unless there is a time-table clash. If you would like
to see me or have a question, please email me or your tutors. I will try to respond as quickly as I can.
Assessed Coursework: During the course there will be one assignment, which makes up 40% of the final grade. The date on which this assessment will be handed out will be announced at least two weeks beforehand, and you are given two weeks to complete it. The assessment is to be handed to me in person or at the statistics office, S16, level 7. Due to the number of students on this course, I do not accept assessments via e-mail (unless there are extreme circumstances, which would need to be verified by the department beforehand) or by student submission on IVLE (again, there are too many students for this). There is NO mid-term examination.
Exam: There is a 2 hour final exam of 4 questions. No calculators are allowed and the examination
is closed-book (with the exception of the below information). You will be given a formula sheet to
assist you in the examination (Tables 4.1 and 4.2) and you may bring a single handwritten A4 sheet to the examination with any information you deem useful; you may write on both sides of the A4.
Problem Sheets: There are also 10 non-assessed problem sheets available on my website (also
IVLE). Typed solutions will be available on my website.
Course Details: These notes are not sufficient to replace lectures. In particular, many examples
and clarifications are given during the class. This course investigates the following concepts: Basic
concepts of probability, conditional probability, independence, random variables, joint and marginal
distributions, mean and variance, some common probability distributions, sampling distributions,
estimation and hypothesis testing based on a normal population. There are three chapters of course
content, the first covering the foundations of probability. We then move on to random variables
and their distributions which features the main content of the course and is a necessary prerequisite
for further work in statistical modeling and randomized algorithms. The third chapter is a basic
introduction to statistics and gives some basic notions which would be used in statistical modeling.
The final chapter of the notes provides some mathematical background for the course, which you
are strongly advised to read in the first week. You are expected to know everything that is in this
chapter. In particular the notion of double summation and integration is used routinely in this course
and you should spend some time recalling these ideas. Some additional help on integration will be
given later on in the course.
References: The recommended reference for this course is [1] (Chapters 1-5). For a non-technical
introduction the book [2] provides an entertaining and intuitive look at probability.
Chapter 1
Introduction to Probability
1.1 Introduction
This Chapter provides a basic introduction to the notions that underlie probability. The level of the notes is below that required for complete mathematical rigor, but they do provide a technical development of these ideas. Essentially, the basic ideas of probability theory start with a probability triple: a sample space of outcomes (e.g. in flipping a coin, a head or a tail), a collection of events (e.g. tail, or tail and head) and a way to compare the likelihood of events via a probability distribution (the chance of obtaining a tail). Moving on from these notions, we consider the probability of events given that other events are known to have occurred (conditional probability) (e.g. the probability a coin lands tails, given we know it has two heads). Some events have probabilities which decouple in a special way and are called independent (for example, flipping a fair coin twice, the outcome of the first flip does not influence that of the second).
The structure of this Chapter is as follows: In Section 1.2, we discuss the idea of a probability
triple; in Section 1.3 conditional probability is introduced and in Section 1.4 we discuss independence.
1.2 Probability Triples

1.2.1 Sample Space

Example 1.2.2. Consider rolling a six-sided fair die, once. There are 6 possible outcomes, and thus Ω = {1, 2, . . . , 6}. We may be interested (for example, for betting purposes) in the following events:
1. we roll a 1, i.e. {1}
2. we roll an even number, i.e. {2, 4, 6}
3. we roll an even number less than 4, i.e. {2}
4. we roll an odd number, i.e. {1, 3, 5}
As a result, we think of events as subsets of the sample space Ω. Such events can be constructed by unions, intersections and complements of other events (we will formally explain why, below). For example, letting A = {2, 4, 6} in case (2) above, we immediately yield that the event (4) is A^c. Similarly, letting A be the event in (2) and B be the event of rolling a number less than 4, we obtain that event (3) is A ∩ B. For those of you that have forgotten basic set notations and terminologies, see Table1 1.1. In particular, we will think of Ω as the certain event (we must roll a 1, 2, . . . , 6) and its complement Ω^c = ∅ as the impossible event.
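The event algebra above can be mirrored directly with Python's built-in sets; the following is a small illustrative sketch (the variable names are mine, not part of the notes):

```python
# Events from Example 1.2.2 as Python sets; set algebra mirrors the
# complement and intersection constructions in the text.
omega = {1, 2, 3, 4, 5, 6}   # sample space for one roll of the die
A = {2, 4, 6}                # event (2): an even number
B = {1, 2, 3}                # event B: a number less than 4

event_4 = omega - A          # A^c, i.e. event (4)
event_3 = A & B              # A intersect B, i.e. event (3)

print(event_4, event_3)
```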
1.2.2 σ-Fields
Now we have a notion of an event; in particular, events are subsets of Ω. A particular question of interest is then: are all subsets of Ω events? The answer actually turns out to be no, but the technical reasons for this lie far out of the scope of this course. We will content ourselves to use a particular collection of sets F (a set of sets) of Ω which contains all the events that we can make probability statements about. This collection of sets is a σ-field.
Definition 1.2.3. A collection of sets F of Ω is called a σ-field if it satisfies the following conditions:
1. ∅ ∈ F
2. If A_1, A_2, . . . ∈ F then ∪_{i=1}^{∞} A_i ∈ F
3. If A ∈ F then A^c ∈ F
It can be shown that σ-fields are closed under countable intersections (i.e. if A_1, A_2, . . . ∈ F then ∩_{i=1}^{∞} A_i ∈ F). Whilst it may seem quite abstract, it will turn out that, for this course, all the sets we consider will lie in the σ-field F.
Example 1.2.4. Consider flipping a single coin; letting H denote a head and T a tail, then:
Ω = {H, T} and F = {∅, Ω, {H}, {T}}.
Thus, to summarize: so far, (Ω, F) is a sample space and a σ-field. The former is the collection of all possible outcomes and the latter is a collection of sets on Ω (the events) that follow Definition 1.2.3. Our next objective is to define a way of assigning a likelihood to each event.
1 For those of you that have forgotten set theory, Section 4.1 has a refresher
1.2.3 Probability
We now introduce a way to assign a likelihood to an event, via probability. One possible interpretation is the following: suppose my experiment is repeated many times; then the probability of any event is the limit of the ratio of the number of times the event occurs to the number of repetitions. We note that this is not the only interpretation of probability, but we do not diverge into a discussion of the philosophy of probability. Formally, we introduce the notion of a probability measure:
Definition 1.2.6. A probability measure P on (Ω, F) is a function P : F → [0, 1] which satisfies:
1. P(Ω) = 1 and P(∅) = 0
2. For A_1, A_2, . . . disjoint (i ≠ j, A_i ∩ A_j = ∅) members of F,
P(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i).
The triple (Ω, F, P), comprising a set Ω, a σ-field F of subsets of Ω and a probability measure P, is called a probability space (or probability triple).
The idea is to associate the probability space with an experiment:
Example 1.2.7. Consider flipping a coin as in Example 1.2.4, with Ω = {H, T} and F = {∅, Ω, {H}, {T}}. Then we can set:
P({H}) = p, P({T}) = 1 − p, p ∈ [0, 1].
If p = 1/2 we say that the coin is fair.
Example 1.2.8. Consider rolling a 6-sided die; then Ω = {1, . . . , 6} and F = 2^Ω (the power set of Ω; the set of all subsets of Ω). Define a probability measure on (Ω, F) as:
P(A) = Σ_{i∈A} p_i, A ∈ F
where, for i ∈ {1, . . . , 6}, 0 ≤ p_i ≤ 1 and Σ_{i=1}^{6} p_i = 1. If p_i = 1/6, i ∈ {1, . . . , 6}, then:
P(A) = Card(A)/6
where Card(A) is the cardinality of A (the number of elements in the set A).
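Concretely, this measure can be coded with exact rational arithmetic (a small sketch, not part of the notes):

```python
from fractions import Fraction

# p_i = 1/6 for each face of a fair die; P(A) = sum of p_i over i in A.
p = {i: Fraction(1, 6) for i in range(1, 7)}

def prob(A):
    """P(A) for an event A, a subset of {1, ..., 6}."""
    return sum(p[i] for i in A)

assert prob({2, 4, 6}) == Fraction(1, 2)   # Card(A)/6 = 3/6
assert prob(set(range(1, 7))) == 1         # P(Omega) = 1
assert prob(set()) == 0                    # P(empty set) = 0
```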
We now consider a sequence of results on probability spaces.
Lemma 1.2.9. We have the following properties on a probability space (Ω, F, P):
1. For any A ∈ F, P(A^c) = 1 − P(A).
2. For any A, B ∈ F, if A ⊆ B then P(B) = P(A) + P(B \ A) ≥ P(A).
3. For any A, B ∈ F, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
4. (inclusion-exclusion formula) For any A_1, . . . , A_n ∈ F:
P(∪_{i=1}^{n} A_i) = Σ_i P(A_i) − Σ_{i<j} P(A_i ∩ A_j) + Σ_{i<j<k} P(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n+1} P(∩_{i=1}^{n} A_i)
where, for example, Σ_{i<j} = Σ_{j=1}^{n} Σ_{i=1}^{j−1}, etc.
Now (A ∩ B) ⊆ B, so by (2)
P(B \ (A ∩ B)) = P(B) − P(A ∩ B)
and thus, as A ∪ B is the disjoint union of A and B \ (A ∩ B),
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
For (4), this can be proved by induction and is an exercise.
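The inclusion-exclusion formula can be sanity-checked by brute-force enumeration; the sketch below (the three events are arbitrary choices of mine) verifies the n = 3 case on the fair die:

```python
from fractions import Fraction
from itertools import combinations

# Brute-force check of inclusion-exclusion (n = 3) on the fair die.
omega = set(range(1, 7))
A = [{1, 2, 3}, {2, 4, 6}, {3, 4, 5}]

def P(E):
    return Fraction(len(E), len(omega))   # uniform measure

def intersection(sets):
    out = omega
    for s in sets:
        out = out & s
    return out

lhs = P(A[0] | A[1] | A[2])
# Alternating sum over all k-fold intersections, k = 1, ..., n.
rhs = sum((-1) ** (k + 1) * sum(P(intersection(c)) for c in combinations(A, k))
          for k in range(1, len(A) + 1))
assert lhs == rhs
```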
To conclude the Section, we introduce two terms:
An event A ∈ F is null if P(A) = 0.
An event A ∈ F occurs almost surely if P(A) = 1.
Null events are not impossible. We will see this when considering random variables which take values
in the real line.
1.3 Conditional Probability

Definition 1.3.1. Consider a probability space (Ω, F, P) and let A, B ∈ F with P(B) > 0. The conditional probability of A given B is
P(A|B) := P(A ∩ B)/P(B).
Example 1.3.2. A family has two children of different ages. What is the probability that both children are boys, given that at least one is a boy? The older and younger child may each be a boy or a girl, so the sample space is:
Ω = {GG, BB, GB, BG}
and we assume that all outcomes are equally likely, P(GG) = P(BB) = P(GB) = P(BG) = 1/4 (the uniform distribution). We are interested in A = {BB} (both boys) given B = {BB, GB, BG} (at least one boy):
P(A|B) = P(A ∩ B)/P(B) = (1/4)/(3/4) = 1/3.
One can also ask the question: what is the probability that in a family of two children, where the younger of the two is a boy, both are boys? Writing C for the event that the younger child is a boy, we want:
P(A|C) = P(A ∩ C)/P(C) = (1/4)/(2/4) = 1/2.
1.3.1 Bayes Theorem

For A, B ∈ F with P(A) > 0 and P(B) > 0, Bayes' Theorem states that
P(B|A) = P(A|B)P(B)/P(A).
1.3.2 Theorem of Total Probability

Lemma 1.3.5. Consider a probability space (Ω, F, P) and, for any fixed n ≥ 2, let B_1, . . . , B_n ∈ F be a partition of Ω, with P(B_i) > 0, i ∈ {1, . . . , n}. Then, for any A ∈ F:
P(A) = Σ_{i=1}^{n} P(A|B_i)P(B_i).
Proof. We give the proof for n = 2, the other cases being very similar. We have A = (A ∩ B_1) ∪ (A ∩ B_2) (recall B_1 ∩ B_2 = ∅, and B_2 = B_1^c), so
P(A) = P(A ∩ B_1) + P(A ∩ B_2) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2).
Example 1.3.6. Consider two factories which manufacture a product. If the product comes from factory I, it is defective with probability 1/5, and if it is from factory II, it is defective with probability 1/20. It is twice as likely that a product comes from factory I. What is the probability that a given product operates properly (i.e. is not defective)? Let A denote the event that it operates properly and B denote the event that the product is made in factory I. By the theorem of total probability:
P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) = (4/5)(2/3) + (19/20)(1/3) = 51/60 = 17/20.
One can also ask: given that a product is defective, what is the probability that it came from factory I? By Bayes' Theorem:
P(B|A^c) = P(A^c|B)P(B)/P(A^c) = (1/5)(2/3)/(9/60) = 8/9.
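The two computations in this example can be reproduced with exact arithmetic (a sketch using Python's fractions):

```python
from fractions import Fraction

P_B = Fraction(2, 3)             # product from factory I (twice as likely)
P_Bc = Fraction(1, 3)            # product from factory II
P_def_given_B = Fraction(1, 5)   # P(defective | factory I)
P_def_given_Bc = Fraction(1, 20) # P(defective | factory II)

# Theorem of total probability for the defective event:
P_def = P_def_given_B * P_B + P_def_given_Bc * P_Bc
P_ok = 1 - P_def                               # P(operates properly)

# Bayes' Theorem: factory I, given the product is defective:
P_B_given_def = P_def_given_B * P_B / P_def

assert P_ok == Fraction(17, 20)
assert P_B_given_def == Fraction(8, 9)
```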
1.4 Independence
The idea of independence is roughly that the probability of an event A is unaffected by the occurrence of a (non-null) event B; that is, P(A) = P(A|B), when P(B) > 0. More formally:
Definition 1.4.1. Consider a probability space (Ω, F, P) and let A, B ∈ F. A and B are independent if
P(A ∩ B) = P(A)P(B).
More generally, a family of sets A_1, . . . , A_n ∈ F (∞ > n ≥ 2) are independent if
P(∩_{i=1}^{n} A_i) = Π_{i=1}^{n} P(A_i).
A family is pairwise independent if P(A_i ∩ A_j) = P(A_i)P(A_j) for every i ≠ j. This does not necessarily mean that A_1, . . . , A_n are independent events. Another important concept is conditional independence; given an event C ∈ F, with P(C) > 0, A, B ∈ F are conditionally independent events if
P(A ∩ B|C) = P(A|C)P(B|C).
This can be extended to families of sets A_i given C.
Example 1.4.2. Consider a pack of playing cards, and one chooses a card completely at random
(i.e. no card counting etc). One can assume that the probability of choosing a suit (e.g. spade) is
independent of its rank. So, for example:
P(spade ∩ king) = P(spade)P(king) = (13/52)(4/52) = 1/52.
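This independence can be confirmed by enumerating the deck (a small sketch; rank 13 stands for the king):

```python
from fractions import Fraction
from itertools import product

# A 52-card deck as (rank, suit) pairs drawn uniformly at random.
deck = list(product(range(1, 14), ["spade", "heart", "diamond", "club"]))

def P(predicate):
    return Fraction(sum(1 for card in deck if predicate(card)), len(deck))

p_spade = P(lambda c: c[1] == "spade")          # 13/52
p_king = P(lambda c: c[0] == 13)                # 4/52
p_spade_king = P(lambda c: c == (13, "spade"))  # 1/52

assert p_spade_king == p_spade * p_king == Fraction(1, 52)
```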
Chapter 2
2.1 Introduction
Throughout the Chapter we assume that there is a probability space (, F , P), but its presence is
minimized from the discussion. This is the main Chapter of the course and we focus upon random
variables and their distribution (Section 2.2). In particular, a random variable is (essentially) just a
map from Ω to some subset of the real line. Once we are there, we introduce notions of probability
through distribution functions. Sometimes the random variables take values on a countable space
(possibly countably infinite) and we call such random variables discrete (Section 2.3). The proba-
bilities of such random variables are linked to probability mass functions and we use this concept
to revisit independence and conditional probability. We also consider expectation (the theoretical
average) and conditional expectation for discrete random variables. These ideas are revisited when
the random variables are continuous (Section 2.4). The Chapter is concluded when we discuss the
convergence of random variables (Section 2.5) and, in particular, the famous central limit theorem.
A distribution function has the following properties, which we state without proof. See [1] for
the proof. The lemma characterizes a distribution function: a function F is a distribution function
for some random variable if and only if it satisfies the three properties in the following lemma.
Lemma 2.2.4. A distribution function F has the following properties:
1. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1,
2. if x < y then F(x) ≤ F(y) (F is non-decreasing),
3. F is right-continuous: F(x + h) → F(x) as h ↓ 0.
For example, for a single flip of a coin (Example 1.2.4) one can define a random variable X as the number of heads:
X(H) = 1, X(T) = 0.
We remark that
F(x) = Σ_{i: x_i ≤ x} f(x_i),  f(x) = F(x) − lim_{y↑x} F(y)
which provides an association between the distribution function and the PMF. We have the following result, whose proof follows easily from the above definitions and results.
Lemma 2.3.4. A PMF satisfies:
1. the set of x such that f(x) ≠ 0 is countable,
2. Σ_{x∈X} f(x) = 1.
Example 2.3.5. A coin is flipped n times and the probability one obtains a head is p ∈ (0, 1); Ω = {H, T}^n. Let X denote the number of heads, which takes values in the set X = {0, 1, . . . , n} and is thus a discrete random variable. Consider x ∈ X; exactly (n choose x) points in Ω give us x heads and each such point occurs with probability p^x (1 − p)^{n−x}; hence
f(x) = (n choose x) p^x (1 − p)^{n−x},  x ∈ X.
The random variable X is said to have a Binomial distribution and this is denoted X ~ B(n, p). Note that
(n choose x) = n! / ((n − x)! x!)
with x! = x(x − 1) · · · 1 and 0! = 1.
Example 2.3.6. Let λ > 0 be given. A random variable X that takes values on X = {0, 1, 2, . . . } is said to have a Poisson distribution with parameter λ (denoted X ~ P(λ)) if its PMF is:
f(x) = λ^x e^{−λ} / x!,  x ∈ X.
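Both PMFs are easy to code and check (a sketch; the parameter values are arbitrary choices of mine):

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """f(x) = (n choose x) p^x (1 - p)^(n - x), x in {0, ..., n}."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """f(x) = lam^x e^(-lam) / x!, x in {0, 1, ...}."""
    return lam ** x * exp(-lam) / factorial(x)

# Lemma 2.3.4 part 2: each PMF sums to 1 (the Poisson sum is truncated,
# so the check is only approximate).
assert abs(sum(binomial_pmf(x, 10, 0.3) for x in range(11)) - 1.0) < 1e-12
assert abs(sum(poisson_pmf(x, 2.5) for x in range(100)) - 1.0) < 1e-12
```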
2.3.2 Independence
We now extend the notion of independence of events to the domain of random variables. Recall that
events A and B are independent if the joint probability is equal to the product (A does not affect
B).
Definition 2.3.7. Discrete random variables X and Y are independent if the events {X = x} and {Y = y} are independent for each (x, y) ∈ X × Y.
To understand this idea, let X = {x_1, x_2, . . . }, Y = {y_1, y_2, . . . }; then X and Y are independent if and only if the events A_i = {X = x_i}, B_j = {Y = y_j} are independent for every possible pair i, j.
Example 2.3.8. Consider flipping a coin once, which lands tail with probability p ∈ (0, 1). Let X be the number of heads seen and Y the number of tails. Then:
P(X = Y = 1) = 0
and
P(X = 1)P(Y = 1) = (1 − p)p ≠ 0
so X and Y cannot be independent.
A useful result (which we do not prove) that is worth noting is the following.
Theorem 2.3.9. If X and Y are independent random variables and g : X R, h : Y R, then
the random variables g(X) and h(Y ) are also independent.
In full generality1, consider a sequence of discrete random variables X_1, X_2, . . . , X_n, X_i ∈ X_i; they are said to be independent if the events {X_1 = x_1}, . . . , {X_n = x_n} are independent for every possible (x_1, . . . , x_n) ∈ X_1 × · · · × X_n. That is:
P(X_1 = x_1, . . . , X_n = x_n) = Π_{i=1}^{n} P(X_i = x_i),  (x_1, . . . , x_n) ∈ X_1 × · · · × X_n.
1 this point will not make too much sense, until we see Section 2.3.4
2.3.3 Expectation
Throughout your statistical training, you may have encountered the notion of an average or mean
value of data. In this Section we consider the idea of the average value of a random variable, which
is called the expected value.
Definition 2.3.10. The expected value of a discrete random variable X on X, with PMF f, denoted E[X], is defined as
E[X] := Σ_{x∈X} x f(x).
Example 2.3.11. Let X ~ P(λ); then
E[X] = Σ_{x=0}^{∞} x λ^x e^{−λ} / x!
= λ e^{−λ} Σ_{x=1}^{∞} λ^{x−1} / (x − 1)!
= λ e^{−λ} e^{λ} = λ
where we have used the Taylor series expansion for the exponential function on the third line.
We now look at the idea of an expectation of a function of a random variable:
Lemma 2.3.12. Given a discrete random variable X on X, with PMF f and g : X → R:
E[g(X)] = Σ_{x∈X} g(x) f(x).
Example 2.3.13. Let X ~ P(λ). Differentiating the identity e^λ = Σ_{x=0}^{∞} λ^x / x! twice with respect to λ gives Σ_{x=0}^{∞} x(x − 1) λ^{x−2} / x! = e^λ, so that Σ_{x=0}^{∞} x(x − 1) λ^x e^{−λ} / x! = λ². Hence
E[X²] = E[X(X − 1)] + E[X] = λ² + λ
where we have assumed that it is legitimate to swap differentiation and summation (it turns out that this is true here).
An important concept is the moment generating function:
Definition 2.3.14. For a discrete random variable X the moment generating function (MGF) is
M(t) = E[e^{Xt}] = Σ_{x∈X} e^{xt} f(x),  t ∈ T
where T is the set of t for which Σ_{x∈X} e^{xt} f(x) < ∞.
For example, if X ~ P(λ), then
M(t) = Σ_{x=0}^{∞} e^{xt} λ^x e^{−λ} / x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x / x! = exp{λ(e^t − 1)},  t ∈ R.
Then
M′(0) = λ.
An important special case of functions of random variables are:
Definition 2.3.17. Given a discrete random variable X on X, with PMF f and k ∈ Z_+, the k-th moment of X is
E[X^k]
and the k-th central moment of X is
E[(X − E[X])^k].
Of particular importance is the idea of the variance, Var[X] := E[(X − E[X])²], the second central moment. Since
(x − E[X])² f(x) ≥ 0,  x ∈ X
it follows that Var[X] ≥ 0: variances cannot be negative.
Example 2.3.18. Returning to the previous Poisson examples, when X ~ P(λ), we have shown that
E[X²] = λ² + λ,  E[X] = λ.
Thus Var[X] = E[X²] − E[X]² = λ.
Exercise 2.3.19. Compute the mean and variance for a Binomial distribution B(n, p) random variable. [Hint: consider the differentiation, w.r.t. x, of the equality
Σ_{k=0}^{n} (n choose k) x^k = (1 + x)^n.]
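The answers to this exercise can be checked numerically; the sketch below (with arbitrary n and p) compares direct PMF sums with the standard results E[X] = np and Var[X] = np(1 − p):

```python
from math import comb

# Direct computation of the first two moments of B(n, p) from the PMF.
n, p = 12, 0.3
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

mean = sum(x * f for x, f in enumerate(pmf))
second_moment = sum(x ** 2 * f for x, f in enumerate(pmf))
variance = second_moment - mean ** 2

assert abs(mean - n * p) < 1e-9
assert abs(variance - n * p * (1 - p)) < 1e-9
```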
We now state a Theorem, whose proof we do not give. In general, it is simple to establish, once
the concept of a joint distribution has been introduced; we have not done this, so we simply state
the result.
Theorem 2.3.20. The expectation operator has the following properties:
1. if X ≥ 0, then E[X] ≥ 0
2. if a, b ∈ R then E[aX + bY] = aE[X] + bE[Y]
3. if X = c ∈ R always, then E[X] = c.
An important result that we will use later on and whose proof is omitted is as follows.
Lemma 2.3.21. If X and Y are independent then E[XY ] = E[X]E[Y ].
A notion of (linear) dependence is correlation. This will be discussed in detail later on.
Definition 2.3.22. X and Y are uncorrelated if E[XY ] = E[X]E[Y ].
It is important to remark that random variables that are independent are uncorrelated. However,
uncorrelated variables are not necessarily independent; we will explore this idea later on.
We end the section with some useful properties of the variance operator.
Theorem 2.3.23. For random variables X and Y:
1. For a ∈ R, Var[aX] = a² Var[X],
2. For X, Y uncorrelated, Var[X + Y] = Var[X] + Var[Y].
Proof. For 1. we have:
Var[aX] = E[(aX)²] − E[aX]²
= a² E[X²] − a² E[X]² = a² Var[X]
where we have used Theorem 2.3.20 2. in the second line.
Now for 2. we have:
Var[X + Y] = E[(X + Y − E[X + Y])²]
= E[X² + Y² + 2XY + E[X + Y]² − 2(X + Y)E[X + Y]]
= E[X² + Y² + 2XY + E[X]² + E[Y]² + 2E[X]E[Y]] − 2E[X + Y]²
= E[X²] + E[Y²] + 4E[X]E[Y] + E[X]² + E[Y]² − 2(E[X]² + E[Y]² + 2E[X]E[Y])
= E[X²] + E[Y²] − E[X]² − E[Y]² = Var[X] + Var[Y]
where we have repeatedly used Theorem 2.3.20 2. and, in the fourth line, E[XY] = E[X]E[Y] (X and Y are uncorrelated).
2.3.4 Dependence
As we saw in the previous Section, there is a need to define distributions on more than one random variable. We will do that in this Section. We start with the following definition (which one can easily generalize to a collection of n ≥ 1 discrete random variables):
Definition 2.3.24. The joint distribution function F : R² → [0, 1] of X, Y, where X and Y are discrete random variables, is given by
F(x, y) = P({X ≤ x} ∩ {Y ≤ y}).
Their joint mass function f : R² → [0, 1] is given by
f(x, y) = P({X = x} ∩ {Y = y}).
Table 2.1: The Joint Mass Function in Example 2.3.26. The row totals are the marginal mass
function on X (and sum to 1) and respectively the column totals are the marginal mass function on
Y.
In general, it may be that the random variables are defined on a space Z which may not be decomposable into a cartesian product X × Y. In such scenarios we write the joint support as simply Z and omit X and Y; in this scenario, we will use the notation Σ_x or Σ_y to denote sums over the supports induced by Z. This concept will be clarified during a reading of the subsequent text.
Of particular importance are the marginal PMFs of X and Y:
f(x) = Σ_y f(x, y),  f(y) = Σ_x f(x, y).
Example 2.3.25. Consider Theorem 2.3.20 2. and for simplicity suppose Z = X × Y. Now, we have
E[aX + bY] = Σ_{x∈X} Σ_{y∈Y} (ax + by) f(x, y)
= Σ_{x∈X} ax Σ_{y∈Y} f(x, y) + Σ_{y∈Y} by Σ_{x∈X} f(x, y)
= a Σ_{x∈X} x f(x) + b Σ_{y∈Y} y f(y)
= aE[X] + bE[Y].
Example 2.3.26. Suppose X = {1, 2, 3} and Y = {−1, 0, 2}; then an example of a joint PMF can be found in Table 2.1. From the table, we have:
E[XY] = Σ_{x∈X} Σ_{y∈Y} x y f(x, y) = 29/18
(just sum the 9 values in the table, multiplying each time by the product of the associated x and y).
Similarly
E[X] = 1 · (6/18) + 2 · (5/18) + 3 · (7/18) = 37/18
and
E[Y] = −1 · (3/18) + 0 · (7/18) + 2 · (8/18) = 13/18.
We now formalize independence in a result, which provides us a more direct way to ascertain if
two random variables X and Y are independent.
Lemma 2.3.27. The discrete random variables X and Y are independent if and only if
f(x, y) = f(x)f(y),  (x, y) ∈ X × Y.
More generally, X and Y are independent if and only if f(x, y) factorizes into the product g(x)h(y), with g a function only of x and h a function only of y.
Example 2.3.28. Consider the joint PMF:
f(x, y) = (λ^x e^{−λ} / x!)(λ^y e^{−λ} / y!),  X = Y = {0, 1, . . . }.
Now clearly
f(x) = λ^x e^{−λ} / x!, x ∈ X,  f(y) = λ^y e^{−λ} / y!, y ∈ Y.
It is also clear via Lemma 2.3.27 that the random variables X and Y are independent and identically distributed (and Poisson distributed).
As in the case of a single variable, we are interested in the expectation of a functional of two
random variables (strictly, in the proof of Theorem 2.3.23 we have already used the following result):
Lemma 2.3.29. E[g(X, Y)] = Σ_{(x,y)∈X×Y} g(x, y) f(x, y).
Recall that in the previous Section, we mentioned a notion of dependence called correlation. In
order to formally define this concept, we introduce first the covariance and then correlation.
Definition 2.3.30. The covariance between X and Y is
Cov[X, Y] := E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
and the correlation is
ρ(X, Y) := Cov[X, Y] / (Var[X] Var[Y])^{1/2}.
Returning to Example 2.3.26: E[X²] = 89/18 and E[Y²] = 35/18, so Var[X] = 89/18 − (37/18)² = 233/324 and Var[Y] = 35/18 − (13/18)² = 461/324, and
Cov[X, Y] = 29/18 − (37/18)(13/18) = 41/324.
Thus
ρ(X, Y) = 41/√107413.
Some useful facts about correlations that we do not prove: ρ(X, Y) ∈ [−1, 1], and |ρ(X, Y)| = 1 only when X and Y are linearly related. When we consider continuous random variables, an example of uncorrelated random variables that are not independent will be given.
Example 2.3.34. Suppose that Y ~ P(λ) and X|Y = y ~ B(y, p). Find the conditional probability mass function of Y|X = x. Note that the random variables lie in the space Z = {(x, y) : y ∈ {0, 1, . . . }, x ∈ {0, 1, . . . , y}}. We have
f(y|x) = f(x, y)/f(x) = f(x|y)f(y)/f(x),  (x, y) ∈ Z
which is a version of Bayes' Theorem for discrete random variables. Note that for (x, y) ∈ Z all the PMFs above are positive. Now for x ∈ {0, 1, . . . }
f(x) = Σ_{y=x}^{∞} (y choose x) p^x (1 − p)^{y−x} λ^y e^{−λ} / y!
= ((λp)^x e^{−λ} / x!) Σ_{y=x}^{∞} (λ(1 − p))^{y−x} / (y − x)!
= (λp)^x e^{−λp} / x!.
Thus, we have for y ∈ {x, x + 1, . . . }:
f(y|x) = [(y choose x) p^x (1 − p)^{y−x} λ^y e^{−λ} / y!] / [(λp)^x e^{−λp} / x!]
= (λ(1 − p))^{y−x} e^{−λ(1−p)} / (y − x)!.

Definition 2.3.35. The conditional expectation of Y given X = x is
E[Y|X = x] := Σ_y y f(y|x)
given that the conditional PMF is well-defined. We generally write E[Y|X] or E[Y|x].
An important result associated to conditional expectations is as follows:
Theorem 2.3.36. The conditional expectation satisfies:
E[E[Y|X]] = E[Y].
Example 2.3.37. Let us return to Example 2.3.34. Find E[X|Y], E[Y|X] and E[X]. From Exercise 2.3.19, you should have derived that if Z ~ B(n, p) then E[Z] = np. Then
E[X|Y = y] = yp.
Thus
E[X] = E[E[X|Y]] = pE[Y].
As Y ~ P(λ),
E[X] = λp.
From Example 2.3.34, f(y|x) = (λ(1 − p))^{y−x} e^{−λ(1−p)} / (y − x)!, y ∈ {x, x + 1, . . . }. Thus
E[Y|X = x] = Σ_{y=x}^{∞} y (λ(1 − p))^{y−x} e^{−λ(1−p)} / (y − x)! = x + λ(1 − p).
We end the Section with a result which can be very useful in practice.
Theorem 2.3.38. We have, for any g : R → R:
E[Y g(X)] = E[E[Y|X] g(X)].
Proof. We have
E[Y g(X)] = Σ_{(x,y)∈Z} y g(x) f(x, y)
= Σ_{(x,y)∈Z} y g(x) f(y|x) f(x)
= Σ_x g(x) [Σ_y y f(y|x)] f(x)
= E[E[Y|X] g(X)].
2.4 Continuous Random Variables

A random variable X is called continuous if its distribution function can be expressed as
F(x) = ∫_{−∞}^{x} f(u) du,  x ∈ R
where f : R → [0, ∞); the R.H.S. is the usual Riemann integral and we will assume that the R.H.S. is differentiable.
Definition 2.4.1. The function f is called the probability density function (PDF) of the con-
tinuous random variable X.
We note that, under our assumptions, f (x) = F 0 (x).
Example 2.4.2. One of the most commonly used PDFs is the Gaussian or normal distribution:
f(x) = (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)²},  x ∈ X = R
where μ ∈ R, σ² > 0. We use the notation X ~ N(μ, σ²).
One of the key points associated to PDFs is as follows. The numerical value f (x) does not
represent the probability that X takes the value x. The technical explanation goes far beyond the
mathematical level of this course, but perhaps an intuitive reason is simply that there are uncountably
infinite points in X (so assigning a probability to each point is seemingly impossible). In general, one assigns probability to sets of non-zero width. For example, let A = [a, b], −∞ < a < b < ∞; then one might expect:
P(X ∈ A) = ∫_A f(x) dx.
Indeed this holds true, but we are deliberately vague about this. We give the following result, which
is not proved and should be taken as true.
Lemma 2.4.3. If X has a PDF f, then
1. ∫_X f(x) dx = 1.
2. P(X = x) = 0 for each x ∈ X.
3. P(X ∈ [a, b]) = ∫_{a}^{b} f(x) dx, −∞ ≤ a < b ≤ ∞.
Example 2.4.4. Returning to Example 2.4.2, we have
P(X ∈ X) = ∫_{−∞}^{∞} (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)²} dx
= (1/√(2π)) ∫_{−∞}^{∞} e^{−u²/2} du = 1.
Here, we have used the substitution u = (x − μ)/σ to go to the second line and the fact that
∫_{−∞}^{∞} e^{−u²/2} du = √(2π).
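The substitution argument can be checked numerically; the sketch below (μ and σ are arbitrary choices of mine) approximates the integral of the N(μ, σ²) density with a midpoint rule:

```python
from math import exp, pi, sqrt

# Midpoint-rule check that the N(mu, sigma^2) density integrates to 1;
# the range [mu - 10 sigma, mu + 10 sigma] carries all but a negligible
# part of the mass.
mu, sigma = 1.0, 2.0

def f(x):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

n = 200_000
a, b = mu - 10 * sigma, mu + 10 * sigma
h = (b - a) / n
integral = sum(f(a + (i + 0.5) * h) for i in range(n)) * h
assert abs(integral - 1.0) < 1e-6
```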
2.4.2 Independence
To define an idea of independence for continuous random variables, we cannot use the one for discrete
random variables (recall Definition 2.3.7); the sets {X = x} and {Y = y} have zero probability and
are hence trivially independent. Thus we use a new definition for independence:
Definition 2.4.5. The random variables X and Y are independent if the events
{X ≤ x} and {Y ≤ y}
are independent for every x, y ∈ R.
2.4.3 Expectation
As for discrete random variables one can consider the idea of the average of a random variable. This
is simply brought about by replacing summations with integrations:
Definition 2.4.6. The expectation of a continuous random variable X with PDF f is given by
E[X] = ∫_X x f(x) dx.
For example, if X ~ N(μ, σ²), then
E[X] = ∫_{−∞}^{∞} x (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)²} dx
= ∫_{−∞}^{∞} (σu + μ)(1/√(2π)) e^{−u²/2} du
= μ.
Here, we have used the substitution u = (x − μ)/σ to go to the second line and the fact that
∫_{−∞}^{∞} (1/√(2π)) e^{−u²/2} du = 1
to go to the third.
Example 2.4.9. Consider the gamma density:
f(x) = (λ^α / Γ(α)) x^{α−1} e^{−λx},  x ∈ X = [0, ∞), α, λ > 0
where Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt. We use the notation X ~ G(α, λ) and note that if X ~ G(1, λ) then X ~ E(λ). Now
E[X] = (λ^α / Γ(α)) ∫_0^∞ x^α e^{−λx} dx
= (λ^α / (Γ(α) λ^{α+1})) ∫_0^∞ u^α e^{−u} du
= (1 / (Γ(α) λ)) Γ(α + 1) = α/λ
where we have used the substitution u = λx to go to the second line and Γ(α + 1) = αΓ(α) on the final line (from herein you may use that identity without proof).
We now state a useful technical result
Theorem 2.4.10. If X and g(X) (g : R → R) are continuous random variables, then
E[g(X)] = ∫_X g(x) f(x) dx.
We remark that all the extensions to expectations discussed in Section 2.3.3 can be extended to
the continuous case. In particular Definitions 2.3.17 and 2.3.22 can be imported to the continuous
case. In addition the results: Theorems 2.3.20 and 2.3.23 and Lemma 2.3.21 all can be extended. It
is assumed that this is the case from herein.
Example 2.4.11. Let us return to Example 2.4.7 (the exponential distribution E(λ)):
E[X²] = ∫_0^∞ x² λ e^{−λx} dx
= [−x² e^{−λx}]_0^∞ + 2 ∫_0^∞ x e^{−λx} dx
= 2/λ².
Thus, for an exponential random variable:
Var[X] = 2/λ² − (1/λ)² = 1/λ².
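These exponential moments can be checked numerically (a sketch; λ is an arbitrary choice of mine and the tail beyond 40/λ is negligible):

```python
from math import exp

# Midpoint-rule check of the first two moments of X ~ E(lam).
lam = 1.5

def f(x):
    return lam * exp(-lam * x)

n = 400_000
b = 40.0 / lam
h = b / n
m1 = m2 = 0.0
for i in range(n):
    x = (i + 0.5) * h
    m1 += x * f(x) * h
    m2 += x * x * f(x) * h

assert abs(m1 - 1 / lam) < 1e-6                  # E[X]   = 1/lam
assert abs(m2 - 2 / lam ** 2) < 1e-6             # E[X^2] = 2/lam^2
assert abs(m2 - m1 ** 2 - 1 / lam ** 2) < 1e-6   # Var[X] = 1/lam^2
```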
Example 2.4.12. Let us return to Example 2.4.2. Then
E[X²] = ∫_{−∞}^{∞} x² (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)²} dx
= ∫_{−∞}^{∞} (σu + μ)² (1/√(2π)) e^{−u²/2} du
= σ² ∫ u² (1/√(2π)) e^{−u²/2} du + μ² + 2σμ ∫ u (1/√(2π)) e^{−u²/2} du
= (σ²/√(2π)) {[−u e^{−u²/2}]_{−∞}^{∞} + ∫ e^{−u²/2} du} + μ²
= σ² + μ².
Here we have used ∫ u (1/√(2π)) e^{−u²/2} du = 0 and ∫ e^{−u²/2} du = √(2π). Thus, for a normal random variable:
Var[X] = σ².
Example 2.4.13. Let us return to Example 2.4.9. Now
E[X²] = (λ^α / Γ(α)) ∫_0^∞ x^{α+1} e^{−λx} dx
= (λ^α / (Γ(α) λ^{α+2})) ∫_0^∞ u^{α+1} e^{−u} du
= (1 / (Γ(α) λ²)) Γ(α + 2) = α(α + 1)/λ²
where we have used Γ(α + 2) = (α + 1)αΓ(α). Thus, for a gamma random variable:
Var[X] = α/λ².
To end this section we consider an important concept; the MGF for continuous random variables.
Definition 2.4.14. For a continuous random variable X the moment generating function (MGF) is
M(t) = E[e^{Xt}] = ∫_X e^{xt} f(x) dx,  t ∈ T
where T is the set of t for which ∫_X e^{xt} f(x) dx < ∞.
Exercise 2.4.15. Show that M^{(k)}(0) = E[X^k], k ∈ Z_+, assuming one may swap differentiation and integration.
As for discrete random variables, the moment generating function is a simple way to obtain
moments, if it is simple to differentiate M (t). Note also, that it can be proven that the MGF
uniquely characterizes a distribution.
Example 2.4.16. Suppose X ~ E(λ); then, supposing λ > t,
M(t) = λ ∫_0^∞ e^{−x(λ−t)} dx
= [−(λ/(λ − t)) e^{−x(λ−t)}]_0^∞
= λ/(λ − t).
Clearly
M′(t) = λ/(λ − t)²
and thus E[X] = M′(0) = 1/λ.
Example 2.4.17. Suppose X ~ N(μ, σ²); then
M(t) = ∫_{−∞}^{∞} (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)² + xt} dx
= ∫_{−∞}^{∞} (1/√(2πσ²)) exp{−(1/(2σ²))[(x − (μ + tσ²))² − (μ + tσ²)² + μ²]} dx
= exp{μt + σ²t²/2} ∫_{−∞}^{∞} (1/√(2πσ²)) exp{−(1/(2σ²))(x − (μ + tσ²))²} dx
= exp{μt + σ²t²/2}
where we have used a change-of-variables u = (x − (μ + tσ²))/σ to deal with the integral.
2.4.4 Dependence
Just as for discrete random variables, one can consider the idea of joint distributions for continuous
random variables.
Definition 2.4.18. The joint distribution function of X and Y is the function F : R² → [0, 1] given by
F(x, y) = P(X ≤ x, Y ≤ y).
Again, as for discrete random variables, one requires a PDF:
Definition 2.4.19. The random variables are jointly continuous with joint PDF f : R² → [0, ∞) if
F(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(u, v) du dv
or, equivalently,
f(x, y) = ∂²F(x, y)/∂x∂y.
Note that the R.H.S. of the first display is a double integral, but we will often write only one integral sign in such contexts.
As for discrete random variables, we can introduce the idea of marginal PDFs. Here, we take a
little longer to consider these ideas:
22
As Z x Z Z y Z
F (x) = f (u, y)dydu F (y) = f (x, u)dxdu
We remark that expectation is much the same for joint distributions (as for Section 2.4.3) of
continuous random variables:
Z Z Z
E[g(X, Y )] = g(x, y)f (x, y)dxdy = g(x, y)f (x, y)dxdy.
Z
We now turn to independence; we state the following result with no proof. If you are unconvinced,
simply take the below as a definition.
Theorem 2.4.21. The random variables X and Y are independent if and only if
or equivalently
f (x, y) = f (x)f (y).
Example 2.4.22. Let Z = (R+ )2 = X Y and
where we have used integration tables to obtain the integral3 . Thus, returning to (2.4.1):
1 1
P(X 2 + Y 2 1) = + log(1 + 2) = log(1 + 2).
2 2
Given the previous discussion, the notion of a conditional expectation for continuous random
variables: Z
E[g(Y )|X = x] = g(y)f (y|x)dy.
As for discrete-valued random variables, we have the following result, which is again not proved.
Theorem 2.4.30. Consider jointly continuous random variables X and Y with g, h : R R with
h(X), g(Y ) continuous. Then:
Z Z
E[h(X)g(Y )] = E[E[g(Y )|X]h(X)] = g(y)f (y|x)dy h(x)f (x)dx
M (t) = E[eY t ]
= E[E[eY t |X]]
2 2
= E[eXt+(1 t )/2
]
(12 t2 )/2
= e E[eXt ]
(12 t2 )/2 2 2
= e et+(2 t )/2
Here we have used Example 2.4.17 for the MGF of a normal distribution. Thus we can conclude that
Y N (, 12 + 22 ).
Now
P( y X y) = P(X y) P(X y)
= ( y) ( y)
= ( y) [1 ( y)]
= 2( y) 1.
Then
d 1 1
f (y) = F (y) = 0 ( y) = ey/2 yY
dy y 2y
that is Y G(1/2, 1/2). This is also called the chi-squared distribution on one degree of freedom.
We remark that Y and X are clearly not independent (if X changes so does Y and in a deterministic
manner). Now
E[XY ] = E[X 3 ] = 0.
In addition E[X] = 0 and E[Y ] = 1. So X and Y are uncorrelated, but they are not independent.
We now move onto a more general change of variable formula. Suppose X1 and X2 have a joint
density on Z and we set
(Y1 , Y2 ) = T (X1 , X2 )
where T : Z T where T is differentiable and invertible and T R2 . What is the joint density
function of (Y1 , Y2 ) on T? As T is invertible, we set
Then we define
T11 T21
T11 T21 T21 T11
y1 y1
J(y1 , y2 ) := T11 T21
=
y1 y2 y1 y2
y2 y2
to be the Jacobian of transformation. Then we have the following result, which simply follows from
the change of variable rule for integration:
Theorem 2.4.34. If (X1 , X2 ) have joint density f (x, y) on Z, then for (Y1 , Y2 ) = T (X1 , X2 ), with
T as described above, the joint density of (Y1 , Y2 ), denoted g is:
Let
(Y1 , Y2 ) = (X1 + X2 , X1 /X2 ).
To find the joint density of (Y1 , Y2 ) we first note that T = Z and that
Y Y Y1
1 2
(X1 , X2 ) = , .
1 + Y2 1 + Y2
Then the Jacobian is
y2 1
1+y2 1+y2
y1
J(y1 , y2 ) := = .
y1 y1
(1+y2 )2 (1+y 2)
2 (1 + y2 )2
Thus:
y1
g(y1 , y2 ) = 2 ey1 (y1 , y2 ) T.
(1 + y2 )2
One can check that indeed Y1 and Y2 are independent and, that the marginal densities are:
1
g(y1 ) = 2 y1 ey1 y1 R+ g(y2 ) = y 2 R+ .
(1 + y2 )2
28
Example 2.4.36. Suppose we have (X1 , X2 ) as in the case of Example 2.4.35, except we have the
mapping:
(Y1 , Y2 ) = (X1 , X1 + X2 ).
Now, clearly Y2 Y1 , so this transformation induces the support for (Y1 , Y2 ) as:
T = {(y1 , y2 ) : 0 y1 y2 , y2 R+ }.
Then
(X1 , X2 ) = (Y1 , Y2 Y1 )
and clearly J(y1 , y2 ) = 1; thus
Example 2.4.37. Let X1 and X2 be independent N (0, 1) random variables (Z = R2 ) and let
p
(Y1 , Y2 ) = (X1 , X1 + 1 2 X2 ) [1, 1]
Then T = Z and Y2 Y1
(X1 , X2 ) = Y1 , p .
1 2
Then the Jacobian is
1
12 = p 1
J(y1 , y2 ) := 1 .
0 1 2
12
Thus
1 1
g(y1 , y2 ) = p exp{ (y12 + (y2 y1 )2 /(1 2 )} (y1 , y2 ) T
2 1 2 2
1 1
= exp{ (y 2 + y22 2y1 y2 )} (y1 , y2 ) T.
2(1 2 ) 1
p
2 1 2
We write X Xn2 ; you will notice that also X G(n/2, 1/2), so that E[X] = n. Note that from
Table 4.2, we have that
1 1
M (t) = n/2
> t.
(1 2t) 2
We have the following result:
Proposition 2.4.38. Let X1 , . . . , Xn be independent and identically distributed (i.i.d.) standard
i.i.d.
normal random variables (i.e. Xi N (0, 1)). Let
Z = X12 + + Xn2 .
Then Z Xn2 .
29
Then T Tn .
Proof. We have
1 1 2 1 1
f (x, y) = e 2 x n/2 y n/21 e 2 y Z = R R+ .
2 2 (n/2)
We will use Theorem 2.4.34, with the transformation defined by
X
T = q
Y
n
S = Y
and then marginalize out S. The Jacobian of transformation is:
ps r
0 s
J(t, s) := t n = n.
2 sn
1
Then
1 s
f (t, s) = sn/21/2 exp{ [t2 /n + 1]} (t, s) R R+ .
2n2n/2 (n/2) 2
Then we have for t R
Z
1 s
f (t) = sn/21/2 exp{ [t2 /n + 1]}ds
2n2n/2 (n/2) 0 2
Z
1 1
=
t2 n+1
un/21/2 eu du
2n2n/2 (n/2) 0 ( 12 + 2n ) 2
1 1
= ([n + 1]/2)
t2 n+1
2n2n/2 (n/2) ( 12 + 2n ) 2
([n + 1]/2) t2 (n+1)/2
= 1+
n(n/2) n
30
and we conclude.
The last distribution that we consider is the standard F distribution on d1 , d2 > 0 degrees of
freedom: d +d
1 d d1 /2
1 d1 /21
d 1 1 2 2
f (x) = x 1 + x x X = R+
B( d21 , d22 ) d2 d2
where
d1 d2 ( d21 )( d22 )
B( , )= .
2 2 ( d1 +d
2 )
2
X/d1
F = .
Y /d2
X/d1
T =
Y /d2
S = Y
Then
1 d1 s d1 st d1 /21 d2d1 st d2 /21 s/2
f (t, s) = e 2 s e T = (R+ )2 .
2d1 /2+d2 /2 (d1 /2)(d2 /2) d2 d2
1 d d1 /2
1
g= .
2d1 /2+d2 /2 (d1 /2)(d2 /2) d2
Then, for t R+
Z
d1 /21
d1 +d2
1 1 d1 t
f (t) = gt s exp{s( [1 +
2 ])}ds
2 d2
Z0
1 d1 t (d1 +d2 )/2
= gtd1 /21 u(d1 +d2 )/21 eu ( [1 + ]) du
0 2 d2
1 d1 t (d1 +d2 )/2
= gtd1 /21 ( [1 + ]) ((d1 + d2 )/2)
2 d2
d +d
d1 +d2
d 1 1 2 2
= g((d1 + d2 )/2)2 2 td1 /21 1 + t
d2
d +d
1 d /2
d1 1 d1 /21
d1 1 2 2
= t 1+ t
B( d21 , d22 ) d2 d2
and we conclude.
31
with X1 , . . . , Xn independent and identically distributed with zero mean and unit variance. This
particular result is very useful in hypothesis testing, which we shall see later on. Throughout, we
will focus on continuous random variables, but one can extend this notion.
Now if X E(1): Z x
F (x) = eu du = 1 ex .
0
d
So Xn X, X E(1).
Example 2.5.3. Consider a sequence of random variables Xn X = [0, 1], n 1, with
sin(2nx)
Fn (x) = 1 .
2n
Then for any fixed x [0, 1]
lim Fn (x) = x.
n
lim Fn (x) = c
n
P
then Xn c.
Example 2.5.6. Consider a sequence of random variables Xn X = R+ , n 1, with
x n
Fn (x) = .
1+x
Then for any fixed x X
lim Fn (x) = 0.
n
P
Thus Xn 0.
We finish the Section with a rather important result, which we again do not prove. It is called
the weak law of large numbers:
Theorem 2.5.7. Let X1 , X2 , . . . be a sequence of independent and identically distributed random
variables with E[|X1 |] < . Then
n
1X P
Xi E[X1 ].
n i=1
Note that this result extends to functions; i.e. if g : R R with E[|g(X1 )|] < then
n
1X P
g(Xi ) E[g(X1 )].
n i=1
where we will suppose that g(x) 6= 0. Suppose we cannot calculate I. Consider any non-zero pdf f (x)
on R. Then Z
g(x)
I= f (x)dx
R (x)
f
and suppose that Z
g(x)
f (x)dx < .
R f (x)
Then by the weak law of large numbers, if X1 , . . . , Xn are i.i.d. with pdf f then
n
1 X g(Xi ) P
I.
n i=1 f (Xi )
This provides a justification for a numerical method (called the Monte Carlo method) to approximate
integrals.
33
where we have used the i.i.d. property and the MGF of a Gamma random variable (see Table 4.2).
Thus Zn G(a, b) for any n 1. However, from the CLT one can reasonably approximate Zn , when
n is large by a normal random variable with mean
a a
n =
bn b
and variance
a a
n = 2.
nb2 b
Introduction to Statistics
3.1 Introduction
In this final Chapter, we give a very brief introduction to statistical ideas. Here the notion is that
one has observed data from some sampling distribution the population distribution and one wishes
to infer what the properties of the population are on the basis of observed samples. In particular
we will be interested in estimating the parameters of sampling distributions such as the maximum
likelihood method (Section 3.2) as well as testing hypotheses about parameters (3.3). We end the
Chapter with an introduction to Bayesian statistics (Section 3.4) which is another alternative way
to estimate parameters, which is more complex, but much richer than the MLE method.
We call f (x1 , . . . , xn ) the likelihood of the data. As maximizing a function is equivalent to maxi-
mizing a monotonic increasing transformation of the function, we often work with the log-likelihood :
Xn
l (x1 , . . . , xn ) = log f (x1 , . . . , xn ) = log f (xi ) .
i=1
If is some continuous space (as it generally is for our examples) and = (1 , . . . , d ), then we can
compute the gradient vector:
l (x , . . . , x ) l (x1 , . . . , xn )
1 n
l (x1 , . . . , xn ) = ,...,
1 d
and we would like to solve, for (below 0 is the ddimensional vector of zeros)
l (x1 , . . . , xn ) = 0. (3.2.1)
34
35
The solution of this equation (assuming it exists) is a maximum if the hessian matrix is negative
definite: 2 l (x1 ,...,xn ) 2 l (x1 ,...,xn ) 2
12 1 2 l
(x1 ,...,xn )
1 d
H() := .. .. .. ..
.
. . . .
2 2 2
l (x1 ,...,xn ) l (x1 ,...,xn ) l (x1 ,...,xn )
d 1 d 2 2 d
If the d numbers 1 , . . . , d which solve|Id H()| = 0, with Id the d d identity matrix, are all
negative, then is a local maximum of l (x1 , . . . , xn ). If d = 1 then this just boils down to checking
whether the second derivative of the log-likelihood is negative at the solution of (3.2.1).
Thus in summary, the approach we employ is as follows:
1. Compute the likelihood f (x1 , . . . , xn ).
2. Compute the log-likelihood l (x1 , . . . , xn ) and its gradient vector l (x1 , . . . , xn ).
3. Solve l (x1 , . . . , xn ) = 0, with respect to , call this solution en (we are assuming there
is only one en ).
In general, point 3. may not be possible analytically (so for example, one can use Newtons method).
However, you will not be asked to solve l (x1 , . . . , xn ) = 0, unless there is an analytic solution.
Thirdly
n
1X
n + xi = 0
i=1
so,
n
en = 1
X
xi .
n i=1
Fourthly,
n
d2 l (x1 , . . . , xn ) 1 X
2
= 2 xi < 0
d i=1
for any > 0 (assuming there is at least one i such that xi > 0). Thus assuming there is at least
one i such that xi > 0
n
1X
n = xi .
n i=1
36
will converge in probability (see Theorem 2.5.7) to E[X1 ] = if our assumptions hold true. That is,
we recover the true parameter value; such a property is called consistency - we do not address this
issue further.
Example 3.2.3. Let X1 , . . . , Xn be i.i.d. E() random variables. Let us compute the MLE of = ,
given observations x1 , . . . , xn . First, we have
n
Y
f (x1 , . . . , xn ) = exi
i=1
n
X
= n exp{ xi }.
i=1
Thirdly
n
n X
xi = 0
i=1
so
n
en = ( 1
X
xi )1 .
n i=1
Fourthly,
d2 l (x1 , . . . , xn ) n
= 2 < 0.
d2
Thus
n
1 X 1
n = ( xi ) .
n i=1
Example 3.2.4. Let X1 , . . . , Xn be i.i.d. N (, 2 ) random variables. Let us compute the MLE of
(, 2 ) = , given observations x1 , . . . , xn . First, we have
n
Y 1 1
f (x1 , . . . , xn ) = exp{ (xi )2 }
i=1 2 2 2 2
n n
1 1 X
= exp{ (xi )2 }.
2 2 2 2 i=1
simultaneously for and 2 . Since (3.2.2) can be solved independently of (3.2.3), we have:
n
1X
en = xi .
n i=1
Thus substituting into (3.2.3), we have
n
1 X n
en )2 =
(xi
2( 2 )2 i=1 2 2
that is:
n
1X
en2 =
en )2 .
(xi
n i=1
Fourthly the hessian matrix is:
" Pn #
n (12 )2 i=1 (xi )
H() = Pn2 Pn .
(12 )2 i=1 (xi ) n 1
2( 2 )2 ( 2 )3 i=1 (xi )
2
Note that when en = (e en2 ) the off-diagonal elements are exactly 0, so when solving |I2 H(en )| =
n ,
0, we simply need to show that the diagonal elements are negative. Clearly for the first diagonal
element
n
2 <0
en
en )2 > 0 for at least one i (we do not allow the case
if (xi en2 , in that we assume this doesnt
occur). In addition
n
n 1 X n n n
2 2
2 3
en )2 =
(xi 2 2
2 2 = < 0.
2(e
n ) (e
n ) i=1 2(e
n ) (e
n ) n2 )2
2(e
Thus
n
1 X n n
1X 1 X 2
n = xi , xi xj .
n i=1
n i=1 n j=1
Then
1. Xn N (0, 2 /n).
2. (n 1)s2n / 2 Xn1
2
.
3. Xn and (n 1)s2n / 2 are independent random variables.
Note that as one might imagine Proposition 2.4.38 is used to prove Theorem 3.3.1. Note that one
can show that the Tn1 distribution has a symmetric pdf, around 0.
Now consider testing H0 : = 0 against the two-sided alternative H1 : 6= 0 (that is, we are
testing whether the population mean, the true mean of the data, is a particular value). Now we
know for any R that T () Tn1 , thus if the null hypothesis is true T (0 ) should be a random
variable that is consistent with a Tn1 random variable. To construct the rejection region of the
test, we must choose a confidence level, typically 95%. Then we want T (0 ) to lie in 95% of the
probability. However, this still does not tell us what the rejection region is; this is informed by the
alternative hypothesis, that H1 : 6= 0 ; which indicates that a value which is inconsistent with the
Tn1 distribution lies in each tail of the distribution. Thus the procedure is:
1. Compute T (0 ).
2. Decide upon your confidence level (1 )%. This defines the rejection region.
3. Compute the tvalues (t, t). These are the numbers in R such that the probability (under a
Tn1 random variable) of exceeding t is /2 and the probability of being less than t is /2.
4. If t < T (0 ) < t then we do not reject the null hypothesis, otherwise we reject the null
hypothesis.
A number of remarks are in order. First, a test can only disprove a null hypothesis. The fact
that we do not reject the null on the basis of the sample evidence does not mean that the hypothesis
is true.
Second, it is useful to distinguish between two types of errors that can be made:
(n 1)s2X,n /X
2 2
Xn1
39
and independently:
(m 1)s2Y,m /Y2 Xm1
2
.
2
Now suppose that we want to test H0 : X = Y2 against H1 : X2
6= Y2 . Now if H0 is true, by
Theorem 3.3.1 and Proposition 2.4.40.
2
F (X ) := (n 1)s2X,n /X
2
/(n 1) (m 1)s2Y,m /X
2
/(m 1) = s2X,n /s2Y,m Fn1,m1
3. Compute the F values (f , f ). These are the numbers in R+ such that the probability (under
a Fn1,m1 random variable) of exceeding f is /2 and the probability of being less than f
is /2.
2
4. If f < F (X ) < f then we do not reject the null hypothesis, otherwise we reject the null
hypothesis.
Remark 3.3.2. All of our results concern Normal samples. However one can use the CLT (Theorem
2.5.9) to extend these tests to non-normal data. We do not follow this idea in this course.
The main key behind Bayesian statistics is the choice of a prior probability distribution for the
parameter . That is, Bayesian statisticians specify a probability distribution on the parameter
before the data are observed. This probability distribution is supposed to reflect the information
one might have before seeing the observations. To make this idea concrete, consider the following
example:
i.i.d.
Example 3.4.1. Suppose that we will observe data Xi | E(). Then one has to construct a
probability distribution on R+ . A possible candidate is G(a, b). If one has some prior beliefs
about the mean and variance of (say E[] = 10, Var[] = 10) then one determine what a and b
are.
40
For a Bayesian statistician, the posterior is the final answer, in that all statistical inference should
be associated to the posterior. For example, if one is interested in estimating then one can use the
posterior mean: Z
E[|x1 , . . . , xn ] = (|x1 , . . . , xn )d.
In addition, consider R (or indeed any univariate component of ), then we can compute a
confidence interval, which is called an credible interval in Bayesian statistics. The highest 95%-
posterior-credible (HPC) interval is the shortest region [, ], such that
Z
(|x1 , . . . , xn )d = 0.95.
3.4.3 Examples
Despite the fact that Bayesian inference is very challenging, there are still many examples where one
can do analytical calculations.
Example 3.4.2. Let us consider Example 3.4.1. Here we have that
Z n
X ba a1 b
f (x1 , . . . , xn ) = n exp{ xi } e d
0 i=1
(a)
Z n
ba X
= n+a1 exp{[ xi + b]}d
(a) 0 i=1
ba 1 n+a Z
= un+a1 eu du
(a) [ ni=1 xi + b]
P
0
ba 1 n+a
= Pn (n + a).
(a) [ i=1 xi + b]
So, as:
n
ba n+a1 X
f (x1 , . . . , xn |)() = exp{[ xi + b]}
(a) i=1
we have: Pn
n+a1 exp{[ i=1 xi + b]}
(|x1 , . . . , xn ) = n+a
P 1 (n + a)
[ n xi +b]
i=1
41
i.e.
n
X
|x1 , . . . , xn G(n + a, b + xi ).
i=1
Thus, the posterior distribution on is in the same family as the prior, except, with updated param-
eters, reflecting the data. So, for example:
n+a
E[|x1 , . . . , xn ] = Pn .
b + i=1 xi
In comparison to the MLE in Example 3.2.3, we see that the posterior mean and MLE correspond
as a, b 0.
i.i.d.
Example 3.4.3. Let Xi | P(), i {1, . . . , n}. Suppose the prior on is G(a, b). Then
Z P
n 1 ba a1 b
f (x1 , . . . , xn ) = i=1 xi exp{n} Qn e d
0 i=1 xi ! (a)
Z P
ba n
= Qn i=1 xi +a1 exp{[n + b]}d
(a) i=1 xi ! 0
ba 1 Pni=1 xi +a Z Pn
= Qn u i=1 xi +a1 eu du
(a) i=1 xi ! [n + b] 0
1 Pni=1 xi +a X n
ba
= Qn ( xi + a).
(a) i=1 xi ! [n + b] i=1
So as:
ba Pn
xi +a1
f (x1 , . . . , xn |)() = Qn i=1 exp{[n + b]}
(a) i=1 xi !
we have: Pn
xi +a1
i=1 exp{[n + b]}
(|x1 , . . . , xn ) = Pni=1 xi +a P
1 n
[n+b] ( i=1 xi + a)
i.e.
n
X
|x1 , . . . , xn G xi + a, n + b .
i=1
Thus, the posterior distribution on is in the same family as the prior, except, with updated param-
eters, reflecting the data. So, for example:
Pn
xi + a
E[|x1 , . . . , xn ] = i=1 .
n+b
In comparison to the MLE in Example 3.2.1, we see that the posterior mean and MLE correspond
as a, b 0.
i.i.d.
Example 3.4.4. Let Xi | N (, 1), i {1, . . . , n}. Suppose the prior on is N (, ). Then
Z n
1 n 1X 1 1
f (x1 , . . . , xn ) = exp{ (xi )2 } exp{ ( )2 }d
2 2 i=1
2 2
1 n 1 Z n
1X 1
= exp{ (xi )2 ( )2 }d.
2 2 2 i=1 2
To compute the integral, let us first manipulate the exponent inside the integral:
n n n
1X 1 1 1 X 2 X 2
(xi )2 ( )2 = [2 (n + ) 2( + xi ) + + xi ]
2 i=1 2 2 i=1 i=1
2 Pn
Let c(, , x1 , . . . , xn ) = 21 [ + i=1 x2i ], then
n n
1X 1 1 1 X
(xi )2 ( )2 = [2 (n + ) 2( + xi )] + c(, , x1 , . . . , xn )
2 i=1 2 2 i=1
Pn Pn
1 1 h ( + i=1 xi ) 2 ( + i=1 xi ) 2 i
= (n + ) ) ( + c(, , x1 , . . . , xn ).
2 n + 1 n + 1
42
2
+ n
P
( i=1 xi )
Let c0 (, , x1 , . . . , xn ) = c(, , x1 , . . . , xn ) + 21 (n + 1 )( 1
n+
, then we have
Pn
( + i=1 xi ) 2
1 n 1 Z
1 1
f (x1 , . . . , xn ) = exp{c0 (, , x1 , . . . , xn )} exp{ (n + ) ) }d
2 2 2 n + 1
1 n 1
0 2
= exp{c (, , x1 , . . . , xn )} .
2 2 (n + 1 )1/2
So as:
Pn
1 n 1
0 1 1 ( + i=1 xi ) 2
f (x1 , . . . , xn |)() = exp{c (, , x1 , . . . , xn )} exp{ (n+ ) ) }
2 2 2 n + 1
we have
+ n
P
( i=1 xi ) 2
exp{ 12 (n + 1 ) 1
n+
) }
(|x1 , . . . , xn ) =
2
1 1/2
(n+ )
i.e.
( + Pn x ) 1 1
i=1 i
|x1 , . . . , xn N , (n + )
n + 1
Thus, the posterior distribution on is in the same family as the prior, except, with updated param-
eters, reflecting the data. So, for example:
Pn
( + i=1 xi )
E[|x1 , . . . , xn ] = .
n + 1
In comparison to the MLE in Example 3.2.4, we see that the posterior mean and MLE correspond
as , for any fixed R.
We note that in all of our examples the posterior is in the same family as the prior. This is not
by chance; for every member of the exponential family, with parameter it is often possible to find
a prior which obeys this former property. Such priors are called conjugate priors.
Chapter 4
Miscellaneous Results
In the following Chapter we quote some results of use in the course. We do not cover this material
in lectures and is there for your convenience and revision. The following Sections give general facts
which should be taken as true. You can easily find the theory behind each of these results in any
undergraduate text in maths or statistics.
AB =BA A B = B A ASSOCIATIVITY
A (B C) = (A B) (A C) A (B C) = (A B) (A C) DISTRIBUTIVITY.
Note that the second rule holds when mixing intersection and union, e.g.:
A (B C) = (A B) (A C).
(A B)c = Ac B c (A B)c = Ac B c .
43
44
Also recall the product rule; for two functions f (x) and g(x), we have
d dg(x) df (x)
(f (x)g(x)) = f (x) + g(x) .
dx dx dx
In addition the chain rule; for two functions f (x) and g(x), let h(x) = f (g(x)), then
dh(x) df (g(x)) dg(x)
= .
dx dg(x) dx
For example, suppose f (x) = log(x) and g(x) = 2 + sin(x) and set h(x) = log(2 + sin(x)), then
df (g(x)) 1 dg(x)
= = cos(x)
dg(x) g(x) dx
where df /dg is the derivative of log(g). Hence
dh(x) 1 cos(x)
= cos(x) = .
dx g(x) 2 + sin(x)
Note that by log(x) I mean the inverse function of the exponential function, or natural logarithm:
log(ex ) = x.
Recall that log(ab) = log(a) + log(b) and log(ab ) = b log(a). Note that sin2 (x) + cos2 (x) = 1.
4.3 Summation
Pk2
Recall that x=k1 f (x), for some integer k1 k2 , with f : Z R, means
f (k1 ) + f (k1 + 1) + f (k1 + 2) + + f (k2 1) + f (k2 ).
For example k1 = 1, k2 = 3, f (x) = x, then
3
X
x = 1 + 2 + 3 = 6.
x=1
Some standard summations that you may have seen in the past are listed below:
1 X
= zk |z| < 1 GEOMETRIC
1z
k=0
Xzk
ez = z R EXPONENTIAL
k!
k=0
n
X n k
(1 + z)n = z n Z+ BINOMIAL
k
k=0
n
X n+k k
(1 + z)n = z n Z+ , |z| < 1 NEGATIVE BINOMIAL
k
k=0
X zk
log(1 z) = |z| < 1 LOGARITHMIC
k
k=1
X zk
log(1 + z) = (1)k+1 |z| < 1 LOGARITHMIC
k
k=1
45
A standard trick that is used in this course is the reversal of summation and differentiation. For
example, consider f : Z R, where R, then it is valid to assume in this course that:
k2 k2
dX X df (x, )
f (x, ) = .
d d
x=k1 x=k1
In words the derivative of the sum is equal to the sum of the derivatives. This trick is very useful for
computing expectations w.r.t. e.g. geometric distributions. Note that this also applies to higher-order
derivatives:
k2 k2
d2 X X d2 f (x, )
2
f (x, ) = .
d d2
x=k1 x=k1
Pk4 Pk2 2
A double summation y=k3 x=k1 f (x, y), with f : Z R, k1 k2 , with
f : Z R, k3 k4 , with f : Z R means:
X k2
k4 X k4
X
f (x, y) = f (k1 , y) + + f (k2 , y)
y=k3 x=k1 y=k3
= f (k1 , k3 ) + f (k1 , k3 + 1) + + f (k1 , k4 ) + + f (k2 , k3 ) + f (k2 , k3 + 1) + + f (k2 , k4 ).
In words: keep y constant and sum over x, then for each x value sum over y. Note again, that it
will be mathematically valid to switch the order of summation in this course, i.e.:
X k2
k4 X k2 X
X k4
f (x, y) = f (x, y).
y=k3 x=k1 x=k1 y=k3
You should be careful when switching the order of summation, when the first limit depends upon y,
e.g.
k4 X
X y
f (x, y).
y=k3 x=k1
A table of integrals, essentially follows by reversing the derivatives that you saw previously (we
omit the constant of integration which is not needed at this level):
xp+1
Z
xp dx = p 6= 1
p+1
Z
a bx
aebx dx = e
b
Z
log(x)dx = x log(x) x
Z
1
ax dx = ax a > 0
log(a)
Z
1
sin(ax)dx = cos(ax)
a
Z
1
cos(ax)dx = sin(ax)
a
Z
tan(ax)dx = log | sec(ax)|.
where g is a positive function. Suppose that there is a PDF f (x) such that
f (x) = cg(x)
where you know c. Then Z Z
1 1
g(x)dx = f (x)dx = .
c c
Recall that
Z
2
I= eu /2
du = 2.
This can be established by the fact that
Z Z
2
/2v 2 /2
I2 = eu dudv
4.7 Distributions
A Table of discrete distributions can be found in Table 4.1 and continuous Distributions in Table
4.2. Recall that Z
(a) = ta1 et dt a > 0.
0
Note that one can show (a + 1) = a(a) and for a Z+ , (a) = (a 1)!. In addition that for
a, b > 0
(a)(b)
B(a, b) = .
(a + b)
One can show also that Z 1
B(a, b) = ua1 (1 u)b1 du.
0
Support X Par. PMF CDF E[X] Var[X] MGF
B(1, p) {0, 1} p (0, 1) p p(1 p) (1 p) + pet
n x nx
px(1 p)1x
B(n, p) {0, 1, . . . , n} p (0, 1), n Z+ x p (1 p) np np(1 p) ((1 p) + pet )n
x e
P() {0, 1, 2, . . . } R+ x! exp{(et 1)}
pet
Ge(p) {1, 2, . . . } p (0, 1) (1 p)x1 p 1 qx 1/p (1 p)/p2
t
1et (1p) n
x1 xn n pe
N e(n, p) {n, n + 1, . . . } p (0, 1), n Z+ n1 (1 p) p n/p n(1 p)/p2 1et (1p)
Table 4.2: Table of Continuous Distributions. Note that B(a, b)1 = (a + b)/[(a)(b)].
48
Bibliography
[1] Grimmett, G, & Stirzaker, D. (2001). Probability and Random Processes. Third Edition.
Oxford: OUP.
49