NUS ST2334 Lecture Notes


ST2334 Probability and Statistics

A/P Ajay Jasra
Office 06-18, S16
E-Mail: staja@nus.edu.sg

Department of Statistics and Applied Probability
National University of Singapore
6 Science Drive 2, Singapore 117546
Contents

1 Introduction to Probability
  1.1 Introduction
  1.2 Probability Triples
    1.2.1 Sample Space
    1.2.2 σ-Fields
    1.2.3 Probability
  1.3 Conditional Probability
    1.3.1 Bayes' Theorem
    1.3.2 Theorem of Total Probability
  1.4 Independence

2 Random Variables and their Distributions
  2.1 Introduction
  2.2 Random Variables and Distribution Functions
    2.2.1 Random Variables
    2.2.2 Distribution Functions
  2.3 Discrete Random Variables
    2.3.1 Probability Mass Functions
    2.3.2 Independence
    2.3.3 Expectation
    2.3.4 Dependence
    2.3.5 Conditional Distributions and Expectations
  2.4 Continuous Random Variables
    2.4.1 Probability Density Functions
    2.4.2 Independence
    2.4.3 Expectation
    2.4.4 Dependence
    2.4.5 Conditional Distributions and Expectations
    2.4.6 Functions of Random Variables
  2.5 Convergence of Random Variables
    2.5.1 Convergence in Distribution
    2.5.2 Convergence in Probability
    2.5.3 Central Limit Theorem

3 Introduction to Statistics
  3.1 Introduction
  3.2 Maximum Likelihood Estimation
    3.2.1 Introduction
    3.2.2 The Method
    3.2.3 Examples of Computing the MLE
  3.3 Hypothesis Testing
    3.3.1 Introduction
    3.3.2 Constructing Test Statistics
  3.4 Bayesian Inference
    3.4.1 Introduction
    3.4.2 Bayesian Estimation
    3.4.3 Examples

4 Miscellaneous Results
  4.1 Set Theory
  4.2 Derivatives and Functions
  4.3 Summation
  4.4 Exponential Function
  4.5 Taylor Series
  4.6 Integration Methods
  4.7 Distributions

Course Information
Lecture times: This course is held from January 2015-April 2015. Lecture times are at 0800-
1000 Monday and Wednesday at LT25. The notes are available on my website: http://www.stat.
nus.edu.sg/~staja/ (also IVLE). You should register as usual for tutorials and for manual regis-
tration please email me with your name and matriculation number and I will attempt to put you
into a convenient tutorial group.

Office Hour: My office hour is at 1500 on Wednesday in 04-02, Department of Statistics and Applied
Probability. I am not available at other times, unless there is a time-table clash. If you would like
to see me or have a question, please email me or your tutors. I will try to respond as quickly as I can.

Assessed Coursework: During the course there will be one assignment which makes up 40% of
the final grade. The date when this assessment will be handed out will be provided at least two
weeks beforehand and you are given two weeks to complete the assessment. The assessment is to
be handed to me in person or at the statistics office, S16, level 7. Due to the number of students on
this course I do not accept assessments via e-mail (unless there are extreme circumstances, which
would need to be verified by the department before-hand) or by student submission on IVLE (again,
there are too many students for this). There is NO mid-term examination.

Exam: There is a 2 hour final exam of 4 questions. No calculators are allowed and the examination
is closed-book (with the exception of the below information). You will be given a formula sheet to
assist you in the examination (Tables 4.1 and 4.2) and you may bring a single handwritten A4 sheet
to the examination with any information you deem useful for the examination; you may write on
both sides of the A4.

Problem Sheets: There are also 10 non-assessed problem sheets available on my website (also
IVLE). Typed solutions will be available on my website.

Course Details: These notes are not sufficient to replace lectures. In particular, many examples
and clarifications are given during the class. This course investigates the following concepts: Basic
concepts of probability, conditional probability, independence, random variables, joint and marginal
distributions, mean and variance, some common probability distributions, sampling distributions,
estimation and hypothesis testing based on a normal population. There are three chapters of course
content, the first covering the foundations of probability. We then move on to random variables
and their distributions which features the main content of the course and is a necessary prerequisite
for further work in statistical modeling and randomized algorithms. The third chapter is a basic
introduction to statistics and gives some basic notions which would be used in statistical modeling.
The final chapter of the notes provides some mathematical background for the course, which you
are strongly advised to read in the first week. You are expected to know everything that is in this
chapter. In particular, the notions of double summation and integration are used routinely in this course
and you should spend some time recalling these ideas. Some additional help on integration will be
given later on in the course.

References: The recommended reference for this course is [1] (Chapters 1-5). For a non-technical
introduction the book [2] provides an entertaining and intuitive look at probability.
Chapter 1

Introduction to Probability

1.1 Introduction
This Chapter provides a basic introduction to the notions that underlie probability. The
level of the notes is below that required for complete mathematical rigor, but does provide a technical
development of these ideas. Essentially, the basic ideas of probability theory start with a probability
triple: a sample space (e.g. in flipping a coin, a head or tail), sets of events (e.g. tail, or tail and
head) and a way to compare the likelihood of events by a probability distribution (the chance of
obtaining a tail). Moving on from these notions, we consider the probability of events, given other
events are known to have occurred (conditional probability) (the probability a coin lands tails, given
we know it has two heads). Some events have probabilities which decouple in a special way and
are called independent (for example, flipping a fair coin twice, the outcome of the first flip does not
influence that of the second).
The structure of this Chapter is as follows: In Section 1.2, we discuss the idea of a probability
triple; in Section 1.3 conditional probability is introduced and in Section 1.4 we discuss independence.

1.2 Probability Triples


As mentioned in the introduction, probability begins with a triple:

- A sample space (possible outcomes)
- A collection of sets of outcomes (a σ-field)
- A way to compare the likelihood of events (a probability measure)

We will slowly introduce and analyze these concepts.

1.2.1 Sample Space


The basic notion we begin with is that of an experiment: a collection of possible outcomes, of
which we cannot (usually) determine exactly which will happen. For example, if we flip a coin, or if
we watch a football game and so on; in general we do not know for certain what will happen. The
idea of a sample space is as follows.
Definition 1.2.1. The set of all possible outcomes of an experiment is called the sample space
and is denoted by Ω.

Example 1.2.2. Consider rolling a six-sided fair die, once. There are 6 possible outcomes, and
thus Ω = {1, 2, . . . , 6}. We may be interested (for example, for betting purposes) in the following
events:
events:
1. we roll a 1
2. we roll an even number
3. we roll an even number which is less than 3
4. we roll an odd number

  Notation   Set Terminology         Probability Terminology
  Ω          Collection of objects   Sample space
  ω          Member of Ω             Elementary event
  A          Subset of Ω             Event that an outcome in A occurs
  Aᶜ         Complement of A         Event that no outcome in A occurs
  A ∩ B      Intersection            Both A and B
  A ∪ B      Union                   A or B
  A \ B      Difference              A but not B
  A △ B      Symmetric difference    A or B but not both
  A ⊆ B      Inclusion               If A then B
  ∅          Empty set               Impossible event

Table 1.1: Terminology of Set and Probability Theory

One thing that we immediately realize is that each of the events in the above example is a subset
of Ω; that is, the events (1)-(4) correspond to:
1. {1}
2. {2, 4, 6}
3. {2}
4. {1, 3, 5}

As a result, we think of events as subsets of the sample space Ω. Such events can be constructed
by unions, intersections and complements of other events (we will formally explain why, below). For
example, letting A = {2, 4, 6} in case (2) above, we immediately see that the event (4) is Aᶜ.
Similarly, letting A be the event in (2) and B be the event of rolling a number less than 4, we obtain
that event (3) is A ∩ B. For those of you that have forgotten basic set notations and terminologies,
see Table 1.1¹. In particular, we will think of Ω as the certain event (we must roll one of 1, 2, . . . , 6)
and its complement Ωᶜ = ∅ as the impossible event.

1.2.2 σ-Fields
Now we have a notion of an event; in particular, events are subsets of Ω. A particular question
of interest is then: are all subsets of Ω events? The answer actually turns out to be no, but the
technical reasons for this lie far out of the scope of this course. We will content ourselves to use a
particular collection of sets F (a set of sets) of Ω which contains all the events that we can make
probability statements about. This collection of sets is a σ-field.
Definition 1.2.3. A collection F of subsets of Ω is called a σ-field if it satisfies the following
conditions:

1. ∅ ∈ F
2. If A₁, A₂, . . . ∈ F then ∪_{i=1}^∞ Aᵢ ∈ F
3. If A ∈ F then Aᶜ ∈ F

It can be shown that σ-fields are closed under countable intersections (i.e. if A₁, A₂, . . . ∈ F then
∩_{i=1}^∞ Aᵢ ∈ F). Whilst it may seem quite abstract, it will turn out that, for this course, all the sets
we consider will lie in the σ-field F.
Example 1.2.4. Consider flipping a single coin; letting H denote a head and T a tail, then
Ω = {H, T} and F = {∅, Ω, {H}, {T}}.

Thus, to summarize, so far (Ω, F) is a sample space and a σ-field. The former is the collection
of all possible outcomes and the latter is a collection of sets on Ω (the events) that follow Definition
1.2.3. Our next objective is to define a way of assigning a likelihood to each event.
1. For those of you that have forgotten set theory, Section 4.1 has a refresher.

Exercise 1.2.5. You are given De Morgan's Laws:

    (∪ᵢ Aᵢ)ᶜ = ∩ᵢ Aᵢᶜ,    (∩ᵢ Aᵢ)ᶜ = ∪ᵢ Aᵢᶜ.

Show that if A, B ∈ F, then A ∩ B and A \ B are in F (just use the rules of σ-fields).

1.2.3 Probability
We now introduce a way to assign a likelihood to an event, via probability. One possible interpretation
is the following: suppose my experiment is repeated many times; then the probability of any event
is the limit of the ratio of the number of times the event occurs over the number of repetitions. We
note that this is not the only interpretation of probability, but we do not diverge into a discussion
of the philosophy of probability. Formally, we introduce the notion of a probability measure:
Definition 1.2.6. A probability measure P on (Ω, F) is a function P : F → [0, 1] which satisfies:
1. P(Ω) = 1 and P(∅) = 0
2. For A₁, A₂, . . . disjoint (i ≠ j, Aᵢ ∩ Aⱼ = ∅) members of F,

    P(∪_{i=1}^∞ Aᵢ) = Σ_{i=1}^∞ P(Aᵢ).

The triple (Ω, F, P), comprising a set Ω, a σ-field F of subsets of Ω and a probability measure P,
is called a probability space (or probability triple).
The idea is to associate the probability space with an experiment:
Example 1.2.7. Consider flipping a coin as in Example 1.2.4, with Ω = {H, T} and F = {∅, Ω, {H}, {T}}.
Then we can set:

    P({H}) = p,  P({T}) = 1 − p,  p ∈ [0, 1].

If p = 1/2 we say that the coin is fair.
Example 1.2.8. Consider rolling a 6-sided die; then Ω = {1, . . . , 6} and F = {0, 1}^Ω (the power
set of Ω: the set of all subsets of Ω). Define a probability measure on (Ω, F) as:

    P(A) = Σ_{i∈A} pᵢ,  A ∈ F,

where for i ∈ {1, . . . , 6}, 0 ≤ pᵢ ≤ 1 and Σ_{i=1}^6 pᵢ = 1. If pᵢ = 1/6 for every i ∈ {1, . . . , 6} then:

    P(A) = Card(A)/6,

where Card(A) is the cardinality of A (the number of elements in the set A).
We now consider a sequence of results on probability spaces.
We now consider a sequence of results on probability spaces.
Lemma 1.2.9. We have the following properties on a probability space (Ω, F, P):
1. For any A ∈ F, P(Aᶜ) = 1 − P(A).
2. For any A, B ∈ F, if A ⊆ B then P(B) = P(A) + P(B \ A) ≥ P(A).
3. For any A, B ∈ F, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
4. (inclusion-exclusion formula) For any A₁, . . . , Aₙ ∈ F:

    P(∪_{i=1}^n Aᵢ) = Σᵢ P(Aᵢ) − Σ_{i<j} P(Aᵢ ∩ Aⱼ) + Σ_{i<j<k} P(Aᵢ ∩ Aⱼ ∩ Aₖ) − ··· + (−1)^{n+1} P(∩_{i=1}^n Aᵢ),

where, for example, Σ_{i<j} = Σ_{j=1}^n Σ_{i=1}^{j−1}, etc.

Proof. For (1), A ∪ Aᶜ = Ω, so 1 = P(Ω) = P(A ∪ Aᶜ) = P(A) + P(Aᶜ) (as A ∩ Aᶜ = ∅). Thus, one
concludes by rearranging.
For (2), as B = A ∪ (B \ A) and A ∩ (B \ A) = ∅ (recall Exercise 1.2.5), P(B) = P(A) + P(B \ A).
The inequality is clear from the definition of a probability measure.
For (3), A ∪ B = A ∪ (B \ A), which is a disjoint union, and B \ A = B \ (A ∩ B), hence

    P(A ∪ B) = P(A) + P(B \ A) = P(A) + P(B \ (A ∩ B)).

Now (A ∩ B) ⊆ B, so by (2)

    P(B \ (A ∩ B)) = P(B) − P(A ∩ B)

and thus

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

For (4), this can be proved by induction and is an exercise.
To conclude the Section, we introduce two terms:
- An event A ∈ F is null if P(A) = 0.
- An event A ∈ F occurs almost surely if P(A) = 1.
Null events are not impossible. We will see this when considering random variables which take values
in the real line.

1.3 Conditional Probability


We now move onto the notion of conditional probability. It captures the probability of an
event, given that another event is known to have occurred. For example, what is the probability that it
rains today, given that it rained yesterday? We formalize the notion of conditional probability:
Definition 1.3.1. Consider a probability space (Ω, F, P) and let A, B ∈ F with P(B) > 0. Then the
conditional probability that A occurs given that B occurs is defined to be:

    P(A|B) := P(A ∩ B)/P(B).

Example 1.3.2. A family has two children of different ages. What is the probability that both
children are boys, given that at least one is a boy? The older and younger child may each be a boy
or a girl, so the sample space is:

    Ω = {GG, BB, GB, BG}

and we assume that all outcomes are equally likely: P(GG) = P(BB) = P(GB) = P(BG) = 1/4 (the
uniform distribution). We are interested in:

    P(BB | GB ∪ BB ∪ BG) = P(BB ∩ (GB ∪ BB ∪ BG)) / P(GB ∪ BB ∪ BG)
                         = P(BB) / P(GB ∪ BB ∪ BG)
                         = (1/4)/(3/4) = 1/3.

One can also ask the question: what is the probability that in a family of two children, where the
younger of the two is a boy, both are boys? We want:

    P(BB | BG ∪ BB) = P(BB ∩ (BG ∪ BB)) / P(BG ∪ BB)
                    = P(BB) / P(BG ∪ BB)
                    = (1/4)/(1/2) = 1/2.
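These two answers are easy to check empirically. Below is a minimal Monte Carlo sketch; Python and its standard library are assumptions of this illustration and are not part of the original notes.

```python
# Monte Carlo check of Example 1.3.2: simulate many two-child families and
# estimate the two conditional probabilities by relative frequencies.
import random

random.seed(0)
n = 10**6
at_least_one = both_given_one = younger_boy = both_given_younger = 0
for _ in range(n):
    older, younger = random.choice("BG"), random.choice("BG")
    both = (older == "B" and younger == "B")
    if older == "B" or younger == "B":      # condition: at least one boy
        at_least_one += 1
        both_given_one += both
    if younger == "B":                      # condition: the younger is a boy
        younger_boy += 1
        both_given_younger += both

print(both_given_one / at_least_one)        # approximately 1/3
print(both_given_younger / younger_boy)     # approximately 1/2
```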

1.3.1 Bayes' Theorem

An important result in conditional probability is Bayes' theorem.
Theorem 1.3.3. Consider a probability space (Ω, F, P) and let A, B ∈ F with P(A), P(B) > 0. Then
we have:

    P(B|A) = P(A|B)P(B)/P(A).

Proof. By definition

    P(B|A) = P(A ∩ B)/P(A).

As P(B) > 0, P(A ∩ B) = P(A|B)P(B) and hence:

    P(B|A) = P(A|B)P(B)/P(A).

1.3.2 Theorem of Total Probability


We will use Bayes' Theorem after introducing the following result (the Theorem of Total Probability).
We begin with a preliminary definition:
Definition 1.3.4. A family of sets B₁, . . . , Bₙ is called a partition of Ω if:

    i ≠ j ⇒ Bᵢ ∩ Bⱼ = ∅    and    ∪_{i=1}^n Bᵢ = Ω.

Lemma 1.3.5. Consider a probability space (Ω, F, P) and, for any fixed n ≥ 2, let B₁, . . . , Bₙ ∈ F be
a partition of Ω, with P(Bᵢ) > 0, i ∈ {1, . . . , n}. Then, for any A ∈ F:

    P(A) = Σ_{i=1}^n P(A|Bᵢ)P(Bᵢ).

Proof. We give the proof for n = 2, the other cases being very similar. We have A = (A ∩ B₁) ∪ (A ∩ B₂)
(recall B₁ ∩ B₂ = ∅, and B₂ = B₁ᶜ), so

    P(A) = P(A ∩ B₁) + P(A ∩ B₂)
         = P(A|B₁)P(B₁) + P(A|B₂)P(B₂).

Example 1.3.6. Consider two factories which manufacture a product. If the product comes from
factory I, it is defective with probability 1/5, and if it is from factory II, it is defective with probability
1/20. It is twice as likely that a product comes from factory I. What is the probability that a given
product operates properly (i.e. is not defective)? Let A denote the event that the product operates
properly and B denote the event that the product is made in factory I. Then:

    P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ)
         = (4/5)(2/3) + (19/20)(1/3) = 51/60.

If a product is defective, what is the probability that it came from factory I? This is P(B|Aᶜ); by
Bayes' Theorem

    P(B|Aᶜ) = P(Aᶜ|B)P(B)/P(Aᶜ)
            = (1/5)(2/3)/(9/60) = 8/9.

1.4 Independence
The idea of independence is roughly that the probability of an event A is unaffected by the occur-
rence of a (non-null) event B; that is, P(A) = P(A|B), when P(B) > 0. More formally:
Definition 1.4.1. Consider a probability space (Ω, F, P) and let A, B ∈ F. A and B are indepen-
dent if

    P(A ∩ B) = P(A)P(B).

More generally, a family of sets A₁, . . . , Aₙ ∈ F (∞ > n ≥ 2) are independent if

    P(∩_{i=1}^n Aᵢ) = Π_{i=1}^n P(Aᵢ).

Two important concepts include pairwise independence:

    P(Aᵢ ∩ Aⱼ) = P(Aᵢ)P(Aⱼ),  i ≠ j.

This does not necessarily mean that A₁, . . . , Aₙ are independent events. Another important concept
is conditional independence: given an event C ∈ F, with P(C) > 0, A, B ∈ F are conditionally
independent events if

    P(A ∩ B|C) = P(A|C)P(B|C).

This can be extended to families of sets Aᵢ given C.
Example 1.4.2. Consider a pack of playing cards, and one chooses a card completely at random
(i.e. no card counting etc). One can assume that the probability of choosing a suit (e.g. spade) is
independent of its rank. So, for example:

    P(spade ∩ king) = P(spade)P(king) = (13/52)(4/52) = 1/52.
Chapter 2

Random Variables and their Distributions

2.1 Introduction
Throughout the Chapter we assume that there is a probability space (Ω, F, P), but its presence is
minimized in the discussion. This is the main Chapter of the course and we focus upon random
variables and their distribution (Section 2.2). In particular, a random variable is (essentially) just a
map from to some subset of the real line. Once we are there we introduce notions of probability
through distribution functions. Sometimes the random variables take values on a countable space
(possibly countably infinite) and we call such random variables discrete (Section 2.3). The proba-
bilities of such random variables are linked to probability mass functions and we use this concept
to revisit independence and conditional probability. We also consider expectation (the theoretical
average) and conditional expectation for discrete random variables. These ideas are revisited when
the random variables are continuous (Section 2.4). The Chapter concludes with the
convergence of random variables (Section 2.5) and, in particular, the famous central limit theorem.

2.2 Random Variables and Distribution Functions


2.2.1 Random Variables
In general, we are often concerned not with an experiment itself, but with an associated consequence
of the outcome. For example, a gambler is often interested in his or her profit or loss, rather than in
the result that occurs. To deal with this issue, we introduce the notion of a random variable:
Definition 2.2.1. A random variable is a function X : Ω → R such that for each x ∈ R,
{ω ∈ Ω : X(ω) ≤ x} ∈ F. Such a function is said to be F-measurable.
For the purpose of this course, the technical constraint of F-measurability can be ignored; you
can content yourself to think of X(ω) as a mapping from the sample space Ω to the real line and
continue onwards. In general we omit the argument ω and just write X, with possible realized values
(numbers) written in lower-case x. The distinction between a random variable X and its realized
value is very important and you should maintain this convention. To move from the random variable
back to our probability measure P, we write events {ω ∈ Ω : X(ω) ≤ x} as {X ≤ x}, and hence we
write P({X ≤ x}) as P(X ≤ x) to mean P({ω ∈ Ω : X(ω) ≤ x}) (recall {ω ∈ Ω : X(ω) ≤ x} ∈ F
and so this makes sense). This leads us to the notion of a distribution function:

2.2.2 Distribution Functions


Definition 2.2.2. The distribution function of a random variable X is the function F : R → [0, 1]
given by F(x) = P(X ≤ x), x ∈ R.
Example 2.2.3. Consider flipping a fair coin twice; then Ω = {HH, TT, HT, TH}, with F =
{0, 1}^Ω and P(HH) = P(TT) = P(HT) = P(TH) = 1/4. Define a random variable X as the number
of heads; so

    X(HH) = 2,  X(HT) = X(TH) = 1,  X(TT) = 0.


The associated distribution function is:

    F(x) = 0,    if x < 0,
           1/4,  if 0 ≤ x < 1,
           3/4,  if 1 ≤ x < 2,
           1,    if x ≥ 2.

A distribution function has the following properties, which we state without proof. See [1] for
the proof. The lemma characterizes a distribution function: a function F is a distribution function
for some random variable if and only if it satisfies the three properties in the following lemma.
Lemma 2.2.4. A distribution function F has the following properties:

1. lim_{x→−∞} F(x) = 0, lim_{x→∞} F(x) = 1,
2. if x < y then F(x) ≤ F(y),
3. F is right continuous: lim_{ε↓0} F(x + ε) = F(x) for each x ∈ R.
Example 2.2.5. Consider flipping a coin as in Example 1.2.4, which lands heads with probability
p ∈ (0, 1). Let X : Ω → R be given by

    X(H) = 1,  X(T) = 0.

The associated distribution function is

    F(x) = 0,      if x < 0,
           1 − p,  if 0 ≤ x < 1,
           1,      if x ≥ 1.

X is said to have a Bernoulli distribution.


We end the Section with another lemma, whose proof can be found in [1].
Lemma 2.2.6. A distribution function F of a random variable X has the following properties:
1. P(X > x) = 1 − F(x) for any fixed x ∈ R,
2. P(x < X ≤ y) = F(y) − F(x) for any fixed x, y ∈ R with x < y,
3. P(X = x) = F(x) − lim_{y↑x} F(y) for any fixed x ∈ R.

2.3 Discrete Random Variables


2.3.1 Probability Mass Functions
We now move onto an important class of random variable called discrete random variables.
Definition 2.3.1. A random variable is said to be discrete if it takes values in some countable subset
X = {x₁, x₂, . . . } of R.
A discrete random variable takes values only at countably many points and hence its distribution
function is a step function, jumping at the points of X. An important concept is
the probability mass function (PMF):
Definition 2.3.2. The probability mass function of a discrete random variable X is the function
f : X → [0, 1] defined by f(x) = P(X = x).
Remark 2.3.3. We generally use sans-serif notation X, Z to denote the support of a
random variable, that is, the range of values for which its PMF (or PDF, as will
be defined much later on) is (potentially) non-zero. Note however, the PMF may be
defined on, say, Z or R but is taken as (potentially) non-zero only at those points in X;
it is always zero outside X. We call X the support of the random variable.

We remark that

    F(x) = Σ_{i : xᵢ ≤ x} f(xᵢ),    f(x) = F(x) − lim_{y↑x} F(y),

which provides an association between the distribution function and the PMF. We have the following
result, whose proof follows easily from the above definitions and results.
Lemma 2.3.4. A PMF satisfies:
1. the set of x such that f(x) ≠ 0 is countable,
2. Σ_{x∈X} f(x) = 1.

Example 2.3.5. A coin is flipped n times and the probability one obtains a head is p ∈ (0, 1);
Ω = {H, T}ⁿ. Let X denote the number of heads, which takes values in the set X = {0, 1, . . . , n} and
is thus a discrete random variable. Consider x ∈ X: exactly C(n, x) points in Ω give us x heads and
each such point occurs with probability p^x (1 − p)^{n−x}; hence

    f(x) = C(n, x) p^x (1 − p)^{n−x},  x ∈ X.

The random variable X is said to have a Binomial distribution and this is denoted X ∼ B(n, p).
Note that

    C(n, x) = n!/((n − x)! x!)

with x! = x(x − 1) ··· 1 and 0! = 1.
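As an illustration (not from the notes), the PMF can be evaluated directly in Python using math.comb; by Lemma 2.3.4 the values should sum to 1, and the mean should equal np (Exercise 2.3.19).

```python
# Evaluate the B(n, p) PMF of Example 2.3.5 and check Lemma 2.3.4.
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """f(x) = C(n, x) p^x (1 - p)^(n - x) for x in {0, ..., n}."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
print(sum(pmf))                                        # 1.0 (property 2)
print(sum(x * f for x, f in enumerate(pmf)))           # mean n*p = 3.0
```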
Example 2.3.6. Let λ > 0 be given. A random variable X that takes values on X = {0, 1, 2, . . . }
is said to have a Poisson distribution with parameter λ (denoted X ∼ P(λ)) if its PMF is:

    f(x) = λ^x e^{−λ}/x!,  x ∈ X.

2.3.2 Independence
We now extend the notion of independence of events to the domain of random variables. Recall that
events A and B are independent if the joint probability is equal to the product (A does not affect
B).
Definition 2.3.7. Discrete random variables X and Y are independent if the events {X = x} and
{Y = y} are independent for each (x, y) ∈ X × Y.
To understand this idea, let X = {x₁, x₂, . . . }, Y = {y₁, y₂, . . . }; then X and Y are independent
if and only if the events Aᵢ = {X = xᵢ}, Bⱼ = {Y = yⱼ} are independent for every possible pair i, j.
Example 2.3.8. Consider flipping a coin once, which lands tail with probability p ∈ (0, 1). Let X
be the number of heads seen and Y the number of tails. Then:

    P(X = Y = 1) = 0

and

    P(X = 1)P(Y = 1) = (1 − p)p ≠ 0,

so X and Y cannot be independent.
A useful result (which we do not prove) that is worth noting is the following.
Theorem 2.3.9. If X and Y are independent random variables and g : X → R, h : Y → R, then
the random variables g(X) and h(Y) are also independent.
In full generality¹, consider a sequence of discrete random variables X₁, X₂, . . . , Xₙ, Xᵢ ∈ Xᵢ;
they are said to be independent if the events {X₁ = x₁}, . . . , {Xₙ = xₙ} are independent for every
possible (x₁, . . . , xₙ) ∈ X₁ × ··· × Xₙ. That is:

    P(X₁ = x₁, . . . , Xₙ = xₙ) = Π_{i=1}^n P(Xᵢ = xᵢ),  (x₁, . . . , xₙ) ∈ X₁ × ··· × Xₙ.

1. This point will not make too much sense until we see Section 2.3.4.

2.3.3 Expectation
Throughout your statistical training, you may have encountered the notion of an average or mean
value of data. In this Section we consider the idea of the average value of a random variable, which
is called the expected value.
Definition 2.3.10. The expected value of a discrete random variable X on X, with PMF f, denoted
E[X], is defined as

    E[X] := Σ_{x∈X} x f(x)

whenever the sum on the R.H.S. is absolutely convergent.


Example 2.3.11. Recall the Poisson random variable from Example 2.3.6, X ∼ P(λ). The expected
value is:

    E[X] = Σ_{x=0}^∞ x λ^x e^{−λ}/x!
         = λ e^{−λ} Σ_{x=1}^∞ λ^{x−1}/(x − 1)!
         = λ e^{−λ} e^{λ}
         = λ,

where we have used the Taylor series expansion for the exponential function on the third line.
We now look at the idea of an expectation of a function of a random variable:
Lemma 2.3.12. Given a discrete random variable X on X, with PMF f and g : X → R:

    E[g(X)] = Σ_{x∈X} g(x) f(x)

whenever the sum on the R.H.S. is absolutely convergent.

Example 2.3.13. Returning to Example 2.3.11, with X ∼ P(λ), we have

    E[X²] = Σ_{x=0}^∞ x² λ^x e^{−λ}/x!
          = λ e^{−λ} Σ_{x=1}^∞ x λ^{x−1}/(x − 1)!
          = λ e^{−λ} Σ_{x=1}^∞ (d/dλ)(λ^x) (1/(x − 1)!)
          = λ e^{−λ} (d/dλ)[λ Σ_{x=1}^∞ λ^{x−1}/(x − 1)!]
          = λ e^{−λ} (d/dλ)(λ e^λ)
          = λ² + λ,

where we have assumed that it is legitimate to swap differentiation and summation (it turns out that
this is true here).
An important concept is the moment generating function:
Definition 2.3.14. For a discrete random variable X the moment generating function (MGF)
is

    M(t) = E[e^{Xt}] = Σ_{x∈X} e^{xt} f(x),  t ∈ T,

where T is the set of t for which Σ_{x∈X} e^{xt} f(x) < ∞.

Exercise 2.3.15. Show that

    E[X] = M′(0),  E[X²] = M″(0)

when the right-hand derivatives exist.


The moment generating function is thus a simple way to obtain moments, if it is simple to differ-
entiate M (t). Note also, that it can be proven that the MGF uniquely characterizes a distribution.
Example 2.3.16. Let X ∼ P(λ); then:

    E[e^{Xt}] = Σ_{x=0}^∞ e^{xt} λ^x e^{−λ}/x!
              = e^{−λ} Σ_{x=0}^∞ (λe^t)^x/x!
              = exp{λ(e^t − 1)}.

Then

    M′(0) = λ.
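A numerical sanity check of this MGF (an illustration assuming numpy is available; not part of the notes): a Monte Carlo average of e^{Xt} should match exp{λ(e^t − 1)}.

```python
# Compare a Monte Carlo estimate of E[e^{Xt}] with the closed-form Poisson MGF.
import math
import numpy as np

rng = np.random.default_rng(1)
lam, t = 2.0, 0.3
x = rng.poisson(lam, size=10**6)
print(np.exp(t * x).mean())                  # Monte Carlo E[e^{Xt}]
print(math.exp(lam * (math.exp(t) - 1)))     # exact MGF; values approx equal
```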
An important special case of functions of random variables are:
Definition 2.3.17. Given a discrete random variable X on X, with PMF f and k ∈ Z⁺, the kth
moment of X is

    E[X^k]

and the kth central moment of X is

    E[(X − E[X])^k].

Of particular importance is the idea of the variance:

    Var[X] := E[(X − E[X])²].

Now, we have the following calculations:

    E[(X − E[X])²] = Σ_{x∈X} (x − E[X])² f(x)
                   = Σ_{x∈X} (x² − 2xE[X] + E[X]²) f(x)
                   = Σ_{x∈X} x² f(x) − 2E[X] Σ_{x∈X} x f(x) + E[X]² Σ_{x∈X} f(x)
                   = E[X²] − 2E[X]² + E[X]² = E[X²] − E[X]².

Hence we have shown that

    Var[X] = E[X²] − E[X]².  (2.3.1)

Note a very important point: for any absolutely convergent sum Σ aₙ, if aₙ ≥ 0 for each n, then
Σ aₙ ≥ 0. Now as Var[X] := E[(X − E[X])²] = Σ_{x∈X} (x − E[X])² f(x) and clearly

    (x − E[X])² f(x) ≥ 0,  x ∈ X,

it follows that variances cannot be negative².
Example 2.3.18. Returning to Examples 2.3.11 and 2.3.13, when X ∼ P(λ), we have shown that

    E[X²] = λ² + λ,  E[X] = λ.

Hence on using (2.3.1):

    Var[X] = λ.

That is, for a Poisson random variable, its mean and variance are equal.
2. This is very important and you will not be told again.
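This equality of mean and variance is easy to see in simulation. A minimal sketch, assuming numpy is available (an illustration, not part of the notes):

```python
# For X ~ P(lam), the sample mean and sample variance should both approach lam.
import numpy as np

rng = np.random.default_rng(0)
lam = 3.5
x = rng.poisson(lam, size=10**6)
print(x.mean(), x.var())   # both approximately 3.5
```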

Exercise 2.3.19. Compute the mean and variance of a Binomial B(n, p) random variable. [Hint:
consider the differentiation, w.r.t. x, of the equality

    Σ_{k=0}^n C(n, k) x^k = (1 + x)^n.]
We now state a Theorem, whose proof we do not give. In general, it is simple to establish once
the concept of a joint distribution has been introduced; we have not done this, so we simply state
the result.
Theorem 2.3.20. The expectation operator has the following properties:
1. if X ≥ 0, then E[X] ≥ 0,
2. if a, b ∈ R then E[aX + bY] = aE[X] + bE[Y],
3. if X = c ∈ R always, then E[X] = c.
An important result that we will use later on and whose proof is omitted is as follows.
Lemma 2.3.21. If X and Y are independent then E[XY] = E[X]E[Y].
A notion of dependence (linear dependence) is correlation. This will be discussed in detail later on.
Definition 2.3.22. X and Y are uncorrelated if E[XY] = E[X]E[Y].
It is important to remark that random variables that are independent are uncorrelated. However,
uncorrelated variables are not necessarily independent; we will explore this idea later on.
We end the section with some useful properties of the variance operator.
Theorem 2.3.23. For random variables X and Y:
1. For a ∈ R, Var[aX] = a² Var[X],
2. For X, Y uncorrelated, Var[X + Y] = Var[X] + Var[Y].
Proof. For 1. we have:

    Var[aX] = E[(aX)²] − E[aX]²
            = a² E[X²] − a² E[X]² = a² Var[X],

where we have used Theorem 2.3.20 2. in the second line.
Now for 2. we have:

    Var[X + Y] = E[(X + Y − E[X + Y])²]
               = E[X² + Y² + 2XY + E[X + Y]² − 2(X + Y)E[X + Y]]
               = E[X²] + E[Y²] + 2E[X]E[Y] + E[X + Y]² − 2E[X + Y]²
               = E[X²] + E[Y²] + 2E[X]E[Y] − (E[X]² + E[Y]² + 2E[X]E[Y])
               = E[X²] + E[Y²] − E[X]² − E[Y]² = Var[X] + Var[Y],

where we have repeatedly used Theorem 2.3.20 2. and E[XY] = E[X]E[Y] (X, Y uncorrelated).

2.3.4 Dependence
As we saw in the previous Section, there is a need to define distributions on more than one random
variable. We will do that in this Section. We start with the following definition (which one can
easily generalize to a collection of n ≥ 1 discrete random variables):
Definition 2.3.24. The joint distribution function F : R² → [0, 1] of X and Y, where X and Y are
discrete random variables, is given by

    F(x, y) = P(X ≤ x ∩ Y ≤ y).

Their joint mass function f : R² → [0, 1] is given by

    f(x, y) = P(X = x ∩ Y = y).

           y = −1    y = 0    y = 2    f(x)
  x = 1     1/18     3/18     2/18     6/18
  x = 2     2/18     0        3/18     5/18
  x = 3     0        4/18     3/18     7/18
  f(y)      3/18     7/18     8/18

Table 2.1: The joint mass function in Example 2.3.26. The row totals are the marginal mass
function of X (and sum to 1) and, respectively, the column totals are the marginal mass function
of Y.

In general, it may be that the random variables are defined on a space Z which may not be
decomposable into a cartesian product X × Y. In such scenarios we write the joint support as simply
Z and omit X and Y; in this scenario, we will use the notation Σ_x or Σ_y to denote sums over the
supports induced by Z. This concept will be clarified during a reading of the subsequent text.
Of particular importance are the marginal PMFs of X and Y:

    f(x) = Σ_y f(x, y),    f(y) = Σ_x f(x, y).

Note that, clearly,

    Σ_x f(x) = 1 = Σ_y f(y).

Example 2.3.25. Consider Theorem 2.3.20 2. and, for simplicity, suppose Z = X × Y. Now, we have

    E[aX + bY] = Σ_{x∈X} Σ_{y∈Y} (ax + by) f(x, y)
               = a Σ_{x∈X} x Σ_{y∈Y} f(x, y) + b Σ_{y∈Y} y Σ_{x∈X} f(x, y)
               = a Σ_{x∈X} x f(x) + b Σ_{y∈Y} y f(y)
               = aE[X] + bE[Y].

Example 2.3.26. Suppose X = {1, 2, 3} and Y = {−1, 0, 2}; then an example of a joint PMF can be
found in Table 2.1. From the table, we have:

    E[XY] = Σ_{x∈X} Σ_{y∈Y} x y f(x, y) = 29/18

(just sum the 9 values in the table, multiplying each time by the product of the associated x and y).
Similarly

    E[X] = 1 × 6/18 + 2 × 5/18 + 3 × 7/18 = 37/18

and

    E[Y] = −1 × 3/18 + 0 × 7/18 + 2 × 8/18 = 13/18.

We now formalize independence in a result, which provides us a more direct way to ascertain if
two random variables X and Y are independent.
Lemma 2.3.27. The discrete random variables X and Y are independent if and only if

    f(x, y) = f(x) f(y),  for all x, y ∈ R.

More generally, X and Y are independent if and only if f(x, y) factorizes into the product g(x)h(y),
with g a function only of x and h a function only of y.
Example 2.3.28. Consider the joint PMF:

    f(x, y) = (λ^x e^{−λ}/x!)(λ^y e^{−λ}/y!),  (x, y) ∈ X × Y,  X = Y = {0, 1, . . . }.

Now clearly

    f(x) = λ^x e^{−λ}/x!,  x ∈ X,    f(y) = λ^y e^{−λ}/y!,  y ∈ Y.

It is also clear via Lemma 2.3.27 that the random variables X and Y are independent and identically
distributed (and Poisson distributed).
As in the case of a single variable, we are interested in the expectation of a function of two
random variables (strictly, in the proof of Theorem 2.3.23 we have already used the following result):
Lemma 2.3.29. E[g(X, Y)] = Σ_{(x,y)∈X×Y} g(x, y) f(x, y).

Recall that in the previous Section, we mentioned a notion of dependence called correlation. In
order to formally define this concept, we introduce first the covariance and then the correlation.
Definition 2.3.30. The covariance between X and Y is

    Cov[X, Y] := E[(X − E[X])(Y − E[Y])].

The correlation between X and Y is

    ρ(X, Y) = Cov[X, Y] / √(Var[X] Var[Y]).

Exercise 2.3.31. Show that

    E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].

Thus, independent random variables have zero correlation.


Example 2.3.32. Returning to Example 2.3.26, we have:

    Var[X] = 233/324,    Var[Y] = 461/324

and

    Cov[X, Y] = 29/18 − (37/18)(13/18) = 41/324.

Thus

    ρ(X, Y) = 41/√107413.

Some useful facts about correlations that we do not prove:

- |ρ(X, Y)| ≤ 1. Random variables with correlation 1 or −1 are said to be perfectly positively or
negatively correlated.
- The correlation coefficient is 1 iff Y increases linearly with X and −1 iff Y decreases linearly as
X increases.

When we consider continuous random variables, an example of uncorrelated random variables that
are not independent will be given.

2.3.5 Conditional Distributions and Expectations


In Section 1.3 we discussed the idea of conditional probabilities associated to events. This idea can
be extended to discrete-valued random variables:
Definition 2.3.33. The conditional distribution function of Y given X, written F_{Y|X}(·|x), is defined
by

    F_{Y|X}(y|x) = P(Y ≤ y | X = x)

for any x with P(X = x) > 0. The conditional PMF of Y given X = x is defined by

    f(y|x) = P(Y = y | X = x)

when x is such that P(X = x) > 0.



We remark that, in particular, the conditional mass function of Y given X = x is:

    f(y|x) = f(x, y)/f(x) = f(x, y) / Σ_y f(x, y).

Example 2.3.34. Suppose that Y ∼ P(λ) and X | Y = y ∼ B(y, p). Find the conditional probability
mass function of Y | X = x. Note that the random variables lie in the space Z = {(x, y) : y ∈
{0, 1, . . . }, x ∈ {0, 1, . . . , y}}. We have

    f(y|x) = f(x, y)/f(x) = f(x|y) f(y)/f(x),  (x, y) ∈ Z,

which is a version of Bayes' Theorem for discrete random variables. Note that for (x, y) ∈ Z all the
PMFs above are positive. Now for x ∈ {0, 1, . . . }

    f(x) = Σ_{y=x}^∞ C(y, x) p^x (1 − p)^{y−x} λ^y e^{−λ}/y!
         = e^{−λ} ((λp)^x/x!) Σ_{y=x}^∞ (λ(1 − p))^{y−x}/(y − x)!
         = ((λp)^x/x!) e^{−λp}.

Thus, we have for y ∈ {x, x + 1, . . . }:

    f(y|x) = [C(y, x) p^x (1 − p)^{y−x} λ^y e^{−λ}/y!] / [((λp)^x/x!) e^{−λp}]

which, after some algebra, becomes:

    f(y|x) = (λ(1 − p))^{y−x} e^{−λ(1−p)} / (y − x)!,  y ∈ {x, x + 1, . . . }.
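The marginal computed above says that X ∼ P(λp) (the "thinned" Poisson). A simulation sketch (assuming numpy; not part of the notes):

```python
# If Y ~ P(lam) and X | Y = y ~ B(y, p), the marginal of X is P(lam * p),
# so its sample mean and variance should both be near lam * p.
import numpy as np

rng = np.random.default_rng(2)
lam, p = 4.0, 0.25
y = rng.poisson(lam, size=10**6)
x = rng.binomial(y, p)          # numpy accepts a vector of trial counts
print(x.mean(), x.var())        # both approximately lam * p = 1.0
```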
Given the idea of a conditional distribution, we move onto the idea of a conditional expectation:
Definition 2.3.35. The conditional expectation of a random variable Y, given X = x, is

    E[Y | X = x] = Σ_y y f(y|x)

given that the conditional PMF is well-defined. We generally write E[Y|X] or E[Y|x].
An important result associated to conditional expectations is as follows:
Theorem 2.3.36. The conditional expectation satisfies:

    E[E[Y|X]] = E[Y]

assuming the expectations all exist.

Proof. We have

    E[Y] = Σ_y y f(y)
         = Σ_{(x,y)∈Z} y f(x, y)
         = Σ_{(x,y)∈Z} y f(y|x) f(x)
         = Σ_x [Σ_y y f(y|x)] f(x)
         = E[E[Y|X]].

Example 2.3.37. Let us return to Example 2.3.34. Find E[X|Y], E[Y|X] and E[X]. From Exercise
2.3.19, you should have derived that if Z ∼ B(n, p) then E[Z] = np. Then

    E[X | Y = y] = yp.

Thus

    E[X] = E[E[X|Y]] = pE[Y].

As Y ∼ P(λ),

    E[X] = λp.

From Example 2.3.34,

    f(y|x) = (λ(1 − p))^{y−x} e^{−λ(1−p)}/(y − x)!,  y ∈ {x, x + 1, . . . }.

Thus

    E[Y | X = x] = Σ_{y=x}^∞ y (λ(1 − p))^{y−x} e^{−λ(1−p)}/(y − x)!.

Setting u = y − x in the summation, one yields

    E[Y | X = x] = Σ_{u=0}^∞ (u + x)(λ(1 − p))^u e^{−λ(1−p)}/u!
                 = Σ_{u=0}^∞ u (λ(1 − p))^u e^{−λ(1−p)}/u! + x
                 = λ(1 − p) + x.

Here, we have used the fact that U ∼ P(λ(1 − p)).

We end the Section with a result which can be very useful in practice.
Theorem 2.3.38. We have, for any g : R → R,

    E[E[Y|X] g(X)] = E[Y g(X)]

assuming the expectations all exist.

Proof. We have

    E[Y g(X)] = Σ_{(x,y)∈Z} y g(x) f(x, y)
              = Σ_{(x,y)∈Z} y g(x) f(y|x) f(x)
              = Σ_x g(x)[Σ_y y f(y|x)] f(x)
              = E[E[Y|X] g(X)].
2.4 Continuous Random Variables

In the previous Section, we considered random variables taking values in a countable (finite or
countably infinite) set. However, there are of course a wide variety of applications where one sees
numerical outcomes of an experiment that can lie potentially anywhere on the real line. Thus we
extend each of the concepts that we saw for discrete random variables to continuous ones. In a rather
informal (and incorrect) manner, one can simply think of the idea of replacing summation with
integration; of course things will be more challenging than this, but one should keep this idea in the
back of one's mind.

2.4.1 Probability Density Functions


A random variable is said to be continuous if its distribution function F(x) = P(X ≤ x) can be
written as

    F(x) = ∫_{−∞}^x f(u) du,

where f : R → [0, ∞), the R.H.S. is the usual Riemann integral and we will assume that the R.H.S. is
differentiable.
Definition 2.4.1. The function f is called the probability density function (PDF) of the con-
tinuous random variable X.
We note that, under our assumptions, f(x) = F′(x).
Example 2.4.2. One of the most commonly used PDFs is the Gaussian or normal distribution:

    f(x) = (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)²},  x ∈ X = R,

where μ ∈ R, σ² > 0. We use the notation X ∼ N(μ, σ²).
One of the key points associated to PDFs is as follows. The numerical value f (x) does not
represent the probability that X takes the value x. The technical explanation goes far beyond the
mathematical level of this course, but perhaps an intuitive reason is simply that there are uncountably
infinite points in X (so assigning a probability to each point is seemingly impossible). In general, one
assigns probability to sets of non-zero width. For example let A = [a, b], −∞ < a < b < ∞; then
one might expect:

    P(X ∈ A) = ∫_A f(x) dx.

Indeed this holds true, but we are deliberately vague about this. We give the following result, which
is not proved and should be taken as true.
Lemma 2.4.3. If X has a PDF f, then
1. ∫_X f(x) dx = 1,
2. P(X = x) = 0 for each x ∈ X,
3. P(X ∈ [a, b]) = ∫_a^b f(x) dx, −∞ ≤ a < b ≤ ∞.
Example 2.4.4. Returning to Example 2.4.2, we have

    P(X ∈ X) = ∫_{−∞}^∞ (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)²} dx
             = ∫_{−∞}^∞ (1/√(2π)) e^{−u²/2} du = 1.

Here, we have used the substitution u = (x − μ)/σ to go to the second line and the fact that
∫_{−∞}^∞ e^{−u²/2} du = √(2π).

2.4.2 Independence
To define an idea of independence for continuous random variables, we cannot use the one for discrete
random variables (recall Definition 2.3.7); the sets {X = x} and {Y = y} have zero probability and
are hence trivially independent. Thus we use a new definition for independence:
Definition 2.4.5. The random variables X and Y are independent if

    {X ≤ x} and {Y ≤ y}

are independent events for each x, y ∈ R.


As for PMFs one can consider independence through PDFs; however, as we are yet to define the
notion of joint PDFs, we will leave this idea for now. We note that one can show that for any real
valued functions h and g (at least within some technical conditions which are assumed in this course)
that if X and Y are independent, so are the random variables h(X) and g(Y ). We do not prove why.

2.4.3 Expectation
As for discrete random variables one can consider the idea of the average of a random variable. This
is simply brought about by replacing summations with integrations:
Definition 2.4.6. The expectation of a continuous random variable X with PDF f is given by

    E[X] = ∫_X x f(x) dx

whenever the integral exists.

Example 2.4.7. Consider the exponential density:

    f(x) = λe^{−λx},  x ∈ X = [0, ∞),  λ > 0.

We use the notation X ∼ E(λ). Then

    E[X] = ∫_0^∞ xλe^{−λx} dx = [−xe^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = [−(1/λ)e^{−λx}]_0^∞ = 1/λ,

where we have used integration by parts.
Example 2.4.8. Let us return to Example 2.4.2. Then

    E[X] = ∫_{−∞}^∞ x (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)²} dx
         = ∫_{−∞}^∞ (σu + μ)(1/√(2π)) e^{−u²/2} du
         = σ ∫_{−∞}^∞ u (1/√(2π)) e^{−u²/2} du + μ
         = (σ/√(2π)) [−e^{−u²/2}]_{−∞}^∞ + μ
         = μ.

Here, we have used the substitution u = (x − μ)/σ to go to the second line and the fact that

    ∫_{−∞}^∞ (1/√(2π)) e^{−u²/2} du = 1

to go to the third.
Example 2.4.9. Consider the gamma density:

    f(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx},  x ∈ X = [0, ∞),  α, λ > 0,

where Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt. We use the notation X ∼ G(α, λ) and note that if X ∼ G(1, λ) then
X ∼ E(λ). Now

    E[X] = (λ^α/Γ(α)) ∫_0^∞ x^α e^{−λx} dx
         = (1/(Γ(α)λ)) ∫_0^∞ u^α e^{−u} du
         = Γ(α + 1)/(Γ(α)λ) = α/λ,

where we have used the substitution u = λx to go to the second line and Γ(α + 1) = αΓ(α) on the final
line (from herein you may use that identity without proof).
We now state a useful technical result.
Theorem 2.4.10. If X and g(X) (g : R → R) are continuous random variables, then

    E[g(X)] = ∫_X g(x) f(x) dx.

We remark that all the extensions to expectations discussed in Section 2.3.3 can be extended to
the continuous case. In particular Definitions 2.3.17 and 2.3.22 can be imported to the continuous
case. In addition the results: Theorems 2.3.20 and 2.3.23 and Lemma 2.3.21 all can be extended. It
is assumed that this is the case from herein.
Example 2.4.11. Let us return to Example 2.4.7:

    E[X²] = ∫_0^∞ x²λe^{−λx} dx
          = [−x²e^{−λx}]_0^∞ + 2 ∫_0^∞ xe^{−λx} dx
          = 2/λ².

Thus, for an exponential random variable:

    Var[X] = 1/λ².
Example 2.4.12. Let us return to Example 2.4.2. Then

    E[X²] = ∫_{−∞}^∞ x² (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)²} dx
          = ∫_{−∞}^∞ (σu + μ)² (1/√(2π)) e^{−u²/2} du
          = σ² ∫_{−∞}^∞ u² (1/√(2π)) e^{−u²/2} du + μ² + 2σμ ∫_{−∞}^∞ u (1/√(2π)) e^{−u²/2} du
          = (σ²/√(2π)) {[−ue^{−u²/2}]_{−∞}^∞ + ∫_{−∞}^∞ e^{−u²/2} du} + μ²
          = σ² + μ².

Here we have used ∫ u (1/√(2π)) e^{−u²/2} du = 0 and ∫ e^{−u²/2} du = √(2π). Thus, for a normal random
variable:

    Var[X] = σ².
Example 2.4.13. Let us return to Example 2.4.9. Now

    E[X²] = (λ^α/Γ(α)) ∫_0^∞ x^{α+1} e^{−λx} dx
          = (1/(Γ(α)λ²)) ∫_0^∞ u^{α+1} e^{−u} du
          = Γ(α + 2)/(Γ(α)λ²) = α(α + 1)/λ²,

where we have used Γ(α + 2) = (α + 1)αΓ(α). Thus, for a gamma random variable:

    Var[X] = α/λ².
To end this section we consider an important concept: the MGF for continuous random variables.
Definition 2.4.14. For a continuous random variable X the moment generating function
(MGF) is

    M(t) = E[e^{Xt}] = ∫_X e^{xt} f(x) dx,  t ∈ T,

where T is the set of t for which ∫_X e^{xt} f(x) dx < ∞.
Exercise 2.4.15. Show that

    E[X] = M′(0),  E[X²] = M″(0)

when the right-hand derivatives exist.



As for discrete random variables, the moment generating function is a simple way to obtain
moments, if it is simple to differentiate M(t). Note also, that it can be proven that the MGF
uniquely characterizes a distribution.
Example 2.4.16. Suppose X ∼ E(λ); then, supposing λ > t,

    M(t) = λ ∫_0^∞ e^{−x(λ−t)} dx
         = λ [−e^{−x(λ−t)}/(λ − t)]_0^∞
         = λ/(λ − t).

Clearly

    M′(t) = λ/(λ − t)²

and thus E[X] = 1/λ.
Example 2.4.17. Suppose X ∼ N(μ, σ²); then

    M(t) = ∫_{−∞}^∞ (1/√(2πσ²)) exp{−(1/(2σ²))(x − μ)² + xt} dx
         = ∫_{−∞}^∞ (1/√(2πσ²)) exp{−(1/(2σ²))[(x − (μ + tσ²))² − (μ + tσ²)² + μ²]} dx
         = exp{μt + σ²t²/2} ∫_{−∞}^∞ (1/√(2πσ²)) exp{−(1/(2σ²))(x − (μ + tσ²))²} dx
         = exp{μt + σ²t²/2},

where we have used a change-of-variables u = (x − (μ + tσ²))/σ to deal with the integral.

2.4.4 Dependence
Just as for discrete random variables, one can consider the idea of joint distributions for continuous
random variables.
Definition 2.4.18. The joint distribution function of X and Y is the function F : R² → [0, 1]
given by

    F(x, y) = P(X ≤ x, Y ≤ y).

Again, as for discrete random variables, one requires a PDF:
Definition 2.4.19. The random variables are jointly continuous with joint PDF f : R² → [0, ∞) if

    F(x, y) = ∫_{−∞}^y ∫_{−∞}^x f(u, v) du dv

for each (x, y) ∈ R².

In this course, we will assume (generally) that

    f(x, y) = ∂²F(x, y)/∂x∂y.

Note that P(X = x, Y = y) = 0 and for sufficiently well defined sets A ⊆ Z ⊆ R²

    P((X, Y) ∈ A) = ∫_A f(x, y) dx dy;

note that the R.H.S. is a double integral, but we will only write one integral sign in such contexts.
As for discrete random variables, we can introduce the idea of marginal PDFs. Here, we take a
little longer to consider these ideas:

Definition 2.4.20. The marginal distribution functions of X and Y are

    F(x) = lim_{y→∞} F(x, y),    F(y) = lim_{x→∞} F(x, y).

As

    F(x) = ∫_{−∞}^x ∫_{−∞}^∞ f(u, y) dy du,    F(y) = ∫_{−∞}^y ∫_{−∞}^∞ f(x, u) dx du,

the marginal density functions of X and Y are

    f(x) = ∫_{−∞}^∞ f(x, y) dy,    f(y) = ∫_{−∞}^∞ f(x, y) dx.

We remark that expectation is much the same for joint distributions (as in Section 2.4.3) of
continuous random variables:

    E[g(X, Y)] = ∫∫ g(x, y) f(x, y) dx dy = ∫_Z g(x, y) f(x, y) dx dy.

So as before, if g(x, y) = ah(x) + bk(y), then

    E[g(X, Y)] = aE[h(X)] + bE[k(Y)].

We now turn to independence; we state the following result with no proof. If you are unconvinced,
simply take the below as a definition.

Theorem 2.4.21. The random variables X and Y are independent if and only if

    F(x, y) = F(x)F(y)

or, equivalently,

    f(x, y) = f(x)f(y).
Example 2.4.22. Let Z = (R⁺)² = X × Y and

    f(x, y) = λ² e^{−λ(x+y)},  (x, y) ∈ Z,  λ > 0.

Then one has

    f(x) = λe^{−λx}, x ∈ X,    f(y) = λe^{−λy}, y ∈ Y.

In addition, what is the probability that X > Y?

    P(X > Y) = ∫_0^∞ ∫_0^x f(x, y) dy dx
             = ∫_0^∞ (∫_0^x λe^{−λy} dy) λe^{−λx} dx
             = ∫_0^∞ [−e^{−λy}]_0^x λe^{−λx} dx
             = ∫_0^∞ (1 − e^{−λx}) λe^{−λx} dx
             = [−e^{−λx} + (1/2)e^{−2λx}]_0^∞
             = 1 − 1/2 = 1/2.
Again, one has the idea of covariance and correlation for continuous valued random variables:

    Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
              = ∫_Z xy f(x, y) dx dy − ∫ x f(x) dx ∫ y f(y) dy,

    ρ(X, Y) = Cov[X, Y] / √(Var[X] Var[Y]).

Example 2.4.23. Let Z = R² and define

    f(x, y) = (1/(2π√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))(x² + y² − 2ρxy)},  (x, y) ∈ Z,  ρ ∈ (−1, 1).

Let us check that this is a PDF; clearly it is non-negative on Z. So

    ∫_Z f(x, y) dx dy = ∫∫ (1/(2π√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))(x² + y² − 2ρxy)} dx dy
                      = ∫ (1/(2π√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))(y² − ρ²y²)} ∫ exp{−(1/(2(1 − ρ²)))(x − ρy)²} dx dy
                      = ∫ (1/√(2π)) exp{−(1/(2(1 − ρ²)))(y² − ρ²y²)} dy
                      = ∫ (1/√(2π)) exp{−y²/2} dy = 1.

Here we have completed the square in the integral in x and used the change of variable u = (x −
ρy)/(1 − ρ²)^{1/2}. Thus f(x, y) defines a joint PDF. Moreover, it is called the standard bi-variate
normal distribution. Now let us consider the marginal PDFs:

    f(x) = ∫ (1/(2π√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))(x² + y² − 2ρxy)} dy,  x ∈ R
         = (1/(2π√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))(x² − ρ²x²)} ∫ exp{−(1/(2(1 − ρ²)))(y − ρx)²} dy
         = (1/√(2π)) e^{−x²/2}.

That is, X ∼ N(0, 1). By similar arguments, Y ∼ N(0, 1). Now, let us consider:

    E[XY] = ∫∫ xy (1/(2π√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))(x² + y² − 2ρxy)} dx dy
          = ∫ y (1/(2π√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))(y² − ρ²y²)} ∫ x exp{−(1/(2(1 − ρ²)))(x − ρy)²} dx dy
          = ∫ y (1/√(2π)) e^{−y²/2} (ρy) dy
          = ρ ∫ y² (1/√(2π)) e^{−y²/2} dy = ρ.

Here the inner integral in x equals (2π(1 − ρ²))^{1/2} times ρy, the mean of an N(ρy, 1 − ρ²) density,
and we have used the results in Examples 2.4.8 and 2.4.12 that the first and second (raw) moments
of N(μ, σ²) are μ and σ² + μ². Therefore

    Cov[X, Y] = ρ,    ρ(X, Y) = ρ/1 = ρ.

Note also that if ρ = 0 then:

    f(x, y) = (1/(2π)) exp{−(1/2)(x² + y²)},  (x, y) ∈ Z,

so, as

    f(x) = (1/√(2π)) e^{−x²/2}, x ∈ R,    f(y) = (1/√(2π)) e^{−y²/2}, y ∈ R,

we have f(x, y) = f(x)f(y). That is, standard bi-variate normal random variables are independent
if and only if they are uncorrelated. Note that this does not always occur for other random variables.

We conclude this section with a mention of joint moment generating functions:

Definition 2.4.24. For continuous random variables X and Y the joint moment generating
function is

    M(t₁, t₂) = E[e^{Xt₁ + Yt₂}] = ∫_Z e^{xt₁ + yt₂} f(x, y) dx dy,  (t₁, t₂) ∈ T,

where T is the set of (t₁, t₂) for which ∫_Z e^{xt₁ + yt₂} f(x, y) dx dy < ∞.

Exercise 2.4.25. Show that

    E[X] = ∂M(t₁, t₂)/∂t₁ |_{t₁=t₂=0},  E[Y] = ∂M(t₁, t₂)/∂t₂ |_{t₁=t₂=0},  E[XY] = ∂²M(t₁, t₂)/∂t₁∂t₂ |_{t₁=t₂=0}

when the derivatives exist.

2.4.5 Conditional Distributions and Expectations


The idea of conditional distributions and expectations can be further extended to continuous random
variables. However, we recall that the event {X = x} has zero probability, so how can we condition
on it? It turns out that we cannot have a rigorous answer to this question (i.e. at the level of
mathematics in this course), so some faith must be extended towards the following definitions.
Definition 2.4.26. The conditional distribution function of Y given X = x is the function:

    F(y|x) = ∫_{−∞}^y (f(x, v)/f(x)) dv

for any x such that f(x) > 0.

Definition 2.4.27. The conditional density function of F(y|x), written f(y|x), is given by:

    f(y|x) = f(x, y)/f(x)

for any x such that f(x) > 0.
We remark that clearly

    f(y|x) = f(x, y) / ∫_{−∞}^∞ f(x, y) dy.
Example 2.4.28. Suppose that Y | X = x ∼ N(x, 1) and X ∼ N(0, 1), Z = R². Find the PDFs
f(x, y) and f(x|y). We have

    f(x, y) = f(y|x) f(x),

so

    f(x, y) = (1/(2π)) exp{−(1/2)((y − x)² + x²)},  (x, y) ∈ Z.

Now

    f(x|y) = f(x, y)/f(y) = f(y|x) f(x)/f(y),

which is Bayes' theorem for continuous random variables. Now for y ∈ R

    f(y) = ∫ (1/(2π)) exp{−(1/2)((y − x)² + x²)} dx
         = (1/(2π)) exp{−y²/2} ∫ exp{−(1/2)(2x² − 2xy)} dx
         = (1/(2π)) exp{−y²/2} ∫ exp{−[(x − y/2)² − y²/4]} dx
         = (1/(2π)) exp{−y²/4} (1/√2) ∫ e^{−u²/2} du
         = (1/(2√π)) exp{−y²/4}.

So Y ∼ N(0, 2). Then:

    f(x|y) = [(1/(2π)) exp{−(1/2)((y − x)² + x²)}] / [(1/(2√π)) exp{−y²/4}],  x ∈ R.

After some calculations (exercise) one obtains:

    f(x|y) = (1/√π) exp{−(x − y/2)²},  x ∈ R,

that is, X | Y = y ∼ N(y/2, 1/2).
Example 2.4.29. Let X and Y have joint PDF:

    f(x, y) = 1/x,  (x, y) ∈ Z = {(x, y) : 0 ≤ y ≤ x ≤ 1}.

Then for x ∈ [0, 1]

    f(x) = ∫_0^x (1/x) dy = 1.

In addition, for (x, y) ∈ Z:

    f(y|x) = (1/x)/1 = 1/x,

that is, if x ∈ [0, 1], then conditional upon x, Y ∼ U[0, x] (the uniform distribution on [0, x]). Now,
suppose we are interested in P(X² + Y² ≤ 1 | X = x). Let x ≥ 0, and define

    A(x) := {y ∈ R : 0 ≤ y ≤ x, x² + y² ≤ 1},

that is, A(x) = [0, x ∧ (1 − x²)^{1/2}], where a ∧ b = min{a, b}. Thus (x ∈ [0, 1])

    P(X² + Y² ≤ 1 | X = x) = ∫_{A(x)} f(y|x) dy
                           = ∫_0^{x ∧ (1−x²)^{1/2}} (1/x) dy
                           = (x ∧ (1 − x²)^{1/2})/x.

To compute P(X² + Y² ≤ 1), we note, by setting A = {(x, y) : (x, y) ∈ Z, x² + y² ≤ 1}, we are
interested in P(A), which is:

    P(X² + Y² ≤ 1) = ∫_A f(x, y) dy dx
                   = ∫_0^1 ∫_{A(x)} f(y|x) dy f(x) dx
                   = ∫_0^1 ((x ∧ (1 − x²)^{1/2})/x) dx.

Now x ≤ (1 − x²)^{1/2} iff x ≤ 1/√2, hence

    P(X² + Y² ≤ 1) = ∫_0^{1/√2} dx + ∫_{1/√2}^1 ((1 − x²)^{1/2}/x) dx.  (2.4.1)

Now we have, setting x = cos(θ),

    ∫_{1/√2}^1 ((1 − x²)^{1/2}/x) dx = ∫_0^{π/4} sin(θ) tan(θ) dθ
                                     = [log(|sec(θ) + tan(θ)|) − sin(θ)]_0^{π/4}
                                     = log(1 + √2) − 1/√2,

where we have used integration tables to obtain the integral³. Thus, returning to (2.4.1):

    P(X² + Y² ≤ 1) = 1/√2 + log(1 + √2) − 1/√2 = log(1 + √2).

3. You will not be expected to perform such an integration in the examination.
Given the previous discussion, we have the notion of a conditional expectation for continuous random
variables:

    E[g(Y) | X = x] = ∫ g(y) f(y|x) dy.

As for discrete-valued random variables, we have the following result, which is again not proved.
Theorem 2.4.30. Consider jointly continuous random variables X and Y with g, h : R → R with
h(X), g(Y) continuous. Then:

    E[h(X) g(Y)] = E[E[g(Y)|X] h(X)] = ∫ (∫ g(y) f(y|x) dy) h(x) f(x) dx

whenever the expectations exist.


Example 2.4.31. Let us return to Example 2.4.28. Then

    E[XY] = E[Y E[X|Y]]
          = E[Y(Y/2)]
          = (1/2)(2) = 1.

Thus, as Var[X] = 1 and Var[Y] = 2,

    ρ(X, Y) = 1/√2.
Example 2.4.32. Suppose Y | X = x ∼ N(x, σ₁²) and X ∼ N(μ, σ₂²). To find the marginal distribu-
tion of Y, one can use MGFs and conditional expectations. We have:

    M(t) = E[e^{Yt}]
         = E[E[e^{Yt} | X]]
         = E[e^{Xt + σ₁²t²/2}]
         = e^{σ₁²t²/2} E[e^{Xt}]
         = e^{σ₁²t²/2} e^{μt + σ₂²t²/2}
         = exp{μt + (σ₁² + σ₂²)t²/2}.

Here we have used Example 2.4.17 for the MGF of a normal distribution. Thus we can conclude that

    Y ∼ N(μ, σ₁² + σ₂²).
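A quick simulation sketch of this conclusion (assuming numpy; not part of the notes):

```python
# X ~ N(mu, s2^2) and Y | X = x ~ N(x, s1^2) should give Y ~ N(mu, s1^2 + s2^2).
import numpy as np

rng = np.random.default_rng(5)
mu, s1, s2 = 1.0, 0.8, 1.5
x = rng.normal(mu, s2, size=10**6)
y = rng.normal(x, s1)            # conditional draw given each sampled x
print(y.mean(), y.var())         # approx mu = 1.0 and s1**2 + s2**2 = 2.89
```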

2.4.6 Functions of Random Variables


Consider a random variable X and g : R → R a continuous and invertible function (continuity can
be weakened, but let us use it for now). Suppose Y = g(X); what is the distribution of Y? We have:

    F(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ∈ g⁻¹((−∞, y])) = ∫_{g⁻¹((−∞,y])} f(x) dx.

Note that for A ⊆ R, g⁻¹(A) = {x ∈ R : g(x) ∈ A}.


Example 2.4.33. Let X ∼ N(0, 1) and set

    Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−u²/2} du

to denote the standard normal CDF. Suppose Y = X²; then for y ≥ 0 (note Y = R⁺)

    P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y).

Now

    P(−√y ≤ X ≤ √y) = P(X ≤ √y) − P(X ≤ −√y)
                    = Φ(√y) − Φ(−√y)
                    = Φ(√y) − [1 − Φ(√y)]
                    = 2Φ(√y) − 1.

Then

    f(y) = (d/dy)F(y) = (1/√y)Φ′(√y) = (1/√(2πy)) e^{−y/2},  y ∈ Y,

that is, Y ∼ G(1/2, 1/2). This is also called the chi-squared distribution on one degree of freedom.
We remark that Y and X are clearly not independent (if X changes so does Y, and in a deterministic
manner). Now

    E[XY] = E[X³] = 0.

In addition E[X] = 0 and E[Y] = 1. So X and Y are uncorrelated, but they are not independent.
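This is the promised example of uncorrelated but dependent random variables; a simulation sketch (assuming numpy; illustration only) makes the point numerically.

```python
# X ~ N(0,1) and Y = X^2 have correlation near 0, yet Y is a function of X.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=10**6)
y = x**2
print(np.corrcoef(x, y)[0, 1])            # approximately 0: uncorrelated
print(np.corrcoef(np.abs(x), y)[0, 1])    # clearly non-zero: not independent
```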

We now move onto a more general change of variable formula. Suppose X₁ and X₂ have a joint
density on Z and we set

    (Y₁, Y₂) = T(X₁, X₂),

where T : Z → T, with T differentiable and invertible and T ⊆ R². What is the joint density
function of (Y₁, Y₂) on T? As T is invertible, we set

    X₁ = T₁⁻¹(Y₁, Y₂),  X₂ = T₂⁻¹(Y₁, Y₂).

Then we define

    J(y₁, y₂) := det [ ∂T₁⁻¹/∂y₁  ∂T₂⁻¹/∂y₁ ; ∂T₁⁻¹/∂y₂  ∂T₂⁻¹/∂y₂ ]
               = (∂T₁⁻¹/∂y₁)(∂T₂⁻¹/∂y₂) − (∂T₂⁻¹/∂y₁)(∂T₁⁻¹/∂y₂)

to be the Jacobian of the transformation. Then we have the following result, which simply follows from
the change of variable rule for integration:
Theorem 2.4.34. If (X₁, X₂) have joint density f(x₁, x₂) on Z, then for (Y₁, Y₂) = T(X₁, X₂), with
T as described above, the joint density of (Y₁, Y₂), denoted g, is:

    g(y₁, y₂) = f(T₁⁻¹(y₁, y₂), T₂⁻¹(y₁, y₂)) |J(y₁, y₂)|,  (y₁, y₂) ∈ T.

Example 2.4.35. Suppose Z = (R⁺)² and

    f(x₁, x₂) = λ² e^{−λ(x₁+x₂)},  (x₁, x₂) ∈ Z.

Let

    (Y₁, Y₂) = (X₁ + X₂, X₁/X₂).

To find the joint density of (Y₁, Y₂) we first note that T = Z and that

    (X₁, X₂) = (Y₁Y₂/(1 + Y₂), Y₁/(1 + Y₂)).

Then the Jacobian is

    J(y₁, y₂) = det [ y₂/(1 + y₂)   1/(1 + y₂) ; y₁/(1 + y₂)²   −y₁/(1 + y₂)² ] = −y₁/(1 + y₂)².

Thus:

    g(y₁, y₂) = λ² e^{−λy₁} y₁/(1 + y₂)²,  (y₁, y₂) ∈ T.

One can check that indeed Y₁ and Y₂ are independent and that the marginal densities are:

    g(y₁) = λ² y₁ e^{−λy₁}, y₁ ∈ R⁺;    g(y₂) = 1/(1 + y₂)², y₂ ∈ R⁺.
28
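One can verify the conclusions of this example by simulation (a sketch; the rate $\lambda$ and the evaluation points are illustrative): the marginal CDF of $Y_2$ is $y_2/(1+y_2)$ and the joint law factorizes.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 1.5, 10**6                       # illustrative rate
x1 = rng.exponential(1 / lam, size=n)     # X1, X2 i.i.d. E(lam)
x2 = rng.exponential(1 / lam, size=n)
y1, y2 = x1 + x2, x1 / x2

# Y1 ~ G(2, lam): mean 2/lam; marginal CDF of Y2 at 1 is 1/2
print(y1.mean(), 2 / lam)
print(np.mean(y2 <= 1.0), 0.5)

# independence: the joint CDF factorizes (checked at one illustrative point)
a, b = 2.0, 1.0
print(np.mean((y1 <= a) & (y2 <= b)), np.mean(y1 <= a) * np.mean(y2 <= b))
```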

Example 2.4.36. Suppose we have $(X_1, X_2)$ as in the case of Example 2.4.35, except we have the mapping:
$$(Y_1, Y_2) = (X_1, X_1 + X_2).$$
Now, clearly $Y_2 \ge Y_1$, so this transformation induces the support for $(Y_1, Y_2)$ as:
$$\mathsf{T} = \{(y_1, y_2) : 0\le y_1\le y_2,\ y_2\in\mathbb{R}_+\}.$$
Then
$$(X_1, X_2) = (Y_1, Y_2 - Y_1)$$
and clearly $J(y_1, y_2) = 1$; thus
$$g(y_1, y_2) = \lambda^2 e^{-\lambda y_2}, \quad (y_1, y_2)\in\mathsf{T}.$$

Example 2.4.37. Let $X_1$ and $X_2$ be independent $N(0,1)$ random variables ($\mathsf{Z} = \mathbb{R}^2$) and, for $\rho\in[-1,1]$, let
$$(Y_1, Y_2) = \big(X_1,\ \rho X_1 + \sqrt{1-\rho^2}\,X_2\big).$$
Then $\mathsf{T} = \mathsf{Z}$ and
$$(X_1, X_2) = \Big(Y_1, \frac{Y_2 - \rho Y_1}{\sqrt{1-\rho^2}}\Big).$$
Then the Jacobian is
$$J(y_1, y_2) := \begin{vmatrix} 1 & -\frac{\rho}{\sqrt{1-\rho^2}} \\ 0 & \frac{1}{\sqrt{1-\rho^2}} \end{vmatrix} = \frac{1}{\sqrt{1-\rho^2}}.$$
Thus
$$g(y_1, y_2) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Big\{-\frac12\Big(y_1^2 + \frac{(y_2-\rho y_1)^2}{1-\rho^2}\Big)\Big\} = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Big\{-\frac{1}{2(1-\rho^2)}\big(y_1^2 + y_2^2 - 2\rho y_1 y_2\big)\Big\}, \quad (y_1, y_2)\in\mathsf{T}.$$
On inspection of Example 2.4.23, $(Y_1, Y_2)$ follow a standard bivariate normal distribution.

Distributions associated to the Normal Distribution

In the following Subsection we investigate some distributions that are related to the normal distribution. They appear frequently in hypothesis testing, which is something that we will investigate in the following Chapter. We note that, in a similar way to the way in which joint distributions are defined (Definition 2.4.19), one can easily extend to joint distributions of $n\ge 2$ variables (so that there is a joint distribution $F(x_1,\ldots,x_n)$ and density $f(x_1,\ldots,x_n)$).
We begin with the idea of the chi-square distribution on $n > 0$ degrees of freedom:
$$f(x) = \frac{1}{2^{n/2}\Gamma(n/2)}x^{n/2-1}e^{-x/2}, \quad x\in\mathsf{X}=\mathbb{R}_+.$$
We write $X \sim \chi^2_n$; you will notice that also $X \sim G(n/2, 1/2)$, so that $E[X] = n$. Note that from Table 4.2, we have that
$$M(t) = \frac{1}{(1-2t)^{n/2}}, \quad \frac12 > t.$$
We have the following result:

Proposition 2.4.38. Let $X_1,\ldots,X_n$ be independent and identically distributed (i.i.d.) standard normal random variables (i.e. $X_i \stackrel{\text{i.i.d.}}{\sim} N(0,1)$). Let
$$Z = X_1^2 + \cdots + X_n^2.$$
Then $Z \sim \chi^2_n$.

Proof. We will use moment generating functions. We have
$$M(t) = E[e^{Zt}] = E[e^{t\sum_{i=1}^n X_i^2}] = \prod_{i=1}^n E[e^{tX_i^2}] = E[e^{X_1^2 t}]^n.$$
Here we have used the independence property on the third equality and the fact that the random variables are identically distributed on the last. Now, for $\frac12 > t$,
$$E[e^{X_1^2 t}] = \int_{\mathbb{R}}\frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac12(1-2t)x_1^2\Big\}dx_1 = \frac{1}{(1-2t)^{1/2}}\int_{\mathbb{R}}\frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{u^2}{2}\Big\}du = \frac{1}{(1-2t)^{1/2}},$$
where we have substituted $u = (1-2t)^{1/2}x_1$. Hence
$$M(t) = \frac{1}{(1-2t)^{n/2}}, \quad \frac12 > t,$$
and we conclude the result, as this is the MGF of the $\chi^2_n$ distribution.
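A quick simulation check of Proposition 2.4.38 (a sketch; the value of $n$ and the evaluation point are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 5, 10**6
z = (rng.standard_normal((m, n)) ** 2).sum(axis=1)    # Z = X_1^2 + ... + X_n^2

print(z.mean(), n)                                    # E[Z] = n
print(np.mean(z <= 9.0), stats.chi2.cdf(9.0, df=n))   # empirical vs chi^2_n CDF
```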
The standard t-distribution on $n$ degrees of freedom is ($n > 0$):
$$p(x) = \frac{\Gamma([n+1]/2)}{\sqrt{n\pi}\,\Gamma(n/2)}\Big(1+\frac{x^2}{n}\Big)^{-(n+1)/2}, \quad x\in\mathsf{X}=\mathbb{R}.$$
We write $X \sim T_n$. Then we have the following important result.

Proposition 2.4.39. Let $X \sim N(0,1)$ and independently $Y \sim \chi^2_n$, $n > 0$. Let
$$T = \frac{X}{\sqrt{Y/n}}.$$
Then $T \sim T_n$.
Proof. We have
$$f(x, y) = \frac{1}{\sqrt{2\pi}}e^{-\frac12 x^2}\,\frac{1}{2^{n/2}\Gamma(n/2)}y^{n/2-1}e^{-\frac12 y}, \quad (x,y)\in\mathsf{Z}=\mathbb{R}\times\mathbb{R}_+.$$
We will use Theorem 2.4.34, with the transformation defined by
$$T = \frac{X}{\sqrt{Y/n}}, \qquad S = Y,$$
and then marginalize out $S$. The inverse transformation is $X = T\sqrt{S/n}$, $Y = S$, so the Jacobian of the transformation is:
$$J(t, s) := \begin{vmatrix} \sqrt{s/n} & 0 \\ \frac{t}{2\sqrt{sn}} & 1 \end{vmatrix} = \sqrt{\frac{s}{n}}.$$
Then
$$f(t, s) = \frac{1}{\sqrt{2\pi n}\,2^{n/2}\Gamma(n/2)}\,s^{n/2-1/2}\exp\Big\{-\frac{s}{2}\Big[\frac{t^2}{n}+1\Big]\Big\}, \quad (t, s)\in\mathbb{R}\times\mathbb{R}_+.$$
Then we have, for $t\in\mathbb{R}$ (substituting $u = \frac{s}{2}[\frac{t^2}{n}+1]$),
$$f(t) = \frac{1}{\sqrt{2\pi n}\,2^{n/2}\Gamma(n/2)}\int_0^\infty s^{n/2-1/2}\exp\Big\{-\frac{s}{2}\Big[\frac{t^2}{n}+1\Big]\Big\}ds = \frac{1}{\sqrt{2\pi n}\,2^{n/2}\Gamma(n/2)}\,\frac{1}{\big(\frac12+\frac{t^2}{2n}\big)^{\frac{n+1}{2}}}\int_0^\infty u^{n/2-1/2}e^{-u}\,du$$
$$= \frac{1}{\sqrt{2\pi n}\,2^{n/2}\Gamma(n/2)}\,\frac{\Gamma([n+1]/2)}{\big(\frac12+\frac{t^2}{2n}\big)^{\frac{n+1}{2}}} = \frac{\Gamma([n+1]/2)}{\sqrt{n\pi}\,\Gamma(n/2)}\Big(1+\frac{t^2}{n}\Big)^{-(n+1)/2}$$
and we conclude.
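One can likewise check Proposition 2.4.39 by simulating the construction $T = X/\sqrt{Y/n}$ and comparing empirical quantiles with those of $T_n$ (a sketch with illustrative values; scipy supplies the reference quantiles):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 4, 10**6
x = rng.standard_normal(m)                            # X ~ N(0, 1)
y = stats.chi2.rvs(df=n, size=m, random_state=rng)    # Y ~ chi^2_n, independent of X
t = x / np.sqrt(y / n)                                # T = X / sqrt(Y/n)

for q in [0.9, 0.95, 0.99]:
    print(np.quantile(t, q), stats.t.ppf(q, df=n))    # empirical vs T_n quantiles
```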

The last distribution that we consider is the standard F-distribution on $d_1, d_2 > 0$ degrees of freedom:
$$f(x) = \frac{1}{B(\frac{d_1}{2}, \frac{d_2}{2})}\Big(\frac{d_1}{d_2}\Big)^{d_1/2}x^{d_1/2-1}\Big(1+\frac{d_1}{d_2}x\Big)^{-\frac{d_1+d_2}{2}}, \quad x\in\mathsf{X}=\mathbb{R}_+,$$
where
$$B\Big(\frac{d_1}{2}, \frac{d_2}{2}\Big) = \frac{\Gamma(\frac{d_1}{2})\Gamma(\frac{d_2}{2})}{\Gamma(\frac{d_1+d_2}{2})}.$$
We write $X \sim F_{(d_1,d_2)}$. We have the following result.

Proposition 2.4.40. Let $X \sim \chi^2_{d_1}$ and independently $Y \sim \chi^2_{d_2}$, $d_1, d_2 > 0$. Let
$$F = \frac{X/d_1}{Y/d_2}.$$
Then $F \sim F_{(d_1,d_2)}$.


Proof. We have
$$f(x, y) = \frac{1}{2^{d_1/2}\Gamma(d_1/2)}x^{d_1/2-1}e^{-\frac12 x}\,\frac{1}{2^{d_2/2}\Gamma(d_2/2)}y^{d_2/2-1}e^{-\frac12 y}, \quad (x,y)\in\mathsf{Z}=\mathbb{R}_+\times\mathbb{R}_+.$$
We will use Theorem 2.4.34, with the transformation defined by
$$T = \frac{X/d_1}{Y/d_2}, \qquad S = Y,$$
and then marginalize out $S$. The inverse transformation is $X = (d_1/d_2)TS$, $Y = S$, so the Jacobian of the transformation is:
$$J(t, s) := \begin{vmatrix} \frac{d_1 s}{d_2} & 0 \\ \frac{d_1 t}{d_2} & 1 \end{vmatrix} = \frac{d_1 s}{d_2}.$$
Then
$$f(t, s) = \frac{1}{2^{(d_1+d_2)/2}\Gamma(d_1/2)\Gamma(d_2/2)}\,\frac{d_1 s}{d_2}\Big(\frac{d_1 s t}{d_2}\Big)^{d_1/2-1}s^{d_2/2-1}e^{-\frac{d_1 s t}{2d_2}}e^{-s/2}, \quad (t, s)\in\mathsf{T}=(\mathbb{R}_+)^2.$$
To shorten the subsequent notations, let
$$g = \frac{1}{2^{(d_1+d_2)/2}\Gamma(d_1/2)\Gamma(d_2/2)}\Big(\frac{d_1}{d_2}\Big)^{d_1/2}.$$
Then, for $t\in\mathbb{R}_+$ (substituting $u = s\big(\frac12[1+\frac{d_1 t}{d_2}]\big)$),
$$f(t) = g\,t^{d_1/2-1}\int_0^\infty s^{\frac{d_1+d_2}{2}-1}\exp\Big\{-s\Big(\frac12\Big[1+\frac{d_1 t}{d_2}\Big]\Big)\Big\}ds = g\,t^{d_1/2-1}\Big(\frac12\Big[1+\frac{d_1 t}{d_2}\Big]\Big)^{-(d_1+d_2)/2}\int_0^\infty u^{(d_1+d_2)/2-1}e^{-u}\,du$$
$$= g\,t^{d_1/2-1}\Big(\frac12\Big[1+\frac{d_1 t}{d_2}\Big]\Big)^{-(d_1+d_2)/2}\Gamma((d_1+d_2)/2) = g\,\Gamma((d_1+d_2)/2)\,2^{\frac{d_1+d_2}{2}}\,t^{d_1/2-1}\Big(1+\frac{d_1}{d_2}t\Big)^{-\frac{d_1+d_2}{2}}$$
$$= \frac{1}{B(\frac{d_1}{2}, \frac{d_2}{2})}\Big(\frac{d_1}{d_2}\Big)^{d_1/2}t^{d_1/2-1}\Big(1+\frac{d_1}{d_2}t\Big)^{-\frac{d_1+d_2}{2}}$$
and we conclude.
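Similarly, a simulation sketch checking Proposition 2.4.40 (illustrative degrees of freedom and evaluation point):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d1, d2, m = 3, 8, 10**6
x = stats.chi2.rvs(df=d1, size=m, random_state=rng)   # X ~ chi^2_{d1}
y = stats.chi2.rvs(df=d2, size=m, random_state=rng)   # Y ~ chi^2_{d2}, independent
f = (x / d1) / (y / d2)                               # F = (X/d1) / (Y/d2)

print(np.mean(f <= 2.0), stats.f.cdf(2.0, dfn=d1, dfd=d2))   # empirical vs F CDF
```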

2.5 Convergence of Random Variables


In the following Section we provide a brief introduction to convergence of random variables and, in particular, two modes of convergence:

- Convergence in Distribution
- Convergence in Probability

These properties are rather important in probability and mathematical statistics and provide a way to justify many statistical and numerical procedures. The properties are rather loosely associated to the notion of a sequence of random variables $X_1, X_2, \ldots, X_n$ (there will typically be infinitely many of them) and we will be concerned with the idea of what happens to the distribution or some functional as $n\to\infty$. The second part of this Section will focus on perhaps the most important result in probability: the central limit theorem. This provides a characterization of the random variable
$$S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$$
with $X_1, \ldots, X_n$ independent and identically distributed with zero mean and unit variance. This particular result is very useful in hypothesis testing, which we shall see later on. Throughout, we will focus on continuous random variables, but one can extend this notion.

2.5.1 Convergence in Distribution


Consider a sequence of random variables $X_1, X_2, \ldots$, with associated distribution functions $F_1, F_2, \ldots$; we then have the following definition:

Definition 2.5.1. We say that the sequence of distribution functions $F_1, F_2, \ldots$ converges to a distribution function $F$ if $\lim_{n\to\infty}F_n(x) = F(x)$ at each point $x$ where $F$ is continuous. If $X$ has distribution function $F$, we say that $X_n$ converges in distribution to $X$ and write $X_n \stackrel{d}{\to} X$.

Example 2.5.2. Consider a sequence of random variables $X_n \in \mathsf{X}_n = [0, n]$, $n\ge 1$, with
$$F_n(x) = 1 - \Big(1 - \frac{x}{n}\Big)^n, \quad x\in\mathsf{X}_n.$$
Then for any fixed $x\in\mathbb{R}_+$
$$\lim_{n\to\infty}F_n(x) = 1 - e^{-x}.$$
Now if $X \sim E(1)$:
$$F(x) = \int_0^x e^{-u}\,du = 1 - e^{-x}.$$
So $X_n \stackrel{d}{\to} X$, $X \sim E(1)$.
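Numerically (an illustration, with an arbitrary evaluation point), the convergence of $F_n(x)$ to $1 - e^{-x}$ is easy to see:

```python
import numpy as np

x = 1.3                                  # any fixed point x >= 0 (illustrative)
for n in [10, 100, 1000, 10000]:
    Fn = 1 - (1 - x / n) ** n            # F_n(x) = 1 - (1 - x/n)^n
    print(n, Fn, 1 - np.exp(-x))         # converges to 1 - e^{-x}
```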
Example 2.5.3. Consider a sequence of random variables $X_n \in \mathsf{X} = [0, 1]$, $n\ge 1$, with
$$F_n(x) = x - \frac{\sin(2n\pi x)}{2n\pi}.$$
Then for any fixed $x\in[0,1]$
$$\lim_{n\to\infty}F_n(x) = x.$$
Now if $X \sim U_{[0,1]}$, $x\in[0,1]$:
$$F(x) = \int_0^x du = x.$$
So $X_n \stackrel{d}{\to} X$, $X \sim U_{[0,1]}$.

2.5.2 Convergence in Probability


We now consider an alternative notion of convergence for a sequence of random variables $X_1, X_2, \ldots$.

Definition 2.5.4. We say that $X_n$ converges in probability to a constant $c\in\mathbb{R}$ (written $X_n \stackrel{P}{\to} c$) if for every $\varepsilon > 0$
$$\lim_{n\to\infty}P(|X_n - c| > \varepsilon) = 0.$$

We note that convergence in probability can be extended to convergence to a random variable $X$, but we do not do this, as it is beyond the level of this course. We have the following useful result, which we do not prove.

Theorem 2.5.5. Consider a sequence of random variables $X_1, X_2, \ldots$, with associated distribution functions $F_1, F_2, \ldots$. If for every $x\in\mathsf{X}_n$
$$\lim_{n\to\infty}F_n(x) = c$$
then $X_n \stackrel{P}{\to} c$.
Example 2.5.6. Consider a sequence of random variables $X_n \in \mathsf{X} = \mathbb{R}_+$, $n\ge 1$, with
$$F_n(x) = \Big(\frac{x}{1+x}\Big)^n.$$
Then for any fixed $x\in\mathsf{X}$
$$\lim_{n\to\infty}F_n(x) = 0.$$
Thus $X_n \stackrel{P}{\to} 0$.

We finish the Section with a rather important result, which we again do not prove. It is called the weak law of large numbers:

Theorem 2.5.7. Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables with $E[|X_1|] < \infty$. Then
$$\frac{1}{n}\sum_{i=1}^n X_i \stackrel{P}{\to} E[X_1].$$
Note that this result extends to functions; i.e. if $g : \mathbb{R}\to\mathbb{R}$ with $E[|g(X_1)|] < \infty$ then
$$\frac{1}{n}\sum_{i=1}^n g(X_i) \stackrel{P}{\to} E[g(X_1)].$$

Example 2.5.8. Consider an integral
$$I = \int_{\mathbb{R}} g(x)\,dx$$
where we will suppose that $g(x) \ne 0$. Suppose we cannot calculate $I$. Consider any non-zero pdf $f(x)$ on $\mathbb{R}$. Then
$$I = \int_{\mathbb{R}}\frac{g(x)}{f(x)}f(x)\,dx$$
and suppose that
$$\int_{\mathbb{R}}\Big|\frac{g(x)}{f(x)}\Big|f(x)\,dx < \infty.$$
Then by the weak law of large numbers, if $X_1, \ldots, X_n$ are i.i.d. with pdf $f$ then
$$\frac{1}{n}\sum_{i=1}^n\frac{g(X_i)}{f(X_i)} \stackrel{P}{\to} I.$$
This provides a justification for a numerical method (called the Monte Carlo method) to approximate integrals.
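To make this concrete (an illustration not in the notes), one can approximate $I = \int_{\mathbb{R}} e^{-x^2}\,dx = \sqrt{\pi}$, taking $f$ to be the standard normal pdf; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6

g = lambda x: np.exp(-x**2)                             # integrand; I = sqrt(pi)
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # proposal pdf (standard normal)

x = rng.standard_normal(n)                              # X_i i.i.d. with pdf f
print(np.mean(g(x) / f(x)), np.sqrt(np.pi))             # (1/n) sum g(X_i)/f(X_i) -> I
```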

2.5.3 Central Limit Theorem


We close the Section and Chapter with the central limit theorem (CLT), which we state without proof.

Theorem 2.5.9. Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables with $E[|X_1|] < \infty$ and $0 < \mathrm{Var}[X_1] < \infty$. Then
$$\sqrt{n}\,\Big(\frac{1}{n}\sum_{i=1}^n\frac{X_i - E[X_1]}{\sqrt{\mathrm{Var}[X_1]}}\Big) \stackrel{d}{\to} Z$$
where $Z \sim N(0,1)$.

The CLT asserts that the distribution of the summation $\sum_{i=1}^n X_i$, suitably normalized and centered, will converge to that of a normal distribution. An often-used idea, via the CLT, is normal approximation. That is, for $n$ large
$$\sqrt{n}\,\Big(\frac{1}{n}\sum_{i=1}^n\frac{X_i - E[X_1]}{\sqrt{\mathrm{Var}[X_1]}}\Big) \approx N(0,1)$$
where $\approx$ means approximately distributed as, so we have⁴
$$\sum_{i=1}^n X_i \approx N\big(nE[X_1], n\mathrm{Var}[X_1]\big).$$

Example 2.5.10. Let $X_1, \ldots, X_n$ be i.i.d. $G(a/n, b)$ random variables. Then the distribution of $Z_n = \sum_{i=1}^n X_i$ can be found, for $b > t$, from:
$$M(t) = E[e^{Z_n t}] = \prod_{i=1}^n E[e^{X_i t}] = E[e^{X_1 t}]^n = \Big(\Big(\frac{b}{b-t}\Big)^{a/n}\Big)^n = \Big(\frac{b}{b-t}\Big)^a$$
where we have used the i.i.d. property and the MGF of a Gamma random variable (see Table 4.2). Thus $Z_n \sim G(a, b)$ for any $n\ge 1$. However, from the CLT, one can reasonably approximate $Z_n$, when $n$ is large, by a normal random variable with mean
$$n\cdot\frac{a}{bn} = \frac{a}{b}$$
and variance
$$n\cdot\frac{a}{nb^2} = \frac{a}{b^2}.$$

⁴ If $Z \sim N(0,1)$ you can verify that if $X = \sigma Z + \mu$ then $X \sim N(\mu, \sigma^2)$.
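As a numerical illustration of this normal approximation (the parameter values are chosen for illustration), one can compare the exact $G(a, b)$ CDF of $Z_n$ with that of $N(a/b, a/b^2)$; the agreement improves as $a$ grows:

```python
import numpy as np
from scipy import stats

a, b = 50.0, 2.0                                        # illustrative parameters
exact = stats.gamma(a, scale=1 / b)                     # Z_n ~ G(a, b) exactly
approx = stats.norm(loc=a / b, scale=np.sqrt(a) / b)    # CLT approximation

for z in [20.0, 25.0, 30.0]:
    print(z, exact.cdf(z), approx.cdf(z))               # CDFs are close
```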


Chapter 3

Introduction to Statistics

3.1 Introduction
In this final Chapter, we give a very brief introduction to statistical ideas. Here the notion is that one has observed data from some sampling distribution (the population distribution) and one wishes to infer what the properties of the population are on the basis of the observed samples. In particular, we will be interested in estimating the parameters of sampling distributions, for instance by the maximum likelihood method (Section 3.2), as well as in testing hypotheses about parameters (Section 3.3). We end the Chapter with an introduction to Bayesian statistics (Section 3.4), which is an alternative way to estimate parameters; it is more complex, but much richer than the MLE method.

3.2 Maximum Likelihood Estimation


3.2.1 Introduction
So far in this course, we have proceeded to discuss pmfs and pdfs such as $P(\lambda)$ or $N(\mu, \sigma^2)$, but we have not discussed what the parameters $\lambda$ or $(\mu, \sigma^2)$ might be. We discuss in this Section a particularly important way to estimate parameters, called maximum likelihood estimation (MLE).

The basic idea is this. Suppose one observes data $x_1, x_2, \ldots, x_n$ and we hypothesize that they follow some joint distribution $F_\theta(x_1, \ldots, x_n)$, $\theta\in\Theta$ ($\Theta$ is the parameter space, e.g. $\Theta = \mathbb{R}$). Then the idea of maximum likelihood estimation is to find the parameter which maximizes the joint pmf/pdf of the data, that is:
$$\hat\theta_n = \mathrm{argmax}_{\theta\in\Theta}\,f_\theta(x_1, \ldots, x_n).$$

3.2.2 The Method


Throughout this section, we assume that $X_1, \ldots, X_n$ are mutually independent. So, we have $X_i \stackrel{\text{i.i.d.}}{\sim} F_\theta$, and the joint pmf/pdf is:
$$f_\theta(x_1, \ldots, x_n) = f_\theta(x_1)f_\theta(x_2)\cdots f_\theta(x_n) = \prod_{i=1}^n f_\theta(x_i).$$
We call $f_\theta(x_1, \ldots, x_n)$ the likelihood of the data. As maximizing a function is equivalent to maximizing a monotonic increasing transformation of the function, we often work with the log-likelihood:
$$l_\theta(x_1, \ldots, x_n) = \log\big(f_\theta(x_1, \ldots, x_n)\big) = \sum_{i=1}^n\log\big(f_\theta(x_i)\big).$$
If $\Theta$ is some continuous space (as it generally is for our examples) and $\theta = (\theta_1, \ldots, \theta_d)$, then we can compute the gradient vector:
$$\nabla l_\theta(x_1, \ldots, x_n) = \Big(\frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial\theta_1}, \ldots, \frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial\theta_d}\Big)$$
and we would like to solve, for $\theta$ (below $0$ is the $d$-dimensional vector of zeros)
$$\nabla l_\theta(x_1, \ldots, x_n) = 0. \qquad (3.2.1)$$
The solution of this equation (assuming it exists) is a maximum if the Hessian matrix is negative definite:
$$H(\theta) := \begin{pmatrix} \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_1^2} & \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_1\partial\theta_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_d\partial\theta_1} & \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_d\partial\theta_2} & \cdots & \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_d^2} \end{pmatrix}.$$
If the $d$ numbers $\lambda_1, \ldots, \lambda_d$ which solve $|\lambda I_d - H(\theta)| = 0$, with $I_d$ the $d\times d$ identity matrix, are all negative, then $\theta$ is a local maximum of $l_\theta(x_1, \ldots, x_n)$. If $d = 1$ then this just boils down to checking whether the second derivative of the log-likelihood is negative at the solution of (3.2.1).

Thus, in summary, the approach we employ is as follows:

1. Compute the likelihood $f_\theta(x_1, \ldots, x_n)$.

2. Compute the log-likelihood $l_\theta(x_1, \ldots, x_n)$ and its gradient vector $\nabla l_\theta(x_1, \ldots, x_n)$.

3. Solve $\nabla l_\theta(x_1, \ldots, x_n) = 0$ with respect to $\theta$; call this solution $\tilde\theta_n$ (we are assuming there is only one $\tilde\theta_n$).

4. If $H(\tilde\theta_n)$ is negative definite, then $\hat\theta_n = \tilde\theta_n$.

In general, point 3 may not be possible analytically (so, for example, one can use Newton's method). However, you will not be asked to solve $\nabla l_\theta(x_1, \ldots, x_n) = 0$ unless there is an analytic solution.

3.2.3 Examples of Computing the MLE


Example 3.2.1. Let $X_1, \ldots, X_n$ be i.i.d. $P(\lambda)$ random variables. Let us compute the MLE of $\theta = \lambda$, given observations $x_1, \ldots, x_n$. First, we have
$$f_\lambda(x_1, \ldots, x_n) = \prod_{i=1}^n\frac{\lambda^{x_i}e^{-\lambda}}{x_i!} = e^{-n\lambda}\lambda^{\sum_{i=1}^n x_i}\frac{1}{\prod_{i=1}^n x_i!}.$$
Second, the log-likelihood is:
$$l_\lambda(x_1, \ldots, x_n) = \log(f_\lambda(x_1, \ldots, x_n)) = -n\lambda + \Big(\sum_{i=1}^n x_i\Big)\log(\lambda) - \log\Big(\prod_{i=1}^n x_i!\Big).$$
The gradient vector is a derivative:
$$\frac{dl_\lambda(x_1, \ldots, x_n)}{d\lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^n x_i.$$
Thirdly,
$$-n + \frac{1}{\lambda}\sum_{i=1}^n x_i = 0$$
so
$$\tilde\lambda_n = \frac{1}{n}\sum_{i=1}^n x_i.$$
Fourthly,
$$\frac{d^2 l_\lambda(x_1, \ldots, x_n)}{d\lambda^2} = -\frac{1}{\lambda^2}\sum_{i=1}^n x_i < 0$$
for any $\lambda > 0$ (assuming there is at least one $i$ such that $x_i > 0$). Thus, assuming there is at least one $i$ such that $x_i > 0$,
$$\hat\lambda_n = \frac{1}{n}\sum_{i=1}^n x_i.$$

Remark 3.2.2. As a theoretical justification of $\hat\lambda_n$ as in Example 3.2.1, we note that
$$\frac{1}{n}\sum_{i=1}^n x_i$$
will converge in probability (see Theorem 2.5.7) to $E[X_1] = \lambda$ if our assumptions hold true. That is, we recover the true parameter value; such a property is called consistency. We do not address this issue further.
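A quick numerical check of Example 3.2.1 (a sketch with illustrative values): the sample mean coincides with a direct numerical maximization of the log-likelihood.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
x = rng.poisson(3.5, size=5000)              # illustrative Poisson data

print(x.mean())                              # closed-form MLE: the sample mean

# a direct numerical maximization of the log-likelihood agrees
negloglik = lambda lam: -np.sum(stats.poisson.logpmf(x, lam))
res = optimize.minimize_scalar(negloglik, bounds=(1e-6, 50), method="bounded")
print(res.x)
```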
Example 3.2.3. Let $X_1, \ldots, X_n$ be i.i.d. $E(\lambda)$ random variables. Let us compute the MLE of $\theta = \lambda$, given observations $x_1, \ldots, x_n$. First, we have
$$f_\lambda(x_1, \ldots, x_n) = \prod_{i=1}^n\lambda e^{-\lambda x_i} = \lambda^n\exp\Big\{-\lambda\sum_{i=1}^n x_i\Big\}.$$
Second, the log-likelihood is:
$$l_\lambda(x_1, \ldots, x_n) = \log(f_\lambda(x_1, \ldots, x_n)) = n\log(\lambda) - \lambda\sum_{i=1}^n x_i.$$
The gradient vector is a derivative:
$$\frac{dl_\lambda(x_1, \ldots, x_n)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i.$$
Thirdly,
$$\frac{n}{\lambda} - \sum_{i=1}^n x_i = 0$$
so
$$\tilde\lambda_n = \Big(\frac{1}{n}\sum_{i=1}^n x_i\Big)^{-1}.$$
Fourthly,
$$\frac{d^2 l_\lambda(x_1, \ldots, x_n)}{d\lambda^2} = -\frac{n}{\lambda^2} < 0.$$
Thus
$$\hat\lambda_n = \Big(\frac{1}{n}\sum_{i=1}^n x_i\Big)^{-1}.$$

Example 3.2.4. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables. Let us compute the MLE of $\theta = (\mu, \sigma^2)$, given observations $x_1, \ldots, x_n$. First, we have
$$f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{1}{2\sigma^2}(x_i-\mu)^2\Big\} = \Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)^n\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\Big\}.$$
Second, the log-likelihood is:
$$l_\theta(x_1, \ldots, x_n) = \log(f_\theta(x_1, \ldots, x_n)) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2.$$
The gradient vector is
$$\Big(\frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial\mu}, \frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial\sigma^2}\Big) = \Big(\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu),\ -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu)^2\Big).$$
Thirdly, we must solve the equations
$$\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) = 0 \qquad (3.2.2)$$
$$-\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu)^2 = 0 \qquad (3.2.3)$$
simultaneously for $\mu$ and $\sigma^2$. Since (3.2.2) can be solved independently of (3.2.3), we have:
$$\tilde\mu_n = \frac{1}{n}\sum_{i=1}^n x_i.$$
Thus, substituting into (3.2.3), we have
$$\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\tilde\mu_n)^2 = \frac{n}{2\sigma^2},$$
that is:
$$\tilde\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n(x_i-\tilde\mu_n)^2.$$
Fourthly, the Hessian matrix is:
$$H(\theta) = \begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{1}{(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu) \\ -\frac{1}{(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu) & \frac{n}{2(\sigma^2)^2} - \frac{1}{(\sigma^2)^3}\sum_{i=1}^n(x_i-\mu)^2 \end{pmatrix}.$$
Note that when $\theta = \tilde\theta_n = (\tilde\mu_n, \tilde\sigma_n^2)$ the off-diagonal elements are exactly $0$, so when solving $|\lambda I_2 - H(\tilde\theta_n)| = 0$ we simply need to show that the diagonal elements are negative. Clearly, for the first diagonal element,
$$-\frac{n}{\tilde\sigma_n^2} < 0$$
if $(x_i - \tilde\mu_n)^2 > 0$ for at least one $i$ (we do not allow the case $\tilde\sigma_n^2 = 0$; that is, we assume this doesn't occur). In addition
$$\frac{n}{2(\tilde\sigma_n^2)^2} - \frac{1}{(\tilde\sigma_n^2)^3}\sum_{i=1}^n(x_i-\tilde\mu_n)^2 = \frac{n}{2(\tilde\sigma_n^2)^2} - \frac{n}{(\tilde\sigma_n^2)^2} = -\frac{n}{2(\tilde\sigma_n^2)^2} < 0.$$
Thus
$$\hat\theta_n = \Big(\frac{1}{n}\sum_{i=1}^n x_i,\ \frac{1}{n}\sum_{i=1}^n\Big(x_i - \frac{1}{n}\sum_{j=1}^n x_j\Big)^2\Big).$$
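For Example 3.2.4 the MLE is computed directly from the data; a minimal sketch (illustrative values), noting the divisor $n$ (not $n-1$) in the variance estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=10000)   # illustrative data

mu_hat = x.mean()                         # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()   # MLE of sigma^2 (divisor n, not n - 1)
print(mu_hat, sigma2_hat)
```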

3.3 Hypothesis Testing


3.3.1 Introduction
The theory of hypothesis testing is concerned with the problem of determining whether or not a
statistical hypothesis, that is, a statement about the probability distribution of the data, is consistent with the available sample evidence. The particular hypothesis to be tested is called the null
hypothesis and is denoted by H0 . The ultimate goal is to accept or reject H0 .
In addition to the null hypothesis H0 , one may also be interested in a particular set of deviations
from H0 , called the alternative hypothesis and denoted by H1 . Usually, the null and the alternative
hypotheses are not on an equal footing: H0 is clearly specified and of intrinsic interest, whereas H1
serves only to indicate what types of departure from H0 are of interest.
A statistical test of a null hypothesis H0 is typically based on three elements:
1. a statistic $T$, called a test statistic;

2. a partition of the possible values of $T$ into two distinct regions: the set $K$ of values of $T$ that are regarded as inconsistent with $H_0$, called the critical or rejection region of the test, and its complement $K^c$, called the non-rejection region;

3. a decision rule that rejects the null hypothesis $H_0$ as inconsistent with the data if the observed value of $T$ falls in the rejection region $K$, and does not reject $H_0$ if the observed value of $T$ belongs instead to $K^c$.

3.3.2 Constructing Test Statistics


In order to construct test statistics, we start with an important result, which we do not prove. Note that all of our results are associated to normally distributed data.

Theorem 3.3.1. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables. Define
$$\bar{X}_n := \frac{1}{n}\sum_{i=1}^n X_i, \qquad s_n^2 := \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X}_n)^2.$$
Then:

1. $\bar{X}_n \sim N(\mu, \sigma^2/n)$.

2. $(n-1)s_n^2/\sigma^2 \sim \chi^2_{n-1}$.

3. $\bar{X}_n$ and $(n-1)s_n^2/\sigma^2$ are independent random variables.

We note then, by Proposition 2.4.39, that it follows that
$$T(\mu) := \frac{(\bar{X}_n - \mu)/(\sigma/\sqrt{n})}{\sqrt{s_n^2/\sigma^2}} = \frac{\bar{X}_n - \mu}{\sqrt{s_n^2/n}} \sim T_{n-1}.$$
Note that, as one might imagine, Proposition 2.4.38 is used to prove Theorem 3.3.1. Note also that one can show that the $T_{n-1}$ distribution has a pdf which is symmetric around $0$.
Now consider testing $H_0 : \mu = \mu_0$ against the two-sided alternative $H_1 : \mu \ne \mu_0$ (that is, we are testing whether the population mean, the true mean of the data, is a particular value). Now we know for any $\mu\in\mathbb{R}$ that $T(\mu) \sim T_{n-1}$; thus, if the null hypothesis is true, $T(\mu_0)$ should be a random variable that is consistent with a $T_{n-1}$ random variable. To construct the rejection region of the test, we must choose a confidence level, typically 95%. Then we want $T(\mu_0)$ to lie in a region of 95% probability. However, this still does not tell us what the rejection region is; this is informed by the alternative hypothesis $H_1 : \mu \ne \mu_0$, which indicates that a value inconsistent with the $T_{n-1}$ distribution may lie in either tail of the distribution. Thus the procedure is as follows (a code sketch is given after the list):

1. Compute $T(\mu_0)$.

2. Decide upon your confidence level $100(1-\alpha)\%$. This defines the rejection region.

3. Compute the t-values $(-t_\alpha, t_\alpha)$. These are the numbers in $\mathbb{R}$ such that the probability (under a $T_{n-1}$ random variable) of exceeding $t_\alpha$ is $\alpha/2$ and the probability of being less than $-t_\alpha$ is $\alpha/2$.

4. If $-t_\alpha < T(\mu_0) < t_\alpha$ then we do not reject the null hypothesis; otherwise we reject the null hypothesis.
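A sketch of the two-sided t-test procedure above (the data and level are illustrative; scipy is used for the t-values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=0.2, scale=1.0, size=30)   # illustrative sample
mu0, alpha = 0.0, 0.05                        # H0: mu = mu0, 95% confidence

n = x.size
T = (x.mean() - mu0) / np.sqrt(x.var(ddof=1) / n)   # T(mu0)
t_alpha = stats.t.ppf(1 - alpha / 2, df=n - 1)      # upper alpha/2 t-value

print("reject H0" if abs(T) >= t_alpha else "do not reject H0")
```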
A number of remarks are in order. First, a test can only disprove a null hypothesis. The fact that we do not reject the null on the basis of the sample evidence does not mean that the hypothesis is true.

Second, it is useful to distinguish between two types of errors that can be made:

Type I error: Reject $H_0$ when it is true.

Type II error: Do not reject $H_0$ when it is false.

In statistics, a Type I error is usually regarded as the more serious. The analogy is with the judicial system, where convicting an innocent person is typically considered a much more serious problem than letting a guilty person go free.
A second test statistic we consider is for testing whether two samples have the same variance. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be independent sequences of random variables, where $X_i \stackrel{\text{i.i.d.}}{\sim} N(\mu_X, \sigma_X^2)$ and $Y_i \stackrel{\text{i.i.d.}}{\sim} N(\mu_Y, \sigma_Y^2)$. Now we know from Theorem 3.3.1 that
$$(n-1)s_{X,n}^2/\sigma_X^2 \sim \chi^2_{n-1}$$
and independently:
$$(m-1)s_{Y,m}^2/\sigma_Y^2 \sim \chi^2_{m-1}.$$
Now suppose that we want to test $H_0 : \sigma_X^2 = \sigma_Y^2$ against $H_1 : \sigma_X^2 \ne \sigma_Y^2$. If $H_0$ is true, by Theorem 3.3.1 and Proposition 2.4.40,
$$F(\sigma_X^2) := \frac{(n-1)s_{X,n}^2/\sigma_X^2/(n-1)}{(m-1)s_{Y,m}^2/\sigma_X^2/(m-1)} = s_{X,n}^2/s_{Y,m}^2 \sim F_{n-1,m-1}.$$
Thus, we perform the following procedure (a code sketch is given after the list):

1. Compute $F(\sigma_X^2)$.

2. Decide upon your confidence level $100(1-\alpha)\%$. This defines the rejection region.

3. Compute the F-values $(\underline{f}_\alpha, \bar{f}_\alpha)$. These are the numbers in $\mathbb{R}_+$ such that the probability (under an $F_{n-1,m-1}$ random variable) of exceeding $\bar{f}_\alpha$ is $\alpha/2$ and the probability of being less than $\underline{f}_\alpha$ is $\alpha/2$.

4. If $\underline{f}_\alpha < F(\sigma_X^2) < \bar{f}_\alpha$ then we do not reject the null hypothesis; otherwise we reject the null hypothesis.
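A sketch of the variance-ratio (F) test procedure above, again with illustrative data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=25)             # illustrative samples
y = rng.normal(0.0, 1.2, size=30)
alpha = 0.05

F = x.var(ddof=1) / y.var(ddof=1)             # F(sigma_X^2) = s_X^2 / s_Y^2
f_lo = stats.f.ppf(alpha / 2, dfn=x.size - 1, dfd=y.size - 1)
f_hi = stats.f.ppf(1 - alpha / 2, dfn=x.size - 1, dfd=y.size - 1)

print("reject H0" if (F <= f_lo or F >= f_hi) else "do not reject H0")
```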
Remark 3.3.2. All of our results concern normal samples. However, one can use the CLT (Theorem 2.5.9) to extend these tests to non-normal data. We do not follow this idea in this course.

3.4 Bayesian Inference


3.4.1 Introduction
So far we have considered point estimation methods for the unknown parameter. In the following Section, we consider the idea of Bayesian inference for unknown parameters. Recall, from Examples 2.3.34 and 2.4.28, that there is a Bayes theorem for discrete and continuous random variables. Indeed, there is a version of Bayes theorem that can mix continuous and discrete random variables. Let $X$ and $Y$ be two jointly defined random variables such that $X$ and $Y$ are each either continuous or discrete. Then Bayes theorem is:
$$f(x|y) = \frac{f(y|x)f(x)}{f(y)}$$
as long as the associated pmfs/pdfs are well defined. Throughout the section, we assume that the pmfs/pdfs are always positive on their supports.

3.4.2 Bayesian Estimation


Throughout, we will assume that we have a random sample $X_1, X_2, \ldots, X_n$ which are conditionally independent, given a parameter $\theta$. As we will see, in Bayesian statistics the parameter $\theta$ is a random variable. So, we assume $X_i|\theta \stackrel{\text{i.i.d.}}{\sim} F(\cdot|\theta)$, where $F(\cdot|\theta)$ is assumed to be a distribution function for any $\theta\in\Theta$. Thus we have that the joint pmf/pdf of $X_1, \ldots, X_n|\theta$ is:
$$f(x_1, \ldots, x_n|\theta) = \prod_{i=1}^n f(x_i|\theta).$$
The main key behind Bayesian statistics is the choice of a prior probability distribution for the parameter $\theta$. That is, Bayesian statisticians specify a probability distribution on the parameter before the data are observed. This probability distribution is supposed to reflect the information one might have before seeing the observations. To make this idea concrete, consider the following example:

Example 3.4.1. Suppose that we will observe data $X_i|\lambda \stackrel{\text{i.i.d.}}{\sim} E(\lambda)$. Then one has to construct a probability distribution on $\mathbb{R}_+$. A possible candidate is $\lambda \sim G(a, b)$. If one has some prior beliefs about the mean and variance of $\lambda$ (say $E[\lambda] = 10$, $\mathrm{Var}[\lambda] = 10$) then one can determine what $a$ and $b$ are.

Throughout, we will write the prior pmf/pdf as $\pi(\theta)$.

Now the way in which Bayesian inference works is to update the prior beliefs on $\theta$ via the posterior pmf/pdf. That is, in the light of the data, the distributional properties of the prior are updated. This is achieved by Bayes theorem; the posterior pmf/pdf is:
$$\pi(\theta|x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n|\theta)\pi(\theta)}{f(x_1, \ldots, x_n)}$$
where
$$f(x_1, \ldots, x_n) = \int_\Theta f(x_1, \ldots, x_n|\theta)\pi(\theta)\,d\theta$$
if $\theta$ is continuous and, if $\theta$ is discrete:
$$f(x_1, \ldots, x_n) = \sum_{\theta\in\Theta} f(x_1, \ldots, x_n|\theta)\pi(\theta).$$
For a Bayesian statistician, the posterior is the final answer, in that all statistical inference should be associated to the posterior. For example, if one is interested in estimating $\theta$, then one can use the posterior mean:
$$E[\theta|x_1, \ldots, x_n] = \int_\Theta \theta\,\pi(\theta|x_1, \ldots, x_n)\,d\theta.$$
In addition, consider $\theta\in\mathbb{R}$ (or indeed any univariate component of $\theta$); then we can compute a confidence interval, which is called a credible interval in Bayesian statistics. The highest 95%-posterior-credible (HPC) interval is the shortest region $[\theta_l, \theta_u]$ such that
$$\int_{\theta_l}^{\theta_u}\pi(\theta|x_1, \ldots, x_n)\,d\theta = 0.95.$$
By shortest, we mean the $[\theta_l, \theta_u]$ such that $|\theta_u - \theta_l|$ is smallest.

It should be clear by now that the posterior distribution is much richer than the MLE, in the sense that one now has a whole distribution which reflects the parameter, instead of a point estimate. What also might be apparent now is the fact that the posterior is perhaps difficult to calculate. For example, the posterior mean will require you to compute two integrations over $\Theta$ (in general) and this may not be easy to calculate in practice. In addition, it could be difficult, analytically, to calculate a HPC. As a result, most Bayesian inference is done via numerical methods, based on Monte Carlo (see Example 2.5.8); we do not mention this further.

3.4.3 Examples
Despite the fact that Bayesian inference is very challenging, there are still many examples where one
can do analytical calculations.
Example 3.4.2. Let us consider Example 3.4.1. Here we have that
$$f(x_1, \ldots, x_n) = \int_0^\infty\lambda^n\exp\Big\{-\lambda\sum_{i=1}^n x_i\Big\}\frac{b^a}{\Gamma(a)}\lambda^{a-1}e^{-b\lambda}\,d\lambda$$
$$= \frac{b^a}{\Gamma(a)}\int_0^\infty\lambda^{n+a-1}\exp\Big\{-\lambda\Big[\sum_{i=1}^n x_i + b\Big]\Big\}d\lambda$$
$$= \frac{b^a}{\Gamma(a)}\Big(\frac{1}{\sum_{i=1}^n x_i + b}\Big)^{n+a}\int_0^\infty u^{n+a-1}e^{-u}\,du$$
$$= \frac{b^a}{\Gamma(a)}\Big(\frac{1}{\sum_{i=1}^n x_i + b}\Big)^{n+a}\Gamma(n+a).$$
So, as:
$$f(x_1, \ldots, x_n|\lambda)\pi(\lambda) = \frac{b^a}{\Gamma(a)}\lambda^{n+a-1}\exp\Big\{-\lambda\Big[\sum_{i=1}^n x_i + b\Big]\Big\}$$
we have:
$$\pi(\lambda|x_1, \ldots, x_n) = \frac{\lambda^{n+a-1}\exp\{-\lambda[\sum_{i=1}^n x_i + b]\}}{\big(\frac{1}{\sum_{i=1}^n x_i + b}\big)^{n+a}\Gamma(n+a)}$$
i.e.
$$\lambda|x_1, \ldots, x_n \sim G\Big(n+a,\ b+\sum_{i=1}^n x_i\Big).$$
Thus, the posterior distribution on $\lambda$ is in the same family as the prior, except with updated parameters reflecting the data. So, for example:
$$E[\lambda|x_1, \ldots, x_n] = \frac{n+a}{b+\sum_{i=1}^n x_i}.$$
In comparison to the MLE in Example 3.2.3, we see that the posterior mean and MLE correspond as $a, b \to 0$.
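A numerical sketch of this conjugate update (illustrative prior and data): the posterior is $G(n+a, b+\sum_i x_i)$, so the update is two lines of code.

```python
import numpy as np

rng = np.random.default_rng(6)
lam_true, a, b = 2.0, 1.0, 1.0               # illustrative rate and G(a, b) prior
x = rng.exponential(1 / lam_true, size=200)  # data, X_i | lam ~ E(lam)

a_post = a + x.size                          # posterior: G(n + a, b + sum x_i)
b_post = b + x.sum()

print(a_post / b_post)                       # posterior mean
print(1 / x.mean())                          # MLE; they agree as a, b -> 0
```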
Example 3.4.3. Let $X_i|\lambda \stackrel{\text{i.i.d.}}{\sim} P(\lambda)$, $i\in\{1, \ldots, n\}$. Suppose the prior on $\lambda$ is $G(a, b)$. Then
$$f(x_1, \ldots, x_n) = \int_0^\infty\lambda^{\sum_{i=1}^n x_i}\exp\{-n\lambda\}\frac{1}{\prod_{i=1}^n x_i!}\frac{b^a}{\Gamma(a)}\lambda^{a-1}e^{-b\lambda}\,d\lambda$$
$$= \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\int_0^\infty\lambda^{\sum_{i=1}^n x_i + a - 1}\exp\{-\lambda[n+b]\}d\lambda$$
$$= \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\Big(\frac{1}{n+b}\Big)^{\sum_{i=1}^n x_i + a}\int_0^\infty u^{\sum_{i=1}^n x_i + a - 1}e^{-u}\,du$$
$$= \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\Big(\frac{1}{n+b}\Big)^{\sum_{i=1}^n x_i + a}\Gamma\Big(\sum_{i=1}^n x_i + a\Big).$$
So, as:
$$f(x_1, \ldots, x_n|\lambda)\pi(\lambda) = \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\lambda^{\sum_{i=1}^n x_i + a - 1}\exp\{-\lambda[n+b]\}$$
we have:
$$\pi(\lambda|x_1, \ldots, x_n) = \frac{\lambda^{\sum_{i=1}^n x_i + a - 1}\exp\{-\lambda[n+b]\}}{\big(\frac{1}{n+b}\big)^{\sum_{i=1}^n x_i + a}\Gamma(\sum_{i=1}^n x_i + a)}$$
i.e.
$$\lambda|x_1, \ldots, x_n \sim G\Big(\sum_{i=1}^n x_i + a,\ n+b\Big).$$
Thus, the posterior distribution on $\lambda$ is in the same family as the prior, except with updated parameters reflecting the data. So, for example:
$$E[\lambda|x_1, \ldots, x_n] = \frac{\sum_{i=1}^n x_i + a}{n+b}.$$
In comparison to the MLE in Example 3.2.1, we see that the posterior mean and MLE correspond as $a, b \to 0$.
Example 3.4.4. Let $X_i|\mu \stackrel{\text{i.i.d.}}{\sim} N(\mu, 1)$, $i\in\{1, \ldots, n\}$. Suppose the prior on $\mu$ is $N(\xi, \tau)$. Then
$$f(x_1, \ldots, x_n) = \int_{\mathbb{R}}\Big(\frac{1}{\sqrt{2\pi}}\Big)^n\exp\Big\{-\frac12\sum_{i=1}^n(x_i-\mu)^2\Big\}\frac{1}{\sqrt{2\pi\tau}}\exp\Big\{-\frac{1}{2\tau}(\mu-\xi)^2\Big\}d\mu$$
$$= \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\frac{1}{\sqrt{2\pi\tau}}\int_{\mathbb{R}}\exp\Big\{-\frac12\sum_{i=1}^n(x_i-\mu)^2 - \frac{1}{2\tau}(\mu-\xi)^2\Big\}d\mu.$$
To compute the integral, let us first manipulate the exponent inside the integral:
$$-\frac12\sum_{i=1}^n(x_i-\mu)^2 - \frac{1}{2\tau}(\mu-\xi)^2 = -\frac12\Big[\mu^2\Big(n+\frac1\tau\Big) - 2\mu\Big(\frac{\xi}{\tau}+\sum_{i=1}^n x_i\Big) + \frac{\xi^2}{\tau} + \sum_{i=1}^n x_i^2\Big].$$
Let $c(\xi, \tau, x_1, \ldots, x_n) = -\frac12[\frac{\xi^2}{\tau} + \sum_{i=1}^n x_i^2]$; then
$$-\frac12\sum_{i=1}^n(x_i-\mu)^2 - \frac{1}{2\tau}(\mu-\xi)^2 = -\frac12\Big[\mu^2\Big(n+\frac1\tau\Big) - 2\mu\Big(\frac{\xi}{\tau}+\sum_{i=1}^n x_i\Big)\Big] + c(\xi, \tau, x_1, \ldots, x_n)$$
$$= -\frac12\Big(n+\frac1\tau\Big)\Big(\mu - \frac{\xi/\tau+\sum_{i=1}^n x_i}{n+1/\tau}\Big)^2 + \frac12\Big(n+\frac1\tau\Big)\Big(\frac{\xi/\tau+\sum_{i=1}^n x_i}{n+1/\tau}\Big)^2 + c(\xi, \tau, x_1, \ldots, x_n).$$
Let $c'(\xi, \tau, x_1, \ldots, x_n) = c(\xi, \tau, x_1, \ldots, x_n) + \frac12(n+\frac1\tau)\big(\frac{\xi/\tau+\sum_{i=1}^n x_i}{n+1/\tau}\big)^2$; then we have
$$f(x_1, \ldots, x_n) = \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\frac{1}{\sqrt{2\pi\tau}}\exp\{c'(\xi, \tau, x_1, \ldots, x_n)\}\int_{\mathbb{R}}\exp\Big\{-\frac12\Big(n+\frac1\tau\Big)\Big(\mu - \frac{\xi/\tau+\sum_{i=1}^n x_i}{n+1/\tau}\Big)^2\Big\}d\mu$$
$$= \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\frac{1}{\sqrt{2\pi\tau}}\exp\{c'(\xi, \tau, x_1, \ldots, x_n)\}\frac{\sqrt{2\pi}}{(n+\frac1\tau)^{1/2}}.$$
So, as:
$$f(x_1, \ldots, x_n|\mu)\pi(\mu) = \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\frac{1}{\sqrt{2\pi\tau}}\exp\{c'(\xi, \tau, x_1, \ldots, x_n)\}\exp\Big\{-\frac12\Big(n+\frac1\tau\Big)\Big(\mu - \frac{\xi/\tau+\sum_{i=1}^n x_i}{n+1/\tau}\Big)^2\Big\}$$
we have
$$\pi(\mu|x_1, \ldots, x_n) = \frac{\exp\{-\frac12(n+\frac1\tau)(\mu - \frac{\xi/\tau+\sum_{i=1}^n x_i}{n+1/\tau})^2\}}{\sqrt{2\pi}\,(n+\frac1\tau)^{-1/2}}$$
i.e.
$$\mu|x_1, \ldots, x_n \sim N\Big(\frac{\xi/\tau+\sum_{i=1}^n x_i}{n+1/\tau},\ \Big(n+\frac1\tau\Big)^{-1}\Big).$$
Thus, the posterior distribution on $\mu$ is in the same family as the prior, except with updated parameters reflecting the data. So, for example:
$$E[\mu|x_1, \ldots, x_n] = \frac{\xi/\tau+\sum_{i=1}^n x_i}{n+1/\tau}.$$
In comparison to the MLE in Example 3.2.4, we see that the posterior mean and MLE correspond as $\tau\to\infty$, for any fixed $\xi\in\mathbb{R}$.
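A numerical sketch of the posterior update in Example 3.4.4 (illustrative $\xi$, $\tau$ and data):

```python
import numpy as np

rng = np.random.default_rng(7)
xi, tau = 0.0, 10.0                           # illustrative N(xi, tau) prior
x = rng.normal(loc=1.0, scale=1.0, size=100)  # data, X_i | mu ~ N(mu, 1)

prec = x.size + 1 / tau                       # posterior "precision" n + 1/tau
post_mean = (xi / tau + x.sum()) / prec       # posterior mean
post_var = 1 / prec                           # posterior variance

print(post_mean, post_var)
print(x.mean())                               # MLE; recovered as tau -> infinity
```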

We note that in all of our examples the posterior is in the same family as the prior. This is not by chance; for every member of the exponential family with parameter $\theta$, it is often possible to find a prior which obeys this property. Such priors are called conjugate priors.
Chapter 4

Miscellaneous Results

In the following Chapter we quote some results of use in the course. We do not cover this material in lectures; it is there for your convenience and revision. The following Sections give general facts which should be taken as true. You can easily find the theory behind each of these results in any undergraduate text in maths or statistics.

4.1 Set Theory


Recall that $A\cup B$ means all the elements in $A$ or all the elements of $B$; e.g. $A = \{1, 2\}$, $B = \{3, 4\}$, then $A\cup B = \{1, 2, 3, 4\}$. In addition, $A\cap B$ means all the elements in $A$ and $B$; e.g. $A = \{1, 2, 3\}$, $B = \{3, 4\}$, then $A\cap B = \{3\}$. Finally, recall $A^c$ is the complement of the set $A$, that is, everything in $\Omega$ that is not in $A$; e.g. $\Omega = \{1, 2, 3\}$ and $A = \{1, 2\}$, then $A^c = \{3\}$.

Let $A, B, C$ be sets in some state-space $\Omega$. Then the following hold true:
$$A\cup B = B\cup A, \quad A\cap B = B\cap A \quad \text{COMMUTATIVITY}$$
$$A\cup(B\cup C) = (A\cup B)\cup C, \quad A\cap(B\cap C) = (A\cap B)\cap C \quad \text{ASSOCIATIVITY}$$
$$A\cap(B\cup C) = (A\cap B)\cup(A\cap C) \quad \text{DISTRIBUTIVITY}.$$
Note that the distributivity rule also holds when mixing intersection and union the other way, e.g.:
$$A\cup(B\cap C) = (A\cup B)\cap(A\cup C).$$
We also have De Morgan's laws:
$$(A\cup B)^c = A^c\cap B^c, \qquad (A\cap B)^c = A^c\cup B^c.$$
A neat trick for probability:
$$A = (A\cap B)\cup(A\cap B^c).$$

4.2 Derivatives and Functions


Here I list some standard mathematical derivatives. Below a, b are real-valued finite constants, unless
otherwise stated.
$$\frac{d}{dx}(x^p) = px^{p-1} \quad p\ne 0$$
$$\frac{d}{dx}(ae^{bx}) = abe^{bx}$$
$$\frac{d}{dx}(\log(x)) = \frac{1}{x} \quad x > 0$$
$$\frac{d}{dx}(a^x) = \log(a)\,a^x \quad a > 0$$
$$\frac{d}{dx}(\sin(ax)) = a\cos(ax)$$
$$\frac{d}{dx}(\cos(ax)) = -a\sin(ax)$$
$$\frac{d}{dx}(\tan(ax)) = a\sec^2(ax)$$

Also recall the product rule; for two functions $f(x)$ and $g(x)$, we have
$$\frac{d}{dx}(f(x)g(x)) = f(x)\frac{dg(x)}{dx} + g(x)\frac{df(x)}{dx}.$$
In addition the chain rule; for two functions $f(x)$ and $g(x)$, let $h(x) = f(g(x))$; then
$$\frac{dh(x)}{dx} = \frac{df(g(x))}{dg(x)}\,\frac{dg(x)}{dx}.$$
For example, suppose $f(x) = \log(x)$ and $g(x) = 2 + \sin(x)$ and set $h(x) = \log(2 + \sin(x))$; then
$$\frac{df(g(x))}{dg(x)} = \frac{1}{g(x)}, \qquad \frac{dg(x)}{dx} = \cos(x),$$
where $df/dg$ is the derivative of $\log(g)$. Hence
$$\frac{dh(x)}{dx} = \frac{1}{g(x)}\cos(x) = \frac{\cos(x)}{2+\sin(x)}.$$
Note that by $\log(x)$ I mean the inverse function of the exponential function, or natural logarithm: $\log(e^x) = x$. Recall that $\log(ab) = \log(a) + \log(b)$ and $\log(a^b) = b\log(a)$. Note that $\sin^2(x) + \cos^2(x) = 1$.

4.3 Summation
Recall that $\sum_{x=k_1}^{k_2} f(x)$, for integers $k_1\le k_2$, with $f : \mathbb{Z}\to\mathbb{R}$, means
$$f(k_1) + f(k_1+1) + f(k_1+2) + \cdots + f(k_2-1) + f(k_2).$$
For example, $k_1 = 1$, $k_2 = 3$, $f(x) = x$; then
$$\sum_{x=1}^{3} x = 1 + 2 + 3 = 6.$$
Recall that for any finite $c\in\mathbb{R}$:
$$\sum_{x=k_1}^{k_2} cf(x) = c\sum_{x=k_1}^{k_2} f(x).$$
Also, for two real-valued functions $f(x)$ and $g(x)$:
$$\sum_{x=k_1}^{k_2}\big(f(x) + g(x)\big) = \sum_{x=k_1}^{k_2} f(x) + \sum_{x=k_1}^{k_2} g(x).$$
Some standard summations that you may have seen in the past are listed below:
$$\frac{1}{1-z} = \sum_{k=0}^\infty z^k \quad |z| < 1 \quad \text{GEOMETRIC}$$
$$e^z = \sum_{k=0}^\infty\frac{z^k}{k!} \quad z\in\mathbb{R} \quad \text{EXPONENTIAL}$$
$$(1+z)^n = \sum_{k=0}^n\binom{n}{k}z^k \quad n\in\mathbb{Z}_+ \quad \text{BINOMIAL}$$
$$(1-z)^{-n} = \sum_{k=0}^\infty\binom{n+k-1}{k}z^k \quad n\in\mathbb{Z}_+,\ |z| < 1 \quad \text{NEGATIVE BINOMIAL}$$
$$\log(1-z) = -\sum_{k=1}^\infty\frac{z^k}{k} \quad |z| < 1 \quad \text{LOGARITHMIC}$$
$$\log(1+z) = \sum_{k=1}^\infty(-1)^{k+1}\frac{z^k}{k} \quad |z| < 1 \quad \text{LOGARITHMIC}$$
Note that, supposing $|z| < 1$, we have
$$\sum_{k=j}^{n} z^k = z^j\,\frac{1-z^{n-j+1}}{1-z}.$$

A standard trick that is used in this course is the reversal of summation and differentiation. For example, consider $f : \mathbb{Z}\times\Theta\to\mathbb{R}$, where $\Theta\subseteq\mathbb{R}$; then it is valid to assume in this course that:
$$\frac{d}{d\theta}\Big(\sum_{x=k_1}^{k_2} f(x, \theta)\Big) = \sum_{x=k_1}^{k_2}\frac{df(x, \theta)}{d\theta}.$$
In words, the derivative of the sum is equal to the sum of the derivatives. This trick is very useful for computing expectations w.r.t., e.g., geometric distributions; a worked example is given below. Note that this also applies to higher-order derivatives:
$$\frac{d^2}{d\theta^2}\Big(\sum_{x=k_1}^{k_2} f(x, \theta)\Big) = \sum_{x=k_1}^{k_2}\frac{d^2 f(x, \theta)}{d\theta^2}.$$
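For instance (a standard calculation, included here as an illustration), for $X \sim Ge(p)$ one can swap the sum and the derivative of the geometric series to obtain $E[X]$: writing $q = 1-p$,
$$E[X] = \sum_{x=1}^\infty x\,q^{x-1}p = p\sum_{x=1}^\infty\frac{d}{dq}q^x = p\,\frac{d}{dq}\Big(\frac{q}{1-q}\Big) = \frac{p}{(1-q)^2} = \frac{1}{p},$$
in agreement with Table 4.1.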
A double summation $\sum_{y=k_3}^{k_4}\sum_{x=k_1}^{k_2} f(x, y)$, with $f : \mathbb{Z}^2\to\mathbb{R}$, $k_1\le k_2$, $k_3\le k_4$, means:
$$\sum_{y=k_3}^{k_4}\Big(\sum_{x=k_1}^{k_2} f(x, y)\Big) = \sum_{y=k_3}^{k_4}\big(f(k_1, y) + \cdots + f(k_2, y)\big)$$
$$= f(k_1, k_3) + f(k_1, k_3+1) + \cdots + f(k_1, k_4) + \cdots + f(k_2, k_3) + f(k_2, k_3+1) + \cdots + f(k_2, k_4).$$
In words: keep $y$ constant and sum over $x$, then sum the resulting values over $y$. Note again that it will be mathematically valid to switch the order of summation in this course, i.e.:
$$\sum_{y=k_3}^{k_4}\sum_{x=k_1}^{k_2} f(x, y) = \sum_{x=k_1}^{k_2}\sum_{y=k_3}^{k_4} f(x, y).$$
You should be careful when switching the order of summation when the inner limit depends upon $y$, e.g.
$$\sum_{y=k_3}^{k_4}\sum_{x=k_1}^{y} f(x, y).$$

4.4 Exponential Function


For $x\in\mathbb{R}_+$:
$$\lim_{n\to\infty}\Big(1+\frac{x}{n}\Big)^n = e^x, \qquad \lim_{n\to\infty}\Big(1-\frac{x}{n}\Big)^n = e^{-x}.$$

4.5 Taylor Series


Under assumptions that are in effect in this course, for infinitely differentiable $f : \mathbb{R}\to\mathbb{R}$:
$$f(x) = \sum_{k=0}^\infty\frac{(x-c)^k}{k!}f^{(k)}(c) \quad c\in\mathbb{R}.$$

4.6 Integration Methods


In lectures, I will give a review of integration by parts and integration by substitution. The notes may be found online. We have the following standard properties of integrals, for $c$ real and finite and $f(x)$ integrable:
$$\int cf(x)\,dx = c\int f(x)\,dx.$$
For two integrable functions $f(x)$ and $g(x)$ we have
$$\int\big(f(x) + g(x)\big)dx = \int f(x)\,dx + \int g(x)\,dx.$$
A table of integrals essentially follows by reversing the derivatives that you saw previously (we omit the constant of integration, which is not needed at this level):
$$\int x^p\,dx = \frac{x^{p+1}}{p+1} \quad p\ne -1$$
$$\int ae^{bx}\,dx = \frac{a}{b}e^{bx}$$
$$\int\log(x)\,dx = x\log(x) - x$$
$$\int a^x\,dx = \frac{1}{\log(a)}a^x \quad a > 0$$
$$\int\sin(ax)\,dx = -\frac{1}{a}\cos(ax)$$
$$\int\cos(ax)\,dx = \frac{1}{a}\sin(ax)$$
$$\int\tan(ax)\,dx = \frac{1}{a}\log|\sec(ax)|.$$

Some ideas for integration. Since
$$\frac{d}{dx}[f(g(x))] = g'(x)f'(g(x)),$$
we have
$$\int g'(x)f'(g(x))\,dx = f(g(x)) + C.$$
Consider also the following integration trick (a worked example is given below). Suppose you want to compute
$$\int g(x)\,dx$$
where $g$ is a positive function. Suppose that there is a PDF $f(x)$ such that
$$f(x) = c\,g(x)$$
where you know $c$. Then
$$\int g(x)\,dx = \frac{1}{c}\int f(x)\,dx = \frac{1}{c}.$$
Recall that
$$I = \int_{-\infty}^{\infty}e^{-u^2/2}\,du = \sqrt{2\pi}.$$
This can be established by the fact that
$$I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}e^{-u^2/2 - v^2/2}\,du\,dv$$
and then changing to polar co-ordinates.

4.7 Distributions
A table of discrete distributions can be found in Table 4.1 and of continuous distributions in Table 4.2. Recall that
$$\Gamma(a) = \int_0^\infty t^{a-1}e^{-t}\,dt \quad a > 0.$$
Note that one can show $\Gamma(a+1) = a\Gamma(a)$ and, for $a\in\mathbb{Z}_+$, $\Gamma(a) = (a-1)!$. In addition, for $a, b > 0$,
$$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$
One can show also that
$$B(a, b) = \int_0^1 u^{a-1}(1-u)^{b-1}\,du.$$
Table 4.1: Table of Discrete Distributions. Note that $q = 1-p$.

- $B(1, p)$: support $\{0, 1\}$; parameter $p\in(0,1)$; PMF $p^x(1-p)^{1-x}$; $E[X] = p$; $\mathrm{Var}[X] = p(1-p)$; MGF $(1-p) + pe^t$.
- $B(n, p)$: support $\{0, 1, \ldots, n\}$; parameters $p\in(0,1)$, $n\in\mathbb{Z}_+$; PMF $\binom{n}{x}p^x(1-p)^{n-x}$; $E[X] = np$; $\mathrm{Var}[X] = np(1-p)$; MGF $((1-p) + pe^t)^n$.
- $P(\lambda)$: support $\{0, 1, 2, \ldots\}$; parameter $\lambda\in\mathbb{R}_+$; PMF $\frac{\lambda^x e^{-\lambda}}{x!}$; $E[X] = \lambda$; $\mathrm{Var}[X] = \lambda$; MGF $\exp\{\lambda(e^t - 1)\}$.
- $Ge(p)$: support $\{1, 2, \ldots\}$; parameter $p\in(0,1)$; PMF $(1-p)^{x-1}p$; CDF $1 - q^x$; $E[X] = 1/p$; $\mathrm{Var}[X] = (1-p)/p^2$; MGF $\frac{pe^t}{1 - e^t(1-p)}$.
- $Ne(n, p)$: support $\{n, n+1, \ldots\}$; parameters $p\in(0,1)$, $n\in\mathbb{Z}_+$; PMF $\binom{x-1}{n-1}(1-p)^{x-n}p^n$; $E[X] = n/p$; $\mathrm{Var}[X] = n(1-p)/p^2$; MGF $\big(\frac{pe^t}{1 - e^t(1-p)}\big)^n$.

Table 4.2: Table of Continuous Distributions. Note that $B(a, b)^{-1} = \Gamma(a+b)/[\Gamma(a)\Gamma(b)]$.

- $U_{[a,b]}$: support $[a, b]$; parameters $-\infty < a < b < \infty$; PDF $\frac{1}{b-a}$; CDF $\frac{x-a}{b-a}$; $E[X] = (a+b)/2$; $\mathrm{Var}[X] = (b-a)^2/12$; MGF $\frac{e^{bt} - e^{at}}{t(b-a)}$.
- $E(\lambda)$: support $\mathbb{R}_+$; parameter $\lambda\in\mathbb{R}_+$; PDF $\lambda e^{-\lambda x}$; CDF $1 - e^{-\lambda x}$; $E[X] = 1/\lambda$; $\mathrm{Var}[X] = 1/\lambda^2$; MGF $\frac{\lambda}{\lambda - t}$.
- $G(a, b)$: support $\mathbb{R}_+$; parameters $a, b\in\mathbb{R}_+$; PDF $\frac{b^a}{\Gamma(a)}x^{a-1}e^{-bx}$; $E[X] = a/b$; $\mathrm{Var}[X] = a/b^2$; MGF $\big(\frac{b}{b-t}\big)^a$.
- $N(a, b)$: support $\mathbb{R}$; parameters $(a, b)\in\mathbb{R}\times\mathbb{R}_+$; PDF $\frac{1}{\sqrt{2\pi b}}e^{-\frac{1}{2b}(x-a)^2}$; $E[X] = a$; $\mathrm{Var}[X] = b$; MGF $e^{at + bt^2/2}$.
- $Be(a, b)$: support $[0, 1]$; parameters $(a, b)\in(\mathbb{R}_+)^2$; PDF $\frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1}$; $E[X] = a/(a+b)$; $\mathrm{Var}[X] = \frac{ab}{(a+b)^2(a+b+1)}$.
