Revision Notes - ST2131: Ma Hongqiang April 18, 2017


Contents
1 Combinatorial Analysis

2 Axioms of Probability

3 Conditional Probability and Independence

4 Random Variables

5 Continuous Random Variable

6 Jointly Distributed Random Variables

7 Properties of Expectation

8 Limit Theorems

9 Problems

1 Combinatorial Analysis
Theorem 1.1 (Generalised Basic Principle of Counting).
Suppose that r experiments are to be performed. If
• experiment 1 can result in n1 possible outcomes;
• experiment 2 can result in n2 possible outcomes;
• ···
• experiment r can result in nr possible outcomes;
then together there are n1 n2 · · · nr possible outcomes of the r experiments.

1.1 Permutations
Theorem 1.2 (Permutation of distinct objects).
Suppose there are n distinct objects; then the total number of permutations is n!.
Theorem 1.3 (General principle of permutation).
For n objects of which n1 are alike, n2 are alike, . . ., nr are alike, there are
n!/(n1! n2! · · · nr!)
different permutations of the n objects.

1.2 Combinations
Theorem 1.4 (General principle of combination).
If there are n distinct objects, of which we choose a group of r items, the number of combinations equals

(n choose r) = n!/(r!(n − r)!)

1.2.1 Useful Combinatorial Identities


1. For 1 ≤ r ≤ n, (n choose r) = (n−1 choose r−1) + (n−1 choose r)

2. (Binomial Theorem) Let n be a non-negative integer, then

(x + y)^n = Σ_{k=0}^n (n choose k) x^k y^(n−k)

3. Σ_{k=0}^n (n choose k) = 2^n

4. Σ_{k=0}^n (−1)^k (n choose k) = 0
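These identities can be checked numerically with the standard library's `math.comb`; a small sketch (the value n = 8 is an arbitrary choice for illustration):

```python
import math

n = 8
# Identity 1 (Pascal's rule): C(n, r) = C(n-1, r-1) + C(n-1, r)
pascal_ok = all(math.comb(n, r) == math.comb(n - 1, r - 1) + math.comb(n - 1, r)
                for r in range(1, n + 1))
# Identities 3 and 4: row sum and alternating row sum of binomial coefficients
row_sum = sum(math.comb(n, k) for k in range(n + 1))              # should be 2^n
alt_sum = sum((-1) ** k * math.comb(n, k) for k in range(n + 1))  # should be 0
```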

1.3 Multinomial Coefficients
If n1 + n2 + · · · + nr = n, we define (n choose n1, n2, · · · , nr) by

(n choose n1, n2, · · · , nr) = n!/(n1! n2! · · · nr!)

Thus (n choose n1, n2, · · · , nr) represents the number of possible divisions of n distinct objects into r distinct groups of respective sizes n1, n2, · · · , nr.

Theorem 1.5 (Multinomial Theorem).

(x1 + x2 + · · · + xr)^n = Σ over (n1, . . . , nr) with n1 + · · · + nr = n of (n choose n1, n2, · · · , nr) x1^n1 x2^n2 · · · xr^nr
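A sketch checking the multinomial theorem for r = 3 by brute-force expansion (the values of n and xs are arbitrary test inputs):

```python
import math

def multinomial(ks):
    # n! / (k1! k2! ... kr!)
    out = math.factorial(sum(ks))
    for k in ks:
        out //= math.factorial(k)
    return out

# Expand (x1 + x2 + x3)^n term by term and compare with direct evaluation
n, xs = 4, (2.0, 3.0, 5.0)
direct = sum(xs) ** n
expanded = sum(multinomial((n1, n2, n - n1 - n2))
               * xs[0] ** n1 * xs[1] ** n2 * xs[2] ** (n - n1 - n2)
               for n1 in range(n + 1) for n2 in range(n + 1 - n1))
```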

1.4 Number of Integer Solutions of Equations

Theorem 1.6. There are (n−1 choose r−1) distinct positive integer-valued vectors (x1, x2, . . . , xr) satisfying the equation
x1 + x2 + · · · + xr = n
where xi > 0 for i = 1, . . . , r.

Theorem 1.7. There are (n+r−1 choose r−1) distinct non-negative integer-valued vectors (x1, x2, . . . , xr) satisfying the equation
x1 + x2 + · · · + xr = n
where xi ≥ 0 for i = 1, . . . , r.
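Both counts can be verified by brute force for small n and r; a sketch (n = 7, r = 3 chosen only as a small test case):

```python
import math
from itertools import product

def count_solutions(n, r, lo):
    # Brute-force count of integer vectors (x1,...,xr), each xi >= lo, summing to n
    return sum(1 for xs in product(range(lo, n + 1), repeat=r) if sum(xs) == n)

n, r = 7, 3
positive = count_solutions(n, r, 1)      # Theorem 1.6 predicts C(n-1, r-1)
nonnegative = count_solutions(n, r, 0)   # Theorem 1.7 predicts C(n+r-1, r-1)
```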

2 Axioms of Probability
2.1 Sample Space and Events
The basic object of probability is an experiment: an activity or procedure that produces distinct, well-defined possibilities called outcomes.
The sample space is the set of all possible outcomes of an experiment, usually denoted by
S.
Any subset E of the sample space is an event.

2.2 Axioms of Probability


Probability, denoted by P , is a function on the collection of events satisfying
(i) For any event A,
0 ≤ P (A) ≤ 1

(ii) Let S be the sample space, then


P (S) = 1

(iii) For any sequence of mutually exclusive events A1, A2, . . . (that is, Ai Aj = ∅ when i ≠ j),

P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)

2.3 Properties of Probability


Theorem 2.1. P (∅) = 0.
Theorem 2.2. For any finite sequence of mutually exclusive events A1, A2, . . . , An,

P(∪_{i=1}^n Ai) = Σ_{i=1}^n P(Ai)

Theorem 2.3. Let A be an event, then


P (Ac ) = 1 − P (A)
Theorem 2.4. If A ⊆ B, then
P (A) + P (BAc ) = P (B)
Theorem 2.5 (Inclusion-Exclusion Principle). Let A1, A2, . . . , An be any events, then

P(A1 ∪ A2 ∪ · · · ∪ An) = Σ_{i=1}^n P(Ai) − Σ_{1≤i1<i2≤n} P(Ai1 Ai2) + · · ·
+ (−1)^(r+1) Σ_{1≤i1<···<ir≤n} P(Ai1 · · · Air) + · · · + (−1)^(n+1) P(A1 · · · An)
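The principle is easy to check on a small finite sample space with uniform probability; a sketch (the sample space and the three events are arbitrary illustrations):

```python
from itertools import combinations

# Uniform probability on a 12-point sample space; events chosen arbitrarily
S = set(range(1, 13))
events = [{s for s in S if s % 2 == 0},   # A1: even outcomes
          {s for s in S if s % 3 == 0},   # A2: multiples of 3
          {s for s in S if s <= 4}]       # A3: outcomes at most 4
P = lambda E: len(E) / len(S)

lhs = P(set.union(*events))
# Inclusion-exclusion: alternating sum over all r-fold intersections
rhs = sum((-1) ** (r + 1) * sum(P(set.intersection(*c))
                                for c in combinations(events, r))
          for r in range(1, len(events) + 1))
```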

2.4 Probability as a Continuous Set Function


Definition 2.1. A sequence of events {En }, n ≥ 1 is said to be an increasing sequence if
E1 ⊂ E2 ⊂ · · · ⊂ En ⊂ En+1 ⊂ · · ·
whereas it is said to be a decreasing sequence if
E1 ⊃ E2 ⊃ · · · ⊃ En ⊃ En+1 ⊃ · · ·
Definition 2.2. If {En}, n ≥ 1 is an increasing sequence of events, define a new event, denoted by lim_{n→∞} En, as

lim_{n→∞} En = ∪_{i=1}^∞ Ei

Similarly, if {En}, n ≥ 1 is a decreasing sequence of events, define a new event, denoted by lim_{n→∞} En, as

lim_{n→∞} En = ∩_{i=1}^∞ Ei

Theorem 2.6. If {En}, n ≥ 1 is either an increasing or a decreasing sequence of events, then

lim_{n→∞} P(En) = P(lim_{n→∞} En)

3 Conditional Probability and Independence
3.1 Conditional Probabilities
Definition 3.1. Let A and B be two events. Suppose that P(A) > 0; the conditional probability of B given A is defined as

P(B|A) = P(AB)/P(A)

Consequently, if P(A) > 0, then P(AB) = P(A)P(B|A).
Theorem 3.1 (General Multiplication Rule).
Let A1 , A2 , . . . , An be n events, then

P (A1 A2 · · · An ) = P (A1 )P (A2 |A1 )P (A3 |A1 A2 ) · · · P (An |A1 A2 · · · An−1 )

3.2 Bayes’ Formulas


Let A and B be any two events, then

P (B) = P (B|A)P (A) + P (B|Ac )P (Ac )

Definition 3.2. We say that A1 , A2 , . . . , An partitions the sample space S if:


• They are mutually exclusive, meaning Ai ∩ Aj = ∅ for all i ≠ j.

• They are collectively exhaustive, meaning ∪_{i=1}^n Ai = S.


Theorem 3.2 (Bayes’ First Formula).
Suppose the events A1 , A2 , . . . , An partitions the sample space. Assume further that P (Ai ) >
0 for 1 ≤ i ≤ n. Let B be any event, then

P (B) = P (B|A1 )P (A1 ) + · · · + P (B|An )P (An )

Theorem 3.3 (Bayes’ Second Formula).


Suppose the events A1 , A2 , . . . , An partitions the sample space. Assume further that P (Ai ) >
0 for 1 ≤ i ≤ n. Let B be any event, then for any 1 ≤ i ≤ n,
P(Ai | B) = P(B|Ai)P(Ai) / [P(B|A1)P(A1) + · · · + P(B|An)P(An)]
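A worked numerical sketch of both formulas, with A1 and A1^c as the partition (all the numbers here are hypothetical, chosen only for illustration):

```python
# Hypothetical numbers: a rare condition and an imperfect test.
# A1 = "has condition", A2 = complement of A1; B = "test positive".
p_a1 = 0.01
p_b_given_a1 = 0.95
p_b_given_a2 = 0.05

# Bayes' first formula: P(B) = P(B|A1)P(A1) + P(B|A2)P(A2)
p_b = p_b_given_a1 * p_a1 + p_b_given_a2 * (1 - p_a1)
# Bayes' second formula: P(A1|B) = P(B|A1)P(A1) / P(B)
p_a1_given_b = p_b_given_a1 * p_a1 / p_b
```

Even with a fairly accurate test, the posterior probability P(A1|B) stays modest because the prior P(A1) is small.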

3.3 Independent Events


Definition 3.3. Two events A and B are said to be independent if

P (AB) = P (A)P (B)

They are said to be dependent otherwise.

Theorem 3.4. If A and B are independent, then so are

1. A and B c

2. Ac and B

3. Ac and B c

Definition 3.4. Events A1, A2, . . . , An are said to be independent if for every subcollection of events Ai1, . . . , Air, we have

P(Ai1 · · · Air) = P(Ai1) · · · P(Air)

3.4 P (·|A) is a Probability


Theorem 3.5. Let A be an event with P (A) > 0. Then the following three conditions hold.

1. For any event B, we have


0 ≤ P (B|A) ≤ 1

2.
P (S|A) = 1

3. Let B1, B2, . . . be a sequence of mutually exclusive events, then

P(∪_{k=1}^∞ Bk | A) = Σ_{k=1}^∞ P(Bk | A)
4 Random Variables
4.1 Random Variables
Definition 4.1. A random variable X is a mapping from the sample space to the real numbers.

4.2 Discrete Random Variables


Definition 4.2. A random variable is said to be discrete if the range of X is either finite
or countably infinite.
Definition 4.3. Suppose that random variable X is discrete, taking values x1, x2, . . .; then the probability mass function of X, denoted by pX, is defined as

pX(x) = P(X = x) for x = x1, x2, . . ., and pX(x) = 0 otherwise
Properties of the probability mass function include
1. pX (xi ) ≥ 0 for i = 1, 2, . . .;
2. pX (x) = 0 for other values of x;
3. Since X must take on one of the values xi, Σ_{i=1}^∞ pX(xi) = 1.
Definition 4.4. The cumulative distribution function of X, is defined as
FX : R → R
where
FX (x) = P (X ≤ x) ∀x ∈ R
Remark: Suppose that X is discrete and takes values x1, x2, . . . where x1 < x2 < x3 < · · ·. Then F is a step function: F is constant on the interval [xi−1, xi) (where F takes the value p(x1) + · · · + p(xi−1)), and then takes a jump of size p(xi) at xi.

4.3 Expected Value


Definition 4.5. If X is a discrete random variable having a probability mass function pX ,
the expectation or expected value of X, denoted by E(X) or µX is defined by
E(X) = Σ_x x pX(x)

Definition 4.6 (Bernoulli Random Variable).


Suppose X takes only two values 0 and 1 with
P (X = 0) = 1 − p and P (X = 1) = p
We call this random variable a Bernoulli random variable of parameter p. And we denote
it by X ∼ Be(p).

4.4 Expectation of a Function of a Random Variable
Theorem 4.1. If X is a discrete random variable that takes values xi, i ≥ 1, with respective probabilities pX(xi), then for any real-valued function g,

E[g(X)] = Σ_i g(xi) pX(xi), or equivalently, = Σ_x g(x) pX(x)

Theorem 4.2. Let a and b be constants, then


E[aX + b] = aE(X) + b
Theorem 4.3 (Tail Sum Formula for Expectation).
For a nonnegative integer-valued random variable X,

E(X) = Σ_{k=1}^∞ P(X ≥ k) = Σ_{k=0}^∞ P(X > k)
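A numerical sketch of the tail sum formula on X ∼ Geom(p), where P(X ≥ k) = (1 − p)^(k−1) and E(X) = 1/p (the infinite sums are truncated at a point where the tail is negligible):

```python
# Tail-sum check for a geometric random variable with parameter p
p, q = 0.3, 0.7
N = 2000                                               # truncation point
mean_from_pmf = sum(k * p * q ** (k - 1) for k in range(1, N))
mean_from_tail = sum(q ** (k - 1) for k in range(1, N))  # sum of P(X >= k)
```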

4.5 Variance and Standard Deviation


Definition 4.7. If X is a random variable with mean µ, then the variance of X, denoted
by Var(X), is defined by
Var(X) = E[(X − µ)²]
An alternative formula for variance is:
Var(X) = E(X²) − [E(X)]²
Remark:
1. Var(X) ≥ 0
2. Var(X) = 0 if and only if X is a degenerate random variable
3. E(X 2 ) ≥ [E(X)]2 ≥ 0
4. Var(aX + b) = a2 Var(X)

4.6 Discrete Random Variable arising from Repeated Trials


1. Bernoulli random variable, denoted by Be(p).
We only perform the Bernoulli Trial once and define
X = 1 if it is a success, and X = 0 if it is a failure.
Here,
P (X = 1) = p, P (X = 0) = 1 − p
and
E(X) = p, Var(X) = p(1 − p)

2. Binomial random variable, denoted by Bin(n, p)
We perform the experiment (under identical conditions and independently) n times and define
X = number of successes in n Bernoulli(p) trials
Therefore, X takes values 0, 1, . . . , n. In fact, for 0 ≤ k ≤ n,

P(X = k) = (n choose k) p^k q^(n−k)

where q = 1 − p. Here,
E(X) = np, Var(X) = np(1 − p)
Also, a useful fact is
P(X = k + 1)/P(X = k) = (p/(1 − p)) · ((n − k)/(k + 1))

3. Geometric random variable, denoted by Geom(p)


Define the random variable

X = number of Bernoulli(p) trials required to obtain the first success

Here X takes values 1, 2, 3, . . . and so on. In fact, for k ≥ 1,

P (X = k) = pq k−1

And,
E(X) = 1/p, Var(X) = (1 − p)/p²
4. Negative binomial random variable, denoted by NB(r, p)
Define the random variable

X = number of Bernoulli(p) trials required to obtain r successes.

Here, X takes values r, r + 1, . . . and so on. In fact, for k ≥ r,

P(X = k) = (k − 1 choose r − 1) p^r q^(k−r)

And
E(X) = r/p, Var(X) = r(1 − p)/p²
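The binomial mean, variance, and ratio identity can all be checked directly from the pmf; a sketch (n = 10, p = 0.3 are arbitrary test values):

```python
import math

# E(X) = np, Var(X) = np(1-p), and the ratio P(X=k+1)/P(X=k) identity for Bin(n, p)
n, p = 10, 0.3
q = 1 - p
pmf = [math.comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum(k * k * pk for k, pk in enumerate(pmf)) - mean ** 2
ratio_ok = all(abs(pmf[k + 1] / pmf[k] - (p / q) * (n - k) / (k + 1)) < 1e-12
               for k in range(n))
```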

4.7 Poisson Random Variable
A random variable X is said to have a Poisson distribution with parameter λ if X takes
values 0, 1, 2 . . . with probabilities given as:

P(X = k) = e^(−λ) λ^k / k!
And
E(X) = λ Var(X) = λ
The Poisson distribution with parameter λ := np can be used as an approximation for a binomial distribution with parameters (n, p) when n is large and p is small such that np is moderate.
(Poisson Paradigm) Consider n events, with pi equal to the probability that event i occurs, i = 1, . . . , n. If all the pi are small and the trials are either independent or at most weakly dependent, then the number of these events that occur approximately has a Poisson distribution with mean λ := Σ_{i=1}^n pi. Another use is the Poisson process.
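A sketch comparing the two pmfs in the regime the approximation targets (n = 1000, p = 0.003 chosen so that λ = np = 3 is moderate):

```python
import math

# Poisson(np) approximation to Bin(n, p) for large n and small p
n, p = 1000, 0.003
lam = n * p
binom_pmf = lambda k: math.comb(n, k) * p ** k * (1 - p) ** (n - k)
poisson_pmf = lambda k: math.exp(-lam) * lam ** k / math.factorial(k)
# Largest pointwise discrepancy over the region carrying nearly all the mass
max_error = max(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(21))
```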

4.8 Hypergeometric Random Variable


Suppose that we have a set of N balls, of which m are red and N − m are blue. We choose n of these balls, without replacement, and define X to be the number of red balls in our sample. Then

P(X = x) = (m choose x)(N − m choose n − x) / (N choose n)

for x = 0, 1, . . . , n. A random variable whose probability mass function is given by the above equation for some values of n, N, m is said to be a hypergeometric random variable, and is denoted by H(n, N, m). Here,

E(X) = nm/N, Var(X) = (nm/N) [ (n − 1)(m − 1)/(N − 1) + 1 − nm/N ]
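A quick sketch checking that the hypergeometric pmf sums to 1 and that E(X) = nm/N (the parameters N = 20, m = 8, n = 5 are arbitrary test values):

```python
import math

# Hypergeometric H(n, N, m): pmf normalisation and mean
N, m, n = 20, 8, 5
pmf = {x: math.comb(m, x) * math.comb(N - m, n - x) / math.comb(N, n)
       for x in range(n + 1)}
total = sum(pmf.values())
mean = sum(x * px for x, px in pmf.items())
```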

4.9 Expected Value of Sums of Random Variables


Theorem 4.4.

E[X] = Σ_{s∈S} X(s) p(s)

Theorem 4.5. For random variables X1, X2, . . . , Xn,

E[Σ_{i=1}^n Xi] = Σ_{i=1}^n E[Xi]

4.10 Distribution Functions and Probability Mass Function


4.10.1 Properties of distribution function
1. FX is a nondecreasing function.

2. lim_{b→∞} FX(b) = 1

3. lim_{b→−∞} FX(b) = 0

4. FX is right continuous. That is, for any b ∈ R,

lim_{x→b+} FX(x) = FX(b)

4.10.2 Useful Calculations


1. Calculating probabilities from the distribution function

(a) P(a < X ≤ b) = FX(b) − FX(a)
(b) P(X < b) = lim_{n→∞} FX(b − 1/n)
(c) P(X = a) = FX(a) − FX(a−), where FX(a−) = lim_{x→a−} FX(x)

2. Calculating probabilities from the probability mass function

P(A) = Σ_{x∈A} pX(x)

3. Calculating the probability mass function from the distribution function

pX(x) = FX(x) − FX(x−)

4. Calculating the distribution function from the probability mass function

FX(x) = Σ_{y≤x} pX(y)

5 Continuous Random Variable
5.1 Introduction
Definition 5.1. We say that X is a continuous random variable if there exists a non-
negative function fX , defined for all real x ∈ R, having the property that, for any set B of
real numbers,

P(X ∈ B) = ∫_B fX(x) dx

The function fX is called the probability density function of the random variable X.
For instance, letting B = [a, b], we have

P(a ≤ X ≤ b) = ∫_a^b fX(x) dx

Definition 5.2. We defined the distribution function of X by


FX(x) = P(X ≤ x) = ∫_{−∞}^x fX(t) dt

and, using the Fundamental Theorem of Calculus,

FX′(x) = fX(x)

Theorem 5.1 (Properties of Distribution Function). 1. P(X = x) = 0 for all x ∈ R.

2. FX is continuous.

3. For any a, b ∈ R, where a < b,

P (a ≤ X ≤ b) = P (a < X ≤ b) = P (a ≤ X < b) = P (a < X < b)

5.2 Expectation and Variance of Continuous Random Variables


Definition 5.3. Let X be a continuous random variable with probability density function fX; then

E(X) = ∫_{−∞}^∞ x fX(x) dx

Var(X) = ∫_{−∞}^∞ (x² − [E(X)]²) fX(x) dx

Theorem 5.2. If X is a continuous random variable with probability density function fX ,


then for any real-valued function g,

1. E[g(X)] = ∫_{−∞}^∞ g(x) fX(x) dx

2. Same linearity Property:
E(aX + b) = aE(X) + b

3. Same alternative formula for variance

Var(X) = E(X 2 ) − [E(X)]2

Theorem 5.3 (Tail sum formula).


Suppose X is a nonnegative continuous random variable, then
E(X) = ∫_0^∞ P(X > x) dx = ∫_0^∞ P(X ≥ x) dx

Theorem 5.4. We have Var(aX + b) = a2 Var(X).

5.3 Uniform distribution


A random variable X is said to be uniformly distributed over the interval (0, 1) if its probability density function is given by

fX(x) = 1 for 0 < x < 1, and fX(x) = 0 otherwise

We denote this by X ∼ U(0, 1).


Finding FX:

FX(x) = ∫_{−∞}^x fX(y) dy = 0 if x < 0; x if 0 ≤ x < 1; 1 if 1 ≤ x

In general, for a < b, a random variable X is uniformly distributed over the interval (a, b) if
its probability density function is given by
fX(x) = 1/(b − a) if a < x < b, and fX(x) = 0 otherwise

We denote this by X ∼ U (a, b). In a similar way,



Z x 0,
 if x < a
FX (x) = fX (y)dy = x−ab−a
, if a ≤ x < b
−∞ 
1, if b ≤ x

It can be shown that

E(X) = (a + b)/2, Var(X) = (b − a)²/12

5.4 Normal Distribution
A random variable is said to be normally distributed with parameters µ and σ 2 if its
probability density function is given by
fX(x) = (1/(√(2π) σ)) e^(−(x−µ)²/(2σ²))

We denote this by X ∼ N(µ, σ²). The density function is bell-shaped, always positive, symmetric about µ, and attains its maximum at x = µ.
A normal random variable is called a standard normal random variable when µ = 0 and
σ = 1 and is denoted by Z ∼ N (0, 1). Its probability density function is denoted by φ and
its distribution function by Φ.
φ(x) = (1/√(2π)) e^(−x²/2)

Φ(x) = (1/√(2π)) ∫_{−∞}^x e^(−y²/2) dy

An observation: Let Y ∼ N (µ, σ 2 ) and Z ∼ N (0, 1), then


     
P(a < Y ≤ b) = P((a − µ)/σ < Z ≤ (b − µ)/σ) = Φ((b − µ)/σ) − Φ((a − µ)/σ)

Theorem 5.5 (Properties of Standard Normal). 1. P (Z ≥ 0) = P (Z ≤ 0) = 0.5.

2. −Z ∼ N (0, 1)

3. P (Z ≤ x) = 1 − P (Z > x)

4. P (Z ≤ −x) = P (Z ≥ x)
5. If Y ∼ N(µ, σ²), then X = (Y − µ)/σ ∼ N(0, 1)

6. If X ∼ N (0, 1), then Y = aX + b ∼ N (b, a2 )

Important facts:

1. If Y ∼ N (µ, σ 2 ), then E(Y ) = µ and Var(Y ) = σ 2 .

2. If Z ∼ N (0, 1), then E(Z) = 0 and Var(Z) = 1.

Definition 5.4. The qth quantile of a random variable X is defined as a number zq so that
P (X ≤ zq ) = q.
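The standardisation identity above can be checked numerically; Φ is available in the standard library via the error function, Φ(x) = (1 + erf(x/√2))/2 (the values of µ, σ, a, b below are arbitrary test inputs):

```python
import math

def Phi(x):
    # Standard normal distribution function via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# For Y ~ N(mu, sigma^2): P(a < Y <= b) = Phi((b-mu)/sigma) - Phi((a-mu)/sigma)
mu, sigma = 5.0, 2.0
a, b = 3.0, 9.0
prob = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)
```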

5.5 Exponential Distribution
A random variable X is said to be exponentially distributed with parameter λ > 0 if its
probability density function is given by
fX(x) = λe^(−λx) if x ≥ 0, and fX(x) = 0 if x < 0

The distribution function of X is given by


FX(x) = 0 if x ≤ 0, and FX(x) = 1 − e^(−λx) if x > 0

The exponential distribution has the memoryless property:

P (X > s + t | X > s) = P (X > t)

Mean and variance of X ∼ Exp(λ):


E(X) = 1/λ, Var(X) = 1/λ²

5.6 Gamma Distribution


A random variable X is said to have a gamma distribution with parameters (α, λ), denoted
by X ∼ Γ(α, λ), if its probability density function is given by
fX(x) = (λ^α / Γ(α)) e^(−λx) x^(α−1) if x ≥ 0, and fX(x) = 0 if x < 0

where λ > 0, α > 0, and Γ(α), called the gamma function, is defined by
Γ(α) = ∫_0^∞ e^(−y) y^(α−1) dy

Remark:

1. Γ(1) = 1.

2. Γ(α) = (α − 1)Γ(α − 1)

3. For integral values of α = n, Γ(n) = (n − 1)!.

4. Γ(1, λ) = Exp(λ).

5. Γ(1/2) = √π

5.7 Beta Distribution
A random variable X is said to have a beta distribution with parameter (a, b), denoted
by X ∼ Beta(a, b), if its density is given by
f(x) = (1/B(a, b)) x^(a−1) (1 − x)^(b−1) if 0 < x < 1, and f(x) = 0 otherwise

where

B(a, b) = ∫_0^1 x^(a−1) (1 − x)^(b−1) dx
is known as the beta function.
It can be shown that
B(a, b) = Γ(a)Γ(b)/Γ(a + b)
If X ∼ Beta(a, b), then

E[X] = a/(a + b) and Var(X) = ab/((a + b)²(a + b + 1))

5.8 Cauchy Distribution


A random variable X is said to have a Cauchy distribution with parameter θ, −∞ < θ <
∞, denoted by X ∼ Cauchy(θ), if its density is given by
f(x) = (1/π) · 1/(1 + (x − θ)²), −∞ < x < ∞

5.9 Approximation of Binomial Random Variables


Theorem 5.6 (De Moivre-Laplace Limit Theorem).
Suppose that X ∼ Bin(n, p). Then for any a < b
 
P(a ≤ (X − np)/√(npq) ≤ b) → Φ(b) − Φ(a)

as n → ∞, where q = 1 − p.
That is,
Bin(n, p) ≈ N (np, npq)
Equivalently,
(X − np)/√(npq) ≈ Z

where Z ∼ N(0, 1).
Remark: The normal approximation will generally be quite good for values of n satisfying np(1 − p) ≥ 10.

Approximation is further improved if we incorporate continuity correction.
If X ∼ Bin(n, p), then
 
P(X = k) = P(k − 1/2 ≤ X ≤ k + 1/2)

P(X ≥ k) = P(X ≥ k − 1/2)

P(X ≤ k) = P(X ≤ k + 1/2)
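A sketch comparing the exact binomial probability with the normal approximation under continuity correction (n = 100, p = 1/2 chosen so that np(1 − p) = 25 ≥ 10):

```python
import math

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))   # standard normal cdf

n, p = 100, 0.5
mu, sd = n * p, math.sqrt(n * p * (1 - p))
# Exact P(45 <= X <= 55) versus the corrected normal approximation
exact = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(45, 56))
approx = Phi((55.5 - mu) / sd) - Phi((44.5 - mu) / sd)
```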

5.10 Distribution of a Function of a Random Variable


Theorem 5.7. Let X be a continuous random variable having probability density function fX. Suppose that g(x) is a strictly monotonic (increasing or decreasing), differentiable (and thus continuous) function of x. Then the random variable Y defined by Y = g(X) has a probability density function given by

fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)| if y = g(x) for some x, and fY(y) = 0 if y ≠ g(x) for all x
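A sketch applying the theorem to Y = X² with X ∼ Exp(λ): g is strictly increasing on the support [0, ∞), with g⁻¹(y) = √y and |d/dy g⁻¹(y)| = 1/(2√y). The resulting density is checked by numerical integration against the known value P(Y ≤ 1) = P(X ≤ 1) = 1 − e^(−λ) (λ = 1.5 is an arbitrary test value):

```python
import math

lam = 1.5
f_X = lambda x: lam * math.exp(-lam * x)                 # Exp(lam) density
# Theorem 5.7 with g(x) = x^2 on [0, inf): f_Y(y) = f_X(sqrt(y)) / (2 sqrt(y))
f_Y = lambda y: f_X(math.sqrt(y)) / (2 * math.sqrt(y))

# Midpoint-rule integration of f_Y over (0, 1]
steps = 200_000
h = 1.0 / steps
p_y_le_1 = sum(f_Y((i + 0.5) * h) * h for i in range(steps))
target = 1 - math.exp(-lam)
```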

6 Jointly Distributed Random Variables
6.1 Joint Distribution Functions
Definition 6.1. For any two random variables X and Y defined on the same sample space, we define the joint distribution function of X and Y by

FX,Y(x, y) = P(X ≤ x, Y ≤ y) for x, y ∈ R

The distribution function of X can be obtained from the joint distribution function of X and Y in the following way:

FX(x) = lim_{y→∞} FX,Y(x, y)

We call FX the marginal distribution function of X.
Similarly,

FY(y) = lim_{x→∞} FX,Y(x, y)

and FY is called the marginal distribution function of Y.

Theorem 6.1 (Some Useful Calculations).


Let a, b, a1 ≤ a2 , b1 ≤ b2 be real numbers, then

P (X > a, Y > b) = 1 − FX (a) − FY (b) + FX,Y (a, b)

P (a1 < X ≤ a2 , b1 < Y ≤ b2 ) = FX,Y (a2 , b2 ) − FX,Y (a1 , b2 ) + FX,Y (a1 , b1 ) − FX,Y (a2 , b1 )

6.1.1 Jointly Discrete Random Variables


In the case when both X and Y are discrete random variables, we define the joint proba-
bility mass function of X and Y as:

pX,Y (x, y) = P (X = x, Y = y)

We can recover the probability mass functions of X and Y in the following manner:

pX(x) = P(X = x) = Σ_y pX,Y(x, y)

pY(y) = P(Y = y) = Σ_x pX,Y(x, y)

We call pX the marginal probability mass function of X and pY the marginal prob-
ability mass function of Y .

Theorem 6.2 (Some useful formulas).

1. P(a1 < X ≤ a2, b1 < Y ≤ b2) = Σ_{a1<x≤a2} Σ_{b1<y≤b2} pX,Y(x, y)

2. FX,Y(a, b) = P(X ≤ a, Y ≤ b) = Σ_{x≤a} Σ_{y≤b} pX,Y(x, y)

3. P(X > a, Y > b) = Σ_{x>a} Σ_{y>b} pX,Y(x, y)

6.1.2 Jointly Continuous Random Variables


We say that X and Y are jointly continuous random variables if there exists a function fX,Y (called the joint probability density function of X and Y) such that for every set C ⊂ R², we have

P((X, Y) ∈ C) = ∫∫_{(x,y)∈C} fX,Y(x, y) dx dy

Theorem 6.3 (Some useful formulas).

1. Let A, B ⊂ R and take C = A × B above:

P(X ∈ A, Y ∈ B) = ∫_A ∫_B fX,Y(x, y) dy dx

2. In particular, let a1, a2, b1, b2 ∈ R where a1 < a2 and b1 < b2; we have

P(a1 < X ≤ a2, b1 < Y ≤ b2) = ∫_{a1}^{a2} ∫_{b1}^{b2} fX,Y(x, y) dy dx

3. Let a, b ∈ R; we have

FX,Y(a, b) = P(X ≤ a, Y ≤ b) = ∫_{−∞}^a ∫_{−∞}^b fX,Y(x, y) dy dx

As a result of this,

fX,Y(x, y) = ∂²FX,Y(x, y)/∂x∂y
Definition 6.2. The marginal probability density function of X is given by
fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy

Similarly, the marginal probability density function of Y is given by


fY(y) = ∫_{−∞}^∞ fX,Y(x, y) dx

6.2 Independent Random Variables
Two random variables X and Y are said to be independent if

P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B) for any A, B ⊂ R

Theorem 6.4 (For jointly discrete random variables).

The following three statements are equivalent:

1. Random variables X and Y are independent.

2. For all x, y ∈ R, we have

pX,Y(x, y) = pX(x)pY(y)

3. For all x, y ∈ R, we have

FX,Y(x, y) = FX(x)FY(y)

Theorem 6.5. Random variables X and Y are independent if and only if there exist func-
tions g, h : R → R such that for all x, y ∈ R, we have

fX,Y (x, y) = g(x)h(y)

6.3 Sums of Independent Random Variables


Under the assumption of independence of X and Y , we have

fX,Y (x, y) = fX (x)fY (y)

Then it follows that

FX+Y(x) = ∫_{−∞}^∞ FX(x − t) fY(t) dt

and

fX+Y(x) = ∫_{−∞}^∞ fX(x − t) fY(t) dt
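A numerical sketch of the convolution formula: for X, Y independent Exp(λ), the convolution integral should reproduce the Γ(2, λ) density λ²xe^(−λx) (consistent with Theorems 6.6 and 6.7 below; λ = 2 and x = 1.3 are arbitrary test values):

```python
import math

lam = 2.0
f = lambda t: lam * math.exp(-lam * t) if t >= 0 else 0.0   # Exp(lam) density

def f_sum(x, steps=10_000):
    # Midpoint-rule evaluation of the convolution  ∫ f(x - t) f(t) dt  over [0, x]
    h = x / steps
    return sum(f(x - (i + 0.5) * h) * f((i + 0.5) * h) * h for i in range(steps))

x = 1.3
gamma_density = lam ** 2 * x * math.exp(-lam * x)            # Gamma(2, lam) at x
```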

Theorem 6.6 (Sum of 2 Independent Gamma Random Variables).


Assume that X ∼ Γ(α, λ) and Y ∼ Γ(β, λ), and X and Y are mutually independent. Then,

X + Y ∼ Γ(α + β, λ)

Theorem 6.7 (Sum of Independent Exponential Random Variables).


Let X1 , X2 , . . . , Xn be n independent exponential random variables each having parameter
λ. Equivalently, Xi ∼ Exp(λ) = Γ(1, λ). Then, X1 + X2 + · · · + Xn ∼ Γ(n, λ).

Theorem 6.8 (Sum of Independent Normal Random Variables).

If Xi ∼ N(µi, σi²) for i = 1, 2, . . . , n are independent, then Σ_{i=1}^n Xi ∼ N(Σ_{i=1}^n µi, Σ_{i=1}^n σi²).

6.4 X and Y are discrete and independent
Theorem 6.9 (Sum of 2 Independent Poisson Random Variables).
If X ∼ Poisson(λ) and Y ∼ Poisson(µ) are two independent random variables, X + Y ∼
Poisson(λ + µ).

Theorem 6.10 (Sum of 2 Independent Binomial Random Variables).


If X ∼ Bin(n, p) and Y ∼ Bin(m, p) are two independent random variables, X + Y ∼
Bin(n + m, p).

Theorem 6.11 (Sum of 2 Independent Geometric Random Variables).


If X ∼ Geom(p) and Y ∼ Geom(p) are two independent random variables, X+Y ∼ NB(2, p).

6.5 Conditional distribution: Discrete Case


The conditional probability mass function of X given that Y = y is defined by

pX|Y(x | y) := P(X = x | Y = y) = pX,Y(x, y)/pY(y)

for all values of y such that pY(y) > 0.


Similarly, the conditional distribution function of X given that Y = y is defined by

FX|Y(x | y) := P(X ≤ x, Y = y)/P(Y = y) = Σ_{a≤x} pX|Y(a | y)

Theorem 6.12. If X is independent of Y , then the conditional probability mass function


of X given Y = y is the same as the marginal probability mass function of X for every y
such that pY (y) > 0, i.e. pX|Y (x | y) = pX (x).

6.6 Conditional distributions: Continuous Case


Suppose X and Y are jointly continuous random variables. Define the conditional prob-
ability density function of X given that Y = y as

fX|Y(x | y) := fX,Y(x, y)/fY(y)

for all y such that fY(y) > 0. We define conditional probabilities of events associated with one random variable when we are given the value of a second random variable. That is, for A ⊂ R and y such that fY(y) > 0,

P(X ∈ A | Y = y) = ∫_A fX|Y(x | y) dx

In particular, the conditional distribution function of X given that Y = y is defined by

FX|Y(x | y) = P(X ≤ x | Y = y) = ∫_{−∞}^x fX|Y(t | y) dt

Theorem 6.13. If X is independent of Y , then the conditional probability density function


of X given Y = y is the same as the marginal probability density function of X for every y
such that fY (y) > 0, i.e.,
fX|Y (x | y) = fX (x)

6.7 Joint Probability Distribution Function of Functions of Random Variables
Let X and Y be jointly distributed random variables with joint probability density function
fX,Y .
Suppose that
U = g(X, Y ) and V = h(X, Y )
for some functions g and h.
The joint probability density function of U and V is given by

fU,V(u, v) = fX,Y(x, y) |J(x, y)|⁻¹

where x = a(u, v) and y = b(u, v). Here, g and h have continuous partial derivatives and

J(x, y) = det [ ∂g/∂x  ∂g/∂y ; ∂h/∂x  ∂h/∂y ]

6.8 Jointly Distributed Random Variables: n ≥ 3


Assume X, Y, Z are jointly continuous random variables, with

FX,Y,Z (x, y, z) := P (X ≤ x, Y ≤ y, Z ≤ z)

The marginal distribution functions are given as

FX,Y(x, y) = lim_{z→∞} FX,Y,Z(x, y, z)
FX,Z(x, z) = lim_{y→∞} FX,Y,Z(x, y, z)
FY,Z(y, z) = lim_{x→∞} FX,Y,Z(x, y, z)
FX(x) = lim_{y,z→∞} FX,Y,Z(x, y, z)
FY(y) = lim_{x,z→∞} FX,Y,Z(x, y, z)
FZ(z) = lim_{x,y→∞} FX,Y,Z(x, y, z)

6.8.1 Joint probability density function of X, Y and Z: fX,Y,Z(x, y, z)
For any D ⊂ R³, we have

P((X, Y, Z) ∈ D) = ∫∫∫_{(x,y,z)∈D} fX,Y,Z(x, y, z) dx dy dz

6.8.2 Marginal probability density functions of X, Y and Z

fX(x) = ∫_{−∞}^∞ ∫_{−∞}^∞ fX,Y,Z(x, y, z) dy dz
fY(y) = ∫_{−∞}^∞ ∫_{−∞}^∞ fX,Y,Z(x, y, z) dx dz
fZ(z) = ∫_{−∞}^∞ ∫_{−∞}^∞ fX,Y,Z(x, y, z) dx dy
fX,Y(x, y) = ∫_{−∞}^∞ fX,Y,Z(x, y, z) dz
fY,Z(y, z) = ∫_{−∞}^∞ fX,Y,Z(x, y, z) dx
fX,Z(x, z) = ∫_{−∞}^∞ fX,Y,Z(x, y, z) dy

6.8.3 Independent random variables


Theorem 6.14. For jointly continuous random variables, the following three statements are
equivalent:

1. Random variables X, Y and Z are independent

2. For all x, y, z ∈ R, we have

fX,Y,Z (x, y, z) = fX (x)fY (y)fZ (z)

3. For all x, y, z ∈ R, we have

FX,Y,Z (x, y, z) = FX (x)FY (y)FZ (z)

7 Properties of Expectation
Theorem 7.1. If a ≤ X ≤ b, then a ≤ E(X) ≤ b.

7.1 Expectation of Sums of Random Variables


Theorem 7.2.

1. If X and Y are jointly discrete with joint probability mass function pX,Y, then

E[g(X, Y)] = Σ_y Σ_x g(x, y) pX,Y(x, y)

2. If X and Y are jointly continuous with joint probability density function fX,Y, then

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) fX,Y(x, y) dx dy

Some important consequences of the theorem above are:

1. If g(x, y) ≥ 0 whenever pX,Y (x, y) > 0, then E[g(X, Y )] ≥ 0.

2. E[g(X, Y ) + h(X, Y )] = E[g(X, Y )] + E[h(X, Y )]

3. E[g(X) + h(Y )] = E[g(X)] + E[h(Y )].

4. Monotone Property
If jointly distributed random variables X and Y satisfy X ≤ Y , then

E(X) ≤ E(Y )

Theorem 7.3 (Boole’s Inequality).


P(∪_{k=1}^n Ak) ≤ Σ_{k=1}^n P(Ak)

7.2 Covariance, Variance of Sums, Correlations


Definition 7.1. The covariance of jointly distributed random variables X and Y , denoted
by cov(X, Y ), is defined by

cov(X, Y) = E[(X − µX)(Y − µY)]

where µX , µY denote the means of X and Y respectively.


If cov(X, Y) ≠ 0, we say that X and Y are correlated.

Theorem 7.4 (Alternative formulae for covariance).

cov(X, Y ) = E(XY ) − E(X)E(Y )


= E[X(Y − µY )]
= E[Y (X − µX )]
Theorem 7.5. If X and Y are independent, then for any functions g, h : R → R, we have

E[g(X)h(Y)] = E[g(X)]E[h(Y)]
Theorem 7.6. If X and Y are independent, then cov(X, Y ) = 0.
Theorem 7.7 (Some properties of covariance).

1. Var(X) = cov(X, X)
2. cov(X, Y ) = cov(Y, X)
3. cov(Σ_{i=1}^n ai Xi, Σ_{j=1}^m bj Yj) = Σ_{i=1}^n Σ_{j=1}^m ai bj cov(Xi, Yj)

Theorem 7.8.
Var(Σ_{k=1}^n Xk) = Σ_{k=1}^n Var(Xk) + 2 Σ_{1≤i<j≤n} cov(Xi, Xj)

If X1 , . . . , Xn are independent random variables, then


Var(Σ_{k=1}^n Xk) = Σ_{k=1}^n Var(Xk)

In other words, under independence, variance of sum = sum of variances
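The variance-of-a-sum identity can be verified exactly on a small joint pmf; a sketch (the pmf values are arbitrary, chosen only for illustration):

```python
# Check Var(X + Y) = Var(X) + Var(Y) + 2 cov(X, Y) on a small joint pmf
pmf = {(0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.1,
       (1, 0): 0.2, (1, 1): 0.1, (1, 2): 0.3}
E = lambda g: sum(g(x, y) * p for (x, y), p in pmf.items())

mx, my = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: (x - mx) ** 2)
var_y = E(lambda x, y: (y - my) ** 2)
cov_xy = E(lambda x, y: (x - mx) * (y - my))
var_sum = E(lambda x, y: (x + y - mx - my) ** 2)
```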


Definition 7.2 (Correlation Coefficient).
The correlation coefficient of random variables X and Y, denoted by ρ(X, Y), is defined by

ρ(X, Y) = cov(X, Y)/√(Var(X)Var(Y))

Theorem 7.9.

−1 ≤ ρ(X, Y) ≤ 1

1. The correlation coefficient is a measure of the degree of linearity between X and Y. If ρ(X, Y) = 0, then X and Y are said to be uncorrelated.

2. ρ(X, Y) = 1 if and only if Y = aX + b where a = σY/σX > 0.

3. ρ(X, Y) = −1 if and only if Y = aX + b where a = −σY/σX < 0.

4. ρ(X, Y) is dimensionless.

5. If X and Y are independent, then ρ(X, Y) = 0.

7.3 Conditional expectation
Definition 7.3.

1. If X and Y are jointly distributed discrete random variables, then


E[X | Y = y] = Σ_x x pX|Y(x | y), if pY(y) > 0

2. If X and Y are jointly distributed continuous random variables, then


E[X | Y = y] = ∫_{−∞}^∞ x fX|Y(x | y) dx, if fY(y) > 0

Theorem 7.10 (Some important formulas).


E[g(X) | Y = y] = Σ_x g(x) pX|Y(x | y) in the discrete case, and ∫_{−∞}^∞ g(x) fX|Y(x | y) dx in the continuous case,

and hence

E[Σ_{k=1}^n Xk | Y = y] = Σ_{k=1}^n E[Xk | Y = y]

7.3.1 Computing expectation by conditioning


Theorem 7.11.
E[X] = E[E[X | Y]] = Σ_y E(X | Y = y)P(Y = y) if Y is discrete, and ∫_{−∞}^∞ E(X | Y = y) fY(y) dy if Y is continuous

7.3.2 Computing probabilities by conditioning


Theorem 7.12.
Let X = IA where A is an event. Then we have

E[IA] = P(A) and E[IA | Y = y] = P(A | Y = y)

and hence

P(A) = E(IA) = E[E(IA | Y)]
= Σ_y E(IA | Y = y)P(Y = y) if Y is discrete, and ∫_{−∞}^∞ E(IA | Y = y) fY(y) dy if Y is continuous
= Σ_y P(A | Y = y)P(Y = y) if Y is discrete, and ∫_{−∞}^∞ P(A | Y = y) fY(y) dy if Y is continuous

7.4 Conditional Variance
Definition 7.4. The conditional variance of X given Y is defined as

Var(X | Y) = E[(X − E[X | Y])² | Y]

Theorem 7.13.
Var(X) = E[Var(X | Y )] + Var(E[X | Y ])

7.5 Moment Generating Functions


Definition 7.5. The moment generating function of random variable X, denoted by
MX, is defined as

MX(t) = E[e^(tX)]
= Σ_x e^(tx) pX(x) if X is discrete with probability mass function pX
= ∫_{−∞}^∞ e^(tx) fX(x) dx if X is continuous with probability density function fX

Theorem 7.14 (Properties of Moment Generating Function).

1. MX^(n)(0) = E[X^n]; the nth derivative of MX at 0 gives the nth moment.
2. Multiplicative Property: If X and Y are independent, then

MX+Y (t) = MX (t)MY (t)

3. Uniqueness Property: Let X and Y be random variables with their moment generating
functions MX and MY respectively. Suppose that there exists an h > 0 such that

MX (t) = MY (t), ∀t ∈ (−h, h)

then X and Y have the same distribution.


Theorem 7.15 (Typical Moment Generating Functions).

1. When X ∼ Be(p), M(t) = 1 − p + pe^t.

2. When X ∼ Bin(n, p), M(t) = (1 − p + pe^t)^n.

3. When X ∼ Geom(p), M(t) = pe^t / (1 − (1 − p)e^t).

4. When X ∼ Poisson(λ), M(t) = e^(λ(e^t − 1)).

5. When X ∼ U(α, β), M(t) = (e^(βt) − e^(αt))/((β − α)t).

6. When X ∼ Exp(λ), M(t) = λ/(λ − t) for t < λ.

7. When X ∼ N(µ, σ²), M(t) = e^(µt + σ²t²/2).
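A sketch of property 1: differentiating the Poisson(λ) mgf at t = 0 (here approximated by finite differences) recovers E[X] = λ and Var(X) = λ (λ = 2.5 is an arbitrary test value):

```python
import math

lam = 2.5
M = lambda t: math.exp(lam * (math.exp(t) - 1))    # Poisson(lam) mgf

h = 1e-4                                           # central finite differences
m1 = (M(h) - M(-h)) / (2 * h)                      # approximates M'(0)  = E[X]
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2            # approximates M''(0) = E[X^2]
var = m2 - m1 ** 2
```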

7.6 Joint Moment Generating Functions
Definition 7.6. For any n random variables X1 , . . . , Xn , the joint moment generating func-
tion, M (t1 , . . . , tn ), is defined for all real values t1 , . . . , tn by

M (t1 , . . . , tn ) = E[et1 X1 +···+tn Xn ]

The individual moment generating functions can be obtained from M (t1 , . . . , tn ) by letting
all but one of the tj be 0. That is,

MXi (t) = E[etXi ] = M (0, . . . , 0, t, 0, . . . , 0)

where the t is in the ith place.


It can be proved that M(t1, . . . , tn) uniquely determines the joint distribution of X1, . . . , Xn.
The n random variables X1, . . . , Xn are independent if and only if

M (t1 , . . . , tn ) = MX1 (t1 ) · · · MXn (tn )

8 Limit Theorems
8.1 Chebyshev’s Inequality and the Weak Law of Large Numbers
Theorem 8.1 (Markov’s Inequality).
Let X be a nonnegative random variable. For a > 0, we have
P(X ≥ a) ≤ E(X)/a
Theorem 8.2 (Chebyshev’s Inequality).
Let X be a random variable with finite mean µ and variance σ 2 , then for a > 0, we have
P(|X − µ| ≥ a) ≤ σ²/a²
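A sketch comparing Chebyshev's bound with the exact tail probability for a binomial random variable (Bin(20, 1/2) and a = 4 are arbitrary test values; the bound is typically loose):

```python
import math

# Chebyshev's bound versus the exact tail probability for X ~ Bin(20, 1/2)
n = 20
mu, var = n * 0.5, n * 0.25
pmf = [math.comb(n, k) * 0.5 ** n for k in range(n + 1)]
a = 4.0
exact = sum(pk for k, pk in enumerate(pmf) if abs(k - mu) >= a)
bound = var / a ** 2                   # sigma^2 / a^2 = 5/16
```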
Theorem 8.3 (Consequences of Chebyshev’s Inequality).
If Var(X) = 0, then the random variable X is a constant. Or in other words,
P (X = E(X)) = 1
Theorem 8.4 (The Weak Law of Large Numbers).
Let X1 , X2 , . . . be a sequence of independent and identically distributed random variables,
with common mean µ. Then, for any ε > 0,

P(|(X1 + · · · + Xn)/n − µ| ≥ ε) → 0 as n → ∞

8.2 Central Limit Theorem


Theorem 8.5 (Central Limit Theorem).
Let X1 , X2 , . . . be a sequence of independent and identically distributed random variables,
each having mean µ and variance σ 2 . Then the distribution of
(X1 + · · · + Xn − nµ)/(σ√n)

tends to the standard normal distribution as n → ∞. That is,

lim_{n→∞} P((X1 + · · · + Xn − nµ)/(σ√n) ≤ x) = (1/√(2π)) ∫_{−∞}^x e^(−t²/2) dt
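A Monte Carlo sketch of the theorem: standardised sums of U(0, 1) variables (µ = 1/2, σ² = 1/12) should hit the event {Z ≤ 1} with frequency close to Φ(1) ≈ 0.841 (the choices n = 48 and 20,000 trials are arbitrary):

```python
import math
import random

random.seed(0)                                       # reproducible simulation
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, trials = 48, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)                   # mean and sd of U(0, 1)
hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (sigma * math.sqrt(n))        # standardised sum
    hits += z <= 1.0
frac = hits / trials
```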

8.3 The Strong Law of Large Numbers


Theorem 8.6 (The Strong Law of Large Numbers).
Let X1 , X2 , . . . be a sequence of independent and identically distributed random variables,
each having a finite mean µ = E(Xi ). Then with probability 1,
(X1 + · · · + Xn)/n → µ as n → ∞

In other words,

P(lim_{n→∞} (X1 + · · · + Xn)/n = µ) = 1

9 Problems
1 (AY1314Sem1) Let X1 and X2 have a bivariate normal distribution with parameters µ1 = µ2 = 0, σ1 = σ2 = 1 and ρ = 1/2. Find the probability that all of the roots of the following equation are real:

X1 x² + 2X2 x + X1 = 0

2 (AY1617Sem1) Let (X1, X2) have a bivariate normal distribution with means 0, variances µ1² and µ2², respectively, and with correlation coefficient −1 < ρ < 1.

(a) Determine the distribution of aX1 + bX2, where a and b are two real numbers such that a² + b² > 0.
(b) Find a constant b such that X1 + bX2 is independent of X1. Justify your answer.
(c) Find the probability that the following equation has real roots:

X1 x² − 2X1 x − bX2 = 0

where b is the constant found in part (b).
