Undergraduate Probability: Richard F. Bass

Undergraduate Probability
Richard F. Bass

© Copyright 2013 Richard F. Bass
Contents
1 Combinatorics
2 The probability set-up
3 Independence
4 Conditional probability
5 Random variables
6 Some discrete distributions
7 Continuous distributions
8 Normal distribution
9 Normal approximation
10 Some continuous distributions
11 Multivariate distributions
12 Expectations
13 Moment generating functions
14 Limit laws
Chapter 1
Combinatorics
The first basic principle is to multiply. Suppose we have 4 shirts of 4 different
colors and 3 pairs of pants of different colors. How many outfits are possible?
For each shirt there are 3 possibilities, so altogether there are 4 × 3 = 12
possibilities.
Example. How many license plates of 3 letters followed by 3 numbers are
possible? Answer. \(26^3 \cdot 10^3\), because there are 26 possibilities for the first
place, 26 for the second, 26 for the third, 10 for the fourth, 10 for the fifth,
and 10 for the sixth. We multiply.
How many ways can one arrange a, b, c? One can have

abc, acb, bac, bca, cab, cba.
There are 3 possibilities for the first position. Once we have chosen the first
position, there are 2 possibilities for the second position, and once we have
chosen the first two positions, there is only 1 choice left for the third. So
there are 3 × 2 × 1 = 3! arrangements. In general, if there are n letters, there
are n! possibilities.
Example. What is the number of possible batting orders with 9 players?
Answer. 9!
Example. How many ways can one arrange 4 math books, 3 chemistry books,
2 physics books, and 1 biology book on a bookshelf so that all the math books
are together, all the chemistry books are together, and all the physics books
are together? Answer. 4!(4!3!2!1!). We can arrange the math books in 4!
ways, the chemistry books in 3! ways, the physics books in 2! ways, and the
biology book in 1! = 1 way. But we also have to decide which set of books go
on the left, which next, and so on. That is the same as the number of ways
of arranging the letters M, C, P, B, and there are 4! ways of doing that.
How many ways can one arrange the letters a, a, b, c? Let us label them
A, a, b, c. There are 4!, or 24, ways to arrange these letters. But we have
repeats: we could have Aa or aA. So each arrangement is counted twice,
and the answer should be 4!/2! = 12. If there were 3 a’s, 4 b’s, and 2 c’s,
we would have
\[ \frac{9!}{3!\,4!\,2!}. \]
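The count for repeated letters can be checked by brute force. The following is a minimal Python sketch, not part of the original notes; the function names `count_distinct_arrangements` and `arrangements_formula` are illustrative choices. The first function lists every ordering and discards duplicates, the second applies the factorial formula.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def count_distinct_arrangements(letters):
    """Count distinct orderings of `letters` by listing every permutation."""
    return len(set(permutations(letters)))

def arrangements_formula(letters):
    """n! divided by the factorial of each letter's multiplicity."""
    result = factorial(len(letters))
    for multiplicity in Counter(letters).values():
        result //= factorial(multiplicity)
    return result

print(count_distinct_arrangements("aabc"))   # brute force for a, a, b, c
print(arrangements_formula("aabc"))          # 4!/2!
print(arrangements_formula("aaabbbbcc"))     # 9!/(3!4!2!)
```

Brute force is only practical for short words, since the number of permutations grows like n!.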
What we just counted are called permutations. Now let us look
at what are known as combinations. How many ways can we choose 3 letters
out of 5? If the letters are a, b, c, d, e and order matters, then there would
be 5 for the first position, 4 for the second, and 3 for the third, for a total
of 5 × 4 × 3. But suppose the letters selected were a, b, c. If order doesn’t
matter, we will have the letters a, b, c 6 times, because there are 3! ways of
arranging 3 letters. The same is true for any choice of three letters. So we
should have 5 × 4 × 3/3!. We can rewrite this as
\[ \frac{5 \cdot 4 \cdot 3}{3!} = \frac{5!}{3!\,2!}. \]
This is often written \(\binom{5}{3}\), read “5 choose 3.” Sometimes this is written \(C_{5,3}\)
or \({}_5C_3\). More generally,
\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \]
Example. How many ways can one choose a committee of 3 out of 10 people?
Answer. \(\binom{10}{3}\).
Example. Suppose there are 8 men and 8 women. How many ways can we
choose a committee that has 2 men and 2 women? Answer. We can choose
2 men in \(\binom{8}{2}\) ways and 2 women in \(\binom{8}{2}\) ways. The number of committees
is then the product: \(\binom{8}{2}\binom{8}{2}\).
Suppose one has 9 people and one wants to divide them into one committee
of 3, one of 4, and a last of 2. There are \(\binom{9}{3}\) ways of choosing the first
committee. Once that is done, there are 6 people left and there are \(\binom{6}{4}\)
ways of choosing the second committee. Once that is done, the remainder
must go in the third committee. So the answer is
\[ \frac{9!}{3!\,6!} \cdot \frac{6!}{4!\,2!} = \frac{9!}{3!\,4!\,2!}. \]
In general, to divide \(n\) objects into one group of \(n_1\), one group of \(n_2\), . . ., and
a \(k\)th group of \(n_k\), where \(n = n_1 + \cdots + n_k\), the answer is
\[ \frac{n!}{n_1!\,n_2! \cdots n_k!}. \]
These are known as multinomial coefficients.
Another example: suppose we have 4 Americans and 6 Canadians. (a)
How many ways can we arrange them in a line? (b) How many ways if all
the Americans have to stand together? (c) How many ways if not all the
Americans are together? (d) Suppose you want to choose a committee of 3,
which will be all Americans or all Canadians. How many ways can this be
done? (e) How many ways for a committee of 3 that is not all Americans
or all Canadians? Answer. (a) This is just 10! (b) Consider the Americans
as a group and each Canadian as a group; this gives 7 groups, which can
be arranged in 7! ways. Once we have these seven groups arranged, we can
arrange the Americans within their group in 4! ways, so we get 4!7!. (c) This
is the answer to (a) minus the answer to (b): 10! − 4!7!. (d) We can choose a
committee of 3 Americans in \(\binom{4}{3}\) ways and a committee of 3 Canadians in
\(\binom{6}{3}\) ways, so the answer is \(\binom{4}{3} + \binom{6}{3}\). (e) We can choose a committee of
3 out of 10 in \(\binom{10}{3}\) ways, so the answer is \(\binom{10}{3} - \binom{4}{3} - \binom{6}{3}\).
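Parts (c), (d) and (e) can be checked numerically. The sketch below is an illustration only: the labels `A1`, ..., `C6` and the helper `nationality` are made up for the example. It lists every committee of 3 and counts those drawn from a single nationality.

```python
from itertools import combinations
from math import comb, factorial

# Hypothetical labels: four Americans and six Canadians.
people = ["A1", "A2", "A3", "A4"] + [f"C{i}" for i in range(1, 7)]

def nationality(person):
    return person[0]  # "A" for American, "C" for Canadian

committees = list(combinations(people, 3))
same_nationality = [c for c in committees
                    if len({nationality(p) for p in c}) == 1]
mixed = len(committees) - len(same_nationality)

print(len(same_nationality), comb(4, 3) + comb(6, 3))        # part (d)
print(mixed, comb(10, 3) - comb(4, 3) - comb(6, 3))          # part (e)
print(factorial(10) - factorial(4) * factorial(7))           # part (c)
```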
Finally, we consider three interrelated examples. First, suppose one has 8
o’s and 2 |’s. How many ways can one arrange these symbols in order? There
are 10 spots, and we want to select 8 of them in which we place the o’s. So
we have \(\binom{10}{8}\).
Next, suppose one has 8 indistinguishable balls. How many ways can one
put them in 3 boxes? Let us make sequences of o’s and |’s; any such sequence
that has | at each side, 2 other |’s, and 8 o’s represents a way of arranging
balls into boxes. For example, if one has
| o o | o o o | o o o |,
this would represent 2 balls in the first box, 3 in the second, and 3 in the
third. Altogether there are 8 + 4 symbols; the first is a |, as is the last, so
there are 10 symbols that can be either | or o. Also, 8 of them must be o.
How many ways out of 10 spaces can one pick 8 of them into which to put an
o? We just did that: the answer is \(\binom{10}{8}\).
Now, to finish, suppose we have $8,000 to invest in 3 mutual funds. Each
mutual fund requires you to make investments in increments of $1,000. How
many ways can we do this? This is the same as putting 8 indistinguishable
balls in 3 boxes, and we know the answer is \(\binom{10}{8}\).
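The balls-in-boxes count can be confirmed by direct enumeration. This is a small Python sketch under the assumption that a box (or fund) may receive nothing; `count_allocations` is an illustrative name, not something from the notes.

```python
from itertools import product
from math import comb

def count_allocations(balls, boxes):
    """Count ways to put `balls` indistinguishable balls into `boxes` boxes."""
    return sum(1 for counts in product(range(balls + 1), repeat=boxes)
               if sum(counts) == balls)

print(count_allocations(8, 3))   # brute force over all (x1, x2, x3)
print(comb(10, 8))               # formula from the text
```

Each tuple `(x1, x2, x3)` with `x1 + x2 + x3 = 8` corresponds to one sequence of o’s and |’s, which is why the two numbers agree.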
Chapter 2
The probability set-up
We will have a sample space, denoted S (sometimes Ω) that consists of all
possible outcomes. For example, if we roll two dice, the sample space would
be all possible pairs made up of the numbers one through six. An event is a
subset of S.
Another example is to toss a coin 2 times, and let
S = {HH, HT, TH, TT};
or to let S be the possible orders in which 5 horses finish in a horse race; or
S the possible prices of some stock at closing time today; or S = [0, ∞), the
age at which someone dies; or S the points in a circle, the possible places a
dart can hit.
We use the following usual notation: A ∪ B is the union of A and B and
denotes the points of S that are in A or B or both. A ∩ B is the intersection
of A and B and is the set of points that are in both A and B. ∅ denotes the
empty set. A ⊂ B means that A is contained in B and \(A^c\) is the complement
of A, that is, the points in S that are not in A. We extend the definition to
have \(\bigcup_{i=1}^n A_i\) be the union of \(A_1, \ldots, A_n\), and similarly \(\bigcap_{i=1}^n A_i\).
An exercise is to show that
\[ \Big(\bigcup_{i=1}^n A_i\Big)^c = \bigcap_{i=1}^n A_i^c \quad\text{and}\quad \Big(\bigcap_{i=1}^n A_i\Big)^c = \bigcup_{i=1}^n A_i^c. \]
These are called DeMorgan’s laws.
There are no restrictions on S. The collection of events, F, must be a
σ-field, which means that it satisfies the following:
(i) ∅ and S are in F;
(ii) if A is in F, then \(A^c\) is in F;
(iii) if \(A_1, A_2, \ldots\) are in F, then \(\bigcup_{i=1}^\infty A_i\) and \(\bigcap_{i=1}^\infty A_i\) are in F.
Typically we will take F to be all subsets of S, and so (i)-(iii) are automatically
satisfied. The only times we won’t have F be all subsets is for
technical reasons or when we talk about conditional expectation.
So now we have a space S, a σ-field F, and we need to talk about what a
probability is. There are three axioms:
(1) 0 ≤ P(E) ≤ 1 for all events E.
(2) P(S) = 1.
(3) If the \(E_i\) are pairwise disjoint, \(P(\bigcup_{i=1}^\infty E_i) = \sum_{i=1}^\infty P(E_i)\).
Pairwise disjoint means that \(E_i \cap E_j = \emptyset\) unless \(i = j\).

Note that probabilities are probabilities of subsets of S, not of points of
S. However it is common to write P(x) for P({x}).
Intuitively, the probability of E should be the number of times E occurs
in n times, taking a limit as n tends to infinity. This is hard to use. It is
better to start with these axioms, and then to prove that the probability of
E is the limit as we hoped.
There are some easy consequences of the axioms.
Proposition 2.1 (1) P(∅) = 0.
(2) If \(A_1, \ldots, A_n\) are pairwise disjoint, \(P(\bigcup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i)\).
(3) \(P(E^c) = 1 - P(E)\).
(4) If E ⊂ F, then P(E) ≤ P(F).
(5) P(E ∪ F) = P(E) + P(F) − P(E ∩ F).
Proof. For (1), let \(A_i = \emptyset\) for each \(i\). These are clearly disjoint, so
\(P(\emptyset) = P(\bigcup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i) = \sum_{i=1}^\infty P(\emptyset)\). If P(∅) were positive, then the last
term would be infinity, contradicting the fact that probabilities are between
0 and 1. So the probability must be zero.
The second follows if we let \(A_{n+1} = A_{n+2} = \cdots = \emptyset\). We still have pairwise
disjointness and \(\bigcup_{i=1}^\infty A_i = \bigcup_{i=1}^n A_i\), and \(\sum_{i=1}^\infty P(A_i) = \sum_{i=1}^n P(A_i)\), using (1).
To prove (3), use \(S = E \cup E^c\). By (2), \(P(S) = P(E) + P(E^c)\). By axiom
(2), P(S) = 1, so (3) follows.
To prove (4), write \(F = E \cup (F \cap E^c)\), so \(P(F) = P(E) + P(F \cap E^c) \ge P(E)\)
by (2) and axiom (1).
Similarly, to prove (5), we have \(P(E \cup F) = P(E) + P(E^c \cap F)\) and
\(P(F) = P(E \cap F) + P(E^c \cap F)\). Solving the second equation for \(P(E^c \cap F)\) and
substituting in the first gives the desired result.
It is very common for a probability space to consist of finitely many points,
all with equally likely probabilities. For example, in tossing a fair coin, we
have S = {H, T}, with \(P(H) = P(T) = \frac12\). Similarly, in rolling a fair die, the
probability space consists of {1, 2, 3, 4, 5, 6}, each point having probability \(\frac16\).
Example. What is the probability that if we roll 2 dice, the sum is 7? Answer.
There are 36 possibilities, of which 6 have a sum of 7: (1, 6), (2, 5), (3, 4),
(4, 3), (5, 2), (6, 1). Since they are all equally likely, the probability is \(\frac{6}{36} = \frac16\).
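When all outcomes are equally likely, a probability is a count of favorable outcomes divided by the size of the sample space. The following Python sketch illustrates this for two dice; the helper name `probability` is an assumption made for the example.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """P(event) when every outcome in sample_space is equally likely."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

p_sum_7 = probability(lambda roll: roll[0] + roll[1] == 7)
print(p_sum_7)
```

Using `Fraction` keeps the answer exact instead of rounding it to a decimal.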
Example. What is the probability that in a poker hand (5 cards out of 52)
we get exactly 4 of a kind? Answer. The probability of 4 aces and 1 king is
\[ \binom{4}{4}\binom{4}{1} \Big/ \binom{52}{5}. \]
The probability of 4 jacks and one 3 is the same. There
are 13 ways to pick the rank that we have 4 of and then 12 ways to pick the
rank we have one of, so the answer is
\[ 13 \cdot 12 \cdot \frac{\binom{4}{4}\binom{4}{1}}{\binom{52}{5}}. \]
Example. What is the probability that in a poker hand we get exactly 3
of a kind (and the other two cards are of different ranks)? Answer. The
probability of 3 aces, 1 king and 1 queen is
\[ \binom{4}{3}\binom{4}{1}\binom{4}{1} \Big/ \binom{52}{5}. \]
We have 13 choices for the rank we have 3 of and \(\binom{12}{2}\) choices for the other
two ranks, so the answer is
\[ 13 \binom{12}{2} \frac{\binom{4}{3}\binom{4}{1}\binom{4}{1}}{\binom{52}{5}}. \]
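The two poker answers are easy to evaluate numerically. Here is a short Python sketch, given only as an illustration, that computes both as exact fractions; the variable names are our own.

```python
from fractions import Fraction
from math import comb

hands = comb(52, 5)   # number of 5-card hands

# Four of a kind: rank of the four, rank of the odd card, then suits.
four_of_a_kind = Fraction(13 * 12 * comb(4, 4) * comb(4, 1), hands)

# Three of a kind: rank of the three, two other ranks, then suits.
three_of_a_kind = Fraction(13 * comb(12, 2) * comb(4, 3) * comb(4, 1) ** 2, hands)

print(four_of_a_kind, float(four_of_a_kind))
print(three_of_a_kind, float(three_of_a_kind))
```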
Example. In a class of 30 people, what is the probability everyone has a
different birthday? (We assume each day is equally likely.)
Answer. Let the first person have a birthday on some day. The probability
that the second person has a different birthday will be \(\frac{364}{365}\). The probability
that the third person has a different birthday from the first two people is \(\frac{363}{365}\).
Continuing in this way, the 30th person must avoid 29 days, so the answer is
\[ \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{336}{365}. \]
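The product in the birthday example is tedious by hand but immediate on a computer. A minimal Python sketch follows, assuming 365 equally likely days as in the text; the function name `prob_all_distinct` is illustrative.

```python
def prob_all_distinct(people, days=365):
    """Probability that `people` birthdays are all different."""
    p = 1.0
    for k in range(people):
        p *= (days - k) / days   # k days are already taken
    return p

print(prob_all_distinct(30))
```

The factor for `k = 0` is 365/365 = 1, which corresponds to the first person being free to have any birthday.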
Chapter 3
Independence
We say E and F are independent if
P(E ∩ F ) = P(E)P(F ).
Example. Suppose you flip two coins. The outcome of heads on the second is
independent of the outcome of tails on the first. To be more precise, if A is
tails for the first coin and B is heads for the second, and we assume we have
fair coins (although this is not necessary), we have
\(P(A \cap B) = \frac14 = \frac12 \cdot \frac12 = P(A)P(B)\).
Example. Suppose you draw a card from an ordinary deck. Let E be the event that you
drew an ace, F the event that you drew a spade. Here
\(\frac{1}{52} = P(E \cap F) = \frac{1}{13} \cdot \frac{1}{4} = P(E)P(F)\).
Proposition 3.1 If E and F are independent, then E and \(F^c\) are independent.
Proof.
\[ P(E \cap F^c) = P(E) - P(E \cap F) = P(E) - P(E)P(F) = P(E)[1 - P(F)] = P(E)P(F^c). \]
We say E, F, and G are independent if E and F are independent, E and G are
independent, F and G are independent, and P(E ∩ F ∩ G) = P(E)P(F)P(G).
Example. Suppose you roll two dice, E is that the sum is 7, F that the first
is a 4, and G that the second is a 3. E and F are independent, as are E and
G and F and G, but E, F and G are not.
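The claim that E, F and G are pairwise independent but not independent can be verified by listing all 36 rolls. The following Python sketch is illustrative; the helper `P` and the lambda names mirror the events in the example but are otherwise our own.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))

def P(event):
    """Probability of an event on two fair dice."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

E = lambda r: r[0] + r[1] == 7   # the sum is 7
F = lambda r: r[0] == 4          # the first die is a 4
G = lambda r: r[1] == 3          # the second die is a 3

pairwise = [
    P(lambda r: E(r) and F(r)) == P(E) * P(F),
    P(lambda r: E(r) and G(r)) == P(E) * P(G),
    P(lambda r: F(r) and G(r)) == P(F) * P(G),
]
triple = P(lambda r: E(r) and F(r) and G(r)) == P(E) * P(F) * P(G)
print(pairwise, triple)
```

All three pairwise checks hold, while the triple product fails: the only roll in E ∩ F ∩ G is (4, 3), which has probability 1/36 rather than 1/216.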
Example. What is the probability that exactly 3 threes will show if you roll
10 dice? Answer. The probability that the 1st, 2nd, and 4th dice will show
a three and the other 7 will not is \((\frac16)^3 (\frac56)^7\). Independence is used here: the
probability is \(\frac16 \cdot \frac16 \cdot \frac56 \cdot \frac16 \cdot \frac56 \cdots \frac56\). The probability that the 4th, 5th, and 6th dice
will show a three and the other 7 will not is the same thing. So to answer
our original question, we take \((\frac16)^3 (\frac56)^7\) and multiply it by the number of ways
of choosing 3 dice out of 10 to be the ones showing threes. There are \(\binom{10}{3}\)
ways of doing that.
This is a particular example of what are known as Bernoulli trials or the
binomial distribution.
Suppose you have n independent trials, where the probability of a success
is p. Then the probability there are k successes is the number of ways of putting
k objects in n slots (which is \(\binom{n}{k}\)) times the probability that there will be
k successes and n − k failures in exactly a given order. So the probability is
\[ \binom{n}{k} p^k (1-p)^{n-k}. \]
k
A problem that comes up in actuarial science frequently is gambler’s ruin.
Example. Suppose you toss a fair coin repeatedly and independently. If it

comes up heads, you win a dollar, and if it comes up tails, you lose a dollar.
Suppose you start with $50. What’s the probability you will get to $200
before you go broke?
Answer. It is easier to solve a slightly harder problem. Let y(x) be the
probability you get to 200 before 0 if you start with x dollars. Clearly y(0) = 0
and y(200) = 1. If you start with x dollars, then with probability \(\frac12\) you get
a heads and will then have x + 1 dollars. With probability \(\frac12\) you get a tails
and will then have x − 1 dollars. So we have
\[ y(x) = \tfrac12 y(x+1) + \tfrac12 y(x-1). \]
Multiplying by 2, and subtracting y(x) + y(x − 1) from each side, we have
y(x + 1) − y(x) = y(x) − y(x − 1).
This says succeeding slopes of the graph of y(x) are constant (remember that
x must be an integer). In other words, y(x) must be a line. Since y(0) = 0
and y(200) = 1, we have y(x) = x/200, and therefore y(50) = 1/4.
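The recursion for y(x) can also be solved mechanically. The sketch below is one possible approach, not the method of the notes: it rewrites the equation as y(x + 1) = 2y(x) − y(x − 1), builds every y(x) as a multiple of the unknown value y(1), and then scales so that y(target) = 1. The function name `ruin_probabilities` is an assumption.

```python
from fractions import Fraction

def ruin_probabilities(target):
    """Solve y(x) = (y(x+1) + y(x-1))/2 with y(0) = 0 and y(target) = 1."""
    # With y(0) = 0 and y(1) = s unknown, y(x) = c[x] * s for every x,
    # where c follows the recursion c[x+1] = 2 c[x] - c[x-1].
    c = [Fraction(0), Fraction(1)]
    for x in range(1, target):
        c.append(2 * c[x] - c[x - 1])
    s = 1 / c[target]            # choose s so that y(target) = 1
    return [cx * s for cx in c]

y = ruin_probabilities(200)
print(y[50])
```

The computed values agree with the line y(x) = x/200 found above.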
Example. Suppose we are in the same situation, but you are allowed to go
arbitrarily far in debt. Let y(x) be the probability you ever get to $200.
What is a formula for y(x)?
Answer. Just as above, we have the equation \(y(x) = \frac12 y(x+1) + \frac12 y(x-1)\).
This implies y(x) is linear, and as above y(200) = 1. Now the slope of y
cannot be negative, or else we would have y > 1 for some x and that is not
possible. Neither can the slope be positive, or else we would have y < 0,
and again this is not possible, because probabilities must be between 0 and
1. Therefore the slope must be 0, or y(x) is constant, or y(x) = 1 for all x.
In other words, one is certain to get to $200 eventually (provided, of course,
that one is allowed to go into debt). There is nothing special about the figure
200. Another way of seeing this is to compute as above the probability of
getting to 200 before −M and then letting M → ∞.
Chapter 4
Conditional probability
Suppose there are 200 men, of whom 100 are smokers, and 100 women, of
whom 20 are smokers. What is the probability that a person chosen at
random will be a smoker? The answer is 120/300. Now, let us ask, what is
the probability that a person chosen at random is a smoker given that the
person is a woman? One would expect the answer to be 20/100 and it is.
What we have computed is
\[ \frac{\text{number of women smokers}}{\text{number of women}} = \frac{\text{number of women smokers}/300}{\text{number of women}/300}, \]
which is the same as the probability that a person chosen at random is a
woman and a smoker divided by the probability that a person chosen at
random is a woman.
With this in mind, we make the following definition. If P(F) > 0, we
define
\[ P(E \mid F) = \frac{P(E \cap F)}{P(F)}. \]
P(E | F) is read “the probability of E given F.”
Note P(E ∩ F ) = P(E | F )P(F ).
Suppose you roll two dice. What is the probability the sum is 8? There are
five ways this can happen (2, 6), (3, 5), (4, 4), (5, 3), (6, 2), so the probability
is 5/36. Let us call this event A. What is the probability that the sum is
8 given that the first die shows a 3? Let B be the event that the first die
shows a 3. Then P(A ∩ B) is the probability that the first die shows a 3 and
the sum is 8, or 1/36. P(B) = 1/6, so \(P(A \mid B) = \frac{1/36}{1/6} = 1/6\).
Example. Suppose a box has 3 red marbles and 2 black ones. We select 2
marbles. What is the probability that the second marble is red given that the first
one is red? Answer. Let A be the event the second marble is red, and B the
event that the first one is red. P(B) = 3/5, while P(A ∩ B) is the probability
both are red, or is the probability that we chose 2 red out of 3 and 0 black
out of 2. Then \(P(A \cap B) = \binom{3}{2}\binom{2}{0} \big/ \binom{5}{2} = 3/10\). Then
\(P(A \mid B) = \frac{3/10}{3/5} = 1/2\).
Example. A family has 2 children. Given that one of the children is a boy,
what is the probability that the other child is also a boy? Answer. Let B be
the event that one child is a boy, and A the event that both children are boys.
The possibilities are bb, bg, gb, gg, each with probability 1/4. P(A ∩ B) =
P(bb) = 1/4 and P(B) = P(bb, bg, gb) = 3/4. So the answer is \(\frac{1/4}{3/4} = 1/3\).
Example. Suppose the test for HIV is 99% accurate in both directions and
0.3% of the population is HIV positive. If someone tests positive, what is
the probability they actually are HIV positive?
Let D mean HIV positive, and T mean tests positive.
\[ P(D \mid T) = \frac{P(D \cap T)}{P(T)} = \frac{(.99)(.003)}{(.99)(.003) + (.01)(.997)} \approx 23\%. \]
A short reason why this surprising result holds is that the error in the test is
much greater than the percentage of people with HIV. A little longer answer
is to suppose that we have 1000 people. On average, 3 of them will be HIV
positive, and about 10 of the other 997 will test positive even though they are
not. So roughly 13 people test positive, of whom about 3 have HIV, and the
chances that someone has HIV given that the person tests positive is
approximately 3/13, or about 23%. The reason that it is not exactly 3/13 is that
there is some chance someone who is positive will test negative, and the
expected number of false positives is 9.97 rather than 10.
Suppose you know P(E | F ) and you want P(F | E).
Example. Suppose 36% of families own a dog, 30% of families own a cat,
and 22% of the families that have a dog also have a cat. A family is chosen
at random and found to have a cat. What is the probability they also own
a dog? Answer. Let D be the families that own a dog, and C the families
that own a cat. We are given P(D) = .36, P(C) = .30, P(C | D) = .22. We
want to know P(D | C). We know P(D | C) = P(D ∩ C)/P(C). To find
the numerator, we use P(D ∩ C) = P(C | D)P(D) = (.22)(.36) = .0792. So
P(D | C) = .0792/.3 = .264 = 26.4%.
Example. Suppose 30% of the women in a class received an A on the test
and 25% of the men received an A. The class is 60% women. Given that a
person chosen at random received an A, what is the probability this person
is a woman? Answer. Let A be the event of receiving an A, W be the
event of being a woman, and M the event of being a man. We are given
P(A | W) = .30, P(A | M) = .25, P(W) = .60 and we want P(W | A). From
the definition
\[ P(W \mid A) = \frac{P(W \cap A)}{P(A)}. \]
As in the previous example,
P(W ∩ A) = P(A | W )P(W ) = (.30)(.60) = .18.
To find P(A), we write
P(A) = P(W ∩ A) + P(M ∩ A).
Since the class is 40% men,
P(M ∩ A) = P(A | M )P(M ) = (.25)(.40) = .10.
So
P(A) = P(W ∩ A) + P(M ∩ A) = .18 + .10 = .28.
Finally,
\[ P(W \mid A) = \frac{P(W \cap A)}{P(A)} = \frac{.18}{.28}. \]
To get a general formula, we can write
\[ P(F \mid E) = \frac{P(E \cap F)}{P(E)} = \frac{P(E \mid F)P(F)}{P(E \cap F) + P(E \cap F^c)} = \frac{P(E \mid F)P(F)}{P(E \mid F)P(F) + P(E \mid F^c)P(F^c)}. \]
This formula is known as Bayes’ rule.
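Bayes’ rule is a one-line computation once the three inputs are known. The following Python sketch is illustrative only: the function name `bayes` and its parameter names are assumptions, and the inputs are the figures from the class example and the HIV test example above.

```python
def bayes(p_e_given_f, p_f, p_e_given_not_f):
    """P(F | E) from P(E | F), P(F), and P(E | F^c)."""
    numerator = p_e_given_f * p_f
    return numerator / (numerator + p_e_given_not_f * (1 - p_f))

# Class example: E = received an A, F = woman.
print(bayes(0.30, 0.60, 0.25))
# HIV test example: E = tests positive, F = HIV positive.
print(bayes(0.99, 0.003, 0.01))
```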

Here is another example related to conditional probability, although this
is not an example of Bayes’ rule. This is known as the Monty Hall problem
after the host of the TV show of the 60’s called Let’s Make a Deal.
There are three doors, behind one a nice car, behind each of the other
two a goat eating a bale of straw. You choose a door. Then Monty Hall
opens one of the other doors, which shows a bale of straw. He gives you the
opportunity of switching to the remaining door. Should you do it?
Answer. Let’s suppose you choose door 1, since the same analysis applies
whichever door you chose. Strategy one is to stick with door 1. With prob-
ability 1/3 you chose the car. Monty Hall shows you one of the other doors,
but that doesn’t change your probability of winning.
Strategy 2 is to change. Let’s say the car is behind door 1, which happens
with probability 1/3. Monty Hall shows you one of the other doors, say
door 2. There will be a goat, so you switch to door 3, and lose. The same
argument applies if he shows you door 3. Suppose the car is behind door 2.
He will show you door 3, since he doesn’t want to give away the car. You
switch to door 2 and win. This happens with probability 1/3. The same
argument applies if the car is behind door 3. So you win with probability
2/3 and lose with probability 1/3. Thus strategy 2 is much superior.
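The two strategies can be compared by enumerating every case exactly. In the sketch below, which is an illustration and not part of the notes, we assume that when Monty has two goat doors to choose from he opens either with equal probability; the variable names are our own.

```python
from fractions import Fraction

doors = [1, 2, 3]
pick = 1                       # the contestant's first choice
win_stay = Fraction(0)
win_switch = Fraction(0)

for car in doors:              # each location has probability 1/3
    # Monty may open any door that is neither the pick nor the car.
    options = [d for d in doors if d != pick and d != car]
    for opened in options:     # assumed equally likely when he has a choice
        weight = Fraction(1, 3) / len(options)
        switched_to = next(d for d in doors if d not in (pick, opened))
        if car == pick:
            win_stay += weight
        if car == switched_to:
            win_switch += weight

print(win_stay, win_switch)
```

The result does not depend on how Monty breaks ties, since in that case the car is behind door 1 and switching loses either way.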
Chapter 5
Random variables
A random variable is a real-valued function on S. Random variables are

usually denoted by X, Y, Z, . . .
Example. If one rolls a die, let X denote the outcome (i.e., either 1,2,3,4,5,6).
Example. If one rolls a die, let Y be 1 if an odd number is showing and 0 if

an even number is showing.
Example. If one tosses 10 coins, let X be the number of heads showing.
Example. In n trials, let X be the number of successes.

A discrete random variable is one that can only take countably many values.
For a discrete random variable, we define the probability mass function
or the density by p(x) = P(X = x). Here P(X = x) is an abbreviation
for P({ω ∈ S : X(ω) = x}). This type of abbreviation is standard. Note
\(\sum_i p(x_i) = 1\) since X must equal something.
Let X be the number showing if we roll a die. The expected number to
show up on a roll of a die should be
1 · P(X = 1) + 2 · P(X = 2) + · · · + 6 · P(X = 6) = 3.5.
More generally, we define
\[ E X = \sum_{\{x : p(x) > 0\}} x\,p(x) \]
to be the expected value or expectation or mean of X.
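The definition of expectation is a weighted sum and can be coded in a few lines. This Python sketch represents a probability mass function as a dictionary from values to probabilities, which is our own choice of representation; the name `expectation` is likewise illustrative.

```python
from fractions import Fraction

def expectation(pmf):
    """E X = sum of x * p(x) over the values x with p(x) > 0."""
    return sum(x * p for x, p in pmf.items() if p > 0)

die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expectation(die))
```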
Example. If we toss a coin and X is 1 if we have heads and 0 if we have tails,
what is the expectation of X? Answer.
\[ p_X(x) = \begin{cases} \frac12, & x = 1, \\ \frac12, & x = 0, \\ 0, & \text{all other values of } x. \end{cases} \]
Hence \(E X = (1)(\frac12) + (0)(\frac12) = \frac12\).
Example. Suppose X = 0 with probability \(\frac12\), 1 with probability \(\frac14\), 2 with
probability \(\frac18\), and more generally n with probability \(1/2^{n+1}\). This is an example
where X can take infinitely many values (although still countably many
values). What is the expectation of X? Answer. Here \(p_X(n) = 1/2^{n+1}\) if n is a
nonnegative integer and 0 otherwise. So
\[ E X = (0)\tfrac12 + (1)\tfrac14 + (2)\tfrac18 + (3)\tfrac{1}{16} + \cdots. \]
This turns out to sum to 1. To see this, recall the formula for a geometric
series:
\[ 1 + x + x^2 + x^3 + \cdots = \frac{1}{1-x}. \]
If we differentiate this, we get
\[ 1 + 2x + 3x^2 + \cdots = \frac{1}{(1-x)^2}. \]
We have
\[ E X = 1(\tfrac14) + 2(\tfrac18) + 3(\tfrac{1}{16}) + \cdots = \tfrac14\Big[1 + 2(\tfrac12) + 3(\tfrac14) + \cdots\Big] = \frac14 \cdot \frac{1}{(1-\frac12)^2} = 1. \]
Example. Suppose we roll a fair die. If a 1 or 2 is showing, let X = 3; if a 3
or 4 is showing, let X = 4, and if a 5 or 6 is showing, let X = 10. What is
E X? Answer. We have \(P(X = 3) = P(X = 4) = P(X = 10) = \frac13\), so
\[ E X = \sum x P(X = x) = (3)(\tfrac13) + (4)(\tfrac13) + (10)(\tfrac13) = \tfrac{17}{3}. \]
Let’s give a proof of the linearity of expectation in the case when X and
Y both take only finitely many values.
Let Z = X + Y, let \(a_1, \ldots, a_n\) be the values taken by X, \(b_1, \ldots, b_m\) be the
values taken by Y, and \(c_1, \ldots, c_\ell\) the values taken by Z. Since there are only
finitely many values, we can interchange the order of summations freely.
We write
\[
\begin{aligned}
E Z &= \sum_{k=1}^{\ell} c_k P(Z = c_k) = \sum_{k=1}^{\ell} \sum_{i=1}^{n} c_k P(Z = c_k, X = a_i) \\
&= \sum_k \sum_i c_k P(X = a_i, Y = c_k - a_i) \\
&= \sum_k \sum_i \sum_{j=1}^{m} c_k P(X = a_i, Y = c_k - a_i, Y = b_j) \\
&= \sum_i \sum_j \sum_k c_k P(X = a_i, Y = c_k - a_i, Y = b_j).
\end{aligned}
\]
Now \(P(X = a_i, Y = c_k - a_i, Y = b_j)\) will be 0, unless \(c_k - a_i = b_j\). For
each pair (i, j), this will be non-zero for only one value k, since the \(c_k\) are all
different. Therefore for each i and j
\[
\begin{aligned}
\sum_k c_k P(X = a_i, Y = c_k - a_i, Y = b_j) &= \sum_k (a_i + b_j) P(X = a_i, Y = c_k - a_i, Y = b_j) \\
&= (a_i + b_j) P(X = a_i, Y = b_j).
\end{aligned}
\]
Substituting,
\[
\begin{aligned}
E Z &= \sum_i \sum_j (a_i + b_j) P(X = a_i, Y = b_j) \\
&= \sum_i \sum_j a_i P(X = a_i, Y = b_j) + \sum_j \sum_i b_j P(X = a_i, Y = b_j) \\
&= \sum_i a_i P(X = a_i) + \sum_j b_j P(Y = b_j) \\
&= E X + E Y.
\end{aligned}
\]
It turns out there is a formula for the expectation of random variables like
\(X^2\) and \(e^X\). To see how this works, let us first look at an example.
Suppose we roll a die and let X be the value that is showing. We want
the expectation \(E X^2\). Let \(Y = X^2\), so that \(P(Y = 1) = \frac16\), \(P(Y = 4) = \frac16\),
etc. and
\[ E X^2 = E Y = (1)\tfrac16 + (4)\tfrac16 + \cdots + (36)\tfrac16. \]
We can also write this as
\[ E X^2 = (1^2)\tfrac16 + (2^2)\tfrac16 + \cdots + (6^2)\tfrac16, \]
which suggests that a formula for \(E X^2\) is \(\sum_x x^2 P(X = x)\). This turns out
to be correct.
The only possibility where things could go wrong is if more than one value
of X leads to the same value of \(X^2\). For example, suppose \(P(X = -2) = \frac18\),
\(P(X = -1) = \frac14\), \(P(X = 1) = \frac38\), \(P(X = 2) = \frac14\). Then if \(Y = X^2\),
\(P(Y = 1) = \frac58\) and \(P(Y = 4) = \frac38\). Then
\[ E X^2 = (1)\tfrac58 + (4)\tfrac38 = (-1)^2\tfrac14 + (1)^2\tfrac38 + (-2)^2\tfrac18 + (2)^2\tfrac14. \]
So even in this case \(E X^2 = \sum_x x^2 P(X = x)\).
Theorem 5.1 \(E g(X) = \sum g(x)p(x)\).
Proof. Let Y = g(X). Then
\[ E Y = \sum_y y P(Y = y) = \sum_y y \sum_{\{x : g(x) = y\}} P(X = x) = \sum_x g(x) P(X = x). \]
Example. \(E X^2 = \sum x^2 p(x)\).
\(E X^n\) is called the nth moment of X. If M = E X, then
\[ \operatorname{Var}(X) = E (X - M)^2 \]
is called the variance of X. The square root of Var(X) is the standard
deviation of X.
The variance measures how much spread there is about the expected value.
Example. We toss a fair coin and let X = 1 if we get heads, X = −1 if
we get tails. Then E X = 0, so X − E X = X, and then
\(\operatorname{Var} X = E X^2 = (1)^2 \frac12 + (-1)^2 \frac12 = 1\).
Example. We roll a die and let X be the value that shows. We have previously
calculated \(E X = \frac72\). So X − E X equals
\[ -\tfrac52, -\tfrac32, -\tfrac12, \tfrac12, \tfrac32, \tfrac52, \]
each with probability \(\frac16\). So
\[ \operatorname{Var} X = (-\tfrac52)^2 \tfrac16 + (-\tfrac32)^2 \tfrac16 + (-\tfrac12)^2 \tfrac16 + (\tfrac12)^2 \tfrac16 + (\tfrac32)^2 \tfrac16 + (\tfrac52)^2 \tfrac16 = \tfrac{35}{12}. \]
Note that the expectation of a constant is just the constant. An alternate
expression for the variance is
\[
\begin{aligned}
\operatorname{Var} X &= E X^2 - 2E(XM) + E(M^2) \\
&= E X^2 - 2M^2 + M^2 \\
&= E X^2 - (E X)^2.
\end{aligned}
\]
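Both expressions for the variance can be compared on the die example. The Python sketch below is illustrative; it uses Theorem 5.1 through a helper we call `expect`, and represents the die’s probability mass function as a dictionary.

```python
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}

def expect(g, pmf):
    """E g(X) = sum of g(x) p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x, die)
var_definition = expect(lambda x: (x - mean) ** 2, die)   # E (X - M)^2
var_shortcut = expect(lambda x: x * x, die) - mean ** 2    # E X^2 - (E X)^2
print(mean, var_definition, var_shortcut)
```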
Chapter 6
Some discrete distributions
Bernoulli. A r.v. X such that P(X = 1) = p and P(X = 0) = 1 − p is said
to be a Bernoulli r.v. with parameter p. Note E X = p and \(E X^2 = p\), so
\(\operatorname{Var} X = p - p^2 = p(1-p)\).
Binomial. A r.v. X has a binomial distribution with parameters n and p
if \(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\). The number of successes in n trials is a
binomial. After some cumbersome calculations one can derive E X = np.
An easier way is to realize that if X is binomial, then \(X = Y_1 + \cdots + Y_n\),
where the \(Y_i\) are independent Bernoulli’s, so \(E X = E Y_1 + \cdots + E Y_n = np\).
We haven’t defined what it means for r.v.’s to be independent, but here we
mean that the events \((Y_k = 1)\) are independent. The cumbersome way is as
follows.
\[
\begin{aligned}
E X &= \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=1}^{n} k \binom{n}{k} p^k (1-p)^{n-k} \\
&= \sum_{k=1}^{n} k \frac{n!}{k!\,(n-k)!} p^k (1-p)^{n-k} \\
&= np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,((n-1)-(k-1))!} p^{k-1} (1-p)^{(n-1)-(k-1)} \\
&= np \sum_{k=0}^{n-1} \frac{(n-1)!}{k!\,((n-1)-k)!} p^k (1-p)^{(n-1)-k} \\
&= np \sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1-p)^{(n-1)-k} = np.
\end{aligned}
\]
To get the variance of X, we have
\[ E X^2 = \sum_{k=1}^{n} E Y_k^2 + \sum_{i \ne j} E Y_i Y_j. \]
Now
\[ E Y_i Y_j = 1 \cdot P(Y_i Y_j = 1) + 0 \cdot P(Y_i Y_j = 0) = P(Y_i = 1, Y_j = 1) = P(Y_i = 1)P(Y_j = 1) = p^2 \]
using independence. The square of \(Y_1 + \cdots + Y_n\) yields \(n^2\) terms, of which n
are of the form \(Y_k^2\). So we have \(n^2 - n\) terms of the form \(Y_i Y_j\) with \(i \ne j\).
Hence
\[ \operatorname{Var} X = E X^2 - (E X)^2 = np + (n^2 - n)p^2 - (np)^2 = np(1-p). \]
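The formulas E X = np and Var X = np(1 − p) can be checked directly against the probability mass function. Here is a minimal Python sketch for one arbitrary choice of parameters (n = 12, p = 1/3, chosen only for illustration); `binomial_moments` is an illustrative name.

```python
from fractions import Fraction
from math import comb

def binomial_moments(n, p):
    """Return (E X, Var X) computed directly from the binomial pmf."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    second = sum(k * k * pk for k, pk in enumerate(pmf))
    return mean, second - mean**2

n, p = 12, Fraction(1, 3)
mean, variance = binomial_moments(n, p)
print(mean, n * p)                  # both should equal np
print(variance, n * p * (1 - p))    # both should equal np(1 - p)
```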
Later we will see that the variance of the sum of independent r.v.’s is
the sum of the variances, so we could quickly get Var X = np(1 − p). Alternatively,
one can compute \(E(X^2) - E X = E(X(X-1))\) using binomial
coefficients and derive the variance of X from that.
Poisson. X is Poisson with parameter λ if
\[ P(X = i) = e^{-\lambda} \frac{\lambda^i}{i!}. \]
Note \(\sum_{i=0}^\infty \lambda^i/i! = e^\lambda\), so the probabilities add up to one.
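To compute expectations,
\[ E X = \sum_{i=0}^\infty i e^{-\lambda} \frac{\lambda^i}{i!} = e^{-\lambda} \lambda \sum_{i=1}^\infty \frac{\lambda^{i-1}}{(i-1)!} = \lambda. \]
Similarly one can show that
\[
\begin{aligned}
E(X^2) - E X = E X(X-1) &= \sum_{i=0}^\infty i(i-1) e^{-\lambda} \frac{\lambda^i}{i!} \\
&= \lambda^2 e^{-\lambda} \sum_{i=2}^\infty \frac{\lambda^{i-2}}{(i-2)!} \\
&= \lambda^2,
\end{aligned}
\]
so \(E X^2 = E(X^2 - X) + E X = \lambda^2 + \lambda\), and hence Var X = λ.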
Example. Suppose on average there are 5 homicides per month in a given
city. What is the probability there will be at most 1 in a certain month?
Answer. If X is the number of homicides, we are given that E X = 5. Since
the expectation for a Poisson is λ, then λ = 5. Therefore
\(P(X = 0) + P(X = 1) = e^{-5} + 5e^{-5}\).
Example. Suppose on average there is one large earthquake per year in California.
What’s the probability that next year there will be exactly 2 large
earthquakes?
Answer. λ = E X = 1, so \(P(X = 2) = e^{-1}(\frac12)\).
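The two Poisson answers can be turned into decimals with a short computation. The following Python sketch is illustrative; `poisson_pmf` is our own name for the probability mass function defined above.

```python
from math import exp, factorial

def poisson_pmf(i, lam):
    """P(X = i) for a Poisson random variable with parameter lam."""
    return exp(-lam) * lam**i / factorial(i)

at_most_one_homicide = poisson_pmf(0, 5) + poisson_pmf(1, 5)
two_earthquakes = poisson_pmf(2, 1)
print(at_most_one_homicide)   # e^-5 + 5 e^-5
print(two_earthquakes)        # e^-1 / 2
```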
We have the following proposition.
Proposition 6.1 If \(X_n\) is binomial with parameters n and \(p_n\) and \(np_n \to \lambda\),
then \(P(X_n = i) \to P(Y = i)\), where Y is Poisson with parameter λ.
The above proposition shows that the Poisson distribution models binomials
when the probability of a success is small. The number of misprints on
a page, the number of automobile accidents, the number of people entering
a store, etc. can all be modeled by Poissons.
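Proof. For simplicity, let us suppose \(\lambda = np_n\). In the general case we use
\(\lambda_n = np_n\). We write
\[
\begin{aligned}
P(X_n = i) &= \frac{n!}{i!\,(n-i)!} p_n^i (1 - p_n)^{n-i} \\
&= \frac{n(n-1)\cdots(n-i+1)}{i!} \Big(\frac{\lambda}{n}\Big)^i \Big(1 - \frac{\lambda}{n}\Big)^{n-i} \\
&= \frac{n(n-1)\cdots(n-i+1)}{n^i} \cdot \frac{\lambda^i}{i!} \cdot \frac{(1 - \lambda/n)^n}{(1 - \lambda/n)^i}.
\end{aligned}
\]
The first factor tends to 1 as n → ∞. \((1 - \lambda/n)^i \to 1\) as n → ∞ and
\((1 - \lambda/n)^n \to e^{-\lambda}\) as n → ∞.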
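The convergence in Proposition 6.1 can be observed numerically. The sketch below is an illustration with arbitrarily chosen values λ = 2 and i = 3; it compares the binomial probability with p = λ/n to the Poisson probability as n grows.

```python
from math import comb, exp, factorial

def binomial_pmf(i, n, p):
    return comb(n, i) * p**i * (1 - p)**(n - i)

def poisson_pmf(i, lam):
    return exp(-lam) * lam**i / factorial(i)

lam, i = 2.0, 3          # illustrative values, not from the text
errors = []
for n in (10, 100, 1000, 10000):
    error = abs(binomial_pmf(i, n, lam / n) - poisson_pmf(i, lam))
    errors.append(error)
    print(n, error)
```

For these values the gap shrinks roughly by a factor of 10 each time n is multiplied by 10.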
Uniform. Let \(P(X = k) = \frac{1}{n}\) for k = 1, 2, . . . , n. This is the distribution of
the number showing on a die (with n = 6), for example.
Geometric. Here \(P(X = i) = (1-p)^{i-1} p\) for i = 1, 2, . . .. In Bernoulli trials,
if we let X be the first time we have a success, then X will be geometric.
For example, if we toss a coin over and over and X is the first time we get
a heads, then X will have a geometric distribution. To see this, to have the
first success occur on the kth trial, we have to have k − 1 failures in the first
k − 1 trials and then a success. The probability of that is \((1-p)^{k-1} p\). Since
\(\sum_{n=1}^\infty n r^{n-1} = 1/(1-r)^2\) (differentiate the formula \(\sum_{n=0}^\infty r^n = 1/(1-r)\)), taking
r = 1 − p gives \(E X = \sum_{k=1}^\infty k (1-p)^{k-1} p = p/p^2 = 1/p\). Similarly we have
\(\operatorname{Var} X = (1-p)/p^2\).
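The mean and variance of a geometric random variable can be approximated by truncating the infinite series. This Python sketch is illustrative only; the function name `geometric_moments`, the cutoff of 10,000 terms, and the value p = 0.25 are assumptions made for the example.

```python
def geometric_moments(p, terms=10_000):
    """Approximate E X and Var X by truncating the series at `terms` terms."""
    mean = second = 0.0
    for k in range(1, terms + 1):
        prob = (1 - p) ** (k - 1) * p    # P(X = k)
        mean += k * prob
        second += k * k * prob
    return mean, second - mean**2

mean, variance = geometric_moments(0.25)
print(mean, variance)     # compare with 1/p and (1 - p)/p^2
```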
Negative binomial. Let r and p be parameters and set
\[ P(X = n) = \binom{n-1}{r-1} p^r (1-p)^{n-r}, \qquad n = r, r+1, \ldots. \]
A negative binomial represents the number of trials until r successes. To get
the above formula, to have the rth success in the nth trial, we must have exactly
r − 1 successes in the first n − 1 trials and then a success in the nth trial.
Hypergeometric. Set
\[ P(X = i) = \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}. \]
This comes up in sampling without replacement: if there are N balls, of
which m are one color and the other N − m are another, and we choose n
balls at random without replacement, then X represents the number of balls
of the first color among those chosen, and P(X = i) is the probability of
having i balls of the first color.
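The hypergeometric formula can be tried on the marble example from Chapter 4 (5 marbles, 3 red, draw 2). The Python sketch below is illustrative; `hypergeometric_pmf` and its argument order are our own choices.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(i, N, m, n):
    """P(i balls of the first color) when n of N balls are drawn, m of that color."""
    return Fraction(comb(m, i) * comb(N - m, n - i), comb(N, n))

# 5 marbles, 3 red, draw 2: probability both are red.
print(hypergeometric_pmf(2, 5, 3, 2))
# The probabilities over i = 0, 1, 2 should add up to one.
print(sum(hypergeometric_pmf(i, 5, 3, 2) for i in range(3)))
```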
Chapter 7
Continuous distributions
A r.v. X is said to have a continuous distribution if there exists a nonnegative
function f such that
\[ P(a \le X \le b) = \int_a^b f(x)\,dx \]
for every a and b. (More precisely, such an X is said to have an absolutely
continuous distribution.) f is called the density function for X.
Note \(\int_{-\infty}^\infty f(x)\,dx = P(-\infty < X < \infty) = 1\). In particular,
\(P(X = a) = \int_a^a f(x)\,dx = 0\) for every a.
Example. Suppose we are given $f(x) = c/x^3$ for $x \ge 1$ (and $f(x) = 0$
otherwise). Since $\int_{-\infty}^\infty f(x)\,dx = 1$ and
\[
\int_{-\infty}^\infty f(x)\,dx = c \int_1^\infty \frac{1}{x^3}\,dx = \frac{c}{2},
\]
we have $c = 2$.
Define $F(y) = \mathbb{P}(-\infty < X \le y) = \int_{-\infty}^y f(x)\,dx$. $F$ is called
the distribution function of $X$. We can define $F$ for any random variable,
not just continuous ones, by setting $F(y) = \mathbb{P}(X \le y)$. In the case of
discrete random variables, this is not particularly useful, although it does
serve to unify discrete and continuous random variables. In the continuous
case, the fundamental theorem of calculus tells us, provided $f$ satisfies
some conditions, that
\[
f(y) = F'(y).
\]
By analogy with the discrete case, we define the expectation by
\[
\mathbb{E} X = \int_{-\infty}^\infty x f(x)\,dx.
\]
In the example above,
\[
\mathbb{E} X = \int_1^\infty x\, \frac{2}{x^3}\,dx = 2 \int_1^\infty x^{-2}\,dx = 2.
\]
We give another definition of the expectation in the continuous case. First
suppose $X$ is nonnegative. Define $X_n(\omega)$ to be $k/2^n$ if
$k/2^n \le X(\omega) < (k+1)/2^n$. We are approximating $X$ from below by the
largest multiple of $2^{-n}$. Each $X_n$ is discrete and the $X_n$ increase to
$X$. We define $\mathbb{E} X = \lim_{n\to\infty} \mathbb{E} X_n$.
Let us argue that this agrees with the first definition in this case. We have
\[
\begin{aligned}
\mathbb{E} X_n &= \sum_k \frac{k}{2^n}\, \mathbb{P}(X_n = k/2^n) \\
&= \sum_k \frac{k}{2^n}\, \mathbb{P}(k/2^n \le X < (k+1)/2^n) \\
&= \sum_k \frac{k}{2^n} \int_{k/2^n}^{(k+1)/2^n} f(x)\,dx \\
&= \sum_k \int_{k/2^n}^{(k+1)/2^n} \frac{k}{2^n}\, f(x)\,dx.
\end{aligned}
\]
If $x \in [k/2^n, (k+1)/2^n)$, then $x$ differs from $k/2^n$ by at most $1/2^n$.
So the last expression differs from
\[
\sum_k \int_{k/2^n}^{(k+1)/2^n} x f(x)\,dx
\]
by at most $\sum_k (1/2^n)\, \mathbb{P}(k/2^n \le X < (k+1)/2^n) \le 1/2^n$, which goes
to 0 as $n \to \infty$. On the other hand,
\[
\sum_k \int_{k/2^n}^{(k+1)/2^n} x f(x)\,dx = \int_0^\infty x f(x)\,dx,
\]
which is how we defined the expectation of $X$.

We will not prove the following, but it is an interesting exercise: if Xm

is any sequence of discrete random variables that increase up to X, then
limm→∞ E Xm will have the same value E X.
To show linearity, if X and Y are bounded positive random variables,
then take Xm discrete increasing up to X and Ym discrete increasing up to
Y . Then Xm + Ym is discrete and increases up to X + Y , so we have
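\[
\begin{aligned}
\mathbb{E}(X + Y) &= \lim_{m\to\infty} \mathbb{E}(X_m + Y_m) \\
&= \lim_{m\to\infty} \mathbb{E} X_m + \lim_{m\to\infty} \mathbb{E} Y_m = \mathbb{E} X + \mathbb{E} Y.
\end{aligned}
\]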
If X is not necessarily positive, we have a similar definition; we will not

do the details. This second definition of expectation is mostly useful for
theoretical purposes and much less so for calculations.
Similarly to the discrete case, we have

Proposition 7.1 $\mathbb{E}\, g(X) = \int g(x) f(x)\,dx$.

As in the discrete case,
\[
\operatorname{Var} X = \mathbb{E}[X - \mathbb{E} X]^2.
\]
As an example of these calculations, let us look at the uniform distribution.

We say that a random variable $X$ has a uniform distribution on $[a, b]$ if
$f_X(x) = \frac{1}{b-a}$ if $a \le x \le b$ and 0 otherwise.
To calculate the expectation of $X$,
\[
\begin{aligned}
\mathbb{E} X &= \int_{-\infty}^\infty x f_X(x)\,dx = \int_a^b x\, \frac{1}{b-a}\,dx \\
&= \frac{1}{b-a} \int_a^b x\,dx \\
&= \frac{1}{b-a} \Big( \frac{b^2}{2} - \frac{a^2}{2} \Big) = \frac{a+b}{2}.
\end{aligned}
\]
This is what one would expect. To calculate the variance, we first calculate
\[
\mathbb{E} X^2 = \int_{-\infty}^\infty x^2 f_X(x)\,dx = \int_a^b x^2\, \frac{1}{b-a}\,dx = \frac{a^2 + ab + b^2}{3}.
\]
We then do some algebra to obtain
\[
\operatorname{Var} X = \mathbb{E} X^2 - (\mathbb{E} X)^2 = \frac{(b-a)^2}{12}.
\]
Chapter 8
Normal distribution
A r.v. is a standard normal (written $N(0, 1)$) if it has density
\[
\frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.
\]
A synonym for normal is Gaussian. The first thing to do is show that this is
a density. Let $I = \int_0^\infty e^{-x^2/2}\,dx$. Then
\[
I^2 = \int_0^\infty \int_0^\infty e^{-x^2/2} e^{-y^2/2}\,dx\,dy.
\]
Changing to polar coordinates,
\[
I^2 = \int_0^{\pi/2} \int_0^\infty r e^{-r^2/2}\,dr\,d\theta = \pi/2.
\]
So $I = \sqrt{\pi/2}$, hence $\int_{-\infty}^\infty e^{-x^2/2}\,dx = \sqrt{2\pi}$ as it should.
Note
\[
\int x e^{-x^2/2}\,dx = 0
\]
by symmetry, so $\mathbb{E} Z = 0$. For the variance of $Z$, we use integration by parts:
\[
\mathbb{E} Z^2 = \frac{1}{\sqrt{2\pi}} \int x^2 e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int x \cdot x e^{-x^2/2}\,dx.
\]
The integral is equal to
\[
-x e^{-x^2/2} \Big]_{-\infty}^{\infty} + \int e^{-x^2/2}\,dx = \sqrt{2\pi}.
\]
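Therefore $\operatorname{Var} Z = \mathbb{E} Z^2 = 1$.
We say $X$ is a $N(\mu, \sigma^2)$ if $X = \sigma Z + \mu$, where $Z$ is a $N(0, 1)$. We see that
\[
\begin{aligned}
F_X(x) &= \mathbb{P}(X \le x) = \mathbb{P}(\mu + \sigma Z \le x) \\
&= \mathbb{P}(Z \le (x-\mu)/\sigma) = F_Z((x-\mu)/\sigma)
\end{aligned}
\]
if $\sigma > 0$. (A similar calculation holds if $\sigma < 0$.) Then by the chain rule $X$
has density
\[
f_X(x) = F_X'(x) = \frac{1}{\sigma}\, F_Z'((x-\mu)/\sigma) = \frac{1}{\sigma}\, f_Z((x-\mu)/\sigma).
\]
This is equal to
\[
\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2}.
\]
$\mathbb{E} X = \mu + \sigma\, \mathbb{E} Z$ and $\operatorname{Var} X = \sigma^2 \operatorname{Var} Z$, so
\[
\mathbb{E} X = \mu, \qquad \operatorname{Var} X = \sigma^2.
\]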
If $X$ is $N(\mu, \sigma^2)$ and $Y = aX + b$, then $Y = a(\mu + \sigma Z) + b = (a\mu + b) + (a\sigma)Z$,
or $Y$ is $N(a\mu + b, a^2\sigma^2)$. In particular, if $X$ is $N(\mu, \sigma^2)$ and
$Z = (X - \mu)/\sigma$, then $Z$ is $N(0, 1)$.
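The distribution function of a standard $N(0, 1)$ is often denoted $\Phi(x)$, so
that
\[
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2}\,dy.
\]
Tables of $\Phi(x)$ are often given only for $x > 0$. One can use the symmetry of
the density function to see that
\[
\Phi(-x) = 1 - \Phi(x);
\]
this follows from
\[
\begin{aligned}
\Phi(-x) = \mathbb{P}(Z \le -x) &= \int_{-\infty}^{-x} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy \\
&= \int_x^\infty \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy = \mathbb{P}(Z \ge x) \\
&= 1 - \mathbb{P}(Z < x) = 1 - \Phi(x).
\end{aligned}
\]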
Example. Find P(1 ≤ X ≤ 4) if X is N (2, 25).
Answer. Write X = 2 + 5Z. So
P(1 ≤ X ≤ 4) = P(1 ≤ 2 + 5Z ≤ 4)
= P(−1 ≤ 5Z ≤ 2) = P(−0.2 ≤ Z ≤ .4)
= P(Z ≤ .4) − P(Z ≤ −0.2)
= Φ(0.4) − Φ(−0.2) = .6554 − [1 − Φ(0.2)]
= .6554 − [1 − .5793] = .2347.
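When no table is at hand, $\Phi$ can be evaluated through the error function, using the identity $\Phi(x) = \frac12\big(1 + \operatorname{erf}(x/\sqrt{2})\big)$. The sketch below (the helper names `Phi` and `normal_interval_prob` are illustrative, not from the text) repeats the calculation for $X$ a $N(2, 25)$ by standardizing exactly as in the example.

```python
import math

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_interval_prob(lo, hi, mu, sigma):
    """P(lo <= X <= hi) for X = mu + sigma * Z with Z standard normal."""
    return Phi((hi - mu) / sigma) - Phi((lo - mu) / sigma)

p = normal_interval_prob(1, 4, mu=2, sigma=5)
print(p)            # close to the table-based value .2347
print(Phi(1.96))    # close to .975
```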
Example. Find c such that P(|Z| ≥ c) = .05. Answer. By symmetry we want

c such that P(Z ≥ c) = .025 or Φ(c) = P(Z ≤ c) = .975. From the table
we see c = 1.96 ≈ 2. This is the origin of the idea that the 95% significance
level is ±2 standard deviations from the mean.
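Proposition 8.1 We have the following bound. For $x > 0$
\[
\mathbb{P}(Z \ge x) = 1 - \Phi(x) \le \frac{1}{\sqrt{2\pi}}\, \frac{1}{x}\, e^{-x^2/2}.
\]
Proof. If $y \ge x$, then $y/x \ge 1$, and then
\[
\begin{aligned}
\mathbb{P}(Z \ge x) &= \frac{1}{\sqrt{2\pi}} \int_x^\infty e^{-y^2/2}\,dy \\
&\le \frac{1}{\sqrt{2\pi}} \int_x^\infty \frac{y}{x}\, e^{-y^2/2}\,dy \\
&= \frac{1}{\sqrt{2\pi}}\, \frac{1}{x}\, e^{-x^2/2}.
\end{aligned}
\]
This is a good estimate when $x \ge 3.5$.

In particular, for $x$ large,
\[
\mathbb{P}(Z \ge x) = 1 - \Phi(x) \le e^{-x^2/2}.
\]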
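To get a feel for how tight the bound of Proposition 8.1 is, one can compare it with the exact tail computed from the complementary error function, $\mathbb{P}(Z \ge x) = \frac12 \operatorname{erfc}(x/\sqrt{2})$. This is only an illustrative sketch; the function names and the sample points are assumptions of the example.

```python
import math

def upper_tail(x):
    """Exact P(Z >= x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_bound(x):
    """The upper bound (1/sqrt(2*pi)) * (1/x) * exp(-x^2/2), valid for x > 0."""
    return math.exp(-x * x / 2.0) / (x * math.sqrt(2.0 * math.pi))

points = (0.5, 1.0, 2.0, 3.5, 5.0)
for x in points:
    # The ratio tail/bound moves toward 1 as x grows
    print(x, upper_tail(x), tail_bound(x), upper_tail(x) / tail_bound(x))
```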
Chapter 9
Normal approximation to the binomial
A special case of the central limit theorem is
Theorem 9.1 If $S_n$ is a binomial with parameters $n$ and $p$, then
\[
\mathbb{P}\Big( a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b \Big) \to \mathbb{P}(a \le Z \le b),
\]
as $n \to \infty$, where $Z$ is a $N(0, 1)$.
This approximation is good if $np(1-p) \ge 10$ and gets better the larger
this quantity gets. Note $np$ is the same as $\mathbb{E} S_n$ and $np(1-p)$ is the same as
$\operatorname{Var} S_n$. So the ratio is also equal to $(S_n - \mathbb{E} S_n)/\sqrt{\operatorname{Var} S_n}$, and this
ratio has mean 0 and variance 1, the same as a standard $N(0, 1)$.
Note that here p stays fixed as n → ∞, unlike the case of the Poisson
approximation.
Example. Suppose a fair coin is tossed 100 times. What is the probability
there will be more than 60 heads?
Answer. $np = 50$ and $\sqrt{np(1-p)} = 5$. We have
\[
\mathbb{P}(S_n \ge 60) = \mathbb{P}((S_n - 50)/5 \ge 2) \approx \mathbb{P}(Z \ge 2) \approx .0228.
\]
Example. Suppose a die is rolled 180 times. What is the probability a 3 will
be showing more than 50 times?
Answer. Here $p = \frac{1}{6}$, so $np = 30$ and $\sqrt{np(1-p)} = 5$. Then
$\mathbb{P}(S_n > 50) \approx \mathbb{P}(Z > 4)$, which is less than $e^{-4^2/2}$.
Example. Suppose a drug is supposed to be 75% effective. It is tested on 100

people. What is the probability more than 70 people will be helped?
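Answer. Here $S_n$ is the number of successes, $n = 100$, and $p = .75$. We have
\[
\begin{aligned}
\mathbb{P}(S_n \ge 70) &= \mathbb{P}\big((S_n - 75)/\sqrt{300/16} \ge -1.154\big) \\
&\approx \mathbb{P}(Z \ge -1.154) \approx .87.
\end{aligned}
\]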
(The last figure came from a table.)
When $b - a$ is small, there is a correction that makes things more accurate,
namely replace $a$ by $a - \frac12$ and $b$ by $b + \frac12$. This correction never hurts and is
sometimes necessary. For example, in tossing a coin 100 times, there is positive
probability that there are exactly 50 heads, while without the correction, the
answer given by the normal approximation would be 0.
Example. We toss a coin 100 times. What is the probability of getting 49,
50, or 51 heads?
Answer. We write P(49 ≤ Sn ≤ 51) = P(48.5 ≤ Sn ≤ 51.5) and then continue

as above.
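The effect of the correction can be seen by carrying this example through numerically. The sketch below (variable names are illustrative) computes the exact binomial probability of 49, 50, or 51 heads in 100 tosses and compares it with the normal approximation with and without the half-unit correction, using $\Phi(x) = \frac12\big(1 + \operatorname{erf}(x/\sqrt2)\big)$.

```python
import math

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 100, 0.5
mean = n * p                      # np = 50
sd = math.sqrt(n * p * (1 - p))   # sqrt(np(1-p)) = 5

# Exact probability P(49 <= S_n <= 51) for a fair coin
exact = sum(math.comb(n, k) for k in (49, 50, 51)) / 2**n

# Normal approximation without and with the correction
uncorrected = Phi((51 - mean) / sd) - Phi((49 - mean) / sd)
corrected = Phi((51.5 - mean) / sd) - Phi((48.5 - mean) / sd)

print(exact, uncorrected, corrected)
```

In this instance the corrected value is much closer to the exact probability than the uncorrected one.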
Chapter 10
Some continuous distributions
We look at some other continuous random variables besides normals.
Uniform. Here $f(x) = 1/(b-a)$ if $a \le x \le b$ and 0 otherwise. To compute
expectations, $\mathbb{E} X = \frac{1}{b-a} \int_a^b x\,dx = (a+b)/2$.
Exponential. An exponential with parameter $\lambda$ has density $f(x) = \lambda e^{-\lambda x}$ if
$x \ge 0$ and 0 otherwise. For $a \ge 0$ we have
\[
\mathbb{P}(X > a) = \int_a^\infty \lambda e^{-\lambda x}\,dx = e^{-\lambda a}
\]
and we readily compute $\mathbb{E} X = 1/\lambda$, $\operatorname{Var} X = 1/\lambda^2$. Examples where an
exponential r.v. is a good model are the length of a telephone call, the length
of time before someone arrives at a bank, and the length of time before a light
bulb burns out.
Exponentials are memory-less. This means that P(X > s + t | X > t) =
P(X > s), or given that the light bulb has burned 5 hours, the probability it
will burn 2 more hours is the same as the probability a new light bulb will
burn 2 hours. To prove this,
\[
\begin{aligned}
\mathbb{P}(X > s + t \mid X > t) &= \frac{\mathbb{P}(X > s + t)}{\mathbb{P}(X > t)} \\
&= \frac{e^{-\lambda(s+t)}}{e^{-\lambda t}} = e^{-\lambda s} \\
&= \mathbb{P}(X > s).
\end{aligned}
\]
Gamma. A gamma distribution with parameters $\lambda$ and $t$ has density
\[
f(x) = \frac{\lambda e^{-\lambda x} (\lambda x)^{t-1}}{\Gamma(t)}
\]
if $x \ge 0$ and 0 otherwise. Here $\Gamma(t) = \int_0^\infty e^{-y} y^{t-1}\,dy$ is the Gamma function,
which interpolates the factorial function.
An exponential is the time for something to occur. A gamma is the time
for $t$ events to occur. A gamma with parameters $\frac12$ and $\frac{n}{2}$ is known as a $\chi^2_n$, a
chi-squared r.v. with $n$ degrees of freedom. Gammas and chi-squareds come
up frequently in statistics. Another distribution that arises in statistics is
the beta:
\[
f(x) = \frac{1}{B(a, b)}\, x^{a-1} (1-x)^{b-1}, \qquad 0 < x < 1,
\]
where $B(a, b) = \int_0^1 x^{a-1} (1-x)^{b-1}\,dx$.
Cauchy. Here
\[
f(x) = \frac{1}{\pi}\, \frac{1}{1 + (x - \theta)^2}.
\]
What is interesting about the Cauchy is that it does not have finite mean,
that is, E |X| = ∞.
Often it is important to be able to compute the density of Y = g(X). Let

us give a couple of examples. If X is uniform on (0, 1] and Y = − log X,
then Y > 0. If x > 0,
\[
\begin{aligned}
F_Y(x) &= \mathbb{P}(Y \le x) = \mathbb{P}(-\log X \le x) \\
&= \mathbb{P}(\log X \ge -x) = \mathbb{P}(X \ge e^{-x}) = 1 - \mathbb{P}(X \le e^{-x}) \\
&= 1 - F_X(e^{-x}).
\end{aligned}
\]
Taking the derivative,
\[
f_Y(x) = \frac{d}{dx} F_Y(x) = -f_X(e^{-x})(-e^{-x}),
\]
using the chain rule. Since $f_X = 1$, this gives $f_Y(x) = e^{-x}$, or $Y$ is exponential
with parameter 1.
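This calculation is also the basis of a common way to simulate exponentials from uniforms. The sketch below is a simulation check under stated assumptions: the function name, the seed, and the sample size are arbitrary choices, and `1 - rng.random()` is used so that the uniform value lies in $(0, 1]$ as in the text. It compares the sample mean with $\mathbb{E} Y = 1$ and the empirical fraction of samples above 1 with $\mathbb{P}(Y > 1) = e^{-1}$.

```python
import math
import random

def sample_exponential_via_uniform(n, seed=0):
    """Draw n samples of Y = -log X with X uniform on (0, 1]."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1]
    return [-math.log(1.0 - rng.random()) for _ in range(n)]

samples = sample_exponential_via_uniform(200_000)
sample_mean = sum(samples) / len(samples)
tail_fraction = sum(1 for y in samples if y > 1.0) / len(samples)
print(sample_mean, tail_fraction, math.exp(-1.0))
```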
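For another example, suppose $X$ is $N(0, 1)$ and $Y = X^2$. Then for $x > 0$,
\[
\begin{aligned}
F_Y(x) &= \mathbb{P}(Y \le x) = \mathbb{P}(X^2 \le x) = \mathbb{P}(-\sqrt{x} \le X \le \sqrt{x}) \\
&= \mathbb{P}(X \le \sqrt{x}) - \mathbb{P}(X \le -\sqrt{x}) = F_X(\sqrt{x}) - F_X(-\sqrt{x}).
\end{aligned}
\]
Taking the derivative and using the chain rule,
\[
f_Y(x) = \frac{d}{dx} F_Y(x) = f_X(\sqrt{x})\, \frac{1}{2\sqrt{x}} - f_X(-\sqrt{x}) \Big( -\frac{1}{2\sqrt{x}} \Big).
\]
Remembering that $f_X(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$ and doing some algebra, we end up
with
\[
f_Y(x) = \frac{1}{\sqrt{2\pi}}\, x^{-1/2} e^{-x/2},
\]
which is a Gamma with parameters $\frac12$ and $\frac12$. (This is also a $\chi^2$ with one
degree of freedom.)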
One more example. Suppose X is uniform on (−π/2, π/2) and Y = tan X.

Then
\[
F_Y(x) = \mathbb{P}(X \le \tan^{-1} x) = F_X(\tan^{-1} x),
\]
and taking the derivative yields
\[
f_Y(x) = f_X(\tan^{-1} x)\, \frac{1}{1 + x^2} = \frac{1}{\pi}\, \frac{1}{1 + x^2},
\]
which is a Cauchy distribution.
Chapter 11
Multivariate distributions
We want to discuss collections of random variables $(X_1, X_2, \ldots, X_n)$, which
are known as random vectors. In the discrete case, we can define the density
$p(x, y) = \mathbb{P}(X = x, Y = y)$. Remember that here the comma means "and."
In the continuous case a density is a function such that
\[
\mathbb{P}(a \le X \le b,\ c \le Y \le d) = \int_a^b \int_c^d f(x, y)\,dy\,dx.
\]
Example. If $f_{X,Y}(x, y) = c e^{-x} e^{-2y}$ for $0 < x < \infty$ and $x < y < \infty$, what is $c$?
Answer. We use the fact that a density must integrate to 1. So
\[
\int_0^\infty \int_x^\infty c e^{-x} e^{-2y}\,dy\,dx = 1.
\]
Recalling multivariable calculus, this is
\[
\int_0^\infty c e^{-x}\, \tfrac12 e^{-2x}\,dx = \frac{c}{6},
\]
so $c = 6$.
The multivariate distribution function of $(X, Y)$ is defined by
$F_{X,Y}(x, y) = \mathbb{P}(X \le x, Y \le y)$. In the continuous case, this is
\[
\int_{-\infty}^x \int_{-\infty}^y f_{X,Y}(x, y)\,dy\,dx,
\]
and so we have
\[
f(x, y) = \frac{\partial^2 F}{\partial x\, \partial y}(x, y).
\]
The extension to n random variables is exactly similar.
We have
\[
\mathbb{P}(a \le X \le b,\ c \le Y \le d) = \int_a^b \int_c^d f_{X,Y}(x, y)\,dy\,dx,
\]
or
\[
\mathbb{P}((X, Y) \in D) = \iint_D f_{X,Y}\,dy\,dx
\]
when $D$ is the set $\{(x, y) : a \le x \le b,\ c \le y \le d\}$. One can show this holds
when $D$ is any set. For example,
\[
\mathbb{P}(X < Y) = \iint_{\{x < y\}} f_{X,Y}(x, y)\,dy\,dx.
\]
If one has the joint density of $X$ and $Y$, one can recover the densities of
$X$ and of $Y$:
\[
f_X(x) = \int_{-\infty}^\infty f_{X,Y}(x, y)\,dy, \qquad f_Y(y) = \int_{-\infty}^\infty f_{X,Y}(x, y)\,dx.
\]
If we have a binomial with parameters $n$ and $p$, this can be thought of as
the number of successes in $n$ trials, and
\[
\mathbb{P}(X = k) = \frac{n!}{k!(n-k)!}\, p^k (1-p)^{n-k}.
\]
If we let $k_1 = k$, $k_2 = n - k$, $p_1 = p$, and $p_2 = 1 - p$, this can be rewritten as
\[
\frac{n!}{k_1!\, k_2!}\, p_1^{k_1} p_2^{k_2},
\]
as long as $n = k_1 + k_2$. Thus this is the probability of $k_1$ successes and $k_2$
failures, where the probabilities of success and failure are $p_1$ and $p_2$, resp.
A multinomial random vector is $(X_1, \ldots, X_r)$ with
\[
\mathbb{P}(X_1 = n_1, \ldots, X_r = n_r) = \frac{n!}{n_1! \cdots n_r!}\, p_1^{n_1} \cdots p_r^{n_r},
\]
where $n_1 + \cdots + n_r = n$ and $p_1 + \cdots + p_r = 1$. Thus this generalizes the binomial
to more than 2 categories.
In the discrete case we say X and Y are independent if P(X = x, Y =
y) = P(X = x)P(Y = y) for all x and y. In the continuous case, X and Y
are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)
for all pairs of subsets A, B of the reals. The left hand side is an abbreviation
for
P({ω : X(ω) is in A and Y (ω) is in B})
and similarly for the right hand side.
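In the discrete case, if we have independence,
\[
\begin{aligned}
p_{X,Y}(x, y) &= \mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x)\,\mathbb{P}(Y = y) \\
&= p_X(x)\, p_Y(y).
\end{aligned}
\]
In other words, the joint density $p_{X,Y}$ factors. In the continuous case,
\[
\begin{aligned}
\int_a^b \int_c^d f_{X,Y}(x, y)\,dy\,dx &= \mathbb{P}(a \le X \le b,\ c \le Y \le d) \\
&= \mathbb{P}(a \le X \le b)\,\mathbb{P}(c \le Y \le d) \\
&= \int_a^b f_X(x)\,dx \int_c^d f_Y(y)\,dy \\
&= \int_a^b \int_c^d f_X(x) f_Y(y)\,dy\,dx.
\end{aligned}
\]
One can conclude from this by taking partial derivatives that
\[
f_{X,Y}(x, y) = f_X(x) f_Y(y),
\]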
or again the joint density factors. Going the other way, one can also see that
if the joint density factors, then one has independence.
Example. Suppose one has a floor made out of wood planks and one drops a
needle onto it. What is the probability the needle crosses one of the cracks?
Suppose the needle is of length L and the wood planks are D across.
Answer. Let X be the distance from the midpoint of the needle to the nearest
crack and let Θ be the angle the needle makes with the vertical. Then X and
Θ will be independent. X is uniform on [0, D/2] and Θ is uniform on [0, π/2].
A little geometry shows that the needle will cross a crack if $L/2 > X/\cos\Theta$.
We have $f_{X,\Theta} = \frac{4}{\pi D}$ and so we have to integrate this constant over the set
where $X < L\cos\Theta/2$ and $0 \le \Theta \le \pi/2$ and $0 \le X \le D/2$. Assuming $L \le D$,
the integral is
\[
\int_0^{\pi/2} \int_0^{L\cos\theta/2} \frac{4}{\pi D}\,dx\,d\theta = \frac{2L}{\pi D}.
\]
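The answer $2L/(\pi D)$ lends itself to a Monte Carlo check. The following sketch assumes $L \le D$ and uses the same model as the answer above ($X$ uniform on $[0, D/2]$, $\Theta$ uniform on $[0, \pi/2]$, independent); the function name, seed, and the values $L = 1$, $D = 2$ are illustrative choices.

```python
import math
import random

def buffon_estimate(L, D, trials, seed=0):
    """Monte Carlo estimate of the crossing probability, assuming L <= D."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, D / 2.0)          # midpoint distance to nearest crack
        theta = rng.uniform(0.0, math.pi / 2)  # angle with the vertical
        if x < (L / 2.0) * math.cos(theta):    # crossing condition from the text
            hits += 1
    return hits / trials

L, D = 1.0, 2.0
estimate = buffon_estimate(L, D, trials=200_000)
exact = 2 * L / (math.pi * D)
print(estimate, exact)
```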
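If $X$ and $Y$ are independent, then
\[
\begin{aligned}
\mathbb{P}(X + Y \le a) &= \iint_{\{x+y \le a\}} f_{X,Y}(x, y)\,dx\,dy \\
&= \iint_{\{x+y \le a\}} f_X(x) f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^\infty \int_{-\infty}^{a-y} f_X(x) f_Y(y)\,dx\,dy \\
&= \int F_X(a - y) f_Y(y)\,dy.
\end{aligned}
\]
Differentiating with respect to $a$, we have
\[
f_{X+Y}(a) = \int f_X(a - y) f_Y(y)\,dy.
\]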
There are a number of cases where this is interesting.

(1) If X is a gamma with parameters s and λ and Y is a gamma with
parameters t and λ, then straightforward integration shows that X + Y is a
gamma with parameters s + t and λ. In particular, the sum of n independent
exponentials with parameter λ is a gamma with parameters n and λ.
(2) If $Z$ is a $N(0, 1)$, then $F_{Z^2}(y) = \mathbb{P}(Z^2 \le y) = \mathbb{P}(-\sqrt{y} \le Z \le \sqrt{y}) =
F_Z(\sqrt{y}) - F_Z(-\sqrt{y})$. Differentiating shows that $f_{Z^2}(y) = c e^{-y/2} (y/2)^{(1/2)-1}$,
or $Z^2$ is a gamma with parameters $\frac12$ and $\frac12$. So using (1) above, if $Z_i$ are
independent $N(0, 1)$'s, then $\sum_{i=1}^n Z_i^2$ is a gamma with parameters $n/2$ and
$\frac12$, i.e., a $\chi^2_n$.
(3) If $X_i$ is a $N(\mu_i, \sigma_i^2)$ and the $X_i$ are independent, then some lengthy
calculations show that $\sum_{i=1}^n X_i$ is a $N(\sum \mu_i, \sum \sigma_i^2)$.
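(4) The analogue for discrete random variables is easier. If $X$ and $Y$ are
independent and take only nonnegative integer values, we have
\[
\begin{aligned}
\mathbb{P}(X + Y = r) &= \sum_{k=0}^r \mathbb{P}(X = k, Y = r - k) \\
&= \sum_{k=0}^r \mathbb{P}(X = k)\,\mathbb{P}(Y = r - k).
\end{aligned}
\]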
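In the case where $X$ is a Poisson with parameter $\lambda$ and $Y$ is a Poisson
with parameter $\mu$, we see that $X + Y$ is a Poisson with parameter $\lambda + \mu$. To
check this, use the above formula to get
\[
\begin{aligned}
\mathbb{P}(X + Y = r) &= \sum_{k=0}^r \mathbb{P}(X = k)\,\mathbb{P}(Y = r - k) \\
&= \sum_{k=0}^r e^{-\lambda} \frac{\lambda^k}{k!}\, e^{-\mu} \frac{\mu^{r-k}}{(r-k)!} \\
&= e^{-(\lambda+\mu)}\, \frac{1}{r!} \sum_{k=0}^r \binom{r}{k} \lambda^k \mu^{r-k} \\
&= e^{-(\lambda+\mu)}\, \frac{(\lambda + \mu)^r}{r!}
\end{aligned}
\]
using the binomial theorem.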
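The identity can be confirmed numerically by carrying out the convolution sum directly. In the sketch below the parameter values $\lambda = 1.3$, $\mu = 2.1$ and the function names are illustrative assumptions.

```python
import math

def poisson_pmf(lam, k):
    """P(X = k) for a Poisson with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def convolved_pmf(lam, mu, r):
    """P(X + Y = r) via the sum over k of P(X = k) P(Y = r - k)."""
    return sum(poisson_pmf(lam, k) * poisson_pmf(mu, r - k) for k in range(r + 1))

lam, mu = 1.3, 2.1  # illustrative parameters
for r in range(6):
    # The two columns should agree up to rounding error
    print(r, convolved_pmf(lam, mu, r), poisson_pmf(lam + mu, r))
```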
Note that it is not always the case that the sum of two independent random
variables will be a random variable of the same type.
If $X$ and $Y$ are independent normals, then $-Y$ is also a normal (with
$\mathbb{E}(-Y) = -\mathbb{E} Y$ and $\operatorname{Var}(-Y) = (-1)^2 \operatorname{Var} Y = \operatorname{Var} Y$), and so $X - Y$ is also
normal.
To define a conditional density in the discrete case, we write
\[
p_{X|Y=y}(x \mid y) = \mathbb{P}(X = x \mid Y = y).
\]
This is equal to
\[
\frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)} = \frac{p(x, y)}{p_Y(y)}.
\]
Analogously, we define in the continuous case
\[
f_{X|Y=y}(x \mid y) = \frac{f(x, y)}{f_Y(y)}.
\]
Just as in the one-dimensional case, there is a change of variables formula.
Let us recall how the formula goes in one dimension. If $X$ has a density $f_X$
and $Y = g(X)$ with $g$ increasing, then
\[
\begin{aligned}
F_Y(y) &= \mathbb{P}(Y \le y) = \mathbb{P}(g(X) \le y) \\
&= \mathbb{P}(X \le g^{-1}(y)) = F_X(g^{-1}(y)).
\end{aligned}
\]
Taking the derivative, using the chain rule, and recalling that the derivative
of $g^{-1}(y)$ is $1/g'(g^{-1}(y))$, we have
\[
f_Y(y) = f_X(g^{-1}(y))\, \frac{1}{g'(g^{-1}(y))}.
\]
The higher dimensional case is very analogous. Suppose $Y_1 = g_1(X_1, X_2)$
and $Y_2 = g_2(X_1, X_2)$. Let $h_1$ and $h_2$ be such that $X_1 = h_1(Y_1, Y_2)$ and
$X_2 = h_2(Y_1, Y_2)$. (This plays the role of $g^{-1}$.) Let $J$ be the Jacobian of
the mapping $(x_1, x_2) \to (g_1(x_1, x_2), g_2(x_1, x_2))$, so that
\[
J = \frac{\partial g_1}{\partial x_1} \frac{\partial g_2}{\partial x_2} - \frac{\partial g_1}{\partial x_2} \frac{\partial g_2}{\partial x_1}.
\]
(This is the analogue of $g'(y)$.) Using the change of variables theorem from
multivariable calculus, we have
\[
f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(x_1, x_2)\, |J|^{-1}.
\]
Example. Suppose $X_1$ is $N(0, 1)$, $X_2$ is $N(0, 4)$, and $X_1$ and $X_2$ are
independent. Let $Y_1 = 2X_1 + X_2$, $Y_2 = X_1 - 3X_2$. Then $y_1 = g_1(x_1, x_2) =
2x_1 + x_2$, $y_2 = g_2(x_1, x_2) = x_1 - 3x_2$, so
\[
J = \begin{vmatrix} 2 & 1 \\ 1 & -3 \end{vmatrix} = -7.
\]
(In general, $J$ might depend on $x$, and hence on $y$.) Some algebra leads to
$x_1 = \frac37 y_1 + \frac17 y_2$, $x_2 = \frac17 y_1 - \frac27 y_2$. Since $X_1$ and $X_2$ are independent,
\[
f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2) = \frac{1}{\sqrt{2\pi}}\, e^{-x_1^2/2}\, \frac{1}{\sqrt{8\pi}}\, e^{-x_2^2/8}.
\]
Therefore
\[
f_{Y_1,Y_2}(y_1, y_2) = \frac{1}{\sqrt{2\pi}}\, e^{-(\frac37 y_1 + \frac17 y_2)^2/2}\, \frac{1}{\sqrt{8\pi}}\, e^{-(\frac17 y_1 - \frac27 y_2)^2/8}\, \frac17.
\]
Chapter 12
Expectations
As in the one variable case, we have
\[
\mathbb{E}\, g(X, Y) = \sum_x \sum_y g(x, y)\, p(x, y)
\]
in the discrete case and
\[
\mathbb{E}\, g(X, Y) = \iint g(x, y) f(x, y)\,dx\,dy
\]
in the continuous case.

If we set $g(x, y) = x + y$, then
\[
\begin{aligned}
\mathbb{E}(X + Y) &= \iint (x + y) f(x, y)\,dx\,dy \\
&= \iint x f(x, y)\,dx\,dy + \iint y f(x, y)\,dx\,dy.
\end{aligned}
\]
If we now set $g(x, y) = x$, we see the first integral on the right is $\mathbb{E} X$, and
similarly the second is $\mathbb{E} Y$. Therefore
\[
\mathbb{E}(X + Y) = \mathbb{E} X + \mathbb{E} Y.
\]
Proposition 12.1 If X and Y are independent, then
E [h(X)k(Y )] = E h(X)E k(Y ).
In particular, E (XY ) = (E X)(E Y ).
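Proof. By the above with $g(x, y) = h(x) k(y)$,
\[
\begin{aligned}
\mathbb{E}[h(X) k(Y)] &= \iint h(x) k(y) f(x, y)\,dx\,dy \\
&= \iint h(x) k(y) f_X(x) f_Y(y)\,dx\,dy \\
&= \int h(x) f_X(x) \int k(y) f_Y(y)\,dy\,dx \\
&= \int h(x) f_X(x) (\mathbb{E}\, k(Y))\,dx \\
&= \mathbb{E}\, h(X)\, \mathbb{E}\, k(Y).
\end{aligned}
\]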
The covariance of two random variables X and Y is defined by
Cov (X, Y ) = E [(X − E X)(Y − E Y )].
As with the variance, Cov (X, Y ) = E (XY )−(E X)(E Y ). It follows that if X
and Y are independent, then E (XY ) = (E X)(E Y ), and then Cov (X, Y ) =
0.
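Note
\[
\begin{aligned}
\operatorname{Var}(X + Y) &= \mathbb{E}[((X + Y) - \mathbb{E}(X + Y))^2] \\
&= \mathbb{E}[((X - \mathbb{E} X) + (Y - \mathbb{E} Y))^2] \\
&= \mathbb{E}[(X - \mathbb{E} X)^2 + 2(X - \mathbb{E} X)(Y - \mathbb{E} Y) + (Y - \mathbb{E} Y)^2] \\
&= \operatorname{Var} X + 2 \operatorname{Cov}(X, Y) + \operatorname{Var} Y.
\end{aligned}
\]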
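We have the following corollary: if $X$ and $Y$ are independent, then
\[
\operatorname{Var}(X + Y) = \operatorname{Var} X + \operatorname{Var} Y.
\]
Proof. Since $\operatorname{Cov}(X, Y) = 0$ by independence, we have
\[
\operatorname{Var}(X + Y) = \operatorname{Var} X + \operatorname{Var} Y + 2 \operatorname{Cov}(X, Y) = \operatorname{Var} X + \operatorname{Var} Y.
\]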

Since a binomial is the sum of $n$ independent Bernoulli's, its variance is
$np(1-p)$. If we write $\overline{X} = \sum_{i=1}^n X_i/n$ and the $X_i$ are independent and have
the same distribution ($\overline{X}$ is called the sample mean), then $\mathbb{E}\, \overline{X} = \mathbb{E} X_1$ and
$\operatorname{Var} \overline{X} = \operatorname{Var} X_1/n$.
We define the conditional expectation of $X$ given $Y$ by
\[
\mathbb{E}[X \mid Y = y] = \int x f_{X|Y=y}(x)\,dx.
\]
Chapter 13
Moment generating functions
We define the moment generating function $m_X$ by
\[
m_X(t) = \mathbb{E}\, e^{tX},
\]
provided this is finite. In the discrete case this is equal to $\sum e^{tx} p(x)$ and in
the continuous case $\int e^{tx} f(x)\,dx$.
Let us compute the moment generating function for some of the distribu-
tions we have been working with.
1. Bernoulli: pet + (1 − p).
2. Binomial: using independence,
\[
\mathbb{E}\, e^{t \sum X_i} = \mathbb{E} \prod e^{tX_i} = \prod \mathbb{E}\, e^{tX_i} = (pe^t + (1-p))^n,
\]
where the $X_i$ are independent Bernoulli's.

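3. Poisson:
\[
\mathbb{E}\, e^{tX} = \sum_k \frac{e^{tk} e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_k \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t - 1)}.
\]
4. Exponential:
\[
\mathbb{E}\, e^{tX} = \int_0^\infty e^{tx} \lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - t}
\]
if $t < \lambda$ and $\infty$ if $t \ge \lambda$.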
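5. $N(0, 1)$:
\[
\frac{1}{\sqrt{2\pi}} \int e^{tx} e^{-x^2/2}\,dx = e^{t^2/2}\, \frac{1}{\sqrt{2\pi}} \int e^{-(x-t)^2/2}\,dx = e^{t^2/2}.
\]
6. $N(\mu, \sigma^2)$: Write $X = \mu + \sigma Z$. Then
\[
\mathbb{E}\, e^{tX} = \mathbb{E}\, e^{t\mu} e^{t\sigma Z} = e^{t\mu} e^{(t\sigma)^2/2} = e^{t\mu + t^2\sigma^2/2}.
\]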
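Proposition 13.1 If $X$ and $Y$ are independent, then
\[
m_{X+Y}(t) = m_X(t)\, m_Y(t).
\]
Proof. By independence and Proposition 12.1,
\[
m_{X+Y}(t) = \mathbb{E}\, e^{tX} e^{tY} = \mathbb{E}\, e^{tX}\, \mathbb{E}\, e^{tY} = m_X(t)\, m_Y(t).
\]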
Proposition 13.2 If mX (t) = mY (t) < ∞ for all t in an interval, then X

and Y have the same distribution.
We will not prove this, but this is essentially the uniqueness of the Laplace
transform. Note $\mathbb{E}\, e^{tX} = \int e^{tx} f_X(x)\,dx$. If $f_X(x) = 0$ for $x < 0$, this is
$\int_0^\infty e^{tx} f_X(x)\,dx = \mathcal{L}f_X(-t)$, where $\mathcal{L}f_X$ is the Laplace transform of $f_X$.
We can use this to verify some of the properties of sums we proved before.
For example, if $X$ is a $N(a, b^2)$ and $Y$ is a $N(c, d^2)$ and $X$ and $Y$ are
independent, then
\[
m_{X+Y}(t) = e^{at + b^2t^2/2}\, e^{ct + d^2t^2/2} = e^{(a+c)t + (b^2+d^2)t^2/2}.
\]
Proposition 13.2 then implies that $X + Y$ is a $N(a + c, b^2 + d^2)$.

Similarly, if X and Y are independent Poisson random variables with
parameters a and b, resp., then
\[
m_{X+Y}(t) = m_X(t)\, m_Y(t) = e^{a(e^t - 1)}\, e^{b(e^t - 1)} = e^{(a+b)(e^t - 1)},
\]
which is the moment generating function of a Poisson with parameter $a + b$.
One problem with the moment generating function is that it might be
infinite. One way to get around this, at the cost of considerable work, is
to use the characteristic function $\varphi_X(t) = \mathbb{E}\, e^{itX}$, where $i = \sqrt{-1}$. This is
always finite, and is the analogue of the Fourier transform.
The joint moment generating function of $X$ and $Y$ is
\[
m_{X,Y}(s, t) = \mathbb{E}\, e^{sX + tY}.
\]
If $X$ and $Y$ are independent, then
\[
m_{X,Y}(s, t) = m_X(s)\, m_Y(t)
\]
by Proposition 12.1. We will not prove this, but the converse is also true: if
$m_{X,Y}(s, t) = m_X(s)\, m_Y(t)$ for all $s$ and $t$, then $X$ and $Y$ are independent.
Chapter 14
Limit laws
Suppose $X_i$ are independent and have the same distribution. In the case
of continuous or discrete random variables, this means they all have the
same density. We say the $X_i$ are i.i.d., which stands for "independent and
identically distributed." Let $S_n = \sum_{i=1}^n X_i$. $S_n$ is called the partial sum
process.
Theorem 14.1 Suppose $\mathbb{E}|X_i| < \infty$ and let $\mu = \mathbb{E} X_i$. Then
\[
\frac{S_n}{n} \to \mu.
\]
This is known as the strong law of large numbers (SLLN). The convergence
here means that $S_n(\omega)/n \to \mu$ for every $\omega \in S$, where $S$ is the probability
space, except possibly for a set of $\omega$ of probability 0.
The proof of Theorem 14.1 is quite hard, and we prove a weaker version,
the weak law of large numbers (WLLN). The WLLN states that for every
$a > 0$,
\[
\mathbb{P}\Big( \Big| \frac{S_n}{n} - \mathbb{E} X_1 \Big| > a \Big) \to 0
\]
as $n \to \infty$. It is not even that easy to give an example of random variables
that satisfy the WLLN but not the SLLN.
Before proving the WLLN, we need an inequality called Chebyshev's
inequality.

Proposition 14.2 If $Y \ge 0$, then for any $A > 0$,
\[
\mathbb{P}(Y > A) \le \frac{\mathbb{E} Y}{A}.
\]
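Proof. We do the case for continuous densities, the case for discrete densities
being similar. We have
\[
\begin{aligned}
\mathbb{P}(Y > A) &= \int_A^\infty f_Y(y)\,dy \le \int_A^\infty \frac{y}{A}\, f_Y(y)\,dy \\
&\le \frac{1}{A} \int_{-\infty}^\infty y f_Y(y)\,dy = \frac{1}{A}\, \mathbb{E} Y.
\end{aligned}
\]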
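The discrete case can be illustrated on a small example. The sketch below takes $Y$ to be the number showing on a fair die (an illustrative choice, as are the names used) and checks $\mathbb{P}(Y > A) \le \mathbb{E} Y / A$ for $A = 1, \ldots, 6$ using exact rational arithmetic.

```python
from fractions import Fraction

# Density of a fair die: P(Y = k) = 1/6 for k = 1, ..., 6
die = {k: Fraction(1, 6) for k in range(1, 7)}
expectation = sum(k * q for k, q in die.items())  # E Y

def tail_probability(A):
    """P(Y > A) for the fair die."""
    return sum(q for k, q in die.items() if k > A)

for A in range(1, 7):
    # Left side of the inequality, then the bound E Y / A
    print(A, tail_probability(A), expectation / A)
```

The bound is far from sharp here, which is typical: it uses only the mean of $Y$.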
We now prove the WLLN.
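Theorem 14.3 Suppose the $X_i$ are i.i.d. and $\mathbb{E}|X_1|$ and $\operatorname{Var} X_1$ are finite.
Then for every $a > 0$,
\[
\mathbb{P}\Big( \Big| \frac{S_n}{n} - \mathbb{E} X_1 \Big| > a \Big) \to 0
\]
as $n \to \infty$.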
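Proof. Recall $\mathbb{E} S_n = n \mathbb{E} X_1$ and by the independence, $\operatorname{Var} S_n = n \operatorname{Var} X_1$,
so $\operatorname{Var}(S_n/n) = \operatorname{Var} X_1/n$. We have
\[
\begin{aligned}
\mathbb{P}\Big( \Big| \frac{S_n}{n} - \mathbb{E} X_1 \Big| > a \Big) &= \mathbb{P}\Big( \Big| \frac{S_n}{n} - \mathbb{E}\Big(\frac{S_n}{n}\Big) \Big| > a \Big) \\
&= \mathbb{P}\Big( \Big| \frac{S_n}{n} - \mathbb{E}\Big(\frac{S_n}{n}\Big) \Big|^2 > a^2 \Big) \\
&\le \frac{\mathbb{E}\, \big| \frac{S_n}{n} - \mathbb{E}(\frac{S_n}{n}) \big|^2}{a^2} \\
&= \frac{\operatorname{Var}(\frac{S_n}{n})}{a^2} \\
&= \frac{\operatorname{Var} X_1 / n}{a^2} \to 0.
\end{aligned}
\]
The inequality step follows from Proposition 14.2 with $A = a^2$ and
$Y = | \frac{S_n}{n} - \mathbb{E}(\frac{S_n}{n}) |^2$.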
We now turn to the central limit theorem (CLT).
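Theorem 14.4 Suppose the $X_i$ are i.i.d. Suppose $\mathbb{E} X_i^2 < \infty$. Let $\mu = \mathbb{E} X_i$
and $\sigma^2 = \operatorname{Var} X_i$. Then
\[
\mathbb{P}\Big( a \le \frac{S_n - n\mu}{\sigma\sqrt{n}} \le b \Big) \to \mathbb{P}(a \le Z \le b)
\]
for every $a$ and $b$, where $Z$ is a $N(0, 1)$.
The ratio on the left is $(S_n - \mathbb{E} S_n)/\sqrt{\operatorname{Var} S_n}$. We do not claim that this
ratio converges for any $\omega$ (in fact, it doesn't), but that the probabilities
converge.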
Example. If the Xi are i.i.d. Bernoulli random variables, so that Sn is a

binomial, this is just the normal approximation to the binomial.
Example. Suppose we roll a die 3600 times. Let Xi be the number showing
on the ith roll. We know Sn /n will be close to 3.5. What’s the probability it
differs from 3.5 by more than 0.05?
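Answer. We want
\[
\mathbb{P}\Big( \Big| \frac{S_n}{n} - 3.5 \Big| > .05 \Big).
\]
We rewrite this as
\[
\begin{aligned}
\mathbb{P}(|S_n - n \mathbb{E} X_1| > (.05)(3600)) &= \mathbb{P}\Big( \Big| \frac{S_n - n \mathbb{E} X_1}{\sqrt{n} \sqrt{\operatorname{Var} X_1}} \Big| > \frac{180}{(60)\sqrt{35/12}} \Big) \\
&\approx \mathbb{P}(|Z| > 1.756) \approx .08.
\end{aligned}
\]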
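The arithmetic in this example can be reproduced in a few lines. The sketch below (variable names are illustrative) computes $\operatorname{Var} X_1 = 35/12$ for a fair die, the standardized threshold $180/(60\sqrt{35/12})$, and the two-sided normal tail $\mathbb{P}(|Z| > z) = \operatorname{erfc}(z/\sqrt{2})$.

```python
import math

n = 3600
mean_x1 = sum(range(1, 7)) / 6                            # E X_1 = 3.5
var_x1 = sum(k * k for k in range(1, 7)) / 6 - mean_x1**2  # Var X_1 = 35/12

# Standardize the event |S_n - n E X_1| > (.05)(3600)
z = 0.05 * n / (math.sqrt(n) * math.sqrt(var_x1))

# P(|Z| > z) for a standard normal Z
two_sided = math.erfc(z / math.sqrt(2.0))
print(var_x1, z, two_sided)
```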
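Example. Suppose the lifetime of a human has expectation 72 and variance
36. What is the probability that the average of the lifetimes of 100 people
exceeds 73?
Answer. We want
\[
\begin{aligned}
\mathbb{P}\Big( \frac{S_n}{n} > 73 \Big) &= \mathbb{P}(S_n > 7300) \\
&= \mathbb{P}\Big( \frac{S_n - n \mathbb{E} X_1}{\sqrt{n} \sqrt{\operatorname{Var} X_1}} > \frac{7300 - (100)(72)}{\sqrt{100} \sqrt{36}} \Big) \\
&\approx \mathbb{P}(Z > 1.667) \approx .047.
\end{aligned}
\]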
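The idea behind proving the central limit theorem is the following. It
turns out that if $m_{Y_n}(t) \to m_Z(t)$ for every $t$, then
$\mathbb{P}(a \le Y_n \le b) \to \mathbb{P}(a \le Z \le b)$. (We won't prove this.) We are going to let
$Y_n = (S_n - n\mu)/\sigma\sqrt{n}$.
Let $W_i = (X_i - \mu)/\sigma$. Then $\mathbb{E} W_i = 0$, $\operatorname{Var} W_i = \frac{\operatorname{Var} X_i}{\sigma^2} = 1$, the $W_i$ are
independent, and
\[
\frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\sum_{i=1}^n W_i}{\sqrt{n}}.
\]
So there is no loss of generality in assuming that $\mu = 0$ and $\sigma = 1$. Then
\[
m_{Y_n}(t) = \mathbb{E}\, e^{tY_n} = \mathbb{E}\, e^{(t/\sqrt{n})(S_n)} = m_{S_n}(t/\sqrt{n}).
\]
Since the $X_i$ are i.i.d., all the $X_i$ have the same moment generating function.
Since $S_n = X_1 + \cdots + X_n$, then
\[
m_{S_n}(t) = m_{X_1}(t) \cdots m_{X_n}(t) = [m_{X_1}(t)]^n.
\]
If we expand $e^{tX_1}$ as a power series,
\[
m_{X_1}(t) = \mathbb{E}\, e^{tX_1} = 1 + t\, \mathbb{E} X_1 + \frac{t^2}{2!}\, \mathbb{E}(X_1)^2 + \frac{t^3}{3!}\, \mathbb{E}(X_1)^3 + \cdots.
\]
We put the above together and obtain
\[
\begin{aligned}
m_{Y_n}(t) &= m_{S_n}(t/\sqrt{n}) \\
&= [m_{X_1}(t/\sqrt{n})]^n \\
&= \Big[ 1 + t \cdot 0 + \frac{(t/\sqrt{n})^2}{2!} + R_n \Big]^n \\
&= \Big[ 1 + \frac{t^2}{2n} + R_n \Big]^n,
\end{aligned}
\]
where $n|R_n| \to 0$ as $n \to \infty$. This converges to $e^{t^2/2} = m_Z(t)$ as $n \to \infty$.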
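The final limit can be watched numerically. The sketch below drops the remainder $R_n$ entirely (an assumption of this illustration, so it shows only the main term) and compares $(1 + t^2/2n)^n$ with $e^{t^2/2}$ for an illustrative value $t = 1.5$.

```python
import math

def main_term(t, n):
    """(1 + t^2 / (2n))^n, the bracket above with the remainder R_n dropped."""
    return (1.0 + t * t / (2.0 * n)) ** n

t = 1.5  # illustrative value of t
limit = math.exp(t * t / 2.0)
for n in (10, 1000, 100_000, 1_000_000):
    # The first column approaches the limit e^{t^2/2} as n grows
    print(n, main_term(t, n), limit)
```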
