Undergraduate Probability

Richard F. Bass

© Copyright 2013 Richard F. Bass
Contents

1 Combinatorics
2 The probability set-up
3 Independence
4 Conditional probability
5 Random variables
6 Some discrete distributions
7 Continuous distributions
8 Normal distribution
9 Normal approximation
10 Some continuous distributions
11 Multivariate distributions
12 Expectations
13 Moment generating functions
14 Limit laws
Chapter 1
Combinatorics
Example. How many ways can one arrange 4 math books, 3 chemistry books, 2 physics books, and 1 biology book on a bookshelf so that all the math books are together, all the chemistry books are together, and all the physics books are together? Answer. 4!(4! 3! 2! 1!). We can arrange the math books in 4! ways, the chemistry books in 3! ways, the physics books in 2! ways, and the biology book in 1! = 1 way. But we also have to decide which set of books goes on the left, which next, and so on. That is the same as the number of ways of arranging the letters M, C, P, B, and there are 4! ways of doing that.
How many ways can one arrange the letters a, a, b, c? Let us label them
A, a, b, c. There are 4!, or 24, ways to arrange these letters. But we have
repeats: we could have Aa or aA. So we have a repeat for each possibility,
and so the answer should be 4!/2! = 12. If there were 3 a’s, 4 b’s, and 2 c’s,
we would have
\frac{9!}{3!\,4!\,2!}.
What we just computed is called the number of permutations. Now let us look
at what are known as combinations. How many ways can we choose 3 letters
out of 5? If the letters are a, b, c, d, e and order matters, then there would
be 5 for the first position, 4 for the second, and 3 for the third, for a total
of 5 × 4 × 3. But suppose the letters selected were a, b, c. If order doesn’t
matter, we will have the letters a, b, c 6 times, because there are 3! ways of
arranging 3 letters. The same is true for any choice of three letters. So we
should have 5 × 4 × 3/3!. We can rewrite this as
\frac{5 \cdot 4 \cdot 3}{3!} = \frac{5!}{3!\,2!}.
This is often written \binom{5}{3}, read "5 choose 3." Sometimes this is written C_{5,3} or {}_5C_3. More generally,
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.
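These counts are easy to check numerically. Here is a minimal Python sketch (not part of the original text; it assumes Python 3.8+ for math.comb and math.perm):

from math import comb, factorial, perm

print(comb(5, 3))                                     # 10 ways to choose 3 letters from 5
print(factorial(5) // (factorial(3) * factorial(2)))  # 10 again, via 5!/(3! 2!)
print(perm(5, 3))                                     # 60 = 5 * 4 * 3 ordered choices; 60/3! = 10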
Example. Suppose there are 8 men and 8 women. How many ways can we
choose a committee that has 2 men and 2 women? Answer. We can choose 2 men in \binom{8}{2} ways and 2 women in \binom{8}{2} ways. The number of committees is then the product \binom{8}{2}\binom{8}{2}.
Suppose one has 9 people and one wants to divide them into one committee of 3, one of 4, and a last of 2. There are \binom{9}{3} ways of choosing the first committee. Once that is done, there are 6 people left and there are \binom{6}{4} ways of choosing the second committee. Once that is done, the remainder must go in the third committee. So the answer is
\frac{9!}{3!\,6!} \cdot \frac{6!}{4!\,2!} = \frac{9!}{3!\,4!\,2!}.
In general, to divide n objects into one group of n_1, one group of n_2, \ldots, and a kth group of n_k, where n = n_1 + \cdots + n_k, the answer is
\frac{n!}{n_1!\, n_2! \cdots n_k!}.
These are known as multinomial coefficients.
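As an illustration (not from the text), the committee count above can be computed either from the multinomial formula or step by step with binomial coefficients; a short Python sketch:

from math import comb, factorial

direct = factorial(9) // (factorial(3) * factorial(4) * factorial(2))
stepwise = comb(9, 3) * comb(6, 4) * comb(2, 2)
print(direct, stepwise)   # both give 1260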
Another example: suppose we have 4 Americans and 6 Canadians. (a)
How many ways can we arrange them in a line? (b) How many ways if all
the Americans have to stand together? (c) How many ways if not all the
Americans are together? (d) Suppose you want to choose a committee of 3,
which will be all Americans or all Canadians. How many ways can this be
done? (e) How many ways for a committee of 3 that is not all Americans
or all Canadians? Answer. (a) This is just 10! (b) Consider the Americans
as a group and each Canadian as a group; this gives 7 groups, which can
be arranged in 7! ways. Once we have these seven groups arranged, we can
arrange the Americans within their group in 4! ways, so we get 4!7! (c) This
is the answer to (a) minus the answer to (b): 10! − 4!\,7!. (d) We can choose a committee of 3 Americans in \binom{4}{3} ways and a committee of 3 Canadians in \binom{6}{3} ways, so the answer is \binom{4}{3} + \binom{6}{3}. (e) We can choose a committee of 3 out of 10 in \binom{10}{3} ways, so the answer is \binom{10}{3} − \binom{4}{3} − \binom{6}{3}.
Next, suppose we want to put 8 indistinguishable balls into 3 boxes. How many ways can this be done? Represent each ball by an o and the walls of the boxes by |'s. For example, if we write
| o o | o o o | o o o |,
this would represent 2 balls in the first box, 3 in the second, and 3 in the
third. Altogether there are 8 + 4 symbols; the first is a |, as is the last, so there are 10 symbols that can be either | or o. Also, 8 of them must be o. How many ways out of 10 spaces can one pick 8 of them into which to put an o? We just did that: the answer is \binom{10}{8}.
Now, to finish, suppose we have $8,000 to invest in 3 mutual funds. Each mutual fund requires you to make investments in increments of $1,000. How many ways can we do this? This is the same as putting 8 indistinguishable balls in 3 boxes, and we know the answer is \binom{10}{8}.
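A brute-force check of this count (not in the original; the variable names are ad hoc) simply enumerates the possible allocations of the 8 units of $1,000:

from math import comb

count = sum(1 for a in range(9) for b in range(9 - a))  # third fund gets 8 - a - b
print(count, comb(10, 8))                                # both give 45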
Chapter 2

The probability set-up
For example, if we toss a coin twice, the sample space is S = \{HH, HT, TH, TT\}.

A σ-field F is a collection of subsets of S satisfying
(i) ∅ and S are in F;
(ii) if A is in F, then A^c is in F;
(iii) if A_1, A_2, \ldots are in F, then \cup_{i=1}^{\infty} A_i and \cap_{i=1}^{\infty} A_i are in F.
Proposition. Suppose P is a probability. Then
(1) P(∅) = 0;
(2) if A_1, \ldots, A_n are pairwise disjoint, then P(\cup_{i=1}^{n} A_i) = \sum_{i=1}^{n} P(A_i).

Proof. For (1), let A_i = ∅ for each i. These are clearly disjoint, so
P(∅) = P(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{\infty} P(∅).
If P(∅) were positive, then the last term would be infinite, contradicting the fact that probabilities are between 0 and 1. So the probability must be zero.
The second follows if we let A_{n+1} = A_{n+2} = \cdots = ∅. We still have pairwise disjointness, \cup_{i=1}^{\infty} A_i = \cup_{i=1}^{n} A_i, and \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{n} P(A_i), using (1).
Example. What is the probability that if we roll 2 dice, the sum is 7? Answer.
There are 36 possibilities, of which 6 have a sum of 7: (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1). Since they are all equally likely, the probability is 6/36 = 1/6.
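A quick enumeration in Python (added here as an illustration, not from the text) confirms the count:

from fractions import Fraction

favorable = sum(1 for d1 in range(1, 7) for d2 in range(1, 7) if d1 + d2 == 7)
print(Fraction(favorable, 36))   # 1/6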
Example. What is the probability that in a poker hand (5 cards out of 52)
we get exactly 4 of a kind? Answer. The probability of 4 aces and 1 king is
\frac{\binom{4}{4}\binom{4}{1}}{\binom{52}{5}}.
The probability of 4 jacks and one 3 is the same. There are 13 ways to pick the rank that we have 4 of and then 12 ways to pick the rank we have one of, so the answer is
13 \cdot 12 \cdot \frac{\binom{4}{4}\binom{4}{1}}{\binom{52}{5}}.
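Numerically (a sketch that is not part of the original text), the four-of-a-kind probability works out to roughly 0.00024:

from math import comb

p = 13 * 12 * comb(4, 4) * comb(4, 1) / comb(52, 5)
print(p)   # about 0.00024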
Example. Suppose there are 30 people in a room. What is the probability that at least two of them share a birthday? Answer. We compute the probability that all the birthdays are different and subtract from 1. Let the first person have a birthday on some day. The probability that the second person has a different birthday will be 364/365. The probability that the third person has a different birthday from the first two people is 363/365, and so on. So the answer is
1 − \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{336}{365}.
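Evaluating this product numerically (an added sketch, assuming 365 equally likely birthdays as in the text) shows the probability of a shared birthday among 30 people is about 0.71:

prob_all_different = 1.0
for k in range(1, 30):                     # persons 2 through 30
    prob_all_different *= (365 - k) / 365
print(1 - prob_all_different)              # about 0.706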
Chapter 3
Independence
We say two events E and F are independent if
P(E ∩ F) = P(E)P(F).
Example. Suppose you flip two coins. The outcome of heads on the second is
independent of the outcome of tails on the first. To be more precise, if A is
tails for the first coin and B is heads for the second, and we assume we have
fair coins (although this is not necessary), we have P(A ∩ B) = 1/4 = (1/2)(1/2) = P(A)P(B).
Example. Suppose you draw a card from an ordinary deck. Let E be you
drew an ace, F be that you drew a spade. Here 1/52 = P(E ∩ F) = (1/13)(1/4) = P(E)P(F).
Example. Suppose you roll two dice, E is that the sum is 7, F that the first
is a 4, and G that the second is a 3. E and F are independent, as are E and
G and F and G, but E, F and G are not.
Example. What is the probability that exactly 3 threes will show if you roll
10 dice? Answer. The probability that the 1st, 2nd, and 4th dice will show a three and the other 7 will not is (1/6)^3 (5/6)^7. Independence is used here: the probability is (1/6)(1/6)(5/6)(1/6)(5/6) \cdots (5/6). The probability that the 4th, 5th, and 6th dice will show a three and the other 7 will not is the same thing. So to answer our original question, we take (1/6)^3 (5/6)^7 and multiply it by the number of ways of choosing 3 dice out of 10 to be the ones showing threes. There are \binom{10}{3} ways of doing that.
This is a particular example of what are known as Bernoulli trials or the
binomial distribution.
Suppose you have n independent trials, where the probability of a success is p. Then the probability that there are k successes is the number of ways of putting k objects in n slots (which is \binom{n}{k}) times the probability that there will be k successes and n − k failures in exactly a given order. So the probability is
\binom{n}{k} p^k (1 − p)^{n−k}.
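The dice example above fits this formula with n = 10, k = 3, and p = 1/6. The following Python sketch (not from the original; the simulation size is an arbitrary choice) evaluates the formula and checks it by simulation:

from math import comb
import random

n, k, p = 10, 3, 1 / 6
exact = comb(n, k) * p**k * (1 - p)**(n - k)

trials = 100_000
hits = sum(1 for _ in range(trials)
           if sum(random.randint(1, 6) == 3 for _ in range(n)) == k)
print(exact, hits / trials)   # both near 0.155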
Example. Suppose you play a fair game repeatedly: on each play you win $1 with probability 1/2 and lose $1 with probability 1/2. You start with $50 and you quit when you either reach $200 or go broke. Let y(x) be the probability you reach $200 before $0 when your current fortune is $x. Conditioning on the outcome of the first play,
y(x) = \tfrac12 y(x + 1) + \tfrac12 y(x − 1), \qquad 0 < x < 200.
This says successive slopes of the graph of y(x) are constant (remember that x must be an integer). In other words, y(x) must be a line. Since y(0) = 0 and y(200) = 1, we have y(x) = x/200, and therefore y(50) = 1/4.
Example. Suppose we are in the same situation, but you are allowed to go
arbitrarily far in debt. Let y(x) be the probability you ever get to $200.
What is a formula for y(x)?
Answer. Just as above, we have the equation y(x) = \tfrac12 y(x + 1) + \tfrac12 y(x − 1).
This implies y(x) is linear, and as above y(200) = 1. Now the slope of y
cannot be negative, or else we would have y > 1 for some x and that is not
possible. Neither can the slope be positive, or else we would have y < 0,
and again this is not possible, because probabilities must be between 0 and
1. Therefore the slope must be 0, or y(x) is constant, or y(x) = 1 for all x.
In other words, one is certain to get to $200 eventually (provided, of course,
that one is allowed to go into debt). There is nothing special about the figure
200. Another way of seeing this is to compute as above the probability of
getting to 200 before −M and then letting M → ∞.
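The answer y(50) = 1/4 can be checked by simulation. Here is a minimal Monte Carlo sketch (not in the original text) of the game as described above: start at $50, win or lose $1 with probability 1/2 each play, and stop at $200 or $0.

import random

def reaches_200(start=50):
    x = start
    while 0 < x < 200:
        x += random.choice((1, -1))
    return x == 200

trials = 2000
print(sum(reaches_200() for _ in range(trials)) / trials)   # close to 0.25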
Chapter 4
Conditional probability
Suppose there are 200 men, of which 100 are smokers, and 100 women, of
which 20 are smokers. What is the probability that a person chosen at
random will be a smoker? The answer is 120/300. Now, let us ask, what is
the probability that a person chosen at random is a smoker given that the
person is a woman? One would expect the answer to be 20/100, and it is. What we have computed is
\frac{\text{number of women smokers}}{\text{number of women}} = \frac{\text{number of women smokers}/300}{\text{number of women}/300},
which is the same as the probability that a person chosen at random is a
woman and a smoker divided by the probability that a person chosen at
random is a woman.
With this in mind, we make the following definition. If P(F ) > 0, we
define
P(E | F) = \frac{P(E ∩ F)}{P(F)}.
P(E | F ) is read “the probability of E given F .”
Note P(E ∩ F ) = P(E | F )P(F ).
Suppose you roll two dice. What is the probability the sum is 8? There are
five ways this can happen (2, 6), (3, 5), (4, 4), (5, 3), (6, 2), so the probability
is 5/36. Let us call this event A. What is the probability that the sum is
8 given that the first die shows a 3? Let B be the event that the first die
shows a 3. Then P(A ∩ B) is the probability that the first die shows a 3 and the sum is 8, or 1/36. P(B) = 1/6, so P(A | B) = (1/36)/(1/6) = 1/6.
Example. Suppose a box has 3 red marbles and 2 black ones. We select 2 marbles. What is the probability that the second marble is red given that the first one is red? Answer. Let A be the event the second marble is red, and B the event that the first one is red. P(B) = 3/5, while P(A ∩ B) is the probability both are red, i.e., the probability that we chose 2 red out of 3 and 0 black out of 2. So
P(A ∩ B) = \frac{\binom{3}{2}\binom{2}{0}}{\binom{5}{2}} = \frac{3}{10}.
Then P(A | B) = (3/10)/(3/5) = 1/2.
Example. A family has 2 children. Given that one of the children is a boy,
what is the probability that the other child is also a boy? Answer. Let B be
the event that one child is a boy, and A the event that both children are boys.
The possibilities are bb, bg, gb, gg, each with probability 1/4. P(A ∩ B) = P(bb) = 1/4 and P(B) = P(\{bb, bg, gb\}) = 3/4. So the answer is (1/4)/(3/4) = 1/3.
Example. Suppose the test for HIV is 99% accurate in both directions and
0.3% of the population is HIV positive. If someone tests positive, what is
the probability they actually are HIV positive?
Let D mean HIV positive, and T mean tests positive.
P(D | T) = \frac{P(D ∩ T)}{P(T)} = \frac{(.99)(.003)}{(.99)(.003) + (.01)(.997)} ≈ 23%.
A short reason why this surprising result holds is that the error in the test is
much greater than the percentage of people with HIV. A little longer answer
is to suppose that we have 1000 people. On average, 3 of them will be HIV
positive and 10 will test positive. So the chances that someone has HIV given
that the person tests positive is approximately 3/10. The reason that it is
not exactly .3 is that there is some chance someone who is positive will test
negative.
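The arithmetic in this example is easy to reproduce; a short Python sketch (added for illustration, using the 99% accuracy and 0.3% prevalence from the text):

prevalence, accuracy = 0.003, 0.99
p_test_positive = accuracy * prevalence + (1 - accuracy) * (1 - prevalence)
print(accuracy * prevalence / p_test_positive)   # about 0.23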
Example. Suppose 36% of families own a dog, 30% of families own a cat,
and 22% of the families that have a dog also have a cat. A family is chosen
at random and found to have a cat. What is the probability they also own
15
a dog? Answer. Let D be the families that own a dog, and C the families
that own a cat. We are given P(D) = .36, P(C) = .30, and P(C | D) = .22. We
want to know P(D | C). We know P(D | C) = P(D ∩ C)/P(C). To find
the numerator, we use P(D ∩ C) = P(C | D)P(D) = (.22)(.36) = .0792. So
P(D | C) = .0792/.3 = .264 = 26.4%.
So
P(A) = P(W ∩ A) + P(M ∩ A) = .18 + .10 = .28.
Finally,
P(W | A) = \frac{P(W ∩ A)}{P(A)} = \frac{.18}{.28}.
To get a general formula, we can write
P(F | E) = \frac{P(E ∩ F)}{P(E)} = \frac{P(E | F)P(F)}{P(E ∩ F) + P(E ∩ F^c)} = \frac{P(E | F)P(F)}{P(E | F)P(F) + P(E | F^c)P(F^c)}.
Chapter 5

Random variables
Example. If one rolls a die, let X denote the outcome (i.e., either 1,2,3,4,5,6).
Example. Let P(X = i) = (1/2)^{i+1} for i = 0, 1, 2, \ldots. This turns out to sum to 1. To see this, recall the formula for a geometric series:
1 + x + x^2 + x^3 + \cdots = \frac{1}{1 − x}.
If we differentiate this, we get
1 + 2x + 3x^2 + \cdots = \frac{1}{(1 − x)^2}.
We have
E X = 1(\tfrac14) + 2(\tfrac18) + 3(\tfrac1{16}) + \cdots
= \tfrac14\left[1 + 2(\tfrac12) + 3(\tfrac14) + \cdots\right]
= \tfrac14 \cdot \frac{1}{(1 − \tfrac12)^2} = 1.
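As a numerical check (not part of the text, and assuming the distribution P(X = i) = (1/2)^{i+1}, i = 0, 1, 2, \ldots, used above), truncating the series gives values very close to 1 for both the total probability and the mean:

total_prob = sum(0.5 ** (i + 1) for i in range(200))
mean = sum(i * 0.5 ** (i + 1) for i in range(200))
print(total_prob, mean)   # both very close to 1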
Let’s give a proof of the linearity of expectation in the case when X and
Y both take only finitely many values.
Let Z = X + Y , let a1 , . . . , an be the values taken by X, b1 , . . . , bm be the
values taken by Y , and c1 , . . . , c` the values taken by Z. Since there are only
finitely many values, we can interchange the order of summations freely.
We write
E Z = \sum_{k=1}^{\ell} c_k P(Z = c_k) = \sum_{k=1}^{\ell} \sum_{i=1}^{n} c_k P(Z = c_k, X = a_i)
= \sum_k \sum_i c_k P(X = a_i, Y = c_k − a_i)
= \sum_k \sum_i \sum_{j=1}^{m} c_k P(X = a_i, Y = c_k − a_i, Y = b_j)
= \sum_i \sum_j \sum_k c_k P(X = a_i, Y = c_k − a_i, Y = b_j).
The summand is zero unless c_k = a_i + b_j, so for each pair (i, j) only that term survives, and
E Z = \sum_i \sum_j (a_i + b_j) P(X = a_i, Y = b_j)
= \sum_i a_i P(X = a_i) + \sum_j b_j P(Y = b_j)
= E X + E Y.
It turns out there is a formula for the expectation of random variables like X^2 and e^X. To see how this works, let us first look at an example.
Suppose we roll a die and let X be the value that is showing. We want the expectation E X^2. Let Y = X^2, so that P(Y = 1) = 1/6, P(Y = 4) = 1/6, etc., and
E X^2 = E Y = (1)\tfrac16 + (4)\tfrac16 + \cdots + (36)\tfrac16.
We can also write this as
E X^2 = 1^2 \cdot \tfrac16 + 2^2 \cdot \tfrac16 + \cdots + 6^2 \cdot \tfrac16,
which suggests that a formula for E X^2 is \sum_x x^2 P(X = x). This turns out to be correct.
The only possibility where things could go wrong is if more than one value of X leads to the same value of X^2. For example, suppose P(X = −2) = 1/8, P(X = −1) = 1/4, P(X = 1) = 3/8, P(X = 2) = 1/4. Then if Y = X^2, P(Y = 1) = 5/8 and P(Y = 4) = 3/8. Then
E X^2 = E Y = 1 \cdot \tfrac58 + 4 \cdot \tfrac38 = \tfrac{17}{8},
which agrees with \sum_x x^2 P(X = x) because the terms for x = −1 and x = 1 combine, as do those for x = −2 and x = 2.

Theorem 5.1 E g(X) = \sum_x g(x) p(x).

Example. E X^2 = \sum_x x^2 p(x).
Let M = E X. The quantity
Var X = E (X − M)^2
is called the variance of X. The square root of Var X is the standard deviation of X.
The variance measures how much spread there is about the expected value.
Example. We roll a die and let X be the value that shows. We have previously calculated E X = \tfrac72. So X − E X takes the values
−\tfrac52, −\tfrac32, −\tfrac12, \tfrac12, \tfrac32, \tfrac52,
each with probability \tfrac16. So
Var X = (−\tfrac52)^2\tfrac16 + (−\tfrac32)^2\tfrac16 + (−\tfrac12)^2\tfrac16 + (\tfrac12)^2\tfrac16 + (\tfrac32)^2\tfrac16 + (\tfrac52)^2\tfrac16 = \tfrac{35}{12}.
Var X = E (X − M)^2 = E X^2 − 2E(XM) + E(M^2)
= E X^2 − 2M^2 + M^2
= E X^2 − (E X)^2.
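These die computations can be verified exactly with rational arithmetic; a small Python sketch (not from the text):

from fractions import Fraction

faces = [Fraction(k) for k in range(1, 7)]
ex = sum(faces) / 6                    # E X
ex2 = sum(f * f for f in faces) / 6    # E X^2
print(ex, ex2 - ex * ex)               # 7/2 and 35/12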
Chapter 6

Some discrete distributions
Binomial. A r.v. X has a binomial distribution with parameters n and p if
P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}.
The number of successes in n trials is a
binomial. After some cumbersome calculations one can derive E X = np.
An easier way is to realize that if X is binomial, then X = Y_1 + \cdots + Y_n, where the Y_i are independent Bernoullis, so E X = E Y_1 + \cdots + E Y_n = np.
We haven’t defined what it means for r.v.’s to be independent, but here we
mean that the events (Yk = 1) are independent. The cumbersome way is as
follows.
E X = \sum_{k=0}^{n} k \binom{n}{k} p^k (1 − p)^{n−k} = \sum_{k=1}^{n} k \binom{n}{k} p^k (1 − p)^{n−k}
= \sum_{k=1}^{n} k \frac{n!}{k!(n − k)!} p^k (1 − p)^{n−k}
= np \sum_{k=1}^{n} \frac{(n − 1)!}{(k − 1)!\,((n − 1) − (k − 1))!} p^{k−1} (1 − p)^{(n−1)−(k−1)}
= np \sum_{k=0}^{n−1} \frac{(n − 1)!}{k!\,((n − 1) − k)!} p^{k} (1 − p)^{(n−1)−k}
= np \sum_{k=0}^{n−1} \binom{n−1}{k} p^{k} (1 − p)^{(n−1)−k} = np.
Now
E Y_iY_j = 1 \cdot P(Y_iY_j = 1) + 0 \cdot P(Y_iY_j = 0) = P(Y_i = 1, Y_j = 1) = P(Y_i = 1)P(Y_j = 1) = p^2.
Later we will see that the variance of the sum of independent r.v.’s is
the sum of the variances, so we could quickly get Var X = np(1 − p). Alternatively, one can compute E(X^2) − E X = E(X(X − 1)) using binomial coefficients and derive the variance of X from that.
Poisson. A r.v. X has the Poisson distribution with parameter λ if
P(X = i) = e^{−λ} \frac{λ^i}{i!}.
Note \sum_{i=0}^{\infty} λ^i/i! = e^{λ}, so the probabilities add up to one.
To compute expectations,
E X = \sum_{i=0}^{\infty} i\, e^{−λ} \frac{λ^i}{i!} = e^{−λ} λ \sum_{i=1}^{\infty} \frac{λ^{i−1}}{(i − 1)!} = λ.
E(X^2) − E X = E X(X − 1) = \sum_{i=0}^{\infty} i(i − 1)\, e^{−λ} \frac{λ^i}{i!}
= λ^2 e^{−λ} \sum_{i=2}^{\infty} \frac{λ^{i−2}}{(i − 2)!}
= λ^2,
so Var X = E(X^2) − (E X)^2 = λ^2 + λ − λ^2 = λ.
Example. Suppose on average there is one large earthquake per year in California. What's the probability that next year there will be exactly 2 large earthquakes? Answer. Modeling the number of large earthquakes in a year as a Poisson r.v. with λ = 1, the probability is e^{−1} \frac{1^2}{2!} = \frac{1}{2e} ≈ .18.
Proposition. Suppose X_n is a binomial r.v. with parameters n and p_n, where np_n → λ as n → ∞. Then P(X_n = i) → e^{−λ} λ^i/i! for each i.

The above proposition shows that the Poisson distribution models binomials when the probability of a success is small. The number of misprints on a page, the number of automobile accidents, the number of people entering a store, etc. can all be modeled by Poissons.
Proof. For simplicity, let us suppose λ = np_n. In the general case we use λ_n = np_n. We write
P(X_n = i) = \frac{n!}{i!\,(n − i)!} p_n^i (1 − p_n)^{n−i}
= \frac{n(n − 1) \cdots (n − i + 1)}{i!} \Bigl(\frac{λ}{n}\Bigr)^i \Bigl(1 − \frac{λ}{n}\Bigr)^{n−i}
= \frac{n(n − 1) \cdots (n − i + 1)}{n^i} \, \frac{λ^i}{i!} \, \frac{(1 − λ/n)^n}{(1 − λ/n)^i}.
The first factor tends to 1 as n → ∞, (1 − λ/n)^i → 1 as n → ∞, and (1 − λ/n)^n → e^{−λ} as n → ∞.
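A numerical illustration (not in the original) of how close this approximation is: compare the binomial(n, λ/n) and Poisson(λ) probabilities for a few values of i.

from math import comb, exp, factorial

lam, n = 1.0, 1000
p = lam / n
for i in range(4):
    binom = comb(n, i) * p**i * (1 - p)**(n - i)
    poisson = exp(-lam) * lam**i / factorial(i)
    print(i, round(binom, 5), round(poisson, 5))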
Hypergeometric. Set
P(X = i) = \frac{\binom{m}{i}\binom{N − m}{n − i}}{\binom{N}{n}}.
Chapter 7

Continuous distributions
For X uniform on [a, b],
Var X = E X^2 − (E X)^2 = \frac{(b − a)^2}{12}.
Chapter 8
Normal distribution
Therefore Var Z = E Z^2 = 1.

We say X is a N(µ, σ^2) if X = σZ + µ, where Z is a N(0, 1). We see that
F_X(x) = P(X ≤ x) = P(σZ + µ ≤ x) = P(Z ≤ (x − µ)/σ) = F_Z((x − µ)/σ)
if σ > 0. (A similar calculation holds if σ < 0.) Then by the chain rule X has density
f_X(x) = F_X'(x) = \frac{1}{σ} F_Z'((x − µ)/σ) = \frac{1}{σ} f_Z((x − µ)/σ).
This is equal to
\frac{1}{\sqrt{2π}\,σ} e^{−(x−µ)^2/2σ^2}.
E X = µ + σ E Z and Var X = σ^2 Var Z, so
E X = µ, Var X = σ^2.
Write Φ(x) = P(Z ≤ x), where Z is a N(0, 1). By the symmetry of the normal density,
Φ(−x) = 1 − Φ(x).
Example. Suppose X is N(2, 25) and we want P(1 ≤ X ≤ 4). Writing X = 2 + 5Z with Z a N(0, 1),
P(1 ≤ X ≤ 4) = P(1 ≤ 2 + 5Z ≤ 4)
= P(−1 ≤ 5Z ≤ 2) = P(−0.2 ≤ Z ≤ .4)
= P(Z ≤ .4) − P(Z ≤ −0.2)
= Φ(0.4) − Φ(−0.2) = .6554 − [1 − Φ(0.2)]
= .6554 − [1 − .5793].
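Such probabilities can also be computed directly, writing Φ in terms of the error function. A minimal Python sketch (not from the text):

from math import erf, sqrt

def Phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

print(Phi(0.4) - Phi(-0.2))   # about 0.235, matching .6554 - [1 - .5793]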
Chapter 9

Normal approximation

Example. Suppose a fair coin is tossed 100 times. What is the probability there will be more than 60 heads?
Answer. np = 50 and \sqrt{np(1 − p)} = 5. We have
P(S_n > 60) ≈ P(Z > 2) ≈ .0228.
Example. Suppose a die is rolled 180 times. What is the probability a 3 will
be showing more than 50 times?
Answer. Here p = \tfrac16, so np = 30 and \sqrt{np(1 − p)} = 5. Then P(S_n > 50) ≈ P(Z > 4), which is less than e^{−4^2/2}.
When b−a is small, there is a correction that makes things more accurate,
namely replace a by a − \tfrac12 and b by b + \tfrac12. This correction never hurts and is sometimes necessary. For example, in tossing a coin 100 times, there is positive
probability that there are exactly 50 heads, while without the correction, the
answer given by the normal approximation would be 0.
Example. We toss a coin 100 times. What is the probability of getting 49,
50, or 51 heads? Answer. Using the continuity correction, this is approximately P(48.5 ≤ S_n ≤ 51.5) ≈ P(−0.3 ≤ Z ≤ 0.3) = 2Φ(0.3) − 1 ≈ .236.

Chapter 10

Some continuous distributions

Gamma. A r.v. has a gamma distribution with parameters λ and t if its density is
f(x) = \frac{λ e^{−λx} (λx)^{t−1}}{Γ(t)}
if x ≥ 0 and 0 otherwise. Here Γ(t) = \int_0^{\infty} e^{−y} y^{t−1}\,dy is the Gamma function,
which interpolates the factorial function.
An exponential is the time for something to occur. A gamma is the time for t events to occur. A gamma with parameters \tfrac12 and \tfrac{n}{2} is known as a χ^2_n, a chi-squared r.v. with n degrees of freedom. Gammas and chi-squareds come up frequently in statistics. Another distribution that arises in statistics is
the beta:
f(x) = \frac{1}{B(a, b)} x^{a−1} (1 − x)^{b−1}, \qquad 0 < x < 1,
where B(a, b) = \int_0^1 x^{a−1} (1 − x)^{b−1}\,dx.
Cauchy. Here
f(x) = \frac{1}{π} \cdot \frac{1}{1 + (x − θ)^2}.
What is interesting about the Cauchy is that it does not have finite mean,
that is, E |X| = ∞.
using the chain rule. Since fX = 1, this gives fY (x) = e−x , or Y is exponential
with parameter 1.
For another example, suppose X is N (0, 1) and Y = X 2 . Then
F_Y(x) = P(Y ≤ x) = P(X^2 ≤ x) = P(−\sqrt{x} ≤ X ≤ \sqrt{x})
= P(X ≤ \sqrt{x}) − P(X ≤ −\sqrt{x}) = F_X(\sqrt{x}) − F_X(−\sqrt{x}).
Chapter 11

Multivariate distributions
Example. If f_{X,Y}(x, y) = c e^{−x} e^{−2y} for 0 < x < ∞ and x < y < ∞, what is c? Answer. We need \int_0^{\infty} \int_x^{\infty} c e^{−x} e^{−2y}\,dy\,dx = 1. The inner integral equals \tfrac{c}{2} e^{−3x}; integrating over x gives c/6, so c = 6.
If F(x, y) = P(X ≤ x, Y ≤ y) denotes the joint distribution function, then differentiating gives
f(x, y) = \frac{∂^2 F}{∂x\,∂y}(x, y).
The extension to n random variables is exactly similar.
We have
P(a ≤ X ≤ b, c ≤ Y ≤ d) = \int_a^b \int_c^d f_{X,Y}(x, y)\,dy\,dx,
or
P((X, Y) ∈ D) = \iint_D f_{X,Y}\,dy\,dx
when D is the set {(x, y) : a ≤ x ≤ b, c ≤ y ≤ d}. One can show this holds
when D is any set. For example,
P(X < Y) = \iint_{\{x < y\}} f_{X,Y}(x, y)\,dy\,dx.
If one has the joint density of X and Y , one can recover the densities of
X and of Y :
f_X(x) = \int_{−\infty}^{\infty} f_{X,Y}(x, y)\,dy, \qquad f_Y(y) = \int_{−\infty}^{\infty} f_{X,Y}(x, y)\,dx.
We say X and Y are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A)\,P(Y ∈ B)
for all pairs of subsets A, B of the reals. The left hand side is an abbreviation for
for
P({ω : X(ω) is in A and Y (ω) is in B})
and similarly for the right hand side.
In the discrete case, if we have independence,
p_{X,Y}(x, y) = P(X = x, Y = y) = P(X = x)\,P(Y = y) = p_X(x)\,p_Y(y).
In other words, the joint density p_{X,Y} factors. In the continuous case,
\int_a^b \int_c^d f_{X,Y}(x, y)\,dy\,dx = P(a ≤ X ≤ b, c ≤ Y ≤ d)
= P(a ≤ X ≤ b)\,P(c ≤ Y ≤ d)
= \int_a^b f_X(x)\,dx \int_c^d f_Y(y)\,dy
= \int_a^b \int_c^d f_X(x) f_Y(y)\,dy\,dx,
or again the joint density factors. Going the other way, one can also see that
if the joint density factors, then one has independence.
Example. Suppose one has a floor made out of wood planks and one drops a
needle onto it. What is the probability the needle crosses one of the cracks?
Suppose the needle is of length L and the wood planks are D across.
Answer. Let X be the distance from the midpoint of the needle to the nearest
crack and let Θ be the angle the needle makes with the vertical. Then X and
Θ will be independent. X is uniform on [0, D/2] and Θ is uniform on [0, π/2].
A little geometry shows that the needle will cross a crack if L/2 > X/ cos Θ.
We have f_{X,Θ} = \frac{4}{πD} and so we have to integrate this constant over the set where X < L\cos Θ/2, 0 ≤ Θ ≤ π/2, and 0 ≤ X ≤ D/2. The integral is
\int_0^{π/2} \int_0^{L\cos θ/2} \frac{4}{πD}\,dx\,dθ = \frac{2L}{πD}.
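A Monte Carlo sketch of this needle experiment (added for illustration, assuming L ≤ D as in the computation above):

import math, random

L, D, trials = 1.0, 2.0, 200_000
hits = 0
for _ in range(trials):
    x = random.uniform(0, D / 2)            # distance from midpoint to nearest crack
    theta = random.uniform(0, math.pi / 2)  # angle with the vertical
    if x < (L / 2) * math.cos(theta):
        hits += 1
print(hits / trials, 2 * L / (math.pi * D))   # both near 0.318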
Note that it is not always the case that the sum of two independent random
variables will be a random variable of the same type.
In the discrete case, the conditional probability that X = x given Y = y is P(X = x | Y = y). This is equal to
\frac{P(X = x, Y = y)}{P(Y = y)} = \frac{p(x, y)}{p_Y(y)}.
Analogously, we define in the continuous case
f_{X|Y=y}(x | y) = \frac{f(x, y)}{f_Y(y)}.
If Y = g(X) with g increasing, then F_Y(y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)). Taking the derivative, using the chain rule, and recalling that the derivative of g^{−1}(y) is 1/g'(g^{−1}(y)), we have
f_Y(y) = f_X(g^{−1}(y)) \, \frac{1}{g'(g^{−1}(y))}.
Example. Suppose X1 is N (0, 1), X2 is N (0, 4), and X1 and X2 are inde-
pendent. Let Y1 = 2X1 + X2 , Y2 = X1 − 3X2 . Then y1 = g1 (x1 , x2 ) =
2x_1 + x_2, y_2 = g_2(x_1, x_2) = x_1 − 3x_2, so
J = \begin{vmatrix} 2 & 1 \\ 1 & −3 \end{vmatrix} = −7.
(In general, J might depend on x, and hence on y.) Some algebra leads to x_1 = \tfrac37 y_1 + \tfrac17 y_2, x_2 = \tfrac17 y_1 − \tfrac27 y_2. Since X_1 and X_2 are independent,
f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2) = \frac{1}{\sqrt{2π}} e^{−x_1^2/2} \, \frac{1}{\sqrt{8π}} e^{−x_2^2/8}.
Therefore
f_{Y_1,Y_2}(y_1, y_2) = \frac{1}{\sqrt{2π}} e^{−(\frac37 y_1 + \frac17 y_2)^2/2} \, \frac{1}{\sqrt{8π}} e^{−(\frac17 y_1 − \frac27 y_2)^2/8} \cdot \frac{1}{7}.
Chapter 12
Expectations
Using the formula E g(X, Y) = \iint g(x, y) f_{X,Y}(x, y)\,dx\,dy with g(x, y) = x + y and splitting the integral into two pieces: if we now set g(x, y) = x, we see the first integral on the right is E X, and similarly the second is E Y. Therefore
E (X + Y ) = E X + E Y.
If X and Y are independent, then E[h(X)k(Y)] = E h(X)\,E k(Y).

The covariance of X and Y is defined by Cov(X, Y) = E[(X − E X)(Y − E Y)]. As with the variance, Cov(X, Y) = E(XY) − (E X)(E Y). It follows that if X and Y are independent, then E(XY) = (E X)(E Y), and then Cov(X, Y) = 0.
Note
Var(X + Y) = E[((X + Y) − E(X + Y))^2]
= E[((X − E X) + (Y − E Y))^2]
= E[(X − E X)^2 + 2(X − E X)(Y − E Y) + (Y − E Y)^2]
= Var X + 2\,Cov(X, Y) + Var Y.
Chapter 13

Moment generating functions

The moment generating function of a random variable X is defined by
m_X(t) = E e^{tX}.
4. Exponential:
E e^{tX} = \int_0^{\infty} e^{tx} λ e^{−λx}\,dx = \frac{λ}{λ − t}
if t < λ and ∞ if t ≥ λ.
5. N (0, 1):
\frac{1}{\sqrt{2π}} \int e^{tx} e^{−x^2/2}\,dx = e^{t^2/2} \frac{1}{\sqrt{2π}} \int e^{−(x−t)^2/2}\,dx = e^{t^2/2}.
by Proposition 13.2. We will not prove this, but the converse is also true: if
mX,Y (s, t) = mX (s)mY (t) for all s and t, then X and Y are independent.
Chapter 14
Limit laws
Suppose Xi are independent and have the same distribution. In the case
of continuous or discrete random variables, this means they all have the
same density. We say the X_i are i.i.d., which stands for "independent and identically distributed." Let S_n = \sum_{i=1}^{n} X_i. S_n is called the partial sum process.
Theorem 14.1 Suppose the X_i are i.i.d. with E|X_1| < ∞, and let µ = E X_1. Then
\frac{S_n}{n} → µ.
This is known as the strong law of large numbers (SLLN). The convergence
here means that Sn (ω)/n → µ for every ω ∈ S, where S is the probability
space, except possibly for a set of ω of probability 0.
The proof of Theorem 14.1 is quite hard, and we prove a weaker version,
the weak law of large numbers (WLLN). The WLLN states that for every
a > 0,
P\left(\left|\frac{S_n}{n} − E X_1\right| > a\right) → 0
as n → ∞. It is not even that easy to give an example of random variables
that satisfy the WLLN but not the SLLN.
Before proving the WLLN, we need an inequality called Chebyshev’s in-
equality.
Chebyshev's inequality states that if Y ≥ 0 and A > 0, then
P(Y > A) ≤ \frac{E Y}{A}.

Proof. We do the case for continuous densities, the case for discrete densities being similar. We have
P(Y > A) = \int_A^{\infty} f_Y(y)\,dy ≤ \int_A^{\infty} \frac{y}{A} f_Y(y)\,dy ≤ \frac{1}{A} \int_{−\infty}^{\infty} y f_Y(y)\,dy = \frac{1}{A} E Y.
Theorem 14.3 Suppose the Xi are i.i.d. and E |X1 | and Var X1 are finite.
Then for every a > 0,
P\left(\left|\frac{S_n}{n} − E X_1\right| > a\right) → 0
as n → ∞.
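The law of large numbers is easy to see in simulation. The following sketch (not part of the text) averages die rolls, for which E X_1 = 3.5:

import random

for n in (100, 10_000, 1_000_000):
    s = sum(random.randint(1, 6) for _ in range(n))
    print(n, s / n)   # approaches 3.5 as n grows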
Theorem 14.4 Suppose the Xi are i.i.d. Suppose E Xi2 < ∞. Let µ = E Xi
and σ 2 = Var Xi . Then
P\left(a ≤ \frac{S_n − nµ}{σ\sqrt{n}} ≤ b\right) → P(a ≤ Z ≤ b)
as n → ∞, where Z is a standard normal r.v.
The ratio on the left is (S_n − E S_n)/\sqrt{Var\,S_n}. We do not claim that this ratio converges for any ω (in fact, it doesn't), but that the probabilities converge.
Example. Suppose we roll a die 3600 times. Let Xi be the number showing
on the ith roll. We know Sn /n will be close to 3.5. What’s the probability it
differs from 3.5 by more than 0.05?
Answer. We want
P\left(\left|\frac{S_n}{n} − 3.5\right| > .05\right).
We rewrite this as
P(|S_n − nE X_1| > (.05)(3600)) = P\left(\frac{|S_n − nE X_1|}{\sqrt{n}\sqrt{Var\,X_1}} > \frac{180}{(60)\sqrt{35/12}}\right) ≈ P(|Z| > 1.76) ≈ .08.
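A rough simulation of this example (added as an illustration; the number of trials is arbitrary) agrees with the normal approximation:

import random

n, trials = 3600, 2000
count = 0
for _ in range(trials):
    s = sum(random.randint(1, 6) for _ in range(n))
    if abs(s / n - 3.5) > 0.05:
        count += 1
print(count / trials)   # near 0.08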
Answer. We want
P\left(\frac{S_n}{n} > 73\right) = P(S_n > 7300) = P\left(\frac{S_n − nE X_1}{\sqrt{n}\sqrt{Var\,X_1}} > \frac{7300 − (100)(72)}{\sqrt{100}\sqrt{36}}\right)
≈ P(Z > 1.667) ≈ .047.
The idea behind proving the central limit theorem is the following. It
turns out that if m_{Y_n}(t) → m_Z(t) for every t, then P(a ≤ Y_n ≤ b) → P(a ≤ Z ≤ b). (We won't prove this.) We are going to let Y_n = (S_n − nµ)/σ\sqrt{n}. Let W_i = (X_i − µ)/σ. Then E W_i = 0, Var W_i = \frac{Var\,X_i}{σ^2} = 1, the W_i are independent, and
\frac{S_n − nµ}{σ\sqrt{n}} = \frac{\sum_{i=1}^{n} W_i}{\sqrt{n}}.
So there is no loss of generality in assuming that µ = 0 and σ = 1. Then
m_{Y_n}(t) = E e^{tY_n} = E e^{(t/\sqrt{n}) S_n} = m_{S_n}(t/\sqrt{n}).
Since the X_i are i.i.d., all the X_i have the same moment generating function. Since S_n = X_1 + \cdots + X_n, then
m_{S_n}(t) = m_{X_1}(t) \cdots m_{X_n}(t) = [m_{X_1}(t)]^n.
Also,
m_{X_1}(t) = E e^{tX_1} = 1 + t\,E X_1 + \frac{t^2}{2!} E(X_1)^2 + \frac{t^3}{3!} E(X_1)^3 + \cdots.
We put the above together and obtain
m_{Y_n}(t) = m_{S_n}(t/\sqrt{n})
= [m_{X_1}(t/\sqrt{n})]^n
= \left[1 + t \cdot 0 + \frac{(t/\sqrt{n})^2}{2!} + R_n\right]^n
= \left[1 + \frac{t^2}{2n} + R_n\right]^n,
where n|R_n| → 0 as n → ∞. This converges to e^{t^2/2} = m_Z(t) as n → ∞.