St213 1993 Lecture Notes
Wilfrid S. Kendall
version 1.0
28 April 1999
1. Introduction
The main purpose of the course ST213 Mathematics of Random Events (which we
will abbreviate to MoRE) is to work over again the basics of the mathematics of
uncertainty. You have already covered this in a rough-and-ready fashion in:
(a) ST111 Probability;
and possibly in
(b) ST114 Games and Decisions.
In this course we will cover these matters with more care. It is important to
do this because a proper appreciation of the fundamentals of the mathematics of
random events
(a) gives an essential basis for getting a good grip on the basic ideas of
statistics;
(b) will be of increasing importance in the future as it forms the basis of
the hugely important field of mathematical finance.
It is appropriate at this level that we cover the material emphasizing concepts
rather than proofs: by-and-large we will concentrate on what the results say and
so will on some occasions explain them rather than prove them. The third-year
courses MA305 Measure Theory, and ST318 Probability Theory go into the matter
of proofs. For further discussion of how Warwick probability courses fit together,
see our road-map to probability at Warwick at
www.warwick.ac.uk/statsdept/teaching/probmap.html
1.1 Books
[1] D. Williams (1991) Probability with Martingales, CUP.
after the test, and an examples class will be held a week after the test. The tests
will be marked, and the assessed component will be based on the best 3 out of 4
of your answers.
This method helps you learn during the lecture course so should:
improve your exam marks;
increase your enjoyment of the course;
cost less time than end-of-term assessment.
Further copies of exercise sheets (after they have been handed out in lectures!)
can be obtained at the homepage for the ST213 course:
www.warwick.ac.uk/statsdept/teaching/ST213.html
These notes will also be made available at the above URL, chapter by chapter
as they are covered in lectures. Notice that they do not cover all the material
of the lectures: their purpose is to provide a basic skeleton of summary material
to supplement the notes you make during lectures. For example no proofs are
included. In particular you will not find it possible to cover the course by ignoring
lectures and depending on these notes alone!
Further related material (e.g. related courses, some pretty pictures of random
processes, ...) can be obtained by following links from W.S. Kendall's homepage:
www.warwick.ac.uk/statsdept/Staff/WSK/
Finally, the Library Student Reserve Collection (SRC) will in the summer
term hold copies of previous examination papers, and we will run two revision
classes for this course at that time.
For U uniformly distributed over [0, 1],

P[ a ≤ U ≤ b ] = b − a = ∫_0^1 f(x) dx , where f(x) = 1 if a ≤ x ≤ b and f(x) = 0 otherwise;

and yet, since [a, b] = ⋃_{x∈[a,b]} {x} , should we not equally have

P[ a ≤ U ≤ b ] = ∑_{x∈[a,b]} P[ U = x ] = ∑_{x∈[a,b]} 0 = 0 ?
It is often the case that one feels justified in assuming the coins individually are
equally likely to come up heads or tails. Using the fact P[ A = T ] = 1 − P[ A = H ],
etc, we find

P[ A comes up heads ] = 1/2 ,
P[ B comes up heads ] = 1/2 .
To find probabilities such as P [ HH ] = P [ A = H, B = H ] we need to say
something about the relationship between the two coin-tosses. It is often the case
that one feels justified in assuming the coin-tosses are independent, so
P[ A = H, B = H ] = P[ A = H ] P[ B = H ] .
However this assumption may be unwise when the person tossing the coin is not
experienced! We may decide that some variant of the following is a better model:
the event determining [B = H] is C if [A = H], D if [A = T ], where
P[ C = H ] = 3/4 ,
P[ D = H ] = 1/4 .
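As a quick check on this variant model (a short supplementary computation): the coin B is still individually fair, since

P[ B = H ] = P[ A = H ] P[ C = H ] + P[ A = T ] P[ D = H ] = (1/2)(3/4) + (1/2)(1/4) = 1/2 ,

and yet independence fails:

P[ A = H, B = H ] = P[ A = H ] P[ C = H ] = 3/8 ≠ 1/4 = P[ A = H ] P[ B = H ] .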
{ Ω = {H, T}, {H}, {T}, ∅ } ;
(iii) now consider the following class of subsets of the unit interval [0, 1]:
A = { finite unions of subintervals }. This is an algebra. For example, if

A = (a0, a1) ∪ (a2, a3) ∪ ... ∪ (a2n, a2n+1)

is a non-overlapping union of intervals (and we can always re-arrange
matters so that any union of intervals is non-overlapping!) then

Ac = [0, a0] ∪ [a1, a2] ∪ ... ∪ [a2n+1, 1]

is again a finite union of subintervals.
In realistic examples algebras are rather large: not surprising, since they
correspond to the collection of all true-or-false statements you can make about a
certain experiment! (If your experiment's results can be summarised as n different
yes/no answers such as "result is hot/cold", "result is coloured black/white",
etc, then the relevant algebra is composed of 2^{2^n} different subsets!) Therefore it is
of interest that the typical element of an algebra can be written down in a rather
special form:
Theorem 2.3 (Representation of typical element of algebra): If C is a
collection of subsets of Ω then the event A belongs to the algebra A(C) generated
by C if and only if

A = ⋃_{i=1}^{N} ⋂_{j=1}^{Mi} Ci,j

where for each i, j either Ci,j or its complement Ci,j^c belongs to C. Moreover we
may write A in this form with the sets

Di = ⋂_{j=1}^{Mi} Ci,j

being disjoint. *
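As a small illustration of the theorem (a supplementary worked instance): if C = {C1, C2} then the event C1 ∪ C2, which lies in A(C), can be rewritten in the required disjoint form as

C1 ∪ C2 = (C1 ∩ C2) ∪ (C1 ∩ C2^c) ∪ (C1^c ∩ C2) .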
We are now in a position to produce our first stab at a set of axioms for
probability. Given a sample space Ω and an algebra A of subsets, probability P[ · ]
assigns a number between 0 and 1 to each event in the algebra A, obeying the
rules given below. There is a close analogy to the notion of length of subsets of
[0, 1] (and also to notions of area, volume, ...): the table below makes this clear:

Probability                                  Length
P[ ∅ ] = 0                                   Length(∅) = 0
P[ Ω ] = 1                                   Length([0, 1]) = 1
P[ A ∪ B ] = P[ A ] + P[ B ]                 Length(A ∪ B) = Length(A) + Length(B)
  if A ∩ B = ∅                                 if A ∩ B = ∅
There are some consequences of these axioms which are not completely trivial.
For example, the law of negation

P[ Ac ] = 1 − P[ A ] ;

and the generalized law of addition, holding when A ∩ B is not necessarily empty,

P[ A ∪ B ] = P[ A ] + P[ B ] − P[ A ∩ B ] ,

which extends to n events as

P[ A1 ∪ ... ∪ An ] = ∑_i P[ Ai ] − ∑_{i<j} P[ Ai ∩ Aj ] + ... + (−1)^{n+1} P[ A1 ∩ A2 ∩ ... ∩ An ] .
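As a quick illustration (a supplementary check, using the two fair independent coin-tosses above):

P[ A = H or B = H ] = 1/2 + 1/2 − 1/4 = 3/4 ,

agreeing with direct enumeration of the favourable outcomes HH, HT, TH.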
(b) P[ first n tosses all give heads ] = P[ ⋂_{i=1}^{n} Ai ]
(c) P[ the first toss which gives a head is even-numbered ]
There is a difference! The first two can be dealt with within the algebra. The
third cannot: suppose Cn is the event "the first toss in numbers 1, ..., n which
gives a head is even-numbered, or else all n of these tosses give tails"; then Cn lies
in A(C), and converges down to the event C "the first toss which gives a head is
even-numbered", but C is not in A(C).
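(A supplementary computation, assuming fair independent tosses: once countable unions are permitted, C is the disjoint union over k = 1, 2, ... of the events "the first head occurs at toss 2k", so one would want

P[ C ] = ∑_{k=1}^{∞} (1/2)^{2k} = (1/4)/(1 − 1/4) = 1/3 ,

a value which the finite-union algebra A(C) alone cannot deliver.)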
We now find that a number of problems raise their heads.
Problems with "everywhere being impossible": Suppose we
are running an experiment with an outcome uniformly distributed
over [0, 1]. Then we have a problem as mentioned in the second of our
motivating examples: under reasonable conditions we are working
with the algebra of finite unions of sub-intervals of [0, 1], and the
probability measure which gives P[ [a, b] ] = b − a, but this means
P[ {a} ] = 0. Now we need to be careful, since if we rashly allow
ourselves to work with uncountable unions we get

[0, 1] = ⋃_{x∈[0,1]} {x} , and so 1 = P[ [0, 1] ] = ∑_{x∈[0,1]} P[ {x} ] = ∑_{x∈[0,1]} 0 = 0 .
∑_{m=1}^{n} 1/2^{m+1} = (1/2)(1 − 1/2^n) = 1/2 − 1/2^{n+1} → 1/2 ≠ 1 .
2.5 σ-algebras
The first task is to establish a wide range of sensible limit sets. Boldly, we look
at sets which can be obtained by any imaginable combination of countable set
operations: the collection of all such sets is a σ-algebra.**
Definition 2.4 (σ-algebra): A σ-algebra of subsets of Ω is an algebra which is
also closed under countable unions.
In fact σ-algebras are even larger than ordinary algebras; it is difficult to
describe a typical member of a σ-algebra, and it pays to talk about σ-algebras
generated by specified collections of sets.
Definition 2.5 (σ-algebra generated by a collection): For any collection of
subsets C of Ω, we define σ(C) to be the intersection of all σ-algebras of subsets of
Ω which contain C:

σ(C) = ⋂ { S : S is a σ-algebra and C ⊆ S } .
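For a concrete feel (an illustration, not part of the original list of examples): if C consists of all the singletons {x} of Ω, then

σ(C) = { A ⊆ Ω : A is countable or Ac is countable } ,

which for uncountable Ω is far smaller than the collection of all subsets of Ω.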
Theorem 2.6 (Monotone limits): Note that σ(C) defined above is indeed a
σ-algebra. Furthermore, it is the smallest σ-algebra containing C, and it is closed
under monotone limits.
P( ⋃_{i=1}^{∞} Ai ) ≤ ∑_{i=1}^{∞} P( Ai )

even when the union is not disjoint, etc.
(CA) is a kind of continuity condition. A similar continuity condition is that of
monotone limits.
Definition 2.9 (Monotone limits): A set-function μ : A → [0, 1] is said to obey
the monotone limits property (ML) if it satisfies:
μ(Ai) ↑ μ(A) whenever the Ai increase upwards to a limit set A which
lies in A.
(ML) is simpler to check than (CA) but is equivalent for finitely-additive
measures.
Theorem 2.10 (Equivalence for countable additivity):
(FA) + (ML) ⟺ (CA)
Lemma 2.11 (Another equivalence): Suppose P is a finitely additive probability measure on (Ω, F), where F is an algebra of sets. Then P is countably
additive if and only if

lim_{n→∞} P[ An ] = 1

whenever the An increase upwards to Ω.
n
to get at least an upper bound for what it would be sensible to call the measure
of A.
Of course we must give equal priority to considering what is the measure of
the complement Ac . Suppose for definiteness that A is contained in a simple set
Q of finite measure (a convenient interval for length, a square for area, a cube for
volume, ...) so that Ac = Q \ A. Then consideration of μ*(Ac) leads us directly to
consideration of inner-measure μ_* for A:

μ_*(A) = μ*(Q) − μ*(Ac) .

Clearly μ_*(A) ≤ μ*(A): moreover we can only expect a truly sensible definition
of measure on the set

F = { A : μ_*(A) = μ*(A) } .
The fundamental theorem of measure theory states that this works out all
right!
The Extension Theorem says: if a measure μ, defined on an algebra of subsets of Ω, makes the equalities

μ( ⋃_{i=1}^{∞} Ai ) = ∑_{i=1}^{∞} μ( Ai )

make sense (whenever the disjoint union ⋃_{i=1}^{∞} Ai actually belongs to the algebra),
then it can be extended uniquely to the (typically much larger) σ-algebra generated
by the original algebra, so as again to be a (σ-additive) measure.
There is an important special part of this theorem which is worth stating
separately.
Definition 2.13 (π-system): A π-system of subsets of Ω is a collection of subsets
including Ω itself and closed under finite intersections.
Theorem 2.14 (Uniqueness for probability measures): Two finite measures
which agree on a π-system Π also agree on the generated σ-algebra σ(Π).
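A typical application (sketched here as a supplementary example): the sets [0, x] for 0 ≤ x ≤ 1 form a π-system (note [0, x] ∩ [0, y] = [0, min{x, y}], and [0, 1] = Ω is included) which generates the Borel σ-algebra on [0, 1]; so two probability measures on [0, 1] with the same distribution function x ↦ P[ [0, x] ] must agree on every Borel set.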
the σ-algebra B = σ(A) (the Borel σ-algebra restricted to [0, 1]). We call this
Lebesgue measure.
There is a significant connection between infinite sequences of coin tosses and
numbers in [0, 1]. Briefly, we can expand a number x ∈ [0, 1] in binary (as opposed
to decimal!): we write x as

.α1 α2 α3 ...

where αi equals 1 or 0 according as 2^i x (reduced modulo 2) is greater than 1 or not. The coin-tossing
σ-algebra can be viewed as generated by the sequence

{α1, α2, α3, ...}

with 0 standing for tails, 1 for heads. In effect we get a map from coin-tossing
space 2^N to number space [0, 1] with the slight cautionary note that this map
very occasionally maps two sequences onto one number (think of .0111111... and
.100000...). In particular the event

[α1 = a1, α2 = a2, ..., αd = ad]

corresponds to the interval [x, x + 2^{−d}) , where x = .a1 a2 ... ad .
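For instance (a worked instance of the correspondence, added here): the event [α1 = 1, α2 = 0] corresponds to the interval [1/2, 1/2 + 1/4) = [1/2, 3/4), whose length 1/4 is exactly the probability of "head then tail" for two fair coin-tosses.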
and since there are only countably many rational q, and P[ Aq ] doesn't depend on
q, we determine

P[ [0, 1] ] = ∑_{q rational} P[ Aq ] = ∑_{q rational} P[ A ] .
But this cannot make sense if P [ [0, 1] ] = 1! We are forced to conclude that
A cannot be Lebesgue measurable.
This example has a lot to do with the Banach-Tarski paradox described in
one of our motivating examples above.
[Bi i.o.] = ⋂_{i=1}^{∞} ⋃_{j=i}^{∞} Bj .
Bi holds eventually ([Bi ev.]) if for all large enough i the statement
Bi is true: in set-theoretic terms

[Bi ev.] = ⋃_{i=1}^{∞} ⋂_{j=i}^{∞} Bj .
Notice these two concepts ev. and i.o. make sense even if the infinite sequence
is just a sequence, with no notion of events occurring consecutively in time!
Notice (you should check this yourself!)
[Bi i.o.] = [Bi^c ev.]^c .
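(One way to see this, sketched here in case you get stuck: it is a double application of De Morgan's laws,

[Bi^c ev.]^c = ( ⋃_{i=1}^{∞} ⋂_{j=i}^{∞} Bj^c )^c = ⋂_{i=1}^{∞} ⋃_{j=i}^{∞} Bj = [Bi i.o.] .)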
P( ⋂_{i=1}^{∞} Ai ) = ∏_{i=1}^{∞} P[ Ai ] .

We have to be careful what we mean by the infinite product ∏_{i=1}^{∞} P[ Ai ]: we
mean of course the limiting value lim_{n→∞} ∏_{i=1}^{n} P[ Ai ].
We can now prove a remarkable pair of facts about P [ Ai i.o. ] (and hence its
twin P [ Ai ev. ]!). It turns out it is often easy to tell whether these events have
probability 0 or 1.
(i) if ∑_{i=1}^{∞} P[ Ai ] < ∞ then P[ Ai i.o. ] = 0 ;
(ii) if the events Ai are independent and ∑_{i=1}^{∞} P[ Ai ] = ∞ then P[ Ai i.o. ] = 1 .
Note the two parts of the above result are not quite symmetrical: the second
part also requires independence. It is a good exercise to work out a counterexample
to part (ii) if independence fails.
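One minimal counterexample, offered here as a hint: take every Ai equal to one fixed event A with 0 < P[ A ] < 1. Then ∑_i P[ Ai ] = ∞, yet [Ai i.o.] = A and so P[ Ai i.o. ] = P[ A ] < 1.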
Theorem 3.19 (Law of large numbers for events): Suppose that we have a
sequence of independent events Ai each with the same probability p. Let Sn count
the number of events A1, ..., An which occur. Then for each ε > 0

P[ |Sn/n − p| ≤ ε ev. ] = 1 .
P[ X ∈ B ]

whenever B ∈ B.
4. Integration
One of the main things to do with functions is to integrate them (find the area
under the curve). One of the main things to do with random variables is to take
their expectations (find their average values). It turns out that these are really the
same idea! We start with integration.
∫ h dμ = ∫ h(x) μ( dx) = ∑_{i=1}^{n} ci μ(Ai)

where

h(x) = ∑_{i=1}^{n} ci I_{Ai}(x)

as above.
Note that one really should prove that the definition of ∫ h dμ does not depend
on exactly how one represents h as the sum of indicator functions.
Integration for such functions has a number of basic properties which one uses
all the time, almost unconsciously, when trying to find integrals.
Theorem 4.34 (Properties of integration for simple functions):
(1) if μ(f ≠ g) = 0 then ∫ f dμ = ∫ g dμ;
(2) Linearity: ∫ (af + bg) dμ = a ∫ f dμ + b ∫ g dμ;
(3) Monotonicity: f ≤ g means ∫ f dμ ≤ ∫ g dμ;
(4) min{f, g} and max{f, g} are simple.
Simple functions are rather boring. For more general functions we use limiting
arguments. We have to be a little careful here, since some functions will have
integrals built up from +∞ where they are integrated over one part of the region,
and −∞ over another part. Think for example of

∫_{−1}^{1} (1/x) dx = ∫_{−1}^{0} (1/x) dx + ∫_{0}^{1} (1/x) dx ,

which equals (−∞) + (+∞) = ?
For non-negative measurable f we define

∫ f dμ = sup { ∫ g dμ : g simple, 0 ≤ g ≤ f } ;

and for general measurable f, written as a difference f = g − h of two non-negative
measurable functions, we define

∫ f dμ = ∫ g dμ − ∫ h dμ .
One really needs to prove that the integral ∫ f dμ does not depend on the
choice f = g − h. In fact if there is any choice which works then the easy choice

g = max{f, 0} ,
h = max{−f, 0}
One can show that the integral on integrable functions agrees with its definition on simple functions and is linear. What starts to make the theory very easy
is that the integral thus defined behaves very well when studying limits.
Theorem 4.37 (Monotone convergence theorem (MON)): If fn ↑ f (all
being non-negative measurable functions) then

∫ fn dμ ↑ ∫ f dμ .
Corollary 4.38 (Integrability and simple functions): if f is non-negative
and measurable then for any sequence of non-negative simple functions fn such
that fn ↑ f we have

∫ fn dμ ↑ ∫ f dμ .
E[ g(X) ] = ∫ g(x) PX( dx) .
4.4 Examples
You need to work through examples such as the following to get a good idea of
how the above really works out in practice. See the material covered in lectures
for more on this.
Evaluate ∫_0^1 x Leb( dx).

Consider a probability measure P on {1, 2, 3, ...} with P[ {i} ] = pi: then ∫ f dP = ∑_{i=1}^{∞} f(i) pi.

Evaluate ∫_0^y e^{−x} Leb( dx).

Evaluate ∫_0^n f(x) Leb( dx) where

f(x) = c1 if 0 ≤ x < 1,
       c2 if 1 ≤ x < 2,
       ...
       cn if n−1 ≤ x < n.
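As a worked version of the first of these (a supplementary calculation): approximating x from below by the dyadic simple functions fn(x) = ∑_{k=0}^{2^n − 1} k2^{−n} I[ k2^{−n} ≤ x < (k+1)2^{−n} ] gives

∫ fn dLeb = ∑_{k=0}^{2^n − 1} k2^{−n} · 2^{−n} = (2^n − 1)/2^{n+1} ↑ 1/2 ,

so ∫_0^1 x Leb( dx) = 1/2 by (MON).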
5. Convergence
Approximation is a fundamental key to making mathematics work in practice.
Instead of being stuck, unable to do a hard problem, we find an easier problem
which has almost the same answer, and do that instead! The notion of convergence
(see first-year analysis) is the formal structure giving us the tools to do this. For
random variables there are a number of different notions of convergence, depending
on whether we need to approximate a whole sequence of actual random values, or
just a particular random value, or even just probabilities.
We say Xn → Y in prob, if we have, for each ε > 0,

P[ |Xn − Y| > ε ] → 0 ;

and Xn → Y a.s., if we have

P[ Xn → Y ] = 1 .

(For a constant limit a these read P[ Xn → a ] = 1 for almost sure convergence, and
P[ |Xn − a| > ε ] → 0 for convergence in probability.)
As one might expect, the notion of almost sure convergence implies that of
convergence in probability.
21
Theorem 5.44 (Almost sure convergence implies convergence in probability): Xn → X a.s. implies Xn → X in prob.
Almost sure convergence allows for various theorems telling us when it is
OK to exchange integrals and limits. Generally this doesn't work: consider the
example

1 = ∫_0^∞ exp(n − t) I[ t ≥ n ] dt for every n, yet ∫_0^∞ lim_n exp(n − t) I[ t ≥ n ] dt = ∫_0^∞ 0 dt = 0 .

However we have already seen one case where it does work: when the limit is
monotonic. In fact we only need this to hold almost everywhere (i.e. when the
convergence is almost sure).
Theorem 5.49 (Weak law of large numbers): if a sequence of random variables Xi is independent, and if the random variables all have the same finite mean
and variance, E[ Xi ] = μ and Var(Xi) = σ² < ∞, then

Sn/n → μ in prob.

where Sn = X1 + ... + Xn is the partial sum of the sequence.
As you will see, the proof is really rather easy when we use Chebyshev's
inequality above. Indeed it is also quite easy to generalize to the case when the
random variables are correlated, as long as the covariances are small ...
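The Chebyshev step, sketched here as supplementary reasoning: by independence Var(Sn/n) = σ²/n, so for each ε > 0

P[ |Sn/n − μ| > ε ] ≤ σ²/(n ε²) → 0 ,

which is exactly convergence in probability.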
However the corresponding result for almost sure convergence, rather than
convergence in probability, is rather harder to prove.
Theorem 5.50 (Strong law of large numbers): if a sequence of random variables Xi is independent and identically distributed, and if E[ Xi ] = μ (finite) then

Sn/n → μ a.s.

where Sn = X1 + ... + Xn is the partial sum of the sequence.
Corollary 5.53 (Dominated convergence theorem (DOM)): If the functions fn : R → R are bounded above in absolute value by g a.e. (so |fn| < g a.e.)
and g is integrable and also fn → f then

lim ∫ fn dμ = ∫ f dμ .
This is a very powerful result ...
5.5 Examples
If the Xn form a bounded sequence of random variables and they converge almost surely to X then

E[ Xn ] → E[ X ] .
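(This is immediate from (DOM): take the dominating function g to be the constant bound M for which |Xn| ≤ M.)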
Suppose that U is a random variable uniformly distributed over [0, 1]
and

Xn = ∑_{k=0}^{2^n − 1} k2^{−n} I[ k2^{−n} ≤ U < (k+1)2^{−n} ] .

Then E[ log(1 − Xn) ] → −1.
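A sketch of why (supplementary reasoning): Xn ↑ U, so −log(1 − Xn) ↑ −log(1 − U); by (MON),

E[ −log(1 − Xn) ] ↑ E[ −log(1 − U) ] = ∫_0^1 −log(1 − u) du = 1 ,

giving E[ log(1 − Xn) ] → −1.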
Suppose that the Xn are independent and X1 = 1 while for n ≥ 2

P[ Xn = n + 1 ] = 1/n³ ,
P[ Xn = 1/(n + 1) ] = 1/n³ ,
P[ Xn = 1 ] = 1 − 2/n³ ,

and Zn = ∏_{i=1}^{n} Xi. Then the Zn form an almost surely convergent
sequence with limit Z∞, and

E[ Zn ] → E[ Z∞ ] .
6. Product measures
6.1 Product measure spaces
The idea here is, given two measure spaces (Ω, F, μ) and (Ω′, F′, μ′), we build a
measure space Ω × Ω′ by using rectangle sets A × B with measures μ(A) μ′(B).
As you might guess from the product form μ(A) × μ′(B), in the context of
probability this is related to independence.
Lemma 6.55 (Representation of A(R)): every member of A(R) can be expressed as a finite disjoint union of rectangle sets.
It is now possible to apply the Extension Theorem (we need to check countable
additivity; this is non-trivial but works) to define the product measure μ × μ′
on the whole σ-algebra σ(R).
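For example (an illustration added here): taking both factors to be Lebesgue measure on [0, 1] produces two-dimensional Lebesgue measure (area) on the unit square; and if X and Y are independent random variables, the law of the pair (X, Y) is exactly the product of the law of X with the law of Y.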
P[ (X, Y) ∈ A ] ,