Notes On Set Theory and Probability Theory
Michelle Alexopoulos
August 2003
0.1. Set Theory and Probability Theory
Before we talk about probability, it is useful to review some basic definitions and results from set theory.

Definition 2. A set A is a subset of a set B if every element of A also belongs to B, i.e., $A \subseteq B$.

Definition 3. Two sets, A and B, are equal, denoted A = B, if and only if every element in A belongs to the set B and every element in B belongs to set A, i.e., $A \subseteq B$ and $B \subseteq A$.

Definition 4. A set A is a proper subset of B, denoted $A \subset B$, if $A \subseteq B$ and B does not equal A.

Definition 5. The empty set, or null set, denoted $\emptyset$, is a set which contains no elements.

Definition 6. The complement of a set A (relative to S), denoted $A^c$, is the set of elements of S that do not belong to A:

$$A^c = \{\omega : \omega \in S \text{ and } \omega \notin A\}$$
Definition 7. The union of sets A and B, denoted $A \cup B$, is the set containing all elements that belong to A or to B (or to both):

$$A \cup B = \{\omega : \omega \in A \text{ or } \omega \in B\}.$$

Definition 8. The intersection of sets A and B, denoted $A \cap B$, is the set containing all elements that belong to both A and B:

$$A \cap B = \{\omega : \omega \in A \text{ and } \omega \in B\}.$$
Definition 9. Two sets, A and B, are called disjoint or mutually exclusive if they have no elements in common, i.e., $A \cap B = \emptyset$.

Definition 10. The set of all possible outcomes of a random experiment is called the sample space, denoted S.
Theorem 15. $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$ (First distributive law)
Basic probability theory is defined using a triple (S, F, P) where S is the sample space, F is the collection of events, and P is a function that maps F into the interval [0,1]. P is the probability measure, and intuitively F is the set of all events that can be verified to have occurred or not occurred. I will discuss these objects in more detail below; for the most part, I will follow standard notation.
In elementary probability theory, we usually associate these elements with the outcomes of an experiment. For example, for a single toss of a coin, the sample space is

$$S = \{H, T\},$$

where H stands for head and T for tail. For an experiment of three tosses of a coin (or, more generally, n tosses), the sample space is

$$S = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i = H \text{ or } T\}.$$
An alternate example is one where an individual rolls a die once. The sample space is

$$S = \{1, 2, 3, 4, 5, 6\}$$

and if we rolled the die n times, the sample space of this experiment would be:

$$S = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i \in \{1, 2, \ldots, 6\}\}.$$
In modern probability theory, the sample space can be fairly general and abstract. For example, it can be the collection of all real numbers, R, or the collection of all n-tuples of real numbers. If S is discrete, then all subsets correspond to events, but if S is continuous, only measurable subsets correspond to events.
To each event A in the class of events, we associate a real number, P(A), i.e., P(A) is the probability of the event A, if the following axioms are satisfied:

Axiom 1: $P(A) \ge 0$ for every event A.

Axiom 2: $P(S) = 1$.

Axiom 3: For any sequence of mutually exclusive events $A_1, A_2, \ldots$,

$$P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$$
Theorem 22. If $A_1 \subseteq A_2$, then $P(A_1) \le P(A_2)$ and $P(A_2 \setminus A_1) = P(A_2) - P(A_1)$.

Theorem 27. If A and B are any two events, then $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Conditional Probability: Let A and B be two events such that P (A) > 0.
Let P (B|A) denote the probability of B given that A has occurred. Since A has
already occurred, it becomes the new sample space. From this, we are led to the
definition of P (B|A):
$$P(B|A) = \frac{P(A \cap B)}{P(A)} \quad \text{or} \quad P(A \cap B) = P(A)P(B|A).$$
Theorem 29. Bayes' Rule: Suppose that $A_1, A_2, \ldots, A_n$ are mutually exclusive events whose union is the sample space S. Then, for any event A with $P(A) > 0$,

$$P(A_k|A) = \frac{P(A_k)P(A|A_k)}{\sum_{i=1}^{n} P(A_i)P(A|A_i)}.$$
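As a quick numerical illustration (an added sketch, not part of the original notes; the priors and conditional probabilities below are made-up numbers), Bayes' rule is a one-liner once the pieces are tabulated:

```python
# Bayes' rule: P(A_k | A) = P(A_k) P(A|A_k) / sum_i P(A_i) P(A|A_i)
# Hypothetical numbers for three mutually exclusive events A_1, A_2, A_3
prior = [0.5, 0.3, 0.2]          # P(A_i); must sum to 1
likelihood = [0.1, 0.4, 0.8]     # P(A | A_i)

evidence = sum(p * l for p, l in zip(prior, likelihood))   # the denominator
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]

print(posterior)        # P(A_1|A), P(A_2|A), P(A_3|A) ~ [0.152, 0.364, 0.485]
print(sum(posterior))   # sanity check: posterior probabilities sum to 1
```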
Definition 30. Two events A and B are said to be independent if and only if
$$P(A \cap B) = P(A)P(B).$$
The number of permutations of n objects taken r at a time is

$$_nP_r = \frac{n!}{(n-r)!}$$

and the number of combinations of n objects taken r at a time is

$$_nC_r = \binom{n}{r} = \frac{n!}{r!(n-r)!}.$$
$_nP_r$ is used when order matters. For example, if we want to find out how many different permutations consisting of three letters each can be formed from the 4 letters A, B, C, and D, the answer is given by

$$_4P_3 = \frac{4!}{(4-3)!} = 4 \cdot 3 \cdot 2 \cdot 1 = 24.$$
In this case order matters, i.e., ABC is a different permutation than ACB or BCA. If we only want to know how many ways three letters can be chosen from the 4 letters (without regard to order), the answer is given by

$$_4C_3 = \frac{4!}{3!(4-3)!} = 4.$$

In this case order does not matter, so we can only have 4 possibilities: (1) A, B, and C; (2) A, C, and D; (3) A, B, and D; and (4) B, C, and D.
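These counts are easy to verify by brute force. A minimal added sketch in Python (math.perm and math.comb require Python 3.8+):

```python
import math
from itertools import permutations, combinations

letters = ["A", "B", "C", "D"]

# Order matters: 4P3 = 4!/(4-3)! = 24
print(math.perm(4, 3))                          # 24
print(len(list(permutations(letters, 3))))      # 24, by enumeration

# Order does not matter: 4C3 = 4!/(3!(4-3)!) = 4
print(math.comb(4, 3))                          # 4
print(len(list(combinations(letters, 3))))      # 4: ABC, ABD, ACD, BCD
```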
Other commonly used definitions: If you are reading more advanced books on probability theory, you will often find terms like sigma-field and sigma-algebra and references to measurable functions and measure spaces. I will now briefly turn to these.
Consider a partition of S, i.e., a collection $\{A_1, A_2, \ldots, A_N\}$ of disjoint subsets of S whose union is S. A partition can be used to represent the information an agent may have. Suppose that the numbers in S represent all possible returns in the stock market in a day, and consider the partitions

$$\mathcal{A}_0 = \{S\}, \quad \mathcal{A}_1 = \{\{\omega : \omega < 0\}, \{0\}, \{\omega : \omega > 0\}\}, \quad \mathcal{A}_2 = \{\{\omega\} : \omega \in S\}.$$

Then, after the return is realized, an agent with information partition $\mathcal{A}_0$ has effectively no information about the return, while an agent with information $\mathcal{A}_1$ can tell whether the return is positive, zero, or negative, and an agent with information $\mathcal{A}_2$ knows exactly what the return is. So, these three partitions represent increasing amounts of information. Given a partition, the agent should be able to assign probabilities to each of the events in the partition, $P_1, \ldots, P_N$. Based on these probabilities, the agent should also be able to decide the probabilities of events such as $A_1 \cup A_2$, or $A_1^c$.
This motivates the following definition of measurable sets, which can be thought of as events. Above, I mentioned that F includes all outcomes on the sample space that can be verified to have occurred or not occurred. Basically this means that if set A is an event, then its complement, $A^c$ (i.e., not A), must also be an event. Similarly, if A and B are events, we should be able to verify (a) if both A and B happened and (b) if either A or B (or both) occurred. Thus, the collection of events F should be closed under complementation, intersection, and union. For our purposes, we will refer to F as a $\sigma$-algebra ($\sigma$-field). The next definition states these ideas more formally.

Definition. A collection F of subsets of S is called a $\sigma$-algebra ($\sigma$-field) if: (i) $S \in F$; (ii) $A \in F$ implies $A^c \in F$; and (iii) $A_i \in F$ for $i = 1, 2, \ldots$ implies $\bigcup_{i=1}^{\infty} A_i \in F$.
Note that if $A_i \in F$ for $i = 1, 2$, then $A_i^c \in F$, which implies that $A_1^c \cup A_2^c \in F$. Moreover,

$$(A_1^c \cup A_2^c)^c = \{\omega \in S : \omega \notin A_1^c \text{ and } \omega \notin A_2^c\} = \{\omega \in S : \omega \in A_1 \text{ and } \omega \in A_2\} = A_1 \cap A_2.$$

So, $A_1 \cap A_2 \in F$.
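This De Morgan step can be checked mechanically on a small finite sample space. A throwaway added sketch (S, A1, A2 are arbitrary made-up sets):

```python
# Finite check of the De Morgan step: (A1^c U A2^c)^c = A1 n A2
S = frozenset(range(1, 11))          # a small sample space {1, ..., 10}
A1 = frozenset({1, 2, 3, 4, 5})
A2 = frozenset({4, 5, 6, 7})

def comp(A):
    return S - A                     # complement relative to S

left = comp(comp(A1) | comp(A2))     # (A1^c union A2^c)^c
right = A1 & A2                      # A1 intersect A2
print(left == right, left)           # True frozenset({4, 5})
```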
Definition 34. A pair (S, F) is called a measurable space, and any set in F is called a measurable set (or an event).

Examples:

(i) $F = \{\emptyset, S\}$, the trivial $\sigma$-field.

Definition 35. $\sigma(C)$: the smallest $\sigma$-field that contains the collection of subsets C.

Examples:

(iii) B, the $\sigma$-field generated by all the open intervals in R. We call the sets in B the Borel sets.
Definition 36. Given a measurable space (S, F), a measure v is a set function on F such that:

(i) $v(A) \ge 0$ for all $A \in F$;

(ii) $v(\emptyset) = 0$;

(iii) for any sequence of disjoint sets $A_1, A_2, \ldots \in F$,

$$v\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} v(A_i).$$

Examples: the counting measure on S, and the Lebesgue measure on (R, B), for which $v((a, b)) = b - a$.
Proposition 37. For a measure space (S, F, v), we have:

(i) if $A \subseteq B$, then $v(A) \le v(B)$;

(ii) $v\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} v(A_i)$;

(iii) if $A_1 \subseteq A_2 \subseteq \cdots$, then

$$v\left(\lim_{n} A_n\right) = v\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{n} v(A_n)$$

(or, if $A_1 \supseteq A_2 \supseteq \cdots$, then

$$v\left(\lim_{n} A_n\right) = v\left(\bigcap_{i=1}^{\infty} A_i\right) = \lim_{n} v(A_n)$$

if $v(A_1) < \infty$).
Proof. (i) Write $B = A \cup C$ where $C = B \setminus A$, so that A and C are disjoint. Then $v(B) = v(A) + v(C) \ge v(A)$ because $v(C) \ge 0$.

(ii) Define $C_1 = A_1$ and $C_i = A_i \setminus \left(\bigcup_{j=1}^{i-1} A_j\right)$ for $i \ge 2$. Then $\{C_i\}$ is a sequence of disjoint sets such that $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} C_i$ and, by (i), $v(C_i) \le v(A_i)$. Thus, we have

$$v\left(\bigcup_{i=1}^{\infty} A_i\right) = v\left(\bigcup_{i=1}^{\infty} C_i\right) = \sum_{i=1}^{\infty} v(C_i) \le \sum_{i=1}^{\infty} v(A_i).$$
(iii) Define $D_1 = A_1$ and $D_n = A_n \setminus A_{n-1}$ for $n \ge 2$. Then the $D_n$ are disjoint and such that $\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} D_n$. By the definition of measure, we have

$$v\left(\bigcup_{n=1}^{\infty} A_n\right) = v\left(\bigcup_{n=1}^{\infty} D_n\right) = \sum_{n=1}^{\infty} v(D_n) = \lim_{n} \sum_{i=1}^{n} v(D_i) = \lim_{n} \sum_{i=1}^{n} \left[v(A_i) - v(A_{i-1})\right] = \lim_{n} v(A_n),$$

where $v(A_0) \equiv 0$. For the decreasing case, define $B_n = A_1 \setminus A_n$. Then the $B_n$ are increasing, and by the result just proved we have

$$v\left(\bigcup_{n=1}^{\infty} B_n\right) = \lim_{n} v(B_n) = v(A_1) - \lim_{n} v(A_n).$$

However, $\bigcap_{n=1}^{\infty} A_n = A_1 \setminus \left(\bigcup_{n=1}^{\infty} B_n\right)$, so

$$v\left(\bigcap_{n=1}^{\infty} A_n\right) = v(A_1) - v\left(\bigcup_{n=1}^{\infty} B_n\right) = \lim_{n} v(A_n).$$
Q.E.D.
For measurable functions we have, among other properties: (i) If f and g are measurable, then so are $fg$ and $af + bg$, where a and b are real numbers.
Proposition 39. Let f and g be measurable functions on a measure space (S, F, v).
Then:

(i) $\int (af + bg)\,dv = a \int f\,dv + b \int g\,dv$.

(ii) If $f = g$ a.e., then $\int f\,dv = \int g\,dv$.

(iii) If $f \le g$ a.e., then $\int f\,dv \le \int g\,dv$.

(iv) If $f \ge 0$ a.e. and $\int f\,dv = 0$, then $f = 0$ a.e.

(v) If $f \ge 0$ a.e. and $\int f\,dv = 1$, then the set function

$$P(B) = \int_B f\,dv$$

is a probability measure.

(vi) If $f_n \to f$ a.e., $|f_n| \le g$, and $\int g\,dv < \infty$, then

$$\lim_{n} \int f_n\,dv = \int f\,dv.$$

(vii) If $|\partial f(\omega, \theta)/\partial \theta| \le g(\omega)$ a.e., and $\int g\,dv < \infty$, then

$$\frac{d}{d\theta} \int f(\omega, \theta)\,dv = \int \frac{\partial f(\omega, \theta)}{\partial \theta}\,dv.$$
0.3. Random Variables and Probability Distributions

Definition 40. Consider a random experiment with sample space S. A random variable $X(\omega)$ is a single-valued real function that assigns a real number to each sample point $\omega$ of S. Often we use a single letter X for this function in place of $X(\omega)$.
Probability Distribution:
Definition 41. A listing of the values x taken by a random variable X and their associated probabilities is called the probability distribution of X.

The cumulative distribution function (cdf) of X is defined by $F_X(x) = P(X \le x)$.

Properties of $F_X(x)$:

1. $0 \le F_X(x) \le 1$

2. $F_X(x_1) \le F_X(x_2)$ if $x_1 \le x_2$

3. $\lim_{x \to \infty} F_X(x) = F_X(\infty) = 1$

4. $\lim_{x \to -\infty} F_X(x) = F_X(-\infty) = 0$
(A random variable may be discrete or continuous.) X is called a discrete random variable only if its range contains a finite or countably infinite number of points. Alternatively, if $F_X(x)$ changes values only in jumps (at most a countable number of them) and is constant between jumps, then X is called a discrete random variable.

Suppose the jumps in $F_X(x)$ occur at the points $x_1, x_2, \ldots$, where the sequence may be either finite or countably infinite. The probability mass function (pmf) of X is then defined by $p_X(x_k) = P(X = x_k)$.

Properties of $p_X(x)$:

1. $0 \le p_X(x_k) \le 1$

2. $p_X(x) = 0$ if $x \ne x_k$, $k = 1, 2, \ldots$

3. $\sum_k p_X(x_k) = 1$

The cdf of a discrete random variable X is given by:

$$F_X(x) = P(X \le x) = \sum_{x_k \le x} p_X(x_k)$$
X is called a continuous random variable only if its range contains an interval (either finite or infinite) of real numbers. Equivalently, X is continuous only if its cdf $F_X(x)$ is continuous, with a derivative $dF_X(x)/dx$ which exists everywhere except at possibly a finite number of points.

For the case of a continuous random variable, the probability associated with any particular point is zero (i.e., P(X = x) = 0). However, we can assign a positive probability to any interval in the range of X.

Definition 46. Let $f(x) = \frac{dF_X(x)}{dx}$. The function f(x) is called the probability density function (pdf) of X.

Properties of f(x):

1. $f(x) \ge 0$

2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$

The cdf is recovered from the pdf by:

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
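As an added illustration (not in the original notes), one can check these properties numerically for a specific density, say the exponential pdf $f(x) = \lambda e^{-\lambda x}$ on $[0, \infty)$, whose cdf is $F_X(x) = 1 - e^{-\lambda x}$:

```python
import math

lam = 2.0                                  # rate parameter (arbitrary choice)
f = lambda x: lam * math.exp(-lam * x)     # pdf of an exponential(lam)

def integrate(g, a, b, n=100_000):
    """Simple midpoint-rule integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

print(integrate(f, 0.0, 50.0))             # ~1.0: the pdf integrates to one
x = 1.0
print(integrate(f, 0.0, x))                # ~0.8647: numerically integrated cdf
print(1.0 - math.exp(-lam * x))            # exact F_X(1) = 1 - e^{-2}
```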
0.4. Expectation of a Random Variable

Definition 47. The expected value (mean) of a random variable X is defined as:

$$E[X] = \begin{cases} \sum_x x f(x) & \text{if } X \text{ is discrete} \\ \int x f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$
Proposition 48. Let g(x) be a function of x. The expectation of g(X) is given by:

$$E[g(X)] = \begin{cases} \sum_x g(x) f(x) & \text{if } X \text{ is discrete} \\ \int g(x) f(x)\,dx & \text{if } X \text{ is continuous.} \end{cases}$$
Similarly, the variance of a random variable is:

$$Var[X] = E[(X - \mu)^2] = \begin{cases} \sum_x (x - \mu)^2 f(x) & \text{if } X \text{ is discrete} \\ \int (x - \mu)^2 f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$

where $\mu = E(X)$.

The variance is usually denoted by $\sigma^2$, and a useful identity is

$$Var(X) = \sigma^2 = E(X^2) - \mu^2.$$
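As a concrete check (an added sketch), take a fair six-sided die: both variance formulas give the same answer:

```python
# Fair six-sided die: f(x) = 1/6 for x in {1, ..., 6}
xs = range(1, 7)
f = 1.0 / 6.0

mu = sum(x * f for x in xs)                      # E[X] = 3.5
var_def = sum((x - mu) ** 2 * f for x in xs)     # E[(X - mu)^2]
var_alt = sum(x * x * f for x in xs) - mu ** 2   # E[X^2] - mu^2

print(mu, var_def, var_alt)                      # 3.5 2.9166... 2.9166...
```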
2. $E(aX) = aE(X)$

3. $E(X + Y) = E(X) + E(Y)$

5. $Var(aX) = a^2 Var(X)$
The Normal distribution: In econometrics you will often use the Normal distribution. The general form of a normal distribution with mean $\mu$ and variance $\sigma^2$ is

$$f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2}\right\}.$$

We write $x \sim N[\mu, \sigma^2]$, which reads "x is normally distributed with mean $\mu$ and standard deviation $\sigma$."

If $a = -\mu/\sigma$ and $b = 1/\sigma$, then letting $z = a + bx$ we find that $z \sim N[0, 1]$. N[0, 1] is called the standard normal distribution and has the density function

$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\}.$$

The notation $\phi(z)$ is often used to denote the standard normal density, and $\Phi(z)$ its cdf.
3. If $z_1, z_2, \ldots, z_n$ are independent random variables and $z_i \sim N[0, 1]$ for all i, then

$$\sum_{i=1}^{n} z_i^2 \sim \chi^2[n].$$
You will also find that the t-distribution will converge in the limit to the
normal distribution.
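A quick simulation sketch (added here, with numpy and arbitrary made-up parameters) illustrates both the standardization $z = (x - \mu)/\sigma$ and the $\chi^2[n]$ result, using the fact that a $\chi^2[n]$ variable has mean n and variance 2n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.5, 2.0, 5, 200_000

x = rng.normal(mu, sigma, size=(reps, n))   # x ~ N[mu, sigma^2]
z = (x - mu) / sigma                        # standardized: z ~ N[0, 1]
print(z.mean(), z.std())                    # ~0 and ~1

q = (z ** 2).sum(axis=1)                    # sum of n squared N[0,1] draws
print(q.mean(), q.var())                    # ~n = 5 and ~2n = 10, as for chi^2[n]
```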
0.5. Multivariate Random Variables

Above we discussed the case where we have one random variable (i.e., the univariate case). However, in many instances we will want to consider the case where we have multiple random variables. The good news is that the concepts described above can be fairly easily extended to the case of n random variables (the multivariate case).

Definition 50. Given an experiment, the n-tuple of random variables $(X_1, X_2, \ldots, X_n)$ is called an n-variate random variable (or an n-dimensional random vector).
Let X denote a random vector in $R^n$. The vector $X = (X_1, X_2, \ldots, X_n)$ takes values $x = (x_1, x_2, \ldots, x_n)$ in $R^n$, and its joint cdf is $F_X(x_1, \ldots, x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n)$. The marginal joint cdfs are obtained from this one by setting the appropriate arguments equal to $+\infty$.

For the discrete n-variate random variable, the joint pmf is defined by $p_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$, with the following properties:

1. $0 \le p_{X_1 X_2 \ldots X_n}(x_1, \ldots, x_n) \le 1$

2. $\sum_{x_1} \cdots \sum_{x_n} p_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n) = 1$
3. The marginal pmf of one random variable (or set of random variables) is found by summing $p_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n)$ over the ranges of the other variables, e.g.,

$$p_{X_1 X_2 \ldots X_{n-k}}(x_1, x_2, \ldots, x_{n-k}) = \sum_{x_{n-k+1}} \sum_{x_{n-k+2}} \cdots \sum_{x_n} p_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n)$$
4. Conditional pmfs are then defined in a straightforward manner. For example, $p_{X_1|X_2}(x_1|x_2) = p_{X_1 X_2}(x_1, x_2)/p_{X_2}(x_2)$.

For the continuous case, if the joint cdf can be written as

$$F_X(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(z_1, \ldots, z_n)\,dz_1 \ldots dz_n$$

for some function f, then we can generally find the joint pdf for a continuous random vector as $f(x_1, \ldots, x_n) = \partial^n F_X(x_1, \ldots, x_n)/\partial x_1 \cdots \partial x_n$.
If we know the joint distribution function of a random vector X, then we also know
the joint distribution of any subvector of X. For example, the joint distribution of the first k components, $X^{(k)} = (X_1, \ldots, X_k)$, is

$$F_{X^{(k)}}(x_1, \ldots, x_k) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_k} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(z_1, \ldots, z_n)\,dz_{k+1} \ldots dz_n\,dz_1 \ldots dz_k.$$
The joint pdf has the following properties:

1. $f_{X_1 X_2 \ldots X_n}(x_1, \ldots, x_n) \ge 0$

2. $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n)\,dx_1 \ldots dx_n = 1$

3. The marginal pdf of one random variable (or set of random variables) is found by integrating $f_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n)$ over the ranges of the other variables, e.g.,

$$f_{X_1 X_2 \ldots X_{n-k}}(x_1, x_2, \ldots, x_{n-k}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n)\,dx_{n-k+1}\,dx_{n-k+2} \ldots dx_n$$
Proposition 51. If the joint distribution function of X and Y has a p.d.f. f(x, y), then the conditional distribution of X given Y has a p.d.f. f(x|y) given by the following:

$$f(x|y) = \frac{f(x, y)}{f_Y(y)}$$

where $f_Y(y) \equiv \int_{-\infty}^{\infty} f(x, y)\,dx$ is the marginal density of the random variable Y.
0.5.1. Expectations:
The expectation of a function g of the random vector $X = (X_1, \ldots, X_n)$ is:

$$E[g(X)] = \begin{cases} \int \cdots \int g(x_1, \ldots, x_n) f(x_1, \ldots, x_n)\,dx_1 \ldots dx_n & \text{if the variables are continuous} \\ \sum \cdots \sum g(x_1, \ldots, x_n) f(x_1, \ldots, x_n) & \text{if the variables are discrete} \end{cases}$$
Let $E_X(g(X))$ and $E_{X,Y}(g(X))$ denote the expectation of the function g(X) with respect to the marginal and the joint distributions, respectively. It is easy to show that $E_{X,Y}(g(X)) = E_X(g(X))$, since

$$E_{X,Y}(g(X)) = \sum_{i,j} g(x_i) f(x_i, y_j) = \sum_i g(x_i) \left[\sum_j f(x_i, y_j)\right] = \sum_i g(x_i) f(x_i) = E_X(g(X)).$$
The conditional expectation of Y given X = x is:

$$E(Y|X = x) = \begin{cases} \sum_i y_i f(y_i|x) & \text{in the discrete case} \\ \int y f(y|x)\,dy & \text{in the continuous case} \end{cases}$$

Similarly,

$$E(g(X, Y)|X = x) = \begin{cases} \sum_i g(x, y_i) f(y_i|X = x) & \text{in the discrete case} \\ \int g(x, y) f(y|X = x)\,dy & \text{in the continuous case} \end{cases}$$

Theorem 56. Law of iterated expectations: $E[y] = E_x[E[y|x]]$, where the notation $E_x[\cdot]$ indicates the expectation over the values of x.
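Theorem 56 can be verified on a small discrete example (the joint pmf below is made up for illustration):

```python
# Joint pmf of (X, Y) on a 2x2 support; arbitrary made-up probabilities
pmf = {(0, 10): 0.2, (0, 20): 0.3, (1, 10): 0.4, (1, 20): 0.1}

ey_direct = sum(p * y for (x, y), p in pmf.items())   # E[Y] directly

# E_x[E[Y|X=x]]: inner conditional expectation, then average over X
px = {}
for (x, y), p in pmf.items():
    px[x] = px.get(x, 0.0) + p                        # marginal pmf of X
e_y_given_x = {x: sum(p * y for (xx, y), p in pmf.items() if xx == x) / px[x]
               for x in px}
ey_iterated = sum(px[x] * e_y_given_x[x] for x in px)

print(ey_direct, ey_iterated)                         # both 14.0
```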
Properties of Conditional Expectation: The conditional expectation can be defined given any $\sigma$-field $\mathcal{A}$ such that $\mathcal{A} \subseteq F$, or given any random variable Y. Among its properties are the following:
(iii) If a and b are real numbers, then $E[aX + bY | F_1] = aE[X|F_1] + bE[Y|F_1]$ a.s.

(vii) If $E[|g(X, Y)|] < \infty$, then $E[g(X, Y)|Y = y] = E[g(X, y)|Y = y]$ a.s.
The expectation of a random vector is defined as the vector consisting of the expectations of its elements, and for any conformable vector of constants c, $E[c'X] = c'E[X]$. To see that the variance matrix Var(X) is positive semidefinite, note that

$$Var(c'X) \equiv E[(c'X - E[c'X])(c'X - E[c'X])'] = c'Var(X)c.$$

So, $c'Var(X)c = Var(c'X) \ge 0$ for every c. Q.E.D.
Transformations: Let X be a continuous random variable with pdf $f_X(x)$. If the transformation y = g(x) is one-to-one and has the inverse transformation x = q(y), then the pdf of Y = g(X) is given by $f_Y(y) = f_X(q(y))\,|dq(y)/dy|$.

Now let Z = g(X, Y) and W = h(X, Y), where X and Y are random variables and $f_{X,Y}(x, y)$ is the joint pdf of X and Y. If the transformation z = g(x, y), w = h(x, y) is one-to-one and has the inverse transformation x = q(z, w), y = r(z, w), then the joint pdf of Z and W is given by

$$f_{Z,W}(z, w) = f_{X,Y}(x, y)\,|J(x, y)|^{-1}, \quad \text{where } x = q(z, w),\ y = r(z, w)$$

and J(x, y) is the Jacobian determinant

$$J(x, y) = \begin{vmatrix} \partial g/\partial x & \partial g/\partial y \\ \partial h/\partial x & \partial h/\partial y \end{vmatrix} = \begin{vmatrix} \partial z/\partial x & \partial z/\partial y \\ \partial w/\partial x & \partial w/\partial y \end{vmatrix}.$$

Alternatively, define

$$J(z, w) = \begin{vmatrix} \partial q/\partial z & \partial q/\partial w \\ \partial r/\partial z & \partial r/\partial w \end{vmatrix} = \begin{vmatrix} \partial x/\partial z & \partial x/\partial w \\ \partial y/\partial z & \partial y/\partial w \end{vmatrix};$$

then $|J(z, w)| = |J(x, y)|^{-1}$ and $f_{Z,W}(z, w) = f_{X,Y}[q(z, w), r(z, w)]\,|J(z, w)|$.
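As an added sanity check by simulation (the transformation $y = e^x$ of a standard normal is an arbitrary choice): here $q(y) = \ln y$ and $|dq/dy| = 1/y$, so the formula gives $f_Y(y) = \phi(\ln y)/y$, which should match a histogram estimate of the density of Y:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
y = np.exp(x)                                   # one-to-one transform y = e^x

phi = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)
f_y = lambda y: phi(np.log(y)) / y              # f_X(q(y)) |dq(y)/dy|, q(y) = ln y

# Compare the change-of-variables formula with a histogram density estimate
edges = np.array([0.5, 1.0, 1.5, 2.0])
hist, _ = np.histogram(y, bins=edges, density=True)
mids = (edges[:-1] + edges[1:]) / 2
print(hist)          # simulated density on each bin
print(f_y(mids))     # formula evaluated at the bin midpoints: close to hist
```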
The multivariate normal distribution: Consider the random vector $x = (x_1, \ldots, x_n)'$ with mean vector $\mu$ and covariance matrix $\Sigma$. The general form of the joint density is

$$f(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} e^{-(1/2)(x - \mu)'\Sigma^{-1}(x - \mu)}.$$

Properties of the multivariate normal: Let $x_1$ be any subset of the variables, including a single variable, and let $x_2$ be the remaining variables. Partition $\mu$ and $\Sigma$ likewise, so

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$

Then the marginal distributions are

$$x_1 \sim N(\mu_1, \Sigma_{11}) \quad \text{and} \quad x_2 \sim N(\mu_2, \Sigma_{22}),$$

and the conditional distribution of $x_1$ given $x_2$ is normal with mean and variance

$$\mu_{1.2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$$

$$\Sigma_{11.2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
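These partitioned formulas translate directly into a few lines of numpy. An added sketch with made-up values of $\mu$ and $\Sigma$ (both $x_1$ and $x_2$ scalar, so the answers can be checked by hand):

```python
import numpy as np

# Made-up 2-variable example: x = (x1, x2) ~ N(mu, Sigma)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

m1, m2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2 = np.array([3.0])                              # observed value of x2

# mu_{1.2} = mu1 + Sigma12 Sigma22^{-1} (x2 - mu2)
mu_cond = m1 + S12 @ np.linalg.solve(S22, x2 - m2)
# Sigma_{11.2} = Sigma11 - Sigma12 Sigma22^{-1} Sigma21
S_cond = S11 - S12 @ np.linalg.solve(S22, S21)

print(mu_cond)   # [1.8]:   1.0 + 0.8 * (3 - 2) / 1.0
print(S_cond)    # [[1.36]]: 2.0 - 0.8**2 / 1.0
```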
0.6. Markov Process
Stochastic Process: A stochastic process is a sequence of random variables $\{Z_t, t = 0, 1, \ldots\}$ defined on a common sample space S. Suppose that $Z_t$ is measurable with respect to a $\sigma$-field $F_t$ for all $t = 0, 1, \ldots$ and that the sequence of $\sigma$-fields $\{F_t, t = 0, 1, \ldots\}$ is increasing ($F_t \subseteq F_{t+1}$); such a sequence is called a filtration, and the process is said to be adapted to it.

Note that for any sequence of random variables, we can always set $F_t = \sigma(Z_0, Z_1, \ldots, Z_t)$, so that $Z_t$ is measurable with respect to $F_t$. Most of the time this is the natural filtration that we work with for stochastic processes. However, there may be other filtrations to which the process is adapted.
Sample Path: For any given $\omega \in S$, we call the sequence of real numbers $\{Z_t(\omega), t = 0, 1, \ldots\}$ a sample path (or realization) of the process.
Markov Process: We call $\{Z_t\}$ a Markov process if, for any times $t_1 < t_2 < \cdots < t_n < t$, we have

$$P(Z_t \le z \mid Z_{t_1}, Z_{t_2}, \ldots, Z_{t_n}) = P(Z_t \le z \mid Z_{t_n}).$$
This type of process has a memoryless property since the future state of the
process depends only on the present state and not on the past history.
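A short simulation sketch (added here; the two-state transition matrix is made up) illustrates the memoryless property: conditioning on an extra lag leaves the estimated transition probability unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up transition matrix: P[i, j] = P(Z_{t+1} = j | Z_t = i)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

T = 200_000
z = np.zeros(T, dtype=int)
for t in range(T - 1):
    z[t + 1] = 1 if rng.random() < P[z[t], 1] else 0   # next state given current

# Estimate P(Z_{t+1}=1 | Z_t=0) vs P(Z_{t+1}=1 | Z_t=0, Z_{t-1}=1)
now0 = (z[1:-1] == 0)
print(z[2:][now0].mean())            # ~0.10 = P[0, 1]
prev1 = now0 & (z[:-2] == 1)
print(z[2:][prev1].mean())           # ~0.10 as well: the extra lag adds nothing
```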
Acknowledgements: The notes for this section have been based on materials provided by Prof. Xiaodong Zhu, Prof. Angelo Melino, and Prof. Bruce Hansen, and taken from Probability, Random Variables, and Random Processes by Hsu. Please consult these sources for further details.