CalTech Lecture On Probability
Unless otherwise noted, references to Theorems, page numbers, etc. are from Casella & Berger, chap. 1. Statistics: draw conclusions about a population of objects by conducting an experiment.
Ω: sample space. Set of outcomes of an experiment. Example: tossing a coin twice. Ω = {HH, HT, TT, TH}.
An event is a subset of Ω. Examples: (i) at least one head is {HH, HT, TH}; (ii) no more than one head is {HT, TH, TT}; etc. In order to have a probabilistic model, we need to enumerate the set of events that we can distinguish upon running an experiment. A set of events B is a σ-algebra (or σ-field) of Ω, which is a collection of subsets of Ω with the following properties:
(1) ∅ ∈ B
(2) If an event A ∈ B, then A^c ∈ B (closed under complementation)
(3) If A_1, A_2, . . . ∈ B, then ∪_{i=1}^∞ A_i ∈ B (closed under countable union). A countable sequence can be indexed using the natural integers.
Additional properties:
(4) (1)+(2) imply Ω ∈ B
(5) (3)+De Morgan's Laws [1] imply ∩_{i=1}^∞ A_i ∈ B (closed under countable intersection)
For a given sample space Ω, there are many different σ-algebras:
- {∅, Ω}: the trivial σ-algebra.
- The power set P(Ω), which contains all the subsets of Ω.
For the 2-coin toss example, the smallest σ-algebra containing {HH}, {HT}, {TH}, {TT} is the power set P(Ω), with 2^4 = 16 elements. We call this the σ-algebra generated by the fundamental events {HH}, {HT}, {TH}, {TT}.
[1] De Morgan's Law: (A ∪ B)^c = A^c ∩ B^c.
In practice, rather than specifying a particular σ-algebra from scratch, there is usually a class of events of interest, C, which we want to be included in the σ-algebra. Hence, we wish to complete C by adding events to it so that we get a σ-algebra. Define: Let C be a collection of subsets of Ω. The minimal σ-field generated by C, denoted σ(C), satisfies: (i) C ⊆ σ(C); (ii) if B' is any other σ-field containing C, then σ(C) ⊆ B'.
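For a finite sample space, σ(C) can be computed directly by closing C under complements and unions until nothing new appears. A minimal Python sketch (not from the notes; the function name is illustrative):

```python
from itertools import combinations

def generated_sigma_algebra(omega, collection):
    """sigma(C): the smallest sigma-algebra on the finite sample space
    `omega` containing every event in `collection`. For finite omega,
    closing under complements and pairwise unions suffices."""
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(a) for a in collection}
    changed = True
    while changed:
        changed = False
        current = list(sigma)
        for a in current:                      # close under complementation
            if omega - a not in sigma:
                sigma.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):  # close under (finite) union
            if a | b not in sigma:
                sigma.add(a | b)
                changed = True
    return sigma

# 2-coin toss: the fundamental events generate the full power set
omega = {"HH", "HT", "TH", "TT"}
print(len(generated_sigma_algebra(omega, [{w} for w in omega])))  # 16
```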
Given a measurable space (Ω, B), a probability function P with domain B is a function which satisfies:
1. P(A) ≥ 0, for all A ∈ B.
2. P(Ω) = 1.
3. Countable additivity: If A_1, A_2, . . . ∈ B are pairwise disjoint (i.e., A_i ∩ A_j = ∅ for all i ≠ j), then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
Define: the support of P is the set {A ∈ B : P(A) > 0}.
Example: Return to the 2-coin toss. Assuming that the coin is fair (50/50 chance of getting heads/tails), the probability function for the σ-algebra consisting of all subsets of Ω is:

Event A           P(A)
{HH}              1/4
{HT}              1/4
{TH}              1/4
{TT}              1/4
∅                 0
Ω                 1    (using pt. (3) of Defn above)
{HH, HT, TH}      3/4
{HH, HT}          1/2
...               ...
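A quick numeric check of these axioms on the fair-coin example (a sketch; the event encoding as Python sets is mine):

```python
omega = {"HH", "HT", "TH", "TT"}
P = lambda event: len(event & omega) / 4      # equal point masses of 1/4

assert P(omega) == 1                          # axiom 2
A, B = {"HH", "HT", "TH"}, {"TT"}             # disjoint events
assert A & B == set() and P(A | B) == P(A) + P(B)   # (finite) additivity
print(P({"HH", "HT", "TH"}))                  # 0.75, as in the table
```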
The triple (Ω, B, P) is a probability space. Extended example: ([0, 1], B([0, 1]), λ). 1. The sample space is the real interval [0, 1].
2. The event space is the Borel σ-algebra on [0,1], denoted B([0, 1]). This is the minimal σ-algebra generated by the elementary events {[0, b), 0 ≤ b ≤ 1}. This collection contains things like [1/2, 2/3], [0, 1/2] ∪ (2/3, 1], {1/2}. To see this, note that closed intervals can be generated as countable intersections of open intervals (and vice versa):

lim_{n→∞} (a − 1/n, b + 1/n) = ∩_{n=1}^∞ (a − 1/n, b + 1/n) = [a, b]
lim_{n→∞} [a + 1/n, b − 1/n] = ∪_{n=1}^∞ [a + 1/n, b − 1/n] = (a, b)
(Limit has unambiguous meaning because the set sequences are monotonic.) Thus, B([0, 1]) can equivalently be characterized as the minimal σ-field generated by: (i) the open intervals (a, b) on [0, 1]; (ii) the closed intervals [a, b]; (iii) the closed half-lines [0, a], and so on. Moreover: it is also the minimal σ-field containing all the open sets in [0, 1]: B([0, 1]) = σ(open sets on [0, 1]). This last characterization of the Borel field, as the minimal σ-field containing the open subsets, can be generalized to any metric space (i.e., so that openness is defined). This includes R, R^k, even functional spaces (e.g. L2[a, b], the space of square-integrable functions on [a, b]).
3. λ(A), for all A ∈ B, is Lebesgue measure, defined as the sum of the lengths of the intervals contained in A. E.g.: λ([1/2, 2/3]) = 1/6, λ([0, 1/2] ∪ (2/3, 1]) = 5/6, λ({1/2}) = 0.
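For finite disjoint unions of intervals, Lebesgue measure is just the sum of the lengths; a tiny sketch (my encoding of intervals as (a, b) endpoint pairs):

```python
def lebesgue_measure(intervals):
    """Measure of a finite disjoint union of intervals, given as
    (a, b) endpoint pairs. Open vs. closed endpoints are irrelevant:
    single points have measure zero."""
    return sum(b - a for a, b in intervals)

print(lebesgue_measure([(1/2, 2/3)]))           # 1/6
print(lebesgue_measure([(0, 1/2), (2/3, 1)]))   # 5/6
```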
More examples: Consider the measurable space ([0, 1], B). Are the following probability measures?
- For some ω ∈ [0, 1] and A ∈ B: P(A) = λ(A) if ω ∈ A, and 0 otherwise.
- For some ω ∈ [0, 1] and A ∈ B: P(A) = 1 if ω ∈ A, and 0 otherwise.
- P(A) = 1, for all A ∈ B.
Can you figure out an appropriate σ-algebra for which these functions are probability measures? For the third example: take the σ-algebra as {∅, [0, 1]}.
Additional properties of probability measures (CB Thms 1.2.8-11). For prob. fxn P and A, B ∈ B:
- P(∅) = 0; P(A) ≤ 1; P(A^c) = 1 − P(A).
- P(B ∩ A^c) = P(B) − P(A ∩ B).
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
- Subadditivity (Boole's inequality): for events A_i, i ≥ 1,
P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).
For two events, combining P(A ∪ B) ≤ 1 with P(A ∪ B) = P(A) + P(B) − P(A ∩ B) yields
P(A ∩ B) ≥ P(A) + P(B) − 1,
which is called the Bonferroni bound on the joint event A ∩ B. (Note: when P(A) and P(B) are small, the bound is < 0, which is trivially correct. Also, the bound is always ≤ 1.) With three events, the above properties imply:
P(∪_{i=1}^3 A_i) = Σ_{i=1}^3 P(A_i) − Σ_{i<j} P(A_i ∩ A_j) + P(A_1 ∩ A_2 ∩ A_3).
More generally, for n events:
P(∪_{i=1}^n A_i) = Σ_{i=1}^n P(A_i) − Σ_{1≤i<j≤n} P(A_i ∩ A_j) + Σ_{1≤i<j<k≤n} P(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n+1} P(A_1 ∩ A_2 ∩ · · · ∩ A_n).
This equality, the inclusion-exclusion formula, can be used to derive a wide variety of bounds (depending on what is known and unknown).
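The formula transcribes directly into code for a finite sample space; a sketch (the event encoding and helper names are mine):

```python
from itertools import combinations

def prob_union(events, prob):
    """Inclusion-exclusion: P(A_1 u ... u A_n) as the alternating
    sum of P(intersection) over all non-empty subcollections."""
    n = len(events)
    total = 0.0
    for k in range(1, n + 1):
        sign = (-1) ** (k + 1)
        for sub in combinations(events, k):
            total += sign * prob(set.intersection(*sub))
    return total

# 2-coin toss with the fair-coin measure: P(event) = |event| / 4
prob = lambda e: len(e) / 4
A = [{"HH", "HT"}, {"HH", "TH"}, {"TT"}]
print(prob_union(A, prob))   # 1.0 (the three events cover Omega)
```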
Updating information: conditional probabilities
Consider a given probability space (Ω, B, P). Definition 1.3.2: if A, B ∈ B, and P(B) > 0, then the conditional probability of A given B, denoted P(A|B), is
P(A|B) = P(A ∩ B) / P(B).
If you interpret P(A) as the prob. that the outcome of the experiment is in A, then P(A|B) is the prob. that the outcome is in A, given that you know it is in B. If A and B are disjoint, then P(A|B) = 0/P(B) = 0. If A ⊆ B, then P(A|B) = P(A)/P(B) ≤ 1. Here B is necessary for A. If B ⊆ A, then P(A|B) = P(B)/P(B) = 1. Here B implies A. As CB point out, when you condition on the event B, then B becomes the sample space of a new probability space, for which P(·|B) is the appropriate probability measure:
Ω → B
B → {A ∩ B : A ∈ B}
P(·) → P(·|B)
From manipulating the conditional probability formula, you can get that
P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A)
P(A|B) = P(B|A) · P(A) / P(B).
For a partition of disjoint events A_1, A_2, . . . of Ω:
P(B) = Σ_{i=1}^∞ P(B ∩ A_i) = Σ_{i=1}^∞ P(B|A_i) P(A_i).
Hence:
P(A_i|B) = P(B|A_i) P(A_i) / Σ_{j=1}^∞ P(B|A_j) P(A_j),
which is Bayes' Rule.
Example: Let's Make a Deal. There are three doors (numbered 1, 2, 3). Behind one of them, a prize has been randomly placed. You (the contestant) bet that the prize is behind door 1. Monty Hall opens door 2, and reveals that there is no prize behind door 2. He asks you: do you want to switch your bet to door 3? Informally: MH has revealed that the prize is not behind 2. There are two cases: either (a) it is behind 1, or (b) behind 3. In which case is MH's opening door 2 more probable? In case (a), MH could have opened either 2 or 3; in case (b), MH is forced to open 2 (since he cannot open door 1, because you chose that door). MH's opening of door 2 is more probable under case (b), so you should switch. (This is actually a maximum likelihood argument.) More formally, define two random variables D (the door behind which the prize is) and M (denoting the door which Monty opens). Consider a comparison of the conditional
probabilities P(D = 1|M = 2) vs. P(D = 3|M = 2). Note that these two sum to 1, so you will switch to D = 3 if P(D = 3|M = 2) > 0.5.

D   M   Prob
1   1   0
1   2   (1/2)·(1/3) = 1/6
1   3   (1/2)·(1/3) = 1/6
2   1   0
2   2   0
2   3   1·(1/3) = 1/3
3   1   0
3   2   1·(1/3) = 1/3
3   3   0
(Note that Monty will never open door 1, because you bet on door 1.)
Before Monty opens door 2, you believe that Pr(D = 3) = 1/3. After Monty opens door 2, you can update to
P(D = 3|M = 2) = P(M = 2|D = 3) P(D = 3) / P(M = 2) = (1/3) / (1/6 + 1/3) = 2/3 > 1/2,
so you should switch.
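The update can be checked both with Bayes' Rule and by simulation; a sketch (the helper names are mine, and the likelihoods come from the table above):

```python
import random

def bayes_posterior(priors, likelihoods):
    """Posterior P(A_i | B) from priors P(A_i) and likelihoods P(B | A_i)."""
    total = sum(p * l for p, l in zip(priors, likelihoods))   # P(B)
    return [p * l / total for p, l in zip(priors, likelihoods)]

# D = door with the prize; you bet on door 1 and Monty opens door 2.
priors = [1/3, 1/3, 1/3]          # P(D = 1), P(D = 2), P(D = 3)
likelihoods = [1/2, 0, 1]         # P(M = 2 | D = i)
print(bayes_posterior(priors, likelihoods))      # [1/3, 0.0, 2/3]

# Monte Carlo check: switching wins about 2/3 of the time.
wins, trials = 0, 100_000
for _ in range(trials):
    prize = random.randint(1, 3)                 # you always bet on door 1
    opened = random.choice([2, 3]) if prize == 1 else ({2, 3} - {prize}).pop()
    switch_to = ({2, 3} - {opened}).pop()
    wins += (switch_to == prize)
print(wins / trials)                             # approx. 0.667
```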
Two events A, B ∈ B are statistically independent iff P(A ∩ B) = P(A) · P(B). (Two disjoint events, each with positive probability, are never independent.) Independence implies that P(A|B) = P(A) and P(B|A) = P(B):
knowing that the outcome is in B does not change your perception of the outcome's being in A. Example: in the 2-coin toss, the events "first toss is heads" (HH, HT) and "second toss is heads" (HH, TH) are independent. (Note that independence of two events does not mean that the two events have zero intersection in the sample space.) Some trivial cases for independence of two events A_1 and A_2: (i) P(A_1) = P(A_2) = 1; (ii) P(A_1) = 0.
When there are more than two events (i.e., A_1, . . . , A_n), we use the concept of mutual independence: A_1, . . . , A_n are mutually independent iff, for any subcollection A_{i_1}, . . . , A_{i_k},
P(∩_{j=1}^k A_{i_j}) = Π_{j=1}^k P(A_{i_j}).
This is stronger than P(∩_{i=1}^n A_i) = Π_{i=1}^n P(A_i), and also stronger than pairwise independence P(A_i ∩ A_j) = P(A_i)P(A_j), i ≠ j. Indeed, it involves Σ_{k=2}^n C(n, k) = 2^n − n − 1 equations (which is the number of subcollections A_{i_1}, . . . , A_{i_k}).
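A standard illustration (not in the notes) that pairwise independence does not imply mutual independence, checked by enumerating all 2^3 − 3 − 1 = 4 equations for two fair coin tosses:

```python
from itertools import combinations

# A1 = "first toss is heads", A2 = "second toss is heads",
# A3 = "both tosses agree"; each of the 4 outcomes has probability 1/4.
A = [{"HH", "HT"}, {"HH", "TH"}, {"HH", "TT"}]
prob = lambda event: len(event) / 4

for k in (2, 3):
    for sub in combinations(A, k):
        lhs = prob(set.intersection(*sub))
        rhs = 1.0
        for e in sub:
            rhs *= prob(e)
        print(k, lhs, rhs, lhs == rhs)
# every pair matches (pairwise independent), but for the triple
# P(A1 n A2 n A3) = 1/4 while P(A1)P(A2)P(A3) = 1/8.
```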
Random variables
A random variable is a function from the sample space Ω to the real numbers. Examples (2-coin toss):
1. Let t_i denote the outcome of toss i, with t_i = 1 for heads and t_i = 2 for tails. One RV is x ≡ t_1 + t_2:

ω     x(ω)
HH    2
HT    3
TH    3
TT    4

Note that an RV need not be a one-to-one mapping from Ω to R.
2. Another RV is x equal to the number of heads:

ω     P(ω)   x(ω)
HH    1/4    2
HT    1/4    1
TH    1/4    1
TT    1/4    0

implying
x = 0 w/prob 1/4, 1 w/prob 1/2, 2 w/prob 1/4.
This example illustrates how we use (Ω, B, P), the original probability space, to define (induce) a probability space for the random variable: here ({0, 1, 2}, all subsets of {0, 1, 2}, P_x).
This is the simplest example of a discrete random variable: one with a countable range.
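The induced distribution P_x is computed by pushing the original measure through the map; a sketch for finite spaces (helper name mine):

```python
from collections import defaultdict

def pushforward(P, X):
    """Induced distribution of a random variable X on a finite
    probability space: P_x(v) = P({omega : X(omega) = v})."""
    Px = defaultdict(float)
    for omega, p in P.items():
        Px[X(omega)] += p
    return dict(Px)

P = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}
num_heads = lambda w: w.count("H")
print(pushforward(P, num_heads))   # {2: 0.25, 1: 0.5, 0: 0.25}
```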
For continuous random variables x : Ω → R, we define the probability space:
- Sample space is the real line R.
- Event space is B(R), the Borel σ-algebra on the real line, which is generated by the half-lines {(−∞, a], a ∈ (−∞, ∞)}.
- Probability measure P_x defined so that, for A ∈ B(R), P_x(A) = P(ω : x(ω) ∈ A) ≡ P(x^{−1}(A)).
Implicit assumption: for all A ∈ B(R), x^{−1}(A) ∈ B(Ω). Otherwise, P(x^{−1}(A)) may not be well-defined, since the domain of the P(·) function is B(Ω). This is the requirement that the random variable x(·) is Borel-measurable. Example: consider X(ω) = |ω|, with ω from the probability space ([−1, 1], B[−1, 1], λ/2). Then the probability space for X(·) is: (i) sample space [0, 1]; (ii) event space B[0, 1]; and (iii) probability measure P_x such that P_x(A) = P(x(ω) ∈ A) = P(ω : ω ∈ A or −ω ∈ A) = λ(A).
For example, P_x([1/3, 2/3]) = λ([1/3, 2/3])/2 + λ([−2/3, −1/3])/2 = λ([1/3, 2/3]) = 1/3.
For a random variable X on (R, B(R), P_x), we define its cumulative distribution function (CDF) F_X(x) ≡ P_x(X ≤ x), for all x. (Note that all the sets {X ≤ x} are in B(R).) For a discrete random variable, the CDF is a step function which is continuous from the right (graph).
Thm 1.5.3: F(x) is a CDF iff
1. lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0
2. F(x) is nondecreasing
3. F(x) is right-continuous: for every x_0, lim_{x↓x_0} F(x) = F(x_0)
Any random variable X is tight: for every ε > 0 there exists a constant M < ∞ such that P(|X| > M) < ε. It does not have a probability mass at ±∞.
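A sketch of such a right-continuous step CDF for the number-of-heads RV (the helper is mine):

```python
def discrete_cdf(pmf):
    """Return F(x) = P(X <= x) for a discrete RV given by a pmf dict.
    The result is a right-continuous step function (Thm 1.5.3)."""
    points = sorted(pmf)
    def F(x):
        return sum(pmf[v] for v in points if v <= x)
    return F

F = discrete_cdf({0: 0.25, 1: 0.5, 2: 0.25})     # number of heads
print(F(-1), F(0), F(0.5), F(1), F(2))           # 0.0 0.25 0.25 0.75 1.0
```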
Definition 1.5.8: the random variables X and Y are identically distributed if for every set A ∈ B(R), P_X(X ∈ A) = P_Y(Y ∈ A). Note that X, Y being identically distributed does not mean that X = Y! (Example: 2-coin toss, with X being the number of heads and Y the number of tails.) But Thm 1.5.10: X and Y are identically distributed iff F_X(z) = F_Y(z) for every z.
Definition 1.6.1: the probability mass function (pmf) for a discrete random variable X is f_X(x) ≡ P_X(X = x). Recover it from the CDF as the distance (on the y-axis) between the steps. Definition 1.6.3: the probability density function (pdf) for a continuous random variable X is the f_X(x) which satisfies
F_X(x) = ∫_{−∞}^x f_X(t) dt.    (3)
Thm 1.6.5: A function f_X(x) is a pmf or pdf iff
- f_X(x) ≥ 0 for all x;
- for a discrete RV, Σ_x f_X(x) = 1; for a continuous RV, ∫_{−∞}^∞ f_X(x) dx = 1.
By Eq. (3), and the fundamental theorem of calculus, if f_X(·) is continuous, then f_X(x) = F'_X(x) (i.e., F_X is the anti-derivative of f_X). [can skip] Generally, one looks for CDFs F and (nonnegative and integrable) functions f satisfying the relationships F(b) − F(a) = ∫_a^b f(x) dx and F'(x) = f(x). By the Radon-Nikodym Theorem, a (necessary and sufficient) condition for these relationships (and hence for the existence of a density function f_X(·)) is that the probability measure P_X of the real-valued random variable X be absolutely continuous with respect to Lebesgue measure. For random variables on the real line, this implies the usual continuity of the CDF F(x); what it allows for are non-continuous, but nevertheless integrable, f. Absolutely continuous w.r.t. Lebesgue measure means that all sets in the support of X (which is a part of the real line) which have zero Lebesgue measure must also have zero probability under P_X; i.e., for all A ⊆ R such that λ(A) = 0, P_X(A) = 0. Since singletons (and countable sets of singletons) have zero Lebesgue measure, this condition essentially rules out random variables which have a point mass at some point. Intuitively, a point mass shows up as a jump in the CDF F. Example: ([0, 1], B[0, 1], λ) and random variable
X(ω) = ω if ω ≤ 1/2, and X(ω) = 1/4 otherwise.
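Under this reading of the example, P_X has an atom of size 1/2 at the point 1/4 (a jump in the CDF), so P_X is not absolutely continuous and no density exists; a quick Monte Carlo sketch:

```python
import random

# X(w) = w if w <= 1/2, else 1/4, with w ~ Uniform[0, 1]:
# P(X = 1/4) is approximately 1/2, i.e. a point mass, so P_X is not
# absolutely continuous w.r.t. Lebesgue measure.
random.seed(0)
draws = [w if w <= 0.5 else 0.25
         for w in (random.random() for _ in range(100_000))]
print(sum(d == 0.25 for d in draws) / len(draws))   # approx. 0.5
```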
Conditional CDF/PDF
Random variable X on (R, B(R), P_X). What is Prob(X ≤ x|X ∈ A) (the conditional CDF)? Go back to basics:
Prob(X ≤ x|X ∈ A) = Prob({X ≤ x} ∩ A) / Prob(A).
This expression can be differentiated to obtain the conditional PDF. Example: X ~ U[0, 1], with conditioning event X ≥ z. Conditional CDF:
Prob(X ≤ x|X ≥ z) = 0 if x ≤ z; (x − z)/(1 − z) if x > z.
Hence, the conditional pdf is 1/(1 − z), for x > z. Example (truncated wages): wage offers X ~ U[0, 10], and X ≥ 5.25 (we only observe wages when they lie above the minimum wage). Conditional CDF, for x ∈ [5.25, 10]:
F_X(x|X ≥ 5.25) = Prob({X ≤ x} ∩ {X ≥ 5.25}) / Prob(X ≥ 5.25) = ((x − 5.25)/10) / (4.75/10) = (x − 5.25)/4.75.
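A small numeric sketch of the truncated-wage example (the function name is mine):

```python
def truncated_uniform_cdf(x, lo=0.0, hi=10.0, cutoff=5.25):
    """Conditional CDF F(x | X >= cutoff) for X ~ U[lo, hi]:
    P({X <= x} and {X >= cutoff}) / P(X >= cutoff)."""
    if x < cutoff:
        return 0.0
    if x > hi:
        return 1.0
    return (x - cutoff) / (hi - cutoff)

print(truncated_uniform_cdf(7.0))   # (7 - 5.25) / 4.75 = approx. 0.368
# the conditional pdf is the constant 1 / 4.75 on [5.25, 10]
```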
Lebesgue integration
Consider the measure space (R, B, P), where P is any measure (not necessarily Lebesgue measure). Then we define the Lebesgue-Stieltjes integral:
E_P f = ∫ f dP ≡ sup_{E_i} Σ_i (inf_{ω∈E_i} f(ω)) P(E_i),
where the sup is taken over all finite partitions {E_1, E_2, . . .} of R. Assign the value +∞ when this sup does not exist.
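On ([0,1], Lebesgue measure), and for an increasing f, the inner sum is easy to compute exactly on equal-width partitions; a sketch (function names mine):

```python
def lower_lebesgue_sum(f, n):
    """Lower sum for E_P f = integral of f dP on ([0,1], Lebesgue),
    over the partition into n equal intervals E_i = [i/n, (i+1)/n).
    For monotone increasing f, inf over E_i is attained at the left
    endpoint, so this is exactly sum_i (inf_{E_i} f) * P(E_i)."""
    return sum(f(i / n) * (1 / n) for i in range(n))

f = lambda x: x ** 2
for n in (10, 100, 1000):
    print(n, lower_lebesgue_sum(f, n))   # increases toward 1/3
```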
1.1 Properties of the integral
∫ k f dP = k ∫ f dP.
∫ (f + g) dP = ∫ f dP + ∫ g dP.
1.2 Existence
When does the Lebesgue integral exist (that is, when is it finite)? Consider a non-negative bounded function f(ω), and a sequence of successively finer partitions E^1, E^2, E^3, etc. For a partition E^i, consider the simple function defined, for each E_k ∈ E^i, as f^i(ω) = inf_{ω'∈E_k} f(ω') for ω ∈ E_k. This function is constant on each element of the partition E^i, and equal to the infimum of the function f(ω) on each element. Because the range of each function f^i(ω) is finite, the Lebesgue integral ∫ f^i dP is just a finite sum, and exists.
Furthermore, we see that f^i(ω) ↑ f(ω), for P-almost all ω. The question is: does ∫ f dP, which intuitively is the limit of ∫ f^i dP, also exist?
Monotone convergence theorem: If {f_n} is a non-decreasing sequence of measurable non-negative functions, with f_n(ω) ↑ f(ω), then
lim ∫ f_n dP = ∫ f dP.
(As stated, we don't require boundedness of f_n, f: so both LHS and RHS can be +∞.) For general functions f which may take both positive and negative values, we break f up into its positive part f^+ = max{f, 0} and negative part f^− = f^+ − f. Both f^+ and f^− are non-negative functions. We define
∫ f dP = ∫ f^+ dP − ∫ f^− dP.
We say that f is integrable if both of the integrals on the RHS are finite, i.e. ∫ |f| dP < ∞. Additional results (if monotone convergence does not hold):
Fatou's lemma: for a sequence of non-negative functions f_n:
liminf_n ∫ f_n dP ≥ ∫ (liminf_n f_n) dP.
(On the RHS, the liminf is taken pointwise in ω.) This is for sequences of functions which need not converge.
Dominated convergence theorem: If f_n(ω) → f(ω) for P-almost all ω, and there exists a function g(ω) such that |f_n(ω)| ≤ g(ω) for P-almost all ω and for all n, and g is integrable (∫ g dP < ∞), then ∫ f_n dP → ∫ f dP. That is, E f_n → E f.
Bounded convergence theorem: If f_n(ω) → f(ω) for P-almost all ω, and there exists a constant B < ∞ such that |f_n(ω)| ≤ B for P-almost all ω and for all n, then ∫ f_n dP → ∫ f dP.
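A numeric sketch of monotone convergence, using the truncations f_n = min(f, n) ↑ f for f(x) = x^{−1/2} on (0, 1], whose integral is 2 (exactly, ∫ f_n = 2 − 1/n); the Riemann-sum helper is mine:

```python
def integral(g, m=200_000):
    # simple midpoint sum on (0, 1], adequate for illustration
    return sum(g((i + 0.5) / m) for i in range(m)) / m

f = lambda x: x ** -0.5
for n in (1, 10, 100):
    fn = lambda x, n=n: min(f(x), n)       # f_n increases pointwise to f
    print(n, integral(fn))                 # approx. 1.0, 1.9, 1.99 -> 2
```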