
Chapter 2.

Discrete Probability
2.2: Conditional Probability
Alex Tsun | Slides (Google Drive) | Video (YouTube)

2.2.1 Conditional Probability

Sometimes we would like to incorporate new information into our probability. For example, you may be feeling symptoms of some disease, and so you take a test to see whether you have it or not. Let D be the event you have a disease, and T be the event you test positive (T^C is the event you test negative). It could be that P(D) = 0.01 (1% chance of having the disease without knowing anything). But how can we update this probability given that we tested positive (or negative)? This will be written as P(D | T) or P(D | T^C) respectively. You would think P(D | T) > P(D) since you're more likely to have the disease once you test positive, and P(D | T^C) < P(D) since you're less likely to have the disease once you test negative. These are called conditional probabilities - they are the probability of an event, given that you know some other event occurred. Is there a formula for updating P(D) given new information? Yes!
Let's go back to the example of students in CSE312 liking donuts and ice cream. Recall we defined event A as liking ice cream and event B as liking donuts. Then, remember we had 36 students that only like ice cream (A ∩ B^C), 7 students that like donuts and ice cream (A ∩ B), and 13 students that only like donuts (B ∩ A^C). Let's also say that we have 14 students that don't like either (A^C ∩ B^C). That leaves us with the following picture, which makes up the whole sample space:

Now, what if we asked the question: what's the probability that someone likes ice cream, given that we know they like donuts? We can approach this with the knowledge that 20 of the students like donuts (13 who don't like ice cream and 7 who do). What this question is getting at is: given the knowledge that someone likes donuts, what is the chance that they also like ice cream? Well, 7 of the 20 who like donuts like ice cream, so we are left with the probability 7/20. We write this as P(A | B) (read the "probability of A given B") and in this case we have the following:


P(A | B) = 7/20
         = |A ∩ B| / |B|                      [|B| = 20 people like donuts, |A ∩ B| = 7 people like both]
         = (|A ∩ B| / |Ω|) / (|B| / |Ω|)      [divide top and bottom by |Ω|, which is equivalent]
         = P(A ∩ B) / P(B)                    [if we have equally likely outcomes]

This intuition (which worked only in the special case of equally likely outcomes) leads us to the definition of conditional probability:

Definition 2.2.1: Conditional Probability

The conditional probability of event A given that event B happened is:

P(A | B) = P(A ∩ B) / P(B)

An equivalent and useful formula we can derive (by multiplying both sides by the denominator P(B), and switching the sides of the equation) is:

P(A ∩ B) = P(A | B) P(B)

Let’s consider an important question: does P (A | B) = P (B | A)? No!


This is a common misconception we can show with some examples. In the above example with ice cream and donuts, we already showed P(A | B) = 7/20, but P(B | A) = 7/43 (since 36 + 7 = 43 students like ice cream in total), and these are not equal.
Consider another example where W is the event that you are wet and S is the event you are swimming. Then, the probability you are wet given you are swimming, P(W | S) = 1, as if you are swimming you are certainly wet. But, the probability you are swimming given you are wet, P(S | W) ≠ 1, because there are numerous other reasons you could be wet that don't involve swimming (being in the rain, showering, etc.).
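To make the counting concrete, here is a short Python sketch (the variable names are just for illustration) that recovers P(A | B) and P(B | A) from the class counts above:

# Counts from the CSE312 donuts/ice cream example.
only_ice_cream = 36    # |A ∩ B^C|
both = 7               # |A ∩ B|
only_donuts = 13       # |B ∩ A^C|
neither = 14           # |A^C ∩ B^C|

total = only_ice_cream + both + only_donuts + neither    # |Ω| = 70

p_A_and_B = both / total
p_B = (both + only_donuts) / total        # 20/70
p_A = (both + only_ice_cream) / total     # 43/70

print(p_A_and_B / p_B)    # P(A | B) = 7/20 = 0.35
print(p_A_and_B / p_A)    # P(B | A) = 7/43 ≈ 0.163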

2.2.2 Bayes Theorem

This brings us to Bayes Theorem:

Theorem 2.2.1: Bayes Theorem

Let A, B be events with nonzero probability. Then,

P(A | B) = P(B | A) P(A) / P(B)

Note that in the above P (A) is called the prior, which is our belief without knowing anything about
event B. P (A | B) is called the posterior, our belief after learning that event B occurred.

This theorem is important because it allows us to "reverse the conditioning"! Notice that both P(A | B) and P(B | A) appear in this equation on opposite sides. So if we know P(A) and P(B) and can more easily calculate one of P(A | B) or P(B | A), we can use Bayes Theorem to derive the other.

Proof of Bayes Theorem. Recall the (alternate) definition of conditional probability from above:

P (A ∩ B) = P (A | B) P (B) (2.2.1)

Swapping the roles of A and B we can also get that:

P (B ∩ A) = P (B | A) P (A) (2.2.2)

But, because A ∩ B = B ∩ A (since these are the outcomes in both events A and B, and the order of
intersection does not matter), P (A ∩ B) = P (B ∩ A), so (2.2.1) and (2.2.2) are equal and we have (by
setting the right-hand sides equal):

P (A | B) P (B) = P (B | A) P (A)

We can divide both sides by P (B) and get Bayes Theorem:


P(A | B) = P(B | A) P(A) / P(B)

Wow, I wish I was alive back then and had this important (and easy to prove) theorem named after me!
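As a quick numeric sanity check of "reversing the conditioning", here is a minimal Python sketch using the donuts/ice cream counts from earlier (the function name is made up for illustration):

def reverse_conditioning(p_b_given_a, p_a, p_b):
    # Bayes Theorem: P(A | B) = P(B | A) P(A) / P(B).
    return p_b_given_a * p_a / p_b

# Donuts/ice cream numbers: |Ω| = 70, |A| = 43, |B| = 20, |A ∩ B| = 7.
print(reverse_conditioning(7/43, 43/70, 20/70))   # 0.35, i.e. P(A | B) = 7/20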

Example(s)

We'll investigate two slightly different questions whose answers don't seem like they should be different, but are. Suppose a family has two children (who, at birth, were each equally likely to be male or female). Let's say a telemarketer calls home and one of the two children picks up.

1. If the child who responded was male, and says “Let me get my older sibling”, what is the
probability that both children are male?
2. If the child who responded was male, and says “Let me get my other sibling”, what is the
probability that both children are male?

Solution There are four equally likely outcomes: MM, MF, FM, and FF (where M represents male and F represents female, the first letter is the younger child, and the second letter is the older child). Let A be the event both children are male.

1. In this part, we're given that the younger sibling is male. So we can rule out 2 of the 4 outcomes above and we're left with MF and MM. Out of these two, in one of these cases we get MM, and so our desired probability is 1/2.
More formally, let this event be B, which happens with probability 2/4 (2 out of 4 equally likely outcomes). Then, P(A | B) = P(A ∩ B) / P(B) = (1/4) / (2/4) = 1/2, since P(A ∩ B) is the probability both children are male, which happens in 1 out of 4 equally likely scenarios. This is because the older sibling's sex is independent of the younger sibling's, so knowing the younger sibling is male doesn't change the probability of the older sibling being male (which is what we computed just now).

2. In this part, we’re given that at least one sibling is male. That is, out of the 4 outcomes, we can only
rule out the FF option. Out of the remaining options MM, MF, and FM, only one has both siblings
being male. Hence, the probability desired is 1/3. You can do a similar more formal argument like we
did above!

See how a slight wording change changed the answer?
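If the 1/2 versus 1/3 answers still feel surprising, here is a short Monte Carlo sketch in Python, under the stated assumption that each child is independently equally likely to be male or female, which estimates both conditional probabilities by counting:

import random

trials = 100_000
both_given_younger_male = [0, 0]        # [both male, younger male]
both_given_at_least_one_male = [0, 0]   # [both male, at least one male]

for _ in range(trials):
    younger = random.choice("MF")
    older = random.choice("MF")
    both_male = (younger == "M" and older == "M")

    if younger == "M":                   # scenario 1: the responder is the younger child
        both_given_younger_male[1] += 1
        both_given_younger_male[0] += both_male
    if younger == "M" or older == "M":   # scenario 2: at least one child is male
        both_given_at_least_one_male[1] += 1
        both_given_at_least_one_male[0] += both_male

print(both_given_younger_male[0] / both_given_younger_male[1])            # ≈ 1/2
print(both_given_at_least_one_male[0] / both_given_at_least_one_male[1])  # ≈ 1/3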


We’ll see a disease testing example later, which requires the next section first. If you test positive for a
disease, how concerned should you be? The result may surprise you!

2.2.3 Law of Total Probability

Let’s say you sign up for a chemistry class, but are assigned to one of three teachers randomly. Furthermore,
you know the probabilities you fail the class if you were to have each teacher (from historical results, or
word-of-mouth from classmates who have taken the class). Can we combine this information to compute the
overall probability that you fail chemistry (before you know which teacher you get)? Yes - using the law of
total probability below! We first need to define what a partition is.

Definition 2.2.2: Partitions


Non-empty events E1, . . . , En partition the sample space Ω if they are:

• (Exhaustive) E1 ∪ E2 ∪ · · · ∪ En = ∪_{i=1}^{n} Ei = Ω; that is, they cover the entire sample space.
• (Pairwise Mutually Exclusive) For all i ≠ j, Ei ∩ Ej = ∅; that is, none of them overlap.

Note that for any event E, E and E^C always form a partition of Ω.
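As a small illustration, here is a Python sketch (with made-up outcome labels for the four regions of the donuts/ice cream picture) that checks the two partition conditions:

from itertools import combinations

# Label each student type as an outcome; the four regions partition Ω.
omega = {"ice_cream_only", "both", "donuts_only", "neither"}
events = [{"ice_cream_only"}, {"both"}, {"donuts_only"}, {"neither"}]

exhaustive = set().union(*events) == omega
pairwise_disjoint = all(a.isdisjoint(b) for a, b in combinations(events, 2))
print(exhaustive and pairwise_disjoint)   # True: these events partition Ω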

Example(s)

Two example partitions can be seen in the image below:



You can see that partition is a very appropriate word here! In the first image, the four events
E1 , . . . , E4 don’t overlap and cover the sample space. In the second image, the two events E, E C do
the same thing! This is useful when you know exactly one of a few things will happen. For example,
for the chemistry example, there might be only three teachers, and you will be assigned to exactly
one of them: at most one because you can’t have two teachers (mutually exclusive), and at least one
because there aren’t other teachers possible (exhaustive).

Now, suppose we have some event F which intersects with various events that form a partition of Ω. This
is illustrated by the picture below:

Notice that F is composed of its intersection with each of E1 , E2 , and E3 , and so we can split F up into
smaller pieces. This means that we can write the following (green chunk F ∩ E1 , plus pink chunk F ∩ E2
plus yellow chunk F ∩ E3 ):

P (F ) = P (F ∩ E1 ) + P (F ∩ E2 ) + P (F ∩ E3 )

Note that F and E4 do not intersect, so F ∩ E4 = ∅. For completeness, we can include E4 in the above equation, because P(F ∩ E4) = 0. So, in all we have:

P (F ) = P (F ∩ E1 ) + P (F ∩ E2 ) + P (F ∩ E3 ) + P (F ∩ E4 )

This leads us to the law of total probability.



Theorem 2.2.2: Law of Total Probability (LTP)

If events E1, . . . , En partition Ω, then for any event F,

P(F) = P(F ∩ E1) + · · · + P(F ∩ En) = Σ_{i=1}^{n} P(F ∩ Ei)

Using the definition of conditional probability, P(F ∩ Ei) = P(F | Ei) P(Ei), we can replace each of the terms above and get the (typically) more useful formula:

P(F) = P(F | E1) P(E1) + · · · + P(F | En) P(En) = Σ_{i=1}^{n} P(F | Ei) P(Ei)

That is, to compute the probability of an event F overall: suppose we have n disjoint cases E1, . . . , En for which we can (easily) compute the probability of F in each of these cases, P(F | Ei). Then, take the weighted average of these probabilities, using the probabilities P(Ei) as weights (the probability of being in each case).

Example(s)

Let’s consider an example in which we are trying to determine the probability that we fail chemistry.
Let’s call the event F failing, and consider the three events E1 for getting the Mean Teacher, E2 for
getting the Nice Teacher, and E3 for getting the Hard Teacher which partition the sample space. The
following table gives the relevant probabilities:

                                           Mean Teacher E1   Nice Teacher E2   Hard Teacher E3
Probability of Teaching You, P(Ei)              6/8               1/8               1/8
Probability of Failing You, P(F | Ei)            1                 0                1/2

Solve for the probability of failing.

Solution Before doing anything, how are you liking your chances? There is a high probability (6/8) of getting
the Mean Teacher, and she will certainly fail you. Therefore, you should be pretty sad.

Now let’s do the computation. Notice that the first row sums to 1, as it must, since events E1 , E2 , E3
partition the sample space (you have exactly one of the three teachers). Using the Law of Total Probability
(LTP), we have the following:

P(F) = Σ_{i=1}^{3} P(F | Ei) P(Ei) = P(F | E1) P(E1) + P(F | E2) P(E2) + P(F | E3) P(E3)
     = 1 · 6/8 + 0 · 1/8 + 1/2 · 1/8
     = 13/16

Notice that to get the probability of failing, what we did was: consider the probability of failing in each of the 3 cases, and take a weighted average of these probabilities, using the probability of each case as the weight. This is exactly what the law of total probability lets us do!
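The same weighted-average computation as a few lines of Python (the lists simply encode the table above):

# P(Ei) and P(F | Ei) for the Mean, Nice, and Hard teachers, respectively.
p_teacher = [6/8, 1/8, 1/8]
p_fail_given_teacher = [1.0, 0.0, 1/2]

# Law of Total Probability: P(F) = sum over i of P(F | Ei) * P(Ei).
p_fail = sum(pf * pe for pf, pe in zip(p_fail_given_teacher, p_teacher))
print(p_fail)   # 0.8125 = 13/16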

You might consider using the LTP when you know the probability of your desired event in each of several disjoint cases (and the probability of being in each case).

Example(s)

Misfortune struck us and we ended up failing chemistry class. What is the probability that we had
the Hard Teacher given that we failed?

Solution First, this probability should be low intuitively because if you failed, it was probably due to the Mean Teacher (because you are more likely to get her, AND because she has a high fail rate of 100%).

Start by writing out in a formula what you want to compute; in our case, it is P (E3 | F ) (getting the hard
teacher given that we failed). We know P (F | E3 ) and we want to solve for P (E3 | F ). This is a hint to use
Bayes Theorem since we can reverse the conditioning! Using that with the numbers from the table and the
previous question:

P(E3 | F) = P(F | E3) P(E3) / P(F)          [bayes theorem]
          = (1/2 · 1/8) / (13/16)
          = 1/13
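Continuing the Python sketch from the previous example, the posterior takes just one more line:

p_teacher = [6/8, 1/8, 1/8]               # P(E1), P(E2), P(E3)
p_fail_given_teacher = [1.0, 0.0, 1/2]    # P(F | E1), P(F | E2), P(F | E3)

p_fail = sum(pf * pe for pf, pe in zip(p_fail_given_teacher, p_teacher))   # 13/16

# Bayes Theorem: P(E3 | F) = P(F | E3) P(E3) / P(F).
print(p_fail_given_teacher[2] * p_teacher[2] / p_fail)   # 0.0769... = 1/13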

2.2.4 Bayes Theorem with the Law of Total Probability

Oftentimes, the denominator P(F) in Bayes Theorem is hard to compute directly, so we must compute it using the LTP. Here, we just combine two powerful formulae: Bayes Theorem and the Law of Total Probability:

Theorem 2.2.3: Bayes Theorem with the Law of Total Probability

Let events E1 , . . . , En partition the sample space Ω, and let F be another event. Then:

P(E1 | F) = P(F | E1) P(E1) / P(F)                                   [by bayes theorem]
          = P(F | E1) P(E1) / ( Σ_{i=1}^{n} P(F | Ei) P(Ei) )        [by the law of total probability]

In particular, in the case of a simple partition of Ω into E and E^C, if E is an event with nonzero probability, then:

P(E | F) = P(F | E) P(E) / P(F)                                      [by bayes theorem]
         = P(F | E) P(E) / ( P(F | E) P(E) + P(F | E^C) P(E^C) )     [by the law of total probability]
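A tiny Python helper for the two-case version of this formula (a sketch; the function name is made up for illustration):

def posterior(p_f_given_e, p_f_given_not_e, p_e):
    # P(E | F) via Bayes Theorem, with P(F) expanded by the LTP over {E, E^C}.
    p_not_e = 1 - p_e
    p_f = p_f_given_e * p_e + p_f_given_not_e * p_not_e   # law of total probability
    return p_f_given_e * p_e / p_f                        # bayes theorem

# For example, with the llama flu numbers from Exercise 1 below:
print(posterior(0.98, 0.02, 0.001))   # ≈ 0.0468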

2.2.5 Exercises
1. Suppose the llama flu disease has become increasingly common, and now 0.1% of the population has
it (1 in 1000 people). Suppose there is a test for it which is 98% accurate (i.e., 2% of the time it will
give the wrong answer). Given that you tested positive, what is the probability you have the disease?
Before any computation, think about what you think the answer might be.

Solution: Let L be the event you have the llama flu, and T be the event you test positive (T^C is the event you test negative). You are asked for P(L | T). We do know P(T | L) = 0.98 because if you have the llama flu, the probability you test positive is 98%. This gives us the hint to use Bayes Theorem!
We get that
P(L | T) = P(T | L) P(L) / P(T)
We are given P (T | L) = 0.98 and P (L) = 0.001, but how can we get P (T ), the probability of testing
positive? Well that depends on whether you have the disease or not. When you have two or more
cases (L and L^C), that's a hint to use the LTP! So we can write

P(T) = P(T | L) P(L) + P(T | L^C) P(L^C)

Again, interpret this as a weighted average of the probability of testing positive whether you had llama flu, P(T | L), or not, P(T | L^C), weighting by the probability you are in each of these cases, P(L) and P(L^C). We know P(L^C) = 0.999 since P(L^C) = 1 − P(L) (axiom of probability). But what about P(T | L^C)? This is the probability of testing positive given that you don't have llama flu, which is 0.02 or 2% (due to the 98% accuracy). Putting this all together, we get:

P(L | T) = P(T | L) P(L) / P(T)                                   [bayes theorem]
         = P(T | L) P(L) / ( P(T | L) P(L) + P(T | L^C) P(L^C) )  [LTP]
         = (0.98 · 0.001) / (0.98 · 0.001 + 0.02 · 0.999)
         ≈ 0.046756

Not even a 5% chance we have the disease, what a relief! But wait, how can that be? The test
is so accurate, and it said you were positive? This is because the prior probability of having the
disease P (L) was so low at 0.1% (actually this is pretty high for a disease rate). If you think about
it, the posterior probability we computed P (L | T ) is 47× larger than the prior probability P (L)
(P (L | T ) /P (L) ≈ 0.047/0.001 = 47), so the test did make it a lot more likely we had the disease after
all!
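As a sanity check, here is a short Monte Carlo sketch of this exercise in Python (it models "98% accurate" as reporting the truth with probability 0.98, as in the solution above); the fraction of positive testers who are actually sick comes out near 4.7%:

import random

trials = 1_000_000
positives = 0
sick_and_positive = 0

for _ in range(trials):
    sick = random.random() < 0.001          # P(L) = 0.001
    correct = random.random() < 0.98        # the test is right 98% of the time
    tests_positive = sick if correct else not sick

    if tests_positive:
        positives += 1
        sick_and_positive += sick

print(sick_and_positive / positives)   # ≈ 0.047, matching P(L | T) above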

2. Suppose we have four fair dice: one with three sides, one with four sides, one with five sides, and one with six sides (the numbering of an n-sided die is 1, 2, ..., n). We pick one of the four dice, each with equal probability, and roll the same die three times. We get all 4's. What is the probability we chose the 5-sided die to begin with?

Solution: Let Di be the event we rolled the i-sided die, for i = 3, 4, 5, 6. Notice that these
D3 , D4 , D5 , D6 partition the sample space.

P(D5 | 444) = P(444 | D5) P(D5) / P(444)                                                                      [by bayes theorem]
            = P(444 | D5) P(D5) / ( P(444 | D3) P(D3) + P(444 | D4) P(D4) + P(444 | D5) P(D5) + P(444 | D6) P(D6) )   [by ltp]
            = (1/5^3 · 1/4) / ( 0 · 1/4 + 1/4^3 · 1/4 + 1/5^3 · 1/4 + 1/6^3 · 1/4 )
            = (1/125) / ( 1/64 + 1/125 + 1/216 )
            = 1728/6103 ≈ 0.2831
Note that we compute P(444 | Di) by noting there's only one outcome where we get (4, 4, 4) out of the i^3 equally likely outcomes. This is true except when i = 3, where it's not possible to roll all 4's.
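The same computation as a direct Python translation of the formula above (not a simulation):

# P(444 | Di) = 1/i^3 for i >= 4; a 3-sided die can never show a 4.
sides = [3, 4, 5, 6]
p_pick = 1 / 4                                              # each die is equally likely
p_444_given_die = [0 if n < 4 else 1 / n**3 for n in sides]

p_444 = sum(p * p_pick for p in p_444_given_die)                      # LTP
p_d5_given_444 = p_444_given_die[sides.index(5)] * p_pick / p_444     # Bayes Theorem
print(p_d5_given_444)   # ≈ 0.2831 = 1728/6103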
