Chapter 2. Discrete Probability
2.2: Conditional Probability
Slides (Google Drive) Alex Tsun Video (YouTube)
Sometimes we would like to incorporate new information into our probability. For example, you may be
feeling symptoms of some disease, and so you take a test to see whether you have it or not. Let D be the
event you have the disease, and T be the event you test positive (T^C is the event you test negative). It could
be that P(D) = 0.01 (a 1% chance of having the disease without knowing anything). But how can we update
this probability given that we tested positive (or negative)? We write these as P(D | T) or P(D | T^C)
respectively. You would think P(D | T) > P(D), since you're more likely to have the disease once you test
positive, and P(D | T^C) < P(D), since you're less likely to have the disease once you test negative. These
are called conditional probabilities - they are the probability of an event, given that you know some other
event occurred. Is there a formula for updating P (D) given new information? Yes!
Let’s go back to the example of students in CSE312 liking donuts and ice cream. Recall we defined event
A as liking ice cream and event B as liking donuts. Then, remember we had 36 students that only like ice
cream (A ∩ B^C), 7 students that like both donuts and ice cream (A ∩ B), and 13 students that only like donuts
(B ∩ A^C). Let's also say that we have 14 students that don't like either (A^C ∩ B^C). That leaves us with the
following picture, which makes up the whole sample space:
Now, what if we asked the question, what’s the probability that someone likes ice cream, given that we
know they like donuts? We can approach this with the knowledge that 20 of the students like donuts (13
who don’t like ice cream and 7 who do). What this question is getting at, is: given the knowledge that
someone likes donuts, what is the chance that they also like ice cream? Well, 7 of the 20 who like donuts
like ice cream, so we are left with the probability 7/20. We write this as P(A | B) (read "the probability of A
given B") and in this case we have the following:
P(A | B) = 7/20
         = |A ∩ B| / |B|                    [|B| = 20 people like donuts, |A ∩ B| = 7 people like both]
         = (|A ∩ B| / |Ω|) / (|B| / |Ω|)    [divide top and bottom by |Ω|, which is equivalent]
         = P(A ∩ B) / P(B)                  [if we have equally likely outcomes]
This intuition (which worked only in the special case of equally likely outcomes) leads us to the definition of
conditional probability:
P(A | B) = P(A ∩ B) / P(B)
An equivalent and useful formula we can derive (by multiplying both sides by the denominator P(B),
and switching the sides of the equation) is:

P(A ∩ B) = P(A | B) P(B)
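To make the definition concrete, here is a quick Python sketch (variable names are my own, not from the text) that recomputes P(A | B) from the donut/ice-cream counts, using exact fractions so no rounding sneaks in:

```python
from fractions import Fraction

# Counts from the example: 36 only ice cream, 7 both, 13 only donuts, 14 neither
total = 36 + 7 + 13 + 14           # |Ω| = 70 students
p_B = Fraction(13 + 7, total)      # P(B): likes donuts
p_A_and_B = Fraction(7, total)     # P(A ∩ B): likes both

# Definition of conditional probability: P(A | B) = P(A ∩ B) / P(B)
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)                 # 7/20
```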
These definitions lead us to Bayes Theorem (proved below):

P(A | B) = P(B | A) P(A) / P(B)
Note that in the above P (A) is called the prior, which is our belief without knowing anything about
event B. P (A | B) is called the posterior, our belief after learning that event B occurred.
This theorem is important because it allows us to "reverse the conditioning"! Notice that both P(A | B)
and P(B | A) appear in this equation, on opposite sides. So if we know P(A) and P(B), and can more
easily calculate one of P(A | B) or P(B | A), we can use Bayes Theorem to derive the other.
Proof of Bayes Theorem. Recall the (alternate) definition of conditional probability from above:
P (A ∩ B) = P (A | B) P (B) (2.2.1)
P (B ∩ A) = P (B | A) P (A) (2.2.2)
But, because A ∩ B = B ∩ A (since these are the outcomes in both events A and B, and the order of
intersection does not matter), P (A ∩ B) = P (B ∩ A), so (2.2.1) and (2.2.2) are equal and we have (by
setting the right-hand sides equal):
P (A | B) P (B) = P (B | A) P (A)
Wow, I wish I was alive back then and had this important (and easy to prove) theorem named after me!
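Here is a short Python sketch (my own, reusing the donut/ice-cream numbers) of "reversing the conditioning": given P(A), P(B), and P(A | B), Bayes Theorem hands us P(B | A) directly:

```python
from fractions import Fraction

# Known quantities from the donut/ice-cream example (|Ω| = 70)
p_A = Fraction(43, 70)             # P(A): likes ice cream (36 + 7 students)
p_B = Fraction(20, 70)             # P(B): likes donuts (13 + 7 students)
p_A_given_B = Fraction(7, 20)      # P(A | B), computed earlier

# Bayes Theorem reverses the conditioning: P(B | A) = P(A | B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(p_B_given_A)                 # 7/43
```

This matches counting directly: 7 of the 43 ice-cream likers also like donuts.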
Example(s)
We'll investigate two slightly different questions whose answers don't seem like they should be different,
but are. Suppose a family has two children (who, at birth, were each equally likely to be male
or female). Let's say a telemarketer calls home and one of the two children picks up.
1. If the child who responded was male, and says “Let me get my older sibling”, what is the
probability that both children are male?
2. If the child who responded was male, and says “Let me get my other sibling”, what is the
probability that both children are male?
Solution There are four equally likely outcomes: MM, MF, FM, and FF (where M represents male, F
represents female, and the first letter is the sex of the younger child). Let A be the event both children are male.
1. In this part, we’re given that the younger sibling is male. So we can rule out 2 of the 4 outcomes
above and we’re left with MF and MM. Out of these two, in one of these cases we get MM, and so our
desired probability is 1/2.
More formally, let this event be B, which happens with probability 2/4 (2 out of 4 equally likely
outcomes). Then, P(A | B) = P(A ∩ B) / P(B) = (1/4) / (2/4) = 1/2, since P(A ∩ B) is the probability both children
are male, which happens in 1 out of 4 equally likely scenarios. This is because the older sibling's sex
is independent of the younger sibling's, so knowing the younger sibling is male doesn't change the
probability of the older sibling being male (which is what we computed just now).
2. In this part, we’re given that at least one sibling is male. That is, out of the 4 outcomes, we can only
rule out the FF option. Out of the remaining options MM, MF, and FM, only one has both siblings
being male. Hence, the probability desired is 1/3. You can do a similar more formal argument like we
did above!
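If the 1/2 versus 1/3 distinction still feels slippery, a quick Monte Carlo simulation (a sketch of my own, not from the text) lets you check both answers empirically:

```python
import random

# Simulate many two-child families; True = male, each child independently 50/50.
random.seed(42)
TRIALS = 100_000

b1_count = a_b1_count = 0   # part 1: condition on the younger child being male
b2_count = a_b2_count = 0   # part 2: condition on at least one child being male
for _ in range(TRIALS):
    younger_male = random.random() < 0.5
    older_male = random.random() < 0.5
    both_male = younger_male and older_male
    if younger_male:                    # part 1's conditioning event
        b1_count += 1
        a_b1_count += both_male
    if younger_male or older_male:      # part 2's conditioning event
        b2_count += 1
        a_b2_count += both_male

p_part1 = a_b1_count / b1_count   # should be near 1/2
p_part2 = a_b2_count / b2_count   # should be near 1/3
print(p_part1, p_part2)
```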
Let’s say you sign up for a chemistry class, but are assigned to one of three teachers randomly. Furthermore,
you know the probabilities you fail the class if you were to have each teacher (from historical results, or
word-of-mouth from classmates who have taken the class). Can we combine this information to compute the
overall probability that you fail chemistry (before you know which teacher you get)? Yes - using the law of
total probability below! We first need to define what a partition is.
Events E1, . . . , En partition the sample space Ω if they are mutually exclusive (Ei ∩ Ej = ∅ whenever
i ≠ j) and exhaustive (E1 ∪ · · · ∪ En = Ω).

You can see that partition is a very appropriate word here! Four disjoint events E1, . . . , E4 can cover
the sample space, and so can the pair of events E, E^C. This is useful when you know exactly one of a
few things will happen. For example, for the chemistry example, there might be only three teachers,
and you will be assigned to exactly one of them: at most one because you can't have two teachers
(mutually exclusive), and at least one because there aren't other teachers possible (exhaustive).
Now, suppose we have some event F which intersects with various events that form a partition of Ω. This
is illustrated by the picture below:
Notice that F is composed of its intersection with each of E1, E2, and E3, and so we can split F up into
smaller pieces. This means that we can write the following (green chunk F ∩ E1, plus pink chunk F ∩ E2,
plus yellow chunk F ∩ E3):
P (F ) = P (F ∩ E1 ) + P (F ∩ E2 ) + P (F ∩ E3 )
Note that F and E4 do not intersect, so F ∩ E4 = ∅. For completion, we can include E4 in the above
equation, because P (F ∩ E4 ) = 0. So, in all we have:
P (F ) = P (F ∩ E1 ) + P (F ∩ E2 ) + P (F ∩ E3 ) + P (F ∩ E4 )
In general, for a partition E1, . . . , En of Ω:

P(F) = P(F ∩ E1) + · · · + P(F ∩ En) = Σ_{i=1}^{n} P(F ∩ Ei)

Applying P(F ∩ Ei) = P(F | Ei) P(Ei) to each term, we get the law of total probability (LTP):

P(F) = P(F | E1) P(E1) + · · · + P(F | En) P(En) = Σ_{i=1}^{n} P(F | Ei) P(Ei)
That is, to compute the probability of an event F overall, suppose we have n disjoint cases E1, . . . , En
for which we can (easily) compute the probability of F in each of these cases (P(F | Ei)). Then, take
the weighted average of these probabilities, using the probabilities P(Ei) as weights (the probability
of being in each case).
Example(s)
Let’s consider an example in which we are trying to determine the probability that we fail chemistry.
Let’s call the event F failing, and consider the three events E1 for getting the Mean Teacher, E2 for
getting the Nice Teacher, and E3 for getting the Hard Teacher which partition the sample space. The
following table gives the relevant probabilities:

             E1 (Mean)   E2 (Nice)   E3 (Hard)
P(Ei)           6/8         1/8         1/8
P(F | Ei)        1           0          1/2
Solution Before doing anything, how are you liking your chances? There is a high probability (6/8) of getting
the Mean Teacher, and she will certainly fail you. Therefore, you should be pretty sad.
Now let’s do the computation. Notice that the first row sums to 1, as it must, since events E1 , E2 , E3
partition the sample space (you have exactly one of the three teachers). Using the Law of Total Probability
(LTP), we have the following:
P(F) = Σ_{i=1}^{3} P(F | Ei) P(Ei) = P(F | E1) P(E1) + P(F | E2) P(E2) + P(F | E3) P(E3)
     = 1 · 6/8 + 0 · 1/8 + 1/2 · 1/8
     = 13/16
Notice that to get the probability of failing, what we did was: consider the probability of failing in each of the
3 cases, and take a weighted average of them using the probability of each case. This is exactly what the law of
total probability lets us do!
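The same weighted average is easy to reproduce in a small Python sketch (variable names are mine), using exact fractions so the 13/16 comes out exactly:

```python
from fractions import Fraction

# Teacher assignment and per-teacher fail probabilities (Mean, Nice, Hard)
p_E = [Fraction(6, 8), Fraction(1, 8), Fraction(1, 8)]
p_F_given_E = [Fraction(1), Fraction(0), Fraction(1, 2)]

# Law of Total Probability: P(F) = sum over i of P(F | Ei) P(Ei)
p_F = sum(pf * pe for pf, pe in zip(p_F_given_E, p_E))
print(p_F)   # 13/16
```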
You might consider using the LTP when you know the probability of your desired event in each of several disjoint cases that cover the sample space.
Example(s)
Misfortune struck us and we ended up failing chemistry class. What is the probability that we had
the Hard Teacher given that we failed?
Solution First, this probability should be low intuitively, because if you failed, it was probably due to the
Mean Teacher (you are more likely to get them, AND they have a fail rate of 100%).
Start by writing out in a formula what you want to compute; in our case, it is P (E3 | F ) (getting the hard
teacher given that we failed). We know P (F | E3 ) and we want to solve for P (E3 | F ). This is a hint to use
Bayes Theorem since we can reverse the conditioning! Using that with the numbers from the table and the
previous question:
P(E3 | F) = P(F | E3) P(E3) / P(F)    [Bayes Theorem]
          = (1/2 · 1/8) / (13/16)
          = 1/13
Oftentimes, the denominator in Bayes Theorem is hard, so we must compute it using the LTP. Here, we just
combine two powerful formulae: Bayes Theorem and the Law of Total Probability:
Let events E1 , . . . , En partition the sample space Ω, and let F be another event. Then:
P(E1 | F) = P(F | E1) P(E1) / P(F)                             [by Bayes Theorem]
          = P(F | E1) P(E1) / Σ_{i=1}^{n} P(F | Ei) P(Ei)      [by the law of total probability]
In particular, in the case of a simple partition of Ω into E and E C , if E is an event with nonzero
probability, then:
P(E | F) = P(F | E) P(E) / P(F)                                     [by Bayes Theorem]
         = P(F | E) P(E) / (P(F | E) P(E) + P(F | E^C) P(E^C))      [by the law of total probability]
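Here is a sketch of how this combined formula might look in code (the helper name `posterior` is my own, and the value P(F | E^C) = 6/7 below is derived from the chemistry table: (1 · 6/8 + 0 · 1/8) / (7/8) = 6/7):

```python
from fractions import Fraction

def posterior(p_F_given_E, p_E, p_F_given_Ec):
    """P(E | F) by Bayes Theorem, with P(F) expanded by the LTP over E, E^C."""
    p_F = p_F_given_E * p_E + p_F_given_Ec * (1 - p_E)   # law of total probability
    return p_F_given_E * p_E / p_F                        # Bayes Theorem

# Chemistry example with E = E3 (Hard Teacher):
# P(F | E3) = 1/2, P(E3) = 1/8, P(F | E3^C) = 6/7
print(posterior(Fraction(1, 2), Fraction(1, 8), Fraction(6, 7)))   # 1/13
```

This reproduces the 1/13 we computed by hand for P(E3 | F).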
2.2.5 Exercises
1. Suppose the llama flu disease has become increasingly common, and now 0.1% of the population has
it (1 in 1000 people). Suppose there is a test for it which is 98% accurate (i.e., 2% of the time it will
give the wrong answer). Given that you tested positive, what is the probability you have the disease?
Before any computation, think about what you think the answer might be.
Solution: Let L be the event you have the llama flu, and T be the event you test positive (T^C is
the event you test negative). You are asked for P(L | T). We do know P(T | L) = 0.98, because if you
have the llama flu, the probability you test positive is 98%. This gives us the hint to use Bayes Theorem!
We get that
P(L | T) = P(T | L) P(L) / P(T)
We are given P(T | L) = 0.98 and P(L) = 0.001, but how can we get P(T), the probability of testing
positive? Well, that depends on whether you have the disease or not. When you have two or more
cases (L and L^C), that's a hint to use the LTP! So we can write

P(T) = P(T | L) P(L) + P(T | L^C) P(L^C)
Putting it together:

P(L | T) = P(T | L) P(L) / P(T)                                     [Bayes Theorem]
         = P(T | L) P(L) / (P(T | L) P(L) + P(T | L^C) P(L^C))      [LTP]
         = (0.98 · 0.001) / (0.98 · 0.001 + 0.02 · 0.999)
         ≈ 0.046756
Not even a 5% chance we have the disease, what a relief! But wait, how can that be? The test
is so accurate, and it said you were positive? This is because the prior probability of having the
disease P (L) was so low at 0.1% (actually this is pretty high for a disease rate). If you think about
it, the posterior probability we computed P (L | T ) is 47× larger than the prior probability P (L)
(P (L | T ) /P (L) ≈ 0.047/0.001 = 47), so the test did make it a lot more likely we had the disease after
all!
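A quick numeric check of this computation in Python (a sketch; the names are mine):

```python
# Llama flu numbers from the exercise
p_L = 0.001            # prior P(L): disease prevalence
p_T_given_L = 0.98     # P(T | L): test positive given sick
p_T_given_Lc = 0.02    # P(T | L^C): false positive rate

# P(T) by the LTP, then P(L | T) by Bayes Theorem
p_T = p_T_given_L * p_L + p_T_given_Lc * (1 - p_L)
p_L_given_T = p_T_given_L * p_L / p_T
print(round(p_L_given_T, 6))   # 0.046756
```

Playing with the prior `p_L` here is a good way to see why rare diseases make even accurate tests unconvincing.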
2. Suppose we have four fair dice: one with three sides, one with four sides, one with five sides, and one
with six sides (the numbering of an n-sided die is 1, 2, ..., n). We pick one of the four dice, each with
equal probability, and roll that same die three times. We get all 4's. What is the probability we chose
the 5-sided die to begin with?
Solution: Let Di be the event we picked the i-sided die, for i = 3, 4, 5, 6. Notice that these
events partition the sample space, since we pick exactly one die. Writing 444 for the event we roll three 4's, we want P(D5 | 444):
P(D5 | 444) = P(444 | D5) P(D5) / P(444)    [by Bayes Theorem]
            = P(444 | D5) P(D5) / (P(444 | D3) P(D3) + P(444 | D4) P(D4) + P(444 | D5) P(D5) + P(444 | D6) P(D6))    [by LTP]
            = (1/5^3 · 1/4) / (0 · 1/4 + 1/4^3 · 1/4 + 1/5^3 · 1/4 + 1/6^3 · 1/4)
            = (1/125) / (1/64 + 1/125 + 1/216)
            = 1728/6103 ≈ 0.2831
Note that we compute P (444|Di ) by noting there’s only one outcome where we get (4, 4, 4) out of the
i3 equally likely outcomes. This is true except when i = 3, where it’s not possible to roll all 4’s.
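Finally, a short Python sketch (my own) reproducing this answer exactly with fractions:

```python
from fractions import Fraction

sides = [3, 4, 5, 6]
p_pick = Fraction(1, 4)      # each die is picked with equal probability

def p_444_given(n):
    # One outcome (4, 4, 4) out of n^3; impossible on the 3-sided die
    return Fraction(1, n ** 3) if n >= 4 else Fraction(0)

p_444 = sum(p_444_given(n) * p_pick for n in sides)   # LTP
p_D5_given_444 = p_444_given(5) * p_pick / p_444      # Bayes Theorem
print(p_D5_given_444)   # 1728/6103
```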