03 Probability
In 2015 there were about 4 million babies born in the US, and 48.8% of the newborns
were girls.
We write: P(newborn is a girl)= 48.8%.
The probability of an event is defined as the proportion of times this event occurs in
many repetitions.
This is the standard definition of probability. It requires that the chance experiment
can be repeated many times.
While interned during the Second World War, John Kerrich tossed a coin 10,000 times
and observed 5,067 tosses resulting in heads.
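The long-run idea behind Kerrich's experiment can be sketched with a simulated coin. This is only an illustration: the function name `proportion_of_heads`, the fixed seed, and the assumption of a perfectly fair coin are choices made here, not part of the original experiment.

```python
import random

def proportion_of_heads(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times and return the proportion of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The proportion tends to settle near 1/2 as the number of tosses grows.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```

With few tosses the proportion can be far from 1/2; with many tosses it is typically close, which is what the long-run definition describes.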
What is probability?
The long-run interpretation of probability can make it difficult to interpret the
probability of a single event:
‘What I was wrong about this year’ by David Leonhardt (12/24/2017)
Sometimes people use a different interpretation:
‘The probability that my best friend calls today is 30%’.
Such a ‘subjective probability’ is not based on experiments, and different people may
assign different subjective probabilities to the same event.
Four basic rules
Probabilities are always between 0 and 1. Computing probabilities comes down to
applying a few basic rules repeatedly:
We saw that P(newborn is a girl)= 48.8%.
Therefore P(newborn is a boy)= 51.2%.
More formally, write A for an event, such as A=’newborn is a girl’.
Complement rule: P(A does not occur) = 1−P(A)
The next two rules make it possible to express probabilities of multiple events as those
of the individual events.
A and B are mutually exclusive if they cannot occur at the same time.
As an example, we roll a die twice.
A = [die face] on first roll, B = [another die face] on first roll, C = [die face] on second roll
A and B are mutually exclusive, but A and C are not.
Addition rule: If A and B are mutually exclusive, then
P(A or B) = P(A) + P(B)
Two events are independent if knowing that one occurs does not change the probability
that the other occurs.
B and C are independent, but A and B are not.
Multiplication rule: If A and B are independent, then
P(A and B) = P(A) P(B)
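The addition and multiplication rules can be checked by enumerating the 36 equally likely outcomes of two rolls. The sketch below uses hypothetical faces chosen only for illustration (A = a one on the first roll, B = a two on the first roll, C = a one on the second roll); the helper `prob` is likewise an assumption of this example.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first roll, second roll) pairs.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a true/false function of an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# Hypothetical faces, chosen only for illustration.
A = lambda o: o[0] == 1   # a one on the first roll
B = lambda o: o[0] == 2   # a two on the first roll
C = lambda o: o[1] == 1   # a one on the second roll

p_a_or_b = prob(lambda o: A(o) or B(o))
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_and_c = prob(lambda o: A(o) and C(o))
p_b_and_c = prob(lambda o: B(o) and C(o))

# A and B are mutually exclusive, so the addition rule applies.
print(p_a_or_b == prob(A) + prob(B))
# B and C are independent, so the multiplication rule applies.
print(p_b_and_c == prob(B) * prob(C))
# A and B are not independent: they can never occur together.
print(p_a_and_b, prob(A) * prob(B))
# A and C are not mutually exclusive: both can occur on the same pair of rolls.
print(p_a_and_c)
```

Using `Fraction` keeps the arithmetic exact, so the equalities can be compared directly rather than up to rounding.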
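Roll a die three times. What is P(at least one [die face])?
We can compute this with the complement rule and, because the three rolls are
independent, the multiplication rule:
$$
\begin{aligned}
P(\text{at least one [die face]}) &= 1 - P(\text{no [die face] in any of the three rolls}) \\
&= 1 - P(\text{no [die face] in first roll}) \times P(\text{no [die face] in second roll}) \times P(\text{no [die face] in third roll}) \\
&= 1 - \frac{5}{6} \times \frac{5}{6} \times \frac{5}{6} \\
&= 42.1\%
\end{aligned}
$$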
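As a cross-check, the same probability can be obtained by listing all 216 outcomes of three rolls. The sketch assumes, only for illustration, that the face of interest is a six (the constant `FACE` is hypothetical; any single face gives the same answer).

```python
from fractions import Fraction
from itertools import product

FACE = 6  # hypothetical choice of face; any single face gives the same result

# All 6 * 6 * 6 = 216 equally likely outcomes of three rolls.
rolls = list(product(range(1, 7), repeat=3))

# Direct count of outcomes that contain the face at least once.
p_enumerated = Fraction(sum(1 for r in rolls if FACE in r), len(rolls))

# Complement rule combined with the multiplication rule.
p_complement = 1 - Fraction(5, 6) ** 3

print(p_enumerated, p_complement, float(p_enumerated))
```

Both routes give 91/216, which is about 42.1%.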
Conditional probability
Spam e-mail is more likely to contain the word ‘money’ than ham (non-spam) e-mail:
P(‘money’ in e-mail | spam)= 8%, P(‘money’ in e-mail | ham)= 1%.
The conditional probability of B given A is
$$
P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}
$$
General multiplication rule: P(A and B) = P(A) P(B|A)
In the special case where A and B are independent: P(A and B) = P(A) P(B)
Computing probabilities by total enumeration:
P(spam) = 20%. What is the probability that ‘money’ appears in an e-mail?
From data we know P(money | spam)= 8%, P(money | ham)= 1%.
The idea is to artificially introduce the event ‘spam/ham’.
The event ‘money appears in the e-mail’ can be written as:
(money appears and e-mail is spam) or (money appears and e-mail is ham)
These two events are mutually exclusive, so the addition rule applies:
P(money appears)
=P(money and spam) + P(money and ham)
=P(money | spam) P(spam) + P(money | ham) P(ham)
= 0.08 × 0.2 + 0.01 × 0.8
= 2.4%
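The enumeration over spam and ham can be written as a small function. This is a minimal sketch; the name `total_probability` and the dictionary layout are assumptions made for this example.

```python
def total_probability(p_event_given_case, p_case):
    """Sum P(event | case) * P(case) over cases that are mutually exclusive
    and together cover every possibility."""
    if abs(sum(p_case.values()) - 1.0) > 1e-9:
        raise ValueError("case probabilities must sum to 1")
    return sum(p_event_given_case[c] * p_case[c] for c in p_case)

p_money = total_probability(
    p_event_given_case={"spam": 0.08, "ham": 0.01},
    p_case={"spam": 0.2, "ham": 0.8},
)
print(round(p_money, 3))  # 0.024
```

The check that the case probabilities sum to 1 guards against leaving out a case, which is the usual way this calculation goes wrong.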
Bayes’ rule
From data we know P(money appears in e-mail | e-mail is spam)= 8%, but what we
need to build a spam filter is P(e-mail is spam | money appears in e-mail).
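$$
P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} = \frac{P(B \text{ and } A)}{P(A)} = \frac{P(A \mid B)\, P(B)}{P(A)}
$$
$$
P(\text{spam} \mid \text{money}) = \frac{P(\text{money} \mid \text{spam})\, P(\text{spam})}{P(\text{money})} = \frac{0.08 \times 0.2}{0.024} = 67\%
$$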
Bayes’ rule:
$$
P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)} = \frac{P(A \mid B)\, P(B)}{P(A \mid B)\, P(B) + P(A \mid \text{not } B)\, P(\text{not } B)}
$$
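Bayes’ rule with the expanded denominator translates directly into code. The sketch below is one possible way to write it; the function name `bayes_posterior` and its parameter names are assumptions of this example.

```python
def bayes_posterior(p_a_given_b, p_b, p_a_given_not_b):
    """Return P(B | A) using Bayes' rule with the denominator P(A)
    expanded over 'B' and 'not B'."""
    numerator = p_a_given_b * p_b
    denominator = numerator + p_a_given_not_b * (1 - p_b)
    return numerator / denominator

# Spam example: A = 'money appears in e-mail', B = 'e-mail is spam'.
p_spam_given_money = bayes_posterior(p_a_given_b=0.08, p_b=0.2, p_a_given_not_b=0.01)
print(round(p_spam_given_money, 2))  # 0.67
```

Expanding the denominator means only the three quantities known from data are needed; P(money) does not have to be supplied separately.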
Bayesian analysis
Examples and case studies: False positives
1% of the population has a certain disease. If an infected person is tested, then there is
a 95% chance that the test is positive. If the person is not infected, then there is a 2%
chance that the test gives an erroneous positive result (‘false positive’).
Given that a person tests positive, what is the probability that the person has the disease?
Know P(D)= 1% , P(+|D)= 95% , P(+|no D)= 2%.
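Want
$$
\begin{aligned}
P(D \mid +) &= \frac{P(+ \mid D)\, P(D)}{P(+)} \\
&= \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid \text{no } D)\, P(\text{no } D)} \\
&= \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.02 \times 0.99} \\
&= 32.4\%
\end{aligned}
$$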
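One way to see why the answer is so low is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical population of 100,000 in which the stated rates hold exactly; the population size is an illustration only.

```python
population = 100_000  # hypothetical population size, chosen so the counts are whole numbers

diseased = population * 1 // 100       # 1% have the disease
healthy = population - diseased        # the remaining 99%

true_positives = diseased * 95 // 100  # 95% of diseased people test positive
false_positives = healthy * 2 // 100   # 2% of healthy people test positive

# Among everyone who tests positive, the share who actually have the disease.
p_disease_given_positive = true_positives / (true_positives + false_positives)

print(true_positives, false_positives, round(p_disease_given_positive, 3))
```

In this hypothetical population there are 950 true positives but 1,980 false positives: the disease is rare, so the healthy group is large enough that even a 2% error rate produces more positives than the diseased group does.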
Case study: Warner’s randomized response model
What percentage of students have cheated during an exam in college?
Problem: Students may be too embarrassed to answer truthfully.
Randomization comes to the rescue:
We do a survey that first instructs students to toss a coin twice. If the student gets
‘tails’ on the first toss, then the student has to answer question 1, otherwise the
student answers question 2.
Q1: Have you ever cheated on an exam in college?
Q2: Did you get ‘tails’ on the second toss?
So the answer will be partly random: We don’t know whether a ‘yes’ answer is due to
the student cheating or to getting tails on the second toss. This should put the student
at ease to answer truthfully.
Key point: While we don’t know what an individual ‘yes’ means, we can estimate the
proportion of cheaters using all the answers collectively:
P(yes) = P(yes and Q1) + P(yes and Q2)
= P(yes | Q1) P(Q1) + P(yes | Q2) P(Q2)
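With a fair coin, P(Q1) = P(Q2) = 1/2 and P(yes | Q2) = 1/2, so the equation above can be solved for the quantity of interest: P(yes | Q1) = 2 P(yes) − 1/2. The simulation below is a sketch of how that estimate behaves; the true cheating rate of 30%, the sample size, the seed, and both function names are hypothetical choices, and it assumes every student follows the instructions and answers truthfully.

```python
import random

def estimate_cheating_rate(answers):
    """Solve P(yes) = 0.5 * P(yes | Q1) + 0.5 * 0.5 for P(yes | Q1)."""
    p_yes = sum(answers) / len(answers)
    return 2 * p_yes - 0.5

def simulate_survey(true_rate, n_students, seed=0):
    """Simulate yes/no answers under the randomized response design."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_students):
        first_toss_is_tails = rng.random() < 0.5
        if first_toss_is_tails:
            # Question 1: answered truthfully, 'yes' with the true cheating rate.
            answers.append(rng.random() < true_rate)
        else:
            # Question 2: 'yes' exactly when the second toss is tails.
            answers.append(rng.random() < 0.5)
    return answers

answers = simulate_survey(true_rate=0.3, n_students=100_000)
estimate = estimate_cheating_rate(answers)
print(round(estimate, 3))
```

The estimate uses only the overall proportion of ‘yes’ answers, so no individual answer reveals whether that student cheated.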