Artificial Intelligence
Uncertainty and
Probabilities
AIMA Chapter 12
Problems:
• Partial observability (road state, other drivers' plans, etc.)
• Noisy sensors (traffic reports)
• Uncertainty in action outcomes (flat tire, etc.)
• Complexity of modeling and predicting traffic
A purely logical approach leads to conclusions that are too weak for effective decision
making:
• A25 will get me there on time if there is no accident on the bridge and it doesn't
rain and my tires remain intact, etc., etc.
• A∞ is guaranteed to get me there on time, but who lives forever?
• Laziness: failure to enumerate exceptions, qualifications, etc.
Making Decisions Under Uncertainty
Random Variables
• R: Is it raining?
• W: What’s the weather?
• Die: What is the outcome of rolling two dice?
• V: What is the speed of my car (in MPH)?
Domain
• R ∈ {True, False}
• W ∈ {Sunny, Cloudy, Rainy, Snow}
• Die ∈ {(1,1), (1,2), … (6,6)}
• V ∈ [0, 200]
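As a rough illustration (the dictionary below and its name are my own, not from the slides), these variables and their domains can be written directly as Python data:

```python
# Random variables and their domains: discrete domains as lists,
# the continuous speed V as an interval (lower, upper) in MPH.
domains = {
    "R":   [True, False],                                        # Is it raining?
    "W":   ["Sunny", "Cloudy", "Rainy", "Snow"],                 # Weather
    "Die": [(i, j) for i in range(1, 7) for j in range(1, 7)],   # Two dice
    "V":   (0, 200),                                             # Speed in MPH
}

print(len(domains["Die"]))  # 36 possible outcomes for the two dice
```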
Events and Propositions
Probabilistic statements are defined over events: world states or sets of states. Events are described using propositions:
• "It is raining": R = True
• "The weather is either cloudy or snowy": W = Cloudy ∨ W = Snowy
• "The sum of the two dice rolls is 11": Die ∈ {(5,6), (6,5)}
• "My car is going between 30 and 50 miles per hour": 30 ≤ V ≤ 50
Notation:
• P(X = x), P_X(x), or P(x) for short, is the probability of the event that random variable X takes on the value x.
• For propositions it means the probability of the set of possible worlds in which the proposition holds.
Kolmogorov’s 3 Axioms of Probability
• 0 ≤ P(a) ≤ 1 for every proposition a
• P(true) = 1 and P(false) = 0
• P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
Notation:
• P(x), P(X = x) is the probability that random variable X takes on value x.
• 𝑷(X) (bold P) is the distribution of probabilities over all possible values of X. Often we are lazy or forget to make P bold.
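A tiny sketch (function and variable names are hypothetical) of checking these requirements for a finite distribution given as a dict from values to probabilities:

```python
def is_valid_distribution(p, tol=1e-9):
    """Check that all probabilities lie in [0, 1] and that they sum to 1."""
    in_range = all(0.0 <= prob <= 1.0 for prob in p.values())
    sums_to_one = abs(sum(p.values()) - 1.0) < tol
    return in_range and sums_to_one

weather = {"Sunny": 0.6, "Cloudy": 0.29, "Rainy": 0.1, "Snow": 0.01}
print(is_valid_distribution(weather))  # True
```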
Marginal probability distributions
Example: P(Cavity) and P(Toothache) are the marginal distributions of the joint distribution P(Cavity, Toothache).
Marginalization (summing out Y):
P(X = x) = P((X = x ∧ Y = y_1) ∨ ⋯ ∨ (X = x ∧ Y = y_n)) = Σ_{i=1}^{n} P(X = x, Y = y_i)
Conditional probability: P(A | B) = P(A, B) / P(B)
[Figure: the joint probability distribution P(Cavity, Toothache) with its marginal distributions P(Cavity) and P(Toothache).]
Select P(X, Y = y), e.g., P(Toothache, Cavity = false):
  Toothache = false: 0.8
  Toothache = true:  0.1
The entries sum to P(Y = y) = P(Cavity = false) = 0.9.
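A minimal sketch of marginalization and conditioning on a small joint table. The Cavity = false column matches the numbers above; the Cavity = true cells are invented so the table sums to 1:

```python
# Joint distribution P(Toothache, Cavity) as (toothache, cavity) -> probability.
joint = {
    (True,  False): 0.10, (False, False): 0.80,   # Cavity = false column (as above)
    (True,  True):  0.05, (False, True):  0.05,   # Cavity = true column (invented)
}

def p_toothache(t):
    """Marginal P(Toothache = t): sum out Cavity."""
    return sum(p for (tt, _), p in joint.items() if tt == t)

def p_toothache_given_cavity(t, c):
    """Conditional P(Toothache = t | Cavity = c) = P(t, c) / P(c)."""
    p_c = sum(p for (_, cc), p in joint.items() if cc == c)
    return joint[(t, c)] / p_c

print(p_toothache(True))                      # 0.15
print(p_toothache_given_cavity(True, False))  # 0.1 / 0.9 ≈ 0.111
```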
Product rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Therefore, P(A | B) = P(B | A) P(A) / P(B),
  where P(A | B) is the posterior probability and P(A) is the prior probability.
Rev. Thomas Bayes (1702-1761)
• Why is this useful?
Bayes' rule gives the posterior probability:
P(A | B) = P(B | A) P(A) / P(B)
P(Rain | Predict) = P(Predict | Rain) P(Rain) / P(Predict)
                  = P(Predict | Rain) P(Rain) / (P(Predict | Rain) P(Rain) + P(Predict | ¬Rain) P(¬Rain))
                  = (0.9 × 0.014) / (0.9 × 0.014 + 0.1 × 0.986) ≈ 0.113
The weather forecast updates her belief from 0.014 to about 0.113.
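The same update as a quick sketch (numbers from the example above; the function name is my own), with the denominator expanded over Rain and ¬Rain:

```python
def posterior_rain(p_predict_given_rain, p_predict_given_no_rain, p_rain):
    """P(Rain | Predict) via Bayes' rule, expanding
    P(Predict) = P(Predict|Rain)P(Rain) + P(Predict|not Rain)P(not Rain)."""
    numerator = p_predict_given_rain * p_rain
    evidence = numerator + p_predict_given_no_rain * (1.0 - p_rain)
    return numerator / evidence

print(posterior_rain(0.9, 0.1, 0.014))  # ≈ 0.113
```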
Normalization Trick
• Instead of computing P(B) directly, compute P(B | A) P(A) for every value of A and normalize so the results sum to 1: P(A | B) ∝ P(B | A) P(A).
Issue With Applying Bayes’ Theorem
• The joint probability table and the tables with conditional probabilities are typically too large!
For n random variables with a domain size of d each, we have a table of size O(d^n). This is a problem for
• storing the table, and
• estimating the probabilities from data (we need lots of data).
Independence
P(Coin_1, …, Coin_n) = P(Coin_1) × ⋯ × P(Coin_n) = ∏_i P(Coin_i)
• We need one parameter per coin (the chance of getting H).
• Independence reduces the number of parameters needed to specify the joint distribution from 2^n − 1 to n.
• Note: If we have identical (iid) coins, then we only need 2 numbers: the probability of H and the number of coins.
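A short sketch (the coin biases are invented) of how independence lets us compute any joint entry from n per-coin parameters instead of storing a 2^n table:

```python
from itertools import product

p_heads = [0.5, 0.7, 0.3]   # one parameter per coin (n = 3, instead of 2**3 - 1 = 7)

def joint_probability(outcome):
    """P(Coin_1, ..., Coin_n) = product of the individual P(Coin_i)."""
    prob = 1.0
    for p, o in zip(p_heads, outcome):
        prob *= p if o == "H" else (1.0 - p)
    return prob

print(joint_probability(("H", "T", "H")))                          # 0.5 * 0.3 * 0.3 = 0.045
print(sum(joint_probability(o) for o in product("HT", repeat=3)))  # 1.0
```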
Conditional Independence
• If the patient has a cavity, the probability that the probe catches in it does
not depend on whether he/she has a toothache
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
• Therefore, Catch is conditionally independent of Toothache given Cavity
• Likewise, Toothache is conditionally independent of Catch given Cavity
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
Decomposition of the Joint Probability Distribution
• Conditional independence and the chain rule (with Cavity the common cause of Toothache and Catch) give:
P(Toothache, Catch, Cavity) = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
                            = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
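A brief sketch (all probability values are invented) of rebuilding the full joint from the three small factors in this decomposition:

```python
from itertools import product

p_cavity = {True: 0.2, False: 0.8}
p_toothache_given_cavity = {True: 0.6, False: 0.1}   # P(Toothache = true | Cavity)
p_catch_given_cavity     = {True: 0.9, False: 0.2}   # P(Catch = true | Cavity)

def joint(toothache, catch, cavity):
    """P(Toothache, Catch, Cavity) = P(Toothache|Cavity) P(Catch|Cavity) P(Cavity)."""
    pt = p_toothache_given_cavity[cavity] if toothache else 1 - p_toothache_given_cavity[cavity]
    pc = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
    return pt * pc * p_cavity[cavity]

# Only 1 + 2 + 2 = 5 parameters specify all 2**3 joint entries, which still sum to 1.
print(sum(joint(t, c, cav) for t, c, cav in product([True, False], repeat=3)))  # 1.0
```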
Estimating a hidden variable X from evidence E. Examples:
• x ∈ {zebra, giraffe, hippo}, e = image features
• x ∈ {spam, not spam}, e = email message
Notation: We use x̂ for an estimate and x* for the best estimate.
Bayes Decision Rule
• Assumption: The agent has a loss function, which is 0 if the
value of X (x) is guessed correctly, and 1 otherwise.
L(x, x̂) = 1 if x̂ ≠ x, and 0 otherwise   (0/1 loss)
• The value for X that minimizes the expected loss is the one that
has the greatest posterior probability given the evidence.
x* = argmax_x P(X = x | E = e) = argmax_x P(e | x) P(x) / P(e) = argmax_x P(e | x) P(x),
since P(e) is fixed for a given example and can be dropped.
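A compact sketch of the resulting MAP decision; the animal example and all numbers below are hypothetical:

```python
def map_estimate(prior, likelihood):
    """Return argmax_x P(e | x) P(x), i.e., the value with the greatest posterior."""
    return max(prior, key=lambda x: likelihood[x] * prior[x])

prior      = {"zebra": 0.3, "giraffe": 0.5, "hippo": 0.2}    # P(x)
likelihood = {"zebra": 0.05, "giraffe": 0.02, "hippo": 0.20}  # P(e | x) for the observed features

print(map_estimate(prior, likelihood))  # 'hippo' (0.2 * 0.20 = 0.04 is the largest product)
```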
• If each feature can take on k values, how many entries are in the
joint probability table 𝑃(𝑓1 , … , 𝑓𝑛 , ℎ)?
• Decision: Spam if
𝑃(𝐻 = 𝑠𝑝𝑎𝑚 | 𝑚𝑒𝑠𝑠𝑎𝑔𝑒) > 𝑃(𝐻 = ¬𝑠𝑝𝑎𝑚 | 𝑚𝑒𝑠𝑠𝑎𝑔𝑒)
equivalent to 𝑎𝑟𝑔𝑚𝑎𝑥ℎ 𝑃(ℎ|𝑚𝑒𝑠𝑠𝑎𝑔𝑒)
Parameter Estimation
Count in training data, with add-one (Laplace) smoothing for low counts:
P(H = spam) = (# of spam messages + 1) / (total # of messages + # of classes)
P(w_i = 1 | H = spam) = (# of spam messages that contain the word + 1) / (total # of spam messages + # of classes)
Example result for a message: spam: 0.33, ¬spam: 0.67
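A minimal naive Bayes spam-filter sketch following the counts above; the toy training messages are invented, features are word occurrences, and add-one smoothing matches the Parameter Estimation formulas:

```python
from collections import Counter

# Toy training data (invented): (set of words in the message, label).
training = [
    ({"win", "money", "now"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"money", "transfer"}, "spam"),
    ({"lunch", "tomorrow"}, "ham"),
]
classes = ["spam", "ham"]
label_counts = Counter(label for _, label in training)

def prior(h):
    """P(H = h) = (# messages of class h + 1) / (total # messages + # classes)."""
    return (label_counts[h] + 1) / (len(training) + len(classes))

def word_likelihood(word, h):
    """P(w = 1 | H = h) = (# class-h messages containing w + 1) / (# class-h messages + # classes)."""
    containing = sum(1 for words, label in training if label == h and word in words)
    return (containing + 1) / (label_counts[h] + len(classes))

def classify(message_words):
    """argmax_h P(h) * prod_w P(w = 1 | h), over the words present in the message."""
    def score(h):
        s = prior(h)
        for w in message_words:
            s *= word_likelihood(w, h)
        return s
    return max(classes, key=score)

print(classify({"money", "now"}))      # 'spam'
print(classify({"meeting", "lunch"}))  # 'ham'
```

For simplicity this sketch only scores the words that occur in a message; a full naive Bayes model would also multiply in P(w = 0 | h) for the words that are absent.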