Artificial Intelligence
Uncertainty and
Probabilities
AIMA Chapter 12
Problems:
• Partial observability (road state, other drivers' plans, etc.)
• Noisy sensors (traffic reports)
• Uncertainty in action outcomes (flat tire, etc.)
• Complexity of modeling and predicting traffic
A purely logical approach leads to conclusions that are too weak for effective decision
making:
• A25 will get me there on time if there is no accident on the bridge and it doesn't
rain and my tires remain intact, etc., etc.
• A∞ is guaranteed to get me there on time, but who lives forever?
• Laziness: failure to enumerate exceptions, qualifications, etc.
Making Decisions Under Uncertainty
Random Variables
• R: Is it raining?
• W: What’s the weather?
• Die: What is the outcome of rolling two dice?
• V: What is the speed of my car (in MPH)?
Domain
• R ∈ {True, False}
• W ∈ {Sunny, Cloudy, Rainy, Snow}
• Die ∈ {(1,1), (1,2), … (6,6)}
• V ∈ [0, 200]
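As a rough illustration (the dictionary below and its name are my own, not from the slides), these variables and their domains can be written directly as Python data:

```python
# Random variables and their domains: discrete domains as lists,
# the continuous speed V as an interval (lower, upper) in MPH.
domains = {
    "R":   [True, False],                                        # Is it raining?
    "W":   ["Sunny", "Cloudy", "Rainy", "Snow"],                 # Weather
    "Die": [(i, j) for i in range(1, 7) for j in range(1, 7)],   # Two dice
    "V":   (0, 200),                                             # Speed in MPH
}

print(len(domains["Die"]))  # 36 possible outcomes for the two dice
```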
Events and Propositions
Probabilistic statements are defined over events: world states or sets of states. Events are described using propositions:
• "It is raining": R = True
• "The weather is either cloudy or snowy": W = Cloudy ∨ W = Snowy
• "The sum of the two dice rolls is 11": Die ∈ {(5,6), (6,5)}
• "My car is going between 30 and 50 miles per hour": 30 ≤ V ≤ 50
Notation:
• P(X = x), P_X(x), or P(x) for short, is the probability of the event that random variable X takes on the value x.
• For propositions it means the probability of the set of possible worlds in which the proposition holds.
Kolmogorov’s 3 Axioms of Probability
• 0 ≤ P(a) ≤ 1 for every proposition a
• P(true) = 1 and P(false) = 0
• P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
Notation:
• P(x), P(X = x) is the probability that random variable X takes on value x.
• 𝑷(X) (bold P) is the distribution of probabilities over all possible values of X. Often we are lazy or forget to make P bold.
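A tiny sketch (function and variable names are hypothetical) of checking these requirements for a finite distribution given as a dict from values to probabilities:

```python
def is_valid_distribution(p, tol=1e-9):
    """Check that all probabilities lie in [0, 1] and that they sum to 1."""
    in_range = all(0.0 <= prob <= 1.0 for prob in p.values())
    sums_to_one = abs(sum(p.values()) - 1.0) < tol
    return in_range and sums_to_one

weather = {"Sunny": 0.6, "Cloudy": 0.29, "Rainy": 0.1, "Snow": 0.01}
print(is_valid_distribution(weather))  # True
```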
Marginal probability distributions
Example: P(Cavity) and P(Toothache) are the marginal distributions of the joint distribution P(Cavity, Toothache).
Marginalization (summing out Y):
P(X = x) = P((X = x ∧ Y = y_1) ∨ ⋯ ∨ (X = x ∧ Y = y_n)) = Σ_{i=1}^{n} P(X = x, Y = y_i)
Conditional probability: P(A | B) = P(A, B) / P(B)
[Figure: the joint probability distribution P(Cavity, Toothache) with its marginal distributions P(Cavity) and P(Toothache).]
Select P(X, Y = y), e.g., P(Toothache, Cavity = false):
  Toothache = false: 0.8
  Toothache = true:  0.1
The entries sum to P(Y = y) = P(Cavity = false) = 0.9.
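A minimal sketch of marginalization and conditioning on a small joint table. The Cavity = false column matches the numbers above; the Cavity = true cells are invented so the table sums to 1:

```python
# Joint distribution P(Toothache, Cavity) as (toothache, cavity) -> probability.
joint = {
    (True,  False): 0.10, (False, False): 0.80,   # Cavity = false column (as above)
    (True,  True):  0.05, (False, True):  0.05,   # Cavity = true column (invented)
}

def p_toothache(t):
    """Marginal P(Toothache = t): sum out Cavity."""
    return sum(p for (tt, _), p in joint.items() if tt == t)

def p_toothache_given_cavity(t, c):
    """Conditional P(Toothache = t | Cavity = c) = P(t, c) / P(c)."""
    p_c = sum(p for (_, cc), p in joint.items() if cc == c)
    return joint[(t, c)] / p_c

print(p_toothache(True))                      # 0.15
print(p_toothache_given_cavity(True, False))  # 0.1 / 0.9 ≈ 0.111
```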
Product rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Therefore, P(A | B) = P(B | A) P(A) / P(B),
  where P(A | B) is the posterior probability and P(A) is the prior probability.
Rev. Thomas Bayes (1702-1761)
• Why is this useful?
Bayes' rule gives the posterior probability:
P(A | B) = P(B | A) P(A) / P(B)
P(Rain | Predict) = P(Predict | Rain) P(Rain) / P(Predict)
                  = P(Predict | Rain) P(Rain) / (P(Predict | Rain) P(Rain) + P(Predict | ¬Rain) P(¬Rain))
                  = (0.9 × 0.014) / (0.9 × 0.014 + 0.1 × 0.986) ≈ 0.113
The weather forecast updates her belief from 0.014 to about 0.113.
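The same update as a quick sketch (numbers from the example above; the function name is my own), with the denominator expanded over Rain and ¬Rain:

```python
def posterior_rain(p_predict_given_rain, p_predict_given_no_rain, p_rain):
    """P(Rain | Predict) via Bayes' rule, expanding
    P(Predict) = P(Predict|Rain)P(Rain) + P(Predict|not Rain)P(not Rain)."""
    numerator = p_predict_given_rain * p_rain
    evidence = numerator + p_predict_given_no_rain * (1.0 - p_rain)
    return numerator / evidence

print(posterior_rain(0.9, 0.1, 0.014))  # ≈ 0.113
```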
Normalization Trick
• Instead of computing P(B) directly, compute P(B | A) P(A) for every value of A and normalize so the results sum to 1: P(A | B) ∝ P(B | A) P(A).
Issue With Applying Bayes’ Theorem
• The joint probability table and the tables with conditional probabilities are typically too large!
For n random variables with a domain size of d each, we have a table of size O(d^n). This is a problem for
• storing the table, and
• estimating the probabilities from data (we need lots of data).
Independence
P(Coin_1, …, Coin_n) = P(Coin_1) × ⋯ × P(Coin_n) = ∏_i P(Coin_i)
• We need one parameter per coin (the chance of getting H).
• Independence reduces the number of parameters needed to specify the joint distribution from 2^n − 1 to n.
• Note: If we have identical (iid) coins, then we only need 2 numbers: the probability of H and the number of coins.
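A short sketch (the coin biases are invented) of how independence lets us compute any joint entry from n per-coin parameters instead of storing a 2^n table:

```python
from itertools import product

p_heads = [0.5, 0.7, 0.3]   # one parameter per coin (n = 3, instead of 2**3 - 1 = 7)

def joint_probability(outcome):
    """P(Coin_1, ..., Coin_n) = product of the individual P(Coin_i)."""
    prob = 1.0
    for p, o in zip(p_heads, outcome):
        prob *= p if o == "H" else (1.0 - p)
    return prob

print(joint_probability(("H", "T", "H")))                          # 0.5 * 0.3 * 0.3 = 0.045
print(sum(joint_probability(o) for o in product("HT", repeat=3)))  # 1.0
```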
Conditional Independence
• If the patient has a cavity, the probability that the probe catches in it does
not depend on whether he/she has a toothache
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
• Therefore, Catch is conditionally independent of Toothache given Cavity
• Likewise, Toothache is conditionally independent of Catch given Cavity
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
Decomposition of the Joint Probability Distribution
• Conditional independence and the chain rule (with Cavity the common cause of Toothache and Catch) give:
P(Toothache, Catch, Cavity) = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
                            = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
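A brief sketch (all probability values are invented) of rebuilding the full joint from the three small factors in this decomposition:

```python
from itertools import product

p_cavity = {True: 0.2, False: 0.8}
p_toothache_given_cavity = {True: 0.6, False: 0.1}   # P(Toothache = true | Cavity)
p_catch_given_cavity     = {True: 0.9, False: 0.2}   # P(Catch = true | Cavity)

def joint(toothache, catch, cavity):
    """P(Toothache, Catch, Cavity) = P(Toothache|Cavity) P(Catch|Cavity) P(Cavity)."""
    pt = p_toothache_given_cavity[cavity] if toothache else 1 - p_toothache_given_cavity[cavity]
    pc = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
    return pt * pc * p_cavity[cavity]

# Only 1 + 2 + 2 = 5 parameters specify all 2**3 joint entries, which still sum to 1.
print(sum(joint(t, c, cav) for t, c, cav in product([True, False], repeat=3)))  # 1.0
```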
Estimating a hidden variable X from evidence E. Examples:
• x ∈ {zebra, giraffe, hippo}, e = image features
• x ∈ {spam, not spam}, e = email message
Notation: We use x̂ for an estimate and x* for the best estimate.
Bayes Decision Rule
• Assumption: The agent has a loss function, which is 0 if the
value of X (x) is guessed correctly, and 1 otherwise.
L(x, x̂) = 1 if x̂ ≠ x, and 0 otherwise   (0/1 loss)
• The value for X that minimizes the expected loss is the one that
has the greatest posterior probability given the evidence.
x* = argmax_x P(X = x | E = e) = argmax_x P(e | x) P(x) / P(e) = argmax_x P(e | x) P(x),
since P(e) is fixed for a given example and can be dropped.
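A compact sketch of the resulting MAP decision; the animal example and all numbers below are hypothetical:

```python
def map_estimate(prior, likelihood):
    """Return argmax_x P(e | x) P(x), i.e., the value with the greatest posterior."""
    return max(prior, key=lambda x: likelihood[x] * prior[x])

prior      = {"zebra": 0.3, "giraffe": 0.5, "hippo": 0.2}    # P(x)
likelihood = {"zebra": 0.05, "giraffe": 0.02, "hippo": 0.20}  # P(e | x) for the observed features

print(map_estimate(prior, likelihood))  # 'hippo' (0.2 * 0.20 = 0.04 is the largest product)
```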
• If each feature can take on k values, how many entries are in the
joint probability table 𝑃(𝑓1 , … , 𝑓𝑛 , ℎ)?
• Decision: Spam if
𝑃(𝐻 = 𝑠𝑝𝑎𝑚 | 𝑚𝑒𝑠𝑠𝑎𝑔𝑒) > 𝑃(𝐻 = ¬𝑠𝑝𝑎𝑚 | 𝑚𝑒𝑠𝑠𝑎𝑔𝑒)
equivalent to 𝑎𝑟𝑔𝑚𝑎𝑥ℎ 𝑃(ℎ|𝑚𝑒𝑠𝑠𝑎𝑔𝑒)
Parameter Estimation
Count in training data, with add-one (Laplace) smoothing for low counts:
P(H = spam) = (# of spam messages + 1) / (total # of messages + # of classes)
P(w_i = 1 | H = spam) = (# of spam messages that contain the word + 1) / (total # of spam messages + # of classes)
Example result for a message: spam: 0.33, ¬spam: 0.67
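A minimal naive Bayes spam-filter sketch following the counts above; the toy training messages are invented, features are word occurrences, and add-one smoothing matches the Parameter Estimation formulas:

```python
from collections import Counter

# Toy training data (invented): (set of words in the message, label).
training = [
    ({"win", "money", "now"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"money", "transfer"}, "spam"),
    ({"lunch", "tomorrow"}, "ham"),
]
classes = ["spam", "ham"]
label_counts = Counter(label for _, label in training)

def prior(h):
    """P(H = h) = (# messages of class h + 1) / (total # messages + # classes)."""
    return (label_counts[h] + 1) / (len(training) + len(classes))

def word_likelihood(word, h):
    """P(w = 1 | H = h) = (# class-h messages containing w + 1) / (# class-h messages + # classes)."""
    containing = sum(1 for words, label in training if label == h and word in words)
    return (containing + 1) / (label_counts[h] + len(classes))

def classify(message_words):
    """argmax_h P(h) * prod_w P(w = 1 | h), over the words present in the message."""
    def score(h):
        s = prior(h)
        for w in message_words:
            s *= word_likelihood(w, h)
        return s
    return max(classes, key=score)

print(classify({"money", "now"}))      # 'spam'
print(classify({"meeting", "lunch"}))  # 'ham'
```

For simplicity this sketch only scores the words that occur in a message; a full naive Bayes model would also multiply in P(w = 0 | h) for the words that are absent.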