Section 3:

Reasoning Under Uncertainty

With slides from Dan Klein and Pieter Abbeel

ExpectiMax: What Probabilities to Use?

• In Expectimax search, we have a probabilistic model of how the opponent (or environment) will behave in any state
– Model could be a simple uniform distribution (roll a die)
– Model could be sophisticated and require a great deal of computation
– We have a chance node for any outcome out of our control: opponent or environment
– Model might say that adversarial actions are more likely!

• For now, assume each chance node magically comes along with probabilities that specify the distribution over its outcomes

Having a probabilistic belief about another agent’s action does not mean that the agent is flipping any coins!

Quiz: Informed Probabilities
• Let’s say you know that your opponent is actually running a depth 2 minimax, using the result 80% of the time, and moving randomly otherwise
• Question: What tree search should you use?

§ Answer: Expectimax!
§ To figure out EACH chance node’s probabilities, you have to run a simulation of your opponent
§ This kind of thing gets very slow very quickly
§ Even worse if you have to simulate your opponent simulating you…
§ … except for minimax, which has the nice property that it all collapses into one game tree
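
As a rough sketch of what "figuring out a chance node's probabilities" could look like for the quiz opponent: the helper names below are hypothetical, and the code assumes that "moving randomly" means choosing uniformly among all legal moves (so the random branch may also pick the minimax move). The minimax move itself is passed in; computing it would require the opponent simulation mentioned above.

```python
def opponent_action_distribution(actions, minimax_action, p_minimax=0.8):
    """Distribution over opponent actions for one chance node.

    Assumption: with probability p_minimax the opponent plays
    minimax_action; otherwise it picks uniformly among all actions.
    """
    p_random = (1.0 - p_minimax) / len(actions)
    return {
        a: p_random + (p_minimax if a == minimax_action else 0.0)
        for a in actions
    }


def expected_value(action_values, distribution):
    """Value of a chance node: probability-weighted average of child values."""
    return sum(distribution[a] * v for a, v in action_values.items())


# Hypothetical chance node: three opponent moves and their values for us.
values = {"left": 3.0, "stay": 8.0, "right": 5.0}
dist = opponent_action_distribution(list(values), minimax_action="left")
print(dist)
print(expected_value(values, dist))
```

The expectimax value sits between the pure minimax value (3.0 here) and the uniform average, weighted toward the minimax move.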

Modeling Assumptions

Dealing with Uncertainty

§ The robot can handle uncertainty in an obstacle position by representing the set of all positions of the obstacle that the robot thinks possible at each time (belief state)
§ For example, this set can be a disc whose radius grows linearly with time

(Figure: initial set of possible positions at t = 0, set of possible positions at time T, set of possible positions at time 2T)
The robot must plan to be outside this disc at time t = T
Imperfect Observation of the World

Observation of the world can be:

§ Partial, e.g., a vision sensor can’t see through obstacles (lack of percepts)

(Figure: two rooms, R1 and R2.) The robot may not know whether there is dust in room R2

Definition: Belief State
§ In the presence of non-deterministic sensory uncertainty, an agent’s belief state represents all the states of the world that it thinks are possible at a given time or at a given stage of reasoning
§ In the probabilistic model of uncertainty, a probability is associated with each state to measure its likelihood of being the actual state

(Figure: four possible states with probabilities 0.2, 0.3, 0.4, 0.1)

What do probabilities mean?

§ Probabilities have a natural frequency interpretation
§ The agent believes that if it were able to return many times to a situation where it has the same belief state, then the actual states in this situation would occur at a relative frequency defined by the probability distribution

(Figure: four possible states with probabilities 0.2, 0.3, 0.4, 0.1; the first state would occur 20% of the time)

Belief State: Example
§ Consider a world where a dentist agent D meets a new patient P
§ D is interested in only one thing: whether P has a cavity, which D models using the proposition Cavity
§ Before making any observation, D’s belief state is:

Cavity   ¬Cavity
p        1-p

§ This means that D believes that a fraction p of patients have cavities

Where do probabilities come from?

§ Frequencies observed in the past, e.g., by the agent, its designer, or others
§ Symmetries, e.g.:
• If I roll a die, each of the 6 outcomes has probability 1/6
§ Subjectivism, e.g.:
• If I drive on Highway 280 at 120mph, I will get a speeding ticket with probability 0.6
• Principle of indifference: If there is no knowledge to consider one possibility more probable than another, give them the same probability

Pacman: Ghost position is uncertain

• A ghost is in the grid
• Sensor readings tell how close a square is to the ghost
– On the ghost: red
– 1 or 2 away: orange
– 3 or 4 away: yellow
– 5+ away: green

§ Sensors are noisy, but we know P(Color | Distance)

P(red | 3)   P(orange | 3)   P(yellow | 3)   P(green | 3)
0.05         0.15            0.5             0.3

Pacman Uncertainty: 2

• General situation:
– Observed variables (evidence): Agent knows certain
things about the state of the world (e.g., sensor
readings or symptoms)
– Unobserved variables: Agent needs to reason about
other aspects (e.g. where an object is or what
disease is present)
– Model: Agent knows something about how the
known variables relate to the unknown variables

• Probabilistic reasoning gives us a framework for managing our beliefs and knowledge

Random Variables

• A random variable is some aspect of the world about which we (may) have uncertainty
– R = Is it raining?
– T = Is it hot or cold?
– D = How long will it take to drive to work?
– L = Where is the ghost?

• We denote random variables with capital letters

• Random variables have domains

– R in {true, false} (often write as {+r, -r})
– T in {hot, cold}
– D in [0, ∞)
– L in possible locations, maybe {(0,0), (0,1), …}

Probability Distributions

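• Unobserved random variables have distributions

P(T)
T      P
hot    0.5
cold   0.5

P(W)
W       P
sun     0.6
rain    0.1
fog     0.3
meteor  0.0

• A distribution is a TABLE of probabilities of values
• A probability (lower case value) is a single number, e.g. P(W = rain) = 0.1
• Must have: P(X = x) ≥ 0 for all x, and Σ_x P(X = x) = 1

Shorthand notation (OK if all domain entries are unique):
P(hot) = P(T = hot)
P(cold) = P(T = cold)
P(rain) = P(W = rain)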

Joint Distributions

• A joint distribution over a set of random variables X1, X2, …, Xn specifies a real number for each assignment (or outcome):
P(X1 = x1, X2 = x2, …, Xn = xn), written P(x1, x2, …, xn) for short

– Must obey: P(x1, x2, …, xn) ≥ 0 and Σ_(x1, …, xn) P(x1, x2, …, xn) = 1

P(T, W)
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

• Size of distribution of n variables with domain sizes d?
– O(size) = ?
– For all but the smallest distributions, impractical to write out!

Marginal Distributions

• Marginal distributions are sub-tables which eliminate variables
• Marginalization (summing out): Combine collapsed rows by adding

P(T, W)
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

P(T), where P(t) = Σ_w P(t, w)
T      P
hot    0.5
cold   0.5

P(W), where P(w) = Σ_t P(t, w)
W      P
sun    0.6
rain   0.4
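
Summing out can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original slides: the dictionary layout and the helper name `marginal` are assumptions, and the numbers are the P(T, W) entries from the table above.

```python
from collections import defaultdict

# Joint distribution P(T, W), keyed by (t, w).
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}


def marginal(joint, keep_index):
    """Sum out every variable except the one at position keep_index."""
    result = defaultdict(float)
    for assignment, p in joint.items():
        result[assignment[keep_index]] += p
    return dict(result)


p_t = marginal(joint, 0)  # P(T)
p_w = marginal(joint, 1)  # P(W)
print(p_t)
print(p_w)
```

Up to floating-point rounding, the printed values match the marginal tables: 0.5 and 0.5 for T, 0.6 and 0.4 for W.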

Exercise: Marginal Distributions

P(X, Y)
X     Y     P
+x    +y    0.2
+x    -y    0.3
-x    +y    0.4
-x    -y    0.1

Fill in the marginal tables P(X) (rows +x, -x) and P(Y) (rows +y, -y).

2/10/19 Artificial Intelligence, Fall 2018 19

Probabilistic Models

• A probabilistic model is a joint distribution over a set of random variables

• Probabilistic models:
– (Random) variables with domains
– Assignments are called outcomes
– Joint distributions: say whether assignments (outcomes) are likely
– Normalized: sum to 1.0
– Ideally: only certain variables directly interact

Distribution over T,W
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

Conditional Probabilities
• Relates joint and conditional probabilities
– In fact, this is taken as the definition of a conditional probability

P(a | b) = P(a, b) / P(b)

P(T, W)
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

P(hot | sun) = ?    P(sun | cold) = ?

Conditional Distributions

• Conditional distributions are probability distributions over some variables given fixed values of others

Joint Distribution P(T, W)
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

Conditional Distributions

P(W | T = hot)
W      P
sun    0.8
rain   0.2

P(W | T = cold)
W      P
sun    0.4
rain   0.6

Normalization Trick

P(T, W)
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

SELECT the joint probabilities matching the evidence (T = cold)
T      W      P
cold   sun    0.2
cold   rain   0.3

NORMALIZE the selection (make it sum to one) to get P(W | T = cold)
W      P
sun    0.4
rain   0.6

• Why does this work? Sum of selection is P(evidence)! (P(T=c), here)
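
The select-then-normalize procedure can be sketched as follows. This is a minimal illustration under assumed conventions (joint stored as a dictionary keyed by tuples; the helper name `condition` is hypothetical), using the P(T, W) numbers from the slide.

```python
# Joint distribution P(T, W), keyed by (t, w).
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}


def condition(joint, index, value):
    """Return the distribution over the remaining variables given that
    the variable at position index takes the given value."""
    # SELECT: keep only the joint entries consistent with the evidence.
    selected = {a: p for a, p in joint.items() if a[index] == value}
    # NORMALIZE: the selected entries sum to P(evidence).
    z = sum(selected.values())
    return {a[:index] + a[index + 1:]: p / z for a, p in selected.items()}


p_w_given_cold = condition(joint, 0, "cold")  # P(W | T = cold)
print(p_w_given_cold)
```

Note that P(T = cold) is never computed separately; it falls out as the normalization constant `z`.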

Exercise: Selection & Normalization
• P(X | Y=-y) ?

Procedure: SELECT the joint probabilities matching the evidence, then NORMALIZE the selection (make it sum to one)

P(X, Y)
X     Y     P
+x    +y    0.2
+x    -y    0.3
-x    +y    0.4
-x    -y    0.1


To Normalize

• (Dictionary) To bring or restore to a normal condition
– Here: all entries sum to ONE

• Procedure:
– Step 1: Compute Z = sum over all entries
– Step 2: Divide every entry by Z

• Example 1 (Z = 0.5)
W      P (before)   P (normalized)
sun    0.2          0.4
rain   0.3          0.6

• Example 2 (Z = 50)
T      W      P (before)   P (normalized)
hot    sun    20           0.4
hot    rain   5            0.1
cold   sun    10           0.2
cold   rain   15           0.3
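
The two-step procedure is short enough to write directly. The sketch below is illustrative only (the function name `normalize` and the zero-sum check are assumptions added here); it reuses the counts from Example 2.

```python
def normalize(table):
    """Divide every entry by Z, the sum over all entries."""
    z = sum(table.values())  # Step 1: compute Z
    if z == 0:
        raise ValueError("cannot normalize a table whose entries sum to zero")
    return {key: value / z for key, value in table.items()}  # Step 2


# Example 2: unnormalized counts over (T, W), Z = 50.
counts = {
    ("hot", "sun"): 20,
    ("hot", "rain"): 5,
    ("cold", "sun"): 10,
    ("cold", "rain"): 15,
}
probs = normalize(counts)
print(probs)
```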

Probabilistic Inference

• Probabilistic inference: compute a desired probability from other known probabilities (evidence)

• We generally compute conditional probabilities

– P(on time | no reported accidents) = 0.90
– These represent the agent’s beliefs given the evidence

• Probabilities change with new evidence:

– P(on time | no accidents, 5 a.m.) = 0.95
– P(on time | no accidents, 5 a.m., raining) = 0.80
– Observing new evidence causes beliefs to be updated

Inference 1: The Product Rule

• Sometimes given conditional distributions but want the joint

P(y) P(x | y) = P(x, y)

• Example: P(W) P(D | W) = P(D, W)

P(W)
W      P
sun    0.8
rain   0.2

P(D | W)
D      W      P
wet    sun    0.1
dry    sun    0.9
wet    rain   0.7
dry    rain   0.3

P(D, W)
D      W      P
wet    sun    0.08
dry    sun    0.72
wet    rain   0.14
dry    rain   0.06
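
A small sketch of the product rule on the weather example: each joint entry is the conditional entry times the prior of the conditioning value. The data layout and the name `product_rule` are assumptions made for illustration.

```python
# Prior P(W) and conditional P(D | W) from the example.
p_w = {"sun": 0.8, "rain": 0.2}
p_d_given_w = {
    "sun": {"wet": 0.1, "dry": 0.9},
    "rain": {"wet": 0.7, "dry": 0.3},
}


def product_rule(prior, conditional):
    """Build the joint P(D, W) = P(D | W) * P(W), keyed by (d, w)."""
    return {
        (d, w): conditional[w][d] * prior[w]
        for w in prior
        for d in conditional[w]
    }


joint_dw = product_rule(p_w, p_d_given_w)
for (d, w), p in joint_dw.items():
    print(d, w, round(p, 2))
```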

The Chain Rule

• More generally, can always write any joint distribution as an incremental product of conditional distributions

P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2)
P(x1, x2, …, xn) = Π_i P(xi | x1, …, x(i-1))

• Why is this true?
– Recursive decomposition using product rule


Inference 2: Bayes’ Rule

• Two ways to factor a joint distribution over two variables:

P(x, y) = P(x | y) P(y) = P(y | x) P(x)

• Dividing, we get:

P(x | y) = P(y | x) P(x) / P(y)

(Figure caption: “That’s my rule!”)

• Why is this at all helpful?

– Lets us build one conditional from its reverse
– Often one conditional is tricky but the other one is simple
– Foundation of many AI systems we’ll see later (e.g. ASR, MT, POS,…)

• In the running for most important AI, ML, DM equation!

Inference with Bayes’ Rule
• Example: Diagnostic probability from causal probability:

P(cause | effect) = P(effect | cause) P(cause) / P(effect)

• Example:
– M: meningitis, S: stiff neck
– Example givens:
P(+m) = 0.0001
P(+s | +m) = 0.8
P(+s | -m) = 0.01

P(+m | +s) = P(+s | +m) P(+m) / P(+s)
           = P(+s | +m) P(+m) / [P(+s | +m) P(+m) + P(+s | -m) P(-m)]
           = (0.8 × 0.0001) / (0.8 × 0.0001 + 0.01 × 0.9999)
           ≈ 0.0079

– Note: posterior probability of meningitis still very small
– Note: you should still get stiff necks checked out! Why?
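
The meningitis calculation can be checked numerically. The function below is a hypothetical helper for a binary cause; its name and parameter names are illustrative, and the inputs are the givens from the example.

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """P(cause | effect) for a binary cause.

    prior                = P(cause)
    likelihood           = P(effect | cause)
    likelihood_given_not = P(effect | not cause)
    """
    # P(effect), by summing over both values of the cause.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


p_m_given_s = bayes_posterior(
    prior=0.0001, likelihood=0.8, likelihood_given_not=0.01
)
print(round(p_m_given_s, 4))
```

The posterior is roughly 80 times the prior of 0.0001, yet still below 1%, because stiff necks without meningitis are far more common than meningitis itself.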

Ghostbusters, Revisited
• Let’s say we have two distributions:
– Prior distribution over ghost location: P(G)
• Let’s say this is uniform
– Sensor reading model: P(R | G)
• Given: we know what our sensors do
• R = reading color measured at (1,1)
• E.g. P(R = yellow | G=(1,1)) = 0.1

• We can calculate the posterior distribution P(G|r) over ghost locations given a sensor reading, with Bayes’ rule:

P(g | r) ∝ P(r | g) P(g)

Hands-on Example: Ghost Localization

• Setup:
– Prior distribution over ghost location: P(G) = uniform (on right)
– R = reading color measured at (1,1) = Yellow
– Sensor reading model: P(R | G)

P(red | 3)   P(orange | 3)   P(yellow | 3)   P(green | 3)
0.05         0.15            0.5             0.3

• What is probability of ghost at (3,3)?
– Details on Board
– Answer: 0.1

Quiz: Inference with Bayes’ Rule
• Given:

P(W)
W      P
sun    0.8
rain   0.2

P(D | W)
D      W      P
wet    sun    0.1
dry    sun    0.9
wet    rain   0.7
dry    rain   0.3

• What is P(W | dry) ?


Graphical Model Notation

• Nodes: variables (with domains)

– Can be assigned (observed) or unassigned

• Arcs: interactions
– Indicate direct influence between variables
– Formally: encode conditional independence
• For now: imagine that arrows mean direct causation (not true in general)


Definition: Independence

• Two variables are independent if:

P(x, y) = P(x) P(y) for all x, y

– This says that their joint distribution factors into a product of two simpler distributions
– Another form: P(x | y) = P(x) for all x, y
– We write: X ⊥ Y

• Independence is a simplifying modeling assumption
– Empirical joint distributions: at best close to independent
– What could we assume for {Weather, Traffic, Cavity, Toothache}?

Example: Independence
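• N fair, independent coin flips:

P(X1)       P(X2)       …    P(Xn)
H  0.5      H  0.5           H  0.5
T  0.5      T  0.5           T  0.5

P(X1, X2, …, Xn) = P(X1) P(X2) … P(Xn), so each of the 2^N outcomes has probability 0.5^N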
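
One way to test whether a two-variable joint table is independent is to compare every entry with the product of its marginals. The sketch below is illustrative (the helper name `is_independent` and the tolerance are assumptions); it checks two fair coins and the P(T, W) table used earlier in this section.

```python
import math
from itertools import product


def is_independent(joint, tol=1e-9):
    """Check P(x, y) == P(x) * P(y) for every pair (x, y)."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        math.isclose(joint.get((x, y), 0.0), p_x[x] * p_y[y], abs_tol=tol)
        for x, y in product(xs, ys)
    )


# Two fair, independent coin flips: every outcome has probability 0.25.
two_coins = {(a, b): 0.25 for a in "HT" for b in "HT"}

# The temperature/weather joint from the earlier tables.
weather = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

print(is_independent(two_coins))
print(is_independent(weather))
```

The coins factor exactly. The weather table does not, since P(hot) P(sun) = 0.5 × 0.6 = 0.3 while P(hot, sun) = 0.4.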


Conditional Independence
• P(Toothache, Cavity, Catch)
• If I have a cavity, the probability that the probe catches in it
doesn't depend on whether I have a toothache:
– P(+catch | +toothache, +cavity) = P(+catch | +cavity)
• The same independence holds if I don’t have a cavity:
– P(+catch | +toothache, -cavity) = P(+catch| -cavity)
• Catch is conditionally independent of Toothache given Cavity:
– P(Catch | Toothache, Cavity) = P(Catch | Cavity)
§ Equivalent statements:
§ P(Toothache | Catch , Cavity) = P(Toothache | Cavity)
§ P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
§ One can be derived from the other easily

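Conditional Independence and the Chain Rule
• Chain rule: P(X1, X2, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) …

• Trivial decomposition:
P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Toothache, Cavity)

• With assumption of conditional independence:
P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Cavity)

• Bayes nets / graphical models help us express conditional independence assumptions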

Graphical Model Semantics

• A set of nodes, one per variable X
• A directed, acyclic graph
• A conditional distribution for each node
– A collection of distributions over X, one for each combination of parents’ values: P(X | a1 … an), where A1 … An are the parents of X
– CPT: conditional probability table
– Description of a noisy causal process

A Bayes net = Topology (graph) + Local Conditional Probabilities


Conditional Probability Tables
• Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case).
– Roots (sources) of the DAG that have no parents are given prior probabilities.

Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls

Burglary:    P(B) = .001
Earthquake:  P(E) = .002

Alarm:
B    E    P(A)
T    T    .95
T    F    .94
F    T    .29
F    F    .001

JohnCalls:
A    P(J)
T    .90
F    .05

MaryCalls:
A    P(M)
T    .70
F    .01


CPT Comments
• Probability of Node=false is not given; it can be obtained by subtracting from 1, e.g. P(¬B) = 1 − P(B)
• The values listed in a CPT column do not need to add up to one across rows: each row conditions on a different combination of parent values, so only P(Node=true | parents) + P(Node=false | parents) must equal one
• Example requires 10 parameters rather than 2^5 − 1 = 31 for specifying the full joint distribution.
• Number of parameters in the CPT for a node is exponential in the number of parents (fan-in).
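
The parameter counts can be reproduced with a short sketch. It assumes every variable is Boolean, so a node with k parents needs 2^k numbers (one P(Node=true | parents) per parent combination); the function names are hypothetical, and the parent lists follow the burglary network.

```python
# Parents of each node in the burglary network (all variables Boolean).
parents = {
    "B": [],
    "E": [],
    "A": ["B", "E"],
    "J": ["A"],
    "M": ["A"],
}


def cpt_parameters(parents):
    """One number per parent combination for each Boolean node."""
    return sum(2 ** len(ps) for ps in parents.values())


def full_joint_parameters(num_variables):
    """Entries in a full Boolean joint table, minus one because the
    entries must sum to one."""
    return 2 ** num_variables - 1


print(cpt_parameters(parents))              # 1 + 1 + 4 + 2 + 2
print(full_joint_parameters(len(parents)))  # 2**5 - 1
```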
Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines a joint distribution:
P(x1, x2, …, xn) = Π_{i=1..n} P(xi | Parents(Xi))
• Example
P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
= P(J | A) P(M | A) P(A | ¬B ∧ ¬E) P(¬B) P(¬E)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
• An inefficient approach to inference is:
– 1) Compute the joint distribution using this equation.
– 2) Compute any desired conditional probability using the joint distribution.
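
As a sketch of the factored joint, the code below multiplies one CPT entry per node for a full assignment. The CPT values come from the burglary network; P(B) = 0.001 is an assumption inferred from the factor P(¬B) = 0.999 in the example, and the helper names are illustrative.

```python
# CPTs for the burglary network; each value is P(node = True | parents).
p_b = 0.001  # assumed, inferred from P(not B) = 0.999 in the example
p_e = 0.002
p_a = {  # keyed by (burglary, earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
p_j = {True: 0.90, False: 0.05}  # keyed by alarm
p_m = {True: 0.70, False: 0.01}  # keyed by alarm


def prob(p_true, value):
    """Probability of a Boolean value given P(True)."""
    return p_true if value else 1.0 - p_true


def joint_probability(b, e, a, j, m):
    """Product of P(x_i | Parents(X_i)) over all five nodes."""
    return (
        prob(p_b, b)
        * prob(p_e, e)
        * prob(p_a[(b, e)], a)
        * prob(p_j[a], j)
        * prob(p_m[a], m)
    )


p = joint_probability(b=False, e=False, a=True, j=True, m=True)
print(round(p, 6))
```

Looping `joint_probability` over all 32 assignments would build the full joint table, which is the inefficient inference route described above.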
Bayes Nets: Big Picture

• Two problems with using full joint distribution tables as our probabilistic models:
– Unless there are only a few variables, the joint is WAY too big to represent explicitly
– Hard to learn (estimate) anything empirically about more than a few variables at a time

• Bayes nets: a technique for describing complex joint distributions (models) using simple, local distributions (conditional probabilities)
– More properly called graphical models
– We describe how variables locally interact
– Local interactions chain together to give global, indirect interactions
Probability Review
• Reading Materials: R&N Chapter 13
• Online Resources:
– https://courses.washington.edu/css490/2012.Winter/lec
Tutorials with hands-on examples
– https://www.hackerearth.com/practice/machine-
– https://www.hackerearth.com/practice/machine-
Preview: Homework 3
• http://www.cs.emory.edu/~eugene/cs425/p3/

Probability Review
• Bag with 10 marbles: 3 red, 7 blue

– Reach in, take one, put it back

– Repeat lots of times.
– What fraction red? About .3
– P(red) = .3
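
The marble experiment is easy to simulate. The sketch below is illustrative (the function name, the seed, and the number of draws are arbitrary choices); it draws with replacement and reports the observed fraction of red marbles, which should land near 0.3.

```python
import random


def fraction_red(num_draws, seed=0):
    """Draw with replacement from a bag of 3 red and 7 blue marbles
    and return the fraction of draws that were red."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    bag = ["red"] * 3 + ["blue"] * 7
    reds = sum(rng.choice(bag) == "red" for _ in range(num_draws))
    return reds / num_draws


print(fraction_red(100_000))
```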

Probability Distribution
• The probability for each value of a random variable
if color = (red, blue)
P(color) = (.3, .7)

Basic Properties
• 0 ≤ P(A) ≤ 1
• P(true) = 1
P(red ∨ blue ∨ green) = 1

• P(false) = 0
P(black) = 0

Basic Properties
• P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

  .3   P(red)
+ .4   P(striped)
− .1   P(red ∧ striped)   (counted twice, so subtract once)
= .6   P(red ∨ striped)

Probabilistic Models

• A probabilistic model is a joint distribution over a set of random variables
• Probabilistic models:
– (Random) variables with domains
– Assignments are called outcomes
– Joint distributions: say whether assignments (outcomes) are likely
– Normalized: sum to 1.0
– Ideally: only certain variables directly interact

Distribution over T,W
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

• Constraint satisfaction problems:
– Variables with domains
– Constraints: state whether assignments are possible
– Ideally: only certain variables directly interact

Constraint over T,W
hot    sun    T
hot    rain   F
cold   sun    F
cold   rain   T

Events

• An event is a set E of outcomes
P(E) = Σ_{(x1, …, xn) ∈ E} P(x1, …, xn)

• From a joint distribution, we can calculate the probability of any event

P(T, W)
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

– Probability that it’s hot AND sunny? P(+hot, +sun) = 0.4
– Probability that it’s hot? P(+hot) = 0.4 + 0.1 = 0.5
– Probability that it’s hot OR sunny? P(+hot OR +sun) = 0.4 + 0.1 + 0.2 = 0.7

• Typically, the events we care about are partial assignments, like P(T=hot)
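
Since an event is just a set of outcomes, its probability can be computed by summing the joint entries that satisfy a predicate. The sketch below is illustrative (the helper name `event_probability` and the lambda-based encoding of events are assumptions) and uses the P(T, W) table.

```python
# Joint distribution P(T, W), keyed by (t, w).
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}


def event_probability(joint, predicate):
    """Sum the probabilities of all outcomes for which predicate holds."""
    return sum(p for outcome, p in joint.items() if predicate(*outcome))


p_hot_and_sun = event_probability(joint, lambda t, w: t == "hot" and w == "sun")
p_hot = event_probability(joint, lambda t, w: t == "hot")
p_hot_or_sun = event_probability(joint, lambda t, w: t == "hot" or w == "sun")
print(p_hot_and_sun, p_hot, p_hot_or_sun)
```

The OR event counts each matching outcome once, so no inclusion-exclusion correction is needed when summing directly over outcomes.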

Exercise: Event Probabilities

P(X, Y)
X     Y     P
+x    +y    0.2
+x    -y    0.3
-x    +y    0.4
-x    -y    0.1

• P(+x, +y) ?
• P(+x) ?
• P(-y OR +x) ?

Conditional Probabilities
• Relates joint and conditional probabilities
– In fact, this is taken as the definition of a conditional probability

P(a | b) = P(a, b) / P(b)

P(T, W)
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

P(hot | sun) = ?    P(cold | rain) = ?

Exercise: Conditional Probabilities

P(X, Y)
X     Y     P
+x    +y    0.2
+x    -y    0.3
-x    +y    0.4
-x    -y    0.1

• P(+x | +y) ?
• P(-x | +y) ?
• P(-y | +x) ?

