13. Uncertainty


Quantifying Uncertainty

Jihoon Yang

Machine Learning Research Laboratory


Department of Computer Science & Engineering
Sogang University



Uncertainty
Let action At = leave for airport t minutes before flight
Will At get me there on time?
Problems:
1) partial observability (road state, other drivers’ plans, etc.)
2) noisy sensors (traffic radio)
3) uncertainty in action outcomes (flat tire, etc.), etc.
Hence a purely logical approach either
1) risks falsehood: “A25 will get me there on time”
or 2) leads to conclusions that are too weak for decision making
“A25 will get me there on time if there’s no accident on the
bridge and it doesn’t rain and my tires remain intact etc.”
(A1440 might reasonably be said to get me there on time
but I’d have to stay overnight in the airport . . .)

Making decisions under uncertainty
Suppose I believe the following:

P (A25 gets me there on time| . . .) = 0.04


P (A90 gets me there on time| . . .) = 0.70
P (A120 gets me there on time| . . .) = 0.95
P (A1440 gets me there on time| . . .) = 0.9999

Which action to choose?

Depends on my preferences for missing flight vs. airport cuisine, etc.

Utility theory is used to represent and infer preferences

Decision theory = utility theory + probability theory
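
To make the combination concrete, here is a minimal Python sketch of an expected-utility choice. The probabilities are the ones above; the utilities and waiting costs are purely illustrative assumptions, not part of the example.

```python
# Decision theory sketch: choose the action with maximum expected utility.
# Probabilities come from the slide; the utility numbers are made-up assumptions.
p_on_time = {"A25": 0.04, "A90": 0.70, "A120": 0.95, "A1440": 0.9999}
wait_minutes = {"A25": 25, "A90": 90, "A120": 120, "A1440": 1440}

def expected_utility(action):
    # Hypothetical utilities: +100 for catching the flight, -500 for missing it,
    # minus 0.2 per minute spent waiting at the airport.
    p = p_on_time[action]
    return p * 100 + (1 - p) * (-500) - 0.2 * wait_minutes[action]

for a in sorted(p_on_time, key=expected_utility, reverse=True):
    print(f"{a}: EU = {expected_utility(a):.1f}")
print("Best action:", max(p_on_time, key=expected_utility))  # A120 under these utilities
```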



Probability basics
Begin with a set Ω — the sample space
e.g., 6 possible rolls of a die.
ω ∈ Ω is a sample point/possible world/atomic event

A probability space or probability model is a sample space
with an assignment P(ω) for every ω ∈ Ω s.t.
0 ≤ P(ω) ≤ 1
Σ_ω P(ω) = 1
e.g., P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

An event A is any subset of Ω
P(A) = Σ_{ω ∈ A} P(ω)

E.g., P(die roll < 4) = P(1) + P(2) + P(3) = 1/6 + 1/6 + 1/6 = 1/2
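
A minimal Python sketch of these definitions, using the die model above (an event is represented as a set of sample points):

```python
# Sample space Ω for one die roll, with P(ω) = 1/6 for every sample point ω.
from fractions import Fraction

P_omega = {w: Fraction(1, 6) for w in range(1, 7)}
assert sum(P_omega.values()) == 1          # the probabilities sum to 1

def prob(event):
    """P(A) = Σ_{ω ∈ A} P(ω) for an event A ⊆ Ω."""
    return sum(P_omega[w] for w in event)

print(prob({1, 2, 3}))                     # P(die roll < 4) = 1/2
```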



Random variables
A random variable is a function from sample points to some range,
e.g., the reals or Booleans (e.g. Odd(1) = true)

P induces a probability distribution for any r.v. X:

P(X = xi) = Σ_{ω : X(ω) = xi} P(ω)

e.g., P (Odd = true) = P (1) + P (3) + P (5) = 1/6 + 1/6 + 1/6 = 1/2
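
A short Python sketch of the induced distribution, continuing the die example (Odd is the Boolean random variable above):

```python
# A random variable is a function from sample points to some range.
from fractions import Fraction

P_omega = {w: Fraction(1, 6) for w in range(1, 7)}

def Odd(w):                                # Boolean-valued random variable
    return w % 2 == 1

def induced_dist(X):
    """P(X = x) = Σ_{ω : X(ω) = x} P(ω)."""
    d = {}
    for w, p in P_omega.items():
        d[X(w)] = d.get(X(w), 0) + p
    return d

print(induced_dist(Odd))                   # {True: 1/2, False: 1/2}
```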



Propositions
Think of a proposition as the event (set of sample points)
where the proposition is true

Given Boolean random variables A and B:


event a = set of sample points ω where A(ω) = true
event ¬a = set of sample points ω where A(ω) = false
event a ∧ b = points ω where A(ω) = true and B(ω) = true

Often in AI applications, the sample points are defined


by the values of a set of random variables, i.e., the
sample space is the Cartesian product of the ranges of the variables

With Boolean variables, sample point = propositional logic model
e.g., A = true, B = false, or a ∧ ¬b.
Proposition = disjunction of atomic events in which it is true
e.g., (a ∨ b) ≡ (¬a ∧ b) ∨ (a ∧ ¬b) ∨ (a ∧ b)
⇒ P(a ∨ b) = P(¬a ∧ b) + P(a ∧ ¬b) + P(a ∧ b)
Syntax for propositions
Propositional or Boolean random variables
e.g., Cavity (do I have a cavity?)
Cavity = true is a proposition, also written cavity

Discrete random variables (finite or infinite)


e.g., Weather is one of ⟨sunny, rain, cloudy, snow⟩
Weather = rain is a proposition
Values must be exhaustive and mutually exclusive

Continuous random variables (bounded or unbounded)


e.g., Temp = 21.6; also allow, e.g., Temp < 22.0.

Arbitrary Boolean combinations of basic propositions



Axioms of probability
For any propositions A, B
1. 0 ≤ P (A) ≤ 1
2. P (T rue) = 1 and P (F alse) = 0
3. P (A ∨ B) = P (A) + P (B) − P (A ∧ B)
[Venn diagram: the events A and B as overlapping regions inside True, with intersection A ∧ B]

A probability is a measure over a set of events that satisfies the three axioms
⇒ probability theory is analogous to logical theory (axioms)
Prior & joint probability
Prior or unconditional probabilities of propositions
e.g., P (Cavity = true) = 0.1 and P (W eather = sunny) = 0.72
correspond to belief prior to arrival of any (new) evidence

Probability distribution gives values for all possible assignments:


P(Weather) = ⟨0.72, 0.1, 0.08, 0.1⟩ (normalized, i.e., sums to 1)

Joint probability distribution for a set of r.v.s gives the


probability of every atomic event on those r.v.s (i.e., every sample point)
P(Weather, Cavity) = a 4 × 2 matrix of values:

                 Weather = sunny   rain   cloudy   snow
Cavity = true         0.144        0.02    0.016   0.02
Cavity = false        0.576        0.08    0.064   0.08

Every question about a domain can be answered by the joint


distribution because every event is a sum of sample points
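
As a small illustration, here is a Python sketch that stores the joint table above with one probability per sample point and answers questions by summing:

```python
# Joint distribution P(Weather, Cavity), keyed by (weather, cavity) sample points.
joint = {
    ("sunny",  True): 0.144, ("rain",  True): 0.02,
    ("cloudy", True): 0.016, ("snow",  True): 0.02,
    ("sunny",  False): 0.576, ("rain",  False): 0.08,
    ("cloudy", False): 0.064, ("snow",  False): 0.08,
}

def prob(event):
    """Probability of any event, as the sum over sample points where it holds."""
    return sum(p for point, p in joint.items() if event(*point))

print(prob(lambda w, c: c))                     # P(Cavity = true)   = 0.2
print(prob(lambda w, c: w == "sunny"))          # P(Weather = sunny) = 0.72
print(prob(lambda w, c: c or w == "rain"))      # any Boolean combination works
```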



Inference using the joint distribution

                 Toothache = true   Toothache = false
Cavity = true          0.4                0.1
Cavity = false         0.1                0.4

P(cavity) = P(cavity, toothache) + P(cavity, ¬toothache) = 0.4 + 0.1 = 0.5



Conditional probability

• Conditional or posterior probabilities


P(cavity | toothache) = 0.8
probability of cavity given that toothache is all I know
(note cavity is shorthand for Cavity = true)

• Notation for conditional distributions:


P(Cavity | Toothache) = 2-element vector of 2-element vectors
P(cavity | toothache, cavity) = 1 (if the evidence already includes cavity)

• New evidence may be irrelevant (probability of cavity given


toothache is independent of Weather)
P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8



Conditional probability
Definition of conditional probability:

P(a|b) = P(a ∧ b) / P(b)    if P(b) ≠ 0

Product rule gives an alternative formulation:


P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)

A general version holds for whole distributions, e.g.,


P(Weather, Cavity) = P(Weather|Cavity) P(Cavity)
(View as a 4 × 2 set of equations, not matrix multiplication.)

Chain rule is derived by successive application of product rule:


P(X1, . . . , Xn) = P(X1, . . . , Xn−1) P(Xn | X1, . . . , Xn−1)
                 = P(X1, . . . , Xn−2) P(Xn−1 | X1, . . . , Xn−2) P(Xn | X1, . . . , Xn−1)
                 = . . .
                 = Π_{i=1}^{n} P(Xi | X1, . . . , Xi−1)



Inference by enumeration
Probabilistic inference is the computation of posterior probabilities for
query propositions given observed evidence
where the full joint distribution can be viewed as the KB
from which answers to all questions may be derived
Start with the joint distribution
                  toothache               ¬toothache
             catch      ¬catch        catch      ¬catch
 cavity      .108        .012          .072        .008
¬cavity      .016        .064          .144        .576
For any proposition φ, sum the atomic events where it is true


P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

E.g., P (toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
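
The same summation is easy to write down in Python; here is a sketch using the joint table above, with keys as (toothache, catch, cavity) truth values:

```python
# Full joint distribution for Toothache, Catch, Cavity.
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def P(phi):
    """P(φ) = Σ_{ω ⊨ φ} P(ω): sum the atomic events where φ is true."""
    return sum(p for w, p in joint.items() if phi(*w))

print(P(lambda toothache, catch, cavity: toothache))              # 0.2
print(P(lambda toothache, catch, cavity: cavity or toothache))    # 0.28
```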



Probabilistic inference

• One common task is to extract the distribution over a single


variable or some subset of variables, called marginal distribution
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
P(¬toothache) = … = 0.8

• This process is called marginalization or summing out: for any sets


of variables Y and Z
P(Y) = Σ_z P(Y, z) = Σ_z P(Y | z) P(z)

• A distribution over Y can be obtained by summing out all other


variables from any joint distribution containing Y



Inference by enumeration

For any proposition φ, sum the atomic events where it is true


P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
E.g., P (cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 +
0.016 + 0.064 = 0.28



Inference by enumeration

Can also compute conditional probabilities


P(¬cavity|toothache) = P(¬cavity ∧ toothache) / P(toothache)
                     = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4



Normalization

                  toothache               ¬toothache
             catch      ¬catch        catch      ¬catch
 cavity      .108        .012          .072        .008
¬cavity      .016        .064          .144        .576

• Denominator can be viewed as a normalization constant α


P(Cavity | toothache) = αP(Cavity, toothache)
= α[ P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch) ]
= α[ <0.108, 0.016> + <0.012, 0.064> ]
= α<0.12, 0.08> = <0.6, 0.4>

• General idea: compute distribution on query variable by fixing


evidence variables and summing over unobserved variables
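
A small Python sketch of the normalization step, assuming the joint table above is stored as joint[(toothache, catch, cavity)]:

```python
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def normalize(d):
    """Scale unnormalized probabilities so they sum to 1 (the α step)."""
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

# Fix the evidence (toothache = true) and sum out the unobserved variable Catch.
unnormalized = {cavity: sum(joint[(True, catch, cavity)] for catch in (True, False))
                for cavity in (True, False)}
print(normalize(unnormalized))     # {True: 0.6, False: 0.4} = P(Cavity | toothache)
```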

Probabilistic inference
• Let X be all variables. Typically we want the posterior distribution of
the query variables Y given specific values e for the evidence
variables E

• Let other variables be H = X – Y – E

• Then the required summation of joint entries is done by summing


out the other variables:
P(Y | E = e) = Σ_h P(Y, H = h | E = e) = α Σ_h P(Y, H = h, E = e)

• In principle, joint distributions can be used to answer any


probabilistic queries

• Obvious problems:
  – Worst-case time complexity O(d^n), where d is the largest arity and n the number of variables
  – Space complexity O(d^n) to store the joint distribution
  – How to find the numbers for the O(d^n) entries??
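
Here is a generic (and deliberately exponential) Python sketch of this procedure for the three-variable example, to make the O(d^n) enumeration explicit; the joint table and variable names are the ones used earlier:

```python
from itertools import product

variables = ("Toothache", "Catch", "Cavity")
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def enumerate_query(query_var, evidence):
    """P(Y | E = e) = α Σ_h P(Y, H = h, E = e), summing out the hidden variables."""
    hidden = [v for v in variables if v != query_var and v not in evidence]
    dist = {}
    for qval in (True, False):
        total = 0.0
        for hvals in product((True, False), repeat=len(hidden)):
            assignment = {**evidence, query_var: qval, **dict(zip(hidden, hvals))}
            total += joint[tuple(assignment[v] for v in variables)]
        dist[qval] = total
    alpha = 1 / sum(dist.values())              # normalization constant
    return {k: alpha * v for k, v in dist.items()}

print(enumerate_query("Cavity", {"Toothache": True}))   # {True: 0.6, False: 0.4}
```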
Independence

• A and B are independent iff


P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A)P(B)

[Figure: the joint over Toothache, Catch, Cavity, Weather decomposes into separate joints over {Toothache, Catch, Cavity} and {Weather}]

• P(Toothache, Catch, Cavity, Weather)


= P(Toothache, Catch, Cavity) P(Weather)
• 32 entries reduced to 12; for n independent biased coins, O(2^n) → O(n)

• Absolute independence powerful but rare


• How can we manage a large number of variables?
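
The saving comes from storing only the factors; a small Python sketch with the numbers used in the slides (8 + 4 = 12 stored entries instead of 32):

```python
# P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
p_tcc = {   # 8 entries, keyed by (toothache, catch, cavity)
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}
p_weather = {"sunny": 0.72, "rain": 0.1, "cloudy": 0.08, "snow": 0.1}   # 4 entries

def joint(toothache, catch, cavity, weather):
    """Reconstruct any of the 32 joint entries from the 12 stored numbers."""
    return p_tcc[(toothache, catch, cavity)] * p_weather[weather]

print(joint(True, True, True, "sunny"))    # 0.108 * 0.72
```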



Conditional independence
P(Toothache, Cavity, Catch) has 2³ − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn’t
depend on whether I have a toothache
(1) P (catch|toothache, cavity) = P (catch|cavity)
The same independence holds if I haven’t got a cavity
(2) P (catch|toothache, ¬cavity) = P (catch|¬cavity)
Catch is conditionally independent of Toothache given Cavity
P(Catch|Toothache, Cavity) = P(Catch|Cavity)
Equivalent statements
P(Toothache|Catch, Cavity) = P(Toothache|Cavity)
P(Toothache, Catch|Cavity) = P(Toothache|Cavity) P(Catch|Cavity)



Conditional independence

• Write out full joint distribution using chain rule:


P(Toothache, Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
i.e. 2 + 2 + 1 = 5 independent numbers

• Conditional independence
– Often reduces the size of the representation of the joint
distribution from exponential in n to linear in n
– Is one of the most basic and robust forms of knowledge about
uncertain environments



Conditional independence

• X is conditionally independent of Y given Z if the probability


distribution governing X is independent of the value of Y given the
value of Z:
P(X|Y, Z) = P(X|Z)

that is, if

(∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)



Bayes’ rule
Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)

⇒ Bayes’ rule: P(a|b) = P(b|a) P(a) / P(b)

or in distribution form

P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)

Useful for assessing diagnostic probability from causal probability:

P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)

E.g., let M be meningitis, S be stiff neck:

P(m|s) = P(s|m) P(m) / P(s) = (0.8 × 0.0001) / 0.1 = 0.0008

Note: posterior probability of meningitis still very small!
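
The arithmetic as a one-liner in Python, using the numbers from the slide:

```python
# Bayes' rule: P(m | s) = P(s | m) P(m) / P(s)
p_s_given_m = 0.8        # causal probability: stiff neck given meningitis
p_m = 0.0001             # prior probability of meningitis
p_s = 0.1                # prior probability of stiff neck

print(p_s_given_m * p_m / p_s)   # 0.0008 — the posterior is still very small
```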


Bayes’ rule and conditional independence

P(Cavity | toothache ∧ catch)
= α P(toothache ∧ catch | Cavity) P(Cavity)
= α P(toothache | Cavity) P(catch | Cavity) P(Cavity)

• This is an example of a naïve Bayes (idiot Bayes) model:


P(Cause, Effect1, . . . , Effectn) = P(Cause) Π_i P(Effecti | Cause)

[Figure: naïve Bayes structure — Cause (Cavity) is the parent of Effect1 . . . Effectn (Toothache, Catch)]

• Total number of parameters is linear in n
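
A sketch of the naïve Bayes query in Python. The conditional probabilities below are not given on this slide; they are the values implied by the joint table used earlier (e.g., P(toothache | cavity) = 0.12 / 0.2 = 0.6):

```python
# Naive Bayes: P(Cavity | toothache, catch) ∝ P(Cavity) P(toothache | Cavity) P(catch | Cavity)
p_cavity = {True: 0.2, False: 0.8}
p_toothache_given = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
p_catch_given     = {True: 0.9, False: 0.2}   # P(catch | Cavity)

def posterior(toothache, catch):
    scores = {}
    for c in (True, False):
        p_t = p_toothache_given[c] if toothache else 1 - p_toothache_given[c]
        p_c = p_catch_given[c] if catch else 1 - p_catch_given[c]
        scores[c] = p_cavity[c] * p_t * p_c
    alpha = 1 / sum(scores.values())              # normalize
    return {c: alpha * s for c, s in scores.items()}

print(posterior(True, True))   # P(Cavity | toothache ∧ catch) ≈ {True: 0.871, False: 0.129}
```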

Example: Wumpus World
[Figure: 4 × 4 grid; squares [1,1], [1,2], [2,1] have been visited and are known to be safe (OK); breezes (B) were observed in [1,2] and [2,1]]

Pij = true iff [i, j] contains a pit


Bij = true iff [i, j] is breezy
Include only B1,1, B1,2, B2,1 in the probability model



Specifying the probability model


The full joint distribution is P(P1,1, . . . , P4,4, B1,1, B1,2, B2,1)
Apply product rule: P(B1,1, B1,2, B2,1 | P1,1, . . . , P4,4)P(P1,1, . . . , P4,4)
(Do it this way to get P (Ef f ect|Cause))
First term: 1 if pits are adjacent to breezes, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:
P(P1,1, . . . , P4,4) = Π_{i,j = 1,1}^{4,4} P(Pi,j) = 0.2^n × 0.8^(16−n)  for n pits



Observations and query
We know the following facts:
b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Query is P(P1,3|known, b)
Define Unknown = the Pij's other than P1,3 and Known
For inference by enumeration, we have
P(P1,3 | known, b) = α Σ_unknown P(P1,3, unknown, known, b)
Grows exponentially with number of squares



Using conditional independence
Basic insight: observations are conditionally independent of other
hidden squares given neighbouring hidden squares
[Figure: grid partitioned into KNOWN squares, the QUERY square [1,3], the FRINGE (unknown squares adjacent to the known ones), and OTHER (the remaining unknown squares)]

Define Unknown = Fringe ∪ Other

P(b | P1,3, Known, Unknown) = P(b | P1,3, Known, Fringe)
Manipulate query into a form where we can use this

Using conditional independence

P(P1,3 | known, b)
= α Σ_unknown P(P1,3, unknown, known, b)
= α Σ_unknown P(b | P1,3, known, unknown) P(P1,3, known, unknown)
= α Σ_fringe Σ_other P(b | known, P1,3, fringe, other) P(P1,3, known, fringe, other)
= α Σ_fringe Σ_other P(b | known, P1,3, fringe) P(P1,3, known, fringe, other)
= α Σ_fringe P(b | known, P1,3, fringe) Σ_other P(P1,3, known, fringe, other)
= α Σ_fringe P(b | known, P1,3, fringe) Σ_other P(P1,3) P(known) P(fringe) P(other)
= α P(known) P(P1,3) Σ_fringe P(b | known, P1,3, fringe) P(fringe) Σ_other P(other)
= α′ P(P1,3) Σ_fringe P(b | known, P1,3, fringe) P(fringe)

Using conditional independence


[Figure: the five fringe models consistent with the observations, with prior probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, 0.8 × 0.2 = 0.16 (the three models with a pit in [1,3]) and 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16 (the two models without)]

P(P1,3 | known, b) = α′ ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩

P(P2,2 | known, b) ≈ ⟨0.86, 0.14⟩
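
To check the arithmetic, here is a small Python sketch that enumerates the fringe models and normalizes, assuming (as in the figures above) that the fringe consists of squares [2,2] and [3,1]:

```python
from itertools import product

P_PIT = 0.2                    # pits are placed independently with probability 0.2 per square

def consistent(p13, p22, p31):
    """Do these pit assignments produce the observed breezes
    b = ¬b1,1 ∧ b1,2 ∧ b2,1, given that [1,1], [1,2], [2,1] are pit-free?"""
    b11 = False                # neighbours [1,2] and [2,1] contain no pit
    b12 = p13 or p22           # [1,2] is breezy iff a neighbouring square has a pit
    b21 = p22 or p31
    return (not b11) and b12 and b21

def prior(pit):
    return P_PIT if pit else 1 - P_PIT

unnormalized = {}
for p13 in (True, False):
    total = sum(prior(p22) * prior(p31)
                for p22, p31 in product((True, False), repeat=2)
                if consistent(p13, p22, p31))
    unnormalized[p13] = prior(p13) * total

alpha = 1 / sum(unnormalized.values())
print({k: round(alpha * v, 2) for k, v in unnormalized.items()})   # {True: 0.31, False: 0.69}
```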

