Chapter 13: Uncertainty


By Ali Naqvi
 The world is not a well-defined place.
 There is uncertainty in the facts we know:
◦ What’s the temperature? Imprecise measures
◦ Is Trump a good president? Imprecise definitions
◦ Where is the pit? Imprecise knowledge
 There is uncertainty in our inferences
◦ If I have a blistery, itchy rash and was gardening all
weekend I probably have poison ivy
 People make successful decisions all the time anyhow.
 Uncertain data
◦ missing data, unreliable, ambiguous, imprecise representation,
inconsistent, subjective, derived from defaults, noisy…
 Uncertain knowledge
◦ Multiple causes lead to multiple effects
◦ Incomplete knowledge of causality in the domain
◦ Probabilistic/stochastic effects
 Uncertain knowledge representation
◦ restricted model of the real system
◦ limited expressiveness of the representation mechanism
 inference process
◦ Derived result is formally correct, but wrong in the real world
◦ New conclusions are not well-founded (eg, inductive reasoning)
◦ Incomplete, default reasoning methods
Uncertainty techniques used in AI
systems include:
 Probability
 Bayes' theorem
 Certainty Factors
 Fuzzy Logic
 Traditional logic is monotonic
◦ The set of legal conclusions grows monotonically with the set
of facts appearing in our initial database
 When humans reason, we use defeasible logic
◦ Almost every conclusion we draw is subject to reversal
◦ If we find contradicting information later, we’ll want to retract
earlier inferences
 Nonmonotonic logic, or defeasible reasoning, allows a
statement to be retracted
 Solution: Truth Maintenance
◦ Keep explicit information about which facts/inferences
support other inferences
◦ If the foundation disappears, so must the conclusion
 Agents must still act even if world is not certain
 If it is not sure which of two squares has a pit and must
enter one of them to reach the gold, the agent will
take a chance
 If it can act only when certain, most of the time it will
not act.
 Example: An agent wants to drive someone to the airport to
catch a flight, and is considering plan A90 that involves leaving
home 90 minutes before the flight departs and driving at a
reasonable speed. Even though the Pullman airport is only 5
miles away, the agent will not be able to reach a definite
conclusion - it will be more like
 “Plan A90 will get us to the airport in time, as long as my car
doesn't break down or run out of gas, and I don't get into an
accident, and there are no accidents on the Moscow-Pullman
highway, and the plane doesn't leave early, and there's no
thunderstorms in the area, …”
 We may use this plan if it improves our situation, given known
information
 Performance measure includes getting to the airport in time,
not waiting at the airport, and/or not getting a speeding ticket.
 Consider the following plans for getting to the airport:
◦ P(A25 gets me there on time | ...) = 0.04
◦ P(A90 gets me there on time | ...) = 0.70
◦ P(A120 gets me there on time | ...) = 0.95
◦ P(A1440 gets me there on time | ...) = 0.9999
 Which action should I choose?
 Depends on my preferences for missing the flight vs. time
spent waiting, etc.
◦ Utility theory is used to represent and infer Preferences
◦ Decision theory is a combination of probability theory
and utility theory
 Decision theory = utility theory + probability theory
 Pure logic fails for three main reasons:
 Laziness
◦ Too much work to list complete set of antecedents
or consequents needed to ensure an exceptionless
rule, too hard to use the enormous rules that
result
 Theoretical ignorance
◦ Science has no complete theory for the domain
 Practical ignorance
◦ Even if we know all the rules, we may be uncertain
about a particular patient because all the
necessary tests have not or cannot be run
 Probabilities are numeric values between 0 and 1
(inclusive) that represent ideal certainties (not beliefs)
of statements, given assumptions about the
circumstances in which the statements apply.
 These values can be verified by testing, unlike certainty
factors. They apply in highly controlled situations.

Probability(event) = P(event) = (#instances of the event) / (total #instances)
 0 <= P(Event) <= 1
 Disjunction, a∨b
 P(a∨b) = P(a) + P(b) – P(a^b)
 Negation
 P(~a) = 1 – P(a)
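These two identities can be checked by direct counting. Below is a minimal sketch in Python; the single fair six-sided die and the events a ("even") and b ("greater than 3") are illustrative assumptions, not from the slides:

```python
from fractions import Fraction

# Hypothetical sample space: one fair six-sided die.
omega = list(range(1, 7))

def p(event):
    # P(event) = #instances of the event / total #instances
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

a = lambda w: w % 2 == 0   # "roll is even"
b = lambda w: w > 3        # "roll is greater than 3"

# Disjunction: P(a v b) = P(a) + P(b) - P(a ^ b)
lhs = p(lambda w: a(w) or b(w))
rhs = p(a) + p(b) - p(lambda w: a(w) and b(w))
assert lhs == rhs == Fraction(2, 3)

# Negation: P(~a) = 1 - P(a)
assert p(lambda w: not a(w)) == 1 - p(a)
```

Using exact `Fraction` arithmetic avoids floating-point noise, so equalities can be tested exactly.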
 Conditional probability
◦ Once evidence is obtained, the agent can
use conditional probabilities, P(a|b)
◦ P(a|b) = probability of a being true given
that we know b is true
◦ The equation P(a|b) = P(a ^ b) / P(b)
holds whenever P(b) > 0
 Conjunction
◦ Product rule
◦ P(a^b) = P(a)*P(b|a)
◦ P(a^b) = P(b)*P(a|b)

 In other words, the only way a and b can both
be true is if a is true and b is true given that
a is true (thus b is also true)
 If a and b are independent events (the truth
of a has no effect on the truth of b) then:
P(a^b) = P(a) * P(b).
 “Wet” and “Raining” are not independent
events.
 “Wet” and “Joe made a joke” are pretty close
to independent events.
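Independence and the product rule can be tested the same way. A sketch assuming two fair dice (my stand-in for the Wet/Raining examples, not from the slides):

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: two fair dice.
omega = list(product(range(1, 7), repeat=2))

def p(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

a = lambda w: w[0] % 2 == 0     # first die is even
b = lambda w: w[1] > 3          # second die is greater than 3
c = lambda w: w[0] + w[1] > 7   # the total exceeds 7

both = lambda e, f: (lambda w: e(w) and f(w))

# a and b concern different dice (independent): P(a ^ b) = P(a) * P(b)
assert p(both(a, b)) == p(a) * p(b)
# The total depends on the first die, so a and c are NOT independent:
assert p(both(a, c)) == Fraction(1, 4)   # P(a ^ c)
assert p(a) * p(c) == Fraction(5, 24)    # P(a) * P(c) differs
```

The pair (a, b) plays the role of "Wet" and "Joe made a joke"; the pair (a, c) plays the role of "Wet" and "Raining".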
 For example, if we roll two dice, each showing one of six possible
numbers, the number of total unique rolls is 6*6 = 36. We
distinguish the dice in some way (a first and second or left and right
die). Here is a listing of the joint possibilities for the dice:
(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

 The number of rolls which add up to 4 is 3 ((1,3), (2,2), (3,1)), so the
probability of rolling a total of 4 is 3/36 = 1/12.
 This does not mean the statement is 8.3% true, but that there is an
8.3% chance of it being true.
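The enumeration above is easy to reproduce programmatically; a small sketch:

```python
from fractions import Fraction
from itertools import product

# All 36 joint possibilities for two distinguishable dice.
rolls = list(product(range(1, 7), repeat=2))
favorable = [r for r in rolls if sum(r) == 4]   # (1,3), (2,2), (3,1)

assert len(rolls) == 36
assert len(favorable) == 3
assert Fraction(len(favorable), len(rolls)) == Fraction(1, 12)
```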
 P(event) is the probability in the absence of any additional
information
 Probability depends on evidence.
 Before looking at dice: P(sum of 4) = 1/12
 After looking at dice: P(sum of 4) = 0 or 1, depending on what
we see
 All probability statements must indicate the evidence with
respect to which the probability is being assessed.
 As new evidence is collected, probability calculations are
updated.
 Before specific evidence is obtained, we refer to the prior or
unconditional probability of the event with respect to the
evidence. After the evidence is obtained, we refer to the
posterior or conditional probability.
 If we know that exactly one of A1, A2, ..., An are true, then
we know
 P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + ... + P(B|An)P(An) and
 P(B|X) = P(B|A1,X)P(A1|X) + ... + P(B|An,X)P(An|X)
 Example
◦ P(Sunday) = P(Monday) =.. = P(Saturday) = 1/7
◦ P(FootballToday) =
P(FootballToday|Sunday)P(Sunday) +
P(FootballToday|Monday)P(Monday) +
.. +
P(FootballToday|Saturday)P(Saturday)
= 0 + 0 + 0 + 0 + 0 + 0 + 1/7*1 = 1/7
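The football slide is an instance of the summation rule above; a sketch with the slide's numbers (the day abbreviations and the "football only on Saturday" reading are my assumptions about the slide's intent):

```python
from fractions import Fraction

days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
p_day = {d: Fraction(1, 7) for d in days}          # P(each day) = 1/7
p_fb_given_day = {d: Fraction(0) for d in days}
p_fb_given_day["Sat"] = Fraction(1)                # football only on Saturday

# P(FootballToday) = sum over days of P(FootballToday | day) * P(day)
p_football = sum(p_fb_given_day[d] * p_day[d] for d in days)
assert p_football == Fraction(1, 7)
```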
 If we want to know the probability of a variable
that can take on multiple values, we may define
a probability distribution, or a set of
probabilities for each possible variable value.
 TemperatureToday =
{Below50, 50s, 60s, 70s, 80s, 90sAndAbove}
 P(TemperatureToday) =
{0.1, 0.1, 0.5, 0.2, 0.05, 0.05}
 Note that the sum of the probabilities for
possible values of any given variable must
always sum to 1.
 Because events are rarely isolated from other events,
we may want to define a joint probability distribution,
or P(X1, X2, .., Xn).
 Each Xi is a vector of probabilities for values of variable
Xi.
 The joint probability distribution is an n-dimensional
array of combinations of probabilities.

Wet ~Wet
Rain 0.6 0.4
~Rain 0.4 0.6
 A lunar lander crashes somewhere in your town, at one of the 54 grid
cells chosen uniformly at random (each cell has an equal probability of
being the crash point).
 D is the event that it crashes downtown.
 R is the event that it crashes in the river.

[Figure: a 9-by-6 town grid; the river occupies two full rows (18 cells),
downtown a 3-cell-wide block of four rows (12 cells), overlapping in 6 cells.]

What is P(R)? 18/54
What is P(D)? 12/54
What is P(D^R)? 6/54
What is P(D|R)? 6/18
What is P(R|D)? 6/12
What is P(R^D)/P(D)? 6/12
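These counts can be reproduced with sets of grid cells. The 9-by-6 layout below is a hypothetical reconstruction of the figure; any layout with the same cell counts gives the same answers:

```python
from fractions import Fraction

# Hypothetical 9x6 grid: 54 equally likely crash cells.
cells = {(x, y) for x in range(9) for y in range(6)}
river = {(x, y) for x in range(9) for y in (2, 3)}             # 18 river cells
downtown = {(x, y) for x in (6, 7, 8) for y in (0, 1, 2, 3)}   # 12 downtown cells

def p(event):
    return Fraction(len(event), len(cells))

assert p(river) == Fraction(18, 54)
assert p(downtown) == Fraction(12, 54)
assert p(downtown & river) == Fraction(6, 54)
# Conditional probability: P(D|R) = P(D ^ R) / P(R), P(R|D) = P(R ^ D) / P(D)
assert p(downtown & river) / p(river) == Fraction(6, 18)
assert p(downtown & river) / p(downtown) == Fraction(6, 12)
```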


 Bayes' Rule
◦ Given a hypothesis (H) and evidence (E), and given
that P(E) > 0, what is P(H|E)?
 Many times rules and information are uncertain, yet
we still want to say something about the
consequent; namely, the degree to which it can be
believed. Thomas Bayes suggested an approach.
 Recall the two forms of the product rule:
◦ P(ab) = P(a) * P(b|a)
◦ P(ab) = P(b) * P(a|b)
 If we equate the two right-hand sides and divide by P(b), we get

P(a|b) = P(b|a) * P(a) / P(b)

 Bayes' rule is useful when we have three of the four
parts of the equation.
 In this example, a doctor knows that meningitis
causes a stiff neck in 50% of such cases. The prior
probability of having meningitis is 1/50,000 and the
prior probability of any patient having a stiff neck is
1/20.
 What is the probability that a patient has meningitis
if they have a stiff neck?
 H = "Patient has meningitis“ ( Cause)
 E = "Patient has stiff neck“ (Effect)
P(H|E) = P(E|H) * P(H) / P(E) = (0.5 * 0.00002) / 0.05 = 0.0002
 I have three identical boxes labeled H1, H2, and H3
I place 1 black bead and 3 white beads into H1
I place 2 black beads and 2 white beads into H2
I place 4 black beads and no white beads into H3
 I draw a box at random, and randomly remove a bead
from that box. Given the color of the bead, what can I
deduce as to which box I drew?
 If I replace the bead, then redraw another bead at
random from the same box, how well can I predict its
color before drawing it?
[Figure: the three boxes H1, H2, H3 with their beads.]
 Observation: I draw a white bead.
 P(H1|W) = P(H1)P(W|H1) / P(W)
= (1/3 * 3/4) / 5/12 = 3/12 * 12/5 = 36/60 = 3/5
 P(H2|W) = P(H2)P(W|H2) / P(W)
= (1/3 * 1/2) / 5/12 = 1/6 * 12/5 = 12/30 = 2/5
 P(H3|W) = P(H3)P(W|H3) / P(W)
= (1/3 * 0) / 5/12 = 0 * 12/5 = 0
 If I replace the bead, then redraw another bead at
random from the same box, how well can I predict its
color before drawing it?
 P(H1)=3/5, P(H2) = 2/5, P(H3) = 0
 P(W) = P(W|H1)P(H1) + P(W|H2)P(H2) +
P(W|H3)P(H3)
= 3/4*3/5 + 1/2*2/5 + 0*0 = 9/20 + 4/20 = 13/20
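The bead example is a Bayesian update followed by a predictive step; a sketch with the slide's numbers:

```python
from fractions import Fraction

# Priors and likelihoods from the slide.
boxes = ["H1", "H2", "H3"]
prior = {h: Fraction(1, 3) for h in boxes}
p_white = {"H1": Fraction(3, 4), "H2": Fraction(1, 2), "H3": Fraction(0)}

# P(W) by total probability, then Bayes' rule for each box.
pw = sum(p_white[h] * prior[h] for h in boxes)
posterior = {h: p_white[h] * prior[h] / pw for h in boxes}
assert pw == Fraction(5, 12)
assert posterior["H1"] == Fraction(3, 5)
assert posterior["H2"] == Fraction(2, 5)
assert posterior["H3"] == 0

# Predictive probability that a second draw (after replacement) is white:
p_next_white = sum(p_white[h] * posterior[h] for h in boxes)
assert p_next_white == Fraction(13, 20)
```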
 We wish to know the probability that John has malaria, given that he has
a slightly unusual symptom: a high fever.
 We have 4 kinds of information:
a) probability that a person has malaria regardless of symptoms (0.0001)
b) probability that a person has the symptom of fever given that he
has malaria (0.75)
c) probability that a person has the symptom of fever given that he
does NOT have malaria (0.14)
d) John has a high fever
 H = John has malaria
 E = John has a high fever
 Given: P(H) = 0.0001, P(E|H) = 0.75, P(E|~H) = 0.14

P(H|E) = P(E|H) * P(H) / P(E)
P(E) = 0.75 * 0.0001 + 0.14 * 0.9999 = 0.14006
P(H|E) = (0.75 * 0.0001) / 0.14006 = 0.0005354

On the other hand, if John did not have a fever, his probability of having malaria would be

P(H|~E) = P(~E|H) * P(H) / P(~E) = (1 - 0.75)(0.0001) / (1 - 0.14006) = 0.000029

which is much smaller.
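The same arithmetic can be checked in code (numbers straight from the slide):

```python
# Given: P(H) = 0.0001, P(E|H) = 0.75, P(E|~H) = 0.14
p_h, p_e_h, p_e_not_h = 0.0001, 0.75, 0.14

# Total probability gives P(E); Bayes' rule then gives P(H|E).
p_e = p_e_h * p_h + p_e_not_h * (1 - p_h)
p_h_e = p_e_h * p_h / p_e
assert abs(p_e - 0.14006) < 1e-5
assert abs(p_h_e - 0.0005354) < 1e-6

# Without the fever: P(H|~E) = P(~E|H) * P(H) / P(~E)
p_h_not_e = (1 - p_e_h) * p_h / (1 - p_e)
assert abs(p_h_not_e - 0.000029) < 1e-6
```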
Prior or unconditional probabilities of propositions
e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72
correspond to belief prior to arrival of any (new) evidence
Probability distribution gives values for all possible assignments:
P(Weather) = ⟨0.72, 0.1, 0.08, 0.1⟩ (normalized, i.e., sums to 1)

Joint probability distribution for a set of r.v.s gives the probability of every
atomic event on those r.v.s (i.e., every sample point)
P(Weather, Cavity) = a 4 × 2 matrix of values:

Weather = sunny rain cloudy snow


Cavity = true 0.144 0.02 0.016 0.02
Cavity = false 0.576 0.08 0.064 0.08
Every question about a domain can be answered by the joint
distribution because every event is a sum of sample points
Conditional or posterior probabilities
e.g., P(cavity|toothache) = 0.8
i.e., given that toothache is all I know
NOT "if toothache then 80% chance of cavity"

(Notation for conditional distributions:
P(Cavity|Toothache) = 2-element vector of 2-element vectors)

If we know more, e.g., cavity is also given, then we have
P(cavity|toothache, cavity) = 1

Note: the less specific belief remains valid after more evidence arrives,
but is not always useful

New evidence may be irrelevant, allowing simplification, e.g.,
P(cavity|toothache, 49ersWin) = P(cavity|toothache) = 0.8

This kind of inference, sanctioned by domain knowledge, is crucial
Definition of conditional probability: P(a|b) = P(a ∧ b) / P(b) whenever P(b) > 0
Start with the joint distribution:
toothache ¬toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬cavity .016 .064 .144 .576

For any proposition φ, sum the atomic events where it is true:


P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
Start with the joint distribution:
toothache ¬toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬cavity .016 .064 .144 .576

For any proposition φ, sum the atomic events where it is true:


P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
Start with the joint distribution:
toothache ¬toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬cavity .016 .064 .144 .576

For any proposition φ, sum the atomic events where it is true:


P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
Start with the joint distribution:
toothache ¬toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬cavity .016 .064 .144 .576

Can also compute conditional probabilities:

P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                       = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
                       = 0.4
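Summing atomic events is a short computation over a dictionary form of the joint; a sketch using the dentistry table above:

```python
# The full joint distribution, keyed by (cavity, toothache, catch).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def p(event):
    # P(phi) = sum of the probabilities of atomic events where phi holds
    return sum(pr for w, pr in joint.items() if event(*w))

assert abs(p(lambda cav, t, c: t) - 0.2) < 1e-9            # P(toothache)
assert abs(p(lambda cav, t, c: cav or t) - 0.28) < 1e-9    # P(cavity v toothache)
# P(~cavity | toothache) = P(~cavity ^ toothache) / P(toothache)
cond = p(lambda cav, t, c: (not cav) and t) / p(lambda cav, t, c: t)
assert abs(cond - 0.4) < 1e-9
```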
toothache ¬toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬cavity .016 .064 .144 .576

Denominator can be viewed as a normalization constant α

P(Cavity|toothache) = α P(Cavity, toothache)
= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
= α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩]
= α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
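The normalization trick in code: fix the evidence, sum out the hidden variable Catch, then rescale. The joint table is repeated so the snippet stands alone:

```python
# Joint distribution keyed by (cavity, toothache, catch).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

# Fix the evidence toothache = True and sum out the hidden variable catch.
unnormalized = {
    cav: sum(pr for (c, t, _), pr in joint.items() if c == cav and t)
    for cav in (True, False)
}
alpha = 1 / sum(unnormalized.values())     # alpha = 1 / P(toothache)
posterior = {cav: alpha * pr for cav, pr in unnormalized.items()}
assert abs(posterior[True] - 0.6) < 1e-9
assert abs(posterior[False] - 0.4) < 1e-9
```

Note that α never has to be computed from P(toothache) separately; the unnormalized entries already sum to it.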

General idea: compute distribution on query variable by fixing evidence


variables and summing over hidden variables
A and B are independent iff
P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A)P(B)
[Figure: adding an independent Weather variable factors the model into two
pieces: {Toothache, Catch, Cavity} and {Weather}.]

P(Toothache, Catch, Cavity, Weather)
= P(Toothache, Catch, Cavity) P(Weather)

32 entries reduced to 12; for n independent biased coins, 2^n → n

Absolute independence powerful but rare


Dentistry is a large field with hundreds of variables,
none of which are independent. What to do?
Let X be all the variables. Typically, we want
the posterior joint distribution of the query variables Y
given specific values e for the evidence variables E
Let the hidden variables be H = X − Y − E
Then the required summation of joint entries is done by summing out the hidden
variables:
P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
The terms in the summation are joint entries because Y, E, and H together
exhaust the set of random variables

Obvious problems:
1) Worst-case time complexity O(d^n) where d is the largest arity
2) Space complexity O(d^n) to store the joint distribution
3) How to find the numbers for O(d^n) entries???
P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries

If I have a cavity, the probability that the probe catches in it doesn’t depend on
whether I have a toothache:
(1) P (catch|toothache, cavity) = P (catch|cavity)

The same independence holds if I haven’t got a cavity:


(2) P (catch|toothache, ¬cavity) = P (catch|¬cavity)

Catch is conditionally independent of Toothache given Cavity:


P(Catch|Toothache, Cavity) = P(Catch|Cavity)
Equivalent statements:
P(Toothache |Catch, Cavity) = P(Toothache|Cavity)
P(Toothache, Catch| Cavity) = P(Toothache|Cavity)P(Catch|Cavity)
Write out full joint distribution using chain rule:
P(T oothache, Catch, Cavity)
= P(T oothache|Catch, Cavity)P(Catch, Cavity)
= P(T oothache|Catch, Cavity)P(Catch|Cavity)P(Cavity)
= P(T oothache|Cavity)P(Catch|Cavity)P(Cavity)

I.e., 2 + 2 + 1 = 5 independent numbers (equations 1 and 2 remove 2)


In most cases, the use of conditional independence reduces the size of the
representation of the joint distribution from exponential in n to linear in n.
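The conditional independence asserted in equations (1) and (2) can be verified numerically against the joint table from the earlier dentistry slides; a self-contained sketch:

```python
# Joint distribution keyed by (cavity, toothache, catch).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def p(event):
    return sum(pr for w, pr in joint.items() if event(*w))

def p_cond(event, given):
    return p(lambda *w: event(*w) and given(*w)) / p(given)

catch = lambda cav, t, c: c
for cav_val in (True, False):
    # P(catch | toothache, Cavity=cav_val) == P(catch | Cavity=cav_val)
    with_toothache = lambda cav, t, c, v=cav_val: t and cav == v
    cavity_only    = lambda cav, t, c, v=cav_val: cav == v
    assert abs(p_cond(catch, with_toothache) - p_cond(catch, cavity_only)) < 1e-9
```

Both conditionals come out to 0.9 given cavity and 0.2 given ¬cavity, so this particular joint really does satisfy the claimed independence.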
Conditional independence is our most basic and robust
form of knowledge about uncertain environments.
Product rule: P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)

⇒ Bayes' rule: P(a|b) = P(b|a)P(a) / P(b)

or in distribution form

P(Y|X) = P(X|Y)P(Y) / P(X) = α P(X|Y)P(Y)

Useful for assessing diagnostic probability from causal probability:

P(Cause|Effect) = P(Effect|Cause)P(Cause) / P(Effect)

E.g., let M be meningitis, S be stiff neck:

P(m|s) = P(s|m)P(m) / P(s) = (0.8 × 0.0001) / 0.1 = 0.0008
Note: posterior probability of meningitis still very small!
P(Cavity|toothache ∧ catch)
= α P(toothache ∧ catch|Cavity) P(Cavity)
= α P(toothache|Cavity) P(catch|Cavity) P(Cavity)

This is an example of a naive Bayes model:

P(Cause, Effect_1, . . . , Effect_n) = P(Cause) Π_i P(Effect_i | Cause)

[Figure: Cavity with children Toothache and Catch; in general, Cause with
children Effect_1 . . . Effect_n.]

Total number of parameters is linear in n
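A naive Bayes version of the dentistry query, with conditional probabilities that I derived from the earlier joint table (P(cavity) = 0.2, P(toothache|cavity) = 0.6, P(catch|cavity) = 0.9, P(toothache|~cavity) = 0.1, P(catch|~cavity) = 0.2):

```python
# Parameters of the naive Bayes model (derived from the slides' joint table).
p_cavity    = {True: 0.2, False: 0.8}
p_toothache = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
p_catch     = {True: 0.9, False: 0.2}   # P(catch | Cavity)

# P(Cavity | toothache, catch) is proportional to
# P(toothache|Cavity) * P(catch|Cavity) * P(Cavity)
unnorm = {c: p_toothache[c] * p_catch[c] * p_cavity[c] for c in (True, False)}
alpha = 1 / sum(unnorm.values())
posterior = {c: alpha * v for c, v in unnorm.items()}

# Matches the exact answer from the joint: 0.108 / (0.108 + 0.016)
assert round(posterior[True], 3) == 0.871
```

Because this joint happens to satisfy the conditional independences exactly, the naive Bayes answer agrees with the answer computed directly from the full joint.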


The full joint distribution is P(P1,1, . . . , P4,4, B1,1, B1,2, B2,1)
Apply product rule: P(B1,1, B1,2, B2,1 | P1,1, . . . , P4,4) P(P1,1, . . . , P4,4)
(Do it this way to get P(Effect|Cause).)

First term: 1 if pits are adjacent to breezes, 0 otherwise

Second term: pits are placed randomly, probability 0.2 per square:

P(P1,1, . . . , P4,4) = Π_{i,j} P(Pi,j) = 0.2^n × 0.8^(16−n) for n pits
We know the following facts:
b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Query is P(P1,3|known, b)

Define Unknown = the Pi,j's other than P1,3 and Known

For inference by enumeration, we have

P(P1,3|known, b) = α Σ_unknown P(P1,3, unknown, known, b)

Grows exponentially with number of squares!


Basic insight: observations are conditionally independent of other hidden
squares given neighbouring hidden squares
[Figure: the 4×4 wumpus grid. QUERY is square (1,3); KNOWN is (1,1), (1,2),
(2,1); FRINGE is the unknown squares adjacent to the known ones; OTHER is
everything else.]

Define Unknown = Fringe ∪ Other


P(b|P1,3, Known, Unknown) = P(b|P1,3, Known, Fringe)
Manipulate query into a form where we can use this!
Probability is a rigorous formalism for uncertain knowledge
Joint probability distribution specifies probability of every atomic event
Queries can be answered by summing over atomic events
For nontrivial domains, we must find a way to reduce the joint size
Independence and conditional independence provide the tools
