Ai R16 - Unit-6
UNIT-VI: Uncertainty Measure & Fuzzy Sets and Fuzzy Logic
6.1. Uncertainty
● Uncertainty may occur in a knowledge-based system (KBS) because of problems with the data.
● The probability that a particular event will occur = the number of ways the
event can occur divided by the total number of equally likely possible outcomes.
Example: The probability of throwing two successive heads with a fair coin is
probability = 1/4 = 0.25
since exactly one of the four equally likely outcomes (HH, HT, TH, TT) is favourable.
● Since events are sets, all set operations can be performed on events. The basic properties include:
– P(S) = 1, where S is the sure event
– P(A') = 1 - P(A)
– P(A or B) = P(A) + P(B) - P(A and B)
● The probability of getting heads on one or on both of the coins, i.e. the
union of the events A and B, is expressed as
P(A or B) = P(A) + P(B) - P(A and B) = 0.5 + 0.5 - 0.25 = 0.75
6.2.3. Conditional Probability
P(H | E) = P(H and E) / P(E)
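For instance, when rolling a fair die, let H = "the outcome is 2" and E = "the outcome is even". Then P(H and E) = 1/6 and P(E) = 1/2, so
P(H | E) = (1/6) / (1/2) = 1/3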
This approach relies on the concept that one should incorporate the prior
probability of an event into the interpretation of a situation. Bayes' theorem states:
P(H|E) = P(E|H) * P(H) / P(E)
Proof of Bayes' Theorem
From the definition of conditional probability,
P(H and E) = P(H|E) * P(E) and P(E and H) = P(E|H) * P(H)
Since P(H and E) = P(E and H), equating the right-hand sides and dividing by P(E), we obtain
P(H|E) = P(E|H) * P(H) / P(E)
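Continuing the die illustration: since P(E|H) = 1 (if the outcome is 2, it is certainly even), Bayes' theorem gives P(H|E) = 1 * (1/6) / (1/2) = 1/3, agreeing with the direct computation above.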
● We normally assume that facts are always completely true, but facts might
also be only probably true.
Example:
battery_dead(0.04).
– a fact asserting that the battery is dead with probability 0.04.
● Rules may carry probabilities as well:
– "if 30% of the time when the car does not start, it is true that the
battery is dead"
Here 30% is the rule probability. If the right-hand side of the rule is certain,
then we can even write the above rule as:
battery_dead(0.3) :- ignition_not_start.
If the right-hand side is itself uncertain, its probability can be passed on to the conclusion:
battery_dead(P) :- voltmeter_measurement_abnormal(P).
– We should gather all relevant rules and facts about the battery being
dead.
● In this case, the probabilities of the subgoals on the right side of the rule are
multiplied, assuming all the events are independent of each other, using
the formula
P(A and B and C and ...) = P(A) * P(B) * P(C) * ...
● The rules with the same conclusion can be uncertain for different reasons.
● If there is more than one rule with the same predicate name, each having a
different probability, then the cumulative likelihood of that
predicate can be computed by or-combination.
● To get the overall probability of the predicate, the following formula gives the
'or' probability when the events are mutually independent:
Prob(A or B or C or ...) = 1 - (1 - P(A)) * (1 - P(B)) * (1 - P(C)) * ...
Examples
1. "half of the time when a computer does not work, then the battery is dead"
battery_dead(P):-computer_dead(P1), P is P1*0.5.
2. "95% of the time when a computer has electricalproblem and battery is old,
then the battery is dead"
battery_dead(P) :- electrical_prob(P1),
battery_old(P2), P is P1 * P2 * 0.95.
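The two rules above can then be or-combined. A minimal SWI-Prolog sketch (the auxiliary names battery_dead_1/battery_dead_2 and the base facts are assumed for illustration, not taken from the notes):

% each rule computes its own likelihood for the same conclusion
battery_dead_1(P) :- computer_dead(P1), P is P1 * 0.5.
battery_dead_2(P) :- electrical_prob(P1), battery_old(P2),
                     P is P1 * P2 * 0.95.

% cumulative likelihood by or-combination (independence assumed)
battery_dead(P) :-
    battery_dead_1(P1),
    battery_dead_2(P2),
    P is 1 - (1 - P1) * (1 - P2).

% hypothetical base facts for a test query
computer_dead(0.6).
electrical_prob(0.9).
battery_old(0.8).

% ?- battery_dead(P).  here P1 = 0.3 and P2 = 0.684, so P = 0.7788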
● For the sake of simplicity we write P(X1, ..., Xn) instead of P(X1 and
... and Xn).
P(X1, ..., Xn) = P(Xn | X1, ..., Xn-1) * P(X1, ..., Xn-1)
or, expanding recursively,
P(X1, ..., Xn) = P(Xn | X1, ..., Xn-1) * P(Xn-1 | X1, ..., Xn-2) * ... * P(X2 | X1) * P(X1)
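For example, for three variables the chain rule reads
P(X1, X2, X3) = P(X3 | X1, X2) * P(X2 | X1) * P(X1)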
● Nodes with no children are termed hypothesis nodes, and nodes with
no parents are called independent (evidence) nodes.
● Here there are four nodes, with {A, B} representing evidences and
{C, D} representing hypotheses.
Figure: a four-node Bayesian network; arcs run from the evidence nodes A, B to the hypothesis nodes C, D.
● All four variables have two possible values T (for true) and F (for false).
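Under this structure the joint distribution factors compactly. Assuming each hypothesis node has both evidence nodes as parents (an illustrative assumption about the arcs), the chain rule together with the network's independence assertions gives
P(A, B, C, D) = P(A) * P(B) * P(C | A, B) * P(D | A, B)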
● Using this model one can answer queries with the conditional
probability formula, e.g.
P(C = T | A = T) = P(C = T and A = T) / P(A = T)
where each joint probability is obtained by summing the factored joint distribution over the variables not mentioned in the query.
● It can easily handle situations where some data entries are missing, as
this model encodes dependencies among all variables.
● The probabilities are described as a single numeric point value. This can
be a distortion of the precision that is actually available for supporting
evidence.
● A major shortcoming is the dependence of the results on the quality and
extent of the prior beliefs used in Bayesian inference processing.
● The reliability of a Bayesian network depends on the reliability of the prior
knowledge.
Selecting the proper distribution model to describe the data has a notable effect
on the quality of the resulting network, so the choice of statistical
distribution for modeling the data is very important.
● A measure of belief can initially be defined as
MB[H, E] = (P(H|E) - P(H)) / (1 - P(H))
In order to avoid getting a negative value of belief, we can modify the above
definition to obtain a positive value of the measure as follows:
MB[H, E] = 1, if P(H) = 1
= (max{P(H|E), P(H)} - P(H)) / (1 - P(H)), otherwise
Alternatively, a measure of disbelief is defined as
MD[H, E] = 1, if P(H) = 0
= (P(H) - min{P(H|E), P(H)}) / P(H), otherwise
● The certainty factor combines these two measures: CF[H, E] = MB[H, E] - MD[H, E].
● When two pieces of evidence E1 and E2 both bear on the same hypothesis H, let us first compute MB(H, E1 and E2) and MD(H, E1 and E2):
MB(H, E1 and E2) = MB(H, E1) + MB(H, E2) * (1 - MB(H, E1))
● Similarly MD is defined:
MD(H, E1 and E2) = MD(H, E1) + MD(H, E2) * (1 - MD(H, E1))
Therefore,
CF(H, E1 and E2) = MB(H, E1 and E2) - MD(H, E1 and E2) = 0.58
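As a worked illustration with assumed values: if MB(H, E1) = 0.3, MB(H, E2) = 0.4 and MD(H, E1) = MD(H, E2) = 0, then
MB(H, E1 and E2) = 0.3 + 0.4 * (1 - 0.3) = 0.58
CF(H, E1 and E2) = 0.58 - 0 = 0.58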
● Case 3: In chained rules, the rules are chained together so that the
outcome of one rule is the input of another rule. For
example, the outcome of an experiment may be treated as evidence
for some hypothesis, i.e., E1 → E2 → H. The certainty factor then propagates along the chain:
CF(H, E1) = CF(H, E2) * max{0, CF(E2, E1)}
● These degrees of belief may or may not have the mathematical properties
of probabilities.
● How much they differ from probabilities will depend on how closely the
evidence and the hypothesis it supports are related.
Example
– Suppose Mary tells John that his car has been stolen. John's belief
in the truth of this statement will depend on the reliability of
Mary; say John judges her to be reliable with degree 0.85. But it does not
mean that the statement is false if Mary is not reliable.
– The statement "the car is stolen" is therefore assigned belief 0.85, and the
statement "the car is not stolen" is assigned belief 0. This zero does not
mean that John is sure that his car is not stolen; in the case of
probability, a 0 would mean that John is sure that his car is not
stolen. The values 0.85 and 0 together constitute a belief function.
● The power set P(U) is the set of all possible subsets of U, including
the empty set, represented by ∅.
● Assume that m1 and m2 are two belief (mass) functions representing
multiple sources of evidence for two different hypotheses. Dempster's rule combines them into a new function m3:
m3(∅) = 0
m3(C) = Σ{A ∩ B = C} m1(A) * m2(B) / (1 - Σ{A ∩ B = ∅} m1(A) * m2(B)), for C ≠ ∅
● This belief function gives a new value for every non-empty set C that arises as an intersection A ∩ B.
● The normalization factor has the effect of completely ignoring conflict and
attributing any mass associated with conflict to the null set.
– Any belief mass not committed to a specific subset is assigned to the whole set U; this means we are sure that the answer is somewhere in the whole set
U. For example:
m1(U) = 0.2
m2(U) = 0.4
● After combination, the previous belief functions are modified to m3, whose belief
values differ from the earlier beliefs:
m3({flu}) = 0.48
m3({flu, cold}) = 0.12
m3(U) = 0.08
● Suppose m3 is further combined with another source of evidence m4, where m4(U) = 0.3.
● From the combination table we get multiple belief values for the empty set ∅, and its
total belief value is 0.56.
● While computing the new belief we may get the same subset generated from
different intersections. The m value for such a set is computed by
summing all such products m1(A) * m2(B).
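The combination rule can be sketched in Prolog. This is an illustrative implementation only: the mass functions mass1/2 and mass2/2 and their focal elements (written as sorted lists) are hypothetical, chosen so that some conflict arises and the normalisation step is exercised.

% Dempster's rule of combination (SWI-Prolog sketch)
:- use_module(library(ordsets)).

mass1([flu], 0.8).            mass1([cold, flu], 0.2).
mass2([flu], 0.6).            mass2([cold, measles], 0.4).

% one cell of the intersection table: C = A ∩ B with mass m1(A)*m2(B)
cell(C, P) :-
    mass1(A, PA), mass2(B, PB),
    ord_intersection(A, B, C),
    P is PA * PB.

% K: total mass falling on the empty set, i.e. the conflict
conflict(K) :- findall(P, cell([], P), Ps), sum_list(Ps, K).

% combined mass: sum the cells for each non-empty C, then normalise
combined(C, M) :-
    setof(C0, P^cell(C0, P), Cs),
    member(C, Cs), C \== [],
    findall(P, cell(C, P), Ps), sum_list(Ps, S),
    conflict(K),
    M is S / (1 - K).

% ?- combined(C, M).  here the conflict is K = 0.32, giving
% C = [cold], M = 0.08/0.68 ≈ 0.118 and C = [flu], M = 0.60/0.68 ≈ 0.882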
● In fuzzy logic, a statement can be both true and false to some degree, and it
can also be neither completely true nor completely false. Fuzzy logic is a non-monotonic logic.
● In fuzzy logic, the classical law of the excluded middle does not hold.
● Russell's paradox
– A barber shaves exactly those men who do not shave themselves. Does the barber shave himself? Assume that he did shave himself. But we see from the story that
he shaved only those men who did not shave themselves;
therefore, he did not shave himself. Either assumption thus leads to its opposite.
Example – Paradox
● "All Cretans are liars", said the Cretan.
● If the Cretan is a liar, then his claim cannot be believed, and so he is not a
liar.
● The main idea behind Fuzzy systems is that truth values (in fuzzy logic)
or membership values are indicated by a value in the range [0,1] with 0
for absolute falsity and 1 for absolute truth.
● Fuzzy set theory differs from conventional set theory in that it allows each
element of a given set to belong to that set to some degree.
● In classical set theory, by contrast, each element either fully belongs to the
set or is completely excluded from it.
● In other words, classical set theory represents a special case of the more
general fuzzy set theory.
Example
Probability approach:
● We may assign the statement "Helen is old" the truth value of 0.95. The
interpretation is that there is a 95% chance that Helen is old.
Fuzzy approach:
● Here the value 0.95 is interpreted as Helen's degree of membership in the set of old people, i.e. Helen is more or less old.
● Although these two statements seem similar, they actually carry
different meanings.
● First view: there is a 5% chance that Helen may not be old at all.
● Second view: there is no chance of Helen being young; she is more
or less old.
● A fuzzy set F on a universe X is written as F = { (x, µF(x)) | x ∈ X }, where µF(x) ∈ [0, 1] is the degree of membership of x in F.
Example: over the universe X = {1, 2, ..., 7} (elements with membership 0 are omitted), let
A = { (3, 0.7), (5, 1), (6, 0.8) } and B = { (3, 0.9), (4, 1), (6, 0.6) }
The complement, defined by µA'(x) = 1 - µA(x), is
A' = { (1, 1), (2, 1), (3, 0.3), (4, 1), (6, 0.2), (7, 1) }
Union and intersection of A and B are illustrated below.
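Using the standard pointwise definitions µA∪B(x) = max{µA(x), µB(x)} and µA∩B(x) = min{µA(x), µB(x)}, the sets above give:
A ∪ B = { (3, 0.9), (4, 1), (5, 1), (6, 0.8) }
A ∩ B = { (3, 0.7), (6, 0.6) }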
Additional membership functions
● S-shaped function
● Z-shaped function
● Pi function
● Vicinity function
S-shaped function
S(x; a, b, c) = 0, for x ≤ a
= 2[(x - a) / (c - a)]², for a < x ≤ b
= 1 - 2[(x - c) / (c - a)]², for b < x ≤ c
= 1, for x > c
Here b = (a + c)/2 is the crossover point, at which S = 0.5.
Figure: S-shaped membership function, rising from 0 at a to 1 at c.
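A direct transcription into Prolog (an illustrative sketch; the predicate name s_mf/5 is assumed):

% s_mf(+X, +A, +B, +C, -Y): Y is the S-function value S(X; A, B, C)
s_mf(X, A, _, _, 0.0) :- X =< A, !.
s_mf(X, A, B, C, Y) :- X =< B, !, Y is 2 * ((X - A) / (C - A)) ** 2.
s_mf(X, A, _, C, Y) :- X =< C, !, Y is 1 - 2 * ((X - C) / (C - A)) ** 2.
s_mf(_, _, _, _, 1.0).

% ?- s_mf(50, 20, 50, 80, Y).   Y = 0.5, the crossover point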
Z-Shaped Function
Z(x; a, b, c) = 1, for x ≤ a
= 1 - 2[(x - a) / (c - a)]², for a < x ≤ b
= 2[(x - c) / (c - a)]², for b < x ≤ c
= 0, for x > c
Note that Z(x; a, b, c) = 1 - S(x; a, b, c).
Figure: Z-shaped membership function, falling from 1 at a to 0 at c.
Triangular function
F(x; a, b, c) = 0, if x < a
= (x - a) / (b - a), if a ≤ x ≤ b
= (c - x) / (c - b), if b ≤ x ≤ c
= 0, if c < x
Figure: triangular membership function, with the membership value rising from 0 at a to 1 at b and falling back to 0 at c.
Gaussian function
G(x; a, b) = e^(-(x - b)² / (2a²))
Pi function
F(x; a, b, c, d) = 0, if x < a
= (x - a) / (b - a), if a ≤ x ≤ b
= 1, if b < x < c
= (d - x) / (d - c), if c ≤ x ≤ d
= 0, if d < x
● The parameters a and d locate the 'feet' of the curve, while b and c locate
its 'shoulders'. In the graph given in Fig. 10.14, a = 2, b = 4, c = 5, and d
= 9.
Figure: Pi-shaped membership function, with feet at a and d and shoulders at b and c.
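As a quick check with the parameter values from the figure (a = 2, b = 4, c = 5, d = 9):
F(3; 2, 4, 5, 9) = (3 - 2) / (4 - 2) = 0.5
F(4.5; 2, 4, 5, 9) = 1
F(7; 2, 4, 5, 9) = (9 - 7) / (9 - 5) = 0.5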
Vicinity function
● The vicinity function is centred at a point a and falls off to zero on both sides; the total width of the function between its two zero points is equal to 2b.
Figure: Vicinity function, centred at a and reaching zero at a - b and a + b.
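In Zadeh's standard formulation, the vicinity (π) function can be built from the S-function:
π(x; b, a) = S(x; a - b, a - b/2, a), for x ≤ a
= 1 - S(x; a, a + b/2, a + b), for x > a
which is zero outside [a - b, a + b] and reaches 1 at x = a.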
● Most of the actual fuzzy control operations are drawn from a small set of
different curves such as those above.