(R17A1204) Artificial Intelligence (6) - 119-143


Uncertain knowledge and Learning

Core vs. Probabilistic AI


 Knowledge Reasoning: work with facts/assertions; develop rules of logical inference.
 Planning: work with applicability/effects of actions; develop searches for actions that achieve goals/avert disasters.
 Expert Systems: develop by hand a set of rules for examining inputs, updating internal states, and generating outputs.
 Learning approach: use probabilistic models to tune performance based on many data examples.
 Probabilistic AI: emphasis on noisy measurements, approximation in hard cases, learning, and algorithmic issues.
o logical assertions ⇒ probability distributions
o logical inference ⇒ conditional probability distributions
o logical operators ⇒ probabilistic generative models

Probabilistic reasoning
Causes of uncertainty:
The following are some leading causes of uncertainty in the real world.

 Information from unreliable sources
 Experimental errors
 Equipment faults
 Temperature variation
 Climate change

Probabilistic reasoning is a way of knowledge representation where we apply the concept of probability
to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine probability theory with
logic to handle the uncertainty.

We use probability in probabilistic reasoning because it provides a way to handle the uncertainty that results from laziness (unwillingness to enumerate every relevant factor) and ignorance (lack of complete knowledge).

In the real world, there are many scenarios where the certainty of something is not confirmed, such as "It will rain today," someone's behavior in a given situation, or the outcome of a match between two teams or two players. These are probable sentences: we can assume they may hold but cannot be sure, so we use probabilistic reasoning.

Probability: Probability can be defined as the chance that an uncertain event will occur. It is the numerical measure of the likelihood that an event will occur. The value of a probability always lies between 0 and 1:

0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.

P(A) = 0 indicates that event A will certainly not occur.
P(A) = 1 indicates that event A will certainly occur.

Artificial Intelligence Page 119


We can find the probability that an event does not happen using the formula:
P(¬A) + P(A) = 1, where P(¬A) is the probability of event A not happening.
Event: Each possible outcome of a variable is called an event.
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in the real world.
Prior probability: The prior probability of an event is probability computed before observing new
information.
Posterior probability: The probability that is calculated after all evidence or information has been taken into account. It is a combination of the prior probability and the new information.
Conditional probability:
Conditional probability is the probability of an event occurring given that another event has already happened.

Let's suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A under the condition B". It can be written as:

P(A|B) = P(A⋀B) / P(B)

where P(A⋀B) = joint probability of A and B, and
P(B) = marginal probability of B.

If the probability of A is given and we need to find the probability of B, then it will be given as:

P(B|A) = P(A⋀B) / P(A)

It can be explained using a Venn diagram: since B is the event that has occurred, the sample space is reduced to set B, and we can now calculate event A given that event B has already occurred by dividing P(A⋀B) by P(B).

Example:
In a class, 70% of the students like English and 40% of the students like both English and mathematics. What percentage of the students who like English also like mathematics?
Solution:
Let A be the event that a student likes Mathematics and B be the event that a student likes English. Then:

P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 = 0.57

Hence, 57% of the students who like English also like Mathematics.
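The computation above can be checked with a short Python snippet (the variable names are mine, not from the text):

```python
# Conditional probability: P(A|B) = P(A and B) / P(B).
p_english = 0.70        # P(B): students who like English
p_both = 0.40           # P(A and B): students who like both subjects

p_math_given_english = p_both / p_english
print(round(p_math_given_english, 2))  # → 0.57
```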

Why Reason Probabilistically?

 In many problem domains it isn't possible to create complete, consistent models of the world.
Therefore agents (and people) must act in uncertain worlds (which the real world is).
 Want an agent to make rational decisions even when there is not enough information to prove
that an action will work.
 Some of the reasons for reasoning under uncertainty:
o True uncertainty. E.g., flipping a coin.
o Theoretical ignorance. There is no complete theory which is known about the problem
domain. E.g., medical diagnosis.
o Laziness. The space of relevant factors is very large, and would require too much work to
list the complete set of antecedents and consequents. Furthermore, it would be too hard
to use the enormous rules that resulted.
o Practical ignorance. Uncertain about a particular individual in the domain because all of
the information necessary for that individual has not been collected.



 Probability theory will serve as the formal language for representing and reasoning with
uncertain knowledge.

Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which determines the
probability of an event with uncertain knowledge.

In probability theory, it relates the conditional probability and marginal probabilities of two random
events.
Bayes' theorem was named after the British mathematician Thomas Bayes. Bayesian inference is an application of Bayes' theorem and is fundamental to Bayesian statistics.
It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).
Bayes' theorem allows updating the probability prediction of an event by observing new information from the real world.

Example: If the probability of cancer depends on one's age, then using Bayes' theorem we can determine the probability of cancer more accurately given the age.

Bayes' theorem can be derived using product rule and conditional probability of event A with
known event B:

From the product rule we can write:

1. P(A ⋀ B) = P(A|B) P(B)

Similarly, the probability of event B with known event A:

2. P(A ⋀ B) = P(B|A) P(A)

Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B)    ...(a)

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most modern AI systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypothesis A given that we have observed evidence B.
P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate the probability of the evidence.
P(A) is called the prior probability, the probability of the hypothesis before considering the evidence.
P(B) is called the marginal probability, the probability of the evidence on its own.



In equation (a), in general we can write P(B) = Σi P(Ai) P(B|Ai); hence Bayes' rule can be written as:

P(Ai|B) = P(Ai) P(B|Ai) / Σk P(Ak) P(B|Ak)

Applying Bayes' rule:

Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is very
useful in cases where we have a good probability of these three terms and want to determine the fourth
one.

Suppose we observe the effect of some unknown cause and want to compute that cause. Then Bayes' rule becomes:

P(cause|effect) = P(effect|cause) P(cause) / P(effect)

Example-1:

Question: What is the probability that a patient has the disease meningitis given a stiff neck?

Given Data:

A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs
80% of the time. He is also aware of some more facts, which are given as follows:

o The Known probability that a patient has meningitis disease is 1/30,000.


o The Known probability that a patient has a stiff neck is 2%.

Let a be the proposition that the patient has a stiff neck and b the proposition that the patient has meningitis. Then we can state the following:

P(a|b) = 0.8

P(b) = 1/30000

P(a) = 0.02

Applying Bayes' rule:

P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 = 1/750 ≈ 0.00133



Hence, we can assume that about 1 in 750 patients with a stiff neck has meningitis.
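The update can be reproduced with a short Python sketch of Bayes' rule (variable names are mine):

```python
# Bayes' rule: P(b|a) = P(a|b) * P(b) / P(a)
p_a_given_b = 0.8      # P(stiff neck | meningitis)
p_b = 1 / 30000        # prior P(meningitis)
p_a = 0.02             # P(stiff neck)

p_b_given_a = p_a_given_b * p_b / p_a
print(round(1 / p_b_given_a))  # → 750, i.e. about 1 patient in 750
```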

A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:

o Directed Acyclic Graph
o Table of conditional probabilities

The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:

o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs (directed arrows) represent the causal relationships or conditional probabilities between random variables. These directed links connect pairs of nodes in the graph. A link means that one node directly influences the other; if there is no directed link, the nodes are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of the network graph.
o If we consider node B, which is connected to node A by a directed arrow, then node A is called the parent of node B.
o Node C is independent of node A.

The Bayesian network has mainly two components:



o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.

Representing Belief about Propositions

 Rather than reasoning about the truth or falsity of a proposition, reason about the belief that a
proposition or event is true or false
 For each primitive proposition or event, attach a degree of belief to the sentence
 Use probability theory as a formal means of manipulating degrees of belief
 Given a proposition, A, assign a probability, P(A), such that 0 <= P(A) <= 1, where if A is true,
P(A)=1, and if A is false, P(A)=0. Proposition A must be either true or false, but P(A) summarizes
our degree of belief in A being true/false.
o Examples
o P(Weather=Sunny) = 0.7 means that we believe that the weather will be Sunny with 70%
certainty. In this case Weather is a random variable that can take on values in a domain such
as {Sunny, Rainy, Snowy, Cloudy}.
o P(Cavity=True) = 0.05 means that we believe there is a 5% chance that a person has a cavity.
Cavity is a Boolean random variable since it can take on possible values True and False.
o Example: P(A=a ^ B=b) = P(A=a, B=b) = 0.2, where A=My_Mood, a=happy, B=Weather, and
b=rainy, means that there is a 20% chance that when it's raining my mood is happy.

 We will assume that in a given problem domain, the programmer and expert identify all of the
relevant propositional variables that are needed to reason about the domain.
 Each of these will be represented as a random variable, i.e., a variable that can take on values
from a set of mutually exclusive and exhaustive values called the sample space or partition of
the random variable. Usually this will mean a sample space {True, False}.
 For example, the proposition Cavity has possible values True and False indicating whether a
given patient has a cavity or not. A random variable that has True and False as its possible values
is called a Boolean random variable.

More generally, propositions can include the equality predicate with random variables and the
possible values they can have.
For example, we might have a random variable Color with possible values red, green, blue,
and other.
Then P(Color=red) indicates the likelihood that the color of a given object is red.
Similarly, for Boolean random variables we can ask P(A=True), which is abbreviated to P(A), and
P(A=False), which is abbreviated to P(~A).

Axioms of Probability Theory



Probability Theory provides us with the formal mechanisms and rules for manipulating propositions
represented probabilistically. The following are the three axioms of probability theory:

 0 <= P(A=a) <= 1 for all a in sample space of A


 P(True)=1, P(False)=0
 P(A v B) = P(A) + P(B) - P(A ^ B)

From these axioms we can show the following properties also hold:

 P(~A) = 1 - P(A)
 P(A) = P(A ^ B) + P(A ^ ~B)
 Sum{P(A=a)} = 1, where the sum is over all possible values a in the sample space of A
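These axioms and derived properties can be checked numerically against any joint distribution. The distribution below over two Boolean variables is an arbitrary example of mine, not from the text:

```python
# An arbitrary joint distribution P(A=a, B=b) over two Boolean variables.
joint = {
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}

p_a = sum(p for (a, b), p in joint.items() if a)     # P(A) by marginalizing
p_b = sum(p for (a, b), p in joint.items() if b)     # P(B)
p_a_and_b = joint[(True, True)]                      # P(A ^ B)
p_a_or_b = p_a + p_b - p_a_and_b                     # axiom 3

assert abs(sum(joint.values()) - 1.0) < 1e-9         # total probability is 1
assert abs((1 - p_a) - sum(p for (a, b), p in joint.items() if not a)) < 1e-9
assert abs(p_a - (joint[(True, True)] + joint[(True, False)])) < 1e-9
print(round(p_a, 2), round(p_b, 2), round(p_a_or_b, 2))  # → 0.5 0.3 0.6
```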

Joint Probability Distribution

Given an application domain in which we have determined a sufficient set of random variables to
encode all of the relevant information about that domain, we can completely specify all of the possible
probabilistic information by constructing the full joint probability distribution,
P(V1=v1, V2=v2, ..., Vn=vn), which assigns probabilities to all possible combinations of values to all
random variables.

For example, consider a domain described by three Boolean random variables, Bird, Flier, and Young.
Then we can enumerate a table showing all possible interpretations and associated probabilities:

Bird Flier Young Probability


T T T 0.0
T T F 0.2
T F T 0.04
T F F 0.01
F T T 0.01
F T F 0.01
F F T 0.23
F F F 0.5

Notice that there are 8 rows in the above table, reflecting the fact that there are 2^3 ways to assign values to the three Boolean variables. More generally, with n Boolean variables the table will be of size 2^n. And if the n variables each had k possible values, the table would be of size k^n.

Also notice that the sum of the probabilities in the right column must equal 1, since the set of all possible values for each variable is known. This means that for n Boolean random variables, the table has 2^n - 1 values that must be determined to completely fill it in.



If all of the probabilities are known for a full joint probability distribution table, then we can
compute any probabilistic statement about the domain. For example, using the table above, we can
compute

 P(Bird=T) = P(B) = 0.0 + 0.2 + 0.04 + 0.01 = 0.25


 P(Bird=T, Flier=F) = P(B, ~F) = P(B, ~F, Y) + P(B, ~F, ~Y) = 0.04 + 0.01 = 0.05
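The same marginalization can be done mechanically. In the sketch below, the `joint` dictionary is the table above keyed by (Bird, Flier, Young); the helper name `marginal` is mine:

```python
# Full joint probability distribution from the table above.
joint = {
    (True,  True,  True):  0.00, (True,  True,  False): 0.20,
    (True,  False, True):  0.04, (True,  False, False): 0.01,
    (False, True,  True):  0.01, (False, True,  False): 0.01,
    (False, False, True):  0.23, (False, False, False): 0.50,
}

def marginal(**fixed):
    """Sum the joint over all rows consistent with the fixed values.
    Keyword names: bird, flier, young (positions 0, 1, 2 of the key)."""
    idx = {"bird": 0, "flier": 1, "young": 2}
    return sum(p for row, p in joint.items()
               if all(row[idx[k]] == v for k, v in fixed.items()))

print(round(marginal(bird=True), 2))               # P(Bird=T)          → 0.25
print(round(marginal(bird=True, flier=False), 2))  # P(Bird=T, Flier=F) → 0.05
```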

Conditional Probabilities

 Conditional probabilities are key for reasoning because they formalize the process of
accumulating evidence and updating probabilities based on new evidence.
 For example, if we know there is a 4% chance of a person having a cavity, we can represent this
as the prior (aka unconditional) probability P(Cavity)=0.04.
 Say that person now has a toothache symptom; we'd like to know the posterior probability of a Cavity given this new evidence. That is, compute P(Cavity | Toothache).
 If P(A|B) = 1, this is equivalent to the sentence in Propositional Logic B => A. Similarly, if P(A|B)
=0.9, then this is like saying B => A with 90% certainty.
 In other words, we've made implication fuzzy because it's not absolutely certain.
 Given several measurements and other "evidence", E1, ..., Ek, we will formulate queries as P(Q |
E1, E2, ..., Ek) meaning "what is the degree of belief that Q is true given that we know E1, ...,
Ek and nothing else."

Conditional probability is defined as: P(A|B) = P(A ^ B)/P(B) = P(A,B)/P(B)


One way of looking at this definition is as a normalized (using P(B)) joint probability (P(A,B)).

 Example Computing Conditional Probability from the Joint Probability Distribution


Say we want to compute P(~Bird | Flier) and we know the full joint probability distribution
function given above.
 We can do this as follows:
 P(~B|F) = P(~B,F) / P(F)
 = (P(~B,F,Y) + P(~B,F,~Y)) / P(F)
 = (.01 + .01)/P(F)

Next, we could either compute the marginal probability P(F) from the full joint probability
distribution, or, as is more commonly done, we could do it by using a process
called normalization, which first requires computing

P(B|F) = P(B,F) / P(F)


= (P(B,F,Y) + P(B,F,~Y)) / P(F)
= (0.0 + 0.2)/P(F)

Now we also know that P(~B|F) + P(B|F) = 1, so substituting from above and solving for P(F) we
get P(F) = 0.22. Hence, P(~B|F) = 0.02/0.22 = 0.091.
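The normalization steps can be sketched directly (the joint-table entries are from the example above):

```python
# P(~B | F) by normalization: compute both unnormalized terms, then
# use the fact that P(~B|F) + P(B|F) = 1 to recover P(F).
p_notb_f = 0.01 + 0.01   # P(~B, F) = P(~B,F,Y) + P(~B,F,~Y)
p_b_f = 0.00 + 0.20      # P(B, F)  = P(B,F,Y)  + P(B,F,~Y)

p_f = p_notb_f + p_b_f   # normalizing constant P(F) = 0.22
print(round(p_notb_f / p_f, 3))  # → 0.091
```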



While this is an effective procedure for computing conditional probabilities, it is intractable in
general because it means that we must compute and store the full joint probability distribution
table, which is exponential in size.

 Some important rules related to conditional probability are:


o Rewriting the definition of conditional probability, we get the Product Rule: P(A,B) = P(A|B)P(B)
o Chain Rule: P(A,B,C,D) = P(A|B,C,D)P(B|C,D)P(C|D)P(D), which generalizes the product rule for a
joint probability of an arbitrary number of variables. Note that ordering the variables results in a
different expression, but all have the same resulting value.
o Conditionalized version of the Chain Rule: P(A,B|C) = P(A|B,C)P(B|C)
o Bayes's Rule: P(A|B) = (P(A)P(B|A))/P(B), which can be written as follows to more clearly emphasize
the "updating" aspect of the rule: P(A|B) = P(A) * [P(B|A)/P(B)] Note: The terms P(A) and P(B) are
called the prior (or marginal) probabilities. The term P(A|B) is called the posterior probability
because it is derived from or depends on the value of B.
o Conditionalized version of Bayes's Rule: P(A|B,C) = P(B|A,C)P(A|C)/P(B|C)
o Conditioning (aka Addition) Rule: P(A) = Sum{P(A|B=b)P(B=b)} where the sum is over all possible
values b in the sample space of B.
o P(~B|A) = 1 - P(B|A)

Assuming conditional independence of B and C given A, we can simplify Bayes's Rule for two pieces of evidence B and C:

 P(A | B,C) = (P(A)P(B,C | A))/P(B,C)


 = (P(A)P(B|A)P(C|A))/(P(B)P(C|B))
 = P(A) * [P(B|A)/P(B)] * [P(C|A)/P(C|B)]
 = (P(A) * P(B|A) * P(C|A))/P(B,C)

Naive Bayes Classifier:

Say we have a random variable, C, which represents the possible ways to classify an input pattern of
features that have been measured.

The domain of C is the set of possible classifications, e.g., it might be the possible diagnoses in a
medical domain.

Say the possible values for C are {a,b,c}, and the features we have measured are

E1=e1, E2=e2, ..., En=en.

Then we can compute

P(C=a | E1=e1, ..., En=en),

P(C=b | E1=e1, ..., En=en) and



P(C=c | E1=e1, ..., En=en) assuming E1, ..., En are conditionally independent given C.

Since for each value of C the denominators are the same above, they can be ignored.

So, for example,

P(C=a | E1=e1, ..., En=en) ∝ P(C=a) * P(E1=e1 | C=a) * P(E2=e2 | C=a) * ... * P(En=en | C=a)

Choose the value for C that gives the maximum score.

Finally, since only relative values are needed and probabilities are often very small, it is common to
compute the sum of logarithms of the probabilities:

log P(C=a | E1=e1, ..., En=en) = log P(C=a) + log P(E1=e1 | C=a) + ... + log P(En=en | C=a).
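A minimal naive Bayes classifier following this scheme might look as below. The priors and per-feature likelihoods are invented for illustration, as are the names `priors`, `likelihoods`, and `classify`:

```python
import math

priors = {"a": 0.5, "b": 0.3, "c": 0.2}   # P(C)
likelihoods = {                            # P(Ei = 1 | C), one entry per feature
    "a": [0.9, 0.1],
    "b": [0.5, 0.5],
    "c": [0.2, 0.8],
}

def classify(evidence):
    """Pick the class maximizing log P(C) + sum_i log P(Ei=ei | C)."""
    def score(c):
        s = math.log(priors[c])
        for i, e in enumerate(evidence):
            p = likelihoods[c][i]
            s += math.log(p if e else 1 - p)  # Boolean features
        return s
    return max(priors, key=score)

print(classify([1, 0]))  # → a
print(classify([0, 1]))  # → c
```

Working with sums of logs avoids underflow when many small probabilities are multiplied.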

If B and C are (unconditionally) independent, then P(C|B) = P(C), so

P(A | B,C) = P(A) * [P(B|A)/P(B)] * [P(C|A)/P(C)]

 Example

Consider the medical domain consisting of three Boolean variables: PickledLiver, Jaundice,
Bloodshot, where the first indicates if a given patient has the "disease" PickledLiver, and the
second and third describe symptoms of the patient. We'll assume that Jaundice and Bloodshot
are independent.

The doctor wants to determine the likelihood that the patient has a PickledLiver.

Based on no other information, she knows that the prior probability P(PickledLiver) = 2^-17. So, this represents the doctor's initial belief in this diagnosis. However, after examination, she determines that the patient has jaundice. She knows that P(Jaundice) = 2^-10 and P(Jaundice | PickledLiver) = 2^-3, so she computes the new updated probability of the patient having PickledLiver as:

P(PickledLiver | Jaundice) = P(P)P(J|P)/P(J)
= (2^-17 * 2^-3)/2^-10
= 2^-10

So, based on this new evidence, the doctor increases her belief in this diagnosis from 2^-17 to 2^-10.

Next, she determines that the patient's eyes are bloodshot, so now we need to add this new
piece of evidence and update the probability of PickledLiver given Jaundice and Bloodshot.

Say P(Bloodshot) = 2^-6 and P(Bloodshot | PickledLiver) = 2^-1. Then she computes the new conditional probability:



P(PickledLiver | Jaundice, Bloodshot) = (P(P)P(J|P)P(B|P))/(P(J)P(B))
= 2^-10 * [2^-1 / 2^-6]
= 2^-5

So, after taking both symptoms into account, the doctor's belief that the patient has a PickledLiver is 2^-5.
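Because every probability in this example is a power of 2, the two updates can be verified exactly in floating point:

```python
# Sequential Bayesian updating for the PickledLiver example.
p_pl = 2 ** -17                          # prior P(PickledLiver)
p_j, p_j_given_pl = 2 ** -10, 2 ** -3    # P(Jaundice), P(Jaundice | PL)
p_b, p_b_given_pl = 2 ** -6, 2 ** -1     # P(Bloodshot), P(Bloodshot | PL)

after_jaundice = p_pl * (p_j_given_pl / p_j)        # first update
after_both = after_jaundice * (p_b_given_pl / p_b)  # second update

print(after_jaundice == 2 ** -10, after_both == 2 ** -5)  # → True True
```

Powers of two are represented exactly as floats, so the equality checks are safe here.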

Bayesian Networks (aka Belief Networks)

 Bayesian Networks, also known as Bayes Nets, Belief Nets, Causal Nets, and Probability Nets, are
a space-efficient data structure for encoding all of the information in the full joint probability
distribution for the set of random variables defining a domain. That is, from the Bayesian Net
one can compute any value in the full joint probability distribution of the set of random
variables.
 Represents all of the direct causal relationships between variables
 Intuitively, to construct a Bayesian net for a given set of variables, draw arcs from cause
variables to immediate effects.
 Space efficient because it exploits the fact that in many real-world problem domains the
dependencies between variables are generally local, so there are a lot of conditionally
independent variables
 Captures both qualitative and quantitative relationships between variables
 Can be used to reason
o Forward (top-down) from causes to effects -- predictive reasoning (aka causal
reasoning)
o Backward (bottom-up) from effects to causes -- diagnostic reasoning
 Formally, a Bayesian Net is a directed, acyclic graph (DAG), where there is a node for each
random variable, and a directed arc from A to B whenever A is a direct causal influence on B.
Thus the arcs represent direct causal relationships and the nodes represent states of affairs. The
occurrence of A provides support for B, and vice versa. The backward influence is called
"diagnostic" or "evidential" support for A due to the occurrence of B.
 Each node A in a net is conditionally independent of any subset of nodes that are not
descendants of A given the parents of A.

Net Topology Reflects Conditional Independence Assumptions

 Conditional independence defines local net structure. For example, if B and C are conditionally
independent given A, then by definition P(C|A,B) = P(C|A) and, symmetrically, P(B|A,C) = P(B|A).
Intuitively, think of A as the direct cause of both B and C. In a Bayesian Net this will be
represented by the local structure:



For example, in the dentist example in the textbook, having a Cavity causes both a Toothache
and the dental probe to Catch, but these two events are conditionally independent given Cavity.
That is, if we know nothing about whether or not someone has a Cavity, then Toothache and
Catch are dependent. But as soon as we definitely know the person has a cavity or not, then
knowing that the person has a Toothache as well has no effect on whether Catch is true. This
conditional independence relationship will be reflected in the Bayesian Net topology as:

 In general, we will construct the net so that given its parents, a node is conditionally
independent of the rest of the net variables. That is,

P(X1=x1, ..., Xn=xn) = P(x1 | Parents(X1)) * ... * P(xn | Parents(Xn))

Hence, we don't need the full joint probability distribution, only conditionals relative to the
parent variables.

 Example (From (Charniak, 1991))

Consider the problem domain in which when I go home I want to know if someone in my family
is home before I go in. Let's say I know the following information:

(1) When my wife leaves the house, she often (but not always) turns on the outside light. (She also sometimes turns the light on when she's expecting a guest.)

(2) When nobody is home, the dog is often left outside.

(3) If the dog has bowel-troubles, it is also often left outside.

(4) If the dog is outside, I will probably hear it barking (though it might not bark, or I might hear
a different dog barking and think it's my dog).



Given this information, define the following five Boolean random variables:

O: Everyone is Out of the house


L: The Light is on
D: The Dog is outside
B: The dog has Bowel troubles
H: I can Hear the dog barking

From this information, the following direct causal influences seem appropriate:

1. H is only directly influenced by D. Hence H is conditionally independent of L, O and B given D.


2. D is only directly influenced by O and B. Hence D is conditionally independent of L given O and B.
3. L is only directly influenced by O. Hence L is conditionally independent of D, H and B given O.
4. O and B are independent.

Based on the above, the following is a Bayesian Net that represents these direct causal relationships
(though it is important to note that these causal connections are not absolute, i.e., they are not
implications):

Next, the following quantitative information is added to the net; this information is usually given
by an expert or determined empirically from training data.

o For each root node (i.e., node without any parents), the prior probability of the random
variable associated with the node is determined and stored there
o For each non-root node, the conditional probabilities of the node's variable given all
possible combinations of its immediate parent nodes are determined. This results in
a conditional probability table (CPT) at each non-root node.

Doing this for the above example, we get the following Bayesian Net:



Notice that in this example a total of 10 probabilities are computed and stored in the net, whereas the full joint probability distribution would require a table containing 2^5 = 32 probabilities. The reduction is due to the conditional independence of many variables.

Two variables that are not directly connected by an arc can still affect each other. For example,
B and H are not (unconditionally) independent, but H does not directly depend on B.

Given a Bayesian Net, we can easily read off the conditional independence relations that are
represented. Specifically, each node, V, is conditionally independent of all nodes that are not
descendants of V, given V's parents. For example, in the above example H is conditionally
independent of B, O, and L given D. So, P(H | B,D,O,L) = P(H | D).

Building a Bayesian Net

Intuitively, "to construct a Bayesian Net for a given set of variables, we draw arcs from cause variables to
immediate effects. In almost all cases, doing so results in a Bayesian network [whose conditional
independence implications are accurate]." (Heckerman, 1996)

More formally, the following algorithm constructs a Bayesian Net:

1. Identify a set of random variables that describe the given problem domain
2. Choose an ordering for them: X1, ..., Xn
3. for i=1 to n do
a. Add a new node for Xi to the net
b. Set Parents(Xi) to be the minimal set of already added nodes such that we have
conditional independence of Xi and all other members of {X1, ..., Xi-1} given Parents(Xi)
c. Add a directed arc from each node in Parents(Xi) to Xi
d. If Xi has at least one parent, then define a conditional probability table at Xi: P(Xi=x |
possible assignments to Parents(Xi)). Otherwise, define a prior probability at Xi: P(Xi)



 There is not, in general, a unique Bayesian Net for a given set of random variables. But all
represent the same information in that from any net constructed every entry in the joint
probability distribution can be computed.
 The "best" net is constructed if in Step 2 the variables are topologically sorted first. That is, each
variable comes before all of its children. So, the first nodes should be the roots, then the nodes
they directly influence, and so on.
 The algorithm will not construct a net that is illegal in the sense of violating the rules of
probability.

Computing Joint Probabilities from a Bayesian Net

To illustrate how a Bayesian Net can be used to compute an arbitrary value in the joint probability
distribution, consider the Bayesian Net shown above for the "home domain."

Goal: Compute P(B,~O,D,~L,H)

P(B,~O,D,~L,H) = P(H,~L,D,~O,B)
= P(H | ~L,D,~O,B) * P(~L,D,~O,B) by Product Rule
= P(H|D) * P(~L,D,~O,B) by Conditional Independence of H and
L,O, and B given D
= P(H|D) P(~L | D,~O,B) P(D,~O,B) by Product Rule
= P(H|D) P(~L|~O) P(D,~O,B) by Conditional Independence of L and D,
and L and B, given O
= P(H|D) P(~L|~O) P(D | ~O,B) P(~O,B) by Product Rule
= P(H|D) P(~L|~O) P(D|~O,B) P(~O | B) P(B) by Product Rule
= P(H|D) P(~L|~O) P(D|~O,B) P(~O) P(B) by Independence of O and B
= (.3)(1 - .6)(.1)(1 - .6)(.3)
= 0.00144

where all of the numeric values are available directly in the Bayesian Net (since
P(~A|B) = 1 - P(A|B)).
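The derivation reduces the query to a product of CPT lookups. A sketch using only the CPT entries that appear above (P(O)=0.6, P(B)=0.3, P(L|~O)=0.6, P(D|~O,B)=0.1, P(H|D)=0.3; the variable names are mine):

```python
# P(B, ~O, D, ~L, H) as a product of CPT entries from the net.
p_b = 0.3                 # P(B)
p_o = 0.6                 # P(O), so P(~O) = 1 - 0.6
p_l_given_noto = 0.6      # P(L | ~O), so P(~L | ~O) = 1 - 0.6
p_d_given_noto_b = 0.1    # P(D | ~O, B)
p_h_given_d = 0.3         # P(H | D)

p = (p_h_given_d * (1 - p_l_given_noto) * p_d_given_noto_b
     * (1 - p_o) * p_b)
print(round(p, 5))  # → 0.00144
```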

APPROXIMATE INFERENCE IN BAYESIAN NETWORKS


 Direct sampling methods

The simplest kind of random sampling process for Bayesian networks generates events from a
network that has no evidence associated with it. The idea is to sample each variable in turn, in
topological order. The probability distribution from which the value is sampled is conditioned on the
values already assigned to the variable’s parents.
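Direct (ancestral) sampling for the home-domain net can be sketched as below. CPT entries not given in the text (P(L|O), most of P(D|O,B), P(H|~D)) are invented for illustration, and the function name `prior_sample` is mine:

```python
import random

def prior_sample(rng):
    """Sample every variable in topological order, each conditioned on
    the values already drawn for its parents."""
    o = rng.random() < 0.6                       # P(O) = 0.6 (root)
    b = rng.random() < 0.3                       # P(B) = 0.3 (root)
    l = rng.random() < (0.9 if o else 0.6)       # P(L|O) assumed; P(L|~O) = 0.6
    p_d = {(True, True): 0.9, (True, False): 0.8,     # P(D | O, B): only the
           (False, True): 0.1, (False, False): 0.05}  # (~O, B) entry is from the text
    d = rng.random() < p_d[(o, b)]
    h = rng.random() < (0.3 if d else 0.01)      # P(H|D) = 0.3; P(H|~D) assumed
    return o, b, l, d, h

rng = random.Random(42)
samples = [prior_sample(rng) for _ in range(20000)]
# Monte Carlo estimate of P(O): fraction of samples with O true (near 0.6).
print(round(sum(s[0] for s in samples) / len(samples), 2))
```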



 Likelihood weighting

Likelihood weighting avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e. It is a particular instance of the general statistical technique of importance sampling, tailored for inference in Bayesian networks.

 Inference by Markov chain simulation

Markov chain Monte Carlo (MCMC) algorithms work quite differently from rejection sampling and likelihood weighting. Instead of generating each sample from scratch, MCMC algorithms generate each sample by making a random change to the preceding sample. It is therefore helpful to think of an MCMC algorithm as being in a particular current state specifying a value for every variable, and generating a next state by making random changes to the current state.

FIRST-ORDER PROBABILITY MODELS


 Possible worlds
For Bayesian networks, the possible worlds are assignments of values to variables; for the Boolean case in particular, the possible worlds are identical to those of propositional logic. For a first-order probability model, then, it seems we need the possible worlds to be those of first-order logic: that is, a set of objects with relations among them and an interpretation that maps constant symbols to objects, predicate symbols to relations, and function symbols to functions on those objects.

 Relational probability models


Like first-order logic, RPMs have constant, function, and predicate symbols. We can refine the model by introducing a context-specific independence.



A context-specific independence allows a variable to be independent of some of its parents given certain
values of others.

 Open-universe probability models


 A vision system doesn’t know what exists, if anything, around the next corner, and may not know if
the object it sees now is the same one it saw a few minutes ago.
 A text-understanding system does not know in advance the entities that will be featured in a text, and must reason about whether phrases such as "Mary," "Dr. Smith," "she," "his cardiologist," "his mother," and so on refer to the same object.
 An intelligence analyst hunting for spies never knows how many spies there really are and can only
guess whether various pseudonyms, phone numbers, and sightings belong to the same individual.

Representing ignorance: Dempster–Shafer theory

The Dempster–Shafer theory DEMPSTER–SHAFER is designed to deal with the distinction between
uncertainty and ignorance. Rather than computing the probability of a proposition, it computes
theprobability that the evidence supports the proposition. This measure of belief is called abelief
function, written Bel(X).

The mathematical formulation of Dempster–Shafer theory is similar to that of probability theory; the
main difference is that, instead of assigning probabilities to possible worlds, the theory assigns masses
to sets of possible worlds, that is, to events.

The masses still must add to 1 over all possible events. Bel(A) is defined to be the sum of masses for all
events that are subsets of (i.e., that entail) A, including A itself. With this definition, Bel(A) and Bel(¬A)
sum to at most 1, and the gap—the interval between Bel(A) and 1 − Bel(¬A)—is often interpreted as
bounding the probability of A.

As with default reasoning, there is a problem in connecting beliefs to actions. Whenever there is a gap in
the beliefs, a decision problem can be defined such that a Dempster–Shafer system is unable to
make a decision.
Bel(A) should be interpreted not as a degree of belief in A but as the probability assigned to all the
possible worlds (now interpreted as logical theories) in which A is provable.

For example:
Consider a room with four people: A, B, C, and D. Suddenly the lights go out, and when they come back
on, B has died from a knife wound in his back. No one entered the room, no one left the room, and B did
not commit suicide. Who is the murderer?

 Either {A}, {C}, or {D} killed him.


 Either {A, C}, {C, D}, or {A, D} killed him.



 Or all three of them killed him, i.e., {A, C, D}.
 None of them killed him (the empty set, ∅).
These are the possible hypotheses from which we can identify the murderer by measuring plausibility.
Using the above example we can say :
Set of possible conclusions (P): {p1, p2, …, pn}, where P is the set of possible conclusions. P must be
exhaustive, meaning at least one pi must be true, and the pi must be mutually exclusive. The power set
contains 2^n elements, where n is the number of elements in the possible set.
For example:
If P = {a, b, c}, then the power set is
{∅, {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, {a, b, c}} = 2^3 = 8 elements.
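The 2^n count can be checked with a few lines of Python (a minimal sketch using only the standard library):

```python
from itertools import chain, combinations

def power_set(frame):
    """Every subset of a frame of discernment, from the empty set up to the full set."""
    return [frozenset(s) for s in chain.from_iterable(
        combinations(sorted(frame), r) for r in range(len(frame) + 1))]

subsets = power_set({"a", "b", "c"})
print(len(subsets))  # 2**3 = 8
```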

Mass function m(K): an assignment such as m({K or B}) means there is evidence for {K or B} that cannot
be divided among more specific beliefs for K and B individually.

Belief in K: The belief in an element K of the power set is the sum of the masses of all elements that are
subsets of K.
For example, let K = {a, b, c}. Then
Bel(K) = m(a) + m(b) + m(c) + m(a, b) + m(a, c) + m(b, c) + m(a, b, c)
Plausibility of K: the sum of the masses of all sets that intersect K:
Pl(K) = m(a) + m(b) + m(c) + m(a, b) + m(b, c) + m(a, c) + m(a, b, c)
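The two definitions translate directly into Python. The mass assignment below is hypothetical, chosen only to illustrate that Bel(K) never exceeds Pl(K):

```python
# Hypothetical mass assignment over the frame {a, b, c};
# the numbers are illustrative only and sum to 1.
m = {
    frozenset({"a"}): 0.3,
    frozenset({"b"}): 0.2,
    frozenset({"a", "c"}): 0.4,
    frozenset({"a", "b", "c"}): 0.1,
}

def bel(K):
    """Belief: total mass of focal sets that are subsets of K."""
    return sum(mass for S, mass in m.items() if S <= K)

def pl(K):
    """Plausibility: total mass of focal sets that intersect K."""
    return sum(mass for S, mass in m.items() if S & K)

K = frozenset({"a", "c"})
print(bel(K), pl(K))  # Bel(K) <= Pl(K) always holds
```

Here Bel({a, c}) collects the masses of {a} and {a, c}, while Pl({a, c}) also picks up the mass of {a, b, c}, so the interval [Bel, Pl] brackets the probability of the event.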

Characteristics of Dempster Shafer Theory:


The theory accommodates ignorance while ensuring that the masses over all events aggregate to 1.
Ignorance is reduced in this theory by adding more and more evidence. A combination rule (Dempster's
rule) is used to combine various types of possibilities.
Advantages:
 The uncertainty interval is reduced.
 DST has a much lower level of ignorance.
 Diagnostic hierarchies can be represented using it.
 A person dealing with such problems is free to think about the evidence.
Disadvantages:
 The computational effort is high, as we have to deal with 2^n sets.

Learning
An agent is learning if it improves its performance on future tasks after making observations
about the world.

Forms Of Learning
Any component of an agent can be improved by learning from data. The design of a learning element depends on four factors:

 Which component is to be improved


o direct mapping from conditions on the current state to actions
o infer relevant properties of the world
o results of possible actions
o Action-value information
o Goals that describe classes of states
 What prior knowledge the agent already has.



 What representation is used for the data and the component.
o representations: propositional and first-order logical sentences
o Bayesian networks for the inferential components
o factored representation—a vector of attribute values—and outputs that can be either a
continuous numerical value or a discrete value
 What feedback is available to learn from: the type of feedback available determines the three main
types of learning.
o In unsupervised learning the agent learns patterns in the input even though no explicit
feedback is supplied.
o In reinforcement learning the agent learns from a series of reinforcements—rewards or
punishments.
o In supervised learning the agent observes some example input–output pairs and learns a function
that maps from input to output.
o In semi-supervised learning we are given a few labeled examples and must make what we can of
a large collection of unlabeled examples.

SUPERVISED LEARNING

Given a training set of N example input–output pairs (x1, y1), (x2, y2), . . . (xN, yN) , where each yj was
generated by an unknown function y = f(x), discover a function h that approximates the true function f.
The function h is a hypothesis. To measure the accuracy of a hypothesis we give it a test set of examples
that are distinct from the training set.
Conditional probability distribution: when the function f is stochastic—it is not strictly a function of
x—what we have to learn is a conditional probability distribution, P(Y | x).
Classification: when the output y is one of a finite set of values, the learning problem is called
classification.
Regression: when y is a number (such as tomorrow’s temperature), the learning problem is called
regression.
The hypothesis space, H, can be a set of polynomials; polynomial fitting means fitting a function of a
single variable to some data points.
Ockham’s razor: how do we choose a function or a polynomial from among multiple consistent
hypotheses? One answer is to prefer the simplest hypothesis consistent with the data. This principle is
called Ockham’s razor.
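As a toy sketch of this principle (the data and the candidate hypotheses below are invented for illustration; a hypothesis here is simply a set of attributes that must all be true, and "simplest" means fewest attribute tests):

```python
# Hypothetical data: each example is (attribute dict, label).
data = [({"red": True,  "round": True},  True),
        ({"red": True,  "round": False}, True),
        ({"red": False, "round": True},  False)]

# A candidate hypothesis is a set of attributes that must all be true.
candidates = [frozenset(), frozenset({"red"}), frozenset({"round"}),
              frozenset({"red", "round"})]

def predicts(h, x):
    """A hypothesis predicts True iff every required attribute holds."""
    return all(x[attr] for attr in h)

consistent = [h for h in candidates
              if all(predicts(h, x) == y for x, y in data)]
simplest = min(consistent, key=len)   # Ockham's razor: fewest tests
print(sorted(simplest))  # ['red']
```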
Realizable : a learning problem is realizable if the hypothesis space contains the true function.
Unfortunately, we cannot always tell whether a given learning problem is realizable, because the true
function is not known.
Supervised learning can be done by choosing the hypothesis h* that is most probable given the data:
h* = argmax over h ∈ H of P(h | data) = argmax over h ∈ H of P(data | h) P(h)



There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a
good hypothesis within that space.

LEARNING DECISION TREES


Decision tree induction is one of the simplest and yet most successful forms of machine learning.
The decision tree representation: the aim here is to learn a definition for the goal predicate.
A decision tree represents a function that takes as input a vector of attribute values and returns a
“decision”—a single output value. The input and output values can be discrete or continuous
o A decision tree reaches its decision by performing a sequence of tests.
o Each internal node in the tree corresponds to a test of the value of one of the input attributes, Ai,
o the branches from the node are labeled with the possible values of the attribute, Ai =vik.
o Each leaf node in the tree specifies a value to be returned by the function.

Decision Tree Algorithm:

The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer strategy: always test the
most important attribute first. This test divides the problem up into smaller subproblems that can then
be solved recursively.

function DECISION-TREE-LEARNING(examples, attributes, parent_examples) returns a tree
  if examples is empty then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return PLURALITY-VALUE(examples)
  else
    A ← argmax over a ∈ attributes of IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value vk of A do
      exs ← {e : e ∈ examples and e.A = vk}
      subtree ← DECISION-TREE-LEARNING(exs, attributes − A, examples)
      add a branch to tree with label (A = vk) and subtree subtree
    return tree
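The pseudocode above can be rendered as a short Python sketch. The dict-based data format, the tuple tree representation, and the tiny dataset are all invented for illustration; IMPORTANCE is implemented here as information gain:

```python
from collections import Counter
import math

def plurality_value(examples):
    """Most common classification among the examples."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def entropy(examples):
    """Entropy (in bits) of the class distribution of the examples."""
    counts = Counter(e["class"] for e in examples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def importance(a, examples, values):
    """Information gain of attribute a (one common choice of IMPORTANCE)."""
    remainder = 0.0
    for v in values[a]:
        exs = [e for e in examples if e[a] == v]
        if exs:
            remainder += len(exs) / len(examples) * entropy(exs)
    return entropy(examples) - remainder

def decision_tree_learning(examples, attributes, parent_examples, values):
    if not examples:
        return plurality_value(parent_examples)
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                      # all examples agree
        return classes.pop()
    if not attributes:                         # no tests left
        return plurality_value(examples)
    a = max(attributes, key=lambda x: importance(x, examples, values))
    branches = {}
    for v in values[a]:
        exs = [e for e in examples if e[a] == v]
        branches[v] = decision_tree_learning(exs, attributes - {a},
                                             examples, values)
    return (a, branches)                       # internal node: test on a

examples = [
    {"hungry": "yes", "raining": "no",  "class": True},
    {"hungry": "yes", "raining": "yes", "class": True},
    {"hungry": "no",  "raining": "no",  "class": False},
]
values = {"hungry": ["yes", "no"], "raining": ["yes", "no"]}
tree = decision_tree_learning(examples, {"hungry", "raining"},
                              examples, values)
print(tree)  # a single test on "hungry" classifies all three examples
```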

Expressiveness of decision trees


A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true if and
only if the input attributes satisfy one of the paths leading to a leaf with value true.
Goal ⇔ (Path1 ∨ Path2 ∨ · · ·), where each Path is a conjunction of attribute-value tests required
to follow that path. A tree consists of just tests on attributes in the interior nodes, values of



attributes on the branches, and output values on the leaf nodes. For a wide variety of problems, the
decision tree format yields a nice, concise result. But some functions cannot be represented
concisely. We can evaluate the accuracy of a learning algorithm with a learning curve.
Choosing attribute tests
The greedy search used in decision tree learning is designed to approximately minimize the depth of the
final tree. The idea is to pick the attribute that goes as far as possible toward providing an exact
classification of the examples. A perfect attribute divides the examples into sets, each of which is all
positive or all negative and thus will be a leaf of the tree.

Entropy is a measure of the uncertainty of a random variable; acquisition of information corresponds to
a reduction in entropy. For a random variable V with values vk,
H(V) = −Σk P(vk) log2 P(vk)

We can check that the entropy of a fair coin flip is indeed 1 bit:
H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 .
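This check can be reproduced in a few lines of Python (the loaded-coin distribution is an invented extra illustration):

```python
import math

def H(probabilities):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(H([0.5, 0.5]))    # fair coin: 1.0 bit
print(H([0.99, 0.01]))  # loaded coin: roughly 0.08 bits
print(H([1.0]))         # certain outcome: 0 bits
```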

The information gain from the attribute test on A is the expected reduction in entropy:
Gain(A) = H(Goal) − Remainder(A), where Remainder(A) = Σk (pk + nk)/(p + n) · B(pk/(pk + nk)) and B(q)
is the entropy of a Boolean variable that is true with probability q.

Pruning
In decision trees, a technique called decision tree pruning combats overfitting. Pruning works by
eliminating nodes that are not clearly relevant.

Issues in decision trees:


 Missing data
 Multivalued attributes
 Continuous and integer-valued input attributes
 Continuous-valued output attributes

A LOGICAL FORMULATION OF LEARNING
Current-best-hypothesis search
The idea behind current-best-hypothesis search is to maintain a single hypothesis, and to adjust it as
new examples arrive in order to maintain consistency.
The extension of the hypothesis must be increased to include new examples. This is called
generalization.
function CURRENT-BEST-LEARNING(examples, h) returns a hypothesis or fail



  if examples is empty then
    return h
  e ← FIRST(examples)
  if e is consistent with h then
    return CURRENT-BEST-LEARNING(REST(examples), h)
  else if e is a false positive for h then
    for each h′ in specializations of h consistent with examples seen so far do
      h′′ ← CURRENT-BEST-LEARNING(REST(examples), h′)
      if h′′ ≠ fail then return h′′
  else if e is a false negative for h then
    for each h′ in generalizations of h consistent with examples seen so far do
      h′′ ← CURRENT-BEST-LEARNING(REST(examples), h′)
      if h′′ ≠ fail then return h′′
  return fail
The extension of the hypothesis must be decreased to exclude the example. This is called specialization.
Least-commitment search
Backtracking arises because the current-best-hypothesis approach has to choose a particular hypothesis
as its best guess even though it does not have enough data yet to be sure of the choice. What we can do
instead is to keep around all and only those hypotheses that are consistent with all the data so far. Each
new example will either have no effect or will get rid of some of the hypotheses.
One important property of this approach is that it is incremental: one never has to go back and
reexamine the old examples.

Boundary Set :
We also have an ordering on the hypothesis space, namely, generalization/specialization. This is a partial
ordering, which means that each boundary will not be a point but rather a set of hypotheses called a
boundary set.
The great thing is that we can represent the entire version space using just two boundary sets: a
most general boundary (the G-set) and a most specific boundary (the S-set). Everything in
between is guaranteed to be consistent with the examples.
Consider the members Si and Gi of the S- and G-sets. For each one, the new example may be a false
positive or a false negative.
1. False positive for Si: This means Si is too general, but there are no consistent specializations of Si (by
definition), so we throw it out of the S-set.
2. False negative for Si: This means Si is too specific, so we replace it by all its immediate generalizations,
provided they are more specific than some member of G.
3. False positive for Gi: This means Gi is too general, so we replace it by all its immediate specializations,
provided they are more general than some member of S.
4. False negative for Gi: This means Gi is too specific, but there are no consistent generalizations of Gi
(by definition), so we throw it out of the G-set.



EXPLANATION-BASED LEARNING
Explanation-based learning is a method for extracting general rules from individual observations.
Memoization
The technique of memoization has long been used in computer science to speed up programs by saving
the results of computation. The basic idea of memo functions is to accumulate a database of input–
output pairs; when the function is called, it first checks the database to see whether it can avoid solving
the problem from scratch.
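A minimal sketch of such a memo function in Python (the naive Fibonacci function is a stock illustration, not from the text):

```python
import functools

def memoize(f):
    """Cache input-output pairs so each input is solved from scratch once."""
    table = {}
    @functools.wraps(f)
    def wrapper(x):
        if x not in table:        # check the database first
            table[x] = f(x)       # fall back to solving from scratch
        return table[x]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040, computed with linearly many calls
```

Without the memo table the naive recursion repeats the same subproblems exponentially often; with it, each value of n is computed exactly once.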
Explanation-based learning takes this a good deal further, by creating general rules that cover an entire
class of cases.
Basic EBL process works as follows:
1. Given an example, construct a proof that the goal predicate applies to the example using the available
background knowledge
2. In parallel, construct a generalized proof tree for the variabilized goal using the same inference steps
as in the original proof.
3. Construct a new rule whose left-hand side consists of the leaves of the proof tree and whose right-
hand side is the variabilized goal (after applying the necessary bindings from the generalized proof).
4. Drop any conditions from the left-hand side that are true regardless of the values of the variables in
the goal.

Three factors involved in the analysis of efficiency gains from EBL:


1. Adding large numbers of rules can slow down the reasoning process, because the inference
mechanism must still check those rules even in cases where they do not yield a solution. In other words,
it increases the branching factor in the search space.
2. To compensate for the slowdown in reasoning, the derived rules must offer significant increases in
speed for the cases that they do cover. These increases come about mainly because the derived rules
avoid dead ends that would otherwise be taken, but also because they shorten the proof itself.
3. Derived rules should be as general as possible, so that they apply to the largest possible set of cases.

LEARNING USING RELEVANCE INFORMATION


The learning algorithm is based on a straightforward attempt to find the simplest determination
consistent with the observations.
A determination P ≻ Q says that if any pair of examples match on P, then they must also match on Q. A
determination is therefore consistent with a set of examples if every pair that matches on the predicates
on the left-hand side also matches on the goal predicate.

An algorithm for finding a minimal consistent determination

function MINIMAL-CONSISTENT-DET(E,A) returns a set of attributes


inputs: E, a set of examples
A, a set of attributes, of size n
for i = 0 to n do



for each subset Ai of A of size i do
if CONSISTENT-DET?(Ai,E) then return Ai

function CONSISTENT-DET?(A,E) returns a truth value


inputs: A, a set of attributes
E, a set of examples
local variables: H, a hash table
for each example e in E do
if some example in H has the same values as e for the attributes A
but a different classification then return false
store the class of e in H, indexed by the values for attributes A of the example e
return true
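The two functions above translate directly into Python. The dict-based example format and the tiny dataset are invented for illustration; the hash table H becomes a plain dict keyed by attribute values:

```python
from itertools import combinations

def consistent_det(attrs, examples):
    """True iff examples that agree on attrs always agree on the class."""
    table = {}
    for e in examples:
        key = tuple(e[a] for a in attrs)
        if key in table and table[key] != e["class"]:
            return False          # same attribute values, different class
        table[key] = e["class"]
    return True

def minimal_consistent_det(examples, attributes):
    """Smallest subset of attributes whose values determine the class."""
    for i in range(len(attributes) + 1):          # smallest subsets first
        for subset in combinations(sorted(attributes), i):
            if consistent_det(subset, examples):
                return set(subset)

# Hypothetical examples: the class happens to depend on "a" alone.
examples = [{"a": 1, "b": 0, "class": "x"},
            {"a": 1, "b": 1, "class": "x"},
            {"a": 0, "b": 0, "class": "y"}]
print(minimal_consistent_det(examples, {"a", "b"}))  # {'a'}
```

Because subsets are tried in order of increasing size, the first consistent determination returned is guaranteed to be minimal.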

Given an algorithm for learning determinations, a learning agent has a way to construct a minimal
hypothesis within which to learn the target predicate. For example, we can combine MINIMAL-
CONSISTENT-DET with the DECISION-TREE-LEARNING algorithm.
This yields a relevance-based decision-tree learning algorithm RBDTL that first identifies a minimal set of
relevant attributes and then passes this set to the decision tree algorithm for learning.

INDUCTIVE LOGIC PROGRAMMING


Inductive logic programming (ILP) combines inductive methods with the power of first-order
representations, concentrating in particular on the representation of hypotheses as logic programs.
It has gained popularity for three reasons.
1. ILP offers a rigorous approach to the general knowledge-based inductive learning problem.
2. It offers complete algorithms for inducing general, first-order theories from examples, which can
therefore learn successfully in domains where attribute-based algorithms are hard to apply.
3. Inductive logic programming produces hypotheses that are (relatively) easy for humans to read.
The object of an inductive learning program is to come up with a set of sentences for the Hypothesis
such that the entailment constraint is satisfied. Suppose, for the moment, that the agent has no
background knowledge: Background is empty. Then one possible solution would require making pairs
of people into objects.
Top-down inductive learning methods
The first approach to ILP works by starting with a very general rule and gradually specializing it so that it
fits the data.
This is essentially what happens in decision-tree learning, where a decision tree is gradually grown until
it is consistent with the observations.
To do ILP we use first-order literals instead of attributes, and the hypothesis is a set of clauses instead
of a decision tree.
Three kinds of literals
1. Literals using predicates
2. Equality and inequality literals
3. Arithmetic comparisons



Inductive learning with inverse deduction
The second major approach to ILP involves inverting the normal deductive proof process.
Inverse resolution is based on the following observation.
Recall that an ordinary resolution step takes two clauses C1 and C2 and resolves them to produce the
resolvent C.
An inverse resolution step takes a resolvent C and produces two clauses C1 and C2, such that C is the
result of resolving C1 and C2.
Alternatively, it may take a resolvent C and clause C1 and produce a clause C2 such that C is the result of
resolving C1 and C2.
A number of approaches to taming the search have been implemented in ILP systems:
1. Redundant choices can be eliminated
2. The proof strategy can be restricted
3. The representation language can be restricted
4. Inference can be done with model checking rather than theorem proving
5. Inference can be done with ground propositional clauses rather than in first-order
logic.

