(R17A1204) Artificial Intelligence (6) - 119-143
Probabilistic reasoning
Probabilistic reasoning is a way of knowledge representation in which we apply the concept of probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine probability theory with logic to handle uncertainty.
We use probability in probabilistic reasoning because it provides a way to handle the uncertainty that results from someone's laziness or ignorance.
In the real world there are many scenarios where the certainty of something is not confirmed, such as "It will rain today," the behaviour of a person in a given situation, or the outcome of a match between two teams or two players. These are probable statements: we can assume they may happen but cannot be sure, so here we use probabilistic reasoning.
Probability: probability can be defined as the chance that an uncertain event will occur; it is a numerical measure of the likelihood that an event will occur. The value of a probability always lies between 0 and 1, where 0 represents an impossible event and 1 a certain event.
Let's suppose we want to calculate the probability of event A when event B has already occurred ("the probability of A under the condition B"). It can be written as:

P(A|B) = P(A⋀B) / P(B)

where P(A⋀B) is the joint probability of A and B, and P(B) is the marginal probability of B.

If the probability of A is given instead and we need the probability of B given A, it is:

P(B|A) = P(A⋀B) / P(A)
This can be explained using a Venn diagram: since B is the event that has occurred, the sample space is reduced to set B, and we can compute the probability of event A given that B has occurred by dividing P(A⋀B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English and Mathematics. What percentage of the students who like English also like Mathematics?
Solution:
Let A be the event that a student likes Mathematics and B the event that a student likes English. Then P(B) = 0.7 and P(A⋀B) = 0.4, so

P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 ≈ 0.57

Hence, 57% of the students who like English also like Mathematics.
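This arithmetic can be checked with a short script (a minimal sketch; the variable names are ours):

```python
# Conditional probability for the class example above:
# P(B) = 0.70 (likes English), P(A ^ B) = 0.40 (likes both).
p_B = 0.70
p_A_and_B = 0.40

p_A_given_B = p_A_and_B / p_B   # P(A|B) = P(A ^ B) / P(B)
print(round(p_A_given_B, 2))    # 0.57
```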
In many problem domains it isn't possible to create complete, consistent models of the world. Therefore agents (and people) must act in uncertain worlds (as the real world is). We want an agent to make rational decisions even when there is not enough information to prove that an action will work.
Some of the reasons for reasoning under uncertainty:
o True uncertainty. E.g., flipping a coin.
o Theoretical ignorance. There is no complete theory which is known about the problem
domain. E.g., medical diagnosis.
o Laziness. The space of relevant factors is very large, and it would require too much work to list the complete set of antecedents and consequents. Furthermore, it would be too hard to use the enormous rule set that resulted.
o Practical ignorance. Uncertain about a particular individual in the domain because all of
the information necessary for that individual has not been collected.
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which determines the
probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two random
events.
Bayes' theorem was named after the British mathematician Thomas Bayes. Bayesian inference is an application of Bayes' theorem and is fundamental to Bayesian statistics.
It is a way to calculate the value of P(A|B) with the knowledge of P(B|A).
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
Example: if the risk of cancer is related to a person's age, then by using Bayes' theorem we can determine the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of event A with known event B. Writing the joint probability both ways:

P(A ⋀ B) = P(A|B) P(B)
P(A ⋀ B) = P(B|A) P(A)

Equating the right-hand sides and dividing by P(B) gives:

P(A|B) = P(B|A) P(A) / P(B)   ... (a)

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here:
P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood: assuming the hypothesis is true, it is the probability of the evidence.
P(A) is called the prior probability: the probability of the hypothesis before considering the evidence.
P(B) is called the marginal probability: the probability of the evidence alone.
Bayes' rule allows us to compute the single term P(A|B) in terms of P(B|A), P(A), and P(B). This is very useful in cases where we have good estimates of these three terms and want to determine the fourth one.
Suppose we want to perceive the effect of some unknown cause and compute that cause; then Bayes' rule becomes:

P(cause | effect) = P(effect | cause) P(cause) / P(effect)
Example-1:
Question: What is the probability that a patient has the disease meningitis, given a stiff neck?
Given data:
A doctor is aware that the disease meningitis causes a patient to have a stiff neck 80% of the time. He is also aware of some more facts, which are given as follows:
o The probability that a patient has meningitis is 1/30000.
o The probability that a patient has a stiff neck is 2%.
Let a be the proposition that the patient has a stiff neck and b the proposition that the patient has meningitis. Then:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
Applying Bayes' theorem:

P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.0013

Hence, we can assume that roughly 1 patient in 750 with a stiff neck has meningitis.
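The same computation in code (a sketch using the figures above):

```python
# Bayes' rule for the stiff-neck example: P(b|a) = P(a|b) * P(b) / P(a).
p_a_given_b = 0.8        # P(stiff neck | meningitis)
p_b = 1 / 30000          # P(meningitis)
p_a = 0.02               # P(stiff neck)

p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 6))  # 0.001333, i.e. about 1 in 750
```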
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:
o A directed acyclic graph over the random variables
o A table of conditional probability distributions, one per node, given its parents
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs, or directed arrows, represent causal relationships or conditional probabilities between random variables; these directed links connect pairs of nodes in the graph. A link means that one node directly influences the other; if there is no directed link, the nodes are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of
the network graph.
o If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of Node B.
o Node C is independent of node A.
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
Rather than reasoning about the truth or falsity of a proposition, reason about the belief that a
proposition or event is true or false
For each primitive proposition or event, attach a degree of belief to the sentence
Use probability theory as a formal means of manipulating degrees of belief
Given a proposition, A, assign a probability, P(A), such that 0 <= P(A) <= 1, where if A is true,
P(A)=1, and if A is false, P(A)=0. Proposition A must be either true or false, but P(A) summarizes
our degree of belief in A being true/false.
o Examples
o P(Weather=Sunny) = 0.7 means that we believe that the weather will be Sunny with 70%
certainty. In this case Weather is a random variable that can take on values in a domain such
as {Sunny, Rainy, Snowy, Cloudy}.
o P(Cavity=True) = 0.05 means that we believe there is a 5% chance that a person has a cavity.
Cavity is a Boolean random variable since it can take on possible values True and False.
o Example: P(A=a ^ B=b) = P(A=a, B=b) = 0.2, where A=My_Mood, a=happy, B=Weather, and
b=rainy, means that there is a 20% chance that when it's raining my mood is happy.
We will assume that in a given problem domain, the programmer and expert identify all of the
relevant propositional variables that are needed to reason about the domain.
Each of these will be represented as a random variable, i.e., a variable that can take on values
from a set of mutually exclusive and exhaustive values called the sample space or partition of
the random variable. Usually this will mean a sample space {True, False}.
For example, the proposition Cavity has possible values True and False indicating whether a
given patient has a cavity or not. A random variable that has True and False as its possible values
is called a Boolean random variable.
More generally, propositions can include the equality predicate with random variables and the
possible values they can have.
For example, we might have a random variable Color with possible values red, green, blue,
and other.
Then P(Color=red) indicates the likelihood that the color of a given object is red.
Similarly, for Boolean random variables we can ask P(A=True), which is abbreviated to P(A), and
P(A=False), which is abbreviated to P(~A).
From the axioms of probability (0 ≤ P(A) ≤ 1, P(True) = 1, P(False) = 0, and P(A ∨ B) = P(A) + P(B) − P(A ∧ B)) we can show that the following properties also hold:
P(~A) = 1 - P(A)
P(A) = P(A ^ B) + P(A ^ ~B)
Sum{P(A=a)} = 1, where the sum is over all possible values a in the sample space of A
Given an application domain in which we have determined a sufficient set of random variables to
encode all of the relevant information about that domain, we can completely specify all of the possible
probabilistic information by constructing the full joint probability distribution,
P(V1=v1, V2=v2, ..., Vn=vn), which assigns probabilities to all possible combinations of values to all
random variables.
For example, consider a domain described by three Boolean random variables, Bird, Flier, and Young.
Then we can enumerate a table showing all possible interpretations and associated probabilities:
Notice that there are 8 rows in the above table, reflecting the fact that there are 2^3 ways to assign values to the three Boolean variables. More generally, with n Boolean variables the table will be of size 2^n, and if each of n variables had k possible values, the table would be of size k^n.
Also notice that the sum of the probabilities in the right column must equal 1, since the set of all possible values for each variable is exhaustive. This means that for n Boolean random variables, the table has 2^n − 1 values that must be determined to completely fill it in.
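As a sketch, a full joint distribution can be stored as a table mapping assignments to probabilities. The numbers below are illustrative assumptions standing in for the omitted table, constrained only to sum to 1:

```python
# An illustrative full joint distribution over the three Boolean variables
# (Bird, Flier, Young). Any probabilistic query over these variables can be
# answered by summing the relevant rows.
joint = {
    (True,  True,  True):  0.01, (True,  True,  False): 0.19,
    (True,  False, True):  0.01, (True,  False, False): 0.04,
    (False, True,  True):  0.01, (False, True,  False): 0.04,
    (False, False, True):  0.10, (False, False, False): 0.60,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # probabilities must sum to 1

# Example query: the marginal probability that an object is a bird.
p_bird = sum(p for (bird, flier, young), p in joint.items() if bird)
print(round(p_bird, 2))  # 0.25
```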
Conditional Probabilities
Conditional probabilities are key for reasoning because they formalize the process of
accumulating evidence and updating probabilities based on new evidence.
For example, if we know there is a 4% chance of a person having a cavity, we can represent this
as the prior (aka unconditional) probability P(Cavity)=0.04.
Say that the person now has a symptom, a toothache; we'd like to know the posterior probability of Cavity given this new evidence. That is, compute P(Cavity | Toothache).
If P(A|B) = 1, this is equivalent to the sentence in Propositional Logic B => A. Similarly, if P(A|B)
=0.9, then this is like saying B => A with 90% certainty.
In other words, we've made implication fuzzy because it's not absolutely certain.
Given several measurements and other "evidence", E1, ..., Ek, we will formulate queries as P(Q |
E1, E2, ..., Ek) meaning "what is the degree of belief that Q is true given that we know E1, ...,
Ek and nothing else."
Next, we could either compute the marginal probability P(F) directly from the full joint probability distribution or, as is more commonly done, use a process called normalization, which first requires computing the joint entries P(B ∧ F) = 0.20 and P(~B ∧ F) = 0.02, so that P(B|F) = 0.20/P(F) and P(~B|F) = 0.02/P(F).
Now we also know that P(~B|F) + P(B|F) = 1, so substituting from above and solving for P(F) we get P(F) = 0.22. Hence, P(~B|F) = 0.02/0.22 ≈ 0.091.
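The normalization step can be sketched in code; the joint entry P(B ∧ F) = 0.20 is an assumption inferred from the quoted result:

```python
# Normalization: recover the marginal P(F) from the joint entries, then
# divide to get the conditional. Joint values assumed from the text.
p_BF, p_nBF = 0.20, 0.02

p_F = p_BF + p_nBF          # marginal: P(F) = P(B ^ F) + P(~B ^ F)
p_nB_given_F = p_nBF / p_F  # P(~B | F)

print(round(p_F, 2))           # 0.22
print(round(p_nB_given_F, 3))  # 0.091
```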
Say we have a random variable, C, which represents the possible ways to classify an input pattern of
features that have been measured.
The domain of C is the set of possible classifications, e.g., it might be the possible diagnoses in a
medical domain.
Say the possible values for C are {a, b, c}, and the features we have measured are E1 = e1, ..., En = en. By Bayes' rule, and assuming the features are conditionally independent given the class:

P(C=a | E1=e1, ..., En=en) = P(C=a) * P(E1=e1 | C=a) * ... * P(En=en | C=a) / P(E1=e1, ..., En=en)

Since for each value of C the denominator is the same, it can be ignored, giving:

P(C=a | E1=e1, ..., En=en) ∝ P(C=a) * P(E1=e1 | C=a) * P(E2=e2 | C=a) * ... * P(En=en | C=a)
Finally, since only relative values are needed and probabilities are often very small, it is common to
compute the sum of logarithms of the probabilities:
log P(C=a | E1=e1, ..., En=en) = log P(C=a) + log P(E1=e1 | C=a) + ... + log P(En=en | C=a).
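The log-probability scoring above can be sketched as follows; the priors and likelihoods are illustrative assumptions:

```python
import math

# Naive Bayes scoring in log space, matching the formula above.
priors = {"a": 0.5, "b": 0.3, "c": 0.2}
# likelihoods[cls][i] = P(Ei = ei | C = cls) for the observed evidence
likelihoods = {
    "a": [0.9, 0.8],
    "b": [0.2, 0.5],
    "c": [0.1, 0.1],
}

def log_score(cls):
    # log P(C=cls) + sum of log P(Ei=ei | C=cls)
    return math.log(priors[cls]) + sum(math.log(p) for p in likelihoods[cls])

best = max(priors, key=log_score)  # class with the highest (relative) score
print(best)  # a
```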
Example
Consider the medical domain consisting of three Boolean variables: PickledLiver, Jaundice,
Bloodshot, where the first indicates if a given patient has the "disease" PickledLiver, and the
second and third describe symptoms of the patient. We'll assume that Jaundice and Bloodshot
are independent.
The doctor wants to determine the likelihood that the patient has a PickledLiver.
Based on no other information, she knows that the prior probability P(PickledLiver) = 2^-17; this represents the doctor's initial belief in the diagnosis. However, after examination, she determines that the patient has jaundice. She knows that P(Jaundice) = 2^-10 and P(Jaundice | PickledLiver) = 2^-3, so she computes the new updated probability of the patient having PickledLiver as:

P(PickledLiver | Jaundice) = P(Jaundice | PickledLiver) P(PickledLiver) / P(Jaundice)
                           = (2^-3 * 2^-17) / 2^-10
                           = 2^-10

So, based on this new evidence, the doctor increases her belief in this diagnosis from 2^-17 to 2^-10.
Next, she determines that the patient's eyes are bloodshot, so now we need to add this new piece of evidence and update the probability of PickledLiver given Jaundice and Bloodshot.
Say P(Bloodshot) = 2^-6 and P(Bloodshot | PickledLiver) = 2^-1. Then, using the independence of the two symptoms, she computes the new conditional probability:

P(PickledLiver | Jaundice, Bloodshot) = P(Bloodshot | PickledLiver) P(PickledLiver | Jaundice) / P(Bloodshot)
                                      = (2^-1 * 2^-10) / 2^-6
                                      = 2^-5

So, after taking both symptoms into account, the doctor's belief that the patient has a PickledLiver is 2^-5.
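The two updates can be replayed in code (powers of two are exact in floating point, so the equalities hold exactly):

```python
# Sequential Bayesian updating for the PickledLiver example, using the
# powers of two given in the text.
p_pl = 2 ** -17                       # prior P(PickledLiver)

# Update on Jaundice: P(PL|J) = P(J|PL) * P(PL) / P(J)
p_pl = (2 ** -3) * p_pl / (2 ** -10)  # now 2**-10

# Update on Bloodshot (independent symptom):
# P(PL|J,B) = P(B|PL) * P(PL|J) / P(B)
p_pl = (2 ** -1) * p_pl / (2 ** -6)   # now 2**-5

print(p_pl == 2 ** -5)  # True
```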
Bayesian Networks, also known as Bayes Nets, Belief Nets, Causal Nets, and Probability Nets, are
a space-efficient data structure for encoding all of the information in the full joint probability
distribution for the set of random variables defining a domain. That is, from the Bayesian Net
one can compute any value in the full joint probability distribution of the set of random
variables.
Represents all of the direct causal relationships between variables
Intuitively, to construct a Bayesian net for a given set of variables, draw arcs from cause
variables to immediate effects.
Space efficient because it exploits the fact that in many real-world problem domains the
dependencies between variables are generally local, so there are a lot of conditionally
independent variables
Captures both qualitative and quantitative relationships between variables
Can be used to reason
o Forward (top-down) from causes to effects -- predictive reasoning (aka causal
reasoning)
o Backward (bottom-up) from effects to causes -- diagnostic reasoning
Formally, a Bayesian Net is a directed, acyclic graph (DAG), where there is a node for each
random variable, and a directed arc from A to B whenever A is a direct causal influence on B.
Thus the arcs represent direct causal relationships and the nodes represent states of affairs. The occurrence of A provides support for B, and vice versa. The backward influence is called "diagnostic" or "evidential" support for A due to the occurrence of B.
Each node A in a net is conditionally independent of any subset of nodes that are not
descendants of A given the parents of A.
Conditional independence defines local net structure. For example, if B and C are conditionally
independent given A, then by definition P(C|A,B) = P(C|A) and, symmetrically, P(B|A,C) = P(B|A).
Intuitively, think of A as the direct cause of both B and C. In a Bayesian Net this will be
represented by the local structure:
In general, we will construct the net so that, given its parents, a node is conditionally independent of the rest of the net variables. That is,

P(Xi | X1, ..., Xi-1) = P(Xi | Parents(Xi))

Hence, we don't need the full joint probability distribution, only conditionals relative to the parent variables.
Consider the problem domain in which when I go home I want to know if someone in my family
is home before I go in. Let's say I know the following information:
(1) When my wife leaves the house, she often (but not always) turns on the outside light. (She also sometimes turns the light on when she's expecting a guest.)
(2) When nobody is home, the dog is often left outside.
(3) If the dog has bowel trouble, it is often left outside.
(4) If the dog is outside, I will probably hear it barking (though it might not bark, or I might hear a different dog barking and think it's my dog).
From this information, the following direct causal influences seem appropriate:
Based on the above, the following is a Bayesian Net that represents these direct causal relationships
(though it is important to note that these causal connections are not absolute, i.e., they are not
implications):
Next, the following quantitative information is added to the net; this information is usually given
by an expert or determined empirically from training data.
o For each root node (i.e., node without any parents), the prior probability of the random
variable associated with the node is determined and stored there
o For each non-root node, the conditional probabilities of the node's variable given all
possible combinations of its immediate parent nodes are determined. This results in
a conditional probability table (CPT) at each non-root node.
Doing this for the above example, we get the following Bayesian Net:
Two variables that are not directly connected by an arc can still affect each other. For example,
B and H are not (unconditionally) independent, but H does not directly depend on B.
Given a Bayesian Net, we can easily read off the conditional independence relations that are
represented. Specifically, each node, V, is conditionally independent of all nodes that are not
descendants of V, given V's parents. For example, in the above example H is conditionally
independent of B, O, and L given D. So, P(H | B,D,O,L) = P(H | D).
Intuitively, "to construct a Bayesian Net for a given set of variables, we draw arcs from cause variables to
immediate effects. In almost all cases, doing so results in a Bayesian network [whose conditional
independence implications are accurate]." (Heckerman, 1996)
1. Identify a set of random variables that describe the given problem domain
2. Choose an ordering for them: X1, ..., Xn
3. for i=1 to n do
a. Add a new node for Xi to the net
b. Set Parents(Xi) to be the minimal set of already added nodes such that we have
conditional independence of Xi and all other members of {X1, ..., Xi-1} given Parents(Xi)
c. Add a directed arc from each node in Parents(Xi) to Xi
d. If Xi has at least one parent, then define a conditional probability table at Xi: P(Xi=x |
possible assignments to Parents(Xi)). Otherwise, define a prior probability at Xi: P(Xi)
To illustrate how a Bayesian Net can be used to compute an arbitrary value in the joint probability
distribution, consider the Bayesian Net shown above for the "home domain."
P(B,~O,D,~L,H) = P(H,~L,D,~O,B)
= P(H | ~L,D,~O,B) * P(~L,D,~O,B) by Product Rule
= P(H|D) * P(~L,D,~O,B) by Conditional Independence of H and
L,O, and B given D
= P(H|D) P(~L | D,~O,B) P(D,~O,B) by Product Rule
= P(H|D) P(~L|~O) P(D,~O,B) by Conditional Independence of L and D,
and L and B, given O
= P(H|D) P(~L|~O) P(D | ~O,B) P(~O,B) by Product Rule
= P(H|D) P(~L|~O) P(D|~O,B) P(~O | B) P(B) by Product Rule
= P(H|D) P(~L|~O) P(D|~O,B) P(~O) P(B) by Independence of O and B
= (.3)(1 - .6)(.1)(1 - .6)(.3)
= 0.00144
where all of the numeric values are available directly in the Bayesian Net (since
P(~A|B) = 1 - P(A|B)).
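The chain-rule product above can be evaluated directly; the CPT values below are those implied by the numeric product in the text:

```python
# Computing P(B, ~O, D, ~L, H) for the home-domain net via the chain rule,
# using the CPT values read off the numeric product above.
P_B = 0.3              # P(B): dog has bowel trouble
P_O = 0.6              # P(O): family is out
P_L_given_nO = 0.6     # P(L | ~O), so P(~L | ~O) = 1 - 0.6 = 0.4
P_D_given_nO_B = 0.1   # P(D | ~O, B): dog outside, given family home + bowel trouble
P_H_given_D = 0.3      # P(H | D): hear barking, given dog outside

joint = P_H_given_D * (1 - P_L_given_nO) * P_D_given_nO_B * (1 - P_O) * P_B
print(round(joint, 5))  # 0.00144
```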
The simplest kind of random sampling process for Bayesian networks generates events from a
network that has no evidence associated with it. The idea is to sample each variable in turn, in
topological order. The probability distribution from which the value is sampled is conditioned on the
values already assigned to the variable’s parents.
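A minimal sketch of this sampling process, for a hypothetical two-node net with assumed CPTs:

```python
import random

# Prior sampling for a tiny net O -> D ("family Out" causes "Dog outside").
# Variables are sampled in topological order, each conditioned on the
# values already drawn for its parents. CPT values are assumptions.
P_O = 0.6
P_D_given_O = {True: 0.9, False: 0.2}  # P(D = true | O)

def prior_sample(rng):
    o = rng.random() < P_O             # sample the root from its prior
    d = rng.random() < P_D_given_O[o]  # sample the child given its parent
    return o, d

rng = random.Random(0)
samples = [prior_sample(rng) for _ in range(100_000)]
freq_O = sum(o for o, _ in samples) / len(samples)
print(round(freq_O, 1))  # 0.6, since the empirical frequency approaches P(O)
```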
Likelihood weighting avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e. It is a particular instance of the general statistical technique of importance sampling, tailored for inference in Bayesian networks.
Markov chain Monte Carlo (MCMC) algorithms work quite differently from rejection sampling and likelihood weighting. Instead of generating each sample from scratch, MCMC algorithms generate each sample by making a random change to the preceding sample. It is therefore helpful to think of an MCMC algorithm as being in a particular current state, specifying a value for every variable, and generating a next state by making random changes to the current state.
The Dempster–Shafer theory is designed to deal with the distinction between uncertainty and ignorance. Rather than computing the probability of a proposition, it computes the probability that the evidence supports the proposition. This measure of belief is called a belief function, written Bel(X).
The mathematical formulation of Dempster–Shafer theory is similar to that of probability theory; the main difference is that, instead of assigning probabilities to possible worlds, the theory assigns masses to sets of possible worlds, that is, to events.
The masses must still add to 1 over all possible events. Bel(A) is defined as the sum of the masses for all events that are subsets of (i.e., that entail) A, including A itself. With this definition, Bel(A) and Bel(¬A) sum to at most 1, and the gap (the interval between Bel(A) and 1 − Bel(¬A)) is often interpreted as bounding the probability of A.
As with default reasoning, there is a problem in connecting beliefs to actions. Whenever there is a gap in the beliefs, a decision problem can be defined such that a Dempster–Shafer system is unable to make a decision.
Bel(A) should be interpreted not as a degree of belief in A but as the probability assigned to all the possible worlds (now interpreted as logical theories) in which A is provable.
For example:
Consider a room with four persons: A, B, C, and D. Suddenly the lights go out, and when they come back on, B has died of a stab wound in the back made with a knife. No one entered or left the room, and B did not commit suicide. We have to find out who the murderer is.
Mass function m(K): a mass such as m({K or B}) means there is evidence for {K or B} that cannot be divided among the more specific beliefs K and B.
Belief in K: the belief in an element K of the power set is the sum of the masses of the elements that are subsets of K. This can be explained through an example.
Let K = {a, b, c}. Then:
Bel(K) = m(a) + m(b) + m(c) + m(a, b) + m(a, c) + m(b, c) + m(a, b, c)
Plausibility of K: it is the sum of the masses of the sets that intersect K:
Pl(K) = m(a) + m(b) + m(c) + m(a, b) + m(b, c) + m(a, c) + m(a, b, c)
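Bel and Pl can be computed directly from a mass assignment; the masses below are illustrative assumptions over the frame {a, b, c}:

```python
# Belief and plausibility from a mass assignment, following the definitions
# above: Bel(K) sums masses of subsets of K; Pl(K) sums masses of sets that
# intersect K. Masses must sum to 1.
masses = {
    frozenset("a"): 0.3,
    frozenset("b"): 0.2,
    frozenset("ab"): 0.1,
    frozenset("abc"): 0.4,
}
assert abs(sum(masses.values()) - 1.0) < 1e-9

def bel(K):
    K = frozenset(K)
    return sum(m for s, m in masses.items() if s <= K)  # subsets of K

def pl(K):
    K = frozenset(K)
    return sum(m for s, m in masses.items() if s & K)   # sets intersecting K

print(round(bel("ab"), 2), round(pl("ab"), 2))  # 0.6 1.0
```

Note that bel("ab") ≤ pl("ab"), as the theory requires: the gap between them bounds the probability of the event {a, b}.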
Learning
An agent is learning if it improves its performance on future tasks after making observations
about the world.
Forms Of Learning
Any component of an agent can be improved by learning from data. The improvements, and the techniques used to make them, depend on four factors:
o Which component of the agent is to be improved
o What prior knowledge the agent already has
o What representation is used for the data and the component
o What feedback is available to learn from
SUPERVISED LEARNING
Given a training set of N example input–output pairs (x1, y1), (x2, y2), . . . (xN, yN) , where each yj was
generated by an unknown function y = f(x), discover a function h that approximates the true function f.
The function h is a hypothesis. To measure the accuracy of a hypothesis we give it a test set of examples
that are distinct from the training set.
Conditional probability distribution: the function f may be stochastic, i.e., not strictly a function of x; in that case, what we have to learn is a conditional probability distribution, P(Y | x).
Classification: when the output y is one of a finite set of values, the learning problem is called classification.
Regression: when y is a number (such as tomorrow's temperature), the learning problem is called regression.
The hypothesis space, H, might be, for example, a set of polynomials when we are fitting a function of a single variable to some data points.
Ockham's razor: how do we choose a function or a polynomial from among multiple consistent hypotheses? One answer is to prefer the simplest hypothesis consistent with the data. This principle is called Ockham's razor.
Realizable : a learning problem is realizable if the hypothesis space contains the true function.
Unfortunately, we cannot always tell whether a given learning problem is realizable, because the true
function is not known.
Supervised learning can be done by choosing the hypothesis h that is most probable given the data:

h* = argmax_h P(h | data)

which, by Bayes' rule, is equivalent to

h* = argmax_h P(data | h) P(h)
The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer strategy. This test divides
the problem up into smaller subproblems that can then be solved recursively.
We can check that the entropy of a fair coin flip is indeed 1 bit:
H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 .
The information gain from the attribute test on A is the expected reduction in entropy:

Gain(A) = B(p/(p+n)) - Remainder(A)

where B(q) is the entropy of a Boolean variable that is true with probability q, and Remainder(A) = Sum_k ((pk + nk)/(p + n)) * B(pk/(pk + nk)) is the entropy remaining after the test splits the p positive and n negative examples into subsets.
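These computations can be sketched as follows; the split counts used for the gain are illustrative assumptions:

```python
import math

def B(q):
    """Entropy (in bits) of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

# The fair coin flip carries exactly 1 bit, as computed in the text.
print(B(0.5))  # 1.0

# Information gain of an attribute test that splits (p, n) = (6, 6)
# examples into two branches with the (assumed) class counts below.
p, n = 6, 6
branches = [(4, 1), (2, 5)]  # (positives, negatives) in each branch
remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in branches)
gain = B(p / (p + n)) - remainder
print(round(gain, 3))
```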
Pruning
In decision trees, a technique called decision tree pruning combats overfitting. Pruning works by
eliminating nodes that are not clearly relevant.
A LOGICAL FORMULATION OF LEARNING
Current-best-hypothesis search
The idea behind current-best-hypothesis search is to maintain a single hypothesis, and to adjust it as
new examples arrive in order to maintain consistency.
The extension of the hypothesis must be increased to include new examples. This is called
generalization.
function CURRENT-BEST-LEARNING(examples, h) returns a hypothesis or fail
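A heavily simplified sketch of current-best-hypothesis search, under the assumption that a hypothesis is a closed interval on the real line; generalization extends the interval, and specialization (with its backtracking) is omitted:

```python
# Minimal current-best-hypothesis sketch. A hypothesis is an interval
# (lo, hi); an example x is classified positive iff lo <= x <= hi.
def current_best_learning(examples, h):
    if not examples:
        return h
    (x, label), rest = examples[0], examples[1:]
    lo, hi = h
    predicted = lo <= x <= hi
    if predicted == label:                  # consistent with h: keep it
        return current_best_learning(rest, h)
    if label and not predicted:             # false negative: generalize h
        return current_best_learning(rest, (min(lo, x), max(hi, x)))
    return None                             # false positive: fail (no specialization here)

examples = [(2.0, True), (5.0, True), (1.0, True)]
print(current_best_learning(examples, (2.0, 2.0)))  # (1.0, 5.0)
```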
Boundary Set :
We also have an ordering on the hypothesis space, namely, generalization/specialization. This is a partial
ordering, which means that each boundary will not be a point but rather a set of hypotheses called a
boundary set.
The great thing is that we can represent the entire version space using just two boundary sets: a most general boundary (the G-set) and a most specific boundary (the S-set). Everything in between is guaranteed to be consistent with the examples.
Consider the members Si and Gi of the S- and G-sets. For each one, the new example may be a false positive or a false negative:
1. False positive for Si: This means Si is too general, but there are no consistent specializations of Si (by
definition), so we throw it out of the S-set.
2. False negative for Si: This means Si is too specific, so we replace it by all its immediate generalizations,
provided they are more specific than some member of G.
3. False positive for Gi: This means Gi is too general, so we replace it by all its immediate specializations,
provided they are more general than some member of S.
4. False negative for Gi: This means Gi is too specific, but there are no consistent generalizations of Gi
(by definition) so we throw it out of the G-set
Given an algorithm for learning determinations, a learning agent has a way to construct a minimal
hypothesis within which to learn the target predicate. For example, we can combine MINIMAL-
CONSISTENT-DET with the DECISION-TREE-LEARNING algorithm.
This yields a relevance-based decision-tree learning algorithm RBDTL that first identifies a minimal set of
relevant attributes and then passes this set to the decision tree algorithm for learning.