Unit 5
II-YEAR-AI&DS -2021 R
SUB CODE: AL3391
SUB NAME: Artificial Intelligence
Unit V:
PROBABILISTIC REASONING
PART A
1.Define Reinforcement Learning.
In reinforcement learning, rather than being told what to do by a teacher, the agent must learn from occasional rewards.
Example: If a taxi driver does not get a tip at the end of a journey, it gives him an indication that his behaviour is undesirable.
2.Define Supervised Learning.
An algorithm for supervised learning is given the correct value of the unknown function for particular inputs, and it must try to recover the unknown function.
3.Define Classification Learning.
Supervised learning involves learning a function from examples of its inputs and outputs.
Example: When applying the brakes on a wet road, the car may skid as a result.
Unsupervised Learning
It involves learning patterns in the input when no specific output values are supplied.
Example: Day by day, the agent will learn about “good traffic days” and “bad traffic days” without any advice.
7.Define Bayesian Learning.
It calculates the probability of each hypothesis, given the data, and makes predictions on that basis; i.e., predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single “best” hypothesis.
8.Define MAP.
Maximum A Posteriori. A very common approximation is to make predictions based on the single most probable hypothesis. This is MAP.
9.Define MDL.
MDL (Minimum Description Length) is a learning method which attempts to minimize the size of the hypothesis and data encodings rather than work with probabilities.
10.What is a Maximum-Likelihood hypothesis?
ML - it is a reasonable approach when there is no reason to prefer one hypothesis over another a priori.
11.What are the methods for maximum likelihood parameter learning?
i. Write down an expression for the likelihood of the data as a function of the parameter.
ii. Write down the derivative of the log likelihood with respect to each parameter.
iii. Find the parameter values such that the derivatives are zero.
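As an illustration of these three steps, here is a minimal sketch assuming a simple Bernoulli model (not any particular example from the syllabus): for N observations with c successes, the likelihood is θ^c(1−θ)^(N−c), and setting the derivative of the log likelihood to zero gives θ = c/N.

```python
import math

def bernoulli_mle(data):
    """Step iii in closed form: d/dθ [c·log θ + (N−c)·log(1−θ)] = 0 gives θ = c/N."""
    c, n = sum(data), len(data)
    return c / n

def log_likelihood(theta, data):
    """Step i/ii: the log likelihood of the data as a function of the parameter."""
    c, n = sum(data), len(data)
    return c * math.log(theta) + (n - c) * math.log(1 - theta)

data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]   # 7 successes out of 10
theta_hat = bernoulli_mle(data)          # 0.7
# The closed-form estimate beats nearby parameter values:
assert all(log_likelihood(theta_hat, data) >= log_likelihood(t, data)
           for t in (0.5, 0.6, 0.65, 0.75, 0.8))
```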
In this model, the “class” variable C is the root and the “attribute” variables Xi are the leaves. This model assumes that the attributes are conditionally independent of each other, given the class.
12.Define sum of squared errors.
The difference between the actual value yj and the predicted value (θ1 xj + θ2) is squared and summed over all examples; E is the sum of squared errors.
13.Define EM.
Expectation Maximization: the idea of EM is to pretend that we know the parameters of the model and then to infer the probability that each data point belongs to each component. After that, we refit the components to the data, where each component is fitted to the entire data set with each point weighted by the probability that it belongs to that component.
It consists of nodes or units connected by directed links. A link propagates activation from one unit to another. Each link has a numeric weight, which determines the strength and sign of the connection.
17.What are the categories of neural network structures?
i. Acyclic (or) feed-forward networks
ii. Cyclic (or) recurrent networks
The agent’s policy is fixed and the task is to learn the utilities of states, this could also involve
learning a model of the environment.
The agent must learn what to do. An agent must experience as much as possible of its environment
in order to learn how to behave in it.
i. Rough translation ii. Restricted-source translation iii. Pre-edited translation iv. Literary translation
16 MARKS
5.1.Acting under uncertainty
Uncertainty:
Till now, we have learned knowledge representation using first-order logic and propositional logic with certainty, which means we were sure about the predicates. With this knowledge representation, we might write A→B, which means if A is true then B is true. But consider a situation where we are not sure whether A is true or not; then we cannot express this statement. This situation is called uncertainty.
So to represent uncertain knowledge, where we are not sure about the predicates, we need uncertain reasoning or probabilistic reasoning.
Causes of uncertainty:
Following are some leading causes of uncertainty in the real world.
5.2.Bayesian inference
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two
random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The
Bayesian inference is an application of Bayes' theorem, which is fundamental to
Bayesian statistics.
Bayes' theorem allows updating the probability prediction of an event by observing new information from the real world.
Example: If the probability of cancer is related to a person's age, then by using Bayes' theorem we can determine the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of event A with known event B:
P(A|B) = P(B|A) P(A) / P(B) ...(a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood: assuming that the hypothesis is true, we calculate the probability of the evidence.
P(A) is called the prior probability: the probability of the hypothesis before considering the evidence.
P(B) is called the marginal probability: the probability of the evidence.
In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai); hence Bayes' rule can be written as:
P(Ai|B) = P(Ai) P(B|Ai) / Σk P(Ak) P(B|Ak)
where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is very useful in cases where we have good probability estimates of these three terms and want to determine the fourth one. Suppose we want to perceive the effect of some unknown cause and want to compute that cause; then Bayes' rule becomes:
P(cause|effect) = P(effect|cause) P(cause) / P(effect)
Example-1:
Question: What is the probability that a patient has the disease meningitis, given a stiff neck?
A doctor is aware that the disease meningitis causes a patient to have a stiff neck 80% of the time. He is also aware of some more facts, which are given as follows:
Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has meningitis. Then we have:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
Applying Bayes' rule: P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.00133.
Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff neck.
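The arithmetic above can be checked with a few lines (a sketch; the variable names are my own):

```python
# Bayes' rule for the meningitis example: P(b|a) = P(a|b) P(b) / P(a)
p_a_given_b = 0.8        # P(stiff neck | meningitis)
p_b = 1 / 30000          # prior P(meningitis)
p_a = 0.02               # P(stiff neck)

p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)       # ≈ 0.00133, i.e. about 1 in 750
```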
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The probability that the card is a king is 4/52. Calculate the posterior probability P(King|Face), i.e., the probability that the drawn card is a king, given that it is a face card.
Solution:
P(King|Face) = P(Face|King) P(King) / P(Face) = (1 × 4/52) / (12/52) = 1/3.
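This can be verified exactly with Python's fractions module (a sketch; it assumes the standard count of 12 face cards per deck):

```python
from fractions import Fraction

p_king = Fraction(4, 52)          # 4 kings in a 52-card deck
p_face = Fraction(12, 52)         # 12 face cards (J, Q, K of each suit)
p_face_given_king = Fraction(1)   # every king is a face card

p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)          # 1/3
```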
Application of Bayes' theorem in Artificial Intelligence:
o It is used to calculate the next step of a robot when the already executed step is given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.
5.3.Naïve Bayes Classifier
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A on the observed event B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability of the hypothesis before observing the evidence, and P(B) is the marginal probability of the evidence.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of weather conditions:
Weather No Yes
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
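The same numbers can be recovered directly from the 14-row dataset above (a sketch; the function and variable names are my own):

```python
from collections import Counter

# The 14-day dataset from the table above.
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
class_count = Counter(play)                # {'Yes': 10, 'No': 4}
joint_count = Counter(zip(outlook, play))  # e.g. ('Sunny', 'Yes') -> 3

def posterior(weather, label):
    prior      = class_count[label] / n                        # P(label)
    likelihood = joint_count[(weather, label)] / class_count[label]
    evidence   = sum(1 for o in outlook if o == weather) / n   # P(weather)
    return likelihood * prior / evidence

p_yes = posterior("Sunny", "Yes")   # 0.3 * (10/14) / (5/14) = 0.60
p_no  = posterior("Sunny", "No")    # 0.5 * (4/14)  / (5/14) = 0.40
print("Play" if p_yes > p_no else "Don't play")   # prints "Play"
```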
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship
between features.
Applications of Naïve Bayes Classifier:
5.4.Probabilistic reasoning
Probabilistic reasoning:
We use probability in probabilistic reasoning because it provides a way to handle the uncertainty that results from laziness and ignorance.
In the real world, there are lots of scenarios where the certainty of something is not confirmed, such as "it will rain today", "the behavior of someone in some situation", or "a match between two teams or two players". These are probable sentences, which we can assume will happen but cannot be sure about; here we use probabilistic reasoning.
In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics
As probabilistic reasoning uses probability and related terms, so before understanding probabilistic
reasoning, let's understand some common terms:
Probability: Probability can be defined as the chance that an uncertain event will occur. It is the numerical measure of the likelihood that an event will occur. The value of probability always lies between 0 and 1, representing the range from ideal uncertainty to certainty.
P(A) = 0 indicates total uncertainty in an event A.
P(A) = 1 indicates total certainty in an event A.
We can find the probability of an uncertain event by using the below formula:
P(A) = (number of desired outcomes) / (total number of outcomes)
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in the real world.
Prior probability: The prior probability of an event is the probability computed before observing new information.
Posterior probability: The probability that is calculated after all evidence or information has been taken into account. It is a combination of the prior probability and the new information.
Conditional probability:
Conditional probability is the probability of an event occurring, given that another event has already happened.
Suppose we want to calculate the probability of event A given that event B has already occurred:
P(A|B) = P(A⋀B) / P(B)
Where P(A⋀B) = joint probability of A and B, and
P(B) = marginal probability of B.
If the probability of A is given and we need to find the probability of B, then it will be given as:
P(B|A) = P(A⋀B) / P(A)
It can be explained by using a Venn diagram, where B is the occurred event, so the sample space is reduced to set B; we can now calculate event A, given that event B has already occurred, by dividing the probability P(A⋀B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English and Mathematics. What percentage of the students who like English also like Mathematics?
Solution:
Let A be the event that a student likes Mathematics and B be the event that a student likes English.
P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 = 0.57
Hence, 57% of the students who like English also like Mathematics.
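A two-line check of this conditional probability (a sketch):

```python
p_english = 0.7            # P(student likes English)
p_eng_and_math = 0.4       # P(student likes English AND Mathematics)

# P(Math | English) = P(English AND Math) / P(English)
p_math_given_english = p_eng_and_math / p_english
print(round(p_math_given_english, 2))   # 0.57
```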
5.5.Bayesian Belief Network
A Bayesian belief network is a key computer technology for dealing with probabilistic events and for solving problems which have uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their
conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:
o Directed acyclic graph
o Table of conditional probabilities
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities between random variables. These directed links connect pairs of nodes in the graph. A link represents that one node directly influences the other node; if there is no directed link, the nodes are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by
the nodes of the network graph.
o If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of Node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph, or DAG.
The Bayesian network has mainly two components:
o Causal component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional probability. So let's first understand the joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written as follows in terms of the joint probability distribution:
P[x1, x2, ..., xn] = P[x1 | x2, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
In general, for each variable Xi, we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken the responsibility of informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused with the phone ringing and calls then too. On the other hand, Sophia likes to listen to loud music, so sometimes she misses the alarm. Here we would like to compute the probability of a burglary alarm.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that burglary and earthquake are the parent nodes of the alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend on the alarm probability.
o The network represents that our agents do not directly perceive the burglary, do not notice the minor earthquake, and that David and Sophia do not confer before calling.
o The conditional distributions for each node are given as a conditional probability table, or CPT.
o Each row in the CPT must sum to 1, because all the entries in a row represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if there are two parents, the CPT will contain 4 probability values.
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of probability as P[D, S, A, B, E]. We can rewrite this probability statement using the joint probability distribution:
P[D, S, A, B, E] = P[D|A] P[S|A] P[A|B, E] P[B] P[E]
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002, which is the probability of a burglary.
P(E = True) = 0.001, which is the probability of a minor earthquake.
The conditional probability of David calling depends on the probability of the alarm:
A P(D= True) P(D= False)
True 0.91 0.09
False 0.05 0.95
The conditional probability of Sophia calling depends on its parent node "Alarm":
A P(S= True) P(S= False)
True 0.75 0.25
False 0.02 0.98
From the formula of the joint distribution, we can write the problem statement in the form of a probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) P(D|A) P(A|¬B, ¬E) P(¬B) P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045.
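The product above can be checked with a few lines (a sketch; the CPT rows for the alarm node that are not quoted in the text are the ones commonly used with this version of the example, so treat them as assumptions):

```python
# Joint probability P(D, S, A, ¬B, ¬E) for the burglary network.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A = {(True, True): 0.94, (True, False): 0.95,        # P(A=True | B, E); assumed
       (False, True): 0.31, (False, False): 0.001}     # rows except (F, F) = 0.001
P_D = {True: 0.91, False: 0.05}                        # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                        # P(S=True | A)

p = P_D[True] * P_S[True] * P_A[(False, False)] * P_B[False] * P_E[False]
print(p)   # ≈ 0.00068045
```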
Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
There are two ways to understand the semantics of a Bayesian network, which are given below:
their effect on the alarm. Also, given the state of the alarm, whether or not J calls has no influence on M's calling.
Formally speaking, we believe that the following conditional independence statement holds:
P(M calls | J calls, Alarm, Earthquake, Burglary) = P(M calls | Alarm).
• Compactness and node ordering:
Bayesian networks are compact, and they possess the property of being locally structured (also called sparse systems). In a locally structured system, each subcomponent interacts directly with only a bounded number of other components, regardless of the total number of components.
⇒ Local structure is usually associated with linear rather than exponential growth in complexity.
⇒ Assume n Boolean variables for simplicity, where each random variable is influenced by at most k other variables. Then the amount of information needed to specify each conditional probability table will be at most 2^k numbers.
⇒ The complete network can be specified by n·2^k numbers, whereas the full joint distribution contains 2^n numbers.
⇒ To make this concrete, suppose we have n = 30 nodes, each with five parents (k = 5). Then the Bayesian network requires 960 numbers, but the full joint distribution requires over a billion.
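The arithmetic is easy to confirm:

```python
n, k = 30, 5
bayes_net_numbers = n * 2**k    # each node's CPT has at most 2^k entries
full_joint_numbers = 2**n       # full joint over n Boolean variables

print(bayes_net_numbers)        # 960
print(full_joint_numbers)       # 1073741824, i.e. over a billion
```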
Ordering of nodes in Bayesian network:
1) The correct order in which to add nodes is to add the "root causes" first, then the variables they influence, and so on, until we reach the "leaves", which have no direct causal influence on the other variables.
2) If the wrong order is chosen, we get a more complicated network.
For example: consider the network shown in the following diagram.
The network construction process goes as follows:
i) Adding M calls: no parent.
ii) Adding J calls: if M calls, that probably means the alarm has gone off, which of course would make it more likely that J calls. Therefore, J calls needs M calls as a parent.
iii) Adding Alarm: clearly, if both call, it is more likely that the alarm has gone off than if just one or neither calls, so we need both M calls and J calls as parents.
iv) Adding Burglary: if we know the alarm state, then the call from J or M might give us information about the phone ringing or M's music, but not about burglary: P(Burglary | Alarm, J calls, M calls) = P(Burglary | Alarm). Hence, we need just Alarm as a parent.
v) Adding Earthquake: if the alarm is on, it is more likely that there has been an earthquake (the alarm is an earthquake detector of sorts). But if we know that there has been a burglary, then that explains the alarm, and the probability of an earthquake would be only slightly above normal. Hence, we need both Alarm and Burglary as parents. The network is shown below.
3. Conditional Independence Relations in Bayesian Networks
One can start from a "topological" semantics that specifies the conditional independence relationships encoded
by the graph structure, and from these we can derive the "numerical" semantics. The topological semantics is
given by either of the following specifications, which are equivalent.
1) A node is conditionally independent of its non-descendants, given its parents. For example, in Fig. 7.3.2, J calls is independent of Burglary and Earthquake, given the value of Alarm.
2) A node is conditionally independent of all other nodes in the network, given its parents, children, and children's parents - that is, given its Markov blanket.
For example: Burglary is independent of J calls and M calls, given Alarm and Earthquake.
These specifications are illustrated in the following figures. From these conditional independence assertions and the CPTs, the full joint distribution can be reconstructed; thus the "numerical" semantics and the "topological" semantics are equivalent.
A node X is conditionally independent of its non-descendants (the Zij's), given its parents (the Ui's, shown in the gray area).
A node X is conditionally independent of all other nodes in the network, given its Markov blanket (the gray area).
A network with both discrete and continuous variables is called a hybrid Bayesian network.
Consider the simple example in the following diagram, in which a customer buys some fruit depending on its cost, which depends in turn on the size of the harvest and whether the government's subsidy scheme is operating. The variable Cost is continuous and has both continuous and discrete parents; the variable Buys is discrete and has a continuous parent.
For the Cost variable, we need to specify P(Cost | Harvest, Subsidy). The discrete parent is handled by explicit enumeration, that is, specifying both P(Cost | Harvest, subsidy) and P(Cost | Harvest, ¬subsidy). To handle Harvest, we specify how the distribution over the cost c depends on the continuous value h of Harvest. In other words, we specify the parameters of the cost distribution as a function of h.
5.6.Inferencing in Bayesian Networks
1. Exact Inference in Bayesian Networks
For inferencing in a probabilistic system, it is required to calculate the posterior probability distribution for a set of query variables, where some observed events are given [that is, we have some values attached to the evidence variables].
• Notation Revisited:
The notation used in inferencing is the same as the one used in probability theory.
X: the query variable.
E: the set of evidence variables E1, ..., Em, and e is the particular observed event.
Y: the set of non-evidence variables Y1, Y2, ..., Yk [non-evidence variables are also called hidden variables].
The complete set of variables is X = {X} ∪ E ∪ Y.
• Generally, the query asks for the posterior probability distribution P(X | e) [assuming that the query variable is not among the evidence variables; if it is, then the posterior distribution for X simply gives probability 1 to the observed value]. [Note that a query can contain more than one variable. For study purposes we are assuming a single variable.]
• Example: In the burglary case, the observed event is JCalls = true and MCalls = true.
The query is 'Has a burglary occurred?'
The probability distribution for this situation would be:
P(Burglary | J calls = true, M calls = true) = <0.284, 0.716>
2. Inference by Enumeration
A Bayesian network gives a complete representation of the full joint distribution. This full joint distribution can be written as a product of conditional probabilities from the Bayesian network.
A query can be answered using a Bayesian network by computing sums of products of conditional probabilities from the network.
function ENUMERATION-ASK(X, e, bn) returns a distribution over X
  Q(X) ← a distribution over X, initially empty
  for each value xi of X do
    Q(xi) ← ENUMERATE-ALL(VARS[bn], e extended with X = xi)
  return NORMALIZE(Q(X))

function ENUMERATE-ALL(vars, e) returns a real number
  if EMPTY?(vars) then return 1.0
  Y ← FIRST(vars)
  if Y has value y in e
    then return P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), e)
    else return Σy P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), ey)
      where ey is e extended with Y = y
Example:
Consider query,
P(Burglary | J calls = true, M calls =true)
The hidden variables in the query are Earthquake and Alarm.
Using the query equation:
P(Burglary | j, m) = α P(Burglary, j, m) = α Σe Σa P(Burglary, e, a, j, m)
The semantics of Bayesian networks (equation 7.2.1) then gives us an expression in terms of CPT entries. For simplicity, we do this just for Burglary = true:
P(b | j, m) = α Σe Σa P(b) P(e) P(a | b, e) P(j | a) P(m | a)
• To compute this expression, we have to add four terms, each computed by multiplying five numbers.
• In the worst case, where we have to sum out almost all the variables, the complexity of the algorithm for a network with n Boolean variables is O(n·2^n).
• An improvement can be obtained from the following simple observations: the P(b) term is a constant and can be moved outside the summations over a and e, and the P(e) term can be moved outside the summation over a. Hence, we have
P(b | j, m) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)
This expression can be evaluated by looping through the variables in order, multiplying CPT entries as we go.
For each summation, we also need to loop over the variable's possible values. The structure of this computation
is shown in following diagram. Using the numbers from Fig. 7.3.2, we obtain P(b | j, m) = α × 0.00059224. The
corresponding computation for ⌐b yields α × 0.0014919;
Hence,
P(B | j, m)= α < 0.00059224, 0.0014919 >
≈ <0.284, 0.716 >
That is, the chance of burglary, given calls from both neighbours, is about 28%. Note that in Fig. 7.3.7 the evaluation proceeds top-down, multiplying values along each path and summing at the "+" nodes. Observe that there is repetition of the paths for j and m.
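The whole enumeration can be reproduced in a short program (a sketch; since the figure with the CPTs is not reproduced here, the numbers below are the standard ones associated with this J/M version of the network and should be treated as assumptions):

```python
def p_node(var, value, ev):
    """Return P(var = value | parent values assigned in ev). CPT values assumed."""
    if var == 'B':
        p = 0.001
    elif var == 'E':
        p = 0.002
    elif var == 'A':
        p = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}[(ev['B'], ev['E'])]
    elif var == 'J':
        p = 0.90 if ev['A'] else 0.05
    else:  # 'M'
        p = 0.70 if ev['A'] else 0.01
    return p if value else 1 - p

ORDER = ['B', 'E', 'A', 'J', 'M']   # topological order of the network

def enumerate_all(vars_, ev):
    """ENUMERATE-ALL from the pseudocode: multiply fixed vars, sum out hidden ones."""
    if not vars_:
        return 1.0
    y, rest = vars_[0], vars_[1:]
    if y in ev:
        return p_node(y, ev[y], ev) * enumerate_all(rest, ev)
    return sum(p_node(y, v, ev) * enumerate_all(rest, {**ev, y: v})
               for v in (True, False))

def enumeration_ask(x, ev):
    """ENUMERATION-ASK: unnormalized values for each value of x, then normalize."""
    q = {v: enumerate_all(ORDER, {**ev, x: v}) for v in (True, False)}
    z = sum(q.values())
    return {v: q[v] / z for v in q}

dist = enumeration_ask('B', {'J': True, 'M': True})
print(round(dist[True], 3))   # 0.284
```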
3. The Variable Elimination Algorithm
Factors: each part of the expression is annotated with the name of the associated variable; these parts are called factors.
Steps in algorithm:
i) The factor for M, P(m | a), does not require summing over M. We store the probability, given each value of a, in a two-element factor fM(A).
Note: fM means that M was used to produce f.
ii) Store the factor for J as the two-element vector fJ(A).
iii) The factor for A is P(a | B, e) which will be a 2×2×2 matrix fA (A, B, E).
iv) Sum out A from the product of these three factors. This will give a 2×2 matrix whose indices range over just B and E. We put a bar over A in the name of the matrix to indicate that A has been summed out:
fĀJM(B, E) = Σa fA(a, B, E) × fJ(a) × fM(a)
= fA(a, B, E) × fJ(a) × fM(a) + fA(¬a, B, E) × fJ(¬a) × fM(¬a)
The multiplication process used here is called a pointwise product.
v) Process E in the same way, i.e., sum out E from the product of fE(E) and fĀJM(B, E):
fĒĀJM(B) = fE(e) × fĀJM(B, e) + fE(¬e) × fĀJM(B, ¬e)
vi) Compute the answer simply by multiplying the factor for B, fB(B) = P(B), by the accumulated matrix fĒĀJM(B):
P(B | j, m) = α fB(B) × fĒĀJM(B)
From the above sequence of steps it can be noticed that two computational operations are required.
a) Pointwise product of a pair of factors.
b) Summing out a variable from a product of factors.
a) Pointwise product of a pair of factors: the pointwise product of two factors f1 and f2 yields a new factor f whose variables are the union of the variables in f1 and f2. Suppose the two factors have variables Y1, ..., Yk in common. Then we have
f(X1, ..., Xj, Y1, ..., Yk, Z1, ..., Zl) = f1(X1, ..., Xj, Y1, ..., Yk) f2(Y1, ..., Yk, Z1, ..., Zl)
If all the variables are binary, then f1 and f2 have 2^(j+k) and 2^(k+l) entries, and the pointwise product has 2^(j+k+l) entries.
For example: given two factors f1(A, B) and f2(B, C) with the probability distributions shown below, the pointwise product f1 × f2 is given as f(A, B, C).
b) Summing out a variable from a product of factors: this is a straightforward computation. Any factor that does not depend on the variable to be summed out can be moved outside the summation.
For example:
Σe fE(e) × fA(A, B, e) × fJ(A) × fM(A) = fJ(A) × fM(A) × Σe fE(e) × fA(A, B, e)
Now the pointwise product inside the summation is computed, and the variable is summed out of the resulting matrix:
fJ(A) × fM(A) × Σe fE(e) × fA(A, B, e) = fJ(A) × fM(A) × fĒA(A, B)
Matrices are not multiplied until we need to sum out a variable from the accumulated product. At that point, we multiply only those matrices that include the variable to be summed out.
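These two operations can be sketched for Boolean factors as follows (a sketch; the dictionary-based factor representation and the example numbers are my own):

```python
from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    """Pointwise product of two factors over Boolean variables.
    A factor is a dict mapping a tuple of truth values (in vars order) to a number."""
    out_vars = list(dict.fromkeys(vars1 + vars2))   # union, preserving order
    out = {}
    for assign in product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, assign))
        k1 = tuple(a[v] for v in vars1)
        k2 = tuple(a[v] for v in vars2)
        out[assign] = f1[k1] * f2[k2]
    return out, out_vars

def sum_out(var, f, vars_):
    """Sum a variable out of a factor, collapsing entries that differ only in var."""
    i = vars_.index(var)
    keep = [v for v in vars_ if v != var]
    out = {}
    for assign, p in f.items():
        key = tuple(v for j, v in enumerate(assign) if j != i)
        out[key] = out.get(key, 0.0) + p
    return out, keep

# Tiny check: f1(A, B) * f2(B, C), then sum out B.
f1 = {(True, True): 0.3, (True, False): 0.7,
      (False, True): 0.9, (False, False): 0.1}
f2 = {(True, True): 0.2, (True, False): 0.8,
      (False, True): 0.6, (False, False): 0.4}
f, fv = pointwise_product(f1, ['A', 'B'], f2, ['B', 'C'])
g, gv = sum_out('B', f, fv)
print(gv)   # ['A', 'C']
```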
4. The Complexity Involved in Exact Inferencing
The variable elimination algorithm is more efficient than enumeration algorithm because it avoids
repeated computations as well as drops irrelevant variables.
The variable elimination algorithm constructs factors during its operation. The space and time complexity of variable elimination is directly dependent on the size of the largest factor constructed during the operation.
The factor construction is determined by the order of elimination of the variables and by the structure of the network, which affects both space and time complexity.
Clustering algorithm:
1) In clustering algorithms (also known as join tree algorithms), inferencing time can be reduced to O(n). In clustering, individual nodes of the network are joined to form cluster nodes in such a way that the resulting network is a polytree.
2) The variable elimination algorithm is an efficient algorithm for answering individual queries. However, if posterior probabilities are to be computed for all the variables in the network, it can be less efficient: even in a polytree network it needs to issue O(n) queries costing O(n) each, for a total of O(n²) time. A clustering algorithm improves on this.
For example: the multiply connected network shown in Fig. 7.3.8 (a) can be converted into a polytree by combining the Sprinkler and Rain nodes into a cluster node called Sprinkler + Rain, as shown in Fig. 7.3.8 (b). The two Boolean nodes are replaced by a meganode that takes on four possible values: TT, TF, FT, FF. The meganode has only one parent, the Boolean variable Cloudy, so there are two conditioning cases.
5.7.Approximate Inference in Bayesian Network
Because exact inference in multiply connected networks is an intractable problem, we turn to approximate inferencing.
For approximate inferencing, randomized sampling algorithms (Monte Carlo algorithms) are used. Their accuracy depends on the number of samples generated. Monte Carlo algorithms are widely used in problem areas where it is difficult to calculate exact quantities.
There are two families of Monte Carlo algorithms:
1) Direct sampling algorithm.
2) Markov chain sampling algorithm.
1. Direct Sampling Algorithm
1) The basic element in any sampling algorithm is the generation of samples from a known probability
distribution.
For example: an unbiased coin can be thought of as a random variable Coin with values (heads, tails) and a prior distribution P(Coin) = <0.5, 0.5>. Sampling from this distribution is exactly like flipping the coin: with probability 0.5 it will return heads, and with probability 0.5 it will return tails. Given a source of random numbers in the range [0, 1], it is a simple matter to sample any distribution on a single variable.
2) The simplest kind of random sampling process for Bayesian networks generates events from a network that
has no evidence associated with it. The idea is to sample each variable in turn, in topological order.
3) The probability distribution from which the value is sampled is conditioned on the values already assigned to
the variable's parents.
4) The sampling algorithm:
function PRIOR-SAMPLE(bn) returns an event sampled from the prior specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X1, ..., Xn)
  x ← an event with n elements
  for i = 1 to n do
    xi ← a random sample from P(Xi | parents(Xi))
  return x
5) Applying the operations of the algorithm to the network in Fig. 7.3.8 (a), assuming the ordering [Cloudy, Sprinkler, Rain, WetGrass]:
i) Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
ii) Sample from P(Sprinkler | Cloudy = true) = <0.1, 0.9>; suppose this returns false.
iii) Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.
iv) Sample from P(WetGrass | Sprinkler = false, Rain = true) = <0.9, 0.1>; suppose this returns true.
In this case PRIOR-SAMPLE returns the event [true, false, true, true].
6) PRIOR-SAMPLE generates samples from the prior joint distribution specified by the network. First, let
S_PS(x1, ..., xn) be the probability that a specific event is generated by the PRIOR-SAMPLE algorithm. Just looking
at the sampling process, we have
S_PS(x1, ..., xn) = Π(i=1..n) P(xi | parents(Xi)),
because each sampling step depends only on the parent values. This expression is also the probability of the
event according to the Bayesian net's representation of the joint distribution. Hence,
S_PS(x1, ..., xn) = P(x1, ..., xn)
7) In any sampling algorithm, the answers are computed by counting the actual samples generated. Suppose
there are N total samples, and let N_PS(x1, ..., xn) be the frequency of the specific event x1, ..., xn. We expect this
frequency to converge, in the limit, to its expected value according to the sampling probability:
lim(N→∞) N_PS(x1, ..., xn)/N = S_PS(x1, ..., xn) = P(x1, ..., xn) ... (7.3.4)
For example: Consider the event produced earlier: [true, false, true, true]. The sampling probability for this
event is
S_PS(true, false, true, true) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324
Hence, in the limit of large N, we expect 32.4 % of the samples to be of this event.
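The arithmetic can be checked directly; each factor below is one of the CPT entries used in steps i)-iv) above:

```python
# P(Cloudy=true) * P(Sprinkler=false | Cloudy=true)
# * P(Rain=true | Cloudy=true) * P(WetGrass=true | Sprinkler=false, Rain=true)
s_ps = 0.5 * 0.9 * 0.8 * 0.9
print(round(s_ps, 3))  # 0.324
```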
8) Whenever we use an approximate equality ("≈"), we mean it in exactly this sense: the estimated probability
becomes exact in the large-sample limit. Such an estimate is called consistent.
For example: One can produce a consistent estimate of the probability of any partially specified event x1, ..., xm,
where m ≤ n, as follows:
P(x1, ..., xm) ≈ N_PS(x1, ..., xm)/N .... (7.3.5)
That is, the probability of the event can be estimated as the fraction of all complete events generated by the
sampling process that match the partially specified event.
For example: If we generate 1000 samples from the sprinkler network, and 511 of them have, Rain = true, then
the estimated probability of rain, written as P(Rain = true), is 0.511.
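The whole direct-sampling procedure can be sketched as follows. This is a minimal, illustrative encoding of the sprinkler network, not a fixed API; CPT entries not quoted in the text (e.g. P(Sprinkler = true | Cloudy = false) = 0.5) are assumed from the standard sprinkler example:

```python
import random

# Variables in topological order; each maps its tuple of parent values
# to P(variable = true).
NETWORK = [
    ("Cloudy",    (),                   {(): 0.5}),
    ("Sprinkler", ("Cloudy",),          {(True,): 0.1, (False,): 0.5}),
    ("Rain",      ("Cloudy",),          {(True,): 0.8, (False,): 0.2}),
    ("WetGrass",  ("Sprinkler", "Rain"),
     {(True, True): 0.99, (True, False): 0.9,
      (False, True): 0.9, (False, False): 0.0}),
]

def prior_sample():
    """Generate one complete event from the prior joint distribution,
    sampling each variable given its already-sampled parents."""
    event = {}
    for name, parents, cpt in NETWORK:
        p_true = cpt[tuple(event[p] for p in parents)]
        event[name] = random.random() < p_true
    return event

# Estimate P(Rain = true) as the fraction of samples with Rain = true.
N = 10000
count = sum(prior_sample()["Rain"] for _ in range(N))
print(count / N)  # should be near the true prior P(Rain = true) = 0.5
```

With these CPTs, P(Rain = true) = 0.5 × 0.8 + 0.5 × 0.2 = 0.5, so the printed fraction should hover around 0.5.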
2. Rejection Sampling in Bayesian Networks
1) Rejection sampling is a method for producing samples from a hard-to-sample distribution given an easy-to-
sample distribution.
2) It can be used to compute conditional probabilities; that is, to determine P(X | e).
3) The rejection sampling algorithm is,
function REJECTION-SAMPLING (X, e, bn, N) returns an estimate of P(X | e)
inputs: X, the query variable
e, evidence specified as an event
bn, a Bayesian network
N, the total number of samples to be generated
local variables: N[X], a vector of counts over X, initially zero
for j = 1 to N do
x ← PRIOR-SAMPLE (bn)
if x is consistent with e then N[x] ← N[x] + 1 where x is the value of X in x
return NORMALIZE (N[X])
The rejection sampling algorithm for answering queries given evidence in a Bayesian network.
• Working of algorithm:
i) It generates samples from the prior distribution specified by the network.
ii) It rejects all those samples that do not match the evidence.
iii) Finally, the estimate P(X = x | e) is obtained by counting how often X = x occurs in the remaining samples.
4) Let P̂(X | e) be the estimated distribution that the algorithm returns. From the definition of the algorithm, we
have
P̂(X | e) = α N_PS(X, e) = N_PS(X, e)/N_PS(e).
From equation (7.3.5) this becomes
P̂(X | e) ≈ P(X, e)/P(e) = P(X | e).
That is, rejection sampling produces a consistent estimate of the true probability.
5) Applying the operations of the algorithm to the network in Fig. 7.3.8 (a), let us assume that we wish to estimate
P(Rain | Sprinkler = true) using 100 samples. Of the 100 that we generate, suppose 8 have Rain = true and 19 have
Rain = false (the rest are rejected). Hence,
P(Rain | Sprinkler = true) ≈ NORMALIZE(<8, 19>) = <0.296, 0.704>.
The true answer is <0.3, 0.7>.
6) As more samples are collected, the estimate converges to the true answer. The standard deviation of the
error in each probability will be proportional to 1/√n, where n is the number of samples used in the estimate.
7) The biggest problem with rejection sampling is that it rejects so many samples! The fraction of samples
consistent with the evidence e drops exponentially as the number of evidence variables grows, so the procedure
is simply unusable for complex problems.
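A self-contained sketch of rejection sampling, assuming the standard sprinkler CPT values (the dictionary-based encoding is illustrative, not a fixed API):

```python
import random

NETWORK = [  # (name, parents, map from parent values to P(var = true))
    ("Cloudy",    (),                   {(): 0.5}),
    ("Sprinkler", ("Cloudy",),          {(True,): 0.1, (False,): 0.5}),
    ("Rain",      ("Cloudy",),          {(True,): 0.8, (False,): 0.2}),
    ("WetGrass",  ("Sprinkler", "Rain"),
     {(True, True): 0.99, (True, False): 0.9,
      (False, True): 0.9, (False, False): 0.0}),
]

def prior_sample():
    event = {}
    for name, parents, cpt in NETWORK:
        event[name] = random.random() < cpt[tuple(event[p] for p in parents)]
    return event

def rejection_sampling(query, evidence, n):
    """Estimate P(query | evidence) by sampling from the prior and
    discarding every sample inconsistent with the evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        sample = prior_sample()
        if all(sample[var] == val for var, val in evidence.items()):
            counts[sample[query]] += 1
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()} if total else None

est = rejection_sampling("Rain", {"Sprinkler": True}, 10000)
print(est)  # est[True] should approach the true value 0.3
```

Note that roughly 70 % of the generated samples are thrown away here, since P(Sprinkler = true) = 0.3 under these CPTs, which illustrates the wastefulness discussed in point 7).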
3. Likelihood Weighting in Bayesian Networks
1) Likelihood weighting avoids the inefficiency of rejection sampling by generating only events that are
consistent with the evidence e.
2) The likelihood weighting algorithm is,
function LIKELIHOOD-WEIGHTING (X, e, bn, N) returns an estimate of P(X | e)
inputs: X, the query variable
e, evidence specified as an event
bn, a Bayesian network
N, the total number of samples to be generated
local variables: W, a vector of weighted counts over X, initially zero
for j = 1 to N do
x, w ← WEIGHTED-SAMPLE (bn, e)
W[x] ← W[x] + w where x is the value of X in x
return NORMALIZE (W[X])

function WEIGHTED-SAMPLE (bn, e) returns an event and a weight
x ← an event with n elements; w ← 1
for i = 1 to n do
if Xi has a value xi in e
then w ← w × P(Xi = xi | parents(Xi))
else xi ← a random sample from P(Xi | parents(Xi))
return x, w
Note on the likelihood weighting algorithm:
i) It fixes the values for the evidence variables E and samples only the remaining variables X and Y. This
guarantees that each event generated is consistent with the evidence.
ii) Not all events are equal, however. Before tallying the counts in the distribution for the query variable, each
event is weighted by the likelihood that the event accords to the evidence, as measured by the product of the
conditional probabilities for each evidence variable, given its parents.
iii) Intuitively, events in which the actual evidence appears unlikely should be given less weight.
3) Applying the operations of the algorithm to the network in Fig. 7.3.8 (a), with the query P(Rain | Sprinkler = true,
WetGrass = true), the process goes as follows:
i) The weight w is set to 1.0
ii) Now, an event is generated.
• Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
• Sprinkler is an evidence variable with value true. Therefore, we set
w ← w × P(Sprinkler = true | Cloudy = true) = 0.1.
• Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.
• WetGrass is an evidence variable with value true. Therefore, we set
w ← w × P(WetGrass = true | Sprinkler = true, Rain = true) = 0.099.
iii) Here WEIGHTED-SAMPLE returns the event [true, true, true, true] with weight 0.099, and this is tallied
under Rain = true.
iv) The weight is low because the event describes a cloudy day, which makes the sprinkler unlikely to be on.
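The weight from the walkthrough can be verified in two lines; the factor 0.99 is the CPT entry P(WetGrass = true | Sprinkler = true, Rain = true) implied by the result 0.099:

```python
w = 1.0
w *= 0.1    # P(Sprinkler = true | Cloudy = true)
w *= 0.99   # P(WetGrass = true | Sprinkler = true, Rain = true), assumed CPT value
print(round(w, 3))  # 0.099
```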
4) How likelihood weighting works:
i) Examine the sampling distribution S_WS for WEIGHTED-SAMPLE.
ii) The evidence variables E are fixed with values e.
iii) Call the other variables Z; that is, Z = {X} ∪ Y.
iv) The algorithm samples each variable in Z, given its parent values:
S_WS(z, e) = Π(i=1..l) P(zi | parents(Zi)) ... (7.3.6)
Notice that parents(Zi) can include both hidden variables and evidence variables. Unlike the prior distribution
P(z), the distribution S_WS pays some attention to the evidence: the sampled values for each Zi will be
influenced by evidence among Zi's ancestors.
v) On the other hand, S_WS pays less attention to the evidence than does the true posterior distribution
P(z | e), because the sampled values for each Zi ignore evidence among Zi's non-ancestors.
vi) The likelihood weight w makes up for the difference between the actual and desired sampling distributions.
The weight for a given sample x, composed from z and e, is the product of the likelihoods for each evidence
variable given its parents (some or all of which may be among the Zi's):
w(z, e) = Π(i=1..m) P(ei | parents(Ei)) ... (7.3.7)
vii) Multiplying equations (7.3.6) and (7.3.7), we see that the weighted probability of a sample has the
particularly convenient form
S_WS(z, e) w(z, e) = Π(i=1..l) P(zi | parents(Zi)) × Π(i=1..m) P(ei | parents(Ei)) = P(z, e) ... (7.3.8)
because the two products cover all the variables in the network, allowing us to use equation (7.3.1) for the joint
probability.
viii) It is easy to show that likelihood weighting estimates are consistent. For any particular value x of X,
the estimated posterior probability can be calculated as follows:
P̂(x | e) = α Σy N_WS(x, y, e) w(x, y, e)   from LIKELIHOOD-WEIGHTING
≈ α' Σy S_WS(x, y, e) w(x, y, e)   for large N
= α' Σy P(x, y, e)   by equation (7.3.8)
= α' P(x, e) = P(x | e).
Hence, likelihood weighting returns consistent estimates.
5) Performance of algorithm:
i) Likelihood weighting uses all the samples generated; therefore it can be much more efficient than rejection
sampling.
ii) It will, however, suffer a degradation in performance as the number of evidence variables increases.
iii) This is because most samples will have very low weights, and hence the weighted estimate will be dominated
by the tiny fraction of samples that accord more than an infinitesimal likelihood to the evidence.
iv) The problem is exacerbated if the evidence variables occur late in the variable ordering, because then the
samples will be simulations that bear little resemblance to the reality suggested by the evidence.
4. Markov Chain Monte Carlo (MCMC) Algorithm for Inference in Bayesian Networks
Working of MCMC algorithm
1) MCMC generates each event by making a random change to the preceding event. It is therefore helpful to
think of the network as being in a particular current state specifying a value for every variable.
2) The next state is generated by randomly sampling a value for one of the non-evidence variables Xi,
conditioned on the current values of the variables in the Markov blanket of Xi.
3) MCMC therefore wanders randomly around the state space - the space of possible complete assignments -
flipping one variable at a time, but keeping the evidence variables fixed.
4) For example: Consider the query P(Rain | Sprinkler = true, WetGrass = true) applied to the network of Fig.
7.3.8 (a). The evidence variables Sprinkler and WetGrass are fixed to their observed values and the hidden
variables Cloudy and Rain are initialized randomly - let us say to true and false respectively. Thus, the initial
state is [true, true, false, true]. Now the following steps are executed repeatedly:
a) Cloudy is sampled, given the current values of its Markov blanket variables: in this case, we sample from
P(Cloudy | Sprinkler = true, Rain = false). (Shortly we will show how to calculate this distribution.) Suppose the
result is Cloudy = false. Then the new current state is [false, true, false, true].
b) Rain is sampled, given the current values of its Markov blanket variables: in this case, we sample from
P(Rain | Cloudy = false, Sprinkler = true, WetGrass = true). Suppose this yields Rain = true. The new current state
is [false, true, true, true].
5) Each state visited during this process is a sample that contributes to the estimate, for the query variable Rain.
If the process visits 20 states where Rain is true and 60 states where Rain is false, then the answer to the query
is NORMALIZE (<20, 60>) = <0.25, 0.75>).
6) The complete algorithm is shown as follows:
function MCMC-ASK (X, e, bn, N) returns an estimate of P(X | e)
local variables: N[X], a vector of counts over X, initially zero
Z, the nonevidence variables in bn
x, the current state of the network, initially copied from e
initialize x with random values for the variables in Z
for j = 1 to N do
N[x] ← N[x] + 1 where x is the value of X in x
for each Zi in Z do
sample the value of Zi in x from P(Zi | mb(Zi)), given the values of MB(Zi) in x
return NORMALIZE (N[X])
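A compact sketch of Gibbs sampling for this particular query, assuming the standard sprinkler CPT values. The Markov-blanket distributions are computed by normalizing the products described in the text; tallying once per sweep (rather than once per variable flip) is a simplification of MCMC-ASK:

```python
import random

# CPTs for the sprinkler network (standard example values).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                    # P(Sprinkler=true | Cloudy)
P_R = {True: 0.8, False: 0.2}                    # P(Rain=true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.9,   # P(WetGrass=true | S, R)
       (False, True): 0.9, (False, False): 0.0}

def gibbs_ask_rain(n, sprinkler=True, wetgrass=True):
    """Estimate P(Rain = true | Sprinkler, WetGrass) by Gibbs sampling
    over the two hidden variables Cloudy and Rain."""
    cloudy = random.random() < 0.5   # random initial state
    rain = random.random() < 0.5
    count_true = 0
    for _ in range(n):
        # Sample Cloudy given its Markov blanket (Sprinkler, Rain):
        # P(c | s, r) is proportional to P(c) P(s | c) P(r | c).
        w = {}
        for c in (True, False):
            pc = P_C if c else 1 - P_C
            ps = P_S[c] if sprinkler else 1 - P_S[c]
            pr = P_R[c] if rain else 1 - P_R[c]
            w[c] = pc * ps * pr
        cloudy = random.random() < w[True] / (w[True] + w[False])
        # Sample Rain given its Markov blanket (Cloudy, Sprinkler, WetGrass):
        # P(r | c, s, w) is proportional to P(r | c) P(w | s, r).
        w = {}
        for r in (True, False):
            pr = P_R[cloudy] if r else 1 - P_R[cloudy]
            pw = P_W[(sprinkler, r)] if wetgrass else 1 - P_W[(sprinkler, r)]
            w[r] = pr * pw
        rain = random.random() < w[True] / (w[True] + w[False])
        count_true += rain
    return count_true / n

est = gibbs_ask_rain(20000)
print(est)  # approaches the true posterior, ~0.32 under these CPTs
```

With these CPT values the exact posterior works out to 0.0891/0.2781 ≈ 0.32, which the long-run fraction of Rain = true states should approach.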
Why MCMC works
1) It should be noticed that the sampling process settles into a "dynamic equilibrium" in which the long-run
fraction of time spent in each state is exactly proportional to its posterior probability.
2) This remarkable property follows from the specific transition probability with which the process moves from
one state to another, as defined by the conditional distribution given the Markov blanket of the variable being
sampled.
3) Let q(X → X') be the probability that the process makes a transition from state X to X'. This transition
probability defines what is called a Markov chain on the state space.
4) Now suppose that we run the Markov chain for t steps, and let πt(X) be the probability that the system is in
state X at time t.
- Similarly, let πt+1(X') be the probability of being in state X' at time t + 1.
- Given πt(X), we can calculate πt+1(X') by summing, over all states the system could be in at time t, the
probability of being in that state times the probability of making the transition to X':
πt+1(X') = ΣX πt(X) q(X → X')
5) We will say that the chain has reached its stationary distribution if πt = πt+1. Call this stationary distribution
π; its defining equation is therefore
π(X') = ΣX π(X) q(X → X') for all X'. ..... (7.3.9)
Under certain standard assumptions about the transition probability distribution q, there is exactly one distribution
π satisfying this equation for any given q.
6) Equation (7.3.9) can be read as saying that the expected "outflow" from each state (i.e., its current
"population") is equal to the expected "inflow" from all the states.
One way to satisfy this relationship is if the expected flow between any pair of states is the same in both directions.
This is the property of detailed balance:
π(X) q(X → X') = π(X') q(X' → X) for all X, X'. ...... (7.3.10)
7) It can be shown that detailed balance implies stationarity simply by summing over X in equation (7.3.10). We
have
ΣX π(X) q(X → X') = ΣX π(X') q(X' → X) = π(X') ΣX q(X' → X) = π(X'),
where the last step follows because a transition is guaranteed to occur from X'.
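The stationarity argument can be checked numerically on a tiny two-state chain: choose any π and any kernel q satisfying detailed balance, and confirm that the inflow ΣX π(X) q(X → X') equals π(X') for every state. The numbers below are an arbitrary illustrative choice:

```python
# Two states A and B with a chosen stationary distribution pi.
pi = {"A": 0.25, "B": 0.75}
# A transition kernel chosen to satisfy detailed balance:
# pi(A) q(A->B) = 0.25 * 0.6 = 0.15 = 0.75 * 0.2 = pi(B) q(B->A).
q = {("A", "A"): 0.4, ("A", "B"): 0.6,
     ("B", "A"): 0.2, ("B", "B"): 0.8}

for target in ("A", "B"):
    inflow = sum(pi[s] * q[(s, target)] for s in ("A", "B"))
    assert abs(inflow - pi[target]) < 1e-12  # pi is stationary under q
```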
8) We now show that the transition probability q(X → X') defined by the sampling step in MCMC-ASK satisfies
the detailed balance equation with a stationary distribution π equal to P(X | e) (the true posterior distribution on
the hidden variables).
This is done in two steps:
a) First, we will define a Markov chain in which each variable is sampled conditionally on the current values of
all the other variables, and we will show that this satisfies detailed balance.
b) Then we will simply observe that, for Bayesian networks, doing that is equivalent to sampling conditionally on
the variable's Markov blanket.
• Let Xi be the variable to be sampled, and let X̄i be all the hidden variables other than Xi. Their values in the
current state are xi and x̄i. If we sample a new value x'i for Xi conditionally on all the other variables, including the
evidence, we have
q(X → X') = q((xi, x̄i) → (x'i, x̄i)) = P(x'i | x̄i, e).
• This transition probability is called the Gibbs sampler and is a particularly convenient form of MCMC. We now
show that the Gibbs sampler is in detailed balance with the true posterior:
π(X) q(X → X') = P(X | e) P(x'i | x̄i, e) = P(xi, x̄i | e) P(x'i | x̄i, e)
= P(xi | x̄i, e) P(x̄i | e) P(x'i | x̄i, e)   (using the chain rule on the first term)
= P(xi | x̄i, e) P(x'i, x̄i | e) = q(X' → X) π(X').
• Because a variable is independent of all other variables given its Markov blanket, sampling conditionally on X̄i
is equivalent to sampling from P(x'i | mb(Xi)). Hence, to flip each variable Xi, the number of multiplications
required is equal to the number of Xi's children.
9) We have discussed here a simple variant of MCMC, namely the Gibbs sampler. In its most general form,
MCMC is a powerful method for computing with probability models, and many variants have been developed,
including the simulated annealing algorithm presented earlier.
5.8.Causal Networks:
More precisely, each edge in a causal network is a directed edge leading from the past event
to the future event.
Some causal networks are independent of the choice of evolution, and these
are called causally invariant.
A causal relationship exists when one variable in a data set has a direct influence on
another variable. Thus, one event triggers the occurrence of another event. A causal
relationship is also referred to as cause and effect.
Often, in the absence of randomised control trials, there is a need for causal inference
purely from observational data. However, in this case the well-known fact that
'correlation does not imply causation' forces us to distinguish between events that cause
specific outcomes and those that merely correlate with them.
One possible explanation for correlation between variables where neither causes the
other is the presence of confounding variables that influence both the target and a
driver of that target. Unobserved confounding variables are severe threats when doing
causal inference on observational data.
A causal generalization, e.g., that smoking causes lung cancer, is not about any
particular smoker but states that a special relationship exists between the property of
smoking and the property of getting lung cancer. As a causal statement, this says more
than that there is a correlation between the two properties.
Looking for special circumstances: what was the cause of the fire? Oxygen? or an
arsonist's match?
Causes are sometimes said to be INUS conditions in that they are Insufficient but
Necessary parts of an unnecessary but sufficient set of conditions for the effect.
Striking a match may be said to be a cause of its lighting.
Suppose there is some set of conditions that is sufficient for a match's lighting. This
might include the presence of oxygen, the appropriate chemicals in the matchhead and
the striking.
The striking can be said to be a necessary part of this set (though insufficient by itself)
because without the striking among those other conditions the match would not have
lit. But the set itself, though sufficient, is not necessary because other sets of conditions
could have produced the lighting of the match.
But the performance of the model does not itself tell us that X causes Y in the real
world. There are other possible configurations that will produce a correlation between
X and Y. For instance, both X and Y may themselves have a common cause C without
X being otherwise related to Y.
But an intervention to start the school year in mid-winter will not result in leaves
changing color. There is a common cause for the school year and colorful foliage that
produces the relationship: the end of summer.
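This common-cause point can be illustrated with a toy simulation (all variable names and coefficients here are invented for illustration): C drives both X and Y, so X and Y are correlated observationally, yet intervening on X leaves Y unchanged:

```python
import random

def observe(n):
    """Observational data: C is a common cause of both X and Y."""
    data = []
    for _ in range(n):
        c = random.gauss(0, 1)               # confounder (e.g. end of summer)
        x = c + 0.1 * random.gauss(0, 1)     # e.g. school year starting
        y = c + 0.1 * random.gauss(0, 1)     # e.g. leaves changing color
        data.append((x, y))
    return data

def intervene(n, x_value):
    """Intervention do(X = x_value): X is set externally, so the edge
    from C to X is cut, but Y still follows C only."""
    return [(x_value, random.gauss(0, 1) + 0.1 * random.gauss(0, 1))
            for _ in range(n)]

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

# Observationally X and Y are strongly correlated...
print(corr(observe(10000)))                               # close to 1
# ...but forcing X to a large value does not move Y's mean.
print(sum(y for _, y in intervene(10000, 5.0)) / 10000)   # close to 0
```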
Structural Causal Models (SCMs)
Structural causal models represent causal dependencies using graphical models that
provide an intuitive visualisation by representing variables as nodes and relationships
between variables as edges in a graph.
Graphical models serve as a language for structuring and visualising knowledge about
the world and can incorporate both data-driven and human inputs.
SCMs have had a transformative impact on multiple data-intensive disciplines (e.g.
epidemiology, economics, etc.), enabling the codification of existing knowledge
in diagrammatic and algebraic forms and consequently leveraging data to estimate the
answers to interventional and counterfactual questions.
Bayesian Networks are one of the most widely used SCMs and are at the core of this library.