Forward Backward Chaining
Forward Chaining is one of the two main methods of reasoning when using an inference engine and
can be described logically as repeated application of modus ponens. Forward chaining is a popular
implementation strategy for expert systems, business and production rule systems. Forward chaining
starts with the available data and uses inference rules to extract more data (from an end user, for
example) until a goal is reached. An inference engine using forward chaining searches the inference
rules until it finds one where the antecedent (If clause) is known to be true. When such a rule is
found, the engine can conclude, or infer, the consequent (Then clause), resulting in the addition of
new information to its data.
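The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production inference engine: the rule base, the fact names, and the function name forward_chain are all hypothetical, and rules are assumed to be simple (set of antecedents, consequent) pairs.

```python
def forward_chain(facts, rules):
    """Apply rules until no new fact can be inferred; return all known facts."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Modus ponens: if every antecedent is known, infer the consequent.
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rule base: (set of antecedents, consequent).
RULES = [
    ({"croaks", "eats flies"}, "frog"),
    ({"chirps", "sings"}, "canary"),
    ({"frog"}, "green"),
    ({"canary"}, "yellow"),
]

derived = forward_chain({"croaks", "eats flies"}, RULES)
print(sorted(derived))  # ['croaks', 'eats flies', 'frog', 'green']
```

Starting from the two observed facts, the engine first infers "frog" and then, on the next pass, "green"; the canary rules never fire because their antecedents are not known.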
Backward Chaining:
Backward chaining (or backward reasoning) is an inference method that can be described (in lay
terms) as working backward from the goal(s). It is used in automated theorem provers, inference
engines, proof assistants and other artificial intelligence applications.
In game theory, its application to (simpler) subgames in order to find a solution to the game is
called backward induction.
Backward chaining is implemented in logic programming by SLD resolution. Both forward and
backward chaining are based on the modus ponens inference rule. Backward chaining is one of the
two most commonly used methods of reasoning with inference rules and logical implications – the
other is forward chaining. Backward chaining systems usually employ a depth-first search strategy.
Prolog has backward chaining built in.
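Working backward from a goal can be sketched as a recursive, depth-first search. The example below is only an illustration under simple assumptions: rules are propositional (set of antecedents, consequent) pairs, and the rule base and the function name backward_chain are hypothetical. It does not do the unification that SLD resolution in Prolog performs.

```python
def backward_chain(goal, facts, rules, seen=frozenset()):
    """Return True if goal can be proved from facts using rules (depth-first)."""
    if goal in facts:
        return True
    if goal in seen:
        # The goal is already being proved higher up: stop to avoid a loop.
        return False
    for premises, conclusion in rules:
        # Pick a rule that concludes the goal, then prove each antecedent.
        if conclusion == goal and all(
            backward_chain(p, facts, rules, seen | {goal}) for p in premises
        ):
            return True
    return False


# Hypothetical rule base: (set of antecedents, consequent).
RULES = [
    ({"croaks", "eats flies"}, "frog"),
    ({"chirps", "sings"}, "canary"),
    ({"frog"}, "green"),
    ({"canary"}, "yellow"),
]
FACTS = {"croaks", "eats flies"}

print(backward_chain("green", FACTS, RULES))   # True
print(backward_chain("yellow", FACTS, RULES))  # False
```

To prove "green" the search reduces the goal to "frog", then to "croaks" and "eats flies", both of which are known facts. Only rules relevant to the goal are ever examined.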
Bayesian Network
A Bayesian Network is composed of nodes, where the nodes correspond to events that you might or
might not know. They’re typically called random variables, which may be discrete or continuous.
These nodes are connected by arrows, and if there is an arrow from X to Y, X is said to be parent to
Y. Each node Xi has a conditional probability distribution P(Xi|Parents(Xi)). Bayes Networks
define the probability distribution over graphs of random variables.
A joint distribution over n binary variables has 2^n possible combinations of values, so a table
listing every combination grows exponentially with n. This is where the Bayesian Network is key. A
Bayesian Network's advantage is that it represents such a very large Joint Probability Distribution
(JPD) far more compactly than an unstructured representation (one with no graph structure). To
clarify, the JPD gives the probability of every possible event, where an event is a combination of
values of all the variables.
Quick intro to probability
P(A) – probability of event A
P(A’) = 1 – P(A) – Complementary probability of P(A)
P(A ∩ B) – Probability of events A and B
P(A ∪ B) – Probability of events A or B
P(A|B) – Probability of event A given event B occurred.
A⟂B – A and B are independent of each other. If A⟂B, then P(A,B) = P(A)*P(B), where P(A,B)
is another way of writing P(A ∩ B).
Perhaps the most important rule in AI is the Bayes Rule, named after Thomas Bayes, a British
mathematician. Bayes Rule is stated as follows:
P(A|B) = P(B|A)*P(A) / P(B)
The rule is useful when we know how to calculate the probability of B given A, but what we
need is the probability of A given B.
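A short numeric sketch may help here. It applies Bayes Rule in its usual form, P(A|B) = P(B|A)*P(A) / P(B), to made-up numbers: A is "has the condition" and B is "test is positive". The three input probabilities are assumptions chosen for illustration only.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes Rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Assumed, illustrative numbers (not taken from the text).
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.90      # P(B|A): positive test given the condition
p_b_given_not_a = 0.05  # P(B|A'): positive test without the condition

# Total probability of B, using P(A') = 1 - P(A).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes(p_b_given_a, p_a, p_b)
print(round(posterior, 3))  # 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior P(A) is small. P(B|A) alone does not tell us this, which is why the rule is needed.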
In the example network, arrows run from A and B into C, and from C into D and E. A and B
depend only on their own variable, so their distributions are P(A) and P(B), since there are no
arrows (connections) coming into them. C, on the other hand, is conditioned on A and B, so we
have P(C|A,B). D and E are conditioned on C, so we have P(D|C) and P(E|C).
This gives us the joint probability, represented by a Bayes Network. The joint probability is the
product of various Bayes Network probabilities that are defined over the individual nodes, where
each node’s probability is only conditioned on the incoming arrows.
P(A,B,C,D,E) = P(A)*P(B)*P(C|A,B)*P(D|C)*P(E|C)
• A and B have no incoming arrows, so they have the probability distributions P(A) and P(B).
• C has two incoming arrows, so its probability is conditioned on A and B, giving us P(C|A,B).
• D and E are both conditioned on C, giving us P(D|C) and P(E|C).
So, the definition of this setup for the joint distribution, P(A,B,C,D,E), is based on the factors
above, and gives us one really big advantage. The joint distribution over any five binary random
variables requires 2^5-1=31 probability values, while our Bayes network only requires 10: one each
for P(A) and P(B), four for P(C|A,B) (one per combination of A and B), and two each for P(D|C)
and P(E|C).
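The count can be checked mechanically. The sketch below assumes every variable is binary, so a node with k parents needs 2^k numbers (one per combination of parent values); the dictionary and function names are illustrative.

```python
# Parents of each node in the five-node example network.
PARENTS = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}


def network_parameter_count(parents):
    """Numbers needed by the network: 2^k per binary node with k parents."""
    return sum(2 ** len(p) for p in parents.values())


def full_joint_parameter_count(n):
    """Numbers needed by a full joint table over n binary variables."""
    return 2 ** n - 1


print(network_parameter_count(PARENTS))          # 10
print(full_joint_parameter_count(len(PARENTS)))  # 31
```

The gap widens quickly: the full table doubles with every added variable, while the network grows only with the number of parents per node.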
Example: I'm at work and my neighbor John calls to say my alarm is ringing. Sometimes the alarm
is set off by minor earthquakes. Is there a burglar?
• John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the
alarm.
• Mary likes rather loud music and sometimes misses the alarm.
Network topology reflects "causal" knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
P(Burglary, Earthquake, Alarm, JohnCalls, MaryCalls) =
P(Burglary)*P(Earthquake)*P(Alarm|Burglary,Earthquake)*P(JohnCalls|Alarm)*P(MaryCalls|Alarm)
Here we are looking at the JPD of the alarm network: the alarm may be set off by a burglary
and/or an earthquake, and John and Mary may each call when the alarm sounds.
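To answer "Is there a burglar?" we can sum the joint probability over the variables we did not observe. The sketch below does this by brute-force enumeration. The text gives no numbers for this network, so every probability value below is an assumption chosen for illustration, and the variable and function names are hypothetical.

```python
from itertools import product

# Assumed, illustrative conditional probability tables (not from the text).
P_B = 0.001                       # P(Burglary)
P_E = 0.002                       # P(Earthquake)
P_A = {                           # P(Alarm | Burglary, Earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

BOOLS = [True, False]


def prob(p_true, value):
    """Probability of a binary value, given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """Product of the five factors of the alarm network."""
    return (prob(P_B, b) * prob(P_E, e) * prob(P_A[(b, e)], a)
            * prob(P_J[a], j) * prob(P_M[a], m))


# The joint distribution must sum to 1 over all 2^5 assignments.
total = sum(joint(*values) for values in product(BOOLS, repeat=5))

# P(Burglary | JohnCalls, MaryCalls): sum out Earthquake and Alarm.
numerator = sum(joint(True, e, a, True, True)
                for e, a in product(BOOLS, repeat=2))
evidence = sum(joint(b, e, a, True, True)
               for b, e, a in product(BOOLS, repeat=3))
p_burglary = numerator / evidence
print(round(p_burglary, 3))  # 0.284
```

With these assumed numbers, both neighbors calling raises the probability of a burglary from 0.001 to about 0.28. Enumeration is exponential in the number of hidden variables, so it is only practical for small networks like this one.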
A concrete example
Consider two friends, Alice and Bob, who live far apart from each other and who talk together daily
over the telephone about what they did that day. Bob is only interested in three activities: walking in
the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively
by the weather on a given day. Alice has no definite information about the weather, but she knows
general trends. Based on what Bob tells her he did each day, Alice tries to guess what the weather
must have been like.
Alice believes that the weather operates as a discrete Markov chain. There are two states, "Rainy"
and "Sunny", but she cannot observe them directly, that is, they are hidden from her. On each day,
there is a certain chance that Bob will perform one of the following activities, depending on the
weather: "walk", "shop", or "clean". Since Bob tells Alice about his activities, those are the
observations. The entire system is that of a hidden Markov model (HMM).
Alice knows the general weather trends in the area, and what Bob likes to do on average. In other
words, the parameters of the HMM are known. They can be represented as follows in Python:
# Hidden states of the weather Markov chain
states = ('Rainy', 'Sunny')

# P(tomorrow's weather | today's weather)
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}

# P(Bob's activity | today's weather)
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
A start_probability table, which the listing above does not include, represents Alice's belief
about which state the HMM is in when Bob first calls her (all she knows is that it tends to be rainy
on average). That belief need not be the equilibrium distribution, which is (given the transition
probabilities) approximately {'Rainy': 0.57, 'Sunny': 0.43}. The transition_probability
represents the change of the weather in the underlying Markov chain. In this example, there is only
a 30% chance that tomorrow will be sunny if today is rainy. The emission_probability represents
how likely Bob is to perform a certain activity on each day. If it is rainy, there is a 50% chance
that he is cleaning his apartment; if it is sunny, there is a 60% chance that he is outside for a walk.
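The equilibrium figures can be reproduced from the transition probabilities alone. The sketch below repeatedly pushes a distribution through the chain until it stops changing (power iteration). The function name, the uniform starting point, and the iteration count are illustrative choices.

```python
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}


def stationary(transition, iterations=200):
    """Approximate the equilibrium distribution by repeated transition steps."""
    states = list(transition)
    dist = {s: 1.0 / len(states) for s in states}  # any starting point works here
    for _ in range(iterations):
        # One day forward: P(tomorrow = t) = sum over s of P(today = s) * P(t | s)
        dist = {t: sum(dist[s] * transition[s][t] for s in states)
                for t in states}
    return dist


equilibrium = stationary(transition_probability)
print({s: round(p, 2) for s, p in equilibrium.items()})
# {'Rainy': 0.57, 'Sunny': 0.43}
```

The exact values are 4/7 for Rainy and 3/7 for Sunny, which can also be found by solving pi = pi * P by hand.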
Note that, in the above model (and also the one below), the prior distribution of the initial state is
not specified. Typical learning models correspond to assuming a discrete uniform distribution over
possible states (i.e. no particular prior distribution is assumed).
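Alice's guessing problem can be sketched with the Viterbi algorithm, which finds the single most likely sequence of hidden states for a sequence of observations. The text does not name this algorithm, so treat it as one possible approach. The uniform start_probability follows the note above about assuming a uniform distribution over states, and the three-day observation sequence is made up for illustration.

```python
states = ('Rainy', 'Sunny')
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
# Assumption: uniform prior over the initial state.
start_probability = {'Rainy': 0.5, 'Sunny': 0.5}
# Assumption: a made-up sequence of what Bob reports on three days.
observations = ('walk', 'shop', 'clean')


def viterbi(obs_seq, states, start_p, trans_p, emit_p):
    """Return (most likely state sequence, its joint probability)."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][obs_seq[0]], None) for s in states}]
    for obs in obs_seq[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Best way to arrive in state s from any previous state r.
            p, r = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s][obs], r)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path_prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, path_prob


path, path_prob = viterbi(observations, states, start_probability,
                          transition_probability, emission_probability)
print(path)  # ['Sunny', 'Rainy', 'Rainy']
```

With these inputs Alice would guess that the first day was sunny (Bob walked) and the next two were rainy (he shopped, then cleaned). A different prior or observation sequence can change the answer.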