AI Unit 3
TOPICS
Probabilistic Reasoning,
Probability,
Conditional probability,
Bayes’ Rule,
Bayesian Networks - Representation, Construction, Inference,
Temporal model,
Hidden Markov model.
UNCERTAINTY
Most intelligent systems have some degree of uncertainty associated with them. Uncertainty in
systems arises primarily because of problems with data:
• A major cause is missing or unavailable data, or data that is present but unreliable or ambiguous
due to errors in measurement, the presence of multiple conflicting measurements, and so on.
• It is not always possible to represent data in a precise and consistent manner.
• Data is often based on defaults, which may have exceptions, leading to errors in
intelligent systems.
Uncertainty may also be caused by the representation of knowledge since it might represent only
the best guesses of the expert based on observation or statistical conclusions, which may not be
appropriate in all situations. Because of these numerous sources of error, it is imperative that
uncertainty management is incorporated in intelligent systems.
1. PROBABILITY THEORY
Probability theory is designed to estimate the degree of uncertainty. It is the oldest method with a
strong mathematical basis.
The term probability is defined as a way of turning an opinion or an expectation into a number lying
between 0 and 1.
It basically reflects the likelihood of an event, or the chance that a particular event will occur.
Assume a set S (known as a sample space) consisting of all possible outcomes of a random
experiment.
If S represents a sample space and A and B represent events, then the following axioms hold true,
where A' represents the complement of set A.
P(A) ≥0
P(S)=1
P(A') = 1-P(A)
P(AUB) = P(A)+P(B), if events A and B are mutually exclusive.
P(AUB) = P(A)+P(B)-P(A∩B), if A and B are not mutually exclusive. This is called the
addition rule of probability.
In general, if events A1, A2, . . . , AN in S are mutually exclusive, then we can write
P(A1 ∪ A2 ∪ . . . ∪ AN) = P(A1) + . . . + P(AN)
Consider another example where the joint probability distribution of two variables A and B is
given in the table below.

Joint probabilities      A        A'
B                        0.20     0.12
B'                       0.65     0.03

That is, P(A and B) = 0.20, P(A and B') = 0.65, P(A' and B) = 0.12, P(A' and B') = 0.03.
From this data we can compute P(A) and P(B) as P(A) = P (A and B) + P (A and B')
= 0.20+0.65=0.85
P(B) = P (A and B) + P (A' and B)
= 0.20+0.12=0.32
We can compute the probability of any logical combination of A and B as follows
P (A or B) = P(A)+P(B)-P(A∩B)
=P(A and B)+ P(A and B')+ P(A and B)+ P(A' and B)-P(A and B)
=P(A and B)+ P(A and B')+ P(A' and B)
=0.20+0.65+0.12
=0.97
Alternatively, we can compute P (A or B) as follows
P (A or B) = 1- P((A or B) ' )
=1-P(A ' and B ')
=1-0.03=0.97
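These manipulations are easy to check with a few lines of Python; the sketch below simply encodes the joint table above and recomputes the marginals and P(A or B) (the variable names are chosen here for illustration).

# Joint distribution of two Boolean variables A and B, taken from the table above.
joint = {
    (True, True): 0.20,    # P(A and B)
    (True, False): 0.65,   # P(A and B')
    (False, True): 0.12,   # P(A' and B)
    (False, False): 0.03,  # P(A' and B')
}

# Marginals are obtained by summing out the other variable.
p_a = sum(p for (a, b), p in joint.items() if a)   # 0.85
p_b = sum(p for (a, b), p in joint.items() if b)   # 0.32

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = p_a + p_b - joint[(True, True)]         # 0.97

# Equivalently, P(A or B) = 1 - P(A' and B')
assert abs(p_a_or_b - (1 - joint[(False, False)])) < 1e-9

print(p_a, p_b, p_a_or_b)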
3. CONDITIONAL PROBABILITY
When the probability of the statement is based on a piece of evidence, this type of probability is
called Conditional Probability. The concept of Conditional Probability relates the probability of
one event to the occurrence of another. It is defined as the probability of the occurrence of an
event H (hypothesis) provided an event E (evidence) is known to have occurred. It is denoted by
P(H|E):

P(H | E) = (Number of events favourable to H which are also favourable to E) / (Number of events favourable to E)

P(H | E) = P(H and E) / P(E)
Result 1: If events H and E defined in the sample space S of a random experiment are independent,
then P(H|E) = P(H).
Proof: P(H|E) = P(H and E) / P(E)      (by definition)
             = P(H) * P(E) / P(E)      (since H and E are independent)
             = P(H)
Similarly, P(E|H) = P(E).
For example, if P(H and E) = 0.002 and P(E) = 0.005, then P(H | E) = 0.002 / 0.005 = 0.4.
It must be noted that if the sun is bright today, then the probability of the sun being bright
tomorrow is high; these two events are not independent. All possible joint probabilities of these two
variables can be computed. Let A represent "sun is bright today" and B represent "sun will be bright
tomorrow".
All four situations can be represented as shown in the figure; using the paths, we can compute the
joint probabilities appropriately.
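Since the figure with the actual numbers is not reproduced here, the following Python sketch uses assumed values purely for illustration: a prior P(A) and the conditional probabilities of B given A. It shows how the four joint probabilities along the paths are computed.

# Assumed (illustrative) values -- the real numbers are in the figure.
p_a = 0.7               # P(A): sun is bright today
p_b_given_a = 0.8       # P(B | A): bright tomorrow given bright today
p_b_given_not_a = 0.3   # P(B | A')

# Joint probabilities along each path: P(A and B) = P(A) * P(B | A), and so on.
joint = {
    ("A", "B"):   p_a * p_b_given_a,                  # 0.56
    ("A", "B'"):  p_a * (1 - p_b_given_a),            # 0.14
    ("A'", "B"):  (1 - p_a) * p_b_given_not_a,        # 0.09
    ("A'", "B'"): (1 - p_a) * (1 - p_b_given_not_a),  # 0.21
}

assert abs(sum(joint.values()) - 1.0) < 1e-9
print(joint)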
4. BAYES’ THEOREM
Thomas Bayes proposed Bayes’ theorem in the year 1763 for two events using the concept of
conditional probability. This theorem provides a mathematical model for reasoning where prior
beliefs are combined with evidence to get estimates of uncertainty.
It relates the conditional probability and probabilities of events H and E. The basic idea is
to compute P(H|E) that represents the probability assigned to H after taking into account the new
piece of evidence E.
This approach relies on the concept that one should incorporate the prior probability of an
event into the interpretation of a new situation.
Bayes' theorem relates the conditional probabilities of events, which allows us to express P(H|E) in
terms of the probabilities P(E|H), P(E) and P(H). H denotes the hypothesis, while E represents a new
piece of evidence.
P(H | E) = P(E | H) * P(H) / P(E)
Similarly, the probability P(H|E) can be expressed in terms of P(E|H), P(E|~H) and P(H). Here
P(E|H) and P(E|~H) are two conditional probabilities which indicate how probable our piece of
evidence is depending on whether our theory is true or not.
P(H|E) can then be expressed as
P(H | E) = P(E | H) * P(H) / [P(E | H) * P(H) + P(E | ~H) * P(~H)]
P(H) is called the prior probability of H. It is called prior because it does not consider any
information regarding E. P(H|E) is known as the conditional probability of H given E. It is also called
the posterior probability because it is derived from, or depends on, the specified value of E. P(E|H) is
known as the conditional probability of E given H. P(E) is the prior probability of E and acts as a
normalizing constant.
Bayesian probability theory was used by the PROSPECTOR expert system for locating
deposits of various minerals from geological evidence at a particular location. Some expert systems
use Bayesian theory to derive further conclusions.
An 'if-then' rule in a rule-based system can incorporate a probability, as in 'If X is true then Y can
be concluded with probability p'. For example: IF the patient has a cold, THEN the patient will sneeze
(0.75). The probability 0.75 is stated at the end of the rule.
Example: We want to find the probability that Mike has a cold, given that he was observed sneezing.
Solution:
The prior probability of the hypothesis that Mike has a cold is 0.25: P(H) = 0.25.
The probability that Mike was observed sneezing when he had a cold in the past is 0.9:
P(Mike was observed sneezing | Mike has a cold) = P(E|H) = 0.90.
The probability that Mike was observed sneezing when he did not have a cold is 0.20:
P(Mike was observed sneezing | Mike does not have a cold) = P(E|~H) = 0.20.
We have to find the probability of Mike having a cold when he was observed sneezing, that is,
P(Mike has a cold | Mike was observed sneezing) = P(H|E).
Using the formula

P(H | E) = P(E | H) * P(H) / [P(E | H) * P(H) + P(E | ~H) * P(~H)]
         = (0.90 * 0.25) / (0.90 * 0.25 + 0.20 * 0.75)
         = 0.225 / 0.375 = 0.6

The probability of Mike having a cold given that he sneezes is therefore 0.6.
Similarly, we can compute his probability of having a cold if he was not sneezing:

P(H | ~E) = [P(~E | H) * P(H)] / P(~E)
          = [(1 - 0.9) * 0.25] / [1 - 0.375]
          = 0.025 / 0.625 = 0.04
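The same calculation can be packaged as a small Python function (a sketch; the function name is ours), using the numbers from the example above.

def bayes(p_h, p_e_given_h, p_e_given_not_h):
    """Posterior P(H|E) via Bayes' theorem with the total-probability denominator."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e

# Mike example: P(H) = 0.25, P(E|H) = 0.90, P(E|~H) = 0.20
p_cold_given_sneeze = bayes(0.25, 0.90, 0.20)             # 0.6
p_cold_given_no_sneeze = bayes(0.25, 1 - 0.90, 1 - 0.20)  # 0.04

print(p_cold_given_sneeze, p_cold_given_no_sneeze)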
Bayes' Theorem can be extended to include more than two events:
• One hypothesis and two evidences
• One hypothesis and multiple evidences
• Chained evidence
• Multiple hypotheses and a single evidence
• Multiple hypotheses and multiple evidences
We can derive different forms of the formula using Bayes’ theorem and conditional probability to
compute P(H|E1 and E2).
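One common form assumes that the evidences are conditionally independent given the hypothesis (and given its negation), in which case P(H | E1 and E2) is proportional to P(E1|H) P(E2|H) P(H). The sketch below, with purely illustrative numbers, shows this normalized form; the independence assumption is ours and does not hold in every problem.

def posterior_two_evidences(p_h, p_e1_h, p_e2_h, p_e1_nh, p_e2_nh):
    """P(H | E1 and E2), assuming E1 and E2 are conditionally independent given H and given ~H."""
    num = p_e1_h * p_e2_h * p_h
    den = num + p_e1_nh * p_e2_nh * (1 - p_h)
    return num / den

# Illustrative numbers only.
print(posterior_two_evidences(0.25, 0.9, 0.8, 0.2, 0.3))   # 0.8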
5. BAYESIAN NETWORKS
A Bayesian network is a directed acyclic graph in which each node represents a random variable and
is annotated with a local conditional probability distribution P(Xi | parent_nodes(Xi)). The full joint
distribution over all the variables is the product of these local distributions:

P(X1, . . . , Xn) = Π_i P(Xi | parent_nodes(Xi))          (Equation 1)

This expression is a reduction of the joint probability formula for n variables, since some of the terms
corresponding to independent variables are not required.
If node Xi has no parents, its local probability distribution is said to be unconditional and is written
as P(Xi) instead of P(Xi | parent_nodes(Xi)).
Nodes having parents are called conditional nodes.
If the value of a node is observed, then the node is said to be an evidence node.
Nodes with no children are termed hypothesis nodes, while nodes with no parents are called
independent nodes.
As a small example, consider a network with parent nodes A and B and a child node C, with prior
probabilities P(A) = 0.3 and P(B) = 0.6 and the following conditional probability table for C:

A    B    P(C | A, B)
T    T    0.7
T    F    0.4
F    T    0.2
F    F    0.01
Consider an alarm system installed in a house that can be triggered by three events, namely
earthquake (E), burglary (B) and tornado (D). The prior probabilities P(E) and P(B) are given in the
figure, together with the conditional probability table for the alarm A:

B    E    D    P(A | B, E, D)
T    T    T    1.0
T    T    F    0.9
T    F    T    0.95
T    F    F    0.85
F    T    T    0.89
F    T    F    0.7
F    F    T    0.87
F    F    F    0.3
The network structure shows that burglary and earthquakes directly affect the probability of the
alarm’s going off, but whether John and Mary call depends only on the alarm.
The network thus represents our assumptions that they do not perceive burglaries directly, they do
not notice minor earthquakes, and they do not confer before calling.
The conditional distributions in Figure 14.2 are shown as a conditional probability table, or CPT.
Each row in a CPT contains the conditional probability of each node value for a conditioning case.
A conditioning case is just a possible combination of values for the parent nodes—a miniature
possible world.
Each row must sum to 1, because the entries represent an exhaustive set of cases for the variable.
For Boolean variables, once you know that the probability of a true value is p, the probability of
false must be 1 – p.
In general, a table for a Boolean variable with k Boolean parents contains 2^k independently
specifiable probabilities. A node with no parents has only one row, representing the prior
probabilities of each possible value of the variable.
Notice that the network does not have nodes corresponding to Mary’s currently listening to loud
music or to the telephone ringing and confusing John. These factors are summarized in the
uncertainty associated with the links from Alarm to JohnCalls and MaryCalls. This shows both
laziness and ignorance in operation.
The next step is to explain how to construct a Bayesian network in such a way that the resulting
joint distribution is a good representation of a given domain. The Equation implies certain
conditional independence relationships that can be used to guide the knowledge engineer in
constructing the topology of the network.
First, we rewrite the entries in the joint distribution in terms of conditional probabilities, using the
product rule:

P(x1, . . . , xn) = P(xn | xn-1, . . . , x1) P(xn-1 | xn-2, . . . , x1) · · · P(x2 | x1) P(x1)

This identity is called the chain rule. It holds for any set of random variables. Comparing it with
Equation 1, we see that the specification of the joint distribution is equivalent to the general
assertion that, for every variable Xi in the network,
P(Xi | Xi-1, . . . , X1) = P(Xi | parent_nodes(Xi))          (Equation 2)

provided that
Parents(Xi) ⊆ {Xi−1, . . .,X1}. This last condition is satisfied by numbering the nodes in a way that
is consistent with the partial order implicit in the graph structure.
What Equation 2 says is that the Bayesian network is a correct representation of the domain only if
each node is conditionally independent of its other predecessors in the node ordering, given its
parents. We can satisfy this condition with this methodology:
Nodes
First determine the set of variables that are required to model the domain. Now order
them, {X1, . . . ,Xn}. Any order will work, but the resulting network will be more compact if
the variables are ordered such that causes precede effects.
Links
For i = 1 to n do:
• Choose, from X1, . . . ,Xi−1, a minimal set of parents for Xi, such that Equation 2 is
satisfied.
• For each parent insert a link from the parent to Xi.
• CPTs: Write down the conditional probability table, P(Xi | Parents(Xi)). The parents of node
Xi should contain all those nodes in X1, . . . , Xi-1 that directly influence Xi.
Because each node is connected only to earlier nodes, this construction method guarantees that the
network is acyclic.
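To make the factorization of Equation 2 concrete, here is a small Python sketch for the burglary network discussed below. The CPT numbers are the standard textbook values and are assumed here for illustration; substitute the values from your figure if they differ. The sketch computes one full joint entry as the product of the local conditional probabilities.

# Assumed (standard textbook) numbers for the burglary network.
p_b = 0.001                    # P(Burglary)
p_e = 0.002                    # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
p_j = {True: 0.90, False: 0.05}                      # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}                      # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as the product of the local conditional probabilities."""
    pb = p_b if b else 1 - p_b
    pe = p_e if e else 1 - p_e
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pj = p_j[a] if j else 1 - p_j[a]
    pm = p_m[a] if m else 1 - p_m[a]
    return pb * pe * pa * pj * pm

# Probability that the alarm sounds and both neighbours call,
# with neither a burglary nor an earthquake:
print(joint(False, False, True, True, True))   # about 0.00063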
Another important property of Bayesian networks is that they contain no redundant
probability values. If there is no redundancy, then there is no chance for inconsistency: it is
impossible for the knowledge engineer or domain expert to create a Bayesian network that violates
the axioms of probability.
Suppose we have completed the network in the figure except for the choice of parents for MaryCalls.
MaryCalls is certainly influenced by whether there is a Burglary or an Earthquake, but not directly.
Intuitively, our knowledge of the domain tells us that these events influence Mary’s calling
behavior only through their effect on the alarm.
Also, given the state of the alarm, whether John calls has no influence on Mary’s calling.
Formally speaking, we believe that the following conditional independence statement holds:
P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm).
Thus, Alarm will be the only parent node for MaryCalls.
MARKOV BLANKET
Another important independence property is implied by the topological semantics.
A node is conditionally independent of all other nodes in the network, given its parents, children,
and children’s parents—that is, given its Markov blanket.
For example, Burglary is independent of JohnCalls and MaryCalls, given Alarm and Earthquake.
6. TEMPORAL MODEL
Time and Uncertainty
We have developed our techniques for probabilistic reasoning in the context of static worlds, in
which each random variable has a single fixed value.
For example, when repairing a car, we assume that whatever is broken remains broken
during the process of diagnosis; our job is to infer the state of the car from observed evidence,
which also remains fixed.
Now consider a slightly different problem: treating a diabetic patient. As in the case of car
repair, we have evidence such as recent insulin doses, food intake, blood sugar measurements, and
other physical signs.
The task is to assess the current state of the patient, including the actual blood sugar level
and insulin level. Given this information, we can make a decision about the patient’s food intake
and insulin dose. Unlike the case of car repair, here the dynamic aspects of the problem are
essential.
The same considerations arise in many other contexts, such as tracking the location of a robot,
tracking the economic activity of a nation, and making sense of a spoken or written sequence of
words. How can dynamic situations like these be modeled?
➢ We view the world as a series of snapshots, or time slices, each of which contains a set of
random variables, some observable and some not.
FOR EXAMPLE:
You are the security guard stationed at a secret underground installation. You want to know
whether it’s raining today, but your only access to the outside world occurs each morning when
you see the director coming in with, or without, an umbrella.
For each day t, the set Et thus contains a single evidence variable Umbrellat (or Ut for short, whether
the umbrella appears), and the set Xt contains a single state variable Raint (or Rt for short, whether
it is raining).
➢ The interval between time slices also depends on the problem. For diabetes monitoring, a suitable
interval might be an hour rather than a day.
➢ We assume the interval between slices is fixed, so we can label times by integers.
➢ Assume that evidence starts arriving at t=1. Hence, our umbrella world is represented by state
variables R0, R1, R2, . . . and evidence variables U1, U2, . . ..
➢ We will use the notation a:b to denote the sequence of integers from a to b (inclusive), and the
notation Xa:b to denote the set of variables from Xa to Xb. For example, U1:3 corresponds to the
variables U1, U2, U3.
➢ An agent maintains a belief state that represents which states of the world are currently possible.
➢ From the belief state and a transition model, the agent can predict how the world might evolve in
the next time step.
➢ From the percepts observed and a sensor model, the agent can update the belief state. A percept is
the input that an intelligent agent is perceiving at any given moment. It is essentially the same
concept as a percept in psychology, except that it is being perceived not by the brain but by the
agent.
➢ The transition and sensor models may be uncertain: the transition model describes the probability
distribution of the variables at time t, given the state of the world at past times,
➢ While the sensor model describes the probability of each percept at time t, given the current state
of the world.
With the set of state and evidence variables for a given problem decided on, the next step is to
specify how the world evolves (the transition model) and how the evidence variables get their
values (the sensor model).
The transition model specifies the probability distribution over the latest state variables, given the
previous values, that is, P(Xt |X0:t−1). The set X0:t−1 is unbounded in size as t increases. We solve
the problem by making a Markov assumption.
MARKOV ASSUMPTION
The Markov assumption states that the current state depends on only a finite fixed number of
previous states.
Processes satisfying this assumption were first studied in depth by the Russian statistician Andrei
Markov (1856–1922) and are called Markov processes or Markov chains.
Hence, in a first-order Markov process, the transition model is the conditional distribution P(Xt
|Xt−1). The transition model for a second-order Markov process is the conditional distribution P(Xt
|Xt-2, Xt-1). The figure shows the Bayesian network structures corresponding to first-order and
second-order Markov processes. Even with the Markov assumption there is still a problem: there are
infinitely many possible values of t. Do we need to specify a different distribution for each time step?
We avoid this problem by assuming that changes in the world state are caused by a stationary
process—that is, a process of change that is governed by laws that do not themselves change over
time.
In the umbrella world, then, the conditional probability of rain, P(Rt |Rt−1), is the same for all t,
and we only have to specify one conditional probability table.
Now for the sensor model. The evidence variables Et could depend on previous variables as well
as the current state variables. Thus, we make a sensor Markov assumption as follows:
P(Et |X0:t,E0:t−1) = P(Et |Xt)
Thus P(Et |Xt) is our sensor model (sometimes called the observation model).
Figure 15.2 shows both the transition model and the sensor model for the umbrella example.
The structure in Figure 15.2 is a first-order Markov process—the probability of rain is
assumed to depend only on whether it rained the previous day.
1. Filtering
This is the task of computing the belief state, the posterior distribution over the most recent
state, given all evidence to date. That is, we wish to compute P(Xt | e1:t).
2. Prediction
This is the task of computing the posterior distribution over a future state, given all
evidence to date. That is, we wish to compute P(Xt+k | e1:t) for some k > 0.
In the umbrella example, this might mean computing the probability of rain three
days from now, given all the observations to date. Prediction is useful for evaluating possible
courses of action based on their expected outcomes.
3. Smoothing
This is the task of computing the posterior distribution over a past state, given all
evidence up to the present. That is, we wish to compute P(Xk | e1:t) for some k such that 0 ≤ k < t.
6.1.1 FILTERING AND PREDICTION
A useful filtering algorithm maintains a current state estimate and updates it as each new piece of
evidence arrives; that is, it computes P(Xt+1 | e1:t+1) = f(et+1, P(Xt | e1:t))
for some function f. This process is called recursive estimation. We can view the calculation as
being composed of two parts: first, the current state distribution is projected forward from t to t+1;
then it is updated using the new evidence et+1. This two-part process emerges quite simply when
the formula is rearranged:

P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) Σ_xt P(Xt+1 | xt) P(xt | e1:t)          (15.5)

Here α is a normalizing constant used to make the probabilities sum to 1. The second term, the
one-step prediction P(Xt+1 | e1:t) = Σ_xt P(Xt+1 | xt) P(xt | e1:t), predicts the next state, and the
first term, P(et+1 | Xt+1), updates this prediction with the new evidence et+1.
Within the summation, the first factor comes from the transition model and the second comes from
the current state distribution. Hence, we have the desired recursive formulation. The filtered estimate
P(Xt | e1:t) can be thought of as a "message" f1:t that is propagated forward along the sequence,
modified by each transition and updated by each new observation. The process is given by

f1:t+1 = α FORWARD(f1:t, et+1)

where FORWARD implements the update described in Equation (15.5) and the process begins
with f1:0 = P(X0). When all the state variables are discrete, the time for each update is constant
with f1:0 = P(X0). When all the state variables are discrete, the time for each update is constant
(i.e., independent of t), and the space required is also constant. (The constants depend, of course,
on the size of the state space and the specific type of the temporal model in question). The time
and space requirements for updating must be constant if an agent with limited memory is to keep
track of the current state distribution over an unbounded sequence of observations.
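As a concrete sketch, here is the forward (filtering) update for the umbrella world in Python. The probabilities 0.7, 0.9 and 0.2 are the usual textbook values for this example and are assumed here rather than taken from the figure.

# Umbrella world HMM (assumed standard values):
#   P(R_t = true | R_{t-1} = true) = 0.7,  P(R_t = true | R_{t-1} = false) = 0.3
#   P(U_t = true | R_t = true) = 0.9,      P(U_t = true | R_t = false) = 0.2
T = {True: 0.7, False: 0.3}   # P(rain today = true | rain yesterday)
S = {True: 0.9, False: 0.2}   # P(umbrella = true | rain today)

def forward(f, umbrella):
    """One filtering step: one-step prediction, then update with the new evidence."""
    pred = {r1: sum((T[r0] if r1 else 1 - T[r0]) * f[r0] for r0 in (True, False))
            for r1 in (True, False)}                     # P(R_{t+1} | e_1:t)
    unnorm = {r1: (S[r1] if umbrella else 1 - S[r1]) * pred[r1] for r1 in (True, False)}
    alpha = 1.0 / sum(unnorm.values())                   # normalizing constant
    return {r1: alpha * p for r1, p in unnorm.items()}

f = {True: 0.5, False: 0.5}      # prior P(R_0)
for u in [True, True]:           # umbrella observed on days 1 and 2
    f = forward(f, u)
    print(f[True])               # about 0.818, then about 0.883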
In addition to filtering and prediction, we can use a forward recursion to compute the likelihood of
the evidence sequence, P(e1:t). This is a useful quantity if we want to compare different temporal
models that might have produced the same evidence sequence (e.g., two different models for the
persistence of rain). For this recursion, we use a likelihood message
L1:t (Xt)=P(Xt, e1:t).
It is a simple exercise to show that the message calculation is identical to that for filtering:
L1:t+1 = FORWARD(L1:t, et+1) .
Having computed L1:t, we obtain the actual likelihood by summing out Xt:

L1:t = P(e1:t) = Σ_xt L1:t(xt)
6.1.2 SMOOTHING
Smoothing is the process of computing the distribution over a past state given evidence up to the
present; that is, P(Xk | e1:t) for 0 ≤ k < t (see Figure 15.3). In anticipation of another recursive
message-passing approach, we can split the computation into two parts: the evidence up to k and
the evidence from k + 1 to t,

P(Xk | e1:t) = α P(Xk | e1:k) P(ek+1:t | Xk) = α f1:k × bk+1:t          (15.8)

where the backward message bk+1:t = P(ek+1:t | Xk) is computed by a backward recursion

bk+1:t = BACKWARD(bk+2:t, ek+1)          (15.9)

where BACKWARD implements this backward update. As with the forward
recursion, the time and space needed for each update are constant and thus independent of t. We
can now see that the two terms in Equation (15.8) can both be computed by recursions through
time, one running forward from 1 to k and using the filtering equation (15.5), and the other running
backward from t down to k + 1 and using Equation (15.9).
Both the forward and backward recursions take a constant amount of time per step; hence, the time
complexity of smoothing with respect to evidence e1:t is O(t). This is the complexity for smoothing
at a particular time step k.
If we want to smooth the whole sequence, one obvious method is simply to run the whole
smoothing process once for each time step to be smoothed. This results in a time complexity of
O(t^2).
A better approach uses a simple application of dynamic programming to reduce the
complexity to O(t). A clue appears in the preceding analysis of the umbrella example, where we
were able to reuse the results of the forward-filtering phase.
The key to the linear-time algorithm is to record the results of forward filtering over the
whole sequence. Then we run the backward recursion from t down to 1, computing the
smoothed estimate at each step k from the computed backward message bk+1:t and the stored
forward message f1:k. This procedure is aptly called the forward–backward algorithm.
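The forward–backward idea can be sketched in the same style as the earlier filtering example, again with the assumed umbrella probabilities: the forward messages are stored during the forward pass and combined with the backward messages during the backward pass.

# Umbrella model (same assumed values as before).
T = {True: 0.7, False: 0.3}   # P(rain_t = true | rain_{t-1})
S = {True: 0.9, False: 0.2}   # P(umbrella_t = true | rain_t)

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def forward(f, u):
    """Filtering update: predict with T, then weight by the sensor model for evidence u."""
    return normalize({r1: (S[r1] if u else 1 - S[r1])
                          * sum((T[r0] if r1 else 1 - T[r0]) * f[r0] for r0 in (True, False))
                      for r1 in (True, False)})

def backward(b, u):
    """Backward update: b_{k+1:t} and evidence e_{k+1} = u give b_{k:t}."""
    return {r0: sum((S[r1] if u else 1 - S[r1])
                    * (T[r0] if r1 else 1 - T[r0]) * b[r1] for r1 in (True, False))
            for r0 in (True, False)}

def forward_backward(evidence, prior):
    """Smoothed estimates P(R_k | e_1:t), storing the forward messages (linear time)."""
    fs = [prior]
    for u in evidence:                       # forward pass, keeping every message
        fs.append(forward(fs[-1], u))
    b = {True: 1.0, False: 1.0}              # backward message starts as all ones
    smoothed = [None] * len(evidence)
    for k in range(len(evidence), 0, -1):    # backward pass
        smoothed[k - 1] = normalize({r: fs[k][r] * b[r] for r in (True, False)})
        b = backward(b, evidence[k - 1])
    return smoothed

print(forward_backward([True, True], {True: 0.5, False: 0.5}))
# smoothed P(R_1 = true | u_1, u_2) is about 0.883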
The forward–backward algorithm forms the computational backbone for many applications
that deal with sequences of noisy observations. As described so far, it has two practical drawbacks.
The first is that its space complexity can be too high when the state space is large and the
sequences are long. It uses O(|f|t) space, where |f| is the size of the representation of the forward
message. The space requirement can be reduced to O(|f| log t), with a concomitant increase in the
time complexity by a factor of log t. In some cases a constant-space algorithm can be used.
The second drawback of the basic algorithm is that it needs to be modified to work in an online
setting where smoothed estimates must be computed for earlier time slices as new observations are
continuously added to the end of the sequence.
The most common requirement is for fixed-lag smoothing, which requires computing the smoothed
estimate P(Xt−d | e1:t) for fixed d. That is, smoothing is done for the time slice d steps behind the
current time t; as t increases, the smoothing has to keep up.
Obviously, we can run the forward–backward algorithm over the d-step “window” as each new
observation is added, but this seems inefficient.
6.1.3 MOST LIKELY SEQUENCE
There is a linear-time algorithm for finding the most likely sequence of states given a sequence of
observations, but it requires a little more
thought. It relies on the same Markov property that yielded efficient algorithms for filtering and
smoothing. The easiest way to think about the problem is to view each sequence as a path through
a graph whose nodes are the possible states at each time step.
Such a graph is shown for the umbrella world in Figure 15.5(a). Now consider the task of
finding the most likely path through this graph, where the likelihood of any path is the product of
the transition probabilities along the path and the probabilities of the given observations at each
state.
Let's focus in particular on paths that reach the state Rain5 = true. Because of the Markov
property, it follows that the most likely path to the state Rain5 = true consists of the most likely
path to some state at time 4 followed by a transition to Rain5 = true; and the state at time 4 that
will become part of the path to Rain5 = true is whichever maximizes the likelihood of that path.
In other words, there is a recursive relationship between the most likely paths to each state xt+1
and the most likely paths to each state xt. We can write this relationship as an equation connecting
the probabilities of the paths:

max_{x1..xt} P(x1, . . . , xt, Xt+1 | e1:t+1)
    = α P(et+1 | Xt+1) max_{xt} [ P(Xt+1 | xt) max_{x1..xt-1} P(x1, . . . , xt-1, xt | e1:t) ]          (15.11)
Thus, the algorithm for computing the most likely sequence is similar to filtering: it runs forward
along the sequence, computing the m message at each time step, using Equation (15.11). The
progress of this computation is shown in Figure 15.5(b).
At the end, it will have the probability for the most likely sequence reaching each of the final
states. One can thus easily select the most likely sequence overall (the states outlined in bold). In
order to identify the actual sequence, as opposed to just computing its probability, the algorithm will
also need to record, for each state, the best state that leads to it; these are indicated by the bold
arrows in Figure 15.5(b).
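The same idea in code, for the umbrella model with the assumed values used earlier: the sketch keeps, for each state, both the (unnormalized) probability of the best path reaching it and a back-pointer, so the actual sequence can be recovered at the end.

# Umbrella model again (assumed values).
T = {True: 0.7, False: 0.3}   # P(rain_t = true | rain_{t-1})
S = {True: 0.9, False: 0.2}   # P(umbrella_t = true | rain_t)

def viterbi(evidence, prior):
    """Most likely sequence of rain states given the umbrella observations.

    prior is the distribution P(R_1) before any evidence is seen."""
    # m[r] = (unnormalized) probability of the most likely path ending in state r
    m = {r: prior[r] * (S[r] if evidence[0] else 1 - S[r]) for r in (True, False)}
    backpointers = []
    for e in evidence[1:]:
        new_m, bp = {}, {}
        for r1 in (True, False):
            # best predecessor state for r1
            best_r0 = max((True, False),
                          key=lambda r0: m[r0] * (T[r0] if r1 else 1 - T[r0]))
            bp[r1] = best_r0
            new_m[r1] = (S[r1] if e else 1 - S[r1]) * m[best_r0] \
                        * (T[best_r0] if r1 else 1 - T[best_r0])
        backpointers.append(bp)
        m = new_m
    # Follow the back-pointers from the most likely final state to recover the path.
    state = max(m, key=m.get)
    path = [state]
    for bp in reversed(backpointers):
        state = bp[state]
        path.append(state)
    return list(reversed(path))

# Observations: umbrella, umbrella, no umbrella, umbrella, umbrella
print(viterbi([True, True, False, True, True], {True: 0.5, False: 0.5}))
# -> [True, True, False, True, True]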
7. HIDDEN MARKOV MODEL
A hidden Markov model (HMM) is a temporal probabilistic model in which the state of the process
is described by a single discrete random variable, whose possible values are the possible states of
the world. With a single state variable, the transition model can be written as a matrix T, with
Tij = P(Xt = j | Xt-1 = i), and the sensor model for each observation et can be written as a diagonal
matrix Ot whose i-th diagonal entry is P(et | Xt = i). The forward filtering update then becomes a
matrix-vector operation:

f1:t+1 = α Ot+1 T^T f1:t          (15.12)

Besides providing an elegant description of the filtering and smoothing algorithms for HMMs, the
matrix formulation reveals opportunities for improved algorithms.
The first is a simple variation on the forward–backward algorithm that allows smoothing to
be carried out in constant space, independently of the length of the sequence. The idea is that
smoothing for any particular time slice k requires the simultaneous presence of both the forward
and backward messages, f1:k and bk+1:t, according to Equation (15.8).
The forward–backward algorithm achieves this by storing the fs computed on the forward pass so
that they are available during the backward pass.
Another way to achieve this is with a single pass that propagates both f and b in the
same direction. For example, the "forward" message f can be propagated backward if we
manipulate Equation (15.12) to work in the other direction:

f1:t = α' (T^T)^-1 (Ot+1)^-1 f1:t+1
The modified smoothing algorithm works by first running the standard forward pass to
compute f1:t (forgetting all the intermediate results) and then running the backward pass for both
b and f together, using them to compute the smoothed estimate at each step. Since only one copy
of each message is needed, the storage requirements are constant (i.e., independent of t, the length
of the sequence).
There are two significant restrictions on this algorithm: it requires that the transition matrix
be invertible and that the sensor model have no zeroes—that is, that every observation be possible
in every state.
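A small NumPy sketch of the matrix formulation (again with the assumed umbrella numbers): the forward update of Equation (15.12) and the same message propagated backward by inverting it. The inversion is exactly why the transition matrix must be invertible and the sensor model must contain no zeroes.

import numpy as np

# Umbrella HMM in matrix form (states ordered [rain, not rain]); assumed textbook values.
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])          # T[i, j] = P(X_{t+1} = j | X_t = i)
O_umbrella = np.diag([0.9, 0.2])    # diagonal sensor matrix for U_t = true

def normalize(v):
    return v / v.sum()

# Forward update, Equation (15.12): f_{1:t+1} = alpha * O_{t+1} T^T f_{1:t}
f0 = np.array([0.5, 0.5])
f1 = normalize(O_umbrella @ T.T @ f0)     # about [0.818, 0.182]

# The same message propagated backward by inverting the update:
# f_{1:t} = alpha' * (T^T)^{-1} (O_{t+1})^{-1} f_{1:t+1}
f0_again = normalize(np.linalg.inv(T.T) @ np.linalg.inv(O_umbrella) @ f1)
print(f1, f0_again)                       # f0_again recovers [0.5, 0.5]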
A second area in which the matrix formulation reveals an improvement is in online
smoothing with a fixed lag. The fact that smoothing can be done in constant space suggests that
there should exist an efficient recursive algorithm for online smoothing—that is, an algorithm
whose time complexity is independent of the length of the lag. Let us suppose that the lag is d; that
is, we are smoothing at time slice t−d, where the current time is t.
In the localization problem for the vacuum world the robot had a single nondeterministic Move action
and its sensors reported perfectly whether or not obstacles lay immediately to the north, south, east,
and west; the robot’s belief state was the set of possible locations it could be in.
Here we make the problem slightly more realistic by including a simple probability model
for the robot’s motion and by allowing for noise in the sensors. The state variable X t represents the
location of the robot on the discrete grid; the domain of this variable is the set of empty squares {s1, . .
. , sn}.
Let NEIGHBORS(s) be the set of empty squares that are adjacent to s and let N(s) be the
size of that set. Then the transition model for the Move action says that the robot is equally likely to
end up at any neighboring square:

P(Xt+1 = j | Xt = i) = Tij = 1/N(i) if j ∈ NEIGHBORS(i), and 0 otherwise.
The sensor variable Et has 16 possible values, each a four-bit sequence giving the presence or absence
of an obstacle in a particular compass direction. We will use the notation NS, for example, to mean
that the north and south sensors report an obstacle and the east and west do not.
Suppose that each sensor's error rate is ε and that errors occur independently for the four sensor
directions.
In that case, the probability of getting all four bits right is (1 - ε)^4 and the probability of getting them
all wrong is ε^4.
Furthermore, if dit is the discrepancy, the number of bits that are different, between the true
values for square i and the actual reading et, then the probability that a robot in square i would
receive a sensor reading et is

P(Et = et | Xt = i) = (Ot)ii = (1 - ε)^(4 - dit) ε^dit
Given the matrices T and Ot, the robot can use Equation (15.12) to compute the posterior
distribution over locations—that is, to work out where it is. Figure 15.7 shows the distributions
P(X1 |E1 =NSW) and P(X2 |E1 =NSW,E2 =NS).
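To make the localization model concrete, the following sketch builds the transition matrix and one sensor matrix for a hypothetical 3 x 3 grid with no interior obstacles and applies one matrix filtering step; the grid, the choice ε = 0.2, and the helper names are illustrative assumptions rather than part of the original example.

import numpy as np

# Hypothetical 3x3 grid of empty squares, indexed row by row.
ROWS, COLS = 3, 3
squares = [(r, c) for r in range(ROWS) for c in range(COLS)]

def neighbors(s):
    r, c = s
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [q for q in cand if q in squares]

# Transition model for Move: equally likely to end up in any neighboring square.
n = len(squares)
T = np.zeros((n, n))
for i, s in enumerate(squares):
    for q in neighbors(s):
        T[i, squares.index(q)] = 1.0 / len(neighbors(s))

def true_reading(s):
    """Four bits (N, S, E, W): True if there is a wall in that direction."""
    r, c = s
    return ((r - 1, c) not in squares, (r + 1, c) not in squares,
            (r, c + 1) not in squares, (r, c - 1) not in squares)

def sensor_matrix(reading, eps=0.2):
    """Diagonal O_t with entries (1 - eps)^(4 - d) * eps^d, d = bit discrepancy."""
    probs = []
    for s in squares:
        d = sum(a != b for a, b in zip(true_reading(s), reading))
        probs.append((1 - eps) ** (4 - d) * eps ** d)
    return np.diag(probs)

# One filtering step from a uniform prior, given a reading of NS (walls north and south).
f = np.full(n, 1.0 / n)
O1 = sensor_matrix((True, True, False, False))
f = O1 @ T.T @ f
f = f / f.sum()
print(f.reshape(ROWS, COLS))   # most mass on the top-middle and bottom-middle squares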
In addition to filtering to estimate its current location, the robot can use smoothing to work
out where it was at any given past time—for example, where it began at time 0—and it can use the
Viterbi algorithm to work out the most likely path it has taken to get where it is now.
Figure 15.8 shows the localization error and Viterbi path accuracy for various values of the
per-bit sensor error rate ε . Even when ε is 20%—which means that the overall sensor reading is
wrong 59% of the time—the robot is usually able to work out its location within two squares after
25 observations.
This is because of the algorithm’s ability to integrate evidence over time and to take into
account the probabilistic constraints imposed on the location sequence by the transition model.