UNIT IV
ACTING LOGICALLY
PLANNING
The task of coming up with a sequence of actions that will achieve a goal is called
planning. Environments that are fully observable, deterministic, finite, static and discrete are
called classical planning environments.
PLANNING PROBLEM
Solution for the planning problem: An action sequence that, when executed in the initial state,
results in a state that satisfies the goal.
The search for a solution begins with the initial plan, containing a Start action with the
effect At(Spare, Trunk) ∧ At(Flat, Axle) and a Finish action with the sole precondition
At(Spare, Axle). Then we generate successors by picking an open precondition to work on and
choosing among the possible actions that can achieve it.
The sequence of events is as follows:
1. Pick the only open precondition, At(Spare, Axle) of Finish. Choose the only applicable
action, PutOn(Spare, Axle).
2. Pick the At(Spare, Ground) precondition of PutOn(Spare, Axle). Choose the only applicable
action, Remove(Spare, Trunk), to achieve it.
3. Pick the ¬At(Flat, Axle) precondition of PutOn(Spare, Axle). Just to be contrary, choose the
LeaveOvernight action rather than the Remove(Flat, Axle) action. Notice that LeaveOvernight
also has the effect ¬At(Spare, Ground), which means it conflicts with the causal link
Remove(Spare, Trunk) --At(Spare, Ground)--> PutOn(Spare, Axle).
To resolve the conflict we add an ordering constraint putting LeaveOvernight before
Remove(Spare, Trunk).
4. The only remaining open precondition at this point is the At(Spare, Trunk) precondition of the
action Remove(Spare, Trunk). The only action that can achieve it is the existing Start action, but
the causal link from Start to Remove(Spare, Trunk) is in conflict with the ¬At(Spare, Trunk)
effect of LeaveOvernight. This time there is no way to resolve the conflict with LeaveOvernight:
we cannot order it before Start (because nothing can come before Start), and we cannot order it
after Remove(Spare, Trunk) (because there is already a constraint ordering it before
Remove(Spare, Trunk)). So we are forced to back up, remove the Remove(Spare, Trunk) action
and the last two causal links.
5. Consider again the ¬At(Flat, Axle) precondition of PutOn(Spare, Axle). This time, we choose
Remove(Flat, Axle).
6. Once again, pick the At(Spare, Trunk) precondition of Remove(Spare, Trunk) and choose Start
to achieve it. This time there are no conflicts.
7. Pick the At(Flat, Axle) precondition of Remove(Flat, Axle), and choose Start to achieve it.
The plan is now complete and consistent (see the sketch below).
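The finished partial-order plan can be written down explicitly as a set of actions, ordering constraints, and causal links. A minimal Python sketch of that data structure; the tuple encoding is an illustration, not something prescribed by the text:

# Sketch of the finished spare-tire partial-order plan.
actions = ["Start", "Remove(Spare,Trunk)", "Remove(Flat,Axle)",
           "PutOn(Spare,Axle)", "Finish"]

# Ordering constraints (A, B) meaning A must come before B.
orderings = {
    ("Start", "Remove(Spare,Trunk)"),
    ("Start", "Remove(Flat,Axle)"),
    ("Remove(Spare,Trunk)", "PutOn(Spare,Axle)"),
    ("Remove(Flat,Axle)", "PutOn(Spare,Axle)"),
    ("PutOn(Spare,Axle)", "Finish"),
}

# Causal links (producer, condition, consumer); "~" stands for negation.
causal_links = {
    ("Start", "At(Spare,Trunk)", "Remove(Spare,Trunk)"),
    ("Start", "At(Flat,Axle)", "Remove(Flat,Axle)"),
    ("Remove(Spare,Trunk)", "At(Spare,Ground)", "PutOn(Spare,Axle)"),
    ("Remove(Flat,Axle)", "~At(Flat,Axle)", "PutOn(Spare,Axle)"),
    ("PutOn(Spare,Axle)", "At(Spare,Axle)", "Finish"),
}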
The STRIPS representation talks about what actions do, but, because the representation is
based on situation calculus, it cannot talk about how long an action takes or even about when an
action occurs. Time is of the essence in the general family of applications called job shop
scheduling. Such tasks require completing a set of jobs, each of which consists of a sequence of
actions, where each action has a given duration and might require some resources.
The problem is to determine a schedule that minimizes the total time required to complete
all the jobs, while respecting the resource constraints. As an example, consider a highly simplified
automobile assembly problem. There are two jobs: assembling cars C1 and C2.
Each job consists of three actions: adding the engine, adding the wheels, and inspecting
the results. The engine must be put in first (because having the front wheels on would inhibit
access to the engine compartment) and of course the inspection must be done last.
This is a scheduling problem rather than a planning problem: we must determine when each action
should begin and end, based on the durations of the actions as well as their ordering. The notation
Duration(d) in the effect of an action (where d must be bound to a number) means that the action
takes d minutes to complete.
We use the critical path method (CPM) to determine the possible start and end times of each action.
A path through a partial-order plan is a linearly ordered sequence of actions beginning
with Start and ending with Finish.
The critical path is that path whose total duration is longest; the path is "critical"
because it determines the duration of the entire plan: shortening other paths doesn't shorten the
plan as a whole, but delaying the start of any action on the critical path slows down the whole
plan.
Each action has a window of possible start times [ES, LS], where
ES = earliest possible start time and LS = latest possible start time.
Given actions A and B with A ≺ B:
ES(Start) = 0
ES(B) = max over all A ≺ B of [ES(A) + Duration(A)]
LS(Finish) = ES(Finish)
LS(A) = [min over all B ≻ A of LS(B)] − Duration(A)
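These two recurrences amount to a forward pass (for ES) and a backward pass (for LS) over the partial order. A small Python sketch; the durations and ordering below are illustrative placeholders rather than the car-assembly figure:

# Critical path method: ES by forward pass, LS by backward pass.
durations = {"Start": 0, "A": 30, "B": 10, "C": 10, "Finish": 0}
predecessors = {            # the constraints A < B, keyed by the successor
    "Start": [],
    "A": ["Start"],
    "B": ["A"],
    "C": ["Start"],
    "Finish": ["B", "C"],
}
successors = {a: [b for b, preds in predecessors.items() if a in preds]
              for a in durations}

order = ["Start", "A", "B", "C", "Finish"]    # any topological order

ES = {}
for a in order:                               # forward pass
    ES[a] = max((ES[p] + durations[p] for p in predecessors[a]), default=0)

LS = {}
for a in reversed(order):                     # backward pass
    LS[a] = min((LS[s] - durations[a] for s in successors[a]),
                default=ES["Finish"])

# Actions with ES == LS have zero slack and lie on the critical path.
critical = [a for a in order if ES[a] == LS[a]]
print(ES, LS, critical)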
Scheduling with no resource constraints
Representation of Decompositions
General action descriptions are stored in a plan library. Each method has the form
Decompose(a, d), where a is an action and d is a partial-order (PO) plan that implements it.
In the BuildHouse example, the Start action supplies all preconditions of actions that are not
supplied by other actions; these are the external preconditions. The Finish action collects all
effects of actions that are not consumed by other actions in the plan; these are the external
effects. Primary effects (used to achieve the goal) are distinguished from secondary effects.
Properties of Decomposition
A decomposition should be a correct implementation of the action a: it is correct if plan d is a
complete and consistent PO plan for the problem of achieving the effects of a given the
preconditions of a.
A decomposition is not necessarily unique.
A decomposition performs information hiding:
the STRIPS description of the higher-level action hides some preconditions and effects;
it ignores all internal effects of the decomposition;
it does not specify the intervals inside the activity during which preconditions and
effects must hold.
Information hiding is essential to HTN planning (a sketch of a plan-library entry follows).
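A plan-library entry can be stored as the abstract action together with the partial-order plan that implements it. A minimal sketch; the BuildHouse step names and conditions below are illustrative placeholders, not a worked example from the text:

# Sketch of a plan-library entry Decompose(a, d): an abstract action a maps
# to the partial-order plan d that implements it.  Step names are illustrative.
plan_library = {
    "BuildHouse": {
        "steps": ["GetPermit", "HireBuilder", "Construction", "PayBuilder"],
        "orderings": {("GetPermit", "Construction"),
                      ("HireBuilder", "Construction"),
                      ("Construction", "PayBuilder")},
        # Only the external preconditions/effects are visible at the higher
        # level; internal details stay hidden (information hiding).
        "external_preconditions": ["Land"],
        "external_effects": ["House"],
    }
}

def decompose(action):
    """Replace an abstract action by its stored partial-order plan."""
    return plan_library[action]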
Planning in nondeterministic domains
Nondeterministic worlds
Bounded nondeterminism: Effects can be enumerated, but agent cannot know in advance
which one will occur
Unbounded nondeterminism: The set of possible effects is unbounded or too large to
enumerate
Planning for bounded nondeterminism
Sensorless planning
Contingent planning
Planning for unbounded nondeterminism
Online replanning
Continuous planning
Sensorless planning
Agent has no sensors to tell which state it is in, therefore each action might lead to one of
several possible outcomes
Must reason about sets of states (belief states), and make sure it arrives in a goal state
regardless of where it comes from and results of actions
Nondeterminism of the environment does not matter – the agent cannot detect the
difference anyway
The required reasoning is often not feasible, and sensorless planning is therefore often not
applicable
Contingent planning
Constructs conditional plans with branches for each (enumerable) possible situation
Decides which action to choose based on special sensing actions that become parts of the
plan
Can also tackle partially observable domains by including reasoning about belief states
(as in sensorless planning)
Planning algorithms have been extended to produce conditional branching plans
Online replanning
Monitors the situation as the plan unfolds, and detects when things go wrong
Performs replanning to find new ways to reach the goals, if possible by repairing the current plan
The agent proceeds from state S and expects to reach E by following the original whole plan
It then detects that it is actually in state O
It creates a repair plan that takes it from O to some state P on the original plan
The new plan to reach G becomes repair + continuation (sketched below)
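The monitor-and-repair cycle can be summarised as a loop. A minimal sketch, where plan_from(s, goal) is a hypothetical planner returning a list of (expected_state_before_action, action) pairs, and observe_state() and execute() are hypothetical sensing and actuation helpers:

# Sketch of online replanning: execute, monitor, and repair when the
# observed state diverges from the expected one.
def online_replan(initial_state, goal, plan_from, observe_state, execute):
    plan = plan_from(initial_state, goal)           # original whole plan
    while plan:
        expected, action = plan[0]
        observed = observe_state()
        if observed != expected:                    # we are in O, not E
            repair = plan_from(observed, expected)  # get back onto the plan at P
            plan = repair + plan                    # new plan = repair + continuation
        else:
            execute(action)
            plan = plan[1:]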
Conditional Planning
Deal with uncertainty by checking the environment to see what is really
happening.
Used in fully observable and nondeterministic environments.
The outcome of an action is unknown.
Conditional steps will check the state of the environment. How to construct a
conditional plan?
Actions: left, right, suck
Propositions to define states: AtL, AtR, CleanL, CleanR
How to include nondeterminism?
Actions can have disjunctive effects, e.g. moving left sometimes fails:
Action(Left, PRECOND: AtR, EFFECT: AtL)
becomes
Action(Left, PRECOND: AtR, EFFECT: AtL ∨ AtR)
Actions can have conditional effects:
Action(Left, PRECOND: AtR, EFFECT: AtL ∨ (AtL ∧ when CleanL: ¬CleanL))
Effects can be both disjunctive and conditional.
Double Murphy: there is a possibility of depositing dirt when moving to the other square, and a
possibility of depositing dirt when the action is Suck.
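Enumerating these outcomes explicitly is what bounded nondeterminism allows. A small sketch for the double-Murphy vacuum world, where a state is encoded as (position, CleanL, CleanR); the encoding is illustrative:

# Sketch: bounded nondeterminism as sets of successor states.
def results_left(state):
    pos, clean_l, clean_r = state
    outcomes = set()
    if pos == "R":                                   # precondition AtR
        outcomes.add(("L", clean_l, clean_r))        # move succeeds
        outcomes.add(("R", clean_l, clean_r))        # move sometimes fails
        if clean_l:
            outcomes.add(("L", False, clean_r))      # may deposit dirt on arrival
    return outcomes

def results_suck(state):
    pos, clean_l, clean_r = state
    if pos == "L":
        outcomes = {("L", True, clean_r)}            # a dirty square gets cleaned
        if clean_l:
            outcomes.add(("L", False, clean_r))      # a clean square may get dirt
        return outcomes
    outcomes = {("R", clean_l, True)}
    if clean_r:
        outcomes.add(("R", clean_l, False))
    return outcomes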
Continuous Planning:
The agent persists indefinitely in an environment, going through phases of goal formulation,
planning and acting. Execution monitoring and the planner operate as one continuous process.
Example: Blocks world
• Assume a fully observable environment
• Assume a partially ordered plan
Blocks World Example
Initial state
Action(Move(x,y),
   PRECOND: Clear(x) ∧ Clear(y) ∧ On(x,z),
   EFFECT: On(x,y) ∧ Clear(z) ∧ ¬On(x,z) ∧ ¬Clear(y))
The agent first needs to formulate a goal: On(C,D) ∧ On(D,B). The plan is created incrementally;
the planner returns NoOp and checks the percepts.
Assume that the percepts don't change and this plan is constructed, with an ordering constraint
between Move(D,B) and Move(C,D). Start is the label of the current state during planning. Before
the agent can execute the plan, nature intervenes: D is moved onto B.
The agent executes the new plan and performs the action Move(C,D). Assume the agent is clumsy
and drops C on A. There is no plan left, but there is still an open PRECOND, so the agent
determines a new plan for the open condition: again Move(C,D).
Similar to POP: on each iteration, find a plan flaw and fix it. Possible flaws: missing goal,
open precondition, causal conflict, unsupported link, redundant action, unexecuted action,
unnecessary historical goal.
Multiagent Planning
Single-agent planning performs poorly here, since agents cannot be indifferent to other agents'
intentions. In general there are two types of multiagent environments:
• Cooperative
• Competitive
Multiagent planning problem:
Consider a doubles tennis example where the agents want to return the ball.
Agents(A, B)
Init(At(A, [Left, Baseline]) ∧ At(B, [Right, Net]) ∧ Approaching(Ball, [Right, Baseline]) ∧
Partner(A, B) ∧ Partner(B, A))
Goal(Returned(Ball) ∧ At(agent, [x, Net]))
Action(Hit(agent, Ball),
   PRECOND: Approaching(Ball, [x, y]) ∧ At(agent, [x, y]) ∧ Partner(agent, partner) ∧ ¬At(partner, [x, y]),
   EFFECT: Returned(Ball))
Action(Go(agent, [x, y]),
   PRECOND: At(agent, [a, b]),
   EFFECT: At(agent, [x, y]) ∧ ¬At(agent, [a, b]))
Cooperation: Joint Goals and Plans
A solution is a joint plan consisting of actions for both agents. For example:
A: [Go(A, [Right, Baseline]), Hit(A, Ball)]   B: [NoOp(B), NoOp(B)]
or
A: [Go(A, [Left, Net]), NoOp(A)]   B: [Go(B, [Right, Baseline]), Hit(B, Ball)]
Coordination is required for the agents to settle on the same joint plan.
Multi-Body Planning
The planning problem faced by a single centralized agent that can dictate actions to each of
several physical entities.
Assume for simplicity that every action takes one time step and that at each point in the joint
plan the actions of the different bodies are performed simultaneously as a joint action. Planning
can be performed using POP applied to the set of all possible joint actions; note that the size of
this set is the product of the individual bodies' action sets, so it grows exponentially with the
number of bodies.
Coordination Mechanisms
To ensure agreement on a joint plan: use a convention.
A convention is a constraint on the selection of joint plans (beyond the constraint that the
joint plan must work if the agents adopt it), e.g. "stick to your court" or "one player stays at
the net." Conventions that are widely adopted are called social laws, e.g. language. They can be
domain-specific or domain-independent, and they could arise through an evolutionary process, as
in flocking behavior, which follows three simple steering rules (sketched in code below):
o Separation: steer away from neighbors when you get too close
o Cohesion: steer toward the average position of neighbors
o Alignment: steer toward the average orientation (heading) of neighbors
o The flock exhibits the emergent behavior of flying as a pseudo-rigid body
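A compact sketch of the three rules; the neighborhood radius, weights, and time step are arbitrary illustrative parameters:

# Boids sketch: separation, cohesion, alignment applied to every agent.
import numpy as np

def flock_step(positions, velocities, radius=5.0, dt=0.1,
               w_sep=1.5, w_coh=1.0, w_align=1.0):
    new_v = velocities.copy()
    for i, p in enumerate(positions):
        dist = np.linalg.norm(positions - p, axis=1)
        nbrs = (dist < radius) & (dist > 0)          # neighbors within the radius
        if not nbrs.any():
            continue
        separation = (p - positions[nbrs]).sum(axis=0)             # steer away when too close
        cohesion = positions[nbrs].mean(axis=0) - p                # toward average position
        alignment = velocities[nbrs].mean(axis=0) - velocities[i]  # match average heading
        new_v[i] += dt * (w_sep * separation + w_coh * cohesion + w_align * alignment)
    return positions + dt * new_v, new_v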
Coordination Mechanisms
In the absence of conventions, communication can be used, e.g. shouting "Mine!" or "Yours!" in
the tennis example.
The burden of arriving at a successful joint plan can be placed on
the agent designer (the agents are reactive, with no explicit models of other agents), or
the agent (the agents are deliberative, and a model of the other agents is required).
Competitive Environments: Agents can have conflicting utilities, e.g. in zero-sum games like
chess. The agent must:
• Recognise that there are other agents
• Compute some of the other agents' plans
• Compute how the other agents' plans interact with its own plan
• Decide on the best action in view of these interactions.
A model of the other agent is required, yet there is no commitment to a joint action plan.
BAYES' RULE
Flip a coin. What is the chance of it landing heads side up? There are two sides, heads and
tails, and the heads side is one of them, thus
P(Heads) = (heads side) / (heads side + tails side) = 1/2
Conditional Probability
Events A and B are events that are not mutually exclusive, but occur conditionally on the
occurrence of one another.
The probability that both events A and B will occur is called the joint probability, P(A and B).
The probability that event A will occur given that event B occurs is called the conditional
probability:
P(A|B) = (the number of times A and B can occur) / (the number of times B can occur)
or
P(A|B) = P(A and B) / P(B)
or
P(A|B) = P(A ∩ B) / P(B)
Let's take an example.
Suppose we have two dice and we want to know the probability of the total being 8. If we roll
both at the same time, the probability is 5/36.
But what happens if we roll the first die and get a 5: now what is the probability of getting an
8? There is only one way to get an 8 after a 5 has been rolled: you have to roll a 3.
P(A|B) = P(A ∩ B) / P(B)
P(total of 8 with two dice | first die shows 5) = P(rolling a 5 and a 3) / P(rolling a 5)
   = (1/36) / (1/6) = 1/6
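A quick numeric check of that calculation in Python:

from fractions import Fraction

# Conditional probability of a total of 8, given that the first die shows 5.
p_5_and_3 = Fraction(1, 36)      # first die 5 AND second die 3
p_5 = Fraction(1, 6)             # first die shows 5
print(p_5_and_3 / p_5)           # 1/6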
P(A|B) = P(A ∩ B) / P(B)
P(B|A) = P(B ∩ A) / P(A)
Remembering that the intersection is commutative, P(A ∩ B) = P(B ∩ A), we can combine the two
equations to obtain Bayes' theorem (stated here in terms of a hypothesis h and data D):
P(h|D) = P(D|h) P(h) / P(D)
P(h) – the initial probability that hypothesis h holds, before we have observed the data.
Often called the prior probability of h, it reflects any background knowledge that we have about
the correctness of hypothesis h. If we have no initial knowledge about the hypotheses (h0…hn),
we would divide the probability equally among the set of available hypotheses (of which h is a
member).
P(D) – the prior probability that the data D will be observed (this is the probability of D given
no knowledge about which hypothesis holds). Note that it is independent of h.
P(D|h) – the likelihood: the probability of observing data D given that hypothesis h holds.
P(h|D) – the posterior probability of h being true given that D is observed. It increases with
the prior probability of h, i.e. with how commonly h is true independently of the data. It also
increases with the likelihood P(D|h) of data D being associated with hypothesis h: the higher our
confidence that data D is present only when h is true, the more support the observation of D
gives to our hypothesis. When P(D) is high, the evidence is likely to occur whether or not h
holds, so it weakens the link between h and D.
Bayes Rule in AI
This classifier applies to tasks in which each example is described by a conjunction of attribute
values and the target value f(x) can take any value from a finite set V.
In this example we want to use Bayes' theorem to find out the likelihood of playing tennis for a
given set of weather attributes.
f(x) takes values in V = (yes, no), i.e. V = (yes we will play tennis, no we will not play tennis).
The attribute values are a0…a3 = (Outlook, Temperature, Humidity, and Wind).
To determine our answer (if we are going to play tennis given a certain set of conditions) we
make an expression that determines the probability based on our training examples from the
table.
The most probable hypothesis is
v_MAP = argmax over v in V of P(v | a0, …, a3) = argmax over v in V of P(a0, …, a3 | v) P(v)
where
P(a|v) = P(Outlook, Temperature, Humidity, Wind | Play tennis, Don't Play tennis)
In order to get a table with reliable measurements for every combination of the attributes a0…a3
for each hypothesis v, our table would have to be of size 3*3*2*2*2 = 72, and each combination
would have to be observed multiple times to ensure its reliability. Why? Because we would be
modelling the full inter-dependence of the attributes (probably a realistic assumption, but
expensive). The Naive Bayes classifier is based on a simplifying assumption: that the attribute
values are conditionally independent given the target value. That is to say, a cool temperature
is treated as completely independent of it being sunny, and so on.
So:
P(a0, …, an | vj) = product over i of P(ai | vj)
for example
P(Outlook = sunny, Temperature = cool, Humidity = normal, Wind = strong | Play tennis)
   = P(sunny | Play tennis) * P(cool | Play tennis) * P(normal | Play tennis) * P(strong | Play tennis)
The probability of observing P(a0…an | vj) is equal to the product of the probabilities of
observing the individual attributes. Quite an assumption.
Using the table of 14 examples we can calculate our overall probabilities and conditional
probabilities.
First we estimate the probability of playing tennis:
P(Play Tennis = Yes) = 9/14 = .64
P(Play Tennis = No) = 5/14 = .36
Then we estimate the conditional probabilities of the individual attributes. Remember, this is
the step in which we are assuming that the attributes are independent of each other:
Outlook:
P(Outlook = Sunny | Play Tennis = Yes) = 2/9 = .22
P(Outlook = Sunny | Play Tennis = No) = 3/5 = .6
Temperature
P(Temperature = Hot | Play Tennis = Yes) = 2/9 = .22
P(Temperature = Hot | Play Tennis = No) = 2/5 = .40
Wind
P(Wind = Weak | Play Tennis = Yes) = 6/9 = .66
P(Wind = Weak | Play Tennis = No) = 2/5 = .40
What would our Naive Bayes classifier predict in terms of playing tennis on a day like this?
Whichever expression has the higher probability (greater numerical value):
P(Play Tennis = Yes | (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong))
   = P(sunny|Yes) * P(cool|Yes) * P(high|Yes) * P(strong|Yes) * P(Yes)
     / [ P(sunny, cool, high, strong | Yes) * P(Yes) + P(sunny, cool, high, strong | No) * P(No) ]
and noting that the quantities in the denominator are expanded using the independence assumption
in the same way as the term in the numerator:
P(Yes | sunny, cool, high, strong) = .0051 / (.0051 + .0207) = .1977
P(No | sunny, cool, high, strong) = .0207 / (.0051 + .0207) = .8023
As we can see, the Naive Bayes classifier gives a value of about 20% for playing tennis under the
described conditions and about 80% for not playing tennis, therefore the prediction is that no
tennis will be played on a day with these conditions.
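The whole calculation can be reproduced from these estimates. A minimal Python sketch; note that the Humidity conditionals and the remaining Temperature/Wind values are not listed in the text above, so the numbers used for them here are assumptions taken from the standard 14-example PlayTennis table:

import math

# Naive Bayes prediction for (sunny, cool, high humidity, strong wind).
priors = {"Yes": 9/14, "No": 5/14}
cond = {
    "Yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "No":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

def naive_bayes(observed):
    # Score each class by prior * product of the attribute conditionals,
    # then normalize so the two scores sum to 1.
    scores = {v: priors[v] * math.prod(cond[v][a] for a in observed) for v in priors}
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}

print(naive_bayes(["sunny", "cool", "high", "strong"]))
# roughly {'Yes': 0.20, 'No': 0.80}, in line with the worked example above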
In a Bayesian network, each node's probability function can be represented by a table of entries,
one entry for each of the possible combinations of its parents being true or false. Similar ideas
may be applied to undirected, and possibly cyclic, graphs; such networks are called Markov
networks.
Suppose that there are two events which could cause grass to be wet: either the sprinkler
is on or it is raining. Also, suppose that the rain has a direct effect on the use of the sprinkler
(namely that when it rains, the sprinkler is usually not turned on). Then the situation can be
modeled with a Bayesian network over three variables, each with two possible values, T (for true)
and F (for false).
The names of the variables are abbreviated to G = Grass wet (yes/no), S = Sprinkler turned on
(yes/no), and R = Raining (yes/no).
The model can answer questions like "What is the probability that it is raining, given the
grass is wet?" by using the conditional probability formula and summing over all nuisance
variables:
P(R = T | G = T) = P(G = T, R = T) / P(G = T)
   = [sum over S of P(G = T, S, R = T)] / [sum over S, R of P(G = T, S, R)]
Using the expansion of the joint probability function, P(G, S, R) = P(G | S, R) P(S | R) P(R),
and the conditional probabilities from the conditional probability tables (CPTs) stated in the
diagram, one can evaluate each term in the sums in the numerator and denominator.
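A brute-force enumeration of that query in Python; the CPT values below are the commonly used illustrative numbers for this sprinkler example (the diagram itself is not reproduced here), so treat them as assumptions:

# Enumeration over the joint P(G, S, R) = P(G | S, R) P(S | R) P(R).
from itertools import product

P_R = {True: 0.2, False: 0.8}
P_S_given_R = {True: {True: 0.01, False: 0.99},   # P(S | R = T)
               False: {True: 0.4, False: 0.6}}    # P(S | R = F)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}

def joint(g, s, r):
    pg = P_G_given_SR[(s, r)] if g else 1 - P_G_given_SR[(s, r)]
    return pg * P_S_given_R[r][s] * P_R[r]

# P(R=T | G=T) = sum_S P(G=T, S, R=T) / sum_{S,R} P(G=T, S, R)
numerator = sum(joint(True, s, True) for s in (True, False))
denominator = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
print(numerator / denominator)   # about 0.36 with these CPTs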
If, on the other hand, we wish to answer an interventional question, "What is the
probability that it would rain, given that we wet the grass?", the answer is governed by the
post-intervention joint distribution obtained by removing the factor P(G | S, R) from the
pre-intervention distribution. As expected, the probability of rain is unaffected by the action:
P(R | do(G = T)) = P(R). Predicting the effect of an action in this way is not always possible
when some variables are unobserved, as in most policy evaluation problems. The effect of the
action can still be predicted, however, whenever a criterion called the "back-door" criterion is
satisfied. It states that, if a set Z of nodes can be observed that d-separates[3] (or blocks) all
back-door paths from X to Y, then P(Y | do(X = x)) = sum over z of P(Y | X = x, Z = z) P(Z = z).
A back-door path is one that ends with an arrow into X. Sets that satisfy the back-door criterion
are called "sufficient" or "admissible." For example, the set Z = {R} is admissible for predicting
the effect of S = T on G, because R d-separates the (only) back-door path S ← R → G. However, if
S is not observed, there is no other set that d-separates this path, and the effect of turning the
sprinkler on (S = T) on the grass (G) cannot be predicted from passive observations. We then say
that P(G | do(S = T)) is not "identified." This reflects the fact that, lacking interventional
data, we cannot determine whether the observed dependence between S and G is due to a causal
connection or is spurious (an apparent dependence arising from a common cause, R).
A Bayesian network representation only needs to store at most n·2^k values (for n Boolean
variables, each with at most k parents), rather than the 2^n entries of the full joint
distribution. One advantage of Bayesian networks is that it is intuitively easier for a human to
understand (a sparse set of) direct dependencies and local distributions than a complete joint
distribution.
Decision Networks: Represents information about the agent’s current state, its possible actions,
the state that will result from the agent’s action, and the utility of that state.
A decision network is a belief network plus decision and utility nodes:
Chance nodes: represent random variables
Decision nodes: represent points where the decision-maker has a choice of actions
Utility nodes: represent the agent's utility function
Evaluating decision networks (a code sketch follows the procedure):
1. Set the evidence variables for the current state.
2. For each possible value of the decision node:
   (a) Set the decision node to that value.
   (b) Calculate the posterior probabilities for the parent nodes of the utility node, using a
       standard probabilistic inference algorithm.
   (c) Calculate the resulting utility for the action.
3. Return the action with the highest utility.
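A minimal sketch of that loop, where infer() and utility() are hypothetical stand-ins for a probabilistic inference routine and the utility table attached to the utility node:

# Sketch of decision-network evaluation.
def best_action(evidence, decision_values, infer, utility):
    best, best_eu = None, float("-inf")
    for d in decision_values:                       # (a) fix the decision node
        # (b) posterior over the utility node's parents, given evidence + decision
        posterior = infer(dict(evidence, decision=d))
        # (c) expected utility of taking action d
        eu = sum(p * utility(outcome, d) for outcome, p in posterior.items())
        if eu > best_eu:
            best, best_eu = d, eu
    return best                                     # 3. the action with highest utility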
The value of information. Not all available information is provided to the agent before it
makes its decision. One of the most important parts of decision making is knowing what
questions to ask.
Whether to conduct expensive and critical tests depends on two factors:
– Whether the different possible outcomes would make a significant difference to the
optimal course of action
– The likelihood of the various outcomes
Information value theory enables an agent to choose what information to acquire.
If the environment were deterministic, a solution would be easy: the agent would always
reach +1 with the moves [U, U, R, R, R]. Because actions are unreliable, a sequence of moves will
not always lead to the desired outcome. Let each action achieve the intended effect with
probability 0.8, but with probability 0.1 each the action moves the agent at right angles to the
intended direction (to either side). If the agent bumps into a wall, it stays in the same square.
Now the sequence [U, U, R, R, R] leads to the goal state with probability 0.8^5 = 0.32768. In
addition, the agent has a small chance of reaching the goal by accident by going the other way
around the obstacle, with probability 0.1^4 × 0.8, for a grand total of 0.32776.
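The arithmetic behind those two figures:

# Probability of reaching +1 with the move sequence [U, U, R, R, R].
p_direct = 0.8 ** 5                # every move has its intended effect
p_around = 0.1 ** 4 * 0.8          # slips sideways four times, going around the obstacle
print(p_direct, p_around, p_direct + p_around)   # approximately 0.32768, 8e-05, 0.32776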
A transition model specifies outcome probabilities for each action in each possible state
• Let P(s’ | s, a) denote the probability of reaching state s' if action a is done in
state s
• The transitions are Markovian in the sense that the probability of reaching s’
depends only on s and not the earlier states
• To specify the utility function for the agent
• The decision problem is sequential, so the utility function depends on a
sequence of states
• For now, we will simply stipulate that in each state s, the agent receives a
reward R(s), which may be positive or negative
For our particular example, the reward is -0.04 in all states except in the terminal
states
• The utility of an environment history is just (for now) the sum of rewards
received
• If the agent reaches the state +1, e.g., after ten steps, its total utility will be 0.6
• The small negative reward gives the agent an incentive to reach [4, 3] quickly
• A sequential decision problem for a fully observable environment with a Markovian
transition model and additive rewards is called a Markov decision process (MDP)
• An MDP is defined by the following four components:
• Initial state s0,
• A set Actions(s) of actions in each state,
• Transition model P(s’ | s, a), and
• Reward function R(s)
• As a solution to an MDP we cannot take a fixed action sequence, because the agent
might end up in a state other than the goal
• A solution must be a policy, which specifies what the agent should do for any state that
the agent might reach
• π(s) denotes the action recommended by the policy π for state s
• If the agent has a complete policy, then no matter what the outcome of any action, the
agent will always know what to do next. Each time a given policy is executed starting
from the initial state, the stochastic nature of the environment will lead to a different
environment history
• The quality of a policy is therefore measured by the expected utility of the possible
environment histories generated by the policy
• An optimal policy yields the highest expected utility
• A policy represents the agent function explicitly and is therefore a description of a
simple reflex agent
Value Iteration
• For calculating an optimal policy we calculate the utility of each state and then use the state
utilities to select an optimal action in each state
• The utility of a state is the expected utility of the state sequences that might follow it.
Obviously, the state sequences depend on the policy that is executed
• Let st be the state the agent is in after executing the policy for t steps; note that st is a
random variable
• R(s) is the short-term reward for being in s, whereas U(s) is the long-term total reward from s
onwards
• In our example grid the utilities are higher for states closer to the +1 exit, because fewer steps
are required to reach the exit
• Solving the Bellman equations simultaneously with the efficient techniques for systems of
linear equations does not work, because max is a nonlinear operation
• In the iterative approach we start with arbitrary initial values for the utilities, calculate the
right-hand side of the Bellman equation, U(s) = R(s) + max over a of [sum over s' of
P(s' | s, a) U(s')], and plug it into the left-hand side (the Bellman update; see the sketch
after this list)
• If we apply the Bellman update infinitely often, we are guaranteed
to reach an equilibrium, in which case the final utility values must
be solutions to the Bellman equations
• They are also the unique solutions, and the corresponding policy is optimal
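A compact sketch of value iteration; the interface is an assumption: T(s, a) returns a list of (probability, next_state) pairs, actions(s) lists the applicable actions, R(s) is the reward, and gamma is an optional discount (gamma = 1 matches the undiscounted presentation above):

# Value iteration: repeatedly apply the Bellman update until the utilities
# stop changing noticeably.
def value_iteration(states, actions, T, R, gamma=1.0, eps=1e-6):
    U = {s: 0.0 for s in states}                  # arbitrary initial utilities
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            best = max((sum(p * U[s2] for p, s2 in T(s, a)) for a in actions(s)),
                       default=0.0)               # terminal states have no actions
            U_new[s] = R(s) + gamma * best        # Bellman update
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:
            return U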
Policy Iteration
• Beginning from some initial policy π0, alternate:
• Policy evaluation: given a policy πi, calculate Ui = U^πi, the utility of each state if πi
were to be executed
• Policy improvement: calculate the new MEU policy πi+1, using one-step look-ahead
based on Ui
• The algorithm terminates when the policy improvement step yields no change in
utilities
• At this point, we know that the utility function Ui is a fixed point of the Bellman update
and a solution to the Bellman equations, so πi must be an optimal policy
• Because there are only finitely many policies for a finite state space, and each iteration
can be shown to yield a better policy, policy iteration must terminate
• Because at the ith iteration the policy πi specifies the action πi(s) in state s, there is no
need to maximize over actions in policy evaluation
• We have a simplified version of the Bellman equation:
Ui(s) = R(s) + sum over s' of P(s' | s, πi(s)) Ui(s')
• For example: Ui(1,1) = −0.04 + 0.8 Ui(1,2) + 0.1 Ui(1,1) + 0.1 Ui(2,1),
Ui(1,2) = −0.04 + 0.8 Ui(1,3) + 0.2 Ui(1,2), etc.
• Now the nonlinear max has been removed, and we have a system of linear equations
• A system of n linear equations in n unknowns can be solved exactly in time O(n^3) by
standard linear algebra methods
Instead of using a cubic amount of time to reach the exact solution (for large state
spaces), we can instead perform some number of simplified value iteration steps to get a
reasonably good approximation of the utilities:
Ui+1(s) = R(s) + sum over s' of P(s' | s, πi(s)) Ui(s')
• This algorithm is called modified policy iteration (see the sketch at the end of this section)
• In asynchronous policy iteration we pick any subset of the states on each iteration and
apply either policy improvement or simplified value iteration to that subset
• Given certain conditions on the initial policy and initial utility function, asynchronous
policy iteration is guaranteed to converge to an optimal policy
• We can design, e.g., algorithms that concentrate on updating the values of states that are
likely to be reached by a good policy
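Finally, a sketch of policy iteration that uses k rounds of the simplified, max-free Bellman update for policy evaluation, i.e. modified policy iteration; the interface assumptions are the same as in the value-iteration sketch above:

# Modified policy iteration: approximate evaluation, then greedy improvement.
import random

def expected_utility(a, s, U, T):
    return sum(p * U[s2] for p, s2 in T(s, a))

def policy_iteration(states, actions, T, R, gamma=1.0, k=20):
    pi = {s: random.choice(actions(s)) for s in states if actions(s)}
    U = {s: 0.0 for s in states}
    while True:
        for _ in range(k):                        # approximate policy evaluation
            U = {s: R(s) + (gamma * expected_utility(pi[s], s, U, T)
                            if s in pi else 0.0)
                 for s in states}
        unchanged = True
        for s in pi:                              # policy improvement
            best = max(actions(s), key=lambda a: expected_utility(a, s, U, T))
            if expected_utility(best, s, U, T) > expected_utility(pi[s], s, U, T):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U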