CSE 473 MDP Notes
1 Non-Deterministic Search
Picture a robot runner, coming to the end of his first ever marathon. Though it seems likely he will
complete the race and claim the accompanying everlasting glory, it’s by no means guaranteed. He
may pass out from exhaustion or misstep and slip and fall, tragically breaking both of his legs.
Even more unlikely, a literally earth-shattering earthquake may spontaneously occur, swallowing
up the runner mere inches before he crosses the finish line. Such possibilities add a degree of
uncertainty to the runner’s actions, and it’s this uncertainty that will be the subject of the following
discussion. In the first and second note, we talked about traditional search problems and how to
solve them; then, in the third note, we changed our model to account for adversaries and other
agents in the world that influenced our path to goal states. Now, we’ll change our model again to
account for another influencing factor – the dynamics of the world itself. The environment in which
an agent is placed may subject the agent’s actions to being nondeterministic, which means that
there are multiple possible successor states that can result from an action taken in some state. This
is, in fact, the case in many card games such as poker or blackjack, where there exists an inherent
uncertainty from the randomness of card dealing. Such problems where the world poses a degree
of uncertainty are known as nondeterministic search problems, and can be solved with models
known as Markov decision processes, or MDPs.
Figure 1: The Grid World example from lecture, where the agent moves between adjacent squares non-deterministically and receives rewards depending on the states it reaches.
A Markov decision process is defined by several properties:
• A set of states S. States in MDPs are represented in the same way as in traditional search problems.
• A set of actions A. Actions in MDPs are also represented in the same way as in traditional search problems.
• A start state.
• Possibly one or more terminal states.
• A transition function T(s, a, s′), which gives the probability that taking action a from state s lands the agent in successor state s′.
• A discount factor γ (discussed shortly).
• A reward function R(s, a, s′). Typically, MDPs are modeled with small "living" rewards at each step to reward an agent’s survival, along with large rewards for arriving at a terminal state. Rewards may be positive or negative depending on whether or not they benefit the agent in question, and the agent’s objective is naturally to acquire the maximum reward possible before arriving at some terminal state. Sometimes, the reward function depends only on states, in which case it is denoted R(s) or R(s′).
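To make the components above concrete, here is a minimal sketch of how an MDP might be encoded in Python. The variable names and the toy transition and reward numbers are illustrative assumptions, not values taken from these notes.

```python
from typing import Dict, List, Tuple

State, Action = str, str

states: List[State] = ["A", "B", "DONE"]      # hypothetical states
actions: List[Action] = ["left", "right"]      # hypothetical actions
start_state: State = "A"
terminal_states = {"DONE"}
gamma = 0.9                                    # assumed discount factor

# T[(s, a)] lists the possible successors of taking a in s as (s', prob)
# pairs; the probabilities for each (s, a) should sum to 1.
T: Dict[Tuple[State, Action], List[Tuple[State, float]]] = {
    ("A", "right"): [("B", 0.8), ("A", 0.2)],
    ("A", "left"):  [("A", 1.0)],
    ("B", "right"): [("DONE", 1.0)],
    ("B", "left"):  [("A", 1.0)],
}

# R(s, a, s'): a small "living" reward per step and a large terminal reward.
def R(s: State, a: Action, s2: State) -> float:
    return 10.0 if s2 == "DONE" else -0.1
```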
As an example, consider the racecar MDP. There are three possible states, S = {cool, warm, overheated}, and two possible actions, A = {slow, fast}. Just like in a state-space graph, each of the three states is represented by a node,
with edges representing actions. Overheated is a terminal state, since once a racecar agent arrives
at this state, it can no longer perform any actions for further rewards (it’s a sink state in the MDP
and has no outgoing edges). Notably, for nondeterministic actions, there are multiple edges repre-
senting the same action from the same state with differing successor states. Each edge is annotated
not only with the action it represents, but also a transition probability and corresponding reward.
These are summarized below:
Additionally, knowing that an agent’s goal is to maximize its reward across all timesteps, we can correspondingly express this mathematically as a maximization of the following utility function:

U([s_0, s_1, s_2, ...]) = R(s_0, a_0, s_1) + R(s_1, a_1, s_2) + R(s_2, a_2, s_3) + ...
Markov decision processes, like state-space graphs, can be unraveled into search trees. Uncer-
tainty is modeled in these search trees with Q-states, also known as action states, essentially iden-
tical to expectimax chance nodes. This is a fitting choice, as Q-states use probabilities to model
the uncertainty that the environment will land an agent in a given state just as expectimax chance
nodes use probabilities to model the uncertainty that adversarial agents will land our agent in a
given state through the move these agents select. The Q-state corresponding to having taken action a from state s is notated as the tuple (s, a).
Observe the unraveled search tree for our racecar, truncated to depth-2:
The green nodes represent Q-states, where an action has been taken from a state but has yet to be
resolved into a successor state. It’s important to understand that agents spend zero timesteps in
Q-states, and that they are simply a construct created for ease of representation and development
of MDP algorithms.
Noting that the above definition of a discounted utility function looks similar to a geometric series
with ratio γ, we can prove that it’s guaranteed to be finite-valued as long as the constraint |γ| < 1
(where |n| denotes the absolute value operator) is met through the following logic:
V([s_0, s_1, s_2, ...]) = R(s_0, a_0, s_1) + γR(s_1, a_1, s_2) + γ^2 R(s_2, a_2, s_3) + ...
                        = Σ_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) ≤ Σ_{t=0}^{∞} γ^t R_max = R_max / (1 − γ)
where Rmax is the maximum possible reward attainable at any given timestep in the MDP. Typi-
cally, γ is selected strictly from the range 0 < γ < 1, since values in the range −1 < γ ≤ 0
are simply not meaningful in most real-world situations–a negative value for γ means the reward
for a state s would flip-flop between positive and negative values at alternating timesteps.
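As a quick sanity check of the bound above, the short snippet below computes a discounted sum of rewards and compares it to R_max / (1 − γ). The reward sequence and the value of γ are made-up numbers for illustration.

```python
gamma = 0.9
rewards = [2.0, -1.0, 3.0, 0.5]      # hypothetical R(s_t, a_t, s_{t+1}) values
r_max = 3.0                           # the largest single-step reward above

discounted_utility = sum(gamma ** t * r for t, r in enumerate(rewards))
bound = r_max / (1 - gamma)           # R_max / (1 - gamma) = 30.0

print(round(discounted_utility, 4))   # 3.8945
print(discounted_utility <= bound)    # True
```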
Markovianness
Markov decision processes are "markovian" in the sense that they satisfy the Markov property,
or memoryless property, which states that the future and the past are conditionally independent,
given the present. Intuitively, this means that, if we know the present state, knowing the past
doesn’t give us any more information about the future. To express this mathematically, consider
an agent that has visited states s0 , s1 , ..., st after taking actions a0 , a1 , ..., at−1 in some MDP, and
has just taken action at . The probability that this agent then arrives at state st+1 given their history
of previous states visited and actions taken can be written as follows:
P (St+1 = st+1 |St = st , At = at , St−1 = st−1 , At−1 = at−1 , ..., S0 = s0 )
where each St denotes the random variable representing our agent’s state and At denotes the
random variable representing the action our agent takes at time t. The Markov property states
that the above probability can be simplified as follows:
P (St+1 = st+1 |St = st , At = at , St−1 = st−1 , At−1 = at−1 , ..., S0 = s0 ) = P (St+1 = st+1 |St = st , At = at )
which is "memoryless" in the sense that the probability of arriving in a state s′ at time t+1 depends
only on the state s and action a taken at time t, not on any earlier states or actions. In fact, it is these
memoryless probabilities which are encoded by the transition function: T (s, a, s′ ) = P (s′ |s, a) .
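In code, this memorylessness shows up as a successor distribution indexed only by the current state and action. The sketch below samples s′ from a hypothetical transition table; note that no history beyond (s, a) is ever consulted. The table and its numbers are illustrative assumptions.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical transition table: P(s' | s, a) encoded as (s', prob) pairs.
T: Dict[Tuple[str, str], List[Tuple[str, float]]] = {
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],   # illustrative numbers
}

def sample_successor(s: str, a: str) -> str:
    """Draw s' ~ T(s, a, .); earlier states and actions are irrelevant."""
    successors = [s2 for s2, _ in T[(s, a)]]
    probs = [p for _, p in T[(s, a)]]
    return random.choices(successors, weights=probs, k=1)[0]

print(sample_successor("cool", "fast"))   # "cool" or "warm", each w.p. 0.5
```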
With some investigation, it’s not hard to determine that Policy 2 is optimal. Following the policy from each start state until taking the action a = Exit yields the following rewards:
Start State Reward
a 10
b 1
c 0.1
d 0.1
e 1
We’ll now learn how to solve such MDPs (and much more complex ones!) algorithmically
using the Bellman equation for Markov decision processes:

V∗(s) = max_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)]

Before we begin interpreting what this means, let’s also define the equation for the optimal value
of a Q-state (more commonly known as an optimal Q-value):
Q∗(s, a) = Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)]
Note that this second definition allows us to reexpress the Bellman equation as

V∗(s) = max_a Q∗(s, a)

which is a dramatically simpler quantity. The Bellman equation is an example of a dynamic pro-
gramming equation, an equation that decomposes a problem into smaller subproblems via an in-
herent recursive structure. We can see this inherent recursion in the equation for the Q-value of
a state, in the term [R(s, a, s′ ) + γV ∗ (s′ )]. This term represents the total utility an agent receives
by first taking a from s and arriving at s′ and then acting optimally henceforth. The immediate
reward from the action a taken, R(s, a, s′ ), is added to the optimal discounted sum of rewards
attainable from s′ , V ∗ (s′ ), which is discounted by γ to account for the passage of one timestep in
taking action a. Though in most cases there exists a vast number of possible sequences of states
and actions from s′ to some terminal state, all this detail is abstracted away and encapsulated in a
single recursive value, V ∗ (s′ ).
We can now take another step outwards and consider the full equation for Q-value. Knowing
[R(s, a, s′ ) + γV ∗ (s′ )] represents the utility attained by acting optimally after arriving in state s′
from Q-state (s, a), it becomes evident that the quantity
Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)]
is simply a weighted sum of utilities, with each utility weighted by its probability of occurrence.
This is by definition the expected utility of acting optimally from Q-state (s, a) onwards! This com-
pletes our analysis and gives us enough insight to interpret the full Bellman equation–the optimal
value of a state, V ∗ (s), is simply the maximum expected utility over all possible actions from s. Com-
puting maximum expected utility for a state s is essentially the same as running expectimax–we
first compute the expected utility from each Q-state (s, a) (equivalent to computing the value of
chance nodes), then compute the maximum over these nodes to compute the maximum expected
utility (equivalent to computing the value of a maximizer node).
One final note on the Bellman equation – its usage is as a condition for optimality. In other
words, if we can somehow determine a value V (s) for every state s ∈ S such that the Bellman
equation holds true for each of these states, we can conclude that these values are the optimal
values for their respective states. Indeed, satisfying this condition implies ∀s ∈ S, V (s) = V ∗ (s).
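The sketch below spells out the Q-value and Bellman computations in Python, assuming the dictionary-based T and callable R interface from the earlier sketch; that interface is an assumption of these examples, not something prescribed by the notes.

```python
from typing import Callable, Dict, List, Tuple

State, Action = str, str
Transitions = Dict[Tuple[State, Action], List[Tuple[State, float]]]
Reward = Callable[[State, Action, State], float]

def q_value(s: State, a: Action, V: Dict[State, float],
            T: Transitions, R: Reward, gamma: float) -> float:
    """Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]."""
    return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])

def bellman_value(s: State, actions: List[Action], V: Dict[State, float],
                  T: Transitions, R: Reward, gamma: float) -> float:
    """V(s) = max over a of Q(s, a); terminal states with no actions get 0."""
    qs = [q_value(s, a, V, T, R, gamma) for a in actions if (s, a) in T]
    return max(qs) if qs else 0.0
```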
4 Value Iteration
Now that we have a framework to test for optimality of the values of states in an MDP, the nat-
ural follow-up question to ask is how to actually compute these optimal values. To answer this
question, we need time-limited values (the natural result of enforcing finite horizons). The time-
limited value for a state s with a time-limit of k timesteps is denoted Vk (s), and represents the
maximum expected utility attainable from s given that the Markov decision process under con-
sideration terminates in k timesteps. Equivalently, this is what a depth-k expectimax run on the
search tree for an MDP returns.
Value iteration is a dynamic programming algorithm that uses an iteratively longer time limit
to compute time-limited values until convergence (that is, until the V values are the same for each
state as they were in the past iteration: ∀s, Vk+1 (s) = Vk (s)). It operates as follows:
1. ∀s ∈ S, initialize V0 (s) = 0. This should be intuitive, since setting a time limit of 0 timesteps
means no actions can be taken before termination, and so no rewards can be acquired.
2. Repeat the following update rule until convergence:

∀s ∈ S, V_{k+1}(s) ← max_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV_k(s′)]

At iteration k of value iteration, we use the time-limited values with limit k for each state to generate the time-limited values with limit (k + 1). In essence, we use computed solutions to subproblems (all the V_k(s)) to iteratively build up solutions to larger subproblems (all the V_{k+1}(s)); this is what makes value iteration a dynamic programming algorithm.
The complexity of each iteration of value iteration is O(|S|²|A|), and the policy may converge long before the values do.
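Here is a compact sketch of the full loop, reusing the assumed T / R interface from the earlier sketches. The tolerance-based stopping test is one common convergence criterion; the notes themselves only require that consecutive value estimates agree.

```python
from typing import Callable, Dict, List, Tuple

def value_iteration(states: List[str], actions: List[str],
                    T: Dict[Tuple[str, str], List[Tuple[str, float]]],
                    R: Callable[[str, str, str], float],
                    gamma: float, tol: float = 1e-8) -> Dict[str, float]:
    V = {s: 0.0 for s in states}                       # step 1: V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            # V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_k(s')]
            qs = [sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
                  for a in actions if (s, a) in T]
            V_new[s] = max(qs) if qs else 0.0          # terminal states stay 0
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new                               # values have converged
        V = V_new
```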
Let’s see a few updates of value iteration in practice by revisiting our racecar MDP from earlier,
introducing a discount factor of γ = 1:
Similarly, we can repeat the procedure to compute a second round of updates with our newfound
values for V1 (s) to compute V2 (s).
It’s worthwhile to observe that V ∗ (s) for any terminal state must be 0, since no actions can ever be
taken from any terminal state to reap any rewards.
5 Policy Iteration
Recall that our ultimate goal in solving an MDP is to determine an optimal policy. This can be done
once all optimal values for states are determined using a method called policy extraction. The
intuition behind policy extraction is very simple: if you’re in a state s, you should take the action
a which yields the maximum expected utility. Not surprisingly, a is the action that takes us to the
Q-state with maximum Q-value, allowing for a formal definition of the optimal policy:
∀s ∈ S, π∗(s) = argmax_a Q∗(s, a) = argmax_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)]
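A direct translation of policy extraction into Python, again under the assumed T / R interface from the earlier sketches; it simply takes the argmax over Q-values for each state.

```python
from typing import Callable, Dict, List, Optional, Tuple

def extract_policy(states: List[str], actions: List[str],
                   T: Dict[Tuple[str, str], List[Tuple[str, float]]],
                   R: Callable[[str, str, str], float],
                   gamma: float, V: Dict[str, float]) -> Dict[str, Optional[str]]:
    policy: Dict[str, Optional[str]] = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in T:
                continue                               # action unavailable in s
            q = sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
            if q > best_q:
                best_a, best_q = a, q                  # keep the argmax action
        policy[s] = best_a                             # None for terminal states
    return policy
```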
Value iteration can be quite slow. At each iteration, we must update the values of all |S| states
(where |n| refers to the cardinality operator), each of which requires iteration over all |A| actions
as we compute the Q-value for each action. The computation of each of these Q-values, in turn,
requires iteration over each of the |S| states again, leading to a poor runtime of O(|S|²|A|). Addi-
tionally, when all we want to determine is the optimal policy for the MDP, value iteration tends to do a lot of unnecessary work, since the policy often converges well before the values themselves do. Policy iteration addresses this by computing and improving policies directly. It operates as follows:
1. Define an initial policy. This can be arbitrary, but policy iteration will converge faster the closer the initial policy is to the eventual optimal policy.
2. Repeat the following until convergence:
• Evaluate the current policy with policy evaluation. For a policy π, policy evaluation means computing V^π(s) for all states s, where V^π(s) is the expected utility of starting in state s and following π:
V^π(s) = Σ_{s′} T(s, π(s), s′)[R(s, π(s), s′) + γV^π(s′)]
Define the policy at iteration i of policy iteration as π_i. Since we are fixing a single action for each state, we no longer need the max operator, which effectively leaves us with a system of |S| equations generated by the above rule. Each V^{π_i}(s) can then be computed by simply solving this system. Alternatively, we can also compute V^{π_i}(s) by using the following update rule until convergence, just like in value iteration:
V_{k+1}^{π_i}(s) ← Σ_{s′} T(s, π_i(s), s′)[R(s, π_i(s), s′) + γV_k^{π_i}(s′)]
• Once the current policy has been evaluated, use policy improvement to generate a better policy: perform policy extraction with respect to V^{π_i}, i.e. π_{i+1}(s) = argmax_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV^{π_i}(s′)].
If π_{i+1} = π_i, the algorithm has converged, and we can conclude that π_{i+1} = π_i = π∗.
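Putting the two pieces together, the sketch below alternates iterative policy evaluation with greedy policy improvement until the policy stops changing; it assumes the same hypothetical T / R interface as the earlier sketches.

```python
from typing import Callable, Dict, List, Tuple

def policy_iteration(states: List[str], actions: List[str],
                     T: Dict[Tuple[str, str], List[Tuple[str, float]]],
                     R: Callable[[str, str, str], float],
                     gamma: float, eval_tol: float = 1e-8):
    # 1. Arbitrary initial policy: first available action in each state.
    policy = {s: next((a for a in actions if (s, a) in T), None) for s in states}
    while True:
        # Policy evaluation: iterate the fixed-policy update to convergence.
        V = {s: 0.0 for s in states}
        while True:
            V_new = {}
            for s in states:
                a = policy[s]
                V_new[s] = 0.0 if a is None else sum(
                    p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
            if max(abs(V_new[s] - V[s]) for s in states) < eval_tol:
                V = V_new
                break
            V = V_new
        # Policy improvement: act greedily with respect to V^{pi_i}.
        new_policy = {}
        for s in states:
            scored = [(sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)]), a)
                      for a in actions if (s, a) in T]
            new_policy[s] = max(scored)[1] if scored else None
        if new_policy == policy:          # pi_{i+1} = pi_i  =>  converged
            return new_policy, V
        policy = new_policy
```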