
UW CSE 473 Notes 4 Notes by: Yizhong Wang, Michael Lee

1 Non-Deterministic Search
Picture a robot runner, coming to the end of his first ever marathon. Though it seems likely he will
complete the race and claim the accompanying everlasting glory, it’s by no means guaranteed. He
may pass out from exhaustion or misstep and slip and fall, tragically breaking both of his legs.
Even more unlikely, a literally earth-shattering earthquake may spontaneously occur, swallowing
up the runner mere inches before he crosses the finish line. Such possibilities add a degree of
uncertainty to the runner’s actions, and it’s this uncertainty that will be the subject of the following
discussion. In the first and second notes, we talked about traditional search problems and how to solve them; then, in the third note, we changed our model to account for adversaries and other agents in the world that influenced our path to goal states. Now, we'll change our model again to account for another influencing factor – the dynamics of the world itself. The environment in which an agent is placed may make the agent's actions nondeterministic, meaning that there are multiple possible successor states that can result from an action taken in some state. This
is, in fact, the case in many card games such as poker or blackjack, where there exists an inherent
uncertainty from the randomness of card dealing. Such problems where the world poses a degree
of uncertainty are known as nondeterministic search problems, and can be solved with models
known as Markov decision processes, or MDPs.

Figure 1: The Grid World example mentioned in the lecture, where the agent moves to adjacent squares non-deterministically and receives rewards based on the states it reaches.


2 Markov Decision Processes


A Markov Decision Process is defined by several properties:
• A set of states S. States in MDPs are represented in the same way as states in traditional
search problems.

• A set of actions A. Actions in MDPs are also represented in the same way as in traditional
search problems.

• A start state.

• Possibly one or more terminal states.

• Possibly a discount factor γ. We’ll cover discount factors shortly.

• A transition function T(s, a, s′). Since we have introduced the possibility of nondeterministic actions, we need a way to delineate the likelihood of the possible outcomes after taking any given action from any given state. The transition function for an MDP does exactly this - it's a probability function which represents the probability that an agent taking an action a ∈ A from a state s ∈ S ends up in a state s′ ∈ S.

• A reward function R(s, a, s′ ). Typically, MDPs are modeled with small "living" rewards at
each step to reward an agent’s survival, along with large rewards for arriving at a terminal
state. Rewards may be positive or negative depending on whether or not they benefit the
agent in question, and the agent’s objective is naturally to acquire the maximum reward
possible before arriving at some terminal state. Sometimes, the reward function can depend
only on the states, denoted as R(s) or R(s′ ).
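To make these components concrete, here is one possible way to bundle them in code. This is only an illustrative sketch in Python (the class name, field names, and dictionary layout are our assumptions, not notation from these notes); terminal states are simply states with no available actions.

from typing import NamedTuple

class MDP(NamedTuple):
    states: list    # the set of states S
    actions: dict   # maps each state s to the list of actions available in s
                    # (an empty list marks a terminal state)
    T: dict         # maps (s, a) to a dictionary {s': T(s, a, s')}
    R: dict         # maps (s, a, s') to the reward R(s, a, s')
    start: object   # the start state
    gamma: float    # the discount factor (discussed shortly)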

The Racecar Example


Constructing an MDP for a situation is quite similar to constructing a state-space graph for a search problem, with a couple of additional caveats. Consider the motivating example of a racecar:


There are three possible states, S = {cool, warm, overheated}, and two possible actions A = {slow, fast}. Just like in a state-space graph, each of the three states is represented by a node,
with edges representing actions. Overheated is a terminal state, since once a racecar agent arrives
at this state, it can no longer perform any actions for further rewards (it’s a sink state in the MDP
and has no outgoing edges). Notably, for nondeterministic actions, there are multiple edges repre-
senting the same action from the same state with differing successor states. Each edge is annotated
not only with the action it represents, but also a transition probability and corresponding reward.
These are summarized below:

• Transition Function T(s, a, s′):
  – T(cool, slow, cool) = 1
  – T(warm, slow, cool) = 0.5
  – T(warm, slow, warm) = 0.5
  – T(cool, fast, cool) = 0.5
  – T(cool, fast, warm) = 0.5
  – T(warm, fast, overheated) = 1

• Reward Function R(s, a, s′):
  – R(cool, slow, cool) = 1
  – R(warm, slow, cool) = 1
  – R(warm, slow, warm) = 1
  – R(cool, fast, cool) = 2
  – R(cool, fast, warm) = 2
  – R(warm, fast, overheated) = −10
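As an illustration, the table above can be written down directly with the MDP container sketched earlier. The dictionary encoding and the choice of cool as the start state are assumptions for illustration; the probabilities and rewards are exactly those listed above, and γ = 1 matches the worked value iteration example later in these notes.

T = {
    ("cool", "slow"): {"cool": 1.0},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}
R = {
    ("cool", "slow", "cool"): 1,
    ("warm", "slow", "cool"): 1,
    ("warm", "slow", "warm"): 1,
    ("cool", "fast", "cool"): 2,
    ("cool", "fast", "warm"): 2,
    ("warm", "fast", "overheated"): -10,
}
racecar = MDP(
    states=["cool", "warm", "overheated"],
    actions={"cool": ["slow", "fast"], "warm": ["slow", "fast"], "overheated": []},
    T=T, R=R, start="cool", gamma=1.0,   # start state assumed; gamma chosen to match the later example
)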
We represent the movement of an agent through different MDP states over time with discrete
timesteps, defining st ∈ S and at ∈ A as the state in which an agent exists and the action which an
agent takes at timestep t respectively. An agent starts in state s0 at timestep 0, and takes an action
at every timestep. The movement of an agent through an MDP can thus be modeled as follows:
s0 −a0→ s1 −a1→ s2 −a2→ s3 −a3→ ...

Additionally, knowing that an agent's goal is to maximize its reward across all timesteps, we can
correspondingly express this mathematically as a maximization of the following utility function:

V ([s0 , a0 , s1 , a1 , s2 , ...]) = R(s0 , a0 , s1 ) + R(s1 , a1 , s2 ) + R(s2 , a2 , s3 ) + ...

Markov decision processes, like state-space graphs, can be unraveled into search trees. Uncer-
tainty is modeled in these search trees with Q-states, also known as action states, essentially iden-
tical to expectimax chance nodes. This is a fitting choice, as Q-states use probabilities to model
the uncertainty that the environment will land an agent in a given state just as expectimax chance
nodes use probabilities to model the uncertainty that adversarial agents will land our agent in a
given state through the moves these agents select. The Q-state represented by having taken action a from state s is notated as the tuple (s, a).
Observe the unraveled search tree for our racecar, truncated to depth-2:


The green nodes represent Q-states, where an action has been taken from a state but has yet to be
resolved into a successor state. It’s important to understand that agents spend zero timesteps in
Q-states, and that they are simply a construct created for ease of representation and development
of MDP algorithms.

Finite Horizons and Discounting


There is an inherent problem with our racecar MDP - we haven’t placed any time constraints on
the number of timesteps for which a racecar can take actions and collect rewards. With our current
formulation, it could routinely choose a = slow at every timestep forever, safely and effectively
obtaining infinite reward without any risk of overheating. This is prevented by the introduction
of finite horizons and/or discount factors. An MDP enforcing a finite horizon is simple - it essen-
tially defines a "lifetime" for agents, which gives them some set number of timesteps n to accrue
as much reward as they can before being automatically terminated. We’ll return to this concept
shortly.
Discount factors are slightly more complicated, and are introduced to model an exponential
decay in the value of rewards over time. Concretely, with a discount factor of γ, taking action at
from state st at timestep t and ending up in state st+1 results in a reward of γ t R(st , at , st+1 ) instead
of just R(st , at , st+1 ). Now, instead of maximizing the additive utility

V ([s0 , a0 , s1 , a1 , s2 , ...]) = R(s0 , a0 , s1 ) + R(s1 , a1 , s2 ) + R(s2 , a2 , s3 ) + ...

we attempt to maximize discounted utility

V ([s0 , a0 , s1 , a1 , s2 , ...]) = R(s0 , a0 , s1 ) + γR(s1 , a1 , s2 ) + γ 2 R(s2 , a2 , s3 ) + ...

Noting that the above definition of a discounted utility function looks similar to a geometric series
with ratio γ, we can prove that it's guaranteed to be finite-valued as long as the constraint |γ| < 1 (where | · | denotes absolute value) is met, through the following logic:
V([s0, a0, s1, a1, s2, ...]) = R(s0, a0, s1) + γR(s1, a1, s2) + γ^2 R(s2, a2, s3) + ...
                             = Σ_{t=0}^{∞} γ^t R(st, at, st+1) ≤ Σ_{t=0}^{∞} γ^t Rmax = Rmax / (1 − γ)

where Rmax is the maximum possible reward attainable at any given timestep in the MDP. Typically, γ is selected strictly from the range 0 < γ < 1, since values in the range −1 < γ ≤ 0 are simply not meaningful in most real-world situations – a negative value for γ means the reward for a state s would flip-flop between positive and negative values at alternating timesteps.
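As a small sanity check of the bound above, the following helper (ours, not from the notes) computes the discounted utility of a finite list of per-step rewards; for 0 < γ < 1, even very long streams of Rmax stay below Rmax/(1 − γ).

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite prefix [R(s0,a0,s1), R(s1,a1,s2), ...]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With Rmax = 2 and gamma = 0.5 the bound is Rmax / (1 - gamma) = 4:
# discounted_return([2] * 50, 0.5) -> 3.999999..., just below 4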

Markovianness
Markov decision processes are "markovian" in the sense that they satisfy the Markov property,
or memoryless property, which states that the future and the past are conditionally independent,
given the present. Intuitively, this means that, if we know the present state, knowing the past
doesn’t give us any more information about the future. To express this mathematically, consider
an agent that has visited states s0 , s1 , ..., st after taking actions a0 , a1 , ..., at−1 in some MDP, and
has just taken action at . The probability that this agent then arrives at state st+1 given their history
of previous states visited and actions taken can be written as follows:
P (St+1 = st+1 |St = st , At = at , St−1 = st−1 , At−1 = at−1 , ..., S0 = s0 )
where each St denotes the random variable representing our agent’s state and At denotes the
random variable representing the action our agent takes at time t. The Markov property states
that the above probability can be simplified as follows:
P (St+1 = st+1 |St = st , At = at , St−1 = st−1 , At−1 = at−1 , ..., S0 = s0 ) = P (St+1 = st+1 |St = st , At = at )
which is "memoryless" in the sense that the probability of arriving in a state s′ at time t+1 depends
only on the state s and action a taken at time t, not on any earlier states or actions. In fact, it is these
memoryless probabilities which are encoded by the transition function: T (s, a, s′ ) = P (s′ |s, a) .
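In code, the Markov property shows up as the fact that simulating one step of the environment needs only the current state and the chosen action, never the history. A minimal sketch, reusing the dictionary-based transition function from the racecar example (the helper name is ours):

import random

def sample_next_state(T, s, a):
    """Draw s' with probability T(s, a, s'); the agent's history is never consulted."""
    successors = T[(s, a)]                    # {s': P(s' | s, a)}
    states, probs = zip(*successors.items())
    return random.choices(states, weights=probs)[0]

# sample_next_state(T, "cool", "fast") returns "cool" or "warm", each with probability 0.5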

3 Solving Markov Decision Processes


Recall that in deterministic, non-adversarial search, solving a search problem means finding an
optimal plan to arrive at a goal state. Solving a Markov decision process, on the other hand,
means finding an optimal policy π ∗ : S → A, a function mapping each state s ∈ S to an action
a ∈ A. An explicit policy π defines a reflex agent - given a state s, an agent at s implementing π
will select a = π(s) as the appropriate action to make without considering future consequences
of its actions. An optimal policy is one that, if followed by the implementing agent, will yield the maximum expected total reward or utility.
Consider the following MDP with S = {a, b, c, d, e}, A = {East, W est, Exit} (with Exit being
a valid action only in states a and e and yielding rewards of 10 and 1 respectively), a discount
factor γ = 0.1, and deterministic transitions:


Two potential policies for this MDP are as follows:

(a) Policy 1 (b) Policy 2

With some investigation, it's not hard to determine that Policy 2 is optimal. Following this policy until taking the Exit action yields the following discounted rewards for each start state:
Start State Reward
a 10
b 1
c 0.1
d 0.1
e 1
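These numbers can be reproduced with a few lines of code. Since the figure is not shown here, the layout and Policy 2 are assumptions: states a through e sit in a row, East and West deterministically move one cell, Exit is available only at a and e, and Policy 2 maps a to Exit, b and c to West, d to East, and e to Exit.

policy2 = {"a": "Exit", "b": "West", "c": "West", "d": "East", "e": "Exit"}
step = {("b", "West"): "a", ("c", "West"): "b", ("d", "East"): "e"}   # assumed deterministic moves
exit_reward = {"a": 10, "e": 1}
gamma = 0.1

def policy2_return(s):
    """Discounted reward collected by following Policy 2 from s until Exit."""
    t = 0
    while policy2[s] != "Exit":
        s = step[(s, policy2[s])]   # intermediate moves earn no reward here
        t += 1
    return (gamma ** t) * exit_reward[s]

# policy2_return("c") ≈ 0.1 and policy2_return("b") ≈ 1.0, matching the table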
We’ll now learn how to solve such MDPs (and much more complex ones!) algorithmically
using the Bellman equation for Markov decision processes.

The Bellman Equation


In order to talk about the Bellman equation for MDPs, we must first introduce two new mathe-
matical quantities:
• The optimal value of a state s, V ∗ (s) – the optimal value of s is the expected value of the utility an optimally-behaving agent that starts in s will receive, over the rest of the agent's lifetime. Note that some texts denote the same quantity by U∗(s).
• The optimal value of a Q-state (s, a), Q∗ (s, a) - the optimal value of (s, a) is the expected
value of the utility an agent receives after starting in s, taking a, and acting optimally hence-
forth.
Using these two new quantities and the other MDP quantities discussed earlier, the Bellman equa-
tion is defined as follows:
V∗(s) = max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ V∗(s′)]


Before we begin interpreting what this means, let’s also define the equation for the optimal value
of a Q-state (more commonly known as an optimal Q-value):
Q∗(s, a) = Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ V∗(s′)]

Note that this second definition allows us to reexpress the Bellman equation as

V∗(s) = max_a Q∗(s, a)

which is a dramatically simpler quantity. The Bellman equation is an example of a dynamic pro-
gramming equation, an equation that decomposes a problem into smaller subproblems via an in-
herent recursive structure. We can see this inherent recursion in the equation for the Q-value of
a state, in the term [R(s, a, s′ ) + γV ∗ (s′ )]. This term represents the total utility an agent receives
by first taking a from s and arriving at s′ and then acting optimally henceforth. The immediate
reward from the action a taken, R(s, a, s′ ), is added to the optimal discounted sum of rewards
attainable from s′ , V ∗ (s′ ), which is discounted by γ to account for the passage of one timestep in
taking action a. Though in most cases there exists a vast number of possible sequences of states
and actions from s′ to some terminal state, all this detail is abstracted away and encapsulated in a
single recursive value, V ∗ (s′ ).
We can now take another step outwards and consider the full equation for Q-value. Knowing
[R(s, a, s′ ) + γV ∗ (s′ )] represents the utility attained by acting optimally after arriving in state s′
from Q-state (s, a), it becomes evident that the quantity
Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ V∗(s′)]

is simply a weighted sum of utilities, with each utility weighted by its probability of occurrence.
This is by definition the expected utility of acting optimally from Q-state (s, a) onwards! This com-
pletes our analysis and gives us enough insight to interpret the full Bellman equation–the optimal
value of a state, V ∗ (s), is simply the maximum expected utility over all possible actions from s. Com-
puting maximum expected utility for a state s is essentially the same as running expectimax–we
first compute the expected utility from each Q-state (s, a) (equivalent to computing the value of
chance nodes), then compute the maximum over these nodes to compute the maximum expected
utility (equivalent to computing the value of a maximizer node).
One final note on the Bellman equation – its usage is as a condition for optimality. In other
words, if we can somehow determine a value V (s) for every state s ∈ S such that the Bellman
equation holds true for each of these states, we can conclude that these values are the optimal
values for their respective states. Indeed, satisfying this condition implies ∀s ∈ S, V (s) = V ∗ (s).
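The two quantities above translate almost directly into code. Below is a sketch reusing the dictionary-based MDP container from earlier (the helper names are ours): q_value computes Q(s, a) from a table of state values V, and satisfies_bellman checks the optimality condition V(s) = max_a Q(s, a) at every non-terminal state.

def q_value(mdp, V, s, a):
    """Expected utility of taking a in s and then following the values in V."""
    return sum(p * (mdp.R[(s, a, s2)] + mdp.gamma * V[s2])
               for s2, p in mdp.T[(s, a)].items())

def satisfies_bellman(mdp, V, tol=1e-9):
    """True if V(s) = max_a Q(s, a) holds at every non-terminal state."""
    return all(abs(V[s] - max(q_value(mdp, V, s, a) for a in mdp.actions[s])) < tol
               for s in mdp.states if mdp.actions[s])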

4 Value Iteration
Now that we have a framework to test for optimality of the values of states in an MDP, the natural follow-up question to ask is how to actually compute these optimal values. To answer this question, we need time-limited values (the natural result of enforcing finite horizons). The time-
limited value for a state s with a time-limit of k timesteps is denoted Vk (s), and represents the
maximum expected utility attainable from s given that the Markov decision process under con-
sideration terminates in k timesteps. Equivalently, this is what a depth-k expectimax run on the
search tree for the MDP returns.
Value iteration is a dynamic programming algorithm that uses an iteratively longer time limit
to compute time-limited values until convergence (that is, until the V values are the same for each
state as they were in the past iteration: ∀s, Vk+1 (s) = Vk (s)). It operates as follows:

1. ∀s ∈ S, initialize V0 (s) = 0. This should be intuitive, since setting a time limit of 0 timesteps
means no actions can be taken before termination, and so no rewards can be acquired.

2. Repeat the following update rule until convergence:


∀s ∈ S, Vk+1(s) ← max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ Vk(s′)]

At iteration k of value iteration, we use the time-limited values with limit k for each state
to generate the time-limited values with limit (k + 1). In essence, we use computed solutions
to subproblems (all the Vk (s)) to iteratively build up solutions to larger subproblems (all the
Vk+1 (s)); this is what makes value iteration a dynamic programming algorithm.

The complexity of each iteration is O(|S|²|A|), and the policy may converge long before the values do.
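The update rule is short enough to transcribe directly. The sketch below reuses the dictionary-based MDP container and the q_value helper from earlier; the max_iters cap is our addition, since the values need not converge when γ = 1 (as in the racecar example that follows).

def value_iteration(mdp, tol=1e-9, max_iters=10_000):
    V = {s: 0.0 for s in mdp.states}                    # step 1: V_0(s) = 0
    for _ in range(max_iters):
        V_next = {s: max((q_value(mdp, V, s, a) for a in mdp.actions[s]),
                         default=0.0)                   # terminal states stay at 0
                  for s in mdp.states}
        if all(abs(V_next[s] - V[s]) < tol for s in mdp.states):
            return V_next                               # converged: V_{k+1} = V_k
        V = V_next
    return V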

Let’s see a few updates of value iteration in practice by revisiting our racecar MDP from earlier,
introducing a discount factor of γ = 1:

We begin value iteration by initialization of all V0 (s) = 0:

     cool  warm  overheated
V0    0     0     0


In our first round of updates, we can compute ∀s ∈ S, V1 (s) as follows:

V1(cool) = max{1 · [1 + 1 · 0], 0.5 · [2 + 1 · 0] + 0.5 · [2 + 1 · 0]}
         = max{1, 2}
         = 2
V1(warm) = max{0.5 · [1 + 1 · 0] + 0.5 · [1 + 1 · 0], 1 · [−10 + 1 · 0]}
         = max{1, −10}
         = 1
V1(overheated) = max{} = 0

     cool  warm  overheated
V0    0     0     0
V1    2     1     0

Similarly, we can repeat the procedure to compute a second round of updates with our newfound
values for V1 (s) to compute V2 (s).

V2(cool) = max{1 · [1 + 1 · 2], 0.5 · [2 + 1 · 2] + 0.5 · [2 + 1 · 1]}
         = max{3, 3.5}
         = 3.5
V2(warm) = max{0.5 · [1 + 1 · 2] + 0.5 · [1 + 1 · 1], 1 · [−10 + 1 · 0]}
         = max{2.5, −10}
         = 2.5
V2(overheated) = max{} = 0

     cool  warm  overheated
V0    0     0     0
V1    2     1     0
V2    3.5   2.5   0

It’s worthwhile to observe that V ∗ (s) for any terminal state must be 0, since no actions can ever be
taken from any terminal state to reap any rewards.
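For reference, the two rounds of updates above can be reproduced by applying the update rule twice to the racecar MDP sketched earlier (γ = 1 here, so we deliberately do not run to convergence):

V = {s: 0.0 for s in racecar.states}
for k in (1, 2):
    V = {s: max((q_value(racecar, V, s, a) for a in racecar.actions[s]), default=0.0)
         for s in racecar.states}
    print(f"V{k}:", V)
# V1: {'cool': 2.0, 'warm': 1.0, 'overheated': 0.0}
# V2: {'cool': 3.5, 'warm': 2.5, 'overheated': 0.0}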


Value Iteration Convergence


Case 1: if the search tree has a maximum depth M, then VM holds the actual untruncated values.

Case 2: if the discount factor satisfies γ < 1, then Vk and Vk+1 can both be viewed as the results of depth-(k + 1) expectimax on nearly identical search trees. The only difference is the bottom layer: the Vk tree has zero rewards there, while the bottom layer of the Vk+1 tree has (at best) all Rmax. Since that bottom layer is discounted by a factor of γ^k, the two sets of values differ by at most γ^k · Rmax, which shrinks to zero as k increases, so the values converge.

5 Policy Iteration
Recall that our ultimate goal in solving an MDP is to determine an optimal policy. This can be done
once all optimal values for states are determined using a method called policy extraction. The
intuition behind policy extraction is very simple: if you’re in a state s, you should take the action
a which yields the maximum expected utility. Not surprisingly, a is the action that takes us to the
Q-state with maximum Q-value, allowing for a formal definition of the optimal policy:
∀s ∈ S, π∗(s) = argmax_a Q∗(s, a) = argmax_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ V∗(s′)]
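As code, policy extraction is a single argmax over Q-values. A sketch, reusing the q_value helper from earlier (terminal states have no actions and are simply left out of the returned policy):

def extract_policy(mdp, V):
    """pi(s) = argmax_a Q(s, a), computed from a table of state values V."""
    return {s: max(mdp.actions[s], key=lambda a: q_value(mdp, V, s, a))
            for s in mdp.states if mdp.actions[s]}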

Value iteration can be quite slow. At each iteration, we must update the values of all |S| states (where | · | denotes set cardinality), each of which requires iteration over all |A| actions
as we compute the Q-value for each action. The computation of each of these Q-values, in turn,
requires iteration over each of the |S| states again, leading to a poor runtime of O(|S|2 |A|). Addi-
tionally, when all we want to determine is the optimal policy for the MDP, value iteration tends to do a lot of overcomputation, since the policy computed by policy extraction generally converges
significantly faster than the values themselves. The fix for these flaws is to use policy iteration
as an alternative, an algorithm that maintains the optimality of value iteration while providing
significant performance gains. Policy iteration operates as follows:

1. Define an initial policy. This can be arbitrary, but policy iteration will converge faster the
closer the initial policy is to the eventual optimal policy.

2. Repeat the following until convergence:

• Evaluate the current policy with policy evaluation. For a policy π, policy evaluation
means computing V π (s) for all states s, where V π (s) is the expected utility of starting in
state s when following π:
V^π(s) = Σ_{s′} T(s, π(s), s′) [R(s, π(s), s′) + γ V^π(s′)]

Define the policy at iteration i of policy iteration as πi . Since we are fixing a single
action for each state, we no longer need the max operator which effectively leaves us
with a system of |S| equations generated by the above rule. Each V πi (s) can then be
computed by simply solving this system. Alternatively, we can also compute V πi (s) by
using the following update rule until convergence, just like in value iteration:
V^{πi}_{k+1}(s) ← Σ_{s′} T(s, πi(s), s′) [R(s, πi(s), s′) + γ V^{πi}_k(s′)]

However, this second method is typically slower in practice.


• Once we’ve evaluated the current policy, use policy improvement to generate a better
policy. Policy improvement uses policy extraction on the values of states generated by
policy evaluation to generate this new and improved policy:
πi+1(s) = argmax_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ V^{πi}(s′)]

If πi+1 = πi , the algorithm has converged, and we can conclude that πi+1 = πi = π ∗ .
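Putting the two steps together gives the sketch below. It uses the iterative flavor of policy evaluation described above rather than solving the linear system, and it reuses the dictionary-based MDP container and the extract_policy helper from earlier; the fixed number of evaluation sweeps is our simplification.

def policy_evaluation(mdp, pi, sweeps=100):
    """Approximate V^pi by repeatedly applying the fixed-policy update rule."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(sweeps):                     # pi fixes the action, so there is no max
        V = {s: (sum(p * (mdp.R[(s, pi[s], s2)] + mdp.gamma * V[s2])
                     for s2, p in mdp.T[(s, pi[s])].items())
                 if mdp.actions[s] else 0.0)
             for s in mdp.states}
    return V

def policy_iteration(mdp, pi0):
    """Alternate policy evaluation and policy improvement until the policy stops changing."""
    pi = dict(pi0)                              # pi0 maps each non-terminal state to an action
    while True:
        V = policy_evaluation(mdp, pi)
        pi_next = extract_policy(mdp, V)        # policy improvement
        if pi_next == pi:
            return pi                           # converged: pi_{i+1} = pi_i = pi*
        pi = pi_next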
