CSE 473 MDP Notes
1 Non-Deterministic Search
Picture a robot runner, coming to the end of his first ever marathon. Though it seems likely he will
complete the race and claim the accompanying everlasting glory, it’s by no means guaranteed. He
may pass out from exhaustion or misstep and slip and fall, tragically breaking both of his legs.
Even more unlikely, a literally earth-shattering earthquake may spontaneously occur, swallowing
up the runner mere inches before he crosses the finish line. Such possibilities add a degree of
uncertainty to the runner’s actions, and it’s this uncertainty that will be the subject of the following
discussion. In the first and second note, we talked about traditional search problems and how to
solve them; then, in the third note, we changed our model to account for adversaries and other
agents in the world that influenced our path to goal states. Now, we’ll change our model again to
account for another influencing factor – the dynamics of the world itself. The environment in which
an agent is placed may subject the agent’s actions to being nondeterministic, which means that
there are multiple possible successor states that can result from an action taken in some state. This
is, in fact, the case in many card games such as poker or blackjack, where there exists an inherent
uncertainty from the randomness of card dealing. Such problems where the world poses a degree
of uncertainty are known as nondeterministic search problems, and can be solved with models
known as Markov decision processes, or MDPs.
Figure 1: The Grid World example from lecture, where the agent moves between adjacent squares non-deterministically and receives rewards depending on the states it reaches.
A Markov decision process is defined by several properties:
• A set of states S. States in MDPs are represented in the same way as in traditional search problems.
• A set of actions A. Actions in MDPs are also represented in the same way as in traditional search problems.
• A start state.
• Possibly one or more terminal states.
• A transition function T(s, a, s′), which gives the probability that taking action a from state s lands the agent in successor state s′.
• A discount factor γ (discussed shortly).
• A reward function R(s, a, s′). Typically, MDPs are modeled with small "living" rewards at each step to reward an agent’s survival, along with large rewards for arriving at a terminal state. Rewards may be positive or negative depending on whether or not they benefit the agent in question, and the agent’s objective is naturally to acquire the maximum reward possible before arriving at some terminal state. Sometimes, the reward function depends only on states, in which case it is denoted R(s) or R(s′).
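To make the components above concrete, here is a minimal sketch of how an MDP might be encoded in Python. The variable names and the toy transition and reward numbers are illustrative assumptions, not values taken from these notes.

```python
from typing import Dict, List, Tuple

State, Action = str, str

states: List[State] = ["A", "B", "DONE"]      # hypothetical states
actions: List[Action] = ["left", "right"]      # hypothetical actions
start_state: State = "A"
terminal_states = {"DONE"}
gamma = 0.9                                    # assumed discount factor

# T[(s, a)] lists the possible successors of taking a in s as (s', prob)
# pairs; the probabilities for each (s, a) should sum to 1.
T: Dict[Tuple[State, Action], List[Tuple[State, float]]] = {
    ("A", "right"): [("B", 0.8), ("A", 0.2)],
    ("A", "left"):  [("A", 1.0)],
    ("B", "right"): [("DONE", 1.0)],
    ("B", "left"):  [("A", 1.0)],
}

# R(s, a, s'): a small "living" reward per step and a large terminal reward.
def R(s: State, a: Action, s2: State) -> float:
    return 10.0 if s2 == "DONE" else -0.1
```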
As an example, consider the racecar MDP. There are three possible states, S = {cool, warm, overheated}, and two possible actions, A = {slow, fast}. Just like in a state-space graph, each of the three states is represented by a node,
with edges representing actions. Overheated is a terminal state, since once a racecar agent arrives
at this state, it can no longer perform any actions for further rewards (it’s a sink state in the MDP
and has no outgoing edges). Notably, for nondeterministic actions, there are multiple edges repre-
senting the same action from the same state with differing successor states. Each edge is annotated
not only with the action it represents, but also a transition probability and corresponding reward.
These are summarized below:
Additionally, knowing that an agent’s goal is to maximize its reward across all timesteps, we can correspondingly express this mathematically as a maximization of the following utility function:

U([s_0, s_1, s_2, ...]) = R(s_0, a_0, s_1) + R(s_1, a_1, s_2) + R(s_2, a_2, s_3) + ...
Markov decision processes, like state-space graphs, can be unraveled into search trees. Uncer-
tainty is modeled in these search trees with Q-states, also known as action states, essentially iden-
tical to expectimax chance nodes. This is a fitting choice, as Q-states use probabilities to model
the uncertainty that the environment will land an agent in a given state just as expectimax chance
nodes use probabilities to model the uncertainty that adversarial agents will land our agent in a
given state through the move these agents select. The Q-state corresponding to having taken action a from state s is notated as the tuple (s, a).
Observe the unraveled search tree for our racecar, truncated to depth-2:
The green nodes represent Q-states, where an action has been taken from a state but has yet to be
resolved into a successor state. It’s important to understand that agents spend zero timesteps in
Q-states, and that they are simply a construct created for ease of representation and development
of MDP algorithms.
Noting that the above definition of a discounted utility function looks similar to a geometric series
with ratio γ, we can prove that it’s guaranteed to be finite-valued as long as the constraint |γ| < 1
(where |n| denotes the absolute value operator) is met through the following logic:
V([s_0, s_1, s_2, ...]) = R(s_0, a_0, s_1) + γR(s_1, a_1, s_2) + γ^2 R(s_2, a_2, s_3) + ...
                        = Σ_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) ≤ Σ_{t=0}^{∞} γ^t R_max = R_max / (1 − γ)
where Rmax is the maximum possible reward attainable at any given timestep in the MDP. Typi-
cally, γ is selected strictly from the range 0 < γ < 1, since values in the range −1 < γ ≤ 0
are simply not meaningful in most real-world situations–a negative value for γ means the reward
for a state s would flip-flop between positive and negative values at alternating timesteps.
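As a quick sanity check of the bound above, the short snippet below computes a discounted sum of rewards and compares it to R_max / (1 − γ). The reward sequence and the value of γ are made-up numbers for illustration.

```python
gamma = 0.9
rewards = [2.0, -1.0, 3.0, 0.5]      # hypothetical R(s_t, a_t, s_{t+1}) values
r_max = 3.0                           # the largest single-step reward above

discounted_utility = sum(gamma ** t * r for t, r in enumerate(rewards))
bound = r_max / (1 - gamma)           # R_max / (1 - gamma) = 30.0

print(round(discounted_utility, 4))   # 3.8945
print(discounted_utility <= bound)    # True
```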
Markovianness
Markov decision processes are "markovian" in the sense that they satisfy the Markov property,
or memoryless property, which states that the future and the past are conditionally independent,
given the present. Intuitively, this means that, if we know the present state, knowing the past
doesn’t give us any more information about the future. To express this mathematically, consider
an agent that has visited states s0 , s1 , ..., st after taking actions a0 , a1 , ..., at−1 in some MDP, and
has just taken action at . The probability that this agent then arrives at state st+1 given their history
of previous states visited and actions taken can be written as follows:
P (St+1 = st+1 |St = st , At = at , St−1 = st−1 , At−1 = at−1 , ..., S0 = s0 )
where each St denotes the random variable representing our agent’s state and At denotes the
random variable representing the action our agent takes at time t. The Markov property states
that the above probability can be simplified as follows:
P (St+1 = st+1 |St = st , At = at , St−1 = st−1 , At−1 = at−1 , ..., S0 = s0 ) = P (St+1 = st+1 |St = st , At = at )
which is "memoryless" in the sense that the probability of arriving in a state s′ at time t+1 depends
only on the state s and action a taken at time t, not on any earlier states or actions. In fact, it is these
memoryless probabilities which are encoded by the transition function: T (s, a, s′ ) = P (s′ |s, a) .
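In code, this memorylessness shows up as a successor distribution indexed only by the current state and action. The sketch below samples s′ from a hypothetical transition table; note that no history beyond (s, a) is ever consulted. The table and its numbers are illustrative assumptions.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical transition table: P(s' | s, a) encoded as (s', prob) pairs.
T: Dict[Tuple[str, str], List[Tuple[str, float]]] = {
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],   # illustrative numbers
}

def sample_successor(s: str, a: str) -> str:
    """Draw s' ~ T(s, a, .); earlier states and actions are irrelevant."""
    successors = [s2 for s2, _ in T[(s, a)]]
    probs = [p for _, p in T[(s, a)]]
    return random.choices(successors, weights=probs, k=1)[0]

print(sample_successor("cool", "fast"))   # "cool" or "warm", each w.p. 0.5
```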
With some investigation, it’s not hard to determine that Policy 2 is optimal. Following the policy from each start state until taking the action a = Exit yields the following rewards:
Start State Reward
a 10
b 1
c 0.1
d 0.1
e 1
We’ll now learn how to solve such MDPs (and much more complex ones!) algorithmically
using the Bellman equation for Markov decision processes:

V∗(s) = max_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)]

Before we begin interpreting what this means, let’s also define the equation for the optimal value
of a Q-state (more commonly known as an optimal Q-value):
Q∗(s, a) = Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)]
Note that this second definition allows us to reexpress the Bellman equation as

V∗(s) = max_a Q∗(s, a)

which is a dramatically simpler quantity. The Bellman equation is an example of a dynamic pro-
gramming equation, an equation that decomposes a problem into smaller subproblems via an in-
herent recursive structure. We can see this inherent recursion in the equation for the Q-value of
a state, in the term [R(s, a, s′ ) + γV ∗ (s′ )]. This term represents the total utility an agent receives
by first taking a from s and arriving at s′ and then acting optimally henceforth. The immediate
reward from the action a taken, R(s, a, s′ ), is added to the optimal discounted sum of rewards
attainable from s′ , V ∗ (s′ ), which is discounted by γ to account for the passage of one timestep in
taking action a. Though in most cases there exists a vast number of possible sequences of states
and actions from s′ to some terminal state, all this detail is abstracted away and encapsulated in a
single recursive value, V ∗ (s′ ).
We can now take another step outwards and consider the full equation for Q-value. Knowing
[R(s, a, s′ ) + γV ∗ (s′ )] represents the utility attained by acting optimally after arriving in state s′
from Q-state (s, a), it becomes evident that the quantity
Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)]
is simply a weighted sum of utilities, with each utility weighted by its probability of occurrence.
This is by definition the expected utility of acting optimally from Q-state (s, a) onwards! This com-
pletes our analysis and gives us enough insight to interpret the full Bellman equation–the optimal
value of a state, V ∗ (s), is simply the maximum expected utility over all possible actions from s. Com-
puting maximum expected utility for a state s is essentially the same as running expectimax–we
first compute the expected utility from each Q-state (s, a) (equivalent to computing the value of
chance nodes), then compute the maximum over these nodes to compute the maximum expected
utility (equivalent to computing the value of a maximizer node).
One final note on the Bellman equation – its usage is as a condition for optimality. In other
words, if we can somehow determine a value V (s) for every state s ∈ S such that the Bellman
equation holds true for each of these states, we can conclude that these values are the optimal
values for their respective states. Indeed, satisfying this condition implies ∀s ∈ S, V (s) = V ∗ (s).
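The sketch below spells out the Q-value and Bellman computations in Python, assuming the dictionary-based T and callable R interface from the earlier sketch; that interface is an assumption of these examples, not something prescribed by the notes.

```python
from typing import Callable, Dict, List, Tuple

State, Action = str, str
Transitions = Dict[Tuple[State, Action], List[Tuple[State, float]]]
Reward = Callable[[State, Action, State], float]

def q_value(s: State, a: Action, V: Dict[State, float],
            T: Transitions, R: Reward, gamma: float) -> float:
    """Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]."""
    return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])

def bellman_value(s: State, actions: List[Action], V: Dict[State, float],
                  T: Transitions, R: Reward, gamma: float) -> float:
    """V(s) = max over a of Q(s, a); terminal states with no actions get 0."""
    qs = [q_value(s, a, V, T, R, gamma) for a in actions if (s, a) in T]
    return max(qs) if qs else 0.0
```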
4 Value Iteration
Now that we have a framework to test for optimality of the values of states in an MDP, the nat-
ural follow-up question to ask is how to actually compute these optimal values. To answer this
question, we need time-limited values (the natural result of enforcing finite horizons). The time-
limited value for a state s with a time-limit of k timesteps is denoted Vk (s), and represents the
maximum expected utility attainable from s given that the Markov decision process under con-
sideration terminates in k timesteps. Equivalently, this is what a depth-k expectimax run on the
search tree for an MDP returns.
Value iteration is a dynamic programming algorithm that uses an iteratively longer time limit
to compute time-limited values until convergence (that is, until the V values are the same for each
state as they were in the past iteration: ∀s, Vk+1 (s) = Vk (s)). It operates as follows:
1. ∀s ∈ S, initialize V0 (s) = 0. This should be intuitive, since setting a time limit of 0 timesteps
means no actions can be taken before termination, and so no rewards can be acquired.
2. Repeat the following update rule until convergence:

∀s ∈ S, V_{k+1}(s) ← max_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV_k(s′)]

At iteration k of value iteration, we use the time-limited values with limit k for each state to generate the time-limited values with limit (k + 1). In essence, we use computed solutions to subproblems (all the V_k(s)) to iteratively build up solutions to larger subproblems (all the V_{k+1}(s)); this is what makes value iteration a dynamic programming algorithm.
The complexity of each iteration of value iteration is O(|S|²|A|), and the policy may converge long before the values do.
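Here is a compact sketch of the full loop, reusing the assumed T / R interface from the earlier sketches. The tolerance-based stopping test is one common convergence criterion; the notes themselves only require that consecutive value estimates agree.

```python
from typing import Callable, Dict, List, Tuple

def value_iteration(states: List[str], actions: List[str],
                    T: Dict[Tuple[str, str], List[Tuple[str, float]]],
                    R: Callable[[str, str, str], float],
                    gamma: float, tol: float = 1e-8) -> Dict[str, float]:
    V = {s: 0.0 for s in states}                       # step 1: V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            # V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_k(s')]
            qs = [sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
                  for a in actions if (s, a) in T]
            V_new[s] = max(qs) if qs else 0.0          # terminal states stay 0
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new                               # values have converged
        V = V_new
```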
Let’s see a few updates of value iteration in practice by revisiting our racecar MDP from earlier,
introducing a discount factor of γ = 1:
Similarly, we can repeat the procedure to compute a second round of updates with our newfound
values for V1 (s) to compute V2 (s).
It’s worthwhile to observe that V ∗ (s) for any terminal state must be 0, since no actions can ever be
taken from any terminal state to reap any rewards.
5 Policy Iteration
Recall that our ultimate goal in solving an MDP is to determine an optimal policy. This can be done
once all optimal values for states are determined using a method called policy extraction. The
intuition behind policy extraction is very simple: if you’re in a state s, you should take the action
a which yields the maximum expected utility. Not surprisingly, a is the action that takes us to the
Q-state with maximum Q-value, allowing for a formal definition of the optimal policy:
∀s ∈ S, π∗(s) = argmax_a Q∗(s, a) = argmax_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)]
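A direct translation of policy extraction into Python, again under the assumed T / R interface from the earlier sketches; it simply takes the argmax over Q-values for each state.

```python
from typing import Callable, Dict, List, Optional, Tuple

def extract_policy(states: List[str], actions: List[str],
                   T: Dict[Tuple[str, str], List[Tuple[str, float]]],
                   R: Callable[[str, str, str], float],
                   gamma: float, V: Dict[str, float]) -> Dict[str, Optional[str]]:
    policy: Dict[str, Optional[str]] = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in T:
                continue                               # action unavailable in s
            q = sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
            if q > best_q:
                best_a, best_q = a, q                  # keep the argmax action
        policy[s] = best_a                             # None for terminal states
    return policy
```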
Value iteration can be quite slow. At each iteration, we must update the values of all |S| states
(where |n| refers to the cardinality operator), each of which requires iteration over all |A| actions
as we compute the Q-value for each action. The computation of each of these Q-values, in turn,
requires iteration over each of the |S| states again, leading to a poor runtime of O(|S|²|A|). Addi-
tionally, when all we want to determine is the optimal policy for the MDP, value iteration tends to do a lot of unnecessary work, since the policy often converges well before the values themselves do. Policy iteration addresses this by computing and improving policies directly. It operates as follows:
1. Define an initial policy. This can be arbitrary, but policy iteration will converge faster the closer the initial policy is to the eventual optimal policy.
2. Repeat the following until convergence:
• Evaluate the current policy with policy evaluation. For a policy π, policy evaluation means computing V^π(s) for all states s, where V^π(s) is the expected utility of starting in state s and following π:
V^π(s) = Σ_{s′} T(s, π(s), s′)[R(s, π(s), s′) + γV^π(s′)]
Define the policy at iteration i of policy iteration as π_i. Since we are fixing a single action for each state, we no longer need the max operator, which effectively leaves us with a system of |S| equations generated by the above rule. Each V^{π_i}(s) can then be computed by simply solving this system. Alternatively, we can also compute V^{π_i}(s) by using the following update rule until convergence, just like in value iteration:
V_{k+1}^{π_i}(s) ← Σ_{s′} T(s, π_i(s), s′)[R(s, π_i(s), s′) + γV_k^{π_i}(s′)]
• Once the current policy has been evaluated, use policy improvement to generate a better policy: perform policy extraction with respect to V^{π_i}, i.e. π_{i+1}(s) = argmax_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV^{π_i}(s′)].
If π_{i+1} = π_i, the algorithm has converged, and we can conclude that π_{i+1} = π_i = π∗.
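Putting the two pieces together, the sketch below alternates iterative policy evaluation with greedy policy improvement until the policy stops changing; it assumes the same hypothetical T / R interface as the earlier sketches.

```python
from typing import Callable, Dict, List, Tuple

def policy_iteration(states: List[str], actions: List[str],
                     T: Dict[Tuple[str, str], List[Tuple[str, float]]],
                     R: Callable[[str, str, str], float],
                     gamma: float, eval_tol: float = 1e-8):
    # 1. Arbitrary initial policy: first available action in each state.
    policy = {s: next((a for a in actions if (s, a) in T), None) for s in states}
    while True:
        # Policy evaluation: iterate the fixed-policy update to convergence.
        V = {s: 0.0 for s in states}
        while True:
            V_new = {}
            for s in states:
                a = policy[s]
                V_new[s] = 0.0 if a is None else sum(
                    p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
            if max(abs(V_new[s] - V[s]) for s in states) < eval_tol:
                V = V_new
                break
            V = V_new
        # Policy improvement: act greedily with respect to V^{pi_i}.
        new_policy = {}
        for s in states:
            scored = [(sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)]), a)
                      for a in actions if (s, a) in T]
            new_policy[s] = max(scored)[1] if scored else None
        if new_policy == policy:          # pi_{i+1} = pi_i  =>  converged
            return new_policy, V
        policy = new_policy
```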