2024 MDPs Part 1
Hanoi, 03/2025
Non-deterministic search
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path
[Figure: Racing MDP — states Cool, Warm, Overheated; actions Slow and Fast; edges labeled with transition probabilities (0.5, 1.0) and rewards (+1, +2)]
Racing Search Tree
MDP Search Trees
▪ Each MDP state projects an expectimax-like search tree
▪ s is a state
▪ (s, a) is a q-state
▪ (s, a, s’) is called a transition
▪ Why discount?
▪ Reward now is better than later (a small worked sum follows after this list)
▪ Also helps our algorithms converge
▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
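As a small worked illustration of the first point (the discount γ = 0.5 and the reward sequence here are made-up numbers, not from the slides), utilities of reward sequences are discounted sums:

```latex
% Utility of a reward sequence r_0, r_1, r_2, ... under discount gamma:
U([r_0, r_1, r_2, \dots]) \;=\; \sum_{t \ge 0} \gamma^{t} r_t
% e.g. gamma = 0.5 and rewards [1, 2, 3]:
%   U = 1 + 0.5 \cdot 2 + 0.25 \cdot 3 = 2.75,
% so the same rewards are worth less the later they arrive.
```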
Recap: Defining MDPs
▪ Markov decision processes (a minimal Python encoding follows below):
▪ Set of states S
▪ Start state s0
▪ Set of actions A
▪ Transitions P(s’|s,a) (or T(s,a,s’))
▪ Rewards R(s,a,s’) (and discount γ)
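To make the recap concrete, here is a minimal Python encoding of the racing MDP. The transition model and rewards are reconstructed from the racing diagram and the worked numbers later in the deck, so treat the exact values as an assumption; `T`, `R`, and `GAMMA` simply mirror the slide notation.

```python
# Minimal sketch of the racing MDP (reconstructed from the slides; the exact
# probabilities and rewards are an assumption).  T[s][a] lists (s', prob)
# pairs and R[(s, a, s')] is the reward for taking that transition.
STATES = ["cool", "warm", "overheated"]       # "overheated" is absorbing
ACTIONS = ["slow", "fast"]

T = {
    "cool": {
        "slow": [("cool", 1.0)],
        "fast": [("cool", 0.5), ("warm", 0.5)],
    },
    "warm": {
        "slow": [("cool", 0.5), ("warm", 0.5)],
        "fast": [("overheated", 1.0)],
    },
    "overheated": {},                         # terminal: no actions available
}

R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

GAMMA = 0.9   # discount; the worked example later in the deck uses gamma = 1
```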
Racing Search Tree
▪ We’re doing way too much work with expectimax!
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
Gridworld V* Values
[Figure: optimal values V*(s); Noise = 0.2, Discount = 0.9, Living reward = 0]
Gridworld Q* Values
[Figure: optimal q-values Q*(s,a); Noise = 0.2, Discount = 0.9, Living reward = 0]
Time-Limited Values
▪ Key idea: time-limited values. Define Vk(s) to be the optimal value of s if the game ends in k more time steps (equivalently, what a depth-k expectimax would compute from s)
[Figure series: Gridworld time-limited values Vk for k = 1, 2, …, 12 and k = 100; all with Noise = 0.2, Discount = 0.9, Living reward = 0]
Computing Time-Limited Values
Value Iteration
▪ Start with V0(s) = 0: no time steps left means an expected reward sum of zero
▪ Given vector of Vk(s) values, do one ply of expectimax from each state:
Vk+1(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
▪ Value iteration computes them: repeat this update until the values converge
Value Iteration (again ☺)
▪ Init: ∀s: V(s) = 0
▪ Iterate: ∀s: Vnew(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ], then set V = Vnew (a runnable sketch follows the note below)
Note: can even directly assign to V(s), which will not compute the sequence of Vk but will still converge to V*
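A runnable sketch of this loop, assuming the dictionary encoding of the racing MDP from the earlier recap (restated here so the snippet stands alone); `value_iteration` and `q_value` are illustrative names, not from the slides.

```python
# Value iteration: batch Bellman updates until the values stop changing.
# The racing-MDP dictionaries are restated (reconstructed, as before) so the
# snippet is self-contained.
T = {
    "cool": {"slow": [("cool", 1.0)], "fast": [("cool", 0.5), ("warm", 0.5)]},
    "warm": {"slow": [("cool", 0.5), ("warm", 0.5)], "fast": [("overheated", 1.0)]},
    "overheated": {},                              # terminal state
}
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0, ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0, ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

def q_value(V, s, a, gamma):
    """Q(s,a) = sum_s' T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[s][a])

def value_iteration(gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in T}                        # init: for all s, V(s) = 0
    while True:
        # Vnew(s) = max_a Q(s, a); terminal states keep value 0
        V_new = {s: (max(q_value(V, s, a, gamma) for a in T[s]) if T[s] else 0.0)
                 for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new                                  # batch update: V = Vnew

# Under the reconstructed model this converges to roughly
# {'cool': 15.5, 'warm': 14.5, 'overheated': 0.0}.
print(value_iteration(gamma=0.9))
```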
Example: Value Iteration
▪ Racing MDP, assume no discount (γ = 1); values listed in the order (Cool, Warm, Overheated)
▪ V0 = (0, 0, 0)
▪ V1: from Cool, Slow gives 1 and Fast gives .5*2 + .5*2 = 2, so V1(Cool) = 2; from Warm, Slow gives .5*1 + .5*1 = 1 and Fast gives -10, so V1(Warm) = 1; Overheated is terminal, so V1 = (2, 1, 0)
▪ V2: from Cool, Slow gives 1 + 2 = 3 and Fast gives .5*(2+2) + .5*(2+1) = 3.5, so V2(Cool) = 3.5; from Warm, Slow gives .5*(1+2) + .5*(1+1) = 2.5 and Fast gives -10, so V2(Warm) = 2.5; hence V2 = (3.5, 2.5, 0) (checked in the sketch below)
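The same two updates can be checked with plain arithmetic; the Fast-from-Warm branch (probability 1.0 to Overheated, reward -10) comes from the reconstructed racing model sketched earlier:

```python
# Check the worked example directly (gamma = 1, values for Cool and Warm).
V1_cool = max(1.0, 0.5 * 2 + 0.5 * 2)                                   # -> 2.0
V1_warm = max(0.5 * 1 + 0.5 * 1, -10.0)                                 # -> 1.0
V2_cool = max(1 + V1_cool, 0.5 * (2 + V1_cool) + 0.5 * (2 + V1_warm))   # -> 3.5
V2_warm = max(0.5 * (1 + V1_cool) + 0.5 * (1 + V1_warm), -10.0)         # -> 2.5
print(V1_cool, V1_warm, V2_cool, V2_warm)    # 2.0 1.0 3.5 2.5
```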
Convergence*
▪ How do we know the Vk vectors are going to converge?
(assuming 0 < γ < 1)
▪ Proof Sketch:
▪ For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
▪ The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
▪ That last layer is at best all RMAX, and at worst all RMIN
▪ But everything on that layer is discounted by γ^k that far out
▪ So Vk and Vk+1 differ by at most γ^k max|R| (see the bound spelled out below)
▪ So as k increases, the values converge
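Spelled out as a bound (a sketch; max|R| here denotes the largest one-step reward magnitude, which is not named explicitly on the slide):

```latex
% The depth-(k+1) and depth-k trees agree everywhere except the deepest layer,
% and that layer's contribution is discounted by gamma^k:
\max_s \bigl| V_{k+1}(s) - V_k(s) \bigr| \;\le\; \gamma^{k}\, \max |R|
% Since 0 < gamma < 1, the right-hand side goes to 0 as k grows, so the
% sequence (V_k) is Cauchy and converges (its limit is V^*).
```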
Policy Extraction
Computing Actions from Values
▪ Let’s imagine we have the optimal values V*(s)
▪ How should we act? We need a one-step lookahead (a mini-expectimax): π*(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
▪ This is called policy extraction, since it gets the policy implied by the values (sketched in code below)
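A minimal sketch of that one-step lookahead in Python, assuming the racing-MDP dictionaries `T` and `R` from the earlier sketches and a value table `V` (for example the output of `value_iteration`); `extract_policy` is an illustrative name:

```python
# Policy extraction: mini-expectimax (one step) against the given values.
def extract_policy(V, T, R, gamma=0.9):
    policy = {}
    for s, acts in T.items():
        if not acts:                       # terminal state: nothing to choose
            policy[s] = None
            continue
        # pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]
        policy[s] = max(acts, key=lambda a: sum(
            p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in acts[a]))
    return policy
```

Under the reconstructed racing model with γ = 0.9 this yields Fast in Cool and Slow in Warm.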
Computing Actions from Q-Values
▪ Let’s imagine we have the optimal q-values Q*(s,a)
▪ How should we act? Simply take the action with the best q-value: π*(s) = argmax_a Q*(s,a) (one-liner sketch below)
▪ Important lesson: actions are easier to select from q-values than values!
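By contrast, with q-values no model or lookahead is needed; a one-liner sketch (the `Q` table keyed by (state, action) pairs is a hypothetical input):

```python
# Acting from q-values is just an argmax over the actions available in s.
def best_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])
```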
Problems with Value Iteration
▪ Value iteration repeats the Bellman updates:
Vk+1(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
▪ Problem 1: It’s slow – O(S²A) per iteration
[Figure: Gridworld values at k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0]