
UET (Since 2004)
VNU-University of Engineering and Technology (Đại học Công nghệ, ĐHQGHN)

INT3401E: Artificial Intelligence


Lecture 8: Markov Decision Process (Part 1)
Duc-Trong Le

(Slides based on AI course, University of California, Berkeley)

Hanoi, 03/2025
Non-deterministic search
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path

▪ Noisy movement: actions do not always go as planned
▪ 80% of the time, the action North takes the agent North (if there is no wall there)
▪ 10% of the time, North takes the agent West; 10% East
▪ If there is a wall in the direction the agent would have been taken, the agent stays put

▪ The agent receives rewards each time step


▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)

▪ Goal: maximize sum of rewards


Action in Grid World: Deterministic vs. Non-deterministic
Markov Decision Process (MDP)
▪ An MDP is defined by:
▪ A set of states s ∈ S
▪ A set of actions a ∈ A
▪ A transition model T(s, a, s’)
▪ Probability that a from s leads to s’, i.e., P(s’| s, a)
▪ A reward function R(s, a, s’) for each transition
▪ A start state
▪ Possibly a terminal state (or absorbing state)
▪ A utility function, defined as the sum of additive (discounted) rewards

▪ MDPs are fully observable but probabilistic search problems


Policies
▪ A policy π gives an action for each state, π: S → A

▪ In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal

▪ For MDPs, we want an optimal policy π*: S → A

▪ An optimal policy maximizes expected utility
▪ An explicit policy defines a reflex agent
Sample Optimal Policies
Example: Racing
▪ A robot car wants to travel far, quickly
▪ Three states: Cool, Warm, Overheated
▪ Two actions: Slow, Fast
▪ Going faster gets double reward

[Transition diagram, reconstructed from the slide:]
  Cool, Slow → Cool (prob 1.0), reward +1
  Cool, Fast → Cool (prob 0.5) or Warm (prob 0.5), reward +2
  Warm, Slow → Cool (prob 0.5) or Warm (prob 0.5), reward +1
  Warm, Fast → Overheated (prob 1.0), reward -10
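To make the MDP components (S, A, T, R, terminal state) concrete, here is a minimal Python sketch of how this racing MDP could be encoded, assuming the transition structure reconstructed above; the names RACING_MDP, STATES and actions() are ours, not from the slides.

# Minimal encoding of the racing MDP: each (state, action) pair maps to a
# list of (probability, next_state, reward) triples, i.e. T and R together.
RACING_MDP = {
    ("cool", "slow"): [(1.0, "cool", +1)],
    ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
    ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
    # "overheated" is terminal (absorbing): it has no entries, hence no actions.
}
STATES = ["cool", "warm", "overheated"]
START_STATE = "cool"

def actions(state):
    """Actions available in a state (empty for the terminal state)."""
    return [a for (s, a) in RACING_MDP if s == state]

Storing T(s,a,s’) and R(s,a,s’) together matches how they always appear side by side in the Bellman updates later in the lecture.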
Racing Search Tree
MDP Search Trees
▪ Each MDP state projects an expectimax-like search tree
▪ s is a state
▪ (s, a) is a q-state
▪ (s, a, s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
Utilities of Sequences
▪ What preferences should an agent have over reward sequences?

▪ More or less? [1, 2, 2] or [2, 3, 4]

▪ Now or later? [0, 0, 1] or [1, 0, 0]


Discounting
▪ It’s reasonable to maximize the sum of rewards
▪ It’s also reasonable to prefer rewards now to rewards later
▪ One solution: values of rewards decay exponentially
▪ A reward is worth 1 now, γ one step from now, and γ² two steps from now


Discounting
▪ How to discount?
▪ Each time we descend a level, we multiply in the discount once

▪ Why discount?
▪ Reward now is better than later
▪ Also helps our algorithms converge

▪ Example: discount of 0.5

▪ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
▪ U([3,2,1]) = 1*3 + 0.5*2 + 0.25*1 = 4.25
▪ So U([1,2,3]) < U([3,2,1])
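As a quick check of the arithmetic above, a discounted utility is straightforward to compute; this is just an illustrative helper (the function name is ours, not from the slides).

def discounted_utility(rewards, gamma):
    """Sum of rewards, multiplying in the discount once per time step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 4.25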
Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite rewards?
▪ Solutions:
▪ Finite horizon: (similar to depth-limited search)
▪ Terminate episodes after a fixed T steps (e.g. life)
▪ Gives nonstationary policies (π depends on the time left)

▪ Discounting with 0 < γ < 1 solves the problem of infinite reward streams!

▪ Geometric series: 1 + γ + γ² + … = 1/(1 − γ)
▪ Assume rewards are bounded by ±Rmax
▪ Then r₀ + γr₁ + γ²r₂ + … is bounded by ±Rmax/(1 − γ)

▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
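Spelling out the bound in the bullets above (this is just the geometric series argument restated):

|r₀ + γr₁ + γ²r₂ + …| ≤ Rmax(1 + γ + γ² + …) = Rmax/(1 − γ), provided 0 < γ < 1.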
Recap: Defining MDPs
▪ Markov decision processes:
▪ Set of states S
▪ Start state s0
▪ Set of actions A
▪ Transitions P(s’|s,a) (or T(s,a,s’))
▪ Rewards R(s,a,s’) (and discount γ)

▪ MDP quantities so far:


▪ Policy = Choice of action for each state
▪ Utility = sum of (discounted) rewards
Solving MDPs
Recall: Racing MDP
▪ A robot car wants to travel far, quickly
▪ Three states: Cool, Warm, Overheated
▪ Two actions: Slow, Fast
▪ Going faster gets double reward

[Same transition diagram as before:]
  Cool, Slow → Cool (prob 1.0), reward +1
  Cool, Fast → Cool (prob 0.5) or Warm (prob 0.5), reward +2
  Warm, Slow → Cool (prob 0.5) or Warm (prob 0.5), reward +1
  Warm, Fast → Overheated (prob 1.0), reward -10
Racing Search Tree
▪ We’re doing way too much work with expectimax!

▪ Problem: States are repeated
▪ Idea: Only compute needed quantities once

▪ Problem: Tree goes on forever
▪ Idea: Do a depth-limited computation, but with increasing depths until the change is small
▪ Note: deep parts of the tree eventually don’t matter if γ < 1
Optimal Quantities

▪ The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally

▪ The value (utility) of a q-state (s, a):
  Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

▪ The optimal policy:
  π*(s) = optimal action from state s
The Bellman Equations

How to be optimal:
Step 1: Take correct first action

Step 2: Keep being optimal


Values of States
▪ Recursive definition of value:

  V*(s) = max_a Q*(s,a)
  Q*(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
  V*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
Gridworld V* Values (Noise = 0.2, Discount = 0.9, Living reward = 0)
Gridworld Q* Values (Noise = 0.2, Discount = 0.9, Living reward = 0)
Time-Limited Values
▪ Key idea: time-limited values

▪ Define Vk(s) to be the optimal value of s if the game ends in k more time steps
▪ Equivalently, it’s what a depth-k expectimax would give from s

[Demo – time-limited values (L8D4)]


[Gridworld demo: time-limited values Vk shown for k = 0, 1, 2, …, 12 and k = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0)]
Computing Time-Limited Values
Value Iteration
▪ Start with V0(s) = 0: no time steps left means an expected reward sum of zero
▪ Given a vector of Vk(s) values, do one ply of expectimax from each state:

  Vk+1(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]

▪ Repeat until convergence, which yields V*

▪ Complexity of each iteration: O(S²A)

▪ Theorem: will converge to unique optimal values


▪ Basic idea: approximations get refined towards optimal values
▪ Policy may converge long before values do
Value Iteration
▪ Bellman equations characterize the optimal values:

  V*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]

▪ Value iteration computes them by repeatedly applying this update:

  Vk+1(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
Value Iteration (again ☺)

▪ Init:
  ∀s: V(s) = 0

▪ Iterate:
  ∀s: Vnew(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]
  V = Vnew
Note: can even directly assign to V(s), which will not compute the sequence of Vk but will still converge to V*
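The pseudocode above translates almost line for line into Python. Below is a minimal, self-contained sketch on the racing MDP (same assumed encoding as earlier; function and variable names are ours, not from the slides). With no discount (γ = 1) and two iterations it reproduces the values computed by hand in the example that follows: V2(Cool) = 3.5 and V2(Warm) = 2.5.

# Minimal value iteration on the racing MDP (assumed transition structure).
# Each (state, action) maps to a list of (probability, next_state, reward).
MDP = {
    ("cool", "slow"): [(1.0, "cool", +1)],
    ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
    ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
}
STATES = ["cool", "warm", "overheated"]

def q_value(V, s, a, gamma):
    """Q(s,a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in MDP[(s, a)])

def value_iteration(gamma, iterations):
    """Repeat the Bellman update a fixed number of times, starting from V0 = 0."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        V_new = {}
        for s in STATES:
            acts = [a for (s0, a) in MDP if s0 == s]
            # Terminal states have no actions; their value stays 0.
            V_new[s] = max((q_value(V, s, a, gamma) for a in acts), default=0.0)
        V = V_new
    return V

print(value_iteration(gamma=1.0, iterations=2))
# -> {'cool': 3.5, 'warm': 2.5, 'overheated': 0.0}

Setting gamma below 1, or iterating until the largest change falls below a tolerance, gives the convergent version discussed above.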
Example: Value Iteration

Assume no discount (γ = 1). Values are shown as [Cool, Warm, Overheated].

V0 = [0, 0, 0]

Computing V1:
  Cool:  Slow: 1;  Fast: .5*2 + .5*2 = 2            → V1(Cool) = 2
  Warm:  Slow: .5*1 + .5*1 = 1;  Fast: -10           → V1(Warm) = 1
V1 = [2, 1, 0]

Computing V2:
  Cool:  Slow: 1 + 2 = 3;  Fast: .5*(2+2) + .5*(2+1) = 3.5  → V2(Cool) = 3.5
  Warm:  Slow: .5*(1+2) + .5*(1+1) = 2.5;  Fast: -10         → V2(Warm) = 2.5
V2 = [3.5, 2.5, 0]
Convergence*
▪ How do we know the Vk vectors are going to converge? (assuming 0 < γ < 1)

▪ Proof Sketch:
▪ For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
▪ The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
▪ That last layer is at best all Rmax, and at worst Rmin
▪ But everything is discounted by γᵏ that far out
▪ So Vk and Vk+1 are at most γᵏ max|R| different
▪ So as k increases, the values converge
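Restating the last two bullets as an inequality: |Vk+1(s) − Vk(s)| ≤ γᵏ · max|R| for every state s, and since 0 < γ < 1, γᵏ → 0 as k → ∞, so the updates converge.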
Policy Extraction
Computing Actions from Values
▪ Let’s imagine we have the optimal values V*(s)

▪ How should we act?


▪ It’s not obvious!

▪ We need to do a mini-expectimax (one step):

  π*(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]

▪ This is called policy extraction, since it gets the policy implied by the values
Computing Actions from Q-Values
▪ Let’s imagine we have the optimal q-values Q*(s,a)

▪ How should we act?

▪ Completely trivial to decide: π*(s) = argmax_a Q*(s,a)

▪ Important lesson: actions are easier to select from q-values than values!
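Both extraction rules fit in a few lines of Python; this is a sketch under the same assumed dictionary encoding used earlier (V maps states to values, Q maps (state, action) pairs to q-values; function names are ours, not from the slides).

def extract_policy_from_values(V, mdp, states, gamma):
    """One-step expectimax: pick the action maximizing expected R + gamma * V."""
    policy = {}
    for s in states:
        acts = [a for (s0, a) in mdp if s0 == s]
        if not acts:          # terminal state: nothing to choose
            continue
        policy[s] = max(
            acts,
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)]),
        )
    return policy

def extract_policy_from_q_values(Q):
    """Trivial with q-values: take the argmax action for each state."""
    policy = {}
    for (s, a), q in Q.items():
        if s not in policy or q > Q[(s, policy[s])]:
            policy[s] = a
    return policy

Note how the q-value version needs no transition model at all, which is exactly the "important lesson" above.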
Problems with Value Iteration
▪ Value iteration repeats the Bellman updates:

  Vk+1(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]

▪ Problem 1: It’s slow – O(S²A) per iteration

▪ Problem 2: The “max” at each state rarely changes

▪ Problem 3: The policy often converges long before the values


[Gridworld values shown at k = 12 and again at k = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0)]
