2024 MDPs Part 1
Hanoi, 03/2025
Non-deterministic search
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path
[Figure: Racing MDP — states Cool, Warm, Overheated; actions Slow and Fast; edges labeled with transition probabilities (0.5, 1.0) and rewards (+1, +2)]
Racing Search Tree
MDP Search Trees
▪ Each MDP state projects an expectimax-like search tree
▪ s is a state
▪ (s, a) is a q-state
▪ (s, a, s’) is called a transition
▪ Why discount?
▪ Reward now is better than later (a small worked sum follows after this list)
▪ Also helps our algorithms converge
▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
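As a small worked illustration of the first point (the discount γ = 0.5 and the reward sequence here are made-up numbers, not from the slides), utilities of reward sequences are discounted sums:

```latex
% Utility of a reward sequence r_0, r_1, r_2, ... under discount gamma:
U([r_0, r_1, r_2, \dots]) \;=\; \sum_{t \ge 0} \gamma^{t} r_t
% e.g. gamma = 0.5 and rewards [1, 2, 3]:
%   U = 1 + 0.5 \cdot 2 + 0.25 \cdot 3 = 2.75,
% so the same rewards are worth less the later they arrive.
```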
Recap: Defining MDPs
▪ Markov decision processes (a minimal Python encoding follows below):
▪ Set of states S
▪ Start state s0
▪ Set of actions A
▪ Transitions P(s’|s,a) (or T(s,a,s’))
▪ Rewards R(s,a,s’) (and discount γ)
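To make the recap concrete, here is a minimal Python encoding of the racing MDP. The transition model and rewards are reconstructed from the racing diagram and the worked numbers later in the deck, so treat the exact values as an assumption; `T`, `R`, and `GAMMA` simply mirror the slide notation.

```python
# Minimal sketch of the racing MDP (reconstructed from the slides; the exact
# probabilities and rewards are an assumption).  T[s][a] lists (s', prob)
# pairs and R[(s, a, s')] is the reward for taking that transition.
STATES = ["cool", "warm", "overheated"]       # "overheated" is absorbing
ACTIONS = ["slow", "fast"]

T = {
    "cool": {
        "slow": [("cool", 1.0)],
        "fast": [("cool", 0.5), ("warm", 0.5)],
    },
    "warm": {
        "slow": [("cool", 0.5), ("warm", 0.5)],
        "fast": [("overheated", 1.0)],
    },
    "overheated": {},                         # terminal: no actions available
}

R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

GAMMA = 0.9   # discount; the worked example later in the deck uses gamma = 1
```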
Racing Search Tree
▪ We’re doing way too much work with expectimax!
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
Gridworld V* Values
[Figure: optimal values V*(s); Noise = 0.2, Discount = 0.9, Living reward = 0]
Gridworld Q* Values
[Figure: optimal q-values Q*(s,a); Noise = 0.2, Discount = 0.9, Living reward = 0]
Time-Limited Values
▪ Key idea: time-limited values. Define Vk(s) to be the optimal value of s if the game ends in k more time steps (equivalently, what a depth-k expectimax would compute from s)
[Figure series: Gridworld time-limited values Vk for k = 1, 2, …, 12 and k = 100; all with Noise = 0.2, Discount = 0.9, Living reward = 0]
Computing Time-Limited Values
Value Iteration
▪ Start with V0(s) = 0: no time steps left means an expected reward sum of zero
▪ Given vector of Vk(s) values, do one ply of expectimax from each state:
Vk+1(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
▪ Value iteration computes them: repeat this update until the values converge
Value Iteration (again ☺)
▪ Init: ∀s: V(s) = 0
▪ Iterate: ∀s: Vnew(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ], then set V = Vnew (a runnable sketch follows the note below)
Note: can even directly assign to V(s), which will not compute the sequence of Vk but will still converge to V*
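A runnable sketch of this loop, assuming the dictionary encoding of the racing MDP from the earlier recap (restated here so the snippet stands alone); `value_iteration` and `q_value` are illustrative names, not from the slides.

```python
# Value iteration: batch Bellman updates until the values stop changing.
# The racing-MDP dictionaries are restated (reconstructed, as before) so the
# snippet is self-contained.
T = {
    "cool": {"slow": [("cool", 1.0)], "fast": [("cool", 0.5), ("warm", 0.5)]},
    "warm": {"slow": [("cool", 0.5), ("warm", 0.5)], "fast": [("overheated", 1.0)]},
    "overheated": {},                              # terminal state
}
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0, ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0, ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

def q_value(V, s, a, gamma):
    """Q(s,a) = sum_s' T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[s][a])

def value_iteration(gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in T}                        # init: for all s, V(s) = 0
    while True:
        # Vnew(s) = max_a Q(s, a); terminal states keep value 0
        V_new = {s: (max(q_value(V, s, a, gamma) for a in T[s]) if T[s] else 0.0)
                 for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new                                  # batch update: V = Vnew

# Under the reconstructed model this converges to roughly
# {'cool': 15.5, 'warm': 14.5, 'overheated': 0.0}.
print(value_iteration(gamma=0.9))
```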
Example: Value Iteration
▪ Racing MDP, assume no discount (γ = 1); values listed in the order (Cool, Warm, Overheated)
▪ V0 = (0, 0, 0)
▪ V1: from Cool, Slow gives 1 and Fast gives .5*2 + .5*2 = 2, so V1(Cool) = 2; from Warm, Slow gives .5*1 + .5*1 = 1 and Fast gives -10, so V1(Warm) = 1; Overheated is terminal, so V1 = (2, 1, 0)
▪ V2: from Cool, Slow gives 1 + 2 = 3 and Fast gives .5*(2+2) + .5*(2+1) = 3.5, so V2(Cool) = 3.5; from Warm, Slow gives .5*(1+2) + .5*(1+1) = 2.5 and Fast gives -10, so V2(Warm) = 2.5; hence V2 = (3.5, 2.5, 0) (checked in the sketch below)
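The same two updates can be checked with plain arithmetic; the Fast-from-Warm branch (probability 1.0 to Overheated, reward -10) comes from the reconstructed racing model sketched earlier:

```python
# Check the worked example directly (gamma = 1, values for Cool and Warm).
V1_cool = max(1.0, 0.5 * 2 + 0.5 * 2)                                   # -> 2.0
V1_warm = max(0.5 * 1 + 0.5 * 1, -10.0)                                 # -> 1.0
V2_cool = max(1 + V1_cool, 0.5 * (2 + V1_cool) + 0.5 * (2 + V1_warm))   # -> 3.5
V2_warm = max(0.5 * (1 + V1_cool) + 0.5 * (1 + V1_warm), -10.0)         # -> 2.5
print(V1_cool, V1_warm, V2_cool, V2_warm)    # 2.0 1.0 3.5 2.5
```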
Convergence*
▪ How do we know the Vk vectors are going to converge?
(assuming 0 < γ < 1)
▪ Proof Sketch:
▪ For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
▪ The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
▪ That last layer is at best all RMAX, and at worst all RMIN
▪ But everything on that layer is discounted by γ^k that far out
▪ So Vk and Vk+1 differ by at most γ^k max|R| (see the bound spelled out below)
▪ So as k increases, the values converge
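Spelled out as a bound (a sketch; max|R| here denotes the largest one-step reward magnitude, which is not named explicitly on the slide):

```latex
% The depth-(k+1) and depth-k trees agree everywhere except the deepest layer,
% and that layer's contribution is discounted by gamma^k:
\max_s \bigl| V_{k+1}(s) - V_k(s) \bigr| \;\le\; \gamma^{k}\, \max |R|
% Since 0 < gamma < 1, the right-hand side goes to 0 as k grows, so the
% sequence (V_k) is Cauchy and converges (its limit is V^*).
```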
Policy Extraction
Computing Actions from Values
▪ Let’s imagine we have the optimal values V*(s)
▪ How should we act? We need a one-step lookahead (a mini-expectimax): π*(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
▪ This is called policy extraction, since it gets the policy implied by the values (sketched in code below)
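A minimal sketch of that one-step lookahead in Python, assuming the racing-MDP dictionaries `T` and `R` from the earlier sketches and a value table `V` (for example the output of `value_iteration`); `extract_policy` is an illustrative name:

```python
# Policy extraction: mini-expectimax (one step) against the given values.
def extract_policy(V, T, R, gamma=0.9):
    policy = {}
    for s, acts in T.items():
        if not acts:                       # terminal state: nothing to choose
            policy[s] = None
            continue
        # pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]
        policy[s] = max(acts, key=lambda a: sum(
            p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in acts[a]))
    return policy
```

Under the reconstructed racing model with γ = 0.9 this yields Fast in Cool and Slow in Warm.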
Computing Actions from Q-Values
▪ Let’s imagine we have the optimal q-values Q*(s,a)
▪ How should we act? Simply take the action with the best q-value: π*(s) = argmax_a Q*(s,a) (one-liner sketch below)
▪ Important lesson: actions are easier to select from q-values than values!
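By contrast, with q-values no model or lookahead is needed; a one-liner sketch (the `Q` table keyed by (state, action) pairs is a hypothetical input):

```python
# Acting from q-values is just an argmax over the actions available in s.
def best_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])
```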
Problems with Value Iteration
▪ Value iteration repeats the Bellman updates:
Vk+1(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
▪ Problem 1: It’s slow – O(S²A) per iteration
[Figure: Gridworld values at k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0]