CH 9 MDP
Dynamic programming
Introduction to Markov decision processes
Markov decision processes formulation
Discounted Markov decision processes
Average cost Markov decision processes
Continuous-time Markov decision processes
Xiaolan Xie
Dynamic programming
Basic principle of dynamic programming
Some applications
Introduction
Dynamic programming (DP) is a general optimization technique based on implicit enumeration of the solution space.
The problems should have a particular sequential structure, such that the set of unknowns can be determined sequentially.
It is based on the "principle of optimality".
A wide range of problems can be put in sequential form and solved by dynamic programming.
Introduction
Applications:
Optimal control
Most problems in graph theory
Investment
Deterministic and stochastic inventory control
Project scheduling
Production scheduling
[Figure: example shortest-path network with arc lengths 10, 10, 10, 10, 5, 8, 15.]
Principle of optimality: every subpath, say from A to B, of an optimal path is itself optimal.
Corollary:
$$SP(x_0, y) = \min \{ SP(x_0, z) + l(z, y) \mid z : \text{predecessor of } y \}$$
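As an illustration, a minimal sketch of the corollary as code, assuming the network is a DAG given by predecessor lists and arc lengths (all names are illustrative):

```python
# Shortest-path lengths SP(x0, y) on a DAG, computed by the corollary
# SP(x0, y) = min over predecessors z of { SP(x0, z) + l(z, y) }.
def shortest_paths(topo_order, predecessors, length, x0):
    SP = {x0: 0.0}
    for y in topo_order:               # nodes in topological order
        if y == x0:
            continue
        # assumes every node other than x0 has a predecessor reachable from x0
        SP[y] = min(SP[z] + length[z, y] for z in predecessors[y])
    return SP
```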
Solving a problem by DP
1. Extension
Extend the problem to a family of problems of the same nature
2. Recursive Formulation (application of the principle of optimality)
Link optimal solutions of these problems by a recursive relation
3. Decomposition into steps or phases
Define the order of resolution of the problems in such a way
that, when solving a problem P, the optimal solutions of all other
problems needed for the computation of P are already known.
4. Computation by steps
Solving a problem by DP
Difficulties in using dynamic programming:
identification of the family of problems;
transformation of the problem into a sequential form.
[Figure: deterministic DP model; at each decision epoch t, applying action u_t in state x_t incurs stage cost g_t(x_t, u_t) and moves the system to state x_{t+1} = f_t(x_t, u_t) at the next decision epoch.]

Objective:
$$\text{Minimize } g_N(x_N) + \sum_{t=0}^{N-1} g_t(x_t, u_t)$$

Cost-to-go of state x at epoch n:
$$J_n(x) = \min \left\{ g_N(x_N) + \sum_{t=n}^{N-1} g_t(x_t, u_t) \,\middle|\, x_n = x \right\}$$

Backward recursion (principle of optimality):
$$J_n(x) = \min_{u_n} \left\{ g_n(x, u_n) + J_{n+1}\big( f_n(x, u_n) \big) \right\}$$
Applications
Single machine scheduling (Knapsack)
Inventory control
Traveling salesman problem
Applications
Single machine scheduling (Knapsack)
Problem :
Consider a set of N production requests, each needing a
production time ti on a bottleneck machine and generating a
profit pi. The capacity of the bottleneck machine is C.
Question: determine the production requests to confirm in
order to maximize the total profit.
Formulation:
$$\max \sum_i p_i X_i$$
subject to:
$$\sum_i t_i X_i \le C, \qquad X_i \in \{0, 1\}$$
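A minimal DP sketch for this formulation, assuming integer processing times ti (function and variable names are illustrative):

```python
# Knapsack DP: J[k][c] = best profit using requests 1..k with capacity c.
def knapsack(times, profits, C):
    N = len(times)
    J = [[0] * (C + 1) for _ in range(N + 1)]
    for k in range(1, N + 1):
        t, p = times[k - 1], profits[k - 1]
        for c in range(C + 1):
            J[k][c] = J[k - 1][c]                    # reject request k
            if t <= c:                               # accept request k
                J[k][c] = max(J[k][c], J[k - 1][c - t] + p)
    return J[N][C]
```

Note how the capacity index c extends the problem to a family of problems of the same nature, exactly as in the extension step above.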
Applications
Inventory control
See exercises
Applications
Traveling salesman problem
Problem :
Data: a graph with N nodes and a distance matrix
[dij] between any two nodes i and j.
Question: determine a circuit of minimum total
distance passing through each node exactly once.
Extension:
C(y, S): length of the shortest path from y to x0 passing exactly once through each node in S.
Recursion: $C(y, S) = \min_{z \in S} \{ d_{yz} + C(z, S \setminus \{z\}) \}$, with $C(y, \emptyset) = d_{y x_0}$.
Application: machine scheduling with setups.
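A sketch of this recursion (the Held-Karp algorithm) in code, where node 0 plays the role of x0 and all names are illustrative:

```python
# Held-Karp DP for the TSP. C(y, S) = length of the shortest path from y
# back to node 0 visiting every node of the frozenset S exactly once.
from functools import lru_cache

def tsp(d):
    N = len(d)
    @lru_cache(maxsize=None)
    def C(y, S):                      # S: frozenset of nodes still to visit
        if not S:
            return d[y][0]
        return min(d[y][z] + C(z, S - {z}) for z in S)
    return C(0, frozenset(range(1, N)))
```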
Applications
Total tardiness minimization on a single machine

Si: starting time of job i
Xij = 1 if job i precedes job j, 0 otherwise
Ti: tardiness of job i

$$\min \sum_{i=1}^{n} w_i T_i$$
subject to:
$$T_i \ge S_i + p_i - d_i$$
$$S_j \ge S_i + p_i - M(1 - X_{ij})$$
$$S_i, T_i \ge 0, \qquad X_{ij} \in \{0, 1\}$$

Job | Due date di | Processing time pi | Weight wi
1   | 5           | 2                  | 6
2   | 1           | 4                  | 2
3   | 5           | 3                  | 3
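Besides this MIP formulation, the problem can be solved by DP over job subsets, since the jobs scheduled first occupy the machine for a fixed total time P(S); a minimal sketch (the recursion F(S) = min over j in S of F(S minus j) + wj·max(0, P(S) − dj) is standard, names are illustrative):

```python
# DP over job subsets for total weighted tardiness (a sketch).
# F(S): minimum cost of sequencing the jobs in S first; the job j placed
# last in S completes at time P(S) = total processing time of S.
from functools import lru_cache

def weighted_tardiness(p, d, w):
    n = len(p)
    @lru_cache(maxsize=None)
    def F(S):                                    # S: bitmask of jobs
        if S == 0:
            return 0.0
        total = sum(p[j] for j in range(n) if S >> j & 1)
        return min(F(S & ~(1 << j)) + w[j] * max(0, total - d[j])
                   for j in range(n) if S >> j & 1)
    return F((1 << n) - 1)

# Example data from the table above:
print(weighted_tardiness(p=[2, 4, 3], d=[5, 1, 5], w=[6, 2, 3]))
```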
[Figure: stochastic DP model; at each decision epoch t, applying action u_t in state x_t under random perturbation w_t incurs stage cost g_t(x_t, u_t, w_t) and moves the system to state x_{t+1} at the next decision epoch.]

Objective:
$$\text{Minimize } E\left[ g_N(x_N) + \sum_{t=0}^{N-1} g_t(x_t, u_t, w_t) \right]$$
Open-loop control: order quantities u1, u2, ..., uN-1 are determined once, at time 0.
Closed-loop control: each order quantity ut is selected at time t, on the basis of the observed state xt.
$$J(x_0) = E\left[ \sum_{t=0}^{N-1} \big( c\, u_t + r(x_t + u_t - w_t) \big) \right]$$
where c is the unit ordering cost and r(·) the holding/shortage cost.

Optimal control: minimize J(x0) over all possible policies.

$$p_{ij}(u, t) = P\{ x_{t+1} = j \mid x_t = i,\ u_t = u \}$$
$$J_\pi(x_0) = E\left[ g_N(x_N) + \sum_{t=0}^{N-1} g_t\big( x_t, \mu_t(x_t), w_t \big) \right]$$

$$J^*(x_0) = \min_{\pi} J_\pi(x_0)$$

$$J_i(x_i) = \min_{\pi} E\left[ g_N(x_N) + \sum_{t=i}^{N-1} g_t\big( x_t, \mu_t(x_t), w_t \big) \right]$$

Optimality equation:
$$J_t(x_t) = \min_{u_t \in U_t(x_t)} E_{w_t}\left[ g_t(x_t, u_t, w_t) + J_{t+1}\big( f_t(x_t, u_t, w_t) \big) \right]$$
Cost-to-go of the inventory example:

State | Stage 0 cost-to-go | Stage 1 cost-to-go | Stage 2 cost-to-go
0     | 3.7                | 2.5                | 1.3
1     | 2.7                | 1.5                | 0.3
2     | 2.818              | 1.68               | 1.1

[The original table also lists the optimal order quantity for each stage and state.]
Policy: a sequence of decision rules in order to minimize the cost function.
Issues: existence of an optimal policy.

[Figure: the action applied in the present state determines the costs and the next state.]
Applications
Inventory management
Bus engine replacement
Highway pavement maintenance
Bed allocation in hospitals
Personnel staffing in fire departments
Traffic control in communication networks
Example

$$\text{Minimize } E\left[ \int_0^T g(X_t)\, dt \right] \quad \text{with} \quad g(X) = \begin{cases} hX, & \text{if } X \ge 0 \\ -bX, & \text{if } X < 0 \end{cases}$$

[Figure: make-to-stock system; from each stock level X = ..., 0, 1, 2, 3, ..., the action (make, p) moves the stock up at rate p and demand moves it down at rate d.]
Example
Zero-stock policy: produce if and only if the stock level X is negative.

[Figure: Markov chain of the stock level under the zero-stock policy; states ..., -2, -1, 0 with production at rate p moving the stock up and demand at rate d moving it down.]
Decision epochs
Times at which decisions are made.
A Markov decision process is characterized by {T, S, As, pt(· |s, a), Ct(s, a)}.
Transition probability vectors (p(0|s,a), p(1|s,a), p(2|s,a)) for each state-action pair:

Action | State 0         | State 1         | State 2
a = 0  | (1, 0, 0)       | (0.9, 0.1, 0)   | (0.2, 0.7, 0.1)
a = 1  | (0.9, 0.1, 0)   | (0.2, 0.7, 0.1) | not allowed
a = 2  | (0.2, 0.7, 0.1) | not allowed     | not allowed
Decision Rules
A decision rule prescribes a procedure for action selection in each
state at a specified decision epoch.
A decision rule can be either Markovian (depending only on the current
state) or history-dependent, and either deterministic or randomized.
As a result, the decision rules can be: Markovian deterministic (MD),
Markovian randomized (MR), history-dependent deterministic (HD), or
history-dependent randomized (HR).
Policies
A policy specifies the decision rule to be used at every decision epoch.
A policy is a sequence of decision rules, i.e. π = {d1, d2, ..., dN-1}.
Example
Decision epochs: T = {1, 2, ..., N}
States: S = {s1, s2}
Actions: As1 = {a11, a12}, As2 = {a21}
Costs: Ct(s1, a11) = 5, Ct(s1, a12) = 10, Ct(s2, a21) = -1, CN(s1) = CN(s2) = 0
Transition probabilities: pt(s1|s1, a11) = 0.5, pt(s2|s1, a11) = 0.5, pt(s1|s1, a12) = 0, pt(s2|s1, a12) = 1, pt(s1|s2, a21) = 0, pt(s2|s2, a21) = 1

[Figure: two-state diagram; arcs are labeled {cost, probability}: a11 loops on s1 with {5, 0.5} and leads to s2 with {5, 0.5}; a12 leads from s1 to s2 with {10, 1}; a21 loops on s2 with {-1, 1}.]
Example
A deterministic Markov policy
Decision epoch 1:
d1(s1) = a11, d1(s2) = a21
Decision epoch 2:
d2(s1) = a12, d2(s2) = a21
Example
A randomized Markov policy
Decision epoch 1:
P1, s1(a11) = 0.7, P1, s1(a12) = 0.3
P1, s2(a21) = 1
Decision epoch 2:
P2, s1(a11) = 0.4, P2, s1(a12) = 0.6
P2, s2(a21) = 1
Example
A deterministic history-dependent policy
Decision epoch 1: d1(s1) = a11, d1(s2) = a21
Decision epoch 2:

History h | d2(h, s1)  | d2(h, s2)
(s1, a11) | a13        | a21
(s1, a12) | infeasible | a21
(s1, a13) | a11        | infeasible
(s2, a21) | infeasible | a21

[Figure: the same MDP as above, extended with a third action a13 at s1, labeled {0, 1}.]
Example
A randomized history-dependent policy
Decision epoch 1:
Decision epoch 2, at s = s1:

History h | P(a = a11) | P(a = a12) | P(a = a13)
(s1, a11) | 0.4        | 0.3        | 0.3
(s1, a12) | infeasible | infeasible | infeasible
(s1, a13) | 0.8        | 0.1        | 0.1
(s2, a21) | infeasible | infeasible | infeasible

At s = s2, select a21.

[Figure: the same MDP diagram, with a13 at s1 labeled {0, 1}.]
Remarks
Each Markov policy leads to a discrete-time Markov chain, and the
policy can be evaluated by solving the related Markov chain.
Assumptions
Assumption 1: The decision epochs T = {1, 2, ..., N}
Assumption 2: The state space S is finite or countable
Assumption 3: The action space As is finite for each s
Criterion:
$$\inf_{\pi \in \Pi^{HR}} E^\pi\left[ \sum_{t=1}^{N-1} C_t(X_t, a_t) + C_N(X_N) \,\middle|\, X_1 = s \right]$$
Optimality equations
Theorem: The value functions
$$V_n(s) = \min_{\pi \in \Pi^{HR}} E^\pi\left[ \sum_{t=n}^{N-1} C_t(X_t, a_t) + C_N(X_N) \,\middle|\, X_n = s \right]$$
satisfy the backward recursion
$$V_t(s) = \min_{a \in A_s} \left\{ C_t(s, a) + \sum_{j \in S} p_t(j \mid s, a)\, V_{t+1}(j) \right\}, \qquad V_N(s) = C_N(s),$$
and the action a that minimizes the above term defines the optimal policy.
Optimality equations
The optimality equation can also be expressed as:
$$V_t(s) = \min_{a \in A_s} Q_t(s, a), \qquad Q_t(s, a) = C_t(s, a) + \sum_{j \in S} p_t(j \mid s, a)\, V_{t+1}(j)$$
Backward induction algorithm:
1. Set t = N and V_N(s) = C_N(s) for all s.
2. Substitute t - 1 for t and compute, for every s,
$$V_t(s) = \min_{a \in A_s} \left\{ C_t(s, a) + \sum_{j \in S} p_t(j \mid s, a)\, V_{t+1}(j) \right\}$$
$$d_t(s) = \arg\min_{a \in A_s} \left\{ C_t(s, a) + \sum_{j \in S} p_t(j \mid s, a)\, V_{t+1}(j) \right\}$$
3. Repeat step 2 until t = 1.
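A minimal sketch of backward induction in code, assuming costs and transition probabilities are given as nested containers indexed by epoch, state, and action (all names and the 0-based epochs are illustrative):

```python
# Backward induction for a finite-horizon MDP (a sketch; epochs 0..N-1,
# C[t][s][a]: stage cost, p[t][s][a][j]: transition probability,
# CN[s]: terminal cost).
def backward_induction(C, p, CN, N):
    V = list(CN)                       # V_{t+1}, initialized with V_N
    policy = [None] * N
    for t in reversed(range(N)):       # t = N-1, ..., 0
        V_t, d_t = [], []
        for s in range(len(CN)):
            Q = [C[t][s][a] + sum(pj * Vj for pj, Vj in zip(p[t][s][a], V))
                 for a in range(len(C[t][s]))]
            a_star = min(range(len(Q)), key=Q.__getitem__)
            V_t.append(Q[a_star])
            d_t.append(a_star)
        V, policy[t] = V_t, d_t
    return V, policy                   # V = V_0, policy[t][s] = optimal action
```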
Discounted Markov decision processes
Assumptions
Assumption 1: The decision epochs T = {1, 2, ...}
Assumption 2: The state space S is finite or countable
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities;
C(s, a) and p(j |s, a) do not vary from decision epoch to
decision epoch
Assumption 5: Bounded costs: | Ct(s, a) | ≤ M < ∞ for all a ∈ As
and all s ∈ S (to be relaxed)
Assumptions
Criterion:
$$\inf_{\pi \in \Pi^{HR}} \lim_{N \to \infty} E^\pi\left[ \sum_{t=1}^{N} \lambda^{t-1} C_t(X_t, a_t) \,\middle|\, X_1 = s \right]$$
where
0 < λ < 1 is the discount factor;
Π^HR is the set of all possible policies.
Optimality equations
Theorem: Under Assumptions 1-5, the optimal cost function
$$V^*(s) = \inf_{\pi \in \Pi^{HR}} \lim_{N \to \infty} E^\pi\left[ \sum_{t=1}^{N} \lambda^{t-1} C_t(X_t, a_t) \,\middle|\, X_1 = s \right]$$
exists and satisfies the optimality equation
$$V^*(s) = \min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V^*(j) \right\}$$

Value iteration:
$$V^{n+1}(s) = \min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V^n(j) \right\}$$
$$d(s) = \arg\min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V^n(j) \right\}$$
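A minimal value iteration sketch; the fixed stopping tolerance is an assumption, since the slides do not specify a stopping criterion:

```python
# Value iteration for a discounted MDP (a sketch; C[s][a]: cost,
# p[s][a][j]: transition probability, 0 < lam < 1: discount factor).
def value_iteration(C, p, lam, eps=1e-8):
    nS = len(C)
    V = [0.0] * nS
    while True:
        Q = [[C[s][a] + lam * sum(pj * Vj for pj, Vj in zip(p[s][a], V))
              for a in range(len(C[s]))] for s in range(nS)]
        V_new = [min(Qs) for Qs in Q]
        if max(abs(v1 - v0) for v1, v0 in zip(V_new, V)) < eps:
            d = [min(range(len(Qs)), key=Qs.__getitem__) for Qs in Q]
            return V_new, d            # value function and greedy policy
        V = V_new
```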
Policy improvement:
$$d^{n+1}(s) = \arg\min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V^n(j) \right\}$$
Policy evaluation: the value of a stationary policy d,
$$V^d(s) = E\left[ \sum_{t=1}^{\infty} \lambda^{t-1} C\big(X_t, d(X_t)\big) \,\middle|\, X_1 = s \right],$$
solves the linear system
$$V^d(s) = C\big(s, d(s)\big) + \lambda \sum_{j \in S} p\big(j \mid s, d(s)\big)\, V^d(j)$$
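A sketch of policy iteration in which the evaluation step solves this linear system directly (numpy-based; all names are illustrative):

```python
# Policy iteration for a discounted MDP (a sketch).
# Policy evaluation solves (I - lam * P_d) V = C_d.
import numpy as np

def policy_iteration(C, p, lam):
    nS = len(C)
    d = [0] * nS                              # initial policy
    while True:
        P_d = np.array([p[s][d[s]] for s in range(nS)])
        C_d = np.array([C[s][d[s]] for s in range(nS)])
        V = np.linalg.solve(np.eye(nS) - lam * P_d, C_d)   # evaluation
        d_new = [min(range(len(C[s])),                     # improvement
                     key=lambda a: C[s][a] + lam * float(np.dot(p[s][a], V)))
                 for s in range(nS)]
        if d_new == d:
            return V, d
        d = d_new
```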
$$V(s) = \min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V(j) \right\}$$

Linear programming formulation:
$$\max \sum_{s \in S} V(s)$$
subject to
$$V(s) \le C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V(j), \quad \forall (s, a)$$
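A sketch of this linear program using scipy.optimize.linprog (assuming the maximization form above; names are illustrative):

```python
# Solving the discounted-MDP linear program with scipy (a sketch).
import numpy as np
from scipy.optimize import linprog

def lp_solve(C, p, lam):
    nS = len(C)
    rows, rhs = [], []
    for s in range(nS):
        for a in range(len(C[s])):
            row = -lam * np.array(p[s][a])
            row[s] += 1.0                     # V(s) - lam * sum_j p(j|s,a) V(j)
            rows.append(row)
            rhs.append(C[s][a])
    res = linprog(c=-np.ones(nS),             # maximize sum_s V(s)
                  A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * nS)
    return res.x                              # optimal value function V
```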
$$V^*(s) = \min_{a \in A_s} \left\{ C(s, a) + \sum_{j \in S} p(j \mid s, a)\, V^*(j) \right\}$$

Theorem 2. Assume that the set of control actions is finite. Then, under
the condition C(s, a) ≥ 0 for all states s and control actions a, we have
$$\lim_{N \to \infty} V^N(s) = V^*(s)$$
where V^N(s) is the solution of the value iteration algorithm with V^0(s) = 0.
Implication of Theorem 2: the optimal cost can be obtained as the limit
of value iteration, and the optimal stationary policy can also be obtained in
the limit.
Example
Consider a computer system consisting of M different processors.
Using processor i for a job incurs a finite cost Ci with C1 < C2 < ... < CM.
When we submit a job to this system, processor i is assigned to our job with
probability pi.
At this point we can (a) decide to go with this processor, or (b) hold the
job until a lower-cost processor is assigned.
The system periodically returns to our job and assigns a processor in the
same way.
Waiting until the next processor assignment incurs a fixed finite cost c.
Question:
How do we decide whether to go with the processor currently assigned to
our job, or to wait for the next assignment?
Suggestions:
The state definition should include all information useful for the decision.
The problem belongs to the so-called stochastic shortest path problem.
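One possible formulation, an assumption on my part since the slide only gives hints: take the state to be the index of the currently assigned processor, so that V(i) = min{Ci, c + Σj pj V(j)}, which can be solved by fixed-point iteration:

```python
# Fixed-point iteration for V(i) = min(C_i, c + sum_j p_j * V(j))
# (a sketch; state = index of the currently assigned processor).
def processor_values(C, p, c, iters=10000):
    V = list(C)                         # start from "always accept"
    for _ in range(iters):
        wait = c + sum(pj * Vj for pj, Vj in zip(p, V))
        V = [min(Ci, wait) for Ci in C]
    return V                            # accept processor i iff C[i] <= wait
```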
Average cost Markov decision processes
Assumptions
Assumption 1: The decision epochs T = {1, 2, ...}
Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities;
C(s, a) and p(j |s, a) do not vary from decision epoch to
decision epoch
Assumption 5: Bounded costs: | Ct(s, a) | ≤ M < ∞ for all a ∈ As
and all s ∈ S
Assumption 6: The Markov chain corresponding to any
stationary deterministic policy contains a single recurrent
class. (Unichain)
Assumptions
Criterion:
$$\inf_{\pi \in \Pi^{HR}} \lim_{N \to \infty} \frac{1}{N}\, E^\pi\left[ \sum_{t=1}^{N} C_t(X_t, a_t) \,\middle|\, X_1 = s \right]$$
where
Π^HR is the set of all possible policies.
Optimal policy
Under Assumptions 1-6, there exists an optimal stationary
deterministic policy.
Further, there exist a real number g and a value function h(s) that
satisfy the following optimality equation:
$$h(s) + g = \min_{a \in A_s} \left\{ C(s, a) + \sum_{j \in S} p(j \mid s, a)\, h(j) \right\}$$
where g is the optimal average cost and h is the differential cost:
$$h(s) = \lim_{\lambda \to 1} \big( V_\lambda(s) - V_\lambda(x_0) \big)$$
Linear programming formulation:
$$\max g$$
subject to
$$g + h(s) \le C(s, a) + \sum_{j \in S} p(j \mid s, a)\, h(j), \quad \forall (s, a)$$
$$h(x_0) = 0$$
Remarks: value iteration and policy iteration can also be
extended to the average cost case.
Relative value iteration:
$$U^{n+1}(s) = \min_{a \in A_s} \left\{ C(s, a) + \sum_{j \in S} p(j \mid s, a)\, h^n(j) \right\}$$
$$h^{n+1}(s) = U^{n+1}(s) - U^{n+1}(s_0), \qquad g^n = U^{n+1}(s_0)$$
$$d(s) = \arg\min_{a \in A_s} \left\{ C(s, a) + \sum_{j \in S} p(j \mid s, a)\, h^{n+1}(j) \right\}$$
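A sketch of relative value iteration in code, taking s0 to be state 0 and a fixed iteration count (both are assumptions):

```python
# Relative value iteration for an average-cost MDP (a sketch; s0 = 0).
def relative_value_iteration(C, p, iters=1000):
    nS = len(C)
    h = [0.0] * nS
    for _ in range(nS and iters):
        U = [min(C[s][a] + sum(p[s][a][j] * h[j] for j in range(nS))
                 for a in range(len(C[s]))) for s in range(nS)]
        g = U[0]                          # estimate of the average cost
        h = [U[s] - U[0] for s in range(nS)]
    d = [min(range(len(C[s])),
             key=lambda a: C[s][a] + sum(p[s][a][j] * h[j] for j in range(nS)))
         for s in range(nS)]
    return g, h, d
```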
Continuous-time Markov decision processes
Assumptions
Assumption 1: The decision epochs T = R+
Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary cost rates and transition rates;
C(s, a) and μ(j |s, a) do not vary over time
Assumptions
Criterion (discounted cost):
$$\inf_{\pi \in \Pi^{HR}} E^\pi\left[ \int_0^\infty C\big(X(t), a(t)\big)\, e^{-\beta t}\, dt \right]$$
Criterion (average cost):
$$\inf_{\pi \in \Pi^{HR}} \lim_{T \to \infty} \frac{1}{T}\, E^\pi\left[ \int_0^T C\big(X(t), a(t)\big)\, dt \right]$$
Example

$$\text{Minimize } \int_0^\infty E\big[ g(X_t) \big]\, e^{-\beta t}\, dt \quad \text{with} \quad g(X) = \begin{cases} hX, & \text{if } X \ge 0 \\ -bX, & \text{if } X < 0 \end{cases}$$

[Figure: make-to-stock system; from each stock level X = ..., 0, 1, 2, 3, ..., the action (make, p) moves the stock up at rate p and demand moves it down at rate d.]
Uniformization
Any continuous-time Markov chain can be converted to a
discrete-time chain through a process called "uniformization".
Uniformization
In order to synchronize (uniformize) the transitions at the same
pace, we choose a uniformization rate
$$\Lambda \ge \max_i \{\lambda(i)\}$$
where λ(i) is the total transition rate out of state i.
In the uniformized Markov chain:
transitions occur only at instants generated by a common
Poisson process of rate Λ (also called standard clock);
the state-transition probabilities are
$$p_{ij} = \lambda_{ij} / \Lambda, \qquad p_{ii} = 1 - \lambda(i)/\Lambda$$
where the self-loop transitions correspond to fictitious
events.
Uniformization

[Figure: a two-state CTMC with rates a (S1 → S2) and b (S2 → S1). Uniformized with rate Λ ≥ max{a, b}, it becomes a DTMC in which S1 moves to S2 with probability a/Λ and loops with probability 1 − a/Λ, while S2 moves to S1 with probability b/Λ and loops with probability 1 − b/Λ.]
Uniformization
For a Markov decision process, the uniformization rate Λ
should be such that
$$\Lambda \ge \lambda(s, a) = \sum_{j \in S} \mu(j \mid s, a)$$
for all states s and for all possible control actions a.
The state-transition probabilities of the uniformized Markov
decision process become:
$$p(j \mid s, a) = \mu(j \mid s, a)/\Lambda, \qquad p(s \mid s, a) = 1 - \sum_{j \ne s} \mu(j \mid s, a)/\Lambda$$
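A minimal sketch of this uniformization step in code (the data layout mu[s][a][j] for the transition rates is an assumption):

```python
# Uniformizing a CTMDP (a sketch): convert rates mu[s][a][j] into the
# transition probabilities of a discrete-time chain with clock rate Lam.
def uniformize(mu, Lam):
    nS = len(mu)
    p = []
    for s in range(nS):
        p_s = []
        for a in range(len(mu[s])):
            rate_out = sum(mu[s][a][j] for j in range(nS) if j != s)
            assert Lam >= rate_out, "uniformization rate too small"
            row = [mu[s][a][j] / Lam for j in range(nS)]
            row[s] = 1.0 - rate_out / Lam   # fictitious self-loop
            p_s.append(row)
        p.append(p_s)
    return p
```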
Uniformization

[Figure: the uniformized make-to-stock chain with Λ = p + d; from each stock level, action (make) moves the stock up with probability p/Λ, action (not make) keeps the p/Λ mass in place as a self-loop, and demand moves the stock down with probability d/Λ in both cases.]
Uniformization
Under the uniformization:
a sequence of discrete decision epochs T1, T2, ... is generated,
where T_{k+1} − T_k ~ EXP(Λ);
the discrete-time Markov chain describes the state of the system at
these decision epochs;
all criteria can be easily converted.

[Figure: a sample path with decision epochs T0, T1, T2, T3 separated by EXP(Λ) intervals; a fixed cost K(s, a) may be incurred at each transition in addition to the cost rate C(s, a).]

For the discounted criterion:
$$E\left[ \int_0^\infty C\big(X(t), a(t)\big)\, e^{-\beta t}\, dt \right] = E\left[ \sum_{k=0}^{\infty} \int_{T_k}^{T_{k+1}} C(X_k, a_k)\, e^{-\beta t}\, dt \right] = \sum_{k=0}^{\infty} E\big[ C(X_k, a_k) \big]\, E\left[ \int_{T_k}^{T_{k+1}} e^{-\beta t}\, dt \right] = \frac{1}{\beta + \Lambda}\, E\left[ \sum_{k=0}^{\infty} \left( \frac{\Lambda}{\beta + \Lambda} \right)^{k} C(X_k, a_k) \right]$$

where γ = Λ/(β + Λ) is a discount factor.

For the average-cost criterion:
$$\lim_{T \to \infty} \frac{1}{T}\, E\left[ \int_0^T C\big(X(t), a(t)\big)\, dt \right] = \lim_{N \to \infty} \frac{1}{N}\, E\left[ \sum_{k=0}^{N-1} C(X_k, a_k) \right]$$
Optimality equation (discounted cost):
$$V(s) = \min_{a \in A_s} \left\{ \frac{C(s, a)}{\beta + \Lambda} + K(s, a) + \gamma \sum_{j \in S} p(j \mid s, a)\, \big[ k(s, a, j) + V(j) \big] \right\}$$
where K(s, a) is a fixed cost incurred by action a in state s and
k(s, a, j) a fixed cost attached to the transition from s to j.
Optimality equation (average cost):
$$h(s) + g = \min_{a \in A_s} \left\{ \frac{C(s, a)}{\Lambda} + \sum_{j \in S} p(j \mid s, a)\, h(j) \right\}$$
where g is the average cost per period of the uniformized chain.
Example (continued)
Uniformize the Markov decision process with rate Λ = p + d.
The optimality equation:
$$V(s) = \min\left\{ \frac{g(s)}{\beta + p + d} + \frac{p}{\beta + p + d}\, V(s+1) + \frac{d}{\beta + p + d}\, V(s-1),\ \frac{g(s)}{\beta + p + d} + \frac{p}{\beta + p + d}\, V(s) + \frac{d}{\beta + p + d}\, V(s-1) \right\}$$
where the first term corresponds to producing and the second to not producing.
Example (continued)
From the optimality equation:
$$V(s) = \frac{g(s)}{\beta + p + d} + \frac{p}{\beta + p + d} \Big( V(s) + \min\big\{ V(s+1) - V(s),\ 0 \big\} \Big) + \frac{d}{\beta + p + d}\, V(s-1)$$
Hence it is optimal to produce in state s if and only if V(s+1) ≤ V(s).
Example (continued)
Convexity proved by value iteration:
$$V^{n+1}(s) = \frac{g(s)}{\beta + p + d} + \frac{p}{\beta + p + d} \min\big\{ V^n(s+1),\ V^n(s) \big\} + \frac{d}{\beta + p + d}\, V^n(s-1), \qquad V^0(s) = 0$$
Proof by induction:
V^0 is convex.
If V^n is convex with minimum at s = K, then min{V^n(s+1), V^n(s)} is convex, and hence V^{n+1} is convex.
The minimum point K of the value function is the hedging point: it is optimal to produce if and only if s < K.
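A sketch that computes the hedging point numerically by value iteration on a truncated state space (the truncation bounds and the parameter values in the usage line are assumptions):

```python
# Hedging point for the uniformized make-to-stock example (a sketch).
def hedging_point(p, d, beta, h, b, smin=-20, smax=20, iters=2000):
    states = range(smin, smax + 1)
    V = {s: 0.0 for s in states}
    g = lambda s: h * s if s >= 0 else -b * s     # holding / backlog cost
    for _ in range(iters):
        V = {s: (g(s)
                 + p * min(V.get(s + 1, V[smax]), V[s])   # produce vs. not
                 + d * V.get(s - 1, V[smin])) / (beta + p + d)
             for s in states}
    # produce iff V(s+1) <= V(s); the switching level is the hedging point K
    return next((s for s in states if s + 1 <= smax and V[s + 1] > V[s]), smax)

print(hedging_point(p=1.0, d=0.5, beta=0.1, h=1.0, b=10.0))
```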