CH 9 MDP


Plan

Dynamic programming
Introduction to Markov decision processes
Markov decision process formulation
Discounted Markov decision processes
Average cost Markov decision processes
Continuous-time Markov decision processes
Xiaolan Xie

Dynamic programming

Basic principle of dynamic programming
Some applications
Stochastic dynamic programming

Introduction

Dynamic programming (DP) is a general optimization technique based on implicit enumeration of the solution space.
The problem should have a particular sequential structure, such that the unknowns can be determined sequentially.
It is based on the "principle of optimality".
A wide range of problems can be put in sequential form and solved by dynamic programming.

Introduction

Applications:
Optimal control
Most problems in graph theory
Investment
Deterministic and stochastic inventory control
Project scheduling
Production scheduling

We limit ourselves to discrete optimization.

Illustration of DP by the shortest path problem

Problem: We are planning the construction of a highway from city A to city K. The different construction alternatives and their costs are given in the following graph. The problem consists in determining the highway with the minimum total cost.

[Figure: graph of cities A to K with arc construction costs (e.g. 14, 3, 8, 10, 5, 15); figure omitted.]

BELLMAN's principle of optimality

General form:
If C belongs to an optimal path from A to B, then the sub-paths from A to C and from C to B are also optimal.
In other words, every sub-path of an optimal path is optimal.

[Figure: optimal path from A to B through C, with both sub-paths optimal; figure omitted.]

Corollary:
SP(x0, y) = min { SP(x0, z) + l(z, y) : z predecessor of y }

Solving a problem by DP

1. Extension
Extend the problem to a family of problems of the same nature.
2. Recursive formulation (application of the principle of optimality)
Link the optimal solutions of these problems by a recursive relation.
3. Decomposition into steps or phases
Define the order of resolution of the problems in such a way that, when solving a problem P, the optimal solutions of all other problems needed for the computation of P are already known.
4. Computation by steps

Solving a problem by DP

Difficulties in using dynamic programming:
identification of the family of problems;
transformation of the problem into a sequential form.

Shortest path in an acyclic graph

Problem setting: find a shortest path from x0 (the root of the graph) to a given node y0.
Extension: find a shortest path from x0 to any node y, denoted SP(y).
Recursive formulation:
SP(y) = min { SP(z) + l(z, y) : z predecessor of y }
Decomposition into steps: at each step k, consider only nodes y with unknown SP(y) but for which SP(z) is known for all predecessors z.
Compute SP(y) step by step.
Remarks:
It is a backward dynamic programming.
It is also possible to solve this problem by forward dynamic programming.
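The step-by-step computation takes only a few lines of code. A minimal sketch, assuming the nodes are listed in topological order (the highway instance is not fully recoverable from the slide, so the graph below is hypothetical):

from math import inf

def dag_shortest_paths(nodes, arcs, x0):
    """Shortest paths from the root x0 in an acyclic graph.

    nodes: list of nodes in topological order (every predecessor of a
           node appears before it).
    arcs:  dict mapping (z, y) -> arc length l(z, y).
    Returns SP with SP[y] = length of a shortest x0-y path.
    """
    SP = {y: inf for y in nodes}
    SP[x0] = 0.0
    pred = {y: [z for (z, w) in arcs if w == y] for y in nodes}
    for y in nodes:
        if y == x0:
            continue
        # Principle of optimality: SP(y) = min over predecessors z of SP(z) + l(z, y).
        for z in pred[y]:
            SP[y] = min(SP[y], SP[z] + arcs[(z, y)])
    return SP

# Hypothetical instance:
nodes = ["A", "B", "C", "D"]
arcs = {("A", "B"): 3, ("A", "C"): 8, ("B", "C"): 4, ("B", "D"): 10, ("C", "D"): 5}
print(dag_shortest_paths(nodes, arcs, "A"))  # {'A': 0.0, 'B': 3.0, 'C': 7.0, 'D': 12.0}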

DP from a control point of view

Consider the control of
(i) a discrete-time dynamic system, with
(ii) costs generated over time depending on the states and the control actions.

[Diagram: at the present decision epoch, an action applied in state t generates a cost and leads to state t+1 at the next decision epoch.]

DP from a control point of view

System dynamics:
x_{t+1} = f_t(x_t, u_t), t = 0, 1, ..., N-1
where
t: time index
x_t: state of the system
u_t: control action to decide at time t

DP from a control point of view

Criterion to optimize:

Minimize g_N(x_N) + Σ_{t=0}^{N-1} g_t(x_t, u_t)

DP from a control point of view

Value function or cost-to-go function:

J_n(x) = minimum of g_N(x_N) + Σ_{t=n}^{N-1} g_t(x_t, u_t), subject to x_n = x

DP from a control point of view

Optimality equation or Bellman equation:

J_n(x) = MIN_{u_n} { g_n(x, u_n) + J_{n+1}(f_n(x, u_n)) }

Applications

Single machine scheduling (knapsack)
Inventory control
Traveling salesman problem

Applications

Single machine scheduling (knapsack)
Problem:
Consider a set of N production requests, each needing a production time t_i on a bottleneck machine and generating a profit p_i. The capacity of the bottleneck machine is C.
Question: determine the production requests to confirm in order to maximize the total profit.
Formulation:
max Σ p_i X_i
subject to:
Σ t_i X_i <= C, X_i ∈ {0, 1}
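A compact DP sketch for this knapsack formulation, assuming integer production times; the instance is hypothetical:

def knapsack(times, profits, C):
    """DP for the knapsack formulation above.

    V[c] = best total profit achievable with capacity c using the
    requests considered so far; requests are added one at a time.
    """
    V = [0] * (C + 1)
    for t, p in zip(times, profits):
        # Iterate capacities downward so each request is confirmed at most once.
        for c in range(C, t - 1, -1):
            V[c] = max(V[c], V[c - t] + p)
    return V[C]

# Hypothetical instance: 4 requests, machine capacity 7.
print(knapsack([3, 4, 2, 5], [8, 9, 5, 12], 7))  # 17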

Applications

Inventory control

See exercises.

Applications

Traveling salesman problem

Problem:
Data: a graph with N nodes and a distance matrix [d_ij] between any two nodes i and j.
Question: determine a circuit of minimum total distance passing through each node exactly once.
Extension:
C(y, S): shortest path from y to x0 passing exactly once through each node in S.
Application: machine scheduling with setups.
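The extension C(y, S) leads to the Held-Karp dynamic program. A sketch, stated with paths starting from node 0 rather than ending at x0 (the mirror image of C(y, S)); the 4-node instance is hypothetical:

from itertools import combinations

def tsp_dp(d):
    """Held-Karp DP for the TSP: cost[(S, y)] = length of a shortest
    path from node 0 visiting exactly the nodes in S and ending at y.
    d is an N x N distance matrix; node 0 plays the role of x0.
    """
    N = len(d)
    cost = {(frozenset([y]), y): d[0][y] for y in range(1, N)}
    for size in range(2, N):
        for S in combinations(range(1, N), size):
            fs = frozenset(S)
            for y in S:
                # Principle of optimality over the node z visited just before y.
                cost[(fs, y)] = min(cost[(fs - {y}, z)] + d[z][y] for z in S if z != y)
    full = frozenset(range(1, N))
    return min(cost[(full, y)] + d[y][0] for y in range(1, N))

# Hypothetical 4-node instance:
d = [[0, 2, 9, 10], [1, 0, 6, 4], [15, 7, 0, 8], [6, 3, 12, 0]]
print(tsp_dp(d))  # 21, achieved by the circuit 0 -> 2 -> 3 -> 1 -> 0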

Applications

Total tardiness minimization on a single machine

S_i: starting time of job i
X_ij = 1 if job i precedes job j, 0 otherwise
T_i: tardiness of job i

min Σ_{i=1}^{n} w_i T_i
subject to:
T_i >= S_i + p_i - d_i
S_j >= S_i + p_i - M (1 - X_ij)
Data:
Job                  1  2  3
Due date d_i         5  6  5
Processing time p_i  3  2  4
Weight w_i           3  1  2
S_i, T_i >= 0
X_ij ∈ {0, 1}
where M is a large constant.

Stochastic dynamic programming

Model
Consider the control of
(i) a discrete-time stochastic dynamic system, with
(ii) costs generated over time.

[Diagram: at each decision epoch, an action and a random perturbation applied in the current state generate a stage cost and lead to the next state.]

Stochastic dynamic programming

Model
System dynamics:
x_{t+1} = f_t(x_t, u_t, w_t), t = 0, 1, ..., N-1
where
t: time index
x_t: state of the system
u_t: decision at time t
w_t: random perturbation

Stochastic dynamic programming

Model
Criterion:

Minimize E[ g_N(x_N) + Σ_{t=0}^{N-1} g_t(x_t, u_t, w_t) ]

Stochastic dynamic programming

Model
Open-loop control:
the order quantities u_0, u_1, ..., u_{N-1} are determined once at time 0.
Closed-loop control:
the order quantity u_t at each period is determined dynamically with the knowledge of the state x_t.

Stochastic dynamic programming

Control policy
The rule for selecting at each period t a control action u_t for each possible state x_t.

Examples of inventory control policies:
1. Order a constant quantity u_t = E[w_t].
2. Order-up-to policy:
u_t = S_t - x_t, if x_t <= S_t
u_t = 0, if x_t > S_t
where S_t is a constant order-up-to level.

Stochastic dynamic programming

Control policy
Mathematically, in closed-loop control, we want to find a sequence of functions μ_t, t = 0, ..., N-1, mapping the state x_t into the control u_t = μ_t(x_t), so as to minimize the total expected cost.

The sequence π = {μ_0, ..., μ_{N-1}} is called a policy.

Stochastic dynamic programming

Optimal control
Cost of a given policy π = {μ_0, ..., μ_{N-1}}:

J_π(x_0) = E[ g_N(x_N) + Σ_{t=0}^{N-1} g_t(x_t, μ_t(x_t), w_t) ]

Optimal control:
minimize J_π(x_0) over all possible policies π.

Stochastic dynamic programming

State transition probabilities
State transition probability:

p_ij(u, t) = P{x_{t+1} = j | x_t = i, u_t = u}

depending on the control policy.

Stochastic dynamic programming

Basic problem
A discrete-time dynamic system:
x_{t+1} = f_t(x_t, u_t, w_t), t = 0, 1, ..., N-1
Finite state space: x_t ∈ S_t
Finite control space: u_t ∈ C_t
Control policy: π = {μ_0, ..., μ_{N-1}} with u_t = μ_t(x_t)
State-transition probabilities: p_ij(u)
Stage cost: g_t(x_t, μ_t(x_t), w_t)

Stochastic dynamic programming

Basic problem
Expected cost of a policy π:

J_π(x_0) = E[ g_N(x_N) + Σ_{t=0}^{N-1} g_t(x_t, μ_t(x_t), w_t) ]

The optimal control policy π* is the policy with minimal cost:

J_{π*}(x_0) = MIN_{π ∈ Π} J_π(x_0)

where Π is the set of all admissible policies.

J*(x): optimal cost function or optimal value function.

Stochastic dynamic programming

Principle of optimality
Let π* = {μ*_0, ..., μ*_{N-1}} be an optimal policy for the basic problem over the N time periods.
Then the truncated policy {μ*_i, ..., μ*_{N-1}} is optimal for the following subproblem: minimization of the total cost (called the cost-to-go function) from time i to time N, starting with state x_i at time i:

J_i(x_i) = MIN E[ g_N(x_N) + Σ_{t=i}^{N-1} g_t(x_t, μ_t(x_t), w_t) ]

Stochastic dynamic programming

DP algorithm
Theorem: For every initial state x_0, the optimal cost J*(x_0) of the basic problem is equal to J_0(x_0), given by the last step of the following algorithm, which proceeds backward in time from period N-1 to period 0:

J_N(x_N) = g_N(x_N)   (A)
J_t(x_t) = MIN_{u_t ∈ U_t(x_t)} E_{w_t}[ g_t(x_t, u_t, w_t) + J_{t+1}(f_t(x_t, u_t, w_t)) ]   (B)

Furthermore, if u*_t = μ*_t(x_t) minimizes the right side of Eq. (B) for each x_t and t, the policy π* = {μ*_0, ..., μ*_{N-1}} is optimal.

Stochastic dynamic programming

Example
Consider the inventory control problem with the following:
Excess demand is lost, i.e. x_{t+1} = max{0, x_t + u_t - w_t}.
The inventory capacity is 2, i.e. x_t + u_t <= 2.
The inventory holding/shortage cost is (x_t + u_t - w_t)^2.
The unit ordering cost is 1, i.e. g_t(x_t, u_t, w_t) = u_t + (x_t + u_t - w_t)^2.
N = 3 and the terminal cost is g_N(x_N) = 0.
Demand: P(w_t = 0) = 0.1, P(w_t = 1) = 0.7, P(w_t = 2) = 0.2.

Stochastic dynamic programming

DP algorithm
Optimal policy and cost-to-go:

Stock | Stage 0 cost-to-go | Stage 0 order | Stage 1 cost-to-go | Stage 1 order | Stage 2 cost-to-go | Stage 2 order
0     | 3.7                | 1             | 2.5                | 1             | 1.3                | 1
1     | 2.7                | 0             | 1.5                | 0             | 0.3                | 0
2     | 2.818              | 0             | 1.68               | 0             | 1.1                | 0
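A direct implementation of Eq. (B) for this example reproduces the table above:

demand = {0: 0.1, 1: 0.7, 2: 0.2}
N, CAP = 3, 2

J = {N: {x: 0.0 for x in range(CAP + 1)}}      # terminal cost g_N = 0
policy = {}
for t in range(N - 1, -1, -1):                  # backward in time
    J[t], policy[t] = {}, {}
    for x in range(CAP + 1):
        best = None
        for u in range(CAP + 1 - x):            # respect x + u <= 2
            cost = sum(p * (u + (x + u - w) ** 2 + J[t + 1][max(0, x + u - w)])
                       for w, p in demand.items())
            if best is None or cost < best[0]:
                best = (cost, u)
        J[t][x], policy[t][x] = best

for t in range(N):
    print(t, {x: round(J[t][x], 3) for x in J[t]}, policy[t])
# 0 {0: 3.7, 1: 2.7, 2: 2.818} {0: 1, 1: 0, 2: 0}
# 1 {0: 2.5, 1: 1.5, 2: 1.68} {0: 1, 1: 0, 2: 0}
# 2 {0: 1.3, 1: 0.3, 2: 1.1} {0: 1, 1: 0, 2: 0}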

Sequential decision model

Key ingredients:
A set of decision epochs
A set of system states
A set of available actions
A set of state/action dependent immediate costs
A set of state/action dependent transition probabilities

Policy: a sequence of decision rules in order to minimize the cost function.

Issues:
Existence of an optimal policy
Form of the optimal policy
Computation of an optimal policy

[Diagram: an action in the present state generates costs and leads to the next state.]

Applications

Inventory management
Bus engine replacement
Highway pavement maintenance
Bed allocation in hospitals
Personnel staffing in fire departments
Traffic control in communication networks

Example

Consider a system with one machine producing one product. The processing time of a part is exponentially distributed with rate p. Demands arrive according to a Poisson process of rate d.

State: X_t = stock level. Action: a_t = make or rest.

Minimize lim_{T→∞} (1/T) E[ ∫_0^T g(X_t) dt ]
with g(X) = hX if X >= 0, and g(X) = -bX if X < 0.

[Diagram: birth-death chain on the stock level; "make" moves the stock up at rate p, each demand moves it down at rate d.]

Example

Zero-stock policy:
P(0) = 1 - ρ, P(-n) = ρ^n P(0), with ρ = d/p
average cost = b d/(p - d)

Hedging point policy with hedging point 1:
P(1) = 1 - ρ, P(-n) = ρ^{n+1} P(1)
average cost = h(1 - ρ) + ρ b d/(p - d)

The hedging point policy is better iff h < b d/(p - d).

[Diagram: the birth-death chains of the two policies; figure omitted.]
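A quick numeric check of the two average cost formulas on a hypothetical instance (p = 2, d = 1, h = 1, b = 3), where the hedging point policy should win since h = 1 < bd/(p - d) = 3:

p, d, h, b = 2.0, 1.0, 1.0, 3.0
rho = d / p
zero_stock = b * d / (p - d)                      # average cost, hedging point 0
hedging_1 = h * (1 - rho) + rho * b * d / (p - d)  # average cost, hedging point 1
print(zero_stock, hedging_1)                      # 3.0 2.0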

MDP Model formulation


Decision epochs

Times at which decisions are made.
The set T of decision epochs can be either a discrete set or a continuum.
The set T can be finite (finite horizon problem) or infinite (infinite horizon).

State and action sets

At each decision epoch, the system occupies a state.
S: the set of all possible system states.
A_s: the set of allowable actions in state s.
A = ∪_{s∈S} A_s: the set of all possible actions.
S and A_s can be:
finite sets
countably infinite sets
compact sets

Costs and transition probabilities

As a result of choosing action a ∈ A_s in state s at decision epoch t:
the decision maker receives a cost C_t(s, a), and
the system state at the next decision epoch is determined by the probability distribution p_t(·|s, a).

If the cost depends on the state at the next decision epoch, then
C_t(s, a) = Σ_{j∈S} C_t(s, a, j) p_t(j|s, a)
where C_t(s, a, j) is the cost if the next state is j.

A Markov decision process is characterized by {T, S, A_s, p_t(·|s, a), C_t(s, a)}.

Example of inventory management

Consider the inventory control problem with the following:
Excess demand is lost, i.e. x_{t+1} = max{0, x_t + u_t - w_t}.
The inventory capacity is 2, i.e. x_t + u_t <= 2.
The inventory holding/shortage cost is (x_t + u_t - w_t)^2.
The unit ordering cost is 1, i.e. g_t(x_t, u_t, w_t) = u_t + (x_t + u_t - w_t)^2.
N = 3 and the terminal cost is g_N(x_N) = 0.
Demand: P(w_t = 0) = 0.1, P(w_t = 1) = 0.7, P(w_t = 2) = 0.2.

Example of inventory management

Decision epochs: T = {0, 1, 2, ..., N}
Set of states: S = {0, 1, 2}, indicating the initial stock x_t
Action set A_s: the possible order quantities u_t
A_0 = {0, 1, 2}, A_1 = {0, 1}, A_2 = {0}
Cost function: C_t(s, a) = E[a + (s + a - w_t)^2]
Transition probabilities p_t(·|s, a), written as (p(0|s, a), p(1|s, a), p(2|s, a)):

a = 0:  s = 0: (1, 0, 0)       s = 1: (0.9, 0.1, 0)   s = 2: (0.2, 0.7, 0.1)
a = 1:  s = 0: (0.9, 0.1, 0)   s = 1: (0.2, 0.7, 0.1) s = 2: not allowed
a = 2:  s = 0: (0.2, 0.7, 0.1) s = 1: not allowed     s = 2: not allowed
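The rows of this table can be recovered from the demand distribution and the lost-sales dynamics; a sketch:

# Derive p(j|s, a) from x' = max(0, s + a - w) and the demand law.
demand = {0: 0.1, 1: 0.7, 2: 0.2}

def transition_row(s, a):
    p = [0.0, 0.0, 0.0]
    for w, prob in demand.items():
        p[max(0, s + a - w)] += prob
    return tuple(p)

for a in range(3):
    for s in range(3):
        if s + a <= 2:                         # capacity constraint
            print(f"p(.|s={s}, a={a}) =", transition_row(s, a))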

Decision rules

A decision rule prescribes a procedure for action selection in each state at a specified decision epoch.

A decision rule can be either:
Markovian (memoryless) if the selection of the action a_t is based only on the current state s_t;
History dependent if the action selection depends on the past history, i.e. the sequence of states/actions h_t = (s_1, a_1, ..., s_{t-1}, a_{t-1}, s_t).

Decision rules

A decision rule can also be either:
Deterministic if the decision rule selects one action with certainty;
Randomized if the decision rule only specifies a probability distribution on the set of actions.

Decision rules

As a result, decision rules can be:
HR: history dependent and randomized
HD: history dependent and deterministic
MR: Markovian and randomized
MD: Markovian and deterministic

Policies

A policy specifies the decision rule to be used at every decision epoch.
A policy is a sequence of decision rules, i.e. π = {d_1, d_2, ..., d_{N-1}}.

A policy is stationary if d_t = d for all t.

Stationary deterministic and stationary randomized policies are important for infinite horizon Markov decision processes.

Example

Decision epochs: T = {1, 2, ..., N}
States: S = {s1, s2}
Actions: A_{s1} = {a11, a12}, A_{s2} = {a21}
Costs: C_t(s1, a11) = 5, C_t(s1, a12) = 10, C_t(s2, a21) = -1, C_N(s1) = C_N(s2) = 0
Transition probabilities: p_t(s1|s1, a11) = 0.5, p_t(s2|s1, a11) = 0.5, p_t(s1|s1, a12) = 0, p_t(s2|s1, a12) = 1, p_t(s1|s2, a21) = 0, p_t(s2|s2, a21) = 1

[Diagram: a11 from s1 (cost 5) moves to s1 or s2 with probability 0.5 each; a12 from s1 (cost 10) moves to s2; a21 from s2 (cost -1) stays in s2.]

Example

A deterministic Markov policy:
Decision epoch 1: d_1(s1) = a11, d_1(s2) = a21
Decision epoch 2: d_2(s1) = a12, d_2(s2) = a21

Example

A randomized Markov policy:
Decision epoch 1: P_{1,s1}(a11) = 0.7, P_{1,s1}(a12) = 0.3, P_{1,s2}(a21) = 1
Decision epoch 2: P_{2,s1}(a11) = 0.4, P_{2,s1}(a12) = 0.6, P_{2,s2}(a21) = 1

Example

A deterministic history-dependent policy:
Decision epoch 1: d_1(s1) = a11, d_1(s2) = a21
Decision epoch 2:

history h  | d_2(h, s1) | d_2(h, s2)
(s1, a11)  | a13        | a21
(s1, a12)  | infeasible | a21
(s1, a13)  | a11        | infeasible
(s2, a21)  | infeasible | a21

[Diagram: as before, plus a third action a13 in s1 with cost 0 that returns to s1 with probability 1.]

Example

A randomized history-dependent policy:
Decision epoch 1:
P_{1,s1}(a11) = 0.6, P_{1,s1}(a12) = 0.3, P_{1,s1}(a13) = 0.1, P_{1,s2}(a21) = 1

Decision epoch 2, at s = s1:

history h  | P(a = a11) | P(a = a12) | P(a = a13)
(s1, a11)  | 0.4        | 0.3        | 0.3
(s1, a12)  | infeasible | infeasible | infeasible
(s1, a13)  | 0.8        | 0.1        | 0.1
(s2, a21)  | infeasible | infeasible | infeasible

At s = s2, select a21.

Remarks

Each Markov policy leads to a discrete-time Markov chain, and the policy can be evaluated by solving the related Markov chain.

Finite horizon Markov decision processes

Assumptions

Assumption 1: The decision epochs are T = {1, 2, ..., N}.
Assumption 2: The state space S is finite or countable.
Assumption 3: The action space A_s is finite for each s.

Criterion:

MIN_{π ∈ Π^HR} E^π[ Σ_{t=1}^{N-1} C_t(X_t, a_t) + C_N(X_N) | X_1 = s ]

where Π^HR is the set of all possible policies.

Optimality of Markov deterministic policies

Theorem:
Assume S is finite or countable, and that A_s is finite for each s ∈ S.
Then there exists a deterministic Markovian policy which is optimal.

Optimality equations

Theorem: The value functions

V_n(s) = MIN_{π ∈ Π^HR} E^π[ Σ_{t=n}^{N-1} C_t(X_t, a_t) + C_N(X_N) | X_n = s ]

satisfy the following optimality equations:

V_t(s) = MIN_{a ∈ A_s} { C_t(s, a) + Σ_{j∈S} p_t(j|s, a) V_{t+1}(j) }
V_N(s) = C_N(s)

and the action a that minimizes the right-hand side defines the optimal policy.

Optimality equations

The optimality equation can also be expressed as:

V_t(s) = MIN_{a ∈ A_s} Q_t(s, a)
Q_t(s, a) = C_t(s, a) + Σ_{j∈S} p_t(j|s, a) V_{t+1}(j)

where Q_t(s, a) is a Q-function used to evaluate the consequence of taking action a in state s.

Dynamic programming algorithm

1. Set t = N and V_N(s_N) = C_N(s_N) for all s_N ∈ S.
2. Substitute t-1 for t and compute, for each s ∈ S:
V_t(s) = MIN_{a ∈ A_s} { C_t(s, a) + Σ_{j∈S} p_t(j|s, a) V_{t+1}(j) }
d_t(s) = argmin_{a ∈ A_s} { C_t(s, a) + Σ_{j∈S} p_t(j|s, a) V_{t+1}(j) }
3. Repeat step 2 until t = 1.

Infinite horizon discounted Markov decision processes

Assumptions

Assumption 1: The decision epochs are T = {1, 2, ...}.
Assumption 2: The state space S is finite or countable.
Assumption 3: The action space A_s is finite for each s.
Assumption 4: Stationary costs and transition probabilities: C(s, a) and p(j|s, a) do not vary from decision epoch to decision epoch.
Assumption 5: Bounded costs: |C(s, a)| <= M for all a ∈ A_s and all s ∈ S (to be relaxed).

Assumptions

Criterion:

MIN_{π ∈ Π^HR} lim_{N→∞} E^π[ Σ_{t=1}^{N} λ^{t-1} C_t(X_t, a_t) | X_1 = s ]

where
0 < λ < 1 is the discount factor
Π^HR is the set of all possible policies.

Optimality equations

Theorem: Under Assumptions 1-5, the following optimal cost function exists:

V*(s) = MIN_{π ∈ Π^HR} lim_{N→∞} E^π[ Σ_{t=1}^{N} λ^{t-1} C_t(X_t, a_t) | X_1 = s ]

and satisfies the following optimality equation:

V*(s) = MIN_{a ∈ A_s} { C(s, a) + λ Σ_{j∈S} p(j|s, a) V*(j) }

Further, V*(·) is the unique solution of the optimality equation.
Moreover, a stationary policy is optimal iff it attains the minimum in the optimality equation.

Computation of optimal policy: value iteration

Value iteration algorithm:
1. Select any bounded value function V^0, let n = 0.
2. For each s ∈ S, compute
V^{n+1}(s) = MIN_{a ∈ A_s} { C(s, a) + λ Σ_{j∈S} p(j|s, a) V^n(j) }
3. Repeat step 2 until convergence.
4. For each s ∈ S, compute
d(s) = argmin_{a ∈ A_s} { C(s, a) + λ Σ_{j∈S} p(j|s, a) V^{n+1}(j) }
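A minimal value iteration sketch, assuming the costs and transition probabilities are stored as arrays (np.inf marks actions not allowed in a state); the two-state instance at the end is the earlier example, with a hypothetical discount factor:

import numpy as np

def value_iteration(C, P, lam, eps=1e-8):
    """Value iteration sketch for a discounted MDP.

    C[a, s]    : stage cost of action a in state s (np.inf if not allowed)
    P[a, s, j] : transition probability p(j|s, a)
    lam        : discount factor, 0 < lam < 1
    """
    V = np.zeros(C.shape[1])
    while True:
        Q = C + lam * np.einsum("asj,j->as", P, V)     # Q(s, a) for all pairs
        V_new = Q.min(axis=0)
        if np.max(np.abs(V_new - V)) < eps * (1 - lam) / (2 * lam):
            return V_new, Q.argmin(axis=0)              # value and greedy policy
        V = V_new

# In s1: a11 (cost 5, to s1/s2 with prob 0.5 each) or a12 (cost 10, to s2);
# in s2: only a21 (cost -1, stays in s2).
C = np.array([[5.0, -1.0], [10.0, np.inf]])
P = np.array([[[0.5, 0.5], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
print(value_iteration(C, P, lam=0.9))   # V ~ (0.909, -10.0), policy (a11, a21)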

Computation of optimal policy: value iteration

Theorem: Under Assumptions 1-5,
a. V^n converges to V*.
b. The stationary policy defined in the value iteration algorithm converges to an optimal policy.

Computation of optimal policy: policy iteration

Policy iteration algorithm:
1. Select an arbitrary stationary policy π^0, let n = 0.
2. (Policy evaluation) Obtain the value function V^n of policy π^n.
3. (Policy improvement) Choose π^{n+1} = {d_{n+1}, d_{n+1}, ...} such that
d_{n+1}(s) = argmin_{a ∈ A_s} { C(s, a) + λ Σ_{j∈S} p(j|s, a) V^n(j) }
4. Repeat steps 2-3 until π^{n+1} = π^n.

Computation of optimal policy: policy iteration

Policy evaluation:
For any stationary deterministic policy π = {d, d, ...}, its value function

V_π(s) = E^π[ Σ_{t=1}^{∞} λ^{t-1} C(X_t, a_t) | X_1 = s ]

is the unique solution of the following equation:

V_π(s) = C(s, d(s)) + λ Σ_{j∈S} p(j|s, d(s)) V_π(j)
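A policy iteration sketch with the same data layout as the value iteration sketch, solving the evaluation step as the linear system above; it assumes action 0 is allowed in every state so that the initial policy is feasible:

import numpy as np

def policy_iteration(C, P, lam):
    """Policy iteration sketch for a discounted MDP."""
    n_states = C.shape[1]
    d = np.zeros(n_states, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - lam * P_d) V = C_d exactly.
        P_d = P[d, np.arange(n_states), :]
        C_d = C[d, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - lam * P_d, C_d)
        # Policy improvement.
        Q = C + lam * np.einsum("asj,j->as", P, V)
        d_new = Q.argmin(axis=0)
        if np.array_equal(d_new, d):
            return V, d
        d = d_new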

Computation of optimal policy: policy iteration

Theorem:
The value functions V^n generated by the policy iteration algorithm satisfy V^{n+1} <= V^n.
Further, if V^{n+1} = V^n, then V^n = V*.

Computation of optimal policy: linear programming

Recall the optimality equation:

V(s) = MIN_{a ∈ A_s} { C(s, a) + λ Σ_{j∈S} p(j|s, a) V(j) }

The optimal value function can be determined by the following linear program:

Maximize Σ_{s∈S} V(s)
subject to
V(s) <= C(s, a) + λ Σ_{j∈S} p(j|s, a) V(j), for all (s, a)
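A sketch of this linear program with scipy.optimize.linprog, using the same data layout as above (linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

def lp_solve(C, P, lam):
    """LP sketch: maximize sum_s V(s) subject to
    V(s) <= C(s, a) + lam * sum_j p(j|s, a) V(j) for every allowed (s, a)."""
    n_actions, n_states = C.shape
    A_ub, b_ub = [], []
    for a in range(n_actions):
        for s in range(n_states):
            if np.isfinite(C[a, s]):               # skip disallowed actions
                row = lam * P[a, s, :].copy()
                row[s] -= 1.0                      # row @ V = lam * sum_j p V(j) - V(s)
                A_ub.append(-row)                  # -row @ V <= C(s, a)
                b_ub.append(C[a, s])
    res = linprog(c=-np.ones(n_states),            # maximize sum_s V(s)
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)  # V is a free variable
    return res.x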

Extension to unbounded costs

Theorem 1. Under the condition C(s, a) >= 0 (or C(s, a) <= 0) for all states s and control actions a, the optimal cost function V*(s) among all stationary deterministic policies satisfies the optimality equation

V*(s) = MIN_{a ∈ A_s} { C(s, a) + λ Σ_{j∈S} p(j|s, a) V*(j) }

Theorem 2. Assume that the set of control actions is finite. Then, under the condition C(s, a) >= 0 for all states s and control actions a, we have

lim_{N→∞} V^N(s) = V*(s)

where V^N(s) is the solution of the value iteration algorithm with V^0(s) = 0.
Implication of Theorem 2: the optimal cost can be obtained as the limit of value iteration, and the optimal stationary policy can also be obtained in the limit.

Example

Consider a computer system consisting of M different processors. Using processor i for a job incurs a finite cost C_i, with C_1 < C_2 < ... < C_M.
When we submit a job to this system, processor i is assigned to our job with probability p_i.
At this point we can (a) decide to go with this processor or (b) choose to hold the job until a lower-cost processor is assigned.
The system periodically returns to our job and assigns a processor in the same way.
Waiting until the next processor assignment incurs a fixed finite cost c.
Question:
How do we decide whether to go with the processor currently assigned to our job or to wait for the next assignment?
Suggestions:
The state definition should include all information useful for the decision.
The problem belongs to the class of so-called stochastic shortest path problems.

Infinite horizon average cost Markov decision processes

Assumptions

Assumption 1: The decision epochs are T = {1, 2, ...}.
Assumption 2: The state space S is finite.
Assumption 3: The action space A_s is finite for each s.
Assumption 4: Stationary costs and transition probabilities: C(s, a) and p(j|s, a) do not vary from decision epoch to decision epoch.
Assumption 5: Bounded costs: |C(s, a)| <= M for all a ∈ A_s and all s ∈ S.
Assumption 6: The Markov chain corresponding to any stationary deterministic policy contains a single recurrent class (unichain).

Assumptions

Criterion:

MIN_{π ∈ Π^HR} lim_{N→∞} (1/N) E^π[ Σ_{t=1}^{N} C_t(X_t, a_t) | X_1 = s ]

where Π^HR is the set of all possible policies.

Optimal policy

Under Assumptions 1-6, there exists an optimal stationary deterministic policy.
Further, there exist a real number g and a value function h(s) that satisfy the following optimality equation:

h(s) + g = MIN_{a ∈ A_s} { C(s, a) + Σ_{j∈S} p(j|s, a) h(j) }

For any two solutions (g, h) and (g', h') of the optimality equation: (i) g = g' is the optimal average cost; (ii) h(s) = h'(s) + k for some constant k; (iii) the stationary policy determined by the optimality equation is an optimal policy.

Relation between discounted and average cost MDP

It can be shown that:

g = lim_{λ→1} (1 - λ) V_λ(s)
h(s) = lim_{λ→1} ( V_λ(s) - V_λ(x0) )   (differential cost)

for any given state x0.

Computation of the optimal policy by LP

Recall the optimality equation:

h(s) + g = MIN_{a ∈ A_s} { C(s, a) + Σ_{j∈S} p(j|s, a) h(j) }

This leads to the following LP for optimal policy computation:

Maximize g
subject to
h(s) + g <= C(s, a) + Σ_{j∈S} p(j|s, a) h(j), for all (s, a)
h(x0) = 0

Remarks: Value iteration and policy iteration can also be extended to the average cost case.

Computation of optimal policy: value iteration

1. Select any bounded value function h^0 with h^0(s0) = 0, let n = 0.
2. For each s ∈ S, compute
U^{n+1}(s) = MIN_{a ∈ A_s} { C(s, a) + Σ_{j∈S} p(j|s, a) h^n(j) }
h^{n+1}(s) = U^{n+1}(s) - U^{n+1}(s0)
g^{n+1} = U^{n+1}(s0)
3. Repeat step 2 until convergence.
4. For each s ∈ S, compute
d(s) = argmin_{a ∈ A_s} { C(s, a) + Σ_{j∈S} p(j|s, a) h^{n+1}(j) }
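A relative value iteration sketch for the average cost case, with the same data layout as before (the unichain assumption is required, and an aperiodicity transformation may be needed if some policy induces a periodic chain):

import numpy as np

def relative_value_iteration(C, P, s0=0, tol=1e-9, max_iter=100000):
    """Relative value iteration sketch for an average cost MDP."""
    h = np.zeros(C.shape[1])
    g = 0.0
    for _ in range(max_iter):
        U = (C + np.einsum("asj,j->as", P, h)).min(axis=0)
        g, h_new = U[s0], U - U[s0]              # normalize at reference state s0
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    d = (C + np.einsum("asj,j->as", P, h)).argmin(axis=0)
    return g, h, d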

Extensions to unbounded cost

Theorem. Assume that the set of control actions is finite. Suppose that there exist a finite constant L and some state x0 such that

|V_λ(x) - V_λ(x0)| <= L

for all states x and for all λ ∈ (0, 1). Then, for some sequence {λ_n} converging to 1, the following limits exist and satisfy the optimality equation:

g = lim_{λ→1} (1 - λ) V_λ(s)
h(s) = lim_{λ→1} ( V_λ(s) - V_λ(x0) )

Easy extension to policy iteration.

Continuous time Markov decision processes

Assumptions

Assumption 1: The decision epochs are T = R+.
Assumption 2: The state space S is finite.
Assumption 3: The action space A_s is finite for each s.
Assumption 4: Stationary cost rates and transition rates: C(s, a) and μ(j|s, a) do not vary over time.

Assumptions

Criterion (discounted or average cost):

MIN_{π ∈ Π^HR} E^π[ ∫_0^∞ C(X_t, a_t) e^{-βt} dt ]

MIN_{π ∈ Π^HR} lim_{T→∞} (1/T) E^π[ ∫_0^T C(X_t, a_t) dt ]

Example

Consider a system with one machine producing one product. The processing time of a part is exponentially distributed with rate p. Demands arrive according to a Poisson process of rate d.

State: X_t = stock level. Action: a_t = make or rest.

Minimize E[ ∫_0^∞ g(X_t) e^{-βt} dt ]
with g(X) = hX if X >= 0, and g(X) = -bX if X < 0.

[Diagram: birth-death chain on the stock level; "make" moves the stock up at rate p, each demand moves it down at rate d.]

Uniformization

Any continuous-time Markov chain can be converted to a discrete-time chain through a process called "uniformization".

Each continuous-time Markov chain is characterized by the transition rates μ_ij of all possible transitions.
The sojourn time T_i in each state i is exponentially distributed with rate μ(i) = Σ_{j≠i} μ_ij, i.e. E[T_i] = 1/μ(i).
Transitions out of different states are unpaced and asynchronous, depending on μ(i).

Uniformization

In order to synchronize (uniformize) the transitions at the same pace, we choose a uniformization rate

γ >= MAX_i { μ(i) }

The uniformized Markov chain has:
transitions occurring only at instants generated by a common Poisson process of rate γ (also called the standard clock);
state-transition probabilities
p_ij = μ_ij / γ
p_ii = 1 - μ(i)/γ
where the self-loop transitions correspond to fictitious events.

Uniformization

Example: a two-state CTMC, with rate a from S1 to S2 and rate b from S2 to S1.

Step 1: Determine the rate of each state: μ(S1) = a, μ(S2) = b.
Step 2: Select a uniformization rate γ >= max_i { μ(i) }.
Step 3: Add self-loop transitions to the states of the CTMC.
Step 4: Derive the corresponding uniformized DTMC:
from S1: move to S2 with probability a/γ, stay in S1 with probability 1 - a/γ;
from S2: move to S1 with probability b/γ, stay in S2 with probability 1 - b/γ.

Uniformization

For a Markov decision process, the uniformization rate γ should be such that

γ >= μ(s, a) = Σ_{j∈S} μ(j|s, a)

for all states s and for all possible control actions a.
The state-transition probabilities of the uniformized Markov decision process become:

p(j|s, a) = μ(j|s, a)/γ
p(s|s, a) = 1 - Σ_{j∈S} μ(j|s, a)/γ
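A sketch of the uniformization step, assuming the transition rates are stored as an array mu[a, s, j]:

import numpy as np

def uniformize(mu):
    """Convert CTMDP rates mu[a, s, j] into a uniformization rate gamma
    and transition probabilities P[a, s, j] with fictitious self-loops."""
    n_actions, n_states, _ = mu.shape
    gamma = mu.sum(axis=2).max()          # gamma >= total rate of every (s, a)
    P = mu / gamma
    for a in range(n_actions):
        for s in range(n_states):
            P[a, s, s] += 1.0 - mu[a, s].sum() / gamma   # fictitious self-loop
    return gamma, P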

Uniformization

The make/rest example uniformized at rate γ = p + d:

[Diagram: for each stock level s, action "make" moves to s+1 with probability p/γ and to s-1 with probability d/γ; action "not make" moves to s-1 with probability d/γ and stays at s with probability p/γ (fictitious self-loop).]

Uniformization

Under uniformization:
a sequence of discrete decision epochs T1, T2, ... is generated, where T_{k+1} - T_k = EXP(γ);
the discrete-time Markov chain describes the state of the system at these decision epochs;
all criteria can be easily converted.

Costs may combine a fixed cost K(s, a) incurred when action a is taken in state s, a continuous cost rate C(s, a) per unit time, and a fixed cost k(s, a, j) incurred at the transition to state j.

[Diagram: decision epochs T0, T1, T2, T3 generated by a Poisson process at rate γ.]

Cost function conversion for the uniformized Markov chain

Discounted cost of a stationary policy (with only the continuous cost):

E[ ∫_0^∞ C(X_t, a_t) e^{-βt} dt ]
= E[ Σ_{k=0}^{∞} C(X_k, a_k) ∫_{T_k}^{T_{k+1}} e^{-βt} dt ]
  (the state changes and the actions are taken only at the epochs T_k)
= Σ_{k=0}^{∞} E[ C(X_k, a_k) ] E[ ∫_{T_k}^{T_{k+1}} e^{-βt} dt ]
  (mutual independence of (X_k, a_k) and (T_k, T_{k+1}))
= (1/(β+γ)) Σ_{k=0}^{∞} α^k E[ C(X_k, a_k) ], with α = γ/(β+γ)
  ({T_k} is a Poisson process at rate γ)

Average cost of a stationary policy (with only the continuous cost):

lim_{T→∞} (1/T) E[ ∫_0^T C(X_t, a_t) dt ] = lim_{N→∞} (1/N) E[ Σ_{k=0}^{N-1} C(X_k, a_k) ]

Cost function conversion for the uniformized Markov chain

Equivalent discrete-time discounted MDP:
a discrete-time Markov chain with uniform transition rate γ;
a discount factor α = γ/(β+γ);
a stage cost given by the sum of
the continuous cost C(s, a)/(β+γ),
the fixed cost K(s, a) incurred at T_k,
the discounted expected fixed cost α Σ_{j∈S} k(s, a, j) p(j|s, a) incurred at T_{k+1}.

Optimality equation:

V(s) = MIN_{a ∈ A_s} { C(s, a)/(β+γ) + K(s, a) + α Σ_{j∈S} p(j|s, a) [ k(s, a, j) + V(j) ] }

Cost function conversion for the uniformized Markov chain

Equivalent discrete-time average-cost MDP:
a discrete-time Markov chain with uniform transition rate γ;
a stage cost given by C(s, a)/γ whenever a state s is entered and an action a is chosen.

Optimality equation:

h(s) + g = MIN_{a ∈ A_s} { C(s, a)/γ + Σ_{j∈S} p(j|s, a) h(j) }

where
g = average cost per discretized time period
γ g = average cost per time unit (can also be obtained directly from the optimality equation with stage cost C(s, a))

Example (continued)

Uniformize the Markov decision process with rate γ = p + d.
The optimality equation is:

V(s) = MIN { g(s)/(p+d) + [ p V(s+1) + d V(s-1) ]/(p+d) : producing,
             g(s)/(p+d) + [ p V(s) + d V(s-1) ]/(p+d) : not producing }

Example (continued)

From the optimality equation:

V(s) = g(s)/(p+d) + (d/(p+d)) V(s-1) + (p/(p+d)) V(s) + (p/(p+d)) MIN{ V(s+1) - V(s), 0 }

If V(s) is convex, then there exists a K such that:
V(s+1) - V(s) > 0 and the decision is not producing, for all s >= K, and
V(s+1) - V(s) <= 0 and the decision is producing, for all s < K.

Example (continued)

Convexity proved by value iteration:

V^{n+1}(s) = g(s)/(p+d) + (d/(p+d)) V^n(s-1) + (p/(p+d)) MIN{ V^n(s+1), V^n(s) }
V^0(s) = 0

Proof by induction:
V^0 is convex.
If V^n is convex with minimum at s = K, then MIN{ V^n(s+1), V^n(s) } is convex, and hence V^{n+1} is convex.

[Figure: a convex V^n with minimum at s = K and its shifted minimum envelope; figure omitted.]
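A sketch that checks the hedging point structure numerically: relative value iteration on the uniformized chain, truncated to a finite stock range (the parameter values and the truncation bounds are assumptions for illustration):

import numpy as np

h, b, p, d = 1.0, 5.0, 2.0, 1.0         # holding, backlog, production, demand
gamma = p + d                            # uniformization rate
lo, hi = -30, 10                         # truncated stock range (assumption)
S = np.arange(lo, hi + 1)
g = np.where(S >= 0, h * S, -b * S).astype(float)

V = np.zeros(len(S))
ref = -lo                                # index of stock level 0
for _ in range(100000):
    V_up = np.append(V[1:], V[-1])       # V(s+1), reflected at the upper bound
    V_dn = np.insert(V[:-1], 0, V[0])    # V(s-1), reflected at the lower bound
    U = (g + p * np.minimum(V_up, V) + d * V_dn) / gamma
    g_per, V_new = U[ref], U - U[ref]    # relative value iteration step
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

make = np.append(V[1:], np.inf) <= V     # produce iff V(s+1) <= V(s)
print("average cost per time unit:", gamma * g_per)
print("produce at stock levels:", S[make])

With these parameters the computed policy should produce for stock levels below 2 and rest above, i.e. a hedging point of 2, with average cost close to the formula value h(1-ρ)(2+ρ) + b ρ^2 d/(p-d) = 2.5.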
