CH 9 MDP
Dynamic programming
Introduction to Markov decision processes
Markov decision processes formulation
Discounted Markov decision processes
Average cost Markov decision processes
Continuous-time Markov decision processes
Xiaolan Xie
Dynamic programming
Basic principle of dynamic programming
Some applications
Introduction
Dynamic programming (DP) is a general optimization technique based on implicit enumeration of the solution space.
The problems should have a particular sequential structure, such that the set of unknowns can be determined sequentially.
It is based on the "principle of optimality".
A wide range of problems can be put in sequential form and solved by dynamic programming.
Introduction
Applications:
Optimal control
Most problems in graph theory
Investment
Deterministic and stochastic inventory control
Project scheduling
Production scheduling
[Figure: example shortest-path network with arc lengths 10, 10, 10, 10, 5, 8, 15.]
Principle of optimality: every subpath, say from A to B, of an optimal path is itself optimal.
Corollary:
$$SP(x_0, y) = \min \{ SP(x_0, z) + l(z, y) \mid z : \text{predecessor of } y \}$$
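As an illustration, a minimal sketch of the corollary as code, assuming the network is a DAG given by predecessor lists and arc lengths (all names are illustrative):

```python
# Shortest-path lengths SP(x0, y) on a DAG, computed by the corollary
# SP(x0, y) = min over predecessors z of { SP(x0, z) + l(z, y) }.
def shortest_paths(topo_order, predecessors, length, x0):
    SP = {x0: 0.0}
    for y in topo_order:               # nodes in topological order
        if y == x0:
            continue
        # assumes every node other than x0 has a predecessor reachable from x0
        SP[y] = min(SP[z] + length[z, y] for z in predecessors[y])
    return SP
```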
Solving a problem by DP
1. Extension
Extend the problem to a family of problems of the same nature
2. Recursive Formulation (application of the principle of optimality)
Link optimal solutions of these problems by a recursive relation
3. Decomposition into steps or phases
Define the order of resolution of the problems in such a way
that, when solving a problem P, the optimal solutions of all other
problems needed for the computation of P are already known.
4. Computation by steps
Solving a problem by DP
Difficulties in using dynamic programming:
identification of the family of problems;
transformation of the problem into a sequential form.
[Figure: deterministic DP model; at each decision epoch t, applying action u_t in state x_t incurs stage cost g_t(x_t, u_t) and moves the system to state x_{t+1} = f_t(x_t, u_t) at the next decision epoch.]

Objective:
$$\text{Minimize } g_N(x_N) + \sum_{t=0}^{N-1} g_t(x_t, u_t)$$

Cost-to-go of state x at epoch n:
$$J_n(x) = \min \left\{ g_N(x_N) + \sum_{t=n}^{N-1} g_t(x_t, u_t) \,\middle|\, x_n = x \right\}$$

Backward recursion (principle of optimality):
$$J_n(x) = \min_{u_n} \left\{ g_n(x, u_n) + J_{n+1}\big( f_n(x, u_n) \big) \right\}$$
Applications
Single machine scheduling (Knapsack)
Inventory control
Traveling salesman problem
Applications
Single machine scheduling (Knapsack)
Problem :
Consider a set of N production requests, each needing a
production time ti on a bottleneck machine and generating a
profit pi. The capacity of the bottleneck machine is C.
Question: determine the production requests to confirm in
order to maximize the total profit.
Formulation:
$$\max \sum_i p_i X_i$$
subject to:
$$\sum_i t_i X_i \le C, \qquad X_i \in \{0, 1\}$$
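A minimal DP sketch for this formulation, assuming integer processing times ti (function and variable names are illustrative):

```python
# Knapsack DP: J[k][c] = best profit using requests 1..k with capacity c.
def knapsack(times, profits, C):
    N = len(times)
    J = [[0] * (C + 1) for _ in range(N + 1)]
    for k in range(1, N + 1):
        t, p = times[k - 1], profits[k - 1]
        for c in range(C + 1):
            J[k][c] = J[k - 1][c]                    # reject request k
            if t <= c:                               # accept request k
                J[k][c] = max(J[k][c], J[k - 1][c - t] + p)
    return J[N][C]
```

Note how the capacity index c extends the problem to a family of problems of the same nature, exactly as in the extension step above.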
Applications
Inventory control
See exercises
Applications
Traveling salesman problem
Problem :
Data: a graph with N nodes and a distance matrix
[dij] between any two nodes i and j.
Question: determine a circuit of minimum total
distance passing through each node exactly once.
Extension:
C(y, S): length of the shortest path from y to x0 passing exactly once through each node in S.
Recursion: $C(y, S) = \min_{z \in S} \{ d_{yz} + C(z, S \setminus \{z\}) \}$, with $C(y, \emptyset) = d_{y x_0}$.
Application: machine scheduling with setups.
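A sketch of this recursion (the Held-Karp algorithm) in code, where node 0 plays the role of x0 and all names are illustrative:

```python
# Held-Karp DP for the TSP. C(y, S) = length of the shortest path from y
# back to node 0 visiting every node of the frozenset S exactly once.
from functools import lru_cache

def tsp(d):
    N = len(d)
    @lru_cache(maxsize=None)
    def C(y, S):                      # S: frozenset of nodes still to visit
        if not S:
            return d[y][0]
        return min(d[y][z] + C(z, S - {z}) for z in S)
    return C(0, frozenset(range(1, N)))
```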
Applications
Total tardiness minimization on a single machine

Si: starting time of job i
Xij = 1 if job i precedes job j, 0 otherwise
Ti: tardiness of job i

$$\min \sum_{i=1}^{n} w_i T_i$$
subject to:
$$T_i \ge S_i + p_i - d_i$$
$$S_j \ge S_i + p_i - M(1 - X_{ij})$$
$$S_i, T_i \ge 0, \qquad X_{ij} \in \{0, 1\}$$

Job | Due date di | Processing time pi | Weight wi
1   | 5           | 2                  | 6
2   | 1           | 4                  | 2
3   | 5           | 3                  | 3
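Besides this MIP formulation, the problem can be solved by DP over job subsets, since the jobs scheduled first occupy the machine for a fixed total time P(S); a minimal sketch (the recursion F(S) = min over j in S of F(S minus j) + wj·max(0, P(S) − dj) is standard, names are illustrative):

```python
# DP over job subsets for total weighted tardiness (a sketch).
# F(S): minimum cost of sequencing the jobs in S first; the job j placed
# last in S completes at time P(S) = total processing time of S.
from functools import lru_cache

def weighted_tardiness(p, d, w):
    n = len(p)
    @lru_cache(maxsize=None)
    def F(S):                                    # S: bitmask of jobs
        if S == 0:
            return 0.0
        total = sum(p[j] for j in range(n) if S >> j & 1)
        return min(F(S & ~(1 << j)) + w[j] * max(0, total - d[j])
                   for j in range(n) if S >> j & 1)
    return F((1 << n) - 1)

# Example data from the table above:
print(weighted_tardiness(p=[2, 4, 3], d=[5, 1, 5], w=[6, 2, 3]))
```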
[Figure: stochastic DP model; at each decision epoch t, applying action u_t in state x_t under random perturbation w_t incurs stage cost g_t(x_t, u_t, w_t) and moves the system to state x_{t+1} at the next decision epoch.]

Objective:
$$\text{Minimize } E\left[ g_N(x_N) + \sum_{t=0}^{N-1} g_t(x_t, u_t, w_t) \right]$$
Open-loop control: order quantities u1, u2, ..., uN-1 are determined once, at time 0.
Closed-loop control: each order quantity ut is selected at time t, on the basis of the observed state xt.
$$J(x_0) = E\left[ \sum_{t=0}^{N-1} \big( c\, u_t + r(x_t + u_t - w_t) \big) \right]$$
where c is the unit ordering cost and r(·) the holding/shortage cost.

Optimal control: minimize J(x0) over all possible policies.

$$p_{ij}(u, t) = P\{ x_{t+1} = j \mid x_t = i,\ u_t = u \}$$
$$J_\pi(x_0) = E\left[ g_N(x_N) + \sum_{t=0}^{N-1} g_t\big( x_t, \mu_t(x_t), w_t \big) \right]$$

$$J^*(x_0) = \min_{\pi} J_\pi(x_0)$$

$$J_i(x_i) = \min_{\pi} E\left[ g_N(x_N) + \sum_{t=i}^{N-1} g_t\big( x_t, \mu_t(x_t), w_t \big) \right]$$

Optimality equation:
$$J_t(x_t) = \min_{u_t \in U_t(x_t)} E_{w_t}\left[ g_t(x_t, u_t, w_t) + J_{t+1}\big( f_t(x_t, u_t, w_t) \big) \right]$$
Cost-to-go of the inventory example:

State | Stage 0 cost-to-go | Stage 1 cost-to-go | Stage 2 cost-to-go
0     | 3.7                | 2.5                | 1.3
1     | 2.7                | 1.5                | 0.3
2     | 2.818              | 1.68               | 1.1

[The original table also lists the optimal order quantity for each stage and state.]
Policy: a sequence of decision rules in order to minimize the cost function.
Issues: existence of an optimal policy.

[Figure: the action applied in the present state determines the costs and the next state.]
Applications
Inventory management
Bus engine replacement
Highway pavement maintenance
Bed allocation in hospitals
Personnel staffing in fire departments
Traffic control in communication networks
Example

$$\text{Minimize } E\left[ \int_0^T g(X_t)\, dt \right] \quad \text{with} \quad g(X) = \begin{cases} hX, & \text{if } X \ge 0 \\ -bX, & \text{if } X < 0 \end{cases}$$

[Figure: make-to-stock system; from each stock level X = ..., 0, 1, 2, 3, ..., the action (make, p) moves the stock up at rate p and demand moves it down at rate d.]
Example
Zero-stock policy: produce if and only if the stock level X is negative.

[Figure: Markov chain of the stock level under the zero-stock policy; states ..., -2, -1, 0 with production at rate p moving the stock up and demand at rate d moving it down.]
Decision epochs
Times at which decisions are made.
A Markov decision process is characterized by {T, S, As, pt(· |s, a), Ct(s, a)}.
Transition probability vectors (p(0|s,a), p(1|s,a), p(2|s,a)) for each state-action pair:

Action | State 0         | State 1         | State 2
a = 0  | (1, 0, 0)       | (0.9, 0.1, 0)   | (0.2, 0.7, 0.1)
a = 1  | (0.9, 0.1, 0)   | (0.2, 0.7, 0.1) | not allowed
a = 2  | (0.2, 0.7, 0.1) | not allowed     | not allowed
Decision Rules
A decision rule prescribes a procedure for action selection in each
state at a specified decision epoch.
A decision rule can be either Markovian (depending only on the current
state) or history-dependent, and either deterministic or randomized.
As a result, the decision rules can be: Markovian deterministic (MD),
Markovian randomized (MR), history-dependent deterministic (HD), or
history-dependent randomized (HR).
Policies
A policy specifies the decision rule to be used at every decision epoch.
A policy is a sequence of decision rules, i.e. π = {d1, d2, ..., dN-1}.
Example
Decision epochs: T = {1, 2, ..., N}
States: S = {s1, s2}
Actions: As1 = {a11, a12}, As2 = {a21}
Costs: Ct(s1, a11) = 5, Ct(s1, a12) = 10, Ct(s2, a21) = -1, CN(s1) = CN(s2) = 0
Transition probabilities: pt(s1|s1, a11) = 0.5, pt(s2|s1, a11) = 0.5, pt(s1|s1, a12) = 0, pt(s2|s1, a12) = 1, pt(s1|s2, a21) = 0, pt(s2|s2, a21) = 1

[Figure: two-state diagram; arcs are labeled {cost, probability}: a11 loops on s1 with {5, 0.5} and leads to s2 with {5, 0.5}; a12 leads from s1 to s2 with {10, 1}; a21 loops on s2 with {-1, 1}.]
Example
A deterministic Markov policy
Decision epoch 1:
d1(s1) = a11, d1(s2) = a21
Decision epoch 2:
d2(s1) = a12, d2(s2) = a21
Example
A randomized Markov policy
Decision epoch 1:
P1, s1(a11) = 0.7, P1, s1(a12) = 0.3
P1, s2(a21) = 1
Decision epoch 2:
P2, s1(a11) = 0.4, P2, s1(a12) = 0.6
P2, s2(a21) = 1
Example
A deterministic history-dependent policy
Decision epoch 1: d1(s1) = a11, d1(s2) = a21
Decision epoch 2:

History h | d2(h, s1)  | d2(h, s2)
(s1, a11) | a13        | a21
(s1, a12) | infeasible | a21
(s1, a13) | a11        | infeasible
(s2, a21) | infeasible | a21

[Figure: the same MDP as above, extended with a third action a13 at s1, labeled {0, 1}.]
Example
A randomized history-dependent policy
Decision epoch 1:
Decision epoch 2, at s = s1:

History h | P(a = a11) | P(a = a12) | P(a = a13)
(s1, a11) | 0.4        | 0.3        | 0.3
(s1, a12) | infeasible | infeasible | infeasible
(s1, a13) | 0.8        | 0.1        | 0.1
(s2, a21) | infeasible | infeasible | infeasible

At s = s2, select a21.

[Figure: the same MDP diagram, with a13 at s1 labeled {0, 1}.]
Remarks
Each Markov policy leads to a discrete-time Markov chain, and the
policy can be evaluated by solving the related Markov chain.
Assumptions
Assumption 1: The decision epochs T = {1, 2, ..., N}
Assumption 2: The state space S is finite or countable
Assumption 3: The action space As is finite for each s
Criterion:
$$\inf_{\pi \in \Pi^{HR}} E^\pi\left[ \sum_{t=1}^{N-1} C_t(X_t, a_t) + C_N(X_N) \,\middle|\, X_1 = s \right]$$
Optimality equations
Theorem: The value functions
$$V_n(s) = \min_{\pi \in \Pi^{HR}} E^\pi\left[ \sum_{t=n}^{N-1} C_t(X_t, a_t) + C_N(X_N) \,\middle|\, X_n = s \right]$$
satisfy the backward recursion
$$V_t(s) = \min_{a \in A_s} \left\{ C_t(s, a) + \sum_{j \in S} p_t(j \mid s, a)\, V_{t+1}(j) \right\}, \qquad V_N(s) = C_N(s),$$
and the action a that minimizes the above term defines the optimal policy.
Optimality equations
The optimality equation can also be expressed as:
$$V_t(s) = \min_{a \in A_s} Q_t(s, a), \qquad Q_t(s, a) = C_t(s, a) + \sum_{j \in S} p_t(j \mid s, a)\, V_{t+1}(j)$$
Backward induction algorithm:
1. Set t = N and V_N(s) = C_N(s) for all s.
2. Substitute t - 1 for t and compute, for every s,
$$V_t(s) = \min_{a \in A_s} \left\{ C_t(s, a) + \sum_{j \in S} p_t(j \mid s, a)\, V_{t+1}(j) \right\}$$
$$d_t(s) = \arg\min_{a \in A_s} \left\{ C_t(s, a) + \sum_{j \in S} p_t(j \mid s, a)\, V_{t+1}(j) \right\}$$
3. Repeat step 2 until t = 1.
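A minimal sketch of backward induction in code, assuming costs and transition probabilities are given as nested containers indexed by epoch, state, and action (all names and the 0-based epochs are illustrative):

```python
# Backward induction for a finite-horizon MDP (a sketch; epochs 0..N-1,
# C[t][s][a]: stage cost, p[t][s][a][j]: transition probability,
# CN[s]: terminal cost).
def backward_induction(C, p, CN, N):
    V = list(CN)                       # V_{t+1}, initialized with V_N
    policy = [None] * N
    for t in reversed(range(N)):       # t = N-1, ..., 0
        V_t, d_t = [], []
        for s in range(len(CN)):
            Q = [C[t][s][a] + sum(pj * Vj for pj, Vj in zip(p[t][s][a], V))
                 for a in range(len(C[t][s]))]
            a_star = min(range(len(Q)), key=Q.__getitem__)
            V_t.append(Q[a_star])
            d_t.append(a_star)
        V, policy[t] = V_t, d_t
    return V, policy                   # V = V_0, policy[t][s] = optimal action
```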
Discounted Markov decision processes
Assumptions
Assumption 1: The decision epochs T = {1, 2, ...}
Assumption 2: The state space S is finite or countable
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities;
C(s, a) and p(j |s, a) do not vary from decision epoch to
decision epoch
Assumption 5: Bounded costs: | Ct(s, a) | ≤ M < ∞ for all a ∈ As
and all s ∈ S (to be relaxed)
Assumptions
Criterion:
$$\inf_{\pi \in \Pi^{HR}} \lim_{N \to \infty} E^\pi\left[ \sum_{t=1}^{N} \lambda^{t-1} C_t(X_t, a_t) \,\middle|\, X_1 = s \right]$$
where
0 < λ < 1 is the discount factor;
Π^HR is the set of all possible policies.
Optimality equations
Theorem: Under Assumptions 1-5, the optimal cost function
$$V^*(s) = \inf_{\pi \in \Pi^{HR}} \lim_{N \to \infty} E^\pi\left[ \sum_{t=1}^{N} \lambda^{t-1} C_t(X_t, a_t) \,\middle|\, X_1 = s \right]$$
exists and satisfies the optimality equation
$$V^*(s) = \min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V^*(j) \right\}$$

Value iteration:
$$V^{n+1}(s) = \min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V^n(j) \right\}$$
$$d(s) = \arg\min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V^n(j) \right\}$$
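A minimal value iteration sketch; the fixed stopping tolerance is an assumption, since the slides do not specify a stopping criterion:

```python
# Value iteration for a discounted MDP (a sketch; C[s][a]: cost,
# p[s][a][j]: transition probability, 0 < lam < 1: discount factor).
def value_iteration(C, p, lam, eps=1e-8):
    nS = len(C)
    V = [0.0] * nS
    while True:
        Q = [[C[s][a] + lam * sum(pj * Vj for pj, Vj in zip(p[s][a], V))
              for a in range(len(C[s]))] for s in range(nS)]
        V_new = [min(Qs) for Qs in Q]
        if max(abs(v1 - v0) for v1, v0 in zip(V_new, V)) < eps:
            d = [min(range(len(Qs)), key=Qs.__getitem__) for Qs in Q]
            return V_new, d            # value function and greedy policy
        V = V_new
```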
Policy improvement:
$$d^{n+1}(s) = \arg\min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V^n(j) \right\}$$
Policy evaluation: the value of a stationary policy d,
$$V^d(s) = E\left[ \sum_{t=1}^{\infty} \lambda^{t-1} C\big(X_t, d(X_t)\big) \,\middle|\, X_1 = s \right],$$
solves the linear system
$$V^d(s) = C\big(s, d(s)\big) + \lambda \sum_{j \in S} p\big(j \mid s, d(s)\big)\, V^d(j)$$
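A sketch of policy iteration in which the evaluation step solves this linear system directly (numpy-based; all names are illustrative):

```python
# Policy iteration for a discounted MDP (a sketch).
# Policy evaluation solves (I - lam * P_d) V = C_d.
import numpy as np

def policy_iteration(C, p, lam):
    nS = len(C)
    d = [0] * nS                              # initial policy
    while True:
        P_d = np.array([p[s][d[s]] for s in range(nS)])
        C_d = np.array([C[s][d[s]] for s in range(nS)])
        V = np.linalg.solve(np.eye(nS) - lam * P_d, C_d)   # evaluation
        d_new = [min(range(len(C[s])),                     # improvement
                     key=lambda a: C[s][a] + lam * float(np.dot(p[s][a], V)))
                 for s in range(nS)]
        if d_new == d:
            return V, d
        d = d_new
```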
$$V(s) = \min_{a \in A_s} \left\{ C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V(j) \right\}$$

Linear programming formulation:
$$\max \sum_{s \in S} V(s)$$
subject to
$$V(s) \le C(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, V(j), \quad \forall (s, a)$$
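A sketch of this linear program using scipy.optimize.linprog (assuming the maximization form above; names are illustrative):

```python
# Solving the discounted-MDP linear program with scipy (a sketch).
import numpy as np
from scipy.optimize import linprog

def lp_solve(C, p, lam):
    nS = len(C)
    rows, rhs = [], []
    for s in range(nS):
        for a in range(len(C[s])):
            row = -lam * np.array(p[s][a])
            row[s] += 1.0                     # V(s) - lam * sum_j p(j|s,a) V(j)
            rows.append(row)
            rhs.append(C[s][a])
    res = linprog(c=-np.ones(nS),             # maximize sum_s V(s)
                  A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * nS)
    return res.x                              # optimal value function V
```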
$$V^*(s) = \min_{a \in A_s} \left\{ C(s, a) + \sum_{j \in S} p(j \mid s, a)\, V^*(j) \right\}$$

Theorem 2. Assume that the set of control actions is finite. Then, under
the condition C(s, a) ≥ 0 for all states s and control actions a, we have
$$\lim_{N \to \infty} V^N(s) = V^*(s)$$
where V^N(s) is the solution of the value iteration algorithm with V^0(s) = 0.
Implication of Theorem 2: the optimal cost can be obtained as the limit
of value iteration, and the optimal stationary policy can also be obtained in
the limit.
Example
Consider a computer system consisting of M different processors.
Using processor i for a job incurs a finite cost Ci with C1 < C2 < ... < CM.
When we submit a job to this system, processor i is assigned to our job with
probability pi.
At this point we can (a) decide to go with this processor, or (b) hold the
job until a lower-cost processor is assigned.
The system periodically returns to our job and assigns a processor in the
same way.
Waiting until the next processor assignment incurs a fixed finite cost c.
Question:
How do we decide whether to go with the processor currently assigned to
our job, or to wait for the next assignment?
Suggestions:
The state definition should include all information useful for the decision.
The problem belongs to the so-called stochastic shortest path problem.
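One possible formulation, an assumption on my part since the slide only gives hints: take the state to be the index of the currently assigned processor, so that V(i) = min{Ci, c + Σj pj V(j)}, which can be solved by fixed-point iteration:

```python
# Fixed-point iteration for V(i) = min(C_i, c + sum_j p_j * V(j))
# (a sketch; state = index of the currently assigned processor).
def processor_values(C, p, c, iters=10000):
    V = list(C)                         # start from "always accept"
    for _ in range(iters):
        wait = c + sum(pj * Vj for pj, Vj in zip(p, V))
        V = [min(Ci, wait) for Ci in C]
    return V                            # accept processor i iff C[i] <= wait
```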
Average cost Markov decision processes
Assumptions
Assumption 1: The decision epochs T = {1, 2, ...}
Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities;
C(s, a) and p(j |s, a) do not vary from decision epoch to
decision epoch
Assumption 5: Bounded costs: | Ct(s, a) | ≤ M < ∞ for all a ∈ As
and all s ∈ S
Assumption 6: The Markov chain corresponding to any
stationary deterministic policy contains a single recurrent
class. (Unichain)
Assumptions
Criterion:
$$\inf_{\pi \in \Pi^{HR}} \lim_{N \to \infty} \frac{1}{N}\, E^\pi\left[ \sum_{t=1}^{N} C_t(X_t, a_t) \,\middle|\, X_1 = s \right]$$
where
Π^HR is the set of all possible policies.
Optimal policy
Under Assumptions 1-6, there exists an optimal stationary
deterministic policy.
Further, there exist a real number g and a value function h(s) that
satisfy the following optimality equation:
$$h(s) + g = \min_{a \in A_s} \left\{ C(s, a) + \sum_{j \in S} p(j \mid s, a)\, h(j) \right\}$$
where g is the optimal average cost and h is the differential cost:
$$h(s) = \lim_{\lambda \to 1} \big( V_\lambda(s) - V_\lambda(x_0) \big)$$
Linear programming formulation:
$$\max g$$
subject to
$$g + h(s) \le C(s, a) + \sum_{j \in S} p(j \mid s, a)\, h(j), \quad \forall (s, a)$$
$$h(x_0) = 0$$
Remarks: value iteration and policy iteration can also be
extended to the average cost case.
Relative value iteration:
$$U^{n+1}(s) = \min_{a \in A_s} \left\{ C(s, a) + \sum_{j \in S} p(j \mid s, a)\, h^n(j) \right\}$$
$$h^{n+1}(s) = U^{n+1}(s) - U^{n+1}(s_0), \qquad g^n = U^{n+1}(s_0)$$
$$d(s) = \arg\min_{a \in A_s} \left\{ C(s, a) + \sum_{j \in S} p(j \mid s, a)\, h^{n+1}(j) \right\}$$
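A sketch of relative value iteration in code, taking s0 to be state 0 and a fixed iteration count (both are assumptions):

```python
# Relative value iteration for an average-cost MDP (a sketch; s0 = 0).
def relative_value_iteration(C, p, iters=1000):
    nS = len(C)
    h = [0.0] * nS
    for _ in range(nS and iters):
        U = [min(C[s][a] + sum(p[s][a][j] * h[j] for j in range(nS))
                 for a in range(len(C[s]))) for s in range(nS)]
        g = U[0]                          # estimate of the average cost
        h = [U[s] - U[0] for s in range(nS)]
    d = [min(range(len(C[s])),
             key=lambda a: C[s][a] + sum(p[s][a][j] * h[j] for j in range(nS)))
         for s in range(nS)]
    return g, h, d
```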
Continuous-time Markov decision processes
Assumptions
Assumption 1: The decision epochs T = R+
Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary cost rates and transition rates;
C(s, a) and μ(j |s, a) do not vary over time
Assumptions
Criterion (discounted cost):
$$\inf_{\pi \in \Pi^{HR}} E^\pi\left[ \int_0^\infty C\big(X(t), a(t)\big)\, e^{-\beta t}\, dt \right]$$
Criterion (average cost):
$$\inf_{\pi \in \Pi^{HR}} \lim_{T \to \infty} \frac{1}{T}\, E^\pi\left[ \int_0^T C\big(X(t), a(t)\big)\, dt \right]$$
Example

$$\text{Minimize } \int_0^\infty E\big[ g(X_t) \big]\, e^{-\beta t}\, dt \quad \text{with} \quad g(X) = \begin{cases} hX, & \text{if } X \ge 0 \\ -bX, & \text{if } X < 0 \end{cases}$$

[Figure: make-to-stock system; from each stock level X = ..., 0, 1, 2, 3, ..., the action (make, p) moves the stock up at rate p and demand moves it down at rate d.]
Uniformization
Any continuous-time Markov chain can be converted to a
discrete-time chain through a process called "uniformization".
Uniformization
In order to synchronize (uniformize) the transitions at the same
pace, we choose a uniformization rate
$$\Lambda \ge \max_i \{\lambda(i)\}$$
where λ(i) is the total transition rate out of state i.
In the uniformized Markov chain:
transitions occur only at instants generated by a common
Poisson process of rate Λ (also called standard clock);
the state-transition probabilities are
$$p_{ij} = \lambda_{ij} / \Lambda, \qquad p_{ii} = 1 - \lambda(i)/\Lambda$$
where the self-loop transitions correspond to fictitious
events.
Uniformization

[Figure: a two-state CTMC with rates a (S1 → S2) and b (S2 → S1). Uniformized with rate Λ ≥ max{a, b}, it becomes a DTMC in which S1 moves to S2 with probability a/Λ and loops with probability 1 − a/Λ, while S2 moves to S1 with probability b/Λ and loops with probability 1 − b/Λ.]
Uniformization
For a Markov decision process, the uniformization rate Λ
should be such that
$$\Lambda \ge \lambda(s, a) = \sum_{j \in S} \mu(j \mid s, a)$$
for all states s and for all possible control actions a.
The state-transition probabilities of the uniformized Markov
decision process become:
$$p(j \mid s, a) = \mu(j \mid s, a)/\Lambda, \qquad p(s \mid s, a) = 1 - \sum_{j \ne s} \mu(j \mid s, a)/\Lambda$$
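A minimal sketch of this uniformization step in code (the data layout mu[s][a][j] for the transition rates is an assumption):

```python
# Uniformizing a CTMDP (a sketch): convert rates mu[s][a][j] into the
# transition probabilities of a discrete-time chain with clock rate Lam.
def uniformize(mu, Lam):
    nS = len(mu)
    p = []
    for s in range(nS):
        p_s = []
        for a in range(len(mu[s])):
            rate_out = sum(mu[s][a][j] for j in range(nS) if j != s)
            assert Lam >= rate_out, "uniformization rate too small"
            row = [mu[s][a][j] / Lam for j in range(nS)]
            row[s] = 1.0 - rate_out / Lam   # fictitious self-loop
            p_s.append(row)
        p.append(p_s)
    return p
```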
Uniformization

[Figure: the uniformized make-to-stock chain with Λ = p + d; from each stock level, action (make) moves the stock up with probability p/Λ, action (not make) keeps the p/Λ mass in place as a self-loop, and demand moves the stock down with probability d/Λ in both cases.]
Uniformization
Under the uniformization:
a sequence of discrete decision epochs T1, T2, ... is generated,
where T_{k+1} − T_k ~ EXP(Λ);
the discrete-time Markov chain describes the state of the system at
these decision epochs;
all criteria can be easily converted.

[Figure: a sample path with decision epochs T0, T1, T2, T3 separated by EXP(Λ) intervals; a fixed cost K(s, a) may be incurred at each transition in addition to the cost rate C(s, a).]

For the discounted criterion:
$$E\left[ \int_0^\infty C\big(X(t), a(t)\big)\, e^{-\beta t}\, dt \right] = E\left[ \sum_{k=0}^{\infty} \int_{T_k}^{T_{k+1}} C(X_k, a_k)\, e^{-\beta t}\, dt \right] = \sum_{k=0}^{\infty} E\big[ C(X_k, a_k) \big]\, E\left[ \int_{T_k}^{T_{k+1}} e^{-\beta t}\, dt \right] = \frac{1}{\beta + \Lambda}\, E\left[ \sum_{k=0}^{\infty} \left( \frac{\Lambda}{\beta + \Lambda} \right)^{k} C(X_k, a_k) \right]$$

where γ = Λ/(β + Λ) is a discount factor.

For the average-cost criterion:
$$\lim_{T \to \infty} \frac{1}{T}\, E\left[ \int_0^T C\big(X(t), a(t)\big)\, dt \right] = \lim_{N \to \infty} \frac{1}{N}\, E\left[ \sum_{k=0}^{N-1} C(X_k, a_k) \right]$$
Optimality equation (discounted cost):
$$V(s) = \min_{a \in A_s} \left\{ \frac{C(s, a)}{\beta + \Lambda} + K(s, a) + \gamma \sum_{j \in S} p(j \mid s, a)\, \big[ k(s, a, j) + V(j) \big] \right\}$$
where K(s, a) is a fixed cost incurred by action a in state s and
k(s, a, j) a fixed cost attached to the transition from s to j.
Optimality equation (average cost):
$$h(s) + g = \min_{a \in A_s} \left\{ \frac{C(s, a)}{\Lambda} + \sum_{j \in S} p(j \mid s, a)\, h(j) \right\}$$
where g is the average cost per period of the uniformized chain.
Example (continued)
Uniformize the Markov decision process with rate Λ = p + d.
The optimality equation:
$$V(s) = \min\left\{ \frac{g(s)}{\beta + p + d} + \frac{p}{\beta + p + d}\, V(s+1) + \frac{d}{\beta + p + d}\, V(s-1),\ \frac{g(s)}{\beta + p + d} + \frac{p}{\beta + p + d}\, V(s) + \frac{d}{\beta + p + d}\, V(s-1) \right\}$$
where the first term corresponds to producing and the second to not producing.
Example (continued)
From the optimality equation:
$$V(s) = \frac{g(s)}{\beta + p + d} + \frac{p}{\beta + p + d} \Big( V(s) + \min\big\{ V(s+1) - V(s),\ 0 \big\} \Big) + \frac{d}{\beta + p + d}\, V(s-1)$$
Hence it is optimal to produce in state s if and only if V(s+1) ≤ V(s).
Example (continued)
Convexity proved by value iteration:
$$V^{n+1}(s) = \frac{g(s)}{\beta + p + d} + \frac{p}{\beta + p + d} \min\big\{ V^n(s+1),\ V^n(s) \big\} + \frac{d}{\beta + p + d}\, V^n(s-1), \qquad V^0(s) = 0$$
Proof by induction:
V^0 is convex.
If V^n is convex with minimum at s = K, then min{V^n(s+1), V^n(s)} is convex, and hence V^{n+1} is convex.
The minimum point K of the value function is the hedging point: it is optimal to produce if and only if s < K.
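A sketch that computes the hedging point numerically by value iteration on a truncated state space (the truncation bounds and the parameter values in the usage line are assumptions):

```python
# Hedging point for the uniformized make-to-stock example (a sketch).
def hedging_point(p, d, beta, h, b, smin=-20, smax=20, iters=2000):
    states = range(smin, smax + 1)
    V = {s: 0.0 for s in states}
    g = lambda s: h * s if s >= 0 else -b * s     # holding / backlog cost
    for _ in range(iters):
        V = {s: (g(s)
                 + p * min(V.get(s + 1, V[smax]), V[s])   # produce vs. not
                 + d * V.get(s - 1, V[smin])) / (beta + p + d)
             for s in states}
    # produce iff V(s+1) <= V(s); the switching level is the hedging point K
    return next((s for s in states if s + 1 <= smax and V[s + 1] > V[s]), smax)

print(hedging_point(p=1.0, d=0.5, beta=0.1, h=1.0, b=10.0))
```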