Dynamic Programming and Optimal Control
http://ocw.mit.edu
6.231 Dynamic Programming and Stochastic Control
Fall 2008
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
LECTURE SLIDES ON DYNAMIC PROGRAMMING
BASED ON LECTURES GIVEN AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
CAMBRIDGE, MASS
FALL 2008
DIMITRI P. BERTSEKAS
These lecture slides are based on the book: Dynamic Programming and Optimal Control, 3rd edition, Vols. 1 and 2, Athena Scientific, 2007, by Dimitri P. Bertsekas; see
http://www.athenasc.com/dpbook.html
Last Updated: December 2008
The slides may be freely reproduced and distributed.
6.231 DYNAMIC PROGRAMMING
LECTURE 1
LECTURE OUTLINE
Problem Formulation
Examples
The Basic Problem
Significance of Feedback
DP AS AN OPTIMIZATION METHODOLOGY
- Generic optimization problem:
    min_{u ∈ U} g(u)
where u is the optimization/decision variable, g(u) is the cost function, and U is the constraint set
- Categories of problems:
  - Discrete (U is finite) or continuous
  - Linear (g is linear and U is polyhedral) or nonlinear
  - Stochastic or deterministic: In stochastic problems the cost involves a stochastic parameter w, which is averaged, i.e., it has the form
    g(u) = E_w { G(u, w) }
where w is a random parameter.
- DP can deal with complex stochastic problems where information about w becomes available in stages, and the decisions are also made in stages and make use of this information.
BASIC STRUCTURE OF STOCHASTIC DP
- Discrete-time system
    x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, ..., N-1
  - k: Discrete time
  - x_k: State; summarizes past information that is relevant for future optimization
  - u_k: Control; decision to be selected at time k from a given set
  - w_k: Random parameter (also called disturbance or noise depending on the context)
  - N: Horizon or number of times control is applied
- Cost function that is additive over time
    E { g_N(x_N) + sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) }
INVENTORY CONTROL EXAMPLE
[Figure: inventory system. Stock x_k at period k, stock u_k ordered at period k, demand w_k at period k; stock at period k+1 is x_{k+1} = x_k + u_k - w_k; cost of period k is c u_k + r(x_k + u_k - w_k)]
- Discrete-time system
    x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k
- Cost function that is additive over time
    E { g_N(x_N) + sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) }
      = E { sum_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }
- Optimization over policies: rules/functions u_k = mu_k(x_k) that map states to controls
ADDITIONAL ASSUMPTIONS
- The set of values that the control u_k can take depends at most on x_k and not on prior x or u
- The probability distribution of w_k does not depend on past values w_{k-1}, ..., w_0, but may depend on x_k and u_k
  - Otherwise past values of w or x would be useful for future optimization
- Sequence of events envisioned in period k:
  - x_k occurs according to
        x_k = f_{k-1}( x_{k-1}, u_{k-1}, w_{k-1} )
  - u_k is selected with knowledge of x_k, i.e.,
        u_k ∈ U_k(x_k)
  - w_k is random and generated according to a distribution
        P_{w_k}(x_k, u_k)
DETERMINISTIC FINITE-STATE PROBLEMS
- Scheduling example: Find optimal sequence of operations A, B, C, D
- A must precede B, and C must precede D
- Given startup costs S_A and S_C, and setup transition cost C_{mn} from operation m to operation n
[Figure: state transition graph from the initial state through the partial schedules A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA, with arcs labeled by the startup costs S_A, S_C and the transition costs C_{mn}]
STOCHASTIC FINITE-STATE PROBLEMS
- Example: Find two-game chess match strategy
- Timid play draws with prob. p_d > 0 and loses with prob. 1 - p_d. Bold play wins with prob. p_w < 1/2 and loses with prob. 1 - p_w
[Figure: transition diagrams over the possible match scores 0-0, 1-0, 0-1, 2-0, 1.5-0.5, 1-1, 0.5-1.5, 0-2. Under timid play in the 1st or 2nd game the score advances by 0.5-0.5 w.p. p_d and the opponent scores w.p. 1 - p_d; under bold play the player scores w.p. p_w and the opponent scores w.p. 1 - p_w]
BASIC PROBLEM
- System x_{k+1} = f_k(x_k, u_k, w_k), k = 0, ..., N-1
- Control constraints u_k ∈ U_k(x_k)
- Probability distribution P_k( · | x_k, u_k ) of w_k
- Policies pi = {mu_0, ..., mu_{N-1}}, where mu_k maps states x_k into controls u_k = mu_k(x_k) and is such that mu_k(x_k) ∈ U_k(x_k) for all x_k
- Expected cost of pi starting at x_0 is
    J_pi(x_0) = E { g_N(x_N) + sum_{k=0}^{N-1} g_k( x_k, mu_k(x_k), w_k ) }
- Optimal cost function
    J*(x_0) = min_pi J_pi(x_0)
- Optimal policy pi* satisfies
    J_{pi*}(x_0) = J*(x_0)
- In deterministic problems open loop is as good as closed loop
- Chess match example; value of information
[Figure: from score 0-0 play bold; if the first game is won (prob. p_w, score 1-0) play timid in the second game (reaching 1.5-0.5 w.p. p_d or 1-1 w.p. 1 - p_d); if it is lost (score 0-1) play bold (reaching 1-1 w.p. p_w or 0-2 w.p. 1 - p_w)]
VARIANTS OF DP PROBLEMS
- Continuous-time problems
- Imperfect state information problems
- Infinite horizon problems
- Suboptimal control
LECTURE BREAKDOWN
- Finite Horizon Problems (Vol. 1, Ch. 1-6)
  - Ch. 1: The DP algorithm (2 lectures)
  - Ch. 2: Deterministic finite-state problems (2 lectures)
  - Ch. 3: Deterministic continuous-time problems (1 lecture)
  - Ch. 4: Stochastic DP problems (2 lectures)
  - Ch. 5: Imperfect state information problems (2 lectures)
  - Ch. 6: Suboptimal control (3 lectures)
- Infinite Horizon Problems - Simple (Vol. 1, Ch. 7, 3 lectures)
- Infinite Horizon Problems - Advanced (Vol. 2)
  - Ch. 1: Discounted problems - Computational methods (2 lectures)
  - Ch. 2: Stochastic shortest path problems (1 lecture)
  - Ch. 6: Approximate DP (6 lectures)
A NOTE ON THESE SLIDES
- These slides are a teaching aid, not a text
- Don't expect a rigorous mathematical development or precise mathematical statements
- Figures are meant to convey and enhance ideas, not to express them precisely
- Omitted proofs and a much fuller discussion can be found in the text, which these slides follow
6.231 DYNAMIC PROGRAMMING
LECTURE 2
LECTURE OUTLINE
The basic problem
Principle of optimality
DP example: Deterministic problem
DP example: Stochastic problem
The general DP algorithm
State augmentation
BASIC PROBLEM
- System x_{k+1} = f_k(x_k, u_k, w_k), k = 0, ..., N-1
- Control constraints u_k ∈ U_k(x_k)
- Probability distribution P_k( · | x_k, u_k ) of w_k
- Policies pi = {mu_0, ..., mu_{N-1}}, where mu_k maps states x_k into controls u_k = mu_k(x_k) and is such that mu_k(x_k) ∈ U_k(x_k) for all x_k
- Expected cost of pi starting at x_0 is
    J_pi(x_0) = E { g_N(x_N) + sum_{k=0}^{N-1} g_k( x_k, mu_k(x_k), w_k ) }
- Optimal cost function
    J*(x_0) = min_pi J_pi(x_0)
- Optimal policy pi* is one that satisfies
    J_{pi*}(x_0) = J*(x_0)
PRINCIPLE OF OPTIMALITY
- Let pi* = {mu_0*, mu_1*, ..., mu_{N-1}*} be an optimal policy
- Consider the tail subproblem whereby we are at x_i at time i and wish to minimize the "cost-to-go" from time i to time N
    E { g_N(x_N) + sum_{k=i}^{N-1} g_k( x_k, mu_k(x_k), w_k ) }
and the tail policy {mu_i*, mu_{i+1}*, ..., mu_{N-1}*}
[Figure: tail subproblem starting at x_i, over the time interval from i to N]
- Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)
- DP first solves ALL tail subproblems of final stage
- At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length
DETERMINISTIC SCHEDULING EXAMPLE
- Find optimal sequence of operations A, B, C, D (A must precede B and C must precede D)
[Figure: state transition graph from the initial state through the partial schedules A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA, with the arc costs and the optimal cost-to-go recorded at each node]
- Start from the last tail subproblem and go backwards
- At each state-time pair, we record the optimal cost-to-go and the optimal decision
STOCHASTIC INVENTORY EXAMPLE
[Figure: inventory system. Stock x_k at period k, order u_k, demand w_k; x_{k+1} = x_k + u_k - w_k; cost of period k is c u_k + r(x_k + u_k - w_k)]
- Tail subproblems of length 1:
    J_{N-1}(x_{N-1}) = min_{u_{N-1} >= 0} E_{w_{N-1}} { c u_{N-1} + r(x_{N-1} + u_{N-1} - w_{N-1}) }
- Tail subproblems of length N - k:
    J_k(x_k) = min_{u_k >= 0} E_{w_k} { c u_k + r(x_k + u_k - w_k) + J_{k+1}(x_k + u_k - w_k) }
- J_0(x_0) is the optimal cost of initial state x_0
DP ALGORITHM
- Start with
    J_N(x_N) = g_N(x_N),
and go backwards using
    J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k} { g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) },   k = 0, 1, ..., N-1.
- Then J_0(x_0), generated at the last step, is equal to the optimal cost J*(x_0). Also, the policy
    pi* = {mu_0*, ..., mu_{N-1}*},
where mu_k*(x_k) minimizes in the right side above for each x_k and k, is optimal
- Justification: Proof by induction that J_k(x_k) is equal to J_k*(x_k), defined as the optimal cost of the tail subproblem that starts at time k at state x_k
- Note:
  - ALL the tail subproblems are solved (in addition to the original problem)
  - Intensive computational requirements
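The backward recursion above translates directly into a tabular computation. Below is a minimal sketch, applied to the inventory example (x_{k+1} = x_k + u_k - w_k); the horizon, cost parameters, demand distribution, and state/control grids are illustrative assumptions, not part of the slides.

```python
# Backward DP recursion J_k(x) = min_u E{ c*u + r(x+u-w) + J_{k+1}(x+u-w) }
N = 3                                  # horizon (assumed)
states = range(0, 11)                  # admissible stock levels 0..10 (assumed)
controls = range(0, 11)                # admissible order quantities (assumed)
demand = {0: 0.1, 1: 0.7, 2: 0.2}      # P(w_k = w) (assumed)
c, p, h = 1.0, 3.0, 1.0                # order, shortage, holding costs (assumed)

def r(x):                              # shortage/holding cost r(x)
    return p * max(0, -x) + h * max(0, x)

def clamp(x):                          # keep the successor state on the grid
    return max(min(states), min(max(states), x))

J = {N: {x: 0.0 for x in states}}      # terminal cost g_N = 0
mu = {}                                # optimal policy mu_k(x_k)
for k in range(N - 1, -1, -1):
    J[k], mu[k] = {}, {}
    for x in states:
        best_u, best_cost = None, float("inf")
        for u in controls:
            cost = sum(prob * (c * u + r(x + u - w) + J[k + 1][clamp(x + u - w)])
                       for w, prob in demand.items())
            if cost < best_cost:
                best_u, best_cost = u, cost
        J[k][x], mu[k][x] = best_cost, best_u

print("J_0(0) =", round(J[0][0], 3), " mu_0(0) =", mu[0][0])
```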
PROOF OF THE INDUCTION STEP
- Let pi_k = {mu_k, mu_{k+1}, ..., mu_{N-1}} denote a tail policy from time k onward
- Assume that J_{k+1}(x_{k+1}) = J*_{k+1}(x_{k+1}). Then

  J*_k(x_k) = min_{(mu_k, pi_{k+1})} E_{w_k, ..., w_{N-1}} { g_k( x_k, mu_k(x_k), w_k )
                  + g_N(x_N) + sum_{i=k+1}^{N-1} g_i( x_i, mu_i(x_i), w_i ) }

            = min_{mu_k} E_{w_k} { g_k( x_k, mu_k(x_k), w_k )
                  + min_{pi_{k+1}} E_{w_{k+1}, ..., w_{N-1}} { g_N(x_N) + sum_{i=k+1}^{N-1} g_i( x_i, mu_i(x_i), w_i ) } }

            = min_{mu_k} E_{w_k} { g_k( x_k, mu_k(x_k), w_k ) + J*_{k+1}( f_k( x_k, mu_k(x_k), w_k ) ) }

            = min_{mu_k} E_{w_k} { g_k( x_k, mu_k(x_k), w_k ) + J_{k+1}( f_k( x_k, mu_k(x_k), w_k ) ) }

            = min_{u_k ∈ U_k(x_k)} E_{w_k} { g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) }

            = J_k(x_k)
LINEAR-QUADRATIC ANALYTICAL EXAMPLE
[Figure: material with initial temperature x_0 passes through Oven 1 (temperature u_0), producing x_1, and then through Oven 2 (temperature u_1), producing final temperature x_2]
- System
    x_{k+1} = (1 - a) x_k + a u_k,   k = 0, 1,
where a is a given scalar from the interval (0, 1)
- Cost
    r (x_2 - T)^2 + u_0^2 + u_1^2
where r is a given positive scalar
- DP Algorithm:
    J_2(x_2) = r (x_2 - T)^2
    J_1(x_1) = min_{u_1} [ u_1^2 + r ( (1 - a) x_1 + a u_1 - T )^2 ]
    J_0(x_0) = min_{u_0} [ u_0^2 + J_1( (1 - a) x_0 + a u_0 ) ]
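A small numerical check of this two-stage recursion can be carried out directly: J_1 has a closed-form minimizer, and J_0 can be handled by a one-dimensional search. The parameter values a, r, T, x0 below are illustrative assumptions.

```python
# Two-oven DP recursion: J_2, J_1 (closed form), J_0 (grid search over u_0)
a, r, T, x0 = 0.5, 2.0, 100.0, 20.0      # assumed parameter values

def J2(x2):
    return r * (x2 - T) ** 2

def J1(x1):
    # min over u1 of u1^2 + r((1-a)x1 + a u1 - T)^2, solved in closed form
    u1 = -r * a * ((1 - a) * x1 - T) / (1 + r * a ** 2)
    return u1 ** 2 + J2((1 - a) * x1 + a * u1), u1

def J0(x0):
    # one-dimensional minimization over u0 by a simple grid search
    return min(((u0 ** 2 + J1((1 - a) * x0 + a * u0)[0]), u0)
               for u0 in [i * 0.01 for i in range(0, 20001)])

cost, u0_star = J0(x0)
print("optimal cost J_0(x0) =", round(cost, 2), " first control u_0 =", round(u0_star, 2))
```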
STATE AUGMENTATION
- When assumptions of the basic problem are violated (e.g., disturbances are correlated, cost is nonadditive, etc.) reformulate/augment the state
- Example: Time lags
    x_{k+1} = f_k(x_k, x_{k-1}, u_k, w_k)
- Introduce an additional state variable y_k = x_{k-1}. The new system takes the form
    ( x_{k+1}, y_{k+1} ) = ( f_k(x_k, y_k, u_k, w_k), x_k )
View x~_k = (x_k, y_k) as the new state.
- DP algorithm for the reformulated problem:
    J_k(x_k, x_{k-1}) = min_{u_k ∈ U_k(x_k)} E_{w_k} { g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, x_{k-1}, u_k, w_k), x_k ) }
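A minimal sketch of the state-augmentation idea above: a system with a one-period time lag is simulated as a first-order system on the augmented state (x_k, y_k) with y_k = x_{k-1}. The dynamics f and the inputs are illustrative assumptions.

```python
def f(x, x_prev, u, w):
    # hypothetical lagged dynamics: next state depends on both x_k and x_{k-1}
    return 0.8 * x + 0.1 * x_prev + u + w

def augmented_step(aug_state, u, w):
    x, y = aug_state                     # y plays the role of x_{k-1}
    return (f(x, y, u, w), x)            # (x_{k+1}, y_{k+1}) = ( f_k(x_k, y_k, u_k, w_k), x_k )

state = (1.0, 0.0)                       # (x_0, x_{-1}); x_{-1} assumed known
for k, (u, w) in enumerate([(0.5, 0.0), (0.2, -0.1), (0.0, 0.05)]):
    state = augmented_step(state, u, w)
    print(f"k={k+1}  augmented state (x, y) = ({state[0]:.3f}, {state[1]:.3f})")
```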
6.231 DYNAMIC PROGRAMMING
LECTURE 3
LECTURE OUTLINE
Deterministic finite-state DP problems
Backward shortest path algorithm
Forward shortest path algorithm
Shortest path examples
Alternative shortest path algorithms
DETERMINISTIC FINITE-STATE PROBLEM
[Figure: stage-by-stage transition graph from the initial state s at stage 0 through stages 1, 2, ..., N-1, N, to an artificial terminal node t whose terminal arcs carry cost equal to the terminal cost]
- States <==> Nodes
- Controls <==> Arcs
- Control sequences (open-loop) <==> paths from initial state to terminal states
- a^k_{ij}: Cost of transition from state i ∈ S_k to state j ∈ S_{k+1} at time k (view it as length of the arc)
- a^N_{it}: Terminal cost of state i ∈ S_N
- Cost of control sequence <==> Cost of the corresponding path (view it as length of the path)
BACKWARD AND FORWARD DP ALGORITHMS
- DP algorithm:
    J_N(i) = a^N_{it},   i ∈ S_N,
    J_k(i) = min_{j ∈ S_{k+1}} [ a^k_{ij} + J_{k+1}(j) ],   i ∈ S_k,   k = 0, ..., N-1
The optimal cost is J_0(s) and is equal to the length of the shortest path from s to t
- Observation: An optimal path s --> t is also an optimal path t --> s in a "reverse" shortest path problem where the direction of each arc is reversed and its length is left unchanged
- Forward DP algorithm (= backward DP algorithm for the reverse problem):
    J~_N(j) = a^0_{sj},   j ∈ S_1,
    J~_k(j) = min_{i ∈ S_{N-k}} [ a^{N-k}_{ij} + J~_{k+1}(i) ],   j ∈ S_{N-k+1}
- View J~_k(j) as optimal cost-to-arrive to state j from initial state s
A NOTE ON FORWARD DP ALGORITHMS
- There is no forward DP algorithm for stochastic problems
- Mathematically, for stochastic problems, we cannot restrict ourselves to open-loop sequences, so the shortest path viewpoint fails
- Conceptually, in the presence of uncertainty, the concept of "optimal cost-to-arrive" at a state x_k does not make sense. For example, it may be impossible to guarantee (with prob. 1) that any given state can be reached
- By contrast, even in stochastic problems, the concept of optimal cost-to-go from any state x_k makes clear sense
GENERIC SHORTEST PATH PROBLEMS
- {1, 2, ..., N, t}: nodes of a graph (t: the destination)
- a_{ij}: cost of moving from node i to node j
- Find a shortest (minimum cost) path from each node i to node t
- Assumption: All cycles have nonnegative length. Then an optimal path need not take more than N moves
- We formulate the problem as one where we require exactly N moves but allow degenerate moves from a node i to itself with cost a_{ii} = 0
    J_k(i) = optimal cost of getting from i to t in N - k moves
    J_0(i): Cost of the optimal path from i to t.
- DP algorithm:
    J_k(i) = min_{j=1,...,N} [ a_{ij} + J_{k+1}(j) ],   k = 0, 1, ..., N-2,
with J_{N-1}(i) = a_{it}, i = 1, 2, ..., N
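A minimal sketch of this recursion on a small graph. The four-node example, its arc costs, and the costs to the destination t are illustrative assumptions.

```python
INF = float("inf")
N = 4                                    # non-destination nodes 1..N (assumed)
# a[i][j]: arc costs among nodes 1..N; a_t[i]: arc cost from i to destination t
a = {1: {2: 2.0}, 2: {3: 1.0}, 3: {4: 1.5}, 4: {1: 3.0}}
a_t = {1: INF, 2: 6.0, 3: 2.0, 4: 0.5}

J = {N - 1: {i: a_t[i] for i in range(1, N + 1)}}      # J_{N-1}(i) = a_{it}
for k in range(N - 2, -1, -1):
    J[k] = {}
    for i in range(1, N + 1):
        stay = J[k + 1][i]               # degenerate move i -> i with cost 0
        move = min((a[i].get(j, INF) + J[k + 1][j] for j in range(1, N + 1)), default=INF)
        J[k][i] = min(stay, move)

for i in range(1, N + 1):
    print(f"shortest distance from {i} to t: {J[0][i]}")
```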
EXAMPLE
[Figure: (a) a 5-node example graph with destination t and the arc lengths shown; (b) the cost-to-go values J_k(i) generated by the DP recursion for each state i at stages k = 0, 1, 2, 3, 4]
    J_{N-1}(i) = a_{it},   i = 1, 2, ..., N,
    J_k(i) = min_{j=1,...,N} [ a_{ij} + J_{k+1}(j) ],   k = 0, 1, ..., N-2.
ESTIMATION / HIDDEN MARKOV MODELS
- Markov chain with transition probabilities p_{ij}
- State transitions are hidden from view
- For each transition, we get an (independent) observation
- r(z; i, j): Prob. the observation takes value z when the state transition is from i to j
- Trajectory estimation problem: Given the observation sequence Z_N = {z_1, z_2, ..., z_N}, what is the most likely state transition sequence X_N = {x_0, x_1, ..., x_N} [one that maximizes p(X_N | Z_N) over all X_N = {x_0, x_1, ..., x_N}].
[Figure: trellis of hidden states x_0, x_1, x_2, ..., x_{N-1}, x_N between an artificial start node s and an artificial end node t]
VITERBI ALGORITHM
- We have
    p(X_N | Z_N) = p(X_N, Z_N) / p(Z_N)
where p(X_N, Z_N) and p(Z_N) are the unconditional probabilities of occurrence of (X_N, Z_N) and Z_N
- Maximizing p(X_N | Z_N) is equivalent to maximizing ln( p(X_N, Z_N) )
- We have
    p(X_N, Z_N) = pi_{x_0} prod_{k=1}^{N} p_{x_{k-1} x_k} r(z_k; x_{k-1}, x_k)
so the problem is equivalent to
    minimize  - ln( pi_{x_0} ) - sum_{k=1}^{N} ln( p_{x_{k-1} x_k} r(z_k; x_{k-1}, x_k) )
over all possible sequences {x_0, x_1, ..., x_N}.
- This is a shortest path problem.
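A minimal Viterbi sketch matching the shortest-path formulation above: forward recursion over the trellis with arc lengths -ln( p_{ij} r(z; i, j) ), followed by backtracking. The two-state chain, observation model, and observation sequence are illustrative assumptions.

```python
import math

states = [0, 1]
pi0 = [0.6, 0.4]                                  # initial distribution pi_{x_0} (assumed)
p = [[0.7, 0.3], [0.4, 0.6]]                      # transition probabilities p_{ij} (assumed)
r = {("a", 0, 0): 0.9, ("a", 0, 1): 0.2, ("a", 1, 0): 0.5, ("a", 1, 1): 0.1,
     ("b", 0, 0): 0.1, ("b", 0, 1): 0.8, ("b", 1, 0): 0.5, ("b", 1, 1): 0.9}
Z = ["a", "b", "b"]                               # observed sequence z_1, ..., z_N (assumed)

# forward pass: D[i] = shortest length of a trellis path ending at state i
D = {i: -math.log(pi0[i]) for i in states}
back = []                                         # back pointers for path recovery
for z in Z:
    newD, ptr = {}, {}
    for j in states:
        cand = [(D[i] - math.log(p[i][j] * r[(z, i, j)]), i) for i in states]
        newD[j], ptr[j] = min(cand)
    D, back = newD, back + [ptr]

# backtrack the most likely state sequence x_0, ..., x_N
x = [min(D, key=D.get)]
for ptr in reversed(back):
    x.append(ptr[x[-1]])
print("most likely state sequence:", list(reversed(x)))
```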
GENERAL SHORTEST PATH ALGORITHMS
- There are many non-DP shortest path algorithms. They can all be used to solve deterministic finite-state problems
- They may be preferable to DP if they avoid calculating the optimal cost-to-go of EVERY state
- This is essential for problems with HUGE state spaces. Such problems arise for example in combinatorial optimization
[Figure: tree of partial sequences with origin node s = A, children AB, AC, AD, grandchildren ABC, ABD, ACB, ACD, ADB, ADC, complete sequences ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, an artificial terminal node t, and the arc lengths shown]
LABEL CORRECTING METHODS
- Given: Origin s, destination t, lengths a_{ij} >= 0.
- Idea is to progressively discover shorter paths from the origin s to every other node i
- Notation:
  - d_i (label of i): Length of the shortest path found (initially d_s = 0, d_i = ∞ for i ≠ s)
  - UPPER: The label d_t of the destination
  - OPEN list: Contains nodes that are currently active in the sense that they are candidates for further examination (initially OPEN = {s})
Label Correcting Algorithm
- Step 1 (Node Removal): Remove a node i from OPEN and for each child j of i, do step 2
- Step 2 (Node Insertion Test): If d_i + a_{ij} < min{ d_j, UPPER }, set d_j = d_i + a_{ij} and set i to be the parent of j. In addition, if j ≠ t, place j in OPEN if it is not already in OPEN, while if j = t, set UPPER to the new value d_i + a_{it} of d_t
- Step 3 (Termination Test): If OPEN is empty, terminate; else go to step 1
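A minimal sketch of Steps 1-3 above. The rule for which node to remove from OPEN is left open by the algorithm; FIFO removal is used here, and the small graph is an illustrative assumption.

```python
from collections import deque

INF = float("inf")
graph = {                        # graph[i] = list of (child j, arc length a_ij >= 0) (assumed)
    "s": [("1", 1.0), ("2", 4.0)],
    "1": [("2", 1.0), ("t", 6.0)],
    "2": [("t", 1.0)],
}

def label_correcting(graph, origin="s", dest="t"):
    d = {origin: 0.0}                      # labels d_i (missing key means infinity)
    parent, upper = {}, INF                # UPPER = label of the destination
    open_list = deque([origin])            # OPEN initially contains the origin
    while open_list:                       # Step 3: stop when OPEN is empty
        i = open_list.popleft()            # Step 1: remove a node from OPEN
        for j, a_ij in graph.get(i, []):   # Step 2: insertion test for each child j
            if d[i] + a_ij < min(d.get(j, INF), upper):
                d[j] = d[i] + a_ij
                parent[j] = i
                if j != dest and j not in open_list:
                    open_list.append(j)
                elif j == dest:
                    upper = d[j]
    return upper, parent

upper, parent = label_correcting(graph)
print("shortest s -> t distance (UPPER):", upper)   # 3.0 via s -> 1 -> 2 -> t
```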
VISUALIZATION/EXPLANATION
- Given: Origin s, destination t, lengths a_{ij} >= 0
- d_i (label of i): Length of the shortest path found thus far (initially d_s = 0, d_i = ∞ for i ≠ s). The label d_i is implicitly associated with an s --> i path
- UPPER: The label d_t of the destination
- OPEN list: Contains active nodes (initially OPEN = {s})
[Figure: flowchart of the node insertion test. A node i is removed from OPEN; for a child j we ask "Is d_i + a_{ij} < d_j?" (is the path s --> i --> j better than the current path s --> j?) and "Is d_i + a_{ij} < UPPER?" (does the path s --> i --> j have a chance to be part of a shorter s --> t path?); if YES to both, set d_j = d_i + a_{ij} and INSERT j into OPEN]
EXAMPLE
[Figure: the tree of the previous slide, with origin node s = A, intermediate nodes AB, AC, AD, ABC, ABD, ACB, ACD, ADB, ADC, complete sequences, the artificial terminal node t, and the arc lengths shown]

Iter. No.   Node Exiting OPEN   OPEN after Iteration   UPPER
0           -                   1                      ∞
1           1                   2, 7, 10               ∞
2           2                   3, 5, 7, 10            ∞
3           3                   4, 5, 7, 10            ∞
4           4                   5, 7, 10               43
5           5                   6, 7, 10               43
6           6                   7, 10                  13
7           7                   8, 10                  13
8           8                   9, 10                  13
9           9                   10                     13
10          10                  Empty                  13

- Note that some nodes never entered OPEN
6.231 DYNAMIC PROGRAMMING
LECTURE 4
LECTURE OUTLINE
Label correcting methods for shortest paths
Variants of label correcting methods
Branch-and-bound as a shortest path algorithm
LABEL CORRECTING METHODS
- Origin s, destination t, lengths a_{ij} that are >= 0
- d_i (label of i): Length of the shortest path found thus far (initially d_i = ∞ except d_s = 0). The label d_i is implicitly associated with an s --> i path
- UPPER: Label d_t of the destination
- OPEN list: Contains active nodes (initially OPEN = {s})
[Figure: flowchart of the node insertion test, as in Lecture 3: remove i from OPEN; if d_i + a_{ij} < d_j (is the path s --> i --> j better than the current path s --> j?) and d_i + a_{ij} < UPPER (does it have a chance to be part of a shorter s --> t path?), set d_j = d_i + a_{ij} and insert j into OPEN]
VALIDITY OF LABEL CORRECTING METHODS
- Proposition: If there exists at least one path from the origin to the destination, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination
- Proof: (1) Each time a node j enters OPEN, its label is decreased and becomes equal to the length of some path from s to j
(2) The number of possible distinct path lengths is finite, so the number of times a node can enter OPEN is finite, and the algorithm terminates
(3) Let (s, j_1, j_2, ..., j_k, t) be a shortest path and let d* be the shortest distance. If UPPER > d* at termination, UPPER will also be larger than the length of all the paths (s, j_1, ..., j_m), m = 1, ..., k, throughout the algorithm. Hence, node j_k will never enter the OPEN list with d_{j_k} equal to the shortest distance from s to j_k. Similarly node j_{k-1} will never enter the OPEN list with d_{j_{k-1}} equal to the shortest distance from s to j_{k-1}. Continue to j_1 to get a contradiction
MAKING THE METHOD EFFICIENT
- Reduce the value of UPPER as quickly as possible
  - Try to discover "good" s --> t paths early in the course of the algorithm
- Keep the number of reentries into OPEN low
  - Try to remove from OPEN nodes with small label first.
  - Heuristic rationale: if d_i is small, then d_j when set to d_i + a_{ij} will be accordingly small, so reentrance of j in the OPEN list is less likely
- Reduce the overhead for selecting the node to be removed from OPEN
- These objectives are often in conflict. They give rise to a large variety of distinct implementations
- Good practical strategies try to strike a compromise between low overhead and small label node selection
NODE SELECTION METHODS
- Depth-first search: Remove from the top of OPEN and insert at the top of OPEN.
  - Has low memory storage properties (OPEN is not too long). Reduces UPPER quickly.
[Figure: depth-first traversal of a tree from the origin node s to the destination node t]
- Best-first search (Dijkstra): Remove from OPEN a node with minimum value of label.
  - Interesting property: Each node will be inserted in OPEN at most once.
  - Nodes enter OPEN at minimum distance
  - Many implementations/approximations
ADVANCED INITIALIZATION
- Instead of starting from d_i = ∞ for all i ≠ s, start with
    d_i = length of some path from s to i   (or d_i = ∞)
    OPEN = { i ≠ t | d_i < ∞ }
- Motivation: Get a small starting value of UPPER.
- No node with shortest distance >= initial value of UPPER will enter OPEN
- Good practical idea:
  - Run a heuristic (or use common sense) to get a "good" starting path P from s to t
  - Use as UPPER the length of P, and as d_i the path distances of all nodes i along P
- Very useful also in reoptimization, where we solve the same problem with slightly different data
VARIANTS OF LABEL CORRECTING METHODS
- If a lower bound h_j of the true shortest distance from j to t is known, use the test
    d_i + a_{ij} + h_j < UPPER
for entry into OPEN, instead of
    d_i + a_{ij} < UPPER
The label correcting method with lower bounds as above is often referred to as the A* method.
- If an upper bound m_j of the true shortest distance from j to t is known, then if d_j + m_j < UPPER, reduce UPPER to d_j + m_j.
- Important use: Branch-and-bound algorithm for discrete optimization can be viewed as an implementation of this last variant.
BRANCH-AND-BOUND METHOD
- Problem: Minimize f(x) over a finite set of feasible solutions X.
- Idea of branch-and-bound: Partition the feasible set into smaller subsets, and then calculate certain bounds on the attainable cost within some of the subsets to eliminate from further consideration other subsets.
Bounding Principle
Given two subsets Y_1 ⊂ X and Y_2 ⊂ X, suppose that we have bounds
    f_1 <= min_{x ∈ Y_1} f(x)   (a lower bound),     f_2 >= min_{x ∈ Y_2} f(x)   (an upper bound).
Then, if f_2 <= f_1, the solutions in Y_1 may be disregarded since their cost cannot be smaller than the cost of the best solution in Y_2.
- The B+B algorithm can be viewed as a label correcting algorithm, where lower bounds define the arc costs, and upper bounds are used to strengthen the test for admission to OPEN.
SHORTEST PATH IMPLEMENTATION
- Acyclic graph/partition of X into subsets (typically a tree). The leafs consist of single solutions.
- Upper and lower bounds f̄_Y and f̲_Y for the minimum cost over each subset Y can be calculated. The lower bound of a leaf {x} is f(x)
- Each arc (Y, Z) has length f̲_Z - f̲_Y
- Shortest distance from X to Y = f̲_Y - f̲_X
- Distance from origin X to a leaf {x} is f(x) - f̲_X
- Shortest path from X to the set of leafs gives the optimal cost and optimal solution
- UPPER is the smallest f(x) - f̲_X out of leaf nodes {x} examined so far
[Figure: partition tree of X = {1,2,3,4,5} into {1,2,3} and {4,5}, then {1,2}, {3}, {4}, {5}, with leaves {1} and {2}]
BRANCH-AND-BOUND ALGORITHM
- Step 1: Remove a node Y from OPEN. For each child Y_j of Y, do the following:
  - Entry Test: If f̲_{Y_j} < UPPER, place Y_j in OPEN.
  - Update UPPER: If f̄_{Y_j} < UPPER, set UPPER = f̄_{Y_j}, and if Y_j consists of a single solution, mark that as being the best solution found so far
- Step 2: (Termination Test) If OPEN is empty, terminate; the best solution found so far is optimal. Else go to Step 1
- It is neither practical nor necessary to generate a priori the acyclic graph (generate it as you go)
- Keys to branch-and-bound:
  - Generate as sharp as possible upper and lower bounds at each node
  - Have a good partitioning and node selection strategy
- Method involves a lot of art, may be prohibitively time-consuming ... but guaranteed to find an optimal solution
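A minimal sketch in the spirit of Steps 1-2 above, applied to a tiny traveling salesman problem. Nodes of the branch-and-bound tree correspond to partial tours starting at city 0; the lower bound adds, for each city not yet departed from, its cheapest feasible outgoing arc; the 4-city cost matrix and this bounding choice are illustrative assumptions.

```python
INF = float("inf")
a = [[0, 10, 15, 20],
     [10, 0, 35, 25],
     [15, 35, 0, 30],
     [20, 25, 30, 0]]
n = len(a)

def lower_bound(tour, cost):
    # partial tour cost + cheapest way to leave each city not yet departed from
    remaining = [c for c in range(n) if c not in tour] + [tour[-1]]
    lb = cost
    for c in remaining:
        lb += min(a[c][j] for j in range(n) if j != c and (j not in tour[1:] or j == 0))
    return lb

UPPER, best = INF, None
OPEN = [([0], 0)]                        # each node Y: (partial tour, its cost)
while OPEN:                              # Step 2: terminate when OPEN is empty
    tour, cost = OPEN.pop()              # Step 1: remove a node Y from OPEN
    if len(tour) == n:                   # leaf: a single complete solution
        total = cost + a[tour[-1]][0]
        if total < UPPER:
            UPPER, best = total, tour + [0]
        continue
    for j in range(n):                   # children of Y
        if j not in tour:
            child = (tour + [j], cost + a[tour[-1]][j])
            if lower_bound(*child) < UPPER:   # entry test with the lower bound
                OPEN.append(child)

print("optimal tour:", best, "cost:", UPPER)
```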
6.231 DYNAMIC PROGRAMMING
LECTURE 5
LECTURE OUTLINE
Deterministic continuous-time optimal control
Examples
Connection with the calculus of variations
The Hamilton-Jacobi-Bellman equation as a
continuous-time limit of the DP algorithm
The Hamilton-Jacobi-Bellman equation as a
sufficient condition
Examples
PROBLEM FORMULATION
- Continuous-time dynamic system:
    x'(t) = f( x(t), u(t) ),   0 <= t <= T,   x(0): given,
where
  - x(t) ∈ R^n: state vector at time t
  - u(t) ∈ U ⊂ R^m: control vector at time t
  - U: control constraint set
  - T: terminal time
- Admissible control trajectories { u(t) | t ∈ [0, T] }: piecewise continuous functions { u(t) | t ∈ [0, T] } with u(t) ∈ U for all t ∈ [0, T]; they uniquely determine { x(t) | t ∈ [0, T] }
- Problem: Find an admissible control trajectory { u(t) | t ∈ [0, T] } and corresponding state trajectory { x(t) | t ∈ [0, T] } that minimizes the cost
    h( x(T) ) + integral_0^T g( x(t), u(t) ) dt
- f, h, g are assumed continuously differentiable
EXAMPLE I
- Motion control: A unit mass moves on a line under the influence of a force u
- x(t) = ( x_1(t), x_2(t) ): position and velocity of the mass at time t
- Problem: From a given ( x_1(0), x_2(0) ), bring the mass "near" a given final position-velocity pair (x̄_1, x̄_2) at time T in the sense:
    minimize | x_1(T) - x̄_1 |^2 + | x_2(T) - x̄_2 |^2
subject to the control constraint
    |u(t)| <= 1,   for all t ∈ [0, T]
- The problem fits the framework with
    x'_1(t) = x_2(t),   x'_2(t) = u(t),
    h( x(T) ) = | x_1(T) - x̄_1 |^2 + | x_2(T) - x̄_2 |^2,
    g( x(t), u(t) ) = 0,   for all t ∈ [0, T]
EXAMPLE II
- A producer with production rate x(t) at time t may allocate a portion u(t) of his/her production rate to reinvestment and 1 - u(t) to production of a storable good. Thus x(t) evolves according to
    x'(t) = γ u(t) x(t),
where γ > 0 is a given constant
- The producer wants to maximize the total amount of product stored
    integral_0^T ( 1 - u(t) ) x(t) dt
subject to
    0 <= u(t) <= 1,   for all t ∈ [0, T]
- The initial production rate x(0) is a given positive number
EXAMPLE III (CALCULUS OF VARIATIONS)
[Figure: a curve x(t) from a given point at t = 0 to a given line at t = T, with slope x'(t) = u(t); its length is integral_0^T sqrt( 1 + ( u(t) )^2 ) dt]
- Find a curve from a given point to a given line that has minimum length
- The problem is
    minimize integral_0^T sqrt( 1 + ( x'(t) )^2 ) dt
    subject to x(0) = α
- Reformulation as an optimal control problem:
    minimize integral_0^T sqrt( 1 + ( u(t) )^2 ) dt
    subject to x'(t) = u(t),   x(0) = α
HAMILTON-JACOBI-BELLMAN EQUATION I
- We discretize [0, T] at times 0, δ, 2δ, ..., Nδ, where δ = T/N, and we let
    x_k = x(kδ),   u_k = u(kδ),   k = 0, 1, ..., N
- We also discretize the system and cost:
    x_{k+1} = x_k + f(x_k, u_k)·δ,     h(x_N) + sum_{k=0}^{N-1} g(x_k, u_k)·δ
- We write the DP algorithm for the discretized problem
    J~*(Nδ, x) = h(x),
    J~*(kδ, x) = min_{u ∈ U} [ g(x, u)·δ + J~*( (k+1)δ, x + f(x, u)·δ ) ]
- Assuming J~* is differentiable, a first order Taylor expansion gives
    J~*(kδ, x) = min_{u ∈ U} [ g(x, u)·δ + J~*(kδ, x) + ∇_t J~*(kδ, x)·δ + ∇_x J~*(kδ, x)' f(x, u)·δ + o(δ) ]
- Cancel J~*(kδ, x), divide by δ, and take the limit as δ --> 0 (with kδ = t):
    0 = min_{u ∈ U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ],   for all (t, x),
with the boundary condition J*(T, x) = h(x)
- This is the Hamilton-Jacobi-Bellman (HJB) equation - a partial differential equation, which is satisfied for all time-state pairs (t, x) by the cost-to-go function J*(t, x) (assuming J* is differentiable and the preceding informal limiting procedure is valid)
- Hard to tell a priori if J*(t, x) is differentiable
- So we use the HJB Eq. as a verification tool; suppose we can solve it for a differentiable V(t, x):
    0 = min_{u ∈ U} [ g(x, u) + ∇_t V(t, x) + ∇_x V(t, x)' f(x, u) ],   for all (t, x),
    V(T, x) = h(x),   for all x
- Suppose also that μ*(t, x) attains the minimum above for all (t, x), and let { x*(t) | t ∈ [0, T] } and { u*(t) = μ*( t, x*(t) ) | t ∈ [0, T] } be the corresponding state and control trajectories
- Then
    V(t, x) = J*(t, x),   for all (t, x),
and { u*(t) | t ∈ [0, T] } is optimal, attaining the cost h( x*(T) ) + integral_0^T g( x*(t), u*(t) ) dt
EXAMPLE OF THE HJB EQUATION
- Consider the scalar system x'(t) = u(t), with |u(t)| <= 1 and cost (1/2)( x(T) )^2. The HJB equation is
    0 = min_{|u| <= 1} [ ∇_t V(t, x) + ∇_x V(t, x)·u ],   for all (t, x),
with the terminal condition V(T, x) = (1/2) x^2
- Evident candidate for optimality: μ*(t, x) = -sgn(x). Corresponding cost-to-go
    J*(t, x) = (1/2) ( max{ 0, |x| - (T - t) } )^2,
which can be verified to solve the HJB equation
LINEAR QUADRATIC PROBLEM
- Consider the linear system x'(t) = A x(t) + B u(t) and the quadratic cost
    x(T)' Q_T x(T) + integral_0^T ( x(t)' Q x(t) + u(t)' R u(t) ) dt
- The HJB equation is
    0 = min_{u ∈ R^m} [ x' Q x + u' R u + ∇_t V(t, x) + ∇_x V(t, x)' (A x + B u) ],
with the terminal condition V(T, x) = x' Q_T x. We try a solution of the form
    V(t, x) = x' K(t) x,   K(t): symmetric
- This leads to the Riccati differential equation
    K'(t) = -K(t) A - A' K(t) + K(t) B R^{-1} B' K(t) - Q
with the terminal condition K(T) = Q_T
6.231 DYNAMIC PROGRAMMING
LECTURE 6
LECTURE OUTLINE
Examples of stochastic DP problems
Linear-quadratic problems
Inventory control
LINEAR-QUADRATIC PROBLEMS
- System: x_{k+1} = A_k x_k + B_k u_k + w_k
- Quadratic cost
    E_{w_k, k=0,1,...,N-1} { x_N' Q_N x_N + sum_{k=0}^{N-1} ( x_k' Q_k x_k + u_k' R_k u_k ) }
where Q_k >= 0 and R_k > 0 (in the positive (semi)definite sense).
- w_k are independent and zero mean
- DP algorithm:
    J_N(x_N) = x_N' Q_N x_N,
    J_k(x_k) = min_{u_k} E { x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}( A_k x_k + B_k u_k + w_k ) }
- Key facts:
  - J_k(x_k) is quadratic
  - Optimal policy {mu_0*, ..., mu_{N-1}*} is linear:
        mu_k*(x_k) = L_k x_k
  - Similar treatment of a number of variants
DERIVATION
- By induction verify that
    mu_k*(x_k) = L_k x_k,   J_k(x_k) = x_k' K_k x_k + constant,
where L_k are matrices given by
    L_k = -( B_k' K_{k+1} B_k + R_k )^{-1} B_k' K_{k+1} A_k,
and where K_k are symmetric positive semidefinite matrices given by
    K_N = Q_N,
    K_k = A_k' ( K_{k+1} - K_{k+1} B_k ( B_k' K_{k+1} B_k + R_k )^{-1} B_k' K_{k+1} ) A_k + Q_k.
- This is called the discrete-time Riccati equation.
- Just like DP, it starts at the terminal time N and proceeds backwards.
- Certainty equivalence holds (optimal policy is the same as when w_k is replaced by its expected value E{w_k} = 0).
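A minimal sketch of the discrete-time Riccati recursion and the gains L_k above, for a time-invariant system. NumPy is assumed available; the 2x2 system and cost matrices are illustrative assumptions.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)            # Q_k = Q >= 0 (assumed)
R = np.array([[1.0]])    # R_k = R > 0 (assumed)
Q_N = np.eye(2)
N = 20

K = Q_N                  # K_N = Q_N
L = [None] * N
for k in range(N - 1, -1, -1):
    M = B.T @ K @ B + R
    L[k] = -np.linalg.solve(M, B.T @ K @ A)                        # L_k = -(B'K B + R)^{-1} B'K A
    K = A.T @ (K - K @ B @ np.linalg.solve(M, B.T @ K)) @ A + Q    # Riccati step

print("gain L_0 =", np.round(L[0], 4))
print("K_0 =\n", np.round(K, 4))
```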
ASYMPTOTIC BEHAVIOR OF RICCATI EQUATION
- Assume time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C) where Q = C'C
- The Riccati equation converges, lim_{k -> -∞} K_k = K, where K is positive definite, and is the unique (within the class of positive semidefinite matrices) solution of the algebraic Riccati equation
    K = A' ( K - K B ( B' K B + R )^{-1} B' K ) A + Q
- The corresponding steady-state controller mu*(x) = L x, where
    L = -( B' K B + R )^{-1} B' K A,
is stable in the sense that the matrix (A + B L) of the closed-loop system
    x_{k+1} = (A + B L) x_k + w_k
satisfies lim_{k -> ∞} (A + B L)^k = 0.
GRAPHICAL PROOF FOR SCALAR SYSTEMS
[Figure: graph of F(P) against P, showing the 45-degree line, the asymptote A^2 R / B^2 + Q, the value Q at P = 0, the iterates P_k and P_{k+1}, the fixed point P*, and the negative root at -R/B^2]
- Riccati equation (with P_k = K_{N-k}):
    P_{k+1} = A^2 ( P_k - B^2 P_k^2 / ( B^2 P_k + R ) ) + Q,
or P_{k+1} = F(P_k), where
    F(P) = A^2 R P / ( B^2 P + R ) + Q.
- Note the two steady-state solutions, satisfying P = F(P), of which only one is positive.
RANDOM SYSTEM MATRICES
- Suppose that {A_0, B_0}, ..., {A_{N-1}, B_{N-1}} are not known but rather are independent random matrices that are also independent of the w_k
- DP algorithm is
    J_N(x_N) = x_N' Q_N x_N,
    J_k(x_k) = min_{u_k} E_{w_k, A_k, B_k} { x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}( A_k x_k + B_k u_k + w_k ) }
- Optimal policy mu_k*(x_k) = L_k x_k, where
    L_k = -( R_k + E{ B_k' K_{k+1} B_k } )^{-1} E{ B_k' K_{k+1} A_k },
and where the matrices K_k are given by
    K_N = Q_N,
    K_k = E{ A_k' K_{k+1} A_k } - E{ A_k' K_{k+1} B_k } ( R_k + E{ B_k' K_{k+1} B_k } )^{-1} E{ B_k' K_{k+1} A_k } + Q_k
PROPERTIES
- Certainty equivalence may not hold
- Riccati equation may not converge to a steady-state
[Figure: graph of F~(P) against P, with the value Q at P = 0 and the negative root at -R / E{B^2}; the 45-degree line need not be crossed, so the iteration can diverge]
- We have P_{k+1} = F~(P_k), where
    F~(P) = E{A^2} R P / ( E{B^2} P + R ) + Q + T P^2 / ( E{B^2} P + R ),
    T = E{A^2} E{B^2} - ( E{A} )^2 ( E{B} )^2
INVENTORY CONTROL
- x_k: stock, u_k: inventory purchased, w_k: demand
    x_{k+1} = x_k + u_k - w_k,   k = 0, 1, ..., N-1
- Minimize
    E { sum_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }
where, for some p > 0 and h > 0,
    r(x) = p max(0, -x) + h max(0, x)
- DP algorithm:
    J_N(x_N) = 0,
    J_k(x_k) = min_{u_k >= 0} [ c u_k + H(x_k + u_k) + E { J_{k+1}(x_k + u_k - w_k) } ],
where H(x + u) = E{ r(x + u - w) }.
OPTIMAL POLICY
- DP algorithm can be written as
    J_N(x_N) = 0,
    J_k(x_k) = min_{u_k >= 0} [ G_k(x_k + u_k) ] - c x_k,
where
    G_k(y) = c y + H(y) + E { J_{k+1}(y - w) }.
- If G_k is convex and lim_{|x| -> ∞} G_k(x) -> ∞, we have
    mu_k*(x_k) = S_k - x_k   if x_k < S_k,
                 0           if x_k >= S_k,
where S_k minimizes G_k(y).
- This is shown, assuming that c < p, by showing that J_k is convex for all k, and
    lim_{|x| -> ∞} J_k(x) -> ∞
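A minimal numerical sketch of the base-stock structure above: for each k, S_k minimizes G_k(y) = c*y + H(y) + E{ J_{k+1}(y - w) } over a grid of stock levels, and the optimal order is u_k = max(0, S_k - x_k). The horizon, costs, demand distribution, and grids are illustrative assumptions.

```python
N, c, p, h = 3, 1.0, 4.0, 1.0               # assumed parameters (note c < p)
demand = {0: 0.2, 1: 0.5, 2: 0.3}           # P(w_k = w) (assumed)
grid = range(-10, 11)                       # stock levels y considered for ordering up to
states = range(-15, 11)                     # states on which J_k is tabulated

def r(x):
    return p * max(0, -x) + h * max(0, x)   # shortage / holding cost

def H(y):
    return sum(prob * r(y - w) for w, prob in demand.items())

J = {x: 0.0 for x in states}                # J_N = 0
S = [None] * N
for k in range(N - 1, -1, -1):
    G = {y: c * y + H(y) + sum(prob * J[y - w] for w, prob in demand.items())
         for y in grid}
    S[k] = min(G, key=G.get)                # S_k = argmin_y G_k(y)
    # J_k(x) = G_k(max(x, S_k)) - c*x : order up to S_k if x < S_k, else order nothing
    J = {x: (G[S[k]] if x < S[k] else G[x]) - c * x for x in states}

print("base-stock levels S_k:", S)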
JUSTIFICATION
- Graphical inductive proof that J_k is convex.
[Figure: graphs of H(y) and of c y + H(y), whose minimum over y is attained at S_{N-1}, and of the resulting convex cost-to-go J_{N-1}(x_{N-1}), which has slope -c to the left of S_{N-1}]
6.231 DYNAMIC PROGRAMMING
LECTURE 7
LECTURE OUTLINE
Stopping problems
Scheduling problems
Other applications
PURE STOPPING PROBLEMS
- Two possible controls:
  - Stop (incur a one-time stopping cost, and move to cost-free and absorbing stop state)
  - Continue [using x_{k+1} = f_k(x_k, w_k) and incurring the cost-per-stage]
- Each policy consists of a partition of the set of states x_k into two regions:
  - Stop region, where we stop
  - Continue region, where we continue
[Figure: state space partitioned into a STOP REGION and a CONTINUE REGION, with the absorbing stop state]
EXAMPLE: ASSET SELLING
- A person has an asset, and at k = 0, 1, ..., N-1 receives a random offer w_k
- May accept w_k and invest the money at fixed rate of interest r, or reject w_k and wait for w_{k+1}. Must accept the last offer w_{N-1}
- DP algorithm (x_k: current offer, T: stop state):
    J_N(x_N) = x_N   if x_N ≠ T,
               0     if x_N = T,
    J_k(x_k) = max( (1 + r)^{N-k} x_k,  E { J_{k+1}(w_k) } )   if x_k ≠ T,
               0                                               if x_k = T.
- Optimal policy:
  - accept the offer x_k if x_k > α_k,
  - reject the offer x_k if x_k < α_k,
where
    α_k = E { J_{k+1}(w_k) } / (1 + r)^{N-k}.
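A minimal sketch of this recursion: compute the acceptance thresholds α_k = E{ J_{k+1}(w) } / (1 + r)^{N-k} backwards for a discrete offer distribution. The horizon, interest rate, and offer distribution are illustrative assumptions.

```python
N, r = 5, 0.05
offers = {10.0: 0.3, 20.0: 0.4, 30.0: 0.3}             # P(w_k = w) (assumed)

EJ = {N: sum(p * w for w, p in offers.items())}        # E{ J_N(w) } = E{ w }
alpha = {}
for k in range(N - 1, -1, -1):
    alpha[k] = EJ[k + 1] / (1 + r) ** (N - k)
    # J_k(w) = max( (1+r)^{N-k} w, E{ J_{k+1}(w') } ); take its expectation over offers
    EJ[k] = sum(p * max((1 + r) ** (N - k) * w, EJ[k + 1]) for w, p in offers.items())

print("thresholds alpha_k:", {k: round(a, 2) for k, a in sorted(alpha.items())})
# one can check that alpha_k >= alpha_{k+1}, consistent with the analysis that follows
```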
FURTHER ANALYSIS
[Figure: the thresholds α_1, α_2, ..., α_{N-1} plotted against k; accept above the curve, reject below]
- Can show that α_k >= α_{k+1} for all k
- Proof: Let V_k(x_k) = J_k(x_k) / (1 + r)^{N-k} for x_k ≠ T. Then the DP algorithm is V_N(x_N) = x_N and
    V_k(x_k) = max( x_k, (1 + r)^{-1} E_w { V_{k+1}(w) } ).
We have α_k = E_w { V_{k+1}(w) } / (1 + r), so it is enough to show that V_k(x) >= V_{k+1}(x) for all x and k. Start with V_{N-1}(x) >= V_N(x) and use the monotonicity property of DP.
- We can also show that α_k --> ā as k --> -∞. Suggests that for an infinite horizon the optimal policy is stationary.
GENERAL STOPPING PROBLEMS
- At time k, we may stop at cost t(x_k) or choose a control u_k ∈ U(x_k) and continue
    J_N(x_N) = t(x_N),
    J_k(x_k) = min [ t(x_k),  min_{u_k ∈ U(x_k)} E { g(x_k, u_k, w_k) + J_{k+1}( f(x_k, u_k, w_k) ) } ]
- Optimal to stop at time k for states x in the set
    T_k = { x | t(x) <= min_{u ∈ U(x)} E { g(x, u, w) + J_{k+1}( f(x, u, w) ) } }
- Since J_{N-1}(x) <= J_N(x), we have J_k(x) <= J_{k+1}(x) for all k, so
    T_0 ⊂ ... ⊂ T_k ⊂ T_{k+1} ⊂ ... ⊂ T_{N-1}.
- Interesting case is when all the T_k are equal (to T_{N-1}, the set where it is better to stop than to go one step and stop). Can be shown to be true if
    f(x, u, w) ∈ T_{N-1},   for all x ∈ T_{N-1}, u ∈ U(x), w.
SCHEDULING PROBLEMS
- Set of tasks to perform, the ordering is subject to optimal choice.
- Costs depend on the order
- There may be stochastic uncertainty, and precedence and resource availability constraints
- Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.)
- Some special problems admit a simple quasi-analytical solution method
  - Optimal policy has an "index form", i.e., each task has an easily calculable "index", and it is optimal to select the task that has the maximum value of index (multi-armed bandit problems - to be discussed later)
  - Some problems can be solved by an "interchange argument" (start with some schedule, interchange two adjacent tasks, and see what happens)
EXAMPLE: THE QUIZ PROBLEM
- Given a list of N questions. If question i is answered correctly (given probability p_i), we receive reward R_i; if not the quiz terminates. Choose order of questions to maximize expected reward.
- Let i and j be the kth and (k+1)st questions in an optimally ordered list
    L = (i_0, ..., i_{k-1}, i, j, i_{k+2}, ..., i_{N-1})
    E{ reward of L } = E{ reward of {i_0, ..., i_{k-1}} }
        + p_{i_0} ... p_{i_{k-1}} ( p_i R_i + p_i p_j R_j )
        + p_{i_0} ... p_{i_{k-1}} p_i p_j E{ reward of {i_{k+2}, ..., i_{N-1}} }
- Consider the list with i and j interchanged
    L' = (i_0, ..., i_{k-1}, j, i, i_{k+2}, ..., i_{N-1})
- Since L is optimal, E{ reward of L } >= E{ reward of L' }, so it follows that
    p_i R_i + p_i p_j R_j >= p_j R_j + p_j p_i R_i
or
    p_i R_i / (1 - p_i) >= p_j R_j / (1 - p_j).
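A minimal sketch of the index rule implied by the interchange argument above: order questions by decreasing p_i R_i / (1 - p_i), and check the result against brute-force enumeration. The (p_i, R_i) data are illustrative assumptions.

```python
from itertools import permutations

p = [0.9, 0.5, 0.7]          # success probabilities p_i (assumed)
R = [1.0, 6.0, 3.0]          # rewards R_i (assumed)

def expected_reward(order):
    total, survive = 0.0, 1.0
    for i in order:
        total += survive * p[i] * R[i]   # R_i collected if all previous answers and i succeed
        survive *= p[i]
    return total

index_order = sorted(range(len(p)), key=lambda i: p[i] * R[i] / (1 - p[i]), reverse=True)
best_order = max(permutations(range(len(p))), key=expected_reward)

print("index-rule order:", index_order, " reward:", round(expected_reward(index_order), 4))
print("brute-force best:", list(best_order), " reward:", round(expected_reward(best_order), 4))
```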
MINIMAX CONTROL
- Consider the basic problem with the difference that the disturbance w_k, instead of being random, is just known to belong to a given set W_k(x_k, u_k).
- Find a policy pi that minimizes the cost
    J_pi(x_0) = max_{w_k ∈ W_k(x_k, mu_k(x_k)), k=0,1,...,N-1} [ g_N(x_N) + sum_{k=0}^{N-1} g_k( x_k, mu_k(x_k), w_k ) ]
- The DP algorithm takes the form
    J_N(x_N) = g_N(x_N),
    J_k(x_k) = min_{u_k ∈ U(x_k)} max_{w_k ∈ W_k(x_k, u_k)} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]
(Exercise 1.5 in the text, solution posted on the www).
UNKNOWN-BUT-BOUNDED CONTROL
- For each k, keep the state x_k of the controlled system
    x_{k+1} = f_k( x_k, mu_k(x_k), w_k )
inside a given set X_k, the target set at time k.
- This is a minimax control problem, where the cost at stage k is
    g_k(x_k) = 0 if x_k ∈ X_k,
               1 if x_k ∉ X_k.
- We must reach at time k the set
    X̄_k = { x_k | J_k(x_k) = 0 }
in order to be able to maintain the state within the subsequent target sets.
- Start with X̄_N = X_N, and for k = 0, 1, ..., N-1,
    X̄_k = { x_k ∈ X_k | there exists u_k ∈ U_k(x_k) such that f_k(x_k, u_k, w_k) ∈ X̄_{k+1}, for all w_k ∈ W_k(x_k, u_k) }
6.231 DYNAMIC PROGRAMMING
LECTURE 8
LECTURE OUTLINE
Problems with imperfect state info
Reduction to the perfect state info case
Linear quadratic problems
Separation of estimation and control
BASIC PROBLEM WITH IMPERFECT STATE INFO
- Same as basic problem of Chapter 1 with one difference: the controller, instead of knowing x_k, receives at each time k an observation of the form
    z_0 = h_0(x_0, v_0),   z_k = h_k(x_k, u_{k-1}, v_k),   k >= 1
- The observation z_k belongs to some space Z_k.
- The random observation disturbance v_k is characterized by a probability distribution
    P_{v_k}( · | x_k, ..., x_0, u_{k-1}, ..., u_0, w_{k-1}, ..., w_0, v_{k-1}, ..., v_0 )
- The initial state x_0 is also random and characterized by a probability distribution P_{x_0}.
- The probability distribution P_{w_k}( · | x_k, u_k ) of w_k is given, and it may depend explicitly on x_k and u_k but not on w_0, ..., w_{k-1}, v_0, ..., v_{k-1}.
- The control u_k is constrained to a given subset U_k (this subset does not depend on x_k, which is not assumed known).
INFORMATION VECTOR AND POLICIES
- Denote by I_k the information vector, i.e., the information available at time k:
    I_k = (z_0, z_1, ..., z_k, u_0, u_1, ..., u_{k-1}),   k >= 1,
    I_0 = z_0
- We consider policies pi = {mu_0, mu_1, ..., mu_{N-1}}, where each function mu_k maps the information vector I_k into a control u_k and
    mu_k(I_k) ∈ U_k,   for all I_k, k >= 0
- We want to find a policy pi that minimizes
    J_pi = E_{x_0, w_k, v_k, k=0,...,N-1} { g_N(x_N) + sum_{k=0}^{N-1} g_k( x_k, mu_k(I_k), w_k ) }
subject to the equations
    x_{k+1} = f_k( x_k, mu_k(I_k), w_k ),   k >= 0,
    z_0 = h_0(x_0, v_0),   z_k = h_k( x_k, mu_{k-1}(I_{k-1}), v_k ),   k >= 1
REFORMULATION AS PERFECT INFO PROBLEM
- We have
    I_{k+1} = (I_k, z_{k+1}, u_k),   k = 0, 1, ..., N-2,   I_0 = z_0
View this as a dynamic system with state I_k, control u_k, and random disturbance z_{k+1}
- We have
    P(z_{k+1} | I_k, u_k) = P(z_{k+1} | I_k, u_k, z_0, z_1, ..., z_k),
since z_0, z_1, ..., z_k are part of the information vector I_k. Thus the probability distribution of z_{k+1} depends explicitly only on the state I_k and control u_k and not on the prior "disturbances" z_k, ..., z_0
- Write
    E { g_k(x_k, u_k, w_k) } = E { E_{x_k, w_k} { g_k(x_k, u_k, w_k) | I_k, u_k } }
so the cost per stage of the new system is
    g~_k(I_k, u_k) = E_{x_k, w_k} { g_k(x_k, u_k, w_k) | I_k, u_k }
DP ALGORITHM
- Writing the DP algorithm for the (reformulated) perfect state info problem and "doing the algebra":
    J_k(I_k) = min_{u_k ∈ U_k} E_{x_k, w_k, z_{k+1}} { g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) | I_k, u_k }
for k = 0, 1, ..., N-2, and for k = N-1,
    J_{N-1}(I_{N-1}) = min_{u_{N-1} ∈ U_{N-1}} E_{x_{N-1}, w_{N-1}} { g_N( f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) )
        + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) | I_{N-1}, u_{N-1} }
- The optimal cost J* is given by
    J* = E_{z_0} { J_0(z_0) }
LINEAR-QUADRATIC PROBLEMS
- System: x_{k+1} = A_k x_k + B_k u_k + w_k
- Quadratic cost
    E_{w_k, k=0,1,...,N-1} { x_N' Q_N x_N + sum_{k=0}^{N-1} ( x_k' Q_k x_k + u_k' R_k u_k ) }
where Q_k >= 0 and R_k > 0
- Observations
    z_k = C_k x_k + v_k,   k = 0, 1, ..., N-1
- w_0, ..., w_{N-1}, v_0, ..., v_{N-1} indep. zero mean
- Key fact to show:
  - Optimal policy {mu_0*, ..., mu_{N-1}*} is of the form:
        mu_k*(I_k) = L_k E{ x_k | I_k }
    L_k: same as for the perfect state info case
  - Estimation problem and control problem can be solved separately
DP ALGORITHM I
- Last stage N-1 (suppressing index N-1):
    J_{N-1}(I_{N-1}) = min_{u_{N-1}} E_{x_{N-1}, w_{N-1}} { x_{N-1}' Q x_{N-1} + u_{N-1}' R u_{N-1}
        + (A x_{N-1} + B u_{N-1} + w_{N-1})' Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_{N-1}, u_{N-1} }
- Since E{ w_{N-1} | I_{N-1} } = E{ w_{N-1} } = 0, the minimization involves
    min_{u_{N-1}} [ u_{N-1}' (B' Q B + R) u_{N-1} + 2 E{ x_{N-1} | I_{N-1} }' A' Q B u_{N-1} ]
- The minimization yields the optimal mu*_{N-1}:
    u*_{N-1} = mu*_{N-1}(I_{N-1}) = L_{N-1} E{ x_{N-1} | I_{N-1} }
where
    L_{N-1} = -(B' Q B + R)^{-1} B' Q A
DP ALGORITHM II
- Substituting in the DP algorithm
    J_{N-1}(I_{N-1}) = E_{x_{N-1}} { x_{N-1}' K_{N-1} x_{N-1} | I_{N-1} }
        + E_{x_{N-1}} { ( x_{N-1} - E{ x_{N-1} | I_{N-1} } )' P_{N-1} ( x_{N-1} - E{ x_{N-1} | I_{N-1} } ) | I_{N-1} }
        + E_{w_{N-1}} { w_{N-1}' Q_N w_{N-1} },
where the matrices K_{N-1} and P_{N-1} are given by
    P_{N-1} = A_{N-1}' Q_N B_{N-1} ( R_{N-1} + B_{N-1}' Q_N B_{N-1} )^{-1} B_{N-1}' Q_N A_{N-1},
    K_{N-1} = A_{N-1}' Q_N A_{N-1} - P_{N-1} + Q_{N-1}
- Note the structure of J_{N-1}: in addition to the quadratic and constant terms, it involves a quadratic in the estimation error
    x_{N-1} - E{ x_{N-1} | I_{N-1} }
DP ALGORITHM III
- DP equation for period N-2:
    J_{N-2}(I_{N-2}) = min_{u_{N-2}} E_{x_{N-2}, w_{N-2}, z_{N-1}} { x_{N-2}' Q x_{N-2} + u_{N-2}' R u_{N-2} + J_{N-1}(I_{N-1}) | I_{N-2}, u_{N-2} }
        = E { x_{N-2}' Q x_{N-2} | I_{N-2} }
          + min_{u_{N-2}} [ u_{N-2}' R u_{N-2} + E { x_{N-1}' K_{N-1} x_{N-1} | I_{N-2}, u_{N-2} } ]
          + E { ( x_{N-1} - E{ x_{N-1} | I_{N-1} } )' P_{N-1} ( x_{N-1} - E{ x_{N-1} | I_{N-1} } ) | I_{N-2}, u_{N-2} }
          + E_{w_{N-1}} { w_{N-1}' Q_N w_{N-1} }
- Key point: We have excluded the next to last term from the minimization with respect to u_{N-2}
- This term turns out to be independent of u_{N-2}
QUALITY OF ESTIMATION LEMMA
- For every k, there is a function M_k such that we have
    x_k - E{ x_k | I_k } = M_k(x_0, w_0, ..., w_{k-1}, v_0, ..., v_k),
independently of the policy being used
- The following simplified version of the lemma conveys the main idea
- Simplified Lemma: Let r, u, z be random variables such that r and u are independent, and let x = r + u. Then
    x - E{ x | z, u } = r - E{ r | z }
- Proof: We have
    x - E{ x | z, u } = r + u - E{ r + u | z, u }
                      = r + u - E{ r | z, u } - u
                      = r - E{ r | z, u }
                      = r - E{ r | z }
APPLYING THE QUALITY OF EST. LEMMA
- Using the lemma,
    x_{N-1} - E{ x_{N-1} | I_{N-1} } = ξ_{N-1},
where
    ξ_{N-1}: function of x_0, w_0, ..., w_{N-2}, v_0, ..., v_{N-1}
- Since ξ_{N-1} is independent of u_{N-2}, the conditional expectation of ξ_{N-1}' P_{N-1} ξ_{N-1} satisfies
    E { ξ_{N-1}' P_{N-1} ξ_{N-1} | I_{N-2}, u_{N-2} } = E { ξ_{N-1}' P_{N-1} ξ_{N-1} | I_{N-2} }
and is independent of u_{N-2}.
- So minimization in the DP algorithm yields
    u*_{N-2} = mu*_{N-2}(I_{N-2}) = L_{N-2} E{ x_{N-2} | I_{N-2} }
FINAL RESULT
- Continuing similarly (using also the quality of estimation lemma)
    mu_k*(I_k) = L_k E{ x_k | I_k },
where L_k is the same as for perfect state info:
    L_k = -( R_k + B_k' K_{k+1} B_k )^{-1} B_k' K_{k+1} A_k,
with K_k generated from K_N = Q_N, using
    K_k = A_k' K_{k+1} A_k - P_k + Q_k,
    P_k = A_k' K_{k+1} B_k ( R_k + B_k' K_{k+1} B_k )^{-1} B_k' K_{k+1} A_k
[Figure: block diagram of the closed-loop system: the plant x_{k+1} = A_k x_k + B_k u_k + w_k with measurement z_k = C_k x_k + v_k feeds an estimator producing E{ x_k | I_k }, which is multiplied by the gain L_k to give the control u_k; a delay feeds u_{k-1} back to the estimator]
SEPARATION INTERPRETATION
- The optimal controller can be decomposed into
  (a) An estimator, which uses the data to generate the conditional expectation E{ x_k | I_k }.
  (b) An actuator, which multiplies E{ x_k | I_k } by the gain matrix L_k and applies the control input u_k = L_k E{ x_k | I_k }.
- Generically the estimate x̂ of a random vector x given some information (random vector) I, which minimizes the mean squared error
    E_x { |x - x̂|^2 | I } = E { |x|^2 | I } - 2 E{ x | I }' x̂ + |x̂|^2
is E{ x | I } (set to zero the derivative with respect to x̂ of the above quadratic form).
- The estimator portion of the optimal controller is optimal for the problem of estimating the state x_k assuming the control is not subject to choice.
- The actuator portion is optimal for the control problem assuming perfect state information.
STEADY STATE/IMPLEMENTATION ASPECTS
- As N --> ∞, the solution of the Riccati equation converges to a steady state and L_k --> L.
- If x_0, w_k, and v_k are Gaussian, E{ x_k | I_k } is a linear function of I_k and is generated by a nice recursive algorithm, the Kalman filter.
- The Kalman filter involves also a Riccati equation, so for N --> ∞, and a stationary system, it also has a steady-state structure.
- Thus, for Gaussian uncertainty, the solution is nice and possesses a steady state.
- For non-Gaussian uncertainty, computing E{ x_k | I_k } may be very difficult; the conditional distribution P_{x_k | I_k} is then propagated recursively from P_{x_{k-1} | I_{k-1}}, u_{k-1}, and z_k
[Figure: block diagram of the system x_{k+1} = f_k(x_k, u_k, w_k) with measurement z_k = h_k(x_k, u_{k-1}, v_k), a delay, an estimator producing P_{x_k | I_k}, and an actuator mu_k applying u_k]
EXAMPLE: A SEARCH PROBLEM
- At each period, decide to search or not search a site that may contain a treasure.
- If we search and a treasure is present, we find it with prob. β and remove it from the site.
- Treasure's worth: V. Cost of search: C
- States: treasure present & treasure not present
- Each search can be viewed as an observation of the state
- Denote
    p_k: prob. of treasure present at the start of time k
with p_0 given.
- p_k evolves at time k according to the equation
    p_{k+1} = p_k                                               if not search,
              0                                                 if search and find treasure,
              p_k (1 - β) / ( p_k (1 - β) + 1 - p_k )           if search and no treasure.
SEARCH PROBLEM (CONTINUED)
- DP algorithm
    J_k(p_k) = max [ 0,  -C + p_k β V + (1 - p_k β) J_{k+1}( p_k (1 - β) / ( p_k (1 - β) + 1 - p_k ) ) ],
with J_N(p_N) = 0.
- Can be shown by induction that the functions J_k satisfy
    J_k(p_k) = 0,   for all p_k <= C / (β V)
- Furthermore, it is optimal to search at period k if and only if
    p_k β V >= C
(expected reward from the next search >= the cost of the search)
FINITE-STATE SYSTEMS
- Suppose the system is a finite-state Markov chain, with states 1, ..., n.
- Then the conditional probability distribution P_{x_k | I_k} is a vector
    ( P(x_k = 1 | I_k), ..., P(x_k = n | I_k) )
- The DP algorithm can be executed over the n-dimensional simplex (state space is not expanding with increasing k)
- When the control and observation spaces are also finite sets, it turns out that the cost-to-go functions J_k in the DP algorithm are piecewise linear and concave (Exercise 5.7).
- This is conceptually important and also (moderately) useful in practice.
INSTRUCTION EXAMPLE
- Teaching a student some item. Possible states are L: Item learned, or L̄: Item not learned.
- Possible decisions: T: Terminate the instruction, or T̄: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.
- The test has two possible outcomes: R: Student gives a correct answer, or R̄: Student gives an incorrect answer.
[Figure: probabilistic structure. From L̄ the item is learned (transition to L) with probability t and remains not learned with probability 1 - t; from L the state stays L with probability 1. In L the test outcome is R with probability 1; in L̄ the outcomes R and R̄ occur with probabilities r and 1 - r]
- Cost of instruction is I per period
- Cost of terminating instruction: 0 if student has learned the item, and C > 0 if not.
INSTRUCTION EXAMPLE II
- Let p_k: prob. student has learned the item given the test results so far
    p_k = P(x_k = L | I_k) = P(x_k = L | z_0, z_1, ..., z_k).
- Using Bayes' rule we can obtain
    p_{k+1} = Φ(p_k, z_{k+1})
            = ( 1 - (1 - t)(1 - p_k) ) / ( 1 - (1 - t)(1 - r)(1 - p_k) )   if z_{k+1} = R,
              0                                                            if z_{k+1} = R̄.
- DP algorithm:
    J_k(p_k) = min [ (1 - p_k) C,  I + E_{z_{k+1}} { J_{k+1}( Φ(p_k, z_{k+1}) ) } ],
starting with
    J_{N-1}(p_{N-1}) = min [ (1 - p_{N-1}) C,  I + (1 - t)(1 - p_{N-1}) C ].
INSTRUCTION EXAMPLE III
- Write the DP algorithm as
    J_k(p_k) = min [ (1 - p_k) C,  I + A_k(p_k) ],
where
    A_k(p_k) = P(z_{k+1} = R | I_k) J_{k+1}( Φ(p_k, R) ) + P(z_{k+1} = R̄ | I_k) J_{k+1}( Φ(p_k, R̄) )
- Can show by induction that A_k(p) are piecewise linear, concave, monotonically decreasing, with
    A_{k-1}(p) <= A_k(p) <= A_{k+1}(p),   for all p ∈ [0, 1].
[Figure: the functions (1 - p) C and I + A_{N-1}(p), I + A_{N-2}(p), I + A_{N-3}(p) plotted against p ∈ [0, 1]; their intersections define thresholds α_{N-1}, α_{N-2}, α_{N-3}, and 1 - I/C marks the threshold at the last stage]
6.231 DYNAMIC PROGRAMMING
LECTURE 10
LECTURE OUTLINE
Suboptimal control
Certainty equivalent control
Limited lookahead policies
Performance bounds
Problem approximation approach
Heuristic cost-to-go approximation
PRACTICAL DIFFICULTIES OF DP
- The curse of modeling
- The curse of dimensionality
  - Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
  - Quick explosion of the number of states in combinatorial problems
  - Intractability of imperfect state information problems
- There may be real-time solution constraints
  - A family of problems may be addressed. The data of the problem to be solved is given with little advance notice
  - The problem data may change as the system is controlled - need for on-line replanning
CERTAINTY EQUIVALENT CONTROL (CEC)
- Replace the stochastic problem with a deterministic problem
- At each time k, the uncertain quantities are fixed at some "typical" values
- Implementation for an imperfect info problem. At each time k:
  (1) Compute a state estimate x̄_k(I_k) given the current information vector I_k.
  (2) Fix the w_i, i >= k, at some w̄_i(x_i, u_i). Solve the deterministic problem:
        minimize g_N(x_N) + sum_{i=k}^{N-1} g_i( x_i, u_i, w̄_i(x_i, u_i) )
      subject to x_k = x̄_k(I_k) and for i >= k,
        u_i ∈ U_i,   x_{i+1} = f_i( x_i, u_i, w̄_i(x_i, u_i) ).
  (3) Use as control the first element in the optimal control sequence found.
ALTERNATIVE IMPLEMENTATION
- Let { mu_0^d(x_0), ..., mu_{N-1}^d(x_{N-1}) } be an optimal policy for the deterministic problem with the disturbances fixed at their typical values. The CEC applies at time k the control input
    mu~_k(I_k) = mu_k^d( x̄_k(I_k) )
[Figure: block diagram of the system x_{k+1} = f_k(x_k, u_k, w_k) with measurement z_k = h_k(x_k, u_{k-1}, v_k), a delay, an estimator producing x̄_k(I_k), and an actuator applying u_k = mu_k^d( x̄_k(I_k) )]
CEC WITH HEURISTICS
- Solve the "deterministic equivalent" problem using a heuristic/suboptimal policy
- Improved version of this idea: At time k minimize the stage k cost plus the heuristic cost of the remaining stages, i.e., apply at time k a control u̅_k that minimizes over u_k ∈ U_k(x_k)
    g_k( x_k, u_k, w̄_k(x_k, u_k) ) + H_{k+1}( f_k( x_k, u_k, w̄_k(x_k, u_k) ) )
where H_{k+1} is the cost-to-go function corresponding to the heuristic.
- This is an example of an important suboptimal control idea:
Minimize at each stage k the sum of approximations to the current stage cost and the optimal cost-to-go.
- This is a central idea in several other suboptimal control schemes, such as limited lookahead, and rollout algorithms.
- H_{k+1}(x_{k+1}) may be computed off-line or on-line.
PARTIALLY STOCHASTIC CEC
- Instead of fixing all future disturbances to their typical values, fix only some, and treat the rest as stochastic.
- Important special case: Treat an imperfect state information problem as one of perfect state information, using an estimate x̄_k(I_k) of x_k as if it were exact.
- Multiaccess Communication Example: Consider controlling the slotted Aloha system (discussed in Ch. 5) by optimally choosing the probability of transmission of waiting packets. This is a hard problem of imperfect state info, whose perfect state info version is easy.
- Natural partially stochastic CEC:
    mu~_k(I_k) = min [ 1, 1 / x̄_k(I_k) ],
where x̄_k(I_k) is an estimate of the current packet backlog based on the entire past channel history of successes, idles, and collisions (which is I_k).
LIMITED LOOKAHEAD POLICIES
- One-step lookahead (1SL) policy: At each k and state x_k, use the control mu̅_k(x_k) that attains the minimum in
    min_{u_k ∈ U_k(x_k)} E { g_k(x_k, u_k, w_k) + J~_{k+1}( f_k(x_k, u_k, w_k) ) },
where
  - J~_N = g_N.
  - J~_{k+1}: approximation to true cost-to-go J_{k+1}
- Two-step lookahead policy: At each k and x_k, use the control mu~_k(x_k) attaining the minimum above, where the function J~_{k+1} is obtained using a 1SL approximation (solve a 2-step DP problem).
- If J~_{k+1} is readily available and the minimization above is not too hard, the 1SL policy is implementable on-line.
- Sometimes one also replaces U_k(x_k) above with a subset of most "promising" controls U̅_k(x_k).
- As the length of lookahead increases, the required computation quickly explodes.
PERFORMANCE BOUNDS FOR 1SL
- Let J̄_k(x_k) be the cost-to-go from (x_k, k) of the 1SL policy, based on functions J~_k.
- Assume that for all (x_k, k), we have
    Ĵ_k(x_k) <= J~_k(x_k),     (*)
where Ĵ_N = g_N and for all k,
    Ĵ_k(x_k) = min_{u_k ∈ U_k(x_k)} E { g_k(x_k, u_k, w_k) + J~_{k+1}( f_k(x_k, u_k, w_k) ) },
[so Ĵ_k(x_k) is computed along with mu̅_k(x_k)]. Then
    J̄_k(x_k) <= Ĵ_k(x_k),   for all (x_k, k).
- Important application: When J~_k is the cost-to-go of some heuristic policy (then the 1SL policy is called the rollout policy).
- The bound can be extended to the case where there is a δ_k in the RHS of (*). Then
    J̄_k(x_k) <= J~_k(x_k) + δ_k + ... + δ_{N-1}
COMPUTATIONAL ASPECTS
- Sometimes nonlinear programming can be used to calculate the 1SL or the multistep version [particularly when U_k(x_k) is not a discrete set]. Connection with stochastic programming methods.
- The choice of the approximating functions J~_k is critical, and is calculated in a variety of ways.
- Some approaches:
  (a) Problem Approximation: Approximate the optimal cost-to-go with some cost derived from a related but simpler problem
  (b) Heuristic Cost-to-Go Approximation: Approximate the optimal cost-to-go with a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (Neuro-Dynamic Programming)
  (c) Rollout Approach: Approximate the optimal cost-to-go with the cost of some suboptimal policy, which is calculated either analytically or by simulation
PROBLEM APPROXIMATION
- Many (problem-dependent) possibilities
  - Replace uncertain quantities by nominal values, or simplify the calculation of expected values by limited simulation
  - Simplify difficult constraints or dynamics
- Example of enforced decomposition: Route m vehicles that move over a graph. Each node has a "value". The first vehicle that passes through the node collects its value. Max the total collected value, subject to initial and final time constraints (plus time windows and other constraints).
- Usually the 1-vehicle version of the problem is much simpler. This motivates an approximation obtained by solving single vehicle problems.
- 1SL scheme: At time k and state x_k (position of vehicles and "collected value nodes"), consider all possible kth moves by the vehicles, and at the resulting states we approximate the optimal value-to-go with the value collected by optimizing the vehicle routes one-at-a-time
HEURISTIC COST-TO-GO APPROXIMATION
- Use a cost-to-go approximation from a parametric class J~(x, r), where r is a vector of tunable weights (the "approximation architecture").
- Method for tuning the weights ("training" the architecture).
- Successful application strongly depends on how these issues are handled, and on insight about the problem.
- Sometimes a simulator is used, particularly when there is no mathematical model of the system.
APPROXIMATION ARCHITECTURES
- Divided in linear and nonlinear [i.e., linear or nonlinear dependence of J~(x, r) on r].
- Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.
- Architectures based on feature extraction:
[Figure: the state x is passed through a feature extraction mapping to produce a feature vector y, which is fed to a cost approximator with parameter vector r, producing the cost approximation J~(y, r)]
- Ideally, the features will encode much of the nonlinearity that is inherent in the cost-to-go approximated, and the approximation may be quite accurate without a complicated architecture.
- Sometimes the state space is partitioned, and "local" features are introduced for each subset of the partition (they are 0 outside the subset).
- With a well-chosen feature vector y(x), we can use a linear architecture
    J~(x, r) = J^( y(x), r ) = sum_i r_i y_i(x)
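A minimal sketch of the linear feature-based architecture above, J~(x, r) = sum_i r_i y_i(x), with the weights fit by least squares on sampled (state, target cost) pairs. The feature map and the synthetic training data are illustrative assumptions; NumPy is assumed available.

```python
import numpy as np

def features(x):
    # hypothetical feature extraction mapping y(x) for a scalar state x
    return np.array([1.0, x, abs(x)])

rng = np.random.default_rng(0)
xs = rng.uniform(-5, 5, size=200)                  # sampled states (assumed)
targets = 2.0 + 0.5 * xs + 3.0 * np.abs(xs) + rng.normal(0, 0.1, size=xs.size)

Y = np.stack([features(x) for x in xs])            # feature matrix, one row per sample
r, *_ = np.linalg.lstsq(Y, targets, rcond=None)    # tune the weights r by least squares

def J_tilde(x):
    return features(x) @ r                         # J~(x, r) = r' y(x)

print("fitted weights r:", np.round(r, 3))
print("J~(2.0) =", round(float(J_tilde(2.0)), 3))
```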
COMPUTER CHESS
- Programs use a feature-based position evaluator that assigns a score to each move/position
[Figure: position evaluator; feature extraction (material balance, mobility, safety, etc.) followed by a weighting of features produces the score]
- Most often the weighting of features is linear but multistep lookahead is involved.
- Most often the training is done by trial and error.
- Additional features:
  - Depth first search
  - Variable depth search when dynamic positions are involved
  - Alpha-beta pruning
6.231 DYNAMIC PROGRAMMING
LECTURE 11
LECTURE OUTLINE
Rollout algorithms
Cost improvement property
Discrete deterministic problems
Sequential consistency and greedy algorithms
Sequential improvement
ROLLOUT ALGORITHMS
- One-step lookahead policy: At each k and state x_k, use the control mu̅_k(x_k) that attains the minimum in
    min_{u_k ∈ U_k(x_k)} E { g_k(x_k, u_k, w_k) + J~_{k+1}( f_k(x_k, u_k, w_k) ) },
where
  - J~_N = g_N.
  - J~_{k+1}: approximation to true cost-to-go J_{k+1}
- Rollout algorithm: When J~_k is the cost-to-go of some heuristic policy (called the base policy)
- Cost improvement property (to be shown): The rollout algorithm achieves no worse (and usually much better) cost than the base heuristic starting from the same state.
- Main difficulty: Calculating J~_k(x_k) may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.
  - May involve Monte Carlo simulation if the problem is stochastic.
  - Things improve in the deterministic case.
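A minimal sketch of the rollout idea above for a deterministic problem: at each state, evaluate each control by one step of the true cost plus the cost of the base heuristic from the resulting state, and apply the best one. The toy walk-on-a-line problem and the "always move right" base heuristic are illustrative assumptions.

```python
N = 6
g = lambda x, u: abs(u)                 # stage cost of moving by u in {-1, 0, +1}
gN = lambda x: (x - 4) ** 2             # terminal cost: end up near x = 4 (assumed)
f = lambda x, u: x + u                  # deterministic dynamics

def base_heuristic_cost(x, k):
    # base policy: always move +1; roll it forward and accumulate its cost
    cost = 0.0
    for i in range(k, N):
        cost += g(x, +1)
        x = f(x, +1)
    return cost + gN(x)

def rollout_control(x, k):
    # one-step lookahead with the base heuristic as cost-to-go approximation
    return min((-1, 0, +1),
               key=lambda u: g(x, u) + base_heuristic_cost(f(x, u), k + 1))

x = 0
for k in range(N):
    u = rollout_control(x, k)
    print(f"k={k}  x={x}  rollout control u={u}")
    x = f(x, u)
print("final state:", x, " terminal cost:", gN(x))
```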
EXAMPLE: THE QUIZ PROBLEM
- A person is given N questions; answering correctly question i has probability p_i, reward v_i. Quiz terminates at the first incorrect answer.
- Problem: Choose the ordering of questions so as to maximize the total expected reward.
- Assuming no other constraints, it is optimal to use the index policy: Answer questions in decreasing order of p_i v_i / (1 - p_i).
- With minor changes in the problem, the index policy need not be optimal. Examples:
  - A limit (< N) on the maximum number of questions that can be answered.
  - Time windows, sequence-dependent rewards, precedence constraints.
- Rollout with the index policy as base policy: Convenient because at a given state (subset of questions already answered), the index policy and its expected reward can be easily calculated.
- Very effective for solving the quiz problem and important generalizations in scheduling (see Bertsekas and Castanon, J. of Heuristics, Vol. 5, 1999).
COST IMPROVEMENT PROPERTY
- Let
  - J̄_k(x_k): Cost-to-go of the rollout policy
  - H_k(x_k): Cost-to-go of the base policy
- We claim that J̄_k(x_k) <= H_k(x_k) for all x_k, k
- Proof by induction: We have J̄_N(x_N) = H_N(x_N) for all x_N. Assume that
    J̄_{k+1}(x_{k+1}) <= H_{k+1}(x_{k+1}),   for all x_{k+1}.
Then, for all x_k,
    J̄_k(x_k) = E { g_k( x_k, mu̅_k(x_k), w_k ) + J̄_{k+1}( f_k( x_k, mu̅_k(x_k), w_k ) ) }
             <= E { g_k( x_k, mu̅_k(x_k), w_k ) + H_{k+1}( f_k( x_k, mu̅_k(x_k), w_k ) ) }
             <= E { g_k( x_k, mu_k(x_k), w_k ) + H_{k+1}( f_k( x_k, mu_k(x_k), w_k ) ) }
             = H_k(x_k)
  - Induction hypothesis ==> 1st inequality
  - Min selection of mu̅_k(x_k) ==> 2nd inequality
  - Definition of H_k, mu_k ==> last equality
EXAMPLE: THE BREAKTHROUGH PROBLEM
[Figure: a binary tree rooted at "root" with some arcs crossed out (blocked) and a free path shown with thick lines]
- Given a binary tree with N stages.
- Each arc is either free or is blocked (crossed out in the figure).
- Problem: Find a free path from the root to the leaves (such as the one shown with thick lines).
- Base heuristic (greedy): Follow the right branch if free; else follow the left branch if free.
- For large N and given prob. of free branch: the rollout algorithm requires O(N) times more computation, but has O(N) times larger prob. of finding a free path than the greedy algorithm.
DISCRETE DETERMINISTIC PROBLEMS
Any discrete optimization problem (with a finite number of choices/feasible solutions) can be represented as a sequential decision process by using a tree.
The leaves of the tree correspond to the feasible solutions.
The problem can be solved by DP, starting from the leaves and going back towards the root.
Example: Traveling salesman problem. Find a minimum cost tour that goes exactly once through each of N cities.
[Figure: decision tree for a traveling salesman problem with four cities A, B, C, D; origin node s = A; partial tours AB, AC, AD, then ABC, ABD, ACB, ACD, ADB, ADC, and complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB at the leaves.]
A CLASS OF GENERAL DISCRETE PROBLEMS
Generic problem:
    Given a graph with directed arcs
    A special node s called the origin
    A set of terminal nodes, called destinations, and a cost g(i) for each destination i.
    Find a min cost path starting at the origin, ending at one of the destination nodes.
Base heuristic: For any nondestination node i, constructs a path (i, i_1, . . . , i_m, \bar i) starting at i and ending at one of the destination nodes \bar i. We call \bar i the projection of i, and we denote H(i) = g(\bar i).
Rollout algorithm: Start at the origin; choose the successor node with least cost projection.
[Figure: path s, i_1, . . . , i_{m-1}, i_m; the neighbors j_1, j_2, j_3, j_4 of i_m, and their projections p(j_1), p(j_2), p(j_3), p(j_4).]
EXAMPLE: ONE-DIMENSIONAL WALK
A person takes either a unit step to the left or a unit step to the right. Minimize the cost g(i) of the point i where he will end up after N steps.
[Figure: trajectory tree from (0, 0) to the points (N, -N), . . . , (N, 0), . . . , (N, N), and the cost function g(i) over -N <= i <= N.]
Base heuristic: Always go to the right. Rollout finds the rightmost local minimum.
Base heuristic: Compare "always go to the right" and "always go to the left". Choose the best of the two. Rollout finds a global minimum.
SEQUENTIAL CONSISTENCY
The base heuristic is sequentially consistent if all nodes of its path have the same projection, i.e., for every node i, whenever it generates the path (i, i_1, . . . , i_m, \bar i) starting at i, it also generates the path (i_1, . . . , i_m, \bar i) starting at i_1.
Prime example of a sequentially consistent heuristic is a greedy algorithm. It uses an estimate F(i) of the optimal cost starting from i.
At the typical step, given a path (i, i_1, . . . , i_m), where i_m is not a destination, the algorithm adds to the path a node i_{m+1} such that
    i_{m+1} = arg min_{j \in N(i_m)} F(j)
Prop.: If the base heuristic is sequentially consistent, the cost of the rollout algorithm is no more than the cost of the base heuristic. In particular, if (s, i_1, . . . , i_m) is the rollout path, we have
    H(s) >= H(i_1) >= . . . >= H(i_{m-1}) >= H(i_m)
where H(i) = cost of the heuristic starting at i.
Proof: Rollout deviates from the greedy path only when it discovers an improved path.
SEQUENTIAL IMPROVEMENT
We say that the base heuristic is sequentially improving if for every non-destination node i, we have
    H(i) >= min_{j: j is a neighbor of i} H(j)
If the base heuristic is sequentially improving, the cost of the rollout algorithm is no more than the cost of the base heuristic, starting from any node.
Fortified rollout algorithm:
    Simple variant of the rollout algorithm, where we keep the best path found so far through the application of the base heuristic.
    If the rollout path deviates from the best path found, then follow the best path.
    Can be shown to be a rollout algorithm with a sequentially improving base heuristic for a slightly modified variant of the original problem.
    Has the cost improvement property.
6.231 DYNAMIC PROGRAMMING
LECTURE 12
LECTURE OUTLINE
More on rollout algorithms - Stochastic problems
Simulation-based methods for rollout
Approximations of rollout algorithms
Rolling horizon approximations
Discretization of continuous time
Discretization of continuous space
Other suboptimal approaches
ROLLOUT ALGORITHMS - STOCHASTIC PROBLEMS
Rollout policy: At each k and state x_k, use the control \bar\mu_k(x_k) that attains
    min_{u_k \in U_k(x_k)} Q_k(x_k, u_k),
where
    Q_k(x_k, u_k) = E{ g_k(x_k, u_k, w_k) + H_{k+1}( f_k(x_k, u_k, w_k) ) }
and H_{k+1}(x_{k+1}) is the cost-to-go of the heuristic.
Q_k(x_k, u_k) is called the Q-factor of (x_k, u_k), and for a stochastic problem, its computation may involve Monte Carlo simulation.
Potential difficulty: To minimize the Q-factor over u_k, we must form Q-factor differences Q_k(x_k, u) - Q_k(x_k, \bar u). This differencing often amplifies the simulation error in the calculation of the Q-factors.
Potential remedy: Compare any two controls u and \bar u by simulating the Q-factor difference Q_k(x_k, u) - Q_k(x_k, \bar u) directly. This may effect variance reduction of the simulation-induced error.
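A sketch of the variance-reduction idea above: compare two controls by simulating the Q-factor difference with a common disturbance sample for both, rather than estimating each Q-factor separately. sample_disturbance, step, stage_cost, and heuristic_cost_to_go are hypothetical problem-specific callables.

def q_factor_difference(k, x, u1, u2, n_samples,
                        sample_disturbance, step, stage_cost, heuristic_cost_to_go):
    # Estimate Q_k(x, u1) - Q_k(x, u2) directly, reusing the same disturbance sample
    # for both controls (common random numbers) to reduce variance.
    diff = 0.0
    for _ in range(n_samples):
        w = sample_disturbance(k, x)
        q1 = stage_cost(k, x, u1, w) + heuristic_cost_to_go(k + 1, step(k, x, u1, w))
        q2 = stage_cost(k, x, u2, w) + heuristic_cost_to_go(k + 1, step(k, x, u2, w))
        diff += q1 - q2
    return diff / n_samples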
Q-FACTOR APPROXIMATION
Here, instead of simulating the Q-factors, we approximate the costs-to-go H_{k+1}(x_{k+1}).
Certainty equivalence approach: Given x_k, fix the future disturbances at typical values \bar w_{k+1}, . . . , \bar w_{N-1} and approximate the Q-factors with
    \tilde Q_k(x_k, u_k) = E{ g_k(x_k, u_k, w_k) + \tilde H_{k+1}( f_k(x_k, u_k, w_k) ) }
where \tilde H_{k+1}( f_k(x_k, u_k, w_k) ) is the cost of the heuristic with the disturbances fixed at the typical values.
This is an approximation of H_{k+1}( f_k(x_k, u_k, w_k) ) by using a single sample simulation.
Variant of the certainty equivalence approach: Approximate H_{k+1}( f_k(x_k, u_k, w_k) ) by simulation using a small number of representative samples (scenarios).
Alternative: Calculate (exact or approximate) values for the cost-to-go of the base policy at a limited set of state-time pairs, and then approximate H_{k+1} using an approximation architecture and a training algorithm or least-squares fit.
ROLLING HORIZON APPROACH
This is an l-step lookahead policy where the cost-to-go approximation is just 0.
Alternatively, the cost-to-go approximation is the terminal cost function g_N.
A short rolling horizon saves computation.
"Paradox": It is not true that a longer rolling horizon always improves performance.
Example: At the initial state, there are two controls available (1 and 2). At every other state, there is only one control.
[Figure: from the current state, control 1 leads to the optimal trajectory, which has high cost over the first l stages and low cost thereafter; control 2 leads to low cost over the first l stages and high cost thereafter.]
ROLLING HORIZON COMBINED WITH ROLLOUT
We can use a rolling horizon approximation in calculating the cost-to-go of the base heuristic.
Because the heuristic is suboptimal, the rationale for a long rolling horizon becomes weaker.
Example: N-stage stopping problem where the stopping cost is 0, the continuation cost is either -\epsilon or 1, where 0 < \epsilon << 1, and the first state with continuation cost equal to 1 is state m. Then the optimal policy is to stop at state m, and the optimal cost is -m\epsilon.
[Figure: states 0, 1, 2, . . . , m, . . . , N in a line, with continuation cost -\epsilon up to state m and 1 thereafter; a stopped state below.]
Consider the heuristic that continues at every state, and the rollout policy that is based on this heuristic, with a rolling horizon of l <= m steps.
It will continue up to the first m - l + 1 stages, thus compiling a cost of -(m - l + 1)\epsilon. The rollout performance improves as l becomes shorter!
Limited vision may work to our advantage!
DISCRETIZATION
If the state space and/or control space is continuous/infinite, it must be replaced by a finite discretization.
Need for consistency, i.e., as the discretization becomes finer, the cost-to-go functions of the discretized problem should converge to those of the continuous problem.
Pitfalls with discretizing continuous time.
    The control constraint set changes a lot as we pass to the discrete-time approximation.
Continuous-time shortest path pitfall:
    \dot x_1(t) = u_1(t),   \dot x_2(t) = u_2(t),
with control constraint u_i(t) \in {-1, 1} and cost \int_0^T g( x(t) ) dt. Compare with the naive discretization
    x_1(t + \Delta t) = x_1(t) + \Delta t u_1(t),   x_2(t + \Delta t) = x_2(t) + \Delta t u_2(t),
with u_i(t) \in {-1, 1}.
Convexification effect of continuous time.
SPACE DISCRETIZATION I
Given a discrete-time system with state space S, consider a finite subset \bar S; for example \bar S could be a finite grid within a continuous state space S.
Difficulty: f(x, u, w) \notin \bar S for x \in \bar S.
We define an approximation to the original problem, with state space \bar S, as follows:
Express each x \in S as a convex combination of states in \bar S, i.e.,
    x = \sum_{x_i \in \bar S} \phi_i(x) x_i,   where \phi_i(x) >= 0 and \sum_i \phi_i(x) = 1
Define a "reduced" dynamic system with state space \bar S, whereby from each x_i \in \bar S we move to x = f(x_i, u, w) according to the system equation of the original problem, and then move to x_j \in \bar S with probabilities \phi_j(x).
Define similarly the corresponding cost per stage of the transitions of the reduced system.
SPACE DISCRETIZATION II
Let \bar J_k(x_i) be the optimal cost-to-go of the "reduced" problem from each state x_i \in \bar S and time k onward.
Approximate the optimal cost-to-go of any x \in S for the original problem by
    \tilde J_k(x) = \sum_{x_i \in \bar S} \phi_i(x) \bar J_k(x_i),
and use one-step lookahead based on \tilde J_k.
The choice of coefficients \phi_i(x) is in principle arbitrary, but should aim at consistency, i.e., as the number of states in \bar S increases, \tilde J_k(x) should converge to the optimal cost-to-go of the original problem.
Interesting observation: While the original problem may be deterministic, the reduced problem is always stochastic.
Generalization: The set \bar S may be any finite set (not a subset of S) as long as the coefficients \phi_i(x) admit a meaningful interpretation that quantifies the degree of association of x with x_i.
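A sketch of the interpolation step above for a one-dimensional grid: the coefficients \phi_i(x) are chosen here as the barycentric weights of the two neighboring grid points (one common choice; the slides allow any convex weights).

import bisect

def interpolation_coefficients(x, grid):
    # grid: sorted list of grid points forming S-bar. Returns {index: phi_i(x)} with
    # nonnegative weights that sum to 1 (barycentric weights of the two neighbors).
    if x <= grid[0]:
        return {0: 1.0}
    if x >= grid[-1]:
        return {len(grid) - 1: 1.0}
    j = bisect.bisect_right(grid, x)
    lo, hi = grid[j - 1], grid[j]
    t = (x - lo) / (hi - lo)
    return {j - 1: 1.0 - t, j: t}

def interpolated_cost(x, grid, J_bar):
    # J~_k(x) = sum_i phi_i(x) * Jbar_k(x_i)
    return sum(w * J_bar[i] for i, w in interpolation_coefficients(x, grid).items())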
OTHER SUBOPTIMAL CONTROL APPROACHES
Minimize the DP equation error: Approximate the optimal cost-to-go functions J_k(x_k) with functions \tilde J_k(x_k, r_k), where r_k is a vector of unknown parameters, chosen to minimize some form of error in the DP equations.
Direct approximation of control policies: For a subset of states x^i, i = 1, . . . , m, find
    \hat\mu_k(x^i) = arg min_{u_k \in U_k(x^i)} E{ g(x^i, u_k, w_k) + \tilde J_{k+1}( f_k(x^i, u_k, w_k), r_{k+1} ) }.
Then find \tilde\mu_k(x_k, s_k), where s_k is a vector of parameters obtained by solving the problem
    min_s \sum_{i=1}^m | \hat\mu_k(x^i) - \tilde\mu_k(x^i, s) |^2.
Approximation in policy space: Do not bother with cost-to-go approximations. Parametrize the policies as \tilde\mu_k(x_k, s_k), and minimize the cost function of the problem over the parameters s_k.
6.231 DYNAMIC PROGRAMMING
LECTURE 13
LECTURE OUTLINE
Infinite horizon problems
Stochastic shortest path problems
Bellman's equation
Dynamic programming - value iteration
Examples
TYPES OF INFINITE HORIZON PROBLEMS
Same as the basic problem, but:
    The number of stages is infinite.
    The system is stationary.
Total cost problems: Minimize
    J_\pi(x_0) = lim_{N \to \infty} E_{w_k, k=0,1,...} { \sum_{k=0}^{N-1} \alpha^k g( x_k, \mu_k(x_k), w_k ) }
    Stochastic shortest path problems (\alpha = 1, finite-state system with a termination state)
    Discounted problems (\alpha < 1, bounded cost per stage)
    Discounted and undiscounted problems with unbounded cost per stage
Average cost problems
    lim_{N \to \infty} (1/N) E_{w_k, k=0,1,...} { \sum_{k=0}^{N-1} g( x_k, \mu_k(x_k), w_k ) }
PREVIEW OF INFINITE HORIZON RESULTS
Key issue: The relation between the infinite and finite horizon optimal cost-to-go functions.
Illustration: Let \alpha = 1 and J_N(x) denote the optimal cost of the N-stage problem, generated after N DP iterations, starting from J_0(x) \equiv 0:
    J_{k+1}(x) = min_{u \in U(x)} E_w { g(x, u, w) + J_k( f(x, u, w) ) },  for all x
Typical results for total cost problems:
    J^*(x) = lim_{N \to \infty} J_N(x),  for all x
    J^*(x) = min_{u \in U(x)} E_w { g(x, u, w) + J^*( f(x, u, w) ) },  for all x
(Bellman's Equation). If \mu(x) minimizes in Bellman's Eq., the policy {\mu, \mu, . . .} is optimal.
Bellman's Eq. always holds. The other results are true for SSP (and bounded/discounted; unusual exceptions for other problems).
STOCHASTIC SHORTEST PATH PROBLEMS
Assume a finite-state system: States 1, . . . , n and a special cost-free termination state t
    Transition probabilities p_{ij}(u)
    Control constraints u \in U(i)
    Cost of policy \pi = {\mu_0, \mu_1, . . .} is
        J_\pi(i) = lim_{N \to \infty} E { \sum_{k=0}^{N-1} g( x_k, \mu_k(x_k) ) | x_0 = i }
    Optimal policy if J_\pi(i) = J^*(i) for all i.
Assumption (termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than m stages; for all \pi, we have
    \rho_\pi = max_{i=1,...,n} P{ x_m \ne t | x_0 = i, \pi } < 1
FINITENESS OF POLICY COST-TO-GO FUNCTIONS
Let
    \rho = max_\pi \rho_\pi.
Note that \rho_\pi depends only on the first m components of the policy \pi, so that \rho < 1.
For any \pi and any initial state i
    P{ x_{2m} \ne t | x_0 = i, \pi } = P{ x_{2m} \ne t | x_m \ne t, x_0 = i, \pi } P{ x_m \ne t | x_0 = i, \pi } <= \rho^2
and similarly
    P{ x_{km} \ne t | x_0 = i, \pi } <= \rho^k,  i = 1, . . . , n
So E{ cost between times km and (k+1)m - 1 } <= m \rho^k max_{i=1,...,n, u \in U(i)} | g(i, u) |
and
    | J_\pi(i) | <= \sum_{k=0}^\infty m \rho^k max_{i,u} | g(i, u) | = ( m / (1 - \rho) ) max_{i,u} | g(i, u) |
MAIN RESULT
Given any initial conditions J_0(1), . . . , J_0(n), the sequence J_k(i) generated by the DP iteration
    J_{k+1}(i) = min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) J_k(j) ],  for all i
converges to the optimal cost J^*(i), which satisfies Bellman's equation
    J^*(i) = min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) J^*(j) ],  for all i
A stationary policy \mu is optimal if and only if for every state i, \mu(i) attains the minimum in Bellman's equation.
Key proof idea: The "tail" of the cost series,
    \sum_{k=mK}^\infty E{ g( x_k, \mu_k(x_k) ) },
vanishes as K increases to \infty.
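A minimal sketch of the DP (value) iteration above for a finite-state SSP; the data layout (dicts of transition probabilities with missing mass going to the cost-free termination state) is an assumption of this sketch, not part of the slides.

def ssp_value_iteration(n, U, g, p, n_iter=1000, tol=1e-9):
    # U[i]: list of controls at state i; g[i][u]: stage cost; p[i][u]: dict {j: p_ij(u)}
    # over nonterminal states j (missing probability mass goes to the termination state).
    J = [0.0] * n
    for _ in range(n_iter):
        J_new = [
            min(g[i][u] + sum(prob * J[j] for j, prob in p[i][u].items()) for u in U[i])
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(J, J_new)) < tol:
            return J_new
        J = J_new
    return J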
OUTLINE OF PROOF THAT J_N -> J^*
Assume for simplicity that J_0(i) = 0 for all i, and for any K >= 1, write the cost of any policy \pi as
    J_\pi(x_0) = E { \sum_{k=0}^{mK-1} g( x_k, \mu_k(x_k) ) } + E { \sum_{k=mK}^\infty g( x_k, \mu_k(x_k) ) }
              <= E { \sum_{k=0}^{mK-1} g( x_k, \mu_k(x_k) ) } + \sum_{k=K}^\infty \rho^k m max_{i,u} | g(i, u) |
Take the minimum of both sides over \pi to obtain
    J^*(x_0) <= J_{mK}(x_0) + ( \rho^K / (1 - \rho) ) m max_{i,u} | g(i, u) |.
Similarly, we have
    J_{mK}(x_0) - ( \rho^K / (1 - \rho) ) m max_{i,u} | g(i, u) | <= J^*(x_0).
It follows that lim_{K \to \infty} J_{mK}(x_0) = J^*(x_0).
It can be seen that J_{mK}(x_0) and J_{mK+k}(x_0) converge to the same limit for k = 1, . . . , m - 1, so J_N(x_0) -> J^*(x_0).
EXAMPLE I
Minimizing the E{Time to Termination}: Let
    g(i, u) = 1,  for all i = 1, . . . , n, u \in U(i)
Under our assumptions, the costs J^*(i) uniquely solve Bellman's equation, which has the form
    J^*(i) = min_{u \in U(i)} [ 1 + \sum_{j=1}^n p_{ij}(u) J^*(j) ],  i = 1, . . . , n
In the special case where there is only one control at each state, J^*(i) is the mean first passage time from i to t.
Example fragment (pursuit of a fly along a line, state = distance):
    J^*(i) = 1 + p J^*(i) + (1 - 2p) J^*(i - 1) + p J^*(i - 2),  i >= 2,
    J^*(1) = 1 + min[ 2p J^*(1),  p J^*(2) + (1 - 2p) J^*(1) ],
with J^*(0) = 0. Substituting J^*(2) from the first equation,
    p J^*(2) = p / (1 - p) + p (1 - 2p) J^*(1) / (1 - p).
Work from here to find that when one unit away from the fly it is optimal not to move if and only if p >= 1/3.
6.231 DYNAMIC PROGRAMMING
LECTURE 14
LECTURE OUTLINE
Review of stochastic shortest path problems
Computational methods
Value iteration
Policy iteration
Linear programming
Discounted problems as special case of SSP
STOCHASTIC SHORTEST PATH PROBLEMS
Assume a finite-state system: States 1, . . . , n and a special cost-free termination state t
    Transition probabilities p_{ij}(u)
    Control constraints u \in U(i)
    Cost of policy \pi = {\mu_0, \mu_1, . . .} is
        J_\pi(i) = lim_{N \to \infty} E { \sum_{k=0}^{N-1} g( x_k, \mu_k(x_k) ) | x_0 = i }
    Optimal policy if J_\pi(i) = J^*(i) for all i.
Assumption (termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than m stages; for all \pi, we have
    \rho_\pi = max_{i=1,...,n} P{ x_m \ne t | x_0 = i, \pi } < 1
MAIN RESULT
Given any initial conditions J_0(1), . . . , J_0(n), the sequence J_k(i) generated by value iteration
    J_{k+1}(i) = min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) J_k(j) ],  for all i
converges to the optimal cost J^*(i), which satisfies Bellman's equation
    J^*(i) = min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) J^*(j) ],  for all i
A stationary policy \mu is optimal if and only if for every state i, \mu(i) attains the minimum in Bellman's equation.
Key proof idea: The "tail" of the cost series,
    \sum_{k=mK}^\infty E{ g( x_k, \mu_k(x_k) ) },
vanishes as K increases to \infty.
BELLMAN'S EQUATION FOR A SINGLE POLICY
Consider a stationary policy \mu:
    J_\mu(i) = g( i, \mu(i) ) + \sum_{j=1}^n p_{ij}( \mu(i) ) J_\mu(j),  i = 1, . . . , n
Proof: This is just Bellman's equation for a modified/restricted problem where there is only one policy, the stationary policy \mu, i.e., the control constraint set at state i is U_\mu(i) = { \mu(i) }.
The equation provides a way to compute J_\mu(i), i = 1, . . . , n, but the computation is substantial for large n [O(n^3)].
For large n, value iteration may be preferable. (Typical case of a large linear system of equations, where an iterative method may be better than a direct solution method.)
POLICY ITERATION
It generates a sequence \mu^1, \mu^2, . . . of stationary policies, starting with any stationary policy \mu^0.
At the typical iteration, given \mu^k, we perform a policy evaluation step, which computes the J_{\mu^k}(i) as the solution of the (linear) system of equations
    J(i) = g( i, \mu^k(i) ) + \sum_{j=1}^n p_{ij}( \mu^k(i) ) J(j),  i = 1, . . . , n,
in the n unknowns J(1), . . . , J(n). We then perform a policy improvement step, which computes a new policy \mu^{k+1} as
    \mu^{k+1}(i) = arg min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) J_{\mu^k}(j) ],  for all i
The algorithm stops when J_{\mu^k}(i) = J_{\mu^{k+1}}(i) for all i.
Note the connection with the rollout algorithm, which is just a single policy iteration.
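A sketch of the two steps above for a finite-state SSP, using numpy to solve the linear policy-evaluation system; it assumes the initial policy (and every policy encountered) is proper, so that I - P_mu is invertible. The data layout is an assumption of this sketch.

import numpy as np

def policy_evaluation(P_mu, g_mu):
    # Solve J = g_mu + P_mu J, i.e., (I - P_mu) J = g_mu.
    n = len(g_mu)
    return np.linalg.solve(np.eye(n) - P_mu, g_mu)

def policy_iteration(n, U, g, P):
    # U[i]: controls at i; g[i][u]: stage cost; P[i][u]: length-n array of transition probs
    # to nonterminal states (missing mass goes to the cost-free termination state).
    mu = [U[i][0] for i in range(n)]                 # arbitrary initial (assumed proper) policy
    while True:
        P_mu = np.array([P[i][mu[i]] for i in range(n)])
        g_mu = np.array([g[i][mu[i]] for i in range(n)])
        J = policy_evaluation(P_mu, g_mu)            # policy evaluation step
        mu_new = [min(U[i], key=lambda u: g[i][u] + P[i][u] @ J) for i in range(n)]
        if mu_new == mu:                             # no change => optimal
            return mu, J
        mu = mu_new                                  # policy improvement step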
JUSTIFICATION OF POLICY ITERATION
We can show that J_{\mu^{k+1}}(i) <= J_{\mu^k}(i) for all i, k.
Fix k and consider the sequence generated by
    J_{N+1}(i) = g( i, \mu^{k+1}(i) ) + \sum_{j=1}^n p_{ij}( \mu^{k+1}(i) ) J_N(j)
where J_0(i) = J_{\mu^k}(i). We have
    J_0(i) = g( i, \mu^k(i) ) + \sum_{j=1}^n p_{ij}( \mu^k(i) ) J_0(j)
          >= g( i, \mu^{k+1}(i) ) + \sum_{j=1}^n p_{ij}( \mu^{k+1}(i) ) J_0(j) = J_1(i)
Using the monotonicity property of DP,
    J_0(i) >= J_1(i) >= . . . >= J_N(i) >= J_{N+1}(i) >= . . . ,  for all i
Since J_N(i) -> J_{\mu^{k+1}}(i) as N -> \infty, we obtain J_{\mu^k}(i) = J_0(i) >= J_{\mu^{k+1}}(i) for all i. Also, if J_{\mu^k}(i) = J_{\mu^{k+1}}(i) for all i, then J_{\mu^k} solves Bellman's equation and is therefore equal to J^*.
A policy cannot be repeated, there are finitely many stationary policies, so the algorithm terminates with an optimal policy.
LINEAR PROGRAMMING
We claim that J^* is the "largest" J that satisfies the constraint
    J(i) <= g(i, u) + \sum_{j=1}^n p_{ij}(u) J(j),    (1)
for all i = 1, . . . , n and u \in U(i).
Proof: If we use value iteration to generate a sequence of vectors J_k = ( J_k(1), . . . , J_k(n) ) starting with a J_0 such that
    J_0(i) <= min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) J_0(j) ],  for all i,
then J_k(i) <= J_{k+1}(i) for all k and i (monotonicity property of DP) and J_k -> J^*, so that J_0(i) <= J^*(i) for all i.
So J^* can be obtained by maximizing \sum_{i=1}^n J(i) subject to the constraint (1), a linear program in the variables J(1), . . . , J(n).
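A sketch of the linear program above using scipy: maximize the sum of the J(i) subject to constraint (1). Since scipy's linprog minimizes, the objective is negated; the data layout is an assumption of this sketch.

import numpy as np
from scipy.optimize import linprog

def ssp_lp(n, U, g, P):
    # Constraint (1) rewritten as: J(i) - sum_j p_ij(u) J(j) <= g(i, u) for all (i, u).
    A, b = [], []
    for i in range(n):
        for u in U[i]:
            row = -np.array(P[i][u], dtype=float)
            row[i] += 1.0
            A.append(row)
            b.append(g[i][u])
    # Maximize sum_i J(i)  <=>  minimize -sum_i J(i).
    res = linprog(c=-np.ones(n), A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(None, None)] * n)
    return res.x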
DISCOUNTED PROBLEMS (CONTINUED)
Policy iteration converges finitely to an optimal policy, and linear programming works.
Example: Asset selling over an infinite horizon. If accepted, the offer x_k of period k is invested at a rate of interest r.
By depreciating the sale amount to period 0 dollars, we view (1 + r)^{-k} x_k as the reward for selling the asset in period k at a price x_k, where r > 0 is the rate of interest. So the discount factor is \alpha = 1/(1 + r).
J^* is the unique solution of Bellman's equation
    J^*(x) = max[ x,  E{ J^*(w) } / (1 + r) ].
An optimal policy is to sell if and only if the current offer x_k is greater than or equal to \bar\alpha, where
    \bar\alpha = E{ J^*(w) } / (1 + r).
6.231 DYNAMIC PROGRAMMING
LECTURE 15
LECTURE OUTLINE
Average cost per stage problems
Connection with stochastic shortest path problems
Bellman's equation
Value iteration
Policy iteration
AVERAGE COST PER STAGE PROBLEM
Stationary system with a finite number of states and controls
Minimize over policies \pi = {\mu_0, \mu_1, . . .}
    J_\pi(x_0) = lim_{N \to \infty} (1/N) E_{w_k, k=0,1,...} { \sum_{k=0}^{N-1} g( x_k, \mu_k(x_k), w_k ) }
Important characteristics (not shared by other types of infinite horizon problems)
    For any fixed K, the cost incurred up to time K does not matter (only the state we are at at time K matters)
    If all states "communicate" the optimal cost is independent of the initial state [if we can go from i to j in finite expected time, we must have J^*(i) <= J^*(j)]. So J^*(i) \equiv \lambda^* for all i.
    Because "communication" issues are so important, the methodology relies heavily on Markov chain theory.
CONNECTION WITH SSP
Assumption: State n is such that for some integer m > 0, and for all initial states and all policies, n is visited with positive probability at least once within the first m stages.
Divide the sequence of generated states into cycles marked by successive visits to n.
Each of the cycles can be viewed as a state trajectory of a corresponding stochastic shortest path problem with n as the termination state.
[Figure: original chain with states i, j, n and transition probabilities p_{ij}(u), p_{in}(u), p_{ni}(u), etc.; the associated SSP is obtained by splitting n into a special starting state n and an artificial termination state t.]
Let the cost at i of the SSP be g(i, u) - \lambda^*.
We will show that
    Av. Cost Probl. == A Min Cost Cycle Probl. == SSP Probl.
CONNECTION WITH SSP (CONTINUED)
Consider a minimum cycle cost problem: Find a stationary policy \mu that minimizes the expected cost per transition within a cycle
    C_{nn}(\mu) / N_{nn}(\mu),
where, for a fixed \mu,
    C_{nn}(\mu): E{ cost from n up to the first return to n }
    N_{nn}(\mu): E{ time from n up to the first return to n }
Intuitively, optimal cycle cost = \lambda^*, so
    C_{nn}(\mu) - N_{nn}(\mu) \lambda^* >= 0,
with equality if \mu is optimal.
Thus, the optimal \mu must minimize over \mu the expression C_{nn}(\mu) - N_{nn}(\mu) \lambda^*, which is the expected cost of \mu starting from n in the SSP with stage costs g(i, u) - \lambda^*.
The optimal costs h^*(1), . . . , h^*(n) of this SSP satisfy
    h^*(i) = min_{u \in U(i)} [ g(i, u) - \lambda^* + \sum_{j=1}^n p_{ij}(u) h^*(j) ],  for all i
If \mu^* is an optimal stationary policy for the SSP problem, we have
    h^*(n) = C_{nn}(\mu^*) - N_{nn}(\mu^*) \lambda^* = 0
Combining these equations, we have
    \lambda^* + h^*(i) = min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) h^*(j) ],  for all i
If \mu^*(i) attains the minimum for each i, then \mu^* is optimal.
Example (processing unfilled orders): Bellman's equation takes the form
    \lambda^* + h^*(i) = min[ K + (1 - p) h^*(0) + p h^*(1),  c i + (1 - p) h^*(i) + p h^*(i + 1) ],
and for state n
    \lambda^* + h^*(n) = K + (1 - p) h^*(0) + p h^*(1)
Optimal policy: Process the i unfilled orders if
    K + (1 - p) h^*(0) + p h^*(1) <= c i + (1 - p) h^*(i) + p h^*(i + 1).
Value iteration fragment: Intuitively, for large k,
    J_k(i) \approx k \lambda^* + h^*(i),  for all i,
and the differences remain bounded,
    | J_k(i) - k \lambda^* - h^*(i) | <= max_{j=1,...,n} | J_0(j) - h^*(j) |,  for all i.
Policy evaluation: Compute \lambda^k and h^k(i) of \mu^k, using the n + 1 equations h^k(n) = 0 and
    \lambda^k + h^k(i) = g( i, \mu^k(i) ) + \sum_{j=1}^n p_{ij}( \mu^k(i) ) h^k(j),  for all i
Policy improvement: Find for all i
    \mu^{k+1}(i) = arg min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) h^k(j) ]
If \lambda^{k+1} = \lambda^k and h^{k+1}(i) = h^k(i) for all i, stop; otherwise, repeat with \mu^{k+1} replacing \mu^k.
Result: For each k, we either have \lambda^{k+1} < \lambda^k, or
    \lambda^{k+1} = \lambda^k,   h^{k+1}(i) <= h^k(i),  i = 1, . . . , n.
The algorithm terminates with an optimal policy.
6.231 DYNAMIC PROGRAMMING
LECTURE 16
LECTURE OUTLINE
Control of continuous-time Markov chains
Semi-Markov problems
Problem formulation - Equivalence to discrete-time problems
Discounted problems
Average cost problems
CONTINUOUS-TIME MARKOV CHAINS
Stationary system with nite number of states
and controls
State transitions occur at discrete times
Control applied at these discrete times and stays
constant between transitions
Time between transitions is random
Cost accumulates in continuous time (may also
be incurred at the time of transition)
Example: Admission control in a system with
restricted capacity (e.g., a communication link)
Customer arrivals: a Poisson process
Customers entering the system, depart after
exponentially distributed time
Upon arrival we must decide whether to admit or to block a customer
There is a cost for blocking a customer
For each customer that is in the system, there
is a customer-dependent reward per unit time
Minimize time-discounted or average cost
PROBLEM FORMULATION
x(t) and u(t): State and control at time t
t_k: Time of the kth transition (t_0 = 0)
x_k = x(t_k);  x(t) = x_k for t_k <= t < t_{k+1}.
u_k = u(t_k);  u(t) = u_k for t_k <= t < t_{k+1}.
No transition probabilities; instead transition distributions (quantify the uncertainty about both transition time and next state)
    Q_{ij}(\tau, u) = P{ t_{k+1} - t_k <= \tau, x_{k+1} = j | x_k = i, u_k = u }
Two important formulas:
(1) Transition probabilities are specified by
    p_{ij}(u) = P{ x_{k+1} = j | x_k = i, u_k = u } = lim_{\tau \to \infty} Q_{ij}(\tau, u)
(2) The Cumulative Distribution Function (CDF) of \tau given i, j, u is (assuming p_{ij}(u) > 0)
    P{ t_{k+1} - t_k <= \tau | x_k = i, x_{k+1} = j, u_k = u } = Q_{ij}(\tau, u) / p_{ij}(u)
Thus, Q_{ij}(\tau, u) can be viewed as a "scaled CDF".
EXPONENTIAL TRANSITION DISTRIBUTIONS
Important example of transition distributions:
    Q_{ij}(\tau, u) = p_{ij}(u) ( 1 - e^{-\nu_i(u) \tau} ),
where p_{ij}(u) are transition probabilities, and \nu_i(u) is called the transition rate at state i.
Interpretation: If the system is in state i and control u is applied
    the next state will be j with probability p_{ij}(u)
    the time between the transition to state i and the transition to the next state j is exponentially distributed with parameter \nu_i(u) (independently of j):
        P{ transition time interval > \tau | i, u } = e^{-\nu_i(u) \tau}
The exponential distribution is memoryless. This implies that for a given policy, the system is a continuous-time Markov chain (the future depends on the past through the present).
Without the memoryless property, the Markov property holds only at the times of transition.
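A sketch of simulating one transition of the exponential model above: draw the holding time from an exponential with rate nu_i(u), then the next state from p_ij(u); the data layout (nested dicts) is an assumption of this sketch.

import random

def simulate_transition(i, u, nu, p):
    # nu[i][u]: transition rate nu_i(u); p[i][u]: dict {j: p_ij(u)} summing to 1.
    holding_time = random.expovariate(nu[i][u])      # exponential with parameter nu_i(u)
    r, acc = random.random(), 0.0
    for j, prob in p[i][u].items():                  # sample next state from p_ij(u)
        acc += prob
        if r <= acc:
            return holding_time, j
    return holding_time, j                           # guard against rounding error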
COST STRUCTURES
There is a cost g(i, u) per unit time, i.e.
    g(i, u) dt = the cost incurred in time dt
There may be an extra "instantaneous" cost \hat g(i, u) at the time of a transition (let's ignore this for the moment)
Total discounted cost of \pi = {\mu_0, \mu_1, . . .} starting from state i (with discount factor \beta > 0)
    lim_{N \to \infty} E { \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} e^{-\beta t} g( x_k, \mu_k(x_k) ) dt  |  x_0 = i }
Average cost per unit time
    lim_{N \to \infty} (1 / E{t_N}) E { \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} g( x_k, \mu_k(x_k) ) dt  |  x_0 = i }
We will see that both problems have equivalent discrete-time versions.
A NOTE ON NOTATION
The scaled CDF Q_{ij}(\tau, u) can be used to model discrete, continuous, and mixed distributions for the transition time \tau.
Generally, expected values of functions of \tau can be written as integrals involving dQ_{ij}(\tau, u). For example, the conditional expected value of \tau given i, j, and u is written as
    E{ \tau | i, j, u } = \int_0^\infty \tau dQ_{ij}(\tau, u) / p_{ij}(u)
If Q_{ij}(\tau, u) is continuous with respect to \tau, its derivative
    q_{ij}(\tau, u) = dQ_{ij}(\tau, u) / d\tau
can be viewed as a "scaled" density function. Expected values of functions of \tau can then be written in terms of q_{ij}(\tau, u). For example
    E{ \tau | i, j, u } = \int_0^\infty \tau ( q_{ij}(\tau, u) / p_{ij}(u) ) d\tau
If Q_{ij}(\tau, u) is discontinuous and "staircase-like", expected values can be written as summations.
DISCOUNTED PROBLEMS - COST CALCULATION
For a policy \pi = {\mu_0, \mu_1, . . .}, write
    J_\pi(i) = E{ 1st transition cost } + E{ e^{-\beta \tau} J_{\pi_1}(j) | i, \mu_0(i) }
where J_{\pi_1}(j) is the cost-to-go of the policy \pi_1 = {\mu_1, \mu_2, . . .}.
We calculate the two costs in the RHS. The E{ 1st transition cost }, if u is applied at state i, is
    G(i, u) = E_j { E_\tau { 1st transition cost | j } }
            = \sum_{j=1}^n p_{ij}(u) \int_0^\infty ( \int_0^\tau e^{-\beta t} g(i, u) dt ) dQ_{ij}(\tau, u) / p_{ij}(u)
            = g(i, u) \sum_{j=1}^n \int_0^\infty ( (1 - e^{-\beta \tau}) / \beta ) dQ_{ij}(\tau, u)
Thus the E{ 1st transition cost } is
    G( i, \mu_0(i) ) = g( i, \mu_0(i) ) \sum_{j=1}^n \int_0^\infty ( (1 - e^{-\beta \tau}) / \beta ) dQ_{ij}( \tau, \mu_0(i) )
COST CALCULATION (CONTINUED)
Also the expected (discounted) cost from the next state j is
    E{ e^{-\beta \tau} J_{\pi_1}(j) | i, \mu_0(i) }
      = E_j { E{ e^{-\beta \tau} | i, \mu_0(i), j } J_{\pi_1}(j) | i, \mu_0(i) }
      = \sum_{j=1}^n p_{ij}(u) ( \int_0^\infty e^{-\beta \tau} dQ_{ij}(\tau, u) / p_{ij}(u) ) J_{\pi_1}(j)
      = \sum_{j=1}^n m_{ij}( \mu_0(i) ) J_{\pi_1}(j)
where m_{ij}(u) is given by
    m_{ij}(u) = \int_0^\infty e^{-\beta \tau} dQ_{ij}(\tau, u)   ( < \int_0^\infty dQ_{ij}(\tau, u) = p_{ij}(u) )
and can be viewed as the "effective discount factor" [the analog of \alpha p_{ij}(u) in the discrete-time case].
So
    J_\pi(i) = G( i, \mu_0(i) ) + \sum_{j=1}^n m_{ij}( \mu_0(i) ) J_{\pi_1}(j)
EQUIVALENCE TO AN SSP
Similar to the discrete-time case, introduce a stochastic shortest path problem with an artificial termination state t.
Under control u, from state i the system moves to state j with probability m_{ij}(u) and to the termination state t with probability 1 - \sum_{j=1}^n m_{ij}(u).
Bellman's equation: For i = 1, . . . , n,
    J^*(i) = min_{u \in U(i)} [ G(i, u) + \sum_{j=1}^n m_{ij}(u) J^*(j) ]
Analogs of value iteration, policy iteration, and linear programming.
If in addition to the cost per unit time g, there is an extra (instantaneous) one-stage cost \hat g(i, u), Bellman's equation becomes
    J^*(i) = min_{u \in U(i)} [ \hat g(i, u) + G(i, u) + \sum_{j=1}^n m_{ij}(u) J^*(j) ]
MANUFACTURER'S EXAMPLE REVISITED
A manufacturer receives orders with interarrival times uniformly distributed in [0, \tau_max].
He may process all unfilled orders at cost K > 0, or process none. The cost per unit time of an unfilled order is c. Max number of unfilled orders is n.
The nonzero transition distributions are
    Q_{i1}(\tau, Fill) = Q_{i(i+1)}(\tau, Not Fill) = min[ 1, \tau / \tau_max ]
The one-stage expected cost G is
    G(i, Fill) = 0,   G(i, Not Fill) = \gamma c i,
where
    \gamma = \sum_{j=1}^n \int_0^\infty ( (1 - e^{-\beta \tau}) / \beta ) dQ_{ij}(\tau, u) = \int_0^{\tau_max} ( (1 - e^{-\beta \tau}) / (\beta \tau_max) ) d\tau
There is an "instantaneous" cost
    \hat g(i, Fill) = K,   \hat g(i, Not Fill) = 0
MANUFACTURER'S EXAMPLE CONTINUED
The effective discount factors m_{ij}(u) in Bellman's Equation are
    m_{i1}(Fill) = m_{i(i+1)}(Not Fill) = \alpha,
where
    \alpha = \int_0^\infty e^{-\beta \tau} dQ_{ij}(\tau, u) = \int_0^{\tau_max} ( e^{-\beta \tau} / \tau_max ) d\tau = (1 - e^{-\beta \tau_max}) / (\beta \tau_max)
Bellman's equation has the form
    J^*(i) = min[ K + \alpha J^*(1),  \gamma c i + \alpha J^*(i + 1) ],  i = 1, 2, . . .
As in the discrete-time case, we can conclude that there exists an optimal threshold i^*:
    fill the orders <==> their number i exceeds i^*
AVERAGE COST
Minimize
    lim_{N \to \infty} (1 / E{t_N}) E { \int_0^{t_N} g( x(t), u(t) ) dt }
assuming there is a special state that is "recurrent under all policies".
Total expected cost of a transition
    G(i, u) = g(i, u) \bar\tau_i(u),
where \bar\tau_i(u): Expected transition time.
We now apply the SSP argument used for the discrete-time case. Divide the trajectory into cycles marked by successive visits to n. The cost at (i, u) is G(i, u) - \lambda^* \bar\tau_i(u), where \lambda^* is the optimal expected cost per unit time. Each cycle is viewed as a state trajectory of a corresponding SSP problem with the termination state being essentially n.
So Bellman's Eq. for the average cost problem:
    h^*(i) = min_{u \in U(i)} [ G(i, u) - \lambda^* \bar\tau_i(u) + \sum_{j=1}^n p_{ij}(u) h^*(j) ]
AVERAGE COST MANUFACTURER'S EXAMPLE
The expected transition times are
    \bar\tau_i(Fill) = \bar\tau_i(Not Fill) = \tau_max / 2,
the expected transition cost is
    G(i, Fill) = 0,   G(i, Not Fill) = c i \tau_max / 2,
and there is also the "instantaneous" cost
    \hat g(i, Fill) = K,   \hat g(i, Not Fill) = 0
Bellman's equation:
    h^*(i) = min[ K - \lambda^* (\tau_max / 2) + h^*(1),
                  c i (\tau_max / 2) - \lambda^* (\tau_max / 2) + h^*(i + 1) ]
Again it can be shown that a threshold policy is optimal.
6.231 DYNAMIC PROGRAMMING
LECTURE 17
LECTURE OUTLINE
We start a four-lecture sequence on advanced infinite horizon DP
We allow an infinite state space, so the stochastic shortest path framework cannot be used any more
Results are rigorous assuming a countable disturbance space
    This includes deterministic problems with arbitrary state space, and countable state Markov chains
    Otherwise the mathematics of measure theory make analysis difficult, although the final results are essentially the same as for countable disturbance space
The discounted problem is the proper starting point for this analysis
The central mathematical structure is that the DP mapping is a contraction mapping (instead of existence of a termination state)
DISCOUNTED PROBLEMS W/ BOUNDED COST
Stationary system with arbitrary state space
    x_{k+1} = f(x_k, u_k, w_k),  k = 0, 1, . . .
Cost of a policy \pi = {\mu_0, \mu_1, . . .}
    J_\pi(x_0) = lim_{N \to \infty} E_{w_k, k=0,1,...} { \sum_{k=0}^{N-1} \alpha^k g( x_k, \mu_k(x_k), w_k ) }
with \alpha < 1, and for some M, we have |g(x, u, w)| <= M for all (x, u, w).
Shorthand notation for DP mappings (operate on functions of state to produce other functions)
    (TJ)(x) = min_{u \in U(x)} E_w { g(x, u, w) + \alpha J( f(x, u, w) ) },  for all x
TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost \alpha J.
For any stationary policy \mu
    (T_\mu J)(x) = E_w { g( x, \mu(x), w ) + \alpha J( f(x, \mu(x), w) ) },  for all x
"SHORTHAND" THEORY - A SUMMARY
Cost function expressions [with J_0(x) \equiv 0]
    J_\pi(x) = lim_{k \to \infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J_0)(x),   J_\mu(x) = lim_{k \to \infty} (T_\mu^k J_0)(x)
Bellman's equation: J^* = T J^*,  J_\mu = T_\mu J_\mu
Optimality condition:
    \mu: optimal  <==>  T_\mu J^* = T J^*
Value iteration: For any (bounded) J and all x,
    J^*(x) = lim_{k \to \infty} (T^k J)(x)
Policy iteration: Given \mu^k,
    Policy evaluation: Find J_{\mu^k} by solving J_{\mu^k} = T_{\mu^k} J_{\mu^k}
    Policy improvement: Find \mu^{k+1} such that T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}
TWO KEY PROPERTIES
Monotonicity property: For any functions J and J' such that J(x) <= J'(x) for all x, and any \mu,
    (TJ)(x) <= (TJ')(x),  for all x,
    (T_\mu J)(x) <= (T_\mu J')(x),  for all x.
Additivity property: For any J, any scalar r, and any \mu,
    ( T(J + re) )(x) = (TJ)(x) + \alpha r,  for all x,
    ( T_\mu(J + re) )(x) = (T_\mu J)(x) + \alpha r,  for all x,
where e is the unit function [e(x) \equiv 1].
CONVERGENCE OF VALUE ITERATION
If J_0 \equiv 0,
    J^*(x) = lim_{N \to \infty} (T^N J_0)(x),  for all x
Proof: For any initial state x_0 and policy \pi = {\mu_0, \mu_1, . . .},
    J_\pi(x_0) = E { \sum_{k=0}^\infty \alpha^k g( x_k, \mu_k(x_k), w_k ) }
              = E { \sum_{k=0}^{N-1} \alpha^k g( x_k, \mu_k(x_k), w_k ) } + E { \sum_{k=N}^\infty \alpha^k g( x_k, \mu_k(x_k), w_k ) }
The tail portion satisfies
    | E { \sum_{k=N}^\infty \alpha^k g( x_k, \mu_k(x_k), w_k ) } | <= \alpha^N M / (1 - \alpha),
where M >= |g(x, u, w)|. Take the min over \pi of both sides. Q.E.D.
BELLMAN'S EQUATION
The optimal cost function J^* satisfies Bellman's Eq., i.e. J^* = T J^*.
Proof: For all x and N,
    J^*(x) - \alpha^N M / (1 - \alpha) <= (T^N J_0)(x) <= J^*(x) + \alpha^N M / (1 - \alpha),
where J_0(x) \equiv 0 and M >= |g(x, u, w)|. Applying T to this relation, and using Monotonicity and Additivity,
    (T J^*)(x) - \alpha^{N+1} M / (1 - \alpha) <= (T^{N+1} J_0)(x) <= (T J^*)(x) + \alpha^{N+1} M / (1 - \alpha)
Taking the limit as N \to \infty and using the fact
    lim_{N \to \infty} (T^{N+1} J_0)(x) = J^*(x),
we obtain J^* = T J^*. Q.E.D.
THE CONTRACTION PROPERTY
Contraction property: For any bounded functions J and J', and any \mu,
    max_x | (TJ)(x) - (TJ')(x) | <= \alpha max_x | J(x) - J'(x) |,
    max_x | (T_\mu J)(x) - (T_\mu J')(x) | <= \alpha max_x | J(x) - J'(x) |.
Proof: Denote c = max_{x \in S} | J(x) - J'(x) |. Then
    J(x) - c <= J'(x) <= J(x) + c,  for all x
Apply T to both sides, and use the Monotonicity and Additivity properties:
    (TJ)(x) - \alpha c <= (TJ')(x) <= (TJ)(x) + \alpha c,  for all x
Hence
    | (TJ)(x) - (TJ')(x) | <= \alpha c,  for all x.  Q.E.D.
IMPLICATIONS OF CONTRACTION PROPERTY
Bellman's equation J = TJ has a unique solution, namely J^*, and for any bounded J and all x,
    lim_{k \to \infty} (T^k J)(x) = J^*(x)
Proof: Use
    max_x | (T^k J)(x) - J^*(x) | = max_x | (T^k J)(x) - (T^k J^*)(x) | <= \alpha^k max_x | J(x) - J^*(x) |
Convergence rate: For all k,
    max_x | (T^k J)(x) - J^*(x) | <= \alpha^k max_x | J(x) - J^*(x) |
Also, for each stationary \mu, J_\mu is the unique solution of J = T_\mu J, and
    lim_{k \to \infty} (T_\mu^k J)(x) = J_\mu(x),  for all x,
for any bounded J.
NEC. AND SUFFICIENT OPT. CONDITION
A stationary policy \mu is optimal if and only if \mu(x) attains the minimum in Bellman's equation for each x; i.e.,
    T J^* = T_\mu J^*.
Proof: If T J^* = T_\mu J^*, then using Bellman's equation (J^* = T J^*), we have
    J^* = T_\mu J^*,
so by uniqueness of the fixed point of T_\mu, we obtain J^* = J_\mu; i.e., \mu is optimal.
Conversely, if the stationary policy \mu is optimal, we have J^* = J_\mu, so
    J^* = T_\mu J^*.
Combining this with Bellman's equation (J^* = T J^*), we obtain T J^* = T_\mu J^*. Q.E.D.
COMPUTATIONAL METHODS
Value iteration and variants
    Gauss-Seidel version
    Approximate value iteration
Policy iteration and variants
    Combination with value iteration
    Modified policy iteration
    Asynchronous policy iteration
Linear programming
    maximize \sum_{i=1}^n J(i)
    subject to J(i) <= g(i, u) + \alpha \sum_{j=1}^n p_{ij}(u) J(j),  for all (i, u)
Approximate linear programming: use in place of J(i) a low-dimensional basis function representation
    \tilde J(i, r) = \sum_{k=1}^m r_k w_k(i)
and a low-dimensional LP (with many constraints)
6.231 DYNAMIC PROGRAMMING
LECTURE 18
LECTURE OUTLINE
One-step lookahead and rollout for discounted
problems
Approximate policy iteration: Infinite state space
Contraction mappings in DP
Discounted problems: Countable state space
with unbounded costs
ONE-STEP LOOKAHEAD POLICIES
At state i use the control \bar\mu(i) that attains the minimum in
    (T \tilde J)(i) = min_{u \in U(i)} [ g(i, u) + \alpha \sum_{j=1}^n p_{ij}(u) \tilde J(j) ],
where \tilde J is some approximation to J^*.
Assume that
    T \tilde J <= \tilde J + \delta e,
for some scalar \delta, where e is the unit vector. Then
    J_{\bar\mu} <= T \tilde J + ( \alpha \delta / (1 - \alpha) ) e <= \tilde J + ( \delta / (1 - \alpha) ) e.
Assume that
    J^* - \epsilon e <= \tilde J <= J^* + \epsilon e,
for some scalar \epsilon. Then
    J_{\bar\mu} <= J^* + ( 2 \alpha \epsilon / (1 - \alpha) ) e.
APPLICATION TO ROLLOUT POLICIES
Let \mu^1, . . . , \mu^M be stationary policies, and let
    \tilde J(i) = min{ J_{\mu^1}(i), . . . , J_{\mu^M}(i) },  for all i.
Then, for all i and m = 1, . . . , M, we have
    (T \tilde J)(i) = min_{u \in U(i)} [ g(i, u) + \alpha \sum_{j=1}^n p_{ij}(u) \tilde J(j) ]
                  <= min_{u \in U(i)} [ g(i, u) + \alpha \sum_{j=1}^n p_{ij}(u) J_{\mu^m}(j) ]
                  <= J_{\mu^m}(i)
Taking the minimum over m,
    (T \tilde J)(i) <= \tilde J(i),  for all i.
Using the preceding slide result with \delta = 0,
    J_{\bar\mu}(i) <= \tilde J(i) = min{ J_{\mu^1}(i), . . . , J_{\mu^M}(i) },  for all i,
i.e., the rollout policy \bar\mu improves over each \mu^m.
APPROXIMATE POLICY ITERATION
Suppose that the policy evaluation is approximate, according to
    max_x | \tilde J_k(x) - J_{\mu^k}(x) | <= \delta,  k = 0, 1, . . . ,
and policy improvement is approximate, according to
    max_x | (T_{\mu^{k+1}} \tilde J_k)(x) - (T \tilde J_k)(x) | <= \epsilon,  k = 0, 1, . . . ,
where \delta and \epsilon are some positive scalars.
Error Bound: The sequence {\mu^k} generated by approximate policy iteration satisfies
    limsup_{k \to \infty} max_{x \in S} ( J_{\mu^k}(x) - J^*(x) ) <= ( \epsilon + 2 \alpha \delta ) / (1 - \alpha)^2
Typical practical behavior: The method makes steady progress up to a point and then the iterates J_{\mu^k} oscillate within a neighborhood of J^*.
For a contraction mapping F with modulus \rho and fixed point J^*, the iterates J_k = F J_{k-1} satisfy
    ||J_k - J^*|| <= \rho^k ||J_0 - J^*||,  k = 1, 2, . . . .
Similar result if F is an m-stage contraction mapping.
This is a special case of a general result for contraction mappings F : Y -> Y over normed vector spaces Y that are complete: every sequence {y_k} that is Cauchy (satisfies ||y_m - y_n|| -> 0 as m, n -> \infty) converges.
The space B(S) is complete (see the text for a proof).
A DP-LIKE CONTRACTION MAPPING I
Let S = {1, 2, . . .}, and let F : B(S) -> B(S) be a linear mapping of the form
    (FJ)(i) = b(i) + \sum_{j \in S} a(i, j) J(j),  for all i,
where b(i) and a(i, j) are some scalars. Then F is a contraction with modulus \rho if
    \sum_{j \in S} | a(i, j) | v(j) <= \rho v(i),  for all i
Let F : B(S) -> B(S) be a mapping of the form
    (FJ)(i) = min_{\mu \in M} (F_\mu J)(i),  for all i,
where M is a parameter set, and for each \mu \in M, F_\mu is a contraction mapping from B(S) to B(S) with modulus \rho. Then F is a contraction mapping with modulus \rho.
A DP-LIKE CONTRACTION MAPPING II
Let S = {1, 2, . . .}, let M be a parameter set, and for each \mu \in M, let
    (F_\mu J)(i) = b(i, \mu) + \sum_{j \in S} a(i, j, \mu) J(j),  for all i
Consider the mapping F
    (FJ)(i) = min_{\mu \in M} (F_\mu J)(i),  for all i
We have FJ \in B(S) for all J \in B(S), provided b \in B(S) and V \in B(S), where
    b = ( b(1), b(2), . . . ),   V = ( V(1), V(2), . . . ),
with b(i) = max_{\mu \in M} | b(i, \mu) | and V(i) = max_{\mu \in M} \sum_{j \in S} | a(i, j, \mu) | v(j).
DISCOUNTED DP - UNBOUNDED COST I
State space S = {1, 2, . . .}, transition probabilities p_{ij}(u), cost g(i, u).
Weighted sup-norm
    ||J|| = max_{i \in S} |J(i)| / v_i
on B(S): sequences { J(i) } such that ||J|| < \infty.
Assumptions:
(a) G = ( G(1), G(2), . . . ) \in B(S), where
    G(i) = max_{u \in U(i)} | g(i, u) |,  for all i
(b) V = ( V(1), V(2), . . . ) \in B(S), where
    V(i) = max_{u \in U(i)} \sum_{j \in S} p_{ij}(u) v_j,  for all i
(c) There exists an integer m >= 1 and a scalar \rho \in (0, 1) such that for every policy \pi,
    \alpha^m \sum_{j \in S} P( x_m = j | x_0 = i, \pi ) v_j / v_i <= \rho,  for all i
DISCOUNTED DP - UNBOUNDED COST II
Example: Let v_i = i for all i = 1, 2, . . .
Assumption (a) is satisfied if the maximum expected absolute cost per stage at state i grows no faster than linearly with i.
Assumption (b) states that the maximum expected next state following state i,
    max_{u \in U(i)} E{ j | i, u },
also grows no faster than linearly with i.
Assumption (c) is satisfied if
    \alpha^m \sum_{j \in S} P( x_m = j | x_0 = i, \pi ) j <= \rho i,  for all i
It requires that for all \pi, the expected value of the state obtained m stages after reaching state i is no more than ( \rho / \alpha^m ) i.
If there is bounded upward expected change of the state starting at i, there exists m sufficiently large so that Assumption (c) is satisfied.
DP mappings:
    (T_\mu J)(i) = g( i, \mu(i) ) + \alpha \sum_{j \in S} p_{ij}( \mu(i) ) J(j),  for all i,
    (TJ)(i) = min_{u \in U(i)} [ g(i, u) + \alpha \sum_{j \in S} p_{ij}(u) J(j) ],  for all i
Proposition: Under the earlier assumptions, T and T_\mu map B(S) into B(S), and are m-stage contraction mappings with modulus \rho.
The m-stage contraction properties can be used to essentially replicate the analysis for the case of bounded cost, and to show the standard results:
    The value iteration method J_{k+1} = T J_k converges to the unique solution J^* of Bellman's equation J = TJ.
    The unique solution J^* of Bellman's equation is the optimal cost function.
    A stationary policy \mu is optimal if and only if T_\mu J^* = T J^*.
6.231 DYNAMIC PROGRAMMING
LECTURE 19
LECTURE OUTLINE
Undiscounted problems
Stochastic shortest path problems (SSP)
Proper and improper policies
Analysis and computational methods for SSP
Pathologies of SSP
UNDISCOUNTED PROBLEMS
System: x_{k+1} = f(x_k, u_k, w_k)
Cost of a policy \pi = {\mu_0, \mu_1, . . .}
    J_\pi(x_0) = lim_{N \to \infty} E_{w_k, k=0,1,...} { \sum_{k=0}^{N-1} g( x_k, \mu_k(x_k), w_k ) }
Shorthand notation for DP mappings
    (TJ)(x) = min_{u \in U(x)} E_w { g(x, u, w) + J( f(x, u, w) ) },  for all x
For any stationary policy \mu
    (T_\mu J)(x) = E_w { g( x, \mu(x), w ) + J( f(x, \mu(x), w) ) },  for all x
Neither T nor T_\mu are contractions in general, but their monotonicity is helpful.
SSP problems provide a "soft boundary" between the easy finite-state discounted problems and the hard undiscounted problems.
    They share features of both.
    Some of the nice theory is recovered because of the termination state.
SSP THEORY SUMMARY I
As earlier, we have a cost-free termination state t, a finite number of states 1, . . . , n, and a finite number of controls, but we will make weaker assumptions.
Mappings T and T_\mu (modified to account for termination state t):
    (TJ)(i) = min_{u \in U(i)} [ g(i, u) + \sum_{j=1}^n p_{ij}(u) J(j) ],  i = 1, . . . , n,
    (T_\mu J)(i) = g( i, \mu(i) ) + \sum_{j=1}^n p_{ij}( \mu(i) ) J(j),  i = 1, . . . , n.
Definition: A stationary policy \mu is called proper, if under \mu, from every state i, there is a positive probability path that leads to t.
Important fact: If \mu is proper, T_\mu is a contraction with respect to some weighted max norm:
    max_i (1/v_i) | (T_\mu J)(i) - (T_\mu J')(i) | <= \rho_\mu max_i (1/v_i) | J(i) - J'(i) |
T is similarly a contraction if all \mu are proper (the case discussed in the text, Ch. 7, Vol. I).
SSP THEORY SUMMARY II
The theory can be pushed one step further. Assume that:
(a) There exists at least one proper policy
(b) For each improper \mu, J_\mu(i) = \infty for some state i
Then T^k J -> J^* for all J (holds by the theory of Vol. I, Section 7.2).
A stationary \mu satisfying J >= T_\mu J for some J must be proper - true because
    J >= T_\mu^k J = P_\mu^k J + \sum_{m=0}^{k-1} P_\mu^m g_\mu
and some component of the term on the right blows up if \mu is improper (by our assumptions).
Consequence: T can have at most one fixed point.
Proof: If J and J' are two solutions, select \mu and \mu' such that J = TJ = T_\mu J and J' = TJ' = T_{\mu'} J'. Also
    J = T_\mu^k J <= T_{\mu'}^k J -> J_{\mu'} = J'
Similarly, J' <= J, so J = J'.
SSP ANALYSIS II
We now show that T has a fixed point, and also that policy iteration converges.
Generate a sequence {\mu^k} by policy iteration starting from a proper policy \mu^0.
\mu^1 is proper and J_{\mu^0} >= J_{\mu^1} since
    J_{\mu^0} = T_{\mu^0} J_{\mu^0} >= T J_{\mu^0} = T_{\mu^1} J_{\mu^0} >= T_{\mu^1}^k J_{\mu^0} >= J_{\mu^1}
Thus { J_{\mu^k} } is nonincreasing, some policy \bar\mu will be repeated, with J_{\bar\mu} = T J_{\bar\mu}. So J_{\bar\mu} is a fixed point of T.
Next show T^k J -> J_{\bar\mu} for all J, i.e., value iteration converges to the same limit as policy iteration. (Sketch: True if J = J_{\bar\mu}; moreover, for any \pi = {\mu_0, \mu_1, . . .},
    T_{\mu_0} \cdots T_{\mu_{k-1}} J_0 >= T^k J_0,
where J_0 \equiv 0. Take lim sup as k -> \infty, to obtain J_\pi >= J_{\bar\mu}, so \bar\mu is optimal and J_{\bar\mu} = J^*.)
SSP ANALYSIS III
If all policies are proper (the assumption of Section 7.1, Vol. I), T_\mu and T are contractions with respect to a weighted sup norm.
Proof: Consider a new SSP problem where the transition probabilities are the same as in the original, but the transition costs are all equal to -1. Let \hat J be the corresponding optimal cost vector. For all \mu,
    \hat J(i) = -1 + min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \hat J(j) <= -1 + \sum_{j=1}^n p_{ij}( \mu(i) ) \hat J(j)
For v_i = -\hat J(i), we have v_i >= 1, and for all \mu,
    \sum_{j=1}^n p_{ij}( \mu(i) ) v_j <= v_i - 1 <= \rho v_i,  i = 1, . . . , n,
where
    \rho = max_{i=1,...,n} (v_i - 1) / v_i < 1.
This implies contraction of T_\mu and T by the results of the preceding lecture.
PATHOLOGIES I: DETERM. SHORTEST PATHS
If there is a cycle with cost = 0, Bellman's equation has an infinite number of solutions. Example:
[Figure: nodes 1, 2, t; arcs 1 -> 2 and 2 -> 1 with cost 0, arc 2 -> t with cost 1.]
We have J^*(1) = J^*(2) = 1.
Bellman's equation is
    J(1) = J(2),   J(2) = min[ J(1), 1 ].
It has J^* as a solution.
Set of solutions of Bellman's equation:
    { J | J(1) = J(2) <= 1 }.
PATHOLOGIES II: DETERM. SHORTEST PATHS
If there is a cycle with cost < 0, Bellman's equation has no solution [among functions J with -\infty < J(i) < \infty for all i]. Example:
[Figure: nodes 1, 2, t; arc 1 -> 2 with cost 0, arc 2 -> 1 with cost -1, arc 2 -> t with cost 1.]
We have J^*(1) = J^*(2) = -\infty.
Bellman's equation is
    J(1) = J(2),   J(2) = min[ -1 + J(1), 1 ].
There is no solution [among functions J with -\infty < J(i) < \infty for all i].
Bellman's equation has as solution J^*(1) = J^*(2) = -\infty [within the larger class of functions that can take the value -\infty].
A further pathology (a control u affecting the termination probability at state 1):
    J_\mu(1) = -u + (1 - u^2) J_\mu(1),
from which J_\mu(1) = -1/u.
Thus J^*(i) <= \tilde J(i, r) may hold for some states i and J^*(i) >= \tilde J(i, r) for others.
Principal example: Subspace approximation
    \tilde J(i, r) = \phi(i)' r = \sum_{k=1}^s \phi_k(i) r_k
where \phi_1, . . . , \phi_s are basis functions spanning an s-dimensional subspace of R^n.
Key issue: How to optimize r with low-/s-dimensional operations only.
Other than manual/trial-and-error approaches (e.g., as in computer chess), the only other approaches are simulation-based. They are collectively known as neuro-dynamic programming or reinforcement learning.
APPROX. IN VALUE SPACE - APPROACHES
Policy evaluation/Policy improvement
    Uses simulation algorithms to approximate the cost J_\mu of the current policy \mu
Approximation of the optimal cost function J^*
    Q-Learning: Use a simulation algorithm to approximate the optimal costs J^*(i) or the Q-factors
        Q^*(i, u) = g(i, u) + \alpha \sum_{j=1}^n p_{ij}(u) J^*(j)
    Bellman error approach: Find r to
        min_r E_i { ( \tilde J(i, r) - (T \tilde J)(i, r) )^2 },
    where E_i{.} is taken with respect to some distribution
    Approximate LP (discussed earlier - supplemented with clever schemes to overcome the large-number-of-constraints issue)
POLICY EVALUATE/POLICY IMPROVE
An example:
[Figure: block diagram linking a system simulator, a decision generator producing decision \mu(i) at state i, and a cost-to-go approximator supplying values \tilde J(j, r) to a least-squares optimization.]
The least squares optimization may be replaced by a different algorithm.
POLICY EVALUATE/POLICY IMPROVE I
Approximate the cost of the current policy by using a simulation method.
    Direct policy evaluation - Cost samples generated by simulation, and optimization by least squares
    Indirect policy evaluation - solving the projected equation \Phi r = \Pi T_\mu(\Phi r), where \Pi is projection w/ respect to a suitable weighted Euclidean norm
[Figure: the direct method projects the cost vector J_\mu onto the subspace S spanned by the basis functions; the indirect method solves the projected form \Phi r = \Pi T_\mu(\Phi r) of Bellman's equation.]
Batch and incremental methods
Regular and optimistic policy iteration
POLICY EVALUATE/POLICY IMPROVE II
Projected equation methods are preferred and have rich theory
TD(\lambda): Stochastic iterative algorithm for solving \Phi r = \Pi T_\mu(\Phi r)
LSPE(\lambda): A simulation-based form of projected value iteration
    \Phi r_{k+1} = \Pi T_\mu(\Phi r_k) + simulation noise
[Figure: projected value iteration (PVI) applies the value iterate T(\Phi r_k) = g + \alpha P \Phi r_k and projects it on the subspace S spanned by the basis functions; least squares policy evaluation (LSPE) does the same up to a simulation error.]
LSTD(\lambda): Solves a simulation-based approximation \Phi r = \hat\Pi T_\mu(\Phi r), where \hat\Pi is the projection associated with some probability distribution ( q(1), . . . , q(n) ) over the states.
PROBLEM APPROXIMATION - AGGREGATION
Another major idea in ADP is to approximate the cost-to-go function of the problem with the cost-to-go function of a simpler problem. The simplification is often ad-hoc/problem dependent.
Aggregation is a (semi-)systematic approach for problem approximation. Main elements:
    Introduce a few "aggregate" states, viewed as the states of an "aggregate" system
    Define transition probabilities and costs of the aggregate system, by associating multiple states of the original system with each aggregate state
    Solve (exactly or approximately) the "aggregate" problem by any kind of value or policy iteration method (including simulation-based methods, such as Q-learning)
    Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem
Example (Hard Aggregation): We are given a partition of the state space into subsets of states, and each subset is viewed as an aggregate state (each state belongs to one and only one subset).
AGGREGATION/DISAGGREGATION PROBS
The aggregate system transition probabilities are defined via two (somewhat arbitrary) choices:
For each original system state i and aggregate state m, the aggregation probability a_{im}
    This may be roughly interpreted as the "degree of membership of i in the aggregate state m."
    In the hard aggregation example, a_{im} = 1 if state i belongs to aggregate state/subset m.
For each aggregate state m and original system state i, the disaggregation probability d_{mi}
    This may be roughly interpreted as the "degree to which i is representative of m."
    In the hard aggregation example (assuming all states that belong to aggregate state/subset m are equally representative), d_{mi} = 1/|m| for each state i that belongs to aggregate state/subset m, where |m| is the cardinality (number of states) of m.
AGGREGATION EXAMPLES
Hard aggregation (each original system state is associated with one aggregate state):
[Figure: original system states i, j with transition probabilities p_{ij}(u); disaggregation probabilities from each aggregate state to the original states it contains (e.g., 1/3, 1/4), and aggregation probabilities equal to 1 into the containing aggregate state.]
Soft aggregation (each original system state is associated with multiple aggregate states):
[Figure: as above, but the aggregation probabilities may split among several aggregate states (e.g., 1/2, 1/3, 2/3).]
Coarse grid (each aggregate state is an original system state):
[Figure: as above, with the aggregate states being a subset of the original states ("grid" states) and aggregation probabilities splitting among them.]
AGGREGATE TRANSITION PROBABILITIES
Let the aggregation and disaggregation probabilities, a_{im} and d_{mi}, and the original transition probabilities p_{ij}(u) be given.
The transition probability from aggregate state m to aggregate state n under u is
    q_{mn}(u) = \sum_i \sum_j d_{mi} p_{ij}(u) a_{jn},
and the transition cost is similarly defined.
This corresponds to a probabilistic process that can be simulated as follows:
    From aggregate state m, generate original state i according to d_{mi}.
    Generate a transition from i to j according to p_{ij}(u), with cost g(i, u, j).
    From original state j, generate aggregate state n according to a_{jn}.
After solving for the optimal costs \hat J(m) of the aggregate problem, the costs of the original problem are approximated by
    \tilde J(i) = \sum_m a_{im} \hat J(m)
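A sketch of the formula above: building the aggregate transition matrix q_mn(u) for a fixed control u from the disaggregation matrix d, the original transition matrix p, and the aggregation matrix a, using numpy. The matrix layout is an assumption of this sketch.

import numpy as np

def aggregate_transition_matrix(d, p_u, a):
    # d: (num_agg, n) disaggregation probabilities d_mi; p_u: (n, n) original transition
    # probabilities p_ij(u) for a fixed u; a: (n, num_agg) aggregation probabilities a_jn.
    # Entry (m, n) of the result is q_mn(u) = sum_i sum_j d_mi p_ij(u) a_jn.
    return d @ p_u @ a

def approximate_costs(a, J_hat):
    # J~(i) = sum_m a_im * J_hat(m)
    return a @ J_hat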
6.231 DYNAMIC PROGRAMMING
LECTURE 21
LECTURE OUTLINE
Discounted problems - Approximate policy evaluation/policy improvement
Direct approach - Least squares
Batch and incremental gradient methods
Implementation using TD
Optimistic policy iteration
Exploration issues
THEORETICAL BASIS
If policies are approximately evaluated using an approximation architecture:
    max_i | \tilde J(i, r_k) - J_{\mu^k}(i) | <= \delta,  k = 0, 1, . . .
If policy improvement is also approximate,
    max_i | (T_{\mu^{k+1}} \tilde J)(i, r_k) - (T \tilde J)(i, r_k) | <= \epsilon,  k = 0, 1, . . .
Error Bound: The sequence {\mu^k} generated by approximate policy iteration satisfies
    limsup_{k \to \infty} max_i ( J_{\mu^k}(i) - J^*(i) ) <= ( \epsilon + 2 \alpha \delta ) / (1 - \alpha)^2
Typical practical behavior: The method makes steady progress up to a point and then the iterates J_{\mu^k} oscillate within a neighborhood of J^*.
Direct policy evaluation: approximate J_\mu(i) by averaging cost samples,
    J_\mu(i) \approx (1 / M_i) \sum_{m=1}^{M_i} c(i, m),
    c(i, m): mth (noisy) sample cost starting from state i
Approximating well each J_\mu(i) is impractical for a large state space. Instead, use a compact representation \tilde J(i, r) and solve the least squares problem
    min_r \sum_i \sum_{m=1}^{M_i} ( c(i, m) - \tilde J(i, r) )^2
Note that this is much easier when the architecture is linear - but this is not a requirement.
SIMULATION-BASED DIRECT APPROACH
[Figure: block diagram linking a system simulator, a decision generator producing \mu(i) at state i, a cost-to-go approximator supplying values \tilde J(j, r), and a least-squares optimization.]
Simulator: Given a state-control pair (i, u), generates the next state j using the system's transition probabilities under the policy \mu currently evaluated
Decision generator: Generates the control \mu(i) of the evaluated policy at the current state i
Cost-to-go approximator: \tilde J(., r)
BATCH GRADIENT METHOD I
Focus on a batch: an N-transition portion (i_0, . . . , i_N) of a simulated trajectory.
We view the numbers
    \sum_{t=k}^{N-1} \alpha^{t-k} g( i_t, \mu(i_t), i_{t+1} ),  k = 0, . . . , N - 1,
as cost samples, one per initial state i_0, . . . , i_{N-1}.
Least squares problem
    min_r (1/2) \sum_{k=0}^{N-1} ( \tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g( i_t, \mu(i_t), i_{t+1} ) )^2
Gradient iteration
    r := r - \gamma \sum_{k=0}^{N-1} \nabla \tilde J(i_k, r) ( \tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g( i_t, \mu(i_t), i_{t+1} ) )
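A sketch of the batch gradient iteration above for a linear architecture J~(i, r) = phi(i)'r, in which case the gradient of J~ with respect to r is simply phi(i); the stepsize and feature map are assumptions of this sketch.

import numpy as np

def batch_gradient_update(r, traj, costs, phi, alpha, gamma):
    # traj: states (i_0, ..., i_{N-1}) of the batch; costs[k]: g(i_k, mu(i_k), i_{k+1});
    # phi(i): feature vector; alpha: discount factor; gamma: stepsize.
    N = len(traj)
    grad = np.zeros_like(r)
    for k in range(N):
        # Discounted cost sample from i_k to the end of the batch.
        sample = sum(alpha ** (t - k) * costs[t] for t in range(k, N))
        grad += phi(traj[k]) * (phi(traj[k]) @ r - sample)
    return r - gamma * grad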
BATCH GRADIENT METHOD II
Important tradeoff:
    In order to reduce simulation error and to obtain cost samples for a representatively large subset of states, we must use a large N
    To keep the work per gradient iteration small, we must use a small N
To address the issue of size of N, small batches may be used and changed after one or more iterations.
Then the method becomes susceptible to simulation noise - it requires a diminishing stepsize for convergence.
This slows down the convergence (which can be very slow for a gradient method even without noise).
Theoretical convergence is guaranteed (with a diminishing stepsize) under reasonable conditions, but in practice this is not much of a guarantee.
INCREMENTAL GRADIENT METHOD I
Again focus on an N-transition portion (i_0, . . . , i_N) of a simulated trajectory.
The batch gradient method processes the N transitions all at once, and updates r using the gradient iteration.
The incremental method updates r a total of N times, once after each transition.
After each transition (i_k, i_{k+1}) it uses only the portion of the gradient affected by that transition:
    Evaluate the (single-term) gradient \nabla \tilde J(i_k, r) at the current value of r (call it r_k).
    Sum all the terms that involve the transition (i_k, i_{k+1}), and update r_k by making a correction along their sum:
        r_{k+1} = r_k - \gamma_k ( \nabla \tilde J(i_k, r_k) \tilde J(i_k, r_k)
                  - ( \sum_{t=0}^k \alpha^{k-t} \nabla \tilde J(i_t, r_t) ) g( i_k, \mu(i_k), i_{k+1} ) )
INCREMENTAL GRADIENT METHOD II
After N transitions, all the component gradient terms of the batch iteration are accumulated.
BIG difference:
    In the incremental method, r is changed while processing the batch - the (single-term) gradient \nabla \tilde J(i_t, r) is evaluated at the most recent value of r [after the transition (i_t, i_{t+1})].
    In the batch version these gradients are evaluated at the value of r prevailing at the beginning of the batch.
Because r is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant.
Can have very long batches - can have a single very long simulated trajectory and a single batch.
The incremental version can be implemented more flexibly, and converges much faster in practice.
Interesting convergence analysis (beyond our scope - see Bertsekas and Tsitsiklis, NDP book, also the paper in SIAM J. on Optimization, 2000).
Implementation using temporal differences: define
    d_k = g( i_k, \mu(i_k), i_{k+1} ) + \alpha \tilde J(i_{k+1}, r) - \tilde J(i_k, r),  k <= N - 2,
    d_{N-1} = g( i_{N-1}, \mu(i_{N-1}), i_N ) - \tilde J(i_{N-1}, r)
Following the transition (i_k, i_{k+1}), set
    r_{k+1} = r_k + \gamma_k d_k \sum_{t=0}^k \alpha^{k-t} \nabla \tilde J(i_t, r_t)
This algorithm is known as TD(1). In the important linear case \tilde J(i, r) = \phi(i)' r, it becomes
    r_{k+1} = r_k + \gamma_k d_k \sum_{t=0}^k \alpha^{k-t} \phi(i_t)
A variant of TD(1) is TD(\lambda), \lambda \in [0, 1]. It sets
    r_{k+1} = r_k + \gamma_k d_k \sum_{t=0}^k (\alpha \lambda)^{k-t} \phi(i_t)
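A sketch of the TD(lambda) update above for the linear case, keeping the running sum over t of (alpha*lambda)^(k-t) phi(i_t) as an eligibility trace; the stepsizes, feature map, and handling of the last transition are assumptions of this sketch.

import numpy as np

def td_lambda_episode(states, costs, phi, r, alpha, lam, gammas):
    # states: (i_0, ..., i_N); costs[k]: g(i_k, mu(i_k), i_{k+1}); gammas[k]: stepsizes.
    z = np.zeros_like(r)                                   # eligibility trace
    for k in range(len(states) - 1):
        z = alpha * lam * z + phi(states[k])               # sum_t (alpha*lam)^(k-t) phi(i_t)
        # Temporal difference d_k evaluated at the current parameter vector.
        d = costs[k] + alpha * phi(states[k + 1]) @ r - phi(states[k]) @ r
        r = r + gammas[k] * d * z
    return r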
OPTIMISTIC POLICY ITERATION
We have assumed so far that the least squares optimization must be solved completely for r.
An alternative, known as optimistic policy iteration, is to solve this problem approximately and replace policy \mu with policy \bar\mu after only a few simulation samples.
Extreme possibility is to replace \mu with \bar\mu at the end of each state transition: After state transition (i_k, i_{k+1}), set
    r_{k+1} = r_k + \gamma_k d_k \sum_{t=0}^k (\alpha \lambda)^{k-t} \nabla \tilde J(i_t, r_t),
and simulate the next transition (i_{k+1}, i_{k+2}) using \bar\mu(i_{k+1}), the control of the new policy.
For \lambda = 0, we obtain (the popular) optimistic TD(0), which has the simple form
    r_{k+1} = r_k + \gamma_k d_k \nabla \tilde J(i_k, r_k)
Optimistic policy iteration can exhibit fascinating and counterintuitive behavior (see the NDP book by Bertsekas and Tsitsiklis, Section 6.4.2).
THE ISSUE OF EXPLORATION
To evaluate a policy \mu, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under \mu.
As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate. This seriously impacts the improved policy \bar\mu.
This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system).
One possibility to guarantee adequate exploration: Frequently restart the simulation and ensure that the initial states employed form a rich and representative subset.
Another possibility: Occasionally generate transitions that use a randomly selected control rather than the one dictated by the policy \mu.
Other methods, to be discussed later, use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).
APPROXIMATING Q-FACTORS
The approach described so far for policy evaluation requires calculating expected values for all controls u \in U(i) [and knowledge of p_{ij}(u)].
Model-free alternative: Approximate Q-factors
    \tilde Q(i, u, r) \approx \sum_{j=1}^n p_{ij}(u) ( g(i, u, j) + \alpha J_\mu(j) ),
and use for policy improvement the minimization
    \bar\mu(i) = arg min_{u \in U(i)} \tilde Q(i, u, r)
r is an adjustable parameter vector and \tilde Q(i, u, r) is a parametric architecture, such as
    \tilde Q(i, u, r) = \sum_{k=1}^m r_k \phi_k(i, u)
Can use any method for constructing cost approximations, e.g., TD(\lambda).
Use the Markov chain with states (i, u): p_{ij}( \mu(i) ) is the transition prob. to ( j, \mu(j) ), and 0 to other (j, u').
Major concern: Acutely diminished exploration.
6.231 DYNAMIC PROGRAMMING
LECTURE 22
LECTURE OUTLINE
Discounted problems - Approximate policy evaluation/policy improvement
Indirect approach - The projected equation
Contraction properties - Error bounds
PVI (Projected Value Iteration)
LSPE (Least Squares Policy Evaluation)
Tetris - A case study
[Figure: approximate policy iteration loop - guess an initial policy, evaluate its approximate cost, then improve the policy.]
Evaluate the approximate cost of the current policy as \tilde J_\mu(r) = \Phi r, where \Phi is a full rank n x s matrix with columns the basis functions, and ith row denoted \phi(i)'.
Policy improvement
    \bar\mu(i) = arg min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) ( g(i, u, j) + \alpha \phi(j)' r )
Indirect methods find \Phi r by solving a projected equation.
WEIGHTED EUCLIDEAN PROJECTIONS
Consider a weighted Euclidean norm
    ||J||_v = sqrt( \sum_{i=1}^n v_i ( J(i) )^2 ),
where v is a vector of positive weights v_1, . . . , v_n.
Let \Pi denote the projection operation onto the subspace
    S = { \Phi r | r \in R^s }
with respect to this norm, i.e., for any J \in R^n,
    \Pi J = \Phi r_J,
where
    r_J = arg min_{r \in R^s} ||J - \Phi r||_v,
and r_J can be written explicitly:
    \Pi = \Phi (\Phi' V \Phi)^{-1} \Phi' V,   r_J = (\Phi' V \Phi)^{-1} \Phi' V J,
where V is the diagonal matrix with v_i, i = 1, . . . , n, along the diagonal.
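A sketch of the explicit projection formulas above using numpy; Phi is the n x s basis matrix and v a vector of positive weights (assumptions of this sketch).

import numpy as np

def weighted_projection(Phi, v, J):
    # Returns (r_J, Pi_J) with r_J = (Phi' V Phi)^{-1} Phi' V J and Pi_J = Phi r_J,
    # where V = diag(v).
    V = np.diag(v)
    r_J = np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ J)
    return r_J, Phi @ r_J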
THE PROJECTED BELLMAN EQUATION
For a fixed policy \mu to be evaluated, consider the corresponding mapping T:
    (TJ)(i) = \sum_{j=1}^n p_{ij} ( g(i, j) + \alpha J(j) ),  i = 1, . . . , n,
or more compactly,
    TJ = g + \alpha P J
The solution J_\mu of Bellman's equation J = TJ is approximated by the solution of
    \Phi r = \Pi T(\Phi r)
[Figure: the value iterate T(\Phi r) is projected on the subspace S spanned by the basis functions; the indirect method solves this projected form of Bellman's equation on S.]
KEY QUESTIONS AND RESULTS
Does the projected equation have a solution?
Under what conditions is the mapping ΠT a contraction, so that ΠT has a unique fixed point?
Assuming ΠT has a unique fixed point Φr*, how close is Φr* to J_μ?
Assumption: P has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive:
ξ_j = lim_{N→∞} (1/N) Σ_{k=1}^N P(i_k = j | i_0 = i) > 0,  j = 1, . . . , n
Proposition: ΠT is a contraction of modulus α with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ = (ξ_1, . . . , ξ_n) is the steady-state probability vector. The unique fixed point Φr* of ΠT satisfies
‖J_μ − Φr*‖_ξ ≤ (1/√(1 − α²)) ‖J_μ − ΠJ_μ‖_ξ
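For a model small enough to write down P, g, Φ, and ξ explicitly, the fixed point Φr* of ΠT can be computed directly from the normal equations Φ′Ξ(I − αP)Φ r = Φ′Ξ g of the projection, with Ξ = diag(ξ). The NumPy sketch below uses a randomly generated model purely for illustration and reports the weighted error ‖J_μ − Φr*‖_ξ.

import numpy as np

n, s, alpha = 4, 2, 0.9
rng = np.random.default_rng(2)
P = rng.uniform(size=(n, n)); P /= P.sum(axis=1, keepdims=True)   # transition matrix
g = rng.uniform(size=n)                                           # expected one-stage cost
Phi = rng.standard_normal((n, s))

# steady-state distribution xi: left eigenvector of P for eigenvalue 1
w, V = np.linalg.eig(P.T)
xi = np.abs(np.real(V[:, np.argmin(np.abs(w - 1))])); xi /= xi.sum()
Xi = np.diag(xi)

# projected equation Phi r = Pi T(Phi r)  <=>  Phi' Xi (I - alpha P) Phi r = Phi' Xi g
r_star = np.linalg.solve(Phi.T @ Xi @ (np.eye(n) - alpha * P) @ Phi, Phi.T @ Xi @ g)
J_mu = np.linalg.solve(np.eye(n) - alpha * P, g)                  # exact cost of the policy

err = np.sqrt(xi @ (J_mu - Phi @ r_star) ** 2)                    # ||J_mu - Phi r*||_xi
print("weighted approximation error:", err)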
ANALYSIS
Important property of the projection Π on S with weighted Euclidean norm ‖·‖_v: for all J ∈ ℜ^n and all J̄ ∈ S, the Pythagorean Theorem holds:
‖J − J̄‖²_v = ‖J − ΠJ‖²_v + ‖ΠJ − J̄‖²_v
Proof: Geometrically, (J − ΠJ) and (ΠJ − J̄) are orthogonal in the scaled geometry of the norm ‖·‖_v, where two vectors x, y ∈ ℜ^n are orthogonal if Σ_{i=1}^n v_i x_i y_i = 0. Expand the quadratic in the RHS of
‖J − J̄‖²_v = ‖(J − ΠJ) + (ΠJ − J̄)‖²_v
The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,
‖ΠJ − ΠJ̄‖_v ≤ ‖J − J̄‖_v,  for all J, J̄ ∈ ℜ^n.
To see this, note that
‖Π(J − J̄)‖²_v ≤ ‖Π(J − J̄)‖²_v + ‖(I − Π)(J − J̄)‖²_v = ‖J − J̄‖²_v
PROOF OF CONTRACTION PROPERTY
Lemma: We have
‖Pz‖_ξ ≤ ‖z‖_ξ,  z ∈ ℜ^n
Proof of lemma: Let p_ij be the components of P. For all z ∈ ℜ^n, we have
‖Pz‖²_ξ = Σ_{i=1}^n ξ_i ( Σ_{j=1}^n p_ij z_j )² ≤ Σ_{i=1}^n ξ_i Σ_{j=1}^n p_ij z_j²
= Σ_{j=1}^n ( Σ_{i=1}^n ξ_i p_ij ) z_j² = Σ_{j=1}^n ξ_j z_j² = ‖z‖²_ξ,
where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property Σ_{i=1}^n ξ_i p_ij = ξ_j of the steady-state probabilities.
Using the lemma, the nonexpansiveness of Π, and the definition TJ = g + αPJ, we have
‖ΠTJ − ΠTJ̄‖_ξ ≤ ‖TJ − TJ̄‖_ξ = α‖P(J − J̄)‖_ξ ≤ α‖J − J̄‖_ξ
for all J, J̄ ∈ ℜ^n. Hence ΠT is a contraction of modulus α.
PROOF OF ERROR BOUND
Let Φr* be the fixed point of ΠT. We have
‖J_μ − Φr*‖_ξ ≤ (1/√(1 − α²)) ‖J_μ − ΠJ_μ‖_ξ
Proof: We have
‖J_μ − Φr*‖²_ξ = ‖J_μ − ΠJ_μ‖²_ξ + ‖ΠJ_μ − Φr*‖²_ξ
= ‖J_μ − ΠJ_μ‖²_ξ + ‖ΠTJ_μ − ΠT(Φr*)‖²_ξ
≤ ‖J_μ − ΠJ_μ‖²_ξ + α²‖J_μ − Φr*‖²_ξ,
where the first equality uses the Pythagorean Theorem, the second equality holds because J_μ is the fixed point of T and Φr* is the fixed point of ΠT, and the inequality uses the contraction property of ΠT. From this relation, the result follows.
Note: The factor 1/√(1 − α²) in the RHS can be replaced by a factor that is smaller and computable. See
H. Yu and D. P. Bertsekas, "New Error Bounds for Approximations from Projected Linear Equations," Report LIDS-P-2797, MIT, July 2008.
PROJECTED VALUE ITERATION (PVI)
Given the contraction property of ΠT, we may consider the PVI method
Φr_{k+1} = ΠT(Φr_k)
[Figure: the value iterate T(Φr_k) = g + αPΦr_k is projected on S, the subspace spanned by the basis functions, to give Φr_{k+1}]
Question: Can we implement PVI using simulation, without the need for n-dimensional linear algebra calculations?
LSPE (Least Squares Policy Evaluation) is a simulation-based implementation of PVI.
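A minimal sketch of the PVI iteration Φr_{k+1} = ΠT(Φr_k) on a hypothetical small model, carried out with explicit n-dimensional linear algebra (exactly the computation that the simulation-based LSPE below is meant to avoid):

import numpy as np

n, s, alpha = 4, 2, 0.9
rng = np.random.default_rng(3)
P = rng.uniform(size=(n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.uniform(size=n)
Phi = rng.standard_normal((n, s))
w, V = np.linalg.eig(P.T)
xi = np.abs(np.real(V[:, np.argmin(np.abs(w - 1))])); xi /= xi.sum()
Xi = np.diag(xi)

M = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)   # r_J = M @ J: projection coefficients
r = np.zeros(s)
for k in range(200):
    r = M @ (g + alpha * P @ (Phi @ r))             # coefficients of Pi T(Phi r_k)
print("PVI fixed point r*:", r)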
LSPE - SIMULATION-BASED PVI
PVI, i.e., Φr_{k+1} = ΠT(Φr_k), can be written as
r_{k+1} = arg min_{r∈ℜ^s} ‖Φr − T(Φr_k)‖²_ξ,
from which, by setting the gradient to 0,
Σ_{i=1}^n ξ_i φ(i)φ(i)′ r_{k+1} = Σ_{i=1}^n ξ_i φ(i) Σ_{j=1}^n p_ij ( g(i, j) + αφ(j)′r_k )
For LSPE we generate an infinite trajectory (i_0, i_1, . . .) and update r_k after transition (i_k, i_{k+1}):
Σ_{t=0}^k φ(i_t)φ(i_t)′ r_{k+1} = Σ_{t=0}^k φ(i_t) ( g(i_t, i_{t+1}) + αφ(i_{t+1})′r_k )
LSPE can equivalently be written as
Σ_{i=1}^n ξ̂_{i,k} φ(i)φ(i)′ r_{k+1} = Σ_{i=1}^n ξ̂_{i,k} φ(i) Σ_{j=1}^n p̂_{ij,k} ( g(i, j) + αφ(j)′r_k )
where ξ̂_{i,k}, p̂_{ij,k} are the empirical frequencies of state i and transition (i, j), based on (i_0, . . . , i_{k+1}).
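A simulation-based counterpart, again on a hypothetical random model: it generates a single long trajectory under P, maintains the running sums appearing in the LSPE normal equations above, and updates r after every transition.

import numpy as np

n, s, alpha = 4, 2, 0.9
rng = np.random.default_rng(4)
P = rng.uniform(size=(n, n)); P /= P.sum(axis=1, keepdims=True)
G = rng.uniform(size=(n, n))            # g(i, j): cost of transition i -> j
Phi = rng.standard_normal((n, s))

B = 1e-6 * np.eye(s)                    # running sum of phi(i_t) phi(i_t)' (tiny regularization)
C = np.zeros((s, s))                    # running sum of phi(i_t) phi(i_{t+1})'
b = np.zeros(s)                         # running sum of phi(i_t) g(i_t, i_{t+1})
r = np.zeros(s)
i = 0
for k in range(20000):
    j = rng.choice(n, p=P[i])           # simulate transition (i_k, i_{k+1})
    B += np.outer(Phi[i], Phi[i])
    C += np.outer(Phi[i], Phi[j])
    b += Phi[i] * G[i, j]
    r = np.linalg.solve(B, b + alpha * C @ r)   # LSPE: r_{k+1} from the normal equations
    i = j
print("LSPE estimate of r*:", r)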
LSPE INTERPRETATION
LSPE can be written as PVI with simulation error:
Φr_{k+1} = ΠT(Φr_k) + e_k
where e_k diminishes to 0 as the empirical frequencies ξ̂_{i,k} and p̂_{ij,k} approach ξ_i and p_ij.
[Figure: side-by-side diagrams of Projected Value Iteration (PVI) and Least Squares Policy Evaluation (LSPE); in LSPE the projection of the value iterate T(Φr_k) = g + αPΦr_k onto S is perturbed by a simulation error]
Convergence proof is simple: Use the law of large numbers.
Optimistic LSPE: Changes policy prior to convergence - behavior can be very complicated.
EXAMPLE: TETRIS I
The state consists of the board position i, and the shape of the current falling block (astronomically large number of states).
It can be shown that all policies are proper!!
Use a linear approximation architecture with feature extraction
J̃(i, r) = Σ_{m=1}^s φ_m(i) r_m,
where r = (r_1, . . . , r_s) is the parameter vector and φ_m(i) is the value of the mth feature associated with i.
EXAMPLE: TETRIS II
Approximate policy iteration was implemented with the following features:
The height of each column of the wall
The difference of heights of adjacent columns
The maximum height over all wall columns
The number of holes on the wall
The number 1 (provides a constant offset)
Playing data was collected for a fixed value of the parameter vector r (and the corresponding policy); the policy was approximately evaluated by choosing r to match the playing data in some least-squares sense.
LSPE (its SSP version) was used for approximate policy evaluation.
Both regular and optimistic versions were used.
See: Bertsekas and Ioffe, "Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming," LIDS Report, 1996. Also the NDP book.
6.231 DYNAMIC PROGRAMMING
LECTURE 23
LECTURE OUTLINE
Review of indirect policy evaluation methods
Multistep methods, LSPE(λ)
LSTD(λ)
Q-learning
Q-learning with linear function approximation
Q-learning for optimal stopping problems
REVIEW: PROJECTED BELLMAN EQUATION
For a fixed policy μ to be evaluated, consider the corresponding mapping T:
(TJ)(i) = Σ_{j=1}^n p_ij ( g(i, j) + αJ(j) ),  i = 1, . . . , n,
or more compactly,
TJ = g + αPJ
The solution J_μ of Bellman's equation J = TJ is approximated by the solution of
Φr = ΠT(Φr)
[Figure: T(Φr) is projected on S, the subspace spanned by the basis functions, giving the fixed point Φr = ΠT(Φr)]
Indirect method: Solving a projected form of Bellman's equation.
PVI/LSPE
Key Result: ΠT is a contraction of modulus α with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ = (ξ_1, . . . , ξ_n) is the steady-state probability vector, and its unique fixed point Φr* satisfies
‖J_μ − Φr*‖_ξ ≤ (1/√(1 − α²)) ‖J_μ − ΠJ_μ‖_ξ
Projected Value Iteration (PVI): Φr_{k+1} = ΠT(Φr_k), which can be written as
r_{k+1} = arg min_{r∈ℜ^s} ‖Φr − T(Φr_k)‖²_ξ
or equivalently
r_{k+1} = arg min_{r∈ℜ^s} Σ_{i=1}^n ξ_i ( φ(i)′r − Σ_{j=1}^n p_ij ( g(i, j) + αφ(j)′r_k ) )²
LSPE (simulation-based approximation): We generate an infinite trajectory (i_0, i_1, . . .) and update r_k after transition (i_k, i_{k+1}):
r_{k+1} = arg min_{r∈ℜ^s} Σ_{t=0}^k ( φ(i_t)′r − g(i_t, i_{t+1}) − αφ(i_{t+1})′r_k )²
JUSTIFICATION OF PVI/LSPE CONNECTION
By writing the necessary optimality conditions for the least squares minimization, PVI can be written as
Σ_{i=1}^n ξ_i φ(i)φ(i)′ r_{k+1} = Σ_{i=1}^n ξ_i φ(i) Σ_{j=1}^n p_ij ( g(i, j) + αφ(j)′r_k )
Similarly, by writing the necessary optimality conditions for the least squares minimization, LSPE can be written as
Σ_{t=0}^k φ(i_t)φ(i_t)′ r_{k+1} = Σ_{t=0}^k φ(i_t) ( g(i_t, i_{t+1}) + αφ(i_{t+1})′r_k )
So LSPE is just PVI with the two expected values approximated by simulation-based averages.
Convergence follows by the law of large numbers.
The bottleneck in the rate of convergence is the law of large numbers/simulation error (PVI is a contraction with modulus α, and converges fast relative to simulation).
LEAST SQUARES TEMP. DIFFERENCES (LSTD)
Taking the limit in PVI, we see that the projected equation, Φr* = ΠT(Φr*), can be written as Ar* + b = 0, where
A = Σ_{i=1}^n ξ_i φ(i) ( α Σ_{j=1}^n p_ij φ(j) − φ(i) )′
b = Σ_{i=1}^n ξ_i φ(i) Σ_{j=1}^n p_ij g(i, j)
A, b are expected values that can be approximated by simulation: A_k → A, b_k → b, where
A_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) ( αφ(i_{t+1}) − φ(i_t) )′
b_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) g(i_t, i_{t+1})
LSTD method: Approximates r* as
r* ≈ r̂_k = −A_k^{-1} b_k
Conceptually very simple ... but less suitable for optimistic policy iteration (hard to transfer info from one policy evaluation to the next).
It can be shown that the convergence rate is the same for LSPE/LSTD (for large k, ‖r_k − r̂_k‖ << ‖r_k − r*‖).
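A matching LSTD sketch for the same kind of hypothetical model: it accumulates the sample averages A_k and b_k along a simulated trajectory and returns r̂_k = −A_k^{-1} b_k.

import numpy as np

n, s, alpha = 4, 2, 0.9
rng = np.random.default_rng(5)
P = rng.uniform(size=(n, n)); P /= P.sum(axis=1, keepdims=True)
G = rng.uniform(size=(n, n))                             # g(i, j)
Phi = rng.standard_normal((n, s))

A = np.zeros((s, s))
b = np.zeros(s)
i = 0
K = 20000
for t in range(K):
    j = rng.choice(n, p=P[i])
    A += np.outer(Phi[i], alpha * Phi[j] - Phi[i]) / K   # sample average A_k
    b += Phi[i] * G[i, j] / K                            # sample average b_k
    i = j
r_hat = -np.linalg.solve(A, b)                           # LSTD: r_hat = -A_k^{-1} b_k
print("LSTD estimate of r*:", r_hat)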
MULTISTEP METHODS
Introduce a multistep version of Bellman's equation, J = T^(λ) J, where for λ ∈ [0, 1),
T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}
Note that T^t is a contraction with modulus α^t, with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ is the steady-state probability vector of the Markov chain.
From this it follows that ΠT^(λ) is a contraction with modulus
α_λ = (1 − λ) Σ_{t=0}^∞ α^{t+1} λ^t = α(1 − λ)/(1 − αλ)
T^t and T^(λ) have the same fixed point J_μ and
‖J_μ − Φr*_λ‖_ξ ≤ (1/√(1 − α_λ²)) ‖J_μ − ΠJ_μ‖_ξ
where Φr*_λ is the fixed point of ΠT^(λ).
The fixed point Φr*_λ depends on λ.
Note that α_λ → 0 as λ → 1, so the error bound improves as λ → 1.
PVI(λ)
Φr_{k+1} = ΠT^(λ)(Φr_k) = Π (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}(Φr_k)
or
r_{k+1} = arg min_{r∈ℜ^s} ‖Φr − T^(λ)(Φr_k)‖²_ξ
Using algebra and the relation
(T^{t+1} J)(i) = E[ α^{t+1} J(i_{t+1}) + Σ_{k=0}^t α^k g(i_k, i_{k+1}) | i_0 = i ]
we can write PVI(λ) as
r_{k+1} = arg min_{r∈ℜ^s} Σ_{i=1}^n ξ_i ( φ(i)′r − φ(i)′r_k − Σ_{t=0}^∞ (αλ)^t E[ d_k(i_t, i_{t+1}) | i_0 = i ] )²
where
d_k(i_t, i_{t+1}) = g(i_t, i_{t+1}) + αφ(i_{t+1})′r_k − φ(i_t)′r_k
are the so-called temporal differences (TD) - they are the errors in satisfying Bellman's equation.
LSPE(λ)
Replacing the expected values defining PVI(λ) by simulation-based estimates, we obtain LSPE(λ). It has the form
r_{k+1} = arg min_{r∈ℜ^s} Σ_{t=0}^k ( φ(i_t)′r − φ(i_t)′r_k − Σ_{m=t}^k (αλ)^{m−t} d_k(i_m, i_{m+1}) )²
where (i_0, i_1, . . .) is an infinitely long trajectory generated by simulation.
Can be implemented with convenient incremental update formulas (see the text).
Note the λ-tradeoff:
As λ ↑ 1, the error bound improves.
As λ ↑ 1, the simulation noise in the LSPE(λ) iteration (2nd summation term) increases, so longer simulation trajectories are needed for LSPE(λ) to approximate well PVI(λ).
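A direct (non-incremental) sketch of one LSPE(λ) update, implementing the least-squares problem above for a given trajectory segment; the helper name and the toy data are hypothetical. The discounted TD sums are computed by the backward recursion s_t = d_t + αλ s_{t+1}.

import numpy as np

def lspe_lambda_step(traj, costs, Phi, r_k, alpha, lam):
    # traj  : states i_0, ..., i_k ; costs : g(i_t, i_{t+1}) for t = 0, ..., k-1
    k = len(costs)
    # temporal differences d_k(i_t, i_{t+1}), all evaluated at the current r_k
    d = [costs[t] + alpha * Phi[traj[t + 1]] @ r_k - Phi[traj[t]] @ r_k
         for t in range(k)]
    # discounted TD sums s_t = sum_{m=t}^{k-1} (alpha*lam)^(m-t) d_m, backward recursion
    s = np.zeros(k)
    acc = 0.0
    for t in reversed(range(k)):
        acc = d[t] + alpha * lam * acc
        s[t] = acc
    B = sum(np.outer(Phi[traj[t]], Phi[traj[t]]) for t in range(k))
    rhs = sum(Phi[traj[t]] * (Phi[traj[t]] @ r_k + s[t]) for t in range(k))
    return np.linalg.solve(B, rhs)      # r_{k+1} from the normal equations

# hypothetical usage on random data
rng = np.random.default_rng(6)
Phi = rng.standard_normal((4, 2))
print(lspe_lambda_step([0, 2, 1, 3, 0], [1.0, 0.5, 2.0, 0.3],
                       Phi, np.zeros(2), alpha=0.9, lam=0.7))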
Q-LEARNING I
Q-learning has two motivations:
Dealing with multiple policies simultaneously
Using a model-free approach [no need to know p_ij(u) explicitly, only to simulate them]
The Q-factors are defined by
Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ*(j) ),  for all (i, u)
In view of J* = TJ*, we have J*(i) = min_{u∈U(i)} Q*(i, u), so the Q-factors solve the equation
Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q*(j, u′) ),  for all (i, u)
Q*(i, u) can be shown to be the unique solution of this equation. Reason: This is Bellman's equation for a system whose states are the original states 1, . . . , n, together with all the pairs (i, u).
Value iteration:
Q(i, u) := Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q(j, u′) ),  for all (i, u)
Q-LEARNING II
Use any probabilistic mechanism to select the sequence of pairs (i_k, u_k) [all pairs (i, u) are chosen infinitely often], and for each k, select j_k according to p_{i_k j}(u_k).
At each k, the Q-learning algorithm updates Q(i_k, u_k) according to
Q(i_k, u_k) := ( 1 − γ_k(i_k, u_k) ) Q(i_k, u_k) + γ_k(i_k, u_k) ( g(i_k, u_k, j_k) + α min_{u′∈U(j_k)} Q(j_k, u′) )
The stepsize γ_k(i_k, u_k) must converge to 0 at a proper rate (e.g., like 1/k).
Important mathematical point: In the Q-factor version of Bellman's equation the order of expectation and minimization is reversed relative to the ordinary cost version of Bellman's equation:
J*(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ*(j) )
Q-learning can be shown to converge to the true/exact Q-factors (a sophisticated proof).
Major drawback: The large number of pairs (i, u) - no function approximation is used.
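A tabular Q-learning sketch along these lines (the sampling mechanism, the function names, and the random test MDP are hypothetical; here the pair (i_k, u_k) is drawn uniformly, and the stepsize is 1 over the visit count of the pair):

import numpy as np

def q_learning(sample_next, cost, n, U, alpha, num_iters, rng):
    # sample_next(i, u) -> j simulates a transition according to p_ij(u)
    # cost(i, u, j) is the one-stage cost g(i, u, j); U[i] lists the controls at i
    Q = np.zeros((n, max(len(U[i]) for i in range(n))))
    counts = np.zeros_like(Q)
    for k in range(num_iters):
        i = rng.integers(n)                    # any mechanism visiting all (i, u) often
        u = rng.choice(U[i])
        j = sample_next(i, u)
        counts[i, u] += 1
        gamma = 1.0 / counts[i, u]             # stepsize ~ 1 / (visit count)
        target = cost(i, u, j) + alpha * min(Q[j, v] for v in U[j])
        Q[i, u] = (1 - gamma) * Q[i, u] + gamma * target
    return Q

# hypothetical random MDP
rng = np.random.default_rng(7)
n, m, alpha = 4, 2, 0.9
P = rng.uniform(size=(m, n, n)); P /= P.sum(axis=2, keepdims=True)
G = rng.uniform(size=(m, n, n))
U = [list(range(m)) for _ in range(n)]
Q = q_learning(lambda i, u: rng.choice(n, p=P[u, i]),
               lambda i, u, j: G[u, i, j], n, U, alpha, 50000, rng)
print(Q)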
Q-FACTOR APPROXIMATIONS
Introduce basis function approximation for Q-factors:
Q̃(i, u, r) = φ(i, u)′r
We cannot use LSPE/LSTD because the Q-factor Bellman equation involves minimization/multiple controls.
An optimistic version of LSPE(0) is possible:
Generate an infinitely long sequence {(i_k, u_k) | k = 0, 1, . . .}.
At iteration k, given r_k and state/control (i_k, u_k):
(1) Simulate the next transition (i_k, i_{k+1}) using the transition probabilities p_{i_k j}(u_k).
(2) Generate the control u_{k+1} from the minimization
u_{k+1} = arg min_{u∈U(i_{k+1})} Q̃(i_{k+1}, u, r_k)
(3) Update the parameter vector via
r_{k+1} = arg min_{r∈ℜ^s} Σ_{t=0}^k ( φ(i_t, u_t)′r − g(i_t, u_t, i_{t+1}) − αφ(i_{t+1}, u_{t+1})′r_k )²
Q-LEARNING FOR OPTIMAL STOPPING
Not much is known about the convergence of optimistic LSPE(0).
A major difficulty is that the projected Bellman equation for Q-factors may not be a contraction, and may have multiple solutions or no solution.
There is one important case, optimal stopping, where this difficulty does not occur.
Given a Markov chain with states 1, . . . , n, and transition probabilities p_ij. We assume that the states form a single recurrent class, with steady-state distribution vector ξ = (ξ_1, . . . , ξ_n).
At the current state i, we have two options:
Stop and incur a cost c(i), or
Continue and incur a cost g(i, j), where j is the next state.
Q-factor for the continue action:
Q(i) = Σ_{j=1}^n p_ij ( g(i, j) + α min{c(j), Q(j)} ) ≡ (FQ)(i)
Major fact: ΠF is a contraction of modulus α with respect to the norm ‖·‖_ξ.
LSPE FOR OPTIMAL STOPPING
Introduce the Q-factor approximation
Q̃(i, r) = φ(i)′r
PVI for Q-factors:
Φr_{k+1} = ΠF(Φr_k)
LSPE:
r_{k+1} = ( Σ_{t=0}^k φ(i_t)φ(i_t)′ )^{-1} Σ_{t=0}^k φ(i_t) ( g(i_t, i_{t+1}) + α min{ c(i_{t+1}), φ(i_{t+1})′r_k } )
Simpler version: Replace the term φ(i_{t+1})′r_k by φ(i_{t+1})′r_t. The algorithm still converges to the unique fixed point of ΠF (see H. Yu and D. P. Bertsekas, "A Least Squares Q-Learning Algorithm for Optimal Stopping Problems").
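A sketch of the "simpler version" just mentioned, on a hypothetical random stopping problem: since each past term is evaluated at the parameter vector available when it was generated, running sums suffice and no re-summation over the whole trajectory is needed.

import numpy as np

n, s, alpha = 5, 2, 0.95
rng = np.random.default_rng(8)
P = rng.uniform(size=(n, n)); P /= P.sum(axis=1, keepdims=True)   # "continue" dynamics
G = rng.uniform(size=(n, n))                                      # continuation cost g(i, j)
c = rng.uniform(size=n)                                           # stopping cost c(i)
Phi = rng.standard_normal((n, s))

B = 1e-6 * np.eye(s)                    # running sum of phi phi' (tiny regularization)
b = np.zeros(s)
r = np.zeros(s)
i = 0
for k in range(20000):
    j = rng.choice(n, p=P[i])
    B += np.outer(Phi[i], Phi[i])
    b += Phi[i] * (G[i, j] + alpha * min(c[j], Phi[j] @ r))   # uses the current r (simpler version)
    r = np.linalg.solve(B, b)
    i = j
print("approximate Q-factors for continuation:", Phi @ r)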
6.231 DYNAMIC PROGRAMMING
LECTURE 24
LECTURE OUTLINE
More on projected equation methods/policy
evaluation
Stochastic shortest path problems
Average cost problems
Generalization - Two Markov Chain methods
LSTD-like methods - Use to enhance exploration
REVIEW: PROJECTED BELLMAN EQUATION
For a fixed policy μ to be evaluated, the solution of Bellman's equation J = TJ is approximated by the solution of
Φr = ΠT(Φr)
whose solution is in turn obtained using a simulation-based method such as LSPE(λ), LSTD(λ), or TD(λ).
[Figure: T(Φr) is projected on S, the subspace spanned by the basis functions, giving the fixed point Φr = ΠT(Φr)]
Indirect method: Solving a projected form of Bellman's equation.
These ideas apply to other (linear) Bellman equations, e.g., for SSP and average cost.
Key Issue: Construct a framework where ΠT [or at least ΠT^(λ)] is a contraction.
STOCHASTIC SHORTEST PATHS
Introduce the approximation subspace
S = {Φr | r ∈ ℜ^s}
and, for a given proper policy, Bellman's equation and its projected version
J = TJ = g + PJ,  Φr = ΠT(Φr)
Also its λ-version
Φr = ΠT^(λ)(Φr),  T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}
Question: What should be the norm of projection?
Speculation based on the discounted case: It should be a weighted Euclidean norm with weight vector ξ = (ξ_1, . . . , ξ_n), where ξ_i should be some type of long-term occupancy probability of state i (which can be generated by simulation).
But what does long-term occupancy probability of a state mean in the SSP context?
How do we generate infinite length trajectories given that termination occurs with prob. 1?
SIMULATION TRAJECTORIES FOR SSP
We envision simulation of trajectories up to termination, followed by restart at state i with some fixed probabilities q_0(i) > 0.
Then the long-term occupancy probability of a state i is proportional to
q(i) = Σ_{t=0}^∞ q_t(i),  i = 1, . . . , n,
where
q_t(i) = P(i_t = i),  i = 1, . . . , n,  t = 0, 1, . . .
We use the projection norm
‖J‖_q = ( Σ_{i=1}^n q(i) (J(i))² )^{1/2}
[Note that 0 < q(i) < ∞, but q is not a probability distribution.]
We can show that ΠT^(λ) is a contraction with respect to ‖·‖_q (see the next slide).
CONTRACTION PROPERTY FOR SSP
We have q = Σ_{t=0}^∞ q_t, so
q′P = Σ_{t=0}^∞ q_t′P = Σ_{t=1}^∞ q_t′ = q′ − q_0′
or
Σ_{i=1}^n q(i) p_ij = q(j) − q_0(j),  for all j
To verify that ΠT is a contraction, we show that there exists β < 1 such that ‖Pz‖²_q ≤ β‖z‖²_q for all z ∈ ℜ^n.
For all z ∈ ℜ^n, we have
‖Pz‖²_q = Σ_{i=1}^n q(i) ( Σ_{j=1}^n p_ij z_j )² ≤ Σ_{i=1}^n q(i) Σ_{j=1}^n p_ij z_j²
= Σ_{j=1}^n z_j² Σ_{i=1}^n q(i) p_ij = Σ_{j=1}^n ( q(j) − q_0(j) ) z_j²
= ‖z‖²_q − ‖z‖²_{q_0} ≤ β‖z‖²_q
where
β = 1 − min_j ( q_0(j)/q(j) )
PVI(λ) AND LSPE(λ) FOR SSP
We consider PVI(λ): Φr_{k+1} = ΠT^(λ)(Φr_k), which can be written as
r_{k+1} = arg min_{r∈ℜ^s} Σ_{i=1}^n q(i) ( φ(i)′r − φ(i)′r_k − Σ_{t=0}^∞ λ^t E[ d_k(i_t, i_{t+1}) | i_0 = i ] )²
where d_k(i_t, i_{t+1}) are the TDs.
The LSPE(λ) algorithm is a simulation-based approximation. Let (i_{0,l}, i_{1,l}, . . . , i_{N_l,l}) be the lth trajectory (with i_{N_l,l} = 0), and let r_k be the parameter vector after k trajectories. We set
r_{k+1} = arg min_r Σ_{l=1}^{k+1} Σ_{t=0}^{N_l − 1} ( φ(i_{t,l})′r − φ(i_{t,l})′r_k − Σ_{m=t}^{N_l − 1} λ^{m−t} d_k(i_{m,l}, i_{m+1,l}) )²
where
d_k(i_{m,l}, i_{m+1,l}) = g(i_{m,l}, i_{m+1,l}) + φ(i_{m+1,l})′r_k − φ(i_{m,l})′r_k
Can also update r_k at every transition.
AVERAGE COST PROBLEMS
Consider a single policy to be evaluated, with a single recurrent class, no transient states, and steady-state probability vector ξ = (ξ_1, . . . , ξ_n).
The average cost, denoted by η, is independent of the initial state:
η = lim_{N→∞} (1/N) E[ Σ_{k=0}^{N−1} g(x_k, x_{k+1}) | x_0 = i ],  for all i
Bellman's equation is J = FJ with
FJ = g − ηe + PJ
where e is the unit vector e = (1, . . . , 1).
The projected equation and its λ-version are
Φr = ΠF(Φr),  Φr = ΠF^(λ)(Φr)
A problem here is that ΠF is not a contraction with respect to any norm (since e = Pe).
However, ΠF^(λ) turns out to be a contraction with respect to ‖·‖_ξ, assuming that e does not belong to S and λ > 0 [the case λ = 0 is exceptional, but can be handled - see the text].
LSPE(λ) FOR AVERAGE COST
We generate an infinitely long trajectory (i_0, i_1, . . .).
We estimate the average cost separately: Following each transition (i_k, i_{k+1}), we set
η_k = (1/(k+1)) Σ_{t=0}^k g(i_t, i_{t+1})
Also following (i_k, i_{k+1}), we update r_k by
r_{k+1} = arg min_{r∈ℜ^s} Σ_{t=0}^k ( φ(i_t)′r − φ(i_t)′r_k − Σ_{m=t}^k λ^{m−t} d_k(m) )²
where d_k(m) are the TDs
d_k(m) = g(i_m, i_{m+1}) − η_m + φ(i_{m+1})′r_k − φ(i_m)′r_k
Note that the TDs include the estimate η_m. Since η_m converges to η, for large m it can be viewed as a constant and lumped into the one-stage cost.
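A small sketch of the two estimates involved, for a given trajectory segment (names and data are hypothetical): the running average-cost estimates η_m and the corresponding TDs d_k(m), both of which feed the LSPE(λ) least-squares update above.

import numpy as np

def average_cost_tds(traj, costs, Phi, r):
    # traj : states i_0, ..., i_K ; costs : g(i_m, i_{m+1}) for m = 0, ..., K-1
    eta = np.cumsum(costs) / np.arange(1, len(costs) + 1)   # eta_m after each transition
    d = [costs[m] - eta[m] + Phi[traj[m + 1]] @ r - Phi[traj[m]] @ r
         for m in range(len(costs))]                        # TDs d_k(m) at parameter r
    return eta, np.array(d)

# hypothetical usage
rng = np.random.default_rng(10)
Phi = rng.standard_normal((3, 2))
eta, d = average_cost_tds([0, 1, 2, 0, 1], [1.0, 0.4, 2.2, 0.7], Phi, np.zeros(2))
print(eta, d)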
GENERALIZATION/UNIFICATION
Consider the approximate solution of x = T(x), where
T(x) = Ax + b,  A is n × n,  b ∈ ℜ^n,
by solving the projected equation y = ΠT(y), where Π is the projection on a subspace of basis functions (with respect to some Euclidean norm).
We will generalize from DP to the case where A is arbitrary, subject only to
I − ΠA : invertible
Benefits of generalization:
Unification/higher perspective for TD methods in approximate DP
An extension to a broad new area of applications, where a DP perspective may be helpful
Challenge: Dealing with less structure
Lack of contraction
Absence of a Markov chain
LSTD-LIKE METHOD
Let Π be the projection with respect to
‖x‖_ξ = ( Σ_{i=1}^n ξ_i x_i² )^{1/2}
where ξ ∈ ℜ^n is a probability distribution with positive components.
If r* is the solution of the projected equation, we have Φr* = Π(AΦr* + b) or
r* = arg min_{r∈ℜ^s} Σ_{i=1}^n ξ_i ( φ(i)′r − Σ_{j=1}^n a_ij φ(j)′r* − b_i )²
where φ(i)′ denotes the ith row of the matrix Φ.
Optimality condition/equivalent form:
Σ_{i=1}^n ξ_i φ(i) ( φ(i) − Σ_{j=1}^n a_ij φ(j) )′ r* = Σ_{i=1}^n ξ_i φ(i) b_i
The two expected values are approximated by simulation.
[Figure: row sampling of a state sequence i_0, i_1, . . . , i_k, i_{k+1}, . . . according to ξ, and column sampling of transitions (i_0, j_0), (i_1, j_1), . . . , (i_k, j_k), . . .]
SIMULATION MECHANISM
Row sampling: Generate a sequence i_0, i_1, . . . according to ξ, i.e., the relative frequency of each row i is ξ_i.
Column sampling: Generate (i_0, j_0), (i_1, j_1), . . . according to some transition probability matrix P with
p_ij > 0 if a_ij ≠ 0,
i.e., for each i, the relative frequency of (i, j) is p_ij.
Row sampling may be done using a Markov chain with transition matrix Q (unrelated to P).
Row sampling may also be done without a Markov chain - just sample rows according to some known distribution ξ (e.g., a uniform).
[Figure: row sampling according to ξ (may use a Markov chain Q); column sampling according to a Markov chain P ∼ |A|]
ROW AND COLUMN SAMPLING
Row sampling ∼ State Sequence Generation in DP. Affects:
The projection norm.
Whether ΠA is a contraction.
Column sampling ∼ Transition Sequence Generation in DP.
Can be totally unrelated to row sampling. Affects the sampling/simulation error.
Matching P with |A| is beneficial (has an effect like in importance sampling).
Independent row and column sampling allows exploration at will! Resolves the exploration problem that is critical in approximate policy iteration.
LSTD-LIKE METHOD
Optimality condition/equivalent form of the projected equation:
Σ_{i=1}^n ξ_i φ(i) ( φ(i) − Σ_{j=1}^n a_ij φ(j) )′ r* = Σ_{i=1}^n ξ_i φ(i) b_i
The two expected values are approximated by row and column sampling (batch 0 → t).
We solve the linear equation
Σ_{k=0}^t φ(i_k) ( φ(i_k) − (a_{i_k j_k}/p_{i_k j_k}) φ(j_k) )′ r_t = Σ_{k=0}^t φ(i_k) b_{i_k}
We have r_t → r*, regardless of whether ΠA is a contraction (by the law of large numbers; see the next slide).
An LSPE-like method is also possible, but requires that ΠA is a contraction.
Under the assumption Σ_{j=1}^n |a_ij| ≤ 1 for all i, there are conditions that guarantee contraction of ΠA; see the paper by Bertsekas and Yu, "Projected Equation Methods for Approximate Solution of Large Linear Systems," 2008, or the expanded version of Chapter 6, Vol. 2.
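A sketch of this LSTD-like scheme for a generic linear system x = Ax + b with randomly generated data (all names and sizes hypothetical): rows are sampled according to ξ, columns according to P, and the simulation-based solution r_t is compared with the exact solution of the projected equation.

import numpy as np

n, s = 6, 2
rng = np.random.default_rng(9)
A = 0.2 * rng.standard_normal((n, n))          # generic matrix (I - Pi A assumed invertible)
b = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))

xi = np.full(n, 1.0 / n)                       # row-sampling distribution (uniform here)
Pcol = np.full((n, n), 1.0 / n)                # column-sampling transition matrix, p_ij > 0

L = np.zeros((s, s))
rhs = np.zeros(s)
T = 50000
for t in range(T):
    i = rng.choice(n, p=xi)                    # row sampling according to xi
    j = rng.choice(n, p=Pcol[i])               # column sampling according to P
    L += np.outer(Phi[i], Phi[i] - (A[i, j] / Pcol[i, j]) * Phi[j]) / T
    rhs += Phi[i] * b[i] / T
r_t = np.linalg.solve(L, rhs)                  # simulation-based solution

# exact solution of the projected equation for comparison
Xi = np.diag(xi)
r_exact = np.linalg.solve(Phi.T @ Xi @ (np.eye(n) - A) @ Phi, Phi.T @ Xi @ b)
print(r_t, r_exact)                            # r_t approaches r_exact as T grows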
JUSTIFICATION W/ LAW OF LARGE NUMBERS
We will match terms in the exact optimality condition and the simulation-based version.
Let ξ̂_i^t be the relative frequency of i in row sampling up to time t.
We have
(1/(t+1)) Σ_{k=0}^t φ(i_k)φ(i_k)′ = Σ_{i=1}^n ξ̂_i^t φ(i)φ(i)′ ≈ Σ_{i=1}^n ξ_i φ(i)φ(i)′
(1/(t+1)) Σ_{k=0}^t φ(i_k) b_{i_k} = Σ_{i=1}^n ξ̂_i^t φ(i) b_i ≈ Σ_{i=1}^n ξ_i φ(i) b_i
Let p̂_ij^t be the relative frequency of (i, j) in column sampling up to time t. Then
(1/(t+1)) Σ_{k=0}^t (a_{i_k j_k}/p_{i_k j_k}) φ(i_k)φ(j_k)′ = Σ_{i=1}^n ξ̂_i^t Σ_{j=1}^n p̂_ij^t (a_ij/p_ij) φ(i)φ(j)′
≈ Σ_{i=1}^n Σ_{j=1}^n ξ_i a_ij φ(i)φ(j)′
6.231 DYNAMIC PROGRAMMING
LECTURE 25
LECTURE OUTLINE
Additional topics in ADP
Nonlinear versions of the projected equation
Extension of Q-learning for optimal stopping
Basis function adaptation
Gradient-based approximation in policy space
NONLINEAR EXTENSIONS OF PROJECTED EQ.
If the mapping T is nonlinear (as for example in the case of multiple policies), the projected equation Φr = ΠT(Φr) is also nonlinear.
Any solution r* satisfies
r* ∈ arg min_{r∈ℜ^s} ‖Φr − T(Φr*)‖²
or equivalently
Φr* − ΠT(Φr*) = 0
This is a nonlinear equation, which may have one or many solutions, or no solution at all.
If ΠT is a contraction, then there is a unique solution that can be obtained (in principle) by the fixed point iteration
Φr_{k+1} = ΠT(Φr_k)
We have seen a nonlinear special case of projected value iteration/LSPE where ΠT is a contraction, namely optimal stopping.
This case can be generalized.
LSPE FOR OPTIMAL STOPPING EXTENDED
Consider a system of the form
x = T(x) = Af(x) + b,
where f : ℜ^n → ℜ^n is a mapping with scalar components of the form f(x) = ( f_1(x_1), . . . , f_n(x_n) ).
Assume that each f_i : ℜ → ℜ is nonexpansive:
|f_i(x_i) − f_i(x̄_i)| ≤ |x_i − x̄_i|,  for all i and all x_i, x̄_i ∈ ℜ
This guarantees that T is a contraction with respect to any weighted Euclidean norm ‖·‖_v whenever A is a contraction with respect to that norm.
Algorithms similar to LSPE [approximating Φr_{k+1} = ΠT(Φr_k)] are then possible.
Special case: In the optimal stopping problem of Section 6.4, x is the Q-factor corresponding to the continuation action, α ∈ (0, 1) is a discount factor, f_i(x_i) = min{c_i, x_i}, and A = αP, where P is the transition matrix for continuing.
If Σ_{j=1}^n p_ij < 1 for some state i, and 0 ≤ P ≤ Q, where Q is an irreducible transition matrix, then Π((1 − γ)I + γT) is a contraction with respect to ‖·‖_ξ for all γ ∈ (0, 1), even with α = 1.
BASIS FUNCTION ADAPTATION I
An important issue in ADP is how to select basis functions.
A possible approach is to introduce basis functions that are parametrized by a vector θ, and optimize over θ, i.e., solve the problem
min_θ F(J̃(θ))
where, for example,
F(J̃(θ)) = ‖J̃(θ) − T(J̃(θ))‖²
Another example is
F(J̃(θ)) = Σ_{i∈I} |J(i) − J̃(θ)(i)|²,
where I is a subset of states, and J(i), i ∈ I, are the costs of the policy at these states calculated directly by simulation.
BASIS FUNCTION ADAPTATION II
Some algorithm may be used to minimize F(J̃(θ)) over θ.
A challenge here is that the algorithm should use low-dimensional calculations.
One possibility is to use a form of random search method; see the paper by Menache, Mannor, and Shimkin (Annals of Oper. Res., Vol. 134, 2005).
Another possibility is to use a gradient method. For this it is necessary to estimate the partial derivatives of J̃(θ) with respect to the components of θ.
APPROXIMATION IN POLICY SPACE I
Here the policies are parametrized by a vector r; η(r), g(r), P(r), and h(r) denote the corresponding average cost, one-stage cost vector, transition matrix, and differential cost vector. For a small change Δr from a given r,
Δη ≈ ξ′(Δg + ΔP h),
where ξ is the steady-state probability distribution/vector corresponding to P(r), and all the quantities above are evaluated at r:
Δη = η(r + Δr) − η(r),  Δg = g(r + Δr) − g(r),  ΔP = P(r + Δr) − P(r)
APPROXIMATION IN POLICY SPACE II
Proof of the gradient formula: We have, by differentiating Bellman's equation,
∇η(r)·e + ∇h(r) = ∇g(r) + ∇P(r) h(r) + P(r) ∇h(r)
By left-multiplying with ξ′,
ξ′∇η(r)·e + ξ′∇h(r) = ξ′∇g(r) + ξ′∇P(r) h(r) + ξ′P(r) ∇h(r)
Since ξ′∇η(r)·e = ∇η(r) and ξ′ = ξ′P(r), this equation simplifies to
∇η = ξ′(∇g + ∇P h)
Since we don't know ξ, we cannot implement a gradient-like method for minimizing η(r). An alternative is to use "sampled gradients," i.e., generate a simulation trajectory (i_0, i_1, . . .), and change r once in a while, in the direction of a simulation-based estimate of ξ′(∇g + ∇P h).
There is much recent research on this subject; see e.g., the work of Marbach and Tsitsiklis, and of Konda and Tsitsiklis, and the references given there.