DP Book
Contents

Preface
Common Symbols

1 Introduction
    1.1 Bellman Equations
        1.1.1 Finite-Horizon Job Search
        1.1.2 Infinite Horizon
    1.2 Stability and Contractions
        1.2.1 Vector Space
        1.2.2 Nonlinear Systems
        1.2.3 Successive Approximation
        1.2.4 Finite-Dimensional Function Space
    1.3 Infinite-Horizon Job Search
        1.3.1 Values and Policies
        1.3.2 Computation
    1.4 Chapter Notes

3 Markov Dynamics
    3.1 Foundations
        3.1.1 Markov Chains
        3.1.2 Stationarity and Ergodicity
        3.1.3 Approximation
    3.2 Conditional Expectations
        3.2.1 Mathematical Expectations
        3.2.2 Geometric Sums
    3.3 Job Search Revisited
        3.3.1 Job Search with Markov State
        3.3.2 Job Search with Separation
    3.4 Chapter Notes

I Appendices

Preface
This book is about dynamic programming and its applications in economics, finance,
and adjacent fields. It brings together recent innovations in the theory of dynamic
programming and provides applications and code that can help readers approach the
research frontier. The book is aimed at graduate students and researchers, although
most chapters are accessible to undergraduate students with solid quantitative back-
grounds.
The book contains classical results on dynamic programming that are found in
texts such as Bellman (1957), Denardo (1981), Puterman (2005), and Stokey and
Lucas (1989), as well as extensions created by researchers and practitioners during
the last few decades as they wrestled with how to formulate and solve dynamic models
that can explain patterns observed in data. These extensions include recursive preferences, robust control, continuous time models, and time-varying discount rates.
Such settings often fail to satisfy contraction-mapping restrictions on which tradi-
tional methods are based. To accommodate these applications, the key theoretical
chapters of this book (Chapters 8–9) adopt and extend the abstract framework of
Bertsekas (2022b). This approach provides great generality while also offering trans-
parent proofs.
Chapters 1–3 provide motivation and background material on solving fixed point
problems and computing lifetime valuations. Chapters 4 and 5 cover optimal stopping
and Markov decision processes, respectively. Chapter 6 extends the Markov decision
framework to settings where discount rates vary over time. Chapter 7 treats recursive
preferences. The main theoretical results on dynamic programming from Chapters 4–
6 are special cases of more general results in Chapters 8–9. A brief discussion of
continuous time models can be found in Chapter 10.
Mathematically inclined readers with some background in dynamic programming
might prefer to start with the general results in Chapters 8–9. Indeed, it is possible
to read the text in reverse, jumping to Chapters 8–9 and then moving back to cover
special cases according to interests. However, our teaching experience tells us that
most students find the general results challenging on first pass, but considerably easier
after they have practiced dynamic programming through the earlier chapters. This is
why we have started the presentation with special cases and ended it with general
results.
Instructors wishing to use this book as a text for undergraduate students can start with Chapter 1, skim through Chapter 2, cover Chapters 3–5 in depth, optionally include Chapter 6, and skip Chapters 7–10 entirely.
This book focuses on dynamic programs with finite state spaces, leaving more
general settings to Volume II. Restricting attention to finite states involves some costs,
since there are specific settings where continuous state models are simpler (one ex-
ample being Gaussian linear-quadratic models). Moreover, many continuous state
models allow us to unleash calculus, one of humanity’s most useful inventions.
Nevertheless, finite state models are extremely useful. Computational representa-
tions are always implemented using finitely many floating point numbers, and many
workhorse models in economics and finance are already discrete. In addition, focus-
ing on problems with finite state spaces allows us to avoid using function-analytic and
measure-theoretic machinery and imposing associated auxiliary conditions required
to ensure measurability and existence of extrema. Without these distractions, the
core theory of dynamic programming is especially simple.
For these reasons, we believe that even for sophisticated readers, a good approach to dynamic programming begins with a thorough analysis of the finite state case. This
is the task that we have tackled in Volume I.
Computer code is a first-class citizen in this book. Code is written in Julia and can
be found at
https://github.com/QuantEcon/book-dp1
We chose Julia because it is open source and because Julia allows us to write computer
code that is as close as possible to the relevant mathematical equations. Julia code in
the text is written to maximize clarity rather than speed.
We have also written matching Python code, which can be found in the same
repository. When combined with appropriate scientific libraries, Python is very prac-
tical and efficient for dynamic programming, but implementations tend to be library
specific and are sometimes not as clean as those in Julia. That is why we chose Julia
for programs embedded in the text.
We have tried to mix rigorous theory with exciting applications. Despite the var-
ious layers of abstractions used to unify the theory, the results are practical, being
motivated by important optimization problems from economics and finance.
This book is one of several being written in partnership with the QuantEcon or-
ganization, with funding generously provided by Schmidt Futures (see acknowledg-
ments below). There is some overlap with the first book in the series, Sargent and
Stachurski (2023b), particularly on the topic of Markov chains. Although repetition is
sometimes undesirable, we decided that some overlap would be useful, since it saves
readers from having to jump between two documents.
We are greatly indebted to Jim Savage and Schmidt Futures for generous finan-
cial support, as well as to Shu Hu, Smit Lunagariya, Maanasee Sharma and Chien
Yeh for outstanding research assistance. We are grateful to Makoto Nirei for hosting
John Stachurski at the University of Tokyo in June and July 2022, where significant
progress was made.
We also thank Alexis Akira Toda, Quentin Batista, Fernando Cirelli, Chase Coleman, Yihong Du, Ippei Fujiwara, Saya Ikegawa, Fazeleh Kazemian, Yuchao Li, Dawie van Lill, Qingyin Ma, Simon Mishricky, Pietro Monticone, Shinichi Nishiyama, Flint O'Neil, Zejin Shi, Akshay Shanker, Arnav Sood, Natasha Watkins,
Jingni Yang and Ziyue (Humphrey) Yang for many important fixes, comments and
suggestions. Yuchao Li read the entire manuscript, from cover to cover, and his in-
put and deep knowledge of dynamic programming helped us immensely. Jesse Perla
provided insightful comments on our code.
Common Symbols

Common Abbreviations
Chapter 1
Introduction
The state 𝑋𝑡 is a vector listing current values of variables deemed relevant to choos-
ing the current action. The action 𝐴𝑡 is a vector describing choices of a set of decision
variables. If 𝑇 < ∞ then the problem has a finite horizon. Otherwise it is an infinite
horizon problem. Figure 1.1 illustrates the first two rounds of a dynamic program.
As shown in the figure, a rule for updating the state depends on the current state and
action.
Dynamic programming provides a way to maximize expected lifetime reward of
a decision maker who receives a prospective reward sequence ( 𝑅𝑡 )𝑡⩾0 and who con-
fronts a system that maps today’s state and control into next period’s state. A lifetime
reward is an aggregation of the individual period rewards ( 𝑅𝑡 )𝑡⩾0 into a single value.
An example of lifetime reward is an expected discounted sum $\mathbb{E} \sum_{t \geq 0} \beta^t R_t$ for some β ∈ (0, 1).
[Figure 1.1: the first two rounds of a dynamic program, with states X₀, X₁, X₂, actions A₀, A₁, and rewards R₀, R₁.]
Example 1.0.1. A manager wants to set prices and inventories to maximize a firm’s
expected present value (EPV), which, given interest rate 𝑟 , is defined as
" 2 #
1 1
E 𝜋0 + 𝜋1 + 𝜋2 + · · · . (1.1)
1+𝑟 1+𝑟
Here 𝑋𝑡 will be a vector that quantifies the size of the inventories, prices set by com-
petitors and other factors relevant to profit maximization. The action 𝐴𝑡 sets current
prices and orders of new stock. The current reward 𝑅𝑡 is current profit 𝜋𝑡 , and the
profit stream ( 𝜋𝑡 )𝑡⩾0 is aggregated into a lifetime reward via (1.1).
Dynamic programming has a vast array of applications, from robotics and artifi-
cial intelligence to the sequencing of DNA. Dynamic programming is used every day to
control aircraft, route shipping, test products, recommend information on media plat-
forms and solve research problems. Some companies produce specialized computer
chips that are designed for specific dynamic programs.
Within economics and finance, dynamic programming is applied to topics includ-
ing unemployment, monetary policy, fiscal policy, asset pricing, firm investment,
wealth dynamics, inventory control, commodity pricing, sovereign default, the di-
vision of labor, natural resource extraction, human capital accumulation, retirement
decisions, portfolio choice, and dynamic pricing. We discuss some of these applica-
tions in chapters below.
The core theory of dynamic programming is relatively simple and concise. But
implementation can be computationally demanding. That situation provides one of
the major challenges facing the field of dynamic programming.
Example 1.0.2. To illustrate how computationally demanding problems can be, con-
sider again Example 1.0.1. Suppose that, for each book, a book retailer chooses to
hold between 0 and 10 copies. If there are 100 books to choose from, then the number of possible combinations for her inventories is $11^{100}$, about 20 orders of magnitude
larger than the number of atoms in the known universe. In reality there are probably
many more books to choose from, as well as other factors in the business environment
that affect choices of a retailer.
Imagine someone who begins her working life at time 𝑡 = 1 without employment.
While unemployed, she receives a new job offer paying wage 𝑊𝑡 at each date 𝑡 . She
can accept the offer and work permanently at that wage level or reject the offer, receive
unemployment compensation 𝑐, and draw a new offer next period. We assume that
the wage offer sequence is IID and nonnegative, with distribution 𝜑. In particular,
• W ⊂ R+ is a finite set of possible wage outcomes and
• 𝜑 : W → [0, 1] is a probability distribution on W, assigning a probability 𝜑 ( 𝑤)
to each possible wage outcome 𝑤.
The worker is impatient. Impatience is parameterized by a time discount factor 𝛽 ∈
(0, 1), so that the present value of a next-period payoff of 𝑦 dollars is 𝛽 𝑦 . Since 𝛽 < 1,
the worker will be tempted to accept reasonable offers, rather than to wait for better
ones. A key question is how long to wait.
Suppose as a first step that working life is just two periods. To solve our problem
we work backwards, starting at the final date 𝑡 = 2, after 𝑊2 has been observed.1 If
she is already employed, the worker has no decision to make: she continues working
at her current wage. If she is unemployed, then she should take the largest of 𝑐 and
𝑊2 .
Now we step back to 𝑡 = 1. At this time, having received offer 𝑊1 , the unemployed
worker’s options are (a) accept 𝑊1 and receive it in both periods or (b) reject it, receive
unemployment compensation 𝑐, and then, in the second period, choose the maximum
of 𝑊2 and 𝑐.
Let’s assume that the worker seeks to maximize expected present value. The EPV
of option (a) is 𝑊1 + 𝛽𝑊1 , which is also called the stopping value. The EPV of option
(b), also called the continuation value, is h₁ := c + β E max{c, W₂}. More explicitly,
$$ h_1 = c + \beta \sum_{w' \in W} v_2(w')\,\varphi(w') \quad \text{where} \quad v_2(w) := \max\{c, w\}. \tag{1.2} $$
The optimal choice at 𝑡 = 1 is now clear: accept the offer if 𝑊1 + 𝛽𝑊1 ⩾ ℎ1 and reject
otherwise. A decision tree is shown in Figure 1.2.
In determining the optimal choice above, we assumed that the worker (a) cares about
expected values and (b) knows how to compute them.
In Chapters 7–8 we discuss how to extend or weaken these assumptions. Some
of these extensions allow decision makers to focus on measurements that differ from
¹ The procedure of solving the last period first and then working back in time is called backward induction. Starting with the last period makes sense because there is no future to consider.
[Figure 1.2: decision tree for the two-period problem. At t = 1 the worker accepts and receives W₁ + βW₁ if W₁ + βW₁ ⩾ c + β E v₂(W₂); otherwise she rejects, and at t = 2 accepts if W₂ ⩾ c.]

[Figure 1.3: the time 1 value function v₁(w₁) together with the stopping value w₁ + βw₁ and the continuation value c + β Σ_{w′} max{c, w′}φ(w′); the two cross at the reservation wage w₁∗.]
from expected values. Other extensions assume that the decision maker does not know
underlying probability distributions. For now we put these issues aside and return to
the set up discussed in the previous section.
A key idea in dynamic programming is to use “value functions” to track maximal life-
time rewards from a given state at a given time. The time 2 value function 𝑣2 defined
in (1.2) returns the maximum value obtained in the final stage for each possible re-
alization of the time 2 wage offer. The time 1 value function 𝑣1 evaluated at 𝑤 ∈ W
is
$$ v_1(w) := \max\left\{ w + \beta w,\; c + \beta \sum_{w' \in W} v_2(w')\,\varphi(w') \right\}. \tag{1.3} $$
It represents the present value of expected lifetime income after receiving the first
offer 𝑤, conditional on choosing optimally in both periods.
The value function is shown in Figure 1.3. Figure 1.3 also shows the reservation
wage
$$ w_1^* := \frac{h_1}{1+\beta}. \tag{1.4} $$
This threshold solves w₁∗ + βw₁∗ = h₁ and equates the value of stopping to the value of continuing. For an offer W₁ above
𝑤1∗ , the stopping value exceeds the continuation value. For an offer below the reser-
vation wage, the reverse is true. Hence, the optimal choice for the worker at 𝑡 = 1 is
completely described by the reservation wage.
Parameters and functions underlying the figure are shown in Listing 1.
Figure 1.4 is instructive. We can see that higher unemployment compensation
𝑐 shifts up the continuation value ℎ1 and increases the reservation wage. As a re-
sult, the worker will, on average, spend more time unemployed when unemployment
compensation is higher.
Now let’s suppose that the worker works in period 𝑡 = 0 as well as 𝑡 = 1, 2. Figure 1.4
shows the decision tree for the three periods. Notice that the subtree containing nodes
1 and 2 is just the decision tree for the two-period problem in Figure 1.2. We will use
this to find optimal actions.
At 𝑡 = 0, the value of accepting the current offer 𝑊0 is 𝑊0 + 𝛽𝑊0 + 𝛽 2𝑊0 , while
maximal value of rejecting and waiting, is 𝑐 plus, after discounting by 𝛽 , the maxi-
mum value that can be obtained by behaving optimally from 𝑡 = 1. We have already
calculated this value: it is just 𝑣1 (𝑊1 ), as given in (1.3)!
Maximal time zero value 𝑣0 ( 𝑤) is the maximum of the value of these two options,
given 𝑊0 = 𝑤, so we can write
$$ v_0(w) = \max\left\{ w + \beta w + \beta^2 w,\; c + \beta \sum_{w' \in W} v_1(w')\,\varphi(w') \right\}. \tag{1.5} $$
By plugging 𝑣1 from (1.3) into this expression, we can determine 𝑣0 , as well as the
optimal action, the one that achieves the largest value in the max term in (1.5).
using Distributions

" Computes lifetime value at t=1 given current wage w_1 = w. "
function v_1(w, model)
    (; n, w_vals, ϕ, β, c) = model
    h_1 = c + β * max.(c, w_vals)'ϕ
    return max(w + β * w, h_1)
end
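The function above unpacks a NamedTuple called model. The listing that constructs it is not reproduced here, so the following is only a minimal sketch of a compatible constructor; the name create_job_search_model and all parameter values (grid bounds, Beta-Binomial offer distribution, β, c) are illustrative assumptions rather than the book's own settings.

"Create a job search model as a NamedTuple (illustrative parameter values)."
function create_job_search_model(;
        n=50,          # wage grid size parameter
        w_min=10.0,    # lowest wage on the grid
        w_max=60.0,    # highest wage on the grid
        a=200, b=100,  # shape parameters of the offer distribution
        β=0.96,        # discount factor
        c=10.0)        # unemployment compensation
    w_vals = collect(LinRange(w_min, w_max, n + 1))
    ϕ = pdf.(BetaBinomial(n, a, b), 0:n)   # offer probabilities over the grid
    return (; n, w_vals, ϕ, β, c)
end

model = create_job_search_model()
println(v_1(model.w_vals[end], model))     # lifetime value at t = 1 for the highest offer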
[Figure: decision tree for the three-period problem. At t = 0 the worker either accepts, receiving W₀ + βW₀ + β²W₀, or rejects and draws W₁; the subtree at nodes 1 and 2 repeats the two-period tree of Figure 1.2.]
Figure 1.4 illustrates how the backward induction process works. The last period
value function 𝑣2 is trivial to obtain. With 𝑣2 in hand we can compute 𝑣1 . With 𝑣1 in
hand we can compute 𝑣0 . Once all the value functions are available, we can calculate
whether to accept or reject at each point in time.
Notice how we subdivided the three period problem down into a pair of two pe-
riod problems, given by the two equations (1.3) and (1.5). Breaking many-period
problems down into a sequence of two period problems is the essence of dynamic
programming. The recursive relationships between 𝑣0 and 𝑣1 in (1.5), as well as be-
tween 𝑣1 and 𝑣2 in (1.3), are examples of what are called Bellman equations. We
will see many other examples.
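To make the recursion concrete, here is a minimal sketch of the three-period backward induction on the wage grid. It reuses the hypothetical create_job_search_model constructor sketched earlier and is our illustration, not a listing from the book.

"Backward induction for the three-period problem (illustrative sketch)."
function three_period_values(model)
    (; w_vals, ϕ, β, c) = model
    v2 = max.(c, w_vals)                                   # v2(w) = max{c, w}
    h1 = c + β * v2'ϕ                                      # continuation value at t = 1
    v1 = max.(w_vals .+ β .* w_vals, h1)                   # equation (1.3)
    h0 = c + β * v1'ϕ                                      # continuation value at t = 0
    v0 = max.(w_vals .+ β .* w_vals .+ β^2 .* w_vals, h0)  # equation (1.5)
    return v0, v1, v2
end

v0, v1, v2 = three_period_values(create_job_search_model())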
EXERCISE 1.1.3. Extend the above arguments to 𝑇 time periods, where 𝑇 can be
any finite number. Using Julia or another programming language, write a function
that takes 𝑇 as an argument and returns ( 𝑤0∗ , . . . , 𝑤𝑇∗ ), the sequence of reservation
wages for each period.
Here and below, for any finite or countable set 𝐹 , the symbol D( 𝐹 ) indicates the
set of distributions on 𝐹 .
As with the finite state case, infinite-horizon dynamic programming involves a
two step procedure that first assigns values to states and then deduces optimal actions
given those values. We begin with an informal discussion and then formalize the main
ideas.
To trade off current and future rewards optimally, we need to compare current
payoffs we get from our two choices with the states that those choices lead to and the
maximum value that can be extracted from those states. But how do we calculate the
maximum value that can be extracted from each state when lifetime is infinite?
Consider first the present expected lifetime value of being employed with wage
𝑤 ∈ W. This case is easy because, under the current assumptions, workers who accept
a job are employed forever. Lifetime payoff is
$$ w + \beta w + \beta^2 w + \cdots = \frac{w}{1-\beta}. \tag{1.7} $$
How about maximum present expected lifetime value attainable when entering the
current period unemployed with wage offer 𝑤 in hand? Denote this (as yet unknown)
value by 𝑣∗ ( 𝑤). We call 𝑣∗ the value function. While 𝑣∗ is not trivial to pin down, the
task is not impossible. Our first step in the right direction is to observe that it satisfies
the Bellman equation
$$ v^*(w) = \max\left\{ \frac{w}{1-\beta},\; c + \beta \sum_{w' \in W} v^*(w')\,\varphi(w') \right\} \qquad (w \in W). \tag{1.8} $$
Note that, unlike in the finite-horizon case, the value function and the optimal choice are not time dependent. This is because the worker always looks forward toward an infinite horizon, regardless of the current date.
Equation (1.8) is to be solved for a function 𝑣∗ ∈ RW , the set of all functions from
W to R. Once we have solved for 𝑣∗ (assuming this is possible), optimal choices can
be made by observing current 𝑤 and then choosing the largest of the two alternatives
on the right-hand side of (1.8), just as we did in the finite horizon case. This idea –
that optimal choices can be made by computing the value function and maximizing the
right-hand side of the Bellman equation – is called Bellman’s principle of optimality,
and will be a cornerstone of what follows. Later we prove it in a general setting.
To solve for v∗, we use fixed point theory, our topic in the next section. Later, in §1.3,
we return to the job search problem and apply fixed point theory to solve for 𝑣∗ .
For the most part, we are interested in vectors whose elements are real numbers (as
distinguished from complex numbers). Before investigating such vectors, let’s provide
some useful language about the real line R. (You might want to review some elemen-
tary concepts from real analysis in Appendix §A, such as suprema, infima, minima,
maxima, and convergence.)
Given 𝑎, 𝑏 ∈ R, let 𝑎 ∨ 𝑏 ≔ max{ 𝑎, 𝑏} and 𝑎 ∧ 𝑏 ≔ min{ 𝑎, 𝑏}. The absolute value
of 𝑎 ∈ R is defined as | 𝑎 | ≔ 𝑎 ∨ (−𝑎).
A real-valued vector u = (u₁, …, uₙ) is a finite real sequence with uᵢ ∈ ℝ as the i-th element. The set of all real vectors of length n is denoted by ℝⁿ. The inner product of n-vectors (u₁, …, uₙ) and (v₁, …, vₙ) is ⟨u, v⟩ := Σᵢ₌₁ⁿ uᵢvᵢ.
The set C of complex numbers is defined in the appendix to Sargent and Stachurski
(2023b) and many other places; as is the set C𝑛 of all complex-valued 𝑛-vectors. We
assume readers know what complex numbers are and how to compute the modulus
of a complex number.
EXERCISE 1.2.1. Let 𝛼, 𝑠 and 𝑡 be real numbers. Show that 𝛼 ∨ ( 𝑠 + 𝑡 ) ⩽ 𝑠 + 𝛼 ∨ 𝑡
whenever 𝑠 ⩾ 0.
1.2.1.2 Norms
Because they provide more flexibility when checking conditions that underlie various
results, some alternative norms on R𝑛 are important for applications of fixed point
theory.
As a first step, recall that a function k · k : R𝑛 → R is called a norm on R𝑛 if, for
any 𝛼 ∈ R and 𝑢, 𝑣 ∈ R𝑛 ,
(a) k 𝑢 k ⩾ 0 (nonnegativity)
(b) k 𝑢 k = 0 ⇐⇒ 𝑢 = 0 (positive definiteness)
(c) k 𝛼𝑢 k = | 𝛼 |k 𝑢 k and (absolute homogeneity)
(d) k 𝑢 + 𝑣 k ⩽ k 𝑢 k + k 𝑣 k (triangle inequality)
The Cauchy–Schwarz inequality can be used to prove that the triangle inequality holds for the Euclidean norm (see, e.g., Kreyszig (1978)).
Example 1.2.1. The ℓ₁ norm of a vector u = (u₁, …, uₙ) ∈ ℝⁿ is defined by
$$ \|u\|_1 := \sum_{i=1}^n |u_i|. \tag{1.9} $$

EXERCISE 1.2.3. Fix p ∈ ℝⁿ with pᵢ > 0 for all i ∈ [n] and Σᵢ pᵢ = 1. Show that ‖u‖₁,ₚ := Σᵢ₌₁ⁿ |uᵢ| pᵢ is a norm on ℝⁿ.

The ℓ₁ norm and the Euclidean norm are special cases of the so-called ℓₚ norm, which is defined for p ⩾ 1 by
$$ \|u\|_p := \left( \sum_{i=1}^n |u_i|^p \right)^{1/p}. \tag{1.10} $$

EXERCISE 1.2.4. Prove that the supremum norm (or ℓ∞ norm), defined by ‖u‖∞ := max₁⩽ᵢ⩽ₙ |uᵢ|, is also a norm on ℝⁿ.
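For readers who want to experiment numerically, the norms above are available through Julia's LinearAlgebra standard library; the snippet below is purely illustrative.

using LinearAlgebra

u = [1.0, -2.0, 3.0]
println(norm(u, 1))     # ℓ1 norm: sum of absolute values
println(norm(u, 2))     # Euclidean norm
println(norm(u, 4))     # ℓp norm with p = 4
println(norm(u, Inf))   # supremum norm: largest absolute value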
To begin, recall that when u and (u_m) := (u_m)_{m∈ℕ} are all elements of ℝⁿ, we say that (u_m) converges to u and write u_m → u if ‖u_m − u‖ → 0 as m → ∞.
It might seem that this definition is imprecise. Don’t we need to clarify that the
convergence is with respect to a particular norm?
No we don't. This is because any two norms ‖·‖_a and ‖·‖_b on ℝⁿ are equivalent in the sense that there exist finite positive constants M, N such that
$$ M \|u\|_a \leq \|u\|_b \leq N \|u\|_a \quad \text{for all } u \in \mathbb{R}^n. \tag{1.11} $$
EXERCISE 1.2.6. Let us write k · k 𝑎 ∼ k · k 𝑏 if there exist finite 𝑀, 𝑁 such that (1.11)
holds. Prove that ∼ is an equivalence relation (see §A.1) on the set of all norms on
R𝑛 .
The next exercise tells us that pointwise convergence and norm convergence are
the same thing in finite dimensions.
Next we discuss geometric series in matrix space, along with the Neumann series
lemma, one of many useful results in applied and numerical analysis.
Before starting, we recall that if A = (a_ij) is an n × n matrix with i, j-th element a_ij, then the definition of matrix multiplication tells us that for u ∈ ℝⁿ, the i-th element of Au is Σⱼ₌₁ⁿ a_ij u_j, while the j-th element of u⊤A is Σᵢ₌₁ⁿ a_ij u_i. Think of u ↦ Au and u ↦ u⊤A as two different mappings, each of which takes an n-vector and produces a new n-vector.
Remark 1.2.1. In this book, we adopt a convention that a vector in R𝑛 is just an
𝑛-tuple of real values. This coincides with the viewpoint of languages like Julia and
Python: vectors are just “flat” arrays. But when we use vectors in matrix algebra, they
should be understood as column vectors unless we state otherwise.
Given a norm ‖·‖ on ℝⁿ, the operator norm of a matrix B ∈ ℝⁿˣⁿ is defined by
$$ \|B\|_o := \max_{\|u\| = 1} \|Bu\|. \tag{1.12} $$
Some matrix norms have the submultiplicative property, which means that, for all A, B ∈ ℝⁿˣⁿ, we have ‖AB‖ ⩽ ‖A‖‖B‖.
In what follows we often use the operator norm as our choice of matrix norm
(partly because of its attractive submultiplicative property). Hence, by convention,
an expression such as ‖A‖ refers to the operator norm ‖A‖_o of A.
Analogous to the vector case, we say that a sequence (A_k) of n × n matrices converges to an n × n matrix A and write A_k → A if ‖A_k − A‖ → 0 as k → ∞. Just as
with vectors, this form of norm convergence holds if and only if each element of 𝐴𝑘
converges to the corresponding element of 𝐴. The proof is similar to the solution to
Exercise 1.2.8.
If 𝐴 is an 𝑛 × 𝑛 matrix, then 𝜆 ∈ C is called an eigenvalue of 𝐴 if there exists a
nonzero 𝑒 ∈ C𝑛 such that 𝐴𝑒 = 𝜆𝑒. (Here C is the set of complex numbers and C𝑛 is the
set of complex 𝑛-vectors.) A vector 𝑒 satisfying this equality is called an eigenvector
of 𝐴 and ( 𝜆, 𝑒) is called an eigenpair.
In Julia, we can compute the eigenvalues of a square matrix 𝐴 via eigvals(A).
The code
using LinearAlgebra
A = [0 -1;
     1  0]
println(eigvals(A))
produces
2-element Vector{ComplexF64}:
0.0 - 1.0im
0.0 + 1.0im
Here im stands for 𝑖, the imaginary unit, so the eigenvalues of 𝐴 are −𝑖 and 𝑖.
Turning to geometric series, let us begin in one dimension. Consider the one-
dimensional linear equation 𝑢 = 𝑎𝑢 + 𝑏, where 𝑎, 𝑏 are given and 𝑢 is unknown. Its
solution u∗ satisfies
$$ |a| < 1 \implies u^* = \frac{b}{1-a} = \sum_{k \geq 0} a^k b. \tag{1.14} $$
This scalar result extends naturally to vectors. To show this we suppose that 𝑢 and
𝑏 are column vectors in R𝑛 , and that 𝐴 is an 𝑛 × 𝑛 matrix. We consider the vector
equation 𝑢 = 𝐴𝑢 + 𝑏. For the next result, we recall that the spectral radius of 𝐴 is
defined as
𝜌 ( 𝐴) ≔ max{| 𝜆 | : 𝜆 is an eigenvalue of 𝐴 } (1.15)
Here | 𝜆 | indicates the modulus of complex number 𝜆 .
With 𝐼 as the 𝑛 × 𝑛 identity matrix, we can state the following result.
Theorem 1.2.1 (Neumann Series Lemma). If 𝜌 ( 𝐴) < 1, then 𝐼 − 𝐴 is nonsingular and
$$ (I - A)^{-1} = \sum_{k \geq 0} A^k. $$
using LinearAlgebra
ρ(A) = maximum(abs(λ) for λ in eigvals(A))  # Spectral radius
A = [0.4 0.1;    # Test with arbitrary A
     0.7 0.2]
print(ρ(A))
The rest of this section works through the proof of the Neumann series lemma,
with several parts left as exercises. An informal proof of the lemma runs as follows.
If S := Σ_{k⩾0} A^k, then
$$ I + AS = I + A \sum_{k \geq 0} A^k = I + A + A^2 + \cdots = S. $$
Lemma 1.2.2. If B is any square matrix and ‖·‖ is any matrix norm, then ρ(B)^k ⩽ ‖B^k‖ for all k ∈ ℕ, and ρ(B) = lim_{k→∞} ‖B^k‖^{1/k}.
A proof of Lemma 1.2.2 can be found in Chapter 12 of Bollobás (1999). The second
result is sometimes called Gelfand’s formula.
EXERCISE 1.2.12. Prove: If 𝐴 and 𝐵 are square matrices that commute (i.e., 𝐴𝐵 =
𝐵𝐴), then 𝜌 ( 𝐴𝐵) ⩽ 𝜌 ( 𝐴) 𝜌 ( 𝐵). [Hint: Show ( 𝐴𝐵) 𝑘 = 𝐴𝑘 𝐵𝑘 and use Gelfand’s formula.]
EXERCISE 1.2.13. Prove: ρ(A) < 1 implies that the series Σ_{k⩾0} A^k converges, in the sense that every element of the matrix S_K := Σ_{k=0}^K A^k converges as K → ∞.
From this last result, one can show that (I − A)⁻¹ exists by computing it:

EXERCISE 1.2.14. Prove this claim by showing that, when Σ_{k⩾0} A^k exists, the inverse of I − A exists and indeed (I − A)⁻¹ = Σ_{k⩾0} A^k.³
Listing 3 helps illustrate the result in Exercise 1.2.14, although we truncate the infinite sum Σ_{k⩾0} A^k at 50.
The output 5.621e-12 is close enough to zero for many practical purposes.
While the Neumann series lemma is a powerful tool for solving linear systems, it
doesn’t help us with nonlinear problems. In this section, we present Banach’s fixed
point theorem, one of a variety of techniques for handling nonlinear systems. (Chap-
ter 2 introduces other methods.)
3 Hint: To prove that 𝐴 is invertible and 𝐵 = 𝐴 −1 , it suffices to show that 𝐴𝐵 = 𝐼 .
# Primitives
A = [0.4 0.1;
     0.7 0.2]
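The remainder of Listing 3 is not reproduced above. Continuing from the primitives, a minimal sketch of the comparison the text describes, truncating the sum at K = 50, could look as follows; this is our reconstruction, not necessarily the original listing.

using LinearAlgebra

"Truncated power series Σ_{k=0}^K A^k."
function truncated_neumann(A, K)
    S = zeros(size(A))
    A_power = Matrix{Float64}(I, size(A)...)
    for k in 0:K
        S += A_power            # add A^k to the running sum
        A_power = A_power * A   # update A^k to A^(k+1)
    end
    return S
end

B = inv(I - A)                                         # direct computation of (I - A)^{-1}
println(maximum(abs.(B - truncated_neumann(A, 50))))   # maximal elementwise error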
[Figure 1.5: a self-map T on U = [0, 2] plotted against the 45 degree line, with three fixed points u_ℓ, u_m, and u_h.]
Figure 1.5 shows another example, for a self-map 𝑇 on 𝑈 ≔ [0, 2]. Fixed points
are numbers 𝑢 ∈ [0, 2] where 𝑇 meets the 45 degree line. In this case there are three.
EXERCISE 1.2.15. Let 𝑈 be any set and let 𝑇 be a self-map on 𝑈 . Suppose there
exists an 𝑢¯ ∈ 𝑈 and an 𝑚 ∈ N such that 𝑇 𝑘 𝑢 = 𝑢¯ for all 𝑢 ∈ 𝑈 and 𝑘 ⩾ 𝑚. Prove that,
under this condition, 𝑢¯ is the unique fixed point of 𝑇 in 𝑈 .
for all 𝑢 ∈ 𝑈 and 𝑘 ∈ N. Next, show that 𝑇 is globally stable on 𝑈 whenever 𝜌 ( 𝐴) < 1.
Next we present the Banach fixed point theorem, a workhorse for analyzing nonlinear
operators.
Let U be a nonempty subset of ℝⁿ and let ‖·‖ be a norm on ℝⁿ. A self-map T on U is called a contraction on U with respect to ‖·‖ if there exists a λ < 1 such that ‖Tu − Tv‖ ⩽ λ‖u − v‖ for all u, v ∈ U.
EXERCISE 1.2.21. Let U and T have the properties stated in Theorem 1.2.3. Fix u₀ ∈ U and let u_m := T^m u₀. Show that
$$ \|u_m - u_k\| \leq \sum_{i=m}^{k-1} \lambda^i \|u_0 - u_1\| \qquad \text{whenever } k > m. $$

EXERCISE 1.2.22. Using the results in Exercise 1.2.21, prove that (u_m) is a Cauchy sequence in ℝⁿ. (A sequence (v_m) ⊂ ℝⁿ is called a Cauchy sequence if, for any ε > 0, there exists an N ∈ ℕ such that m, n ⩾ N implies ‖v_m − v_n‖ < ε.)
1.2.3.1 Iteration
"""
Computes an approximate fixed point of a given operator T
via successive approximation.
"""
function successive_approx(T, # operator (callable)
u_0; # initial condition
tolerance=1e-6, # error tolerance
max_iter=10_000, # max iteration bound
print_step=25) # print at multiples
u = u_0
error = Inf
k = 1
u_new = T(u)
error = maximum(abs.(u_new - u))
if k % print_step == 0
println("Completed iteration $k with error $error.")
end
u = u_new
k += 1
end
return u
end
include("s_approx.jl")
using LinearAlgebra
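Only the opening lines of this listing appear above. A minimal sketch of the kind of computation it could perform, iterating an affine map Tu = Au + b with successive_approx from s_approx.jl (the particular A and b below are illustrative assumptions):

# An affine map Tu = Au + b with ρ(A) < 1 (A and b are illustrative)
A = [0.4 0.1;
     0.7 0.2]
b = [1.0, 2.0]
T(u) = A * u + b

u_star = (I - A) \ b                           # exact fixed point (I - A is nonsingular)
u_approx = successive_approx(T, [1.0, 1.0])    # iterate to an approximate fixed point
println(maximum(abs.(u_approx - u_star)))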
[Figure: trajectories generated by successive approximation from several initial conditions u₀, converging to the fixed point u∗.]
𝑘𝑡+1 = 𝑠 𝑓 ( 𝑘𝑡 ) + (1 − 𝛿) 𝑘𝑡 , 𝑡 = 0, 1, . . . , (1.19)
where 𝑘𝑡 is capital stock per worker, 𝑓 : (0, ∞) → (0, ∞) is a production function, 𝑠 > 0
is a saving rate and 𝛿 ∈ (0, 1) is a rate of depreciation. If we set 𝑔 ( 𝑘) ≔ 𝑠 𝑓 ( 𝑘) + (1 − 𝛿) 𝑘,
then iterating with 𝑔 from a starting point 𝑘0 (i.e., setting 𝑘𝑡+1 = 𝑔 ( 𝑘𝑡 ) for all 𝑡 ⩾
0) generates the sequence in (1.19). We can also understand this process as using
successive approximation to compute the fixed point of 𝑔.
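For instance, a minimal sketch of this iteration with f(k) = Ak^α (the parameter values are illustrative, not taken from the text):

# Iterating the Solow-Swan map g(k) = s*f(k) + (1 - δ)*k with f(k) = A*k^α
A, s, α, δ = 2.0, 0.3, 0.3, 0.4          # illustrative parameter values
g(k) = s * A * k^α + (1 - δ) * k

k = 0.5                                  # initial capital per worker
for t in 1:100
    global k = g(k)                      # k_{t+1} = g(k_t)
end
println((k, (s * A / δ)^(1 / (1 - α))))  # iterate vs. the fixed point k*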
EXERCISE 1.2.25. Let f(k) = Ak^α with A > 0 and 0 < α < 1. Show that, while the Solow-Swan map g(k) = sAk^α + (1 − δ)k sends U := (0, ∞) into itself, g is not a contraction on U. [Hint: use the definition of the derivative of g as a limit and consider the derivative g′(k) for k close to zero.]
Although the model specified in Exercise 1.2.25 does not generate a contraction,
it is globally stable. The next exercise asks you to prove this.
EXERCISE 1.2.26. Show that, in the setting of Exercise 1.2.25, the unique fixed point of g in U is
$$ k^* := \left( \frac{sA}{\delta} \right)^{1/(1-\alpha)}. $$
Prove that, for k ∈ U,
[Figure: the Solow–Swan map g(k) = sAk^α + (1 − δ)k, the 45 degree line, and a trajectory of (k_t) converging to the fixed point k∗ = (sA/δ)^{1/(1−α)}.]
The sequence (k_t)_{t⩾0} that is produced by starting from a particular choice of k₀ is traced out in the figure.
The figure illustrates that 𝑘∗ is the unique fixed point of 𝑔 in 𝑈 and all sequences
converge to it. The second statement can be rephrased as: successive approximation
successfully computes the fixed point of 𝑔 by stepping through the time path of capital.
In §1.1.2 we introduced a Bellman equation for the infinite horizon job search prob-
lem. The unknown object in the Bellman equation is a function 𝑣∗ defined on the set
W of possible wage offers. Below we discuss how to solve for this unknown function.
Since the set of wage offers is finite we can write W as {𝑤1 , . . . , 𝑤𝑛 } for some
𝑛 ∈ N. If we adopt this convention and also write 𝑣∗ ( 𝑤𝑖 ) as 𝑣𝑖∗ , then we can view 𝑣∗
as a vector ( 𝑣1∗ , . . . , 𝑣𝑛∗ ) in R𝑛 . The vector interpretation is useful when coding, since
vectors (numerical arrays) are an efficient data type.
Nevertheless, for mathematical exposition, we usually find it more convenient to
express function-like objects (e.g., value functions) as functions rather than vectors.
Thus, we typically write 𝑣∗ ( 𝑤) instead of 𝑣𝑖∗ .
Remark 1.2.2. There is a deeper reason that we usually work with functions rather than vectors: when we shift to general state and action spaces in Volume II, objects such as value functions are more naturally treated as functions than as vectors.
[Figure: the pointwise maximum u ∨ v and pointwise minimum u ∧ v of two functions u and v.]
The next section clarifies our notation with respect to functions and vectors.
If X is any set and 𝑢 maps X to R, then we call 𝑢 a real-valued function on X and write
𝑢 : X → R. Throughout, the symbol RX denotes the set of all real-valued functions on
X. This is a special case of the symbol 𝐵 𝐴 that represents the set of all functions from
𝐴 to 𝐵, where 𝐴 and 𝐵 are sets.
If u, v ∈ ℝ^X and α, β ∈ ℝ, then the expressions αu + βv and uv also represent elements of ℝ^X, defined at x ∈ X by (αu + βv)(x) := αu(x) + βv(x) and (uv)(x) := u(x)v(x).
Let X be finite, so that X = { 𝑥1 , . . . , 𝑥𝑛 } for some 𝑛 ∈ N. The set RX is, in essence, the
vector space R𝑛 expressed in different notation. The next lemma clarifies.
Lemma 1.2.4. The map
$$ \mathbb{R}^X \ni u \longleftrightarrow (u(x_1), \ldots, u(x_n)) \in \mathbb{R}^n \tag{1.23} $$
is a one-to-one correspondence between the function space ℝ^X and the vector space ℝⁿ.
With these conventions, the Neumann series lemma and Banach's contraction mapping theorem extend directly from ℝⁿ to ℝ^X. For example, if |X| = n, C is closed in ℝ^X and T is a contraction on C ⊂ ℝ^X, in the sense that T : C → C and ‖Tu − Tv‖∞ ⩽ λ‖u − v‖∞ for some λ < 1 and all u, v ∈ C, then T has a unique fixed point in C.
1.2.4.3 Distributions
Since each element of D(X) can be identified with a corresponding vector in ℝⁿ, the set D(X) can also be thought of as a subset of ℝⁿ. This collection of vectors (i.e., the nonnegative vectors that sum to unity) is also
called the unit simplex. Given X0 ⊂ X and 𝜑 ∈ D(X), we say that 𝜑 is supported on
X0 if 𝜑 ( 𝑥 ) > 0 implies 𝑥 ∈ X0 .
Fix ℎ ∈ RX and 𝜑 ∈ D(X). Let 𝑋 be a random variable with distribution 𝜑, so that
P{ 𝑋 = 𝑥 } = 𝜑 ( 𝑥 ) for all 𝑥 ∈ X. The expectation of ℎ ( 𝑋 ) is
$$ \mathbb{E}\, h(X) := \sum_{x \in X} h(x)\,\varphi(x) = \langle h, \varphi \rangle. $$
EXERCISE 1.2.27. Fix ℎ ∈ RX . Show that 𝜑∗ ∈ argmax 𝜑∈D(X) hℎ, 𝜑i if and only if
𝜑∗ is supported on argmax 𝑥 ∈X ℎ ( 𝑥 ).
Given τ in the unit interval and a random variable X taking values in X with CDF Φ, the τ-th quantile of X is defined as
$$ Q_\tau X := \min\{ x \in X : \Phi(x) \geq \tau \}. \tag{1.24} $$
EXERCISE 1.2.28. Prove that the quantile function is additive over constants. That
is, for any 𝜏 ∈ [0, 1], random variable 𝑋 on X and 𝛼 ∈ R, we have 𝑄 𝜏 ( 𝑋 + 𝛼) = 𝑄 𝜏 ( 𝑋 ) + 𝛼.
Armed with fixed point methods, we return to the job search problem discussed in
§1.1.2.
In this section we solve for the value function of an infinite horizon job search problem
and associated optimal choices.
Let’s recall the strategy for solving the infinite-horizon job search problem we pro-
posed in §1.1.2. The first step is to compute the optimal value function v∗ that solves the Bellman equation (1.8). The second step is to let
$$ h^* := c + \beta \sum_{w' \in W} v^*(w')\,\varphi(w') \tag{1.26} $$
be the infinite-horizon continuation value that equals the maximal lifetime value that
the worker can receive, contingent on deciding to continue being unemployed today.
With h∗ in hand, the optimal decision at any given time, facing current wage draw w ∈ W, is as follows: accept the offer if w/(1 − β) ⩾ h∗ and reject it otherwise.
The method proposed above requires that we solve for 𝑣∗ . To do so, we introduce a
Bellman operator 𝑇 defined at 𝑣 ∈ RW that is constructed to assure that any fixed
point of 𝑇 solves the Bellman equation and vice versa:
$$ (Tv)(w) = \max\left\{ \frac{w}{1-\beta},\; c + \beta \sum_{w' \in W} v(w')\,\varphi(w') \right\} \qquad (w \in W). \tag{1.27} $$
Proposition 1.3.1 states that T is a contraction of modulus β with respect to the supremum norm ‖·‖∞. Its proof uses the elementary bound
$$ |\alpha \vee x - \alpha \vee y| \leq |x - y| \qquad (\alpha, x, y \in \mathbb{R}). \tag{1.28} $$
EXERCISE 1.3.1. Verify that (1.28) always holds. [Exercise 1.2.1 might be helpful.]
Proof of Proposition 1.3.1. Take any 𝑓 , 𝑔 in 𝑉 and fix any 𝑤 ∈ W. Apply the bound in
(1.28) to get
$$ |(Tf)(w) - (Tg)(w)| \leq \beta \left| \sum_{w'} f(w')\,\varphi(w') - \sum_{w'} g(w')\,\varphi(w') \right| = \beta \left| \sum_{w'} [f(w') - g(w')]\,\varphi(w') \right| \leq \beta \|f - g\|_\infty. $$
Taking the supremum over all w on the left hand side of this expression leads to ‖Tf − Tg‖∞ ⩽ β‖f − g‖∞.
A dynamic program seeks optimal policies. We briefly introduce the notion of a policy
and relate it to the job search application.
In general, for a dynamic program, choices by the controller aim to maximize
lifetime rewards and consist of a state-contingent sequence ( 𝐴𝑡 )𝑡⩾0 specifying how
the agent acts at each point in time. Workers do not know what the future will bring,
so it is natural to assume that 𝐴𝑡 can depend on present and past events but not
future ones. Hence 𝐴𝑡 is a function of the current state 𝑋𝑡 and past state-action pairs
( 𝐴𝑡−𝑖 , 𝑋𝑡−𝑖 ) for 𝑖 ⩾ 1. That is,
A key insight of dynamic programming is that some problems can be set up so that
the optimal current action can be expressed as a function of the current state 𝑋𝑡 .
Example 1.3.1. In Example 1.0.1, the retailer chooses stock orders and prices in
each period. Every quantity relevant to this decision belongs in the current state. It
might include not just the level of current inventories and various measures of business
conditions, but also information about rates at which inventories have changed over
each of the past six months.
If the current state 𝑋𝑡 is enough to determine a current optimal action, then policies
are just maps from states to actions. So we can write 𝐴𝑡 = 𝜎 ( 𝑋𝑡 ) for some function
𝜎. A policy function that depends only on the current state is often called a Markov
policy. Since all policies we consider will be Markov policies, we refer to them more
concisely as “policies.”
Remark 1.3.1. In the last paragraph, we dropped the time subscript on 𝜎 with no
loss of generality because we can always include the date t in the current state; i.e., if Y_t is the state without time, then we can set X_t = (t, Y_t). Whether this is necessary
depends on the problem at hand. For the job search model with finite horizon, the
date matters because opportunities for future earnings decrease with the passage of
time. For the infinite horizon version of the problem, in which an agent always looks
forward toward an infinite horizon, the only current information that matters to the
agent at time 𝑡 is the wage offer 𝑊𝑡 . As a result, the calendar date 𝑡 does not affect the
agent’s decision at time 𝑡 , so there is no need to include time in the state. (In §8.1.3.5,
we will formalize this argument.)
In the job search model, the state is the current wage offer and possible actions are
to accept or to reject the current offer. With 0 interpreted as reject and 1 understood
as accept, the action space is {0, 1}, so a policy is a map 𝜎 from W to {0, 1}. Let Σ be
the set of all such maps.
A policy is an “instruction manual”: for an agent following 𝜎 ∈ Σ, if current wage
offer is 𝑤, the agent always responds with 𝜎 ( 𝑤) ∈ {0, 1}. The policy dictates whether
the agent accepts or rejects at any given wage.
For each 𝑣 ∈ 𝑉 , a 𝑣-greedy policy is a 𝜎 ∈ Σ satisfying
$$ \sigma(w) = \mathbb{1}\left\{ \frac{w}{1-\beta} \geq c + \beta \sum_{w' \in W} v(w')\,\varphi(w') \right\} \quad \text{for all } w \in W. \tag{1.29} $$
Equation (1.29) says that an agent accepts if 𝑤/(1 − 𝛽 ) exceeds the continuation value
computed using 𝑣 and rejects otherwise. Our discussion of optimal choices in §1.3.1.1
implies that an optimal policy σ∗ is given by
$$ \sigma^*(w) = \mathbb{1}\{w \geq w^*\} \quad \text{where} \quad w^* := (1-\beta)\, h^*. \tag{1.30} $$
The quantity 𝑤∗ in (1.30) is called the reservation wage, and parallels the reservation
wage that we introduced for the finite-horizon problem. Equation (1.30) states that
value maximization requires accepting an offer if and only if it exceeds the reservation
wage. Thus, 𝑤∗ provides a scalar description of an optimal policy.
1.3.2 Computation
[Figure: value function iterates (iterate 1, 2, 3 and 1000) plotted against the wage offer.]
include("two_period_job_search.jl")
include("s_approx.jl")
" Solve the infinite-horizon IID job search model by VFI. "
function vfi(model=default_model)
(; n, w_vals, ϕ, β, c) = model
v_init = zero(model.w_vals)
v_star = successive_approx(v -> T(v, model), v_init)
σ_star = get_greedy(v_star, model)
return v_star, σ_star
end
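The listing above calls an operator T and a function get_greedy that are defined elsewhere in the repository. The following is a minimal sketch consistent with (1.27) and (1.29); it is our reconstruction rather than the repository's exact code.

"The Bellman operator from (1.27), acting on a vector v over the wage grid."
function T(v, model)
    (; w_vals, ϕ, β, c) = model
    h = c + β * v'ϕ                     # continuation value implied by v
    return max.(w_vals ./ (1 - β), h)
end

"A v-greedy policy as in (1.29): 1 = accept, 0 = reject, at each wage."
function get_greedy(v, model)
    (; w_vals, ϕ, β, c) = model
    h = c + β * v'ϕ
    return w_vals ./ (1 - β) .>= h
end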
[Figure: the value function, the continuation value, and the stopping value w/(1 − β), plotted against the wage offer.]
The technique we employed to solve the job search model in §1.3.1 follows a standard
approach to dynamic programming. But for this particular problem, there is an easier
way to compute the optimal policy that sidesteps calculating the value function. This
section explains how.
Recall that the value function satisfies the Bellman equation
$$ v^*(w) = \max\left\{ \frac{w}{1-\beta},\; c + \beta \sum_{w' \in W} v^*(w')\,\varphi(w') \right\} \qquad (w \in W), \tag{1.31} $$
and that the continuation value is given by (1.26). We can use ℎ∗ to eliminate 𝑣∗
from (1.31). First we insert ℎ∗ on the right hand side of (1.31) and then we replace
w with w′, which gives v∗(w′) = max{w′/(1 − β), h∗}. Then we take mathematical expectations, multiplying by φ(w′) and summing over w′, and use (1.26) to obtain
$$ h^* = c + \beta \sum_{w' \in W} \max\left\{ \frac{w'}{1-\beta},\; h^* \right\} \varphi(w'). $$
In other words, the optimal continuation value h∗ is a fixed point of the map
$$ g(h) := c + \beta \sum_{w' \in W} \max\left\{ \frac{w'}{1-\beta},\; h \right\} \varphi(w') \qquad (h \in \mathbb{R}_+). \tag{1.33} $$
[Figure 1.12: the map g and the 45 degree line; the unique fixed point is h∗.]
Figure 1.12 shows the function 𝑔 using the discrete wage offer distribution and
parameters as adopted previously. The unique fixed point is ℎ∗ .
Exercise 1.3.2 implies that we can compute ℎ∗ by choosing arbitrary ℎ ∈ R+ and
iterating with 𝑔. Doing so produces a value of approximately 1086. (The associated
reservation wage is 𝑤∗ = (1 − 𝛽 ) ℎ∗ ≈ 43.4.) Computation of ℎ∗ using this method is
much faster than value function iteration because the fixed point problem is in ℝ₊ rather than ℝ₊ⁿ.
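A minimal sketch of this computation, iterating on g as in (1.33) and reusing the hypothetical create_job_search_model from earlier (so the numerical output depends on those illustrative parameters):

"Compute the continuation value h* by iterating on g from (1.33)."
function compute_h_star(model; tol=1e-8, max_iter=10_000)
    (; w_vals, ϕ, β, c) = model
    h = 0.0
    for k in 1:max_iter
        h_new = c + β * (max.(w_vals ./ (1 - β), h))'ϕ   # h_{k+1} = g(h_k)
        abs(h_new - h) < tol && return h_new
        h = h_new
    end
    return h
end

model = create_job_search_model()
h_star = compute_h_star(model)
println((h_star, (1 - model.β) * h_star))   # h* and the implied reservation wage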
With h∗ in hand we have solved the dynamic programming problem, since a policy can be computed directly from h∗: accept w exactly when w ⩾ (1 − β)h∗, as in (1.30).
Chapter 2

Operators and Fixed Points

This chapter discusses techniques that underlie the optimization and fixed point methods used throughout the book. Many of these techniques relate to order. Order-theoretic concepts will prove valuable not only for fixed point methods but also for understanding the main concepts in dynamic programming. Chapter 8 will show that core components of dynamic programming can be expressed in terms of simple order-theoretic constructs.
2.1 Stability
In this section we discuss algorithms for computing fixed points and analyze their
convergence.
First we treat a technique for simplifying analysis of stability and fixed points that
we’ll apply in applications.
To illustrate the idea, suppose that we want to study dynamics induced by a self-
map 𝑇 on 𝑈 ⊂ R𝑛 . We might want to know if a unique fixed point of 𝑇 exists and if
iterates of 𝑇 converge to a fixed point. One approach is to apply fixed point theory to
𝑇.
However, sometimes there is an easier approach: transform T into a "simpler" map T̂ and study the fixed point properties of T̂. For this to work, we need to be sure that
2.1.1.1 Conjugacy
[Figure: commutative diagram for conjugate dynamical systems, mapping u to Tu on U and û to T̂û on Û, with Φ and Φ⁻¹ connecting the two spaces.]
EXERCISE 2.1.1. Show that if (U, T) and (Û, T̂) are conjugate under Φ, then u ∈ U is a fixed point of T on U if and only if Φu ∈ Û is a fixed point of T̂ on Û.

EXERCISE 2.1.2. Extending Exercise 2.1.1, let (U, T) and (Û, T̂) be dynamical systems and let fix(T) and fix(T̂) be the set of fixed points of T and T̂, respectively. Show that Φ is a bijection from fix(T) to fix(T̂).
The next result summarizes the most important consequences of our findings.
In particular, T has a unique fixed point on U if and only if T̂ has a unique fixed point on Û.
Assume again that U and Û are subsets of ℝⁿ. In this setting, we say that dynamical systems (U, T) and (Û, T̂) are topologically conjugate under Φ if (U, T) and (Û, T̂) are conjugate under Φ and, in addition, Φ is a homeomorphism.
EXERCISE 2.1.3. Let U := (0, ∞) and Û := ℝ. Let Tu = Au^α, where A > 0 and α ∈ ℝ, and let T̂û = ln A + αû. Show that T and T̂ are topologically conjugate under Φ := ln.
EXERCISE 2.1.4. Consider again the setting of Exercise 2.1.1, but now suppose that (U, T) and (Û, T̂) are topologically conjugate under Φ. Fixing u, u∗ ∈ U, show that lim_{k→∞} T^k u = u∗ if and only if lim_{k→∞} T̂^k Φu = Φu∗.
The next exercise asks you to show that topological conjugacy is an equivalence relation, as defined in §A.1.

EXERCISE 2.1.5. Let 𝒰 be the set of all dynamical systems (U, T) with U ⊂ ℝⁿ. Show that topological conjugacy is an equivalence relation on 𝒰.
From the preceding exercises we can state the following useful result:
EXERCISE 2.1.6. Returning to the setting of Exercise 2.1.4, let (U, T) and (Û, T̂) be topologically conjugate and let u∗ be a fixed point of T in U. Show that u∗ is locally stable for (U, T) if and only if Φu∗ is locally stable for (Û, T̂).
In one dimension, the first-order approximation of a differentiable map g at a fixed point x∗ is
$$ \hat g(x) := g(x^*) + g'(x^*)(x - x^*) = x^* + g'(x^*)(x - x^*), $$
and, in ℝⁿ, with J_T denoting the Jacobian of T at the fixed point u∗,
$$ \hat T u = u^* + J_T(u^*)(u - u^*) \qquad (u \in U). $$
Combining this theorem with the result of Exercise 2.1.6, we see that, under the
conditions of the theorem, 𝑢∗ is globally stable for ( 𝑂, 𝑇 ), and hence locally stable for
(𝑈, 𝑇 ), whenever ( 𝑂, 𝑇ˆ) is globally stable. By the Neumann series lemma, the first-
order approximation will be globally stable whenever 𝐽𝑇 ( 𝑢∗ ) has spectral radius less
than one. Thus, we have
Corollary 2.1.4. Under the conditions of Theorem 2.1.3, the fixed point 𝑢∗ is locally
stable whenever 𝜌 ( 𝐽𝑇 ( 𝑢∗ )) < 1.
In addition,
$$ T u_k = u^* + T'u^*\,(u_k - u^*) + \frac{T'' v_k}{2}\,(u_k - u^*)^2. \tag{2.1} $$
(The restriction that 0 < |T′u∗| < 1 in Exercise 2.1.7 is mild. For example, given convergence of successive approximation to the fixed point, we expect |T′u∗| < 1, since this inequality implies that u∗ is locally stable.)
While successive approximation always converges when global stability holds, faster
fixed point algorithms can often be obtained by leveraging extra information, such as
gradients. Newton’s method is an important gradient-based technique. (As we discuss
in §5.1.4.2, Newton’s method is a key component of algorithms for solving dynamic
programs.)
While Newton’s method is often used to solve for roots of a given function, here
we use it to find fixed points.
[Figure 2.1: the map T, its first-order approximation T̂ at u₀, the 45 degree line, and the points u₀, u₁, and u∗.]
Figure 2.1 shows 𝑢0 and 𝑢1 when 𝑛 = 1 and 𝑇𝑢 = 1 + 𝑢/( 𝑢 + 1) and 𝑢0 = 0.5. The value
𝑢1 is the fixed point of the first-order approximation 𝑇ˆ. It is closer to the fixed point
of 𝑇 than 𝑢0 , as desired.
Newton's (fixed point) method continues in the same way, from u₁ to u₂ and so on, leading to the sequence of points
$$ u_{k+1} = Q(u_k), \quad \text{where} \quad Q(u) := (I - J_T(u))^{-1}\,\big(Tu - J_T(u)\,u\big), \tag{2.2} $$
with J_T(u) the Jacobian of T at u. (Each u_{k+1} is the fixed point of the first-order approximation of T taken at u_k.) We need not write a new solver, since the successive approximation function in Listing 4 can be applied to Q defined in (2.2).
Figure 2.2 shows both the Newton approximation sequence and the successive ap-
proximation sequence applied to computing the fixed point of the Solow–Swan model
from §1.2.3.2. We use two different initial conditions (top and bottom subfigures).
Both sequences converge, but the Newton sequences converge faster.
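For the one-dimensional Solow–Swan map, the Jacobian in (2.2) reduces to the scalar derivative g′(k), so a minimal sketch of the comparison behind Figure 2.2 (with illustrative parameters) is:

# Newton's fixed point iteration (2.2) for the Solow-Swan map, in one dimension
A, s, α, δ = 2.0, 0.3, 0.3, 0.4            # illustrative parameters
g(k)  = s * A * k^α + (1 - δ) * k
Dg(k) = α * s * A * k^(α - 1) + (1 - δ)    # derivative of g
Q(k)  = (g(k) - Dg(k) * k) / (1 - Dg(k))   # fixed point of the linearization at k

k_newton, k_successive = 0.5, 0.5
for t in 1:10
    global k_newton = Q(k_newton)
    global k_successive = g(k_successive)
end
println((k_newton, k_successive, (s * A / δ)^(1 / (1 - α))))  # both approach k*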
[Figure 2.2: successive approximation and Newton steps for the Solow–Swan model from two initial conditions; both sequences converge to k∗, the Newton steps faster.]
A fast rate of convergence for the Newton scheme can be confirmed theoretically: un-
der mild conditions, there exists a neighborhood of the fixed point within which the
Newton iterates converge quadratically. See, for example, Theorem 5.4.1 of Atkinson
and Han (2005). Some dynamic programming algorithms take advantage of this fast
rate of convergence (see §5.1.4.3).
[Figure: the maps g and Q and the 45 degree line, plotted in (k_t, k_{t+1}) space.]
However, Q is well-behaved near the fixed point (i.e., very flat and hence
strongly contractive) but poorly behaved away from the fixed point. This illustrates
that Newton’s method is fast but generally less robust.
2.1.4.4 Parallelization
Because each Newton step involves inverting (or factorizing) matrices of possibly high dimension, each step is computationally intensive. At the same
time, since the rate of convergence is faster, we have to take fewer steps. In this sense,
the algorithm is less serial – it involves a smaller number of more expensive steps. Be-
cause it is less serial, Newton’s method offers far more potential for parallelization.
Thus, the speed gain associated with Newton’s method can become very large when
using effective parallelization.
2.2 Order
A partial order ⪯ on a nonempty set P is a binary relation on P satisfying, for all p, q, r ∈ P,
p ⪯ p (reflexivity)
p ⪯ q and q ⪯ p implies p = q, and (antisymmetry)
p ⪯ q and q ⪯ r implies p ⪯ r (transitivity)
EXERCISE 2.2.1. Let P be any set and consider the relation ⪯ induced by equality, so that p ⪯ q if and only if p = q. Show that this relation is a partial order on P.
EXERCISE 2.2.2. Let 𝑀 be any set. Show that set inclusion ⊂ induces a partial
order on ℘ ( 𝑀 ), the set of all subsets of 𝑀 .
[Figure 2.4: three vectors u, v, and w in ℝ², illustrating the pointwise order.]
Example 2.2.2 (Pointwise partial order). Fix an arbitrary nonempty set X. The pointwise order ⩽ on the set ℝ^X of all functions from X to ℝ is defined as follows: for u, v ∈ ℝ^X, we write u ⩽ v if u(x) ⩽ v(x) for all x ∈ X.
The preceding pointwise concepts extend immediately to vectors, since vectors are
just real-valued functions under the identification asserted in Lemma 1.2.4 (page 30).
In particular, for vectors u = (u₁, …, uₙ) and v = (v₁, …, vₙ) in ℝⁿ, we write u ⩽ v if uᵢ ⩽ vᵢ for all i, and u ≪ v if uᵢ < vᵢ for all i.
EXERCISE 2.2.5. Limits in R preserve weak inequalities. Use this property to prove
that the same is true in R𝑛 . In particular, show that, for vectors 𝑎, 𝑏 ∈ R𝑛 and sequence
(𝑢𝑘 ) in R𝑛 with 𝑎 ⩽ 𝑢𝑘 ⩽ 𝑏 for all 𝑘 ∈ N and 𝑢𝑘 → 𝑢 ∈ R𝑛 , we have 𝑎 ⩽ 𝑢 ⩽ 𝑏.
Example 2.2.3 (Pointwise order over matrices). Analogous to vectors, for n × k matrices A = (a_ij) and B = (b_ij), we write
• A ⩽ B if a_ij ⩽ b_ij for all i, j, and
• A ≪ B if a_ij < b_ij for all i, j.
EXERCISE 2.2.6. Explain why the pointwise order introduced in Example 2.2.3 is
also a special case of the pointwise order over functions.
Example 2.2.4. The usual order ⩽ on R is a total order, as is the same order on N.
Example 2.2.5. Figure 2.4 shows that the pointwise order ⩽ is not a total order on
R𝑛 . For example, neither 𝑣 ⩽ 𝑤 nor 𝑤 ⩽ 𝑣, since 𝑤1 > 𝑣1 but 𝑤2 < 𝑣2 .
EXERCISE 2.2.9. Is the partial order defined in Exercise 2.2.2 a total order? Either
prove that it is or provide a counterexample.
EXERCISE 2.2.10. Let 𝑃 be any partially ordered set and fix 𝐴 ⊂ 𝑃 . Prove that 𝐴
has at most one greatest element and at most one least element.
EXERCISE 2.2.11. Let 𝑀 be a nonempty set and let ℘ ( 𝑀 ) be the set of all subsets
of M, partially ordered by ⊂. Let {A_i} = {A_i}_{i∈I} be a subset of ℘(M), where I is an arbitrary nonempty index set. Show that S := ∪_i A_i is the greatest element of {A_i} if and only if S ∈ {A_i}.
EXERCISE 2.2.12. Adopt the setting of Exercise 2.2.11 and suppose that { 𝐴𝑖 } is the
set of bounded subsets of R𝑛 . Prove that { 𝐴𝑖 } has no greatest element.
Concepts of suprema and infima on the real line (Appendix A) extend naturally to
partially ordered sets. Given a partially ordered set ( 𝑃, ) and a nonempty subset 𝐴
of P, we call u ∈ P an upper bound of A if a ⪯ u for all a in A. Letting U_P(A) be the set of all upper bounds of A in P, we call ū ∈ P a supremum of A if ū ∈ U_P(A) and ū ⪯ u for every u ∈ U_P(A).
Thus, 𝑢¯ is the least element (see §2.2.1.2) of the set of upper bounds 𝑈 𝑃 ( 𝐴), whenever
it exists.
EXERCISE 2.2.13. Prove that 𝐴 has at most one supremum in 𝑃 .
Suprema and greatest elements are clearly related. The next exercise clarifies this.
EXERCISE 2.2.14. Prove the following statements in the setting described above:
(i) If ā = ∨A and ā ∈ A, then ā is a greatest element of A.
(ii) If A has a greatest element ā, then ā = ∨A.

Remark 2.2.2. In view of Exercise 2.2.14, when A has a greatest element, we can refer to it by ∨A. This notation is used frequently throughout the book.
EXERCISE 2.2.16. Let M be a nonempty set and let ℘(M) be the set of all subsets of M, partially ordered by ⊂. Let {A_i}_{i∈I} be a subset of ℘(M). Prove that ∨_i A_i = ∪_i A_i and ∧_i A_i = ∩_i A_i.
EXERCISE 2.2.17. Even when 𝑃 is totally ordered, existence of suprema and infima
for an abstract partially ordered set ( 𝑃, ) can fail. Provide an example of a totally
ordered set 𝑃 and a subset 𝐴 of 𝑃 that has no supremum in 𝑃 .
For us, the pointwise partial order ⩽ introduced in Example 2.2.2 is especially useful.
In this section we review some properties of this order. Throughout, X is an arbitrary
finite set.
A subset V of ℝ^X is called a sublattice of ℝ^X if
u, v ∈ V implies u ∨ v ∈ V and u ∧ v ∈ V.
For example, the sets
V₁ := {f ∈ ℝ^X : f ⩾ 0}, V₂ := {f ∈ ℝ^X : f ≫ 0} and V₃ := {f ∈ ℝ^X : |f| ⩽ 1}
are all sublattices of ℝ^X.
Above we discussed the fact that, for a pair of functions {𝑢, 𝑣}, the supremum in
( RX , ⩽) is the pointwise maximum, while the infimum in ( RX , ⩽) is the pointwise
minimum. The same principle holds for finite collections of functions. Thus, if { 𝑣𝑖 } ≔
{ 𝑣𝑖 } 𝑖∈ 𝐼 is a finite subset of RX , then, for all 𝑥 ∈ X,
$$ \left( \bigvee_i v_i \right)(x) := \max_{i \in I} v_i(x) \quad \text{and} \quad \left( \bigwedge_i v_i \right)(x) := \min_{i \in I} v_i(x). $$
The next example discusses greatest elements in the setting of pointwise order.
[Figure 2.5: two functions v_σ′ and v_σ″ on W with Σ = {σ′, σ″}, together with their pointwise maximum v∗ = ∨_{σ∈Σ} v_σ.]
Example 2.2.7. Let X be nonempty and fix 𝑉 ⊂ RX . Let 𝑉 be partially ordered by the
pointwise order ⩽. Let { 𝑣𝜎 } ≔ { 𝑣𝜎 }𝜎∈Σ be a finite subset of 𝑉 and let 𝑣∗ ≔ ∨𝜎 𝑣𝜎 ∈ RX
be the pointwise maximum. If 𝑣∗ ∈ { 𝑣𝜎 }, then 𝑣∗ is the greatest element of { 𝑣𝜎 }. If
not, then { 𝑣𝜎 } has no greatest element.
Figure 2.5 helps illustrate Example 2.2.7. In this case, v∗ is not in {v_σ} and {v_σ} has no greatest element (since neither v_σ′ ⩽ v_σ″ nor v_σ″ ⩽ v_σ′).
EXERCISE 2.2.21. Prove the two claims at the end of Example 2.2.7.
In this section we note some useful inequalities and identities related to the pointwise
partial order on RX . As before, X is any finite set.
(i) | 𝑓 + 𝑔 | ⩽ | 𝑓 | + | 𝑔 |.
(ii) ( 𝑓 ∧ 𝑔) + ℎ = ( 𝑓 + ℎ) ∧ ( 𝑔 + ℎ) and ( 𝑓 ∨ 𝑔) + ℎ = ( 𝑓 + ℎ) ∨ ( 𝑔 + ℎ).
(iii) ( 𝑓 ∨ 𝑔) ∧ ℎ = ( 𝑓 ∧ ℎ) ∨ ( 𝑔 ∧ ℎ) and ( 𝑓 ∧ 𝑔) ∨ ℎ = ( 𝑓 ∨ ℎ) ∧ ( 𝑔 ∨ ℎ).
(iv) | 𝑓 ∧ ℎ − 𝑔 ∧ ℎ | ⩽ | 𝑓 − 𝑔 |.
(v) | 𝑓 ∨ ℎ − 𝑔 ∨ ℎ | ⩽ | 𝑓 − 𝑔 |.
( 𝑓 + 𝑔 ) ∧ ℎ ⩽ ( 𝑓 ∧ ℎ) + ( 𝑔 ∧ ℎ) . (2.3)
𝑓 = 𝑓 − 𝑔 + 𝑔 ⩽ | 𝑓 − 𝑔| + 𝑔
The inequality in Lemma 2.2.2 helps with dynamic programming problems that
involve maximization. The next exercise below concerns minimization.
We end this section with a discussion of upper envelopes. To frame the discussion,
we take {𝑇𝜎 } ≔ {𝑇𝜎 }𝜎∈Σ to be a finite family of self-maps on a sublattice 𝑉 of RX .
Consider some properties of the operator T on V defined by
$$ Tv = \bigvee_{\sigma \in \Sigma} T_\sigma v \qquad (v \in V). $$
Lemma 2.2.3. If, for each 𝜎 ∈ Σ, the operator 𝑇𝜎 is a contraction of modulus 𝜆 𝜎 under
the supremum norm, then 𝑇 is a contraction of modulus max𝜎 𝜆 𝜎 under the same norm.
Proof. Let the stated conditions hold and fix 𝑢, 𝑣 ∈ 𝑉 . Applying Lemma 2.2.2, we get
2.2.3.1 Definition
Given two partially ordered sets (P, ⪯) and (Q, ⊴), a map T from P to Q is called order-preserving if, given p, p′ ∈ P, we have
$$ p \preceq p' \implies Tp \trianglelefteq Tp', \tag{2.6} $$
and order-reversing if
$$ p \preceq p' \implies Tp' \trianglelefteq Tp. \tag{2.7} $$
Since f ⩽ g implies ∫ₐᵇ f(x) dx ⩽ ∫ₐᵇ g(x) dx, the map I is order-preserving on C[a, b].
EXERCISE 2.2.29. Prove: If 𝑃 is any partially ordered set and 𝑓 , 𝑔 ∈ 𝑖R𝑃 , then
(i) 𝛼 𝑓 + 𝛽𝑔 ∈ 𝑖R𝑃 whenever 𝛼, 𝛽 ⩾ 0.
(ii) 𝑓 ∨ 𝑔 ∈ 𝑖R𝑃 and 𝑓 ∧ 𝑔 ∈ 𝑖R𝑃 .
The next exercise shows that, in a totally ordered setting, an increasing function
can be represented as the sum of increasing binary functions.
EXERCISE 2.2.32. Let X = {x₁, …, xₙ} where x_k ≺ x_{k+1} for all k. Show that, for any u ∈ iℝ^X, there exist s₁, …, sₙ in ℝ₊ such that u(x) = Σ_{k=1}^n s_k 1{x ⪰ x_k} for all x ∈ X.
Tu = T(v + u − v) ⩽ T(v + ‖u − v‖∞) ⩽ Tv + β‖u − v‖∞.
The relation ⪯_F is also called first order stochastic dominance to differentiate it from other forms of stochastic order.
[Figure 2.6: the binomial distributions φ = B(10, 0.5) and ψ = B(18, 0.5).]
Example 2.2.11. If φ and ψ are the binomial distributions defined above and X = {0, …, 18}, then φ ⪯_F ψ holds. Indeed, if W₁, …, W₁₈ are IID binary random variables with P{Wᵢ = 1} = 0.5 for all i, then X := Σᵢ₌₁¹⁰ Wᵢ has distribution φ and Y := Σᵢ₌₁¹⁸ Wᵢ has distribution ψ. In addition, X ⩽ Y with probability one (i.e., for any outcome of the draws W₁, …, W₁₈). It follows that, for any given u ∈ iℝ^X, we have u(X) ⩽ u(Y) with probability one. Hence E u(X) ⩽ E u(Y) holds, which is the same statement as (2.9).
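A quick numerical check of this example (the increasing function u below is an arbitrary choice of ours):

using Distributions

φ, ψ = Binomial(10, 0.5), Binomial(18, 0.5)
u(x) = sqrt(x)                              # an arbitrary increasing function
Eu_φ = sum(u(x) * pdf(φ, x) for x in 0:18)
Eu_ψ = sum(u(x) * pdf(ψ, x) for x in 0:18)
println(Eu_φ <= Eu_ψ)                       # true, consistent with φ ⪯_F ψ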
Remark 2.2.4. The last paragraph helps explain the pervasiveness of stochastic dom-
inance in economics. It is standard to assume that economic agents have increasing
utility functions and use expected utility to rank lotteries. In such environments, an
upward shift in a lottery, as measured by stochastic dominance, makes all agents bet-
ter off.
For a given distribution 𝜑, the function 𝐺 𝜑 is sometimes called the counter CDF
(counter cumulative distribution function) of 𝜑.
(i) φ ⪯_F ψ =⇒ G_φ ⩽ G_ψ.
(ii) If X is totally ordered by ⪯, then G_φ ⩽ G_ψ =⇒ φ ⪯_F ψ.
The proof is given on page 345. Figure 2.7 helps to illustrate. Here X ⊂ ℝ and φ and ψ are distributions on X. We can see that φ ⪯_F ψ because the counter CDFs are ordered in the sense that G_φ ⩽ G_ψ pointwise on X.
EXERCISE 2.2.34. Prove the transitivity component of Lemma 2.2.6, i.e., prove that ⪯_F is transitive on D(X).

EXERCISE 2.2.35. Fix τ ∈ (0, 1] and let Q_τ be the quantile function defined on page 32. Choose φ, ψ ∈ D(X) and let X, Y be X-valued random variables with distributions φ and ψ respectively. Prove that φ ⪯_F ψ implies Q_τ(X) ⩽ Q_τ(Y).
[Figure 2.7: two distributions φ and ψ on X ⊂ ℝ (top panel) and their counter CDFs, which satisfy G_φ ⩽ G_ψ (bottom panel).]
will increase steady state inflation. By providing sufficient conditions for monotone
shifts in fixed points, results in this section can help answer such questions.
Let (𝑃, ⪯) be a partially ordered set. Given two self-maps 𝑆 and 𝑇 on a set 𝑃, we
write 𝑆 ⪯ 𝑇 if 𝑆𝑢 ⪯ 𝑇𝑢 for every 𝑢 ∈ 𝑃 and say that 𝑇 dominates 𝑆 on 𝑃.
EXERCISE 2.2.36. Let (𝑃, ⪯) be a partially ordered set and let 𝑆 and 𝑇 be order-
preserving self-maps such that 𝑆 ⪯ 𝑇. Show that 𝑆^𝑘 ⪯ 𝑇^𝑘 holds for all 𝑘 ∈ N.
EXERCISE 2.2.37. Let (𝑃, ⪯) be a partially ordered set, let S be the set of all self-
maps on 𝑃 and, as above, write 𝑆 ⪯ 𝑇 if 𝑇 dominates 𝑆 on 𝑃. Show that ⪯ is a partial
order on S.
One might assume that, in a setting where 𝑇 dominates 𝑆, the fixed points of 𝑇
will be larger. This can hold, as in Figure 2.8, but it can also fail, as in Figure 2.9.
A difference between these two situations is that in Figure 2.8 the map 𝑇 is globally
stable. This leads us to our next result.
[Figures 2.8 and 2.9: self-maps 𝑆 and 𝑇 with 𝑇 dominating 𝑆, and their fixed points 𝑢_𝑆 and 𝑢_𝑇]
Proposition 2.2.7. Let 𝑆 and 𝑇 be self-maps on 𝑀 ⊂ R𝑛 and let ⩽ be the pointwise order.
If 𝑇 dominates 𝑆 on 𝑀 and, in addition, 𝑇 is order-preserving and globally stable on 𝑀 ,
then its unique fixed point dominates any fixed point of 𝑆.
Proof of Proposition 2.2.7. Assume the conditions of the proposition and let 𝑢𝑇 be the
unique fixed point of 𝑇 . Let 𝑢𝑆 be any fixed point of 𝑆. Since 𝑆 ⩽ 𝑇 , we have 𝑢𝑆 =
𝑆𝑢𝑆 ⩽ 𝑇𝑢𝑆 . Applying 𝑇 to both sides of this inequality and using the order-preserving
property of 𝑇 and transitivity of ⩽ gives 𝑢𝑆 ⩽ 𝑇 2 𝑢𝑆 . Continuing in this fashion yields
𝑢𝑆 ⩽ 𝑇 𝑘 𝑢𝑆 for all 𝑘 ∈ N. Taking the limit in 𝑘 and using the fact that ⩽ is closed under
limits gives 𝑢𝑆 ⩽ 𝑢𝑇 . □
Figure 2.10 helps illustrate the results of Exercise 2.2.38. The top left sub-figure
shows a baseline parameterization, with 𝐴 = 2.0, 𝑠 = 𝛼 = 0.3 and 𝛿 = 0.4. The
other sub-figures show how the steady state changes as parameters deviate from that
baseline.
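As a quick numerical illustration of Proposition 2.2.7, the following sketch computes the fixed points by successive approximation and confirms that raising 𝐴 raises the steady state. It rests on an assumption: we take the law of motion in Exercise 2.2.38 to be a Solow-style update 𝑔(𝑘) = 𝑠𝐴𝑘^𝛼 + (1 − 𝛿)𝑘, which matches the parameters listed above but is not spelled out in this section.

# Illustrative sketch only: the Solow-style form of g is an assumption.
g(k; A=2.0, s=0.3, α=0.3, δ=0.4) = s * A * k^α + (1 - δ) * k

function fixed_point(h; k0=1.0, tol=1e-10, max_iter=10_000)
    k = k0
    for _ in 1:max_iter
        k_new = h(k)
        abs(k_new - k) < tol && return k_new
        k = k_new
    end
    return k
end

k_baseline = fixed_point(k -> g(k))          # steady state at the baseline parameters
k_high_A   = fixed_point(k -> g(k; A=2.5))   # steady state with larger A
@assert k_baseline <= k_high_A               # consistent with Proposition 2.2.7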
EXERCISE 2.2.39. In (1.33) on page 40, we defined a map 𝑔 such that the opti-
mal continuation value ℎ∗ is a fixed point. Using this construction, prove that ℎ∗ is
increasing in 𝛽 .
Figure 2.11 gives an illustration of the result in Exercise 2.2.39. Here an increase in
𝛽 leads to a larger continuation value. This seems reasonable, since larger 𝛽 indicates
more concern about outcomes in future periods.
While the preceding examples of parametric monotonicity are all one-dimensional,
we will soon see that Proposition 2.2.7 can also be applied in high-dimensional set-
tings.
[Figure 2.10: steady states under the baseline parameterization (default) and under 𝐴 = 2.5, 𝑠 = 0.2, and 𝛿 = 0.6]
[Figure 2.11: the maps 𝑔_1 (𝛽_1 = 0.95) and 𝑔_2 (𝛽_2 = 0.96), the 45-degree line, and the fixed points ℎ∗_1 and ℎ∗_2]
𝐴𝑒 = 𝜌(𝐴)𝑒 and 𝜀𝐴 = 𝜌(𝐴)𝜀. (2.10)
If 𝐴 is irreducible, then the right and left eigenvectors are everywhere positive and unique.
Moreover, if 𝐴 is everywhere positive, then with 𝑒 and 𝜀 normalized so that ⟨𝜀, 𝑒⟩ = 1, we
have
𝜌(𝐴)^{−𝑡} 𝐴^𝑡 → 𝑒𝜀    (𝑡 → ∞). (2.11)
We can use the Perron–Frobenius theorem to provide bounds on the spectral radius
of a nonnegative matrix. Fix 𝑛 × 𝑛 matrix 𝐴 = ( 𝑎𝑖 𝑗 ) and set
• rowsum_𝑖(𝐴) ≔ Σ_𝑗 𝑎_{𝑖𝑗} = the 𝑖-th row sum of 𝐴 and
• colsum_𝑗(𝐴) ≔ Σ_𝑖 𝑎_{𝑖𝑗} = the 𝑗-th column sum of 𝐴.
EXERCISE 2.3.1. Prove Lemma 2.3.2. (Hint: Since 𝑒 and 𝜀 are nonnegative and
nonzero, and since eigenvectors are defined only up to nonzero multiples, you can
assume that both of these vectors sum to 1.)
Let 𝐴 be an 𝑛 × 𝑛 matrix. We know from Gelfand’s formula (page 19) that if ‖·‖ is any
matrix norm, then ‖𝐴^𝑘‖^{1/𝑘} → 𝜌(𝐴) as 𝑘 → ∞. While useful, this lemma can be difficult
to apply because it involves matrix norms. Fortunately, when 𝐴 is nonnegative, we
have the following variation, which only involves vector norms.
The expression on the left of (2.12) is sometimes called the local spectral ra-
dius of 𝐴 at ℎ. Lemma 2.3.3 gives one set of conditions under which a local spectral
radius equals the spectral radius. This result will be useful when we examine state-
dependent discounting in Chapter 6.
For a proof of Lemma 2.3.3 see Theorem 9.1 of Krasnosel’skii et al. (1972).
𝑃⩾0 and 𝑃1 = 1
where 1 is a column vector of ones, so that 𝑃 is nonnegative and has unit row sums.
The Perron–Frobenius theorem will be useful for the following exercise.
The vector 𝜓 in part (iii) of Exercise 2.3.2 is called a stationary distribution for 𝑃 .
Such distributions play an important role in the theory of Markov chains. We discuss
their interpretation and significance in §3.1.2.
EXERCISE 2.3.3. Given Markov matrix 𝑃 and constant 𝜀 > 0, prove the following
result: There exists no ℎ ∈ RX with 𝑃ℎ ⩾ ℎ + 𝜀.
[Figure 2.12: flows between new entrants, unemployed, and employed workers, with transition rates 𝑏, 𝜆(1 − 𝑑), (1 − 𝜆)(1 − 𝑑), 𝛼(1 − 𝑑), and (1 − 𝛼)(1 − 𝑑)]
We assume that all parameters lie in (0, 1). New workers are initially unemployed.
Transition rates between two pools appear in Figure 2.12. For example, the rate
of flow from employment to unemployment is 𝛼 (1 − 𝑑 ), which equals the fraction of
employed workers who remained in the labor market and separated from their jobs.
Let 𝑒𝑡 and 𝑢𝑡 be the number of employed and unemployed workers at time 𝑡 re-
spectively. The total population (of workers) is 𝑛𝑡 ≔ 𝑒𝑡 + 𝑢𝑡 . In view of the rates stated
above, the number of unemployed workers evolves according to
𝑢_{𝑡+1} = 𝛼(1 − 𝑑) 𝑒_𝑡 + (1 − 𝜆)(1 − 𝑑) 𝑢_𝑡 + 𝑏 𝑛_𝑡.
The three terms on the right correspond to the newly unemployed (due to separation),
the unemployed who failed to find jobs last period, and new entrants into the labor
force. The number of employed workers evolves according to
𝑒_{𝑡+1} = 𝜆(1 − 𝑑) 𝑢_𝑡 + (1 − 𝛼)(1 − 𝑑) 𝑒_𝑡.
Evolution of the time series for 𝑢𝑡 , 𝑒𝑡 and 𝑛𝑡 is illustrated in Figure 2.13. We set
parameters to 𝛼 = 0.01, 𝜆 = 0.1, 𝑑 = 0.02, and 𝑏 = 0.025. The initial population of
unemployed and employed workers are 𝑢0 = 0.6 and 𝑒0 = 1.2, respectively. The series
grow over the long run due to net population growth.
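A minimal sketch of the simulation behind Figure 2.13 (our code, not the book’s listing), iterating the flow equations stated above with the parameters just given:

function simulate_labor_market(; α=0.01, λ=0.1, d=0.02, b=0.025,
                                 u0=0.6, e0=1.2, T=100)
    u_path, e_path = zeros(T + 1), zeros(T + 1)
    u_path[1], e_path[1] = u0, e0
    for t in 1:T
        u, e = u_path[t], e_path[t]
        # newly unemployed + unsuccessful searchers + new entrants
        u_path[t+1] = α * (1 - d) * e + (1 - λ) * (1 - d) * u + b * (u + e)
        # new hires + continuing employed workers
        e_path[t+1] = λ * (1 - d) * u + (1 - α) * (1 - d) * e
    end
    return u_path, e_path, u_path .+ e_path     # the last series is n_t
end

u_path, e_path, n_path = simulate_labor_market()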
Can we say more about the dynamics of this system? For example, what long run
unemployment rate should we expect? Also, do long run outcomes depend heavily
on the initial conditions 𝑢0 and 𝑒0 ? Can we make some general statements that hold
regardless of the initial state?
[Figure 2.13: time paths of 𝑢_𝑡, 𝑒_𝑡, and 𝑛_𝑡 over 𝑡 = 0, . . . , 100]
To begin to address these questions, we first organize the linear system for ( 𝑒𝑡 )
and (𝑢𝑡 ) by setting
𝑥_𝑡 ≔ (𝑢_𝑡, 𝑒_𝑡)^⊤   and   𝐴 ≔ [ (1 − 𝑑)(1 − 𝜆) + 𝑏    (1 − 𝑑)𝛼 + 𝑏
                                  (1 − 𝑑)𝜆              (1 − 𝑑)(1 − 𝛼) ].    (2.13)
With these definitions, we can write the dynamics as 𝑥_{𝑡+1} = 𝐴𝑥_𝑡. As a result, 𝑥_𝑡 = 𝐴^𝑡 𝑥_0,
where 𝑥_0 = (𝑢_0, 𝑒_0)^⊤.
The overall growth rate of the total labor force is 𝑔 = 𝑏 − 𝑑 , in the sense that
𝑛𝑡+1 = (1 + 𝑔 ) 𝑛𝑡 for all 𝑡 .
EXERCISE 2.3.4. Confirm this claim by using the equation 𝑥𝑡+1 = 𝐴𝑥𝑡 .
EXERCISE 2.3.5. Prove that 𝜌 ( 𝐴) = 1 + 𝑔. [Hint: Use one of the results in §2.3.1.1.]
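As an informal numerical check on the claim in Exercise 2.3.5 (a sketch, not a proof), one can compute 𝜌(𝐴) directly:

using LinearAlgebra

α, λ, d, b = 0.01, 0.1, 0.02, 0.025
A = [(1 - d) * (1 - λ) + b    (1 - d) * α + b;
     (1 - d) * λ              (1 - d) * (1 - α)]
ρ_A = maximum(abs.(eigvals(A)))       # spectral radius of A
@assert ρ_A ≈ 1 + b - d               # matches 1 + g with g = b - d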
[Figure: trajectories of the (unemployed, employed) workforce from initial conditions 𝑥_0 = (0.1, 4.0) and 𝑥_0 = (5.0, 0.1), with 𝑥̄ marked]
and 𝑒¯ ≔ 1 − 𝑢¯.
Remark 2.3.2. A more thorough analysis would require us to think carefully about
how the underlying rates 𝛼, 𝜆 , 𝑏 and 𝑑 are determined. For the hiring rate 𝜆 , we
could use the job search model to fix the rate at which workers are matched to jobs.
In particular, with 𝑤∗ as the reservation wage, we could set
𝜆 = P{𝑤_𝑡 ⩾ 𝑤∗} = Σ_{𝑤 ⩾ 𝑤∗} 𝜑(𝑤)
There are two ways to think about a matrix. In one definition, an 𝑛 × 𝑘 matrix 𝐴 is
an 𝑛 × 𝑘 array of (real) numbers. In the second, 𝐴 is a linear operator from R𝑘 to
R𝑛 that takes a vector 𝑢 ∈ R𝑘 and sends it to 𝐴𝑢 in R𝑛 . Let’s clarify these ideas in a
setting where 𝑛 = 𝑘. While the matrix representation is important, the linear operator
representation is more fundamental and more general.
(We write 𝐿𝑢 instead of 𝐿 (𝑢), etc.) For example, if 𝐴 is an 𝑛 × 𝑛 matrix, then the
map from 𝑢 to 𝐴𝑢 defines a linear operator, since the rules of matrix algebra yield
𝐴 ( 𝛼𝑢 + 𝛽𝑣) = 𝛼𝐴𝑢 + 𝛽 𝐴𝑣.
We just showed that each matrix can be regarded as a linear operator. In fact the
converse is also true:
A proof of Theorem 2.3.4 can be found in Kreyszig (1978) and many other sources.
Why introduce linear operators if they are essentially the same as matrices? One
reason is that, while a one-to-one correspondence between linear operators and ma-
trices holds in R𝑛 , the concept of linear operators is far more general. Linear operators
can be defined over many different kinds of sets whose elements have vector-like prop-
erties. This is related to the point that we made about function spaces in Remark 1.2.2
on page 28.
Another reason is computational: the matrix representation of a linear operator
can be tedious to construct and difficult to instantiate in memory in large problems.
We illustrate this point in §2.3.3.3 below.
Given 𝐿 ∈ L(R^X), we write
(𝐿𝑢)(𝑥) = Σ_{𝑥′∈X} 𝐿(𝑥, 𝑥′) 𝑢(𝑥′)    (𝑥 ∈ X, 𝑢 ∈ R^X). (2.16)
We use the same symbol 𝐿 on both sides of the equals sign because both represent
essentially the same object (in the sense that a matrix 𝐴 can be viewed as a collection
of numbers (𝐴_{𝑖𝑗}) or as a linear map 𝑢 ↦→ 𝐴𝑢).
The function 𝐿 on the right-hand side of (2.16) is sometimes called the “kernel”
of the operator 𝐿. However, we will call it a matrix in what follows, since 𝐿 ( 𝑥, 𝑥 0) =
𝐿 ( 𝑥 𝑖 , 𝑥 𝑗 ) is just an 𝑛 × 𝑛 array of real numbers. When more precision is required, we
will call it the matrix representation of 𝐿.
In essence, the operation in (2.16) is just matrix multiplication: ( 𝐿𝑢)( 𝑥 ) is row 𝑥
of the matrix product 𝐿𝑢.
EXERCISE 2.3.8. Confirm that 𝐿 on the left-hand side of (2.16) is in fact a linear
operator (i.e., an element of L ( RX )).
The eigenvalues and eigenvectors of the linear operator 𝐿 are defined as the eigen-
values and eigenvectors of its matrix representation. The spectral radius 𝜌 ( 𝐿) of 𝐿 is
defined analogously.
We used the same symbol for the operator 𝐿 on the left-hand side of (2.16) and
its matrix representation on the right because these two objects are in one-to-one
correspondence. In particular, every 𝐿 ∈ L ( RX ) can be expressed in the form of
(2.16) for a suitable choice of matrix (𝐿(𝑥, 𝑥′)). Readers who are comfortable with
these claims can skip ahead to §2.3.3.3. The next lemma provides more details.
Lemma 2.3.5 needs no formal proof. Theorem 2.3.4 already tells us that (a) and
(b) are in one-to-one correspondence. Also, (b) and (c) are in one-to-one correspon-
dence because each 𝐿 ∈ L ( RX ) can be identified with a linear operator 𝑢 ↦→ 𝐿𝑢 on R𝑛
by pairing 𝑢, 𝐿𝑢 ∈ RX with its vector representation in R𝑛 (see §1.2.4.2). Finally, (d)
and (a) are in one-to-one correspondence under the identification 𝐿 ( 𝑥 𝑖 , 𝑥 𝑗 ) ↔ 𝐿𝑖 𝑗 .
At the end of §2.3.3.1 we claimed that working with linear operators brings some
computational advantages vis-à-vis working with matrices. This section fills in some
details. (Readers who prefer not to think about computational issues at this point can
skip ahead to §2.3.3.4.)
To illustrate the main idea, consider a setting where the state space X takes the
form X = Y × Z with |Y| = 𝑗 and |Z| = 𝑘. A typical element of X is 𝑥 = ( 𝑦, 𝑧 ). As we
shall see, this kind of setting arises naturally in dynamic programming.
Let 𝑄 be a map from Z × Z to R (i.e., a 𝑘 × 𝑘 matrix) and consider the operator
sending 𝑢 ∈ RX to 𝐿𝑢 ∈ RX according to the rule
(𝐿𝑢)(𝑥) = (𝐿𝑢)(𝑦, 𝑧) = Σ_{𝑧′∈Z} 𝑢(𝑦, 𝑧′) 𝑄(𝑧, 𝑧′) (2.17)
by employing sparse matrices, but doing so adds boilerplate and can be a source of
inefficiency.
Because of these issues, most modern scientific computing environments support
linear operators directly, as well as actions on linear operators such as inverting linear
maps. These considerations encourage us to take an operator-based approach.
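The following sketch (with small, made-up dimensions) illustrates the point: the operator in (2.17) can be applied with one small matrix product, whereas the equivalent matrix representation is a 𝑗𝑘 × 𝑗𝑘 array, built here with a Kronecker product purely for comparison.

using LinearAlgebra

j, k = 20, 30
Q = rand(k, k)                       # an arbitrary k × k matrix for illustration
u = rand(j, k)                       # u(y, z) stored as a j × k array

# Operator form of (2.17): (Lu)(y, z) = Σ_{z'} u(y, z') Q(z, z')
Lu = u * Q'

# Matrix form: a (jk) × (jk) array acting on vec(u); wasteful when j and k are large
L_mat = kron(Q, Matrix(I, j, j))
@assert maximum(abs.(L_mat * vec(u) - vec(Lu))) < 1e-10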
𝑢 ⩾ 0 =⇒ 𝐿𝑢 ⩾ 0. (2.18)
Remark 2.3.3. The characterization in Lemma 2.3.6 suggests that we should really call
a linear operator satisfying (2.18) “nonnegative” rather than positive. Nevertheless,
the “positive” terminology is standard (see, e.g., Zaanen (2012)).
In the next exercise, you can think of 𝜑 as a row vector and 𝜑𝑃 as premultiplying
the matrix 𝑃 by this row vector. Chapter 3 uses the map 𝜑 ↦→ 𝜑𝑃 to update marginal
distributions generated by Markov chains.
Markov operators are important for us because they generate Markov dynamics, a
foundation of dynamic programming. Thus, (2.19) is a rule for updating distributions
by one period under the Markov dynamics specified by 𝑃 . We’ll use it often in the next
chapter.
Davey and Priestley (2002) provide a good introduction to partial orders and order-
theoretic concepts. Our favorite books on fixed points and analysis include Ok (2007),
Zhang (2012), Cheney (2013) and Atkinson and Han (2005). Good background ma-
terial on order-theoretic fixed point methods can be found in Guo et al. (2004) and
Zhang (2012).
Chapter 3
Markov Dynamics
3.1 Foundations
To simplify terminology, we also call ( 𝑋𝑡 ) 𝑃 -Markov when (3.1) holds. We call either
𝑋0 or its distribution 𝜓0 the initial condition of ( 𝑋𝑡 ), depending on context. 𝑃 is also
called the transition matrix of the Markov chain.
The definition of a Markov chain says two things:
(i) When updating to 𝑋𝑡+1 from 𝑋𝑡 , earlier states are not required.
(ii) 𝑃 encodes all of the information required to perform the update, given the cur-
rent state 𝑋𝑡 .
As an example, consider a firm whose inventory of some product follows 𝑆-𝑠 dynam-
ics, meaning that the firm waits until its inventory falls below some level 𝑠 > 0 and
then immediately replenishes by ordering 𝑆 units. This pattern of decisions can be
rationalized if ordering requires paying a fixed cost. Thus, in §5.2.1, we will show
that 𝑆-𝑠 behavior is optimal in a setting where fixed costs exist and the firm’s aim is
to maximize its present value.
𝑋𝑡 ∈ X =⇒ P{ 𝑋𝑡+1 ∈ X} = 1.
Listing 7 provides code that simulates inventory paths and computes other objects
of interest. Since the state space X = { 𝑥1 , . . . , 𝑥𝑛 } corresponds to {0, . . . , 𝑆 + 𝑠 } and
Julia indexing starts at 1, we set 𝑥 𝑖 = 𝑖 − 1. This convention is used when computing
P[i, j], which corresponds to 𝑃 ( 𝑥 𝑖 , 𝑥 𝑗 ). The code in the listing is used to produce
the simulation of inventories in Figure 3.1.
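Listing 7 itself is not reproduced here, but the following sketch conveys the idea; the specific law of motion 𝑋_{𝑡+1} = max(𝑋_𝑡 − 𝐷_{𝑡+1}, 0) + 𝑆 1{𝑋_𝑡 ⩽ 𝑠} and the geometric demand distribution are illustrative assumptions, not the book’s exact specification.

using Distributions

function simulate_inventory(; S=100, s=10, p=0.4, T=200, x0=S)
    d_dist = Geometric(p)                       # assumed demand distribution
    x_path = zeros(Int, T + 1)
    x_path[1] = x0
    for t in 1:T
        x, d = x_path[t], rand(d_dist)
        x_path[t+1] = max(x - d, 0) + (x <= s ? S : 0)   # reorder S units when x ≤ s
    end
    return x_path
end

x_path = simulate_inventory()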
The function compute_mc returns an instance of a MarkovChain object that can
store both the state X and the transition probabilities. The QuantEcon.jl library
defines this data type and provides functions that simulate Markov chains, compute
a stationary distribution, and perform related tasks.
Given a finite state space X, 𝑘 ⩾ 0 and 𝑃 ∈ M ( RX ), let 𝑃 𝑘 be the 𝑘-th power of 𝑃 . (If
𝑘 = 0, then 𝑃 𝑘 is the identity matrix.) Since M ( RX ) is closed under multiplication
(Exercise 2.3.2), 𝑃 𝑘 is in M ( RX ) for all 𝑘 ⩾ 0. In this context, 𝑃 𝑘 is sometimes called
the 𝑘-step transition matrix corresponding to 𝑃 . In what follows, 𝑃 𝑘 ( 𝑥, 𝑥 0) denotes
the ( 𝑥, 𝑥 0)-th element of the matrix representation of 𝑃 𝑘 .
[Figure 3.1: a simulated inventory path (𝑋_𝑡) over 𝑡 = 0, . . . , 200]
𝑃^𝑘(𝑥, 𝑥′) = P{𝑋_{𝑡+𝑘} = 𝑥′ | 𝑋_𝑡 = 𝑥}. (3.2)
Thus, 𝑃 𝑘 provides the 𝑘-step transition probabilities for the 𝑃 -Markov chain ( 𝑋𝑡 ).
EXERCISE 3.1.2. Prove the claim in the last sentence via induction.
P { 𝑋 𝑘 = 𝑥 0 | 𝑋0 = 𝑥 } > 0 .
Thus, irreducibility of 𝑃 means that the 𝑃 -Markov chain eventually visits any state
from any other state with positive probability.
Proof of Lemma 3.1.1. Fix 𝑃 ∈ M(R^X). 𝑃 is irreducible if and only if Σ_{𝑘⩾0} 𝑃^𝑘 ≫ 0.
This is equivalent to the statement that for each (𝑥, 𝑥′) ∈ X × X, there exists a 𝑘 ⩾ 0
such that 𝑃^𝑘(𝑥, 𝑥′) > 0, which is in turn equivalent to part (ii) of Lemma 3.1.1. □
using QuantEcon
P = [0.1 0.9;
0.0 1.0]
mc = MarkovChain(P)
print(is_irreducible(mc))
EXERCISE 3.1.3. Using Lemma 3.1.1, prove that the stochastic matrix associated
with the 𝑆-𝑠 inventory dynamics in §3.1.1.2 is irreducible.
Several libraries have code for testing irreducibility, including QuantEcon.jl. See
Listing 8 for an example of a call to this functionality. In this case, irreducibility fails
because state 2 is an absorbing state. Once entered, the probability of ever leaving
that state is zero. (A subset Y of X with this property is called an absorbing set.)
For (3.5) and 𝜓𝑡+1 = 𝜓𝑡 𝑃 to hold, each 𝜓𝑡 must be a row vector. In what follows, we
always treat the distributions ( 𝜓𝑡 )𝑡⩾0 of ( 𝑋𝑡 )𝑡⩾0 as row vectors.
EXERCISE 3.1.4. Let (𝑋_𝑡) be 𝑃-Markov on X with 𝑋_0 ∼ 𝜓_0. Show that 𝑋_𝑡 has
distribution 𝜓_0 𝑃^𝑡 for all 𝑡 ⩾ 0.
[Figure: empirical state frequencies from a long simulation compared with the stationary distribution 𝜓∗]
Listing 9 provides a function to update from 𝑋𝑡 to 𝑋𝑡+1 , using the fact that rand()
generates a draw from the uniform distribution on [0, 1).
EXERCISE 3.1.5. Explain why Listing 9 updates the current state according to the
probabilities in 𝑃 .
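Since Listing 9 is not reproduced here, the sketch below shows one standard way to implement such an update (an illustration, not necessarily the book’s exact code): compare a uniform draw against the cumulative probabilities in row 𝑃(𝑥, ·).

function update_state(x, P)
    u = rand()                        # uniform draw on [0, 1)
    cum_prob = 0.0
    for y in 1:size(P, 2)
        cum_prob += P[x, y]
        if u < cum_prob               # first y whose cumulative probability exceeds u
            return y
        end
    end
    return size(P, 2)                 # guard against floating point rounding
end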
EXERCISE 3.1.7. Prove this using the Perron–Frobenius theorem. More generally,
show that this global stability result holds for any 𝑃 ∈ M(R^X) with 𝑃 ≫ 0.
EXERCISE 3.1.8. Fix 𝛼 = 0.3 and 𝛽 = 0.2. Compute the sequence ( 𝜓𝑃 𝑡 ) for different
choices of 𝜓 and confirm that your results are consistent with the claim that 𝜓𝑃 𝑡 → 𝜓∗
as 𝑡 → ∞ for any 𝜓 ∈ D(X).
3.1.3 Approximation
Consider the scalar AR(1) process (3.10), 𝑋_{𝑡+1} = 𝜌𝑋_𝑡 + 𝑏 + 𝜈𝜀_{𝑡+1}, where (𝜀_𝑡) is IID
and standard normal and |𝜌| < 1. Its stationary distribution is
𝜓∗ = 𝑁(𝜇_𝑥, 𝜎²_𝑥)   with   𝜇_𝑥 ≔ 𝑏/(1 − 𝜌)   and   𝜎²_𝑥 ≔ 𝜈²/(1 − 𝜌²).
EXERCISE 3.1.10. Suppose that 𝑋_𝑡 ∼ 𝜓∗, 𝜀_{𝑡+1} ∼ 𝑁(0, 1), and that 𝑋_𝑡 and 𝜀_{𝑡+1} are inde-
pendent. Prove that 𝜌𝑋_𝑡 + 𝑏 + 𝜈𝜀_{𝑡+1} has distribution 𝜓∗. Is this still true if we drop the
independence assumption made above?
Process (3.10) is also ergodic in a similar sense to (3.7) on page 87: on average,
realizations of the process spend most of their time in regions of the state where the
stationary distribution puts high probability mass. (You can check this via simulations
if you wish.) Hence, in the discretization that follows, we shall put the discrete state
space in this area.
EXERCISE 3.1.11. Set 𝑏 = 0 in (3.10) and let 𝐹 be the CDF of 𝑁 (0, 𝜈2 ). Show that
for all 𝛿, 𝑡 ∈ R.
To discretize (3.10) we use Tauchen’s method, starting with the case 𝑏 = 0.1 As
a first step, we choose 𝑛 as the number of states for the discrete approximation and
𝑚 as an integer that sets the width of the state space. Then we create a state space
X ≔ { 𝑥1 , . . . , 𝑥𝑛 } ⊂ R as an equispaced grid that brackets the stationary mean on both
sides by 𝑚 standard deviations:
• set 𝑥1 = −𝑚 𝜎𝑥 ,
• set 𝑥𝑛 = 𝑚 𝜎𝑥 and
• set 𝑥 𝑖+1 = 𝑥 𝑖 + 𝑠 where 𝑠 = ( 𝑥𝑛 − 𝑥1 )/( 𝑛 − 1) and 𝑖 in [ 𝑛 − 1].
The next step is to create an 𝑛 × 𝑛 matrix 𝑃 that approximates the dynamics in (3.10).
For 𝑖, 𝑗 ∈ [𝑛], set
• 𝑃(𝑥_𝑖, 𝑥_1) = 𝐹(𝑥_1 − 𝜌𝑥_𝑖 + 𝑠/2),
• 𝑃(𝑥_𝑖, 𝑥_𝑛) = 1 − 𝐹(𝑥_𝑛 − 𝜌𝑥_𝑖 − 𝑠/2), and
• 𝑃(𝑥_𝑖, 𝑥_𝑗) = 𝐹(𝑥_𝑗 − 𝜌𝑥_𝑖 + 𝑠/2) − 𝐹(𝑥_𝑗 − 𝜌𝑥_𝑖 − 𝑠/2) for 1 < 𝑗 < 𝑛.
1 Tauchen’s method (Tauchen (1986)) is simple but sub-optimal in some cases. For a more general
discretization method and a survey of the literature, see Farmer and Toda (2017).
[Figure 3.3: the stationary distribution 𝜓∗ and the stationary distribution of its discrete approximation]
The first two are boundary rules and the third applies Exercise 3.1.11.
EXERCISE 3.1.12. Prove that Σ_{𝑗=1}^{𝑛} 𝑃(𝑥_𝑖, 𝑥_𝑗) = 1 for all 𝑖 ∈ [𝑛].
Finally, if 𝑏 ≠ 0, then we shift the state space to center it on the mean 𝜇 𝑥 of the
stationary distribution 𝑁 ( 𝜇 𝑥 , 𝜎2𝑥 ). This is done by replacing 𝑥 𝑖 with 𝑥 𝑖 + 𝜇 𝑥 for each 𝑖.
Julia routines that compute X and 𝑃 can be found in the library QuantEcon.jl.
Figure 3.3 compares the continuous stationary distribution 𝜓∗ and the unique sta-
tionary distribution of the discrete approximation when X and 𝑃 are constructed as
above when 𝜌 = 0.9, 𝑏 = 0.0, 𝜈 = 1.0 and the discretization parameters are 𝑛 = 15 and
𝑚 = 3.
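A sketch of the computation behind Figure 3.3, using the QuantEcon.jl routines mentioned above (variable names are ours):

using QuantEcon

ρ, ν, n, m = 0.9, 1.0, 15, 3
mc = tauchen(n, ρ, ν, 0.0, m)               # grid and transition matrix for (3.10) with b = 0
x_vals, P = mc.state_values, mc.p
ψ_approx = stationary_distributions(mc)[1]  # stationary distribution of the discretization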
In this section we discuss how to compute conditional expectations for Markov chains.
The theory will be essential for the study of finite MDPs, since, in these models, lifetime
rewards are mathematical expectations of flow reward functions of Markov states.
𝜓_0 𝑃^𝑡 𝑃^𝑘 ℎ = 𝜓_0 𝑃^{𝑡+𝑘} ℎ = 𝜓_{𝑡+𝑘} ℎ = Eℎ(𝑋_{𝑡+𝑘}).
Hence (3.15) holds.
Next we connect Markov chains to order theory via stochastic dominance. These
connections will have many applications below.
Let X be a finite set partially ordered by ⪯. A Markov operator 𝑃 ∈ M(R^X) is
called monotone increasing if
𝑥, 𝑦 ∈ X and 𝑥 ⪯ 𝑦 =⇒ 𝑃(𝑥, ·) ⪯_F 𝑃(𝑦, ·).
Thus, 𝑃 is monotone increasing if shifting up the current state shifts up the next period
state, in the sense that its distribution increases in the stochastic dominance ordering
(see §2.2.4) on D(X). Below, we will see that monotonicity of Markov operators is
closely related to monotonicity of value functions in dynamic programming.
Monotonicity of Markov operators is related to positive autocorrelation. To illus-
trate the idea, consider the AR(1) model 𝑋𝑡+1 = 𝜌𝑋𝑡 + 𝜎𝜀𝑡+1 from §3.1.3 and suppose we
apply Tauchen discretization, mapping the parameters 𝜌, 𝜎 and a discretization size 𝑛
into a Markov operator 𝑃 on state space X = { 𝑥1 , . . . , 𝑥𝑛 } ⊂ R, totally ordered by ⩽. If
𝜌 ⩾ 0, so that positive autocorrelation holds, then 𝑃 is monotone increasing.
for some 𝛼, 𝛽 ∈ [0, 1]. Show that 𝑃𝑤 is monotone increasing if and only if 𝛼 + 𝛽 ⩽ 1.
3.2.2.1 Theory
With 𝐼 as the identity matrix, the next result describes 𝑣 as function of 𝛽 , 𝑃 and ℎ.
where the first equality in (3.18) uses linearity of expectations and the second follows
from (3.14) and the assumption that ( 𝑋𝑡 ) is 𝑃 -Markov starting at 𝑥 .2 Applying the
Neumann series lemma (p. 18) to the matrix 𝛽𝑃, we see that Σ_{𝑡=0}^{∞} (𝛽𝑃)^𝑡 = (𝐼 − 𝛽𝑃)^{−1}.
The lemma applies because 𝜌 ( 𝛽𝑃 ) = 𝛽𝜌 ( 𝑃 ) = 𝛽 < 1, as follows from Exercise 2.3.2. □
Consider a firm that receives random profit stream (𝜋_𝑡)_{𝑡⩾0}. Suppose that the value of
the firm equals the expected present value of its profit stream. Suppose for now that
the interest rate is constant at 𝑟 > 0. With 𝛽 ≔ 1/(1 + 𝑟), total valuation is
𝑉_0 = E Σ_{𝑡=0}^{∞} 𝛽^𝑡 𝜋_𝑡. (3.19)
To compute this value, we need to know how profits evolve. A common strategy is
to set 𝜋𝑡 = 𝜋 ( 𝑋𝑡 ) for some fixed 𝜋 ∈ RX , where ( 𝑋𝑡 )𝑡⩾0 is a state process. For known
dynamics of ( 𝑋𝑡 ) and function 𝜋, the value 𝑉0 in (3.19) can be computed.
Here we assume that ( 𝑋𝑡 ) is 𝑃 -Markov for 𝑃 ∈ M ( RX ) with finite X. Then condi-
tioning on 𝑋_0 = 𝑥, we can write the value as
𝑣(𝑥) ≔ E_𝑥 Σ_{𝑡=0}^{∞} 𝛽^𝑡 𝜋_𝑡 ≔ E [ Σ_{𝑡=0}^{∞} 𝛽^𝑡 𝜋_𝑡 | 𝑋_0 = 𝑥 ].
By Lemma 3.2.1, the value 𝑣(𝑥) is finite and the function 𝑣 ∈ R^X can be obtained by
𝑣 = Σ_{𝑡=0}^{∞} 𝛽^𝑡 𝑃^𝑡 𝜋 = (𝐼 − 𝛽𝑃)^{−1} 𝜋.
It is plausible that the value of the firm will be higher for a return process in
which higher states generate higher profits and predict higher future states. The next
exercise confirms this.
EXERCISE 3.2.6. Let X be partially ordered and suppose that 𝜋 ∈ 𝑖RX and that 𝑃 is
monotone increasing. (See §3.2.1.3 for terminology and notation.) Prove that, under
these conditions, 𝑣 is increasing on X.
2 To justify the first equality, care must be taken when pushing expectations through infinite sums.
In the present setting, justification can be provided via the dominated convergence theorem (see, e.g.,
Dudley (2002), Theorem 4.3.5). A proof of a more general result can be found in §B.2.
𝑣(𝑥) ≔ E_𝑥 Σ_{𝑡⩾0} 𝛽^𝑡 𝑢(𝐶_𝑡),
where 𝛽 ∈ (0, 1) is a discount factor and 𝑢 : R_+ → R is called the flow utility function.
Dependence of 𝑣(𝑥) on 𝑥 comes from the initial condition 𝑋_0 = 𝑥 influencing the
Markov state process and, therefore, the consumption path.
Using 𝐶_𝑡 = 𝑐(𝑋_𝑡) and defining 𝑟 ≔ 𝑢 ∘ 𝑐, we can write 𝑣(𝑥) = E_𝑥 Σ_{𝑡⩾0} 𝛽^𝑡 𝑟(𝑋_𝑡). By
Lemma 3.2.1, this sum is finite and 𝑣 can be expressed as
𝑣 = ( 𝐼 − 𝛽𝑃 ) −1 𝑟. (3.21)
Figure 3.4 shows an example when 𝑢 has the constant relative risk aversion (CRRA)
specification
𝑢(𝑐) = 𝑐^{1−𝛾}/(1 − 𝛾)    (𝑐 ⩾ 0, 𝛾 > 0), (3.22)
while 𝑐 ( 𝑥 ) = exp( 𝑥 ), so that consumption takes the form 𝐶𝑡 = exp( 𝑋𝑡 ), and ( 𝑋𝑡 )𝑡⩾0 is
a Tauchen discretization (see §3.1.3) of 𝑋𝑡+1 = 𝜌𝑋𝑡 + 𝜈𝑊𝑡+1 where (𝑊𝑡 )𝑡⩾1 is IID and
standard normal. Parameters are 𝑛 = 25, 𝛽 = 0.98, 𝜌 = 0.96, 𝜈 = 0.05 and 𝛾 = 2. We
set 𝑟 = 𝑢 ◦ 𝑐 and solved for 𝑣 via (3.21).
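A sketch of this computation (our variable names, not those of the book’s listings):

using QuantEcon, LinearAlgebra

n, β, ρ, ν, γ = 25, 0.98, 0.96, 0.05, 2.0
mc = tauchen(n, ρ, ν)                   # Tauchen discretization of the state process
P, x_vals = mc.p, mc.state_values
u(c) = c^(1 - γ) / (1 - γ)              # CRRA flow utility (3.22)
c(x) = exp(x)                           # consumption rule C_t = exp(X_t)
r = u.(c.(x_vals))                      # r = u ∘ c on the grid
v = (I - β * P) \ r                     # v = (I − βP)^{-1} r, as in (3.21)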
EXERCISE 3.2.8. The value function in Figure 3.4 appears to be increasing in the
state 𝑥 . Prove this for the CRRA model when 𝜌 ⩾ 0.
[Figure 3.4: the value function 𝑣 computed from (3.21)]
We adopt the job search setting of §1.3 but assume now that the wage process (𝑊𝑡 ) is
𝑃 -Markov on W ⊂ R+ , where 𝑃 ∈ M ( RW ) and W is finite.
The value function 𝑣∗ for the Markov job search model is now defined as follows:
𝑣∗(𝑤) is the maximum lifetime value that can be obtained when the worker is un-
employed with current wage offer 𝑤 in hand. The value function 𝑣∗ satisfies the
Bellman equation
𝑣∗(𝑤) = max{ 𝑤/(1 − 𝛽), 𝑐 + 𝛽 Σ_{𝑤′∈W} 𝑣∗(𝑤′) 𝑃(𝑤, 𝑤′) }    (𝑤 ∈ W). (3.23)
Bellman equation (3.23) extends a corresponding Bellman equation for the IID
case (cf. (1.25) on page 33). (A full proof is given in Chapter 4.) The Bellman
operator corresponding to (3.23) is
(𝑇𝑣)(𝑤) = max{ 𝑤/(1 − 𝛽), 𝑐 + 𝛽 Σ_{𝑤′∈W} 𝑣(𝑤′) 𝑃(𝑤, 𝑤′) }
for all 𝑤 ∈ W.
Let 𝑉 ≔ R^W_+ and endow 𝑉 with the pointwise partial order ⩽ and the supremum
norm, so that ‖𝑓 − 𝑔‖_∞ = max_{𝑤∈W} |𝑓(𝑤) − 𝑔(𝑤)|.
EXERCISE 3.3.1. Prove that
(i) 𝑇 is an order-preserving self-map on 𝑉 .
(ii) 𝑇 is a contraction of modulus 𝛽 on 𝑉 .
We recommend that you study the proof of the next lemma, since the same style
of argument occurs often below.
Lemma 3.3.1. 𝑣∗ is increasing on (W, ⩽) whenever 𝑃 is monotone increasing.
Proof. Let 𝑖𝑉 be the increasing functions in 𝑉 and suppose that 𝑃 is monotone in-
creasing. 𝑇 is a self-map on 𝑖𝑉 in this setting, since 𝑣 ∈ 𝑖𝑉 implies ℎ(𝑤) ≔ 𝑐 +
𝛽 Σ_{𝑤′} 𝑣(𝑤′) 𝑃(𝑤, 𝑤′) is in 𝑖𝑉. Hence, for such a 𝑣, both ℎ and the stopping value func-
tion 𝑒(𝑤) ≔ 𝑤/(1 − 𝛽) are in 𝑖𝑉. It follows that 𝑇𝑣 = ℎ ∨ 𝑒 is in 𝑖𝑉.
Since 𝑖𝑉 is a closed subset of 𝑉 and 𝑇 is a self-map on 𝑖𝑉 , the fixed point 𝑣∗ is in 𝑖𝑉
(see Exercise 1.2.18 on page 22). □
In view of the contraction property established in Exercise 3.3.1, we can use value
function iteration (i) to compute an approximation 𝑣 to the value function and (ii) to
calculate the 𝑣-greedy policy that approximates the optimal policy. Code for imple-
menting this procedure is in Listing 10. The definition of a 𝑣-greedy policy resembles
that for the IID case (see (1.29) on page 35).
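Listing 10 is not reproduced here; the following is a minimal sketch of the procedure, taking the wage grid w_vals and transition matrix P as given inputs (names and default parameters are illustrative):

function solve_job_search(w_vals, P; c=1.0, β=0.96, tol=1e-8, max_iter=10_000)
    v = w_vals ./ (1 - β)                                    # initial guess
    for _ in 1:max_iter
        v_new = max.(w_vals ./ (1 - β), c .+ β .* (P * v))   # apply the Bellman operator T
        if maximum(abs.(v_new .- v)) < tol
            v = v_new
            break
        end
        v = v_new
    end
    σ = w_vals ./ (1 - β) .>= c .+ β .* (P * v)              # v-greedy policy
    return v, σ
end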
[Figure 3.5: the value function 𝑣∗(𝑤), the stopping value 𝑤/(1 − 𝛽), and the continuation value ℎ∗(𝑤) for Markov job search]
The continuation value ℎ∗ from the IID case (see page 33) is now replaced by a con-
tinuation value function
ℎ∗(𝑤) ≔ 𝑐 + 𝛽 Σ_{𝑤′} 𝑣∗(𝑤′) 𝑃(𝑤, 𝑤′)    (𝑤 ∈ W).
The continuation value depends on 𝑤 because the current offer helps predict the offer
next period, which in turn affects the value of continuing. The functions 𝑤 ↦→ 𝑤/(1 −
𝛽 ), ℎ∗ and 𝑣∗ corresponding to the default model in Listing 10 are shown in Figure 3.5.
EXERCISE 3.3.2. Explain why the continuation value function is increasing in Fig-
ure 3.5. If possible, provide a mathematical and economic explanation.
EXERCISE 3.3.3. Using the Bellman equation (3.23), show that ℎ∗ obeys
ℎ∗(𝑤) = 𝑐 + 𝛽 Σ_{𝑤′} max{ 𝑤′/(1 − 𝛽), ℎ∗(𝑤′) } 𝑃(𝑤, 𝑤′)    (𝑤 ∈ W).
Exercise 3.3.4 suggests an alternative way to solve the job search problem: iterate
with 𝑄 to obtain the continuation value function ℎ∗ and then use the policy
𝜎∗(𝑤) = 1{ 𝑤/(1 − 𝛽) ⩾ ℎ∗(𝑤) }    (𝑤 ∈ W)
that tells the worker to accept when the current stopping value exceeds the current
continuation value.
We saw that in the IID case a computational strategy based on continuation values
is far more efficient than value function iteration (see §1.3.2.2). Since continuation
values are functions rather than scalars, here the two approaches (iterating with 𝑇 vs
iterating with 𝑄 ) are more similar. In Chapter 4 we discuss alternative computational
strategies in more detail, seeking conditions under which one approach will be more
efficient than the other.
We now modify the job search problem discussed in §3.3.1 by adding separations. In
particular, an existing match between worker and firm terminates with probability 𝛼
every period. (This is an extension because setting 𝛼 = 0 recovers the permanent job
scenario from §3.3.1.)
The worker now views the loss of a job as a capital loss and a spell of unemploy-
ment as an investment. In what follows, the wage process and discount factor are
unchanged from §3.3.1. As before, 𝑉 ≔ RW+ is endowed with the supremum norm.
The value function 𝑣∗_𝑢 for an unemployed worker satisfies the recursion
𝑣∗_𝑢(𝑤) = max{ 𝑣∗_𝑒(𝑤), 𝑐 + 𝛽 Σ_{𝑤′∈W} 𝑣∗_𝑢(𝑤′) 𝑃(𝑤, 𝑤′) }    (𝑤 ∈ W), (3.25)
where 𝑣𝑒∗ is the value function for an employed worker, i.e., the lifetime value of a
worker who starts the period employed at wage 𝑤. Value function 𝑣∗_𝑒 satisfies
𝑣∗_𝑒(𝑤) = 𝑤 + 𝛽 [ 𝛼 Σ_{𝑤′} 𝑣∗_𝑢(𝑤′) 𝑃(𝑤, 𝑤′) + (1 − 𝛼) 𝑣∗_𝑒(𝑤) ]    (𝑤 ∈ W). (3.26)
This equation states that value accruing to an employed worker is current wage plus
the discounted expected value of being either employed or unemployed next period.
We claim that, when 0 < 𝛼, 𝛽 < 1, the system (3.25)–(3.26) has a unique solution
(𝑣∗_𝑢, 𝑣∗_𝑒) in 𝑉 × 𝑉. To show this we first solve (3.26) in terms of 𝑣∗_𝑒(𝑤) to obtain
𝑣∗_𝑒(𝑤) = [ 𝑤 + 𝛼𝛽 (𝑃𝑣∗_𝑢)(𝑤) ] / (1 − 𝛽(1 − 𝛼)). (3.27)
(Recall (𝑃ℎ)(𝑤) ≔ Σ_{𝑤′} ℎ(𝑤′) 𝑃(𝑤, 𝑤′) for ℎ ∈ R^W.) Substituting into (3.25) yields
𝑣∗_𝑢(𝑤) = max{ [ 𝑤 + 𝛼𝛽 (𝑃𝑣∗_𝑢)(𝑤) ] / (1 − 𝛽(1 − 𝛼)), 𝑐 + 𝛽 (𝑃𝑣∗_𝑢)(𝑤) }. (3.28)
EXERCISE 3.3.5. Prove that there exists a unique 𝑣𝑢∗ ∈ 𝑉 that solves (3.28). Propose
a convergent method for computing both 𝑣𝑢∗ and 𝑣𝑒∗ . [Hint: See Lemma 2.2.3 on
page 59.]
Figure 3.6 shows the value function 𝑣𝑢∗ for an unemployed worker, which is the
fixed point of (3.28), as well as the stopping and continuation values, which are given
by
𝑠∗(𝑤) ≔ [ 𝑤 + 𝛼𝛽 (𝑃𝑣∗_𝑢)(𝑤) ] / (1 − 𝛽(1 − 𝛼))   and   ℎ∗_𝑒(𝑤) ≔ 𝑐 + 𝛽 (𝑃𝑣∗_𝑢)(𝑤)
respectively, for each 𝑤 ∈ W. Parameters are as in Listing 11. The value function
𝑣𝑢∗ is the pointwise maximum (i.e., 𝑣𝑢∗ = 𝑠∗ ∨ ℎ∗ ). The worker’s optimal policy while
unemployed is
𝜎∗ ( 𝑤) ≔ 1{ 𝑠∗ ( 𝑤) ⩾ ℎ∗ ( 𝑤)} .
As before, the smallest 𝑤 such that 𝜎∗ ( 𝑤) = 1 is called the reservation wage.
Figure 3.7 shows how the reservation wage changes with 𝛼. To produce this figure
we solved the model for the reservation wage at 10 values of 𝛼 on an evenly spaced
grid ranging from 0 to 1. The reservation wage falls with 𝛼, since time spent unemployed
is a capital investment in better wages, and the value of this investment declines as
the separation rate rises.
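A sketch of the computation suggested by Exercise 3.3.5, based on successive approximation of (3.28); Listing 11’s parameters are not reproduced here, so the defaults below are illustrative:

function solve_with_separation(w_vals, P; c=1.0, α=0.1, β=0.96,
                                tol=1e-8, max_iter=10_000)
    stop_val(v) = (w_vals .+ α * β .* (P * v)) ./ (1 - β * (1 - α))
    cont_val(v) = c .+ β .* (P * v)
    v_u = w_vals ./ (1 - β)                        # initial guess for v_u*
    for _ in 1:max_iter
        v_new = max.(stop_val(v_u), cont_val(v_u)) # right-hand side of (3.28)
        if maximum(abs.(v_new .- v_u)) < tol
            v_u = v_new
            break
        end
        v_u = v_new
    end
    σ = stop_val(v_u) .>= cont_val(v_u)            # accept where stopping value is higher
    # reservation wage: smallest w with σ(w) = 1 (assumes w_vals sorted and some acceptance)
    w_star = w_vals[findfirst(σ)]
    return v_u, w_star
end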
[Figure 3.6: the continuation value, the stopping value, and 𝑣∗_𝑢(𝑤)]
[Figure 3.7: the reservation wage as a function of 𝛼]
Many good textbooks on Markov chains exist, including Norris (1998), Häggström
et al. (2002) and Privault (2013). Sargent and Stachurski (2023b) provides a rel-
atively comprehensive treatment from a network perspective that is a natural one
for Markov chains. Other economic applications are discussed in Stokey and Lucas
(1989) and Ljungqvist and Sargent (2018). Meyer (2000) gives a detailed account
of the theory of nonnegative matrices. Another useful reference is Horn and Johnson
(2012).
A systematic study of monotone Markov chains was initiated by Daley (1968).
Monotone Markov methods have many important applications in economics. See,
for example, Hopenhayn and Prescott (1992), Kamihigashi and Stachurski (2014),
Jaśkiewicz and Nowak (2014), Balbus et al. (2014), Foss et al. (2018) and Hu and
Shmaya (2019).
Chapter 4
Optimal Stopping
We begin with a standard theory of optimal stopping and then consider alternative
approaches that feature continuation values and threshold policies. We aim to provide
a rigorous discussion of optimality that refines our less formal analysis of job search
in §1.3 and §3.3.1.
4.1.1 Theory
Our first step is to set out the fundamental theory of discrete time infinite-horizon
optimal stopping problems.
Given a 𝑃 -Markov chain ( 𝑋𝑡 )𝑡⩾0 , a decision maker observes the state 𝑋𝑡 in each
period and decides whether to continue or stop. If she chooses to stop, she receives
final reward 𝑒 ( 𝑋𝑡 ) and the process terminates. If she decides to continue, then she
receives 𝑐 ( 𝑋𝑡 ) and the process repeats next period. Lifetime rewards are
E Σ_{𝑡⩾0} 𝛽^𝑡 𝑅_𝑡,
where 𝑅𝑡 equals 𝑐 ( 𝑋𝑡 ) while the agent continues, 𝑒 ( 𝑋𝑡 ) when the agent stops, and zero
thereafter.
Example 4.1.1. Consider the infinite-horizon job search problem from Chapter 1,
where the wage offer process (𝑊𝑡 ) is IID with common distribution 𝜑 on finite set
W. This is an optimal stopping problem with state space X = W and 𝑃 ∈ M ( RX )
having all rows equal to 𝜑, so that all draws are IID from 𝜑. The exit reward function
is 𝑒 ( 𝑥 ) = 𝑥 /(1 − 𝛽 ) and the continuation reward function is constant and equal to
unemployment compensation.
Example 4.1.2. Consider an infinite-horizon American call option that provides the
right to buy a given asset at strike price 𝐾 at each future date. The market price
of the asset is 𝑆𝑡 = 𝑠 ( 𝑋𝑡 ), where ( 𝑋𝑡 ) is 𝑃 -Markov on finite set X and 𝑠 ∈ RX . The
interest rate is 𝑟 > 0. Deciding when to exercise is an optimal stopping problem, with
exit corresponding to exercising the option. The discount factor is 1/(1 + 𝑟 ), the exit
reward function is 𝑒 ( 𝑥 ) ≔ 𝑠 ( 𝑥 ) − 𝐾 and the continuation reward is zero.1
the assumption that the current state contains enough information for the agent to
decide whether or not to stop.
Let Σ be the set of functions from X to {0, 1}. Let 𝑣𝜎 ( 𝑥 ) denote the expected lifetime
value of following policy 𝜎 now and in every future period, given optimal stopping
problem S = ( 𝛽, 𝑃, 𝑟, 𝑒) and current state 𝑥 ∈ X. We call 𝑣𝜎 the 𝜎-value function. We
also call 𝑣_𝜎(𝑥) the lifetime value of policy 𝜎 conditional on initial state 𝑥. Section
4.1.1.2 shows that 𝑣_𝜎 is well defined and describes how to calculate it. A policy
𝜎∗ ∈ Σ is called optimal for S if
which is also what we expect: the value of continuing is the current reward plus the
discounted expected reward obtained by continuing with policy 𝜎 next period.
We want to solve (4.2) for 𝑣𝜎 . To this end, we define 𝑟𝜎 ∈ RX and 𝐿𝜎 ∈ L ( RX ) via
𝑟_𝜎(𝑥) ≔ 𝜎(𝑥)𝑒(𝑥) + (1 − 𝜎(𝑥))𝑐(𝑥)   and   𝐿_𝜎(𝑥, 𝑥′) ≔ 𝛽(1 − 𝜎(𝑥))𝑃(𝑥, 𝑥′).
With these definitions, (4.2) becomes 𝑣_𝜎 = 𝑟_𝜎 + 𝐿_𝜎 𝑣_𝜎, and hence
𝑣_𝜎 = (𝐼 − 𝐿_𝜎)^{−1} 𝑟_𝜎. (4.4)
EXERCISE 4.1.1. Confirm that 𝜌 ( 𝐿𝜎 ) < 1 holds for any optimal stopping problem.
By Exercise 4.1.1 and the Neumann series lemma, 𝑣𝜎 is uniquely defined by (4.4).
For the proofs below, it is helpful to view 𝑣_𝜎 as the fixed point of an operator. We
associate each 𝜎 ∈ Σ with a policy operator 𝑇_𝜎 defined at 𝑣 ∈ R^X by
(𝑇_𝜎 𝑣)(𝑥) = 𝜎(𝑥)𝑒(𝑥) + (1 − 𝜎(𝑥)) [ 𝑐(𝑥) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑥′) ]. (4.5)
In vector notation, 𝑇_𝜎 𝑣 = 𝑟_𝜎 + 𝐿_𝜎 𝑣.
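In code, 𝑣_𝜎 can be computed directly from (4.4); the sketch below (our notation, not a listing from the text) treats 𝜎, 𝑒, and 𝑐 as vectors and 𝑃 as a matrix:

using LinearAlgebra

function policy_value(σ, e, c, P, β)
    r_σ = σ .* e .+ (1 .- σ) .* c        # r_σ(x) = σ(x) e(x) + (1 − σ(x)) c(x)
    L_σ = β .* (1 .- σ) .* P             # L_σ(x, x′) = β (1 − σ(x)) P(x, x′)
    return (I - L_σ) \ r_σ               # v_σ = (I − L_σ)^{-1} r_σ, as in (4.4)
end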
In the job search problem in §3.3.1, we argued that the value function equals the fixed
point of the Bellman operator. Here we make the same argument more formally in
the more general setting of optimal stopping.
First, given an optimal stopping problem S = ( 𝛽, 𝑃, 𝑟, 𝑒) with 𝜎-value functions
{ 𝑣𝜎 }𝜎∈Σ , we define the value function 𝑣∗ of S via
𝑣∗(𝑥) ≔ max_{𝜎∈Σ} 𝑣_𝜎(𝑥)    (𝑥 ∈ X), (4.6)
so that 𝑣∗ ( 𝑥 ) is the maximal lifetime value available to an agent facing current state
𝑥 . Following notation in §2.2.2.1, we can also write 𝑣∗ = ∨𝜎 𝑣𝜎 .
Given that solving the maximization in (4.6) is, in general, a difficult problem,
how can we obtain the value function? The following steps can do the job:
(i) formulate a Bellman equation for the value function of the optimal stopping
problem, namely,
𝑣(𝑥) = max{ 𝑒(𝑥), 𝑐(𝑥) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑥′) }    (𝑥 ∈ X), (4.7)
(ii) prove that this Bellman equation has a unique solution in RX , and then
(iii) show that this solution equals the value function, as defined in (4.6).
Proof of Proposition 4.1.2. With the result of Exercise 4.1.5 in hand, we need only
show that the unique fixed point 𝑣¯ of 𝑇 in RX is equal to 𝑣∗ = ∨𝜎 𝑣𝜎 . We show 𝑣¯ ⩽ 𝑣∗
and then 𝑣¯ ⩾ 𝑣∗ .
Paralleling the definition provided in the discussion of job search (§1.3), for each
𝑣 ∈ R^X, we call 𝜎 ∈ Σ 𝑣-greedy if, for all 𝑥 ∈ X,
𝜎(𝑥) ∈ argmax_{𝑎∈{0,1}} { 𝑎𝑒(𝑥) + (1 − 𝑎) [ 𝑐(𝑥) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑥′) ] }. (4.9)
A 𝑣-greedy policy uses 𝑣 to assign values to states and then chooses to stop or continue
based on the action that generates a higher payoff.
With this language in place, the next proposition makes precise our informal §1.1.2
argument that optimal choices can be made using the value function.
The theory stated above tells us that successive approximation using the Bellman op-
erator converges to 𝑣∗ and 𝑣∗ -greedy policies are optimal. These facts make value
function iteration (VFI) a natural algorithm for solving optimal stopping problems.
(VFI for optimal stopping problems corresponds to VFI for job search, as shown on
page 37.) Later, in Theorem 8.1.1, we will show that when the number of iterates is
sufficiently large, VFI produces an optimal policy.
In §3.2.2.2 we discussed firm valuation using expected present value of the cash flow
generated by profits. This is a standard approach. However, it ignores that firms
have the option to cease operations and sell all remaining assets. In this section, we
consider firm valuation in the presence of an exit option.
We saw in §4.1.1.2–§4.1.1.3 that 𝑇𝜎 has a unique fixed point 𝑣𝜎 and that 𝑣𝜎 ( 𝑧) repre-
sents the value of following policy 𝜎 forever, conditional on 𝑍0 = 𝑧.
The Bellman operator for the firm’s problem is the order-preserving self-map 𝑇 on
RZ defined by
(𝑇𝑣)(𝑧) = max{ 𝑠, 𝜋(𝑧) + 𝛽 Σ_{𝑧′} 𝑣(𝑧′) 𝑄(𝑧, 𝑧′) }    (𝑧 ∈ Z).
[Figure 4.1: the value function 𝑣∗, the scrap value 𝑠, and the continuation value function ℎ∗ for the firm exit problem]
Let 𝑣∗ be the value function for this problem. By Proposition 4.1.2, 𝑣∗ is the unique
fixed point of 𝑇 in RZ and the unique solution to the Bellman equation. Moreover, suc-
cessive approximation from any 𝑣 ∈ RZ converges to 𝑣∗ . Finally, by Proposition 4.1.3,
a policy is optimal if and only if it is 𝑣∗ -greedy.
Figure 4.1 plots 𝑣∗ , computed via value function iteration (i.e., successive approx-
imation using 𝑇 , along with the stopping value 𝑠 and the continuation value function
ℎ∗ = 𝜋 + 𝛽𝑄𝑣∗ , under the parameterization given in Listing 12. As implied by the Bell-
man equation, 𝑣∗ is the pointwise maximum of 𝑠 and ℎ∗. The 𝑣∗-greedy policy instructs
the firm to exit when the continuation value of the firm falls below the scrap value.
EXERCISE 4.1.6. Replicate Figure 4.1 by using the parameters in Listing 12 and
applying value function iteration. Reviewing the code for job search on page 99 should
be helpful.
[Figure: the value function 𝑣∗ compared with the no-exit value of the firm]
EXERCISE 4.1.7. Prove the following: If 𝑄 ≫ 0 and 𝑠 > 𝑤(𝑧) for at least one 𝑧 ∈ Z,
then 𝑤 ≪ 𝑣∗. Provide some intuition for this result.
EXERCISE 4.1.8. Consider a version of the model of firm value with exit where
productivity is constant but prices are stochastic. In particular, the price process ( 𝑃𝑡 )
for the final good is 𝑄 -Markov. Suppose further that one-period profits for a given
price 𝑝 are max ℓ⩾0 𝜋 ( ℓ, 𝑝), where ℓ is labor input. Suppose that 𝜋 ( ℓ, 𝑝) = 𝑝ℓ1/2 − 𝑤ℓ,
where the wage rate 𝑤 is constant. Formulate the Bellman equation.
4.1.3 Monotonicity
We study monotonicity in values and actions in the general optimal stopping problem
described in §4.1.1, with X as the state space, 𝑒 as the exit reward function and 𝑐 as
the continuation reward function.
(The continuation reward function 𝑐 and the continuation value function ℎ∗ are dis-
tinct objects.)
Let X be partially ordered and let 𝑖RX be the increasing functions in RX .
Lemma 4.1.4. If 𝑒, 𝑐 ∈ 𝑖RX and 𝑃 is monotone increasing, then ℎ∗ and 𝑣∗ are both
increasing.
Proof. Let the stated conditions hold. The Bellman operator can be written pointwise
as 𝑇 𝑣 = 𝑒 ∨ ( 𝑐 + 𝛽𝑃𝑣). Since 𝑃 is monotone increasing, 𝑃 is invariant on 𝑖RX . It follows
from this fact and the conditions on 𝑒 and 𝑐 that 𝑇 is invariant on 𝑖RX . Hence, by
Exercise 1.2.18 on page 22, 𝑣∗ is in 𝑖RX . Since ℎ∗ = 𝑐 + 𝛽𝑃𝑣∗ , the same is true for
ℎ∗ . □
Example 4.1.3. Consider the §4.1.2 firm problem with exit with Bellman operator
𝑇 𝑣 = 𝑠 ∨ ( 𝜋 + 𝛽𝑄𝑣). Since 𝑠 is constant, it follows directly that 𝑣∗ and ℎ∗ are both
increasing functions when 𝜋 ∈ 𝑖RZ and 𝑄 is monotone increasing.
The optimal policy in the IID job search problem takes the form 𝜎∗ ( 𝑤) = 1{𝑤 ⩾ 𝑤∗ }
for all 𝑤 ∈ W, where 𝑤∗ ≔ (1 − 𝛽 ) ℎ∗ is the reservation wage and ℎ∗ is the continuation
value (see page 36). This optimal policy is of threshold type: once the wage offer
exceeds the threshold, the decision is to stop.
Since threshold policies are convenient, let us now try to characterize them.
Throughout this section, we take X to be a subset of R. Elements of X are ordered
by ⩽, the usual order on R.
For a binary function on X ⊂ R, the condition that 𝜎∗ is decreasing means that the
decision maker chooses to exit when 𝑥 is sufficiently small.
Example 4.1.4. In the firm problem with exit, as described in §4.1.2, ℎ∗ is increasing
whenever 𝜋 ∈ 𝑖RZ and 𝑄 is monotone increasing. Since the scrap value is constant,
Exercise 4.1.9 applies under these conditions. Hence the optimal policy is decreasing.
This reasoning agrees with Figure 4.1, where exit is optimal when the state is small
and continuing is optimal when 𝑧 is large. This makes sense: since 𝑄 is monotone
increasing, low current values of 𝑧 predict low future values of 𝑧, so profits associated
with continuing can be anticipated to be low.
EXERCISE 4.1.10. Show that the conditions of Exercise 4.1.9 hold when 𝑒 is con-
stant on X, 𝑐 is increasing on X, and 𝑃 is monotone increasing.
Example 4.1.5. In the IID job search problem, 𝑒 ( 𝑤) = 𝑤/(1 − 𝛽 ) is increasing and ℎ∗ is
constant. Hence the result in Exercise 4.1.11 applies. This is why the optimal policy
𝜎∗ ( 𝑤) = 1{𝑤 ⩾ (1 − 𝛽 ) ℎ∗ } is increasing. The agent accepts all sufficiently large wage
offers.
𝑥 < 𝑥 ∗ =⇒ 𝜎∗ ( 𝑥 ) = 0 and 𝑥 ⩾ 𝑥 ∗ =⇒ 𝜎∗ ( 𝑥 ) = 1.
Remark 4.1.1. Conditions in Exercises 4.1.9–4.1.11 are sufficient but not necessary
for monotone policies. Figure 3.5 on page 100 provides an example of a setting where the
policy is increasing (the agent accepts for sufficiently large wage offers) even though
both 𝑒(𝑥) = 𝑥/(1 − 𝛽) and ℎ∗ are strictly increasing.
Let ℎ∗ be the continuation value function for the optimal stopping problem defined
in (4.10). To compute ℎ∗ directly we begin with the optimal stopping version of the
Bellman equation evaluated at 𝑣∗ and rewrite it as
Taking expectations of both sides of the equation conditional on current state 𝑥 pro-
duces Σ_{𝑥′} 𝑣∗(𝑥′) 𝑃(𝑥, 𝑥′) = Σ_{𝑥′} max{𝑒(𝑥′), ℎ∗(𝑥′)} 𝑃(𝑥, 𝑥′). Multiplying by 𝛽, adding
𝑐(𝑥), and using the definition of ℎ∗, we get
ℎ∗(𝑥) = 𝑐(𝑥) + 𝛽 Σ_{𝑥′} max{ 𝑒(𝑥′), ℎ∗(𝑥′) } 𝑃(𝑥, 𝑥′)    (𝑥 ∈ X). (4.12)
Proposition 4.1.5 provides the following alternative method to compute the opti-
mal policy that does not involve value function iteration:
Proof of Proposition 4.1.5. Fix 𝑓 , 𝑔 ∈ RX and 𝑥 ∈ X. By the triangle inequality and the
bound | 𝛼 ∨ 𝑥 − 𝛼 ∨ 𝑦 | ⩽ | 𝑥 − 𝑦 | from page 34, we have
|(𝐶𝑓)(𝑥) − (𝐶𝑔)(𝑥)| ⩽ 𝛽 Σ_{𝑥′} |max{𝑒(𝑥′), 𝑓(𝑥′)} − max{𝑒(𝑥′), 𝑔(𝑥′)}| 𝑃(𝑥, 𝑥′)
                      ⩽ 𝛽 Σ_{𝑥′} |𝑓(𝑥′) − 𝑔(𝑥′)| 𝑃(𝑥, 𝑥′).
The beginning of §4.1.4 mentioned that switching from value function iteration to
continuation value iteration can substantially reduce the dimensionality of the prob-
lem in some cases. Here we describe situations where this works.
To begin, let W and Z be two finite sets and suppose that 𝜑 ∈ D(W) and 𝑄 ∈
M(R^Z). Let (𝑊_𝑡) be IID with distribution 𝜑 and let (𝑍_𝑡) be a 𝑄-Markov chain on Z.
If (𝑊𝑡 ) and ( 𝑍𝑡 ) are independent, then ( 𝑋𝑡 ) defined by 𝑋𝑡 = (𝑊𝑡 , 𝑍𝑡 ) is 𝑃 -Markov on X,
where
𝑃 ( 𝑥, 𝑥 0) = 𝑃 (( 𝑤, 𝑧 ) , ( 𝑤0, 𝑧0)) = 𝜑 ( 𝑤0) 𝑄 ( 𝑧, 𝑧0) .
Suppose that the continuation reward depends only on 𝑧 so that we can write the
Bellman operator as
(𝑇𝑣)(𝑤, 𝑧) = max{ 𝑒(𝑤, 𝑧), 𝑐(𝑧) + 𝛽 Σ_{𝑤′∈W} Σ_{𝑧′∈Z} 𝑣(𝑤′, 𝑧′) 𝜑(𝑤′) 𝑄(𝑧, 𝑧′) }. (4.14)
Since the right-hand side depends on both 𝑤 and 𝑧, the Bellman operator acts on an
𝑛-dimensional space, where 𝑛 ≔ |X| = |W| × |Z|.
However, if we inspect the right-hand side of (4.14), we see that the continuation
value function depends only on 𝑧. Dependence on 𝑤 vanishes because 𝑤 does not
help predict 𝑤0. Thus, the continuation value function is an object in |Z|-dimensional
space. The continuation value operator
(𝐶ℎ)(𝑧) = 𝑐(𝑧) + 𝛽 Σ_{𝑤′} Σ_{𝑧′} max{ 𝑒(𝑤′, 𝑧′), ℎ(𝑧′) } 𝜑(𝑤′) 𝑄(𝑧, 𝑧′)    (𝑧 ∈ Z) (4.15)
therefore acts on the |Z|-dimensional space R^Z.
Example 4.1.6. We can embed the IID job search problem into this setting by
taking (𝑊_𝑡) to be the wage offer process and (𝑍_𝑡) to be constant. This is why the IID
case offers a large dimensionality reduction when we switch to continuation values.
Consider the firm valuation problem from §4.1.2 but suppose now that scrap value
fluctuates with prices of underlying assets. For simplicity let’s assume that scrap value
at each time 𝑡 is given by the IID sequence ( 𝑆𝑡 ), where each 𝑆𝑡 has density 𝜑 on R+ .
The corresponding Bellman operator is
(𝑇𝑣)(𝑧, 𝑠) = max{ 𝑠, 𝜋(𝑧) + 𝛽 Σ_{𝑧′} ∫ 𝑣(𝑧′, 𝑠′) 𝜑(𝑠′) d𝑠′ 𝑄(𝑧, 𝑧′) }.
We can convert this problem to a finite state space optimal stopping problem by
discretizing the density 𝜑 onto a finite grid contained in R+ . However, since continu-
ation values depend only on 𝑧, a better approach is to switch to a continuation value
operator.
EXERCISE 4.1.12. Write down the continuation value operator for this problem as
a mapping from R^Z to itself.
In this section we discuss some applications of optimal stopping and apply the results
described above.
We discussed American options briefly in Example 4.1.2 on page 106. Here we inves-
tigate this class of derivatives more carefully. We focus on American call options that
2 Actually, in most definitions, 𝑢 is also restricted to be bounded and measurable, in order to ensure
that the integrals are finite. These technicalities can be ignored in the exercise.
provide the right to buy a particular stock or bond at a fixed strike price 𝐾 at any
time before a set expiration date. The market price of the asset at time 𝑡 is denoted
by 𝑆𝑡 .
We discussed a case in which the expiration date is infinity in Example 4.1.2.
However, options without termination dates – also called perpetual options – are rare
in practice. Hence we focus on the finite-horizon case. We are interested in computing
the expected value of holding the option when discounting with a fixed interest rate,
a typical assumption when pricing American options.
Finite horizon American options can be priced by backward induction in an ap-
proach like the one we used for the finite horizon job search problem discussed in
Chapter 1. Alternatively, we can embed finite horizon options into the theory of
infinite-horizon optimal stopping. We use the second approach here, since we have
just presented a theory for infinite-horizon optimal stopping.
To this end, we take 𝑇 ∈ N to be a fixed integer indicating the date of expiration.
The option is purchased at 𝑡 = 0 and can be exercised at any 𝑡 ∈ N with 𝑡 ⩽ 𝑇. To
include 𝑡 in the current state, we set T ≔ {1, . . . , 𝑇 + 1} and 𝑚(𝑡) ≔ min{𝑡 + 1, 𝑇 + 1}.
The idea is that time is updated via 𝑡0 = 𝑚 ( 𝑡 ), so that time increments at each update
until 𝑡 = 𝑇 + 1. After that we hold 𝑡 constant. Bounding time at 𝑇 + 1 keeps the state
space finite.
We assume that the stock price 𝑆_𝑡 evolves according to
𝑆_𝑡 = 𝑍_𝑡 + 𝑊_𝑡   where (𝑊_𝑡)_{𝑡⩾0} is IID with distribution 𝜑 ∈ D(W).
Here ( 𝑍𝑡 )𝑡⩾0 is 𝑄 -Markov on finite set Z for some 𝑄 ∈ M ( RZ ) and W is also finite. This
means that the share price has both persistent and transient stochastic components.
If we set parameters so that ( 𝑍𝑡 )𝑡⩾0 resembles a random walk, price changes will be
difficult to predict.
To form a §4.1.1.1 optimal stopping problem, we must specify the state and clarify
the 𝑃 ∈ M ( RX ) that maps to the state process. We set the state space to X ≔ T×W×Z
and
𝑃 (( 𝑡, 𝑤, 𝑧 ) , ( 𝑡0, 𝑤0, 𝑧0)) ≔ 1{𝑡0 = 𝑚 ( 𝑡 )} 𝜑 ( 𝑤0) 𝑄 ( 𝑧, 𝑧0) .
Thus, time updates deterministically via 𝑡0 = 𝑚 ( 𝑡 ) and 𝑧0 and 𝑤0 are drawn indepen-
dently from 𝑄 ( 𝑧, ·) and 𝜑 respectively.
As for a perpetual option, the continuation reward is zero and the discount factor
is 𝛽 ≔ 1/(1 + 𝑟), where 𝑟 > 0 is a fixed risk-free rate. The exit reward can be expressed as
𝑒 ( 𝑡, 𝑤, 𝑧 ) ≔ 1{𝑡 ⩽ 𝑇 }[ 𝑧 + 𝑤 − 𝐾 ] .
The Bellman equation is
𝑣(𝑡, 𝑤, 𝑧) = max{ 𝑒(𝑡, 𝑤, 𝑧), 𝛽 Σ_{𝑤′} Σ_{𝑧′} 𝑣(𝑡′, 𝑤′, 𝑧′) 𝜑(𝑤′) 𝑄(𝑧, 𝑧′) },
where 𝑡′ = 𝑚(𝑡). This value function 𝑣(𝑡, 𝑤, 𝑧) neatly captures the value of the option:
it is the maximum of current exercise value and the discounted expected value of
carrying the option over to the next period.
Since the problem described above is an optimal stopping problem in the sense of
§4.1.1.1, all of the optimality results attained for that problem apply. In particular,
iterates of the Bellman operator converge to the value function 𝑣∗ and, moreover, a
policy is optimal if and only if it is 𝑣∗ -greedy.
We can do better than value function iteration. Since (𝑊𝑡 )𝑡⩾0 is IID and appears
only in the exit reward, we can reduce dimensionality by switching to the continuation
value operator, which, in this case, can be expressed as
(𝐶ℎ)(𝑡, 𝑧) = 𝛽 Σ_{𝑧′} Σ_{𝑤′} max{ 𝑒(𝑡′, 𝑤′, 𝑧′), ℎ(𝑡′, 𝑧′) } 𝜑(𝑤′) 𝑄(𝑧, 𝑧′). (4.16)
As proved in §4.1.4, the unique fixed point of 𝐶 is the continuation value function ℎ∗ ,
and 𝐶 𝑘 ℎ → ℎ∗ as 𝑘 → ∞ for all ℎ ∈ RX . With the fixed point in hand, we can compute
the optimal policy as
𝜎∗ ( 𝑡, 𝑤, 𝑧 ) = 1 {𝑒 ( 𝑡, 𝑤, 𝑧) ⩾ ℎ∗ ( 𝑡, 𝑧)} .
Here 𝜎∗ ( 𝑡, 𝑤, 𝑧 ) = 1 prescribes exercising the option at time 𝑡 .
Figure 4.3 provides a visual representation of optimal actions under the default
parameterization described in Listing 13. Each of the three panels shows contour lines
of the net exit reward 𝑓(𝑡, 𝑤, 𝑧) ≔ 𝑒(𝑡, 𝑤, 𝑧) − ℎ∗(𝑡, 𝑧), viewed as a function of (𝑤, 𝑧),
when 𝑡 is held fixed. The date 𝑡 for each subfigure is shown in the title. The optimal
policy exercises the option when 𝑓(𝑡, 𝑤, 𝑧) ⩾ 0.
In each subfigure, the exercise region, which is the set of (𝑤, 𝑧) such that 𝑓(𝑡, 𝑤, 𝑧) ⩾
0, corresponds to the northeast part of the figure, where 𝑤 and 𝑧 are both large. The
boundary between exercise and continuing is the zero contour line, which is shown
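# The listing below uses tauchen from QuantEcon.jl; the import is added here for completeness.
using QuantEcon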
"Creates an instance of the option model with log S_t = Z_t + W_t."
function create_american_option_model(;
n=100, μ=10.0, # Markov state grid size and mean value
ρ=0.98, ν=0.2, # persistence and volatility for Markov state
s=0.3, # volatility parameter for W_t
r=0.01, # interest rate
K=10.0, T=200) # strike price and expiration date
t_vals = collect(1:T+1)
mc = tauchen(n, ρ, ν)
z_vals, Q = mc.state_values .+ μ, mc.p
w_vals, φ, β = [-s, s], [0.5, 0.5], 1 / (1 + r)
e(t, i_w, i_z) = (t ≤ T) * (z_vals[i_z] + w_vals[i_w] - K)
return (; t_vals, z_vals, w_vals, Q, φ, T, β, K, e)
end
in black. Notice that the size of the exercise region expands with 𝑡 . This is because
the value of waiting decreases when the set of possible exercise dates declines.
Figure 4.4 provides some simulations of the stock price process ( 𝑆𝑡 )𝑡⩾0 over the life-
time of the option, again using the default parameterization described in Listing 13.
The blue region in the top part of each subfigure contains values of the stock price
𝑆𝑡 = 𝑍𝑡 + 𝑊𝑡 such that 𝑆𝑡 ⩾ 𝐾 . In this configuration in which the price of the underly-
ing exceeds the strike price, the option is said to be “in the money.” The figure also
shows an optimal exercise date that is the first 𝑡 such that 𝑒(𝑡, 𝑊_𝑡, 𝑍_𝑡) ⩾ ℎ∗(𝑡, 𝑍_𝑡) in a
simulation.
Consider a firm that engages in costly research and development (R&D) in order to
develop a new product. The firm decides whether to continue developing the product
before starting to market it or to stop developing and start marketing it. For simplicity,
we assume that the value of bringing the product to market is a one-time lump sum
payment 𝜋𝑡 = 𝜋 ( 𝑋𝑡 ), where ( 𝑋𝑡 ) is a 𝑃 -Markov chain on finite set X with 𝑃 ∈ M ( RX ).
The flow cost of investing in R&D is 𝐶𝑡 per period, where ( 𝐶𝑡 ) is a stochastic process.
Future payoffs are discounted at rate 𝑟 > 0 and we set 𝛽 ≔ 1/(1 + 𝑟 ).
[Figure 4.3: contour lines of the net exit reward 𝑓(𝑡, 𝑤, 𝑧) over (𝑤, 𝑧) at 𝑡 = 1, 𝑡 = 195, and 𝑡 = 199]
[Figure 4.4: three simulated price paths 𝑆_𝑡 over the life of the option, with the in-the-money region (𝑆_𝑡 ⩾ 𝐾) and the optimal exercise date marked]
As a first take on this problem, suppose that 𝐶𝑡 ≡ 𝑐 ∈ R+ for all 𝑡 and formulate an
optimal stopping problem with exit reward 𝑒 = 𝜋 and constant continuation reward
−𝑐. The Bellman equation is
𝑣(𝑥) = max{ 𝜋(𝑥), −𝑐 + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑥′) }    (𝑥 ∈ X). (4.17)
EXERCISE 4.2.1. Write down the continuation value operator for this problem.
Prove that the continuation value function ℎ∗ is increasing in 𝑥 whenever 𝜋 ∈ 𝑖RX and
𝑃 is monotone increasing.
EXERCISE 4.2.2. Prove that the optimal policy 𝜎∗ is increasing whenever 𝜋 is in-
creasing and ( 𝑋𝑡 ) is IID (so that all rows of 𝑃 are identical). Provide economic intuition
for this result.
Let’s suppose now that ( 𝐶𝑡 )𝑡⩾0 is IID with common distribution 𝜑 ∈ D(W). The Bell-
man equation is
𝑣(𝑐, 𝑥) = max{ 𝜋(𝑥), −𝑐 + 𝛽 Σ_{𝑥′} Σ_{𝑐′} 𝑣(𝑐′, 𝑥′) 𝜑(𝑐′) 𝑃(𝑥, 𝑥′) }. (4.18)
Since (𝐶𝑡 ) is IID, we would ideally like to integrate it out in the manner of §4.1.4.2,
thereby lowering the dimensionality of the problem. However, note that the continu-
ation value associated with (4.18) is
ℎ(𝑐, 𝑥) ≔ −𝑐 + 𝛽 Σ_{𝑥′} Σ_{𝑐′} 𝑣(𝑐′, 𝑥′) 𝜑(𝑐′) 𝑃(𝑥, 𝑥′),
which depends on 𝑐 as well as 𝑥. To proceed, define the expected value function
𝑔(𝑥) ≔ Σ_{𝑥′} Σ_{𝑐′} 𝑣(𝑐′, 𝑥′) 𝜑(𝑐′) 𝑃(𝑥, 𝑥′), (4.19)
so that ℎ(𝑐, 𝑥) = −𝑐 + 𝛽𝑔(𝑥) and the Bellman equation becomes 𝑣(𝑐, 𝑥) = max{𝜋(𝑥), −𝑐 + 𝛽𝑔(𝑥)}.
Rewrite the Bellman equation using 𝑔 and replacing (𝑐, 𝑥) with (𝑐′, 𝑥′) to get
𝑔(𝑥) = Σ_{𝑥′} Σ_{𝑐′} max{ 𝜋(𝑥′), −𝑐′ + 𝛽𝑔(𝑥′) } 𝜑(𝑐′) 𝑃(𝑥, 𝑥′)    (𝑥 ∈ X). (4.20)
From Exercise 4.2.3, we see that (4.20) has a unique solution 𝑔∗ in RX that can be
computed by successive approximation. With 𝑔∗ in hand, we can compute the optimal
policy via
𝜎∗(𝑐, 𝑥) = 1{ 𝜋(𝑥) ⩾ −𝑐 + 𝛽𝑔∗(𝑥) }.
Remark 4.2.1. This technique solves for the expected value function defined in (4.19).
In §5.3 we shall discuss this method and its convergence properties in a more general
setting.
(1992) is the classic reference. Perla and Tonetti (2014) construct a growth model
in which firms at the bottom of the productivity distribution imitate more productive
firms. Carvalho and Grassi (2019) analyze business cycles in a setting of firm growth
with exit and a Pareto distribution of firms.
Infinite duration American options are analyzed in Mordecki (2002). Practical
methods for pricing American options are provided by Longstaff and Schwartz (2001),
Rogers (2002), and Kohler et al. (2010).
Replacement problems are an important class of optimal stopping problems not treated in
this chapter. An important early paper by Rust (1987) uses dynamic programming
to find optimal replacement policies for engine parts and goes on to fit the model
to data. §5.3.1 discusses structural estimation in the style of Rust (1987) and
others.
Chapter 5

Markov Decision Processes
In this chapter we study a class of discrete time, infinite horizon dynamic programs
called Markov decision processes (MDPs). This standard class of problems is broad
enough to encompass many applications, including the optimal stopping problems in
Chapter 4. MDPs can also be combined with reinforcement learning to tackle settings
where important inputs to an MDP are not known.
We study a controller who interacts with a state process ( 𝑋𝑡 )𝑡⩾0 by choosing an action
path ( 𝐴𝑡 )𝑡⩾0 to maximize expected discounted rewards
E Σ_{𝑡⩾0} 𝛽^𝑡 𝑟(𝑋_𝑡, 𝐴_𝑡), (5.1)
taking an initial state 𝑋0 as given. As with all dynamic programs, we insist that the
controller is not clairvoyant: he or she cannot choose actions that depend on future
states.
To formalize the problem, we fix a finite set X, henceforth called the state space,
and a finite set A, henceforth called the action space. In what follows, a correspon-
dence Γ from X to A is a function from X into ℘(A), the set of all subsets of A. The set
of feasible state-action pairs is
G ≔ {(𝑥, 𝑎) ∈ X × A : 𝑎 ∈ Γ(𝑥)}.
Below we define the value function 𝑣∗ as maximal lifetime rewards and show that 𝑣∗ is the unique
solution to the Bellman equation in R^X.
We can understand the Bellman equation as reducing an infinite-horizon problem
to a two period problem involving the present and the future. Current actions influ-
ence (i) current rewards and (ii) expected discounted value from future states. In
every case we examine, there is a trade-off between maximizing current rewards and
shifting probability mass towards states with high future rewards.
5.1.2 Examples
Here we list examples of MDPs. We will see that some models neatly fit the MDP
structure, while others can be coaxed into the MDP framework by adding states or
applying other tricks.
Rust (1987) ignited the field of dynamic structural estimation by examining an en-
gine replacement problem for a bus workshop. In each period the superintendent
decides whether or not to replace the engine of a given bus. Replacement is costly but
delaying risks unexpected failure. Rust (1987) solved this trade-off using dynamic
programming.
We consider an abstract version of Rust’s problem with binary action 𝐴𝑡 . When
𝐴𝑡 = 1, the state resets to some fixed renewal state 𝑥¯ in a finite set X (e.g., mileage
resets to zero when an engine is replaced). When 𝐴𝑡 = 0, the state updates according
to 𝑄 ∈ M ( RX ) (e.g., mileage increases stochastically when the engine is not replaced).
Given current state 𝑥 and action 𝑎, current reward 𝑟 ( 𝑥, 𝑎) is received. The discount
factor is 𝛽 ∈ (0, 1).
For this problem, the Bellman equation has the form
𝑣(𝑥) = max { 𝑟(𝑥, 1) + 𝛽𝑣(𝑥̄),  𝑟(𝑥, 0) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑄(𝑥, 𝑥′) }   (𝑥 ∈ X),   (5.3)
where the first term is the value from action 1 and the second is the value of action 0.
To set the problem up as an MDP we set A = {0, 1} and Γ(𝑥) = A for all 𝑥 ∈ X. We
define the reward function 𝑟 and a stochastic kernel 𝑃 to match the dynamics described above.
Inserting 𝑃 from (5.4) into the right-hand side of (5.3) recovers the MDP
Bellman equation (5.2).
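To fix ideas, here is a minimal Julia sketch of the Bellman operator in (5.3); the mileage grid, cost parameters, and the upward-drifting kernel 𝑄 below are illustrative assumptions rather than a specification used elsewhere in the text.
function replacement_model(; n=100, β=0.95, c_replace=10.0, c_run=0.05)
    # Hypothetical kernel Q: mileage drifts up by 0, 1 or 2 grid points
    Q = zeros(n, n)
    for x in 1:n, j in 0:2
        Q[x, min(x + j, n)] += 1/3
    end
    r = zeros(n, 2)                 # rewards r(x, a); columns are a = 0 and a = 1
    r[:, 1] .= -c_run .* (1:n)      # a = 0: keep the engine, pay running costs
    r[:, 2] .= -c_replace           # a = 1: replace, pay the replacement cost
    return (; β, r, Q, x̄=1)         # x̄ is the renewal state
end

# Bellman operator for (5.3): v ↦ pointwise max over keeping and replacing
function T(v, model)
    (; β, r, Q, x̄) = model
    keep  = r[:, 1] .+ β .* (Q * v)
    renew = r[:, 2] .+ β * v[x̄]
    return max.(keep, renew)
end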
The firm faces exogenous demand process (𝐷𝑡)𝑡⩾0 ∼IID 𝜑 ∈ D(Z+). Inventory (𝑋𝑡)𝑡⩾0 of the product obeys
𝑋𝑡+1 = 𝑓(𝑋𝑡, 𝐴𝑡, 𝐷𝑡+1) ≔ (𝑋𝑡 − 𝐷𝑡+1) ∨ 0 + 𝐴𝑡.   (5.6)
The term 𝐴𝑡 is units of stock ordered this period, which take one period to arrive. The
definition of 𝑓 imposes the assumption that firms cannot sell more stock than they
have on hand. We assume that the firm can store at most 𝐾 items at one time.
With the price of the firm's product set to one, current profits are given by
𝜋𝑡+1 = 𝑋𝑡 ∧ 𝐷𝑡+1 − 𝑐𝐴𝑡 − 𝜅 1{𝐴𝑡 > 0}.
Here 𝑐 is unit product cost and 𝜅 is a fixed cost of ordering inventory. We take the
minimum 𝑋𝑡 ∧ 𝐷𝑡+1 because orders in excess of inventory are assumed to be lost rather
than back-filled.
We can map our inventory problem into an MDP with state space X ≔ {0, . . . , 𝐾 }
and action space A ≔ X. The feasible correspondence Γ is
Γ ( 𝑥 ) ≔ {0, . . . , 𝐾 − 𝑥 }, (5.7)
which represents the set of feasible orders when the current inventory state is 𝑥 . The
reward function is expected current profits, or
𝑟(𝑥, 𝑎) ≔ Σ_{𝑑⩾0} (𝑥 ∧ 𝑑) 𝜑(𝑑) − 𝑐𝑎 − 𝜅 1{𝑎 > 0}.   (5.8)
The stochastic kernel from the set of feasible state-action pairs G induced by Γ is, in
view of (5.6),
𝑃(𝑥, 𝑎, 𝑥′) ≔ P{𝑓(𝑥, 𝑎, 𝐷) = 𝑥′}   when 𝐷 ∼ 𝜑.   (5.9)
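As a rough sketch (not the book's Listing 14), the primitives (5.8)–(5.9) can be tabulated as arrays; the geometric parameter p and the truncation point d_max below are assumptions made purely for illustration.
function inventory_primitives(; K=40, c=0.2, κ=0.8, p=0.6, d_max=200)
    φ(d) = (1 - p)^d * p                      # geometric pmf on 0, 1, 2, ...
    f(x, a, d) = max(x - d, 0) + a            # transition function in (5.6)
    X = 0:K
    # r(x, a) from (5.8); entries with infeasible a are simply never used
    r = [sum(min(x, d) * φ(d) for d in 0:d_max) - c*a - κ*(a > 0) for x in X, a in X]
    # P(x, a, x′) from (5.9); truncation error at d_max is negligible when d_max is large
    P = zeros(K+1, K+1, K+1)
    for x in X, a in 0:(K - x), d in 0:d_max
        P[x+1, a+1, f(x, a, d)+1] += φ(d)
    end
    return (; r, P, Γ = x -> 0:(K - x))
end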
This operator maps RX to itself and is designed so that its set of fixed points in RX
coincides with the set of solutions to (5.11) in RX.
𝑊𝑡+1 = 𝑅 (𝑊𝑡 − 𝐶𝑡 ) ( 𝑡 ⩾ 0)
where 𝐶𝑡 is current consumption and 𝑅 is the gross interest rate. The agent seeks to
maximize E Σ_{𝑡⩾0} 𝛽^𝑡 𝑢(𝐶𝑡) given 𝑊0 = 𝑤. The corresponding Bellman equation is
𝑣(𝑤) = max_{0 ⩽ 𝑤′ ⩽ 𝑤} { 𝑢(𝑤 − 𝑤′/𝑅) + 𝛽𝑣(𝑤′) }.   (5.13)
EXERCISE 5.1.4. Frame this model as an MDP with W as the state space.
Let’s focus on the job search problem with Markov state discussed in §3.3.1 (al-
though the arguments for the general optimal stopping problem in §4.1.1.1 are very
similar). As before, W is the set of wage outcomes. Since we need the symbol 𝑃 for
other purposes, we let 𝑄 be the Markov matrix for wages, so that (𝑊𝑡 )𝑡⩾0 is 𝑄 -Markov
on W.
To express the job search problem as an MDP, let X = {0, 1} × W be a state space
whose typical element is (𝑒, 𝑤), with 𝑒 representing either unemployment (𝑒 = 0) or
employment (𝑒 = 1), and let A = {0, 1} be the action space, with 𝑎 = 1 interpreted as accepting the current wage offer.
EXERCISE 5.1.5. Express the job search problem as an MDP, with state space X and
action space A as described in the previous paragraph.
5.1.3 Optimality
In this section we return to the general MDP setting of §5.1.1, define optimal policies
and state our main optimality result. As was the case for job search, actions are gov-
erned by policies, which are maps from states to actions (see, in particular, §1.3.1.3,
where policies were introduced).
𝑃𝜎(𝑥, 𝑥′) ≔ 𝑃(𝑥, 𝜎(𝑥), 𝑥′)   (𝑥, 𝑥′ ∈ X).
Note that 𝑃𝜎 ∈ M ( RX ). Fixing a policy “closes the loop” in the state transition process
and defines a Markov chain for the state.
Under the policy 𝜎, rewards at state 𝑥 are 𝑟 ( 𝑥, 𝜎 ( 𝑥 )). If
𝑟𝜎 ( 𝑥 ) ≔ 𝑟 ( 𝑥, 𝜎 ( 𝑥 )) and E 𝑥 ≔ E [ · | 𝑋0 = 𝑥 ]
then the lifetime value of following 𝜎 starting from state 𝑥 can be written as
𝑣𝜎(𝑥) = E𝑥 Σ_{𝑡⩾0} 𝛽^𝑡 𝑟𝜎(𝑋𝑡)   where (𝑋𝑡) is 𝑃𝜎-Markov with 𝑋0 = 𝑥.   (5.17)
Analogous to the optimal stopping case, we call 𝑣𝜎 the 𝜎-value function. We also call
𝑣𝜎 ( 𝑥 ) the lifetime value of policy 𝜎 conditional on initial state 𝑥 .
(𝑇𝜎 is analogous to the policy operator defined for the optimal stopping problem in
§4.1.1.3.) In vector notation,
𝑇𝜎 𝑣 = 𝑟𝜎 + 𝛽𝑃𝜎 𝑣. (5.20)
The next exercise shows how 𝑇𝜎 can be put to work.
Computationally, this means that we can pick 𝑣 ∈ RX and iterate with 𝑇𝜎 to obtain
an approximation to 𝑣𝜎 .
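For instance, under the assumption that 𝑟𝜎 and 𝑃𝜎 are already stored as a vector and a matrix, a minimal Julia sketch of both routes to 𝑣𝜎 is:
using LinearAlgebra

lifetime_value(r_σ, P_σ, β) = (I - β * P_σ) \ r_σ      # exact: v_σ = (I - βP_σ)^(-1) r_σ

function iterate_policy_operator(r_σ, P_σ, β; k=500)   # approximate: apply T_σ k times to v ≡ 0
    v = zero(r_σ)
    for _ in 1:k
        v = r_σ + β * (P_σ * v)
    end
    return v
end

# Hypothetical two-state example
P_σ = [0.9 0.1; 0.2 0.8]; r_σ = [1.0, 2.0]; β = 0.95
maximum(abs, lifetime_value(r_σ, P_σ, β) - iterate_policy_operator(r_σ, P_σ, β))  # ≈ 0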
EXERCISE 5.1.8. Prove that, when the initial condition for iteration is 𝑣 ≡ 0 ∈ RX ,
the 𝑘-th iterate 𝑇𝜎^𝑘 𝑣 is equal to the truncated sum Σ_{𝑡=0}^{𝑘−1} 𝛽^𝑡 𝑃𝜎^𝑡 𝑟𝜎.
The next exercise extends Exercise 5.1.8 and aids interpretation of policy opera-
tors. It tells us that (𝑇𝜎𝑘 𝑣)( 𝑥 ) is the payoff from following policy 𝜎 and starting in state
𝑥 when lifetime is truncated to the finite horizon 𝑘 and 𝑣 provides a terminal payoff
in each state.
Given MDP M = ( Γ, 𝛽, 𝑟, 𝑃 ) with 𝜎-value functions { 𝑣𝜎 }𝜎∈Σ , the value function corre-
sponding to M is defined as 𝑣∗ ≔ ∨𝜎∈Σ 𝑣𝜎 , where, as usual, the maximum is pointwise.
More explicitly,
𝑣∗(𝑥) = max_{𝜎∈Σ} 𝑣𝜎(𝑥)   (𝑥 ∈ X).   (5.21)
This is consistent with our definition of the value function in the optimal stopping case
(see page 108). It is the maximal lifetime value we can extract from each state using
feasible behavior. The maximum in (5.21) exists at each 𝑥 because Σ is finite.
A policy 𝜎 ∈ Σ is called optimal for M if 𝑣𝜎 = 𝑣∗ . In other words, a policy is optimal
if its lifetime value is maximal at each state.
Example 5.1.1. Consider again Figure 2.5 on page 57, supposing that Σ = {𝜎′, 𝜎′′}.
As drawn, there is no optimal policy, since 𝑣∗ differs from both 𝑣𝜎′ and 𝑣𝜎′′. Below, in
Proposition 5.1.1, we will show that such an outcome is not possible for MDPs.
Our optimality results are easier to follow with some additional terminology. To
start, given 𝑣 ∈ RX , we define a policy 𝜎 ∈ Σ to be 𝑣-greedy if
𝜎(𝑥) ∈ argmax_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) }   for all 𝑥 ∈ X.   (5.22)
In essence, a 𝑣-greedy policy treats 𝑣 as the correct value function and sets all actions
accordingly.
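A direct Julia sketch of (5.22), assuming the primitives are stored as an n × m reward array r, an n × m × n kernel P, and a Boolean feasibility mask (all hypothetical inputs), is:
function greedy_policy(v, r, P, feasible, β)
    n, m = size(r)
    σ = zeros(Int, n)
    for x in 1:n
        best_a, best_val = 0, -Inf
        for a in 1:m
            feasible[x, a] || continue               # skip a ∉ Γ(x)
            val = r[x, a] + β * sum(P[x, a, x′] * v[x′] for x′ in 1:n)
            if val > best_val
                best_a, best_val = a, val
            end
        end
        σ[x] = best_a                                # a v-greedy action at x
    end
    return σ
end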
EXERCISE 5.1.10. Fix 𝜎 ∈ Σ and 𝑣 ∈ RX . Prove that the set {𝑇𝜎 𝑣}𝜎∈Σ has a least
and greatest element.
Figure 5.1: The Bellman operator 𝑇 = ∨_{𝜎∈Σ} 𝑇𝜎 as the upper envelope of the policy operators 𝑇𝜎′, 𝑇𝜎′′, 𝑇𝜎′′′
The last part of Exercise 5.1.11 tells us that 𝑇 is the pointwise maximum of {𝑇𝜎 }𝜎∈Σ ,
which can be expressed as 𝑇 = ∨𝜎 𝑇𝜎 . Figure 5.1 illustrates this relationship in one
dimension.
While Proposition 5.1.1 is a special case of later results (see §8.1.3.3), a direct
proof is not difficult and we provide one below for interested readers.
inequality.
As for the second inequality, fix 𝜎 ∈ Σ and observe that 𝑇𝜎 𝑣 ⩽ 𝑇 𝑣 for all 𝑣 ∈ RX .
Since 𝑇 is order-preserving and globally stable, Proposition 2.2.7 on page 67 implies
that 𝑣𝜎 ⩽ 𝑣¯. Taking the supremum over 𝜎 ∈ Σ yields 𝑣∗ ⩽ 𝑣¯.
Hence 𝑣∗ is a fixed point of 𝑇 in RX . Since 𝑇 is globally stable on RX , the remaining
claims in parts (i)–(ii) follow immediately.
As for part (iii), it follows from Exercise 5.1.11 and part (i) of this theorem that
𝜎 is 𝑣∗ -greedy ⇐⇒ 𝑇𝜎 𝑣∗ = 𝑇 𝑣∗ ⇐⇒ 𝑇𝜎 𝑣∗ = 𝑣∗ .
The right hand side of this expression tells us that 𝑣∗ is a fixed point of 𝑇𝜎 . But the only
fixed point of 𝑇𝜎 is 𝑣𝜎 , so the right hand side is equivalent to the statement 𝑣𝜎 = 𝑣∗ .
Hence, by this chain of logic and the definition of optimality, 𝜎 is 𝑣∗-greedy if and only if 𝜎 is optimal. □
Figure 5.2: The Bellman operator 𝑇 as the upper envelope of 𝑇𝜎′ and 𝑇𝜎′′, with fixed points 𝑣𝜎′, 𝑣𝜎′′ = 𝑣∗ and the 45-degree line
In this figure,
(i) 𝑣∗ is the largest of these fixed points, which equals 𝑣𝜎′′, and
(ii) 𝜎′′ is the optimal policy, since 𝑣𝜎′′ = 𝑣∗.
In accordance with Proposition 5.1.1, 𝑣∗ is also the fixed point of the Bellman operator.
It is important to understand the significance of (iii) in Proposition 5.1.1. Greedy
policies are relatively easy to compute, in the sense that solving (5.22) at each 𝑥 is
easier than trying to directly solve the problem of maximizing lifetime value, since Σ
is in general far larger than Γ ( 𝑥 ). Part (iii) tells us that solving the overall problem
reduces to computing a 𝑣-greedy policy with the right choice of 𝑣. As was the case for optimal stopping
problems, that choice is the value function 𝑣∗. Intuitively, 𝑣∗ assigns a "correct" value
to each state, in the sense of maximal lifetime value the controller can extract, so
using 𝑣∗ to calculate greedy policies leads to the optimal outcome.
5.1.4 Algorithms
In previous chapters we solved job search and optimal stopping problems using value
function iteration. In this section we present a generalization suitable for arbitrary
MDPs and then discuss two important alternatives.
Value function iteration (VFI) for MDPs is similar to VFI for the job search model
(see page 37): we use successive approximation on 𝑇 to compute an approximation
𝑣𝑘 to the value function 𝑣∗ and then take a 𝑣𝑘 -greedy policy. The general procedure is
given by Algorithm 5.2.
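A bare-bones Julia sketch of this procedure, written under the assumption that a Bellman operator T and a v-greedy routine are supplied by the user as functions, is:
function value_function_iteration(T, v_greedy, v_init; τ=1e-6, max_iter=10_000)
    v = v_init
    for _ in 1:max_iter
        v_new = T(v)
        error = maximum(abs, v_new - v)
        v = v_new
        error < τ && break            # stop when successive iterates are within τ
    end
    return v_greedy(v), v             # approximate optimal policy and value function
end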
The fact that the sequence ( 𝑣𝑘 ) 𝑘⩾0 produced by VFI converges to 𝑣∗ is immediate
from Proposition 5.1.1 (as the tolerance 𝜏 is taken toward zero). It is also true that
the greedy policy produced in the last step is approximately optimal when 𝜏 is small,
and exactly optimal when 𝑘 is sufficiently large. Proofs are given in Chapter 8, where
we examine VFI in a more general setting.
VFI is robust, easy to understand and easy to implement. These properties explain
its enduring popularity. At the same time, in terms of efficiency, VFI is often dominated
by alternative algorithms, two of which are discussed below.
Figure 5.3: Howard policy iteration: from an initial 𝜎, compute 𝑣𝜎 = (𝐼 − 𝛽𝑃𝜎)^{−1} 𝑟𝜎, take a 𝑣𝜎-greedy policy, and repeat
Unlike VFI, Howard policy iteration (HPI) computes optimal policies by iterating
between computing the value of a given policy and computing the greedy policy as-
sociated with that value. The full technique is described in Algorithm 5.3.
A visualization of HPI is given in Figure 5.3, where 𝜎 is the initial choice. Next we
compute the lifetime value 𝑣𝜎, and then the 𝑣𝜎-greedy policy 𝜎′, and so on. The com-
putation of lifetime value is called the policy evaluation step, while the computation
of greedy policies is called policy improvement.
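A sketch of the loop in Julia, reusing the greedy_policy sketch above and the same hypothetical array format for r and P, is:
using LinearAlgebra

function howard_policy_iteration(r, P, feasible, β, σ_init; max_iter=1_000)
    σ = σ_init
    for _ in 1:max_iter
        n = length(σ)
        r_σ = [r[x, σ[x]] for x in 1:n]
        P_σ = [P[x, σ[x], x′] for x in 1:n, x′ in 1:n]
        v_σ = (I - β * P_σ) \ r_σ                         # policy evaluation
        σ_new = greedy_policy(v_σ, r, P, feasible, β)     # policy improvement
        σ_new == σ && return σ, v_σ                       # converged: σ is optimal
        σ = σ_new
    end
    error("no convergence within max_iter")
end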
HPI has two very attractive features. One is that, in a finite state setting, the
algorithm always converges to an exact optimal policy in a finite number of steps,
regardless of the initial condition. We prove this fact in a more general setting in
Chapter 8. The second is that the rate of convergence is faster than VFI, as will be
shown in §5.1.4.3.
Figure 5.4: One step of HPI in one dimension: 𝑇, 𝑇𝜎′, the 45-degree line, and the points 𝑣𝜎, 𝑣𝜎′, 𝑣∗
Figure 5.4 gives another illustration, presented in the one-dimensional setting that
we used for Figure 5.2. In this illustration, we imagine that there are many feasible
policies, and hence many functions in {𝑇𝜎}, so that their upper envelope, which is the
Bellman operator, becomes a smoother curve. The figure shows the update from 𝑣𝜎
to the next lifetime value 𝑣𝜎′, via the following two steps:
(i) Take 𝜎′ to be 𝑣𝜎-greedy, which means that 𝑇𝜎′ 𝑣𝜎 = 𝑇𝑣𝜎 (see Exercise 5.1.11).
(ii) Take 𝑣𝜎′ to be the fixed point of 𝑇𝜎′.
The next step, from 𝑣𝜎′ to 𝑣𝜎′′, is analogous.
Comparison of this figure with Figure 2.1 on page 48 suggests that HPI is an im-
plementation of Newton’s method, applied to the Bellman operator. We confirm this
in §5.1.4.3.
In discussing the connection between HPI and Newton iteration, one issue is that
𝑇 is not always differentiable, as seen in Figure 5.2. But 𝑇 is convex, and this lets
us substitute subgradients for derivatives. Once we make this modification, HPI and
Newton iteration are identical, as we now show.
Figure 5.5: The convex maps 𝑇1 (differentiable at 𝑣, unique subgradient) and 𝑇2 (nondifferentiable at 𝑣, multiple subgradients)
First, recall that, given a self-map 𝑇 from 𝑆 ⊂ R^𝑛 to itself, an 𝑛 × 𝑛 matrix 𝐷 is
called a subgradient of 𝑇 at 𝑣 ∈ 𝑆 if
𝑇𝑢 ⩾ 𝑇𝑣 + 𝐷(𝑢 − 𝑣)   for all 𝑢 ∈ 𝑆.   (5.26)
Figure 5.5 illustrates the definition in one dimension, where 𝐷 is just a scalar de-
termining the slope of a tangent line at 𝑣. In the left subfigure, 𝑇1 is convex and
differentiable at 𝑣, which means that only one subgradient exists (since any other
choice of slope implies that the inequality in (5.26) will fail for some 𝑢). In the right
subfigure, 𝑇2 is convex but nondifferentiable at 𝑣, so multiple subgradients exist.
In the next result, we take ( Γ, 𝛽, 𝑟, 𝑃 ) to be a given MDP and let 𝑇 be the associated
Bellman operator.
𝑇𝑢 = 𝑇 𝑣 + 𝑇𝑢 − 𝑇 𝑣 ⩾ 𝑇 𝑣 + 𝑇𝜎 𝑢 − 𝑇𝜎 𝑣.
Now let’s consider Newton’s method applied to the problem of finding the fixed
point of 𝑇 . Since 𝑇 is nondifferentiable and convex, we replace the Jacobian in New-
ton’s method (see (2.2) on page 48) with the subdifferential. This leads us to iterate
on
𝑣𝑘+1 = 𝑄𝑣𝑘 where 𝑄𝑣 ≔ ( 𝐼 − 𝛽𝑃𝜎 ) −1 (𝑇 𝑣 − 𝛽𝑃𝜎 𝑣) .
In the definition of 𝑄 , the policy 𝜎 is 𝑣-greedy. Using 𝑇 𝑣 = 𝑇𝜎 𝑣, the map 𝑄 reduces
to 𝑄𝑣 ≔ ( 𝐼 − 𝛽𝑃𝜎 ) −1 𝑟𝜎 , which is exactly the update step to produce the next 𝜎-value
function in HPI (i.e., the lifetime value of a 𝑣-greedy policy).
The fact that HPI is a version of Newton’s method suggests that its iterates ( 𝑣𝑘 ) 𝑘⩾0
enjoy quadratic convergence. This is indeed the case: under mild conditions one can
show there exists a constant 𝑁 such that, for all 𝑘 ⩾ 0,
‖𝑣𝑘+1 − 𝑣∗‖ ⩽ 𝑁 ‖𝑣𝑘 − 𝑣∗‖²   (5.27)
(see, e.g., Puterman (2005), Theorem 6.4.8). Hence HPI enjoys both a fast conver-
gence rate and the robustness of global convergence.
However, HPI is not always optimal in terms of efficiency, since the size of the
constant term in (5.27) also matters. This term can be large because, at each step, the
update from 𝑣𝜎 to 𝑣𝜎′ requires computing the exact lifetime value 𝑣𝜎′ of the 𝑣𝜎-greedy
policy 𝜎′. Computing this fixed point exactly can be computationally expensive in
high dimensions.
One way around this issue is to forgo computing the fixed point 𝑣𝜎′ exactly, re-
placing it with an approximation. The next section takes up this idea.
Optimistic policy iteration (OPI) is an algorithm that borrows from both VFI and HPI.
In essence, the algorithm is the same as HPI except that, instead of computing the full
value 𝑣𝜎 of a given policy, the approximation 𝑇𝜎𝑚 𝑣 from Exercise 5.1.7 is used instead.
Algorithm 5.4 clarifies.
In the algorithm, the policy operator 𝑇𝜎𝑘 is applied 𝑚 times to generate an ap-
proximation of 𝑣𝜎𝑘 . The constant step size 𝑚 can also be replaced with a sequence
( 𝑚𝑘 ) ⊂ N. In either case, for MDPs, convergence to an optimal policy is guaranteed.
We prove this in a more general setting in Chapter 8.
Notice that, as 𝑚 → ∞, the algorithm increasingly approximates Howard policy
iteration, since 𝑇𝜎𝑘^𝑚 𝑣𝑘 converges to 𝑣𝜎𝑘. At the same time, if 𝑚 = 1, the algorithm reduces to VFI.
This follows from Exercise 5.1.11, which tells us that, when 𝜎𝑘 is 𝑣𝑘 -greedy, 𝑇𝜎𝑘 𝑣𝑘 =
𝑇 𝑣𝑘 . Hence, with intermediate 𝑚, OPI can be seen as a “convex combination” of HPI
and VFI.
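A sketch of the corresponding loop, in the same hypothetical array format and reusing the greedy_policy sketch above, is:
function optimistic_policy_iteration(r, P, feasible, β, v_init; m=20, τ=1e-6, max_iter=10_000)
    v = v_init
    n = length(v)
    σ = greedy_policy(v, r, P, feasible, β)     # initial v-greedy policy
    for _ in 1:max_iter
        r_σ = [r[x, σ[x]] for x in 1:n]
        P_σ = [P[x, σ[x], x′] for x in 1:n, x′ in 1:n]
        v_new = v
        for _ in 1:m                            # m applications of T_σ approximate v_σ
            v_new = r_σ + β * (P_σ * v_new)
        end
        σ_new = greedy_policy(v_new, r, P, feasible, β)
        if maximum(abs, v_new - v) < τ
            return σ_new, v_new
        end
        σ, v = σ_new, v_new
    end
    return σ, v
end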
5.2 Applications
This section gives several applications of the MDP model to economic problems. The
applications illustrate the ease with which MDPs can be implemented on a computer
(provided that the state and action spaces are not too large).
In §3.1.1.2 we studied a firm whose inventory behavior was specified to follow S-s
dynamics. In §5.1.2.2 we introduced a model where investment behavior is endoge-
nous, determined by the desire to maximize firm value. In this section, we show that
this endogenous inventory behavior can replicate the S-s dynamics from §3.1.1.2.
We saw in §5.1.2.2 that the optimal inventory model is an MDP, so the Proposi-
tion 5.1.1 optimality and convergence results apply. In particular, the unique fixed
point of the Bellman operator is the value function 𝑣∗ , and a policy 𝜎∗ is optimal if
and only if 𝜎∗ is 𝑣∗ -greedy.
We solve the model numerically using VFI. As in Exercise 5.1.2, we take 𝜑 to be
the geometric distribution on Z+ with parameter 𝑝. We use the default parameter
values shown in Listing 14. The code listing also presents an implementation of the Bellman operator.
Figure 5.6: The value function and optimal policy for the inventory problem
Figure 5.6 exhibits an approximation of the value function 𝑣∗ , computed by iter-
ating with 𝑇 starting at 𝑣 ≡ 1. Figure 5.6 also shows the approximate optimal policy,
obtained as a 𝑣∗ -greedy policy:
𝜎∗(𝑥) ∈ argmax_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑑⩾0} 𝑣∗(𝑓(𝑥, 𝑎, 𝑑)) 𝜑(𝑑) }
The plot of the optimal policy shows that there is a threshold region below which the
firm orders large batches and above which the firm orders nothing. This makes sense,
since the firm wishes to economize on the fixed cost of ordering. Figure 5.7 shows a
simulation of inventory dynamics under the optimal policy, starting from 𝑋0 = 0. The
time path closely approximates the S-s dynamics discussed in §3.1.1.2.
EXERCISE 5.2.1. Compute the optimal policy by extending the code given in List-
ing 14. Replicate Figure 5.7, modulo randomness, by sampling from a geometric
distribution and implementing the dynamics in (5.6). At each 𝑋𝑡 , the action 𝐴𝑡 should
be chosen according to the optimal policy 𝜎∗ ( 𝑋𝑡 ).
Figure 5.7: Simulated inventory dynamics (𝑋𝑡) under the optimal policy
As our next example of an MDP, we modify the cake eating problem in §5.1.2.3 to
add labor income. Wealth evolves according to
𝑊𝑡+1 = 𝑅(𝑊𝑡 + 𝑌𝑡 − 𝐶𝑡),
where (𝑊𝑡 ) takes values in finite set W ⊂ R+ and labor income (𝑌𝑡 ) is a Markov chain
on finite set Y ⊂ R+ with transition matrix 𝑄 .1 𝑅 is a gross rate of interest, so that
investing 𝑑 dollars today returns 𝑅𝑑 next period. Other parts of the problem are un-
changed. The Bellman operator can be written as
(𝑇𝑣)(𝑤, 𝑦) = max_{𝑤′ ∈ Γ(𝑤,𝑦)} { 𝑢(𝑤 + 𝑦 − 𝑤′/𝑅) + 𝛽 Σ_{𝑦′} 𝑣(𝑤′, 𝑦′) 𝑄(𝑦, 𝑦′) }.   (5.29)
The action is the savings choice 𝑠, which takes values in W and equals 𝑤′. The feasible correspondence is the set of
feasible savings values
Γ ( 𝑤, 𝑦 ) = { 𝑠 ∈ W : 𝑠 ⩽ 𝑅 ( 𝑤 + 𝑦 )} .
5.2.2.2 Implementation
To implement the algorithms discussed in §5.1.4, we use the Bellman operator (5.29),
and the corresponding definition of a 𝑣-greedy policy, which is
𝜎(𝑤, 𝑦) ∈ argmax_{𝑤′ ∈ Γ(𝑤,𝑦)} { 𝑢(𝑤 + 𝑦 − 𝑤′/𝑅) + 𝛽 Σ_{𝑦′} 𝑣(𝑤′, 𝑦′) 𝑄(𝑦, 𝑦′) }
Code for implementing the model and these two operators is given in Listing 15.
Income is constructed as a discretized AR(1) process using the method from §3.1.3.
Exponentiation is applied to the grid so that income takes positive values.
The function get_value in Listing 16 uses the expression 𝑣𝜎 = (𝐼 − 𝛽𝑃𝜎)^{−1} 𝑟𝜎 from
(5.18) to obtain the value of a given policy 𝜎. The matrix 𝑃𝜎 and vector 𝑟𝜎 take the form
𝑟𝜎(𝑤, 𝑦) = 𝑢(𝑤 + 𝑦 − 𝜎(𝑤, 𝑦)/𝑅)   and   𝑃𝜎((𝑤, 𝑦), (𝑤′, 𝑦′)) = 1{𝑤′ = 𝜎(𝑤, 𝑦)} 𝑄(𝑦, 𝑦′).
5.2.2.3 Timing
Since all results for MDPs apply, we know that the value function 𝑣∗ is the unique fixed
point of the Bellman operator in RX , and that value function iteration, Howard policy
iteration and optimistic policy iteration all converge. Listing 17, which begins with
include("finite_opt_saving_0.jl")
include("s_approx.jl")
include("finite_opt_saving_1.jl")
implements these three algorithms. Since the state and action spaces are finite, Howard policy iteration
is guaranteed to return an exact optimal policy.
Figure 5.8: Solution time in seconds for VFI, HPI, and OPI in the optimal savings model, plotted against the OPI step parameter 𝑚
Figure 5.8 shows the number of seconds taken to solve the finite optimal savings
model under the default parameters when executed on a laptop machine with 20 CPUs
running at around 4GHz. The horizontal axis corresponds to the step parameter 𝑚 in
OPI (Algorithm 5.4). The two other algorithms do not depend on 𝑚 and hence their
timings are constant. The figure shows that HPI is an order of magnitude faster than
VFI and that optimistic policy iteration is even faster for moderate values of 𝑚.
One reason VFI is slow is that the discount factor is close to one. This matters because
the convergence rate of VFI is linear, with errors declining geometrically at rate
𝛽. In contrast, HPI, being an instance of Newton iteration, converges quadrati-
cally (see §2.1.4.2). As a result, HPI tends to dominate VFI when the discount factor
approaches unity.
Run-times are also dependent on implementation, and relative speed varies signif-
icantly with coding style, software and hardware platforms. In our implementation,
the main deficiency is that parallelization is under-utilized. Better exploitation of
parallelization tends to favor HPI, as discussed in §2.1.4.4.
Figure 5.9: A simulated wealth time series (𝑤𝑡) under the optimal savings policy
5.2.2.4 Outputs
Figure 5.9 shows a typical time series for the wealth of a single household under the
optimal policy. The series is created by computing an optimal policy 𝜎∗ , generating
(𝑌𝑡)_{𝑡=0}^{𝑚−1} as a 𝑄-Markov chain on Y and then computing (𝑊𝑡)_{𝑡=0}^{𝑚} via 𝑊𝑡+1 = 𝜎∗(𝑊𝑡, 𝑌𝑡) for
𝑡 running from 0 to 𝑚 − 1. Initial wealth 𝑊0 is set to 1.0 and 𝑚 = 2000.
Figure 5.10 shows the result of computing and histogramming a longer time series,
with 𝑚 set to 1,000,000. This histogram approximates the stationary distribution of
wealth for a large population, each updating via 𝜎∗ and each with independently
generated labor income series (𝑌𝑡)_{𝑡=0}^{𝑚−1}. (This is due to ergodicity of the wealth-income
process. For a discussion of the connection between stationary distributions and time
series under ergodicity see, for example, Sargent and Stachurski (2023b).)
The shape of the wealth distribution in Figure 5.10 is unrealistic. In almost all
countries, the wealth distribution has a very long right tail. The Gini coefficient of the
distribution in Figure 5.10 is 0.54, which is too low. For example, World Bank data
for 2019 produces a wealth Gini for the US equal to 0.852. For Germany and Japan
the figures are 0.816 and 0.627 respectively.
In §5.3.3 we discuss a variation on the optimal savings model that can produce a
more realistic wealth distribution.
Figure 5.10: Histogram of the simulated wealth time series under the optimal policy
As our next application, we consider a monopolist facing adjustment costs and stochas-
tically evolving demand. The monopolist balances setting enough capacity to meet
demand against costs of adjusting capacity.
We assume that the monopolist produces a single product and faces an inverse demand
function of the form
𝑃𝑡 = 𝑎0 − 𝑎1𝑌𝑡 + 𝑍𝑡 ,
where 𝑎0 , 𝑎1 are positive parameters, 𝑌𝑡 is output, 𝑃𝑡 is price and the demand shock 𝑍𝑡
follows
𝑍𝑡+1 = 𝜌𝑍𝑡 + 𝜎𝜂𝑡+1,   {𝜂𝑡} ∼IID 𝑁(0, 1).
Current profits are
𝜋𝑡 ≔ 𝑃𝑡 𝑌𝑡 − 𝑐𝑌𝑡 − 𝛾 (𝑌𝑡+1 − 𝑌𝑡 ) 2 .
Here 𝛾 (𝑌𝑡+1 − 𝑌𝑡 ) 2 represents costs associated with adjusting production scale, param-
eterized by 𝛾 , and 𝑐 is unit cost of current production. Costs are convex, so rapid
changes to capacity are expensive.
The monopolist chooses (𝑌𝑡) to maximize the expected discounted value of its profit stream. Intuitively,
• if 𝛾 is close to zero, then the optimal output path 𝑌𝑡 will track the time path of 𝑌̄𝑡 relatively closely, while
• if 𝛾 is large, then adjustment is expensive and output will respond to demand shocks only gradually.
Taking the state to be (𝑦, 𝑧), where 𝑦 is current output and 𝑧 is the demand shock, and taking the action 𝑞 to be the choice of next period output, the reward function is
𝑟((𝑦, 𝑧), 𝑞) = (𝑎0 − 𝑎1 𝑦 + 𝑧 − 𝑐) 𝑦 − 𝛾(𝑞 − 𝑦)².
The stochastic kernel is
𝑃((𝑦, 𝑧), 𝑞, (𝑦′, 𝑧′)) = 1{𝑦′ = 𝑞} 𝑄(𝑧, 𝑧′).
The term 1{𝑦′ = 𝑞} states that next period output 𝑦′ is equal to our current choice 𝑞
for next period output. With these definitions, the problem defines an MDP and all of
the optimality theory for MDPs applies.
5.2.3.3 Implementation
By combining iteration with the policy operator and computation of greedy policies,
we can implement optimistic policy iteration, compute the optimal policy 𝜎∗ , and
study output choices generated by this policy. We are particularly interested in how
output responds over time to randomly generated demand shocks.
Figure 5.11 shows the result of a simulation designed to shed light on how output
responds to demand. After choosing initial values (𝑌1 , 𝑍1 ) and generating a 𝑄 -Markov
chain (𝑍𝑡)_{𝑡=1}^{𝑇}, we simulated optimal output via 𝑌𝑡+1 = 𝜎∗(𝑌𝑡, 𝑍𝑡). The default parameters
are shown in Listing 18. In the figure, the adjustment cost parameter 𝛾 is varied as
shown in the title. In addition to the optimal output path, the path of (𝑌¯𝑡 ) as defined
in (5.32) is also presented.
The figure shows how increasing 𝛾 promotes smoothing, as predicted in our dis-
cussion above. For small 𝛾 , adjustment costs have only minor effects on choices, so
output closely follows (𝑌¯𝑡 ), the optimal path when output responds immediately to
demand shocks. Conversely, larger values of 𝛾 make adjustment expensive, so the
monopolist adjusts output relatively slowly in response to changes in demand.
Figure 5.11: Optimal output (𝑌𝑡) and the target path (𝑌̄𝑡) for 𝛾 ∈ {1, 10, 20, 30}
using QuantEcon  # assumed dependency: provides tauchen
function create_investment_model(;
r=0.04, # Interest rate
a_0=10.0, a_1=1.0, # Demand parameters
γ=25.0, c=1.0, # Adjustment and unit cost
y_min=0.0, y_max=20.0, y_size=100, # Grid for output
ρ=0.9, ν=1.0, # AR(1) parameters
z_size=25) # Grid size for shock
β = 1/(1+r)
y_grid = LinRange(y_min, y_max, y_size)
mc = tauchen(z_size, ρ, ν)
z_grid, Q = mc.state_values, mc.p
return (; β, a_0, a_1, γ, c, y_grid, z_grid, Q)
end
Figure 5.12 compares timings for VFI, HPI and OPI. Parameters are as in Listing 18.
As in Figure 5.8, which gave timings for the optimal savings model, the horizontal axis
shows 𝑚, which is the step parameter in OPI (see Algorithm 5.4). VFI and HPI do not
depend on 𝑚 and hence their timings are constant. The vertical axis is time in seconds.
HPI is faster than VFI, although the difference is not as dramatic as was the case
for optimal savings. One reason is that the discount factor is relatively small for the
optimal investment model (𝑟 = 0.04 and 𝛽 = 1/(1 + 𝑟 ), so 𝛽 ≈ 0.96). Since 𝛽 is
the modulus of contraction for the Bellman operator, this means that VFI converges
relatively quickly. Another observation is that, for many values of 𝑚, OPI dominates
both VFI and HPI in terms of speed, which is consistent with our findings for the
optimal savings model. At 𝑚 = 70, OPI is around 20 times faster than VFI.
Figure 5.12: Solution time in seconds for VFI, HPI, and OPI in the investment model, plotted against the OPI step parameter 𝑚
fixed cost induces lumpy adjustment, as shown in Figure 5.13. Show that this model is
an MDP. Write the Bellman equation and the procedure for optimistic policy iteration
in the context of this model. Replicate Figure 5.13, modulo randomness, using the
parameters shown in Listing 19.
using QuantEcon  # assumed dependency: provides tauchen
function create_hiring_model(;
r=0.04, # Interest rate
κ=1.0, # Adjustment cost
α=0.4, # Production parameter
p=1.0, w=1.0, # Price and wage
l_min=0.0, l_max=30.0, l_size=100, # Grid for labor
ρ=0.9, ν=0.4, b=1.0, # AR(1) parameters
z_size=100) # Grid size for shock
β = 1/(1+r)
l_grid = LinRange(l_min, l_max, l_size)
mc = tauchen(z_size, ρ, ν, b, 6)
z_grid, Q = mc.state_values, mc.p
return (; β, κ, α, p, w, l_grid, z_grid, Q)
end
Figure 5.13: Simulated employment (ℓ𝑡) and productivity (𝑍𝑡) paths in the hiring model
Efficient solution methods are essential in structural estimation because the un-
derlying dynamic program must be solved repeatedly in order to search the param-
eter space for a good fit to data. Moreover, these dynamic programs are often high-
dimensional, due to shocks to preferences and other random variables that the agents
inside the model are assumed to see but that the econometrician does not. When
these shocks are persistent, the dimension of the state grows.³
²Rational expectations econometrics was a response to that Critique. While early work on rational expectations originated from the macroeconomics community (e.g. Hansen and Sargent (1980), Hansen and Sargent (1990)), many of their examples were actually about industrial organization and other microeconomic models. This work was part of a broad process that erased many boundaries between micro and macro theory.
In order to maintain focus on dynamic programming, we will not describe the de-
tails of the estimation step required for structural estimation (although §5.4 contains
references for those who wish to learn about that). Instead, we focus on the kinds of
dynamic programs treated in structural estimation and techniques for solving them
efficiently.
5.3.1.2 An Illustration
Let us look at an example of a dynamic program with preference shocks used in struc-
tural estimation, which is taken from a study of labor supply by married women
(Keane et al., 2011). The husband of the decision maker, a married woman, is al-
ready working. The couple has young children and the mother is deciding whether
to work. Her utility function is
𝑢(𝑐, 𝑑, 𝜉) = 𝑐 + (𝛼𝑛 + 𝜉)(1 − 𝑑),
where 𝑐 is consumption, 𝑑 ∈ {0, 1} indicates whether she works, 𝑛 is the number of young children, and 𝜉 is a preference shock. Consumption obeys
𝑐𝑡 = 𝑓𝑡 + 𝑤𝑡 𝑑𝑡 − 𝜋𝑛𝑑𝑡,
where 𝑓𝑡 is the father’s income, 𝑤𝑡 is the mother’s wage and 𝜋 is the cost of child care.
Wages depend on human capital ℎ𝑡, which increases with experience. In particular, 𝑤𝑡 = 𝛾ℎ𝑡 + 𝜂𝑡, where 𝜂𝑡 is a transient wage shock, while human capital evolves according to ℎ𝑡+1 = ℎ𝑡 + 𝑑𝑡, rising by one unit in each period
in which she works. See Keane et al. (2011) for further discussion.
With the current reward defined as
𝑟(𝑓, ℎ, 𝜉, 𝜂, 𝑑) ≔ 𝑓 + (𝛾ℎ + 𝜂)𝑑 − 𝜋𝑛𝑑 + (𝛼𝑛 + 𝜉)(1 − 𝑑),
the problem of maximizing expected discounted utility is an MDP with the Bellman equation
𝑣(𝑓, ℎ, 𝜉, 𝜂) = max_𝑑 { 𝑟(𝑓, ℎ, 𝜉, 𝜂, 𝑑) + 𝛽 Σ_{𝑓′,𝜉′,𝜂′} 𝑣(𝑓′, ℎ + 𝑑, 𝜉′, 𝜂′) 𝐹(𝑓, 𝑓′) 𝜑(𝜉′, 𝜂′) }.
While we can proceed directly with a technique such as VFI to obtain optimal
choices, we can simplify.
One way is by reducing the number of states. A hint comes from looking at the
expected value function
𝑔(𝑓, ℎ, 𝑑) ≔ Σ_{𝑓′,𝜉′,𝜂′} 𝑣(𝑓′, ℎ + 𝑑, 𝜉′, 𝜂′) 𝐹(𝑓, 𝑓′) 𝜑(𝜉′, 𝜂′).
This function depends only on three arguments and, moreover, the choice variable 𝑑
is binary. Hence we can break 𝑔 down into two functions 𝑔 ( 𝑓 , ℎ, 0) and 𝑔 ( 𝑓 , ℎ, 1), each
of which depends only on the pair ( 𝑓 , ℎ). These functions are substantially simpler
than 𝑣 when the domain of ( 𝜉, 𝜂) is large. Hence, it is natural to consider whether we
can solve our problem using 𝑔 rather than 𝑣.
Rather than address this question within the context of the preceding model, let’s shift
to a generic version of the dynamic program used in structural estimation and how it
can be solved using expected value methods. Our generic version takes the form
𝑣(𝑦, 𝜀) = max_{𝑎∈Γ(𝑦)} { 𝑟(𝑦, 𝜀, 𝑎) + 𝛽 Σ_{𝑦′} ∫ 𝑣(𝑦′, 𝜀′) 𝑃(𝑦, 𝑎, 𝑦′) 𝜑(𝜀′) d𝜀′ },   (5.33)
with expected value function 𝑔(𝑦, 𝑎) ≔ Σ_{𝑦′} ∫ 𝑣(𝑦′, 𝜀′) 𝑃(𝑦, 𝑎, 𝑦′) 𝜑(𝜀′) d𝜀′.
There are several potential advantages associated with working with 𝑔 rather than 𝑣.
One is that the set of actions A can be much smaller than the set of states that would be
created by discretization of the preference shock space E (especially if 𝜀𝑡 takes values
in a high-dimensional space). Another is that the integral provides smoothing, so that
𝑔 is typically a smooth function. This can accelerate structural estimation procedures.
To exploit the relative simplicity of the expected value function, we rewrite the Bell-
man equation (5.33) as
𝑣 ( 𝑦, 𝜀) = max {𝑟 ( 𝑦, 𝜀, 𝑎) + 𝛽𝑔 ( 𝑦, 𝑎)} .
𝑎∈ Γ ( 𝑦 )
To solve this functional equation we introduce the expected value Bellman op-
erator 𝑅 defined at 𝑔 ∈ RG by
(𝑅𝑔)(𝑦, 𝑎) = Σ_{𝑦′} ∫ max_{𝑎′∈Γ(𝑦′)} {𝑟(𝑦′, 𝜀′, 𝑎′) + 𝛽𝑔(𝑦′, 𝑎′)} 𝜑(𝜀′) d𝜀′ 𝑃(𝑦, 𝑎, 𝑦′).   (5.35)
We postpone proving Proposition 5.3.1 until §5.3.5, where we prove a more gen-
eral result.
Example 5.3.2. In the labor supply problem in §5.3.1.2, the expected value Bellman operator becomes
(𝑅𝑔)(𝑓, ℎ, 𝑑) = Σ_{𝑓′,𝜉′,𝜂′} max_{𝑑′} {𝑟(𝑓′, ℎ + 𝑑, 𝜉′, 𝜂′, 𝑑′) + 𝛽𝑔(𝑓′, ℎ + 𝑑, 𝑑′)} 𝐹(𝑓, 𝑓′) 𝜑(𝜉′, 𝜂′).
Once the fixed point 𝑔∗ of this operator has been computed, an optimal work decision is obtained via
𝜎∗(𝑓, ℎ, 𝜉, 𝜂) ∈ argmax_𝑑 {𝑟(𝑓, ℎ, 𝜉, 𝜂, 𝑑) + 𝛽𝑔∗(𝑓, ℎ, 𝑑)}.
The Gumbel distribution has the following useful stability property, a proof of
which can be found in Huijben et al. (2022).
Lemma 5.3.2. If 𝑍1, . . . , 𝑍𝑘 ∼IID 𝐺(0) and 𝑐1, . . . , 𝑐𝑘 are real numbers, then
max_{1⩽𝑖⩽𝑘} (𝑍𝑖 + 𝑐𝑖) ∼ 𝐺( −𝛾 + ln [ Σ_{𝑖=1}^{𝑘} exp(𝑐𝑖) ] ).
To exploit Lemma 5.3.2, we continue the discussion in the previous section but
assume now that A = {𝑎1, . . . , 𝑎𝑘}, that Γ(𝑦′) = A for all 𝑦′ (so that actions are unrestricted),
that 𝜀′ in (5.35) is additive in rewards and indexed by actions, so that
𝑟(𝑦′, 𝜀′, 𝑎′) = 𝑟(𝑦′, 𝑎′) + 𝜀′(𝑎′) for all feasible (𝑦′, 𝑎′), and that, conditional on 𝑦′, the
vector (𝜀(𝑎1), . . . , 𝜀(𝑎𝑘)) consists of 𝑘 independent 𝐺(0) shocks. Thus, each feasible
choice returns a reward perturbed by an independent Gumbel shock.
From these assumptions and Lemma 5.3.2, the term inside the integral in (5.35)
satisfies
This operator is convenient because the absence of a max operator permits fast eval-
uation. Notice also that 𝑅 is smooth in 𝑔, which suggests that we can use gradient
information to compute its fixed points.
Notice how the Gumbel max trick that exploits Lemma 5.3.2 depends crucially
on the expected value formulation of the Bellman equation, rather than the standard
formulation (5.33). This is because the expected value formulation puts the max
inside the expectation operator, unlike the standard formulation, where the max is on
the outside.
Variations of the Gumbel max trick have many uses in structural econometrics
(see §5.4).
We modify the §5.2.2 optimal savings problem by replacing a constant gross rate of
interest 𝑅 by an IID sequence ( 𝜂𝑡 )𝑡⩾0 with common distribution 𝜑 on finite set E. So the
consumer faces a fluctuating rate of return on financial wealth. In each period 𝑡, the
consumer knows 𝜂𝑡 , the gross rate of interest between 𝑡 and 𝑡 + 1, before deciding how
much to consume and how much to save. Other aspects of the problem are unchanged.
We have two motivations. One is computational, namely, to illustrate how framing
a decision in terms of expected values can reduce dimensionality, analogous to the
results in §5.3.1.4. The other is to generate a more realistic wealth distribution than
that generated by the §5.2.2.4 optimal savings model.
With stochastic returns on wealth, the Bellman equation becomes
𝑣(𝑤, 𝑦, 𝜂) = max_{𝑤′ ⩽ 𝜂(𝑤+𝑦)} { 𝑢(𝑤 + 𝑦 − 𝑤′/𝜂) + 𝛽 Σ_{𝑦′,𝜂′} 𝑣(𝑤′, 𝑦′, 𝜂′) 𝑄(𝑦, 𝑦′) 𝜑(𝜂′) }.
Both 𝑤 and 𝑤′ are constrained to a finite set W ⊂ R+. The expected value function
can be expressed as
𝑔(𝑦, 𝑤′) ≔ Σ_{𝑦′,𝜂′} 𝑣(𝑤′, 𝑦′, 𝜂′) 𝑄(𝑦, 𝑦′) 𝜑(𝜂′).   (5.37)
In the remainder of this section, we will say that a savings policy 𝜎 is 𝑔-greedy if
𝜎(𝑦, 𝑤, 𝜂) ∈ argmax_{𝑤′ ⩽ 𝜂(𝑤+𝑦)} { 𝑢(𝑤 + 𝑦 − 𝑤′/𝜂) + 𝛽𝑔(𝑦, 𝑤′) }.
Since it is an MDP, we can see immediately that if we replace 𝑣 in (5.37) with the
value function 𝑣∗ , then a 𝑔-greedy policy will be an optimal one.
Using manipulations analogous to those we used in §5.3.1.4, we can rewrite the
Bellman equation in terms of expected value functions via
𝑔(𝑦, 𝑤′) = Σ_{𝑦′,𝜂′} max_{𝑤′′ ⩽ 𝜂′(𝑤′+𝑦′)} { 𝑢(𝑤′ + 𝑦′ − 𝑤′′/𝜂′) + 𝛽𝑔(𝑦′, 𝑤′′) } 𝑄(𝑦, 𝑦′) 𝜑(𝜂′).
From here we could proceed by introducing an expected value Bellman operator analogous
to 𝑅 in (5.35), proving it to be a contraction map and then showing that greedy
policies taken with respect to the fixed point are optimal. All of this can be accom-
plished without too much difficulty – we prove more general results in §5.3.5.
However, we also know that optimistic policy iteration (OPI) is, in general, more
efficient than value function iteration. This motivates us to introduce the modified
𝜎-value operator
(𝑅𝜎 𝑔)(𝑦, 𝑤′) = Σ_{𝑦′,𝜂′} { 𝑢(𝑤′ + 𝑦′ − 𝜎(𝑤′, 𝑦′, 𝜂′)/𝜂′) + 𝛽𝑔(𝑦′, 𝜎(𝑤′, 𝑦′, 𝜂′)) } 𝑄(𝑦, 𝑦′) 𝜑(𝜂′).
This is a modification of the regular 𝜎-value operator 𝑇𝜎 that makes it act on expected
value functions.
A suitably modified OPI routine that is adapted from the regular OPI algorithm in
§5.1.4.4 can be found in Algorithm 5.5 on page 177. The routine is convergent. We
discuss this in greater detail in §5.3.5.
Figure 5.14 shows a histogram of a long wealth time series that parallels Fig-
ure 5.10 on page 155. The only significant difference is the switch to stochastic returns
(as described above). Parameters are as in Listing 20. Now the wealth distribution has
a more realistic long right tail (a few observations are in the far right tail, although
they are difficult to see). The Gini coefficient is 0.72, which is closer to typical country
values recorded in World Bank data (but still lower than the US). In essence, this oc-
curs because return shocks have multiplicative rather than additive effects on wealth,
so a sequence of high draws compounds to make wealth grow fast.
Figure 5.14: Histogram of a long wealth time series with stochastic returns (Gini = 0.72)
EXERCISE 5.3.3. Consider a version of the optimal savings problem from §5.2.2
where labor income has both persistent and transient components. In particular, as-
sume that 𝑌𝑡 = 𝑍𝑡 + 𝜀𝑡 for all 𝑡 , where ( 𝜀𝑡 )𝑡⩾0 is IID with common distribution 𝜑 on E,
while ( 𝑍𝑡 )𝑡⩾0 is 𝑄 -Markov on Z. Such a specification of labor income can capture how
households should react differently to transient and “permanent” shocks (see §5.4
for more discussion). Following the pattern developed for the savings model with
stochastic returns on wealth, write down both the Bellman equation and the Bellman
equation in terms of expected value functions.
5.3.4 Q-Factors
𝑄-factors assign values to state-action pairs. They set the stage for 𝑄-learning, an
application of reinforcement learning in which 𝑄-factors are learned via stochastic
approximation, a recursive estimation technique. Under special
conditions 𝑄-learning eventually learns optimal 𝑄-factors for a finite MDP.
𝑄 -learning is connected to the topic of this chapter because it relies on a Bellman
operator for the 𝑄 -factor. We discuss that Bellman operator, but we don’t discuss
𝑄 -learning here.
To begin, we fix an MDP (Γ, 𝛽, 𝑟, 𝑃) with state space X and action space A. For each
𝑣 ∈ RX, the corresponding 𝑄-factor is the function 𝑞 on G defined by 𝑞(𝑥, 𝑎) ≔ 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′).
We can convert the Bellman equation into an equation in 𝑄 -factors by observing that,
given such a 𝑞, the Bellman equation can be written as 𝑣 ( 𝑥 ) = max 𝑎∈Γ ( 𝑥 ) 𝑞 ( 𝑥, 𝑎). Taking
the mean and discounting on both sides of this equation gives
𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) = 𝛽 Σ_{𝑥′} max_{𝑎′∈Γ(𝑥′)} 𝑞(𝑥′, 𝑎′) 𝑃(𝑥, 𝑎, 𝑥′).
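A small Julia sketch of the resulting 𝑄-factor Bellman operator, in the same hypothetical array format as the earlier sketches, is:
function q_factor_bellman(q, r, P, feasible, β)
    n, m = size(r)
    # maximal feasible Q-factor at each successor state x′
    q_max = [maximum(q[x′, a′] for a′ in 1:m if feasible[x′, a′]) for x′ in 1:n]
    Sq = similar(q)
    for x in 1:n, a in 1:m
        Sq[x, a] = r[x, a] + β * sum(P[x, a, x′] * q_max[x′] for x′ in 1:n)
    end
    return Sq
end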
Enthusiastic readers might like to try to prove Proposition 5.3.4 directly. We defer
the proof until §5.3.5, where a more general result is obtained.
Our study of structural estimation in §5.3.1, optimal savings in §5.3.3 and 𝑄 -factors
in §5.3.4 all involved manipulations of the Bellman and policy operators that pre-
sented alternative perspectives on the respective optimization problems. Rather than
offering additional applications that apply such ideas, we now develop a general theo-
retical framework from which to understand manipulations of the Bellman and policy
operators for general MDPs. The framework clarifies when and how these techniques
can be applied.
Fix an MDP ( Γ, 𝛽, 𝑟, 𝑃 ) with state space X and action space A. As usual, Σ is the set of
feasible policies, G is the set of feasible state, action pairs, 𝑇 is the Bellman operator
and 𝑣∗ denotes the value function. Our first step is to decompose 𝑇 . To do this we
introduce three auxiliary operators:
• 𝐸 : RX → RG defined by (𝐸𝑣)(𝑥, 𝑎) = Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′),
• 𝐷 : RG → RG defined by ( 𝐷𝑔 )( 𝑥, 𝑎) = 𝑟 ( 𝑥, 𝑎) + 𝛽𝑔 ( 𝑥, 𝑎) and
• 𝑀 : RG → RX defined by ( 𝑀𝑞) ( 𝑥 ) = max 𝑎∈Γ ( 𝑥 ) 𝑞 ( 𝑥, 𝑎).
Evidently the action of the Bellman operator 𝑇 on a given 𝑣 ∈ RX is the composition
of these three steps:
(i) take conditional expectations given ( 𝑥, 𝑎) ∈ G (applying 𝐸),
(ii) discount and add current rewards (applying 𝐷), and
(iii) maximize with respect to current action (applying 𝑀 ).
As a result, we can write 𝑇 = 𝑀 𝐷𝐸 ≔ 𝑀 ◦ 𝐷 ◦ 𝐸 (apply 𝐸 first, 𝐷 second, and 𝑀 third).
This decomposition is visualized in Figure 5.15. The action of 𝑇 is a round trip from
the top node, which is the set of value functions.
If we stare at Figure 5.15, we can imagine two other round trips. One is a round
trip from the set of expected value functions, obtained by the sequence 𝐸𝑀 𝐷. The
other is a round trip from the set of 𝑄 -factors, obtained by the sequence 𝐷𝐸𝑀 . Let’s
name these additional round trips 𝑅 and 𝑆 respectively, so that, collecting all three,
𝑅 ≔ 𝐸𝑀𝐷,   𝑆 ≔ 𝐷𝐸𝑀,   and   𝑇 = 𝑀𝐷𝐸.   (5.40)
Both 𝑅 and 𝑆 act on functions in RG . The next exercise provides an explicit represen-
tation of these operators.
Figure 5.15: The maps 𝐸, 𝐷 and 𝑀, with RX (value functions) at the top node and RG (expected value functions and 𝑄-factors) below
Let’s connect our “refactored” Bellman operators 𝑅 and 𝑆 to our preceding exam-
ples. Inspection of (5.39) confirms that 𝑆 is exactly the 𝑄 -factor Bellman operator.
In addition, 𝑅 is a general version of the expected value Bellman operator defined in
(5.35).
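In array form (again assuming r is n × m, P is n × m × n and feasible is a Boolean mask), the three operators and the compositions above can be sketched in Julia as:
E_op(v, P)        = [sum(P[x, a, x′] * v[x′] for x′ in axes(P, 3)) for x in axes(P, 1), a in axes(P, 2)]
D_op(g, r, β)     = r .+ β .* g
M_op(q, feasible) = [maximum(q[x, a] for a in axes(q, 2) if feasible[x, a]) for x in axes(q, 1)]

T_op(v, r, P, feasible, β) = M_op(D_op(E_op(v, P), r, β), feasible)   # T = M ∘ D ∘ E
R_op(g, r, P, feasible, β) = E_op(M_op(D_op(g, r, β), feasible), P)   # R = E ∘ M ∘ D
S_op(q, r, P, feasible, β) = D_op(E_op(M_op(q, feasible), P), r, β)   # S = D ∘ E ∘ M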
EXERCISE 5.3.6. Show that, for all 𝑘 ∈ N, the following relationships hold:
𝑇^𝑘 = 𝑀𝐷𝑅^{𝑘−1}𝐸,   𝑅^𝑘 = 𝐸𝑇^{𝑘−1}𝑀𝐷,   and   𝑆^𝑘 = 𝐷𝐸𝑇^{𝑘−1}𝑀.
While the equalities in Exercise 5.3.6 can be proved by induction via the logic
revealed by (5.40), the intuition is straightforward from Figure 5.15. For example,
the relationship 𝑅 𝑘 = 𝐸𝑇 𝑘−1 𝑀 𝐷 states that round-tripping 𝑘 times from the space of
expected values (EV function space) is the same as shifting to value function space by
applying 𝑀 𝐷, round-tripping 𝑘 − 1 times using 𝑇 , and then shifting one more step to
EV function space via 𝐸.
Although the relationships in Exercise 5.3.6 are easy to prove, they are already
useful. For example, suppose that in a computational setting 𝑅 is easier to iterate
than 𝑇 . Then to iterate with 𝑇 𝑘 times, we can instead use 𝑇 𝑘 = 𝑀 𝐷𝑅 𝑘−1 𝐸: We apply 𝐸
once, 𝑅 𝑘 − 1 times, and 𝑀 and 𝐷 once each. If 𝑘 is large, this might be more efficient.
In the next exercise and the next section, we let ‖·‖ ≔ ‖·‖∞, the supremum norm
on either RX or RG.
Lemma 5.3.5. The operators 𝑅, 𝑆 and 𝑇 are all contraction maps of modulus 𝛽 under
the supremum norm.
Proof. That 𝑇 is a contraction of modulus 𝛽 was proved in Proposition 5.1.1, on page 138.
We can prove this more easily now by applying Exercise 5.3.7, which, for arbitrary
𝑣, 𝑣′ ∈ RX, gives
In the next section we clarify relationships between these operators and prove
Propositions 5.3.1 and 5.3.4.
From Lemma 5.3.5 we see that 𝑅, 𝑆 and 𝑇 all have unique fixed points. We denote
them by 𝑔∗ , 𝑞∗ and 𝑣∗ respectively, so that
𝑅𝑔 ∗ = 𝑔 ∗ , 𝑆𝑞∗ = 𝑞∗ , and 𝑇 𝑣∗ = 𝑣∗ .
We already know that 𝑣∗ is the value function (Proposition 5.1.1). The results be-
low show that the other two fixed points are, like the value function, sufficient to
determine optimality.
Proposition 5.3.6. The fixed points of 𝑅, 𝑆 and 𝑇 are connected by the following rela-
tionships:
(i) 𝑔∗ = 𝐸𝑣∗ ,
(ii) 𝑞∗ = 𝐷𝑔 ∗ , and
(iii) 𝑣∗ = 𝑀𝑞∗ .
Proof. To prove (i), first observe that, in the notation of (5.40), we have 𝐸𝑣∗ = 𝐸𝑇 𝑣∗ =
𝐸𝑀 𝐷𝐸𝑣∗ = 𝑅𝐸𝑣∗ . Hence 𝐸𝑣∗ is a fixed point of 𝑅. But 𝑅 has only one fixed point, which
is 𝑔∗ . Therefore, 𝑔∗ = 𝐸𝑣∗ . The proofs of (ii) and (iii) are analogous. □
Proof. To see that (i) implies (ii), suppose that 𝜎 is 𝑣-greedy when 𝑣 = 𝑣∗ . Then for
arbitrary 𝑥 ∈ X
𝜎(𝑥) ∈ argmax_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑣∗(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) } = argmax_{𝑎∈Γ(𝑥)} {𝑟(𝑥, 𝑎) + 𝛽𝑔∗(𝑥, 𝑎)}.
Hence 𝜎 is 𝑔-greedy when 𝑔 = 𝑔∗ , and (i) implies (ii). The proofs of the remaining
equivalences (ii) =⇒ (iii) =⇒ (i) are similar. The claim that 𝜎 is optimal if and only
if any one of (i)–(iii) holds now follows from Proposition 5.1.1, which asserts that 𝜎
is optimal if and only if 𝜎 is 𝑣∗ -greedy. □
(i) Fix 𝑔 ∈ RG .
(ii) Iterate with 𝑅 to obtain 𝑔𝑘 ≔ 𝑅 𝑘 𝑔 ≈ 𝑔∗ .
(iii) Compute a 𝑔𝑘 -greedy policy.
Earlier in this chapter we found that VFI is often outperformed by HPI or OPI. Our next step
is to apply these methods to modified versions of the Bellman equation, as discussed
in the previous section. This allows us to combine advantages of HPI/OPI with the
potential efficiency gains obtained by refactoring the Bellman equation.
We illustrate these ideas below by producing a version of OPI that can compute
𝑄 -factors and expected value functions. (The same is true for HPI, although we leave
details of that construction to interested readers.)
To begin, we introduce a new operator, denoted 𝑀𝜎 , that, for fixed 𝜎 ∈ Σ and
𝑞∈ RG , produces
( 𝑀𝜎 𝑞)( 𝑥 ) ≔ 𝑞 ( 𝑥, 𝜎 ( 𝑥 )) ( 𝑥 ∈ X) .
This operator is the policy analog of the maximization operator 𝑀 defined by ( 𝑀𝑞)( 𝑥 ) =
max 𝑎∈Γ ( 𝑥 ) 𝑞 ( 𝑥, 𝑎) in §5.3.5.1. Analogous to (5.40), we set
𝑅𝜎 ≔ 𝐸 𝑀𝜎 𝐷, 𝑆 𝜎 ≔ 𝐷 𝐸 𝑀𝜎 , 𝑇𝜎 ≔ 𝑀𝜎 𝐷 𝐸.
You can verify that 𝑇𝜎 is the ordinary 𝜎-policy operator (defined in (5.19)). The op-
erators 𝑅𝜎 and 𝑆𝜎 are the expected value and 𝑄 -factor equivalents.
Let’s now show that OPI can be successfully modified via these alternative oper-
ators. We will focus on the expected value viewpoint (value functions are replaced
by expected value functions), which is often helpful in the applications we wish to
consider.
Our modified OPI routine is given in Algorithm 5.5. It makes the obvious modifi-
cations to regular OPI, switching to working with expected value functions in RG and
from iteration with 𝑇𝜎 to iteration with 𝑅𝜎 .
Algorithm 5.5 is globally convergent in the same sense as regular OPI (Algo-
rithm 5.4 on page 145). In fact, if we pick 𝑣0 ∈ RX and apply regular OPI with
this initial condition, as well as applying Algorithm 5.5 with initial condition 𝑔0 ≔ 𝐸𝑣0 ,
then the sequences ( 𝑣𝑘 ) 𝑘⩾0 and ( 𝑔𝑘 ) 𝑘⩾0 generated by the two algorithms are connected
via 𝑔𝑘 = 𝐸𝑣𝑘 for all 𝑘 ⩾ 0. If greedy policies are unique, then it is also true that the
policy sequences generated by the two algorithms are identical.
Let’s prove these claims, assuming for convenience that greedy policies are unique.
Consider first the claim that 𝑔𝑘 = 𝐸𝑣𝑘 for all 𝑘 ⩾ 0. This is true by assumption when
𝑘 = 0. Suppose, as an induction hypothesis, that 𝑔𝑘 = 𝐸𝑣𝑘 holds at arbitrary 𝑘. Let 𝜎
be 𝑔𝑘 -greedy. Then
𝜎(𝑥) = argmax_{𝑎∈Γ(𝑥)} {𝑟(𝑥, 𝑎) + 𝛽𝑔𝑘(𝑥, 𝑎)} = argmax_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑣𝑘(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) },
where the second equality is implied by 𝑔𝑘 = 𝐸𝑣𝑘 . Hence 𝜎 is both 𝑔𝑘 -greedy and 𝑣𝑘 -
greedy and so is the next policy selected by both modified and regular OPI. Moreover,
updating via Algorithm 5.5 and applying (5.41), we have 𝑔𝑘+1 = 𝐸𝑣𝑘+1, which completes the induction.
5.4 Chapter Notes
Detailed treatment of MDPs can be found in books by Bellman (1957), Howard (1960),
Denardo (1981), Puterman (2005), Bertsekas (2012), Hernández-Lerma and Lasserre
(2012a, 2012b), and Kochenderfer et al. (2022). The books by Hernández-Lerma and
Lasserre (2012a, 2012b) provide excellent coverage of theory, while Puterman (2005)
gives a clear and detailed exposition of algorithms and techniques. Further discussion
of the connection between HPI and Newton iteration can be found in Section 6.4 of
Puterman (2005).
HPI is routinely used in artificial intelligence applications, including during the
training of AlphaZero by DeepMind. Further discussion of these variants of HPI and
their connection to Newton iteration can be found in Bertsekas (2021) and Bertsekas
(2022a).
There are several methods available for accelerating value function iteration,
including asynchronous VFI and Anderson acceleration. Due to space constraints,
we omit discussion of these topics. Interested readers can find a treatment of asyn-
chronous VFI in Bertsekas (2022b). For discussion of Anderson acceleration see, for
example, Walker and Ni (2011) or Geist and Scherrer (2018). First order methods
for accelerating VFI are presented in Goyal and Grand-Clement (2023).
Other methods for computing solutions to MDPs include the linear programming
(LP) approach and the policy gradient technique, both of which solve a problem of
the form
max_{𝜎∈Σ} Σ_𝑥 𝑤(𝑥) 𝑣(𝑥)   s.t.   𝑣 = 𝑟𝜎 + 𝛽𝑃𝜎 𝑣   (5.42)
for some chosen weight function 𝑤. The LP approach views (5.42) as a linear program
and applies various algorithms to the primal and dual problems. See, for example,
Puterman (2005) or Ying and Zhu (2020).
The policy gradient method involves approximating 𝜎 and 𝑣 in (5.42) using smooth
functions with finitely many parameters. These parameters are then adjusted via
some version of gradient ascent. A recent trend for high-dimensional MDPs is to ap-
proximate the value and policy functions with neural nets. An early exposition can
be found in Bertsekas and Tsitsiklis (1996). A more recent monograph is Bertsekas
(2021). For research along these lines in the context of economic applications see,
for example, Maliar et al. (2021), Hill et al. (2021), Han et al. (2021), Kahou et al.
(2021), Kase et al. (2022), and Azinovic et al. (2022).
In some versions of these algorithms, as well as in VFI and HPI, the expectations
associated with dynamic programs are computed using Monte Carlo sampling meth-
ods. See, for example, Rust (1997), Powell (2007), and Bertsekas (2021). Sidford
et al. (2023) combine linear programming and sampling approaches.
The optimal savings problem is a workhorse in macroeconomics and has been
treated extensively in the literature. Early references include Brock and Mirman
(1972), Mirman and Zilcha (1975), Schechtman (1976), Deaton and Laroque (1992),
and Carroll (1997). For more recent studies, see, for example, Li and Stachurski
(2014), Açıkgöz (2018), Light (2018), Lehrer and Light (2018), or Ma et al. (2020).
Recent applications involving optimal savings in a representative agent framework
include Bianchi (2011), Paciello and Wiederholt (2014), Rendahl (2016), Heathcote
and Perri (2018), Paroussos et al. (2019), Erosa and González (2019), Herrendorf
et al. (2021), and Michelacci et al. (2022). For more on the long right tail of the
wealth distribution (as discussed in §5.3.3), see, for example, Benhabib et al. (2015),
Krueger et al. (2016), or Stachurski and Toda (2019).
Households solving optimal savings problems are often embedded in heteroge-
neous agent models in order to study income distributions, wealth distributions, busi-
ness cycles and other macroeconomic phenomena. Representative examples include
Aiyagari (1994), Huggett (1993), Krusell and Smith (1998), Miao (2006), Algan
et al. (2014), Toda (2014), Benhabib et al. (2015), Stachurski and Toda (2019),
Toda (2019), Light (2020), Hubmer et al. (2020), or Cao (2020).
Exercise §5.3.3 considered optimal savings and consumption in the presence of
transient and persistent shocks to labor income. For research in this vein, see, for ex-
ample, Quah (1990), Carroll (2009), De Nardi et al. (2010), or Lettau and Ludvigson
(2014). For empirical work on labor income dynamics, see, for example, Newhouse
(2005), Guvenen (2007), Guvenen (2009), or Blundell et al. (2015). For analysis of
optimal savings in a very general setting, see Ma et al. (2020) or Ma and Toda (2021).
The optimal investment problem dates back to Lucas and Prescott (1971). Text-
book treatments can be found in Stokey and Lucas (1989) and Dixit and Pindyck
(2012). Sargent (1980) and Hayashi (1982) used optimal investment problems to
connect optimal capital accumulation with Tobin’s 𝑞 (which is the ratio between a
physical asset’s market value and its replacement value). Other influential papers in
the field include Lee and Shin (2000), Hassett and Hubbard (2002), Bloom et al.
(2007), Bond and Van Reenen (2007), Bloom (2009), and Wang and Wen (2012).
Carruth et al. (2000) contains a survey.
Classic papers about S-s inventory models include Arrow et al. (1951) and Dvoret-
zky et al. (1952). Optimality of S-s policies under certain conditions was first estab-
lished by Scarf (1960). Kelle and Milne (1999) study the impact of S-s inventory
policies on the supply chain, including connection to the “bullwhip” effect. The con-
nection between S-s inventory policies and macroeconomic fluctuations is studied in
Nirei (2006).
The model in Exercise 5.2.3 is loosely adapted from Bagliano and Bertola (2004).
Rust (1994) is a classic and highly readable reference in the area of structural esti-
mation of MDPs. Keane and Wolpin (1997) provides an influential study of the career
choices of young men. Roberts and Tybout (1997) analyze the decision to export in
the presence of sunk costs. Keane et al. (2011) provide an overview of structural esti-
mation applied to labor market problems. Gentry et al. (2018) review analysis of auc-
tions using structural estimation. Legrand (2019) surveys the use of structural models
to study the dynamics of commodity prices. Calsamiglia et al. (2020) use structural
estimation to study school choices. Iskhakov et al. (2020) provide a thoughtful dis-
cussion on the differences between structural estimation and machine learning. Luo
and Sang (2022) propose structural estimation via sieves.
Theoretical analysis of expected value functions in discrete choice models and
other settings can be found in Rust (1994), Norets (2010), Mogensen (2018) and
Kristensen et al. (2021). The expected value Gumbel max trick is due to Rust (1987)
and builds on work by McFadden (1974). The Gumbel max trick is also used in ma-
chine learning methods (see, e.g., Jang et al. (2016)).
In §5.3.4 we mentioned 𝑄 -learning, which was originally proposed by Watkins
(1989). Tsitsiklis (1994) and Melo (2001) studied convergence of 𝑄 -learning. In
related work, Esponda and Pouzo (2021) study Markov decision processes where dy-
namics are unknown, and where agents update their understanding of transition laws
via Bayesian updating.
The theory in §5.3.5 on optimality under modifications of the Bellman equation is
loosely based on Ma and Stachurski (2021). That paper considers arbitrary modifica-
tions in a very general setting.
Chapter 6
Stochastic Discounting
In this chapter we describe how to extend the MDP model to handle time-varying
discount factors, a specification now widely used in macroeconomics and finance.
6.1.1 Valuation
Our first step is to motivate and understand lifetime valuation when discount factors
vary over time.
6.1.1.1 Motivation
In §3.2.2.2 we discussed firm valuation in a setting where the interest rate is constant.
But data show that interest rates are time-varying, even for safe assets like US Treasury
bills. Figure 6.1 shows nominal interest rate on 1 Year Treasury bills since the 1950s,
while Figure 6.2 shows an estimate of the real interest rate for 10 year T-bills since
2012. Both nominal and real interest rates are evidently time varying.
Figure 6.1: Nominal interest rate on 1-year US Treasury bills
Figure 6.2: Estimated real interest rate on 10-year Treasury bills
Example 6.1.1. Consider a firm valuation problem where interest rates ( 𝑟𝑡 )𝑡⩾0 are
stochastic. The time zero expected present value of time 𝑡 profit 𝜋𝑡 is
E[𝛽1 · · · 𝛽𝑡 · 𝜋𝑡]   where 𝛽𝑡 ≔ 1/(1 + 𝑟𝑡).
Remark 6.1.1. Time-varying discount factors are found in extensions of the Section
§3.2.2.3 household consumption-saving problem that appear in modern models of
business cycle dynamics, asset prices, and wealth distributions. For just one important
example, see Krusell and Smith (1998). Marimon (1984) included random discount
factors in his thorough analysis of growth and turnpike properties of general equi-
librium models, unfortunately only parts of which were included in Marimon (1989).
Exogenous impatience shocks have been used as demand shocks in some dynamic
models. For more citations see §7.4.
6.1.1.2 Theory
The aim of this section is to understand and evaluate expressions such as (6.1).
Throughout,
The sequence (𝛽𝑡)𝑡⩾0 is called a discount factor process and ∏_{𝑖=0}^{𝑡} 𝛽𝑖 is the discount
factor for period 𝑡 payoffs evaluated at time zero. We are interested in expected
discounted sums of the form
𝑣(𝑥) ≔ E𝑥 Σ_{𝑡=0}^{∞} [ ∏_{𝑖=0}^{𝑡} 𝛽𝑖 ] ℎ(𝑋𝑡)   (𝑥 ∈ X).   (6.3)
𝐿 ( 𝑥, 𝑥 0) ≔ 𝑏 ( 𝑥, 𝑥 0) 𝑃 ( 𝑥, 𝑥 0) (6.4)
Theorem 6.1.1 generalizes Lemma 3.2.1 on page 94. Indeed, if 𝑏 ≡ 𝛽 ∈ (0, 1),
then 𝐿 = 𝛽𝑃 and 𝜌 ( 𝐿) = 𝛽𝜌 ( 𝑃 ) = 𝛽 < 1, so the result in Theorem 6.1.1 reduces to
Lemma 3.2.1.
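As a numerical illustration (with made-up two-state values for 𝑃, 𝑏 and ℎ), the valuation can be computed in Julia along the following lines:
using LinearAlgebra

P = [0.8 0.2; 0.3 0.7]            # Markov matrix for the state
b = [0.97 0.95; 0.99 0.96]        # state-dependent discounting b(x, x′)
h = [1.0, 2.0]                    # payoff function
L = b .* P                        # the operator in (6.4)
ρ_L = maximum(abs, eigvals(L))    # spectral radius; here ρ(L) < 1
v = (I - L) \ h                   # unique solution of v = h + Lv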
We establish (6.6) using induction on 𝑡 . It is easy to see that (6.6) holds at 𝑡 = 1. Now
suppose it holds at 𝑡 . We claim it also holds at 𝑡 + 1. To show this we fix ℎ ∈ RX and
set 𝛿𝑡 ≔ ∏_{𝑖=0}^{𝑡} 𝛽𝑖. Applying the law of iterated expectations (see §3.2.1.2) yields
EXERCISE 6.1.1. Consider Example 6.1.1 again but now assume that ( 𝑋𝑡 ) is 𝑃 -
Markov, 𝜋𝑡 = 𝜋 ( 𝑋𝑡 ), and 𝑟𝑡 = 𝑟 ( 𝑋𝑡 ) for some 𝑟, 𝜋 ∈ RX .1 The expected present value of
the firm given current state 𝑋0 = 𝑥 is
𝑣(𝑥) = E𝑥 Σ_{𝑡=0}^{∞} [ ∏_{𝑖=0}^{𝑡} 𝛽𝑖 ] 𝜋𝑡   (6.9)
Suggest a condition under which 𝑣 ( 𝑥 ) is finite and discuss how to compute it.
EXERCISE 6.1.2. Let X be partially ordered and assume 𝜌 ( 𝐿) < 1. Prove that 𝑣
is increasing on X whenever 𝑃 is monotone increasing, 𝜋 is increasing on X, and 𝑟 is
decreasing on X.
In Theorem 6.1.1 the condition 𝜌 ( 𝐿) < 1 drives stability. In this section we develop
necessary and sufficient conditions for 𝜌 ( 𝐿) < 1 to hold.
    𝜌(𝐿) = lim_{t→∞} ℓ𝑡^{1/t}    when    ℓ𝑡 ≔ max_{𝑥∈X} E𝑥 ∏_{i=0}^{t} 𝛽𝑖.    (6.10)
1 We are assuming that randomness in interest rates is a function of the same Markov state that
influences profits. There is very little loss of generality in making this assumption. In fact, the two
processes can still be statistically independent. For example, if we take 𝑋𝑡 to have the form 𝑋𝑡 = (𝑌𝑡 , 𝑍𝑡 ),
where (𝑌𝑡 ) and ( 𝑍𝑡 ) are independent Markov chains, then we can take 𝛽𝑡 to be a function of 𝑌𝑡 and 𝜋𝑡
to be a function of 𝑍𝑡 . The resulting interest and profit processes are statistically independent.
Since 𝟙 ≫ 0, Lemma 2.3.3 yields (6.10). For a proof of the second claim in Lemma 6.1.2, see Proposition 4.1 of Stachurski and Zhang (2021). □
The expression in (6.10) connects the spectral radius with the long run proper-
ties of the discount factor process. The connection becomes even simpler when 𝑃 is
irreducible, as the next exercise asks you to show.
Exercise 6.1.3 shows that the spectral radius is a long run (geometric) average of
the discount factor process. For the conclusions of Theorem 6.1.1 to hold, we need
this long run average to be less than unity.
Figure 6.3 illustrates the condition 𝜌(𝐿) < 1 when 𝛽𝑡 = 𝑋𝑡 and 𝑃 is a Markov matrix produced by discretization of the AR(1) process

    𝑋𝑡+1 = 𝜇(1 − 𝑎) + 𝑎𝑋𝑡 + 𝑠(1 − 𝑎²)^{1/2} 𝜀𝑡+1,    (𝜀𝑡) IID, 𝜀𝑡 ∼ 𝑁(0, 1).    (6.12)
The discussion in §3.1.3 tells us that the stationary distribution 𝜓∗ of (6.12) is nor-
mally distributed with mean 𝜇 and standard deviation 𝑠. The parameter 𝑎 controls
autocorrelation. In the figure we set 𝜇 to 0.96, which, since 𝛽𝑡 = 𝑋𝑡 , is the stationary
mean of the discount factor process. The parameters 𝑎 and 𝑠 are varied in the figure,
and the contour plot shows the corresponding value of 𝜌 ( 𝐿). The process (6.12) is
discretized via the Tauchen method with the size of the state space set to 6 (which
avoids negative values for 𝛽 ( 𝑥 )).
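To make the calculation concrete, here is a minimal Julia sketch (not the book's discount_spec_rad.jl; the function name and parameter defaults are ours) that builds 𝐿(𝑥, 𝑥′) = 𝛽(𝑥)𝑃(𝑥, 𝑥′) with 𝛽(𝑥) = 𝑥 from a Tauchen discretization of (6.12) and evaluates its spectral radius:

```julia
using LinearAlgebra, QuantEcon

# Spectral radius of L(x, x') = β(x) P(x, x') when β_t = X_t and P is a
# Tauchen discretization of (6.12).  Default values are illustrative only.
function compute_rho_L(; a=0.85, s=0.1, μ=0.96, n=6)
    mc = tauchen(n, a, s * sqrt(1 - a^2), μ * (1 - a))  # AR(1) in (6.12)
    x_vals, P = mc.state_values, mc.p
    L = x_vals .* P                  # scale row x of P by β(x) = x
    return maximum(abs, eigvals(L))  # spectral radius ρ(L)
end
```

Evaluating this function over a grid of (𝑎, 𝑠) pairs reproduces the qualitative pattern in Figure 6.3: 𝜌(𝐿) rises with autocorrelation and volatility.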
The figure shows that 𝜌(𝐿) tends to increase with both the volatility and the autocorrelation of the state process. This seems natural given the expression on the right hand side of (6.11): sequences of large values of 𝛽𝑖 compound in the product ∏_{i=0}^{t} 𝛽𝑖, pushing up the long run average value, and such sequences occur more often when autocorrelation and volatility are large.

[Figure 6.3: 𝜌(𝐿) for different values of (𝑎, 𝑠) (discount_spec_rad.jl)]
We finish this section with a lemma that simplifies computation of the spectral
radius in settings where the process ( 𝛽𝑡 ) depends only on a subset of the state variables
– a setting that is common in applications. In the statement of the lemma, the state
space X takes the form X = Y × Z. We fix 𝑄 ∈ M ( RZ ) and 𝑅 ∈ M ( RY ). The discount
operator 𝐿 is
Lemma 6.1.3. The operators 𝐿 and 𝐿Z obey 𝜌 ( 𝐿Z ) = 𝜌 ( 𝐿), where the first spectral radius
is taken in L( RX ) and the second is taken in L ( RZ ).
Taking the limit and using Lemma 6.1.2 gives 𝜌 ( 𝐿) = 𝜌 ( 𝐿Z ), where the first spectral
radius is taken in L ( RX ) and the second is taken in L( RZ ). □
Let
• 𝑉 = (0, ∞)^X, and
• 𝐿 be a positive linear operator on R^X.
Then the following statements are equivalent:
(i) 𝜌(𝐿) < 1.
(ii) The equation 𝑣 = ℎ + 𝐿𝑣 has a unique solution in 𝑉.
The next example illustrates Theorem 6.1.5 by proving a result similar to Exer-
cise 1.2.17 on page 22.
    ‖𝑇^𝑘 𝑢 − 𝑇^𝑘 𝑣‖ = ‖𝐴^𝑘 𝑢 − 𝐴^𝑘 𝑣‖ = ‖𝐴^𝑘 (𝑢 − 𝑣)‖ ⩽ ‖𝐴^𝑘‖ ‖𝑢 − 𝑣‖,

where the last step is by the submultiplicative property of the operator norm (page 16). Since 𝜌(𝐴) < 1, we can choose a 𝑘 ∈ N such that ‖𝐴^𝑘‖ < 1 (see Exercise 1.2.11). Hence
𝑇 is eventually contracting and Theorem 6.1.5 yields global stability. The unique fixed
point satisfies 𝑢 = 𝐴𝑢 + 𝑏 and, since 𝜌 ( 𝐴) < 1, we can use the Neumann series lemma
(page 18) to write it as 𝑢 = ( 𝐼 − 𝐴) −1 𝑏.
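As a quick numerical illustration of this example (the matrix and vector below are arbitrary choices of ours, not from the text), successive approximation on 𝑇𝑣 = 𝐴𝑣 + 𝑏 converges to the Neumann series solution:

```julia
using LinearAlgebra

# Iterate v ↦ A v + b and compare with (I - A)⁻¹ b.
function check_affine_example()
    A = [0.5 0.4; 0.3 0.6]                 # ρ(A) = 0.9 < 1
    b = [1.0, 2.0]
    v = zeros(2)
    for _ in 1:500
        v = A * v + b                      # successive approximation
    end
    return maximum(abs, v - (I - A) \ b)   # distance from the fixed point (≈ 0)
end
```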
Example 6.1.2 illustrates the connection between Theorem 6.1.5 and the Neu-
mann series lemma. Theorem 6.1.5 is more general because it can be applied in
nonlinear settings. But the Neumann series lemma remains important because, when
applicable, it provides inverse and power series representations of the fixed point.
On one hand, if 𝑇 is a contraction map on 𝑈 ⊂ RX with respect to a given norm
k · k 𝑎 , we cannot necessarily say that 𝑇 is a contraction with respect to some other norm
k · k 𝑏 on RX . On the other hand, if 𝑇 is an eventual contraction on 𝑈 with respect to
some given norm on RX , then 𝑇 is eventually contracting with respect to every norm
on RX . The next exercise asks you to verify this.
The following sufficient condition for eventual contractivity will be helpful when we
study dynamic programs with state-dependent discounting.
    |𝑇𝑣 − 𝑇𝑤| ⩽ 𝐿 |𝑣 − 𝑤|    (6.13)

    |𝑇^𝑘 𝑣 − 𝑇^𝑘 𝑤| ⩽ 𝐿^𝑘 |𝑣 − 𝑤|.

    ‖𝑇^𝑘 𝑣 − 𝑇^𝑘 𝑤‖ ⩽ ‖𝐿^𝑘 |𝑣 − 𝑤|‖ ⩽ ‖𝐿^𝑘‖ ‖𝑣 − 𝑤‖,
Proof. Fix 𝑣, 𝑤 ∈ 𝑈 and let 𝑇 and 𝐿 be as in the statement of the proposition. By the
assumed properties on 𝑇 , we have
𝑇 𝑣 = 𝑇 ( 𝑣 + 𝑤 − 𝑤) ⩽ 𝑇 ( 𝑤 + | 𝑣 − 𝑤 |) ⩽ 𝑇𝑤 + 𝐿 | 𝑣 − 𝑤 | .
We can now turn to dynamic programs in which the objective is to maximize a lifetime
value in the presence of state-dependent discounting. First we present an extension
of the MDP model from Chapter 5 that admits state-dependent discounting. Then we
provide weak conditions under which optimal policies exist and Bellman’s principle
of optimality holds.
6.2.1.1 Setup
where 𝑥 ∈ X and 𝑣 ∈ RX . Notice that the discount factor depends on all relevant
information: the current action, the current state and the stochastically determined
next period state.
Let Σ be the set of all feasible policies, defined as for regular MDPs. The policy oper-
ator 𝑇𝜎 corresponding to 𝜎 ∈ Σ is represented by
    (𝑇𝜎 𝑣)(𝑥) = 𝑟(𝑥, 𝜎(𝑥)) + Σ_{𝑥′} 𝑣(𝑥′) 𝛽(𝑥, 𝜎(𝑥), 𝑥′) 𝑃(𝑥, 𝜎(𝑥), 𝑥′).    (6.16)

In what follows it is convenient to define 𝐿𝜎 ∈ L(R^X) via

    𝐿𝜎(𝑥, 𝑥′) ≔ 𝛽(𝑥, 𝜎(𝑥), 𝑥′) 𝑃(𝑥, 𝜎(𝑥), 𝑥′).    (6.17)
Notice that we can now write (6.16) as 𝑇𝜎 𝑣 = 𝑟𝜎 + 𝐿𝜎 𝑣. In line with our discussion of
MDPs in Chapter 5, when 𝑇𝜎 has a unique fixed point we denote it by 𝑣𝜎 and interpret
it as lifetime value.
Lemma 6.2.1. If Assumption 6.2.1 holds, then, for each 𝜎 ∈ Σ, the linear operator 𝐼 − 𝐿𝜎
is invertible and, in RX , the policy operator 𝑇𝜎 has a unique fixed point
𝑣𝜎 = ( 𝐼 − 𝐿𝜎 ) −1 𝑟𝜎 . (6.18)
Proof. Fix 𝜎 ∈ Σ. By the Neumann series lemma, 𝐼 − 𝐿𝜎 is invertible. Any fixed point
of 𝑇𝜎 obeys 𝑣 = 𝑟𝜎 + 𝐿𝜎 𝑣, which, given invertibility of 𝐼 − 𝐿𝜎 , is equivalent to (6.18). □
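In code, (6.18) is a single linear solve. The following Julia sketch (the function name and array layout are ours, not the book's) computes 𝑣𝜎 from arrays representing 𝑟(𝑥, 𝑎), 𝛽(𝑥, 𝑎, 𝑥′) and 𝑃(𝑥, 𝑎, 𝑥′):

```julia
using LinearAlgebra

# Lifetime value v_σ = (I - L_σ)⁻¹ r_σ of a policy σ (a vector of action
# indices), given arrays r[x, a], β[x, a, x′] and P[x, a, x′].
function policy_value(r, β, P, σ)
    n = length(σ)
    r_σ = [r[x, σ[x]] for x in 1:n]
    L_σ = [β[x, σ[x], y] * P[x, σ[x], y] for x in 1:n, y in 1:n]
    @assert maximum(abs, eigvals(L_σ)) < 1  # ρ(L_σ) < 1, in the spirit of Assumption 6.2.1
    return (I - L_σ) \ r_σ
end
```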
EXERCISE 6.2.2. Show that, under Assumption 6.2.1, the operator 𝑇𝜎 is globally
stable on RX .
EXERCISE 6.2.3. Show that Assumption 6.2.1 holds whenever there exists an 𝐿 ∈
L( RX ) such that 𝜌 ( 𝐿) < 1 and
6.2.1.3 Optimality
where 𝑥 ∈ X and 𝑣 ∈ RX .
Given 𝑣 ∈ RX , a policy 𝜎 is called 𝑣-greedy if 𝜎 ( 𝑥 ) is a maximizer of the right-hand
side of (6.21) for all 𝑥 in X. Equivalently, 𝜎 is 𝑣-greedy whenever 𝑇𝜎 𝑣 = 𝑇 𝑣.
When Assumption 6.2.1 holds and, as a result, 𝑇𝜎 has a unique fixed point 𝑣𝜎 for
each 𝜎 ∈ Σ, we let 𝑣∗ denote the value function, which is defined as 𝑣∗ ≔ ∨𝜎∈Σ 𝑣𝜎 . As
for the regular MDP case, a policy 𝜎 is called optimal if 𝑣𝜎 = 𝑣∗ .
We can now state our main optimality result for MDPs with state-dependent dis-
counting.
(i) the value function 𝑣∗ is the unique solution to the Bellman equation in RX ,
(ii) a policy 𝜎 ∈ Σ is optimal if and only if it is 𝑣∗ -greedy, and
6.2.1.4 Algorithms
Algorithms for solving an MDP with state-dependent discounting include value func-
tion iteration (VFI), Howard policy iteration (HPI), and optimistic policy iteration
(OPI). The algorithms for VFI and OPI are identical to those given for regular MDPs
(see §5.1.4), provided that the correct operators 𝑇 and 𝑇𝜎 are used, and that the def-
inition of a 𝑣-greedy policy is as given in §6.2.1.1. The algorithm for HPI is almost
identical, with the only change being that computation of lifetime values involves 𝐿𝜎 .
Details are given in Algorithm 6.1.
We prove in Chapter 8 that, under the conditions of Assumption 6.2.1, VFI, OPI
and HPI are all convergent, and that HPI converges to an exact optimal policy in a
finite number of steps.
Some applications use an exogenous state component to drive a discount factor pro-
cess. In this section we set up such a model and obtain optimality conditions by
applying Proposition 6.2.2.
The first step is to decompose the state 𝑋𝑡 into a pair (𝑌𝑡 , 𝑍𝑡 ), where (𝑌𝑡 )𝑡⩾0 is
endogenous (i.e., affected by the actions of the controller) and ( 𝑍𝑡 )𝑡⩾0 is purely ex-
ogenous. In particular, the primitives consist of
for all ( 𝑦, 𝑧 ) ∈ X.
This exogenous discount model is a special case of the general MDP with state-
dependent discounting. Indeed, we can write (6.22) as (6.21) by setting 𝑥 ≔ ( 𝑦, 𝑧 )
and defining
    𝑃(𝑥, 𝑎, 𝑥′) ≔ 𝑃((𝑦, 𝑧), 𝑎, (𝑦′, 𝑧′)) ≔ 𝑄(𝑧, 𝑧′) 𝑅(𝑦, 𝑎, 𝑦′).
The following proposition provides a relatively simple sufficient condition for the
core optimality results in the setting of the exogenous discount model.
In §6.2.1.2 we mentioned that requiring sup 𝛽 < 1 is too strict for some applications.
For example, the real interest rate 𝑟𝑡 shown in Figure 6.2 is sometimes negative. Using
long historical records, Farmer et al. (2023) find that the discount rate is negative
around 1/3 of the time. This means that the associated discount factor 𝛽𝑡 = 1/(1 + 𝑟𝑡 )
is sometimes greater than 1 and sup 𝛽 < 1 fails.
[Figure: a simulated path of the discount factor process (𝛽𝑡), shown against the line 𝛽 = 1]
In this section, we modify the inventory management model from §5.2.1 to include
time-varying interest rates.
2 The parameters are 𝜌 = 0.85, 𝜎 = 0.0062, and 𝑏 = 0.99875. In line with Hills et al. (2019), we
discretize the model via mc = tauchen(n, ρ, σ, 1 - ρ, m) with 𝑚 = 4.5 and 𝑛 = 15.
Recall that, in the model of §5.2.1, the Bellman equation takes the form
    𝑣(𝑥) = max_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑑⩾0} 𝑣(𝑓(𝑥, 𝑎, 𝑑)) 𝜑(𝑑) }    (6.24)
If we set
    𝑅(𝑦, 𝑎, 𝑦′) ≔ P{𝑓(𝑦, 𝑎, 𝐷) = 𝑦′}    when    𝐷 ∼ 𝜑,
then 𝑅 ( 𝑦, 𝑎, 𝑦 0) is the probability of realizing next period inventory level 𝑦 0 when the
current level is 𝑦 and the action is 𝑎. Hence we can rewrite (6.25) as
    𝐵((𝑦, 𝑧), 𝑎, 𝑣) = 𝑟(𝑦, 𝑎) + 𝛽(𝑧) Σ_{𝑦′, 𝑧′} 𝑣(𝑦′, 𝑧′) 𝑄(𝑧, 𝑧′) 𝑅(𝑦, 𝑎, 𝑦′).    (6.26)
We have now created a version of the MDP with exogenous state-dependent dis-
counting described in §6.2.1.5. Letting 𝐿 ( 𝑧, 𝑧0) ≔ 𝛽 ( 𝑧) 𝑄 ( 𝑧, 𝑧0) and applying Propo-
sition 6.2.3, we see that all of the standard optimality results hold whenever 𝜌 ( 𝐿) < 1.
Figure 6.5 shows how inventory evolves under an optimal program when the pa-
rameters of the problem are as given in Listing 21. (The code preallocates and com-
putes arrays representing 𝑟 , 𝑅 and 𝑄 in (6.26) and includes a test for 𝜌 ( 𝐿) < 1.) We
set 𝛽 ( 𝑧) = 𝑧 and take ( 𝑍𝑡 ) to be a discretization of an AR(1) process. Figure 6.5
was created by simulating ( 𝑍𝑡 ) according to 𝑄 and inventory (𝑌𝑡 ) according to 𝑌𝑡+1 =
(𝑌𝑡 − 𝐷𝑡+1 ) ∨ 0 + 𝐴𝑡 , where 𝐴𝑡 follows the optimal policy.
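A simulation of this kind can be coded directly from the state update. The sketch below is ours, not the book's: σ_star stands for a computed optimal policy indexed by (inventory, exogenous state), Q is the transition matrix for (𝑍𝑡), and demand_draw() returns a draw of 𝐷𝑡+1.

```julia
# Simulate inventory under a fixed policy, following
# Y_{t+1} = (Y_t − D_{t+1}) ∨ 0 + A_t and Z_{t+1} ∼ Q(Z_t, ·).
function simulate_inventory(σ_star, Q, demand_draw; y0=0, z0=1, T=400)
    y, z = y0, z0
    Y = zeros(Int, T)
    for t in 1:T
        Y[t] = y
        a = σ_star[y + 1, z]                            # order quantity A_t
        y = max(y - demand_draw(), 0) + a               # inventory update
        z = searchsortedfirst(cumsum(Q[z, :]), rand())  # draw Z_{t+1} from Q(z, ·)
    end
    return Y
end
```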
The outcome is similar to Figure 5.7, in the sense that inventory falls slowly and
then jumps up. As before, fixed costs induce this lumpy behavior. However, a new
phenomenon is now present: inventories trend up when interest rates fall and down
when they rise. (The interest rate 𝑟𝑡 is calculated via 𝛽𝑡 = 1/(1 + 𝑟𝑡 ) at each 𝑡 .) High
[Figure 6.5: simulated inventory (𝑌𝑡) and interest rate (𝑟𝑡) paths under the optimal policy]
interest rates foreshadow high interest rates due to positive autocorrelation ( 𝜌 > 0),
which in turn devalue future profits and hence encourage managers to economize on
stock.
Figure 6.6 shows execution time for VFI and OPI at different choices of 𝑚 (see
§6.2.1.3 for the interpretation of 𝑚). As for the optimal savings problem we studied
in Chapter 5, OPI is around 1 order of magnitude faster when 𝑚 is close to 50 (cf.
Figure 5.8 on page 153).
function create_sdd_inventory_model(;
ρ=0.98, ν=0.002, n_z=20, b=0.97, # Z state parameters
K=40, c=0.2, κ=0.8, p=0.6, # firm and demand parameters
d_max=100) # truncation of demand shock
[Figure 6.6: execution time for VFI and OPI at different values of 𝑚]
We first discuss risk-neutral pricing and show why this assumption is typically implau-
sible. Next, we introduce stochastic discount factors and stationary asset pricing.
Consider the problem of assigning a current price Π𝑡 to an asset that confers on its
owner the right to payoff 𝐺𝑡+1 . The payoff is stochastic and realized next period. One
simple idea is to use risk-neutral pricing, which implies that
Π𝑡 = E𝑡 𝛽 𝐺𝑡+1 (6.27)
for some constant discount factor 𝛽 ∈ (0, 1). If the payoff is in 𝑘 periods, then we modify the price to E𝑡 𝛽^𝑘 𝐺𝑡+𝑘. In essence, risk-neutral pricing says that cost equals expected reward, discounted to present value by compounding a constant rate of discount. (A rate of discount, say 𝜌, is linked to a discount factor, say 𝛽, by 𝛽 = 1/(1 + 𝜌) ≈ exp(−𝜌).)
Example 6.3.1. Let 𝑆𝑡 be the price of a stock at each point in time 𝑡 . A European call
option gives its owner the right to purchase the stock at price 𝐾 at time 𝑡 + 𝑘. Under risk-neutral pricing, the time 𝑡 price of this option is

    Π𝑡 = E𝑡 𝛽^𝑘 max{𝑆𝑡+𝑘 − 𝐾, 0}.
Although risk neutrality allows for simple pricing, assuming risk neutrality for all
investors is not plausible.
To give one example, suppose that we take the asset that pays 𝐺𝑡+1 in (6.27) and
replace it with another asset that pays 𝐻𝑡+1 = 𝐺𝑡+1 + 𝜀𝑡+1 , where 𝜀𝑡+1 is independent of
𝐺𝑡+1 , E𝑡 𝜀𝑡+1 = 0 and Var 𝜀𝑡+1 > 0. In effect, we are adding risk to the original payoff
without changing its mean.
Under risk neutrality, the price of this new asset is

    E𝑡 𝛽 𝐻𝑡+1 = E𝑡 𝛽 (𝐺𝑡+1 + 𝜀𝑡+1) = E𝑡 𝛽 𝐺𝑡+1 = Π𝑡.

In other words, adding mean-zero risk leaves the price unchanged. This outcome contradicts the idea that investors typically want compensation for bearing risk.
A helpful way to think about the same point is to consider the rate of return 𝑟𝑡+1 ≔ (𝐺𝑡+1 − Π𝑡)/Π𝑡 on holding an asset with payoff 𝐺𝑡+1. From (6.27) we have E𝑡 𝛽(1 + 𝑟𝑡+1) = 1, or

    E𝑡 𝑟𝑡+1 = (1 − 𝛽)/𝛽.
Since the right-hand side does not depend on 𝐺𝑡+1 , risk neutrality implies that all
assets have the same expected rate of return. But this contradicts the finding that, on
average, riskier assets tend to have higher rates of return that compensate investors
for bearing risk.
Example 6.3.2. The risk premium on a given asset is defined as the expected rate
of return minus the rate of return on a risk-free asset. If we assume risk neutrality
then, by the preceding discussion, the risk premium is zero for all assets. However,
calculations based on post-war US data show that the average return premium on
equities over safe assets is around 8% per annum (see, e.g., Cochrane (2009)).
To go beyond risk neutral-pricing, let’s start with a model containing one asset and
one agent. It is straightforward to price the asset and compare it to the risk neutral
case.
A representative agent takes the price Π𝑡 of a risky asset as given and solves a standard consumption choice problem. Here 𝑢 is the agent's flow utility function and 𝐶𝑡 is consumption. The first-order condition for purchases of the asset equates the marginal cost of a unit of the asset, Π𝑡 𝑢′(𝐶𝑡), with its expected discounted marginal benefit, E𝑡 𝛽 𝑢′(𝐶𝑡+1) 𝐺𝑡+1. Rearranging gives us

    Π𝑡 = E𝑡 [ 𝛽 (𝑢′(𝐶𝑡+1)/𝑢′(𝐶𝑡)) 𝐺𝑡+1 ].    (6.28)
Comparing (6.28) with (6.27), we see that the payoff is now multiplied by a positive
random variable rather than a constant. The random variable

    𝑀𝑡+1 ≔ 𝛽 𝑢′(𝐶𝑡+1)/𝑢′(𝐶𝑡)    (6.29)

is called the stochastic discount factor or pricing kernel. We call the particular form of the pricing kernel shown in (6.29) the Lucas stochastic discount factor (Lucas SDF) in honor of Lucas (1978a).
Example 6.3.4. If utility has the CRRA form 𝑢(𝑐) = 𝑐^{1−𝛾}/(1 − 𝛾) for some 𝛾 > 0, then the Lucas SDF takes the form

    𝑀𝑡+1 = 𝛽 (𝐶𝑡+1/𝐶𝑡)^{−𝛾},    (6.30)

which we can also write as 𝑀𝑡+1 = 𝛽 exp(−𝛾𝑔𝑡+1) when 𝑔𝑡+1 ≔ ln(𝐶𝑡+1/𝐶𝑡) is the growth rate of consumption.
In the CRRA case, the Lucas SDF applies heavier discounting to assets that con-
centrate payoffs in states of the world where the agent is already enjoying strong
consumption growth. Conversely, the SDF attaches higher weights to future payoffs
that occur when consumption growth is low because such payoffs hedge against the
risk of drawing low consumption states.
The standard neoclassical theory of asset pricing generalizes the Lucas discounting
specification by assuming only that there exists a positive random variable 𝑀𝑡+1 such
that the price of an asset with payoff 𝐺𝑡+1 is
As above, 𝑀𝑡+1 is called a stochastic discount factor (SDF). Equation (6.31) generalizes (6.28) by refraining from restricting the SDF (apart from assuming positivity).
Actually, it can be shown that there exists an SDF 𝑀𝑡+1 such that (6.31) is always valid under relatively weak assumptions. In particular, a single SDF 𝑀𝑡+1 can be used to price any asset in the market, so if 𝐻𝑡+1 is another stochastic payoff then the current price of an asset with this payoff is E𝑡 𝑀𝑡+1 𝐻𝑡+1.
We do not prove these claims, since our interest is in understanding forward-
looking equations in Markov environments. Some relevant references are listed in
§6.4.
Now we are ready to look at pricing a stationary cash flow over an infinite horizon,
a basic problem in asset pricing. We will apply the Markov structure assumed in
§6.3.1.4. In all that follows, ( 𝑋𝑡 ) is 𝑃 -Markov on X and 𝑀𝑡+1 is defined as in §6.3.1.4.
We seek the time 𝑡 price, denoted by Π𝑡 , for an ex-dividend contract on the divi-
dend stream ( 𝐷𝑡 )𝑡⩾0 . The contract provides the owner with the right to the dividend
stream. The “ex-dividend” component means that, should the dividend stream be
traded at time 𝑡 , the dividend paid at time 𝑡 goes to the seller rather than the buyer.
As a result, purchasing at 𝑡 and selling at 𝑡 + 1 pays Π𝑡+1 + 𝐷𝑡+1. Hence, applying the asset pricing rule (6.31), the time 𝑡 price Π𝑡 of the contract must satisfy
or, equivalently,
    𝜋 = 𝐴𝜋 + 𝐴𝑑    when    𝐴(𝑥, 𝑥′) ≔ 𝑚(𝑥, 𝑥′) 𝑃(𝑥, 𝑥′).    (6.35)

By the Neumann series lemma, 𝜌(𝐴) < 1 implies that (6.35) has the unique solution

    𝜋* ≔ (𝐼 − 𝐴)^{−1} 𝐴𝑑 = Σ_{k=1}^{∞} 𝐴^𝑘 𝑑.
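Computationally, solving for 𝜋* is a single linear solve. Here is a small Julia sketch (the function name is ours) that takes matrices for 𝑚 and 𝑃 and a dividend vector 𝑑:

```julia
using LinearAlgebra

# Equilibrium ex-dividend prices π* = (I - A)⁻¹ A d, with A(x,x′) = m(x,x′) P(x,x′).
function ex_dividend_prices(m, P, d)
    A = m .* P
    @assert maximum(abs, eigvals(A)) < 1   # ρ(A) < 1 is required
    return (I - A) \ (A * d)
end
```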
EXERCISE 6.3.1. Show that 𝜌(𝐴) < 1 is both necessary and sufficient for existence of a unique solution to (6.34) in (0, ∞)^X whenever 𝑚, 𝑑 ≫ 0.
Remark 6.3.1. We can call 𝐴 an Arrow–Debreu discount operator. Its powers apply discounting: the valuation of any random payoff 𝑔 in 𝑘 periods is 𝐴^𝑘 𝑔.
EXERCISE 6.3.4. Derive the price for a cum-dividend contract on the dividend
stream ( 𝐷𝑡 )𝑡⩾0 , with the model otherwise unchanged. Under this contract, should the
right to the dividend stream be traded at time 𝑡 , the dividend paid at time 𝑡 goes to
the buyer rather than the seller.
Asset prices can be expressed as infinite sums under the assumptions stated above.
Let’s show this for cum-dividend contracts (although the case of ex-dividend contracts
is similar). In Exercise 6.3.4 you found that the state-contingent price vector 𝜋 for a
cum-dividend contract on the dividend stream ( 𝐷𝑡 )𝑡⩾0 obeys
    𝜋 = 𝑑 + 𝐴𝜋    when    𝐴(𝑥, 𝑥′) ≔ 𝑚(𝑥, 𝑥′) 𝑃(𝑥, 𝑥′)    (6.37)
where 𝑀𝑡+1 ≔ 𝑚 ( 𝑋𝑡 , 𝑋𝑡+1 ) for 𝑡 ⩾ 0 and 𝑀0 ≔ 1. This expression agrees with our intu-
ition: The price of the contract is the expected present value of the dividend stream,
with the time 𝑡 dividend discounted by the composite factor 𝑀1 · · · 𝑀𝑡 .
Until now, our discussion of asset pricing has assumed that dividends are stationary.
However, dividends typically grow over time, along with other economic measures
such as GDP. In this section we solve for the price of a dividend stream when dividends
exhibit random growth.
where 𝜅 is a fixed function, ( 𝑋𝑡 ) is the state process and ( 𝜂𝑡 ) is IID. We let 𝜑 be the
density of each 𝜂𝑡 and assume that ( 𝑋𝑡 ) is 𝑃 -Markov on a finite set X. Let’s suppose as
before that the SDF obeys 𝑀𝑡+1 = 𝑚 ( 𝑋𝑡 , 𝑋𝑡+1 ) for some positive function 𝑚.
Since dividends grow over time, so will the price of the asset. As such, we should
no longer seek a fixed function 𝜋 such that Π𝑡 = 𝜋 ( 𝑋𝑡 ) for all 𝑡 , since the resulting
price process ( Π𝑡 ) will fail to grow. Instead, we try to solve for the price-dividend
ratio 𝑉𝑡 ≔ Π𝑡 / 𝐷𝑡 , which we hope will be stationary.
The price-dividend process (𝑉𝑡∗ ) defined by 𝑉𝑡∗ = 𝑣∗ ( 𝑋𝑡 ) solves (6.38). The price
can be recovered via Π𝑡 = 𝑉𝑡∗ 𝐷𝑡 .
    𝜅(𝑋𝑡, 𝜂𝑑,𝑡+1) = 𝜇𝑑 + 𝑋𝑡 + 𝜎𝑑 𝜂𝑑,𝑡+1,

where (𝜂𝑑,𝑡)𝑡⩾0 is IID and standard normal. Consumption growth is given by

    ln(𝐶𝑡+1/𝐶𝑡) = 𝜇𝑐 + 𝑋𝑡 + 𝜎𝑐 𝜂𝑐,𝑡+1,

where (𝜂𝑐,𝑡)𝑡⩾0 is also IID and standard normal. We use the Lucas SDF in (6.30), implying that

    𝑀𝑡+1 = 𝛽 (𝐶𝑡+1/𝐶𝑡)^{−𝛾} = 𝛽 exp(−𝛾(𝜇𝑐 + 𝑋𝑡 + 𝜎𝑐 𝜂𝑐,𝑡+1)).
Figure 6.7 shows the price-dividend ratio function 𝑣∗ for the specification given
in Listing 22, as well as for an alternative mean dividend growth rate 𝜇 𝑑 . The state
process is a Tauchen discretization of an AR(1) process with positive autocorrelation.
An increase in the state predicts higher dividends, which tends to increase the price.
At the same time, higher 𝑥 also predicts higher consumption growth, which acts neg-
atively on the price. For values of 𝛾 greater than 1, the second effect dominates and
the price-dividend ratio slopes down.
EXERCISE 6.3.8. Complete the code in Listing 22 and replicate Figure 6.7. Add a
test to your code that checks 𝜌 ( 𝐴) < 1 before computing the price-dividend ratio.
[Figure 6.7: the price-dividend ratio 𝑣* for 𝜇𝑑 = 0.02 and 𝜇𝑑 = 0.08]
In §6.3.1.5 we used the Neumann series lemma to solve for the equilibrium price
vector 𝜋. However, some modifications to the basic model introduce nonlinearities
that render the Neumann series lemma inapplicable. For example, Harrison and Kreps
(1978) analyze a setting with heterogeneous beliefs and incomplete markets, leading
to failure of the standard asset pricing equation. This results in a nonlinear equation
for prices.
We treat the Harrison and Kreps model only briefly. There are two types of agents.
Type 𝑖 believes that the state updates according to stochastic matrix 𝑃 𝑖 for 𝑖 = 1, 2.
Agents are risk-neutral, so 𝑚 ( 𝑥, 𝑦 ) ≡ 𝛽 ∈ (0, 1). Harrison and Kreps (1978) show that,
for their model, the equilibrium condition (6.34) becomes
    𝜋(𝑥) = max_{𝑖} 𝛽 Σ_{𝑥′} [𝜋(𝑥′) + 𝑑(𝑥′)] 𝑃𝑖(𝑥, 𝑥′)    (6.42)
for 𝑥 ∈ X and 𝑖 ∈ {1, 2}. Setting aside the details that lead to this equation, our
objective is simply to obtain a vector of prices 𝜋 that solves (6.42).
As a first step, we introduce an operator 𝑇 : R^X_+ → R^X_+ that maps 𝜋 to 𝑇𝜋 via

    (𝑇𝜋)(𝑥) = max_{𝑖} 𝛽 Σ_{𝑥′} [𝜋(𝑥′) + 𝑑(𝑥′)] 𝑃𝑖(𝑥, 𝑥′)    (𝑥 ∈ X).    (6.43)
    |(𝑇𝑝)(𝑥) − (𝑇𝑞)(𝑥)| ⩽ 𝛽 max_{𝑖} | Σ_{𝑥′} [𝑝(𝑥′) + 𝑑(𝑥′)] 𝑃𝑖(𝑥, 𝑥′) − Σ_{𝑥′} [𝑞(𝑥′) + 𝑑(𝑥′)] 𝑃𝑖(𝑥, 𝑥′) | ⩽ 𝛽 max_{𝑖} Σ_{𝑥′} |𝑝(𝑥′) − 𝑞(𝑥′)| 𝑃𝑖(𝑥, 𝑥′).
Since this bound holds for all 𝑥, we can take the maximum with respect to 𝑥 and obtain

    ‖𝑇𝑝 − 𝑇𝑞‖∞ ⩽ 𝛽 ‖𝑝 − 𝑞‖∞.

Thus, on R^X_+, the map 𝑇 is a contraction of modulus 𝛽 with respect to the sup norm. Since R^X_+ is a closed subset of R^X, it follows that 𝑇 has a unique fixed point in this set. Hence the system (6.42) has a unique solution 𝜋* in R^X_+, representing
equilibrium prices. This fixed point can be computed by successive approximation.
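A sketch of this computation in Julia (using the book's successive_approx helper from s_approx.jl; the primitives P1, P2, d and β below are placeholders) follows directly from (6.43):

```julia
include("s_approx.jl")   # provides successive_approx

# Harrison–Kreps prices: iterate the operator in (6.43) to its fixed point.
function hk_prices(P1, P2, d, β; π_init=zero(d))
    T(π) = max.(β * (P1 * (π + d)), β * (P2 * (π + d)))  # pointwise max over the two belief types
    return successive_approx(T, π_init)
end
```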
(2005), Amador et al. (2006), Balbus et al. (2018), Fedus et al. (2019), Hens and
Schindler (2020), Jaśkiewicz and Nowak (2021), and Drugeon and Wigniolle (2021).
This chapter focused on time additive models with state-dependent discounting.
More general preference specifications with this feature include Albuquerque et al.
(2016), Schorfheide et al. (2018), Pohl et al. (2018), Gomez-Cram and Yaron (2020),
and de Groot et al. (2022). In Chapter 8 we consider state-dependent discounting in
general settings that accommodate such nonlinearities.
Chapter 7
Nonlinear Valuation
The most natural way to express lifetime value in recursive preference environments
is as a fixed point of a (typically nonlinear) operator. One challenge is that some
recursive preference specifications induce operators that fail to be contractions. For
this reason, we now invest in additional fixed point theory. All of this theory concerns
order-preserving maps, since the operators we consider always inherit monotonicity
from underlying preferences.
If you try to draw an increasing function that maps [0, 1] to itself without touching
the 45 degree line you will find it impossible. Below we state a famous fixed point
theorem due to Bronislaw Knaster (1893–1980) and Alfred Tarski (1901–1983) that
generalizes this idea. In the statement, X is a finite set and 𝑉 ≔ [ 𝑣1 , 𝑣2 ], where 𝑣1 , 𝑣2
are functions in RX with 𝑣1 ⩽ 𝑣2 .
Theorem 7.1.1 (Knaster–Tarski). If 𝑇 is an order-preserving self-map on 𝑉 , then the set
of fixed points of 𝑇 is nonempty and contains least and greatest elements 𝑎 ⩽ 𝑏. Moreover,
𝑇 𝑘 𝑣1 ⩽ 𝑎 ⩽ 𝑏 ⩽ 𝑇 𝑘 𝑣2 for all 𝑘 ⩾ 0.
Unlike, say, the fixed point theorem of Banach (§1.2.2.3), Theorem 7.1.1 only
yields existence. Uniqueness does not hold in general, as you can easily confirm by
sketching the one-dimensional case or completing the following exercise.
EXERCISE 7.1.1. Consider the setting of Theorem 7.1.1 and suppose in addition
that 𝑣1 ≠ 𝑣2 . Show that there exists an order-preserving self-map on 𝑉 with a contin-
uum of fixed points.
In this section we study sufficient conditions for global stability that replace contrac-
tivity with shape properties such as concavity and monotonicity. To build intuition, we
start with the one-dimensional case and show how these properties can be combined
to achieve stability. Readers focused on results can safely skip to §7.1.2.2.
In §1.2.3.2 we showed that concavity and monotonicity can yield global stability for
the Solow–Swan model. Here is a more general result.
Proposition 7.1.2. If 𝑔 is an increasing concave self-map on 𝑈 ≔ (0, ∞) and, for all
𝑥 ∈ 𝑈 , there exist 𝑎, 𝑏 ∈ 𝑈 with 𝑎 ⩽ 𝑥 ⩽ 𝑏, 𝑎 < 𝑔 ( 𝑎) and 𝑔 ( 𝑏) ⩽ 𝑏, then 𝑔 is globally stable
on 𝑈 .
[Figure: an increasing concave map 𝑔 and the 45 degree line, with fixed point 𝑥*]
𝑔 ( 𝑥 ) = 𝑔 ( 𝜆𝑎 + (1 − 𝜆 ) 𝑦 ) ⩾ 𝜆𝑔 ( 𝑎) + (1 − 𝜆 ) 𝑔 ( 𝑦 ) > 𝜆𝑎 + (1 − 𝜆 ) 𝑦 = 𝑥 = 𝑔 ( 𝑥 ) .
EXERCISE 7.1.2. Prove that the map 𝑔 and set 𝑈 defined in the discussion of the Solow-Swan model above Proposition 7.1.2 satisfy the conditions of the proposition.
EXERCISE 7.1.3. Show that the condition 𝑎 < 𝑔 ( 𝑎) in Proposition 7.1.2 cannot be
dropped without weakening the conclusion.
    𝑓′(𝑘) → ∞ as 𝑘 → 0    and    𝑓′(𝑘) → 0 as 𝑘 → ∞,
EXERCISE 7.1.5. Fajgelbaum et al. (2017) study a law of motion for aggregate
uncertainty given by
    𝑠𝑡+1 = 𝑔(𝑠𝑡)    where    𝑔(𝑠) ≔ 𝜌² (1/𝑠 + 1/𝜂²)^{−1} + 𝑎 + 𝛾.
Let 𝑎, 𝜂 and 𝛾 be positive constants and assume 0 < 𝜌 < 1. Prove that 𝑔 is globally
stable on 𝑀 ≔ (0, ∞).
and concave if
[Figure 7.2: the convex case (a) and the concave case (b), shown in one dimension on the order interval [𝑣1, 𝑣2]]
We are now ready to state our next fixed point result, which was first proved
in an infinite-dimensional setting by Du (1990). In the statement, X is a finite set,
𝑉 ≔ [ 𝑣1 , 𝑣2 ] is a nonempty order interval in ( RX , ⩽), and 𝑇 is a self-map on 𝑉 .
Conditions (i) and (ii) are similar – in fact (ii) holds whenever (i) holds, so (ii)
is the weaker (but slightly more complicated) condition. Conditions (iii) and (iv) are
similar in the same sense. Figure 7.2 illustrates the convex and the concave versions
of the result in one dimension. We encourage you to sketch your own variations to
understand the roles that different conditions play.
A full proof of Theorem 7.1.3 can be found in Du (1990) or Theorem 2.1.2 and
Corollary 2.1.1 of Zhang (2012). In our setting, existence follows from the Knaster–
Tarski theorem on page 213. We prove uniqueness on page 347.
Continuing to assume that ℎ ≫ 0 and 𝐴 is a positive linear operator, we can use Du's theorem to establish the next result (which generalizes Lemma 6.1.4 on page 188).
The key to proving (i) implies (ii) is that 𝐺 is order-preserving and either convex
or concave, depending on the value of 𝜃. The remaining conditions in Du’s theorem
are established over order intervals using 𝜌 ( 𝐴) 1/𝜃 < 1. By applying an approxima-
tion argument, global stability is extended from order intervals to all of 𝑉 . Some of
these details are contained in the following exercises and a full proof can be found in
Stachurski et al. (2022).
Let

    𝐹𝑥(𝑡) = {ℎ(𝑥) + 𝑡^{1/𝜃}}^{𝜃}    (𝑡 > 0).
EXERCISE 7.1.9. Kleinman et al. (2023) study a dynamic discrete choice model of
migration with savings and capital accumulation. They show that optimal consump-
tion for landlords in their model is 𝑐𝑡 = 𝜎𝑡 𝑅𝑡 𝑘𝑡 , where 𝑘𝑡 is capital, 𝑅𝑡 is the gross rate
of return on capital and 𝜎𝑡 is a state-dependent process obeying
    𝜎𝑡^{−1} = 1 + 𝛽^{𝜓} [ E𝑡 𝑅𝑡+1^{(𝜓−1)/𝜓} 𝜎𝑡+1^{−1/𝜓} ]^{𝜓}.    (7.3)
Prove that there exists a unique solution to (7.3) of the form 𝜎𝑡 = 𝜎(𝑋𝑡) for some 𝜎 ∈ R^X with 𝜎 ≫ 0 if and only if 𝜌(𝐴)^{𝜓} < 1.
The time additive model of valuation in §3.2.2.3 can be studied from a purely recursive
point of view. As a starting point, we state that the value 𝑉𝑡 of current and future
consumption is defined at each point in time 𝑡 by the recursion
𝑉𝑡 = 𝑢 ( 𝐶𝑡 ) + 𝛽 E𝑡 𝑉𝑡+1 . (7.5)
The random variables 𝑉𝑡 and 𝑉𝑡+1 are the unknown objects in this expression. The
expectation E𝑡 conditions on 𝑋0 , . . . , 𝑋𝑡 and 𝐶𝑡 = 𝑐 ( 𝑋𝑡 ). The process ( 𝑋𝑡 )𝑡⩾0 is 𝑃 -Markov.
Since consumption is a function of ( 𝑋𝑡 )𝑡⩾0 and knowledge of the current state 𝑋𝑡
is sufficient to forecast future values (by the Markov property), it is natural to guess
that 𝑉𝑡 will depend on the Markov chain only through 𝑋𝑡. Hence we guess that a solution of (7.5) takes the form 𝑉𝑡 = 𝑣(𝑋𝑡) for some 𝑣 ∈ R^X.
(Here 𝑣 is an ansatz, meaning “educated guess.” First we guess the form of a
solution and then we try to verify that the guess is correct. So long as we carry out
the second step, starting with a guess brings no loss of rigor.)
Under this conjecture, (7.5) can be rewritten as 𝑣(𝑋𝑡) = 𝑢(𝑐(𝑋𝑡)) + 𝛽 E𝑡 𝑣(𝑋𝑡+1). Conditioning on 𝑋𝑡 = 𝑥 and setting 𝑟 ≔ 𝑢 ◦ 𝑐, this becomes

    𝑣(𝑥) = 𝑟(𝑥) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑥′)    (𝑥 ∈ X).

In vector form, we get 𝑣 = 𝑟 + 𝛽𝑃𝑣. From the Neumann series lemma, the solution is
𝑣∗ = ( 𝐼 − 𝛽𝑃 ) −1 𝑟 , which is identical to (3.21) on page 96.
EXERCISE 7.2.1. Verify our guess: Show (𝑉𝑡∗ ) obeys (7.5) when 𝑉𝑡∗ ≔ 𝑣∗ ( 𝑋𝑡 ).
In summary, (7.5) and the sequential representation (3.20) specify the same life-
time value for consumption paths.
While the recursive formulation in (7.5) now seems redundant, since it produces
the same specification that we obtained from the sequential approach, the recursive
set up gives us a formula to build on, and hence a pathway to overcoming limitations
of the time additive approach. Most of the rest of this chapter will be focused on this
agenda.
Pursuing this agenda will produce preferences over consumption paths where the
sequential approach has no natural counterpart. This occurs when current value 𝑉𝑡 is
nonlinear in current rewards and continuation values (unlike the linear specification
(7.5)). Such specifications are called recursive preferences. When dealing with
recursive preference models, the lack of a sequential counterpart means that we are
forced to proceed recursively.
Remark 7.2.1. The term “recursive preferences” is confusing, since, as we have just
agreed, time additive preferences also admit the recursive specification (7.5). Nonethe-
less, when economists say “recursive preferences,” they almost always refer to settings
where lifetime utility can only be expressed recursively. We follow this convention.
In the previous section we discussed how the time additive preference specification

    𝑣(𝑥) = E𝑥 Σ_{𝑡⩾0} 𝛽^𝑡 𝑢(𝐶𝑡),    (7.7)
also called the discounted expected utility model, can be framed recursively, and
how this provides a pathway to go beyond the time additive specification. We are
motivated to do so because the time additive specification has been rejected by exper-
imental and observational data in many settings.
In this section we highlight some of the limitations of time additive preferences.
While our discussion is only brief, more background and a list of references can be
found in §7.4.
One issue with (7.7) is the assumption of a constant positive discount rate, which
has been refuted by a long list of empirical studies. This issue was discussed in §6.4.
Another limitation of time additive preferences is that agents are risk-neutral in
future utility (see, e.g., (7.5), where current value depends linearly on future value).
Although risk aversion over consumption can be built in through curvature of 𝑢, this
same curvature also determines the elasticity of intertemporal substitution, meaning
that the two aspects of preferences cannot be separated. We elaborate on this point
in §7.3.1.4.
A third issue with time additivity is that agents with such preferences are indiffer-
ent to any variation in the joint distribution of rewards that leaves marginal distribu-
tions unchanged. To get a sense of what this means, suppose you accept a new job
and will be employed by this firm for the rest of your life. Your daily consumption will
be entirely determined by your daily wage. Your boss offers you two options:
(A) Your boss will flip a coin at the start of your first day on the job. If the coin is
heads, you will receive $10,000 a day for the rest of your life. If the coin is tails,
you will receive $1 per day for the rest of your life.
(B) Your boss will flip a coin at the start of every day. If the coin is heads, you will
receive $10,000. If the coin is tails, you will receive $1.
If you have a strict preference between options A and B, then your choice cannot
be rationalized with time additive preferences.
To see why, let 𝜑 be a probability distribution that represents the lottery described
above, putting mass 0.5 on 10,000 and mass 0.5 on 1. Under option A, consumption
(𝐶𝑡)𝑡⩾1 is given by 𝐶𝑡 = 𝐶1 for all 𝑡, where 𝐶1 ∼ 𝜑. Under option B, consumption (𝐶𝑡)𝑡⩾1 is an IID sequence drawn from 𝜑. Either way, lifetime utility is

    E Σ_{𝑡⩾1} 𝛽^𝑡 𝑢(𝐶𝑡) = Σ_{𝑡⩾1} 𝛽^𝑡 E 𝑢(𝐶𝑡) = 𝛽 ū / (1 − 𝛽),

where ū ≔ 0.5 𝑢(10,000) + 0.5 𝑢(1). Hence lifetime utility is the same under both options, and a time additive agent is indifferent between them.
Having motivated recursive preferences, let’s turn to our first example: risk-sensitive
preferences. For the consumption problem described in §7.2.1.1, imposing risk-
sensitive preferences means replacing the recursion 𝑣 = 𝑟 + 𝛽𝑃𝑣 for 𝑣 with
    𝑣(𝑥) = 𝑟(𝑥) + 𝛽 (1/𝜃) ln { Σ_{𝑥′} exp(𝜃 𝑣(𝑥′)) 𝑃(𝑥, 𝑥′) }    (𝑥 ∈ X).    (7.8)
We understand the functional equation (7.8) as “defining” lifetime utility under risk-
sensitive preferences. A function 𝑣 solving (7.8) gives a lifetime valuation 𝑣 ( 𝑥 ) to each
𝑥 ∈ X, with the interpretation that 𝑣 ( 𝑥 ) is lifetime utility conditional on initial state
𝑥 . This definition of lifetime value is by analogy to the time additive case studied in
§7.2.1.1, where the function 𝑣 solving 𝑣 = 𝑟 + 𝛽𝑃𝑣 measures lifetime utility from each
initial state.
EXERCISE 7.2.2. Prove that, for any random variable 𝜉, any nonzero 𝜃, and any constant 𝑐, we have E𝜃[𝜉 + 𝑐] = E𝜃[𝜉] + 𝑐.
The key idea behind the entropic risk-adjusted expectation is that decreasing 𝜃
lowers appetite for risk and increasing 𝜃 does the opposite.
    E𝜃[𝜉] = E[𝜉] + 𝜃 Var[𝜉]/2.    (7.9)
[Hint: Look up the moment generating function of a normal distribution.]
Expression (7.9) above shows that, for the Gaussian case, E𝜃 [ 𝜉] equals the mean
plus a term that penalizes variance when 𝜃 < 0 and rewards it when 𝜃 > 0.
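A quick Monte Carlo experiment (ours, purely illustrative) confirms (7.9) for a Gaussian draw:

```julia
using Statistics

θ, μ, σ = -1.0, 1.0, 1.0
ξ = μ .+ σ .* randn(10^6)
entropic = (1 / θ) * log(mean(exp.(θ .* ξ)))   # E_θ[ξ] estimated by simulation
println((entropic, μ + θ * σ^2 / 2))           # both values ≈ 0.5
```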
As a tractable case, let’s suppose that 𝑟 ( 𝑥 ) = 𝑥 and that 𝑋𝑡+1 = 𝜌𝑋𝑡 + 𝜎𝑊𝑡+1 where (𝑊𝑡 )𝑡⩾1
is IID and standard normal. Here | 𝜌 | < 1 and 𝜎 ⩾ 0 controls volatility of the state.
Rather than discretizing the state process, we leave it as continuous and proceed by
hand.
In this setting, the functional equation (7.8) for 𝑣 becomes
    𝑣(𝑥) = 𝑥 + 𝛽 E𝜃[𝑣(𝜌𝑥 + 𝜎𝑊)].    (7.11)

Guessing a solution of the form 𝑣(𝑥) = 𝑎𝑥 + 𝑏 and matching coefficients yields

    𝑎 ≔ 1/(1 − 𝜌𝛽)    and    𝑏 ≔ 𝜃 (𝛽/(1 − 𝛽)) (𝑎𝜎)²/2.
We can see that, under the stated assumptions, lifetime value 𝑣 is increasing in
the state variable 𝑥 . However, impacts of the parameters generally depend on 𝜃. For
example, if 𝜃 > 0, increasing 𝜎 shifts up lifetime utility. If 𝜃 < 0, then lifetime value decreases with 𝜎. This is as we expect: lifetime utility is affected positively or negatively by volatility, depending on whether the agent is risk loving or risk averse.
Figure 7.3 shows the true solution 𝑣 ( 𝑥 ) = 𝑎𝑥 + 𝑏 to the risk-sensitive lifetime util-
ity model, as well as an approximate fixed point from a discrete approximation. The
discrete approximation is computed by applying successive approximation to 𝐾𝜃 after
discretizing the state process via Tauchen’s method. The parameters and discretiza-
tion are shown in Listing 23.
EXERCISE 7.2.6. Dropping the Gaussian assumption, suppose now that consump-
tion is IID with 𝐶𝑡 = 𝑐 ( 𝑋𝑡 ) where ( 𝑋𝑡 )𝑡⩾0 is IID with distribution 𝜑 on finite set X. Now
the operator 𝐾𝜃 becomes
    (𝐾𝜃 𝑣)(𝑥) = 𝑟(𝑥) + 𝛽 (1/𝜃) ln { Σ_{𝑥′} exp(𝜃 𝑣(𝑥′)) 𝜑(𝑥′) }    (𝑥 ∈ X).
[Figure 7.3: the approximate fixed point from the discretized model and the exact solution 𝑣(𝑥) = 𝑎𝑥 + 𝑏]
using QuantEcon   # for tauchen

function create_rs_utility_model(;
n=180, # size of state space
β=0.95, # time discount factor
ρ=0.96, # correlation coef in AR(1)
σ=0.1, # volatility
θ=-1.0) # risk aversion
mc = tauchen(n, ρ, σ, 0, 10) # n_std = 10
x_vals, P = mc.state_values, mc.p
r = x_vals # special case u(c(x)) = x
return (; β, θ, ρ, σ, r, x_vals, P)
end
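The following sketch (ours, not one of the book's listings) implements the discretized operator 𝐾𝜃 from (7.8) for the model in Listing 23 and compares the computed fixed point with the exact solution 𝑣(𝑥) = 𝑎𝑥 + 𝑏 derived above:

```julia
include("s_approx.jl")   # the book's successive approximation helper

K(v, m) = m.r + (m.β / m.θ) .* log.(m.P * exp.(m.θ .* v))   # K_θ from (7.8)

model = create_rs_utility_model()
v_approx = successive_approx(v -> K(v, model), zeros(length(model.x_vals)))

a = 1 / (1 - model.ρ * model.β)
b = model.θ * model.β / (1 - model.β) * (a * model.σ)^2 / 2
v_exact = a .* model.x_vals .+ b
println(maximum(abs, v_approx - v_exact))    # small, reflecting discretization error
```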
7.2.3.1 Specification
where 𝛾 , 𝛼 are nonzero parameters and 𝛽 ∈ (0, 1). As for risk-sensitive preferences,
lack of time additivity implies that there is no neat sequential representation for life-
time value. As a result, we must work directly with the recursive expression (7.12).
Assume as before that 𝐶𝑡 = 𝑐 ( 𝑋𝑡 ), where 𝑐 ∈ RX
+ and ( 𝑋𝑡 )𝑡 ⩾0 is 𝑃 -Markov on finite
set X. We conjecture a solution of the form 𝑉𝑡 = 𝑣 ( 𝑋𝑡 ) for some 𝑣 ∈ 𝑉 ≔ RX + . Under
this conjecture, the Epstein–Zin Koopmans operator corresponding to (7.12) is
include("s_approx.jl")
using LinearAlgebra, QuantEcon
function create_ez_utility_model(;
n=200, # size of state space
ρ=0.96, # correlation coef in AR(1)
σ=0.1, # volatility
β=0.99, # time discount factor
α=0.75, # EIS parameter
γ=-2.0) # risk aversion parameter
mc = tauchen(n, ρ, σ, 0, 5)
x_vals, P = mc.state_values, mc.p
c = exp.(x_vals)
return (; β, ρ, σ, α, γ, c, x_vals, P)
end
function K(v, model)   # Epstein–Zin Koopmans operator (header and unpacking reconstructed)
    (; β, α, γ, c, P) = model
    R = (P * (v.^γ)).^(1/γ)
    return ((1 - β) * c.^α + β * R.^α).^(1/α)
end
function compute_ez_utility(model)
v_init = ones(length(model.x_vals))
v_star = successive_approx(v -> K(v, model),
v_init,
tolerance=1e-10)
return v_star
end
[Figure: the initial condition 𝑣0 and the computed fixed point 𝑣*]
The operator 𝐾ˆ is simpler to work with than 𝐾 because it unifies 𝛼, 𝛾 into a single
parameter 𝜃 and decomposes the Epstein–Zin update rule into two parts: a linear
map 𝑃 and a separate nonlinear component.
[Figure 7.5: the scalar map K̂ = 𝐹 plotted against the 45 degree line]
This shows that (𝑉, 𝐾 ) and (𝑉, 𝐾ˆ ) are topologically conjugate, as claimed. □
Proof of Proposition 7.2.3. Set 𝐴 = 𝛽^{𝜃} 𝑃. With this notation we have K̂𝑣 = [ℎ + (𝐴𝑣)^{1/𝜃}]^{𝜃}. In view of Theorem 7.1.4 on page 217, this operator is globally stable on 𝑉 whenever 𝜌(𝐴)^{1/𝜃} < 1. In our case 𝜌(𝐴) = 𝜌(𝛽^{𝜃}𝑃) = 𝛽^{𝜃}, so 𝜌(𝐴)^{1/𝜃} = 𝛽. It follows that K̂ is globally stable on 𝑉 whenever 𝛽 < 1. Since (𝑉, 𝐾) and (𝑉, K̂) are topologically
conjugate, the proof of Proposition 7.2.3 is complete. □
While we can consider studying stability of 𝐾ˆ using contraction arguments, this ap-
proach fails under useful parameterizations. To illustrate, suppose that X = { 𝑥1 }.
Then ℎ is a constant, 𝑃 is the identity, 𝑣 is a scalar and K̂𝑣 = 𝐹(𝑣) with 𝐹(𝑣) = (ℎ + 𝛽𝑣^{1/𝜃})^{𝜃}, as shown in Figure 7.5. Here 𝜃 = 5, ℎ = 0.5 and 𝛽 = 0.5. We see
that 𝐾ˆ has infinite slope at zero, so the contraction property fails.2
2 We could try to truncate the interval to a neighborhood of the fixed point and hope that 𝐾ˆ is a
contraction when restricted to this interval. But in higher dimensions we are not sure that a fixed point
exists for a broad range of parameters, which makes this idea hard to implement.
EXERCISE 7.2.9. Prove that, given the parameter values used for Figure 7.5, the function 𝐹 satisfies 𝐹′(𝑡) → ∞ as 𝑡 ↓ 0.
EXERCISE 7.3.1. In the last example, the certainty equivalent 𝑅 = 𝑃 is linear. Prove
that this is the only linear case. In particular, prove the following: if R(X) is the set
of all certainty equivalent operators on RX , then R(X) ∩ L ( RX ) = M ( RX ).
The next example is nonlinear. It treats the risk-adjusted expectation that appears
in risk-sensitive preferences.
Example 7.3.2. Let 𝑉 = RX and fix nonzero 𝜃 and 𝑃 ∈ M ( RX ). The entropic cer-
tainty equivalent operator is the operator 𝑅𝜃 on 𝑉 defined by
    (𝑅𝜃 𝑣)(𝑥) = (1/𝜃) ln { Σ_{𝑥′} exp(𝜃 𝑣(𝑥′)) 𝑃(𝑥, 𝑥′) }    (𝑣 ∈ 𝑉, 𝑥 ∈ X).
Example 7.3.3. As a third example, let 𝑉 be the interior of the positive cone, as in
§7.2.3.2, and fix 𝑃 ∈ M ( RX ). The operator
    (𝑅𝛾 𝑣)(𝑥) = { Σ_{𝑥′} 𝑣(𝑥′)^{𝛾} 𝑃(𝑥, 𝑥′) }^{1/𝛾}    (𝑣 ∈ 𝑉, 𝑥 ∈ X, 𝛾 ≠ 0)    (7.16)
EXERCISE 7.3.4. Let 𝑉 = RX and fix 𝑃 ∈ M ( RX ) and 𝜏 ∈ [0, 1]. Let 𝑅𝜏 be the
quantile certainty equivalent. That is, ( 𝑅𝜏 𝑣) ( 𝑥 ) = 𝑄 𝜏 𝑣 ( 𝑋 ) where 𝑋 ∼ 𝑃 ( 𝑥, ·) and 𝑄 𝜏
is the quantile functional (see page 32). More specifically,
    (𝑅𝜏 𝑣)(𝑥) = min { 𝑦 ∈ R : Σ_{𝑥′} 𝟙{𝑣(𝑥′) ⩽ 𝑦} 𝑃(𝑥, 𝑥′) ⩾ 𝜏 }    (𝑣 ∈ 𝑉, 𝑥 ∈ X).
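For concreteness, here is a small Julia sketch (the function name is ours) of the quantile certainty equivalent just described:

```julia
# (R_τ v)(x): the τ-quantile of v(X′) when X′ ∼ P(x, ·).
function quantile_ce(v, P, τ)
    idx = sortperm(v)                  # order states by the value v assigns them
    out = similar(v, Float64)
    for x in eachindex(v)
        cdf = cumsum(P[x, idx])        # distribution of v(X′) under P(x, ·)
        cdf[end] = 1.0                 # guard against floating point rounding
        out[x] = v[idx[findfirst(>=(τ), cdf)]]  # smallest y with P{v(X′) ≤ y} ≥ τ
    end
    return out
end
```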
EXERCISE 7.3.6. Let R(X) be the set of certainty equivalent operators on RX and
prove the following:
7.3.1.2 Properties
Example 7.3.5. Let 𝑉 be the interior of the positive cone of RX and fix 𝑃 ∈ M ( RX ).
In this setting, the Kreps–Porteus certainty equivalent operator 𝑅 𝛾 in (7.16) is subad-
ditive on 𝑉 when 𝛾 ⩾ 1 and superadditive on 𝑉 when 𝛾 ⩽ 1 (and, as usual, 𝛾 ≠ 0). The
subadditive case follows directly from Minkowski’s inequality, while the superadditive
case follows from the mean inequalities in Bullen (2003) (p. 202).
EXERCISE 7.3.7. Prove that the quantile certainty equivalent operator 𝑅𝜏 from
Exercise 7.3.4 is constant-subadditive.
EXERCISE 7.3.8. Show that the entropic certainty equivalent operator 𝑅𝜃 from
Example 7.3.2 is constant-subadditive.
    ‖𝑅𝑣 − 𝑅𝑤‖∞ ⩽ ‖𝑣 − 𝑤‖∞    for all 𝑣, 𝑤 ∈ R^X.
Later we will combine Lemma 7.3.1 with the fixed point results for convex and
concave operators in §7.1.2.2 to establish existence and uniqueness of lifetime values
for certain kinds of Koopmans operators.
7.3.1.3 Monotonicity
Let X be partially ordered and let 𝑖RX be the set of increasing functions in RX . Let 𝑉 be
such that 𝑖RX ⊂ 𝑉 ⊂ RX and let 𝑅 be a certainty equivalent on 𝑉 . We call 𝑅 monotone
increasing if 𝑅 is invariant on 𝑖RX . This extends the terminology in §3.2.1.3, where
we applied it to Markov operators (cf., Exercise 3.2.4 on page 94).
As shown below, the concept of monotone increasing certainty equivalent opera-
tors is connected to outcomes where lifetime preferences are increasing in the state.
EXERCISE 7.3.12. Show that the entropic certainty equivalent operator in Exam-
ple 7.3.2 is monotone increasing on 𝑉 = RX whenever 𝑃 is monotone increasing, for
all nonzero values of 𝜃.
7.3.1.4 Aggregation
Here CES stands for “constant elasticity of substitution.” An important special case
of both the CES and Uzawa aggregators is the
From these basic types we can also build composite aggregators. For example, we
might consider a CES-Uzawa aggregator of the form 𝐴 ( 𝑥, 𝑦 ) = {𝑟 ( 𝑥 ) 𝛼 + 𝑏 ( 𝑥 ) 𝑦 𝛼 }1/𝛼 with
𝑟, 𝑏 ∈ RX , 𝑏 ⩾ 0 and 𝛼 ≠ 0. As we will see in §7.3.3.3, the CES-Uzawa aggregator
can be used to construct models with both Epstein–Zin utility and state-dependent
discounting (as in, say, Albuquerque et al. (2016) or Schorfheide et al. (2018).)
𝐾 = 𝐴◦𝑅 (7.18)
Example 7.3.7. For risk-sensitive preferences, the Koopmans operator (page 223) can
be expressed as 𝐾𝜃 = 𝐴ADD ◦ 𝑅𝜃 , where 𝑅𝜃 is the entropic certainty equivalent operator.
    EIS = d ln(𝑦/𝑐) / d ln(𝑈𝑐/𝑈𝑦)    where    𝑈𝑐 ≔ ∂𝑈(𝑐, 𝑦)/∂𝑐    and    𝑈𝑦 ≔ ∂𝑈(𝑐, 𝑦)/∂𝑦.
The fact that EIS = 1/(1−𝛼) under the CES aggregator is significant because the EIS
can be measured from data using regression and other techniques. While estimates
vary significantly, the detailed meta-analysis by Havranek et al. (2015) suggests 0.5 as
a plausible average value for international studies, with rich countries tending slightly
higher. Basu and Bundick (2017) use 0.8 when calibrating to US data. Under these
estimates, the relationship EIS = 1/(1 − 𝛼) implies a value for 𝛼 between -1.0 and
-0.25.
Example 7.3.9. In the case of time additive preferences, lifetime value was defined in
(3.21) by 𝑣 = ( 𝐼 − 𝛽𝑃 ) −1 𝑟 . Equivalently, 𝑣 is the fixed point of the operator 𝐾 defined
by 𝐾𝑣 = 𝑟 + 𝛽𝑃𝑣. Since 𝐾 is globally stable, the fixed point is unique. In view of
Lemma 3.2.1 on page 94, it satisfies
    𝑣(𝑥) = E Σ_{𝑡⩾0} 𝛽^𝑡 𝑟(𝑋𝑡)    when (𝑋𝑡) is 𝑃-Markov and 𝑋0 = 𝑥.
In many applications, our existence and uniqueness proofs for fixed points of 𝐾
will also establish global stability. For Koopmans operators, global stability has the
following interpretation: for 𝑤 ∈ 𝑉 , 𝑚 ∈ N and 𝑥 ∈ X, the value ( 𝐾 𝑚 𝑤)( 𝑥 ) gives
total finite-horizon utility over periods 0, . . . , 𝑚 under the preferences embedded in
𝐾 , with initial state 𝑥 and terminal condition 𝑤. Hence global stability implies that, for
any choice of terminal condition, finite-horizon utility converges to infinite-horizon
utility as the time horizon converges to infinity. The next exercise helps to illustrate
this point.
Exercise 7.3.15 confirms that, at least for the time additive case, global stability of
𝐾 is equivalent to the statement that a finite-horizon valuation with arbitrary terminal
condition 𝑤 converges to the infinite-horizon valuation.
Let X = (X, ) be partially ordered, let 𝑖RX be the set of increasing functions in RX ,
and let 𝑉 be such that 𝑖RX ⊂ 𝑉 ⊂ RX . Let 𝐾 be a Koopmans operator on 𝑉 , so that
𝐾𝑣 = 𝐴 ◦ 𝑅 for some aggregator 𝐴 and certainty equivalent operator 𝑅 on 𝑉 . Suppose
that 𝐾 has a unique fixed point 𝑣∗ ∈ 𝑉 . A natural question is: when is 𝑣∗ increasing in
the state?
Proof. It is not difficult to check that, under the stated conditions, 𝐾 is invariant on
𝑖RX . It follows from Exercise 1.2.18 on page 22 that 𝑣∗ is increasing on X. □
The next proposition states conditions for global stability in settings where aggre-
gators have the Blackwell property.
Proposition 7.3.3. If 𝐴 is a Blackwell aggregator and 𝑅 is constant-subadditive, then
the Koopmans operator 𝐾 ≔ 𝐴 ◦ 𝑅 is a contraction on 𝑉 with respect to k · k ∞ .
Proof. Let the primitives be as stated. In view of Lemma 2.2.4 on page 62, and taking
into account the fact that 𝐾 is order-preserving, we need only show that there exists
a 𝛽 ∈ (0, 1) with 𝐾 ( 𝑣 + 𝜆 ) ⩽ 𝐾𝑣 + 𝛽𝜆 for all 𝑣 ∈ 𝑉 and 𝜆 ∈ R+ . To see this, fix 𝑣 ∈ 𝑉 and
𝜆 ∈ R+ . Applying constant-subadditivity of 𝑅 and monotonicity of 𝐴, we have
𝐾 ( 𝑣 + 𝜆 ) = 𝐴 (·, 𝑅 ( 𝑣 + 𝜆 )) ⩽ 𝐴 (·, 𝑅𝑣 + 𝜆 )
Since 𝐴 is a Blackwell aggregator, the last term is bounded by 𝐴 (·, 𝑅𝑣) + 𝛽𝜆 with 𝛽 < 1.
Hence 𝐾 ( 𝑣 + 𝜆 ) ⩽ 𝐾𝑣 + 𝛽𝜆 , and 𝐾 is a contraction of modulus 𝛽 on 𝑉 . □
We can now complete the proof of Proposition 7.2.2, which concerned global stability
of the Koopmans operator generated by risk-sensitive preferences.
Proof of Proposition 7.2.2. Fix 𝜃 ≠ 0 and recall that 𝐾𝜃 in (7.10) can be expressed as
𝐾𝜃 = 𝐴ADD ◦ 𝑅𝜃 when 𝑅𝜃 is the entropic certainty equivalent. Since 𝐴ADD is a Blackwell
aggregator and 𝑅𝜃 is constant-subadditive (Exercise 7.3.8), Proposition 7.3.3 applies.
In particular, 𝐾𝜃 is globally stable on RX . □
for 𝛽 ∈ (0, 1), 𝜏 ∈ [0, 1], 𝑟 ∈ RX and 𝑅𝜏 as described in Exercise 7.3.4. Since 𝑅𝜏 is
constant-subadditive (Exercise 7.3.7) and the additive aggregator is Blackwell, 𝐾𝜏 is
globally stable (Proposition 7.3.3). The operator 𝐾𝜏 represents quantile preferences,
as described in de Castro and Galvao (2019) and other studies (see §7.4). The value 𝜏
parameterizes attitude to risk, a point we return to in §8.2.1.4.
Now consider 𝐾𝑣 = 𝑟 + 𝑏𝑅𝑣 from (7.21) when 𝑅 is not in M ( RX ). Here 𝑏𝑅𝑣 is the
pointwise product, so that ( 𝑏𝑅𝑣)( 𝑥 ) = 𝑏 ( 𝑥 )( 𝑅𝑣)( 𝑥 ) for all 𝑥 .
We cannot use Proposition 7.3.3 to prove stability of 𝐾 unless 𝑏 ( 𝑥 ) < 1 for all 𝑥 ∈ X.
Since this condition is rather strict, we now study weaker conditions that can be valid
even when 𝑏 exceeds 1 in some states. Specifically, we consider
A fixed point of 𝐾 corresponds to lifetime value for an agent with Epstein–Zin prefer-
ences and state-dependent discounting. (Such set ups are used in research on macroe-
conomic dynamics and asset pricing – see §7.4 for more details).
In what follows we take 𝑉 = (0, ∞) X and assume that ℎ, 𝑏 ∈ 𝑉 and 𝑃 is irreducible.
EXERCISE 7.3.22. Prove: (𝑉, 𝐾 ) and (𝑉, 𝐾ˆ ) are topologically conjugate under Φ.
Proof of Proposition 7.3.5. In view of Exercise 7.3.22, it suffices to show that 𝐾ˆ is glob-
ally stable on 𝑉 if and only if 𝜌(𝐴)^{𝛼/𝛾} < 1. This is implied by Theorem 7.1.4, since 𝐴 is irreducible (see Exercise 7.3.20 on page 241) and 𝜌(𝐴)^{1/𝜃} = 𝜌(𝐴)^{𝛼/𝛾}. □
dation was supplied by Koopmans (1960). Bastianello and Faro (2022) study the
foundations of discounted expected utility from a purely subjective framework.
Problems with the time additive discounted utility model include non-constant
discounting, as discussed in §6.4, as well as sign effects (gains being discounted more
than losses) and magnitude effects (small outcomes being discounted more than large ones). See, for example, Thaler (1981) and Benzion et al. (1989). A critical review of
the time additive model and a list of many references can be found in Frederick et al.
(2002).
In the stochastic setting, the time additive framework is a subset of the expected
utility model (Von Neumann and Morgenstern (1944), Friedman (1956), Savage
(1951)). There are many well documented departures from expected utility in ex-
perimental data. See the start of Andreoni and Sprenger (2012) and the article Eric-
son and Laibson (2019) for an introduction to the literature. An interesting historical
discussion of time additive expected utility can be found in Becker et al. (1989).
(It is ironic that those most responsible for popularizing the time additive dis-
counted expected utility (DEU) framework have also been among the most critical.
For example, Samuelson (1939) stated that it is “completely arbitrary” to assume that
the DEU specification holds. He goes on to claim that, in the analysis of savings and
consumption, it is “extremely doubtful whether we can learn much from considering
such an economic man.” In addition, Stokey and Lucas (1989), whose work helped
to standardize DEU as a methodology for quantitative analysis, argued in a separate
study that DEU is attractive only because of its relative simplicity (Lucas and Stokey,
1984).)
Do the departures from time additive expected utility found in experimental data
actually matter for quantitative work? Evidence suggests that the answer is affirma-
tive. In macroeconomics and asset pricing in particular, researchers increasingly use
non-additive preferences in order to bring model outputs closer to the data. For ex-
ample, many quantitative models of asset pricing rely heavily on Epstein–Zin prefer-
ences. Representative examples include Epstein and Zin (1991), Tallarini Jr (2000),
Bansal and Yaron (2004), Hansen et al. (2008), Bansal et al. (2012), Schorfheide
et al. (2018), and de Groot et al. (2022). Alternative numerical solution methods are
discussed in Pohl et al. (2018).
An excellent introduction to recursive preference models can be found in Backus
et al. (2004). Our use of the term “Koopmans operator,” which is not entirely stan-
dard, honors early contributions by Nobel laureate Tjalling Koopmans on recursive
preferences (see Koopmans (1960) and Koopmans et al. (1964)).
Theoretical properties of recursive preference models have been studied in many
papers, including Epstein and Zin (1989), Weil (1990), Boyd (1990), Hansen and
While the MDP model from Chapters 5–6 is elegant and widely used, researchers in
economics, finance, and other fields are working to extend it. Reasons include:
(i) MDP theory cannot be applied to settings where lifetime values are described
by the kinds of nonlinear recursions discussed in Chapter 7.
(ii) Equilibria in some models of production and economic geography can be com-
puted using dynamic programming but not all such programming problems fit
within the MDP framework.
(iii) Dynamic programming problems that include adversarial agents to promote ro-
bust decision rules can fail to be MDPs.
To handle such departures from the MDP assumptions, we now construct a more
general dynamic programming framework, building on an approach to optimization
initially developed by Denardo (1967) and extended by Bertsekas (2022b). Further
references are provided in §8.4.
We start this chapter by building a framework that centers on an abstract repre-
sentation of the Bellman equation (§8.1). We then state optimality results and show
how they can be verified in a range of applications. We defer proofs of core optimal-
ity results to Chapter 9, where we strip dynamic programs down to their essence by
adopting a purely operator-theoretic perspective.
    𝑣(𝑥) = max_{𝑎∈Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑣).    (8.1)

    G ≔ {(𝑥, 𝑎) ∈ X × A : 𝑎 ∈ Γ(𝑥)}
The monotonicity condition (8.2) is natural: if, relative to 𝑣, rewards are at least
as high for 𝑤 in every future state, then the total rewards one can extract under 𝑤
should be at least as high. The consistency condition in (8.3) ensures that as we
consider values of different policies we remain within the value space 𝑉 .
The MDP framework is a special case of the RDP framework:
Example 8.1.1. Consider MDP M = ( Γ, 𝛽, 𝑟, 𝑃 ) with state space X and action space A
(see, e.g., §5.1.1). We can frame M as an RDP by taking Γ as unchanged, 𝑉 = RX ,
and

    𝐵(𝑥, 𝑎, 𝑣) = 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′)    ((𝑥, 𝑎) ∈ G, 𝑣 ∈ 𝑉).    (8.4)
Now ( Γ, 𝑉, 𝐵) forms an RDP. The monotonicity condition (8.2) clearly holds and the
consistency condition (8.3) is trivial, since 𝑉 is all of RX . Inserting (8.4) into the ab-
stract Bellman equation (8.1) recovers the MDP Bellman equation ((5.2) on page 129).
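In code, the abstraction is equally direct. The sketch below (ours, not the book's) applies the Bellman update (8.1) for any RDP specified by an aggregator function B and a feasible correspondence Γ; the trailing comment shows how the MDP aggregator (8.4) fits this interface, with hypothetical arrays r and P.

```julia
# One application of the Bellman operator v(x) = max_{a ∈ Γ(x)} B(x, a, v).
function bellman_update(B, Γ, v, x_vals)
    return [maximum(B(x, a, v) for a in Γ(x)) for x in x_vals]
end

# For the MDP case (8.4), with hypothetical arrays r[x, a] and P[x, a, x′]:
#   B(x, a, v) = r[x, a] + β * sum(v[y] * P[x, a, y] for y in eachindex(v))
```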
Example 8.1.2. Consider a basic cake eating problem (see §5.1.2.3), where X is a
finite subset of R+ and 𝑥 ∈ X is understood to be the number of remaining slices of
cake today. Let 𝑥′ be the number of remaining slices next period and 𝑢(𝑥 − 𝑥′) be the utility from slices enjoyed today. The utility function 𝑢 maps R+ to R. Let 𝑉 = R^X, let Γ be defined by Γ(𝑥) = {𝑥′ ∈ X : 𝑥′ ⩽ 𝑥} and let

    𝐵(𝑥, 𝑥′, 𝑣) = 𝑢(𝑥 − 𝑥′) + 𝛽𝑣(𝑥′).
Then ( Γ, 𝑉, 𝐵) is an RDP with Bellman equation identical to that of the original cake
eating problem in §5.1.2.3. The monotonicity condition (8.2) and the consistency
condition (8.3) are easy to verify.
The last example is a special case of Example 8.1.1, since the cake eating problem
is an MDP (see §5.1.2.3). Nonetheless, Example 8.1.2 is instructive because, for cake
eating, the MDP construction is tedious (e.g., we need to define a stochastic kernel 𝑃
even though transitions are deterministic), while the RDP construction is straightfor-
ward.
The next example makes a related point.
Example 8.1.3. In §5.1.2.4 we showed that the job search model is an MDP but the
construction was tedious. But we can also represent job search as an RDP and the em-
bedding is straightforward. To see this, recall that, for an arbitrary optimal stopping
problem with primitives as described in Chapter 4, the Bellman equation is
    𝑣(𝑥) = max { 𝑒(𝑥), 𝑐(𝑥) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑥′) }    (𝑥 ∈ X).    (8.5)
for 𝑥 ∈ X and 𝑎 ∈ A ≔ {0, 1}. Then ( Γ, 𝑉, 𝐵) is an RDP (Exercise 8.1.1) and setting
𝑣 ( 𝑥 ) = max 𝑎∈ Γ ( 𝑥 ) 𝐵 ( 𝑥, 𝑎, 𝑣) reproduces the Bellman equation (8.5).
EXERCISE 8.1.1. Verify that conditions (8.2)–(8.3) hold for this RDP.
We assume that ( 𝑍𝑡 ) is 𝑄 -Markov on finite set Z and (𝑌𝑡 ) takes values in finite set Y.
With state space X ≔ Y × Z, action space Y, feasible correspondence 𝑥 ↦→ Γ ( 𝑥 ), value
space 𝑉 = RX and aggregator
    𝐵(𝑥, 𝑎, 𝑣) = 𝐵((𝑦, 𝑧), 𝑦′, 𝑣) = 𝐹(𝑦, 𝑧, 𝑦′) + 𝛽 Σ_{𝑧′} 𝑣(𝑦′, 𝑧′) 𝑄(𝑧, 𝑧′),
EXERCISE 8.1.2. Show that this RDP can also be expressed as an MDP.
Examples 8.1.1–8.1.4 treated RDPs that can be embedded into the MDP frame-
work. In the remaining examples, we consider models that cannot be represented as
MDPs.
Example 8.1.6. We can modify the MDP in Example 8.1.1 to use risk-sensitive pref-
erences. We do this by taking Γ, 𝑉 to be the same as the MDP example and setting
    𝐵(𝑥, 𝑎, 𝑣) = 𝑟(𝑥, 𝑎) + 𝛽 (1/𝜃) ln { Σ_{𝑥′} exp(𝜃 𝑣(𝑥′)) 𝑃(𝑥, 𝑎, 𝑥′) }    (8.9)
Example 8.1.7 (Epstein–Zin Preferences). We can also modify the MDP in Exam-
ple 8.1.1 to use the Epstein–Zin specification (see (7.12) on page 226) by setting
Example 8.1.8. The shortest path problem considers optimal traversal of a directed
graph G = (X, 𝐸), where X is the vertices of the graph and 𝐸 is the edges. A weight
function 𝑐 : 𝐸 → R+ associates a cost with each edge (𝑥, 𝑥′) ∈ 𝐸. The aim is to find the minimum cost path from 𝑥 to a specified vertex 𝑑 for every 𝑥 ∈ X. Under some conditions, the problem can be solved by applying a Bellman operator of the form (𝑇𝑣)(𝑥) = min_{𝑥′∈O(𝑥)} { 𝑐(𝑥, 𝑥′) + 𝑣(𝑥′) } for 𝑥 ∈ X, where O(𝑥) is the set of direct successors of 𝑥.
We aim to discuss optimality of RDPs. To prepare for this topic, we now clarify lifetime
values associated with different policy choices in the RDP setting.
Let R = ( Γ, 𝑉, 𝐵) be an RDP with state and action spaces X and A, and let Σ be the
set of all feasible policies. For each 𝜎 ∈ Σ we introduce the policy operator 𝑇𝜎 as a
self-map on 𝑉 defined by (𝑇𝜎𝑣)(𝑥) = 𝐵(𝑥, 𝜎(𝑥), 𝑣) for all 𝑥 ∈ X.
The RDP policy operator is a direct generalization of the MDP policy operator defined
on page 135, as well as the optimal stopping policy operator defined on page 108.
Example 8.1.9. For the optimal stopping problem discussed in Chapter 4, the function
𝑣𝜎 that records the lifetime value of a policy 𝜎 from any given state is the unique fixed
point of the optimal stopping policy operator 𝑇𝜎 . See §4.1.1.3.
Example 8.1.10. For the MDP model discussed in Chapter 5, lifetime value of policy
𝜎 is given by 𝑣𝜎 = ( 𝐼 − 𝛽𝑃𝜎 ) −1 𝑟𝜎 . As discussed in Exercise 5.1.7, 𝑣𝜎 is the unique fixed
point of the MDP policy operator 𝑇𝜎 defined by 𝑇𝜎 𝑣 = 𝑟𝜎 + 𝛽𝑃𝜎 𝑣.
Example 8.1.11. For the MDP model with state-dependent discounting introduced
in Chapter 6, Exercise 6.2.1 shows that the lifetime value of following policy 𝜎 is the
unique fixed point of the policy operator 𝑇𝜎 defined in (6.16) on page 192.
The previous examples are linear but the same idea extends to nonlinear recursive
preference models as well. To see this, recall the generic Koopmans operator ( 𝐾𝑣)( 𝑥 ) =
𝐴 ( 𝑥, ( 𝑅𝑣) ( 𝑥 )) introduced in §7.3.1. Lifetime value is the unique fixed point of this
operator whenever it exists. In all of the RDP examples we have considered, the policy
operator can be expressed as (𝑇𝜎 𝑣) ( 𝑥 ) = 𝐴𝜎 ( 𝑥, ( 𝑅𝜎 𝑣)( 𝑥 )) for some aggregator 𝐴𝜎 and
certainty equivalent operator 𝑅𝜎 . Hence 𝑇𝜎 is a Koopmans operator and lifetime value
associated with policy 𝜎 is the fixed point of this operator.
Let R = ( Γ, 𝑉, 𝐵) be a given RDP with policy operators {𝑇𝜎 }. Given that our objective
is to maximize lifetime value over the set of policies in Σ, we need to assume at the
very least that lifetime value is well defined at each policy. To this end, we say that R
is well-posed whenever 𝑇𝜎 has a unique fixed point 𝑣𝜎 in 𝑉 for all 𝜎 ∈ Σ.
Example 8.1.12. The optimal stopping RDP we introduced in Example 8.1.3 is well-
posed. Indeed, for each 𝜎 ∈ Σ, the policy operator 𝑇𝜎 has a unique fixed point in RX
by Proposition 4.1.1 on page 108.
Example 8.1.13. The RDP generated by the MDP model in Example 8.1.1 is well-
posed, since, for each 𝜎 ∈ Σ, the operator 𝑇𝜎 = 𝑟𝜎 + 𝛽𝑃𝜎 has the unique fixed point
𝑣𝜎 = ( 𝐼 − 𝛽𝑃𝜎 ) −1 𝑟𝜎 in RX .
Example 8.1.14. The shortest path problem discussed in Example 8.1.8 is not well-
posed without further assumptions. For example, consider a graph that contains two vertices 𝑥 ≠ 𝑑 and 𝑥′ ≠ 𝑑 joined by edges (𝑥, 𝑥′) and (𝑥′, 𝑥) with positive costs. A policy 𝜎 that cycles between 𝑥 and 𝑥′ yields a policy operator 𝑇𝜎 with no fixed point, since summing 𝑣(𝑥) = 𝑐(𝑥, 𝑥′) + 𝑣(𝑥′) and 𝑣(𝑥′) = 𝑐(𝑥′, 𝑥) + 𝑣(𝑥) would force the total cycle cost to be zero.
Let R be an RDP with policy operators {𝑇𝜎 }𝜎∈Σ . In what follows, we call R globally
stable if 𝑇𝜎 is globally stable on 𝑉 for all 𝜎 ∈ Σ.
Example 8.1.15. The optimal stopping RDP we introduced in Example 8.1.3 is glob-
ally stable, since, for each 𝜎 ∈ Σ, the policy operator 𝑇𝜎 is globally stable on RX by
Proposition 4.1.1 on page 108.
Example 8.1.16. The RDP generated by the MDP model in Example 8.1.1 is globally
stable. See Exercise 5.1.7 on page 135.
In §8.1.3 we will see that global stability yields strong optimality properties.
8.1.2.3 Continuity
We call R continuous if, for every (𝑥, 𝑎) ∈ G, 𝑣𝑘 → 𝑣 in 𝑉 implies 𝐵(𝑥, 𝑎, 𝑣𝑘) → 𝐵(𝑥, 𝑎, 𝑣). Continuity is satisfied by all applications considered in this text. For example, for the
RDP generated by an MDP (Example 8.1.1), the deviation | 𝐵 ( 𝑥, 𝑎, 𝑣𝑘 ) − 𝐵 ( 𝑥, 𝑎, 𝑣)| is
dominated by 𝛽 k 𝑣𝑘 − 𝑣 k ∞ for all ( 𝑥, 𝑎) ∈ G. Hence continuity holds.
Below we will see that continuity is useful when considering convergence of certain
algorithms.
8.1.3 Optimality
Given 𝑣 ∈ 𝑉, we call a policy 𝜎 ∈ Σ 𝑣-greedy if 𝜎(𝑥) ∈ argmax_{𝑎∈Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑣) for all 𝑥 ∈ X. Since Γ(𝑥) is finite and nonempty at each 𝑥 ∈ X, at least one such policy exists. As
with policy operators, the notion of greedy policies extends existing definitions from
earlier chapters.
EXERCISE 8.1.7. Show that, for each 𝑣 ∈ 𝑉 , the set {𝑇𝜎 𝑣}𝜎∈Σ ⊂ 𝑉 contains a
least and greatest element (see §2.2.1.2 for the definitions). Explain the connection
between the greatest element and 𝑣-greedy policies.
The Bellman operator associated with the RDP is defined by
(𝑇𝑣)(𝑥) = max_{𝑎∈Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑣)    (𝑥 ∈ X).
Example 8.1.17. For the Epstein–Zin RDP in (8.10), the Bellman operator is given
by
(𝑇𝑣)(𝑥) = max_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + 𝛽 [ Σ_{𝑥′} 𝑣(𝑥′)^𝛾 𝑃(𝑥, 𝑎, 𝑥′) ]^{𝛼/𝛾} }^{1/𝛼}    (𝑥 ∈ X).
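A minimal Julia sketch of this Bellman operator is shown below; the array shapes and the assumption that Γ(𝑥) is all of A are illustrative simplifications, not part of Example 8.1.17.

# Epstein–Zin Bellman operator from Example 8.1.17, acting on a value
# vector v.  Here Γ(x) = A for every x, an assumption made to keep the
# sketch short; r and P are as in the earlier MDP sketch.
function T_ez(v, r, P, β, α, γ)
    nx, na = size(r)
    Tv = similar(v)
    for x in 1:nx
        vals = ((r[x, a] + β * sum(v[x′]^γ * P[x, a, x′] for x′ in 1:nx)^(α / γ))^(1 / α)
                for a in 1:na)
        Tv[x] = maximum(vals)
    end
    return Tv
end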
EXERCISE 8.1.8. Given RDP R = ( Γ, 𝑉, 𝐵) with policy operators {𝑇𝜎 } and Bellman
operator 𝑇 , show that, for each 𝑣 ∈ 𝑉 ,
(i) 𝑇𝑣 = ⋁_{𝜎} 𝑇𝜎𝑣 ≔ ⋁_{𝜎∈Σ} (𝑇𝜎𝑣), and
(ii) 𝜎 is 𝑣-greedy if and only if 𝑇 𝑣 = 𝑇𝜎 𝑣.
(iii) 𝑇 is an order-preserving self-map on 𝑉 .
EXERCISE 8.1.9. Show that, for a given RDP ( Γ, 𝑉, 𝐵) and fixed 𝑣 ∈ 𝑉 , the Bellman
operator 𝑇 obeys
(𝑇^𝑘𝑣)(𝑥) = max_{𝑎∈Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑇^{𝑘−1}𝑣)    (8.14)
for all 𝑘 ∈ Z+ and all 𝑥 ∈ X. Show, in addition, that for any policy 𝜎 ∈ Σ, the policy
operator 𝑇𝜎 obeys
(𝑇𝜎^𝑘𝑣)(𝑥) = 𝐵(𝑥, 𝜎(𝑥), 𝑇𝜎^{𝑘−1}𝑣)    (8.15)
for all 𝑘 ∈ Z+ and all 𝑥 ∈ X.
8.1.3.2 Algorithms
To solve RDPs for optimal policies, we use two core algorithms: Howard policy itera-
tion (HPI) and optimistic policy iteration (OPI). As in previous chapters, OPI includes
VFI as a special case.
To describe HPI we take R = ( Γ, 𝑉, 𝐵) to be a well-posed RDP with feasible policy set
Σ, policy operators {𝑇𝜎 }, and Bellman operator 𝑇 . In this setting, the HPI algorithm is
essentially identical to the one given for MDPs in §5.1.4.2, except that 𝑣𝜎 is calculated
as the fixed point of 𝑇𝜎 , rather than taking the specific form ( 𝐼 − 𝛽𝑃𝜎 ) −1 𝑟𝜎 . The details
are in Algorithm 8.1.
Algorithm 8.1 is somewhat ambiguous, since it is not always clear how to imple-
ment the instruction “ 𝑣𝑘 ← the fixed point of 𝑇𝜎𝑘 ”. However, if R is globally stable,
then each 𝑇𝜎𝑘 is globally stable, so an approximation of the fixed point can be calcu-
lated by iterating with 𝑇𝜎𝑘 . This line of thought leads us to consider optimistic policy
iteration (OPI) as a more practical alternative. Algorithm 8.2 states an OPI routine
for solving R that generalizes the MDP OPI routine in §5.1.4.
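Algorithms 8.1 and 8.2 are not reproduced here, but the following sketch indicates how OPI might be organized in Julia for a generic RDP. The caller supplies the aggregator B(x, a, v) and the feasible correspondence Γ; the function name, the indexing of states as 1:nx, and the default parameters are assumptions made for illustration only.

# Optimistic policy iteration for a generic RDP with states 1:nx.
# Setting m = 1 recovers value function iteration.
function opi(B, Γ, nx; m=20, tol=1e-8, max_iter=10_000)
    v = zeros(nx)
    σ = [first(Γ(x)) for x in 1:nx]
    for _ in 1:max_iter
        σ = [argmax(a -> B(x, a, v), Γ(x)) for x in 1:nx]   # v-greedy policy
        v_new = copy(v)
        for _ in 1:m                                        # m policy steps
            v_new = [B(x, σ[x], v_new) for x in 1:nx]
        end
        if maximum(abs.(v_new - v)) < tol
            return v_new, σ
        end
        v = v_new
    end
    return v, σ
end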
EXERCISE 8.1.10. Prove that, for the sequence ( 𝑣𝑘 ) in the OPI algorithm 8.2, we
have 𝑣𝑘 = 𝑇 𝑘 𝑣0 whenever 𝑚 = 1. (In other words, OPI reduces to VFI when 𝑚 = 1.)
8.1.3.3 Optimality
Let R be a well-posed RDP with policy operators {𝑇𝜎 } and 𝜎-value functions { 𝑣𝜎 }.
In this context, we set 𝑣∗ ≔ ⋁_{𝜎∈Σ} 𝑣𝜎 ∈ RX and call 𝑣∗ the value function of R. By definition, 𝑣∗ satisfies 𝑣∗(𝑥) = max_{𝜎∈Σ} 𝑣𝜎(𝑥) for all 𝑥 ∈ X. A policy 𝜎 ∈ Σ is called optimal for R if 𝑣𝜎 = 𝑣∗.
Both of these definitions generalize the definitions we used for MDPs and optimal
stopping. In particular, optimality of a policy means that it generates maximum pos-
sible lifetime value from every state.
We say that R satisfies Bellman's principle of optimality if the optimal policies are exactly the 𝑣∗-greedy policies; that is, 𝜎 ∈ Σ is optimal for R if and only if 𝜎 is 𝑣∗-greedy.
We can now state our main optimality result for RDPs. In the statement, R is a well-
posed RDP with value function 𝑣∗ .
As OPI includes VFI as a special case (𝑚 = 1), Theorem 8.1.1 also implies conver-
gence of VFI under the stated conditions.
In terms of applications, Theorem 8.1.1 is the most important optimality result in
this book. It provides the core optimality results from dynamic programming and a
broadly convergent algorithm for computing optimal policies.
The proof of Theorem 8.1.1 is deferred to §9.1.
Example 8.1.18. The optimality results for optimal stopping problems we presented
in Chapter 4 are a special case of Theorem 8.1.1, since such optimal stopping problems
generate globally stable RDPs (as discussed in Example 8.1.15).
Example 8.1.19. The optimality results for MDPs we presented in Chapter 5 are a
special case of Theorem 8.1.1, since MDPs generate globally stable RDPs (as discussed
in Example 8.1.16).
Up until now we have focused entirely on stationary policies, in the sense that the
same policy is used at every point in time. What if we drop this assumption and
admit the option to change policies? Might this lead to higher lifetime values?
In this section, we show that for globally stable RDPs the answer is negative. This
finding justifies our focus on stationary policies.
To begin, let R = ( Γ, 𝑉, 𝐵) be a globally stable RDP. Recall from Remark 8.1.1 that,
given 𝑣 ∈ 𝑉 , 𝜎 ∈ Σ, 𝑘 ∈ N and 𝑥 ∈ X, the value (𝑇𝜎𝑘 𝑣)( 𝑥 ) gives finite horizon utility over
periods 0, . . . , 𝑘 under policy 𝜎, with initial state 𝑥 and terminal condition 𝑣. Extending
this idea, it is natural to understand 𝑇𝜎𝑘 𝑇𝜎𝑘−1 · · · 𝑇𝜎1 𝑣 as providing finite horizon utility
values for the nonstationary policy sequence ( 𝜎𝑘 ) 𝑘∈N ⊂ Σ, given terminal condition
𝑣 ∈ 𝑉 . For the same policy sequence, we define its lifetime value via
𝑣̄ ≔ lim sup_{𝑘→∞} 𝑣𝑘   with   𝑣𝑘 ≔ 𝑇𝜎𝑘 𝑇𝜎𝑘−1 · · · 𝑇𝜎1 𝑣
EXERCISE 8.1.11. Show that, under the stated conditions, 𝑣𝑘 ⩽ 𝑇 𝑘 𝑣 for all 𝑘 ∈ N.
We show below that boundedness can be used to obtain optimality results for well-
posed RDPs, even without global stability.
Another attractive feature of boundedness is that it permits a reduction of value
space, as illustrated by the next two exercises.
EXERCISE 8.1.13. Adopt the setting of Exercise 8.1.12 and suppose, in addition,
that the RDP is well-posed. Show that 𝑣𝜎 ∈ 𝑉ˆ for all 𝜎 ∈ Σ.
Exercise 8.1.13 implies the reduced RDP ( Γ, 𝑉ˆ, 𝐵) is also well-posed under the
stated conditions, and that it contains all the 𝜎-value functions and the value function
from the original RDP ( Γ, 𝑉, 𝐵). Hence any optimality results for ( Γ, 𝑉ˆ, 𝐵) carry over
to ( Γ, 𝑉, 𝐵).
EXERCISE 8.1.14. Show that the RDP generated by an MDP in Example 8.1.1 is
bounded.
EXERCISE 8.1.15. Show that the optimal stopping RDP from Example 8.1.3 is
bounded.
EXERCISE 8.1.17. Consider the shortest path RDP ( Γ, 𝑉, 𝐵) in Example 8.1.8 and
assume in addition that the graph G contains only one cycle, which is a self-loop at 𝑑 ,
that 𝑑 is accessible from every vertex 𝑥 ∈ X, and that 𝑐 ( 𝑑, 𝑑 ) = 0. (These assumptions
imply that every path leads to 𝑑 in finite time and that travelers reaching 𝑑 remain
there forever at zero cost.) Let 𝐶 ( 𝑥 ) be the maximum cost of traveling to 𝑑 from 𝑥 ,
which is finite by the stated assumptions. Show that (8.20) holds when 𝑣1 ≔ 0 and
𝑣2 ≔ 𝐶 .
The next result shows that, when considering optimality, stability can be replaced
by boundedness.
Theorem 8.1.2. If R is well-posed and bounded, then (i)–(iv) of Theorem 8.1.1 hold.
𝐵(𝑥, 𝑎, 𝑣) = 𝜑^{−1}[ 𝐵̂(𝑥, 𝑎, 𝜑 ∘ 𝑣) ]   for all 𝑣 ∈ 𝑉 and (𝑥, 𝑎) ∈ G.    (8.21)
The benefit of Proposition 8.1.3 is that one of these models might be easier to
analyze than the other. We apply the proposition to the Epstein–Zin specification in
§8.1.4.1 and to a smooth ambiguity model in §8.3.4. The next exercise will be useful
for the proof.
In the next section we will see how these ideas can simplify optimality analysis.
In this section we show how the preceding optimality results and the notion of topological conjugacy can be deployed to analyze the Epstein–Zin RDP from Example 8.1.7.
Recall that the aggregator in Example 8.1.7 is
𝐵(𝑥, 𝑎, 𝑣) = { 𝑟(𝑥, 𝑎) + 𝛽 [ Σ_{𝑥′} 𝑣(𝑥′)^𝛾 𝑃(𝑥, 𝑎, 𝑥′) ]^{𝛼/𝛾} }^{1/𝛼}.    (8.22)
To prove Proposition 8.1.4, we set up a simpler and more tractable model. Our
first step is to introduce another RDP by setting
𝐵(𝑥, 𝑎, 𝑣) = 𝐵̂(𝑥, 𝑎, 𝑣^𝛾)^{1/𝛾},    (8.23)
with
𝐵̂(𝑥, 𝑎, 𝑣) = { 𝑟(𝑥, 𝑎) + 𝛽 [ Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) ]^{1/𝜃} }^𝜃,    (8.24)
where 𝜃 ≔ 𝛾 /𝛼.
The value of introducing R̂ comes from the fact that R̂ is easier to work with than
R (just as the modified Epstein–Zin Koopmans operator 𝐾ˆ defined in §7.2.3.3 turned
out to be easier to work with than the original Epstein–Zin Koopmans operator 𝐾
introduced in §7.2.3.2).
EXERCISE 8.1.19. Prove that R and R̂ are topologically conjugate RDPs (see §8.1.4).
Lemma 8.1.5. The RDP R̂ is globally stable.
Proof. In view of (8.24), each policy operator 𝑇̂𝜎 associated with R̂ takes the form
(𝑇̂𝜎𝑣)(𝑥) = { 𝑟(𝑥, 𝜎(𝑥)) + 𝛽 [ Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝜎(𝑥), 𝑥′) ]^{1/𝜃} }^𝜃.    (8.25)
Each such 𝑇̂𝜎 is a special case of 𝐾̂ defined on page 229 by 𝐾̂𝑣 = (ℎ + 𝛽(𝑃𝑣)^{1/𝜃})^𝜃 (see
(7.15)). We saw in §7.2.3.3 that this operator is globally stable under the stated
assumptions. Hence R̂ is a globally stable RDP. □
Proof of Proposition 8.1.4. Exercise 8.1.19 and Proposition 8.1.3 on page 260 together
imply that R is globally stable if and only if R̂ is globally stable. The claim that R is
globally stable now follows from Lemma 8.1.5. □
In §8.1 we showed that well-posed RDPs have strong optimality properties whenever
they are globally stable or bounded, and that VFI and OPI converge whenever they
are globally stable. But what conditions are sufficient for these properties? We start
with a relatively strict condition based on contractivity and then progress to models
that fail to be contractive.
In this section we study RDPs with strong contraction properties. Many traditional
dynamic programs fit into this framework.
Let R = ( Γ, 𝑉, 𝐵) be an RDP with state space X, action space A, and feasible state-action
pair set G. We call R contracting if there exists a 𝛽 < 1 such that
|𝐵(𝑥, 𝑎, 𝑣) − 𝐵(𝑥, 𝑎, 𝑤)| ⩽ 𝛽 ‖𝑣 − 𝑤‖_∞   for all (𝑥, 𝑎) ∈ G and 𝑣, 𝑤 ∈ 𝑉.    (8.26)
In line with the terminology for contraction maps, we call 𝛽 the modulus of contrac-
tion for R when (8.26) holds.
Example 8.2.1. The optimal stopping RDP from Example 8.1.3 is contracting with
modulus 𝛽 , since, for 𝐵 in (8.6), an application of the triangle inequality gives
|𝐵(𝑥, 𝑎, 𝑣) − 𝐵(𝑥, 𝑎, 𝑤)| = (1 − 𝑎) 𝛽 | Σ_{𝑥′} [𝑣(𝑥′) − 𝑤(𝑥′)] 𝑃(𝑥, 𝑥′) | ⩽ 𝛽 ‖𝑣 − 𝑤‖_∞.
Proposition 8.2.1. If R is contracting with modulus 𝛽 , then 𝑇 and {𝑇𝜎 }𝜎∈Σ are all con-
tractions of modulus 𝛽 on 𝑉 under the norm k · k ∞ .
Corollary 8.2.2 tells us that contracting RDPs are globally stable and, as a result, the
sequence of functions in 𝑉 generated by VFI (Algorithm 8.2 with 𝑚 = 1) converges to
𝑣∗. However, this result is asymptotic and requires 𝑣0 = 𝑣𝜎 for some 𝜎 ∈ Σ. We
can improve this result in the current setting by leveraging the contraction property:
Proposition 8.2.3. If (𝑣𝑘) is the sequence generated by VFI and 𝜎 is 𝑣𝑘-greedy, then
‖𝑣∗ − 𝑣𝜎‖_∞ ⩽ (2𝛽/(1 − 𝛽)) ‖𝑣𝑘 − 𝑣𝑘−1‖_∞   for all 𝑘 ∈ N.    (8.28)
Since the VFI algorithm terminates when k 𝑣𝑘 − 𝑣𝑘−1 k ∞ falls below a given tolerance,
the result in (8.28) directly provides a quantitative bound on the performance of the
policy returned by VFI.
Proof of Proposition 8.2.3. Let ( Γ, 𝑉, 𝐵) and 𝑣 be as stated and let 𝑣∗ be the value func-
tion. Note that
‖𝑣∗ − 𝑣𝜎‖_∞ ⩽ ‖𝑣∗ − 𝑣𝑘‖_∞ + ‖𝑣𝑘 − 𝑣𝜎‖_∞.    (8.29)
To bound the first term on the right-hand side of (8.29), we use the fact that 𝑣∗ is a
fixed point of 𝑇 , obtaining
‖𝑣∗ − 𝑣𝑘‖_∞ ⩽ ‖𝑣∗ − 𝑇𝑣𝑘‖_∞ + ‖𝑇𝑣𝑘 − 𝑣𝑘‖_∞ ⩽ 𝛽‖𝑣∗ − 𝑣𝑘‖_∞ + 𝛽‖𝑣𝑘 − 𝑣𝑘−1‖_∞.
Hence
‖𝑣∗ − 𝑣𝑘‖_∞ ⩽ (𝛽/(1 − 𝛽)) ‖𝑣𝑘 − 𝑣𝑘−1‖_∞.    (8.30)
Now consider the second term on the right-hand side of (8.29). Since 𝜎 is 𝑣𝑘 -greedy,
we have 𝑇 𝑣𝑘 = 𝑇𝜎 𝑣𝑘 , and
‖𝑣𝑘 − 𝑣𝜎‖_∞ ⩽ ‖𝑣𝑘 − 𝑇𝑣𝑘‖_∞ + ‖𝑇𝑣𝑘 − 𝑣𝜎‖_∞ = ‖𝑇𝑣𝑘−1 − 𝑇𝑣𝑘‖_∞ + ‖𝑇𝜎𝑣𝑘 − 𝑇𝜎𝑣𝜎‖_∞.
∴ ‖𝑣𝑘 − 𝑣𝜎‖_∞ ⩽ 𝛽‖𝑣𝑘−1 − 𝑣𝑘‖_∞ + 𝛽‖𝑣𝑘 − 𝑣𝜎‖_∞.
∴ ‖𝑣𝑘 − 𝑣𝜎‖_∞ ⩽ (𝛽/(1 − 𝛽)) ‖𝑣𝑘 − 𝑣𝑘−1‖_∞.    (8.31)
Together, (8.29), (8.30), and (8.31) give us (8.28). □
Next we state a useful condition for contractivity that is related to Blackwell’s suffi-
cient condition discussed in §2.2.3.3. We say that RDP ( Γ, 𝑉, 𝐵) satisfies Blackwell’s
condition if 𝑣 ∈ 𝑉 implies 𝑣 + 𝜆 ≔ 𝑣 + 𝜆 1 is in 𝑉 for every 𝜆 ⩾ 0 and, in addition, there
exists a 𝛽 ∈ [0, 1) such that 𝐵(𝑥, 𝑎, 𝑣 + 𝜆) ⩽ 𝐵(𝑥, 𝑎, 𝑣) + 𝛽𝜆 for all 𝑣 ∈ 𝑉, 𝜆 ⩾ 0, and (𝑥, 𝑎) ∈ G.
EXERCISE 8.2.4. Prove that the RDP for the state-dependent discounting model in
Example 8.1.5 is contracting on 𝑉 = RX whenever there exists a 𝑏 < 1 with 𝛽 ( 𝑥, 𝑎, 𝑥 0) ⩽
𝑏 for all ( 𝑥, 𝑎) ∈ G and 𝑥 0 ∈ X.
EXERCISE 8.2.5. Prove that the discrete optimal savings model from §5.2.2 satisfies
Blackwell’s condition.
Consider the job search problem with correlated wage draws first investigated in
§3.3.1. With finite wage offer set W, wage offer process generated by 𝑃 ∈ M ( RW )
and 𝛽 ∈ (0, 1), we can frame this as an RDP ( Γ, 𝑉, 𝐵) with 𝑉 = RW , Γ ( 𝑤) = {0, 1} for
𝑤 ∈ W and
𝐵(𝑤, 𝑎, 𝑣) ≔ 𝑎 · 𝑤/(1 − 𝛽) + (1 − 𝑎) [ 𝑐 + 𝛽(𝑃𝑣)(𝑤) ].
Since the model just described is an optimal stopping problem, Example 8.2.1 tells us
that (Γ, 𝑉, 𝐵) is contracting.
Now consider the following modification, where Γ and 𝑉 are as before but 𝐵 is
replaced by
𝐵𝜏(𝑤, 𝑎, 𝑣) ≔ 𝑎 · 𝑤/(1 − 𝛽) + (1 − 𝑎) [ 𝑐 + 𝛽(𝑅𝜏𝑣)(𝑤) ],
where 𝜏 ∈ [0, 1] and 𝑅𝜏 is the quantile certainty equivalent operator described in
Exercise 7.3.4 (page 232).
Figure 8.1 shows the reservation wage for a range of 𝜏 values, computed using
optimistic policy iteration (and taking the smallest 𝑤 ∈ W such that 𝜎∗ ( 𝑤) = 1). The
stationary distribution of 𝑃 is also shown in the figure, tilted 90 degrees.
The parameters and the code for applying 𝑇𝜎 and evaluating greedy functions are
shown in Listing 26. That listing includes the quantile operator 𝑅𝜏 , which is imple-
mented in Listing 25. (Quantiles of discrete random variables can also be computed
using functionality contained in Distributions.jl.)
# The helper `quantile(τ, v, ϕ)` is defined earlier in Listing 25
# (truncated here); it returns the τ-th quantile of the values in v
# under the probability weights ϕ.

"For each i, compute the τ-th quantile of v(Y) when Y ∼ P(i, ⋅)"
function R(τ, v, P)
    return [quantile(τ, v, P[i, :]) for i in eachindex(v)]
end
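The body of the quantile helper is not shown above. The sketch below is one way such a function could be written, consistent with the definition of R_τ in Exercise 7.3.4; it is an assumption about the implementation rather than a reproduction of Listing 25.

# τ-th quantile of a discrete random variable taking the value v[i]
# with probability ϕ[i]: the smallest y with P{value ≤ y} ≥ τ.
# (If `Statistics.quantile` is already in scope, a distinct name avoids a clash.)
function quantile(τ, v, ϕ)
    order = sortperm(v)            # indices sorted by outcome
    cdf = 0.0
    for i in order
        cdf += ϕ[i]
        if cdf ≥ τ
            return v[i]
        end
    end
    return v[order[end]]           # guard against rounding error
end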
The main message of Figure 8.1 is that the reservation wage rises in 𝜏. In essence,
higher 𝜏 focuses the attention of the worker on the right tail of the distribution of
continuation values. This encourages the worker to take on more risk, which leads to
a higher reservation wage (i.e., reluctance to accept a given current offer).
In this section we consider a small open economy that borrows in international finan-
cial markets in order to smooth consumption and has the option to default. We show
that the model is a contractive RDP.
Income (𝑌𝑡 )𝑡⩾0 is exogenous and 𝑄 -Markov on finite set Y. A representative house-
hold faces budget constraint
𝐶𝑡 = 𝑌𝑡 + 𝑏𝑡 − 𝑞𝑏𝑡+1 ( 𝑡 ⩾ 0) ,
using QuantEcon
include("quantile_function.jl")

"""
The policy operator T_σ for the job search model with quantile
preferences (Listing 26).
"""
function T_σ(v, σ, model)
    (; n, w_vals, P, β, c, τ) = model
    h = c .+ β * R(τ, v, P)          # continuation values via R_τ
    e = w_vals ./ (1 - β)            # stopping (acceptance) values
    return σ .* e + (1 .- σ) .* h
end
In other words, if 𝑑 = 0, so the country is not in default, the government can choose
any 𝑏𝑎 ∈ B and also any 𝑑 𝑎 ∈ {0, 1} (i.e., default or not default). If 𝑑 = 1, however, the
government has no choices. We represent this situation by 𝑏𝑎 = 0 and 𝑑 𝑎 = 1.
The value aggregator 𝐵 takes a case-by-case form. To specify it we decompose the problem across cases for 𝑑 and 𝑑𝑎. First consider the
case where 𝑑 = 0 (not currently in default) and 𝑑 𝑎 = 0 (the government chooses not
to default). For this case 𝑦 + 𝑏 − 𝑞𝑏𝑎 is current consumption, so we set
𝐵((𝑦, 𝑏, 0), (𝑏𝑎, 0), 𝑣) = 𝑢(𝑦 + 𝑏 − 𝑞𝑏𝑎) + 𝛽 Σ_{𝑦′} 𝑣(𝑦′, 𝑏𝑎, 0) 𝑄(𝑦, 𝑦′).    (8.32)
Now consider the case where 𝑑 = 0 and 𝑑 𝑎 = 1, so the government chooses to default.
𝐵((𝑦, 𝑏, 0), (𝑏𝑎, 1), 𝑣) = 𝑢(ℎ(𝑦)) + 𝛽 [ 𝜃 Σ_{𝑦′} 𝑣(𝑦′, 0, 0) 𝑄(𝑦, 𝑦′) + (1 − 𝜃) Σ_{𝑦′} 𝑣(𝑦′, 0, 1) 𝑄(𝑦, 𝑦′) ].    (8.33)
The term Σ_{𝑦′} 𝑣(𝑦′, 0, 0) 𝑄(𝑦, 𝑦′) is the expected value next period when the country is readmitted to international financial markets (with 𝑏′ = 0 and 𝑑′ = 0), while the term Σ_{𝑦′} 𝑣(𝑦′, 0, 1) 𝑄(𝑦, 𝑦′) is the expected value next period when default continues (with 𝑏′ = 0 and 𝑑′ = 1).
Since 𝐵 (( 𝑦, 𝑏, 1) , ( 𝑏𝑎 , 0) , 𝑣) is not feasible (a defaulted country cannot itself directly
choose to reenter financial markets), the only other possibility is 𝐵 (( 𝑦, 𝑏, 1) , ( 𝑏𝑎 , 1) , 𝑣),
which is the expected value when the country remains in default. But this is the same
as 𝐵 (( 𝑦, 𝑏, 0) , ( 𝑏𝑎 , 1) , 𝑣) specified above: the value for a country that stays in default is
the same as that for a country that newly enters default.
EXERCISE 8.2.7. By working through cases (8.32)–(8.33) for the value aggregator
𝐵, show that the model described above is a contractive RDP.
Many RDPs are not contracting. There is no single method for handling all types
of non-contractive RDPs, so we introduce alternative techniques over the next few
sections. The first such technique, treated in this section, handles RDPs that contract
“eventually,” even though they may fail to contract in one step. We show that these
eventually contracting RDPs are globally stable, so all of the fundamental optimality
results apply.
One application for these results is the MDP model with state-dependent discount-
ing treated in Chapter 6. This section contains a proof of the main optimality result
in that chapter (Proposition 6.2.2 on page 193).
𝜎 ∈ Σ =⇒ 𝜌 ( 𝐿𝜎 ) < 1 where 𝐿𝜎 ( 𝑥, 𝑥 0) ≔ 𝐿 ( 𝑥, 𝜎 ( 𝑥 ) , 𝑥 0) .
Proof. Let R be as stated and fix 𝜎 ∈ Σ. Let 𝑇𝜎 be the associated policy operator and
let 𝐿𝜎 be the linear operator in (8.34). For fixed 𝑣, 𝑤 ∈ 𝑉 we have |𝑇𝜎𝑣 − 𝑇𝜎𝑤| ⩽ 𝐿𝜎 |𝑣 − 𝑤| pointwise on X.
Since 𝐿𝜎 ⩾ 0 and 𝜌 ( 𝐿𝜎 ) < 1, Proposition 6.1.6 on page 190 implies that 𝑇𝜎 is even-
tually contracting on 𝑉 . Since 𝑉 is closed in RX , it follows that 𝑇𝜎 is globally stable
(Theorem 6.1.5, page 189). Hence R is globally stable, as claimed. □
With Proposition 8.2.4 in hand, we can complete the proof of Proposition 6.2.2 on
page 193, which pertained to optimality properties for MDPs with state-dependent
discounting.
𝐿(𝑥, 𝑎, 𝑥′) ≔ 𝛽(𝑥, 𝑎, 𝑥′) 𝑃(𝑥, 𝑎, 𝑥′)   and   𝐿𝜎(𝑥, 𝑥′) ≔ 𝐿(𝑥, 𝜎(𝑥), 𝑥′).
For 𝑣, 𝑤 ∈ 𝑉 and (𝑥, 𝑎) ∈ G we then have
|𝐵(𝑥, 𝑎, 𝑣) − 𝐵(𝑥, 𝑎, 𝑤)| ⩽ | Σ_{𝑥′} [𝑣(𝑥′) − 𝑤(𝑥′)] 𝛽(𝑥, 𝑎, 𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) | ⩽ Σ_{𝑥′} |𝑣(𝑥′) − 𝑤(𝑥′)| 𝐿(𝑥, 𝑎, 𝑥′),
Theorem 8.1.1 shows that RDPs have excellent optimality properties when all policy
operators are globally stable on value space. So far we have looked at conditions
for stability based on contractions (§8.2.1) and eventual contractions (§8.2.2). But
sometimes both of these approaches fail and we need alternative conditions.
In this section we explore alternative conditions based on Du’s theorem (page 216).
Du’s theorem is well suited to the task of studying stability of policy operators, since
it leverages the fact that all policy operators are order-preserving.
The convexity requirement on the aggregator is that, for all (𝑥, 𝑎) ∈ G, 𝑣, 𝑤 ∈ 𝑉 and 𝜆 ∈ [0, 1],
𝐵(𝑥, 𝑎, 𝜆𝑣 + (1 − 𝜆)𝑤) ⩽ 𝜆𝐵(𝑥, 𝑎, 𝑣) + (1 − 𝜆)𝐵(𝑥, 𝑎, 𝑤),    (8.36)
while the concavity requirement is that
𝐵(𝑥, 𝑎, 𝜆𝑣 + (1 − 𝜆)𝑤) ⩾ 𝜆𝐵(𝑥, 𝑎, 𝑣) + (1 − 𝜆)𝐵(𝑥, 𝑎, 𝑤).    (8.38)
In both of the definitions above, condition (ii) is rather complex. The next exercise
provides simpler sufficient conditions.
Both convexity and concavity yield stability, as the next proposition shows.
Proof. We begin with the convex case. Fix 𝜎 ∈ Σ. By the monotonicity property of
RDPs, 𝑇𝜎 is an order-preserving self-map on 𝑉 . Since (8.36) holds, 𝑇𝜎 is also a convex
operator on 𝑉. Moreover, 𝑇𝜎𝑣1 ⩾ 𝑣1 because 𝑇𝜎 : 𝑉 → 𝑉 and, by (8.40), 𝑇𝜎𝑣2 ⩽ 𝑣2.
It follows from Proposition 8.2.5 that, for convex and concave RDPs, all of the
optimality and convergence results in Theorem 8.1.1 apply.
(The functions 𝑣1 and 𝑣2 are constant.) We claim that the RDP R̂ ≔ ( Γ, 𝑉ˆ, 𝐵) is both
convex and concave.
EXERCISE 8.2.11. Prove that (8.40) and (8.41) both hold for R̂.
EXERCISE 8.2.12. Complete the proof that R̂ is both concave and convex.
Consider the risk-sensitive preference RDP in Example 8.1.6, with state space X and
action space A. Let 𝑉 = RX . For ( 𝑥, 𝑎) ∈ G and 𝑣 ∈ 𝑉 , we can express the aggregator
as
𝐵 ( 𝑥, 𝑎, 𝑣) ≔ 𝑟 ( 𝑥, 𝑎) + 𝛽 ( 𝑅𝜃𝑎 𝑣)( 𝑥 )
where 𝜃 is a nonzero constant and
(𝑅𝜃𝑎𝑣)(𝑥) ≔ (1/𝜃) ln { Σ_{𝑥′} exp(𝜃𝑣(𝑥′)) 𝑃(𝑥, 𝑎, 𝑥′) }.
Notice that, for each fixed 𝑎 ∈ Γ ( 𝑥 ), the operator 𝑅𝜃𝑎 is an entropic certainty equivalent
operator on 𝑉 (see Example 7.3.2 on page 232).
Proof. Fix 𝛽 < 1. We show that ( Γ, 𝑉, 𝐵) obeys Blackwell’s condition (§8.2.1.3). To this
end, fix 𝑣 ∈ 𝑉 , ( 𝑥, 𝑎) ∈ G, and 𝜆 ⩾ 0. Since 𝑅𝜃𝑎 is constant-subadditive (Exercise 7.3.8
on page 233), we have 𝐵(𝑥, 𝑎, 𝑣 + 𝜆) = 𝑟(𝑥, 𝑎) + 𝛽(𝑅𝜃𝑎(𝑣 + 𝜆))(𝑥) ⩽ 𝑟(𝑥, 𝑎) + 𝛽(𝑅𝜃𝑎𝑣)(𝑥) + 𝛽𝜆 = 𝐵(𝑥, 𝑎, 𝑣) + 𝛽𝜆. Hence Blackwell's condition holds. □
The next exercise pertains to quantile preferences rather than risk-sensitive pref-
erences, but the result can be obtained via a relatively straightforward modification
of the proof of Proposition 8.3.1.
EXERCISE 8.3.1. Let R ≔ ( Γ, 𝑉, 𝐵) be an RDP with 𝑉 = RX and fix 𝜏 ∈ [0, 1]. Let
𝐵 ( 𝑥, 𝑎, 𝑣) = 𝑟 ( 𝑥, 𝑎) + 𝛽 ( 𝑅𝜏𝑎 𝑣)( 𝑥 ) where, for each 𝑎 ∈ Γ ( 𝑥 ), the map 𝑅𝜏𝑎 is given by
(𝑅𝜏𝑎𝑣)(𝑥) = min { 𝑦 ∈ R : Σ_{𝑥′} 𝟙{𝑣(𝑥′) ⩽ 𝑦} 𝑃(𝑥, 𝑎, 𝑥′) ⩾ 𝜏 }    (𝑣 ∈ 𝑉, 𝑥 ∈ X).
Let’s consider a job search problem where future wage outcomes are evaluated via
risk-sensitive expectations. The associated Bellman operator is
(𝑇𝑣)(𝑤) = max { 𝑤/(1 − 𝛽), 𝑐 + (𝛽/𝜃) ln [ Σ_{𝑤′} exp(𝜃𝑣(𝑤′)) 𝑃(𝑤, 𝑤′) ] }    (𝑤 ∈ W).
Here 𝜃 is a nonzero parameter and other details are as in §3.3.1. We can represent
the problem as an RDP with state space W, action space A = {0, 1}, feasible correspon-
dence Γ ( 𝑤) = A, value space 𝑉 ≔ RW , and value aggregator
𝐵(𝑤, 𝑎, 𝑣) = 𝑎 · 𝑤/(1 − 𝛽) + (1 − 𝑎) { 𝑐 + (𝛽/𝜃) ln [ Σ_{𝑤′} exp(𝜃𝑣(𝑤′)) 𝑃(𝑤, 𝑤′) ] }.
If 𝜃 < 0, then the agent is risk-averse with respect to the gamble associated with
continuing and waiting for new wage draws. If 𝜃 > 0 then the agent is risk-loving
with respect to such gambles. For 𝜃 ≈ 0, the agent is close to risk-neutral.
Figure 8.2 shows how the continuation value, value function and optimal decision
vary with 𝜃. Apart from 𝜃, parameters are identical to those in Listing 10 on page 99.
Indeed, for 𝜃 close to zero, as in the middle sub-figure of Figure 8.2, we see that the
value function and reservation wage are almost identical to those from the risk-neutral
model in Figure 3.5 on page 100.
As expected, a negative value of 𝜃 tends to reduce the continuation value and hence
the reservation wage, since the agent’s dislike of risk encourages early acceptance of
an offer. For positive values of 𝜃 the reverse is true, as seen in the bottom sub-figure.
EXERCISE 8.3.2. Replicate Figure 8.2. The simplest method is to modify the code
in Listing 10 and use value function iteration.
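Exercise 8.3.2 can be approached with a few lines of Julia along the following lines. The variable names (w_vals for the wage grid, P for the wage transition matrix) and the convergence settings are assumptions made for this sketch; the Listing 10 referenced in the text is not reproduced here.

# Value function iteration for the risk-sensitive job search model.
function vfi_risk_sensitive(w_vals, P, c, β, θ; tol=1e-8, max_iter=10_000)
    e = w_vals ./ (1 - β)                          # value of accepting
    v = copy(e)                                    # initial guess
    for _ in 1:max_iter
        h = c .+ (β / θ) .* log.(P * exp.(θ .* v)) # continuation values
        v_new = max.(e, h)
        if maximum(abs.(v_new - v)) < tol
            return v_new, e .≥ h                   # value function, accept decisions
        end
        v = v_new
    end
    error("VFI failed to converge")
end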
Some problems in economics, finance and artificial intelligence assume that decisions
emerge from a dynamic two-person zero sum game in which the two agents’ prefer-
ences are perfectly misaligned. This can lead to a dynamic program where the Bellman
equation takes the form
𝑣(𝑥) = max_{𝑎∈Γ(𝑥)} inf_{𝑑∈𝐷(𝑥,𝑎)} 𝐵(𝑥, 𝑎, 𝑑, 𝑣)    (𝑥 ∈ X),    (8.43)
where 𝐵 ( 𝑥, 𝑎, 𝑑, 𝑣) represents lifetime value for the decision maker conditional on her
current action 𝑎 and her adversary’s action 𝑑 . The decision maker chooses action 𝑎 ∈
Γ ( 𝑥 ) with the knowledge that the opponent will then choose 𝑑 ∈ 𝐷 ( 𝑥, 𝑎) to minimize
her lifetime value.
Remark 8.3.1. In some settings we can replace the inf in (8.43) with min. In other
settings this is not so obvious. For this reason we use inf throughout, paired with the
assumption that 𝐵 is bounded below. This means that the infimum is always well-
defined and finite.
8.3.2.1 Optimality
(a) 𝑣 ⩽ 𝑤 implies 𝐵(𝑥, 𝑎, 𝑑, 𝑣) ⩽ 𝐵(𝑥, 𝑎, 𝑑, 𝑤) for all 𝑥 ∈ X, 𝑎 ∈ Γ(𝑥), 𝑑 ∈ 𝐷(𝑥, 𝑎);
(b) there exists an 𝜀 > 0 such that 𝑣1(𝑥) + 𝜀 ⩽ 𝐵(𝑥, 𝑎, 𝑑, 𝑣1) for all 𝑥 ∈ X, 𝑎 ∈ Γ(𝑥), 𝑑 ∈ 𝐷(𝑥, 𝑎);
(c) 𝐵(𝑥, 𝑎, 𝑑, 𝑣2) ⩽ 𝑣2(𝑥) for all 𝑥 ∈ X, 𝑎 ∈ Γ(𝑥), 𝑑 ∈ 𝐷(𝑥, 𝑎); and
(d) 𝐵(𝑥, 𝑎, 𝑑, 𝜆𝑣 + (1 − 𝜆)𝑤) ⩾ 𝜆𝐵(𝑥, 𝑎, 𝑑, 𝑣) + (1 − 𝜆)𝐵(𝑥, 𝑎, 𝑑, 𝑤) for all 𝑥 ∈ X, 𝑎 ∈ Γ(𝑥), 𝑑 ∈ 𝐷(𝑥, 𝑎), 𝑣, 𝑤 ∈ 𝑉 and 𝜆 ∈ [0, 1].
𝐵̂(𝑥, 𝑎, 𝑣) ≔ inf_{𝑑∈𝐷(𝑥,𝑎)} 𝐵(𝑥, 𝑎, 𝑑, 𝑣)    ((𝑥, 𝑎) ∈ G, 𝑣 ∈ 𝑉).
We consider R = (Γ, 𝑉, 𝐵̂).
EXERCISE 8.3.3. Let 𝑓 and 𝑔 map nonempty set 𝐷 into R. Assume that both 𝑓 and
𝑔 are bounded below. Prove that, in this setting, inf_{𝑑∈𝐷} [ 𝑓(𝑑) + 𝑔(𝑑) ] ⩾ inf_{𝑑∈𝐷} 𝑓(𝑑) + inf_{𝑑∈𝐷} 𝑔(𝑑).
Proof of Proposition 8.3.2. First we need to check that R is an RDP. In view of (a) we have 𝐵̂(𝑥, 𝑎, 𝑣) ⩽ 𝐵̂(𝑥, 𝑎, 𝑤) whenever (𝑥, 𝑎) ∈ G and 𝑣, 𝑤 ∈ 𝑉 with 𝑣 ⩽ 𝑤. Also, by (b) and (c),
𝑣1(𝑥) < 𝐵̂(𝑥, 𝑎, 𝑣1)   and   𝐵̂(𝑥, 𝑎, 𝑣2) ⩽ 𝑣2(𝑥)   for all (𝑥, 𝑎) ∈ G.    (8.44)
It remains to verify the concavity condition
𝐵̂(𝑥, 𝑎, 𝜆𝑣 + (1 − 𝜆)𝑤) ⩾ 𝜆𝐵̂(𝑥, 𝑎, 𝑣) + (1 − 𝜆)𝐵̂(𝑥, 𝑎, 𝑤)    (8.45)
for (𝑥, 𝑎) ∈ G, 𝑣, 𝑤 ∈ 𝑉 and 𝜆 ∈ [0, 1]. To this end, observe that
𝐵̂(𝑥, 𝑎, 𝜆𝑣 + (1 − 𝜆)𝑤) = inf_{𝑑∈𝐷(𝑥,𝑎)} 𝐵(𝑥, 𝑎, 𝑑, 𝜆𝑣 + (1 − 𝜆)𝑤)
    ⩾ inf_{𝑑∈𝐷(𝑥,𝑎)} [ 𝜆𝐵(𝑥, 𝑎, 𝑑, 𝑣) + (1 − 𝜆)𝐵(𝑥, 𝑎, 𝑑, 𝑤) ]
    ⩾ 𝜆 inf_{𝑑∈𝐷(𝑥,𝑎)} 𝐵(𝑥, 𝑎, 𝑑, 𝑣) + (1 − 𝜆) inf_{𝑑∈𝐷(𝑥,𝑎)} 𝐵(𝑥, 𝑎, 𝑑, 𝑤),
where the first inequality is by condition (d) above and the second is by Exercise 8.3.3.
This proves (8.45), so R is a concave RDP. □
The setting we consider is a modified MDP where the adversarial agent’s actions
affect the reward function and transition kernel. This leads to a Bellman equation of
the form
𝑣(𝑥) = max_{𝑎∈Γ(𝑥)} inf_{𝑑∈𝐷(𝑥,𝑎)} { 𝑟(𝑥, 𝑎, 𝑑) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑑, 𝑥′) }    (𝑥 ∈ X).    (8.46)
To construct the value space 𝑉, we let 𝑟1 = min 𝑟 and 𝑟2 = max 𝑟, fix 𝜀 > 0, and set
𝑉 = [𝑣1, 𝑣2]   where   𝑣1 ≔ (𝑟1 − 𝜀)/(1 − 𝛽)   and   𝑣2 ≔ 𝑟2/(1 − 𝛽).    (8.47)
EXERCISE 8.3.4. Prove: For 𝑣1 , 𝑣2 in (8.47), conditions (b)–(c) on page 277 hold.
Proof of Lemma 8.3.3. It suffices to show that R obeys (a)–(d) on page 277. Conditions (a) and (d) are elementary in this setting. Conditions (b) and (c) were established in
Exercise 8.3.4. □
Until now we have considered agents facing decision problems where outcomes are
uncertain but probabilities are known. For example, while the job seeker introduced in
Chapter 1 does not know the next period wage offer when choosing her current action,
she does know the distribution of that offer. She uses this distribution to determine
an optimal course of action. Similarly, the controllers in our discussion of optimal
stopping and MDPs used their knowledge of the Markov transition law to determine
an optimal policy.
In many cases, the assumption that the decision maker knows all probability distri-
butions that govern outcomes under different actions is debatable. In this section we
study lifetime valuations in settings of Knightian uncertainty (Knight, 1921), which
means that outcome distributions are themselves unknown. Some authors refer to
Knightian uncertainty as ambiguity.
Below we consider some dynamic problems where decision makers face Knightian
uncertainty.
First we study the choices of a decision maker who knows her reward function but
distrusts her specification of the stochastic kernel 𝑃 that describes the evolution of the
state. This distrust is expressed by assuming that she knows that 𝑃 belongs to some
class of stochastic kernels from G × X to X. This can lead to aggregators of the form
𝐵(𝑥, 𝑎, 𝑣) = 𝑟(𝑥, 𝑎) + 𝛽 inf_{𝑃∈P(𝑥,𝑎)} Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′)    (8.48)
for ( 𝑥, 𝑎) ∈ G. As usual, 𝑟 maps G to R and 𝛽 ∈ (0, 1). The decision maker can
construct a policy that is robust to her distrust of the stochastic kernel by using this
aggregator 𝐵. Such aggregators arise in the field of robust control.
Positing that the decision maker knows a nontrivial set of stochastic kernels is a
way of modeling Knightian uncertainty, as distinguished from risks that are described
by known probability distributions.
Example 8.3.1. Consider the simple job search problem from Chapter 1. Suppose that
the worker believes that the wage offer distribution lies in some subset P of D(W).
She can seek a decision rule that is robust to worst-case beliefs by optimizing with
aggregator
𝐵(𝑤, 𝑎, 𝑣) = 𝑎 · 𝑤/(1 − 𝛽) + (1 − 𝑎) [ 𝑐 + 𝛽 inf_{𝜑∈P} Σ_{𝑤′} 𝑣(𝑤′) 𝜑(𝑤′) ].
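When the belief set P is finite, the infimum in this aggregator is a minimum and can be computed directly. The sketch below represents P as a collection of candidate offer distributions over the wage grid; this representation, and the function name, are illustrative choices rather than part of the example.

# Robust job search aggregator from Example 8.3.1 with a finite belief set.
# v is the value function on the wage grid; belief_set is a collection of
# candidate offer distributions φ (each a probability vector).
function B_robust(w, a, v, c, β, belief_set)
    worst_case = minimum(sum(v .* φ) for φ in belief_set)   # inf over φ ∈ P
    return a * w / (1 - β) + (1 - a) * (c + β * worst_case)
end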
Proof. Writing 𝐵 as
𝐵(𝑥, 𝑎, 𝑣) = inf_{𝑃∈P(𝑥,𝑎)} { 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) },    (8.49)
we see that R is a special case of the perturbed MDP model in §8.3.2.2. Concavity
now follows from Lemma 8.3.3. □
We conclude from the discussion above that the robust control RDP is globally
stable. Hence all of the fundamental optimality properties hold.
In this setup, P(𝑥, 𝑎) is often large, weakening the constraint on 𝑃. At the same
time, we introduce the penalty term 𝑑 ( 𝑃 ( 𝑥, 𝑎, ·) , 𝑃¯ ( 𝑥, 𝑎, ·)), which can be understood
as recording the deviation between a given kernel 𝑃 and some baseline specification
𝑃¯.
One interpretation of this setting is that the decision maker begins with a baseline
specification of dynamics but lacks confidence in its accuracy. In her desire to choose a
robust policy, she imagines herself playing against an adversarial agent. Her adversary
can choose transition kernels that deviate from the baseline, but the presence of the
penalty term means that extreme deviations are curbed.
If we define
𝑟̂(𝑥, 𝑎) = 𝑟(𝑥, 𝑎) + 𝑑(𝑃(𝑥, 𝑎, ·), 𝑃̄(𝑥, 𝑎, ·)),
then (8.50) can be expressed as
𝐵(𝑥, 𝑎, 𝑣) = inf_{𝑃} { 𝑟̂(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) }.
Hence, for this choice of deviation, the robust control aggregator (8.50) reduces to the
risk-sensitive aggregator (see Example 8.1.6 on page 249) under the baseline transi-
tion kernel.
Ju and Miao (2012) propose and study a recursive smooth ambiguity model in the
context of asset pricing. A generic discrete formulation of their optimization problem
can be expressed in terms of the aggregator
𝐵(𝑥, 𝑎, 𝑣) = { 𝑟(𝑥, 𝑎) + 𝛽 [ ∫ ( Σ_{𝑥′} 𝑣(𝑥′)^𝛾 𝑃𝜃(𝑥, 𝑎, 𝑥′) )^{𝜅/𝛾} 𝜇(𝑥, d𝜃) ]^{𝛼/𝜅} }^{1/𝛼},    (8.52)
where 𝛼, 𝜅, 𝛾 are nonzero parameters, 𝑃𝜃 is a stochastic kernel from G to X for each
𝜃 in a finite dimensional parameter space Θ, and 𝜇 ( 𝑥, ·) is a probability distribution
over Θ for each 𝑥 ∈ X. The distribution 𝜇 ( 𝑥, ·) represents subjective beliefs over the
transition rule for the state.
The aggregator 𝐵 in (8.52) is defined for 𝑥 ∈ X, 𝑎 ∈ Γ(𝑥) and 𝑣 ∈ 𝐼, where 𝐼 is the interior of the positive cone of RX. To ensure finite real values, we assume 𝑟 ≫ 0.
As with the Epstein–Zin case, 𝛼 parameterizes the elasticity of intertemporal sub-
stitution and 𝛾 governs risk aversion. The parameter 𝜅 captures ambiguity aversion.
If 𝜅 = 𝛾 , the agent is said to be ambiguity neutral.
EXERCISE 8.3.5. Show that the smooth ambiguity aggregator 𝐵 reduces to the
Epstein–Zin aggregator when the agent is ambiguity neutral.
Returning to (8.52), we focus on the case 𝜅 < 𝛾 < 0 < 𝛼 < 1, which includes the
calibration used in Ju and Miao (2012). (Other cases can be handled using similar
methods and details are left to the reader.) After constructing a suitable value space,
we will show that the resulting RDP is globally stable.
As a first step, set 𝑟1 ≔ min 𝑟 , 𝑟2 ≔ max 𝑟 and fix 𝜀 > 0. Consider the constant
functions
𝑣1 ≔ ( 𝑟1/(1 − 𝛽) )^{1/𝛼}   and   𝑣2 ≔ ( (𝑟2 + 𝜀)/(1 − 𝛽) )^{1/𝛼}.
Here is our main result for this section. It implies that all optimality and conver-
gence results for R are valid (see, in particular, Theorem 8.1.1).
Proposition 8.3.5. Under the stated assumptions, the RDP R is globally stable.
To prove Proposition 8.3.5, we use a transformation, just as we did with the
Epstein–Zin case in §8.1.4.1. To this end we introduce the composite parameters
𝜉 ≔ 𝛾/𝜅 ∈ (0, 1)   and   𝜁 ≔ 𝜅/𝛼 < 0.
Then we define
𝐵̂(𝑥, 𝑎, 𝑣) = { 𝑟(𝑥, 𝑎) + 𝛽 [ ∫ ( Σ_{𝑥′} 𝑣(𝑥′)^𝜉 𝑃𝜃(𝑥, 𝑎, 𝑥′) )^{1/𝜉} 𝜇(𝑥, d𝜃) ]^𝜁 }^{1/𝜁}    (8.54)
and
𝑉̂ = [𝑣̂1, 𝑣̂2]   where   𝑣̂1 ≔ 𝑣2^{1/𝜅}   and   𝑣̂2 ≔ 𝑣1^{1/𝜅}.
Note that 𝑉ˆ is a nonempty order interval of strictly positive real-valued functions, since
0 < 𝑣1 < 𝑣2 and 𝜅 < 0. We set R̂ = ( Γ, 𝑉ˆ, 𝐵ˆ).
EXERCISE 8.3.8. Show that 𝑣̂1 < 𝐵̂(𝑥, 𝑎, 𝑣̂1) and 𝐵̂(𝑥, 𝑎, 𝑣̂2) ⩽ 𝑣̂2 for all (𝑥, 𝑎) ∈ G.
The next exercise shows that R and R̂ are topologically conjugate (see §8.1.4).
Proof of Lemma 8.3.6. Fix (𝑥, 𝑎) ∈ G and write 𝐵̂(𝑥, 𝑎, 𝑣) = 𝜓( ∫ 𝑓(𝜃, 𝑣) 𝜇(𝑥, d𝜃) ), where
𝑓(𝜃, 𝑣) ≔ ( Σ_{𝑥′} 𝑣(𝑥′)^𝜉 𝑃𝜃(𝑥, 𝑎, 𝑥′) )^{1/𝜉}   and   𝜓(𝑡) ≔ ( 𝑟(𝑥, 𝑎) + 𝛽𝑡^𝜁 )^{1/𝜁}.
For fixed 𝜃, the function 𝑣 ↦→ 𝑓 ( 𝜃, 𝑣) is concave over all 𝑣 in the interior of the positive
cone of RX by Lemma 7.3.1 on page 234. The real-valued function 𝜓 satisfies 𝜓′ > 0 and 𝜓″ < 0 over 𝑡 ∈ (0, ∞). Since we are composing order-preserving concave functions, it follows that 𝐵̂(𝑥, 𝑎, 𝑣̂) is concave on 𝑉̂. □
Proof of Proposition 8.3.5. To prove that R is globally stable it suffices to prove that
R̂ is globally stable (see Exercise 8.3.9 and Proposition 8.1.3 on page 260). Given
the results of Exercise 8.3.8 and Lemma 8.3.6, the RDP R̂ is concave. But then R̂ is
globally stable, by Proposition 8.2.5. □
8.3.5 Minimization
Until now, all theory and applications have concerned maximization of lifetime values.
Now is a good time to treat minimization. Throughout this section, R is a well-posed
RDP. The pointwise minimum 𝑣∗ ≔ ⋀_{𝜎∈Σ} 𝑣𝜎 is called the min-value function generated
by R. We call a policy 𝜎 ∈ Σ min-optimal for R if 𝑣𝜎 = 𝑣∗ . A policy 𝜎 ∈ Σ is called
𝑣-min-greedy for R if 𝜎(𝑥) ∈ argmin_{𝑎∈Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑣) for all 𝑥 ∈ X.
Recall the shortest path problem introduced in Example 8.1.8, where X is the vertices
of a graph, 𝐸 is the edges, 𝑐 : 𝐸 → R+ maps a travel cost to each edge ( 𝑥, 𝑥 0) ∈ 𝐸,
and O( 𝑥 ) is the set of direct successors of 𝑥 . The aim is to minimize total travel cost
to a destination node 𝑑 . We adopt all assumptions from Exercise 8.1.17 and assume
in addition that 𝑐 ( 𝑥, 𝑥 0) = 0 implies 𝑥 = 𝑑 . As in Exercise 8.1.17, we let 𝐶 ( 𝑥 ) be the
maximum cost of traveling to 𝑑 from 𝑥 along any directed path.
We regard the problem as an RDP R = (O, 𝑉, 𝐵) with 𝑉 = [0, 𝐶 ] and
𝐵(𝑥, 𝑥′, 𝑣) = 𝑐(𝑥, 𝑥′) + 𝑣(𝑥′)    (𝑥 ∈ X).    (8.55)
In the present setting, the function 𝑣 in (8.55) is often called the cost-to-go function, with 𝑣(𝑥′) understood as the remaining cost after moving to state 𝑥′.
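The Bellman (min-)operator generated by (8.55) is easy to implement. The sketch below stores the graph as a dictionary mapping each vertex to its outgoing edges and costs; this data structure, and the small example graph, are assumptions made purely for illustration.

# Bellman min-operator for the shortest path RDP in (8.55).
function T_sp(v, successors)
    return Dict(x => minimum(c + v[x′] for (x′, c) in out)
                for (x, out) in successors)
end

# A hypothetical three-vertex example in which vertex 3 is the destination d.
successors = Dict(1 => Dict(2 => 1.0, 3 => 5.0),
                  2 => Dict(3 => 1.0),
                  3 => Dict(3 => 0.0))
v = Dict(x => 0.0 for x in keys(successors))
for _ in 1:10
    global v = T_sp(v, successors)   # iterates converge to the min cost-to-go
end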
While the value aggregator 𝐵 in (8.55) is simple, the absence of discounting (which
is standard in the shortest path literature) means that R is not contracting. Fortu-
nately, R turns out to be concave (in the sense of §8.2.3), which allows us to prove
Proposition 8.3.8. Under the stated conditions, the shortest path RDP is globally stable
and the min-value function 𝑣∗ is the unique solution to
𝑣∗(𝑥) = min_{𝑥′∈Γ(𝑥)} { 𝑐(𝑥, 𝑥′) + 𝑣∗(𝑥′) }    (𝑥 ∈ X)
(In the present context, 𝑣∗ is also known as the minimum cost-to-go function.)
Proof. We first show that R is concave. By the definition of concave RDPs in §8.2.3,
and given that 𝐵 ( 𝑥, 𝑥 0, 𝑣) is affine in 𝑣 (and hence concave), it suffices to prove that
there exists a 𝛿 > 0 such that
When discussing MDPs we used 𝛽 to represent the discount factor. Given 𝛽 , the dis-
count rate or rate of time preference is the value 𝜌 that solves 𝛽 = 1/(1 + 𝜌). The
standard MDP assumption 𝛽 < 1 implies this rate is positive. You will recall from
Chapter 5 that the condition 𝛽 < 1 is central to the general theory of MDPs, since it
yields global stability of the Bellman and policy operators on RX (via the Neumann
series lemma or Banach’s fixed point theorem).
In the previous section, on shortest paths, we studied an RDP with a zero discount
rate. Now we go one step further and consider problems with negative rates of time
preference. Such preferences are commonly inferred when people face unpleasant
tasks. Subjects of studies often prefer getting such tasks “over and done with” rather
than postponing them. (Negative discount rates are inferred in other settings as well.
§9.3 provides background and references.)
In this section, we model optimal choice under a negative discount rate. Taking
our cue from the discussion above, we consider a scenario where a task generates
disutility but has to be completed. In particular, we assume that
𝐵(𝑥, 𝑥′, 𝑣) = 𝑐(𝑥, 𝑥′) + 𝛽𝑣(𝑥′)    (𝑥, 𝑥′ ∈ X)    (8.57)
where X is a finite set and 𝛽 > 1 is a constant. The function 𝑐 gives the cost of transitioning from 𝑥 to the new state 𝑥′.
The value aggregator 𝐵 in (8.57) is the same as the shortest path aggregator (8.55),
except for the constant 𝛽 . To keep the discussion simple, we adopt all other assump-
tions from the shortest path discussion in §8.3.5.1.
(𝑇𝜎 𝑣)( 𝑥 ) = 𝑐 ( 𝑥, 𝜎 ( 𝑥 )) + 𝛽𝑣 ( 𝜎 ( 𝑥 )) ⩽ 𝑐 ( 𝑥, 𝜎 ( 𝑥 )) + 𝛽𝐶 ( 𝜎 ( 𝑥 )) ⩽ 𝐶 ( 𝑥 ) .
The last bound holds because 𝐶 ( 𝑥 ) is, by definition, greater than the cost of traveling
from 𝑥 to 𝜎 ( 𝑥 ) and then following the most expensive path.
Proposition 8.3.9. Under the stated conditions, the negative discount rate RDP is glob-
ally stable, the min-value function 𝑣∗ is the unique solution to
𝑣∗(𝑥) = min_{𝑥′∈Γ(𝑥)} { 𝑐(𝑥, 𝑥′) + 𝛽𝑣∗(𝑥′) }    (𝑥 ∈ X)
Proof. The proof of Proposition 8.3.9 is essentially identical to the proof of Proposi-
tion 8.3.8. Readers are invited to confirm this. □
The RDP framework adopted in this chapter is inspired by Bertsekas (2022b), who
in turn credits Mitten (1964) as the first research paper to frame Richard Bellman’s
dynamic programming problems in an abstract setting. Denardo (1967) describes key
ideas including what we call contracting RDPs (see §8.2.1). Denardo credits Shapley
(1953) for inspiring his contraction-based arguments.
The key optimality results from this chapter (Theorems 9.2.4 and 8.1.1) are
somewhat new, although closely related results appear in Bertsekas (2022b). See,
in addition, Bloise et al. (2023), which builds on Bertsekas (2022b) and Ren and
Stachurski (2021).
The job search application with quantile preferences in §8.2.1.4 is based on de Cas-
tro et al. (2022). The same reference includes a general theory of dynamic program-
ming when certainty equivalents are computed using quantile operators and aggre-
gation is time additive.
The optimal default application in §8.2.1.5 is loosely based on Arellano (2008).
Influential contributions to this line of work include Yue (2010), Chatterjee and Eyi-
gungor (2012), Arellano and Ramanarayanan (2012), Cruces and Trebesch (2013),
Ghosh et al. (2013), Gennaioli et al. (2014), and Bocola et al. (2019).
At the start of the chapter we motivated RDPs by mentioning that equilibria in
some models of production and economic geography can be computed using dynamic
programming. Examples include Hsu (2012), Hsu et al. (2014), Antràs and De Gortari
(2020), Kikuchi et al. (2021) and Tyazhelnikov (2022).
Early references for dynamic programming with risk-sensitive preferences include
Jacobson (1973), Whittle (1981), and Hansen and Sargent (1995). Elegant mod-
ern treatments can be found in Asienkiewicz and Jaśkiewicz (2017) and Bäuerle and
Jaśkiewicz (2023), and an extension to general static risk measures is available in
Bäuerle and Glauner (2022). Risk-sensitivity is applied to the study of optimal growth
in Bäuerle and Jaśkiewicz (2018), and to optimal dividend payouts in Bäuerle and
Jaśkiewicz (2017). Risk-sensitivity is also used in applications of reinforcement learn-
ing, where the underlying state process is not known. See, for example, Shen et al.
(2014), Majumdar et al. (2017) or Gao et al. (2021).
Dynamic programming problems that acknowledge model uncertainty by includ-
ing adversarial agents to promote robust decision rules can be found in Cagetti et al.
(2002), Hansen and Sargent (2011), and other related papers. Al-Najjar and Shmaya
(2019) study the connection between Epstein–Zin utility and parameter uncertainty.
Ruszczyński (2010) considers risk averse dynamic programming and time consistency.
The smooth ambiguity model in §8.3.4 is loosely adapted from Klibanoff et al.
(2009) and Ju and Miao (2012). For applications of optimization under smooth am-
biguity, see, for example, Guan and Wang (2020) or Yu et al. (2023). Zhao (2020)
studies yield curves in a setting where ambiguity-averse agents face varying amounts
of Knightian uncertainty over the short and long run.
Readers who wish to see some motivation for the discussion of negative discount-
ing in §8.3.5.2 can consult Loewenstein and Sicherman (1991), who found that the
majority of workers they surveyed reported a preference for increasing wage profiles
over decreasing ones that yield the same undiscounted sum, even when it was pointed
out that the latter could be used to construct a dominating consumption sequence.
Loewenstein and Prelec (1991) obtained similar results. In summarizing their study,
they argue that, in the context of the choice problems that they examined, “sequences
of outcomes that decline in value are greatly disliked, indicating a negative rate of
time preference” (Loewenstein and Prelec, 1991, p. 351).
In §8.3.5.2 we considered dynamic programs with negative discount rates. A more
general treatment of such problems can be found in Kikuchi et al. (2021), which also
shows how negative discount rate dynamic programs connect to static problems con-
cerning equilibria in production networks and draws connections with Coase’s theory
of the firm.
An algorithm that we neglected to discuss is stochastic gradient descent (or ascent)
in policy space. Typically policies are parameterized via an approximation architec-
ture that consists of basis functions, activation functions, and compositions of them
(e.g., a neural network). In large models, such approximation is used even when the
state and action spaces are finite, simply because the curse of dimensionality makes
exact representations infeasible. For recent discussions of gradient descent in pol-
icy spaces see Nota and Thomas (2019), Mei et al. (2020), and Bhandari and Russo
(2022).
Chapter 9

Abstract Dynamic Programming
9.1.1 Preliminaries
Let’s cover some fundamental concepts that we’ll use when considering abstract dy-
namic programs.
The first concept is related to stability of maps over partially ordered spaces. Our aim
is to provide a weak notion of stability that can be applied in any partially ordered set
(without any form of topology).
[Figure 9.1: an order stable map 𝑇 on 𝑉 = [0, 1], plotted against the 45-degree line, with unique fixed point 𝑣̄.]
Let 𝑉 be a partially ordered set and let 𝑇 be a self-map on 𝑉 with exactly one fixed
point 𝑣̄ in 𝑉. In this setting, we call 𝑇 upward stable if 𝑣 ⪯ 𝑇𝑣 implies 𝑣 ⪯ 𝑣̄, downward stable if 𝑇𝑣 ⪯ 𝑣 implies 𝑣̄ ⪯ 𝑣, and order stable if 𝑇 is both upward and downward stable.
Figure 9.1 gives an illustration of a map 𝑇 on 𝑉 = [0, 1] that is order stable:
all points mapped up by 𝑇 lie below its fixed point and all points mapped down by 𝑇
lie above its fixed point. The figure suggests that order stability is related to global
stability, as defined in §1.2.2.2. We affirm this in Lemma 9.1.1, just below.
Proof. Assume the stated conditions. By global stability, 𝑇 has a unique fixed point 𝑣¯
in 𝑉. If 𝑣 ∈ 𝑉 and 𝑣 ⩽ 𝑇𝑣, then iterating on this inequality and using the fact that 𝑇 is order-preserving gives 𝑣 ⩽ 𝑇^𝑘𝑣 for all 𝑘 ∈ N. Taking 𝑘 → ∞ and applying global stability yields 𝑣 ⩽ 𝑣̄, so 𝑇 is upward stable. A similar argument shows that 𝑇𝑣 ⩽ 𝑣 implies 𝑣̄ ⩽ 𝑣, so 𝑇 is order stable. □
Given partially ordered set 𝑉 = (𝑉, ⪯), let 𝑉^𝜕 = (𝑉, ⪯^𝜕) be the order dual, so that, for 𝑢, 𝑣 ∈ 𝑉, we have 𝑢 ⪯^𝜕 𝑣 if and only if 𝑣 ⪯ 𝑢. (The notation is slightly confusing but the concept is simple: 𝑉^𝜕 is just 𝑉 with the order reversed.) The following result will be useful.
In this section we formalize abstract dynamic programs and present fundamental op-
timality results. §9.1.2.1 starts the ball rolling with an informal overview.
9.1.2.1 Prelude
We saw in §8.1 that a globally stable RDP yields a set of feasible policies Σ and, for
each 𝜎 ∈ Σ, a policy operator 𝑇𝜎 defined on the value space 𝑉 ⊂ RX . Notice that the
dynamic program is fully specified by the family of operators {𝑇𝜎 }𝜎∈Σ and the space
𝑉 that they act on. From this set of operators we obtain the set of lifetime values
{ 𝑣𝜎 }𝜎∈Σ , with each 𝑣𝜎 uniquely identified as a fixed point of 𝑇𝜎 . These lifetime values
define the value function 𝑣∗ as the pointwise maximum 𝑣∗ = ∨𝜎 𝑣𝜎 . An optimal policy
is then defined as a 𝜎 ∈ Σ obeying 𝑣𝜎 = 𝑣∗ .
To shed unnecessary structure before the main optimality proofs, a natural idea is
to start directly with an abstract set of “policy operators” {𝑇𝜎 } acting on some set 𝑉 .
One can then define lifetime values and optimality as in the previous paragraph and
start to investigate conditions on the family of operators {𝑇𝜎 } that lead to optimality.
We use these ideas as our starting point, beginning with an arbitrary family {𝑇𝜎 }
of operators on a partially ordered set.
An abstract dynamic program (ADP) is a pair A = (𝑉, {𝑇𝜎}𝜎∈Σ) such that
(i) 𝑉 = (𝑉, ⪯) is a partially ordered set,
(ii) {𝑇𝜎}𝜎∈Σ is a family of self-maps on 𝑉, and
(iii) for each 𝑣 ∈ 𝑉, the set {𝑇𝜎𝑣}𝜎∈Σ has both a greatest and a least element.
Elements of the index set Σ are called policies and elements of {𝑇𝜎 } are called policy
operators. Given 𝑣 ∈ 𝑉, a policy 𝜎 in Σ is called 𝑣-greedy if 𝑇𝜎𝑣 ⪰ 𝑇𝜏𝑣 for all 𝜏 ∈ Σ.
Existence of a greatest element in (iii) of the definition above is equivalent to the
statement that each 𝑣 ∈ 𝑉 has at least one 𝑣-greedy policy.
Remark 9.1.1. Existence of a least element in (iii) is needed only because we wish to
consider minimization as well as maximization. For settings where only maximization
is considered, this can be dropped from the list of assumptions. (An analogous state-
ment holds for minimization and greatest elements.) We mention least elements in
Example 9.1.1 below and then disregard them until we treat minimization in §9.2.3.
Remark 9.1.2. In the applications treated in this chapter, ⪯ will always be the pointwise partial order. In Volume II other partial orders arise.
Example 9.1.1 (RDPs generate ADPs). Let R = ( Γ, 𝑉, 𝐵) be an RDP with finite state X,
as defined in §8.1.1. For each 𝜎 in the feasible policy set Σ, let 𝑇𝜎 be the corresponding
policy operator, defined at 𝑣 ∈ 𝑉 by (𝑇𝜎 𝑣)( 𝑥 ) = 𝐵 ( 𝑥, 𝜎 ( 𝑥 ) , 𝑣). The pair AR ≔ (𝑉, {𝑇𝜎 }) is
an ADP, since 𝑉 is partially ordered by ⩽, 𝑇𝜎 is a self-map on 𝑉 for all 𝜎 ∈ Σ, and, given
𝑣 ∈ 𝑉 , choosing 𝜎
¯ ∈ Σ such that 𝜎¯ ( 𝑥 ) ∈ argmax 𝑎∈Γ ( 𝑥 ) 𝐵 ( 𝑥, 𝑎, 𝑣) for all 𝑥 ∈ X produces a
𝑣-greedy policy and a greatest element for {𝑇𝜎 𝑣} (cf., Exercise 8.1.7 on page 253). A
least element of {𝑇𝜎 𝑣} can be generated by replacing “argmax” with “argmin.”
We have just shown that RDPs are ADPs. But there are also ADPs that do not
fit naturally into the RDP framework. The next two examples illustrate. In these
examples, the Bellman equation does not match the RDP Bellman equation 𝑣 ( 𝑥 ) =
max 𝑎∈Γ ( 𝑥 ) 𝐵 ( 𝑥, 𝑎, 𝑣) due to the inverted order of expectation and maximization.
Example 9.1.3. Recall the 𝑄-factor MDP Bellman operator, which takes the form
(𝑆𝑞)(𝑥, 𝑎) = 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} max_{𝑎′∈Γ(𝑥′)} 𝑞(𝑥′, 𝑎′) 𝑃(𝑥, 𝑎, 𝑥′)    (9.1)
with 𝑞 ∈ RG and (𝑥, 𝑎) ∈ G. (We are repeating (5.39) on page 171.) The 𝑄-factor policy operators {𝑆𝜎} corresponding to (9.1) are given by
(𝑆𝜎𝑞)(𝑥, 𝑎) = 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑞(𝑥′, 𝜎(𝑥′)) 𝑃(𝑥, 𝑎, 𝑥′)    ((𝑥, 𝑎) ∈ G).    (9.2)
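For concreteness, a Julia sketch of the Q-factor policy operator in (9.2) follows; the array shapes for r and P and the representation of σ as a vector of action indices are assumptions made for this sketch.

# Q-factor policy operator S_σ from (9.2), acting on q ∈ R^G, with q and
# r stored as nx × na matrices and P as an nx × na × nx array.
function S_σ(q, σ, r, P, β)
    nx, na = size(r)
    Sq = similar(q)
    for x in 1:nx, a in 1:na
        continuation = sum(q[x′, σ[x′]] * P[x, a, x′] for x′ in 1:nx)
        Sq[x, a] = r[x, a] + β * continuation
    end
    return Sq
end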
Example 9.1.4. In reinforcement learning and related fields the 𝑄 -factor approach
from Example 9.1.3 has been extended to risk-sensitive decision processes (see, e.g.,
Fei et al. (2021)). The corresponding 𝑄-factor Bellman equation is given by
𝑓(𝑥, 𝑎) = 𝑟(𝑥, 𝑎) + (𝛽/𝜃) ln { Σ_{𝑥′} exp( 𝜃 max_{𝑎′∈Γ(𝑥′)} 𝑓(𝑥′, 𝑎′) ) 𝑃(𝑥, 𝑎, 𝑥′) }    ((𝑥, 𝑎) ∈ G).    (9.3)
In Chapter 10 we will see that continuous time dynamic programs can also be
viewed as ADPs.
9.2 Optimality
In this section we study optimality properties of ADPs, aiming for generalizations of
the foundational results of dynamic programming. To achieve this aim we need to
define optimality and provide sufficient conditions.
9.2.1 Max-Optimality
We begin with maximization. Later, in §9.2.3, we will show that results for minimiza-
tion problems are simple corollaries of maximization results.
We call an ADP A ≔ (𝑉, {𝑇𝜎 }) well-posed if every policy operator 𝑇𝜎 has a unique
fixed point in 𝑉 . In view of the preceding discussion on lifetime values, well-posedness
is a minimum requirement for constructing an optimality theory around ADPs.
9.2.1.2 Operators
Given A = (𝑉, {𝑇𝜎}), we set 𝑇𝑣 ≔ ⋁_{𝜎∈Σ} 𝑇𝜎𝑣 for each 𝑣 ∈ 𝑉 (9.5) and call 𝑇 the Bellman operator generated by A. Note that 𝑇 is a well-defined self-
map on 𝑉 by part (iii) of the definition of ADPs (existence of greedy policies). A
function 𝑣 ∈ 𝑉 is said to satisfy the Bellman equation if it is a fixed point of 𝑇 .
The definition of 𝑇 in (9.5) includes all of the Bellman operators we have met
as special cases. For example, consider an RDP R = ( Γ, 𝑉, 𝐵) with Bellman operator
(𝑇𝑣)(𝑥) = max_{𝑎∈Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑣). We can write 𝑇𝑣 as ⋁_{𝜎} 𝑇𝜎𝑣, as shown in Exercise 8.1.8 on
page 253. Thus, the Bellman operator of the RDP agrees with the Bellman operator
𝑇 of the corresponding ADP AR .
Below we consider Howard policy iteration (HPI) as an algorithm for solving for
optimal policies of ADPs. We use precisely the same instruction set as for the RDP
case, as shown in Algorithm 8.1 on page 254. To further clarify the algorithm, we
define a map 𝐻 from 𝑉 to { 𝑣𝜎 } via 𝐻 𝑣 = 𝑣𝜎 where 𝜎 is 𝑣-max-greedy. Iterating with
𝐻 generates the value sequence associated with Howard policy iteration. In what
follows, we call 𝐻 the Howard operator generated by the ADP.
9.2.1.3 Properties
Corollary 9.2.2. Let R be an RDP and let AR be the ADP generated by R. If R is globally
stable, then AR is max-stable.
Proof. Let R and AR be as stated and suppose that R is globally stable. In view of
Lemma 9.1.1 on page 292, each policy operator is order stable. Hence AR is order
stable. Since Σ is finite, Proposition 9.2.1 implies that AR is also max-stable. □
EXERCISE 9.2.2. Show that the ADP described in Example 9.1.3 is max-stable.
Order stability is central to the optimality results stated below. While order stabil-
ity is a somewhat nonstandard condition, the next result shows that, at least in simple
settings, order stability is necessary for any discussion of optimality.
(i) A is well-posed.
(ii) A is order stable.
Let A = (𝑉, {𝑇𝜎 }) be a well-posed ADP with 𝜎-value functions { 𝑣𝜎 }𝜎∈Σ . We define
𝑉Σ ≔ {𝑣𝜎}𝜎∈Σ   and   𝑉𝑢 ≔ {𝑣 ∈ 𝑉 : 𝑣 ⪯ 𝑇𝑣}.
If 𝑉Σ has a greatest element, then we denote it by 𝑣∗ and call it the value function
generated by A. In this setting, a policy 𝜎 ∈ Σ is called optimal for A if 𝑣𝜎 = 𝑣∗ . We
say that A obeys Bellman's principle of optimality if 𝜎 ∈ Σ is optimal for A exactly when 𝜎 is 𝑣∗-greedy.
Theorem 9.2.4 informs us that finite well-posed ADPs have first-rate optimality
properties under a relatively mild stability condition. In §9.2.2 we use Theorem 9.2.4
to prove all optimality results for RDPs stated in Chapter 8. The proof of Theo-
rem 9.2.4 is given in §B.4 (see page 347). Note that (iv) follows directly from (i)
and is included only for completeness.
This volume focuses on dynamic programming problems with finite states. Here we
restrict ourselves to one high-level result for general state spaces.
Proposition 9.2.5 tells us that we can drop finiteness of policy set Σ (which is
implied by finite states and actions) whenever the Bellman operator has at least one
fixed point. Various fixed point methods are available for establishing this existence.
We defer further details until Volume II. Proposition 9.2.5 is proved in §B.4.
This section discusses adding mixed strategies to an RDP. We will need to apply Propo-
sition 9.2.5 to discuss optimality because the set of mixed strategies is not finite.
Let R = ( Γ, 𝑉, 𝐵) be an RDP with finite state space X, finite action space A, policy
set Σ and Bellman operator 𝑇 (see §8.1.3). A mixed strategy for R is a map 𝜑 sending
𝑥 ∈ X into a distribution 𝜑 𝑥 ∈ D(A) supported on Γ ( 𝑥 ). In other words, for each 𝑥 ∈ X,
𝜑𝑥 : A → [0, 1]   and   Σ_{𝑎∈Γ(𝑥)} 𝜑𝑥(𝑎) = 1.
Let Φ be the set of all mixed strategies for R. For each mixed strategy 𝜑 ∈ Φ, we
introduce the policy operator on 𝑉 defined by
(𝑇̂𝜑𝑣)(𝑥) = Σ_{𝑎∈A} 𝐵(𝑥, 𝑎, 𝑣) 𝜑𝑥(𝑎)    (𝑣 ∈ 𝑉, 𝑥 ∈ X).
The right hand side is the expected lifetime value from current state 𝑥 , when the
current action is drawn from 𝜑 𝑥 and future states are evaluated via 𝑣.
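As a sketch, this mixed-strategy policy operator can be coded as follows, with φ stored as an |X| × |A| matrix of probabilities whose x-th row is supported on Γ(x) and B(x, a, v) supplied by the caller; both conventions are assumptions made for illustration.

# Policy operator T̂_φ for a mixed strategy φ: the expected aggregator
# value at each state when the current action is drawn from φ[x, :].
function T_mixed(v, φ, B)
    nx, na = size(φ)
    return [sum(B(x, a, v) * φ[x, a] for a in 1:na) for x in 1:nx]
end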
It follows from the discussion above that A𝑀 ≔ (𝑉, {𝑇ˆ𝜑 } 𝜑∈Φ ) is an ADP (where “M”
stands for “mixed”), and that the Bellman operator 𝑇ˆ associated with the ADP A𝑀 is
given by
(𝑇̂𝑣)(𝑥) = max_{𝑎∈Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑣) = (𝑇𝑣)(𝑥)    (𝑣 ∈ 𝑉, 𝑥 ∈ X).    (9.6)
Let us assume for simplicity that R is contracting (see §8.2.1), with modulus of
contraction 𝛽 ∈ (0, 1). Assume also that 𝑉 is closed in RX . As a result, the value
function 𝑣∗ for R exists in 𝑉 and is the unique fixed point of 𝑇 in 𝑉 (Corollary 8.2.2).
EXERCISE 9.2.6. Show that, under the assumptions stated above, {𝑇ˆ𝜑 } 𝜑∈Φ and 𝑇ˆ
are all contraction mappings.
By Exercise 9.2.6, the ADP A𝑀 is max-stable (since globally stable operators are
order stable – see Lemma 9.1.1 on page 292 – and the Bellman operator 𝑇ˆ has a fixed
point). Hence, by Proposition 9.2.5, the value function 𝑣ˆ∗ for A𝑀 exists in 𝑉 and is
the unique fixed point of 𝑇ˆ in 𝑉 . But, by (9.6), 𝑇ˆ and 𝑇 agree on 𝑉 . Hence 𝑣ˆ∗ = 𝑣∗ . We
conclude as follows: while the set of mixed strategies is larger than the set of pure
strategies (i.e., deterministic policies), the maximal lifetime value from each state is
the same.
In this section we provide some preliminary results related to OPI convergence, where
OPI obeys the algorithm given on page 255. Throughout, R = ( Γ, 𝑉, 𝐵) is a globally
stable RDP with policy set Σ, policy operators {𝑇𝜎 }, Bellman operator 𝑇 , and value
function 𝑣∗ . As usual, 𝑣𝜎 denotes the unique fixed point of 𝑇𝜎 for all 𝜎 ∈ Σ. In the
results stated below, 𝑚 is a fixed natural number indicating the OPI step size and 𝐻
and 𝑊𝑚 are as defined in §8.1.3.2.
𝑣 ∈ 𝑉𝑢 =⇒ 𝑇𝑣 ⩽ 𝑊𝑚𝑣 ⩽ 𝑇^𝑚𝑣.
Proof. As for the self-map property, pick any 𝑣 ∈ 𝑉𝑢 . Since 𝑇 and 𝑇𝜎 are order-
preserving, 𝑣 ⩽ 𝑇 𝑣 and 𝜎 is 𝑣-greedy, we have
Proof. The bounds in (9.7) hold for 𝑘 = 1 by 𝑣0 ∈ 𝑉𝑢 and Lemma 9.2.7. Since all
operators are order-preserving and invariant on 𝑉𝑢 , the extension to arbitrary 𝑘 follows
from Exercise 2.2.36 on page 65. □
Lemma 9.2.9. Let 𝑣0 be any element of 𝑉Σ and let 𝑣𝑘 = 𝑊𝑚𝑘 𝑣0 for all 𝑘 ∈ N. If 𝑣𝑘 = 𝑣𝑘+1
for some 𝑘 ∈ N, then 𝑣𝑘 = 𝑣∗ and every 𝑣𝑘 -greedy policy is optimal.
Proof. Let the sequence (𝑣𝑘) be as stated and suppose that 𝑣𝑘 = 𝑣𝑘+1. Let 𝜎 be 𝑣𝑘-
greedy. It follows that 𝑇𝜎^𝑚 𝑣𝑘 = 𝑣𝑘 and, moreover, 𝑣𝑘 ⩽ 𝑇𝑣𝑘 = 𝑇𝜎 𝑣𝑘, where the
inequality is by 𝑣𝑘 ∈ 𝑉𝑢. As a result,
𝑣𝑘 ⩽ 𝑇𝜎 𝑣𝑘 ⩽ 𝑇𝜎𝑚 𝑣𝑘 = 𝑣𝑘 .
    𝑒 ≔ min_{𝜎∈Σ₀} ‖𝑣𝜎 − 𝑣∗‖∞ > 0.
Choose 𝐾 ∈ N such that k 𝑣𝑘 − 𝑣∗ k ∞ < 𝑒 for all 𝑘 ⩾ 𝐾 . Fix 𝑘 ⩾ 𝐾 and let 𝜎 be 𝑣𝑘 -greedy.
We claim that 𝜎 is optimal. Indeed, since 𝑣𝑘 ∈ 𝑉𝑢, we have 𝑣𝑘 ⩽ 𝑇𝑣𝑘 = 𝑇𝜎 𝑣𝑘, so, by
upward stability, 𝑣𝑘 ⩽ 𝑣𝜎 . As a result,
| 𝑣∗ − 𝑣𝜎 | = 𝑣∗ − 𝑣𝜎 ⩽ 𝑣∗ − 𝑣𝑘 .
In §8.1.3 we stated two key optimality results for RDPs, the first concerning globally
stable RDPs (Theorem 8.1.1 on page 256) and the second concerning bounded RDPs
(Theorem 8.1.2 on page 259). Let’s now prove them. In what follows, R = ( Γ, 𝑉, 𝐵) is
a well-posed RDP and AR ≔ (𝑉, {𝑇𝜎 }) is the ADP generated by R.
Proof of Theorem 8.1.1. Let R be globally stable. Then AR is finite and max-stable, by
Corollary 9.2.2. Hence the optimality and HPI convergence claims in Theorem 8.1.1
follow from Theorem 9.2.4.
Regarding OPI convergence, let ( 𝑣𝑘 , 𝜎𝑘 ) be as given in (8.18) on page 255. To-
gether, Lemma 9.2.6 and Lemma 9.2.8 imply that 𝑣𝑘 → 𝑣∗ as 𝑘 → ∞. Given such
convergence, Lemma 9.2.10 implies that there exists a 𝐾 ∈ N such that 𝜎𝑘 is optimal
whenever 𝑘 ⩾ 𝐾 . □
9.2.3 Min-Optimality
Until now, our ADP theory has focused on maximization of lifetime values. Now we
turn to minimization. One of our aims is to prove the RDP minimization results in
§8.3.5. We will see that ADP minimization results are easily recovered from ADP
maximization results via order duality.
Let A = (𝑉, {𝑇𝜎 }) be a well-posed ADP and let 𝑉Σ ≔ { 𝑣𝜎 } be the set of 𝜎-value
functions. We call 𝜎 ∈ Σ min-optimal for A if 𝑣𝜎 is a least element of 𝑉Σ . When 𝑉Σ
has a least element we denote it by 𝑣∗ and call it the min-value function generated
by A. A policy 𝜎 is called 𝑣-min-greedy if 𝑇𝜎 𝑣 ⩽ 𝑇𝜏 𝑣 for all 𝜏 ∈ Σ. Existence of a
𝑣-min-greedy policy for each 𝑣 ∈ 𝑉 is guaranteed by the definition of ADPs.
We say that A obeys Bellman’s principle of min-optimality if
If, in addition, Σ is finite, then min-HPI converges to 𝑣∗ in finitely many steps.
To prove Theorem 9.2.11 we use order duality. Below, if A ≔ (𝑉, {𝑇𝜎 }) is an ADP
then its dual is
In this setting, we let 𝑇 𝜕 be the Bellman operator for A𝜕 , ( 𝑣∗ ) 𝜕 be the value function
for A𝜕 , and so on. We note that A is self-dual, in the sense that (A𝜕 ) 𝜕 = A, since the
same is true for 𝑉 .
To make our terminology more symmetric, in the remainder of this section we
refer to maximization-based optimal policies as max-optimal, the Bellman operator
𝑇 = ⋁_𝜎 𝑇𝜎 as the Bellman max-operator, and so on.
EXERCISE 9.2.7. Let A be a well-posed ADP with dual A𝜕 . Verify the following.
Proof of Theorem 9.2.11. Let A be min-stable. By Exercise 9.2.7, the dual A𝜕 is max-
stable. Hence all of the conclusions of the max-optimality result in Theorem 9.2.4
apply to A𝜕 . All that remains is to translate these max-optimality results for A𝜕 back
to min-optimality results for A.
Regarding claim (i) of the min-optimality results, max-optimality of A𝜕 implies
that ( 𝑣∗ ) 𝜕 exists in 𝑉 . But then 𝑣∗ exists in 𝑉 , since, by Exercise 9.2.7, 𝑣∗ = ( 𝑣∗ ) 𝜕 .
Regarding (ii), we know that ( 𝑣∗ ) 𝜕 is the unique solution to 𝑇 𝜕 ( 𝑣∗ ) 𝜕 = ( 𝑣∗ ) 𝜕 , so,
applying Exercise 9.2.7 again, we have 𝑇 𝑣∗ = 𝑣∗ .
The remaining steps of the proof are similar and left to the reader. □
As indicated in notes for Chapter 8, our interest in abstract dynamic programming was
inspired by Bertsekas (2022b). This chapter generalizes his framework by switching
to a “completely abstract” setting based on analysis of self-maps on partially ordered
spaces. The material here is based on Sargent and Stachurski (2023a). Earlier work
on dynamic programming in a setting with no topology can be found in Kamihigashi
(2014).
Chapter 10
Continuous Time
Earlier chapters treated dynamics in discrete time. Now we switch to continuous time.
We restrict ourselves to finite state spaces, where continuous time processes are pure
jump processes. This allows us to provide a rigorous and self-contained treatment,
while laying foundations for a treatment of general state problems.
In this section we introduce continuous time Markov models. In §10.2, we will use
them as components of continuous time Markov decision processes.
10.1.1 Background
    𝑢̇𝑡 ≔ (d/d𝑡) 𝑢𝑡 = 𝑟𝑢𝑡 for all 𝑡 ⩾ 0, with initial balance 𝑢0 given.   (10.2)
We understand (10.2) as a functional equation whose solution is an element 𝑡 ↦→ 𝑢𝑡 of
𝐶1 ( R+ , R), the set of continuously differentiable functions from R+ to R, that satisfies
(10.2). We claim that 𝑢𝑡 ≔ e𝑟𝑡 𝑢0 is the only solution to (10.2) in 𝐶1 ( R+ , R). It is easy
to check that this choice of 𝑢𝑡 obeys (10.2). As for uniqueness, suppose that 𝑡 ↦→ 𝑦𝑡 is
another solution in 𝐶1 ( R+ , R), so that 𝑦¤𝑡 = 𝑟 𝑦𝑡 for all 𝑡 ⩾ 0 and 𝑦0 = 𝑢0 . Then
    (d/d𝑡)[ 𝑦𝑡 e^{−𝑟𝑡} ] = 𝑦̇𝑡 e^{−𝑟𝑡} − 𝑟𝑦𝑡 e^{−𝑟𝑡} = 𝑟𝑦𝑡 e^{−𝑟𝑡} − 𝑟𝑦𝑡 e^{−𝑟𝑡} = 0,
so 𝑡 ↦→ 𝑦𝑡 e−𝑟𝑡 is constant on R+ , implying existence of a 𝑐 ∈ R such that 𝑦𝑡 = 𝑐 e𝑟𝑡 for
all 𝑡 ⩾ 0. Setting 𝑡 = 0 and using the initial condition gives 𝑐 = 𝑢0 . Hence, at any 𝑡 , we
have 𝑦𝑡 = e𝑟𝑡 𝑢0 = 𝑢𝑡 .
The continuous time system in Example 10.1.1 is closely related to the discrete
time difference equation 𝑢𝑡+1 = e𝑟 𝑢𝑡 . Indeed, if we start at 𝑢0 , then the 𝑡 -th iterate
is e𝑟𝑡 𝑢0 , so solutions agree at integer times. We can think of the continuous time
system as one that interpolates between points in time of a corresponding discrete
time system.
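As a quick check (our own, not from the text), the claim that the continuous flow interpolates the discrete system can be verified numerically: iterating the map 𝑢 ↦→ e^𝑟 𝑢 reproduces e^{𝑟𝑡}𝑢0 at integer dates.

    # Sketch: compare the continuous solution u_t = e^{rt} u_0 with the
    # discrete iteration u_{t+1} = e^r u_t at integer dates.
    r, u0 = 0.05, 1.0
    u_cont(t) = exp(r * t) * u0
    u_disc = accumulate((u, _) -> exp(r) * u, 1:5; init=u0)   # discrete iterates
    @assert all(isapprox.(u_disc, u_cont.(1:5); rtol=1e-12))
    println("discrete iterates match e^{rt}u_0 at integer dates")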
The exponential e^𝜆 of 𝜆 = 𝑎 + 𝑖𝑏 ∈ C can also be defined via (10.1). From the
identity e^{𝑖𝑏} = cos(𝑏) + 𝑖 sin(𝑏), we obtain

    e^𝜆 = e^{𝑎+𝑖𝑏} = e^𝑎 (cos(𝑏) + 𝑖 sin(𝑏)).   (10.3)
Continuous time Markov chains have a close relationship with the exponential distri-
bution, a fact that stems from its being the only distribution having the memoryless
property
P{𝑊 > 𝑠 + 𝑡 | 𝑊 > 𝑠} = P{𝑊 > 𝑡 } for all 𝑠, 𝑡 > 0. (10.4)
EXERCISE 10.1.1. Verify that (10.4) holds when 𝑊 =ᵈ Exp(𝜃).
The memoryless property is special. For example, the probability that an individ-
ual human being lives 70 years from birth is not equal to the probability that he or she
lives another 70 years conditional on having reached age 70. In fact the exponential
distribution is the only memoryless distribution supported on the nonnegative reals:
Lemma 10.1.1. If 𝑊 has counter CDF 𝐺 satisfying 0 < 𝐺 ( 𝑡 ) < 1 for all 𝑡 > 0, then the
following statements are equivalent:
(i) 𝑊 =ᵈ Exp(𝜃) for some 𝜃 > 0.
(ii) 𝑊 satisfies the memoryless property in (10.4).
Proof. Exercise 10.1.1 treats (i) ⇒ (ii). As for (ii) ⇒ (i), suppose (ii) holds. Then
𝐺 has three properties:
This is sufficient to prove (i) because then 𝜃 ≔ − ln 𝐺 (1) is a positive real number (by
(b)) and, furthermore,
To see that (10.5) holds, fix 𝑚, 𝑛 ∈ N. We can use (c) to obtain both 𝐺(𝑚/𝑛) = 𝐺(1/𝑛)^𝑚
and 𝐺(1) = 𝐺(1/𝑛)^𝑛. It follows that 𝐺(𝑚/𝑛)^𝑛 = 𝐺(1/𝑛)^{𝑚𝑛} = 𝐺(1)^𝑚 and, raising to the
power of 1/𝑛, we get (10.5) when 𝑡 = 𝑚/𝑛.
The discussion so far confirms that (10.5) holds when 𝑡 is rational. So now take
any 𝑡 ⩾ 0 and rational sequences (𝑎𝑛) and (𝑏𝑛) converging to 𝑡 with 𝑎𝑛 ⩽ 𝑡 ⩽ 𝑏𝑛 for
all 𝑛. By (a) we have 𝐺(𝑏𝑛) ⩽ 𝐺(𝑡) ⩽ 𝐺(𝑎𝑛) for all 𝑛, so 𝐺(1)^{𝑏𝑛} ⩽ 𝐺(𝑡) ⩽ 𝐺(1)^{𝑎𝑛} for all
𝑛 ∈ N. Taking the limit in 𝑛 completes the proof. □
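As a numerical aside (ours, not from the text), the memoryless property (10.4) can be checked by simulation; here Exp(𝜃) is parametrized by its rate 𝜃, as in the text, so the Distributions.jl scale parameter is 1/𝜃.

    # Sketch: estimate P{W > s + t | W > s} and P{W > t} for exponential W.
    using Distributions, Random

    Random.seed!(42)
    θ, s, t = 0.5, 1.0, 2.0
    W = rand(Exponential(1/θ), 10^6)          # Exp(θ) draws (rate θ)
    lhs = count(w -> w > s + t, W) / count(w -> w > s, W)
    rhs = count(w -> w > t, W) / length(W)
    println("P{W > s+t | W > s} ≈ $lhs,  P{W > t} ≈ $rhs")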
The real exponential formula (10.1) extends to the matrix exponential via
    e^𝐴 ≔ 𝐼 + 𝐴 + 𝐴²/2! + ⋯ = ∑_{𝑘⩾0} 𝐴^𝑘 / 𝑘!,   (10.6)
where 𝐴 is any square matrix. As we will see, the matrix exponential plays a key role
in the solution of vector-valued linear differential equations.
EXERCISE 10.1.2. Let 𝐴 be 𝑛 × 𝑛 and let ‖·‖ be the operator norm (see page 16).
Show that (10.6) converges, in the sense that ‖∑_{𝑘=0}^{𝑚} 𝐴^𝑘/𝑘!‖ is bounded in 𝑚.
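As a quick illustration (ours, not from the text), the truncated series in (10.6) can be compared with Julia's built-in matrix exponential; the matrix below is an arbitrary stand-in.

    # Sketch: partial sums of (10.6) approach exp(A) from LinearAlgebra.
    using LinearAlgebra

    A = [0.1 0.4; -0.3 0.2]
    series(A, m) = sum(A^k / factorial(k) for k in 0:m)   # partial sums of (10.6)
    println(opnorm(series(A, 20) - exp(A)))                # ≈ 0 for moderate m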
Lemma 10.1.2 (Properties of the matrix exponential). Let 𝐴 and 𝐵 be square matrices.
    (v) (d/d𝑡) e^{𝑡𝐴} = 𝐴 e^{𝑡𝐴} = e^{𝑡𝐴} 𝐴.   (10.7)
(vi) e^{𝐴^⊤} = (e^𝐴)^⊤.
(vii) The fundamental theorem of calculus holds, in the sense that
    e^{𝑡𝐴} − e^{𝑠𝐴} = ∫_𝑠^𝑡 e^{𝜏𝐴} 𝐴 d𝜏   for all 𝑠 ⩽ 𝑡.   (10.8)
The proof of part (ii) of Lemma 10.1.2 uses the definition of the exponential and
the binomial formula. See, for example, Hirsch and Smale (1974). Part (iii) follows
directly from part (ii). Part (iv) follows easily from part (i) when 𝐴 is diagonalizable
(and can be proved more generally via the Jordan canonical form).
EXERCISE 10.1.4. Prove (v) of Lemma 10.1.2. A good starting point is to observe
that, for any 𝑡 ∈ R,
    (d/d𝑡) e^{𝑡𝐴} = lim_{ℎ→0} (e^{𝑡𝐴+ℎ𝐴} − e^{𝑡𝐴})/ℎ = e^{𝑡𝐴} lim_{ℎ→0} (e^{ℎ𝐴} − 𝐼)/ℎ.   (10.9)
EXERCISE 10.1.5. Using Lemma 10.1.2, show that, for any 𝑛 × 𝑛 matrix 𝐴, the
matrix e 𝐴 is invertible, with inverse e− 𝐴 .
As for (vii), we are drawing an analogy with the fundamental theorem of calculus
for scalar-valued functions, which states that 𝑓(𝑡) − 𝑓(𝑠) = ∫_𝑠^𝑡 𝑓′(𝜏) d𝜏 for all 𝑠 ⩽ 𝑡,
where 𝑓′ is the derivative of 𝑓.
EXERCISE 10.1.6. Prove part (vii) of Lemma 10.1.2. [Hint: use part (v).]
Recall from §2.1.1 that a discrete dynamical system is a pair (𝑈, 𝑆), where 𝑈 is a set
and 𝑆 is a self-map on 𝑈 . Trajectories are sequences ( 𝑆𝑡 𝑢)𝑡⩾0 = (𝑢, 𝑆𝑢, 𝑆2 𝑢, . . .), where
    𝑆_{𝑠+𝑡} = 𝑆𝑡 ∘ 𝑆𝑠 for all 𝑠, 𝑡 ⩾ 0
(see, e.g., Hirsch and Smale (1974), Section 8.7). Hence (𝑆𝑡)𝑡⩾0 defined by 𝑆𝑡 𝑢 = 𝐹(𝑡, 𝑢)
satisfies the semigroup property and ( R𝑛 , ( 𝑆𝑡 )𝑡⩾0 ) is a continuous time dynamical sys-
tem. The function 𝑓 is called the vector field of ( R𝑛 , ( 𝑆𝑡 )𝑡⩾0 ).
Given our interest in continuous time Markov chains and their connection to linear
systems (see the comments at the start of §10.1.1), we focus primarily on linear dif-
ferential equations. The next result discusses linear IVPs, illustrating the key role of
the matrix exponential. In the statement, 𝐴 is 𝑛 × 𝑛 and both 𝑢¤ 𝑡 and 𝑢𝑡 are column
vectors in R𝑛 .
Proposition 10.1.3. The unique solution of the 𝑛-dimensional IVP 𝑢̇𝑡 = 𝐴𝑢𝑡 with 𝑢0 given is

    𝑢𝑡 ≔ e^{𝑡𝐴} 𝑢0   (𝑡 ⩾ 0).   (10.11)
EXERCISE 10.1.7. Let 𝑃 be 𝑛 × 𝑛 and consider the IVP 𝜑¤ 𝑡 = 𝜑𝑡 𝑃 and 𝜑0 given, where
each 𝜑𝑡 is a row vector in R𝑛 . Prove that this IVP has the unique solution 𝜑𝑡 ≔ 𝜑0 e𝑡𝑃 .
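As a numerical check (ours, not from the text), the flow 𝑢𝑡 = e^{𝑡𝐴}𝑢0 from (10.11) can be verified to satisfy the differential equation by finite differencing; the matrix is the one from (10.13) below and the initial condition is arbitrary.

    # Sketch: u_t = e^{tA} u_0 satisfies u̇_t = A u_t (checked numerically).
    using LinearAlgebra

    A = [-2.0 -0.4 0.0; -1.4 -1.0 2.2; 0.0 -2.0 -0.6]
    u0 = [0.01, 0.005, -0.001]                 # arbitrary initial condition
    u(t) = exp(t * A) * u0
    h = 1e-6
    for t in (0.5, 1.0, 2.0)
        du = (u(t + h) - u(t - h)) / (2h)      # centered finite difference
        @assert norm(du - A * u(t)) < 1e-5
    end
    println("u̇_t = A u_t holds (numerically) along the flow")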
    𝑡 ↦→ 𝑢𝑡,   𝑢𝑡 = e^{𝑡𝐴} 𝑢0   (𝑡 ⩾ 0)   (10.12)

    𝐴 = ( −2.0  −0.4   0.0
          −1.4  −1.0   2.2
           0.0  −2.0  −0.6 ) .   (10.13)
[Figure: the trajectory 𝑡 ↦→ e^{𝑡𝐴} 𝑢0 in R³, starting from the initial condition 𝑢0.]
Exercise 10.1.8 and (10.14) tell us that the long run dynamics of e𝑡 𝐴 are deter-
mined by the scalar flows 𝑡 ↦→ e𝑡𝜆 𝑗 . How does e𝑡𝜆 evolve over time when 𝜆 ∈ C?
To answer this question we write 𝜆 = 𝑎 + 𝑖𝑏 and apply (10.3) to obtain
Let 𝐴 be any square matrix. In the following statement about a spectral bound, ‖·‖
is the operator norm defined in §1.2.1.4.
Lemma 10.1.4. If 𝜏 > 0, then 𝜏𝑠 ( 𝐴) = 𝑠 ( 𝜏𝐴). Moreover,
    e^{𝑠(𝐴)} = 𝜌(e^𝐴)   and   𝑠(𝐴) = lim_{𝑡→∞} (1/𝑡) ln ‖e^{𝑡𝐴}‖.   (10.17)
(The second equality in (10.17) also holds when the limit is taken over 𝑡 ∈ R+ .
See, for example, Engel and Nagel (2006).)
The next theorem is a key stability result for exponential flows. Among other
things, it extends to arbitrary 𝐴 the finding that 𝑠 ( 𝐴) < 0 is necessary and sufficient
for stability.
Theorem 10.1.5. For any square matrix 𝐴, the following statements are equivalent:
(i) 𝑠 ( 𝐴) < 0.
(ii) ‖e^{𝑡𝐴}‖ → 0 as 𝑡 → ∞.
(iii) There exist 𝑀, 𝜔 > 0 such that ‖e^{𝑡𝐴}‖ ⩽ 𝑀 e^{−𝑡𝜔} for all 𝑡 ⩾ 0.
(iv) ∫_0^∞ ‖e^{𝑡𝐴} 𝑢0‖^𝑝 d𝑡 < ∞ for all 𝑝 ⩾ 1 and 𝑢0 ∈ R𝑛.
A full proof of Theorem 10.1.5 in a general setting can be found in §V.II of Engel
and Nagel (2006).
Theorem 10.1.5 tells us that the flow 𝑡 ↦→ e𝑡 𝐴 𝑢0 converges to the origin at an
exponential rate if and only if 𝑠 ( 𝐴) < 0. The equivalence of (i) and (ii) was proved for
the diagonalizable case in §10.1.2.3. It can be viewed as the continuous time analog
of ‖𝐵^𝑘‖ → 0 if and only if 𝜌(𝐵) < 1 (see Exercise 1.2.11 on page 19).
EXERCISE 10.1.12. Prove that (i) implies (ii) without assuming that 𝐴 is diagonal-
izable. In addition, prove that (iii) implies (iv).
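As a numerical aside (ours, not from the text), the spectral bound is easy to compute as the largest real part of the eigenvalues, and the decay predicted by Theorem 10.1.5 can be observed directly; the matrix is the one from (10.13).

    # Sketch: s(A) < 0 and ‖e^{tA}‖ decays as t grows.
    using LinearAlgebra

    A = [-2.0 -0.4 0.0; -1.4 -1.0 2.2; 0.0 -2.0 -0.6]
    s = maximum(real, eigvals(A))              # spectral bound s(A)
    println("s(A) = $s")
    for t in (1, 5, 10, 20)
        println("‖e^{tA}‖ = ", opnorm(exp(t * A)))
    end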
Advanced treatments of continuous time systems often begin with semigroups. Let’s
briefly describe these and connect them to things we have studied earlier. (If you
prefer to skip this section on first reading, you can move to the next one after noting
that, given an 𝑛 × 𝑛 matrix 𝐴, the family ( 𝑆𝑡 )𝑡⩾0 = (e𝑡 𝐴 )𝑡⩾0 is called an exponential
semigroup and that 𝐴 is called the infinitesimal generator of the semigroup.)
Let X be a finite set and let ( 𝑆𝑡 )𝑡⩾0 be a subset of L( RX ) indexed by 𝑡 ∈ R+ . The
family ( 𝑆𝑡 )𝑡⩾0 is called a strongly continuous semigroup or 𝐶0 -semigroup on RX if
A natural question, then, is, given a semigroup (𝑆𝑡 ) 𝑡⩾0 on L( RX ), does there always
exist a “vector field” type object that “generates” (𝑆𝑡 ) 𝑡⩾0 ? When X is finite, the answer
is affirmative. This object, denoted below by 𝐴, is called the infinitesimal generator
of the semigroup and is defined by
    𝐴 = lim_{𝑡↓0} (𝑆𝑡 − 𝑆0)/𝑡 = lim_{𝑡↓0} (𝑆𝑡 − 𝐼)/𝑡   (10.19)
    lim_{𝑡↓0} (𝑆𝑡 − 𝑆0)/𝑡 = lim_{𝑡↓0} (e^{𝑡𝐴} − e^0)/𝑡 = (d/d𝑡) e^{𝑡𝐴} |_{𝑡=0} = 𝐴 e^{0𝐴} = 𝐴.
The preceding discussion places our analysis in a wider context. To practice our
new terminology, we can restate (i) ⇐⇒ (ii) from Theorem 10.1.5 by saying that the
exponential semigroup (𝑆𝑡 ) 𝑡⩾0 = (e𝑡 𝐴 ) 𝑡⩾0 converges to zero if and only if the spectral
bound of its infinitesimal generator is negative.
Having studied multivariate linear dynamics, we are now ready to specialize to the
Markov case, where dynamics evolve in distribution space. For the most part we now
switch to operator-theoretic notation, where X is a finite set with 𝑛 elements, and an
𝑛 × 𝑛 matrix is identified with a linear operator on L ( RX ). As emphasized in §2.3.3.1,
this is merely a change in terminology, and all preceding results for matrices extend
directly to linear operators.
Let
I ( RX ) = the set of all intensity operators in L( RX ) .
Example 10.1.4. The matrix
    𝑄 ≔ ( −2   1   1
           0  −1   1
           2   1  −3 )
is an intensity matrix, since off-diagonal terms are nonnegative and rows sum to zero.
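As a small check (ours, not from the text), the defining properties of this 𝑄 and the stochasticity of 𝑃𝑡 = e^{𝑡𝑄} (cf. Proposition 10.1.7 below) can be verified directly.

    # Sketch: Q is an intensity matrix and P_t = e^{tQ} is stochastic.
    using LinearAlgebra

    Q = [-2.0 1.0 1.0; 0.0 -1.0 1.0; 2.0 1.0 -3.0]
    @assert all(Q[i, j] >= 0 for i in 1:3 for j in 1:3 if i != j)   # off-diagonals ⩾ 0
    @assert all(abs.(sum(Q, dims=2)) .< 1e-12)                      # rows sum to zero
    for t in (0.1, 1.0, 10.0)
        P = exp(t * Q)
        @assert all(P .> -1e-12) && all(abs.(sum(P, dims=2) .- 1) .< 1e-10)
    end
    println("P_t = e^{tQ} is stochastic for each t tested")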
when 𝜓𝑡 and 𝜓¤ 𝑡 are understood to be row vectors. We say that D(X) is invariant for
the IVP (10.21) if the solution ( 𝜓𝑡 )𝑡⩾0 remains in D(X) for all 𝑡 ⩾ 0.
In view of Proposition 10.1.3, we can rephrase this by stating that D(X) is invariant
for (10.21) whenever
Our key result for this section shows the central role of intensity matrices:
Proposition 10.1.7. Fix 𝑄 ∈ L(RX) and set 𝑃𝑡 ≔ e^{𝑡𝑄} for each 𝑡 ⩾ 0. The following
statements are equivalent:
(i) 𝑄 ∈ I(RX).
(ii) 𝑃𝑡 ∈ M(RX) for all 𝑡 ⩾ 0.
(iii) the set of distributions D(X) is invariant for the IVP (10.21).
1 Other names for intensity matrices include “𝑄-matrices” (which is fine until you need to use another
Proposition 10.1.7 tells us that the set I ( RX ) coincides with the set of continuous
time (and time-homogeneous) Markov models on X. Any specification outside this
class fails to generate flows in distribution space. The proof is completed in several
steps below.
For Exercises 10.1.13–10.1.15, 𝑄 ∈ I ( RX ) and 𝑃𝑡 ≔ e𝑡𝑄 .
For the proof of Proposition 10.1.7, we have now shown that (i) implies (ii). Evi-
dently (ii) implies (iii), because if 𝜓0 ∈ D and 𝜓𝑡 = 𝜓0 𝑃𝑡 where 𝑃𝑡 is stochastic, then
𝜓𝑡 ∈ D(X). Hence it remains only to show that (iii) implies (i).
Returning to Proposition 10.1.7, the last two exercises confirm that (iii) implies
(i). The proof is now complete.
10.1.3.2 Interpretation
The previous section covered the formal relationship between intensity matrices and
Markov operators. Let’s now discuss the connection more informally, in order to build
intuition.
To this end, let ( 𝑋𝑡 )𝑡⩾0 be 𝑃ℎ -Markov in discrete time. Here ℎ > 0 is the length of
the time step. We write the corresponding distribution sequence 𝜓𝑡+ℎ = 𝜓𝑡 𝑃ℎ in terms
of change per unit of time, as in
    (𝜓_{𝑡+ℎ} − 𝜓𝑡)/ℎ = 𝜓𝑡 (𝑃ℎ − 𝐼)/ℎ,   where 𝐼 is the 𝑛 × 𝑛 identity.   (10.25)
    𝑄(𝑥, 𝑥′) ≈ [𝑃ℎ(𝑥, 𝑥′) − 1{𝑥 = 𝑥′}] / ℎ   (10.27)
EXERCISE 10.1.18. Prove that, when ℎ > 0 and 𝑃ℎ is stochastic, the matrix on the
right-hand side of (10.27) is an intensity matrix.
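As a numerical aside (our own sketch, not from the text), the approximation (10.27) can be seen at work by computing (𝑃ℎ − 𝐼)/ℎ with 𝑃ℎ = e^{ℎ𝑄} for several step sizes ℎ.

    # Sketch: (P_h − I)/h → Q as h ↓ 0 when P_h = e^{hQ}.
    using LinearAlgebra

    Q = [-2.0 1.0 1.0; 0.0 -1.0 1.0; 2.0 1.0 -3.0]
    for h in (0.1, 0.01, 0.001)
        Q_approx = (exp(h * Q) - I) / h
        println("h = $h:  error = ", opnorm(Q_approx - Q))
    end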
EXERCISE 10.1.19. To formalize (10.27), use the expression for the matrix expo-
nential in (10.6) to prove that if 𝑃𝑡 = e𝑡𝑄 , then
more explicitly as

    𝑃_{𝑠+𝑡}(𝑥, 𝑥′) = ∑_𝑧 𝑃𝑠(𝑥, 𝑧) 𝑃𝑡(𝑧, 𝑥′)   (𝑠, 𝑡 ⩾ 0, 𝑥, 𝑥′ ∈ X).   (10.29)
We can work in the other direction as well: if we can establish that a function 𝑡 ↦→ 𝑃𝑡
from R+ to L ( RX ) satisfies either one of these equations, then ( 𝑃𝑡 )𝑡⩾0 is a Markov
semigroup with infinitesimal generator 𝑄 . The next proposition gives details.
(i) 𝑃̇𝑡 = 𝑄𝑃𝑡, or
(ii) 𝑃̇𝑡 = 𝑃𝑡 𝑄,
Proposition 10.1.8 is a version of our result for linear IVPs in Proposition 10.1.3,
except that the IVP is now defined in operator space, rather than vector space.
We have discussed the one-to-one connection between intensity matrices and Markov
semigroups, and how the dynamics generated by Markov semigroups trace out distri-
bution flows. Let’s now connect these objects to continuous time Markov chains.
10.1.4.1 Definition
where F𝑠 ≔ ( 𝑋𝜏 )0⩽ 𝜏⩽ 𝑠 is the history of the process up to time 𝑠. To update from time
𝑠 to time 𝑡 given this history, we simply take the last value 𝑋𝑠 and update using 𝑃𝑡 .
Conditioning on 𝑋𝑠 = 𝑥 , we get
𝑃𝑡 ( 𝑥, 𝑥 0) = P{ 𝑋𝑠 + 𝑡 = 𝑥 0 | 𝑋𝑠 = 𝑥 } ( 𝑠, 𝑡 ⩾ 0, 𝑥, 𝑥 0 ∈ X) .
Mirroring terminology for discrete chains from §3.1.1.1, we will call a continuous
time Markov chain ( 𝑋𝑡 )𝑡⩾0 𝑄 -Markov when (10.30) holds and 𝑄 is the infinitesimal
generator of ( 𝑃𝑡 )𝑡⩾0 .
In what follows, P𝑥 and E𝑥 denote probabilities and expectations conditional on
𝑋0 = 𝑥 . Given ℎ ∈ RX , we have
    E𝑥 ℎ(𝑋𝑡) = ∑_{𝑥′} 𝑃𝑡(𝑥, 𝑥′) ℎ(𝑥′) =: (𝑃𝑡 ℎ)(𝑥).
[Figure 10.2: A jump chain sample path 𝑋𝑡, with wait times 𝑊1, 𝑊2, 𝑊3 and jump times 𝐽1, 𝐽2, 𝐽3 marked on the time axis.]
wait times, the sums 𝐽𝑘 = ∑_{𝑖=1}^{𝑘} 𝑊𝑖 are called the jump times and (𝑌𝑘) is called the
embedded jump chain. The jumps and the process (𝑋𝑡)𝑡⩾0 are illustrated in Figure 10.2.
𝑃𝑡 ( 𝑥, 𝑥 0) ≔ P 𝑥 { 𝑋𝑡 = 𝑥 0 } = P 𝑥 { 𝑋𝑡 = 𝑥 0 , 𝐽1 > 𝑡 } + P 𝑥 { 𝑋 𝑡 = 𝑥 0 , 𝐽1 ⩽ 𝑡 } . (10.33)
For the second term on the right hand side of (10.33), we obtain
P 𝑥 { 𝑋𝑡 = 𝑥 0 , 𝐽1 ⩽ 𝑡 } = E𝑥 [1{ 𝐽1 ⩽ 𝑡 }P𝑥 { 𝑋𝑡 = 𝑥 0 | 𝑊1 , 𝑌1 }] = E𝑥 1{ 𝐽1 ⩽ 𝑡 } 𝑃𝑡− 𝐽1 (𝑌1 , 𝑥 0) .
Evaluating the expectation and using the independence of 𝐽1 and 𝑌1 , this becomes
    P𝑥{𝑋𝑡 = 𝑥′, 𝐽1 ⩽ 𝑡} = ∫_0^∞ 1{𝜏 ⩽ 𝑡} ∑_𝑧 Π(𝑥, 𝑧) 𝑃_{𝑡−𝜏}(𝑧, 𝑥′) 𝜆(𝑥) e^{−𝜏𝜆(𝑥)} d𝜏
                        = 𝜆(𝑥) ∫_0^𝑡 ∑_𝑧 Π(𝑥, 𝑧) 𝑃_{𝑡−𝜏}(𝑧, 𝑥′) e^{−𝜏𝜆(𝑥)} d𝜏.
Proof. The claim that 𝑃0 = 𝐼 is obvious. For the second claim, one can easily verify
that, when 𝑓 is a differentiable function and 𝛼 > 0, we have
Note also that, with the change of variable 𝑠 = 𝑡 − 𝜏, we can rewrite (10.32) as
    𝑃𝑡(𝑥, 𝑥′) = e^{−𝑡𝜆(𝑥)} [ 𝐼(𝑥, 𝑥′) + 𝜆(𝑥) ∫_0^𝑡 (Π𝑃𝑠)(𝑥, 𝑥′) e^{𝑠𝜆(𝑥)} d𝑠 ].   (10.36)
Proof of Proposition 10.1.9. Proposition 10.1.9 follows directly from Lemma 10.1.10
and Lemma 10.1.11, combined with Proposition 10.1.8. □
Let 𝑋𝑡 be a firm’s inventory at time 𝑡 . When current stock is 𝑥 > 0, customers arrive
at rate 𝜆 ( 𝑥 ), so the wait time for the next customer is an independent draw from the
Exp( 𝜆 ( 𝑥 )) distribution; 𝜆 maps X to (0, ∞).
The 𝑘-th customer demands 𝑈𝑘 units, where each 𝑈𝑘 is an independent draw from
a fixed distribution 𝜑 on N. Purchases are constrained by inventory, however, so
inventory falls by 𝑈𝑘 ∧ 𝑋𝑡 . When inventory hits zero the firm orders 𝑏 units of new
stock. The wait time for new stock is also exponential, being an independent draw
from Exp( 𝜆 (0)).
Let 𝑌 represent the inventory size after the next jump (induced by either a customer
purchase or ordering new stock), given current stock 𝑥 . If 𝑥 > 0, then 𝑌 is a draw from
the distribution of 𝑥 − 𝑈 ∧ 𝑥 where 𝑈 ∼ 𝜑. If 𝑥 = 0, then 𝑌 ≡ 𝑏. Hence 𝑌 is a draw from
Π ( 𝑥, ·), where Π (0, 𝑦 ) = 1{ 𝑦 = 𝑏} and, for 0 < 𝑥 ⩽ 𝑏,
    Π(𝑥, 𝑦) =  0               if 𝑥 ⩽ 𝑦,
               P{𝑥 − 𝑈 = 𝑦}    if 0 < 𝑦 < 𝑥,      (10.37)
               P{𝑈 ⩾ 𝑥}        if 𝑦 = 0.
We can simulate the inventory process ( 𝑋𝑡 )𝑡⩾0 via the jump chain algorithm on
page 322. In this case, the wait time sequence (𝑊𝑘 ) is the wait time for customers
[Figure 10.3: Simulated inventory path 𝑋𝑡 over 𝑡 ∈ [0, 50].]
(and for inventory when 𝑋𝑡 = 0) and the jump sequence (𝑌𝑘 ) is the level of inven-
tory immediately after each jump. By Proposition 10.1.9, the inventory process is
𝑄 -Markov with 𝑄 given by 𝑄 ( 𝑥, 𝑥 0) = 𝜆 ( 𝑥 )( Π ( 𝑥, 𝑥 0) − 𝐼 ( 𝑥, 𝑥 0)).
Figure 10.3 shows a simulation when orders are geometric, so that
In the simulation we set 𝛼 = 0.7, 𝑏 = 10 and 𝜆 ( 𝑥 ) ≡ 0.5. The figure plots 𝑋𝑡 for
𝑡 ∈ [0, 50]. Since each wait time 𝑊𝑖 is a draw from Exp(0.5) the mean wait time is
2.0. The function that produces the map 𝑡 ↦→ 𝑋𝑡 is shown in Listing 27.
"""
Generate a path for inventory starting at b, up to time T.
Return the path as a function X(t) constructed from (J_k) and (Y_k).
"""
function sim_path(; T=10, seed=123, λ=0.5, α=0.7, b=10)
J, Y = 0.0, b
J_vals, Y_vals = [J], [Y]
Random.seed!(seed)
φ = Exponential(1/λ) # Wait times are exponential
G = Geometric(α) # Orders are geometric
while true
W = rand(φ)
J += W
push!(J_vals, J)
if Y == 0
Y = b
else
U = rand(G) + 1 # Geometric on 1, 2,...
Y = Y - min(Y, U)
end
push!(Y_vals, Y)
if J > T
break
end
end
function X(t)
k = searchsortedlast(J_vals, t)
return Y_vals[k+1]
end
return X
end
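As a usage sketch (ours, roughly matching the parameters quoted in the text for Figure 10.3), the path can be generated and evaluated on a grid as follows.

    X = sim_path(T=50)                  # uses the defaults λ = 0.5, α = 0.7, b = 10
    ts = range(0, 50, length=500)
    inventory = [X(t) for t in ts]      # evaluate the step function on a time grid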
Then we set

    𝜆(𝑥) ≔ −𝑄(𝑥, 𝑥)   and   Π(𝑥, 𝑥′) ≔ 𝐼(𝑥, 𝑥′) + 𝑄(𝑥, 𝑥′)/𝜆(𝑥).
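As a small illustration (our own sketch, not from the text), this decomposition can be computed directly from an intensity matrix; it assumes 𝜆(𝑥) > 0 for every state, and the function name jump_decomposition is ours.

    # Sketch: recover the jump rate λ and jump kernel Π from Q.
    using LinearAlgebra

    function jump_decomposition(Q)
        λ = -diag(Q)                       # λ(x) = −Q(x, x), assumed > 0
        Π = I(size(Q, 1)) + Q ./ λ         # Π(x, x′) = I(x, x′) + Q(x, x′)/λ(x)
        return λ, Π
    end

    Q = [-2.0 1.0 1.0; 0.0 -1.0 1.0; 2.0 1.0 -3.0]
    λ, Π = jump_decomposition(Q)
    @assert all(abs.(sum(Π, dims=2) .- 1) .< 1e-12)   # Π is a stochastic matrix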
We are ready to turn to dynamic programming in continuous time. As for the discrete
time case, continuous time dynamic programs aim to maximize a measure of lifetime
value. In §10.2.1 we study lifetime valuations. In §10.2.2 we learn how to maximize
them.
10.2.1 Valuation
For the discrete time problems with state-dependent discounting that we studied in
Chapter 6, lifetime valuations take the form 𝑣 = ∑_{𝑡⩾0} 𝐾^𝑡 ℎ for some ℎ ∈ RX and a
positive linear operator 𝐾 on RX . (See Theorem 6.1.1 and (6.18) on page 192.) For
a continuous time version we fix ℎ ∈ RX , take ( 𝐾𝑡 )𝑡⩾0 to be a positive exponential
semigroup in L ( RX ), where positive means 𝐾𝑡 ⩾ 0 for all 𝑡 , and set
    𝑣 = ∫_0^∞ 𝐾𝑡 ℎ d𝑡.   (10.38)
Using the semigroup property and linearity of 𝐾𝑡 , we can write the last term on the
right hand side as
    ∫_𝑡^∞ 𝐾𝜏 ℎ d𝜏 = ∫_0^∞ 𝐾_{𝑡+𝜏} ℎ d𝜏 = ∫_0^∞ 𝐾𝑡 𝐾𝜏 ℎ d𝜏 = 𝐾𝑡 ∫_0^∞ 𝐾𝜏 ℎ d𝜏 = 𝐾𝑡 𝑣.
Combining this result with the expression for 𝑣 in the previous display proves (10.39).
This proves part (i) of the proposition.
Turning to (ii), if we rearrange (10.39) and divide by 𝑡 > 0, we get
    −((𝐾𝑡 − 𝐼)/𝑡) 𝑣 = (1/𝑡) ∫_0^𝑡 𝐾𝜏 ℎ d𝜏.   (10.41)
is a positive 𝐶0 -semigroup.
In the proof of Proposition 10.2.2, we will use the fact that ( 𝑋𝑡 )𝑡⩾0 satisfies the
Markov property. In particular, if 𝐻 is a real-valued function on the path space
𝐶 ( R+ , X), then
    E𝑥 [ 𝐻((𝑋𝜏)𝜏⩾𝑠) | (𝑋𝜏)_{𝜏=0}^{𝑠} ] = E_{𝑋𝑠} 𝐻((𝑋𝜏)𝜏⩾0)   for all 𝑥 ∈ X.   (10.43)
EXERCISE 10.2.1. Let ( 𝑋𝑡 )𝑡⩾0 be as stated and, for each 𝑠, 𝑡 ∈ R+ with 𝑠 ⩽ 𝑡, let
𝜂 ( 𝑠, 𝑡 ) be the random variable defined by
    𝜂(𝑠, 𝑡) = exp( −∫_𝑠^𝑡 𝛿(𝑋𝜏) d𝜏 ).
Show that
(i) 𝜂 ( 𝑠, 𝑡 ) > 0 for all 0 ⩽ 𝑠 ⩽ 𝑡 ,
(ii) 𝜂 ( 𝑠, 𝑠) = 1 for all 𝑠 ∈ R+ , and
(iii) 𝜂 (0, 𝑠 + 𝑡 ) = 𝜂 (0, 𝑠) 𝜂 ( 𝑠, 𝑠 + 𝑡 ) for all 𝑠, 𝑡 ∈ R+ .
Using the Markov property (10.43), the inner expectation in the last display can be
expressed as
    E[ exp( −∫_𝑠^{𝑠+𝑡} 𝛿(𝑋𝜏) d𝜏 ) ℎ(𝑋_{𝑠+𝑡}) | (𝑋𝜏)_{𝜏=0}^{𝑠} ]
        = E_{𝑋𝑠} [ exp( −∫_0^𝑡 𝛿(𝑋𝜏) d𝜏 ) ℎ(𝑋𝑡) ] = (𝐾𝑡 ℎ)(𝑋𝑠),
so
    (𝐾_{𝑠+𝑡} ℎ)(𝑥) = E𝑥 [ 𝜂(0, 𝑠)(𝐾𝑡 ℎ)(𝑋𝑠) ] = E𝑥 [ exp( −∫_0^𝑠 𝛿(𝑋𝜏) d𝜏 ) (𝐾𝑡 ℎ)(𝑋𝑠) ] = (𝐾𝑠 𝐾𝑡 ℎ)(𝑥).
To see that 𝐾𝑡 is a positive operator for all 𝑡 , observe that if ℎ ⩾ 0, then the expec-
tation in (10.42) is nonnegative. Hence 𝐾𝑡 ℎ ⩾ 0 whenever ℎ ⩾ 0.
To prove continuity of 𝑡 ↦→ 𝐾𝑡 ℎ, it suffices to show that (𝐾𝑡 ℎ)(𝑥) → ℎ(𝑥) as 𝑡 ↓ 0
(see, e.g., Engel and Nagel (2006), Proposition 1.3). This holds by right-continuity
of 𝑋𝑡, which gives ℎ(𝑋𝑡) → ℎ(𝑥) as 𝑡 ↓ 0, and hence

    lim_{𝑡↓0} (𝐾𝑡 ℎ)(𝑥) = E𝑥 [ lim_{𝑡↓0} exp( −∫_0^𝑡 𝛿(𝑋𝜏) d𝜏 ) ℎ(𝑋𝑡) ] = ℎ(𝑥).
(Readers familiar with measure theory can justify the change of limit and expectation
via the dominated convergence theorem.) □
Many studies of continuous time dynamic programming with discounting use a con-
stant discount rate. In this setting, the lifetime value in (10.38) becomes
    𝑣(𝑥) ≔ E𝑥 ∫_0^∞ e^{−𝑡𝛿} ℎ(𝑋𝑡) d𝑡   (10.44)
for some 𝛿 ∈ R and ℎ ∈ RX . Here ( 𝑋𝑡 )𝑡⩾0 is a continuous time Markov chain on finite
state X generated by Markov semigroup ( 𝑃𝑡 )𝑡⩾0 with intensity operator 𝑄 . The idea is
that ℎ ( 𝑋𝑡 ) is an instantaneous reward at each time 𝑡 , while 𝛿 is a fixed discount rate.
Equation 10.44 is the continuous time version of (3.16) on page 94.
    (𝛿𝐼 − 𝑄)^{−1} ⩾ 0   and   𝑣 = (𝛿𝐼 − 𝑄)^{−1} ℎ.   (10.45)
Proof. As a first step, we reverse the order of expectation and integration in (10.44)
to get
    𝑣(𝑥) = ∫_0^∞ (𝐾𝑡 ℎ)(𝑥) d𝑡   where   (𝐾𝑡 ℎ)(𝑥) ≔ e^{−𝑡𝛿} E𝑥 ℎ(𝑋𝑡) = e^{−𝑡𝛿} (𝑃𝑡 ℎ)(𝑥).
(This change of order can be justified by Fubini's theorem, which can be applied when
E𝑥 ∫_0^∞ e^{−𝑡𝛿} |ℎ(𝑋𝑡)| d𝑡 < ∞. Since X is finite, we have |ℎ| ⩽ 𝑀 < ∞ for some constant
𝑀, and the double integral is dominated by 𝑀 ∫_0^∞ e^{−𝑡𝛿} d𝑡 = 𝑀/𝛿.)
Note that 𝐾𝑡 is a special case of (10.42). Hence ( 𝐾𝑡 )𝑡⩾0 is a positive 𝐶0 -semigroup.
Its infinitesimal generator is 𝐴 ≔ 𝑄 − 𝛿𝐼 , since 𝐾𝑡 = e−𝑡𝛿 𝑃𝑡 = e𝑡 ( 𝑄−𝛿𝐼 ) . We claim that
𝑠 ( 𝐴) < 0. To see this, observe that (using (10.17)),
    e^{𝑠(𝑄−𝛿𝐼)} = 𝜌(e^{𝑄−𝛿𝐼}) = 𝜌(e^𝑄 e^{−𝛿𝐼}) = 𝜌(e^𝑄 e^{−𝛿} 𝐼) = e^{−𝛿} 𝜌(e^𝑄) = e^{−𝛿} 𝜌(𝑃1) = e^{−𝛿}.

    𝑣 = −𝐴^{−1} ℎ = (−𝐴)^{−1} ℎ = (𝛿𝐼 − 𝑄)^{−1} ℎ.
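As a numerical check (ours, not from the text), the closed form (10.45) can be compared with a crude quadrature of the lifetime value; the matrix 𝑄 and vector ℎ below are arbitrary stand-ins.

    # Sketch: v = (δI − Q)^{-1} h versus direct integration of e^{-δt} P_t h.
    using LinearAlgebra

    Q = [-2.0 1.0 1.0; 0.0 -1.0 1.0; 2.0 1.0 -3.0]
    δ, h = 0.3, [1.0, 2.0, 3.0]
    v = (δ * I - Q) \ h                                  # closed form from (10.45)
    dt, T = 0.01, 60.0                                   # crude Riemann sum
    v_num = sum(exp(-δ * t) * (exp(t * Q) * h) * dt for t in 0:dt:T)
    println(norm(v - v_num))                             # small (quadrature error only)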
In this section we define continuous time Markov decision processes, discuss optimal-
ity theory, and provide algorithms and applications.
10.2.2.1 Definition
Given two finite sets A and X, called the state and action spaces respectively, we define
a continuous time Markov decision process (or continuous time MDP) to be a tuple
C = ( Γ, 𝛿, 𝑟, 𝑄 ) consisting of
G ≔ {( 𝑥, 𝑎) ∈ X × A : 𝑎 ∈ Γ ( 𝑥 )},
and 𝑄 ( 𝑥, 𝑎, 𝑥 0) ⩾ 0 whenever 𝑥 ≠ 𝑥 0.
Informally, at state 𝑥 with action 𝑎 over the short interval from 𝑡 to 𝑡 + ℎ, the
controller receives instantaneous reward 𝑟 ( 𝑥, 𝑎) ℎ and the state transitions to state 𝑥 0
with probability 𝑄 ( 𝑥, 𝑎, 𝑥 0) ℎ + 𝑜 ( ℎ).
Paralleling our discussion of the discrete time case in Chapter 5, the set of feasible
policies is
Σ ≔ {𝜎 ∈ AX : 𝜎 ( 𝑥 ) ∈ Γ ( 𝑥 ) for all 𝑥 ∈ X} . (10.47)
𝑄 𝜎 ( 𝑥, 𝑥 0) ≔ 𝑄 ( 𝑥, 𝜎 ( 𝑥 ) , 𝑥 0) ( 𝑥, 𝑥 0 ∈ X) .
Letting

    𝑃𝑡^𝜎 ≔ e^{𝑡𝑄𝜎}   and   𝑟𝜎(𝑥) ≔ 𝑟(𝑥, 𝜎(𝑥))   (𝑥 ∈ X),

the lifetime value of following 𝜎 starting from state 𝑥 is

    𝑣𝜎(𝑥) ≔ E𝑥 ∫_0^∞ e^{−𝛿𝑡} 𝑟(𝑋𝑡, 𝜎(𝑋𝑡)) d𝑡 = E𝑥 ∫_0^∞ e^{−𝛿𝑡} 𝑟𝜎(𝑋𝑡) d𝑡,   (10.48)
where ( 𝑋𝑡 )𝑡⩾0 is 𝑄 𝜎 -Markov with initial condition 𝑥 . We call 𝑣𝜎 the 𝜎-value function.
Since 𝛿 > 0, we can apply Proposition 10.2.3 to obtain
    𝑣𝜎 = (𝛿𝐼 − 𝑄𝜎)^{−1} 𝑟𝜎.   (10.49)
Like the discrete time case, a 𝑣-greedy policy chooses actions optimally to trade off
high current rewards versus high rate of flow into future states with high values.
Unlike the discrete time case, the discount factor does not appear in (10.50) because
the trade-off is instantaneous.
We introduce a continuous time policy iteration algorithm that parallels discrete time
HPI for Markov decision processes, as described in §5.1.4.2.
The continuous time HPI routine is given in Algorithm 10.2, with the intuition
being similar to that for the discrete time MDP version given on page 141. We provide
convergence results in §10.2.3.
2 𝑘 ← 0
3 𝜀 ← 1
4 while 𝜀 > 0 do
5 𝑣𝑘 ← ( 𝛿𝐼 − 𝑄 𝜎𝑘 ) −1 𝑟𝜎𝑘
6 𝜎𝑘+1 ← a 𝑣𝑘 -greedy policy
7 𝜀 ← 1{𝜎𝑘 ≠ 𝜎𝑘+1 }
8 𝑘← 𝑘+1
9 end
10 return 𝜎𝑘
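The sketch below is our own minimal Julia rendering of Algorithm 10.2, not the book's code. It assumes the primitives are stored as dense arrays with every action feasible in every state; the names continuous_hpi, r and Q are ours. Policy evaluation uses (10.49) and the greedy step uses (10.50).

    # Minimal continuous time HPI sketch. r is n×m, Q is n×m×n with
    # Q[x, a, :] an intensity row, and δ > 0 is the discount rate.
    using LinearAlgebra

    function continuous_hpi(r, Q, δ; σ0=ones(Int, size(r, 1)))
        n, m = size(r)
        σ = copy(σ0)
        while true
            Qσ = [Q[x, σ[x], y] for x in 1:n, y in 1:n]
            rσ = [r[x, σ[x]] for x in 1:n]
            v = (δ * I - Qσ) \ rσ                        # vσ = (δI − Qσ)^{-1} rσ
            # greedy step: maximize r(x,a) + Σ_{x'} v(x') Q(x, a, x')
            σ_new = [argmax([r[x, a] + dot(Q[x, a, :], v) for a in 1:m]) for x in 1:n]
            σ_new == σ && return σ, v
            σ = σ_new
        end
    end

Because the policy set is finite, the loop terminates after finitely many policy switches, in line with the convergence result discussed in §10.2.3.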
𝑇𝜎 𝑣 = 𝑟𝜎 + ( 𝑄 𝜎 + (1 − 𝛿) 𝐼 ) 𝑣. (10.51)
As shown in Proposition 10.2.3, each 𝑇𝜎 is order stable on RX , with unique fixed point
𝑣𝜎 . Hence A ≔ ( RX , {𝑇𝜎 }) is an order stable ADP.
EXERCISE 10.2.2. Show that 𝜎 is 𝑣-greedy (i.e., (10.50) holds) if and only if 𝜎 is
𝑣-greedy for A in the sense of §9.1.2.2.
10.2.3 Optimality
For a continuous time MDP C = ( Γ, 𝛿, 𝑟, 𝑄 ) with 𝜎-value functions { 𝑣𝜎 },
• the value function generated by C is 𝑣∗ ≔ ⋁_𝜎 𝑣𝜎, and
• a policy is called optimal for C if 𝑣𝜎 = 𝑣∗ .
A function 𝑣 ∈ RX is said to satisfy a Hamilton–Jacobi–Bellman (HJB) equation if

    𝛿𝑣(𝑥) = max_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + ∑_{𝑥′} 𝑣(𝑥′) 𝑄(𝑥, 𝑎, 𝑥′) }   for all 𝑥 ∈ X.   (10.52)
It is clear from (10.50) and Exercise 10.2.2 that, for each 𝑣 ∈ RX , the set of 𝑣-max-
greedy policies is nonempty. Since Σ is finite, it follows from Proposition 9.2.1 that
A is max-stable. Hence, by Theorem 9.2.4, an optimal policy always exists and the
value function 𝑣∗ is the unique fixed point of 𝑇 in RX . The last statement is equivalent
to the assertion that 𝑣∗ is the unique element of RX satisfying
    𝑣∗(𝑥) = max_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + ∑_{𝑥′} 𝑣∗(𝑥′) 𝑄(𝑥, 𝑎, 𝑥′) + (1 − 𝛿) 𝑣∗(𝑥) }.
Rearranging this expression confirms that 𝑣∗ is the unique solution to the HJB equation
in RX .
Applying Theorem 9.2.4 again, a policy is optimal for A if and only if 𝑇𝜎 𝑣∗ = 𝑇 𝑣∗ .
Since the definition of optimality for A coincides with the definition of optimality for
C, we see that C obeys Bellman’s principle of optimality.
The continuous time HPI routine described in Algorithm 10.2 is just ADP max-HPI
(see §9.2.1.4) specialized to the current setting. Hence, applying Theorem 9.2.4 once
more, continuous time HPI converges to an optimal policy in finitely many steps. □
𝜆 ( 𝑥 ) = 𝜆 ( 𝑠, 𝑤) = 1{ 𝑠 = 0}𝜅 + 1{ 𝑠 = 1}𝛼
denote the state-dependent jump rate, which switches between 𝜅 and 𝛼 depending
on employment status.
Let 𝑎 ∈ A ≔ {0, 1} indicate the action, where 0 means reject and 1 means accept.
Let Π ( 𝑥, 𝑎, 𝑥 0) represent the jump probabilities, with
The first two lines consider jump probabilities for the state ( 𝑠, 𝑤) when unemployed
and the action is 𝑎. The second two consider jump probabilities when employed. The
reason that the probability assigned to the last line is zero is that a jump from 𝑠 = 1
occurs because the worker is fired, so the value of 𝑠 after the jump is zero.
EXERCISE 10.2.3. Prove that Π is a stochastic kernel, in the sense that Π ⩾ 0 and
∑_{𝑥′} Π(𝑥, 𝑎, 𝑥′) = 1 for all possible (𝑥, 𝑎) = ((𝑠, 𝑤), 𝑎) in X × {0, 1}.
𝑄 𝜎 ( 𝑥, 𝑥 0) ≔ 𝜆 ( 𝑥 )( Π ( 𝑥, 𝜎 ( 𝑥 ) , 𝑥 0) − 𝐼 ( 𝑥, 𝑥 0)) ,
EXERCISE 10.2.4. Provide economic intuition for the monotone relationships be-
tween parameters and the reservation wage discussed in the preceding paragraph.
Applebaum (2019) and Engel and Nagel (2006) provide elegant introductions to
semigroup theory and its applications in studying partial and stochastic differential
equations. The beautiful book by Lasota and Mackey (1994) covers connections
among semigroups, Markov processes, and stochastic differential equations. Norris
(1998) provides a good introduction to continuous time Markov chains, while Liggett
(2010) is more advanced.
[Figure: the reservation wage plotted against the separation rate, the offer rate, the discount rate, and unemployment compensation.]
A rigorous treatment of continuous time MDPs can be found in Hernández-Lerma
and Lasserre (2012b), which also handles the case where X is countably infinite. Our
approach is somewhat different, since our main optimality results rest on the ADP
theory in Chapter 9.
In recent years, continuous time dynamic programming has become more common
in macroeconomic analysis. Influential references include Nuño and Moll (2018),
Kaplan et al. (2018), Achdou et al. (2022), and Fernández-Villaverde et al. (2023).
For computational aspects, see Duarte (2018), Ráfales and Vázquez (2021), Rendahl
(2022), and Eslami and Phelan (2023).
Part I
Appendices
Appendix A
Suprema and Infima
This section of the appendix contains an extremely brief review of basic facts concern-
ing sets, functions, suprema and infima. We recommend Bartle and Sherbert (2011)
for those who wish to learn more.
the image of 𝐶 under 𝑓 . Also, for 𝐷 ⊂ 𝐵, the set 𝑓 −1 ( 𝐷) is all points in 𝐴 that map into
𝐷 under 𝑓 , and is called the preimage of 𝐷 under 𝑓 .
A function 𝑓 : 𝐴 → 𝐵 is called one-to-one if distinct elements of 𝐴 are always
mapped into distinct elements of 𝐵, and onto if every element of 𝐵 is the image under
𝑓 of at least one point in 𝐴. A bijection or one-to-one correspondence from 𝐴 to 𝐵
is a function 𝑓 from 𝐴 to 𝐵 that is both one-to-one and onto.
A set X is called finite if there exists a bijection from X to [ 𝑛] ≔ {1, . . . , 𝑛} for some
𝑛 ∈ N. In this case we can write X = { 𝑥1 , . . . , 𝑥 𝑛 }. The number 𝑛 is called the car-
dinality of X. Note that, according to our definition, every finite set is automatically
nonempty.
If 𝑓 : 𝐴 → 𝐵 and 𝑔 : 𝐵 → 𝐶 , then the composition of 𝑓 and 𝑔 is the function 𝑔 ◦ 𝑓
from 𝐴 to 𝐶 defined at 𝑎 ∈ 𝐴 by ( 𝑔 ◦ 𝑓 )( 𝑎) ≔ 𝑔 ( 𝑓 ( 𝑎)).
EXERCISE A.2.1. Show that, for all of the sets (0, 1), [0, 1) and (0, 1], the number
1 is the supremum of the set.
EXERCISE A.2.2. Fix 𝐴 ⊂ R. Prove that, for 𝑠 ∈ 𝑈 ( 𝐴), we have 𝑠 = sup 𝐴 if and only
if, for all 𝜀 > 0, there exists a point 𝑎 ∈ 𝐴 with 𝑎 > 𝑠 − 𝜀.
Theorem A.2.1 is often taken as axiomatic in formal constructions of the real num-
bers. (Alternatively, one may assume completeness of the reals and then prove The-
orem A.2.1 using this property. See, e.g., Bartle and Sherbert (2011).)
If 𝑖 ∈ R is a lower bound for 𝐴 and also satisfies 𝑖 ⩾ ℓ for every lower bound ℓ of
𝐴, then 𝑖 is called the infimum of 𝐴 and we write 𝑖 = inf 𝐴. At most one such 𝑖 exists,
and every nonempty subset of R bounded from below has an infimum.
A real sequence is a map 𝑥 from N to R, with the value of the function at 𝑘 ∈ N
typically denoted by 𝑥 𝑘 rather than 𝑥 ( 𝑘). A real sequence 𝑥 = ( 𝑥 𝑘 ) 𝑘⩾1 ≔ ( 𝑥 𝑘 ) 𝑘∈N is said
to converge to 𝑥¯ ∈ R if, for each 𝜀 > 0, there exists an 𝑁 ∈ N such that | 𝑥 𝑘 − 𝑥¯| < 𝜀
for all 𝑘 ⩾ 𝑁 . In this case we write lim𝑘 𝑥 𝑘 = 𝑥¯ or 𝑥 𝑘 → 𝑥¯. Bartle and Sherbert (2011)
give an excellent introduction to real sequences and their basic properties.
A real sequence ( 𝑥 𝑘 ) 𝑘⩾1 is called increasing if 𝑥 𝑘 ⩽ 𝑥 𝑘+1 for all 𝑘 and decreasing if
𝑥 𝑘+1 ⩽ 𝑥 𝑘 for all 𝑘. If ( 𝑥 𝑘 ) 𝑘⩾1 is increasing (resp., decreasing) and 𝑥 𝑘 → 𝑥 ∈ R then we
also write 𝑥 𝑘 ↑ 𝑥 (resp., 𝑥 𝑘 ↓ 𝑥 ).
Let (𝑥𝑘) be a real sequence in R and set 𝑠𝑛 ≔ ∑_{𝑘=1}^{𝑛} 𝑥𝑘. If the sequence (𝑠𝑛) converges
to some 𝑠 ∈ R, then we set

    ∑_{𝑘=1}^{∞} 𝑥𝑘 ≔ ∑_{𝑘⩾1} 𝑥𝑘 ≔ 𝑠 = lim_{𝑛→∞} 𝑠𝑛.

We say that the series ∑_{𝑘=1}^{𝑛} 𝑥𝑘 converges to ∑_{𝑘=1}^{∞} 𝑥𝑘.
A subset 𝐴 of R is called closed if, for any sequence ( 𝑥𝑛 ) contained in 𝐴 and con-
verging to some limit 𝑥 ∈ R, the limit 𝑥 is in 𝐴.
EXERCISE A.3.2. Show that, if 𝐴 is a closed and bounded subset of R, then 𝐴 has
both a maximum and a minimum.
whenever the latter exists. The terms inf 𝑥 ∈ 𝐷 𝑓 ( 𝑥 ) and min𝑥 ∈ 𝐷 𝑓 ( 𝑥 ) are defined analo-
gously. A point 𝑥 ∗ ∈ 𝐷 is called a
Appendix B
Remaining Proofs
Proof of Lemma 2.2.5. Regarding (i), fix 𝜑, 𝜓 ∈ D(X) with 𝜑 ⪯F 𝜓. Pick any 𝑦 ∈ X.
By transitivity of partial orders, the function 𝑢(𝑥) ≔ 1{𝑦 ⪯ 𝑥} is in 𝑖RX. Hence
∑_𝑥 𝑢(𝑥)𝜑(𝑥) ⩽ ∑_𝑥 𝑢(𝑥)𝜓(𝑥). Given the definition of 𝑢, this is equivalent to 𝐺^𝜑(𝑦) ⩽
𝐺^𝜓(𝑦). As 𝑦 was chosen arbitrarily, we have 𝐺^𝜑 ⩽ 𝐺^𝜓 pointwise on X.
Regarding (ii), let 𝜑, 𝜓 ∈ D(X) be such that 𝐺^𝜑 ⩽ 𝐺^𝜓 and let X be totally ordered
by ⪯. We can write X as {𝑥1, …, 𝑥𝑛} with 𝑥𝑖 ⪯ 𝑥𝑖+1 for all 𝑖. Pick any 𝑢 ∈ 𝑖RX and let
𝛼𝑖 = 𝑢(𝑥𝑖). By Exercise 2.2.32, we can write 𝑢 as 𝑢(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑠𝑖 1{𝑥 ⪰ 𝑥𝑖} at each 𝑥 ∈ X,
where 𝑠𝑖 ⩾ 0 for all 𝑖. Hence

    ∑_{𝑥∈X} 𝑢(𝑥)𝜑(𝑥) = ∑_{𝑥∈X} ∑_{𝑖=1}^{𝑛} 𝑠𝑖 1{𝑥 ⪰ 𝑥𝑖} 𝜑(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑠𝑖 ∑_{𝑥∈X} 1{𝑥 ⪰ 𝑥𝑖} 𝜑(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑠𝑖 𝐺^𝜑(𝑥𝑖).
A similar argument gives ∑_{𝑥∈X} 𝑢(𝑥)𝜓(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑠𝑖 𝐺^𝜓(𝑥𝑖). Since 𝐺^𝜑 ⩽ 𝐺^𝜓, we have

    ∑_{𝑥∈X} 𝑢(𝑥)𝜑(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑠𝑖 𝐺^𝜑(𝑥𝑖) ⩽ ∑_{𝑖=1}^{𝑛} 𝑠𝑖 𝐺^𝜓(𝑥𝑖) = ∑_{𝑥∈X} 𝑢(𝑥)𝜓(𝑥).
    𝐹𝑇 ≔ ∑_{𝑡=0}^{𝑇} 𝛿𝑡 ℎ(𝑋𝑡)   and   𝐹 ≔ ∑_{𝑡=0}^{∞} 𝛿𝑡 ℎ(𝑋𝑡),   where 𝛿𝑡 ≔ ∏_{𝑖=0}^{𝑡} 𝛽𝑖.
Our first aim is to show that 𝐹 is a well-defined random variable, in the sense that the
sum converges almost surely. Since absolute convergence of real series implies con-
vergence, and since finite expectation implies finiteness almost everywhere, it suffices
to show that
    E𝑥 ∑_{𝑡=0}^{∞} 𝛿𝑡 |ℎ(𝑋𝑡)| < ∞.   (B.2)
By the monotone convergence theorem (see, e.g., Dudley (2002), Theorem 4.3.2), we
have
    E𝑥 ∑_{𝑡=0}^{∞} 𝛿𝑡 |ℎ(𝑋𝑡)| = ∑_{𝑡=0}^{∞} E𝑥 𝛿𝑡 |ℎ(𝑋𝑡)| = ∑_{𝑡=0}^{∞} (𝐿^𝑡 |ℎ|)(𝑥),
where the last equality is by (6.6). Since 𝜌 ( 𝐿) < 1, we have shown that (B.2) holds,
which in turn confirms that 𝐹 is well-defined and finite almost surely.
Now observe that, on the probability one set where 𝐹 is finite, we have 𝐹𝑇 → 𝐹 as
𝑇 → ∞. Moreover,
    |𝐹𝑇| ⩽ ∑_{𝑡=0}^{𝑇} 𝛿𝑡 |ℎ(𝑋𝑡)| ⩽ 𝑌 ≔ ∑_{𝑡=0}^{∞} 𝛿𝑡 |ℎ(𝑋𝑡)|,
    ∑_{𝑡=0}^{∞} E𝑥 𝛿𝑡 ℎ(𝑋𝑡) = lim_{𝑇→∞} ∑_{𝑡=0}^{𝑇} E𝑥 𝛿𝑡 ℎ(𝑋𝑡) = lim_{𝑇→∞} E𝑥 ∑_{𝑡=0}^{𝑇} 𝛿𝑡 ℎ(𝑋𝑡) = E𝑥 ∑_{𝑡=0}^{∞} 𝛿𝑡 ℎ(𝑋𝑡).
    𝜆 = min_{𝑥∈X} [𝑎(𝑥) − 𝜑(𝑥)] / [𝑏(𝑥) − 𝜑(𝑥)]
and let 𝑥¯ be a minimizer. It follows immediately from its definition that 𝜆 obeys
0 ⩽ 𝜆 ⩽ 1 and
𝑎 = 𝑇 𝑎 ⩾ 𝑇 ( 𝜆𝑏 + (1 − 𝜆 ) 𝜑) ⩾ 𝜆𝑏 + (1 − 𝜆 )𝑇 𝜑.
(iv) If A is finite, then 𝑣∗ exists in 𝑉 and 𝐻 𝑣∗ = 𝑣∗ . Moreover, for all 𝑣 ∈ 𝑉 , the HPI
sequence ( 𝑣𝑘 ) defined by 𝑣𝑘 = 𝐻 𝑘 𝑣 converges to 𝑣∗ in finitely many steps.
(v) Fix 𝑣 ∈ 𝑉 and let ( 𝑣𝑘 ) be the HPI sequence defined by 𝑣𝑘 = 𝐻 𝑘 𝑣 for 𝑘 ∈ N. If
𝑣𝑘+1 = 𝑣𝑘 for some 𝑘 ∈ N, then 𝑣𝑘 = 𝑣∗ and every 𝑣𝑘 -greedy policy is optimal.
Proof of Proposition 9.2.1. If A is finite, then, by (iii)–(iv) of Lemma B.4.1, the point
𝑣∗ exists in 𝑉 and is a fixed point of 𝑇 . □
Proof of Theorem 9.2.4. Parts (i)–(iv) of Theorem 9.2.4 follow from Proposition 9.2.5,
which provides optimality results for max-stable ADPs, and Proposition 9.2.1, which
tells us that every finite order stable ADP is max-stable. Regarding the final claim in
Theorem 9.2.4, on convergence of HPI, suppose that A is finite and order stable. If
HPI terminates, then (v) of Lemma B.4.1 implies that it returns an optimal policy.
Part (iv) of the same lemma implies that HPI terminates in finitely many steps. □
Appendix C
Solutions to Selected Exercises
Solution to Exercise 1.1.1. Here is one possible answer: On one hand, providing
additional unemployment compensation is costly for taxpayers and tends to increase
the unemployment rate. On the other hand, unemployment compensation encour-
ages the worker to reject low initial offers, leading to a better lifetime wage. This
can enhance worker welfare and expand the tax base. A larger model is needed to
disentangle these effects.
    𝑇𝑢̄ = 𝑇𝑇^𝑚 𝑢̄ = 𝑇^{𝑚+1} 𝑢̄ = 𝑢̄,
so 𝑢̄ is a fixed point of 𝑇.
Solution to Exercise 1.2.16. Assume the hypotheses of the exercise and let 𝑢𝑚 ≔
𝑇^𝑚 𝑢 for all 𝑚 ∈ N. By continuity and 𝑢𝑚 → 𝑢∗, we have 𝑇𝑢𝑚 → 𝑇𝑢∗. But the sequence
(𝑇𝑢𝑚) is just (𝑢𝑚) with the first element omitted, so, given that 𝑢𝑚 → 𝑢∗, we must have
𝑇𝑢𝑚 → 𝑢∗. Since limits are unique, it follows that 𝑢∗ = 𝑇𝑢∗.
Solution to Exercise 1.2.18. Let the stated hypotheses hold and fix 𝑢 ∈ 𝐶 . By
global stability we have 𝑇 𝑘 𝑢 → 𝑢∗ . Since 𝑇 is invariant on 𝐶 we have (𝑇 𝑘 𝑢) 𝑘∈N ⊂ 𝐶 .
Since 𝐶 is closed, this implies that the limit is in 𝐶 . In other words, 𝑢∗ ∈ 𝐶 , as claimed.
    ‖𝐴𝑥 + 𝑏 − 𝐴𝑦 − 𝑏‖ = ‖𝐴(𝑥 − 𝑦)‖ ⩽ ‖𝐴‖ ‖𝑥 − 𝑦‖.

    ‖𝑢𝑚 − 𝑢𝑘‖ ⩽ [(𝜆^𝑚 − 𝜆^𝑘)/(1 − 𝜆)] ‖𝑢0 − 𝑢1‖   (𝑚, 𝑘 ∈ N with 𝑚 < 𝑘).

Hence (𝑢𝑚) is Cauchy, as claimed.

    (𝑔(𝑦) − 𝑔(𝑥))/(𝑦 − 𝑥) > |𝑔′(𝑥)| − 𝜀 = 𝑔′(𝑥) − 𝜀

    |𝑔(𝑥) − 𝑔(𝑦)| > [𝑔′(𝑥) − 𝜀] |𝑥 − 𝑦|

follows that, for any 𝜆 ∈ [0, 1), we can find a pair 𝑥, 𝑦 such that |𝑔(𝑥) − 𝑔(𝑦)| > 𝜆|𝑥 − 𝑦|.
Hence 𝑔 is not a contraction map under | · |.
𝛼 ∨ 𝑥 ⩽ |𝑥 − 𝑦| + 𝛼 ∨ 𝑦 ⇐⇒ 𝛼 ∨ 𝑥 − 𝛼 ∨ 𝑦 ⩽ |𝑥 − 𝑦 |.
Solution to Exercise 2.1.1. Let (𝑈, 𝑇 ) and (𝑈,ˆ 𝑇ˆ) be conjugate under Φ, with
𝑇 ◦ Φ = Φ ◦ 𝑇 . The stated equivalence holds because
ˆ
𝑇𝑢 = 𝑢 ⇐⇒ Φ𝑇𝑢 = Φ𝑢 ⇐⇒ 𝑇ˆ Φ𝑢 = Φ𝑢.
Solution to Exercise 2.1.3. To show that 𝑇 = Φ−1 ◦𝑇ˆ ◦ Φ holds, we can equivalently
prove that Φ ◦ 𝑇 = 𝑇ˆ ◦ Φ. For 𝑢 ∈ R, we have Φ𝑇𝑢 = ln 𝐴 + 𝛼 ln 𝑢 and 𝑇ˆ Φ𝑢 = ln 𝐴 + 𝛼 ln 𝑢.
Hence Φ ◦ 𝑇 = 𝑇ˆ ◦ Φ, as was to be shown.
𝑇 𝑘 𝑢 → 𝑢∗ ⇐⇒ Φ𝑇 𝑘 𝑢 → Φ𝑢∗ ⇐⇒ 𝑇ˆ 𝑘 Φ𝑢 → Φ𝑢∗ .
Solution to Exercise 2.1.5. Let U be the set of all dynamical systems (𝑈, 𝑇) with
𝑈 ⊂ R^𝑛 and write (𝑈, 𝑇) ∼ (𝑈̂, 𝑇̂) if these systems are topologically conjugate. It is
easy to see that ∼ is reflexive and symmetric. Regarding transitivity, suppose that
(𝑈, 𝑇) ∼ (𝑈′, 𝑇′) and (𝑈′, 𝑇′) ∼ (𝑈″, 𝑇″). Let 𝐹 be the homeomorphism from 𝑈 to 𝑈′
and 𝐺 be the homeomorphism from 𝑈′ to 𝑈″. Then 𝐻 ≔ 𝐺 ∘ 𝐹 is a homeomorphism
from 𝑈 to 𝑈″ with inverse 𝐹^{−1} ∘ 𝐺^{−1}. Moreover, on 𝑈, we have

    𝑇 = 𝐹^{−1} ∘ 𝑇′ ∘ 𝐹 = 𝐹^{−1} ∘ 𝐺^{−1} ∘ 𝑇″ ∘ 𝐺 ∘ 𝐹 = (𝐺 ∘ 𝐹)^{−1} ∘ 𝑇″ ∘ 𝐺 ∘ 𝐹.
Solution to Exercise 2.2.7. Regarding the first claim, fix 𝐵 ∈ M_{𝑚×𝑘} with 𝑏𝑖𝑗 ⩾ 0
for all 𝑖, 𝑗. Pick any 𝑖 ∈ [𝑚] and 𝑢 ∈ R^𝑘. By the triangle inequality, we have |∑_𝑗 𝑏𝑖𝑗 𝑢𝑗| ⩽
∑_𝑗 𝑏𝑖𝑗 |𝑢𝑗|. Stacking these inequalities yields |𝐵𝑢| ⩽ 𝐵|𝑢|, as was to be shown.
Regarding the second, let 𝐴 and ( 𝑢𝑘 ) be as stated, with 𝑢𝑘+1 ⩽ 𝐴𝑢𝑘 for all 𝑘. We
aim to prove 𝑢𝑘 ⩽ 𝐴𝑘 𝑢0 for all 𝑘 using induction. In doing so, we observe that 𝑢1 ⩽
𝐴𝑢0 , so the claim is true at 𝑘 = 1. Suppose now that it holds at 𝑘 − 1. Then 𝑢𝑘 ⩽
𝐴𝑢𝑘−1 ⩽ 𝐴𝐴𝑘−1 𝑢0 = 𝐴𝑘 𝑢0 , where the last step used nonnegativity of 𝐴 and the induction
hypothesis. The claim is now proved.
Solution to Exercise 2.2.8. Assume the stated conditions. Let ℎ ≔ 𝑣 − 𝑢 and let
𝑎𝑖𝑗 be the 𝑖, 𝑗-th element of 𝐴. We have ℎ ⩾ 0 and ℎ𝑗 > 0 at some 𝑗. Hence ∑_𝑗 𝑎𝑖𝑗 ℎ𝑗 > 0.
This says that every row of 𝐴ℎ is strictly positive. In other words, 𝐴ℎ = 𝐴(𝑣 − 𝑢) ≫ 0.
The claim follows.
Solution to Exercise 2.2.9. Let 𝑀 = {1, 2}, let 𝐴 = {1} and let 𝐵 = {2}. Then
𝐴 ⊂ 𝐵 and 𝐵 ⊂ 𝐴 both fail. Hence ⊂ is not a total order on ℘ ( 𝑀 ).
Solution to Exercise 2.2.16. To see the former, observe that 𝐴𝑗 ⊂ ∪𝑖 𝐴𝑖 for all
𝑗 ∈ 𝐼. Hence ∪𝑖 𝐴𝑖 is an upper bound of {𝐴𝑖}. Moreover, if 𝐵 ⊂ 𝑀 and 𝐴𝑖 ⊂ 𝐵 for all
𝑖 ∈ 𝐼, then ∪𝑖 𝐴𝑖 ⊂ 𝐵. This proves that ∪𝑖 𝐴𝑖 is the supremum. The proof of the infimum
case is similar.
Solution to Exercise 2.2.17. Here is one possible answer. Let 𝑃 = (0, 1), partially
ordered by ⩽. The set 𝐴 = [1/2, 1) is bounded above in R (and hence has a supremum
in R) but has no supremum in 𝑃. Indeed, if 𝑠 = ⋁𝐴, then 𝑠 ∈ 𝑃 and 𝑎 ⩽ 𝑠 for all 𝑎 ∈ 𝐴.
It is clear that no such element exists.
𝑎 ∧ 𝑐 = ( 𝑎 − 𝑏 + 𝑏) ∧ 𝑐 ⩽ (| 𝑎 − 𝑏 | + 𝑏) ∧ 𝑐 ⩽ | 𝑎 − 𝑏 | ∧ 𝑐 + 𝑏 ∧ 𝑐.
Solution to Exercise 2.2.24. Since min 𝑓 = − max(− 𝑓 ) and similarly for 𝑔, we can
apply Lemma 2.2.2 to obtain
Solution to Exercise 2.2.26. Let 𝐴 and 𝑃 be as stated. The claim that 𝐴𝑘 is order-
preserving on 𝑃 holds at 𝑘 = 1. Suppose now that it holds at 𝑘 and fix 𝑝, 𝑞 ∈ 𝑃 with
𝑝 𝑞. By the induction hypothesis and the fact that 𝐴 is order-preserving, we have
𝐴𝐴𝑘 𝑝 𝐴𝐴𝑘 𝑞. Hence 𝐴𝑘+1 𝑝 𝐴𝑘+1 𝑞. We conclude that 𝐴𝑘+1 is also order-preserving,
as was to be shown.
    ∑_{𝑘=1}^{𝑛} 𝑠𝑘 1{𝑥𝑗 ⩾ 𝑥𝑘} = ∑_{𝑘=1}^{𝑗} 𝑠𝑘 = (𝛼1 − 𝛼0) + (𝛼2 − 𝛼1) + ⋯ + (𝛼𝑗 − 𝛼𝑗−1) = 𝛼𝑗.

In other words, ∑_{𝑘=1}^{𝑛} 𝑠𝑘 1{𝑥𝑗 ⩾ 𝑥𝑘} = 𝑢(𝑥𝑗). This completes the proofs.
Hence ∑_𝑥 𝑢(𝑥)𝑓(𝑥) ⩽ ∑_𝑥 𝑢(𝑥)ℎ(𝑥). Since 𝑢 was arbitrary in 𝑖RX, we are done.
{ 𝑥 ∈ X : 𝐹 𝜓 ( 𝑥 ) ⩾ 𝜏} ⊂ { 𝑥 ∈ X : 𝐹 𝜑 ( 𝑥 ) ⩾ 𝜏} .
min{ 𝑥 ∈ X : 𝐹 𝜑 ( 𝑥 ) ⩾ 𝜏} ⩽ min{ 𝑥 ∈ X : 𝐹 𝜓 ( 𝑥 ) ⩾ 𝜏} .
That is, 𝑄 𝜏 ( 𝑋 ) ⩽ 𝑄 𝜏 (𝑌 ).
Solution to Exercise 2.3.1. Let 𝐴 be as stated and let 𝑒 be the right eigenvector in
(2.10). Since 𝑒 is nonnegative and nonzero, and since eigenvectors are defined only
up to constant multiples, we can and do assume that ∑_𝑗 𝑒𝑗 = 1. From 𝐴𝑒 = 𝜌(𝐴)𝑒 we
have ∑_𝑗 𝑎𝑖𝑗 𝑒𝑗 = 𝜌(𝐴) 𝑒𝑖 for all 𝑖. Summing with respect to 𝑖 gives ∑_𝑗 colsum𝑗(𝐴) 𝑒𝑗 = 𝜌(𝐴).
Since the elements of 𝑒 are nonnegative and sum to one, 𝜌(𝐴) is a weighted average
of the column sums. Hence the second pair of bounds in Lemma 2.3.2 holds. The
remaining proof is similar (use the left eigenvector).
Solution to Exercise 2.3.3. Let 𝑃 and 𝜀 have the stated properties. Fix ℎ ∈ RX.
It suffices to show that for this arbitrary ℎ we can find an 𝑥 ∈ X such that (𝑃ℎ)(𝑥) <
ℎ(𝑥) + 𝜀. This is easy to verify, since, for 𝑥̄ ∈ argmax_{𝑥∈X} ℎ(𝑥), we have
(𝑃ℎ)(𝑥̄) = ∑_{𝑥′} ℎ(𝑥′) 𝑃(𝑥̄, 𝑥′) ⩽ ℎ(𝑥̄).
is in D(X) whenever 𝜑 ∈ D(X), since, for any such 𝜑, the vector 𝜑𝑃 is clearly non-
negative and

    ∑_{𝑥′} (𝜑𝑃)(𝑥′) = ∑_{𝑥′} ∑_𝑥 𝑃(𝑥, 𝑥′) 𝜑(𝑥) = ∑_𝑥 𝜑(𝑥) = 1.
The induction hypothesis allows us to use (3.2) at 𝑘, so the last equation becomes

    P{𝑋_{𝑡+𝑘+1} = 𝑥′ | 𝑋𝑡 = 𝑥} = ∑_𝑧 𝑃^𝑘(𝑥, 𝑧) 𝑃(𝑧, 𝑥′) = 𝑃^{𝑘+1}(𝑥, 𝑥′).
Solution to Exercise 3.1.3. Let 𝑥 ∈ X be the current state at time 𝑡 and suppose
first that 𝑠 < 𝑥 . The next period state 𝑋𝑡+1 hits 𝑠 with positive probability, since 𝜑 ( 𝑑 ) > 0
for all 𝑑 ∈ Z+ . The state 𝑋𝑡+2 hits 𝑆 + 𝑠 with positive probability, since 𝜑 (0) > 0. From
𝑆 + 𝑠, the inventory level reaches any point in X = {0, . . . , 𝑆 + 𝑠 } in one step with
positive probability. Hence, from current state 𝑥 , inventory reaches any other state 𝑦
with positive probability in three steps.
The logic for the case 𝑥 ⩽ 𝑠 is similar and left to the reader.
𝜓𝑃 𝑡 → 𝜓1 𝜓∗ = 𝜓∗ as 𝑡 → ∞.
Solution to Exercise 3.2.2. Using Exercise 3.1.11 and the definition of 𝑃, it can
be shown that

    𝐺(𝑥, 𝑥𝑘) ≔ ∑_{𝑗=𝑘}^{𝑛} 𝑃(𝑥, 𝑥𝑗) = P{𝑥𝑘 − 𝑠/2 < 𝑋_{𝑡+1} | 𝑋𝑡 = 𝑥}.
Solution to Exercise 3.2.5. Clearly this is true for 𝑡 = 1. Suppose it is also true
for arbitrary 𝑡. Then, for any ℎ ∈ 𝑖RX, the function 𝑃^𝑡 ℎ is again in 𝑖RX. From this it
follows that 𝑃^{𝑡+1} ℎ = 𝑃𝑃^𝑡 ℎ is also in 𝑖RX, since 𝑃 is monotone increasing. This proves
that 𝑃^{𝑡+1} is invariant on 𝑖RX, and therefore monotone increasing.
Solution to Exercise 3.2.6. Let 𝜋 and 𝑃 satisfy the stated conditions. By Exer-
cise 3.2.5, 𝑃^𝑡 is monotone increasing for all 𝑡. By this fact and the assumption 𝜋 ∈ 𝑖RX,
we see that 𝑃^𝑡 𝜋 ∈ 𝑖RX for all 𝑡. Hence 𝑣 = ∑_{𝑡⩾0} 𝛽^𝑡 𝑃^𝑡 𝜋 is also increasing.
Solution to Exercise 3.3.1. We start with part (i). To show that 𝑇 is a self-map
on 𝑉 ≔ RW + , we just need to verify that 𝑣 ∈ 𝑉 implies 𝑇 𝑣 ∈ 𝑉 , which only requires
us to verify that 𝑇 maps nonnegative functions into nonnegative functions. This is
clear from the definition. Regarding the order-preserving property, fix 𝑓 , 𝑔 ∈ 𝑉 with
𝑓 ⩽ 𝑔. We claim that 𝑇𝑓 ⩽ 𝑇𝑔. Indeed, if 𝑤 ∈ W, then ∑_{𝑤′∈W} 𝑓(𝑤′)𝑃(𝑤, 𝑤′) ⩽
∑_{𝑤′∈W} 𝑔(𝑤′)𝑃(𝑤, 𝑤′), which in turn implies that (𝑇𝑓)(𝑤) ⩽ (𝑇𝑔)(𝑤). Since 𝑤 was an
arbitrary wage value, we have 𝑇𝑓 ⩽ 𝑇𝑔, so 𝑇 is order-preserving.
Regarding part (ii), let 𝑒 ( 𝑤) ≔ 𝑤/(1 − 𝛽 ) and fix 𝑓 , 𝑔 in 𝑉 . Writing the operators
pointwise and applying the last result in Lemma 2.2.1 (page 58) gives
|𝑇 𝑓 − 𝑇 𝑔 | = | 𝑒 ∨ ( 𝑐 + 𝛽𝑃 𝑓 ) − 𝑒 ∨ ( 𝑐 + 𝛽𝑃𝑔 )|
⩽ | 𝛽𝑃 𝑓 − 𝛽𝑃𝑔 |
= 𝛽 | 𝑃 ( 𝑓 − 𝑔)|
⩽ 𝛽𝑃 | 𝑓 − 𝑔 | .
(Here the last inequality uses the result in Exercise 2.2.7 on page 53.) Since 𝑃 ⩾ 0 we
have 𝑃|𝑓 − 𝑔| ⩽ 𝑃‖𝑓 − 𝑔‖∞ 1 = ‖𝑓 − 𝑔‖∞ 1, so

    |𝑇𝑓 − 𝑇𝑔| ⩽ 𝛽 ‖𝑓 − 𝑔‖∞ 1.
Solution to Exercise 3.3.2. The code in Listing 10 creates a Markov chain via
Tauchen approximation of an AR(1) process with positive autocorrelation parameter.
By Exercise 3.2.2, 𝑃 is monotone increasing. Hence, by Lemma 3.3.1, the value func-
tion is increasing. Since ℎ∗ = 𝑐 + 𝛽𝑃𝑣∗ , it follows that ℎ∗ is increasing. Regarding
intuition, positive autocorrelation in wages means that high current wages predict
high future wages. It follows that the value of waiting for future wages rises with
current wages.
Since 𝑇 𝑣 = (𝑇1 𝑣) ∨ (𝑇2 𝑣), Lemma 2.2.3 on page 59 tells us that 𝑇 will be a contraction
provided that 𝑇1 and 𝑇2 are both contraction maps. For the case of 𝑇2 , we have
    ‖𝑇2 𝑓 − 𝑇2 𝑔‖∞ = max_𝑤 |𝑐 + 𝛽(𝑃𝑓)(𝑤) − 𝑐 − 𝛽(𝑃𝑔)(𝑤)| ⩽ max_𝑤 𝛽 ∑_{𝑤′} |𝑓(𝑤′) − 𝑔(𝑤′)| 𝑃(𝑤, 𝑤′).
    |(𝑇𝜎 𝑓)(𝑥) − (𝑇𝜎 𝑔)(𝑥)| = (1 − 𝜎(𝑥)) 𝛽 | ∑_{𝑥′} (𝑔(𝑥′) − 𝑓(𝑥′)) 𝑃(𝑥, 𝑥′) |
                            ⩽ 𝛽 | ∑_{𝑥′} [𝑓(𝑥′) − 𝑔(𝑥′)] 𝑃(𝑥, 𝑥′) |.

Applying the triangle inequality and ∑_{𝑥′} 𝑃(𝑥, 𝑥′) = 1, we obtain

    |(𝑇𝜎 𝑓)(𝑥) − (𝑇𝜎 𝑔)(𝑥)| ⩽ 𝛽 ∑_{𝑥′} |𝑓(𝑥′) − 𝑔(𝑥′)| 𝑃(𝑥, 𝑥′) ⩽ 𝛽 ‖𝑓 − 𝑔‖∞.

Taking the supremum over all 𝑥 on the left hand side of this expression leads to

    ‖𝑇𝜎 𝑓 − 𝑇𝜎 𝑔‖∞ ⩽ 𝛽 ‖𝑓 − 𝑔‖∞.
𝑇 𝑓 = 𝑒 ∨ ( 𝑐 + 𝛽𝑃 𝑓 ) ⩽ 𝑒 ∨ ( 𝑐 + 𝛽𝑃𝑔 ) = 𝑇 𝑔.
Solution to Exercise 4.1.5. This result follows from Lemma 2.2.3 on page 59.
For the sake of the exercise, we also provide a direct proof:
Take any 𝑓 , 𝑔 in RX . Writing the operators pointwise and applying the last result
in Lemma 2.2.1 (page 58) gives
|𝑇 𝑓 − 𝑇 𝑔 | = | 𝑒 ∨ ( 𝑐 + 𝛽𝑃 𝑓 ) − 𝑒 ∨ ( 𝑐 + 𝛽𝑃𝑔 )|
⩽ | 𝛽𝑃 𝑓 − 𝛽𝑃𝑔 |
= 𝛽 | 𝑃 ( 𝑓 − 𝑔)|
⩽ 𝛽𝑃 | 𝑓 − 𝑔 | .
(Here the last inequality uses the result in Exercise 2.2.7 on page 53.) Since 𝑃 ⩾ 0 we
have 𝑃|𝑓 − 𝑔| ⩽ 𝑃‖𝑓 − 𝑔‖∞ 1 = ‖𝑓 − 𝑔‖∞ 1, so

    |𝑇𝑓 − 𝑇𝑔| ⩽ 𝛽 ‖𝑓 − 𝑔‖∞ 1.
    𝑇(𝑠 ∨ 𝑤) = 𝑠 ∨ (𝜋 + 𝛽𝑄(𝑠 ∨ 𝑤)) ⩾ 𝜋 + 𝛽𝑄(𝑠 ∨ 𝑤) ≫ 𝜋 + 𝛽𝑄𝑤 = 𝑤,

where the strict inequality is by Exercise 2.2.8 on page 53. We conclude that 𝑣∗ ⩾
𝑇(𝑠 ∨ 𝑤) ≫ 𝑤, as was to be shown.
Intuitively, the option to exit adds value to firms everywhere in the state space,
since 𝑄 ≫ 0 implies that the state can shift to a region of the state space where exit
is optimal in a later period.
Solution to Exercise 4.1.8. For the model described, the Bellman equation takes
the form

    𝑣(𝑝) = max{ 𝑠, max_{ℓ⩾0} [ 𝜋(ℓ, 𝑝) + 𝛽 ∑_{𝑝′} 𝑣(𝑝′) 𝑄(𝑝, 𝑝′) ] }.
The next period scrap value 𝑆𝑡+1 is integrated out and the remaining function depends
only on 𝑧 ∈ Z.
Since, for each 𝑧′ ∈ Z, the function 𝑠′ ↦→ max{𝑠′, ℎ(𝑧′)} is increasing, we have

    ∑_{𝑧′} ∫ max{𝑠′, ℎ(𝑧′)} 𝜑𝑎(𝑠′) d𝑠′ 𝑄(𝑧, 𝑧′) ⩽ ∑_{𝑧′} ∫ max{𝑠′, ℎ(𝑧′)} 𝜑𝑏(𝑠′) d𝑠′ 𝑄(𝑧, 𝑧′).
Solution to Exercise 4.2.1. In view of (4.13), the continuation value operator for
this problem is
    (𝐶ℎ)(𝑥) = −𝑐 + 𝛽 ∑_{𝑥′} max{𝜋(𝑥′), ℎ(𝑥′)} 𝑃(𝑥, 𝑥′)   (𝑥 ∈ X).
The monotonicity result for ℎ∗ follows from Lemma 4.1.4 on page 114.
Since constant functions are (weakly) decreasing, Exercise 4.1.11 applies and 𝜎∗ is
increasing. Intuitively, the value of waiting is independent of the current state, while
the value of bringing the product to market is increasing in the current state. Hence,
if the firm brings the product to market in state 𝑥, then it should also do so at any
𝑥′ ⩾ 𝑥.
The middle case is obtained by observing that the next period state hits 𝑥 0 when 𝑥 0 = 𝑎
if and only if 𝐷𝑡+1 ⩾ 𝑥 and then using the expression for the geometric distribution.
    |(𝑇𝑣)(𝑥) − (𝑇𝑤)(𝑥)| ⩽ 𝛽 max_{𝑎∈Γ(𝑥)} | ∑_{𝑑⩾0} [𝑣(𝑓(𝑥, 𝑎, 𝑑)) − 𝑤(𝑓(𝑥, 𝑎, 𝑑))] 𝜑(𝑑) |
                        ⩽ 𝛽 max_{𝑎∈Γ(𝑥)} ∑_{𝑑⩾0} |𝑣(𝑓(𝑥, 𝑎, 𝑑)) − 𝑤(𝑓(𝑥, 𝑎, 𝑑))| 𝜑(𝑑)

Since ∑_{𝑑⩾0} 𝜑(𝑑) = 1, it follows that, for arbitrary 𝑥 ∈ X,
Solution to Exercise 5.1.4. We take the action 𝐴𝑡 to be the choice of next period
wealth 𝑊𝑡+1 , so that the action space is also W. The feasible correspondence is
Γ ( 𝑤) = { 𝑎 ∈ W : 𝑎 ⩽ 𝑅𝑤} ( 𝑤 ∈ W) ,
Solution to Exercise 5.1.5. To impose that workers never leave the firm, we
require 𝑎 ⩾ 𝑒. Thus, the feasible correspondence is
Γ ( 𝑥 ) = Γ ( 𝑒, 𝑤) = { 𝑎 ∈ {0, 1} : 𝑎 ⩾ 𝑒} .
and 𝑃 [(1, 𝑤) , 1, ( 𝑒0, 𝑤0)] = 1{ 𝑒0 = 1}1{𝑤0 = 𝑤}. Equation (5.14) says that if 𝑎 = 0 then
𝑒0 = 0 and the next wage is drawn from 𝑄 ( 𝑤, 𝑤0), while if 𝑎 = 1 then 𝑒0 = 1 and the
next wage is 𝑤. You can verify that 𝑃 is a stochastic kernel from G to X.
To double check that these definitions work, we can verify that they lead to the
same Bellman equations that we saw in §3.3.1. Under the definitions of Γ, 𝑟 and 𝑃
just provided, we have 𝑣 (1, 𝑤) = 𝑤 + 𝛽 E 𝑣 (1, 𝑤). This implies that 𝑣 (1, 𝑤) = 𝑤/(1 − 𝛽 ),
which is what we expect for lifetime value of an agent employed with wage 𝑤.
Moreover, the Bellman equation for 𝑣 (0, 𝑤) agrees with the one we obtained for
an unemployed agent on page 97. To see this when 𝑒 = 0, observe that the Bellman
equation is

    𝑣(0, 𝑤) = max_{𝑎∈{0,1}} { 𝑎𝑤 + (1 − 𝑎)𝑐 + 𝛽 ∑_{(𝑒′,𝑤′)} 𝑣(𝑒′, 𝑤′) 𝑃[(0, 𝑤), 𝑎, (𝑒′, 𝑤′)] }
            = max_{𝑎∈{0,1}} { 𝑎𝑤 + (1 − 𝑎)𝑐 + 𝛽 [ 𝑎𝑣(𝑎, 𝑤) + (1 − 𝑎) ∑_{𝑤′} 𝑣(𝑎, 𝑤′) 𝑄(𝑤, 𝑤′) ] },
where the second equation follows from (5.14). (You can see this by checking the
cases 𝑎 = 0 and 𝑎 = 1.) Rearranging and using 𝑣 (1, 𝑤) = 𝑤/(1 − 𝛽 ) now gives
    𝑣(0, 𝑤) = max{ 𝑤/(1 − 𝛽), 𝑐 + 𝛽 ∑_{𝑤′} 𝑣(0, 𝑤′) 𝑄(𝑤, 𝑤′) }.   (5.15)
This is the Bellman equation for an unemployed agent from the job search problem
we saw previously on page 97.
    |(𝑇𝜎 𝑣)(𝑥) − (𝑇𝜎 𝑤)(𝑥)| = 𝛽 | ∑_{𝑥′} 𝑃(𝑥, 𝜎(𝑥), 𝑥′) 𝑣(𝑥′) − ∑_{𝑥′} 𝑃(𝑥, 𝜎(𝑥), 𝑥′) 𝑤(𝑥′) |
                            ⩽ ∑_{𝑥′} 𝑃(𝑥, 𝜎(𝑥), 𝑥′) 𝛽 |𝑣(𝑥′) − 𝑤(𝑥′)| ⩽ 𝛽 ‖𝑣 − 𝑤‖∞.
Taking the supremum over all 𝑥 ∈ X yields the desired result. This contraction prop-
erty combined with Banach’s fixed point theorem implies that 𝑇𝜎 has a unique fixed
point.
Now suppose that 𝑣 is the unique fixed point of 𝑇𝜎 . Then 𝑣 = 𝑟𝜎 + 𝛽𝑃𝜎 𝑣. But then
𝑣 = ( 𝐼 − 𝛽𝑃𝜎 ) −1 𝑟𝜎 . Hence 𝑣 = 𝑣𝜎 . This establishes all claims in the lemma.
Solution to Exercise 5.1.11. Fix 𝑣 ∈ RX . Part (i) follows from the fact that Γ ( 𝑥 )
is finite and nonempty at each 𝑥 ∈ X. Hence we can select an element 𝑎∗𝑥 from the
argmax in the definition of a 𝑣-greedy policy at each 𝑥 in X. The resulting policy is
𝑣-greedy. For part (ii) we need to show that 𝜎 ∈ Σ is 𝑣-greedy if and only if
    𝑟(𝑥, 𝜎(𝑥)) + 𝛽 ∑_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝜎(𝑥), 𝑥′) = max_{𝑎∈Γ(𝑥)} { 𝑟(𝑥, 𝑎) + 𝛽 ∑_{𝑥′} 𝑣(𝑥′) 𝑃(𝑥, 𝑎, 𝑥′) }
Solution to Exercise 5.1.12. This result follows from Lemma 2.2.3 on page 59.
For the sake of the exercise, we also provide a direct proof:
Fix 𝑣, 𝑤 ∈ RX and 𝑥 ∈ X. By Exercise 5.1.11 and the max-inequality lemma
(page 58), we have
Solution to Exercise 5.1.13. Part (iii) of Proposition 5.1.1 implies (iv) because
every 𝑣 ∈ RX has at least one greedy policy (Exercise 5.1.11). In particular, at least
one 𝑣∗ -greedy policy exists.
Both 𝑤 and 𝑤′ are constrained to a finite set W ⊂ R+. The expected value function
can be expressed as

    𝑔(𝑧, 𝑤′) ≔ ∑_{𝑧′,𝜀′} 𝑣(𝑤′, 𝑧′, 𝜀′) 𝑄(𝑧, 𝑧′) 𝜑(𝜀′).   (5.38)
In the remainder of this section, we will say that a savings policy 𝜎 is 𝑔-greedy if
    𝜎(𝑧, 𝑤, 𝜀) ∈ argmax_{𝑤′ ⩽ 𝑅(𝑤+𝑧+𝜀)} { 𝑢(𝑤 + 𝑧 + 𝜀 − 𝑤′/𝑅) + 𝛽𝑔(𝑧, 𝑤′) }.
Since it is an MDP, we can see immediately that if we replace 𝑣 in (5.38) with the
value function 𝑣∗ , then a 𝑔-greedy policy will be an optimal one. We can rewrite the
Bellman equation in terms of expected value functions via
    𝑔(𝑧, 𝑤′) = ∑_{𝑧′,𝜀′} max_{𝑤″ ⩽ 𝑅(𝑤′+𝑧′+𝜀′)} { 𝑢(𝑤′ + 𝑧′ + 𝜀′ − 𝑤″/𝑅) + 𝛽𝑔(𝑧′, 𝑤″) } 𝑄(𝑧, 𝑧′) 𝜑(𝜀′).
Solution to Exercise 6.2.2. Fix 𝜎 ∈ Σ and let Assumption 6.2.1 on page 192 hold.
We saw in the proof of Lemma 6.2.1 that 𝑇𝜎 𝑣 = 𝑟𝜎 + 𝐿𝜎 𝑣 and that 𝑣𝜎 = (𝐼 − 𝐿𝜎)^{−1} 𝑟𝜎 is
the unique fixed point of this operator in RX. Moreover, for fixed 𝑣, 𝑤 ∈ RX, we have

    |𝑇𝜎 𝑣 − 𝑇𝜎 𝑤| = |𝐿𝜎 𝑣 − 𝐿𝜎 𝑤| = |𝐿𝜎(𝑣 − 𝑤)| ⩽ 𝐿𝜎 |𝑣 − 𝑤|.
Solution to Exercise 6.2.4. Fix 𝜎 ∈ Σ. In the present setting, the discount operator
𝐿𝜎 from (6.17) becomes
𝐿𝜎 ( 𝑥, 𝑥 0) = 𝐿𝜎 (( 𝑦, 𝑧 ) , ( 𝑦 0, 𝑧0)) = 𝛽 ( 𝑧 ) 𝑄 ( 𝑧, 𝑧0) 𝑅 ( 𝑦, 𝜎 ( 𝑦 ) , 𝑦 0) .
Proceeding as for the ex-dividend contract, the price conditional on current state 𝑥 is
𝜋(𝑥) = 𝑑(𝑥) + ∑_{𝑥′} 𝑚(𝑥, 𝑥′) 𝜋(𝑥′) 𝑃(𝑥, 𝑥′). In vector form, this is 𝜋 = 𝑑 + 𝐴𝜋. Solving out
for prices gives 𝜋∗ = (𝐼 − 𝐴)^{−1} 𝑑.
Solution to Exercise 7.1.8. Observe that ( 𝐺𝑢)( 𝑥 ) = 𝐹 𝑥 [( 𝐴𝑢)( 𝑥 )]. Since 𝐴 is order-
preserving and 𝐹 is increasing, 𝑢 ⩽ 𝑣 implies 𝐺𝑢 ⩽ 𝐺𝑣. In particular, 𝐺 is order-
preserving. If 𝜃 ∈ (0, 1], then 𝐹 is convex. Hence, fixing 𝑢, 𝑣 ∈ 𝑉 and 𝜆 ∈ [0, 1] (and
dropping 𝑥 from our notation), we have
𝐹 𝐴 ( 𝜆𝑢 + (1 − 𝜆 ) 𝑣) = 𝐹 ( 𝜆 𝐴𝑢 + (1 − 𝜆 ) 𝐴𝑣) ⩽ 𝜆𝐹 𝐴𝑢 + (1 − 𝜆 ) 𝐹 𝐴𝑣.
Hence 𝐺 is convex. The proof that 𝐺 is concave for other values of 𝜃 is similar and
omitted.
Solution to Exercise 7.1.9. Setting 𝑣𝑡 = 𝜎𝑡^{−1/𝜓}, we can write (7.3) as

    𝑣𝑡 = [ 1 + 𝛽^𝜓 ( E𝑡 𝑅_{𝑡+1}^{(𝜓−1)/𝜓} 𝑣_{𝑡+1} )^𝜓 ]^{1/𝜓}.   (7.4)

    𝑣(𝑥) = [ 1 + 𝛽^𝜓 ( ∑_{𝑥′} 𝑓(𝑥′)^{(𝜓−1)/𝜓} 𝑣(𝑥′) 𝑃(𝑥, 𝑥′) )^𝜓 ]^{1/𝜓}   (𝑥 ∈ X).

Using the definition of 𝐴 in the exercise, we can write the equation in vector form
as 𝑣 = [1 + (𝐴𝑣)^𝜓]^{1/𝜓}. It follows from Theorem 7.1.4 that a unique strictly positive
solution to this equation exists if and only if 𝜌(𝐴)^𝜓 < 1. This proves the claim in the
exercise.
Solution to Exercise 7.2.7. Let 𝐾 be as stated (see (7.14)) and fix 𝑣 ≫ 0. Clearly
𝑣^𝛾 ≫ 0 and hence 𝑃𝑣^𝛾 ≫ 0 (see Exercise 3.2.5 on page 94). Since ℎ ⩾ 0, it follows
easily that 𝐾𝑣 ≫ 0.
( 𝑅𝜏 ( 𝑣 + 𝜆 ))( 𝑥 ) = 𝑄 𝜏 ( 𝑣 ( 𝑋 ) + 𝜆 ) = 𝑄 𝜏 ( 𝑣 ( 𝑋 )) + 𝜆,
where the second equality is by Exercise 1.2.28 on page 32. Since 𝑥 was arbitrary, we
have 𝑅𝜏 ( 𝑣 + 𝜆 ) = 𝑅𝜏 𝑣 + 𝜆 . Hence 𝑅𝜏 is constant-subadditive, as claimed.
\[
Rv = R(v - w + w) \leq R(\|v - w\|_\infty \mathbb{1} + w) \leq Rw + \|v - w\|_\infty \mathbb{1} .
\]
\[
V_0 = u(C_0) + \beta \mathbb{E}_0 V_1
= u(C_0) + \beta \mathbb{E}_0 [u(C_1) + \beta \mathbb{E}_1 V_2]
= u(C_0) + \beta \mathbb{E}_0 u(C_1) + \beta^2 \mathbb{E}_0 V_2 .
\]
Continuing forward until time \(m\) yields \(V_0 = \sum_{t=0}^{m-1} \beta^t \mathbb{E}_0 u(C_t) + \beta^m \mathbb{E}_0 V_m\). Shifting to
functional form and using \(r = u \circ c\), the last expression becomes
\[
v = \sum_{t=0}^{m-1} (\beta P)^t r + (\beta P)^m w .
\]
By Exercise 1.2.17 on page 22, this is just \(K^m w\) when \(K\) is the associated Koopmans
operator \(Kv = r + \beta P v\) and, moreover, \(K^m w \to v^* := (I - \beta P)^{-1} r\).
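A quick numerical check of this convergence (with made-up primitives) is below: iterating \(Kv = r + \beta P v\) from an arbitrary \(w\) approaches \((I - \beta P)^{-1} r\).

```python
import numpy as np

beta = 0.95
P = np.array([[0.7, 0.3], [0.4, 0.6]])
r = np.array([1.0, 2.0])
v_star = np.linalg.solve(np.eye(2) - beta * P, r)   # (I - beta P)^{-1} r

w = np.array([10.0, -3.0])         # arbitrary initial condition
for _ in range(500):
    w = r + beta * P @ w           # apply the Koopmans operator K
print(np.max(np.abs(w - v_star)))  # essentially zero
```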
Solution to Exercise 7.3.17. The additive case is obvious. Regarding the Leontief
case, fix 𝑥 ∈ X, 𝑦 ∈ R and 𝜆 ∈ R+ . We have
Solution to Exercise 7.3.19. We already know from Exercise 7.3.17 that the Leon-
tief aggregator satisfies Blackwell’s condition when 𝛽 ∈ (0, 1). Since 𝑅𝜏 is constant-
subadditive, global stability follows from Proposition 7.3.3.
\[
P(x, a, x') = P((y, z), a, (y', z')) = \mathbb{1}\{y' = a\} \, Q(z, z') .
\]
Solution to Exercise 8.1.7. Fix 𝑣 ∈ 𝑉 and consider the set {𝑇𝜎 𝑣}𝜎∈Σ ⊂ 𝑉 . We first
show that {𝑇𝜎 𝑣}𝜎∈Σ contains a greatest element. Suppose that 𝜎¯ is 𝑣-greedy. If 𝜎 is
any other policy, then
Solution to Exercise 8.1.8. Regarding part (i), fix 𝑣 ∈ 𝑉 . For any 𝜎 ∈ Σ and 𝑥 ∈ X,
we have
Since \(x\) was chosen arbitrarily, we have confirmed that \(Tv = \bigvee_{\sigma \in \Sigma} T_\sigma v\).
Regarding part (ii), 𝜎 is 𝑣-greedy if and only if
Solution to Exercise 8.1.13. Let {𝑇𝜎 } be the policy operators associated with a
bounded and well-posed RDP ( Γ, 𝑉, 𝐵). Let 𝑉ˆ ≔ [ 𝑣1 , 𝑣2 ], where 𝑣1 , 𝑣2 are as in (8.20).
Fix 𝜎 ∈ Σ and let 𝑣𝜎 be the 𝜎-value function of policy operator 𝑇𝜎 . It follows from (8.20)
that 𝑇𝜎 is a self-map on 𝑉ˆ. By the Knaster–Tarski fixed point theorem (page 213), 𝑇𝜎
has at least one fixed point in 𝑉ˆ. By uniqueness, that fixed point is 𝑣𝜎 .
\[
B(x, a, v_2) = r(x, a) + \beta v_2 \leq \bar{r} + \beta v_2 = v_2
\]
for all ( 𝑥, 𝑎) ∈ G. This is the upper bound condition in (8.20). The proof of the lower
bound condition is similar.
\[
\bar{r} + L v_2 = \bar{r} + L v_2 - v_2 + v_2 = \bar{r} - (I - L) v_2 + v_2 = \bar{r} - \bar{r} + v_2 = v_2 .
\]
Since \(B(x, a, v_2) \leq \bar{r} + (L v_2)(x)\), this proves that \(v_2\) satisfies the upper bound condition
in (8.20). The proof of the lower bound condition is similar.
Solution to Exercise 8.1.17. The only nontrivial condition to check is that the
bound \(B(x, x', v_2) \leq v_2(x)\) holds for all feasible \((x, x')\). In particular, we need to show
that
\[
c(x, x') + C(x') \leq C(x) \quad \text{whenever } x' \in \mathcal{O}(x) \text{ and } x' \neq x .
\]
This is true by the definition of \(C\), since \(C(x)\) is the maximum travel cost to \(d\) and
\(c(x, x') + C(x')\) is the cost of traveling to \(d\) via \(x'\) and then taking the most expensive
path.
\[
B(x, a, v) \leq B(x, a, w + \|v - w\|_\infty) \leq B(x, a, w) + \beta \|v - w\|_\infty .
\]
Reversing the roles of \(v\) and \(w\) and combining the two bounds gives
\[
|B(x, a, v) - B(x, a, w)| \leq \beta \|v - w\|_\infty .
\]
Let 𝜎 be a map from X to A such that 𝜎 ( 𝑥 ) is a maximizer of the right hand side of this
expression for all 𝑥 . Clearly 𝜎 ∈ Σ and
Solution to Exercise 8.2.10. We discuss the first case, regarding (8.37). When
(8.40) holds, by finiteness of G, we can take an 𝜀 > 0 such that
𝐵 ( 𝑥, 𝑎, 𝑣2 ) ⩽ 𝑣2 ( 𝑥 ) − 𝜀 for all ( 𝑥, 𝑎) ∈ G.
We then have
𝜀 ⩽ 𝑣2 ( 𝑥 ) − 𝐵 ( 𝑥, 𝑎, 𝑣2 ) ⩽ 𝑣2 ( 𝑥 ) − 𝐵 ( 𝑥, 𝑎, 𝑣1 ) ⩽ 𝑣2 ( 𝑥 ) − 𝑣1 ( 𝑥 )
\[
B(x, a, v_2) \leq v_2(x) - \delta \|v_2 - v_1\|_\infty \leq v_2(x) - \delta [v_2(x) - v_1(x)]
\]
Solution to Exercise 8.2.11. We prove (8.41) and leave (8.40) to the reader. For
given ( 𝑥, 𝑎) ∈ G,
\[
B(x, a, v_1) = r(x, a) + \beta \sum_{x'} v_1(x') P(x, a, x')
\geq r_1 + \beta \, \frac{r_1 - \varepsilon}{1 - \beta}
= \frac{r_1 - \beta r_1 + \beta r_1 - \beta \varepsilon}{1 - \beta}
= v_1 + \varepsilon .
\]
Solution to Exercise 8.3.1. For each fixed 𝑎 ∈ Γ ( 𝑥 ), the map 𝑅𝜏𝑎 is a version of the
quantile certainty equivalent operator defined in Exercise 7.3.4 on page 232. With
this observation, we can replicate the proof of Proposition 8.3.1, after replacing 𝑅𝜃𝑎
with 𝑅𝜏𝑎 . The latter is also constant-subadditive, by Exercise 7.3.7 on page 233.
\[
B(x, a, d, v_1) = r(x, a) + \beta \, \frac{r_1 - \varepsilon}{1 - \beta}
\geq r_1 + \beta \, \frac{r_1 - \varepsilon}{1 - \beta}
= \frac{r_1 - \beta r_1 + \beta r_1 - \beta \varepsilon}{1 - \beta}
= v_1 + \varepsilon .
\]
\[
B(x, a, v) = \left\{ r(x, a) + \beta \left[ \sum_{x'} v(x')^\gamma P(x, a, x') \right]^{\alpha/\gamma} \right\}^{1/\alpha} ,
\]
where \(P(x, a, x') := \int P_\theta(x, a, x') \, \mu(x, \mathrm{d}\theta)\) is a weighted average over beliefs. This is
identical to the Epstein–Zin aggregator (see Example 8.1.7).
\[
c(x, \sigma(x)) + \beta c(\sigma(x), \sigma^2(x)) + \beta^2 c(\sigma^2(x), \sigma^3(x)) + \cdots + \beta^{n-1} c(\sigma^{n-1}(x), \sigma^n(x))
\]
Solution to Exercise 9.1.1. By the Neumann series lemma, \(T\) has a unique fixed
point in \(V\) given by \(\bar{v} := (I - A)^{-1} r\). \(T\) is upward stable because, given \(v \in \mathbb{R}^{\mathsf{X}}\) with
\(v \leq Tv\), we have \(v \leq r + Av\), or \((I - A)v \leq r\). By the Neumann series lemma, \((I - A)^{-1}\)
is a positive linear operator (as the sum of nonnegative matrices), so we can multiply
by this inverse to get \(v \leq (I - A)^{-1} r = \bar{v}\). This proves upward stability. Reversing the
inequalities shows that downward stability also holds.
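The argument can be checked numerically. In the sketch below (random nonnegative \(A\) with small row sums, so \(\rho(A) < 1\); all values are illustrative), a vector satisfying \(v \leq Tv\) is constructed and verified to lie below \(\bar{v}\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = 0.3 * rng.random((3, 3))       # nonnegative, row sums < 0.9, so rho(A) < 1
r = rng.random(3)
assert np.max(np.abs(np.linalg.eigvals(A))) < 1

v_bar = np.linalg.solve(np.eye(3) - A, r)   # fixed point of T v = r + A v

c = 1.0
v = v_bar - c * np.ones(3)         # then T v - v = c * (1 - A @ 1) >= 0
assert np.all(v <= r + A @ v)      # v <= T v
assert np.all(v <= v_bar + 1e-12)  # upward stability: v <= v_bar
print("upward stability verified for this example")
```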
Solution to Exercise 9.2.1. The first part of the exercise is immediate from the
definitions. For the second, take \(v, w \in V\) with \(v \preceq w\). Since \(T_\sigma\) is order-preserving, we
have \(T_\sigma v \preceq T_\sigma w\) for all \(\sigma \in \Sigma\). Hence \(T_\sigma v \preceq Tw\) for all \(\sigma \in \Sigma\). Therefore \(Tv \preceq Tw\).
Solution to Exercise 9.2.3. For all 𝑣 ∈ 𝑉Σ , we have 𝑣 = 𝑣𝜎 for some 𝜎, and hence
𝑇 𝑣 ⩾ 𝑇𝜎 𝑣 = 𝑇𝜎 𝑣𝜎 = 𝑣𝜎 = 𝑣.
Solution to Exercise 9.2.4. This follows directly from Exercise 1.2.27 on page 31.
Solution to Exercise 9.2.5. This result follows from Exercise 9.2.4, since, at each
𝑥 , the maximizing distribution 𝜑 𝑥 is supported on argmax 𝑎∈ Γ ( 𝑥 ) 𝐵 ( 𝑥, 𝑎, 𝑣).
Solution to Exercise 10.1.1. If \(W \overset{d}{=} \operatorname{Exp}(\theta)\) and \(s, t > 0\), then
\[
\left\| \sum_{k=0}^{m} \frac{A^k}{k!} \right\|
\leq \sum_{k=0}^{m} \frac{\|A^k\|}{k!}
\leq \sum_{k=0}^{m} \frac{\|A\|^k}{k!}
\leq \mathrm{e}^{\|A\|} ,
\]
where the last term uses the ordinary (scalar) exponential function defined in (10.1).
(If you also want to prove that the scalar series in (10.1) converges, you can do so via
the ratio test.)
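If SciPy is available, the series definition and the bound \(\|\mathrm{e}^A\| \leq \mathrm{e}^{\|A\|}\) can be checked directly; the matrix below and the choice of the spectral norm are illustrative.

```python
import numpy as np
from math import factorial
from scipy.linalg import expm

A = np.array([[0.1, -0.4], [0.3, 0.2]])

# Truncated series sum_{k=0}^{29} A^k / k! agrees with expm(A)
series = sum(np.linalg.matrix_power(A, k) / factorial(k) for k in range(30))
assert np.allclose(series, expm(A))

norm = lambda M: np.linalg.norm(M, 2)   # any submultiplicative matrix norm works
print(norm(expm(A)), np.exp(norm(A)))   # first number is <= second
```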
Solution to Exercise 10.1.4. We use the definition \(\mathrm{e}^A = \sum_{k \geq 0} A^k / k!\) for the proof and
fix \(t \in \mathbb{R}\). A common argument for differentiating \(\mathrm{e}^{tA}\) with respect to \(t\) is to take the
derivative through the infinite sum to get
\[
\frac{\mathrm{d}}{\mathrm{d} t} \mathrm{e}^{tA} = A + t \frac{A^2}{1!} + t^2 \frac{A^3}{2!} + \cdots = A \mathrm{e}^{tA} .
\]
But this is not fully rigorous, since we have not justified interchange of limits. A better
answer is to start with (10.9), which gives
\[
\frac{\mathrm{d}}{\mathrm{d} t} \mathrm{e}^{tA} = \mathrm{e}^{tA} \lim_{h \to 0} \frac{\mathrm{e}^{hA} - I}{h} .
\]
Taking the limit 𝑡 → ∞ and applying Gelfand’s lemma, this sequence converges to
ln 𝜌 (e 𝐴 ). But ln 𝜌 (e 𝐴 ) = 𝑠 ( 𝐴), by the first equality in (10.17). This proves the second
equality in (10.17).
Solution to Exercise 10.1.12. Let's start with (i) \(\implies\) (ii), or \(s(A) < 0\) implies
\(\|\mathrm{e}^{tA}\| \to 0\) as \(t \to \infty\).
Here is one proof that works for \(t \in \mathbb{N}\) and \(t \to \infty\). Observe that, since \((\mathrm{e}^A)^t = \mathrm{e}^{tA}\),
the powers \(B^t\) of \(B := \mathrm{e}^A\) match the flow \(t \mapsto \mathrm{e}^{tA}\) at integer times. We have \(B^t \to 0\) if
and only if \(\rho(B) < 1\). But, by Lemma 10.1.4, \(\rho(B) = \rho(\mathrm{e}^A) = \mathrm{e}^{s(A)}\). Hence \(\rho(B) < 1\)
is equivalent to \(s(A) < 0\). Thus, \(s(A) < 0\) is the exact condition we need to obtain
\(B^t = \mathrm{e}^{tA} \to 0\).
We can improve on this proof of (i) =⇒ (ii) by allowing 𝑡 ∈ R and 𝑡 → ∞ as
follows. Suppose \(s(A) < 0\). Fix \(\varepsilon > 0\) such that \(s(A) + \varepsilon < 0\) and use (10.17) to obtain
a \(T < \infty\) such that \((1/t) \ln \|\mathrm{e}^{tA}\| \leq s(A) + \varepsilon\) for all \(t \geq T\). Equivalently, for \(t\) large, we
have \(\|\mathrm{e}^{tA}\| \leq \mathrm{e}^{t(s(A) + \varepsilon)}\). The claim follows.
That (iii) implies (iv) is immediate: Just substitute the bound in (iii) into the
integral.
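Part (i) \(\implies\) (ii) is also easy to visualize numerically. The snippet below (illustrative matrix, SciPy assumed) computes the spectral abscissa and shows the norm decay.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-0.5, 1.0], [0.0, -0.2]])
s = max(np.linalg.eigvals(A).real)       # spectral abscissa s(A)
print("s(A) =", s)                       # negative for this example

for t in (1.0, 5.0, 20.0, 50.0):
    print(t, np.linalg.norm(expm(t * A), 2))   # || e^{tA} || tends to zero
```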
It is clear from this representation that all entries of 𝑃𝑡 = e𝑡𝑄 are nonnegative.
\[
\frac{\mathrm{d}}{\mathrm{d} t} \, \delta_x \mathrm{e}^{tQ} \mathbb{1}
= \delta_x \frac{\mathrm{d}}{\mathrm{d} t} \mathrm{e}^{tQ} \mathbb{1}
= \delta_x \mathrm{e}^{tQ} Q \mathbb{1} = 0
\]
for all \(t \geq 0\). Evaluating at \(t = 0\), we get \(\delta_x Q \mathbb{1} = 0\). That is, \(\sum_{x'} Q(x, x') = 0\).
\[
\frac{\mathrm{d}}{\mathrm{d} t} \mathrm{e}^{tQ} = Q \mathrm{e}^{tQ} = \mathrm{e}^{tQ} Q
\quad \text{for all } t \geq 0 . \tag{10.23}
\]
Evaluating (10.23) at \(t = 0\) and recalling that \(\mathrm{e}^0 = I\) gives
\[
Q = \lim_{h \downarrow 0} \frac{1}{h} \left( \mathrm{e}^{hQ} - I \right) . \tag{10.24}
\]
Interpreting \(\delta_x\) as a row vector and \(\delta_{x'}\) as a column vector, while using the fact that
\(x \neq x'\) combined with (10.24), we obtain
\[
Q(x, x') = \delta_x Q \delta_{x'}
= \delta_x \lim_{h \downarrow 0} \frac{\mathrm{e}^{hQ} - I}{h} \, \delta_{x'}
= \lim_{h \downarrow 0} \frac{\delta_x \mathrm{e}^{hQ} \delta_{x'}}{h} .
\]
Hence we need only show that \(\delta_x \mathrm{e}^{hQ} \delta_{x'} \geq 0\). By (ii), \(\delta_x \mathrm{e}^{hQ}\) is a distribution, so the
inequality holds.
Solution to Exercise 10.1.19. Using the matrix exponential (10.6) and 𝑃𝑡 = e𝑡𝑄
yields
\[
P_t(x, x') = \mathbb{1}\{x = x'\} + t Q(x, x') + t^2 \frac{Q^2(x, x')}{2!} + t^3 \frac{Q^3(x, x')}{3!} + \cdots
\]
Setting 𝑡 = ℎ and using 𝑜 ( ℎ) to capture terms converging to zero faster than ℎ as ℎ ↓ 0
recovers (10.28).
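A small numerical check of this \(o(h)\) approximation (with an illustrative intensity matrix and SciPy's `expm`) is given below: the error of \(I + hQ\) shrinks faster than \(h\).

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-0.3,  0.2,  0.1],
              [ 0.4, -0.5,  0.1],
              [ 0.0,  0.6, -0.6]])      # rows sum to zero, off-diagonals >= 0

for h in (0.1, 0.01, 0.001):
    err = np.max(np.abs(expm(h * Q) - (np.eye(3) + h * Q)))
    print(h, err, err / h)              # err / h -> 0, confirming err = o(h)
```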
Bibliography
Achdou, Y., Han, J., Lasry, J.-M., Lions, P.-L., and Moll, B. (2022). Income and wealth
distribution in macroeconomics: A continuous-time approach. The Review of Eco-
nomic Studies, 89(1):45–86.
Aiyagari, S. R. (1994). Uninsured idiosyncratic risk and aggregate saving. The Quar-
terly Journal of Economics, 109(3):659–684.
Albuquerque, R., Eichenbaum, M., Luo, V. X., and Rebelo, S. (2016). Valuation risk
and asset pricing. The Journal of Finance, 71(6):2861–2904.
Algan, Y., Allais, O., den Haan, W. J., and Rendahl, P. (2014). Solving and simu-
lating models with heterogeneous agents and aggregate uncertainty. Handbook of
Computational Economics, 3.
Amador, M., Werning, I., and Angeletos, G.-M. (2006). Commitment vs. flexibility.
Econometrica, 74(2):365–396.
Andreoni, J. and Sprenger, C. (2012). Risk preferences are not time preferences.
American Economic Review, 102(7):3357–3376.
Antràs, P. and De Gortari, A. (2020). On the geography of global value chains. Econo-
metrica, 88(4):1553–1598.
Arrow, K. J., Harris, T., and Marschak, J. (1951). Optimal inventory policy. Econo-
metrica, 19(3):250–272.
Atkinson, K. and Han, W. (2005). Theoretical Numerical Analysis, volume 39. Springer.
Augeraud-Véron, E., Fabbri, G., and Schubert, K. (2019). The value of biodiversity
as an insurance device. American Journal of Agricultural Economics, 101(4):1068–
1081.
Azinovic, M., Gaegauf, L., and Scheidegger, S. (2022). Deep equilibrium nets. Inter-
national Economic Review, 63(4):1471–1525.
Backus, D. K., Routledge, B. R., and Zin, S. E. (2004). Exotic preferences for macroe-
conomists. NBER Macroeconomics Annual, 19:319–390.
Bagliano, F.-C. and Bertola, G. (2004). Models for Dynamic Macroeconomics. Oxford
University Press.
Balbus, Ł., Reffett, K., and Woźny, Ł. (2014). A constructive study of Markov equilibria
in stochastic games with strategic complementarities. Journal of Economic Theory,
150:815–840.
Balbus, Ł., Reffett, K., and Woźny, Ł. (2018). On uniqueness of time-consistent Markov
policies for quasi-hyperbolic consumers under uncertainty. Journal of Economic The-
ory, 176:293–310.
Balbus, Ł., Reffett, K., and Woźny, Ł. (2022). Time-consistent equilibria in dynamic
models with recursive payoffs and behavioral discounting. Journal of Economic The-
ory, page 105493.
Bansal, R., Kiku, D., and Yaron, A. (2012). An empirical evaluation of the long-run
risks model for asset prices. Critical Finance Review, 1(1):183–221.
Bansal, R. and Yaron, A. (2004). Risks for the long run: A potential resolution of asset
pricing puzzles. The Journal of Finance, 59(4):1481–1509.
Barillas, F., Hansen, L. P., and Sargent, T. J. (2009). Doubts or variability? Journal
of Economic Theory, 144(6):2388–2418.
Bäuerle, N. and Glauner, A. (2022). Markov decision processes with recursive risk
measures. European Journal of Operational Research, 296(3):953–966.
Bäuerle, N. and Jaśkiewicz, A. (2017). Optimal dividend payout model with risk
sensitive preferences. Insurance: Mathematics and Economics, 73:82–93.
Bäuerle, N. and Jaśkiewicz, A. (2018). Stochastic optimal growth model with risk
sensitive preferences. Journal of Economic Theory, 173:181–200.
Becker, R. A., Boyd III, J. H., and Sung, B. Y. (1989). Recursive utility and optimal
capital accumulation. I. Existence. Journal of Economic Theory, 47(1):76–100.
Benhabib, J., Bisin, A., and Zhu, S. (2015). The wealth distribution in Bewley
economies with capital income risk. Journal of Economic Theory, 159:489–515.
Benzion, U., Rapoport, A., and Yagil, J. (1989). Discount rates inferred from decisions:
An experimental study. Management Science, 35(3):270–284.
Bertsekas, D. (2022a). Newton’s method for reinforcement learning and model pre-
dictive control. Results in Control and Optimization, 7:100121.
Bhandari, J. and Russo, D. (2022). Global optimality guarantees for policy gradient
methods. arXiv:1906.01786.
Bloise, G., Le Van, C., and Vailakis, Y. (2023). Do not blame Bellman: It is Koopmans’
fault. Econometrica, in press.
Bloom, N., Bond, S., and Van Reenen, J. (2007). Uncertainty and investment dynam-
ics. The Review of Economic Studies, 74(2):391–415.
Blundell, R., Graber, M., and Mogstad, M. (2015). Labor income dynamics and the
insurance from taxes, transfers, and the family. Journal of Public Economics, 127:58–
73.
Bocola, L., Bornstein, G., and Dovis, A. (2019). Quantitative sovereign default models
and the European debt crisis. Journal of International Economics, 118:20–30.
Bommier, A., Kochov, A., and Le Grand, F. (2017). On monotone recursive preferences.
Econometrica, 85(5):1433–1466.
Bommier, A., Kochov, A., and Le Grand, F. (2019). Ambiguity and endogenous dis-
counting. Journal of Mathematical Economics, 83:48–62.
Bommier, A. and Villeneuve, B. (2012). Risk aversion and the value of risk to life.
Journal of Risk and Insurance, 79(1):77–104.
Borovička, J. and Stachurski, J. (2020). Necessary and sufficient conditions for exis-
tence and uniqueness of recursive utilities. The Journal of Finance.
Boyd, J. H. (1990). Recursive utility and the Ramsey problem. Journal of Economic
Theory, 50(2):326–345.
Brémaud, P. (2020). Markov Chains: Gibbs Fields, Monte Carlo Simulation and Queues,
volume 31. Springer Nature.
Bullen, P. S. (2003). Handbook of Means and Their Inequalities, volume 560. Springer
Science & Business Media.
Burdett, K. (1978). A theory of employee job search and quit rates. American Economic
Review, 68(1):212–220.
Cagetti, M., Hansen, L. P., Sargent, T. J., and Williams, N. (2002). Robustness and
pricing with uncertain growth. The Review of Financial Studies, 15(2):363–404.
Calsamiglia, C., Fu, C., and Güell, M. (2020). Structural estimation of a model of
school choices: The Boston mechanism versus its alternatives. Journal of Political
Economy, 128(2):642–680.
Cao, D. (2020). Recursive equilibrium in Krusell and Smith (1998). Journal of Eco-
nomic Theory, 186.
Cao, D. and Werning, I. (2018). Saving and dissaving with hyperbolic discounting.
Econometrica, 86(3):805–857.
Carroll, C. D. (1997). Buffer-stock saving and the life cycle/permanent income hy-
pothesis. Quarterly Journal of Economics, 112(1):1–55.
Carruth, A., Dickerson, A., and Henley, A. (2000). What do we know about investment
under uncertainty? Journal of Economic Surveys, 14(2):119–154.
Carvalho, V. M. and Grassi, B. (2019). Large firm dynamics and the business cycle.
American Economic Review, 109(4):1375–1425.
Cheney, W. (2013). Analysis for Applied Mathematics, volume 208. Springer Science
& Business Media.
Chetty, R. (2008). Moral hazard versus liquidity and optimal unemployment insur-
ance. Journal of Political Economy, 116(2):173–234.
Christiano, L. J., Motto, R., and Rostagno, M. (2014). Risk shocks. American Economic
Review, 104(1):27–65.
Çınlar, E. and Vanderbei, R. J. (2013). Real and Convex Analysis. Springer Science &
Business Media.
Colacito, R., Croce, M., Ho, S., and Howard, P. (2018). BKK the EZ way: International
long-run growth news and capital flows. American Economic Review, 108(11):3416–
3449.
de Castro, L., Galvao, A. F., and Nunes, D. (2022). Dynamic economics with quantile
preferences. SSRN 4108230.
de Groot, O., Richter, A. W., and Throckmorton, N. A. (2022). Valuation risk revalued.
Quantitative Economics, 13(2):723–759.
De Nardi, M., French, E., and Jones, J. B. (2010). Why do the elderly save? The role
of medical expenses. Journal of Political Economy, 118(1):39–75.
DeJarnette, P., Dillenberger, D., Gottlieb, D., and Ortoleva, P. (2020). Time lotteries
and stochastic impatience. Econometrica, 88(2):619–656.
Drugeon, J.-P. and Wigniolle, B. (2021). On Markovian collective choice with hetero-
geneous quasi-hyperbolic discounting. Economic Theory, 72(4):1257–1296.
Du, Y. (1990). Fixed points of increasing operators in ordered Banach spaces and
applications. Applicable Analysis, 38(01-02):1–20.
Dudley, R. M. (2002). Real Analysis and Probability, volume 74. Cambridge University
Press.
Duffie, D. and Garman, M. B. (1986). Intertemporal Arbitrage and the Markov Valua-
tion of Securities. Citeseer.
Dupuis, P. and Ellis, R. S. (2011). A Weak Convergence Approach to the Theory of Large
Deviations. John Wiley & Sons.
Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1952). The inventory problem: I. Case of
known distributions of demand. Econometrica, 20(2):187–222.
Engel, K.-J. and Nagel, R. (2006). A Short Course on Operator Semigroups. Springer
Science & Business Media.
Epstein, L. G. and Zin, S. E. (1989). Risk aversion and the temporal behavior of
consumption and asset returns: A theoretical framework. Econometrica, 57(4):937–
969.
Epstein, L. G. and Zin, S. E. (1991). Substitution, risk aversion, and the temporal be-
havior of consumption and asset returns: An empirical analysis. Journal of Political
Economy, 99(2):263–286.
Erosa, A. and González, B. (2019). Taxation and the life cycle of firms. Journal of
Monetary Economics, 105:114–130.
Fagereng, A., Holm, M. B., Moll, B., and Natvik, G. (2019). Saving behavior across
the wealth distribution: The importance of capital gains. Technical report, National
Bureau of Economic Research.
Farmer, J. D., Geanakoplos, J., Richiardi, M. G., Montero, M., Perelló, J., and Masoliver,
J. (2023). Discounting the distant future: What do historical bond prices imply
about the long term discount rate?
Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. (2019). Hy-
perbolic discounting and learning over multiple horizons. arXiv:1902.06865.
Fei, Y., Yang, Z., Chen, Y., and Wang, Z. (2021). Exponential Bellman equation and
improved regret bounds for risk-sensitive reinforcement learning. Advances in Neu-
ral Information Processing Systems, 34:20436–20446.
Fernández-Villaverde, J., Hurtado, S., and Nuno, G. (2023). Financial frictions and
the wealth distribution. Econometrica, 91(3):869–901.
Föllmer, H. and Knispel, T. (2011). Entropic risk measures: Coherence vs. con-
vexity, model ambiguity and robust large deviations. Stochastics and Dynamics,
11(02n03):333–351.
Foss, S., Shneer, V., Thomas, J. P., and Worrall, T. (2018). Stochastic stability of
monotone economies in regenerative environments. Journal of Economic Theory,
173:334–360.
Frederick, S., Loewenstein, G., and O'Donoghue, T. (2002). Time discounting and
time preference: A critical review. Journal of Economic Literature, 40(2):351–401.
Gao, Y., Lui, K. Y. C., and Hernandez-Leal, P. (2021). Robust risk-sensitive reinforce-
ment learning agents for trading markets. arXiv:2107.08083.
Gennaioli, N., Martin, A., and Rossi, S. (2014). Sovereign default, domestic banks,
and financial institutions. The Journal of Finance, 69(2):819–866.
Gentry, M. L., Hubbard, T. P., Nekipelov, D., and Paarsch, H. J. (2018). Structural
econometrics of auctions: A review. Foundations and Trends in Econometrics, 9(2-
4):79–302.
Ghosh, A. R., Kim, J. I., Mendoza, E. G., Ostry, J. D., and Qureshi, M. S. (2013). Fiscal
fatigue, fiscal space and debt sustainability in advanced economies. The Economic
Journal, 123(566):F4–F30.
Gillingham, K., Iskhakov, F., Munk-Nielsen, A., Rust, J., and Schjerning, B. (2022).
Equilibrium trade in automobiles. Journal of Political Economy.
Gomez-Cram, R. and Yaron, A. (2020). How important are inflation expectations for
the nominal yield curve? The Review of Financial Studies, 34(2):985–1045.
Guo, D., Cho, Y. J., and Zhu, J. (2004). Partial Ordering Methods in Nonlinear Problems.
Nova Publishers.
Guvenen, F. (2007). Learning your earning: Are labor income shocks really very
persistent? American Economic Review, 97(3):687–712.
Häggström, O. et al. (2002). Finite Markov Chains and Algorithmic Applications. Cam-
bridge University Press.
Han, J., Yang, Y., et al. (2021). DeepHAM: A global solution method for heterogeneous
agent models with aggregate shocks. arXiv:2112.14377.
Hansen, L. P., Heaton, J. C., and Li, N. (2008). Consumption strikes back? Measuring
long-run risk. Journal of Political Economy, 116(2):260–302.
Havranek, T., Horvath, R., Irsova, Z., and Rusnak, M. (2015). Cross-country
heterogeneity in intertemporal substitution. Journal of International Economics,
96(1):100–118.
Heathcote, J. and Perri, F. (2018). Wealth and volatility. The Review of Economic
Studies, 85(4):2173–2213.
Hens, T. and Schindler, N. (2020). Value and patience: The value premium in a
dividend-growth model with hyperbolic discounting. Journal of Economic Behavior
& Organization, 172:161–179.
Herrendorf, B., Rogerson, R., and Valentinyi, A. (2021). Structural change in in-
vestment and consumption—a unified analysis. The Review of Economic Studies,
88(3):1311–1346.
Hill, E., Bardoscia, M., and Turrell, A. (2021). Solving heterogeneous general equi-
librium economic models with deep reinforcement learning. arXiv:2103.16977.
Hills, T. S., Nakata, T., and Schmidt, S. (2019). Effective lower bound risk. European
Economic Review, 120:103321.
Hirsch, M. and Smale, S. (1974). Differential Equations, Dynamical Systems and Linear
Algebra. Academic Press.
Hopenhayn, H. A. (1992). Entry, exit, and firm dynamics in long run equilibrium.
Econometrica, 60:1127–1150.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. John Wiley &
Sons.
Hsu, W.-T. (2012). Central place theory and city size distribution. The Economic
Journal, 122(563):903–932.
Hsu, W.-T., Holmes, T. J., and Morgan, F. (2014). Optimal city hierarchy: A dy-
namic programming approach to central place theory. Journal of Economic Theory,
154:245–273.
Hu, T.-W. and Shmaya, E. (2019). Unique monetary equilibrium with inflation in a
stationary Bewley–Aiyagari model. Journal of Economic Theory, 180:368–382.
Hubmer, J., Krusell, P., and Smith, Jr, A. A. (2020). Sources of US wealth inequality:
Past, present, and future. NBER Macroeconomics Annual 2020, volume 35.
Huijben, I. A., Kool, W., Paulus, M. B., and Van Sloun, R. J. (2022). A review of the
Gumbel-max trick and its extensions for discrete stochasticity in machine learning.
IEEE Transactions on Pattern Analysis and Machine Intelligence.
Iskhakov, F., Rust, J., and Schjerning, B. (2020). Machine learning and structural
econometrics: Contrasts and synergies. The Econometrics Journal, 23(3):S81–S124.
Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with Gumbel-
softmax. arXiv:1611.01144.
Ju, N. and Miao, J. (2012). Ambiguity, learning, and asset returns. Econometrica,
80(2):559–591.
Kahou, M. E., Fernández-Villaverde, J., Perla, J., and Sood, A. (2021). Exploiting
symmetry in high-dimensional dynamic programming. Technical report, National
Bureau of Economic Research.
Kaplan, G., Moll, B., and Violante, G. L. (2018). Monetary policy according to HANK.
American Economic Review, 108(3):697–743.
Karp, L. (2005). Global warming and hyperbolic discounting. Journal of Public Eco-
nomics, 89(2-3):261–282.
Kase, H., Melosi, L., and Rottner, M. (2022). Estimating nonlinear heterogeneous
agents models with neural networks. Technical report, CEPR Discussion Paper No.
DP17391.
Keane, M. P., Todd, P. E., and Wolpin, K. I. (2011). The structural estimation of be-
havioral models: Discrete choice dynamic programming methods and applications.
In Handbook of Labor Economics, volume 4, pages 331–461. Elsevier.
Keane, M. P. and Wolpin, K. I. (1997). The career decisions of young men. Journal of
Political Economy, 105(3):473–522.
Kelle, P. and Milne, A. (1999). The effect of (s, S) ordering policy on the supply chain.
International Journal of Production Economics, 59(1-3):113–122.
Kikuchi, T., Nishimura, K., Stachurski, J., and Zhang, J. (2021). Coase meets Bell-
man: Dynamic programming for production networks. Journal of Economic Theory,
196:105287.
Kleinman, B., Liu, E., and Redding, S. J. (2023). Dynamic spatial general equilibrium.
Econometrica, 91(2):385–424.
Klibanoff, P., Marinacci, M., and Mukerji, S. (2009). Recursive smooth ambiguity
preferences. Journal of Economic Theory, 144(3):930–976.
Knight, F. H. (1921). Risk, Uncertainty and Profit, volume 31. Houghton Mifflin.
Kochenderfer, M. J., Wheeler, T. A., and Wray, K. H. (2022). Algorithms for Decision
Making. MIT Press.
Kohler, M., Krzyżak, A., and Todorovic, N. (2010). Pricing of high-dimensional Amer-
ican options by neural networks. Mathematical Finance, 20(3):383–410.
Koopmans, T. C., Diamond, P. A., and Williamson, R. E. (1964). Stationary utility and
time perspective. Econometrica, 32(1/2):82–100.
Krasnosel’skii, M. A., Vainikko, G. M., Zabreiko, P. P., Rutitskii, Y. B., and Stetsenko,
V. Y. (1972). Approximate Solution of Operator Equations. Springer Netherlands.
Kristensen, D., Mogensen, P. K., Moon, J. M., and Schjerning, B. (2021). Solving
dynamic discrete choice models using smoothing and sieve methods. Journal of
Econometrics, 223(2):328–360.
Krueger, D., Mitman, K., and Perri, F. (2016). Macroeconomics and household het-
erogeneity. In Handbook of Macroeconomics, volume 2, pages 843–921. Elsevier.
Krusell, P. and Smith, Jr, A. A. (1998). Income and wealth heterogeneity in the
macroeconomy. Journal of Political Economy, 106(5):867–896.
Lasota, A. and Mackey, M. C. (1994). Chaos, Fractals, and Noise: Stochastic Aspects of
Dynamics, volume 97. Springer Science & Business Media, 2 edition.
Lee, J. and Shin, K. (2000). The role of a variable input in the relationship between
investment and uncertainty. American Economic Review, 90(3):667–680.
Li, H. and Stachurski, J. (2014). Solving the income fluctuation problem with un-
bounded rewards. Journal of Economic Dynamics and Control, 45:353–365.
Liao, J. and Berg, A. (2018). Sharpening Jensen’s inequality. The American Statisti-
cian.
Ljungqvist, L. (2002). How do lay-off costs affect employment? The Economic Journal,
112(482):829–853.
Lucas, R. E. and Stokey, N. L. (1984). Optimal growth with many consumers. Journal
of Economic Theory, 32(1):139–171.
Ma, Q., Stachurski, J., and Toda, A. A. (2020). The income fluctuation problem and
the evolution of wealth. Journal of Economic Theory, 187:105003.
Ma, Q. and Toda, A. A. (2021). A theory of the saving rate of the rich. Journal of
Economic Theory, 192:105193.
Majumdar, A., Singh, S., Mandlekar, A., and Pavone, M. (2017). Risk-sensitive inverse
reinforcement learning via coherent risk models. In Robotics: Science and Systems,
volume 16, page 117.
Maliar, L., Maliar, S., and Winant, P. (2021). Deep learning for solving dynamic
economic models. Journal of Monetary Economics, 122:76–101.
Marcet, A., Obiols-Homs, F., and Weil, P. (2007). Incomplete markets, labor supply
and capital accumulation. Journal of Monetary Economics, 54(8):2621–2635.
Marimon, R. (1984). General equilibrium and growth under uncertainty: the turnpike
property. Northwestern University Economics Department Discussion Paper 624.
Marinacci, M., Principi, G., and Stanca, L. (2023). Recursive preferences and ambi-
guity attitudes. arXiv:2304.06830.
McCall, J. J. (1970). Economics of information and job search. The Quarterly Journal
of Economics, 84(1):113–126.
Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. (2020). On the global con-
vergence rates of softmax policy gradient methods. In International Conference on
Machine Learning, pages 6820–6829. PMLR.
Meissner, T. and Pfeiffer, P. (2022). Measuring preferences over the temporal resolu-
tion of consumption uncertainty. Journal of Economic Theory, 200:105379.
Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra, volume 71. Siam.
Michelacci, C., Paciello, L., and Pozzi, A. (2022). The extensive margin of aggregate
consumption demand. The Review of Economic Studies, 89(2):909–947.
Mordecki, E. (2002). Optimal stopping and perpetual options for Lévy processes.
Finance and Stochastics, 6(4):473–493.
Mortensen, D. T. (1986). Job search and labor market analysis. Handbook of Labor
Economics, 2:849–919.
Newhouse, D. (2005). The persistence of income shocks: Evidence from rural Indone-
sia. Review of Development Economics, 9(3):415–433.
Nuño, G. and Moll, B. (2018). Social optima in economies with heterogeneous agents.
Review of Economic Dynamics, 28:150–180.
Ok, E. A. (2007). Real Analysis with Economic Applications, volume 10. Princeton
University Press.
Paroussos, L., Mandel, A., Fragkiadakis, K., Fragkos, P., Hinkel, J., and Vrontisi, Z.
(2019). Climate clubs and the macro-economic benefits of international coopera-
tion on climate policy. Nature Climate Change, 9(7):542–546.
Perla, J. and Tonetti, C. (2014). Equilibrium imitation and growth. Journal of Political
Economy, 122(1):52–76.
Pissarides, C. A. (1979). Job matchings with state employment agencies and random
search. The Economic Journal, 89(356):818–833.
Pohl, W., Schmedders, K., and Wilms, O. (2018). Higher order effects in asset pricing
models with long-run risks. The Journal of Finance, 73(3):1061–1111.
Pohl, W., Schmedders, K., and Wilms, O. (2019). Relative existence for recursive
utility. SSRN 3432469.
Ren, G. and Stachurski, J. (2021). Dynamic programming with value convexity. Au-
tomatica, 130:109641.
Rendahl, P. (2022). Continuous vs. discrete time: Some computational insights. Jour-
nal of Economic Dynamics and Control, 144:104522.
Rogerson, R., Shimer, R., and Wright, R. (2005). Search-theoretic models of the labor
market: A survey. Journal of Economic Literature, 43(4):959–988.
Saijo, H. (2017). The uncertainty multiplier and business cycles. Journal of Economic
Dynamics and Control, 78:1–25.
Samuelson, P. A. (1939). Interactions between the multiplier analysis and the prin-
ciple of acceleration. The Review of Economics and Statistics, 21(2):75–78.
Scarf, H. (1960). The optimality of (S, s) policies in the dynamic inventory problem.
Mathematical Methods in the Social Sciences, pages 196–202.
Schorfheide, F., Song, D., and Yaron, A. (2018). Identifying long-run risks: A Bayesian
mixed-frequency approach. Econometrica, 86(2):617–654.
Shen, Y., Tobia, M. J., Sommer, T., and Obermayer, K. (2014). Risk-sensitive rein-
forcement learning. Neural Computation, 26(7):1298–1328.
Shiryaev, A. N. (2007). Optimal Stopping Rules, volume 8. Springer Science & Business
Media.
Sidford, A., Wang, M., Wu, X., and Ye, Y. (2023). Variance reduced value iteration
and faster algorithms for solving Markov decision processes. Naval Research Logistics
(NRL), 70(5):423–442.
Stachurski, J., Wilms, O., and Zhang, J. (2022). Unique solutions to power-
transformed affine systems. arXiv:2212.00275.
Stanca, L. (2023). Recursive preferences, correlation aversion, and the temporal res-
olution of uncertainty. Working papers, University of Torino.
Von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behav-
ior. Princeton University Press.
Walker, H. F. and Ni, P. (2011). Anderson acceleration for fixed-point iterations. SIAM
Journal on Numerical Analysis, 49(4):1715–1735.
Wang, P. and Wen, Y. (2012). Hayashi meets Kiyotaki and Moore: A theory of capital
adjustment costs. Review of Economic Dynamics, 15(2):207–225.
Yu, L., Lin, L., Guan, G., and Liu, J. (2023). Time-consistent lifetime portfolio selection
under smooth ambiguity. Mathematical Control and Related Fields, 13(3):967–987.
Zhang, Z. (2012). Variational, Topological, and Partial Order Methods with Their Ap-
plications, volume 29. Springer.
Zhao, G. (2020). Ambiguity, nominal bond yields, and real bond yields. American
Economic Review: Insights, 2(2):177–192.
Subject Index