
Dynamic Programming

VOLUME I: FINITE STATES

Thomas J. Sargent and John Stachurski

April 18, 2024


This book is dedicated to the memory of Robert E. Lucas, Jr.
Contents

Preface
Common Symbols
Common Abbreviations

1 Introduction
  1.1 Bellman Equations
    1.1.1 Finite-Horizon Job Search
    1.1.2 Infinite Horizon
  1.2 Stability and Contractions
    1.2.1 Vector Space
    1.2.2 Nonlinear Systems
    1.2.3 Successive Approximation
    1.2.4 Finite-Dimensional Function Space
  1.3 Infinite-Horizon Job Search
    1.3.1 Values and Policies
    1.3.2 Computation
  1.4 Chapter Notes

2 Operators and Fixed Points
  2.1 Stability
    2.1.1 Conjugate Maps
    2.1.2 Local Stability
    2.1.3 Convergence Rates
    2.1.4 Gradient-Based Methods
  2.2 Order
    2.2.1 Partial Orders
    2.2.2 The Case of Pointwise Order
    2.2.3 Order-Preserving Maps
    2.2.4 Stochastic Dominance
    2.2.5 Parametric Monotonicity
  2.3 Matrices and Operators
    2.3.1 Nonnegative Matrices
    2.3.2 A Lake Model
    2.3.3 Linear Operators
  2.4 Chapter Notes

3 Markov Dynamics
  3.1 Foundations
    3.1.1 Markov Chains
    3.1.2 Stationarity and Ergodicity
    3.1.3 Approximation
  3.2 Conditional Expectations
    3.2.1 Mathematical Expectations
    3.2.2 Geometric Sums
  3.3 Job Search Revisited
    3.3.1 Job Search with Markov State
    3.3.2 Job Search with Separation
  3.4 Chapter Notes

4 Optimal Stopping
  4.1 Introduction to Optimal Stopping
    4.1.1 Theory
    4.1.2 Firm Valuation with Exit
    4.1.3 Monotonicity
    4.1.4 Continuation Values
  4.2 Further Applications
    4.2.1 American Options
    4.2.2 Research and Development
  4.3 Chapter Notes

5 Markov Decision Processes
  5.1 Definition and Properties
    5.1.1 The MDP Model
    5.1.2 Examples
    5.1.3 Optimality
    5.1.4 Algorithms
  5.2 Applications
    5.2.1 Optimal Inventories
    5.2.2 Optimal Savings with Labor Income
    5.2.3 Optimal Investment
  5.3 Modified Bellman Equations
    5.3.1 Structural Estimation
    5.3.2 The Gumbel Max Trick
    5.3.3 Optimal Savings with Stochastic Returns on Wealth
    5.3.4 Q-Factors
    5.3.5 Operator Factorizations
  5.4 Chapter Notes

6 Stochastic Discounting
  6.1 Time-Varying Discount Factors
    6.1.1 Valuation
    6.1.2 Testing the Spectral Radius Condition
    6.1.3 Fixed Point Results
  6.2 Optimality with State-Dependent Discounting
    6.2.1 MDPs with State-Dependent Discounting
    6.2.2 Inventory Management Revisited
  6.3 Asset Pricing
    6.3.1 Introduction to Asset Pricing
    6.3.2 Nonstationary Dividends
    6.3.3 Incomplete Markets
  6.4 Chapter Notes

7 Nonlinear Valuation
  7.1 Beyond Contraction Maps
    7.1.1 Knaster–Tarski for Function Space
    7.1.2 Concavity, Convexity and Stability
    7.1.3 A Power-Transformed Affine Equation
  7.2 Recursive Preferences
    7.2.1 Motivation: Optimal Savings
    7.2.2 Risk-Sensitive Preferences
    7.2.3 Epstein–Zin Preferences
  7.3 General Representations
    7.3.1 Koopmans Operators
    7.3.2 A Blackwell-Type Condition
    7.3.3 Uzawa Aggregation
  7.4 Chapter Notes

8 Recursive Decision Processes
  8.1 Definition and Properties
    8.1.1 Defining RDPs
    8.1.2 Lifetime Value
    8.1.3 Optimality
    8.1.4 Topologically Conjugate RDPs
  8.2 Types of RDPs
    8.2.1 Contracting RDPs
    8.2.2 Eventually Contracting RDPs
    8.2.3 Convex and Concave RDPs
  8.3 Further Applications
    8.3.1 Risk-Sensitive RDPs
    8.3.2 Adversarial Agents
    8.3.3 Ambiguity and Robustness
    8.3.4 Smooth Ambiguity
    8.3.5 Minimization
  8.4 Chapter Notes

9 Abstract Dynamic Programming
  9.1 Abstract Dynamic Programs
    9.1.1 Preliminaries
    9.1.2 Abstract Dynamic Programs
  9.2 Optimality
    9.2.1 Max-Optimality
    9.2.2 Optimality Results for RDPs
    9.2.3 Min-Optimality
  9.3 Chapter Notes

10 Continuous Time
  10.1 Continuous Time Markov Chains
    10.1.1 Background
    10.1.2 Continuous Time Flows
    10.1.3 Markov Semigroups
    10.1.4 Continuous Time Markov Chains
  10.2 Continuous Time Markov Decision Processes
    10.2.1 Valuation
    10.2.2 Constructing a Decision Process
    10.2.3 Optimality
    10.2.4 Application: Job Search
  10.3 Chapter Notes

I Appendices

A Suprema and Infima
  A.1 Sets and Functions
  A.2 Some Properties of the Real Line
  A.3 Max and Min

B Remaining Proofs
  B.1 Chapter 2 Results
  B.2 Chapter 6 Results
  B.3 Chapter 7 Results
  B.4 Chapter 9 Results

C Solutions to Selected Exercises

Preface

This book is about dynamic programming and its applications in economics, finance,
and adjacent fields. It brings together recent innovations in the theory of dynamic
programming and provides applications and code that can help readers approach the
research frontier. The book is aimed at graduate students and researchers, although
most chapters are accessible to undergraduate students with solid quantitative back-
grounds.
The book contains classical results on dynamic programming that are found in
texts such as Bellman (1957), Denardo (1981), Puterman (2005), and Stokey and
Lucas (1989), as well as extensions created by researchers and practitioners during
the last few decades as they wrestled with how to formulate and solve dynamic models
that can explain patterns observed in data. These extensions include recursive preferences,
robust control, continuous time models, and time-varying discount rates.
Such settings often fail to satisfy contraction-mapping restrictions on which traditional
methods are based. To accommodate these applications, the key theoretical
chapters of this book (Chapters 8–9) adopt and extend the abstract framework of
Bertsekas (2022b). This approach provides great generality while also offering trans-
parent proofs.
Chapters 1–3 provide motivation and background material on solving fixed point
problems and computing lifetime valuations. Chapters 4 and 5 cover optimal stopping
and Markov decision processes, respectively. Chapter 6 extends the Markov decision
framework to settings where discount rates vary over time. Chapter 7 treats recursive
preferences. The main theoretical results on dynamic programming from Chapters 4–
6 are special cases of more general results in Chapters 8–9. A brief discussion of
continuous time models can be found in Chapter 10.
Mathematically inclined readers with some background in dynamic programming
might prefer to start with the general results in Chapters 8–9. Indeed, it is possible
to read the text in reverse, jumping to Chapters 8–9 and then moving back to cover
special cases according to interests. However, our teaching experience tells us that
most students find the general results challenging on first pass, but considerably easier
after they have practiced dynamic programming through the earlier chapters. This is
why we have started the presentation with special cases and ended it with general
results.
Instructors wishing to use this book as a text for undergraduate students can start
with Chapter 1, skim through Chapter 2, cover Chapters 3–5 in depth, optionally
include Chapter 6 and skip Chapters 7–10 entirely.
This book focuses on dynamic programs with finite state spaces, leaving more
general settings to Volume II. Restricting attention to finite states involves some costs,
since there are specific settings where continuous state models are simpler (one ex-
ample being Gaussian linear-quadratic models). Moreover, many continuous state
models allow us to unleash calculus, one of humanity’s most useful inventions.
Nevertheless, finite state models are extremely useful. Computational representa-
tions are always implemented using finitely many floating point numbers, and many
workhorse models in economics and finance are already discrete. In addition, focusing
on problems with finite state spaces allows us to avoid using function-analytic and
measure-theoretic machinery and imposing associated auxiliary conditions required
to ensure measurability and existence of extrema. Without these distractions, the
core theory of dynamic programming is especially simple.
For these reasons, we believe that even for sophisticated readers, a good approach
to dynamic programming begins with a thorough analysis of the finite state case. This
is the task that we have tackled in Volume I.
Computer code is a first-class citizen in this book. Code is written in Julia and can
be found at

https://github.com/QuantEcon/book-dp1

We chose Julia because it is open source and because Julia allows us to write computer
code that is as close as possible to the relevant mathematical equations. Julia code in
the text is written to maximize clarity rather than speed.
We have also written matching Python code, which can be found in the same
repository. When combined with appropriate scientific libraries, Python is very prac-
tical and efficient for dynamic programming, but implementations tend to be library
specific and are sometimes not as clean as those in Julia. That is why we chose Julia
for programs embedded in the text.
We have tried to mix rigorous theory with exciting applications. Despite the var-
ious layers of abstractions used to unify the theory, the results are practical, being
motivated by important optimization problems from economics and finance.

This book is one of several being written in partnership with the QuantEcon or-
ganization, with funding generously provided by Schmidt Futures (see acknowledg-
ments below). There is some overlap with the first book in the series, Sargent and
Stachurski (2023b), particularly on the topic of Markov chains. Although repetition is
sometimes undesirable, we decided that some overlap would be useful, since it saves
readers from having to jump between two documents.
We are greatly indebted to Jim Savage and Schmidt Futures for generous finan-
cial support, as well as to Shu Hu, Smit Lunagariya, Maanasee Sharma and Chien
Yeh for outstanding research assistance. We are grateful to Makoto Nirei for hosting
John Stachurski at the University of Tokyo in June and July 2022, where significant
progress was made.
We also thank Alexis Akira Toda, Quentin Batista, Fernando Cirelli, Chase Coleman,
Yihong Du, Ippei Fujiwara, Saya Ikegawa, Fazeleh Kazemian, Yuchao Li, Dawie
van Lill, Qingyin Ma, Simon Mishricky, Pietro Monticone, Shinichi Nishiyama, Flint
O’Neil, Zejin Shi, Akshay Shanker, Arnav Sood, Natasha Watkins,
Jingni Yang and Ziyue (Humphrey) Yang for many important fixes, comments and
suggestions. Yuchao Li read the entire manuscript, from cover to cover, and his in-
put and deep knowledge of dynamic programming helped us immensely. Jesse Perla
provided insightful comments on our code.
Common Symbols

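$\mathbb{1}\{P\}$    indicator, equal to 1 if statement $P$ is true and 0 otherwise
$\alpha := 1$    $\alpha$ is defined as equal to 1
$f \equiv 1$    function $f$ is everywhere equal to 1
$\wp(A)$    the power set of $A$; that is, the collection of all subsets of set $A$
$[n]$    $\{1, \ldots, n\}$
$\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}$ and $\mathbb{C}$    the natural, integer, real and complex numbers respectively
$\mathbb{Z}_+$, $\mathbb{R}_+$, etc.    the nonnegative elements of $\mathbb{Z}$, $\mathbb{R}$, etc.
$|x|$ for $x \in \mathbb{R}$    the absolute value of $x$
$|\lambda|$ for $\lambda \in \mathbb{C}$    the modulus of $\lambda$ (i.e., $\sqrt{a^2 + b^2}$ when $\lambda = a + ib$)
$|B|$ for set $B$    the cardinality of $B$
$\mathbb{R}^n$    all $n$-tuples of real numbers
$x \leq y$ ($x, y \in \mathbb{R}^n$)    $x_i \leq y_i$ for $i = 1, \ldots, n$ (pointwise partial order)
$x \ll y$ ($x, y \in \mathbb{R}^n$)    $x_i < y_i$ for $i = 1, \ldots, n$
$\mathcal{D}(F)$    the set of distributions on $F$
$\mathbb{R}^{\mathsf{M}}$    the set of all functions from set $\mathsf{M}$ to $\mathbb{R}$
$i\mathbb{R}^{\mathsf{M}}$    the set of increasing functions in $\mathbb{R}^{\mathsf{M}}$
$\mathcal{L}(\mathsf{X})$    the set of linear operators on $\mathbb{R}^{\mathsf{X}}$
$\mathcal{M}(\mathsf{X})$    the set of Markov operators in $\mathcal{L}(\mathsf{X})$ (see §2.3.3.4)
$\langle a, b \rangle$    the inner product of vectors $a$ and $b$
$\bigvee_{\alpha \in A} u_\alpha$    supremum of $\{u_\alpha\}_{\alpha \in A}$
$\bigwedge_{\alpha \in A} u_\alpha$    infimum of $\{u_\alpha\}_{\alpha \in A}$
IID    independent and identically distributed
$X \overset{d}{=} Y$    $X$ and $Y$ have the same distribution
$X \sim F$    $X$ has distribution $F$
$F \preceq_{\mathrm{F}} G$    $G$ first order stochastically dominates $F$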
Common Abbreviations

SDF Stochastic discount factor (see §6.3.1.2 and §6.3.1.3)


MDP Markov decision process (see §5.1.1)
VFI Value function iteration (see §5.1.4.1)
HPI Howard policy iteration (see §5.1.4.2)
OPI Optimistic policy iteration (see §5.1.4.4)
RDP Recursive decision process (see §8.1.1)
ADP Abstract dynamic program (see §9.1.2)
SDP Stable dynamic program (see §9.2.2.1)

Chapter 1

Introduction

The temporal structure of a typical dynamic program is

    an initial state $X_0$ is given
    $t \leftarrow 0$
    while $t < T$ do
        the controller of the system observes the current state $X_t$
        the controller chooses an action $A_t$
        the controller receives a reward $R_t$ that depends on the current state and action
        the state updates to $X_{t+1}$
        $t \leftarrow t + 1$
    end

The state 𝑋𝑡 is a vector listing current values of variables deemed relevant to choos-
ing the current action. The action 𝐴𝑡 is a vector describing choices of a set of decision
variables. If 𝑇 < ∞ then the problem has a finite horizon. Otherwise it is an infinite
horizon problem. Figure 1.1 illustrates the first two rounds of a dynamic program.
As shown in the figure, a rule for updating the state depends on the current state and
action.
Dynamic programming provides a way to maximize expected lifetime reward of
a decision maker who receives a prospective reward sequence ( 𝑅𝑡 )𝑡⩾0 and who con-
fronts a system that maps today’s state and control into next period’s state. A lifetime
reward is an aggregation of the individual period rewards $(R_t)_{t \geq 0}$ into a single value.
An example of lifetime reward is an expected discounted sum $\mathbb{E} \sum_{t \geq 0} \beta^t R_t$ for some
$\beta \in (0, 1)$.
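
The loop described above, together with the discounted sum of rewards, can be sketched in a few lines of Julia. This is only an illustration under stated assumptions: the functions `policy`, `reward` and `update` are hypothetical placeholders rather than objects defined in the text, and the example is deterministic, so no expectation needs to be taken.

```julia
# Hypothetical primitives, chosen only to make the loop concrete.
policy(x) = x > 0 ? 1 : 0          # action A_t as a function of the state X_t
reward(x, a) = Float64(a)          # reward R_t depends on the state and action
update(x, a) = max(x - a, 0)       # next state X_{t+1} from the state and action

"Run the loop for T periods and return the discounted sum of rewards."
function simulate(x0, T; β=0.95)
    x, total = x0, 0.0
    for t in 0:(T-1)
        a = policy(x)                  # the controller observes X_t and chooses A_t
        total += β^t * reward(x, a)    # accumulate β^t R_t
        x = update(x, a)               # the state updates to X_{t+1}
    end
    return total
end

lifetime_reward = simulate(3, 10)      # one unit of reward at each of t = 0, 1, 2
```

With a stochastic update rule, the same loop would be run many times and the results averaged to approximate the expectation.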


[Figure 1.1: A dynamic program. The diagram shows states $X_0$, $X_1$, $X_2$, actions $A_0$, $A_1$ and rewards $R_0$, $R_1$.]

Example 1.0.1. A manager wants to set prices and inventories to maximize a firm’s
expected present value (EPV), which, given interest rate $r$, is defined as
$$ \mathbb{E} \left[ \pi_0 + \frac{1}{1+r} \pi_1 + \left( \frac{1}{1+r} \right)^2 \pi_2 + \cdots \right]. \tag{1.1} $$

Here 𝑋𝑡 will be a vector that quantifies the size of the inventories, prices set by com-
petitors and other factors relevant to profit maximization. The action 𝐴𝑡 sets current
prices and orders of new stock. The current reward 𝑅𝑡 is current profit 𝜋𝑡 , and the
profit stream ( 𝜋𝑡 )𝑡⩾0 is aggregated into a lifetime reward via (1.1).
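
As a small illustration of the aggregation in (1.1), the following sketch computes the present value of one given, finite profit stream. The function name `epv` and the numbers are assumptions made for this example. The expectation in (1.1) is omitted because the stream here is deterministic.

```julia
"Present value of the profit stream π_0, π_1, ... at interest rate r."
function epv(π_vals, r)
    # Profit received t periods ahead is discounted by (1 / (1 + r))^t
    return sum(π_t / (1 + r)^t for (t, π_t) in zip(0:length(π_vals)-1, π_vals))
end

value = epv([100.0, 110.0], 0.1)   # hypothetical two-period profit stream
```

Setting $\beta = 1/(1+r)$ shows that (1.1) is an instance of the expected discounted sum discussed earlier.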

Dynamic programming has a vast array of applications, from robotics and artifi-
cial intelligence to the sequencing of DNA. Dynamic programming is used every day to
control aircraft, route shipping, test products, recommend information on media plat-
forms and solve research problems. Some companies produce specialized computer
chips that are designed for specific dynamic programs.
Within economics and finance, dynamic programming is applied to topics includ-
ing unemployment, monetary policy, fiscal policy, asset pricing, firm investment,
wealth dynamics, inventory control, commodity pricing, sovereign default, the di-
vision of labor, natural resource extraction, human capital accumulation, retirement
decisions, portfolio choice, and dynamic pricing. We discuss some of these applica-
tions in chapters below.
The core theory of dynamic programming is relatively simple and concise. But
implementation can be computationally demanding. That situation provides one of
the major challenges facing the field of dynamic programming.

Example 1.0.2. To illustrate how computationally demanding problems can be, con-
sider again Example 1.0.1. Suppose that, for each book, a book retailer chooses to
hold between 0 and 10 copies. If there are 100 books to choose from, then the number
of possible combinations for her inventories is $11^{100}$, about 20 orders of magnitude
larger than the number of atoms in the known universe. In reality there are probably
many more books to choose from, as well as other factors in the business environment
that affect choices of a retailer.
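
The size of this number can be checked with arbitrary-precision integers. In the sketch below, the figure of $10^{80}$ atoms is an assumed round estimate that does not come from the text. Common estimates vary by a couple of orders of magnitude, so the computed gap should be read as approximate.

```julia
n_configs = big(11)^100            # 11 possible stock levels for each of 100 books
atoms = big(10)^80                 # assumed rough estimate, not taken from the text

digits_in_count = ndigits(n_configs)       # number of decimal digits in 11^100
gap = ndigits(n_configs ÷ atoms) - 1       # orders of magnitude separating the two
```

Under this assumption the gap is a little over twenty orders of magnitude, in line with the rough figure quoted in the example.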

In this book we discuss fundamental theory, traditional economic applications and
recent applications with computationally demanding environments. We also cover recent
trends towards more sophisticated specifications of lifetime rewards, often called
recursive preferences. Throughout the book, theory and computation are combined,
since, for interesting problems, brute-force computation is futile, while theory alone
provides limited insight. The interplay between interesting applications, fundamen-
tal theory, computational methods and evolving hardware capability makes dynamic
programming exciting.

1.1 Bellman Equations


In this section we introduce the recursive structure of dynamic programming in a
simple setting. After solving a finite-horizon model, we consider an infinite-horizon
version and explain how it produces a system of nonlinear equations. Then we turn
to methods for solving such systems.

1.1.1 Finite-Horizon Job Search


We begin with a celebrated model of job search created by McCall (1970). McCall
analyzed the decision problem of an unemployed worker in terms of current and
prospective wage offers, impatience, and the availability of unemployment compensa-
tion. Here we study a simple version of the model in which essential ideas of dynamic
programming are particularly clear.
Readers who are familiar with Bellman equations can skim this section quickly
and proceed directly to §1.2.

1.1.1.1 A Two-Period Problem

Imagine someone who begins her working life at time 𝑡 = 1 without employment.
While unemployed, she receives a new job offer paying wage 𝑊𝑡 at each date 𝑡 . She
can accept the offer and work permanently at that wage level or reject the offer, receive
unemployment compensation 𝑐, and draw a new offer next period. We assume that
the wage offer sequence is IID and nonnegative, with distribution 𝜑. In particular,
• W ⊂ R+ is a finite set of possible wage outcomes and
• 𝜑 : W → [0, 1] is a probability distribution on W, assigning a probability 𝜑 ( 𝑤)
to each possible wage outcome 𝑤.
The worker is impatient. Impatience is parameterized by a time discount factor 𝛽 ∈
(0, 1), so that the present value of a next-period payoff of 𝑦 dollars is 𝛽 𝑦 . Since 𝛽 < 1,
the worker will be tempted to accept reasonable offers, rather than to wait for better
ones. A key question is how long to wait.
Suppose as a first step that working life is just two periods. To solve our problem
we work backwards, starting at the final date 𝑡 = 2, after 𝑊2 has been observed.1 If
she is already employed, the worker has no decision to make: she continues working
at her current wage. If she is unemployed, then she should take the largest of 𝑐 and
𝑊2 .
Now we step back to 𝑡 = 1. At this time, having received offer 𝑊1 , the unemployed
worker’s options are (a) accept 𝑊1 and receive it in both periods or (b) reject it, receive
unemployment compensation 𝑐, and then, in the second period, choose the maximum
of 𝑊2 and 𝑐.
Let’s assume that the worker seeks to maximize expected present value. The EPV
of option (a) is 𝑊1 + 𝛽𝑊1 , which is also called the stopping value. The EPV of option
(b), also called the continuation value, is $h_1 := c + \beta \, \mathbb{E} \max\{c, W_2\}$. More explicitly,
$$ h_1 = c + \beta \sum_{w' \in \mathsf{W}} v_2(w') \varphi(w') \quad \text{where} \quad v_2(w) := \max\{c, w\}. \tag{1.2} $$

The optimal choice at 𝑡 = 1 is now clear: accept the offer if 𝑊1 + 𝛽𝑊1 ⩾ ℎ1 and reject
otherwise. A decision tree is shown in Figure 1.2.
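
To make the two-period decision rule concrete, here is a minimal sketch with a three-point wage distribution. The wage values, probabilities and parameters are hypothetical and chosen only so that the arithmetic is easy to follow.

```julia
w_vals = [10.0, 20.0, 30.0]        # hypothetical wage outcomes in W
φ = [0.25, 0.50, 0.25]             # hypothetical probabilities φ(w)
β, c = 0.9, 15.0                   # hypothetical discount factor and compensation

v_2 = max.(c, w_vals)              # v_2(w) = max{c, w}
h_1 = c + β * sum(v_2 .* φ)        # continuation value, as in (1.2)

# Accept at t = 1 exactly when the stopping value w + βw is at least h_1
accept = (1 + β) .* w_vals .>= h_1
```

Here the continuation value is 34.125, so the lowest offer is rejected and the two higher offers are accepted.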

1.1.1.2 Comments on Information

In determining the optimal choice above, we assumed that the worker (a) cares about
expected values and (b) knows how to compute them.
In Chapters 7–8 we discuss how to extend or weaken these assumptions. Some
of these extensions allow decision makers to focus on measurements that differ
from expected values. Other extensions assume that the decision maker does not know
underlying probability distributions. For now we put these issues aside and return to
the setup discussed in the previous section.

1 The procedure of solving the last period first and then working back in time is called backward
induction. Starting with the last period makes sense because there is no future to consider.

[Figure 1.2: Decision tree for a two period problem. Node 1: after drawing $W_1$, accept and receive $W_1 + \beta W_1$ if $W_1 + \beta W_1 \geq c + \beta \, \mathbb{E} \, v_2(W_2)$; otherwise reject, receive $c$ and draw $W_2$. Node 2: accept and receive $W_2$ if $W_2 \geq c$; otherwise reject and receive $c$.]

[Figure 1.3: The value function $v_1$ and the reservation wage. The plot shows the stopping value $w_1 + \beta w_1$, the continuation value $c + \beta \sum_{w'} \max\{c, w'\} \varphi(w')$ and $v_1(w_1)$, with $w_1$ on the horizontal axis running from 10.0 to 60.0 and the reservation wage $w_1^*$ marked.]

1.1.1.3 Value Functions

A key idea in dynamic programming is to use “value functions” to track maximal life-
time rewards from a given state at a given time. The time 2 value function 𝑣2 defined
in (1.2) returns the maximum value obtained in the final stage for each possible re-
alization of the time 2 wage offer. The time 1 value function 𝑣1 evaluated at 𝑤 ∈ W
is
$$ v_1(w) := \max \left\{ w + \beta w, \; c + \beta \sum_{w' \in \mathsf{W}} v_2(w') \varphi(w') \right\}. \tag{1.3} $$

It represents the present value of expected lifetime income after receiving the first
offer 𝑤, conditional on choosing optimally in both periods.
The value function is shown in Figure 1.3. Figure 1.3 also shows the reservation
wage
$$ w_1^* := \frac{h_1}{1 + \beta}. \tag{1.4} $$
It is the $w$ that solves the indifference condition
$$ w + \beta w = c + \beta \sum_{w' \in \mathsf{W}} v_2(w') \varphi(w'), $$
and equates the value of stopping to the value of continuing. For an offer 𝑊1 above
𝑤1∗ , the stopping value exceeds the continuation value. For an offer below the reser-
vation wage, the reverse is true. Hence, the optimal choice for the worker at 𝑡 = 1 is
completely described by the reservation wage.
Parameters and functions underlying the figure are shown in Listing 1.
Figure 1.3 is instructive. We can see that higher unemployment compensation
𝑐 shifts up the continuation value ℎ1 and increases the reservation wage. As a result,
the worker will, on average, spend more time unemployed when unemployment
compensation is higher.
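
The effect of unemployment compensation on the reservation wage can also be checked numerically. The sketch below is self-contained and, to avoid external packages, assumes a uniform offer distribution on a hypothetical wage grid, which differs from the distribution used in Listing 1. The function name `res_wage_given_c` is likewise an assumption made for this example.

```julia
"Reservation wage w_1^* = h_1 / (1 + β) when unemployment compensation is c."
function res_wage_given_c(c; w_vals=10.0:1.0:60.0, β=0.96)
    φ = fill(1 / length(w_vals), length(w_vals))   # uniform offer distribution (an assumption)
    h_1 = c + β * sum(max.(c, w_vals) .* φ)        # continuation value, as in (1.2)
    return h_1 / (1 + β)                           # reservation wage, as in (1.4)
end

res_wages = res_wage_given_c.([10.0, 20.0, 30.0])  # increasing in c
```

Each increase in $c$ raises $h_1$ and therefore the reservation wage, consistent with the discussion above.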

EXERCISE 1.1.1. If unemployment compensation increases unemployment duration,
should we conclude that increasing such compensation is detrimental to society?
Provide some thoughts on this question in the context of the McCall model.

1.1.1.4 Three Periods

Now let’s suppose that the worker works in period 𝑡 = 0 as well as 𝑡 = 1, 2. Figure 1.4
shows the decision tree for the three periods. Notice that the subtree containing nodes
1 and 2 is just the decision tree for the two-period problem in Figure 1.2. We will use
this to find optimal actions.
At $t = 0$, the value of accepting the current offer $W_0$ is $W_0 + \beta W_0 + \beta^2 W_0$, while the
maximal value of rejecting and waiting is $c$ plus, after discounting by $\beta$, the maximum
value that can be obtained by behaving optimally from $t = 1$. We have already
calculated this value: it is just $v_1(W_1)$, as given in (1.3)!
Maximal time zero value $v_0(w)$ is the maximum of the value of these two options,
given $W_0 = w$, so we can write
$$ v_0(w) = \max \left\{ w + \beta w + \beta^2 w, \; c + \beta \sum_{w' \in \mathsf{W}} v_1(w') \varphi(w') \right\}. \tag{1.5} $$

By plugging 𝑣1 from (1.3) into this expression, we can determine 𝑣0 , as well as the
optimal action, the one that achieves the largest value in the max term in (1.5).
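
The substitution of $v_1$ into (1.5) can be carried out numerically by working backwards from the last period. The sketch below uses a hypothetical three-point wage distribution and hypothetical parameters. It is an illustration, not the code used for the figures.

```julia
w_vals = [10.0, 20.0, 30.0]        # hypothetical wage outcomes in W
φ = [0.25, 0.50, 0.25]             # hypothetical probabilities φ(w)
β, c = 0.9, 15.0                   # hypothetical discount factor and compensation

v_2 = max.(c, w_vals)                          # last period: v_2(w) = max{c, w}
h_1 = c + β * sum(v_2 .* φ)                    # continuation value at t = 1
v_1 = max.((1 + β) .* w_vals, h_1)             # v_1 from (1.3)
h_0 = c + β * sum(v_1 .* φ)                    # continuation value at t = 0
v_0 = max.((1 + β + β^2) .* w_vals, h_0)       # v_0 from (1.5)

# Accept at t = 0 exactly when the stopping value attains the max in (1.5)
accept_0 = (1 + β + β^2) .* w_vals .>= h_0
```

Each value function is computed from the one that follows it in time, which is the pattern of backward induction.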

using Distributions

"Creates an instance of the job search model, stored as a NamedTuple."
function create_job_search_model(;
        n=50,        # wage grid size
        w_min=10.0,  # lowest wage
        w_max=60.0,  # highest wage
        a=200,       # wage distribution parameter
        b=100,       # wage distribution parameter
        β=0.96,      # discount factor
        c=10.0       # unemployment compensation
    )
    w_vals = collect(LinRange(w_min, w_max, n+1))
    # Probabilities on the support 0, ..., n of the Beta-binomial distribution
    ϕ = pdf.(BetaBinomial(n, a, b), 0:n)
    return (; n, w_vals, ϕ, β, c)
end

"Computes lifetime value at t=1 given current wage w_1 = w."
function v_1(w, model)
    (; n, w_vals, ϕ, β, c) = model
    h_1 = c + β * max.(c, w_vals)'ϕ    # continuation value, as in (1.2)
    return max(w + β * w, h_1)
end

"Computes reservation wage at t=1."
function res_wage(model)
    (; n, w_vals, ϕ, β, c) = model
    h_1 = c + β * max.(c, w_vals)'ϕ    # continuation value, as in (1.2)
    return h_1 / (1 + β)
end

Listing 1: Computing 𝑣1 and 𝑤1∗ (two_period_job_search.jl)


[Figure 1.4: Decision tree for job seeker. Node 0: after drawing $W_0$, accept and receive $W_0 + \beta W_0 + \beta^2 W_0$ if $W_0 + \beta W_0 + \beta^2 W_0 \geq c + \beta \, \mathbb{E} \, v_1(W_1)$; otherwise reject, receive $c$ and draw $W_1$. Node 1: accept and receive $W_1 + \beta W_1$ if $W_1 + \beta W_1 \geq c + \beta \, \mathbb{E} \, v_2(W_2)$; otherwise reject, receive $c$ and draw $W_2$. Node 2: accept and receive $W_2$ if $W_2 \geq c$; otherwise reject and receive $c$.]

Figure 1.4 illustrates how the backward induction process works. The last period
value function 𝑣2 is trivial to obtain. With 𝑣2 in hand we can compute 𝑣1 . With 𝑣1 in
hand we can compute 𝑣0 . Once all the value functions are available, we can calculate
whether to accept or reject at each point in time.

EXERCISE 1.1.2. The optimal action at time $t = 0$ is determined by a time zero
reservation wage $w_0^*$. The worker should accept the time zero wage offer if and only
if $W_0$ exceeds $w_0^*$. Calculate $w_0^*$ for this problem, by analogy with $w_1^*$ in (1.4).

Notice how we broke the three period problem down into a pair of two period
problems, given by the two equations (1.3) and (1.5). Breaking many-period
problems down into a sequence of two period problems is the essence of dynamic
programming. The recursive relationships between $v_0$ and $v_1$ in (1.5), as well as
between $v_1$ and $v_2$ in (1.3), are examples of what are called Bellman equations. We
will see many other examples.

EXERCISE 1.1.3. Extend the above arguments to 𝑇 time periods, where 𝑇 can be
any finite number. Using Julia or another programming language, write a function
that takes 𝑇 as an argument and returns ( 𝑤0∗ , . . . , 𝑤𝑇∗ ), the sequence of reservation
wages for each period.

1.1.2 Infinite Horizon


Next we consider an infinite horizon problem that in some ways is more challenging
but in other ways simpler. On one hand, the lack of a terminal period means that
backward induction requires a subtler justification. On the other hand, the infinite
horizon means that the worker always faces an infinite future, so that we only have
to study a single value function and need not keep track of the number of remaining
periods in the problem. This will become clearer as the section unfolds.²
With the above discussion in mind, let us consider a worker who aims to maximize

$$\mathbb{E} \sum_{t=0}^{\infty} \beta^t R_t, \tag{1.6}$$

where 𝑅𝑡 ∈ {𝑐, 𝑊𝑡} is earnings at time 𝑡. As before, jobs are permanent, so accepting
a job at a given wage means earning that wage in every subsequent period.
² Incidentally, imposing an infinite horizon is not the same as assuming humans live forever. Rather,
it corresponds to the idea that humans have no specific “termination” date. More generally, we can
understand an infinite horizon as an approximation to a finite horizon in which observations are recorded
at relatively high frequency and no clear termination date exists.

Let’s clarify our assumptions:


Assumption 1.1.1. The wage process (𝑊𝑡)𝑡⩾0 is IID with common distribution 𝜑, where
𝜑 ∈ D(W) and W ⊂ R+ is finite. The parameters 𝑐 and 𝛽 are positive and 𝛽 < 1.

Here and below, for any finite or countable set 𝐹 , the symbol D( 𝐹 ) indicates the
set of distributions on 𝐹 .
As with the finite-horizon case, infinite-horizon dynamic programming involves a
two-step procedure that first assigns values to states and then deduces optimal actions
given those values. We begin with an informal discussion and then formalize the main
ideas.
To trade off current and future rewards optimally, we need to compare current
payoffs we get from our two choices with the states that those choices lead to and the
maximum value that can be extracted from those states. But how do we calculate the
maximum value that can be extracted from each state when lifetime is infinite?
Consider first the present expected lifetime value of being employed with wage
𝑤 ∈ W. This case is easy because, under the current assumptions, workers who accept
a job are employed forever. Lifetime payoff is
$$w + \beta w + \beta^2 w + \cdots = \frac{w}{1 - \beta}. \tag{1.7}$$

How about maximum present expected lifetime value attainable when entering the
current period unemployed with wage offer 𝑤 in hand? Denote this (as yet unknown)
value by 𝑣∗ ( 𝑤). We call 𝑣∗ the value function. While 𝑣∗ is not trivial to pin down, the
task is not impossible. Our first step in the right direction is to observe that it satisfies
the Bellman equation
$$v^*(w) = \max \left\{ \frac{w}{1 - \beta}, \; c + \beta \sum_{w' \in \mathsf{W}} v^*(w') \varphi(w') \right\} \tag{1.8}$$

at every 𝑤 ∈ W. (Here 𝑤′ is the offer next period.)


Our reasoning is as follows: The first term inside the max operation is the stopping
value, or lifetime payoff from accepting current offer 𝑤. The second term inside the
max operation is the continuation value, or current expected value of rejecting and
behaving optimally thereafter. Maximal value is obtained by selecting the largest of
these two alternatives.
Note the similarity between (1.8) and our finite horizon Bellman equations (1.3)
and (1.5). The only real difference is that the value function is no longer
time-dependent. This is because the worker always looks forward toward an infinite
horizon, regardless of the current date.
Equation (1.8) is to be solved for a function 𝑣∗ ∈ R^W, the set of all functions from
W to R. Once we have solved for 𝑣∗ (assuming this is possible), optimal choices can
be made by observing current 𝑤 and then choosing the largest of the two alternatives
on the right-hand side of (1.8), just as we did in the finite horizon case. This idea –
that optimal choices can be made by computing the value function and maximizing the
right-hand side of the Bellman equation – is called Bellman’s principle of optimality,
and will be a cornerstone of what follows. Later we prove it in a general setting.
To solve for 𝑣∗, we use fixed point theory, the topic of the next section. Later, in §1.3,
we return to the job search problem and apply fixed point theory to solve for 𝑣∗.
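As a preview, the right-hand side of (1.8) and the choice rule it implies can be sketched in Julia. The primitives below (wage grid, uniform `ϕ`, parameter values) and the names `bellman_rhs`, `accept_rule` and `iterate_bellman` are assumptions for this example only. Repeatedly applying the right-hand side is used here without justification; the fixed point theory developed below is what explains why such iteration can be expected to work.

```julia
# Assumed primitives for this sketch (not taken from the text)
β, c = 0.96, 10.0
w_vals = collect(LinRange(10.0, 60.0, 51))      # finite wage set W
ϕ = fill(1 / length(w_vals), length(w_vals))    # offer distribution on W (uniform, assumed)

# Right-hand side of (1.8), applied to a candidate value function v (a vector)
function bellman_rhs(v)
    stop = w_vals ./ (1 - β)          # stopping value w / (1 - β) at each wage
    cont = c + β * sum(v .* ϕ)        # continuation value (the same for every w)
    return max.(stop, cont)
end

# Choice implied by v: accept (true) wherever stopping is at least as good as continuing
accept_rule(v) = w_vals ./ (1 - β) .>= c + β * sum(v .* ϕ)

# Repeatedly apply the right-hand side, starting from a given guess
function iterate_bellman(v; n_iter=500)
    for _ in 1:n_iter
        v = bellman_rhs(v)
    end
    return v
end

v = iterate_bellman(w_vals ./ (1 - β))
residual = maximum(abs.(v - bellman_rhs(v)))    # how far v is from solving (1.8)
println("Bellman residual: ", residual)
println("lowest accepted wage: ", minimum(w_vals[accept_rule(v)]))
```

A small residual indicates that `v` approximately satisfies (1.8), after which `accept_rule` picks the larger of the two alternatives at each wage.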

1.2 Stability and Contractions


In this section we cover enough fixed point theory to solve an infinite horizon job
search problem. (In Chapter 2 we consider more general results.) Readers who are
familiar with the Neumann series lemma and Banach’s fixed point theorem can skim
this section and proceed to §1.3.

1.2.1 Vector Space

To begin, we recall some fundamental properties of real numbers, finite-dimensional
vector space, basic topology and equivalence of norms.

1.2.1.1 Real and Complex Vectors

For the most part, we are interested in vectors whose elements are real numbers (as
distinguished from complex numbers). Before investigating such vectors, let’s provide
some useful language about the real line R. (You might want to review some elemen-
tary concepts from real analysis in Appendix §A, such as suprema, infima, minima,
maxima, and convergence.)
Given 𝑎, 𝑏 ∈ R, let 𝑎 ∨ 𝑏 ≔ max{ 𝑎, 𝑏} and 𝑎 ∧ 𝑏 ≔ min{ 𝑎, 𝑏}. The absolute value
of 𝑎 ∈ R is defined as | 𝑎 | ≔ 𝑎 ∨ (−𝑎).
A real-valued vector 𝑢 = (𝑢1, . . . , 𝑢𝑛) is a finite real sequence with 𝑢𝑖 ∈ R as the
𝑖-th element. The set of all real vectors of length 𝑛 is denoted by R𝑛. The inner product
of 𝑛-vectors (𝑢1, . . . , 𝑢𝑛) and (𝑣1, . . . , 𝑣𝑛) is $\langle u, v \rangle := \sum_{i=1}^n u_i v_i$.

The set C of complex numbers is defined in the appendix to Sargent and Stachurski
(2023b) and many other places; as is the set C𝑛 of all complex-valued 𝑛-vectors. We
assume readers know what complex numbers are and how to compute the modulus
of a complex number.
EXERCISE 1.2.1. Let 𝛼, 𝑠 and 𝑡 be real numbers. Show that 𝛼 ∨ ( 𝑠 + 𝑡 ) ⩽ 𝑠 + 𝛼 ∨ 𝑡
whenever 𝑠 ⩾ 0.

1.2.1.2 Norms

The Euclidean norm on a real vector space is defined as

$$\|u\| := \sqrt{\langle u, u \rangle} \qquad (u \in \mathbb{R}^n).$$

Because they provide more flexibility when checking conditions that underlie various
results, some alternative norms on R𝑛 are important for applications of fixed point
theory.
As a first step, recall that a function ‖·‖ : R𝑛 → R is called a norm on R𝑛 if, for
any 𝛼 ∈ R and 𝑢, 𝑣 ∈ R𝑛,

(a) ‖𝑢‖ ⩾ 0 (nonnegativity)
(b) ‖𝑢‖ = 0 ⟺ 𝑢 = 0 (positive definiteness)
(c) ‖𝛼𝑢‖ = |𝛼|‖𝑢‖ (absolute homogeneity) and
(d) ‖𝑢 + 𝑣‖ ⩽ ‖𝑢‖ + ‖𝑣‖ (triangle inequality)

The Euclidean norm on R𝑛 satisfies the Cauchy–Schwarz inequality

$$|\langle u, v \rangle| \leq \|u\| \cdot \|v\| \quad \text{for all } u, v \in \mathbb{R}^n.$$

This inequality can be used to prove that the triangle inequality holds for the Euclidean
norm (see, e.g., Kreyszig (1978)).
Example 1.2.1. The ℓ1 norm of a vector 𝑢 = (𝑢1, . . . , 𝑢𝑛) ∈ R𝑛 is defined by

$$\|u\|_1 := \sum_{i=1}^n |u_i|. \tag{1.9}$$

In machine learning applications, ‖·‖₁ is sometimes called the “Manhattan norm,” and
𝑑1(𝑢, 𝑣) ≔ ‖𝑢 − 𝑣‖₁ is called the “Manhattan distance” or “taxicab distance” between
vectors 𝑢 and 𝑣. We will refer to it as the ℓ1 distance or ℓ1 deviation.

EXERCISE 1.2.2. Verify that the ℓ1 norm on R𝑛 satisfies (a)–(d) above.

EXERCISE 1.2.3. Fix 𝑝 ∈ R𝑛 with 𝑝𝑖 > 0 for all 𝑖 ∈ [𝑛] and $\sum_i p_i = 1$. Show that
$\|u\|_{1,p} := \sum_{i=1}^n |u_i| p_i$ is a norm on R𝑛.

The ℓ1 norm and the Euclidean norm are special cases of the so-called ℓ𝑝 norm,
which is defined for 𝑝 ⩾ 1 by

$$\|u\|_p := \left( \sum_{i=1}^n |u_i|^p \right)^{1/p}. \tag{1.10}$$

It can be shown that 𝑢 ↦ ‖𝑢‖𝑝 is a norm for all 𝑝 ⩾ 1, as suggested by the name
(see, e.g., Kreyszig (1978)). For this norm, the subadditivity asserted in (d) is called
Minkowski’s inequality.
Since the Euclidean case is obtained by setting 𝑝 = 2, the Euclidean norm is also
called the ℓ2 norm, and we write ‖·‖₂ rather than ‖·‖ when extra clarity is required.

EXERCISE 1.2.4. Prove that the supremum norm (or ℓ∞ norm), defined by
$\|u\|_\infty := \max_{i=1}^n |u_i|$, is also a norm on R𝑛.

(The symbol ‖𝑢‖∞ is used because, for all 𝑢 ∈ R𝑛, we have ‖𝑢‖𝑝 → ‖𝑢‖∞ as 𝑝 → ∞.)
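The norms introduced so far can be evaluated with `norm` from Julia's `LinearAlgebra` standard library. The sketch below uses an arbitrary test vector (an assumption for illustration) and prints the ℓ𝑝 norm for increasing 𝑝 to show it approaching the ℓ∞ norm.

```julia
using LinearAlgebra

u = [3.0, -4.0, 1.0]                 # arbitrary test vector

println("ℓ1 norm: ", norm(u, 1))     # sum of absolute values
println("ℓ2 norm: ", norm(u))        # Euclidean norm (the default)
println("ℓ∞ norm: ", norm(u, Inf))   # largest absolute value

# ‖u‖_p approaches ‖u‖_∞ as p grows
for p in (1, 2, 4, 16, 64)
    println("p = $p:  ‖u‖_p = ", norm(u, p))
end
```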


For the next exercise, we recall that the indicator function of logical statement 𝑃 ,
denoted here by 1{ 𝑃 }, takes value 1 (resp., 0) if 𝑃 is true (resp., false). For example,
if 𝑥, 𝑦 ∈ R, then

$$\mathbb{1}\{x \leq y\} = \begin{cases} 1 & \text{if } x \leq y \\ 0 & \text{otherwise.} \end{cases}$$

If 𝐴 ⊂ 𝑆, where 𝑆 is any set, then 1𝐴(𝑥) ≔ 1{𝑥 ∈ 𝐴} for all 𝑥 ∈ 𝑆.
EXERCISE 1.2.5. The so-called ℓ0 “norm” $\|u\|_0 := \sum_{i=1}^n \mathbb{1}\{u_i \neq 0\}$ used in some data
science applications is not a norm on R𝑛. Prove this.

1.2.1.3 Equivalence of Vector Norms

An important property of a finite-dimensional normed vector space is that all norms
are “equivalent.” Let’s review this result and discuss why it matters.

To begin, recall that when 𝑢 and (𝑢𝑚) ≔ (𝑢𝑚)𝑚∈N are all elements of R𝑛, we say
that (𝑢𝑚) converges to 𝑢 and write 𝑢𝑚 → 𝑢 if

‖𝑢𝑚 − 𝑢‖ → 0 as 𝑚 → ∞ for some norm ‖·‖ on R𝑛.

It might seem that this definition is imprecise. Don’t we need to clarify that the
convergence is with respect to a particular norm?
No, we don’t. This is because any two norms ‖·‖𝑎 and ‖·‖𝑏 on R𝑛 are equivalent
in the sense that there exist finite positive constants 𝑀, 𝑁 such that

$$M \|u\|_a \leq \|u\|_b \leq N \|u\|_a \quad \text{for all } u \in \mathbb{R}^n. \tag{1.11}$$

(See, e.g., Kreyszig (1978).)
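For one concrete pair of norms the constants in (1.11) are easy to write down: with ‖·‖𝑎 = ‖·‖∞ and ‖·‖𝑏 = ‖·‖₁, the choice 𝑀 = 1 and 𝑁 = 𝑛 works. The following sketch spot-checks this on random vectors. The function name `check_equivalence` and the number of trials are assumptions for the example, and a numerical check of this kind is of course not a proof.

```julia
using LinearAlgebra

# Check M‖u‖_a ≤ ‖u‖_b ≤ N‖u‖_a with a = ∞, b = 1, M = 1 and N = n
function check_equivalence(n; trials=1_000)
    M, N = 1.0, float(n)
    for _ in 1:trials
        u = randn(n)
        a, b = norm(u, Inf), norm(u, 1)
        # small slack guards against floating point rounding
        (M * a <= b + 1e-12 && b <= N * a + 1e-12) || return false
    end
    return true
end

println(check_equivalence(5))
```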

EXERCISE 1.2.6. Let us write ‖·‖𝑎 ∼ ‖·‖𝑏 if there exist finite positive 𝑀, 𝑁 such that
(1.11) holds. Prove that ∼ is an equivalence relation (see §A.1) on the set of all norms on
R𝑛.

EXERCISE 1.2.7. Let ‖·‖𝑎 and ‖·‖𝑏 be any two norms on R𝑛. Given a point 𝑢
in R𝑛 and a sequence (𝑢𝑚) in R𝑛, use (1.11) to confirm that ‖𝑢𝑚 − 𝑢‖𝑎 → 0 implies
‖𝑢𝑚 − 𝑢‖𝑏 → 0 as 𝑚 → ∞.

The next exercise tells us that pointwise convergence and norm convergence are
the same thing in finite dimensions.

EXERCISE 1.2.8. Let ‖·‖ be any norm on R𝑛. Fixing a point 𝑢 in R𝑛 and a sequence
(𝑢𝑚) in R𝑛, let $u^i$ and $u^i_m$ be the 𝑖-th components of 𝑢 and 𝑢𝑚 respectively. Show that
$u^i_m \to u^i$ for all 𝑖 ∈ {1, . . . , 𝑛} if and only if ‖𝑢𝑚 − 𝑢‖ → 0.

Recall that a set 𝐶 ⊂ R𝑛 is called bounded if there exists an 𝑀 ∈ N with ‖𝑥‖ ⩽ 𝑀
for all 𝑥 ∈ 𝐶; and closed in R𝑛 if, for all 𝑢 ∈ R𝑛 and sequences (𝑢𝑚) ⊂ 𝐶 such that
𝑢𝑚 → 𝑢 as 𝑚 → ∞, we also have 𝑢 ∈ 𝐶. A set 𝐺 ⊂ R𝑛 is called open in R𝑛 if its
complement 𝐺^𝑐 is closed in R𝑛. A set 𝑁 is called a neighborhood of 𝑢 ∈ R𝑛 if there
exists an open set 𝐺 ⊂ R𝑛 with 𝑢 ∈ 𝐺 ⊂ 𝑁. A map 𝑇 from 𝑈 ⊂ R𝑛 to R𝑘 is called
continuous at 𝑢 ∈ 𝑈 if 𝑇𝑢𝑚 → 𝑇𝑢 for any (𝑢𝑚) ⊂ 𝑈 with 𝑢𝑚 → 𝑢; and continuous if 𝑇 is
continuous at every 𝑢 ∈ 𝑈. These notions apply for any norm, since convergence does
not depend on a choice of norm.

1.2.1.4 Matrices and Neumann Series

Next we discuss geometric series in matrix space, along with the Neumann series
lemma, one of many useful results in applied and numerical analysis.
Before starting we recall that if 𝐴 = (𝑎𝑖𝑗) is an 𝑛 × 𝑛 matrix with 𝑖, 𝑗-th element 𝑎𝑖𝑗,
then the definition of matrix multiplication tells us that for 𝑢 ∈ R𝑛, the 𝑖-th element
of 𝐴𝑢 is $\sum_{j=1}^n a_{ij} u_j$, while the 𝑗-th element of 𝑢⊤𝐴 is $\sum_{i=1}^n a_{ij} u_i$. Think of 𝑢 ↦ 𝐴𝑢 and
𝑢 ↦ 𝑢⊤𝐴 as two different mappings, each of which takes an 𝑛-vector and produces a
new 𝑛-vector.
Remark 1.2.1. In this book, we adopt a convention that a vector in R𝑛 is just an
𝑛-tuple of real values. This coincides with the viewpoint of languages like Julia and
Python: vectors are just “flat” arrays. But when we use vectors in matrix algebra, they
should be understood as column vectors unless we state otherwise.
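The two mappings 𝑢 ↦ 𝐴𝑢 and 𝑢 ↦ 𝑢⊤𝐴 can be compared directly in Julia. The matrix and vector below are arbitrary values chosen for illustration, and the variable names are assumptions of this sketch.

```julia
A = [1.0 2.0;
     3.0 4.0]
u = [10.0, 1.0]          # a "flat" vector (one-dimensional array)

Au  = A * u              # u treated as a column vector: i-th entry is Σ_j a_ij u_j
utA = u' * A             # u' is a row vector: j-th entry is Σ_i a_ij u_i

println(Au)
println(vec(utA))        # flatten the row vector back into a plain vector
```

Here `u'` is the adjoint (transpose, for real entries) of `u`, so `u' * A` returns a row vector, while `A * u` returns a flat array like `u` itself.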

Just as we considered norms of vectors in §1.2.1.2, we will find it helpful to have
a notion of norms of matrices. A real-valued map ‖·‖ defined on R𝑛×𝑛, the set of real 𝑛 × 𝑛
matrices, is called a matrix norm if it has the following properties: for any 𝛼 ∈ R and
any 𝑛 × 𝑛 matrices 𝐴, 𝐵,
(a) ‖𝐴‖ ⩾ 0,
(b) ‖𝐴‖ = 0 ⟺ 𝐴 = 0,
(c) ‖𝛼𝐴‖ = |𝛼|‖𝐴‖, and
(d) ‖𝐴 + 𝐵‖ ⩽ ‖𝐴‖ + ‖𝐵‖.
These are called nonnegativity, positive definiteness, absolute homogeneity and the
triangle inequality, by analogy with the properties of norms on R𝑛 discussed in §1.2.1.2.
An example of a matrix norm is the so-called operator norm

$$\|B\|_o := \max_{\|u\| = 1} \|Bu\|. \tag{1.12}$$

Here 𝐵 is 𝑛 × 𝑛, 𝑢 is in R𝑛 and the norm on the right-hand side is the Euclidean
norm over the 𝑛-vector 𝐵𝑢. Another example of a matrix norm is the supremum norm
defined as

$$\|B\|_\infty := \max_{1 \leq i, j \leq n} |b_{ij}|, \quad \text{where } b_{ij} \text{ is the } i,j\text{-th element of } B. \tag{1.13}$$

Some matrix norms have the submultiplicative property, which means that, for
all 𝐴, 𝐵 ∈ R𝑛×𝑛, we have ‖𝐴𝐵‖ ⩽ ‖𝐴‖‖𝐵‖.
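Both matrix norms are easy to evaluate in Julia. In the sketch below, `opnorm` from the `LinearAlgebra` standard library returns the operator norm induced by the Euclidean norm (its default), while the supremum norm is the largest absolute entry. The test matrix and the brute-force comparison over unit vectors are assumptions added for illustration.

```julia
using LinearAlgebra

B = [1.0 2.0;
     3.0 4.0]

op_norm  = opnorm(B)            # operator norm induced by the Euclidean norm, as in (1.12)
sup_norm = maximum(abs.(B))     # supremum norm, as in (1.13)

# Crude check of (1.12): maximize ‖Bu‖ over a grid of unit vectors u in R²
angles = range(0, 2π, length=1_000)
brute  = maximum(norm(B * [cos(θ), sin(θ)]) for θ in angles)

println("operator norm  = ", op_norm)
println("brute force    = ", brute)
println("supremum norm  = ", sup_norm)
```

The brute-force value sits slightly below `op_norm` because the grid of unit vectors is finite.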

EXERCISE 1.2.9. Show that the operator norm ‖·‖𝑜 is submultiplicative on R𝑛×𝑛. Provide
a counterexample to the claim that ‖·‖∞ is submultiplicative.

In what follows we often use the operator norm as our choice of matrix norm
(partly because of its attractive submultiplicative property). Hence, by convention,
an expression such as ‖𝐴‖ refers to the operator norm ‖𝐴‖𝑜 of 𝐴.
Analogous to the vector case, we say that a sequence (𝐴𝑘) of 𝑛 × 𝑛 matrices
converges to an 𝑛 × 𝑛 matrix 𝐴 and write 𝐴𝑘 → 𝐴 if ‖𝐴𝑘 − 𝐴‖ → 0 as 𝑘 → ∞. Just as
with vectors, this form of norm convergence holds if and only if each element of 𝐴𝑘
converges to the corresponding element of 𝐴. The proof is similar to the solution to
Exercise 1.2.8.
If 𝐴 is an 𝑛 × 𝑛 matrix, then 𝜆 ∈ C is called an eigenvalue of 𝐴 if there exists a
nonzero 𝑒 ∈ C𝑛 such that 𝐴𝑒 = 𝜆𝑒. (Here C is the set of complex numbers and C𝑛 is the
set of complex 𝑛-vectors.) A vector 𝑒 satisfying this equality is called an eigenvector
of 𝐴 and ( 𝜆, 𝑒) is called an eigenpair.
In Julia, we can compute the eigenvalues of a square matrix 𝐴 via eigvals(A).
The code

using LinearAlgebra
A = [0 -1;
     1  0]
display(eigvals(A))

produces

2-element Vector{ComplexF64}:
0.0 - 1.0im
0.0 + 1.0im

Here im stands for 𝑖, the imaginary unit, so the eigenvalues of 𝐴 are −𝑖 and 𝑖.
Turning to geometric series, let us begin in one dimension. Consider the
one-dimensional linear equation 𝑢 = 𝑎𝑢 + 𝑏, where 𝑎, 𝑏 are given and 𝑢 is unknown. Its
solution 𝑢∗ satisfies

$$|a| < 1 \implies u^* = \frac{b}{1 - a} = \sum_{k \geq 0} a^k b. \tag{1.14}$$

This scalar result extends naturally to vectors. To show this we suppose that 𝑢 and
𝑏 are column vectors in R𝑛, and that 𝐴 is an 𝑛 × 𝑛 matrix. We consider the vector
equation 𝑢 = 𝐴𝑢 + 𝑏. For the next result, we recall that the spectral radius of 𝐴 is
defined as

$$\rho(A) := \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } A \}. \tag{1.15}$$

Here |𝜆| indicates the modulus of complex number 𝜆.
With 𝐼 as the 𝑛 × 𝑛 identity matrix, we can state the following result.
Theorem 1.2.1 (Neumann Series Lemma). If 𝜌(𝐴) < 1, then 𝐼 − 𝐴 is nonsingular and

$$(I - A)^{-1} = \sum_{k \geq 0} A^k.$$

It follows directly that the vector system 𝑢 = 𝐴𝑢 + 𝑏 has a unique solution
$u^* = (I - A)^{-1} b = \sum_{k \geq 0} A^k b$ whenever 𝜌(𝐴) < 1. This is the multivariate extension of (1.14).
The code in Listing 2 shows how to compute the spectral radius of an arbitrary
matrix 𝐴 in Julia. The print statement produces approximately 0.5828, so, for this
matrix, 𝜌(𝐴) < 1.

using LinearAlgebra
ρ(A) = maximum(abs(λ) for λ in eigvals(A))  # Spectral radius
A = [0.4 0.1;                               # Test with arbitrary A
     0.7 0.2]
print(ρ(A))

Listing 2: Computing a spectral radius (compute_spec_rad.jl)

EXERCISE 1.2.10. Prove that 𝜌 ( 𝛼𝐵) = | 𝛼 | 𝜌 ( 𝐵) for all 𝛼 ∈ R.

The rest of this section works through the proof of the Neumann series lemma,
with several parts left as exercises. An informal proof of the lemma runs as follows.
If $S := \sum_{k \geq 0} A^k$, then

$$I + AS = I + A \sum_{k \geq 0} A^k = I + A + A^2 + \cdots = S.$$

Rearranging 𝐼 + 𝐴𝑆 = 𝑆 gives 𝑆 = (𝐼 − 𝐴)^{−1}, which matches the claim in the Neumann
series lemma.
This informal argument lacks rigor. To make it rigorous, we must prove (a) that
the sum $\sum_{k \geq 0} A^k$ converges and (b) that the matrix 𝐼 − 𝐴 is invertible.

Lemma 1.2.2. If 𝐵 is any square matrix and ‖·‖ is any submultiplicative matrix norm, then

$$\rho(B)^k \leq \|B^k\| \text{ for all } k \in \mathbb{N} \quad \text{and} \quad \|B^k\|^{1/k} \to \rho(B) \text{ as } k \to \infty.$$

A proof of Lemma 1.2.2 can be found in Chapter 12 of Bollobás (1999). The second
result is sometimes called Gelfand’s formula.
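Both parts of Lemma 1.2.2 can be observed numerically for the operator norm. The sketch below reuses the matrix from Listing 2. The chosen powers 𝑘 are arbitrary, and `opnorm` is the `LinearAlgebra` routine for the operator norm induced by the Euclidean norm.

```julia
using LinearAlgebra

ρ(B) = maximum(abs(λ) for λ in eigvals(B))    # spectral radius, as in (1.15)

B = [0.4 0.1;
     0.7 0.2]

for k in (1, 5, 25, 125)
    Bk = B^k
    println("k = $k:  ρ(B)^k = ", ρ(B)^k,
            "   ‖B^k‖ = ", opnorm(Bk),
            "   ‖B^k‖^(1/k) = ", opnorm(Bk)^(1 / k))
end
println("ρ(B) = ", ρ(B))
```

In each row the first number should not exceed the second, and the third should drift toward 𝜌(𝐵) as 𝑘 grows.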

EXERCISE 1.2.11. Using Lemma 1.2.2, show that

(i) ‖𝐵^𝑘‖ → 0 as 𝑘 → ∞ if and only if 𝜌(𝐵) < 1.
(ii) 𝜌(𝐵) > 1 implies ‖𝐵^𝑘‖ → ∞ as 𝑘 → ∞.

EXERCISE 1.2.12. Prove: If 𝐴 and 𝐵 are square matrices that commute (i.e., 𝐴𝐵 =
𝐵𝐴), then 𝜌(𝐴𝐵) ⩽ 𝜌(𝐴)𝜌(𝐵). [Hint: Show (𝐴𝐵)^𝑘 = 𝐴^𝑘 𝐵^𝑘 and use Gelfand’s formula.]

EXERCISE 1.2.13. Prove: 𝜌(𝐴) < 1 implies that the series $\sum_{k \geq 0} A^k$ converges, in the
sense that every element of the matrix $S_K := \sum_{k=0}^{K} A^k$ converges as 𝐾 → ∞.

From this last result, one can show that (𝐼 − 𝐴)^{−1} exists by computing it:

EXERCISE 1.2.14. Prove this claim by showing that, when $\sum_{k \geq 0} A^k$ exists, the
inverse of 𝐼 − 𝐴 exists and indeed $(I - A)^{-1} = \sum_{k \geq 0} A^k$.³

Listing 3 helps illustrate the result in Exercise 1.2.14, although we truncate the
infinite sum $\sum_{k \geq 0} A^k$ at 50 terms.
The output 5.621e-12 is close enough to zero for many practical purposes.

1.2.2 Nonlinear Systems

While the Neumann series lemma is a powerful tool for solving linear systems, it
doesn’t help us with nonlinear problems. In this section, we present Banach’s fixed
point theorem, one of a variety of techniques for handling nonlinear systems.
(Chapter 2 introduces other methods.)
³ Hint: To prove that 𝐴 is invertible and 𝐵 = 𝐴^{−1}, it suffices to show that 𝐴𝐵 = 𝐼.

using LinearAlgebra

# Primitives
A = [0.4 0.1;
     0.7 0.2]

# Method one: direct inverse
B_inverse = inv(I - A)

# Method two: power series
function power_series(A)
    B_sum = zeros((2, 2))
    A_power = I
    for k in 1:50
        B_sum += A_power
        A_power = A_power * A
    end
    return B_sum
end

# Print maximal error
print(maximum(abs.(B_inverse - power_series(A))))

Listing 3: Matrix inversion vs power series (power_series.jl)

1.2.2.1 Fixed Points

A standard approach to solving an equation is to formulate it as a fixed point problem.
This section provides the basic definitions and some simple results from fixed point
theory.
Let 𝑈 be any nonempty set. We call 𝑇 a self-map on 𝑈 if 𝑇 is a function from 𝑈
into itself. For a self-map 𝑇 on 𝑈 , a point 𝑢∗ ∈ 𝑈 is called a fixed point of 𝑇 in 𝑈 if
𝑇𝑢∗ = 𝑢∗ . (In fixed point theory, it is common to write 𝑇𝑢 for the image of 𝑢 under 𝑇 ,
rather than 𝑇 ( 𝑢).)
Example 1.2.2. Let 𝑈 = R𝑛 and let 𝑇 be defined by 𝑇𝑢 = 𝐴𝑢 + 𝑏, where 𝐴 and 𝑏 are as
in §1.2.1.4. Since 𝑢 is a fixed point of 𝑇 if and only if 𝑢 = 𝐴𝑢 + 𝑏, solving the equation
𝑢 = 𝐴𝑢 + 𝑏 is the same as searching for the fixed point of 𝑇. By the Neumann series
lemma, 𝑇 has unique fixed point 𝑢∗ ≔ (𝐼 − 𝐴)^{−1}𝑏 in 𝑈 whenever 𝜌(𝐴) < 1.
Example 1.2.3. Every 𝑢 in set 𝑈 is fixed under the identity map 𝐼 : 𝑢 ↦→ 𝑢.
Example 1.2.4. If 𝑈 = N and 𝑇𝑢 = 𝑢 + 1, then 𝑇 has no fixed point.

[Figure: graph of 𝑇 on [0, 2] together with the 45 degree line; the three fixed points are labeled 𝑢ℓ, 𝑢𝑚 and 𝑢ℎ.]

Figure 1.5: Graph and fixed points of 𝑇 : 𝑢 ↦ 2.125/(1 + 𝑢^{−4})

Figure 1.5 shows another example, for a self-map 𝑇 on 𝑈 ≔ [0, 2]. Fixed points
are numbers 𝑢 ∈ [0, 2] where 𝑇 meets the 45 degree line. In this case there are three.
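Fixed points of this map are zeros of 𝑢 ↦ 𝑇𝑢 − 𝑢, so they can be located with any root finder. The sketch below uses a hand-written bisection routine (the name `bisect`, the bracket [0.5, 1.5] and the tolerance are assumptions of this example) to find the fixed point lying inside that bracket, and then evaluates 𝑇 at the right endpoint 𝑢 = 2.

```julia
T(u) = 2.125 / (1 + u^(-4))      # the map in Figure 1.5
f(u) = T(u) - u                  # zeros of f are fixed points of T

# Plain bisection on [a, b], assuming f(a) and f(b) have opposite signs
function bisect(f, a, b; tol=1e-10)
    fa = f(a)
    while b - a > tol
        m = (a + b) / 2
        fm = f(m)
        if sign(fm) == sign(fa)
            a, fa = m, fm        # root lies in the right half
        else
            b = m                # root lies in the left half
        end
    end
    return (a + b) / 2
end

u_mid = bisect(f, 0.5, 1.5)      # bracket chosen by inspecting the figure
println("fixed point in [0.5, 1.5] ≈ ", u_mid)
println("T(2.0) = ", T(2.0))     # the right endpoint is also mapped to itself
```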

EXERCISE 1.2.15. Let 𝑈 be any set and let 𝑇 be a self-map on 𝑈. Suppose there
exists a ū ∈ 𝑈 and an 𝑚 ∈ N such that 𝑇^𝑘 𝑢 = ū for all 𝑢 ∈ 𝑈 and 𝑘 ⩾ 𝑚. Prove that,
under this condition, ū is the unique fixed point of 𝑇 in 𝑈.

EXERCISE 1.2.16. Let 𝑇 be a self-map on 𝑈 ⊂ R𝑑. Prove the following: If 𝑇^𝑚 𝑢 → 𝑢∗
as 𝑚 → ∞ for some pair 𝑢, 𝑢∗ ∈ 𝑈 and, in addition, 𝑇 is continuous at 𝑢∗, then 𝑢∗ is a
fixed point of 𝑇.

When considering fixed points, given a self-map 𝑇 on 𝑈, we typically seek
conditions on 𝑇 and 𝑈 under which the following properties hold:

• 𝑇 has at least one fixed point on 𝑈 (existence)
• 𝑇 has at most one fixed point on 𝑈 (uniqueness)
• 𝑇 has a fixed point on 𝑈 and the fixed point can be computed numerically.

1.2.2.2 Global Stability

A self-map 𝑇 on 𝑈 is called globally stable on 𝑈 if 𝑇 has a unique fixed point 𝑢∗ in
𝑈 and 𝑇^𝑘 𝑢 → 𝑢∗ as 𝑘 → ∞ for all 𝑢 ∈ 𝑈. Here 𝑇^𝑘 indicates 𝑘 compositions of 𝑇 with
itself. Global stability is a desirable property in the setting of dynamic programming.
A number of our results rely on it.

EXERCISE 1.2.17. As in Example 1.2.2, let 𝑈 = R𝑛 and let 𝑇 be defined by 𝑇𝑢 =
𝐴𝑢 + 𝑏. Using induction, prove that

$$T^k u = A^k u + A^{k-1} b + A^{k-2} b + \cdots + Ab + b \tag{1.16}$$

for all 𝑢 ∈ 𝑈 and 𝑘 ∈ N. Next, show that 𝑇 is globally stable on 𝑈 whenever 𝜌(𝐴) < 1.

Let 𝑇 be a self-map on 𝑈 ⊂ R𝑛. We call 𝑇 invariant on 𝐶 ⊂ 𝑈 and call 𝐶 an
invariant set if 𝑇 is also a self-map on 𝐶; that is, if 𝑢 ∈ 𝐶 implies 𝑇𝑢 ∈ 𝐶.

EXERCISE 1.2.18. Let 𝑇 be a globally stable self-map on 𝑈 ⊂ R𝑛, with fixed point
𝑢∗. Prove the following: If 𝐶 is closed and 𝑇 is invariant on 𝐶, then 𝑢∗ ∈ 𝐶.

1.2.2.3 Banach’s Fixed Point Theorem

Next we present the Banach fixed point theorem, a workhorse for analyzing nonlinear
operators.
Let 𝑈 be a nonempty subset of R𝑛 and let ‖·‖ be a norm on R𝑛. A self-map 𝑇 on
𝑈 is called a contraction on 𝑈 with respect to ‖·‖ if there exists a 𝜆 < 1 such that

$$\|Tu - Tv\| \leq \lambda \|u - v\| \quad \text{for all } u, v \in U. \tag{1.17}$$

The constant 𝜆 is called the modulus of contraction.


EXERCISE 1.2.19. Let 𝑇 be a contraction on 𝑈 with respect to a norm ‖·‖. Show
that 𝑇 is continuous on 𝑈 and has at most one fixed point in 𝑈.

EXERCISE 1.2.20. Let 𝑈 = R𝑛 and let 𝑇𝑥 = 𝐴𝑥 + 𝑏, where 𝐴 is 𝑛 × 𝑛 and 𝑏 is
𝑛 × 1. Prove that 𝑇 is a contraction of modulus ‖𝐴‖ on 𝑈 (see (1.12) for the definition)
whenever ‖𝐴‖ < 1.

The following theorem features a contraction.


Theorem 1.2.3 (Banach’s contraction mapping theorem). If 𝑈 is closed in R𝑛 and 𝑇
is a contraction of modulus 𝜆 on 𝑈 with respect to some norm ‖·‖ on R𝑛, then 𝑇 has a
unique fixed point 𝑢∗ in 𝑈 and

$$\|T^k u - u^*\| \leq \lambda^k \|u - u^*\| \quad \text{for all } k \in \mathbb{N} \text{ and } u \in U. \tag{1.18}$$

In particular, 𝑇 is globally stable on 𝑈.
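The error bound (1.18) can be checked numerically on the affine map 𝑇𝑢 = 𝐴𝑢 + 𝑏, which by Exercise 1.2.20 is a contraction of modulus ‖𝐴‖ under the Euclidean norm whenever ‖𝐴‖ < 1. The sketch below is an illustration under assumed values: the matrix is the one used in the earlier listings, while `b`, the starting points and the helper name `check_bound` are choices made for this example.

```julia
using LinearAlgebra

A = [0.4 0.1;
     0.7 0.2]
b = [1.0, 2.0]
T(u) = A * u + b

λ = opnorm(A)                  # modulus of contraction under the Euclidean norm (about 0.84)
u_star = (I - A) \ b           # fixed point of T

# Compare the error ‖T^k u − u*‖ with the bound λ^k ‖u − u*‖ from (1.18)
function check_bound(u; n_steps=30)
    e0 = norm(u - u_star)
    ok = true
    for k in 1:n_steps
        u = T(u)
        ok &= norm(u - u_star) <= λ^k * e0 + 1e-12   # slack for rounding
    end
    return ok
end

println(check_bound([10.0, -5.0]))
```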

We prove Theorem 1.2.3 in stages that build on the following exercises.

EXERCISE 1.2.21. Let 𝑈 and 𝑇 have the properties stated in Theorem 1.2.3. Fix
𝑢0 ∈ 𝑈 and let 𝑢𝑚 ≔ 𝑇^𝑚 𝑢0. Show that

$$\|u_m - u_k\| \leq \sum_{i=m}^{k-1} \lambda^i \|u_0 - u_1\|$$

holds for all 𝑚, 𝑘 ∈ N with 𝑚 < 𝑘.

EXERCISE 1.2.22. Using the results in Exercise 1.2.21, prove that (𝑢𝑚) is a Cauchy
sequence in R𝑛. (A sequence (𝑣𝑚) ⊂ R𝑛 is called a Cauchy sequence if, for any 𝜀 > 0,
there exists an 𝑁 ∈ N such that 𝑚, 𝑛 ⩾ 𝑁 implies ‖𝑣𝑚 − 𝑣𝑛‖ < 𝜀.)

A fundamental property of R𝑛 is that if (𝑣𝑚) is a Cauchy sequence in R𝑛, then there
exists a 𝑣̄ ∈ R𝑛 such that (𝑣𝑚) converges to 𝑣̄. (This property is called completeness
of the vector space R𝑛. See, for example, Çınlar and Vanderbei (2013).) Hence it
follows from Exercise 1.2.22 that (𝑢𝑚) has a limit 𝑢∗ ∈ R𝑛.

EXERCISE 1.2.23. Prove that 𝑢∗ ∈ 𝑈 .

Proof of Theorem 1.2.3. The preceding exercises established existence of a point 𝑢∗ ∈
𝑈 such that 𝑇^𝑚 𝑢 → 𝑢∗. The fact that 𝑢∗ is a fixed point of 𝑇 now follows from
Exercise 1.2.16 and Exercise 1.2.19. Uniqueness is implied by Exercise 1.2.19. The
bound (1.18) follows from iteration on the contraction inequality (1.17) while
setting 𝑣 = 𝑢∗. □

EXERCISE 1.2.24. Let 𝑇 be a contraction of modulus 𝛽 on R𝑛 with fixed point ū.
Consider the damped or relaxed iteration scheme $u_{n+1} = (1 - \alpha) u_n + \alpha T u_n$. Show that,
for any choice of 𝑢0, these iterates converge to ū whenever 0 < 𝛼 ⩽ 1.
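The damped scheme is simple to experiment with before attempting a proof. The sketch below applies it to the affine contraction 𝑇𝑢 = 𝐴𝑢 + 𝑏 with the matrix from the earlier listings; `b`, the starting point, the damping values and the helper name `damped_iterate` are assumptions for this illustration, and a fixed number of steps is used instead of a stopping rule.

```julia
using LinearAlgebra

A = [0.4 0.1;
     0.7 0.2]
b = [1.0, 2.0]
T(u) = A * u + b                       # a contraction, since opnorm(A) < 1
u_bar = (I - A) \ b                    # its fixed point

# Damped iteration u ← (1 − α) u + α T u, run for a fixed number of steps
function damped_iterate(T, u, α; n_steps=500)
    for _ in 1:n_steps
        u = (1 - α) * u + α * T(u)
    end
    return u
end

for α in (0.25, 0.5, 1.0)
    u = damped_iterate(T, [0.0, 0.0], α)
    println("α = $α:  distance to fixed point = ", norm(u - u_bar))
end
```

Setting 𝛼 = 1 recovers ordinary iteration with 𝑇, while smaller values of 𝛼 take shorter steps toward 𝑇𝑢.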

1.2.3 Successive Approximation

Consider a self-map 𝑇 on 𝑈 ⊂ R𝑛. We seek algorithms that compute fixed points of 𝑇
whenever they exist.

1.2.3.1 Iteration

If 𝑇 is globally stable on 𝑈, then a natural algorithm for approximating the unique
fixed point 𝑢∗ of 𝑇 in 𝑈 is to pick any 𝑢 ∈ 𝑈 and iterate with 𝑇 for some finite number
of steps:

fix 𝑢0 ∈ 𝑈 and 𝜏 > 0
𝑘 ← 0
𝜀 ← 𝜏 + 1
while 𝜀 > 𝜏 do
    𝑢𝑘+1 ← 𝑇𝑢𝑘
    𝜀 ← ‖𝑢𝑘+1 − 𝑢𝑘‖
    𝑘 ← 𝑘 + 1
end
return 𝑢𝑘

By the definition of global stability, (𝑢𝑘)𝑘⩾0 converges to 𝑢∗. The algorithm just
described is called either successive approximation or fixed point iteration.
Listing 4 provides a function that implements this procedure. Distances between points
are measured with the ℓ∞ norm.
Listing 5 applies successive approximation to the map 𝑇𝑢 = 𝐴𝑢 + 𝑏 using the
function defined in s_approx.jl. Figure 1.6 shows the sequence of iterates generated by
four runs of the successive approximation algorithm, each with a different starting
condition 𝑢0 . The map and parameters are the same as in Listing 5. It is clear from
the figure that a good choice of initial condition (i.e., one that is close to the fixed
point) accelerates convergence.
Of course for 𝑇𝑢 = 𝐴𝑢 + 𝑏 with 𝜌(𝐴) < 1, there is a more direct method to compute
the fixed point: the Neumann series lemma tells us that 𝑢∗ = (𝐼 − 𝐴)^{−1}𝑏, so we can apply
a numerical linear equation solver. However, even for this case, sometimes successive
approximation is used instead. One reason is that the matrix 𝐼 − 𝐴 can be very large
(i.e., high-dimensional), making application of a linear solver problematic. Another is
that we might be satisfied with a quick approximation of the fixed point, computed with
a few iterations of 𝑇. Both of these situations can arise in dynamic programming.

"""
Computes an approximate fixed point of a given operator T
via successive approximation.
"""
function successive_approx(T,                  # operator (callable)
                           u_0;                # initial condition
                           tolerance=1e-6,     # error tolerance
                           max_iter=10_000,    # max iteration bound
                           print_step=25)      # print at multiples
    u = u_0
    error = Inf
    k = 1

    while (error > tolerance) && (k <= max_iter)
        u_new = T(u)
        error = maximum(abs.(u_new - u))   # ℓ∞ distance between successive iterates

        if k % print_step == 0
            println("Completed iteration $k with error $error.")
        end

        u = u_new
        k += 1
    end

    if error <= tolerance
        # k was incremented after the final update, so k - 1 iterations were run
        println("Terminated successfully in $(k - 1) iterations.")
    else
        println("Warning: hit iteration bound.")
    end

    return u
end

Listing 4: Successive approximation (s_approx.jl)



include("s_approx.jl")
using LinearAlgebra

# Compute the fixed point of Tu = Au + b via linear algebra
A, b = [0.4 0.1; 0.7 0.2], [1.0; 2.0]
u_star = (I - A) \ b                    # compute (I - A)^{-1} * b

# Compute the fixed point via successive approximation
T(u) = A * u + b
u_0 = [1.0; 1.0]
u_star_approx = successive_approx(T, u_0)

# Test for approximate equality (prints "true")
print(isapprox(u_star, u_star_approx, rtol=1e-5))

Listing 5: Using successive approximations to compute 𝑢∗ (linear_iter.jl)

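[Figure: iterates of successive approximation in the plane, started from four different initial conditions 𝑢0, with each path converging to the fixed point 𝑢∗.]

Figure 1.6: Successive approximations from different initial conditions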



1.2.3.2 A One-Dimensional Example

To illustrate successive approximations in a nonlinear setting, we use the Solow–Swan
growth model, which is a good place to begin presenting a theory of economic growth.
A fixed point for the Solow–Swan model can be computed with pencil and paper. The
model also provides a good laboratory for studying how successive approximations
might converge to a fixed point.
One version of the Solow–Swan growth dynamics is

$$k_{t+1} = s f(k_t) + (1 - \delta) k_t, \qquad t = 0, 1, \ldots, \tag{1.19}$$

where 𝑘𝑡 is capital stock per worker, 𝑓 : (0, ∞) → (0, ∞) is a production function, 𝑠 > 0
is a saving rate and 𝛿 ∈ (0, 1) is a rate of depreciation. If we set 𝑔(𝑘) ≔ 𝑠𝑓(𝑘) + (1 − 𝛿)𝑘,
then iterating with 𝑔 from a starting point 𝑘0 (i.e., setting $k_{t+1} = g(k_t)$ for all 𝑡 ⩾
0) generates the sequence in (1.19). We can also understand this process as using
successive approximation to compute the fixed point of 𝑔.

EXERCISE 1.2.25. Let 𝑓(𝑘) = 𝐴𝑘^𝛼 with 𝐴 > 0 and 0 < 𝛼 < 1. Show that, while
the Solow–Swan map 𝑔(𝑘) = 𝑠𝐴𝑘^𝛼 + (1 − 𝛿)𝑘 sends 𝑈 ≔ (0, ∞) into itself, 𝑔 is not a
contraction on 𝑈. [Hint: use the definition of the derivative of 𝑔 as a limit and consider
the derivative 𝑔′(𝑘) for 𝑘 close to zero.]

Although the model specified in Exercise 1.2.25 does not generate a contraction,
it is globally stable. The next exercise asks you to prove this.

EXERCISE 1.2.26. Show that, in the setting of Exercise 1.2.25, the unique fixed
point of 𝑔 in 𝑈 is

$$k^* := \left( \frac{sA}{\delta} \right)^{1/(1-\alpha)}.$$

Prove that, for 𝑘 ∈ 𝑈,

(i) 𝑘 ⩽ 𝑘∗ implies 𝑘 ⩽ 𝑔(𝑘) ⩽ 𝑘∗ and
(ii) 𝑘∗ ⩽ 𝑘 implies 𝑘∗ ⩽ 𝑔(𝑘) ⩽ 𝑘.

Conclude that 𝑔 is globally stable on 𝑈. (Why?)

Figure 1.7 illustrates the dynamics in a 45 degree diagram when 𝑓(𝑘) = 𝐴𝑘^𝛼. In
the top subfigure, 𝐴 = 2.0, 𝛼 = 0.3, 𝑠 = 0.3 and 𝛿 = 0.4. The function 𝑔 is plotted
alongside the 45 degree line. When 𝑔(𝑘𝑡) lies strictly above the 45 degree line, then
𝑘𝑡+1 = 𝑔(𝑘𝑡) > 𝑘𝑡 and so capital per worker rises. If 𝑔(𝑘𝑡) < 𝑘𝑡 then it falls. A trajectory
(𝑘𝑡)𝑡⩾0 that is produced by starting from a particular choice of 𝑘0 is traced out in the
figure.

[Figure: 45 degree diagram with 𝑘𝑡 on the horizontal axis and 𝑘𝑡+1 on the vertical axis, both running from 0 to 3. It shows the curve 𝑔(𝑘) = 𝑠𝐴𝑘^𝛼 + (1 − 𝛿)𝑘, the 45 degree line, the fixed point 𝑘∗ = (𝑠𝐴/𝛿)^{1/(1−𝛼)} and a trajectory starting at 𝑘0.]

Figure 1.7: Successive approximation for the Solow–Swan model
The figure illustrates that 𝑘∗ is the unique fixed point of 𝑔 in 𝑈 and all sequences
converge to it. The second statement can be rephrased as: successive approximation
successfully computes the fixed point of 𝑔 by stepping through the time path of capital.
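The convergence shown in the figure can be reproduced in a few lines of Julia. The sketch below uses the parameter values quoted for Figure 1.7 and the closed-form fixed point stated in Exercise 1.2.26. The helper name `trajectory`, the two starting points and the path length are assumptions made for this illustration.

```julia
A, α, s, δ = 2.0, 0.3, 0.3, 0.4          # parameter values quoted for Figure 1.7
g(k) = s * A * k^α + (1 - δ) * k         # Solow–Swan map
k_star = (s * A / δ)^(1 / (1 - α))       # closed-form fixed point from Exercise 1.2.26

# Iterate k_{t+1} = g(k_t) and return the whole trajectory (k_0, ..., k_n)
function trajectory(g, k_0, n)
    k = zeros(n + 1)
    k[1] = k_0
    for t in 1:n
        k[t + 1] = g(k[t])
    end
    return k
end

for k_0 in (0.25, 3.0)
    path = trajectory(g, k_0, 100)
    println("k_0 = $k_0:  k_100 = ", path[end], "   k* = ", k_star)
end
```

Starting below 𝑘∗ the path rises toward it, and starting above 𝑘∗ the path falls toward it, in line with parts (i) and (ii) of Exercise 1.2.26.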

1.2.4 Finite-Dimensional Function Space

In §1.1.2 we introduced a Bellman equation for the infinite horizon job search
problem. The unknown object in the Bellman equation is a function 𝑣∗ defined on the set
W of possible wage offers. Below we discuss how to solve for this unknown function.
Since the set of wage offers is finite we can write W as {𝑤1, . . . , 𝑤𝑛} for some
𝑛 ∈ N. If we adopt this convention and also write 𝑣∗(𝑤𝑖) as $v_i^*$, then we can view 𝑣∗
as a vector $(v_1^*, \ldots, v_n^*)$ in R𝑛. The vector interpretation is useful when coding, since
vectors (numerical arrays) are an efficient data type.
Nevertheless, for mathematical exposition, we usually find it more convenient to
express function-like objects (e.g., value functions) as functions rather than vectors.
Thus, we typically write 𝑣∗(𝑤) instead of $v_i^*$.
Remark 1.2.2. There is a deeper reason that we usually work with functions rather
than vectors: when we shift to general state and action spaces in Volume II, objects
such as value functions can no longer be represented by finite-dimensional vectors.
Instead we must use the language of functional analysis. By adopting this language
now, the leap to general spaces will be smoother, since terminology and notation will
mostly be unchanged.

[Figure: two functions 𝑢 and 𝑣 plotted together with their pointwise maximum 𝑢 ∨ 𝑣 and pointwise minimum 𝑢 ∧ 𝑣.]

Figure 1.8: Functions 𝑢 ∨ 𝑣 and 𝑢 ∧ 𝑣

The next section clarifies our notation with respect to functions and vectors.

1.2.4.1 Pointwise Operations on Functions

If X is any set and 𝑢 maps X to R, then we call 𝑢 a real-valued function on X and write
𝑢 : X → R. Throughout, the symbol R^X denotes the set of all real-valued functions on
X. This is a special case of the symbol 𝐵^𝐴 that represents the set of all functions from
𝐴 to 𝐵, where 𝐴 and 𝐵 are sets.
If 𝑢, 𝑣 ∈ R^X and 𝛼, 𝛽 ∈ R, then the expressions 𝛼𝑢 + 𝛽𝑣 and 𝑢𝑣 also represent
elements of R^X, defined at 𝑥 ∈ X by

$$(\alpha u + \beta v)(x) = \alpha u(x) + \beta v(x) \quad \text{and} \quad (uv)(x) = u(x) v(x). \tag{1.20}$$

Similarly, |𝑢|, 𝑢 ∨ 𝑣, and 𝑢 ∧ 𝑣 are real-valued functions on X defined by

$$|u|(x) = |u(x)|, \quad (u \vee v)(x) = u(x) \vee v(x) \quad \text{and} \quad (u \wedge v)(x) = u(x) \wedge v(x). \tag{1.21}$$

Figure 1.8 illustrates functions 𝑢 ∨ 𝑣 and 𝑢 ∧ 𝑣 when X is a subset of R.



[Figure: two vectors 𝑢 and 𝑣 in the plane, together with 𝑢 ∨ 𝑣 and 𝑢 ∧ 𝑣.]

Figure 1.9: The vectors 𝑢 ∨ 𝑣 and 𝑢 ∧ 𝑣 in R2

Similarly, if $u = (u_i)_{i=1}^n$ and $v = (v_i)_{i=1}^n$ are vectors in R^𝑛, then

\[ |u| := (|u_i|)_{i=1}^n, \quad u \wedge v := (u_i \wedge v_i)_{i=1}^n \quad \text{and} \quad u \vee v := (u_i \vee v_i)_{i=1}^n. \tag{1.22} \]

Figure 1.9 illustrates these operations in R^2.
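The pointwise operations in (1.20)–(1.22) map directly onto broadcasting in Julia. The following is a minimal sketch in which the vectors `u`, `v` and the scalars `α`, `β` are arbitrary values chosen for illustration.

```julia
# Functions on a finite set X stored as vectors of their values.
u = [1.0, -2.0, 3.0]
v = [0.5,  4.0, -1.0]
α, β = 2.0, -1.0

lin_comb = α .* u .+ β .* v     # (αu + βv)(x) = αu(x) + βv(x)
prod_uv  = u .* v               # (uv)(x) = u(x) v(x)
abs_u    = abs.(u)              # |u|(x) = |u(x)|
u_join_v = max.(u, v)           # (u ∨ v)(x) = u(x) ∨ v(x)
u_meet_v = min.(u, v)           # (u ∧ v)(x) = u(x) ∧ v(x)
```

One consequence that is easy to confirm numerically is the identity 𝑢 ∨ 𝑣 + 𝑢 ∧ 𝑣 = 𝑢 + 𝑣, which holds pointwise.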

1.2.4.2 Functions vs Vectors

Let X be finite, so that X = { 𝑥1 , . . . , 𝑥𝑛 } for some 𝑛 ∈ N. The set RX is, in essence, the
vector space R𝑛 expressed in different notation. The next lemma clarifies.

Lemma 1.2.4. If X = { 𝑥1 , . . . , 𝑥𝑛 }, then

R^X ∋ 𝑢 ←→ (𝑢(𝑥1), . . . , 𝑢(𝑥𝑛)) ∈ R^𝑛 (1.23)

is a one-to-one correspondence between the function space RX and the vector space R𝑛 .

The claim in Lemma 1.2.4 is obvious: a real-valued function 𝑢 on X is uniquely
identified by the values that it takes on X, which form an 𝑛-tuple of real numbers.
Throughout the text, whenever the supporting set X is finite, we freely use the
identification in (1.23). For example, if ‖ · ‖ is any norm on R^𝑛, then ‖ · ‖ extends to
R^X via the identification in (1.23). That is, for 𝑢 ∈ R^X, the value ‖𝑢‖ is given by the
norm of the vector (𝑢(𝑥1), . . . , 𝑢(𝑥𝑛)) ∈ R^𝑛.
We say that a subset of R^X is closed (resp., open, compact, etc.) if the corresponding
subset of R^𝑛 is closed (resp., open, compact, etc.).

With these conventions, the Neumann series lemma and Banach’s contraction
mapping theorem extend directly from R^𝑛 to R^X. For example, if |X| = 𝑛, 𝐶 is closed
in R^X and 𝑇 is a contraction on 𝐶 ⊂ R^X, in the sense that 𝑇 : 𝐶 → 𝐶 and

there exists a 𝜆 ∈ [0, 1) s.t. ‖𝑇𝑓 − 𝑇𝑔‖ ⩽ 𝜆‖𝑓 − 𝑔‖ for all 𝑓, 𝑔 ∈ 𝐶,

then 𝑇 has a unique fixed point 𝑓∗ in 𝐶 and

‖𝑇^𝑘 𝑓 − 𝑓∗‖ ⩽ 𝜆^𝑘 ‖𝑓 − 𝑓∗‖ for all 𝑘 ∈ N and 𝑓 ∈ 𝐶.
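To make the contraction bound concrete, here is a small sketch that iterates an affine contraction on R^X with |X| = 3 and checks the geometric error bound under the supremum norm. The map `T`, the helper names `iterate_to_fixed_point` and `bound_holds`, and all numerical values are assumptions made for illustration; this is not the successive approximation routine from Listing 4.

```julia
sup_norm(f) = maximum(abs, f)

"Iterate T from f0 until successive iterates are within tol (hypothetical helper)."
function iterate_to_fixed_point(T, f0; tol=1e-10, max_iter=10_000)
    f = f0
    for _ in 1:max_iter
        f_new = T(f)
        sup_norm(f_new - f) < tol && return f_new
        f = f_new
    end
    return f
end

λ = 0.5
b = [1.0, 2.0, 3.0]
T(f) = λ .* f .+ b              # contraction of modulus λ under the sup norm
f_star = b ./ (1 - λ)           # exact fixed point, for comparison

f0 = zeros(3)
f_approx = iterate_to_fixed_point(T, f0)

"Check that the error after k steps is at most λ^k times the initial error, k = 1..K."
function bound_holds(T, f0, f_star, λ; K=20)
    f = f0
    for k in 1:K
        f = T(f)
        sup_norm(f - f_star) <= λ^k * sup_norm(f0 - f_star) + 1e-12 || return false
    end
    return true
end
```

Calling `bound_holds(T, f0, f_star, λ)` checks the inequality for the first 20 iterates. For this affine example the bound holds with equality, so a small tolerance is added to absorb rounding.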

Incidentally, in the preceding paragraph 𝑇 is a function that sends functions into
functions (e.g., sends 𝑓 into 𝑇𝑓). To help distinguish 𝑇 from the functions that it
acts on, 𝑇 in this setting is often called an operator rather than a function. This is
a convention rather than a formal distinction: from a mathematical perspective, an
operator is just a function.
A foundational class of operators acting on RX is the set of linear operators. There
is a strong sense in which linear operators are just matrices. We investigate these
ideas in §2.3.3. At the same time, when studying dynamic programming we also use
many operators that are not linear. One example is the “Bellman operator,” which we
start to investigate in §1.3.1.2.

1.2.4.3 Distributions

Given a set X with 𝑛 elements, the set of probability distributions on X is written as
D(X) and contains all $\varphi \in \mathbb{R}^{\mathsf X}_+$ with $\sum_{x \in \mathsf X} \varphi(x) = 1$. Since we can identify any 𝑓 ∈ R^X
with a corresponding vector in R^𝑛, the set D(X) can also be thought of as a subset of
R^𝑛. This collection of vectors (i.e., the nonnegative vectors that sum to unity) is also
called the unit simplex. Given X0 ⊂ X and 𝜑 ∈ D(X), we say that 𝜑 is supported on
X0 if 𝜑(𝑥) > 0 implies 𝑥 ∈ X0.
Fix ℎ ∈ R^X and 𝜑 ∈ D(X). Let 𝑋 be a random variable with distribution 𝜑, so that
P{𝑋 = 𝑥} = 𝜑(𝑥) for all 𝑥 ∈ X. The expectation of ℎ(𝑋) is

\[ \mathbb{E} h(X) := \sum_{x \in \mathsf X} h(x) \varphi(x) = \langle h, \varphi \rangle. \]
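Under the vector identification in (1.23), this expectation is an ordinary inner product. A minimal sketch, with hypothetical values for ℎ and 𝜑 on a three-point set:

```julia
using LinearAlgebra   # provides dot

h = [1.0, 4.0, 9.0]          # values h(x) for x in X = {x1, x2, x3}
φ = [0.2, 0.3, 0.5]          # a distribution on X: nonnegative and sums to one

expectation = dot(h, φ)      # E h(X) = Σ_x h(x) φ(x) = ⟨h, φ⟩
```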

EXERCISE 1.2.27. Fix ℎ ∈ R^X. Show that $\varphi^* \in \operatorname{argmax}_{\varphi \in \mathcal D(\mathsf X)} \langle h, \varphi \rangle$ if and only if
𝜑∗ is supported on $\operatorname{argmax}_{x \in \mathsf X} h(x)$.

If X ⊂ R, then the cumulative distribution function (CDF) corresponding to 𝜑 is
the map Φ from X to R given by

\[ \Phi(x) := \mathbb{P}\{X \leq x\} = \sum_{x' \in \mathsf X} \mathbb{1}\{x' \leq x\} \varphi(x'). \]

If 𝜏 ∈ [0, 1], then the 𝜏-th quantile of 𝑋 is

𝑄 𝜏 𝑋 ≔ min{ 𝑥 ∈ X : Φ ( 𝑥 ) ⩾ 𝜏} . (1.24)

If 𝜏 = 1/2, then 𝑄 𝜏 𝑋 is called the median of 𝑋 .

Example 1.2.5. Suppose X = {𝑥1, 𝑥2, 𝑥3} with 𝑥1 < 𝑥2 < 𝑥3. If 𝜑 = (0.5, 0.0, 0.5) and 𝑋 ∼ 𝜑, then
Φ = (0.5, 0.5, 1) and 𝑄_{1/2}(𝑋) = 𝑥1. The min in (1.24) allows us to select a unique
median (even though 𝑥2 is also a reasonable choice).
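The definitions of the CDF and of the quantile in (1.24) translate into a few lines of code. The sketch below assumes that the support is stored in increasing order; the function name `quantile_discrete` is hypothetical, and the fallback to the largest support point only guards against rounding in the cumulative sum.

```julia
"Compute the τ-th quantile of X ~ φ on the sorted support x_vals, following (1.24)."
function quantile_discrete(τ, x_vals, φ)
    Φ = cumsum(φ)                          # CDF evaluated at each support point
    i = findfirst(p -> p >= τ, Φ)          # smallest index with Φ(x) ≥ τ
    # If rounding leaves Φ[end] slightly below τ, fall back to the largest point.
    return x_vals[something(i, length(x_vals))]
end

x_vals = [1.0, 2.0, 3.0]                   # stands in for x1 < x2 < x3
φ = [0.5, 0.0, 0.5]
median_X = quantile_discrete(0.5, x_vals, φ)
```

With the distribution from Example 1.2.5 the function selects the first support point as the median, in line with the example.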

Evidently, if the median of 𝑋 is 𝑥, then the median of 𝑋 + 𝛼 will be 𝑥 + 𝛼. This same
logic carries over to arbitrary quantiles, as the next exercise asks you to show.

EXERCISE 1.2.28. Prove that the quantile function is additive over constants. That
is, for any 𝜏 ∈ [0, 1], random variable 𝑋 on X and 𝛼 ∈ R, we have 𝑄 𝜏 ( 𝑋 + 𝛼) = 𝑄 𝜏 ( 𝑋 ) + 𝛼.

1.3 Infinite-Horizon Job Search

Armed with fixed point methods, we return to the job search problem discussed in
§1.1.2.

1.3.1 Values and Policies

In this section we solve for the value function of an infinite horizon job search problem
and associated optimal choices.

1.3.1.1 Optimal Choices

Let’s recall the strategy for solving the infinite-horizon job search problem we proposed
in §1.1.2. The first step is to compute the optimal value function 𝑣∗ that solves
the Bellman equation

\[ v^*(w) = \max\left\{ \frac{w}{1-\beta},\; c + \beta \sum_{w' \in \mathsf W} v^*(w') \varphi(w') \right\} \qquad (w \in \mathsf W). \tag{1.25} \]

Suppose for a moment that we can compute 𝑣∗, and let

\[ h^* := c + \beta \sum_{w'} v^*(w') \varphi(w') \tag{1.26} \]

be the infinite-horizon continuation value that equals the maximal lifetime value that
the worker can receive, contingent on deciding to continue being unemployed today.
With ℎ∗ in hand, the optimal decision at any given time, facing current wage draw
𝑤 ∈ W, is as follows:

(i) If 𝑤/(1 − 𝛽 ) ⩾ ℎ∗ , then accept the job offer.


(ii) If not, then reject and wait for the next offer.
This decision maximizes lifetime value given the current offer.
(We will prove below that this decision process is optimal as claimed. For now,
however, we focus on computing 𝑣∗ and ℎ∗ .)

1.3.1.2 The Bellman Operator

The method proposed above requires that we solve for 𝑣∗. To do so, we introduce a
Bellman operator 𝑇 defined at 𝑣 ∈ R^W that is constructed to assure that any fixed
point of 𝑇 solves the Bellman equation and vice versa:

\[ (Tv)(w) = \max\left\{ \frac{w}{1-\beta},\; c + \beta \sum_{w' \in \mathsf W} v(w') \varphi(w') \right\} \qquad (w \in \mathsf W). \tag{1.27} \]

Let $V := \mathbb{R}^{\mathsf W}_+$ and let ‖ · ‖∞ be the supremum norm on 𝑉. We measure distance
between two elements 𝑓, 𝑔 of 𝑉 by ‖𝑓 − 𝑔‖∞ = max_{𝑤∈W} |𝑓(𝑤) − 𝑔(𝑤)|. Under this distance,
we have the following result.
Proposition 1.3.1. 𝑇 is a contraction of modulus 𝛽 on 𝑉 .

A proof of Proposition 1.3.1 is given below. An implication of the proposition, via
Banach’s contraction mapping theorem, is that 𝑇^𝑘 𝑣 → 𝑣∗ as 𝑘 → ∞ for any 𝑣 ∈ 𝑉, so
we can compute 𝑣∗ to any required degree of accuracy by successive approximation.

Our proof of Proposition 1.3.1 uses the elementary bound

|𝛼 ∨ 𝑥 − 𝛼 ∨ 𝑦 | ⩽ | 𝑥 − 𝑦 | ( 𝛼, 𝑥, 𝑦 ∈ R) (1.28)

EXERCISE 1.3.1. Verify that (1.28) always holds. [Exercise 1.2.1 might be helpful.]

Proof of Proposition 1.3.1. Take any 𝑓, 𝑔 in 𝑉 and fix any 𝑤 ∈ W. Apply the bound in
(1.28) to get

\[ |(Tf)(w) - (Tg)(w)| \leq \left| c + \beta \sum_{w'} f(w') \varphi(w') - \left( c + \beta \sum_{w'} g(w') \varphi(w') \right) \right| = \beta \left| \sum_{w'} [f(w') - g(w')] \varphi(w') \right|. \]

Apply the triangle inequality to obtain

\[ |(Tf)(w) - (Tg)(w)| \leq \beta \sum_{w'} |f(w') - g(w')| \varphi(w') \leq \beta \| f - g \|_\infty. \]

Taking the supremum over all 𝑤 on the left hand side of this expression leads to

‖𝑇𝑓 − 𝑇𝑔‖∞ ⩽ 𝛽‖𝑓 − 𝑔‖∞.

Since 𝑓 , 𝑔 were arbitrary elements of 𝑉 , the contraction claim is verified. □
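The contraction property can also be probed numerically. The sketch below draws random pairs 𝑓, 𝑔 ∈ 𝑉 and records the ratio ‖𝑇𝑓 − 𝑇𝑔‖∞ / ‖𝑓 − 𝑔‖∞, which Proposition 1.3.1 says cannot exceed 𝛽. The wage grid, the uniform offer distribution and the parameter values are assumptions made for illustration and are not the values in Listing 1.

```julia
using LinearAlgebra, Random

# Hypothetical parameters for illustration (not the values in Listing 1).
β, c = 0.96, 10.0
w_vals = collect(range(10.0, 60.0, length=51))
ϕ = fill(1 / length(w_vals), length(w_vals))     # uniform offer distribution

# The Bellman operator (1.27) for this toy model.
T(v) = [max(w / (1 - β), c + β * dot(v, ϕ)) for w in w_vals]
sup_norm(f) = maximum(abs, f)

Random.seed!(1234)
ratios = map(1:1_000) do _
    f = 2_000 .* rand(length(w_vals))     # random elements of V
    g = 2_000 .* rand(length(w_vals))
    sup_norm(T(f) - T(g)) / sup_norm(f - g)
end
worst_ratio = maximum(ratios)             # should not exceed β
```

A check of this kind is not a proof, but it is a cheap way to catch coding errors in an implementation of 𝑇.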

1.3.1.3 Optimal Policies

A dynamic program seeks optimal policies. We briefly introduce the notion of a policy
and relate it to the job search application.
In general, for a dynamic program, choices by the controller aim to maximize
lifetime rewards and consist of a state-contingent sequence ( 𝐴𝑡 )𝑡⩾0 specifying how
the agent acts at each point in time. Workers do not know what the future will bring,
so it is natural to assume that 𝐴𝑡 can depend on present and past events but not
future ones. Hence 𝐴𝑡 is a function of the current state 𝑋𝑡 and past state-action pairs
( 𝐴𝑡−𝑖 , 𝑋𝑡−𝑖 ) for 𝑖 ⩾ 1. That is,

𝐴𝑡 = 𝜎𝑡 ( 𝑋𝑡 , 𝐴𝑡−1 , 𝑋𝑡−1 , 𝐴𝑡−2 , 𝑋𝑡−2 , . . . , 𝐴0 , 𝑋0 )

for some function 𝜎𝑡 ; 𝜎𝑡 is called a time 𝑡 policy function.



A key insight of dynamic programming is that some problems can be set up so that
the optimal current action can be expressed as a function of the current state 𝑋𝑡 .

Example 1.3.1. In Example 1.0.1, the retailer chooses stock orders and prices in
each period. Every quantity relevant to this decision belongs in the current state. It
might include not just the level of current inventories and various measures of business
conditions, but also information about rates at which inventories have changed over
each of the past six months.

If the current state 𝑋𝑡 is enough to determine a current optimal action, then policies
are just maps from states to actions. So we can write 𝐴𝑡 = 𝜎 ( 𝑋𝑡 ) for some function
𝜎. A policy function that depends only on the current state is often called a Markov
policy. Since all policies we consider will be Markov policies, we refer to them more
concisely as “policies.”

Remark 1.3.1. In the last paragraph, we dropped the time subscript on 𝜎 with no
loss of generality because we can always include the date 𝑡 in the current state (i.e.,
if 𝑌𝑡 is the state without time, then we can set 𝑋𝑡 = (𝑡, 𝑌𝑡)). Whether this is necessary
depends on the problem at hand. For the job search model with finite horizon, the
date matters because opportunities for future earnings decrease with the passage of
time. For the infinite horizon version of the problem, in which an agent always looks
forward toward an infinite horizon, the only current information that matters to the
agent at time 𝑡 is the wage offer 𝑊𝑡 . As a result, the calendar date 𝑡 does not affect the
agent’s decision at time 𝑡 , so there is no need to include time in the state. (In §8.1.3.5,
we will formalize this argument.)

In the job search model, the state is the current wage offer and possible actions are
to accept or to reject the current offer. With 0 interpreted as reject and 1 understood
as accept, the action space is {0, 1}, so a policy is a map 𝜎 from W to {0, 1}. Let Σ be
the set of all such maps.
A policy is an “instruction manual”: for an agent following 𝜎 ∈ Σ, if the current wage
offer is 𝑤, the agent always responds with 𝜎(𝑤) ∈ {0, 1}. The policy dictates whether
the agent accepts or rejects at any given wage.
For each 𝑣 ∈ 𝑉, a 𝑣-greedy policy is a 𝜎 ∈ Σ satisfying

\[ \sigma(w) = \mathbb{1}\left\{ \frac{w}{1-\beta} \geq c + \beta \sum_{w' \in \mathsf W} v(w') \varphi(w') \right\} \quad \text{for all } w \in \mathsf W. \tag{1.29} \]

Equation (1.29) says that an agent accepts if 𝑤/(1 − 𝛽) is at least as large as the
continuation value computed using 𝑣 and rejects otherwise. Our discussion of optimal
choices in §1.3.1.1 can now be summarized as the recommendation

Adopt a 𝑣∗ -greedy policy.

This statement is sometimes called Bellman’s principle of optimality.


Inserting 𝑣∗ into (1.29) and rearranging, we can express a 𝑣∗ -greedy policy via

𝜎∗ ( 𝑤) = 1 {𝑤 ⩾ 𝑤∗ } where 𝑤∗ ≔ (1 − 𝛽 ) ℎ∗ . (1.30)

The quantity 𝑤∗ in (1.30) is called the reservation wage, and parallels the reservation
wage that we introduced for the finite-horizon problem. Equation (1.30) states that
value maximization requires accepting an offer if and only if it is at least as large as
the reservation wage. Thus, 𝑤∗ provides a scalar description of an optimal policy.

1.3.2 Computation

Let’s turn to computation. In §1.3.2.1, we apply a standard dynamic programming
method, called value function iteration. In §1.3.2.2, we apply a more specialized
method that uses the structure of the job search problem to accelerate computation.

1.3.2.1 Value Function Iteration

Recall that, by Proposition 1.3.1, we can compute an approximate optimal policy by
applying successive approximation via the Bellman operator. In the language of dynamic
programming, this is called value function iteration. Algorithm 1.1 provides
a full description.
While 𝑇 𝑘 𝑣 rarely attains 𝑣∗ for 𝑘 < ∞, we can obtain a close approximation by moni-
toring distances between successive iterates, waiting until they become small enough.
Later we will study how these distances depend on 𝑘, the number of iterations, as well
as on parameters defining rewards and opportunities.
Listing 6 implements value function iteration for the infinite-horizon job search
model, using the function for successive approximation from Listing 4.
Figure 1.10 shows a sequence of iterates (𝑇 𝑘 𝑣) 𝑘 when 𝑣 ≡ 0 and parameters are as
given in Listing 1 (page 8). Iterates 0, 1 and 2 are shown, in addition to iterate 1000,
which we take as a good approximation to the limiting function. If you experiment
with different initial conditions, you will see that they all converge to the same limit.
Algorithm 1.1: Value function iteration for job search

input 𝑣0 ∈ 𝑉, an initial guess of 𝑣∗
input 𝜏, a tolerance level for error
𝜀 ← 𝜏 + 1
𝑘 ← 0
while 𝜀 > 𝜏 do
    for 𝑤 ∈ W do
        𝑣𝑘+1(𝑤) ← (𝑇𝑣𝑘)(𝑤)
    end
    𝜀 ← ‖𝑣𝑘 − 𝑣𝑘+1‖∞
    𝑘 ← 𝑘 + 1
end
Compute a 𝑣𝑘-greedy policy 𝜎
return 𝜎

Figure 1.10: A sequence of iterates of the Bellman operator (lifetime value plotted against the wage offer)

include("two_period_job_search.jl")
include("s_approx.jl")

"The Bellman operator."
function T(v, model)
    (; n, w_vals, ϕ, β, c) = model
    return [max(w / (1 - β), c + β * v'ϕ) for w in w_vals]
end

"Get a v-greedy policy."
function get_greedy(v, model)
    (; n, w_vals, ϕ, β, c) = model
    σ = w_vals ./ (1 - β) .>= c .+ β * v'ϕ  # Boolean policy vector
    return σ
end

"Solve the infinite-horizon IID job search model by VFI."
function vfi(model=default_model)
    (; n, w_vals, ϕ, β, c) = model
    v_init = zero(model.w_vals)
    v_star = successive_approx(v -> T(v, model), v_init)
    σ_star = get_greedy(v_star, model)
    return v_star, σ_star
end

Listing 6: Value function iteration (iid_job_search.jl)
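Listing 6 relies on two included files. For readers who want something that runs on its own, the following sketch implements Algorithm 1.1 without those dependencies. The constructor `create_toy_model`, the uniform offer distribution and all parameter values are assumptions made for illustration; they are not the specification in Listing 1, so the numbers produced will differ from those reported in the text.

```julia
using LinearAlgebra

# Hypothetical model: uniform wage offers on a grid; β and c chosen for illustration.
function create_toy_model(; n=51, w_min=10.0, w_max=60.0, β=0.96, c=10.0)
    w_vals = collect(range(w_min, w_max, length=n))
    ϕ = fill(1 / n, n)
    return (; n, w_vals, ϕ, β, c)
end

"The Bellman operator (1.27)."
function T(v, model)
    (; w_vals, ϕ, β, c) = model
    h = c + β * dot(v, ϕ)                  # continuation value implied by v
    return [max(w / (1 - β), h) for w in w_vals]
end

"A v-greedy policy, as in (1.29): true means accept."
function get_greedy(v, model)
    (; w_vals, ϕ, β, c) = model
    return w_vals ./ (1 - β) .>= c + β * dot(v, ϕ)
end

"Value function iteration following Algorithm 1.1, starting from v ≡ 0."
function vfi(model; tol=1e-6, max_iter=10_000)
    v = zeros(model.n)
    for _ in 1:max_iter
        v_new = T(v, model)
        ε = maximum(abs, v_new - v)        # sup-norm distance between iterates
        v = v_new
        ε <= tol && break
    end
    return v, get_greedy(v, model)
end

model = create_toy_model()
v_star, σ_star = vfi(model)
```

The loop stops once successive iterates are within `tol` in the supremum norm, mirroring the stopping rule in Algorithm 1.1. The `max_iter` cap is an added safeguard that the algorithm itself does not include.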


Figure 1.11: The approximate value function for job search (value function, continuation value and 𝑤/(1 − 𝛽))

Figure 1.11 shows an approximation of 𝑣∗ computed using the code in Listing 6,
along with the stopping reward 𝑤/(1 − 𝛽) and the corresponding continuation value
(1.26). As anticipated, the value function is the pointwise maximum of the stopping
reward and the continuation value. The worker chooses to accept an offer only when
that offer exceeds some value close to 43.5.

1.3.2.2 Computing the Continuation Value Directly

The technique we employed to solve the job search model in §1.3.1 follows a standard
approach to dynamic programming. But for this particular problem, there is an easier
way to compute the optimal policy that sidesteps calculating the value function. This
section explains how.
Recall that the value function satisfies the Bellman equation

\[ v^*(w) = \max\left\{ \frac{w}{1-\beta},\; c + \beta \sum_{w'} v^*(w') \varphi(w') \right\} \qquad (w \in \mathsf W), \tag{1.31} \]

and that the continuation value is given by (1.26). We can use ℎ∗ to eliminate 𝑣∗
from (1.31). First we insert ℎ∗ on the right hand side of (1.31) and then we replace
𝑤 with 𝑤′, which gives 𝑣∗(𝑤′) = max{𝑤′/(1 − 𝛽), ℎ∗}. Then we take mathematical
expectations of both sides, multiply by 𝛽 and add 𝑐 to obtain

\[ h^* = c + \beta \sum_{w'} \max\left\{ \frac{w'}{1-\beta},\; h^* \right\} \varphi(w'). \tag{1.32} \]

Figure 1.12: Computing the continuation value as the fixed point of 𝑔

To obtain the unknown value ℎ∗, we introduce the mapping 𝑔 : R+ → R+ defined by

\[ g(h) = c + \beta \sum_{w'} \max\left\{ \frac{w'}{1-\beta},\; h \right\} \varphi(w'). \tag{1.33} \]

By construction, ℎ∗ solves (1.32) if and only if ℎ∗ is a fixed point of 𝑔.

EXERCISE 1.3.2. Show that 𝑔 is a contraction map on R+. Conclude that ℎ∗ is the
unique fixed point of 𝑔 in R+.

Figure 1.12 shows the function 𝑔 using the discrete wage offer distribution and
parameters as adopted previously. The unique fixed point is ℎ∗ .
Exercise 1.3.2 implies that we can compute ℎ∗ by choosing an arbitrary ℎ ∈ R+ and
iterating with 𝑔. Doing so produces a value of approximately 1086. (The associated
reservation wage is 𝑤∗ = (1 − 𝛽)ℎ∗ ≈ 43.4.) Computation of ℎ∗ using this method is
much faster than value function iteration because the fixed point problem is in R+
rather than $\mathbb{R}^n_+$.
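The scalar iteration just described can be sketched as follows. The wage grid, the uniform offer distribution and the values of 𝛽 and 𝑐 are assumptions for illustration rather than the parameterization in Listing 1, so the computed ℎ∗ will not match the figure of 1086 quoted above. The helper name `compute_h_star` is also hypothetical.

```julia
using LinearAlgebra

# Hypothetical parameters (uniform offers), used only for illustration.
β, c = 0.96, 10.0
w_vals = collect(range(10.0, 60.0, length=51))
ϕ = fill(1 / length(w_vals), length(w_vals))

"The map g in (1.33)."
g(h) = c + β * dot(max.(w_vals ./ (1 - β), h), ϕ)

"Iterate g from h0 until successive values differ by at most tol."
function compute_h_star(g; h0=0.0, tol=1e-8, max_iter=10_000)
    h = h0
    for _ in 1:max_iter
        h_new = g(h)
        abs(h_new - h) <= tol && return h_new
        h = h_new
    end
    return h
end

h_star = compute_h_star(g)
w_star = (1 - β) * h_star                  # reservation wage, as in (1.30)
σ_star = w_vals ./ (1 - β) .>= h_star      # greedy policy on the wage grid
v_star = max.(w_vals ./ (1 - β), h_star)   # value function recovered from h*
```

Each step of this iteration costs one inner product, and the object being updated is a single number rather than a vector of length 𝑛.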
With ℎ∗ in hand we have solved the dynamic programming problem, since a policy
𝜎∗ is 𝑣∗-greedy if and only if it satisfies

\[ \sigma^*(w) = \mathbb{1}\left\{ \frac{w}{1-\beta} \geq h^* \right\} \qquad (w \in \mathbb{R}_+). \tag{1.34} \]

EXERCISE 1.3.3. As a computational exercise, compare the value function 𝑣∗ computed via

\[ v^*(w) = \max\left\{ \frac{w}{1-\beta},\; h^* \right\} \]

with our previous result, shown in Figure 1.11. You should find them essentially
identical.

1.4 Chapter Notes

Dynamic programming is often attributed to Richard Bellman (1920–1984). Both
the term “dynamic programming” and the technique were popularized by Bellman
(1957). According to his autobiography, Bellman chose the name dynamic program-
ming to avoid giving the impression that he was conducting mathematical research
within RAND Corporation. His ultimate boss, Secretary of Defense Charles Wilson,
apparently disliked such research (Bellman (1984)).
For treatments of dynamic programming from the perspective of economics and
finance, see, for example, Sargent (1987), Stokey and Lucas (1989), Van and Dana
(2003), Bäuerle and Rieder (2011), or Stachurski (2022).
The job search model was introduced by McCall (1970). The McCall model and
its extensions transformed economists’ way of thinking about labor markets (see,
e.g., Lucas (1978b)). Influential extensions to the job search model include Burdett
(1978), Jovanovic (1979), Pissarides (1979), Jovanovic (1984), Mortensen (1986),
Ljungqvist (2002) and Chetty (2008). Rogerson et al. (2005) provides a useful sur-
vey.
For elementary real analysis, the book by Bartle and Sherbert (2011) is excellent.
Ok (2007) is a superb treatment of real analysis and how it is used throughout eco-
nomic theory. Discussions of Banach’s theorem and the Neumann series lemma can
be found in Cheney (2013) and Atkinson and Han (2005). Martins-da Rocha and
Vailakis (2010) provides an extension to Banach’s theorem that requires only local
contractivity.
Chapter 2

Operators and Fixed Points

This chapter discusses techniques that underlie the optimization and fixed point meth-
ods used throughout the book. Many of these techniques relate to order. Order-
theoretic concepts will prove valuable not only for fixed point methods but also for
understanding the main concepts in dynamic programming. Chapter 8 will show that
core components of dynamic programming can be expressed in terms of simple order-theoretic
constructs.

2.1 Stability

In this section we discuss algorithms for computing fixed points and analyze their
convergence.

2.1.1 Conjugate Maps

First we treat a technique for simplifying the analysis of stability and fixed points
that we will use in applications.
To illustrate the idea, suppose that we want to study dynamics induced by a self-
map 𝑇 on 𝑈 ⊂ R𝑛 . We might want to know if a unique fixed point of 𝑇 exists and if
iterates of 𝑇 converge to a fixed point. One approach is to apply fixed point theory to
𝑇.
However, sometimes there is an easier approach: transform 𝑇 into a “simpler” map
𝑇ˆ and study its fixed point properties. For this to work, we need to be sure that
useful properties we discover about 𝑇ˆ will transmit themselves back to properties of
𝑇, the map that actually interests us.
This section explains a notion of conjugacy that formalizes these ideas. The study
of conjugate relationships originated in the field of dynamical systems theory. Later we
will apply this approach to operators that arise in contexts of dynamic programming
and recursive preferences.

2.1.1.1 Conjugacy

A dynamical system is a pair (𝑈, 𝑇), where 𝑈 is any set and 𝑇 is a self-map on 𝑈.
Two dynamical systems (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) are said to be conjugate under Φ if Φ is a
bijection from 𝑈 to 𝑈ˆ such that 𝑇 = Φ^{−1} ◦ 𝑇ˆ ◦ Φ on 𝑈.
Conjugacy of (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) under Φ can be understood as follows: shifting a
point 𝑢 ∈ 𝑈 to 𝑇𝑢 via 𝑇 is equivalent to moving 𝑢 into 𝑈ˆ via 𝑢ˆ = Φ𝑢, applying 𝑇ˆ, and
then moving the result back using Φ^{−1}:

            𝑇
      𝑢 --------> 𝑇𝑢
      |            ^
    Φ |            | Φ^{−1}
      v            |
      𝑢ˆ --------> 𝑇ˆ𝑢ˆ
            𝑇ˆ

Example 2.1.1. Let 𝐴 be 𝑛 × 𝑛 diagonalizable, meaning that there exists a diagonal
matrix 𝐷 and a matrix 𝑃 such that 𝐴 = 𝑃^{−1}𝐷𝑃. We regard 𝐴 as a self-map on R^𝑛, 𝐷 as
a self-map on C^𝑛, and 𝑃 as a map from R^𝑛 to C^𝑛. The identity 𝐴 = 𝑃^{−1}𝐷𝑃 implies that
the dynamical systems (R^𝑛, 𝐴) and (C^𝑛, 𝐷) are conjugate.

The next two exercises illustrate benefits of establishing a conjugate relationship
between two dynamical systems.

EXERCISE 2.1.1. Show that if (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) are conjugate under Φ, then 𝑢 ∈ 𝑈
is a fixed point of 𝑇 on 𝑈 if and only if Φ𝑢 ∈ 𝑈ˆ is a fixed point of 𝑇ˆ on 𝑈ˆ.

EXERCISE 2.1.2. Extending Exercise 2.1.1, let (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) be dynamical systems
that are conjugate under Φ, and let fix(𝑇) and fix(𝑇ˆ) be the sets of fixed points of 𝑇
and 𝑇ˆ, respectively. Show that Φ is a bijection from fix(𝑇) to fix(𝑇ˆ).

The next result summarizes the most important consequences of our findings.

Proposition 2.1.1. If (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) are dynamical systems that are conjugate under Φ, then

(i) 𝑢 is a fixed point of 𝑇 if and only if Φ𝑢 is a fixed point of 𝑇ˆ,
(ii) 𝑢ˆ is a fixed point of 𝑇ˆ if and only if Φ^{−1}𝑢ˆ is a fixed point of 𝑇, and
(iii) the set of fixed points of 𝑇 and the set of fixed points of 𝑇ˆ have the same cardinality.

In particular, 𝑇 has a unique fixed point on 𝑈 if and only if 𝑇ˆ has a unique fixed
point on 𝑈ˆ.

2.1.1.2 Topological Conjugacy

Let 𝑈 and 𝑈ˆ be two subsets of R^𝑛. A function Φ from 𝑈 to 𝑈ˆ is called a homeomorphism
if it is continuous, bijective, and its inverse Φ^{−1} is also continuous.

Example 2.1.2. The map Φ𝑢 = ln 𝑢 from (0, ∞) to R is a homeomorphism, with
continuous inverse Φ^{−1}𝑦 = exp(𝑦).

Example 2.1.3. Let Φ be an 𝑛 × 𝑛 matrix. We can regard Φ as a map sending column
vector 𝑢 into column vector Φ𝑢. This map is a homeomorphism from R^𝑛 to itself if
and only if Φ is nonsingular.

Assume again that 𝑈 and 𝑈ˆ are subsets of R^𝑛. In this setting, we say that dynamical
systems (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) are topologically conjugate under Φ if (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) are
conjugate under Φ and, in addition, Φ is a homeomorphism.

EXERCISE 2.1.3. Let 𝑈 ≔ (0, ∞) and 𝑈ˆ ≔ R. Let 𝑇𝑢 = 𝐴𝑢^𝛼, where 𝐴 > 0 and 𝛼 ∈ R,
and let 𝑇ˆ𝑢ˆ = ln 𝐴 + 𝛼𝑢ˆ. Show that 𝑇 and 𝑇ˆ are topologically conjugate under Φ ≔ ln.
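Before attempting a proof, it can help to check the conjugacy relation numerically. The sketch below evaluates 𝑇 and Φ^{−1} ◦ 𝑇ˆ ◦ Φ on a grid for one hypothetical choice of 𝐴 and 𝛼 (with 𝛼 ≠ 1, so that the fixed point of 𝑇ˆ is unique). The names `T_hat` and `Φ_inv` are introduced here for illustration only.

```julia
A, α = 2.0, 0.5                     # hypothetical parameter values
T(u)         = A * u^α              # self-map on U = (0, ∞)
T_hat(u_hat) = log(A) + α * u_hat   # self-map on the real line
Φ(u)         = log(u)
Φ_inv(u_hat) = exp(u_hat)

# Conjugacy requires that T agrees with Φ_inv ∘ T_hat ∘ Φ on U; check on a grid.
u_grid = range(0.1, 10.0, length=100)
max_gap = maximum(abs(T(u) - Φ_inv(T_hat(Φ(u)))) for u in u_grid)

# Fixed points correspond: u_hat_star solves u_hat = ln A + α u_hat,
# and mapping it back through Φ_inv gives a fixed point of T.
u_hat_star = log(A) / (1 - α)
u_star = Φ_inv(u_hat_star)
```

The affine map `T_hat` is much easier to analyze than `T`, which is the point of passing to the conjugate system.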

EXERCISE 2.1.4. Consider again the setting of Exercise 2.1.1, but now suppose
that (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) are topologically conjugate under Φ. Fixing 𝑢, 𝑢∗ ∈ 𝑈, show that
lim_{𝑘→∞} 𝑇^𝑘 𝑢 = 𝑢∗ if and only if lim_{𝑘→∞} 𝑇ˆ^𝑘 Φ𝑢 = Φ𝑢∗.

The next exercise asks you to show that topological conjugacy is an equivalence
relation, as defined in §A.1.

EXERCISE 2.1.5. Let U be the set of all dynamical systems (𝑈, 𝑇) with 𝑈 ⊂ R^𝑛.
Show that topological conjugacy is an equivalence relation on U.

From the preceding exercises we can state the following useful result:

Proposition 2.1.2. If (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) are topologically conjugate under Φ, then

(i) 𝑇 is globally stable on 𝑈 if and only if 𝑇ˆ is globally stable on 𝑈ˆ, and
(ii) in this case, the unique fixed points 𝑢∗ ∈ 𝑈 and 𝑢ˆ∗ ∈ 𝑈ˆ satisfy 𝑢ˆ∗ = Φ𝑢∗.

2.1.2 Local Stability

In §1.2.2.2 we investigated global stability. Here we introduce local stability and
provide a sufficient condition for situations in which the map is smooth.
Let 𝑈 be a subset of R𝑛 and let 𝑇 be a self-map on 𝑈 . A fixed point 𝑢∗ of 𝑇 in 𝑈 is
called locally stable for the dynamical system (𝑈, 𝑇 ) if there exists an open set 𝑂 ⊂ 𝑈
such that 𝑢∗ ∈ 𝑂 and 𝑇 𝑘 𝑢 → 𝑢∗ as 𝑘 → ∞ for every 𝑢 ∈ 𝑂. In other words, the domain
of attraction for 𝑢∗ contains an open neighborhood of 𝑢∗ .
Example 2.1.4. Consider the self-map 𝑔 on R defined by 𝑔(𝑥) = 𝑥^2. The fixed point
1 is not locally stable (for example, 𝑔^𝑡(𝑥) → ∞ for any 𝑥 > 1). However, 0 is locally
stable, because −1 < 𝑥 < 1 implies that 𝑔^𝑡(𝑥) → 0 as 𝑡 → ∞.
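The behavior described in Example 2.1.4 is easy to see by iterating the map directly. In the sketch below the helper `iterate_map` and the two starting points are choices made for illustration.

```julia
g(x) = x^2

"Return the k-th iterate of g starting from x."
function iterate_map(g, x, k)
    for _ in 1:k
        x = g(x)
    end
    return x
end

near_zero = iterate_map(g, 0.9, 50)     # starts inside (-1, 1)
near_one  = iterate_map(g, 1.01, 20)    # starts just above the fixed point 1
```

The first trajectory collapses toward 0, while the second leaves every bounded set even though it starts close to the fixed point 1.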

EXERCISE 2.1.6. Returning to the setting of Exercise 2.1.4, let (𝑈, 𝑇) and (𝑈ˆ, 𝑇ˆ) be
topologically conjugate and let 𝑢∗ be a fixed point of 𝑇 in 𝑈. Show that 𝑢∗ is locally
stable for (𝑈, 𝑇) if and only if Φ𝑢∗ is locally stable for (𝑈ˆ, 𝑇ˆ).

For an interior fixed point 𝑥∗ of a smooth self-map 𝑔 on an interval of R, local
stability holds whenever |𝑔′(𝑥∗)| < 1. The proof strategy proceeds as follows: When
|𝑔′(𝑥∗)| < 1, the first-order linear approximation

𝑔ˆ(𝑥) ≔ 𝑔(𝑥∗) + 𝑔′(𝑥∗)(𝑥 − 𝑥∗) = 𝑥∗ + 𝑔′(𝑥∗)(𝑥 − 𝑥∗)

is a contraction of modulus |𝑔′(𝑥∗)| with unique fixed point 𝑥∗. Hence all trajectories
of 𝑔ˆ converge to 𝑥∗. Moreover, since 𝑔 and 𝑔ˆ are similar in a neighborhood of 𝑥∗, the
same is true for trajectories of 𝑔 starting close to 𝑥∗.
The next theorem formalizes this line of argument and extends it to multiple dimensions.
In stating the theorem, we take 𝑇 to be a self-map on 𝑈 with fixed point 𝑢∗
in 𝑈 and assume that 𝑇 is continuously differentiable on 𝑈. Recall that the Jacobian
of 𝑇 at 𝑢 ∈ 𝑈 is

\[ J_T(u) := \begin{pmatrix} \frac{\partial T_1}{\partial u_1}(u) & \cdots & \frac{\partial T_1}{\partial u_n}(u) \\ \vdots & & \vdots \\ \frac{\partial T_n}{\partial u_1}(u) & \cdots & \frac{\partial T_n}{\partial u_n}(u) \end{pmatrix} \quad \text{where} \quad Tu = \begin{pmatrix} T_1 u \\ \vdots \\ T_n u \end{pmatrix}, \]

and let 𝑇ˆ be the first-order approximation to 𝑇 at 𝑢∗:

𝑇ˆ𝑢 = 𝑢∗ + 𝐽𝑇(𝑢∗)(𝑢 − 𝑢∗) (𝑢 ∈ 𝑈).

Theorem 2.1.3 (Hartman–Grobman). If 𝐽𝑇(𝑢∗) is nonsingular and has no eigenvalues
on the unit circle in C, then there exists an open neighborhood 𝑂 of 𝑢∗ such that
(𝑂, 𝑇) and (𝑂, 𝑇ˆ) are topologically conjugate.

Combining this theorem with the result of Exercise 2.1.6, we see that, under the
conditions of the theorem, 𝑢∗ is globally stable for ( 𝑂, 𝑇 ), and hence locally stable for
(𝑈, 𝑇 ), whenever ( 𝑂, 𝑇ˆ) is globally stable. By the Neumann series lemma, the first-
order approximation will be globally stable whenever 𝐽𝑇 ( 𝑢∗ ) has spectral radius less
than one. Thus, we have

Corollary 2.1.4. Under the conditions of Theorem 2.1.3, the fixed point 𝑢∗ is locally
stable whenever 𝜌 ( 𝐽𝑇 ( 𝑢∗ )) < 1.
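The condition in Corollary 2.1.4 can be checked numerically for a given map. The sketch below uses a hypothetical two-dimensional map with fixed point at the origin, approximates its Jacobian by central finite differences, and computes the spectral radius. The map, the helper `jacobian_fd` and the step size are assumptions made for illustration; in practice one might prefer an analytical Jacobian or automatic differentiation.

```julia
using LinearAlgebra

# A hypothetical smooth self-map on ℝ² with fixed point u* = (0, 0).
T(u) = [0.5 * sin(u[1]) + 0.2 * u[2],
        0.1 * u[1] + 0.4 * tanh(u[2])]
u_star = [0.0, 0.0]

"Central finite-difference approximation of the Jacobian (hypothetical helper)."
function jacobian_fd(T, u; h=1e-6)
    n = length(u)
    J = zeros(n, n)
    for j in 1:n
        e = zeros(n)
        e[j] = h
        J[:, j] = (T(u + e) - T(u - e)) / (2h)
    end
    return J
end

"Apply T to u repeatedly, k times."
function iterate_map(T, u, k)
    for _ in 1:k
        u = T(u)
    end
    return u
end

J = jacobian_fd(T, u_star)
ρ = maximum(abs, eigvals(J))                  # spectral radius of the Jacobian
u_limit = iterate_map(T, [0.3, -0.2], 200)    # trajectory started near u*
```

Here the Jacobian at the origin has eigenvalues 0.6 and 0.3, so the spectral radius is below one and trajectories that start near the origin converge to it.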

2.1.3 Convergence Rates

To discuss relative rates of convergence we fix a norm ‖ · ‖ on R^𝑛 and take a sequence
(𝑢𝑘)𝑘⩾0 ⊂ R^𝑛 converging to 𝑢∗ ∈ R^𝑛. Set 𝑒𝑘 ≔ ‖𝑢𝑘 − 𝑢∗‖ for all 𝑘. We say that (𝑢𝑘)
converges to 𝑢∗ at rate at least 𝑞 if 𝑞 ⩾ 1 and, for some 𝛽 ∈ (0, ∞) and 𝑁 ∈ N, we
have

\[ e_{k+1} \leq \beta e_k^q \quad \text{for all } k \geq N. \]

We say that convergence occurs at rate 𝑞 if, in addition,

\[ \limsup_{k \to \infty} \frac{e_{k+1}}{e_k^q} = \beta. \]

In addition,

• If 𝑞 = 2, then we say that convergence is (at least) quadratic.
• If 𝑞 = 1 and 𝛽 < 1, then we say that convergence is (at least) linear.

Example 2.1.5. Let 𝑇 be a contraction of modulus 𝜆 on a closed set 𝑈 ⊂ R^𝑛. If 𝑢∗ is
the unique fixed point of 𝑇 in 𝑈 and 𝑢𝑘 ≔ 𝑇^𝑘 𝑢0, then (𝑢𝑘) converges at least linearly to
𝑢∗, since

𝑒𝑘+1 = ‖𝑢𝑘+1 − 𝑢∗‖ = ‖𝑇𝑢𝑘 − 𝑇𝑢∗‖ ⩽ 𝜆𝑒𝑘.

Orders of convergence are studied in the neighborhood of zero, implying that
higher orders are faster. For example, suppose 𝜀𝑘 ≔ ‖𝑢𝑘 − 𝑢∗‖ is the size of the error
and that 𝑢𝑘 converges to 𝑢∗ quadratically. If, say, 𝜀𝑘 = 10^{−5}, then 𝜀𝑘+1 ≈ 𝛽 × 10^{−10}.
Provided that 𝛽 is not large, the number of accurate digits roughly doubles at each
step.
Successive approximations typically converge at a linear rate. To see this in one
dimension, try the following exercise.

EXERCISE 2.1.7. Let 𝑇 : 𝑈 → 𝑈 where 𝑇 is twice continuously differentiable and 𝑈
is an open interval in R. Suppose that 𝑇 has a fixed point 𝑢∗ ∈ 𝑈 and that 𝑢𝑘 ≔ 𝑇^𝑘 𝑢0
converges to 𝑢∗ as 𝑘 → ∞. Prove that the rate of convergence is linear whenever
0 < |𝑇′𝑢∗| < 1. In completing the proof, you might find it helpful to use the fact that,
by a second order Taylor expansion, there is a 𝑣𝑘 between 𝑢𝑘 and 𝑢∗ such that

\[ T u_k = u^* + T'u^* (u_k - u^*) + \frac{T''v_k}{2} (u_k - u^*)^2. \tag{2.1} \]

(The restriction that 0 < |𝑇′𝑢∗| < 1 in Exercise 2.1.7 is mild. For example, given
convergence of successive approximation to the fixed point, we expect |𝑇′𝑢∗| < 1, since
this inequality implies that 𝑢∗ is locally stable.)

2.1.4 Gradient-Based Methods

While successive approximation always converges when global stability holds, faster
fixed point algorithms can often be obtained by leveraging extra information, such as
gradients. Newton’s method is an important gradient-based technique. (As we discuss
in §5.1.4.2, Newton’s method is a key component of algorithms for solving dynamic
programs.)
While Newton’s method is often used to solve for roots of a given function, here
we use it to find fixed points.

2.1.4.1 Newton Fixed Point Iteration

Suppose first that 𝑇 is a differentiable self-map on an open set 𝑈 ⊂ R^𝑛 and that we
want to find a fixed point of 𝑇. Our plan is to start with a guess 𝑢0 of the fixed point
and then update it to 𝑢1. To do this we use the first-order approximation 𝑇ˆ of 𝑇 around
𝑢0 and solve for the fixed point of 𝑇ˆ, which we can do exactly since 𝑇ˆ is linear. We
take this new point as 𝑢1 and then continue.

Figure 2.1: First step of Newton’s method applied to 𝑇

If 𝑇 is one-dimensional then 𝑇ˆ𝑢 ≔ 𝑇𝑢0 + 𝑇′𝑢0 (𝑢 − 𝑢0). For 𝑛 > 1 we replace 𝑇′𝑢0
with the Jacobian of 𝑇 at 𝑢0, which we write as 𝐽𝑇(𝑢0). We then solve 𝑇ˆ𝑢1 = 𝑢1 for 𝑢1,
which gives

𝑢1 = (𝐼 − 𝐽𝑇(𝑢0))^{−1} (𝑇𝑢0 − 𝐽𝑇(𝑢0)𝑢0) (𝐼 is the 𝑛 × 𝑛 identity).

Figure 2.1 shows 𝑢0 and 𝑢1 when 𝑛 = 1 and 𝑇𝑢 = 1 + 𝑢/( 𝑢 + 1) and 𝑢0 = 0.5. The value
𝑢1 is the fixed point of the first-order approximation 𝑇ˆ. It is closer to the fixed point
of 𝑇 than 𝑢0 , as desired.
Newton’s (fixed point) method continues in the same way, from 𝑢1 to 𝑢2 and so
on, leading to the sequence of points

𝑢𝑘+1 = 𝑄𝑢𝑘 where 𝑄𝑢 ≔ ( 𝐼 − 𝐽𝑇 ( 𝑢)) −1 (𝑇𝑢 − 𝐽𝑇 ( 𝑢) 𝑢) 𝑘 = 0, 1, . . . (2.2)

We need not write a new solver, since the successive approximation function in List-
ing 4 can be applied to 𝑄 defined in (2.2).
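As an illustration of (2.2) in one dimension, the sketch below applies Newton fixed point iteration to the map 𝑇𝑢 = 1 + 𝑢/(𝑢 + 1) from Figure 2.1, starting at 𝑢0 = 0.5, and compares the errors with those of plain successive approximation. The derivative is coded by hand, and the helper `trajectory` is a hypothetical stand-in for the successive approximation routine in Listing 4.

```julia
T(u)  = 1 + u / (u + 1)
dT(u) = 1 / (u + 1)^2                 # derivative of T

# One-dimensional version of Q in (2.2): (Tu - T'(u) u) / (1 - T'(u)).
Q(u) = (T(u) - dT(u) * u) / (1 - dT(u))

"Collect u_0, ..., u_K for the iteration u_{k+1} = F(u_k)."
function trajectory(F, u0, K)
    us = [u0]
    for _ in 1:K
        push!(us, F(us[end]))
    end
    return us
end

u_star = (1 + sqrt(5)) / 2            # exact fixed point of T (solves u² - u - 1 = 0)
newton_errors     = abs.(trajectory(Q, 0.5, 5) .- u_star)
successive_errors = abs.(trajectory(T, 0.5, 5) .- u_star)
```

Comparing the two error vectors entry by entry shows the Newton errors shrinking much faster after the first step or two, which is the quadratic behavior discussed in the next section.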

2.1.4.2 Rates of Convergence

Figure 2.2 shows both the Newton approximation sequence and the successive ap-
proximation sequence applied to computing the fixed point of the Solow–Swan model
from §1.2.3.2. We use two different initial conditions (top and bottom subfigures).
Both sequences converge, but the Newton sequences converge faster.
Figure 2.2: Newton’s method applied to the Solow–Swan update rule (successive approximation and Newton steps from two initial conditions, with fixed point 𝑘∗)

A fast rate of convergence for the Newton scheme can be confirmed theoretically: under
mild conditions, there exists a neighborhood of the fixed point within which the
Newton iterates converge quadratically. See, for example, Theorem 5.4.1 of Atkinson
and Han (2005). Some dynamic programming algorithms take advantage of this fast
rate of convergence (see §5.1.4.3).

2.1.4.3 Speed vs Robustness

Sometimes we can accelerate computations by exploiting a problem’s special structure
(e.g., differentiability, convexity, monotonicity). But we often face a trade-off between
speed and robustness to details of problem specification. More robust methods impose
less structure.
Relative to other algorithms, successive approximation tends to be robust but slow.
We saw one illustration of the relatively slow rate of convergence in Figure 2.2. But
we can also see its relatively strong robustness properties via the same example, by
inspecting Figure 2.3, which compares the update rule of successive approximation
(the function 𝑔) with the update rule for Newton’s method (the function 𝑄 in (2.2)).
Also plotted is the dashed 45 degree line.
The parameterization is the same as for the top subfigure in Figure 1.7. As previously
discussed, the shape of 𝑔 implies global convergence of successive approximation.
In contrast, 𝑄 is well-behaved near the fixed point (i.e., very flat and hence
strongly contractive) but poorly behaved away from the fixed point. This illustrates
that Newton’s method is fast but generally less robust.

Figure 2.3: Robustness of successive approximation vs Newton’s method (𝑔, 𝑄 and the 45 degree line, with 𝑘𝑡+1 plotted against 𝑘𝑡)

2.1.4.4 Parallelization

We have discussed rates of convergence for fixed point methods. Mathematicians
and computer scientists also analyze algorithms via worst case complexity, which
measures the number of fundamental operations (e.g., addition and multiplication of
floating point numbers) when an algorithm acts on data that is least favorable for
good performance. These measures are attractive because they are independent of
the software and hardware platforms on which algorithms are implemented.
Software and hardware matter not just for absolute performance of algorithms but
also for relative performance. For example, although a single update step in successive
approximation can often be partially parallelized, the algorithm is inherently serial,
in the sense that the ( 𝑘 + 1)-th iterate cannot be computed until iterate 𝑘 is available.
Moreover, because the rate of convergence is typically slow (i.e., linear), there can be
many small serial steps. This limits parallelization.
Newton’s method is also serial to some degree, since we are just iterating with a different map (the operator 𝑄 in (2.2)). However, because it involves inverting matrices of possibly high dimension, each step is computationally intensive. At the same time, since the rate of convergence is faster, we have to take fewer steps. In this sense, the algorithm is less serial: it involves a smaller number of more expensive steps. Because it is less serial, Newton’s method offers far more potential for parallelization. Thus, the speed gain associated with Newton’s method can become very large when using effective parallelization.

2.2 Order

This section reviews key concepts from order theory.

2.2.1 Partial Orders

We define partial orders and examine some of their basic properties.

2.2.1.1 Partially Ordered Sets

A partial order on a nonempty set 𝑃 is a relation ⪯ on 𝑃 × 𝑃 that, for any 𝑝, 𝑞, 𝑟 in 𝑃, satisfies

𝑝 ⪯ 𝑝 (reflexivity),
𝑝 ⪯ 𝑞 and 𝑞 ⪯ 𝑝 implies 𝑝 = 𝑞 (antisymmetry), and
𝑝 ⪯ 𝑞 and 𝑞 ⪯ 𝑟 implies 𝑝 ⪯ 𝑟 (transitivity).

The pair (𝑃, ⪯) is called a partially ordered set. For convenience, we sometimes write 𝑃 for (𝑃, ⪯) and 𝑞 ⪰ 𝑝 for 𝑝 ⪯ 𝑞. The statement 𝑝 ⪯ 𝑞 ⪯ 𝑟 means 𝑝 ⪯ 𝑞 and 𝑞 ⪯ 𝑟.

Example 2.2.1. The usual order ⩽ on R is a partial order on R. For example, 𝑎 ⩽ 𝑏 and 𝑏 ⩽ 𝑎 implies 𝑎 = 𝑏.

EXERCISE 2.2.1. Let 𝑃 be any set and consider the relation induced by equality, so that 𝑝 ⪯ 𝑞 if and only if 𝑝 = 𝑞. Show that this relation is a partial order on 𝑃.

EXERCISE 2.2.2. Let 𝑀 be any set. Show that set inclusion ⊂ induces a partial
order on ℘ ( 𝑀 ), the set of all subsets of 𝑀 .
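On a finite set, the three axioms can be checked by brute force. The sketch below is illustrative only (the helper names `powerset`, `is_partial_order` and `is_total` are our own) and is no substitute for the proofs requested in the exercises. It tests set inclusion and equality on the subsets of a three-element set, along with proper inclusion, which fails reflexivity.

```python
from itertools import combinations, product

def powerset(M):
    """All subsets of M, as frozensets."""
    items = list(M)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_partial_order(P, leq):
    """Check reflexivity, antisymmetry and transitivity of leq on finite P."""
    reflexive = all(leq(p, p) for p in P)
    antisymmetric = all(p == q for p, q in product(P, P)
                        if leq(p, q) and leq(q, p))
    transitive = all(leq(p, r) for p, q, r in product(P, P, P)
                     if leq(p, q) and leq(q, r))
    return reflexive and antisymmetric and transitive

def is_total(P, leq):
    """Check whether any two elements of P are comparable under leq."""
    return all(leq(p, q) or leq(q, p) for p, q in product(P, P))

P = powerset({1, 2, 3})
inclusion = lambda a, b: a <= b         # a is a subset of b
equality = lambda a, b: a == b
proper_inclusion = lambda a, b: a < b   # a is a proper subset of b

print(is_partial_order(P, inclusion), is_partial_order(P, equality),
      is_partial_order(P, proper_inclusion), is_total(P, inclusion))
```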
Figure 2.4: Pointwise we have 𝑢 ⩽ 𝑣 and 𝑢 ≪ 𝑣 but not 𝑤 ⩽ 𝑣 (points 𝑢 = (𝑢1, 𝑢2), 𝑣 = (𝑣1, 𝑣2) and 𝑤 = (𝑤1, 𝑤2) plotted in the plane)

Example 2.2.2 (Pointwise partial order). Fix an arbitrary nonempty set X. The point-
wise order ⩽ on the set RX of all functions from X to R is defined as follows:

given 𝑢, 𝑣 in RX , set 𝑢 ⩽ 𝑣 if 𝑢 ( 𝑥 ) ⩽ 𝑣 ( 𝑥 ) for all 𝑥 ∈ X.

EXERCISE 2.2.3. Show that the pointwise order ⩽ is a partial order on RX .

In what follows, for 𝑢, 𝑣 ∈ RX, we write 𝑢 ≪ 𝑣 if 𝑢(𝑥) < 𝑣(𝑥) for all 𝑥 ∈ X.

EXERCISE 2.2.4. Show that the relation ≪ is not a partial order on RX.

The preceding pointwise concepts extend immediately to vectors, since vectors are
just real-valued functions under the identification asserted in Lemma 1.2.4 (page 30).
In particular, for vectors 𝑢 = ( 𝑢1 , . . . , 𝑢𝑛 ) and 𝑣 = ( 𝑣1 , . . . , 𝑣𝑛 ) in R𝑛 , we write

• 𝑢 ⩽ 𝑣 if 𝑢𝑖 ⩽ 𝑣𝑖 for all 𝑖 ∈ [𝑛] and
• 𝑢 ≪ 𝑣 if 𝑢𝑖 < 𝑣𝑖 for all 𝑖 ∈ [𝑛].

Statements 𝑢 ⩾ 𝑣 and 𝑢 ≫ 𝑣 are defined analogously. Figure 2.4 illustrates. Naturally, ⩽ is called the pointwise order on R𝑛.
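In code, the pointwise relations reduce to elementwise comparisons. The following sketch uses NumPy; the vectors are hypothetical values chosen so that 𝑢 ≪ 𝑣 while 𝑣 and 𝑤 are not comparable, in the spirit of Figure 2.4.

```python
import numpy as np

def leq(u, v):
    """Pointwise order: u <= v if u_i <= v_i for all i."""
    return bool(np.all(np.asarray(u) <= np.asarray(v)))

def ll(u, v):
    """Strict pointwise relation: u << v if u_i < v_i for all i."""
    return bool(np.all(np.asarray(u) < np.asarray(v)))

# Hypothetical vectors in R^2
u = np.array([1.0, 2.0])
v = np.array([2.0, 3.0])
w = np.array([3.0, 1.0])

print(leq(u, v), ll(u, v))    # u lies below v in both coordinates
print(leq(w, v), leq(v, w))   # v and w are not comparable
print(ll(u, u))               # << is not reflexive
```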

EXERCISE 2.2.5. Limits in R preserve weak inequalities. Use this property to prove
that the same is true in R𝑛 . In particular, show that, for vectors 𝑎, 𝑏 ∈ R𝑛 and sequence
(𝑢𝑘 ) in R𝑛 with 𝑎 ⩽ 𝑢𝑘 ⩽ 𝑏 for all 𝑘 ∈ N and 𝑢𝑘 → 𝑢 ∈ R𝑛 , we have 𝑎 ⩽ 𝑢 ⩽ 𝑏.

Example 2.2.3 (Pointwise order over matrices). Analogous to vectors, for 𝑛 × 𝑘 matrices 𝐴 = (𝑎𝑖𝑗) and 𝐵 = (𝑏𝑖𝑗), we write

• 𝐴 ⩽ 𝐵 if 𝑎𝑖𝑗 ⩽ 𝑏𝑖𝑗 for all 𝑖, 𝑗.
• 𝐴 ≪ 𝐵 if 𝑎𝑖𝑗 < 𝑏𝑖𝑗 for all 𝑖, 𝑗.

We call ⩽ the pointwise order over matrices.

EXERCISE 2.2.6. Explain why the pointwise order introduced in Example 2.2.3 is
also a special case of the pointwise order over functions.

EXERCISE 2.2.7. Prove the next two facts:

(i) If 𝐵 is 𝑚 × 𝑘 and 𝐵 ⩾ 0, then |𝐵𝑢| ⩽ 𝐵|𝑢| for all 𝑘 × 1 column vectors 𝑢.
(ii) If 𝐴 is 𝑛 × 𝑛 with 𝐴 ⩾ 0 and (𝑢_𝑘) is a sequence in R𝑛 satisfying 𝑢_{𝑘+1} ⩽ 𝐴𝑢_𝑘 for all 𝑘 ⩾ 0, then 𝑢_𝑘 ⩽ 𝐴^𝑘 𝑢_0.

EXERCISE 2.2.8. Let 𝐴 be 𝑛 × 𝑘 and let 𝑢 and 𝑣 be 𝑘-vectors. Prove that 𝐴 ≫ 0, 𝑢 ⩽ 𝑣 and 𝑢 ≠ 𝑣 implies 𝐴𝑢 ≪ 𝐴𝑣.

A partial order ⪯ on 𝑃 is called total if, for all 𝑝, 𝑞 ∈ 𝑃, either 𝑝 ⪯ 𝑞 or 𝑞 ⪯ 𝑝.

Example 2.2.4. The usual order ⩽ on R is a total order, as is the same order on N.

Example 2.2.5. Figure 2.4 shows that the pointwise order ⩽ is not a total order on
R𝑛 . For example, neither 𝑣 ⩽ 𝑤 nor 𝑤 ⩽ 𝑣, since 𝑤1 > 𝑣1 but 𝑤2 < 𝑣2 .

EXERCISE 2.2.9. Is the partial order defined in Exercise 2.2.2 a total order? Either
prove that it is or provide a counterexample.

2.2.1.2 Least and Greatest Elements

Given a partially ordered set (𝑃, ⪯) and 𝐴 ⊂ 𝑃, we say that 𝑔 ∈ 𝑃 is a greatest element of 𝐴 if 𝑔 ∈ 𝐴 and, in addition, 𝑎 ∈ 𝐴 ⟹ 𝑎 ⪯ 𝑔. We call ℓ ∈ 𝑃 a least element of 𝐴 if ℓ ∈ 𝐴 and, in addition, 𝑎 ∈ 𝐴 ⟹ ℓ ⪯ 𝑎.
If 𝐴 is totally ordered, then a greatest element 𝑔 of 𝐴 is also called a maximum of 𝐴, while a least element ℓ of 𝐴 is also called a minimum. See Appendix A for more about maxima and minima.

Remark 2.2.1. Elementary optimization problems have real-valued objectives, which means that we seek maxima and minima. In contrast, the objective in dynamic programming is to maximize a lifetime value function (or minimize a lifetime cost function), a function over a state space. Thus, the objective takes values in a partially ordered set and we seek greatest (or least) elements.

EXERCISE 2.2.10. Let 𝑃 be any partially ordered set and fix 𝐴 ⊂ 𝑃 . Prove that 𝐴
has at most one greatest element and at most one least element.

EXERCISE 2.2.11. Let 𝑀 be a nonempty set and let ℘(𝑀) be the set of all subsets of 𝑀, partially ordered by ⊂. Let {𝐴𝑖} = {𝐴𝑖}_{𝑖∈𝐼} be a subset of ℘(𝑀), where 𝐼 is an arbitrary nonempty index set. Show that 𝑆 ≔ ∪𝑖 𝐴𝑖 is the greatest element of {𝐴𝑖} if and only if 𝑆 ∈ {𝐴𝑖}.

EXERCISE 2.2.12. Adopt the setting of Exercise 2.2.11 and suppose that { 𝐴𝑖 } is the
set of bounded subsets of R𝑛 . Prove that { 𝐴𝑖 } has no greatest element.

2.2.1.3 Sup and Inf

Concepts of suprema and infima on the real line (Appendix A) extend naturally to partially ordered sets. Given a partially ordered set (𝑃, ⪯) and a nonempty subset 𝐴 of 𝑃, we call 𝑢 ∈ 𝑃 an upper bound of 𝐴 if 𝑎 ⪯ 𝑢 for all 𝑎 in 𝐴. Letting 𝑈𝑃(𝐴) be the set of all upper bounds of 𝐴 in 𝑃, we call ū ∈ 𝑃 a supremum of 𝐴 if

ū ∈ 𝑈𝑃(𝐴) and ū ⪯ 𝑢 for all 𝑢 ∈ 𝑈𝑃(𝐴).

Thus, ū is the least element (see §2.2.1.2) of the set of upper bounds 𝑈𝑃(𝐴), whenever it exists.
EXERCISE 2.2.13. Prove that 𝐴 has at most one supremum in 𝑃 .

If 𝑃 ⊂ R and ⪯ is ⩽, then the notion of supremum on a partially ordered set reduces to the elementary definition of the supremum for subsets of the real line discussed in Appendix A.
Letting 𝐴 be a subset of partially ordered space 𝑃,

• the supremum of 𝐴 is typically denoted ⋁𝐴.
• If 𝐴 = {𝑎𝑖}_{𝑖∈𝐼} for some index set 𝐼, we also write ⋁𝐴 as ⋁𝑖 𝑎𝑖.
• If 𝐴 = {𝑎, 𝑏}, then ⋁𝐴 is also written as 𝑎 ∨ 𝑏.

Suprema and greatest elements are clearly related. The next exercise clarifies this.

EXERCISE 2.2.14. Prove the following statements in the setting described above:

(i) If ā = ⋁𝐴 and ā ∈ 𝐴, then ā is a greatest element of 𝐴.
(ii) If 𝐴 has a greatest element ā, then ā = ⋁𝐴.

Remark 2.2.2. In view of Exercise 2.2.14, when 𝐴 has a greatest element, we can refer to it by ⋁𝐴. This notation is used frequently throughout the book.

We call ℓ ∈ 𝑃 a lower bound of 𝐴 if 𝑎 ⪰ ℓ for all 𝑎 in 𝐴. An element ℓ̄ of 𝑃 is called an infimum of 𝐴 if ℓ̄ is a lower bound of 𝐴 and ℓ̄ ⪰ ℓ for every lower bound ℓ of 𝐴. We use analogous notation to denote the infimum. For example, if 𝐴 = {𝑎, 𝑏}, then ⋀𝐴 is also written as 𝑎 ∧ 𝑏.
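For a finite partially ordered set, suprema and infima can be found by direct search over upper and lower bounds. The sketch below is an illustration under our own choice of example: the set {1, . . . , 12} ordered by divisibility, where suprema correspond to least common multiples and may fail to exist inside the set. The function names are hypothetical.

```python
def upper_bounds(P, A, leq):
    """All u in P with a <= u for every a in A."""
    return [u for u in P if all(leq(a, u) for a in A)]

def least_element(S, leq):
    """Return the least element of the finite collection S, or None if absent."""
    for l in S:
        if all(leq(l, s) for s in S):
            return l
    return None

def supremum(P, A, leq):
    """Least element of the set of upper bounds of A in P (None if absent)."""
    return least_element(upper_bounds(P, A, leq), leq)

def infimum(P, A, leq):
    """Infimum of A in P, computed as a supremum under the reversed order."""
    return supremum(P, A, lambda p, q: leq(q, p))

divides = lambda a, b: b % a == 0     # a "precedes" b if a divides b
P = list(range(1, 13))

print(supremum(P, [4, 6], divides))   # least common multiple, inside P
print(infimum(P, [4, 6], divides))    # greatest common divisor
print(supremum(P, [5, 7], divides))   # no upper bound in P, so None
```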

EXERCISE 2.2.15. Let (𝑃, ⪯) be a partially ordered set and let 𝐴 be a subset of 𝑃. Prove that if ℓ is a least element of 𝐴, then ℓ = ⋀𝐴.

EXERCISE 2.2.16. Let 𝑀 be a nonempty set and let ℘(𝑀) be the set of all subsets of 𝑀, partially ordered by ⊂. Let {𝐴𝑖}_{𝑖∈𝐼} be a subset of ℘(𝑀). Prove that ⋁𝑖 𝐴𝑖 = ∪𝑖 𝐴𝑖 and ⋀𝑖 𝐴𝑖 = ∩𝑖 𝐴𝑖.

EXERCISE 2.2.17. Even when 𝑃 is totally ordered, existence of suprema and infima for an abstract partially ordered set (𝑃, ⪯) can fail. Provide an example of a totally ordered set 𝑃 and a subset 𝐴 of 𝑃 that has no supremum in 𝑃.

2.2.2 The Case of Pointwise Order

For us, the pointwise partial order ⩽ introduced in Example 2.2.2 is especially useful.
In this section we review some properties of this order. Throughout, X is an arbitrary
finite set.

2.2.2.1 Suprema and Infima under a Pointwise Order

Given 𝑢, 𝑣 ∈ RX, the symbol 𝑢 ∧ 𝑣 is possibly ambiguous because we used the symbol both for a pointwise minimum in §1.2.4.1 and an infimum of {𝑢, 𝑣} in §2.2.1.3. Fortunately, for elements of the partially ordered set (RX, ⩽), these two definitions coincide. Indeed, if 𝑓(𝑥) ≔ min{𝑢(𝑥), 𝑣(𝑥)} for all 𝑥 ∈ X, then

(i) 𝑓 is a lower bound for {𝑢, 𝑣} in (RX, ⩽), and
(ii) 𝑔 ⩽ 𝑢 and 𝑔 ⩽ 𝑣 implies 𝑔 ⩽ 𝑓.

Hence 𝑓 is the infimum of {𝑢, 𝑣} in (RX, ⩽).

EXERCISE 2.2.18. Prove that the supremum 𝑢 ∨ 𝑣 of {𝑢, 𝑣} in (RX, ⩽) is the pointwise maximum 𝑓(𝑥) ≔ max{𝑢(𝑥), 𝑣(𝑥)}.

A subset 𝑉 of RX is called a sublattice of RX if

𝑢, 𝑣 ∈ 𝑉 implies 𝑢 ∨ 𝑣 ∈ 𝑉 and 𝑢 ∧ 𝑣 ∈ 𝑉 .

Example 2.2.6. The sets

𝑉1 ≔ {𝑓 ∈ RX : 𝑓 ⩾ 0}, 𝑉2 ≔ {𝑓 ∈ RX : 𝑓 ≫ 0} and 𝑉3 ≔ {𝑓 ∈ RX : |𝑓| ⩽ 1}

are all sublattices of RX.

Above we discussed the fact that, for a pair of functions {𝑢, 𝑣}, the supremum in (RX, ⩽) is the pointwise maximum, while the infimum in (RX, ⩽) is the pointwise minimum. The same principle holds for finite collections of functions. Thus, if {𝑣𝑖} ≔ {𝑣𝑖}_{𝑖∈𝐼} is a finite subset of RX, then, for all 𝑥 ∈ X,

$$ \Big( \bigvee_i v_i \Big)(x) := \max_{i \in I} v_i(x) \quad \text{and} \quad \Big( \bigwedge_i v_i \Big)(x) := \min_{i \in I} v_i(x). $$

EXERCISE 2.2.19. Verify these claims.

EXERCISE 2.2.20. Show that if 𝑉 is a sublattice and {𝑣𝑖} is a finite collection of functions in 𝑉, then ⋁𝑖 𝑣𝑖 and ⋀𝑖 𝑣𝑖 are also in 𝑉.

The next example discusses greatest elements in the setting of pointwise order.
Figure 2.5: 𝑣∗ is the upper envelope of {𝑣𝜎}_{𝜎∈Σ} (shown for Σ = {𝜎′, 𝜎″}, with 𝑣∗ = ⋁_{𝜎∈Σ} 𝑣𝜎)

Example 2.2.7. Let X be nonempty and fix 𝑉 ⊂ RX . Let 𝑉 be partially ordered by the
pointwise order ⩽. Let { 𝑣𝜎 } ≔ { 𝑣𝜎 }𝜎∈Σ be a finite subset of 𝑉 and let 𝑣∗ ≔ ∨𝜎 𝑣𝜎 ∈ RX
be the pointwise maximum. If 𝑣∗ ∈ { 𝑣𝜎 }, then 𝑣∗ is the greatest element of { 𝑣𝜎 }. If
not, then { 𝑣𝜎 } has no greatest element.

Figure 2.5 helps illustrate Example 2.2.7. In this case, 𝑣∗ is not in {𝑣𝜎} and {𝑣𝜎} has no greatest element (since neither 𝑣𝜎′ ⩽ 𝑣𝜎″ nor 𝑣𝜎″ ⩽ 𝑣𝜎′).
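The dichotomy in Example 2.2.7 is easy to reproduce with arrays. In the sketch below, functions on a two-point set X are stored as NumPy vectors; the specific values and the helper names are assumptions made for illustration.

```python
import numpy as np

def pointwise_max(vs):
    """Upper envelope of a finite list of functions stored as arrays."""
    return np.max(np.stack(vs), axis=0)

def greatest_element(vs):
    """Return the greatest element of vs under the pointwise order, or None."""
    v_star = pointwise_max(vs)
    for v in vs:
        if np.array_equal(v, v_star):
            return v
    return None

# Two functions that cross, as in Figure 2.5: neither dominates the other
crossing = [np.array([1.0, 3.0]), np.array([2.0, 2.0])]
# Adding the upper envelope itself to the collection creates a greatest element
with_envelope = crossing + [pointwise_max(crossing)]

print(greatest_element(crossing))        # None
print(greatest_element(with_envelope))   # the upper envelope
```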

EXERCISE 2.2.21. Prove the two claims at the end of Example 2.2.7.

Given a partially ordered set (𝑃, ⪯) and 𝑎, 𝑏 ∈ 𝑃, the order interval [𝑎, 𝑏] is defined as all 𝑝 ∈ 𝑃 such that 𝑎 ⪯ 𝑝 ⪯ 𝑏. (If 𝑎 ⪯ 𝑏 fails, the order interval is empty.)

EXERCISE 2.2.22. Let 𝑉 be a sublattice of RX. Show that the intersection of any two order intervals in 𝑉 is an order interval in 𝑉.

2.2.2.2 Inequalities and Identities

In this section we note some useful inequalities and identities related to the pointwise
partial order on RX . As before, X is any finite set.

Lemma 2.2.1. For 𝑓 , 𝑔, ℎ ∈ RX , the following statements are true:

(i) | 𝑓 + 𝑔 | ⩽ | 𝑓 | + | 𝑔 |.
(ii) ( 𝑓 ∧ 𝑔) + ℎ = ( 𝑓 + ℎ) ∧ ( 𝑔 + ℎ) and ( 𝑓 ∨ 𝑔) + ℎ = ( 𝑓 + ℎ) ∨ ( 𝑔 + ℎ).
(iii) ( 𝑓 ∨ 𝑔) ∧ ℎ = ( 𝑓 ∧ ℎ) ∨ ( 𝑔 ∧ ℎ) and ( 𝑓 ∧ 𝑔) ∨ ℎ = ( 𝑓 ∨ ℎ) ∧ ( 𝑔 ∨ ℎ).
(iv) | 𝑓 ∧ ℎ − 𝑔 ∧ ℎ | ⩽ | 𝑓 − 𝑔 |.
(v) | 𝑓 ∨ ℎ − 𝑔 ∨ ℎ | ⩽ | 𝑓 − 𝑔 |.

These results follow immediately from proofs of corresponding claims when 𝑓, 𝑔, ℎ ∈ R. For example, by the usual triangle inequality for scalars, we have |𝑓(𝑥) + 𝑔(𝑥)| ⩽ |𝑓(𝑥)| + |𝑔(𝑥)| for all 𝑥 ∈ X. This is equivalent to the statement |𝑓 + 𝑔| ⩽ |𝑓| + |𝑔| in (i). Similarly, inequality (v) follows directly from a corresponding scalar inequality that was already proved in Exercise 1.3.1, page 34.
A complete proof of Lemma 2.2.1 can be found in Theorem 30.1 of Aliprantis and Burkinshaw (1998).
It is also true that, if 𝑓, 𝑔, ℎ ∈ RX+, then

(𝑓 + 𝑔) ∧ ℎ ⩽ (𝑓 ∧ ℎ) + (𝑔 ∧ ℎ). (2.3)

EXERCISE 2.2.23. Prove: If 𝑎, 𝑏, 𝑐 ∈ R+ , then | 𝑎 ∧ 𝑐 − 𝑏 ∧ 𝑐 | ⩽ | 𝑎 − 𝑏 | ∧ 𝑐.

We note the following useful inequality.

Lemma 2.2.2. Let 𝐷 be a finite set. If 𝑓 and 𝑔 are elements of R𝐷, then

$$ \Big| \max_{z \in D} f(z) - \max_{z \in D} g(z) \Big| \leq \max_{z \in D} |f(z) - g(z)|. \tag{2.4} $$

Proof. Fixing 𝑓, 𝑔 ∈ R𝐷, we have

𝑓 = 𝑓 − 𝑔 + 𝑔 ⩽ |𝑓 − 𝑔| + 𝑔
∴ max 𝑓 ⩽ max(|𝑓 − 𝑔| + 𝑔) ⩽ max |𝑓 − 𝑔| + max 𝑔
∴ max 𝑓 − max 𝑔 ⩽ max |𝑓 − 𝑔|

Reversing the roles of 𝑓 and 𝑔 proves the claim. □
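A quick numerical sanity check of (2.4), together with its counterpart for minima, can be run on random vectors. This is only a sketch with randomly generated data and a small floating point tolerance; it does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(42)

def max_gap(f, g):
    """Left side of (2.4): |max f - max g|."""
    return abs(f.max() - g.max())

def min_gap(f, g):
    """Analogous quantity for minima: |min f - min g|."""
    return abs(f.min() - g.min())

def sup_distance(f, g):
    """Right side of (2.4): max over z of |f(z) - g(z)|."""
    return np.abs(f - g).max()

# Random elements of R^D with |D| = 10
pairs = [(rng.normal(size=10), rng.normal(size=10)) for _ in range(1_000)]
tol = 1e-12
max_ok = all(max_gap(f, g) <= sup_distance(f, g) + tol for f, g in pairs)
min_ok = all(min_gap(f, g) <= sup_distance(f, g) + tol for f, g in pairs)
print(max_ok, min_ok)
```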

The inequality in Lemma 2.2.2 helps with dynamic programming problems that involve maximization. The next exercise concerns minimization.

EXERCISE 2.2.24. Prove that, in the setting of Lemma 2.2.2, we have

$$ \Big| \min_{z \in D} f(z) - \min_{z \in D} g(z) \Big| \leq \max_{z \in D} |f(z) - g(z)|. \tag{2.5} $$

We end this section with a discussion of upper envelopes. To frame the discussion, we take {𝑇𝜎} ≔ {𝑇𝜎}_{𝜎∈Σ} to be a finite family of self-maps on a sublattice 𝑉 of RX. Consider some properties of the operator 𝑇 on 𝑉 defined by

$$ Tv = \bigvee_{\sigma \in \Sigma} T_\sigma v \qquad (v \in V). $$

It follows from the sublattice property that 𝑇 is a self-map on 𝑉. In some sources, 𝑇 is called the upper envelope of the functions {𝑇𝜎}. The following lemma will be useful for dynamic programming.

Lemma 2.2.3. If, for each 𝜎 ∈ Σ, the operator 𝑇𝜎 is a contraction of modulus 𝜆 𝜎 under
the supremum norm, then 𝑇 is a contraction of modulus max𝜎 𝜆 𝜎 under the same norm.

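Proof. Let the stated conditions hold and fix 𝑢, 𝑣 ∈ 𝑉. Applying Lemma 2.2.2, we get

$$ \begin{aligned} \| Tu - Tv \|_\infty &= \max_x \Big| \max_\sigma (T_\sigma u)(x) - \max_\sigma (T_\sigma v)(x) \Big| \\ &\leq \max_x \max_\sigma |(T_\sigma u)(x) - (T_\sigma v)(x)| \\ &= \max_\sigma \max_x |(T_\sigma u)(x) - (T_\sigma v)(x)|. \end{aligned} $$

$$ \therefore \quad \| Tu - Tv \|_\infty \leq \max_\sigma \| T_\sigma u - T_\sigma v \|_\infty \leq \max_\sigma \lambda_\sigma \| u - v \|_\infty. $$

Hence 𝑇 is a contraction of modulus max𝜎 𝜆𝜎 on 𝑉, as claimed. □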
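Lemma 2.2.3 can be illustrated with a family of affine maps. In the sketch below, each 𝑇𝜎 has the assumed form 𝑇𝜎𝑣 = 𝑟𝜎 + 𝛽𝜎 𝑃𝜎 𝑣, where 𝑃𝜎 is a randomly generated stochastic matrix, so that 𝑇𝜎 is a contraction of modulus 𝛽𝜎 under the supremum norm. These primitives are hypothetical and chosen only to make the check concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
betas = [0.5, 0.9, 0.7]          # assumed moduli, one per index sigma

def random_stochastic(n):
    """Random n x n matrix with nonnegative entries and unit row sums."""
    P = rng.random((n, n))
    return P / P.sum(axis=1, keepdims=True)

# Each T_sigma is described by (r_sigma, beta_sigma, P_sigma)
family = [(rng.normal(size=n), b, random_stochastic(n)) for b in betas]

def T_sigma(v, r, beta, P):
    """Affine map v -> r + beta * P v, a contraction of modulus beta."""
    return r + beta * (P @ v)

def T(v):
    """Upper envelope: pointwise maximum of T_sigma v over sigma."""
    return np.max([T_sigma(v, *params) for params in family], axis=0)

def sup_norm(x):
    return np.max(np.abs(x))

u, v = rng.normal(size=n), rng.normal(size=n)
lhs = sup_norm(T(u) - T(v))
rhs = max(betas) * sup_norm(u - v)
print(lhs <= rhs)
```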

2.2.3 Order-Preserving Maps

Order-preserving maps appear throughout the theory of dynamic programming. Here we define them and state a condition for contractivity that requires the order-preserving property.

2.2.3.1 Definition

Given two partially ordered sets (𝑃, ⪯) and (𝑄, ⊴), a map 𝑇 from 𝑃 to 𝑄 is called order-preserving if, given 𝑝, 𝑝′ ∈ 𝑃, we have

𝑝 ⪯ 𝑝′ ⟹ 𝑇𝑝 ⊴ 𝑇𝑝′. (2.6)

𝑇 is called order-reversing if, instead,

𝑝 ⪯ 𝑝′ ⟹ 𝑇𝑝′ ⊴ 𝑇𝑝. (2.7)

Example 2.2.8. Let ⩽ be the pointwise order on R𝑛. If 𝐴 is 𝑛 × 𝑛 with 𝐴 ⩾ 0, then 𝑇 : R𝑛 → R𝑛 defined by 𝑇𝑢 = 𝐴𝑢 + 𝑏 is order-preserving on R𝑛, since 𝑢 ⩽ 𝑣 implies 𝑣 − 𝑢 ⩾ 0, and hence 𝐴(𝑣 − 𝑢) ⩾ 0. But then 𝐴𝑢 ⩽ 𝐴𝑣 and hence 𝑇𝑢 ⩽ 𝑇𝑣.

Example 2.2.9. Given 𝑎 ⩽ 𝑏 in R, let 𝐶[𝑎, 𝑏] be all continuous functions from [𝑎, 𝑏] to R and let ⩽ be the pointwise order on 𝐶[𝑎, 𝑏]. Let

$$ I(f) := \int_a^b f(x) \, dx \qquad (f \in C[a, b]). $$

Since 𝑓 ⩽ 𝑔 implies ∫_𝑎^𝑏 𝑓(𝑥) 𝑑𝑥 ⩽ ∫_𝑎^𝑏 𝑔(𝑥) 𝑑𝑥, the map 𝐼 is order-preserving on 𝐶[𝑎, 𝑏].

EXERCISE 2.2.25. Let 𝑃, 𝑄 be partially ordered sets and let 𝐹 be an order-preserving map from 𝑃 to 𝑄. Suppose that {𝑢𝑖} ⊂ 𝑃 has a greatest and a least element. Prove that, in this setting, both ⋁𝑖 𝐹𝑢𝑖 and ⋀𝑖 𝐹𝑢𝑖 exist in 𝑄, and, moreover,

$$ F \bigvee_i u_i = \bigvee_i F u_i \quad \text{and} \quad F \bigwedge_i u_i = \bigwedge_i F u_i. $$

EXERCISE 2.2.26. Let (𝑃, ⪯) be a partially ordered set and let 𝐴 be an order-preserving self-map on 𝑃. Prove that 𝐴^𝑘 is order-preserving on 𝑃 for any 𝑘 ∈ N.

EXERCISE 2.2.27. Let 𝐴 be 𝑛 × 𝑘 with 𝐴 ⩾ 0. Show that the map 𝑢 ↦ 𝐴𝑢 is order-preserving on R𝑘 under the pointwise order.

EXERCISE 2.2.28. Let 𝐴 and 𝐵 be 𝑛 × 𝑛 with 0 ⩽ 𝐴 ⩽ 𝐵. Prove that 𝐴^𝑘 ⩽ 𝐵^𝑘 for all 𝑘 ∈ N and, in addition, that 𝜌(𝐴) ⩽ 𝜌(𝐵).

2.2.3.2 Increasing and Decreasing Functions

Regarding the definitions in (2.6)–(2.7), when (𝑄, ⊴) = (R, ⩽), it is common to say “increasing” instead of order-preserving, and “decreasing” instead of order-reversing. We adopt this terminology. In particular, given partially ordered set (𝑃, ⪯), we call ℎ ∈ R𝑃

• increasing if 𝑝 ⪯ 𝑝′ implies ℎ(𝑝) ⩽ ℎ(𝑝′) and
• decreasing if 𝑝 ⪯ 𝑝′ implies ℎ(𝑝) ⩾ ℎ(𝑝′).

We use the symbol 𝑖R𝑃 for the set of increasing functions in R𝑃.

Example 2.2.10. If 𝑃 = {1, . . . , 𝑛} and ⪯ is the usual order ⩽ on R, then 𝑥 ↦ 2𝑥 and 𝑥 ↦ 1{2 ⩽ 𝑥} are in 𝑖R𝑃 but 𝑥 ↦ −𝑥 and 𝑥 ↦ 1{𝑥 ⩽ 2} are not.
Remark 2.2.3. Instead of adopting the fancy terms “order-preserving” and “order-
reversing”, why not just use “increasing” and “decreasing”? A short answer is that for
a general partial order the concepts of order-preserving and order-reversing can be
very different from usual notions of increasing and decreasing functions.

EXERCISE 2.2.29. Prove: If 𝑃 is any partially ordered set and 𝑓 , 𝑔 ∈ 𝑖R𝑃 , then
(i) 𝛼 𝑓 + 𝛽𝑔 ∈ 𝑖R𝑃 whenever 𝛼, 𝛽 ⩾ 0.
(ii) 𝑓 ∨ 𝑔 ∈ 𝑖R𝑃 and 𝑓 ∧ 𝑔 ∈ 𝑖R𝑃 .

EXERCISE 2.2.30. Given finite 𝑃 , show that 𝑖R𝑃 is closed in R𝑃 .

EXERCISE 2.2.31. Let 𝑋 be a random variable taking values in finite X. Define ℓ : RX → R by ℓℎ = Eℎ(𝑋). Show that ℓ is increasing when RX has the pointwise order.

The next exercise shows that, in a totally ordered setting, an increasing function
can be represented as the sum of increasing binary functions.

EXERCISE 2.2.32. Let X = {𝑥1, . . . , 𝑥𝑛} where 𝑥𝑘 ⪯ 𝑥𝑘+1 for all 𝑘. Show that, for any 𝑢 ∈ 𝑖RX, there exist 𝑠1, . . . , 𝑠𝑛 in R+ such that 𝑢(𝑥) = Σ_{𝑘=1}^{𝑛} 𝑠𝑘 1{𝑥 ⪰ 𝑥𝑘} for all 𝑥 ∈ X.

As usual, if ℎ : 𝑃 → 𝑄 and 𝑃, 𝑄 ⊂ R, then we will call ℎ

• strictly increasing if 𝑥 < 𝑦 implies ℎ(𝑥) < ℎ(𝑦), and
• strictly decreasing if 𝑥 < 𝑦 implies ℎ(𝑥) > ℎ(𝑦).

2.2.3.3 Blackwell’s Condition

Our discussion of Banach’s theorem in §1.2.2.3 showed the usefulness of contractivity. For an order-preserving operator on a subset of RX, the following condition often simplifies establishing this property. In the statement of the lemma, 𝑈 is a subset of RX, partially ordered by ⩽, and X is finite. Also, 𝑈 has the property that 𝑢 ∈ 𝑈 and 𝑐 ∈ R+ implies 𝑢 + 𝑐 ∈ 𝑈.

Lemma 2.2.4. If 𝑇 is an order-preserving self-map on 𝑈 and there exists a constant 𝛽 ∈ (0, 1) such that

𝑇(𝑢 + 𝑐) ⩽ 𝑇𝑢 + 𝛽𝑐 for all 𝑢 ∈ 𝑈 and 𝑐 ∈ R+, (2.8)

then 𝑇 is a contraction of modulus 𝛽 on 𝑈 with respect to the supremum norm.

Proof. Let 𝑈, 𝑇 have the stated properties and fix 𝑢, 𝑣 ∈ 𝑈. We have

𝑇𝑢 = 𝑇(𝑣 + 𝑢 − 𝑣) ⩽ 𝑇(𝑣 + ‖𝑢 − 𝑣‖∞) ⩽ 𝑇𝑣 + 𝛽‖𝑢 − 𝑣‖∞.

Here the first inequality uses 𝑢 − 𝑣 ⩽ ‖𝑢 − 𝑣‖∞ together with the order-preserving property of 𝑇, and the second uses (2.8). Rearranging gives 𝑇𝑢 − 𝑇𝑣 ⩽ 𝛽‖𝑢 − 𝑣‖∞. Reversing roles of 𝑢 and 𝑣 proves the claim. □
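The two hypotheses of Lemma 2.2.4, and its conclusion, can be checked numerically for a Bellman-style operator. The sketch below assumes an operator of the form (𝑇𝑣)(𝑥) = max over actions 𝑎 of 𝑟(𝑎, 𝑥) + 𝛽 (𝑃𝑎 𝑣)(𝑥), with randomly generated rewards and stochastic matrices. These primitives are hypothetical and serve only to illustrate the condition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, beta = 4, 3, 0.9

# Hypothetical primitives: rewards r[a] and stochastic matrices P[a]
r = rng.normal(size=(n_actions, n))
P = rng.random((n_actions, n, n))
P /= P.sum(axis=2, keepdims=True)       # unit row sums for each action

def T(v):
    """(Tv)(x) = max_a { r[a, x] + beta * (P[a] v)(x) }."""
    return np.max(r + beta * (P @ v), axis=0)

def sup_norm(x):
    return np.max(np.abs(x))

u = rng.normal(size=n)
v = u + rng.random(n)                   # v >= u pointwise
w = rng.normal(size=n)
c = 2.5                                 # a nonnegative constant
tol = 1e-12

order_preserving = bool(np.all(T(u) <= T(v) + tol))
discounting = bool(np.all(T(u + c) <= T(u) + beta * c + tol))   # condition (2.8)
contraction = bool(sup_norm(T(u) - T(w)) <= beta * sup_norm(u - w) + tol)
print(order_preserving, discounting, contraction)
```

For this operator the discounting condition holds with equality, because each 𝑃𝑎 has unit row sums.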

2.2.4 Stochastic Dominance


So far we have discussed partial orders over vectors, functions and sets. It is also useful to have a partial order over distributions that tells us when one distribution is in some sense “larger” than another. In this section we introduce a partial order over some distributions commonly used in economics and finance.
Let’s start with an example. Recall that a random variable 𝑋 is binomial 𝐵(𝑛, 0.5) if it counts the number of heads in 𝑛 flips of a fair coin. Figure 2.6 shows two distributions: 𝜑, the distribution of 𝑋 ∼ 𝐵(10, 0.5), and 𝜓, the distribution of 𝑌 ∼ 𝐵(18, 0.5). Since 𝑌 counts over more flips, we expect it to take larger values in some sense, and we also expect its distribution 𝜓 to reflect this. How can we make these thoughts precise?
A standard order over distributions that captures this idea is defined as follows: Given finite set X partially ordered by ⪯ and 𝜑, 𝜓 ∈ D(X), we say that 𝜓 stochastically dominates 𝜑 and write 𝜑 ⪯_F 𝜓 if

$$ \sum_x u(x) \varphi(x) \leq \sum_x u(x) \psi(x) \quad \text{for every } u \in i\mathbb{R}^{\mathsf{X}}. \tag{2.9} $$

The relation ⪯_F is also called first order stochastic dominance to differentiate it from other forms of stochastic order.
Figure 2.6: Two binomial distributions (𝜑 = 𝐵(10, 0.5) and 𝜓 = 𝐵(18, 0.5))

Example 2.2.11. If 𝜑 and 𝜓 are the binomial distributions defined above and X = {0, . . . , 18}, then 𝜑 ⪯_F 𝜓 holds. Indeed, if 𝑊1, . . . , 𝑊18 are IID binary random variables with P{𝑊𝑖 = 1} = 0.5 for all 𝑖, then 𝑋 ≔ Σ_{𝑖=1}^{10} 𝑊𝑖 has distribution 𝜑 and 𝑌 ≔ Σ_{𝑖=1}^{18} 𝑊𝑖 has distribution 𝜓. In addition, 𝑋 ⩽ 𝑌 with probability one (i.e., for any outcome of the draws 𝑊1, . . . , 𝑊18). It follows that, for any given 𝑢 ∈ 𝑖RX, we have 𝑢(𝑋) ⩽ 𝑢(𝑌) with probability one. Hence E𝑢(𝑋) ⩽ E𝑢(𝑌) holds, which is the same statement as (2.9).

A good way to interpret first order stochastic dominance is to suppose that an agent has preferences over outcomes in X described by a utility function 𝑢 ∈ RX. Suppose in addition that the agent prefers more to less, in the sense that 𝑢 ∈ 𝑖RX, and that the agent ranks lotteries over X according to expected utility, so that the agent evaluates 𝜑 ∈ D(X) according to Σ𝑥 𝑢(𝑥)𝜑(𝑥). Then the agent (weakly) prefers 𝜓 to 𝜑 whenever 𝜑 ⪯_F 𝜓.
We can say more. Consider the class A of all agents who (a) have preferences over outcomes in X, (b) prefer more to less, and (c) rank lotteries over X according to expected utility. Then 𝜑 ⪯_F 𝜓 if and only if every agent in A prefers 𝜓 to 𝜑.

Remark 2.2.4. The last paragraph helps explain the pervasiveness of stochastic dom-
inance in economics. It is standard to assume that economic agents have increasing
utility functions and use expected utility to rank lotteries. In such environments, an
upward shift in a lottery, as measured by stochastic dominance, makes all agents bet-
ter off.
EXERCISE 2.2.33. A simple setting in which we can study stochastic dominance is where X = {1, 2} and X is partially ordered by ⩽. In this case, 𝜑 ⪯_F 𝜓 if and only if 𝜑 puts more mass on 1 than 𝜓, and, equivalently, less mass on 2. That is,

𝜑 ⪯_F 𝜓 ⟺ 𝜓(1) ⩽ 𝜑(1) ⟺ 𝜑(2) ⩽ 𝜓(2).

Verify the equivalence of these statements.

To state another useful perspective on stochastic dominance, we introduce the notation

𝐺𝜑(𝑦) ≔ Σ_{𝑥 ⪰ 𝑦} 𝜑(𝑥)    (𝜑 ∈ D(X), 𝑦 ∈ X).

For a given distribution 𝜑, the function 𝐺𝜑 is sometimes called the counter CDF (counter cumulative distribution function) of 𝜑.

Lemma 2.2.5. For each 𝜑, 𝜓 ∈ D(X), the following statements hold:

(i) 𝜑 ⪯_F 𝜓 ⟹ 𝐺𝜑 ⩽ 𝐺𝜓.
(ii) If X is totally ordered by ⪯, then 𝐺𝜑 ⩽ 𝐺𝜓 ⟹ 𝜑 ⪯_F 𝜓.

The proof is given on page 345. Figure 2.7 helps to illustrate. Here X ⊂ R and 𝜑 and 𝜓 are distributions on X. We can see that 𝜑 ⪯_F 𝜓 because the counter CDFs are ordered in the sense that 𝐺𝜑 ⩽ 𝐺𝜓 pointwise on X.
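For the binomial distributions of Example 2.2.11, the counter CDF criterion is straightforward to verify numerically. The following sketch builds both distributions on X = {0, . . . , 18} from the binomial formula and compares their counter CDFs, then checks (2.9) for two increasing functions of our own choosing. The function names are illustrative.

```python
from math import comb
import numpy as np

def binomial_pmf(n, p, size):
    """Binomial(n, p) probabilities on {0, ..., size - 1} (zero above n)."""
    pmf = np.zeros(size)
    for k in range(n + 1):
        pmf[k] = comb(n, k) * p**k * (1 - p)**(n - k)
    return pmf

def counter_cdf(phi):
    """G(y) = sum of phi(x) over x >= y, for y = 0, ..., len(phi) - 1."""
    return np.cumsum(phi[::-1])[::-1]

x = np.arange(19)                      # the state space {0, ..., 18}
phi = binomial_pmf(10, 0.5, 19)
psi = binomial_pmf(18, 0.5, 19)

G_phi, G_psi = counter_cdf(phi), counter_cdf(psi)
dominates = bool(np.all(G_phi <= G_psi + 1e-12))

# Expected values of two increasing functions under each distribution
u_sqrt = np.sqrt(x)
u_step = (x >= 7).astype(float)
print(dominates, phi @ u_sqrt <= psi @ u_sqrt, phi @ u_step <= psi @ u_step)
```

Since X is totally ordered here, part (ii) of Lemma 2.2.5 says that the ordering of the counter CDFs is enough to conclude that 𝜓 stochastically dominates 𝜑.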

Lemma 2.2.6. Stochastic dominance is a partial order on D(X).

EXERCISE 2.2.34. Prove the transitivity component of Lemma 2.2.6, i.e., prove that ⪯_F is transitive on D(X).

EXERCISE 2.2.35. Fix 𝜏 ∈ (0, 1] and let 𝑄𝜏 be the quantile function defined on page 32. Choose 𝜑, 𝜓 ∈ D(X) and let 𝑋, 𝑌 be X-valued random variables with distributions 𝜑 and 𝜓 respectively. Prove that 𝜑 ⪯_F 𝜓 implies 𝑄𝜏(𝑋) ⩽ 𝑄𝜏(𝑌).

2.2.5 Parametric Monotonicity

We are often interested in whether a change in a parameter shifts an outcome up or down. For example, a parameter might appear in a central bank decision rule for pegging an interest rate, and we want to know whether increasing that parameter will increase steady state inflation. By providing sufficient conditions for monotone shifts in fixed points, results in this section can help answer such questions.

Figure 2.7: Visualization of 𝜑 ⪯_F 𝜓 (two panels over [−3, 3]; the top panel shows the distributions 𝜑 and 𝜓)

Let (𝑃, ⪯) be a partially ordered set. Given two self-maps 𝑆 and 𝑇 on a set 𝑃, we write 𝑆 ⪯ 𝑇 if 𝑆𝑢 ⪯ 𝑇𝑢 for every 𝑢 ∈ 𝑃 and say that 𝑇 dominates 𝑆 on 𝑃.

Example 2.2.12. Let 𝑃 = (R𝑛+, ⩽), let 𝑆𝑢 = 𝐴𝑢 + 𝑏 and 𝑇𝑢 = 𝐵𝑢 + 𝑏, where 𝑏 ∈ 𝑃 and 𝐴 and 𝐵 are 𝑛 × 𝑛 with 0 ⩽ 𝐴 ⩽ 𝐵. For any 𝑢 ∈ 𝑃, we have 𝐴𝑢 ⩽ 𝐵𝑢. Hence 𝑆𝑢 ⩽ 𝑇𝑢 and 𝑇 dominates 𝑆 on 𝑃.

EXERCISE 2.2.36. Let (𝑃, ⪯) be a partially ordered set and let 𝑆 and 𝑇 be order-preserving self-maps such that 𝑆 ⪯ 𝑇. Show that 𝑆^𝑘 ⪯ 𝑇^𝑘 holds for all 𝑘 ∈ N.

EXERCISE 2.2.37. Let (𝑃, ⪯) be a partially ordered set, let S be the set of all self-maps on 𝑃 and, as above, write 𝑆 ⪯ 𝑇 if 𝑇 dominates 𝑆 on 𝑃. Show that ⪯ is a partial order on S.

One might assume that, in a setting where 𝑇 dominates 𝑆, the fixed points of 𝑇
will be larger. This can hold, as in Figure 2.8, but it can also fail, as in Figure 2.9.
A difference between these two situations is that in Figure 2.8 the map 𝑇 is globally
stable. This leads us to our next result.
Figure 2.8: Ordered fixed points when global stability holds (fixed points 𝑢𝑆 and 𝑢𝑇 marked)

Figure 2.9: Reverse-ordered fixed points when global stability fails (maps 𝑇 and 𝑆 with fixed points 𝑢𝑆 and 𝑢𝑇 marked)

Proposition 2.2.7. Let 𝑆 and 𝑇 be self-maps on 𝑀 ⊂ R𝑛 and let ⩽ be the pointwise order.
If 𝑇 dominates 𝑆 on 𝑀 and, in addition, 𝑇 is order-preserving and globally stable on 𝑀 ,
then its unique fixed point dominates any fixed point of 𝑆.

Proof of Proposition 2.2.7. Assume the conditions of the proposition and let 𝑢𝑇 be the unique fixed point of 𝑇. Let 𝑢𝑆 be any fixed point of 𝑆. Since 𝑆 ⩽ 𝑇, we have 𝑢𝑆 = 𝑆𝑢𝑆 ⩽ 𝑇𝑢𝑆. Applying 𝑇 to both sides of this inequality and using the order-preserving property of 𝑇 and transitivity of ⩽ gives 𝑢𝑆 ⩽ 𝑇^2 𝑢𝑆. Continuing in this fashion yields 𝑢𝑆 ⩽ 𝑇^𝑘 𝑢𝑆 for all 𝑘 ∈ N. Since 𝑇 is globally stable, 𝑇^𝑘 𝑢𝑆 → 𝑢𝑇 as 𝑘 → ∞. Taking the limit in 𝑘 and using the fact that ⩽ is closed under limits (Exercise 2.2.5) gives 𝑢𝑆 ⩽ 𝑢𝑇. □

As an application of Proposition 2.2.7, consider again the Solow–Swan growth model 𝑘_{𝑡+1} = 𝑔(𝑘_𝑡) ≔ 𝑠𝑓(𝑘_𝑡) + (1 − 𝛿)𝑘_𝑡. We saw in §1.2.3.2 that if 𝑓(𝑘) = 𝐴𝑘^𝛼 where 𝐴 > 0 and 𝛼 ∈ (0, 1), then 𝑔 is globally stable on 𝑀 ≔ (0, ∞). Clearly 𝑘 ↦ 𝑔(𝑘) is order-preserving on 𝑀. If we now increase, say, the savings rate 𝑠, then 𝑔 will be shifted up everywhere, implying, via Proposition 2.2.7, that the fixed point also rises. Exercise 2.2.38 asks you to step through the details.

EXERCISE 2.2.38. Let 𝑔(𝑘) = 𝑠𝐴𝑘^𝛼 + (1 − 𝛿)𝑘 where all parameters are strictly positive, 𝛼 ∈ (0, 1) and 𝛿 ⩽ 1. Let 𝑘∗(𝑠, 𝐴, 𝛼, 𝛿) be the unique fixed point of 𝑔 in 𝑀. Without using the expression we derived for 𝑘∗ previously (Exercise 1.2.26), show that

(i) 𝑘∗(𝑠, 𝐴, 𝛼, 𝛿) is increasing in 𝑠 and 𝐴.
(ii) 𝑘∗(𝑠, 𝐴, 𝛼, 𝛿) is decreasing in 𝛿.

Figure 2.10 helps illustrate the results of Exercise 2.2.38. The top left sub-figure
shows a baseline parameterization, with 𝐴 = 2.0, 𝑠 = 𝛼 = 0.3 and 𝛿 = 0.4. The
other sub-figures show how the steady state changes as parameters deviate from that
baseline.
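The comparative statics in Exercise 2.2.38 can also be observed numerically. The sketch below computes the fixed point by successive approximation, relying on global stability rather than the closed-form expression, and uses the baseline and alternative parameter values listed for Figure 2.10. The function name and tolerance settings are our own choices.

```python
def fixed_point(s=0.3, A=2.0, alpha=0.3, delta=0.4,
                k0=1.0, tol=1e-10, max_iter=10_000):
    """Fixed point of g(k) = s A k^alpha + (1 - delta) k via successive approximation."""
    k = k0
    for _ in range(max_iter):
        k_new = s * A * k**alpha + (1 - delta) * k
        if abs(k_new - k) < tol:
            return k_new
        k = k_new
    raise RuntimeError("successive approximation did not converge")

baseline = fixed_point()              # A = 2.0, s = alpha = 0.3, delta = 0.4
higher_A = fixed_point(A=2.5)         # productivity shifts g up
lower_s = fixed_point(s=0.2)          # lower savings shifts g down
higher_delta = fixed_point(delta=0.6) # faster depreciation shifts g down

print(baseline, higher_A, lower_s, higher_delta)
```

In each case the direction of the shift in the fixed point matches the direction of the shift in 𝑔, as Proposition 2.2.7 predicts.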
EXERCISE 2.2.39. In (1.33) on page 40, we defined a map 𝑔 such that the opti-
mal continuation value ℎ∗ is a fixed point. Using this construction, prove that ℎ∗ is
increasing in 𝛽 .

Figure 2.11 gives an illustration of the result in Exercise 2.2.39. Here an increase in
𝛽 leads to a larger continuation value. This seems reasonable, since larger 𝛽 indicates
more concern about outcomes in future periods.
While the preceding examples of parametric monotonicity are all one-dimensional, we will soon see that Proposition 2.2.7 can also be applied in high-dimensional settings.

Figure 2.10: Parametric monotonicity for the Solow–Swan model (four panels: default, 𝐴 = 2.5, 𝑠 = .2, 𝛿 = .6)

Figure 2.11: Parametric monotonicity in 𝛽 for the continuation value (curves 𝑔1 with 𝛽1 = 0.95 and 𝑔2 with 𝛽2 = 0.96, the 45 degree line, and fixed points ℎ∗1 and ℎ∗2)

2.3 Matrices and Operators


Many aspects of dynamic programming are most clearly framed using operator the-
ory. In this section, we discuss linear operators and their connections to matrices.
We emphasize nonnegative matrices and so-called positive linear operators that arise
naturally in dynamic programming.

2.3.1 Nonnegative Matrices

We begin by reviewing basic properties of nonnegative matrices. The Perron–Frobenius theorem is a key result.

2.3.1.1 Nonnegative Matrices and their Powers

We call a matrix 𝐴 nonnegative and write 𝐴 ⩾ 0 if all elements of 𝐴 are nonnegative. We call 𝐴 everywhere positive and write 𝐴 ≫ 0 if all elements of 𝐴 are strictly positive. A square matrix 𝐴 is called irreducible if 𝐴 ⩾ 0 and Σ_{𝑘=1}^{∞} 𝐴^𝑘 ≫ 0. An interpretation in terms of connected networks is given in Chapter 1 of Sargent and Stachurski (2023b).
Let 𝐴 be 𝑛 × 𝑛. It is not always true that the spectral radius 𝜌(𝐴) is an eigenvalue of 𝐴.¹ However, when 𝐴 ⩾ 0, the spectral radius is always an eigenvalue. The following theorem states this result and several extensions.
Theorem 2.3.1 (Perron–Frobenius). If 𝐴 ⩾ 0, then 𝜌(𝐴) is an eigenvalue of 𝐴 with nonnegative, real-valued right and left eigenvectors. In particular, we can find a nonnegative, nonzero column vector 𝑒 and a nonnegative, nonzero row vector 𝜀 such that

𝐴𝑒 = 𝜌(𝐴)𝑒 and 𝜀𝐴 = 𝜌(𝐴)𝜀. (2.10)

If 𝐴 is irreducible, then the right and left eigenvectors are everywhere positive and unique. Moreover, if 𝐴 is everywhere positive, then with 𝑒 and 𝜀 normalized so that ⟨𝜀, 𝑒⟩ = 1, we have

𝜌(𝐴)^{−𝑡} 𝐴^𝑡 → 𝑒𝜀 (𝑡 → ∞). (2.11)

The convergence in (2.11) provides a sharp characterization of large powers of 𝐴 that will prove useful in what follows. The assumption that 𝐴 is everywhere positive can be weakened without affecting this convergence. A complete statement and full proof of the Perron–Frobenius theorem can be found in Meyer (2000).

¹ For example, eigenvalues of 𝐴 = diag(−1, 0) are {−1, 0}. Hence 𝜌(𝐴) = |−1| = 1, which is not an eigenvalue of 𝐴.
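The convergence in (2.11) can be observed directly for a small everywhere positive matrix. The sketch below uses an arbitrary 2 × 2 example of our own and NumPy’s `numpy.linalg.eig`; the eigenvectors returned by the solver are rescaled so that ⟨𝜀, 𝑒⟩ = 1.

```python
import numpy as np

# An arbitrary everywhere positive matrix (illustrative values)
A = np.array([[0.5, 0.4],
              [0.3, 0.9]])

# Right eigenvector for the dominant eigenvalue
eigvals, right = np.linalg.eig(A)
i = np.argmax(np.abs(eigvals))
rho = eigvals[i].real                  # spectral radius
e = np.abs(right[:, i].real)           # fix the sign so that e is positive

# Left eigenvector: a right eigenvector of the transpose
eigvals_l, left = np.linalg.eig(A.T)
j = np.argmax(np.abs(eigvals_l))
eps = np.abs(left[:, j].real)
eps = eps / (eps @ e)                  # normalize so that <eps, e> = 1

limit = np.outer(e, eps)               # the matrix e * eps in (2.11)
t = 200
approx = np.linalg.matrix_power(A, t) / rho**t

print(rho)
print(np.max(np.abs(approx - limit)))  # close to zero for large t
```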

Remark 2.3.1. Note that, in general, if 𝑣 is an everywhere positive real-valued eigenvector for 𝐴, then so is 𝛼𝑣 for all 𝛼 > 0. Hence the uniqueness asserted in the Perron–Frobenius theorem is up to positive multiples. It tells us that if 𝑒 is the right eigenvector corresponding to 𝜌(𝐴) and ê is another positive vector satisfying 𝐴ê = 𝜌(𝐴)ê, then ê = 𝛼𝑒 for some 𝛼 > 0. A similar statement holds for the left eigenvector 𝜀.

We can use the Perron–Frobenius theorem to provide bounds on the spectral radius of a nonnegative matrix. Fix 𝑛 × 𝑛 matrix 𝐴 = (𝑎𝑖𝑗) and set

• rowsum𝑖(𝐴) ≔ Σ𝑗 𝑎𝑖𝑗 = the 𝑖-th row sum of 𝐴 and
• colsum𝑗(𝐴) ≔ Σ𝑖 𝑎𝑖𝑗 = the 𝑗-th column sum of 𝐴.

Lemma 2.3.2. If 𝐴 ⩾ 0, then

(i) min𝑖 rowsum𝑖(𝐴) ⩽ 𝜌(𝐴) ⩽ max𝑖 rowsum𝑖(𝐴) and
(ii) min𝑗 colsum𝑗(𝐴) ⩽ 𝜌(𝐴) ⩽ max𝑗 colsum𝑗(𝐴).

EXERCISE 2.3.1. Prove Lemma 2.3.2. (Hint: Since 𝑒 and 𝜀 are nonnegative and
nonzero, and since eigenvectors are defined only up to nonzero multiples, you can
assume that both of these vectors sum to 1.)

2.3.1.2 A Local Spectral Radius Result

Let 𝐴 be an 𝑛 × 𝑛 matrix. We know from Gelfand’s formula (page 19) that if ‖·‖ is any matrix norm, then ‖𝐴^𝑘‖^{1/𝑘} → 𝜌(𝐴) as 𝑘 → ∞. While useful, this result can be difficult to apply because it involves matrix norms. Fortunately, when 𝐴 is nonnegative, we have the following variation, which only involves vector norms.

Lemma 2.3.3. Let ‖·‖ be any norm on R𝑛. If 𝐴 is nonnegative and ℎ ∈ R𝑛 obeys ℎ ≫ 0, then

‖𝐴^𝑘 ℎ‖^{1/𝑘} → 𝜌(𝐴) as 𝑘 → ∞. (2.12)

The expression on the left of (2.12) is sometimes called the local spectral ra-
dius of 𝐴 at ℎ. Lemma 2.3.3 gives one set of conditions under which a local spectral
radius equals the spectral radius. This result will be useful when we examine state-
dependent discounting in Chapter 6.
For a proof of Lemma 2.3.3 see Theorem 9.1 of Krasnosel’skii et al. (1972).
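
To see Lemma 2.3.3 at work, the sketch below evaluates the left-hand side of (2.12) for increasing 𝑘 and compares it with 𝜌(𝐴). The matrix `A`, the vector `h`, the Euclidean norm, and the helper names are all assumptions made for illustration.

```julia
using LinearAlgebra

# Hypothetical nonnegative matrix and everywhere positive vector
A = [0.5 0.4;
     0.3 0.6]
h = [1.0, 2.0]

spectral_radius(A) = maximum(abs, eigvals(A))

# Approximation ‖Aᵏh‖^(1/k) of the local spectral radius, Euclidean norm
local_radius(A, h, k) = norm(A^k * h)^(1 / k)

for k in (1, 10, 100, 1000)
    println("k = ", k, ": ", local_radius(A, h, k))
end
println("ρ(A) = ", spectral_radius(A))
```

Convergence in 𝑘 is slow because the constant multiplying $\rho(A)^k$ only disappears after taking the 𝑘-th root, so large 𝑘 is needed for a close match.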

2.3.1.3 Markov Matrices

An 𝑛 × 𝑛 matrix 𝑃 is called a stochastic matrix or Markov matrix if

𝑃⩾0 and 𝑃1 = 1
where 1 is a column vector of ones, so that 𝑃 is nonnegative and has unit row sums.
The Perron–Frobenius theorem will be useful for the following exercise.

EXERCISE 2.3.2. Let 𝑃, 𝑄 be 𝑛 × 𝑛 Markov matrices. Prove the following facts.

(i) 𝑃𝑄 is also a Markov matrix.


(ii) 𝜌 ( 𝑃 ) = 1.
(iii) There exists a row vector 𝜓 ∈ R+𝑛 such that 𝜓1 = 1 and 𝜓𝑃 = 𝜓.
(iv) If 𝑃 is irreducible, then the vector 𝜓 in (iii) is everywhere positive and unique,
in the sense that no other vector 𝜓 ∈ R+𝑛 satisfies 𝜓1 = 1 and 𝜓𝑃 = 𝜓.

The vector 𝜓 in part (iii) of Exercise 2.3.2 is called a stationary distribution for 𝑃 .
Such distributions play an important role in the theory of Markov chains. We discuss
their interpretation and significance in §3.1.2.
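
The facts in Exercise 2.3.2 can be explored numerically before attempting a proof. The sketch below is one possible check under stated assumptions: the matrices `P` and `Q` are hypothetical, `P` is irreducible, and the helper `stationary_sketch` simply normalizes a left eigenvector obtained from `eigen` in the standard `LinearAlgebra` library.

```julia
using LinearAlgebra

# Two hypothetical Markov matrices (nonnegative with unit row sums)
P = [0.9 0.1 0.0;
     0.4 0.4 0.2;
     0.1 0.1 0.8]
Q = [0.5 0.5 0.0;
     0.0 0.5 0.5;
     0.5 0.0 0.5]

is_markov(M) = all(M .>= 0) && all(isapprox.(sum(M, dims=2), 1.0))
spectral_radius(M) = maximum(abs, eigvals(M))

# Left eigenvector of M for the eigenvalue with largest real part,
# normalized to sum to one
function stationary_sketch(M)
    vals, vecs = eigen(Matrix(M'))
    v = real.(vecs[:, argmax(real.(vals))])
    return v / sum(v)
end

ψ = stationary_sketch(P)
println("PQ is Markov: ", is_markov(P * Q))
println("ρ(P) = ", spectral_radius(P))
println("ψ = ", ψ, ",  ψP = ", vec(ψ' * P))
```

Normalizing by the sum fixes both the scale and the sign of the eigenvector, which is what the requirement 𝜓1 = 1 in part (iii) asks for.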

EXERCISE 2.3.3. Given Markov matrix 𝑃 and constant 𝜀 > 0, prove the following
result: There exists no ℎ ∈ RX with 𝑃ℎ ⩾ ℎ + 𝜀.

2.3.2 A Lake Model

We illustrate the power of the Perron–Frobenius theorem by showing how it helps us
analyze a model of employment and unemployment flows in a large population.
The model is sometimes called a “lake model” because there are two pools of
workers: those who are currently employed and those who are currently unemployed
but still seeking work. The flows between states are as follows:

• Workers exit the labor market at rate 𝑑 .


• New workers enter the labor market at rate 𝑏.
• Employed workers separate from their jobs and become unemployed at rate 𝛼.
• Unemployed workers find jobs at rate 𝜆 .
[Figure 2.12: Lake model transition dynamics. Flows: new entrants → unemployed at rate 𝑏;
unemployed → employed at rate 𝜆(1 − 𝑑); employed → unemployed at rate 𝛼(1 − 𝑑);
unemployed → unemployed at rate (1 − 𝜆)(1 − 𝑑); employed → employed at rate (1 − 𝛼)(1 − 𝑑).]

We assume that all parameters lie in (0, 1). New workers are initially unemployed.
Transition rates between the two pools appear in Figure 2.12. For example, the rate
of flow from employment to unemployment is 𝛼(1 − 𝑑), which equals the fraction of
employed workers who remain in the labor market and separate from their jobs.
Let 𝑒𝑡 and 𝑢𝑡 be the number of employed and unemployed workers at time 𝑡 re-
spectively. The total population (of workers) is 𝑛𝑡 ≔ 𝑒𝑡 + 𝑢𝑡 . In view of the rates stated
above, the number of unemployed workers evolves according to

𝑢𝑡+1 = (1 − 𝑑 ) 𝛼𝑒𝑡 + (1 − 𝑑 )(1 − 𝜆 ) 𝑢𝑡 + 𝑏𝑛𝑡 .

The three terms on the right correspond to the newly unemployed (due to separation),
the unemployed who failed to find jobs last period, and new entrants into the labor
force. The number of employed workers evolves according to

𝑒𝑡+1 = (1 − 𝑑 )(1 − 𝛼) 𝑒𝑡 + (1 − 𝑑 ) 𝜆𝑢𝑡 .

The evolution of the time series for 𝑢𝑡, 𝑒𝑡 and 𝑛𝑡 is illustrated in Figure 2.13. We set
parameters to 𝛼 = 0.01, 𝜆 = 0.1, 𝑑 = 0.02, and 𝑏 = 0.025. The initial populations of
unemployed and employed workers are 𝑢0 = 0.6 and 𝑒0 = 1.2, respectively. The series
grow over the long run due to net population growth.
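
The two laws of motion can be simulated directly. The following is a minimal sketch that uses the parameter values and initial conditions stated above; the function names `lake_update` and `simulate_lake` are hypothetical, and this is not the lake_2.jl file referenced in the figure caption.

```julia
# Parameter values from the text
α, λ, d, b = 0.01, 0.1, 0.02, 0.025

# One-step update of (u, e) using the two laws of motion
function lake_update(u, e)
    n = u + e
    u_next = (1 - d) * α * e + (1 - d) * (1 - λ) * u + b * n
    e_next = (1 - d) * (1 - α) * e + (1 - d) * λ * u
    return u_next, e_next
end

# Simulate T periods from (u0, e0); row t+1 of the result holds (u_t, e_t)
function simulate_lake(u0, e0, T)
    path = zeros(T + 1, 2)
    path[1, :] .= (u0, e0)
    for t in 1:T
        path[t+1, :] .= lake_update(path[t, 1], path[t, 2])
    end
    return path
end

path = simulate_lake(0.6, 1.2, 100)
n = path[:, 1] .+ path[:, 2]          # total labor force n_t
println("u_100 = ", path[end, 1], ", e_100 = ", path[end, 2])
println("n_1 / n_0 = ", n[2] / n[1])
```

Adding the two update equations shows that the ratio of successive labor force totals is constant at 1 + 𝑏 − 𝑑, which is the source of the long-run growth in the series.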
Can we say more about the dynamics of this system? For example, what long run
unemployment rate should we expect? Also, do long run outcomes depend heavily
on the initial conditions 𝑢0 and 𝑒0 ? Can we make some general statements that hold
regardless of the initial state?
[Figure 2.13: Time series for 𝑒𝑡, 𝑢𝑡 and 𝑛𝑡 (lake_2.jl). Three panels plot 𝑢𝑡, 𝑒𝑡 and 𝑛𝑡
against 𝑡 for 𝑡 from 0 to 100.]

To begin to address these questions, we first organize the linear system for (𝑒𝑡)
and (𝑢𝑡) by setting

$$x_t := \begin{pmatrix} u_t \\ e_t \end{pmatrix} \quad \text{and} \quad A := \begin{pmatrix} (1-d)(1-\lambda) + b & (1-d)\alpha + b \\ (1-d)\lambda & (1-d)(1-\alpha) \end{pmatrix}. \tag{2.13}$$

With these definitions, we can write the dynamics as $x_{t+1} = A x_t$. As a result, $x_t = A^t x_0$,
where $x_0 = (u_0 \; e_0)^\top$.
The overall growth rate of the total labor force is 𝑔 = 𝑏 − 𝑑 , in the sense that
𝑛𝑡+1 = (1 + 𝑔 ) 𝑛𝑡 for all 𝑡 .

EXERCISE 2.3.4. Confirm this claim by using the equation 𝑥𝑡+1 = 𝐴𝑥𝑡 .

EXERCISE 2.3.5. Prove that 𝜌 ( 𝐴) = 1 + 𝑔. [Hint: Use one of the results in §2.3.1.1.]

EXERCISE 2.3.6. By the Perron–Frobenius theorem, 1 + 𝑔 is an eigenvalue (in fact the
dominant eigenvalue) of 𝐴. Show that $\mathbb{1}^\top := (1 \; 1)$ is a left eigenvector corresponding
to this eigenvalue.

EXERCISE 2.3.7. Prove that the unique right eigenvector $\bar x$ satisfying $A \bar x = \rho(A) \bar x$
and $\mathbb{1}^\top \bar x = 1$ is given by

$$\bar x := \begin{pmatrix} \bar u \\ \bar e \end{pmatrix} \quad \text{with} \quad \bar u := \frac{1 + g - (1-d)(1-\alpha)}{1 + g - (1-d)(1-\alpha) + (1-d)\lambda} \tag{2.14}$$

and $\bar e := 1 - \bar u$.

[Figure 2.14: Time paths $x_t = A^t x_0$ for two choices of $x_0$ (lake_1.jl). Axes: unemployed
workforce (horizontal) and employed workforce (vertical). Initial conditions: $x_0 = (0.1, 4.0)$
and $x_0 = (5.0, 0.1)$.]

In the language of Perron–Frobenius theory, the right eigenvector $\bar x$ is called the
dominant eigenvector, since it corresponds to the dominant (i.e., largest) eigenvalue
𝜌(𝐴). This eigenvector plays an important role in determining long run outcomes. In
the remainder of this section we illustrate this fact.
To begin, recall that $c \bar x$ is also a right eigenvector corresponding to the eigenvalue
𝜌(𝐴) when 𝑐 > 0. The set $D := \{ x \in \mathbb{R}^2 : x = c \bar x \text{ for some } c > 0 \}$ is shown as a dashed
black line in Figure 2.14. The figure also shows two time paths, each of the form
$(x_t)_{t \geq 0} = (A^t x_0)_{t \geq 0}$, generated from two different initial conditions. In both cases,
the path converges to 𝐷 over time. The figure suggests that paths share
strong similarities in the long run that are determined by the dominant eigenvector
$\bar x$.
To see why this is so, we return to (2.11) from the Perron–Frobenius theorem,
which tells us that, since $A \gg 0$, we have

$$A^t \approx \rho(A)^t \cdot \bar x \mathbb{1}^\top = (1+g)^t \begin{pmatrix} \bar u & \bar u \\ \bar e & \bar e \end{pmatrix} \quad \text{for large } t.$$

As a result, for any initial condition $x_0 = (u_0 \; e_0)^\top$, we have

$$A^t x_0 \approx (1+g)^t \begin{pmatrix} \bar u & \bar u \\ \bar e & \bar e \end{pmatrix} \begin{pmatrix} u_0 \\ e_0 \end{pmatrix} = (1+g)^t (u_0 + e_0) \begin{pmatrix} \bar u \\ \bar e \end{pmatrix} = n_t \bar x,$$

where $n_t = (1+g)^t n_0$ and $n_0 = u_0 + e_0$. This says that, regardless of the initial condition,
the state $x_t$ scales along $\bar x$ at the rate of population growth. This is precisely what we
saw in Figure 2.14.
We can provide additional interpretations to the components 𝑢¯ and 𝑒¯ of 𝑥¯. Since
𝑛𝑡 is the size of the workforce at time 𝑡 , the rate of unemployment is 𝑢𝑡 /𝑛𝑡 . As just
shown, for large 𝑡 this is close to ( 𝑛𝑡 𝑢¯)/𝑛𝑡 = 𝑢¯. Hence 𝑢¯ is the long term rate of
unemployment along the stable growth path. Similarly, the other component 𝑒¯ of the
dominant eigenvector is the long run employment rate.
In summary, the dominant eigenvector provides us with both the long-run rate of
unemployment and the stable growth path, to which all trajectories with positive initial
conditions converge over time.
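
The convergence claim can be checked numerically. The sketch below builds 𝐴 from (2.13) with the parameter values used for Figure 2.13, evaluates the closed form (2.14), and iterates from the two initial conditions shown in Figure 2.14. The names `x_bar` and `normalized_state` and the horizon of 500 periods are assumptions for illustration.

```julia
using LinearAlgebra

α, λ, d, b = 0.01, 0.1, 0.02, 0.025   # parameter values from the text
g = b - d

# Entries of the matrix A in (2.13)
a11 = (1 - d) * (1 - λ) + b
a12 = (1 - d) * α + b
a21 = (1 - d) * λ
a22 = (1 - d) * (1 - α)
A = [a11 a12;
     a21 a22]

# Closed form (2.14) for the dominant eigenvector, normalized to sum to one
u_bar = (1 + g - a22) / (1 + g - a22 + a21)
x_bar = [u_bar, 1 - u_bar]

# Iterate x_{t+1} = A x_t for T periods and return the normalized state x_T / n_T
function normalized_state(A, x0, T)
    x = copy(x0)
    for _ in 1:T
        x = A * x
    end
    return x / sum(x)
end

for x0 in ([0.1, 4.0], [5.0, 0.1])    # initial conditions from Figure 2.14
    println(x0, " -> ", normalized_state(A, x0, 500))
end
println("x_bar = ", x_bar)
```

Dividing by the sum removes population growth, so what remains is the composition of the workforce, and this is the object that settles down to $\bar x$.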

Remark 2.3.2. A more thorough analysis would require us to think carefully about
how the underlying rates 𝛼, 𝜆, 𝑏 and 𝑑 are determined. For the hiring rate 𝜆, we
could use the job search model to fix the rate at which workers are matched to jobs.
In particular, with $w^*$ as the reservation wage, we could set

$$\lambda = \mathbb{P}\{ w_t \geq w^* \} = \sum_{w \geq w^*} \varphi(w).$$

Doing so would allow us to study determinants of 𝜆 that could include unemployment
compensation and workers’ impatience.

2.3.3 Linear Operators

There are two ways to think about a matrix. In one definition, an 𝑛 × 𝑘 matrix 𝐴 is
an 𝑛 × 𝑘 array of (real) numbers. In the second, 𝐴 is a linear operator from R𝑘 to
R𝑛 that takes a vector 𝑢 ∈ R𝑘 and sends it to 𝐴𝑢 in R𝑛 . Let’s clarify these ideas in a
setting where 𝑛 = 𝑘. While the matrix representation is important, the linear operator
representation is more fundamental and more general.

2.3.3.1 Matrices vs Linear Operators

A linear operator on R𝑛 is a map 𝐿 from R𝑛 to R𝑛 such that

𝐿 ( 𝛼𝑢 + 𝛽𝑣) = 𝛼𝐿𝑢 + 𝛽𝐿𝑣 for all 𝑢, 𝑣 ∈ R𝑛 and 𝛼, 𝛽 ∈ R. (2.15)

(We write 𝐿𝑢 instead of 𝐿 (𝑢), etc.) For example, if 𝐴 is an 𝑛 × 𝑛 matrix, then the
map from 𝑢 to 𝐴𝑢 defines a linear operator, since the rules of matrix algebra yield
𝐴 ( 𝛼𝑢 + 𝛽𝑣) = 𝛼𝐴𝑢 + 𝛽 𝐴𝑣.
We just showed that each matrix can be regarded as a linear operator. In fact the
converse is also true:

Theorem 2.3.4. If 𝐿 is a linear operator on $\mathbb{R}^n$, then there exists an 𝑛 × 𝑛 matrix $A = (a_{ij})$
such that 𝐿𝑢 = 𝐴𝑢 for all $u \in \mathbb{R}^n$.

A proof of Theorem 2.3.4 can be found in Kreyszig (1978) and many other sources.
Why introduce linear operators if they are essentially the same as matrices? One
reason is that, while a one-to-one correspondence between linear operators and ma-
trices holds in R𝑛 , the concept of linear operators is far more general. Linear operators
can be defined over many different kinds of sets whose elements have vector-like prop-
erties. This is related to the point that we made about function spaces in Remark 1.2.2
on page 28.
Another reason is computational: the matrix representation of a linear operator
can be tedious to construct and difficult to instantiate in memory in large problems.
We illustrate this point in §2.3.3.3 below.

2.3.3.2 Linear Operators on Function Space

The definition of linear operators on $\mathbb{R}^n$ extends naturally to linear operators on $\mathbb{R}^{\mathsf{X}}$
when $\mathsf{X} = \{x_1, \ldots, x_n\}$: A linear operator on $\mathbb{R}^{\mathsf{X}}$ is a map 𝐿 from $\mathbb{R}^{\mathsf{X}}$ to itself such
that, for all $u, v \in \mathbb{R}^{\mathsf{X}}$ and 𝛼, 𝛽 ∈ R, we have 𝐿(𝛼𝑢 + 𝛽𝑣) = 𝛼𝐿𝑢 + 𝛽𝐿𝑣. In what follows,

$\mathcal{L}(\mathbb{R}^{\mathsf{X}})$ ≔ the set of all linear operators on $\mathbb{R}^{\mathsf{X}}$.

Let 𝐿 be a function from X × X to R. This function induces an operator 𝐿 from $\mathbb{R}^{\mathsf{X}}$
to itself via

$$(Lu)(x) = \sum_{x' \in \mathsf{X}} L(x, x') u(x') \qquad (x \in \mathsf{X}, \; u \in \mathbb{R}^{\mathsf{X}}). \tag{2.16}$$

We use the same symbol 𝐿 on both sides of the equals sign because both represent
essentially the same object (in the sense that a matrix 𝐴 can be viewed as a collection
of numbers ( 𝐴𝑖 𝑗 ) or as a linear map 𝑢 ↦→ 𝐴𝑢).
The function 𝐿 on the right-hand side of (2.16) is sometimes called the “kernel”
of the operator 𝐿. However, we will call it a matrix in what follows, since 𝐿 ( 𝑥, 𝑥 0) =
𝐿 ( 𝑥 𝑖 , 𝑥 𝑗 ) is just an 𝑛 × 𝑛 array of real numbers. When more precision is required, we
will call it the matrix representation of 𝐿.
In essence, the operation in (2.16) is just matrix multiplication: ( 𝐿𝑢)( 𝑥 ) is row 𝑥
of the matrix product 𝐿𝑢.

EXERCISE 2.3.8. Confirm that 𝐿 on the left-hand side of (2.16) is in fact a linear
operator (i.e., an element of L ( RX )).

The eigenvalues and eigenvectors of the linear operator 𝐿 are defined as the eigen-
values and eigenvectors of its matrix representation. The spectral radius 𝜌 ( 𝐿) of 𝐿 is
defined analogously.
We used the same symbol for the operator 𝐿 on the left-hand side of (2.16) and
its matrix representation on the right because these two objects are in one-to-one
correspondence. In particular, every $L \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ can be expressed in the form of
(2.16) for a suitable choice of matrix $(L(x, x'))$. Readers who are comfortable with
these claims can skip ahead to §2.3.3.3. The next lemma provides more details.

Lemma 2.3.5. When $\mathsf{X} = \{x_1, \ldots, x_n\}$, the following sets are in one-to-one correspondence:

(a) The set of all 𝑛 × 𝑛 real matrices.


(b) The set of all linear operators on R𝑛 .
(c) The set L ( RX ) of linear operators on RX .
(d) The set of all functions from X × X to R.

Lemma 2.3.5 needs no formal proof. Theorem 2.3.4 already tells us that (a) and
(b) are in one-to-one correspondence. Also, (b) and (c) are in one-to-one correspondence
because each $L \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ can be identified with a linear operator 𝑢 ↦ 𝐿𝑢 on $\mathbb{R}^n$
by pairing 𝑢 and 𝐿𝑢 in $\mathbb{R}^{\mathsf{X}}$ with their vector representations in $\mathbb{R}^n$ (see §1.2.4.2). Finally, (d)
and (a) are in one-to-one correspondence under the identification $L(x_i, x_j) \leftrightarrow L_{ij}$.

2.3.3.3 Computational Issues

At the end of §2.3.3.1 we claimed that working with linear operators brings some
computational advantages vis-à-vis working with matrices. This section fills in some
details. (Readers who prefer not to think about computational issues at this point can
skip ahead to §2.3.3.4.)
To illustrate the main idea, consider a setting where the state space X takes the
form X = Y × Z with |Y| = 𝑗 and |Z| = 𝑘. A typical element of X is 𝑥 = ( 𝑦, 𝑧 ). As we
shall see, this kind of setting arises naturally in dynamic programming.
Let 𝑄 be a map from Z × Z to R (i.e., a 𝑘 × 𝑘 matrix) and consider the operator
sending $u \in \mathbb{R}^{\mathsf{X}}$ to $Lu \in \mathbb{R}^{\mathsf{X}}$ according to the rule

$$(Lu)(x) = (Lu)(y, z) = \sum_{z' \in \mathsf{Z}} u(y, z') Q(z, z'). \tag{2.17}$$

EXERCISE 2.3.9. Prove that 𝐿 ∈ L ( RX ).

Since 𝐿 is a linear operator on $\mathbb{R}^{\mathsf{X}}$, Lemma 2.3.5 tells us that 𝐿 can be represented
as an 𝑛 × 𝑛 matrix $(L(x_i, x_j)) = (L_{ij})$, where 𝑛 = |X| = 𝑗 × 𝑘. To construct this matrix, we
first need to “flatten” Y×Z into a set X = { 𝑥1 , . . . , 𝑥𝑛 } with a single index. There are two
natural ways to do this. Considering Y × Z as a two-dimensional array with typical
element ( 𝑦𝑖 , 𝑧 𝑗 ), we can (a) stack all 𝑘 columns vertically into one long column, or
(b) concatenate all 𝑗 rows into one long row. The first arrangement is called column-
major ordering and is the default for languages such as Julia and Fortran. The second
is called row-major ordering and is the default for languages such as Python and C.
Either way we obtain a set of elements indexed by 1, . . . , 𝑛.
After adopting one of these conventions, Lemma 2.3.5 assures us we can construct
a uniquely defined 𝑛 × 𝑛 matrix that represents 𝐿. Once we decide how to construct
this matrix, we can instantiate it in computer memory and compute the operation
𝑢 ↦→ 𝐿𝑢 by matrix multiplication.
There are, however, several disadvantages to implementing 𝐿 using this matrix-
based approach. One is that constructing the matrix representation is tedious. An-
other is that confusion can arise when swapping between column- and row-major
orderings in order to shift between languages or to communicate with colleagues. A
third is that differences are introduced between computer code and the natural rep-
resentation (2.17), which can be a source of bugs. A fourth issue is that an 𝑛 × 𝑛
matrix has to be instantiated in memory, even though the linear operation in (2.17)
is only an inner product in R𝑘 . The last issue can be alleviated in most languages
by employing sparse matrices, but doing so adds boilerplate and can be a source of
inefficiency.
Because of these issues, most modern scientific computing environments support
linear operators directly, as well as actions on linear operators such as inverting linear
maps. These considerations encourage us to take an operator-based approach.
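
The contrast between the two approaches can be made concrete. The sketch below applies rule (2.17) directly to a 𝑗 × 𝑘 array and then repeats the computation with an explicit 𝑛 × 𝑛 matrix built under column-major flattening. The sizes, the random inputs, and the names `apply_L` and `L_matrix` are hypothetical; the Kronecker product form is one standard way to write the flattened matrix, not necessarily how the authors would construct it.

```julia
using LinearAlgebra

j, k = 3, 4                      # |Y| = j and |Z| = k (hypothetical sizes)
Q = rand(k, k)                   # a k × k matrix Q(z, z′)
U = rand(j, k)                   # u stored as a j × k array, U[y, z] = u(y, z)

# Operator approach: apply (2.17) directly, with no n × n matrix
apply_L(U, Q) = U * Q'           # (U * Q')[y, z] = Σ_z′ U[y, z′] * Q[z, z′]

# Matrix approach: flatten U in column-major order and build the n × n matrix
L_matrix = kron(Q, Matrix(1.0I, j, j))
flat_result = L_matrix * vec(U)

println(vec(apply_L(U, Q)) ≈ flat_result)
println("entries stored: operator ", length(Q), ", matrix ", length(L_matrix))
```

The operator version stores only the 𝑘 × 𝑘 array `Q` and mirrors (2.17) line for line, while the matrix version stores $(jk)^2$ entries and depends on the flattening convention.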

2.3.3.4 Positive Operators and Markov Operators

Having agreed on the benefits of an operator-theoretic exposition, let us now describe
some kinds of linear operators. We continue to assume that X is a finite set with 𝑛
elements.
The set $\mathbb{R}^{\mathsf{X}}_+$ of all $u \in \mathbb{R}^{\mathsf{X}}$ with 𝑢 ⩾ 0 is called the positive cone of $\mathbb{R}^{\mathsf{X}}$. An operator
$L \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ is called positive if the positive cone is invariant under 𝐿; that is, if

$$u \geq 0 \implies Lu \geq 0. \tag{2.18}$$

Example 2.3.1. The operator $L \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ defined in (2.17) is positive whenever 𝑄 ⩾ 0.
This is because

$$u \geq 0 \implies \sum_{z'} u(y, z') Q(z, z') \geq 0 \quad \text{for all } x = (y, z) \text{ in } \mathsf{X}.$$

Lemma 2.3.6. An operator $L \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ is positive if and only if its matrix representation
is a nonnegative matrix.

EXERCISE 2.3.10. Prove Lemma 2.3.6.

Remark 2.3.3. The Lemma 2.3.6 characterization suggests that we should really call
a linear operator satisfying (2.18) “nonnegative” rather than positive. Nevertheless,
the “positive” terminology is standard (see, e.g., Zaanen (2012)).

EXERCISE 2.3.11. Given $L \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$, prove the following statement: 𝐿 is positive if
and only if 𝐿 is order-preserving on $\mathbb{R}^{\mathsf{X}}$ under the pointwise order.

An operator $P \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ is called a Markov operator on $\mathbb{R}^{\mathsf{X}}$ if 𝑃 is positive and
𝑃1 = 1. We let

$\mathcal{M}(\mathbb{R}^{\mathsf{X}})$ ≔ the set of all Markov operators on $\mathbb{R}^{\mathsf{X}}$.

Viewed as matrices, elements of $\mathcal{M}(\mathbb{R}^{\mathsf{X}})$ are nonnegative matrices whose rows sum
to one. The next exercise asks you to confirm this.

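EXERCISE 2.3.12. Fix $P \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ and let $P(x, x')$ be the matrix representation.
Prove that $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ if and only if $P(x, x') \geq 0$ for all $x, x' \in \mathsf{X}$ and $\sum_{x' \in \mathsf{X}} P(x, x') = 1$
for all $x \in \mathsf{X}$.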

EXERCISE 2.3.13. Prove: If $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ and $v \in \mathbb{R}^{\mathsf{X}}$ with $v \gg 0$, then $Pv \gg 0$.

In the next exercise, you can think of 𝜑 as a row vector and 𝜑𝑃 as premultiplying
the matrix 𝑃 by this row vector. Chapter 3 uses the map 𝜑 ↦ 𝜑𝑃 to update marginal
distributions generated by Markov chains.

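EXERCISE 2.3.14. Fix $P \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$. Prove that $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ if and only if the function
𝜑𝑃 defined by

$$(\varphi P)(x') = \sum_{x \in \mathsf{X}} P(x, x') \varphi(x) \qquad (x' \in \mathsf{X}) \tag{2.19}$$

is in D(X) whenever 𝜑 ∈ D(X).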

Markov operators are important for us because they generate Markov dynamics, a
foundation of dynamic programming. Thus, (2.19) is a rule for updating distributions
by one period under the Markov dynamics specified by 𝑃 . We’ll use it often in the next
chapter.

2.4 Chapter Notes

Davey and Priestley (2002) provide a good introduction to partial orders and order-
theoretic concepts. Our favorite books on fixed points and analysis include Ok (2007),
Zhang (2012), Cheney (2013) and Atkinson and Han (2005). Good background ma-
terial on order-theoretic fixed point methods can be found in Guo et al. (2004) and
Zhang (2012).
Chapter 3

Markov Dynamics

To prepare to analyze dynamic programs, we now study stochastic processes generated
by Markov chains. These processes are widely used to construct economic and
financial models.
At the end of this chapter we return to the job search problem from Chapter 1 and
allow wage draws to be correlated over time (rather than IID). We use a Markov chain
to generate serially correlated wage draws.
Throughout this chapter, the symbol X represents a finite set.

3.1 Foundations

This section describes elementary properties of Markov models.

3.1.1 Markov Chains

Let’s start with a definition and some simple examples.

3.1.1.1 Defining Markov Chains

Fix $\mathsf{X} = \{x_1, \ldots, x_n\}$ and $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$. We interpret $P(x, x')$ as the probability that
a random process moves from 𝑥 to $x'$ over one unit of time. For this interpretation
to make sense we need $P(x, x')$ to be nonnegative and $\sum_{x' \in \mathsf{X}} P(x, x')$ to equal 1 for
every 𝑥 ∈ X, since we want the chain to stay somewhere in the state space after each
update. These are exactly the properties guaranteed by the assumption $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$
(see Exercise 2.3.12).
To formalize ideas, let $(X_t) := (X_t)_{t \geq 0}$ be a sequence of random variables taking
values in X and call $(X_t)$ a Markov chain on state space X if there exists a $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$
such that

$$\mathbb{P}\{ X_{t+1} = x' \mid X_0, X_1, \ldots, X_t \} = P(X_t, x') \quad \text{for all } t \geq 0, \; x' \in \mathsf{X}. \tag{3.1}$$

To simplify terminology, we also call ( 𝑋𝑡 ) 𝑃 -Markov when (3.1) holds. We call either
𝑋0 or its distribution 𝜓0 the initial condition of ( 𝑋𝑡 ), depending on context. 𝑃 is also
called the transition matrix of the Markov chain.
The definition of a Markov chain says two things:

(i) When updating to 𝑋𝑡+1 from 𝑋𝑡 , earlier states are not required.
(ii) 𝑃 encodes all of the information required to perform the update, given the cur-
rent state 𝑋𝑡 .

One way to think about Markov chains is algorithmically: Fix $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ and
let $\psi_0$ be an element of D(X). Now generate $(X_t)$ via Algorithm 3.1. The resulting
sequence is 𝑃-Markov with initial condition $\psi_0$.

Algorithm 3.1: Generation of 𝑃-Markov $(X_t)$ with initial condition $\psi_0$

1 𝑡 ← 0
2 $X_t$ ← a draw from $\psi_0$
3 while 𝑡 < ∞ do
4     $X_{t+1}$ ← a draw from the distribution $P(X_t, \cdot)$
5     𝑡 ← 𝑡 + 1
6 end
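
Algorithm 3.1 translates almost line for line into Julia. The following is a minimal sketch that truncates the infinite loop after a fixed number of draws and represents states by their indices 1, …, 𝑛. The helper names `draw` and `simulate_chain` and the example matrix are hypothetical, and only the standard `Random` library is assumed.

```julia
using Random

# Draw an index from the probability vector `probs` by inverting the CDF
function draw(rng, probs)
    u, c = rand(rng), 0.0
    for (i, p) in enumerate(probs)
        c += p
        u < c && return i
    end
    return length(probs)          # guard against rounding in the cumulative sum
end

# Generate X_0, ..., X_{T-1} following Algorithm 3.1, truncated at T draws.
# States are represented by their indices 1, ..., n.
function simulate_chain(P, ψ0, T; rng=Random.default_rng())
    X = Vector{Int}(undef, T)
    X[1] = draw(rng, ψ0)                       # X_0 drawn from ψ0
    for t in 1:(T - 1)
        X[t+1] = draw(rng, view(P, X[t], :))   # X_{t+1} drawn from P(X_t, ·)
    end
    return X
end

P = [0.9 0.1;
     0.2 0.8]                                  # hypothetical transition matrix
X = simulate_chain(P, [0.5, 0.5], 10)
println(X)
```

Only the current state `X[t]` enters each update, which is the Markov property in code form.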

3.1.1.2 Application: S-s Dynamics

As an example, consider a firm whose inventory of some product follows 𝑆-𝑠 dynam-
ics, meaning that the firm waits until its inventory falls below some level 𝑠 > 0 and
then immediately replenishes by ordering 𝑆 units. This pattern of decisions can be
rationalized if ordering requires paying a fixed cost. Thus, in §5.2.1, we will show
that 𝑆-𝑠 behavior is optimal in a setting where fixed costs exist and the firm’s aim is
to maximize its present value.
To represent 𝑆-𝑠 dynamics, we suppose that a firm’s inventory $(X_t)_{t \geq 0}$ of a given
product obeys

$$X_{t+1} = \max\{ X_t - D_{t+1}, 0 \} + S \mathbb{1}\{ X_t \leq s \},$$

where

• $(D_t)_{t \geq 1}$ is an exogenous IID demand process with $D_t \overset{d}{=} \varphi \in \mathcal{D}(\mathbb{Z}_+)$ for all 𝑡 and
• 𝑆 is the quantity ordered when $X_t \leq s$.

For the distribution 𝜑 of demand we take the geometric distribution, so that
$\varphi(d) = \mathbb{P}\{ D_t = d \} = p(1-p)^d$ for $d \in \mathbb{Z}_+$.
EXERCISE 3.1.1. Confirm the following claim: An appropriate state space for this
model is X ≔ {0, …, 𝑆 + 𝑠}, since

$$X_t \in \mathsf{X} \implies \mathbb{P}\{ X_{t+1} \in \mathsf{X} \} = 1.$$

If we define $h(x, d) := \max\{ x - d, 0 \} + S \mathbb{1}\{ x \leq s \}$, so that $X_{t+1} = h(X_t, D_{t+1})$ for all 𝑡,
then the transition matrix can be expressed as

$$P(x, x') = \mathbb{P}\{ h(x, D_{t+1}) = x' \} = \sum_{d \geq 0} \mathbb{1}\{ h(x, d) = x' \} \varphi(d) \qquad ((x, x') \in \mathsf{X} \times \mathsf{X}).$$

Listing 7 provides code that simulates inventory paths and computes other objects
of interest. Since the state space $\mathsf{X} = \{x_1, \ldots, x_n\}$ corresponds to {0, …, 𝑆 + 𝑠} and
Julia indexing starts at 1, we set $x_i = i - 1$. This convention is used when computing
P[i, j], which corresponds to $P(x_i, x_j)$. The code in the listing is used to produce
the simulation of inventories in Figure 3.1.
The function compute_mc returns an instance of a MarkovChain object that can
store both the state space X and the transition probabilities. The QuantEcon.jl library
defines this data type and provides functions that simulate Markov chains, compute
stationary distributions, and perform related tasks.

3.1.1.3 Higher Order Transition Matrices

Given a finite state space X, 𝑘 ⩾ 0 and 𝑃 ∈ M ( RX ), let 𝑃 𝑘 be the 𝑘-th power of 𝑃 . (If
𝑘 = 0, then 𝑃 𝑘 is the identity matrix.) Since M ( RX ) is closed under multiplication
(Exercise 2.3.2), 𝑃 𝑘 is in M ( RX ) for all 𝑘 ⩾ 0. In this context, 𝑃 𝑘 is sometimes called
the 𝑘-step transition matrix corresponding to 𝑃 . In what follows, 𝑃 𝑘 ( 𝑥, 𝑥 0) denotes
the ( 𝑥, 𝑥 0)-th element of the matrix representation of 𝑃 𝑘 .
using Distributions, QuantEcon, IterTools

function create_inventory_model(; S=100,   # Order size
                                  s=10,    # Order threshold
                                  p=0.4)   # Demand parameter
    ϕ = Geometric(p)
    h(x, d) = max(x - d, 0) + S * (x <= s)
    return (; S, s, ϕ, h)
end

"Simulate the inventory process."
function sim_inventories(model; ts_length=200)
    (; S, s, ϕ, h) = model
    X = Vector{Int32}(undef, ts_length)
    X[1] = S  # Initial condition
    for t in 1:(ts_length-1)
        X[t+1] = h(X[t], rand(ϕ))
    end
    return X
end

"Compute the transition probabilities and state."
function compute_mc(model; d_max=100)
    (; S, s, ϕ, h) = model
    n = S + s + 1  # Size of state space
    state_vals = collect(0:(S + s))
    P = Matrix{Float64}(undef, n, n)
    for (i, j) in product(1:n, 1:n)
        P[i, j] = sum((h(i-1, d) == j-1) * pdf(ϕ, d) for d in 0:d_max)
    end
    return MarkovChain(P, state_vals)
end

"Compute the stationary distribution of the model."
function compute_stationary_dist(model)
    mc = compute_mc(model)
    return mc.state_values, stationary_distributions(mc)[1]
end

Listing 7: An implementation of 𝑆-𝑠 inventory dynamics (inventory_sim.jl)


[Figure 3.1: Inventory simulation (inventory_sim.jl). Plot of inventory $X_t$ against 𝑡 for
𝑡 from 0 to 200.]

The 𝑘-step transition matrix has the following interpretation: If $(X_t)$ is 𝑃-Markov,
then for any $t, k \in \mathbb{Z}_+$ and $x, x' \in \mathsf{X}$,

$$P^k(x, x') = \mathbb{P}\{ X_{t+k} = x' \mid X_t = x \}. \tag{3.2}$$

Thus, $P^k$ provides the 𝑘-step transition probabilities for the 𝑃-Markov chain $(X_t)$.

EXERCISE 3.1.2. Prove the claim in the last sentence via induction.

We can now give the following useful characterization of irreducibility:


Lemma 3.1.1. Given $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$, the following statements are equivalent:
(i) 𝑃 is irreducible.
(ii) If $(X_t)$ is 𝑃-Markov and $x, x' \in \mathsf{X}$, then there exists a 𝑘 ⩾ 0 such that

$$\mathbb{P}\{ X_k = x' \mid X_0 = x \} > 0.$$

Thus, irreducibility of 𝑃 means that the 𝑃 -Markov chain eventually visits any state
from any other state with positive probability.
Proof of Lemma 3.1.1. Fix $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$. 𝑃 is irreducible if and only if $\sum_{k \geq 0} P^k \gg 0$.
This is equivalent to the statement that for each $(x, x') \in \mathsf{X} \times \mathsf{X}$, there exists a 𝑘 ⩾ 0
such that $P^k(x, x') > 0$, which is in turn equivalent to part (ii) of Lemma 3.1.1. □

using QuantEcon

P = [0.1 0.9;
     0.0 1.0]
mc = MarkovChain(P)
print(is_irreducible(mc))

Listing 8: Testing irreducibility (is_irreducible.jl)

EXERCISE 3.1.3. Using Lemma 3.1.1, prove that the stochastic matrix associated
with the 𝑆-𝑠 inventory dynamics in §3.1.1.2 is irreducible.

Several libraries have code for testing irreducibility, including QuantEcon.jl. See
Listing 8 for an example of a call to this functionality. In this case, irreducibility fails
because state 2 is an absorbing state. Once entered, the probability of ever leaving
that state is zero. (A subset Y of X with this property is called an absorbing set.)

3.1.2 Stationarity and Ergodicity

Next we review aspects of Markov dynamics, including stationarity and ergodicity.
Fix $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ and let $(X_t)$ be a 𝑃-Markov chain. Let $\psi_t$ be the distribution of $X_t$. Marginal
distributions $\psi_t$ evolve according to

$$\psi_{t+1}(x') = \sum_{x} P(x, x') \psi_t(x) \quad \text{for all } x' \in \mathsf{X} \text{ and } t \geq 0. \tag{3.3}$$

To verify (3.3), rewrite it as $\mathbb{P}\{ X_{t+1} = x' \} = \sum_x \mathbb{P}\{ X_{t+1} = x' \mid X_t = x \} \mathbb{P}\{ X_t = x \}$, which is
true by the law of total probability. With each $\psi_t$ regarded as a row vector, (3.3) can
also be written as

$$\psi_{t+1} = \psi_t P. \tag{3.4}$$

Equation (3.4) tells us that dynamics of marginal distributions for Markov chains are
generated by deterministic linear difference equations in distribution space. This is
remarkable because the dynamics that drive $(X_t)$ are stochastic and can be arbitrarily
nonlinear.
Iterating on (3.4), we get $\psi_t = \psi_0 P^t$ for all 𝑡. In summary,

$$(X_t)_{t \geq 0} \text{ is } P\text{-Markov with } X_0 \overset{d}{=} \psi_0 \implies X_t \overset{d}{=} \psi_0 P^t \text{ for all } t \geq 0. \tag{3.5}$$

For (3.5) and 𝜓𝑡+1 = 𝜓𝑡 𝑃 to hold, each 𝜓𝑡 must be a row vector. In what follows, we
always treat the distributions ( 𝜓𝑡 )𝑡⩾0 of ( 𝑋𝑡 )𝑡⩾0 as row vectors.
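
The update rule (3.4) is a single matrix product in code. The sketch below iterates it from a point mass and stores each $\psi_t$ as a row, following the row vector convention; the matrix `P`, the initial condition, and the function name `marginals` are assumptions made for illustration.

```julia
# A hypothetical Markov matrix on three states
P = [0.9 0.1 0.0;
     0.4 0.4 0.2;
     0.1 0.1 0.8]

# Iterate ψ_{t+1} = ψ_t P, storing each ψ_t as a row of the output
function marginals(ψ0, P, T)
    ψ = reshape(collect(float.(ψ0)), 1, :)    # treat ψ0 as a 1 × n row vector
    path = zeros(T + 1, length(ψ0))
    path[1, :] .= vec(ψ)
    for t in 1:T
        ψ = ψ * P
        path[t+1, :] .= vec(ψ)
    end
    return path
end

path = marginals([1.0, 0.0, 0.0], P, 5)
for t in 0:5
    println("ψ_", t, " = ", path[t+1, :])
end
```

Each row remains a distribution, and the final row agrees with $\psi_0 P^t$ computed from the matrix power, as (3.5) states.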

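EXERCISE 3.1.4. Let $(X_t)$ be 𝑃-Markov on X with $X_0 \overset{d}{=} \psi_0$. Show that

$$\mathbb{E}\, h(X_t) = \psi_0 P^t h = \langle \psi_0 P^t, h \rangle \quad \text{for all } t \in \mathbb{N} \text{ and } h \in \mathbb{R}^{\mathsf{X}}. \tag{3.6}$$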

Consistent with our definition of stationary distributions in §2.3.1.3, a marginal
distribution $\psi^* \in \mathcal{D}(\mathsf{X})$ is called stationary for 𝑃 if

$$\sum_{x} P(x, x') \psi^*(x) = \psi^*(x') \quad \text{for all } x' \in \mathsf{X}.$$

In vector form this is $\psi^* P = \psi^*$. By this definition and (3.3), if $\psi^*$ is stationary and $X_t$
has distribution $\psi^*$, then so does $X_{t+k}$ for all 𝑘 ⩾ 1.
We saw in Exercise 2.3.2 that every irreducible 𝑃 ∈ M ( RX ) has exactly one sta-
tionary distribution in D(X). The following ergodic property holds under the same
assumptions.

Theorem 3.1.2. If 𝑃 is irreducible with stationary distribution $\psi^*$, then, for any
𝑃-Markov chain $(X_t)$ and any 𝑥 ∈ X, we have

$$\mathbb{P}\left\{ \lim_{k \to \infty} \frac{1}{k} \sum_{t=0}^{k-1} \mathbb{1}\{ X_t = x \} = \psi^*(x) \right\} = 1. \tag{3.7}$$

A proof of (3.7) can be found in Brémaud (2020).


Property (3.7) tells us that, with probability one (i.e., for almost every 𝑃 -Markov
chain that we generate), the fraction of time that the chain spends in any given state
is, in the limit, equal to the probability assigned to that state by the stationary distri-
bution. Markov chains with this property are sometimes said to be ergodic.
Since the 𝑆-𝑠 inventory model from §3.1.1.2 is irreducible, the ergodicity result
from Theorem 3.1.2 applies. In particular, the process has only one stationary distribution
$\psi^*$ in D(X), where X = {0, …, 𝑆 + 𝑠}, and (3.7) is valid. Figure 3.2 illustrates
this by plotting both the stationary distribution $\psi^*$ (which is computed using the code
in Listing 7), and the value $m(y) := \frac{1}{k} \sum_{t=0}^{k-1} \mathbb{1}\{ X_t = y \}$ at each 𝑦 ∈ X for 𝑘 set to
1,000,000. As predicted by the theorem, the fraction of time spent by the chain in
each state is close to the probability assigned by $\psi^*$.
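
The same comparison can be run on a small chain without any external packages. The sketch below is an illustration on a hypothetical three-state irreducible matrix rather than the inventory model: it computes the stationary distribution from a linear system (one standard approach when the stationary distribution is unique) and compares it with time fractions from one long simulated path. The seed, the function name `time_fractions`, and the matrix `P` are assumptions.

```julia
using Random, LinearAlgebra

# Hypothetical irreducible Markov matrix on three states
P = [0.9 0.1 0.0;
     0.4 0.4 0.2;
     0.1 0.1 0.8]
n = size(P, 1)

# Stationary distribution from the linear system ψ (I - P + E) = 1ᵀ,
# where E is the n × n matrix of ones (valid when ψ* is unique)
ψ_star = (I - P + ones(n, n))' \ ones(n)

# Fraction of time spent in each state along one simulated path of length k
function time_fractions(P, k; rng=MersenneTwister(1234), x0=1)
    n = size(P, 1)
    cdfs = cumsum(P, dims=2)                  # row-wise CDFs of P(x, ·)
    counts = zeros(Int, n)
    x = x0
    for _ in 1:k
        counts[x] += 1
        u = rand(rng)
        # first index whose cumulative probability exceeds u
        x = min(searchsortedlast(view(cdfs, x, :), u) + 1, n)
    end
    return counts / k
end

m = time_fractions(P, 1_000_000)
println("stationary: ", ψ_star)
println("simulated:  ", m)
```

The linear system works because any distribution 𝜓 satisfies 𝜓𝐸 = 1ᵀ, so 𝜓(𝐼 − 𝑃 + 𝐸) = 1ᵀ holds exactly when 𝜓𝑃 = 𝜓.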
[Figure 3.2: Ergodicity (inventory_sim.jl). The stationary distribution $\psi^*$ and the
simulated frequency are plotted against the state.]

3.1.2.1 Application: Day Laborer

Suppose that a day laborer is either unemployed ($X_t = 1$) or employed ($X_t = 2$) in
each period. In state 1 he is hired with probability 𝛼 ∈ (0, 1). In state 2 he is fired
with probability 𝛽 ∈ (0, 1). The corresponding state space and transition matrix are

$$\mathsf{X} = \{1, 2\} \quad \text{and} \quad P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}. \tag{3.8}$$

Listing 9 provides a function to update from 𝑋𝑡 to 𝑋𝑡+1 , using the fact that rand()
generates a draw from the uniform distribution on [0, 1).

EXERCISE 3.1.5. Explain why Listing 9 updates the current state according to the
probabilities in 𝑃 .

EXERCISE 3.1.6. Because 𝑃 is everywhere positive, it must be irreducible, so 𝑃 has
a unique stationary distribution $\psi^* \in \mathcal{D}(\mathsf{X})$. Show that $\psi^*$ is given by

$$\psi^* = \frac{1}{\alpha + \beta} \begin{pmatrix} \beta & \alpha \end{pmatrix}. \tag{3.9}$$

function create_laborer_model(; α=0.3, β=0.2)
    return (; α, β)
end

function laborer_update(x, model)  # update X from t to t+1
    (; α, β) = model
    if x == 1
        x′ = rand() < α ? 2 : 1
    else
        x′ = rand() < β ? 1 : 2
    end
    return x′
end

Listing 9: Updating the state of the day laborer (laborer_sim.jl)

It is also true that $\psi P^t \to \psi^*$ as 𝑡 → ∞ for any 𝜓 ∈ D(X). Thus, the operator 𝑃,
when understood as the mapping 𝜓 ↦ 𝜓𝑃, is globally stable on D(X).

EXERCISE 3.1.7. Prove this using the Perron–Frobenius theorem. More generally,
show that this global stability result holds for any $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ with $P \gg 0$.

EXERCISE 3.1.8. Fix 𝛼 = 0.3 and 𝛽 = 0.2. Compute the sequence ( 𝜓𝑃 𝑡 ) for different
choices of 𝜓 and confirm that your results are consistent with the claim that 𝜓𝑃 𝑡 → 𝜓∗
as 𝑡 → ∞ for any 𝜓 ∈ D(X).

EXERCISE 3.1.9. Since 𝑃 is irreducible, ergodicity property (3.7) holds. Simulate
a long realization of a 𝑃-Markov chain from an arbitrary initial condition and
confirm that your results are consistent with (3.7).

3.1.3 Approximation

To simplify numerical calculations, we sometimes approximate a continuous state
Markov process with a Markov chain. For example, consider a linear Gaussian AR(1)
model, where $(X_t)_{t \geq 0}$ evolves in R according to

$$X_{t+1} = \rho X_t + b + \nu \varepsilon_{t+1}, \qquad |\rho| < 1, \quad (\varepsilon_t) \overset{\text{IID}}{\sim} N(0, 1). \tag{3.10}$$

The model (3.10) has a unique stationary distribution $\psi^*$ given by

$$\psi^* = N(\mu_x, \sigma_x^2) \quad \text{with} \quad \mu_x := \frac{b}{1 - \rho} \quad \text{and} \quad \sigma_x^2 := \frac{\nu^2}{1 - \rho^2}.$$

This means that

$$X_t \overset{d}{=} \psi^* \text{ and } X_{t+1} = \rho X_t + b + \nu \varepsilon_{t+1} \quad \text{implies} \quad X_{t+1} \overset{d}{=} \psi^*.$$

EXERCISE 3.1.10. Suppose that $X_t \overset{d}{=} \psi^*$, $\varepsilon_{t+1} \overset{d}{=} N(0, 1)$ and $X_t$ and $\varepsilon_{t+1}$ are
independent. Prove that $\rho X_t + b + \nu \varepsilon_{t+1}$ has distribution $\psi^*$. Is this still true if we drop the
independence assumption made above?

Process (3.10) is also ergodic in a similar sense to (3.7) on page 87: on average,
realizations of the process spend most of their time in regions of the state where the
stationary distribution puts high probability mass. (You can check this via simulations
if you wish.) Hence, in the discretization that follows, we shall put the discrete state
space in this area.

EXERCISE 3.1.11. Set 𝑏 = 0 in (3.10) and let 𝐹 be the CDF of 𝑁 (0, 𝜈2 ). Show that

P{𝑡 − 𝛿 < 𝑋𝑡+1 ⩽ 𝑡 + 𝛿 | 𝑋𝑡 = 𝑥 } = 𝐹 ( 𝑡 − 𝜌𝑥 + 𝛿) − 𝐹 ( 𝑡 − 𝜌𝑥 − 𝛿) (3.11)

for all 𝛿, 𝑡 ∈ R.

To discretize (3.10) we use Tauchen’s method, starting with the case 𝑏 = 0.1 As
a first step, we choose 𝑛 as the number of states for the discrete approximation and
𝑚 as an integer that sets the width of the state space. Then we create a state space
X ≔ { 𝑥1 , . . . , 𝑥𝑛 } ⊂ R as an equispaced grid that brackets the stationary mean on both
sides by 𝑚 standard deviations:

• set 𝑥1 = −𝑚 𝜎𝑥 ,
• set 𝑥𝑛 = 𝑚 𝜎𝑥 and
• set 𝑥 𝑖+1 = 𝑥 𝑖 + 𝑠 where 𝑠 = ( 𝑥𝑛 − 𝑥1 )/( 𝑛 − 1) and 𝑖 in [ 𝑛 − 1].

The next step is to create an 𝑛 × 𝑛 matrix 𝑃 that approximates the dynamics in (3.10).
For 𝑖, 𝑗 ∈ [ 𝑛],

(i) if 𝑗 = 1, then set 𝑃 ( 𝑥𝑖 , 𝑥𝑗 ) = 𝐹 ( 𝑥1 − 𝜌𝑥𝑖 + 𝑠/2);
(ii) if 𝑗 = 𝑛, then set 𝑃 ( 𝑥𝑖 , 𝑥𝑗 ) = 1 − 𝐹 ( 𝑥𝑛 − 𝜌𝑥𝑖 − 𝑠/2);
(iii) otherwise, set 𝑃 ( 𝑥𝑖 , 𝑥𝑗 ) = 𝐹 ( 𝑥𝑗 − 𝜌𝑥𝑖 + 𝑠/2) − 𝐹 ( 𝑥𝑗 − 𝜌𝑥𝑖 − 𝑠/2).

The first two are boundary rules and the third applies Exercise 3.1.11 with 𝑡 = 𝑥𝑗 ,
𝑥 = 𝑥𝑖 and 𝛿 = 𝑠/2.

EXERCISE 3.1.12. Prove that $\sum_{j=1}^{n} P(x_i, x_j) = 1$ for all 𝑖 ∈ [ 𝑛].

1 Tauchen’s method (Tauchen (1986)) is simple but sub-optimal in some cases. For a more general
discretization method and a survey of the literature, see Farmer and Toda (2017).

Figure 3.3: Comparison of 𝜓∗ = 𝑁 ( 𝜇𝑥 , 𝜎²𝑥 ) and its discrete approximant (legend: 𝜓∗ and approx. 𝜓∗)

Finally, if 𝑏 ≠ 0, then we shift the state space to center it on the mean 𝜇 𝑥 of the
stationary distribution 𝑁 ( 𝜇 𝑥 , 𝜎2𝑥 ). This is done by replacing 𝑥 𝑖 with 𝑥 𝑖 + 𝜇 𝑥 for each 𝑖.
Julia routines that compute X and 𝑃 can be found in the library QuantEcon.jl.
Figure 3.3 compares the continuous stationary distribution 𝜓∗ with the unique
stationary distribution of the discrete approximation when X and 𝑃 are constructed as
above, with 𝜌 = 0.9, 𝑏 = 0.0, 𝜈 = 1.0 and discretization parameters 𝑛 = 15 and
𝑚 = 3.
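
The construction above can be sketched in a few lines of Julia. This is an illustrative, self-contained version under stated assumptions, not the QuantEcon.jl routine: `tauchen_sketch`, `std_norm_cdf` and `stationary` are hypothetical names, and the normal CDF is computed with a classical polynomial approximation to the error function (Abramowitz and Stegun 7.1.26) so that no packages are needed. In practice one would more likely call a library CDF.

```julia
using LinearAlgebra

# Hypothetical helper: standard normal CDF via a polynomial approximation
# to erf (absolute error of roughly 1.5e-7).
function std_norm_cdf(z)
    x = abs(z) / sqrt(2)
    t = 1 / (1 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
           t * (-1.453152027 + t * 1.061405429))))
    erf_x = 1 - poly * exp(-x^2)
    return z >= 0 ? 0.5 * (1 + erf_x) : 0.5 * (1 - erf_x)
end

# Tauchen discretization of X' = ρX + b + νε following rules (i)-(iii).
function tauchen_sketch(n, ρ, ν; b=0.0, m=3)
    σ_x = sqrt(ν^2 / (1 - ρ^2))                       # stationary std deviation
    x = collect(range(-m * σ_x, m * σ_x, length=n))   # grid for the case b = 0
    s = x[2] - x[1]                                   # grid step
    F(z) = std_norm_cdf(z / ν)                        # CDF of N(0, ν²)
    P = zeros(n, n)
    for i in 1:n, j in 1:n
        if j == 1
            P[i, j] = F(x[1] - ρ * x[i] + s / 2)
        elseif j == n
            P[i, j] = 1 - F(x[n] - ρ * x[i] - s / 2)
        else
            P[i, j] = F(x[j] - ρ * x[i] + s / 2) - F(x[j] - ρ * x[i] - s / 2)
        end
    end
    return x .+ b / (1 - ρ), P                        # shift grid by μ_x if b ≠ 0
end

# Approximate the stationary distribution by iterating ψ ↦ ψP.
function stationary(P; iters=10_000)
    ψ = fill(1 / size(P, 1), 1, size(P, 1))           # uniform row vector
    for _ in 1:iters
        ψ = ψ * P
    end
    return vec(ψ)
end

x, P = tauchen_sketch(15, 0.9, 1.0)                   # parameters of Figure 3.3
row_sums = vec(sum(P, dims=2))
ψ_star = stationary(P)
```

Plotting `ψ_star` against `x` alongside a suitably scaled normal density gives a comparison in the spirit of Figure 3.3.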

3.2 Conditional Expectations

In this section we discuss how to compute conditional expectations for Markov chains.
The theory will be essential for the study of finite MDPs, since, in these models, lifetime
rewards are mathematical expectations of flow reward functions of Markov states.

3.2.1 Mathematical Expectations

We begin with mathematical expectations of functions of Markov states.

3.2.1.1 Conditional Expectations

Fix $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$. For each $h \in \mathbb{R}^{\mathsf{X}}$, we define

$$
(Ph)(x) = \sum_{x' \in \mathsf{X}} h(x') P(x, x')
\qquad (x \in \mathsf{X}). \tag{3.12}
$$

Noting that 𝑃 ( 𝑥, ·) is the distribution of 𝑋𝑡+1 given 𝑋𝑡 = 𝑥 , we can write

( 𝑃ℎ)( 𝑥 ) = E [ ℎ ( 𝑋𝑡+1 ) | 𝑋𝑡 = 𝑥 ] , (3.13)

where ( 𝑋𝑡 ) is any 𝑃-Markov chain on X. In terms of matrix algebra, viewing ℎ as an
𝑛 × 1 column vector, the expression ( 𝑃ℎ)( 𝑥 ) is one element of the vector 𝑃ℎ obtained
by premultiplying ℎ by 𝑃 .
The interpretation in (3.13) extends to powers of 𝑃 . In particular, we have

$$
(P^k h)(x) = \sum_{x'} h(x') P^k(x, x')
= \mathbb{E}[h(X_{t+k}) \mid X_t = x]. \tag{3.14}
$$

EXERCISE 3.2.1. Show that

(i) Every constant function ℎ ∈ RX is a fixed point of 𝑃 (i.e., 𝑃ℎ = ℎ).


(ii) max 𝑥 | 𝑃ℎ ( 𝑥 )| ⩽ max 𝑥 | ℎ ( 𝑥 )| for all ℎ ∈ RX .
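
A small numerical illustration of (3.12) and (3.14) may help fix ideas. The three-state matrix and the function ℎ below are made-up values chosen only for the example; the code treats ℎ as a column vector and computes conditional expectations by matrix multiplication.

```julia
using LinearAlgebra

P = [0.9 0.1 0.0;
     0.4 0.4 0.2;
     0.1 0.1 0.8]          # illustrative Markov matrix (rows sum to one)
h = [1.0, 2.0, 4.0]        # a function on X = {1, 2, 3} stored as a vector

Ph  = P * h                # (Ph)(x)    = E[h(X_{t+1}) | X_t = x]
P3h = P^3 * h              # (P^3 h)(x) = E[h(X_{t+3}) | X_t = x]

# Numerical checks in the spirit of Exercise 3.2.1
const_h = fill(5.0, 3)
@show P * const_h ≈ const_h
@show maximum(abs.(Ph)) <= maximum(abs.(h))
```

Here `Ph` evaluates to approximately `[1.1, 2.0, 3.5]`, the one-step-ahead conditional expectations from each state.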

3.2.1.2 The Law of Iterated Expectations

The law of iterated expectations is a workhorse in economics and finance. A common
version of the law states that if 𝑋 and 𝑌 are two random variables, then E [E [𝑌 | 𝑋 ]] =
E [𝑌 ]. Let’s show how this law applies for Markov chains.
Let ( 𝑋𝑡 ) be 𝑃-Markov with $X_0 \overset{d}{=} \psi_0$. Fix 𝑡, 𝑘 ∈ N. Set E𝑡 ≔ E [· | 𝑋𝑡 ]. We claim that

$$
\mathbb{E}\left[\mathbb{E}_t[h(X_{t+k})]\right] = \mathbb{E}[h(X_{t+k})]
\quad \text{for any } h \in \mathbb{R}^{\mathsf{X}}. \tag{3.15}
$$

To see this, recall that $\mathbb{E}[h(X_{t+k}) \mid X_t = x] = (P^k h)(x)$. Hence $\mathbb{E}[h(X_{t+k}) \mid X_t] = (P^k h)(X_t)$.
Therefore,

$$
\mathbb{E}\left[\mathbb{E}_t[h(X_{t+k})]\right]
= \mathbb{E}[(P^k h)(X_t)]
= \sum_{x'} (P^k h)(x') \psi_t(x')
= \sum_{x'} (P^k h)(x') (\psi_0 P^t)(x').
$$

Since $\psi_0 P^t$ is a row vector, we can write the last expression as

$$
\psi_0 P^t P^k h = \psi_0 P^{t+k} h = \psi_{t+k} h = \mathbb{E} h(X_{t+k}).
$$

Hence (3.15) holds.
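
The chain of equalities above can be checked numerically. The sketch below uses an arbitrary three-state matrix, an arbitrary initial distribution and arbitrary horizons 𝑡 and 𝑘, all of which are assumptions made for illustration.

```julia
using LinearAlgebra

P  = [0.9 0.1 0.0;
      0.4 0.4 0.2;
      0.1 0.1 0.8]         # illustrative Markov matrix
h  = [1.0, 2.0, 4.0]
ψ0 = [0.2 0.5 0.3]         # initial distribution as a 1 × 3 row vector
t, k = 4, 3

ψt  = ψ0 * P^t                     # distribution of X_t
lhs = (ψt * (P^k * h))[1]          # E[ E_t[h(X_{t+k})] ]
rhs = (ψ0 * P^(t + k) * h)[1]      # E[ h(X_{t+k}) ]
println("difference = ", abs(lhs - rhs))
```

The two sides agree up to floating point error because matrix multiplication is associative.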

3.2.1.3 Monotone Markov Chains

Next we connect Markov chains to order theory via stochastic dominance. These
connections will have many applications below.
Let X be a finite set partially ordered by ⪯. A Markov operator $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ is
called monotone increasing if

$$
x, y \in \mathsf{X} \text{ and } x \preceq y
\implies P(x, \cdot) \preceq_F P(y, \cdot).
$$

Thus, 𝑃 is monotone increasing if shifting up the current state shifts up the next period
state, in the sense that its distribution increases in the stochastic dominance ordering
(see §2.2.4) on D(X). Below, we will see that monotonicity of Markov operators is
closely related to monotonicity of value functions in dynamic programming.
Monotonicity of Markov operators is related to positive autocorrelation. To illus-
trate the idea, consider the AR(1) model 𝑋𝑡+1 = 𝜌𝑋𝑡 + 𝜎𝜀𝑡+1 from §3.1.3 and suppose we
apply Tauchen discretization, mapping the parameters 𝜌, 𝜎 and a discretization size 𝑛
into a Markov operator 𝑃 on state space X = { 𝑥1 , . . . , 𝑥𝑛 } ⊂ R, totally ordered by ⩽. If
𝜌 ⩾ 0, so that positive autocorrelation holds, then 𝑃 is monotone increasing.

EXERCISE 3.2.2. Verify this claim.

EXERCISE 3.2.3. In §3.1.2.1 we discussed a Markov chain

$$
\mathsf{X} = \{1, 2\}
\quad \text{and} \quad
P_w = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}
$$

for some 𝛼, 𝛽 ∈ [0, 1]. Show that 𝑃𝑤 is monotone increasing if and only if 𝛼 + 𝛽 ⩽ 1.

EXERCISE 3.2.4. Prove that 𝑃 is monotone increasing if and only if 𝑃 is invariant
on $i\mathbb{R}^{\mathsf{X}}$; i.e., if $h \in i\mathbb{R}^{\mathsf{X}}$ implies $Ph \in i\mathbb{R}^{\mathsf{X}}$.

EXERCISE 3.2.5. Prove: If 𝑃 is monotone increasing then so is $P^t$ for all 𝑡 ∈ N.
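
When the state space is totally ordered, monotonicity can be tested numerically by comparing upper-tail probabilities across rows, since on a total order first order stochastic dominance reduces to a comparison of tail sums. The sketch below is an illustration under that assumption; `is_monotone_increasing` and `two_state` are hypothetical helper names, and the tolerance is an arbitrary choice.

```julia
# Check monotonicity of a Markov matrix on a totally ordered state space
# {x_1 < ... < x_n}: every upper-tail probability must be weakly larger
# in row i + 1 than in row i.
function is_monotone_increasing(P; tol=1e-12)
    n = size(P, 1)
    # tails[i, k] = Σ_{j ≥ k} P[i, j]
    tails = reverse(cumsum(reverse(P, dims=2), dims=2), dims=2)
    return all(tails[i+1, k] >= tails[i, k] - tol for i in 1:n-1, k in 1:n)
end

# Two-state matrix of the form used in Exercise 3.2.3
two_state(α, β) = [(1 - α) α; β (1 - β)]

mono_a = is_monotone_increasing(two_state(0.3, 0.4))   # α + β ≤ 1
mono_b = is_monotone_increasing(two_state(0.7, 0.6))   # α + β > 1
println((mono_a, mono_b))
```

Comparing adjacent rows suffices because the dominance relation is transitive.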

3.2.2 Geometric Sums

Dynamic programs often form a lifetime value 𝑉0 as a geometric sum of a reward
sequence ( 𝑅𝑡 )𝑡⩾0 with constant discount factor, so that $V_0 = \mathbb{E} \sum_{t=0}^{\infty} \beta^t R_t$ for some 𝛽 > 0.
We saw this in (1.1) on page 2, where we aggregated a profit stream ( 𝜋𝑡 )𝑡⩾0 into an
expected present value of the firm, and again in (1.6) on page 10, where a worker
evaluates lifetime earnings. In this section we study expectations of geometric sums.

3.2.2.1 Theory

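Consider a conditional mathematical expectation of a discounted sum of future
measurements:

$$
v(x) := \mathbb{E}_x \sum_{t=0}^{\infty} \beta^t h(X_t)
:= \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t h(X_t) \,\middle|\, X_0 = x \right] \tag{3.16}
$$

for some constant $\beta \in \mathbb{R}_+$ and $h \in \mathbb{R}^{\mathsf{X}}$. Here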

• ( 𝑋𝑡 ) is 𝑃 -Markov on some finite set X,


• 𝑣 ( 𝑥 ) is a lifetime reward starting from state 𝑥 , and
• E𝑥 indicates that we are conditioning on 𝑋0 = 𝑥 .

With 𝐼 as the identity matrix, the next result describes 𝑣 as a function of 𝛽 , 𝑃 and ℎ.

Lemma 3.2.1. If 𝛽 < 1, then 𝐼 − 𝛽𝑃 is invertible and

$$
v = \sum_{t=0}^{\infty} (\beta P)^t h = (I - \beta P)^{-1} h. \tag{3.17}
$$

Proof. Under the stated conditions

$$
\mathbb{E}_x \sum_{t=0}^{\infty} \beta^t h(X_t)
= \sum_{t=0}^{\infty} \beta^t \mathbb{E}_x h(X_t)
= \sum_{t=0}^{\infty} \beta^t (P^t h)(x), \tag{3.18}
$$

where the first equality in (3.18) uses linearity of expectations and the second follows
from (3.14) and the assumption that ( 𝑋𝑡 ) is 𝑃-Markov starting at 𝑥 .² Applying the
Neumann series lemma (p. 18) to the matrix 𝛽𝑃 , we see that $\sum_{t=0}^{\infty} (\beta P)^t = (I - \beta P)^{-1}$.
The lemma applies because 𝜌 ( 𝛽𝑃 ) = 𝛽𝜌 ( 𝑃 ) = 𝛽 < 1, as follows from Exercise 2.3.2. □
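
Lemma 3.2.1 translates directly into a linear solve. The sketch below, which uses a made-up three-state chain and an arbitrary truncation length, computes 𝑣 via (3.17) and compares it with a truncated version of the Neumann series. The helper name `truncated_sum` is an assumption introduced for the example.

```julia
using LinearAlgebra

P = [0.9 0.1 0.0;
     0.4 0.4 0.2;
     0.1 0.1 0.8]          # illustrative Markov matrix
h = [1.0, 2.0, 4.0]
β = 0.95

v = (I - β * P) \ h        # solves (I - βP) v = h, as in (3.17)

# Truncated series Σ_{t=0}^{T} (βP)^t h for comparison
function truncated_sum(P, h, β; T=2_000)
    total, term = zero(h), copy(h)
    for _ in 0:T
        total += term
        term = β * (P * term)
    end
    return total
end

v_series = truncated_sum(P, h, β)
println("max gap = ", maximum(abs.(v - v_series)))
```

Using the backslash operator avoids forming the inverse explicitly, which is usually preferable numerically.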

3.2.2.2 Application: Valuation of Firms

Consider a firm that receives a random profit stream ( 𝜋𝑡 )𝑡⩾0 . Suppose that the value of
the firm equals the expected present value of its profit stream. Suppose for now that
the interest rate is constant at 𝑟 > 0. With 𝛽 ≔ 1/(1 + 𝑟 ), total valuation is

$$
V_0 = \mathbb{E} \sum_{t=0}^{\infty} \beta^t \pi_t. \tag{3.19}
$$

To compute this value, we need to know how profits evolve. A common strategy is
to set 𝜋𝑡 = 𝜋 ( 𝑋𝑡 ) for some fixed 𝜋 ∈ RX , where ( 𝑋𝑡 )𝑡⩾0 is a state process. For known
dynamics of ( 𝑋𝑡 ) and function 𝜋, the value 𝑉0 in (3.19) can be computed.
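Here we assume that ( 𝑋𝑡 ) is 𝑃-Markov for $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ with finite X. Then,
conditioning on 𝑋0 = 𝑥 , we can write the value as

$$
v(x) := \mathbb{E}_x \sum_{t=0}^{\infty} \beta^t \pi_t
:= \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \pi_t \,\middle|\, X_0 = x \right].
$$

By Lemma 3.2.1, the value 𝑣 ( 𝑥 ) is finite and the function $v \in \mathbb{R}^{\mathsf{X}}$ can be obtained by

$$
v = \sum_{t=0}^{\infty} \beta^t P^t \pi = (I - \beta P)^{-1} \pi.
$$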

It is plausible that the value of the firm will be higher for a return process in
which higher states generate higher profits and predict higher future states. The next
exercise confirms this.

EXERCISE 3.2.6. Let X be partially ordered and suppose that 𝜋 ∈ 𝑖RX and that 𝑃 is
monotone increasing. (See §3.2.1.3 for terminology and notation.) Prove that, under
these conditions, 𝑣 is increasing on X.

2 To justify the first equality, care must be taken when pushing expectations through infinite sums.
In the present setting, justification can be provided via the dominated convergence theorem (see, e.g.,
Dudley (2002), Theorem 4.3.5). A proof of a more general result can be found in §B.2.

3.2.2.3 Application: Valuing Consumption Streams

To model consumption-saving choices we want to evaluate different consumption
paths, where a consumption path is a nonnegative random sequence ( 𝐶𝑡 )𝑡⩾0 . In what
follows we consider consumption paths such that 𝐶𝑡 = 𝑐 ( 𝑋𝑡 ) for all 𝑡 ⩾ 0, where $c \in \mathbb{R}_+^{\mathsf{X}}$
and ( 𝑋𝑡 )𝑡⩾0 is 𝑃-Markov on finite set X. Thus, consumption streams are time-invariant
functions of a finite state Markov chain.
In a standard “time additive” model of consumer preferences with constant
geometric discounting, the time zero value of a consumption stream (𝐶𝑡 )𝑡⩾0 , given current
state 𝑋0 = 𝑥 ∈ X, is

$$
v(x) = \mathbb{E}_x \sum_{t=0}^{\infty} \beta^t u(C_t), \tag{3.20}
$$

where 𝛽 ∈ (0, 1) is a discount factor and 𝑢 : ℝ+ → ℝ is called the flow utility function.
Dependence of 𝑣 ( 𝑥 ) on 𝑥 comes from the initial condition 𝑋0 = 𝑥 influencing the
Markov state process and, therefore, the consumption path.
Using 𝐶𝑡 = 𝑐 ( 𝑋𝑡 ) and defining 𝑟 ≔ 𝑢 ◦ 𝑐 we can write $v(x) = \mathbb{E}_x \sum_{t \geq 0} \beta^t r(X_t)$. By
Lemma 3.2.1, this sum is finite and 𝑣 can be expressed as

$$
v = (I - \beta P)^{-1} r. \tag{3.21}
$$

Figure 3.4 shows an example when 𝑢 has the constant relative risk aversion (CRRA)
specification

$$
u(c) = \frac{c^{1-\gamma}}{1-\gamma}
\qquad (c \geq 0, \; \gamma > 0), \tag{3.22}
$$

while 𝑐 ( 𝑥 ) = exp( 𝑥 ), so that consumption takes the form 𝐶𝑡 = exp( 𝑋𝑡 ), and ( 𝑋𝑡 )𝑡⩾0 is
a Tauchen discretization (see §3.1.3) of 𝑋𝑡+1 = 𝜌𝑋𝑡 + 𝜈𝑊𝑡+1 where (𝑊𝑡 )𝑡⩾1 is IID and
standard normal. Parameters are 𝑛 = 25, 𝛽 = 0.98, 𝜌 = 0.96, 𝜈 = 0.05 and 𝛾 = 2. We
set 𝑟 = 𝑢 ◦ 𝑐 and solved for 𝑣 via (3.21).

EXERCISE 3.2.7. Replicate Figure 3.4.

EXERCISE 3.2.8. The value function in Figure 3.4 appears to be increasing in the
state 𝑥 . Prove this for the CRRA model when 𝜌 ⩾ 0.
Figure 3.4: The value of ( 𝐶𝑡 )𝑡⩾0 given 𝑋𝑡 = 𝑥 (horizontal axis: 𝑥; vertical axis: 𝑣)

3.3 Job Search Revisited


In this section we extend the job search problem studied in §1.3 to a setting with
Markov wage offers. We discuss additional structure when the Markov operator for
wage offers is monotone increasing. We will also allow job separations to occur.

3.3.1 Job Search with Markov State

We adopt the job search setting of §1.3 but assume now that the wage process (𝑊𝑡 ) is
𝑃 -Markov on W ⊂ R+ , where 𝑃 ∈ M ( RW ) and W is finite.

3.3.1.1 Value Function Iteration

The value function 𝑣∗ for the Markov job search model is now defined as follows:
𝑣∗ ( 𝑤) is the maximum lifetime value that can be obtained when the worker is
unemployed with current wage offer 𝑤 in hand. Value function 𝑣∗ satisfies the Bellman
equation

$$
v^*(w) = \max\left\{ \frac{w}{1-\beta}, \;
c + \beta \sum_{w' \in \mathsf{W}} v^*(w') P(w, w') \right\}
\qquad (w \in \mathsf{W}). \tag{3.23}
$$

We continue to assume that 𝑐 > 0 and 𝛽 ∈ (0, 1).

Bellman equation (3.23) extends a corresponding Bellman equation for the IID
case (cf. (1.25) on page 33). (A full proof is given in Chapter 4.) The Bellman
operator corresponding to (3.23) is

$$
(Tv)(w) = \max\left\{ \frac{w}{1-\beta}, \;
c + \beta \sum_{w'} v(w') P(w, w') \right\}
\qquad (w \in \mathsf{W}).
$$

As before, 𝑇 is constructed so that 𝑣∗ is a fixed point (since (3.23) holds). We prove
below that 𝑣∗ is the only fixed point of 𝑇 in $\mathbb{R}_+^{\mathsf{W}}$.
Extending the IID definition (cf. (1.29) on page 35), a policy 𝜎 : W → {0, 1} is
called 𝑣-greedy if

$$
\sigma(w) = \mathbb{1}\left\{ \frac{w}{1-\beta} \geq
c + \beta \sum_{w'} v(w') P(w, w') \right\}
$$

for all 𝑤 ∈ W.
Let $V := \mathbb{R}_+^{\mathsf{W}}$ and endow 𝑉 with the pointwise partial order ⩽ and the supremum
norm, so that $\| f - g \|_\infty = \max_{w \in \mathsf{W}} | f(w) - g(w) |$.
EXERCISE 3.3.1. Prove that
(i) 𝑇 is an order-preserving self-map on 𝑉 .
(ii) 𝑇 is a contraction of modulus 𝛽 on 𝑉 .

We recommend that you study the proof of the next lemma, since the same style
of argument occurs often below.
Lemma 3.3.1. 𝑣∗ is increasing on (W, ⩽) whenever 𝑃 is monotone increasing.

Proof. Let 𝑖𝑉 be the increasing functions in 𝑉 and suppose that 𝑃 is monotone
increasing. 𝑇 is a self-map on 𝑖𝑉 in this setting, since 𝑣 ∈ 𝑖𝑉 implies that
$h(w) := c + \beta \sum_{w'} v(w') P(w, w')$ is in 𝑖𝑉 (monotone increasing 𝑃 maps increasing
functions to increasing functions, as in Exercise 3.2.4). Hence, for such a 𝑣, both ℎ and
the stopping value function 𝑒 ( 𝑤) ≔ 𝑤/(1 − 𝛽 ) are in 𝑖𝑉 . It follows that 𝑇 𝑣 = ℎ ∨ 𝑒 is in 𝑖𝑉 .
Since 𝑖𝑉 is a closed subset of 𝑉 and 𝑇 is a self-map on 𝑖𝑉 , the fixed point 𝑣∗ is in 𝑖𝑉
(see Exercise 1.2.18 on page 22). □

In view of the contraction property established in Exercise 3.3.1, we can use value
function iteration (i) to compute an approximation 𝑣 to the value function and (ii) to
calculate the 𝑣-greedy policy that approximates the optimal policy. Code for imple-
menting this procedure is in Listing 10. The definition of a 𝑣-greedy policy resembles
that for the IID case (see (1.29) on page 35).

using QuantEcon, LinearAlgebra

include("s_approx.jl")    # expected to define successive_approx

"Creates an instance of the job search model with Markov wages."
function create_markov_js_model(;
        n=200,        # wage grid size
        ρ=0.9,        # wage persistence
        ν=0.2,        # wage volatility
        β=0.98,       # discount factor
        c=1.0         # unemployment compensation
    )
    mc = tauchen(n, ρ, ν)
    w_vals, P = exp.(mc.state_values), mc.p
    return (; n, w_vals, P, β, c)
end

"The Bellman operator Tv = max{e, c + β P v} with e(w) = w / (1-β)."
function T(v, model)
    (; n, w_vals, P, β, c) = model
    h = c .+ β * P * v           # continuation values
    e = w_vals ./ (1 - β)        # stopping values
    return max.(e, h)
end

"Get a v-greedy policy."
function get_greedy(v, model)
    (; n, w_vals, P, β, c) = model
    σ = w_vals / (1 - β) .>= c .+ β * P * v
    return σ
end

"Solve the infinite-horizon Markov job search model by VFI."
function vfi(model)
    v_init = zero(model.w_vals)
    v_star = successive_approx(v -> T(v, model), v_init)
    σ_star = get_greedy(v_star, model)
    return v_star, σ_star
end

Listing 10: Job search with Markov state (markov_js.jl)
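
Listing 10 relies on a routine `successive_approx` loaded from `s_approx.jl`, which is not reproduced here. The following self-contained sketch shows one way such a routine might look and applies it to a tiny job search model. Everything in it is an assumption made for illustration: the name `successive_approx_sketch`, its tolerance and iteration cap, and the three-state wage chain, which replaces the Tauchen grid so that no packages are required.

```julia
using LinearAlgebra

# Hypothetical stand-in for successive_approx: iterate T from v_init until
# successive iterates differ by less than tol in the supremum norm.
function successive_approx_sketch(T, v_init; tol=1e-8, max_iter=100_000)
    v = v_init
    for _ in 1:max_iter
        v_new = T(v)
        maximum(abs.(v_new - v)) < tol && return v_new
        v = v_new
    end
    return v
end

# A three-state wage chain chosen for illustration only
w_vals = [1.0, 2.0, 3.0]
P = [0.8 0.2 0.0;
     0.1 0.8 0.1;
     0.0 0.2 0.8]
β, c = 0.9, 2.5

T(v) = max.(w_vals ./ (1 - β), c .+ β * (P * v))        # Bellman operator
v_star = successive_approx_sketch(T, zeros(3))
σ_star = w_vals ./ (1 - β) .>= c .+ β * (P * v_star)    # v_star-greedy policy
println("v* ≈ ", v_star, ", accept = ", σ_star)
```

In this small example the worker accepts only the highest wage, and the computed value function is increasing, in line with Lemma 3.3.1, because the chosen matrix is monotone increasing.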


Figure 3.5: Value, stopping, and continuation for Markov job search (curves: ℎ∗ ( 𝑤), 𝑤/(1 − 𝛽 ) and 𝑣∗ ( 𝑤); horizontal axis: 𝑤)

3.3.1.2 Continuation Values

The continuation value ℎ∗ from the IID case (see page 33) is now replaced by a
continuation value function

$$
h^*(w) := c + \beta \sum_{w'} v^*(w') P(w, w')
\qquad (w \in \mathsf{W}).
$$

The continuation value depends on 𝑤 because the current offer helps predict the offer
next period, which in turn affects the value of continuing. The functions 𝑤 ↦→ 𝑤/(1 − 𝛽 ),
ℎ∗ and 𝑣∗ corresponding to the default model in Listing 10 are shown in Figure 3.5.

EXERCISE 3.3.2. Explain why the continuation value function is increasing in Fig-
ure 3.5. If possible, provide a mathematical and economic explanation.

EXERCISE 3.3.3. Using the Bellman equation (3.23), show that ℎ∗ obeys

$$
h^*(w) = c + \beta \sum_{w'} \max\left\{ \frac{w'}{1-\beta}, \; h^*(w') \right\} P(w, w')
\qquad (w \in \mathsf{W}).
$$

EXERCISE 3.3.4. Let 𝑄 be the operator on 𝑉 defined at ℎ ∈ 𝑉 by

$$
(Qh)(w) := c + \beta \sum_{w'} \max\left\{ \frac{w'}{1-\beta}, \; h(w') \right\} P(w, w')
\qquad (w \in \mathsf{W}). \tag{3.24}
$$

Prove that 𝑄 is (a) an order-preserving self-map on 𝑉 and (b) a contraction of modulus
𝛽 on 𝑉 under the supremum norm.

Exercise 3.3.4 suggests an alternative way to solve the job search problem: iterate
with 𝑄 to obtain the continuation value function ℎ∗ and then use the policy

$$
\sigma^*(w) = \mathbb{1}\left\{ \frac{w}{1-\beta} \geq h^*(w) \right\}
\qquad (w \in \mathsf{W})
$$

that tells the worker to accept when the current stopping value is at least as large as
the current continuation value.
We saw that in the IID case a computational strategy based on continuation values
is far more efficient than value function iteration (see §1.3.2.2). Since continuation
values are functions rather than scalars, here the two approaches (iterating with 𝑇 vs
iterating with 𝑄 ) are more similar. In Chapter 4 we discuss alternative computational
strategies in more detail, seeking conditions under which one approach will be more
efficient than the other.
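
The continuation value approach can be sketched as follows. The three-state wage chain and parameter values are the same made-up ones used for illustration elsewhere and are not taken from the text; `iterate_to_fixed_point` is a hypothetical helper. The operator 𝑄 from (3.24) is written in vector form as 𝑄ℎ = 𝑐 + 𝛽𝑃(𝑒 ∨ ℎ), where 𝑒 holds the stopping values.

```julia
using LinearAlgebra

w_vals = [1.0, 2.0, 3.0]           # illustrative wage grid
P = [0.8 0.2 0.0;
     0.1 0.8 0.1;
     0.0 0.2 0.8]                  # illustrative Markov matrix
β, c = 0.9, 2.5
e = w_vals ./ (1 - β)              # stopping values w / (1 - β)

Q(h) = c .+ β * (P * max.(e, h))   # operator (3.24) in vector form

# Iterate Q until successive iterates are close (a simple sketch)
function iterate_to_fixed_point(Q, h; tol=1e-10, max_iter=100_000)
    for _ in 1:max_iter
        h_new = Q(h)
        maximum(abs.(h_new - h)) < tol && return h_new
        h = h_new
    end
    return h
end

h_star = iterate_to_fixed_point(Q, zeros(3))
σ_star = e .>= h_star              # accept when stopping value is at least h*
v_star = max.(e, h_star)           # value function recovered from h*
println("h* ≈ ", h_star, ", accept = ", σ_star)
```

The recovered `v_star` should coincide with the fixed point of 𝑇 for the same primitives, which gives a convenient cross-check between the two approaches.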

3.3.2 Job Search with Separation

We now modify the job search problem discussed in §3.3.1 by adding separations. In
particular, an existing match between worker and firm terminates with probability 𝛼
every period. (This is an extension because setting 𝛼 = 0 recovers the permanent job
scenario from §3.3.1.)
The worker now views the loss of a job as a capital loss and a spell of unemploy-
ment as an investment. In what follows, the wage process and discount factor are
unchanged from §3.3.1. As before, $V := \mathbb{R}_+^{\mathsf{W}}$ is endowed with the supremum norm.

The value function 𝑣𝑢∗ for an unemployed worker satisfies the recursion

$$
v_u^*(w) = \max\left\{ v_e^*(w), \;
c + \beta \sum_{w' \in \mathsf{W}} v_u^*(w') P(w, w') \right\}
\qquad (w \in \mathsf{W}), \tag{3.25}
$$

where 𝑣𝑒∗ is the value function for an employed worker, i.e., the lifetime value of a
worker who starts the period employed at wage 𝑤. Value function 𝑣𝑒∗ satisfies

$$
v_e^*(w) = w + \beta \left[ \alpha \sum_{w'} v_u^*(w') P(w, w')
+ (1 - \alpha) v_e^*(w) \right]
\qquad (w \in \mathsf{W}). \tag{3.26}
$$
This equation states that the value accruing to an employed worker is the current wage
plus the discounted expected value of being either employed or unemployed next period.
We claim that, when 0 < 𝛼, 𝛽 < 1, the system (3.25)–(3.26) has a unique solution
$(v_e^*, v_u^*)$ in 𝑉 × 𝑉 . To show this we first solve (3.26) for $v_e^*(w)$ to obtain

$$
v_e^*(w) = \frac{1}{1 - \beta(1 - \alpha)}
\left( w + \alpha \beta (P v_u^*)(w) \right). \tag{3.27}
$$

(Recall $(Ph)(w) := \sum_{w'} h(w') P(w, w')$ for $h \in \mathbb{R}^{\mathsf{W}}$.) Substituting into (3.25) yields

$$
v_u^*(w) = \max\left\{ \frac{1}{1 - \beta(1 - \alpha)}
\left( w + \alpha \beta (P v_u^*)(w) \right), \;
c + \beta (P v_u^*)(w) \right\}. \tag{3.28}
$$

EXERCISE 3.3.5. Prove that there exists a unique 𝑣𝑢∗ ∈ 𝑉 that solves (3.28). Propose
a convergent method for computing both 𝑣𝑢∗ and 𝑣𝑒∗ . [Hint: See Lemma 2.2.3 on
page 59.]

Figure 3.6 shows the value function 𝑣𝑢∗ for an unemployed worker, which is the
fixed point of (3.28), as well as the stopping and continuation values, which are given
by

$$
s^*(w) := \frac{1}{1 - \beta(1 - \alpha)} \left( w + \alpha \beta (P v_u^*)(w) \right)
\quad \text{and} \quad
h^*(w) := c + \beta (P v_u^*)(w)
$$

respectively, for each 𝑤 ∈ W. Parameters are as in Listing 11. The value function
𝑣𝑢∗ is the pointwise maximum (i.e., 𝑣𝑢∗ = 𝑠∗ ∨ ℎ∗ ). The worker’s optimal policy while
unemployed is

$$
\sigma^*(w) := \mathbb{1}\{ s^*(w) \geq h^*(w) \}.
$$

As before, the smallest 𝑤 such that 𝜎∗ ( 𝑤) = 1 is called the reservation wage.
Figure 3.7 shows how the reservation wage changes with 𝛼. To produce this figure
we solved the model for the reservation wage at 10 values of 𝛼 in an evenly spaced
grid ranging from 0 to 1. The reservation wage falls with 𝛼, since time spent unemployed
is a capital investment in better wages, and the value of this investment declines as
the separation rate rises.
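
One possible computational approach, sketched under stated assumptions, is to iterate on the right-hand side of (3.28) and then recover the employed value from (3.27). The three-state wage chain, the parameter values and the function name `solve_with_separation` are illustrative choices and do not come from Listing 11.

```julia
using LinearAlgebra

w_vals = [1.0, 2.0, 3.0]           # illustrative wage grid
P = [0.8 0.2 0.0;
     0.1 0.8 0.1;
     0.0 0.2 0.8]                  # illustrative Markov matrix
β, c = 0.9, 2.5

# Iterate on (3.28), then recover stopping and continuation values.
function solve_with_separation(α; tol=1e-10, max_iter=100_000)
    d = 1 / (1 - β * (1 - α))
    v_u = zeros(length(w_vals))
    for _ in 1:max_iter
        Pv = P * v_u
        v_new = max.(d .* (w_vals .+ α * β .* Pv), c .+ β .* Pv)
        done = maximum(abs.(v_new - v_u)) < tol
        v_u = v_new
        done && break
    end
    Pv = P * v_u
    s = d .* (w_vals .+ α * β .* Pv)     # stopping values, equal to v_e in (3.27)
    h = c .+ β .* Pv                     # continuation values
    accept = findfirst(s .>= h)          # first grid point where the worker accepts
    res_wage = accept === nothing ? Inf : w_vals[accept]
    return v_u, s, res_wage
end

v_u0, v_e0, w_bar0 = solve_with_separation(0.0)   # no separation
v_u1, v_e1, w_bar1 = solve_with_separation(0.2)   # separation rate 0.2
println("reservation wages: ", (w_bar0, w_bar1))
```

Setting 𝛼 = 0 reproduces the permanent job model of §3.3.1, which provides a simple consistency check. Looping the function over a grid of 𝛼 values gives the kind of calculation that underlies Figure 3.7.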

EXERCISE 3.3.6. Replicate Figure 3.7.


using QuantEcon, LinearAlgebra

"Creates an instance of the job search model with separation."
function create_js_with_sep_model(;
        n=200,              # wage grid size
        ρ=0.9, ν=0.2,       # wage persistence and volatility
        β=0.98, α=0.1,      # discount factor and separation rate
        c=1.0)              # unemployment compensation
    mc = tauchen(n, ρ, ν)
    w_vals, P = exp.(mc.state_values), mc.p
    return (; n, w_vals, P, β, c, α)
end

Listing 11: Job search with separation model (markov_js_with_sep.jl)

Figure 3.6: Value function with job separation (curves: continuation value, stopping value and 𝑣𝑢∗ ( 𝑤); horizontal axis: 𝑤)

Figure 3.7: Reservation wage vs separation rate (vertical axis: reservation wage)

3.4 Chapter Notes

Many good textbooks on Markov chains exist, including Norris (1998), Häggström
et al. (2002) and Privault (2013). Sargent and Stachurski (2023b) provides a rel-
atively comprehensive treatment from a network perspective that is a natural one
for Markov chains. Other economic applications are discussed in Stokey and Lucas
(1989) and Ljungqvist and Sargent (2018). Meyer (2000) gives a detailed account
of the theory of nonnegative matrices. Another useful reference is Horn and Johnson
(2012).
A systematic study of monotone Markov chains was initiated by Daley (1968).
Monotone Markov methods have many important applications in economics. See,
for example, Hopenhayn and Prescott (1992), Kamihigashi and Stachurski (2014),
Jaśkiewicz and Nowak (2014), Balbus et al. (2014), Foss et al. (2018) and Hu and
Shmaya (2019).
Chapter 4

Optimal Stopping

We study problems of maximizing lifetime rewards in settings in which decision
makers face risks. The job search model studied in Chapters 1 and 3 is one example.
Others include an entrepreneur who decides whether to exit or enter a market, a
borrower who considers defaulting on a loan, a firm that contemplates introducing a
new technology, or a portfolio manager deciding whether to exercise a real or
financial option.
These can all be formulated as dynamic programs and have common features
that facilitate sharp characterizations of optimality. They are all two-action (or binary
choice) problems that provide good laboratories for studying some special dynamic
programs in which recursive representations are particularly enlightening.

4.1 Introduction to Optimal Stopping

We begin with a standard theory of optimal stopping and then consider alternative
approaches that feature continuation values and threshold policies. We aim to provide
a rigorous discussion of optimality that refines our less formal analysis of job search
in §1.3 and §3.3.1.

4.1.1 Theory

Our first step is to set out the fundamental theory of discrete time infinite-horizon
optimal stopping problems.

CHAPTER 4. OPTIMAL STOPPING 106

4.1.1.1 The Stopping Problem

Let X be a finite set. Given X, an optimal stopping problem is a tuple S = ( 𝛽, 𝑃, 𝑐, 𝑒)
that consists of

(i) a discount factor 𝛽 ∈ (0, 1),
(ii) a Markov operator $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$,
(iii) a continuation reward function $c \in \mathbb{R}^{\mathsf{X}}$, and
(iv) an exit reward function $e \in \mathbb{R}^{\mathsf{X}}$.

Given a 𝑃-Markov chain ( 𝑋𝑡 )𝑡⩾0 , a decision maker observes the state 𝑋𝑡 in each
period and decides whether to continue or stop. If she chooses to stop, she receives
final reward 𝑒 ( 𝑋𝑡 ) and the process terminates. If she decides to continue, then she
receives 𝑐 ( 𝑋𝑡 ) and the process repeats next period. Lifetime rewards are

$$
\mathbb{E} \sum_{t \geq 0} \beta^t R_t,
$$

where 𝑅𝑡 equals 𝑐 ( 𝑋𝑡 ) while the agent continues, 𝑒 ( 𝑋𝑡 ) when the agent stops, and zero
thereafter.

Example 4.1.1. Consider the infinite-horizon job search problem from Chapter 1,
where the wage offer process (𝑊𝑡 ) is IID with common distribution 𝜑 on finite set
W. This is an optimal stopping problem with state space X = W and 𝑃 ∈ M ( RX )
having all rows equal to 𝜑, so that all draws are IID from 𝜑. The exit reward function
is 𝑒 ( 𝑥 ) = 𝑥 /(1 − 𝛽 ) and the continuation reward function is constant and equal to
unemployment compensation.

Example 4.1.2. Consider an infinite-horizon American call option that provides the
right to buy a given asset at strike price 𝐾 at each future date. The market price
of the asset is 𝑆𝑡 = 𝑠 ( 𝑋𝑡 ), where ( 𝑋𝑡 ) is 𝑃 -Markov on finite set X and 𝑠 ∈ RX . The
interest rate is 𝑟 > 0. Deciding when to exercise is an optimal stopping problem, with
exit corresponding to exercising the option. The discount factor is 1/(1 + 𝑟 ), the exit
reward function is 𝑒 ( 𝑥 ) ≔ 𝑠 ( 𝑥 ) − 𝐾 and the continuation reward is zero.¹

Optimal decisions are described by a policy function, which is a map 𝜎 from X
to {0, 1}. After observing state 𝑥 at any given time, the decision maker takes action
𝜎 ( 𝑥 ), where 0 means “continue” and 1 means “stop.” Implicit in this formulation is
the assumption that the current state contains enough information for the agent to
decide whether or not to stop.

1 We are studying American options in discrete time. Options with discrete exercise times are
sometimes called Bermudan options. References for the continuous time case are provided in §4.3.
Let Σ be the set of functions from X to {0, 1}. Let 𝑣𝜎 ( 𝑥 ) denote the expected lifetime
value of following policy 𝜎 now and in every future period, given optimal stopping
problem S = ( 𝛽, 𝑃, 𝑐, 𝑒) and current state 𝑥 ∈ X. We call 𝑣𝜎 the 𝜎-value function. We
also call 𝑣𝜎 ( 𝑥 ) the lifetime value of policy 𝜎 conditional on initial state 𝑥 . Section
4.1.1.2 shows that 𝑣𝜎 is well defined and describes how to calculate it. A policy
𝜎∗ ∈ Σ is called optimal for S if

$$
v_{\sigma^*}(x) = \max_{\sigma \in \Sigma} v_\sigma(x)
\quad \text{for all } x \in \mathsf{X}. \tag{4.1}
$$

4.1.1.2 Lifetime Values

Fixing 𝜎 ∈ Σ, let us consider how to compute the lifetime value 𝑣𝜎 ( 𝑥 ) of following 𝜎
conditional on 𝑋0 = 𝑥 . Evidently, 𝑣𝜎 satisfies

$$
v_\sigma(x) = \sigma(x) e(x) + (1 - \sigma(x))
\left[ c(x) + \beta \sum_{x' \in \mathsf{X}} v_\sigma(x') P(x, x') \right]
\quad \text{for all } x \in \mathsf{X}. \tag{4.2}
$$

Indeed, if 𝜎 ( 𝑥 ) = 1, then (4.2) states that 𝑣𝜎 ( 𝑥 ) = 𝑒 ( 𝑥 ), which is what we expect: if
we choose to stop at a given state, then lifetime value from that state equals the exit
reward. If, instead, 𝜎 ( 𝑥 ) = 0, then (4.2) becomes

$$
v_\sigma(x) = c(x) + \beta \sum_{x'} v_\sigma(x') P(x, x'), \tag{4.3}
$$

which is also what we expect: the value of continuing is the current reward plus the
discounted expected reward obtained by continuing with policy 𝜎 next period.
We want to solve (4.2) for 𝑣𝜎 . To this end, we define $r_\sigma \in \mathbb{R}^{\mathsf{X}}$ and $L_\sigma \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ via

$$
r_\sigma(x) := \sigma(x) e(x) + (1 - \sigma(x)) c(x)
\quad \text{and} \quad
L_\sigma(x, x') := \beta (1 - \sigma(x)) P(x, x').
$$

With this notation, we can write (4.2) pointwise as 𝑣𝜎 = 𝑟𝜎 + 𝐿𝜎 𝑣𝜎 . If 𝜌 ( 𝐿𝜎 ) < 1, then

$$
v_\sigma = (I - L_\sigma)^{-1} r_\sigma. \tag{4.4}
$$

EXERCISE 4.1.1. Confirm that 𝜌 ( 𝐿𝜎 ) < 1 holds for any optimal stopping problem.

By Exercise 4.1.1 and the Neumann series lemma, 𝑣𝜎 is uniquely defined by (4.4).
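
To make (4.4) concrete, the sketch below evaluates a fixed policy in a small optimal stopping problem. The three-state chain, the rewards and the policy are assumptions chosen only for illustration. Multiplying row 𝑥 of 𝑃 by 𝛽(1 − 𝜎(𝑥)) is done by broadcasting a column vector against the matrix.

```julia
using LinearAlgebra

P = [0.8 0.2 0.0;
     0.1 0.8 0.1;
     0.0 0.2 0.8]                  # illustrative Markov matrix
β = 0.9
c = fill(2.5, 3)                   # continuation rewards
e = [10.0, 20.0, 30.0]             # exit rewards
σ = [0, 0, 1]                      # policy: stop only in the highest state

r_σ = σ .* e .+ (1 .- σ) .* c      # r_σ(x) = σ(x)e(x) + (1 - σ(x))c(x)
L_σ = β .* (1 .- σ) .* P           # row x of P scaled by β(1 - σ(x))
v_σ = (I - L_σ) \ r_σ              # lifetime value of σ, as in (4.4)

spectral_radius = maximum(abs.(eigvals(L_σ)))
println("ρ(L_σ) = ", spectral_radius, ", v_σ = ", v_σ)
```

The row of `L_σ` for the stopping state is zero, so the corresponding entry of `v_σ` equals the exit reward, as the discussion of (4.2) suggests.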

4.1.1.3 Policy Operators

For the proofs below, it is helpful to view 𝑣𝜎 as the fixed point of an operator. We
associate each 𝜎 ∈ Σ with a policy operator 𝑇𝜎 defined at $v \in \mathbb{R}^{\mathsf{X}}$ by

$$
(T_\sigma v)(x) = \sigma(x) e(x) + (1 - \sigma(x))
\left[ c(x) + \beta \sum_{x'} v(x') P(x, x') \right] \tag{4.5}
$$

for each 𝑥 ∈ X. With this notation, (4.2) can be written as 𝑣𝜎 = 𝑇𝜎 𝑣𝜎 .


EXERCISE 4.1.2. Prove that, for any 𝜎 ∈ Σ, the operator 𝑇𝜎 is order-preserving with
respect to the pointwise partial order ⩽ on RX .

Using the notation in §4.1.1.2, we can also define 𝑇𝜎 via

𝑇𝜎 𝑣 = 𝑟𝜎 + 𝐿𝜎 𝑣.

Proposition 4.1.1. For any 𝜎 ∈ Σ, the policy operator 𝑇𝜎 is a contraction of modulus 𝛽
on $\mathbb{R}^{\mathsf{X}}$ under the supremum norm.

The significance of Proposition 4.1.1 is that by construction 𝑣𝜎 is a fixed point of
𝑇𝜎 . By the contraction property in Proposition 4.1.1, 𝑣𝜎 is the only fixed point of 𝑇𝜎 in
$\mathbb{R}^{\mathsf{X}}$ and, moreover, iterates of 𝑇𝜎 always converge to 𝑣𝜎 .
EXERCISE 4.1.3. Prove Proposition 4.1.1.

4.1.1.4 The Value Function

In the job search problem in §3.3.1, we argued that the value function equals the fixed
point of the Bellman operator. Here we make the same argument more formally in
the more general setting of optimal stopping.
First, given an optimal stopping problem S = ( 𝛽, 𝑃, 𝑐, 𝑒) with 𝜎-value functions
{ 𝑣𝜎 }𝜎∈Σ , we define the value function 𝑣∗ of S via

$$
v^*(x) := \max_{\sigma \in \Sigma} v_\sigma(x)
\qquad (x \in \mathsf{X}), \tag{4.6}
$$

so that 𝑣∗ ( 𝑥 ) is the maximal lifetime value available to an agent facing current state
𝑥 . Following notation in §2.2.2.1, we can also write 𝑣∗ = ∨𝜎 𝑣𝜎 .
Given that solving the maximization in (4.6) is, in general, a difficult problem,
how can we obtain the value function? The following steps can do the job:

(i) formulate a Bellman equation for the value function of the optimal stopping
problem, namely,

$$
v(x) = \max\left\{ e(x), \; c(x) + \beta \sum_{x'} v(x') P(x, x') \right\}
\qquad (x \in \mathsf{X}), \tag{4.7}
$$

(ii) prove that this Bellman equation has a unique solution in $\mathbb{R}^{\mathsf{X}}$, and then
(iii) show that this solution equals the value function, as defined in (4.6).

We shall complete these steps in §4.1.1.5.

4.1.1.5 The Bellman Operator

Define the Bellman operator for the optimal stopping problem S = ( 𝛽, 𝑃, 𝑐, 𝑒) as

$$
(Tv)(x) = \max\left\{ e(x), \; c(x) + \beta \sum_{x'} v(x') P(x, x') \right\} \tag{4.8}
$$

where 𝑥 ∈ X and $v \in \mathbb{R}^{\mathsf{X}}$. By construction, any fixed point of 𝑇 solves the Bellman
equation and vice versa. Pointwise, we can express 𝑇 via 𝑇 𝑣 = 𝑒 ∨ ( 𝑐 + 𝛽𝑃𝑣).

EXERCISE 4.1.4. Prove that 𝑇 is an order-preserving self-map on RX .

Our main result for this section is:

Proposition 4.1.2. If S is an optimal stopping problem with Bellman operator 𝑇 and
value function 𝑣∗ , then

(i) 𝑇 is a contraction map of modulus 𝛽 on $\mathbb{R}^{\mathsf{X}}$ under the supremum norm $\| \cdot \|_\infty$ and
(ii) the unique fixed point of 𝑇 on $\mathbb{R}^{\mathsf{X}}$ is the value function 𝑣∗ .

EXERCISE 4.1.5. Prove the claim in (i) of Proposition 4.1.2.

Proof of Proposition 4.1.2. With the result of Exercise 4.1.5 in hand, we need only
show that the unique fixed point $\bar v$ of 𝑇 in $\mathbb{R}^{\mathsf{X}}$ is equal to 𝑣∗ = ∨𝜎 𝑣𝜎 . We show $\bar v \leq v^*$
and then $\bar v \geq v^*$.

For the first inequality, let 𝜎 ∈ Σ be defined by

$$
\sigma(x) = \mathbb{1}\left\{ e(x) \geq c(x) + \beta \sum_{x'} \bar v(x') P(x, x') \right\}
\quad \text{for all } x \in \mathsf{X}.
$$

Observe that for this choice of 𝜎 we have, for any 𝑥 ∈ X,

$$
\begin{aligned}
(T_\sigma \bar v)(x)
&= \sigma(x) e(x) + (1 - \sigma(x))
\left[ c(x) + \beta \sum_{x'} \bar v(x') P(x, x') \right] \\
&= \max\left\{ e(x), \; c(x) + \beta \sum_{x'} \bar v(x') P(x, x') \right\}
= (T \bar v)(x) = \bar v(x).
\end{aligned}
$$

In particular, $T_\sigma \bar v = \bar v$. But the only fixed point of 𝑇𝜎 in $\mathbb{R}^{\mathsf{X}}$ is 𝑣𝜎 , so $\bar v = v_\sigma$. But
then $\bar v \leq v^*$, by the definition of 𝑣∗ . This is our first inequality.
Regarding the second, fix 𝜎 ∈ Σ and observe that 𝑇 𝑣 ⩾ 𝑇𝜎 𝑣 for all $v \in \mathbb{R}^{\mathsf{X}}$. Since
𝑇 is order-preserving and globally stable, Proposition 2.2.7 on page 67 implies that
$v_\sigma \leq \bar v$. Taking the maximum over 𝜎 ∈ Σ yields $v^* \leq \bar v$. □

4.1.1.6 Optimal Policies

Paralleling the definition provided in the discussion of job search (§1.3), for each
$v \in \mathbb{R}^{\mathsf{X}}$, we call 𝜎 ∈ Σ 𝑣-greedy if, for all 𝑥 ∈ X,

$$
\sigma(x) \in \operatorname*{argmax}_{a \in \{0, 1\}}
\left\{ a e(x) + (1 - a)
\left[ c(x) + \beta \sum_{x'} v(x') P(x, x') \right] \right\}. \tag{4.9}
$$

A 𝑣-greedy policy uses 𝑣 to assign values to states and then chooses to stop or continue
based on the action that generates a higher payoff.
With this language in place, the next proposition makes precise our informal §1.1.2
argument that optimal choices can be made using the value function.

Proposition 4.1.3. Policy 𝜎 ∈ Σ is optimal if and only if it is 𝑣∗ -greedy.

Proposition 4.1.3 is a version of Bellman’s principle of optimality. We shall prove
this principle in a more general setting in Chapter 5.

4.1.1.7 Value Function Iteration

The theory stated above tells us that successive approximation using the Bellman
operator converges to 𝑣∗ and that 𝑣∗ -greedy policies are optimal. These facts make value
function iteration (VFI) a natural algorithm for solving optimal stopping problems.
(VFI for optimal stopping problems corresponds to VFI for job search, as shown on
page 37.) Later, in Theorem 8.1.1, we will show that when the number of iterates is
sufficiently large, VFI produces an optimal policy.
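
A generic VFI routine for optimal stopping can be sketched as below. This is a minimal illustration, not the book's implementation: the names `bellman` and `solve_stopping`, the stopping tolerance and the small firm-exit-style instance (three states, constant exit reward) are all assumptions introduced for the example.

```julia
using LinearAlgebra

# Bellman operator (4.8) in vector form: Tv = e ∨ (c + βPv)
bellman(v, β, P, c, e) = max.(e, c .+ β * (P * v))

# Value function iteration for an optimal stopping problem (β, P, c, e).
function solve_stopping(β, P, c, e; tol=1e-10, max_iter=100_000)
    v = zero(e)
    for _ in 1:max_iter
        v_new = bellman(v, β, P, c, e)
        done = maximum(abs.(v_new - v)) < tol
        v = v_new
        done && break
    end
    σ = e .>= c .+ β * (P * v)      # v-greedy policy: true means stop
    return v, σ
end

# Illustrative instance: exit reward 10 in every state
P = [0.8 0.2 0.0;
     0.1 0.8 0.1;
     0.0 0.2 0.8]
c = [0.2, 1.0, 1.5]                 # continuation rewards
e = fill(10.0, 3)                   # exit rewards
v_star, σ_star = solve_stopping(0.9, P, c, e)
println("v* ≈ ", v_star, ", stop = ", σ_star)
```

In this instance the computed policy stops only in the lowest state, where the continuation reward is small and low states are persistent.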

4.1.2 Firm Valuation with Exit

In §3.2.2.2 we discussed firm valuation using expected present value of the cash flow
generated by profits. This is a standard approach. However, it ignores that firms
have the option to cease operations and sell all remaining assets. In this section, we
consider firm valuation in the presence of an exit option.

4.1.2.1 Optional Exit

Consider a firm whose productivity is exogenous and evolves according to a 𝑄-Markov
chain ( 𝑍𝑡 ) on finite set Z ⊂ ℝ. Profits are given by 𝜋𝑡 = 𝜋 ( 𝑍𝑡 ) for some fixed $\pi \in \mathbb{R}^{\mathsf{Z}}$. At
the start of each period, the firm decides whether to remain in operation and receive
current profit 𝜋𝑡 , or to exit and receive scrap value 𝑠 > 0 for sale of physical assets.
Discounting is at fixed rate 𝑟 and 𝛽 ≔ 1/(1 + 𝑟 ). We assume that 𝑟 > 0.
Let Σ be all 𝜎 : Z → {0, 1}. For given 𝜎 ∈ Σ and $v \in \mathbb{R}^{\mathsf{Z}}$, the corresponding policy
operator is

$$
(T_\sigma v)(z) = \sigma(z) s + (1 - \sigma(z))
\left[ \pi(z) + \beta \sum_{z'} v(z') Q(z, z') \right]
\qquad (z \in \mathsf{Z}).
$$

We saw in §4.1.1.2–§4.1.1.3 that 𝑇𝜎 has a unique fixed point 𝑣𝜎 and that 𝑣𝜎 ( 𝑧) repre-
sents the value of following policy 𝜎 forever, conditional on 𝑍0 = 𝑧.
The Bellman operator for the firm’s problem is the order-preserving self-map 𝑇 on
RZ defined by
( )
Õ
(𝑇 𝑣)( 𝑧) = max 𝑠, 𝜋 ( 𝑧) + 𝛽 𝑣 ( 𝑧0) 𝑄 ( 𝑧, 𝑧0) ( 𝑧 ∈ Z) .
𝑧0

Pointwise, 𝑇 can be written as 𝑇 𝑣 = 𝑠 ∨ ( 𝜋 + 𝛽𝑄𝑣).


Figure 4.1: Value function for firms with exit option (curves 𝑣∗, ℎ∗ and 𝑠 plotted against 𝑧)

Let 𝑣∗ be the value function for this problem. By Proposition 4.1.2, 𝑣∗ is the unique
fixed point of 𝑇 in RZ and the unique solution to the Bellman equation. Moreover, suc-
cessive approximation from any 𝑣 ∈ RZ converges to 𝑣∗ . Finally, by Proposition 4.1.3,
a policy is optimal if and only if it is 𝑣∗ -greedy.
Figure 4.1 plots 𝑣∗, computed via value function iteration (i.e., successive approximation
using 𝑇), along with the stopping value 𝑠 and the continuation value function
ℎ∗ = 𝜋 + 𝛽𝑄𝑣∗, under the parameterization given in Listing 12. As implied by the Bellman
equation, 𝑣∗ is the pointwise maximum of 𝑠 and ℎ∗. The 𝑣∗-greedy policy instructs
the firm to exit when the continuation value of the firm falls below the scrap value.

EXERCISE 4.1.6. Replicate Figure 4.1 by using the parameters in Listing 12 and
applying value function iteration. Reviewing the code for job search on page 99 should
be helpful.

4.1.2.2 Exit vs No-Exit


If we define 𝑤 by $w(z) = \mathbb{E}_z \sum_{t \geq 0} \beta^t \pi_t$ for all 𝑧 ∈ Z, then 𝑤(𝑧) is the value of the
firm given 𝑍0 = 𝑧 when the firm never exits so that 𝑤 evaluates the firm according
to expected present value of the profit stream. Figure 4.2 shows the no-exit value 𝑤
based on the parameterization in Listing 12.
In Figure 4.2, we see that 𝑤 ⩽ 𝑣∗ on Z. Let’s now prove that this is always true.

using QuantEcon

"Creates an instance of the firm exit model."
function create_exit_model(;
        n=200,                  # productivity grid size
        ρ=0.95, μ=0.1, ν=0.1,   # persistence, mean and volatility
        β=0.98, s=100.0         # discount factor and scrap value
    )
    mc = tauchen(n, ρ, ν, μ)    # discretize the productivity process
    z_vals, Q = mc.state_values, mc.p
    return (; n, z_vals, Q, β, s)
end

Listing 12: Firm exit model (firm_exit.jl)

Figure 4.2: Firm value with and without exit (curves 𝑣∗ and the no-exit value 𝑤 plotted against 𝑧)


To show 𝑤 ⩽ 𝑣∗, first observe that $w = (I - \beta Q)^{-1} \pi$, by 𝛽 < 1 and Lemma 3.2.1 on
page 94. Rearranging gives 𝑤 = 𝜋 + 𝛽𝑄𝑤.
Now note that under the policy 𝜎 ≡ 0, where the firm chooses never to exit, we
have 𝑇𝜎 𝑣 = 𝜋 + 𝛽𝑄𝑣. Hence the unique fixed point of 𝑇𝜎 is 𝑤. As a result, 𝑤 = 𝑣𝜎 for
𝜎 ≡ 0. But 𝑣∗ ⩾ 𝑣𝜎 for all 𝜎 ∈ Σ. This proves that 𝑤 ⩽ 𝑣∗ .
Choosing never to exit is a feasible policy. Since 𝑣∗ involves maximization of firm
value over the set of all feasible policies, it must be at least as large as the value of
never exiting.
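
The inequality 𝑤 ⩽ 𝑣∗ can be checked numerically. The sketch below assumes a two-state chain and profits 𝜋(𝑧) = 𝑧. Both are hypothetical choices, since Listing 12 does not specify a profit function, and the helper name `exit_vfi` is also invented here.

```julia
using LinearAlgebra

# Hypothetical primitives (assumptions for this example only)
Q = [0.9 0.1; 0.2 0.8]        # productivity transitions
π_vals = [1.0, 3.0]           # profits, here π(z) = z
β, s = 0.95, 40.0             # discount factor and scrap value

# No-exit value: w = (I - βQ)⁻¹ π
w = (I - β * Q) \ π_vals

# Value with the exit option via successive approximation with T v = s ∨ (π + βQv)
function exit_vfi(π_vals, Q, β, s; tol=1e-10, max_iter=100_000)
    v = fill(s, length(π_vals))
    for _ in 1:max_iter
        v_new = max.(s, π_vals .+ β .* (Q * v))
        err = maximum(abs.(v_new .- v))
        v = v_new
        err < tol && break
    end
    return v
end

v_star = exit_vfi(π_vals, Q, β, s)
```

Comparing `w` and `v_star` element by element shows the option value of exit in each state.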

EXERCISE 4.1.7. Prove the following: If 𝑄 ≫ 0 and 𝑠 > 𝑤(𝑧) for at least one 𝑧 ∈ Z,
then 𝑤 ≪ 𝑣∗. Provide some intuition for this result.

EXERCISE 4.1.8. Consider a version of the model of firm value with exit where
productivity is constant but prices are stochastic. In particular, the price process ( 𝑃𝑡 )
for the final good is 𝑄 -Markov. Suppose further that one-period profits for a given
price 𝑝 are $\max_{\ell \geq 0} \pi(\ell, p)$, where ℓ is labor input. Suppose that $\pi(\ell, p) = p \ell^{1/2} - w \ell$,
where the wage rate 𝑤 is constant. Formulate the Bellman equation.

4.1.3 Monotonicity

We study monotonicity in values and actions in the general optimal stopping problem
described in §4.1.1, with X as the state space, 𝑒 as the exit reward function and 𝑐 as
the continuation reward function.

4.1.3.1 Monotone Values

Let 𝑣∗ be the value function of an optimal stopping problem defined by X, 𝑃, 𝛽, 𝑐 and
𝑒 and define a continuation value function ℎ∗ by
\[
    h^*(x) := c(x) + \beta \sum_{x'} v^*(x') P(x, x')
    \qquad (x \in \mathsf{X}).
    \tag{4.10}
\]

(The continuation reward function 𝑐 and the continuation value function ℎ∗ are dis-
tinct objects.)
Let X be partially ordered and let 𝑖RX be the increasing functions in RX .
Lemma 4.1.4. If 𝑒, 𝑐 ∈ 𝑖RX and 𝑃 is monotone increasing, then ℎ∗ and 𝑣∗ are both
increasing.

Proof. Let the stated conditions hold. The Bellman operator can be written pointwise
as 𝑇 𝑣 = 𝑒 ∨ ( 𝑐 + 𝛽𝑃𝑣). Since 𝑃 is monotone increasing, 𝑃 is invariant on 𝑖RX . It follows
from this fact and the conditions on 𝑒 and 𝑐 that 𝑇 is invariant on 𝑖RX . Hence, by
Exercise 1.2.18 on page 22, 𝑣∗ is in 𝑖RX . Since ℎ∗ = 𝑐 + 𝛽𝑃𝑣∗ , the same is true for
ℎ∗ . □
Example 4.1.3. Consider the §4.1.2 firm problem with exit with Bellman operator
𝑇 𝑣 = 𝑠 ∨ ( 𝜋 + 𝛽𝑄𝑣). Since 𝑠 is constant, it follows directly that 𝑣∗ and ℎ∗ are both
increasing functions when 𝜋 ∈ 𝑖RZ and 𝑄 is monotone increasing.

4.1.3.2 Monotone Actions

The optimal policy in the IID job search problem takes the form 𝜎∗ ( 𝑤) = 1{𝑤 ⩾ 𝑤∗ }
for all 𝑤 ∈ W, where 𝑤∗ ≔ (1 − 𝛽 ) ℎ∗ is the reservation wage and ℎ∗ is the continuation
value (see page 36). This optimal policy is of threshold type: once the wage offer
exceeds the threshold, the decision is to stop.
Since threshold policies are convenient, let us now try to characterize them.
Throughout this section, we take X to be a subset of R. Elements of X are ordered
by ⩽, the usual order on R.

EXERCISE 4.1.9. Prove that the optimal policy 𝜎∗ is decreasing on X whenever 𝑒 is
decreasing on X and ℎ∗ is increasing on X.

For a binary function on X ⊂ R, the condition that 𝜎∗ is decreasing means that the
decision maker chooses to exit when 𝑥 is sufficiently small.
Example 4.1.4. In the firm problem with exit, as described in §4.1.2, ℎ∗ is increasing
whenever 𝜋 ∈ 𝑖RZ and 𝑄 is monotone increasing. Since the scrap value is constant,
Exercise 4.1.9 applies under these conditions. Hence the optimal policy is decreasing.
This reasoning agrees with Figure 4.1, where exit is optimal when the state is small
and continuing is optimal when 𝑧 is large. This makes sense: since 𝑄 is monotone
increasing, low current values of 𝑧 predict low future values of 𝑧, so profits associated
with continuing can be anticipated to be low.

EXERCISE 4.1.10. Show that the conditions of Exercise 4.1.9 hold when 𝑒 is con-
stant on X, 𝑐 is increasing on X, and 𝑃 is monotone increasing.

EXERCISE 4.1.11. Prove that the optimal policy 𝜎∗ is increasing on X whenever 𝑒
is increasing on X and ℎ∗ is decreasing on X.

Example 4.1.5. In the IID job search problem, 𝑒 ( 𝑤) = 𝑤/(1 − 𝛽 ) is increasing and ℎ∗ is
constant. Hence the result in Exercise 4.1.11 applies. This is why the optimal policy
𝜎∗ ( 𝑤) = 1{𝑤 ⩾ (1 − 𝛽 ) ℎ∗ } is increasing. The agent accepts all sufficiently large wage
offers.

In the settings of Exercises 4.1.9–4.1.11, the optimal policy is either increasing or
decreasing. Since X is totally ordered, monotonicity implies that a threshold policy
is optimal. For example, if 𝜎∗ is increasing, then we take 𝑥 ∗ to be the smallest 𝑥 ∈ X
such that 𝜎∗ ( 𝑥 ) = 1. For such an 𝑥 ∗ we have

𝑥 < 𝑥 ∗ =⇒ 𝜎∗ ( 𝑥 ) = 0 and 𝑥 ⩾ 𝑥 ∗ =⇒ 𝜎∗ ( 𝑥 ) = 1.
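
For a policy stored as a Boolean vector over an ordered grid, the threshold can be read off directly. The grid `x_vals` and the policy `σ` below are hypothetical values chosen only to illustrate the construction.

```julia
x_vals = [1.0, 2.0, 3.0, 4.0]           # hypothetical ordered state grid
σ = [false, false, true, true]          # hypothetical increasing policy (true means stop)

@assert issorted(σ)                     # increasing: all false entries precede all true entries
i = findfirst(σ)                        # index of the smallest x with σ(x) = 1
x_star = i === nothing ? nothing : x_vals[i]
```

When the policy never stops, `findfirst` returns `nothing` and no threshold exists on the grid.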

Remark 4.1.1. Conditions in Exercises 4.1.9–4.1.11 are sufficient but not necessary
for monotone policies. Figure 3.5 on page 100 provides an example of a setting where the
policy is increasing (the agent accepts for sufficiently large wage offers) even though
both 𝑒 ( 𝑥 ) = 𝑥 /(1 − 𝛽 ) and ℎ∗ are strictly increasing.

4.1.4 Continuation Values


In §1.3.2.2 we solved the job search problem with IID draws by computing the continu-
ation value ℎ∗ directly and then setting the optimal policy to 𝜎∗ ( 𝑤) = 1 {𝑤/(1 − 𝛽 ) ⩾ ℎ∗ }.
We saw that this approach is more efficient than first computing the value function,
since the continuation value is one-dimensional rather than |W|-dimensional.
In §3.3.1.2, we tried the same approach for the job search problem with Markov
state, where wage draws are correlated. We gathered fewer benefits from using the
continuation value approach in that setting, since the continuation value function has
the same dimensionality as the value function.
These observations motivate us to explore continuation value methods more care-
fully. In this section, we formulate a continuation value approach for the general
optimal stopping problem and verify convergence. We will see that, while all relevant
state components must be included in the value function, purely transitory compo-
nents do not affect continuation values. Hence the continuation value approach is at
least as efficient and sometimes substantially more so.
Another asymmetry between value functions and continuation value functions is
that the latter are typically smoother. For example, in job search problems, the value
function is usually kinked at the reservation wage, while the continuation value func-
tion is smooth. Greater smoothness comes from taking expectations over stochastic
transitions: integration acts as a smoothing operation. Like lower dimensionality,
increased smoothness facilitates analysis and computation.

4.1.4.1 The Continuation Value Operator

Let ℎ∗ be the continuation value function for the optimal stopping problem defined
in (4.10). To compute ℎ∗ directly we begin with the optimal stopping version of the
Bellman equation evaluated at 𝑣∗ and rewrite it as

\[
    v^*(x') = \max \{ e(x'), \, h^*(x') \}
    \qquad (x' \in \mathsf{X}).
    \tag{4.11}
\]

Taking expectations of both sides of the equation conditional on current state 𝑥 produces
$\sum_{x'} v^*(x') P(x, x') = \sum_{x'} \max \{ e(x'), h^*(x') \} P(x, x')$. Multiplying by 𝛽, adding
𝑐(𝑥), and using the definition of ℎ∗, we get
\[
    h^*(x) = c(x) + \beta \sum_{x'} \max \{ e(x'), \, h^*(x') \} P(x, x')
    \qquad (x \in \mathsf{X}).
    \tag{4.12}
\]

This expression motivates us to introduce a continuation value operator 𝐶 : RX → RX via
\[
    (C h)(x) = c(x) + \beta \sum_{x'} \max \{ e(x'), \, h(x') \} P(x, x')
    \qquad (x \in \mathsf{X}).
    \tag{4.13}
\]

Proposition 4.1.5. The operator 𝐶 is a contraction of modulus 𝛽 on RX with the unique
fixed point ℎ∗ in RX.

Proposition 4.1.5 provides the following alternative method to compute the opti-
mal policy that does not involve value function iteration:

(i) Compute ℎ∗ by successive approximation with 𝐶, and

(ii) Calculate 𝜎∗ via 𝜎∗(𝑥) = 1{𝑒(𝑥) ⩾ ℎ∗(𝑥)} for each 𝑥 ∈ X.

In §4.1.4.2 we discuss settings where this approach is advantageous.
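
The two steps above can be sketched as follows. The names `C` and `solve_continuation`, the tolerance, and the three-state primitives are assumptions made for illustration only.

```julia
# Continuation value operator (4.13): C h = c + β P (e ∨ h)
C(h, e, c, P, β) = c .+ β .* (P * max.(e, h))

# Step (i): successive approximation with C; step (ii): policy from h
function solve_continuation(e, c, P, β; tol=1e-10, max_iter=100_000)
    h = zero(c)
    for _ in 1:max_iter
        h_new = C(h, e, c, P, β)
        err = maximum(abs.(h_new .- h))
        h = h_new
        err < tol && break
    end
    return h, e .>= h          # σ(x) = 1{e(x) ≥ h(x)}
end

# Hypothetical three-state example (assumed primitives)
P = [0.9 0.1 0.0; 0.1 0.8 0.1; 0.0 0.1 0.9]
e = [1.0, 1.0, 1.0]
c = [0.0, 0.05, 0.2]
β = 0.9
h, σ = solve_continuation(e, c, P, β)
```

The value function can be recovered afterwards as `max.(e, h)`, in line with (4.11).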

Proof of Proposition 4.1.5. Fix 𝑓, 𝑔 ∈ RX and 𝑥 ∈ X. By the triangle inequality and the
bound |𝛼 ∨ 𝑥 − 𝛼 ∨ 𝑦| ⩽ |𝑥 − 𝑦| from page 34, we have
\[
\begin{aligned}
    |(C f)(x) - (C g)(x)|
    & \leq \beta \sum_{x'} \big| \max \{ e(x'), f(x') \} - \max \{ e(x'), g(x') \} \big| P(x, x') \\
    & \leq \beta \sum_{x'} | f(x') - g(x') | P(x, x').
\end{aligned}
\]

The right-hand side is dominated by $\beta \| f - g \|_\infty$. Taking the maximum on the left-hand
side gives
\[
    \| C f - C g \|_\infty \leq \beta \| f - g \|_\infty,
\]
which confirms that 𝐶 is a contraction of modulus 𝛽 on RX.
From the contraction property, we know that 𝐶 has exactly one fixed point in RX;
(4.12) implies that ℎ∗ is that fixed point. □

4.1.4.2 Dimensionality Reduction

The beginning of §4.1.4 mentioned that switching from value function iteration to
continuation value iteration can substantially reduce the dimensionality of the prob-
lem in some cases. Here we describe situations where this works.
To begin, let W and Z be two finite sets and suppose that 𝜑 ∈ D(W) and 𝑄 ∈ M(RZ).
Let (𝑊𝑡) be IID with distribution 𝜑 and let (𝑍𝑡) be a 𝑄-Markov chain on Z.
If (𝑊𝑡) and (𝑍𝑡) are independent, then (𝑋𝑡) defined by 𝑋𝑡 = (𝑊𝑡, 𝑍𝑡) is 𝑃-Markov on
X ≔ W × Z, where
\[
    P(x, x') = P((w, z), (w', z')) = \varphi(w') Q(z, z').
\]
Suppose that the continuation reward depends only on 𝑧 so that we can write the
Bellman operator as
\[
    (T v)(w, z) = \max \Big\{ e(w, z), \;
        c(z) + \beta \sum_{w' \in \mathsf{W}} \sum_{z' \in \mathsf{Z}} v(w', z') \varphi(w') Q(z, z') \Big\}.
    \tag{4.14}
\]

Since the right-hand side depends on both 𝑤 and 𝑧, the Bellman operator acts on an
𝑛-dimensional space, where 𝑛 ≔ |X| = |W| × |Z|.
However, if we inspect the right-hand side of (4.14), we see that the continuation
value function depends only on 𝑧. Dependence on 𝑤 vanishes because 𝑤 does not
help predict 𝑤0. Thus, the continuation value function is an object in |Z|-dimensional
space. The continuation value operator
\[
    (C h)(z) = c(z) + \beta \sum_{w'} \sum_{z'} \max \{ e(w', z'), \, h(z') \} \varphi(w') Q(z, z')
    \qquad (z \in \mathsf{Z})
    \tag{4.15}
\]
acts on this lower-dimensional space.
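
To make the dimensionality reduction concrete, here is a minimal sketch of the operator in (4.15), where ℎ is stored as a vector of length |Z| even though the state space has |W| × |Z| elements. The names `C_reduced` and `iterate_to_fixed_point`, and all numerical values, are hypothetical.

```julia
# (C h)(z) = c(z) + β Σ_{w'} Σ_{z'} max{e(w', z'), h(z')} φ(w') Q(z, z')
# E[i, j] holds e(w_i, z_j); h, c have length |Z|; φ has length |W|.
function C_reduced(h, E, c, φ, Q, β)
    # m[j] = Σ_i max{E[i, j], h[j]} φ[i], the expectation over the IID component
    m = vec(sum(max.(E, h') .* φ, dims=1))
    return c .+ β .* (Q * m)
end

function iterate_to_fixed_point(F, h; tol=1e-10, max_iter=100_000)
    for _ in 1:max_iter
        h_new = F(h)
        maximum(abs.(h_new .- h)) < tol && return h_new
        h = h_new
    end
    return h
end

# Hypothetical primitives: three wage-like draws, two persistent states
w_vals, φ = [1.0, 2.0, 3.0], [0.3, 0.4, 0.3]
Q = [0.8 0.2; 0.3 0.7]
c = [0.5, 1.0]
β = 0.9
E = [w / (1 - β) for w in w_vals, j in 1:2]    # assumed exit reward e(w, z) = w / (1 - β)

h_star = iterate_to_fixed_point(h -> C_reduced(h, E, c, φ, Q, β), zeros(2))
σ = E .>= h_star'      # σ[i, j] = true means stop at (w_i, z_j)
```

Each iteration updates only two numbers here, whereas iterating with (4.14) would update six.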

Example 4.1.6. We can embed the IID job search problem into this setting by
taking (𝑊𝑡 ) to be the wage offer process and ( 𝑍𝑡 ) to be constant. This is why the IID
case offers a large dimensionality reduction when we switch to continuation values.

More examples of dimensionality reduction are illustrated in the applications below.

4.1.4.3 Application to Firm Value

Consider the firm valuation problem from §4.1.2 but suppose now that scrap value
fluctuates with prices of underlying assets. For simplicity let’s assume that scrap value
at each time 𝑡 is given by the IID sequence ( 𝑆𝑡 ), where each 𝑆𝑡 has density 𝜑 on R+ .
The corresponding Bellman operator is
\[
    (T v)(z, s) = \max \Big\{ s, \;
        \pi(z) + \beta \sum_{z'} \int v(z', s') \varphi(s') \, \mathrm{d}s' \, Q(z, z') \Big\}.
\]

We can convert this problem to a finite state space optimal stopping problem by
discretizing the density 𝜑 onto a finite grid contained in R+ . However, since continu-
ation values depend only on 𝑧, a better approach is to switch to a continuation value
operator.

EXERCISE 4.1.12. Write down the continuation value operator for this problem as
a mapping from RZ to itself.

EXERCISE 4.1.13. In §2.2.4 we defined stochastic dominance for distributions on
finite sets. For densities 𝜑 and 𝜓 on R+, the definition is similar: we say that 𝜓
stochastically dominates 𝜑 and write $\varphi \preceq_F \psi$ if
$\int u(x) \varphi(x) \, \mathrm{d}x \leq \int u(x) \psi(x) \, \mathrm{d}x$ for
every 𝑢 in 𝑖RX.2 With this definition, show that if 𝜑𝑎 and 𝜑𝑏 are two alternative
densities for scrap value and $\varphi_a \preceq_F \varphi_b$, then $\sigma^*_a \geq \sigma^*_b$ pointwise on Z, where $\sigma^*_i$ is the
optimal policy corresponding to density 𝜑𝑖 for 𝑖 ∈ {𝑎, 𝑏}. Interpret this result.

2 Actually, in most definitions, 𝑢 is also restricted to be bounded and measurable, in order to ensure
that the integrals are finite. These technicalities can be ignored in the exercise.

4.2 Further Applications

In this section we discuss some applications of optimal stopping and apply the results
described above.

4.2.1 American Options

We discussed American options briefly in Example 4.1.2 on page 106. Here we investigate
this class of derivatives more carefully. We focus on American call options that
provide the right to buy a particular stock or bond at a fixed strike price 𝐾 at any
time before a set expiration date. The market price of the asset at time 𝑡 is denoted
by 𝑆𝑡.
We discussed a case in which the expiration date is infinity in Example 4.1.2.
However, options without termination dates – also called perpetual options – are rare
in practice. Hence we focus on the finite-horizon case. We are interested in computing
the expected value of holding the option when discounting with a fixed interest rate,
a typical assumption when pricing American options.
Finite horizon American options can be priced by backward induction in an ap-
proach like the one we used for the finite horizon job search problem discussed in
Chapter 1. Alternatively, we can embed finite horizon options into the theory of
infinite-horizon optimal stopping. We use the second approach here, since we have
just presented a theory for infinite-horizon optimal stopping.
To this end, we take 𝑇 ∈ N to be a fixed integer indicating the date of expiration.
The option is purchased at 𝑡 = 0 and can be exercised at any 𝑡 ∈ N with 𝑡 ⩽ 𝑇 . To
include 𝑡 in the current state, we set

T ≔ {1, . . . , 𝑇 + 1} and 𝑚 ( 𝑡 ) ≔ min{𝑡 + 1, 𝑇 + 1} for all 𝑡 ∈ T.

The idea is that time is updated via 𝑡0 = 𝑚 ( 𝑡 ), so that time increments at each update
until 𝑡 = 𝑇 + 1. After that we hold 𝑡 constant. Bounding time at 𝑇 + 1 keeps the state
space finite.
We assume that a stock price 𝑆𝑡 evolves according to
\[
    S_t = Z_t + W_t
    \quad \text{where} \quad
    (W_t)_{t \geq 0} \overset{\text{IID}}{\sim} \varphi \in \mathcal{D}(\mathsf{W}).
\]

Here ( 𝑍𝑡 )𝑡⩾0 is 𝑄 -Markov on finite set Z for some 𝑄 ∈ M ( RZ ) and W is also finite. This
means that the share price has both persistent and transient stochastic components.
If we set parameters so that ( 𝑍𝑡 )𝑡⩾0 resembles a random walk, price changes will be
difficult to predict.
To form an optimal stopping problem in the sense of §4.1.1.1, we must specify the state space
and the 𝑃 ∈ M(RX) that governs the state process. We set the state space to X ≔ T × W × Z
and
𝑃 (( 𝑡, 𝑤, 𝑧 ) , ( 𝑡0, 𝑤0, 𝑧0)) ≔ 1{𝑡0 = 𝑚 ( 𝑡 )} 𝜑 ( 𝑤0) 𝑄 ( 𝑧, 𝑧0) .
Thus, time updates deterministically via 𝑡0 = 𝑚 ( 𝑡 ) and 𝑧0 and 𝑤0 are drawn indepen-
dently from 𝑄 ( 𝑧, ·) and 𝜑 respectively.
As for a perpetual option, the continuation reward is zero and the discount factor
is 𝛽 ≔ 1/(1 + 𝑟), where 𝑟 > 0 is a fixed risk-free rate. The exit reward can be expressed
as 1{𝑡 ⩽ 𝑇}(𝑆𝑡 − 𝐾) so that exercising at time 𝑡 earns the owner 𝑆𝑡 − 𝐾 up to expiry and
zero thereafter. In terms of the state (𝑡, 𝑤, 𝑧), the exit reward is

𝑒(𝑡, 𝑤, 𝑧) ≔ 1{𝑡 ⩽ 𝑇}[𝑧 + 𝑤 − 𝐾].

The Bellman equation can be written
\[
    v(t, w, z) = \max \Big\{ e(t, w, z), \;
        \beta \sum_{w'} \sum_{z'} v(t', w', z') \varphi(w') Q(z, z') \Big\},
\]

where 𝑡0 = 𝑚 ( 𝑡 ). This value function 𝑣 ( 𝑡, 𝑤, 𝑧 ) neatly captures the value of the option:
It is the maximum of current exercise value and the discounted expected value of
carrying the option over to the next period.
Since the problem described above is an optimal stopping problem in the sense of
§4.1.1.1, all of the optimality results attained for that problem apply. In particular,
iterates of the Bellman operator converge to the value function 𝑣∗ and, moreover, a
policy is optimal if and only if it is 𝑣∗ -greedy.
We can do better than value function iteration. Since (𝑊𝑡 )𝑡⩾0 is IID and appears
only in the exit reward, we can reduce dimensionality by switching to the continuation
value operator, which, in this case, can be expressed as
\[
    (C h)(t, z) = \beta \sum_{z'} \sum_{w'} \max \{ e(t', w', z'), \, h(t', z') \} \varphi(w') Q(z, z'),
    \tag{4.16}
\]
where 𝑡′ = 𝑚(𝑡). As proved in §4.1.4, the unique fixed point of 𝐶 is the continuation value function ℎ∗,
and $C^k h \to h^*$ as 𝑘 → ∞ for every real-valued function ℎ on T × Z. With the fixed point
in hand, we can compute the optimal policy as
\[
    \sigma^*(t, w, z) = 1 \{ e(t, w, z) \geq h^*(t, z) \}.
\]
Here 𝜎∗(𝑡, 𝑤, 𝑧) = 1 prescribes exercising the option at time 𝑡.
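
A minimal sketch of continuation value iteration for the option problem follows, with ℎ stored as a (𝑇 + 1) × |Z| array. To keep the example self-contained it uses a small hand-written Markov matrix rather than a discretized AR(1) process. The names `C_option` and `solve_option` and all parameter values are hypothetical.

```julia
# (C h)(t, z) = β Σ_{z'} Σ_{w'} max{e(t', w', z'), h(t', z')} φ(w') Q(z, z'), with t' = m(t)
function C_option(h, z_vals, w_vals, φ, Q, β, K, T)
    h_new = similar(h)
    for t in 1:T+1
        t_next = min(t + 1, T + 1)              # t' = m(t)
        # m[j] = Σ_i max{e(t', w_i, z_j), h(t', z_j)} φ_i
        m = [sum(max((t_next <= T) * (z_vals[j] + w - K), h[t_next, j]) * p
                 for (w, p) in zip(w_vals, φ))
             for j in eachindex(z_vals)]
        h_new[t, :] .= β .* (Q * m)
    end
    return h_new
end

function solve_option(z_vals, w_vals, φ, Q, β, K, T; tol=1e-12, max_iter=10_000)
    h = zeros(T + 1, length(z_vals))
    for _ in 1:max_iter
        h_new = C_option(h, z_vals, w_vals, φ, Q, β, K, T)
        err = maximum(abs.(h_new .- h))
        h = h_new
        err < tol && break
    end
    return h
end

# Hypothetical small instance (assumed values for illustration)
z_vals = [9.0, 10.0, 11.0]
Q = [0.8 0.2 0.0; 0.1 0.8 0.1; 0.0 0.2 0.8]
w_vals, φ = [-0.3, 0.3], [0.5, 0.5]
β, K, T = 1 / 1.01, 10.0, 5
h_star = solve_option(z_vals, w_vals, φ, Q, β, K, T)

# σ[t, i, j] = true means exercise at date t in state (w_i, z_j)
σ = [(t <= T) * (z_vals[j] + w_vals[i] - K) >= h_star[t, j]
     for t in 1:T+1, i in eachindex(w_vals), j in eachindex(z_vals)]
```

Because time moves forward deterministically and stops at 𝑇 + 1, the iteration started from zero settles after at most 𝑇 + 1 steps, which mirrors backward induction.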
Figure 4.3 provides a visual representation of optimal actions under the default
parameterization described in Listing 13. Each of the three subfigures shows contour lines
of the net exit reward 𝑓(𝑡, 𝑤, 𝑧) ≔ 𝑒(𝑡, 𝑤, 𝑧) − ℎ∗(𝑡, 𝑧), viewed as a function of (𝑤, 𝑧),
when 𝑡 is held fixed. The date 𝑡 for each subfigure is shown in the title. The optimal
policy exercises the option when 𝑓(𝑡, 𝑤, 𝑧) ⩾ 0.
In each subfigure, the exercise region, which is the set of (𝑤, 𝑧) such that 𝑓(𝑡, 𝑤, 𝑧) ⩾ 0,
corresponds to the northeast part of the figure, where 𝑤 and 𝑧 are both large. The
boundary between exercise and continuing is the zero contour line, which is shown
in black. Notice that the size of the exercise region expands with 𝑡. This is because
the value of waiting decreases when the set of possible exercise dates shrinks.
Figure 4.4 provides some simulations of the stock price process (𝑆𝑡)𝑡⩾0 over the lifetime
of the option, again using the default parameterization described in Listing 13.
The blue region in the top part of each subfigure contains values of the stock price
𝑆𝑡 = 𝑍𝑡 + 𝑊𝑡 such that 𝑆𝑡 ⩾ 𝐾. In this configuration, in which the price of the underlying
exceeds the strike price, the option is said to be “in the money.” The figure also
shows an optimal exercise date, which is the first 𝑡 such that 𝑒(𝑡, 𝑊𝑡, 𝑍𝑡) ⩾ ℎ∗(𝑡, 𝑍𝑡) in a
simulation.

using QuantEcon, LinearAlgebra, IterTools

"Creates an instance of the option model with S_t = Z_t + W_t."
function create_american_option_model(;
        n=100, μ=10.0,   # Markov state grid size and mean value
        ρ=0.98, ν=0.2,   # persistence and volatility for Markov state
        s=0.3,           # volatility parameter for W_t
        r=0.01,          # interest rate
        K=10.0, T=200)   # strike price and expiration date
    t_vals = collect(1:T+1)
    mc = tauchen(n, ρ, ν)
    z_vals, Q = mc.state_values .+ μ, mc.p
    w_vals, φ, β = [-s, s], [0.5, 0.5], 1 / (1 + r)
    e(t, i_w, i_z) = (t ≤ T) * (z_vals[i_z] + w_vals[i_w] - K)
    return (; t_vals, z_vals, w_vals, Q, φ, T, β, K, e)
end

Listing 13: Pricing an American option (american_option.jl)

4.2.2 Research and Development

Consider a firm that engages in costly research and development (R&D) in order to
develop a new product. The firm decides whether to continue developing the product
before starting to market it or to stop developing and start marketing it. For simplicity,
we assume that the value of bringing the product to market is a one-time lump sum
payment 𝜋𝑡 = 𝜋 ( 𝑋𝑡 ), where ( 𝑋𝑡 ) is a 𝑃 -Markov chain on finite set X with 𝑃 ∈ M ( RX ).
The flow cost of investing in R&D is 𝐶𝑡 per period, where ( 𝐶𝑡 ) is a stochastic process.
Future payoffs are discounted at rate 𝑟 > 0 and we set 𝛽 ≔ 1/(1 + 𝑟 ).
Figure 4.3: Exercise region for the American option (panels for 𝑡 = 1, 𝑡 = 195 and 𝑡 = 199; horizontal axis 𝑤, vertical axis 𝑧)

Figure 4.4: Simulations for the American option process (three simulated paths of 𝑆𝑡 for 𝑡 from 1 to 200, showing the in-the-money and out-of-the-money regions and the exercise date)

4.2.2.1 Constant R&D Costs

As a first take on this problem, suppose that 𝐶𝑡 ≡ 𝑐 ∈ R+ for all 𝑡 and formulate an
optimal stopping problem with exit reward 𝑒 = 𝜋 and constant continuation reward
−𝑐. The Bellman equation is
\[
    v(x) = \max \Big\{ \pi(x), \; -c + \beta \sum_{x'} v(x') P(x, x') \Big\}
    \qquad (x \in \mathsf{X}).
    \tag{4.17}
\]

EXERCISE 4.2.1. Write down the continuation value operator for this problem.
Prove that the continuation value function ℎ∗ is increasing in 𝑥 whenever 𝜋 ∈ 𝑖RX and
𝑃 is monotone increasing.

EXERCISE 4.2.2. Prove that the optimal policy 𝜎∗ is increasing whenever 𝜋 is in-
creasing and ( 𝑋𝑡 ) is IID (so that all rows of 𝑃 are identical). Provide economic intuition
for this result.

4.2.2.2 IID R&D Costs

Let’s suppose now that (𝐶𝑡)𝑡⩾0 is IID with common distribution 𝜑 ∈ D(W). The Bellman
equation is
\[
    v(c, x) = \max \Big\{ \pi(x), \;
        -c + \beta \sum_{x'} \sum_{c'} v(c', x') \varphi(c') P(x, x') \Big\}.
    \tag{4.18}
\]

Since (𝐶𝑡 ) is IID, we would ideally like to integrate it out in the manner of §4.1.4.2,
thereby lowering the dimensionality of the problem. However, note that the continuation
value associated with (4.18) is
\[
    h(c, x) := -c + \beta \sum_{x'} \sum_{c'} v(c', x') \varphi(c') P(x, x'),
\]
which still depends on 𝑐.


Fortunately, there is a way to eliminate 𝑐. Define the expected value
𝑔(𝑥) in state 𝑥 by
\[
    g(x) := \sum_{x'} \sum_{c'} v(c', x') \varphi(c') P(x, x').
    \tag{4.19}
\]

Rewrite the Bellman equation using 𝑔 and replacing (𝑐, 𝑥) with (𝑐′, 𝑥′) to get
\[
    v(c', x') = \max \{ \pi(x'), \; -c' + \beta g(x') \}.
\]

Averaging over (𝑐′, 𝑥′) and using the definition of 𝑔 again gives
\[
    g(x) = \sum_{x'} \sum_{c'} \max \{ \pi(x'), \; -c' + \beta g(x') \} \varphi(c') P(x, x').
    \tag{4.20}
\]

This is a functional equation in 𝑔 that depends only on 𝑥. To solve it, we introduce
the operator 𝑅 defined by
\[
    (R g)(x) = \sum_{x'} \sum_{c'} \max \{ \pi(x'), \; -c' + \beta g(x') \} \varphi(c') P(x, x')
    \qquad (x \in \mathsf{X}).
\]

EXERCISE 4.2.3. Prove that 𝑅 is a contraction of modulus 𝛽 on RX .

From Exercise 4.2.3, we see that (4.20) has a unique solution 𝑔∗ in RX that can be
computed by successive approximation. With 𝑔∗ in hand, we can compute the optimal
policy via
\[
    \sigma^*(c, x) = 1 \{ \pi(x) \geq -c + \beta g^*(x) \}.
\]
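
The following sketch iterates with 𝑅 and then builds the policy from the resulting fixed point. The function names `R` and `solve_R`, and the payoff, cost, and transition values, are hypothetical choices for illustration.

```julia
# (R g)(x) = Σ_{x'} Σ_{c'} max{π(x'), -c' + β g(x')} φ(c') P(x, x')
function R(g, π_vals, c_vals, φ, P, β)
    # m[j] = Σ_{c'} max{π(x_j), -c' + β g(x_j)} φ(c'), the expectation over the IID cost
    m = [sum(max(π_vals[j], -c + β * g[j]) * p for (c, p) in zip(c_vals, φ))
         for j in eachindex(g)]
    return P * m
end

function solve_R(π_vals, c_vals, φ, P, β; tol=1e-10, max_iter=100_000)
    g = zero(π_vals)
    for _ in 1:max_iter
        g_new = R(g, π_vals, c_vals, φ, P, β)
        err = maximum(abs.(g_new .- g))
        g = g_new
        err < tol && break
    end
    return g
end

# Hypothetical primitives (assumed values)
π_vals = [1.0, 5.0, 20.0]                 # lump-sum payoff from bringing the product to market
c_vals, φ = [0.1, 0.5], [0.5, 0.5]        # IID R&D cost draws and their probabilities
P = [0.7 0.3 0.0; 0.0 0.7 0.3; 0.0 0.0 1.0]
β = 0.95
g_star = solve_R(π_vals, c_vals, φ, P, β)

# σ[i, j] = true means stop developing at cost c_i and state x_j
σ = [π_vals[j] >= -c_vals[i] + β * g_star[j]
     for i in eachindex(c_vals), j in eachindex(π_vals)]
```

The iterate `g` has one entry per element of X, whereas the value function in (4.18) has one entry per (𝑐, 𝑥) pair.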

Remark 4.2.1. This technique solves for the expected value function defined in (4.19).
In §5.3 we shall discuss this method and its convergence properties in a more general
setting.

4.3 Chapter Notes


Various textbooks treat optimal stopping in depth, although most use continuous time.
Peskir and Shiryaev (2006) and Shiryaev (2007) are good examples.
There are many applications of optimal stopping in economics and finance, with
influential early research papers including McCall (1970), Jovanovic (1982), Hopen-
hayn (1992), and Ericson and Pakes (1995). Arellano (2008) considers borrowing on
international financial markets with the option of sovereign default (see §8.2.1.5).
Riedel (2009) studies optimal stopping under Knightian uncertainty. Fajgelbaum
et al. (2017) include an optimal stopping problem for firms in a model of uncertainty
traps.
The firm problem with optimal exit has been used to analyze firm dynamics and
firm size distributions in equilibrium models with heterogeneous firms. Hopenhayn
(1992) is the classic reference. Perla and Tonetti (2014) construct a growth model
in which firms at the bottom of the productivity distribution imitate more productive
firms. Carvalho and Grassi (2019) analyze business cycles in a setting of firm growth
with exit and a Pareto distribution of firms.
Infinite duration American options are analyzed in Mordecki (2002). Practical
methods for pricing American options are provided by Longstaff and Schwartz (2001),
Rogers (2002), and Kohler et al. (2010).
Replacement problems are an important class of optimal stopping problems not treated in
this chapter. An important early paper by Rust (1987) uses dynamic programming
to find optimal replacement policies for engine parts and goes on to fit the model
to data. §5.3.1 discusses structural estimation in the style of Rust (1987) and
others.
Chapter 5

Markov Decision Processes

In this chapter we study a class of discrete time, infinite horizon dynamic programs
called Markov decision processes (MDPs). This standard class of problems is broad
enough to encompass many applications, including the optimal stopping problems in
Chapter 4. MDPs can also be combined with reinforcement learning to tackle settings
where important inputs to an MDP are not known.

5.1 Definition and Properties

In this section we define MDPs and investigate optimality.

5.1.1 The MDP Model

We study a controller who interacts with a state process (𝑋𝑡)𝑡⩾0 by choosing an action
path (𝐴𝑡)𝑡⩾0 to maximize expected discounted rewards
\[
    \mathbb{E} \sum_{t \geq 0} \beta^t r(X_t, A_t),
    \tag{5.1}
\]

taking an initial state 𝑋0 as given. As with all dynamic programs, we insist that the
controller is not clairvoyant: he or she cannot choose actions that depend on future
states.
To formalize the problem, we fix a finite set X, henceforth called the state space,
and a finite set A, henceforth called the action space. In what follows, a correspondence
Γ from X to A is a function from X into ℘(A), the set of all subsets of A. The
correspondence is called nonempty if Γ(𝑥) ≠ ∅ for all 𝑥 ∈ X. For example, the map Γ
defined by Γ(𝑥) = [−𝑥, 𝑥] is a nonempty correspondence from R to R.
Given X and A, we define a Markov decision process (MDP) to be a tuple M =
( Γ, 𝛽, 𝑟, 𝑃 ) consisting of

(i) a nonempty correspondence Γ from X to A, referred to as the feasible correspondence,
which in turn defines the feasible state-action pairs
\[
    \mathsf{G} := \{ (x, a) \in \mathsf{X} \times \mathsf{A} : a \in \Gamma(x) \},
\]

(ii) a constant 𝛽 in (0, 1), referred to as the discount factor,

(iii) a function 𝑟 from G to R, referred to as the reward function, and

(iv) a stochastic kernel 𝑃 from G to X; that is, 𝑃 is a map from G × X to R+ satisfying
\[
    \sum_{x' \in \mathsf{X}} P(x, a, x') = 1
    \quad \text{for all } (x, a) \text{ in } \mathsf{G}.
\]

Here Γ(𝑥) ⊂ A is the set of actions available to the controller in state 𝑥. Given a
feasible state-action pair (𝑥, 𝑎), reward 𝑟(𝑥, 𝑎) is received and the next period state 𝑥′
is randomly drawn from 𝑃(𝑥, 𝑎, ·), which is an element of D(X). The dynamics and
reward flow are summarized in Algorithm 5.1.

Algorithm 5.1: MDP dynamics: states, actions, and rewards

𝑡 ← 0
input 𝑋0
while 𝑡 < ∞ do
    observe 𝑋𝑡
    choose action 𝐴𝑡
    receive reward 𝑟(𝑋𝑡, 𝐴𝑡)
    draw 𝑋𝑡+1 from 𝑃(𝑋𝑡, 𝐴𝑡, ·)
    𝑡 ← 𝑡 + 1
end
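
A truncated version of Algorithm 5.1 can be simulated directly. The sketch below assumes that states and actions are indexed by positive integers, that the action is chosen by a fixed rule `policy[x]`, and that the loop stops after a finite `horizon`. These conventions, the function name `simulate_mdp`, and the toy primitives are not from the text.

```julia
using Random

# Simulate discounted rewards Σ_{t < horizon} β^t r(X_t, A_t) along one path.
# r[x, a] is the reward and P[x, a, y] the probability of moving from x to y under a.
function simulate_mdp(x0, policy, r, P, β; horizon=50, rng=Random.default_rng())
    x, total = x0, 0.0
    for t in 0:horizon-1
        a = policy[x]                        # choose action A_t given X_t
        total += β^t * r[x, a]               # receive reward r(X_t, A_t)
        # draw X_{t+1} from P(X_t, A_t, ·) by inverse-transform sampling
        u, cum, x_next = rand(rng), 0.0, size(P, 3)
        for y in 1:size(P, 3)
            cum += P[x, a, y]
            if u <= cum
                x_next = y
                break
            end
        end
        x = x_next
    end
    return total
end

# Hypothetical two-state, two-action example: action a moves the state to a for sure
P = zeros(2, 2, 2)
P[:, 1, 1] .= 1.0
P[:, 2, 2] .= 1.0
r = ones(2, 2)
β = 0.9
total = simulate_mdp(1, [2, 1], r, P, β)
```

With unit rewards the simulated total equals the truncated geometric sum (1 − 𝛽^50)/(1 − 𝛽), which gives a quick sanity check.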

The Bellman equation corresponding to M is
\[
    v(x) = \max_{a \in \Gamma(x)} \Big\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \Big\}
    \quad \text{for all } x \in \mathsf{X}.
    \tag{5.2}
\]

This can be understood as an equation in the unknown function 𝑣 ∈ RX. Below we
define the value function 𝑣∗ as maximal lifetime rewards and show that 𝑣∗ is the unique
solution to the Bellman equation in RX.
We can understand the Bellman equation as reducing an infinite-horizon problem
to a two period problem involving the present and the future. Current actions influ-
ence (i) current rewards and (ii) expected discounted value from future states. In
every case we examine, there is a trade-off between maximizing current rewards and
shifting probability mass towards states with high future rewards.

5.1.2 Examples
Here we list examples of MDPs. We will see that some models neatly fit the MDP
structure, while others can be coaxed into the MDP framework by adding states or
applying other tricks.

5.1.2.1 A Renewal Problem

Rust (1987) ignited the field of dynamic structural estimation by examining an en-
gine replacement problem for a bus workshop. In each period the superintendent
decides whether or not to replace the engine of a given bus. Replacement is costly but
delaying risks unexpected failure. Rust (1987) solved this trade-off using dynamic
programming.
We consider an abstract version of Rust’s problem with binary action 𝐴𝑡 . When
𝐴𝑡 = 1, the state resets to some fixed renewal state 𝑥¯ in a finite set X (e.g., mileage
resets to zero when an engine is replaced). When 𝐴𝑡 = 0, the state updates according
to 𝑄 ∈ M ( RX ) (e.g., mileage increases stochastically when the engine is not replaced).
Given current state 𝑥 and action 𝑎, current reward 𝑟 ( 𝑥, 𝑎) is received. The discount
factor is 𝛽 ∈ (0, 1).
For this problem, the Bellman equation has the form
\[
    v(x) = \max \Big\{ r(x, 1) + \beta v(\bar{x}), \;
        r(x, 0) + \beta \sum_{x'} v(x') Q(x, x') \Big\}
    \qquad (x \in \mathsf{X}),
    \tag{5.3}
\]

where the first term is the value from action 1 and the second is the value of action 0.
To set the problem up as an MDP we set A = {0, 1} and Γ ( 𝑥 ) = A for all 𝑥 ∈ X. We
define

𝑃 ( 𝑥, 𝑎, 𝑥 0) ≔ 𝑎1{ 𝑥 0 = 𝑥¯} + (1 − 𝑎) 𝑄 ( 𝑥, 𝑥 0) (( 𝑥, 𝑎) ∈ G, 𝑥 0 ∈ X) . (5.4)



EXERCISE 5.1.1. Prove that 𝑃 is a stochastic kernel from G to X.

The primitives (Γ, 𝛽, 𝑟, 𝑃) form an MDP. Moreover, the renewal Bellman equation
(5.3) is a special case of the MDP Bellman equation (5.2). To verify this we rewrite
(5.3) as
\[
    v(x) = \max_{a \in \{0, 1\}} \Big\{ r(x, a)
        + \beta \Big[ a v(\bar{x}) + (1 - a) \sum_{x'} v(x') Q(x, x') \Big] \Big\}.
\]

Inserting 𝑃 from (5.4) into the right-hand side of the last equation recovers the MDP
Bellman equation (5.2).
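
The kernel in (5.4) is easy to build as a three-dimensional array. In the sketch below, the matrix `Q`, the renewal state `x_bar`, and the convention of storing action 𝑎 at index 𝑎 + 1 are assumptions made for illustration.

```julia
# Hypothetical mileage transitions on three states (assumed values)
Q = [0.5 0.5 0.0; 0.0 0.5 0.5; 0.0 0.0 1.0]
x_bar = 1                      # renewal state
n = size(Q, 1)

# P[x, a + 1, x'] = a 1{x' = x̄} + (1 - a) Q(x, x') for actions a ∈ {0, 1}
P = [a * (y == x_bar) + (1 - a) * Q[x, y] for x in 1:n, a in 0:1, y in 1:n]

row_sums = sum(P, dims=3)      # each entry should equal one
```

Slicing `P[:, 1, :]` recovers `Q` (no replacement), while `P[:, 2, :]` puts all mass on the renewal state.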

5.1.2.2 Optimal Inventory Management

We study a firm where a manager maximizes shareholder value. To simplify the
problem, we ignore exit options (so that firm value is the expected present value of
profits) and assume that the firm only sells one product. Letting 𝜋𝑡 be profits at time
𝑡 and 𝑟 > 0 be the interest rate, the value of the firm is
\[
    V_0 = \mathbb{E} \sum_{t \geq 0} \beta^t \pi_t
    \quad \text{where} \quad
    \beta := \frac{1}{1 + r}.
    \tag{5.5}
\]

The firm faces an exogenous demand process (𝐷𝑡)𝑡⩾0 that is IID with distribution 𝜑 ∈ D(Z+).
Inventory (𝑋𝑡)𝑡⩾0 of the product obeys
\[
    X_{t+1} = f(X_t, A_t, D_{t+1})
    \quad \text{where} \quad
    f(x, a, d) := (x - d) \vee 0 + a.
    \tag{5.6}
\]

The term 𝐴𝑡 is units of stock ordered this period, which take one period to arrive. The
definition of 𝑓 imposes the assumption that firms cannot sell more stock than they
have on hand. We assume that the firm can store at most 𝐾 items at one time.
With the price of the firm’s product set to one, current profits are given by

𝜋𝑡 ≔ 𝑋𝑡 ∧ 𝐷𝑡+1 − 𝑐𝐴𝑡 − 𝜅1{ 𝐴𝑡 > 0} .

Here 𝑐 is unit product cost and 𝜅 is a fixed cost of ordering inventory. We take the
minimum 𝑋𝑡 ∧ 𝐷𝑡+1 because orders in excess of inventory are assumed to be lost rather
than back-filled.
We can map our inventory problem into an MDP with state space X ≔ {0, . . . , 𝐾 }
and action space A ≔ X. The feasible correspondence Γ is

Γ ( 𝑥 ) ≔ {0, . . . , 𝐾 − 𝑥 }, (5.7)

which represents the set of feasible orders when the current inventory state is 𝑥 . The
reward function is expected current profits, or
\[
r(x, a) := \sum_{d \geq 0} (x \wedge d) \varphi(d) - c a - \kappa \mathbb 1\{a > 0\}. \tag{5.8}
\]

The stochastic kernel from the set of feasible state-action pairs G induced by Γ is, in
view of (5.6),

\[
P(x, a, x') := \mathbb P\{ f(x, a, D) = x' \} \quad \text{when} \quad D \sim \varphi. \tag{5.9}
\]

EXERCISE 5.1.2. Suppose that $\varphi$ is the geometric distribution on $\mathbb Z_+$ with parameter $p$. Write down an expression for the stochastic kernel (5.9) using only $x$, $a$, $x'$ and the parameters of the model.
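
Separately from the closed-form expression requested in the exercise, the kernel (5.9) can be tabulated numerically by summing $\varphi(d)$ over demand outcomes. The sketch below does this under a truncation of the demand support at `d_max`; the function name `inventory_kernel`, the truncation level and the parameter values are assumptions made for illustration.

```julia
f(x, a, d) = max(x - d, 0) + a          # inventory update from (5.6)

"Tabulate P(x, a, ·) by summing ϕ(d) over demand outcomes d = 0, ..., d_max."
function inventory_kernel(x, a; K=40, p=0.6, d_max=100)
    ϕ(d) = (1 - p)^d * p                 # geometric pmf on 0, 1, 2, ...
    probs = zeros(K + 1)                 # entry x′ + 1 holds P(x, a, x′)
    for d in 0:d_max
        probs[f(x, a, d) + 1] += ϕ(d)    # add the mass of demand outcome d
    end
    return probs
end

# Current inventory 5 and an order of 3 units (feasible since 3 ≤ K - 5)
probs = inventory_kernel(5, 3)
```

The truncation discards the tail mass of the geometric distribution beyond `d_max`, which is negligible for the values used here but should be revisited if `p` is small.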

The Bellman equation for this optimal inventory problem is


\[
v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{d \geq 0} v(f(x, a, d)) \varphi(d) \right\} \tag{5.11}
\]

at each $x \in \mathsf X$, where $r(x, a)$ is as given in (5.8) and the aim is to solve for $v$. We introduce the Bellman operator

\[
(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{d \geq 0} v(f(x, a, d)) \varphi(d) \right\}. \tag{5.12}
\]

This operator maps $\mathbb R^{\mathsf X}$ to itself and is designed so that its set of fixed points in $\mathbb R^{\mathsf X}$ coincides with the set of solutions to (5.11) in $\mathbb R^{\mathsf X}$.

EXERCISE 5.1.3. Prove that $T$ is a contraction of modulus $\beta$ on $\mathbb R^{\mathsf X}$ when paired with the supremum norm $\| v \|_\infty := \max_{x \in \mathsf X} |v(x)|$.

5.1.2.3 Example: Cake Eating

Many dynamic programming problems in economics involve a trade-off between current and future consumption. The simplest example in this class is the “cake eating”
problem, where initial household wealth is given but no labor income is received.

Wealth evolves according to

𝑊𝑡+1 = 𝑅 (𝑊𝑡 − 𝐶𝑡 ) ( 𝑡 ⩾ 0)

where $C_t$ is current consumption and $R$ is the gross interest rate. The agent seeks to maximize

\[
\mathbb E \sum_{t \geq 0} \beta^t u(C_t) \quad \text{given} \quad W_0 = w
\]

subject to $0 \leq C_t \leq W_t$ (implying that the agent cannot borrow). Consumption level $C_t$ generates utility $u(C_t)$. Assuming that wealth takes values in a finite set $\mathsf W \subset \mathbb R_+$, the Bellman equation for this problem can be written as

\[
v(w) = \max_{0 \leq w' \leq w} \left\{ u(w - w'/R) + \beta v(w') \right\}. \tag{5.13}
\]

In (5.13) we are using $w' = R(w - c)$ to obtain $c = w - w'/R$. The household uses (5.13) to trade off current utility of consumption against the value of future wealth.

EXERCISE 5.1.4. Frame this model as an MDP with W as the state space.

5.1.2.4 Example: Optimal Stopping

The optimal stopping problem we studied in Chapter 4 can be framed as an MDP. On one hand, doing so allows us to apply results obtained for MDPs to optimal stopping.
On the other hand, expressing an optimal stopping problem as an MDP requires an
additional state variable, which complicates the exposition. The exercise below helps
to illustrate the key ideas.
Remark 5.1.1. While readers interested in the connection between optimal stopping
and MDPs will benefit from this section, others can freely skip to §5.1.3 without losing
continuity. Later, in Chapter 8, we will show that optimal stopping problems can be
embedded in a very general framework (which includes MDPs) without adding extra
state variables.

Let’s focus on the job search problem with Markov state discussed in §3.3.1 (al-
though the arguments for the general optimal stopping problem in §4.1.1.1 are very
similar). As before, W is the set of wage outcomes. Since we need the symbol 𝑃 for
other purposes, we let 𝑄 be the Markov matrix for wages, so that (𝑊𝑡 )𝑡⩾0 is 𝑄 -Markov
on W.
To express the job search problem as an MDP, let $\mathsf X = \{0, 1\} \times \mathsf W$ be a state space whose typical element is $(e, w)$, with $e$ representing either unemployment ($e = 0$) or employment ($e = 1$) and $w$ being the current wage offer. An action $a \in \mathsf A := \{0, 1\}$ indicates rejection or acceptance of the current wage offer.

EXERCISE 5.1.5. Express the job search problem as an MDP, with state space X and
action space A as described in the previous paragraph.

5.1.3 Optimality

In this section we return to the general MDP setting of §5.1.1, define optimal policies
and state our main optimality result. As was the case for job search, actions are gov-
erned by policies, which are maps from states to actions (see, in particular, §1.3.1.3,
where policies were introduced).

5.1.3.1 Policies and Lifetime Values

Let M = ( Γ, 𝛽, 𝑟, 𝑃 ) be an MDP. The set of feasible policies corresponding to M is

Σ ≔ {𝜎 ∈ AX : 𝜎 ( 𝑥 ) ∈ Γ ( 𝑥 ) for all 𝑥 ∈ X} . (5.16)

If we select a policy $\sigma$ from $\Sigma$, it is understood that we respond to state $X_t$ with action $A_t := \sigma(X_t)$ at every date $t$. As a result, the state evolves by drawing $X_{t+1}$ from $P(X_t, \sigma(X_t), \cdot)$ at each $t \geq 0$. In other words, $(X_t)_{t \geq 0}$ is $P_\sigma$-Markov when

\[
P_\sigma(x, x') := P(x, \sigma(x), x') \qquad (x, x' \in \mathsf X).
\]

Note that 𝑃𝜎 ∈ M ( RX ). Fixing a policy “closes the loop” in the state transition process
and defines a Markov chain for the state.
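Under the policy $\sigma$, rewards at state $x$ are $r(x, \sigma(x))$. If

\[
r_\sigma(x) := r(x, \sigma(x)) \quad \text{and} \quad \mathbb E_x := \mathbb E[\, \cdot \mid X_0 = x]
\]

then the lifetime value of following $\sigma$ starting from state $x$ can be written as

\[
v_\sigma(x) = \mathbb E_x \sum_{t \geq 0} \beta^t r_\sigma(X_t) \quad \text{where } (X_t) \text{ is } P_\sigma\text{-Markov with } X_0 = x. \tag{5.17}
\]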

Since 𝛽 < 1, applying Lemma 3.2.1 on page 94 to this expression yields


\[
v_\sigma = \sum_{t \geq 0} \beta^t P_\sigma^t r_\sigma = (I - \beta P_\sigma)^{-1} r_\sigma. \tag{5.18}
\]

Analogous to the optimal stopping case, we call 𝑣𝜎 the 𝜎-value function. We also call
𝑣𝜎 ( 𝑥 ) the lifetime value of policy 𝜎 conditional on initial state 𝑥 .
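
To make (5.18) concrete, here is a minimal sketch that computes $v_\sigma$ for a small MDP by solving the linear system $(I - \beta P_\sigma) v = r_\sigma$. The two-state, two-action primitives, the array layout `r[x, a]` and `P[x, a, x′]`, and the assumption that every action is feasible in every state are all hypothetical choices made for illustration.

```julia
using LinearAlgebra

β = 0.95
# Hypothetical two-state, two-action MDP (all actions feasible in every state)
r = [1.0 0.0;        # r[x, a]
     0.0 2.0]
P = zeros(2, 2, 2)   # P[x, a, x′]
P[1, 1, :] = [0.9, 0.1]
P[1, 2, :] = [0.2, 0.8]
P[2, 1, :] = [0.5, 0.5]
P[2, 2, :] = [0.1, 0.9]

σ = [1, 2]           # policy: index of the action chosen in each state

# Close the loop: P_σ(x, x′) = P(x, σ(x), x′) and r_σ(x) = r(x, σ(x))
n = length(σ)
P_σ = [P[x, σ[x], x′] for x in 1:n, x′ in 1:n]
r_σ = [r[x, σ[x]] for x in 1:n]

# Lifetime value of σ: solve (I - β P_σ) v = r_σ rather than forming the inverse
v_σ = (I - β * P_σ) \ r_σ
```

Using the backslash operator avoids constructing $(I - \beta P_\sigma)^{-1}$ explicitly, which is generally preferable for numerical stability.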

EXERCISE 5.1.6. Prove that $v_1 \leq v_\sigma \leq v_2$ when $v_2 := \| r \|_\infty / (1 - \beta)$ and $v_1 := -v_2$.

Another way to compute $v_\sigma$ is to use the policy operator $T_\sigma$ corresponding to $\sigma$, which is defined at $v \in \mathbb R^{\mathsf X}$ by

\[
(T_\sigma v)(x) = r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \qquad (x \in \mathsf X). \tag{5.19}
\]

(𝑇𝜎 is analogous to the policy operator defined for the optimal stopping problem in
§4.1.1.3.) In vector notation,
𝑇𝜎 𝑣 = 𝑟𝜎 + 𝛽𝑃𝜎 𝑣. (5.20)
The next exercise shows how 𝑇𝜎 can be put to work.

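EXERCISE 5.1.7. Fixing $\sigma$ in $\Sigma$, prove that

(i) $T_\sigma$ is an order-preserving self-map on $\mathbb R^{\mathsf X}$,
(ii) $T_\sigma$ is a contraction on $\mathbb R^{\mathsf X}$ of modulus $\beta$ under the norm $\| \cdot \|_\infty$,
(iii) the $\sigma$-value function $v_\sigma$ is the unique fixed point of $T_\sigma$ in $\mathbb R^{\mathsf X}$, and
(iv) $T_\sigma^k v \to v_\sigma$ as $k \to \infty$ for all $v \in \mathbb R^{\mathsf X}$.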

Computationally, this means that we can pick 𝑣 ∈ RX and iterate with 𝑇𝜎 to obtain
an approximation to 𝑣𝜎 .
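
The following sketch illustrates this iterative approach and compares it with the direct solve from (5.18). The function name `iterate_policy_operator` and the two-state $P_\sigma$ and $r_\sigma$ are hypothetical, chosen only to keep the example small.

```julia
using LinearAlgebra

"Apply T_σ v = r_σ + β P_σ v a total of k times, starting from v."
function iterate_policy_operator(v, r_σ, P_σ, β, k)
    for _ in 1:k
        v = r_σ + β * P_σ * v     # one application of the policy operator (5.20)
    end
    return v
end

β = 0.9
P_σ = [0.9 0.1;
       0.4 0.6]
r_σ = [1.0, 2.0]

v_exact = (I - β * P_σ) \ r_σ                                    # direct solve
v_approx = iterate_policy_operator(zeros(2), r_σ, P_σ, β, 200)   # 200 iterations
gap = maximum(abs.(v_approx - v_exact))                          # sup-norm distance
```

Since $T_\sigma$ is a contraction of modulus $\beta$, the gap shrinks by at least the factor $\beta$ with each iteration, so the number of iterations needed grows as $\beta$ approaches one.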

EXERCISE 5.1.8. Prove that, when the initial condition for iteration is $v \equiv 0 \in \mathbb R^{\mathsf X}$, the $k$-th iterate $T_\sigma^k v$ is equal to the truncated sum $\sum_{t=0}^{k-1} \beta^t P_\sigma^t r_\sigma$.

Remark 5.1.2. To compute $v_\sigma$, should we use the expression $(I - \beta P_\sigma)^{-1} r_\sigma$ in (5.18) or iterate with $T_\sigma$? For small state spaces, the first option is typically faster. However, it is easy to write down dynamic programming problems where $\mathsf X$ is very large (see, e.g., Example 1.0.2 on page 2). If, say, $\mathsf X$ has $10^9$ elements, then $I - \beta P_\sigma$ is $10^9 \times 10^9$. Matrices of this size are difficult to invert, or even to store in memory. In such settings, iterating with $T_\sigma$ might be preferred.

The next exercise extends Exercise 5.1.8 and aids interpretation of policy operators. It tells us that $(T_\sigma^k v)(x)$ is the payoff from following policy $\sigma$ and starting in state $x$ when lifetime is truncated to the finite horizon $k$ and $v$ provides a terminal payoff in each state.

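EXERCISE 5.1.9. Fix $\sigma \in \Sigma$ and let $(X_t)$ be $P_\sigma$-Markov with initial condition $x \in \mathsf X$. Prove that, for given $v \in \mathbb R^{\mathsf X}$ and $k \in \mathbb N$, we have

\[
(T_\sigma^k v)(x) = \mathbb E_x \left[ \sum_{t=0}^{k-1} \beta^t r(X_t, \sigma(X_t)) + \beta^k v(X_k) \right].
\]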

5.1.3.2 Defining Optimality

Given MDP M = ( Γ, 𝛽, 𝑟, 𝑃 ) with 𝜎-value functions { 𝑣𝜎 }𝜎∈Σ , the value function corre-
sponding to M is defined as 𝑣∗ ≔ ∨𝜎∈Σ 𝑣𝜎 , where, as usual, the maximum is pointwise.
More explicitly,
\[
v^*(x) = \max_{\sigma \in \Sigma} v_\sigma(x) \qquad (x \in \mathsf X). \tag{5.21}
\]

This is consistent with our definition of the value function in the optimal stopping case
(see page 108). It is the maximal lifetime value we can extract from each state using
feasible behavior. The maximum in (5.21) exists at each 𝑥 because Σ is finite.
A policy 𝜎 ∈ Σ is called optimal for M if 𝑣𝜎 = 𝑣∗ . In other words, a policy is optimal
if its lifetime value is maximal at each state.

Example 5.1.1. Consider again Figure 2.5 on page 57, supposing that $\Sigma = \{\sigma', \sigma''\}$. As drawn, there is no optimal policy, since $v^*$ differs from both $v_{\sigma'}$ and $v_{\sigma''}$. Below, in Proposition 5.1.1, we will show that such an outcome is not possible for MDPs.

Our optimality results are easier to follow with some additional terminology. To
start, given 𝑣 ∈ RX , we define a policy 𝜎 ∈ Σ to be 𝑣-greedy if
\[
\sigma(x) \in \operatorname*{argmax}_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } x \in \mathsf X. \tag{5.22}
\]

In essence, a 𝑣-greedy policy treats 𝑣 as the correct value function and sets all actions
accordingly.
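
As an illustration of (5.22), the sketch below computes a $v$-greedy policy for a small MDP stored as arrays. The function name `get_greedy`, the layout `r[x, a]` and `P[x, a, x′]`, the example primitives and the assumption that $\Gamma(x) = \mathsf A$ for all $x$ are hypothetical simplifications.

```julia
"Return a v-greedy policy for rewards r[x, a] and kernel P[x, a, x′]."
function get_greedy(v, r, P, β)
    n, m = size(r)
    σ = zeros(Int, n)
    for x in 1:n
        # Action values r(x, a) + β Σ_x′ v(x′) P(x, a, x′), one entry per action
        q = [r[x, a] + β * sum(v[x′] * P[x, a, x′] for x′ in 1:n) for a in 1:m]
        σ[x] = argmax(q)          # ties are broken in favor of the first maximizer
    end
    return σ
end

β = 0.95
r = [1.0 0.0;
     0.0 2.0]
P = zeros(2, 2, 2)
P[1, 1, :] = [0.9, 0.1]
P[1, 2, :] = [0.2, 0.8]
P[2, 1, :] = [0.5, 0.5]
P[2, 2, :] = [0.1, 0.9]

# With v ≡ 0 the greedy policy simply maximizes current reward in each state
σ_myopic = get_greedy(zeros(2), r, P, β)
```

When feasible sets differ across states, the inner comprehension would range over $\Gamma(x)$ rather than over all actions.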

EXERCISE 5.1.10. Fix $v \in \mathbb R^{\mathsf X}$. Prove that the set $\{T_\sigma v\}_{\sigma \in \Sigma}$ has a least and greatest element.

Bellman’s principle of optimality is said to hold for the MDP M if

𝜎 ∈ Σ is optimal for M ⇐⇒ 𝜎 is 𝑣∗ -greedy.



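Figure 5.1: $T$ is the pointwise maximum of $\{T_\sigma\}_{\sigma \in \Sigma}$ (one-dimensional setting). [Plot not reproduced: the curves $T_{\sigma'}$, $T_{\sigma''}$, $T_{\sigma'''}$ and their upper envelope $T = \bigvee_{\sigma \in \Sigma} T_\sigma$.]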

The Bellman operator corresponding to M is the self-map 𝑇 on RX defined by


\[
(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad (x \in \mathsf X). \tag{5.24}
\]

Obviously 𝑇 𝑣 = 𝑣 if and only if 𝑣 satisfies the Bellman equation (5.2).

EXERCISE 5.1.11. Given $v \in \mathbb R^{\mathsf X}$, prove that

(i) at least one $v$-greedy policy exists,
(ii) $\sigma \in \Sigma$ is $v$-greedy if and only if $T_\sigma v = T v$, and
(iii) $(Tv)(x) = \max_{\sigma \in \Sigma} (T_\sigma v)(x)$ for all $x \in \mathsf X$.

The last part of Exercise 5.1.11 tells us that 𝑇 is the pointwise maximum of {𝑇𝜎 }𝜎∈Σ ,
which can be expressed as 𝑇 = ∨𝜎 𝑇𝜎 . Figure 5.1 illustrates this relationship in one
dimension.

EXERCISE 5.1.12. Prove: $T$ is a contraction of modulus $\beta$ on $\mathbb R^{\mathsf X}$ under the norm $\| \cdot \|_\infty$.



5.1.3.3 Optimality Theory

We can now state our main optimality result for MDPs.


Proposition 5.1.1. If $\mathcal M = (\Gamma, \beta, r, P)$ is an MDP with value function $v^*$ and Bellman operator $T$, then
(i) $v^*$ is the unique solution to the Bellman equation in $\mathbb R^{\mathsf X}$,
(ii) $\lim_{k \to \infty} T^k v = v^*$ for all $v \in \mathbb R^{\mathsf X}$,
(iii) Bellman’s principle of optimality holds for $\mathcal M$, and
(iv) at least one optimal policy exists.

While Proposition 5.1.1 is a special case of later results (see §8.1.3.3), a direct
proof is not difficult and we provide one below for interested readers.

Proof of Proposition 5.1.1. In Exercise 5.1.12 we showed that $T$ is a contraction mapping on the closed set $\mathbb R^{\mathsf X}$. Hence $T$ is globally stable on $\mathbb R^{\mathsf X}$ and therefore has a unique fixed point $\bar v \in \mathbb R^{\mathsf X}$. Our first claim is that $\bar v = v^*$. We show $\bar v \leq v^*$ and then $\bar v \geq v^*$.
For the first inequality, let $\sigma \in \Sigma$ be $\bar v$-greedy. Recalling Exercise 5.1.11, we have $T_\sigma \bar v = T \bar v = \bar v$. Hence $\bar v$ is also a fixed point of $T_\sigma$. But the only fixed point of $T_\sigma$ in $\mathbb R^{\mathsf X}$ is $v_\sigma$, so $\bar v = v_\sigma$. But then $\bar v \leq v^*$, since, by definition, $v^* = \bigvee_\sigma v_\sigma$. This is our first inequality.
As for the second inequality, fix 𝜎 ∈ Σ and observe that 𝑇𝜎 𝑣 ⩽ 𝑇 𝑣 for all 𝑣 ∈ RX .
Since 𝑇 is order-preserving and globally stable, Proposition 2.2.7 on page 67 implies
that 𝑣𝜎 ⩽ 𝑣¯. Taking the supremum over 𝜎 ∈ Σ yields 𝑣∗ ⩽ 𝑣¯.
Hence 𝑣∗ is a fixed point of 𝑇 in RX . Since 𝑇 is globally stable on RX , the remaining
claims in parts (i)–(ii) follow immediately.
As for part (iii), it follows from Exercise 5.1.11 and part (i) of this proposition that

𝜎 is 𝑣∗ -greedy ⇐⇒ 𝑇𝜎 𝑣∗ = 𝑇 𝑣∗ ⇐⇒ 𝑇𝜎 𝑣∗ = 𝑣∗ .

The right hand side of this expression tells us that 𝑣∗ is a fixed point of 𝑇𝜎 . But the only
fixed point of 𝑇𝜎 is 𝑣𝜎 , so the right hand side is equivalent to the statement 𝑣𝜎 = 𝑣∗ .
Hence, by this chain of logic and the definition of optimality,

𝜎 is 𝑣∗ -greedy ⇐⇒ 𝑣∗ = 𝑣𝜎 ⇐⇒ 𝜎 is optimal (5.25)

Hence (iii) holds.


Part (iv) is left as an exercise immediately below. □

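Figure 5.2: Illustration of optimality for MDPs. [Plot not reproduced: $T_{\sigma'}$, $T_{\sigma''}$ and $T$ drawn against the 45-degree line, with the fixed points $v_{\sigma'}$ and $v_{\sigma''} = v^*$ marked on the horizontal $v$ axis.]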

EXERCISE 5.1.13. Prove that, in Proposition 5.1.1, (iii) implies (iv).

Figure 5.2 illustrates Proposition 5.1.1 in an abstract case, where $\mathsf X$ is a singleton $\{x\}$. We write $v$ instead of $v(x)$ for the value of state $x$ and place $v$ on the horizontal axis. In the figure, the set of policies is $\Sigma = \{\sigma', \sigma''\}$. For given $\sigma \in \Sigma$, the map $T_\sigma$ is an affine function $T_\sigma v = r_\sigma + \beta P_\sigma v$ and the fixed point is $v_\sigma$. The Bellman operator $T$ is the upper envelope of the functions $\{T_\sigma\}$, as shown in (iii) of Exercise 5.1.11. By definition,

(i) $v^*$ is the largest of these fixed points, which equals $v_{\sigma''}$, and
(ii) $\sigma''$ is the optimal policy, since $v_{\sigma''} = v^*$.

In accordance with Proposition 5.1.1, 𝑣∗ is also the fixed point of the Bellman operator.
It is important to understand the significance of (iii) in Proposition 5.1.1. Greedy
policies are relatively easy to compute, in the sense that solving (5.22) at each 𝑥 is
easier than trying to directly solve the problem of maximizing lifetime value, since Σ
is in general far larger than Γ ( 𝑥 ). Part (iii) tells us that solving the overall problem
reduces to computing a 𝑣-greedy policy with the right choice of 𝑣. For optimal stopping
problems, that choice is the value function 𝑣∗ . Intuitively, 𝑣∗ assigns a “correct” value

to each state, in the sense of maximal lifetime value the controller can extract, so
using 𝑣∗ to calculate greedy policies leads to the optimal outcome.

5.1.4 Algorithms

In previous chapters we solved job search and optimal stopping problems using value
function iteration. In this section we present a generalization suitable for arbitrary
MDPs and then discuss two important alternatives.

5.1.4.1 Value Function Iteration

Value function iteration (VFI) for MDPs is similar to VFI for the job search model
(see page 37): we use successive approximation on 𝑇 to compute an approximation
𝑣𝑘 to the value function 𝑣∗ and then take a 𝑣𝑘 -greedy policy. The general procedure is
given by Algorithm 5.2.

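Algorithm 5.2: Value function iteration for MDPs

    input $v_0 \in \mathbb R^{\mathsf X}$, an initial guess of $v^*$
    input $\tau$, a tolerance level for error
    $\varepsilon \leftarrow +\infty$ and $k \leftarrow 0$
    while $\varepsilon > \tau$ do
        $v_{k+1} \leftarrow T v_k$
        $\varepsilon \leftarrow \| v_k - v_{k+1} \|_\infty$
        $k \leftarrow k + 1$
    end
    return a $v_k$-greedy policy $\sigma$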

The fact that the sequence ( 𝑣𝑘 ) 𝑘⩾0 produced by VFI converges to 𝑣∗ is immediate
from Proposition 5.1.1 (as the tolerance 𝜏 is taken toward zero). It is also true that
the greedy policy produced in the last step is approximately optimal when 𝜏 is small,
and exactly optimal when 𝑘 is sufficiently large. Proofs are given in Chapter 8, where
we examine VFI in a more general setting.
VFI is robust, easy to understand and easy to implement. These properties explain
its enduring popularity. At the same time, in terms of efficiency, VFI is often dominated
by alternative algorithms, two of which are discussed below.
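
For readers who prefer code to pseudocode, Algorithm 5.2 can be sketched as follows for an MDP stored as arrays. The helper names (`action_values`, `T`, `greedy`, `vfi`), the layout `r[x, a]` and `P[x, a, x′]`, the example primitives and the assumption that every action is feasible in every state are all hypothetical.

```julia
# Action values r(x, a) + β Σ_x′ v(x′) P(x, a, x′) for each action a at state x
action_values(v, r, P, β, x) =
    [r[x, a] + β * sum(P[x, a, x′] * v[x′] for x′ in eachindex(v)) for a in axes(r, 2)]

# Bellman operator and greedy policy, both built from the action values
T(v, r, P, β) = [maximum(action_values(v, r, P, β, x)) for x in eachindex(v)]
greedy(v, r, P, β) = [argmax(action_values(v, r, P, β, x)) for x in eachindex(v)]

"Value function iteration: iterate with T until successive iterates are within τ."
function vfi(r, P, β; v0=zeros(size(r, 1)), τ=1e-8)
    v, ε = v0, Inf
    while ε > τ
        v_new = T(v, r, P, β)
        ε = maximum(abs.(v_new - v))    # sup-norm distance between iterates
        v = v_new
    end
    return greedy(v, r, P, β), v
end

β = 0.95
r = [1.0 0.0;
     0.0 2.0]
P = zeros(2, 2, 2)
P[1, 1, :] = [0.9, 0.1]
P[1, 2, :] = [0.2, 0.8]
P[2, 1, :] = [0.5, 0.5]
P[2, 2, :] = [0.1, 0.9]

σ_star, v_star = vfi(r, P, β)
```

The returned `v_star` is only an approximation of $v^*$, with accuracy governed by `τ` and $\beta$; the returned policy is greedy with respect to that approximation.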

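Figure 5.3: Howard policy iteration algorithm (HPI). [Flow diagram not reproduced: initial $\sigma$ → $v_\sigma = (I - \beta P_\sigma)^{-1} r_\sigma$ → $v_\sigma$-greedy $\sigma'$ → $v_{\sigma'} = (I - \beta P_{\sigma'})^{-1} r_{\sigma'}$ → ... → $\sigma^*$ → $v_{\sigma^*} = (I - \beta P_{\sigma^*})^{-1} r_{\sigma^*}$.]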

5.1.4.2 Howard Policy Iteration

Unlike VFI, Howard policy iteration (HPI) computes optimal policies by iterating
between computing the value of a given policy and computing the greedy policy as-
sociated with that value. The full technique is described in Algorithm 5.3.

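Algorithm 5.3: Howard policy iteration for MDPs

    input $\sigma \in \Sigma$
    $v_0 \leftarrow v_\sigma$ and $k \leftarrow 0$
    repeat
        $\sigma_k \leftarrow$ a $v_k$-greedy policy
        $v_{k+1} \leftarrow (I - \beta P_{\sigma_k})^{-1} r_{\sigma_k}$
        if $v_{k+1} = v_k$ then break
        $k \leftarrow k + 1$
    return $\sigma_k$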

A visualization of HPI is given in Figure 5.3, where $\sigma$ is the initial choice. Next we compute the lifetime value $v_\sigma$, and then the $v_\sigma$-greedy policy $\sigma'$, and so on. The computation of lifetime value is called the policy evaluation step, while the computation of greedy policies is called policy improvement.
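
The evaluation and improvement steps can be sketched in code as follows. This is an illustrative implementation for an MDP stored as arrays `r[x, a]` and `P[x, a, x′]` with all actions feasible; the helper names and example primitives are hypothetical. It stops when the greedy policy no longer changes, a common variant of the stopping rule $v_{k+1} = v_k$ that avoids exact floating point comparisons of values.

```julia
using LinearAlgebra

# Action values r(x, a) + β Σ_x′ v(x′) P(x, a, x′) for each action a at state x
action_values(v, r, P, β, x) =
    [r[x, a] + β * sum(P[x, a, x′] * v[x′] for x′ in eachindex(v)) for a in axes(r, 2)]

# Policy improvement: a v-greedy policy
greedy(v, r, P, β) = [argmax(action_values(v, r, P, β, x)) for x in eachindex(v)]

"Policy evaluation: the lifetime value v_σ = (I - β P_σ)^{-1} r_σ."
function policy_value(σ, r, P, β)
    n = length(σ)
    P_σ = [P[x, σ[x], x′] for x in 1:n, x′ in 1:n]
    r_σ = [r[x, σ[x]] for x in 1:n]
    return (I - β * P_σ) \ r_σ
end

"Howard policy iteration, stopping when the greedy policy is unchanged."
function howard_policy_iteration(r, P, β; σ=ones(Int, size(r, 1)))
    while true
        v = policy_value(σ, r, P, β)       # evaluate the current policy
        σ_new = greedy(v, r, P, β)         # improve on it
        σ_new == σ && return σ, v
        σ = σ_new
    end
end

β = 0.95
r = [1.0 0.0;
     0.0 2.0]
P = zeros(2, 2, 2)
P[1, 1, :] = [0.9, 0.1]
P[1, 2, :] = [0.2, 0.8]
P[2, 1, :] = [0.5, 0.5]
P[2, 2, :] = [0.1, 0.9]

σ_star, v_star = howard_policy_iteration(r, P, β)
```

When the loop exits, the returned policy is greedy with respect to its own lifetime value, which is the optimality condition in Bellman's principle.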
HPI has two very attractive features. One is that, in a finite state setting, the algorithm always converges to an exact optimal policy in a finite number of steps, regardless of the initial condition. We prove this fact in a more general setting in Chapter 8. The second is that the rate of convergence is faster than VFI, as will be shown in §5.1.4.3.

Figure 5.4: HPI as a version of Newton’s method. [Plot not reproduced: $T$ and $T_{\sigma'}$ drawn against the 45-degree line, with $v_\sigma$, $v_{\sigma'}$ and $v^*$ marked on the horizontal axis.]

Figure 5.4 gives another illustration, presented in the one-dimensional setting that we used for Figure 5.2. In this illustration, we imagine that there are many feasible policies, and hence many functions in $\{T_\sigma\}$, so that their upper envelope, which is the Bellman operator, becomes a smoother curve. The figure shows the update from $v_\sigma$ to the next lifetime value $v_{\sigma'}$, via the following two steps:
(i) Take $\sigma'$ to be $v_\sigma$-greedy, which means that $T_{\sigma'} v_\sigma = T v_\sigma$ (see Exercise 5.1.11).
(ii) Take $v_{\sigma'}$ to be the fixed point of $T_{\sigma'}$.
The next step, from $v_{\sigma'}$ to $v_{\sigma''}$, is analogous.
Comparison of this figure with Figure 2.1 on page 48 suggests that HPI is an implementation of Newton’s method, applied to the Bellman operator. We confirm this in §5.1.4.3.

5.1.4.3 HPI as Newton Iteration

In discussing the connection between HPI and Newton iteration, one issue is that $T$ is not always differentiable, as seen in Figure 5.2. But $T$ is convex, and this lets us substitute subgradients for derivatives. Once we make this modification, HPI and Newton iteration are identical, as we now show.

Figure 5.5: Subgradients of convex functions. [Plot not reproduced: left panel shows $T_1$, right panel shows $T_2$, each with tangent lines at a point $v$.]

First, recall that, given a self-map 𝑇 from 𝑆 ⊂ R𝑛 to itself, an 𝑛 × 𝑛 matrix 𝐷 is
called a subgradient of 𝑇 at 𝑣 ∈ 𝑆 if

𝑇𝑢 ⩾ 𝑇 𝑣 + 𝐷 ( 𝑢 − 𝑣) for all 𝑢 ∈ 𝑆. (5.26)

Figure 5.5 illustrates the definition in one dimension, where 𝐷 is just a scalar de-
termining the slope of a tangent line at 𝑣. In the left subfigure, 𝑇1 is convex and
differentiable at 𝑣, which means that only one subgradient exists (since any other
choice of slope implies that the inequality in (5.26) will fail for some 𝑢). In the right
subfigure, 𝑇2 is convex but nondifferentiable at 𝑣, so multiple subgradients exist.
In the next result, we take ( Γ, 𝛽, 𝑟, 𝑃 ) to be a given MDP and let 𝑇 be the associated
Bellman operator.

Lemma 5.1.2. If 𝑣 ∈ RX and 𝜎 ∈ Σ is 𝑣-greedy, then 𝛽𝑃𝜎 is a subgradient of 𝑇 at 𝑣.

Proof. Fix 𝑣 ∈ RX and let 𝜎 be 𝑣-greedy. Using 𝑇 ⩾ 𝑇𝜎 and 𝑇𝜎 𝑣 = 𝑇 𝑣, we have

𝑇𝑢 = 𝑇 𝑣 + 𝑇𝑢 − 𝑇 𝑣 ⩾ 𝑇 𝑣 + 𝑇𝜎 𝑢 − 𝑇𝜎 𝑣.

Applying the definition of 𝑇𝜎 now gives

𝑇𝑢 ⩾ 𝑇 𝑣 + 𝛽𝑃𝜎 𝑢 − 𝛽𝑃𝜎 𝑣 = 𝑇 𝑣 + 𝛽𝑃𝜎 ( 𝑢 − 𝑣) .

Hence 𝛽𝑃𝜎 is a subgradient of 𝑇 at 𝑣, as claimed. □

Now let’s consider Newton’s method applied to the problem of finding the fixed point of $T$. Since $T$ is nondifferentiable and convex, we replace the Jacobian in Newton’s method (see (2.2) on page 48) with a subgradient, using $\beta P_\sigma$ from Lemma 5.1.2. This leads us to iterate on

\[
v_{k+1} = Q v_k \quad \text{where} \quad Q v := (I - \beta P_\sigma)^{-1} (T v - \beta P_\sigma v).
\]

In the definition of $Q$, the policy $\sigma$ is $v$-greedy. Using $T v = T_\sigma v$, the map $Q$ reduces to $Q v = (I - \beta P_\sigma)^{-1} r_\sigma$, which is exactly the update step to produce the next $\sigma$-value function in HPI (i.e., the lifetime value of a $v$-greedy policy).
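
The reduction of $Q$ can be spelled out in two short steps. The derivation below is a sketch that assumes the Newton update is written for the root-finding problem $F(v) := Tv - v = 0$ with the Jacobian of $T$ replaced by the subgradient $J := \beta P_\sigma$; the symbols $F$ and $J$ are introduced here for illustration only.

```latex
% Newton step for F(v) = Tv - v, with J = \beta P_\sigma standing in for the Jacobian of T
\begin{aligned}
v_{k+1}
  &= v_k - (J - I)^{-1} (T v_k - v_k) \\
  &= (I - J)^{-1} \left[ (I - J) v_k + T v_k - v_k \right] \\
  &= (I - \beta P_\sigma)^{-1} (T v_k - \beta P_\sigma v_k).
\end{aligned}

% Since \sigma is v_k-greedy, T v_k = T_\sigma v_k = r_\sigma + \beta P_\sigma v_k, so
\begin{aligned}
v_{k+1}
  &= (I - \beta P_\sigma)^{-1} (r_\sigma + \beta P_\sigma v_k - \beta P_\sigma v_k) \\
  &= (I - \beta P_\sigma)^{-1} r_\sigma = v_\sigma .
\end{aligned}
```

The terms involving $v_k$ cancel, which is why the HPI update depends on $v_k$ only through the greedy policy $\sigma$.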
The fact that HPI is a version of Newton’s method suggests that its iterates $(v_k)_{k \geq 0}$ enjoy quadratic convergence. This is indeed the case: under mild conditions one can show there exists a constant $N$ such that, for all $k \geq 1$,

\[
\| v_{k+1} - v_k \| \leq N \| v_k - v_{k-1} \|^2 \tag{5.27}
\]

(see, e.g., Puterman (2005), Theorem 6.4.8). Hence HPI enjoys both a fast convergence rate and the robustness of global convergence.
However, HPI is not always optimal in terms of efficiency, since the size of the
constant term in (5.27) also matters. This term can be large because, at each step, the
update from 𝑣𝜎 to 𝑣𝜎0 requires computing the exact lifetime value 𝑣𝜎0 of the 𝑣𝜎 -greedy
policy 𝜎0. Computing this fixed point exactly can be computationally expensive in
high dimensions.
One way around this issue is to forgo computing the fixed point 𝑣𝜎0 exactly, re-
placing it with an approximation. The next section takes up this idea.

5.1.4.4 Optimistic Policy Iteration

Optimistic policy iteration (OPI) is an algorithm that borrows from both VFI and HPI. In essence, the algorithm is the same as HPI except that, instead of computing the full value $v_\sigma$ of a given policy, it uses the approximation $T_\sigma^m v$ from Exercise 5.1.7. Algorithm 5.4 clarifies.
In the algorithm, the policy operator $T_{\sigma_k}$ is applied $m$ times to generate an approximation of $v_{\sigma_k}$. The constant step size $m$ can also be replaced with a sequence $(m_k) \subset \mathbb N$. In either case, for MDPs, convergence to an optimal policy is guaranteed. We prove this in a more general setting in Chapter 8.
Notice that, as $m \to \infty$, the algorithm increasingly approximates Howard policy iteration, since $T_{\sigma_k}^m v_k$ converges to $v_{\sigma_k}$. At the same time, if $m = 1$, the algorithm reduces to VFI. This follows from Exercise 5.1.11, which tells us that, when $\sigma_k$ is $v_k$-greedy, $T_{\sigma_k} v_k = T v_k$. Hence, with intermediate $m$, OPI can be seen as a “convex combination” of HPI and VFI.

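Algorithm 5.4: Optimistic policy iteration for MDPs

    input $m \in \mathbb N$ and tolerance $\tau \geq 0$
    input $\sigma \in \Sigma$ and set $v_0 \leftarrow v_\sigma$
    $k \leftarrow 0$
    repeat
        $\sigma_k \leftarrow$ a $v_k$-greedy policy
        $v_{k+1} \leftarrow T_{\sigma_k}^m v_k$
        if $\| v_{k+1} - v_k \| \leq \tau$ then break
        $k \leftarrow k + 1$
    return $\sigma_k$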
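
A compact sketch of Algorithm 5.4 for an MDP stored as arrays is given below. The helper names, the layout `r[x, a]` and `P[x, a, x′]`, the example primitives and the assumption that all actions are feasible are hypothetical. For simplicity the sketch starts from $v \equiv 0$ rather than from $v_\sigma$ for an input policy, which is a departure from the algorithm as stated.

```julia
# Action values r(x, a) + β Σ_x′ v(x′) P(x, a, x′) for each action a at state x
action_values(v, r, P, β, x) =
    [r[x, a] + β * sum(P[x, a, x′] * v[x′] for x′ in eachindex(v)) for a in axes(r, 2)]

greedy(v, r, P, β) = [argmax(action_values(v, r, P, β, x)) for x in eachindex(v)]

"Optimistic policy iteration with m applications of T_σ per improvement step."
function optimistic_policy_iteration(r, P, β; m=20, τ=1e-8)
    n = size(r, 1)
    v = zeros(n)
    while true
        σ = greedy(v, r, P, β)                       # policy improvement
        P_σ = [P[x, σ[x], x′] for x in 1:n, x′ in 1:n]
        r_σ = [r[x, σ[x]] for x in 1:n]
        v_new = v
        for _ in 1:m                                 # partial policy evaluation
            v_new = r_σ + β * P_σ * v_new
        end
        maximum(abs.(v_new - v)) <= τ && return σ, v_new
        v = v_new
    end
end

β = 0.95
r = [1.0 0.0;
     0.0 2.0]
P = zeros(2, 2, 2)
P[1, 1, :] = [0.9, 0.1]
P[1, 2, :] = [0.2, 0.8]
P[2, 1, :] = [0.5, 0.5]
P[2, 2, :] = [0.1, 0.9]

σ_star, v_star = optimistic_policy_iteration(r, P, β)
```

Setting `m=1` makes each pass a single application of the Bellman operator, so the routine then behaves like VFI, while large `m` brings each pass close to an exact policy evaluation as in HPI.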

In almost all dynamic programming applications, there exist choices of $m > 1$ such that OPI converges faster than VFI. We investigate these ideas in the applications below. In some cases, there exist values of $m$ such that OPI dominates HPI. However, this depends on the structure of the problem and the software and hardware platforms being employed; see §2.1.4.4 and the applications below for additional discussion.

5.2 Applications

This section gives several applications of the MDP model to economic problems. The
applications illustrate the ease with which MDPs can be implemented on a computer
(provided that the state and action spaces are not too large).

5.2.1 Optimal Inventories

In §3.1.1.2 we studied a firm whose inventory behavior was specified to follow S-s dynamics. In §5.1.2.2 we introduced a model where inventory behavior is endogenous, determined by the desire to maximize firm value. In this section, we show that this endogenous inventory behavior can replicate the S-s dynamics from §3.1.1.2.
We saw in §5.1.2.2 that the optimal inventory model is an MDP, so the optimality and convergence results in Proposition 5.1.1 apply. In particular, the unique fixed point of the Bellman operator is the value function $v^*$, and a policy $\sigma^*$ is optimal if and only if $\sigma^*$ is $v^*$-greedy.
We solve the model numerically using VFI. As in Exercise 5.1.2, we take $\varphi$ to be the geometric distribution on $\mathbb Z_+$ with parameter $p$. We use the default parameter values shown in Listing 14. The code listing also presents an implementation of the Bellman operator.

Figure 5.6: The value function and optimal policy for the inventory problem. [Plot not reproduced: the top panel shows the value $v^*$ against inventory; the bottom panel shows the optimal choice $\sigma^*$ against inventory.]

Figure 5.6 exhibits an approximation of the value function $v^*$, computed by iterating with $T$ starting at $v \equiv 1$. Figure 5.6 also shows the approximate optimal policy, obtained as a $v^*$-greedy policy:

\[
\sigma^*(x) \in \operatorname*{argmax}_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{d \geq 0} v^*(f(x, a, d)) \varphi(d) \right\}.
\]

The plot of the optimal policy shows that there is a threshold region below which the
firm orders large batches and above which the firm orders nothing. This makes sense,
since the firm wishes to economize on the fixed cost of ordering. Figure 5.7 shows a
simulation of inventory dynamics under the optimal policy, starting from 𝑋0 = 0. The
time path closely approximates the S-s dynamics discussed in §3.1.1.2.

EXERCISE 5.2.1. Compute the optimal policy by extending the code given in List-
ing 14. Replicate Figure 5.7, modulo randomness, by sampling from a geometric
distribution and implementing the dynamics in (5.6). At each 𝑋𝑡 , the action 𝐴𝑡 should
be chosen according to the optimal policy 𝜎∗ ( 𝑋𝑡 ).

using Distributions

f(x, a, d) = max(x - d, 0) + a  # Inventory update

function create_inventory_model(; β=0.98,      # discount factor
                                  K=40,        # maximum inventory
                                  c=0.2, κ=2,  # cost parameters
                                  p=0.6)       # demand parameter
    ϕ(d) = (1 - p)^d * p        # demand distribution
    x_vals = collect(0:K)       # set of inventory levels
    return (; β, K, c, κ, p, ϕ, x_vals)
end

"The function B(x, a, v) = r(x, a) + β Σ_x′ v(x′) P(x, a, x′)."
function B(x, a, v, model; d_max=100)
    (; β, K, c, κ, p, ϕ, x_vals) = model
    # Expected revenue, truncating the demand support at d_max
    revenue = sum(min(x, d) * ϕ(d) for d in 0:d_max)
    current_profit = revenue - c * a - κ * (a > 0)
    # Inventory level x′ is stored at index x′ + 1 of v
    next_value = sum(v[f(x, a, d) + 1] * ϕ(d) for d in 0:d_max)
    return current_profit + β * next_value
end

"The Bellman operator."
function T(v, model)
    (; β, K, c, κ, p, ϕ, x_vals) = model
    new_v = similar(v)
    for (x_idx, x) in enumerate(x_vals)
        Γx = 0:(K - x)          # feasible orders at inventory level x
        new_v[x_idx], _ = findmax(B(x, a, v, model) for a in Γx)
    end
    return new_v
end

Listing 14: Solving the optimal inventory model (inventory_dp.jl)


Figure 5.7: Optimal inventory dynamics. [Plot not reproduced: inventory $X_t$ plotted against time $t$.]

5.2.2 Optimal Savings with Labor Income

As our next example of an MDP, we modify the cake eating problem in §5.1.2.3 to
add labor income. Wealth evolves according to

𝑊𝑡+1 = 𝑅 (𝑊𝑡 + 𝑌𝑡 − 𝐶𝑡 ) ( 𝑡 = 0, 1, . . .) , (5.28)

where $(W_t)$ takes values in a finite set $\mathsf W \subset \mathbb R_+$ and labor income $(Y_t)$ is a Markov chain on a finite set $\mathsf Y \subset \mathbb R_+$ with transition matrix $Q$.¹ Here $R$ is a gross rate of interest, so that investing $d$ dollars today returns $Rd$ next period. Other parts of the problem are unchanged. The Bellman operator can be written as

\[
(Tv)(w, y) = \max_{w' \in \Gamma(w, y)} \left\{ u\left(w + y - \frac{w'}{R}\right) + \beta \sum_{y'} v(w', y') Q(y, y') \right\}. \tag{5.29}
\]

5.2.2.1 MDP Representation

To frame this problem as an MDP, we set the state to $x := (w, y)$, representing current wealth and income, taking values in the state space $\mathsf X := \mathsf W \times \mathsf Y$. The action is savings $s$, which takes values in $\mathsf W$ and equals $w'$. The feasible correspondence is the set of feasible savings values

\[
\Gamma(w, y) = \{ s \in \mathsf W : s \leq R(w + y) \}.
\]

The current reward is utility of consumption $r((w, y), s) = u(w + y - s/R)$. The stochastic kernel is

\[
P((w, y), s, (w', y')) = \mathbb 1\{w' = s\} Q(y, y').
\]

Since the problem is now framed as an MDP, the optimality results in Proposition 5.1.1 apply.

¹ See Marcet et al. (2007) and Zhu (2020) for more extensive analysis of how adding a labor supply choice can affect outcomes in a consumption-savings model.

5.2.2.2 Implementation

To implement the algorithms discussed in §5.1.4, we use the Bellman operator (5.29), and the corresponding definition of a $v$-greedy policy, which is

\[
\sigma(w, y) \in \operatorname*{argmax}_{w' \in \Gamma(w, y)} \left\{ u\left(w + y - \frac{w'}{R}\right) + \beta \sum_{y'} v(w', y') Q(y, y') \right\}
\]

for all $(w, y)$. The policy operator for given $\sigma \in \Sigma$ is

\[
(T_\sigma v)(w, y) = u\left(w + y - \frac{\sigma(w, y)}{R}\right) + \beta \sum_{y'} v(\sigma(w, y), y') Q(y, y'). \tag{5.30}
\]

Code for implementing the model and these two operators is given in Listing 15.
Income is constructed as a discretized AR(1) process using the method from §3.1.3.
Exponentiation is applied to the grid so that income takes positive values.
The function get_value in Listing 16 uses the expression $v_\sigma = (I - \beta P_\sigma)^{-1} r_\sigma$ from (5.18) to obtain the value of a given policy $\sigma$. The matrix $P_\sigma$ and vector $r_\sigma$ take the form

\[
P_\sigma((w, y), (w', y')) = \mathbb 1\{\sigma(w, y) = w'\} Q(y, y')
\]

\[
r_\sigma(w, y) = u(w + y - \sigma(w, y)/R).
\]

5.2.2.3 Timing

Since all results for MDPs apply, we know that the value function $v^*$ is the unique fixed point of the Bellman operator in $\mathbb R^{\mathsf X}$, and that value function iteration, Howard policy iteration and optimistic policy iteration all converge. Listing 17 implements these three algorithms. Since the state and action space are finite, Howard policy iteration is guaranteed to return an exact optimal policy.

using QuantEcon, LinearAlgebra, IterTools

function create_savings_model(; R=1.01, β=0.98, γ=2.5,
                                w_min=0.01, w_max=20.0, w_size=200,
                                ρ=0.9, ν=0.1, y_size=5)
    w_grid = LinRange(w_min, w_max, w_size)
    mc = tauchen(y_size, ρ, ν)
    y_grid, Q = exp.(mc.state_values), mc.p
    return (; β, R, γ, w_grid, y_grid, Q)
end

"B(w, y, w′, v) = u(w + y - w′/R) + β Σ_y′ v(w′, y′) Q(y, y′)."
function B(i, j, k, v, model)
    (; β, R, γ, w_grid, y_grid, Q) = model
    w, y, w′ = w_grid[i], y_grid[j], w_grid[k]
    u(c) = c^(1-γ) / (1-γ)
    c = w + y - (w′ / R)
    # Infeasible choices (nonpositive consumption) are assigned value -Inf
    @views value = c > 0 ? u(c) + β * dot(v[k, :], Q[j, :]) : -Inf
    return value
end

"The Bellman operator."
function T(v, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    v_new = similar(v)
    for (i, j) in product(w_idx, y_idx)
        v_new[i, j] = maximum(B(i, j, k, v, model) for k in w_idx)
    end
    return v_new
end

"The policy operator."
function T_σ(v, σ, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    v_new = similar(v)
    for (i, j) in product(w_idx, y_idx)
        v_new[i, j] = B(i, j, σ[i, j], v, model)
    end
    return v_new
end

Listing 15: Discrete optimal savings model (finite_opt_saving_0.jl)

include("finite_opt_saving_0.jl")

"Compute a v-greedy policy."
function get_greedy(v, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    σ = Matrix{Int32}(undef, length(w_idx), length(y_idx))
    for (i, j) in product(w_idx, y_idx)
        _, σ[i, j] = findmax(B(i, j, k, v, model) for k in w_idx)
    end
    return σ
end

"Get the value v_σ of policy σ."
function get_value(σ, model)
    # Unpack and set up
    (; β, R, γ, w_grid, y_grid, Q) = model
    w_idx, y_idx = (eachindex(g) for g in (w_grid, y_grid))
    wn, yn = length(w_idx), length(y_idx)
    n = wn * yn
    u(c) = c^(1-γ) / (1-γ)
    # Build P_σ and r_σ as multi-index arrays
    P_σ = zeros(wn, yn, wn, yn)
    r_σ = zeros(wn, yn)
    for (i, j) in product(w_idx, y_idx)
        w, y, w′ = w_grid[i], y_grid[j], w_grid[σ[i, j]]
        r_σ[i, j] = u(w + y - w′/R)
        for j′ in y_idx
            P_σ[i, j, σ[i, j], j′] = Q[j, j′]
        end
    end
    # Reshape for matrix algebra
    P_σ = reshape(P_σ, n, n)
    r_σ = reshape(r_σ, n)
    # Solve the linear system for the value of σ
    v_σ = (I - β * P_σ) \ r_σ
    # Return as multi-index array
    return reshape(v_σ, wn, yn)
end

Listing 16: Discrete optimal savings model (finite_opt_saving_1.jl)

include("s_approx.jl")
include("finite_opt_saving_1.jl")

"Value function iteration routine."
function value_iteration(model, tol=1e-5)
    vz = zeros(length(model.w_grid), length(model.y_grid))
    v_star = successive_approx(v -> T(v, model), vz, tolerance=tol)
    return get_greedy(v_star, model)
end

"Howard policy iteration routine."
function policy_iteration(model)
    wn, yn = length(model.w_grid), length(model.y_grid)
    σ = ones(Int32, wn, yn)
    i, error = 0, 1.0
    while error > 0
        v_σ = get_value(σ, model)
        σ_new = get_greedy(v_σ, model)
        error = maximum(abs.(σ_new - σ))
        σ = σ_new
        i = i + 1
        println("Concluded loop $i with error $error.")
    end
    return σ
end

"Optimistic policy iteration routine."
function optimistic_policy_iteration(model; tolerance=1e-5, m=100)
    v = zeros(length(model.w_grid), length(model.y_grid))
    error = tolerance + 1
    while error > tolerance
        last_v = v
        σ = get_greedy(v, model)
        for i in 1:m
            v = T_σ(v, σ, model)
        end
        error = maximum(abs.(v - last_v))
    end
    return get_greedy(v, model)
end

Listing 17: Discrete optimal savings model (finite_opt_saving_2.jl)

Figure 5.8: Timings for alternative algorithms, savings model. [Plot not reproduced: run time in seconds against the OPI step parameter $m$, for value function iteration, Howard policy iteration and optimistic policy iteration.]

Figure 5.8 shows the number of seconds taken to solve the finite optimal savings
model under the default parameters when executed on a laptop machine with 20 CPUs
running at around 4GHz. The horizontal axis corresponds to the step parameter 𝑚 in
OPI (Algorithm 5.4). The two other algorithms do not depend on 𝑚 and hence their
timings are constant. The figure shows that HPI is an order of magnitude faster than
VFI and that optimistic policy iteration is even faster for moderate values of 𝑚.
One reason VFI is slow is that the discount factor is close to one. This matters be-
cause the convergence rate for VFI is linear with error size decreasing geometrically
in 𝛽 . In contrast, HPI, being an instance of Newton iteration, converges quadrati-
cally (see §2.1.4.2). As a result, HPI tends to dominate VFI when the discount factor
approaches unity.
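
The gap between linear and quadratic convergence can be seen on a toy problem. The following is a minimal sketch, not the code behind Figure 5.8: it assumes a small randomly generated MDP in which every action is feasible at every state, and the helper names (`random_mdp`, `qvals`, `vfi_count`, `hpi_count`) are hypothetical. It counts the iterations that VFI and HPI need when 𝛽 is close to one.

```julia
using LinearAlgebra, Random

# Hypothetical small MDP: n states, k actions, all actions feasible everywhere.
function random_mdp(n, k; seed=1)
    rng = MersenneTwister(seed)
    r = rand(rng, n, k)              # r[x, a]
    P = rand(rng, n, k, n)           # P[x, a, x′]
    P ./= sum(P, dims=3)             # each P[x, a, :] sums to one
    return r, P
end

# q[x, a] = r[x, a] + β Σ_x′ v[x′] P[x, a, x′]
qvals(v, r, P, β) = r .+ β .* reshape(reshape(P, :, length(v)) * v, size(r))

bellman(v, r, P, β) = vec(maximum(qvals(v, r, P, β), dims=2))

function greedy(v, r, P, β)
    q = qvals(v, r, P, β)
    return [argmax(q[x, :]) for x in axes(q, 1)]
end

# Value of a policy σ, obtained by solving (I - β P_σ) v = r_σ
function policy_value(σ, r, P, β)
    n = length(σ)
    r_σ = [r[x, σ[x]] for x in 1:n]
    P_σ = [P[x, σ[x], y] for x in 1:n, y in 1:n]
    return (I - β * P_σ) \ r_σ
end

# VFI: iterate the Bellman operator until successive iterates are close
function vfi_count(r, P, β; tol=1e-8)
    v, k = zeros(size(r, 1)), 0
    while true
        v_new = bellman(v, r, P, β)
        k += 1
        maximum(abs.(v_new - v)) < tol && return v_new, k
        v = v_new
    end
end

# HPI: alternate policy evaluation and greedy improvement until the policy repeats
function hpi_count(r, P, β)
    σ, k = ones(Int, size(r, 1)), 0
    while true
        σ_new = greedy(policy_value(σ, r, P, β), r, P, β)
        k += 1
        σ_new == σ && return σ, k
        σ = σ_new
    end
end

r, P = random_mdp(20, 5)
β = 0.99
v_vfi, k_vfi = vfi_count(r, P, β)
σ_hpi, k_hpi = hpi_count(r, P, β)
println("VFI iterations: $k_vfi, HPI iterations: $k_hpi")
```

Iteration counts are not run times, since each HPI step solves a linear system, so this sketch only illustrates the convergence-rate argument and says nothing about implementation effects.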
Run-times are also dependent on implementation, and relative speed varies signif-
icantly with coding style, software and hardware platforms. In our implementation,
the main deficiency is that parallelization is under-utilized. Better exploitation of
parallelization tends to favor HPI, as discussed in §2.1.4.4.

[Figure 5.9 plots wealth w_t (vertical axis, 0 to 20) against time (horizontal axis, 0 to 2000).]

Figure 5.9: Time series for wealth

5.2.2.4 Outputs

Figure 5.9 shows a typical time series for the wealth of a single household under the
optimal policy. The series is created by computing an optimal policy $\sigma^*$, generating
$(Y_t)_{t=0}^{m-1}$ as a $Q$-Markov chain on $\mathsf Y$ and then computing $(W_t)_{t=0}^{m}$ via
$W_{t+1} = \sigma^*(W_t, Y_t)$ for $t$ running from 0 to $m - 1$. Initial wealth $W_0$ is set to 1.0
and $m = 2000$.
Figure 5.10 shows the result of computing and histogramming a longer time series,
with $m$ set to 1,000,000. This histogram approximates the stationary distribution of
wealth for a large population, each updating via $\sigma^*$ and each with independently
generated labor income series $(Y_t)_{t=0}^{m-1}$. (This is due to ergodicity of the wealth-income
process. For a discussion of the connection between stationary distributions and time
series under ergodicity see, for example, Sargent and Stachurski (2023b).)
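
The link between a long time series and a stationary distribution can be illustrated on a much smaller chain. The sketch below is an illustration only: the three-state matrix `Q` and the helpers `draw` and `simulate_chain` are hypothetical and are not taken from the savings model. It compares the fraction of time a simulated chain spends in each state with the stationary distribution computed by linear algebra.

```julia
using LinearAlgebra, Random

# Draw an index from the probability vector p by inverse-CDF sampling
function draw(rng, p)
    u, c = rand(rng), 0.0
    for (i, p_i) in enumerate(p)
        c += p_i
        u <= c && return i
    end
    return length(p)
end

# Simulate m periods of a finite Markov chain with transition matrix Q
function simulate_chain(Q, m; init=1, seed=1234)
    rng = MersenneTwister(seed)
    X = Vector{Int}(undef, m)
    X[1] = init
    for t in 1:m-1
        X[t+1] = draw(rng, Q[X[t], :])
    end
    return X
end

# Hypothetical irreducible, aperiodic chain on three states
Q = [0.9 0.1 0.0;
     0.2 0.7 0.1;
     0.0 0.3 0.7]

X = simulate_chain(Q, 200_000)
freq = [count(==(i), X) / length(X) for i in 1:3]

# Stationary distribution: solve ψ = ψ Q together with Σ ψ = 1
A = [(I - Q'); ones(1, 3)]
ψ = A \ [zeros(3); 1.0]

println("time-series frequencies: ", round.(freq, digits=3))
println("stationary distribution: ", round.(ψ, digits=3))
```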
The shape of the wealth distribution in Figure 5.10 is unrealistic. In almost all
countries, the wealth distribution has a very long right tail. The Gini coefficient of the
distribution in Figure 5.10 is 0.54, which is too low. For example, World Bank data
for 2019 produces a wealth Gini for the US equal to 0.852. For Germany and Japan
the figures are 0.816 and 0.627 respectively.
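
For reference, a Gini coefficient can be computed from a sample with a few lines of code. This is a sketch using one standard sample formula; the function name `gini` is hypothetical and the routine is not necessarily the one used to produce the figures in the text.

```julia
# Gini coefficient of a nonnegative sample, using the sorted-sample formula
#   G = 2 Σ_i i y_(i) / (n Σ_i y_(i)) - (n + 1) / n
function gini(x)
    y = sort(x)
    n = length(y)
    return 2 * sum(i * y[i] for i in 1:n) / (n * sum(y)) - (n + 1) / n
end

equal_split = gini([1.0, 1.0, 1.0, 1.0])     # perfectly equal wealth
one_holder = gini([0.0, 0.0, 0.0, 1.0])      # one household holds everything
println((equal_split, one_holder))
```

With this formula, a sample of size n in which one household holds all wealth has Gini coefficient (n − 1)/n, which approaches one as n grows.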
In §5.3.3 we discuss a variation on the optimal savings model that can produce a
more realistic wealth distribution.

[Figure 5.10 is a histogram of wealth on the range 0 to 20; the plot is annotated with Gini = 0.54.]

Figure 5.10: Histogram of wealth

5.2.3 Optimal Investment

As our next application, we consider a monopolist facing adjustment costs and stochas-
tically evolving demand. The monopolist balances setting enough capacity to meet
demand against costs of adjusting capacity.

5.2.3.1 Problem Description

We assume that the monopolist produces a single product and faces an inverse demand
function of the form
$$ P_t = a_0 - a_1 Y_t + Z_t, $$
where $a_0, a_1$ are positive parameters, $Y_t$ is output, $P_t$ is price and the demand shock $Z_t$
follows
$$ Z_{t+1} = \rho Z_t + \sigma \eta_{t+1}, \qquad \{\eta_t\} \overset{\text{IID}}{\sim} N(0, 1). $$
Current profits are
$$ \pi_t := P_t Y_t - c Y_t - \gamma (Y_{t+1} - Y_t)^2. $$
Here $\gamma (Y_{t+1} - Y_t)^2$ represents costs associated with adjusting production scale,
parameterized by $\gamma$, and $c$ is unit cost of current production. Costs are convex, so rapid
changes to capacity are expensive.
The monopolist chooses $(Y_t)$ to maximize the expected discounted value of its profit
flow, which we write as
$$ \mathbb E \sum_{t=0}^{\infty} \beta^t \pi_t. \tag{5.31} $$
Here $\beta = 1/(1 + r)$, where $r > 0$ is a fixed interest rate.


A way to start thinking about the optimal time path of output is to consider what
would happen if 𝛾 = 0. Without adjustment costs there is no intertemporal trade-off,
so the monopolist should choose output to maximize current profit in each period.
The implied level of output at time $t$ is
$$ \bar Y_t := \frac{a_0 - c + Z_t}{2 a_1}. \tag{5.32} $$

EXERCISE 5.2.2. Show that $\bar Y_t$ maximizes current profit when $\gamma = 0$.

For $\gamma > 0$, we expect the following behavior.

• If $\gamma$ is close to zero, then the optimal output path $Y_t$ will track the time path of
$\bar Y_t$ relatively closely, while

• if $\gamma$ is larger, then $Y_t$ will be significantly smoother than $\bar Y_t$, as the monopolist
seeks to avoid adjustment costs.

5.2.3.2 MDP Representation

We can represent this problem as an MDP. To do so we let $\mathsf Y$ be a grid contained in
$\mathbb R_+$ that lists possible output values. To conform to the finite state setting, we discretize
the shock process $(Z_t)$ using Tauchen’s method, as described in §3.1.3. For convenience
we again use $(Z_t)$ to represent the discrete process, which is a finite Markov
chain on $\mathsf Z \subset \mathbb R$ with transition matrix $Q$.
The state space for this MDP is $\mathsf X = \mathsf Y \times \mathsf Z$, while the action space is $\mathsf Y$. The feasible
correspondence is defined by $\Gamma(x) = \mathsf Y$, meaning that choice of output is not restricted
by the state. Thus, the feasible policy set $\Sigma$ is all $\sigma \colon \mathsf Y \times \mathsf Z \to \mathsf Y$.
We write $(y, z)$ for the current state, $q$ for the action (which chooses next period
output) and $(y', z')$ for the next period state. The current reward function is current
profits, which we can write as

$$ r((y, z), q) = (a_0 - a_1 y + z - c) y - \gamma (q - y)^2. $$

The stochastic kernel is

$$ P((y, z), q, (y', z')) = \mathbb 1\{y' = q\} \, Q(z, z'). $$

The term $\mathbb 1\{y' = q\}$ states that next period output $y'$ is equal to our current choice $q$
for next period output. With these definitions, the problem defines an MDP and all of
the optimality theory for MDPs applies.

5.2.3.3 Implementation

The Bellman operator can be expressed as

$$ (Tv)(y, z) = \max_{y' \in \mathsf Y} \left\{ r(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z') \right\}. $$

Given $\sigma \in \Sigma$, we can express the policy operator as

$$ (T_\sigma v)(y, z) = r(y, z, \sigma(y, z)) + \beta \sum_{z'} v(\sigma(y, z), z') Q(z, z'). $$

A $v$-greedy policy is a $\sigma \in \Sigma$ that obeys

$$ \sigma(y, z) \in \operatorname*{argmax}_{y' \in \mathsf Y} \left\{ r(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z') \right\} \quad \text{for all } (y, z) \in \mathsf X. $$

By combining iteration with the policy operator and computation of greedy policies,
we can implement optimistic policy iteration, compute the optimal policy 𝜎∗ , and
study output choices generated by this policy. We are particularly interested in how
output responds over time to randomly generated demand shocks.
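
The steps just described can be sketched in code as follows. This is a minimal illustration rather than the implementation behind Figure 5.11: to stay self-contained it replaces the Tauchen discretization with a hand-picked three-state chain and uses a coarse output grid, and the names `small_investment_model`, `B` and `opi` are hypothetical. The demand and cost parameters follow the defaults quoted in Listing 18.

```julia
using LinearAlgebra

# Hypothetical small instance: the grids and the shock chain are illustrative only
function small_investment_model(; r=0.04, a_0=10.0, a_1=1.0, γ=25.0, c=1.0)
    β = 1 / (1 + r)
    y_grid = collect(range(0.0, 20.0, length=41))
    z_grid = [-1.0, 0.0, 1.0]
    Q = [0.8 0.2 0.0;
         0.1 0.8 0.1;
         0.0 0.2 0.8]
    return (; β, a_0, a_1, γ, c, y_grid, z_grid, Q)
end

# Value of choosing y′ = y_grid[k] at state (y_grid[i], z_grid[j]), given v
function B(i, j, k, v, model)
    (; β, a_0, a_1, γ, c, y_grid, z_grid, Q) = model
    y, z, y′ = y_grid[i], z_grid[j], y_grid[k]
    reward = (a_0 - a_1 * y + z - c) * y - γ * (y′ - y)^2
    return reward + β * dot(v[k, :], Q[j, :])
end

# v-greedy policy, stored as indices into y_grid
function get_greedy(v, model)
    n_y, n_z = size(v)
    return [argmax([B(i, j, k, v, model) for k in 1:n_y]) for i in 1:n_y, j in 1:n_z]
end

# Policy operator T_σ
T_σ(v, σ, model) = [B(i, j, σ[i, j], v, model) for i in axes(v, 1), j in axes(v, 2)]

# Optimistic policy iteration: greedy step followed by m applications of T_σ
function opi(model; tol=1e-6, m=50)
    v = zeros(length(model.y_grid), length(model.z_grid))
    err = tol + 1
    while err > tol
        last_v = v
        σ = get_greedy(v, model)
        for _ in 1:m
            v = T_σ(v, σ, model)
        end
        err = maximum(abs.(v - last_v))
    end
    return get_greedy(v, model), v
end

model = small_investment_model()
σ_star, v_star = opi(model)
println("Chosen next-period output at y = 10, z = 0: ", model.y_grid[σ_star[21, 2]])
```

Storing the policy as grid indices rather than output levels keeps the lookup `v[k, :]` cheap; the output level itself is recovered through `y_grid`.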
Figure 5.11 shows the result of a simulation designed to shed light on how output
responds to demand. After choosing initial values $(Y_1, Z_1)$ and generating a $Q$-Markov
chain $(Z_t)_{t=1}^{T}$, we simulated optimal output via $Y_{t+1} = \sigma^*(Y_t, Z_t)$. The default parameters
are shown in Listing 18. In the figure, the adjustment cost parameter $\gamma$ is varied as
shown in the title. In addition to the optimal output path, the path of $(\bar Y_t)$ as defined
in (5.32) is also presented.
The figure shows how increasing $\gamma$ promotes smoothing, as predicted in our discussion
above. For small $\gamma$, adjustment costs have only minor effects on choices, so
output closely follows $(\bar Y_t)$, the optimal path when output responds immediately to
demand shocks. Conversely, larger values of $\gamma$ make adjustment expensive, so the
monopolist responds relatively slowly to changes in demand.

[Figure 5.11 has four panels, for γ = 1, γ = 10, γ = 20 and γ = 30. Each panel plots output (vertical axis) over 200 periods (horizontal axis) for the optimal path Y_t and the benchmark path Ȳ_t.]
Figure 5.11: Simulation of optimal output with different adjustment costs


using QuantEcon, LinearAlgebra, IterTools

include("s_approx.jl")

function create_investment_model(;
        r=0.04,                              # Interest rate
        a_0=10.0, a_1=1.0,                   # Demand parameters
        γ=25.0, c=1.0,                       # Adjustment and unit cost
        y_min=0.0, y_max=20.0, y_size=100,   # Grid for output
        ρ=0.9, ν=1.0,                        # AR(1) parameters
        z_size=25)                           # Grid size for shock
    β = 1/(1+r)
    y_grid = LinRange(y_min, y_max, y_size)
    mc = tauchen(z_size, ρ, ν)
    z_grid, Q = mc.state_values, mc.p
    return (; β, a_0, a_1, γ, c, y_grid, z_grid, Q)
end

Listing 18: Optimal investment model (finite_lq.jl)

Figure 5.12 compares timings for VFI, HPI and OPI. Parameters are as in Listing 18.
As in Figure 5.8, which gave timings for the optimal savings model, the horizontal axis
shows 𝑚, which is the step parameter in OPI (see Algorithm 5.4). VFI and HPI do not
depend on 𝑚 and hence their timings are constant. The vertical axis is time in seconds.
HPI is faster than VFI, although the difference is not as dramatic as was the case
for optimal savings. One reason is that the discount factor is relatively small for the
optimal investment model (𝑟 = 0.04 and 𝛽 = 1/(1 + 𝑟 ), so 𝛽 ≈ 0.96). Since 𝛽 is
the modulus of contraction for the Bellman operator, this means that VFI converges
relatively quickly. Another observation is that, for many values of 𝑚, OPI dominates
both VFI and HPI in terms of speed, which is consistent with our findings for the
optimal savings model. At 𝑚 = 70, OPI is around 20 times faster than VFI.

[Figure 5.12 plots run time in seconds (vertical axis) against m (horizontal axis) for Howard policy iteration, value function iteration and optimistic policy iteration.]

Figure 5.12: Timings for alternative algorithms, investment model

EXERCISE 5.2.3. Consider a firm that maximizes expected discounted value in a
setting where future profits are discounted at rate $\beta = 1/(1 + r)$, the only production
input is labor and hiring involves fixed costs. Let $\ell_t$ be employment at the firm at time
$t$. Current profits are
$$ \pi_t = p Z_t \ell_t^{\alpha} - w \ell_t - \kappa \mathbb 1\{\ell_{t+1} \neq \ell_t\}, $$
where $p$ is the output price, $w$ is the wage rate, $\alpha$ is a production parameter, the
productivity shock is $Q$-Markov on $\mathsf Z$ and $\kappa$ is a fixed cost of hiring and firing. This
fixed cost induces lumpy adjustment, as shown in Figure 5.13. Show that this model is
an MDP. Write the Bellman equation and the procedure for optimistic policy iteration
in the context of this model. Replicate Figure 5.13, modulo randomness, using the
parameters shown in Listing 19.

5.3 Modified Bellman Equations

Direct application of MDP theory is sometimes suboptimal. For example, we saw in
§1.3.2.2 that solving the job search problem with IID wage draws is best accomplished
by generating a recursion on the continuation value, which reduces dimensionality for
iterative solution methods. Separately, in §4.2.2.2, we saw how a different manipulation
of the Bellman equation also increased efficiency.
Now we aim to study such modifications systematically. We begin by providing
other examples of how manipulating a Bellman equation can facilitate computation
and analysis. Then we establish a theoretical foundation for this line of analysis, and
show how similar ideas can also be applied to policy operators and greedy policies.
(We also treat similar topics at a more advanced and abstract level in Volume II.)
using QuantEcon, LinearAlgebra, IterTools

function create_hiring_model(;
        r=0.04,                              # Interest rate
        κ=1.0,                               # Adjustment cost
        α=0.4,                               # Production parameter
        p=1.0, w=1.0,                        # Price and wage
        l_min=0.0, l_max=30.0, l_size=100,   # Grid for labor
        ρ=0.9, ν=0.4, b=1.0,                 # AR(1) parameters
        z_size=100)                          # Grid size for shock
    β = 1/(1+r)
    l_grid = LinRange(l_min, l_max, l_size)
    mc = tauchen(z_size, ρ, ν, b, 6)
    z_grid, Q = mc.state_values, mc.p
    return (; β, κ, α, p, w, l_grid, z_grid, Q)
end

Listing 19: Firm hiring model (firm_hiring.jl)

[Figure 5.13 plots employment ℓ_t together with the shock Z_t (vertical axis) against time (horizontal axis, 0 to 250).]

Figure 5.13: Optimal shifts in the stock of labor


5.3.1 Structural Estimation

As a first illustration of the ideas in this section, we discuss a connection between
econometric estimation and dynamic programs. Our focus is on some modifications
that econometricians often make to Bellman equations and how they affect computation
and optimality.

5.3.1.1 What is Structural Estimation?

Structural estimation is a branch of quantitative social science in which, in a quest
to understand observed quantities and prices, researchers attribute Markov decision
problems to economic agents. A key step in this approach is to formulate dynamic
programs in terms of functional forms and parameters. The econometric challenge is
to infer parameters that bring the model outputs as close as possible to actual data.
Structural estimation aims to discover objects that are invariant to hypothetical
interventions that the analyst wants to investigate. Examples of such invariant objects
are parameters of utility functions, discount factors, and production technologies.
Agents inside the model solve their MDPs. A policy intervention that systematically
alters the Markov processes that they face will alter agents’ optimal policies, i.e., their
decision rules. Examples of such interventions, involving aspects of fiscal and
monetary policy, are described in several chapters of Lucas and Sargent (1981), a
compendium of early papers that were written in response to the Lucas (1976) Critique
of then-prevailing dynamic econometric models.2
Example 5.3.1. Gillingham et al. (2022) study the used car market in Denmark by
modeling consumers who trade cars in the new and used car markets. By model-
ing consumers’ decision problems, the authors are able to investigate how consumers
would react to a hypothetical modification in automobile taxes. The study finds that
automobile taxes were too high in the sense that the government could have raised
more tax revenue by lowering tax rates.

Efficient solution methods are essential in structural estimation because the
underlying dynamic program must be solved repeatedly in order to search the parameter
space for a good fit to data. Moreover, these dynamic programs are often
high-dimensional, due to shocks to preferences and other random variables that the agents
inside the model are assumed to see but that the econometrician does not. When
these shocks are persistent, the dimension of the state grows.3

2 Rational expectations econometrics was a response to that Critique. While early work on
rational expectations originated from the macroeconomics community (e.g. Hansen and Sargent (1980),
Hansen and Sargent (1990)), many of their examples were actually about industrial organization and
other microeconomic models. This work was part of a broad process that erased many boundaries
between micro and macro theory.

In order to maintain focus on dynamic programming, we will not describe the de-
tails of the estimation step required for structural estimation (although §5.4 contains
references for those who wish to learn about that). Instead, we focus on the kinds of
dynamic programs treated in structural estimation and techniques for solving them
efficiently.

5.3.1.2 An Illustration

Let us look at an example of a dynamic program with preference shocks used in struc-
tural estimation, which is taken from a study of labor supply by married women
(Keane et al., 2011). The husband of the decision maker, a married woman, is al-
ready working. The couple has young children and the mother is deciding whether
to work. Her utility function is

𝑢 ( 𝑐, 𝑑, 𝜉) = 𝑐 + ( 𝛼𝑛 + 𝜉)(1 − 𝑑 ) ,

where $c$ is consumption, $\alpha$ is a parameter, $n$ is the number of children, $\xi$ is a preference
shock and $d$ is the action variable. The action is binary, with $d = 1$ representing the
decision to work in the current period and $d = 0$ representing the decision not to
work.4
The budget constraint for the household is

𝑐𝑡 = 𝑓𝑡 + 𝑤𝑡 𝑑𝑡 − 𝜋𝑛𝑑𝑡 ,

where 𝑓𝑡 is the father’s income, 𝑤𝑡 is the mother’s wage and 𝜋 is the cost of child care.
Wages depend on human capital ℎ𝑡 , which increases with experience. In particular,

𝑤𝑡 = 𝛾ℎ𝑡 + 𝜂𝑡 , with ℎ𝑡 = ℎ𝑡−1 + 𝑑𝑡−1 .

Here $\eta_t$ is random and $\gamma$ is a parameter. We assume that $(f_t)_{t \geq 0}$ is $F$-Markov on some
finite set. In the model, $(\xi_t)_{t \geq 0}$ and $(\eta_t)_{t \geq 0}$ are IID. We denote their joint distribution
by $\varphi$.

3 Hansen and Sargent (1980) analyze the implications of such “Shiller errors” for efficient estimation
procedures in a class of linear structural models.

4 Here, the woman is the primary carer of the child; she derives no utility from children in periods
in which she works. See Keane et al. (2011) for further discussion.


With constant discount factor $\beta$ and implied utility

$$ r(f, h, \xi, \eta, d) := f + (\gamma h + \eta) d - \pi n d + (\alpha n + \xi)(1 - d), $$

the problem of maximizing expected discounted utility is an MDP with the Bellman
equation

$$ v(f, h, \xi, \eta) = \max_{d} \left\{ r(f, h, \xi, \eta, d) + \beta \sum_{f', \xi', \eta'} v(f', h + d, \xi', \eta') F(f, f') \varphi(\xi', \eta') \right\}. $$

While we can proceed directly with a technique such as VFI to obtain optimal
choices, we can simplify.
One way is by reducing the number of states. A hint comes from looking at the
expected value function
$$ g(f, h, d) := \sum_{f', \xi', \eta'} v(f', h + d, \xi', \eta') F(f, f') \varphi(\xi', \eta'). $$

This function depends only on three arguments and, moreover, the choice variable 𝑑
is binary. Hence we can break 𝑔 down into two functions 𝑔 ( 𝑓 , ℎ, 0) and 𝑔 ( 𝑓 , ℎ, 1), each
of which depends only on the pair ( 𝑓 , ℎ). These functions are substantially simpler
than 𝑣 when the domain of ( 𝜉, 𝜂) is large. Hence, it is natural to consider whether we
can solve our problem using 𝑔 rather than 𝑣.

5.3.1.3 Expected Value Functions

Rather than address this question within the context of the preceding model, let’s shift
to a generic version of the dynamic program used in structural estimation and see how it
can be solved using expected value methods. Our generic version takes the form
$$ v(y, \varepsilon) = \max_{a \in \Gamma(y)} \left\{ r(y, \varepsilon, a) + \beta \sum_{y'} \int v(y', \varepsilon') P(y, a, y') \varphi(\varepsilon') \, d\varepsilon' \right\} \tag{5.33} $$
for all $y \in \mathsf Y$ and $\varepsilon \in \mathsf E$. Here $\mathsf Y$ is a finite set, often determined by discretization of
a continuous space, while $\mathsf E$, the outcome space for $\varepsilon$, is allowed to be continuous.
The state $y$ will be called the endogenous state, while $\varepsilon$ is the preference shock. In
practice, $\varepsilon$ will often be a vector of shocks that affect current rewards. The integral
can therefore be multivariate and is over all of $\mathsf E$.
The problem represented by (5.33) is a version of a regular MDP, with state
$x = (y, \varepsilon)$ taking values in $\mathsf X := \mathsf Y \times \mathsf E$. If we discretize the space $\mathsf E$, then all the optimality
theory for MDPs applies. Instead of taking this approach, however, we draw on our
discussion of labor choice in §5.3.1.2. In particular, to enhance efficiency, we will
work with the expected value function
$$ g(y, a) := \sum_{y'} \int v(y', \varepsilon') P(y, a, y') \varphi(\varepsilon') \, d\varepsilon'. \tag{5.34} $$

There are several potential advantages associated with working with 𝑔 rather than 𝑣.
One is that the set of actions A can be much smaller than the set of states that would be
created by discretization of the preference shock space E (especially if 𝜀𝑡 takes values
in a high-dimensional space). Another is that the integral provides smoothing, so that
𝑔 is typically a smooth function. This can accelerate structural estimation procedures.

5.3.1.4 Optimality via EV Methods

To exploit the relative simplicity of the expected value function, we rewrite the
Bellman equation (5.33) as

$$ v(y, \varepsilon) = \max_{a \in \Gamma(y)} \{ r(y, \varepsilon, a) + \beta g(y, a) \}. $$

Taking expectations of both sides and using (5.34) again gives

$$ g(y, a) = \sum_{y'} \int \max_{a' \in \Gamma(y')} \{ r(y', \varepsilon', a') + \beta g(y', a') \} \varphi(\varepsilon') \, d\varepsilon' \, P(y, a, y'). $$

To solve this functional equation we introduce the expected value Bellman
operator $R$ defined at $g \in \mathbb R^{\mathsf G}$ by
$$ (Rg)(y, a) = \sum_{y'} \int \max_{a' \in \Gamma(y')} \{ r(y', \varepsilon', a') + \beta g(y', a') \} \varphi(\varepsilon') \, d\varepsilon' \, P(y, a, y'). \tag{5.35} $$

Here $\mathsf G$ is the set of feasible state-action pairs $(y, a)$.

EXERCISE 5.3.1. Prove that $R$ is order-preserving and a contraction of modulus $\beta$
on $\mathbb R^{\mathsf G}$ (with respect to the supremum norm).

In what follows, we let $g^*$ be the fixed point of $R$ in $\mathbb R^{\mathsf G}$. Since $R$ is a contraction
map, $g^*$ can be computed by successive approximation. The next result shows that
knowing this fixed point is enough to solve the dynamic program.


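Proposition 5.3.1. A policy $\sigma \in \Sigma$ is optimal if and only if

$$ \sigma(y, \varepsilon) \in \operatorname*{argmax}_{a \in \Gamma(y)} \{ r(y, \varepsilon, a) + \beta g^*(y, a) \} \quad \text{for all } (y, \varepsilon) \in \mathsf Y \times \mathsf E. $$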

We postpone proving Proposition 5.3.1 until §5.3.5, where we prove a more gen-
eral result.
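
To make the expected value approach concrete, the following sketch iterates $R$ on a small randomly generated problem and compares the result with ordinary value function iteration on the full state $(y, \varepsilon)$. Everything specific here is an assumption made for illustration: the shock space is a finite set with uniform probabilities, all actions are feasible at every state, and the names `fixed_point`, `R` and `T` are hypothetical.

```julia
using Random

rng = MersenneTwister(0)
n_y, n_e, n_a, β = 4, 5, 2, 0.95
r = rand(rng, n_y, n_e, n_a)                          # r[y, ε, a]
φ = fill(1 / n_e, n_e)                                # distribution of ε
P = rand(rng, n_y, n_a, n_y); P ./= sum(P, dims=3)    # P[y, a, y′]

# Successive approximation for a contraction map F
function fixed_point(F, x; tol=1e-10)
    while true
        x_new = F(x)
        maximum(abs.(x_new - x)) < tol && return x_new
        x = x_new
    end
end

# Expected value Bellman operator, acting on g[y, a]
function R(g)
    # h[y′] = Σ_ε′ max_a′ { r(y′, ε′, a′) + β g(y′, a′) } φ(ε′)
    h = [sum(φ[e] * maximum(r[y, e, a] + β * g[y, a] for a in 1:n_a) for e in 1:n_e)
         for y in 1:n_y]
    return [sum(h[y′] * P[y, a, y′] for y′ in 1:n_y) for y in 1:n_y, a in 1:n_a]
end

# Standard Bellman operator, acting on v[y, ε]
function T(v)
    Ev = v * φ                                        # Ev[y′] = Σ_ε′ v(y′, ε′) φ(ε′)
    return [maximum(r[y, e, a] + β * sum(Ev[y′] * P[y, a, y′] for y′ in 1:n_y)
                    for a in 1:n_a)
            for y in 1:n_y, e in 1:n_e]
end

g_star = fixed_point(R, zeros(n_y, n_a))
v_star = fixed_point(T, zeros(n_y, n_e))

# Policy built from g*: one action index for each (y, ε)
σ_from_g = [argmax([r[y, e, a] + β * g_star[y, a] for a in 1:n_a])
            for y in 1:n_y, e in 1:n_e]
println(size(g_star), " entries in g* versus ", size(v_star), " in v*")
```

In this sketch $g^*$ has `n_y * n_a` entries while $v^*$ has `n_y * n_e`, so the saving grows with the size of the shock space.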
Example 5.3.2. In the labor supply problem in §5.3.1.2, the expected value Bellman
operator becomes
$$ (Rg)(f, h, d) = \sum_{f', \xi', \eta'} \max_{d'} \{ r(f', h + d, \xi', \eta', d') + \beta g(f', h + d, d') \} F(f, f') \varphi(\xi', \eta'). $$

Iterating from an arbitrary guess of $g$ converges to the unique fixed point $g^*$ of $R$. By
Proposition 5.3.1, we can then compute the optimal policy $\sigma^*$ at $(f, h, \xi, \eta)$ by taking

$$ \sigma^*(f, h, \xi, \eta) \in \operatorname*{argmax}_{d} \{ r(f, h, \xi, \eta, d) + \beta g^*(f, h, d) \}. $$

5.3.2 The Gumbel Max Trick


§5.3.1.3 described how using expected values can reduce dimensionality by smoothing.
But there is another feature of an expected value formulation of a Bellman equation
that we can take advantage of when we are prepared to impose extra structure
on preference shocks. This section provides details.
A real-valued random variable $Z$ is said to have a Gumbel distribution (or a “type
1 generalized extreme value distribution”) with location parameter $\mu \in \mathbb R$ if its cumulative
distribution function takes the form $F(z) = \exp(-\exp(-(z - \mu)))$. To denote a random variable
with a Gumbel distribution, we write $Z \sim G(\mu)$. The expectation of $Z$ is $\mu + \gamma$, where
$\gamma \approx 0.577$ is the Euler–Mascheroni constant.

EXERCISE 5.3.2. Prove: if 𝑍 ∼ 𝐺 ( 𝜇 ) and 𝜆 ∈ R, then 𝑍 + 𝜆 ∼ 𝐺 ( 𝜇 + 𝜆 ).
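
The expectation formula can be checked by simulation. The sketch below assumes the standard Gumbel form $F(z) = \exp(-\exp(-(z - \mu)))$, for which $Z = \mu - \ln E$ is Gumbel distributed whenever $E$ is a standard exponential draw. The name `gumbel` and the choice $\mu = 1.5$ are hypothetical.

```julia
using Random, Statistics

const γ_EM = 0.5772156649015329          # Euler–Mascheroni constant

# If E ~ Exp(1), then P(μ - log(E) ≤ z) = exp(-exp(-(z - μ)))
gumbel(rng, μ) = μ - log(randexp(rng))

rng = MersenneTwister(42)
μ = 1.5
draws = [gumbel(rng, μ) for _ in 1:1_000_000]
sample_mean = mean(draws)
println("sample mean = ", sample_mean, ", μ + γ = ", μ + γ_EM)
```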

The Gumbel distribution has the following useful stability property, a proof of
which can be found in Huijben et al. (2022).

Lemma 5.3.2. If $Z_1, \ldots, Z_k \overset{\text{IID}}{\sim} G(-\gamma)$ and $c_1, \ldots, c_k$ are real numbers, then
$$ \max_{1 \leq i \leq k} (Z_i + c_i) \sim G\left( -\gamma + \ln\left[ \sum_{i=1}^{k} \exp(c_i) \right] \right). $$

To exploit Lemma 5.3.2, we continue the discussion in the previous section but
assume now that $\mathsf A = \{a_1, \ldots, a_k\}$, that $\Gamma(y') = \mathsf A$ for all $y'$ (so that actions are
unrestricted), that $\varepsilon'$ in (5.35) is additive in rewards and indexed by actions, so that
$r(y', \varepsilon', a') = r(y', a') + \varepsilon'(a')$ for all feasible $(y', a')$, and that, conditional on $y'$, the
vector $(\varepsilon(a_1), \ldots, \varepsilon(a_k))$ consists of $k$ independent $G(-\gamma)$ shocks, each of which has
mean zero. Thus, each feasible choice returns a reward perturbed by an independent Gumbel shock.
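
The practical content of Lemma 5.3.2 is that the expected maximum of Gumbel-perturbed values is a log-sum-exp. The Monte Carlo sketch below illustrates this under the assumption that shocks are normalized to have mean zero, that is, location $-\gamma$ in the standard parameterization $F(z) = \exp(-\exp(-(z - \mu)))$. The names `gumbel0` and `mc_max` and the vector `c` are hypothetical.

```julia
using Random, Statistics

const γ_EM = 0.5772156649015329          # Euler–Mascheroni constant

# Mean-zero Gumbel draw (location -γ_EM), using E ~ Exp(1) ⇒ -log(E) is standard Gumbel
gumbel0(rng) = -γ_EM - log(randexp(rng))

# Monte Carlo estimate of E max_i (Z_i + c_i) with independent mean-zero Gumbel shocks
function mc_max(c, n; seed=7)
    rng = MersenneTwister(seed)
    return mean(maximum(c[i] + gumbel0(rng) for i in eachindex(c)) for _ in 1:n)
end

c = [0.5, -1.0, 2.0]
lhs = mc_max(c, 1_000_000)               # simulated expectation of the maximum
rhs = log(sum(exp.(c)))                  # log-sum-exp of the constants
println("simulated: ", lhs, "  log-sum-exp: ", rhs)
```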
From these assumptions and Lemma 5.3.2, the term inside the integral in (5.35)
satisfies

$$ \begin{aligned} \max_{a'} \{ r(y', \varepsilon', a') + \beta g(y', a') \} &= \max_{a'} \{ r(y', a') + \varepsilon'(a') + \beta g(y', a') \} \\ &\sim G\left( -\gamma + \ln\left[ \sum_{a'} \exp\left( r(y', a') + \beta g(y', a') \right) \right] \right). \end{aligned} $$

Recalling our rule for computing mathematical expectations of Gumbel distributed
random variables, the expected value Bellman operator $R$ in (5.35) becomes
$$ (Rg)(y, a) = \sum_{y'} \ln\left[ \sum_{a'} \exp\left( r(y', a') + \beta g(y', a') \right) \right] P(y, a, y'). \tag{5.36} $$

This operator is convenient because the absence of a max operator permits fast evaluation.
Notice also that $R$ is smooth in $g$, which suggests that we can use gradient
information to compute its fixed point.

Proposition 5.3.3. The operator 𝑅 in (5.36) is a contraction of modulus 𝛽 on RG .

Proof. The operator $R$ is order-preserving on $\mathbb R^{\mathsf G}$. Straightforward algebra shows that,
for $c \in \mathbb R_+$ and $g \in \mathbb R^{\mathsf G}$, we have $R(g + c) = Rg + \beta c$. The claim now follows from
Blackwell’s sufficient condition for a contraction (page 62). □
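
The two properties used in the proof can be checked numerically. The sketch below implements the operator in (5.36) on a small randomly generated problem and evaluates the shift identity $R(g + c) = Rg + \beta c$ and the contraction bound. The data and the names `logsumexp`, `R_gumbel` and `sup` are hypothetical.

```julia
using Random

# Numerically stable log-sum-exp
function logsumexp(x)
    m = maximum(x)
    return m + log(sum(exp.(x .- m)))
end

# The operator in (5.36): r[y, a] is the deterministic reward, P[y, a, y′] the kernel
function R_gumbel(g, r, P, β)
    n_y, n_a = size(r)
    h = [logsumexp(r[y, :] .+ β .* g[y, :]) for y in 1:n_y]
    return [sum(h[y′] * P[y, a, y′] for y′ in 1:n_y) for y in 1:n_y, a in 1:n_a]
end

rng = MersenneTwister(3)
n_y, n_a, β = 5, 3, 0.9
r = randn(rng, n_y, n_a)
P = rand(rng, n_y, n_a, n_y); P ./= sum(P, dims=3)
g1, g2 = randn(rng, n_y, n_a), randn(rng, n_y, n_a)

sup(x) = maximum(abs.(x))                # supremum norm

# Shift property: R(g + c) - (Rg + βc) should vanish up to rounding
shift_gap = sup(R_gumbel(g1 .+ 2.0, r, P, β) - (R_gumbel(g1, r, P, β) .+ β * 2.0))

# Contraction: ‖Rg1 - Rg2‖ / ‖g1 - g2‖ should not exceed β
ratio = sup(R_gumbel(g1, r, P, β) - R_gumbel(g2, r, P, β)) / sup(g1 - g2)
println("shift gap = ", shift_gap, ", contraction ratio = ", ratio)
```

Subtracting the maximum inside `logsumexp` avoids overflow when the entries of $r + \beta g$ are large, which matters once $g$ is close to its fixed point and $\beta$ is near one.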

Notice how the Gumbel max trick that exploits Lemma 5.3.2 depends crucially
on the expected value formulation of the Bellman equation, rather than the standard
formulation (5.33). This is because the expected value formulation puts the max
inside the expectation operator, unlike the standard formulation, where the max is on
the outside.
Variations of the Gumbel max trick have many uses in structural econometrics
(see §5.4).

5.3.3 Optimal Savings with Stochastic Returns on Wealth

We modify the optimal savings problem of §5.2.2 by replacing the constant gross rate of
interest $R$ with an IID sequence $(\eta_t)_{t \geq 0}$ with common distribution $\varphi$ on a finite set $\mathsf E$. Thus the
consumer faces a fluctuating rate of return on financial wealth. In each period $t$, the
consumer knows $\eta_t$, the gross rate of interest between $t$ and $t + 1$, before deciding how
much to consume and how much to save. Other aspects of the problem are unchanged.
We have two motivations. One is computational, namely, to illustrate how framing
a decision in terms of expected values can reduce dimensionality, analogous to the
results in §5.3.1.4. The other is to generate a more realistic wealth distribution than
that generated by the optimal savings model in §5.2.2.4.
With stochastic returns on wealth, the Bellman equation becomes
$$ v(w, y, \eta) = \max_{w' \leq \eta (w + y)} \left\{ u\left( w + y - \frac{w'}{\eta} \right) + \beta \sum_{y', \eta'} v(w', y', \eta') Q(y, y') \varphi(\eta') \right\}. $$

Both $w$ and $w'$ are constrained to a finite set $\mathsf W \subset \mathbb R_+$. The expected value function
can be expressed as
$$ g(y, w') := \sum_{y', \eta'} v(w', y', \eta') Q(y, y') \varphi(\eta'). \tag{5.37} $$

In the remainder of this section, we will say that a savings policy $\sigma$ is $g$-greedy if
$$ \sigma(w, y, \eta) \in \operatorname*{argmax}_{w' \leq \eta (w + y)} \left\{ u\left( w + y - \frac{w'}{\eta} \right) + \beta g(y, w') \right\}. $$

Since it is an MDP, we can see immediately that if we replace 𝑣 in (5.37) with the
value function 𝑣∗ , then a 𝑔-greedy policy will be an optimal one.
Using manipulations analogous to those we used in §5.3.1.4, we can rewrite the
Bellman equation in terms of expected value functions via
$$ g(y, w') = \sum_{y', \eta'} \max_{w'' \leq \eta' (w' + y')} \left\{ u\left( w' + y' - \frac{w''}{\eta'} \right) + \beta g(y', w'') \right\} Q(y, y') \varphi(\eta'). $$

From here we could proceed by introducing an expected value Bellman operator
analogous to $R$ in (5.35), proving it to be a contraction map and then showing that greedy
policies taken with respect to the fixed point are optimal. All of this can be
accomplished without too much difficulty – we prove more general results in §5.3.5.

using QuantEcon, LinearAlgebra, IterTools

function create_savings_model(; β=0.98, γ=2.5,
                              w_min=0.01, w_max=20.0, w_size=100,
                              ρ=0.9, ν=0.1, y_size=20,
                              η_min=0.75, η_max=1.25, η_size=2)
    η_grid = LinRange(η_min, η_max, η_size)
    ϕ = ones(η_size) * (1 / η_size)    # Uniform distribution
    w_grid = LinRange(w_min, w_max, w_size)
    mc = tauchen(y_size, ρ, ν)
    y_grid, Q = exp.(mc.state_values), mc.p
    return (; β, γ, η_grid, ϕ, w_grid, y_grid, Q)
end

Listing 20: Optimal savings parameters (modified_opt_savings.jl)

However, we also know that optimistic policy iteration (OPI) is, in general, more
efficient than value function iteration. This motivates us to introduce the modified
$\sigma$-value operator
$$ (R_\sigma g)(y, w') = \sum_{y', \eta'} \left\{ u\left( w' + y' - \frac{\sigma(w', y', \eta')}{\eta'} \right) + \beta g(y', \sigma(w', y', \eta')) \right\} Q(y, y') \varphi(\eta'). $$

This is a modification of the regular 𝜎-value operator 𝑇𝜎 that makes it act on expected
value functions.
A suitably modified OPI routine that is adapted from the regular OPI algorithm in
§5.1.4.4 can be found in Algorithm 5.5 on page 177. The routine is convergent. We
discuss this in greater detail in §5.3.5.
Figure 5.14 shows a histogram of a long wealth time series that parallels Fig-
ure 5.10 on page 155. The only significant difference is the switch to stochastic returns
(as described above). Parameters are as in Listing 20. Now the wealth distribution has
a more realistic long right tail (a few observations are in the far right tail, although
they are difficult to see). The Gini coefficient is 0.72, which is closer to typical country
values recorded in World Bank data (but still lower than the US). In essence, this oc-
curs because return shocks have multiplicative rather than additive effects on wealth,
so a sequence of high draws compounds to make wealth grow fast.

[Figure 5.14 is a histogram of wealth on the range 0 to 20; the plot is annotated with Gini = 0.72.]

Figure 5.14: Histogram of wealth (stochastic returns)

EXERCISE 5.3.3. Consider a version of the optimal savings problem from §5.2.2
where labor income has both persistent and transient components. In particular,
assume that $Y_t = Z_t + \varepsilon_t$ for all $t$, where $(\varepsilon_t)_{t \geq 0}$ is IID with common distribution $\varphi$ on $\mathsf E$,
while $(Z_t)_{t \geq 0}$ is $Q$-Markov on $\mathsf Z$. Such a specification of labor income can capture how
households should react differently to transient and “permanent” shocks (see §5.4
for more discussion). Following the pattern developed for the savings model with
stochastic returns on wealth, write down both the Bellman equation and the Bellman
equation in terms of expected value functions.

5.3.4 Q-Factors

$Q$-factors assign values to state-action pairs. They set the stage for $Q$-learning, a
recursive reinforcement learning algorithm that uses stochastic approximation
techniques to learn $Q$-factors. Under special
conditions $Q$-learning eventually learns optimal $Q$-factors for a finite MDP.
𝑄 -learning is connected to the topic of this chapter because it relies on a Bellman
operator for the 𝑄 -factor. We discuss that Bellman operator, but we don’t discuss
𝑄 -learning here.
To begin, we fix an MDP $(\Gamma, \beta, r, P)$ with state space $\mathsf X$ and action space $\mathsf A$. For each
$v \in \mathbb R^{\mathsf X}$, the $Q$-factor corresponding to $v$ is the function

$$ q(x, a) = r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \qquad ((x, a) \in \mathsf G). $$

We can convert the Bellman equation into an equation in $Q$-factors by observing that,
given such a $q$, the Bellman equation can be written as $v(x) = \max_{a \in \Gamma(x)} q(x, a)$. Taking
the mean and discounting on both sides of this equation gives
$$ \beta \sum_{x'} v(x') P(x, a, x') = \beta \sum_{x'} \max_{a' \in \Gamma(x')} q(x', a') P(x, a, x'). $$

Adding $r(x, a)$ and using the definition of $q$ again gives

$$ q(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')} q(x', a') P(x, a, x'). $$

This functional equation motivates us to introduce the $Q$-factor Bellman operator

$$ (Sq)(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')} q(x', a') P(x, a, x') \qquad ((x, a) \in \mathsf G). \tag{5.39} $$

EXERCISE 5.3.4. Prove that $S$ is order-preserving and a contraction of modulus $\beta$
on $\mathbb R^{\mathsf G}$ (with respect to the supremum norm).

Let $q^*$ be the unique fixed point of $S$ in $\mathbb R^{\mathsf G}$.


Proposition 5.3.4. A policy $\sigma \in \Sigma$ is optimal if and only if

$$ \sigma(x) \in \operatorname*{argmax}_{a \in \Gamma(x)} q^*(x, a) \quad \text{for all } x \in \mathsf X. $$

Enthusiastic readers might like to try to prove Proposition 5.3.4 directly. We defer
the proof until §5.3.5, where a more general result is obtained.
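
The relationship between $q^*$ and the value function can be illustrated numerically. The sketch below uses a small randomly generated MDP in which all actions are feasible at every state, which is an assumption made for simplicity, and the names `expect` and `fixed_point` are hypothetical. It iterates $S$ and the ordinary Bellman operator to their fixed points and reads a policy off $q^*$.

```julia
using Random

rng = MersenneTwister(11)
n, n_a, β = 6, 3, 0.9
r = rand(rng, n, n_a)                                 # r[x, a]
P = rand(rng, n, n_a, n); P ./= sum(P, dims=3)        # P[x, a, x′]

# (x, a) ↦ Σ_x′ v(x′) P(x, a, x′), returned as an n × n_a array
expect(v) = reshape(reshape(P, :, n) * v, n, n_a)

T(v) = vec(maximum(r .+ β .* expect(v), dims=2))      # Bellman operator
S(q) = r .+ β .* expect(vec(maximum(q, dims=2)))      # Q-factor Bellman operator

# Successive approximation for a contraction map F
function fixed_point(F, x; tol=1e-10)
    while true
        x_new = F(x)
        maximum(abs.(x_new - x)) < tol && return x_new
        x = x_new
    end
end

v_star = fixed_point(T, zeros(n))
q_star = fixed_point(S, zeros(n, n_a))

# Policy read directly off q*, with no further expectation required
σ_star = [argmax(q_star[x, :]) for x in 1:n]
gap = maximum(abs.(vec(maximum(q_star, dims=2)) - v_star))
println("max_a q*(x, a) differs from v*(x) by at most ", gap)
```

Reading the policy off $q^*$ needs only a maximization over actions, which is one reason $Q$-factors are convenient when the kernel $P$ is unknown or expensive to evaluate.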

5.3.5 Operator Factorizations

Our study of structural estimation in §5.3.1, optimal savings in §5.3.3 and 𝑄 -factors
in §5.3.4 all involved manipulations of the Bellman and policy operators that pre-
sented alternative perspectives on the respective optimization problems. Rather than
offering additional applications that apply such ideas, we now develop a general theo-
retical framework from which to understand manipulations of the Bellman and policy
operators for general MDPs. The framework clarifies when and how these techniques
can be applied.

5.3.5.1 Refactoring the Bellman Operator

Fix an MDP ( Γ, 𝛽, 𝑟, 𝑃 ) with state space X and action space A. As usual, Σ is the set of
feasible policies, G is the set of feasible state, action pairs, 𝑇 is the Bellman operator
and 𝑣∗ denotes the value function. Our first step is to decompose 𝑇 . To do this we
introduce three auxiliary operators:
Í
• 𝐸 : RX → RG defined by ( 𝐸𝑣)( 𝑥, 𝑎) = 𝑥 0 𝑣 ( 𝑥 0) 𝑃 ( 𝑥, 𝑎, 𝑥 0),
• 𝐷 : RG → RG defined by ( 𝐷𝑔 )( 𝑥, 𝑎) = 𝑟 ( 𝑥, 𝑎) + 𝛽𝑔 ( 𝑥, 𝑎) and
• 𝑀 : RG → RX defined by ( 𝑀𝑞) ( 𝑥 ) = max 𝑎∈Γ ( 𝑥 ) 𝑞 ( 𝑥, 𝑎).
Evidently the action of the Bellman operator 𝑇 on a given 𝑣 ∈ RX is the composition
of these three steps:
(i) take conditional expectations given ( 𝑥, 𝑎) ∈ G (applying 𝐸),
(ii) discount and add current rewards (applying 𝐷), and
(iii) maximize with respect to current action (applying 𝑀 ).
As a result, we can write 𝑇 = 𝑀 𝐷𝐸 ≔ 𝑀 ◦ 𝐷 ◦ 𝐸 (apply 𝐸 first, 𝐷 second, and 𝑀 third).
This decomposition is visualized in Figure 5.15. The action of 𝑇 is a round trip from
the top node, which is the set of value functions.
If we stare at Figure 5.15, we can imagine two other round trips. One is a round
trip from the set of expected value functions, obtained by the sequence 𝐸𝑀 𝐷. The
other is a round trip from the set of 𝑄 -factors, obtained by the sequence 𝐷𝐸𝑀 . Let’s
name these additional round trips 𝑅 and 𝑆 respectively, so that, collecting all three,

𝑅 = 𝐸𝑀 𝐷, 𝑆 = 𝐷𝐸𝑀, 𝑇 = 𝑀 𝐷𝐸. (5.40)

Both 𝑅 and 𝑆 act on functions in RG . The next exercise provides an explicit represen-
tation of these operators.

EXERCISE 5.3.5. Show that for any $g, q \in \mathbb{R}^{\mathsf{G}}$ and $(x, a) \in \mathsf{G}$ we have

$$
(Rg)(x, a) = \sum_{x'} \max_{a' \in \Gamma(x')} \{ r(x', a') + \beta g(x', a') \} P(x, a, x')
$$

and

$$
(Sq)(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')} q(x', a') P(x, a, x').
$$
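
[Figure 5.15: Multiple Bellman operators (EV = expected value). The diagram is a cycle on three nodes: $E$ maps $\mathbb{R}^{\mathsf{X}}$ (value functions) to $\mathbb{R}^{\mathsf{G}}$ (EV functions), $D$ maps $\mathbb{R}^{\mathsf{G}}$ (EV functions) to $\mathbb{R}^{\mathsf{G}}$ ($Q$-factors), and $M$ maps $\mathbb{R}^{\mathsf{G}}$ ($Q$-factors) back to $\mathbb{R}^{\mathsf{X}}$ (value functions).]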

Let’s connect our “refactored” Bellman operators 𝑅 and 𝑆 to our preceding exam-
ples. Inspection of (5.39) confirms that 𝑆 is exactly the 𝑄 -factor Bellman operator.
In addition, 𝑅 is a general version of the expected value Bellman operator defined in
(5.35).

EXERCISE 5.3.6. Show that, for all $k \in \mathbb{N}$, the following relationships hold:

• $R^k = E T^{k-1} M D = E M S^{k-1} D$
• $S^k = D R^{k-1} E M = D E T^{k-1} M$
• $T^k = M S^{k-1} D E = M D R^{k-1} E$

(Here, for any operator $A$, we take $A^0$ to be the identity map.)

While the equalities in Exercise 5.3.6 can be proved by induction via the logic
revealed by (5.40), the intuition is straightforward from Figure 5.15. For example,
the relationship 𝑅 𝑘 = 𝐸𝑇 𝑘−1 𝑀 𝐷 states that round-tripping 𝑘 times from the space of
expected values (EV function space) is the same as shifting to value function space by
applying 𝑀 𝐷, round-tripping 𝑘 − 1 times using 𝑇 , and then shifting one more step to
EV function space via 𝐸.
Although the relationships in Exercise 5.3.6 are easy to prove, they are already
useful. For example, suppose that in a computational setting 𝑅 is easier to iterate
than 𝑇 . Then to iterate with 𝑇 𝑘 times, we can instead use 𝑇 𝑘 = 𝑀 𝐷𝑅 𝑘−1 𝐸: We apply 𝐸
once, 𝑅 𝑘 − 1 times, and 𝑀 and 𝐷 once each. If 𝑘 is large, this might be more efficient.
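
The identity $T^k = M D R^{k-1} E$ can be checked numerically on a small example. The sketch below is only an illustration: the two-state, two-action MDP, its rewards and transition probabilities, and the helper name `iterate` are all assumptions made for this example, and every action is taken to be feasible in every state.

```python
# Toy MDP; every number below is an illustrative assumption.
X = [0, 1]                       # states
A = [0, 1]                       # actions, all feasible in every state
beta = 0.9                       # discount factor
r = [[1.0, 0.5], [0.0, 2.0]]     # r[x][a]
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[x][a][x']
     [[0.5, 0.5], [0.1, 0.9]]]

def E(v):
    """(Ev)(x, a) = sum over x' of v(x') P(x, a, x')."""
    return [[sum(v[y] * P[x][a][y] for y in X) for a in A] for x in X]

def D(g):
    """(Dg)(x, a) = r(x, a) + beta * g(x, a)."""
    return [[r[x][a] + beta * g[x][a] for a in A] for x in X]

def M(q):
    """(Mq)(x) = max over a of q(x, a)."""
    return [max(q[x][a] for a in A) for x in X]

def T(v):
    return M(D(E(v)))            # T = M D E acts on value functions

def R(g):
    return E(M(D(g)))            # R = E M D acts on expected value functions

def iterate(op, f, k):
    """Apply the operator op to f a total of k times."""
    for _ in range(k):
        f = op(f)
    return f

k = 50
v0 = [0.0, 1.0]
direct = iterate(T, v0, k)                 # T^k v0
via_R = M(D(iterate(R, E(v0), k - 1)))     # M D R^(k-1) E v0
gap = max(abs(s - t) for s, t in zip(direct, via_R))
print(gap)
```

Both routes apply the same elementary operators in the same order, so the two results should agree up to floating point error.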

In the next exercise and the next section, we let $\| \cdot \| := \| \cdot \|_\infty$, the supremum norm on either $\mathbb{R}^{\mathsf{X}}$ or $\mathbb{R}^{\mathsf{G}}$.

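EXERCISE 5.3.7. Prove the following facts:

(i) $\| Ev - Ev' \| \leq \| v - v' \|$ for all $v, v' \in \mathbb{R}^{\mathsf{X}}$,
(ii) $\| Mg - Mg' \| \leq \| g - g' \|$ for all $g, g' \in \mathbb{R}^{\mathsf{G}}$, and
(iii) $\| Dq - Dq' \| \leq \beta \| q - q' \|$ for all $q, q' \in \mathbb{R}^{\mathsf{G}}$.

We can say that $E$ and $M$ are nonexpansive on $\mathbb{R}^{\mathsf{X}}$ and $\mathbb{R}^{\mathsf{G}}$ respectively, while $D$ is a contraction on $\mathbb{R}^{\mathsf{G}}$.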

Lemma 5.3.5. The operators 𝑅, 𝑆 and 𝑇 are all contraction maps of modulus 𝛽 under
the supremum norm.

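Proof. That $T$ is a contraction of modulus $\beta$ was proved in Proposition 5.1.1, on page 138. We can prove this more easily now by applying Exercise 5.3.7, which, for arbitrary $v, v' \in \mathbb{R}^{\mathsf{X}}$, gives

$$
\| Tv - Tv' \| = \| MDEv - MDEv' \| \leq \| DEv - DEv' \| \leq \beta \| Ev - Ev' \| \leq \beta \| v - v' \|.
$$

Here the first inequality uses nonexpansiveness of $M$, the second uses the contraction property of $D$, and the third uses nonexpansiveness of $E$. Proofs for $R = EMD$ and $S = DEM$ are similar. □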

In the next section we clarify relationships between these operators and prove
Propositions 5.3.1 and 5.3.4.

5.3.5.2 Refactorizations and Optimality

From Lemma 5.3.5 we see that 𝑅, 𝑆 and 𝑇 all have unique fixed points. We denote
them by 𝑔∗ , 𝑞∗ and 𝑣∗ respectively, so that

𝑅𝑔 ∗ = 𝑔 ∗ , 𝑆𝑞∗ = 𝑞∗ , and 𝑇 𝑣∗ = 𝑣∗ .

We already know that 𝑣∗ is the value function (Proposition 5.1.1). The results be-
low show that the other two fixed points are, like the value function, sufficient to
determine optimality.

Proposition 5.3.6. The fixed points of 𝑅, 𝑆 and 𝑇 are connected by the following rela-
tionships:

(i) 𝑔∗ = 𝐸𝑣∗ ,
(ii) 𝑞∗ = 𝐷𝑔 ∗ , and
(iii) 𝑣∗ = 𝑀𝑞∗ .

Proof. To prove (i), first observe that, in the notation of (5.40), we have 𝐸𝑣∗ = 𝐸𝑇 𝑣∗ =
𝐸𝑀 𝐷𝐸𝑣∗ = 𝑅𝐸𝑣∗ . Hence 𝐸𝑣∗ is a fixed point of 𝑅. But 𝑅 has only one fixed point, which
is 𝑔∗ . Therefore, 𝑔∗ = 𝐸𝑣∗ . The proofs of (ii) and (iii) are analogous. □

The results in Proposition 5.3.6 can be written more explicitly as

• $g^*(x, a) = \sum_{x'} v^*(x') P(x, a, x')$ for all $(x, a) \in \mathsf{G}$,
• $q^*(x, a) = r(x, a) + \beta g^*(x, a)$ for all $(x, a) \in \mathsf{G}$, and
• $v^*(x) = \max_{a \in \Gamma(x)} q^*(x, a)$ for all $x \in \mathsf{X}$.

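In the next result and the discussion that follows, given $g, q \in \mathbb{R}^{\mathsf{G}}$, we will call $\sigma \in \Sigma$

• $g$-greedy if $\sigma(x) \in \operatorname{argmax}_{a \in \Gamma(x)} \{ r(x, a) + \beta g(x, a) \}$ for all $x \in \mathsf{X}$, and
• $q$-greedy if $\sigma(x) \in \operatorname{argmax}_{a \in \Gamma(x)} q(x, a)$ for all $x \in \mathsf{X}$.
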
These definitions are exact analogs of the 𝑣-greedy concept, applied to expected value
functions and 𝑄 -factors respectively.
Corollary 5.3.7. For 𝜎 ∈ Σ, the following statements are equivalent:
(i) 𝜎 is 𝑣-greedy when 𝑣 = 𝑣∗ .
(ii) 𝜎 is 𝑔-greedy when 𝑔 = 𝑔∗ .
(iii) 𝜎 is 𝑞-greedy when 𝑞 = 𝑞∗ .
In particular, 𝜎 is optimal if and only if any one (and hence all) of (i)–(iii) holds.

Proof. To see that (i) implies (ii), suppose that $\sigma$ is $v$-greedy when $v = v^*$. Then for arbitrary $x \in \mathsf{X}$

$$
\sigma(x) \in \operatorname*{argmax}_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v^*(x') P(x, a, x') \right\}
= \operatorname*{argmax}_{a \in \Gamma(x)} \{ r(x, a) + \beta g^*(x, a) \}.
$$

Hence 𝜎 is 𝑔-greedy when 𝑔 = 𝑔∗ , and (i) implies (ii). The proofs of the remaining
equivalences (ii) =⇒ (iii) =⇒ (i) are similar. The claim that 𝜎 is optimal if and only
if any one of (i)–(iii) holds now follows from Proposition 5.1.1, which asserts that 𝜎
is optimal if and only if 𝜎 is 𝑣∗ -greedy. □

Notice that Proposition 5.3.4 is a special case of Corollary 5.3.7.


The results in Corollary 5.3.7 can be understood as “refactored” versions of Bell-
man’s principle of optimality. A consequence of these results is that we can solve a
given MDP by modifying VFI to operate either on expected value functions or on 𝑄 -
factors. For example, if we find it more convenient to iterate in expected value space,
then (informally) we can proceed as follows:

(i) Fix 𝑔 ∈ RG .
(ii) Iterate with 𝑅 to obtain 𝑔𝑘 ≔ 𝑅 𝑘 𝑔 ≈ 𝑔∗ .
(iii) Compute a 𝑔𝑘 -greedy policy.

Since 𝑔𝑘 ≈ 𝑔∗ , the resulting policy will be approximately optimal.
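
The three steps above can be sketched in code. This is a minimal illustration under assumed data: the two-state, two-action MDP below (with all actions feasible) and the fixed iteration count of 500 are choices made for this example only. For comparison, the sketch also runs ordinary VFI and computes a $v$-greedy policy.

```python
# Toy MDP; every number below is an illustrative assumption.
X = [0, 1]                       # states
A = [0, 1]                       # actions, all feasible in every state
beta = 0.9
r = [[1.0, 0.5], [0.0, 2.0]]     # r[x][a]
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[x][a][x']
     [[0.5, 0.5], [0.1, 0.9]]]

def R(g):
    """(Rg)(x, a) = sum_x' max_a' {r(x', a') + beta g(x', a')} P(x, a, x')."""
    m = [max(r[y][b] + beta * g[y][b] for b in A) for y in X]
    return [[sum(m[y] * P[x][a][y] for y in X) for a in A] for x in X]

def T(v):
    """Ordinary Bellman operator, used here only for comparison."""
    return [max(r[x][a] + beta * sum(v[y] * P[x][a][y] for y in X) for a in A)
            for x in X]

# Steps (i) and (ii): fix g and iterate with R.
g = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(500):
    g = R(g)

# Step (iii): compute a g-greedy policy.
sigma_g = [max(A, key=lambda a: r[x][a] + beta * g[x][a]) for x in X]

# Comparison: ordinary VFI followed by a v-greedy policy.
v = [0.0, 0.0]
for _ in range(500):
    v = T(v)
sigma_v = [max(A, key=lambda a: r[x][a]
               + beta * sum(v[y] * P[x][a][y] for y in X)) for x in X]

print(sigma_g, sigma_v)
```

In this example greedy actions are unique, so the two policies coincide, as Corollary 5.3.7 suggests they should once both iterations are close to their fixed points.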

5.3.5.3 Refactored OPI

Earlier in this chapter we found that VFI is often outperformed by HPI or OPI. Our next step
is to apply these methods to modified versions of the Bellman equation, as discussed
in the previous section. This allows us to combine advantages of HPI/OPI with the
potential efficiency gains obtained by refactoring the Bellman equation.
We illustrate these ideas below by producing a version of OPI that can compute
𝑄 -factors and expected value functions. (The same is true for HPI, although we leave
details of that construction to interested readers.)
To begin, we introduce a new operator, denoted $M_\sigma$, that, for fixed $\sigma \in \Sigma$ and $q \in \mathbb{R}^{\mathsf{G}}$, produces

$$
(M_\sigma q)(x) := q(x, \sigma(x)) \qquad (x \in \mathsf{X}).
$$

This operator is the policy analog of the maximization operator $M$ defined by $(Mq)(x) = \max_{a \in \Gamma(x)} q(x, a)$ in §5.3.5.1. Analogous to (5.40), we set

$$
R_\sigma := E M_\sigma D, \qquad S_\sigma := D E M_\sigma, \qquad T_\sigma := M_\sigma D E.
$$

You can verify that 𝑇𝜎 is the ordinary 𝜎-policy operator (defined in (5.19)). The op-
erators 𝑅𝜎 and 𝑆𝜎 are the expected value and 𝑄 -factor equivalents.

EXERCISE 5.3.8. The relationships in Exercise 5.3.6 continue to hold after we swap $R, S, T, M$ with $R_\sigma, S_\sigma, T_\sigma, M_\sigma$. Confirm the first of these relationships, showing in particular that

$$
R_\sigma^k = E \, T_\sigma^{k-1} M_\sigma D \quad \text{for all } k \in \mathbb{N}. \tag{5.41}
$$

Let’s now show that OPI can be successfully modified via these alternative oper-
ators. We will focus on the expected value viewpoint (value functions are replaced
by expected value functions), which is often helpful in the applications we wish to
consider.
Our modified OPI routine is given in Algorithm 5.5. It makes the obvious modifi-
cations to regular OPI, switching to working with expected value functions in RG and
from iteration with 𝑇𝜎 to iteration with 𝑅𝜎 .

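Algorithm 5.5: Refactored OPI for expected value functions

    input $g_0 \in \mathbb{R}^{\mathsf{G}}$, an initial guess of $g^*$
    input $\tau$, a tolerance level for error
    input $m \in \mathbb{N}$, a step size
    $k \leftarrow 0$
    $\varepsilon \leftarrow \tau + 1$
    while $\varepsilon > \tau$ do
        $\sigma_k \leftarrow$ a $g_k$-greedy policy
        $g_{k+1} \leftarrow R_{\sigma_k}^m g_k$
        $\varepsilon \leftarrow \| g_k - g_{k+1} \|_\infty$
        $k \leftarrow k + 1$
    end
    return $\sigma_k$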

Algorithm 5.5 is globally convergent in the same sense as regular OPI (Algo-
rithm 5.4 on page 145). In fact, if we pick 𝑣0 ∈ RX and apply regular OPI with
this initial condition, as well as applying Algorithm 5.5 with initial condition 𝑔0 ≔ 𝐸𝑣0 ,
then the sequences ( 𝑣𝑘 ) 𝑘⩾0 and ( 𝑔𝑘 ) 𝑘⩾0 generated by the two algorithms are connected
via 𝑔𝑘 = 𝐸𝑣𝑘 for all 𝑘 ⩾ 0. If greedy policies are unique, then it is also true that the
policy sequences generated by the two algorithms are identical.
Let’s prove these claims, assuming for convenience that greedy policies are unique.
Consider first the claim that 𝑔𝑘 = 𝐸𝑣𝑘 for all 𝑘 ⩾ 0. This is true by assumption when
𝑘 = 0. Suppose, as an induction hypothesis, that 𝑔𝑘 = 𝐸𝑣𝑘 holds at arbitrary 𝑘. Let 𝜎
be 𝑔𝑘 -greedy. Then

$$
\sigma(x) = \operatorname*{argmax}_{a \in \Gamma(x)} \{ r(x, a) + \beta g_k(x, a) \}
= \operatorname*{argmax}_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v_k(x') P(x, a, x') \right\},
$$

where the second equality is implied by 𝑔𝑘 = 𝐸𝑣𝑘 . Hence 𝜎 is both 𝑔𝑘 -greedy and 𝑣𝑘 -
greedy and so is the next policy selected by both modified and regular OPI. Moreover,
updating via Algorithm 5.5 and applying (5.41), we have

$$
g_{k+1} = R_\sigma^m g_k = E T_\sigma^{m-1} M_\sigma D g_k = E T_\sigma^{m-1} M_\sigma D E v_k = E T_\sigma^m v_k.
$$


Since $\sigma$ is $v_k$-greedy, $T_\sigma^m v_k$ is the next function selected by regular OPI. Hence $v_{k+1} = T_\sigma^m v_k$. Connecting with the last chain of equalities yields $g_{k+1} = E v_{k+1}$. This completes the proof that $g_k = E v_k$ for all $k$. Policy functions generated by the algorithms are identical as well.
The preceding discussion provides a justification for the modified OPI algorithm
we adopted in §5.3.3.
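
The following sketch illustrates Algorithm 5.5 and the relationship $g_k = E v_k$ on a toy problem. Everything specific here is an assumption made for illustration: the two-state, two-action MDP (all actions feasible), the function names, the step size $m = 20$ and the tolerance. The function `refactored_opi` returns the last greedy policy it computed, which is one reasonable reading of the final line of the algorithm.

```python
# Toy MDP; every number below is an illustrative assumption.
X = [0, 1]
A = [0, 1]
beta = 0.9
r = [[1.0, 0.5], [0.0, 2.0]]     # r[x][a]
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[x][a][x']
     [[0.5, 0.5], [0.1, 0.9]]]

def E(v):
    return [[sum(v[y] * P[x][a][y] for y in X) for a in A] for x in X]

def D(g):
    return [[r[x][a] + beta * g[x][a] for a in A] for x in X]

def M_sigma(q, sigma):
    """(M_sigma q)(x) = q(x, sigma(x))."""
    return [q[x][sigma[x]] for x in X]

def T_sigma(v, sigma):
    return M_sigma(D(E(v)), sigma)      # T_sigma = M_sigma D E

def R_sigma(g, sigma):
    return E(M_sigma(D(g), sigma))      # R_sigma = E M_sigma D

def g_greedy(g):
    return [max(A, key=lambda a: r[x][a] + beta * g[x][a]) for x in X]

def refactored_opi(g, tol=1e-8, m=20):
    """Sketch of Algorithm 5.5; returns the last greedy policy and final g."""
    eps = tol + 1
    while eps > tol:
        sigma = g_greedy(g)
        g_new = g
        for _ in range(m):
            g_new = R_sigma(g_new, sigma)
        eps = max(abs(g_new[x][a] - g[x][a]) for x in X for a in A)
        g = g_new
    return sigma, g

# Run regular OPI and refactored OPI in lockstep from g_0 = E v_0.
m = 20
v = [0.0, 0.0]
g = E(v)
max_gap = 0.0
for k in range(10):
    sigma_v = g_greedy(E(v))            # a v-greedy policy is (Ev)-greedy
    sigma_g = g_greedy(g)
    assert sigma_v == sigma_g           # same policy chosen by both routines
    for _ in range(m):
        v = T_sigma(v, sigma_v)
        g = R_sigma(g, sigma_g)
    Ev = E(v)
    max_gap = max(max_gap,
                  max(abs(g[x][a] - Ev[x][a]) for x in X for a in A))

sigma_star, g_star = refactored_opi(E([0.0, 0.0]))
print(max_gap, sigma_star)
```

The lockstep loop mirrors the induction argument above: at each step the two routines select the same policy, and the updated functions satisfy $g_{k+1} = E v_{k+1}$ up to floating point error.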

5.4 Chapter Notes

Detailed treatment of MDPs can be found in books by Bellman (1957), Howard (1960),
Denardo (1981), Puterman (2005), Bertsekas (2012), Hernández-Lerma and Lasserre
(2012a, 2012b), and Kochenderfer et al. (2022). The books by Hernández-Lerma and
Lasserre (2012a, 2012b) provide excellent coverage of theory, while Puterman (2005)
gives a clear and detailed exposition of algorithms and techniques. Further discussion
of the connection between HPI and Newton iteration can be found in Section 6.4 of
Puterman (2005).
HPI is routinely used in artificial intelligence applications, including during the
training of AlphaZero by DeepMind. Further discussion of these variants of HPI and
their connection to Newton iteration can be found in Bertsekas (2021) and Bertsekas
(2022a).
There are several methods available for accelerating value function iteration,
including asynchronous VFI and Anderson acceleration. Due to space constraints,
we omit discussion of these topics. Interested readers can find a treatment of asyn-
chronous VFI in Bertsekas (2022b). For discussion of Anderson acceleration see, for
example, Walker and Ni (2011) or Geist and Scherrer (2018). First order methods
for accelerating VFI are presented in Goyal and Grand-Clement (2023).
Other methods for computing solutions to MDPs include the linear programming (LP) approach and the policy gradient technique, both of which solve a problem of the form

$$
\max_{\sigma \in \Sigma} \sum_x w(x) v(x) \quad \text{s.t.} \quad v = r_\sigma + \beta P_\sigma v \tag{5.42}
$$

for some chosen weight function 𝑤. The LP approach views (5.42) as a linear program
and applies various algorithms to the primal and dual problems. See, for example,
Puterman (2005) or Ying and Zhu (2020).

The policy gradient method involves approximating 𝜎 and 𝑣 in (5.42) using smooth
functions with finitely many parameters. These parameters are then adjusted via
some version of gradient ascent. A recent trend for high-dimensional MDPs is to ap-
proximate the value and policy functions with neural nets. An early exposition can
be found in Bertsekas and Tsitsiklis (1996). A more recent monograph is Bertsekas
(2021). For research along these lines in the context of economic applications see,
for example, Maliar et al. (2021), Hill et al. (2021), Han et al. (2021), Kahou et al.
(2021), Kase et al. (2022), and Azinovic et al. (2022).
In some versions of these algorithms, as well as in VFI and HPI, the expectations
associated with dynamic programs are computed using Monte Carlo sampling meth-
ods. See, for example, Rust (1997), Powell (2007), and Bertsekas (2021). Sidford
et al. (2023) combine linear programming and sampling approaches.
The optimal savings problem is a workhorse in macroeconomics and has been
treated extensively in the literature. Early references include Brock and Mirman
(1972), Mirman and Zilcha (1975), Schechtman (1976), Deaton and Laroque (1992),
and Carroll (1997). For more recent studies, see, for example, Li and Stachurski
(2014), Açıkgöz (2018), Light (2018), Lehrer and Light (2018), or Ma et al. (2020).
Recent applications involving optimal savings in a representative agent framework
include Bianchi (2011), Paciello and Wiederholt (2014), Rendahl (2016), Heathcote
and Perri (2018), Paroussos et al. (2019), Erosa and González (2019), Herrendorf
et al. (2021), and Michelacci et al. (2022). For more on the long right tail of the
wealth distribution (as discussed in §5.3.3), see, for example, Benhabib et al. (2015),
Krueger et al. (2016), or Stachurski and Toda (2019).
Households solving optimal savings problems are often embedded in heteroge-
neous agent models in order to study income distributions, wealth distributions, busi-
ness cycles and other macroeconomic phenomena. Representative examples include
Aiyagari (1994), Huggett (1993), Krusell and Smith (1998), Miao (2006), Algan
et al. (2014), Toda (2014), Benhabib et al. (2015), Stachurski and Toda (2019),
Toda (2019), Light (2020), Hubmer et al. (2020), or Cao (2020).
Exercise §5.3.3 considered optimal savings and consumption in the presence of
transient and persistent shocks to labor income. For research in this vein, see, for ex-
ample, Quah (1990), Carroll (2009), De Nardi et al. (2010), or Lettau and Ludvigson
(2014). For empirical work on labor income dynamics, see, for example, Newhouse
(2005), Guvenen (2007), Guvenen (2009), or Blundell et al. (2015). For analysis of
optimal savings in a very general setting, see Ma et al. (2020) or Ma and Toda (2021).
The optimal investment problem dates back to Lucas and Prescott (1971). Text-
book treatments can be found in Stokey and Lucas (1989) and Dixit and Pindyck
(2012). Sargent (1980) and Hayashi (1982) used optimal investment problems to

connect optimal capital accumulation with Tobin’s 𝑞 (which is the ratio between a
physical asset’s market value and its replacement value). Other influential papers in
the field include Lee and Shin (2000), Hassett and Hubbard (2002), Bloom et al.
(2007), Bond and Van Reenen (2007), Bloom (2009), and Wang and Wen (2012).
Carruth et al. (2000) contains a survey.
Classic papers about S-s inventory models include Arrow et al. (1951) and Dvoret-
zky et al. (1952). Optimality of S-s policies under certain conditions was first estab-
lished by Scarf (1960). Kelle and Milne (1999) study the impact of S-s inventory
policies on the supply chain, including connection to the “bullwhip” effect. The con-
nection between S-s inventory policies and macroeconomic fluctuations is studied in
Nirei (2006).
The model in Exercise 5.2.3 is loosely adapted from Bagliano and Bertola (2004).
Rust (1994) is a classic and highly readable reference in the area of structural esti-
mation of MDPs. Keane and Wolpin (1997) provides an influential study of the career
choices of young men. Roberts and Tybout (1997) analyze the decision to export in
the presence of sunk costs. Keane et al. (2011) provide an overview of structural esti-
mation applied to labor market problems. Gentry et al. (2018) review analysis of auc-
tions using structural estimation. Legrand (2019) surveys the use of structural models
to study the dynamics of commodity prices. Calsamiglia et al. (2020) use structural
estimation to study school choices. Iskhakov et al. (2020) provide a thoughtful dis-
cussion on the differences between structural estimation and machine learning. Luo
and Sang (2022) propose structural estimation via sieves.
Theoretical analysis of expected value functions in discrete choice models and
other settings can be found in Rust (1994), Norets (2010), Mogensen (2018) and
Kristensen et al. (2021). The expected value Gumbel max trick is due to Rust (1987)
and builds on work by McFadden (1974). The Gumbel max trick is also used in ma-
chine learning methods (see, e.g., Jang et al. (2016)).
In §5.3.4 we mentioned 𝑄 -learning, which was originally proposed by Watkins
(1989). Tsitsiklis (1994) and Melo (2001) studied convergence of 𝑄 -learning. In
related work, Esponda and Pouzo (2021) study Markov decision processes where dy-
namics are unknown, and where agents update their understanding of transition laws
via Bayesian updating.
The theory in §5.3.5 on optimality under modifications of the Bellman equation is
loosely based on Ma and Stachurski (2021). That paper considers arbitrary modifica-
tions in a very general setting.
Chapter 6

Stochastic Discounting

In this chapter we describe how to extend the MDP model to handle time-varying
discount factors, a specification now widely used in macroeconomics and finance.

6.1 Time-Varying Discount Factors

We introduce formulas for infinite-horizon lifetime valuations under stochastic discounting and provide necessary and sufficient conditions for existence of finite solutions.

6.1.1 Valuation

Our first step is to motivate and understand lifetime valuation when discount factors
vary over time.

6.1.1.1 Motivation

In §3.2.2.2 we discussed firm valuation in a setting where the interest rate is constant. But data show that interest rates are time-varying, even for safe assets like US Treasury bills. Figure 6.1 shows the nominal interest rate on 1-year Treasury bills since the 1950s, while Figure 6.2 shows an estimate of the real interest rate for 10-year T-bills since 2012. Both nominal and real interest rates are evidently time-varying.

[Figure 6.1: Nominal US interest rates (plot_interest_rates_nominal.jl). Time series plot of the nominal interest rate by year, with axis ticks from 1960 to 2020.]

[Figure 6.2: Real US interest rates (plot_interest_rates_real.jl). Time series plot of the real interest rate by year, with axis ticks from 2014 to 2022.]


Example 6.1.1. Consider a firm valuation problem where interest rates $(r_t)_{t \geq 0}$ are stochastic. The time zero expected present value of time $t$ profit $\pi_t$ is

$$
\mathbb{E} [ \beta_1 \cdots \beta_t \cdot \pi_t ] \quad \text{where} \quad \beta_t := \frac{1}{1 + r_t}.
$$

The lifetime value of the firm is then

$$
V_0 = \mathbb{E} \sum_{t=0}^{\infty} \left[ \prod_{i=0}^{t} \beta_i \right] \pi_t. \tag{6.1}
$$

Remark 6.1.1. Time-varying discount factors are found in extensions of the household consumption-saving problem from §3.2.2.3 that appear in modern models of business cycle dynamics, asset prices, and wealth distributions. For just one important example, see Krusell and Smith (1998). Marimon (1984) included random discount factors in his thorough analysis of growth and turnpike properties of general equilibrium models, although unfortunately only parts of that analysis were included in Marimon (1989). Exogenous impatience shocks have been used as demand shocks in some dynamic models. For more citations see §7.4.

6.1.1.2 Theory

The aim of this section is to understand and evaluate expressions such as (6.1).
Throughout,

• X is a finite set, 𝑃 ∈ M ( RX ), and ( 𝑋𝑡 )𝑡⩾0 is 𝑃 -Markov.


• ℎ is an element of RX , with ℎ ( 𝑋𝑡 ) typically interpreted as a payoff or reward at
time 𝑡 in state 𝑋𝑡 .
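• $b$ is a map from $\mathsf{X} \times \mathsf{X}$ to $(0, \infty)$ and

$$
\beta_t := b(X_{t-1}, X_t) \ \text{ for } t \in \mathbb{N}, \ \text{ with } \beta_0 := 1. \tag{6.2}
$$

The sequence $(\beta_t)_{t \geq 0}$ is called a discount factor process and $\prod_{i=0}^{t} \beta_i$ is the discount factor for period $t$ payoffs evaluated at time zero. We are interested in expected discounted sums of the form

$$
v(x) := \mathbb{E}_x \sum_{t=0}^{\infty} \left[ \prod_{i=0}^{t} \beta_i \right] h(X_t) \qquad (x \in \mathsf{X}). \tag{6.3}
$$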

Theorem 6.1.1. Let $L \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ be the discount operator defined by

$$
L(x, x') := b(x, x') P(x, x') \tag{6.4}
$$

for $(x, x') \in \mathsf{X} \times \mathsf{X}$. If $\rho(L) < 1$, then $v$ in (6.3) is finite for all $x \in \mathsf{X}$ and, moreover,

$$
v = (I - L)^{-1} h = \sum_{t=0}^{\infty} L^t h. \tag{6.5}
$$

Theorem 6.1.1 generalizes Lemma 3.2.1 on page 94. Indeed, if 𝑏 ≡ 𝛽 ∈ (0, 1),
then 𝐿 = 𝛽𝑃 and 𝜌 ( 𝐿) = 𝛽𝜌 ( 𝑃 ) = 𝛽 < 1, so the result in Theorem 6.1.1 reduces to
Lemma 3.2.1.
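
As a small numerical illustration of (6.5), the sketch below builds a discount operator for a two-state chain and compares a truncated power series $\sum_t L^t h$ with $(I - L)^{-1} h$, where the inverse is written out by hand for the $2 \times 2$ case. The matrices `P` and `b`, the payoff vector `h` and the truncation point are all assumptions chosen for this example.

```python
# Two-state example; all numbers are illustrative assumptions.
P = [[0.9, 0.1], [0.2, 0.8]]      # Markov matrix P(x, x')
b = [[0.95, 0.90], [1.02, 0.97]]  # discount map b(x, x'), may exceed 1
h = [1.0, 2.0]                    # payoff function h(x)
n = 2

# Discount operator L(x, x') = b(x, x') P(x, x'), as in (6.4).
L = [[b[x][y] * P[x][y] for y in range(n)] for x in range(n)]

def apply(Lmat, f):
    """Return the vector Lmat f."""
    return [sum(Lmat[x][y] * f[y] for y in range(n)) for x in range(n)]

# Truncated power series sum_{t=0}^{1999} L^t h.
v_series = [0.0] * n
term = h[:]                       # current term L^t h, starting at t = 0
for _ in range(2000):
    v_series = [s + u for s, u in zip(v_series, term)]
    term = apply(L, term)

# (I - L)^{-1} h via the explicit inverse of a 2 x 2 matrix.
a11, a12 = 1 - L[0][0], -L[0][1]
a21, a22 = -L[1][0], 1 - L[1][1]
det = a11 * a22 - a12 * a21
v_inv = [(a22 * h[0] - a12 * h[1]) / det,
         (-a21 * h[0] + a11 * h[1]) / det]

err = max(abs(s - t) for s, t in zip(v_series, v_inv))
print(v_series, v_inv, err)
```

Here both row sums of $L$ are below one, which is enough to guarantee $\rho(L) < 1$, so the series converges and the two computations should agree closely.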

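Proof of Theorem 6.1.1. To verify Theorem 6.1.1, we first prove that

$$
\mathbb{E}_x \left[ \prod_{i=0}^{t} \beta_i \, h(X_t) \right] = (L^t h)(x) \quad \text{for all } t \in \mathbb{N},\ h \in \mathbb{R}^{\mathsf{X}} \text{ and } x \in \mathsf{X}. \tag{6.6}
$$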

We establish (6.6) using induction on $t$. It is easy to see that (6.6) holds at $t = 1$, since $\mathbb{E}_x \beta_0 \beta_1 h(X_1) = \sum_{x'} b(x, x') h(x') P(x, x') = (Lh)(x)$. Now suppose it holds at $t$. We claim it also holds at $t + 1$. To show this we fix $h \in \mathbb{R}^{\mathsf{X}}$ and set $\delta_t := \prod_{i=0}^{t} \beta_i$. Applying the law of iterated expectations (see §3.2.1.2) yields

$$
\mathbb{E}_x \delta_{t+1} h(X_{t+1}) = \mathbb{E}_x \mathbb{E}_t \, b(X_t, X_{t+1}) \delta_t h(X_{t+1}) = \mathbb{E}_x \delta_t \, \mathbb{E}_t \, b(X_t, X_{t+1}) h(X_{t+1}).
$$

Since

$$
\mathbb{E}_t \, b(X_t, X_{t+1}) h(X_{t+1}) = \sum_{x'} b(X_t, x') h(x') P(X_t, x') = \sum_{x'} L(X_t, x') h(x') = (Lh)(X_t),
$$

we can now write

$$
\mathbb{E}_x \delta_{t+1} h(X_{t+1}) = \mathbb{E}_x \delta_t f(X_t) \quad \text{where} \quad f(x) := (Lh)(x). \tag{6.7}
$$

Applying the induction hypothesis to (6.7) yields $\mathbb{E}_x \delta_{t+1} h(X_{t+1}) = (L^t f)(x)$. But $L^t f = L^t L h = L^{t+1} h$, so $\mathbb{E}_x \delta_{t+1} h(X_{t+1}) = (L^{t+1} h)(x)$. This completes the induction step and hence the proof of (6.6).
Now we can complete the proof of Theorem 6.1.1. To this end, we fix $x \in \mathsf{X}$ and use (6.6) to obtain

$$
v(x) = \mathbb{E}_x \sum_{t=0}^{\infty} \left[ \prod_{i=0}^{t} \beta_i \, h(X_t) \right]
= \sum_{t=0}^{\infty} \mathbb{E}_x \left[ \prod_{i=0}^{t} \beta_i \, h(X_t) \right]
= \sum_{t=0}^{\infty} (L^t h)(x). \tag{6.8}
$$

Pointwise, this is $v = \sum_{t \geq 0} L^t h$. By the Neumann series lemma and $\rho(L) < 1$, this sum converges and equals $(I - L)^{-1} h$. □

In (6.8) we passed expectations through an infinite sum. This operation is valid under the assumption $\rho(L) < 1$. A complete proof can be found in §B.2.

EXERCISE 6.1.1. Consider Example 6.1.1 again but now assume that $(X_t)$ is $P$-Markov, $\pi_t = \pi(X_t)$, and $r_t = r(X_t)$ for some $r, \pi \in \mathbb{R}^{\mathsf{X}}$.¹ The expected present value of the firm given current state $X_0 = x$ is

$$
v(x) = \mathbb{E}_x \sum_{t=0}^{\infty} \left[ \prod_{i=0}^{t} \beta_i \right] \pi_t. \tag{6.9}
$$

Suggest a condition under which $v(x)$ is finite and discuss how to compute it.

EXERCISE 6.1.2. Let X be partially ordered and assume 𝜌 ( 𝐿) < 1. Prove that 𝑣
is increasing on X whenever 𝑃 is monotone increasing, 𝜋 is increasing on X, and 𝑟 is
decreasing on X.

6.1.2 Testing the Spectral Radius Condition

In Theorem 6.1.1 the condition 𝜌 ( 𝐿) < 1 drives stability. In this section we develop
necessary and sufficient conditions for 𝜌 ( 𝐿) < 1 to hold.

6.1.2.1 Spectral Radii via Expectations

First we develop an alternative representation of the spectral radius based on expectations. The result below is proved via a local spectral radius argument (see page 70). In the statement, $\beta_t$ is as defined in (6.2) and $L$ is the operator in (6.4).

Lemma 6.1.2. Let $(X_t)$ be $P$-Markov starting at $x$. The spectral radius of $L$ obeys

$$
\rho(L) = \lim_{t \to \infty} \ell_t^{1/t} \quad \text{when} \quad \ell_t := \max_{x \in \mathsf{X}} \mathbb{E}_x \prod_{i=0}^{t} \beta_i. \tag{6.10}
$$

Moreover, $\rho(L) < 1$ if and only if there exists a $t \in \mathbb{N}$ such that $\ell_t < 1$.

¹ We are assuming that randomness in interest rates is a function of the same Markov state that influences profits. There is very little loss of generality in making this assumption. In fact, the two processes can still be statistically independent. For example, if we take $X_t$ to have the form $X_t = (Y_t, Z_t)$, where $(Y_t)$ and $(Z_t)$ are independent Markov chains, then we can take $\beta_t$ to be a function of $Y_t$ and $\pi_t$ to be a function of $Z_t$. The resulting interest and profit processes are statistically independent.

Proof. Let $\mathbb{1}$ be an $n$-vector of ones. In view of (6.6), for fixed $t \in \mathbb{N}$, we have

$$
\ell_t^{1/t} = \left( \max_{x \in \mathsf{X}} (L^t \mathbb{1})(x) \right)^{1/t} = \| L^t \mathbb{1} \|_\infty^{1/t}.
$$

Since $\mathbb{1} \gg 0$, Lemma 2.3.3 yields (6.10). For a proof of the second claim in Lemma 6.1.2, see Proposition 4.1 of Stachurski and Zhang (2021). □
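
Both claims in Lemma 6.1.2 can be explored numerically. The sketch below uses an assumed two-state example in which one row sum of $L$ exceeds one, so that $\ell_1 > 1$, and yet $\rho(L) < 1$. The matrices `P` and `b` are illustrative assumptions, and the spectral radius is computed from the trace and determinant, a shortcut that only applies to this $2 \times 2$ case with real eigenvalues.

```python
import math

# Two-state example; all numbers are illustrative assumptions.
P = [[0.9, 0.1], [0.2, 0.8]]
b = [[0.95, 0.90], [1.10, 0.99]]
L = [[b[x][y] * P[x][y] for y in range(2)] for x in range(2)]

# Spectral radius of a 2 x 2 matrix with real eigenvalues:
# the larger root of lambda^2 - tr * lambda + det = 0.
tr = L[0][0] + L[1][1]
det = L[0][0] * L[1][1] - L[0][1] * L[1][0]
rho = (tr + math.sqrt(tr * tr - 4 * det)) / 2

# By (6.6) with h = 1, ell_t = max_x (L^t 1)(x).
f = [1.0, 1.0]                    # L^0 applied to the vector of ones
first_t = None                    # first t with ell_t < 1
roots = {}                        # selected values of ell_t^(1/t)
for t in range(1, 2001):
    f = [sum(L[x][y] * f[y] for y in range(2)) for x in range(2)]
    ell_t = max(f)
    if first_t is None and ell_t < 1:
        first_t = t
    if t in (1, 10, 100, 2000):
        roots[t] = ell_t ** (1 / t)

print(rho, first_t, roots)
```

The values of $\ell_t^{1/t}$ should approach $\rho(L)$ as $t$ grows, and some finite $t$ with $\ell_t < 1$ should be found even though $\ell_1 > 1$.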

The expression in (6.10) connects the spectral radius with the long run proper-
ties of the discount factor process. The connection becomes even simpler when 𝑃 is
irreducible, as the next exercise asks you to show.

EXERCISE 6.1.3. Let $P$ be irreducible. Show that, when $(X_t)$ is $P$-Markov with $X_0$ drawn from the unique stationary distribution $\psi^*$ of $P$, we also have

$$
\rho(L) = \lim_{t \to \infty} \left( \mathbb{E} \prod_{i=0}^{t} \beta_i \right)^{1/t}. \tag{6.11}
$$

(Hint: Try replacing $\| \cdot \|_\infty$ in the proof of Lemma 2.3.3 with $\| h \|_* := \sum_x |h(x)| \psi^*(x)$. We showed that $\| \cdot \|_*$ is a norm on $\mathbb{R}^{\mathsf{X}}$ in Exercise 1.2.3 on page 14.)

Exercise 6.1.3 shows that the spectral radius is a long run (geometric) average of
the discount factor process. For the conclusions of Theorem 6.1.1 to hold, we need
this long run average to be less than unity.
Figure 6.3 illustrates the condition $\rho(L) < 1$ when $\beta_t = X_t$ and $P$ is a Markov matrix produced by discretization of the AR1 process

$$
X_{t+1} = \mu (1 - a) + a X_t + s (1 - a^2)^{1/2} \varepsilon_{t+1}, \qquad (\varepsilon_t) \overset{\text{IID}}{\sim} N(0, 1). \tag{6.12}
$$

The discussion in §3.1.3 tells us that the stationary distribution 𝜓∗ of (6.12) is nor-
mally distributed with mean 𝜇 and standard deviation 𝑠. The parameter 𝑎 controls
autocorrelation. In the figure we set 𝜇 to 0.96, which, since 𝛽𝑡 = 𝑋𝑡 , is the stationary
mean of the discount factor process. The parameters 𝑎 and 𝑠 are varied in the figure,
and the contour plot shows the corresponding value of 𝜌 ( 𝐿). The process (6.12) is
discretized via the Tauchen method with the size of the state space set to 6 (which
avoids negative values for 𝛽 ( 𝑥 )).
The figure shows that $\rho(L)$ tends to increase with both the volatility and the autocorrelation of the state process. This seems natural given the expression on the right hand side of (6.11), since sequences of large values of $\beta_i$ compound in the product $\prod_{i=0}^{t} \beta_i$, pushing up the long run average value, and such sequences occur more often when autocorrelation and volatility are large.

[Figure 6.3: $\rho(L)$ for different values of $(a, s)$ (discount_spec_rad.jl). Contour plot with $a$ on the horizontal axis (0.1 to 0.9) and $s$ on the vertical axis (0.10 to 0.50); contour levels range from 0.90 to 2.10, with the level 1 marked.]

We finish this section with a lemma that simplifies computation of the spectral
radius in settings where the process ( 𝛽𝑡 ) depends only on a subset of the state variables
– a setting that is common in applications. In the statement of the lemma, the state
space X takes the form X = Y × Z. We fix 𝑄 ∈ M ( RZ ) and 𝑅 ∈ M ( RY ). The discount
operator $L$ is

$$
L(x, x') = b(z, z') Q(z, z') R(y, y') \quad \text{with} \quad b \colon \mathsf{Z} \times \mathsf{Z} \to \mathbb{R}_+.
$$

Let $(Z_t)$ and $(Y_t)$ be $Q$-Markov and $R$-Markov respectively, so that, with $P$ as the pointwise product of $Q$ and $R$, the process $(X_t) := ((Y_t, Z_t))$ is $P$-Markov. We set $L_{\mathsf{Z}}(z, z') := b(z, z') Q(z, z')$.

Lemma 6.1.3. The operators $L$ and $L_{\mathsf{Z}}$ obey $\rho(L) = \rho(L_{\mathsf{Z}})$, where the first spectral radius is taken in $\mathcal{L}(\mathbb{R}^{\mathsf{X}})$ and the second is taken in $\mathcal{L}(\mathbb{R}^{\mathsf{Z}})$.

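Proof. Let $\beta_t = b(Z_{t-1}, Z_t)$ for $t \in \mathbb{N}$, with $\beta_0 = 1$, as in (6.2). Since $(\beta_t)$ depends only on $(Z_t)$ and, in addition, $(Y_t)$ and $(Z_t)$ are independent, for $x = (y, z) \in \mathsf{X}$ we have $\mathbb{E}_x \prod_{i=0}^{t} \beta_i = \mathbb{E}_{(y, z)} \prod_{i=0}^{t} \beta_i = \mathbb{E}_z \prod_{i=0}^{t} \beta_i$. Hence

$$
\max_{x \in \mathsf{X}} \left( \mathbb{E}_x \prod_{i=0}^{t} \beta_i \right)^{1/t}
= \max_{z \in \mathsf{Z}} \left( \mathbb{E}_z \prod_{i=0}^{t} \beta_i \right)^{1/t}.
$$

Taking the limit and using Lemma 6.1.2 gives $\rho(L) = \rho(L_{\mathsf{Z}})$, where the first spectral radius is taken in $\mathcal{L}(\mathbb{R}^{\mathsf{X}})$ and the second is taken in $\mathcal{L}(\mathbb{R}^{\mathsf{Z}})$. □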
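
A quick numerical check of Lemma 6.1.3 is sketched below. The matrices `Q`, `Rm` and `b` are illustrative assumptions, and the spectral radius is approximated by a simple power iteration, which is adequate here because every entry of both operators is strictly positive. The helper name `power_iteration` is hypothetical.

```python
def power_iteration(mat, iters=500):
    """Approximate the spectral radius of an entrywise positive matrix."""
    n = len(mat)
    f = [1.0] * n
    rho = 0.0
    for _ in range(iters):
        g = [sum(mat[i][j] * f[j] for j in range(n)) for i in range(n)]
        rho = max(abs(t) for t in g)      # growth factor in the sup norm
        f = [t / rho for t in g]          # renormalize so max |f| = 1
    return rho

# Illustrative assumptions: two states for z and two states for y.
Q = [[0.9, 0.1], [0.2, 0.8]]       # Markov matrix for (Z_t)
Rm = [[0.6, 0.4], [0.3, 0.7]]      # Markov matrix for (Y_t)
b = [[0.95, 0.90], [1.10, 0.99]]   # discount map b(z, z')

# L_Z(z, z') = b(z, z') Q(z, z') on the small state space Z.
L_Z = [[b[z][z2] * Q[z][z2] for z2 in range(2)] for z in range(2)]

# L(x, x') = b(z, z') Q(z, z') R(y, y') on X = Y x Z, with x = (y, z).
states = [(y, z) for y in range(2) for z in range(2)]
L = [[L_Z[z][z2] * Rm[y][y2] for (y2, z2) in states] for (y, z) in states]

rho_LZ = power_iteration(L_Z)
rho_L = power_iteration(L)
print(rho_LZ, rho_L)
```

The practical point is that the spectral radius can be computed from the smaller operator $L_{\mathsf{Z}}$, which here is $2 \times 2$ rather than $4 \times 4$.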

6.1.2.2 Necessary Conditions

In §6.1.1 we studied settings where lifetime value is a function $v$ on a state space $\mathsf{X}$ that satisfies an equation of the form $v = h + Lv$. The unknown is $v \in \mathbb{R}^{\mathsf{X}}$ where $\mathsf{X}$ is a finite set, $h \in \mathbb{R}^{\mathsf{X}}$ is given and $L$ is a linear operator from $\mathbb{R}^{\mathsf{X}}$ to itself. We discussed the fact that $\rho(L) < 1$ is sufficient for $v = h + Lv$ to have a unique solution.
In some settings the condition 𝜌 ( 𝐿) < 1 is also necessary. For example, let

• $V = (0, \infty)^{\mathsf{X}}$ and
• $L$ be a positive linear operator on $\mathbb{R}^{\mathsf{X}}$.

In this setting we have the following result:

Lemma 6.1.4. If ℎ ∈ 𝑉 , then the next two statements are equivalent.

(i) 𝜌 ( 𝐿) < 1.
(ii) The equation 𝑣 = ℎ + 𝐿𝑣 has a unique solution in 𝑉 .

Proof. Regarding (i) ⟹ (ii), existence of a unique $v \in \mathbb{R}^{\mathsf{X}}$ satisfying $v = h + Lv$ follows from the Neumann series lemma. Since $v = \sum_{t \geq 0} L^t h \geq h \gg 0$, we have $v \in V$.
For (ii) ⟹ (i), let $v$ be any solution to $v = h + Lv$ in $V$. By the Perron–Frobenius theorem (page 69), we can select a left eigenvector $e$ such that $e \geq 0$, $e \neq 0$ and $e^\top L = \rho(L) e^\top$. For this $e$, we have $e^\top v = e^\top L v + e^\top h = \rho(L) e^\top v + e^\top h$. Since $e \geq 0$, $e \neq 0$ and $v, h \gg 0$, it must be that $e^\top h > 0$ and $e^\top v > 0$. Therefore $\rho(L)$ satisfies $(1 - \rho(L)) \alpha = \beta$ with $\alpha := e^\top v > 0$ and $\beta := e^\top h > 0$. Hence $\rho(L) < 1$. □

In §7.1.3 we will extend Lemma 6.1.4 to handle certain nonlinear equations.

6.1.3 Fixed Point Results

State-dependent discounting breaks contractivity properties that we exploited in Chapter 5, when we studied optimality of MDPs (see, e.g., the proof of Proposition 5.1.1). Here we introduce a generalization of Banach’s fixed point theorem that can deliver global stability under weaker conditions. For the remainder of this section, $\mathsf{X}$ is any finite set.

6.1.3.1 Long Run Contractions

Fix $U \subset \mathbb{R}^{\mathsf{X}}$. We call a self-map $T$ on $U$ eventually contracting if there exists a $k \in \mathbb{N}$ and a norm $\| \cdot \|$ on $\mathbb{R}^{\mathsf{X}}$ such that $T^k$ is a contraction on $U$ under $\| \cdot \|$.

Theorem 6.1.5. Let $U$ be a closed subset of $\mathbb{R}^{\mathsf{X}}$ and let $T$ be a self-map on $U$. If $T$ is eventually contracting on $U$, then $T$ is globally stable on $U$.

EXERCISE 6.1.4. Prove Theorem 6.1.5. [Hint: Theorem 1.2.3 is self-improving, in the sense that it implies this seemingly stronger result.]

The next example illustrates Theorem 6.1.5 by proving a result similar to Exer-
cise 1.2.17 on page 22.

Example 6.1.2. If $Tu = Au + b$ for some $b \in \mathbb{R}^{\mathsf{X}}$ and $A \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ with $\rho(A) < 1$, then, under the Euclidean norm,

$$
\| T^k u - T^k v \| = \| A^k u - A^k v \| = \| A^k (u - v) \| \leq \| A^k \| \| u - v \|,
$$

where the last inequality is by the submultiplicative property of the operator norm (page 16). Since $\rho(A) < 1$, we can choose a $k \in \mathbb{N}$ such that $\| A^k \| < 1$ (see Exercise 1.2.11). Hence $T$ is eventually contracting and Theorem 6.1.5 yields global stability. The unique fixed point satisfies $u = Au + b$ and, since $\rho(A) < 1$, we can use the Neumann series lemma (page 18) to write it as $u = (I - A)^{-1} b$.
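
To make the example concrete, the sketch below takes an assumed $2 \times 2$ matrix $A$ with $\rho(A) = 0.5$ whose operator norm exceeds one, finds the first $k$ at which $A^k$ has norm below one, and iterates $T$ to its fixed point. For simplicity the sketch swaps the Euclidean norm for the supremum norm, whose induced operator norm is the maximum absolute row sum. This swap is a choice made here, not something taken from the example. The matrix `A` and vector `b` are illustrative assumptions.

```python
# Illustrative assumptions: A is upper triangular with rho(A) = 0.5.
A = [[0.5, 1.0], [0.0, 0.5]]
b = [1.0, 1.0]

def matmul(B, C):
    """Product of two 2 x 2 matrices."""
    return [[sum(B[i][l] * C[l][j] for l in range(2)) for j in range(2)]
            for i in range(2)]

def op_norm_inf(B):
    """Operator norm induced by the supremum norm: max absolute row sum."""
    return max(sum(abs(e) for e in row) for row in B)

# T is not a contraction in this norm, since the norm of A is 1.5 ...
norm_A = op_norm_inf(A)

# ... but some power of A has norm below one, so T is eventually contracting.
Ak, k = A, 1
while op_norm_inf(Ak) >= 1:
    Ak = matmul(Ak, A)
    k += 1

def T(u):
    """Affine map Tu = Au + b."""
    return [sum(A[i][j] * u[j] for j in range(2)) + b[i] for i in range(2)]

# Successive approximation from the origin.
u = [0.0, 0.0]
for _ in range(200):
    u = T(u)

print(norm_A, k, u)
```

For this choice of $A$ and $b$, solving $u = Au + b$ by hand gives $u = (6, 2)$, which the iterates should approach.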

Example 6.1.2 illustrates the connection between Theorem 6.1.5 and the Neumann series lemma. Theorem 6.1.5 is more general because it can be applied in nonlinear settings. But the Neumann series lemma remains important because, when applicable, it provides inverse and power series representations of the fixed point.
On one hand, if 𝑇 is a contraction map on 𝑈 ⊂ RX with respect to a given norm
k · k 𝑎 , we cannot necessarily say that 𝑇 is a contraction with respect to some other norm
k · k 𝑏 on RX . On the other hand, if 𝑇 is an eventual contraction on 𝑈 with respect to
some given norm on RX , then 𝑇 is eventually contracting with respect to every norm
on RX . The next exercise asks you to verify this.

EXERCISE 6.1.5. Let $\| \cdot \|_a$ and $\| \cdot \|_b$ be norms on $\mathbb{R}^{\mathsf{X}}$ and let $T$ be a self-map on $U \subset \mathbb{R}^{\mathsf{X}}$ such that $T^k$ is a contraction on $U$ with respect to $\| \cdot \|_a$ for some $k \in \mathbb{N}$. Prove that there exists an $\ell \in \mathbb{N}$ such that $T^\ell$ is a contraction on $U$ with respect to $\| \cdot \|_b$.

6.1.3.2 A Spectral Radius Condition

The following sufficient condition for eventual contractivity will be helpful when we
study dynamic programs with state-dependent discounting.

Proposition 6.1.6. Let 𝑇 be a self-map on 𝑈 ⊂ ℝ^X. If there exists a positive linear
operator 𝐿 on ℝ^X such that 𝜌(𝐿) < 1 and

|𝑇 𝑣 − 𝑇𝑤 | ⩽ 𝐿 | 𝑣 − 𝑤 | (6.13)

for all 𝑣, 𝑤 ∈ 𝑈 , then 𝑇 is an eventual contraction on 𝑈 .

Proof. Fix 𝑣, 𝑤 ∈ 𝑈 and pick any 𝑘 ∈ ℕ. By (6.13) we have
|𝑇^𝑘 𝑣 − 𝑇^𝑘 𝑤| ⩽ 𝐿 |𝑇^{𝑘−1} 𝑣 − 𝑇^{𝑘−1} 𝑤|, or

\[ e_k \leqslant L e_{k-1} \quad \text{where} \quad e_k := | T^k v - T^k w | . \tag{6.14} \]

Since 𝐿 is positive, 𝐿 is order-preserving on 𝑈 by Exercise 2.3.11. As a result, we can
iterate on (6.14) to obtain 𝑒_𝑘 ⩽ 𝐿^𝑘 𝑒_0, or

\[ | T^k v - T^k w | \leqslant L^k | v - w | . \]

Let ‖ · ‖ be the Euclidean norm. Since 0 ⩽ 𝑎 ⩽ 𝑏 implies ‖𝑎‖ ⩽ ‖𝑏‖, we get

\[ \| T^k v - T^k w \| \leqslant \| L^k | v - w | \| \leqslant \| L^k \| \, \| v - w \| , \]

where ‖𝐿^𝑘‖ is the operator norm (see §1.2.1.4). Since 𝜌(𝐿) < 1, we have ‖𝐿^𝑘‖ → 0 as
𝑘 → ∞ (Exercise 1.2.11 on page 19). Hence 𝑇 is eventually contracting on 𝑈. □

6.1.3.3 A Generalized Blackwell Condition

In §2.2.3.3 we studied a sufficient condition for order-preserving self-maps to be
contractions. The next proposition provides an analogous result for eventual contractions.
In the statement of the proposition, 𝑈 is a subset of ℝ^X such that 𝑣, 𝑐 ∈ 𝑈 and 𝑐 ⩾ 0
implies 𝑣 + 𝑐 ∈ 𝑈.

Proposition 6.1.7. Let 𝑇 be an order-preserving self-map on 𝑈. If there exists a positive
linear operator 𝐿 on ℝ^X such that 𝜌(𝐿) < 1 and

𝑇 ( 𝑣 + 𝑐) ⩽ 𝑇 𝑣 + 𝐿𝑐 for all 𝑐, 𝑣 ∈ RX with 𝑐 ⩾ 0,

then 𝑇 is eventually contracting on 𝑈 .



Proof. Fix 𝑣, 𝑤 ∈ 𝑈 and let 𝑇 and 𝐿 be as in the statement of the proposition. By the
assumed properties of 𝑇, we have

\[ T v = T (v + w - w) \leqslant T (w + | v - w |) \leqslant T w + L | v - w | . \]

Here the first inequality uses the fact that 𝑇 is order-preserving and the second uses the
condition in the proposition with 𝑐 = |𝑣 − 𝑤|. Rearranging gives 𝑇𝑣 − 𝑇𝑤 ⩽ 𝐿|𝑣 − 𝑤|.
Reversing the roles of 𝑣 and 𝑤 yields |𝑇𝑣 − 𝑇𝑤| ⩽ 𝐿|𝑣 − 𝑤|. The claim in
Proposition 6.1.7 now follows from Proposition 6.1.6. □

6.2 Optimality with State-Dependent Discounting

We can now turn to dynamic programs in which the objective is to maximize a lifetime
value in the presence of state-dependent discounting. First we present an extension
of the MDP model from Chapter 5 that admits state-dependent discounting. Then we
provide weak conditions under which optimal policies exist and Bellman’s principle
of optimality holds.

6.2.1 MDPs with State-Dependent Discounting

We are ready to extend the MDP model to include state-dependent discounting. We
construct a framework and then provide weak conditions for optimality based on the
spectral radius methods discussed above.

6.2.1.1 Setup

To provide a framework for dynamic programs with state-dependent discounting, we
begin with an MDP (Γ, 𝛽, 𝑟, 𝑃) with state space X, action space A and feasible
state-action pairs G. We then replace the constant discount factor 𝛽 with a function 𝛽 from
G × X to ℝ+. We call the resulting model an MDP with state-dependent discounting.
The Bellman equation takes the form

\[ v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \sum_{x'} v(x') \beta(x, a, x') P(x, a, x') \right\} \tag{6.15} \]

where 𝑥 ∈ X and 𝑣 ∈ ℝ^X. Notice that the discount factor depends on all relevant
information: the current action, the current state and the stochastically determined
next period state.

For MDPs with state-dependent discounting, we can obtain standard optimality
results by assuming that there exists a 𝑏 < 1 such that 𝛽(𝑥, 𝑎, 𝑥′) ⩽ 𝑏 for all
(𝑥, 𝑎, 𝑥′) ∈ G × X. In this setting it is easy to show that lifetime values are finite, and
to extend the optimality results for regular MDPs found in Proposition 5.1.1 on page 138.
Unfortunately, the assumption discussed in the previous paragraph is too strict for
many applications. (We return to this point in §6.2.1.6.) We will state an optimality
result under weaker conditions.

6.2.1.2 Finite Lifetime Values

Let Σ be the set of all feasible policies, defined as for regular MDPs. The policy
operator 𝑇𝜎 corresponding to 𝜎 ∈ Σ is represented by

\[ (T_\sigma v)(x) = r(x, \sigma(x)) + \sum_{x'} v(x') \beta(x, \sigma(x), x') P(x, \sigma(x), x') . \tag{6.16} \]

Following Chapter 5, we set 𝑟𝜎 ( 𝑥 ) ≔ 𝑟 ( 𝑥, 𝜎 ( 𝑥 )). We define 𝐿𝜎 ∈ L ( RX ) via

𝐿𝜎 ( 𝑥, 𝑥 0) ≔ 𝛽 ( 𝑥, 𝜎 ( 𝑥 ) , 𝑥 0) 𝑃 ( 𝑥, 𝜎 ( 𝑥 ) , 𝑥 0) . (6.17)

Notice that we can now write (6.16) as 𝑇𝜎 𝑣 = 𝑟𝜎 + 𝐿𝜎 𝑣. In line with our discussion of
MDPs in Chapter 5, when 𝑇𝜎 has a unique fixed point we denote it by 𝑣𝜎 and interpret
it as lifetime value.

Assumption 6.2.1 (SD). For all 𝜎 ∈ Σ we have 𝜌 ( 𝐿𝜎 ) < 1.

Lemma 6.2.1. If Assumption 6.2.1 holds, then, for each 𝜎 ∈ Σ, the linear operator 𝐼 − 𝐿𝜎
is invertible and, in RX , the policy operator 𝑇𝜎 has a unique fixed point

𝑣𝜎 = ( 𝐼 − 𝐿𝜎 ) −1 𝑟𝜎 . (6.18)

Proof. Fix 𝜎 ∈ Σ. By the Neumann series lemma, 𝐼 − 𝐿𝜎 is invertible. Any fixed point
of 𝑇𝜎 obeys 𝑣 = 𝑟𝜎 + 𝐿𝜎 𝑣, which, given invertibility of 𝐼 − 𝐿𝜎 , is equivalent to (6.18). □
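Lemma 6.2.1 translates directly into a linear solve. The sketch below assumes a hypothetical two-state problem under a fixed policy 𝜎 (the arrays P_σ, β_σ and r_σ are made up for illustration) and cross-checks (6.18) against a truncated Neumann series.

```julia
using LinearAlgebra

# Hypothetical primitives under a fixed policy σ on a two-point state space.
P_σ = [0.9 0.1;
       0.2 0.8]          # P(x, σ(x), x′)
β_σ = [0.95 1.02;
       0.98 0.96]        # β(x, σ(x), x′); one entry exceeds 1
r_σ = [1.0, 2.0]         # r(x, σ(x))

L_σ = β_σ .* P_σ         # (6.17): elementwise product
ρ_L = maximum(abs.(eigvals(L_σ)))
@assert ρ_L < 1 "Assumption 6.2.1 fails for this policy"

v_σ = (I - L_σ) \ r_σ    # (6.18)

# Truncated Neumann series r + L r + L² r + ⋯ as a cross-check.
function neumann_sum(L, r; n=5_000)
    total, term = zero(r), copy(r)
    for _ in 1:n
        total += term
        term = L * term
    end
    return total
end
```

Note that the discount factor is allowed to exceed 1 in some transitions; only the spectral radius of 𝐿𝜎 matters.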

As discussed, the value 𝑣𝜎(𝑥) has the interpretation of lifetime value of policy 𝜎
conditional on initial state 𝑥. We can reinforce this interpretation by connecting
Lemma 6.2.1 to Theorem 6.1.1 on page 183. The next exercise asks you to work
through all the steps.

EXERCISE 6.2.1. Fix 𝜎 ∈ Σ, set 𝛽_𝑡 ≔ 𝛽(𝑋_{𝑡−1}, 𝜎(𝑋_{𝑡−1}), 𝑋_𝑡) for 𝑡 ⩾ 1 and 𝛽_0 ≔ 1.
Let (𝑋_𝑡) be 𝑃𝜎-Markov with initial condition 𝑥. (As before, 𝑃𝜎(𝑥, 𝑥′) ≔ 𝑃(𝑥, 𝜎(𝑥), 𝑥′).)
Prove that, under Assumption 6.2.1, the function 𝑣𝜎 obeys

\[ v_\sigma(x) = \mathbb{E}_x \sum_{t=0}^{\infty} \left[ \prod_{i=0}^{t} \beta_i \right] r_\sigma(X_t) \qquad (x \in \mathsf{X}) . \tag{6.19} \]

EXERCISE 6.2.2. Show that, under Assumption 6.2.1, the operator 𝑇𝜎 is globally
stable on RX .

EXERCISE 6.2.3. Show that Assumption 6.2.1 holds whenever there exists an 𝐿 ∈
L( RX ) such that 𝜌 ( 𝐿) < 1 and

𝛽 ( 𝑥, 𝑎, 𝑥 0) 𝑃 ( 𝑥, 𝑎, 𝑥 0) ⩽ 𝐿 ( 𝑥, 𝑥 0) for all ( 𝑥, 𝑎) ∈ G and 𝑥 0 ∈ X. (6.20)

6.2.1.3 Optimality

The Bellman operator takes the form

\[ (T v)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \sum_{x'} v(x') \beta(x, a, x') P(x, a, x') \right\} \tag{6.21} \]

where 𝑥 ∈ X and 𝑣 ∈ RX .
Given 𝑣 ∈ RX , a policy 𝜎 is called 𝑣-greedy if 𝜎 ( 𝑥 ) is a maximizer of the right-hand
side of (6.21) for all 𝑥 in X. Equivalently, 𝜎 is 𝑣-greedy whenever 𝑇𝜎 𝑣 = 𝑇 𝑣.
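The Bellman operator, the policy operator and 𝑣-greedy policies can be sketched in a few lines. The array layout below is an assumption made for illustration, not something fixed by the text: r is an n × m matrix with r[x, a] = -Inf for infeasible pairs, and β and P are n × m × n arrays indexed by (x, a, x′). All function names and numbers are hypothetical.

```julia
# Right-hand side of (6.21) before maximization, for every pair (x, a).
function action_values(v, r, β, P)
    n, m = size(r)
    B = fill(-Inf, n, m)
    for x in 1:n, a in 1:m
        isfinite(r[x, a]) || continue   # leave infeasible pairs at -Inf
        B[x, a] = r[x, a] + sum(v[x′] * β[x, a, x′] * P[x, a, x′] for x′ in 1:n)
    end
    return B
end

# Bellman operator T and a v-greedy policy (first maximizer in each state).
bellman(v, r, β, P) = vec(maximum(action_values(v, r, β, P), dims=2))
greedy(v, r, β, P) = [argmax(B_x) for B_x in eachrow(action_values(v, r, β, P))]

# Policy operator T_σ from (6.16).
function policy_operator(v, σ, r, β, P)
    B = action_values(v, r, β, P)
    return [B[x, σ[x]] for x in eachindex(σ)]
end

# Hypothetical two-state, two-action data.
r = [1.0 0.5; 0.0 2.0]
β = fill(0.95, 2, 2, 2); β[1, 2, 2] = 1.01
P = zeros(2, 2, 2)
P[1, 1, :] .= [0.9, 0.1]; P[1, 2, :] .= [0.5, 0.5]
P[2, 1, :] .= [0.3, 0.7]; P[2, 2, :] .= [0.1, 0.9]

v = [0.0, 1.0]
σ = greedy(v, r, β, P)
```

With these definitions, policy_operator(v, σ, r, β, P) coincides with bellman(v, r, β, P) whenever σ is 𝑣-greedy, which is the equivalence 𝑇𝜎 𝑣 = 𝑇𝑣 stated above.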
When Assumption 6.2.1 holds and, as a result, 𝑇𝜎 has a unique fixed point 𝑣𝜎 for
each 𝜎 ∈ Σ, we let 𝑣∗ denote the value function, which is defined as 𝑣∗ ≔ ∨𝜎∈Σ 𝑣𝜎 . As
for the regular MDP case, a policy 𝜎 is called optimal if 𝑣𝜎 = 𝑣∗ .
We can now state our main optimality result for MDPs with state-dependent dis-
counting.

Proposition 6.2.2. If Assumption 6.2.1 holds, then

(i) the value function 𝑣∗ is the unique solution to the Bellman equation in RX ,
(ii) a policy 𝜎 ∈ Σ is optimal if and only if it is 𝑣∗ -greedy, and

(iii) at least one optimal policy exists.

In §8.2.2 we prove a result that includes Proposition 6.2.2 as a special case.

6.2.1.4 Algorithms

Algorithms for solving an MDP with state-dependent discounting include value
function iteration (VFI), Howard policy iteration (HPI), and optimistic policy iteration
(OPI). The algorithms for VFI and OPI are identical to those given for regular MDPs
(see §5.1.4), provided that the correct operators 𝑇 and 𝑇𝜎 are used, and that the
definition of a 𝑣-greedy policy is as given in §6.2.1.3. The algorithm for HPI is almost
identical, with the only change being that computation of lifetime values involves 𝐿𝜎.
Details are given in Algorithm 6.1.

Algorithm 6.1: HPI for MDPs with state-dependent discounting

input 𝜎 ∈ Σ
𝑣_0 ← 𝑣𝜎 and 𝑘 ← 0
repeat
    𝜎_𝑘 ← a 𝑣_𝑘-greedy policy
    𝑣_{𝑘+1} ← (𝐼 − 𝐿_{𝜎_𝑘})^{−1} 𝑟_{𝜎_𝑘}
    if 𝑣_{𝑘+1} = 𝑣_𝑘 then break
    𝑘 ← 𝑘 + 1
return 𝜎_𝑘

We prove in Chapter 8 that, under the conditions of Assumption 6.2.1, VFI, OPI
and HPI are all convergent, and that HPI converges to an exact optimal policy in a
finite number of steps.
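Algorithm 6.1 can be sketched as follows. This is an illustrative implementation under assumptions that are not in the text: r is an n × m matrix, β and P are n × m × n arrays, every action index is feasible in every state, and the exact test 𝑣_{𝑘+1} = 𝑣_𝑘 is replaced by an approximate floating-point comparison. All names and numbers are hypothetical.

```julia
using LinearAlgebra

# L_σ from (6.17) and v_σ from (6.18).
L_of(σ, β, P) = [β[x, σ[x], x′] * P[x, σ[x], x′] for x in eachindex(σ), x′ in eachindex(σ)]
lifetime_value(σ, r, β, P) = (I - L_of(σ, β, P)) \ [r[x, σ[x]] for x in eachindex(σ)]

# A v-greedy policy: first maximizer of the right-hand side of (6.21).
function greedy(v, r, β, P)
    n, m = size(r)
    B = [r[x, a] + sum(v[x′] * β[x, a, x′] * P[x, a, x′] for x′ in 1:n)
         for x in 1:n, a in 1:m]
    return [argmax(B[x, :]) for x in 1:n]
end

function howard_policy_iteration(σ, r, β, P; max_iter=1_000)
    v = lifetime_value(σ, r, β, P)                 # v_0 ← v_σ
    for _ in 1:max_iter
        σ_new = greedy(v, r, β, P)                 # σ_k ← a v_k-greedy policy
        v_new = lifetime_value(σ_new, r, β, P)     # v_{k+1} ← (I − L_{σ_k})⁻¹ r_{σ_k}
        v_new ≈ v && return σ_new, v_new           # approximate version of v_{k+1} = v_k
        v = v_new
    end
    error("HPI did not converge within max_iter iterations")
end

# Hypothetical two-state, two-action data; ρ(L_σ) < 1 for every policy.
r = [1.0 0.5; 0.0 2.0]
β = fill(0.95, 2, 2, 2); β[1, 2, 2] = 1.01
P = zeros(2, 2, 2)
P[1, 1, :] .= [0.9, 0.1]; P[1, 2, :] .= [0.5, 0.5]
P[2, 1, :] .= [0.3, 0.7]; P[2, 2, :] .= [0.1, 0.9]

σ_star, v_star = howard_policy_iteration([1, 1], r, β, P)
```

On termination the returned value should satisfy the Bellman equation up to numerical error, which gives a simple sanity check on the output.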

6.2.1.5 Exogenous Discounting

Some applications use an exogenous state component to drive a discount factor pro-
cess. In this section we set up such a model and obtain optimality conditions by
applying Proposition 6.2.2.
The first step is to decompose the state 𝑋𝑡 into a pair (𝑌𝑡 , 𝑍𝑡 ), where (𝑌𝑡 )𝑡⩾0 is
endogenous (i.e., affected by the actions of the controller) and ( 𝑍𝑡 )𝑡⩾0 is purely ex-
ogenous. In particular, the primitives consist of

(i) a nonempty correspondence Γ from Y × Z to A,
(ii) a function 𝛽 from Z to ℝ+,
(iii) a function 𝑟 from G ≔ {(𝑦, 𝑎) ∈ Y × A : 𝑎 ∈ Γ(𝑦)} to ℝ,
(iv) a stochastic matrix 𝑄 on Z and
(v) a stochastic kernel 𝑅 from G to Y.

The corresponding Bellman equation is

\[ v(y, z) = \max_{a \in \Gamma(y, z)} \left\{ r(y, a) + \beta(z) \sum_{z', y'} v(y', z') Q(z, z') R(y, a, y') \right\} \tag{6.22} \]

for all (𝑦, 𝑧) ∈ X. Given 𝑣 ∈ ℝ^X, a policy 𝜎 ∈ Σ is called 𝑣-greedy if

\[ \sigma(y, z) \in \operatorname*{argmax}_{a \in \Gamma(y, z)} \left\{ r(y, a) + \beta(z) \sum_{z', y'} v(y', z') Q(z, z') R(y, a, y') \right\} \tag{6.23} \]

for all (𝑦, 𝑧) ∈ X.
This exogenous discount model is a special case of the general MDP with state-
dependent discounting. Indeed, we can write (6.22) as (6.21) by setting 𝑥 ≔ ( 𝑦, 𝑧 )
and defining

𝑃 ( 𝑥, 𝑎, 𝑥 0) ≔ 𝑃 (( 𝑦, 𝑧 ) , 𝑎, ( 𝑦 0, 𝑧0)) ≔ 𝑄 ( 𝑧, 𝑧0) 𝑅 ( 𝑦, 𝑎, 𝑦 0) .

The following proposition provides a relatively simple sufficient condition for the
core optimality results in the setting of the exogenous discount model.

Proposition 6.2.3. Let 𝐿 be the operator in L(ℝ^Z) defined by 𝐿(𝑧, 𝑧′) ≔ 𝛽(𝑧) 𝑄(𝑧, 𝑧′).
If 𝜌(𝐿) < 1, then all of the optimality results in Proposition 6.2.2 hold.

EXERCISE 6.2.4. Prove Proposition 6.2.3. (Hint: Use Lemma 6.1.3.)

6.2.1.6 Comments on the Spectral Radius Condition

In §6.2.1.1 we mentioned that requiring sup 𝛽 < 1 is too strict for some applications.
For example, the real interest rate 𝑟𝑡 shown in Figure 6.2 is sometimes negative. Using
long historical records, Farmer et al. (2023) find that the discount rate is negative
around 1/3 of the time. This means that the associated discount factor 𝛽𝑡 = 1/(1 + 𝑟𝑡)
is sometimes greater than 1 and sup 𝛽 < 1 fails.
[Figure: simulated discount factor 𝛽𝑡 plotted against time, with the level 𝛽 = 1 marked]

Figure 6.4: Discount factor process (𝛽𝑡)𝑡⩾0 in Hills et al. (2019).

In macroeconomics, empirically motivated time-varying discount factor specifications
lead to models where 𝛽𝑡 > 1 occurs with positive probability. For example, Hills
et al. (2019) study a model that can be embedded in the MDP framework just
described. Figure 6.4 shows a simulation of one of the discount factor processes used in
their model, prior to discretization. The discount factor process takes the form
𝛽𝑡 = 𝑏𝑍𝑡, where (𝑍𝑡) is an exogenous state process obeying 𝑍𝑡+1 = 1 − 𝜌 + 𝜌𝑍𝑡 + 𝜎𝜀𝑡+1
with (𝜀𝑡) IID and standard normal. Since 𝑍𝑡 is unbounded above, sup 𝛽 < 1 fails for this
model too.
Let’s now consider the weaker condition 𝜌(𝐿) < 1 described in Proposition 6.2.3
and check whether it holds. Following Hills et al. (2019), we discretize the dynamics
of (𝑍𝑡) via a Tauchen approximation, producing a stochastic matrix 𝑄 on a finite set
Z (see footnote 2). The set of values for 𝛽𝑡 ranges between 0.95 and 1.04, so that 𝛽𝑡 > 1
remains possible. Nonetheless, with 𝐿(𝑧, 𝑧′) = 𝛽(𝑧) 𝑄(𝑧, 𝑧′) we obtain 𝜌(𝐿) = 0.9996. Hence
Proposition 6.2.3 applies.
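The spectral radius test is easy to carry out in code. The following sketch uses a hypothetical three-state chain rather than the Tauchen discretization described above, so the numbers are illustrative only; the point is that 𝜌(𝐿) < 1 can hold even when the discount factor exceeds 1 in some states.

```julia
using LinearAlgebra

# Hypothetical stochastic matrix for (Z_t) and discount factor values β(z).
Q = [0.8 0.2 0.0;
     0.1 0.8 0.1;
     0.0 0.2 0.8]
β_vals = [0.94, 0.98, 1.02]      # the last state has β(z) > 1

L = β_vals .* Q                  # L(z, z′) = β(z) Q(z, z′): row z is scaled by β(z)
ρ_L = maximum(abs.(eigvals(L)))

println("sup β = ", maximum(β_vals), ", ρ(L) = ", ρ_L)
```

Broadcasting a vector against a matrix scales rows, which matches the definition 𝐿(𝑧, 𝑧′) = 𝛽(𝑧) 𝑄(𝑧, 𝑧′); the same idiom appears in Listing 21 as z_vals .* Q.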

6.2.2 Inventory Management Revisited

In this section, we modify the inventory management model from §5.2.1 to include
time-varying interest rates.
[Footnote 2] The parameters are 𝜌 = 0.85, 𝜎 = 0.0062, and 𝑏 = 0.99875. In line with Hills
et al. (2019), we discretize the model via mc = tauchen(n, ρ, σ, 1 - ρ, m) with 𝑚 = 4.5
and 𝑛 = 15.

Recall that, in the model of §5.2.1, the Bellman equation takes the form

\[ v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{d \geqslant 0} v(f(x, a, d)) \varphi(d) \right\} \tag{6.24} \]

at each 𝑥 ∈ X, where X ≔ {0, . . . , 𝐾}, 𝑥 is the current inventory level, 𝑎 is the current
inventory order, 𝑟(𝑥, 𝑎) is current profits (defined in (5.8)), 𝑓(𝑥, 𝑎, 𝑑) ≔ (𝑥 − 𝑑) ∨ 0 + 𝑎
and 𝑑 is an IID demand shock with distribution 𝜑. Let’s now add a time-varying
discount rate and investigate its impact on optimal choices.
We add time-varying discounting by replacing the constant 𝛽 in (6.24) with a
stochastic process (𝛽𝑡) where 𝛽𝑡 = 1/(1 + 𝑟𝑡). We suppose that the dynamics can
be expressed as 𝛽𝑡 = 𝛽(𝑍𝑡), where the exogenous process (𝑍𝑡)𝑡⩾0 is 𝑄-Markov on Z.
After relabeling the endogenous state 𝑋𝑡 as 𝑌𝑡 and 𝑥 as 𝑦, in line with the notation in
§6.2.1.5, the Bellman equation becomes 𝑣(𝑦, 𝑧) = max_{𝑎 ∈ Γ(𝑦,𝑧)} 𝐵((𝑦, 𝑧), 𝑎, 𝑣) where

\[ B((y, z), a, v) = r(y, a) + \beta(z) \sum_{d, z'} v(f(y, a, d), z') \varphi(d) Q(z, z') . \tag{6.25} \]

If we set

\[ R(y, a, y') := \mathbb{P}\{ f(y, a, D) = y' \} \quad \text{when } D \sim \varphi , \]

then 𝑅(𝑦, 𝑎, 𝑦′) is the probability of realizing next period inventory level 𝑦′ when the
current level is 𝑦 and the action is 𝑎. Hence we can rewrite (6.25) as

\[ B((y, z), a, v) = r(y, a) + \beta(z) \sum_{y', z'} v(y', z') Q(z, z') R(y, a, y') . \tag{6.26} \]

We have now created a version of the MDP with exogenous state-dependent dis-
counting described in §6.2.1.5. Letting 𝐿 ( 𝑧, 𝑧0) ≔ 𝛽 ( 𝑧) 𝑄 ( 𝑧, 𝑧0) and applying Propo-
sition 6.2.3, we see that all of the standard optimality results hold whenever 𝜌 ( 𝐿) < 1.
Figure 6.5 shows how inventory evolves under an optimal program when the pa-
rameters of the problem are as given in Listing 21. (The code preallocates and com-
putes arrays representing 𝑟 , 𝑅 and 𝑄 in (6.26) and includes a test for 𝜌 ( 𝐿) < 1.) We
set 𝛽 ( 𝑧) = 𝑧 and take ( 𝑍𝑡 ) to be a discretization of an AR(1) process. Figure 6.5
was created by simulating ( 𝑍𝑡 ) according to 𝑄 and inventory (𝑌𝑡 ) according to 𝑌𝑡+1 =
(𝑌𝑡 − 𝐷𝑡+1 ) ∨ 0 + 𝐴𝑡 , where 𝐴𝑡 follows the optimal policy.
The outcome is similar to Figure 5.7, in the sense that inventory falls slowly and
then jumps up. As before, fixed costs induce this lumpy behavior. However, a new
phenomenon is now present: inventories trend up when interest rates fall and down
when they rise. (The interest rate 𝑟𝑡 is calculated via 𝛽𝑡 = 1/(1 + 𝑟𝑡) at each 𝑡.)
Because of positive autocorrelation (𝜌 > 0), high current interest rates foreshadow high
future interest rates, which in turn devalue future profits and hence encourage managers
to economize on stock.

[Figure: two panels plotted against 𝑡, inventory (top) and the interest rate 𝑟𝑡 (bottom)]

Figure 6.5: Inventory dynamics with time-varying interest rates

Figure 6.6 shows execution time for VFI and OPI at different choices of 𝑚 (see
§6.2.1.3 for the interpretation of 𝑚). As for the optimal savings problem we studied
in Chapter 5, OPI is around 1 order of magnitude faster when 𝑚 is close to 50 (cf.
Figure 5.8 on page 153).

6.3 Asset Pricing

This section provides a brief introduction to asset pricing in a Markov environment.
While the topic of asset pricing is fascinating in its own right, our main aim is to
provide additional practice in handling linear valuation problems. (Readers who wish
to push ahead with their study of dynamic programming can safely skip to Chapter 7.)

using LinearAlgebra, Random, Distributions, QuantEcon

f(y, a, d) = max(y - d, 0) + a  # Inventory update

function create_sdd_inventory_model(;
        ρ=0.98, ν=0.002, n_z=20, b=0.97,  # Z state parameters
        K=40, c=0.2, κ=0.8, p=0.6,        # firm and demand parameters
        d_max=100)                        # truncation of demand shock

    ϕ(d) = (1 - p)^d * p                  # demand distribution
    d_vals = collect(0:d_max)
    ϕ_vals = ϕ.(d_vals)
    y_vals = collect(0:K)                 # inventory levels
    n_y = length(y_vals)
    mc = tauchen(n_z, ρ, ν)
    z_vals, Q = mc.state_values .+ b, mc.p
    ρL = maximum(abs.(eigvals(z_vals .* Q)))  # spectral radius of L = β(z) Q(z, z′)
    @assert ρL < 1 "Error: ρ(L) ≥ 1."

    # R[i_y, i_a, i_y′] = probability of moving from y to y′ under order a
    R = zeros(n_y, n_y, n_y)
    for (i_y, y) in enumerate(y_vals)
        for (i_y′, y′) in enumerate(y_vals)
            for (i_a, a) in enumerate(0:(K - y))
                hits = f.(y, a, d_vals) .== y′
                R[i_y, i_a, i_y′] = dot(hits, ϕ_vals)
            end
        end
    end

    # r[i_y, i_a] = current profit; infeasible pairs stay at -Inf
    r = fill(-Inf, n_y, n_y)
    for (i_y, y) in enumerate(y_vals)
        for (i_a, a) in enumerate(0:(K - y))
            cost = c * a + κ * (a > 0)
            r[i_y, i_a] = dot(min.(y, d_vals), ϕ_vals) - cost
        end
    end

    return (; K, c, κ, p, r, R, y_vals, z_vals, Q)
end

Listing 21: Inventory model with time-varying discounting (inventory_sdd.jl)


[Figure: execution time plotted against 𝑚 for value function iteration and optimistic policy iteration]

Figure 6.6: OPI vs VFI timings for the inventory problem

6.3.1 Introduction to Asset Pricing

We first discuss risk-neutral pricing and show why this assumption is typically implau-
sible. Next, we introduce stochastic discount factors and stationary asset pricing.

6.3.1.1 Risk-Neutral Pricing?

Consider the problem of assigning a current price Π𝑡 to an asset that confers on its
owner the right to payoff 𝐺𝑡+1 . The payoff is stochastic and realized next period. One
simple idea is to use risk-neutral pricing, which implies that

Π𝑡 = E𝑡 𝛽 𝐺𝑡+1 (6.27)

for some constant discount factor 𝛽 ∈ (0, 1). If the payoff is in 𝑘 periods, then we
modify the price to E𝑡 𝛽^𝑘 𝐺𝑡+𝑘. In essence, risk-neutral pricing says that cost equals
expected reward, discounted to present value by compounding a constant rate of
discount. (A rate of discount, say 𝜌, is linked to a discount factor, say 𝛽, by 𝛽 =
1/(1 + 𝜌) ≈ exp(−𝜌).)

Example 6.3.1. Let 𝑆𝑡 be the price of a stock at each point in time 𝑡. A European call
option gives its owner the right to purchase the stock at price 𝐾 at time 𝑡 + 𝑘. There
is no obligation to exercise the option, so the payoff at 𝑡 + 𝑘 is max{𝑆𝑡+𝑘 − 𝐾, 0}. Under
risk-neutral pricing, the time 𝑡 price of this option is

\[ \Pi_t = \mathbb{E}_t \, \beta^k \max\{ S_{t+k} - K, 0 \} . \]
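As a small numerical illustration of the formula in Example 6.3.1, the sketch below prices a call under risk neutrality when 𝑆𝑡+𝑘 has a hypothetical three-point conditional distribution. The function name call_price and all numbers are assumptions made for this example.

```julia
# Hypothetical conditional distribution of the stock price S_{t+k}.
S_vals = [80.0, 100.0, 120.0]
probs  = [0.25, 0.50, 0.25]

# Risk-neutral price: β^k times the expected payoff max{S − K, 0}.
call_price(β, k, K, S_vals, probs) = β^k * sum(max.(S_vals .- K, 0.0) .* probs)

price = call_price(0.98, 2, 100.0, S_vals, probs)
```

Here only the top state is in the money, so the expected payoff is 0.25 × 20 = 5 and the price is 0.98² × 5.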

Although risk neutrality allows for simple pricing, assuming risk neutrality for all
investors is not plausible.
To give one example, suppose that we take the asset that pays 𝐺𝑡+1 in (6.27) and
replace it with another asset that pays 𝐻𝑡+1 = 𝐺𝑡+1 + 𝜀𝑡+1 , where 𝜀𝑡+1 is independent of
𝐺𝑡+1 , E𝑡 𝜀𝑡+1 = 0 and Var 𝜀𝑡+1 > 0. In effect, we are adding risk to the original payoff
without changing its mean.
Under risk neutrality, the price of this new asset is

Π𝑡𝐻 = E𝑡 𝛽 [𝐺𝑡+1 + 𝜀𝑡+1 ] = Π𝑡 + 𝛽 E𝑡 𝜀𝑡+1 = Π𝑡 .


Thus, 𝐻𝑡+1 and 𝐺𝑡+1 are priced identically. Their conditional means are both E𝑡 𝐺𝑡+1,
but their variances satisfy

\[ \operatorname{Var} H_{t+1} = \operatorname{Var} G_{t+1} + \operatorname{Var} \varepsilon_{t+1} > \operatorname{Var} G_{t+1} . \]

This outcome contradicts the idea that investors typically want compensation for bear-
ing risk.
A helpful way to think about the same point is to consider the rate of return 𝑟𝑡+1 ≔
(𝐺𝑡+1 − Π𝑡)/Π𝑡 on holding an asset with payoff 𝐺𝑡+1. From (6.27) we have
E𝑡 𝛽(1 + 𝑟𝑡+1) = 1, or

\[ \mathbb{E}_t \, r_{t+1} = \frac{1 - \beta}{\beta} . \]

Since the right-hand side does not depend on 𝐺𝑡+1 , risk neutrality implies that all
assets have the same expected rate of return. But this contradicts the finding that, on
average, riskier assets tend to have higher rates of return that compensate investors
for bearing risk.

Example 6.3.2. The risk premium on a given asset is defined as the expected rate
of return minus the rate of return on a risk-free asset. If we assume risk neutrality
then, by the preceding discussion, the risk premium is zero for all assets. However,
calculations based on post-war US data show that the average return premium on
equities over safe assets is around 8% per annum (see, e.g., Cochrane (2009)).

6.3.1.2 A Stochastic Discount Factor

To go beyond risk-neutral pricing, let’s start with a model containing one asset and
one agent. It is straightforward to price the asset and compare it to the risk-neutral
case.
A representative agent takes the price Π𝑡 of a risky asset as given and solves

\[ \max_{0 \leqslant \alpha \leqslant 1} \{ u(C_t) + \beta \, \mathbb{E}_t \, u(C_{t+1}) \} \quad \text{subject to} \quad C_t = E_t - \Pi_t \alpha \ \text{ and } \ C_{t+1} = E_{t+1} + \alpha G_{t+1} . \]

Here

• 𝑢 is a flow utility function,
• 𝐺𝑡+1 is the payoff of the asset and Π𝑡 is the time-𝑡 price,
• 𝛽 is a constant discount factor measuring impatience of the agent,
• 𝐸𝑡 and 𝐸𝑡+1 are endowments and
• 𝛼 is the share of the asset purchased by the agent.

Rewriting as max_𝛼 {𝑢(𝐸𝑡 − Π𝑡 𝛼) + 𝛽 E𝑡 𝑢(𝐸𝑡+1 + 𝛼𝐺𝑡+1)} and differentiating with respect
to 𝛼 leads to the first order condition

\[ u'(E_t - \Pi_t \alpha) \, \Pi_t = \beta \, \mathbb{E}_t \, u'(E_{t+1} + \alpha G_{t+1}) \, G_{t+1} . \]

Rearranging gives us

\[ \Pi_t = \mathbb{E}_t \left[ \beta \frac{u'(C_{t+1})}{u'(C_t)} \, G_{t+1} \right] . \tag{6.28} \]

Comparing (6.28) with (6.27), we see that the payoff is now multiplied by a positive
random variable rather than a constant. The random variable

\[ M_{t+1} := \beta \frac{u'(C_{t+1})}{u'(C_t)} \tag{6.29} \]

is called the stochastic discount factor or pricing kernel. We call the particular
form of the pricing kernel shown in (6.29) the Lucas stochastic discount factor (Lucas
SDF) in honor of Lucas (1978a).

Example 6.3.3. If 𝑢 is linear, so that 𝑢(𝑐) = 𝑎𝑐 + 𝑏 for some 𝑎, 𝑏 ∈ ℝ, then 𝑢′(𝑐) = 𝑎 for
all 𝑐, so 𝑀𝑡+1 = 𝛽. If the utility function has no curvature, then pricing is risk neutral.

Example 6.3.4. If utility has the CRRA form 𝑢(𝑐) = 𝑐^{1−𝛾}/(1 − 𝛾) for some 𝛾 > 0, then
the Lucas SDF takes the form

\[ M_{t+1} = \beta \left( \frac{C_{t+1}}{C_t} \right)^{-\gamma} , \tag{6.30} \]

which we can also write as 𝑀𝑡+1 = 𝛽 exp(−𝛾𝑔𝑡+1) when 𝑔𝑡+1 ≔ ln(𝐶𝑡+1/𝐶𝑡) is the growth
rate of consumption.
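The two forms of the CRRA Lucas SDF in Example 6.3.4 can be compared directly. In this sketch the function name lucas_sdf and the default parameter values are hypothetical.

```julia
# Lucas SDF under CRRA utility, written in terms of consumption growth g = ln(C′ / C).
lucas_sdf(g; β=0.99, γ=2.0) = β * exp(-γ * g)

# The same object written in terms of the consumption ratio C′ / C, as in (6.30).
lucas_sdf_ratio(ratio; β=0.99, γ=2.0) = β * ratio^(-γ)

M_boom = lucas_sdf(0.02)    # strong consumption growth: heavier discounting
M_bust = lucas_sdf(-0.02)   # weak consumption growth: lighter discounting
```

With zero growth the SDF reduces to 𝛽, and it decreases as consumption growth rises.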

In the CRRA case, the Lucas SDF applies heavier discounting to assets that con-
centrate payoffs in states of the world where the agent is already enjoying strong
consumption growth. Conversely, the SDF attaches higher weights to future payoffs
that occur when consumption growth is low because such payoffs hedge against the
risk of drawing low consumption states.

6.3.1.3 A General Specification

The standard neoclassical theory of asset pricing generalizes the Lucas discounting
specification by assuming only that there exists a positive random variable 𝑀𝑡+1 such
that the price of an asset with payoff 𝐺𝑡+1 is

Π𝑡 = E𝑡 𝑀𝑡+1 𝐺𝑡+1 ( 𝑡 ⩾ 0) . (6.31)

As above, 𝑀𝑡+1 is called a stochastic discount factor (SDF). Equation (6.31) generalizes
(6.28) by refraining from restricting the SDF (apart from assuming positivity).
Actually, it can be shown that there exists an SDF 𝑀𝑡+1 such that (6.31) is always
valid under relatively weak assumptions. In particular, a single SDF 𝑀𝑡+1 can be used
to price any asset in the market, so if 𝐻𝑡+1 is another stochastic payoff then the
current price of an asset with this payoff is E𝑡 𝑀𝑡+1 𝐻𝑡+1.
We do not prove these claims, since our interest is in understanding forward-
looking equations in Markov environments. Some relevant references are listed in
§6.4.

6.3.1.4 Markov Pricing

A common assumption in quantitative applications is that all underlying randomness
is driven by a Markov model. In this spirit, we take (𝑋𝑡) to be 𝑃-Markov on finite state
space X, where 𝑃 ∈ M(ℝ^X), and suppose further that the SDF and payoff have the forms

\[ M_{t+1} = m(X_t, X_{t+1}) \quad \text{and} \quad G_{t+1} = g(X_t, X_{t+1}) \]

for fixed functions 𝑚, 𝑔 mapping X × X to ℝ+. Since 𝑚 is arbitrary at this point, we
don’t assume a particular specification for the SDF.
In this setting, conditioning on 𝑋𝑡 = 𝑥, the standard asset pricing equation Π𝑡 =
E𝑡 𝑀𝑡+1 𝐺𝑡+1 becomes

\[ \pi(x) = \sum_{x'} m(x, x') \, g(x, x') \, P(x, x') \qquad (x \in \mathsf{X}) , \tag{6.32} \]

where 𝜋(𝑥) is the price of the asset conditional on 𝑋𝑡 = 𝑥. (That is, Π𝑡 = 𝜋(𝑋𝑡).)
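In matrix form, (6.32) is a row sum of the elementwise product of 𝑚, 𝑔 and 𝑃. A minimal sketch, assuming hypothetical two-state arrays and an illustrative function name:

```julia
# One-period Markov pricing as in (6.32): π(x) = Σ_{x′} m(x, x′) g(x, x′) P(x, x′).
one_period_price(m, g, P) = vec(sum(m .* g .* P, dims=2))

# Hypothetical two-state primitives.
P = [0.9 0.1;
     0.2 0.8]            # transition probabilities
m = [0.97 0.94;
     0.99 0.96]          # SDF values m(x, x′)
g = [1.0 2.0;
     0.5 1.5]            # payoffs g(x, x′)

price = one_period_price(m, g, P)
```

In the risk-neutral case 𝑚 ≡ 𝛽 with a payoff of one unit in every state, the same function returns 𝛽 in each state.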

6.3.1.5 Pricing a Stationary Dividend Stream

Now we are ready to look at pricing a stationary cash flow over an infinite horizon,
a basic problem in asset pricing. We will apply the Markov structure assumed in
§6.3.1.4. In all that follows, ( 𝑋𝑡 ) is 𝑃 -Markov on X and 𝑀𝑡+1 is defined as in §6.3.1.4.
We seek the time 𝑡 price, denoted by Π𝑡 , for an ex-dividend contract on the divi-
dend stream ( 𝐷𝑡 )𝑡⩾0 . The contract provides the owner with the right to the dividend
stream. The “ex-dividend” component means that, should the dividend stream be
traded at time 𝑡 , the dividend paid at time 𝑡 goes to the seller rather than the buyer.
As a result, purchasing at 𝑡 and selling at 𝑡 + 1 pays Π𝑡+1 + 𝐷𝑡+1. Hence, applying the
asset pricing rule (6.31), the time 𝑡 price Π𝑡 of the contract must satisfy

\[ \Pi_t = \mathbb{E}_t \, M_{t+1} ( \Pi_{t+1} + D_{t+1} ) . \tag{6.33} \]

We assume the existence of a 𝑑 ∈ ℝ^X_+ such that 𝐷𝑡 = 𝑑(𝑋𝑡) for all 𝑡. Seeking a solution
of the form Π𝑡 = 𝜋(𝑋𝑡) and using (6.32), we can write (6.33) as

\[ \pi(x) = \sum_{x'} m(x, x') \, ( \pi(x') + d(x') ) \, P(x, x') \qquad (x \in \mathsf{X}) , \tag{6.34} \]

or, equivalently,

\[ \pi = A \pi + A d \quad \text{when} \quad A(x, x') := m(x, x') \, P(x, x') . \tag{6.35} \]

By the Neumann series lemma, 𝜌(𝐴) < 1 implies (6.35) has unique solution

\[ \pi^* := (I - A)^{-1} A d = \sum_{k=1}^{\infty} A^k d . \]

The vector 𝜋∗ is called an equilibrium price function.
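The equilibrium price function is one linear solve away once 𝐴 is formed. The sketch below assumes hypothetical two-state primitives and verifies that the computed vector satisfies (6.35).

```julia
using LinearAlgebra

# Hypothetical two-state primitives.
P = [0.9 0.1;
     0.2 0.8]
m = [0.97 0.94;
     0.99 0.96]
d = [1.0, 2.0]

A = m .* P                                   # A(x, x′) = m(x, x′) P(x, x′)
ρ_A = maximum(abs.(eigvals(A)))
@assert ρ_A < 1 "Neumann series lemma does not apply"

price_ex = (I - A) \ (A * d)                 # ex-dividend price π* = (I − A)⁻¹ A d
residual = maximum(abs.(price_ex .- A * (price_ex .+ d)))
```

Solving the linear system with the backslash operator avoids forming (𝐼 − 𝐴)^{−1} explicitly, which is the usual practice for numerical stability.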


EXERCISE 6.3.1. Show that 𝜌(𝐴) < 1 is both necessary and sufficient for existence
of a unique solution to (6.34) in (0, ∞)^X whenever 𝑚, 𝑑 ≫ 0.

EXERCISE 6.3.2. As discussed in §6.3.1.1, the case 𝑚 ≡ 𝛽 for some 𝛽 ∈ ℝ+ is called
the risk-neutral case. Provide a condition on 𝛽 under which 𝜌(𝐴) < 1.

EXERCISE 6.3.3. Confirm that ( Π𝑡 )𝑡⩾0 generated by Π𝑡 = 𝜋∗ ( 𝑋𝑡 ) solves (6.33).

Remark 6.3.1. We can call 𝐴 an Arrow–Debreu discount operator. Its powers apply
discounting: the valuation of any random payoff 𝑔 in 𝑘 periods is 𝐴𝑘 𝑔.

EXERCISE 6.3.4. Derive the price for a cum-dividend contract on the dividend
stream ( 𝐷𝑡 )𝑡⩾0 , with the model otherwise unchanged. Under this contract, should the
right to the dividend stream be traded at time 𝑡 , the dividend paid at time 𝑡 goes to
the buyer rather than the seller.

6.3.1.6 Forward Sum Representation

Asset prices can be expressed as infinite sums under the assumptions stated above.
Let’s show this for cum-dividend contracts (although the case of ex-dividend contracts
is similar). In Exercise 6.3.4 you found that the state-contingent price vector 𝜋 for a
cum-dividend contract on the dividend stream ( 𝐷𝑡 )𝑡⩾0 obeys

𝜋 = 𝑑 + 𝐴𝜋 when 𝐴 ( 𝑥, 𝑥 0) ≔ 𝑚 ( 𝑥, 𝑥 0) 𝑃 ( 𝑥, 𝑥 0) (6.37)

and 𝜌(𝐴) < 1. As before, 𝐷𝑡 = 𝑑(𝑋𝑡) and (𝑋𝑡)𝑡⩾0 is 𝑃-Markov on X. Applying the
uniqueness component of the Neumann series lemma (page 18) and Theorem 6.1.1,
we see that the function 𝜋 also obeys

\[ \pi(x) = \mathbb{E}_x \sum_{t=0}^{\infty} \left[ \prod_{i=0}^{t} M_i \right] D_t \qquad (x \in \mathsf{X}) , \]

where 𝑀𝑡+1 ≔ 𝑚(𝑋𝑡, 𝑋𝑡+1) for 𝑡 ⩾ 0 and 𝑀0 ≔ 1. This expression agrees with our
intuition: The price of the contract is the expected present value of the dividend stream,
with the time 𝑡 dividend discounted by the composite factor 𝑀1 · · · 𝑀𝑡.
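The forward sum can be approximated by truncation, using the fact that the 𝑡-th term E𝑥 [𝑀1 · · · 𝑀𝑡 𝑑(𝑋𝑡)] equals (𝐴^𝑡 𝑑)(𝑥) under the Markov assumptions. The following sketch, with hypothetical two-state primitives and an illustrative function name, compares the truncated sum with the solution of (6.37).

```julia
using LinearAlgebra

# Hypothetical two-state primitives.
P = [0.9 0.1;
     0.2 0.8]
m = [0.97 0.94;
     0.99 0.96]
d = [1.0, 2.0]
A = m .* P

price_cum = (I - A) \ d                  # solves π = d + Aπ, as in (6.37)

# Truncated forward sum d + A d + A² d + ⋯ + Aⁿ d.
function forward_sum(A, d; n=2_000)
    total, term = zero(d), copy(d)
    for _ in 0:n
        total += term
        term = A * term
    end
    return total
end

price_sum = forward_sum(A, d)
price_ex = (I - A) \ (A * d)             # ex-dividend price, for comparison
```

In this sketch the cum-dividend price exceeds the ex-dividend price by exactly the current dividend, so price_cum ≈ d + price_ex.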

6.3.2 Nonstationary Dividends

Until now, our discussion of asset pricing has assumed that dividends are stationary.
However, dividends typically grow over time, along with other economic measures
such as GDP. In this section we solve for the price of a dividend stream when dividends
exhibit random growth.

6.3.2.1 Price-Dividend Ratios

A standard model of dividend growth is

\[ \ln \frac{D_{t+1}}{D_t} = \kappa(X_t, \eta_{t+1}) , \qquad t = 0, 1, \ldots , \]

where 𝜅 is a fixed function, ( 𝑋𝑡 ) is the state process and ( 𝜂𝑡 ) is IID. We let 𝜑 be the
density of each 𝜂𝑡 and assume that ( 𝑋𝑡 ) is 𝑃 -Markov on a finite set X. Let’s suppose as
before that the SDF obeys 𝑀𝑡+1 = 𝑚 ( 𝑋𝑡 , 𝑋𝑡+1 ) for some positive function 𝑚.
Since dividends grow over time, so will the price of the asset. As such, we should
no longer seek a fixed function 𝜋 such that Π𝑡 = 𝜋 ( 𝑋𝑡 ) for all 𝑡 , since the resulting
price process ( Π𝑡 ) will fail to grow. Instead, we try to solve for the price-dividend
ratio 𝑉𝑡 ≔ Π𝑡 / 𝐷𝑡 , which we hope will be stationary.

EXERCISE 6.3.5. Using Π𝑡 = E𝑡 [ 𝑀𝑡+1 ( 𝐷𝑡+1 + Π𝑡+1 )], show that

𝑉𝑡 = E𝑡 [ 𝑀𝑡+1 exp( 𝜅 ( 𝑋𝑡 , 𝜂𝑡+1 )) (1 + 𝑉𝑡+1 )] . (6.38)

After conditioning on 𝑋𝑡 = 𝑥, (6.38) leads us to conjecture existence of a function
𝑣 such that

\[ v(x) = \sum_{x'} m(x, x') \int \exp(\kappa(x, \eta)) \, \varphi(\mathrm{d}\eta) \, [1 + v(x')] \, P(x, x') \tag{6.39} \]

for all 𝑥 ∈ X. We understand (6.39) as an equation to be solved for the unknown
object 𝑣 ∈ ℝ^X. If we can find a solution 𝑣∗ to (6.39), then setting 𝑉𝑡 = 𝑣∗(𝑋𝑡) yields a
process (𝑉𝑡) that obeys (6.38).

EXERCISE 6.3.6. Let

\[ A(x, x') := m(x, x') \int \exp(\kappa(x, \eta)) \, \varphi(\mathrm{d}\eta) \, P(x, x') \qquad (x, x' \in \mathsf{X}) . \tag{6.40} \]

Show that (6.39) has a unique solution 𝑣∗ in ℝ^X when 𝜌(𝐴) < 1, and

\[ v^* = (I - A)^{-1} A \mathbb{1} = \sum_{t \geqslant 1} A^t \mathbb{1} . \tag{6.41} \]

The price-dividend process (𝑉𝑡∗) defined by 𝑉𝑡∗ = 𝑣∗(𝑋𝑡) solves (6.38). The price
can be recovered via Π𝑡 = 𝑉𝑡∗ 𝐷𝑡.
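A minimal sketch of (6.40) and (6.41) follows. It assumes a hypothetical specification that is simpler than the one studied in the next section: a constant SDF 𝑚 ≡ 𝛽 and dividend growth 𝜅(𝑥, 𝜂) = 𝜇 + 𝑥 + 𝜎𝜂 with 𝜂 standard normal, so that the integral in (6.40) equals exp(𝜇 + 𝑥 + 𝜎²/2). The state grid and transition matrix are made up for illustration.

```julia
using LinearAlgebra

# Hypothetical three-state growth process.
x_vals = [-0.02, 0.0, 0.02]
P = [0.8 0.2 0.0;
     0.1 0.8 0.1;
     0.0 0.2 0.8]
β, μ, σ = 0.96, 0.01, 0.05

# ∫ exp(κ(x, η)) φ(dη) = exp(μ + x + σ²/2) when η is standard normal.
growth = exp.(μ .+ x_vals .+ σ^2 / 2)

A = β .* growth .* P                     # (6.40) with m ≡ β; row x is scaled by growth(x)
ρ_A = maximum(abs.(eigvals(A)))
@assert ρ_A < 1 "ρ(A) ≥ 1: the price-dividend ratio is not well defined"

v_star = (I - A) \ (A * ones(length(x_vals)))    # (6.41)
```

The solution can be checked against the fixed point form 𝑣 = 𝐴(𝟙 + 𝑣) of (6.39).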

6.3.2.2 Application: Markov Growth with a Lucas SDF

As an example, suppose that dividend growth obeys

\[ \kappa(X_t, \eta_{d,t+1}) = \mu_d + X_t + \sigma_d \, \eta_{d,t+1} , \]

where (𝜂_{𝑑,𝑡})𝑡⩾0 is IID and standard normal. Consumption growth is given by

\[ \ln \frac{C_{t+1}}{C_t} = \mu_c + X_t + \sigma_c \, \eta_{c,t+1} , \]

where (𝜂_{𝑐,𝑡})𝑡⩾0 is also IID and standard normal. We use the Lucas SDF in (6.30),
implying that

\[ M_{t+1} = \beta \left( \frac{C_{t+1}}{C_t} \right)^{-\gamma} = \beta \exp( -\gamma ( \mu_c + X_t + \sigma_c \, \eta_{c,t+1} ) ) . \]

EXERCISE 6.3.7. Using (6.40), show that

\[ A(x, x') = \beta \exp\left( -\gamma \mu_c + \mu_d + (1 - \gamma) x + \frac{\gamma^2 \sigma_c^2 + \sigma_d^2}{2} \right) P(x, x') . \]

Figure 6.7 shows the price-dividend ratio function 𝑣∗ for the specification given
in Listing 22, as well as for an alternative mean dividend growth rate 𝜇 𝑑 . The state
process is a Tauchen discretization of an AR(1) process with positive autocorrelation.
An increase in the state predicts higher dividends, which tends to increase the price.
At the same time, higher 𝑥 also predicts higher consumption growth, which acts neg-
atively on the price. For values of 𝛾 greater than 1, the second effect dominates and
the price-dividend ratio slopes down.

EXERCISE 6.3.8. Complete the code in Listing 22 and replicate Figure 6.7. Add a
test to your code that checks 𝜌 ( 𝐴) < 1 before computing the price-dividend ratio.

using QuantEcon, LinearAlgebra

"Creates an instance of the asset pricing model with Markov state."
function create_asset_pricing_model(;
        n=200,               # state grid size
        ρ=0.9, ν=0.2,        # state persistence and volatility
        β=0.99, γ=2.5,       # discount and preference parameter
        μ_c=0.01, σ_c=0.02,  # consumption growth mean and volatility
        μ_d=0.02, σ_d=0.1)   # dividend growth mean and volatility
    mc = tauchen(n, ρ, ν)
    x_vals, P = exp.(mc.state_values), mc.p
    return (; x_vals, P, β, γ, μ_c, σ_c, μ_d, σ_d)
end

Listing 22: Asset pricing model with Lucas SDF (pd_ratio.jl)

[Figure: price-dividend ratio plotted against the state 𝑥 for 𝜇𝑑 = 0.02 and 𝜇𝑑 = 0.08]

Figure 6.7: Price-dividend ratio as a function of the state


6.3.3 Incomplete Markets

In §6.3.1.5 we used the Neumann series lemma to solve for the equilibrium price
vector 𝜋. However, some modifications to the basic model introduce nonlinearities
that render the Neumann series lemma inapplicable. For example, Harrison and Kreps
(1978) analyze a setting with heterogeneous beliefs and incomplete markets, leading
to failure of the standard asset pricing equation. This results in a nonlinear equation
for prices.
We treat the Harrison and Kreps model only briefly. There are two types of agents.
Type 𝑖 believes that the state updates according to stochastic matrix 𝑃_𝑖 for 𝑖 = 1, 2.
Agents are risk-neutral, so 𝑚(𝑥, 𝑦) ≡ 𝛽 ∈ (0, 1). Harrison and Kreps (1978) show that,
for their model, the equilibrium condition (6.34) becomes

\[ \pi(x) = \max_i \, \beta \sum_{x'} [ \pi(x') + d(x') ] \, P_i(x, x') \tag{6.42} \]

for 𝑥 ∈ X and 𝑖 ∈ {1, 2}. Setting aside the details that lead to this equation, our
objective is simply to obtain a vector of prices 𝜋 that solves (6.42).
As a first step, we introduce an operator 𝑇 : RX + → R+ that maps 𝜋 to 𝑇𝜋 via
X

Õ
(𝑇𝜋)( 𝑥 ) = max 𝛽 [ 𝜋 ( 𝑥 0) + 𝑑 ( 𝑥 0)] 𝑃𝑖 ( 𝑥, 𝑥 0) ( 𝑥 ∈ X) . (6.43)
𝑖
𝑥0

We are assuming 𝑑 ⩾ 0, so 𝑇 is indeed a self-map on RX


+.

By construction, a vector 𝜋 ∈ RX+ is a fixed point of 𝑇 if and only if it is a vector


of prices that solves (6.42). Hence, we have successfully converted our equilibrium
problem into a fixed point problem.
We aim to show that $T$ is a contraction. To this end, pick any $p, q \in \mathbb{R}^{\mathsf X}_+$.
Applying the inequality from Lemma 2.2.2 on page 58, we obtain
$$
|(Tp)(x) - (Tq)(x)| \leq \beta \max_i
\left| \sum_{x'} [p(x') + d(x')] P_i(x, x') - \sum_{x'} [q(x') + d(x')] P_i(x, x') \right| .
$$

Using the triangle inequality and canceling terms leads to
$$
|(Tp)(x) - (Tq)(x)| \leq \beta \max_{i \in \{1, 2\}} \sum_{x'} |p(x') - q(x')| P_i(x, x')
\leq \beta \| p - q \|_\infty .
$$

Since this bound holds for all $x$, we can take the maximum with respect to $x$ and obtain
$$
\| Tp - Tq \|_\infty \leq \beta \| p - q \|_\infty .
$$
Thus, on $\mathbb{R}^{\mathsf X}_+$, the map $T$ is a contraction of modulus $\beta$ with respect
to the sup norm. Since $\mathbb{R}^{\mathsf X}_+$ is a closed subset of $\mathbb{R}^{\mathsf X}$,
we conclude that $T$ has a unique fixed point in this set. Hence, the system (6.42) has a unique
solution $\pi^*$ in $\mathbb{R}^{\mathsf X}_+$, representing
equilibrium prices. This fixed point can be computed by successive approximation.
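
The successive approximation step can be sketched in a few lines of Julia. This is only an illustration under assumptions of our own: the two belief matrices `P1` and `P2`, the dividend vector `d`, the value of `β` and the helper name `solve_prices` are hypothetical and do not come from the text.

```julia
# Hypothetical primitives (not from the text): two belief matrices and dividends
P1 = [0.9 0.1; 0.3 0.7]      # beliefs of type 1
P2 = [0.6 0.4; 0.1 0.9]      # beliefs of type 2
d  = [1.0, 2.0]              # dividends, d ≥ 0
β  = 0.95                    # discount factor

# (Tp)(x) = max_i β Σ_x' [p(x') + d(x')] P_i(x, x'), as in (6.43)
T(p) = β .* max.(P1 * (p .+ d), P2 * (p .+ d))

# Iterate p ← Tp until successive iterates are within tol in the sup norm
function solve_prices(T, p_init; tol=1e-10, max_iter=10_000)
    p = p_init
    for k in 1:max_iter
        p_new = T(p)
        err = maximum(abs.(p_new .- p))
        p = p_new
        err < tol && return p, k
    end
    error("successive approximation did not converge")
end

p_star, num_iter = solve_prices(T, zeros(2))
println("prices = ", p_star, " after ", num_iter, " iterations")
```

Because $T$ dominates the pricing operator of each individual type and all of these operators are order-preserving, the computed prices should be at least as large as the prices that would prevail if every agent held beliefs $P_1$ (or $P_2$).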

EXERCISE 6.3.9. Provide an alternative proof of contractivity of $T$ on $\mathbb{R}^{\mathsf X}_+$
using Blackwell’s condition (§2.2.3.3).

6.4 Chapter Notes


Asset pricing is discussed in many sources, including Hansen and Renault (2010), Ross
(2009), Cochrane (2009), Duffie (2010) and Campbell (2017). Asset pricing is part
of many applications and extensions in macroeconomics, public finance, international
economics, and other fields. Some of these are described in Ljungqvist and Sargent
(2018).
Dynamic programming with state-dependent discounting is becoming more com-
mon in macroeconomics and finance. Representative examples include Krusell and
Smith (1998), Woodford (2011), Christiano et al. (2014), Albuquerque et al. (2016),
Saijo (2017), Basu and Bundick (2017), de Groot et al. (2018), Schorfheide et al.
(2018), Hills et al. (2019), Toda (2019), Fagereng et al. (2019), Hubmer et al. (2020)
and Cao (2020). For more on the theory of state-dependent discounting, see Jasso-
Fuentes et al. (2020), Toda (2021) or Stachurski and Zhang (2021). An analysis of
sovereign default with time-varying interest rates is provided by Bloise and Vailakis
(2022).
Another challenge to the standard model with constant discount rates comes from
empirical and experimental studies that find evidence of “hyperbolic discounting,”
where valuations across time fall rapidly at first and then more slowly. Provocative
reviews of hyperbolic and quasi-hyperbolic discounting can be found in Frederick et al.
(2002) and Rubinstein (2003). Cao and Werning (2018) provide conditions under
which predictions from optimal savings models with quasi-hyperbolic discounting are
robust. Balbus et al. (2018) analyze uniqueness of time-consistent stationary Markov
policies for quasi-hyperbolic households under uncertainty. Balbus et al. (2022) study
equilibria in dynamic models with recursive payoffs and generalized discounting.
Noor and Takeoka (2022) address the topic of optimal discounting. Additional references
include Diamond and Köszegi (2003), Dasgupta and Maskin (2005), Karp (2005), Amador
et al. (2006), Balbus et al. (2018), Fedus et al. (2019), Hens and
Schindler (2020), Jaśkiewicz and Nowak (2021), and Drugeon and Wigniolle (2021).
This chapter focused on time additive models with state-dependent discounting.
More general preference specifications with this feature include Albuquerque et al.
(2016), Schorfheide et al. (2018), Pohl et al. (2018), Gomez-Cram and Yaron (2020),
and de Groot et al. (2022). In Chapter 8 we consider state-dependent discounting in
general settings that accommodate such nonlinearities.
Chapter 7

Nonlinear Valuation

Dynamic programs are optimization problems where the objective to be maximized
is lifetime value. As such, one key topic is how to combine a sequence of rewards into
a corresponding lifetime value. So far we have considered linear valuation based on
summation over expected discounted rewards, using either constant discount rates
(Chapters 1–5) or state-dependent discounting (Chapter 6). In this chapter we con-
sider extensions, where lifetime value is computed from a recursion over the reward
sequence instead of a discounted sum. This “recursive preference” approach permits
far more general specifications of lifetime value, and is becoming increasingly popular
in economics, finance and computer science (see, e.g., §6.4).
This chapter focuses purely on valuation (i.e., combining reward sequences into
lifetime values), rather than optimization. Later, in Chapter 8, we will show how to
maximize lifetime value in settings where recursive preferences are adopted.
Throughout this chapter, the symbol X always represents a finite set.

7.1 Beyond Contraction Maps

The most natural way to express lifetime value in recursive preference environments
is as a fixed point of a (typically nonlinear) operator. One challenge is that some
recursive preference specifications induce operators that fail to be contractions. For
this reason, we now invest in additional fixed point theory. All of this theory concerns
order-preserving maps, since the operators we consider always inherit monotonicity
from underlying preferences.


7.1.1 Knaster–Tarski for Function Space

If you try to draw an increasing function that maps [0, 1] to itself without touching
the 45 degree line you will find it impossible. Below we state a famous fixed point
theorem due to Bronislaw Knaster (1893–1980) and Alfred Tarski (1901–1983) that
generalizes this idea. In the statement, X is a finite set and 𝑉 ≔ [ 𝑣1 , 𝑣2 ], where 𝑣1 , 𝑣2
are functions in RX with 𝑣1 ⩽ 𝑣2 .
Theorem 7.1.1 (Knaster–Tarski). If 𝑇 is an order-preserving self-map on 𝑉 , then the set
of fixed points of 𝑇 is nonempty and contains least and greatest elements 𝑎 ⩽ 𝑏. Moreover,

$$
T^k v_1 \leq a \leq b \leq T^k v_2 \quad \text{for all } k \geq 0.
$$

Unlike, say, the fixed point theorem of Banach (§1.2.2.3), Theorem 7.1.1 only
yields existence. Uniqueness does not hold in general, as you can easily confirm by
sketching the one-dimensional case or completing the following exercise.
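
To illustrate the bounds in Theorem 7.1.1 and the failure of uniqueness, here is a small Julia sketch for the one-dimensional case $V = [0, 1]$. The map `T` below is a hypothetical example chosen for illustration: it is increasing on $[0, 1]$, maps $[0, 1]$ into itself, and has three fixed points, $(1 - \sqrt{0.5})/2$, $0.5$ and $(1 + \sqrt{0.5})/2$.

```julia
# Hypothetical increasing self-map on [0, 1] with three fixed points
T(x) = 0.1 + 0.8 * (3x^2 - 2x^3)

# Compute the k-th iterate T^k x0
function iterate_map(T, x0, k)
    x = x0
    for _ in 1:k
        x = T(x)
    end
    return x
end

a = (1 - sqrt(0.5)) / 2     # least fixed point
b = (1 + sqrt(0.5)) / 2     # greatest fixed point

# Iterates from v1 = 0 rise toward a; iterates from v2 = 1 fall toward b
for k in (0, 1, 5, 50)
    lo, hi = iterate_map(T, 0.0, k), iterate_map(T, 1.0, k)
    println("k = $k:  T^k v1 = $lo,  T^k v2 = $hi")
end
```

In this example the iterates started at the endpoints bracket every fixed point, as the theorem asserts, but they converge to different limits, so no global stability result is available.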

EXERCISE 7.1.1. Consider the setting of Theorem 7.1.1 and suppose in addition
that 𝑣1 ≠ 𝑣2 . Show that there exists an order-preserving self-map on 𝑉 with a contin-
uum of fixed points.

7.1.2 Concavity, Convexity and Stability

In this section we study sufficient conditions for global stability that replace contrac-
tivity with shape properties such as concavity and monotonicity. To build intuition, we
start with the one-dimensional case and show how these properties can be combined
to achieve stability. Readers focused on results can safely skip to §7.1.2.2.

7.1.2.1 The One-Dimensional Case

In §1.2.3.2 we showed that concavity and monotonicity can yield global stability for
the Solow–Swan model. Here is a more general result.
Proposition 7.1.2. If 𝑔 is an increasing concave self-map on 𝑈 ≔ (0, ∞) and, for all
𝑥 ∈ 𝑈 , there exist 𝑎, 𝑏 ∈ 𝑈 with 𝑎 ⩽ 𝑥 ⩽ 𝑏, 𝑎 < 𝑔 ( 𝑎) and 𝑔 ( 𝑏) ⩽ 𝑏, then 𝑔 is globally stable
on 𝑈 .

Proof. Regarding existence, fix $x \in U$ and suppose first that $x \leq g(x)$. Since $g$ is
increasing, we have $g(x) \leq g^2(x)$. Continuing in this fashion shows that $(g^k(x))_{k \geq 0}$
is monotone increasing. Moreover, there exists a $b \in U$ such that $x \leq b$ and $g(b) \leq b$.
Hence $g(x) \leq g(b) \leq b$. Iterating yields $g^k(x) \leq b$ for all $k$, so $(g^k(x))_{k \geq 0}$
is increasing and bounded above. Thus, there exists an $x^* \in U$ such that $x_k := g^k(x)$
converges to $x^*$ (by Theorem A.2.1 and Exercise A.2.4). Since $g$ is concave and hence
continuous on any open set (see, e.g., Barbu and Precupanu (2012)), the result in Exercise 1.2.16
(page 21) implies that $x^* = g(x^*)$.
If, instead, $g(x) \leq x$, then a similar argument shows that $(g^k(x))_{k \geq 0}$ is decreasing
and bounded. Using analogous reasoning, we obtain a fixed point $x^*$ in $U$ with
$g^k(x) \to x^*$.
To show the uniqueness of the fixed point, assume $g(x) = x$ and $g(y) = y$ for some
$x, y \in U$. We claim that $x = y$. To see this, suppose without loss of generality that
$x \leq y$. By assumption, there exists an $a \in U$ such that $a \leq x \leq y$ and $g(a) > a$.
Because $a \leq x \leq y$, we can take $\lambda \in [0, 1]$ such that $x = \lambda a + (1 - \lambda) y$.
If $\lambda > 0$, then concavity of $g$ and $g(a) > a$ imply the contradiction
$$
g(x) = g(\lambda a + (1 - \lambda) y) \geq \lambda g(a) + (1 - \lambda) g(y)
> \lambda a + (1 - \lambda) y = x = g(x).
$$
Hence $\lambda = 0$. Since $x = \lambda a + (1 - \lambda) y$, this yields $x = y$. □

[Figure 7.1: the function g and the 45 degree line plotted against x, with the fixed point x∗ marked]

Figure 7.1: Global stability induced by increasing concave functions


Figure 7.1 gives one example, where 𝑔 ( 𝑥 ) = 1 + 𝑥 /2. The conditions of Proposition 7.1.2
hold because, given any 𝑥 > 0, we can find an 𝑎 in (0, 𝑥 ) that gets mapped
strictly up (i.e., 𝑔 ( 𝑎) is above the 45 degree line) and a point 𝑏 > 𝑥 that gets mapped down
(i.e., 𝑔 ( 𝑏) is below the 45 degree line).

EXERCISE 7.1.2. Prove that the map 𝑔 and set 𝑈 defined in the discussion of the
Solow-Swan model above Proposition 7.1.2 satisfy the conditions of the proposition.

EXERCISE 7.1.3. Show that the condition 𝑎 < 𝑔 ( 𝑎) in Proposition 7.1.2 cannot be
dropped without weakening the conclusion.

EXERCISE 7.1.4. Dropping the Cobb-Douglas specification on production, suppose
$g(k) = s f(k) + (1 - \delta) k$ where $0 < s, \delta < 1$ and $f$ is a strictly positive increasing
concave production function on $U = (0, \infty)$ satisfying the Inada conditions
$$
f'(k) \to \infty \text{ as } k \to 0 \quad \text{and} \quad f'(k) \to 0 \text{ as } k \to \infty.
$$
Use Proposition 7.1.2 to prove that $g$ is globally stable on $U$.

EXERCISE 7.1.5. Fajgelbaum et al. (2017) study a law of motion for aggregate
uncertainty given by
$$
s_{t+1} = g(s_t) \quad \text{where} \quad
g(s) := \rho^2 \left( \frac{1}{s} + a^2 \frac{1}{\eta} \right)^{-1} + \gamma .
$$
Let $a$, $\eta$ and $\gamma$ be positive constants and assume $0 < \rho < 1$. Prove that $g$ is
globally stable on $M := (0, \infty)$.

7.1.2.2 The Multidimensional Case

Proposition 7.1.2 extends to multiple dimensions. In this section we present a
multidimensional version that covers both convex and concave functions.
To state our result we extend the definition of convexity and concavity to vector-
valued self-maps. The definitions mirror those for scalar-valued functions: a self-map
𝑇 on a convex subset 𝐷 of RX is called convex if

𝑇 ( 𝜆𝑢 + (1 − 𝜆 ) 𝑣) ⩽ 𝜆𝑇𝑢 + (1 − 𝜆 )𝑇 𝑣 whenever 𝑢, 𝑣 ∈ 𝐷 and 𝜆 ∈ [0, 1];

and concave if

𝜆𝑇𝑢 + (1 − 𝜆 )𝑇 𝑣 ⩽ 𝑇 ( 𝜆𝑢 + (1 − 𝜆 ) 𝑣) whenever 𝑢, 𝑣 ∈ 𝐷 and 𝜆 ∈ [0, 1] .

Here ⩽ is, as usual, the pointwise order.


[Figure 7.2: two panels, (a) and (b), each plotting 𝑇𝑣 against the 45 degree line over the order interval from 𝑣1 to 𝑣2]

Figure 7.2: Du’s theorem: convex and concave cases

We are now ready to state our next fixed point result, which was first proved
in an infinite-dimensional setting by Du (1990). In the statement, X is a finite set,
𝑉 ≔ [ 𝑣1 , 𝑣2 ] is a nonempty order interval in ( RX , ⩽), and 𝑇 is a self-map on 𝑉 .

Theorem 7.1.3 (Du). If $T$ is order-preserving on $V$, then $T$ is globally stable on $V$ under
any one of (i)–(iv) below.

(i) $T$ is concave and $T v_1 \gg v_1$, or
(ii) $T$ is concave and there exists a $\delta > 0$ such that $T v_1 \geq v_1 + \delta (v_2 - v_1)$, or
(iii) $T$ is convex and $T v_2 \ll v_2$, or
(iv) $T$ is convex and there exists a $\delta > 0$ such that $T v_2 \leq v_2 - \delta (v_2 - v_1)$.

Conditions (i) and (ii) are similar – in fact (ii) holds whenever (i) holds, so (ii)
is the weaker (but slightly more complicated) condition. Conditions (iii) and (iv) are
similar in the same sense. Figure 7.2 illustrates the convex and the concave versions
of the result in one dimension. We encourage you to sketch your own variations to
understand the roles that different conditions play.
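
As a numerical complement to such sketches, the following Julia fragment illustrates case (i) of Theorem 7.1.3 in two dimensions. Everything in it is a hypothetical example of our own: the matrix `A`, the vector `h`, the map `T` and the order interval are chosen only so that `T` is order-preserving, concave, and satisfies $T v_1 \gg v_1$ and $T v_2 \leq v_2$.

```julia
A = [0.5 0.3; 0.2 0.6]     # hypothetical matrix with A ≥ 0
h = [1.0, 2.0]             # hypothetical vector with h ≫ 0

# T v = sqrt(h + A v), applied pointwise: order-preserving and concave
T(v) = sqrt.(h .+ A * v)

v1 = zeros(2)              # T v1 = sqrt.(h) ≫ v1
v2 = fill(10.0, 2)         # T v2 = [3.0, sqrt(10)] ≤ v2

# Compute the k-th iterate T^k v
function iterate_map(T, v, k)
    for _ in 1:k
        v = T(v)
    end
    return v
end

from_below = iterate_map(T, v1, 100)
from_above = iterate_map(T, v2, 100)
println("limit from v1: ", from_below)
println("limit from v2: ", from_above)
```

Since `T` is order-preserving, every trajectory that starts in $[v_1, v_2]$ is sandwiched between these two, so agreement of the two limits is what global stability looks like in practice.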

EXERCISE 7.1.6. Let 𝐹 and 𝐺 be self-maps on convex 𝐷 ⊂ R𝑛 . Show that 𝑇 ≔ 𝐹 ◦ 𝐺
is concave on 𝐷 whenever 𝐹 and 𝐺 are order-preserving and concave on 𝐷.

A full proof of Theorem 7.1.3 can be found in Du (1990) or Theorem 2.1.2 and
Corollary 2.1.1 of Zhang (2012). In our setting, existence follows from the Knaster–
Tarski theorem on page 213. We prove uniqueness on page 347.

7.1.3 A Power-Transformed Affine Equation

Du’s theorem provides conditions under which concave or convex order-preserving
self-maps on order intervals attain global stability. In this section we study maps of
this type that have additional structure. While this additional structure is restrictive, it
allows us to obtain global stability on unbounded subsets rather than order intervals.
To begin, let $\mathsf X$ be a finite set and consider the equation
$$
v = [h + (Av)^{1/\theta}]^\theta \qquad (v \in V), \tag{7.1}
$$
where $\theta$ is a nonzero parameter, $A \in \mathcal{L}(\mathbb{R}^{\mathsf X})$ with $A \geq 0$,
$V = (0, \infty)^{\mathsf X}$, and $h \in V$. This
system reduces to the affine model studied in Lemma 6.1.4 (page 188) when $\theta = 1$.
To analyze (7.1), we introduce the self-map
$$
Gv = [h + (Av)^{1/\theta}]^\theta \qquad (v \in V). \tag{7.2}
$$

Continuing to assume that $h \gg 0$ and $A$ is a positive linear operator, we can use Du’s
theorem to establish the next result (which generalizes Lemma 6.1.4 on page 188).

Theorem 7.1.4. If $A$ is irreducible, then the following statements are equivalent.

(i) $\rho(A)^{1/\theta} < 1$.
(ii) $G$ is globally stable on $V$.

In the case $\rho(A)^{1/\theta} \geq 1$, the map $G$ has no fixed point in $V$.

The key to proving (i) implies (ii) is that $G$ is order-preserving and either convex
or concave, depending on the value of $\theta$. The remaining conditions in Du’s theorem
are established over order intervals using $\rho(A)^{1/\theta} < 1$. By applying an approximation
argument, global stability is extended from order intervals to all of $V$. Some of
these details are contained in the following exercises and a full proof can be found in
Stachurski et al. (2022).
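
The stability condition in Theorem 7.1.4 is easy to check numerically. The Julia sketch below uses hypothetical primitives (`A`, `h` and `θ` are our own choices, with eigenvalues of `A` equal to 0.7 and 0.2) and two helper functions, `spectral_radius` and `fixed_point`, that we introduce only for this illustration.

```julia
using LinearAlgebra

# Spectral radius via eigenvalues (fine for small dense matrices)
spectral_radius(A) = maximum(abs.(eigvals(A)))

A = [0.4 0.3; 0.2 0.5]     # hypothetical positive (hence irreducible) matrix
h = [1.0, 0.5]             # hypothetical h ≫ 0
θ = 2.0                    # hypothetical nonzero parameter

# G v = [h + (A v)^(1/θ)]^θ with powers applied pointwise, as in (7.2)
G(v) = (h .+ (A * v).^(1/θ)).^θ

# Successive approximation with a sup-norm stopping rule
function fixed_point(G, v_init; tol=1e-10, max_iter=10_000)
    v = v_init
    for _ in 1:max_iter
        v_new = G(v)
        err = maximum(abs.(v_new .- v))
        v = v_new
        err < tol && return v
    end
    error("iteration did not converge")
end

ρ = spectral_radius(A)
println("ρ(A)^(1/θ) = ", ρ^(1/θ))      # below one, so G should be globally stable
v_star = fixed_point(G, ones(2))
println("fixed point of G: ", v_star)
```

Restarting the iteration from a very different initial condition in $V$ should return the same vector, in line with global stability.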
Let
$$
F_x(t) = \left\{ h(x) + t^{1/\theta} \right\}^\theta \qquad (t > 0).
$$

EXERCISE 7.1.7. Prove that, for all 𝑥 ∈ X, the function 𝐹 𝑥 is increasing,

(i) convex whenever 𝜃 ∈ (0, 1], and
(ii) concave otherwise (i.e., for other nonzero 𝜃).

EXERCISE 7.1.8. Using Exercise 7.1.7, prove that 𝐺 is order-preserving on 𝑉 , convex
on 𝑉 whenever 𝜃 ∈ (0, 1], and concave otherwise.

EXERCISE 7.1.9. Kleinman et al. (2023) study a dynamic discrete choice model of
migration with savings and capital accumulation. They show that optimal consumption
for landlords in their model is $c_t = \sigma_t R_t k_t$, where $k_t$ is capital, $R_t$ is the
gross rate of return on capital and $\sigma_t$ is a state-dependent process obeying
$$
\sigma_t^{-1} = 1 + \beta^\psi
\left[ \mathbb{E}_t R_{t+1}^{(\psi - 1)/\psi} \sigma_{t+1}^{-1/\psi} \right]^\psi . \tag{7.3}
$$
Here $\beta$ is a discount factor and $\psi$ is a utility parameter. Assume $R_t = f(X_t)$ where
$\mathsf X$ is finite, $f \in \mathbb{R}^{\mathsf X}$, and $(X_t)$ is $P$-Markov for some
$P \in \mathcal{M}(\mathbb{R}^{\mathsf X})$. Let $A \in \mathcal{L}(\mathbb{R}^{\mathsf X})$ be defined by
$$
(Av)(x) = \beta \sum_{x'} f(x')^{(\psi - 1)/\psi} v(x') P(x, x') .
$$
Prove that there exists a unique solution to (7.3) of the form $\sigma_t = \sigma(X_t)$ for some
$\sigma \in \mathbb{R}^{\mathsf X}$ with $\sigma \gg 0$ if and only if $\rho(A)^\psi < 1$.

7.2 Recursive Preferences


In this section we compute lifetime values associated with given reward processes in
settings that involve nonlinear recursions. These nonlinear recursions are called re-
cursive preferences. We will show how some common specifications of recursive prefer-
ences can be translated into lifetime valuations via the fixed point methods introduced
in Chapter 2 and §7.1.

7.2.1 Motivation: Optimal Savings


We motivate recursive preference models by analyzing consumption decisions.

7.2.1.1 A Recursive View of a Standard Model

The time additive model of valuation in §3.2.2.3 can be studied from a purely recursive
point of view. As a starting point, we state that the value 𝑉𝑡 of current and future
consumption is defined at each point in time 𝑡 by the recursion

𝑉𝑡 = 𝑢 ( 𝐶𝑡 ) + 𝛽 E𝑡 𝑉𝑡+1 . (7.5)

The random variables 𝑉𝑡 and 𝑉𝑡+1 are the unknown objects in this expression. The
expectation E𝑡 conditions on 𝑋0 , . . . , 𝑋𝑡 and 𝐶𝑡 = 𝑐 ( 𝑋𝑡 ). The process ( 𝑋𝑡 )𝑡⩾0 is 𝑃 -Markov.
Since consumption is a function of ( 𝑋𝑡 )𝑡⩾0 and knowledge of the current state 𝑋𝑡
is sufficient to forecast future values (by the Markov property), it is natural to guess
that 𝑉𝑡 will depend on the Markov chain only through 𝑋𝑡 . Hence we guess that there is a
solution of (7.5) that takes the form 𝑉𝑡 = 𝑣 ( 𝑋𝑡 ) for some 𝑣 ∈ RX .
(Here 𝑣 is an ansatz, meaning “educated guess.” First we guess the form of a
solution and then we try to verify that the guess is correct. So long as we carry out
the second step, starting with a guess brings no loss of rigor.)
Under this conjecture, (7.5) can be rewritten as 𝑣 ( 𝑋𝑡 ) = 𝑢 ( 𝑐 ( 𝑋𝑡 )) + 𝛽 E𝑡 𝑣 ( 𝑋𝑡+1 ).
Conditioning on 𝑋𝑡 = 𝑥 and setting 𝑟 ≔ 𝑢 ◦ 𝑐, this becomes

𝑣 ( 𝑥 ) = 𝑟 ( 𝑥 ) + 𝛽 E𝑥 𝑣 ( 𝑋𝑡+1 ) = 𝑟 ( 𝑥 ) + 𝛽 ( 𝑃𝑣)( 𝑥 ) ( 𝑥 ∈ X) . (7.6)

In vector form, we get $v = r + \beta P v$. From the Neumann series lemma, the solution is
$v^* = (I - \beta P)^{-1} r$, which is identical to (3.21) on page 96.
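
For concreteness, here is a small Julia sketch of this calculation. The stochastic matrix `P`, the consumption vector `c`, the utility function `u` and the discount factor are hypothetical values chosen for illustration.

```julia
using LinearAlgebra

P = [0.8 0.2; 0.4 0.6]     # hypothetical stochastic matrix
c = [1.0, 2.0]             # hypothetical consumption in each state
u(x) = log(x)              # hypothetical utility function
β = 0.95

r = u.(c)                  # r = u ∘ c

# Solve v = r + β P v as a linear system rather than forming the inverse
v_star = (I - β * P) \ r

# Check the recursion (7.6) state by state
residual = maximum(abs.(v_star .- (r .+ β .* (P * v_star))))
println("v* = ", v_star, ", residual = ", residual)
```

Solving the linear system with `\` avoids computing $(I - \beta P)^{-1}$ explicitly, which is the usual practice when the state space is large.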

EXERCISE 7.2.1. Verify our guess: Show (𝑉𝑡∗ ) obeys (7.5) when 𝑉𝑡∗ ≔ 𝑣∗ ( 𝑋𝑡 ).

In summary, (7.5) and the sequential representation (3.20) specify the same life-
time value for consumption paths.
While the recursive formulation in (7.5) now seems redundant, since it produces
the same specification that we obtained from the sequential approach, the recursive
set up gives us a formula to build on, and hence a pathway to overcoming limitations
of the time additive approach. Most of the rest of this chapter will be focused on this
agenda.
Pursuing this agenda will produce preferences over consumption paths where the
sequential approach has no natural counterpart. This occurs when current value 𝑉𝑡 is
nonlinear in current rewards and continuation values (unlike the linear specification
(7.5)). Such specifications are called recursive preferences. When dealing with
recursive preference models, the lack of a sequential counterpart means that we are
forced to proceed recursively.

Remark 7.2.1. The term “recursive preferences” is confusing, since, as we have just
agreed, time additive preferences also admit the recursive specification (7.5). Nonethe-
less, when economists say “recursive preferences,” they almost always refer to settings
where lifetime utility can only be expressed recursively. We follow this convention.

7.2.1.2 Limitations of Time Additive Preferences

In the previous section we discussed how the time additive preference specification
$$
v(x) = \mathbb{E}_x \sum_{t \geq 0} \beta^t u(C_t), \tag{7.7}
$$

also called the discounted expected utility model, can be framed recursively, and
how this provides a pathway to go beyond the time additive specification. We are
motivated to do so because the time additive specification has been rejected by exper-
imental and observational data in many settings.
In this section we highlight some of the limitations of time additive preferences.
While our discussion is only brief, more background and a list of references can be
found in §7.4.
One issue with (7.7) is the assumption of a constant positive discount rate, which
has been refuted by a long list of empirical studies. This issue was discussed in §6.4.
Another limitation of time additive preferences is that agents are risk-neutral in
future utility (see, e.g., (7.5), where current value depends linearly on future value).
Although risk aversion over consumption can be built in through curvature of 𝑢, this
same curvature also determines the elasticity of intertemporal substitution, meaning
that the two aspects of preferences cannot be separated. We elaborate on this point
in §7.3.1.4.
A third issue with time additivity is that agents with such preferences are indiffer-
ent to any variation in the joint distribution of rewards that leaves marginal distribu-
tions unchanged. To get a sense of what this means, suppose you accept a new job
and will be employed by this firm for the rest of your life. Your daily consumption will
be entirely determined by your daily wage. Your boss offers you two options:

(A) Your boss will flip a coin at the start of your first day on the job. If the coin is
heads, you will receive $10,000 a day for the rest of your life. If the coin is tails,
you will receive $1 per day for the rest of your life.
(B) Your boss will flip a coin at the start of every day. If the coin is heads, you will
receive $10,000. If the coin is tails, you will receive $1.

If you have a strict preference between options A and B, then your choice cannot
be rationalized with time additive preferences.
To see why, let $\varphi$ be a probability distribution that represents the lottery described
above, putting mass 0.5 on 10,000 and mass 0.5 on 1. Under option A, consumption
$(C_t)_{t \geq 1}$ is given by $C_t = C_1$ for all $t$, where $C_1 \sim \varphi$. Under option B,
consumption $(C_t)_{t \geq 1}$ is an IID sequence drawn from $\varphi$. Either way, lifetime utility is
$$
\mathbb{E} \sum_{t \geq 1} \beta^t u(C_t) = \sum_{t \geq 1} \beta^t \, \mathbb{E} u(C_t)
= \frac{\beta \bar u}{1 - \beta},
$$
where $\bar u := \mathbb{E} u(C_1) = u(1)/2 + u(10{,}000)/2$.
The critical part of this argument is the passing of expectations through the sum,
which uses time additivity. The implication is that lifetime utility depends only on the
marginal distribution of each $C_t$, rather than on the joint distribution of the stochastic
process $(C_t)_{t \geq 1}$.

7.2.2 Risk-Sensitive Preferences

Having motivated recursive preferences, let’s turn to our first example: risk-sensitive
preferences. For the consumption problem described in §7.2.1.1, imposing risk-sensitive
preferences means replacing the recursion $v = r + \beta P v$ for $v$ with
$$
v(x) = r(x) + \beta \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) P(x, x') \right\}
\qquad (x \in \mathsf X). \tag{7.8}
$$
As before, $r(x) = u(c(x))$ represents current utility when the current state is $x$. The
parameter $\theta$ is a nonzero constant in $\mathbb{R}$.
In (7.8), the transform $f(v) = \exp(\theta v)$ is applied to $v$ before expectation is taken.
After the expectation is computed, the transform is undone via $f^{-1}(v) = (1/\theta) \ln(v)$.
We show below that the agent can be either risk-averse or risk-loving with respect to
future outcomes, depending on the value of 𝜃.

7.2.2.1 Lifetime Utility

We understand the functional equation (7.8) as “defining” lifetime utility under risk-
sensitive preferences. A function 𝑣 solving (7.8) gives a lifetime valuation 𝑣 ( 𝑥 ) to each
𝑥 ∈ X, with the interpretation that 𝑣 ( 𝑥 ) is lifetime utility conditional on initial state
𝑥 . This definition of lifetime value is by analogy to the time additive case studied in
§7.2.1.1, where the function 𝑣 solving 𝑣 = 𝑟 + 𝛽𝑃𝑣 measures lifetime utility from each
initial state.
In the previous paragraph we wrote “defining” in scare quotes because we can’t
be sure we have a definition at this point. Just because we write down a recursive
expression for lifetime utility doesn’t mean that corresponding lifetime utility is actually
well defined. (For example, we can happily write down the recursive vector equation
𝑣 = 𝑣 + 1 but no vector 𝑣 solving this equation exists.) One aim of this chapter is to
provide conditions under which recursions like (7.8) have solutions.
Another issue is uniqueness. Suppose that (7.8) has many solutions. In this case
the predictions of the utility model are ambiguous. Our perspective is that the re-
cursive preference specification (7.8) is not correctly formulated unless existence and
uniqueness hold. We return to this point in §7.2.2.3.
One final comment: even if we can find a $v$ that solves (7.8), the nonlinearities
introduced by risk sensitivity imply that there will be no neat sequential representation
analogous to $v(x) = \mathbb{E}_x \sum_t \beta^t u(C_t)$ from the time additive case. (This connects to
Remark 7.2.1, where we discuss recursive preference terminology.)

7.2.2.2 Risk-Adjusted Expectation

We want to understand the “expectation-like” expression on the right hand side of
(7.8) that replaces the ordinary conditional expectation $\sum_{x'} v(x') P(x, x')$ from the time
additive case. To this end, we define, for arbitrary random variable $\xi$ and nonzero $\theta \in \mathbb{R}$,
$$
\mathcal{E}_\theta[\xi] = \frac{1}{\theta} \ln \left\{ \mathbb{E}[\exp(\theta \xi)] \right\} .
$$

The value E𝜃 [ 𝜉] is called the entropic risk-adjusted expectation of 𝜉 given 𝜃.

EXERCISE 7.2.2. Prove that, for any random variable 𝜉, any nonzero 𝜃 and any
constant 𝑐, we have E𝜃 [ 𝜉 + 𝑐] = E𝜃 [ 𝜉] + 𝑐.

The key idea behind the entropic risk-adjusted expectation is that decreasing 𝜃
lowers appetite for risk and increasing 𝜃 does the opposite.

EXERCISE 7.2.3. Prove that, if $\xi$ is normally distributed, then
$$
\mathcal{E}_\theta[\xi] = \mathbb{E}[\xi] + \theta \frac{\operatorname{Var}[\xi]}{2} . \tag{7.9}
$$
[Hint: Look up the moment generating function of a normal distribution.]

Expression (7.9) above shows that, for the Gaussian case, E𝜃 [ 𝜉] equals the mean
plus a term that penalizes variance when 𝜃 < 0 and rewards it when 𝜃 > 0.

More generally, we have the following result.


Lemma 7.2.1. For any random variable 𝜉 taking values in X, we have
(i) E𝜃 [ 𝜉] ⩽ E [ 𝜉] for all 𝜃 < 0.
(ii) E𝜃 [ 𝜉] ⩾ E [ 𝜉] for all 𝜃 > 0.
Moreover, both of these inequalities are strict if and only if Var[ 𝜉] > 0.

Proof. Fix nonzero $\theta \in \mathbb{R}$ and let $f \colon \mathbb{R} \to (0, \infty)$ be defined by
$f(x) = \exp(\theta x)$. Note that $f'(x) = \theta \exp(\theta x)$ and $f''(x) = \theta^2 \exp(\theta x)$.
Thus $f$ is convex and either increasing or decreasing depending on whether $\theta$ is positive or
negative. Also, $\mathcal{E}_\theta[\xi] = f^{-1}(\mathbb{E} f(\xi))$.
By Jensen’s inequality,
$$
\mathbb{E}[f(\xi)] \geq f(\mathbb{E}[\xi]) .
$$
If $\theta > 0$, then $f^{-1}$ is increasing, so applying $f^{-1}$ to both sides gives
$\mathcal{E}_\theta[\xi] \geq \mathbb{E}[\xi]$. If $\theta < 0$, then $f^{-1}$ is decreasing, so applying
$f^{-1}$ to both sides gives $\mathcal{E}_\theta[\xi] \leq \mathbb{E}[\xi]$. This
proves the two weak inequalities in Lemma 7.2.1. To obtain strict inequalities we can
apply the same argument using a strict version of Jensen’s inequality (see, e.g., Liao
and Berg (2018)), which is valid when $\operatorname{Var}[\xi] > 0$. □
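
Lemma 7.2.1 can be checked numerically for a random variable with finite support. In the Julia sketch below, the function name `entropic`, the outcomes `ξ` and the probabilities `φ` are hypothetical choices made for illustration.

```julia
# Entropic risk-adjusted expectation (1/θ) ln E[exp(θ ξ)] for a random
# variable taking values ξ[i] with probabilities φ[i]; θ must be nonzero
entropic(ξ, φ, θ) = log(sum(exp.(θ .* ξ) .* φ)) / θ

ξ = [0.0, 1.0, 4.0]        # hypothetical outcomes
φ = [0.5, 0.3, 0.2]        # hypothetical probabilities
m = sum(ξ .* φ)            # ordinary expectation

println("θ = -1: ", entropic(ξ, φ, -1.0), " (below the mean ", m, ")")
println("θ = +1: ", entropic(ξ, φ, 1.0), " (above the mean ", m, ")")

# A degenerate random variable has zero variance, so no adjustment occurs
println("degenerate case: ", entropic([2.0, 2.0], [0.5, 0.5], 1.5))
```

For `θ` close to zero the adjustment becomes small and the value approaches the ordinary expectation `m`.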

7.2.2.3 Existence and Uniqueness

Let’s return to investigating lifetime utility under risk-sensitive preferences. To this
end, we introduce the risk-sensitive Koopmans operator $K_\theta$ on $\mathbb{R}^{\mathsf X}$ via
$$
(K_\theta v)(x) = r(x) + \beta \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) P(x, x') \right\}
\qquad (x \in \mathsf X). \tag{7.10}
$$

Evidently, for given nonzero $\theta$, a function $v \in \mathbb{R}^{\mathsf X}$ solves the risk-sensitive
preference lifetime utility specification (7.8) if and only if $v$ is a fixed point of $K_\theta$.
This explains the significance of the following result:

Proposition 7.2.2. If $\beta < 1$, then $K_\theta$ is globally stable on $\mathbb{R}^{\mathsf X}$.

We postpone a proof of Proposition 7.2.2 because we will prove a more general
result in §7.3.2.2. For now we note the following implications.
(i) For each nonzero 𝜃, lifetime utility is both well-defined and uniquely defined
for risk-sensitive preferences (i.e., (7.8) has a unique solution).
(ii) The unique solution, denoted henceforth by 𝑣∗ , can be computed by successive
approximation using 𝐾𝜃 .
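
A minimal Julia sketch of this computation on a two-state chain follows. The primitives `P`, `r`, `β` and `θ` and the helper `fixed_point` are hypothetical and serve only to illustrate the iteration; the comparison with the risk-neutral value is included because Lemma 7.2.1 suggests that lifetime utility with $\theta < 0$ should not exceed it.

```julia
using LinearAlgebra

P = [0.9 0.1; 0.2 0.8]     # hypothetical stochastic matrix
r = [1.0, 2.0]             # hypothetical current utility r = u ∘ c
β, θ = 0.95, -1.0          # θ < 0: risk-averse case

# Risk-sensitive Koopmans operator (7.10) in vector form
K(v) = r .+ (β / θ) .* log.(P * exp.(θ .* v))

# Successive approximation with a sup-norm stopping rule
function fixed_point(K, v_init; tol=1e-10, max_iter=10_000)
    v = v_init
    for _ in 1:max_iter
        v_new = K(v)
        err = maximum(abs.(v_new .- v))
        v = v_new
        err < tol && return v
    end
    error("iteration did not converge")
end

v_rs = fixed_point(K, zeros(2))     # risk-sensitive lifetime utility
v_rn = (I - β * P) \ r              # time additive (risk-neutral) benchmark
println("risk-sensitive: ", v_rs)
println("risk-neutral:   ", v_rn)
```

For large state spaces or large values of $|\theta v|$, a numerically safer implementation would subtract the maximum of `θ .* v` before exponentiating; the plain formula is kept here for readability.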

7.2.2.4 The Gaussian Case

As a tractable case, let’s suppose that 𝑟 ( 𝑥 ) = 𝑥 and that 𝑋𝑡+1 = 𝜌𝑋𝑡 + 𝜎𝑊𝑡+1 where (𝑊𝑡 )𝑡⩾1
is IID and standard normal. Here | 𝜌 | < 1 and 𝜎 ⩾ 0 controls volatility of the state.
Rather than discretizing the state process, we leave it as continuous and proceed by
hand.
In this setting, the functional equation (7.8) for 𝑣 becomes

𝑣 ( 𝑥 ) = 𝑥 + 𝛽 E𝜃 [ 𝑣 ( 𝜌𝑥 + 𝜎𝑊 )] (7.11)

for each 𝑥 ∈ X, where 𝑊 is standard normal.


Since 𝜌𝑥 + 𝜎𝑊 is Gaussian, the expression (7.9) for the risk-adjusted expectation
of a normal random variable leads us to conjecture that the solution 𝑣 will be affine,
i.e., 𝑣 ( 𝑥 ) = 𝑎𝑥 + 𝑏 for some 𝑎, 𝑏 ∈ R. This conjecture turns out to be correct:

EXERCISE 7.2.4. Verify that $v(x) = ax + b$ solves (7.11) when
$$
a := \frac{1}{1 - \rho \beta}
\quad \text{and} \quad
b := \theta \, \frac{\beta}{1 - \beta} \, \frac{(a \sigma)^2}{2} .
$$

We can see that, under the stated assumptions, lifetime value 𝑣 is increasing in
the state variable 𝑥 . However, impacts of the parameters generally depend on 𝜃. For
example, if 𝜃 > 0, then increasing 𝜎 shifts up lifetime utility. If 𝜃 < 0, then lifetime value
decreases with 𝜎. This is as we expect: lifetime utility is affected positively or negatively
by volatility, depending on whether the agent is risk loving or risk averse.
Figure 7.3 shows the true solution 𝑣 ( 𝑥 ) = 𝑎𝑥 + 𝑏 to the risk-sensitive lifetime util-
ity model, as well as an approximate fixed point from a discrete approximation. The
discrete approximation is computed by applying successive approximation to 𝐾𝜃 after
discretizing the state process via Tauchen’s method. The parameters and discretiza-
tion are shown in Listing 23.

EXERCISE 7.2.5. Replicate Figure 7.3.

[Figure 7.3: the approximate fixed point and the true solution v(x) = ax + b, plotted against x]

Figure 7.3: Approximate and true solutions in the Gaussian case

    using LinearAlgebra, QuantEcon

    function create_rs_utility_model(;
            n=180,      # size of state space
            β=0.95,     # time discount factor
            ρ=0.96,     # correlation coef in AR(1)
            σ=0.1,      # volatility
            θ=-1.0)     # risk aversion
        mc = tauchen(n, ρ, σ, 0, 10)    # n_std = 10
        x_vals, P = mc.state_values, mc.p
        r = x_vals      # special case u(c(x)) = x
        return (; β, θ, ρ, σ, r, x_vals, P)
    end

Listing 23: Risk sensitive utility model parameters (rs_utility.jl)

EXERCISE 7.2.6. Dropping the Gaussian assumption, suppose now that consumption
is IID with $C_t = c(X_t)$ where $(X_t)_{t \geq 0}$ is IID with distribution $\varphi$ on finite set
$\mathsf X$. Now the operator $K_\theta$ becomes
$$
(K_\theta v)(x) = r(x) + \beta \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) \varphi(x') \right\}
\qquad (x \in \mathsf X).
$$
Although iterating on $K_\theta$ is convergent, there is a more efficient method that reduces
to solving a one-dimensional equation. Propose such a method and confirm that it is
convergent. [Hint: Consider reviewing §4.2.2.2.]

7.2.3 Epstein–Zin Preferences

One of the most popular specifications of recursive preferences in quantitative research
is Epstein–Zin utility.¹ This class of preferences has been used to study asset pricing,
business cycles, monetary policy, fiscal policy, optimal taxation, climate policy, pension
plans, and other topics. In this section we introduce the Epstein–Zin specification and
discuss how to solve it. We will see that the specification, while highly nonlinear, is
nonetheless well behaved.

7.2.3.1 Specification

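With Epstein–Zin preferences, the relationship $V_t = u(C_t) + \beta \mathbb{E}_t V_{t+1}$ is replaced by
$$
V_t = \left\{ (1 - \beta) C_t^\alpha + \beta \left[ \mathbb{E}_t V_{t+1}^\gamma \right]^{\alpha/\gamma} \right\}^{1/\alpha}, \tag{7.12}
$$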

where 𝛾 , 𝛼 are nonzero parameters and 𝛽 ∈ (0, 1). As for risk-sensitive preferences,
lack of time additivity implies that there is no neat sequential representation for life-
time value. As a result, we must work directly with the recursive expression (7.12).
Assume as before that $C_t = c(X_t)$, where $c \in \mathbb{R}^{\mathsf X}_+$ and $(X_t)_{t \geq 0}$ is
$P$-Markov on finite set $\mathsf X$. We conjecture a solution of the form $V_t = v(X_t)$ for some
$v \in V := \mathbb{R}^{\mathsf X}_+$. Under
this conjecture, the Epstein–Zin Koopmans operator corresponding to (7.12) is
$$
(Kv)(x) = \left\{ (1 - \beta) c(x)^\alpha
+ \beta \left[ \sum_{x'} v(x')^\gamma P(x, x') \right]^{\alpha/\gamma} \right\}^{1/\alpha} . \tag{7.13}
$$
As will be discussed further in §7.3.1.1, the parameter 𝛾 governs risk aversion with
respect to temporal gambles (where outcomes are resolved in the next period), while
𝛽 controls impatience and 𝛼 parametrizes the intertemporal elasticity of substitution.
The fact that all three parameters have distinct effects helps fit data. For example, see
Tallarini Jr (2000) and Barillas et al. (2009).
¹ Epstein–Zin preferences were popularized in Epstein and Zin (1989). They are a special case of
preferences defined by Kreps and Porteus (1978). Further discussion can be found in §7.4.

An important question is whether Epstein–Zin preferences are well defined. In
particular, what conditions do we need on primitives such that the Koopmans operator
𝐾 in (7.13) has a unique fixed point?

7.2.3.2 Properties of the Koopmans Operator

To address this question we rewrite (7.13) in vector form as
$$
Kv = \left\{ h + \beta \left[ P v^\gamma \right]^{\alpha/\gamma} \right\}^{1/\alpha} \tag{7.14}
$$
where $h \in \mathbb{R}^{\mathsf X}$. This is equivalent to (7.13) when $h = (1 - \beta) c^\alpha$. To avoid
fractional powers of negative numbers, we assume throughout that $h \geq 0$.

EXERCISE 7.2.7. Prove that, under this assumption, 𝐾 is a self-map on 𝑉 ≔ (0, ∞) X .

The set 𝑉 is called the interior of the positive cone of RX .


The operator 𝐾 is difficult to work with for two reasons. First, linear and nonlinear
transformations are intertwined. Second, there are several cases for the parameters
that we need to handle in order to understand stability. Nonetheless, by applying a
smooth transformation, we will find it easy to show that the Epstein–Zin Koopmans
operator 𝐾 is globally stable under mild conditions. In particular,

Proposition 7.2.3. If 𝑃 is irreducible and ℎ ≫ 0, then 𝐾 is globally stable on 𝑉.

A proof of Proposition 7.2.3 is provided in §7.2.3.3.


Proposition 7.2.3 implies that Epstein–Zin utility is well-defined under the stated
conditions and, moreover, that the solution can be computed via successive approxi-
mation on 𝐾 . Listing 24 provides code for performing this operation. Figure 7.4 shows
convergence of the sequence of iterates to the fixed point 𝑣∗ , under the parameters in
Listing 24, given an initial condition 𝑣0 . The figure plots every 10th iterate, repeated
100 times.

7.2.3.3 Proof of the Stability Result

We prove Proposition 7.2.3 by

(i) introducing an operator 𝐾ˆ obtained from 𝐾 via a smooth transformation,


(ii) proving that (𝑉, 𝐾ˆ) and (𝑉, 𝐾) are topologically conjugate, and

include("s_approx.jl")
using LinearAlgebra, QuantEcon

# Build the model primitives on a Tauchen discretization of an AR(1) state.
function create_ez_utility_model(;
        n=200,      # size of state space
        ρ=0.96,     # correlation coef in AR(1)
        σ=0.1,      # volatility
        β=0.99,     # time discount factor
        α=0.75,     # EIS parameter
        γ=-2.0)     # risk aversion parameter

    mc = tauchen(n, ρ, σ, 0, 5)
    x_vals, P = mc.state_values, mc.p
    c = exp.(x_vals)

    return (; β, ρ, σ, α, γ, c, x_vals, P)
end

# The Koopmans operator (7.13) applied to the vector v.
function K(v, model)
    (; β, ρ, σ, α, γ, c, x_vals, P) = model

    R = (P * (v.^γ)).^(1/γ)     # Kreps–Porteus certainty equivalent of v
    return ((1 - β) * c.^α + β * R.^α).^(1/α)
end

# Approximate the fixed point of K by successive approximation.
function compute_ez_utility(model)
    v_init = ones(length(model.x_vals))
    v_star = successive_approx(v -> K(v, model),
                               v_init,
                               tolerance=1e-10)
    return v_star
end

Listing 24: Epstein–Zin utility model and Koopmans operator (ez_utility.jl)
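Listing 24 depends on a helper loaded from s_approx.jl and on a discretization routine from QuantEcon. For readers who want to experiment without those dependencies, the following is a minimal self-contained sketch. The function name successive_approx_sketch, the three-state matrix P, and all parameter values are illustrative assumptions rather than the book's code or calibration.

```julia
# Hypothetical stand-in for the helper loaded from s_approx.jl (an assumption,
# not the book's code): iterate T until successive iterates are close.
function successive_approx_sketch(T, v; tolerance=1e-10, max_iter=100_000)
    for k in 1:max_iter
        v_new = T(v)
        err = maximum(abs.(v_new .- v))
        v = v_new
        if err < tolerance
            return v, k
        end
    end
    error("successive approximation did not converge")
end

# Illustrative primitives (assumed values, not the calibration in Listing 24)
β, α, γ = 0.9, 0.75, -2.0
P = [0.9 0.1 0.0;
     0.1 0.8 0.1;
     0.0 0.1 0.9]              # irreducible stochastic matrix
c = [0.5, 1.0, 1.5]
h = (1 - β) .* c .^ α          # strictly positive in every state

# Koopmans operator in the vector form (7.14)
K(v) = (h .+ β .* (P * v .^ γ) .^ (α / γ)) .^ (1 / α)

v_star, iters = successive_approx_sketch(K, ones(3))
println("stopped after $iters iterations: v* ≈ ", v_star)
```

Since the stopping rule only compares successive iterates, the returned vector is an approximation to the fixed point, with accuracy governed by the tolerance.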


[Figure 7.4 plot omitted. Horizontal axis: 𝑥, from −1.5 to 1.5. The iterates are plotted together with the initial condition 𝑣0 and the fixed point 𝑣∗.]

Figure 7.4: Convergence of Koopmans iterates for Epstein–Zin utility

(iii) obtaining conditions under which 𝐾ˆ is globally stable on 𝑉 .

Throughout this section, the assumptions of Proposition 7.2.3 are in force.


To begin we define 𝐾ˆ via

\[
\hat K v = \left\{ h + \beta (P v)^{1/\theta} \right\}^{\theta}
\quad \text{where} \quad \theta := \frac{\gamma}{\alpha}.
\tag{7.15}
\]

The operator 𝐾ˆ is simpler to work with than 𝐾 because it unifies 𝛼, 𝛾 into a single
parameter 𝜃 and decomposes the Epstein–Zin update rule into two parts: a linear
map 𝑃 and a separate nonlinear component.

EXERCISE 7.2.8. Prove that

(i) 𝐾ˆ is a self-map on 𝑉 and


(ii) 𝑣 ∈ 𝑉 is a fixed point of 𝐾 if and only if 𝑣^𝛾 is a fixed point of 𝐾ˆ.

Lemma 7.2.4. Let Φ be defined by Φ𝑣 = 𝑣^𝛾. The map Φ is a homeomorphism from 𝑉 to
itself and (𝑉, 𝐾) and (𝑉, 𝐾ˆ) are topologically conjugate under Φ.

Proof. Evidently Φ is a continuous bijection from 𝑉 to itself, with continuous inverse
Φ^{−1}𝑣 = 𝑣^{1/𝛾}. Hence Φ is a homeomorphism. In addition, for 𝑣 ∈ 𝑉,

\[
\hat K \Phi v
  = \left\{ h + \beta (P \Phi v)^{1/\theta} \right\}^{\theta}
  = \left\{ h + \beta (P v^{\gamma})^{\alpha/\gamma} \right\}^{\gamma/\alpha}
  = \Phi K v.
\]

This shows that (𝑉, 𝐾) and (𝑉, 𝐾ˆ) are topologically conjugate, as claimed. □

[Figure 7.5 plot omitted. Both axes run from 0 to 2. The plot shows the curve 𝐾ˆ = 𝐹 together with the 45-degree line.]

Figure 7.5: Shape properties of 𝐾ˆ in one dimension

Proof of Proposition 7.2.3. Set 𝐴 = 𝛽^𝜃 𝑃. With this notation we have 𝐾ˆ𝑣 = [ℎ + (𝐴𝑣)^{1/𝜃}]^𝜃.
In view of Theorem 7.1.4 on page 217, this operator is globally stable on 𝑉 whenever
𝜌(𝐴)^{1/𝜃} < 1. In our case 𝜌(𝐴) = 𝜌(𝛽^𝜃 𝑃) = 𝛽^𝜃 𝜌(𝑃) = 𝛽^𝜃, since 𝜌(𝑃) = 1 for any
stochastic matrix 𝑃, so 𝜌(𝐴)^{1/𝜃} = 𝛽. It follows that 𝐾ˆ
is globally stable on 𝑉 whenever 𝛽 < 1. Since (𝑉, 𝐾) and (𝑉, 𝐾ˆ) are topologically
conjugate, the proof of Proposition 7.2.3 is complete. □
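The two identities used in this argument, namely 𝐾ˆΦ = Φ𝐾 and 𝜌(𝐴)^{1/𝜃} = 𝛽, can be checked numerically. The sketch below does so at a single test vector; the matrix P, the vector h, and the parameter values are assumptions chosen for illustration, and a check at one point is of course not a proof.

```julia
using LinearAlgebra

# Assumed illustrative primitives
β, α, γ = 0.9, 0.75, -2.0
θ = γ / α
P = [0.9 0.1 0.0;
     0.1 0.8 0.1;
     0.0 0.1 0.9]
h = [0.05, 0.10, 0.15]

K(v)     = (h .+ β .* (P * v .^ γ) .^ (α / γ)) .^ (1 / α)   # as in (7.14)
K_hat(v) = (h .+ β .* (P * v) .^ (1 / θ)) .^ θ              # as in (7.15)
Φ(v)     = v .^ γ

v = [0.7, 1.0, 1.3]                                 # any strictly positive vector
conj_gap = maximum(abs.(K_hat(Φ(v)) .- Φ(K(v))))    # should vanish up to rounding

A = β^θ .* P
ρ_A = maximum(abs.(eigvals(A)))                     # spectral radius of A
println("conjugacy gap = ", conj_gap, ",  ρ(A)^(1/θ) = ", ρ_A^(1 / θ))
```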

7.2.3.4 Why Not Use Contractivity?

While we can consider studying stability of 𝐾ˆ using contraction arguments, this
approach fails under useful parameterizations. To illustrate, suppose that X = {𝑥1}.
Then ℎ is a constant, 𝑃 is the identity, 𝑣 is a scalar and 𝐾ˆ𝑣 = 𝐹(𝑣) with
𝐹(𝑣) = (ℎ + 𝛽𝑣^{1/𝜃})^𝜃, as shown in Figure 7.5. Here 𝜃 = 5, ℎ = 0.5 and 𝛽 = 0.5. We see
that 𝐾ˆ has infinite slope at zero, so the contraction property fails.²
2 We could try to truncate the interval to a neighborhood of the fixed point and hope that 𝐾ˆ is a
contraction when restricted to this interval. But in higher dimensions we are not sure that a fixed point
exists for a broad range of parameters, which makes this idea hard to implement.

EXERCISE 7.2.9. Prove that, given the parameter values used for Figure 7.5, the
function 𝐹 satisfies 𝐹′(𝑡) → ∞ as 𝑡 ↓ 0.
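The behavior of the slope near zero can be seen numerically before attempting a proof. The following sketch evaluates 𝐹′ on a grid of points approaching zero, using the parameter values stated for Figure 7.5. The derivative formula in the comment comes from the chain rule; the function names and the grid are assumptions made for illustration.

```julia
θ, h, β = 5.0, 0.5, 0.5

F(v) = (h + β * v^(1 / θ))^θ
# Chain rule: F′(v) = β (h + β v^{1/θ})^{θ−1} v^{1/θ − 1}
F_prime(v) = β * (h + β * v^(1 / θ))^(θ - 1) * v^(1 / θ - 1)

ts = [1e-2, 1e-4, 1e-6, 1e-8]      # points approaching zero from above
slopes = F_prime.(ts)
for (t, s) in zip(ts, slopes)
    println("t = ", t, "   F′(t) ≈ ", s)
end
```

Any slope above one already rules out a contraction on an interval containing that point, since a contraction of modulus 𝜆 < 1 has difference quotients bounded by 𝜆.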

7.3 General Representations

We have discussed two well-known examples of recursive preferences. In this section


we build a general representation. While various constructions can be found in the
decision theory literature, many are not well suited to quantitative work. Here we
give a relatively parsimonious operator-theoretic definition.

7.3.1 Koopmans Operators

In §7.2.2.3 and §7.2.3.1 we met risk-sensitive and Epstein–Zin Koopmans operators


respectively. In this section we provide a general definition of a Koopmans operator
that will contain these two examples as special cases.
We begin by outlining structure that can be combined to generate Koopmans oper-
ators in a Markov environment. The two key components are an aggregation function
and a certainty equivalent operator. We then build Koopmans operators from these
primitives and connect them to applications. In every setting we consider, lifetime
value is identified with the unique fixed point of the Koopmans operator (whenever
it exists).

7.3.1.1 Certainty Equivalents

The first primitive we consider is a generalization of conditional expectations: Given


𝑉 ⊂ RX , we define a certainty equivalent operator on 𝑉 to be a self-map 𝑅 on 𝑉 such
that

(i) 𝑅 is order-preserving on 𝑉 and


(ii) all constants are fixed under 𝑅 (i.e., 𝑅 ( 𝜆 1) = 𝜆 1 for all 𝜆 ∈ R with 𝜆 1 ∈ 𝑉 ).

Example 7.3.1. The usual conditional expectations operator is a certainty equivalent


operator. To see this, set 𝑉 = RX and fix 𝑃 ∈ M ( RX ). Since 𝑓 , 𝑔 ∈ RX with 𝑓 ⩽ 𝑔
implies 𝑃 𝑓 ⩽ 𝑃𝑔 and 𝑃 ( 𝜆 1) = 𝜆 𝑃 1 = 𝜆 1, we see that 𝑃 satisfies (i)–(ii) above.

EXERCISE 7.3.1. In the last example, the certainty equivalent 𝑅 = 𝑃 is linear. Prove
that this is the only linear case. In particular, prove the following: if R(X) is the set
of all certainty equivalent operators on RX , then R(X) ∩ L ( RX ) = M ( RX ).

The next example is nonlinear. It treats the risk-adjusted expectation that appears
in risk-sensitive preferences.

Example 7.3.2. Let 𝑉 = R^X and fix nonzero 𝜃 and 𝑃 ∈ M(R^X). The entropic certainty
equivalent operator is the operator 𝑅𝜃 on 𝑉 defined by

\[
(R_\theta v)(x) = \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) P(x, x') \right\}
\qquad (v \in V, \; x \in \mathsf{X}).
\]

EXERCISE 7.3.2. Show that 𝑅𝜃 is in fact a certainty equivalent operator.

Example 7.3.3. As a third example, let 𝑉 be the interior of the positive cone, as in
§7.2.3.2, and fix 𝑃 ∈ M(R^X). The operator

\[
(R_\gamma v)(x) = \left\{ \sum_{x'} v(x')^{\gamma} P(x, x') \right\}^{1/\gamma}
\qquad (v \in V, \; x \in \mathsf{X}, \; \gamma \neq 0)
\tag{7.16}
\]

is a certainty equivalent operator on 𝑉. The map 𝑅𝛾 is sometimes called the
Kreps–Porteus certainty equivalent operator in honor of Kreps and Porteus (1978). We
met 𝑅𝛾 in §7.2.3, when we discussed Epstein–Zin preferences.

EXERCISE 7.3.3. Confirm that 𝑅 𝛾 is a certainty equivalent operator.
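The two defining properties of a certainty equivalent operator are easy to probe numerically for the three examples so far. The sketch below checks that constants are fixed and that order is preserved at one particular pair 𝑣 ⩽ 𝑤. The matrix P, the test vectors, and the parameter values 𝜃 = −1 and 𝛾 = −2 are assumptions, and passing these spot checks does not replace the proofs requested in the exercises.

```julia
P = [0.6 0.4;
     0.3 0.7]                                   # assumed Markov matrix

R_exp(v)         = P * v                        # Example 7.3.1
R_entropic(v, θ) = log.(P * exp.(θ .* v)) ./ θ  # Example 7.3.2, θ ≠ 0
R_kp(v, γ)       = (P * v .^ γ) .^ (1 / γ)      # (7.16), needs v ≫ 0 and γ ≠ 0

λ = 2.5
constant = fill(λ, 2)
v, w = [1.0, 2.0], [1.5, 2.0]                   # v ⩽ w pointwise

operators = [R_exp, u -> R_entropic(u, -1.0), u -> R_kp(u, -2.0)]

# Property (ii): constants are fixed, up to floating-point error
fixes_constants = [maximum(abs.(R(constant) .- λ)) < 1e-12 for R in operators]
# Property (i): order is preserved at this particular pair
preserves_order = [all(R(v) .<= R(w)) for R in operators]

println(fixes_constants, "  ", preserves_order)
```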

EXERCISE 7.3.4. Let 𝑉 = R^X and fix 𝑃 ∈ M(R^X) and 𝜏 ∈ [0, 1]. Let 𝑅𝜏 be the
quantile certainty equivalent. That is, (𝑅𝜏𝑣)(𝑥) = 𝑄𝜏 𝑣(𝑋) where 𝑋 ∼ 𝑃(𝑥, ·) and 𝑄𝜏
is the quantile functional (see page 32). More specifically,

\[
(R_\tau v)(x) = \min \left\{ y \in \mathbb{R} :
    \sum_{x'} \mathbb{1}\{ v(x') \leq y \} P(x, x') \geq \tau \right\}
\qquad (v \in V, \; x \in \mathsf{X}).
\]

Confirm that 𝑅𝜏 defines a certainty equivalent operator on 𝑉 .
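On a finite state space the minimum in this definition is attained at one of the values of 𝑣, so the operator can be computed by accumulating probability mass in increasing order of 𝑣. The following is a sketch under that observation. The function names are hypothetical, the example matrix is an assumption, and 𝜏 is taken in (0, 1] so that the minimum is well defined.

```julia
# Smallest y with Σ_{x'} 1{v(x') ⩽ y} p(x') ⩾ τ, where p = P(x, ·) is one row
function quantile_ce_row(p, v, τ)
    idx = sortperm(v)              # visit outcomes in increasing order of v
    cum = 0.0
    for i in idx
        cum += p[i]
        if cum >= τ
            return v[i]
        end
    end
    return v[idx[end]]             # guard against rounding when τ is close to 1
end

R_quantile(P, v, τ) = [quantile_ce_row(P[x, :], v, τ) for x in axes(P, 1)]

# Dyadic probabilities keep the cumulative sums exact in floating point
P = [0.50 0.25 0.25;
     0.25 0.25 0.50]
v = [3.0, 1.0, 2.0]

R_quantile(P, v, 0.75)             # returns [3.0, 2.0]
```

In the first row the mass on {𝑣 ⩽ 2} is only 0.5, so the 0.75 quantile is 3; in the second row that mass is exactly 0.75, so the quantile is 2.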

EXERCISE 7.3.5. Let 𝑅 be a certainty equivalent operator on 𝑉 ⊂ R^X_+, where 𝜆1 ∈ 𝑉
for all 𝜆 ⩾ 0. Prove that 𝑅0 = 0 and 𝑅𝑣 ⩾ 0 whenever 𝑣 ⩾ 0.

The set of certainty equivalent operators on R^X is invariant under convex
combinations, as the next exercise asks you to confirm.

EXERCISE 7.3.6. Let R(X) be the set of certainty equivalent operators on RX and
prove the following:

𝑅 𝑎 , 𝑅 𝑏 ∈ R(X) and 0 ⩽ 𝜆 ⩽ 1 =⇒ 𝜆𝑅 𝑎 + (1 − 𝜆 ) 𝑅𝑏 ∈ R(X) .

7.3.1.2 Properties

A certainty equivalent operator 𝑅 on 𝑉 is called

• positive homogeneous on 𝑉 if 𝑅 ( 𝜆𝑣) = 𝜆𝑅𝑣 for all 𝑣 ∈ 𝑉 and 𝜆 ⩾ 0 with 𝜆𝑣 ∈ 𝑉 ,


• superadditive on 𝑉 if 𝑅 ( 𝑣 + 𝑤) ⩾ 𝑅𝑣 + 𝑅𝑤 for all 𝑣, 𝑤 ∈ 𝑉 with 𝑣 + 𝑤 ∈ 𝑉 ,
• subadditive on 𝑉 if 𝑅 ( 𝑣 + 𝑤) ⩽ 𝑅𝑣 + 𝑅𝑤 for all 𝑣, 𝑤 ∈ 𝑉 with 𝑣 + 𝑤 ∈ 𝑉 ,
• constant-subadditive on 𝑉 if 𝑅 ( 𝑣 + 𝜆 1) ⩽ 𝑅𝑣 + 𝜆 1 for all 𝑣 ∈ 𝑉 and 𝜆 ⩾ 0 with
𝑣 + 𝜆1 ∈ 𝑉 .

Example 7.3.4. Given 𝑃 ∈ M ( RX ), the linear certainty equivalent 𝑅 = 𝑃 is positive


homogeneous and both superadditive and subadditive on RX .

Example 7.3.5. Let 𝑉 be the interior of the positive cone of RX and fix 𝑃 ∈ M ( RX ).
In this setting, the Kreps–Porteus certainty equivalent operator 𝑅 𝛾 in (7.16) is subad-
ditive on 𝑉 when 𝛾 ⩾ 1 and superadditive on 𝑉 when 𝛾 ⩽ 1 (and, as usual, 𝛾 ≠ 0). The
subadditive case follows directly from Minkowski’s inequality, while the superadditive
case follows from the mean inequalities in Bullen (2003) (p. 202).

EXERCISE 7.3.7. Prove that the quantile certainty equivalent operator 𝑅𝜏 from
Exercise 7.3.4 is constant-subadditive.

EXERCISE 7.3.8. Show that the entropic certainty equivalent operator 𝑅𝜃 from
Example 7.3.2 is constant-subadditive.

EXERCISE 7.3.9. Prove: If 𝑅 is constant-subadditive on 𝑉, then 𝑅 is nonexpansive
with respect to the supremum norm. That is,

\[
\| Rv - Rw \|_\infty \leq \| v - w \|_\infty \quad \text{for all } v, w \in \mathbb{R}^{\mathsf{X}}.
\]

In some instances, a certainty equivalent operator is either convex or concave in


the sense of §7.1.2.2.
Example 7.3.6. The entropic certainty equivalent operator 𝑅𝜃 in Example 7.3.2 is
concave on R^X whenever 𝜃 < 0. To prove this we use the result in Föllmer and Knispel
(2011) which states that, for 𝜃 < 0, 0 ⩽ 𝛼 ⩽ 1 and finite-valued random variables 𝑍, 𝑍′,
we have

\[
\mathcal{E}_\theta (\alpha Z + (1 - \alpha) Z') \geq \alpha \mathcal{E}_\theta (Z) + (1 - \alpha) \mathcal{E}_\theta (Z')
\tag{7.17}
\]

where E𝜃 is as defined in §7.2.2.2.

EXERCISE 7.3.10. Using (7.17), show that 𝑅𝜃 is concave on 𝑉 = RX when 𝜃 < 0.

EXERCISE 7.3.11. Let 𝑉 be convex and let 𝑅 be a certainty equivalent operator on


𝑉 . Prove the following:

(i) 𝑅 is convex on 𝑉 whenever 𝑅 is subadditive and positive homogeneous on 𝑉 .


(ii) 𝑅 is concave on 𝑉 whenever 𝑅 is superadditive and positive homogeneous on 𝑉 .

Combining Exercise 7.3.11 and Example 7.3.5, we have proved


Lemma 7.3.1. The Kreps–Porteus certainty equivalent operator 𝑅 𝛾 in (7.16) is convex
on 𝑉 when 𝛾 ⩾ 1 and concave on 𝑉 when 𝛾 ⩽ 1.
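A quick numerical illustration of Lemma 7.3.1 is to compare the operator evaluated at a convex combination with the same convex combination of its values. The sketch below does this at one pair of vectors for 𝛾 = 2 and 𝛾 = −2. The matrix P, the vectors, and the weight 𝜆 are assumptions, and a single pair only illustrates the inequality.

```julia
P = [0.5 0.5;
     0.2 0.8]                                   # assumed Markov matrix
R_kp(v, γ) = (P * v .^ γ) .^ (1 / γ)            # Kreps–Porteus operator (7.16)

v, w, λ = [1.0, 2.0], [2.0, 1.0], 0.5
mix = λ .* v .+ (1 - λ) .* w                    # convex combination of v and w

# γ = 2 ⩾ 1: convexity means R(mix) ⩽ λ Rv + (1 − λ) Rw, so this gap is ⩾ 0
convex_gap = λ .* R_kp(v, 2.0) .+ (1 - λ) .* R_kp(w, 2.0) .- R_kp(mix, 2.0)

# γ = −2 ⩽ 1: concavity means R(mix) ⩾ λ Rv + (1 − λ) Rw, so this gap is ⩾ 0
concave_gap = R_kp(mix, -2.0) .- (λ .* R_kp(v, -2.0) .+ (1 - λ) .* R_kp(w, -2.0))

println("convex gap:  ", convex_gap)
println("concave gap: ", concave_gap)
```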

Later we will combine Lemma 7.3.1 with the fixed point results for convex and
concave operators in §7.1.2.2 to establish existence and uniqueness of lifetime values
for certain kinds of Koopmans operators.

7.3.1.3 Monotonicity

Let X be partially ordered and let 𝑖RX be the set of increasing functions in RX . Let 𝑉 be
such that 𝑖RX ⊂ 𝑉 ⊂ RX and let 𝑅 be a certainty equivalent on 𝑉 . We call 𝑅 monotone
increasing if 𝑅 is invariant on 𝑖RX . This extends the terminology in §3.2.1.3, where
we applied it to Markov operators (cf., Exercise 3.2.4 on page 94).
As shown below, the concept of monotone increasing certainty equivalent opera-
tors is connected to outcomes where lifetime preferences are increasing in the state.

EXERCISE 7.3.12. Show that the entropic certainty equivalent operator in Exam-
ple 7.3.2 is monotone increasing on 𝑉 = RX whenever 𝑃 is monotone increasing, for
all nonzero values of 𝜃.

EXERCISE 7.3.13. Show that the Kreps–Porteus certainty equivalent operator in


Example 7.3.3 is monotone increasing on 𝑉 = (0, ∞) X whenever 𝑃 is monotone in-
creasing, for all nonzero values of 𝛾 .

7.3.1.4 Aggregation

We mentioned above that Koopmans operators are typically constructed by combining


a certainty equivalent operator and an aggregation function. Let’s now discuss the
second of these components.
Given 𝑉 ⊂ RX , an aggregator 𝐴 on 𝑉 is a map 𝐴 from X × R to R such that

(i) 𝑤 ( 𝑥 ) = 𝐴 ( 𝑥, 𝑣 ( 𝑥 )) is in 𝑉 whenever 𝑣 ∈ 𝑉 and


(ii) 𝑦 ↦→ 𝐴 ( 𝑥, 𝑦 ) is increasing for all 𝑥 ∈ X.

Intuitively, an aggregator combines current state and continuation values to measure


lifetime value.
Common types of aggregators include the

• Leontief aggregator 𝐴MIN(𝑥, 𝑦) = min{𝑟(𝑥), 𝛽𝑦} with 𝑟 ∈ R^X and 𝛽 ⩾ 0,

• Uzawa aggregator 𝐴UZAWA(𝑥, 𝑦) = 𝑟(𝑥) + 𝑏(𝑥)𝑦 with 𝑟 ∈ R^X and 𝑏 ∈ R^X_+, and

• CES aggregator 𝐴CES(𝑥, 𝑦) = {𝑟(𝑥)^𝛼 + 𝛽𝑦^𝛼}^{1/𝛼} with 𝑟 ∈ (0, ∞)^X, 𝛽 ⩾ 0 and 𝛼 ≠ 0.

Here CES stands for “constant elasticity of substitution.” An important special case
of both the CES and Uzawa aggregators is the

• additive aggregator 𝐴ADD ( 𝑥, 𝑦 ) = 𝑟 ( 𝑥 ) + 𝛽 𝑦 with 𝑟 ∈ RX and 𝛽 ⩾ 0.

From these basic types we can also build composite aggregators. For example, we
might consider a CES-Uzawa aggregator of the form 𝐴(𝑥, 𝑦) = {𝑟(𝑥)^𝛼 + 𝑏(𝑥)𝑦^𝛼}^{1/𝛼} with
𝑟, 𝑏 ∈ R^X, 𝑏 ⩾ 0 and 𝛼 ≠ 0. As we will see in §7.3.3.3, the CES-Uzawa aggregator
can be used to construct models with both Epstein–Zin utility and state-dependent
discounting (as in, say, Albuquerque et al. (2016) or Schorfheide et al. (2018)).
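The aggregators listed above translate directly into code once states are indexed by integers. The following sketch writes each one as a function of a state index x and a continuation value y. The vectors r and b, the value of 𝛽, and the default 𝛼 are illustrative assumptions, and the function names are hypothetical.

```julia
r = [1.0, 2.0]        # r(x), assumed strictly positive so the CES forms apply
b = [0.90, 1.05]      # state-dependent discount factors b(x) (assumed)
β = 0.95

A_min(x, y)              = min(r[x], β * y)                    # Leontief
A_uzawa(x, y)            = r[x] + b[x] * y                     # Uzawa
A_ces(x, y; α=0.5)       = (r[x]^α + β * y^α)^(1 / α)          # CES, y > 0
A_add(x, y)              = r[x] + β * y                        # additive
A_ces_uzawa(x, y; α=0.5) = (r[x]^α + b[x] * y^α)^(1 / α)       # CES-Uzawa, y > 0

# With α = 1 the CES aggregator reduces to the additive aggregator
println(A_ces(1, 3.0; α=1.0), "  ", A_add(1, 3.0))
```

The additive aggregator is also recovered from the Uzawa form when 𝑏(𝑥) = 𝛽 in every state.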

7.3.1.5 Building Koopmans Operators

We are now ready to build Koopmans operators by combining certainty equivalents
and aggregators. Given 𝑉 ⊂ R^X, we call a self-map 𝐾 on 𝑉 a Koopmans operator if

\[
K = A \circ R
\tag{7.18}
\]

for some aggregator 𝐴 and certainty equivalent operator 𝑅 on 𝑉. The expression in
(7.18) means that (𝐾𝑣)(𝑥) = 𝐴(𝑥, (𝑅𝑣)(𝑥)) at 𝑣 ∈ 𝑉 and 𝑥 ∈ X.
It is generally appropriate to suppose that a uniform increase in continuation val-
ues will increase current value. This property holds for 𝐾 in (7.18). In particular, it
follows from the definitions of 𝐴 and 𝑅 that 𝐾 is an order-preserving self-map on 𝑉 .

Example 7.3.7. For risk-sensitive preferences, the Koopmans operator (page 223) can
be expressed as 𝐾𝜃 = 𝐴ADD ◦ 𝑅𝜃 , where 𝑅𝜃 is the entropic certainty equivalent operator.

Example 7.3.8. The Epstein–Zin Koopmans operator can be expressed as 𝐾 = 𝐴CES ◦


𝑅 𝛾 , where 𝑅 𝛾 is the Kreps–Porteus expectations operator, as defined in (7.16). This is
a version of (7.18) under the CES aggregator.

Remark 7.3.1. We defined time additive preferences somewhat loosely in §3.2.2.3.
Here is a better definition: The Koopmans operator 𝐾 = 𝐴 ◦ 𝑅 is time additive if
𝐴 = 𝐴ADD and 𝑅 is ordinary conditional expectation (as in Example 7.3.1).
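The composition 𝐾 = 𝐴 ◦ 𝑅 can be expressed as a small higher-order function. The sketch below builds the time additive and risk-sensitive Koopmans operators from a shared additive aggregator. The helper name koopmans and all numerical values are assumptions made for illustration.

```julia
# (Kv)(x) = A(x, (Rv)(x)): apply R, then aggregate state by state
koopmans(A, R) = v -> [A(x, y) for (x, y) in enumerate(R(v))]

β, θ = 0.9, -1.0
P = [0.6 0.4;
     0.3 0.7]
r = [1.0, 2.0]

A_add(x, y)   = r[x] + β * y
R_cond(v)     = P * v                            # conditional expectation
R_entropic(v) = log.(P * exp.(θ .* v)) ./ θ      # entropic certainty equivalent

K_additive = koopmans(A_add, R_cond)             # time additive: Kv = r + βPv
K_risk     = koopmans(A_add, R_entropic)         # risk-sensitive: A_ADD ∘ R_θ

v = [0.5, 1.5]
println(K_additive(v), "  ", K_risk(v))
```

With 𝜃 < 0 the entropic certainty equivalent lies below the conditional expectation by Jensen's inequality, so the risk-sensitive value is no larger than the time additive one at the same 𝑣.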

7.3.1.6 Comments on CES Aggregation

The CES aggregator is so-named because, in a static utility maximization problem
where 𝑐 and 𝑦 are two goods and utility is 𝑈(𝑐, 𝑦) = ((1 − 𝛽)𝑐^𝛼 + 𝛽𝑦^𝛼)^{1/𝛼}, the
elasticity of substitution is constant and given by 1/(1 − 𝛼). In the present setting, where
aggregation is across time, 1/(1 − 𝛼) is usually called the elasticity of intertemporal
substitution (EIS). The next exercise explains.

EXERCISE 7.3.14. Consider 𝑈(𝑐, 𝑦) = ((1 − 𝛽)𝑐^𝛼 + 𝛽𝑦^𝛼)^{1/𝛼} as a utility function over
current and future goods 𝑐 and 𝑦. Then

\[
\mathrm{EIS} = \frac{d \ln(y/c)}{d \ln(U_c / U_y)}
\quad \text{where} \quad
U_c := \frac{\partial U(c, y)}{\partial c}
\quad \text{and} \quad
U_y := \frac{\partial U(c, y)}{\partial y}.
\]

Confirm that EIS = 1/(1 − 𝛼).

The fact that EIS = 1/(1−𝛼) under the CES aggregator is significant because the EIS
can be measured from data using regression and other techniques. While estimates
vary significantly, the detailed meta-analysis by Havranek et al. (2015) suggests 0.5 as
a plausible average value for international studies, with rich countries tending slightly
higher. Basu and Bundick (2017) use 0.8 when calibrating to US data. Under these
estimates, the relationship EIS = 1/(1 − 𝛼) implies a value for 𝛼 between -1.0 and
-0.25.
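The mapping between EIS estimates and 𝛼 can be checked with a few lines of code. The sketch below inverts EIS = 1/(1 − 𝛼) and, as a cross-check, recovers the EIS from the CES utility function by finite differences. The function names, the value 𝛽 = 0.96, the evaluation points, and the step size are all assumptions made for illustration.

```julia
# Invert EIS = 1 / (1 − α)
alpha_from_eis(eis) = 1 - 1 / eis

# CES utility over current and future goods
U(c, y; α, β) = ((1 - β) * c^α + β * y^α)^(1 / α)

# Ratio U_c / U_y, approximated by central finite differences
function mrs(c, y; α, β, δ=1e-6)
    Uc = (U(c + δ, y; α, β) - U(c - δ, y; α, β)) / (2δ)
    Uy = (U(c, y + δ; α, β) - U(c, y - δ; α, β)) / (2δ)
    return Uc / Uy
end

# Change in ln(y/c) relative to the change in ln(U_c / U_y) between two points
function eis_numeric(α, β; c=1.0, y1=1.0, y2=1.1)
    num = log(y2 / c) - log(y1 / c)
    den = log(mrs(c, y2; α, β)) - log(mrs(c, y1; α, β))
    return num / den
end

for eis in (0.5, 0.8)
    α = alpha_from_eis(eis)
    println("EIS = ", eis, "  gives α = ", α,
            ";  numerical EIS ≈ ", eis_numeric(α, 0.96))
end
```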

7.3.1.7 Lifetime Value

In §7.3.1.5 we constructed a generic Koopmans operator using an aggregator and a


certainty equivalent operator. In this section we connect this Koopmans operator to
lifetime values and discuss the significance of global stability.
To begin, fix set X and function class 𝑉 ⊂ RX . Let 𝐾 = 𝐴 ◦ 𝑅 be a Koopmans operator
for some aggregator 𝐴 and certainty equivalent operator 𝑅 on 𝑉 . The lifetime value
generated by 𝐾 is the unique fixed point of 𝐾 in 𝑉 , whenever it exists. Given such a 𝑣,
the value 𝑣 ( 𝑥 ) is interpreted as lifetime value conditional on initial state 𝑥 .

Example 7.3.9. In the case of time additive preferences, lifetime value was defined in
(3.21) by 𝑣 = (𝐼 − 𝛽𝑃)^{−1}𝑟. Equivalently, 𝑣 is the fixed point of the operator 𝐾 defined
by 𝐾𝑣 = 𝑟 + 𝛽𝑃𝑣. Since 𝐾 is globally stable, the fixed point is unique. In view of
Lemma 3.2.1 on page 94, it satisfies

\[
v(x) = \mathbb{E} \sum_{t \geq 0} \beta^t r(X_t)
\quad \text{when } (X_t) \text{ is } P\text{-Markov and } X_0 = x.
\]

Example 7.3.10. By Proposition 7.2.2, the risk-sensitive Koopmans operator 𝐾𝜃 =


𝐴ADD ◦ 𝑅𝜃 is globally stable on 𝑉 = RX when 𝛽 ∈ (0, 1). In this setting, the unique fixed
point of 𝐾𝜃 in 𝑉 is interpreted as lifetime value under the risk-sensitive preferences
described in §7.2.2.

In many applications, our existence and uniqueness proofs for fixed points of 𝐾
will also establish global stability. For Koopmans operators, global stability has the
following interpretation: for 𝑤 ∈ 𝑉, 𝑚 ∈ N and 𝑥 ∈ X, the value (𝐾^𝑚 𝑤)(𝑥) gives
total finite-horizon utility over periods 0, . . . , 𝑚 under the preferences embedded in
𝐾 , with initial state 𝑥 and terminal condition 𝑤. Hence global stability implies that, for
any choice of terminal condition, finite-horizon utility converges to infinite-horizon
utility as the time horizon converges to infinity. The next exercise helps to illustrate
this point.

EXERCISE 7.3.15. Consider again the time additive preferences 𝑉𝑡 = 𝑢(𝐶𝑡) + 𝛽E𝑡𝑉_{𝑡+1}
in Example 7.3.9. Suppose that the time horizon is finite, with some exogenous
terminal value 𝑉𝑚 = 𝑤(𝑋𝑚) at time 𝑚. Letting 𝑣𝑚(𝑥) represent lifetime value up until
time 𝑚, conditional on initial state 𝑥, show that

(i) $v_m = \sum_{t=0}^{m-1} (\beta P)^t r + (\beta P)^m w$,
(ii) 𝑣𝑚 = 𝐾^𝑚 𝑤, where 𝐾 is the associated Koopmans operator 𝐾𝑣 = 𝑟 + 𝛽𝑃𝑣, and
(iii) 𝐾^𝑚 𝑤 → 𝑣∗ ≔ (𝐼 − 𝛽𝑃)^{−1}𝑟 as 𝑚 → ∞.

Exercise 7.3.15 confirms that, at least for the time additive case, global stability of
𝐾 is equivalent to the statement that a finite-horizon valuation with arbitrary terminal
condition 𝑤 converges to the infinite-horizon valuation.
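The three claims in Exercise 7.3.15 can be illustrated numerically. The sketch below compares the finite sum with 𝑚 applications of 𝐾 and then lets the horizon grow. The two-state matrix P, the vectors r and w, the discount factor, and the helper name iterate_K are assumptions chosen for illustration.

```julia
using LinearAlgebra

β = 0.9
P = [0.7 0.3;
     0.2 0.8]
r = [1.0, 2.0]
w = [10.0, -5.0]                   # arbitrary terminal condition

K(v) = r .+ β .* (P * v)           # time additive Koopmans operator

# Apply K to w a total of m times
function iterate_K(w, m)
    v = w
    for _ in 1:m
        v = K(v)
    end
    return v
end

m = 25
L = β .* P
v_m_sum  = sum(L^t * r for t in 0:m-1) .+ L^m * w    # expression in part (i)
v_m_iter = iterate_K(w, m)                           # expression in part (ii)
v_star   = (I - L) \ r                               # infinite-horizon value

println("gap between (i) and (ii):   ", maximum(abs.(v_m_sum .- v_m_iter)))
println("distance to v* at m = 500:  ", maximum(abs.(iterate_K(w, 500) .- v_star)))
```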

7.3.1.8 Monotone Lifetime Values

Let X = (X, ⪯) be partially ordered, let 𝑖R^X be the set of increasing functions in R^X,
and let 𝑉 be such that 𝑖R^X ⊂ 𝑉 ⊂ R^X. Let 𝐾 be a Koopmans operator on 𝑉, so that
𝐾 = 𝐴 ◦ 𝑅 for some aggregator 𝐴 and certainty equivalent operator 𝑅 on 𝑉. Suppose
that 𝐾 has a unique fixed point 𝑣∗ ∈ 𝑉. A natural question is: when is 𝑣∗ increasing in
the state?

Lemma 7.3.2. If 𝐾 is globally stable, then 𝑣∗ is increasing on X whenever the following
two conditions hold:

(i) 𝐴(𝑥, 𝑣) ⩽ 𝐴(𝑥′, 𝑣) whenever 𝑣 ∈ 𝑉 and 𝑥 ⪯ 𝑥′, and

(ii) 𝑅 is monotone increasing on 𝑉.

Proof. It is not difficult to check that, under the stated conditions, 𝐾 is invariant on
𝑖RX . It follows from Exercise 1.2.18 on page 22 that 𝑣∗ is increasing on X. □

EXERCISE 7.3.16. Consider the Epstein–Zin Koopmans operator 𝐾 = 𝐴CES ◦ 𝑅𝛾 on
𝑉, where 𝑉 ≔ (0, ∞)^X and the primitives are as in (7.13). Assume the conditions of
Proposition 7.2.3, so that 𝐾 has a unique fixed point 𝑣∗ in 𝑉. Given 𝑃 ∈ M(R^X), we
can write 𝑅𝛾 as 𝑅𝛾𝑣 = (𝑃𝑣^𝛾)^{1/𝛾} at each 𝑣 ∈ 𝑉. Prove that 𝑣∗ is increasing in X whenever
𝑃 is monotone increasing and 𝑐 ∈ 𝑖R^X.

7.3.2 A Blackwell-Type Condition

Let 𝑅 be a certainty equivalent operator on 𝑉 = RX and let 𝐴 be an aggregator on 𝑉 .


Let 𝐾 be the Koopmans operator on 𝑉 defined by ( 𝐾𝑣)( 𝑥 ) = 𝐴 ( 𝑥, ( 𝑅𝑣)( 𝑥 )). When 𝑅 is
constant-subadditive, we can often establish global stability of 𝐾 on 𝑉 via a contraction
mapping argument. This section gives details.

7.3.2.1 Blackwell Aggregators

We call an aggregator 𝐴 on 𝑉 a Blackwell aggregator if there exists a 𝛽 ∈ (0, 1) such


that
𝐴 ( 𝑥, 𝑦 + 𝜆 ) ⩽ 𝐴 ( 𝑥, 𝑦 ) + 𝛽𝜆 (7.19)
for all 𝑥 ∈ X, 𝑦 ∈ R and 𝜆 ∈ R+ .

EXERCISE 7.3.17. Fix 𝛽 ∈ R+ and 𝑟 ∈ RX . Show that the additive aggregator


𝐴 ( 𝑥, 𝑦 ) = 𝑟 ( 𝑥 ) + 𝛽 𝑦 and the Leontief aggregator 𝐴 ( 𝑥, 𝑦 ) = min{𝑟 ( 𝑥 ) , 𝛽 𝑦 } are Blackwell
aggregators when 𝛽 < 1.

The next proposition states conditions for global stability in settings where aggre-
gators have the Blackwell property.
Proposition 7.3.3. If 𝐴 is a Blackwell aggregator and 𝑅 is constant-subadditive, then
the Koopmans operator 𝐾 ≔ 𝐴 ◦ 𝑅 is a contraction on 𝑉 with respect to ‖·‖∞.

Proof. Let the primitives be as stated. In view of Lemma 2.2.4 on page 62, and taking
into account the fact that 𝐾 is order-preserving, we need only show that there exists
a 𝛽 ∈ (0, 1) with 𝐾 ( 𝑣 + 𝜆 ) ⩽ 𝐾𝑣 + 𝛽𝜆 for all 𝑣 ∈ 𝑉 and 𝜆 ∈ R+ . To see this, fix 𝑣 ∈ 𝑉 and
𝜆 ∈ R+ . Applying constant-subadditivity of 𝑅 and monotonicity of 𝐴, we have

𝐾 ( 𝑣 + 𝜆 ) = 𝐴 (·, 𝑅 ( 𝑣 + 𝜆 )) ⩽ 𝐴 (·, 𝑅𝑣 + 𝜆 )

Since 𝐴 is a Blackwell aggregator, the last term is bounded by 𝐴 (·, 𝑅𝑣) + 𝛽𝜆 with 𝛽 < 1.
Hence 𝐾 ( 𝑣 + 𝜆 ) ⩽ 𝐾𝑣 + 𝛽𝜆 , and 𝐾 is a contraction of modulus 𝛽 on 𝑉 . □

The stability of time additive preferences is a special case of Proposition 7.3.3.

7.3.2.2 The Risk-Sensitive Case

We can now complete the proof of Proposition 7.2.2, which concerned global stability
of the Koopmans operator generated by risk-sensitive preferences.

Proof of Proposition 7.2.2. Fix 𝜃 ≠ 0 and recall that 𝐾𝜃 in (7.10) can be expressed as
𝐾𝜃 = 𝐴ADD ◦ 𝑅𝜃 when 𝑅𝜃 is the entropic certainty equivalent. Since 𝐴ADD is a Blackwell
aggregator and 𝑅𝜃 is constant-subadditive (Exercise 7.3.8), Proposition 7.3.3 applies.
In particular, 𝐾𝜃 is globally stable on RX . □
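The contraction property behind this proof can be observed numerically. The sketch below evaluates the risk-sensitive operator at two arbitrary vectors and compares the ratio of sup-norm distances with 𝛽. The matrix P, the vector r, the test vectors, and the values of 𝛽 and 𝜃 are assumptions, and one pair of vectors only illustrates the bound.

```julia
β, θ = 0.9, -1.5
P = [0.6 0.3 0.1;
     0.2 0.6 0.2;
     0.1 0.3 0.6]
r = [1.0, 0.5, 2.0]

R_θ(v) = log.(P * exp.(θ .* v)) ./ θ      # entropic certainty equivalent
K_θ(v) = r .+ β .* R_θ(v)                 # K_θ = A_ADD ∘ R_θ

sup_dist(a, b) = maximum(abs.(a .- b))

v, w = [0.0, 1.0, 4.0], [2.0, -1.0, 0.5]
ratio = sup_dist(K_θ(v), K_θ(w)) / sup_dist(v, w)   # bounded by β under Proposition 7.3.3

# Adding a constant to v shifts R_θ v by the same constant
shift_gap = sup_dist(R_θ(v .+ 1.0), R_θ(v) .+ 1.0)

println("distance ratio = ", ratio, "  (β = ", β, ")")
```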

EXERCISE 7.3.18. Let 𝐾 = 𝐴MIN ◦ 𝑅 on 𝑉 = RX . Prove that 𝐾 is globally stable on 𝑉


whenever 𝑅 is constant-subadditive and 𝐴MIN ( 𝑥, 𝑦 ) = min{𝑟 ( 𝑥 ) , 𝛽 𝑦 } with 𝛽 ∈ (0, 1).

7.3.2.3 Quantile Preferences

Consider a setting where 𝑉 = RX and 𝐾𝜏 ≔ 𝐴ADD ◦ 𝑅𝜏 . That is,

( 𝐾𝜏 𝑣)( 𝑥 ) = 𝑟 ( 𝑥 ) + 𝛽 ( 𝑅𝜏 𝑣)( 𝑥 ) ( 𝑥 ∈ X) (7.20)

for 𝛽 ∈ (0, 1), 𝜏 ∈ [0, 1], 𝑟 ∈ RX and 𝑅𝜏 as described in Exercise 7.3.4. Since 𝑅𝜏 is
constant-subadditive (Exercise 7.3.7) and the additive aggregator is Blackwell, 𝐾𝜏 is
globally stable (Proposition 7.3.3). The operator 𝐾𝜏 represents quantile preferences,
as described in de Castro and Galvao (2019) and other studies (see §7.4). The value 𝜏
parameterizes attitude to risk, a point we return to in §8.2.1.4.

EXERCISE 7.3.19. Consider replacing the operator 𝐾𝜏 in (7.20) with 𝐾 = 𝐴MIN ◦ 𝑅𝜏 .


Under the same assumptions as above (apart from the switch to Leontief aggregator),
prove that 𝐾 is globally stable.

7.3.3 Uzawa Aggregation


Let’s consider the Koopmans operator 𝐾 = 𝐴UZAWA ◦ 𝑅, where 𝑉 is some subset of RX
and 𝑅 is a certainty equivalent operator on 𝑉 . In particular,

( 𝐾𝑣)( 𝑥 ) = 𝑟 ( 𝑥 ) + 𝑏 ( 𝑥 ) ( 𝑅𝑣)( 𝑥 ) ( 𝑥 ∈ X, 𝑣 ∈ 𝑉 ) (7.21)

with 𝑟, 𝑏 ∈ RX and 𝑏 ⩾ 0. We are interested in conditions that imply 𝐾 is globally


stable on 𝑉 .

7.3.3.1 The Case of Conditional Expectation

Let 𝑉 = R^X and suppose 𝑅 = 𝑃 for some 𝑃 ∈ M(R^X), so that 𝑅 is ordinary
conditional expectations. Then 𝐾 becomes 𝐾𝑣 = 𝑟 + 𝐿𝑣 where 𝐿 ∈ L(R^X) with 𝐿(𝑥, 𝑥′) =
𝑏(𝑥)𝑃(𝑥, 𝑥′). By Exercise 1.2.17 on page 22, 𝐾 is globally stable on 𝑉 whenever
𝜌(𝐿) < 1.
This kind of structure arises when households derive utility from a consumption
path while their discount factor fluctuates according to some state variable (see, e.g.,
Krusell and Smith (1998), Toda (2019), Cao (2020), and Hubmer et al. (2020)). For
a given consumption path (𝐶𝑡), lifetime value takes the form

\[
v(x) = \mathbb{E}_x \sum_{t=0}^{\infty} \left( \prod_{i=1}^{t} \beta_i \right) u(C_t)
\tag{7.22}
\]


where 𝑢 is a flow utility function and {𝛽𝑡} is a discount factor process. Suppose 𝐶𝑡 =
𝑐(𝑋𝑡) and 𝛽𝑡 = 𝑏(𝑋𝑡) where 𝑏 ⩾ 0 and (𝑋𝑡) is 𝑃-Markov for some 𝑃 ∈ M(R^X). Set 𝑟 ≔
𝑢 ◦ 𝑐 and 𝐿(𝑥, 𝑥′) ≔ 𝑏(𝑥)𝑃(𝑥, 𝑥′). By Theorem 6.1.1 on page 183, the condition 𝜌(𝐿) < 1
implies that 𝑣 in (7.22) is the unique fixed point of 𝐾𝑣 = 𝑟 + 𝐿𝑣 = 𝑟 + 𝑏𝑃𝑣. In other
words, lifetime value under (7.22) is the unique fixed point of the Koopmans operator
when the aggregator is of Uzawa type and the certainty equivalent is conditional
expectation.
How does this relate to optimization? Recall our discussion of state-dependent
MDPs in Chapter 6. There, the policy operator 𝑇𝜎 in (6.16) on page 192 is a special
case of (7.21) when the discount factor depends only on the current state and action.
With some additional requirements, the condition 𝜌(𝐿) < 1 is necessary as well
as sufficient for existence of a unique fixed point for 𝐾𝑣 = 𝑟 + 𝐿𝑣. Indeed, if 𝑏 ≫ 0
and 𝑃 is irreducible, then 𝐿 is also irreducible and a positive linear operator. Applying
Lemma 6.1.4, we see that 𝑟 ≫ 0 and 𝜌(𝐿) ⩾ 1 implies 𝐾𝑣 = 𝑟 + 𝐿𝑣 has no fixed point
in 𝑉 ≔ {𝑣 ∈ R^X : 𝑣 ≫ 0}.

EXERCISE 7.3.20. Confirm that 𝐿 is irreducible when 𝑏 ≫ 0 and 𝑃 is irreducible.
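A small example shows that 𝜌(𝐿) < 1 can hold even when the discount factor exceeds one in some state. The sketch below builds 𝐿 from an assumed two-state model, computes its spectral radius, and solves for the fixed point of 𝐾𝑣 = 𝑟 + 𝐿𝑣. All numerical values are illustrative assumptions.

```julia
using LinearAlgebra

P = [0.5 0.5;
     0.5 0.5]
b = [1.02, 0.90]        # discount factor exceeds one in the first state
r = [1.0, 1.0]

L = b .* P              # L(x, x') = b(x) P(x, x'): row x of P scaled by b(x)
ρ_L = maximum(abs.(eigvals(L)))

v = (I - L) \ r         # unique fixed point of Kv = r + Lv when ρ(L) < 1

println("ρ(L) ≈ ", ρ_L, ",  v ≈ ", v)
```

Here 𝐿 has rank one, so its spectral radius equals its trace, 0.51 + 0.45 = 0.96, and solving the two linear equations by hand gives 𝑣 = (26.5, 23.5).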

7.3.3.2 Stability via Concavity

Now consider 𝐾𝑣 = 𝑟 + 𝑏𝑅𝑣 from (7.21) when 𝑅 is not in M ( RX ). Here 𝑏𝑅𝑣 is the
pointwise product, so that ( 𝑏𝑅𝑣)( 𝑥 ) = 𝑏 ( 𝑥 )( 𝑅𝑣)( 𝑥 ) for all 𝑥 .
We cannot use Proposition 7.3.3 to prove stability of 𝐾 unless 𝑏 ( 𝑥 ) < 1 for all 𝑥 ∈ X.
Since this condition is rather strict, we now study weaker conditions that can be valid
even when 𝑏 exceeds 1 in some states. Specifically, we consider

(a) 𝑏𝑅𝑣 ⩽ 𝑐 + 𝐿𝑣 for some 𝑐 ∈ R^X and 𝐿 ∈ L(R^X) with 𝜌(𝐿) < 1.

(b) 𝑟 ≫ 0 and 𝑅 is concave on R^X_+.

Let 𝑉 = [0, 𝑣¯] where 𝑣¯ ≔ (𝐼 − 𝐿)^{−1}(𝑟 + 𝑐).

Proposition 7.3.4. If conditions (a)–(b) hold, then 𝐾 is globally stable on 𝑉.

Proof. Under (a)–(b), 𝐾 is concave on R^X_+, with

\[
0 \ll r = r + bR0 = K0
\quad \text{and} \quad
K \bar v = r + bR\bar v \leq r + c + L \bar v = r + c - (I - L)\bar v + \bar v = \bar v.
\]

The claim now follows from Du’s theorem (page 216). □


7.3.3.3 Epstein–Zin Preferences with State-Dependent Discounting

Combining the CES-Uzawa aggregator 𝐴(𝑥, 𝑦) = {𝑟(𝑥)^𝛼 + 𝑏(𝑥)𝑦^𝛼}^{1/𝛼} with the
Kreps–Porteus certainty equivalent operator leads to the Koopmans operator

\[
Kv = \left\{ h + b \left[ P v^{\gamma} \right]^{\alpha/\gamma} \right\}^{1/\alpha},
\quad \text{with } h, b \in \mathbb{R}^{\mathsf{X}}_+.
\tag{7.23}
\]

A fixed point of 𝐾 corresponds to lifetime value for an agent with Epstein–Zin
preferences and state-dependent discounting. (Such setups are used in research on
macroeconomic dynamics and asset pricing; see §7.4 for more details.)
In what follows we take 𝑉 = (0, ∞)^X and assume that ℎ, 𝑏 ∈ 𝑉 and 𝑃 is irreducible.

EXERCISE 7.3.21. Show that 𝐾 is a self-map on 𝑉.

To discuss stability of 𝐾 we introduce the operator 𝐴 ∈ L(R^X) defined by

\[
(Av)(x) := b(x)^{\theta} \sum_{x'} v(x') P(x, x')
\quad \text{where} \quad \theta := \frac{\gamma}{\alpha}.
\]

Proposition 7.3.5. 𝐾 is globally stable on 𝑉 if and only if 𝜌(𝐴)^{𝛼/𝛾} < 1.

To prove Proposition 7.3.5, we proceed as in §7.2.3.3, constructing a conjugate
operator 𝐾ˆ and proving stability of the latter. For this purpose, we introduce

\[
\hat K v = \left\{ h + (Av)^{1/\theta} \right\}^{\theta}
\qquad (v \in V).
\tag{7.24}
\]

Also, let Φ be defined by Φ𝑣 = 𝑣^𝛾.

EXERCISE 7.3.22. Prove: (𝑉, 𝐾 ) and (𝑉, 𝐾ˆ ) are topologically conjugate under Φ.

Proof of Proposition 7.3.5. In view of Exercise 7.3.22, it suffices to show that 𝐾ˆ is
globally stable on 𝑉 if and only if 𝜌(𝐴)^{𝛼/𝛾} < 1. This is implied by Theorem 7.1.4, since 𝐴
is irreducible (see Exercise 7.3.20 on page 241) and 𝜌(𝐴)^{1/𝜃} = 𝜌(𝐴)^{𝛼/𝛾}. □
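The stability condition in Proposition 7.3.5 is straightforward to evaluate for a given model. The sketch below computes 𝜌(𝐴)^{𝛼/𝛾} for an assumed three-state specification in which the discount factor exceeds one in the first state, and then iterates 𝐾 from (7.23) a fixed number of times. The matrix P, the vectors b and h, the parameter values, and the helper name iterate_map are illustrative assumptions.

```julia
using LinearAlgebra

α, γ = 0.75, -2.0
θ = γ / α
P = [0.9 0.1 0.0;
     0.1 0.8 0.1;
     0.0 0.1 0.9]                   # irreducible stochastic matrix
b = [1.01, 0.95, 0.90]              # state-dependent discount factors, b ≫ 0
h = 0.1 .* [0.5, 1.0, 1.5] .^ α     # h ≫ 0

A = (b .^ θ) .* P                   # (Av)(x) = b(x)^θ Σ_{x'} v(x') P(x, x')
stability_stat = maximum(abs.(eigvals(A)))^(α / γ)

# Koopmans operator (7.23)
K(v) = (h .+ b .* (P * v .^ γ) .^ (α / γ)) .^ (1 / α)

# Apply T to v a total of n times
function iterate_map(T, v, n)
    for _ in 1:n
        v = T(v)
    end
    return v
end

v_approx = iterate_map(K, ones(3), 20_000)
residual = maximum(abs.(K(v_approx) .- v_approx))

println("ρ(A)^(α/γ) ≈ ", stability_stat, ",  fixed point residual ≈ ", residual)
```

Since the largest diagonal entry of 𝐴 exceeds one, 𝜌(𝐴) > 1, and the negative exponent 𝛼/𝛾 then places the statistic below one for these assumed values.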

7.4 Chapter Notes


The time additive preference structure in §7.2.1 was popularized by Samuelson (1939),
who built on earlier work by Fisher (1930) and Ramsey (1928). An axiomatic
foundation was supplied by Koopmans (1960). Bastianello and Faro (2022) study the
foundations of discounted expected utility from a purely subjective framework.
Problems with the time additive discounted utility model include non-constant
discounting, as discussed in §6.4, as well as sign effects (gains being discounted more
than losses) and magnitude effects (small outcomes being discounted more than large
ones). See, for example, Thaler (1981) and Benzion et al. (1989). A critical review of
the time additive model and a list of many references can be found in Frederick et al.
(2002).
In the stochastic setting, the time additive framework is a subset of the expected
utility model (Von Neumann and Morgenstern (1944), Friedman (1956), Savage
(1951)). There are many well documented departures from expected utility in ex-
perimental data. See the start of Andreoni and Sprenger (2012) and the article Eric-
son and Laibson (2019) for an introduction to the literature. An interesting historical
discussion of time additive expected utility can be found in Becker et al. (1989).
(It is ironic that those most responsible for popularizing the time additive dis-
counted expected utility (DEU) framework have also been among the most critical.
For example, Samuelson (1939) stated that it is “completely arbitrary” to assume that
the DEU specification holds. He goes on to claim that, in the analysis of savings and
consumption, it is “extremely doubtful whether we can learn much from considering
such an economic man.” In addition, Stokey and Lucas (1989), whose work helped
to standardize DEU as a methodology for quantitative analysis, argued in a separate
study that DEU is attractive only because of its relative simplicity (Lucas and Stokey,
1984).)
Do the departures from time additive expected utility found in experimental data
actually matter for quantitative work? Evidence suggests that the answer is affirma-
tive. In macroeconomics and asset pricing in particular, researchers increasingly use
non-additive preferences in order to bring model outputs closer to the data. For ex-
ample, many quantitative models of asset pricing rely heavily on Epstein–Zin prefer-
ences. Representative examples include Epstein and Zin (1991), Tallarini Jr (2000),
Bansal and Yaron (2004), Hansen et al. (2008), Bansal et al. (2012), Schorfheide
et al. (2018), and de Groot et al. (2022). Alternative numerical solution methods are
discussed in Pohl et al. (2018).
An excellent introduction to recursive preference models can be found in Backus
et al. (2004). Our use of the term “Koopmans operator,” which is not entirely stan-
dard, honors early contributions by Nobel laureate Tjalling Koopmans on recursive
preferences (see Koopmans (1960) and Koopmans et al. (1964)).
Theoretical properties of recursive preference models have been studied in many
papers, including Epstein and Zin (1989), Weil (1990), Boyd (1990), Hansen and
Scheinkman (2009), Marinacci and Montrucchio (2010), Bommier et al. (2017),
Bloise and Vailakis (2018), Marinacci and Montrucchio (2019), Pohl et al. (2019),
Balbus (2020), Borovička and Stachurski (2020), DeJarnette et al. (2020), Christensen
(2022), and Becker and Rincon-Zapatero (2023). The paper by Marinacci and Mon-
trucchio (2019) provides a useful alternative approach to existence of unique fixed
points in the setting of order-preserving maps. Experimental results on Epstein–Zin
preferences can be found in Meissner and Pfeiffer (2022).
There is a strong connection between risk-sensitive preferences and the literature
on robust control. See, for example, Cagetti et al. (2002), Hansen and Sargent (2007),
and Barillas et al. (2009). We return to this point in Chapter 8.
The quantile preferences we considered in §7.3.2.3 have been analyzed in static
and dynamic settings by Giovannetti (2013), de Castro and Galvao (2019), de Castro
and Galvao (2022) and de Castro et al. (2022). Recursive components of the analysis
of quantile and Uzawa preference models build on the study of monotone preferences
in Bommier et al. (2017).
Some recursive preference specifications involve ambiguity aversion. An introduc-
tion to this literature and its applications can be found in Klibanoff et al. (2009),
Hayashi and Miao (2011), Hansen and Miao (2018), Bommier et al. (2019) and
Hansen and Sargent (2020). Marinacci et al. (2023) discuss the connection between
recursivity and attitudes to uncertainty. We discuss ambiguity again in Chapter 8.
Recursive preferences are increasingly applied outside the field of asset pricing,
where they first came to prominence. See, for example, Bommier and Villeneuve
(2012), Colacito et al. (2018), Jensen (2019), or Augeraud-Véron et al. (2019).
The coin flip application in §7.2.1.2 is related to correlation aversion, as discussed
in Stanca (2023), and preference for “consumption spreads” as reviewed in Frederick
et al. (2002).
Some applications of Theorem 7.1.3 to network analysis can be found in Sargent
and Stachurski (2023b).
Chapter 8

Recursive Decision Processes

While the MDP model from Chapters 5–6 is elegant and widely used, researchers in
economics, finance, and other fields are working to extend it. Reasons include:
(i) MDP theory cannot be applied to settings where lifetime values are described
by the kinds of nonlinear recursions discussed in Chapter 7.
(ii) Equilibria in some models of production and economic geography can be com-
puted using dynamic programming but not all such programming problems fit
within the MDP framework.
(iii) Dynamic programming problems that include adversarial agents to promote ro-
bust decision rules can fail to be MDPs.
To handle such departures from the MDP assumptions, we now construct a more
general dynamic programming framework, building on an approach to optimization
initially developed by Denardo (1967) and extended by Bertsekas (2022b). Further
references are provided in §8.4.
We start this chapter by building a framework that centers on an abstract repre-
sentation of the Bellman equation (§8.1). We then state optimality results and show
how they can be verified in a range of applications. We defer proofs of core optimal-
ity results to Chapter 9, where we strip dynamic programs down to their essence by
adopting a purely operator-theoretic perspective.

8.1 Definition and Properties


In this section we introduce and analyze optimality conditions for recursive decision
processes that include and extend all dynamic programming frameworks discussed
so far. Throughout this chapter, X denotes a finite set.

8.1.1 Defining RDPs

Consider a generic Bellman equation of the form

\[ v(x) = \max_{a \in \Gamma(x)} B(x, a, v). \tag{8.1} \]

Here 𝑥 is the state, 𝑎 is an action, Γ is a feasible correspondence, and 𝐵 is an “aggregator”
function. We understand Γ(𝑥) as all actions available to the controller in state
𝑥. The function 𝑣 assigns values to states and is a member of some class 𝑉 ⊂ R^X.
This “abstract” Bellman equation generalizes all of the Bellman equations presented
in previous chapters.
Our plan is to analyze the Bellman equation (8.1) and state conditions on 𝐵 and
the other primitives that make strong optimality properties hold. As a first step, we
introduce two finite sets,

• an action space A and


• a state space X.

Given X and A, we define a recursive decision process (RDP) to be a triple
R = (Γ, 𝑉, 𝐵) consisting of

(i) a feasible correspondence Γ that is a nonempty correspondence from X to A,
which in turn defines
• the feasible state-action pairs

G ≔ {(𝑥, 𝑎) ∈ X × A : 𝑎 ∈ Γ(𝑥)}

• and the set of feasible policies

Σ ≔ {𝜎 ∈ A^X : 𝜎(𝑥) ∈ Γ(𝑥) for all 𝑥 ∈ X},

(ii) a subset 𝑉 of R^X called the value space, and

(iii) a value aggregator 𝐵 that maps G × 𝑉 to R and satisfies both the monotonicity
condition

𝑣, 𝑤 ∈ 𝑉 and 𝑣 ⩽ 𝑤 ⟹ 𝐵(𝑥, 𝑎, 𝑣) ⩽ 𝐵(𝑥, 𝑎, 𝑤) for all (𝑥, 𝑎) ∈ G (8.2)

and the consistency condition

𝑤 ∈ 𝑉 whenever 𝑤(𝑥) = 𝐵(𝑥, 𝜎(𝑥), 𝑣) for some 𝜎 ∈ Σ and 𝑣 ∈ 𝑉. (8.3)

Throughout, ⩽ represents the pointwise order on R^X.


The definition of the feasible correspondence in (i) is identical to that for the MDP
in Chapter 5. As for (ii), we understand 𝑉 to be a class of functions that assign values
to states. In (iii), the interpretation of the aggregator 𝐵 is:

𝐵(𝑥, 𝑎, 𝑣) = total lifetime rewards, contingent on current action 𝑎, current
state 𝑥, and using 𝑣 to evaluate future states.

The monotonicity condition (8.2) is natural: if, relative to 𝑣, rewards are at least
as high for 𝑤 in every future state, then the total rewards one can extract under 𝑤
should be at least as high. The consistency condition in (8.3) ensures that as we
consider values of different policies we remain within the value space 𝑉 .
The MDP framework is a special case of the RDP framework:
Example 8.1.1. Consider MDP M = (Γ, 𝛽, 𝑟, 𝑃) with state space X and action space A
(see, e.g., §5.1.1). We can frame M as an RDP by taking Γ as unchanged, 𝑉 = R^X,
and

\[ B(x, a, v) = r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \qquad ((x, a) \in G, \; v \in V). \tag{8.4} \]

Now (Γ, 𝑉, 𝐵) forms an RDP. The monotonicity condition (8.2) clearly holds and the
consistency condition (8.3) is trivial, since 𝑉 is all of R^X. Inserting (8.4) into the abstract
Bellman equation (8.1) recovers the MDP Bellman equation ((5.2) on page 129).
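To make the embedding concrete, the following is a minimal Python sketch of the aggregator (8.4) and the right-hand side of (8.1). The two-state, two-action MDP, its numbers, and the names `B`, `bellman_rhs`, and `monotone` are hypothetical and chosen only for illustration.

```python
# Hypothetical two-state, two-action MDP; all numbers are illustrative.
X = [0, 1]                       # state space
Gamma = {0: [0, 1], 1: [0, 1]}   # feasible correspondence
beta = 0.9                       # discount factor
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def B(x, a, v):
    """MDP aggregator (8.4): current reward plus discounted expected value."""
    return r[(x, a)] + beta * sum(v[xp] * P[(x, a)][xp] for xp in X)

def bellman_rhs(x, v):
    """Right-hand side of the abstract Bellman equation (8.1) at state x."""
    return max(B(x, a, v) for a in Gamma[x])

v = [0.0, 0.0]
w = [1.0, 2.0]                   # w >= v pointwise

# Spot check of the monotonicity condition (8.2) for this pair (v, w)
monotone = all(B(x, a, v) <= B(x, a, w) for x in X for a in Gamma[x])
```

The check on `monotone` covers only one pair of value functions, so it illustrates (8.2) rather than proving it.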
Example 8.1.2. Consider a basic cake eating problem (see §5.1.2.3), where X is a
finite subset of R+ and 𝑥 ∈ X is understood to be the number of remaining slices of
cake today. Let 𝑥′ be the number of remaining slices next period and 𝑢(𝑥 − 𝑥′) be the
utility from slices enjoyed today. The utility function 𝑢 maps R+ to R. Let 𝑉 = R^X, let
Γ be defined by Γ(𝑥) = {𝑥′ ∈ X : 𝑥′ ⩽ 𝑥} and let

𝐵(𝑥, 𝑥′, 𝑣) = 𝑢(𝑥 − 𝑥′) + 𝛽𝑣(𝑥′).

Then (Γ, 𝑉, 𝐵) is an RDP with Bellman equation identical to that of the original cake
eating problem in §5.1.2.3. The monotonicity condition (8.2) and the consistency
condition (8.3) are easy to verify.

The last example is a special case of Example 8.1.1, since the cake eating problem
is an MDP (see §5.1.2.3). Nonetheless, Example 8.1.2 is instructive because, for cake
eating, the MDP construction is tedious (e.g., we need to define a stochastic kernel 𝑃
even though transitions are deterministic), while the RDP construction is straightforward.
The next example makes a related point.

Example 8.1.3. In §5.1.2.4 we showed that the job search model is an MDP but the
construction was tedious. But we can also represent job search as an RDP and the embedding
is straightforward. To see this, recall that, for an arbitrary optimal stopping
problem with primitives as described in Chapter 4, the Bellman equation is

\[ v(x) = \max \left\{ e(x), \; c(x) + \beta \sum_{x'} v(x') P(x, x') \right\} \qquad (x \in X). \tag{8.5} \]

Let 𝑉 = R^X and Γ(𝑥) = {0, 1} for all 𝑥. Let

\[ B(x, a, v) = a \, e(x) + (1 - a) \left[ c(x) + \beta \sum_{x'} v(x') P(x, x') \right] \tag{8.6} \]

for 𝑥 ∈ X and 𝑎 ∈ A ≔ {0, 1}. Then (Γ, 𝑉, 𝐵) is an RDP (Exercise 8.1.1) and setting
𝑣(𝑥) = max_{𝑎 ∈ Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑣) reproduces the Bellman equation (8.5).

EXERCISE 8.1.1. Verify that conditions (8.2)–(8.3) hold for this RDP.

Example 8.1.4. The dynamic programming framework popularized by Stokey and
Lucas (1989) is characterized by two features: First, the state is divided into an exogenous
process (𝑍𝑡) and an endogenous process (𝑌𝑡). In addition, the next period
endogenous state is directly chosen by the current action. The Bellman equation
takes the form

\[ v(y, z) = \max_{y' \in \Gamma(y, z)} \left\{ F(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z') \right\}. \tag{8.7} \]

We assume that (𝑍𝑡) is 𝑄-Markov on finite set Z and (𝑌𝑡) takes values in finite set Y.
With state space X ≔ Y × Z, action space Y, feasible correspondence 𝑥 ↦ Γ(𝑥), value
space 𝑉 = R^X and aggregator

\[ B(x, a, v) = B((y, z), y', v) = F(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z'), \]

we obtain an RDP with Bellman equation identical to (8.7).



EXERCISE 8.1.2. Show that this RDP can also be expressed as an MDP.

Examples 8.1.1–8.1.4 treated RDPs that can be embedded into the MDP frame-
work. In the remaining examples, we consider models that cannot be represented as
MDPs.

Example 8.1.5 (State-Dependent Discounting). We can add state-dependent discounting
to Example 8.1.1 by changing the aggregator to

\[ B(x, a, v) = r(x, a) + \sum_{x'} v(x') \beta(x, a, x') P(x, a, x'). \tag{8.8} \]

Here 𝛽 is a map from G × X to R+. With Γ and 𝑉 unchanged, (Γ, 𝑉, 𝐵) is an RDP with
Bellman equation identical to that of the MDP with state-dependent discounting we
analyzed in Chapter 6. In §8.2.2 we will use RDP theory developed in this chapter to
verify the optimality results claimed in Chapter 6.

EXERCISE 8.1.3. Verify that ( Γ, 𝑉, 𝐵) as defined in Example 8.1.5 is an RDP.

Example 8.1.6. We can modify the MDP in Example 8.1.1 to use risk-sensitive preferences.
We do this by taking Γ, 𝑉 to be the same as the MDP example and setting

\[ B(x, a, v) = r(x, a) + \beta \, \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) P(x, a, x') \right\} \tag{8.9} \]

for all (𝑥, 𝑎) ∈ G and 𝑣 ∈ 𝑉. Here G is generated by the feasible correspondence Γ.
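The following sketch evaluates the risk-sensitive aggregator (8.9) on a small hypothetical MDP and compares it with the risk-neutral aggregator (8.4). The primitives and the names `B_risk` and `B_mdp` are assumptions made for illustration. The comparison relies on two standard facts: for 𝜃 < 0 the entropic certainty equivalent lies below the mean, and as 𝜃 → 0 it approaches the mean.

```python
import math

# Hypothetical primitives; theta must be nonzero.
X = [0, 1]
beta, theta = 0.9, -0.5
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def B_risk(x, a, v, theta):
    """Risk-sensitive aggregator (8.9)."""
    ev = sum(math.exp(theta * v[xp]) * P[(x, a)][xp] for xp in X)
    return r[(x, a)] + beta * (1.0 / theta) * math.log(ev)

def B_mdp(x, a, v):
    """Risk-neutral MDP aggregator (8.4), used as a benchmark."""
    return r[(x, a)] + beta * sum(v[xp] * P[(x, a)][xp] for xp in X)

v = [1.0, 3.0]
w = [1.5, 3.0]                   # w >= v pointwise

# Spot check of monotonicity (8.2) for one negative and one positive theta
monotone = all(B_risk(x, a, v, th) <= B_risk(x, a, w, th)
               for (x, a) in r for th in (-0.5, 0.5))

# With theta < 0, risky continuation values are penalized relative to the mean
below_mean = B_risk(0, 0, v, theta) < B_mdp(0, 0, v)

# As theta -> 0 the aggregator approaches the risk-neutral one
gap = abs(B_risk(0, 0, v, 1e-6) - B_mdp(0, 0, v))
```

These checks are numerical illustrations only and do not replace the argument requested in Exercise 8.1.4.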

EXERCISE 8.1.4. Confirm that the risk-sensitive model (Γ, 𝑉, 𝐵) in Example 8.1.6 is
an RDP for all nonzero 𝜃.

Example 8.1.7 (Epstein–Zin Preferences). We can also modify the MDP in Example 8.1.1
to use the Epstein–Zin specification (see (7.12) on page 226) by setting

\[ B(x, a, v) = \left\{ r(x, a) + \beta \left[ \sum_{x'} v(x')^{\gamma} P(x, a, x') \right]^{\alpha/\gamma} \right\}^{1/\alpha}, \tag{8.10} \]

where 𝛽 ∈ (0, 1), 𝛾 and 𝛼 are nonzero parameters, and 𝑟 ≫ 0. Let 𝑉 be all strictly
positive functions in R^X. Then (Γ, 𝑉, 𝐵) is an RDP.

EXERCISE 8.1.5. Confirm the last claim in Example 8.1.7.

Example 8.1.8. The shortest path problem considers optimal traversal of a directed
graph G = (X, 𝐸), where X is the vertices of the graph and 𝐸 is the edges. A weight
function 𝑐 : 𝐸 → R+ associates cost to each edge (𝑥, 𝑥′) ∈ 𝐸. The aim is to find the minimum
cost path from 𝑥 to a specified vertex 𝑑 for every 𝑥 ∈ X. Under some conditions,
the problem can be solved by applying a Bellman operator of the form

\[ (Tv)(x) = \min_{x' \in O(x)} \{ c(x, x') + v(x') \} \qquad (x \in X), \tag{8.11} \]

where O(𝑥) ≔ {𝑥′ ∈ X : (𝑥, 𝑥′) ∈ 𝐸} is the direct successors of 𝑥 and 𝑣(𝑥′) is the
minimum cost-to-go from state 𝑥′. The problem is not an MDP because future values
are not discounted. It can be framed as an RDP, however, by setting Γ(𝑥) = O(𝑥),
𝐵(𝑥, 𝑥′, 𝑣) = 𝑐(𝑥, 𝑥′) + 𝑣(𝑥′) and 𝑉 = R^X.
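As an illustration of (8.11), the sketch below iterates the shortest path Bellman operator on a small hypothetical graph. The graph, its edge costs, and the names `edges`, `T`, and `best` are assumptions for this example. The destination is given a zero-cost self-loop and there are no other cycles, which is one setting in which iteration from 𝑣 = 0 settles down after finitely many steps.

```python
# Hypothetical weighted digraph: edges[x] maps each direct successor x' to c(x, x').
# The destination is "D", with a zero-cost self-loop so travelers can remain there.
edges = {
    "A": {"B": 1.0, "C": 5.0},
    "B": {"C": 1.0, "D": 6.0},
    "C": {"D": 1.0},
    "D": {"D": 0.0},
}

def T(v):
    """Bellman operator (8.11): minimize edge cost plus cost-to-go."""
    return {x: min(c + v[xp] for xp, c in succ.items())
            for x, succ in edges.items()}

v = {x: 0.0 for x in edges}      # initial guess v = 0
for _ in range(len(edges)):      # on this graph, |X| iterations are enough
    v = T(v)

# Cost-minimizing successor at each vertex, given the computed cost-to-go
best = {x: min(succ, key=lambda xp: succ[xp] + v[xp])
        for x, succ in edges.items()}
```

Here `v` records the minimum cost-to-go from each vertex and `best` the next vertex on a least-cost path.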

Example 8.1.8 is a minimization problem. We treat minimization explicitly in
§8.3.5, although the shortest path setting can be converted to maximization by replacing
𝑐(𝑥, 𝑥′) with −𝑐(𝑥, 𝑥′). This produces an application similar to the cake eating problem
in Example 8.1.2 (although discounting is eliminated and network structure shows up
in the constraint).

8.1.2 Lifetime Value

We aim to discuss optimality of RDPs. To prepare for this topic, we now clarify lifetime
values associated with different policy choices in the RDP setting.

8.1.2.1 Policies and Value

Let R = ( Γ, 𝑉, 𝐵) be an RDP with state and action spaces X and A, and let Σ be the
set of all feasible policies. For each 𝜎 ∈ Σ we introduce the policy operator 𝑇𝜎 as a
self-map on 𝑉 defined by

(𝑇𝜎 𝑣)( 𝑥 ) = 𝐵 ( 𝑥, 𝜎 ( 𝑥 ) , 𝑣) ( 𝑥 ∈ X) . (8.12)

The RDP policy operator is a direct generalization of the MDP policy operator defined
on page 135, as well as the optimal stopping policy operator defined on page 108.
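The policy operator (8.12) is simple to code once the aggregator is available. The sketch below uses a hypothetical two-state MDP aggregator and approximates the fixed point of 𝑇𝜎 by successive approximation, which is valid here because this particular 𝑇𝜎 is a contraction. The names `T_sigma` and `sigma_value` and all numbers are assumptions for illustration.

```python
# Hypothetical two-state MDP used to build the aggregator B.
X = [0, 1]
beta = 0.9
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def B(x, a, v):
    """MDP aggregator (8.4)."""
    return r[(x, a)] + beta * sum(v[xp] * P[(x, a)][xp] for xp in X)

def T_sigma(sigma, v):
    """Policy operator (8.12): evaluate B along the policy sigma."""
    return [B(x, sigma[x], v) for x in X]

def sigma_value(sigma, tol=1e-12, max_iter=10_000):
    """Approximate the fixed point of T_sigma by iterating from v = 0."""
    v = [0.0 for _ in X]
    for _ in range(max_iter):
        v_new = T_sigma(sigma, v)
        if max(abs(s - t) for s, t in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    return v

sigma = [1, 0]   # hypothetical policy: action 1 in state 0, action 0 in state 1
v_sigma = sigma_value(sigma)

# Fixed point residual: should be close to zero
residual = max(abs(s - t) for s, t in zip(T_sigma(sigma, v_sigma), v_sigma))
```

For aggregators without a contraction property, iteration of this kind needs separate justification, which is the subject of the stability discussion in this chapter.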

EXERCISE 8.1.6. Show that 𝑇𝜎 is an order-preserving self-map on 𝑉 for all 𝜎 ∈ Σ.


Consider a given RDP (Γ, 𝑉, 𝐵) and fix 𝜎 ∈ Σ. If 𝑇𝜎 has a unique fixed point in 𝑉, we
denote this fixed point by 𝑣𝜎 and call it the 𝜎-value function. It is natural to interpret
𝑣𝜎 as representing the lifetime value of following policy 𝜎.

Example 8.1.9. For the optimal stopping problem discussed in Chapter 4, the function
𝑣𝜎 that records the lifetime value of a policy 𝜎 from any given state is the unique fixed
point of the optimal stopping policy operator 𝑇𝜎 . See §4.1.1.3.

Example 8.1.10. For the MDP model discussed in Chapter 5, lifetime value of policy
𝜎 is given by 𝑣𝜎 = (𝐼 − 𝛽𝑃𝜎)⁻¹ 𝑟𝜎. As discussed in Exercise 5.1.7, 𝑣𝜎 is the unique fixed
point of the MDP policy operator 𝑇𝜎 defined by 𝑇𝜎 𝑣 = 𝑟𝜎 + 𝛽𝑃𝜎 𝑣.

Example 8.1.11. For the MDP model with state-dependent discounting introduced
in Chapter 6, Exercise 6.2.1 shows that the lifetime value of following policy 𝜎 is the
unique fixed point of the policy operator 𝑇𝜎 defined in (6.16) on page 192.

The previous examples are linear but the same idea extends to nonlinear recursive
preference models as well. To see this, recall the generic Koopmans operator ( 𝐾𝑣)( 𝑥 ) =
𝐴 ( 𝑥, ( 𝑅𝑣) ( 𝑥 )) introduced in §7.3.1. Lifetime value is the unique fixed point of this
operator whenever it exists. In all of the RDP examples we have considered, the policy
operator can be expressed as (𝑇𝜎 𝑣) ( 𝑥 ) = 𝐴𝜎 ( 𝑥, ( 𝑅𝜎 𝑣)( 𝑥 )) for some aggregator 𝐴𝜎 and
certainty equivalent operator 𝑅𝜎 . Hence 𝑇𝜎 is a Koopmans operator and lifetime value
associated with policy 𝜎 is the fixed point of this operator.

8.1.2.2 Uniqueness and Stability

Let R = ( Γ, 𝑉, 𝐵) be a given RDP with policy operators {𝑇𝜎 }. Given that our objective
is to maximize lifetime value over the set of policies in Σ, we need to assume at the
very least that lifetime value is well defined at each policy. To this end, we say that R
is well-posed whenever 𝑇𝜎 has a unique fixed point 𝑣𝜎 in 𝑉 for all 𝜎 ∈ Σ.

Example 8.1.12. The optimal stopping RDP we introduced in Example 8.1.3 is well-
posed. Indeed, for each 𝜎 ∈ Σ, the policy operator 𝑇𝜎 has a unique fixed point in RX
by Proposition 4.1.1 on page 108.

Example 8.1.13. The RDP generated by the MDP model in Example 8.1.1 is well-posed,
since, for each 𝜎 ∈ Σ, the operator 𝑇𝜎 defined by 𝑇𝜎 𝑣 = 𝑟𝜎 + 𝛽𝑃𝜎 𝑣 has the unique
fixed point 𝑣𝜎 = (𝐼 − 𝛽𝑃𝜎)⁻¹ 𝑟𝜎 in R^X.

Example 8.1.14. The shortest path problem discussed in Example 8.1.8 is not well-posed
without further assumptions. For example, consider a graph that contains two
vertices 𝑥 and 𝑦, with 𝑥 ∈ O(𝑦), 𝑦 ∈ O(𝑥), and 𝑐(𝑥, 𝑦) + 𝑐(𝑦, 𝑥) > 0. Then, for any
policy 𝜎 that maps 𝑥 to 𝑦 and 𝑦 to 𝑥, we have

(𝑇𝜎 𝑣)(𝑥) = 𝑐(𝑥, 𝑦) + 𝑣(𝑦) and (𝑇𝜎 𝑣)(𝑦) = 𝑐(𝑦, 𝑥) + 𝑣(𝑥).

Hence, if 𝑣 ∈ R^X is a fixed point of 𝑇𝜎, we obtain 𝑣(𝑥) = 𝑐(𝑥, 𝑦) + 𝑣(𝑦) and 𝑣(𝑦) =
𝑐(𝑦, 𝑥) + 𝑣(𝑥). Substitution yields 𝑣(𝑥) = 𝑐(𝑥, 𝑦) + 𝑐(𝑦, 𝑥) + 𝑣(𝑥), which contradicts
𝑐(𝑥, 𝑦) + 𝑐(𝑦, 𝑥) > 0. Hence 𝑇𝜎 has no fixed point in R^X.
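The failure of well-posedness can also be seen numerically. In the sketch below, which uses hypothetical costs 𝑐(𝑥, 𝑦) = 1 and 𝑐(𝑦, 𝑥) = 2, every two applications of 𝑇𝜎 raise the value at 𝑥 by 𝑐(𝑥, 𝑦) + 𝑐(𝑦, 𝑥), so the iterates drift upward without settling.

```python
# Hypothetical two-vertex cycle: the policy sends x to y and y to x.
cost = {("x", "y"): 1.0, ("y", "x"): 2.0}
sigma = {"x": "y", "y": "x"}

def T_sigma(v):
    """Shortest path policy operator: (T_sigma v)(s) = c(s, sigma(s)) + v(sigma(s))."""
    return {s: cost[(s, sigma[s])] + v[sigma[s]] for s in sigma}

v = {"x": 0.0, "y": 0.0}
history = [v["x"]]               # value at x along the iteration
for _ in range(6):
    v = T_sigma(v)
    history.append(v["x"])

# Increment over every two steps equals c(x, y) + c(y, x)
two_step_gain = [history[k + 2] - history[k] for k in range(len(history) - 2)]
```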

Let R be an RDP with policy operators {𝑇𝜎 }𝜎∈Σ . In what follows, we call R globally
stable if 𝑇𝜎 is globally stable on 𝑉 for all 𝜎 ∈ Σ.
Example 8.1.15. The optimal stopping RDP we introduced in Example 8.1.3 is glob-
ally stable, since, for each 𝜎 ∈ Σ, the policy operator 𝑇𝜎 is globally stable on RX by
Proposition 4.1.1 on page 108.
Example 8.1.16. The RDP generated by the MDP model in Example 8.1.1 is globally
stable. See Exercise 5.1.7 on page 135.

Obviously every globally stable RDP is well-posed.


Remark 8.1.1. In line with our discussion of stability of Koopmans operators in
§7.3.1.7, for 𝑣 ∈ 𝑉, 𝜎 ∈ Σ, 𝑚 ∈ N and 𝑥 ∈ X, it is natural to interpret (𝑇𝜎^𝑚 𝑣)(𝑥) as
total (finite horizon) utility over periods 0, . . . , 𝑚 under policy 𝜎, with initial state 𝑥
and terminal condition 𝑣 ∈ 𝑉 . Hence global stability implies that, for any choice of
terminal condition, finite horizon valuations always converge to their infinite horizon
counterparts.

In §8.1.3 we will see that global stability yields strong optimality properties.

8.1.2.3 Continuity

Let R = (Γ, 𝑉, 𝐵) be an RDP. We call R continuous if 𝐵(𝑥, 𝑎, 𝑣) is continuous in 𝑣 for
all (𝑥, 𝑎) ∈ G. In other words, R is continuous if, for any 𝑣 ∈ 𝑉, any (𝑥, 𝑎) ∈ G and any
sequence (𝑣𝑘)𝑘⩾1 in 𝑉, we have

\[ \lim_{k \to \infty} B(x, a, v_k) = B(x, a, v) \quad \text{whenever} \quad \lim_{k \to \infty} v_k = v. \]

Continuity is satisfied by all applications considered in this text. For example, for the
RDP generated by an MDP (Example 8.1.1), the deviation |𝐵(𝑥, 𝑎, 𝑣𝑘) − 𝐵(𝑥, 𝑎, 𝑣)| is
dominated by 𝛽‖𝑣𝑘 − 𝑣‖∞ for all (𝑥, 𝑎) ∈ G. Hence continuity holds.
Below we will see that continuity is useful when considering convergence of certain
algorithms.

8.1.3 Optimality

In this section we present optimality theory for RDPs.

8.1.3.1 Greedy Policies

Given an RDP R = (Γ, 𝑉, 𝐵) and 𝑣 ∈ 𝑉, a policy 𝜎 ∈ Σ is called 𝑣-greedy if

\[ \sigma(x) \in \operatorname*{argmax}_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in X. \tag{8.13} \]

Since Γ ( 𝑥 ) is finite and nonempty at each 𝑥 ∈ X, at least one such policy exists. As
with policy operators, the notion of greedy policies extends existing definitions from
earlier chapters.

EXERCISE 8.1.7. Show that, for each 𝑣 ∈ 𝑉 , the set {𝑇𝜎 𝑣}𝜎∈Σ ⊂ 𝑉 contains a
least and greatest element (see §2.2.1.2 for the definitions). Explain the connection
between the greatest element and 𝑣-greedy policies.

Given RDP R = (Γ, 𝑉, 𝐵), we say that 𝑣 ∈ 𝑉 satisfies the Bellman equation if
𝑣(𝑥) = max_{𝑎 ∈ Γ(𝑥)} 𝐵(𝑥, 𝑎, 𝑣) for all 𝑥 ∈ X. The Bellman operator corresponding to R is
the map 𝑇 on 𝑉 defined by

\[ (Tv)(x) = \max_{a \in \Gamma(x)} B(x, a, v) \qquad (x \in X). \]
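A short sketch of greedy policies and the Bellman operator follows, again for a hypothetical two-state MDP aggregator. The names `greedy`, `T`, and `T_sigma` are assumptions. Ties in the argmax are broken by taking the first maximizer in the order of Γ(𝑥), which is one possible selection convention. The final comparison illustrates that applying 𝑇 coincides with applying the policy operator of a 𝑣-greedy policy.

```python
# Hypothetical two-state, two-action MDP aggregator.
X = [0, 1]
Gamma = {0: [0, 1], 1: [0, 1]}
beta = 0.9
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def B(x, a, v):
    return r[(x, a)] + beta * sum(v[xp] * P[(x, a)][xp] for xp in X)

def T(v):
    """Bellman operator: maximize the aggregator over feasible actions."""
    return [max(B(x, a, v) for a in Gamma[x]) for x in X]

def greedy(v):
    """A v-greedy policy (8.13); max returns the first maximizer on ties."""
    return [max(Gamma[x], key=lambda a: B(x, a, v)) for x in X]

def T_sigma(sigma, v):
    """Policy operator (8.12)."""
    return [B(x, sigma[x], v) for x in X]

v = [0.0, 10.0]
sigma = greedy(v)

agrees = T_sigma(sigma, v) == T(v)          # greedy policy attains the max
non_greedy_differs = T_sigma([0, 0], v) != T(v)
```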

Example 8.1.17. For the Epstein–Zin RDP in (8.10), the Bellman operator is given
by

\[ (Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \left[ \sum_{x'} v(x')^{\gamma} P(x, a, x') \right]^{\alpha/\gamma} \right\}^{1/\alpha} \qquad (x \in X). \]

EXERCISE 8.1.8. Given RDP R = (Γ, 𝑉, 𝐵) with policy operators {𝑇𝜎} and Bellman
operator 𝑇, show that, for each 𝑣 ∈ 𝑉,
(i) 𝑇𝑣 = ⋁𝜎 𝑇𝜎 𝑣 ≔ ⋁𝜎∈Σ (𝑇𝜎 𝑣),
(ii) 𝜎 is 𝑣-greedy if and only if 𝑇𝑣 = 𝑇𝜎 𝑣, and
(iii) 𝑇 is an order-preserving self-map on 𝑉.

EXERCISE 8.1.9. Show that, for a given RDP (Γ, 𝑉, 𝐵) and fixed 𝑣 ∈ 𝑉, the Bellman
operator 𝑇 obeys

\[ (T^k v)(x) = \max_{a \in \Gamma(x)} B(x, a, T^{k-1} v) \tag{8.14} \]

for all 𝑘 ∈ Z+ and all 𝑥 ∈ X. Show, in addition, that for any policy 𝜎 ∈ Σ, the policy
operator 𝑇𝜎 obeys

\[ (T_\sigma^k v)(x) = B(x, \sigma(x), T_\sigma^{k-1} v) \tag{8.15} \]

for all 𝑘 ∈ Z+ and all 𝑥 ∈ X.

8.1.3.2 Algorithms

To solve RDPs for optimal policies, we use two core algorithms: Howard policy itera-
tion (HPI) and optimistic policy iteration (OPI). As in previous chapters, OPI includes
VFI as a special case.
To describe HPI we take R = ( Γ, 𝑉, 𝐵) to be a well-posed RDP with feasible policy set
Σ, policy operators {𝑇𝜎 }, and Bellman operator 𝑇 . In this setting, the HPI algorithm is
essentially identical to the one given for MDPs in §5.1.4.2, except that 𝑣𝜎 is calculated
as the fixed point of 𝑇𝜎 , rather than taking the specific form ( 𝐼 − 𝛽𝑃𝜎 ) −1 𝑟𝜎 . The details
are in Algorithm 8.1.

Algorithm 8.1: Howard policy iteration for RDPs

input 𝜎 ∈ Σ
𝑣_0 ← 𝑣_𝜎 and 𝑘 ← 0
repeat
    𝜎_𝑘 ← a 𝑣_𝑘-greedy policy
    𝑣_{𝑘+1} ← the fixed point of 𝑇_{𝜎_𝑘}
    if 𝑣_{𝑘+1} = 𝑣_𝑘 then break
    𝑘 ← 𝑘 + 1
return 𝜎_𝑘
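One way to sketch Algorithm 8.1 in Python is shown below for a hypothetical two-state MDP aggregator. Two departures from the algorithm as stated are assumptions of this sketch: the fixed point of each policy operator is approximated by iteration rather than computed exactly, and the exact equality test is replaced by a tolerance. The names `sigma_value`, `greedy`, and `howard_policy_iteration` are illustrative.

```python
# Hypothetical two-state, two-action MDP aggregator.
X = [0, 1]
Gamma = {0: [0, 1], 1: [0, 1]}
beta = 0.9
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def B(x, a, v):
    return r[(x, a)] + beta * sum(v[xp] * P[(x, a)][xp] for xp in X)

def T_sigma(sigma, v):
    return [B(x, sigma[x], v) for x in X]

def greedy(v):
    return [max(Gamma[x], key=lambda a: B(x, a, v)) for x in X]

def dist(v, w):
    return max(abs(s - t) for s, t in zip(v, w))

def sigma_value(sigma, tol=1e-12, max_iter=10_000):
    """Approximate v_sigma by iterating T_sigma from v = 0."""
    v = [0.0 for _ in X]
    for _ in range(max_iter):
        v_new = T_sigma(sigma, v)
        if dist(v_new, v) < tol:
            return v_new
        v = v_new
    return v

def howard_policy_iteration(sigma, tol=1e-9, max_iter=1_000):
    """Sketch of Algorithm 8.1 with approximate policy evaluation."""
    v = sigma_value(sigma)                  # v_0 = v_sigma
    for _ in range(max_iter):
        sigma = greedy(v)                   # sigma_k is v_k-greedy
        v_new = sigma_value(sigma)          # v_{k+1} = fixed point of T_{sigma_k}
        if dist(v_new, v) <= tol:           # tolerance replaces exact equality
            return sigma, v_new
        v = v_new
    return sigma, v

sigma_star, v_star = howard_policy_iteration([1, 1])
```

For this example the routine stops after a handful of policy updates, in line with the finite termination of HPI.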

Algorithm 8.1 is somewhat ambiguous, since it is not always clear how to implement
the instruction “𝑣_{𝑘+1} ← the fixed point of 𝑇_{𝜎_𝑘}”. However, if R is globally stable,
then each 𝑇_{𝜎_𝑘} is globally stable, so an approximation of the fixed point can be calculated
by iterating with 𝑇_{𝜎_𝑘}. This line of thought leads us to consider optimistic policy
iteration (OPI) as a more practical alternative. Algorithm 8.2 states an OPI routine
for solving R that generalizes the MDP OPI routine in §5.1.4.

Algorithm 8.2: Optimistic policy iteration for RDPs

input 𝑚 ∈ N and tolerance 𝜏 ⩾ 0
input 𝜎 ∈ Σ and set 𝑣_0 ← 𝑣_𝜎
𝑘 ← 0
repeat
    𝜎_𝑘 ← a 𝑣_𝑘-greedy policy
    𝑣_{𝑘+1} ← 𝑇_{𝜎_𝑘}^𝑚 𝑣_𝑘
    if ‖𝑣_{𝑘+1} − 𝑣_𝑘‖ ⩽ 𝜏 then break
    𝑘 ← 𝑘 + 1
return 𝜎_𝑘
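A corresponding sketch of Algorithm 8.2 is given below for the same kind of hypothetical two-state MDP aggregator. The function and variable names are assumptions, the initial 𝑣_0 = 𝑣_𝜎 is approximated by iteration, and the supremum norm is used in the stopping rule. Running it with 𝑚 = 1 corresponds to VFI.

```python
# Hypothetical two-state, two-action MDP aggregator.
X = [0, 1]
Gamma = {0: [0, 1], 1: [0, 1]}
beta = 0.9
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def B(x, a, v):
    return r[(x, a)] + beta * sum(v[xp] * P[(x, a)][xp] for xp in X)

def T_sigma(sigma, v):
    return [B(x, sigma[x], v) for x in X]

def greedy(v):
    return [max(Gamma[x], key=lambda a: B(x, a, v)) for x in X]

def dist(v, w):
    return max(abs(s - t) for s, t in zip(v, w))

def sigma_value(sigma, tol=1e-12, max_iter=10_000):
    """Approximate v_sigma by iterating T_sigma from v = 0."""
    v = [0.0 for _ in X]
    for _ in range(max_iter):
        v_new = T_sigma(sigma, v)
        if dist(v_new, v) < tol:
            return v_new
        v = v_new
    return v

def optimistic_policy_iteration(sigma, m, tau=1e-10, max_iter=100_000):
    """Sketch of Algorithm 8.2: m policy-operator steps per greedy update."""
    v = sigma_value(sigma)                  # v_0 = v_sigma
    sigma_k = sigma
    for _ in range(max_iter):
        sigma_k = greedy(v)                 # sigma_k is v_k-greedy
        v_new = v
        for _step in range(m):              # v_{k+1} = T_{sigma_k}^m v_k
            v_new = T_sigma(sigma_k, v_new)
        if dist(v_new, v) <= tau:
            return sigma_k, v_new
        v = v_new
    return sigma_k, v

sigma_vfi, v_vfi = optimistic_policy_iteration([1, 1], m=1)    # VFI
sigma_opi, v_opi = optimistic_policy_iteration([1, 1], m=20)
```

Larger 𝑚 means fewer greedy updates but more policy-operator steps per update, so the best choice depends on the relative cost of maximization in the application at hand.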

In Algorithm 8.2 we require that 𝑣0 = 𝑣𝜎 for some 𝜎 ∈ Σ. This assumption can
be dropped in some settings. For practical purposes, however, it is almost always
straightforward to initialize OPI with 𝑣0 = 𝑣𝜎 for some simple choice of 𝜎.

EXERCISE 8.1.10. Prove that, for the sequence (𝑣𝑘) in the OPI algorithm 8.2, we
have 𝑣𝑘 = 𝑇^𝑘 𝑣0 whenever 𝑚 = 1. (In other words, OPI reduces to VFI when 𝑚 = 1.)

When we turn to proofs, it will help to have an operator-theoretic description of
HPI and OPI. To this end, we define two operators. The first is 𝐻 : 𝑉 → {𝑣𝜎}, which is
defined via

𝐻𝑣 = 𝑣𝜎 where 𝜎 is 𝑣-greedy. (8.17)

We call 𝐻 the Howard operator generated by R. Iterating with 𝐻 implements HPI.
In particular, if we fix 𝜎 ∈ Σ and set 𝑣𝑘 = 𝐻^𝑘 𝑣𝜎, then (𝑣𝑘)𝑘⩾0 is the sequence of 𝜎-value
functions generated by HPI.¹
Next, fixing 𝑚 ∈ N, we define the operator 𝑊𝑚 from 𝑉 to itself via

𝑊𝑚 𝑣 ≔ 𝑇𝜎^𝑚 𝑣 where 𝜎 is 𝑣-greedy.

(See footnote 1 on the choice of 𝑣-greedy policies.) The operator 𝑊𝑚 is an approximation
of 𝐻, since 𝑇𝜎^𝑚 𝑣 → 𝑣𝜎 = 𝐻𝑣 as 𝑚 → ∞. Iterating with 𝑊𝑚 generates the value
sequence in OPI. More specifically, we take 𝑣0 ∈ {𝑣𝜎} and generate

(𝑣𝑘, 𝜎𝑘)𝑘⩾0 where 𝑣𝑘 = 𝑊𝑚^𝑘 𝑣0 and 𝜎𝑘 is 𝑣𝑘-greedy. (8.18)

This produces an infinite sequence of OPI value and policy iterates.

¹ For 𝐻 to be well-defined, we must always select the same 𝑣-greedy policy when the operator is
applied to 𝑣. To this end, we enumerate the policy set Σ and choose the first 𝑣-greedy policy. This
choice of convention has no effect on convergence results.

8.1.3.3 Optimality

Let R be a well-posed RDP with policy operators {𝑇𝜎} and 𝜎-value functions {𝑣𝜎}.
In this context, we set 𝑣∗ ≔ ⋁𝜎 𝑣𝜎 ∈ R^X and call 𝑣∗ the value function of R. By
definition, 𝑣∗ satisfies

\[ v^*(x) = \max_{\sigma \in \Sigma} v_\sigma(x) \quad \text{for all } x \in X. \tag{8.19} \]

A policy 𝜎 is called optimal for R if 𝑣𝜎 = 𝑣∗; that is, if

𝑣𝜎(𝑥) ⩾ 𝑣𝜏(𝑥) for all 𝜏 ∈ Σ and all 𝑥 ∈ X.

Both of these definitions generalize the definitions we used for MDPs and optimal
stopping. In particular, optimality of a policy means that it generates maximum pos-
sible lifetime value from every state.
We say that R satisfies Bellman’s principle of optimality if

𝜎 ∈ Σ is optimal for R ⇐⇒ 𝜎 is 𝑣∗ -greedy.

We can now state our main optimality result for RDPs. In the statement, R is a well-
posed RDP with value function 𝑣∗ .

Theorem 8.1.1. If R is globally stable, then

(i) 𝑣∗ is the unique solution to the Bellman equation in 𝑉 ,


(ii) R satisfies Bellman’s principle of optimality,
(iii) R has at least one optimal policy,
(iv) HPI returns an optimal policy in finitely many steps, and
(v) the OPI sequence in (8.18) is such that 𝑣𝑘 → 𝑣∗ as 𝑘 → ∞ and, moreover, there
exists a 𝐾 ∈ N such that 𝜎𝑘 is optimal for all 𝑘 ⩾ 𝐾 .

As OPI includes VFI as a special case (𝑚 = 1), Theorem 8.1.1 also implies conver-
gence of VFI under the stated conditions.
In terms of applications, Theorem 8.1.1 is the most important optimality result in
this book. It provides the core optimality results from dynamic programming and a
broadly convergent algorithm for computing optimal policies.
The proof of Theorem 8.1.1 is deferred to §9.1.

Example 8.1.18. The optimality results for optimal stopping problems we presented
in Chapter 4 are a special case of Theorem 8.1.1, since such optimal stopping problems
generate globally stable RDPs (as discussed in Example 8.1.15).

Example 8.1.19. The optimality results for MDPs we presented in Chapter 5 are a
special case of Theorem 8.1.1, since MDPs generate globally stable RDPs (as discussed
in Example 8.1.16).

Examples 8.1.18–8.1.19 are relatively elementary. More complex models will be
handled in §8.2.

8.1.3.4 Comments on the Optimality Theorem

Many traditional treatments of dynamic programming build optimality theory around
contractivity (see, e.g., Puterman (2005) or Stokey and Lucas (1989), Section 4.2).
Assumptions are constructed so that the policy operators and Bellman operator are
all contraction mappings.
While such assumptions are sufficient for Theorem 8.1.1 (since contractivity of the
policy operators implies stability), they are not necessary. There are a variety of ways
to prove uniqueness and stability of fixed points, including the monotonicity-based
methods discussed in §7.1.2 and the spectral methods in §6.1.3.2. These alternatives
will prove useful in settings where contractivity fails, as we shall see in §8.2.
Another point worth noting about the conditions in Theorem 8.1.1 is that no as-
sumptions are placed on the Bellman operator. Rather, one only needs to check prop-
erties of the policy operators. This is advantageous because, unlike the Bellman op-
erator, the policy operators do not involve maximization.

8.1.3.5 Nonstationary Policies

Up until now we have focused entirely on stationary policies, in the sense that the
same policy is used at every point in time. What if we drop this assumption and
admit the option to change policies? Might this lead to higher lifetime values?
In this section, we show that for globally stable RDPs the answer is negative. This
finding justifies our focus on stationary policies.
To begin, let R = (Γ, 𝑉, 𝐵) be a globally stable RDP. Recall from Remark 8.1.1 that,
given 𝑣 ∈ 𝑉, 𝜎 ∈ Σ, 𝑘 ∈ N and 𝑥 ∈ X, the value (𝑇𝜎^𝑘 𝑣)(𝑥) gives finite horizon utility over
periods 0, . . . , 𝑘 under policy 𝜎, with initial state 𝑥 and terminal condition 𝑣. Extending
this idea, it is natural to understand 𝑇_{𝜎_𝑘} 𝑇_{𝜎_{𝑘−1}} ⋯ 𝑇_{𝜎_1} 𝑣 as providing finite horizon utility
values for the nonstationary policy sequence (𝜎𝑘)𝑘∈N ⊂ Σ, given terminal condition
𝑣 ∈ 𝑉. For the same policy sequence, we define its lifetime value via

\[ \bar v := \limsup_{k \to \infty} v_k \quad \text{with} \quad v_k := T_{\sigma_k} T_{\sigma_{k-1}} \cdots T_{\sigma_1} v \]

whenever the limsup is finite and independent of the terminal condition 𝑣.
Suppose that this is the case, and hence 𝑣̄ is well defined. We claim that 𝑣̄ ⩽ 𝑣∗.

EXERCISE 8.1.11. Show that, under the stated conditions, 𝑣𝑘 ⩽ 𝑇^𝑘 𝑣 for all 𝑘 ∈ N.

Since 𝑣̄ is independent of the terminal condition 𝑣, we can assume without loss of
generality that 𝑣 ∈ 𝑉Σ, where 𝑉Σ ≔ {𝑣𝜎 : 𝜎 ∈ Σ} is the set of 𝜎-value functions. By
Theorem 8.1.1, we have 𝑇^𝑘 𝑣 → 𝑣∗ as 𝑘 → ∞. Hence, by Exercise 8.1.11,

\[ \bar v = \limsup_{k \to \infty} v_k \leqslant \limsup_{k \to \infty} T^k v = \lim_{k \to \infty} T^k v = v^*, \]

as was to be shown.

8.1.3.6 Bounded RDPs

We call an RDP R = (Γ, 𝑉, 𝐵) bounded if 𝑉 is convex and, moreover, there exist functions
𝑣1, 𝑣2 ∈ 𝑉 such that 𝑣1 ⩽ 𝑣2 and

𝑣1(𝑥) ⩽ 𝐵(𝑥, 𝑎, 𝑣1) and 𝐵(𝑥, 𝑎, 𝑣2) ⩽ 𝑣2(𝑥) for all (𝑥, 𝑎) ∈ G. (8.20)

We show below that boundedness can be used to obtain optimality results for well-
posed RDPs, even without global stability.
Another attractive feature of boundedness is that it permits a reduction of value
space, as illustrated by the next two exercises.

EXERCISE 8.1.12. Let (Γ, 𝑉, 𝐵) be bounded and let 𝑣1, 𝑣2 ∈ 𝑉 be such that (8.20)
holds. Prove that, in this setting, (Γ, 𝑉̂, 𝐵) is also an RDP when 𝑉̂ ≔ [𝑣1, 𝑣2].

EXERCISE 8.1.13. Adopt the setting of Exercise 8.1.12 and suppose, in addition,
that the RDP is well-posed. Show that 𝑣𝜎 ∈ 𝑉̂ for all 𝜎 ∈ Σ.

Exercise 8.1.13 implies the reduced RDP (Γ, 𝑉̂, 𝐵) is also well-posed under the
stated conditions, and that it contains all the 𝜎-value functions and the value function
from the original RDP (Γ, 𝑉, 𝐵). Hence any optimality results for (Γ, 𝑉̂, 𝐵) carry over
to (Γ, 𝑉, 𝐵).

EXERCISE 8.1.14. Show that the RDP generated by an MDP in Example 8.1.1 is
bounded.

EXERCISE 8.1.15. Show that the optimal stopping RDP from Example 8.1.3 is
bounded.

EXERCISE 8.1.16. Consider the RDP (Γ, 𝑉, 𝐵) generated by an MDP with state-dependent
discounting in Example 8.1.5. Prove that this RDP is bounded whenever the conditions
of Proposition 6.2.2 on page 193 hold.

EXERCISE 8.1.17. Consider the shortest path RDP ( Γ, 𝑉, 𝐵) in Example 8.1.8 and
assume in addition that the graph G contains only one cycle, which is a self-loop at 𝑑 ,
that 𝑑 is accessible from every vertex 𝑥 ∈ X, and that 𝑐 ( 𝑑, 𝑑 ) = 0. (These assumptions
imply that every path leads to 𝑑 in finite time and that travelers reaching 𝑑 remain
there forever at zero cost.) Let 𝐶 ( 𝑥 ) be the maximum cost of traveling to 𝑑 from 𝑥 ,
which is finite by the stated assumptions. Show that (8.20) holds when 𝑣1 ≔ 0 and
𝑣2 ≔ 𝐶 .

The next result shows that, when considering optimality, stability can be replaced
by boundedness.
Theorem 8.1.2. If R is well-posed and bounded, then (i)–(iv) of Theorem 8.1.1 hold.

8.1.4 Topologically Conjugate RDPs

Sometimes RDP models can be simplified by transformations over value space. In
this section we investigate such transformations. The underlying ideas are related to
topological conjugacy of dynamical systems, which we introduced in §2.1.1.2.
To begin, let R = (Γ, 𝑉, 𝐵) and R̂ = (Γ, 𝑉̂, 𝐵̂) be two RDPs with identical state space
X, action space A and feasible correspondence Γ. We consider settings where

\[ V = M^X \quad \text{and} \quad \hat V = \hat M^X \quad \text{where } M, \hat M \subset \mathbb{R}, \]

and where, in addition, there exists a homeomorphism 𝜑 from M onto M̂ such that

\[ B(x, a, v) = \varphi^{-1} [ \hat B(x, a, \varphi \circ v) ] \quad \text{for all } v \in V \text{ and } (x, a) \in G. \tag{8.21} \]

We call R and R̂ topologically conjugate under 𝜑 if 𝜑 is a homeomorphism from
M to M̂ and (8.21) holds.
Here is our main result for this section.
Proposition 8.1.3. If R and R̂ are topologically conjugate, then R is globally stable if
and only if R̂ is globally stable.

The benefit of Proposition 8.1.3 is that one of these models might be easier to
analyze than the other. We apply the proposition to the Epstein–Zin specification in
§8.1.4.1 and to a smooth ambiguity model in §8.3.4. The next exercise will be useful
for the proof.

EXERCISE 8.1.18. Prove the following: If 𝜑 is a homeomorphism from M to M̂ and
Φ𝑣 ≔ 𝜑 ∘ 𝑣, then Φ is a homeomorphism from 𝑉 to 𝑉̂.

Proof of Proposition 8.1.3. By Exercise 8.1.18, Φ𝑣 ≔ 𝜑 ∘ 𝑣 is a homeomorphism from
𝑉 to 𝑉̂. Moreover, for any 𝜎 ∈ Σ, the respective policy operators 𝑇𝜎 and 𝑇̂𝜎 are linked
by

\[ (T_\sigma v)(x) = B(x, \sigma(x), v) = \varphi^{-1} [ \hat B(x, \sigma(x), \varphi \circ v) ] = \varphi^{-1} [ (\hat T_\sigma (\varphi \circ v))(x) ]. \]

This shows that 𝑇𝜎 = Φ⁻¹ ∘ 𝑇̂𝜎 ∘ Φ on 𝑉. Hence (𝑉, 𝑇𝜎) and (𝑉̂, 𝑇̂𝜎) are topologically
conjugate dynamical systems, from which it follows that 𝑇𝜎 is globally stable if and only
if 𝑇̂𝜎 is globally stable (see page 45). This completes the proof of Proposition 8.1.3. □

In the next section we will see how these ideas can simplify optimality analysis.

8.1.4.1 Application: Epstein–Zin RDPs

In this section we show how the preceding optimality results and the notion of topological
conjugacy can be deployed to analyze the Epstein–Zin RDP from Example 8.1.7.
Recall that the aggregator in Example 8.1.7 is

\[ B(x, a, v) = \left\{ r(x, a) + \beta \left( \sum_{x'} v(x')^{\gamma} P(x, a, x') \right)^{\alpha/\gamma} \right\}^{1/\alpha}. \tag{8.22} \]

Let 𝑉 = (0, ∞)^X. We assume that 𝑟 ≫ 0 and take a nonempty feasible correspondence
Γ as given. Exercise 8.1.5 on page 250 confirmed that R ≔ (Γ, 𝑉, 𝐵) is an RDP.
We will call the stochastic kernel 𝑃 irreducible if 𝑃(𝑥, 𝜎(𝑥), 𝑥′) is irreducible for all
𝜎 ∈ Σ. Below we establish stability of R under irreducibility.

Proposition 8.1.4. If 𝑃 is irreducible, then R is globally stable.

To prove Proposition 8.1.4, we set up a simpler and more tractable model. Our
first step is to introduce another RDP by setting

\[ \hat B(x, a, v) = \left\{ B(x, a, v^{1/\gamma}) \right\}^{\gamma}. \tag{8.23} \]

We set R ≔ (Γ, 𝑉, 𝐵) and R̂ ≔ (Γ, 𝑉, 𝐵̂). Notice that 𝐵̂ can also be expressed as

\[ \hat B(x, a, v) = \left\{ r(x, a) + \beta \left( \sum_{x'} v(x') P(x, a, x') \right)^{1/\theta} \right\}^{\theta}, \tag{8.24} \]

where 𝜃 ≔ 𝛾/𝛼.
The value of introducing R̂ comes from the fact that R̂ is easier to work with than
R (just as the modified Epstein–Zin Koopmans operator 𝐾̂ defined in §7.2.3.3 turned
out to be easier to work with than the original Epstein–Zin Koopmans operator 𝐾
introduced in §7.2.3.2).
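The relationship between the two aggregators can be checked numerically. The sketch below codes the Epstein–Zin aggregator (8.22) and the transformed aggregator (8.24) for hypothetical parameters and primitives, and then evaluates the conjugacy relation (8.21) with 𝜑(𝑡) = 𝑡^𝛾 at one strictly positive 𝑣. The names `B`, `B_hat`, `phi`, and `phi_inv` are assumptions for illustration.

```python
# Hypothetical parameters (alpha and gamma nonzero) and primitives with r >> 0.
X = [0, 1]
beta, alpha, gamma = 0.9, 0.5, -2.0
theta = gamma / alpha
r = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def B(x, a, v):
    """Epstein-Zin aggregator (8.22)."""
    s = sum(v[xp] ** gamma * P[(x, a)][xp] for xp in X)
    return (r[(x, a)] + beta * s ** (alpha / gamma)) ** (1.0 / alpha)

def B_hat(x, a, v):
    """Transformed aggregator (8.24)."""
    s = sum(v[xp] * P[(x, a)][xp] for xp in X)
    return (r[(x, a)] + beta * s ** (1.0 / theta)) ** theta

def phi(t):
    """Candidate homeomorphism of (0, inf) onto itself."""
    return t ** gamma

def phi_inv(t):
    return t ** (1.0 / gamma)

v = [1.2, 3.4]                   # a strictly positive value function
gaps = []
for (x, a) in r:
    lhs = B(x, a, v)
    rhs = phi_inv(B_hat(x, a, [phi(t) for t in v]))   # right side of (8.21)
    gaps.append(abs(lhs - rhs))
```

Agreement at one point is of course only suggestive, and the general claim is the content of Exercise 8.1.19.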

EXERCISE 8.1.19. Prove that R and R̂ are topologically conjugate RDPs (see §8.1.4).

Now we investigate the properties of the simpler RDP R̂.

Lemma 8.1.5. If 𝑃 is irreducible, then R̂ is a globally stable RDP.

Proof. In view of (8.24), each policy operator 𝑇̂𝜎 associated with R̂ takes the form

\[ (\hat T_\sigma v)(x) = \left\{ r(x, \sigma(x)) + \beta \left( \sum_{x'} v(x') P(x, \sigma(x), x') \right)^{1/\theta} \right\}^{\theta}. \tag{8.25} \]

Each such 𝑇̂𝜎 is a special case of 𝐾̂ defined on page 229 by 𝐾̂𝑣 = {ℎ + 𝛽(𝑃𝑣)^{1/𝜃}}^𝜃 (see
(7.15)). We saw in §7.2.3.3 that this operator is globally stable under the stated
assumptions. Hence R̂ is a globally stable RDP. □

Now we can complete the

Proof of Proposition 8.1.4. Exercise 8.1.19 and Proposition 8.1.3 on page 260 together
imply that R is globally stable if and only if R̂ is globally stable. The claim that R is
globally stable now follows from Lemma 8.1.5. □

8.2 Types of RDPs

In §8.1 we showed that well-posed RDPs have strong optimality properties whenever
they are globally stable or bounded, and that VFI and OPI converge whenever they
are globally stable. But what conditions are sufficient for these properties? We start
with a relatively strict condition based on contractivity and then progress to models
that fail to be contractive.

8.2.1 Contracting RDPs

In this section we study RDPs with strong contraction properties. Many traditional
dynamic programs fit into this framework.

8.2.1.1 Definition and Examples

Let R = (Γ, 𝑉, 𝐵) be an RDP with state space X, action space A, and feasible state-action
pair set G. We call R contracting if there exists a 𝛽 < 1 such that

$$
|B(x, a, v) - B(x, a, w)| \leq \beta \|v - w\|_\infty \quad \text{for all } (x, a) \in \mathsf G \text{ and } v, w \in V. \tag{8.26}
$$

In line with the terminology for contraction maps, we call 𝛽 the modulus of contraction
for R when (8.26) holds.

Example 8.2.1. The optimal stopping RDP from Example 8.1.3 is contracting with
modulus 𝛽, since, for 𝐵 in (8.6), an application of the triangle inequality gives

$$
|B(x, a, v) - B(x, a, w)| = (1 - a) \beta \left| \sum_{x'} [v(x') - w(x')] P(x, x') \right| \leq \beta \|v - w\|_\infty .
$$

EXERCISE 8.2.1. Show that any MDP is a contracting RDP.

EXERCISE 8.2.2. Show that every contracting RDP is value-continuous.

Proposition 8.2.1. If R is contracting with modulus 𝛽, then 𝑇 and {𝑇𝜎}𝜎∈Σ are all
contractions of modulus 𝛽 on 𝑉 under the norm ‖·‖∞.

Proof. Let R = (Γ, 𝑉, 𝐵) be contracting with modulus 𝛽. Fix 𝜎 ∈ Σ and let 𝑣 and 𝑤 be
elements of 𝑉. By (8.26) we have

$$
|(T_\sigma v)(x) - (T_\sigma w)(x)| = |B(x, \sigma(x), v) - B(x, \sigma(x), w)| \leq \beta \|v - w\|_\infty \tag{8.27}
$$

for every 𝑥 ∈ X. Maximizing over 𝑥 proves that 𝑇𝜎 is a contraction of modulus 𝛽 with
respect to the supremum norm.
Contractivity of 𝑇 now follows from Lemma 2.2.3 on page 59. □

The following corollary to Proposition 8.2.1 is immediate from Banach’s contraction
mapping theorem.

Corollary 8.2.2. If R is contracting and 𝑉 is closed in ℝ^X, then R is globally stable and
hence all of the optimality results in Theorem 8.1.1 apply.

8.2.1.2 Error Bounds

Corollary 8.2.2 tells us that contracting RDPs are globally stable and, as a result, the
sequence of functions in 𝑉 generated by VFI (Algorithm 8.2 with 𝑚 = 1) converges to
𝑣∗. However, this result is asymptotic and conditions on 𝑣0 = 𝑣𝜎 for some 𝜎 ∈ Σ. We
can improve this result in the current setting by leveraging the contraction property:

Proposition 8.2.3. Let (Γ, 𝑉, 𝐵) be a contracting RDP with modulus of contraction 𝛽 and
Bellman operator 𝑇. Fix 𝑣 ∈ 𝑉 and let 𝑣_𝑘 = 𝑇^𝑘 𝑣. If 𝜎 is 𝑣_𝑘-greedy, then

$$
\|v^* - v_\sigma\|_\infty \leq \frac{2\beta}{1 - \beta} \|v_k - v_{k-1}\|_\infty \quad \text{for all } k \in \mathbb N. \tag{8.28}
$$

Since the VFI algorithm terminates when ‖𝑣_𝑘 − 𝑣_{𝑘−1}‖∞ falls below a given tolerance,
the result in (8.28) directly provides a quantitative bound on the performance of the
policy returned by VFI.

Proof of Proposition 8.2.3. Let (Γ, 𝑉, 𝐵) and 𝑣 be as stated and let 𝑣∗ be the value
function. Note that

$$
\|v^* - v_\sigma\|_\infty \leq \|v^* - v_k\|_\infty + \|v_k - v_\sigma\|_\infty . \tag{8.29}
$$

To bound the first term on the right-hand side of (8.29), we use the fact that 𝑣∗ is a
fixed point of 𝑇, obtaining

$$
\|v^* - v_k\|_\infty \leq \|v^* - T v_k\|_\infty + \|T v_k - v_k\|_\infty \leq \beta \|v^* - v_k\|_\infty + \beta \|v_k - v_{k-1}\|_\infty .
$$

Hence

$$
\|v^* - v_k\|_\infty \leq \frac{\beta}{1 - \beta} \|v_k - v_{k-1}\|_\infty . \tag{8.30}
$$

Now consider the second term on the right-hand side of (8.29). Since 𝜎 is 𝑣_𝑘-greedy,
we have 𝑇𝑣_𝑘 = 𝑇𝜎𝑣_𝑘, and

$$
\|v_k - v_\sigma\|_\infty \leq \|v_k - T v_k\|_\infty + \|T v_k - v_\sigma\|_\infty = \|T v_{k-1} - T v_k\|_\infty + \|T_\sigma v_k - T_\sigma v_\sigma\|_\infty .
$$

$$
\therefore \quad \|v_k - v_\sigma\|_\infty \leq \beta \|v_{k-1} - v_k\|_\infty + \beta \|v_k - v_\sigma\|_\infty .
$$

$$
\therefore \quad \|v_k - v_\sigma\|_\infty \leq \frac{\beta}{1 - \beta} \|v_k - v_{k-1}\|_\infty . \tag{8.31}
$$

Together, (8.29), (8.30), and (8.31) give us (8.28). □
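The bound in (8.28) can be checked numerically on a small example. The Julia sketch below builds a randomly generated MDP (a contracting RDP by Exercise 8.2.1), runs a few steps of VFI, and compares the two sides of (8.28) at each step. The model sizes, the random primitives and all function names are assumptions made for illustration; 𝑣∗ is approximated by iterating 𝑇 many times.

```julia
using LinearAlgebra, Random

Random.seed!(1234)
n, m, β = 5, 3, 0.9                  # states, actions, discount factor
r = rand(n, m)                       # rewards r(x, a)
P = rand(n, m, n)                    # unnormalized kernel P(x, a, x′)
P ./= sum(P, dims=3)                 # make each P(x, a, ⋅) a distribution

# Q(v)[x, a] = B(x, a, v) = r(x, a) + β Σ_x′ v(x′) P(x, a, x′)
Q(v) = [r[x, a] + β * dot(P[x, a, :], v) for x in 1:n, a in 1:m]
T(v) = vec(maximum(Q(v), dims=2))    # Bellman operator
greedy(v) = [argmax(Q(v)[x, :]) for x in 1:n]
sup_norm(v) = maximum(abs, v)

"Lifetime value v_σ = (I - β P_σ)⁻¹ r_σ of the policy σ."
function policy_value(σ)
    P_σ = [P[x, σ[x], y] for x in 1:n, y in 1:n]
    r_σ = [r[x, σ[x]] for x in 1:n]
    return (I - β * P_σ) \ r_σ
end

"Apply T to v a total of k times."
function iterate_T(v, k)
    for _ in 1:k
        v = T(v)
    end
    return v
end

v_star = iterate_T(zeros(n), 2_000)  # numerical stand-in for v*

"Check (8.28) along the first num_iter VFI iterates starting from v0."
function check_bound(v0, num_iter)
    ok, v_prev = true, v0
    for k in 1:num_iter
        v_k = T(v_prev)
        σ = greedy(v_k)
        lhs = sup_norm(v_star - policy_value(σ))
        rhs = 2β / (1 - β) * sup_norm(v_k - v_prev)
        ok &= lhs ≤ rhs + 1e-10      # small slack for rounding error
        v_prev = v_k
    end
    return ok
end

@show check_bound(zeros(n), 25)
```

In practice the left-hand side of (8.28) is often far smaller than the right-hand side, because the greedy policy typically becomes optimal well before the value iterates converge.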

8.2.1.3 A Blackwell-Type Condition

Next we state a useful condition for contractivity that is related to Blackwell’s sufficient
condition discussed in §2.2.3.3. We say that RDP (Γ, 𝑉, 𝐵) satisfies Blackwell’s
condition if 𝑣 ∈ 𝑉 implies 𝑣 + 𝜆 ≔ 𝑣 + 𝜆𝟙 is in 𝑉 for every 𝜆 ⩾ 0 and, in addition, there
exists a 𝛽 ∈ [0, 1) such that

$$
B(x, a, v + \lambda) \leq B(x, a, v) + \beta \lambda \quad \text{for all } (x, a) \in \mathsf G, \; v \in V \text{ and } \lambda \in \mathbb R_+ .
$$

EXERCISE 8.2.3. Prove the following: If R satisfies Blackwell’s condition, then R is
contracting with modulus 𝛽.

EXERCISE 8.2.4. Prove that the RDP for the state-dependent discounting model in
Example 8.1.5 is contracting on 𝑉 = ℝ^X whenever there exists a 𝑏 < 1 with 𝛽(𝑥, 𝑎, 𝑥′) ⩽
𝑏 for all (𝑥, 𝑎) ∈ G and 𝑥′ ∈ X.

EXERCISE 8.2.5. Prove that the discrete optimal savings model from §5.2.2 satisfies
Blackwell’s condition.

8.2.1.4 Application: Job Search with Quantile Preferences

Consider the job search problem with correlated wage draws first investigated in
§3.3.1. With finite wage offer set W, wage offer process generated by 𝑃 ∈ M(ℝ^W)
and 𝛽 ∈ (0, 1), we can frame this as an RDP (Γ, 𝑉, 𝐵) with 𝑉 = ℝ^W, Γ(𝑤) = {0, 1} for
𝑤 ∈ W and

$$
B(w, a, v) := a \frac{w}{1 - \beta} + (1 - a) [c + \beta (P v)(w)] .
$$

Since the model just described is an optimal stopping problem, Example 8.2.1 tells us
that (Γ, 𝑉, 𝐵) is contracting.

[Figure 8.1: The reservation wage as a function of 𝜏 (horizontal axis: quantile; vertical axis: wages)]
Now consider the following modification, where Γ and 𝑉 are as before but 𝐵 is
replaced by

$$
B_\tau(w, a, v) := a \frac{w}{1 - \beta} + (1 - a) [c + \beta (R_\tau v)(w)] ,
$$

where 𝜏 ∈ [0, 1] and 𝑅𝜏 is the quantile certainty equivalent operator described in
Exercise 7.3.4 (page 232).

EXERCISE 8.2.6. Prove that (Γ, 𝑉, 𝐵𝜏) is a contracting RDP.

Figure 8.1 shows the reservation wage for a range of 𝜏 values, computed using
optimistic policy iteration (and taking the smallest 𝑤 ∈ W such that 𝜎∗(𝑤) = 1). The
stationary distribution of 𝑃 is also shown in the figure, tilted 90 degrees.
The parameters and the code for applying 𝑇𝜎 and evaluating greedy functions are
shown in Listing 26. That listing includes the quantile operator 𝑅𝜏, which is
implemented in Listing 25. (Quantiles of discrete random variables can also be computed
using functionality contained in Distributions.jl.)

"Compute the τ-th quantile of v(X) when X ∼ ϕ and v = sort(v)."
function quantile(τ, v, ϕ)
    for (i, v_value) in enumerate(v)
        p = sum(ϕ[1:i])      # sum all ϕ[j] s.t. v[j] ≤ v_value
        if p ≥ τ             # exit and return v_value if prob ≥ τ
            return v_value
        end
    end
    return v[end]            # guard: rounding can leave sum(ϕ) just below τ
end

"For each i, compute the τ-th quantile of v(Y) when Y ∼ P(i, ⋅)"
function R(τ, v, P)
    return [quantile(τ, v, P[i, :]) for i in eachindex(v)]
end

Listing 25: Conditional quantile operator (quantile_function.jl)

The main message of Figure 8.1 is that the reservation wage rises in 𝜏. In essence,
higher 𝜏 focuses the attention of the worker on the right tail of the distribution of
continuation values. This encourages the worker to take on more risk, which leads to
a higher reservation wage (i.e., reluctance to accept a given current offer).
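The monotonicity in 𝜏 that drives this result can be seen in a tiny example. The sketch below is a stand-alone variant of the quantile function in Listing 25 that uses a cumulative sum; the name `quantile_sketch` and the numbers are hypothetical and chosen only for illustration.

```julia
"τ-th quantile of v(X) when X ∼ ϕ; assumes v is sorted in increasing order."
function quantile_sketch(τ, v, ϕ)
    i = findfirst(≥(τ), cumsum(ϕ))          # first index with cdf ≥ τ
    return v[something(i, lastindex(v))]    # fall back to the largest value
end

v = [1.0, 2.0, 3.0, 4.0]        # sorted continuation values (made up)
ϕ = [0.1, 0.4, 0.3, 0.2]        # probabilities of each value (made up)

# Quantiles at increasing values of τ
qs = [quantile_sketch(τ, v, ϕ) for τ in (0.05, 0.3, 0.6, 0.9)]
@show qs
```

As 𝜏 increases, the quantile moves weakly to the right through the support of the distribution, which is the mechanism behind the rising continuation value and reservation wage described above.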

8.2.1.5 Application: Optimal Default

In this section we consider a small open economy that borrows in international financial
markets in order to smooth consumption and has the option to default. We show
that the model is a contracting RDP.
Income (𝑌_𝑡)_{𝑡⩾0} is exogenous and 𝑄-Markov on finite set Y. A representative household
faces budget constraint

$$
C_t = Y_t + b_t - q b_{t+1} \quad (t \geq 0),
$$

where 𝐶_𝑡 is consumption at time 𝑡, 𝑞 is the price at time 𝑡 of a risk-free claim on one
unit of time 𝑡 + 1 consumption, and 𝑏_𝑡 measures foreign lending. The price 𝑞 is
determined outside the model, say by international markets. Purchasing a claim on
𝑏_{𝑡+1} units of time 𝑡 + 1 consumption costs 𝑞𝑏_{𝑡+1}. Purchasing a bond with negative
face value 𝑏_{𝑡+1} pays 𝑞𝑏_{𝑡+1} in current consumption goods and promises to deliver
𝑏_{𝑡+1} next period.

using QuantEcon
include("quantile_function.jl")

"Creates an instance of the job search model."
function create_markov_js_model(;
        n=100,       # wage grid size
        ρ=0.9,       # wage persistence
        ν=0.2,       # wage volatility
        β=0.98,      # discount factor
        c=1.0,       # unemployment compensation
        τ=0.5        # quantile parameter
    )
    mc = tauchen(n, ρ, ν)
    w_vals, P = exp.(mc.state_values), mc.p
    return (; n, w_vals, P, β, c, τ)
end

"""
The policy operator

    (T_σ v)(w) = σ(w) (w / (1-β)) + (1 - σ(w))(c + β (R_τ v)(w))

"""
function T_σ(v, σ, model)
    (; n, w_vals, P, β, c, τ) = model
    h = c .+ β * R(τ, v, P)
    e = w_vals ./ (1 - β)
    return σ .* e + (1 .- σ) .* h
end

"Get a v-greedy policy."
function get_greedy(v, model)
    (; n, w_vals, P, β, c, τ) = model
    σ = w_vals / (1 - β) .≥ c .+ β * R(τ, v, P)
    return σ
end

Listing 26: Job search with quantile operator (quantile_js.jl)


Trading bonds is managed by a benevolent government that wants to maximize
household utility. Households discount future utility at rate 𝛽 ∈ (0, 1) and current
consumption 𝐶_𝑡 generates current utility 𝑢(𝐶_𝑡). The government faces borrowing
constraint 𝑏_𝑡 ⩾ −𝑚 where 𝑚 ⩾ 0. The government maximizes expected discounted utility
for the households.
The government can default on foreign loans. In this case, output available for
consumption drops from 𝑌𝑡 to ℎ (𝑌𝑡 ), where ℎ is a function satisfying ℎ ( 𝑦 ) < 𝑦 for all 𝑦 .
After a country defaults, it temporarily loses access to the international credit market.
At the end of each period during which the country is in default, it regains access
to international credit markets with probability 𝜃 ∈ (0, 1). With probability 1 − 𝜃 it
remains in financial autarky. When a country regains access to foreign borrowing, its
debt is reset to zero.
We can cast this as an RDP by considering the value of each state and action. We
set the state space X to be the set of all (𝑦, 𝑏, 𝑑) in Y × B × {0, 1}, where B is a finite
subset of [−𝑚, ∞) indicating possible choices for bond holdings 𝑏_𝑡 and 𝑑 is a binary
variable indicating whether the country is in default (𝑑 = 0 means not in default and
𝑑 = 1 means in default).
The value space 𝑉 is all of ℝ^X. The action space is (𝑏𝑎, 𝑑𝑎) ∈ B × {0, 1} indicating
choices for bond holdings and default. The feasible correspondence specifies feasible
(𝑏𝑎, 𝑑𝑎) at given state (𝑦, 𝑏, 𝑑) and is given by

$$
\Gamma(y, b, d) =
\begin{cases}
\mathsf B \times \{0, 1\} & \text{if } d = 0, \\
\{0\} \times \{1\} & \text{if } d = 1.
\end{cases}
$$

In other words, if 𝑑 = 0, so the country is not in default, the government can choose
any 𝑏𝑎 ∈ B and also any 𝑑𝑎 ∈ {0, 1} (i.e., default or not default). If 𝑑 = 1, however, the
government has no choices. We represent this situation by 𝑏𝑎 = 0 and 𝑑𝑎 = 1.
The value aggregator takes the form

𝐵 (( 𝑦, 𝑏, 𝑑 ) , ( 𝑏𝑎 , 𝑑 𝑎 ) , 𝑣) = value in state ( 𝑦, 𝑏, 𝑑 ) under action ( 𝑏𝑎 , 𝑑 𝑎 ) .

To specify it we decompose the problem across cases for 𝑑 and 𝑑𝑎. First consider the
case where 𝑑 = 0 (not currently in default) and 𝑑𝑎 = 0 (the government chooses not
to default). For this case 𝑦 + 𝑏 − 𝑞𝑏𝑎 is current consumption, so we set

$$
B((y, b, 0), (b_a, 0), v) = u(y + b - q b_a) + \beta \sum_{y'} v(y', b_a, 0) Q(y, y'). \tag{8.32}
$$

Now consider the case where 𝑑 = 0 and 𝑑𝑎 = 1, so the government chooses to default.
Then current consumption is ℎ(𝑦) and we set

$$
B((y, b, 0), (b_a, 1), v) = u(h(y)) + \beta \left[ \theta \sum_{y'} v(y', 0, 0) Q(y, y') + (1 - \theta) \sum_{y'} v(y', 0, 1) Q(y, y') \right]. \tag{8.33}
$$

The term Σ_{𝑦′} 𝑣(𝑦′, 0, 0)𝑄(𝑦, 𝑦′) is the expected value next period when the country is
readmitted to international financial markets (with 𝑏′ = 0 and 𝑑′ = 0), while the term
Σ_{𝑦′} 𝑣(𝑦′, 0, 1)𝑄(𝑦, 𝑦′) is the expected value next period when default continues (with
𝑏′ = 0 and 𝑑′ = 1).
Since 𝐵 (( 𝑦, 𝑏, 1) , ( 𝑏𝑎 , 0) , 𝑣) is not feasible (a defaulted country cannot itself directly
choose to reenter financial markets), the only other possibility is 𝐵 (( 𝑦, 𝑏, 1) , ( 𝑏𝑎 , 1) , 𝑣),
which is the expected value when the country remains in default. But this is the same
as 𝐵 (( 𝑦, 𝑏, 0) , ( 𝑏𝑎 , 1) , 𝑣) specified above: the value for a country that stays in default is
the same as that for a country that newly enters default.

EXERCISE 8.2.7. By working through cases (8.32)–(8.33) for the value aggregator
𝐵, show that the model described above is a contracting RDP.

8.2.2 Eventually Contracting RDPs

Many RDPs are not contracting. There is no single method for handling all types
of non-contractive RDPs, so we introduce alternative techniques over the next few
sections. The first such technique, treated in this section, handles RDPs that contract
“eventually,” even though they may fail to contract in one step. We show that these
eventually contracting RDPs are globally stable, so all of the fundamental optimality
results apply.
One application for these results is the MDP model with state-dependent discount-
ing treated in Chapter 6. This section contains a proof of the main optimality result
in that chapter (Proposition 6.2.2 on page 193).

8.2.2.1 Definition and Properties

Let R = (Γ, 𝑉, 𝐵) be an RDP with policy set Σ. We call R eventually contracting if
there is a map 𝐿 from G × X to ℝ₊ such that

$$
|B(x, a, v) - B(x, a, w)| \leq \sum_{x'} |v(x') - w(x')| L(x, a, x') \tag{8.34}
$$

for all (𝑥, 𝑎) ∈ G and all 𝑣, 𝑤 ∈ 𝑉, and moreover,

$$
\sigma \in \Sigma \implies \rho(L_\sigma) < 1 \quad \text{where } L_\sigma(x, x') := L(x, \sigma(x), x') .
$$

Proposition 8.2.4. Let R = (Γ, 𝑉, 𝐵) be an RDP. If R is eventually contracting and 𝑉 is
closed in ℝ^X, then R is also globally stable and hence all of the optimality and convergence
results in Theorem 8.1.1 apply.

Proof. Let R be as stated and fix 𝜎 ∈ Σ. Let 𝑇𝜎 be the associated policy operator and
let 𝐿𝜎 be the linear operator in (8.34). For fixed 𝑣, 𝑤 ∈ 𝑉 we have

$$
|(T_\sigma v)(x) - (T_\sigma w)(x)| = |B(x, \sigma(x), v) - B(x, \sigma(x), w)| \leq \sum_{x'} |v(x') - w(x')| L_\sigma(x, x') .
$$

Since 𝐿𝜎 ⩾ 0 and 𝜌(𝐿𝜎) < 1, Proposition 6.1.6 on page 190 implies that 𝑇𝜎 is eventually
contracting on 𝑉. Since 𝑉 is closed in ℝ^X, it follows that 𝑇𝜎 is globally stable
(Theorem 6.1.5, page 189). Hence R is globally stable, as claimed. □
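To see how an operator can fail to contract in one step while still satisfying 𝜌(𝐿𝜎) < 1, consider the following Julia sketch. It uses a hypothetical two-state example in which the discount factor exceeds one in the first state; the numbers are assumptions chosen for illustration and do not come from the text.

```julia
using LinearAlgebra

spectral_radius(A) = maximum(abs, eigvals(A))

# Hypothetical two-state example with discounting above 1 in state 1
β = [1.05, 0.80]          # state-dependent discount factors β(x)
Q = [0.2 0.8;
     0.5 0.5]             # stochastic matrix Q(x, x′)
L = β .* Q                # L(x, x′) = β(x) Q(x, x′), row x scaled by β(x)

one_step = opnorm(L, Inf)       # largest row sum; at least 1 here
ρ = spectral_radius(L)          # strictly less than 1 here
many_steps = opnorm(L^20, Inf)  # norm of the 20th power of L

@show one_step ρ many_steps
```

Here the largest row sum of 𝐿 is 1.05, so the supremum-norm bound on one application of the operator does not deliver a contraction, while powers of 𝐿 shrink geometrically because the spectral radius is below one.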

EXERCISE 8.2.8. Prove: If R = (Γ, 𝑉, 𝐵) is eventually contracting, 𝑉 is closed in ℝ^X
and 𝑇 is the Bellman operator generated by R, then 𝑇 is globally stable on 𝑉.

EXERCISE 8.2.9. In §4.1.2 we studied an optimal exit problem for a firm. We
can modify this problem to handle stochastic interest rates by introducing the RDP
R = (Γ, 𝑉, 𝐵) on state space X with Γ(𝑥) = {0, 1}, 𝑉 = ℝ^X and

$$
B(x, a, v) = a s + (1 - a) \left[ \pi(x) + \beta(x) \sum_{x'} v(x') Q(x, x') \right]
$$

for some 𝛽 ∈ ℝ₊^X. (We suppose that state-dependence of 𝛽 is generated by
state-dependent interest rates.) State the Bellman equation for this problem. Prove that
R is globally stable whenever there exists an 𝐿 ∈ L(ℝ^X) such that 𝜌(𝐿) < 1 and
𝛽(𝑥)𝑄(𝑥, 𝑥′) ⩽ 𝐿(𝑥, 𝑥′) for all 𝑥, 𝑥′ ∈ X.

8.2.2.2 Optimality for MDPs with State-Dependent Discounting

With Proposition 8.2.4 in hand, we can complete the proof of Proposition 6.2.2 on
page 193, which pertained to optimality properties for MDPs with state-dependent
discounting.
Let (Γ, 𝛽, 𝑟, 𝑃) be an MDP with state-dependent discounting, as defined in §6.2.1.1.
The state space is X and the action space is A. The function 𝛽 maps G × X to ℝ₊. Set

$$
L(x, a, x') := \beta(x, a, x') P(x, a, x') \quad \text{and} \quad L_\sigma(x, x') := L(x, \sigma(x), x')
$$

for all (𝑥, 𝑎, 𝑥′) ∈ G × X and 𝜎 ∈ Σ.
Assume the conditions of Proposition 6.2.2, so that 𝜌(𝐿𝜎) < 1 for all 𝜎 ∈ Σ.
If we set

$$
B(x, a, v) := r(x, a) + \sum_{x'} v(x') \beta(x, a, x') P(x, a, x') \tag{8.35}
$$

and take 𝑉 to be all of ℝ^X, then R ≔ (Γ, 𝑉, 𝐵) forms an RDP, as discussed in
Exercise 8.2.4. We claim that R is an eventually contracting RDP.
To see this, fix 𝑣, 𝑤 ∈ 𝑉 and (𝑥, 𝑎) ∈ G. Applying the definition (8.35) and the
triangle inequality, we have

$$
|B(x, a, v) - B(x, a, w)| \leq \left| \sum_{x'} [v(x') - w(x')] \beta(x, a, x') P(x, a, x') \right| \leq \sum_{x'} |v(x') - w(x')| L(x, a, x') .
$$

Under the stated assumptions, for each 𝜎 ∈ Σ, the operator 𝐿𝜎(𝑥, 𝑥′) = 𝐿(𝑥, 𝜎(𝑥), 𝑥′)
satisfies 𝜌(𝐿𝜎) < 1. Hence R is eventually contracting, as claimed. Since 𝑉 = ℝ^X is
closed, Proposition 8.2.4 implies that R is a globally stable RDP. The claims in
Proposition 6.2.2 now follow from Theorem 8.1.1.

8.2.3 Convex and Concave RDPs

Theorem 8.1.1 shows that RDPs have excellent optimality properties when all policy
operators are globally stable on value space. So far we have looked at conditions
for stability based on contractions (§8.2.1) and eventual contractions (§8.2.2). But
sometimes both of these approaches fail and we need alternative conditions.
In this section we explore alternative conditions based on Du’s theorem (page 216).
Du’s theorem is well suited to the task of studying stability of policy operators, since
it leverages the fact that all policy operators are order-preserving.

8.2.3.1 Definitions and Optimality

Let R = (Γ, 𝑉, 𝐵) be an RDP with 𝑉 = [𝑣1, 𝑣2] for some 𝑣1 ⩽ 𝑣2 in ℝ^X. We call R convex
if

(i) for all ( 𝑥, 𝑎) ∈ G, 𝜆 ∈ [0, 1] and 𝑣, 𝑤 in 𝑉 , we have

𝐵 ( 𝑥, 𝑎, 𝜆𝑣 + (1 − 𝜆 ) 𝑤) ⩽ 𝜆 𝐵 ( 𝑥, 𝑎, 𝑣) + (1 − 𝜆 ) 𝐵 ( 𝑥, 𝑎, 𝑤) and, (8.36)

(ii) there exists a 𝛿 > 0 such that

𝐵 ( 𝑥, 𝑎, 𝑣2 ) ⩽ 𝑣2 ( 𝑥 ) − 𝛿 [ 𝑣2 ( 𝑥 ) − 𝑣1 ( 𝑥 )] for all ( 𝑥, 𝑎) ∈ G. (8.37)

Analogous to the convex case, we call R concave if

(i) for all ( 𝑥, 𝑎) ∈ G, 𝜆 ∈ [0, 1] and 𝑣, 𝑤 in 𝑉 , we have

𝐵 ( 𝑥, 𝑎, 𝜆𝑣 + (1 − 𝜆 ) 𝑤) ⩾ 𝜆 𝐵 ( 𝑥, 𝑎, 𝑣) + (1 − 𝜆 ) 𝐵 ( 𝑥, 𝑎, 𝑤) and, (8.38)

(ii) there exists a 𝛿 > 0 such that

𝐵 ( 𝑥, 𝑎, 𝑣1 ) ⩾ 𝑣1 ( 𝑥 ) + 𝛿 [ 𝑣2 ( 𝑥 ) − 𝑣1 ( 𝑥 )] for all ( 𝑥, 𝑎) ∈ G. (8.39)

In both of the definitions above, condition (ii) is rather complex. The next exercise
provides simpler sufficient conditions.

EXERCISE 8.2.10. Prove that (8.37) holds whenever

𝐵 ( 𝑥, 𝑎, 𝑣2 ) < 𝑣2 ( 𝑥 ) for all ( 𝑥, 𝑎) ∈ G. (8.40)

Similarly, prove that (8.39) holds whenever

𝐵 ( 𝑥, 𝑎, 𝑣1 ) > 𝑣1 ( 𝑥 ) for all ( 𝑥, 𝑎) ∈ G. (8.41)

Both convexity and concavity yield stability, as the next proposition shows.

Proposition 8.2.5. If R is either convex or concave, then R is globally stable.

Proof. We begin with the convex case. Fix 𝜎 ∈ Σ. By the monotonicity property of
RDPs, 𝑇𝜎 is an order-preserving self-map on 𝑉. Since (8.36) holds, 𝑇𝜎 is also a convex
operator on 𝑉. Moreover, 𝑇𝜎𝑣1 ⩾ 𝑣1 because 𝑇𝜎 : 𝑉 → 𝑉 and, by (8.37), 𝑇𝜎𝑣2 ⩽
𝑣2 − 𝛿(𝑣2 − 𝑣1). Hence Du’s theorem on page 216 applies and 𝑇𝜎 is globally stable on
𝑉. This shows that R is a globally stable RDP.
The proof of the concave case is analogous (using Du’s theorem applied to
order-preserving concave operators). □

It follows from Proposition 8.2.5 that, for convex and concave RDPs, all of the
optimality and convergence results in Theorem 8.1.1 apply.

8.2.3.2 Application to MDPs

Proposition 8.2.5 can be applied to establish optimality properties of regular MDPs.
This exercise is redundant in the sense that optimality properties of regular MDPs have
already been established using other means. At the same time, some of the arguments
developed here will be helpful when we face more sophisticated problems below.
To sketch the argument, let R = (Γ, 𝑉, 𝐵) be an RDP generated by an ordinary
MDP (Γ, 𝛽, 𝑟, 𝑃), as discussed in Example 8.1.1 on page 247. In particular, 𝑉 = ℝ^X
and 𝐵(𝑥, 𝑎, 𝑣) = 𝑟(𝑥, 𝑎) + 𝛽 Σ_{𝑥′} 𝑣(𝑥′)𝑃(𝑥, 𝑎, 𝑥′). We set 𝑟1 ≔ min 𝑟 and 𝑟2 ≔ max 𝑟. Then
we fix 𝜀 > 0 and define 𝑉̂ via

$$
\hat V := [v_1, v_2] \quad \text{where } v_1 := \frac{r_1 - \varepsilon}{1 - \beta} \text{ and } v_2 := \frac{r_2 + \varepsilon}{1 - \beta} . \tag{8.42}
$$

(The functions 𝑣1 and 𝑣2 are constant.) We claim that the RDP R̂ ≔ (Γ, 𝑉̂, 𝐵) is both
convex and concave.

EXERCISE 8.2.11. Prove that (8.40) and (8.41) both hold for R̂.

EXERCISE 8.2.12. Complete the proof that R̂ is both concave and convex.

8.3 Further Applications


In this section we consider some applications of the optimality results in §8.2.

8.3.1 Risk-Sensitive RDPs


In §7.2.2 we introduced risk-sensitive preferences and discussed a recursive utility
problem. Now we embed risk-sensitive preferences into a dynamic program and apply
the preceding optimality results to compute optimal policies.

8.3.1.1 Optimality Results

Consider the risk-sensitive preference RDP in Example 8.1.6, with state space X and
action space A. Let 𝑉 = ℝ^X. For (𝑥, 𝑎) ∈ G and 𝑣 ∈ 𝑉, we can express the aggregator
as

$$
B(x, a, v) := r(x, a) + \beta (R^a_\theta v)(x),
$$

where 𝜃 is a nonzero constant and

$$
(R^a_\theta v)(x) := \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) P(x, a, x') \right\} .
$$

Notice that, for each fixed 𝑎 ∈ Γ(𝑥), the operator 𝑅^𝑎_𝜃 is an entropic certainty equivalent
operator on 𝑉 (see Example 7.3.2 on page 232).

Proposition 8.3.1. If 𝛽 < 1, then ( Γ, 𝑉, 𝐵) is contracting.

Proof. Fix 𝛽 < 1. We show that (Γ, 𝑉, 𝐵) obeys Blackwell’s condition (§8.2.1.3). To this
end, fix 𝑣 ∈ 𝑉, (𝑥, 𝑎) ∈ G, and 𝜆 ⩾ 0. Since 𝑅^𝑎_𝜃 is constant-subadditive (Exercise 7.3.8
on page 233), we have

$$
B(x, a, v + \lambda) = r(x, a) + \beta [R^a_\theta (v + \lambda)](x) \leq r(x, a) + \beta (R^a_\theta v)(x) + \beta \lambda .
$$

The right-hand side equals 𝐵(𝑥, 𝑎, 𝑣) + 𝛽𝜆, so Blackwell’s condition holds. The claim
in Proposition 8.3.1 now follows from Exercise 8.2.3. □
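The key step in this proof, that adding a constant 𝜆 to 𝑣 raises the aggregator by at most 𝛽𝜆, is easy to check numerically. The Julia sketch below does so at a single state-action pair for one negative and one positive value of 𝜃. The reward, distribution and function names are illustrative assumptions.

```julia
# Sketch: Blackwell's condition for the risk-sensitive aggregator
# at one fixed pair (x, a). All numbers are made-up illustrative values.
r, β = 1.0, 0.9
p = [0.3, 0.3, 0.4]              # the distribution P(x, a, ⋅)

# Entropic certainty equivalent (1/θ) ln Σ exp(θ v(x′)) p(x′)
entropic(v, θ) = log(sum(exp.(θ .* v) .* p)) / θ

# Aggregator B(x, a, v) = r(x, a) + β (R_θ v)(x) at the fixed (x, a)
B(v, θ) = r + β * entropic(v, θ)

v = [1.0, -2.0, 0.5]
λ = 0.7

# Gap between B(v + λ) and B(v) + βλ for a risk-averse and a risk-loving θ
gaps = [B(v .+ λ, θ) - (B(v, θ) + β * λ) for θ in (-1.5, 0.5)]
@show gaps
```

For the entropic operator the gap is zero up to rounding error, because the certainty equivalent of 𝑣 + 𝜆 equals the certainty equivalent of 𝑣 plus 𝜆, so the inequality in Blackwell's condition holds with equality.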

The next exercise pertains to quantile preferences rather than risk-sensitive pref-
erences, but the result can be obtained via a relatively straightforward modification
of the proof of Proposition 8.3.1.

EXERCISE 8.3.1. Let R ≔ (Γ, 𝑉, 𝐵) be an RDP with 𝑉 = ℝ^X and fix 𝜏 ∈ [0, 1]. Let
𝐵(𝑥, 𝑎, 𝑣) = 𝑟(𝑥, 𝑎) + 𝛽(𝑅^𝑎_𝜏 𝑣)(𝑥) where, for each 𝑎 ∈ Γ(𝑥), the map 𝑅^𝑎_𝜏 is given by

$$
(R^a_\tau v)(x) = \min \left\{ y \in \mathbb R \; : \; \sum_{x'} \mathbb 1\{v(x') \leq y\} P(x, a, x') \geq \tau \right\} \qquad (v \in V, \; x \in \mathsf X) .
$$

Prove that R is globally stable whenever 𝛽 < 1.

8.3.1.2 Risk-Sensitive Job Search

Let’s consider a job search problem where future wage outcomes are evaluated via
risk-sensitive expectations. The associated Bellman operator is

$$
(T v)(w) = \max \left\{ \frac{w}{1 - \beta}, \; c + \frac{\beta}{\theta} \ln \left[ \sum_{w'} \exp(\theta v(w')) P(w, w') \right] \right\} \qquad (w \in \mathsf W) .
$$

Here 𝜃 is a nonzero parameter and other details are as in §3.3.1. We can represent
the problem as an RDP with state space W, action space A = {0, 1}, feasible
correspondence Γ(𝑤) = A, value space 𝑉 ≔ ℝ^W, and value aggregator

$$
B(w, a, v) = a \frac{w}{1 - \beta} + (1 - a) \left\{ c + \frac{\beta}{\theta} \ln \left[ \sum_{w'} \exp(\theta v(w')) P(w, w') \right] \right\} .
$$

If 𝜃 < 0, then the agent is risk-averse with respect to the gamble associated with
continuing and waiting for new wage draws. If 𝜃 > 0 then the agent is risk-loving
with respect to such gambles. For 𝜃 ≈ 0, the agent is close to risk-neutral.
Figure 8.2 shows how the continuation value, value function and optimal decision
vary with 𝜃. Apart from 𝜃, parameters are identical to those in Listing 10 on page 99.
Indeed, for 𝜃 close to zero, as in the middle sub-figure of Figure 8.2, we see that the
value function and reservation wage are almost identical to those from the risk-neutral
model in Figure 3.5 on page 100.
As expected, a negative value of 𝜃 tends to reduce the continuation value and hence
the reservation wage, since the agent’s dislike of risk encourages early acceptance of
an offer. For positive values of 𝜃 the reverse is true, as seen in the bottom sub-figure.

EXERCISE 8.3.2. Replicate Figure 8.2. The simplest method is to modify the code
in Listing 10 and use value function iteration.

8.3.2 Adversarial Agents

Some problems in economics, finance and artificial intelligence assume that decisions
emerge from a dynamic two-person zero-sum game in which the two agents’ preferences
are perfectly misaligned. This can lead to a dynamic program where the Bellman
equation takes the form

$$
v(x) = \max_{a \in \Gamma(x)} \inf_{d \in D(x, a)} B(x, a, d, v) \qquad (x \in \mathsf X, \; v \in \mathbb R^{\mathsf X}), \tag{8.43}
$$

where 𝐵(𝑥, 𝑎, 𝑑, 𝑣) represents lifetime value for the decision maker conditional on her
current action 𝑎 and her adversary’s action 𝑑. The decision maker chooses action 𝑎 ∈
Γ(𝑥) with the knowledge that the opponent will then choose 𝑑 ∈ 𝐷(𝑥, 𝑎) to minimize
her lifetime value.

[Figure 8.2: Job search with risk-sensitive preferences. Three panels, for 𝜃 = −10, 𝜃 = 0.0001 and 𝜃 = 0.1, each plotting ℎ∗(𝑤), 𝑤/(1 − 𝛽) and 𝑣∗(𝑤) against 𝑤.]
Remark 8.3.1. In some settings we can replace the inf in (8.43) with min. In other
settings this is not so obvious. For this reason we use inf throughout, paired with the
assumption that 𝐵 is bounded below. This means that the infimum is always well-
defined and finite.

8.3.2.1 Optimality

To establish optimality properties in the setting of (8.43), we introduce the following
assumptions:

(a) If 𝑣, 𝑤 ∈ ℝ^X with 𝑣 ⩽ 𝑤, then

$$
B(x, a, d, v) \leq B(x, a, d, w) \quad \text{for all } x \in \mathsf X, \; a \in \Gamma(x), \; d \in D(x, a) .
$$

(b) There exists a 𝑣1 ∈ ℝ^X and 𝜀 > 0 such that

$$
v_1(x) + \varepsilon \leq B(x, a, d, v_1) \quad \text{for all } x \in \mathsf X, \; a \in \Gamma(x), \; d \in D(x, a) .
$$

(c) There exists a 𝑣2 ∈ ℝ^X such that 𝑣1 ⩽ 𝑣2 and

$$
B(x, a, d, v_2) \leq v_2(x) \quad \text{for all } x \in \mathsf X, \; a \in \Gamma(x), \; d \in D(x, a) .
$$

(d) If 𝜆 ∈ [0, 1] and 𝑣, 𝑤 ∈ ℝ^X, then

$$
B(x, a, d, \lambda v + (1 - \lambda) w) \geq \lambda B(x, a, d, v) + (1 - \lambda) B(x, a, d, w)
$$

for all 𝑥 ∈ X, 𝑎 ∈ Γ(𝑥) and 𝑑 ∈ 𝐷(𝑥, 𝑎).

Condition (a) is a natural monotonicity condition: a uniform increase in continu-
ation values increases current value at all states and actions. Conditions (b) and (c)
provide upper and lower bounds. Condition (d) is a concavity condition.
To analyze the decision maker’s problem, we set 𝑉 ≔ [𝑣1, 𝑣2] and

$$
\hat B(x, a, v) := \inf_{d \in D(x, a)} B(x, a, d, v) \qquad ((x, a) \in \mathsf G, \; v \in V) .
$$

We consider R = (Γ, 𝑉, 𝐵̂).

Proposition 8.3.2. If conditions (a)–(d) hold, then R is a concave RDP.

An immediate corollary of Proposition 8.3.2 is that, under the stated conditions,
the decision maker’s problem is a globally stable RDP, and hence the fundamental
optimality properties in Theorem 8.1.1 all hold.
In the proof of Proposition 8.3.2, we use the following exercise.

EXERCISE 8.3.3. Let 𝑓 and 𝑔 map nonempty set 𝐷 into ℝ. Assume that both 𝑓 and
𝑔 are bounded below. Prove that, in this setting,

$$
\inf_{d \in D} (f(d) + g(d)) \geq \inf_{d \in D} f(d) + \inf_{d \in D} g(d) .
$$
Proof of Proposition 8.3.2. First we need to check that R is an RDP. In view of (a) we
have 𝐵̂(𝑥, 𝑎, 𝑣) ⩽ 𝐵̂(𝑥, 𝑎, 𝑤) whenever (𝑥, 𝑎) ∈ G and 𝑣, 𝑤 ∈ 𝑉 and 𝑣 ⩽ 𝑤. Also, by (b)
and (c),

$$
v_1(x) < \hat B(x, a, v_1) \quad \text{and} \quad \hat B(x, a, v_2) \leq v_2(x) \quad \text{for all } (x, a) \in \mathsf G . \tag{8.44}
$$

As a result, 𝑣1(𝑥) ⩽ 𝐵̂(𝑥, 𝜎(𝑥), 𝑣) ⩽ 𝑣2(𝑥) for all 𝑥 ∈ X and 𝑣 ∈ 𝑉. Together, these facts
imply the monotonicity and consistency conditions required of an RDP.
In view of (8.44) and Exercise 8.2.10 on page 272, to establish that R is concave,
we need only show that, for fixed 𝜆 ∈ [0, 1] and 𝑣, 𝑤 ∈ 𝑉,

$$
\hat B(x, a, \lambda v + (1 - \lambda) w) \geq \lambda \hat B(x, a, v) + (1 - \lambda) \hat B(x, a, w) \tag{8.45}
$$

for all (𝑥, 𝑎) ∈ G. This holds because, given (𝑥, 𝑎) ∈ G, 𝜆 ∈ [0, 1] and 𝑣, 𝑤 ∈ 𝑉,

$$
\begin{aligned}
\hat B(x, a, \lambda v + (1 - \lambda) w)
&= \inf_{d \in D(x, a)} B(x, a, d, \lambda v + (1 - \lambda) w) \\
&\geq \inf_{d \in D(x, a)} [\lambda B(x, a, d, v) + (1 - \lambda) B(x, a, d, w)] \\
&\geq \lambda \inf_{d \in D(x, a)} B(x, a, d, v) + (1 - \lambda) \inf_{d \in D(x, a)} B(x, a, d, w),
\end{aligned}
$$

where the first inequality is by condition (d) above and the second is by Exercise 8.3.3.
This proves (8.45), so R is a concave RDP. □

8.3.2.2 A Perturbed MDP Problem

In this section we provide a relatively abstract application of Proposition 8.3.2. Later,
in §8.3.3, we will see more concrete applications.
The setting we consider is a modified MDP where the adversarial agent’s actions
affect the reward function and transition kernel. This leads to a Bellman equation of
the form

$$
v(x) = \max_{a \in \Gamma(x)} \inf_{d \in D(x, a)} \left\{ r(x, a, d) + \beta \sum_{x'} v(x') P(x, a, d, x') \right\} \qquad (x \in \mathsf X) . \tag{8.46}
$$

The choice perturbation 𝑑 ∈ 𝐷(𝑥, 𝑎) is made by the adversary. The object 𝑃 is a
stochastic kernel, in the sense that 𝑃(𝑥, 𝑎, 𝑑, ·) is a distribution over X for each feasible
(𝑥, 𝑎, 𝑑). We assume that Γ is a nonempty correspondence from X to A and 𝐷(𝑥, 𝑎) is
nonempty for all (𝑥, 𝑎) ∈ G. Let

$$
\hat B(x, a, v) = \inf_{d \in D(x, a)} \left\{ r(x, a, d) + \beta \sum_{x'} v(x') P(x, a, d, x') \right\} \qquad ((x, a) \in \mathsf G) .
$$

To construct the value space 𝑉, we fix 𝜀 > 0, let 𝑟1 = min 𝑟 and 𝑟2 = max 𝑟, and set

$$
V = [v_1, v_2] \quad \text{where } v_1 := \frac{r_1 - \varepsilon}{1 - \beta} \text{ and } v_2 := \frac{r_2}{1 - \beta} . \tag{8.47}
$$

(These constant functions are similar to 𝑣1 , 𝑣2 in (8.42) on page 273.)

EXERCISE 8.3.4. Prove: For 𝑣1 , 𝑣2 in (8.47), conditions (b)–(c) on page 277 hold.

Lemma 8.3.3. The perturbed MDP model R ≔ ( Γ, 𝑉, 𝐵ˆ) is a concave RDP.

An immediate corollary of Lemma 8.3.3 is that R is globally stable (via
Proposition 8.2.5) and all optimality results in Theorem 8.1.1 apply.

Proof of Lemma 8.3.3. It suffices to show that R obeys (a)–(d) on page 277. Conditions
(a) and (d) are elementary in this setting. Conditions (b) and (c) were established in
Exercise 8.3.4. □

8.3.3 Ambiguity and Robustness

Until now we have considered agents facing decision problems where outcomes are
uncertain but probabilities are known. For example, while the job seeker introduced in
Chapter 1 does not know the next period wage offer when choosing her current action,
she does know the distribution of that offer. She uses this distribution to determine
an optimal course of action. Similarly, the controllers in our discussion of optimal
stopping and MDPs used their knowledge of the Markov transition law to determine
an optimal policy.
In many cases, the assumption that the decision maker knows all probability distri-
butions that govern outcomes under different actions is debatable. In this section we
study lifetime valuations in settings of Knightian uncertainty (Knight, 1921), which
means that outcome distributions are themselves unknown. Some authors refer to
Knightian uncertainty as ambiguity.
Below we consider some dynamic problems where decision makers face Knightian
uncertainty.

8.3.3.1 Robust Control

First we study the choices of a decision maker who knows her reward function but
distrusts her specification of the stochastic kernel 𝑃 that describes the evolution of the
state. This distrust is expressed by assuming that she knows that 𝑃 belongs to some
class of stochastic kernels from G × X to X. This can lead to aggregators of the form

$$
B(x, a, v) = r(x, a) + \beta \inf_{P \in \mathcal P(x, a)} \left\{ \sum_{x'} v(x') P(x, a, x') \right\} \tag{8.48}
$$

for ( 𝑥, 𝑎) ∈ G. As usual, 𝑟 maps G to R and 𝛽 ∈ (0, 1). The decision maker can
construct a policy that is robust to her distrust of the stochastic kernel by using this
aggregator 𝐵. Such aggregators arise in the field of robust control.
Positing that the decision maker knows a nontrivial set of stochastic kernels is a
way of modeling Knightian uncertainty, as distinguished from risks that are described
by known probability distributions.

Example 8.3.1. Consider the simple job search problem from Chapter 1. Suppose that
the worker believes that the wage offer distribution lies in some subset 𝒫 of D(W).
She can seek a decision rule that is robust to worst-case beliefs by optimizing with
aggregator

$$
B(w, a, v) = a \frac{w}{1 - \beta} + (1 - a) \inf_{\varphi \in \mathcal P} \sum_{w'} v(w') \varphi(w') .
$$

Returning to the robust control model with aggregator 𝐵 in (8.48), we take 𝑉
as defined in (8.47) and set R = (Γ, 𝑉, 𝐵). The set 𝒫 of stochastic kernels is entirely
arbitrary.

Proposition 8.3.4. R is a concave RDP.



Proof. Writing 𝐵 as
$$
B(x, a, v) = \inf_{P \in \mathcal P(x, a)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\}, \tag{8.49}
$$

we see that R is a special case of the perturbed MDP model in §8.3.2.2. Concavity
now follows from Lemma 8.3.3. □

We conclude from the discussion above that the robust control RDP is globally
stable. Hence all of the fundamental optimality properties hold.

8.3.3.2 Robustness and Adversarial Agents

A more general way to implement robustness is via the aggregator
$$
B(x, a, v) = r(x, a) + \beta \inf_{P \in \mathcal P(x, a)} \left\{ \sum_{x'} v(x') P(x, a, x') + d(P(x, a, \cdot), \bar P(x, a, \cdot)) \right\}. \tag{8.50}
$$

In this set up, P( 𝑥, 𝑎) is often large, weakening the constraint on 𝑃 . At the same
time, we introduce the penalty term 𝑑 ( 𝑃 ( 𝑥, 𝑎, ·) , 𝑃¯ ( 𝑥, 𝑎, ·)), which can be understood
as recording the deviation between a given kernel 𝑃 and some baseline specification
𝑃¯.
One interpretation of this setting is that the decision maker begins with a baseline
specification of dynamics but lacks confidence in its accuracy. In her desire to choose a
robust policy, she imagines herself playing against an adversarial agent. Her adversary
can choose transition kernels that deviate from the baseline, but the presence of the
penalty term means that extreme deviations are curbed.
If we define
$$
\hat r(x, a) = r(x, a) + d(P(x, a, \cdot), \bar P(x, a, \cdot)),
$$
then (8.50) can be expressed as
$$
B(x, a, v) = \inf_{P} \left\{ \hat r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\}.
$$

This is a special case of (8.49), so the same optimality theory applies.



8.3.3.3 Connection to Risk-Sensitive Preferences

A useful measure of discrepancy between two probability distributions is the Kullback–Leibler divergence (KL divergence)
$$
d_{KL}(q \mid p) := \sum_{x} q(x) \ln \left( \frac{q(x)}{p(x)} \right) \quad \text{for } q, p \in \mathcal D(\mathsf X).
$$

It is assumed here that $q \prec_{ac} p$, which means that $q(x) = 0$ whenever $p(x) = 0$. We note for future reference that $d_{KL}$ obeys the duality formula for variational inference, which states that, given $h \in \mathbb R^{\mathsf X}$,
$$
\ln \left\{ \sum_{x} \exp(h(x)) p(x) \right\} = \sup_{q \prec_{ac} p} \left\{ \sum_{x} h(x) q(x) - d_{KL}(q \mid p) \right\}. \tag{8.51}
$$

(See, e.g., Dupuis and Ellis (2011), Proposition 1.4.2.)


In robust control, KL divergence can be used to measure deviation between the baseline specification and alternative specifications. It turns out that, under this measure of divergence, there is a tight relationship between robust control and risk-sensitive preferences.
To illustrate this relationship, we fix $\theta < 0$ and set $d_\theta := -(1/\theta) d_{KL}$, so that $d_\theta$ is a simple positive rescaling of the Kullback–Leibler divergence. Using $d_\theta$ in (8.50) leads to
$$
B(x, a, v) = r(x, a) + \beta \inf_{P \in \mathcal P(x, a)} \left\{ \sum_{x'} v(x') P(x, a, x') + d_\theta(P(x, a, \cdot) \mid \bar P(x, a, \cdot)) \right\}.
$$

The constraint set P( 𝑥, 𝑎) is all 𝑃 ∈ M ( RX ) such that 𝑃 ( 𝑥, 𝑎, ·) ≺ac 𝑃¯ ( 𝑥, 𝑎, ·).


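If we multiply both sides of the variational formula (8.51) by $1/\theta$ and set $h = \theta v$, we get
$$
\frac{1}{\theta} \ln \left\{ \sum_{x} \exp(\theta v(x)) p(x) \right\} = \inf_{q \prec_{ac} p} \left\{ \sum_{x} v(x) q(x) - \frac{1}{\theta} d_{KL}(q \mid p) \right\},
$$
where the supremum becomes an infimum because $1/\theta < 0$. This allows us to rewrite 𝐵 as
$$
B(x, a, v) = r(x, a) + \beta \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) \bar P(x, a, x') \right\}.
$$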

Hence, for this choice of deviation, the robust control aggregator (8.50) reduces to the risk-sensitive aggregator (see Example 8.1.6 on page 249) under the baseline transition kernel.
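
The link between the penalized worst case and the risk-sensitive certainty equivalent can be checked numerically on a single distribution. The sketch below assumes a randomly drawn baseline distribution `p`, a value vector `v` and θ = −2. It uses the standard fact that the minimizer is the exponentially tilted distribution proportional to p(x) exp(θ v(x)). All names are illustrative.

```python
import numpy as np

def kl(q, p):
    "KL divergence d_KL(q | p) for distributions on a finite set."
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

rng = np.random.default_rng(1)
n, theta = 5, -2.0
p = rng.random(n)
p /= p.sum()                       # baseline distribution (all entries positive)
v = rng.random(n)

# Left-hand side: risk-sensitive certainty equivalent under the baseline
risk_sensitive = np.log(np.sum(np.exp(theta * v) * p)) / theta

def penalized(q):
    "Objective of the adversary: expected value plus the penalty d_theta."
    return q @ v - kl(q, p) / theta

# Candidate minimizer: exponential tilting of the baseline
q_star = p * np.exp(theta * v)
q_star /= q_star.sum()
robust_value = penalized(q_star)

# No randomly drawn alternative should do better for the adversary
random_values = np.array([penalized(q) for q in rng.dirichlet(np.ones(n), 1000)])
print(np.isclose(risk_sensitive, robust_value))
```

The tilted distribution places more weight on low-value states, which is how the adversary lowers expected value while paying the KL penalty.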

8.3.4 Smooth Ambiguity

Ju and Miao (2012) propose and study a recursive smooth ambiguity model in the
context of asset pricing. A generic discrete formulation of their optimization problem
can be expressed in terms of the aggregator
$$
B(x, a, v) = \left\{ r(x, a) + \beta \left( \int \left[ \sum_{x'} v(x')^{\gamma} P_\theta(x, a, x') \right]^{\kappa/\gamma} \mu(x, d\theta) \right)^{\alpha/\kappa} \right\}^{1/\alpha}, \tag{8.52}
$$
where 𝛼, 𝜅, 𝛾 are nonzero parameters, 𝑃𝜃 is a stochastic kernel from G to X for each
𝜃 in a finite dimensional parameter space Θ, and 𝜇 ( 𝑥, ·) is a probability distribution
over Θ for each 𝑥 ∈ X. The distribution 𝜇 ( 𝑥, ·) represents subjective beliefs over the
transition rule for the state.
The aggregator 𝐵 in (8.52) is defined for 𝑥 ∈ X, 𝑎 ∈ Γ(𝑥) and 𝑣 ∈ 𝐼, where 𝐼 is the interior of the positive cone of RX. To ensure finite real values, we assume 𝑟 ≫ 0.
As with the Epstein–Zin case, 𝛼 parameterizes the elasticity of intertemporal sub-
stitution and 𝛾 governs risk aversion. The parameter 𝜅 captures ambiguity aversion.
If 𝜅 = 𝛾 , the agent is said to be ambiguity neutral.

EXERCISE 8.3.5. Show that the smooth ambiguity aggregator 𝐵 reduces to the
Epstein–Zin aggregator when the agent is ambiguity neutral.

Returning to (8.52), we focus on the case 𝜅 < 𝛾 < 0 < 𝛼 < 1, which includes the
calibration used in Ju and Miao (2012). (Other cases can be handled using similar
methods and details are left to the reader.) After constructing a suitable value space,
we will show that the resulting RDP is globally stable.
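As a first step, set $r_1 := \min r$, $r_2 := \max r$ and fix $\varepsilon > 0$. Consider the constant functions
$$
v_1 := \left( \frac{r_1}{1 - \beta} \right)^{1/\alpha} \quad \text{and} \quad v_2 := \left( \frac{r_2 + \varepsilon}{1 - \beta} \right)^{1/\alpha}.
$$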

EXERCISE 8.3.6. Prove that

𝑣1 ⩽ 𝐵 ( 𝑥, 𝑎, 𝑣1 ) ⩽ 𝐵 ( 𝑥, 𝑎, 𝑣2 ) < 𝑣2 for all ( 𝑥, 𝑎) ∈ G. (8.53)



In the remainder of this section on smooth ambiguity we set 𝑉 = [ 𝑣1 , 𝑣2 ].
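
To make the smooth ambiguity aggregator (8.52) concrete, here is a minimal sketch in which the parameter space Θ is replaced by a finite set, so the integral becomes a weighted sum. The finite Θ, the array shapes, the parameter values and the function name `smooth_ambiguity_B` are assumptions made for illustration. The code evaluates the aggregator at the constant functions 𝑣1 and 𝑣2 and, separately, compares the case 𝜅 = 𝛾 with an Epstein–Zin-style aggregator under the belief-averaged kernel.

```python
import numpy as np

def smooth_ambiguity_B(r, P, mu, v, beta, alpha, kappa, gamma):
    """
    r  : (n_x, n_a) strictly positive rewards
    P  : (n_theta, n_x, n_a, n_x) one stochastic kernel per theta
    mu : (n_x, n_theta) beliefs over theta at each state
    v  : (n_x,) strictly positive values
    """
    inner = (P @ v**gamma) ** (kappa / gamma)      # (n_theta, n_x, n_a)
    mixed = np.einsum("xt,txa->xa", mu, inner)     # integrate over theta
    return (r + beta * mixed ** (alpha / kappa)) ** (1 / alpha)

rng = np.random.default_rng(0)
n_theta, n_x, n_a = 3, 4, 2
beta, alpha, kappa, gamma = 0.9, 0.5, -3.0, -1.0   # kappa < gamma < 0 < alpha < 1
r = 0.5 + rng.random((n_x, n_a))
P = rng.random((n_theta, n_x, n_a, n_x))
P /= P.sum(axis=-1, keepdims=True)
mu = rng.random((n_x, n_theta))
mu /= mu.sum(axis=-1, keepdims=True)

# Constant bounding functions v1 and v2, with epsilon = 0.1
eps = 0.1
v1 = np.full(n_x, (r.min() / (1 - beta)) ** (1 / alpha))
v2 = np.full(n_x, ((r.max() + eps) / (1 - beta)) ** (1 / alpha))
B_lo = smooth_ambiguity_B(r, P, mu, v1, beta, alpha, kappa, gamma)
B_hi = smooth_ambiguity_B(r, P, mu, v2, beta, alpha, kappa, gamma)

# Ambiguity neutrality (kappa = gamma) versus the averaged kernel
v = rng.uniform(v1[0], v2[0], n_x)
B_neutral = smooth_ambiguity_B(r, P, mu, v, beta, alpha, gamma, gamma)
P_bar = np.einsum("xt,txay->xay", mu, P)
B_ez = (r + beta * (P_bar @ v**gamma) ** (alpha / gamma)) ** (1 / alpha)
print(np.allclose(B_neutral, B_ez))
```

In this example the values at 𝑣1 and 𝑣2 respect the ordering in (8.53) up to floating point error.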


EXERCISE 8.3.7. Prove that R ≔ ( Γ, 𝑉, 𝐵) is an RDP.

Here is our main result for this section. It implies that all optimality and conver-
gence results for R are valid (see, in particular, Theorem 8.1.1).
Proposition 8.3.5. Under the stated assumptions, the RDP R is globally stable.
To prove Proposition 8.3.5, we use a transformation, just as we did with the
Epstein–Zin case in §8.1.4.1. To this end we introduce the composite parameters
$$
\xi := \frac{\gamma}{\kappa} \in (0, 1) \quad \text{and} \quad \zeta := \frac{\alpha}{\kappa} < 0.
$$

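Then we define
$$
\hat B(x, a, v) = \left\{ r(x, a) + \beta \left( \int \left[ \sum_{x'} v(x')^{\xi} P_\theta(x, a, x') \right]^{1/\xi} \mu(x, d\theta) \right)^{\zeta} \right\}^{1/\zeta} \tag{8.54}
$$
and
$$
\hat V = [\hat v_1, \hat v_2] \quad \text{where} \quad \hat v_1 := v_2^{\kappa} \quad \text{and} \quad \hat v_2 := v_1^{\kappa}.
$$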
Note that 𝑉ˆ is a nonempty order interval of strictly positive real-valued functions, since
0 < 𝑣1 < 𝑣2 and 𝜅 < 0. We set R̂ = ( Γ, 𝑉ˆ, 𝐵ˆ).

EXERCISE 8.3.8. Prove that R̂ is an RDP satisfying
$$
\hat v_1 < \hat B(x, a, \hat v_1) \quad \text{and} \quad \hat B(x, a, \hat v_2) \leq \hat v_2 \quad \text{for all } (x, a) \in \mathsf G.
$$

The next exercise shows that R and R̂ are topologically conjugate (see §8.1.4).

EXERCISE 8.3.9. Let $\varphi$ be defined on $(0, \infty)$ by $\varphi(t) = t^{\kappa}$. Show that

(i) $B(x, a, v) = \varphi^{-1}[\hat B(x, a, \varphi \circ v)]$ for all $v \in V$ and $(x, a) \in \mathsf G$, and
(ii) $\varphi$ is a homeomorphism from $[v_1, v_2]$ to $[\hat v_1, \hat v_2]$ (as subsets of $\mathbb R$).

Lemma 8.3.6. For each ( 𝑥, 𝑎) ∈ G, the function 𝑣ˆ ↦→ 𝐵ˆ ( 𝑥, 𝑎, 𝑣ˆ) is concave on 𝑉ˆ.

Proof. Fix $(x, a) \in \mathsf G$. We write $\hat B(x, a, \hat v)$ as
$$
\hat B(x, a, \hat v) = \psi \left( \int f(\theta, \hat v) \mu(x, d\theta) \right)
$$
where
$$
f(\theta, v) := \left[ \sum_{x'} v(x')^{\xi} P_\theta(x, a, x') \right]^{1/\xi} \quad \text{and} \quad \psi(t) := \left( r(x, a) + \beta t^{\zeta} \right)^{1/\zeta}.
$$
For fixed $\theta$, the function $v \mapsto f(\theta, v)$ is concave over all $v$ in the interior of the positive cone of $\mathbb R^{\mathsf X}$ by Lemma 7.3.1 on page 234. The real-valued function $\psi$ satisfies $\psi' > 0$ and $\psi'' < 0$ over $t \in (0, \infty)$. Since we are composing order-preserving concave functions, it follows that $\hat B(x, a, \hat v)$ is concave on $\hat V$. □

Proof of Proposition 8.3.5. To prove that R is globally stable it suffices to prove that
R̂ is globally stable (see Exercise 8.3.9 and Proposition 8.1.3 on page 260). Given
the results of Exercise 8.3.8 and Lemma 8.3.6, the RDP R̂ is concave. But then R̂ is
globally stable, by Proposition 8.2.5. □

8.3.5 Minimization

Until now, all theory and applications have concerned maximization of lifetime values.
Now is a good time to treat minimization. Throughout this section, R is a well-posed RDP. The pointwise minimum $v^* := \bigwedge_\sigma v_\sigma$ is called the min-value function generated by R. We call a policy $\sigma \in \Sigma$ min-optimal for R if $v_\sigma = v^*$. A policy $\sigma \in \Sigma$ is called $v$-min-greedy for R if
$$
\sigma(x) \in \operatorname*{argmin}_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \mathsf X.
$$

We say that R obeys Bellman’s principle of min-optimality if

𝜎 ∈ Σ is min-optimal for R ⇐⇒ 𝜎 is 𝑣∗ -min-greedy.

The Bellman min-operator $T$ is defined by
$$
(T v)(x) = \min_{a \in \Gamma(x)} B(x, a, v) \qquad (x \in \mathsf X).
$$
We say that $v \in V$ obeys the min-Bellman equation if $T v = v$. The algorithm defined by replacing “$v$-greedy” with “$v$-min-greedy” in Algorithm 8.1 (HPI) will be called min-HPI.
We can now state the following result, which is analogous to Theorem 8.1.1. In
the statement, R is a well-posed RDP with min-value function 𝑣∗ .

Theorem 8.3.7 (Min-optimality). If R is globally stable, then


(i) 𝑣∗ is the unique solution to the min-Bellman equation in 𝑉 ,
(ii) R satisfies Bellman’s principle of min-optimality,
(iii) R has at least one min-optimal policy, and
(iv) min-HPI returns an exact optimal policy in finitely many steps.

Although we omit the details, a min-OPI convergence result directly analogous to the OPI convergence result in (v) of Theorem 8.1.1 also holds (after replacing maximization-based 𝑣-greedy policies with 𝑣-min-greedy policies).
Theorem 8.3.7 is proved in §9.2.3. For now we consider two applications that
involve minimization.

8.3.5.1 Application: Shortest Paths

Recall the shortest path problem introduced in Example 8.1.8, where X is the set of vertices of a graph, 𝐸 is the set of edges, 𝑐 : 𝐸 → R+ maps a travel cost to each edge (𝑥, 𝑥′) ∈ 𝐸, and O(𝑥) is the set of direct successors of 𝑥. The aim is to minimize total travel cost to a destination node 𝑑. We adopt all assumptions from Exercise 8.1.17 and assume in addition that 𝑐(𝑥, 𝑥′) = 0 implies 𝑥 = 𝑑. As in Exercise 8.1.17, we let 𝐶(𝑥) be the maximum cost of traveling to 𝑑 from 𝑥 along any directed path.
We regard the problem as an RDP R = (O, 𝑉, 𝐵) with 𝑉 = [0, 𝐶] and
$$
B(x, x', v) = c(x, x') + v(x') \qquad (x \in \mathsf X,\; x' \in \mathcal O(x)). \tag{8.55}
$$
In the present setting, the function 𝑣 in (8.55) is often called the cost-to-go function, with 𝑣(𝑥′) in (8.55) understood as remaining costs after moving to state 𝑥′.
While the value aggregator 𝐵 in (8.55) is simple, the absence of discounting (which is standard in the shortest path literature) means that R is not contracting. Fortunately, R turns out to be concave (in the sense of §8.2.3), which allows us to prove the following result.

Proposition 8.3.8. Under the stated conditions, the shortest path RDP is globally stable and the min-value function $v^*$ is the unique solution to
$$
v^*(x) = \min_{x' \in \Gamma(x)} \left\{ c(x, x') + v^*(x') \right\} \qquad (x \in \mathsf X)
$$
in $V$. A policy $\sigma \in \Sigma$ is min-optimal if and only if
$$
\sigma(x) \in \operatorname*{argmin}_{x' \in \Gamma(x)} \left\{ c(x, x') + v^*(x') \right\} \quad \text{for all } x \in \mathsf X.
$$

(In the present context, 𝑣∗ is also known as the minimum cost-to-go function.)

Proof. We first show that R is concave. By the definition of concave RDPs in §8.2.3, and given that $B(x, x', v)$ is affine in $v$ (and hence concave), it suffices to prove that there exists a $\delta > 0$ such that
$$
c(x, x') \geq \delta C(x) \quad \text{for all } x \in \mathsf X \text{ and } x' \in \mathcal O(x). \tag{8.56}
$$
(This corresponds to (8.39) on page 272 when $v_1 = 0$ and $v_2 = C$.)
To this end, we set
$$
\delta = \min_{x \neq d} \; \min_{x' \in \mathcal O(x)} \frac{c(x, x')}{C(x)}.
$$
By the stated cost assumptions, we have $c(x, x') > 0$ when $x \neq d$ and $x' \in \mathcal O(x)$, while $C(x) > 0$ when $x \neq d$. Since $\mathsf X$ is finite, it follows that $\delta$ is finite and positive. Evidently, with this definition, the bound (8.56) holds for all $x \neq d$. In addition, (8.56) holds trivially when $x = d$, since $C(d) = 0$. Hence (8.56) is valid for all $x \in \mathsf X$.
Concavity of R implies global stability by Proposition 8.2.5. The remaining claims
now follow from Theorem 8.3.7. □
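
As an illustration of Proposition 8.3.8, the following sketch iterates the Bellman min-operator on a small, hypothetical graph and then reads off a min-greedy policy. The graph, its costs and the helper names are invented for this example. Node 3 plays the role of the destination 𝑑 and is absorbing at zero cost, which we assume is in line with the conditions adopted from Exercise 8.1.17.

```python
# successors[x] maps each direct successor x' of x to the cost c(x, x')
successors = {
    0: {1: 1.0, 2: 4.0},
    1: {2: 1.0, 3: 5.0},
    2: {3: 1.0},
    3: {3: 0.0},          # destination d = 3
}

def bellman_min(v):
    "Apply (T v)(x) = min over x' in O(x) of c(x, x') + v(x')."
    return {x: min(c + v[y] for y, c in succ.items())
            for x, succ in successors.items()}

def min_greedy(v):
    "Return a v-min-greedy policy (ties broken by dictionary order)."
    return {x: min(succ, key=lambda y: succ[y] + v[y])
            for x, succ in successors.items()}

# Iterate from the zero function, which is the least element of V = [0, C]
v = {x: 0.0 for x in successors}
for _ in range(100):
    v_new = bellman_min(v)
    if v_new == v:        # exact fixed point reached
        break
    v = v_new

policy = min_greedy(v)
print(v)       # minimum cost-to-go at each node
print(policy)  # successor chosen at each node
```

In this example the iteration stops after a handful of steps, and the greedy policy routes node 0 through nodes 1 and 2 rather than taking the expensive direct edges.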

8.3.5.2 Application: Negative Discount Rate Optimality

When discussing MDPs we used 𝛽 to represent the discount factor. Given 𝛽 , the dis-
count rate or rate of time preference is the value 𝜌 that solves 𝛽 = 1/(1 + 𝜌). The
standard MDP assumption 𝛽 < 1 implies this rate is positive. You will recall from
Chapter 5 that the condition 𝛽 < 1 is central to the general theory of MDPs, since it
yields global stability of the Bellman and policy operators on RX (via the Neumann
series lemma or Banach’s fixed point theorem).
In the previous section, on shortest paths, we studied an RDP with a zero discount
rate. Now we go one step further and consider problems with negative rates of time
preference. Such preferences are commonly inferred when people face unpleasant
tasks. Subjects of studies often prefer getting such tasks “over and done with” rather
than postponing them. (Negative discount rates are inferred in other settings as well.
§9.3 provides background and references.)
In this section, we model optimal choice under a negative discount rate. Taking
our cue from the discussion above, we consider a scenario where a task generates
disutility but has to be completed. In particular, we assume that
$$
B(x, x', v) = c(x, x') + \beta v(x') \qquad (x, x' \in \mathsf X) \tag{8.57}
$$
where X is a finite set and 𝛽 > 1 is a constant. The function 𝑐 gives the cost of transitioning from 𝑥 to the new state 𝑥′.
The value aggregator 𝐵 in (8.57) is the same as the shortest path aggregator (8.55),
except for the constant 𝛽 . To keep the discussion simple, we adopt all other assump-
tions from the shortest path discussion in §8.3.5.1.

EXERCISE 8.3.10. Let 𝐶(𝑥) be the maximum cost of traveling from 𝑥 ∈ X to the destination node 𝑑 under any feasible policy. Prove that 𝐶(𝑥) < ∞ for all 𝑥.

We now have an RDP R = (Γ, 𝑉, 𝐵) with Γ = O, 𝐵 as in (8.57) and 𝑉 = [0, 𝐶]. The policy operators map 𝑉 into itself because, for 𝑣 ∈ 𝑉, we clearly have 0 ⩽ 𝑇𝜎𝑣 and, in addition,
$$
(T_\sigma v)(x) = c(x, \sigma(x)) + \beta v(\sigma(x)) \leq c(x, \sigma(x)) + \beta C(\sigma(x)) \leq C(x).
$$
The last bound holds because 𝐶(𝑥) is, by definition, at least as large as the cost of traveling from 𝑥 to 𝜎(𝑥) and then following the most expensive path.

Proposition 8.3.9. Under the stated conditions, the negative discount rate RDP is globally stable, the min-value function $v^*$ is the unique solution to
$$
v^*(x) = \min_{x' \in \Gamma(x)} \left\{ c(x, x') + \beta v^*(x') \right\} \qquad (x \in \mathsf X)
$$
in $V$ and a policy $\sigma \in \Sigma$ is min-optimal if and only if
$$
\sigma(x) \in \operatorname*{argmin}_{x' \in \Gamma(x)} \left\{ c(x, x') + \beta v^*(x') \right\} \quad \text{for all } x \in \mathsf X.
$$

Proof. The proof of Proposition 8.3.9 is essentially identical to the proof of Proposi-
tion 8.3.8. Readers are invited to confirm this. □
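
A small numerical sketch can show how 𝛽 > 1 pushes costs toward the present. The graph below is hypothetical: from node 0, one route pays a cost of 3 now and 1 later, while the other pays 1 now and 3 later, so the undiscounted totals coincide. The value 𝛽 = 0.5 is included only as a contrast case and lies outside the setting of this section.

```python
# successors[x] maps each direct successor x' of x to the cost c(x, x')
successors = {
    0: {1: 3.0, 2: 1.0},   # pay a lot now (via 1) or a little now (via 2)
    1: {3: 1.0},
    2: {3: 3.0},
    3: {3: 0.0},           # destination d = 3, absorbing at zero cost
}

def solve(beta, max_iter=100):
    "Iterate the Bellman min-operator for aggregator c(x, x') + beta * v(x')."
    v = {x: 0.0 for x in successors}
    for _ in range(max_iter):
        v_new = {x: min(c + beta * v[y] for y, c in succ.items())
                 for x, succ in successors.items()}
        if v_new == v:
            break
        v = v_new
    policy = {x: min(succ, key=lambda y: succ[y] + beta * v[y])
              for x, succ in successors.items()}
    return v, policy

v_neg, pol_neg = solve(beta=1.5)   # negative discount rate
v_pos, pol_pos = solve(beta=0.5)   # positive discount rate, for contrast
print(pol_neg[0], v_neg[0])
print(pol_pos[0], v_pos[0])
```

With 𝛽 = 1.5 the min-greedy choice at node 0 is the route that front-loads the large cost, matching the “over and done with” behavior described above. With 𝛽 = 0.5 the large cost is postponed instead.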

8.4 Chapter Notes

The RDP framework adopted in this chapter is inspired by Bertsekas (2022b), who
in turn credits Mitten (1964) as the first research paper to frame Richard Bellman’s
dynamic programming problems in an abstract setting. Denardo (1967) describes key
ideas including what we call contracting RDPs (see §8.2.1). Denardo credits Shapley
(1953) for inspiring his contraction-based arguments.

The key optimality results from this chapter (Theorems 9.2.4 and 8.1.1) are
somewhat new, although closely related results appear in Bertsekas (2022b). See,
in addition, Bloise et al. (2023), which builds on Bertsekas (2022b) and Ren and
Stachurski (2021).
The job search application with quantile preferences in §8.2.1.4 is based on de Cas-
tro et al. (2022). The same reference includes a general theory of dynamic program-
ming when certainty equivalents are computed using quantile operators and aggre-
gation is time additive.
The optimal default application in §8.2.1.5 is loosely based on Arellano (2008).
Influential contributions to this line of work include Yue (2010), Chatterjee and Eyigungor
(2012), Arellano and Ramanarayanan (2012), Cruces and Trebesch (2013),
Ghosh et al. (2013), Gennaioli et al. (2014), and Bocola et al. (2019).
At the start of the chapter we motivated RDPs by mentioning that equilibria in
some models of production and economic geography can be computed using dynamic
programming. Examples include Hsu (2012), Hsu et al. (2014), Antràs and De Gortari
(2020), Kikuchi et al. (2021) and Tyazhelnikov (2022).
Early references for dynamic programming with risk-sensitive preferences include
Jacobson (1973), Whittle (1981), and Hansen and Sargent (1995). Elegant mod-
ern treatments can be found in Asienkiewicz and Jaśkiewicz (2017) and Bäuerle and
Jaśkiewicz (2023), and an extension to general static risk measures is available in
Bäuerle and Glauner (2022). Risk-sensitivity is applied to the study of optimal growth
in Bäuerle and Jaśkiewicz (2018), and to optimal dividend payouts in Bäuerle and
Jaśkiewicz (2017). Risk-sensitivity is also used in applications of reinforcement learn-
ing, where the underlying state process is not known. See, for example, Shen et al.
(2014), Majumdar et al. (2017) or Gao et al. (2021).
Dynamic programming problems that acknowledge model uncertainty by includ-
ing adversarial agents to promote robust decision rules can be found in Cagetti et al.
(2002), Hansen and Sargent (2011), and other related papers. Al-Najjar and Shmaya
(2019) study the connection between Epstein–Zin utility and parameter uncertainty.
Ruszczyński (2010) considers risk averse dynamic programming and time consistency.
The smooth ambiguity model in §8.3.4 is loosely adapted from Klibanoff et al.
(2009) and Ju and Miao (2012). For applications of optimization under smooth am-
biguity, see, for example, Guan and Wang (2020) or Yu et al. (2023). Zhao (2020)
studies yield curves in a setting where ambiguity-averse agents face varying amounts
of Knightian uncertainty over the short and long run.
Readers who wish to see some motivation for the discussion of negative discounting in §8.3.5.2 can consult Loewenstein and Sicherman (1991), who found that the
majority of workers they surveyed reported a preference for increasing wage profiles
over decreasing ones that yield the same undiscounted sum, even when it was pointed
out that the latter could be used to construct a dominating consumption sequence.
Loewenstein and Prelec (1991) obtained similar results. In summarizing their study,
they argue that, in the context of the choice problems that they examined, “sequences
of outcomes that decline in value are greatly disliked, indicating a negative rate of
time preference” (Loewenstein and Prelec, 1991, p. 351).
In §8.3.5.2 we considered dynamic programs with negative discount rates. A more
general treatment of such problems can be found in Kikuchi et al. (2021), which also
shows how negative discount rate dynamic programs connect to static problems con-
cerning equilibria in production networks and draws connections with Coase’s theory
of the firm.
An algorithm that we neglected to discuss is stochastic gradient descent (or ascent)
in policy space. Typically policies are parameterized via an approximation architec-
ture that consists of basis functions, activation functions, and compositions of them
(e.g., a neural network). In large models, such approximation is used even when the
state and action spaces are finite, simply because the curse of dimensionality makes
exact representations infeasible. For recent discussions of gradient descent in pol-
icy spaces see Nota and Thomas (2019), Mei et al. (2020), and Bhandari and Russo
(2022).
Chapter 9

Abstract Dynamic Programming

In Chapter 8 we introduced RDPs, stated their optimality properties and investigated applications that satisfy optimality conditions. But we have yet to prove the core optimality and convergence results in Theorem 8.1.1.
Rather than proving these results directly, we now present a very abstract version of a dynamic programming problem that consists of a family of self-maps on a partially ordered set. Doing so allows us to simplify proofs and extend the reach of dynamic programming theory. (The value of these extensions will become clearer in Volume II.)

9.1 Abstract Dynamic Programs


First we define abstract dynamic programs and prove optimality results under a set
of high level assumptions. Then we connect these results to our Chapter 8 optimality
claims for RDPs.

9.1.1 Preliminaries
Let’s cover some fundamental concepts that we’ll use when considering abstract dy-
namic programs.

9.1.1.1 Order Stability

The first concept is related to stability of maps over partially ordered spaces. Our aim
is to provide a weak notion of stability that can be applied in any partially ordered set
(without any form of topology).

CHAPTER 9. ABSTRACT DYNAMIC PROGRAMMING 292

[Figure 9.1: Order-stability of a self-map 𝑇 on 𝑉 = ([0, 1], ⩽). The plot shows 𝑇 and the 45° line over 𝑉 = [0, 1].]

Let $V = (V, \preceq)$ be a partially ordered set and let $T$ be a self-map on $V$ with exactly one fixed point $\bar v$ in $V$. In this setting, we call $T$

• upward stable on $V$ if $v \in V$ and $v \preceq T v$ implies $v \preceq \bar v$,
• downward stable on $V$ if $v \in V$ and $T v \preceq v$ implies $\bar v \preceq v$, and
• order stable on $V$ if $T$ is both upward and downward stable.

Figure 9.1 gives an illustration of a map 𝑇 on 𝑉 = [0, 1] that is order stable:
all points mapped up by 𝑇 lie below its fixed point and all points mapped down by 𝑇
lie above its fixed point. The figure suggests that order stability is related to global
stability, as defined in §1.2.2.2. We affirm this in Lemma 9.1.1, just below.

EXERCISE 9.1.1. Let $\mathsf X$ be finite and consider the self-map on $V := (\mathbb R^{\mathsf X}, \leq)$ defined by $T v = r + A v$ for some $r \in \mathbb R^{\mathsf X}$ and $A \in \mathcal L(\mathbb R^{\mathsf X})$ with $0 \leq A$ and $\rho(A) < 1$. Prove that $T$ is order stable on $V$.

Lemma 9.1.1. Let $\mathsf X$ be finite, let $V$ be a subset of $\mathbb R^{\mathsf X}$, and let $T$ be an order-preserving self-map on $V$. If $T$ is globally stable on $V$, then $T$ is order stable on $V$.

Proof. Assume the stated conditions. By global stability, $T$ has a unique fixed point $\bar v$ in $V$. If $v \in V$ and $v \leq T v$, then iterating on this inequality and using the fact that $T$ is order-preserving yields $v \leq T^k v$ for all $k \in \mathbb N$. Applying global stability and taking the limit gives $v \leq \bar v$. Hence upward stability holds. The proof of downward stability is similar. □
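
For intuition, order stability can be checked numerically for the affine map 𝑇𝑣 = 𝑟 + 𝐴𝑣 with 𝐴 ⩾ 0 and spectral radius below one. The sketch below is not a proof. The random matrix, the rescaling to spectral radius 0.9 and all variable names are assumptions made for illustration. Points mapped up by 𝑇 are generated by solving (𝐼 − 𝐴)𝑣 = 𝑟 − 𝑤 for nonnegative 𝑤, so that 𝑇𝑣 − 𝑣 = 𝑤.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.random((n, n))                              # nonnegative matrix
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))     # spectral radius 0.9
r = rng.standard_normal(n)
I = np.eye(n)

def T(v):
    return r + A @ v

v_bar = np.linalg.solve(I - A, r)                   # unique fixed point of T

# Each row v of mapped_up satisfies T v - v = w >= 0 (v is mapped up);
# each row of mapped_down satisfies T v - v = -w <= 0 (v is mapped down).
W = rng.random((1000, n))
mapped_up = np.linalg.solve(I - A, (r - W).T).T
mapped_down = np.linalg.solve(I - A, (r + W).T).T

print(np.all(mapped_up <= v_bar + 1e-9))            # upward stability
print(np.all(mapped_down >= v_bar - 1e-9))          # downward stability

# Iterating T from a point that is mapped up gives an increasing sequence
path = [mapped_up[0]]
for _ in range(500):
    path.append(T(path[-1]))
print(np.allclose(path[-1], v_bar))
```

The final loop mirrors the argument in the proof of Lemma 9.1.1: the iterates rise monotonically and converge to the fixed point, so the starting point must lie below it.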

9.1.1.2 Order Duals

Given a partially ordered set $V = (V, \preceq)$, let $V^\partial = (V, \preceq^\partial)$ be the order dual, so that, for $u, v \in V$, we have $u \preceq^\partial v$ if and only if $v \preceq u$. (The notation is slightly confusing but the concept is simple: $V^\partial$ is just $V$ with the order reversed.) The following result will be useful.

Lemma 9.1.2. $S$ is order stable on $V$ if and only if $S$ is order stable on $V^\partial$.

Proof. Let $S$ be order stable on $V$. By definition, $S$ has a unique fixed point $\bar v \in V$. Hence it remains only to check that $S$ is upward and downward stable on $V^\partial$. Regarding upward stability, suppose $v \in V$ and $v \preceq^\partial S v$. Then $S v \preceq v$ and hence $\bar v \preceq v$, by downward stability of $S$ on $V$. But then $v \preceq^\partial \bar v$, so $S$ is upward stable on $V^\partial$. The proof of downward stability is similar.
We have shown that $S$ is order stable on $V^\partial$ whenever $S$ is order stable on $V$. The reverse implication holds because the order dual of $V^\partial$ is $V$. □

9.1.2 Abstract Dynamic Programs

In this section we formalize abstract dynamic programs and present fundamental op-
timality results. §9.1.2.1 starts the ball rolling with an informal overview.

9.1.2.1 Prelude

We saw in §8.1 that a globally stable RDP yields a set of feasible policies Σ and, for
each 𝜎 ∈ Σ, a policy operator 𝑇𝜎 defined on the value space 𝑉 ⊂ RX . Notice that the
dynamic program is fully specified by the family of operators {𝑇𝜎 }𝜎∈Σ and the space
𝑉 that they act on. From this set of operators we obtain the set of lifetime values
{ 𝑣𝜎 }𝜎∈Σ , with each 𝑣𝜎 uniquely identified as a fixed point of 𝑇𝜎 . These lifetime values
define the value function 𝑣∗ as the pointwise maximum 𝑣∗ = ∨𝜎 𝑣𝜎 . An optimal policy
is then defined as a 𝜎 ∈ Σ obeying 𝑣𝜎 = 𝑣∗ .
To shed unnecessary structure before the main optimality proofs, a natural idea is
to start directly with an abstract set of “policy operators” {𝑇𝜎 } acting on some set 𝑉 .

One can then define lifetime values and optimality as in the previous paragraph and
start to investigate conditions on the family of operators {𝑇𝜎 } that lead to optimality.
We use these ideas as our starting point, beginning with an arbitrary family {𝑇𝜎 }
of operators on a partially ordered set.

9.1.2.2 Defining ADPs

An abstract dynamic program (ADP) is a pair $\mathcal A = (V, \{T_\sigma\}_{\sigma \in \Sigma})$ such that

(i) $V = (V, \preceq)$ is a partially ordered set,
(ii) $\{T_\sigma\} := \{T_\sigma\}_{\sigma \in \Sigma}$ is a family of self-maps on $V$, and
(iii) for all $v \in V$, the set $\{T_\sigma v\}_{\sigma \in \Sigma}$ has both a least and greatest element.

Elements of the index set $\Sigma$ are called policies and elements of $\{T_\sigma\}$ are called policy operators. Given $v \in V$, a policy $\sigma$ in $\Sigma$ is called $v$-greedy if $T_\sigma v \succeq T_\tau v$ for all $\tau \in \Sigma$. Existence of a greatest element in (iii) of the definition above is equivalent to the statement that each $v \in V$ has at least one $v$-greedy policy.

Remark 9.1.1. Existence of a least element in (iii) is needed only because we wish to
consider minimization as well as maximization. For settings where only maximization
is considered, this can be dropped from the list of assumptions. (An analogous state-
ment holds for minimization and greatest elements.) We mention least elements in
Example 9.1.1 below and then disregard them until we treat minimization in §9.2.3.

Remark 9.1.2. In the applications treated in this chapter, $\preceq$ will always be the pointwise partial order. In Volume II other partial orders arise.

Example 9.1.1 (RDPs generate ADPs). Let $\mathcal R = (\Gamma, V, B)$ be an RDP with finite state space $\mathsf X$, as defined in §8.1.1. For each $\sigma$ in the feasible policy set $\Sigma$, let $T_\sigma$ be the corresponding policy operator, defined at $v \in V$ by $(T_\sigma v)(x) = B(x, \sigma(x), v)$. The pair $\mathcal A_{\mathcal R} := (V, \{T_\sigma\})$ is an ADP, since $V$ is partially ordered by $\leq$, $T_\sigma$ is a self-map on $V$ for all $\sigma \in \Sigma$, and, given $v \in V$, choosing $\bar\sigma \in \Sigma$ such that $\bar\sigma(x) \in \operatorname{argmax}_{a \in \Gamma(x)} B(x, a, v)$ for all $x \in \mathsf X$ produces a $v$-greedy policy and a greatest element for $\{T_\sigma v\}$ (cf. Exercise 8.1.7 on page 253). A least element of $\{T_\sigma v\}$ can be generated by replacing “argmax” with “argmin.”

In the setting of Example 9.1.1, we call AR the ADP generated by R.

Example 9.1.2. Let $\mathcal M = (\Gamma, \beta, r, P)$ be an MDP, as defined in §5.1.1, with policy operators $\{T_\sigma\}$ defined by $T_\sigma v = r_\sigma + \beta P_\sigma v$ (as in (5.19)). $\mathcal A_{\mathcal M} := (\mathbb R^{\mathsf X}, \{T_\sigma\})$ is an ADP (as a special case of Example 9.1.1). We call $\mathcal A_{\mathcal M}$ the ADP generated by $\mathcal M$.

We have just shown that RDPs are ADPs. But there are also ADPs that do not
fit naturally into the RDP framework. The next two examples illustrate. In these
examples, the Bellman equation does not match the RDP Bellman equation 𝑣 ( 𝑥 ) =
max 𝑎∈Γ ( 𝑥 ) 𝐵 ( 𝑥, 𝑎, 𝑣) due to the inverted order of expectation and maximization.

Example 9.1.3. Recall the $Q$-factor MDP Bellman operator, which takes the form
$$
(S q)(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')} q(x', a') P(x, a, x') \tag{9.1}
$$
with $q \in \mathbb R^{\mathsf G}$ and $(x, a) \in \mathsf G$. (We are repeating (5.39) on page 171.) The $Q$-factor policy operators $\{S_\sigma\}$ corresponding to (9.1) are given by
$$
(S_\sigma q)(x, a) = r(x, a) + \beta \sum_{x'} q(x', \sigma(x')) P(x, a, x') \qquad ((x, a) \in \mathsf G). \tag{9.2}
$$
Each $S_\sigma$ is a self-map on $\mathbb R^{\mathsf G} = (\mathbb R^{\mathsf G}, \leq)$. If $q \in \mathbb R^{\mathsf G}$ and $\sigma \in \Sigma$ is such that $\sigma(x) \in \operatorname{argmax}_{a \in \Gamma(x)} q(x, a)$ for all $x \in \mathsf X$, then $S_\sigma q \geq S_\tau q$ on $\mathsf G$ for all $\tau \in \Sigma$. Hence $\sigma$ is $q$-greedy and $\mathcal A := (\mathbb R^{\mathsf G}, \{S_\sigma\})$ is an ADP.
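
The greedy property claimed in Example 9.1.3 can be verified by brute force on a small model. The sketch below assumes that every action is feasible in every state, so Γ(𝑥) is the full action set, and uses randomly generated primitives. The names `S`, `S_sigma` and `policies` are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n_x, n_a, beta = 3, 2, 0.9
r = rng.random((n_x, n_a))
P = rng.random((n_x, n_a, n_x))
P /= P.sum(axis=-1, keepdims=True)       # P[x, a, :] is a distribution

def S_sigma(q, sigma):
    "Q-factor policy operator (9.2); sigma is an integer array of actions."
    q_sigma = q[np.arange(n_x), sigma]   # x' -> q(x', sigma(x'))
    return r + beta * P @ q_sigma

def S(q):
    "Q-factor Bellman operator (9.1)."
    return r + beta * P @ q.max(axis=1)

q = rng.standard_normal((n_x, n_a))
greedy = q.argmax(axis=1)                # sigma(x) in argmax_a q(x, a)

# Enumerate every policy and compare images pointwise on G
policies = [np.array(s) for s in itertools.product(range(n_a), repeat=n_x)]
dominates = all(np.all(S_sigma(q, greedy) >= S_sigma(q, s) - 1e-12)
                for s in policies)
print(dominates, np.allclose(S_sigma(q, greedy), S(q)))
```

The second check confirms that applying the policy operator of a 𝑞-greedy policy reproduces the Bellman operator 𝑆 at 𝑞.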

Example 9.1.4. In reinforcement learning and related fields the $Q$-factor approach from Example 9.1.3 has been extended to risk-sensitive decision processes (see, e.g., Fei et al. (2021)). The corresponding $Q$-factor Bellman equation is given by
$$
f(x, a) = r(x, a) + \frac{\beta}{\theta} \ln \left\{ \sum_{x'} \exp \left( \theta \max_{a' \in \Gamma(x')} f(x', a') \right) P(x, a, x') \right\} \qquad ((x, a) \in \mathsf G). \tag{9.3}
$$
The policy operators over risk-sensitive $Q$-factors take the form
$$
(Q_\sigma f)(x, a) = r(x, a) + \frac{\beta}{\theta} \ln \left[ \sum_{x'} \exp \left[ \theta f(x', \sigma(x')) \right] P(x, a, x') \right] \tag{9.4}
$$
where $f \in \mathbb R^{\mathsf G}$ and $\sigma \in \Sigma$. An argument similar to the one given in Example 9.1.3 confirms that each $f \in \mathbb R^{\mathsf G}$ has an $f$-greedy policy. Hence $(\mathbb R^{\mathsf G}, \{Q_\sigma\})$ is an ADP.

In Chapter 10 we will see that continuous time dynamic programs can also be
viewed as ADPs.

9.2 Optimality
In this section we study optimality properties of ADPs, aiming for generalizations of
the foundational results of dynamic programming. To achieve this aim we need to
define optimality and provide sufficient conditions.

9.2.1 Max-Optimality

We begin with maximization. Later, in §9.2.3, we will show that results for minimiza-
tion problems are simple corollaries of maximization results.

9.2.1.1 Lifetime Values

The objective of dynamic programming is to optimize lifetime value. But what is lifetime value in this abstract context? Suppose that, for an ADP (𝑉, {𝑇𝜎 }) and fixed 𝜎 ∈ Σ, the policy operator 𝑇𝜎 has a unique fixed point. In this setting, we write 𝑣𝜎 for the fixed point of 𝑇𝜎 and call it the 𝜎-value function. We interpret it as the lifetime value of following policy 𝜎. A closely related interpretation was discussed at length for RDPs in §8.1.2.1 and the situation here is analogous.
Example 9.2.1. Let M be an MDP. If AM is the ADP generated by M, as in Exam-
ple 9.1.2, then the unique fixed point of 𝑇𝜎 is 𝑣𝜎 = ( 𝐼 − 𝛽𝑃𝜎 ) −1 𝑟𝜎 . This accords with our
interpretation of fixed points of 𝑇𝜎 as lifetime values, since ( 𝐼 − 𝛽𝑃𝜎 ) −1 𝑟𝜎 is precisely
the lifetime value of 𝜎 under the MDP assumptions (see §5.1.3.1).
Example 9.2.2. Let A = (𝑉, {𝑇𝜎 }) be an ADP in which each 𝑇𝜎 is a Koopmans operator on 𝑉 , as
defined in §7.3.1. A fixed point of a Koopmans operator is interpreted as lifetime
utility under the preferences it represents (see §7.3.1). Thus 𝑣𝜎 , when well-defined,
is the lifetime value associated with policy 𝜎 and the preferences embedded in 𝑇𝜎 .

We call an ADP A ≔ (𝑉, {𝑇𝜎 }) well-posed if every policy operator 𝑇𝜎 has a unique
fixed point in 𝑉 . In view of the preceding discussion on lifetime values, well-posedness
is a minimum requirement for constructing an optimality theory around ADPs.

9.2.1.2 Operators

Let $\mathcal A = (V, \{T_\sigma\})$ be an ADP. We set
$$
T v := \bigvee_{\sigma} T_\sigma v \qquad (v \in V) \tag{9.5}
$$
and call $T$ the Bellman operator generated by $\mathcal A$. Note that $T$ is a well-defined self-map on $V$ by part (iii) of the definition of ADPs (existence of greedy policies). A function $v \in V$ is said to satisfy the Bellman equation if it is a fixed point of $T$.
The definition of $T$ in (9.5) includes all of the Bellman operators we have met as special cases. For example, consider an RDP $\mathcal R = (\Gamma, V, B)$ with Bellman operator $(T v)(x) = \max_{a \in \Gamma(x)} B(x, a, v)$. We can write $T v$ as $\bigvee_\sigma T_\sigma v$, as shown in Exercise 8.1.8 on page 253. Thus, the Bellman operator of the RDP agrees with the Bellman operator $T$ of the corresponding ADP $\mathcal A_{\mathcal R}$.

EXERCISE 9.2.1. Show that

(i) 𝜎 ∈ Σ is 𝑣-greedy if and only if 𝑇𝜎 𝑣 = 𝑇 𝑣, and


(ii) 𝑇 in (9.5) is order-preserving whenever 𝑇𝜎 is order-preserving for all 𝜎 ∈ Σ.

Below we consider Howard policy iteration (HPI) as an algorithm for solving for optimal policies of ADPs. We use precisely the same instruction set as for the RDP case, as shown in Algorithm 8.1 on page 254. To further clarify the algorithm, we define a map 𝐻 from 𝑉 to {𝑣𝜎 } via 𝐻𝑣 = 𝑣𝜎 where 𝜎 is 𝑣-greedy. Iterating with 𝐻 generates the value sequence associated with Howard policy iteration.¹ In what follows, we call 𝐻 the Howard operator generated by the ADP.
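
The Howard operator can be sketched directly from the ADP primitives: a finite list of policies together with their policy operators. The example below uses the ADP generated by a small, randomly drawn MDP, with 𝑇𝜎𝑣 = 𝑟𝜎 + 𝛽𝑃𝜎𝑣, and selects the first 𝑣-greedy policy in a fixed enumeration. The model, the tolerance-based comparison and the function names are assumptions for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n_x, n_a, beta = 3, 2, 0.9
r = rng.random((n_x, n_a))
P = rng.random((n_x, n_a, n_x))
P /= P.sum(axis=-1, keepdims=True)
idx = np.arange(n_x)
policies = list(itertools.product(range(n_a), repeat=n_x))  # the index set Sigma

def T_sigma(v, sigma):
    "Policy operator T_sigma v = r_sigma + beta * P_sigma v."
    s = list(sigma)
    return r[idx, s] + beta * P[idx, s] @ v

def lifetime_value(sigma):
    "Unique fixed point v_sigma of T_sigma."
    s = list(sigma)
    return np.linalg.solve(np.eye(n_x) - beta * P[idx, s], r[idx, s])

def first_greedy(v):
    "First policy in the enumeration whose image is the greatest element."
    images = [T_sigma(v, s) for s in policies]
    top = np.max(images, axis=0)          # pointwise supremum, equal to T v
    for s, image in zip(policies, images):
        if np.allclose(image, top):
            return s

def H(v):
    "Howard operator: lifetime value of a v-greedy policy."
    return lifetime_value(first_greedy(v))

v_star = np.zeros(n_x)
for k in range(20):
    v_next = H(v_star)
    if np.allclose(v_next, v_star):
        break
    v_star = v_next

print(k, v_star)
```

With only eight policies, the loop exits well before the iteration cap in this example, and the limit dominates the lifetime value of every policy.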

9.2.1.3 Properties

Let A ≔ (𝑉, {𝑇𝜎 }𝜎∈Σ ) be an ADP. We call A

• finite if Σ is a finite set,


• order stable if every policy operator 𝑇𝜎 is order stable on 𝑉 , and
• max-stable if A is order stable and 𝑇 has at least one fixed point in 𝑉 .

Obviously max-stable =⇒ order stable =⇒ well-posed.


Regarding the definition of max-stability, existence of a fixed point of 𝑇 in 𝑉 is a
high level assumption that can be challenging to verify in applications. At the same
time, our main concern in the present volume is the case where A is finite, and, in
this setting, order stability is enough:
1 For 𝐻 to be well-defined, we must always select the same 𝑣-greedy policy when the operator is
applied to 𝑣. We can use the axiom of choice to assign to each 𝑣 a designated 𝑣-greedy policy, although,
in applications, a simple rule usually suffices. For example, if Σ is finite, we can enumerate the policy
set Σ and choose the first 𝑣-greedy policy.

Proposition 9.2.1. If A is order stable and finite, then A is max-stable.

Proposition 9.2.1 is proved in §B.4.

Corollary 9.2.2. Let R be an RDP and let AR be the ADP generated by R. If R is globally
stable, then AR is max-stable.

Proof. Let R and AR be as stated and suppose that R is globally stable. In view of
Lemma 9.1.1 on page 292, each policy operator is order stable. Hence AR is order
stable. Since Σ is finite, Proposition 9.2.1 implies that AR is also max-stable. □

EXERCISE 9.2.2. Show that the ADP described in Example 9.1.3 is max-stable.

Order stability is central to the optimality results stated below. While order stabil-
ity is a somewhat nonstandard condition, the next result shows that, at least in simple
settings, order stability is necessary for any discussion of optimality.

Proposition 9.2.3. Let A = (𝑉, {𝑇𝜎 }) be an ADP generated by an RDP R = ( Γ, 𝑉, 𝐵). If
𝑉 is an order interval in RX , then the following statements are equivalent:

(i) A is well-posed.
(ii) A is order stable.

Proof. Let A be as stated, with 𝑉 = [ 𝑣1 , 𝑣2 ] for some 𝑣1 , 𝑣2 in RX with 𝑣1 ⩽ 𝑣2 . Obviously
(ii) =⇒ (i). Regarding (i) =⇒ (ii), let A be well-posed and pick any policy
operator 𝑇𝜎 . Since A is well-posed, 𝑇𝜎 has a unique fixed point 𝑣𝜎 in 𝑉 . Suppose 𝑣 ∈ 𝑉
with 𝑇𝜎 𝑣 ⩽ 𝑣. Since 𝑇𝜎 is order-preserving, 𝑇𝜎 is a self-map on [ 𝑣1 , 𝑣]. By the Knaster–Tarski
theorem (p. 213), 𝑇𝜎 has at least one fixed point in [ 𝑣1 , 𝑣]. By uniqueness, that
fixed point is 𝑣𝜎 . Hence 𝑣𝜎 ⩽ 𝑣 and downward stability holds. Upward stability can
be confirmed via a similar argument. Hence A is order stable. □

9.2.1.4 Max-Optimality Results

Let A = (𝑉, {𝑇𝜎 }) be a well-posed ADP with 𝜎-value functions { 𝑣𝜎 }𝜎∈Σ . We define

𝑉Σ ≔ { 𝑣𝜎 }𝜎∈ Σ and 𝑉𝑢 ≔ { 𝑣 ∈ 𝑉 : 𝑣 ⪯ 𝑇 𝑣} .

EXERCISE 9.2.3. Prove that 𝑉Σ ⊂ 𝑉𝑢 .

If 𝑉Σ has a greatest element, then we denote it by 𝑣∗ and call it the value function
generated by A. In this setting, a policy 𝜎 ∈ Σ is called optimal for A if 𝑣𝜎 = 𝑣∗ . We
say that A obeys Bellman’s principle of optimality if

𝜎 ∈ Σ is optimal for A ⇐⇒ 𝜎 is 𝑣∗ -greedy.

These definitions are direct generalizations of the corresponding definitions for
RDPs discussed in Chapter 8.
We can now state our main optimality result for ADPs.

Theorem 9.2.4 (Max-optimality). If A is finite and order stable, then

(i) the set of 𝜎-value functions 𝑉Σ has a greatest element 𝑣∗ ,


(ii) 𝑣∗ is the unique solution to the Bellman equation in 𝑉 ,
(iii) A obeys Bellman’s principle of optimality,
(iv) A has at least one optimal policy, and
(v) HPI returns an exact optimal policy in finitely many steps.

Theorem 9.2.4 informs us that finite well-posed ADPs have first-rate optimality
properties under a relatively mild stability condition. In §9.2.2 we use Theorem 9.2.4
to prove all optimality results for RDPs stated in Chapter 8. The proof of Theo-
rem 9.2.4 is given in §B.4 (see page 347). Note that (iv) follows directly from (i)
and is included only for completeness.

9.2.1.5 General States

This volume focuses on dynamic programming problems with finite states. Here we
restrict ourselves to one high-level result for general state spaces.

Proposition 9.2.5. If A is max-stable, then (i)–(iv) of Theorem 9.2.4 hold.

Proposition 9.2.5 tells us that we can drop finiteness of the policy set Σ (which is
implied by finite states and actions) whenever the Bellman operator has at least one
fixed point. Various fixed point methods are available for establishing this existence.
We defer further details until Volume II. Proposition 9.2.5 is proved in §B.4.

9.2.1.6 Application: Mixed Strategies

This section discusses adding mixed strategies to an RDP. We will need to apply Propo-
sition 9.2.5 to discuss optimality because the set of mixed strategies is not finite.
Let R = ( Γ, 𝑉, 𝐵) be an RDP with finite state space X, finite action space A, policy
set Σ and Bellman operator 𝑇 (see §8.1.3). A mixed strategy for R is a map 𝜑 sending
𝑥 ∈ X into a distribution 𝜑 𝑥 ∈ D(A) supported on Γ ( 𝑥 ). In other words, for each 𝑥 ∈ X,

$$\varphi_x \colon \mathsf{A} \to [0, 1] \quad \text{and} \quad \sum_{a \in \Gamma(x)} \varphi_x(a) = 1.$$

Let Φ be the set of all mixed strategies for R. For each mixed strategy 𝜑 ∈ Φ, we
introduce the policy operator on 𝑉 defined by
$$(\hat T_\varphi v)(x) = \sum_{a \in \mathsf{A}} B(x, a, v) \, \varphi_x(a) \qquad (v \in V, \; x \in \mathsf{X}).$$

The right hand side is the expected lifetime value from current state 𝑥 , when the
current action is drawn from 𝜑 𝑥 and future states are evaluated via 𝑣.

EXERCISE 9.2.4. Fix $v \in V$. Prove: If $\varphi \in \Phi$ and, for each $x \in \mathsf{X}$, the distribution
$\varphi_x$ is supported on $\operatorname{argmax}_{a \in \Gamma(x)} B(x, a, v)$, then $\hat T_\varphi v \geqslant \hat T_\psi v$ for all $\psi \in \Phi$.

EXERCISE 9.2.5. Show that, given 𝑣 ∈ 𝑉 and 𝑥 ∈ X we have

$$\max_{\varphi \in \Phi} (\hat T_\varphi v)(x) = \max_{a \in \Gamma(x)} B(x, a, v).$$

It follows from the discussion above that A𝑀 ≔ (𝑉, {𝑇ˆ𝜑 } 𝜑∈Φ ) is an ADP (where “M”
stands for “mixed”), and that the Bellman operator 𝑇ˆ associated with the ADP A𝑀 is
given by
$$(\hat T v)(x) = \max_{a \in \Gamma(x)} B(x, a, v) = (T v)(x) \qquad (v \in V, \; x \in \mathsf{X}). \tag{9.6}$$

Let us assume for simplicity that R is contracting (see §8.2.1), with modulus of
contraction 𝛽 ∈ (0, 1). Assume also that 𝑉 is closed in RX . As a result, the value
function 𝑣∗ for R exists in 𝑉 and is the unique fixed point of 𝑇 in 𝑉 (Corollary 8.2.2).

EXERCISE 9.2.6. Show that, under the assumptions stated above, {𝑇ˆ𝜑 } 𝜑∈Φ and 𝑇ˆ
are all contraction mappings.

By Exercise 9.2.6, the ADP A𝑀 is max-stable (since globally stable operators are
order stable – see Lemma 9.1.1 on page 292 – and the Bellman operator 𝑇ˆ has a fixed
point). Hence, by Proposition 9.2.5, the value function 𝑣ˆ∗ for A𝑀 exists in 𝑉 and is
the unique fixed point of 𝑇ˆ in 𝑉 . But, by (9.6), 𝑇ˆ and 𝑇 agree on 𝑉 . Hence 𝑣ˆ∗ = 𝑣∗ . We
conclude as follows: while the set of mixed strategies is larger than the set of pure
strategies (i.e., deterministic policies), the maximal lifetime value from each state is
the same.
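The key step behind this conclusion is that averaging over actions can never beat the best single action. The following Python sketch illustrates this at one fixed state and one fixed 𝑣. The list `B_values` is a hypothetical stand-in for the numbers 𝐵(𝑥, 𝑎, 𝑣) over feasible actions, and the random distributions are drawn only for illustration; none of this comes from the text.

```python
import random

random.seed(0)

# Hypothetical values of B(x, a, v) over the feasible actions at one state x
B_values = [1.3, -0.2, 2.7, 2.1]

def mixed_value(phi):
    """Expected value sum_a B(x, a, v) * phi_x(a) under the distribution phi."""
    return sum(b * p for b, p in zip(B_values, phi))

def random_distribution(n):
    """Draw a random probability vector of length n."""
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

best_pure = max(B_values)

# Values attained by many randomly drawn mixed strategies
draws = [mixed_value(random_distribution(len(B_values))) for _ in range(1000)]

# A mixed strategy that puts all mass on the maximizing action
degenerate = [1.0 if b == best_pure else 0.0 for b in B_values]

print(max(draws), mixed_value(degenerate), best_pure)
```

No randomly drawn mixed strategy exceeds the best pure action, while the degenerate distribution on the maximizer attains it, which is the content of the identity in (9.6) at this state.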

9.2.2 Optimality Results for RDPs

In this section we return to the optimality properties of RDPs, as first discussed in


§8.1.3.3. Our aim is to connect the ADP optimality results from §9.2.1.4 to the special
case of RDPs and, through this process, complete the proofs of our key RDP optimality
results from Chapter 8.

9.2.2.1 OPI Convergence

In this section we provide some preliminary results related to OPI convergence, where
OPI obeys the algorithm given on page 255. Throughout, R = ( Γ, 𝑉, 𝐵) is a globally
stable RDP with policy set Σ, policy operators {𝑇𝜎 }, Bellman operator 𝑇 , and value
function 𝑣∗ . As usual, 𝑣𝜎 denotes the unique fixed point of 𝑇𝜎 for all 𝜎 ∈ Σ. In the
results stated below, 𝑚 is a fixed natural number indicating the OPI step size and 𝐻
and 𝑊𝑚 are as defined in §8.1.3.2.

Lemma 9.2.6. If $v \in V_\Sigma$, then $T^k v \to v^*$ as $k \to \infty$.

Proof. Fix $v \in V_\Sigma$. On one hand, $v \leqslant v^*$ and hence $T^k v \leqslant T^k v^* = v^*$ for all $k$. On the
other hand, if $\sigma$ is any policy, then $T_\sigma^k v \leqslant T^k v$ for all $k$. Hence $T_\sigma^k v \leqslant T^k v \leqslant v^*$ for all $k$.
If we now take $\sigma$ to be an optimal policy, which exists under the stated assumptions,
we have $T_\sigma^k v \to v_\sigma = v^*$ as $k \to \infty$. Hence $T^k v \to v^*$, as required. □

Lemma 9.2.7. 𝑊𝑚 is an order-preserving self-map on 𝑉𝑢 . Moreover,

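$$v \in V_u \implies T v \leqslant W_m v \leqslant T^m v.$$

Proof. As for the self-map property, pick any $v \in V_u$. Since $T$ and $T_\sigma$ are order-preserving,
$v \leqslant T v$ and $\sigma$ is $v$-greedy, we have

$$W_m v = T_\sigma T_\sigma^{m-1} v \leqslant T T_\sigma^{m-1} v \leqslant T T_\sigma^{m-1} T v = T T_\sigma^{m} v = T W_m v.$$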



Hence 𝑊𝑚 𝑣 ∈ 𝑉𝑢 and 𝑊𝑚 is invariant on 𝑉𝑢 .


As for the order-preserving property, this is immediate from the definition, since
powers of order-preserving self-maps are order-preserving (Exercise 2.2.26).
As for the inequality $T v \leqslant W_m v$, fix $v \in V_u$. Since $T_\sigma$ is order-preserving, $v \leqslant T v$
and $\sigma$ is $v$-greedy, we have

$$T_\sigma^{m-1} v \leqslant T_\sigma^{m-1} T v = T_\sigma^{m-1} T_\sigma v = W_m v.$$

Continuing in the same manner gives $T_\sigma^{m-j} v \leqslant W_m v$ for $j < m$ and, in particular,
$T_\sigma v \leqslant W_m v$. Because $\sigma$ is $v$-greedy, this yields $T v \leqslant W_m v$.
Regarding the second inequality, we use the fact that $T_\sigma \leqslant T$ on $V$ and $T$ and
$T_\sigma$ are both order-preserving to obtain $W_m v = T_\sigma^m v \leqslant T^m v$ (see Exercise 2.2.36 on
page 65). □
Lemma 9.2.8. For each 𝑣0 ∈ 𝑉Σ we have

$$T^k v_0 \leqslant W_m^k v_0 \leqslant T^{km} v_0 \quad \text{for all } k \in \mathbb{N}. \tag{9.7}$$

Proof. The bounds in (9.7) hold for 𝑘 = 1 by 𝑣0 ∈ 𝑉𝑢 and Lemma 9.2.7. Since all
operators are order-preserving and invariant on 𝑉𝑢 , the extension to arbitrary 𝑘 follows
from Exercise 2.2.36 on page 65. □
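The sandwich in (9.7) can be checked numerically on a small example. The Python sketch below is only an illustration under stated assumptions: the two-state, two-action rewards `r`, transitions `P`, discount factor `beta`, and the OPI step size `m = 3` are all hypothetical, the initial condition 𝑣0 is an approximation of a policy value obtained by iterating a policy operator many times, and inequalities are tested up to a small floating point tolerance.

```python
beta = 0.9                       # discount factor (assumed)
states, actions = [0, 1], [0, 1]
r = [[1.0, 0.0], [0.0, 2.0]]     # r[x][a]: hypothetical rewards
P = [[[0.9, 0.1], [0.1, 0.9]],   # P[x][a][y]: hypothetical transition probabilities
     [[0.8, 0.2], [0.3, 0.7]]]

def B(x, a, v):
    return r[x][a] + beta * sum(P[x][a][y] * v[y] for y in states)

def T(v):
    """Bellman operator."""
    return [max(B(x, a, v) for a in actions) for x in states]

def T_sigma(sigma, v):
    """Policy operator for the policy sigma."""
    return [B(x, sigma[x], v) for x in states]

def greedy(v):
    """A v-greedy policy; ties go to the lowest-numbered action."""
    return tuple(max(actions, key=lambda a: B(x, a, v)) for x in states)

def iterate(op, k, v):
    """Apply the operator op to v a total of k times."""
    for _ in range(k):
        v = op(v)
    return v

def W(m, v):
    """OPI operator: W_m v = T_sigma^m v, where sigma is v-greedy."""
    sigma = greedy(v)
    return iterate(lambda w: T_sigma(sigma, w), m, v)

def leq(v, w, tol=1e-9):
    """Pointwise order, up to a floating point tolerance."""
    return all(a <= b + tol for a, b in zip(v, w))

m = 3
sigma0 = (0, 0)
# Approximate v_sigma0, an element of V_Sigma, by iterating T_sigma0 many times
v0 = iterate(lambda w: T_sigma(sigma0, w), 2000, [0.0, 0.0])

bounds_hold = []
for k in range(1, 6):
    lower = iterate(T, k, v0)
    middle = iterate(lambda w: W(m, w), k, v0)
    upper = iterate(T, k * m, v0)
    bounds_hold.append(leq(lower, middle) and leq(middle, upper))

print(bounds_hold)
```

Each entry of `bounds_hold` reports whether both inequalities in (9.7) hold at the corresponding 𝑘.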
Lemma 9.2.9. Let $v_0$ be any element of $V_\Sigma$ and let $v_k = W_m^k v_0$ for all $k \in \mathbb{N}$. If $v_k = v_{k+1}$
for some $k \in \mathbb{N}$, then $v_k = v^*$ and every $v_k$-greedy policy is optimal.

Proof. Let the sequence $(v_k)$ be as stated and suppose that $v_k = v_{k+1}$. Let $\sigma$ be $v_k$-greedy.
It follows that $T_\sigma^m v_k = v_k$ and, moreover, $v_k \leqslant T v_k = T_\sigma v_k$, where the last
inequality is by $v_k \in V_u$. As a result,

$$v_k \leqslant T_\sigma v_k \leqslant T_\sigma^m v_k = v_k.$$

In particular, $T v_k = T_\sigma v_k = v_k$, which in turn gives $v_k = v^*$. Bellman’s principle of
optimality now implies that every $v_k$-greedy policy is optimal. □
Lemma 9.2.10. If ( 𝑣𝑘 ) ⊂ 𝑉𝑢 and 𝑣𝑘 → 𝑣∗ as 𝑘 → ∞, then there exists a 𝐾 ∈ N such that

𝑘 ⩾ 𝐾 =⇒ every 𝑣𝑘 -greedy policy is optimal.

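Proof. Let R be as stated and fix $(v_k) \subset V_u$ with $v_k \to v^*$ as $k \to \infty$. Let $\Sigma^*$ be the set
of optimal policies and let $\Sigma^0 := \Sigma \setminus \Sigma^*$. Since $\Sigma^0$ is finite, we have

$$e := \min_{\sigma \in \Sigma^0} \| v_\sigma - v^* \|_\infty > 0.$$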

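Choose $K \in \mathbb{N}$ such that $\| v_k - v^* \|_\infty < e$ for all $k \geqslant K$. Fix $k \geqslant K$ and let $\sigma$ be $v_k$-greedy.
We claim that $\sigma$ is optimal. Indeed, since $v_k \in V_u$, we have $v_k \leqslant T v_k = T_\sigma v_k$, so, by
upward stability, $v_k \leqslant v_\sigma$. As a result,

$$| v^* - v_\sigma | = v^* - v_\sigma \leqslant v^* - v_k.$$

Hence $\| v^* - v_\sigma \|_\infty \leqslant \| v^* - v_k \|_\infty < e$. But then $\sigma \notin \Sigma^0$, so $\sigma$ is optimal. □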

9.2.2.2 Proofs of RDP Results

In §8.1.3 we stated two key optimality results for RDPs, the first concerning globally
stable RDPs (Theorem 8.1.1 on page 256) and the second concerning bounded RDPs
(Theorem 8.1.2 on page 259). Let’s now prove them. In what follows, R = ( Γ, 𝑉, 𝐵) is
a well-posed RDP and AR ≔ (𝑉, {𝑇𝜎 }) is the ADP generated by R.

Proof of Theorem 8.1.1. Let R be globally stable. Then AR is finite and max-stable, by
Corollary 9.2.2. Hence the optimality and HPI convergence claims in Theorem 8.1.1
follow from Theorem 9.2.4.
Regarding OPI convergence, let ( 𝑣𝑘 , 𝜎𝑘 ) be as given in (8.18) on page 255. To-
gether, Lemma 9.2.6 and Lemma 9.2.8 imply that 𝑣𝑘 → 𝑣∗ as 𝑘 → ∞. Given such
convergence, Lemma 9.2.10 implies that there exists a 𝐾 ∈ N such that 𝜎𝑘 is optimal
whenever 𝑘 ⩾ 𝐾 . □

Proof of Theorem 8.1.2. Let R = ( Γ, 𝑉, 𝐵) be bounded and well-posed. In view of
Exercises 8.1.12–8.1.13 (see page 258), it suffices to prove the optimality claims in
Theorem 8.1.2 for the reduced RDP R̂ = ( Γ, 𝑉ˆ, 𝐵), where 𝑉ˆ is the order interval in RX
generated by the bounding functions (i.e., 𝑉ˆ = [ 𝑣1 , 𝑣2 ]).
Let A be the ADP generated by R̂. By Proposition 9.2.3, A is order stable. Corollary 9.2.2
now implies that A is max-stable. Hence the claims in Theorem 8.1.2 follow
from Theorem 9.2.4. □

9.2.3 Min-Optimality

Until now, our ADP theory has focused on maximization of lifetime values. Now we
turn to minimization. One of our aims is to prove the RDP minimization results in
§8.3.5. We will see that ADP minimization results are easily recovered from ADP
maximization results via order duality.

Let A = (𝑉, {𝑇𝜎 }) be a well-posed ADP and let 𝑉Σ ≔ { 𝑣𝜎 } be the set of 𝜎-value
functions. We call 𝜎 ∈ Σ min-optimal for A if 𝑣𝜎 is a least element of 𝑉Σ . When 𝑉Σ
has a least element we denote it by 𝑣∗ and call it the min-value function generated
by A. A policy 𝜎 is called 𝑣-min-greedy if 𝑇𝜎 𝑣 ⪯ 𝑇𝜏 𝑣 for all 𝜏 ∈ Σ. Existence of a
𝑣-min-greedy policy for each 𝑣 ∈ 𝑉 is guaranteed by the definition of ADPs.
We say that A obeys Bellman’s principle of min-optimality if

𝜎 ∈ Σ is min-optimal for A ⇐⇒ 𝜎 is 𝑣∗ -min-greedy.

We define the Bellman min-operator corresponding to A as the self-map 𝑇 on 𝑉 defined
by $T v = \bigwedge_{\sigma} T_\sigma v$. This map is well-defined because {𝑇𝜎 𝑣}𝜎∈Σ has a least element
and, moreover, 𝜎 ∈ Σ is 𝑣-min-greedy if and only if 𝑇𝜎 𝑣 = 𝑇 𝑣.
We say that 𝑣 satisfies the Bellman min-equation if 𝑇 𝑣 = 𝑣. We call A min-stable
if A is order stable and 𝑇 has at least one fixed point in 𝑉 . We define 𝐻 from 𝑉 to
{ 𝑣𝜎 } via 𝐻 𝑣 = 𝑣𝜎 where 𝜎 is 𝑣-min-greedy and call 𝐻 the Howard min-operator
generated by A. Iterating with 𝐻 is called min-HPI.
Results analogous to Theorem 9.2.4 hold for the minimization case.

Theorem 9.2.11 (Min-optimality). If A is min-stable, then

(i) the min-value function 𝑣∗ generated by A exists in 𝑉 ,


(ii) 𝑣∗ is the unique solution to the Bellman min-equation in 𝑉 ,
(iii) A obeys Bellman’s principle of min-optimality, and
(iv) A has at least one min-optimal policy.

If, in addition, Σ is finite, then min-HPI converges to 𝑣∗ in finitely many steps.

To prove Theorem 9.2.11 we use order duality. Below, if A ≔ (𝑉, {𝑇𝜎 }) is an ADP
then its dual is

A𝜕 ≔ (𝑉 𝜕 , {𝑇𝜎 }) where 𝑉 𝜕 is the order dual of 𝑉.

In this setting, we let 𝑇 𝜕 be the Bellman operator for A𝜕 , ( 𝑣∗ ) 𝜕 be the value function
for A𝜕 , and so on. We note that A is self-dual, in the sense that (A𝜕 ) 𝜕 = A, since the
same is true for 𝑉 .
To make our terminology more symmetric, in the remainder of this section we
refer to maximization-based optimal policies as max-optimal, the Bellman operator
given by $T v = \bigvee_{\sigma} T_\sigma v$ as the Bellman max-operator, and so on.

EXERCISE 9.2.7. Let A be a well-posed ADP with dual A𝜕 . Verify the following.

(i) Given 𝑣 ∈ 𝑉 , 𝜎 ∈ Σ is 𝑣-min-greedy for A if and only if 𝜎 is 𝑣-max-greedy for A𝜕 ,


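(ii) the Bellman min-operator of A equals the Bellman max-operator 𝑇 𝜕 of A𝜕 , and the Bellman max-operator of A equals the Bellman min-operator of A𝜕 ,
(iii) the Howard min-operator of A equals the Howard max-operator 𝐻 𝜕 of A𝜕 , and the Howard max-operator of A equals the Howard min-operator of A𝜕 ,
(iv) A is order stable if and only if A𝜕 is order stable,
(v) A is min-stable if and only if A𝜕 is max-stable, and, in this case, the min-value function of A equals the value function ( 𝑣∗ ) 𝜕 of A𝜕 , and
(vi) 𝜎 ∈ Σ is max-optimal for A if and only if 𝜎 is min-optimal for A𝜕 .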

Self-duality implies corollaries to Exercise 9.2.7 that we treat as self-evident. For
example, A is max-stable if and only if A𝜕 is min-stable, which follows from part (v)
and the fact that (A𝜕 ) 𝜕 = A.

Proof of Theorem 9.2.11. Let A be min-stable. By Exercise 9.2.7, the dual A𝜕 is max-
stable. Hence all of the conclusions of the max-optimality result in Theorem 9.2.4
apply to A𝜕 . All that remains is to translate these max-optimality results for A𝜕 back
to min-optimality results for A.
Regarding claim (i) of the min-optimality results, max-optimality of A𝜕 implies
that ( 𝑣∗ ) 𝜕 exists in 𝑉 . But then 𝑣∗ exists in 𝑉 , since, by Exercise 9.2.7, 𝑣∗ = ( 𝑣∗ ) 𝜕 .
Regarding (ii), we know that ( 𝑣∗ ) 𝜕 is the unique solution to 𝑇 𝜕 ( 𝑣∗ ) 𝜕 = ( 𝑣∗ ) 𝜕 , so,
applying Exercise 9.2.7 again, we have 𝑇 𝑣∗ = 𝑣∗ .
The remaining steps of the proof are similar and left to the reader. □

9.3 Chapter Notes

As indicated in the notes for Chapter 8, our interest in abstract dynamic programming was
inspired by Bertsekas (2022b). This chapter generalizes his framework by switching
to a “completely abstract” setting based on analysis of self-maps on a partially ordered
space. The material here is based on Sargent and Stachurski (2023a). Earlier work
on dynamic programming in a setting with no topology can be found in Kamihigashi
(2014).
Chapter 10

Continuous Time

Earlier chapters treated dynamics in discrete time. Now we switch to continuous time.
We restrict ourselves to finite state spaces, where continuous time processes are pure
jump processes. This allows us to provide a rigorous and self-contained treatment,
while laying foundations for a treatment of general state problems.

10.1 Continuous Time Markov Chains

In this section we introduce continuous time Markov models. In §10.2, we will use
them as components of continuous time Markov decision processes.

10.1.1 Background

In §3.1.1 we learned that if ( 𝑋𝑡 ) = ( 𝑋0 , 𝑋1 , . . .) is 𝑃 -Markov, then the distributions


( 𝜓𝑡 ) of the state process obey 𝜓𝑡+1 = 𝜓𝑡 𝑃 for all 𝑡 . This update rule is a linear differ-
ence equation in distribution space, which in turn suggests that, once we switch to
continuous time, distributions will evolve according to linear differential equations in
distribution space.
This idea turns out to be correct. As such, we begin this chapter with some facts
about linear differential equations.

CHAPTER 10. CONTINUOUS TIME 307

10.1.1.1 Scalar Exponentials

Solutions to linear differential equations involve exponential functions. The real-valued
exponential function can be defined by the power series

$$\mathrm{e}^x \mathrel{:=:} \exp(x) := \sum_{k \geqslant 0} \frac{x^k}{k!} \qquad (x \in \mathbb{R}). \tag{10.1}$$

Example 10.1.1. If $u_t$ is the balance of a savings account that pays a continuously
compounded interest rate $r$, then the balance evolves according to

$$\dot u_t := \frac{\mathrm{d}}{\mathrm{d}t} u_t = r u_t \quad \text{for all } t \geqslant 0 \text{ with initial balance } u_0 \text{ given.} \tag{10.2}$$

We understand (10.2) as a functional equation whose solution is an element $t \mapsto u_t$ of
$C^1(\mathbb{R}_+, \mathbb{R})$, the set of continuously differentiable functions from $\mathbb{R}_+$ to $\mathbb{R}$, that satisfies
(10.2). We claim that $u_t := \mathrm{e}^{rt} u_0$ is the only solution to (10.2) in $C^1(\mathbb{R}_+, \mathbb{R})$. It is easy
to check that this choice of $u_t$ obeys (10.2). As for uniqueness, suppose that $t \mapsto y_t$ is
another solution in $C^1(\mathbb{R}_+, \mathbb{R})$, so that $\dot y_t = r y_t$ for all $t \geqslant 0$ and $y_0 = u_0$. Then

$$\frac{\mathrm{d}}{\mathrm{d}t} \left( y_t \mathrm{e}^{-rt} \right) = \dot y_t \mathrm{e}^{-rt} - r y_t \mathrm{e}^{-rt} = r y_t \mathrm{e}^{-rt} - r y_t \mathrm{e}^{-rt} = 0,$$

so $t \mapsto y_t \mathrm{e}^{-rt}$ is constant on $\mathbb{R}_+$, implying existence of a $c \in \mathbb{R}$ such that $y_t = c \, \mathrm{e}^{rt}$ for
all $t \geqslant 0$. Setting $t = 0$ and using the initial condition gives $c = u_0$. Hence, at any $t$, we
have $y_t = \mathrm{e}^{rt} u_0 = u_t$.

The continuous time system in Example 10.1.1 is closely related to the discrete
time difference equation 𝑢𝑡+1 = e𝑟 𝑢𝑡 . Indeed, if we start at 𝑢0 , then the 𝑡 -th iterate
is e𝑟𝑡 𝑢0 , so solutions agree at integer times. We can think of the continuous time
system as one that interpolates between points in time of a corresponding discrete
time system.
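Both observations, that the exponential path solves (10.2) and that it agrees with the discrete time system at integer dates, are easy to check numerically. The Python sketch below does so under assumed values for the interest rate and initial balance (`r = 0.05` and `u0 = 100.0` are hypothetical), using a central finite difference as a stand-in for the time derivative.

```python
import math

r, u0 = 0.05, 100.0    # hypothetical interest rate and initial balance

def u(t):
    """Candidate solution u_t = e^{rt} u_0 of the differential equation (10.2)."""
    return math.exp(r * t) * u0

# (a) The derivative of u_t, approximated by a central difference, should equal r * u_t
h = 1e-6
ode_gap = max(
    abs((u(t + h) - u(t - h)) / (2 * h) - r * u(t))
    for t in [0.5, 1.0, 5.0, 10.0]
)

# (b) Iterating the difference equation u_{t+1} = e^r u_t matches u_t at integer times
discrete, gaps = u0, []
for t in range(1, 11):
    discrete = math.exp(r) * discrete
    gaps.append(abs(discrete - u(t)))

print(ode_gap, max(gaps))
```

Both reported gaps should be close to zero, up to finite difference and rounding error.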
The exponential $\mathrm{e}^\lambda$ of $\lambda = a + ib \in \mathbb{C}$ can also be defined via (10.1). From the
identity $\mathrm{e}^{ib} = \cos(b) + i \sin(b)$, we obtain

$$\mathrm{e}^\lambda = \mathrm{e}^{a + ib} = \mathrm{e}^a (\cos(b) + i \sin(b)). \tag{10.3}$$

This equation will soon prove useful.



10.1.1.2 The Exponential Distribution

A random variable $W$ is said to be exponentially distributed with rate $\theta$, and we write
$W \overset{d}{=} \mathrm{Exp}(\theta)$, when the counter CDF $G$ satisfies

$$G(t) := \mathbb{P}\{W > t\} = \mathrm{e}^{-\theta t} \qquad (t \geqslant 0).$$

Continuous time Markov chains have a close relationship with the exponential distri-
bution, a fact that stems from its being the only distribution having the memoryless
property
P{𝑊 > 𝑠 + 𝑡 | 𝑊 > 𝑠} = P{𝑊 > 𝑡 } for all 𝑠, 𝑡 > 0. (10.4)

EXERCISE 10.1.1. Verify that (10.4) holds when $W \overset{d}{=} \mathrm{Exp}(\theta)$.

The memoryless property is special. For example, the probability that an individ-
ual human being lives 70 years from birth is not equal to the probability that he or she
lives another 70 years conditional on having reached age 70. In fact the exponential
distribution is the only memoryless distribution supported on the nonnegative reals:

Lemma 10.1.1. If 𝑊 has counter CDF 𝐺 satisfying 0 < 𝐺 ( 𝑡 ) < 1 for all 𝑡 > 0, then the
following statements are equivalent:
(i) $W \overset{d}{=} \mathrm{Exp}(\theta)$ for some $\theta > 0$.
(ii) 𝑊 satisfies the memoryless property in (10.4).

Proof. Exercise 10.1.1 treats (i) ⇒ (ii). As for (ii) ⇒ (i), suppose (ii) holds. Then
𝐺 has three properties:

(a) 𝐺 is decreasing on R+ (as is any counter CDF),


(b) 0 < 𝐺 ( 𝑡 ) < 1 for all 𝑡 > 0, and
(c) 𝐺 ( 𝑠 + 𝑡 ) = 𝐺 ( 𝑠) 𝐺 ( 𝑡 ) for all 𝑠, 𝑡 > 0.

From (a)–(c) we will show that

$$G(t) = G(1)^t \quad \text{for all } t \geqslant 0. \tag{10.5}$$

This is sufficient to prove (i) because then $\theta := -\ln G(1)$ is a positive real number (by
(b)) and, furthermore,

$$G(t) = \exp\{\ln[G(1)^t]\} = \exp\{\ln[G(1)] \, t\} = \exp(-\theta t).$$



To see that (10.5) holds, fix $m, n \in \mathbb{N}$. We can use (c) to obtain both $G(m/n) = G(1/n)^m$
and $G(1) = G(1/n)^n$. It follows that $G(m/n)^n = G(1/n)^{mn} = G(1)^m$ and, raising to the
power of $1/n$, we get (10.5) when $t = m/n$.
The discussion so far confirms that (10.5) holds when $t$ is rational. So now take
any $t \geqslant 0$ and rational sequences $(a_n)$ and $(b_n)$ converging to $t$ with $a_n \leqslant t \leqslant b_n$ for
all $n$. By (a) we have $G(b_n) \leqslant G(t) \leqslant G(a_n)$ for all $n$, so $G(1)^{b_n} \leqslant G(t) \leqslant G(1)^{a_n}$ for all
$n \in \mathbb{N}$. Taking the limit in $n$ completes the proof. □

10.1.1.3 Extension to Matrices

The real exponential formula (10.1) extends to the matrix exponential via

$$\mathrm{e}^A := I + A + \frac{A^2}{2!} + \cdots = \sum_{k \geqslant 0} \frac{A^k}{k!}, \tag{10.6}$$

where 𝐴 is any square matrix. As we will see, the matrix exponential plays a key role
in the solution of vector-valued linear differential equations.
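As a rough illustration of the definition in (10.6), the Python sketch below sums a truncated version of the power series using plain lists. The helper names `mat_mul` and `expm_series` and the truncation at 30 terms are choices made here for illustration, not part of the text, and the truncated series is only a reasonable approximation when the entries of the matrix are moderate in size. The two test matrices have matrix exponentials that are known in closed form.

```python
import math

def mat_mul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(A, terms=30):
    """Truncated power series sum_{k < terms} A^k / k! for the matrix exponential."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]   # k = 0 term: identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        # Update term from A^{k-1} / (k-1)! to A^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Generator of a rotation: its exponential is the rotation matrix through angle b
b = 0.8
rotation_generator = [[0.0, -b], [b, 0.0]]
E_rot = expm_series(rotation_generator)

# Nilpotent matrix: the series terminates after the linear term, so e^N = I + N
nilpotent = [[0.0, 1.0], [0.0, 0.0]]
E_nil = expm_series(nilpotent)

print(E_rot)
print(E_nil)
```

The rotation example connects to (10.3): the generator has eigenvalues ±𝑖𝑏, and the entries of its exponential are cos(𝑏) and ±sin(𝑏).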

EXERCISE 10.1.2. Let $A$ be $n \times n$ and let $\| \cdot \|$ be the operator norm (see page 16).
Show that (10.6) converges, in the sense that $\left\| \sum_{k=0}^{m} \frac{A^k}{k!} \right\|$ is bounded in $m$.

Lemma 10.1.2 (Properties of the matrix exponential). Let 𝐴 and 𝐵 be square matrices.

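(i) If $A$ is diagonalizable with $A = P D P^{-1}$, then $\mathrm{e}^A = P \mathrm{e}^D P^{-1}$.
(ii) If $A$ and $B$ commute (i.e., $AB = BA$), then $\mathrm{e}^{A + B} = \mathrm{e}^A \mathrm{e}^B$.
(iii) If $m$ is any positive integer, then $\mathrm{e}^{mA} = (\mathrm{e}^A)^m$.
(iv) $\lambda$ is an eigenvalue of $A$ if and only if $\mathrm{e}^\lambda$ is an eigenvalue of $\mathrm{e}^A$.
(v) The function $\mathbb{R} \ni t \mapsto \mathrm{e}^{tA}$ is differentiable in $t$, with

$$\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{e}^{tA} = A \mathrm{e}^{tA} = \mathrm{e}^{tA} A. \tag{10.7}$$

(vi) $\mathrm{e}^{A^\top} = (\mathrm{e}^A)^\top$.
(vii) The fundamental theorem of calculus holds, in the sense that

$$\mathrm{e}^{tA} - \mathrm{e}^{sA} = \int_s^t \mathrm{e}^{\tau A} A \, \mathrm{d}\tau \quad \text{for all } s \leqslant t. \tag{10.8}$$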

In Lemma 10.1.2 and below, integration or differentiation of a vector- or matrix-valued
function is carried out element by element. For example, to differentiate a
matrix $B(t) = (b_{ij}(t))$ that depends on $t$, we form a new matrix by differentiating
each element $b_{ij}(t)$ with respect to $t$. The integral $\int_a^b B(t) \, \mathrm{d}t$ is the matrix of integrals
$\int_a^b b_{ij}(t) \, \mathrm{d}t$.

EXERCISE 10.1.3. Prove part (i) of Lemma 10.1.2.

The proof of part (ii) of Lemma 10.1.2 uses the definition of the exponential and
the binomial formula. See, for example, Hirsch and Smale (1974). Part (iii) follows
directly from part (ii). Part (iv) follows easily from part (i) when 𝐴 is diagonalizable
(and can be proved more generally via the Jordan canonical form).

EXERCISE 10.1.4. Prove (v) of Lemma 10.1.2. A good starting point is to observe
that, for any 𝑡 ∈ R,

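$$\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{e}^{tA} = \lim_{h \to 0} \frac{\mathrm{e}^{tA + hA} - \mathrm{e}^{tA}}{h} = \mathrm{e}^{tA} \lim_{h \to 0} \frac{\mathrm{e}^{hA} - I}{h}. \tag{10.9}$$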

EXERCISE 10.1.5. Using Lemma 10.1.2, show that, for any 𝑛 × 𝑛 matrix 𝐴, the
matrix e 𝐴 is invertible, with inverse e− 𝐴 .

As for (vii), we are drawing an analogy with the fundamental theorem of calculus
for scalar-valued functions, which states that $f(t) - f(s) = \int_s^t f'(\tau) \, \mathrm{d}\tau$ for all $s \leqslant t$,
where $f'$ is the derivative of $f$.

EXERCISE 10.1.6. Prove part (vii) of Lemma 10.1.2. [Hint: use part (v).]

10.1.2 Continuous Time Flows


Next we study solutions of multivariate differential equations, with a focus on lin-
ear systems. These results lay foundations for our study of continuous time Markov
dynamics in §10.1.3.

10.1.2.1 Continuous Time Dynamical Systems

Recall from §2.1.1 that a discrete dynamical system is a pair (𝑈, 𝑆), where 𝑈 is a set
and 𝑆 is a self-map on 𝑈 . Trajectories are sequences $(S^t u)_{t \geqslant 0} = (u, Su, S^2 u, \ldots)$, where

𝑢 ∈ 𝑈 is the initial condition. These ideas can be extended to continuous time by


considering a pair (𝑈, ( 𝑆𝑡 )𝑡⩾0 ) where 𝑈 is any set and 𝑆𝑡 is a self-map on 𝑈 for each
𝑡 ∈ R+ . The interpretation is that if 𝑢 ∈ 𝑈 is the current state of the system, then 𝑆𝑡 𝑢
will be the state after 𝑡 units of time.
Example 10.1.2. For the savings account in Example 10.1.1 with solution 𝑢𝑡 ≔ e𝑟𝑡 𝑢0 ,
we can take 𝑈 = R and 𝑆𝑡 𝑢 = e𝑟𝑡 𝑢. Then the state 𝑆𝑡 𝑢 at time 𝑡 is the balance at time 𝑡
associated with initial deposit 𝑢.

In general, to understand (𝑈, ( 𝑆𝑡 )𝑡⩾0 ) as a continuous time dynamical system, we


require that (a) $S_0$ is the identity map, so that the state after zero units of time is just
the initial condition, and (b) if we start at $u$, move forward to $u_s := S_s u$, and then move
again to $S_t u_s$ after another $t$ units of time, the outcome should be the same as moving
from $u$ to $S_{s+t} u$ directly. That is,

$$S_{s+t} = S_t \circ S_s \quad \text{for all } s, t \geqslant 0.$$

This is the semigroup property.


One way that continuous time dynamical systems arise is via initial value problems.
An initial value problem (IVP) in $\mathbb{R}^n$ consists of a differential equation $\dot u_t = f(u_t)$
paired with an initial condition $u_0 \in \mathbb{R}^n$, where $u_t \in \mathbb{R}^n$ and $f \colon \mathbb{R}^n \to \mathbb{R}^n$. Under
suitable conditions on $f$, the solution $u_t := F(t, u_0)$ is uniquely defined for all $t \geqslant 0$,
and, moreover,

$$F(0, u_0) = u_0 \quad \text{and} \quad F(s + t, u_0) = F(t, F(s, u_0)) \quad \text{for all } s, t \geqslant 0$$

(see, e.g., Hirsch and Smale (1974), Section 8.7). Hence $(S_t)_{t \geqslant 0}$ defined by $S_t u = F(t, u)$
satisfies the semigroup property and $(\mathbb{R}^n, (S_t)_{t \geqslant 0})$ is a continuous time dynamical system.
The function $f$ is called the vector field of $(\mathbb{R}^n, (S_t)_{t \geqslant 0})$.

10.1.2.2 Linear Initial Value Problems

Given our interest in continuous time Markov chains and their connection to linear
systems (see the comments at the start of §10.1.1), we focus primarily on linear dif-
ferential equations. The next result discusses linear IVPs, illustrating the key role of
the matrix exponential. In the statement, 𝐴 is 𝑛 × 𝑛 and both 𝑢¤ 𝑡 and 𝑢𝑡 are column
vectors in R𝑛 .
Proposition 10.1.3. The unique solution of the $n$-dimensional IVP

$$\dot u_t = A u_t, \qquad u_0 \in \mathbb{R}^n \text{ given} \tag{10.10}$$



in the set of continuously differentiable functions $t \mapsto u_t$ from $\mathbb{R}_+$ to $\mathbb{R}^n$ is

$$u_t := \mathrm{e}^{tA} u_0 \qquad (t \geqslant 0). \tag{10.11}$$

(Here $\dot u_t := \mathrm{d}u_t / \mathrm{d}t$ is defined by differentiating the vector $u_t$ element-by-element, as
discussed after Lemma 10.1.2.)

Proof of Proposition 10.1.3. That 𝑢𝑡 ≔ e𝑡 𝐴 𝑢0 solves (10.10) is immediate from Exer-


cise 10.1.4. The proof of uniqueness is omitted, although the logic is very similar to
the scalar case, which was discussed in Example 10.1.1. □

EXERCISE 10.1.7. Let $P$ be $n \times n$ and consider the IVP $\dot \varphi_t = \varphi_t P$ and $\varphi_0$ given, where
each $\varphi_t$ is a row vector in $\mathbb{R}^n$. Prove that this IVP has the unique solution $\varphi_t := \varphi_0 \mathrm{e}^{tP}$.

Proposition 10.1.3 motivates us to study flows of the form

$$t \mapsto u_t, \qquad u_t = \mathrm{e}^{tA} u_0 \qquad (t \geqslant 0) \tag{10.12}$$

where $A$ is $n \times n$, $u_0$ is a vector in $\mathbb{R}^n$ indicating the initial condition, and $u_t$ is the
“state” of the system at time $t$. Figure 10.1 shows an example when

$$A = \begin{pmatrix} -2.0 & -0.4 & 0 \\ -1.4 & -1.0 & 2.2 \\ 0.0 & -2.0 & -0.6 \end{pmatrix}. \tag{10.13}$$
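A flow of this kind can be traced numerically by computing $\mathrm{e}^{hA}$ once for a small step $h$ and then applying it repeatedly, which relies on the identity $\mathrm{e}^{(s+t)A} = \mathrm{e}^{sA} \mathrm{e}^{tA}$ from Lemma 10.1.2. The Python sketch below does this for the matrix in (10.13). The step size `h = 0.1`, the initial condition `u0`, and the truncated series helper `expm_series` are assumptions made for illustration; the initial condition used in Figure 10.1 is not stated in the text.

```python
import math

A = [[-2.0, -0.4, 0.0],
     [-1.4, -1.0, 2.2],
     [0.0, -2.0, -0.6]]          # the matrix in (10.13)

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_vec(X, u):
    return [sum(x * y for x, y in zip(row, u)) for row in X]

def scale(c, M):
    return [[c * x for x in row] for row in M]

def expm_series(M, terms=30):
    """Truncated power series for e^M; adequate when the entries of M are small."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def norm(u):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in u))

h = 0.1                            # time step (assumed)
E_h = expm_series(scale(h, A))     # e^{hA}

u0 = [0.01, 0.01, 0.01]            # hypothetical initial condition
u, norms = u0, [norm(u0)]
for _ in range(100):               # advance from t = 0 to t = 10 in steps of h
    u = mat_vec(E_h, u)
    norms.append(norm(u))

# Check of the identity e^{hA} e^{hA} = e^{2hA}
E_2h = expm_series(scale(2 * h, A))
semigroup_gap = max(abs(a - b)
                    for row_a, row_b in zip(mat_mul(E_h, E_h), E_2h)
                    for a, b in zip(row_a, row_b))

print(norms[0], norms[50], norms[100], semigroup_gap)
```

For this matrix the norms shrink as time advances, so the path heads toward the origin.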

10.1.2.3 Stability in the Diagonalizable Case

For an exponential flow such as (10.12), a key question is whether or not 𝑢𝑡 → 0 as


𝑡 → ∞. (This will matter when we try to evaluate lifetime rewards over an infinite
horizon in continuous time.) Rather than analyze these issues at every possible 𝑢0 ,
we directly consider the matrix-valued flow 𝑡 ↦→ e𝑡 𝐴 and study whether e𝑡 𝐴 → 0.
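The case where $A$ is diagonalizable provides a good starting point. Suppose $A =
P^{-1} D P$ with $D = \operatorname{diag}_j(\lambda_j)$ containing the eigenvalues of $A$. Then, by Lemma 10.1.2,
for any $t \geqslant 0$, we have

$$\mathrm{e}^{tA} = \mathrm{e}^{t P^{-1} D P} = P^{-1} \mathrm{e}^{tD} P. \tag{10.14}$$

EXERCISE 10.1.8. Prove that $\mathrm{e}^{tD} = \operatorname{diag}(\mathrm{e}^{t \lambda_1}, \ldots, \mathrm{e}^{t \lambda_n})$.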



Figure 10.1: Exponential flow $t \mapsto \mathrm{e}^{tA} u_0$ starting from $u_0 \in \mathbb{R}^3$

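Exercise 10.1.8 and (10.14) tell us that the long run dynamics of $\mathrm{e}^{tA}$ are determined
by the scalar flows $t \mapsto \mathrm{e}^{t \lambda_j}$. How does $\mathrm{e}^{t \lambda}$ evolve over time when $\lambda \in \mathbb{C}$?
To answer this question we write $\lambda = a + ib$ and apply (10.3) to obtain

$$\mathrm{e}^{t \lambda} = \mathrm{e}^{ta} (\cos(tb) + i \sin(tb)).$$

This equation tells us that

$$\mathrm{e}^{t \lambda} \to 0 \text{ as } t \to \infty \text{ if and only if } \operatorname{Re} \lambda < 0, \tag{10.15}$$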

where Re 𝜆 is the real part of 𝜆 (i.e., if 𝜆 = 𝑎 + 𝑖𝑏, then Re 𝜆 = 𝑎).
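The role of the real part can be seen directly by tracking the modulus of $\mathrm{e}^{t\lambda}$, which equals $\mathrm{e}^{t \operatorname{Re} \lambda}$ because the trigonometric factor has modulus one. The short Python sketch below uses three hypothetical values of $\lambda$ with negative, zero, and positive real part.

```python
import cmath
import math

def modulus_path(lam, times):
    """Modulus of e^{t * lam} at each t in times."""
    return [abs(cmath.exp(t * lam)) for t in times]

times = [0, 1, 5, 10, 20]

decaying = modulus_path(complex(-0.5, 3.0), times)   # Re(lam) < 0: modulus shrinks
neutral = modulus_path(complex(0.0, 3.0), times)     # Re(lam) = 0: modulus stays at one
growing = modulus_path(complex(0.2, 3.0), times)     # Re(lam) > 0: modulus grows

# The modulus depends on lam only through its real part
real_part_only = [math.exp(-0.5 * t) for t in times]

print(decaying, neutral, growing)
```

Only the first path converges to zero, in line with (10.15); the imaginary part affects rotation in the complex plane but not the modulus.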


From this analysis, we conclude that, when $A$ is diagonalizable, we have $\mathrm{e}^{tA} \to 0$
if and only if $\operatorname{Re} \lambda_j < 0$ for all $\lambda_j \in \sigma(A)$, where $\sigma(A)$ denotes the set of all eigenvalues
(the spectrum) of $A$. Another way to put this is that $\mathrm{e}^{tA} \to 0$ if and only if $s(A) < 0$,
where

$$s(A) := \max_{\lambda \in \sigma(A)} \operatorname{Re} \lambda \tag{10.16}$$

is the spectral bound of $A$.


As the preceding analysis suggests, the spectral bound plays a key role in the
asymptotics of exponential flows, just as the spectral radius governs asymptotics of
trajectories of linear maps (see, e.g., Exercise 1.2.11 on page 19). The next section

expands on this analysis, while dropping the assumption that 𝐴 is diagonalizable.

10.1.2.4 The General Case

Let 𝐴 be any square matrix. In the following statement about a spectral bound, k · k
is the operator norm defined in §1.2.1.4.
Lemma 10.1.4. If $\tau > 0$, then $\tau s(A) = s(\tau A)$. Moreover,

$$\mathrm{e}^{s(A)} = \rho(\mathrm{e}^A) \quad \text{and} \quad s(A) = \lim_{t \to \infty} \frac{1}{t} \ln \| \mathrm{e}^{tA} \|. \tag{10.17}$$

EXERCISE 10.1.9. Confirm that 𝜏𝑠 ( 𝐴) = 𝑠 ( 𝜏𝐴) when 𝜏 > 0.

EXERCISE 10.1.10. Prove the first equality in (10.17).

EXERCISE 10.1.11. The second equality in (10.17) is reminiscent of Gelfand’s


lemma (see page 19). Confirm that it holds when the limit is taken over 𝑡 ∈ N.

(The second equality in (10.17) also holds when the limit is taken over 𝑡 ∈ R+ .
See, for example, Engel and Nagel (2006).)
The next theorem is a key stability result for exponential flows. Among other
things, it extends to arbitrary 𝐴 the finding that 𝑠 ( 𝐴) < 0 is necessary and sufficient
for stability.
Theorem 10.1.5. For any square matrix 𝐴, the following statements are equivalent:
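(i) $s(A) < 0$.
(ii) $\| \mathrm{e}^{tA} \| \to 0$ as $t \to \infty$.
(iii) There exist $M, \omega > 0$ such that $\| \mathrm{e}^{tA} \| \leqslant M \mathrm{e}^{-t \omega}$ for all $t \geqslant 0$.
(iv) $\int_0^\infty \| \mathrm{e}^{tA} u_0 \|^p \, \mathrm{d}t < \infty$ for all $p \geqslant 1$ and $u_0 \in \mathbb{R}^n$.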

A full proof of Theorem 10.1.5 in a general setting can be found in §V.II of Engel
and Nagel (2006).
Theorem 10.1.5 tells us that the flow $t \mapsto \mathrm{e}^{tA} u_0$ converges to the origin at an
exponential rate if and only if $s(A) < 0$. The equivalence of (i) and (ii) was proved for
the diagonalizable case in §10.1.2.3. It can be viewed as the continuous time analog
of $\| B^k \| \to 0$ if and only if $\rho(B) < 1$ (see Exercise 1.2.11 on page 19).

EXERCISE 10.1.12. Prove that (i) implies (ii) without assuming that 𝐴 is diagonal-
izable. In addition, prove that (iii) implies (iv).

10.1.2.5 Semigroup Terminology

Advanced treatments of continuous time systems often begin with semigroups. Let’s
briefly describe these and connect them to things we have studied earlier. (If you
prefer to skip this section on first reading, you can move to the next one after noting
that, given an 𝑛 × 𝑛 matrix 𝐴, the family ( 𝑆𝑡 )𝑡⩾0 = (e𝑡 𝐴 )𝑡⩾0 is called an exponential
semigroup and that 𝐴 is called the infinitesimal generator of the semigroup.)
Let X be a finite set and let ( 𝑆𝑡 )𝑡⩾0 be a subset of L( RX ) indexed by 𝑡 ∈ R+ . The
family ( 𝑆𝑡 )𝑡⩾0 is called a strongly continuous semigroup or 𝐶0 -semigroup on RX if

(i) $S_0 = I$, where $I$ is the identity,
(ii) $S_{t + t'} = S_t \circ S_{t'}$, and
(iii) $t \mapsto S_t u$ is a continuous map from $\mathbb{R}_+$ to $\mathbb{R}^{\mathsf{X}}$ for every $u \in \mathbb{R}^{\mathsf{X}}$.

In essence, a $C_0$-semigroup on $\mathbb{R}^{\mathsf{X}}$ is a continuous time dynamical system $(\mathbb{R}^{\mathsf{X}}, (S_t)_{t \geqslant 0})$
where each $S_t$ maps an initial state into a time $t$ state.

Example 10.1.3 (Exponential semigroup). Fix $A$ in $\mathcal{L}(\mathbb{R}^{\mathsf{X}})$ and let $(S_t)_{t \geqslant 0}$ be defined
by $S_t = \mathrm{e}^{tA}$. Then $(S_t)_{t \geqslant 0}$ is a $C_0$-semigroup on $\mathbb{R}^{\mathsf{X}}$. To verify this we take
$\mathsf{X} = \{x_1, \ldots, x_n\}$ and view $S_t$ and $A$ as $n \times n$ matrices. The $C_0$-semigroup properties
now follow directly from Lemma 10.1.2. For example, $t \mapsto S_t u$ is continuous because
it is differentiable in $t$, by (v) of Lemma 10.1.2.

The semigroup perspective is important because it extends in a natural way to


settings where X is not finite, in which case we replace the finite-dimensional set RX
with some (typically infinite-dimensional) class of functions G ⊂ RX , and each 𝑆𝑡
becomes a linear operator mapping G into itself. At this level of generality, 𝑆𝑡 𝑢 can
be the solution to a partial differential equation, or a stochastic differential equation
(see, e.g., Engel and Nagel (2006) or Applebaum (2019)). Operator semigroup theory
offers an elegant and powerful framework for handling such systems.
For semigroups in general settings we often have no analytical expressions for
𝑆𝑡 . This situation is like the one we encountered in the continuous time system in
§10.1.2.1, where 𝑢¤ 𝑡 = 𝑓 ( 𝑢𝑡 ) and 𝑓 is potentially nonlinear. When no analytical solution
𝑢𝑡 exists, analyzing the dynamics requires us to try to infer its properties from the
vector field 𝑓 , so that 𝑓 becomes the primary focus of analysis.

A natural question, then, is, given a semigroup (𝑆𝑡 ) 𝑡⩾0 on L( RX ), does there always
exist a “vector field” type object that “generates” (𝑆𝑡 ) 𝑡⩾0 ? When X is finite, the answer
is affirmative. This object, denoted below by 𝐴, is called the infinitesimal generator
of the semigroup and is defined by
\[
A = \lim_{t \downarrow 0} \frac{S_t - S_0}{t} = \lim_{t \downarrow 0} \frac{S_t - I}{t}. \tag{10.19}
\]

At 𝑢 ∈ 𝑈 , the vector 𝐴𝑢 indicates the instantaneous change in the state.


More precisely, when X is finite, we have:
Proposition 10.1.6. If ( 𝑆𝑡 ) 𝑡⩾0 is a 𝐶0 -semigroup on RX , then
(i) there exists an 𝐴 ∈ L ( RX ) such that 𝑆𝑡 = e𝑡 𝐴 for all 𝑡 ⩾ 0, and
(ii) 𝐴 is the infinitesimal generator of ( 𝑆𝑡 ) 𝑡⩾0 .

Semigroups of the form described in Proposition 10.1.6 are called exponential semigroups (or “uniformly continuous” semigroups).
A full proof of Proposition 10.1.6 can be found in the discussion of Theorem 2.12
of Engel and Nagel (2006). The results are not surprising, since the main claim is
that, in finite dimensions, solutions to linear differential equations have exponential
form. The fact that 𝐴 is the infinitesimal generator of the semigroup ( 𝑆𝑡 ) 𝑡⩾0 = (e𝑡 𝐴 ) 𝑡⩾0
follows from Lemma 10.1.2, which gives

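\[
\lim_{t \downarrow 0} \frac{S_t - S_0}{t}
= \lim_{t \downarrow 0} \frac{\mathrm{e}^{tA} - \mathrm{e}^{0}}{t}
= \left. \frac{\mathrm{d}}{\mathrm{d}t} \mathrm{e}^{tA} \right|_{t=0}
= A \mathrm{e}^{0A} = A.
\]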

The preceding discussion places our analysis in a wider context. To practice our
new terminology, we can restate (i) ⇐⇒ (ii) from Theorem 10.1.5 by saying that the
exponential semigroup (𝑆𝑡 ) 𝑡⩾0 = (e𝑡 𝐴 ) 𝑡⩾0 converges to zero if and only if the spectral
bound of its infinitesimal generator is negative.

10.1.3 Markov Semigroups

Having studied multivariate linear dynamics, we are now ready to specialize to the
Markov case, where dynamics evolve in distribution space. For the most part we now
switch to operator-theoretic notation, where X is a finite set with 𝑛 elements, and an
$n \times n$ matrix is identified with a linear operator in $\mathcal{L}(\mathbb{R}^{\mathsf{X}})$. As emphasized in §2.3.3.1,
this is merely a change in terminology, and all preceding results for matrices extend
directly to linear operators.

10.1.3.1 Intensity Matrices

If $(X_t)_{t \geq 0}$ is $P$-Markov on $\mathsf{X}$ for some $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$, then the marginal distributions of $(X_t)_{t \geq 0}$ evolve according to the linear difference system $\psi_{t+1} = \psi_t P$ (see §3.1.1). We now seek a continuous time analog in the form of a linear differential equation.
To this end we call $Q \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ an intensity operator or intensity matrix¹ when
\[
Q(x, x') \geq 0 \text{ whenever } x \neq x'
\quad \text{and} \quad
\sum_{x'} Q(x, x') = 0 \text{ for all } x \in \mathsf{X}. \tag{10.20}
\]

Let
I ( RX ) = the set of all intensity operators in L( RX ) .
Example 10.1.4. The matrix
\[
Q := \begin{pmatrix} -2 & 1 & 1 \\ 0 & -1 & 1 \\ 2 & 1 & -3 \end{pmatrix}
\]

is an intensity matrix, since off-diagonal terms are nonnegative and rows sum to zero.

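Consider the IVP
\[
\dot\psi_t(x') = \sum_{x} Q(x, x') \, \psi_t(x)
\qquad (t \geq 0, \; x' \in \mathsf{X}),
\]
which we can also write as
\[
\dot\psi_t = \psi_t Q, \qquad \psi_0 \in \mathcal{D}(\mathsf{X}) \text{ given}, \tag{10.21}
\]
where $\psi_t$ and $\dot\psi_t$ are understood to be row vectors. We say that $\mathcal{D}(\mathsf{X})$ is invariant for the IVP (10.21) if the solution $(\psi_t)_{t \geq 0}$ remains in $\mathcal{D}(\mathsf{X})$ for all $t \geq 0$.
In view of Proposition 10.1.3, we can rephrase this by stating that $\mathcal{D}(\mathsf{X})$ is invariant for (10.21) whenever
\[
\psi_0 \in \mathcal{D}(\mathsf{X}) \implies \psi_0 \mathrm{e}^{tQ} \in \mathcal{D}(\mathsf{X}) \text{ for all } t \geq 0. \tag{10.22}
\]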

Our key result for this section shows the central role of intensity matrices:
Proposition 10.1.7. Fix $Q \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ and set $P_t := \mathrm{e}^{tQ}$ for each $t \geq 0$. The following statements are equivalent:
(i) $Q \in \mathcal{I}(\mathbb{R}^{\mathsf{X}})$.
(ii) $P_t \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ for all $t \geq 0$.
(iii) The set of distributions $\mathcal{D}(\mathsf{X})$ is invariant for the IVP (10.21).

¹ Other names for intensity matrices include “$Q$-matrices” (which is fine until you need to use another symbol), “Kolmogorov matrices”, and “infinitesimal stochastic matrices.”


Proposition 10.1.7 tells us that the set I ( RX ) coincides with the set of continuous
time (and time-homogeneous) Markov models on X. Any specification outside this
class fails to generate flows in distribution space. The proof is completed in several
steps below.
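
To make the equivalence of (i) and (ii) in Proposition 10.1.7 concrete, here is a small numerical sketch in Julia. The helper names `is_intensity` and `is_stochastic`, the tolerance `tol` and the matrix `B` are our own assumptions, introduced only for illustration; `Q` is the matrix from Example 10.1.4.

```julia
using LinearAlgebra

# Check the two conditions in (10.20); `tol` is an assumed numerical tolerance
function is_intensity(Q; tol=1e-10)
    n = size(Q, 1)
    offdiag_ok = all(Q[i, j] ≥ -tol for i in 1:n, j in 1:n if i != j)
    rows_ok = all(abs.(sum(Q, dims=2)) .≤ tol)
    return offdiag_ok && rows_ok
end

# Nonnegative entries and unit row sums
is_stochastic(P; tol=1e-10) =
    all(P .≥ -tol) && all(abs.(sum(P, dims=2) .- 1) .≤ tol)

Q = [-2.0  1.0  1.0;      # the matrix from Example 10.1.4
      0.0 -1.0  1.0;
      2.0  1.0 -3.0]

B = [-1.0  2.0;           # first row sums to 1, so B is not an intensity matrix
      1.0 -1.0]

q_ok      = is_intensity(Q)
p_ok      = all(is_stochastic(exp(t * Q)) for t in (0.0, 0.1, 1.0, 10.0))
b_ok      = is_intensity(B)
b_flow_ok = is_stochastic(exp(1.0 * B))
```

Here `q_ok` and `p_ok` should be `true`, while `b_ok` and `b_flow_ok` should be `false`: the rows of $\mathrm{e}^{B}$ sum to more than one, so the flow generated by `B` leaves distribution space.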
For Exercises 10.1.13–10.1.15, 𝑄 ∈ I ( RX ) and 𝑃𝑡 ≔ e𝑡𝑄 .

EXERCISE 10.1.13. Show that 𝑃𝑡 1 = 1 for all 𝑡 ⩾ 0.

EXERCISE 10.1.14. Set $\theta := \max_{x \in \mathsf{X}} |Q(x, x)|$ and $K := I + \frac{1}{\theta} Q$, where $I$ is the $n \times n$ identity. (If $\theta = 0$, then set $K := I$.) Prove that $K$ is a stochastic matrix and $Q = \theta (K - I)$.

EXERCISE 10.1.15. Using the representation for $Q$ obtained in Exercise 10.1.14 and the definition of the matrix exponential, show that $P_t$ is nonnegative for all $t \geq 0$.
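
The representation in Exercises 10.1.14–10.1.15 can be explored numerically before attempting the proofs. The sketch below assumes the matrix from Example 10.1.4 and uses the fact that $Q = \theta(K - I)$ gives $\mathrm{e}^{tQ} = \mathrm{e}^{-t\theta} \mathrm{e}^{t\theta K}$, since $I$ commutes with $K$. The function name `expm_series` and the truncation level `terms` are hypothetical choices.

```julia
using LinearAlgebra

Q = [-2.0  1.0  1.0;
      0.0 -1.0  1.0;
      2.0  1.0 -3.0]

θ = maximum(abs.(diag(Q)))         # θ = max_x |Q(x, x)|
K = I + Q / θ                      # K = I + (1/θ) Q

# Truncated series e^{tQ} = e^{-tθ} Σ_k (tθ)^k / k! · K^k.
# Every term in the sum is a nonnegative matrix when K is stochastic.
function expm_series(K, θ, t; terms=200)
    P = zeros(size(K))
    term = Matrix{Float64}(I, size(K)...)   # holds (tθ)^k / k! · K^k
    for k in 0:terms
        P += term
        term = term * K * (t * θ / (k + 1))
    end
    return exp(-t * θ) * P
end

t = 0.8
P_series = expm_series(K, θ, t)
P_direct = exp(t * Q)
```

The two matrices `P_series` and `P_direct` should agree up to rounding error, and the series form makes nonnegativity of $P_t$ visible term by term.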

For the proof of Proposition 10.1.7, we have now shown that (i) implies (ii). Evidently (ii) implies (iii), because if $\psi_0 \in \mathcal{D}(\mathsf{X})$ and $\psi_t = \psi_0 P_t$ where $P_t$ is stochastic, then $\psi_t \in \mathcal{D}(\mathsf{X})$. Hence it remains only to show that (iii) implies (i).

EXERCISE 10.1.16. Let $Q$ be $n \times n$ and assume (iii). Fix $x \in \mathsf{X}$. By (iii) we have $\delta_x \mathrm{e}^{tQ} \mathbb{1} = 1$ for all $t \geq 0$, where $\mathbb{1}$ is a vector of ones. Show that $\sum_{x'} Q(x, x') = 0$ using this identity.

EXERCISE 10.1.17. Prove that $Q(x, x') \geq 0$ when (iii) holds and $x, x' \in \mathsf{X}$ with $x \neq x'$.

Returning to Proposition 10.1.7, the last two exercises confirm that (iii) implies
(i). The proof is now complete.

10.1.3.2 Interpretation

The previous section covered the formal relationship between intensity matrices and
Markov operators. Let’s now discuss the connection more informally, in order to build
intuition.

To this end, let $(X_t)_{t \geq 0}$ be $P_h$-Markov in discrete time. Here $h > 0$ is the length of the time step. We write the corresponding distribution sequence $\psi_{t+h} = \psi_t P_h$ in terms of change per unit of time, as in
\[
\frac{\psi_{t+h} - \psi_t}{h} = \psi_t \, \frac{P_h - I}{h},
\quad \text{where $I$ is the $n \times n$ identity.} \tag{10.25}
\]
Continuous time dynamics are obtained by taking the limit as $h \downarrow 0$. If we define
\[
Q := \lim_{h \downarrow 0} \frac{P_h - I}{h}, \tag{10.26}
\]
and assume that limits exist, then (10.25) becomes (10.21).

What properties does $Q$ have? Inspecting (10.26) suggests that
\[
Q(x, x') \approx \frac{P_h(x, x') - \mathbb{1}\{x = x'\}}{h} \tag{10.27}
\]
when $h$ is small and positive.

EXERCISE 10.1.18. Prove that, when ℎ > 0 and 𝑃ℎ is stochastic, the matrix on the
right-hand side of (10.27) is an intensity matrix.

EXERCISE 10.1.19. To formalize (10.27), use the expression for the matrix exponential in (10.6) to prove that if $P_t = \mathrm{e}^{tQ}$, then
\[
P_h(x, x') = h \, Q(x, x') + o(h) \quad \text{whenever } x \neq x'. \tag{10.28}
\]

Equation (10.28) tells us that $Q(x, x')$ represents the instantaneous rate of flow out of state $x$ and into state $x'$. The on-diagonal value $P_h(x, x)$ just balances the off-diagonal probabilities.
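
The approximations (10.27) and (10.28) are easy to inspect numerically. The following sketch assumes the matrix from Example 10.1.4; the step sizes are arbitrary illustrative choices.

```julia
using LinearAlgebra

Q = [-2.0  1.0  1.0;
      0.0 -1.0  1.0;
      2.0  1.0 -3.0]

P(h) = exp(h * Q)                  # P_h = e^{hQ}

# Gap between (P_h − I)/h and Q in the sup operator norm, for shrinking h
steps = (1e-1, 1e-2, 1e-3)
gaps = [opnorm((P(h) - I) / h - Q, Inf) for h in steps]

# Off-diagonal version, as in (10.28): P_h(x, x′)/h approaches Q(x, x′)
h = 1e-4
ratio_12 = P(h)[1, 2] / h          # compare with Q[1, 2] = 1
```

The entries of `gaps` should shrink roughly in proportion to `h`, which is what the $o(h)$ term in (10.28) predicts.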

10.1.3.3 Markov Semigroups

Fix $Q \in \mathcal{I}(\mathbb{R}^{\mathsf{X}})$. In the terminology of §10.1.2.5, the family of operators $(P_t)_{t \geq 0} = (\mathrm{e}^{tQ})_{t \geq 0} \subset \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ that solves $\dot\psi_t = \psi_t Q$ (see (10.22)) is an exponential semigroup. Since each $P_t$ is in $\mathcal{M}(\mathbb{R}^{\mathsf{X}})$, it is also called the Markov semigroup generated by $Q$. It satisfies the semigroup property $P_{s+t} = P_s P_t$ for all $s, t \geq 0$, which can be written more explicitly as
\[
P_{s+t}(x, x') = \sum_{z} P_s(x, z) P_t(z, x')
\qquad (s, t \geq 0, \; x, x' \in \mathsf{X}). \tag{10.29}
\]
In the present setting, (10.29) is called the (continuous time) Chapman–Kolmogorov equation. It states that the probability of moving from $x$ to $x'$ over $s + t$ units of time equals the probability of moving from $x$ to $z$ over $s$ units of time, and then $z$ to $x'$ over $t$ units of time, summed over all $z$.
Again following the terminology in §10.1.2.5, the intensity matrix $Q$ that defines $(P_t)_{t \geq 0} = (\mathrm{e}^{tQ})_{t \geq 0}$ is also called the infinitesimal generator of $(P_t)_{t \geq 0}$.
From Lemma 10.1.2, the derivative of $\mathrm{e}^{tQ}$ is $Q \mathrm{e}^{tQ} = \mathrm{e}^{tQ} Q$. We can write this as

• $\dot P_t = Q P_t$, which is called the Kolmogorov backward equation, and
• $\dot P_t = P_t Q$, which is called the Kolmogorov forward equation.

We can work in the other direction as well: if we can establish that a function 𝑡 ↦→ 𝑃𝑡
from R+ to L ( RX ) satisfies either one of these equations, then ( 𝑃𝑡 )𝑡⩾0 is a Markov
semigroup with infinitesimal generator 𝑄 . The next proposition gives details.

Proposition 10.1.8. Let $Q$ be an intensity matrix. If $t \mapsto P_t$ is a differentiable function from $\mathbb{R}_+$ to $\mathcal{L}(\mathbb{R}^{\mathsf{X}})$ such that $P_0 = I$ and either

(i) $\dot P_t = Q P_t$ or
(ii) $\dot P_t = P_t Q$,

then $P_t = \mathrm{e}^{tQ}$ for all $t \geq 0$.

Proposition 10.1.8 is a version of our result for linear IVPs in Proposition 10.1.3,
except that the IVP is now defined in operator space, rather than vector space.
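
Proposition 10.1.8 can be illustrated by integrating the Kolmogorov backward equation numerically and comparing the result with the matrix exponential. The sketch below uses a classical fourth-order Runge–Kutta scheme, which is our own choice of integrator and not taken from the text; `solve_backward`, `steps` and the horizon `T` are hypothetical names and values.

```julia
using LinearAlgebra

# Classical RK4 for the matrix ODE Ṗ_t = Q P_t with initial condition P_0 = I
function solve_backward(Q, T; steps=1_000)
    P = Matrix{Float64}(I, size(Q)...)
    dt = T / steps
    for _ in 1:steps
        k1 = Q * P
        k2 = Q * (P + dt / 2 * k1)
        k3 = Q * (P + dt / 2 * k2)
        k4 = Q * (P + dt * k3)
        P += dt / 6 * (k1 + 2k2 + 2k3 + k4)
    end
    return P
end

Q = [-2.0  1.0  1.0;      # the matrix from Example 10.1.4
      0.0 -1.0  1.0;
      2.0  1.0 -3.0]

T = 2.0
P_ode = solve_backward(Q, T)       # numerical solution of the backward equation
P_exp = exp(T * Q)                 # closed form from Proposition 10.1.8
gap = opnorm(P_ode - P_exp, Inf)
```

The value of `gap` should be very small. Replacing `Q * P` with `P * Q` in each stage integrates the forward equation instead and leads to the same limit.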

10.1.4 Continuous Time Markov Chains

We have discussed the one-to-one connection between intensity matrices and Markov
semigroups, and how the dynamics generated by Markov semigroups trace out distri-
bution flows. Let’s now connect these objects to continuous time Markov chains.

10.1.4.1 Definition

Let $C(\mathbb{R}_+, \mathsf{X})$ be the set of right-continuous functions from $\mathbb{R}_+$ to $\mathsf{X}$ and let $(P_t)_{t \geq 0}$ be a Markov semigroup generated by some $Q \in \mathcal{I}(\mathbb{R}^{\mathsf{X}})$. A continuous time Markov chain generated by $(P_t)_{t \geq 0}$ is a random function $(X_t)_{t \geq 0}$ that takes values in $C(\mathbb{R}_+, \mathsf{X})$ and satisfies
\[
\mathbb{P}\{X_{s+t} = x' \mid \mathcal{F}_s\} = P_t(X_s, x')
\quad \text{for all } s, t \geq 0 \text{ and } x' \in \mathsf{X}, \tag{10.30}
\]
where $\mathcal{F}_s := (X_\tau)_{0 \leq \tau \leq s}$ is the history of the process up to time $s$. To update from time $s$ to time $s + t$ given this history, we simply take the last value $X_s$ and update using $P_t$. Conditioning on $X_s = x$, we get
\[
P_t(x, x') = \mathbb{P}\{X_{s+t} = x' \mid X_s = x\}
\qquad (s, t \geq 0, \; x, x' \in \mathsf{X}).
\]

Mirroring terminology for discrete chains from §3.1.1.1, we will call a continuous
time Markov chain ( 𝑋𝑡 )𝑡⩾0 𝑄 -Markov when (10.30) holds and 𝑄 is the infinitesimal
generator of ( 𝑃𝑡 )𝑡⩾0 .
In what follows, $\mathbb{P}_x$ and $\mathbb{E}_x$ denote probabilities and expectations conditional on $X_0 = x$. Given $h \in \mathbb{R}^{\mathsf{X}}$, we have
\[
\mathbb{E}_x h(X_t) = \sum_{x'} P_t(x, x') h(x') =: (P_t h)(x).
\]

This expression mirrors the discrete time case discussed in §3.2.1.1.

10.1.4.2 A Jump Chain Construction

In §10.1.4.1 we defined a continuous time Markov chain. In this section we describe a standard method for constructing one by using three components:
(i) an initial condition 𝜓 ∈ D(X),
(ii) a jump matrix Π ∈ M ( RX ), and
(iii) a rate function 𝜆 mapping X to (0, ∞).
The process ( 𝑋𝑡 ) starts at state 𝑥 , which is drawn from 𝜓, waits there for an expo-
nential time 𝑊 with rate 𝜆 ( 𝑥 ), and then updates to a new state 𝑥 0 drawn from Π ( 𝑥, ·).
We take 𝑥 0 as the new state for the process and repeat.
These ideas are restated in Algorithm 10.1. In the algorithm, $(W_k)$ and $(Y_k)$ are drawn independently. The process $(W_k)$ is called the sequence of holding times or wait times, the sums $J_k = \sum_{i=1}^{k} W_i$ are called the jump times and $(Y_k)$ is called the embedded jump chain. The jumps and the process $(X_t)_{t \geq 0}$ are illustrated in Figure 10.2.

[Figure 10.2: A jump chain sample path. Horizontal axis: time, marked with wait times $W_1, W_2, W_3$ and jump times $J_1, J_2, J_3$; vertical axis: $X_t$.]

Algorithm 10.1: Jump chain algorithm

draw $Y_0$ from $\psi$, set $J_0 = 0$ and $k = 1$
while $t < \infty$ do
    draw $W_k$ independently from $\mathrm{Exp}(\lambda(Y_{k-1}))$
    $J_k \leftarrow J_{k-1} + W_k$
    $X_t \leftarrow Y_{k-1}$ for all $t$ in $[J_{k-1}, J_k)$
    draw $Y_k$ from $\Pi(Y_{k-1}, \cdot)$
    $k \leftarrow k + 1$
end
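
One possible rendering of Algorithm 10.1 in Julia is sketched below. It is an illustration under our own assumptions rather than a listing from the text: states are labelled $1, \ldots, n$, the simulation stops once the jump time passes a finite horizon `T`, and the names `draw` and `jump_chain`, together with the example values of `Π`, `λ` and `ψ`, are hypothetical.

```julia
using Random

# Draw an index from the probability vector p by inverse transform sampling
function draw(rng, p)
    u, c = rand(rng), 0.0
    for (i, q) in enumerate(p)
        c += q
        u < c && return i
    end
    return length(p)               # guard against rounding in the cumulative sum
end

# Simulate jump times (J_k) and the embedded chain (Y_k) up to time T.
# ψ is a probability vector, Π a stochastic matrix, λ a vector of positive rates.
function jump_chain(ψ, Π, λ, T; rng=Random.default_rng())
    Y = draw(rng, ψ)
    J = 0.0
    J_vals, Y_vals = [J], [Y]
    while J ≤ T
        W = randexp(rng) / λ[Y]    # W ~ Exp(λ(Y)), which has mean 1/λ(Y)
        J += W
        Y = draw(rng, Π[Y, :])
        push!(J_vals, J)
        push!(Y_vals, Y)
    end
    # X_t = Y_{k-1} for t in [J_{k-1}, J_k); valid for 0 ≤ t ≤ T
    X(t) = Y_vals[searchsortedlast(J_vals, t)]
    return X, J_vals, Y_vals
end

# Hypothetical three-state example
Π = [0.0 0.5 0.5;
     1.0 0.0 0.0;
     0.3 0.7 0.0]
λ = [1.0, 2.0, 0.5]
ψ = [1.0, 0.0, 0.0]

X, J_vals, Y_vals = jump_chain(ψ, Π, λ, 10.0; rng=MersenneTwister(1))
```

The returned function `X` evaluates the sample path at any `t` in `[0, T]`, and is piecewise constant and right-continuous by construction.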

Let $I \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ be the identity matrix, so $I(x, x') = \mathbb{1}\{x = x'\}$, and define $Q \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ via
\[
Q(x, x') = \lambda(x) \left( \Pi(x, x') - I(x, x') \right)
\qquad (x, x' \in \mathsf{X}). \tag{10.31}
\]
It is easy to verify that $Q$ is an intensity matrix. In fact $Q$ is the intensity matrix for the Markov semigroup associated with the process generated by Algorithm 10.1. For $x \neq x'$, it tells us that probability flows from $x$ to $x'$ at rate $\lambda(x) \Pi(x, x')$, which is the rate at which jumps occur at $x$ times the probability that a jump from $x$ lands on $x'$. The next result formalizes these ideas.
Proposition 10.1.9. The process ( 𝑋𝑡 )𝑡⩾0 generated by Algorithm 10.1 is 𝑄 -Markov.

To prove Proposition 10.1.9 we take $(X_t)_{t \geq 0}$ to be as in the statement of the proposition and define $(P_t)_{t \geq 0}$ by $P_t(x, x') = \mathbb{P}_x\{X_t = x'\}$ for all $x, x' \in \mathsf{X}$. The proof uses the following steps:

(i) Obtain an integral equation that $(P_t)_{t \geq 0}$ must satisfy.
(ii) Differentiate to obtain the Kolmogorov backward equation $\dot P_t = Q P_t$.
(iii) Solve this differential equation to obtain $P_t = \mathrm{e}^{tQ}$ for all $t$.

Here is the first step. In the statement, $\Pi P_{t-\tau}$ is the matrix product of $\Pi$ and $P_{t-\tau}$, while the equation in (10.32) is sometimes called the integrated Kolmogorov backward equation.
Lemma 10.1.10. For all $t \geq 0$ and $x, x'$ in $\mathsf{X}$, the semigroup $(P_t)_{t \geq 0}$ satisfies
\[
P_t(x, x') = \mathrm{e}^{-t\lambda(x)} I(x, x')
+ \lambda(x) \int_0^t (\Pi P_{t-\tau})(x, x') \, \mathrm{e}^{-\tau\lambda(x)} \, \mathrm{d}\tau. \tag{10.32}
\]

Proof. Fixing $x, x' \in \mathsf{X}$ and $t > 0$, we have
\[
P_t(x, x') := \mathbb{P}_x\{X_t = x'\}
= \mathbb{P}_x\{X_t = x', J_1 > t\} + \mathbb{P}_x\{X_t = x', J_1 \leq t\}. \tag{10.33}
\]
Regarding the first term on the right hand side of (10.33),
\[
\mathbb{P}_x\{X_t = x', J_1 > t\} = I(x, x') \, \mathbb{P}_x\{J_1 > t\} = I(x, x') \, \mathrm{e}^{-t\lambda(x)}. \tag{10.34}
\]
For the second term on the right hand side of (10.33), we obtain
\[
\mathbb{P}_x\{X_t = x', J_1 \leq t\}
= \mathbb{E}_x \left[ \mathbb{1}\{J_1 \leq t\} \, \mathbb{P}_x\{X_t = x' \mid W_1, Y_1\} \right]
= \mathbb{E}_x \left[ \mathbb{1}\{J_1 \leq t\} \, P_{t - J_1}(Y_1, x') \right].
\]
Evaluating the expectation and using the independence of $J_1$ and $Y_1$, this becomes
\[
\begin{aligned}
\mathbb{P}_x\{X_t = x', J_1 \leq t\}
&= \int_0^\infty \mathbb{1}\{\tau \leq t\} \sum_z \Pi(x, z) P_{t-\tau}(z, x') \, \lambda(x) \mathrm{e}^{-\tau\lambda(x)} \, \mathrm{d}\tau \\
&= \lambda(x) \int_0^t \sum_z \Pi(x, z) P_{t-\tau}(z, x') \, \mathrm{e}^{-\tau\lambda(x)} \, \mathrm{d}\tau.
\end{aligned}
\]
Combining this result with (10.33) and (10.34) gives (10.32). □

Differentiating the integrated Kolmogorov backward equation produces the Kolmogorov backward equation:
Lemma 10.1.11. If $(P_t)_{t \geq 0}$ satisfies (10.32), then $P_0 = I$ and $\dot P_t = Q P_t$ for all $t \geq 0$.

Proof. The claim that $P_0 = I$ is obvious. For the second claim, one can easily verify that, when $f$ is a differentiable function and $\alpha > 0$, we have
\[
g(t) = \mathrm{e}^{-t\alpha} f(t) \implies g'(t) = \mathrm{e}^{-t\alpha} f'(t) - \alpha g(t). \tag{10.35}
\]
Note also that, with the change of variable $s = t - \tau$, we can rewrite (10.32) as
\[
P_t(x, x') = \mathrm{e}^{-t\lambda(x)} \left\{ I(x, x') + \lambda(x) \int_0^t (\Pi P_s)(x, x') \, \mathrm{e}^{s\lambda(x)} \, \mathrm{d}s \right\}. \tag{10.36}
\]
Applying (10.35) produces
\[
P_t'(x, x') = \mathrm{e}^{-t\lambda(x)} \left\{ \lambda(x) (\Pi P_t)(x, x') \, \mathrm{e}^{t\lambda(x)} \right\} - \lambda(x) P_t(x, x').
\]
Rearranging yields $\dot P_t(x, x') = \lambda(x) [(\Pi - I) P_t](x, x')$, which is identical to $\dot P_t = Q P_t$. □

Proof of Proposition 10.1.9. Proposition 10.1.9 follows directly from Lemma 10.1.10
and Lemma 10.1.11, combined with Proposition 10.1.8. □

10.1.4.3 Application: Inventory Dynamics

Let 𝑋𝑡 be a firm’s inventory at time 𝑡 . When current stock is 𝑥 > 0, customers arrive
at rate 𝜆 ( 𝑥 ), so the wait time for the next customer is an independent draw from the
Exp( 𝜆 ( 𝑥 )) distribution; 𝜆 maps X to (0, ∞).
The 𝑘-th customer demands 𝑈𝑘 units, where each 𝑈𝑘 is an independent draw from
a fixed distribution 𝜑 on N. Purchases are constrained by inventory, however, so
inventory falls by 𝑈𝑘 ∧ 𝑋𝑡 . When inventory hits zero the firm orders 𝑏 units of new
stock. The wait time for new stock is also exponential, being an independent draw
from Exp( 𝜆 (0)).
Let 𝑌 represent the inventory size after the next jump (induced by either a customer
purchase or ordering new stock), given current stock 𝑥 . If 𝑥 > 0, then 𝑌 is a draw from
the distribution of 𝑥 − 𝑈 ∧ 𝑥 where 𝑈 ∼ 𝜑. If 𝑥 = 0, then 𝑌 ≡ 𝑏. Hence 𝑌 is a draw from
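$\Pi(x, \cdot)$, where $\Pi(0, y) = \mathbb{1}\{y = b\}$ and, for $0 < x \leq b$,
\[
\Pi(x, y) =
\begin{cases}
0 & \text{if } x \leq y \\
\mathbb{P}\{x - U = y\} & \text{if } 0 < y < x \\
\mathbb{P}\{U \geq x\} & \text{if } y = 0
\end{cases} \tag{10.37}
\]
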

EXERCISE 10.1.20. Prove that Π is a stochastic matrix on X ≔ {0, 1, . . . , 𝑏}.
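
As a numerical companion to (10.37), the sketch below builds $\Pi$ for geometric demand $\varphi(k) = (1 - \alpha)^{k-1} \alpha$ on $k = 1, 2, \ldots$, in which case $\mathbb{P}\{U \geq x\} = (1 - \alpha)^{x-1}$. The function name `inventory_jump_matrix`, the index shift (state $x$ is stored at index $x + 1$) and the parameter values are our own assumptions.

```julia
using LinearAlgebra

# Jump matrix (10.37) under geometric demand; state x ∈ {0, …, b} sits at index x + 1
function inventory_jump_matrix(; α=0.7, b=10)
    φ(k) = (1 - α)^(k - 1) * α         # P{U = k}
    tail(x) = (1 - α)^(x - 1)          # P{U ≥ x} for x ≥ 1
    Π = zeros(b + 1, b + 1)
    Π[1, b + 1] = 1.0                  # Π(0, y) = 1{y = b}: restock to b
    for x in 1:b
        Π[x + 1, 1] = tail(x)          # y = 0: demand exhausts the stock
        for y in 1:(x - 1)
            Π[x + 1, y + 1] = φ(x - y) # 0 < y < x: P{x − U = y} = φ(x − y)
        end
    end
    return Π
end

Π = inventory_jump_matrix()
row_sums = sum(Π, dims=2)

# Intensity matrix from (10.31) with a constant rate λ(x) ≡ 0.5
Q = 0.5 * (Π - I)
```

Each entry of `row_sums` should equal one, which is the numerical counterpart of Exercise 10.1.20, and the rows of `Q` should sum to zero.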

We can simulate the inventory process $(X_t)_{t \geq 0}$ via the jump chain algorithm (Algorithm 10.1 on page 322). In this case, the wait time sequence $(W_k)$ is the wait time for customers (and for inventory when $X_t = 0$) and the jump sequence $(Y_k)$ is the level of inventory immediately after each jump. By Proposition 10.1.9, the inventory process is $Q$-Markov with $Q$ given by $Q(x, x') = \lambda(x) (\Pi(x, x') - I(x, x'))$.

[Figure 10.3: Continuous time inventory dynamics. Horizontal axis: time (0 to 50); vertical axis: inventory $X_t$.]

Figure 10.3 shows a simulation when orders are geometric, so that

𝜑 ( 𝑘) = P{𝑈 = 𝑘 } = (1 − 𝛼) 𝑘−1 𝛼 ( 𝑘 ∈ N, 𝛼 ∈ (0, 1)) .

In the simulation we set 𝛼 = 0.7, 𝑏 = 10 and 𝜆 ( 𝑥 ) ≡ 0.5. The figure plots 𝑋𝑡 for
𝑡 ∈ [0, 50]. Since each wait time 𝑊𝑖 is a draw from Exp(0.5) the mean wait time is
2.0. The function that produces the map 𝑡 ↦→ 𝑋𝑡 is shown in Listing 27.

10.1.4.4 From Intensity Matrices to Jump Chains

If $Q \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ is a given intensity matrix, how should we produce a continuous time $Q$-Markov chain? If we can construct a jump chain that is $Q$-Markov, then not only do we obtain existence of a $Q$-Markov chain but we also provide a way to simulate one (via Algorithm 10.1).
To construct such a jump chain we first fix an intensity matrix $Q \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ and, to simplify matters, assume that all rows of $Q$ are nonzero. This means that the process has no absorbing states (since having nonzero rows is equivalent to $Q(x, x) < 0$ for all $x$, which in turn states that there is a nonzero outflow from each state).

using Random, Distributions

"""
Generate a path for inventory starting at b, up to time T.

Return the path as a function X(t) constructed from (J_k) and (Y_k).
"""
function sim_path(; T=10, seed=123, λ=0.5, α=0.7, b=10)

    J, Y = 0.0, b
    J_vals, Y_vals = [J], [Y]
    Random.seed!(seed)
    φ = Exponential(1/λ)    # Wait times are exponential with rate λ (mean 1/λ)
    G = Geometric(α)        # Orders are geometric

    while true
        W = rand(φ)
        J += W
        push!(J_vals, J)
        if Y == 0
            Y = b           # Restock when inventory is empty
        else
            U = rand(G) + 1 # Geometric on 1, 2,...
            Y = Y - min(Y, U)
        end
        push!(Y_vals, Y)
        if J > T
            break
        end
    end

    # J_vals[k] holds J_{k-1} and Y_vals[k] holds Y_{k-1}, so X_t = Y_vals[k]
    # for t in [J_vals[k], J_vals[k+1]), as in Algorithm 10.1
    function X(t)
        k = searchsortedlast(J_vals, t)
        return Y_vals[k]
    end

    return X
end

Listing 27: Continuous time inventory dynamics (inventory_cont_time.jl)



Then we set
\[
\lambda(x) := -Q(x, x)
\quad \text{and} \quad
\Pi(x, x') := I(x, x') + \frac{Q(x, x')}{\lambda(x)}.
\]
It is straightforward to confirm that $\Pi \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$ and that $Q$ satisfies (10.31). Hence, by Proposition 10.1.9, the process $(X_t)_{t \geq 0}$ generated by Algorithm 10.1 is $Q$-Markov.
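
The decomposition of $Q$ into a rate function and a jump matrix takes only a few lines of Julia. This is a minimal sketch that assumes all rows of $Q$ are nonzero; the function name `jump_decomposition` is hypothetical and the test matrix is the one from Example 10.1.4.

```julia
using LinearAlgebra

# Split an intensity matrix with nonzero rows into rates λ and a jump matrix Π
function jump_decomposition(Q)
    λ = -diag(Q)                   # λ(x) = −Q(x, x) > 0
    Π = I + Q ./ λ                 # Π(x, x′) = I(x, x′) + Q(x, x′) / λ(x)
    return λ, Π
end

Q = [-2.0  1.0  1.0;
      0.0 -1.0  1.0;
      2.0  1.0 -3.0]

λ, Π = jump_decomposition(Q)
Q_rebuilt = λ .* (Π - I)           # recover Q through (10.31)
```

Because `λ` is a column vector, the broadcast `Q ./ λ` divides row `x` of `Q` by `λ[x]`. The resulting `Π` has a zero diagonal, so the embedded jump chain never jumps from a state to itself.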

10.2 Continuous Time Markov Decision Processes

We are ready to turn to dynamic programming in continuous time. As for the discrete
time case, continuous time dynamic programs aim to maximize a measure of lifetime
value. In §10.2.1 we study lifetime valuations. In §10.2.2 we learn how to maximize
them.

10.2.1 Valuation

In this section we consider lifetime valuations associated with continuous reward flows, starting from a general semigroup perspective and then progressing to specific cases (such as expected lifetime value under constant discounting). Throughout, $\mathsf{X}$ is a finite set.

10.2.1.1 A Semigroup Perspective

For the discrete time problems with state-dependent discounting that we studied in Chapter 6, lifetime valuations take the form $v = \sum_{t \geq 0} K^t h$ for some $h \in \mathbb{R}^{\mathsf{X}}$ and a positive linear operator $K$ on $\mathbb{R}^{\mathsf{X}}$. (See Theorem 6.1.1 and (6.18) on page 192.) For a continuous time version we fix $h \in \mathbb{R}^{\mathsf{X}}$, take $(K_t)_{t \geq 0}$ to be a positive exponential semigroup in $\mathcal{L}(\mathbb{R}^{\mathsf{X}})$, where positive means $K_t \geq 0$ for all $t$, and set
\[
v = \int_0^\infty K_t h \, \mathrm{d}t. \tag{10.38}
\]
Let $A \in \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ be the infinitesimal generator of $(K_t)_{t \geq 0}$. The next result provides a condition for finiteness of $v$ and several characterizations.

Proposition 10.2.1. If $s(A) < 0$, then

(i) the integral in (10.38) is finite and
\[
v = \int_0^t K_\tau h \, \mathrm{d}\tau + K_t v \quad \text{for all } t \geq 0, \tag{10.39}
\]
(ii) $A$ is bijective and $v = -A^{-1} h$,
(iii) $A^{-1} \leq 0$, and
(iv) the operator $U \colon \mathbb{R}^{\mathsf{X}} \to \mathbb{R}^{\mathsf{X}}$ defined by
\[
U w = h + (I + A) w \qquad (w \in \mathbb{R}^{\mathsf{X}}) \tag{10.40}
\]
is order stable on $\mathbb{R}^{\mathsf{X}}$ and $v$ in (10.38) is the unique fixed point.

A way to understand (10.39) is to view the valuation $v$ as a price that reflects prospective benefits from holding an asset. The asset yields a flow of benefits, where
ℎ ( 𝑥 ) is the instantaneous reward in state 𝑥 . Rewards 𝑡 periods in the future are dis-
counted by the pricing operator 𝐾𝑡 . Thus, ( 𝐾𝑡 ℎ) ( 𝑥 ) is the anticipated payoff 𝑡 periods
ahead, discounted for the wait time and possibly also for risk as in (6.31) on page 203.
The value 𝑣 ( 𝑥 ) is then lifetime value, which equals the current price.
In this asset valuation setting, (10.39) is a natural consistency condition. It says
that the price of purchasing the asset today is equal to the payouts obtained from
holding the asset from now until time 𝑡 and then selling it for current discounted
value 𝐾𝑡 𝑣. (This is the continuous time analog of (6.33) on page 204.)
The discussion above matches the semigroup perspective on asset pricing intro-
duced in Garman (1985) and Duffie and Garman (1986). In addition to shedding
light on (10.39), it also leads to the assertion that 𝑣 = − 𝐴−1 ℎ in (ii), which is obtained
by differentiating (10.39). Details are in the proof below.

Proof of Proposition 10.2.1. From Proposition 10.1.6, we have $K_t = \mathrm{e}^{tA}$ for all $t \geq 0$. Since $s(A) < 0$, Theorem 10.1.5 implies that the integral in (10.38) is finite. For any $t \geq 0$,
\[
v = \int_0^\infty K_\tau h \, \mathrm{d}\tau
= \int_0^t K_\tau h \, \mathrm{d}\tau + \int_t^\infty K_\tau h \, \mathrm{d}\tau.
\]
Using the semigroup property and linearity of $K_t$, we can write the last term on the right hand side as
\[
\int_t^\infty K_\tau h \, \mathrm{d}\tau
= \int_0^\infty K_{t+\tau} h \, \mathrm{d}\tau
= \int_0^\infty K_t K_\tau h \, \mathrm{d}\tau
= K_t \int_0^\infty K_\tau h \, \mathrm{d}\tau
= K_t v.
\]
Combining this result with the expression for $v$ in the previous display proves (10.39). This proves part (i) of the proposition.
Turning to (ii), if we rearrange (10.39) and divide by $t > 0$, we get
\[
-\frac{K_t - I}{t} \, v = \frac{1}{t} \int_0^t K_\tau h \, \mathrm{d}\tau. \tag{10.41}
\]
By the fundamental theorem of calculus,
\[
\lim_{t \to 0} \frac{1}{t} \int_0^t K_\tau h \, \mathrm{d}\tau
= \left. \frac{\mathrm{d}}{\mathrm{d}t} \int_0^t K_\tau h \, \mathrm{d}\tau \right|_{t=0}
= K_0 h = I h = h.
\]
As a result, taking $t \to 0$ in (10.41) and using the definition of the infinitesimal generator yields $-A v = h$. Moreover, since $s(A) < 0$, all eigenvalues of $A$ are nonzero. Hence $A$ has nonzero determinant and is therefore nonsingular (bijective). Combining these facts yields $v = -A^{-1} h$.
Regarding (iii), fix $g \in \mathbb{R}^{\mathsf{X}}$ with $g \geq 0$. From the preceding results, the function $w = \int_0^\infty K_t g \, \mathrm{d}t$ is finite and equals $-A^{-1} g$. Since $K_t \geq 0$ for all $t$, we have $w \geq 0$. Thus, $-A^{-1} g \geq 0$ whenever $g \geq 0$. Hence $-A^{-1} \geq 0$, or $A^{-1} \leq 0$.
For (iv) we use the fact that $v$ obeys $-A v = h$ to obtain $v = h + (I + A) v$. Hence $v$ is a fixed point of $U$. Conversely, if $w$ is a fixed point of $U$, then $-A w = h$. But $A$ is invertible, so then $w = -A^{-1} h = v$. Hence $v$ is the only fixed point of $U$ in $\mathbb{R}^{\mathsf{X}}$.
Order stability of $U$ requires upward and downward stability on $\mathbb{R}^{\mathsf{X}}$. For upward stability, suppose that $w \in \mathbb{R}^{\mathsf{X}}$ and $U w \geq w$. Then $h + A w \geq 0$, or $-A w \leq h$. But $-A^{-1} \geq 0$, so $w \leq -A^{-1} h = v$ and upward stability holds. The proof of downward stability is similar. □
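
The claims in Proposition 10.2.1 can be checked numerically on a small example. In the sketch below the generator `A` and the reward vector `h` are hypothetical; `A` has nonnegative off-diagonal entries, so $\mathrm{e}^{tA} \geq 0$, and strictly negative row sums, so $s(A) < 0$. The integral in (10.38) is approximated by a trapezoid rule on a truncated horizon, and `integrate_flow`, `T` and `dt` are our own illustrative choices.

```julia
using LinearAlgebra

A = [-1.0  0.5  0.2;       # hypothetical generator of a positive semigroup
      0.3 -0.8  0.1;
      0.0  0.4 -0.9]
h = [1.0, 2.0, 0.5]

s_A = maximum(real.(eigvals(A)))   # spectral bound s(A)
v = -(A \ h)                       # part (ii): v = −A⁻¹ h

# Trapezoid approximation of ∫_0^T K_t h dt with K_t = e^{tA}
function integrate_flow(A, h; T=60.0, dt=0.001)
    K = exp(dt * A)                # one-step operator K_dt
    x = copy(h)                    # x tracks K_t h, starting from t = 0
    total = zeros(length(h))
    for _ in 1:round(Int, T / dt)
        x_next = K * x
        total += dt / 2 * (x + x_next)
        x = x_next
    end
    return total
end

v_quad = integrate_flow(A, h)                          # approximates (10.38)
rhs_1039 = integrate_flow(A, h; T=2.0) + exp(2.0 * A) * v   # right side of (10.39) at t = 2
Uv = h + (I + A) * v                                   # part (iv): U v should equal v
```

Up to discretization error, `v_quad`, `rhs_1039` and `Uv` should all coincide with `v`, and every entry of `-inv(A)` should be nonnegative.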

10.2.1.2 Valuations as Expectations


In applications, the expression $v = \int_0^\infty K_t h \, \mathrm{d}t$ from (10.38) typically arises as a discounted expectation over a flow of rewards. When analyzing $v$ we wish to deploy Proposition 10.2.1, so we need to check that any expectation we propose results in $(K_t)$ being a semigroup. The next proposition provides one result along these lines.

Proposition 10.2.2. If $(X_t)_{t \geq 0}$ is a continuous time Markov chain on $\mathsf{X}$ and $\delta \in \mathbb{R}^{\mathsf{X}}$, then the family of operators $(K_t)_{t \geq 0} \subset \mathcal{L}(\mathbb{R}^{\mathsf{X}})$ defined by
\[
(K_t h)(x) = \mathbb{E}_x \exp\left( -\int_0^t \delta(X_\tau) \, \mathrm{d}\tau \right) h(X_t)
\qquad (t \geq 0) \tag{10.42}
\]
is a positive $C_0$-semigroup.

In the proof of Proposition 10.2.2, we will use the fact that $(X_t)_{t \geq 0}$ satisfies the Markov property. In particular, if $H$ is a real-valued function on the path space $C(\mathbb{R}_+, \mathsf{X})$, then
\[
\mathbb{E}_x \left[ H((X_\tau)_{\tau \geq s}) \mid (X_\tau)_{\tau=0}^{s} \right]
= \mathbb{E}_{X_s} H((X_\tau)_{\tau \geq 0})
\quad \text{for all } x \in \mathsf{X}. \tag{10.43}
\]
For a proof of (10.43), see, for example, Chapter 2 of Liggett (2010).

EXERCISE 10.2.1. Let $(X_t)_{t \geq 0}$ be as stated and, for each $s, t \in \mathbb{R}_+$ with $s \leq t$, let $\eta(s, t)$ be the random variable defined by
\[
\eta(s, t) = \exp\left( -\int_s^t \delta(X_\tau) \, \mathrm{d}\tau \right).
\]
Show that
(i) $\eta(s, t) > 0$ for all $0 \leq s \leq t$,
(ii) $\eta(s, s) = 1$ for all $s \in \mathbb{R}_+$, and
(iii) $\eta(0, s + t) = \eta(0, s) \, \eta(s, s + t)$ for all $s, t \in \mathbb{R}_+$.

Proof of Proposition 10.2.2. Fix $h \in \mathbb{R}^{\mathsf{X}}$. Evidently $(K_0 h)(x) = h(x)$, so $K_0 = I$. Regarding the semigroup property, we fix $s, t \geq 0$ and use Exercise 10.2.1 and the law of iterated expectations to obtain
\[
(K_{s+t} h)(x) = \mathbb{E}_x \, \eta(0, s + t) \, h(X_{s+t})
= \mathbb{E}_x \left[ \eta(0, s) \, \mathbb{E} \left[ \eta(s, s + t) \, h(X_{s+t}) \mid (X_\tau)_{\tau=0}^{s} \right] \right].
\]
Using the Markov property (10.43), the inner expectation in the last display can be expressed as
\[
\begin{aligned}
\mathbb{E} \left[ \exp\left( -\int_s^{s+t} \delta(X_\tau) \, \mathrm{d}\tau \right) h(X_{s+t}) \,\Big|\, (X_\tau)_{\tau=0}^{s} \right]
&= \mathbb{E}_{X_s} \left[ \exp\left( -\int_0^t \delta(X_\tau) \, \mathrm{d}\tau \right) h(X_t) \right] \\
&= (K_t h)(X_s),
\end{aligned}
\]
so
\[
(K_{s+t} h)(x) = \mathbb{E}_x \, \eta(0, s) (K_t h)(X_s)
= \mathbb{E}_x \exp\left( -\int_0^s \delta(X_\tau) \, \mathrm{d}\tau \right) (K_t h)(X_s)
= (K_s K_t h)(x).
\]
This argument confirms that $K_{s+t} = K_s \circ K_t$.
To see that $K_t$ is a positive operator for all $t$, observe that if $h \geq 0$, then the expectation in (10.42) is nonnegative. Hence $K_t h \geq 0$ whenever $h \geq 0$.
To prove continuity of $t \mapsto K_t h$, it suffices to show that $(K_t h)(x) \to h(x)$ as $t \downarrow 0$ (see, e.g., Engel and Nagel (2006), Proposition 1.3). This holds by right-continuity of $X_t$, which gives $h(X_t) \to h(x)$ as $t \downarrow 0$ when $X_0 = x$, and hence
\[
\lim_{t \downarrow 0} (K_t h)(x)
= \mathbb{E}_x \lim_{t \downarrow 0} \exp\left( -\int_0^t \delta(X_\tau) \, \mathrm{d}\tau \right) h(X_t)
= h(x).
\]
(Readers familiar with measure theory can justify the change of limit and expectation via the dominated convergence theorem.) □

10.2.1.3 Constant Discounting

Many studies of continuous time dynamic programming with discounting use a constant discount rate. In this setting, the lifetime value in (10.38) becomes
\[
v(x) := \mathbb{E}_x \int_0^\infty \mathrm{e}^{-t\delta} h(X_t) \, \mathrm{d}t \tag{10.44}
\]
for some $\delta \in \mathbb{R}$ and $h \in \mathbb{R}^{\mathsf{X}}$. Here $(X_t)_{t \geq 0}$ is a continuous time Markov chain on the finite state space $\mathsf{X}$ generated by Markov semigroup $(P_t)_{t \geq 0}$ with intensity operator $Q$. The idea is that $h(X_t)$ is an instantaneous reward at each time $t$, while $\delta$ is a fixed discount rate. Equation (10.44) is the continuous time version of (3.16) on page 94.

Proposition 10.2.3. If $\delta > 0$, then $v$ in (10.44) is finite, $\delta I - Q$ is bijective,
\[
(\delta I - Q)^{-1} \geq 0
\quad \text{and} \quad
v = (\delta I - Q)^{-1} h. \tag{10.45}
\]
In addition, $v$ is the unique fixed point of
\[
U w = h + (Q + (1 - \delta) I) w \qquad (w \in \mathbb{R}^{\mathsf{X}}) \tag{10.46}
\]
and $U$ is order stable on $\mathbb{R}^{\mathsf{X}}$.

Proof. As a first step, we reverse the order of expectation and integration in (10.44) to get
\[
v(x) = \int_0^\infty (K_t h)(x) \, \mathrm{d}t
\quad \text{where} \quad
(K_t h)(x) := \mathrm{e}^{-t\delta} \, \mathbb{E}_x h(X_t) = \mathrm{e}^{-t\delta} (P_t h)(x).
\]
(This change of order can be justified by Fubini’s theorem, which can be applied when $\mathbb{E}_x \int_0^\infty \mathrm{e}^{-t\delta} |h(X_t)| \, \mathrm{d}t < \infty$. Since $\mathsf{X}$ is finite, we have $|h| \leq M < \infty$ for some constant $M$, and the double integral is dominated by $M \int_0^\infty \mathrm{e}^{-t\delta} \, \mathrm{d}t = M / \delta$.)
Note that $K_t$ is a special case of (10.42). Hence $(K_t)_{t \geq 0}$ is a positive $C_0$-semigroup. Its infinitesimal generator is $A := Q - \delta I$, since $K_t = \mathrm{e}^{-t\delta} P_t = \mathrm{e}^{t(Q - \delta I)}$. We claim that $s(A) < 0$. To see this, observe that (using (10.17)),
\[
\mathrm{e}^{s(Q - \delta I)} = \rho(\mathrm{e}^{Q - \delta I}) = \rho(\mathrm{e}^{Q} \mathrm{e}^{-\delta I}) = \rho(\mathrm{e}^{Q} \mathrm{e}^{-\delta} I) = \mathrm{e}^{-\delta} \rho(\mathrm{e}^{Q}) = \mathrm{e}^{-\delta} \rho(P_1) = \mathrm{e}^{-\delta},
\]
where the last equality holds because $P_1$ is a stochastic matrix and hence $\rho(P_1) = 1$. Taking logs gives $s(Q - \delta I) = -\delta$. Since $\delta > 0$, we have $s(Q - \delta I) < 0$, as claimed.
We can now apply Proposition 10.2.1 with $A = Q - \delta I$ and $K_t = \mathrm{e}^{tA}$. The proposition tells us that $A$ is bijective, and
\[
v = -A^{-1} h = (-A)^{-1} h = (\delta I - Q)^{-1} h.
\]
It also tells us that $-A^{-1} \geq 0$, so $(\delta I - Q)^{-1} = (-A)^{-1} = -A^{-1} \geq 0$. This confirms both claims in (10.45). Finally, the operator $U$ in (10.46) is a special case of $U$ in (10.40), with $A = Q - \delta I$, so $U$ is order stable with unique fixed point $v$ (by Proposition 10.2.1). All of the claims in Proposition 10.2.3 are now verified. □
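
For a quick numerical illustration of Proposition 10.2.3, the sketch below computes $v = (\delta I - Q)^{-1} h$ for the intensity matrix of Example 10.1.4. The reward vector `h` and discount rate `δ` are hypothetical values. The last check uses the additional observation that $(\delta I - Q) \mathbb{1} = \delta \mathbb{1}$, so each row of $(\delta I - Q)^{-1}$ sums to $1/\delta$.

```julia
using LinearAlgebra

Q = [-2.0  1.0  1.0;
      0.0 -1.0  1.0;
      2.0  1.0 -3.0]
h = [1.0, 0.0, 2.0]        # hypothetical reward vector
δ = 0.05                   # hypothetical discount rate

A = Q - δ * I
s_A = maximum(real.(eigvals(A)))   # spectral bound, which should equal −δ

R = inv(δ * I - Q)                 # (δI − Q)⁻¹
v = R * h                          # lifetime value from (10.45)

U(w) = h + (Q + (1 - δ) * I) * w   # the operator in (10.46)
fixed_point_gap = maximum(abs.(U(v) - v))
row_sums = sum(R, dims=2)          # each row should sum to 1/δ
```

Since `h` is bounded between 0 and 2, every entry of `v` should lie between $0$ and $2/\delta$, which gives a simple sanity check on the output.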

10.2.2 Constructing a Decision Process

In this section we define continuous time Markov decision processes, discuss optimal-
ity theory, and provide algorithms and applications.

10.2.2.1 Definition

Given two finite sets $\mathsf{X}$ and $\mathsf{A}$, called the state and action spaces respectively, we define a continuous time Markov decision process (or continuous time MDP) to be a tuple $\mathscr{C} = (\Gamma, \delta, r, Q)$ consisting of

(i) a nonempty correspondence $\Gamma$ from $\mathsf{X}$ to $\mathsf{A}$, referred to as the feasible correspondence, which in turn defines the feasible state-action pairs
\[
\mathsf{G} := \{(x, a) \in \mathsf{X} \times \mathsf{A} : a \in \Gamma(x)\},
\]
(ii) a constant $\delta > 0$, referred to as the discount rate,
(iii) a function $r$ from $\mathsf{G}$ to $\mathbb{R}$, referred to as the reward function, and
(iv) an intensity kernel $Q$ from $\mathsf{G}$ to $\mathsf{X}$; that is, a map $Q$ from $\mathsf{G} \times \mathsf{X}$ to $\mathbb{R}$ satisfying
\[
\sum_{x'} Q(x, a, x') = 0 \quad \text{for all } (x, a) \text{ in } \mathsf{G}
\]
and $Q(x, a, x') \geq 0$ whenever $x \neq x'$.

Informally, at state $x$ with action $a$ over the short interval from $t$ to $t + h$, the controller receives instantaneous reward $r(x, a) h$ and the state transitions to state $x' \neq x$ with probability $Q(x, a, x') h + o(h)$.
Paralleling our discussion of the discrete time case in Chapter 5, the set of feasible
policies is
Σ ≔ {𝜎 ∈ AX : 𝜎 ( 𝑥 ) ∈ Γ ( 𝑥 ) for all 𝑥 ∈ X} . (10.47)

10.2.2.2 Lifetime Values

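Choosing policy $\sigma$ from $\Sigma$ means that we respond to state $X_t$ with action $A_t := \sigma(X_t)$ at every $t \geq 0$. The state then evolves according to the intensity operator
\[
Q_\sigma(x, x') := Q(x, \sigma(x), x') \qquad (x, x' \in \mathsf{X}).
\]
Letting
\[
P_t^\sigma := \mathrm{e}^{t Q_\sigma}
\quad \text{and} \quad
r_\sigma(x) := r(x, \sigma(x)) \qquad (x \in \mathsf{X}),
\]
the lifetime value of following $\sigma$ starting from state $x$ is
\[
v_\sigma(x) := \mathbb{E}_x \int_0^\infty \mathrm{e}^{-\delta t} r(X_t, \sigma(X_t)) \, \mathrm{d}t
= \mathbb{E}_x \int_0^\infty \mathrm{e}^{-\delta t} r_\sigma(X_t) \, \mathrm{d}t, \tag{10.48}
\]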

where ( 𝑋𝑡 )𝑡⩾0 is 𝑄 𝜎 -Markov with initial condition 𝑥 . We call 𝑣𝜎 the 𝜎-value function.
Since 𝛿 > 0, we can apply Proposition 10.2.3 to obtain

𝑣𝜎 = ( 𝛿𝐼 − 𝑄 𝜎 ) −1 𝑟𝜎 . (10.49)

Representation (10.49) provides a straightforward method for computing 𝑣𝜎 .

10.2.2.3 Greedy Policies

A policy $\sigma \in \Sigma$ is called $v$-greedy for $\mathscr{C}$ if
\[
\sigma(x) \in \operatorname*{argmax}_{a \in \Gamma(x)}
\left\{ r(x, a) + \sum_{x'} v(x') Q(x, a, x') \right\}
\quad \text{for all } x \in \mathsf{X}. \tag{10.50}
\]

Like the discrete time case, a 𝑣-greedy policy chooses actions optimally to trade off
high current rewards versus high rate of flow into future states with high values.
Unlike the discrete time case, the discount factor does not appear in (10.50) because
the trade-off is instantaneous.

10.2.2.4 Policy Iteration

We introduce a continuous time policy iteration algorithm that parallels discrete time
HPI for Markov decision processes, as described in §5.1.4.2.
The continuous time HPI routine is given in Algorithm 10.2, with the intuition
being similar to that for the discrete time MDP version given on page 141. We provide
convergence results in §10.2.3.

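Algorithm 10.2: Continuous time Howard policy iteration

input $\sigma_0 \in \Sigma$, an initial guess of $\sigma$
$k \leftarrow 0$
$\varepsilon \leftarrow 1$
while $\varepsilon > 0$ do
    $v_k \leftarrow (\delta I - Q_{\sigma_k})^{-1} r_{\sigma_k}$
    $\sigma_{k+1} \leftarrow$ a $v_k$-greedy policy
    $\varepsilon \leftarrow \mathbb{1}\{\sigma_k \neq \sigma_{k+1}\}$
    $k \leftarrow k + 1$
end
return $\sigma_k$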
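
Algorithm 10.2 translates almost line by line into Julia. The sketch below is an illustration under simplifying assumptions of our own: states and actions are labelled by integers, every action is feasible in every state (so $\Gamma(x) = \mathsf{A}$), rewards are stored as a matrix `r[x, a]` and the intensity kernel as an array `Q[x, a, x′]`. The function names and the numerical values of `Q1`, `Q2`, `r` and `δ` are hypothetical.

```julia
using LinearAlgebra

# Policy evaluation, as in (10.49): v_σ = (δI − Q_σ)⁻¹ r_σ
function policy_value(σ, r, Q, δ)
    n = length(σ)
    Q_σ = [Q[x, σ[x], y] for x in 1:n, y in 1:n]
    r_σ = [r[x, σ[x]] for x in 1:n]
    return (δ * I - Q_σ) \ r_σ
end

# A v-greedy policy, as in (10.50), assuming all actions are feasible
function greedy(v, r, Q)
    n, m = size(r)
    return [argmax([r[x, a] + dot(Q[x, a, :], v) for a in 1:m]) for x in 1:n]
end

# Continuous time Howard policy iteration, following Algorithm 10.2
function howard_policy_iteration(r, Q, δ; σ=ones(Int, size(r, 1)))
    while true
        v = policy_value(σ, r, Q, δ)
        σ_new = greedy(v, r, Q)
        σ_new == σ && return σ, v
        σ = σ_new
    end
end

# Hypothetical example with three states and two actions
Q1 = [-1.0 1.0 0.0; 0.5 -1.0 0.5; 0.0 1.0 -1.0]
Q2 = [-2.0 0.0 2.0; 1.0 -3.0 2.0; 1.0 1.0 -2.0]
Q = zeros(3, 2, 3)
Q[:, 1, :] = Q1
Q[:, 2, :] = Q2
r = [1.0 0.5; 0.0 -0.5; 2.0 1.5]
δ = 0.1

σ_star, v_star = howard_policy_iteration(r, Q, δ)

# Right-hand side of the HJB equation evaluated at v_star
hjb_rhs = [maximum(r[x, a] + dot(Q[x, a, :], v_star) for a in 1:2) for x in 1:3]
```

At termination the returned policy is greedy with respect to its own value, so `δ * v_star` should match `hjb_rhs` up to rounding error.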

10.2.2.5 Policy Operators

For each 𝜎 ∈ Σ, let 𝑇𝜎 be the operator defined at 𝑣 ∈ RX by

𝑇𝜎 𝑣 = 𝑟𝜎 + ( 𝑄 𝜎 + (1 − 𝛿) 𝐼 ) 𝑣. (10.51)

As shown in Proposition 10.2.3, each 𝑇𝜎 is order stable on RX , with unique fixed point
𝑣𝜎 . Hence A ≔ ( RX , {𝑇𝜎 }) is an order stable ADP.

EXERCISE 10.2.2. Show that 𝜎 is 𝑣-greedy (i.e., (10.50) holds) if and only if 𝜎 is
𝑣-greedy for A in the sense of §9.1.2.2.

10.2.3 Optimality
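For a continuous time MDP $\mathscr{C} = (\Gamma, \delta, r, Q)$ with $\sigma$-value functions $\{v_\sigma\}$,
• the value function generated by $\mathscr{C}$ is $v^* := \bigvee_\sigma v_\sigma$, and
• a policy $\sigma \in \Sigma$ is called optimal for $\mathscr{C}$ if $v_\sigma = v^*$.
A function $v \in \mathbb{R}^{\mathsf{X}}$ is said to satisfy a Hamilton–Jacobi–Bellman (HJB) equation if
\[
\delta v(x) = \max_{a \in \Gamma(x)}
\left\{ r(x, a) + \sum_{x'} v(x') Q(x, a, x') \right\}
\quad \text{for all } x \in \mathsf{X}. \tag{10.52}
\]
We say that $\mathscr{C}$ obeys Bellman’s principle of optimality if
\[
\sigma \in \Sigma \text{ is optimal for } \mathscr{C}
\iff
\sigma \text{ is } v^*\text{-greedy}.
\]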

Here is our main optimality result for continuous time MDPs.


Theorem 10.2.4. For any continuous time MDP C = ( Γ, 𝛿, 𝑟, 𝑄 ),
(i) the value function 𝑣∗ is the unique solution to the HJB equation in RX ,
(ii) C obeys Bellman’s principle of optimality, and
(iii) C has at least one optimal policy.
In addition, continuous time HPI converges to an optimal policy in finitely many steps.

Proof. Let C = (Γ, 𝛿, 𝑟, 𝑄) be a fixed continuous time MDP with lifetime values {𝑣𝜎}
and value function 𝑣∗. Consider the order stable ADP A ≔ (RX, {𝑇𝜎}) discussed in
§10.2.2.5. The ADP Bellman max-operator is 𝑇 ≔ ⋁𝜎 𝑇𝜎, which can be written more
explicitly as

$$(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \sum_{x'} v(x') \, Q(x, a, x') + (1 - \delta) v(x) \right\}. \tag{10.53}$$


It is clear from (10.50) and Exercise 10.2.2 that, for each 𝑣 ∈ RX , the set of 𝑣-max-
greedy policies is nonempty. Since Σ is finite, it follows from Proposition 9.2.1 that
A is max-stable. Hence, by Theorem 9.2.4, an optimal policy always exists and the
value function 𝑣∗ is the unique fixed point of 𝑇 in RX . The last statement is equivalent
to the assertion that 𝑣∗ is the unique element of RX satisfying

$$v^*(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \sum_{x'} v^*(x') \, Q(x, a, x') + (1 - \delta) v^*(x) \right\}.$$

Rearranging this expression confirms that 𝑣∗ is the unique solution to the HJB equation
in RX .
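
The rearrangement can be spelled out as follows. This is a sketch of the step rather than part of the original argument; it uses only the fact that the term (1 − 𝛿)𝑣∗(𝑥) does not depend on the action 𝑎 and can therefore be moved outside the maximum.

```latex
\begin{aligned}
v^*(x)
  &= \max_{a \in \Gamma(x)} \Big\{ r(x,a) + \sum_{x'} v^*(x') Q(x,a,x') \Big\}
     + (1 - \delta) v^*(x) \\
\iff \quad \delta v^*(x)
  &= \max_{a \in \Gamma(x)} \Big\{ r(x,a) + \sum_{x'} v^*(x') Q(x,a,x') \Big\} .
\end{aligned}
```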
Applying Theorem 9.2.4 again, a policy is optimal for A if and only if 𝑇𝜎 𝑣∗ = 𝑇 𝑣∗ .
Since the definition of optimality for A coincides with the definition of optimality for
C, we see that C obeys Bellman’s principle of optimality.
The continuous time HPI routine described in Algorithm 10.2 is just ADP max-HPI
(see §9.2.1.4) specialized to the current setting. Hence, applying Theorem 9.2.4 once
more, continuous time HPI converges to an optimal policy in finitely many steps. □

10.2.4 Application: Job Search


Here we study a continuous time version of the job search problem with separation
considered in §3.3.2. As before, a worker can be either unemployed (state 0) or
employed (state 1). When the worker is employed, she can be fired at any time.
Firing occurs at rate 𝛼 > 0, meaning that the probability of being fired over the short
interval from 𝑡 to 𝑡 + ℎ is approximately 𝛼ℎ. When unemployed, the worker receives
flow unemployment compensation 𝑐 and job offers at rate 𝜅. She can choose
either to accept or to reject an offer; she discounts the future at rate 𝛿 > 0.
We assume that job offers are associated with wage offers that take values in a finite
set W. Let 𝑃 ∈ M(R^W) give probabilities for new wage draws, so that, conditional on
previous draw 𝑤, the next offer is drawn from 𝑃 ( 𝑤, ·).
For the state space we set X = {0, 1} × W, with typical state 𝑥 = ( 𝑠, 𝑤). Here 𝑠 is
binary and indicates current employment status, while 𝑤 is the current wage. Let

𝜆 ( 𝑥 ) = 𝜆 ( 𝑠, 𝑤) = 1{ 𝑠 = 0}𝜅 + 1{ 𝑠 = 1}𝛼

denote the state-dependent jump rate, which switches between 𝜅 and 𝛼 depending
on employment status.
Let 𝑎 ∈ A ≔ {0, 1} indicate the action, where 0 means reject and 1 means accept.
Let Π ( 𝑥, 𝑎, 𝑥 0) represent the jump probabilities, with

$$\begin{aligned}
\Pi((0, w), a, (0, w')) &= P(w, w')(1 - a) && \text{(unemployed to unemployed)} \\
\Pi((0, w), a, (1, w')) &= P(w, w') \, a && \text{(unemployed to employed)} \\
\Pi((1, w), a, (0, w')) &= P(w, w') && \text{(employed to unemployed)} \\
\Pi((1, w), a, (1, w')) &= 0. && \text{(employed to employed)}
\end{aligned}$$

The first two lines give jump probabilities from the state (𝑠, 𝑤) when the worker is
unemployed and the action is 𝑎. The last two give jump probabilities when employed. The
probability assigned in the last line is zero because a jump from 𝑠 = 1
occurs only when the worker is fired, so the value of 𝑠 after the jump is zero.

EXERCISE 10.2.3. Prove that Π is a stochastic kernel, in the sense that Π ⩾ 0 and
∑𝑥′ Π(𝑥, 𝑎, 𝑥′) = 1 for all possible (𝑥, 𝑎) = ((𝑠, 𝑤), 𝑎) in X × {0, 1}.

Motivated by the jump chain construction of intensity matrices in (10.31) on
page 322, we set

$$Q(x, a, x') = \lambda(x) \left( \Pi(x, a, x') - I(x, x') \right).$$

It follows that, for any 𝜎 ∈ Σ ≔ {0, 1}^X, the matrix

$$Q_\sigma(x, x') := \lambda(x) \left( \Pi(x, \sigma(x), x') - I(x, x') \right)$$

is an intensity matrix for the jump chain under policy 𝜎.


If we define
𝑟 ( 𝑥, 𝑎) = 𝑟 (( 𝑠, 𝑤) , 𝑎) = 𝑐1{ 𝑠 = 0} + 𝑤1{ 𝑠 = 1},
then lifetime value is given by (10.48), where ( 𝑋𝑡 )𝑡⩾0 is 𝑄 𝜎 -Markov and 𝑋0 = 𝑥 .
With Γ defined by Γ ( 𝑥 ) = A for all 𝑥 ∈ X, the tuple C = ( Γ, 𝛿, 𝑟, 𝑄 ) is a continuous
time MDP and Theorem 10.2.4 applies. In particular, an optimal policy exists and can
be computed with HPI in a finite number of iterations.
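
As a rough illustration of this construction, the following Python sketch builds 𝜆, Π and 𝑄 for a small wage grid and computes the row sums needed to check that Π is a stochastic kernel and that each 𝑄(𝑥, 𝑎, ·) sums to zero. The grid size, the uniform wage kernel `P` and the values of `alpha` and `kappa` are hypothetical choices made only for this example; they are not the parameters used in cont_time_js.jl.

```python
def build_job_search(n_wages, P, alpha, kappa):
    """Return the state list and the functions lam, Pi and Q."""
    # A state is a pair (s, i): employment status and wage index.
    states = [(s, i) for s in (0, 1) for i in range(n_wages)]

    def lam(x):
        # Offers arrive at rate kappa when unemployed; firing at rate alpha.
        return kappa if x[0] == 0 else alpha

    def Pi(x, a, y):
        (s, i), (s_next, j) = x, y
        if s == 0:
            # Unemployed: stay unemployed if a = 0, become employed if a = 1.
            return P[i][j] * (a if s_next == 1 else 1 - a)
        # Employed: a jump means being fired, so the next status is 0.
        return P[i][j] if s_next == 0 else 0.0

    def Q(x, a, y):
        # Jump chain construction: Q = lam * (Pi - I).
        return lam(x) * (Pi(x, a, y) - (1.0 if x == y else 0.0))

    return states, lam, Pi, Q


n = 3
P = [[1.0 / n] * n for _ in range(n)]  # hypothetical IID uniform wage draws
states, lam, Pi, Q = build_job_search(n, P, alpha=0.5, kappa=1.0)

row_sums_Pi = [sum(Pi(x, a, y) for y in states)
               for x in states for a in (0, 1)]
row_sums_Q = [sum(Q(x, a, y) for y in states)
              for x in states for a in (0, 1)]
```

With these objects in hand, any policy 𝜎 yields a matrix 𝑄𝜎 by fixing 𝑎 = 𝜎(𝑥) in each row.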
Figure 10.4 shows an optimal policy computed in this way. (Code and parameter
values can be found in cont_time_js.jl.) The policy is of threshold type, with a
reservation wage of around 12. Figure 10.5 shows how this reservation wage changes
with parameters. The reservation wage increases as the separation rate falls, as the
offer rate increases, as the discount rate falls, and as unemployment compensation
increases.

EXERCISE 10.2.4. Provide economic intuition for the monotone relationships be-
tween parameters and the reservation wage discussed in the preceding paragraph.

10.3 Chapter Notes

Applebaum (2019) and Engel and Nagel (2006) provide elegant introductions to
semigroup theory and its applications in studying partial and stochastic differential
equations. The beautiful book by Lasota and Mackey (1994) covers connections
among semigroups, Markov processes, and stochastic differential equations. Norris
(1998) provides a good introduction to continuous time Markov chains, while Liggett
(2010) is more advanced.

[Figure 10.4: Continuous time job search policy. Horizontal axis: wage offer; vertical axis: action (reject/accept).]

[Figure 10.5: Continuous time job search reservation wage. Four panels plot the reservation wage against the separation rate, the offer rate, the discount rate, and unemployment compensation.]

A rigorous treatment of continuous time MDPs can be found in Hernández-Lerma
and Lasserre (2012b), which also handles the case where X is countably infinite. Our
approach is somewhat different, since our main optimality results rest on the ADP
theory in Chapter 9.
In recent years, continuous time dynamic programming has become more common
in macroeconomic analysis. Influential references include Nuño and Moll (2018),
Kaplan et al. (2018), Achdou et al. (2022), and Fernández-Villaverde et al. (2023).
For computational aspects, see Duarte (2018), Ráfales and Vázquez (2021), Rendahl
(2022), and Eslami and Phelan (2023).
Part I

Appendices

Appendix A

Suprema and Infima

This section of the appendix contains an extremely brief review of basic facts concern-
ing sets, functions, suprema and infima. We recommend Bartle and Sherbert (2011)
for those who wish to learn more.

A.1 Sets and Functions

A set is a collection of objects viewed as a whole. Examples include the set of
natural numbers N ≔ {1, 2, . . . } and [𝑛] ≔ {1, 2, . . . , 𝑛} when 𝑛 ∈ N. The set that
contains no elements is called the empty set and denoted by ∅.
Let 𝐴 and 𝐵 be two sets and let 𝐴 × 𝐵 be their Cartesian product, defined as the
set of all ordered pairs ( 𝑎, 𝑏) such that 𝑎 ∈ 𝐴 and 𝑏 ∈ 𝐵. A binary relation ∼ between
two sets 𝐴 and 𝐵 is a subset of 𝐴 × 𝐵. If ( 𝑎, 𝑏) is in this subset we write 𝑎 ∼ 𝑏. An
equivalence relation on 𝐴 is a binary relation ∼ between 𝐴 and itself that is reflexive,
symmetric and transitive. That is,

(a) 𝑎 ∼ 𝑎 for all 𝑎 ∈ 𝐴,
(b) 𝑎 ∼ 𝑎′ implies 𝑎′ ∼ 𝑎, and
(c) 𝑎 ∼ 𝑎′ and 𝑎′ ∼ 𝑎″ implies 𝑎 ∼ 𝑎″.

A function 𝑓 from set 𝐴 to set 𝐵, written 𝐴 ∋ 𝑥 ↦ 𝑓(𝑥) ∈ 𝐵 or 𝑓 : 𝐴 → 𝐵, is a
rule (in fact, a binary relation) associating to each and every element 𝑎 in 𝐴 one and
only one element 𝑏 ∈ 𝐵. The point 𝑏 is also written as 𝑓(𝑎), and called the image of
𝑎 under 𝑓. For 𝐶 ⊂ 𝐴, the set 𝑓(𝐶) is the set of all images of points in 𝐶, and is called
the image of 𝐶 under 𝑓. Also, for 𝐷 ⊂ 𝐵, the set 𝑓^{−1}(𝐷) is all points in 𝐴 that map into
𝐷 under 𝑓, and is called the preimage of 𝐷 under 𝑓.
A function 𝑓 : 𝐴 → 𝐵 is called one-to-one if distinct elements of 𝐴 are always
mapped into distinct elements of 𝐵, and onto if every element of 𝐵 is the image under
𝑓 of at least one point in 𝐴. A bijection or one-to-one correspondence from 𝐴 to 𝐵
is a function 𝑓 from 𝐴 to 𝐵 that is both one-to-one and onto.
A set X is called finite if there exists a bijection from X to [ 𝑛] ≔ {1, . . . , 𝑛} for some
𝑛 ∈ N. In this case we can write X = { 𝑥1 , . . . , 𝑥 𝑛 }. The number 𝑛 is called the car-
dinality of X. Note that, according to our definition, every finite set is automatically
nonempty.
If 𝑓 : 𝐴 → 𝐵 and 𝑔 : 𝐵 → 𝐶 , then the composition of 𝑓 and 𝑔 is the function 𝑔 ◦ 𝑓
from 𝐴 to 𝐶 defined at 𝑎 ∈ 𝐴 by ( 𝑔 ◦ 𝑓 )( 𝑎) ≔ 𝑔 ( 𝑓 ( 𝑎)).

A.2 Some Properties of the Real Line


Given a subset 𝐴 of R, we call 𝑢 ∈ R an upper bound of 𝐴 if 𝑎 ⩽ 𝑢 for all 𝑎 in 𝐴. A
lower bound of 𝐴 is any number ℓ such that ℓ ⩽ 𝑎 for all 𝑎 ∈ 𝐴. If 𝐴 has both an upper
and lower bound then 𝐴 is called bounded. Equivalently, 𝐴 is bounded whenever
there exists an 𝑛 ∈ N with 𝐴 ⊂ [−𝑛, 𝑛].
Let 𝑈(𝐴) be the set of all upper bounds of 𝐴. An element 𝑢̄ of R is called a supremum
or least upper bound of 𝐴 if
(i) 𝑢̄ ∈ 𝑈(𝐴) and
(ii) 𝑢̄ ⩽ 𝑢 for every 𝑢 ∈ 𝑈(𝐴).
When a supremum of 𝐴 exists in R, we write it as sup 𝐴.
Example A.2.1. For the set 𝐼 ≔ [0, 1] ⊂ R, the number 1 is an upper bound of 𝐼 .
Moreover, if 𝑢 is an upper bound of 𝐼 , then 𝑢 ⩾ 1. Hence 1 is the supremum of 𝐼 .
Example A.2.2. N has no supremum in R, since the set of upper bounds is empty.

EXERCISE A.2.1. Show that, for all of the sets (0, 1), [0, 1) and (0, 1], the number
1 is the supremum of the set.

EXERCISE A.2.2. Fix 𝐴 ⊂ R. Prove that, for 𝑠 ∈ 𝑈 ( 𝐴), we have 𝑠 = sup 𝐴 if and only
if, for all 𝜀 > 0, there exists a point 𝑎 ∈ 𝐴 with 𝑎 > 𝑠 − 𝜀.

EXERCISE A.2.3. Fix 𝐴 ⊂ R. Prove that 𝐴 has at most one supremum.

One of the most important properties of R is stated below.


Theorem A.2.1 (Least upper bound property). Every nonempty subset of R with an
upper bound in R has a supremum in R.

Theorem A.2.1 is often taken as axiomatic in formal constructions of the real num-
bers. (Alternatively, one may assume completeness of the reals and then prove The-
orem A.2.1 using this property. See, e.g., Bartle and Sherbert (2011).)
If 𝑖 ∈ R is a lower bound for 𝐴 and also satisfies 𝑖 ⩾ ℓ for every lower bound ℓ of
𝐴, then 𝑖 is called the infimum of 𝐴 and we write 𝑖 = inf 𝐴. At most one such 𝑖 exists,
and every nonempty subset of R bounded from below has an infimum.
A real sequence is a map 𝑥 from N to R, with the value of the function at 𝑘 ∈ N
typically denoted by 𝑥𝑘 rather than 𝑥(𝑘). A real sequence 𝑥 = (𝑥𝑘)𝑘⩾1 ≔ (𝑥𝑘)𝑘∈N is said
to converge to 𝑥̄ ∈ R if, for each 𝜀 > 0, there exists an 𝑁 ∈ N such that |𝑥𝑘 − 𝑥̄| < 𝜀
for all 𝑘 ⩾ 𝑁. In this case we write lim𝑘 𝑥𝑘 = 𝑥̄ or 𝑥𝑘 → 𝑥̄. Bartle and Sherbert (2011)
give an excellent introduction to real sequences and their basic properties.
A real sequence ( 𝑥 𝑘 ) 𝑘⩾1 is called increasing if 𝑥 𝑘 ⩽ 𝑥 𝑘+1 for all 𝑘 and decreasing if
𝑥 𝑘+1 ⩽ 𝑥 𝑘 for all 𝑘. If ( 𝑥 𝑘 ) 𝑘⩾1 is increasing (resp., decreasing) and 𝑥 𝑘 → 𝑥 ∈ R then we
also write 𝑥 𝑘 ↑ 𝑥 (resp., 𝑥 𝑘 ↓ 𝑥 ).

EXERCISE A.2.4. Let ( 𝑥 𝑘 ) be a bounded monotone increasing sequence in R. Prove


that sup𝑘 𝑥 𝑘 = lim𝑘 𝑥 𝑘 .

Let (𝑥𝑘) be a real sequence in R and set 𝑠𝑛 ≔ ∑_{𝑘=1}^{𝑛} 𝑥𝑘. If the sequence (𝑠𝑛) converges
to some 𝑠 ∈ R, then we set

$$\sum_{k=1}^{\infty} x_k := \sum_{k \geq 1} x_k := s = \lim_{n \to \infty} s_n .$$

We say that the series ∑_{𝑘=1}^{𝑛} 𝑥𝑘 converges to ∑_{𝑘=1}^{∞} 𝑥𝑘.
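
The definition can be illustrated numerically. The sketch below computes the partial sums 𝑠𝑛 for the geometric sequence 𝑥𝑘 = (1/2)^{𝑘−1}, an example of our own choosing whose series converges to 2; the function name `partial_sums` is likewise just an illustrative helper.

```python
def partial_sums(x, n):
    """Return [s_1, ..., s_n], where s_m = x(1) + ... + x(m)."""
    sums, total = [], 0.0
    for k in range(1, n + 1):
        total += x(k)
        sums.append(total)
    return sums


def geometric(k):
    # Hypothetical example sequence x_k = (1/2)^(k-1).
    return 0.5 ** (k - 1)


s = partial_sums(geometric, 60)
gap = abs(s[-1] - 2.0)  # distance between s_60 and the limit 2
```

Because every term is nonnegative, the partial sums are increasing, so this is also an instance of the monotone convergence described in Exercise A.2.4.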

A.3 Max and Min


A number 𝑚 contained in a subset 𝐴 of R is called the maximum of 𝐴 and we write
𝑚 = max 𝐴 if 𝑎 ⩽ 𝑚 for every 𝑎 ∈ 𝐴. It is called the minimum of 𝐴 and we write
𝑚 = min 𝐴 if 𝑎 ⩾ 𝑚 for every 𝑎 ∈ 𝐴.

EXERCISE A.3.1. Prove: If 𝐴 is a finite subset of R, then sup 𝐴 = max 𝐴.

A subset 𝐴 of R is called closed if, for any sequence ( 𝑥𝑛 ) contained in 𝐴 and con-
verging to some limit 𝑥 ∈ R, the limit 𝑥 is in 𝐴.

EXERCISE A.3.2. Show that, if 𝐴 is a closed and bounded subset of R, then 𝐴 has
both a maximum and a minimum.

EXERCISE A.3.3. Prove the following statements:

(i) If 𝐴 ⊂ 𝐵, then sup 𝐴 ⩽ sup 𝐵.


(ii) If 𝑠 = sup 𝐴 and 𝑠 ∈ 𝐴, then 𝑠 = max 𝐴.
(iii) If 𝑖 = inf 𝐴 and 𝑖 ∈ 𝐴, then 𝑖 = min 𝐴.

Given an arbitrary set 𝐷 and a function 𝑓 : 𝐷 → R, define

$$\sup_{x \in D} f(x) := \sup \{ f(x) : x \in D \} \quad \text{and} \quad \max_{x \in D} f(x) := \max \{ f(x) : x \in D \}$$

whenever the latter exists. The terms inf_{𝑥∈𝐷} 𝑓(𝑥) and min_{𝑥∈𝐷} 𝑓(𝑥) are defined analogously.
A point 𝑥∗ ∈ 𝐷 is called a

• maximizer of 𝑓 on 𝐷 if 𝑓(𝑥∗) ⩾ 𝑓(𝑥) for all 𝑥 ∈ 𝐷, and a
• minimizer of 𝑓 on 𝐷 if 𝑓(𝑥∗) ⩽ 𝑓(𝑥) for all 𝑥 ∈ 𝐷.

Equivalently, 𝑥∗ ∈ 𝐷 is a maximizer of 𝑓 on 𝐷 if 𝑓(𝑥∗) = max_{𝑥∈𝐷} 𝑓(𝑥), and a minimizer
if 𝑓(𝑥∗) = min_{𝑥∈𝐷} 𝑓(𝑥). We define

$$\operatorname*{argmax}_{x \in D} f(x) := \{ x^* \in D : f(x^*) \geq f(x) \text{ for all } x \in D \} .$$

The set argmin_{𝑥∈𝐷} 𝑓(𝑥) is defined analogously.
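
Since argmax is a set, it can contain more than one point. A minimal Python sketch for a finite, nonempty 𝐷 is given below; the helper name `argmax_set` and the example function are assumptions made for illustration.

```python
def argmax_set(f, D):
    """Return the set of all maximizers of f on the finite nonempty set D."""
    values = {x: f(x) for x in D}
    m = max(values.values())  # the maximum of {f(x) : x in D}
    return {x for x, fx in values.items() if fx == m}


D = [-2, -1, 0, 1, 2]


def f(x):
    # Hypothetical objective with two maximizers on D.
    return x * x


maximizers = argmax_set(f, D)
```

Here both endpoints of the grid attain the maximum value 4, so the returned set has two elements.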


Appendix B

Remaining Proofs

B.1 Chapter 2 Results

Proof of Lemma 2.2.5. Regarding (i), fix 𝜑, 𝜓 ∈ D(X) with 𝜑 ⪯F 𝜓. Pick any 𝑦 ∈ X.
By transitivity of partial orders, the function 𝑢(𝑥) ≔ 1{𝑦 ⪯ 𝑥} is in 𝑖RX. Hence
∑𝑥 𝑢(𝑥)𝜑(𝑥) ⩽ ∑𝑥 𝑢(𝑥)𝜓(𝑥). Given the definition of 𝑢, this is equivalent to 𝐺^𝜑(𝑦) ⩽
𝐺^𝜓(𝑦). As 𝑦 was chosen arbitrarily, we have 𝐺^𝜑 ⩽ 𝐺^𝜓 pointwise on X.
Regarding (ii), let 𝜑, 𝜓 ∈ D(X) be such that 𝐺^𝜑 ⩽ 𝐺^𝜓 and let X be totally ordered
by ⪯. We can write X as {𝑥1, . . . , 𝑥𝑛} with 𝑥𝑖 ⪯ 𝑥𝑖+1 for all 𝑖. Pick any 𝑢 ∈ 𝑖RX and let
𝛼𝑖 = 𝑢(𝑥𝑖). By Exercise 2.2.32, we can write 𝑢 as 𝑢(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑠𝑖 1{𝑥 ⪰ 𝑥𝑖} at each 𝑥 ∈ X,
where 𝑠𝑖 ⩾ 0 for all 𝑖. Hence

$$\sum_{x \in \mathsf{X}} u(x) \varphi(x) = \sum_{x \in \mathsf{X}} \sum_{i=1}^{n} s_i \mathbb{1}\{x \succeq x_i\} \varphi(x) = \sum_{i=1}^{n} s_i \sum_{x \in \mathsf{X}} \mathbb{1}\{x \succeq x_i\} \varphi(x) = \sum_{i=1}^{n} s_i G^{\varphi}(x_i) .$$

A similar argument gives ∑_{𝑥∈X} 𝑢(𝑥)𝜓(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑠𝑖 𝐺^𝜓(𝑥𝑖). Since 𝐺^𝜑 ⩽ 𝐺^𝜓, we have

$$\sum_{x \in \mathsf{X}} u(x) \varphi(x) = \sum_{i=1}^{n} s_i G^{\varphi}(x_i) \leq \sum_{i=1}^{n} s_i G^{\psi}(x_i) = \sum_{x \in \mathsf{X}} u(x) \psi(x) .$$

We conclude that 𝜑 ⪯F 𝜓, as was to be shown. □


B.2 Chapter 6 Results

We adopt the setting of §6.1.1.2 and consider the claim

$$\mathbb{E}_x \sum_{t=0}^{\infty} \left[ \prod_{i=0}^{t} \beta_i \right] h(X_t) = \sum_{t=0}^{\infty} \mathbb{E}_x \left[ \prod_{i=0}^{t} \beta_i \right] h(X_t) \tag{B.1}$$

when (𝑋𝑡) is 𝑃-Markov with initial condition 𝑥 and ℎ ∈ RX. Throughout this discussion
the assumption 𝜌(𝐿) < 1 is in force (see Theorem 6.1.1). Unlike the rest of the book,
we assume some familiarity with measure theory, at the level of, say, Dudley (2002),
Chapters 3 and 4.
To begin the discussion we set

$$F_T := \sum_{t=0}^{T} \delta_t h(X_t) \quad \text{and} \quad F := \sum_{t=0}^{\infty} \delta_t h(X_t) \quad \text{where} \quad \delta_t := \prod_{i=0}^{t} \beta_i .$$

Our first aim is to show that 𝐹 is a well-defined random variable, in the sense that the
sum converges almost surely. Since absolute convergence of real series implies convergence,
and since finite expectation implies finiteness almost everywhere, it suffices
to show that

$$\mathbb{E}_x \sum_{t=0}^{\infty} \delta_t |h(X_t)| < \infty. \tag{B.2}$$

By the monotone convergence theorem (see, e.g., Dudley (2002), Theorem 4.3.2), we
have

$$\mathbb{E}_x \sum_{t=0}^{\infty} \delta_t |h(X_t)| = \sum_{t=0}^{\infty} \mathbb{E}_x \delta_t |h(X_t)| = \sum_{t=0}^{\infty} (L^t |h|)(x) ,$$

where the last equality is by (6.6). Since 𝜌(𝐿) < 1, the series ∑𝑡 𝐿^𝑡 converges, so the
right-hand side is finite. Hence (B.2) holds,
which in turn confirms that 𝐹 is well-defined and finite almost surely.
Now observe that, on the probability one set where 𝐹 is finite, we have 𝐹𝑇 → 𝐹 as
𝑇 → ∞. Moreover,

$$|F_T| \leq \sum_{t=0}^{T} \delta_t |h(X_t)| \leq Y := \sum_{t=0}^{\infty} \delta_t |h(X_t)| ,$$

and, as shown above, E𝑥 𝑌 < ∞. By the dominated convergence theorem, we now
have E𝑥 𝐹 = lim_{𝑇→∞} E𝑥 𝐹𝑇, or, equivalently,

$$\mathbb{E}_x \sum_{t=0}^{\infty} \delta_t h(X_t) = \lim_{T \to \infty} \mathbb{E}_x \sum_{t=0}^{T} \delta_t h(X_t) = \lim_{T \to \infty} \sum_{t=0}^{T} \mathbb{E}_x \delta_t h(X_t) = \sum_{t=0}^{\infty} \mathbb{E}_x \delta_t h(X_t) .$$

Hence (B.1) holds.

B.3 Chapter 7 Results


Proof of uniqueness for Theorem 7.1.3. We focus on the concave case. Let 𝐼 be as in
Theorem 7.1.3 and suppose that 𝑇 is an order-preserving concave self map on 𝐼 with
𝑇𝜑 ≫ 𝜑. By Theorem 7.1.1, 𝑇 has least and greatest fixed points in 𝐼. We denote
them by 𝑎 and 𝑏, respectively. Let

$$\lambda = \min_{x \in \mathsf{X}} \frac{a(x) - \varphi(x)}{b(x) - \varphi(x)}$$

and let 𝑥̄ be a minimizer. It follows immediately from its definition that 𝜆 obeys
0 ⩽ 𝜆 ⩽ 1 and

$$a(x) \geq \lambda b(x) + (1 - \lambda) \varphi(x) \text{ for all } x \in \mathsf{X}, \text{ with equality at } \bar{x} .$$

As a result, applying the assumed properties of 𝑇, we have

$$a = Ta \geq T(\lambda b + (1 - \lambda) \varphi) \geq \lambda b + (1 - \lambda) T\varphi .$$

Suppose now that 𝜆 < 1. Since 𝑇𝜑 ≫ 𝜑, we get 𝑎 ≫ 𝜆𝑏 + (1 − 𝜆)𝜑 and evaluating this
at 𝑥̄ yields

$$a(\bar{x}) > \lambda b(\bar{x}) + (1 - \lambda) \varphi(\bar{x}) = a(\bar{x}) ,$$

which is a contradiction. Hence 𝜆 = 1 and, therefore, 𝑎 ⩾ 𝑏. Since all fixed points 𝑢̄ of
𝑇 in 𝐼 obey 𝑎 ⩽ 𝑢̄ ⩽ 𝑏, we see that 𝑎 = 𝑏 is the unique fixed point of 𝑇 in 𝐼. □

B.4 Chapter 9 Results


Let’s now turn to the proof of the core optimality results for ADPs. In what follows,
A = (𝑉, {𝑇𝜎 }) is a well-posed ADP with Bellman operator 𝑇 and 𝜎-value functions
{ 𝑣𝜎 }𝜎∈Σ . We start with
Lemma B.4.1. If A is order stable, then the following statements hold:
(i) 𝑣 ∈ 𝑉𝑢 ⟹ 𝑣 ⪯ 𝐻𝑣.
(ii) If 𝜎 ∈ Σ and 𝑇𝑣𝜎 = 𝑣𝜎, then 𝑣𝜎 = 𝑣∗.
(iii) If 𝑣 ∈ 𝑉 and 𝐻𝑣 = 𝑣, then 𝑣 = 𝑣∗ and 𝑇𝑣∗ = 𝑣∗.
(iv) If A is finite, then 𝑣∗ exists in 𝑉 and 𝐻𝑣∗ = 𝑣∗. Moreover, for all 𝑣 ∈ 𝑉, the HPI
sequence (𝑣𝑘) defined by 𝑣𝑘 = 𝐻^𝑘 𝑣 converges to 𝑣∗ in finitely many steps.
(v) Fix 𝑣 ∈ 𝑉 and let (𝑣𝑘) be the HPI sequence defined by 𝑣𝑘 = 𝐻^𝑘 𝑣 for 𝑘 ∈ N. If
𝑣𝑘+1 = 𝑣𝑘 for some 𝑘 ∈ N, then 𝑣𝑘 = 𝑣∗ and every 𝑣𝑘-greedy policy is optimal.

Proof. Regarding (i), fix 𝑣 ∈ 𝑉𝑢 and let 𝜏 be 𝑣-greedy, with 𝐻𝑣 = 𝑣𝜏. Since 𝑣 ∈ 𝑉𝑢, we
have 𝑣 ⪯ 𝑇𝑣 = 𝑇𝜏 𝑣. This inequality and upward stability of 𝑇𝜏 yield 𝑣 ⪯ 𝑣𝜏. But then
𝑣 ⪯ 𝐻𝑣, as claimed.
Regarding (ii), suppose 𝜎 ∈ Σ and 𝑇𝑣𝜎 = 𝑣𝜎. Fix 𝜏 ∈ Σ and note that 𝑣𝜎 = 𝑇𝑣𝜎 ⪰
𝑇𝜏 𝑣𝜎. Downward stability of 𝑇𝜏 implies 𝑣𝜎 ⪰ 𝑣𝜏. Since 𝜏 ∈ Σ was arbitrary, 𝑣𝜎 = 𝑣∗.
Regarding (iii), fix 𝑣 ∈ 𝑉 with 𝐻𝑣 = 𝑣 and let 𝜎 be such that 𝐻𝑣 = 𝑣𝜎. Then 𝑣𝜎 = 𝑣,
and, since 𝜎 is 𝑣-greedy, 𝑇𝜎 𝑣 = 𝑇𝑣. But then 𝑇𝜎 𝑣𝜎 = 𝑇𝑣𝜎, and, since 𝑣𝜎 = 𝑇𝜎 𝑣𝜎, we have
𝑣𝜎 = 𝑇𝑣𝜎. Part (ii) now implies 𝑣 = 𝑣𝜎 = 𝑣∗. This proves the first claim. Regarding the
second, substituting 𝑣𝜎 = 𝑣∗ into 𝑣𝜎 = 𝑇𝑣𝜎 yields 𝑣∗ = 𝑇𝑣∗.
For (iv), it suffices to show that 𝐻𝑣∗ = 𝑣∗ and there exists a 𝐾 ∈ N such that
𝐻^𝐾 𝑣 = 𝑣∗. To this end, let 𝑣𝑘 = 𝐻^𝑘 𝑣 and note that 𝑣𝑘 ∈ 𝑉Σ for all 𝑘 ⩾ 1. Part (i) implies
that 𝑣𝑘+1 ⪰ 𝑣𝑘 for all 𝑘 ∈ N. Since the sequence (𝑣𝑘) is contained in the finite set
𝑉Σ, it must be that 𝑣𝐾+1 = 𝑣𝐾 for some 𝐾 ∈ N (since otherwise 𝑉Σ contains an infinite
sequence of distinct points). But then 𝐻𝑣𝐾 = 𝑣𝐾+1 = 𝑣𝐾, so 𝑣𝐾 is a fixed point of 𝐻.
Part (iii) now implies that 𝑣𝐾 = 𝑣∗.
For (v), let (𝑣𝑘) be as stated and suppose that 𝑣𝑘+1 = 𝑣𝑘 for some 𝑘 ∈ N. Then
𝑣𝑘 is a fixed point of 𝐻, so, by (iii) above, we have 𝑣𝑘 = 𝑣∗. By Bellman’s principle of
optimality, every 𝑣𝑘-greedy policy is optimal. □

Proof of Proposition 9.2.1. If A is finite, then, by (iii)–(iv) of Lemma B.4.1, the point
𝑣∗ exists in 𝑉 and is a fixed point of 𝑇 . □

We first prove Proposition 9.2.5 and then return to Theorem 9.2.4.

Proof of Proposition 9.2.5. Let A be max-stable. We need to establish the following
claims:

(a) 𝑉Σ has a greatest element 𝑣∗,
(b) 𝑣∗ is the unique fixed point of 𝑇 in 𝑉,
(c) a policy is optimal if and only if it is 𝑣∗-greedy, and
(d) at least one optimal policy exists.


For claims (a)–(b), we observe that, by max-stability, 𝑇 has a fixed point 𝑣̄ in 𝑉. By
existence of greedy policies, we can find a 𝜎 ∈ Σ such that 𝑣̄ = 𝑇𝑣̄ = 𝑇𝜎 𝑣̄. But 𝑇𝜎
has a unique fixed point in 𝑉, equal to 𝑣𝜎, so 𝑣̄ = 𝑣𝜎. Moreover, if 𝜏 is any policy,
then 𝑇𝜏 𝑣̄ ⪯ 𝑇𝑣̄ = 𝑣̄ and hence, by downward stability, 𝑣𝜏 ⪯ 𝑣̄. These facts imply that
𝑣∗ ≔ 𝑣̄ is the greatest element of 𝑉Σ and a fixed point of 𝑇. Since greatest elements
are unique, 𝑣∗ is the only fixed point of 𝑇 in 𝑉.
Regarding (c), parts (a)–(b) give 𝑣∗ ∈ 𝑉 and 𝑇𝑣∗ = 𝑣∗. Now recall that 𝜎 is optimal
if and only if 𝑣𝜎 = 𝑣∗. Since 𝑣𝜎 is the unique fixed point of 𝑇𝜎, this is equivalent to
𝑇𝜎 𝑣∗ = 𝑣∗. Since 𝑇𝑣∗ = 𝑣∗, the last statement is equivalent to 𝑇𝜎 𝑣∗ = 𝑇𝑣∗, which is, in
turn, equivalent to the statement that 𝜎 is 𝑣∗-greedy.
Part (d) follows directly from (a). □

Proof of Theorem 9.2.4. Parts (i)–(iv) of Theorem 9.2.4 follow from Proposition 9.2.5,
which provides optimality results for max-stable ADPs, and Proposition 9.2.1, which
tells us that every finite order stable ADP is max-stable. Regarding the final claim in
Theorem 9.2.4, on convergence of HPI, suppose that A is finite and order stable. If
HPI terminates, then (v) of Lemma B.4.1 implies that it returns an optimal policy.
Part (iv) of the same lemma implies that HPI terminates in finitely many steps. □
Appendix C

Solutions to Selected Exercises

Solution to Exercise 1.1.1. Here is one possible answer: On one hand, providing
additional unemployment compensation is costly for taxpayers and tends to increase
the unemployment rate. On the other hand, unemployment compensation encour-
ages the worker to reject low initial offers, leading to a better lifetime wage. This
can enhance worker welfare and expand the tax base. A larger model is needed to
disentangle these effects.

Solution to Exercise 1.2.1. Fix 𝛼, 𝑠 and 𝑡 with 𝑠 ⩾ 0. Suppose first that 𝛼 ⩾ 𝑠 + 𝑡 .


Then 𝛼 ∨ ( 𝑠 + 𝑡 ) = 𝛼 ⩽ 𝛼 ∨ 𝑡 ⩽ 𝑠 + 𝛼 ∨ 𝑡 , as claimed. Suppose next that 𝛼 ⩽ 𝑠 + 𝑡 . Then
𝛼 ∨ ( 𝑠 + 𝑡 ) = 𝑠 + 𝑡 ⩽ 𝑠 + 𝛼 ∨ 𝑡 , as required.

Solution to Exercise 1.2.5. For 𝛼 > 0 we always have ‖𝛼𝑢‖0 = ‖𝑢‖0, which violates
absolute homogeneity.

Solution to Exercise 1.2.15. Let 𝑇 and 𝑈 be as stated in the exercise. Regarding
uniqueness, suppose that 𝑇 has two distinct fixed points 𝑢 and 𝑦 in 𝑈. Since 𝑇^𝑚 𝑢 = 𝑢̄
and 𝑇^𝑚 𝑦 = 𝑢̄, we have 𝑇^𝑚 𝑢 = 𝑇^𝑚 𝑦. But 𝑢 and 𝑦 are distinct fixed points, so 𝑢 = 𝑇^𝑚 𝑢
must be distinct from 𝑦 = 𝑇^𝑚 𝑦. Contradiction.
Regarding the claim that 𝑢̄ is a fixed point, we recall that 𝑇^𝑘 𝑢 = 𝑢̄ for 𝑘 ⩾ 𝑚. Hence
𝑇^𝑚 𝑢̄ = 𝑢̄ and 𝑇^{𝑚+1} 𝑢̄ = 𝑢̄. But then

$$T \bar{u} = T T^m \bar{u} = T^{m+1} \bar{u} = \bar{u} ,$$

so 𝑢̄ is a fixed point of 𝑇.


Solution to Exercise 1.2.16. Assume the hypotheses of the exercise and let 𝑢𝑚 ≔
𝑇^𝑚 𝑢 for all 𝑚 ∈ N. By continuity and 𝑢𝑚 → 𝑢∗ we have 𝑇𝑢𝑚 → 𝑇𝑢∗. But the sequence
(𝑇𝑢𝑚) is just (𝑢𝑚) with the first element omitted, so, given that 𝑢𝑚 → 𝑢∗, we must have
𝑇𝑢𝑚 → 𝑢∗. Since limits are unique, it follows that 𝑢∗ = 𝑇𝑢∗.

Solution to Exercise 1.2.18. Let the stated hypotheses hold and fix 𝑢 ∈ 𝐶. By
global stability we have 𝑇^𝑘 𝑢 → 𝑢∗. Since 𝑇 is invariant on 𝐶 we have (𝑇^𝑘 𝑢)𝑘∈N ⊂ 𝐶.
Since 𝐶 is closed, this implies that the limit is in 𝐶. In other words, 𝑢∗ ∈ 𝐶, as claimed.

Solution to Exercise 1.2.20. By the definition of the operator norm we have
‖𝐴𝑢‖ ⩽ ‖𝐴‖‖𝑢‖ for all 𝑢 ∈ R^𝑛. If ‖𝐴‖ < 1, then 𝑇 is a contraction of modulus ‖𝐴‖,
since, for any 𝑥, 𝑦 ∈ 𝑈,

$$\| Ax + b - Ay - b \| = \| A(x - y) \| \leq \| A \| \| x - y \| .$$

Solution to Exercise 1.2.22. From the bound in Exercise 1.2.21, we obtain

$$\| u_m - u_k \| \leq \frac{\lambda^m - \lambda^k}{1 - \lambda} \| u_0 - u_1 \| \qquad (m, k \in \mathbb{N} \text{ with } m < k) .$$

Hence (𝑢𝑚) is Cauchy, as claimed.

Solution to Exercise 1.2.24. Fix 𝛼 ∈ (0, 1] and let 𝐹 be defined by 𝐹𝑢 = (1 − 𝛼)𝑢 +
𝛼𝑇𝑢. Readers will be able to verify that 𝐹 is also a contraction with identical fixed
point 𝑢̄, and that damped iteration is just iteration with 𝐹. The claim follows.
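
A small numerical sketch of damped iteration follows. It assumes a scalar contraction 𝑇𝑢 = 0.5𝑢 + 1 with fixed point 2, chosen only for illustration; the function name `damped_iteration` and its tolerance parameters are likewise hypothetical.

```python
def damped_iteration(T, u0, alpha, tol=1e-12, max_iter=10_000):
    """Iterate u <- (1 - alpha) * u + alpha * T(u) until successive
    iterates differ by less than tol."""
    u = u0
    for _ in range(max_iter):
        u_new = (1 - alpha) * u + alpha * T(u)
        if abs(u_new - u) < tol:
            return u_new
        u = u_new
    raise RuntimeError("damped iteration did not converge")


def T(u):
    # Hypothetical contraction of modulus 0.5 with fixed point 2.
    return 0.5 * u + 1.0


u_damped = damped_iteration(T, u0=10.0, alpha=0.3)
u_plain = damped_iteration(T, u0=10.0, alpha=1.0)  # ordinary iteration
```

In this example the damped map is 𝐹𝑢 = 0.85𝑢 + 0.3, which has the same fixed point as 𝑇 but a larger modulus, so damping slows convergence here.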

Solution to Exercise 1.2.25. By the definition of the derivative, for any 𝑥 ∈ 𝑈 ≔
(0, ∞), we have

$$\lim_{y \to x} \left| \frac{g(y) - g(x)}{y - x} - g'(x) \right| = 0.$$

Hence, by the reverse triangle inequality, for fixed 𝜀 > 0, we can take a 𝛿 > 0 such that

$$\left| \frac{g(y) - g(x)}{y - x} \right| > |g'(x)| - \varepsilon = g'(x) - \varepsilon$$

for all 𝑦 ∈ (𝑥 − 𝛿, 𝑥 + 𝛿). Rearranging gives

$$|g(x) - g(y)| > [g'(x) - \varepsilon] \, |x - y|$$

for all 𝑦 ∈ (𝑥 − 𝛿, 𝑥 + 𝛿). But 𝑔′(𝑥) = 𝑠𝛼𝑥^{𝛼−1} + 1 − 𝛿, which converges to +∞ as 𝑥 → 0. It
follows that, for any 𝜆 ∈ [0, 1), we can find a pair 𝑥, 𝑦 such that |𝑔(𝑥) − 𝑔(𝑦)| > 𝜆|𝑥 − 𝑦|.
Hence 𝑔 is not a contraction map under | · |.

Solution to Exercise 1.2.27. Let 𝑚 = max_{𝑥∈X} ℎ(𝑥) and 𝑀 = argmax_{𝑥∈X} ℎ(𝑥).
Suppose first that 𝜑∗ is supported on 𝑀 and let 𝜑 be any distribution on X. Then
⟨ℎ, 𝜑∗⟩ = 𝑚 ⩾ ⟨ℎ, 𝜑⟩. Conversely, if 𝜑∗ is not supported on 𝑀, then ⟨ℎ, 𝜑∗⟩ < 𝑚. In
other words, 𝜑∗ ∈ argmax_{𝜑∈D(X)} ⟨ℎ, 𝜑⟩ if and only if 𝜑∗ is supported on 𝑀.

Solution to Exercise 1.2.28. Fix 𝜏 ∈ [0, 1], 𝑋 ∼ 𝜑 ∈ D(X) and 𝛼 ∈ R. Let Φ𝑋 be
the CDF of 𝑋. Let 𝑌 ≔ 𝑋 + 𝛼, let Y ≔ {𝑥 + 𝛼 : 𝑥 ∈ X} and let Φ𝑌 be the CDF of 𝑌. Note
that Φ𝑌(𝑦) = P{𝑌 ⩽ 𝑦} = P{𝑋 ⩽ 𝑦 − 𝛼} = Φ𝑋(𝑦 − 𝛼) for all 𝑦 ∈ Y.
Let 𝑥∗ ≔ 𝑄𝜏 𝑋 and let 𝑦∗ = 𝑄𝜏(𝑋 + 𝛼) = min{𝑦 ∈ Y : Φ𝑌(𝑦) ⩾ 𝜏}. We need to show
that 𝑦∗ = 𝑥∗ + 𝛼. We do this by proving 𝑦∗ ⩾ 𝑥∗ + 𝛼 and 𝑦∗ ⩽ 𝑥∗ + 𝛼.
For the first inequality, fix 𝑦 ∈ Y such that Φ𝑌 ( 𝑦 ) ⩾ 𝜏. Let 𝑥 = 𝑦 − 𝛼. We then
have Φ𝑌 ( 𝑥 + 𝛼) ⩾ 𝜏 and hence Φ𝑋 ( 𝑥 ) ⩾ 𝜏. Hence 𝑥 ⩾ 𝑥 ∗ , or 𝑦 ⩾ 𝑥 ∗ + 𝛼. Since this last
inequality holds for any 𝑦 ∈ Y with Φ𝑌 ( 𝑦 ) ⩾ 𝜏, we have 𝑦 ∗ ⩾ 𝑥 ∗ + 𝛼.
For the reverse inequality, fix 𝑥 ∈ X with Φ𝑋 ( 𝑥 ) ⩾ 𝜏 and set 𝑦 = 𝑥 + 𝛼. We have
Φ𝑌 ( 𝑦 ) = Φ 𝑋 ( 𝑦 − 𝛼) = Φ 𝑋 ( 𝑥 ) ⩾ 𝜏, so 𝑦 ⩾ 𝑦 ∗ , or 𝑥 ⩾ 𝑦 ∗ − 𝛼. Since the last inequality
holds for all 𝑥 ∈ X with Φ𝑋 ( 𝑥 ) ⩾ 𝜏, we have 𝑥 ∗ ⩾ 𝑦 ∗ − 𝛼. Rearranging gives 𝑦 ∗ ⩽ 𝑥 ∗ + 𝛼,
as was to be shown.

Solution to Exercise 1.3.1. Fix 𝛼, 𝑥, 𝑦 ∈ R. We have 𝑥 = 𝑥 − 𝑦 + 𝑦 ⩽ | 𝑥 − 𝑦 | + 𝑦 .


Applying the result in Exercise 1.2.1 yields

𝛼 ∨ 𝑥 ⩽ |𝑥 − 𝑦| + 𝛼 ∨ 𝑦 ⇐⇒ 𝛼 ∨ 𝑥 − 𝛼 ∨ 𝑦 ⩽ |𝑥 − 𝑦 |.

Reversing the roles of 𝑥 and 𝑦 completes the proof.
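
The resulting bound |𝛼 ∨ 𝑥 − 𝛼 ∨ 𝑦| ⩽ |𝑥 − 𝑦| can be sanity-checked numerically. The sketch below draws random triples and counts violations; the sampling range, seed and floating point slack are arbitrary choices for illustration, and such a check is of course no substitute for the proof.

```python
import random

random.seed(0)  # fixed seed so the experiment is reproducible


def join(a, b):
    """The binary operation a ∨ b on the real line."""
    return max(a, b)


violations = 0
for _ in range(10_000):
    alpha, x, y = (random.uniform(-5.0, 5.0) for _ in range(3))
    lhs = abs(join(alpha, x) - join(alpha, y))
    # Small slack guards against floating point rounding.
    if lhs > abs(x - y) + 1e-12:
        violations += 1
```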

Solution to Exercise 2.1.1. Let (𝑈, 𝑇) and (𝑈̂, 𝑇̂) be conjugate under Φ, with
𝑇̂ ∘ Φ = Φ ∘ 𝑇. The stated equivalence holds because

$$Tu = u \iff \Phi T u = \Phi u \iff \hat{T} \Phi u = \Phi u .$$

Solution to Exercise 2.1.3. To show that 𝑇 = Φ^{−1} ∘ 𝑇̂ ∘ Φ holds, we can equivalently
prove that Φ ∘ 𝑇 = 𝑇̂ ∘ Φ. For 𝑢 ∈ R, we have Φ𝑇𝑢 = ln 𝐴 + 𝛼 ln 𝑢 and 𝑇̂Φ𝑢 = ln 𝐴 + 𝛼 ln 𝑢.
Hence Φ ∘ 𝑇 = 𝑇̂ ∘ Φ, as was to be shown.

Solution to Exercise 2.1.4. From 𝑇̂ = Φ ∘ 𝑇 ∘ Φ^{−1} we have 𝑇̂^2 = Φ ∘ 𝑇 ∘ Φ^{−1} ∘
Φ ∘ 𝑇 ∘ Φ^{−1} = Φ ∘ 𝑇^2 ∘ Φ^{−1} and, continuing in the same way (or using induction),
𝑇̂^𝑘 = Φ ∘ 𝑇^𝑘 ∘ Φ^{−1} for all 𝑘 ∈ N. Equivalently, 𝑇̂^𝑘 ∘ Φ = Φ ∘ 𝑇^𝑘 for all 𝑘 ∈ N. Hence, using
continuity of Φ and Φ^{−1},

$$T^k u \to u^* \iff \Phi T^k u \to \Phi u^* \iff \hat{T}^k \Phi u \to \Phi u^* .$$

Solution to Exercise 2.1.5. Let U be the set of all dynamical systems (𝑈, 𝑇) with
𝑈 ⊂ R^𝑛 and write (𝑈, 𝑇) ∼ (𝑈̂, 𝑇̂) if these systems are topologically conjugate. It is
easy to see that ∼ is reflexive and symmetric. Regarding transitivity, suppose that
(𝑈, 𝑇) ∼ (𝑈′, 𝑇′) and (𝑈′, 𝑇′) ∼ (𝑈″, 𝑇″). Let 𝐹 be the homeomorphism from 𝑈 to 𝑈′
and 𝐺 be the homeomorphism from 𝑈′ to 𝑈″. Then 𝐻 ≔ 𝐺 ∘ 𝐹 is a homeomorphism
from 𝑈 to 𝑈″ with inverse 𝐹^{−1} ∘ 𝐺^{−1}. Moreover, on 𝑈, we have

$$T = F^{-1} \circ T' \circ F = F^{-1} \circ G^{-1} \circ T'' \circ G \circ F = (G \circ F)^{-1} \circ T'' \circ (G \circ F) .$$

Hence (𝑈, 𝑇) ∼ (𝑈″, 𝑇″) and ∼ is transitive, as required.

Solution to Exercise 2.1.7. Since 𝑢𝑘+1 = 𝑇𝑢𝑘, we have

$$\frac{u_{k+1} - u^*}{u_k - u^*} = T'(u^*) + \frac{T''(v_k)}{2} (u_k - u^*) .$$

Since 𝑇 is twice continuously differentiable, 𝑇″𝑣𝑘 is bounded on bounded sets. As a
result, taking absolute values and using 𝑢𝑘 → 𝑢∗ confirms the linear rate claimed in
the exercise.

Solution to Exercise 2.2.4. It is easy to confirm that  violates reflexivity.

Solution to Exercise 2.2.6. Just set X = [ 𝑛] × [ 𝑘].

Solution to Exercise 2.2.7. Regarding the first claim, fix 𝐵 ∈ M^{𝑚×𝑘} with 𝑏𝑖𝑗 ⩾ 0
for all 𝑖, 𝑗. Pick any 𝑖 ∈ [𝑚] and 𝑢 ∈ R^𝑘. By the triangle inequality, we have |∑𝑗 𝑏𝑖𝑗 𝑢𝑗| ⩽
∑𝑗 𝑏𝑖𝑗 |𝑢𝑗|. Stacking these inequalities yields |𝐵𝑢| ⩽ 𝐵|𝑢|, as was to be shown.

Regarding the second, let 𝐴 and (𝑢𝑘) be as stated, with 𝑢𝑘+1 ⩽ 𝐴𝑢𝑘 for all 𝑘. We
aim to prove 𝑢𝑘 ⩽ 𝐴^𝑘 𝑢0 for all 𝑘 using induction. In doing so, we observe that 𝑢1 ⩽
𝐴𝑢0, so the claim is true at 𝑘 = 1. Suppose now that it holds at 𝑘 − 1. Then 𝑢𝑘 ⩽
𝐴𝑢𝑘−1 ⩽ 𝐴𝐴^{𝑘−1} 𝑢0 = 𝐴^𝑘 𝑢0, where the last step used nonnegativity of 𝐴 and the induction
hypothesis. The claim is now proved.

Solution to Exercise 2.2.8. Assume the stated conditions. Let ℎ ≔ 𝑣 − 𝑢 and let
𝑎𝑖𝑗 be the 𝑖, 𝑗-th element of 𝐴. We have ℎ ⩾ 0 and ℎ𝑗 > 0 at some 𝑗. Hence ∑𝑗 𝑎𝑖𝑗 ℎ𝑗 > 0
for every 𝑖. This says that every element of 𝐴ℎ is strictly positive. In other words,
𝐴ℎ = 𝐴(𝑣 − 𝑢) ≫ 0. The claim follows.

Solution to Exercise 2.2.9. Let 𝑀 = {1, 2}, let 𝐴 = {1} and let 𝐵 = {2}. Then
𝐴 ⊂ 𝐵 and 𝐵 ⊂ 𝐴 both fail. Hence ⊂ is not a total order on ℘ ( 𝑀 ).

Solution to Exercise 2.2.10. We prove the claim concerning greatest elements:
Suppose that 𝑔 and 𝑔′ are greatest elements of 𝐴. Then, since both are in 𝐴, we have
𝑔 ⪯ 𝑔′ and 𝑔′ ⪯ 𝑔. Hence, by antisymmetry, 𝑔 = 𝑔′.

Solution to Exercise 2.2.11. To see this, suppose first that 𝑆 ∈ {𝐴𝑖}. Since 𝐴𝑗 ⊂
∪𝑖 𝐴𝑖 =: 𝑆 for all 𝑗 ∈ 𝐼, the set 𝑆 is a greatest element of {𝐴𝑖}. Conversely, if 𝑆 is not
in {𝐴𝑖}, then 𝑆 is not a greatest element (since the definition directly requires that
𝑆 ∈ {𝐴𝑖}).

Solution to Exercise 2.2.12. Since the union of all bounded subsets of R𝑛 is R𝑛


(which is not bounded), { 𝐴𝑖 } has no greatest element. Indeed, if 𝐺 is the greatest
element of { 𝐴𝑖 }, then 𝐺 contains every bounded subset of R𝑛 . But then 𝐺 is not
bounded. Contradiction.

Solution to Exercise 2.2.13. Suppose that 𝑠 and 𝑠′ are both suprema of 𝐴 in 𝑃.
Then both 𝑠 and 𝑠′ are upper bounds, so 𝑠 ⪯ 𝑠′ and 𝑠′ ⪯ 𝑠. Hence 𝑠 = 𝑠′.

Solution to Exercise 2.2.16. To see the former, observe that 𝐴𝑗 ⊂ ∪𝑖 𝐴𝑖 for all
𝑗 ∈ 𝐼. Hence ∪𝑖 𝐴𝑖 is an upper bound of {𝐴𝑖}. Moreover, if 𝐵 ⊂ 𝑀 and 𝐴𝑗 ⊂ 𝐵 for all
𝑗 ∈ 𝐼, then ∪𝑖 𝐴𝑖 ⊂ 𝐵. This proves that ∪𝑖 𝐴𝑖 is the supremum. The proof of the infimum
case is similar.

Solution to Exercise 2.2.17. Here is one possible answer. Let 𝑃 = (0, 1), partially
ordered by ⩽. The set 𝐴 = [1/2, 1) is bounded above in R (and hence has a supremum
in R) but has no supremum in 𝑃. Indeed, if 𝑠 = ⋁ 𝐴, then 𝑠 ∈ 𝑃 and 𝑎 ⩽ 𝑠 for all 𝑎 ∈ 𝐴.
It is clear that no such element exists.

Solution to Exercise 2.2.21. Suppose first that 𝑣∗ ∈ {𝑣𝜎}. Since 𝑣𝜎 ⩽ 𝑣∗ for
all 𝜎, the function 𝑣∗ is the greatest element. Regarding the second claim, suppose
(seeking a contradiction) that 𝑣∗ ∉ {𝑣𝜎} and 𝑣̄ is a greatest element of {𝑣𝜎}. By
definition, 𝑣𝜎 ⩽ 𝑣̄ for all 𝜎, so taking the pointwise maximum gives 𝑣∗ ⩽ 𝑣̄. At the same
time, since 𝑣̄ is a greatest element, we have 𝑣̄ ∈ {𝑣𝜎}, and therefore 𝑣̄ ⩽ max𝜎 𝑣𝜎 = 𝑣∗.
Putting the two inequalities together gives 𝑣̄ = 𝑣∗, which in turn implies that 𝑣∗ ∈ {𝑣𝜎}.
Contradiction.

Solution to Exercise 2.2.22. Let 𝐼𝑎 ≔ [𝑎1, 𝑎2] and 𝐼𝑏 ≔ [𝑏1, 𝑏2] be two order
intervals in 𝑉. Consider the order interval 𝐼 ≔ [𝑎1 ∨ 𝑏1, 𝑎2 ∧ 𝑏2]. If ℎ ∈ 𝐼, then
ℎ ⩾ 𝑎1 ∨ 𝑏1, so ℎ ⩾ 𝑎1 and ℎ ⩾ 𝑏1. A similar argument gives ℎ ⩽ 𝑎2 and ℎ ⩽ 𝑏2. Hence
ℎ ∈ 𝐼𝑎 ∩ 𝐼𝑏. Working in the other direction, it is not difficult to show that ℎ ∈ 𝐼𝑎 ∩ 𝐼𝑏
implies ℎ ∈ 𝐼. Hence 𝐼 = 𝐼𝑎 ∩ 𝐼𝑏. In particular, 𝐼𝑎 ∩ 𝐼𝑏 is an order interval in 𝑉.

Solution to Exercise 2.2.23. Fix 𝑎, 𝑏 ∈ R+ and 𝑐 ∈ R+ . By (2.3), we have

𝑎 ∧ 𝑐 = ( 𝑎 − 𝑏 + 𝑏) ∧ 𝑐 ⩽ (| 𝑎 − 𝑏 | + 𝑏) ∧ 𝑐 ⩽ | 𝑎 − 𝑏 | ∧ 𝑐 + 𝑏 ∧ 𝑐.

Thus, 𝑎 ∧ 𝑐 − 𝑏 ∧ 𝑐 ⩽ | 𝑎 − 𝑏 | ∧ 𝑐. Reversing roles of 𝑎 and 𝑏 gives 𝑏 ∧ 𝑐 − 𝑎 ∧ 𝑐 ⩽ | 𝑎 − 𝑏 | ∧ 𝑐.


This proves the claim in Exercise 2.2.23.

Solution to Exercise 2.2.24. Since min 𝑓 = − max(− 𝑓 ) and similarly for 𝑔, we can
apply Lemma 2.2.2 to obtain

| min 𝑓 − min 𝑔 | = | max(−𝑔) − max(− 𝑓 )| ⩽ max |(−𝑔) − (− 𝑓 )| = max | 𝑓 − 𝑔 | .
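
The inequality can be sanity-checked numerically. The following is a minimal sketch on randomly drawn finite sequences; the helper name `min_gap_and_sup_gap` and the test data are illustrative assumptions, not part of the text.

```python
import random

def min_gap_and_sup_gap(f, g):
    """Return (|min f - min g|, max |f - g|) for two equal-length sequences."""
    lhs = abs(min(f) - min(g))
    rhs = max(abs(a - b) for a, b in zip(f, g))
    return lhs, rhs

random.seed(0)
results = []
for _ in range(1000):
    f = [random.uniform(-5, 5) for _ in range(6)]
    g = [random.uniform(-5, 5) for _ in range(6)]
    results.append(min_gap_and_sup_gap(f, g))

# The bound |min f - min g| <= max |f - g| should hold in every draw.
all_hold = all(lhs <= rhs + 1e-12 for lhs, rhs in results)
```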

Solution to Exercise 2.2.25. We prove the claim regarding greatest elements. To
this end, let $\bar u$ be the greatest element of $\{u_i\}$. Since $F$ is order-preserving, $F u_i \preceq F \bar u$ for all $i$.
As $F \bar u \in \{F u_i\}$, it follows that $F \bar u$ is the
greatest element, and hence the supremum, of $\{F u_i\}$. That is, $\bigvee_i F u_i = F \bar u = F \bigvee_i u_i$.

Solution to Exercise 2.2.26. Let $A$ and $P$ be as stated. The claim that $A^k$ is order-
preserving on $P$ holds at $k = 1$. Suppose now that it holds at $k$ and fix $p, q \in P$ with
$p \preceq q$. By the induction hypothesis, $A^k p \preceq A^k q$. Since $A$ is order-preserving, we then have
$A A^k p \preceq A A^k q$. Hence $A^{k+1} p \preceq A^{k+1} q$. We conclude that $A^{k+1}$ is also order-preserving,
as was to be shown.

Solution to Exercise 2.2.27. Fix an $n \times k$ matrix $A$ with $A \geq 0$, along with $u, v \in \mathbb{R}^k$.
We need to show that $u \leq v$ implies $A u \leq A v$. This
holds because if $u \leq v$ we have $v - u \geq 0$, so $A(v - u) \geq 0$. But then $A v - A u \geq 0$, or
$A u \leq A v$.

Solution to Exercise 2.2.28. Fix square $A, B$ with $0 \leq A \leq B$. It follows from the
rules of matrix multiplication that, for arbitrary nonnegative square matrices $E, F, G$
with $F \leq G$, we have $EF \leq EG$ and $FE \leq GE$. Hence, if $A^k \leq B^k$ for some $k \in \mathbb{N}$, then
$A^{k+1} = A A^k \leq B A^k \leq B B^k = B^{k+1}$. Thus, by induction, $A^k \leq B^k$ for all $k \in \mathbb{N}$, which
verifies the first claim. Regarding the second, it is clear that for nonnegative matrices
$E, F$ with $E \leq F$ we have $\| E \|_\infty \leq \| F \|_\infty$. Hence $\| A^k \|_\infty \leq \| B^k \|_\infty$ for all $k \in \mathbb{N}$. Raising
both sides to the power $1/k$, letting $k \to \infty$ and applying Gelfand’s lemma verifies $\rho(A) \leq \rho(B)$.
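
The argument can be illustrated numerically. The sketch below approximates the spectral radius by $\| A^k \|_\infty^{1/k}$ at a fixed large $k$ and compares two nonnegative matrices with $A \leq B$. The matrices and the helper names (`matmul`, `inf_norm`, `gelfand_estimate`) are assumptions made for illustration only.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def inf_norm(A):
    """Operator norm induced by the sup norm: the maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

def gelfand_estimate(A, k=200):
    """Approximate rho(A) by ||A^k||^(1/k) for a fixed (large) k."""
    Ak = A
    for _ in range(k - 1):
        Ak = matmul(Ak, A)
    return inf_norm(Ak) ** (1.0 / k)

# Illustrative matrices with 0 <= A <= B entrywise; rho(A) = 0.5 exactly.
A = [[0.2, 0.3], [0.1, 0.4]]
B = [[0.3, 0.3], [0.2, 0.5]]

rho_A = gelfand_estimate(A)
rho_B = gelfand_estimate(B)
```

In this example the estimate for `A` should be close to 0.5 and no larger than the estimate for `B`.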

Solution to Exercise 2.2.30. Take $(f_k)_{k \geq 1}$ in $i\mathbb{R}^P$ and $f \in \mathbb{R}^P$ with $f_k \to f$ as
$k \to \infty$. Since $f_k \to f$ we have $f_k(z) \to f(z)$ for all $z \in P$. (Norm convergence
implies pointwise convergence.) Fix $x, y \in P$ with $x \preceq y$. From $(f_k) \subset i\mathbb{R}^P$ we have
$f_k(x) \leq f_k(y)$ for all $k$. Since weak inequalities are preserved under limits, $f(x) \leq f(y)$.
Hence $f \in i\mathbb{R}^P$.

Solution to Exercise 2.2.32. Set $\alpha_k := u(x_k)$ for all $k$ and $s_k := \alpha_k - \alpha_{k-1}$ with
$\alpha_0 := 0$. Fix $x_j \in \mathsf{X}$. Then

\[ \sum_{k=1}^n s_k \mathbb{1}\{x_j \geq x_k\} = \sum_{k=1}^j s_k = (\alpha_1 - \alpha_0) + (\alpha_2 - \alpha_1) + \cdots + (\alpha_j - \alpha_{j-1}) = \alpha_j. \]

In other words, $\sum_{k=1}^n s_k \mathbb{1}\{x_j \geq x_k\} = u(x_j)$. This completes the proof.
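
The telescoping construction can be sketched in code. The grid `xs`, the values `u_vals` and the function names are hypothetical and chosen only to illustrate the decomposition of an increasing function into step functions.

```python
def step_coefficients(u_vals):
    """Return s_k = alpha_k - alpha_{k-1} with alpha_0 = 0, where alpha_k = u(x_k)."""
    coeffs, prev = [], 0.0
    for alpha in u_vals:
        coeffs.append(alpha - prev)
        prev = alpha
    return coeffs

def reconstruct(xs, coeffs, x):
    """Evaluate sum_k s_k * 1{x >= x_k} at the point x."""
    return sum(s for xk, s in zip(xs, coeffs) if x >= xk)

xs = [1, 2, 4, 7]                  # increasing grid x_1 < ... < x_n
u_vals = [0.5, 0.5, 2.0, 3.5]      # values of an increasing function u on the grid
s = step_coefficients(u_vals)
rebuilt = [reconstruct(xs, s, x) for x in xs]
```

Because `u_vals` is increasing, every coefficient after the first is nonnegative, and `rebuilt` should reproduce `u_vals` on the grid.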

Solution to Exercise 2.2.33. Fix $\varphi, \psi \in \mathcal{D}(\mathsf{X})$ and suppose that $\varphi \preceq_{\mathrm F} \psi$. Let $u \in \mathbb{R}^{\mathsf X}$
be defined by $u(1) = 0$ and $u(2) = 1$, which is increasing. Then, by the definition of stochastic dominance,
we have $\varphi(2) \leq \psi(2)$. Since $\varphi(1) = 1 - \varphi(2)$ and $\psi(1) = 1 - \psi(2)$, this inequality is
equivalent to $\psi(1) \leq \varphi(1)$. Finally, suppose that $\psi(1) \leq \varphi(1)$ and fix $u \in i\mathbb{R}^{\mathsf X}$. Let
$h = u(2) - u(1) \geq 0$. Then

\[ \sum_x u(x) \varphi(x) = u(1) \varphi(1) + (u(1) + h)(1 - \varphi(1)) = u(1) + h (1 - \varphi(1)). \]

Similarly, $\sum_x u(x) \psi(x) = u(1) + h(1 - \psi(1))$. Since $h \geq 0$ and $\psi(1) \leq \varphi(1)$, we have
$\sum_x u(x) \varphi(x) \leq \sum_x u(x) \psi(x)$. Thus, $\varphi \preceq_{\mathrm F} \psi$. This chain of implications proves the
equivalences in the exercise.

Solution to Exercise 2.2.34. Suppose $f, g, h \in \mathcal{D}(\mathsf{X})$ with $f \preceq_{\mathrm F} g$ and $g \preceq_{\mathrm F} h$.
Fixing $u \in i\mathbb{R}^{\mathsf X}$, we have

\[ \sum_x u(x) f(x) \leq \sum_x u(x) g(x) \quad \text{and} \quad \sum_x u(x) g(x) \leq \sum_x u(x) h(x). \]

Hence $\sum_x u(x) f(x) \leq \sum_x u(x) h(x)$. Since $u$ was arbitrary in $i\mathbb{R}^{\mathsf X}$, we are done.

Solution to Exercise 2.2.35. Let $F^\varphi := 1 - G^\varphi$ be the CDF of $\varphi$ and let $F^\psi$ be the
CDF of $\psi$. In view of Lemma 2.2.5, we have $F^\psi \leq F^\varphi$. As a consequence,

\[ \{x \in \mathsf{X} : F^\psi(x) \geq \tau\} \subset \{x \in \mathsf{X} : F^\varphi(x) \geq \tau\}. \]

It follows directly that

\[ \min\{x \in \mathsf{X} : F^\varphi(x) \geq \tau\} \leq \min\{x \in \mathsf{X} : F^\psi(x) \geq \tau\}. \]

That is, $Q_\tau(X) \leq Q_\tau(Y)$.

Solution to Exercise 2.2.36. Let $P, S, T$ be as described in the exercise. We aim
to show that $S^k \preceq T^k$ holds for all $k \in \mathbb{N}$. Clearly it holds for $k = 1$. If it also holds at
$k - 1$, then, for any $u \in P$, we have $S^k u = S S^{k-1} u \preceq S T^{k-1} u \preceq T T^{k-1} u = T^k u$, where we
used the induction hypothesis, the order-preserving property of $S$ and the assumption
that $S \preceq T$.

Solution to Exercise 2.2.39. Fix $\beta_1 \leq \beta_2$. Let $g_1$ and $g_2$ be the corresponding fixed
point maps, as defined in (1.33). Since $\beta_1 \leq \beta_2$, we have $g_1(h) \leq g_2(h)$ for all $h \in \mathbb{R}_+$.
Since, in addition, $g_2$ is a contraction map (and hence globally stable), Proposition 2.2.7
applies. In particular, the fixed point $h_1^*$ corresponding to $\beta_1$ is less than or equal to
$h_2^*$, the fixed point corresponding to $\beta_2$.
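
The comparison can be illustrated with a small computation. The sketch below assumes that the map in (1.33) has the form $g(h) = c + \beta \sum_{w} \max\{w/(1-\beta), h\} \varphi(w)$. This form, the wage grid, the probabilities and the helper names are all assumptions made for illustration.

```python
def make_g(beta, c, wages, probs):
    """Assumed form of the map in (1.33): h -> c + beta * E max{w / (1 - beta), h}."""
    def g(h):
        return c + beta * sum(max(w / (1 - beta), h) * p
                              for w, p in zip(wages, probs))
    return g

def fixed_point(g, h0=0.0, tol=1e-10, max_iter=100_000):
    """Successive approximation h <- g(h), stopped when iterates are within tol."""
    h = h0
    for _ in range(max_iter):
        h_new = g(h)
        if abs(h_new - h) < tol:
            return h_new
        h = h_new
    raise RuntimeError("successive approximation did not converge")

wages = [1.0, 2.0, 3.0]       # hypothetical wage offers
probs = [0.3, 0.4, 0.3]       # hypothetical offer probabilities
c = 1.0                       # hypothetical unemployment compensation

h1 = fixed_point(make_g(0.90, c, wages, probs))   # fixed point at beta_1 = 0.90
h2 = fixed_point(make_g(0.95, c, wages, probs))   # fixed point at beta_2 = 0.95
```

Under these assumptions the fixed point computed at the lower discount factor should not exceed the one computed at the higher discount factor.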

Solution to Exercise 2.3.1. Let $A$ be as stated and let $e$ be the right eigenvector in
(2.10). Since $e$ is nonnegative and nonzero, and since eigenvectors are defined only
up to constant multiples, we can and do assume that $\sum_j e_j = 1$. From $A e = \rho(A) e$ we
have $\sum_j a_{ij} e_j = \rho(A) e_i$ for all $i$. Summing with respect to $i$ gives $\sum_j \operatorname{colsum}_j(A) e_j = \rho(A)$.
Since the elements of $e$ are nonnegative and sum to one, $\rho(A)$ is a weighted average
of the column sums. Hence the second pair of bounds in Lemma 2.3.2 holds. The
remaining proof is similar (use the left eigenvector).
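
The weighted-average identity can be checked on an example. The sketch below uses power iteration on an everywhere positive matrix, keeping the iterate normalized to sum to one, so that the sum of $Ae$ is exactly the weighted average of column sums used in the proof. The matrix and the function name are illustrative assumptions.

```python
def spectral_radius_power(A, iters=500):
    """Power iteration for an everywhere positive square matrix A.

    Returns (rho, e), where e is normalized so that its entries sum to one.
    """
    n = len(A)
    e = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        Ae = [sum(A[i][j] * e[j] for j in range(n)) for i in range(n)]
        rho = sum(Ae)              # equals sum_j colsum_j(A) * e_j since sum(e) == 1
        e = [x / rho for x in Ae]
    return rho, e

A = [[1.0, 2.0, 0.5],
     [0.5, 1.0, 1.0],
     [2.0, 0.5, 1.0]]              # hypothetical positive matrix

rho, e = spectral_radius_power(A)
col_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]
```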

Solution to Exercise 2.3.2. Let $P$ and $Q$ be as stated. Evidently $PQ \geq 0$. Moreover,
$PQ \mathbb{1} = P \mathbb{1} = \mathbb{1}$, so $PQ$ is Markov. That $\rho(P) = 1$ follows directly from Lemma 2.3.2.
By the Perron–Frobenius theorem, there exists a nonzero, nonnegative row vector $\varphi$
satisfying $\varphi P = \varphi$. Rescaling $\varphi$ to $\varphi / (\varphi \mathbb{1})$ gives the desired vector $\psi$.
The final positivity and uniqueness claim is also by the Perron–Frobenius theorem,
and its consequences for irreducible matrices. Indeed, if $\varphi$ is another nonnegative
vector satisfying $\varphi \mathbb{1} = 1$ and $\varphi P = \varphi$, then, by the Perron–Frobenius theorem, $\varphi = \alpha \psi$
for some $\alpha > 0$. But then $\alpha \psi \mathbb{1} = 1$ and $\psi \mathbb{1} = 1$, which gives $\alpha = 1$. Hence $\varphi = \psi$.

Solution to Exercise 2.3.3. Let $P$ and $\varepsilon$ have the stated properties. Fix $h \in \mathbb{R}^{\mathsf X}$.
It suffices to show that for this arbitrary $h$ we can find an $x \in \mathsf{X}$ such that $(Ph)(x) < h(x) + \varepsilon$.
This is easy to verify, since, for $\bar x \in \operatorname{argmax}_{x \in \mathsf{X}} h(x)$, we have

\[ (Ph)(\bar x) = \sum_{x'} h(x') P(\bar x, x') \leq h(\bar x). \]

Solution to Exercise 2.3.4. It is straightforward to confirm that both columns of
$A$ sum to $1 + g$. As a result, with $\mathbb{1}^\top$ as a row vector of ones, we have

\[ n_{t+1} = \mathbb{1}^\top x_{t+1} = \mathbb{1}^\top A x_t = (1 + g) \mathbb{1}^\top x_t = (1 + g) n_t, \]

as was to be shown.

Solution to Exercise 2.3.10. Fix $L \in \mathcal{L}(\mathbb{R}^{\mathsf X})$ with $(Lu)(x) = \sum_{x' \in \mathsf{X}} L(x, x') u(x')$ for
all $x \in \mathsf{X}$ and $u \in \mathbb{R}^{\mathsf X}$. Positivity of $L$ requires that

\[ u \geq 0 \implies \sum_{x' \in \mathsf{X}} L(x, x') u(x') \geq 0 \text{ for all } x \in \mathsf{X}. \]

Clearly, this holds whenever $L(x, x') \geq 0$ for all $x, x' \in \mathsf{X}$.

Regarding the converse, suppose that $L$ is positive. Seeking a contradiction, suppose
in addition that we can find a pair $(x_a, x_b) \in \mathsf{X} \times \mathsf{X}$ such that $L(x_a, x_b) < 0$. With
$u(x) := \mathbb{1}\{x = x_b\}$, we have $u \geq 0$ and $(Lu)(x_a) = \sum_{x' \in \mathsf{X}} L(x_a, x') u(x') = L(x_a, x_b) < 0$. This
contradicts positivity of $L$.

Solution to Exercise 2.3.11. Suppose first that $L$ is positive. Fix $u \leq v$ in $\mathbb{R}^{\mathsf X}$ and
observe that, by positivity, $L(v - u) \geq 0$. But then $Lv - Lu \geq 0$ and hence $Lu \leq Lv$. This
shows that $L$ is order-preserving.
Regarding the converse, if $L$ is order-preserving, then $u \geq 0$ implies $Lu \geq L0$. But
for every linear operator we have $L0 = 0$, and so $Lu \geq 0$. Hence $L$ is a positive operator.

Solution to Exercise 2.3.12. Fix $P \in \mathcal{L}(\mathbb{R}^{\mathsf X})$ and let $P(x, x')$ be the matrix
representation, so that

\[ (Pu)(x) = \sum_{x'} P(x, x') u(x') \qquad (x \in \mathsf{X}) \]

for any $u \in \mathbb{R}^{\mathsf X}$. Suppose first that $P \in \mathcal{M}(\mathbb{R}^{\mathsf X})$. The statement that $P$ is a positive
linear operator is equivalent to $P(x, x') \geq 0$ for all $x, x'$ by Lemma 2.3.6. Moreover,
$P \mathbb{1} = \mathbb{1}$ is equivalent to $\sum_{x' \in \mathsf{X}} P(x, x') = 1$ for all $x \in \mathsf{X}$.

Solution to Exercise 2.3.14. In the solution below, we use the characterization
in Exercise 2.3.12: $P \in \mathcal{M}(\mathbb{R}^{\mathsf X})$ if and only if $P(x, x') \geq 0$ for all $x, x' \in \mathsf{X}$ and
$\sum_{x' \in \mathsf{X}} P(x, x') = 1$ for all $x \in \mathsf{X}$.

Fix $P \in \mathcal{L}(\mathbb{R}^{\mathsf X})$ and suppose first that $P \in \mathcal{M}(\mathbb{R}^{\mathsf X})$. Then

\[ (\varphi P)(x') = \sum_x P(x, x') \varphi(x) \qquad (x' \in \mathsf{X}) \tag{2.20} \]

is in $\mathcal{D}(\mathsf{X})$ whenever $\varphi \in \mathcal{D}(\mathsf{X})$, since, for any such $\varphi$, the vector $\varphi P$ is clearly
nonnegative and

\[ \sum_{x'} (\varphi P)(x') = \sum_x \sum_{x'} P(x, x') \varphi(x) = \sum_x \varphi(x) = 1. \]

Now suppose instead that $P \in \mathcal{L}(\mathbb{R}^{\mathsf X})$ and $\varphi P \in \mathcal{D}(\mathsf{X})$ whenever $\varphi \in \mathcal{D}(\mathsf{X})$. It
follows that $P(x, x')$ is nonnegative at arbitrary $(x, x')$, since $(\varphi P)(x') = P(x, x')$ when
$\varphi$ is the distribution that puts all mass on $x$. Moreover, $P(x, \cdot)$ must sum to one at
arbitrary $x$ because if $\varphi$ is the distribution that puts all mass on $x$, then

\[ 1 = \sum_{x'} (\varphi P)(x') = \sum_{x'} P(x, x'). \]

Solution to Exercise 3.1.1. Let $X_t = x \in \mathsf{X}$, so that $X_{t+1} = \max\{x - D_{t+1}, 0\} + S \mathbb{1}\{x \leq s\}$.
Evidently $X_{t+1}$ is integer-valued and nonnegative. If $x \leq s$, then $X_{t+1} \leq \max\{s - D_{t+1}, 0\} + S \leq s + S$.
Similarly, if $s < x \leq S + s$, then $X_{t+1} \leq \max\{x - D_{t+1}, 0\} \leq S + s$. The
claim is verified.
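
A short simulation illustrates the invariance claim. This is a sketch only: the demand law (geometric on $\{0, 1, \ldots\}$), the parameter values and the function names are assumptions made for the example.

```python
import random

def next_inventory(x, d, s, S):
    """One step of X_{t+1} = max(X_t - D_{t+1}, 0) + S * 1{X_t <= s}."""
    return max(x - d, 0) + (S if x <= s else 0)

def draw_geometric(p, rng):
    """Draw D with P{D = d} = (1 - p)^d * p for d = 0, 1, ... (assumed demand law)."""
    d = 0
    while rng.random() >= p:
        d += 1
    return d

rng = random.Random(1234)
s, S, p = 3, 10, 0.4          # hypothetical parameters
x, path = 5, []
for _ in range(10_000):
    x = next_inventory(x, draw_geometric(p, rng), s, S)
    path.append(x)

# Every simulated state should lie in {0, ..., S + s}.
stays_in_state_space = all(0 <= y <= S + s for y in path)
```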

Solution to Exercise 3.1.2. Fixing $t \geq 0$ and $P \in \mathcal{M}(\mathbb{R}^{\mathsf X})$, this claim can be
verified by induction over $k$. The claim is obviously true when $k = 0, 1$. Suppose the
claim is also true at $k$ and now consider the case $k + 1$. By the law of total probability
and the Markov property, for given $x, x' \in \mathsf{X}$, we have

\[ \mathbb{P}\{X_{t+k+1} = x' \mid X_t = x\} = \sum_z \mathbb{P}\{X_{t+k+1} = x' \mid X_{t+k} = z\} \, \mathbb{P}\{X_{t+k} = z \mid X_t = x\}. \]

The induction hypothesis allows us to use (3.2) at $k$, so the last equation becomes

\[ \mathbb{P}\{X_{t+k+1} = x' \mid X_t = x\} = \sum_z P^k(x, z) P(z, x') = P^{k+1}(x, x'). \]

The law (3.2) is now verified at $k + 1$, completing our proof by induction.

Solution to Exercise 3.1.3. Let 𝑥 ∈ X be the current state at time 𝑡 and suppose
first that 𝑠 < 𝑥 . The next period state 𝑋𝑡+1 hits 𝑠 with positive probability, since 𝜑 ( 𝑑 ) > 0
for all 𝑑 ∈ Z+ . The state 𝑋𝑡+2 hits 𝑆 + 𝑠 with positive probability, since 𝜑 (0) > 0. From
𝑆 + 𝑠, the inventory level reaches any point in X = {0, . . . , 𝑆 + 𝑠 } in one step with
positive probability. Hence, from current state 𝑥 , inventory reaches any other state 𝑦
with positive probability in three steps.
The logic for the case 𝑥 ⩽ 𝑠 is similar and left to the reader.

Solution to Exercise 3.1.4. Fix $t \in \mathbb{N}$. Under the stated hypotheses, we have
$X_t \overset{d}{=} \psi_0 P^t$ (see (3.5)). Hence

\[ \mathbb{E} h(X_t) = \sum_{x'} h(x') \mathbb{P}\{X_t = x'\} = \sum_{x'} h(x') (\psi_0 P^t)(x') = \langle \psi_0 P^t, h \rangle. \]
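
The identity can be checked on a small example by comparing the inner product with a brute-force sum over all sample paths. The matrix `P`, the initial distribution `psi_0` and the function `h` below are hypothetical values chosen for illustration.

```python
from itertools import product

P = [[0.9, 0.1, 0.0],
     [0.4, 0.4, 0.2],
     [0.1, 0.1, 0.8]]        # hypothetical Markov matrix
psi_0 = [0.2, 0.5, 0.3]      # hypothetical initial distribution
h = [1.0, 2.0, 4.0]          # hypothetical function on the state space
t = 3

def push_forward(psi, P):
    """Row vector times matrix: psi -> psi P."""
    n = len(P)
    return [sum(psi[x] * P[x][y] for x in range(n)) for y in range(n)]

psi_t = psi_0
for _ in range(t):
    psi_t = push_forward(psi_t, P)
expectation = sum(q * hv for q, hv in zip(psi_t, h))   # <psi_0 P^t, h>

# Brute force: weight h(x_t) by the probability of each path (x_0, ..., x_t).
brute = 0.0
for path in product(range(3), repeat=t + 1):
    prob = psi_0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[a][b]
    brute += prob * h[path[-1]]
```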

Solution to Exercise 3.1.7. Assume $P$ is everywhere positive with unique stationary
distribution $\psi^*$. Since $\rho(P) = 1$, the last part of the Perron–Frobenius theorem tells
us that $P^t \to e \varepsilon$ as $t \to \infty$, where $e$ and $\varepsilon$ are the dominant right and left eigenvectors,
normalized such that $\langle e, \varepsilon \rangle = 1$. In this case we know $\psi^*$ is the dominant left eigenvector
and $\mathbb{1}$ is the dominant right eigenvector. Moreover, $\psi^* \in \mathcal{D}(\mathsf{X})$ yields $\langle \psi^*, \mathbb{1} \rangle = 1$.
Hence, for any $\psi \in \mathcal{D}(\mathsf{X})$, we have

\[ \psi P^t \to \psi \mathbb{1} \psi^* = \psi^* \quad \text{as } t \to \infty. \]

Hence global stability holds, as claimed.

Solution to Exercise 3.1.11. Since we are conditioning on $X_t = x$, we can replace
$X_{t+1}$ with $\rho x + \nu \varepsilon_{t+1}$. The result then follows from $\mathbb{P}\{\alpha < \nu \varepsilon_{t+1} \leq \beta\} = F(\beta) - F(\alpha)$.

Solution to Exercise 3.2.2. Using Exercise 3.1.11 and the definition of $P$, it can
be shown that

\[ G(x, x_k) := \sum_{j=k}^n P(x, x_j) = \mathbb{P}\{x_k - s/2 < X_{t+1} \mid X_t = x\}. \]

Rewriting the probability in terms of $\varepsilon_{t+1}$, we get

\[ G(x, x_k) = \mathbb{P}\{\varepsilon_{t+1} > (x_k - s/2 - \rho x)/\sigma\}. \]

Since $\rho \geq 0$, we can now see that $x \leq y$ implies $G(x, x_k) \leq G(y, x_k)$ for all $k$, or,
equivalently, $G(x, \cdot) \leq G(y, \cdot)$ pointwise on $\mathsf{X}$. By Lemma 2.2.5, this is equivalent to
the statement that $P(x, \cdot) \preceq_{\mathrm F} P(y, \cdot)$, which confirms that $P$ is monotone increasing.

Solution to Exercise 3.2.3. This matrix $P_w$ is monotone increasing if and only
if $(1 - \alpha, \alpha) \preceq_{\mathrm F} (\beta, 1 - \beta)$. From Exercise 2.2.33, we know that this is equivalent to
$\beta \leq 1 - \alpha$, or $\beta + \alpha \leq 1$.
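
The condition can be verified by direct computation. The sketch below tests first order stochastic dominance between adjacent rows through their tail sums (the characterization referred to as Lemma 2.2.5 in these solutions). The function names and the parameter values are illustrative assumptions.

```python
def is_monotone_increasing(P, tol=1e-12):
    """Check P(x, .) <=_F P(x + 1, .) for all x, using tail sums of each row."""
    n = len(P)

    def tail(row, k):
        return sum(row[k:])

    return all(tail(P[x], k) <= tail(P[x + 1], k) + tol
               for x in range(n - 1) for k in range(n))

def worker_matrix(alpha, beta):
    """Two-state matrix with rows (1 - alpha, alpha) and (beta, 1 - beta)."""
    return [[1 - alpha, alpha], [beta, 1 - beta]]

ok_case = is_monotone_increasing(worker_matrix(0.3, 0.5))    # alpha + beta = 0.8
bad_case = is_monotone_increasing(worker_matrix(0.7, 0.6))   # alpha + beta = 1.3
```

Checking adjacent rows is enough here because stochastic dominance is transitive.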

Solution to Exercise 3.2.4. Suppose that $P$ is monotone increasing and fix $h \in i\mathbb{R}^{\mathsf X}$.
We claim that $Ph \in i\mathbb{R}^{\mathsf X}$. To see this, pick any $x, y \in \mathsf{X}$ with $x \preceq y$. Since $x \preceq y$
we have $P(x, \cdot) \preceq_{\mathrm F} P(y, \cdot)$. Hence $\sum_{x'} h(x') P(x, x') \leq \sum_{x'} h(x') P(y, x')$. This shows that
$Ph \in i\mathbb{R}^{\mathsf X}$.
To see the converse, suppose that $P$ is invariant on $i\mathbb{R}^{\mathsf X}$. Fix $x, y \in \mathsf{X}$ with $x \preceq y$.
We claim that $P(x, \cdot) \preceq_{\mathrm F} P(y, \cdot)$. To see this, fix $u \in i\mathbb{R}^{\mathsf X}$. Then $Pu \in i\mathbb{R}^{\mathsf X}$ by invariance, so
$(Pu)(x) \leq (Pu)(y)$ and hence $\sum_{x'} u(x') P(x, x') \leq \sum_{x'} u(x') P(y, x')$. Since $u$ was chosen
arbitrarily from $i\mathbb{R}^{\mathsf X}$, we have $P(x, \cdot) \preceq_{\mathrm F} P(y, \cdot)$. Hence $P$ is monotone increasing, as
was to be shown.

Solution to Exercise 3.2.5. Clearly this is true for $t = 1$. Suppose it is also true
for arbitrary $t$. Then, for any $h \in i\mathbb{R}^{\mathsf X}$, the function $P^t h$ is again in $i\mathbb{R}^{\mathsf X}$. From this it
follows that $P^{t+1} h = P P^t h$ is also in $i\mathbb{R}^{\mathsf X}$, since $P$ is monotone increasing. This proves
that $P^{t+1}$ is invariant on $i\mathbb{R}^{\mathsf X}$, and therefore monotone increasing.

Solution to Exercise 3.2.6. Let $\pi$ and $P$ satisfy the stated conditions. By Exercise 3.2.5,
$P^t$ is monotone increasing for all $t$. By this fact and the assumption $\pi \in i\mathbb{R}^{\mathsf X}$,
we see that $P^t \pi \in i\mathbb{R}^{\mathsf X}$ for all $t$. Hence $v = \sum_{t \geq 0} \beta^t P^t \pi$ is also increasing.
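
As a numerical illustration, the sketch below computes a truncated version of the sum for a monotone increasing matrix and an increasing payoff vector, then checks that the result is increasing. The matrix, the vector `pi` and the truncation length are assumptions made for the example.

```python
def apply(P, h):
    """Matrix-vector product (P h)(x) = sum_y P(x, y) h(y)."""
    n = len(P)
    return [sum(P[x][y] * h[y] for y in range(n)) for x in range(n)]

def discounted_sum(P, pi, beta, terms=2000):
    """Truncated version of v = sum_{t >= 0} beta^t P^t pi."""
    v = [0.0] * len(pi)
    term = list(pi)                          # holds beta^t P^t pi
    for _ in range(terms):
        v = [a + b for a, b in zip(v, term)]
        term = [beta * x for x in apply(P, term)]
    return v

P = [[0.8, 0.2, 0.0],
     [0.3, 0.5, 0.2],
     [0.1, 0.3, 0.6]]     # hypothetical monotone increasing Markov matrix
pi = [0.0, 1.0, 3.0]      # hypothetical increasing payoff vector
beta = 0.95

v = discounted_sum(P, pi, beta)
is_increasing = all(v[i] <= v[i + 1] for i in range(len(v) - 1))
```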

Solution to Exercise 3.2.8. Both $u$ and $\exp$ are increasing on $\mathsf{X}$, so $r$ is in $i\mathbb{R}^{\mathsf X}$.
Since $\rho \geq 0$, $P$ is monotone increasing (see §3.2.1.3). Clearly $\beta P$ shares this property.
It follows that $\beta P r \in i\mathbb{R}^{\mathsf X}$. Applying $\beta P$ again, we have $(\beta P)^2 r \in i\mathbb{R}^{\mathsf X}$. Continuing in
this way, we see that $(\beta P)^k r$ is increasing for all $k$. Hence $\sum_{k \geq 0} (\beta P)^k r$ is increasing. By
the Neumann series lemma, this sum is equal to $v$, so $v \in i\mathbb{R}^{\mathsf X}$.

Solution to Exercise 3.3.1. We start with part (i). To show that $T$ is a self-map
on $V := \mathbb{R}_+^{\mathsf W}$, we just need to verify that $v \in V$ implies $Tv \in V$, which only requires
us to verify that $T$ maps nonnegative functions into nonnegative functions. This is
clear from the definition. Regarding the order-preserving property, fix $f, g \in V$ with
$f \leq g$. We claim that $Tf \leq Tg$. Indeed, if $w \in \mathsf{W}$, then
$\sum_{w' \in \mathsf{W}} f(w') P(w, w') \leq \sum_{w' \in \mathsf{W}} g(w') P(w, w')$, which in turn implies that $(Tf)(w) \leq (Tg)(w)$. Since $w$ was an
arbitrary wage value, we have $Tf \leq Tg$, so $T$ is order-preserving.

Regarding part (ii), let $e(w) := w/(1 - \beta)$ and fix $f, g$ in $V$. Writing the operators
pointwise and applying the last result in Lemma 2.2.1 (page 58) gives

\[
\begin{aligned}
|Tf - Tg| &= |e \vee (c + \beta P f) - e \vee (c + \beta P g)| \\
          &\leq |\beta P f - \beta P g| \\
          &= \beta |P(f - g)| \\
          &\leq \beta P |f - g|.
\end{aligned}
\]

(Here the last inequality uses the result in Exercise 2.2.7 on page 53.) Since $P \geq 0$ we
have $P |f - g| \leq P \| f - g \|_\infty \mathbb{1} = \| f - g \|_\infty \mathbb{1}$, so

\[ |Tf - Tg| \leq \beta \| f - g \|_\infty \mathbb{1}. \]

Taking the maximum on both sides gives $\| Tf - Tg \|_\infty \leq \beta \| f - g \|_\infty$. Since $f, g$ were
arbitrary elements of $V$, the contraction claim is verified.

Solution to Exercise 3.3.2. The code in Listing 10 creates a Markov chain via
Tauchen approximation of an AR(1) process with positive autocorrelation parameter.
By Exercise 3.2.2, 𝑃 is monotone increasing. Hence, by Lemma 3.3.1, the value func-
tion is increasing. Since ℎ∗ = 𝑐 + 𝛽𝑃𝑣∗ , it follows that ℎ∗ is increasing. Regarding
intuition, positive autocorrelation in wages means that high current wages predict
high future wages. It follows that the value of waiting for future wages rises with
current wages.

Solution to Exercise 3.3.5. Let $T$ be the operator on $V$ such that $(T v_u)(w)$ is
the right-hand side of (3.28). To solve the exercise, it suffices to prove that $T$ is a
contraction map on $V$. (Then $v_u$ can be obtained, in the limit, by applying successive
approximation to $T$ and, once the approximate fixed point is computed, $v_e$ can be
obtained via (3.27).) To show that $T$ is a contraction, we let $T_1$ and $T_2$ be the operators
on $V$ defined by

\[ (T_1 v)(w) = \frac{1}{1 - \beta(1 - \alpha)} \left( w + \alpha \beta (Pv)(w) \right) \quad \text{and} \quad (T_2 v)(w) = c + \beta (Pv)(w). \]

Since $Tv = (T_1 v) \vee (T_2 v)$, Lemma 2.2.3 on page 59 tells us that $T$ will be a contraction
provided that $T_1$ and $T_2$ are both contraction maps. For the case of $T_2$, we have

\[ \| T_2 f - T_2 g \|_\infty = \max_w |c + \beta (Pf)(w) - c - \beta (Pg)(w)| \leq \max_w \beta \sum_{w'} |f(w') - g(w')| P(w, w'). \]

The last term is dominated by $\beta \| f - g \|_\infty$, so $T_2$ is a contraction. The proof for $T_1$ is
similar in spirit and hence left to the reader.

Solution to Exercise 4.1.1. Pointwise on $\mathsf{X}$ we have $0 \leq 1 - \sigma \leq 1$, so $0 \leq L_\sigma \leq \beta P$. By
Exercise 2.2.28 on page 60, we then have $\rho(L_\sigma) \leq \rho(\beta P) = \beta < 1$.

Solution to Exercise 4.1.2. Fix $\sigma \in \Sigma$. If $f, g \in \mathbb{R}^{\mathsf X}$, $f \leq g$ and $x \in \mathsf{X}$, then

\[
\begin{aligned}
(T_\sigma g)(x) - (T_\sigma f)(x)
  &= (1 - \sigma(x)) \left[ \beta \sum_{x'} g(x') P(x, x') - \beta \sum_{x'} f(x') P(x, x') \right] \\
  &= (1 - \sigma(x)) \beta \sum_{x'} (g(x') - f(x')) P(x, x').
\end{aligned}
\]

Since $g(x') \geq f(x')$ for all $x'$ we have $(T_\sigma g)(x) \geq (T_\sigma f)(x)$ for all $x$.

Solution to Exercise 4.1.3. Fix $\sigma \in \Sigma$. Given $f, g \in \mathbb{R}^{\mathsf X}$ and $x \in \mathsf{X}$, we have

\[
\begin{aligned}
|(T_\sigma f)(x) - (T_\sigma g)(x)|
  &= \left| (1 - \sigma(x)) \beta \sum_{x'} (f(x') - g(x')) P(x, x') \right| \\
  &\leq \beta \left| \sum_{x'} [f(x') - g(x')] P(x, x') \right|.
\end{aligned}
\]

Applying the triangle inequality and $\sum_{x'} P(x, x') = 1$, we obtain

\[ |(T_\sigma f)(x) - (T_\sigma g)(x)| \leq \beta \sum_{x'} |f(x') - g(x')| P(x, x') \leq \beta \| f - g \|_\infty. \]

Taking the supremum over all $x$ on the left hand side of this expression leads to

\[ \| T_\sigma f - T_\sigma g \|_\infty \leq \beta \| f - g \|_\infty. \]

Since $f, g$ were arbitrary elements of $\mathbb{R}^{\mathsf X}$, the contraction claim is proved.

Solution to Exercise 4.1.4. Fix $f, g \in \mathbb{R}^{\mathsf X}$ with $f \leq g$. Since $P \geq 0$, we have
$Pf \leq Pg$. Hence $c + \beta P f \leq c + \beta P g$. As a result,

\[ Tf = e \vee (c + \beta P f) \leq e \vee (c + \beta P g) = Tg. \]

Solution to Exercise 4.1.5. This result follows from Lemma 2.2.3 on page 59.
For the sake of the exercise, we also provide a direct proof:

Take any $f, g$ in $\mathbb{R}^{\mathsf X}$. Writing the operators pointwise and applying the last result
in Lemma 2.2.1 (page 58) gives

\[
\begin{aligned}
|Tf - Tg| &= |e \vee (c + \beta P f) - e \vee (c + \beta P g)| \\
          &\leq |\beta P f - \beta P g| \\
          &= \beta |P(f - g)| \\
          &\leq \beta P |f - g|.
\end{aligned}
\]

(Here the last inequality uses the result in Exercise 2.2.7 on page 53.) Since $P \geq 0$ we
have $P |f - g| \leq P \| f - g \|_\infty \mathbb{1} = \| f - g \|_\infty \mathbb{1}$, so

\[ |Tf - Tg| \leq \beta \| f - g \|_\infty \mathbb{1}. \]

Taking the maximum on both sides gives $\| Tf - Tg \|_\infty \leq \beta \| f - g \|_\infty$. Since $f, g$ were
arbitrary elements of $\mathbb{R}^{\mathsf X}$, the contraction claim is verified.

Solution to Exercise 4.1.7. First observe that, since $v^* \geq w$ and $T$ is order-
preserving, we have $v^* = T v^* \geq T w = s \vee (\pi + \beta Q w) = s \vee w$. From this we get
$v^* \geq s \vee w$ and applying $T$ to both sides gives $v^* \geq T(s \vee w)$.
Next, observe that

\[ T(s \vee w) = s \vee (\pi + \beta Q (s \vee w)) \geq \pi + \beta Q (s \vee w) \gg \pi + \beta Q w = w, \]

where the strict inequality is by Exercise 2.2.8 on page 53. We conclude that
$v^* \geq T(s \vee w) \gg w$, as was to be shown.
Intuitively, the option to exit adds value to firms everywhere in the state space,
since $Q \gg 0$ implies that the state can shift to a region of the state space where exit
is optimal in a later period.

Solution to Exercise 4.1.8. For the model described, the Bellman equation takes
the form

\[ v(p) = \max\left\{ s, \; \max_{\ell \geq 0} \pi(\ell, p) + \beta \sum_{p'} v(p') Q(p, p') \right\}. \]

Straightforward calculus shows that maximized one-period profits are $\pi(p) = p^2/(4w)$.
Hence the final expression is

\[ v(p) = \max\left\{ s, \; \frac{p^2}{4w} + \beta \sum_{p'} v(p') Q(p, p') \right\}. \]

Solution to Exercise 4.1.9. Fix $x, x' \in \mathsf{X}$ with $x \leq x'$. Since $\sigma^*$ is binary, to show $\sigma^*$
is decreasing it suffices to show that $\sigma^*(x) = 0$ implies $\sigma^*(x') = 0$. Hence we suppose
that $\sigma^*(x) = 0$. This in turn implies that $e(x) < h^*(x)$. As $x \leq x'$, $e$ is decreasing and
$h^*$ is increasing on $\mathsf{X}$, we have $e(x') \leq e(x) < h^*(x) \leq h^*(x')$. Hence $\sigma^*(x') = 0$. We conclude that $\sigma^*$ is
decreasing on $\mathsf{X}$, as claimed.

Solution to Exercise 4.1.11. The solution is similar to that of
Exercise 4.1.9 and hence omitted.

Solution to Exercise 4.1.12. Either by manipulating the Bellman equation or
appealing to (4.15) on page 118, we see that the continuation value operator is defined
at $h \in \mathbb{R}^{\mathsf Z}$ by

\[ (Ch)(z) = \pi(z) + \beta \sum_{z'} \int \max\{s', h(z')\} \varphi(s') \, \mathrm{d}s' \, Q(z, z') \qquad (z \in \mathsf{Z}). \]

The next period scrap value $S_{t+1}$ is integrated out and the remaining function depends
only on $z \in \mathsf{Z}$.

Solution to Exercise 4.1.13. Let $\varphi_a$ and $\varphi_b$ be as stated. For $i \in \{a, b\}$ and $h \in \mathbb{R}^{\mathsf Z}$,
let

\[ (C_i h)(z) = \pi(z) + \beta \sum_{z'} \int \max\{s', h(z')\} \varphi_i(s') \, \mathrm{d}s' \, Q(z, z'). \]

Since, for each $z' \in \mathsf{Z}$, the function $s' \mapsto \max\{s', h(z')\}$ is increasing, we have

\[ \sum_{z'} \int \max\{s', h(z')\} \varphi_a(s') \, \mathrm{d}s' \, Q(z, z') \leq \sum_{z'} \int \max\{s', h(z')\} \varphi_b(s') \, \mathrm{d}s' \, Q(z, z'). \]

Hence $C_a h \leq C_b h$ for all $h \in \mathbb{R}^{\mathsf Z}$. As $C_b$ is order-preserving and globally stable,
Proposition 2.2.7 on page 67 implies that the fixed point of $C_b$ dominates the fixed point of
$C_a$. That is, $h_a^* \leq h_b^*$. But then, for any $z \in \mathsf{Z}$, we have $h_a^*(z) \leq h_b^*(z)$ and hence

\[ \sigma_b^*(z) = \mathbb{1}\{s \geq h_b^*(z)\} \leq \mathbb{1}\{s \geq h_a^*(z)\} = \sigma_a^*(z). \]

The interpretation of $\sigma_b^* \leq \sigma_a^*$ is that the firm exits at fewer states when the scrap value
distribution is $\varphi_b$. This makes sense, since the current scrap value offer $s$ is already
known, while future offers are more promising under $\varphi_b$ than $\varphi_a$. Hence continuing
is more attractive.

Solution to Exercise 4.2.1. In view of (4.13), the continuation value operator for
this problem is

\[ (Ch)(x) = -c + \beta \sum_{x'} \max\{\pi(x'), h(x')\} P(x, x') \qquad (x \in \mathsf{X}). \]

The monotonicity result for $h^*$ follows from Lemma 4.1.4 on page 114.

Solution to Exercise 4.2.2. If $(X_t)$ is IID with common distribution $\varphi$, then the
continuation value $h^*$ is constant; in particular, it is the unique solution to

\[ h = -c + \beta \sum_{x'} \max\{\pi(x'), h\} \varphi(x'). \]

Since constant functions are (weakly) decreasing, Exercise 4.1.11 applies and $\sigma^*$ is
increasing. Intuitively, the value of waiting is independent of the current state, while
the value of bringing the product to market is increasing in the current state. Hence,
if the firm brings the product to market in state $x$, then it should also do so at any
$x' \geq x$.

Solution to Exercise 5.1.2. The stochastic kernel is

\[
P(x, a, x') =
\begin{cases}
0 & \text{if } x' < a \\
(1 - p)^x & \text{if } x' = a \\
(1 - p)^{x + a - x'} p & \text{if } x' > a
\end{cases}
\tag{5.10}
\]

The middle case is obtained by observing that the next period state hits $x'$ when $x' = a$
if and only if $D_{t+1} \geq x$ and then using the expression for the geometric distribution.
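
The kernel in (5.10) can be coded directly and checked to sum to one over next period states. In the sketch below, the function name and parameter values are illustrative, and the explicit zero for $x' > x + a$ (which would require negative demand) is an added assumption that (5.10) leaves implicit.

```python
def kernel(x, a, x_next, p):
    """P(x, a, x') as in (5.10), with demand law phi(d) = (1 - p)^d * p, d = 0, 1, ..."""
    if x_next < a:
        return 0.0
    if x_next == a:
        return (1 - p) ** x                      # demand of at least x empties the stock
    if x_next <= x + a:
        return (1 - p) ** (x + a - x_next) * p   # demand equals x + a - x'
    return 0.0                                   # assumption: x' > x + a is unreachable

p = 0.3                                          # hypothetical demand parameter
row_sums = {
    (x, a): sum(kernel(x, a, x_next, p) for x_next in range(x + a + 1))
    for x in range(6) for a in range(4)
}
```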

Solution to Exercise 5.1.3. $T$ is a sup norm contraction mapping on $\mathbb{R}^{\mathsf X}$ because,
in view of the max-inequality lemma (page 58), for any $v, w$ in $\mathbb{R}^{\mathsf X}$,

\[
\begin{aligned}
|(Tv)(x) - (Tw)(x)|
  &\leq \beta \max_{a \in \Gamma(x)} \left| \sum_{d \geq 0} [v(f(x, a, d)) - w(f(x, a, d))] \varphi(d) \right| \\
  &\leq \beta \max_{a \in \Gamma(x)} \sum_{d \geq 0} |v(f(x, a, d)) - w(f(x, a, d))| \varphi(d).
\end{aligned}
\]

Since $\sum_{d \geq 0} \varphi(d) = 1$, it follows that, for arbitrary $x \in \mathsf{X}$,

\[ |(Tv)(x) - (Tw)(x)| \leq \beta \| v - w \|_\infty. \]

Taking the supremum over all $x \in \mathsf{X}$ yields the desired result.

Solution to Exercise 5.1.4. We take the action $A_t$ to be the choice of next period
wealth $W_{t+1}$, so that the action space is also $\mathsf{W}$. The feasible correspondence is

\[ \Gamma(w) = \{a \in \mathsf{W} : a \leq R w\} \qquad (w \in \mathsf{W}), \]

implying that $\mathsf{G} = \{(w, a) \in \mathsf{W} \times \mathsf{W} : a \leq R w\}$. The current reward is utility of
consumption, or

\[ r(w, a) = u\left( w - \frac{a}{R} \right) \qquad ((w, a) \in \mathsf{G}). \]

The stochastic kernel is $P(w, a, w') = \mathbb{1}\{w' = a\}$. This just states that next period
wealth $w'$ is equal to the action $a$ with probability one.

Solution to Exercise 5.1.5. To impose that workers never leave the firm, we
require $a \geq e$. Thus, the feasible correspondence is

\[ \Gamma(x) = \Gamma(e, w) = \{a \in \{0, 1\} : a \geq e\}. \]

The set of feasible state-action pairs is $\mathsf{G} = \{((e, w), a) \in \mathsf{X} \times \mathsf{A} : a \geq e\}$. The reward
function is

\[ r(x, a) = r((e, w), a) = a w + (1 - a) c. \]

Regarding the stochastic kernel, we need to define state transition probabilities for all
feasible state-action pairs. Letting $P[(e, w), a, (e', w')]$ be the probability of transitioning
to state $(e', w')$ given current state $(e, w)$ and current action $a \geq e$, we set

\[ P[(0, w), a, (e', w')] = \mathbb{1}\{e' = a\} \cdot \left[ a \mathbb{1}\{w' = w\} + (1 - a) Q(w, w') \right] \tag{5.14} \]

and $P[(1, w), 1, (e', w')] = \mathbb{1}\{e' = 1\} \mathbb{1}\{w' = w\}$. Equation (5.14) says that if $a = 0$ then
$e' = 0$ and the next wage is drawn from $Q(w, \cdot)$, while if $a = 1$ then $e' = 1$ and the
next wage is $w$. You can verify that $P$ is a stochastic kernel from $\mathsf{G}$ to $\mathsf{X}$.
To double check that these definitions work, we can verify that they lead to the
same Bellman equations that we saw in §3.3.1. Under the definitions of $\Gamma$, $r$ and $P$
just provided, we have $v(1, w) = w + \beta \mathbb{E} v(1, w)$. This implies that $v(1, w) = w/(1 - \beta)$,
which is what we expect for lifetime value of an agent employed with wage $w$.
Moreover, the Bellman equation for $v(0, w)$ agrees with the one we obtained for
an unemployed agent on page 97. To see this when $e = 0$, observe that the Bellman
equation is

\[
\begin{aligned}
v(0, w)
  &= \max_{a \in \{0, 1\}} \left\{ a w + (1 - a) c + \beta \sum_{(e', w')} v(e', w') P[(0, w), a, (e', w')] \right\} \\
  &= \max_{a \in \{0, 1\}} \left\{ a w + (1 - a) c + \beta \left[ a v(a, w) + (1 - a) \sum_{w'} v(a, w') Q(w, w') \right] \right\},
\end{aligned}
\]

where the second equation follows from (5.14). (You can see this by checking the
cases $a = 0$ and $a = 1$.) Rearranging and using $v(1, w) = w/(1 - \beta)$ now gives

\[ v(0, w) = \max\left\{ \frac{w}{1 - \beta}, \; c + \beta \sum_{w'} v(0, w') Q(w, w') \right\}. \tag{5.15} \]

This is the Bellman equation for an unemployed agent from the job search problem
we saw previously on page 97.

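Solution to Exercise 5.1.6. We need to show that $v_\sigma = (I - \beta P_\sigma)^{-1} r_\sigma$ obeys
$v_1 \leq v_\sigma \leq v_2$ where $v_1, v_2$ are as defined in the exercise. Regarding the upper bound, let
$\bar r := \| r \|_\infty$. We have

\[ (I - \beta P_\sigma)^{-1} r_\sigma \leq (I - \beta P_\sigma)^{-1} \bar r \mathbb{1} = \bar r \sum_{t \geq 0} (\beta P_\sigma)^t \mathbb{1} = \frac{\bar r}{1 - \beta} \mathbb{1} = v_2. \]

A similar argument shows that $v_1 \leq v_\sigma$.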
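
The upper bound can be illustrated on a small example. The sketch below computes $v_\sigma$ by iterating $v \mapsto r_\sigma + \beta P_\sigma v$ and compares it with $\bar r/(1-\beta)$. All numbers are hypothetical, and for simplicity `r_bar` is computed from `r_sigma` alone rather than from the full reward function.

```python
def policy_value(r_sigma, P_sigma, beta, tol=1e-12, max_iter=100_000):
    """Iterate v <- r_sigma + beta * P_sigma v, whose fixed point is v_sigma."""
    n = len(r_sigma)
    v = [0.0] * n
    for _ in range(max_iter):
        v_new = [r_sigma[x] + beta * sum(P_sigma[x][y] * v[y] for y in range(n))
                 for x in range(n)]
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    raise RuntimeError("policy evaluation did not converge")

beta = 0.9
r_sigma = [1.0, -2.0, 0.5]                 # hypothetical rewards under a fixed policy
P_sigma = [[0.5, 0.5, 0.0],
           [0.1, 0.8, 0.1],
           [0.0, 0.3, 0.7]]                # hypothetical transition matrix under that policy

v_sigma = policy_value(r_sigma, P_sigma, beta)
r_bar = max(abs(x) for x in r_sigma)       # sup norm of the rewards used here
upper = r_bar / (1 - beta)                 # the constant bound r_bar / (1 - beta)
```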

Solution to Exercise 5.1.7. Fix $\sigma \in \Sigma$. It is obvious that $T_\sigma$ is a self-map on $\mathbb{R}^{\mathsf X}$ and
$T_\sigma$ is clearly order-preserving, since $v \leq w$ implies $P_\sigma v \leq P_\sigma w$ and hence $T_\sigma v \leq T_\sigma w$.
Also, $T_\sigma$ is a contraction of modulus $\beta$ on $\mathbb{R}^{\mathsf X}$ under the supremum norm, since, for
any $v, w$ in $\mathbb{R}^{\mathsf X}$ we have

\[
\begin{aligned}
|(T_\sigma v)(x) - (T_\sigma w)(x)|
  &= \beta \left| \sum_{x'} P(x, \sigma(x), x') v(x') - \sum_{x'} P(x, \sigma(x), x') w(x') \right| \\
  &\leq \sum_{x'} P(x, \sigma(x), x') \beta |v(x') - w(x')| \leq \beta \| v - w \|_\infty.
\end{aligned}
\]

Taking the supremum over all $x \in \mathsf{X}$ yields the desired result. This contraction
property combined with Banach’s fixed point theorem implies that $T_\sigma$ has a unique fixed
point.

Now suppose that $v$ is the unique fixed point of $T_\sigma$. Then $v = r_\sigma + \beta P_\sigma v$. But then
$v = (I - \beta P_\sigma)^{-1} r_\sigma$. Hence $v = v_\sigma$. This establishes all claims in the lemma.

Solution to Exercise 5.1.10. Fix $v \in V$ and take $\hat\sigma$ to be $v$-greedy, so that

\[ \hat\sigma(x) \in \operatorname*{argmax}_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } x \in \mathsf{X}. \tag{5.23} \]

If $\sigma$ is any other feasible policy, then

\[ r(x, \hat\sigma(x)) + \beta \sum_{x'} v(x') P(x, \hat\sigma(x), x') \geq r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \]

at all $x$. In operator form, this is $T_{\hat\sigma} v \geq T_\sigma v$. Since $\sigma$ was an arbitrary feasible policy, we
have shown that $T_{\hat\sigma} v$ is the greatest element of $\{T_\sigma v\}_{\sigma \in \Sigma}$.
A similar argument replacing argmax with argmin in (5.23) shows that a least
element also exists.

Solution to Exercise 5.1.11. Fix $v \in \mathbb{R}^{\mathsf X}$. Part (i) follows from the fact that $\Gamma(x)$
is finite and nonempty at each $x \in \mathsf{X}$. Hence we can select an element $a_x^*$ from the
argmax in the definition of a $v$-greedy policy at each $x$ in $\mathsf{X}$. The resulting policy is
$v$-greedy. For part (ii) we need to show that $\sigma \in \Sigma$ is $v$-greedy if and only if

\[ r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \]

for all $x \in \mathsf{X}$. But this is immediate from the definition.

Regarding part (iii), it follows from the definitions that $(T_\sigma v)(x) \leq (Tv)(x)$ for all
$x \in \mathsf{X}$. At the same time, for any $v$-greedy $\sigma \in \Sigma$, we have $(T_\sigma v)(x) = (Tv)(x)$ for all $x$.
Hence $Tv = \bigvee_\sigma T_\sigma v$, as was to be shown.
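
Parts (ii) and (iii) can be illustrated by brute force on a tiny problem. The sketch below enumerates every policy of a hypothetical two-state, two-action MDP in which all actions are feasible, then compares the pointwise maximum of $T_\sigma v$ over policies with $Tv$. The model primitives and function names are assumptions made for the example.

```python
from itertools import product

beta = 0.9
states, actions = range(2), range(2)
# Hypothetical two-state, two-action MDP in which every action is feasible.
r = [[1.0, 0.0],
     [0.5, 2.0]]                          # r[x][a]
P = [[[0.7, 0.3], [0.2, 0.8]],
     [[0.5, 0.5], [0.9, 0.1]]]            # P[x][a][x']

def q(v, x, a):
    """r(x, a) + beta * sum_x' v(x') P(x, a, x')."""
    return r[x][a] + beta * sum(v[y] * P[x][a][y] for y in states)

def T_sigma(v, sigma):
    """Policy operator for the policy sigma (a tuple mapping state -> action)."""
    return [q(v, x, sigma[x]) for x in states]

def T(v):
    """Bellman operator."""
    return [max(q(v, x, a) for a in actions) for x in states]

def greedy(v):
    """A v-greedy policy, selecting one maximizer at each state."""
    return tuple(max(actions, key=lambda a: q(v, x, a)) for x in states)

v = [3.0, -1.0]
policies = list(product(actions, repeat=len(states)))
sup_over_policies = [max(T_sigma(v, s)[x] for s in policies) for x in states]
```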

Solution to Exercise 5.1.12. This result follows from Lemma 2.2.3 on page 59.
For the sake of the exercise, we also provide a direct proof:
Fix $v, w \in \mathbb{R}^{\mathsf X}$ and $x \in \mathsf{X}$. By Exercise 5.1.11 and the max-inequality lemma
(page 58), we have

\[
\begin{aligned}
|(Tv)(x) - (Tw)(x)|
  &= \left| \max_{\sigma \in \Sigma} (T_\sigma v)(x) - \max_{\sigma \in \Sigma} (T_\sigma w)(x) \right| \\
  &\leq \max_{\sigma \in \Sigma} |(T_\sigma v)(x) - (T_\sigma w)(x)| \leq \max_{\sigma \in \Sigma} \| T_\sigma v - T_\sigma w \|_\infty.
\end{aligned}
\]

Applying contractivity of $T_\sigma$ (Exercise 5.1.7) and taking the maximum over $x \in \mathsf{X}$, we get $\| Tv - Tw \|_\infty \leq \beta \| v - w \|_\infty$.

Solution to Exercise 5.1.13. Part (iii) of Proposition 5.1.1 implies (iv) because
every 𝑣 ∈ RX has at least one greedy policy (Exercise 5.1.11). In particular, at least
one 𝑣∗ -greedy policy exists.

Solution to Exercise 5.3.3. The Bellman equation becomes

\[ v(w, z, \varepsilon) = \max_{w' \leq R(w + z + \varepsilon)} \left\{ u\left( w + z + \varepsilon - \frac{w'}{R} \right) + \beta \sum_{z', \varepsilon'} v(w', z', \varepsilon') Q(z, z') \varphi(\varepsilon') \right\}. \]

Both $w$ and $w'$ are constrained to a finite set $\mathsf{W} \subset \mathbb{R}_+$. The expected value function
can be expressed as

\[ g(z, w') := \sum_{z', \varepsilon'} v(w', z', \varepsilon') Q(z, z') \varphi(\varepsilon'). \tag{5.38} \]

In the remainder of this section, we will say that a savings policy $\sigma$ is $g$-greedy if

\[ \sigma(z, w, \varepsilon) \in \operatorname*{argmax}_{w' \leq R(w + z + \varepsilon)} \left\{ u\left( w + z + \varepsilon - \frac{w'}{R} \right) + \beta g(z, w') \right\}. \]

Since it is an MDP, we can see immediately that if we replace $v$ in (5.38) with the
value function $v^*$, then a $g$-greedy policy will be an optimal one. We can rewrite the
Bellman equation in terms of expected value functions via

\[ g(z, w') = \sum_{z', \varepsilon'} \max_{w'' \leq R(w' + z' + \varepsilon')} \left\{ u\left( w' + z' + \varepsilon' - \frac{w''}{R} \right) + \beta g(z', w'') \right\} Q(z, z') \varphi(\varepsilon'). \]

Solution to Exercise 6.1.1. Set $L(x, x') := \beta(x) P(x, x')$ with $\beta(x) := 1/(1 + r(x))$.
We claim that (6.9) is finite for all $x \in \mathsf{X}$ and satisfies $v = (I - L)^{-1} \pi$ whenever $\rho(L) < 1$.
To see this, we apply Theorem 6.1.1 with $b(x, x') = \beta(x)$ and $h = \pi$.
Incidentally, to understand $v = \pi + L v$, suppose we buy the firm now, hold it for one
period and then sell it. The expected present value of the payoff is $\pi + L v$. If expected
benefit equals cost, then the value of (i.e., cost of buying) the firm now should equal
$\pi + L v$. That is, $v = \pi + L v$. We expand on these ideas in §6.3.

Solution to Exercise 6.1.3. Let $(X_t)$ be $P$-Markov with $X_0$ drawn from $\psi^*$, let $\| \cdot \|_*$
be the norm defined in the exercise and let $\mathbb{1}$ be an $n$-vector of ones. In view of (6.6),
for fixed $t \in \mathbb{N}$, we have

\[ \mathbb{E} \prod_{i=0}^t \beta_i = \mathbb{E} \left[ \mathbb{E} \left[ \prod_{i=0}^t \beta_i \,\Big|\, X_0 \right] \right] = \mathbb{E} (L^t \mathbb{1})(X_0) = \| L^t \mathbb{1} \|_*. \]

Since $\mathbb{1} \gg 0$, the local spectral radius result on page 70 yields (6.11).

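Solution to Exercise 6.1.4. Let $U$ be closed in $\mathbb{R}^{\mathsf X}$ and let $T$ be a self-map on $U$
such that $T^k$ is a contraction. Let $u^*$ be the unique fixed point of $T^k$. Fix $\varepsilon > 0$. Since
$T^{mk} u \to u^*$ as $m \to \infty$ for every $u \in U$, we can
choose $m$ such that $\| T^{mk} T u^* - u^* \| < \varepsilon$. Then, since $T^{mk} T = T T^{mk}$ and $T^{mk} u^* = u^*$,

\[ \| T u^* - u^* \| = \| T T^{mk} u^* - u^* \| = \| T^{mk} T u^* - u^* \| < \varepsilon. \]

Since $\varepsilon$ was arbitrary we have $\| T u^* - u^* \| = 0$, implying that $u^*$ is a fixed point of $T$.
The proof that $T^m u \to u^*$ for any $u$ is left to the reader.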

Solution to Exercise 6.2.2. Fix $\sigma \in \Sigma$ and let Assumption 6.2.1 on page 192 hold.
We saw in the proof of Lemma 6.2.1 that $T_\sigma v = r_\sigma + L_\sigma v$ and $v_\sigma = (I - L_\sigma)^{-1} r_\sigma$ is the
unique fixed point of this operator in $\mathbb{R}^{\mathsf X}$. Moreover, for fixed $v, w \in \mathbb{R}^{\mathsf X}$, we have

\[ |T_\sigma v - T_\sigma w| = |L_\sigma v - L_\sigma w| = |L_\sigma (v - w)| \leq L_\sigma |v - w|. \]

Hence, by Proposition 6.1.6, $T_\sigma$ is globally stable on $\mathbb{R}^{\mathsf X}$.

Solution to Exercise 6.2.3. Fix $\sigma \in \Sigma$. When (6.20) holds, we have $0 \leq L_\sigma \leq L$.
Exercise 2.2.28 on page 60 now implies that $\rho(L_\sigma) \leq \rho(L)$. Hence $\rho(L_\sigma) < 1$.

Solution to Exercise 6.2.4. Fix $\sigma \in \Sigma$. In the present setting, the discount operator
$L_\sigma$ from (6.17) becomes

\[ L_\sigma(x, x') = L_\sigma((y, z), (y', z')) = \beta(z) Q(z, z') R(y, \sigma(y), y'). \]

In view of Lemma 6.1.3, the spectral radius of $L_\sigma$ on $\mathcal{L}(\mathbb{R}^{\mathsf X})$ is equal to the spectral
radius of $L(z, z') = \beta(z) Q(z, z')$ on $\mathcal{L}(\mathbb{R}^{\mathsf Z})$. It follows that $\rho(L) < 1$ in $\mathcal{L}(\mathbb{R}^{\mathsf Z})$
implies $\rho(L_\sigma) < 1$ in $\mathcal{L}(\mathbb{R}^{\mathsf X})$, so Assumption 6.2.1 holds. Hence, under this condition,
Proposition 6.2.2 is valid.

Solution to Exercise 6.3.1. Assume 𝑚, 𝑑 ≫ 0 and write (6.34) as 𝜋 = 𝐴𝜋 + ℎ,
where ℎ ≔ 𝐴𝑑. A simple argument shows that ℎ ≫ 0 (see Exercise 2.3.13 on page 80
for a closely related claim). The claim in Exercise 6.3.1 now follows directly from
Lemma 6.1.4.

Solution to Exercise 6.3.4. Under a cum-dividend contract, purchasing at $t$ and
selling at $t + 1$ pays $D_t + \Pi_{t+1}$. Hence, applying the fundamental asset pricing equation,
the time $t$ price $\Pi_t$ of the contract must satisfy

\[ \Pi_t = D_t + \mathbb{E}_t M_{t+1} \Pi_{t+1}. \tag{6.36} \]

Proceeding as for the ex-dividend contract, the price conditional on current state $x$ is
$\pi(x) = d(x) + \sum_{x'} m(x, x') \pi(x') P(x, x')$. In vector form, this is $\pi = d + A\pi$. Solving out
for prices gives $\pi^* = (I - A)^{-1} d$.
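
The cum-dividend pricing formula can be checked numerically. The following is a minimal sketch, assuming a hypothetical three-state chain and a constant stochastic discount factor (the arrays `P`, `m` and `d` are illustrative values, not taken from the text), with 𝐴(𝑥, 𝑥′) = 𝑚(𝑥, 𝑥′)𝑃(𝑥, 𝑥′) as in the vector form above.

```python
import numpy as np

# Hypothetical primitives for a three-state example (not from the text).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])      # Markov matrix P(x, x')
m = np.full((3, 3), 0.95)            # stochastic discount factor m(x, x')
d = np.array([1.0, 2.0, 0.5])        # dividend d(x)

A = m * P                            # A(x, x') = m(x, x') P(x, x')
assert np.max(np.abs(np.linalg.eigvals(A))) < 1   # rho(A) < 1

# Cum-dividend price: pi* = (I - A)^{-1} d
pi_star = np.linalg.solve(np.eye(3) - A, d)

# Check the pricing equation pi = d + A pi
print(np.allclose(pi_star, d + A @ pi_star))
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is usually preferable numerically.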

Solution to Exercise 6.3.6. We seek a $v$ that solves

\[ v(x) = \sum_{x'} [1 + v(x')] A(x, x') \qquad (x, x' \in \mathsf{X}). \]

Treating $A$ as a matrix and $v$ as a column vector, this equation becomes $v = A\mathbb{1} + Av$,
where $\mathbb{1}$ is a column vector of ones. By the Neumann series lemma, $\rho(A) < 1$ implies
that this equation has the unique solution $v^* = (I - A)^{-1} A \mathbb{1}$. By the same lemma, $v^*$
has the alternative representation $v^* = \sum_{t \geq 0} A^t (A \mathbb{1}) = \sum_{t \geq 1} A^t \mathbb{1}$.
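
As an illustration of the two representations of 𝑣∗, the sketch below compares the linear-system solution with partial sums of the series. The matrix `A` is a hypothetical example with spectral radius 0.8, chosen only so that the Neumann series lemma applies.

```python
import numpy as np

A = np.array([[0.5, 0.3],
              [0.2, 0.6]])           # hypothetical matrix with rho(A) < 1
one = np.ones(2)
assert np.max(np.abs(np.linalg.eigvals(A))) < 1

v_star = np.linalg.solve(np.eye(2) - A, A @ one)   # (I - A)^{-1} A 1

# Partial sums of sum_{t >= 1} A^t 1
series, term = np.zeros(2), one.copy()
for _ in range(500):
    term = A @ term                  # term = A^t 1 at step t
    series += term

print(np.allclose(v_star, series))
```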

Solution to Exercise 7.1.1. Here is one possible answer: Let 𝑣1 , 𝑣2 be distinct


and let 𝑇 be the identity map on 𝑉 = [ 𝑣1 , 𝑣2 ]. Then 𝑇 is order-preserving and every
point in 𝑉 is fixed under 𝑇 . The set 𝑉 is a continuum because it contains all points
𝑣 = 𝛼𝑣1 + (1 − 𝛼) 𝑣2 with 0 ⩽ 𝛼 ⩽ 1.

Solution to Exercise 7.1.3. If the condition 𝑎 < 𝑔 ( 𝑎) in Proposition 7.1.2 is


dropped then 𝑔 could be the identity map, which has multiple fixed points and is
not globally stable.

Solution to Exercise 7.1.7. It is straightforward to show that $F_x' > 0$ on $(0, \infty)$,
which proves that $F_x$ is increasing. Some additional algebra confirms that $\theta \in (0, 1]$
implies $F_x'' > 0$, while $\theta < 0$ and $\theta > 1$ both imply $F_x'' < 0$. Details are left to the reader.

Solution to Exercise 7.1.8. Observe that ( 𝐺𝑢)( 𝑥 ) = 𝐹 𝑥 [( 𝐴𝑢)( 𝑥 )]. Since 𝐴 is order-
preserving and 𝐹 is increasing, 𝑢 ⩽ 𝑣 implies 𝐺𝑢 ⩽ 𝐺𝑣. In particular, 𝐺 is order-
preserving. If 𝜃 ∈ (0, 1], then 𝐹 is convex. Hence, fixing 𝑢, 𝑣 ∈ 𝑉 and 𝜆 ∈ [0, 1] (and
dropping 𝑥 from our notation), we have

𝐹 𝐴 ( 𝜆𝑢 + (1 − 𝜆 ) 𝑣) = 𝐹 ( 𝜆 𝐴𝑢 + (1 − 𝜆 ) 𝐴𝑣) ⩽ 𝜆𝐹 𝐴𝑢 + (1 − 𝜆 ) 𝐹 𝐴𝑣.

Hence 𝐺 is convex. The proof that 𝐺 is concave for other values of 𝜃 is similar and
omitted.

Solution to Exercise 7.1.9. Setting $v_t = \sigma_t^{-1/\psi}$, we can write (7.3) as

\[
v_t = \left\{ 1 + \beta^\psi \left[ \mathbb{E}_t R_{t+1}^{(\psi - 1)/\psi} v_{t+1} \right]^\psi \right\}^{1/\psi}. \tag{7.4}
\]

We conjecture a stationary Markov solution $v_t = v(X_t)$ for some $v \in \mathbb{R}^{\mathsf{X}}$ with
$v \gg 0$. This $v$ must satisfy

\[
v(x) = \left\{ 1 + \beta^\psi \left[ \sum_{x'} f(x')^{(\psi - 1)/\psi} v(x') P(x, x') \right]^\psi \right\}^{1/\psi}
\qquad (x \in \mathsf{X}).
\]

Using the definition of $A$ in the exercise, we can write the equation in vector form
as $v = [1 + (Av)^\psi]^{1/\psi}$. It follows from Theorem 7.1.4 that a unique strictly positive
solution to this equation exists if and only if $\rho(A)^\psi < 1$. This proves the claim in the
exercise.

Solution to Exercise 7.2.1. Fixing $t$, rearranging $v^* = (I - \beta P)^{-1} r$ to $v^* = r + \beta P v^*$
and evaluating at $X_t$ gives

\[
V_t^* = v^*(X_t) = r(X_t) + \beta \sum_{x'} v^*(x') P(X_t, x')
= u(C_t) + \beta \, \mathbb{E} [ v^*(X_{t+1}) \mid X_t ] = u(C_t) + \beta \, \mathbb{E}_t V_{t+1}^* .
\]

Hence $(V_t^*)_{t \geq 0}$ obeys (7.5), as claimed.

Solution to Exercise 7.2.7. Let $K$ be as stated (see (7.14)) and fix $v \gg 0$. Clearly
$v^\gamma \gg 0$ and hence $P v^\gamma \gg 0$ (see Exercise 3.2.5 on page 94). Since $h \geq 0$, it follows
easily that $Kv \gg 0$.

Solution to Exercise 7.2.8. If $v \gg 0$, then, since $h \gg 0$ also holds, we have

\[ \hat{K} v = \left\{ h + \beta (P v)^{1/\theta} \right\}^\theta \geq h^\theta \gg 0. \]

Hence $\hat{K}$ is a self-map on $V$. In addition, the statement $Kv = v$ is equivalent to
$v^\alpha = h + \beta (P v^\gamma)^{\alpha/\gamma}$. Using $\theta = \gamma/\alpha$, we can rewrite the last equation as
$v^\gamma = [h + \beta (P v^\gamma)^{1/\theta}]^\theta$. In other words, $v = Kv$ if and only if $v^\gamma = \hat{K} v^\gamma$.

Solution to Exercise 7.3.1. Let 𝑅 be a linear operator on RX that is also a certainty


equivalent. In particular, 𝑅 is order-preserving and 𝑅1 = 1. Since 𝑅 is order-preserving
and linear, 𝑅 is a positive linear operator (see Exercise 2.3.11). Hence 𝑅 is a Markov
operator.

Solution to Exercise 7.3.4. Since $V$ is all of $\mathbb{R}^{\mathsf{X}}$, the condition $R_\tau \colon V \to V$ is
trivially satisfied. Regarding monotonicity, fix $v, w \in V$ with $v \leq w$ and $x \in \mathsf{X}$. Let
$X$ be a draw from $P(x, \cdot)$. Then $(R_\tau v)(x) = Q_\tau[v(X)] \leq Q_\tau[w(X)] = (R_\tau w)(x)$, where
the inequality is by Exercise 2.2.35 on page 64. Moreover, given $\lambda \in \mathbb{R}$ and a random
variable $Y$ with $\mathbb{P}\{Y = \lambda\} = 1$, we clearly have $Q_\tau(Y) = \lambda$. It follows that $R_\tau \lambda \mathbb{1} = \lambda \mathbb{1}$.
Hence $R_\tau$ is a certainty equivalent operator, as was to be shown.

Solution to Exercise 7.3.7. Fix 𝑣 ∈ 𝑉 = RX , 𝜆 ∈ R+ and 𝑥 ∈ X. If 𝑋 ∼ 𝑃 ( 𝑥, ·), then

( 𝑅𝜏 ( 𝑣 + 𝜆 ))( 𝑥 ) = 𝑄 𝜏 ( 𝑣 ( 𝑋 ) + 𝜆 ) = 𝑄 𝜏 ( 𝑣 ( 𝑋 )) + 𝜆,

where the second equality is by Exercise 1.2.28 on page 32. Since 𝑥 was arbitrary, we
have 𝑅𝜏 ( 𝑣 + 𝜆 ) = 𝑅𝜏 𝑣 + 𝜆 . Hence 𝑅𝜏 is constant-subadditive, as claimed.

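Solution to Exercise 7.3.8. Fix $v \in V$, $P \in \mathcal{M}(\mathbb{R}^{\mathsf{X}})$, $x \in \mathsf{X}$ and $\lambda \in \mathbb{R}_+$. Let $X$ be a draw
from $P(x, \cdot)$. We have

\[
\begin{aligned}
(R_\theta (v + \lambda))(x)
&= \frac{1}{\theta} \ln \left\{ \mathbb{E} \exp[\theta (v(X) + \lambda)] \right\} \\
&= \frac{1}{\theta} \ln \left\{ \mathbb{E} \exp[\theta v(X)] \cdot \exp(\theta \lambda) \right\} \\
&= \frac{1}{\theta} \ln \left\{ \mathbb{E} \exp[\theta v(X)] \right\} + \lambda.
\end{aligned}
\]

Hence constant-subadditivity holds.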
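
The identity just derived can be confirmed on a small example. This is a minimal sketch, assuming a hypothetical two-state Markov matrix `P`, value vector `v`, risk parameter `theta` and shift `lam`; the helper name `entropic_ce` is introduced here only for illustration.

```python
import numpy as np

def entropic_ce(v, P, theta):
    """(R_theta v)(x) = (1/theta) * ln( sum_x' exp(theta * v(x')) P(x, x') )."""
    return np.log(P @ np.exp(theta * v)) / theta

# Hypothetical inputs chosen only for illustration.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
v = np.array([1.0, -2.0])
theta, lam = -0.5, 3.0

lhs = entropic_ce(v + lam, P, theta)     # R_theta (v + lambda)
rhs = entropic_ce(v, P, theta) + lam     # R_theta v + lambda
print(np.allclose(lhs, rhs))
```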

Solution to Exercise 7.3.9. Let the primitives be as stated. Fix $v, w \in V$. By
monotonicity and constant-subadditivity of $R$, we have

\[ Rv = R(v - w + w) \leq R(\| v - w \|_\infty \mathbb{1} + w) \leq Rw + \| v - w \|_\infty \mathbb{1}. \]

Hence $(Rv)(x) - (Rw)(x) \leq \| v - w \|_\infty$ for all $x \in \mathsf{X}$. Reversing the roles of $v$ and $w$
proves the claim.

Solution to Exercise 7.3.10. Fix $v, v' \in V$, $\theta < 0$, $x \in \mathsf{X}$ and $\alpha \in [0, 1]$. Letting
$X \sim P(x, \cdot)$, $Z = v(X)$ and $Z' = v'(X)$, we have

\[
(R(\alpha v + (1 - \alpha) v'))(x) = \mathcal{E}_\theta(\alpha Z + (1 - \alpha) Z') \geq \alpha \mathcal{E}_\theta(Z) + (1 - \alpha) \mathcal{E}_\theta(Z').
\]

The last expression is just $\alpha (Rv)(x) + (1 - \alpha)(Rv')(x)$, so $R$ is concave on $V$, as claimed.

Solution to Exercise 7.3.11. Regarding (i), fix 𝜆 ∈ [0, 1] and 𝑣, 𝑤 ∈ 𝑉 . Using


subadditivity and positive homogeneity, we have

𝑅 ( 𝜆𝑣 + (1 − 𝜆 ) 𝑤) ⩽ 𝑅 ( 𝜆𝑣) + 𝑅 ((1 − 𝜆 ) 𝑤) = 𝜆𝑅𝑣 + (1 − 𝜆 ) 𝑅𝑤.

This proves that 𝑅 is convex on 𝑉 . The proof of (ii) is similar.

Solution to Exercise 7.3.14. It is not difficult to show that $U_c / U_y = ((1 - \beta)/\beta)(y/c)^{1 - \alpha}$.
Taking logs and rearranging gives $\ln(y/c) = (1/(1 - \alpha)) \ln(U_c/U_y) + k$,
where $k$ is a constant. Using the definition in the exercise yields $EIS = 1/(1 - \alpha)$.

Solution to Exercise 7.3.15. Iterating forward from $V_0$ gives

\[
V_0 = u(C_0) + \beta \mathbb{E}_0 V_1 = u(C_0) + \beta \mathbb{E}_0 [u(C_1) + \beta \mathbb{E}_1 V_2] = u(C_0) + \beta \mathbb{E}_0 u(C_1) + \beta^2 \mathbb{E}_0 V_2 .
\]

Continuing forward until time $m$ yields $V_0 = \sum_{t=0}^{m-1} \beta^t \mathbb{E}_0 u(C_t) + \beta^m \mathbb{E}_0 V_m$. Shifting to
functional form and using $r = u \circ c$, the last expression becomes

\[ v = \sum_{t=0}^{m-1} (\beta P)^t r + (\beta P)^m w. \]

By Exercise 1.2.17 on page 22, this is just $K^m w$ when $K$ is the associated Koopmans
operator $Kv = r + \beta P v$ and, moreover, $K^m w \to v^* \coloneqq (I - \beta P)^{-1} r$.
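
The finite-sum representation and the convergence of the iterates can be illustrated as follows. This is a sketch under assumed primitives: the matrix `P`, reward vector `r`, terminal condition `w`, discount factor `beta` and horizon `m` are hypothetical values picked for the example.

```python
import numpy as np

beta = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])           # hypothetical Markov matrix
r = np.array([1.0, 2.0])             # hypothetical reward vector r = u(c(x))
w = np.array([10.0, -5.0])           # arbitrary terminal condition

def K(v):
    """Koopmans operator Kv = r + beta * P v."""
    return r + beta * (P @ v)

m = 25
v = w.copy()
for _ in range(m):
    v = K(v)                         # v = K^m w after the loop

# Closed form: sum_{t < m} (beta P)^t r + (beta P)^m w
bP = beta * P
closed = sum(np.linalg.matrix_power(bP, t) @ r for t in range(m)) \
         + np.linalg.matrix_power(bP, m) @ w

v_star = np.linalg.solve(np.eye(2) - bP, r)      # (I - beta P)^{-1} r
print(np.allclose(v, closed), np.max(np.abs(v - v_star)))
```

The printed distance shrinks geometrically in `m`, at rate `beta` in the supremum norm, since `P` is a Markov matrix.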

Solution to Exercise 7.3.17. The additive case is obvious. Regarding the Leontief
case, fix 𝑥 ∈ X, 𝑦 ∈ R and 𝜆 ∈ R+ . We have

min{𝑟 ( 𝑥 ) , 𝛽 ( 𝑦 + 𝜆 )} ⩽ min{𝑟 ( 𝑥 ) + 𝛽𝜆, 𝛽 𝑦 + 𝛽𝜆 } = min{𝑟 ( 𝑥 ) , 𝛽 𝑦 } + 𝛽𝜆.

That is, 𝐴 ( 𝑥, 𝑦 + 𝜆 ) ⩽ 𝐴 ( 𝑥, 𝑦 ) + 𝛽𝜆 . Hence Blackwell’s condition holds.
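
A quick randomized check of the Leontief inequality is sketched below. The function name `leontief`, the value of `beta` and the sampling ranges are assumptions made for illustration; the check is a sanity test, not a proof.

```python
import random

beta = 0.95

def leontief(r_x, y):
    """Leontief aggregator A(x, y) = min{r(x), beta * y}, with r(x) passed as r_x."""
    return min(r_x, beta * y)

random.seed(0)
ok = True
for _ in range(10_000):
    r_x = random.uniform(-10, 10)    # reward at some state x
    y = random.uniform(-10, 10)      # continuation value
    lam = random.uniform(0, 10)      # nonnegative shift
    # Blackwell's condition: A(x, y + lam) <= A(x, y) + beta * lam
    ok = ok and leontief(r_x, y + lam) <= leontief(r_x, y) + beta * lam + 1e-12
print(ok)
```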

Solution to Exercise 7.3.19. We already know from Exercise 7.3.17 that the Leon-
tief aggregator satisfies Blackwell’s condition when 𝛽 ∈ (0, 1). Since 𝑅𝜏 is constant-
subadditive, global stability follows from Proposition 7.3.3.

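Solution to Exercise 7.3.22. We saw in Lemma 7.2.4 that $\Phi$ is a homeomorphism
from $V$ to itself. For $v \in V$, we have

\[
\hat{K} \Phi v = \left\{ h + (b^\theta P \Phi v)^{1/\theta} \right\}^\theta
= \left\{ h + b (P v^\gamma)^{1/\theta} \right\}^\theta
= \left\{ h + b (P v^\gamma)^{\alpha/\gamma} \right\}^{\gamma/\alpha}
= \Phi K v.
\]

Thus, $\hat{K} \Phi = \Phi K$ on $V$. Rearranging gives $K = \Phi^{-1} \hat{K} \Phi$, so $(V, K)$ and $(V, \hat{K})$ are
topologically conjugate, as claimed.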

Solution to Exercise 8.1.2. This problem can be formulated as an MDP by setting
the state to $x \coloneqq (y, z)$, taking values in $\mathsf{X} \coloneqq \mathsf{Y} \times \mathsf{Z}$. The action space is $\mathsf{Y}$. The feasible
correspondence is $x \mapsto \Gamma(x)$ and current reward is $r(x, a) = r((y, z), y') = F(y, z, y')$.
The stochastic kernel is

\[ P(x, a, x') = P((y, z), a, (y', z')) = \mathbb{1}\{y' = a\} \, Q(z, z'). \]

This MDP $(\Gamma, \beta, r, P)$ has a Bellman equation identical to (8.7).

Solution to Exercise 8.1.3. We must check that $(\Gamma, V, B)$ satisfies conditions (8.2)–(8.3).
The monotonicity condition holds because $\beta$ is nonnegative, so $w \leq v$ implies

\[
\sum_{x'} w(x') \beta(x, a, x') P(x, a, x') \leq \sum_{x'} v(x') \beta(x, a, x') P(x, a, x')
\quad \text{for all } (x, a) \in \mathsf{G}.
\]

The consistency condition is trivial, since $V$ is all of $\mathbb{R}^{\mathsf{X}}$.

Solution to Exercise 8.1.6. Fix 𝜎 ∈ Σ. The claim that 𝑇𝜎 is a self-map on 𝑉 follows


immediately from the consistency condition in (8.3). The order-preserving property
follows from the monotonicity condition in (8.2).

Solution to Exercise 8.1.7. Fix 𝑣 ∈ 𝑉 and consider the set {𝑇𝜎 𝑣}𝜎∈Σ ⊂ 𝑉 . We first
show that {𝑇𝜎 𝑣}𝜎∈Σ contains a greatest element. Suppose that 𝜎¯ is 𝑣-greedy. If 𝜎 is
any other policy, then

(𝑇𝜎 𝑣)( 𝑥 ) = 𝐵 ( 𝑥, 𝜎 ( 𝑥 ) , 𝑣) ⩽ (𝑇𝜎¯ 𝑣)( 𝑥 ) for all 𝑥 ∈ X.

Hence 𝑇𝜎¯ 𝑣 is a greatest element of {𝑇𝜎 𝑣}𝜎∈Σ .


The proof that {𝑇𝜎 𝑣}𝜎∈Σ contains a least element is analogous, after replacing
argmax with argmin.

Solution to Exercise 8.1.8. Regarding part (i), fix $v \in V$. For any $\sigma \in \Sigma$ and $x \in \mathsf{X}$,
we have

\[
(Tv)(x) = \max_{a \in \Gamma(x)} B(x, a, v) = \max_{\sigma \in \Sigma} B(x, \sigma(x), v) = \max_{\sigma \in \Sigma} (T_\sigma v)(x).
\]

Since $x$ was chosen arbitrarily, we have confirmed that $Tv = \bigvee_{\sigma \in \Sigma} T_\sigma v$.
Regarding part (ii), $\sigma$ is $v$-greedy if and only if

\[ \sigma(x) \in \operatorname*{argmax}_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \mathsf{X}. \]

This is equivalent to $B(x, \sigma(x), v) = \max_{a \in \Gamma(x)} B(x, a, v)$ for all $x \in \mathsf{X}$. Hence $\sigma$ is
$v$-greedy if and only if $T_\sigma v = Tv$, as claimed.
Regarding (iii), to see that 𝑇 is a self-map on 𝑉 , fix 𝑣 ∈ 𝑉 and let 𝜎 be 𝑣-greedy.
Then, by (ii), 𝑇 𝑣 = 𝑇𝜎 𝑣 ∈ 𝑉 . Hence 𝑇 is a self-map, as claimed. The fact that 𝑇 is
order-preserving on 𝑉 follows immediately from the monotonicity property of 𝐵 in
(8.2).

Solution to Exercise 8.1.9. Here’s a proof for $T$ and fixed $k \in \mathbb{N}$: At arbitrary
$x \in \mathsf{X}$,

\[ (T^k v)(x) = (T(T^{k-1} v))(x) = \max_{a \in \Gamma(x)} B(x, a, T^{k-1} v). \tag{8.16} \]

This confirms the claim in (8.14).

Solution to Exercise 8.1.13. Let $\{T_\sigma\}$ be the policy operators associated with a
bounded and well-posed RDP $(\Gamma, V, B)$. Let $\hat{V} \coloneqq [v_1, v_2]$, where $v_1, v_2$ are as in (8.20).
Fix $\sigma \in \Sigma$ and let $v_\sigma$ be the $\sigma$-value function of policy operator $T_\sigma$. It follows from (8.20)
that $T_\sigma$ is a self-map on $\hat{V}$. By the Knaster–Tarski fixed point theorem (page 213), $T_\sigma$
has at least one fixed point in $\hat{V}$. By uniqueness, that fixed point is $v_\sigma$.

Solution to Exercise 8.1.14. Let $(\Gamma, V, B)$ be as stated. Let $\bar{r} = \| r \|_\infty$. We claim
that (8.20) holds when $v_2 = \bar{r}/(1 - \beta)$ and $v_1 = -v_2$. (The functions $v_1$ and $v_2$ are
constant.) To see this, observe that

\[ B(x, a, v_2) = r(x, a) + \beta v_2 \leq \bar{r} + \beta v_2 = v_2 \]

for all $(x, a) \in \mathsf{G}$. This is the upper bound condition in (8.20). The proof of the lower
bound condition is similar.
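
The bounds can be checked on a randomly generated MDP. The sketch below assumes hypothetical primitives (random rewards `r` and kernel `P`, with every action feasible in every state), so it illustrates the inequalities rather than any model from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, beta = 4, 3, 0.9

# Hypothetical MDP primitives: rewards r(x, a) and kernel P(x, a, x').
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)    # each P(x, a, .) is a distribution

def B(v):
    """B(x, a, v) = r(x, a) + beta * sum_x' v(x') P(x, a, x'), for all (x, a)."""
    return r + beta * (P @ v)

r_bar = np.max(np.abs(r))                        # sup norm of r
v2 = np.full(n_states, r_bar / (1 - beta))       # constant upper bound
v1 = -v2                                         # constant lower bound

upper_ok = np.all(B(v2) <= v2[:, None] + 1e-12)  # B(x, a, v2) <= v2(x)
lower_ok = np.all(B(v1) >= v1[:, None] - 1e-12)  # B(x, a, v1) >= v1(x)
print(upper_ok, lower_ok)
```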

Solution to Exercise 8.1.15. Letting $\bar{r} = \| e \|_\infty + \| c \|_\infty$, it can be shown that (8.20)
holds when $v_2 = \bar{r}/(1 - \beta)$ and $v_1 = -v_2$. The argument is similar to that provided for
the MDP case in the solution to Exercise 8.1.14.

Solution to Exercise 8.1.16. Let $\bar{r} \coloneqq \| r \|_\infty \mathbb{1}$. We claim that the functions
$v_2 = (I - L)^{-1} \bar{r}$ and $v_1 = -v_2$ satisfy (8.20). To see this, observe that

\[ \bar{r} + L v_2 = \bar{r} + L v_2 - v_2 + v_2 = \bar{r} - (I - L) v_2 + v_2 = \bar{r} - \bar{r} + v_2 = v_2 . \]

Since $B(x, a, v_2) \leq \bar{r}(x) + (L v_2)(x)$, this proves that $v_2$ satisfies the upper bound condition
in (8.20). The proof of the lower bound condition is similar.

Solution to Exercise 8.1.17. The only nontrivial condition to check is that the
bound $B(x, x', v_2) \leq v_2(x)$ holds for all feasible $(x, x')$. In particular, we need to show
that

\[ c(x, x') + C(x') \leq C(x) \quad \text{whenever } x' \in \mathcal{O}(x) \text{ and } x' \neq x. \]

This is true by the definition of $C$, since $C(x)$ is the maximum travel cost to $d$ and
$c(x, x') + C(x')$ is the cost of traveling to $d$ via $x'$ and then taking the most expensive
path.

Solution to Exercise 8.1.19. The map $\varphi(m) = m^\gamma$ is a homeomorphism from $V$ to
itself and (8.21) holds under $\varphi$. This implies the claim in the exercise.

Solution to Exercise 8.2.3. Let $\mathscr{R} = (\Gamma, V, B)$ satisfy Blackwell’s condition. Fix
$v, w \in V$ and $(x, a) \in \mathsf{G}$. Observe that $v = w + v - w \leq w + \| v - w \|_\infty$. By monotonicity
of $B$ and Blackwell’s condition, we have

\[ B(x, a, v) \leq B(x, a, w + \| v - w \|_\infty) \leq B(x, a, w) + \beta \| v - w \|_\infty . \]

As a result, $B(x, a, v) - B(x, a, w) \leq \beta \| v - w \|_\infty$. Reversing the roles of $v$ and $w$ yields

\[ | B(x, a, v) - B(x, a, w) | \leq \beta \| v - w \|_\infty . \]

Since $\beta < 1$, the RDP $\mathscr{R}$ is contracting.

Solution to Exercise 8.2.8. Let $\mathscr{R} = (\Gamma, V, B)$ be as stated. Fix $v, w \in V$. Using the
max-inequality (page 58), we obtain

\[ |(Tv)(x) - (Tw)(x)| \leq \max_{a \in \Gamma(x)} | B(x, a, v) - B(x, a, w) | \qquad (x \in \mathsf{X}). \]

Let $\sigma$ be a map from $\mathsf{X}$ to $\mathsf{A}$ such that $\sigma(x)$ is a maximizer of the right hand side of this
expression for all $x$. Clearly $\sigma \in \Sigma$ and

\[ |(Tv)(x) - (Tw)(x)| \leq | B(x, \sigma(x), v) - B(x, \sigma(x), w) | \quad \text{for all } x \in \mathsf{X}. \]

Since $\mathscr{R}$ is eventually contracting, there is a positive linear operator $L_\sigma$ with $\rho(L_\sigma) < 1$
and $|Tv - Tw| \leq L_\sigma |v - w|$. Proposition 6.1.6 on page 190 implies that $T$ is eventually
contracting on $V$. Since $V$ is closed, it follows that $T$ is globally stable (Theorem 6.1.5,
page 189).

Solution to Exercise 8.2.10. We discuss the first case, regarding (8.37). When
(8.40) holds, by finiteness of $\mathsf{G}$, we can take an $\varepsilon > 0$ such that

\[ B(x, a, v_2) \leq v_2(x) - \varepsilon \quad \text{for all } (x, a) \in \mathsf{G}. \]

We then have

\[ \varepsilon \leq v_2(x) - B(x, a, v_2) \leq v_2(x) - B(x, a, v_1) \leq v_2(x) - v_1(x) \]

for all $x$, so $0 < \varepsilon \leq \| v_2 - v_1 \|_\infty$. Set $\delta \coloneqq \varepsilon / \| v_2 - v_1 \|_\infty$. From (8.40) we get

\[ B(x, a, v_2) \leq v_2(x) - \delta \| v_2 - v_1 \|_\infty \leq v_2(x) - \delta [v_2(x) - v_1(x)] \]

for arbitrary $(x, a) \in \mathsf{G}$. Hence (8.37) holds.

Solution to Exercise 8.2.11. We prove (8.41) and leave (8.40) to the reader. For
given $(x, a) \in \mathsf{G}$,

\[
B(x, a, v_1) = r(x, a) + \beta \sum_{x'} v_1(x') P(x, a, x')
\geq r_1 + \beta \frac{r_1 - \varepsilon}{1 - \beta}
= \frac{r_1 - \beta r_1 + \beta r_1 - \beta \varepsilon}{1 - \beta}
= v_1 + \varepsilon.
\]

Hence (8.41) is confirmed.

Solution to Exercise 8.3.1. For each fixed $a \in \Gamma(x)$, the map $R_\tau^a$ is a version of the
quantile certainty equivalent operator defined in Exercise 7.3.4 on page 232. With
this observation, we can replicate the proof of Proposition 8.3.1, after replacing $R_\theta^a$
with $R_\tau^a$. The latter is also constant-subadditive, by Exercise 7.3.7 on page 233.

Solution to Exercise 8.3.4. Regarding (b), note that $v_1$ is constant. Hence, at
fixed $(x, a) \in \mathsf{G}$ and $d \in D(x, a)$, we have

\[
B(x, a, d, v_1) = r(x, a) + \beta \frac{r_1 - \varepsilon}{1 - \beta}
\geq r_1 + \beta \frac{r_1 - \varepsilon}{1 - \beta}
= \frac{r_1 - \beta r_1 + \beta r_1 - \beta \varepsilon}{1 - \beta}
= v_1 + \varepsilon.
\]

Hence (b) is confirmed. Regarding (c), we have

\[
B(x, a, d, v_2) = r(x, a) + \beta \frac{r_2}{1 - \beta} \leq r_2 + \beta \frac{r_2}{1 - \beta} = v_2 .
\]

We have now verified conditions (a)–(d) on page 277.

Solution to Exercise 8.3.5. In this case, where $\kappa = \gamma$, $B$ reduces to

\[
B(x, a, v) = \left\{ r(x, a) + \beta \left( \sum_{x'} v(x')^\gamma P(x, a, x') \right)^{\alpha/\gamma} \right\}^{1/\alpha},
\]

where $P(x, a, x') \coloneqq \int P_\theta(x, a, x') \, \mu(x, \mathrm{d}\theta)$ is a weighted average over beliefs. This is
identical to the Epstein–Zin aggregator (see Example 8.1.7).

Solution to Exercise 8.3.10. A feasible policy $\sigma$ is a map from $\mathsf{X}$ to itself satisfying
$\sigma(x) \in \mathcal{O}(x)$ for all $x$. Recalling that $\mathsf{X}$ is finite and setting $n = |\mathsf{X}|$, the stated
assumptions imply that $\sigma^k(x) = d$ for all $k \geq n$ (since all paths lead to $d$ in at most
$n$ steps). Given that $c(d, d) = 0$, it follows that the lifetime cost of following $\sigma$ from
initial condition $x$ is no more than

\[
c(x, \sigma(x)) + \beta c(\sigma(x), \sigma^2(x)) + \beta^2 c(\sigma^2(x), \sigma^3(x)) + \cdots + \beta^{n-1} c(\sigma^{n-1}(x), \sigma^n(x)).
\]

With $c^\top \coloneqq \max c$, we then have

\[ C(x) \leq c^\top \, \frac{1 - \beta^n}{1 - \beta}. \]

Solution to Exercise 9.1.1. By the Neumann series lemma, $T$ has a unique fixed
point in $V$ given by $\bar{v} \coloneqq (I - A)^{-1} r$. The operator $T$ is upward stable because, given $v \in \mathbb{R}^{\mathsf{X}}$ with
$v \leq Tv$, we have $v \leq r + Av$, or $(I - A) v \leq r$. By the Neumann series lemma, $(I - A)^{-1}$
is a positive linear operator (as the sum of nonnegative matrices), so we can multiply
by this inverse to get $v \leq (I - A)^{-1} r = \bar{v}$. This proves upward stability. Reversing the
inequalities shows that downward stability also holds.

Solution to Exercise 9.2.1. The first part of the exercise is immediate from the
definitions. For the second, take 𝑣, 𝑤 ∈ 𝑉 with 𝑣 ⪯ 𝑤. Since 𝑇𝜎 is order-preserving, we
have 𝑇𝜎 𝑣 ⪯ 𝑇𝜎 𝑤 for all 𝜎 ∈ Σ. Hence 𝑇𝜎 𝑣 ⪯ 𝑇𝑤 for all 𝜎 ∈ Σ. Therefore 𝑇 𝑣 ⪯ 𝑇𝑤.

Solution to Exercise 9.2.3. For all 𝑣 ∈ 𝑉Σ , we have 𝑣 = 𝑣𝜎 for some 𝜎, and hence
𝑇 𝑣 ⩾ 𝑇𝜎 𝑣 = 𝑇𝜎 𝑣𝜎 = 𝑣𝜎 = 𝑣.

Solution to Exercise 9.2.4. This follows directly from Exercise 1.2.27 on page 31.

Solution to Exercise 9.2.5. This result follows from Exercise 9.2.4, since, at each
𝑥 , the maximizing distribution 𝜑 𝑥 is supported on argmax 𝑎∈ Γ ( 𝑥 ) 𝐵 ( 𝑥, 𝑎, 𝑣).

Solution to Exercise 9.2.7. Regarding (i), fix 𝑣 ∈ 𝑉 . Policy 𝜎 is 𝑣-min-greedy for


A if and only if 𝑇𝜎 𝑣  𝑇𝜏 𝑣 for all 𝜏 ∈ Σ, which is equivalent to 𝑇𝜎 𝑣  𝜕 𝑇𝜏 𝑣 for all 𝜏 ∈ Σ.
Hence 𝜎 is 𝑣-min-greedy for A if and only if 𝜎 is 𝑣-max-greedy for A𝜕 .
Regarding (ii)–(iii), fix 𝑣 ∈ 𝑉 and let 𝜎 be 𝑣-min-greedy for A (and hence 𝑣-max-
greedy for Â). We then have 𝑇 𝜕 𝑣 = 𝑇𝜎 𝑣 = 𝑇 𝑣. Hence 𝑇 𝜕 = 𝑇 . Similarly, at the same 𝑣
and with the same policy 𝜎, 𝐻 𝜕 𝑣 is equal to 𝑣𝜎 and so is 𝐻 . A similar argument gives
𝑇𝜕 = 𝑇 and 𝐻𝜕 = 𝐻 .
Regarding (iv), Lemma 9.1.2 implies that A is order stable if and only if A𝜕 is order
stable.
Regarding (v), 𝑇 = 𝑇 𝜕 , so 𝑇 has a fixed point in 𝑉 if and only if 𝑇 𝜕 has a fixed point
in 𝑉 . By this fact and (iv), A is min-stable if and only if A𝜕 is max-stable. Moreover,
Ó Ô
in this setting, we have 𝑣∗ = 𝜎 𝑣𝜎 = 𝜕𝜎 𝑣𝜎 = ( 𝑣∗ ) 𝜕 .
Part (vi) follows from similar analysis and details are left to the reader.

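Solution to Exercise 10.1.1. If $W \stackrel{d}{=} \mathrm{Exp}(\theta)$ and $s, t > 0$, then

\[
\frac{\mathbb{P}\{W > s + t \text{ and } W > s\}}{\mathbb{P}\{W > s\}}
= \frac{\mathbb{P}\{W > s + t\}}{\mathbb{P}\{W > s\}}
= \frac{\mathrm{e}^{-\theta s - \theta t}}{\mathrm{e}^{-\theta s}}
= \mathrm{e}^{-\theta t}.
\]

This is equivalent to (10.4).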
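
The memoryless property can also be verified numerically from the survival function alone. The sketch below uses arbitrary illustrative values for `theta`, `s` and `t`, and the helper `survival` is a name introduced here for the example.

```python
import math

def survival(theta, t):
    """P{W > t} for W ~ Exp(theta), where theta is the rate parameter."""
    return math.exp(-theta * t)

theta, s, t = 0.7, 1.3, 2.1          # arbitrary illustrative values

# P{W > s + t | W > s} = P{W > s + t} / P{W > s}
conditional = survival(theta, s + t) / survival(theta, s)
print(math.isclose(conditional, survival(theta, t)))
```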

Solution to Exercise 10.1.2. Using the triangle inequality and submultiplicative
property of the matrix norm, we have

\[
\left\| \sum_{k=0}^{m} \frac{A^k}{k!} \right\|
\leq \sum_{k=0}^{m} \frac{\| A^k \|}{k!}
\leq \sum_{k=0}^{m} \frac{\| A \|^k}{k!}
\leq \mathrm{e}^{\| A \|},
\]

where the last term uses the ordinary (scalar) exponential function defined in (10.1).
(If you also want to prove that the scalar series in (10.1) converges, you can do so via
the ratio test.)

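Solution to Exercise 10.1.3. Given $A = P^{-1} D P$ we have $A^k = P^{-1} D^k P$ for all $k$, so

\[
\mathrm{e}^{A} = \sum_{k \geq 0} \frac{A^k}{k!}
= \sum_{k \geq 0} \frac{P^{-1} D^k P}{k!}
= P^{-1} \left( \sum_{k \geq 0} \frac{D^k}{k!} \right) P
= P^{-1} \mathrm{e}^{D} P.
\]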
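
The diagonalization identity can be illustrated with a truncated power series. This is a sketch only: `expm_series` is a hypothetical helper written for the example (a production code would normally rely on a library routine), and the matrix `A` is an arbitrary diagonalizable choice with real eigenvalues.

```python
import numpy as np

def expm_series(A, terms=60):
    """Truncated power series sum_{k < terms} A^k / k! (adequate for small matrices)."""
    out, term = np.zeros_like(A, dtype=float), np.eye(A.shape[0])
    for k in range(terms):
        out = out + term             # add A^k / k!
        term = term @ A / (k + 1)    # next term A^{k+1} / (k+1)!
    return out

A = np.array([[-1.0, 0.5],
              [0.25, -2.0]])         # hypothetical diagonalizable matrix

# numpy returns A = V diag(lam) V^{-1}; in the text's notation, P = V^{-1}.
lam, V = np.linalg.eig(A)
via_diag = V @ np.diag(np.exp(lam)) @ np.linalg.inv(V)

print(np.allclose(expm_series(A), via_diag))
```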

Solution to Exercise 10.1.4. We use the definition $\mathrm{e}^{A} = \sum_{k \geq 0} \frac{A^k}{k!}$ for the proof and
fix $t \in \mathbb{R}$. A common argument for differentiating $\mathrm{e}^{tA}$ with respect to $t$ is to take the
derivative through the infinite sum to get

\[
\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{e}^{tA} = A + t \frac{A^2}{1!} + t^2 \frac{A^3}{2!} + \cdots = A \mathrm{e}^{tA}.
\]

But this is not fully rigorous, since we have not justified interchange of limits. A better
answer is to start with (10.9), which gives

\[
\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{e}^{tA} = \mathrm{e}^{tA} \lim_{h \to 0} \frac{\mathrm{e}^{hA} - I}{h},
\]

and note that

\[
\frac{\mathrm{e}^{hA} - I}{h} = A + \frac{1}{2!} h A^2 + \frac{1}{3!} h^2 A^3 + \cdots,
\]

which converges to $A$ as $h \to 0$.

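Solution to Exercise 10.1.5. Fix $A$ in $\mathbb{M}^{n \times n}$ and let $B = -A$. Evidently $AB = BA$,
so $\mathrm{e}^{A} \mathrm{e}^{B} = \mathrm{e}^{A - A} = \mathrm{e}^{0}$. It is easy to check that $\mathrm{e}^{0} = I$, so $\mathrm{e}^{A} \mathrm{e}^{-A} = I$.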

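Solution to Exercise 10.1.6. Fix $i, j$ with $1 \leq i, j \leq n$, let $e_k$ be the $k$-th canonical
basis vector and let $f$ be the function on $\mathbb{R}$ defined by $f(t) = \langle e_i, \mathrm{e}^{tA} e_j \rangle$. Part (v)
of Lemma 10.1.2 tells us that $f'(t) = \langle e_i, \mathrm{e}^{tA} A e_j \rangle$. By the fundamental theorem of
calculus, we have $f(t) - f(s) = \int_s^t f'(\tau) \, \mathrm{d}\tau$, or

\[
\langle e_i, \mathrm{e}^{tA} e_j \rangle - \langle e_i, \mathrm{e}^{sA} e_j \rangle = \int_s^t \langle e_i, \mathrm{e}^{\tau A} A e_j \rangle \, \mathrm{d}\tau.
\]

As this is true for any $i, j$, we have $\mathrm{e}^{tA} - \mathrm{e}^{sA} = \int_s^t \mathrm{e}^{\tau A} A \, \mathrm{d}\tau$, which is what we need to prove.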

Solution to Exercise 10.1.7. If we take $u_t = \varphi_t^\top$ and transpose $\dot{\varphi}_t = \varphi_t P$ we get
(10.10) when $A = P^\top$. By Proposition 10.1.3, the unique solution is $u_t = \mathrm{e}^{t P^\top} u_0 =
(\mathrm{e}^{tP})^\top u_0$. Transposing again gives $\varphi_t = \varphi_0 \mathrm{e}^{tP}$, as was to be shown.

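Solution to Exercise 10.1.9. For any $\tau > 0$, we have

\[
\tau s(A) = \tau \max_{\lambda \in \sigma(A)} \operatorname{Re} \lambda = \max_{\tau \lambda \in \sigma(\tau A)} \tau \operatorname{Re} \lambda = s(\tau A). \tag{10.18}
\]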

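Solution to Exercise 10.1.10. With $\xi \coloneqq \max_{\lambda \in \sigma(A)} \operatorname{Re} \lambda$, we have

\[
\rho(\mathrm{e}^{A}) = \max_{\lambda \in \sigma(\mathrm{e}^{A})} |\lambda| = \max_{\lambda \in \sigma(A)} |\mathrm{e}^{\lambda}| = \max_{\lambda \in \sigma(A)} \mathrm{e}^{\operatorname{Re} \lambda} = \mathrm{e}^{\xi}.
\]

(The second equality is by Lemma 10.1.2.) Hence $\rho(\mathrm{e}^{A}) = \mathrm{e}^{s(A)}$, as was to be shown.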
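
The relationship between the spectral radius of the exponential and the spectral bound can be checked on a small matrix. In this sketch, `expm_series` is a hypothetical truncated-series helper written for the example, and `A` is an arbitrary matrix with complex eigenvalues.

```python
import numpy as np

def expm_series(A, terms=80):
    """Truncated power series sum_{k < terms} A^k / k! (adequate for small matrices)."""
    out, term = np.zeros_like(A, dtype=float), np.eye(A.shape[0])
    for k in range(terms):
        out = out + term
        term = term @ A / (k + 1)
    return out

A = np.array([[0.0, 1.0],
              [-2.0, -0.5]])         # hypothetical matrix with complex eigenvalues

s_A = np.max(np.linalg.eigvals(A).real)                         # spectral bound s(A)
rho_expA = np.max(np.abs(np.linalg.eigvals(expm_series(A))))    # rho(e^A)
print(np.isclose(rho_expA, np.exp(s_A)))
```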

Solution to Exercise 10.1.11. For $t \in \mathbb{N}$, we have

\[
\frac{1}{t} \ln \| \mathrm{e}^{tA} \| = \ln \left( \| \mathrm{e}^{tA} \|^{1/t} \right) = \ln \left( \| (\mathrm{e}^{A})^t \|^{1/t} \right).
\]

Taking the limit $t \to \infty$ and applying Gelfand’s lemma, this sequence converges to
$\ln \rho(\mathrm{e}^{A})$. But $\ln \rho(\mathrm{e}^{A}) = s(A)$, by the first equality in (10.17). This proves the second
equality in (10.17).

Solution to Exercise 10.1.12. Let’s start with (i) $\implies$ (ii), or $s(A) < 0$ implies
$\| \mathrm{e}^{tA} \| \to 0$ as $t \to \infty$.
Here is one proof that works for $t \in \mathbb{N}$ and $t \to \infty$. Observe that, since $(\mathrm{e}^{A})^t = \mathrm{e}^{tA}$,
the powers $B^t$ of $B \coloneqq \mathrm{e}^{A}$ match the flow $t \mapsto \mathrm{e}^{tA}$ at integer times. We have $B^t \to 0$ if
and only if $\rho(B) < 1$. But, by Lemma 10.1.4, $\rho(B) = \rho(\mathrm{e}^{A}) = \mathrm{e}^{s(A)}$. Hence $\rho(B) < 1$
is equivalent to $s(A) < 0$. Thus, $s(A) < 0$ is the exact condition we need to obtain
$B^t = \mathrm{e}^{tA} \to 0$.
We can improve on this proof of (i) $\implies$ (ii) by allowing $t \in \mathbb{R}$ and $t \to \infty$ as
follows. Suppose $s(A) < 0$. Fix $\varepsilon > 0$ such that $s(A) + \varepsilon < 0$ and use (10.17) to obtain
a $T < \infty$ such that $(1/t) \ln \| \mathrm{e}^{tA} \| \leq s(A) + \varepsilon$ for all $t \geq T$. Equivalently, for $t$ large, we
have $\| \mathrm{e}^{tA} \| \leq \mathrm{e}^{t (s(A) + \varepsilon)}$. The claim follows.
That (iii) implies (iv) is immediate: Just substitute the bound in (iii) into the
integral.

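Solution to Exercise 10.1.15. Recalling that, for matrix exponentials, $\mathrm{e}^{A + B} = \mathrm{e}^{A} \mathrm{e}^{B}$
whenever $AB = BA$, we have

\[
\mathrm{e}^{tQ} = \mathrm{e}^{t \theta (K - I)} = \mathrm{e}^{-t \theta I} \mathrm{e}^{t \theta K}
= \mathrm{e}^{-t \theta} \left( I + t \theta K + \frac{(t \theta)^2}{2!} K^2 + \cdots \right).
\]

It is clear from this representation that all entries of $P_t = \mathrm{e}^{tQ}$ are nonnegative.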
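
The representation above also suggests a numerically convenient way to compute 𝑃𝑡, since every term in the series for e^{𝑡𝜃𝐾} is nonnegative. The sketch below assumes a hypothetical Markov matrix `K`, rate `theta` and time `t`; `expm_series` is a helper written for this example only.

```python
import numpy as np

def expm_series(A, terms=80):
    """Truncated power series sum_{k < terms} A^k / k! (adequate for small matrices)."""
    out, term = np.zeros_like(A, dtype=float), np.eye(A.shape[0])
    for k in range(terms):
        out = out + term
        term = term @ A / (k + 1)
    return out

theta, t = 2.0, 0.75
K = np.array([[0.1, 0.9, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.3, 0.7]])      # hypothetical Markov matrix
Q = theta * (K - np.eye(3))          # intensity matrix Q = theta (K - I)

# e^{tQ} = e^{-t theta} e^{t theta K}; the series for e^{t theta K} has no negative terms
P_t = np.exp(-t * theta) * expm_series(t * theta * K)

print(np.all(P_t >= 0), np.allclose(P_t.sum(axis=1), 1.0), np.allclose(Q.sum(axis=1), 0.0))
```

In this example the rows of `P_t` sum to one and the rows of `Q` sum to zero, in line with the properties of Markov semigroups and intensity matrices discussed in the surrounding exercises.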

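Solution to Exercise 10.1.16. Since $f(t) \coloneqq \delta_x \mathrm{e}^{tQ} \mathbb{1} = 1$ for all $t \geq 0$, we have
$f'(t) = 0$. Recalling that $(\mathrm{e}^{tQ})' = \mathrm{e}^{tQ} Q$, this means that

\[
\frac{\mathrm{d}}{\mathrm{d}t} \delta_x \mathrm{e}^{tQ} \mathbb{1} = \delta_x \left( \frac{\mathrm{d}}{\mathrm{d}t} \mathrm{e}^{tQ} \right) \mathbb{1} = \delta_x \mathrm{e}^{tQ} Q \mathbb{1} = 0
\]

for all $t \geq 0$. Evaluating at $t = 0$, we get $\delta_x Q \mathbb{1} = 0$. That is, $\sum_{x'} Q(x, x') = 0$.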

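Solution to Exercise 10.1.17. By Lemma 10.1.2, we have

\[
\frac{\mathrm{d}}{\mathrm{d}t} \mathrm{e}^{tQ} = Q \mathrm{e}^{tQ} = \mathrm{e}^{tQ} Q \quad \text{for all } t \geq 0. \tag{10.23}
\]

Evaluating (10.23) at $t = 0$ and recalling that $\mathrm{e}^{0} = I$ gives

\[
Q = \lim_{h \downarrow 0} \frac{1}{h} (\mathrm{e}^{hQ} - I). \tag{10.24}
\]

Interpreting $\delta_x$ as a row vector and $\delta_{x'}$ as a column vector, while using the fact that
$x \neq x'$ combined with (10.24), we obtain

\[
Q(x, x') = \delta_x Q \delta_{x'} = \delta_x \left( \lim_{h \downarrow 0} \frac{\mathrm{e}^{hQ}}{h} \right) \delta_{x'} = \lim_{h \downarrow 0} \delta_x \frac{\mathrm{e}^{hQ}}{h} \delta_{x'} .
\]

Hence we need only show that $\delta_x \mathrm{e}^{hQ} \delta_{x'} \geq 0$. By (ii), $\delta_x \mathrm{e}^{hQ}$ is a distribution, so the
inequality holds.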

Solution to Exercise 10.1.19. Using the matrix exponential (10.6) and $P_t = \mathrm{e}^{tQ}$
yields

\[
P_t(x, x') = \mathbb{1}\{x = x'\} + t Q(x, x') + t^2 \frac{Q^2(x, x')}{2!} + t^3 \frac{Q^3(x, x')}{3!} + \cdots
\]

Setting $t = h$ and using $o(h)$ to capture terms converging to zero faster than $h$ as $h \downarrow 0$
recovers (10.28).

Solution to Exercise 10.2.2. Fix $v \in \mathbb{R}^{\mathsf{X}}$. Policy $\sigma$ is $v$-max-greedy for $\mathscr{A}$ if and
only if $T_\sigma v \geq T_\tau v$ for all $\tau \in \Sigma$, which in turn holds if and only if

\[
r(x, \sigma(x)) + \sum_{x'} v^*(x') Q(x, \sigma(x), x') + (1 - \delta) v^*(x)
= \max_{a \in \Gamma(x)} \left\{ r(x, a) + \sum_{x'} v^*(x') Q(x, a, x') + (1 - \delta) v^*(x) \right\}
\]

for all $x \in \mathsf{X}$. Canceling terms, this reduces to

\[
r(x, \sigma(x)) + \sum_{x'} v^*(x') Q(x, \sigma(x), x')
= \max_{a \in \Gamma(x)} \left\{ r(x, a) + \sum_{x'} v^*(x') Q(x, a, x') \right\}
\]

for all $x \in \mathsf{X}$, which is equivalent to the definition of $v^*$-greedy for $\mathscr{C}$ in (10.50).


Bibliography

Achdou, Y., Han, J., Lasry, J.-M., Lions, P.-L., and Moll, B. (2022). Income and wealth
distribution in macroeconomics: A continuous-time approach. The Review of Economic
Studies, 89(1):45–86.

Açıkgöz, Ö. T. (2018). On the existence and uniqueness of stationary equilibrium in
Bewley economies with production. Journal of Economic Theory, 173:18–55.

Aiyagari, S. R. (1994). Uninsured idiosyncratic risk and aggregate saving. The Quarterly
Journal of Economics, 109(3):659–684.

Al-Najjar, N. I. and Shmaya, E. (2019). Recursive utility and parameter uncertainty.
Journal of Economic Theory, 181:274–288.

Albuquerque, R., Eichenbaum, M., Luo, V. X., and Rebelo, S. (2016). Valuation risk
and asset pricing. The Journal of Finance, 71(6):2861–2904.

Algan, Y., Allais, O., den Haan, W. J., and Rendahl, P. (2014). Solving and simulating
models with heterogeneous agents and aggregate uncertainty. Handbook of
Computational Economics, 3.

Aliprantis, C. D. and Burkinshaw, O. (1998). Principles of Real Analysis. Academic
Press, 3rd edition.

Amador, M., Werning, I., and Angeletos, G.-M. (2006). Commitment vs. flexibility.
Econometrica, 74(2):365–396.

Andreoni, J. and Sprenger, C. (2012). Risk preferences are not time preferences.
American Economic Review, 102(7):3357–3376.

Antràs, P. and De Gortari, A. (2020). On the geography of global value chains.
Econometrica, 88(4):1553–1598.

Applebaum, D. (2019). Semigroups of Linear Operators, volume 93. Cambridge
University Press.


Arellano, C. (2008). Default risk and income fluctuations in emerging economies.
American Economic Review, 98(3):690–712.

Arellano, C. and Ramanarayanan, A. (2012). Default and the maturity structure in
sovereign bonds. Journal of Political Economy, 120(2):187–232.

Arrow, K. J., Harris, T., and Marschak, J. (1951). Optimal inventory policy.
Econometrica, 19(3):250–272.

Asienkiewicz, H. and Jaśkiewicz, A. (2017). A note on a new class of recursive utilities
in Markov decision processes. Applicationes Mathematicae, 44:149–161.

Atkinson, K. and Han, W. (2005). Theoretical Numerical Analysis, volume 39. Springer.

Augeraud-Véron, E., Fabbri, G., and Schubert, K. (2019). The value of biodiversity
as an insurance device. American Journal of Agricultural Economics, 101(4):1068–1081.

Azinovic, M., Gaegauf, L., and Scheidegger, S. (2022). Deep equilibrium nets.
International Economic Review, 63(4):1471–1525.

Backus, D. K., Routledge, B. R., and Zin, S. E. (2004). Exotic preferences for
macroeconomists. NBER Macroeconomics Annual, 19:319–390.

Bagliano, F.-C. and Bertola, G. (2004). Models for Dynamic Macroeconomics. Oxford
University Press.

Balbus, Ł. (2020). On recursive utilities with non-affine aggregator and conditional
certainty equivalent. Economic Theory, 70(2):551–577.

Balbus, Ł., Reffett, K., and Woźny, Ł. (2014). A constructive study of Markov equilibria
in stochastic games with strategic complementarities. Journal of Economic Theory,
150:815–840.

Balbus, Ł., Reffett, K., and Woźny, Ł. (2018). On uniqueness of time-consistent Markov
policies for quasi-hyperbolic consumers under uncertainty. Journal of Economic
Theory, 176:293–310.

Balbus, Ł., Reffett, K., and Woźny, Ł. (2022). Time-consistent equilibria in dynamic
models with recursive payoffs and behavioral discounting. Journal of Economic
Theory, page 105493.

Bansal, R., Kiku, D., and Yaron, A. (2012). An empirical evaluation of the long-run
risks model for asset prices. Critical Finance Review, 1(1):183–221.

Bansal, R. and Yaron, A. (2004). Risks for the long run: A potential resolution of asset
pricing puzzles. The Journal of Finance, 59(4):1481–1509.

Barbu, V. and Precupanu, T. (2012). Convexity and Optimization in Banach Spaces.
Springer Science & Business Media.

Barillas, F., Hansen, L. P., and Sargent, T. J. (2009). Doubts or variability? Journal
of Economic Theory, 144(6):2388–2418.

Bartle, R. G. and Sherbert, D. R. (2011). Introduction to Real Analysis. Hoboken, NJ:
Wiley, 4th edition.

Bastianello, L. and Faro, J. H. (2022). Choquet expected discounted utility. Economic
Theory, pages 1–28.

Basu, S. and Bundick, B. (2017). Uncertainty shocks in a model of effective demand.
Econometrica, 85(3):937–958.

Bäuerle, N. and Glauner, A. (2022). Markov decision processes with recursive risk
measures. European Journal of Operational Research, 296(3):953–966.

Bäuerle, N. and Jaśkiewicz, A. (2017). Optimal dividend payout model with risk
sensitive preferences. Insurance: Mathematics and Economics, 73:82–93.

Bäuerle, N. and Jaśkiewicz, A. (2018). Stochastic optimal growth model with risk
sensitive preferences. Journal of Economic Theory, 173:181–200.

Bäuerle, N. and Rieder, U. (2011). Markov Decision Processes with Applications to
Finance. Springer Science & Business Media.

Becker, R. A., Boyd III, J. H., and Sung, B. Y. (1989). Recursive utility and optimal
capital accumulation. I. Existence. Journal of Economic Theory, 47(1):76–100.

Becker, R. A. and Rincon-Zapatero, J. P. (2023). Recursive utility for Thompson
aggregators: Least fixed point, uniqueness, and approximation theories. Technical
report, Indiana University.

Bellman, R. (1957). Dynamic Programming. American Association for the
Advancement of Science.

Bellman, R. (1984). Eye of the Hurricane. World Scientific.

Benhabib, J., Bisin, A., and Zhu, S. (2015). The wealth distribution in Bewley
economies with capital income risk. Journal of Economic Theory, 159:489–515.

Benzion, U., Rapoport, A., and Yagil, J. (1989). Discount rates inferred from decisions:
An experimental study. Management Science, 35(3):270–284.

Bertsekas, D. (2012). Dynamic Programming and Optimal Control: Volume I, volume 1.
Athena Scientific.

Bertsekas, D. (2021). Rollout, Policy Iteration, and Distributed Reinforcement Learning.
Athena Scientific.

Bertsekas, D. (2022a). Newton’s method for reinforcement learning and model
predictive control. Results in Control and Optimization, 7:100121.

Bertsekas, D. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena
Scientific.

Bertsekas, D. P. (2022b). Abstract Dynamic Programming. Athena Scientific, 3rd edition.

Bhandari, J. and Russo, D. (2022). Global optimality guarantees for policy gradient
methods. arXiv:1906.01786.

Bianchi, J. (2011). Overborrowing and systemic externalities in the business cycle.
American Economic Review, 101(7):3400–3426.

Bloise, G., Le Van, C., and Vailakis, Y. (2023). Do not blame Bellman: It is Koopmans’
fault. Econometrica, in press.

Bloise, G. and Vailakis, Y. (2018). Convex dynamic programming with (bounded)
recursive utility. Journal of Economic Theory, 173:118–141.

Bloise, G. and Vailakis, Y. (2022). On sovereign default with time-varying interest
rates. Review of Economic Dynamics, 44:211–224.

Bloom, N. (2009). The impact of uncertainty shocks. Econometrica, 77(3):623–685.

Bloom, N., Bond, S., and Van Reenen, J. (2007). Uncertainty and investment
dynamics. The Review of Economic Studies, 74(2):391–415.

Blundell, R., Graber, M., and Mogstad, M. (2015). Labor income dynamics and the
insurance from taxes, transfers, and the family. Journal of Public Economics, 127:58–73.

Bocola, L., Bornstein, G., and Dovis, A. (2019). Quantitative sovereign default models
and the European debt crisis. Journal of International Economics, 118:20–30.

Bollobás, B. (1999). Linear Analysis: An Introductory Course. Cambridge University
Press.

Bommier, A., Kochov, A., and Le Grand, F. (2017). On monotone recursive preferences.
Econometrica, 85(5):1433–1466.

Bommier, A., Kochov, A., and Le Grand, F. (2019). Ambiguity and endogenous
discounting. Journal of Mathematical Economics, 83:48–62.

Bommier, A. and Villeneuve, B. (2012). Risk aversion and the value of risk to life.
Journal of Risk and Insurance, 79(1):77–104.

Bond, S. and Van Reenen, J. (2007). Microeconometric models of investment and
employment. In Handbook of Econometrics, volume 6, pages 4417–4498. Elsevier.

Borovička, J. and Stachurski, J. (2020). Necessary and sufficient conditions for exis-
tence and uniqueness of recursive utilities. The Journal of Finance.

Boyd, J. H. (1990). Recursive utility and the Ramsey problem. Journal of Economic
Theory, 50(2):326–345.

Brémaud, P. (2020). Markov Chains: Gibbs Fields, Monte Carlo Simulation and Queues,
volume 31. Springer Nature.

Brock, W. A. and Mirman, L. J. (1972). Optimal economic growth and uncertainty:
The discounted case. Journal of Economic Theory, 4(3):479–513.

Bullen, P. S. (2003). Handbook of Means and Their Inequalities, volume 560. Springer
Science & Business Media.

Burdett, K. (1978). A theory of employee job search and quit rates. American Economic
Review, 68(1):212–220.

Bäuerle, N. and Jaśkiewicz, A. (2023). Markov decision processes with risk-sensitive
criteria: An overview.

Cagetti, M., Hansen, L. P., Sargent, T. J., and Williams, N. (2002). Robustness and
pricing with uncertain growth. The Review of Financial Studies, 15(2):363–404.

Calsamiglia, C., Fu, C., and Güell, M. (2020). Structural estimation of a model of
school choices: The Boston mechanism versus its alternatives. Journal of Political
Economy, 128(2):642–680.

Campbell, J. Y. (2017). Financial Decisions and Markets: A Course in Asset Pricing.
Princeton University Press.

Cao, D. (2020). Recursive equilibrium in Krusell and Smith (1998). Journal of Eco-
nomic Theory, 186.

Cao, D. and Werning, I. (2018). Saving and dissaving with hyperbolic discounting.
Econometrica, 86(3):805–857.

Carroll, C. D. (1997). Buffer-stock saving and the life cycle/permanent income hy-
pothesis. Quarterly Journal of Economics, 112(1):1–55.

Carroll, C. D. (2009). Precautionary saving and the marginal propensity to consume
out of permanent income. Journal of Monetary Economics, 56(6):780–790.

Carruth, A., Dickerson, A., and Henley, A. (2000). What do we know about investment
under uncertainty? Journal of Economic Surveys, 14(2):119–154.

Carvalho, V. M. and Grassi, B. (2019). Large firm dynamics and the business cycle.
American Economic Review, 109(4):1375–1425.

Chatterjee, S. and Eyigungor, B. (2012). Maturity, indebtedness, and default risk.
American Economic Review, 102(6):2674–99.

Cheney, W. (2013). Analysis for Applied Mathematics, volume 208. Springer Science
& Business Media.

Chetty, R. (2008). Moral hazard versus liquidity and optimal unemployment insur-
ance. Journal of Political Economy, 116(2):173–234.

Christensen, T. M. (2022). Existence and uniqueness of recursive utilities without
boundedness. Journal of Economic Theory, page 105413.

Christiano, L. J., Motto, R., and Rostagno, M. (2014). Risk shocks. American Economic
Review, 104(1):27–65.

Çınlar, E. and Vanderbei, R. J. (2013). Real and Convex Analysis. Springer Science &
Business Media.

Cochrane, J. H. (2009). Asset Pricing: Revised Edition. Princeton University Press.

Colacito, R., Croce, M., Ho, S., and Howard, P. (2018). BKK the EZ way: International
long-run growth news and capital flows. American Economic Review, 108(11):3416–
3449.

Cruces, J. J. and Trebesch, C. (2013). Sovereign defaults: The price of haircuts.
American Economic Journal: Macroeconomics, 5(3):85–117.

Daley, D. (1968). Stochastically monotone Markov chains. Probability Theory and
Related Fields, 10(4):305–317.

Dasgupta, P. and Maskin, E. (2005). Uncertainty and hyperbolic discounting.
American Economic Review, 95(4):1290–1299.

Davey, B. A. and Priestley, H. A. (2002). Introduction to Lattices and Order. Cambridge
University Press.

de Castro, L. and Galvao, A. F. (2019). Dynamic quantile models of rational behavior.
Econometrica, 87(6):1893–1939.

de Castro, L. and Galvao, A. F. (2022). Static and dynamic quantile preferences.
Economic Theory, 73(2-3):747–779.

de Castro, L., Galvao, A. F., and Nunes, D. (2022). Dynamic economics with quantile
preferences. SSRN 4108230.

de Groot, O., Richter, A. W., and Throckmorton, N. A. (2018). Uncertainty shocks in
a model of effective demand: Comment. Econometrica, 86(4):1513–1526.

de Groot, O., Richter, A. W., and Throckmorton, N. A. (2022). Valuation risk revalued.
Quantitative Economics, 13(2):723–759.

De Nardi, M., French, E., and Jones, J. B. (2010). Why do the elderly save? The role
of medical expenses. Journal of Political Economy, 118(1):39–75.

Deaton, A. and Laroque, G. (1992). On the behaviour of commodity prices. Review of
Economic Studies, 59(1):1–23.

DeJarnette, P., Dillenberger, D., Gottlieb, D., and Ortoleva, P. (2020). Time lotteries
and stochastic impatience. Econometrica, 88(2):619–656.

Denardo, E. V. (1967). Contraction mappings in the theory underlying dynamic
programming. SIAM Review, 9(2):165–177.

Denardo, E. V. (1981). Dynamic Programming: Models and Applications. Prentice Hall
PTR.

Diamond, P. and Köszegi, B. (2003). Quasi-hyperbolic discounting and retirement.
Journal of Public Economics, 87(9-10):1839–1872.

Dixit, A. K. and Pindyck, R. S. (2012). Investment under Uncertainty. Princeton
University Press.

Drugeon, J.-P. and Wigniolle, B. (2021). On Markovian collective choice with hetero-
geneous quasi-hyperbolic discounting. Economic Theory, 72(4):1257–1296.

Du, Y. (1990). Fixed points of increasing operators in ordered Banach spaces and
applications. Applicable Analysis, 38(01-02):1–20.

Duarte, V. (2018). Machine learning for continuous-time economics. SSRN 3012602.

Dudley, R. M. (2002). Real Analysis and Probability, volume 74. Cambridge University
Press.

Duffie, D. (2010). Dynamic Asset Pricing Theory. Princeton University Press.

Duffie, D. and Garman, M. B. (1986). Intertemporal Arbitrage and the Markov Valua-
tion of Securities. Citeseer.

Dupuis, P. and Ellis, R. S. (2011). A Weak Convergence Approach to the Theory of Large
Deviations. John Wiley & Sons.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1952). The inventory problem: I. Case of
known distributions of demand. Econometrica, 20(2):187–222.

Engel, K.-J. and Nagel, R. (2006). A Short Course on Operator Semigroups. Springer
Science & Business Media.

Epstein, L. G. and Zin, S. E. (1989). Risk aversion and the temporal behavior of
consumption and asset returns: A theoretical framework. Econometrica, 57(4):937–
969.

Epstein, L. G. and Zin, S. E. (1991). Substitution, risk aversion, and the temporal be-
havior of consumption and asset returns: An empirical analysis. Journal of Political
Economy, 99(2):263–286.

Ericson, K. M. and Laibson, D. (2019). Intertemporal choice. In Handbook of
Behavioral Economics: Applications and Foundations 1, volume 2, pages 1–67. Elsevier.

Ericson, R. and Pakes, A. (1995). Markov-perfect industry dynamics: A framework
for empirical work. The Review of Economic Studies, 62(1):53–82.

Erosa, A. and González, B. (2019). Taxation and the life cycle of firms. Journal of
Monetary Economics, 105:114–130.

Eslami, K. and Phelan, T. (2023). The art of temporal approximation: An investigation
into numerical solutions to discrete and continuous-time problems in economics.
Technical report, FRB of Cleveland Working Paper.

Esponda, I. and Pouzo, D. (2021). Equilibrium in misspecified Markov decision
processes. Theoretical Economics, 16(2):717–757.

Fagereng, A., Holm, M. B., Moll, B., and Natvik, G. (2019). Saving behavior across
the wealth distribution: The importance of capital gains. Technical report, National
Bureau of Economic Research.

Fajgelbaum, P. D., Schaal, E., and Taschereau-Dumouchel, M. (2017). Uncertainty
traps. The Quarterly Journal of Economics, 132(4):1641–1692.

Farmer, J. D., Geanakoplos, J., Richiardi, M. G., Montero, M., Perelló, J., and Masoliver,
J. (2023). Discounting the distant future: What do historical bond prices imply
about the long term discount rate?

Farmer, L. E. and Toda, A. A. (2017). Discretizing nonlinear, non-Gaussian Markov
processes with exact conditional moments. Quantitative Economics, 8(2):651–683.

Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. (2019). Hy-
perbolic discounting and learning over multiple horizons. arXiv:1902.06865.

Fei, Y., Yang, Z., Chen, Y., and Wang, Z. (2021). Exponential Bellman equation and
improved regret bounds for risk-sensitive reinforcement learning. Advances in Neu-
ral Information Processing Systems, 34:20436–20446.

Fernández-Villaverde, J., Hurtado, S., and Nuño, G. (2023). Financial frictions and
the wealth distribution. Econometrica, 91(3):869–901.

Fisher, I. (1930). The theory of interest as determined by impatience to spend income
and opportunity to invest it. Bulletin of the American Mathematical Society,
36:783–784.

Föllmer, H. and Knispel, T. (2011). Entropic risk measures: Coherence vs. con-
vexity, model ambiguity and robust large deviations. Stochastics and Dynamics,
11(02n03):333–351.

Foss, S., Shneer, V., Thomas, J. P., and Worrall, T. (2018). Stochastic stability of
monotone economies in regenerative environments. Journal of Economic Theory,
173:334–360.

Frederick, S., Loewenstein, G., and O’Donoghue, T. (2002). Time discounting and
time preference: A critical review. Journal of Economic Literature, 40(2):351–401.

Friedman, M. (1956). Theory of the Consumption Function. Princeton University Press.



Gao, Y., Lui, K. Y. C., and Hernandez-Leal, P. (2021). Robust risk-sensitive reinforce-
ment learning agents for trading markets. arXiv:2107.08083.

Garman, M. B. (1985). Towards a semigroup pricing theory. The Journal of Finance,
40(3):847–861.

Geist, M. and Scherrer, B. (2018). Anderson acceleration for reinforcement learning.
arXiv preprint arXiv:1809.09501.

Gennaioli, N., Martin, A., and Rossi, S. (2014). Sovereign default, domestic banks,
and financial institutions. The Journal of Finance, 69(2):819–866.

Gentry, M. L., Hubbard, T. P., Nekipelov, D., and Paarsch, H. J. (2018). Structural
econometrics of auctions: A review. Foundations and Trends in Econometrics, 9(2-
4):79–302.

Ghosh, A. R., Kim, J. I., Mendoza, E. G., Ostry, J. D., and Qureshi, M. S. (2013). Fiscal
fatigue, fiscal space and debt sustainability in advanced economies. The Economic
Journal, 123(566):F4–F30.

Gillingham, K., Iskhakov, F., Munk-Nielsen, A., Rust, J., and Schjerning, B. (2022).
Equilibrium Trade in Automobiles. Journal of Political Economy.

Giovannetti, B. C. (2013). Asset pricing under quantile utility maximization. Review
of Financial Economics, 22(4):169–179.

Gomez-Cram, R. and Yaron, A. (2020). How Important Are Inflation Expectations for
the Nominal Yield Curve? The Review of Financial Studies, 34(2):985–1045.

Goyal, V. and Grand-Clement, J. (2023). A first-order approach to accelerated value
iteration. Operations Research, 71(2):517–535.

Guan, G. and Wang, X. (2020). Time-consistent reinsurance and investment
strategies for an AAI under smooth ambiguity utility. Scandinavian Actuarial Journal,
2020(8):677–699.

Guo, D., Cho, Y. J., and Zhu, J. (2004). Partial Ordering Methods in Nonlinear Problems.
Nova Publishers.

Guvenen, F. (2007). Learning your earning: Are labor income shocks really very
persistent? American Economic Review, 97(3):687–712.

Guvenen, F. (2009). An empirical investigation of labor income processes. Review of
Economic Dynamics, 12(1):58–79.

Häggström, O. et al. (2002). Finite Markov Chains and Algorithmic Applications. Cam-
bridge University Press.

Han, J., Yang, Y., et al. (2021). DeepHAM: A global solution method for heterogeneous
agent models with aggregate shocks. arXiv:2112.14377.

Hansen, L. P., Heaton, J. C., and Li, N. (2008). Consumption strikes back? Measuring
long-run risk. Journal of Political Economy, 116(2):260–302.

Hansen, L. P. and Miao, J. (2018). Aversion to ambiguity and model misspecification
in dynamic stochastic environments. Proceedings of the National Academy of Sciences,
115(37):9163–9168.

Hansen, L. P. and Renault, E. (2010). Pricing kernels. Encyclopedia of Quantitative
Finance.

Hansen, L. P. and Sargent, T. J. (1980). Formulating and estimating dynamic linear
rational expectations models. Journal of Economic Dynamics and Control, 2:7–46.

Hansen, L. P. and Sargent, T. J. (1990). Rational Expectations Econometrics. CRC
Press.

Hansen, L. P. and Sargent, T. J. (1995). Discounted linear exponential quadratic
Gaussian control. IEEE Transactions on Automatic Control, 40(5):968–971.

Hansen, L. P. and Sargent, T. J. (2007). Recursive robust estimation and control
without commitment. Journal of Economic Theory, 136(1):1–27.

Hansen, L. P. and Sargent, T. J. (2011). Robustness. Princeton University Press.

Hansen, L. P. and Sargent, T. J. (2020). Structured ambiguity and model
misspecification. Journal of Economic Theory, page 105165.

Hansen, L. P. and Scheinkman, J. A. (2009). Long-term risk: An operator approach.
Econometrica, 77(1):177–234.

Harrison, J. M. and Kreps, D. M. (1978). Speculative investor behavior in a stock
market with heterogeneous expectations. The Quarterly Journal of Economics,
92(2):323–336.

Hassett, K. A. and Hubbard, R. G. (2002). Tax policy and business investment. In
Handbook of Public Economics, volume 3, pages 1293–1343. Elsevier.

Havranek, T., Horvath, R., Irsova, Z., and Rusnak, M. (2015). Cross-country
heterogeneity in intertemporal substitution. Journal of International Economics,
96(1):100–118.

Hayashi, F. (1982). Tobin’s marginal q and average q: A neoclassical interpretation.
Econometrica, 50(1):213–224.

Hayashi, T. and Miao, J. (2011). Intertemporal substitution and recursive smooth
ambiguity preferences. Theoretical Economics, 6(3):423–472.

Heathcote, J. and Perri, F. (2018). Wealth and volatility. The Review of Economic
Studies, 85(4):2173–2213.

Hens, T. and Schindler, N. (2020). Value and patience: The value premium in a
dividend-growth model with hyperbolic discounting. Journal of Economic Behavior
& Organization, 172:161–179.

Hernández-Lerma, O. and Lasserre, J. B. (2012a). Discrete-Time Markov Control
Processes: Basic Optimality Criteria, volume 30. Springer Science & Business Media.

Hernández-Lerma, O. and Lasserre, J. B. (2012b). Further Topics on Discrete-Time
Markov Control Processes, volume 42. Springer Science & Business Media.

Herrendorf, B., Rogerson, R., and Valentinyi, A. (2021). Structural change in in-
vestment and consumption—a unified analysis. The Review of Economic Studies,
88(3):1311–1346.

Hill, E., Bardoscia, M., and Turrell, A. (2021). Solving heterogeneous general equi-
librium economic models with deep reinforcement learning. arXiv:2103.16977.

Hills, T. S., Nakata, T., and Schmidt, S. (2019). Effective lower bound risk. European
Economic Review, 120:103321.

Hirsch, M. and Smale, S. (1974). Differential Equations, Dynamical Systems and Linear
Algebra. Academic Press.

Hopenhayn, H. A. (1992). Entry, exit, and firm dynamics in long run equilibrium.
Econometrica, 60:1127–1150.

Hopenhayn, H. A. and Prescott, E. C. (1992). Stochastic monotonicity and stationary
distributions for dynamic economies. Econometrica, 60(6):1387–1406.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.



Howard, R. A. (1960). Dynamic Programming and Markov Processes. John Wiley &
Sons.

Hsu, W.-T. (2012). Central place theory and city size distribution. The Economic
Journal, 122(563):903–932.

Hsu, W.-T., Holmes, T. J., and Morgan, F. (2014). Optimal city hierarchy: A dy-
namic programming approach to central place theory. Journal of Economic Theory,
154:245–273.

Hu, T.-W. and Shmaya, E. (2019). Unique monetary equilibrium with inflation in a
stationary Bewley–Aiyagari model. Journal of Economic Theory, 180:368–382.

Hubmer, J., Krusell, P., and Smith, Jr, A. A. (2020). Sources of US wealth inequality:
Past, present, and future. NBER Macroeconomics Annual 2020, volume 35.

Huggett, M. (1993). The risk-free rate in heterogeneous-agent incomplete-insurance
economies. Journal of Economic Dynamics and Control, 17(5-6):953–969.

Huijben, I. A., Kool, W., Paulus, M. B., and Van Sloun, R. J. (2022). A review of the
Gumbel-max trick and its extensions for discrete stochasticity in machine learning.
IEEE Transactions on Pattern Analysis and Machine Intelligence.

Iskhakov, F., Rust, J., and Schjerning, B. (2020). Machine learning and structural
econometrics: Contrasts and synergies. The Econometrics Journal, 23(3):S81–S124.

Jacobson, D. H. (1973). Optimal stochastic linear systems with exponential
performance criteria and their relation to deterministic differential games. IEEE
Transactions on Automatic Control, AC-18:124–131.

Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with
Gumbel-softmax. arXiv:1611.01144.

Jaśkiewicz, A. and Nowak, A. S. (2014). Stationary Markov perfect equilibria in risk
sensitive stochastic overlapping generations models. Journal of Economic Theory,
151:411–447.

Jaśkiewicz, A. and Nowak, A. S. (2021). Markov decision processes with
quasi-hyperbolic discounting. Finance and Stochastics, 25(2):189–229.

Jasso-Fuentes, H., Menaldi, J.-L., and Prieto-Rumeau, T. (2020). Discrete-time control
with non-constant discount factor. Mathematical Methods of Operations Research,
pages 1–23.

Jensen, N. R. (2019). Life insurance decisions under recursive utility. Scandinavian
Actuarial Journal, 2019(3):204–227.

Jovanovic, B. (1979). Firm-specific capital and turnover. Journal of Political Economy,
87(6):1246–1260.

Jovanovic, B. (1982). Selection and the evolution of industry. Econometrica,
50(3):649–670.

Jovanovic, B. (1984). Matching, turnover, and unemployment. Journal of Political
Economy, 92(1):108–122.

Ju, N. and Miao, J. (2012). Ambiguity, learning, and asset returns. Econometrica,
80(2):559–591.

Kahou, M. E., Fernández-Villaverde, J., Perla, J., and Sood, A. (2021). Exploiting
symmetry in high-dimensional dynamic programming. Technical report, National
Bureau of Economic Research.

Kamihigashi, T. (2014). Elementary results on solutions to the Bellman equation of
dynamic programming: Existence, uniqueness, and convergence. Economic Theory,
56:251–273.

Kamihigashi, T. and Stachurski, J. (2014). Stochastic stability in monotone economies.
Theoretical Economics, 9(2):383–407.

Kaplan, G., Moll, B., and Violante, G. L. (2018). Monetary policy according to HANK.
American Economic Review, 108(3):697–743.

Karp, L. (2005). Global warming and hyperbolic discounting. Journal of Public Eco-
nomics, 89(2-3):261–282.

Kase, H., Melosi, L., and Rottner, M. (2022). Estimating nonlinear heterogeneous
agents models with neural networks. Technical report, CEPR Discussion Paper No.
DP17391.

Keane, M. P., Todd, P. E., and Wolpin, K. I. (2011). The structural estimation of be-
havioral models: Discrete choice dynamic programming methods and applications.
In Handbook of Labor Economics, volume 4, pages 331–461. Elsevier.

Keane, M. P. and Wolpin, K. I. (1997). The career decisions of young men. Journal of
Political Economy, 105(3):473–522.

Kelle, P. and Milne, A. (1999). The effect of (s, S) ordering policy on the supply chain.
International Journal of Production Economics, 59(1-3):113–122.

Kikuchi, T., Nishimura, K., Stachurski, J., and Zhang, J. (2021). Coase meets Bell-
man: Dynamic programming for production networks. Journal of Economic Theory,
196:105287.

Kleinman, B., Liu, E., and Redding, S. J. (2023). Dynamic spatial general equilibrium.
Econometrica, 91(2):385–424.

Klibanoff, P., Marinacci, M., and Mukerji, S. (2009). Recursive smooth ambiguity
preferences. Journal of Economic Theory, 144(3):930–976.

Knight, F. H. (1921). Risk, Uncertainty and Profit, volume 31. Houghton Mifflin.

Kochenderfer, M. J., Wheeler, T. A., and Wray, K. H. (2022). Algorithms for Decision
Making. MIT Press.

Kohler, M., Krzyżak, A., and Todorovic, N. (2010). Pricing of high-dimensional Amer-
ican options by neural networks. Mathematical Finance, 20(3):383–410.

Koopmans, T. C. (1960). Stationary ordinal utility and impatience. Econometrica,
28(2):287–309.

Koopmans, T. C., Diamond, P. A., and Williamson, R. E. (1964). Stationary utility and
time perspective. Econometrica, 32(1/2):82–100.

Krasnosel’skii, M. A., Vainikko, G. M., Zabreiko, P. P., Rutitskii, Y. B., and Stetsenko,
V. Y. (1972). Approximate Solution of Operator Equations. Springer Netherlands.

Kreps, D. M. and Porteus, E. L. (1978). Temporal resolution of uncertainty and
dynamic choice theory. Econometrica, 46(1):185–200.

Kreyszig, E. (1978). Introductory Functional Analysis with Applications, volume 1.
Wiley New York.

Kristensen, D., Mogensen, P. K., Moon, J. M., and Schjerning, B. (2021). Solving
dynamic discrete choice models using smoothing and sieve methods. Journal of
Econometrics, 223(2):328–360.

Krueger, D., Mitman, K., and Perri, F. (2016). Macroeconomics and household het-
erogeneity. In Handbook of Macroeconomics, volume 2, pages 843–921. Elsevier.

Krusell, P. and Smith, Jr, A. A. (1998). Income and wealth heterogeneity in the
macroeconomy. Journal of Political Economy, 106(5):867–896.

Lasota, A. and Mackey, M. C. (1994). Chaos, Fractals, and Noise: Stochastic Aspects of
Dynamics, volume 97. Springer Science & Business Media, 2 edition.

Lee, J. and Shin, K. (2000). The role of a variable input in the relationship between
investment and uncertainty. American Economic Review, 90(3):667–680.

Legrand, N. (2019). The empirical merit of structural explanations of commodity price
volatility: Review and perspectives. Journal of Economic Surveys, 33(2):639–664.

Lehrer, E. and Light, B. (2018). The effect of interest rates on consumption in an
income fluctuation problem. Journal of Economic Dynamics and Control, 94:63–71.

Lettau, M. and Ludvigson, S. C. (2014). Shocks and crashes. NBER Macroeconomics
Annual, 28(1):293–354.

Li, H. and Stachurski, J. (2014). Solving the income fluctuation problem with un-
bounded rewards. Journal of Economic Dynamics and Control, 45:353–365.

Liao, J. and Berg, A. (2018). Sharpening Jensen’s inequality. The American Statisti-
cian.

Liggett, T. M. (2010). Continuous Time Markov Processes: An Introduction, volume
113. American Mathematical Society.

Light, B. (2018). Precautionary saving in a Markovian earnings environment. Review
of Economic Dynamics, 29:138–147.

Light, B. (2020). Uniqueness of equilibrium in a Bewley–Aiyagari model. Economic
Theory, 69(2):435–450.

Ljungqvist, L. (2002). How do lay-off costs affect employment? The Economic Journal,
112(482):829–853.

Ljungqvist, L. and Sargent, T. J. (2018). Recursive Macroeconomic Theory. MIT Press,
4 edition.

Loewenstein, G. and Prelec, D. (1991). Negative time preference. American Economic
Review, 81:347–352.

Loewenstein, G. and Sicherman, N. (1991). Do workers prefer increasing wage
profiles? Journal of Labor Economics, 9:67–84.

Longstaff, F. A. and Schwartz, E. S. (2001). Valuing American options by simulation:
A simple least-squares approach. The Review of Financial Studies, 14(1):113–147.

Lucas, R. and Prescott, E. (1971). Investment under uncertainty. Econometrica,
39(5):659–81.

Lucas, R. E. (1976). Econometric policy evaluation: A critique. In Carnegie-Rochester
Conference Series on Public Policy, volume 1, pages 19–46. North-Holland.

Lucas, R. E. (1978a). Asset prices in an exchange economy. Econometrica,
46(6):1429–1445.

Lucas, R. E. (1978b). Unemployment policy. American Economic Review,
68(2):353–357.

Lucas, R. E. and Sargent, T. J. (1981). Rational Expectations and Econometric Practice,
volume 2. University of Minnesota Press.

Lucas, R. E. and Stokey, N. L. (1984). Optimal growth with many consumers. Journal
of Economic Theory, 32(1):139–171.

Luo, Y. and Sang, P. (2022). Penalized sieve estimation of structural models.
arXiv:2204.13488.

Ma, Q. and Stachurski, J. (2021). Dynamic programming deconstructed:
Transformations of the Bellman equation and computational efficiency. Operations
Research, 69(5):1591–1607.

Ma, Q., Stachurski, J., and Toda, A. A. (2020). The income fluctuation problem and
the evolution of wealth. Journal of Economic Theory, 187:105003.

Ma, Q. and Toda, A. A. (2021). A theory of the saving rate of the rich. Journal of
Economic Theory, 192:105193.

Majumdar, A., Singh, S., Mandlekar, A., and Pavone, M. (2017). Risk-sensitive inverse
reinforcement learning via coherent risk models. In Robotics: Science and Systems,
volume 16, page 117.

Maliar, L., Maliar, S., and Winant, P. (2021). Deep learning for solving dynamic
economic models. Journal of Monetary Economics, 122:76–101.

Marcet, A., Obiols-Homs, F., and Weil, P. (2007). Incomplete markets, labor supply
and capital accumulation. Journal of Monetary Economics, 54(8):2621–2635.

Marimon, R. (1984). General equilibrium and growth under uncertainty: the turnpike
property. Northwestern University Economics Department Discussion Paper 624.

Marimon, R. (1989). Stochastic turnpike property and stationary equilibrium. Journal
of Economic Theory, 47(2):282–306.

Marinacci, M. and Montrucchio, L. (2010). Unique solutions for stochastic recursive
utilities. Journal of Economic Theory, 145(5):1776–1804.

Marinacci, M. and Montrucchio, L. (2019). Unique Tarski fixed points. Mathematics
of Operations Research, 44(4):1174–1191.

Marinacci, M., Principi, G., and Stanca, L. (2023). Recursive preferences and ambi-
guity attitudes. arXiv:2304.06830.

Martins-da Rocha, V. F. and Vailakis, Y. (2010). Existence and uniqueness of a fixed
point for local contractions. Econometrica, 78(3):1127–1141.

McCall, J. J. (1970). Economics of information and job search. The Quarterly Journal
of Economics, 84(1):113–126.

McFadden, D. (1974). The measurement of urban travel demand. Journal of Public
Economics, 3(4):303–328.

Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. (2020). On the global con-
vergence rates of softmax policy gradient methods. In International Conference on
Machine Learning, pages 6820–6829. PMLR.

Meissner, T. and Pfeiffer, P. (2022). Measuring preferences over the temporal resolu-
tion of consumption uncertainty. Journal of Economic Theory, 200:105379.

Melo, F. S. (2001). Convergence of Q-learning: A simple proof. Technical report,
Institute of Systems and Robotics.

Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra, volume 71. SIAM.

Miao, J. (2006). Competitive equilibria of economies with a continuum of consumers
and aggregate shocks. Journal of Economic Theory, 128(1):274–298.

Michelacci, C., Paciello, L., and Pozzi, A. (2022). The extensive margin of aggregate
consumption demand. The Review of Economic Studies, 89(2):909–947.

Mirman, L. J. and Zilcha, I. (1975). On optimal growth under uncertainty. Journal of
Economic Theory, 11(3):329–339.

Mitten, L. (1964). Composition principles for synthesis of optimal multistage
processes. Operations Research, 12(4):610–619.

Mogensen, P. K. (2018). Solving dynamic discrete choice models: Integrated or
expected value function? arXiv:1801.03978.

Mordecki, E. (2002). Optimal stopping and perpetual options for Lévy processes.
Finance and Stochastics, 6(4):473–493.

Mortensen, D. T. (1986). Job search and labor market analysis. Handbook of Labor
Economics, 2:849–919.

Newhouse, D. (2005). The persistence of income shocks: Evidence from rural Indone-
sia. Review of Development Economics, 9(3):415–433.

Nirei, M. (2006). Threshold behavior and aggregate fluctuation. Journal of Economic
Theory, 127(1):309–322.

Noor, J. and Takeoka, N. (2022). Optimal discounting. Econometrica, 90(2):585–623.

Norets, A. (2010). Continuity and differentiability of expected value functions in
dynamic discrete choice models. Quantitative Economics, 1(2):305–322.

Norris, J. R. (1998). Markov Chains. Cambridge University Press.

Nota, C. and Thomas, P. S. (2019). Is the policy gradient a gradient?
arXiv:1906.07073.

Nuño, G. and Moll, B. (2018). Social optima in economies with heterogeneous agents.
Review of Economic Dynamics, 28:150–180.

Ok, E. A. (2007). Real Analysis with Economic Applications, volume 10. Princeton
University Press.

Paciello, L. and Wiederholt, M. (2014). Exogenous information, endogenous
information, and optimal monetary policy. Review of Economic Studies, 81(1):356–388.

Paroussos, L., Mandel, A., Fragkiadakis, K., Fragkos, P., Hinkel, J., and Vrontisi, Z.
(2019). Climate clubs and the macro-economic benefits of international coopera-
tion on climate policy. Nature Climate Change, 9(7):542–546.

Perla, J. and Tonetti, C. (2014). Equilibrium imitation and growth. Journal of Political
Economy, 122(1):52–76.

Peskir, G. and Shiryaev, A. (2006). Optimal Stopping and Free-boundary Problems.
Springer Verlag.

Pissarides, C. A. (1979). Job matchings with state employment agencies and random
search. The Economic Journal, 89(356):818–833.

Pohl, W., Schmedders, K., and Wilms, O. (2018). Higher order effects in asset pricing
models with long-run risks. The Journal of Finance, 73(3):1061–1111.

Pohl, W., Schmedders, K., and Wilms, O. (2019). Relative existence for recursive
utility. SSRN 3432469.

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the curses of
dimensionality, volume 703. John Wiley & Sons.

Privault, N. (2013). Understanding Markov Chains: Examples and Applications.
Springer-Verlag Singapore.

Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic
Programming. Wiley Interscience.

Quah, D. (1990). Permanent and transitory movements in labor income: An
explanation for excess smoothness in consumption. Journal of Political Economy,
98(3):449–475.

Ráfales, J. and Vázquez, C. (2021). Equilibrium models with heterogeneous agents
under rational expectations and its numerical solution. Communications in
Nonlinear Science and Numerical Simulation, 96:105673.

Ramsey, F. P. (1928). A mathematical theory of saving. The Economic Journal,
38:543–559.

Ren, G. and Stachurski, J. (2021). Dynamic programming with value convexity. Au-
tomatica, 130:109641.

Rendahl, P. (2016). Fiscal policy in an unemployment crisis. The Review of Economic
Studies, 83(3):1189–1224.

Rendahl, P. (2022). Continuous vs. discrete time: Some computational insights. Jour-
nal of Economic Dynamics and Control, 144:104522.

Riedel, F. (2009). Optimal stopping with multiple priors. Econometrica,
77(3):857–908.

Roberts, M. J. and Tybout, J. R. (1997). The decision to export in Colombia: An
empirical model of entry with sunk costs. American Economic Review,
87(4):545–564.

Rogers, L. C. (2002). Monte Carlo valuation of American options. Mathematical
Finance, 12(3):271–286.

Rogerson, R., Shimer, R., and Wright, R. (2005). Search-theoretic models of the labor
market: A survey. Journal of Economic Literature, 43(4):959–988.

Ross, S. A. (2009). Neoclassical Finance. Princeton University Press.

Rubinstein, A. (2003). Economics and psychology? The case of hyperbolic
discounting. International Economic Review, 44(4):1207–1216.

Rust, J. (1987). Optimal replacement of GMC bus engines: An empirical model of
Harold Zurcher. Econometrica, 55(5):999–1033.

Rust, J. (1994). Structural estimation of Markov decision processes. Handbook of
Econometrics, 4:3081–3143.

Rust, J. (1997). Using randomization to break the curse of dimensionality.
Econometrica, 65(3):487–516.

Ruszczyński, A. (2010). Risk-averse dynamic programming for Markov decision
processes. Mathematical Programming, 125(2):235–261.

Saijo, H. (2017). The uncertainty multiplier and business cycles. Journal of Economic
Dynamics and Control, 78:1–25.

Samuelson, P. A. (1939). Interactions between the multiplier analysis and the
principle of acceleration. The Review of Economics and Statistics, 21(2):75–78.

Sargent, T. J. (1980). “Tobin’s q” and the rate of investment in general equilibrium.
In Carnegie-Rochester Conference Series on Public Policy, volume 12, pages 107–154.
Elsevier.

Sargent, T. J. (1987). Dynamic Macroeconomic Theory. Harvard University Press.

Sargent, T. J. and Stachurski, J. (2023a). Completely abstract dynamic programming.
arXiv:2308.02148.

Sargent, T. J. and Stachurski, J. (2023b). Economic Networks: Theory and
Computation. Cambridge University Press.

Savage, L. J. (1951). The theory of statistical decisions. Journal of the American
Statistical Association, 46(253):55–67.

Scarf, H. (1960). The optimality of (S, s) policies in the dynamic inventory problem.
Mathematical Methods in the Social Sciences, pages 196–202.

Schechtman, J. (1976). An income fluctuation problem. Journal of Economic Theory,
12(2):218–241.

Schorfheide, F., Song, D., and Yaron, A. (2018). Identifying long-run risks: A Bayesian
mixed-frequency approach. Econometrica, 86(2):617–654.

Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of
Sciences, 39(10):1095–1100.

Shen, Y., Tobia, M. J., Sommer, T., and Obermayer, K. (2014). Risk-sensitive
reinforcement learning. Neural Computation, 26(7):1298–1328.

Shiryaev, A. N. (2007). Optimal Stopping Rules, volume 8. Springer Science & Business
Media.

Sidford, A., Wang, M., Wu, X., and Ye, Y. (2023). Variance reduced value iteration
and faster algorithms for solving Markov decision processes. Naval Research Logistics
(NRL), 70(5):423–442.

Stachurski, J. (2022). Economic Dynamics: Theory and Computation. MIT Press, 2nd
edition.

Stachurski, J. and Toda, A. A. (2019). An impossibility theorem for wealth in
heterogeneous-agent models with limited heterogeneity. Journal of Economic Theory,
182:1–24.

Stachurski, J., Wilms, O., and Zhang, J. (2022). Unique solutions to
power-transformed affine systems. arXiv:2212.00275.

Stachurski, J. and Zhang, J. (2021). Dynamic programming with state-dependent
discounting. Journal of Economic Theory, 192:105190.

Stanca, L. (2023). Recursive preferences, correlation aversion, and the temporal
resolution of uncertainty. Working papers, University of Torino.

Stokey, N. L. and Lucas, R. E. (1989). Recursive Methods in Dynamic Economics.
Harvard University Press.

Tallarini Jr, T. D. (2000). Risk-sensitive real business cycles. Journal of Monetary
Economics, 45(3):507–532.

Tauchen, G. (1986). Finite state Markov-chain approximations to univariate and
vector autoregressions. Economics Letters, 20(2):177–181.

Thaler, R. (1981). Some empirical evidence on dynamic inconsistency. Economics
Letters, 8(3):201–207.

Toda, A. A. (2014). Incomplete market dynamics and cross-sectional distributions.
Journal of Economic Theory, 154:310–348.

Toda, A. A. (2019). Wealth distribution with random discount factors. Journal of
Monetary Economics, 104:101–113.

Toda, A. A. (2021). Perov’s contraction principle and dynamic programming with
stochastic discounting. Operations Research Letters, 49(5):815–819.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning.
Machine Learning, 16:185–202.

Tyazhelnikov, V. (2022). Production clustering and offshoring. American Economic
Journal: Microeconomics.

Van, C. and Dana, R.-A. (2003). Dynamic Programming in Economics. Springer.

Von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic
Behavior. Princeton University Press.

Walker, H. F. and Ni, P. (2011). Anderson acceleration for fixed-point iterations. SIAM
Journal on Numerical Analysis, 49(4):1715–1735.

Wang, P. and Wen, Y. (2012). Hayashi meets Kiyotaki and Moore: A theory of capital
adjustment costs. Review of Economic Dynamics, 15(2):207–225.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD Thesis, Cambridge
University.

Weil, P. (1990). Nonexpected utility in macroeconomics. The Quarterly Journal of
Economics, 105(1):29–42.

Whittle, P. (1981). Risk-sensitive linear/quadratic/Gaussian control. Advances in
Applied Probability, 13(4):764–777.

Woodford, M. (2011). Simple analytics of the government expenditure multiplier.
American Economic Journal: Macroeconomics, 3(1):1–35.

Ying, L. and Zhu, Y. (2020). A note on optimization formulations of Markov decision
processes.

Yu, L., Lin, L., Guan, G., and Liu, J. (2023). Time-consistent lifetime portfolio selection
under smooth ambiguity. Mathematical Control and Related Fields, 13(3):967–987.

Yue, V. Z. (2010). Sovereign default and debt renegotiation. Journal of International
Economics, 80(2):176–187.

Zaanen, A. C. (2012). Introduction to Operator Theory in Riesz Spaces. Springer.

Zhang, Z. (2012). Variational, Topological, and Partial Order Methods with Their
Applications, volume 29. Springer.

Zhao, G. (2020). Ambiguity, nominal bond yields, and real bond yields. American
Economic Review: Insights, 2(2):177–192.

Zhu, S. (2020). Existence of stationary equilibrium in an incomplete-market model
with endogenous labor supply. International Economic Review, 61(3):1115–1138.
Author Index

Subject Index

𝐶0-semigroup, 315
𝑃-Markov, 82
𝑄-Markov, 321
𝑄-factor, 171
ℓ1 norm, 13
ℓ𝑝 norm, 14
𝜑 ⪯_F 𝜓, 62
𝜎-value function, 107, 135, 251, 296, 333
𝑣-greedy, 35, 136, 193, 195, 253, 294, 333
𝑣-min-greedy, 285, 304

Absolute value, 12
Absorbing set, 86
Absorbing state, 86
Abstract dynamic program, 294
Action, 1
Action space, 128, 246, 332
ADP, 294
Adversarial agents, 275
Aggregator, 235
Ambiguity, 280
American option, 106, 119
AR(1) model, 89
Arrow–Debreu discount operator, 205

Backward induction, 4
Bellman equation, 10, 11, 129, 191, 253, 297
Bellman max-operator, 304
Bellman min-equation, 304
Bellman min-operator, 285, 304
Bellman operator, 33, 109, 137, 193, 253, 297
Bellman’s principle of min-optimality, 304
Bellman’s principle of optimality, 12, 110, 136, 256, 299, 335
Bermudan options, 106
Bijection, 342
Binary relation, 341
Blackwell aggregator, 239
Blackwell’s conditions, 264
Bounded RDP, 258
Bounded set, 15, 342

Call option, 119, 200
Cardinality, 342
Cartesian product, 341
Cauchy sequence, 23
Cauchy–Schwarz inequality, 13
Certainty equivalent, 231, 232
CES aggregator, 235
CES-Uzawa aggregator, 235
Chapman–Kolmogorov equation, 320
Closed set, 15, 30, 344
Column major, 78
Compact set, 30
Completeness, 23
Complex numbers, 13
Complex vectors, 13
Composition, 342
Computational complexity, 50
Concave, 215
Concave RDP, 272
Conjugate dynamics, 43
Constant-subadditive, 233
Consumption path, 96
Continuation reward, 106
Continuation value, 4, 11, 33
Continuation value function, 100, 114
Continuation value operator, 117
Continuity of RDPs, 252
Continuous function, 15
Continuous time MDP, 332
Contracting RDP, 262
Contraction, 22
Convergence of real sequences, 343
Convergence of vectors, 15
Convergence rates, 46
Convex, 215
Convex RDP, 272
Correspondence, 128
Cost-to-go function, 286
Counter CDF, 64
CRRA utility, 96
Cum-dividend contract, 205
Cumulative distribution, 32

Damped iteration, 23
Decreasing function, 61
Decreasing sequence, 343
Diagonalizable matrix, 43
Discount factor, 129
Discount factor process, 183
Discount factors, 181
Discount operator, 184
Discount rate, 287, 332
Discounted expected utility, 220
Distribution, 31
Dominant eigenvector, 74
Dominate, 65
Downward stability, 292
Dual set (order dual), 304
Dynamical system, 43

Eigenvalue, 17
Eigenvector, 17
Elasticity of intertemporal substitution, 236
Embedded jump chain, 322
Entropic certainty equivalent, 232
Entropic risk measure, 222
Epstein–Zin, 226
EPV, 2
Equivalence, 15
Equivalence relation, 341
Ergodicity, 87
Euclidean norm, 13
Eventually contracting operator, 189
Eventually contracting RDP, 269
Ex-dividend contract, 204
Exercise region, 121
Exit reward, 106
Expectation, 31
Expected present value, 2
Expected value function, 165
Expected value operator, 165
Exponential distribution, 308
Exponential function, 307
Exponential semigroup, 315, 316

Feasible correspondence, 129, 246, 332
Feasible policy, 134, 246, 333
Feasible state-action pairs, 129, 246, 332
Finite ADP, 297
Finite set, 342
Fixed point, 20
Fixed point iteration, 24
Flow utility, 96

Game, 275
Gelfand’s formula, 19
Globally stable, 22
Globally stable RDP, 252
Greatest element, 53
Greedy policy, 35, 98, 110, 136, 175, 193, 195, 253, 333
Gumbel distribution, 166

Hamilton–Jacobi–Bellman, 335
HJB, 335
Holding times, 321
Homeomorphism, 44
Howard min-operator, 304
Howard operator, 255, 297
Howard policy iteration, 141
HPI, 141
Hyperbolic discounting, 210

Inada conditions, 215
Increasing function, 61
Increasing sequence, 343
Indicator function, 14
Infimum, 55, 343
Infinitesimal generator, 315, 316, 320
Initial condition, 82
Initial value problem, 311
Inner product, 12
Intensity kernel, 333
Intensity matrix, 317
Interest rates, 181
Invariant set, 317
Invariant sets, 22
Irreducibility, 69
Irreducible, 260
IVP, 311

Jacobian, 45
Jump matrix, 321
Jump times, 322

Kernel, 77
KL divergence, 282
Knaster–Tarski theorem, 213
Knightian uncertainty, 280
Kolmogorov equations, 320
Koopmans operator, 223, 226, 235
Kullback–Leibler divergence, 282

Law of iterated expectations, 92
Least element, 53
Leontief aggregator, 235
Lifetime reward, 1, 94
Lifetime value, 107, 135, 237, 333
Linear aggregator, 235
Linear convergence, 46
Linear operator, 76
Local spectral radius, 70
Locally stable, 45
Lower bound, 55, 342
Lucas SDF, 202

Markov chain, 82, 321
Markov decision process, 129, 332
Markov matrix, 71
Markov operator, 79
Markov policy, 35
Markov property, 330
Markov semigroup, 319
Matrix exponential, 309
Matrix norm, 16
Matrix representation, 77
Max-optimal, 304
Max-stable, 297
Maximizer, 344
Maximum, 53, 343
MDP, 129
Median, 32
Memoryless property, 308
min-Bellman equation, 285
Min-HPI, 304
Min-optimal, 304
Min-optimal policy, 285
Min-stable, 304
Min-value function, 285, 304
Minimizer, 344
Minimum, 53, 343
Minkowski’s inequality, 14
Mixed strategy, 300
Modulus of contraction, 22, 262
Monotone increasing, 93, 234
Monotonicity, 246

Natural numbers, 341
Negative discounting, 287
Neighborhood, 15
Newton’s method, 47, 48
Nonempty correspondence, 129
Nonexpansive operator, 174
Nonnegative matrix, 69
Norm, 13

One-to-one, 342
One-to-one correspondence, 342
One-to-one function, 342
Onto, 342
Onto function, 342
Open set, 15, 30
Operator, 31
Operator norm, 16
Optimal, 335
Optimal policy, 107, 136, 193, 256, 299
Optimal stopping, 106
Option, 119
Order dual, 293
Order interval, 57
Order stability, 292
Order stable ADP, 297
Order theory, 42
Order-preserving, 60
Order-reversing, 60

Partial order, 51
Partially ordered set, 51
Pointwise order, 52, 53
Policy, 34
Policy function, 106
Policy functions, 294
Policy operator, 108, 135, 192, 250, 294
Positive cone, 79
Positive homogeneity, 233
Positive matrix, 69
Positive operator, 79
Positive semigroup, 327
Preimage, 342
Price-dividend ratio, 206
Pricing kernel, 202

Q-Learning, 170
Quadratic convergence, 46
Quantile, 32
Quantile preferences, 232

Rate function, 321
Rates of convergence, 46
RDP, 246
Real part, 313
Real sequence, 343
Real vector, 12
Real-valued function, 29
Recursive decision process, 246
Recursive preferences, 219
Reinforcement Learning, 170
Renewal state, 130
Reservation wage, 6, 36, 102
Reward function, 1, 129, 332
Risk premium, 201
Risk-neutral pricing, 200
Risk-sensitive preferences, 221, 282
Robust control, 280
Row major, 78

Self-map, 20
Semigroup, 311, 319
Series, 343
Set, 341
Shortest path, 286
Shortest path problem, 250
Spectral bound, 313
Spectral radius, 18
Spectrum, 313
Stability of RDPs, 252
State space, 82, 128, 246, 332
State variable, 1
State-dependent discounting, 181, 191
Stationary distribution, 71, 87, 90
Stochastic discount factor, 202, 203
Stochastic dominance, 62, 119
Stochastic kernel, 129
Stochastic matrix, 71
Stopping value, 4
Strictly decreasing, 61
Strictly increasing, 61
Strike price, 120
Strongly continuous semigroup, 315
Subadditivity, 233
Subgradient, 143
Sublattice, 56
Submultiplicative, 16
Successive approximation, 24
Superadditivity, 233
Support of distribution, 31
Supremum, 54, 342
Supremum norm, 14

Tauchen’s method, 90
Time additive, 96, 236
Time-varying discount factors, 181
Topological conjugacy, 44
Topologically conjugate RDPs, 260
Total order, 53
Transition matrix, 82, 83
Triangle inequality, 58

Unit simplex, 31
Upper bound, 54, 342
Upper envelope, 59
Upward stability, 292
Uzawa aggregator, 235

Value aggregator, 246
Value function, 11, 97, 108, 136, 193, 256, 299, 335
Value function iteration, 36, 140
Value space, 246
Vector field, 311
VFI, 140

Wait times, 322
Well-posed ADP, 296
Well-posed RDP, 251

Zero sum game, 275
