QuantEcon Python Econometrics

Intermediate Quantitative Economics
with Python
Thomas J. Sargent & John Stachurski
Apr 30, 2024

CONTENTS
I Tools and Techniques 5

1 Modeling COVID 19 7
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 The SIR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Ending Lockdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Linear Algebra 17
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Solving Systems of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Further Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 QR Decomposition 41
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Gram-Schmidt process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Some Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6 Using QR Decomposition to Compute Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7 𝑄𝑅 and PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Circulant Matrices 49
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Constructing a Circulant Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Connection to Permutation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Examples with Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Associated Permutation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5 Singular Value Decomposition (SVD) 65

5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 The Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Four Fundamental Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 Eckart-Young Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.6 Full and Reduced SVD’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.7 Polar Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.8 Application: Principal Components Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.9 Relationship of PCA to SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.10 PCA with Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.11 Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 VARs and DMDs 83

6.1 First-Order Vector Autoregressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Dynamic Mode Decomposition (DMD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3 Representation 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4 Representation 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.5 Representation 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.6 Source for Some Python Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7 Using Newton’s Method to Solve Economic Models 95

7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2 Fixed Point Computation Using Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3 Root-Finding in One Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.4 Multivariate Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
II Elementary Statistics 119

8 Elementary Probability with Matrices 121
8.1 Sketch of Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.2 What Does Probability Mean? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.3 Representing Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.4 Univariate Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.5 Bivariate Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.6 Marginal Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.7 Conditional Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.8 Statistical Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.9 Means and Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.10 Generating Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.11 Some Discrete Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.12 Geometric distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.13 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.14 A Mixed Discrete-Continuous Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.15 Matrix Representation of Some Bivariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.16 A Continuous Bivariate Random Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.17 Sum of Two Independently Distributed Random Variables . . . . . . . . . . . . . . . . . . . . . . . . 154
8.18 Transition Probability Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.19 Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.20 Copula Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.21 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9 LLN and CLT 163

9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.2 Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.3 LLN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.4 CLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
10 Two Meanings of Probability 181
10.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
10.2 Frequentist Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
10.3 Bayesian Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
10.4 Role of a Conjugate Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
11 Multivariate Hypergeometric Distribution 199

11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.2 The Administrator’s Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.3 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
12 Multivariate Normal Distribution 209

12.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
12.2 The Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
12.3 Bivariate Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
12.4 Trivariate Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
12.5 One Dimensional Intelligence (IQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
12.6 Information as Surprise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
12.7 Cholesky Factor Magic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
12.8 Math and Verbal Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
12.9 Univariate Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.10 Stochastic Difference Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
12.11 Application to Stock Price Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.12 Filtering Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
12.13 Classic Factor Analysis Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.14 PCA and Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
13 Fault Tree Uncertainties 247

13.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
13.2 Log normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
13.3 The Convolution Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
13.4 Approximating Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
13.5 Convolving Probability Mass Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
13.6 Failure Tree Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
13.7 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
13.8 Failure Rates Unknown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
13.9 Waste Hoist Failure Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
14 Introduction to Artificial Neural Networks 263

14.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
14.2 A Deep (but not Wide) Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
14.3 Calibrating Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
14.4 Back Propagation and the Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
14.5 Training Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.6 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
14.7 How Deep? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
14.8 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
15 Randomized Response Surveys 275

15.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
15.2 Warner’s Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
15.3 Comparing Two Survey Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
15.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
16 Expected Utilities of Random Responses 285
16.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
16.2 Privacy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
16.3 Zoo of Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
16.4 Respondent’s Expected Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
16.5 Utilitarian View of Survey Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
16.6 Criticisms of Proposed Privacy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
III Linear Programming 301

17 Optimal Transport 303
17.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
17.2 The Optimal Transport Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
17.3 The Linear Programming Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
17.4 The Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
17.5 The Python Optimal Transport Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
18 Von Neumann Growth Model (and a Generalization) 321

18.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
18.2 Model Ingredients and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
18.3 Dynamic Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
18.4 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
18.5 Interpretation as Two-player Zero-sum Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
IV Introduction to Dynamics 337

19 Finite Markov Chains 339
19.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
19.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
19.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
19.4 Marginal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
19.5 Irreducibility and Aperiodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
19.6 Stationary Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
19.7 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
19.8 Computing Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
19.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
20 Inventory Dynamics 363

20.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
20.2 Sample Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
20.3 Marginal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
20.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
21 Linear State Space Models 373

21.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
21.2 The Linear State Space Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
21.3 Distributions and Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
21.4 Stationarity and Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
21.5 Noisy Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
21.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
21.7 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
21.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
22 Samuelson Multiplier-Accelerator 395
22.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
22.2 Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
22.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
22.4 Stochastic Shocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
22.5 Government Spending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
22.6 Wrapping Everything Into a Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
22.7 Using the LinearStateSpace Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
22.8 Pure Multiplier Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
22.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
23 Kesten Processes and Firm Dynamics 433

23.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
23.2 Kesten Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
23.3 Heavy Tails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
23.4 Application: Firm Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
23.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
24 Wealth Distribution Dynamics 445

24.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
24.2 Lorenz Curves and the Gini Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
24.3 A Model of Wealth Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
24.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
24.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
24.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
25 A First Look at the Kalman Filter 459

25.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
25.2 The Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
25.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
25.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
25.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
26 Another Look at the Kalman Filter 479

26.1 A worker’s output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
26.2 A firm’s wage-setting policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
26.3 A state-space representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
26.4 An Innovations Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
26.5 Some Computational Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
26.6 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
V Search 495
27 Job Search I: The McCall Search Model 497
27.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
27.2 The McCall Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
27.3 Computing the Optimal Policy: Take 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
27.4 Computing an Optimal Policy: Take 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
27.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
28 Job Search II: Search and Separation 513

28.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
28.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
28.3 Solving the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
28.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
28.5 Impact of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
28.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
29 Job Search III: Fitted Value Function Iteration 525

29.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
29.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
29.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
29.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
30 Job Search IV: Correlated Wage Offers 535

30.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
30.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
30.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
30.4 Unemployment Duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
30.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
31 Job Search V: Modeling Career Choice 545

31.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
31.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
31.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
31.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
32 Job Search VI: On-the-Job Search 559

32.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
32.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
32.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
32.4 Solving for Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
32.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
33 Job Search VII: A McCall Worker Q-Learns 571

33.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
33.2 Review of McCall Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
33.3 Implied Quality Function 𝑄 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
33.4 From Probabilities to Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
33.5 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
33.6 Employed Worker Can’t Quit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
33.7 Possible Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
VI Consumption, Savings and Capital 589

34 Cass-Koopmans Model 591
34.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
34.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
34.3 Planning Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
34.4 Shooting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
34.5 Setting Initial Capital to Steady State Capital . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
34.6 A Turnpike Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
34.7 A Limiting Infinite Horizon Economy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
35 Cass-Koopmans Competitive Equilibrium 609

35.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
35.2 Review of Cass-Koopmans Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
35.3 Competitive Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
35.4 Market Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
35.5 Firm Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
35.6 Household Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
35.7 Computing a Competitive Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
35.8 Yield Curves and Hicks-Arrow Prices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
36 Cake Eating I: Introduction to Optimal Saving 625

36.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
36.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
36.3 The Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
36.4 The Optimal Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
36.5 The Euler Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
36.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
37 Cake Eating II: Numerical Methods 635

37.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
37.2 Reviewing the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636
37.3 Value Function Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636
37.4 Time Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
37.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
38 Optimal Growth I: The Stochastic Optimal Growth Model 651

38.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
38.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
38.3 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
38.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
39 Optimal Growth II: Accelerating the Code with Numba 667

39.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
39.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
39.3 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
39.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
40 Optimal Growth III: Time Iteration 679

40.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
40.2 The Euler Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
40.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
40.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
41 Optimal Growth IV: The Endogenous Grid Method 691

41.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
41.2 Key Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
41.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
42 The Income Fluctuation Problem I: Basic Model 699

42.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
42.2 The Optimal Savings Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
42.3 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
42.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
42.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
43 The Income Fluctuation Problem II: Stochastic Returns on Assets 717

43.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
43.2 The Savings Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
43.3 Solution Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
43.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
43.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
VII Bayes Law 731

44 Non-Conjugate Priors 733
44.1 Unleashing MCMC on a Binomial Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
44.2 Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
44.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
44.4 Alternative Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
44.5 Posteriors Via MCMC and VI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
44.6 Non-conjugate Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
45 Posterior Distributions for AR(1) Parameters 779

45.1 PyMC Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 782
45.2 Numpyro Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
46 Forecasting an AR(1) Process 789

46.1 A Univariate First-Order Autoregressive Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 790
46.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
46.3 Predictive Distributions of Path Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
46.4 A Wecker-Like Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793
46.5 Using Simulations to Approximate a Posterior Distribution . . . . . . . . . . . . . . . . . . . . . . . 794
46.6 Calculating Sample Path Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
46.7 Original Wecker Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796
46.8 Extended Wecker Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
46.9 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
VIII Information 805

47 Job Search VII: Search with Learning 807
47.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807
47.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
47.3 Take 1: Solution by VFI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
47.4 Take 2: A More Efficient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 816
47.5 Another Functional Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
47.6 Solving the RWFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
47.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
47.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
47.9 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
47.10 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
47.11 Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
47.12 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
48 Likelihood Ratio Processes 839

48.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 839
48.2 Likelihood Ratio Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 840
48.3 Nature Permanently Draws from Density g . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 841
48.4 Peculiar Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 843
48.5 Nature Permanently Draws from Density f . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
48.6 Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845
48.7 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 850
48.8 Sequels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
49 Computing Mean of a Likelihood Ratio Process 855

49.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 855
49.2 Mathematical Expectation of Likelihood Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856
49.3 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 858
49.4 Selecting a Sampling Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 859
49.5 Approximating a cumulative likelihood ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 860
49.6 Distribution of Sample Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
49.7 More Thoughts about Choice of Sampling Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 863
50 A Problem that Stumped Milton Friedman 869

50.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
50.2 Origin of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 870
50.3 A Dynamic Programming Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
50.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 876
50.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
50.6 Comparison with Neyman-Pearson Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
50.7 Sequels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 886
51 Exchangeability and Bayesian Updating 887

51.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 887
51.2 Independently and Identically Distributed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 888
51.3 A Setting in Which Past Observations Are Informative . . . . . . . . . . . . . . . . . . . . . . . . . 889
51.4 Relationship Between IID and Exchangeable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 890
51.5 Exchangeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
51.6 Bayes’ Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
51.7 More Details about Bayesian Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 892
51.8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
51.9 Sequels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 901
52 Likelihood Ratio Processes and Bayesian Learning 903

52.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
52.2 The Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 904
52.3 Likelihood Ratio Process and Bayes’ Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 905
52.4 Behavior of posterior probability {𝜋𝑡 } under the subjective probability distribution . . . . . . . . . . . 909
52.5 Initial Prior is Verified by Paths Drawn from Subjective Conditional Densities . . . . . . . . . . . . . . 915
52.6 Drilling Down a Little Bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
52.7 Sequels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
53 Incorrect Models 919

53.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 919
53.2 Sampling from Compound Lottery 𝐻 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 922
53.3 Type 1 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
53.4 What a type 1 Agent Learns when Mixture 𝐻 Generates Data . . . . . . . . . . . . . . . . . . . . . . 925
53.5 Kullback-Leibler Divergence Governs Limit of 𝜋𝑡 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927
53.6 Type 2 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 931
54 Bayesian versus Frequentist Decision Rules 935

54.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935
54.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 936
54.3 Frequentist Decision Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939
54.4 Bayesian Decision Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 945
54.5 Was the Navy Captain’s Hunch Correct? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 952
54.6 More Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 954
54.7 Distribution of Bayesian Decision Rule’s Time to Decide . . . . . . . . . . . . . . . . . . . . . . . . 954
54.8 Probability of Making Correct Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 958
54.9 Distribution of Likelihood Ratios at Frequentist’s 𝑡 . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
IX LQ Control 963
55 LQ Control: Foundations 965
55.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
55.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 966
55.3 Optimality – Finite Horizon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 968
55.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 971
55.5 Extensions and Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 976
55.6 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
55.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986
56 Lagrangian for LQ Control 995

56.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995
56.2 Undiscounted LQ DP Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996
56.3 Lagrangian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 997
56.4 State-Costate Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 998
56.5 Reciprocal Pairs Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 998
56.6 Schur decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 999
56.7 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1000
56.8 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1005
56.9 Discounted Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1006
57 Eliminating Cross Products 1009

57.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1009
57.2 Undiscounted Dynamic Programming Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1009
57.3 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1010
57.4 Duality table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1011
58 The Permanent Income Model 1013

58.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1013
58.2 The Savings Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1014
58.3 Alternative Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1021
58.4 Two Classic Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1024
58.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1027
58.6 Appendix: The Euler Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1028
59 Permanent Income II: LQ Techniques 1029

59.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1029
59.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1030
59.3 The LQ Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1032
59.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1033
59.5 Two Example Economies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036
60 Production Smoothing via Inventories 1049

60.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1049
60.2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
60.3 Inventories Not Useful . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056
60.4 Inventories Useful but are Hardwired to be Zero Always . . . . . . . . . . . . . . . . . . . . . . . . . 1056
60.5 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057
60.6 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
60.7 Example 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1059
60.8 Example 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1061
60.9 Example 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1062
60.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1065
X Multiple Agent Models 1071

61 A Lake Model of Employment and Unemployment 1073
61.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1073
61.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074
61.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076
61.4 Dynamics of an Individual Worker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081
61.5 Endogenous Job Finding Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1083
61.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1090
62 Rational Expectations Equilibrium 1101

62.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1101
62.2 Rational Expectations Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1104
62.3 Computing an Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1107
62.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1109
63 Stability in Linear Rational Expectations Models 1115

63.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1116
63.2 Linear Difference Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1116
63.3 Illustration: Cagan’s Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1118
63.4 Some Python Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1120
63.5 Alternative Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1122
63.6 Another Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1124
63.7 Log money Supply Feeds Back on Log Price Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1126
63.8 Big 𝑃 , Little 𝑝 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1130
63.9 Fun with SymPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1132
64 Markov Perfect Equilibrium 1135

64.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1135
64.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1136
64.3 Linear Markov Perfect Equilibria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1137
64.4 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1139
64.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1144
65 Uncertainty Traps 1153

65.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1153
65.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1154
65.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1157
65.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1158
65.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1159
66 The Aiyagari Model 1167

66.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1167
66.2 The Economy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1168
66.3 Firms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1169
66.4 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1170
XI Asset Pricing and Finance 1177
67 Asset Pricing: Finite State Models 1179
67.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1179
67.2 Pricing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1180
67.3 Prices in the Risk-Neutral Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1181
67.4 Risk Aversion and Asset Prices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1185
67.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194
68 Competitive Equilibria with Arrow Securities 1199

68.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1199
68.2 The setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1200
68.3 Recursive Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1201
68.4 State Variable Degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1202
68.5 Markov Asset Prices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1202
68.6 General Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1204
68.7 Python Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1208
68.8 Finite Horizon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1219
69 Heterogeneous Beliefs and Bubbles 1225

69.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1225
69.2 Structure of the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1226
69.3 Solving the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1228
69.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1233
XII Data and Empirics 1237

70 Pandas for Panel Data 1239
70.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1239
70.2 Slicing and Reshaping Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1240
70.3 Merging Dataframes and Filling NaNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1245
70.4 Grouping and Summarizing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1250
70.5 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1256
70.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1257
71 Linear Regression in Python 1261

71.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1261
71.2 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1262
71.3 Extending the Linear Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1267
71.4 Endogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1269
71.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1273
71.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1273
72 Maximum Likelihood Estimation 1277

72.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1277
72.2 Set Up and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1278
72.3 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1281
72.4 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1283
72.5 MLE with Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1285
72.6 Maximum Likelihood Estimation with statsmodels . . . . . . . . . . . . . . . . . . . . . . . . . 1290
72.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1294
72.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1295
XIII Auctions 1299
73 First-Price and Second-Price Auctions 1301
73.1 First-Price Sealed-Bid Auction (FPSB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1301
73.2 Second-Price Sealed-Bid Auction (SPSB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1302
73.3 Characterization of SPSB Auction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1302
73.4 Uniform Distribution of Private Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1303
73.5 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1303
73.6 First price sealed bid auction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1303
73.7 Second Price Sealed Bid Auction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1304
73.8 Python Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1304
73.9 Revenue Equivalence Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1306
73.10 Calculation of Bid Price in FPSB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1308
73.11 𝜒2 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1309
73.12 5 Code Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1312
73.13 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1317
74 Multiple Good Allocation Mechanisms 1319

74.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1319
74.2 Ascending Bids Auction for Multiple Goods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1319
74.3 A Benevolent Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1320
74.4 Equivalence of Allocations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1320
74.5 Ascending Bid Auction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1320
74.6 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1321
74.7 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1323
74.8 A Python Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1331
74.9 Robustness Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1340
74.10 A Groves-Clarke Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1352
74.11 An Example Solved by Hand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1353
74.12 Another Python Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1356
XIV Other 1363

75 Troubleshooting 1365
75.1 Fixing Your Local Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1365
75.2 Reporting an Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1366
76 References 1367
77 Execution Statistics 1369
Bibliography 1373
Index 1381
Intermediate Quantitative Economics with Python
This website presents a set of lectures on quantitative economic modeling.

• Tools and Techniques
– Modeling COVID 19
– Linear Algebra
– QR Decomposition
– Circulant Matrices
– Singular Value Decomposition (SVD)
– VARs and DMDs
– Using Newton’s Method to Solve Economic Models
• Elementary Statistics
– Elementary Probability with Matrices
– LLN and CLT
– Two Meanings of Probability
– Multivariate Hypergeometric Distribution
– Multivariate Normal Distribution
– Fault Tree Uncertainties
– Introduction to Artificial Neural Networks
– Randomized Response Surveys
– Expected Utilities of Random Responses
• Linear Programming
– Optimal Transport
– Von Neumann Growth Model (and a Generalization)
• Introduction to Dynamics
– Finite Markov Chains
– Inventory Dynamics
– Linear State Space Models
– Samuelson Multiplier-Accelerator
– Kesten Processes and Firm Dynamics
– Wealth Distribution Dynamics
– A First Look at the Kalman Filter
– Another Look at the Kalman Filter
• Search
– Job Search I: The McCall Search Model
– Job Search II: Search and Separation
– Job Search III: Fitted Value Function Iteration
– Job Search IV: Correlated Wage Offers
– Job Search V: Modeling Career Choice
– Job Search VI: On-the-Job Search
– Job Search VII: A McCall Worker Q-Learns
• Consumption, Savings and Capital
– Cass-Koopmans Model
– Cass-Koopmans Competitive Equilibrium
– Cake Eating I: Introduction to Optimal Saving
– Cake Eating II: Numerical Methods
– Optimal Growth I: The Stochastic Optimal Growth Model
– Optimal Growth II: Accelerating the Code with Numba
– Optimal Growth III: Time Iteration
– Optimal Growth IV: The Endogenous Grid Method
– The Income Fluctuation Problem I: Basic Model
– The Income Fluctuation Problem II: Stochastic Returns on Assets
• Bayes Law
– Non-Conjugate Priors
– Posterior Distributions for AR(1) Parameters
– Forecasting an AR(1) Process
• Information
– Job Search VII: Search with Learning
– Likelihood Ratio Processes
– Computing Mean of a Likelihood Ratio Process
– A Problem that Stumped Milton Friedman
– Exchangeability and Bayesian Updating
– Likelihood Ratio Processes and Bayesian Learning
– Incorrect Models
– Bayesian versus Frequentist Decision Rules
• LQ Control
– LQ Control: Foundations
– Lagrangian for LQ Control
– Eliminating Cross Products
– The Permanent Income Model
– Permanent Income II: LQ Techniques
– Production Smoothing via Inventories
• Multiple Agent Models
– A Lake Model of Employment and Unemployment
– Rational Expectations Equilibrium
– Stability in Linear Rational Expectations Models
– Markov Perfect Equilibrium
– Uncertainty Traps
– The Aiyagari Model
• Asset Pricing and Finance
– Asset Pricing: Finite State Models
– Competitive Equilibria with Arrow Securities
– Heterogeneous Beliefs and Bubbles
• Data and Empirics
– Pandas for Panel Data
– Linear Regression in Python
– Maximum Likelihood Estimation
• Auctions
– First-Price and Second-Price Auctions
– Multiple Good Allocation Mechanisms
• Other
– Troubleshooting
– References
– Execution Statistics
Part I
Tools and Techniques
CHAPTER
ONE
MODELING COVID 19
Contents
• Modeling COVID 19
– Overview
– The SIR Model
– Implementation
– Experiments
– Ending Lockdown
1.1 Overview
This is a Python version of the code for analyzing the COVID-19 pandemic provided by Andrew Atkeson.
See, in particular:
• NBER Working Paper No. 26867
• COVID-19 Working papers and code
The purpose of Atkeson's notes is to introduce economists to quantitative modeling of infectious disease dynamics.
Dynamics are modeled using a standard SIR (Susceptible-Infected-Removed) model of disease spread.
The model dynamics are represented by a system of ordinary differential equations.
The main objective is to study the impact of suppression through social distancing on the spread of the infection.
The focus is on US outcomes but the parameters can be adjusted to study other countries.
We will use the following standard imports:
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import numpy as np
from numpy import exp
We will also use SciPy’s numerical routine odeint for solving differential equations.
from scipy.integrate import odeint
This routine calls into compiled code from the FORTRAN library odepack.
1.2 The SIR Model
In the version of the SIR model we will analyze there are four states.
All individuals in the population are assumed to be in one of these four states.
The states are: susceptible (S), exposed (E), infected (I) and removed (R).
Comments:
• Those in state R have been infected and either recovered or died.
• Those who have recovered are assumed to have acquired immunity.
• Those in the exposed group are not yet infectious.
1.2.1 Time Path
The flow across states follows the path 𝑆 → 𝐸 → 𝐼 → 𝑅.

All individuals in the population are eventually infected when the transmission rate is positive and 𝑖(0) > 0.
The interest is primarily in
• the number of infections at a given time (which determines whether or not the health care system is overwhelmed)
and
• how long the caseload can be deferred (hopefully until a vaccine arrives)
Using lower case letters for the fraction of the population in each state, the dynamics are
$$
\begin{aligned}
\dot s(t) &= -\beta(t)\, s(t)\, i(t) \\
\dot e(t) &= \beta(t)\, s(t)\, i(t) - \sigma e(t) \\
\dot i(t) &= \sigma e(t) - \gamma i(t)
\end{aligned}
\tag{1.1}
$$
In these equations,
• 𝛽(𝑡) is called the transmission rate (the rate at which individuals bump into others and expose them to the virus).
• 𝜎 is called the infection rate (the rate at which those who are exposed become infected)
• 𝛾 is called the recovery rate (the rate at which infected people recover or die).
• the dot symbol $\dot y$ represents the time derivative 𝑑𝑦/𝑑𝑡.
We do not need to model the fraction 𝑟 of the population in state 𝑅 separately because the states form a partition.
In particular, the “removed” fraction of the population is 𝑟 = 1 − 𝑠 − 𝑒 − 𝑖.
We will also track 𝑐 = 𝑖 + 𝑟, which is the cumulative caseload (i.e., all those who have or have had the infection).
The system (1.1) can be written in vector form as
𝑥̇ = 𝐹 (𝑥, 𝑡), 𝑥 ∶= (𝑠, 𝑒, 𝑖) (1.2)
for suitable definition of 𝐹 (see the code below).

1.2.2 Parameters
Both 𝜎 and 𝛾 are thought of as fixed, biologically determined parameters.

As in Atkeson’s note, we set
• 𝜎 = 1/5.2 to reflect an average incubation period of 5.2 days.
• 𝛾 = 1/18 to match an average illness duration of 18 days.
The transmission rate is modeled as
• 𝛽(𝑡) ∶= 𝑅(𝑡)𝛾 where 𝑅(𝑡) is the effective reproduction number at time 𝑡.
(The notation is slightly confusing, since 𝑅(𝑡) is different to 𝑅, the symbol that represents the removed state.)
1.3 Implementation
First we set the population size to match the US.
pop_size = 3.3e8
Next we fix parameters as described above.
γ = 1 / 18
σ = 1 / 5.2
Now we construct a function that represents 𝐹 in (1.2)
def F(x, t, R0=1.6):
    """
    Time derivative of the state vector.

        * x is the state vector (array_like)
        * t is time (scalar)
        * R0 is the effective reproduction number, either a constant
          or a function of time (defaults to a constant)

    """
    s, e, i = x

    # New exposure of susceptibles
    β = R0(t) * γ if callable(R0) else R0 * γ
    ne = β * s * i

    # Time derivatives
    ds = - ne
    de = ne - σ * e
    di = σ * e - γ * i

    return ds, de, di
Note that R0 can be either constant or a given function of time.
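As a quick illustration of this point, the sketch below evaluates a copy of F at a single state, once with a constant R0 and once with a function of time. The state x and the time path 2.0 + t are hypothetical values chosen only for illustration.

```python
# Parameter values taken from the text above
γ = 1 / 18
σ = 1 / 5.2

def F(x, t, R0=1.6):
    "Time derivative of the state vector (s, e, i)."
    s, e, i = x
    β = R0(t) * γ if callable(R0) else R0 * γ
    ne = β * s * i
    return -ne, ne - σ * e, σ * e - γ * i

x = (0.9, 0.05, 0.05)                        # hypothetical state, for illustration
const = F(x, 0.0, R0=2.0)                    # constant R0
varying = F(x, 0.0, R0=lambda t: 2.0 + t)    # hypothetical path with R0(0) = 2

print(const)
print(varying)
```

At 𝑡 = 0 the two calls agree, since the time path starts at the same value as the constant. The three derivatives sum to −𝛾𝑖, which is the flow out of 𝐼 into the removed state.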

The initial conditions are set to
# initial conditions of s, e, i
i_0 = 1e-7
e_0 = 4 * i_0
s_0 = 1 - i_0 - e_0
In vector form the initial condition is
x_0 = s_0, e_0, i_0
We solve for the time path numerically using odeint, at a sequence of dates t_vec.
def solve_path(R0, t_vec, x_init=x_0):
    """
    Solve for i(t) and c(t) via numerical integration,
    given the time path for R0.

    """
    G = lambda x, t: F(x, t, R0)
    s_path, e_path, i_path = odeint(G, x_init, t_vec).transpose()

    c_path = 1 - s_path - e_path       # cumulative cases
    return i_path, c_path
1.4 Experiments
Let’s run some experiments using this code.

The time period we investigate will be 550 days, or around 18 months:
t_length = 550
grid_size = 1000
t_vec = np.linspace(0, t_length, grid_size)
1.4.1 Experiment 1: Constant R0 Case
Let’s start with the case where R0 is constant.

We calculate the time path of infected people under different assumptions for R0:
R0_vals = np.linspace(1.6, 3.0, 6)
labels = [f'$R0 = {r:.2f}$' for r in R0_vals]
i_paths, c_paths = [], []

for r in R0_vals:
    i_path, c_path = solve_path(r, t_vec)
    i_paths.append(i_path)
    c_paths.append(c_path)
Here’s some code to plot the time paths.

def plot_paths(paths, labels, times=t_vec):

    fig, ax = plt.subplots()

    for path, label in zip(paths, labels):
        ax.plot(times, path, label=label)

    ax.legend(loc='upper left')

    plt.show()
Let’s plot current cases as a fraction of the population.
plot_paths(i_paths, labels)
As expected, lower effective transmission rates defer the peak of infections.

They also lead to a lower peak in current cases.
Here are cumulative cases, as a fraction of population:
plot_paths(c_paths, labels)
1.4.2 Experiment 2: Changing Mitigation
Let’s look at a scenario where mitigation (e.g., social distancing) is successively imposed.
Here’s a specification for R0 as a function of time.
def R0_mitigating(t, r0=3, η=1, r_bar=1.6):
    R0 = r0 * exp(- η * t) + (1 - exp(- η * t)) * r_bar
    return R0
The idea is that R0 starts off at 3 and falls to 1.6.

This is due to progressive adoption of stricter mitigation measures.
The parameter η controls the rate, or the speed at which restrictions are imposed.
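To see these properties numerically, here is a small self-contained check of the specification above. The choice η = 1/20 and the evaluation dates are illustrative assumptions.

```python
import numpy as np

def R0_mitigating(t, r0=3, η=1, r_bar=1.6):
    return r0 * np.exp(-η * t) + (1 - np.exp(-η * t)) * r_bar

η = 1 / 20                              # illustrative rate
start = R0_mitigating(0.0, η=η)         # value at t = 0
late = R0_mitigating(550.0, η=η)        # value at the end of the horizon
t_half = np.log(2) / η                  # date at which half the gap has closed
mid = R0_mitigating(t_half, η=η)

print(start, mid, late)
```

The path starts at r0, ends close to r_bar, and sits halfway between the two at 𝑡 = ln(2)/η, so a smaller η means slower adoption of restrictions.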
We consider several different rates:
η_vals = 1/5, 1/10, 1/20, 1/50, 1/100

labels = [fr'$\eta = {η:.2f}$' for η in η_vals]
This is what the time path of R0 looks like at these alternative rates:
fig, ax = plt.subplots()

for η, label in zip(η_vals, labels):
    ax.plot(t_vec, R0_mitigating(t_vec, η=η), label=label)

ax.legend()
plt.show()

Let’s calculate the time path of infected people:
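i_paths, c_paths = [], []

for η in η_vals:
    R0 = lambda t: R0_mitigating(t, η=η)
    i_path, c_path = solve_path(R0, t_vec)
    i_paths.append(i_path)
    c_paths.append(c_path)
These are current cases under the different scenarios:
plot_paths(i_paths, labels)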
Here are cumulative cases, as a fraction of population:
plot_paths(c_paths, labels)
1.5 Ending Lockdown
The following replicates additional results by Andrew Atkeson on the timing of lifting lockdown.
Consider these two mitigation scenarios:
1. 𝑅𝑡 = 0.5 for 30 days and then 𝑅𝑡 = 2 for the remaining 17 months. This corresponds to lifting lockdown in 30
days.
2. 𝑅𝑡 = 0.5 for 120 days and then 𝑅𝑡 = 2 for the remaining 14 months. This corresponds to lifting lockdown in 4
months.
The parameters considered here start the model with 25,000 active infections and 75,000 agents already exposed to the
virus and thus soon to be contagious.
# initial conditions
i_0 = 25_000 / pop_size
e_0 = 75_000 / pop_size
s_0 = 1 - i_0 - e_0
x_0 = s_0, e_0, i_0
Let’s calculate the paths:
R0_paths = (lambda t: 0.5 if t < 30 else 2,
            lambda t: 0.5 if t < 120 else 2)

labels = [f'scenario {i}' for i in (1, 2)]

i_paths, c_paths = [], []

for R0 in R0_paths:
    i_path, c_path = solve_path(R0, t_vec, x_init=x_0)
    i_paths.append(i_path)
    c_paths.append(c_path)
Here is the number of active infections:
plot_paths(i_paths, labels)

What kind of mortality can we expect under these scenarios?

Suppose that 1% of cases result in death
ν = 0.01
This is the cumulative number of deaths:
paths = [path * ν * pop_size for path in c_paths]

plot_paths(paths, labels)
This is the daily death rate:

paths = [path * ν * γ * pop_size for path in i_paths]

plot_paths(paths, labels)
Pushing the peak of the curve further into the future may reduce cumulative deaths if a vaccine is found.

CHAPTER
TWO
LINEAR ALGEBRA
Contents
• Linear Algebra
– Overview
– Vectors
– Matrices
– Solving Systems of Equations
– Eigenvalues and Eigenvectors
– Further Topics
– Exercises
2.1 Overview
Linear algebra is one of the most useful branches of applied mathematics for economists to invest in.
For example, many applied problems in economics and finance require the solution of a linear system of equations, such
as
𝑦1 = 𝑎𝑥1 + 𝑏𝑥2
𝑦2 = 𝑐𝑥1 + 𝑑𝑥2
or, more generally,
$$
\begin{aligned}
y_1 &= a_{11} x_1 + a_{12} x_2 + \cdots + a_{1k} x_k \\
&\;\;\vdots \\
y_n &= a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nk} x_k
\end{aligned}
\tag{2.1}
$$
The objective here is to solve for the “unknowns” 𝑥1 , … , 𝑥𝑘 given 𝑎11 , … , 𝑎𝑛𝑘 and 𝑦1 , … , 𝑦𝑛 .
When considering such problems, it is essential that we first consider at least some of the following questions
• Does a solution actually exist?
• Are there in fact many solutions, and if so how should we interpret them?
• If no solution exists, is there a best “approximate” solution?
• If a solution exists, how should we compute it?
These are the kinds of topics addressed by linear algebra.

In this lecture we will cover the basics of linear and matrix algebra, treating both theory and computation.
We admit some overlap with this lecture, where operations on NumPy arrays were first explained.
Note that this lecture is more theoretical than most, and contains background material that will be used in applications as
we go along.
Let’s start with some imports:

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from scipy.linalg import inv, solve, det, eig
2.2 Vectors
A vector of length 𝑛 is just a sequence (or array, or tuple) of 𝑛 numbers, which we write as 𝑥 = (𝑥1 , … , 𝑥𝑛 ) or 𝑥 =
[𝑥1 , … , 𝑥𝑛 ].
We will write these sequences either horizontally or vertically as we please.
(Later, when we wish to perform certain matrix operations, it will become necessary to distinguish between the two)
The set of all 𝑛-vectors is denoted by ℝ𝑛 .
For example, ℝ2 is the plane, and a vector in ℝ2 is just a point in the plane.
Traditionally, vectors are represented visually as arrows from the origin to the point.
The following figure represents three vectors in this manner
fig, ax = plt.subplots(figsize=(10, 8))
# Set the axes through the origin
for spine in ['left', 'bottom']:
    ax.spines[spine].set_position('zero')
for spine in ['right', 'top']:
    ax.spines[spine].set_color('none')

ax.set(xlim=(-5, 5), ylim=(-5, 5))
ax.grid()
vecs = ((2, 4), (-3, 3), (-4, -3.5))
for v in vecs:
    ax.annotate('', xy=v, xytext=(0, 0),
                arrowprops=dict(facecolor='blue',
                                shrink=0,
                                alpha=0.7,
                                width=0.5))
    ax.text(1.1 * v[0], 1.1 * v[1], str(v))
plt.show()

2.2.1 Vector Operations
The two most common operators for vectors are addition and scalar multiplication, which we now describe.
As a matter of definition, when we add two vectors, we add them element-by-element
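$$
x + y =
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} +
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} :=
\begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix}
$$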
Scalar multiplication is an operation that takes a number 𝛾 and a vector 𝑥 and produces
$$
\gamma x :=
\begin{bmatrix} \gamma x_1 \\ \gamma x_2 \\ \vdots \\ \gamma x_n \end{bmatrix}
$$
Scalar multiplication is illustrated in the next figure

fig, ax = plt.subplots(figsize=(10, 8))
ax.set(xlim=(-5, 5), ylim=(-5, 5))

x = (2, 2)
ax.annotate('', xy=x, xytext=(0, 0),
            arrowprops=dict(facecolor='blue',
                            shrink=0,
                            alpha=1,
                            width=0.5))
ax.text(x[0] + 0.4, x[1] - 0.2, '$x$', fontsize='16')

scalars = (-2, 2)
x = np.array(x)

for s in scalars:
    v = s * x
    ax.annotate('', xy=v, xytext=(0, 0),
                arrowprops=dict(facecolor='red',
                                shrink=0,
                                alpha=0.5,
                                width=0.5))
    ax.text(v[0] + 0.4, v[1] - 0.2, f'${s} x$', fontsize='16')
plt.show()

In Python, a vector can be represented as a list or tuple, such as x = (2, 4, 6), but is more commonly represented
as a NumPy array.
One advantage of NumPy arrays is that scalar multiplication and addition have very natural syntax
x = np.ones(3) # Vector of three ones

y = np.array((2, 4, 6)) # Converts tuple (2, 4, 6) into array
x + y
array([3., 5., 7.])
4 * x
array([4., 4., 4.])
2.2.2 Inner Product and Norm
The inner product of vectors 𝑥, 𝑦 ∈ ℝ𝑛 is defined as

$$
x' y := \sum_{i=1}^n x_i y_i
$$
Two vectors are called orthogonal if their inner product is zero.

The norm of a vector 𝑥 represents its “length” (i.e., its distance from the zero vector) and is defined as
$$
\| x \| := \sqrt{x' x} := \left( \sum_{i=1}^n x_i^2 \right)^{1/2}
$$
The expression ‖𝑥 − 𝑦‖ is thought of as the distance between 𝑥 and 𝑦.

Continuing on from the previous example, the inner product and norm can be computed as follows
np.sum(x * y) # Inner product of x and y
12.0
np.sqrt(np.sum(x**2)) # Norm of x, take one
1.7320508075688772
np.linalg.norm(x) # Norm of x, take two
1.7320508075688772
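The same quantities can be obtained with the @ operator, and the distance ‖𝑥 − 𝑦‖ and an orthogonality check follow the same pattern. The sketch below redefines x and y so that it runs on its own; the vectors u and v are hypothetical examples.

```python
import numpy as np

x = np.ones(3)
y = np.array((2, 4, 6))

inner = x @ y                        # inner product, same as np.sum(x * y)
dist = np.linalg.norm(x - y)         # distance between x and y

u, v = np.array((1, 1)), np.array((1, -1))   # hypothetical pair of vectors
orthogonal = np.isclose(u @ v, 0)            # zero inner product

print(inner, dist, orthogonal)
```

We compare against zero with np.isclose rather than ==, since inner products of floating point vectors are rarely exactly zero.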
2.2.3 Span
Given a set of vectors 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } in ℝ𝑛 , it’s natural to think about the new vectors we can create by performing
linear operations.
New vectors created in this manner are called linear combinations of 𝐴.
In particular, 𝑦 ∈ ℝ𝑛 is a linear combination of 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } if
𝑦 = 𝛽1 𝑎1 + ⋯ + 𝛽𝑘 𝑎𝑘 for some scalars 𝛽1 , … , 𝛽𝑘
In this context, the values 𝛽1 , … , 𝛽𝑘 are called the coefficients of the linear combination.
The set of linear combinations of 𝐴 is called the span of 𝐴.
The next figure shows the span of 𝐴 = {𝑎1 , 𝑎2 } in ℝ3 .
The span is a two-dimensional plane passing through these two points and the origin.
ax = plt.figure(figsize=(10, 8)).add_subplot(projection='3d')
x_min, x_max = -5, 5

y_min, y_max = -5, 5

α, β = 0.2, 0.1
ax.set(xlim=(x_min, x_max), ylim=(x_min, x_max), zlim=(x_min, x_max),

xticks=(0,), yticks=(0,), zticks=(0,))
gs = 3
z = np.linspace(x_min, x_max, gs)
x = np.zeros(gs)
y = np.zeros(gs)
ax.plot(x, y, z, 'k-', lw=2, alpha=0.5)
ax.plot(z, x, y, 'k-', lw=2, alpha=0.5)
ax.plot(y, z, x, 'k-', lw=2, alpha=0.5)
# Fixed linear function, to generate a plane
def f(x, y):
    return α * x + β * y

# Vector locations, by coordinate
x_coords = np.array((3, 3))
y_coords = np.array((4, -4))
z = f(x_coords, y_coords)
for i in (0, 1):
    ax.text(x_coords[i], y_coords[i], z[i], f'$a_{i+1}$', fontsize=14)

# Lines to vectors
for i in (0, 1):
    x = (0, x_coords[i])
    y = (0, y_coords[i])
    z = (0, f(x_coords[i], y_coords[i]))
    ax.plot(x, y, z, 'b-', lw=1.5, alpha=0.6)
# Draw the plane

grid_size = 20
xr2 = np.linspace(x_min, x_max, grid_size)
yr2 = np.linspace(y_min, y_max, grid_size)
x2, y2 = np.meshgrid(xr2, yr2)
z2 = f(x2, y2)
ax.plot_surface(x2, y2, z2, rstride=1, cstride=1, cmap=cm.jet,
linewidth=0, antialiased=True, alpha=0.2)
plt.show()
Examples
If 𝐴 contains only one vector 𝑎1 ∈ ℝ2 , then its span is just the scalar multiples of 𝑎1 , which is the unique line passing
through both 𝑎1 and the origin.
If 𝐴 = {𝑒1 , 𝑒2 , 𝑒3 } consists of the canonical basis vectors of ℝ3 , that is
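$$
e_1 := \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad
e_2 := \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad
e_3 := \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
$$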
then the span of 𝐴 is all of ℝ3 , because, for any 𝑥 = (𝑥1 , 𝑥2 , 𝑥3 ) ∈ ℝ3 , we can write
𝑥 = 𝑥1 𝑒1 + 𝑥2 𝑒2 + 𝑥3 𝑒3
Now consider 𝐴0 = {𝑒1 , 𝑒2 , 𝑒1 + 𝑒2 }.

If 𝑦 = (𝑦1 , 𝑦2 , 𝑦3 ) is any linear combination of these vectors, then 𝑦3 = 0 (check it).

Hence 𝐴0 fails to span all of ℝ3 .
2.2.4 Linear Independence
As we’ll see, it’s often desirable to find families of vectors with relatively large span, so that many vectors can be described
by linear operators on a few vectors.
The condition we need for a set of vectors to have a large span is what’s called linear independence.
In particular, a collection of vectors 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } in ℝ𝑛 is said to be
• linearly dependent if some strict subset of 𝐴 has the same span as 𝐴.
• linearly independent if it is not linearly dependent.
Put differently, a set of vectors is linearly independent if no vector is redundant to the span and linearly dependent
otherwise.
To illustrate the idea, recall the figure that showed the span of vectors {𝑎1 , 𝑎2 } in ℝ3 as a plane through the origin.
If we take a third vector 𝑎3 and form the set {𝑎1 , 𝑎2 , 𝑎3 }, this set will be
• linearly dependent if 𝑎3 lies in the plane
• linearly independent otherwise
As another illustration of the concept, since ℝ𝑛 can be spanned by 𝑛 vectors (see the discussion of canonical basis vectors
above), any collection of 𝑚 > 𝑛 vectors in ℝ𝑛 must be linearly dependent.
The following statements are equivalent to linear independence of 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } ⊂ ℝ𝑛
1. No vector in 𝐴 can be formed as a linear combination of the other elements.
2. If 𝛽1 𝑎1 + ⋯ + 𝛽𝑘 𝑎𝑘 = 0 for scalars 𝛽1 , … , 𝛽𝑘 , then 𝛽1 = ⋯ = 𝛽𝑘 = 0.
(The zero in the first expression is the origin of ℝ𝑛 )
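One way to test linear independence numerically is to stack the vectors as columns of a matrix and compare its rank with the number of vectors. The helper name is_linearly_independent below is hypothetical, and the rank is computed with a numerical tolerance, so this is a sketch rather than an exact test.

```python
import numpy as np

def is_linearly_independent(vectors):
    "Hypothetical helper: True if the given vectors are linearly independent."
    M = np.column_stack(vectors)
    return bool(np.linalg.matrix_rank(M) == len(vectors))

e1, e2, e3 = np.identity(3)           # canonical basis vectors of R^3

basis = is_linearly_independent([e1, e2, e3])
redundant = is_linearly_independent([e1, e2, e1 + e2])
too_many = is_linearly_independent([e1, e2, e3, np.array((1., 2., 3.))])

print(basis, redundant, too_many)
```

The second set fails because 𝑒1 + 𝑒2 is a linear combination of the others, and the third fails because it contains more than 𝑛 = 3 vectors.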
2.2.5 Unique Representations
Another nice thing about sets of linearly independent vectors is that each element in the span has a unique representation
as a linear combination of these vectors.
In other words, if 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } ⊂ ℝ𝑛 is linearly independent and
𝑦 = 𝛽1 𝑎1 + ⋯ + 𝛽𝑘 𝑎𝑘
then no other coefficient sequence 𝛾1 , … , 𝛾𝑘 will produce the same vector 𝑦.

Indeed, if we also have 𝑦 = 𝛾1 𝑎1 + ⋯ + 𝛾𝑘 𝑎𝑘 , then
(𝛽1 − 𝛾1 )𝑎1 + ⋯ + (𝛽𝑘 − 𝛾𝑘 )𝑎𝑘 = 0
Linear independence now implies 𝛾𝑖 = 𝛽𝑖 for all 𝑖.
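Uniqueness means that the coefficients can be recovered from 𝑦. The sketch below builds 𝑦 from two hypothetical independent vectors and known coefficients, then recovers those coefficients with NumPy's least squares routine.

```python
import numpy as np

a1 = np.array((1., 0., 1.))          # hypothetical independent vectors
a2 = np.array((0., 1., 1.))
A = np.column_stack((a1, a2))

β = np.array((2., 3.))
y = A @ β                            # y = 2 a1 + 3 a2

# lstsq returns (solution, residuals, rank, singular values)
β_recovered, *_ = np.linalg.lstsq(A, y, rcond=None)
print(β_recovered)
```

Because the columns are independent and 𝑦 lies in their span, the recovered coefficients match the original ones up to floating point error.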
2.3 Matrices
Matrices are a neat way of organizing data for use in linear operations.
An 𝑛 × 𝑘 matrix is a rectangular array 𝐴 of numbers with 𝑛 rows and 𝑘 columns:
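$$
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1k} \\
a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nk}
\end{bmatrix}
$$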
Often, the numbers in the matrix represent coefficients in a system of linear equations, as discussed at the start of this
lecture.
For obvious reasons, the matrix 𝐴 is also called a vector if either 𝑛 = 1 or 𝑘 = 1.
In the former case, 𝐴 is called a row vector, while in the latter it is called a column vector.
If 𝑛 = 𝑘, then 𝐴 is called square.
The matrix formed by replacing 𝑎𝑖𝑗 by 𝑎𝑗𝑖 for every 𝑖 and 𝑗 is called the transpose of 𝐴 and denoted 𝐴′ or 𝐴⊤ .
If 𝐴 = 𝐴′ , then 𝐴 is called symmetric.
For a square matrix 𝐴, the 𝑛 elements of the form 𝑎𝑖𝑖 for 𝑖 = 1, … , 𝑛 are called the principal diagonal.
𝐴 is called diagonal if the only nonzero entries are on the principal diagonal.
If, in addition to being diagonal, each element along the principal diagonal is equal to 1, then 𝐴 is called the identity matrix
and denoted by 𝐼.
2.3.1 Matrix Operations
Just as was the case for vectors, a number of algebraic operations are defined for matrices.
Scalar multiplication and addition are immediate generalizations of the vector case:
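$$
\gamma A = \gamma
\begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \vdots & \vdots \\ a_{n1} & \cdots & a_{nk} \end{bmatrix} :=
\begin{bmatrix} \gamma a_{11} & \cdots & \gamma a_{1k} \\ \vdots & \vdots & \vdots \\ \gamma a_{n1} & \cdots & \gamma a_{nk} \end{bmatrix}
$$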
and
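$$
A + B =
\begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \vdots & \vdots \\ a_{n1} & \cdots & a_{nk} \end{bmatrix} +
\begin{bmatrix} b_{11} & \cdots & b_{1k} \\ \vdots & \vdots & \vdots \\ b_{n1} & \cdots & b_{nk} \end{bmatrix} :=
\begin{bmatrix} a_{11} + b_{11} & \cdots & a_{1k} + b_{1k} \\ \vdots & \vdots & \vdots \\ a_{n1} + b_{n1} & \cdots & a_{nk} + b_{nk} \end{bmatrix}
$$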
In the latter case, the matrices must have the same shape in order for the definition to make sense.
We also have a convention for multiplying two matrices.
The rule for matrix multiplication generalizes the idea of inner products discussed above and is designed to make multiplication play well with basic linear operations.
If 𝐴 and 𝐵 are two matrices, then their product 𝐴𝐵 is formed by taking as its 𝑖, 𝑗-th element the inner product of the 𝑖-th
row of 𝐴 and the 𝑗-th column of 𝐵.
There are many tutorials to help you visualize this operation, such as this one, or the discussion on the Wikipedia page.
If 𝐴 is 𝑛 × 𝑘 and 𝐵 is 𝑗 × 𝑚, then to multiply 𝐴 and 𝐵 we require 𝑘 = 𝑗, and the resulting matrix 𝐴𝐵 is 𝑛 × 𝑚.
As perhaps the most important special case, consider multiplying 𝑛 × 𝑘 matrix 𝐴 and 𝑘 × 1 column vector 𝑥.

According to the preceding rule, this gives us an 𝑛 × 1 column vector
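$$
Ax =
\begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \vdots & \vdots \\ a_{n1} & \cdots & a_{nk} \end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} :=
\begin{bmatrix} a_{11} x_1 + \cdots + a_{1k} x_k \\ \vdots \\ a_{n1} x_1 + \cdots + a_{nk} x_k \end{bmatrix}
\tag{2.2}
$$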
Note: 𝐴𝐵 and 𝐵𝐴 are not generally the same thing.
Another important special case is the identity matrix.

You should check that if 𝐴 is 𝑛 × 𝑘 and 𝐼 is the 𝑘 × 𝑘 identity matrix, then 𝐴𝐼 = 𝐴.
If 𝐼 is the 𝑛 × 𝑛 identity matrix, then 𝐼𝐴 = 𝐴.
2.3.2 Matrices in NumPy
NumPy arrays are also used as matrices, and have fast, efficient functions and methods for all the standard matrix operations.¹
You can create them manually from tuples of tuples (or lists of lists) as follows
A = ((1, 2),
(3, 4))
type(A)
tuple
A = np.array(A)
type(A)
numpy.ndarray
A.shape
(2, 2)
The shape attribute is a tuple giving the number of rows and columns — see here for more discussion.
To get the transpose of A, use A.transpose() or, more simply, A.T.
There are many convenient functions for creating common matrices (matrices of zeros, ones, etc.) — see here.
Since operations are performed elementwise by default, scalar multiplication and addition have very natural syntax
A = np.identity(3)
B = np.ones((3, 3))
2 * A
array([[2., 0., 0.],
       [0., 2., 0.],
       [0., 0., 2.]])
A + B
array([[2., 1., 1.],
       [1., 2., 1.],
       [1., 1., 2.]])
¹ Although there is a specialized matrix data type defined in NumPy, it’s more standard to work with ordinary NumPy arrays. See this discussion.
To multiply matrices we use the @ symbol.

In particular, A @ B is matrix multiplication, whereas A * B is element-by-element multiplication.
See here for more discussion.
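A short example makes the difference concrete, and also illustrates the earlier remarks that 𝐴𝐵 and 𝐵𝐴 generally differ and that multiplying by the identity leaves a matrix unchanged. The matrices are arbitrary illustrative choices.

```python
import numpy as np

A = np.array(((1, 2),
              (3, 4)))
B = np.array(((0, 1),
              (1, 0)))
I = np.identity(2)

AB = A @ B             # matrix product
BA = B @ A             # product in the reverse order
elementwise = A * B    # element-by-element product

print(AB)
print(BA)
print(elementwise)
print(np.array_equal(A @ I, A), np.array_equal(I @ A, A))
```

Here post-multiplying by B swaps the columns of A, while pre-multiplying swaps its rows, so the two products differ.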
2.3.3 Matrices as Maps
Each 𝑛 × 𝑘 matrix 𝐴 can be identified with a function 𝑓(𝑥) = 𝐴𝑥 that maps 𝑥 ∈ ℝ𝑘 into 𝑦 = 𝐴𝑥 ∈ ℝ𝑛 .
These kinds of functions have a special property: they are linear.
A function 𝑓 ∶ ℝ𝑘 → ℝ𝑛 is called linear if, for all 𝑥, 𝑦 ∈ ℝ𝑘 and all scalars 𝛼, 𝛽, we have
𝑓(𝛼𝑥 + 𝛽𝑦) = 𝛼𝑓(𝑥) + 𝛽𝑓(𝑦)
You can check that this holds for the function 𝑓(𝑥) = 𝐴𝑥 + 𝑏 when 𝑏 is the zero vector and fails when 𝑏 is nonzero.
In fact, it’s known that 𝑓 is linear if and only if there exists a matrix 𝐴 such that 𝑓(𝑥) = 𝐴𝑥 for all 𝑥.
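The claim about 𝑓(𝑥) = 𝐴𝑥 + 𝑏 can be checked numerically at randomly drawn points. In the sketch below, the helper is_linear_at, the seed, and the scalars are assumptions made for illustration; passing such a check at one pair of points is evidence of linearity, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
x, y = rng.standard_normal(2), rng.standard_normal(2)
α, β = 2.0, -0.5

def is_linear_at(f):
    "Check f(αx + βy) = αf(x) + βf(y) at the points drawn above."
    return bool(np.allclose(f(α * x + β * y), α * f(x) + β * f(y)))

b = np.ones(3)
linear_case = is_linear_at(lambda v: A @ v)        # b is the zero vector
affine_case = is_linear_at(lambda v: A @ v + b)    # b is nonzero

print(linear_case, affine_case)
```

With nonzero 𝑏 the two sides differ by (1 − 𝛼 − 𝛽)𝑏, which vanishes only when 𝛼 + 𝛽 = 1.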
2.4 Solving Systems of Equations
Recall again the system of equations (2.1).

If we compare (2.1) and (2.2), we see that (2.1) can now be written more conveniently as
𝑦 = 𝐴𝑥 (2.3)
The problem we face is to determine a vector 𝑥 ∈ ℝ𝑘 that solves (2.3), taking 𝑦 and 𝐴 as given.
This is a special case of a more general problem: Find an 𝑥 such that 𝑦 = 𝑓(𝑥).
Given an arbitrary function 𝑓 and a 𝑦, is there always an 𝑥 such that 𝑦 = 𝑓(𝑥)?
If so, is it always unique?
The answer to both these questions is negative, as the next figure shows
def f(x):
    return 0.6 * np.cos(4 * x) + 1.4

xmin, xmax = -1, 1
x = np.linspace(xmin, xmax, 160)
y = f(x)
ya, yb = np.min(y), np.max(y)

fig, axes = plt.subplots(2, 1, figsize=(10, 10))

for ax in axes:
    ax.set(ylim=(-0.6, 3.2), xlim=(xmin, xmax),
           yticks=(), xticks=())

    ax.plot(x, y, 'k-', lw=2, label='$f$')
    ax.fill_between(x, ya, yb, facecolor='blue', alpha=0.05)
    ax.vlines([0], ya, yb, lw=3, color='blue', label='range of $f$')
    ax.text(0.04, -0.3, '$0$', fontsize=16)

ax = axes[0]

ax.legend(loc='upper right', frameon=False)
ybar = 1.5
ax.plot(x, x * 0 + ybar, 'k--', alpha=0.5)
ax.text(0.05, 0.8 * ybar, '$y$', fontsize=16)
for i, z in enumerate((-0.35, 0.35)):
    ax.vlines(z, 0, f(z), linestyle='--', alpha=0.5)
    ax.text(z, -0.2, f'$x_{i}$', fontsize=16)

ax = axes[1]

ybar = 2.6
ax.plot(x, x * 0 + ybar, 'k--', alpha=0.5)
ax.text(0.04, 0.91 * ybar, '$y$', fontsize=16)

plt.show()

In the first plot, there are multiple solutions, as the function is not one-to-one, while in the second there are no solutions,
since 𝑦 lies outside the range of 𝑓.
Can we impose conditions on 𝐴 in (2.3) that rule out these problems?
In this context, the most important thing to recognize about the expression 𝐴𝑥 is that it corresponds to a linear combination
of the columns of 𝐴.
In particular, if 𝑎1 , … , 𝑎𝑘 are the columns of 𝐴, then
𝐴𝑥 = 𝑥1 𝑎1 + ⋯ + 𝑥𝑘 𝑎𝑘
Hence the range of 𝑓(𝑥) = 𝐴𝑥 is exactly the span of the columns of 𝐴.
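The identity 𝐴𝑥 = 𝑥1 𝑎1 + ⋯ + 𝑥𝑘 𝑎𝑘 is easy to confirm in NumPy by building the linear combination of columns by hand. The matrix and vector below are arbitrary illustrative values.

```python
import numpy as np

A = np.array(((1., 2., 0.),
              (3., 4., 1.)))
x = np.array((2., -1., 5.))

# Weighted sum of the columns of A, with weights given by x
combo = sum(x[j] * A[:, j] for j in range(A.shape[1]))

print(A @ x)
print(combo)
```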

We want the range to be large so that it contains arbitrary 𝑦.
As you might recall, the condition that we want for the span to be large is linear independence.
A happy fact is that linear independence of the columns of 𝐴 also gives us uniqueness.

Indeed, it follows from our earlier discussion that if {𝑎1 , … , 𝑎𝑘 } are linearly independent and 𝑦 = 𝐴𝑥 = 𝑥1 𝑎1 +⋯+𝑥𝑘 𝑎𝑘 ,
then no 𝑧 ≠ 𝑥 satisfies 𝑦 = 𝐴𝑧.
2.4.1 The Square Matrix Case
Let’s discuss some more details, starting with the case where 𝐴 is 𝑛 × 𝑛.
This is the familiar case where the number of unknowns equals the number of equations.
For arbitrary 𝑦 ∈ ℝ𝑛 , we hope to find a unique 𝑥 ∈ ℝ𝑛 such that 𝑦 = 𝐴𝑥.
In view of the observations immediately above, if the columns of 𝐴 are linearly independent, then their span, and hence
the range of 𝑓(𝑥) = 𝐴𝑥, is all of ℝ𝑛 .
Hence there always exists an 𝑥 such that 𝑦 = 𝐴𝑥.
Moreover, the solution is unique.
In particular, the following are equivalent
1. The columns of 𝐴 are linearly independent.
2. For any 𝑦 ∈ ℝ𝑛 , the equation 𝑦 = 𝐴𝑥 has a unique solution.
The property of having linearly independent columns is sometimes expressed as having full column rank.
Inverse Matrices
Can we give some sort of expression for the solution?

If 𝑦 and 𝐴 are scalar with 𝐴 ≠ 0, then the solution is 𝑥 = 𝐴−1 𝑦.
A similar expression is available in the matrix case.
In particular, if square matrix 𝐴 has full column rank, then it possesses a multiplicative inverse matrix 𝐴−1 , with the
property that 𝐴𝐴−1 = 𝐴−1 𝐴 = 𝐼.
As a consequence, if we pre-multiply both sides of 𝑦 = 𝐴𝑥 by 𝐴−1 , we get 𝑥 = 𝐴−1 𝑦.
This is the solution that we’re looking for.
Determinants
Another quick comment about square matrices is that to every such matrix we assign a unique number called the determinant of the matrix (you can find the expression for it here).
If the determinant of 𝐴 is not zero, then we say that 𝐴 is nonsingular.
Perhaps the most important fact about determinants is that 𝐴 is nonsingular if and only if 𝐴 is of full column rank.
This gives us a useful one-number summary of whether or not a square matrix can be inverted.
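As a small illustration, we can compare the determinant and the rank of a nonsingular matrix with those of a matrix whose columns are linearly dependent. The second matrix S is a hypothetical example; in floating point arithmetic a determinant should be compared with zero only up to a tolerance.

```python
import numpy as np

A = np.array(((1., 2.),
              (3., 4.)))      # columns are linearly independent
S = np.array(((1., 2.),
              (2., 4.)))      # second column is twice the first

det_A, rank_A = np.linalg.det(A), np.linalg.matrix_rank(A)
det_S, rank_S = np.linalg.det(S), np.linalg.matrix_rank(S)

print(det_A, rank_A)
print(det_S, rank_S)
```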

2.4.2 More Rows than Columns
This is the 𝑛 × 𝑘 case with 𝑛 > 𝑘.

This case is very important in many settings, not least in the setting of linear regression (where 𝑛 is the number of
observations, and 𝑘 is the number of explanatory variables).
Given arbitrary 𝑦 ∈ ℝ𝑛 , we seek an 𝑥 ∈ ℝ𝑘 such that 𝑦 = 𝐴𝑥.
In this setting, the existence of a solution is highly unlikely.
Without much loss of generality, let’s go over the intuition focusing on the case where the columns of 𝐴 are linearly
independent.
It follows that the span of the columns of 𝐴 is a 𝑘-dimensional subspace of ℝ𝑛 .
This span is very “unlikely” to contain arbitrary 𝑦 ∈ ℝ𝑛 .
To see why, recall the figure above, where 𝑘 = 2 and 𝑛 = 3.
Imagine an arbitrarily chosen 𝑦 ∈ ℝ3 , located somewhere in that three-dimensional space.
What’s the likelihood that 𝑦 lies in the span of {𝑎1 , 𝑎2 } (i.e., the two dimensional plane through these points)?
In a sense, it must be very small, since this plane has zero “thickness”.
As a result, in the 𝑛 > 𝑘 case we usually give up on existence.
However, we can still seek the best approximation, for example, an 𝑥 that makes the distance ‖𝑦 − 𝐴𝑥‖ as small as
possible.
To solve this problem, one can use either calculus or the theory of orthogonal projections.
The solution is known to be 𝑥̂ = (𝐴′ 𝐴)−1 𝐴′ 𝑦 — see for example chapter 3 of these notes.
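The formula can be checked on simulated data. The sketch below, which uses randomly generated 𝐴 and 𝑦 as an assumption, computes the solution from the normal equations 𝐴′𝐴𝑥 = 𝐴′𝑦 and compares it with NumPy's least squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3                          # more rows than columns
A = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# Solve the normal equations A'A x = A'y rather than forming the inverse
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Library least squares routine, for comparison
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

residual = y - A @ x_normal
print(x_normal)
print(A.T @ residual)                 # approximately zero
```

The final line reflects the orthogonal projection view: at the minimizer, the residual 𝑦 − 𝐴𝑥̂ is orthogonal to every column of 𝐴.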
2.4.3 More Columns than Rows
This is the 𝑛 × 𝑘 case with 𝑛 < 𝑘, so there are fewer equations than unknowns.
In this case there are either no solutions or infinitely many — in other words, uniqueness never holds.
For example, consider the case where 𝑘 = 3 and 𝑛 = 2.
Thus, the columns of 𝐴 consist of 3 vectors in ℝ2 .
This set can never be linearly independent, since it is possible to find two vectors that span ℝ2 .
(For example, use the canonical basis vectors)
It follows that one column is a linear combination of the other two.
For example, let’s say that 𝑎1 = 𝛼𝑎2 + 𝛽𝑎3 .
Then if 𝑦 = 𝐴𝑥 = 𝑥1 𝑎1 + 𝑥2 𝑎2 + 𝑥3 𝑎3 , we can also write
𝑦 = 𝑥1 (𝛼𝑎2 + 𝛽𝑎3 ) + 𝑥2 𝑎2 + 𝑥3 𝑎3 = (𝑥1 𝛼 + 𝑥2 )𝑎2 + (𝑥1 𝛽 + 𝑥3 )𝑎3
In other words, uniqueness fails.
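The argument can be replayed numerically. In the sketch below the columns and the scalars 𝛼, 𝛽 are hypothetical values; we build 𝑎1 = 𝛼𝑎2 + 𝛽𝑎3 and exhibit two different vectors that are mapped to the same 𝑦.

```python
import numpy as np

a2 = np.array((1., 0.))
a3 = np.array((0., 1.))
α, β = 2.0, -1.0
a1 = α * a2 + β * a3                  # a1 depends on the other two columns
A = np.column_stack((a1, a2, a3))     # 2 x 3 matrix

x = np.array((1., 1., 1.))
# Move the weight on a1 onto a2 and a3, as in the derivation above
z = np.array((0., x[0] * α + x[1], x[0] * β + x[2]))

print(A @ x)
print(A @ z)
```

Both products give the same vector even though 𝑥 ≠ 𝑧, so 𝑦 = 𝐴𝑥 has more than one solution.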

2.4.4 Linear Equations with SciPy
Here’s an illustration of how to solve linear equations with SciPy’s linalg submodule.
All of these routines are Python front ends to time-tested and highly optimized FORTRAN code
A = ((1, 2), (3, 4))

A = np.array(A)
y = np.ones((2, 1)) # Column vector
det(A) # Check that A is nonsingular, and hence invertible
-2.0
A_inv = inv(A) # Compute the inverse

A_inv
array([[-2. , 1. ],
[ 1.5, -0.5]])
x = A_inv @ y # Solution
A @ x # Should equal y
array([[1.],
[1.]])
solve(A, y) # Produces the same solution
array([[-1.],
[ 1.]])
Observe how we can solve for 𝑥 = 𝐴−1 𝑦 either via inv(A) @ y or using solve(A, y).
The latter method uses a different algorithm (LU decomposition) that is numerically more stable, and hence should almost
always be preferred.
To obtain the least-squares solution 𝑥̂ = (𝐴′ 𝐴)−1 𝐴′ 𝑦, use scipy.linalg.lstsq(A, y).
2.5 Eigenvalues and Eigenvectors
Let 𝐴 be an 𝑛 × 𝑛 square matrix.

If 𝜆 is scalar and 𝑣 is a non-zero vector in ℝ𝑛 such that
𝐴𝑣 = 𝜆𝑣
then we say that 𝜆 is an eigenvalue of 𝐴, and 𝑣 is an eigenvector.

Thus, an eigenvector of 𝐴 is a vector such that when the map 𝑓(𝑥) = 𝐴𝑥 is applied, 𝑣 is merely scaled.
The next figure shows two eigenvectors (blue arrows) and their images under 𝐴 (red arrows).
As expected, the image 𝐴𝑣 of each 𝑣 is just a scaled version of the original

A = ((1, 2),
     (2, 1))
A = np.array(A)
evals, evecs = eig(A)
evecs = evecs[:, 0], evecs[:, 1]

fig, ax = plt.subplots(figsize=(10, 8))
ax.grid(alpha=0.4)

xmin, xmax = -3, 3
ymin, ymax = -3, 3
ax.set(xlim=(xmin, xmax), ylim=(ymin, ymax))

# Plot each eigenvector
for v in evecs:
    ax.annotate('', xy=v, xytext=(0, 0),
                arrowprops=dict(facecolor='blue',
                                shrink=0,
                                alpha=0.6,
                                width=0.5))

# Plot the image of each eigenvector
for v in evecs:
    v = A @ v
    ax.annotate('', xy=v, xytext=(0, 0),
                arrowprops=dict(facecolor='red',
                                shrink=0,
                                alpha=0.6,
                                width=0.5))

# Plot the lines they run through
x = np.linspace(xmin, xmax, 3)
for v in evecs:
    a = v[1] / v[0]
    ax.plot(x, a * x, 'b-', lw=0.4)

plt.show()

The eigenvalue equation is equivalent to (𝐴 − 𝜆𝐼)𝑣 = 0, and this has a nonzero solution 𝑣 only when the columns of
𝐴 − 𝜆𝐼 are linearly dependent.
This in turn is equivalent to stating that the determinant is zero.
Hence to find all eigenvalues, we can look for 𝜆 such that the determinant of 𝐴 − 𝜆𝐼 is zero.
This problem can be expressed as one of solving for the roots of a polynomial in 𝜆 of degree 𝑛.
This in turn implies the existence of 𝑛 solutions in the complex plane, although some might be repeated.
Some nice facts about the eigenvalues of a square matrix 𝐴 are as follows
1. The determinant of 𝐴 equals the product of the eigenvalues.
2. The trace of 𝐴 (the sum of the elements on the principal diagonal) equals the sum of the eigenvalues.
3. If 𝐴 is symmetric, then all of its eigenvalues are real.
4. If 𝐴 is invertible and 𝜆1 , … , 𝜆𝑛 are its eigenvalues, then the eigenvalues of 𝐴−1 are 1/𝜆1 , … , 1/𝜆𝑛 .
A corollary of the first statement is that a matrix is invertible if and only if all its eigenvalues are nonzero.
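These facts are easy to confirm numerically for a particular matrix. The sketch below uses NumPy's linalg routines and the symmetric matrix from the figure above.

```python
import numpy as np

A = np.array(((1., 2.),
              (2., 1.)))
evals = np.linalg.eigvals(A)
evals_inv = np.linalg.eigvals(np.linalg.inv(A))

print(np.prod(evals), np.linalg.det(A))       # fact 1: determinant
print(np.sum(evals), np.trace(A))             # fact 2: trace
print(np.sort(evals_inv), np.sort(1 / evals)) # fact 4: inverse
```

The eigenvalues here are real, as fact 3 predicts for a symmetric matrix, and all are nonzero, consistent with 𝐴 being invertible.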
Using SciPy, we can solve for the eigenvalues and eigenvectors of a matrix as follows
A = ((1, 2),
(2, 1))


A = np.array(A)
evals, evecs = eig(A)
evals
array([ 3.+0.j, -1.+0.j])
evecs
array([[ 0.70710678, -0.70710678],

[ 0.70710678, 0.70710678]])
Note that the columns of evecs are the eigenvectors.

Since any scalar multiple of an eigenvector is an eigenvector with the same eigenvalue (check it), the eig routine normalizes
the length of each eigenvector to one.
2.5.1 Generalized Eigenvalues
It is sometimes useful to consider the generalized eigenvalue problem, which, for given matrices 𝐴 and 𝐵, seeks generalized
eigenvalues 𝜆 and eigenvectors 𝑣 such that
𝐴𝑣 = 𝜆𝐵𝑣
This can be solved in SciPy via scipy.linalg.eig(A, B).

Of course, if 𝐵 is square and invertible, then we can treat the generalized eigenvalue problem as an ordinary eigenvalue
problem 𝐵−1 𝐴𝑣 = 𝜆𝑣, but this is not always the case.
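For the invertible case, the reduction to an ordinary eigenvalue problem can be sketched as follows. The matrix B is a hypothetical example, and we use NumPy's solve to form 𝐵−1𝐴 without computing the inverse explicitly.

```python
import numpy as np

A = np.array(((1., 2.),
              (2., 1.)))
B = np.array(((2., 0.),
              (0., 1.)))                 # hypothetical invertible matrix

# Eigenpairs of B^{-1} A, where solve(B, A) computes B^{-1} A
evals, evecs = np.linalg.eig(np.linalg.solve(B, A))

# Each pair should satisfy the generalized equation A v = λ B v
checks = [bool(np.allclose(A @ v, λ * (B @ v)))
          for λ, v in zip(evals, evecs.T)]
print(evals, checks)
```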
2.6 Further Topics
We round out our discussion by briefly mentioning several other important topics.
2.6.1 Series Expansions

Recall the usual summation formula for a geometric progression, which states that if |𝑎| < 1, then $\sum_{k=0}^{\infty} a^k = (1 - a)^{-1}$.
A generalization of this idea exists in the matrix setting.
Matrix Norms
Let 𝐴 be a square matrix, and let
$$
\| A \| := \max_{\| x \| = 1} \| A x \|
$$
The norms on the right-hand side are ordinary vector norms, while the norm on the left-hand side is a matrix norm — in
this case, the so-called spectral norm.

For example, for a square matrix 𝑆, the condition ‖𝑆‖ < 1 means that 𝑆 is contractive, in the sense that it pulls all vectors
towards the origin.²
Neumann’s Theorem
Let $A$ be a square matrix and let $A^k := A A^{k-1}$ with $A^1 := A$.

In other words, $A^k$ is the $k$-th power of $A$.
Neumann's theorem states the following: If $\|A^k\| < 1$ for some $k \in \mathbb{N}$, then $I - A$ is invertible, and
$$(I - A)^{-1} = \sum_{k=0}^{\infty} A^k \tag{2.4}$$
Spectral Radius
A result known as Gelfand’s formula tells us that, for any square matrix 𝐴,
$$\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}$$
Here $\rho(A)$ is the spectral radius, defined as $\max_i |\lambda_i|$, where $\{\lambda_i\}_i$ is the set of eigenvalues of $A$.
As a consequence of Gelfand's formula, if all eigenvalues are strictly less than one in modulus, there exists a $k$ with
$\|A^k\| < 1$, in which case (2.4) is valid.
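A quick numerical illustration of Neumann's theorem and Gelfand's formula follows. It is a sketch under assumptions: the matrix below is an arbitrary example with spectral radius below one, and the truncation points (200 terms, power 50) are chosen only for illustration.

```python
import numpy as np

A = np.array([[0.4, 0.1],
              [0.7, 0.2]])

# Spectral radius: largest eigenvalue modulus
spectral_radius = np.abs(np.linalg.eigvals(A)).max()

# Partial sum of the Neumann series I + A + A^2 + ...
S = np.zeros_like(A)
A_power = np.eye(2)
for k in range(200):
    S += A_power
    A_power = A_power @ A

# Direct computation of (I - A)^{-1}
direct = np.linalg.inv(np.eye(2) - A)
gap = np.abs(S - direct).max()

# Gelfand's formula, evaluated at a finite k as a rough approximation
k = 50
gelfand_estimate = np.linalg.norm(np.linalg.matrix_power(A, k), 2) ** (1 / k)

print(spectral_radius, gelfand_estimate, gap)
```

The partial sums converge quickly here because the terms shrink roughly like the spectral radius raised to the power 𝑘.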
2.6.2 Positive Definite Matrices
Let 𝐴 be a symmetric 𝑛 × 𝑛 matrix.

We say that 𝐴 is
1. positive definite if $x' A x > 0$ for every $x \in \mathbb{R}^n \setminus \{0\}$
2. positive semi-definite or nonnegative definite if $x' A x \geq 0$ for every $x \in \mathbb{R}^n$
Analogous definitions exist for negative definite and negative semi-definite matrices.
It is notable that if 𝐴 is positive definite, then all of its eigenvalues are strictly positive, and hence 𝐴 is invertible (with
positive definite inverse).
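The sketch below checks these claims for one symmetric matrix chosen for illustration. The sampling of random vectors is only a sanity check of the definition, not a proof.

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])      # symmetric example matrix (an illustrative choice)

# For symmetric matrices, eigvalsh returns real eigenvalues in ascending order
evals = np.linalg.eigvalsh(A)
is_positive_definite = bool(np.all(evals > 0))

# The inverse exists and is again positive definite
A_inv = np.linalg.inv(A)
inverse_is_positive_definite = bool(np.all(np.linalg.eigvalsh(A_inv) > 0))

# Spot check of the definition: x'Ax > 0 for many random nonzero x
rng = np.random.default_rng(0)
xs = rng.standard_normal((1000, 2))
quad_forms = np.einsum('ij,jk,ik->i', xs, A, xs)

print(evals, is_positive_definite, inverse_is_positive_definite, quad_forms.min())
```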
2.6.3 Differentiating Linear and Quadratic Forms
The following formulas are useful in many economic contexts. Let

• 𝑧, 𝑥 and 𝑎 all be 𝑛 × 1 vectors
• 𝐴 be an 𝑛 × 𝑛 matrix
• 𝐵 be an 𝑚 × 𝑛 matrix and 𝑦 be an 𝑚 × 1 vector
Then
1. $\frac{\partial a' x}{\partial x} = a$
2. $\frac{\partial A x}{\partial x} = A'$
3. $\frac{\partial x' A x}{\partial x} = (A + A') x$
4. $\frac{\partial y' B z}{\partial y} = B z$
5. $\frac{\partial y' B z}{\partial B} = y z'$

Footnote 2: Suppose that ‖𝑆‖ < 1. Take any nonzero vector 𝑥, and let 𝑟 ∶= ‖𝑥‖. We have ‖𝑆𝑥‖ = 𝑟‖𝑆(𝑥/𝑟)‖ ≤ 𝑟‖𝑆‖ < 𝑟 = ‖𝑥‖. Hence every point is pulled towards the origin.
Exercise 2.7.1 below asks you to apply these formulas.
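One way to build confidence in these formulas is to compare them with finite-difference gradients. The sketch below does this for formulas 1 and 3; the helper `numerical_gradient` is a hypothetical name introduced here, and the random test data are arbitrary.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    "Central-difference approximation to the gradient of a scalar function f at x."
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # not symmetric in general
a = rng.standard_normal(n)
x = rng.standard_normal(n)

# Formula 1: the gradient of a'x with respect to x is a
err_linear = np.abs(numerical_gradient(lambda v: a @ v, x) - a).max()

# Formula 3: the gradient of x'Ax with respect to x is (A + A')x
err_quadratic = np.abs(
    numerical_gradient(lambda v: v @ A @ v, x) - (A + A.T) @ x
).max()

print(err_linear, err_quadratic)
```

Using a non-symmetric 𝐴 makes the check informative: the gradient is (𝐴 + 𝐴′)𝑥, which reduces to 2𝐴𝑥 only when 𝐴 is symmetric.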
2.6.4 Further Reading
The documentation of the scipy.linalg submodule can be found here.

Chapters 2 and 3 of Econometric Theory contain a discussion of linear algebra along the same lines as above, with
solved exercises.
If you don’t mind a slightly abstract approach, a nice intermediate-level text on linear algebra is [Jänich, 1994].
2.7 Exercises
Exercise 2.7.1
Let 𝑥 be a given 𝑛 × 1 vector and consider the problem
$$v(x) = \max_{y, u} \{ -y' P y - u' Q u \}$$
subject to the linear constraint
𝑦 = 𝐴𝑥 + 𝐵𝑢
Here
• 𝑃 is an 𝑛 × 𝑛 matrix and 𝑄 is an 𝑚 × 𝑚 matrix
• 𝐴 is an 𝑛 × 𝑛 matrix and 𝐵 is an 𝑛 × 𝑚 matrix
• both 𝑃 and 𝑄 are symmetric and positive semidefinite
(What must the dimensions of 𝑦 and 𝑢 be to make this a well-posed problem?)
One way to solve the problem is to form the Lagrangian
ℒ = −𝑦′ 𝑃 𝑦 − 𝑢′ 𝑄𝑢 + 𝜆′ [𝐴𝑥 + 𝐵𝑢 − 𝑦]
where 𝜆 is an 𝑛 × 1 vector of Lagrange multipliers.

Try applying the formulas given above for differentiating quadratic and linear forms to obtain the first-order conditions
for maximizing ℒ with respect to 𝑦, 𝑢 and minimizing it with respect to 𝜆.
Show that these conditions imply that
1. 𝜆 = −2𝑃 𝑦.
2. The optimizing choice of 𝑢 satisfies 𝑢 = −(𝑄 + 𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 𝐴𝑥.
3. The function $v$ satisfies $v(x) = -x' \tilde{P} x$ where $\tilde{P} = A' P A - A' P B (Q + B' P B)^{-1} B' P A$.

As we will see, in economic contexts Lagrange multipliers often are shadow prices.
Note: If we don’t care about the Lagrange multipliers, we can substitute the constraint into the objective function, and
then just maximize −(𝐴𝑥 + 𝐵𝑢)′ 𝑃 (𝐴𝑥 + 𝐵𝑢) − 𝑢′ 𝑄𝑢 with respect to 𝑢. You can verify that this leads to the same
maximizer.
Solution to Exercise 2.7.1

We have an optimization problem:
$$v(x) = \max_{y, u} \{ -y' P y - u' Q u \}$$
s.t.
$$y = Ax + Bu$$
with primitives
• 𝑃, a symmetric and positive semidefinite 𝑛 × 𝑛 matrix
• 𝑄, a symmetric and positive semidefinite 𝑚 × 𝑚 matrix
• 𝐴, an 𝑛 × 𝑛 matrix
• 𝐵, an 𝑛 × 𝑚 matrix
The associated Lagrangian is:
𝐿 = −𝑦′ 𝑃 𝑦 − 𝑢′ 𝑄𝑢 + 𝜆′ [𝐴𝑥 + 𝐵𝑢 − 𝑦]
Step 1.
Differentiating the Lagrangian with respect to $y$ and setting the derivative equal to zero yields
$$\frac{\partial L}{\partial y} = -(P + P') y - \lambda = -2 P y - \lambda = 0,$$
since $P$ is symmetric.
Accordingly, the first-order condition for maximizing $L$ with respect to $y$ implies
$$\lambda = -2 P y$$
Step 2.
Differentiating the Lagrangian with respect to $u$ and setting the derivative equal to zero yields
$$\frac{\partial L}{\partial u} = -(Q + Q') u + B' \lambda = -2 Q u + B' \lambda = 0$$
Substituting $\lambda = -2 P y$ gives
$$Q u + B' P y = 0$$
Substituting the linear constraint $y = Ax + Bu$ into the above equation gives
$$Q u + B' P (Ax + Bu) = 0$$
$$(Q + B' P B) u + B' P A x = 0$$
which is the first-order condition for maximizing $L$ with respect to $u$.
Thus, the optimal choice of $u$ must satisfy
$$u = -(Q + B' P B)^{-1} B' P A x,$$
which follows from solving the first-order condition for $u$.
Step 3.
Rewriting our problem by substituting the constraint into the objective function, we get
$$v(x) = \max_{u} \{ -(Ax + Bu)' P (Ax + Bu) - u' Q u \}$$
Since we know the optimal choice of $u$ satisfies $u = -(Q + B' P B)^{-1} B' P A x$, then
$$v(x) = -(Ax + Bu)' P (Ax + Bu) - u' Q u \quad \text{with} \quad u = -(Q + B' P B)^{-1} B' P A x$$
To evaluate the function
$$\begin{aligned}
v(x) &= -(Ax + Bu)' P (Ax + Bu) - u' Q u \\
&= -(x' A' + u' B') P (Ax + Bu) - u' Q u \\
&= -x' A' P A x - u' B' P A x - x' A' P B u - u' B' P B u - u' Q u \\
&= -x' A' P A x - 2 u' B' P A x - u' (Q + B' P B) u
\end{aligned}$$
For simplicity, denote $S := (Q + B' P B)^{-1} B' P A$, so that $u = -Sx$.

Regarding the second term $-2 u' B' P A x$,
$$\begin{aligned}
-2 u' B' P A x &= 2 x' S' B' P A x \\
&= 2 x' A' P B (Q + B' P B)^{-1} B' P A x
\end{aligned}$$
Notice that the term $(Q + B' P B)^{-1}$ is symmetric as both $P$ and $Q$ are symmetric.
Regarding the third term $-u' (Q + B' P B) u$,
$$\begin{aligned}
-u' (Q + B' P B) u &= -x' S' (Q + B' P B) S x \\
&= -x' A' P B (Q + B' P B)^{-1} B' P A x
\end{aligned}$$
Hence, the sum of the second and third terms is $x' A' P B (Q + B' P B)^{-1} B' P A x$.
This implies that
$$\begin{aligned}
v(x) &= -x' A' P A x - 2 u' B' P A x - u' (Q + B' P B) u \\
&= -x' A' P A x + x' A' P B (Q + B' P B)^{-1} B' P A x \\
&= -x' [A' P A - A' P B (Q + B' P B)^{-1} B' P A] x
\end{aligned}$$
Therefore, the solution to the optimization problem, $v(x) = -x' \tilde{P} x$, follows from the above result by denoting
$\tilde{P} := A' P A - A' P B (Q + B' P B)^{-1} B' P A$.
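The three results can also be checked numerically. The sketch below draws random primitives (an assumption made purely for illustration, with 𝑄 shifted by the identity so that 𝑄 + 𝐵′𝑃𝐵 is invertible) and compares the closed-form value with the objective evaluated at the candidate maximizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2

A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
M = rng.standard_normal((n, n))
P = M.T @ M                       # symmetric positive semidefinite
K = rng.standard_normal((m, m))
Q = K.T @ K + np.eye(m)           # symmetric positive definite (illustrative choice)
x = rng.standard_normal(n)

def objective(u):
    "Objective after substituting the constraint y = Ax + Bu."
    y = A @ x + B @ u
    return -y @ P @ y - u @ Q @ u

# Result 2: the optimal u
S = np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A)
u_star = -S @ x

# Result 1: the multiplier, and the first-order condition in u
y_star = A @ x + B @ u_star
lam = -2 * P @ y_star
foc_u = -2 * Q @ u_star + B.T @ lam

# Result 3: the value function
P_tilde = A.T @ P @ A - A.T @ P @ B @ S
v_formula = -x @ P_tilde @ x
v_direct = objective(u_star)

# Random perturbations of u_star should never improve the objective
perturbed = [objective(u_star + 0.1 * rng.standard_normal(m)) for _ in range(100)]

print(v_formula, v_direct, max(perturbed))
```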

CHAPTER
THREE
QR DECOMPOSITION
3.1 Overview
This lecture describes the QR decomposition and how it relates to

• Orthogonal projection and least squares
• A Gram-Schmidt process
• Eigenvalues and eigenvectors
We’ll write some Python code to help consolidate our understandings.
3.2 Matrix Factorization
The QR decomposition (also called the QR factorization) of a matrix is a decomposition of a matrix into the product of
an orthogonal matrix and a triangular matrix.
A QR decomposition of a real matrix 𝐴 takes the form
𝐴 = 𝑄𝑅
where
• 𝑄 is an orthogonal matrix (so that $Q^T Q = I$)
• 𝑅 is an upper triangular matrix
We’ll use a Gram-Schmidt process to compute a QR decomposition.
Because doing so is so educational, we’ll write our own Python code to do the job.
3.3 Gram-Schmidt process
We’ll start with a square matrix 𝐴.

If a square matrix 𝐴 is nonsingular, then a 𝑄𝑅 factorization is unique up to the signs of the columns of 𝑄 and the corresponding rows of 𝑅.
Our algorithm also works for a rectangular 𝐴 that is not square, a case that we’ll deal with later.
3.3.1 Gram-Schmidt process for square 𝐴
Here we apply a Gram-Schmidt process to the columns of matrix 𝐴.

In particular, let
𝐴 = [ 𝑎1 𝑎2 ⋯ 𝑎𝑛 ]
Let || · || denote the L2 norm.

The Gram-Schmidt algorithm repeatedly combines the following two steps in a particular order
• normalize a vector to have unit norm
• orthogonalize the next vector
To begin, we set $u_1 = a_1$ and then normalize:
$$u_1 = a_1, \quad e_1 = \frac{u_1}{\|u_1\|}$$
We orthogonalize first to compute $u_2$ and then normalize to create $e_2$:
$$u_2 = a_2 - (a_2 \cdot e_1) e_1, \quad e_2 = \frac{u_2}{\|u_2\|}$$
We invite the reader to verify that 𝑒1 is orthogonal to 𝑒2 by checking that 𝑒1 ⋅ 𝑒2 = 0.

The Gram-Schmidt procedure continues iterating.
Thus, for $k = 2, \ldots, n - 1$ we construct
$$u_{k+1} = a_{k+1} - (a_{k+1} \cdot e_1) e_1 - \cdots - (a_{k+1} \cdot e_k) e_k, \quad e_{k+1} = \frac{u_{k+1}}{\|u_{k+1}\|}$$
Here (𝑎𝑗 ⋅ 𝑒𝑖 ) can be interpreted as the linear least squares regression coefficient of 𝑎𝑗 on 𝑒𝑖
• it is the inner product of 𝑎𝑗 and 𝑒𝑖 divided by the inner product of 𝑒𝑖 with itself, where 𝑒𝑖 ⋅ 𝑒𝑖 = 1, as normalization has assured
us.
• this regression coefficient has an interpretation as being a covariance divided by a variance
It can be verified that
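$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}
= \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}
\begin{bmatrix}
a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\
0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_n \cdot e_n
\end{bmatrix}$$
Thus, we have constructed the decomposition
$$A = QR$$
where
$$Q = \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}$$
and
$$R = \begin{bmatrix}
a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\
0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_n \cdot e_n
\end{bmatrix}$$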
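The structure of 𝑅 described above, with entry (𝑖, 𝑗) equal to 𝑎𝑗 ⋅ 𝑒𝑖, means that 𝑅 = 𝑄′𝐴. The sketch below checks this using NumPy's built-in `np.linalg.qr` rather than a Gram-Schmidt routine; the example matrix is an arbitrary nonsingular choice.

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

Q, R = np.linalg.qr(A)

# Entry (i, j) of Q'A is the inner product of column j of A with column i of Q
R_from_inner_products = Q.T @ A

orthogonal = np.allclose(Q.T @ Q, np.eye(3))
upper_triangular = np.allclose(np.tril(R_from_inner_products, -1), 0.0)
reconstructs_A = np.allclose(Q @ R, A)
same_R = np.allclose(R_from_inner_products, R)

print(orthogonal, upper_triangular, reconstructs_A, same_R)
```

The signs of the columns of 𝑄 returned by a library routine can differ from those produced by a Gram-Schmidt process, but the identities checked here hold for any valid QR decomposition.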
3.3.2 𝐴 not square
Now suppose that 𝐴 is an 𝑛 × 𝑚 matrix where 𝑚 > 𝑛.

Then a 𝑄𝑅 decomposition is
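$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_m \end{bmatrix}
= \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}
\begin{bmatrix}
a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 & a_{n+1} \cdot e_1 & \cdots & a_m \cdot e_1 \\
0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 & a_{n+1} \cdot e_2 & \cdots & a_m \cdot e_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_n \cdot e_n & a_{n+1} \cdot e_n & \cdots & a_m \cdot e_n
\end{bmatrix}$$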
which implies that
𝑎1 = (𝑎1 ⋅ 𝑒1 )𝑒1
𝑎2 = (𝑎2 ⋅ 𝑒1 )𝑒1 + (𝑎2 ⋅ 𝑒2 )𝑒2
⋮ ⋮
𝑎𝑛 = (𝑎𝑛 ⋅ 𝑒1 )𝑒1 + (𝑎𝑛 ⋅ 𝑒2 )𝑒2 + ⋯ + (𝑎𝑛 ⋅ 𝑒𝑛 )𝑒𝑛
𝑎𝑛+1 = (𝑎𝑛+1 ⋅ 𝑒1 )𝑒1 + (𝑎𝑛+1 ⋅ 𝑒2 )𝑒2 + ⋯ + (𝑎𝑛+1 ⋅ 𝑒𝑛 )𝑒𝑛
⋮ ⋮
𝑎𝑚 = (𝑎𝑚 ⋅ 𝑒1 )𝑒1 + (𝑎𝑚 ⋅ 𝑒2 )𝑒2 + ⋯ + (𝑎𝑚 ⋅ 𝑒𝑛 )𝑒𝑛
3.4 Some Code
Now let’s write some homemade Python code to implement a QR decomposition by deploying the Gram-Schmidt process
described above.
import numpy as np
from scipy.linalg import qr
def QR_Decomposition(A):
    n, m = A.shape  # get the shape of A

    Q = np.empty((n, n))  # initialize matrix Q
    u = np.empty((n, n))  # initialize matrix u

    u[:, 0] = A[:, 0]
    Q[:, 0] = u[:, 0] / np.linalg.norm(u[:, 0])

    for i in range(1, n):
        u[:, i] = A[:, i]
        for j in range(i):
            u[:, i] -= (A[:, i] @ Q[:, j]) * Q[:, j]  # get each u vector

        Q[:, i] = u[:, i] / np.linalg.norm(u[:, i])  # compute each e vector

    R = np.zeros((n, m))
    for i in range(n):
        for j in range(i, m):
            R[i, j] = A[:, j] @ Q[:, i]

    return Q, R

The preceding code is fine but can benefit from some further housekeeping.
We want to do this because later in this notebook we want to compare results from using our homemade code above with
the code for a QR that the Python scipy package delivers.
There can be sign differences between the 𝑄 and 𝑅 matrices produced by different numerical algorithms.
All of these are valid QR decompositions because of how the sign differences cancel out when we compute 𝑄𝑅.
However, to make the results from our homemade function and the QR module in scipy comparable, let’s require that
𝑄 have positive diagonal entries.
We do this by adjusting the signs of the columns in 𝑄 and the rows in 𝑅 appropriately.
To accomplish this we’ll define a pair of functions.
def diag_sign(A):
    "Compute the signs of the diagonal of matrix A"
    D = np.diag(np.sign(np.diag(A)))
    return D

def adjust_sign(Q, R):
    """
    Adjust the signs of the columns in Q and rows in R to
    impose positive diagonal of Q
    """
    D = diag_sign(Q)
    Q[:, :] = Q @ D
    R[:, :] = D @ R
    return Q, R
3.5 Example
Now let’s do an example.
A = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

# A = np.array([[1.0, 0.5, 0.2], [0.5, 0.5, 1.0], [0.0, 1.0, 1.0]])
# A = np.array([[1.0, 0.5, 0.2], [0.5, 0.5, 1.0]])
array([[1., 1., 0.],

[1., 0., 1.],
[0., 1., 1.]])
Q, R = adjust_sign(*QR_Decomposition(A))
array([[ 0.70710678, -0.40824829, -0.57735027],

[ 0.70710678, 0.40824829, 0.57735027],
[ 0. , -0.81649658, 0.57735027]])
array([[ 1.41421356, 0.70710678, 0.70710678],

[ 0. , -1.22474487, -0.40824829],
[ 0. , 0. , 1.15470054]])
Let’s compare outcomes with what the scipy package produces
Q_scipy, R_scipy = adjust_sign(*qr(A))
print('Our Q: \n', Q)
print('\n')
print('Scipy Q: \n', Q_scipy)
Our Q:
[[ 0.70710678 -0.40824829 -0.57735027]
[ 0.70710678 0.40824829 0.57735027]
[ 0. -0.81649658 0.57735027]]
Scipy Q:
[[ 0.70710678 -0.40824829 -0.57735027]
[ 0.70710678 0.40824829 0.57735027]
[ 0. -0.81649658 0.57735027]]
print('Our R: \n', R)
print('\n')
print('Scipy R: \n', R_scipy)
Our R:
[[ 1.41421356 0.70710678 0.70710678]
[ 0. -1.22474487 -0.40824829]
[ 0. 0. 1.15470054]]
Scipy R:
[[ 1.41421356 0.70710678 0.70710678]
[ 0. -1.22474487 -0.40824829]
[ 0. 0. 1.15470054]]
The above outcomes give us the good news that our homemade function agrees with what scipy produces.
Now let’s do a QR decomposition for a rectangular matrix 𝐴 that is 𝑛 × 𝑚 with 𝑚 > 𝑛.
A = np.array([[1, 3, 4], [2, 0, 9]])
Q, R = adjust_sign(*QR_Decomposition(A))
Q, R
(array([[ 0.4472136 , -0.89442719],

[ 0.89442719, 0.4472136 ]]),
array([[ 2.23606798, 1.34164079, 9.8386991 ],
[ 0. , -2.68328157, 0.4472136 ]]))
Q_scipy, R_scipy = adjust_sign(*qr(A))

Q_scipy, R_scipy
(array([[ 0.4472136 , -0.89442719],

[ 0.89442719, 0.4472136 ]]),
array([[ 2.23606798, 1.34164079, 9.8386991 ],
[ 0. , -2.68328157, 0.4472136 ]]))
3.6 Using QR Decomposition to Compute Eigenvalues
Now for a useful fact about the QR algorithm.

The following iterations on the QR decomposition can be used to compute eigenvalues of a square matrix 𝐴.
Here is the algorithm:
1. Set 𝐴0 = 𝐴 and form 𝐴0 = 𝑄0 𝑅0
2. Form 𝐴1 = 𝑅0 𝑄0 . Note that 𝐴1 is similar to 𝐴0 (easy to verify) and so has the same eigenvalues.
3. Form 𝐴1 = 𝑄1 𝑅1 (i.e., form the 𝑄𝑅 decomposition of 𝐴1 ).
4. Form 𝐴2 = 𝑅1 𝑄1 and then 𝐴2 = 𝑄2 𝑅2 .
5. Iterate to convergence.
6. Compute eigenvalues of 𝐴 and compare them to the diagonal values of the limiting 𝐴𝑛 found from this process.
Remark: this algorithm is close to one of the most efficient ways of computing eigenvalues!
Let’s write some Python code to try out the algorithm
def QR_eigvals(A, tol=1e-12, maxiter=1000):
    "Find the eigenvalues of A using QR decomposition."
    A_old = np.copy(A)
    A_new = np.copy(A)

    diff = np.inf
    i = 0
    while (diff > tol) and (i < maxiter):
        A_old[:, :] = A_new
        Q, R = QR_Decomposition(A_old)

        A_new[:, :] = R @ Q

        diff = np.abs(A_new - A_old).max()
        i += 1

    eigvals = np.diag(A_new)

    return eigvals
Now let’s try the code and compare the results with what np.linalg.eigvals gives us.
Here goes
# experiment this with one random A matrix

A = np.random.random((3, 3))
sorted(QR_eigvals(A))
[-0.5697946336664133, 0.06239382762551169, 2.0469077458946816]
Compare with what NumPy’s np.linalg.eigvals produces.
sorted(np.linalg.eigvals(A))
[-0.5697946336664135, 0.062393827625510115, 2.046907745894684]
3.7 𝑄𝑅 and PCA
There are interesting connections between the 𝑄𝑅 decomposition and principal components analysis (PCA).
Here are some.
1. Let $X'$ be a $k \times n$ random matrix where the $j$th column is a random draw from $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is a $k \times 1$ vector
of means and $\Sigma$ is a $k \times k$ covariance matrix. We want $n \gg k$; this is an “econometrics example”.
2. Form $X' = QR$ where $Q$ is $k \times k$ and $R$ is $k \times n$.
3. Form the eigen decomposition of $RR'$, i.e., we’ll compute $RR' = \tilde{P} \Lambda \tilde{P}'$.
4. Form $X'X = Q \tilde{P} \Lambda \tilde{P}' Q'$ and compare it with the eigen decomposition $X'X = P \hat{\Lambda} P'$.
5. It will turn out that $\Lambda = \hat{\Lambda}$ and that $P = Q \tilde{P}$.
Let’s verify conjecture 5 with some Python code.
Start by simulating a random (𝑛, 𝑘) matrix 𝑋.
k = 5
n = 1000
# generate some random moments
μ = np.random.random(size=k)
C = np.random.random((k, k))
Σ = C.T @ C
# X is a random matrix in which each row is a draw from a multivariate normal distribution
X = np.random.multivariate_normal(μ, Σ, size=n)
X.shape
(1000, 5)

Let’s apply the QR decomposition to 𝑋 ′ .
Q, R = adjust_sign(*QR_Decomposition(X.T))
Check the shapes of 𝑄 and 𝑅.
Q.shape, R.shape
((5, 5), (5, 1000))
Now we can construct $RR'$ and form its eigen decomposition $RR' = \tilde{P} \Lambda \tilde{P}'$.
RR = R @ R.T
λ, P_tilde = np.linalg.eigh(RR)
Λ = np.diag(λ)
We can also apply the eigen decomposition to $X'X = P \hat{\Lambda} P'$.
XX = X.T @ X
λ_hat, P = np.linalg.eigh(XX)
Λ_hat = np.diag(λ_hat)
Compare the eigenvalues that are on the diagonals of $\Lambda$ and $\hat{\Lambda}$.
λ, λ_hat
(array([ 36.45694801, 182.4271492 , 593.23015461, 1315.47957925,

8259.33586321]),
array([ 36.45694801, 182.4271492 , 593.23015461, 1315.47957925,
8259.33586321]))
Let’s compare $P$ and $Q \tilde{P}$.

Again we need to be careful about sign differences between the columns of $P$ and $Q \tilde{P}$.
QP_tilde = Q @ P_tilde
np.abs(P @ diag_sign(P) - QP_tilde @ diag_sign(QP_tilde)).max()
3.344546861683284e-15
Let’s verify that $X'X$ can be decomposed as $Q \tilde{P} \Lambda \tilde{P}' Q'$.
QPΛPQ = Q @ P_tilde @ Λ @ P_tilde.T @ Q.T
np.abs(QPΛPQ - XX).max()
5.002220859751105e-12
CHAPTER
FOUR
CIRCULANT MATRICES
4.1 Overview
This lecture describes circulant matrices and some of their properties.

Circulant matrices have a special structure that connects them to useful concepts including
• convolution
• Fourier transforms
• permutation matrices
Because of these connections, circulant matrices are widely used in machine learning, for example, in image processing.
We begin by importing some Python packages
import numpy as np
from numba import njit
np.set_printoptions(precision=3, suppress=True)
4.2 Constructing a Circulant Matrix
To construct an 𝑁 × 𝑁 circulant matrix, we need only the first row, say,
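$$\begin{bmatrix} c_0 & c_1 & c_2 & c_3 & c_4 & \cdots & c_{N-1} \end{bmatrix}.$$
After setting entries in the first row, the remaining rows of a circulant matrix are determined as follows:
$$C = \begin{bmatrix}
c_0 & c_1 & c_2 & c_3 & c_4 & \cdots & c_{N-1} \\
c_{N-1} & c_0 & c_1 & c_2 & c_3 & \cdots & c_{N-2} \\
c_{N-2} & c_{N-1} & c_0 & c_1 & c_2 & \cdots & c_{N-3} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
c_3 & c_4 & c_5 & c_6 & c_7 & \cdots & c_2 \\
c_2 & c_3 & c_4 & c_5 & c_6 & \cdots & c_1 \\
c_1 & c_2 & c_3 & c_4 & c_5 & \cdots & c_0
\end{bmatrix} \tag{4.1}$$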
It is also possible to construct a circulant matrix by creating the transpose of the above matrix, in which case only the first
column needs to be specified.
Let’s write some Python code to generate a circulant matrix.
@njit
def construct_cirlulant(row):
    N = row.size
    C = np.empty((N, N))
    for i in range(N):
        C[i, i:] = row[:N-i]
        C[i, :i] = row[N-i:]
    return C
# a simple case when N = 3

construct_cirlulant(np.array([1., 2., 3.]))
array([[1., 2., 3.],

[3., 1., 2.],
[2., 3., 1.]])
4.2.1 Some Properties of Circulant Matrices
Here are some useful properties:

Suppose that 𝐴 and 𝐵 are both circulant matrices. Then it can be verified that
• The transpose of a circulant matrix is a circulant matrix.
• 𝐴 + 𝐵 is a circulant matrix
• 𝐴𝐵 is a circulant matrix
• 𝐴𝐵 = 𝐵𝐴
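These properties are easy to check numerically. The sketch below builds circulant matrices by stacking cyclic shifts of a first row with `np.roll`; the helper names `circulant_from_row` and `is_circulant` are introduced here for illustration and are not part of the lecture's code.

```python
import numpy as np

def circulant_from_row(row):
    "Build a circulant matrix whose i-th row is `row` shifted right by i places."
    row = np.asarray(row, dtype=float)
    return np.array([np.roll(row, i) for i in range(row.size)])

def is_circulant(M):
    "Check whether M equals the circulant matrix generated by its first row."
    return bool(np.allclose(M, circulant_from_row(M[0])))

A = circulant_from_row([1.0, 2.0, 3.0, 4.0])
B = circulant_from_row([0.5, -1.0, 0.0, 2.0])

transpose_ok = is_circulant(A.T)
sum_ok = is_circulant(A + B)
product_ok = is_circulant(A @ B)
commute_ok = bool(np.allclose(A @ B, B @ A))

print(transpose_ok, sum_ok, product_ok, commute_ok)
```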
Now consider a circulant matrix with first row
𝑐 = [𝑐0 𝑐1 ⋯ 𝑐𝑁−1 ]
and consider a vector
𝑎 = [𝑎0 𝑎1 ⋯ 𝑎𝑁−1 ]
The convolution of vectors $c$ and $a$ is defined as the vector $b = c * a$ with components

$$b_k = \sum_{i=0}^{N-1} c_{k-i} a_i \tag{4.2}$$
where the index $k - i$ is interpreted modulo $N$.
We use $*$ to denote convolution via the calculation described in equation (4.2).

It can be verified that the vector $b$ satisfies
$$b = C^T a$$
where $C^T$ is the transpose of the circulant matrix defined in equation (4.1).
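The sketch below computes the convolution in three ways: directly from the sum in (4.2), as the matrix product with the transposed circulant matrix, and with the fast Fourier transform. It assumes that the index 𝑘 − 𝑖 in (4.2) wraps around modulo 𝑁 (circular convolution); the vectors are arbitrary examples.

```python
import numpy as np

c = np.array([1.0, 2.0, 3.0, 4.0])
a = np.array([0.5, -1.0, 0.0, 2.0])
N = c.size

# Circulant matrix with first row c, as in (4.1): entry (i, j) is c[(j - i) mod N]
C = np.array([np.roll(c, i) for i in range(N)])

# Direct evaluation of the sum, with indices taken modulo N (an assumption stated above)
b_sum = np.array([sum(c[(k - i) % N] * a[i] for i in range(N)) for k in range(N)])

# Matrix form
b_matrix = C.T @ a

# Circular convolution theorem: multiply the DFTs, then invert
b_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(a)).real

print(b_sum, b_matrix, b_fft)
```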

4.3 Connection to Permutation Matrix
A good way to construct a circulant matrix is to use a permutation matrix.

Before defining a permutation matrix, we’ll define a permutation.
A permutation of the set of non-negative integers {0, 1, 2, …} is a one-to-one mapping of the set onto itself.
A permutation of a set {1, 2, … , 𝑛} rearranges the 𝑛 integers in the set.
A permutation matrix is obtained by permuting the rows of an 𝑛 × 𝑛 identity matrix according to a permutation of the
numbers 1 to 𝑛.
Thus, every row and every column contain precisely a single 1 with 0 everywhere else.
Every permutation corresponds to a unique permutation matrix.
For example, the 𝑁 × 𝑁 matrix
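$$P = \begin{bmatrix}
0 & 1 & 0 & 0 & \cdots & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & 0 & \cdots & 0
\end{bmatrix} \tag{4.3}$$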
serves as a cyclic shift operator that, when applied to an 𝑁 × 1 vector ℎ, shifts entries in rows 2 through 𝑁 up one row
and shifts the entry in row 1 to row 𝑁 .
Eigenvalues of the cyclic shift permutation matrix 𝑃 defined in equation (4.3) can be computed by constructing
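$$P - \lambda I = \begin{bmatrix}
-\lambda & 1 & 0 & 0 & \cdots & 0 \\
0 & -\lambda & 1 & 0 & \cdots & 0 \\
0 & 0 & -\lambda & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & 0 & \cdots & -\lambda
\end{bmatrix}$$
and solving
$$\det(P - \lambda I) = (-1)^N (\lambda^N - 1) = 0$$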
Eigenvalues 𝜆𝑖 can be complex.

Magnitudes ∣ 𝜆𝑖 ∣ of these eigenvalues 𝜆𝑖 all equal 1.
Thus, singular values of the permutation matrix 𝑃 defined in equation (4.3) all equal 1.
It can be verified that permutation matrices are orthogonal matrices:
𝑃𝑃′ = 𝐼
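A small numerical check of these claims follows. It is a sketch that builds the cyclic shift matrix in (4.3) by rolling the columns of an identity matrix, which is one convenient construction among several; 𝑁 = 5 is an arbitrary choice.

```python
import numpy as np

N = 5
# Row i has its single 1 in column (i + 1) mod N, as in (4.3)
P = np.roll(np.eye(N), 1, axis=1)

h = np.arange(1.0, N + 1)          # the vector (1, 2, ..., N)
shifted = P @ h                    # entries move up one row; the first goes to the end

orthogonal = bool(np.allclose(P @ P.T, np.eye(N)))
singular_values = np.linalg.svd(P, compute_uv=False)
eigenvalue_moduli = np.abs(np.linalg.eigvals(P))

print(shifted, orthogonal, singular_values, eigenvalue_moduli)
```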

4.4 Examples with Python
Let’s write some Python code to illustrate these ideas.
@njit
def construct_P(N):
    P = np.zeros((N, N))
    for i in range(N-1):
        P[i, i+1] = 1
    P[-1, 0] = 1
    return P
P4 = construct_P(4)
P4
array([[0., 1., 0., 0.],

[0., 0., 1., 0.],
[0., 0., 0., 1.],
[1., 0., 0., 0.]])
# compute the eigenvalues and eigenvectors
λ, Q = np.linalg.eig(P4)
for i in range(4):
    print(f'λ{i} = {λ[i]:.1f} \nvec{i} = {Q[i, :]}\n')
λ0 = -1.0+0.0j
vec0 = [-0.5+0.j 0. +0.5j 0. -0.5j -0.5+0.j ]
λ1 = 0.0+1.0j
vec1 = [ 0.5+0.j -0.5+0.j -0.5-0.j -0.5+0.j]
λ2 = 0.0-1.0j
vec2 = [-0.5+0.j 0. -0.5j 0. +0.5j -0.5+0.j ]
λ3 = 1.0+0.0j
vec3 = [ 0.5+0.j 0.5-0.j 0.5+0.j -0.5+0.j]
In graphs below, we shall portray eigenvalues of a shift permutation matrix in the complex plane.
These eigenvalues are uniformly distributed along the unit circle.
They are the $N$ roots of unity, meaning they are the $N$ numbers $z$ that solve $z^N = 1$, where $z$ is a complex number.
In particular, the $N$ roots of unity are
$$z = \exp\left(\frac{2 \pi j k}{N}\right), \quad k = 0, \ldots, N - 1$$
where $j$ denotes the imaginary unit.
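Before plotting, the claim can be checked directly. The following sketch compares the computed eigenvalues of a shift permutation matrix with the roots of unity from the formula above; 𝑁 = 6 is an arbitrary choice, and the matching is done by distance because an eigenvalue routine returns eigenvalues in no particular order.

```python
import numpy as np

N = 6
P = np.roll(np.eye(N), 1, axis=1)       # cyclic shift matrix as in (4.3)

eigenvalues = np.linalg.eigvals(P)
roots = np.exp(2j * np.pi * np.arange(N) / N)

# Distance from every eigenvalue to every root of unity
distances = np.abs(eigenvalues[:, None] - roots[None, :])

# Each root of unity should coincide with some eigenvalue, and vice versa
every_root_matched = bool(np.all(distances.min(axis=0) < 1e-8))
every_eigenvalue_matched = bool(np.all(distances.min(axis=1) < 1e-8))

print(every_root_matched, every_eigenvalue_matched)
```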

import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 2, figsize=(10, 10))

for i, N in enumerate([3, 4, 6, 8]):
    row_i = i // 2
    col_i = i % 2

    P = construct_P(N)
    λ, Q = np.linalg.eig(P)

    circ = plt.Circle((0, 0), radius=1, edgecolor='b', facecolor='None')
    ax[row_i, col_i].add_patch(circ)

    for j in range(N):
        ax[row_i, col_i].scatter(λ[j].real, λ[j].imag, c='b')

    ax[row_i, col_i].set_title(f'N = {N}')
    ax[row_i, col_i].set_xlabel('real')
    ax[row_i, col_i].set_ylabel('imaginary')

plt.show()

For a vector of coefficients $\{c_i\}_{i=0}^{N-1}$, eigenvectors of $P$ are also eigenvectors of
$$C = c_0 I + c_1 P + c_2 P^2 + \cdots + c_{N-1} P^{N-1}.$$
Consider an example in which $N = 8$ and let $w = e^{-2 \pi j / N}$.

It can be verified that the matrix $F_8$ of eigenvectors of $P_8$ is
$$F_8 = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & w & w^2 & \cdots & w^7 \\
1 & w^2 & w^4 & \cdots & w^{14} \\
1 & w^3 & w^6 & \cdots & w^{21} \\
1 & w^4 & w^8 & \cdots & w^{28} \\
1 & w^5 & w^{10} & \cdots & w^{35} \\
1 & w^6 & w^{12} & \cdots & w^{42} \\
1 & w^7 & w^{14} & \cdots & w^{49}
\end{bmatrix}$$
The matrix $F_8$ defines a Discrete Fourier Transform.

To convert it into an orthogonal (more precisely, unitary) eigenvector matrix, we can simply normalize it by dividing every entry by $\sqrt{8}$.
• stare at the first column of $F_8$ above to convince yourself of this fact
The eigenvalues corresponding to each eigenvector are $\{w^j\}_{j=0}^{7}$ in order.
def construct_F(N):
    w = np.e ** (-complex(0, 2*np.pi/N))
    F = np.ones((N, N), dtype=complex)
    for i in range(1, N):
        F[i, 1:] = w ** (i * np.arange(1, N))
    return F, w
F8, w = construct_F(8)
(0.7071067811865476-0.7071067811865475j)
F8
array([[ 1. +0.j , 1. +0.j , 1. +0.j , 1. +0.j ,

1. +0.j , 1. +0.j , 1. +0.j , 1. +0.j ],
[ 1. +0.j , 0.707-0.707j, 0. -1.j , -0.707-0.707j,
-1. -0.j , -0.707+0.707j, -0. +1.j , 0.707+0.707j],
[ 1. +0.j , 0. -1.j , -1. -0.j , -0. +1.j ,
1. +0.j , 0. -1.j , -1. -0.j , -0. +1.j ],
[ 1. +0.j , -0.707-0.707j, -0. +1.j , 0.707-0.707j,
-1. -0.j , 0.707+0.707j, 0. -1.j , -0.707+0.707j],
[ 1. +0.j , -1. -0.j , 1. +0.j , -1. -0.j ,
1. +0.j , -1. -0.j , 1. +0.j , -1. -0.j ],
[ 1. +0.j , -0.707+0.707j, 0. -1.j , 0.707+0.707j,
-1. -0.j , 0.707-0.707j, -0. +1.j , -0.707-0.707j],
[ 1. +0.j , -0. +1.j , -1. -0.j , 0. -1.j ,
1. +0.j , -0. +1.j , -1. -0.j , 0. -1.j ],
[ 1. +0.j , 0.707+0.707j, -0. +1.j , -0.707+0.707j,
-1. -0.j , -0.707-0.707j, 0. -1.j , 0.707-0.707j]])
# normalize
Q8 = F8 / np.sqrt(8)
# verify the orthogonality (unitarity)

Q8 @ np.conjugate(Q8)
array([[ 1.+0.j, -0.+0.j, -0.+0.j, -0.+0.j, -0.+0.j, 0.+0.j, 0.+0.j,

0.+0.j],
[-0.-0.j, 1.+0.j, -0.+0.j, -0.+0.j, -0.+0.j, -0.+0.j, 0.+0.j,
0.+0.j],
[-0.-0.j, -0.-0.j, 1.+0.j, -0.+0.j, -0.+0.j, -0.+0.j, 0.+0.j,


0.+0.j],
[-0.-0.j, -0.-0.j, -0.-0.j, 1.+0.j, -0.+0.j, -0.+0.j, -0.+0.j,
-0.+0.j],
[-0.-0.j, -0.-0.j, -0.-0.j, -0.-0.j, 1.+0.j, -0.+0.j, -0.+0.j,
-0.+0.j],
[ 0.-0.j, -0.-0.j, -0.-0.j, -0.-0.j, -0.-0.j, 1.+0.j, -0.+0.j,
-0.+0.j],
[ 0.-0.j, 0.-0.j, 0.-0.j, -0.-0.j, -0.-0.j, -0.-0.j, 1.+0.j,
-0.+0.j],
[ 0.-0.j, 0.-0.j, 0.-0.j, -0.-0.j, -0.-0.j, -0.-0.j, -0.-0.j,
1.+0.j]])
Let’s verify that the $k$th column of $Q_8$ is an eigenvector of $P_8$ with eigenvalue $w^k$.
P8 = construct_P(8)
diff_arr = np.empty(8, dtype=complex)
for j in range(8):
    diff = P8 @ Q8[:, j] - w ** j * Q8[:, j]
    diff_arr[j] = diff @ diff.T
diff_arr
array([ 0.+0.j, -0.+0.j, -0.+0.j, -0.+0.j, -0.+0.j, -0.+0.j, -0.+0.j,

-0.+0.j])
4.5 Associated Permutation Matrix
Next, we execute calculations to verify that the circulant matrix $C$ defined in equation (4.1) can be written as
$$C = c_0 I + c_1 P + \cdots + c_{N-1} P^{N-1}$$
and that every eigenvector of $P$ is also an eigenvector of $C$.

We illustrate this for the $N = 8$ case.
c = np.random.random(8)
array([0.421, 0.58 , 0.352, 0.055, 0.428, 0.466, 0.943, 0.027])
C8 = construct_cirlulant(c)
Compute $c_0 I + c_1 P + \cdots + c_{N-1} P^{N-1}$.

N = 8
C = np.zeros((N, N))
P = np.eye(N)
for i in range(N):
    C += c[i] * P
    P = P8 @ P
array([[0.421, 0.58 , 0.352, 0.055, 0.428, 0.466, 0.943, 0.027],

[0.027, 0.421, 0.58 , 0.352, 0.055, 0.428, 0.466, 0.943],
[0.943, 0.027, 0.421, 0.58 , 0.352, 0.055, 0.428, 0.466],
[0.466, 0.943, 0.027, 0.421, 0.58 , 0.352, 0.055, 0.428],
[0.428, 0.466, 0.943, 0.027, 0.421, 0.58 , 0.352, 0.055],
[0.055, 0.428, 0.466, 0.943, 0.027, 0.421, 0.58 , 0.352],
[0.352, 0.055, 0.428, 0.466, 0.943, 0.027, 0.421, 0.58 ],
[0.58 , 0.352, 0.055, 0.428, 0.466, 0.943, 0.027, 0.421]])
C8
array([[0.421, 0.58 , 0.352, 0.055, 0.428, 0.466, 0.943, 0.027],

[0.027, 0.421, 0.58 , 0.352, 0.055, 0.428, 0.466, 0.943],
[0.943, 0.027, 0.421, 0.58 , 0.352, 0.055, 0.428, 0.466],
[0.466, 0.943, 0.027, 0.421, 0.58 , 0.352, 0.055, 0.428],
[0.428, 0.466, 0.943, 0.027, 0.421, 0.58 , 0.352, 0.055],
[0.055, 0.428, 0.466, 0.943, 0.027, 0.421, 0.58 , 0.352],
[0.352, 0.055, 0.428, 0.466, 0.943, 0.027, 0.421, 0.58 ],
[0.58 , 0.352, 0.055, 0.428, 0.466, 0.943, 0.027, 0.421]])
Now let’s compute the difference between two circulant matrices that we have constructed in two different ways.
np.abs(C - C8).max()
0.0
The $k$th column of $Q_8$ (counting from $k = 0$), which is associated with eigenvalue $w^k$ of $P_8$, is an eigenvector of $C_8$ associated with the eigenvalue $\sum_{h=0}^{7} c_h w^{hk}$.
λ_C8 = np.zeros(8, dtype=complex)
for j in range(8):
    for k in range(8):
        λ_C8[j] += c[k] * w ** (j * k)
λ_C8
array([ 3.272+0.j , 0.053+0.49j , -0.446-0.964j, -0.067-0.691j,

1.018+0.j , -0.067+0.691j, -0.446+0.964j, 0.053-0.49j ])
We can verify this by comparing C8 @ Q8[:, j] with λ_C8[j] * Q8[:, j].
# verify
for j in range(8):
    diff = C8 @ Q8[:, j] - λ_C8[j] * Q8[:, j]
    print(diff)
[0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j]

[-0.+0.j 0.+0.j 0.-0.j -0.-0.j -0.-0.j -0.+0.j -0.+0.j -0.+0.j]
[ 0.-0.j 0.-0.j -0.-0.j -0.-0.j -0.+0.j 0.+0.j 0.-0.j -0.-0.j]
[ 0.+0.j -0.-0.j -0.-0.j -0.+0.j 0.-0.j -0.-0.j 0.+0.j 0.-0.j]
[ 0.+0.j -0.-0.j 0.-0.j -0.+0.j 0.-0.j -0.+0.j 0.-0.j -0.+0.j]
[ 0.-0.j 0.+0.j 0.-0.j 0.+0.j -0.-0.j 0.-0.j -0.+0.j -0.-0.j]
[ 0.+0.j 0.-0.j 0.-0.j 0.-0.j 0.+0.j -0.+0.j -0.-0.j 0.-0.j]
[ 0.+0.j -0.+0.j 0.-0.j 0.-0.j 0.-0.j 0.+0.j 0.+0.j -0.+0.j]
4.6 Discrete Fourier Transform
The Discrete Fourier Transform (DFT) allows us to represent a discrete time sequence as a weighted sum of complex
sinusoids.
Consider a sequence of $N$ real numbers $\{x_j\}_{j=0}^{N-1}$.
The Discrete Fourier Transform maps $\{x_j\}_{j=0}^{N-1}$ into a sequence of complex numbers $\{X_k\}_{k=0}^{N-1}$
where
$$X_k = \sum_{n=0}^{N-1} x_n e^{-2 \pi \frac{kn}{N} i}$$
def DFT(x):
    "The discrete Fourier transform."
    N = len(x)
    w = np.e ** (-complex(0, 2*np.pi/N))
    X = np.zeros(N, dtype=complex)
    for k in range(N):
        for n in range(N):
            X[k] += x[n] * w ** (k * n)
    return X
Consider the following example.
$$x_n = \begin{cases} 1/2 & n = 0, 1 \\ 0 & \text{otherwise} \end{cases}$$
x = np.zeros(10)
x[0:2] = 1/2

array([0.5, 0.5, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ])
Apply a discrete Fourier transform.
X = DFT(x)
array([ 1. +0.j , 0.905-0.294j, 0.655-0.476j, 0.345-0.476j,

0.095-0.294j, -0. +0.j , 0.095+0.294j, 0.345+0.476j,
0.655+0.476j, 0.905+0.294j])
We can plot magnitudes of a sequence of numbers and the associated discrete Fourier transform.
def plot_magnitude(x=None, X=None):
    data = []
    names = []
    xs = []
    if (x is not None):
        data.append(x)
        names.append('x')
        xs.append('n')
    if (X is not None):
        data.append(X)
        names.append('X')
        xs.append('j')

    num = len(data)
    for i in range(num):
        n = data[i].size
        plt.figure(figsize=(8, 3))
        plt.scatter(range(n), np.abs(data[i]))
        plt.vlines(range(n), 0, np.abs(data[i]), color='b')

        plt.xlabel(xs[i])
        plt.ylabel('magnitude')
        plt.title(names[i])
        plt.show()
plot_magnitude(x=x, X=X)

The inverse Fourier transform transforms a Fourier transform 𝑋 of 𝑥 back to 𝑥.

The inverse Fourier transform is defined as
$$x_n = \sum_{k=0}^{N-1} \frac{1}{N} X_k e^{2 \pi \left(\frac{kn}{N}\right) i}, \quad n = 0, 1, \ldots, N - 1$$
def inverse_transform(X):
    N = len(X)
    w = np.e ** (complex(0, 2*np.pi/N))
    x = np.zeros(N, dtype=complex)
    for n in range(N):
        for k in range(N):
            x[n] += X[k] * w ** (k * n) / N
    return x

inverse_transform(X)
array([ 0.5+0.j, 0.5-0.j, -0. -0.j, -0. -0.j, -0. -0.j, -0. -0.j,
-0. +0.j, -0. +0.j, -0. +0.j, -0. +0.j])
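As a cross-check, the transform pair defined above agrees with NumPy's FFT routines, which use the same sign and scaling conventions. The sketch below rebuilds the transform as a matrix product so that it is self-contained; the matrix name `F` is an illustrative choice.

```python
import numpy as np

# The same example sequence as above
x = np.zeros(10)
x[0:2] = 0.5

N = x.size
n = np.arange(N)

# DFT matrix with entries exp(-2πi kn / N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)

X = F @ x                      # forward transform
x_back = np.conj(F) @ X / N    # inverse transform: conjugate matrix, scaled by 1/N

agrees_with_fft = bool(np.allclose(X, np.fft.fft(x)))
agrees_with_ifft = bool(np.allclose(x_back, np.fft.ifft(X)))
round_trip = bool(np.allclose(x_back, x))

print(agrees_with_fft, agrees_with_ifft, round_trip)
```

The double loop in the definitions costs on the order of 𝑁² operations, whereas `np.fft.fft` uses a fast algorithm, so the library routine is preferable for long sequences.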
Another example is
$$x_n = 2 \cos\left(2 \pi \frac{11}{40} n\right), \quad n = 0, 1, 2, \cdots, 19$$
Since $N = 20$, we cannot use an integer multiple of $\frac{1}{20}$ to represent a frequency $\frac{11}{40}$.
To handle this, we shall end up using all $N$ of the available frequencies in the DFT.
Since $\frac{11}{40}$ is in between $\frac{10}{40}$ and $\frac{12}{40}$ (each of which is an integer multiple of $\frac{1}{20}$), the complex coefficients in the DFT have
their largest magnitudes at $k = 5, 6, 14, 15$, not just at a single frequency.
N = 20
x = np.empty(N)
for j in range(N):
    x[j] = 2 * np.cos(2 * np.pi * 11 * j / 40)
X = DFT(x)

What happens if we change the last example to $x_n = 2 \cos\left(2 \pi \frac{10}{40} n\right)$?

Note that $\frac{10}{40}$ is an integer multiple of $\frac{1}{20}$.
N = 20
x = np.empty(N)
for j in range(N):
    x[j] = 2 * np.cos(2 * np.pi * 10 * j / 40)
X = DFT(x)
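The contrast between the two examples can be summarized by locating the largest DFT magnitudes in each case. The sketch below uses `np.fft.fft` in place of the `DFT` function above so that it is self-contained; the helper `top_frequencies` is a hypothetical name introduced for this illustration.

```python
import numpy as np

N = 20
n = np.arange(N)

def top_frequencies(cycles_per_40, count):
    "Return the indices k of the `count` largest DFT magnitudes, plus all magnitudes."
    x = 2 * np.cos(2 * np.pi * cycles_per_40 * n / 40)
    magnitudes = np.abs(np.fft.fft(x))
    peaks = sorted(np.argsort(magnitudes)[-count:].tolist())
    return peaks, magnitudes

# Frequency 11/40 is not a multiple of 1/20: energy spreads over many k
leaky_peaks, leaky_mag = top_frequencies(11, 4)

# Frequency 10/40 = 5/20 is a multiple of 1/20: only two coefficients are nonzero
clean_peaks, clean_mag = top_frequencies(10, 2)
num_nonzero_clean = int(np.sum(clean_mag > 1e-8))

print(leaky_peaks, clean_peaks, num_nonzero_clean)
```

Because the input sequence is real, the magnitudes are symmetric: a peak at index 𝑘 is mirrored at index 𝑁 − 𝑘.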

If we represent the discrete Fourier transform as a matrix, we discover that it equals the matrix 𝐹𝑁 of eigenvectors of the
permutation matrix 𝑃𝑁 .
We can use the example where $x_n = 2 \cos\left(2 \pi \frac{11}{40} n\right), \ n = 0, 1, 2, \cdots, 19$ to illustrate this.
N = 20
x = np.empty(N)
for j in range(N):
    x[j] = 2 * np.cos(2 * np.pi * 11 * j / 40)
array([ 2. , -0.313, -1.902, 0.908, 1.618, -1.414, -1.176, 1.782,

0.618, -1.975, -0. , 1.975, -0.618, -1.782, 1.176, 1.414,
-1.618, -0.908, 1.902, 0.313])
First use the summation formula to transform 𝑥 to 𝑋.
X = DFT(x)
X
array([2. +0.j , 2. +0.558j, 2. +1.218j, 2. +2.174j, 2. +4.087j,

2.+12.785j, 2.-12.466j, 2. -3.751j, 2. -1.801j, 2. -0.778j,
2. -0.j , 2. +0.778j, 2. +1.801j, 2. +3.751j, 2.+12.466j,
2.-12.785j, 2. -4.087j, 2. -2.174j, 2. -1.218j, 2. -0.558j])
Now let’s evaluate the outcome of postmultiplying the eigenvector matrix $F_{20}$ by the vector $x$, a product that we claim
should equal the Fourier transform of the sequence $\{x_n\}_{n=0}^{N-1}$.
F20, _ = construct_F(20)
F20 @ x

array([2. +0.j , 2. +0.558j, 2. +1.218j, 2. +2.174j, 2. +4.087j,

2.+12.785j, 2.-12.466j, 2. -3.751j, 2. -1.801j, 2. -0.778j,
2. -0.j , 2. +0.778j, 2. +1.801j, 2. +3.751j, 2.+12.466j,
2.-12.785j, 2. -4.087j, 2. -2.174j, 2. -1.218j, 2. -0.558j])
Similarly, the inverse DFT can be expressed as an inverse DFT matrix $F_{20}^{-1}$.
F20_inv = np.linalg.inv(F20)
F20_inv @ X
array([ 2. -0.j, -0.313-0.j, -1.902-0.j, 0.908-0.j, 1.618-0.j,

-1.414+0.j, -1.176+0.j, 1.782+0.j, 0.618-0.j, -1.975-0.j,
-0. +0.j, 1.975-0.j, -0.618-0.j, -1.782+0.j, 1.176+0.j,
1.414-0.j, -1.618-0.j, -0.908+0.j, 1.902+0.j, 0.313-0.j])

CHAPTER
FIVE
SINGULAR VALUE DECOMPOSITION (SVD)
5.1 Overview
The singular value decomposition (SVD) is a work-horse in applications of least squares projection that form foundations
for many statistical and machine learning methods.
After defining the SVD, we’ll describe how it connects to
• four fundamental spaces of linear algebra
• under-determined and over-determined least squares regressions
• principal components analysis (PCA)
Like principal components analysis (PCA), dynamic mode decomposition (DMD) can be thought of as a data-reduction procedure that represents salient
patterns by projecting data onto a limited set of factors.
In a sequel to this lecture about Dynamic Mode Decompositions, we’ll describe how SVDs provide ways to rapidly compute
reduced-order approximations to first-order Vector Autoregressions (VARs).
5.2 The Setting
Let 𝑋 be an 𝑚 × 𝑛 matrix of rank 𝑝.

Necessarily, 𝑝 ≤ min(𝑚, 𝑛).
In much of this lecture, we’ll think of 𝑋 as a matrix of data in which
• each column is an individual – a time period or person, depending on the application
• each row is a random variable describing an attribute of a time period or a person, depending on the application
We’ll be interested in two situations
• A short and fat case in which 𝑚 << 𝑛, so that there are many more columns (individuals) than rows (attributes).
• A tall and skinny case in which 𝑚 >> 𝑛, so that there are many more rows (attributes) than columns (individuals).
We’ll apply a singular value decomposition of 𝑋 in both situations.
In the 𝑚 << 𝑛 case in which there are many more individuals 𝑛 than attributes 𝑚, we can calculate sample moments of
a joint distribution by taking averages across observations of functions of the observations.
In this 𝑚 << 𝑛 case, we’ll look for patterns by using a singular value decomposition to do a principal components
analysis (PCA).
In the 𝑚 >> 𝑛 case in which there are many more attributes 𝑚 than individuals 𝑛 and when we are in a time-series
setting in which 𝑛 equals the number of time periods covered in the data set 𝑋, we’ll proceed in a different way.
We’ll again use a singular value decomposition, but now to construct a dynamic mode decomposition (DMD).
5.3 Singular Value Decomposition
A singular value decomposition of an 𝑚 × 𝑛 matrix 𝑋 of rank 𝑝 ≤ min(𝑚, 𝑛) is
𝑋 = 𝑈 Σ𝑉 ⊤ (5.1)
where
𝑈𝑈⊤ = 𝐼 𝑈 ⊤𝑈 = 𝐼
𝑉𝑉⊤ = 𝐼 𝑉 ⊤𝑉 = 𝐼
and
• 𝑈 is an 𝑚 × 𝑚 orthogonal matrix of left singular vectors of 𝑋
• Columns of 𝑈 are eigenvectors of 𝑋𝑋 ⊤
• 𝑉 is an 𝑛 × 𝑛 orthogonal matrix of right singular vectors of 𝑋
• Columns of 𝑉 are eigenvectors of 𝑋 ⊤ 𝑋
• Σ is an 𝑚 × 𝑛 matrix in which the first 𝑝 places on its main diagonal are positive numbers 𝜎1 , 𝜎2 , … , 𝜎𝑝 called
singular values; remaining entries of Σ are all zero
• The 𝑝 singular values are positive square roots of the eigenvalues of the 𝑚 × 𝑚 matrix 𝑋𝑋 ⊤ and also of the 𝑛 × 𝑛
matrix 𝑋 ⊤ 𝑋
• We adopt a convention that when 𝑈 is a complex valued matrix, 𝑈⊤ denotes the conjugate-transpose or Hermitian-transpose of 𝑈, meaning that $U^\top_{ij}$ is the complex conjugate of $U_{ji}$.
• Similarly, when 𝑉 is a complex valued matrix, 𝑉⊤ denotes the conjugate-transpose or Hermitian-transpose of 𝑉.
The matrices 𝑈, Σ, 𝑉 entail linear transformations that reshape vectors in the following ways:
• multiplying vectors by the unitary matrices 𝑈 and 𝑉 rotates them, but leaves angles between vectors and lengths
of vectors unchanged.
• multiplying vectors by the diagonal matrix Σ leaves angles between vectors unchanged but rescales vectors.
Thus, representation (5.1) asserts that multiplying an 𝑛 × 1 vector 𝑦 by the 𝑚 × 𝑛 matrix 𝑋 amounts to performing the
following three multiplications of 𝑦 sequentially:
• rotating 𝑦 by computing 𝑉 ⊤ 𝑦
• rescaling 𝑉 ⊤ 𝑦 by multiplying it by Σ
• rotating Σ𝑉 ⊤ 𝑦 by multiplying it by 𝑈
This structure of the 𝑚 × 𝑛 matrix 𝑋 opens the door to constructing systems of data encoders and decoders.
Thus,
• 𝑉 ⊤ 𝑦 is an encoder
• Σ is an operator to be applied to the encoded data
• 𝑈 is a decoder to be applied to the output from applying operator Σ to the encoded data
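The encoder, operator, and decoder steps listed above can be sketched numerically. This is only an illustration under stated assumptions: the matrix sizes, the random data, and the names `encoded`, `rescaled`, and `decoded` are hypothetical choices made for the example, not part of the lecture's own code.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
m, n = 4, 3                      # illustrative sizes (assumed)
X = rng.standard_normal((m, n))
y = rng.standard_normal(n)

# np.linalg.svd returns U, the vector of singular values, and V transposed
U, s, VT = np.linalg.svd(X, full_matrices=True)

# Build the m x n matrix Sigma from the vector of singular values
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

encoded = VT @ y             # rotate y (the encoder)
rescaled = Sigma @ encoded   # apply the operator Sigma to the encoded data
decoded = U @ rescaled       # rotate again (the decoder)

# The three steps together reproduce X @ y
print(np.allclose(decoded, X @ y))

# The rotation V^T leaves the length of y unchanged
print(np.isclose(np.linalg.norm(encoded), np.linalg.norm(y)))
```

Both checks depend only on the orthogonality of 𝑈 and 𝑉, so they should hold for any real matrix 𝑋, not just this random draw.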
We’ll apply this circle of ideas later in this lecture when we study Dynamic Mode Decomposition.
Road Ahead
What we have described above is called a full SVD.
In a full SVD, the shapes of 𝑈 , Σ, and 𝑉 are (𝑚, 𝑚), (𝑚, 𝑛), (𝑛, 𝑛), respectively.
Later we’ll also describe an economy or reduced SVD.
Before we study a reduced SVD we’ll say a little more about properties of a full SVD.
5.4 Four Fundamental Subspaces
Let 𝒞 denote a column space, 𝒩 denote a null space, and ℛ denote a row space.
Let’s start by recalling the four fundamental subspaces of an 𝑚 × 𝑛 matrix 𝑋 of rank 𝑝.
• The column space of 𝑋, denoted 𝒞(𝑋), is the span of the columns of 𝑋, i.e., all vectors 𝑦 that can be written as
linear combinations of columns of 𝑋. Its dimension is 𝑝.
• The null space of 𝑋, denoted 𝒩(𝑋) consists of all vectors 𝑦 that satisfy 𝑋𝑦 = 0. Its dimension is 𝑛 − 𝑝.
• The row space of 𝑋, denoted ℛ(𝑋) is the column space of 𝑋 ⊤ . It consists of all vectors 𝑧 that can be written as
linear combinations of rows of 𝑋. Its dimension is 𝑝.
• The left null space of 𝑋, denoted 𝒩(𝑋⊤), consists of all vectors 𝑧 such that 𝑋⊤𝑧 = 0. Its dimension is 𝑚 − 𝑝.
For a full SVD of a matrix 𝑋, the matrix 𝑈 of left singular vectors and the matrix 𝑉 of right singular vectors contain
orthogonal bases for all four subspaces.
They form two pairs of orthogonal subspaces that we’ll describe now.
Let 𝑢𝑖 , 𝑖 = 1, … , 𝑚 be the 𝑚 column vectors of 𝑈 and let 𝑣𝑖 , 𝑖 = 1, … , 𝑛 be the 𝑛 column vectors of 𝑉 .
Let’s write the full SVD of X as
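$$
X = \begin{bmatrix} U_L & U_R \end{bmatrix}
\begin{bmatrix} \Sigma_p & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} V_L & V_R \end{bmatrix}^\top \tag{5.2}
$$
where $\Sigma_p$ is a 𝑝 × 𝑝 diagonal matrix with the 𝑝 singular values on the diagonal and
$$
\begin{aligned}
U_L &= \begin{bmatrix} u_1 & \cdots & u_p \end{bmatrix}, & U_R &= \begin{bmatrix} u_{p+1} & \cdots & u_m \end{bmatrix} \\
V_L &= \begin{bmatrix} v_1 & \cdots & v_p \end{bmatrix}, & V_R &= \begin{bmatrix} v_{p+1} & \cdots & v_n \end{bmatrix}
\end{aligned}
$$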
Representation (5.2) implies that
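$$
X \begin{bmatrix} V_L & V_R \end{bmatrix} = \begin{bmatrix} U_L & U_R \end{bmatrix}
\begin{bmatrix} \Sigma_p & 0 \\ 0 & 0 \end{bmatrix}
$$
or
$$
\begin{aligned}
X V_L &= U_L \Sigma_p \\
X V_R &= 0
\end{aligned} \tag{5.3}
$$
or
$$
\begin{aligned}
X v_i &= \sigma_i u_i, & i &= 1, \ldots, p \\
X v_i &= 0, & i &= p + 1, \ldots, n
\end{aligned} \tag{5.4}
$$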
Equations (5.4) tell how the transformation 𝑋 maps a pair of orthonormal vectors 𝑣𝑖, 𝑣𝑗 for 𝑖 and 𝑗 both less than or equal to the rank 𝑝 of 𝑋 into the pair of orthogonal vectors 𝜎𝑖𝑢𝑖, 𝜎𝑗𝑢𝑗, which are rescaled versions of the orthonormal vectors 𝑢𝑖, 𝑢𝑗.
Equations (5.3) assert that

𝒞(𝑋) = 𝒞(𝑈𝐿 )
𝒩(𝑋) = 𝒞(𝑉𝑅 )
Taking transposes on both sides of representation (5.2) implies
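$$
X^\top \begin{bmatrix} U_L & U_R \end{bmatrix} = \begin{bmatrix} V_L & V_R \end{bmatrix}
\begin{bmatrix} \Sigma_p & 0 \\ 0 & 0 \end{bmatrix}
$$
or
$$
\begin{aligned}
X^\top U_L &= V_L \Sigma_p \\
X^\top U_R &= 0
\end{aligned} \tag{5.5}
$$
or
$$
\begin{aligned}
X^\top u_i &= \sigma_i v_i, & i &= 1, \ldots, p \\
X^\top u_i &= 0, & i &= p + 1, \ldots, m
\end{aligned} \tag{5.6}
$$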
Notice how equations (5.6) assert that the transformation 𝑋⊤ maps a pair of distinct orthonormal vectors 𝑢𝑖, 𝑢𝑗 for 𝑖 and 𝑗 both less than or equal to the rank 𝑝 of 𝑋 into the pair of distinct orthogonal vectors 𝜎𝑖𝑣𝑖, 𝜎𝑗𝑣𝑗, which are rescaled versions of the orthonormal vectors 𝑣𝑖, 𝑣𝑗.
Equations (5.5) assert that
ℛ(𝑋) ≡ 𝒞(𝑋 ⊤ ) = 𝒞(𝑉𝐿 )
𝒩(𝑋 ⊤ ) = 𝒞(𝑈𝑅 )
Thus, taken together, the systems of equations (5.3) and (5.5) describe the four fundamental subspaces of 𝑋 in the
following ways:
𝒞(𝑋) = 𝒞(𝑈𝐿 )
𝒩(𝑋 ⊤ ) = 𝒞(𝑈𝑅 )
ℛ(𝑋) ≡ 𝒞(𝑋 ⊤ ) = 𝒞(𝑉𝐿 ) (5.7)
𝒩(𝑋) = 𝒞(𝑉𝑅 )
Since 𝑈 and 𝑉 are both orthonormal matrices, collection (5.7) asserts that
• 𝑈𝐿 is an orthonormal basis for the column space of 𝑋
• 𝑈𝑅 is an orthonormal basis for the null space of 𝑋 ⊤
• 𝑉𝐿 is an orthonormal basis for the row space of 𝑋
• 𝑉𝑅 is an orthonormal basis for the null space of 𝑋
We have verified the four claims in (5.7) simply by performing the multiplications called for by the right side of (5.2) and
reading them.
The claims in (5.7) and the fact that 𝑈 and 𝑉 are both unitary (i.e, orthonormal) matrices imply that
• the column space of 𝑋 is orthogonal to the null space of 𝑋 ⊤
• the null space of 𝑋 is orthogonal to the row space of 𝑋
Sometimes these properties are described with the following two pairs of orthogonal complement subspaces:
• 𝒞(𝑋) is the orthogonal complement of 𝒩(𝑋 ⊤ )
• ℛ(𝑋) is the orthogonal complement of 𝒩(𝑋)
Let’s do an example.

import numpy as np
import numpy.linalg as LA
Having imported these modules, let’s do the example.
np.set_printoptions(precision=2)
# Define the matrix

A = np.array([[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8],
[5, 6, 7, 8, 9]])
# Compute the SVD of the matrix
# (np.linalg.svd returns the transpose of V as its third output)
U, S, VT = np.linalg.svd(A, full_matrices=True)
V = VT.T
# Compute the rank of the matrix
rank = np.linalg.matrix_rank(A)
# Print the rank of the matrix
print("Rank of matrix:\n", rank)
print("S: \n", S)
# Compute the four fundamental subspaces
col_space = U[:, :rank]         # first p columns of U
left_null_space = U[:, rank:]   # remaining columns of U
row_space = V[:, :rank]         # first p columns of V
null_space = V[:, rank:]        # remaining columns of V
print("U:\n", U)
print("Column space:\n", col_space)
print("Left null space:\n", left_null_space)
print("V.T:\n", V.T)
print("Row space:\n", row_space.T)
print("Right null space:\n", null_space.T)
Rank of matrix:
2
S:
[2.69e+01 1.86e+00 1.20e-15 2.24e-16 5.82e-17]
U:
[[-0.27 -0.73 0.63 -0.06 0.06]
[-0.35 -0.42 -0.69 -0.45 0.12]
[-0.43 -0.11 -0.24 0.85 0.12]
[-0.51 0.19 0.06 -0.1 -0.83]
[-0.59 0.5 0.25 -0.24 0.53]]
Column space:
[[-0.27 -0.73]
[-0.35 -0.42]
[-0.43 -0.11]
[-0.51 0.19]
[-0.59 0.5 ]]
Left null space:
[[ 0.63 -0.06 0.06]
[-0.69 -0.45 0.12]
[-0.24 0.85 0.12]
[ 0.06 -0.1 -0.83]
[ 0.25 -0.24 0.53]]
V.T:
[[-0.27 -0.35 -0.43 -0.51 -0.59]
[ 0.73 0.42 0.11 -0.19 -0.5 ]
[ 0.32 -0.65 0.02 0.61 -0.31]
[ 0.54 -0.39 -0.29 -0.41 0.55]
[-0.06 -0.35 0.85 -0.4 -0.04]]
Row space:
[[-0.27 -0.35 -0.43 -0.51 -0.59]
[ 0.73 0.42 0.11 -0.19 -0.5 ]]
Right null space:
[[ 0.32 -0.65 0.02 0.61 -0.31]
[ 0.54 -0.39 -0.29 -0.41 0.55]
[-0.06 -0.35 0.85 -0.4 -0.04]]
5.5 Eckart-Young Theorem
Suppose that we want to construct the best rank 𝑟 approximation of an 𝑚 × 𝑛 matrix 𝑋.

By best, we mean a matrix 𝑋𝑟 of rank 𝑟 < 𝑝 that, among all rank 𝑟 matrices, minimizes
||𝑋 − 𝑋𝑟 ||
where || ⋅ || denotes a norm of a matrix 𝑋 and where 𝑋𝑟 belongs to the space of all rank 𝑟 matrices of dimension 𝑚 × 𝑛.
Three popular matrix norms of an 𝑚 × 𝑛 matrix 𝑋 can be expressed in terms of the singular values of 𝑋
• the spectral or $l^2$ norm $\|X\|_2 = \max_{\|y\| \neq 0} \frac{\|Xy\|}{\|y\|} = \sigma_1$
• the Frobenius norm $\|X\|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_p^2}$
• the nuclear norm $\|X\|_N = \sigma_1 + \cdots + \sigma_p$

The Eckart-Young theorem states that for each of these three norms, the same rank 𝑟 matrix is best and that it equals
𝑋̂ 𝑟 = 𝜎1 𝑈1 𝑉1⊤ + 𝜎2 𝑈2 𝑉2⊤ + ⋯ + 𝜎𝑟 𝑈𝑟 𝑉𝑟⊤ (5.8)
This is a very powerful theorem: it says that we can best approximate our 𝑚 × 𝑛 matrix 𝑋 of rank 𝑝 by an 𝑚 × 𝑛 matrix of lower rank 𝑟 that we can read directly off the SVD.
Moreover, if some of the 𝑝 singular values carry more information than others, and if we want to retain as much information as possible with as little data as possible, we can keep the 𝑟 leading singular values, ordered by magnitude, together with their singular vectors.
We’ll say more about this later when we present Principal Component Analysis.
You can read about the Eckart-Young theorem and some of its uses here.
We’ll make use of this theorem when we discuss principal components analysis (PCA) and also dynamic mode decom-
position (DMD).
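A minimal numerical sketch of the rank 𝑟 approximation in (5.8) follows. The matrix size, the random data, and the choice 𝑟 = 2 are assumptions made only for illustration. Because 𝑋 − 𝑋̂𝑟 has singular values 𝜎𝑟+1, …, 𝜎𝑝, each approximation error can be compared with the corresponding function of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(1)      # seed chosen only for reproducibility
X = rng.standard_normal((6, 4))     # illustrative 6 x 4 data matrix (assumed)
r = 2                               # target rank (assumed)

U, s, VT = np.linalg.svd(X, full_matrices=False)

# Sum of the r leading rank-one terms sigma_i * U_i * V_i^T, as in (5.8)
X_r = (U[:, :r] * s[:r]) @ VT[:r, :]

spectral_error = np.linalg.norm(X - X_r, 2)
frobenius_error = np.linalg.norm(X - X_r, 'fro')
nuclear_error = np.linalg.norm(X - X_r, 'nuc')

print(np.linalg.matrix_rank(X_r))
# Errors expressed through the discarded singular values
print(np.isclose(spectral_error, s[r]))
print(np.isclose(frobenius_error, np.sqrt(np.sum(s[r:] ** 2))))
print(np.isclose(nuclear_error, s[r:].sum()))
```

The sketch checks the size of the error for the truncated SVD. It does not by itself prove that no other rank 𝑟 matrix does better, which is the content of the theorem.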

5.6 Full and Reduced SVD’s
Up to now we have described properties of a full SVD in which shapes of 𝑈 , Σ, and 𝑉 are (𝑚, 𝑚), (𝑚, 𝑛), (𝑛, 𝑛),
respectively.
There is an alternative bookkeeping convention called an economy or reduced SVD in which the shapes of 𝑈 , Σ and 𝑉
are different from what they are in a full SVD.
Thus, note that because we assume that 𝑋 has rank 𝑝, there are only 𝑝 nonzero singular values, where 𝑝 = rank(𝑋) ≤
min (𝑚, 𝑛).
A reduced SVD uses this fact to express 𝑈 , Σ, and 𝑉 as matrices with shapes (𝑚, 𝑝), (𝑝, 𝑝), (𝑛, 𝑝).
You can read about reduced and full SVD here https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
For a full SVD,
𝑈𝑈⊤ = 𝐼 𝑈 ⊤𝑈 = 𝐼
𝑉𝑉⊤ = 𝐼 𝑉 ⊤𝑉 = 𝐼
But not all these properties hold for a reduced SVD.

Which properties hold depends on whether we are in a tall-skinny case or a short-fat case.
• In a tall-skinny case in which 𝑚 >> 𝑛, for a reduced SVD
𝑈𝑈⊤ ≠ 𝐼 𝑈 ⊤𝑈 = 𝐼
𝑉𝑉⊤ = 𝐼 𝑉 ⊤𝑉 = 𝐼
• In a short-fat case in which 𝑚 << 𝑛, for a reduced SVD
𝑈𝑈⊤ = 𝐼 𝑈 ⊤𝑈 = 𝐼
𝑉𝑉⊤ ≠ 𝐼 𝑉 ⊤𝑉 = 𝐼
When we study Dynamic Mode Decomposition below, we shall want to remember these properties when we use a reduced
SVD to compute some DMD representations.
Let’s do an exercise to compare full and reduced SVD’s.
To review,
• in a full SVD
– 𝑈 is 𝑚 × 𝑚
– Σ is 𝑚 × 𝑛
– 𝑉 is 𝑛 × 𝑛
• in a reduced SVD
– 𝑈 is 𝑚 × 𝑝
– Σ is 𝑝 × 𝑝
– 𝑉 is 𝑛 × 𝑝
First, let’s study a case in which 𝑚 = 5 > 𝑛 = 2.
(This is a small example of the tall-skinny case that will concern us when we study Dynamic Mode Decompositions
below.)
import numpy as np
X = np.random.rand(5,2)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V =')
U, S, V
U, S, V =
(array([[-0.48, 0.29, -0.29, -0.59, -0.51],

[-0.3 , -0.1 , -0.79, 0.53, 0.05],
[-0.52, -0.76, 0.33, 0.07, -0.21],
[-0.42, 0.57, 0.44, 0.54, -0.15],
[-0.49, 0.09, 0.04, -0.29, 0.82]]),
array([1.93, 0.69]),
array([[-0.52, -0.85],
[-0.85, 0.52]]))
print('Uhat, Shat, Vhat = ')

Uhat, Shat, Vhat
Uhat, Shat, Vhat =
(array([[-0.48, 0.29],
[-0.3 , -0.1 ],
[-0.52, -0.76],
[-0.42, 0.57],
[-0.49, 0.09]]),
array([1.93, 0.69]),
array([[-0.52, -0.85],
[-0.85, 0.52]]))
rr = np.linalg.matrix_rank(X)
print(f'rank of X = {rr}')
rank of X = 2
Properties:
• Where 𝑈 is constructed via a full SVD, 𝑈 ⊤ 𝑈 = 𝐼𝑚×𝑚 and 𝑈 𝑈 ⊤ = 𝐼𝑚×𝑚
• Where 𝑈̂ is constructed via a reduced SVD, although 𝑈̂ ⊤ 𝑈̂ = 𝐼𝑝×𝑝 , it happens that 𝑈̂ 𝑈̂ ⊤ ≠ 𝐼𝑚×𝑚
We illustrate these properties for our example with the following code cells.
UTU = U.T@U
UUT = U@U.T
print('UUT, UTU = ')
UUT, UTU
UUT, UTU =

(array([[ 1.00e+00, -2.73e-16, -4.78e-17, 9.58e-17, -5.26e-17],

[-2.73e-16, 1.00e+00, -3.84e-17, -1.13e-16, -2.19e-16],
[-4.78e-17, -3.84e-17, 1.00e+00, 1.15e-16, -1.66e-16],
[ 9.58e-17, -1.13e-16, 1.15e-16, 1.00e+00, -8.04e-18],
[-5.26e-17, -2.19e-16, -1.66e-16, -8.04e-18, 1.00e+00]]),
array([[ 1.00e+00, -1.96e-16, -1.25e-16, 7.70e-17, 8.84e-17],
[-1.96e-16, 1.00e+00, 1.29e-16, -2.40e-16, -3.68e-17],
[-1.25e-16, 1.29e-16, 1.00e+00, -2.67e-16, -3.88e-17],
[ 7.70e-17, -2.40e-16, -2.67e-16, 1.00e+00, 7.92e-17],
[ 8.84e-17, -3.68e-17, -3.88e-17, 7.92e-17, 1.00e+00]]))
UhatUhatT = Uhat@Uhat.T
UhatTUhat = Uhat.T@Uhat
print('UhatUhatT, UhatTUhat= ')
UhatUhatT, UhatTUhat
UhatUhatT, UhatTUhat=
(array([[ 0.31, 0.11, 0.03, 0.37, 0.26],

[ 0.11, 0.1 , 0.23, 0.07, 0.14],
[ 0.03, 0.23, 0.84, -0.21, 0.18],
[ 0.37, 0.07, -0.21, 0.5 , 0.26],
[ 0.26, 0.14, 0.18, 0.26, 0.25]]),
array([[ 1.00e+00, -1.96e-16],
[-1.96e-16, 1.00e+00]]))
Remarks:
The cells above illustrate the application of the full_matrices=True and full_matrices=False options.
Using full_matrices=False returns a reduced singular value decomposition.
The full and reduced SVD’s both accurately decompose an 𝑚 × 𝑛 matrix 𝑋.
When we study Dynamic Mode Decompositions below, it will be important for us to remember the preceding properties
of full and reduced SVD’s in such tall-skinny cases.
Now let’s turn to a short-fat case.
To illustrate this case, we’ll set 𝑚 = 2 < 5 = 𝑛 and compute both full and reduced SVD’s.
import numpy as np
X = np.random.rand(2,5)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V = ')
U, S, V
U, S, V =
(array([[ 0.92, -0.38],

[ 0.38, 0.92]]),
array([1.38, 0.31]),
array([[ 0.45, 0.46, 0.31, 0.55, 0.43],
[-0.34, -0.19, -0.53, 0.16, 0.74],
[-0.28, -0.52, 0.76, 0.02, 0.27],
[-0.58, 0.12, -0.02, 0.7 , -0.39],
[-0.51, 0.68, 0.22, -0.43, 0.19]]))
print('Uhat, Shat, Vhat = ')

Uhat, Shat, Vhat
Uhat, Shat, Vhat =
(array([[ 0.92, -0.38],

[ 0.38, 0.92]]),
array([1.38, 0.31]),
array([[ 0.45, 0.46, 0.31, 0.55, 0.43],
[-0.34, -0.19, -0.53, 0.16, 0.74]]))
Let’s verify that our reduced SVD accurately represents 𝑋.
SShat=np.diag(Shat)
np.allclose(X, Uhat@SShat@Vhat)
True
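The orthogonality properties of 𝑉 in the short-fat case can be checked in the same way as those of 𝑈 in the tall-skinny case. The sketch below is an illustration with assumed random data. Note that `np.linalg.svd` returns 𝑉⊤ as its third output, so the code transposes that output to recover the 𝑛 × 𝑝 matrix written `Vhat` here (a name chosen for this example).

```python
import numpy as np

rng = np.random.default_rng(2)   # seed chosen only for reproducibility
X = rng.random((2, 5))           # short-fat example with m = 2, n = 5

# The third return value of np.linalg.svd is V transposed
U, s, VT = np.linalg.svd(X, full_matrices=True)
Uhat, shat, VThat = np.linalg.svd(X, full_matrices=False)

V = VT.T          # n x n in the full SVD
Vhat = VThat.T    # n x p in the reduced SVD, with orthonormal columns

# Full SVD: both products are identity matrices
print(np.allclose(V.T @ V, np.eye(5)), np.allclose(V @ V.T, np.eye(5)))

# Reduced SVD: Vhat^T Vhat is the p x p identity ...
print(np.allclose(Vhat.T @ Vhat, np.eye(2)))
# ... but Vhat Vhat^T is an n x n projection matrix, not the identity
print(np.allclose(Vhat @ Vhat.T, np.eye(5)))
```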
5.7 Polar Decomposition
A reduced singular value decomposition (SVD) of 𝑋 is related to a polar decomposition of 𝑋
𝑋 = 𝑆𝑄
where
𝑆 = 𝑈 Σ𝑈 ⊤
𝑄 = 𝑈𝑉 ⊤
Here
• 𝑆 is an 𝑚 × 𝑚 symmetric matrix
• 𝑄 is an 𝑚 × 𝑛 orthogonal matrix
and in our reduced SVD
• 𝑈 is an 𝑚 × 𝑝 orthonormal matrix
• Σ is a 𝑝 × 𝑝 diagonal matrix
• 𝑉 is an 𝑛 × 𝑝 orthonormal matrix
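As a quick check of the polar decomposition formulas, here is a sketch that builds 𝑆 and 𝑄 from a reduced SVD. The short-fat random matrix with full row rank is an assumption made for the example. In that case 𝑈 is square, so the rows of 𝑄 are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(3)     # seed chosen only for reproducibility
X = rng.standard_normal((3, 5))    # assumed short-fat example, full row rank

U, s, VT = np.linalg.svd(X, full_matrices=False)   # reduced SVD

S = U @ np.diag(s) @ U.T   # m x m symmetric factor
Q = U @ VT                 # m x n factor

print(np.allclose(X, S @ Q))              # X = S Q
print(np.allclose(S, S.T))                # S is symmetric
print(np.allclose(Q @ Q.T, np.eye(3)))    # rows of Q are orthonormal here
```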

5.8 Application: Principal Components Analysis (PCA)
Let’s begin with a case in which 𝑛 >> 𝑚, so that we have many more individuals 𝑛 than attributes 𝑚.
The matrix 𝑋 is short and fat in an 𝑛 >> 𝑚 case as opposed to a tall and skinny case with 𝑚 >> 𝑛 to be discussed
later.
We regard 𝑋 as an 𝑚 × 𝑛 matrix of data:
𝑋 = [𝑋1 ∣ 𝑋2 ∣ ⋯ ∣ 𝑋𝑛 ]
where for $j = 1, \ldots, n$ the column vector $X_j = \begin{bmatrix} X_{1j} \\ X_{2j} \\ \vdots \\ X_{mj} \end{bmatrix}$ is a vector of observations on variables $\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}$.
In a time series setting, we would think of columns 𝑗 as indexing different times at which random variables are observed,
while rows index different random variables.
In a cross-section setting, we would think of columns 𝑗 as indexing different individuals for which random variables are
observed, while rows index different attributes.
As we have seen before, the SVD is a way to decompose a matrix into useful components, just like polar decomposition,
eigendecomposition, and many others.
PCA, on the other hand, is a method that builds on the SVD to analyze data. The goal is to apply certain steps, to help
better visualize patterns in data, using statistical tools to capture the most important patterns in data.
Step 1: Standardize the data:
Because our data matrix may hold variables of different units and scales, we first need to standardize the data.
First by computing the average of each row of 𝑋.
$$
\bar X_j = \frac{1}{m} \sum_{i=1}^{m} x_{i,j}
$$
We then create an average matrix out of these means:
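$$
\bar X = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
\begin{bmatrix} \bar X_1 \mid \bar X_2 \mid \cdots \mid \bar X_n \end{bmatrix}
$$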
We then subtract it from the original matrix to create a mean-centered matrix:
𝐵 = 𝑋 − 𝑋̄
Step 2: Compute the covariance matrix:

Then because we want to extract the relationships between variables rather than just their magnitude, in other words, we
want to know how they can explain each other, we compute the covariance matrix of 𝐵.
$$
C = \frac{1}{n} B^\top B
$$
Step 3: Decompose the covariance matrix and arrange the singular values:
If the matrix 𝐶 is diagonalizable, we can eigendecompose it, find its eigenvalues and rearrange the eigenvalue and eigenvector matrices in decreasing order.
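If 𝐶 is not diagonalizable, we can instead use an SVD of the mean-centered matrix, $B = U \Sigma V^\top$, which implies
$$
\begin{aligned}
B^\top B &= V \Sigma^\top U^\top U \Sigma V^\top \\
&= V \Sigma^\top \Sigma V^\top
\end{aligned}
$$
$$
C = \frac{1}{n} V \Sigma^\top \Sigma V^\top
$$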
We can then rearrange the columns in the matrices 𝑉 and Σ so that the singular values are in decreasing order.
Step 4: Select singular values, (optional) truncate the rest:
We can now decide how many singular values to pick, based on how much variance we want to retain (e.g., retaining 95% of the total variance).
We can obtain the percentage by calculating the variance contained in the leading 𝑟 factors divided by the variance in
total:
$$
\frac{\sum_{i=1}^{r} \sigma_i^2}{\sum_{i=1}^{p} \sigma_i^2}
$$
Step 5: Create the Score Matrix:
$$
\begin{aligned}
T &= BV \\
&= U \Sigma V^\top V \\
&= U \Sigma
\end{aligned}
$$
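The five steps can be sketched in a few lines of NumPy. This is a sketch under explicit assumptions rather than the lecture's own implementation: the simulated data, the sizes `n_obs` and `n_vars`, and the 95% threshold are hypothetical, and the array is laid out with rows as observations and columns as variables, so that it is column means that get subtracted and the covariance matrix is 𝐵⊤𝐵 divided by the number of observations.

```python
import numpy as np

rng = np.random.default_rng(4)    # seed chosen only for reproducibility
n_obs, n_vars = 200, 3            # hypothetical sizes
# Assumption: rows are observations, columns are variables
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X = rng.standard_normal((n_obs, n_vars)) @ mix

# Step 1: subtract the mean of each column
B = X - X.mean(axis=0)

# Step 2: covariance matrix of the centered data
C = B.T @ B / n_obs

# Step 3: SVD of B (NumPy returns singular values in decreasing order)
U, s, VT = np.linalg.svd(B, full_matrices=False)
V = VT.T

# Step 4: share of variance captured by the leading factors,
# and the smallest r that retains at least 95% of it
explained = np.cumsum(s ** 2) / np.sum(s ** 2)
r = int(np.searchsorted(explained, 0.95) + 1)

# Step 5: score matrix
T = B @ V

print(np.allclose(T, U * s))                                   # T = U Sigma
print(np.allclose(C, V @ np.diag(s ** 2 / n_obs) @ V.T))       # C = V (Sigma^T Sigma / n) V^T
print(r, explained)
```

Working from the SVD of 𝐵 avoids forming 𝐶 explicitly, although the sketch forms it anyway so that the two routes can be compared.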
5.9 Relationship of PCA to SVD
To relate an SVD to a PCA of data set 𝑋, first construct the SVD of the data matrix 𝑋:
Let’s assume that sample means of all variables are zero, so we don’t need to standardize our matrix.
𝑋 = 𝑈 Σ𝑉 ⊤ = 𝜎1 𝑈1 𝑉1⊤ + 𝜎2 𝑈2 𝑉2⊤ + ⋯ + 𝜎𝑝 𝑈𝑝 𝑉𝑝⊤ (5.9)
where
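$$
U = \begin{bmatrix} U_1 \mid U_2 \mid \ldots \mid U_m \end{bmatrix}
$$
$$
V^\top = \begin{bmatrix} V_1^\top \\ V_2^\top \\ \ldots \\ V_n^\top \end{bmatrix}
$$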
In equation (5.9), each of the 𝑚 × 𝑛 matrices 𝑈𝑗 𝑉𝑗⊤ is evidently of rank 1.
Thus, we have
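$$
X = \sigma_1 \begin{pmatrix} U_{11} V_1^\top \\ U_{21} V_1^\top \\ \cdots \\ U_{m1} V_1^\top \end{pmatrix}
+ \sigma_2 \begin{pmatrix} U_{12} V_2^\top \\ U_{22} V_2^\top \\ \cdots \\ U_{m2} V_2^\top \end{pmatrix}
+ \ldots
+ \sigma_p \begin{pmatrix} U_{1p} V_p^\top \\ U_{2p} V_p^\top \\ \cdots \\ U_{mp} V_p^\top \end{pmatrix} \tag{5.10}
$$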
Here is how we would interpret the objects in the matrix equation (5.10) in a time series context:
• for each $k = 1, \ldots, n$, the object $\{V_{kj}\}_{j=1}^{n}$ is a time series for the $k$th principal component
• $U_k = \begin{bmatrix} U_{1k} \\ U_{2k} \\ \ldots \\ U_{mk} \end{bmatrix}$, $k = 1, \ldots, m$, is a vector of loadings of variables $X_i$ on the $k$th principal component, $i = 1, \ldots, m$
• 𝜎𝑘 for each 𝑘 = 1, … , 𝑝 is the strength of 𝑘th principal component, where strength means contribution to the
overall covariance of 𝑋.
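The rank-one expansion in (5.9) is easy to confirm numerically. In the sketch below the data are random and the sizes are hypothetical. Row means are removed first so that the zero-mean assumption holds for each variable.

```python
import numpy as np

rng = np.random.default_rng(5)           # seed chosen only for reproducibility
m, n = 3, 8                              # hypothetical sizes
X = rng.standard_normal((m, n))
X = X - X.mean(axis=1, keepdims=True)    # zero sample mean for each variable (row)

U, s, VT = np.linalg.svd(X, full_matrices=False)

# Each term sigma_k * U_k V_k^T is an m x n matrix of rank one
terms = [s[k] * np.outer(U[:, k], VT[k, :]) for k in range(len(s))]

print([np.linalg.matrix_rank(term) for term in terms])
print(np.allclose(X, sum(terms)))        # the terms add up to X
```

Here row `k` of `VT` plays the role of the time series for the 𝑘th principal component, and column `k` of `U` holds the loadings.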
5.10 PCA with Eigenvalues and Eigenvectors
We now use an eigen decomposition of a sample covariance matrix to do PCA.

Let 𝑋𝑚×𝑛 be our 𝑚 × 𝑛 data matrix.
Let’s assume that sample means of all variables are zero.
We can assure this by pre-processing the data by subtracting sample means.
Define a sample covariance matrix Ω as
Ω = 𝑋𝑋 ⊤
Then use an eigen decomposition to represent Ω as follows:
Ω = 𝑃 Λ𝑃 ⊤
Here
• 𝑃 is 𝑚 × 𝑚 matrix of eigenvectors of Ω
• Λ is a diagonal matrix of eigenvalues of Ω
We can then represent 𝑋 as
𝑋 = 𝑃𝜖
where
𝜖 = 𝑃 −1 𝑋
and
𝜖𝜖⊤ = Λ.
We can verify that
𝑋𝑋 ⊤ = 𝑃 Λ𝑃 ⊤ . (5.11)
It follows that we can represent the data matrix 𝑋 as
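$$
X = \begin{bmatrix} X_1 \mid X_2 \mid \ldots \mid X_n \end{bmatrix}
= \begin{bmatrix} P_1 \mid P_2 \mid \ldots \mid P_m \end{bmatrix}
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \ldots \\ \epsilon_m \end{bmatrix}
= P_1 \epsilon_1 + P_2 \epsilon_2 + \ldots + P_m \epsilon_m
$$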
To reconcile the preceding representation with the PCA that we had obtained earlier through the SVD, we first note that
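$$
\epsilon_j \epsilon_j^\top = \lambda_j \equiv \sigma_j^2 .
$$
Now define $\tilde \epsilon_j = \frac{\epsilon_j}{\sqrt{\lambda_j}}$, which implies that $\tilde \epsilon_j \tilde \epsilon_j^\top = 1$.
Therefore
$$
\begin{aligned}
X &= \sqrt{\lambda_1} P_1 \tilde \epsilon_1 + \sqrt{\lambda_2} P_2 \tilde \epsilon_2 + \ldots + \sqrt{\lambda_m} P_m \tilde \epsilon_m \\
&= \sigma_1 P_1 \tilde \epsilon_1 + \sigma_2 P_2 \tilde \epsilon_2 + \ldots + \sigma_m P_m \tilde \epsilon_m ,
\end{aligned}
$$
which agrees with
$$
X = \sigma_1 U_1 V_1^\top + \sigma_2 U_2 V_2^\top + \ldots + \sigma_r U_r V_r^\top
$$
provided that we set
• $U_j = P_j$ (a vector of loadings of variables on principal component $j$)
• $V_k^\top = \tilde \epsilon_k$ (the $k$th principal component)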
Because there are alternative algorithms for computing 𝑃 and 𝑈 for a given data matrix 𝑋, depending on the algorithms used, we might have sign differences or different orders of eigenvectors.
We can resolve such ambiguities about 𝑈 and 𝑃 by
1. sorting eigenvalues and singular values in descending order
2. imposing positive diagonals on 𝑃 and 𝑈 and adjusting signs in 𝑉 ⊤ accordingly
5.11 Connections
To pull things together, it is useful to assemble and compare some formulas presented above.
First, consider an SVD of an 𝑚 × 𝑛 matrix:
𝑋 = 𝑈 Σ𝑉 ⊤
Compute:
𝑋𝑋 ⊤ = 𝑈 Σ𝑉 ⊤ 𝑉 Σ⊤ 𝑈 ⊤
≡ 𝑈 ΣΣ⊤ 𝑈 ⊤ (5.12)
≡ 𝑈 Λ𝑈 ⊤
Compare representation (5.12) with equation (5.11) above.

Evidently, 𝑈 in the SVD is the matrix 𝑃 of eigenvectors of 𝑋𝑋 ⊤ and ΣΣ⊤ is the matrix Λ of eigenvalues.
Second, let’s compute
𝑋 ⊤ 𝑋 = 𝑉 Σ⊤ 𝑈 ⊤ 𝑈 Σ𝑉 ⊤
= 𝑉 Σ⊤ Σ𝑉 ⊤
Thus, the matrix 𝑉 in the SVD is the matrix of eigenvectors of 𝑋⊤𝑋.

Summarizing and fitting things together, we have the eigen decomposition of the sample covariance matrix
𝑋𝑋 ⊤ = 𝑃 Λ𝑃 ⊤
where 𝑃 is an orthogonal matrix.

Further, from the SVD of 𝑋, we know that
𝑋𝑋 ⊤ = 𝑈 ΣΣ⊤ 𝑈 ⊤

where 𝑈 is an orthogonal matrix.

Thus, 𝑃 = 𝑈 and we have the representation of 𝑋
𝑋 = 𝑃 𝜖 = 𝑈 Σ𝑉 ⊤
It follows that
𝑈 ⊤ 𝑋 = Σ𝑉 ⊤ = 𝜖
Note that the preceding implies that
𝜖𝜖⊤ = Σ𝑉 ⊤ 𝑉 Σ⊤ = ΣΣ⊤ = Λ,
so that everything fits together.
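These connections can be confirmed on a small example before wrapping them in a class. The sketch below uses assumed random data with row means removed. The names `lam` and `eps` are stand-ins chosen here for the eigenvalues and the matrix 𝜖.

```python
import numpy as np

rng = np.random.default_rng(6)           # seed chosen only for reproducibility
m, n = 3, 10                             # hypothetical sizes
X = rng.standard_normal((m, n))
X = X - X.mean(axis=1, keepdims=True)    # zero sample mean for each variable

# Eigen decomposition of X X^T; eigh returns eigenvalues in ascending order
lam, P = np.linalg.eigh(X @ X.T)
lam, P = lam[::-1], P[:, ::-1]           # reorder to descending

U, s, VT = np.linalg.svd(X, full_matrices=False)

eps = U.T @ X                            # should equal Sigma V^T

print(np.allclose(lam, s ** 2))                    # eigenvalues are squared singular values
print(np.allclose(np.abs(P), np.abs(U)))           # columns of P and U agree up to sign
print(np.allclose(eps, np.diag(s) @ VT))           # U^T X = Sigma V^T
print(np.allclose(eps @ eps.T, np.diag(s ** 2)))   # eps eps^T = Lambda
```

The comparison of `P` with `U` is made in absolute value because eigenvectors and singular vectors are determined only up to sign. It also relies on the eigenvalues being distinct, which holds generically for random data.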

Below we define a class DecomAnalysis that wraps PCA and SVD for a given data matrix X.
class DecomAnalysis:
    """
    A class for conducting PCA and SVD.
    X: data matrix
    r_component: chosen rank for best approximation
    """

    def __init__(self, X, r_component=None):
        self.X = X
        self.Ω = (X @ X.T)
        self.m, self.n = X.shape
        self.r = LA.matrix_rank(X)

        if r_component:
            self.r_component = r_component
        else:
            self.r_component = self.m

    def pca(self):
        λ, P = LA.eigh(self.Ω)    # columns of P are eigenvectors
        ind = sorted(range(λ.size), key=lambda x: λ[x], reverse=True)

        # sort by eigenvalues
        self.λ = λ[ind]
        P = P[:, ind]
        self.P = P @ diag_sign(P)
        self.Λ = np.diag(self.λ)
        self.explained_ratio_pca = np.cumsum(self.λ) / self.λ.sum()

        # compute the m by n matrix of principal components
        self.ε = self.P.T @ self.X
        P = self.P[:, :self.r_component]
        ε = self.ε[:self.r_component, :]

        # transform data
        self.X_pca = P @ ε

    def svd(self):
        U, σ, VT = LA.svd(self.X)
        ind = sorted(range(σ.size), key=lambda x: σ[x], reverse=True)

        # sort by singular values
        d = min(self.m, self.n)
        self.σ = σ[ind]
        U = U[:, ind]
        D = diag_sign(U)
        self.U = U @ D
        VT[:d, :] = D @ VT[ind, :]
        self.VT = VT
        self.Σ = np.zeros((self.m, self.n))
        self.Σ[:d, :d] = np.diag(self.σ)
        σ_sq = self.σ ** 2
        self.explained_ratio_svd = np.cumsum(σ_sq) / σ_sq.sum()

        # slicing matrices by the number of components to use
        U = self.U[:, :self.r_component]
        Σ = self.Σ[:self.r_component, :self.r_component]
        VT = self.VT[:self.r_component, :]

        # transform data
        self.X_svd = U @ Σ @ VT

    def fit(self, r_component):
        # pca
        P = self.P[:, :r_component]
        ε = self.ε[:r_component, :]

        # transform data
        self.X_pca = P @ ε

        # svd
        U = self.U[:, :r_component]
        Σ = self.Σ[:r_component, :r_component]
        VT = self.VT[:r_component, :]

        # transform data
        self.X_svd = U @ Σ @ VT


def diag_sign(A):
    "Compute the signs of the diagonal of matrix A"
    D = np.diag(np.sign(np.diag(A)))
    return D
We also define a function that prints out information so that we can compare decompositions obtained by different algo-
rithms.
import matplotlib.pyplot as plt


def compare_pca_svd(da):
    """
    Compare the outcomes of PCA and SVD.
    """
    da.pca()
    da.svd()

    print('Eigenvalues and Singular values\n')
    print(f'λ = {da.λ}\n')
    print(f'σ^2 = {da.σ**2}\n')
    print('\n')

    # loading matrices
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    plt.suptitle('loadings')
    axs[0].plot(da.P.T)
    axs[0].set_title('P')
    axs[0].set_xlabel('m')
    axs[1].plot(da.U.T)
    axs[1].set_title('U')
    axs[1].set_xlabel('m')
    plt.show()

    # principal components (drawn on a new figure, since the first one has been shown)
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    plt.suptitle('principal components')
    axs[0].plot(da.ε.T)
    axs[0].set_title('ε')
    axs[0].set_xlabel('n')
    # scale the first r rows of V^T by the matching r square-rooted eigenvalues
    axs[1].plot(da.VT[:da.r, :].T * np.sqrt(da.λ[:da.r]))
    # raw string so that \t in \top is not read as a tab character
    axs[1].set_title(r'$V^\top *\sqrt{\lambda}$')
    axs[1].set_xlabel('n')
    plt.show()
5.12 Exercises
Exercise 5.12.1
In Ordinary Least Squares (OLS), we learn to compute $\hat \beta = (X^\top X)^{-1} X^\top y$, but there are cases in which this formula breaks down, such as when we have collinearity or an underdetermined system (a short-fat matrix).
In these cases, the $(X^\top X)$ matrix is not invertible (its determinant is zero) or is ill-conditioned (its determinant is very close to zero).
What we can do instead is to create what is called a pseudoinverse, a full rank approximation of the inverted matrix, so that we can compute $\hat \beta$ with it.
Thinking in terms of the Eckart-Young theorem, build the pseudoinverse matrix $X^{+}$ and use it to compute $\hat \beta$.

We can use SVD to compute the pseudoinverse:
𝑋 = 𝑈 Σ𝑉 ⊤
inverting 𝑋, we have:
𝑋 + = 𝑉 Σ+ 𝑈 ⊤
where:
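$$
\Sigma^{+} = \begin{bmatrix}
\frac{1}{\sigma_1} & 0 & \cdots & 0 & 0 \\
0 & \frac{1}{\sigma_2} & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \frac{1}{\sigma_p} & 0 \\
0 & 0 & \cdots & 0 & 0
\end{bmatrix}
$$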
and finally:
𝛽 ̂ = 𝑋 + 𝑦 = 𝑉 Σ+ 𝑈 ⊤ 𝑦
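One way to carry out this construction numerically is sketched below. The data are simulated and the sizes `n_obs` and `k` are hypothetical. The cutoff `tol` used to decide which singular values count as nonzero is an assumption of this sketch, not something prescribed by the exercise.

```python
import numpy as np

rng = np.random.default_rng(7)        # seed chosen only for reproducibility
n_obs, k = 5, 8                       # fewer observations than regressors (assumed)
X = rng.standard_normal((n_obs, k))
y = rng.standard_normal(n_obs)

U, s, VT = np.linalg.svd(X, full_matrices=False)

# Invert only singular values that are numerically nonzero
tol = s.max() * max(X.shape) * np.finfo(float).eps    # assumed cutoff
s_plus = np.zeros_like(s)
s_plus[s > tol] = 1.0 / s[s > tol]

X_plus = VT.T @ np.diag(s_plus) @ U.T    # X^+ = V Sigma^+ U^T
beta_hat = X_plus @ y

print(np.allclose(X_plus, np.linalg.pinv(X)))   # agrees with NumPy's pseudoinverse
print(np.allclose(X @ beta_hat, y))             # underdetermined system fits exactly
```

In this example 𝑋⊤𝑋 is 8 × 8 but has rank at most 5, so the textbook OLS formula cannot be applied, while the pseudoinverse still delivers a coefficient vector.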
For an example of PCA applied to analyzing the structure of intelligence tests, see the lecture Multivariable Normal Distribution.
Look at parts of that lecture that describe and illustrate the classic factor analysis model.
As mentioned earlier, in a sequel to this lecture about Dynamic Mode Decompositions, we’ll describe how SVD’s provide
ways rapidly to compute reduced-order approximations to first-order Vector Autoregressions (VARs).

CHAPTER
SIX
VARS AND DMDS
This lecture applies computational methods that we learned about in the lecture Singular Value Decomposition to
• first-order vector autoregressions (VARs)
• dynamic mode decompositions (DMDs)
• connections between DMDs and first-order VARs
6.1 First-Order Vector Autoregressions
We want to fit a first-order vector autoregression
𝑋𝑡+1 = 𝐴𝑋𝑡 + 𝐶𝜖𝑡+1 , 𝜖𝑡+1 ⟂ 𝑋𝑡 (6.1)
where 𝜖𝑡+1 is the time 𝑡 + 1 component of a sequence of i.i.d. 𝑚 × 1 random vectors with mean vector zero and identity
covariance matrix and where the 𝑚 × 1 vector 𝑋𝑡 is
$$
X_t = \begin{bmatrix} X_{1,t} & X_{2,t} & \cdots & X_{m,t} \end{bmatrix}^\top \tag{6.2}
$$
and where ⋅⊤ again denotes complex transposition and 𝑋𝑖,𝑡 is variable 𝑖 at time 𝑡.
We want to fit equation (6.1).
Our data are organized in an 𝑚 × (𝑛 + 1) matrix 𝑋̃
𝑋̃ = [𝑋1 ∣ 𝑋2 ∣ ⋯ ∣ 𝑋𝑛 ∣ 𝑋𝑛+1 ]
where for 𝑡 = 1, … , 𝑛 + 1, the 𝑚 × 1 vector 𝑋𝑡 is given by (6.2).

Thus, we want to estimate a system (6.1) that consists of 𝑚 least squares regressions of everything on one lagged value
of everything.
The 𝑖’th equation of (6.1) is a regression of 𝑋𝑖,𝑡+1 on the vector 𝑋𝑡 .
We proceed as follows.
From $\tilde X$, we form two 𝑚 × 𝑛 matrices
𝑋 = [𝑋1 ∣ 𝑋2 ∣ ⋯ ∣ 𝑋𝑛 ]
and
𝑋 ′ = [𝑋2 ∣ 𝑋3 ∣ ⋯ ∣ 𝑋𝑛+1 ]
Here ′ is part of the name of the matrix 𝑋 ′ and does not indicate matrix transposition.
We use ⋅⊤ to denote matrix transposition or its extension to complex matrices.

In forming 𝑋 and 𝑋′, we have in each case dropped a column from $\tilde X$, the last column in the case of 𝑋, and the first column in the case of 𝑋′.
Evidently, 𝑋 and 𝑋 ′ are both 𝑚 × 𝑛 matrices.
We denote the rank of 𝑋 as 𝑝 ≤ min(𝑚, 𝑛).
Two cases that interest us are
• 𝑛 >> 𝑚, so that we have many more time series observations 𝑛 than variables 𝑚
• 𝑚 >> 𝑛, so that we have many more variables 𝑚 than time series observations 𝑛
At a general level that includes both of these special cases, a common formula describes the least squares estimator 𝐴 ̂ of
𝐴.
But important details differ.
The common formula is
𝐴̂ = 𝑋′𝑋+ (6.3)
where 𝑋 + is the pseudo-inverse of 𝑋.

To read about the Moore-Penrose pseudo-inverse please see Moore-Penrose pseudo-inverse
Applicable formulas for the pseudo-inverse differ for our two cases.
Short-Fat Case:
When 𝑛 >> 𝑚, so that we have many more time series observations 𝑛 than variables 𝑚 and when 𝑋 has linearly
independent rows, 𝑋𝑋 ⊤ has an inverse and the pseudo-inverse 𝑋 + is
𝑋 + = 𝑋 ⊤ (𝑋𝑋 ⊤ )−1
Here 𝑋 + is a right-inverse that verifies 𝑋𝑋 + = 𝐼𝑚×𝑚 .

In this case, our formula (6.3) for the least-squares estimator of the population matrix of regression coefficients 𝐴 becomes
𝐴 ̂ = 𝑋 ′ 𝑋 ⊤ (𝑋𝑋 ⊤ )−1 (6.4)
This formula for least-squares regression coefficients is widely used in econometrics.

It is used to estimate vector autoregressions.
The right side of formula (6.4) is proportional to the empirical cross second moment matrix of 𝑋𝑡+1 and 𝑋𝑡 times the
inverse of the second moment matrix of 𝑋𝑡 .
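To make formula (6.4) concrete, here is a sketch that simulates a small first-order VAR and estimates its coefficient matrix. The matrices `A_true` and `C_true`, the sample size, and the initial condition are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(8)          # seed chosen only for reproducibility
m, n = 2, 500                           # many more time periods than variables
A_true = np.array([[0.8, 0.1],
                   [0.0, 0.5]])         # hypothetical stable coefficient matrix
C_true = 0.1 * np.eye(m)                # hypothetical shock loadings

# Simulate the m x (n + 1) data matrix
X_tilde = np.zeros((m, n + 1))
X_tilde[:, 0] = rng.standard_normal(m)
for t in range(n):
    X_tilde[:, t + 1] = A_true @ X_tilde[:, t] + C_true @ rng.standard_normal(m)

X, X_prime = X_tilde[:, :-1], X_tilde[:, 1:]

# Formula (6.4): A_hat = X' X^T (X X^T)^{-1}
A_hat = X_prime @ X.T @ np.linalg.inv(X @ X.T)

print(np.allclose(A_hat, X_prime @ np.linalg.pinv(X)))    # same as X' X^+
print(np.allclose(X @ np.linalg.pinv(X), np.eye(m)))      # X^+ is a right-inverse
print(A_hat)
```

With this many observations the estimate should be close to `A_true`, though how close depends on the simulated shocks.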
Tall-Skinny Case:
When 𝑚 >> 𝑛, so that we have many more attributes 𝑚 than time series observations 𝑛 and when 𝑋 has linearly
independent columns, 𝑋 ⊤ 𝑋 has an inverse and the pseudo-inverse 𝑋 + is
𝑋 + = (𝑋 ⊤ 𝑋)−1 𝑋 ⊤
Here 𝑋 + is a left-inverse that verifies 𝑋 + 𝑋 = 𝐼𝑛×𝑛 .

In this case, our formula (6.3) for a least-squares estimator of 𝐴 becomes
𝐴 ̂ = 𝑋 ′ (𝑋 ⊤ 𝑋)−1 𝑋 ⊤ (6.5)
Please compare formulas (6.4) and (6.5) for $\hat A$.
Here we are especially interested in formula (6.5).

The 𝑖th row of $\hat A$ is a 1 × 𝑚 vector of regression coefficients of $X_{i,t+1}$ on $X_{j,t}$, $j = 1, \ldots, m$.
If we use formula (6.5) to calculate $\hat A X$ we find that
$$
\hat A X = X'
$$
so that the regression equation fits perfectly.

This is a typical outcome in an underdetermined least-squares model.
To reiterate, in the tall-skinny case (described in Singular Value Decomposition) in which we have a number 𝑛 of obser-
vations that is small relative to the number 𝑚 of attributes that appear in the vector 𝑋𝑡 , we want to fit equation (6.1).
We confront the facts that the least squares estimator is underdetermined and that the regression equation fits perfectly.
To proceed, we’ll want efficiently to calculate the pseudo-inverse 𝑋 + .
The pseudo-inverse 𝑋 + will be a component of our estimator of 𝐴.
As our estimator 𝐴 ̂ of 𝐴 we want to form an 𝑚 × 𝑚 matrix that solves the least-squares best-fit problem
$$
\hat A = \operatorname{argmin}_{\check A} \| X' - \check A X \|_F \tag{6.6}
$$
where || ⋅ ||𝐹 denotes the Frobenius (or Euclidean) norm of a matrix.

The Frobenius norm is defined as
$$
\| A \|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{m} |A_{ij}|^2 }
$$
The minimizer of the right side of equation (6.6) is
𝐴̂ = 𝑋′𝑋+ (6.7)
where the (possibly huge) 𝑛 × 𝑚 matrix 𝑋 + = (𝑋 ⊤ 𝑋)−1 𝑋 ⊤ is again a pseudo-inverse of 𝑋.

For some situations that we are interested in, 𝑋⊤𝑋 can be close to singular, a situation that can make some numerical algorithms inaccurate.
To acknowledge that possibility, we’ll use efficient algorithms to construct a reduced-rank approximation of $\hat A$ in formula (6.5).
Such an approximation to our vector autoregression will no longer fit perfectly.
An efficient way to compute the pseudo-inverse 𝑋 + is to start with a singular value decomposition
𝑋 = 𝑈 Σ𝑉 ⊤ (6.8)
where we remind ourselves that for a reduced SVD, 𝑋 is an 𝑚 × 𝑛 matrix of data, 𝑈 is an 𝑚 × 𝑝 matrix, Σ is a 𝑝 × 𝑝
matrix, and 𝑉 is an 𝑛 × 𝑝 matrix.
We can efficiently construct the pertinent pseudo-inverse 𝑋 + by recognizing the following string of equalities.
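$$
\begin{aligned}
X^{+} &= (X^\top X)^{-1} X^\top \\
&= (V \Sigma U^\top U \Sigma V^\top)^{-1} V \Sigma U^\top \\
&= (V \Sigma \Sigma V^\top)^{-1} V \Sigma U^\top \\
&= V \Sigma^{-1} \Sigma^{-1} V^\top V \Sigma U^\top \\
&= V \Sigma^{-1} U^\top
\end{aligned} \tag{6.9}
$$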
(Since we are in the 𝑚 >> 𝑛 case in which 𝑉 ⊤ 𝑉 = 𝐼𝑝×𝑝 in a reduced SVD, we can use the preceding string of equalities
for a reduced SVD as well as for a full SVD.)
Thus, we shall construct a pseudo-inverse 𝑋 + of 𝑋 by using a singular value decomposition of 𝑋 in equation (6.8) to
compute
𝑋 + = 𝑉 Σ−1 𝑈 ⊤ (6.10)
where the matrix $\Sigma^{-1}$ is constructed by replacing each non-zero element of $\Sigma$ with $\sigma_j^{-1}$.
We can use formula (6.10) together with formula (6.7) to compute the matrix 𝐴 ̂ of regression coefficients.
Thus, our estimator 𝐴 ̂ = 𝑋 ′ 𝑋 + of the 𝑚 × 𝑚 matrix of coefficients 𝐴 is
𝐴 ̂ = 𝑋 ′ 𝑉 Σ−1 𝑈 ⊤ (6.11)
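Formulas (6.10) and (6.11) can be sketched as follows for a tall-skinny data set. The random data and the sizes are assumptions made for illustration. With 𝑋 of full column rank, the reduced SVD has 𝑝 = 𝑛 and all singular values are nonzero, so `1.0 / s` is well defined.

```python
import numpy as np

rng = np.random.default_rng(9)       # seed chosen only for reproducibility
m, n = 20, 4                         # many more variables than time periods
X_tilde = rng.standard_normal((m, n + 1))
X, X_prime = X_tilde[:, :-1], X_tilde[:, 1:]

U, s, VT = np.linalg.svd(X, full_matrices=False)   # U is m x p, VT is p x n
X_plus = VT.T @ np.diag(1.0 / s) @ U.T             # formula (6.10)
A_hat = X_prime @ X_plus                           # formula (6.11)

print(np.allclose(X_plus, np.linalg.inv(X.T @ X) @ X.T))   # matches (X^T X)^{-1} X^T
print(np.allclose(X_plus @ X, np.eye(n)))                  # X^+ is a left-inverse
print(np.allclose(A_hat @ X, X_prime))                     # the regression fits perfectly
```

The last check reproduces the perfect fit that characterizes the underdetermined case: the 𝑚 × 𝑚 matrix `A_hat` has far more free coefficients than there are observations to pin them down.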
6.2 Dynamic Mode Decomposition (DMD)
We turn to the 𝑚 >> 𝑛 tall and skinny case associated with Dynamic Mode Decomposition.
Here an 𝑚 × (𝑛 + 1) data matrix $\tilde X$ contains many more attributes (or variables) 𝑚 than time periods 𝑛 + 1.
Dynamic mode decomposition was introduced by [Schmid, 2010].
You can read about Dynamic Mode Decomposition in [Kutz et al., 2016] and [Brunton and Kutz, 2019] (section 7.2).
Dynamic Mode Decomposition (DMD) computes a rank 𝑟 < 𝑝 approximation to the least squares regression coefficients
𝐴 ̂ described by formula (6.11).
We’ll build up gradually to a formulation that is useful in applications.
We’ll do this by describing three alternative representations of our first-order linear dynamic system, i.e., our vector
autoregression.
Guide to three representations: In practice, we’ll mainly be interested in Representation 3.
We use the first two representations to present some useful intermediate steps that help us to appreciate what is under the
hood of Representation 3.
In applications, we’ll use only a small subset of DMD modes to approximate dynamics.
We use such a small subset of DMD modes to construct a reduced-rank approximation to 𝐴.
To do that, we’ll want to use the reduced SVDs associated with Representation 3, not the full SVDs associated with Representations 1 and 2.
Guide to impatient reader: In our applications, we’ll be using Representation 3.
You might want to skip the stage-setting representations 1 and 2 on first reading.
6.3 Representation 1
In this representation, we shall use a full SVD of 𝑋.

We use the 𝑚 columns of 𝑈 , and thus the 𝑚 rows of 𝑈 ⊤ , to define an 𝑚 × 1 vector 𝑏̃𝑡 as
𝑏̃𝑡 = 𝑈 ⊤ 𝑋𝑡 . (6.12)

The original data 𝑋𝑡 can be represented as
𝑋𝑡 = 𝑈 𝑏̃𝑡 (6.13)
(Here we use 𝑏 to remind ourselves that we are creating a basis vector.)

Since we are now using a full SVD, 𝑈 𝑈 ⊤ = 𝐼𝑚×𝑚 .
So it follows from equation (6.12) that we can reconstruct 𝑋𝑡 from 𝑏̃𝑡 .
In particular,
• Equation (6.12) serves as an encoder that rotates the 𝑚 × 1 vector 𝑋𝑡 to become an 𝑚 × 1 vector 𝑏̃𝑡
• Equation (6.13) serves as a decoder that reconstructs the 𝑚 × 1 vector 𝑋𝑡 by rotating the 𝑚 × 1 vector 𝑏̃𝑡
Define a transition matrix for an 𝑚 × 1 basis vector 𝑏̃𝑡 by
$$\tilde A = U^\top \hat A U \tag{6.14}$$
We can recover 𝐴 ̂ from
$$\hat A = U \tilde A U^\top$$
Dynamics of the 𝑚 × 1 basis vector 𝑏̃𝑡 are governed by
$$\tilde b_{t+1} = \tilde A \tilde b_t$$
To construct forecasts $\overline{X}_t$ of future values of 𝑋𝑡 conditional on 𝑋1 , we can apply decoders (i.e., rotators) to both sides
of this equation and deduce
$$\overline{X}_{t+1} = U \tilde A^t U^\top X_1$$
where we use $\overline{X}_{t+1}$, 𝑡 ≥ 1 to denote a forecast.
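The encoder, decoder, and forecast formula of Representation 1 can be sketched numerically as follows. This is only an illustration on random data; the sizes, the seed, and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3                                  # illustrative sizes
X_tilde = rng.standard_normal((m, n + 1))    # hypothetical data matrix
X, X_prime = X_tilde[:, :-1], X_tilde[:, 1:]

A_hat = X_prime @ np.linalg.pinv(X)          # least-squares coefficient matrix

# Full SVD: U is m x m and orthogonal
U, S, Vt = np.linalg.svd(X, full_matrices=True)

A_tilde = U.T @ A_hat @ U                    # transition matrix (6.14)
X1 = X[:, 0]
b1 = U.T @ X1                                # encoder (6.12)

# Forecast t steps ahead in rotated coordinates, then decode with U
t = 2
forecast_rotated = U @ np.linalg.matrix_power(A_tilde, t) @ b1
forecast_direct = np.linalg.matrix_power(A_hat, t) @ X1

roundtrip_ok = np.allclose(U @ b1, X1)       # decoder (6.13) undoes the encoder
forecasts_agree = np.allclose(forecast_rotated, forecast_direct)
print(roundtrip_ok, forecasts_agree)
```

Both checks rely on 𝑈 being square and orthogonal, which is what a full SVD delivers.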
6.4 Representation 2
This representation is related to one originally proposed by [Schmid, 2010].

It can be regarded as an intermediate step on the way to obtaining a related Representation 3 to be presented later.
As with Representation 1, we continue to
• use a full SVD and not a reduced SVD
As we observed and illustrated in a lecture about the Singular Value Decomposition
• (a) for a full SVD of 𝑋, 𝑈 𝑈 ⊤ = 𝐼𝑚×𝑚 and 𝑈 ⊤ 𝑈 = 𝐼𝑚×𝑚 are both identity matrices
• (b) for a reduced SVD of 𝑋, 𝑈 𝑈 ⊤ is not an identity matrix.
As we shall see later, a full SVD is too confining for what we ultimately want to do, namely, cope with situations in which
𝑈 𝑈 ⊤ is not an identity matrix because we use a reduced SVD of 𝑋.
But for now, let’s proceed under the assumption that we are using a full SVD so that both equalities in (a) hold.
Form an eigendecomposition of the 𝑚 × 𝑚 matrix $\tilde A = U^\top \hat A U$ defined in equation (6.14):
𝐴 ̃ = 𝑊 Λ𝑊 −1 (6.15)
where Λ is a diagonal matrix of eigenvalues and 𝑊 is an 𝑚 × 𝑚 matrix whose columns are eigenvectors corresponding
to rows (eigenvalues) in Λ.
When 𝑈 𝑈 ⊤ = 𝐼𝑚×𝑚 , as is true with a full SVD of 𝑋, it follows that
$$\hat A = U \tilde A U^\top = U W \Lambda W^{-1} U^\top \tag{6.16}$$
According to equation (6.16), the diagonal matrix Λ contains eigenvalues of 𝐴 ̂ and corresponding eigenvectors of 𝐴 ̂ are
columns of the matrix 𝑈 𝑊 .
It follows that the systematic (i.e., not random) parts of the 𝑋𝑡 dynamics captured by our first-order vector autoregressions
are described by
𝑋𝑡+1 = 𝑈 𝑊 Λ𝑊 −1 𝑈 ⊤ 𝑋𝑡
Multiplying both sides of the above equation by 𝑊 −1 𝑈 ⊤ gives
𝑊 −1 𝑈 ⊤ 𝑋𝑡+1 = Λ𝑊 −1 𝑈 ⊤ 𝑋𝑡
or
𝑏̂𝑡+1 = Λ𝑏̂𝑡
where our encoder is
𝑏̂𝑡 = 𝑊 −1 𝑈 ⊤ 𝑋𝑡
and our decoder is
𝑋𝑡 = 𝑈 𝑊 𝑏̂𝑡
We can use this representation to construct a predictor $\overline{X}_{t+1}$ of 𝑋𝑡+1 conditional on 𝑋1 via:
$$\overline{X}_{t+1} = U W \Lambda^t W^{-1} U^\top X_1 \tag{6.17}$$
In effect, [Schmid, 2010] defined an 𝑚 × 𝑚 matrix Φ𝑠 as
Φ𝑠 = 𝑈 𝑊 (6.18)
and a generalized inverse
$$\Phi_s^{+} = W^{-1} U^\top \tag{6.19}$$
[Schmid, 2010] then represented equation (6.17) as
$$\overline{X}_{t+1} = \Phi_s \Lambda^t \Phi_s^{+} X_1 \tag{6.20}$$
Components of the basis vector $\hat b_t = W^{-1} U^\top X_t \equiv \Phi_s^{+} X_t$ are
DMD projected modes.
To understand why they are called projected modes, notice that
$$\Phi_s^{+} = (\Phi_s^\top \Phi_s)^{-1} \Phi_s^\top$$
so that the 𝑚 × 𝑛 matrix
$$\hat b = \Phi_s^{+} X$$
is a matrix of regression coefficients of the 𝑚 × 𝑛 matrix 𝑋 on the 𝑚 × 𝑚 matrix Φ𝑠 .

We’ll say more about this interpretation in a related context when we discuss representation 3, which was suggested by
Tu et al. [Tu et al., 2014].
It is more appropriate to use representation 3 when, as is often the case in practice, we want to use a reduced SVD.

6.5 Representation 3
Departing from the procedures used to construct Representations 1 and 2, each of which deployed a full SVD, we now
use a reduced SVD.
Again, we let 𝑝 ≤ min(𝑚, 𝑛) be the rank of 𝑋.
Construct a reduced SVD
𝑋 = 𝑈̃ Σ̃ 𝑉 ̃ ⊤ ,
where now 𝑈̃ is 𝑚 × 𝑝, Σ̃ is 𝑝 × 𝑝, and 𝑉 ̃ ⊤ is 𝑝 × 𝑛.

Our minimum-norm least-squares approximator of 𝐴 now has representation
𝐴 ̂ = 𝑋 ′ 𝑉 ̃ Σ̃ −1 𝑈̃ ⊤ (6.21)
Computing Dominant Eigenvectors of 𝐴 ̂

We begin by paralleling a step used to construct Representation 1: we define a transition matrix for a rotated 𝑝 × 1 state 𝑏̃𝑡 by
$$\tilde A = \tilde U^\top \hat A \tilde U \tag{6.22}$$
Interpretation as projection coefficients

[Brunton and Kutz, 2022] remark that 𝐴 ̃ can be interpreted in terms of a projection of 𝐴 ̂ onto the 𝑝 modes in 𝑈̃ .
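To verify this, first note that, because 𝑈̃ ⊤ 𝑈̃ = 𝐼, it follows that
$$\tilde A = \tilde U^\top \hat A \tilde U = \tilde U^\top X' \tilde V \tilde\Sigma^{-1} \tilde U^\top \tilde U = \tilde U^\top X' \tilde V \tilde\Sigma^{-1} \tag{6.23}$$
Next, we’ll just compute the regression coefficients in a projection of 𝐴 ̂ on 𝑈̃ using a standard least-squares formula
$$(\tilde U^\top \tilde U)^{-1} \tilde U^\top \hat A = \tilde U^\top X' \tilde V \tilde\Sigma^{-1} \tilde U^\top = \tilde A \tilde U^\top .$$
Post-multiplying these coefficients by 𝑈̃ and using 𝑈̃ ⊤ 𝑈̃ = 𝐼 once more returns 𝐴 ̃.
Thus, 𝐴 ̃ can be read off from a least-squares projection of 𝐴 ̂ onto 𝑈̃ .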

An Inverse Challenge
Because we are using a reduced SVD, 𝑈̃ 𝑈̃ ⊤ ≠ 𝐼.
Consequently,
$$\hat A \neq \tilde U \tilde A \tilde U^\top ,$$
so we can’t simply recover 𝐴 ̂ from 𝐴 ̃ and 𝑈̃ .

A Blind Alley
We can start by hoping for the best and proceeding to construct an eigendecomposition of the 𝑝 × 𝑝 matrix 𝐴 ̃:
𝐴 ̃ = 𝑊̃ Λ𝑊̃ −1 (6.24)
where Λ is a diagonal matrix of 𝑝 eigenvalues and the columns of 𝑊̃ are corresponding eigenvectors.
Mimicking our procedure in Representation 2, we cross our fingers and compute an 𝑚 × 𝑝 matrix
Φ̃ 𝑠 = 𝑈̃ 𝑊̃ (6.25)
that corresponds to (6.18) for a full SVD.
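At this point, where 𝐴 ̂ is given by formula (6.21), it is interesting to compute $\hat A \tilde\Phi_s$:
$$
\begin{aligned}
\hat A \tilde\Phi_s &= (X' \tilde V \tilde\Sigma^{-1} \tilde U^\top)(\tilde U \tilde W) \\
&= X' \tilde V \tilde\Sigma^{-1} \tilde W \\
&\neq (\tilde U \tilde W) \Lambda \\
&= \tilde\Phi_s \Lambda
\end{aligned}
$$
That $\hat A \tilde\Phi_s \neq \tilde\Phi_s \Lambda$ means that, unlike the corresponding situation in Representation 2, columns of $\tilde\Phi_s = \tilde U \tilde W$ are not
eigenvectors of 𝐴 ̂ corresponding to eigenvalues on the diagonal of matrix Λ.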
An Approach That Works
Continuing our quest for eigenvectors of 𝐴 ̂ that we can compute with a reduced SVD, let’s define an 𝑚 × 𝑝 matrix Φ as
$$\Phi \equiv \hat A \tilde\Phi_s = X' \tilde V \tilde\Sigma^{-1} \tilde W \tag{6.26}$$
It turns out that columns of Φ are eigenvectors of 𝐴 ̂.

This is a consequence of a result established by Tu et al. [Tu et al., 2014] that we now present.
Proposition The 𝑝 columns of Φ are eigenvectors of 𝐴 ̂.
Proof: From formula (6.26) we have
$$
\begin{aligned}
\hat A \Phi &= (X' \tilde V \tilde\Sigma^{-1} \tilde U^\top)(X' \tilde V \tilde\Sigma^{-1} \tilde W) \\
&= X' \tilde V \tilde\Sigma^{-1} \tilde A \tilde W \\
&= X' \tilde V \tilde\Sigma^{-1} \tilde W \Lambda \\
&= \Phi \Lambda
\end{aligned}
$$
so that
$$\hat A \Phi = \Phi \Lambda . \tag{6.27}$$
Let 𝜙𝑖 be the 𝑖th column of Φ and 𝜆𝑖 be the corresponding 𝑖th eigenvalue of 𝐴 ̃ from decomposition (6.24).
Equating the 𝑚 × 1 vectors that appear in the 𝑖th columns on the two sides of equation (6.27) gives
$$\hat A \phi_i = \lambda_i \phi_i .$$
This equation confirms that 𝜙𝑖 is an eigenvector of 𝐴 ̂ that corresponds to eigenvalue 𝜆𝑖 of both 𝐴 ̃ and 𝐴 ̂.
This concludes the proof.
Also see [Brunton and Kutz, 2022] (p. 238).
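The proposition, and the contrast with the “blind alley”, can be checked numerically. The sketch below is an illustration on random data, with sizes, seed, and variable names chosen only for the example; it computes both Φ from (6.26) and the candidate matrix from (6.25) and tests the eigenvector equation for each.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 8, 4                                   # illustrative tall-and-skinny sizes
X_tilde = rng.standard_normal((m, n + 1))     # hypothetical data matrix
X, X_prime = X_tilde[:, :-1], X_tilde[:, 1:]

U, S, Vt = np.linalg.svd(X, full_matrices=False)   # reduced SVD
V, S_inv = Vt.T, np.diag(1.0 / S)

A_hat = X_prime @ V @ S_inv @ U.T             # formula (6.21), m x m
A_tilde = U.T @ X_prime @ V @ S_inv           # formulas (6.22)-(6.23), p x p

Lam, W = np.linalg.eig(A_tilde)               # decomposition (6.24); may be complex
Phi = X_prime @ V @ S_inv @ W                 # formula (6.26)
Phi_s = U @ W                                 # formula (6.25)

# Multiplying by Lam scales column i by eigenvalue i, i.e. right-multiplies by diag(Lam)
phi_is_eigen = np.allclose(A_hat @ Phi, Phi * Lam)
phi_s_is_eigen = np.allclose(A_hat @ Phi_s, Phi_s * Lam)
print(phi_is_eigen, phi_s_is_eigen)
```

With generic data such as this, the first check passes while the second fails, which is the pattern the text describes.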
6.5.1 Decoder of 𝑏̌ as a linear projection
From eigendecomposition (6.27) we can represent 𝐴 ̂ as
𝐴 ̂ = ΦΛΦ+ . (6.28)
From formula (6.28) we can deduce dynamics of the 𝑝 × 1 vector 𝑏̌𝑡 :
𝑏̌𝑡+1 = Λ𝑏̌𝑡

where
𝑏̌𝑡 = Φ+ 𝑋𝑡 (6.29)
Since the 𝑚 × 𝑝 matrix Φ has 𝑝 linearly independent columns, the generalized inverse of Φ is
Φ+ = (Φ⊤ Φ)−1 Φ⊤
and so
𝑏̌ = (Φ⊤ Φ)−1 Φ⊤ 𝑋 (6.30)
The 𝑝 × 𝑛 matrix 𝑏̌ is recognizable as a matrix of least squares regression coefficients of the 𝑚 × 𝑛 matrix 𝑋 on the 𝑚 × 𝑝
matrix Φ and consequently
𝑋̌ = Φ𝑏̌ (6.31)
is an 𝑚 × 𝑛 matrix of least squares projections of 𝑋 on Φ.

Variance Decomposition of 𝑋
By virtue of the least-squares projection theory discussed in this quantecon lecture https://python-advanced.quantecon.org/orth_proj.html, we can represent 𝑋 as the sum of the projection 𝑋̌ of 𝑋 on Φ plus a matrix of errors.
To verify this, note that the least squares projection 𝑋̌ is related to 𝑋 by
𝑋 = 𝑋̌ + 𝜖
or
𝑋 = Φ 𝑏̌ + 𝜖 (6.32)
where 𝜖 is an 𝑚 × 𝑛 matrix of least squares errors satisfying the least squares orthogonality conditions 𝜖⊤ Φ = 0 or
$$(X - \Phi \check b)^\top \Phi = 0_{n \times p} \tag{6.33}$$
Rearranging the orthogonality conditions (6.33) gives $X^\top \Phi = \check b^\top \Phi^\top \Phi$, which implies formula (6.30).
6.5.2 An Approximation
We now describe a way to approximate the 𝑝 × 1 vector 𝑏̌𝑡 instead of using formula (6.29).
In particular, the following argument adapted from [Brunton and Kutz, 2022] (page 240) provides a computationally
efficient way to approximate 𝑏̌𝑡 .
For convenience, we’ll apply the method at time 𝑡 = 1.
For 𝑡 = 1, from equation (6.32) we have
𝑋̌ 1 = Φ𝑏̌1 (6.34)
where 𝑏̌1 is a 𝑝 × 1 vector.

Recall from representation 1 above that 𝑋1 = 𝑈 𝑏̃1 , where 𝑏̃1 is a time 1 basis vector for representation 1 and 𝑈 is from
the full SVD 𝑋 = 𝑈 Σ𝑉 ⊤ .
It then follows from equation (6.32) that
𝑈 𝑏̃1 = 𝑋 ′ 𝑉 ̃ Σ̃ −1 𝑊̃ 𝑏̌1 + 𝜖1
where 𝜖1 is a least-squares error vector from equation (6.32).

It follows that
$$\tilde b_1 = U^\top X' \tilde V \tilde\Sigma^{-1} \tilde W \check b_1 + U^\top \epsilon_1$$
Replacing the error term 𝑈 ⊤ 𝜖1 by zero, and replacing 𝑈 from a full SVD of 𝑋 with 𝑈̃ from a reduced SVD, we obtain
an approximation 𝑏̂1 to 𝑏̃1 :
𝑏̂1 = 𝑈̃ ⊤ 𝑋 ′ 𝑉 ̃ Σ̃ −1 𝑊̃ 𝑏̌1
Recall that from equation (6.23), 𝐴 ̃ = 𝑈̃ ⊤ 𝑋 ′ 𝑉 ̃ Σ̃ −1 .

It then follows that
$$\hat b_1 = \tilde A \tilde W \check b_1$$
and therefore, by the eigendecomposition (6.24) of 𝐴 ̃, we have
𝑏̂1 = 𝑊̃ Λ𝑏̌1
Consequently,
𝑏̂1 = (𝑊̃ Λ)−1 𝑏̃1
or
𝑏̂1 = (𝑊̃ Λ)−1 𝑈̃ ⊤ 𝑋1 , (6.35)
which is a computationally efficient approximation to the following instance of equation (6.29) for the initial vector 𝑏̌1 :
𝑏̌1 = Φ+ 𝑋1 (6.36)
(To highlight that (6.35) is an approximation, users of DMD sometimes call components of basis vector 𝑏̌𝑡 = Φ+ 𝑋𝑡 the
exact DMD modes and components of 𝑏̂𝑡 = (𝑊̃ Λ)−1 𝑈̃ ⊤ 𝑋𝑡 the approximate modes.)
Conditional on 𝑋𝑡 , we can compute a decoded 𝑋̌ 𝑡+𝑗 , 𝑗 = 1, 2, … from the exact modes via
𝑋̌ 𝑡+𝑗 = ΦΛ𝑗 Φ+ 𝑋𝑡 (6.37)
or compute a decoded 𝑋̂ 𝑡+𝑗 from approximate modes via
𝑋̂ 𝑡+𝑗 = ΦΛ𝑗 (𝑊̃ Λ)−1 𝑈̃ ⊤ 𝑋𝑡 . (6.38)
We can then use a decoded 𝑋̌ 𝑡+𝑗 or 𝑋̂ 𝑡+𝑗 to forecast 𝑋𝑡+𝑗 .
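The two decoding formulas (6.37) and (6.38) can be sketched as follows. The data are random and the sizes, seed, horizon, and variable names are assumptions for illustration; `np.linalg.pinv` is used here as a stand-in for Φ+ because Φ can be complex.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 4                                  # illustrative sizes
X_tilde = rng.standard_normal((m, n + 1))     # hypothetical data matrix
X, X_prime = X_tilde[:, :-1], X_tilde[:, 1:]

U, S, Vt = np.linalg.svd(X, full_matrices=False)
V, S_inv = Vt.T, np.diag(1.0 / S)
A_tilde = U.T @ X_prime @ V @ S_inv
Lam, W = np.linalg.eig(A_tilde)
Phi = X_prime @ V @ S_inv @ W

x_t, j = X[:, 0], 3                           # condition on the first column, forecast 3 steps

# Formula (6.38): approximate modes (W Lambda)^{-1} U^T x_t, obtained with a linear solve
b_hat = np.linalg.solve(W * Lam, U.T @ x_t)
forecast_approx = (Phi * Lam**j) @ b_hat

# Formula (6.37): exact modes Phi^+ x_t
b_check = np.linalg.pinv(Phi) @ x_t
forecast_exact = (Phi * Lam**j) @ b_check

# Reference: iterate the m x m matrix A_hat directly
A_hat = X_prime @ V @ S_inv @ U.T
reference = np.linalg.matrix_power(A_hat, j) @ x_t
approx_matches_reference = np.allclose(forecast_approx, reference)
print(approx_matches_reference)
```

In this sketch the forecast built from (6.38) coincides with iterating 𝐴 ̂ directly, while never requiring more than 𝑝 × 𝑝 linear algebra for the mode computation; the forecast built from (6.37) can differ from it when 𝑋𝑡 does not lie in the column space of Φ.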
6.5.3 Using Fewer Modes
In applications, we’ll actually use only a few modes, often three or fewer.
Some of the preceding formulas assume that we have retained all 𝑝 modes associated with singular values of 𝑋.
We can adjust our formulas to describe a situation in which we instead retain only the 𝑟 < 𝑝 largest singular values.
In that case, we simply replace Σ̃ with the appropriate 𝑟 × 𝑟 matrix of singular values, 𝑈̃ with the 𝑚 × 𝑟 matrix whose
columns correspond to the 𝑟 largest singular values, and 𝑉 ̃ with the 𝑛 × 𝑟 matrix whose columns correspond to the 𝑟
largest singular values.
Counterparts of all of the salient formulas above then apply.
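A minimal sketch of the truncation step, assuming random data and an arbitrary choice 𝑟 = 2 (both are illustrative assumptions, as are the variable names), looks like this. It relies on the fact that `np.linalg.svd` returns singular values in descending order.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 12, 6, 2                            # keep r < p modes (illustrative)
X_tilde = rng.standard_normal((m, n + 1))     # hypothetical data matrix
X, X_prime = X_tilde[:, :-1], X_tilde[:, 1:]

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Retain the pieces associated with the r largest singular values
U_r, S_r, V_r = U[:, :r], S[:r], Vt[:r, :].T
S_r_inv = np.diag(1.0 / S_r)

A_hat_r = X_prime @ V_r @ S_r_inv @ U_r.T     # rank-r counterpart of (6.21)
A_tilde_r = U_r.T @ X_prime @ V_r @ S_r_inv   # r x r counterpart of (6.23)
Lam_r, W_r = np.linalg.eig(A_tilde_r)
Phi_r = X_prime @ V_r @ S_r_inv @ W_r         # m x r counterpart of (6.26)

still_eigen = np.allclose(A_hat_r @ Phi_r, Phi_r * Lam_r)
print(A_tilde_r.shape, Phi_r.shape, still_eigen)
```

The eigenvector property from the proposition carries over to the truncated objects, since the proof uses only the identity 𝑈̃ ⊤ 𝑈̃ = 𝐼, which the first 𝑟 columns of 𝑈̃ also satisfy.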

6.6 Source for Some Python Code
You can find a Python implementation of DMD here:

https://mathlab.sissa.it/pydmd


CHAPTER
SEVEN
USING NEWTON’S METHOD TO SOLVE ECONOMIC MODELS
Contents
• Using Newton’s Method to Solve Economic Models

– Overview
– Fixed Point Computation Using Newton’s Method
– Root-Finding in One Dimension
– Multivariate Newton’s Method
– Exercises
See also:
GPU: A version of this lecture which makes use of jax to run the code on a GPU is available here
7.1 Overview
Many economic problems involve finding fixed points or zeros (sometimes called “roots”) of functions.
For example, in a simple supply and demand model, an equilibrium price is one that makes excess demand zero.
In other words, an equilibrium is a zero of the excess demand function.
There are various computational techniques for solving for fixed points and zeros.
In this lecture we study an important gradient-based technique called Newton’s method.
Newton’s method does not always work but, in situations where it does, convergence is often fast when compared to other
methods.
The lecture will apply Newton’s method in one-dimensional and multi-dimensional settings to solve fixed-point and zero-
finding problems.
• When finding the fixed point of a function 𝑓, Newton’s method updates an existing guess of the fixed point by
solving for the fixed point of a linear approximation to the function 𝑓.
• When finding the zero of a function 𝑓, Newton’s method updates an existing guess by solving for the zero of a
linear approximation to the function 𝑓.
To build intuition, we first consider an easy, one-dimensional fixed point problem where we know the solution and solve
it using both successive approximation and Newton’s method.
Then we apply Newton’s method to multi-dimensional settings to solve for market equilibria with multiple goods.
At the end of the lecture we leverage the power of automatic differentiation in autograd to solve a very high-dimensional
equilibrium problem.
!pip install autograd
We use the following imports in this lecture

import matplotlib.pyplot as plt
from collections import namedtuple
from scipy.optimize import root
from autograd import jacobian
# Thinly-wrapped numpy to enable automatic differentiation
import autograd.numpy as np
plt.rcParams["figure.figsize"] = (10, 5.7)
7.2 Fixed Point Computation Using Newton’s Method
In this section we solve the fixed point of the law of motion for capital in the setting of the Solow growth model.
We will inspect the fixed point visually, solve it by successive approximation, and then apply Newton’s method to achieve
faster convergence.
7.2.1 The Solow Model
In the Solow growth model, assuming Cobb-Douglas production technology and zero population growth, the law of motion
for capital is
$$k_{t+1} = g(k_t) \quad \text{where} \quad g(k) := s A k^{\alpha} + (1 - \delta) k \tag{7.1}$$
Here
• 𝑘𝑡 is capital stock per worker,
• 𝐴, 𝛼 > 0 are production parameters, 𝛼 < 1
• 𝑠 > 0 is a savings rate, and
• 𝛿 ∈ (0, 1) is a rate of depreciation
In this example, we wish to calculate the unique strictly positive fixed point of 𝑔, the law of motion for capital.
In other words, we seek a 𝑘∗ > 0 such that 𝑔(𝑘∗ ) = 𝑘∗ .
• such a 𝑘∗ is called a steady state, since 𝑘𝑡 = 𝑘∗ implies 𝑘𝑡+1 = 𝑘∗ .
Using pencil and paper to solve 𝑔(𝑘) = 𝑘, you will be able to confirm that
$$k^* = \left( \frac{sA}{\delta} \right)^{1/(1-\alpha)}$$
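The pencil-and-paper steps can be written out as follows. This short derivation uses only the law of motion (7.1) and the restriction 𝑘 > 0.

```latex
\begin{aligned}
g(k) = k
&\iff s A k^{\alpha} + (1 - \delta) k = k \\
&\iff s A k^{\alpha} = \delta k \\
&\iff k^{1 - \alpha} = \frac{s A}{\delta}
   \qquad (\text{dividing both sides by } \delta k^{\alpha} > 0) \\
&\iff k = \left( \frac{s A}{\delta} \right)^{1/(1-\alpha)}
\end{aligned}
```

Since every step is an equivalence for 𝑘 > 0, the positive fixed point is unique.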

7.2.2 Implementation
Let’s store our parameters in a namedtuple to help us keep our code clean and concise.
SolowParameters = namedtuple("SolowParameters", ('A', 's', 'α', 'δ'))
This function creates a suitable namedtuple with default parameter values.
def create_solow_params(A=2.0, s=0.3, α=0.3, δ=0.4):
    "Creates a Solow model parameterization with default values."
    return SolowParameters(A=A, s=s, α=α, δ=δ)
The next two functions implement the law of motion (7.1) and store the true fixed point 𝑘∗ .
def g(k, params):
    A, s, α, δ = params
    return A * s * k**α + (1 - δ) * k

def exact_fixed_point(params):
    # Unpack the parameters used in the closed-form steady state
    A, s, α, δ = params
    return ((s * A) / δ)**(1/(1 - α))
Here is a function to provide a 45 degree plot of the dynamics.
def plot_45(params, ax, fontsize=14):
    k_min, k_max = 0.0, 3.0
    k_grid = np.linspace(k_min, k_max, 1200)

    # Plot the functions
    lb = r"$g(k) = sAk^{\alpha} + (1 - \delta)k$"
    ax.plot(k_grid, g(k_grid, params), lw=2, alpha=0.6, label=lb)
    ax.plot(k_grid, k_grid, "k--", lw=1, alpha=0.7, label="45")

    # Show and annotate the fixed point
    kstar = exact_fixed_point(params)
    fps = (kstar,)
    ax.plot(fps, fps, "go", ms=10, alpha=0.6)
    ax.annotate(r"$k^* = (sA / \delta)^{\frac{1}{1-\alpha}}$",
                xy=(kstar, kstar),
                xycoords="data",
                xytext=(20, -20),
                textcoords="offset points",
                fontsize=fontsize)

    ax.legend(loc="upper left", frameon=False, fontsize=fontsize)
    ax.set_yticks((0, 1, 2, 3))
    ax.set_yticklabels((0.0, 1.0, 2.0, 3.0), fontsize=fontsize)
    ax.set_ylim(0, 3)
    ax.set_xlabel("$k_t$", fontsize=fontsize)
    ax.set_ylabel("$k_{t+1}$", fontsize=fontsize)
Let’s look at the 45 degree diagram for two parameterizations.

params = create_solow_params()
fig, ax = plt.subplots(figsize=(8, 8))
plot_45(params, ax)
plt.show()
params = create_solow_params(α=0.05, δ=0.5)
fig, ax = plt.subplots(figsize=(8, 8))
plot_45(params, ax)
plt.show()

We see that 𝑘∗ is indeed the unique positive fixed point.
Successive Approximation
First let’s compute the fixed point using successive approximation.

In this case, successive approximation means repeatedly updating capital from some initial state 𝑘0 using the law of
motion.
Here’s a time series from a particular choice of 𝑘0 .
def compute_iterates(k_0, f, params, n=25):
    "Compute time series of length n generated by arbitrary function f."
    k = k_0
    k_iterates = []
    for t in range(n):
        k_iterates.append(k)
        k = f(k, params)
    return k_iterates
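# Return to the default parameterization before simulating
params = create_solow_params()
k_0 = 0.25
k_series = compute_iterates(k_0, g, params)
k_star = exact_fixed_point(params)
fig, ax = plt.subplots()
ax.plot(k_series, 'o')
ax.plot([k_star] * len(k_series), 'k--')
ax.set_ylim(0, 3)
plt.show()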
Let’s see the output for a long time series.
k_series = compute_iterates(k_0, g, params, n=10_000)

k_star_approx = k_series[-1]
k_star_approx
1.7846741842265788
This is close to the true value.
k_star
1.7846741842265788

Newton’s Method
In general, when applying Newton’s fixed point method to some function 𝑔, we start with a guess 𝑥0 of the fixed point
and then update by solving for the fixed point of a tangent line at 𝑥0 .
To begin with, we recall that the first-order approximation of 𝑔 at 𝑥0 (i.e., the first order Taylor approximation of 𝑔 at 𝑥0 )
is the function
$$\hat g(x) \approx g(x_0) + g'(x_0)(x - x_0) \tag{7.2}$$
We solve for the fixed point of 𝑔 ̂ by calculating the 𝑥1 that solves
$$x_1 = \frac{g(x_0) - g'(x_0) x_0}{1 - g'(x_0)}$$
Generalising the process above, Newton’s fixed point method iterates on
$$x_{t+1} = \frac{g(x_t) - g'(x_t) x_t}{1 - g'(x_t)}, \quad x_0 \text{ given} \tag{7.3}$$
To implement Newton’s method we observe that the derivative of the law of motion for capital (7.1) is
$$g'(k) = \alpha s A k^{\alpha - 1} + (1 - \delta) \tag{7.4}$$
Let’s define this:
def Dg(k, params):
    A, s, α, δ = params
    return α * A * s * k**(α-1) + (1 - δ)
Here’s a function 𝑞 representing (7.3).
def q(k, params):
    return (g(k, params) - Dg(k, params) * k) / (1 - Dg(k, params))
Now let’s plot some trajectories.
def plot_trajectories(params,
                      k0_a=0.8,  # first initial condition
                      k0_b=3.1,  # second initial condition
                      n=20,      # length of time series
                      fs=14):    # fontsize
    fig, axes = plt.subplots(2, 1, figsize=(10, 6))
    ax1, ax2 = axes
    k_star = exact_fixed_point(params)

    ks1 = compute_iterates(k0_a, g, params, n)
    ax1.plot(ks1, "-o", label="successive approximation")
    ks2 = compute_iterates(k0_b, g, params, n)
    ax2.plot(ks2, "-o", label="successive approximation")
    ks3 = compute_iterates(k0_a, q, params, n)
    ax1.plot(ks3, "-o", label="newton steps")
    ks4 = compute_iterates(k0_b, q, params, n)
    ax2.plot(ks4, "-o", label="newton steps")

    for ax in axes:
        ax.plot(k_star * np.ones(n), "k--")
        ax.legend(fontsize=fs, frameon=False)
        ax.set_ylim(0.6, 3.2)
        ax.set_yticks((k_star,))
        ax.set_yticklabels(("$k^*$",), fontsize=fs)
        ax.set_xticks(np.linspace(0, 19, 20))
    plt.show()
plot_trajectories(params)
We can see that Newton’s method converges faster than successive approximation.
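To put a rough number on the speed difference, the following self-contained sketch counts how many updates each method needs before successive iterates differ by less than a tolerance. It uses the default parameter values from above; the helper `count_steps`, the tolerance, and the starting point 0.8 are assumptions made for this illustration.

```python
A, s, alpha, delta = 2.0, 0.3, 0.3, 0.4       # default Solow parameters

def g(k):
    "Law of motion for capital."
    return s * A * k**alpha + (1 - delta) * k

def Dg(k):
    "Derivative of the law of motion."
    return alpha * s * A * k**(alpha - 1) + (1 - delta)

def q(k):
    "One Newton fixed-point update."
    return (g(k) - Dg(k) * k) / (1 - Dg(k))

def count_steps(update, k0, tol=1e-10, max_iter=1000):
    "Hypothetical helper: apply `update` until successive iterates are within tol."
    k = k0
    for step in range(1, max_iter + 1):
        k_new = update(k)
        if abs(k_new - k) < tol:
            return step, k_new
        k = k_new
    raise RuntimeError("no convergence within max_iter")

steps_sa, k_sa = count_steps(g, 0.8)
steps_newton, k_newton = count_steps(q, 0.8)
print(f"successive approximation: {steps_sa} steps, Newton: {steps_newton} steps")
```

Successive approximation contracts the error by roughly the factor 𝑔′(𝑘∗) per step near the steady state, whereas Newton’s method converges at a faster-than-linear rate once it is close to 𝑘∗.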
7.3 Root-Finding in One Dimension
In the previous section we computed fixed points.

In fact Newton’s method is more commonly associated with the problem of finding zeros of functions.
Let’s discuss this “root-finding” problem and then show how it is connected to the problem of finding fixed points.

7.3.1 Newton’s Method for Zeros
Let’s suppose we want to find an 𝑥 such that 𝑓(𝑥) = 0 for some smooth function 𝑓 mapping real numbers to real numbers.
Suppose we have a guess 𝑥0 and we want to update it to a new point 𝑥1 .
As a first step, we take the first-order approximation of 𝑓 around 𝑥0 :
$$\hat f(x) \approx f(x_0) + f'(x_0)(x - x_0)$$
Now we solve for the zero of 𝑓 ̂.

In particular, we set $\hat f(x_1) = 0$ and solve for 𝑥1 to get
$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}, \quad x_0 \text{ given}$$
Generalizing the formula above, for one-dimensional zero-finding problems, Newton’s method iterates on
$$x_{t+1} = x_t - \frac{f(x_t)}{f'(x_t)}, \quad x_0 \text{ given} \tag{7.5}$$
The following code implements the iteration (7.5)
def newton(f, Df, x_0, tol=1e-7, max_iter=100_000):
    x = x_0

    # Implement the zero-finding formula
    def q(x):
        return x - f(x) / Df(x)

    error = tol + 1
    n = 0
    while error > tol:
        n += 1
        if(n > max_iter):
            raise Exception('Max iteration reached without convergence')
        y = q(x)
        error = np.abs(x - y)
        x = y
        print(f'iteration {n}, error = {error:.5f}')
    return x
Numerous libraries implement Newton’s method in one dimension, including SciPy, so the code is just for illustrative
purposes.
(That said, when we want to apply Newton’s method using techniques such as automatic differentiation or GPU acceler-
ation, it will be helpful to know how to implement Newton’s method ourselves.)
7.3.2 Application to Finding Fixed Points
Now consider again the Solow fixed-point calculation, where we solve for 𝑘 satisfying 𝑔(𝑘) = 𝑘.
We can convert this to a zero-finding problem by setting 𝑓(𝑥) ∶= 𝑔(𝑥) − 𝑥.
Any zero of 𝑓 is clearly a fixed point of 𝑔.
Let’s apply this idea to the Solow problem

k_star_approx_newton = newton(f=lambda x: g(x, params) - x,
Df=lambda x: Dg(x, params) - 1,
x_0=0.8)
iteration 1, error = 1.27209

k_star_approx_newton
1.7846741842265788
The result confirms the descent we saw in the graphs above: a very accurate result is reached with only 5 iterations.
7.4 Multivariate Newton’s Method
In this section, we introduce a two-good problem, present a visualization of the problem, and solve for the equilibrium of
the two-good market using both a zero finder in SciPy and Newton’s method.
We then expand the idea to a larger market with 3,000 goods and compare the performance of the two methods again.
We will see a significant performance gain when using Newton’s method.
7.4.1 A Two Goods Market Equilibrium
Let’s start by computing the market equilibrium of a two-good problem.

We consider a market for two related products, good 0 and good 1, with price vector 𝑝 = (𝑝0 , 𝑝1 ).
Supply of good 𝑖 at price 𝑝 is
$$q_i^s(p) = b_i \sqrt{p_i}$$
Demand of good 𝑖 at price 𝑝 is
$$q_i^d(p) = \exp(-(a_{i0} p_0 + a_{i1} p_1)) + c_i$$
Here 𝑐𝑖 , 𝑏𝑖 and 𝑎𝑖𝑗 are parameters.
For example, the two goods might be computer components that are typically used together, in which case they are
complements. Hence demand depends on the price of both components.
The excess demand function is,
𝑒𝑖 (𝑝) = 𝑞𝑖𝑑 (𝑝) − 𝑞𝑖𝑠 (𝑝), 𝑖 = 0, 1
An equilibrium price vector 𝑝∗ satisfies 𝑒𝑖 (𝑝∗ ) = 0.
We set
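$$A = \begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{pmatrix}, \quad b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} c_0 \\ c_1 \end{pmatrix}$$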
for this particular question.

A Graphical Exploration
Since our problem is only two-dimensional, we can use graphical analysis to visualize and help understand the problem.
Our first step is to define the excess demand function
$$e(p) = \begin{pmatrix} e_0(p) \\ e_1(p) \end{pmatrix}$$
The function below calculates the excess demand for given parameters
def e(p, A, b, c):
    return np.exp(- A @ p) + c - b * np.sqrt(p)
Our default parameter values will be
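$$A = \begin{pmatrix} 0.5 & 0.4 \\ 0.8 & 0.2 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$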
A = np.array([
[0.5, 0.4],
[0.8, 0.2]
])
b = np.ones(2)
c = np.ones(2)
At a price level of 𝑝 = (1, 0.5), the excess demand is
ex_demand = e((1.0, 0.5), A, b, c)
print(f'The excess demand for good 0 is {ex_demand[0]:.3f} \n'

f'The excess demand for good 1 is {ex_demand[1]:.3f}')
The excess demand for good 0 is 0.497

The excess demand for good 1 is 0.699
Next we plot the two functions 𝑒0 and 𝑒1 on a grid of (𝑝0 , 𝑝1 ) values, using contour surfaces and lines.
We will use the following function to build the contour plots
def plot_excess_demand(ax, good=0, grid_size=100, grid_max=4, surface=True):
    # Create a grid_size x grid_size grid of prices
    p_grid = np.linspace(0, grid_max, grid_size)
    z = np.empty((grid_size, grid_size))

    for i, p_1 in enumerate(p_grid):
        for j, p_2 in enumerate(p_grid):
            z[i, j] = e((p_1, p_2), A, b, c)[good]

    if surface:
        cs1 = ax.contourf(p_grid, p_grid, z.T, alpha=0.5)
        plt.colorbar(cs1, ax=ax, format="%.6f")

    ctr1 = ax.contour(p_grid, p_grid, z.T, levels=[0.0])
    ax.set_xlabel("$p_0$")
    ax.set_ylabel("$p_1$")
    ax.set_title(f'Excess Demand for Good {good}')
    plt.clabel(ctr1, inline=1, fontsize=13)
Here’s our plot of 𝑒0 :
fig, ax = plt.subplots()
plot_excess_demand(ax, good=0)
plt.show()
Here’s our plot of 𝑒1 :
fig, ax = plt.subplots()
plot_excess_demand(ax, good=1)
plt.show()

We see the black contour line of zero, which tells us when 𝑒𝑖 (𝑝) = 0.
For a price vector 𝑝 such that 𝑒𝑖 (𝑝) = 0 we know that good 𝑖 is in equilibrium (demand equals supply).
If these two contour lines cross at some price vector 𝑝∗ , then 𝑝∗ is an equilibrium price vector.
fig, ax = plt.subplots(figsize=(10, 5.7))
for good in (0, 1):
    plot_excess_demand(ax, good=good, surface=False)
plt.show()

It seems there is an equilibrium close to 𝑝 = (1.6, 1.5).
Using a Multidimensional Root Finder
To solve for 𝑝∗ more precisely, we use a zero-finding algorithm from scipy.optimize.

We supply 𝑝 = (1, 1) as our initial guess.
init_p = np.ones(2)
This uses the modified Powell method to find the zero
%%time
solution = root(lambda p: e(p, A, b, c), init_p, method='hybr')
CPU times: user 332 µs, sys: 0 ns, total: 332 µs

Wall time: 286 µs
Here’s the resulting value:
p = solution.x
p
array([1.57080182, 1.46928838])
This looks close to our guess from observing the figure. We can plug it back into 𝑒 to test that 𝑒(𝑝) ≈ 0:
np.max(np.abs(e(p, A, b, c)))

2.0383694732117874e-13
This is indeed a very small error.
Adding Gradient Information
In many cases, for zero-finding algorithms applied to smooth functions, supplying the Jacobian of the function leads to
better convergence properties.
Here we manually calculate the elements of the Jacobian
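$$J(p) = \begin{pmatrix} \frac{\partial e_0}{\partial p_0}(p) & \frac{\partial e_0}{\partial p_1}(p) \\ \frac{\partial e_1}{\partial p_0}(p) & \frac{\partial e_1}{\partial p_1}(p) \end{pmatrix}$$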
def jacobian_e(p, A, b, c):
    p_0, p_1 = p
    a_00, a_01 = A[0, :]
    a_10, a_11 = A[1, :]
    # Demand for good i depends on both prices via exp(-(a_i0 p_0 + a_i1 p_1))
    d_0 = np.exp(-(a_00 * p_0 + a_01 * p_1))
    d_1 = np.exp(-(a_10 * p_0 + a_11 * p_1))
    j_00 = -a_00 * d_0 - (b[0]/2) * p_0**(-1/2)
    j_01 = -a_01 * d_0
    j_10 = -a_10 * d_1
    j_11 = -a_11 * d_1 - (b[1]/2) * p_1**(-1/2)
    J = [[j_00, j_01],
         [j_10, j_11]]
    return np.array(J)
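One way to guard against algebra slips in a hand-coded Jacobian is to compare it with a finite-difference approximation. The sketch below does this for the two-good excess demand function at the price vector (1, 0.5); the vectorized form `jac_analytic`, the helper `jac_fd`, and the step size are assumptions introduced only for this check, and plain NumPy is used rather than autograd.

```python
import numpy as np

A = np.array([[0.5, 0.4],
              [0.8, 0.2]])
b = np.ones(2)
c = np.ones(2)

def e(p):
    "Excess demand for the two-good market."
    return np.exp(-A @ p) + c - b * np.sqrt(p)

def jac_analytic(p):
    # d e_i / d p_j = -a_ij * exp(-(A p)_i), minus (b_i / 2) p_i^{-1/2} on the diagonal
    return -A * np.exp(-A @ p)[:, None] - np.diag(b / (2 * np.sqrt(p)))

def jac_fd(p, h=1e-6):
    "Hypothetical helper: central finite-difference Jacobian of e at p."
    J = np.empty((p.size, p.size))
    for j in range(p.size):
        step = np.zeros(p.size)
        step[j] = h
        J[:, j] = (e(p + step) - e(p - step)) / (2 * h)
    return J

p = np.array([1.0, 0.5])
max_gap = np.max(np.abs(jac_analytic(p) - jac_fd(p)))
print(max_gap)
```

A gap on the order of the finite-difference error suggests that the analytic derivatives are coded consistently with the function itself.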
%%time
solution = root(lambda p: e(p, A, b, c),
init_p,
jac=lambda p: jacobian_e(p, A, b, c),
method='hybr')
CPU times: user 582 µs, sys: 46 µs, total: 628 µs

Wall time: 449 µs
Now the solution is even more accurate (although, in this low-dimensional problem, the difference is quite small):
p = solution.x
np.max(np.abs(e(p, A, b, c)))
1.3322676295501878e-15
Using Newton’s Method
Now let’s use Newton’s method to compute the equilibrium price using the multivariate version of Newton’s method
$$p_{n+1} = p_n - J_e(p_n)^{-1} e(p_n) \tag{7.6}$$
This is a multivariate version of (7.5).

(Here 𝐽𝑒 (𝑝𝑛 ) is the Jacobian of 𝑒 evaluated at 𝑝𝑛 .)

The iteration starts from some initial guess of the price vector 𝑝0 .
Here, instead of coding the Jacobian by hand, we use the jacobian() function in the autograd library to auto-differentiate and calculate the Jacobian.
With only slight modification, we can generalize our previous attempt to multi-dimensional problems.
def newton(f, x_0, tol=1e-5, max_iter=10):
    x = x_0
    q = lambda x: x - np.linalg.solve(jacobian(f)(x), f(x))
    error = tol + 1
    n = 0
    while error > tol:
        n+=1
        if(n > max_iter):
            raise Exception('Max iteration reached without convergence')
        y = q(x)
        if(any(np.isnan(y))):
            raise Exception('Solution not found with NaN generated')
        error = np.linalg.norm(x - y)
        x = y
        print(f'iteration {n}, error = {error:.5f}')
    print('\n' + f'Result = {x} \n')
    return x

def e(p, A, b, c):
    return np.exp(- np.dot(A, p)) + c - b * np.sqrt(p)
We find the algorithm terminates in 4 steps
%%time
p = newton(lambda p: e(p, A, b, c), init_p)

Result = [1.57080182 1.46928838]
CPU times: user 4.86 ms, sys: 394 µs, total: 5.25 ms
Wall time: 3.62 ms
np.max(np.abs(e(p, A, b, c)))
1.4632739464559563e-13
The result is very accurate.

With the larger overhead, the speed is not better than the optimized scipy function.

7.4.2 A High-Dimensional Problem
Our next step is to investigate a large market with 3,000 goods.

A JAX version of this section using GPU accelerated linear algebra and automatic differentiation is available here
The excess demand function is essentially the same, but now the matrix 𝐴 is 3000 × 3000 and the parameter vectors 𝑏
and 𝑐 are 3000 × 1.
dim = 3000
np.random.seed(123)
# Create a random matrix A and normalize the columns to sum to one

A = np.random.rand(dim, dim)
A = np.asarray(A)
s = np.sum(A, axis=0)
A = A / s
# Set up b and c
b = np.ones(dim)
c = np.ones(dim)
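Which sums end up equal to one depends on the `axis` argument and on broadcasting. The small check below, using a 5 × 5 matrix and a generator seed chosen purely for illustration, shows the effect of dividing by `np.sum(A, axis=0)` as in the setup code.

```python
import numpy as np

rng = np.random.default_rng(123)              # illustrative seed
A_small = rng.random((5, 5))                  # hypothetical small stand-in for A

# axis=0 sums over rows, giving one total per column;
# broadcasting then divides column j by its own total
A_small = A_small / np.sum(A_small, axis=0)

col_sums = A_small.sum(axis=0)
row_sums = A_small.sum(axis=1)
print(col_sums)
print(row_sums)
```

To normalize rows instead, one could divide by `np.sum(A_small, axis=1, keepdims=True)`.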
Here’s our initial condition
init_p = np.ones(dim)
%%time
p = newton(lambda p: e(p, A, b, c), init_p)
Result = [1.50185286 1.49865815 1.50028285 ... 1.50875149 1.48724784 1.48577532]
CPU times: user 2min 9s, sys: 1.76 s, total: 2min 11s
Wall time: 32.9 s
np.max(np.abs(e(p, A, b, c)))
6.661338147750939e-16
With the same tolerance, we compare the runtime and accuracy of Newton’s method to SciPy’s root function

%%time
solution = root(lambda p: e(p, A, b, c),
init_p,
jac=lambda p: jacobian(e)(p, A, b, c),
method='hybr',
tol=1e-5)
CPU times: user 1min 20s, sys: 618 ms, total: 1min 21s
Wall time: 42.7 s
p = solution.x
np.max(np.abs(e(p, A, b, c)))
8.295585953721485e-07
7.5 Exercises
Exercise 7.5.1
Consider a three-dimensional extension of the Solow fixed point problem with
$$A = \begin{pmatrix} 2 & 3 & 3 \\ 2 & 4 & 2 \\ 1 & 5 & 1 \end{pmatrix}, \quad s = 0.2, \quad \alpha = 0.5, \quad \delta = 0.8$$
As before the law of motion is
$$k_{t+1} = g(k_t) \quad \text{where} \quad g(k) := s A k^{\alpha} + (1 - \delta) k$$
However 𝑘𝑡 is now a 3 × 1 vector.

Solve for the fixed point using Newton’s method with the following initial values:
𝑘10 = (1, 1, 1)
𝑘20 = (3, 5, 5)
𝑘30 = (50, 50, 50)
Hint:
• The computation of the fixed point is equivalent to computing 𝑘∗ such that 𝑔(𝑘∗ ) − 𝑘∗ = 0.
• If you are unsure about your solution, you can start with the solved example:
$$A = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$
with 𝑠 = 0.3, 𝛼 = 0.3, and 𝛿 = 0.4 and starting value:
𝑘0 = (1, 1, 1)
The result should converge to the analytical solution.


Let’s first define the parameters for this problem
A = np.array([[2.0, 3.0, 3.0],

[2.0, 4.0, 2.0],
[1.0, 5.0, 1.0]])
s = 0.2
α = 0.5
δ = 0.8
initLs = [np.ones(3),
np.array([3.0, 5.0, 5.0]),
np.repeat(50.0, 3)]
Then define the multivariate version of the law of motion (7.1)
def multivariate_solow(k, A=A, s=s, α=α, δ=δ):
    return (s * np.dot(A, k**α) + (1 - δ) * k)
Let’s run through each starting value and see the output
attempt = 1
for init in initLs:
    print(f'Attempt {attempt}: Starting value is {init} \n')
    %time k = newton(lambda k: multivariate_solow(k) - k, \
                     init)
    print('-'*64)
    attempt += 1
Attempt 1: Starting value is [1. 1. 1.]

Result = [3.84058108 3.87071771 3.41091933]
CPU times: user 26.8 ms, sys: 137 µs, total: 27 ms

Wall time: 6.33 ms
----------------------------------------------------------------

Result = [3.84058108 3.87071771 3.41091933]


Wall time: 4.06 ms
----------------------------------------------------------------

Result = [3.84058108 3.87071771 3.41091933]
CPU times: user 25.5 ms, sys: 0 ns, total: 25.5 ms

Wall time: 5.97 ms
----------------------------------------------------------------
We find that the results are invariant to the starting values, which reflects how well behaved this problem is.
But the number of iterations it takes to converge depends on the starting values.
Let’s substitute the output back into the formula to check our last result
multivariate_solow(k) - k
array([ 0.0000000e+00, -4.4408921e-16, 8.8817842e-16])
Note the error is very small.

We can also test our results on the known solution
A = np.array([[2.0, 0.0, 0.0],

[0.0, 2.0, 0.0],
[0.0, 0.0, 2.0]])
s = 0.3
α = 0.3
δ = 0.4
init = np.repeat(1.0, 3)
%time k = newton(lambda k: multivariate_solow(k, A=A, s=s, α=α, δ=δ) - k, \

init)

Result = [1.78467418 1.78467418 1.78467418]
CPU times: user 14 ms, sys: 4.03 ms, total: 18 ms

Wall time: 4.22 ms

The result is very close to the ground truth but still slightly different.
%time k = newton(lambda k: multivariate_solow(k, A=A, s=s, α=α, δ=δ) - k, \
                 init,\
                 tol=1e-7)

Result = [1.78467418 1.78467418 1.78467418]

Wall time: 5.23 ms
We can see it steps towards a more accurate solution.
Exercise 7.5.2
In this exercise, let’s try different initial values and check how Newton’s method responds to different starting points.
Let’s define a three-good problem with the following default values:
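$$
A = \begin{pmatrix}
0.2 & 0.1 & 0.7 \\
0.3 & 0.2 & 0.5 \\
0.1 & 0.8 & 0.1
\end{pmatrix},
\quad
b = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}
\quad \text{and} \quad
c = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}
$$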
For this exercise, use the following extreme price vectors as initial values:
𝑝10 = (5, 5, 5)
𝑝20 = (1, 1, 1)
𝑝30 = (4.5, 0.1, 4)
Set the tolerance to 1e-15 for more accurate output.

Define parameters and initial values
A = np.array([
    [0.2, 0.1, 0.7],
    [0.3, 0.2, 0.5],
    [0.1, 0.8, 0.1]
])
b = np.array([1.0, 1.0, 1.0])
c = np.array([1.0, 1.0, 1.0])
initLs = [np.repeat(5.0, 3),
          np.ones(3),
          np.array([4.5, 0.1, 4.0])]
Let’s run through each initial guess and check the output
attempt = 1
for init in initLs:
    print(f'Attempt {attempt}: Starting value is {init} \n')
    %time p = newton(lambda p: e(p, A, b, c), \
                     init, \
                     tol=1e-15, \
                     max_iter=15)
    print('-'*64)
    attempt += 1
/opt/conda/envs/quantecon/lib/python3.11/site-packages/autograd/tracer.py:48: RuntimeWarning: invalid value encountered in sqrt
return f_raw(*args, **kwargs)

/opt/conda/envs/quantecon/lib/python3.11/site-packages/autograd/numpy/numpy_vjps.py:99: RuntimeWarning: invalid value encountered in power
defvjp(anp.sqrt, lambda ans, x : lambda g: g * 0.5 * x**-0.5)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
File <timed exec>:1
Cell In[34], line 12, in newton(f, x_0, tol, max_iter)

10 y = q(x)
11 if(any(np.isnan(y))):
---> 12 raise Exception('Solution not found with NaN generated')
13 error = np.linalg.norm(x - y)
14 x = y
Exception: Solution not found with NaN generated
----------------------------------------------------------------

Result = [1.49744442 1.49744442 1.49744442]

Wall time: 4.85 ms
----------------------------------------------------------------
Attempt 3: Starting value is [4.5 0.1 4. ]



Result = [1.49744442 1.49744442 1.49744442]

Wall time: 6.03 ms
----------------------------------------------------------------
We can see that Newton’s method may fail for some starting values.
Sometimes it may take a few initial guesses to achieve convergence.
Substitute the result back into the formula to check our result
e(p, A, b, c)
array([ 0.00000000e+00, 0.00000000e+00, -2.22044605e-16])
We can see the result is very accurate.
Part II
Elementary Statistics
CHAPTER
EIGHT
ELEMENTARY PROBABILITY WITH MATRICES
This lecture uses matrix algebra to illustrate some basic ideas about probability theory.
After providing somewhat informal definitions of the underlying objects, we’ll use matrices and vectors to describe
probability distributions.
Concepts that we’ll be studying include
• a joint probability distribution
• marginal distributions associated with a given joint distribution
• conditional probability distributions
• statistical independence of two random variables
• joint distributions associated with a prescribed set of marginal distributions
– couplings
– copulas
• the probability distribution of a sum of two independent random variables
– convolution of marginal distributions
• parameters that define a probability distribution
• sufficient statistics as data summaries
We’ll use a matrix to represent a bivariate probability distribution and a vector to represent a univariate probability distribution.
In addition to what’s in Anaconda, this lecture will need the following libraries:
!pip install prettytable
As usual, we’ll start with some imports
import numpy as np
import matplotlib.pyplot as plt
import prettytable as pt
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina')
8.1 Sketch of Basic Concepts
We’ll briefly define what we mean by a probability space, a probability measure, and a random variable.
For most of this lecture, we sweep these objects into the background, but they are there underlying the other objects that
we’ll mainly focus on.
Let Ω be a set of possible underlying outcomes and let 𝜔 ∈ Ω be a particular underlying outcome.
Let 𝒢 ⊂ Ω be a subset of Ω.
Let ℱ be a collection of such subsets 𝒢 ⊂ Ω.
The pair Ω, ℱ forms our probability space on which we want to put a probability measure.
A probability measure 𝜇 maps a set of possible underlying outcomes 𝒢 ∈ ℱ into a scalar number between 0 and 1
• this is the “probability” that 𝑋 belongs to 𝐴, denoted by Prob{𝑋 ∈ 𝐴}.
A random variable 𝑋(𝜔) is a function of the underlying outcome 𝜔 ∈ Ω.
The random variable 𝑋(𝜔) has a probability distribution that is induced by the underlying probability measure 𝜇 and
the function 𝑋(𝜔):
$$
\textrm{Prob}(X \in A) = \int_{\mathcal{G}} \mu(\omega) \, d\omega \tag{8.1}
$$
where 𝒢 is the subset of Ω for which 𝑋(𝜔) ∈ 𝐴.

We call this the induced probability distribution of random variable 𝑋.
8.2 What Does Probability Mean?
Before diving in, we’ll say a few words about what probability theory means and how it connects to statistics.
We also touch on these topics in the quantecon lectures https://python.quantecon.org/prob_meaning.html and https://python.quantecon.org/navy_captain.html.
For much of this lecture we’ll be discussing fixed “population” probabilities.
These are purely mathematical objects.
To appreciate how statisticians connect probabilities to data, the key is to understand the following concepts:
• A single draw from a probability distribution
• Repeated independently and identically distributed (i.i.d.) draws of “samples” or “realizations” from the same
probability distribution
• A statistic defined as a function of a sequence of samples
• An empirical distribution or histogram (a binned empirical distribution) that records observed relative frequencies
• The idea that a population probability distribution is what we anticipate relative frequencies will be in a long
sequence of i.i.d. draws. Here the following mathematical machinery makes precise what is meant by anticipated
relative frequencies
– Law of Large Numbers (LLN)
– Central Limit Theorem (CLT)
Scalar example
Let 𝑋 be a scalar random variable that takes on the 𝐼 possible values 0, 1, 2, … , 𝐼 − 1 with probabilities

$$
\textrm{Prob}(X = i) = f_i,
$$

where

$$
f_i \geqslant 0, \quad \sum_i f_i = 1.
$$

We sometimes write

$$
X \sim \{f_i\}_{i=0}^{I-1}
$$

as a short-hand way of saying that the random variable 𝑋 is described by the probability distribution $\{f_i\}_{i=0}^{I-1}$.
Consider drawing a sample $x_0, x_1, \ldots, x_{N-1}$ of 𝑁 independent and identically distributed draws of 𝑋.

What do “identical” and “independent” mean in IID or iid (“identically and independently distributed”)?
• “identical” means that each draw is from the same distribution.
• “independent” means that the joint distribution equals the product of the marginal distributions, i.e.,

$$
\begin{aligned}
\textrm{Prob}\{x_0 = i_0, x_1 = i_1, \ldots, x_{N-1} = i_{N-1}\} &= \textrm{Prob}\{x_0 = i_0\} \cdot \cdots \cdot \textrm{Prob}\{x_{N-1} = i_{N-1}\} \\
&= f_{i_0} f_{i_1} \cdot \cdots \cdot f_{i_{N-1}}
\end{aligned}
$$

We define an empirical distribution as follows.
For each 𝑖 = 0, … , 𝐼 − 1, let

$$
\begin{aligned}
N_i &= \text{number of times } X = i, \\
N &= \sum_{i=0}^{I-1} N_i \quad \text{total number of draws}, \\
\tilde{f}_i &= \frac{N_i}{N} \sim \text{frequency of draws for which } X = i
\end{aligned}
$$

Key ideas that justify connecting probability theory with statistics are laws of large numbers and central limit theorems.
LLN:
• A Law of Large Numbers (LLN) states that 𝑓𝑖̃ → 𝑓𝑖 as 𝑁 → ∞
CLT:
• A Central Limit Theorem (CLT) describes a rate at which 𝑓𝑖̃ → 𝑓𝑖
Remarks
• For “frequentist” statisticians, anticipated relative frequency is all that a probability distribution means.
• But for a Bayesian it means something more or different.
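To make the LLN statement concrete, here is a minimal sketch that draws 𝑁 i.i.d. samples and compares the empirical frequencies with the population probabilities. The three-point distribution `f`, the seed, and the sample sizes are illustrative assumptions, not values taken from this lecture.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen only for reproducibility
f = np.array([0.2, 0.5, 0.3])         # assumed population probabilities f_i

def empirical_freq(N):
    """Return the relative frequencies N_i / N from N i.i.d. draws of X."""
    draws = rng.choice(len(f), size=N, p=f)
    counts = np.bincount(draws, minlength=len(f))   # N_i for each value i
    return counts / N

for N in (100, 10_000, 1_000_000):
    f_tilde = empirical_freq(N)
    # the largest gap between empirical and population frequencies
    print(N, f_tilde, np.max(np.abs(f_tilde - f)))
```

The gap printed in the last column should tend to shrink as 𝑁 grows, which is what the LLN leads us to anticipate.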
8.3 Representing Probability Distributions
A probability distribution Prob(𝑋 ∈ 𝐴) can be described by its cumulative distribution function (CDF)
𝐹𝑋 (𝑥) = Prob{𝑋 ≤ 𝑥}.
Sometimes, but not always, a random variable can also be described by a density function 𝑓(𝑥) that is related to its CDF by

$$
\textrm{Prob}\{X \in B\} = \int_{t \in B} f(t) \, dt
$$

$$
F(x) = \int_{-\infty}^{x} f(t) \, dt
$$
Here 𝐵 is a set of possible 𝑋’s whose probability we want to compute.
When a probability density exists, a probability distribution can be characterized either by its CDF or by its density.
For a discrete-valued random variable
• the number of possible values of 𝑋 is finite or countably infinite
• we replace a density with a probability mass function, a non-negative sequence that sums to one
• we replace integration with summation in formulas like (8.1) that relate a CDF to a probability mass function
In this lecture, we mostly discuss discrete random variables.
Doing this enables us to confine our tool set basically to linear algebra.
Later we’ll briefly discuss how to approximate a continuous random variable with a discrete random variable.
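For a discrete random variable, the two ways of describing a distribution are linked by a cumulative sum. The sketch below illustrates this with an assumed four-point probability mass function; the vector `f` and the event `B` are hypothetical choices made only for illustration.

```python
import numpy as np

f = np.array([0.1, 0.4, 0.3, 0.2])   # assumed pmf on the values 0, 1, 2, 3
F = np.cumsum(f)                      # F[i] = Prob{X <= i}

# probability of the event B = {1, 2}: sum the pmf over B
B = [1, 2]
prob_B = f[B].sum()

print(F)
print(prob_B)
```

Here summation over the elements of `B` plays the role that integration over 𝐵 plays for a density.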
8.4 Univariate Probability Distributions
We’ll devote most of this lecture to discrete-valued random variables, but we’ll say a few things about continuous-valued
random variables.
8.4.1 Discrete random variable
Let 𝑋 be a discrete random variable that takes possible values 𝑖 = 0, 1, … , 𝐼 − 1 = $\bar{X}$.

Here, we choose the maximum index 𝐼 − 1 because of how this aligns nicely with Python’s index convention.
Define $f_i \equiv \textrm{Prob}\{X = i\}$ and assemble the non-negative vector

$$
f = \begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_{I-1} \end{bmatrix} \tag{8.2}
$$

for which $f_i \in [0, 1]$ for each $i$ and $\sum_{i=0}^{I-1} f_i = 1$.
This vector defines a probability mass function.
The distribution (8.2) has parameters $\{f_i\}_{i=0,1,\cdots,I-2}$ since $f_{I-1} = 1 - \sum_{i=0}^{I-2} f_i$.
These parameters pin down the shape of the distribution.
(Sometimes 𝐼 = ∞.)
Such a “non-parametric” distribution has as many “parameters” as there are possible values of the random variable.
We often work with special distributions that are characterized by a small number of parameters.

In these special parametric distributions,
𝑓𝑖 = 𝑔(𝑖; 𝜃)
where 𝜃 is a vector of parameters that is of much smaller dimension than 𝐼.

Remarks:
• The concept of parameter is intimately related to the notion of sufficient statistic.
• Sufficient statistics are nonlinear functions of a data set.
• Sufficient statistics are designed to summarize all information about parameters that is contained in a data set.
• They are important tools that AI uses to summarize a big data set.
• R. A. Fisher provided a rigorous definition of information – see https://en.wikipedia.org/wiki/Fisher_information
An example of a parametric probability distribution is a geometric distribution.
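It is described by

$$
f_i = \textrm{Prob}\{X = i\} = (1 - \lambda)\lambda^i, \quad \lambda \in (0, 1), \quad i = 0, 1, 2, \ldots
$$

Evidently, $\sum_{i=0}^{\infty} f_i = 1$.
Let 𝜃 be a vector of parameters of the distribution described by 𝑓, then

$$
f_i(\theta) \geq 0, \quad \sum_{i=0}^{\infty} f_i(\theta) = 1
$$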
8.4.2 Continuous random variable
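Let 𝑋 be a continuous random variable that takes values $X \in \tilde{X} \equiv [X_U, X_L]$ and whose distribution has parameters 𝜃.

$$
\textrm{Prob}\{X \in A\} = \int_{x \in A} f(x; \theta) \, dx; \quad f(x; \theta) \geq 0
$$

where 𝐴 is a subset of $\tilde{X}$ and

$$
\textrm{Prob}\{X \in \tilde{X}\} = 1
$$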
8.5 Bivariate Probability Distributions
We’ll now discuss a bivariate joint distribution.

To begin, we restrict ourselves to two discrete random variables.
Let 𝑋, 𝑌 be two discrete random variables that take values:
𝑋 ∈ {0, … , 𝐼 − 1}
𝑌 ∈ {0, … , 𝐽 − 1}
Then their joint distribution is described by a matrix

$$
F_{I \times J} = [f_{ij}]_{i \in \{0, \ldots, I-1\}, \, j \in \{0, \ldots, J-1\}}
$$

whose elements are

$$
f_{ij} = \textrm{Prob}\{X = i, Y = j\} \geq 0
$$

where

$$
\sum_i \sum_j f_{ij} = 1
$$
8.6 Marginal Probability Distributions
The joint distribution induces marginal distributions

$$
\textrm{Prob}\{X = i\} = \sum_{j=0}^{J-1} f_{ij} = \mu_i, \quad i = 0, \ldots, I-1
$$

$$
\textrm{Prob}\{Y = j\} = \sum_{i=0}^{I-1} f_{ij} = \nu_j, \quad j = 0, \ldots, J-1
$$
For example, let a joint distribution over (𝑋, 𝑌 ) be
$$
F = \begin{bmatrix} .25 & .1 \\ .15 & .5 \end{bmatrix} \tag{8.3}
$$
The implied marginal distributions are:
Prob{𝑋 = 0} = .25 + .1 = .35

Prob{𝑋 = 1} = .15 + .5 = .65
Prob{𝑌 = 0} = .25 + .15 = .4
Prob{𝑌 = 1} = .1 + .5 = .6
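The same marginals can be read off the matrix with row and column sums. The following is a small sketch for the joint distribution (8.3); the variable names `mu` and `nu` are our own labels for the vectors of marginal probabilities.

```python
import numpy as np

F = np.array([[0.25, 0.10],
              [0.15, 0.50]])   # joint distribution (8.3)

mu = F.sum(axis=1)   # Prob{X = i}: sum over j, i.e., across each row
nu = F.sum(axis=0)   # Prob{Y = j}: sum over i, i.e., down each column

print(mu)
print(nu)
```

Equivalently, `mu` is the matrix-vector product of `F` with a vector of ones, and `nu` is the product of the transpose of `F` with a vector of ones.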
Digression: If two random variables 𝑋, 𝑌 are continuous and have joint density 𝑓(𝑥, 𝑦), then marginal distributions can be computed by

$$
f(x) = \int_{\mathbb{R}} f(x, y) \, dy
$$

$$
f(y) = \int_{\mathbb{R}} f(x, y) \, dx
$$
8.7 Conditional Probability Distributions
Conditional probabilities are defined according to
$$
\textrm{Prob}\{A \mid B\} = \frac{\textrm{Prob}\{A \cap B\}}{\textrm{Prob}\{B\}}
$$
where 𝐴, 𝐵 are two events.

For a pair of discrete random variables, we have the conditional distribution

$$
\textrm{Prob}\{X = i \mid Y = j\} = \frac{f_{ij}}{\sum_i f_{ij}} = \frac{\textrm{Prob}\{X = i, Y = j\}}{\textrm{Prob}\{Y = j\}}
$$

where 𝑖 = 0, … , 𝐼 − 1, 𝑗 = 0, … , 𝐽 − 1.
Note that

$$
\sum_i \textrm{Prob}\{X = i \mid Y = j\} = \frac{\sum_i f_{ij}}{\sum_i f_{ij}} = 1
$$
Remark: The mathematics of conditional probability implies Bayes’ Law:

$$
\textrm{Prob}\{X = i \mid Y = j\} = \frac{\textrm{Prob}\{X = i, Y = j\}}{\textrm{Prob}\{Y = j\}} = \frac{\textrm{Prob}\{Y = j \mid X = i\} \textrm{Prob}\{X = i\}}{\textrm{Prob}\{Y = j\}}
$$
For the joint distribution (8.3),

$$
\textrm{Prob}\{X = 0 \mid Y = 1\} = \frac{.1}{.1 + .5} = \frac{.1}{.6}
$$
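In matrix terms, conditioning on 𝑌 = 𝑗 amounts to dividing column 𝑗 of the joint matrix by its column sum. Here is a minimal sketch for the joint distribution (8.3); the array names `X_given_Y` and `Y_given_X` are our own.

```python
import numpy as np

F = np.array([[0.25, 0.10],
              [0.15, 0.50]])   # joint distribution (8.3)

# entry (i, j) is Prob{X = i | Y = j}: each column divided by Prob{Y = j}
X_given_Y = F / F.sum(axis=0, keepdims=True)
# entry (i, j) is Prob{Y = j | X = i}: each row divided by Prob{X = i}
Y_given_X = F / F.sum(axis=1, keepdims=True)

print(X_given_Y[0, 1])   # Prob{X = 0 | Y = 1}

# Bayes' Law: Prob{Y = 1 | X = 0} Prob{X = 0} / Prob{Y = 1}
bayes = Y_given_X[0, 1] * F.sum(axis=1)[0] / F.sum(axis=0)[1]
print(bayes)
```

Both printed numbers should equal .1/.6, and each column of `X_given_Y` sums to one.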
8.8 Statistical Independence
Random variables X and Y are statistically independent if
Prob{𝑋 = 𝑖, 𝑌 = 𝑗} = 𝑓𝑖 𝑔𝑗
where

$$
\begin{aligned}
\textrm{Prob}\{X = i\} &= f_i \geq 0, \quad \sum_i f_i = 1 \\
\textrm{Prob}\{Y = j\} &= g_j \geq 0, \quad \sum_j g_j = 1
\end{aligned}
$$
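Conditional distributions are

$$
\begin{aligned}
\textrm{Prob}\{X = i \mid Y = j\} &= \frac{f_i g_j}{\sum_i f_i g_j} = \frac{f_i g_j}{g_j} = f_i \\
\textrm{Prob}\{Y = j \mid X = i\} &= \frac{f_i g_j}{\sum_j f_i g_j} = \frac{f_i g_j}{f_i} = g_j
\end{aligned}
$$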
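Under independence the joint matrix is the outer product of the two marginal vectors, which suggests a simple numerical check. The sketch below uses assumed marginals `f` and `g`, and a hypothetical helper `is_independent` that is not part of this lecture’s code.

```python
import numpy as np

f = np.array([0.3, 0.7])          # assumed marginal distribution of X
g = np.array([0.2, 0.5, 0.3])     # assumed marginal distribution of Y

F_indep = np.outer(f, g)          # entry (i, j) is f_i * g_j

def is_independent(F, tol=1e-12):
    """Check whether F equals the outer product of its own marginals."""
    return np.allclose(F, np.outer(F.sum(axis=1), F.sum(axis=0)), atol=tol)

F_dep = np.array([[0.25, 0.10],
                  [0.15, 0.50]])  # joint distribution (8.3)

print(is_independent(F_indep))
print(is_independent(F_dep))
```

The joint distribution (8.3) fails the check because, for example, .25 differs from .35 × .4 = .14.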
8.9 Means and Variances
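The mean and variance of a discrete random variable 𝑋 are

$$
\begin{aligned}
\mu_X & \equiv \mathbb{E}[X] = \sum_k k \, \textrm{Prob}\{X = k\} \\
\sigma_X^2 & \equiv \mathbb{D}[X] = \sum_k (k - \mathbb{E}[X])^2 \, \textrm{Prob}\{X = k\}
\end{aligned}
$$

A continuous random variable having density $f_X(x)$ has mean and variance

$$
\begin{aligned}
\mu_X & \equiv \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx \\
\sigma_X^2 & \equiv \mathbb{D}[X] = \mathbb{E}\left[(X - \mu_X)^2\right] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x) \, dx
\end{aligned}
$$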
8.10 Generating Random Numbers
Suppose we have at our disposal a pseudo random number generator that draws a uniform random variable, i.e., one with probability distribution

$$
\textrm{Prob}\{\tilde{X} = i\} = \frac{1}{I}, \quad i = 0, \ldots, I-1
$$

How can we transform $\tilde{X}$ to get a random variable 𝑋 for which $\textrm{Prob}\{X = i\} = f_i$, 𝑖 = 0, … , 𝐼 − 1, where $f_i$ is an arbitrary discrete probability distribution on 𝑖 = 0, 1, … , 𝐼 − 1?
The key tool is the inverse of a cumulative distribution function (CDF).
Observe that the CDF of a distribution is monotone and non-decreasing, taking values between 0 and 1.
We can draw a sample of a random variable 𝑋 with a known CDF as follows:
• draw a random variable 𝑢 from a uniform distribution on [0, 1]
• pass the sample value of 𝑢 into the “inverse” target CDF for 𝑋
• 𝑋 has the target CDF
Thus, knowing the “inverse” CDF of a distribution is enough to simulate from this distribution.
Note: The “inverse” CDF needs to exist for this method to work.
The inverse CDF is

$$
F^{-1}(u) \equiv \inf\{x \in \mathbb{R} : F(x) \geq u\} \quad (0 < u < 1)
$$
Here we use infimum because a CDF is a non-decreasing and right-continuous function.

Thus, suppose that
• 𝑈 is a uniform random variable 𝑈 ∈ [0, 1]
• We want to sample a random variable 𝑋 whose CDF is 𝐹 .
It turns out that if we draw uniform random numbers 𝑈 and then compute 𝑋 from

$$
X = F^{-1}(U),
$$

then 𝑋 is a random variable with CDF $F_X(x) = F(x) = \textrm{Prob}\{X \leq x\}$.

We’ll verify this in the special case in which 𝐹 is continuous and bijective so that its inverse function exists and can be
denoted by 𝐹 −1 .
Note that

$$
\begin{aligned}
F_X(x) & = \textrm{Prob}\{X \leq x\} \\
& = \textrm{Prob}\{F^{-1}(U) \leq x\} \\
& = \textrm{Prob}\{U \leq F(x)\} \\
& = F(x)
\end{aligned}
$$
where the last equality occurs because 𝑈 is distributed uniformly on [0, 1] while 𝐹 (𝑥) is a constant given 𝑥 that also lies
on [0, 1].
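For an arbitrary discrete distribution, the generalized inverse $F^{-1}(u) = \inf\{x : F(x) \geq u\}$ picks the smallest index whose cumulative probability is at least 𝑢. The sketch below implements this with `np.searchsorted`; the target vector `f`, the seed, and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen only for reproducibility
f = np.array([0.1, 0.4, 0.3, 0.2])       # assumed target pmf on 0, ..., I-1
F = np.cumsum(f)                         # CDF values F_0, ..., F_{I-1}
F[-1] = 1.0                              # guard against rounding in the last entry

u = rng.random(1_000_000)                # uniform draws on [0, 1)
# generalized inverse CDF: smallest index i with F_i >= u
x = np.searchsorted(F, u, side='left')

# empirical frequencies of the transformed draws
f_hat = np.bincount(x, minlength=len(f)) / len(x)
print(f_hat)
```

The empirical frequencies `f_hat` should be close to the target probabilities `f`.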
Let’s use numpy to compute some examples.
Example: A continuous geometric (exponential) distribution

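Let 𝑋 follow an exponential distribution with parameter 𝜆 > 0.

Its density function is

$$
f(x) = \lambda e^{-\lambda x}
$$

Its CDF is

$$
F(x) = \int_0^x \lambda e^{-\lambda t} \, dt = 1 - e^{-\lambda x}
$$

Let 𝑈 follow a uniform distribution on [0, 1].

𝑋 is a random variable such that 𝑈 = 𝐹(𝑋).
The distribution of 𝑋 can be deduced from

$$
\begin{aligned}
U & = F(X) = 1 - e^{-\lambda X} \\
\implies & 1 - U = e^{-\lambda X} \\
\implies & \log(1 - U) = -\lambda X \\
\implies & X = \frac{\log(1 - U)}{-\lambda}
\end{aligned}
$$

Let’s draw 𝑢 from 𝑈[0, 1] and calculate $x = \frac{\log(1-u)}{-\lambda}$.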
We’ll check whether 𝑋 seems to follow a continuous geometric (exponential) distribution.
Let’s check with numpy.
n, λ = 1_000_000, 0.3
# draw uniform numbers
u = np.random.rand(n)
# transform with the inverse CDF
x = -np.log(1-u)/λ
# draw exponential samples directly for comparison
x_g = np.random.exponential(1 / λ, n)
# plot and compare
plt.hist(x, bins=100, density=True)
plt.show()
plt.hist(x_g, bins=100, density=True, alpha=0.6)
plt.show()

Geometric distribution
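Let 𝑋 be distributed geometrically, that is

$$
\textrm{Prob}(X = i) = (1 - \lambda)\lambda^i, \quad \lambda \in (0, 1), \quad i = 0, 1, \ldots
$$

$$
\sum_{i=0}^{\infty} \textrm{Prob}(X = i) = 1 \longleftrightarrow (1 - \lambda) \sum_{i=0}^{\infty} \lambda^i = \frac{1 - \lambda}{1 - \lambda} = 1
$$

Its CDF is given by

$$
\begin{aligned}
\textrm{Prob}(X \leq i) & = (1 - \lambda) \sum_{j=0}^{i} \lambda^j \\
& = (1 - \lambda) \left[ \frac{1 - \lambda^{i+1}}{1 - \lambda} \right] \\
& = 1 - \lambda^{i+1} \\
& = F(X) = F_i
\end{aligned}
$$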
Again, let 𝑈̃ follow a uniform distribution and we want to find 𝑋 such that 𝐹 (𝑋) = 𝑈̃ .
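Let’s deduce the distribution of 𝑋 from

$$
\begin{aligned}
\tilde{U} & = F(X) = 1 - \lambda^{x+1} \\
1 - \tilde{U} & = \lambda^{x+1} \\
\log(1 - \tilde{U}) & = (x + 1) \log \lambda \\
\frac{\log(1 - \tilde{U})}{\log \lambda} & = x + 1 \\
\frac{\log(1 - \tilde{U})}{\log \lambda} - 1 & = x
\end{aligned}
$$

However, the value of 𝑥 computed this way may not be an integer.

So let

$$
x = \left\lceil \frac{\log(1 - \tilde{U})}{\log \lambda} - 1 \right\rceil
$$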
where ⌈.⌉ is the ceiling function.

Thus 𝑥 is the smallest integer such that the discrete geometric CDF is greater than or equal to 𝑈̃ .
We can verify that 𝑥 is indeed geometrically distributed by the following numpy program.
Note: The exponential distribution is the continuous analog of geometric distribution.
n, λ = 1_000_000, 0.8
# draw uniform numbers

u = np.random.rand(n)
# transform
x = np.ceil(np.log(1-u)/np.log(λ) - 1)
# draw geometric samples with numpy for comparison
# (np.random.geometric is supported on 1, 2, ..., one higher than x above)
x_g = np.random.geometric(1-λ, n)
# plot and compare

plt.hist(x, bins=150, density=True)
plt.show()
np.random.geometric(1-λ, n).max()
64
np.log(0.4)/np.log(0.3)
0.7610560044063083
plt.hist(x_g, bins=150, density=True, alpha=0.6)

plt.show()

8.11 Some Discrete Probability Distributions
Let’s write some Python code to compute means and variances of some univariate random variables.
We’ll use our code to
• compute population means and variances from the probability distribution
• generate a sample of 𝑁 independently and identically distributed draws and compute sample means and variances
• compare population and sample means and variances
8.12 Geometric distribution
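$$
\textrm{Prob}(X = k) = (1 - p)^{k-1} p, \quad k = 1, 2, \ldots
$$

⟹

$$
\begin{aligned}
\mathbb{E}(X) & = \frac{1}{p} \\
\mathbb{D}(X) & = \frac{1 - p}{p^2}
\end{aligned}
$$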
We draw observations from the distribution and compare the sample mean and variance with the theoretical results.
# specify parameters
p, n = 0.3, 1_000_000
# draw observations from the distribution

x = np.random.geometric(p, n)
# compute sample mean and variance

μ_hat = np.mean(x)
σ2_hat = np.var(x)
print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
# compare with theoretical results

print("\nThe population mean is: ", 1/p)
print("The population variance is: ", (1-p)/(p**2))
The sample mean is: 3.33521

The sample variance is: 7.793688255900004
The population mean is: 3.3333333333333335

The population variance is: 7.777777777777778
8.12.1 Newcomb–Benford distribution
The Newcomb–Benford law fits many data sets, e.g., reports of incomes to tax authorities, in which the leading digit is
more likely to be small than large.
See https://en.wikipedia.org/wiki/Benford’s_law
A Benford probability distribution is

$$
\textrm{Prob}\{X = d\} = \log_{10}(d + 1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)
$$

where 𝑑 ∈ {1, 2, ⋯ , 9} can be thought of as a first digit in a sequence of digits.
This is a well defined discrete distribution since we can verify that probabilities are nonnegative and sum to 1.

$$
\log_{10}\left(1 + \frac{1}{d}\right) \geq 0, \quad \sum_{d=1}^{9} \log_{10}\left(1 + \frac{1}{d}\right) = 1
$$

The mean and variance of a Benford distribution are

$$
\begin{aligned}
\mathbb{E}[X] & = \sum_{d=1}^{9} d \log_{10}\left(1 + \frac{1}{d}\right) \simeq 3.4402 \\
\mathbb{V}[X] & = \sum_{d=1}^{9} (d - \mathbb{E}[X])^2 \log_{10}\left(1 + \frac{1}{d}\right) \simeq 6.0565
\end{aligned}
$$
We verify the above and compute the mean and variance using numpy.
Benford_pmf = np.array([np.log10(1+1/d) for d in range(1,10)])

k = np.array(range(1,10))
# mean


mean = np.sum(Benford_pmf * k)
# variance
var = np.sum([(k-mean)**2 * Benford_pmf])
# verify sum to 1
print(np.sum(Benford_pmf))
print(mean)
print(var)
0.9999999999999999
3.440236967123206
6.056512631375667
# plot distribution
plt.plot(range(1,10), Benford_pmf, 'o')
plt.title('Benford\'s distribution')
plt.show()
8.12.2 Pascal (negative binomial) distribution
Consider a sequence of independent Bernoulli trials.

Let 𝑝 be the probability of success.
Let 𝑋 be a random variable that represents the number of failures before we get 𝑟 successes.
Its distribution is

$$
\begin{aligned}
X & \sim NB(r, p) \\
\textrm{Prob}(X = k; r, p) & = \binom{k + r - 1}{r - 1} p^r (1 - p)^k
\end{aligned}
$$

Here, we choose from among 𝑘 + 𝑟 − 1 possible outcomes because the last draw is by definition a success.
We compute the mean and variance to be

$$
\begin{aligned}
\mathbb{E}(X) & = \frac{r(1 - p)}{p} \\
\mathbb{V}(X) & = \frac{r(1 - p)}{p^2}
\end{aligned}
$$
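# specify parameters
r, p, n = 10, 0.3, 1_000_000
# draw observations from the distribution
x = np.random.negative_binomial(r, p, n)
# compute sample mean and variance
μ_hat = np.mean(x)
σ2_hat = np.var(x)
print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
print("\nThe population mean is: ", r*(1-p)/p)
print("The population variance is: ", r*(1-p)/p**2)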
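As a cross-check that does not rely on simulation, we can sum the negative binomial probabilities directly. The sketch below truncates the infinite support at an assumed cutoff `K`, beyond which the remaining probability mass is negligible for these parameter values; `K` and the variable names are our own choices.

```python
import numpy as np
from math import comb

r, p = 10, 0.3        # same parameter values as in the simulation above
K = 400               # assumed truncation point for the infinite support

k = np.arange(K)
# Prob(X = k; r, p) = C(k + r - 1, r - 1) p^r (1 - p)^k
pmf = np.array([comb(int(j) + r - 1, r - 1) * p**r * (1 - p)**int(j) for j in k])

total = pmf.sum()                   # should be very close to 1
mean = k @ pmf                      # sum_k k Prob(X = k)
var = ((k - mean)**2) @ pmf         # sum_k (k - mean)^2 Prob(X = k)

print(total, mean, var)
print(r*(1-p)/p, r*(1-p)/p**2)      # closed-form mean and variance
```

The two printed lines should agree closely on the mean and the variance.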


8.13 Continuous Random Variables
8.13.1 Univariate Gaussian distribution
We write

$$
X \sim N(\mu, \sigma^2)
$$

to indicate the probability distribution

$$
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right]
$$

In the example below, we set 𝜇 = 0, 𝜎 = 0.1.

μ, σ = 0, 0.1
# specify number of draws

n = 1_000_000

x = np.random.normal(μ, σ, n)

μ_hat = np.mean(x)
σ_hat = np.std(x)
print("The sample mean is: ", μ_hat)

print("The sample standard deviation is: ", σ_hat)
The sample mean is: -2.6670604920545608e-05

The sample standard deviation is: 0.10000647799603994
# compare, using absolute deviations so that large negative errors are not missed
print(abs(μ-μ_hat) < 1e-3)
print(abs(σ-σ_hat) < 1e-3)
True
True
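8.13.2 Uniform Distribution

$$
\begin{aligned}
X & \sim U[a, b] \\
f(x) & = \begin{cases} \frac{1}{b - a}, & a \leq x \leq b \\ 0, & \text{otherwise} \end{cases}
\end{aligned}
$$

The population mean and variance are

$$
\begin{aligned}
\mathbb{E}(X) & = \frac{a + b}{2} \\
\mathbb{V}(X) & = \frac{(b - a)^2}{12}
\end{aligned}
$$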
a, b = 10, 20
# specify number of draws

n = 1_000_000

x = a + (b-a)*np.random.rand(n)

μ_hat = np.mean(x)
σ2_hat = np.var(x)
print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
print("\nThe population mean is: ", (a+b)/2)
print("The population variance is: ", (b-a)**2/12)


8.14 A Mixed Discrete-Continuous Distribution
We’ll motivate this example with a little story.

Suppose that to apply for a job you take an interview and either pass or fail it.
You have a 5% chance of passing the interview, and only if you pass will your salary be uniformly distributed on the interval from 300 to 400 a day.
We can describe your daily salary as a discrete-continuous variable with the following probabilities:

$$
P(X = 0) = 0.95
$$

$$
P(300 \leq X \leq 400) = \int_{300}^{400} f(x) \, dx = 0.05
$$

$$
f(x) = 0.0005
$$
Let’s start by generating a random sample and computing sample moments.
x = np.random.rand(1_000_000)
# x[x > 0.95] = 100*x[x > 0.95]+300
x[x > 0.95] = 100*np.random.rand(len(x[x > 0.95]))+300
x[x <= 0.95] = 0
μ_hat = np.mean(x)
σ2_hat = np.var(x)

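The analytical mean and variance can be computed:

$$
\begin{aligned}
\mu & = \int_{300}^{400} x f(x) \, dx \\
& = 0.0005 \int_{300}^{400} x \, dx \\
& = 0.0005 \times \frac{1}{2} x^2 \Big|_{300}^{400} \\
& = 17.5
\end{aligned}
$$

$$
\begin{aligned}
\sigma^2 & = 0.95 \times (0 - 17.5)^2 + \int_{300}^{400} (x - 17.5)^2 f(x) \, dx \\
& = 0.95 \times 17.5^2 + 0.0005 \int_{300}^{400} (x - 17.5)^2 \, dx \\
& = 0.95 \times 17.5^2 + 0.0005 \times \frac{1}{3} (x - 17.5)^3 \Big|_{300}^{400}
\end{aligned}
$$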
mean = 0.0005*0.5*(400**2 - 300**2)

var = 0.95*17.5**2+0.0005/3*((400-17.5)**3-(300-17.5)**3)
print("mean: ", mean)
print("variance: ", var)
mean: 17.5
variance: 5860.416666666666
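These analytical values can be cross-checked in two ways: by simulating the mixed distribution, and by using $\sigma^2 = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ together with the known moments of a uniform distribution on [300, 400]. The following is a sketch only; the seed, the sample size, and the variable names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)     # seed chosen only for reproducibility
n = 1_000_000
prob_pass = 0.05

# simulate: salary is 0 on failure and U[300, 400] on passing
passed = rng.random(n) > 1 - prob_pass
salary = np.where(passed, rng.uniform(300, 400, n), 0.0)
print(salary.mean(), salary.var())

# deterministic cross-check via E[X] and E[X^2]
m_u = (300 + 400) / 2                    # mean of U[300, 400]
v_u = (400 - 300)**2 / 12                # variance of U[300, 400]
m1 = prob_pass * m_u                     # E[X]
m2 = prob_pass * (v_u + m_u**2)          # E[X^2]
print(m1, m2 - m1**2)
```

The second printed line should reproduce the mean 17.5 and the variance of about 5860.42 reported above, and the simulated moments should be close to them.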
8.15 Matrix Representation of Some Bivariate Distributions
Let’s use matrices to represent a joint distribution, conditional distribution, marginal distribution, and the mean and
variance of a bivariate random variable.
The table below illustrates a probability distribution for a bivariate random variable.

$$
F = [f_{ij}] = \begin{bmatrix} 0.3 & 0.2 \\ 0.1 & 0.4 \end{bmatrix}
$$

Marginal distributions are

$$
\textrm{Prob}(X = i) = \sum_j f_{ij} = u_i
$$

$$
\textrm{Prob}(Y = j) = \sum_i f_{ij} = v_j
$$
Below we draw some samples to confirm that the “sampling” distribution agrees well with the “population” distribution.
Sample results:
xs = np.array([0, 1])
ys = np.array([10, 20])
f = np.array([[0.3, 0.2], [0.1, 0.4]])
f_cum = np.cumsum(f)
# draw random numbers

p = np.random.rand(1_000_000)
x = np.vstack([xs[1]*np.ones(p.shape), ys[1]*np.ones(p.shape)])
# map to the bivariate distribution
x[0, p < f_cum[2]] = xs[1]

x[1, p < f_cum[2]] = ys[0]
x[0, p < f_cum[1]] = xs[0]

x[1, p < f_cum[1]] = ys[1]
x[0, p < f_cum[0]] = xs[0]
x[1, p < f_cum[0]] = ys[0]
print(x)
[[ 1. 1. 0. ... 1. 0. 0.]
[20. 20. 20. ... 20. 20. 10.]]
Here, we use exactly the inverse CDF technique to generate a sample from the joint distribution 𝐹.
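The same mapping can be written without a chain of masked assignments. The sketch below treats the cells of the joint matrix as one flattened distribution in row-major order, applies the inverse CDF with `np.searchsorted`, and converts each flattened cell index back to a row and a column. It is an alternative formulation under our own naming (`rng`, `cell`, `sample`), not the code used elsewhere in this lecture.

```python
import numpy as np

rng = np.random.default_rng(0)       # seed chosen only for reproducibility
xs = np.array([0, 1])
ys = np.array([10, 20])
f = np.array([[0.3, 0.2], [0.1, 0.4]])

f_cum = np.cumsum(f)                 # CDF over cells in row-major order
f_cum[-1] = 1.0                      # guard against rounding in the last entry
u = rng.random(1_000_000)

# smallest flattened cell index whose cumulative probability is >= u
cell = np.searchsorted(f_cum, u, side='left')
i, j = np.unravel_index(cell, f.shape)   # row (x) and column (y) indices
sample = np.vstack([xs[i], ys[j]])       # first row holds x, second row holds y

# empirical joint frequencies, to compare with f
f_hat = np.bincount(cell, minlength=f.size).reshape(f.shape) / len(u)
print(f_hat)
```

The matrix `f_hat` should be close to `f`, and its row and column sums estimate the marginal distributions.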
# marginal distribution
xp = np.sum(x[0, :] == xs[0])/1_000_000
yp = np.sum(x[1, :] == ys[0])/1_000_000
# print output
print("marginal distribution for x")
xmtb = pt.PrettyTable()
xmtb.field_names = ['x_value', 'x_prob']
xmtb.add_row([xs[0], xp])
xmtb.add_row([xs[1], 1-xp])
print(xmtb)
print("\nmarginal distribution for y")

ymtb = pt.PrettyTable()
ymtb.field_names = ['y_value', 'y_prob']
ymtb.add_row([ys[0], yp])
ymtb.add_row([ys[1], 1-yp])
print(ymtb)
marginal distribution for x

+---------+---------------------+
| x_value | x_prob |
+---------+---------------------+
| 0 | 0.501237 |
| 1 | 0.49876299999999996 |
+---------+---------------------+
marginal distribution for y

+---------+---------+
| y_value | y_prob |
+---------+---------+
| 10 | 0.40036 |
| 20 | 0.59964 |
+---------+---------+
# conditional distributions
xc1 = x[0, x[1, :] == ys[0]]
xc2 = x[0, x[1, :] == ys[1]]
yc1 = x[1, x[0, :] == xs[0]]
yc2 = x[1, x[0, :] == xs[1]]
xc1p = np.sum(xc1 == xs[0])/len(xc1)

xc2p = np.sum(xc2 == xs[0])/len(xc2)
yc1p = np.sum(yc1 == ys[0])/len(yc1)
yc2p = np.sum(yc2 == ys[0])/len(yc2)

# print output
print("conditional distribution for x")
xctb = pt.PrettyTable()
xctb.field_names = ['y_value', 'prob(x=0)', 'prob(x=1)']
xctb.add_row([ys[0], xc1p, 1-xc1p])
xctb.add_row([ys[1], xc2p, 1-xc2p])
print(xctb)
print("\nconditional distribution for y")

yctb = pt.PrettyTable()
yctb.field_names = ['x_value', 'prob(y=10)', 'prob(y=20)']
yctb.add_row([xs[0], yc1p, 1-yc1p])
yctb.add_row([xs[1], yc2p, 1-yc2p])
print(yctb)
conditional distribution for x

+---------+--------------------+--------------------+
| y_value | prob(x=0) | prob(x=1) |
+---------+--------------------+--------------------+
| 10 | 0.7501148965930663 | 0.2498851034069337 |
| 20 | 0.3350693749583083 | 0.6649306250416918 |
+---------+--------------------+--------------------+
conditional distribution for y

+---------+---------------------+--------------------+
| x_value | prob(y=10) | prob(y=20) |
+---------+---------------------+--------------------+
| 0 | 0.5991497036332114 | 0.4008502963667886 |
| 1 | 0.20058424542317693 | 0.799415754576823 |
+---------+---------------------+--------------------+
Let’s calculate population marginal and conditional probabilities using matrix algebra.

$$
\left[
\begin{array}{c|cc|c}
 & y_1 & y_2 & x \\
\hline
x_1 & 0.3 & 0.2 & 0.5 \\
x_2 & 0.1 & 0.4 & 0.5 \\
\hline
y & 0.4 & 0.6 & 1
\end{array}
\right]
$$

⟹
(1) Marginal distribution:

$$
\left[
\begin{array}{c|cc}
var & var_1 & var_2 \\
\hline
x & 0.5 & 0.5 \\
\hline
y & 0.4 & 0.6
\end{array}
\right]
$$

(2) Conditional distribution:

$$
\left[
\begin{array}{c|cc}
x & x_1 & x_2 \\
\hline
y = y_1 & \frac{0.3}{0.4} = 0.75 & \frac{0.1}{0.4} = 0.25 \\
\hline
y = y_2 & \frac{0.2}{0.6} \approx 0.33 & \frac{0.4}{0.6} \approx 0.67
\end{array}
\right]
$$

$$
\left[
\begin{array}{c|cc}
y & y_1 & y_2 \\
\hline
x = x_1 & \frac{0.3}{0.5} = 0.6 & \frac{0.2}{0.5} = 0.4 \\
\hline
x = x_2 & \frac{0.1}{0.5} = 0.2 & \frac{0.4}{0.5} = 0.8
\end{array}
\right]
$$
These population objects closely resemble sample counterparts computed above.
Let’s wrap some of the functions we have used in a Python class for a general discrete bivariate joint distribution.
class discrete_bijoint:

    def __init__(self, f, xs, ys):
        '''initialization
        -----------------
        parameters:
        f: the bivariate joint probability matrix
        xs: values of x vector
        ys: values of y vector
        '''
        self.f, self.xs, self.ys = f, xs, ys

    def joint_tb(self):
        '''print the joint distribution table'''
        xs = self.xs
        ys = self.ys
        f = self.f
        jtb = pt.PrettyTable()
        jtb.field_names = ['x_value/y_value', *ys, 'marginal sum for x']
        for i in range(len(xs)):
            jtb.add_row([xs[i], *f[i, :], np.sum(f[i, :])])
        jtb.add_row(['marginal_sum for y', *np.sum(f, 0), np.sum(f)])
        print("\nThe joint probability distribution for x and y\n", jtb)
        self.jtb = jtb

    def draw(self, n):
        '''draw random numbers
        ----------------------
        parameters:
        n: number of random numbers to draw
        '''
        xs = self.xs
        ys = self.ys
        f_cum = np.cumsum(self.f)
        p = np.random.rand(n)
        x = np.empty([2, p.shape[0]])
        lf = len(f_cum)
        lx = len(xs)-1
        ly = len(ys)-1
        # work backwards through the cells of f in row-major order
        for i in range(lf):
            x[0, p < f_cum[lf-1-i]] = xs[lx]
            x[1, p < f_cum[lf-1-i]] = ys[ly]
            if ly == 0:
                lx -= 1
                ly = len(ys)-1
            else:
                ly -= 1
        self.x = x
        self.n = n

    def marg_dist(self):
        '''marginal distribution'''
        x = self.x
        xs = self.xs
        ys = self.ys
        n = self.n
        xmp = [np.sum(x[0, :] == xs[i])/n for i in range(len(xs))]
        ymp = [np.sum(x[1, :] == ys[i])/n for i in range(len(ys))]
        # print output (tables are created here so the method does not
        # depend on objects defined outside the class)
        xmtb = pt.PrettyTable()
        ymtb = pt.PrettyTable()
        xmtb.field_names = ['x_value', 'x_prob']
        ymtb.field_names = ['y_value', 'y_prob']
        for i in range(max(len(xs), len(ys))):
            if i < len(xs):
                xmtb.add_row([xs[i], xmp[i]])
            if i < len(ys):
                ymtb.add_row([ys[i], ymp[i]])
        xmtb.add_row(['sum', np.sum(xmp)])
        ymtb.add_row(['sum', np.sum(ymp)])
        print("\nmarginal distribution for x\n", xmtb)
        print("\nmarginal distribution for y\n", ymtb)
        self.xmp = xmp
        self.ymp = ymp

    def cond_dist(self):
        '''conditional distribution'''
        x = self.x
        xs = self.xs
        ys = self.ys
        n = self.n
        xcp = np.empty([len(ys), len(xs)])
        ycp = np.empty([len(xs), len(ys)])
        for i in range(max(len(ys), len(xs))):
            if i < len(ys):
                xi = x[0, x[1, :] == ys[i]]
                idx = xi.reshape(len(xi), 1) == xs.reshape(1, len(xs))
                xcp[i, :] = np.sum(idx, 0)/len(xi)
            if i < len(xs):
                yi = x[1, x[0, :] == xs[i]]
                idy = yi.reshape(len(yi), 1) == ys.reshape(1, len(ys))
                ycp[i, :] = np.sum(idy, 0)/len(yi)
        # print output
        xctb = pt.PrettyTable()
        yctb = pt.PrettyTable()
        xctb.field_names = ['x_value', *xs, 'sum']
        yctb.field_names = ['y_value', *ys, 'sum']
        for i in range(max(len(xs), len(ys))):
            if i < len(ys):
                xctb.add_row([ys[i], *xcp[i], np.sum(xcp[i])])
            if i < len(xs):
                yctb.add_row([xs[i], *ycp[i], np.sum(ycp[i])])
        print("\nconditional distribution for x\n", xctb)
        print("\nconditional distribution for y\n", yctb)
        self.xcp = xcp
        self.xyp = ycp
Let’s apply our code to some examples.

Example 1
# joint
d = discrete_bijoint(f, xs, ys)
d.joint_tb()
The joint probability distribution for x and y

+--------------------+-----+--------------------+--------------------+
| x_value/y_value | 10 | 20 | marginal sum for x |
+--------------------+-----+--------------------+--------------------+
| 0 | 0.3 | 0.2 | 0.5 |
| 1 | 0.1 | 0.4 | 0.5 |
| marginal_sum for y | 0.4 | 0.6000000000000001 | 1.0 |
+--------------------+-----+--------------------+--------------------+
# sample marginal
d.draw(1_000_000)
d.marg_dist()

+---------+----------+
+---------+----------+
| 0 | 0.498825 |
| 1 | 0.501175 |
| sum | 1.0 |
+---------+----------+

+---------+---------+
+---------+---------+
| 10 | 0.39994 |
| 20 | 0.60006 |
| sum | 1.0 |
+---------+---------+
# sample conditional
d.cond_dist()

+---------+---------------------+--------------------+-----+
| x_value | 0 | 1 | sum |
+---------+---------------------+--------------------+-----+


| 10 | 0.7494524178626794 | 0.2505475821373206 | 1.0 |
| 20 | 0.33178182181781823 | 0.6682181781821818 | 1.0 |
+---------+---------------------+--------------------+-----+

+---------+---------------------+---------------------+-----+
| y_value | 10 | 20 | sum |
+---------+---------------------+---------------------+-----+
| 0 | 0.6008840775823184 | 0.39911592241768157 | 1.0 |
| 1 | 0.19993814535840773 | 0.8000618546415923 | 1.0 |
+---------+---------------------+---------------------+-----+
Example 2
xs_new = np.array([10, 20, 30])

ys_new = np.array([1, 2])
f_new = np.array([[0.2, 0.1], [0.1, 0.3], [0.15, 0.15]])
d_new = discrete_bijoint(f_new, xs_new, ys_new)
d_new.joint_tb()
The joint probability distribution for x and y

+--------------------+---------------------+------+---------------------+
| x_value/y_value | 1 | 2 | marginal sum for x |
+--------------------+---------------------+------+---------------------+
| 10 | 0.2 | 0.1 | 0.30000000000000004 |
| 20 | 0.1 | 0.3 | 0.4 |
| 30 | 0.15 | 0.15 | 0.3 |
| marginal_sum for y | 0.45000000000000007 | 0.55 | 1.0 |
+--------------------+---------------------+------+---------------------+
d_new.draw(1_000_000)
d_new.marg_dist()

+---------+----------+
+---------+----------+
| 10 | 0.299336 |
| 20 | 0.400698 |
| 30 | 0.299966 |
| sum | 1.0 |
+---------+----------+

+---------+----------+
+---------+----------+
| 1 | 0.449267 |
| 2 | 0.550733 |
| sum | 1.0 |
+---------+----------+
d_new.cond_dist()


+---------+--------------------+---------------------+--------------------+-----+
| x_value | 10 | 20 | 30 | sum |
+---------+--------------------+---------------------+--------------------+-----+
| 1 | 0.4446108884026648 | 0.22235997747441943 | 0.3330291341229158 | 1.0 |
| 2 | 0.180826280611476 | 0.5461793645922798 | 0.2729943547962443 | 1.0 |
+---------+--------------------+---------------------+--------------------+-----+

+---------+--------------------+--------------------+-----+
| y_value | 1 | 2 | sum |
+---------+--------------------+--------------------+-----+
| 10 | 0.6673069727663896 | 0.3326930272336104 | 1.0 |
| 20 | 0.2493124497751424 | 0.7506875502248577 | 1.0 |
| 30 | 0.4987865291399692 | 0.5012134708600308 | 1.0 |
+---------+--------------------+--------------------+-----+
8.16 A Continuous Bivariate Random Vector
A two-dimensional Gaussian distribution has joint density
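$$
f(x,y) = \left(2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}\right)^{-1} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2}\right)\right]
$$
$$
= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2}\right)\right]
$$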
We start with a bivariate normal distribution pinned down by
$$
\mu = \begin{bmatrix} 0 \\ 5 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 5 & .2 \\ .2 & 1 \end{bmatrix}
$$
# define the joint probability density function
def func(x, y, μ1=0, μ2=5, σ1=np.sqrt(5), σ2=np.sqrt(1), ρ=.2/np.sqrt(5*1)):
    A = (2 * np.pi * σ1 * σ2 * np.sqrt(1 - ρ**2))**(-1)
    B = -1 / 2 / (1 - ρ**2)
    C1 = (x - μ1)**2 / σ1**2
    C2 = 2 * ρ * (x - μ1) * (y - μ2) / σ1 / σ2
    C3 = (y - μ2)**2 / σ2**2
    return A * np.exp(B * (C1 - C2 + C3))

μ1 = 0
μ2 = 5
σ1 = np.sqrt(5)
σ2 = np.sqrt(1)
ρ = .2 / np.sqrt(5 * 1)
x = np.linspace(-10, 10, 1_000)

y = np.linspace(-10, 10, 1_000)
x_mesh, y_mesh = np.meshgrid(x, y, indexing="ij")
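As a sanity check on the closed-form density, the sketch below compares it with SciPy's `multivariate_normal` and confirms that it integrates to roughly one on a grid. The function name `bivariate_normal_pdf` and the grid sizes are illustrative assumptions; the formula is the same as in `func`.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_normal_pdf(x, y, μ1=0.0, μ2=5.0, σ1=np.sqrt(5), σ2=1.0,
                         ρ=0.2 / np.sqrt(5)):
    """Closed-form bivariate normal density (same formula as func)."""
    A = 1 / (2 * np.pi * σ1 * σ2 * np.sqrt(1 - ρ**2))
    B = -1 / (2 * (1 - ρ**2))
    C1 = (x - μ1)**2 / σ1**2
    C2 = 2 * ρ * (x - μ1) * (y - μ2) / (σ1 * σ2)
    C3 = (y - μ2)**2 / σ2**2
    return A * np.exp(B * (C1 - C2 + C3))

# A coarse grid centered on the means keeps the check cheap (assumed sizes)
x_grid = np.linspace(-10, 10, 401)
y_grid = np.linspace(-5, 15, 401)
xm, ym = np.meshgrid(x_grid, y_grid, indexing="ij")

ours = bivariate_normal_pdf(xm, ym)
reference = multivariate_normal(mean=[0, 5],
                                cov=[[5, 0.2], [0.2, 1]]).pdf(np.dstack((xm, ym)))

# Largest pointwise disagreement between the two densities
max_gap = np.max(np.abs(ours - reference))

# Riemann-sum approximation of the integral of the density over the grid
dx = x_grid[1] - x_grid[0]
dy = y_grid[1] - y_grid[0]
total_mass = ours.sum() * dx * dy

print(max_gap, total_mass)
```

The two densities should agree to floating-point precision, and the total mass should be close to one, with a small shortfall from the tails that the grid cuts off.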
Joint Distribution
Let’s plot the population joint density.

# %matplotlib notebook
fig = plt.figure()
ax = plt.axes(projection='3d')
surf = ax.plot_surface(x_mesh, y_mesh, func(x_mesh, y_mesh), cmap='viridis')

plt.show()
# %matplotlib notebook
fig = plt.figure()
ax = plt.axes(projection='3d')
curve = ax.contour(x_mesh, y_mesh, func(x_mesh, y_mesh), zdir='x')

plt.ylabel('y')
ax.set_zlabel('f')
ax.set_xticks([])
plt.show()

Next we can simulate from a built-in numpy function and calculate a sample marginal distribution from the sample mean
and variance.
μ= np.array([0, 5])
σ= np.array([[5, .2], [.2, 1]])
n = 1_000_000
data = np.random.multivariate_normal(μ, σ, n)
x = data[:, 0]
y = data[:, 1]
Marginal distribution
plt.hist(x, bins=1_000, alpha=0.6)

μx_hat, σx_hat = np.mean(x), np.std(x)
print(μx_hat, σx_hat)
x_sim = np.random.normal(μx_hat, σx_hat, 1_000_000)
plt.hist(x_sim, bins=1_000, alpha=0.4, histtype="step")
plt.show()
-0.0009410653678662386 2.237337853596715

plt.hist(y, bins=1_000, density=True, alpha=0.6)

μy_hat, σy_hat = np.mean(y), np.std(y)
print(μy_hat, σy_hat)
y_sim = np.random.normal(μy_hat, σy_hat, 1_000_000)
plt.hist(y_sim, bins=1_000, density=True, alpha=0.4, histtype="step")
plt.show()
4.999005281178264 1.0003086878642835

Conditional distribution
The population conditional distribution is
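$$
[X \mid Y = y] \sim \mathbb{N}\left[\mu_X + \rho\sigma_X\frac{y-\mu_Y}{\sigma_Y},\ \sigma_X^2(1-\rho^2)\right]
$$
$$
[Y \mid X = x] \sim \mathbb{N}\left[\mu_Y + \rho\sigma_Y\frac{x-\mu_X}{\sigma_X},\ \sigma_Y^2(1-\rho^2)\right]
$$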
Let’s approximate the joint density by discretizing and mapping the approximating joint density into a matrix.
We can compute the discretized conditional density by just using matrix algebra and noting that
$$
\textrm{Prob}\{X = i \mid Y = j\} = \frac{f_{ij}}{\sum_i f_{ij}}
$$
Fix 𝑦 = 0.
# discretized marginal density

x = np.linspace(-10, 10, 1_000_000)
z = func(x, y=0) / np.sum(func(x, y=0))
plt.plot(x, z)
plt.show()

The mean and variance are computed by
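$$
\mathbb{E}[X \mid Y = j] = \sum_i i\, \textrm{Prob}\{X = i \mid Y = j\} = \sum_i i \frac{f_{ij}}{\sum_i f_{ij}}
$$
$$
\mathbb{D}[X \mid Y = j] = \sum_i \left(i - \mu_{X \mid Y = j}\right)^2 \frac{f_{ij}}{\sum_i f_{ij}}
$$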
Let’s draw from a normal distribution with above mean and variance and check how accurate our approximation is.
# discretized mean
μx = np.dot(x, z)
# discretized standard deviation

σx = np.sqrt(np.dot((x - μx)**2, z))
# sample
zz = np.random.normal(μx, σx, 1_000_000)
plt.hist(zz, bins=300, density=True, alpha=0.3, range=[-10, 10])
plt.show()

Fix 𝑥 = 1.
y = np.linspace(0, 10, 1_000_000)

z = func(x=1, y=y) / np.sum(func(x=1, y=y))
plt.plot(y,z)
plt.show()

# discretized mean and standard deviation

μy = np.dot(y,z)
σy = np.sqrt(np.dot((y - μy)**2, z))
# sample
zz = np.random.normal(μy,σy,1_000_000)
plt.hist(zz, bins=100, density=True, alpha=0.3)
plt.show()

We compare with the analytically computed parameters and note that they are close.
print(μx, σx)
print(μ1 + ρ * σ1 * (0 - μ2) / σ2, np.sqrt(σ1**2 * (1 - ρ**2)))
print(μy, σy)
print(μ2 + ρ * σ2 * (1 - μ1) / σ1, np.sqrt(σ2**2 * (1 - ρ**2)))
-0.9997518414498433 2.22658413316977
-1.0 2.227105745132009
5.039999456960768 0.9959851265795597
5.04 0.9959919678390986
8.17 Sum of Two Independently Distributed Random Variables
Let 𝑋, 𝑌 be two independent discrete random variables that take values in $\bar{X}$, $\bar{Y}$, respectively.
Define a new random variable 𝑍 = 𝑋 + 𝑌 .
Evidently, 𝑍 takes values from $\bar{Z}$ defined as follows:
$$
\begin{aligned}
\bar{X} & = \{0, 1, \ldots, I-1\}; \qquad f_i = \textrm{Prob}\{X = i\} \\
\bar{Y} & = \{0, 1, \ldots, J-1\}; \qquad g_j = \textrm{Prob}\{Y = j\} \\
\bar{Z} & = \{0, 1, \ldots, I+J-2\}; \qquad h_k = \textrm{Prob}\{X + Y = k\}
\end{aligned}
$$

Independence of 𝑋 and 𝑌 implies that
ℎ𝑘 = Prob{𝑋 = 0, 𝑌 = 𝑘} + Prob{𝑋 = 1, 𝑌 = 𝑘 − 1} + … + Prob{𝑋 = 𝑘, 𝑌 = 0}

ℎ𝑘 = 𝑓0 𝑔𝑘 + 𝑓1 𝑔𝑘−1 + … + 𝑓𝑘−1 𝑔1 + 𝑓𝑘 𝑔0 for 𝑘 = 0, 1, … , 𝐼 + 𝐽 − 2
Thus, we have:
$$
h_k = \sum_{i=0}^{k} f_i g_{k-i} \equiv f * g
$$
where 𝑓 ∗ 𝑔 denotes the convolution of the 𝑓 and 𝑔 sequences.

Similarly, for two independent random variables 𝑋, 𝑌 with densities 𝑓𝑋 , 𝑔𝑌 , the density of 𝑍 = 𝑋 + 𝑌 is
$$
f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, g_Y(z - x)\, dx \equiv f_X * g_Y
$$
where 𝑓𝑋 ∗ 𝑔𝑌 denotes the convolution of the 𝑓𝑋 and 𝑔𝑌 functions.
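The discrete convolution formula can be checked numerically. The sketch below uses two small, made-up probability vectors `f` and `g` (assumptions chosen only for illustration) and compares NumPy's `np.convolve` with a direct double loop over all pairs (𝑖, 𝑗).

```python
import numpy as np

# Hypothetical marginal distributions (illustrative values only)
f = np.array([0.2, 0.5, 0.3])   # Prob{X = i}, i = 0, 1, 2   (I = 3)
g = np.array([0.6, 0.4])        # Prob{Y = j}, j = 0, 1      (J = 2)

# Distribution of Z = X + Y via convolution; length is I + J - 1
h = np.convolve(f, g)

# Direct computation: add f_i * g_j to the cell for k = i + j
h_direct = np.zeros(len(f) + len(g) - 1)
for i, fi in enumerate(f):
    for j, gj in enumerate(g):
        h_direct[i + j] += fi * gj

print(h)
print(h_direct)
```

Both vectors list Prob{𝑍 = 𝑘} for 𝑘 = 0, …, 𝐼 + 𝐽 − 2 and sum to one.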
8.18 Transition Probability Matrix
Consider the following joint probability distribution of two random variables.

Let 𝑋, 𝑌 be discrete random variables with joint distribution
Prob{𝑋 = 𝑖, 𝑌 = 𝑗} = 𝜌𝑖𝑗
where 𝑖 = 0, … , 𝐼 − 1; 𝑗 = 0, … , 𝐽 − 1 and
∑ ∑ 𝜌𝑖𝑗 = 1, 𝜌𝑖𝑗 ⩾ 0.
𝑖 𝑗
An associated conditional distribution is

$$
\textrm{Prob}\{Y = j \mid X = i\} = \frac{\rho_{ij}}{\sum_j \rho_{ij}} = \frac{\textrm{Prob}\{Y = j, X = i\}}{\textrm{Prob}\{X = i\}}
$$
We can define a transition probability matrix

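$$
p_{ij} = \textrm{Prob}\{Y = j \mid X = i\} = \frac{\rho_{ij}}{\sum_j \rho_{ij}}
$$
For example, when 𝐼 = 𝐽 = 2 the matrix is
$$
\begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix}
$$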
The first row is the probability of 𝑌 = 𝑗, 𝑗 = 0, 1 conditional on 𝑋 = 0.

The second row is the probability of 𝑌 = 𝑗, 𝑗 = 0, 1 conditional on 𝑋 = 1.
Note that
• $\sum_j p_{ij} = \frac{\sum_j \rho_{ij}}{\sum_j \rho_{ij}} = 1$, so each row of the matrix $[p_{ij}]$ is a probability distribution (this need not hold for the columns).
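A minimal sketch of this construction follows. It normalizes each row of a joint distribution by its row sum. The matrix `joint` holds made-up values and is an assumption for illustration only.

```python
import numpy as np

# Hypothetical joint distribution: joint[i, j] = Prob{X = i, Y = j}
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])

# Marginal of X: sum over j within each row
marginal_x = joint.sum(axis=1, keepdims=True)

# Transition matrix: P[i, j] = Prob{Y = j | X = i}
P = joint / marginal_x

print(P)
print("row sums:   ", P.sum(axis=1))
print("column sums:", P.sum(axis=0))
```

The row sums equal one, while the column sums generally do not.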

8.19 Coupling
Start with a joint distribution
𝑓𝑖𝑗 = Prob{𝑋 = 𝑖, 𝑌 = 𝑗}
𝑖 = 0, ⋯ 𝐼 − 1
𝑗 = 0, ⋯ 𝐽 − 1
stacked into an 𝐼 × 𝐽 matrix, e.g., with 𝐼 = 𝐽 = 2,
$$
\begin{bmatrix} f_{00} & f_{01} \\ f_{10} & f_{11} \end{bmatrix}
$$
From the joint distribution, we have shown above that we obtain unique marginal distributions.
Now we’ll try to go in a reverse direction.
We’ll find that from two marginal distributions we can usually construct more than one joint distribution that has these marginals.
Each of these joint distributions is called a coupling of the two marginal distributions.
Let’s start with marginal distributions
$$
\textrm{Prob}\{X = i\} = \sum_j f_{ij} = \mu_i, \quad i = 0, \ldots, I-1
$$
$$
\textrm{Prob}\{Y = j\} = \sum_i f_{ij} = \nu_j, \quad j = 0, \ldots, J-1
$$
Given two marginal distributions, 𝜇 for 𝑋 and 𝜈 for 𝑌 , a joint distribution 𝑓𝑖𝑗 with these marginals is said to be a coupling of 𝜇 and 𝜈.
Example:
Consider the following bivariate example.
Prob{𝑋 = 0} =1 − 𝑞 = 𝜇0
Prob{𝑋 = 1} =𝑞 = 𝜇1
Prob{𝑌 = 0} =1 − 𝑟 = 𝜈0
Prob{𝑌 = 1} =𝑟 = 𝜈1
where 0 ≤ 𝑞 < 𝑟 ≤ 1
We construct two couplings.

The first coupling of our two marginal distributions is the joint distribution
$$
f_{ij} = \begin{bmatrix} (1-q)(1-r) & (1-q)r \\ q(1-r) & qr \end{bmatrix}
$$
To verify that it is a coupling, we check that
$$
(1-q)(1-r) + (1-q)r + q(1-r) + qr = 1
$$
$$
\begin{aligned}
\mu_0 & = (1-q)(1-r) + (1-q)r = 1-q \\
\mu_1 & = q(1-r) + qr = q \\
\nu_0 & = (1-q)(1-r) + (1-r)q = 1-r \\
\nu_1 & = r(1-q) + qr = r
\end{aligned}
$$

A second coupling of our two marginal distributions is the joint distribution

$$
f_{ij} = \begin{bmatrix} 1-r & r-q \\ 0 & q \end{bmatrix}
$$
To verify that this is a coupling, note that
1−𝑟+𝑟−𝑞+𝑞 =1
𝜇0 = 1 − 𝑞
𝜇1 = 𝑞
𝜈0 = 1 − 𝑟
𝜈1 = 𝑟
Thus, our two proposed joint distributions have the same marginal distributions.
But the joint distributions differ.
Thus, multiple joint distributions [𝑓𝑖𝑗 ] can have the same marginals.
Remark:
• Couplings are important in optimal transport problems and in Markov processes.
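The two couplings in the example can be checked numerically. The sketch below picks arbitrary values 𝑞 = 0.3 and 𝑟 = 0.6 (assumptions that satisfy 0 ≤ 𝑞 < 𝑟 ≤ 1), builds both joint distributions, and compares their row and column sums with 𝜇 and 𝜈.

```python
import numpy as np

q, r = 0.3, 0.6   # illustrative values with 0 <= q < r <= 1

μ = np.array([1 - q, q])   # marginal of X
ν = np.array([1 - r, r])   # marginal of Y

# First coupling: X and Y independent
f_first = np.array([[(1 - q) * (1 - r), (1 - q) * r],
                    [q * (1 - r),       q * r]])

# Second coupling: no mass on the event {X = 1, Y = 0}
f_second = np.array([[1 - r, r - q],
                     [0.0,   q]])

for name, f in (("first", f_first), ("second", f_second)):
    x_marginal = f.sum(axis=1)   # sum over j gives Prob{X = i}
    y_marginal = f.sum(axis=0)   # sum over i gives Prob{Y = j}
    print(name, x_marginal, y_marginal)
```

Both matrices reproduce 𝜇 and 𝜈, although the matrices themselves differ.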
8.20 Copula Functions
Suppose that 𝑋1 , 𝑋2 , … , 𝑋𝑁 are 𝑁 random variables and that

• their marginal distributions are 𝐹1 (𝑥1 ), 𝐹2 (𝑥2 ), … , 𝐹𝑁 (𝑥𝑁 ), and
• their joint distribution is 𝐻(𝑥1 , 𝑥2 , … , 𝑥𝑁 )
Then there exists a copula function 𝐶(⋅) that verifies
𝐻(𝑥1 , 𝑥2 , … , 𝑥𝑁 ) = 𝐶(𝐹1 (𝑥1 ), 𝐹2 (𝑥2 ), … , 𝐹𝑁 (𝑥𝑁 )).
We can obtain
$$
C(u_1, u_2, \ldots, u_N) = H[F_1^{-1}(u_1), F_2^{-1}(u_2), \ldots, F_N^{-1}(u_N)]
$$
In a reverse direction of logic, given univariate marginal distributions 𝐹1 (𝑥1 ), 𝐹2 (𝑥2 ), … , 𝐹𝑁 (𝑥𝑁 ) and a
copula function 𝐶(⋅), the function 𝐻(𝑥1 , 𝑥2 , … , 𝑥𝑁 ) = 𝐶(𝐹1 (𝑥1 ), 𝐹2 (𝑥2 ), … , 𝐹𝑁 (𝑥𝑁 )) is a coupling of
𝐹1 (𝑥1 ), 𝐹2 (𝑥2 ), … , 𝐹𝑁 (𝑥𝑁 ).
Thus, for given marginal distributions, we can use a copula function to determine a joint distribution when the associated
univariate random variables are not independent.
Copula functions are often used to characterize dependence of random variables.
Discrete marginal distribution
As mentioned above, for two given marginal distributions there can be more than one coupling.
For example, consider two random variables 𝑋, 𝑌 with distributions
Prob(𝑋 = 0) = 0.6,
Prob(𝑋 = 1) = 0.4,
Prob(𝑌 = 0) = 0.3,
Prob(𝑌 = 1) = 0.7,
For these two random variables there can be more than one coupling.
Let’s first generate X and Y.

# define parameters
mu = np.array([0.6, 0.4])
nu = np.array([0.3, 0.7])
# number of draws
draws = 1_000_000
# generate draws from uniform distribution

p = np.random.rand(draws)
# generate draws of X and Y via uniform distribution

x = np.ones(draws)
y = np.ones(draws)
x[p <= mu[0]] = 0
x[p > mu[0]] = 1
y[p <= nu[0]] = 0
y[p > nu[0]] = 1
# calculate parameters from draws

q_hat = sum(x[x == 1])/draws
r_hat = sum(y[y == 1])/draws
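# print output (assumes prettytable was imported as pt)
print("distribution for x")
xmtb = pt.PrettyTable()
xmtb.field_names = ['x_value', 'x_prob']   # field names are illustrative labels
xmtb.add_row([0, 1-q_hat])
xmtb.add_row([1, q_hat])
print(xmtb)

print("distribution for y")
ymtb = pt.PrettyTable()
ymtb.field_names = ['y_value', 'y_prob']   # field names are illustrative labels
ymtb.add_row([0, 1-r_hat])
ymtb.add_row([1, r_hat])
print(ymtb)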
distribution for x
+---------+--------------------+
+---------+--------------------+
| 0 | 0.6006279999999999 |
| 1 | 0.399372 |
+---------+--------------------+
distribution for y
+---------+----------+
+---------+----------+
| 0 | 0.300752 |
| 1 | 0.699248 |
+---------+----------+
Let’s now take our two marginal distributions, one for 𝑋, the other for 𝑌 , and construct two distinct couplings.
For the first joint distribution:
Prob(𝑋 = 𝑖, 𝑌 = 𝑗) = 𝑓𝑖𝑗

where
$$
[f_{ij}] = \begin{bmatrix} 0.18 & 0.42 \\ 0.12 & 0.28 \end{bmatrix}
$$
Let’s use Python to construct this joint distribution and then verify that its marginal distributions are what we want.
# define parameters
f1 = np.array([[0.18, 0.42], [0.12, 0.28]])
f1_cum = np.cumsum(f1)
# number of draws
draws1 = 1_000_000

p = np.random.rand(draws1)
# generate draws of first coupling via uniform distribution

c1 = np.vstack([np.ones(draws1), np.ones(draws1)])
# X=0, Y=0
c1[0, p <= f1_cum[0]] = 0
c1[1, p <= f1_cum[0]] = 0
# X=0, Y=1
c1[0, (p > f1_cum[0])*(p <= f1_cum[1])] = 0
c1[1, (p > f1_cum[0])*(p <= f1_cum[1])] = 1
# X=1, Y=0
c1[0, (p > f1_cum[1])*(p <= f1_cum[2])] = 1
c1[1, (p > f1_cum[1])*(p <= f1_cum[2])] = 0
# X=1, Y=1
c1[0, (p > f1_cum[2])*(p <= f1_cum[3])] = 1
c1[1, (p > f1_cum[2])*(p <= f1_cum[3])] = 1

f1_00 = sum((c1[0, :] == 0)*(c1[1, :] == 0))/draws1
f1_01 = sum((c1[0, :] == 0)*(c1[1, :] == 1))/draws1
f1_10 = sum((c1[0, :] == 1)*(c1[1, :] == 0))/draws1
f1_11 = sum((c1[0, :] == 1)*(c1[1, :] == 1))/draws1
# print output of first joint distribution

print("first joint distribution for c1")
c1_mtb = pt.PrettyTable()
c1_mtb.field_names = ['c1_x_value', 'c1_y_value', 'c1_prob']
c1_mtb.add_row([0, 0, f1_00])
c1_mtb.add_row([0, 1, f1_01])
c1_mtb.add_row([1, 0, f1_10])
c1_mtb.add_row([1, 1, f1_11])
print(c1_mtb)
first joint distribution for c1

+------------+------------+----------+
| c1_x_value | c1_y_value | c1_prob |
+------------+------------+----------+
| 0 | 0 | 0.179818 |
| 0 | 1 | 0.420259 |
| 1 | 0 | 0.120202 |
| 1 | 1 | 0.279721 |
+------------+------------+----------+


c1_q_hat = sum(c1[0, :] == 1)/draws1
c1_r_hat = sum(c1[1, :] == 1)/draws1
# print output
c1_x_mtb = pt.PrettyTable()
c1_x_mtb.field_names = ['c1_x_value', 'c1_x_prob']
c1_x_mtb.add_row([0, 1-c1_q_hat])
c1_x_mtb.add_row([1, c1_q_hat])
print(c1_x_mtb)
print("marginal distribution for y")

c1_ymtb = pt.PrettyTable()
c1_ymtb.field_names = ['c1_y_value', 'c1_y_prob']
c1_ymtb.add_row([0, 1-c1_r_hat])
c1_ymtb.add_row([1, c1_r_hat])
print(c1_ymtb)
+------------+-----------+
| c1_x_value | c1_x_prob |
+------------+-----------+
| 0 | 0.600077 |
| 1 | 0.399923 |
+------------+-----------+
+------------+---------------------+
| c1_y_value | c1_y_prob |
+------------+---------------------+
| 0 | 0.30001999999999995 |
| 1 | 0.69998 |
+------------+---------------------+
Now, let’s construct another joint distribution that is also a coupling of 𝑋 and 𝑌
$$
[f_{ij}] = \begin{bmatrix} 0.3 & 0.3 \\ 0 & 0.4 \end{bmatrix}
$$
# define parameters
f2 = np.array([[0.3, 0.3], [0, 0.4]])
f2_cum = np.cumsum(f2)
# number of draws
draws2 = 1_000_000

p = np.random.rand(draws2)
# generate draws of second coupling via uniform distribution

c2 = np.vstack([np.ones(draws2), np.ones(draws2)])
# X=0, Y=0
c2[0, p <= f2_cum[0]] = 0
c2[1, p <= f2_cum[0]] = 0


# X=0, Y=1
c2[0, (p > f2_cum[0])*(p <= f2_cum[1])] = 0
c2[1, (p > f2_cum[0])*(p <= f2_cum[1])] = 1
# X=1, Y=0
c2[0, (p > f2_cum[1])*(p <= f2_cum[2])] = 1
c2[1, (p > f2_cum[1])*(p <= f2_cum[2])] = 0
# X=1, Y=1
c2[0, (p > f2_cum[2])*(p <= f2_cum[3])] = 1
c2[1, (p > f2_cum[2])*(p <= f2_cum[3])] = 1

f2_00 = sum((c2[0, :] == 0)*(c2[1, :] == 0))/draws2
f2_01 = sum((c2[0, :] == 0)*(c2[1, :] == 1))/draws2
f2_10 = sum((c2[0, :] == 1)*(c2[1, :] == 0))/draws2
f2_11 = sum((c2[0, :] == 1)*(c2[1, :] == 1))/draws2
# print output of second joint distribution

print("second joint distribution for c2")
c2_mtb = pt.PrettyTable()
c2_mtb.field_names = ['c2_x_value', 'c2_y_value', 'c2_prob']
c2_mtb.add_row([0, 0, f2_00])
c2_mtb.add_row([0, 1, f2_01])
c2_mtb.add_row([1, 0, f2_10])
c2_mtb.add_row([1, 1, f2_11])
print(c2_mtb)
second joint distribution for c2

+------------+------------+----------+
| c2_x_value | c2_y_value | c2_prob |
+------------+------------+----------+
| 0 | 0 | 0.29983 |
| 0 | 1 | 0.300708 |
| 1 | 0 | 0.0 |
| 1 | 1 | 0.399462 |
+------------+------------+----------+

c2_q_hat = sum(c2[0, :] == 1)/draws2
c2_r_hat = sum(c2[1, :] == 1)/draws2
# print output
c2_x_mtb = pt.PrettyTable()
c2_x_mtb.field_names = ['c2_x_value', 'c2_x_prob']
c2_x_mtb.add_row([0, 1-c2_q_hat])
c2_x_mtb.add_row([1, c2_q_hat])
print(c2_x_mtb)
print("marginal distribution for y")

c2_ymtb = pt.PrettyTable()
c2_ymtb.field_names = ['c2_y_value', 'c2_y_prob']
c2_ymtb.add_row([0, 1-c2_r_hat])
c2_ymtb.add_row([1, c2_r_hat])
print(c2_ymtb)

+------------+-----------+
| c2_x_value | c2_x_prob |
+------------+-----------+
| 0 | 0.600538 |
| 1 | 0.399462 |
+------------+-----------+
+------------+---------------------+
| c2_y_value | c2_y_prob |
+------------+---------------------+
| 0 | 0.29983000000000004 |
| 1 | 0.70017 |
+------------+---------------------+
We have verified that both joint distributions, 𝑐1 and 𝑐2 , have identical marginal distributions of 𝑋 and 𝑌 , respectively.
So they are both couplings of 𝑋 and 𝑌 .
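The two couplings simulated above can also be read as the output of two different copulas applied to the same marginals. The sketch below is an illustration under stated assumptions: the helper `joint_from_copula` is a hypothetical name, and it relies on the standard copula properties 𝐶(𝑢, 1) = 𝑢 and 𝐶(1, 𝑣) = 𝑣. The independence copula 𝐶(𝑢, 𝑣) = 𝑢𝑣 yields the first joint distribution, and the copula 𝐶(𝑢, 𝑣) = min(𝑢, 𝑣) yields the second.

```python
import numpy as np

def joint_from_copula(C, Fx0, Fy0):
    """
    Build the 2 x 2 joint pmf of binary (X, Y) from a copula C and the
    marginal CDF values Fx0 = Prob{X <= 0} and Fy0 = Prob{Y <= 0}.
    Hypothetical helper, written only for this illustration.
    """
    h00 = C(Fx0, Fy0)                     # H(0, 0) = Prob{X = 0, Y = 0}
    return np.array([[h00,       Fx0 - h00],
                     [Fy0 - h00, 1 - Fx0 - Fy0 + h00]])

Fx0, Fy0 = 0.6, 0.3    # Prob(X = 0) and Prob(Y = 0) from the example

independence = lambda u, v: u * v
comonotone = lambda u, v: min(u, v)

f1_from_copula = joint_from_copula(independence, Fx0, Fy0)
f2_from_copula = joint_from_copula(comonotone, Fx0, Fy0)

print(f1_from_copula)
print(f2_from_copula)
```

The two outputs match the matrices `f1` and `f2` used in the simulations, so changing the copula while holding the marginals fixed changes the coupling.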
8.21 Time Series
Suppose that there are two time periods.

• 𝑡 = 0 “today”
• 𝑡 = 1 “tomorrow”
Let 𝑋(0) be a random variable to be realized at 𝑡 = 0, 𝑋(1) be a random variable to be realized at 𝑡 = 1.
Suppose that
$$
\textrm{Prob}\{X(0) = i, X(1) = j\} = f_{ij} \geq 0, \quad i = 0, \ldots, I-1
$$
$$
\sum_i \sum_j f_{ij} = 1
$$
𝑓𝑖𝑗 is a joint distribution over [𝑋(0), 𝑋(1)].

A conditional distribution is
$$
\textrm{Prob}\{X(1) = j \mid X(0) = i\} = \frac{f_{ij}}{\sum_j f_{ij}}
$$
Remark:
• This is a key formula for a theory of optimally predicting a time series.
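To see how the conditional distribution feeds into prediction, the sketch below computes 𝔼[𝑋(1) | 𝑋(0) = 𝑖], the forecast that minimizes mean squared error, for each value of 𝑖. The joint distribution `f` and the three-state setup are made-up assumptions for illustration.

```python
import numpy as np

# Hypothetical joint distribution: f[i, j] = Prob{X(0) = i, X(1) = j}
f = np.array([[0.20, 0.05, 0.00],
              [0.05, 0.30, 0.05],
              [0.00, 0.10, 0.25]])
states = np.array([0, 1, 2])   # values that X(1) can take

# Conditional distribution of X(1) given X(0) = i: divide each row by its sum
cond = f / f.sum(axis=1, keepdims=True)

# Conditional mean of X(1) for each value of X(0)
forecast = cond @ states

print(cond)
print(forecast)
```

Each entry of `forecast` is the predicted value of 𝑋(1) once the corresponding value of 𝑋(0) has been observed.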

CHAPTER
NINE
LLN AND CLT
Contents
• LLN and CLT

– Overview
– Relationships
– LLN
– CLT
– Exercises
9.1 Overview
This lecture illustrates two of the most important theorems of probability and statistics: The law of large numbers (LLN)
and the central limit theorem (CLT).
These beautiful theorems lie behind many of the most fundamental results in econometrics and quantitative economic
modeling.
The lecture is based around simulations that show the LLN and CLT in action.
We also demonstrate how the LLN and CLT break down when the assumptions they are based on do not hold.
In addition, we examine several useful extensions of the classical theorems, such as
• The delta method, for smooth functions of random variables, and
• the multivariate case.
Some of these extensions are presented as exercises.
We’ll need the following imports:

import matplotlib.pyplot as plt
import random
import numpy as np
from scipy.stats import t, beta, lognorm, expon, gamma, uniform
from scipy.stats import gaussian_kde, poisson, binom, norm, chi2
from matplotlib.collections import PolyCollection
from scipy.linalg import inv, sqrtm
9.2 Relationships
The CLT refines the LLN.

The LLN gives conditions under which sample moments converge to population moments as sample size increases.
The CLT provides information about the rate at which sample moments converge to population moments as sample size
increases.
9.3 LLN
We begin with the law of large numbers, which tells us when sample averages will converge to their population means.
9.3.1 The Classical LLN
The classical law of large numbers concerns independent and identically distributed (IID) random variables.
Here is the strongest version of the classical LLN, known as Kolmogorov’s strong law.
Let 𝑋1 , … , 𝑋𝑛 be independent and identically distributed scalar random variables, with common distribution 𝐹 .
When it exists, let 𝜇 denote the common mean of this sample:
𝜇 ∶= 𝔼𝑋 = ∫ 𝑥𝐹 (𝑑𝑥)
In addition, let
$$
\bar{X}_n := \frac{1}{n} \sum_{i=1}^n X_i
$$
Kolmogorov’s strong law states that, if 𝔼|𝑋| is finite, then
ℙ {𝑋̄ 𝑛 → 𝜇 as 𝑛 → ∞} = 1 (9.1)
What does this last expression mean?

Let’s think about it from a simulation perspective, imagining for a moment that our computer can generate perfect random
samples (which of course it can’t).
Let’s also imagine that we can generate infinite sequences so that the statement 𝑋̄ 𝑛 → 𝜇 can be evaluated.
In this setting, (9.1) should be interpreted as meaning that the probability of the computer producing a sequence where
𝑋̄ 𝑛 → 𝜇 fails to occur is zero.

9.3.2 Proof
The proof of Kolmogorov’s strong law is nontrivial – see, for example, theorem 8.3.5 of [Dudley, 2002].
On the other hand, we can prove a weaker version of the LLN very easily and still get most of the intuition.
The version we prove is as follows: If 𝑋1 , … , 𝑋𝑛 is IID with $\mathbb{E} X_i^2 < \infty$, then, for any 𝜖 > 0, we have
$$
\mathbb{P}\left\{ |\bar{X}_n - \mu| \geq \epsilon \right\} \to 0 \quad \text{as} \quad n \to \infty \tag{9.2}
$$
(This version is weaker because we claim only convergence in probability rather than almost sure convergence, and assume a finite second moment.)
To see that this is so, fix 𝜖 > 0, and let $\sigma^2$ be the variance of each 𝑋𝑖 .
Recall the Chebyshev inequality, which tells us that
$$
\mathbb{P}\left\{ |\bar{X}_n - \mu| \geq \epsilon \right\} \leq \frac{\mathbb{E}[(\bar{X}_n - \mu)^2]}{\epsilon^2} \tag{9.3}
$$
Now observe that
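$$
\begin{aligned}
\mathbb{E}[(\bar{X}_n - \mu)^2]
& = \mathbb{E}\left\{ \left[ \frac{1}{n} \sum_{i=1}^n (X_i - \mu) \right]^2 \right\} \\
& = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \mathbb{E}(X_i - \mu)(X_j - \mu) \\
& = \frac{1}{n^2} \sum_{i=1}^n \mathbb{E}(X_i - \mu)^2 \\
& = \frac{\sigma^2}{n}
\end{aligned}
$$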
Here the crucial step is at the third equality, which follows from independence.
Independence means that if 𝑖 ≠ 𝑗, then the covariance term 𝔼(𝑋𝑖 − 𝜇)(𝑋𝑗 − 𝜇) drops out.
As a result, $n^2 - n$ terms vanish, leading us to a final expression that goes to zero in 𝑛.
Combining our last result with (9.3), we come to the estimate
$$
\mathbb{P}\left\{ |\bar{X}_n - \mu| \geq \epsilon \right\} \leq \frac{\sigma^2}{n \epsilon^2} \tag{9.4}
$$
The claim in (9.2) is now clear.
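The bound in (9.4) can be compared with simulated frequencies. The sketch below is a rough check under assumed settings: uniform draws on [0, 1] (so 𝜇 = 1/2 and 𝜎² = 1/12), a tolerance 𝜖 = 0.1, and a few sample sizes chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

μ, σ2 = 0.5, 1 / 12    # mean and variance of the uniform distribution on [0, 1]
ε = 0.1                # tolerance (illustrative choice)
reps = 20_000          # number of simulated samples for each n

results = []
for n in (10, 50, 250):
    # Each row is one sample of size n; take the sample mean of each row
    sample_means = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
    # Simulated estimate of the left-hand side of (9.4)
    freq = np.mean(np.abs(sample_means - μ) >= ε)
    # Right-hand side of (9.4)
    bound = σ2 / (n * ε**2)
    results.append((n, freq, bound))
    print(f"n = {n:4d}   frequency = {freq:.4f}   Chebyshev bound = {bound:.4f}")
```

The simulated frequency sits below the bound at every sample size. The bound is loose, but it still shrinks to zero as 𝑛 grows, which is all the argument needs.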
Of course, if the sequence 𝑋1 , … , 𝑋𝑛 is correlated, then the cross-product terms 𝔼(𝑋𝑖 − 𝜇)(𝑋𝑗 − 𝜇) are not necessarily
zero.
While this doesn’t mean that the same line of argument is impossible, it does mean that if we want a similar result then
the covariances should be “almost zero” for “most” of these terms.
In a long sequence, this would be true if, for example, 𝔼(𝑋𝑖 − 𝜇)(𝑋𝑗 − 𝜇) approached zero when the difference between
𝑖 and 𝑗 became large.
In other words, the LLN can still work if the sequence 𝑋1 , … , 𝑋𝑛 has a kind of “asymptotic independence”, in the sense
that correlation falls to zero as variables become further apart in the sequence.
This idea is very important in time series analysis, and we’ll come across it again soon enough.
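As a rough illustration of asymptotic independence, the sketch below simulates a stationary first-order autoregression 𝑋𝑡 = 𝑎𝑋𝑡−1 + 𝑊𝑡 with standard normal shocks. The model, the coefficient 𝑎 = 0.9, and the sample size are all assumptions chosen for this example. The correlation between 𝑋𝑡 and 𝑋𝑡+𝑘 is 𝑎ᵏ, which falls to zero as 𝑘 grows, and the population mean is zero.

```python
import numpy as np

rng = np.random.default_rng(1)

a = 0.9          # autoregressive coefficient, |a| < 1 (illustrative)
n = 200_000      # length of the simulated sequence

shocks = rng.standard_normal(n)
X = np.empty(n)
# Start from the stationary distribution N(0, 1 / (1 - a^2))
X[0] = shocks[0] / np.sqrt(1 - a**2)
for t in range(1, n):
    X[t] = a * X[t - 1] + shocks[t]

# Sample mean after each observation
running_mean = np.cumsum(X) / np.arange(1, n + 1)

for m in (100, 10_000, n):
    print(f"sample mean after {m:7d} observations: {running_mean[m - 1]: .4f}")
```

Even though neighboring observations are strongly correlated, the sample mean still settles near the population mean of zero, only more slowly than it would for IID draws.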

9.3.3 Illustration
Let’s now illustrate the classical IID law of large numbers using simulation.
In particular, we aim to generate some sequences of IID random variables and plot the evolution of 𝑋̄ 𝑛 as 𝑛 increases.
Below is a figure that does just this.
It shows IID observations from three different distributions and plots 𝑋̄ 𝑛 against 𝑛 in each case.
The dots represent the underlying observations 𝑋𝑖 for 𝑖 = 1, … , 100.
In each of the three cases, convergence of 𝑋̄ 𝑛 to 𝜇 occurs as predicted.
n = 100

# Arbitrary collection of distributions
distributions = {"student's t with 10 degrees of freedom": t(10),
                 "β(2, 2)": beta(2, 2),
                 "lognormal LN(0, 1/2)": lognorm(0.5),
                 "γ(5, 1/2)": gamma(5, scale=2),
                 "poisson(4)": poisson(4),
                 # scale = 1/λ; a bare positional argument would set loc instead
                 "exponential with λ = 1": expon(scale=1)}

# Create a figure and some axes
num_plots = 3
fig, axes = plt.subplots(num_plots, 1, figsize=(10, 20))

# Set some plotting parameters to improve layout
bbox = (0., 1.02, 1., .102)
legend_args = {'ncol': 2,
               'bbox_to_anchor': bbox,
               'loc': 3,
               'mode': 'expand'}
plt.subplots_adjust(hspace=0.5)

for ax in axes:
    # Choose a randomly selected distribution
    name = random.choice(list(distributions.keys()))
    distribution = distributions.pop(name)

    # Generate n draws from the distribution
    data = distribution.rvs(n)

    # Compute sample mean at each n
    sample_mean = np.empty(n)
    for i in range(n):
        sample_mean[i] = np.mean(data[:i+1])

    # Plot
    ax.plot(list(range(n)), data, 'o', color='grey', alpha=0.5)
    axlabel = r'$\bar{X}_n$ for $X_i \sim$' + name
    ax.plot(list(range(n)), sample_mean, 'g-', lw=3, alpha=0.6, label=axlabel)
    m = distribution.mean()
    ax.plot(list(range(n)), [m] * n, 'k--', lw=1.5, label=r'$\mu$')
    ax.vlines(list(range(n)), m, data, lw=0.2)
    ax.legend(**legend_args, fontsize=12)

plt.show()


The three distributions are chosen at random from a selection stored in the dictionary distributions.
9.4 CLT
Next, we turn to the central limit theorem, which tells us about the distribution of the deviation between sample averages
and population means.
9.4.1 Statement of the Theorem
The central limit theorem is one of the most remarkable results in all of mathematics.
In the classical IID setting, it tells us the following:
If the sequence 𝑋1 , … , 𝑋𝑛 is IID, with common mean 𝜇 and common variance $\sigma^2 \in (0, \infty)$, then
$$
\sqrt{n}(\bar{X}_n - \mu) \stackrel{d}{\to} N(0, \sigma^2) \quad \text{as} \quad n \to \infty \tag{9.5}
$$
Here $\stackrel{d}{\to} N(0, \sigma^2)$ indicates convergence in distribution to a centered (i.e., zero mean) normal with standard deviation 𝜎.
9.4.2 Intuition
The striking implication of the CLT is that for any distribution with finite second moment, the simple operation of adding
independent copies always leads to a Gaussian curve.
A relatively simple proof of the central limit theorem can be obtained by working with characteristic functions (see, e.g.,
theorem 9.5.6 of [Dudley, 2002]).
The proof is elegant but almost anticlimactic, and it provides surprisingly little intuition.
In fact, all of the proofs of the CLT that we know are similar in this respect.
Why does adding independent copies produce a bell-shaped distribution?
Part of the answer can be obtained by investigating the addition of independent Bernoulli random variables.
In particular, let 𝑋𝑖 be binary, with ℙ{𝑋𝑖 = 0} = ℙ{𝑋𝑖 = 1} = 0.5, and let 𝑋1 , … , 𝑋𝑛 be independent.
Think of 𝑋𝑖 = 1 as a “success”, so that $Y_n = \sum_{i=1}^n X_i$ is the number of successes in 𝑛 trials.
The next figure plots the probability mass function of 𝑌𝑛 for 𝑛 = 1, 2, 4, 8.

# 2 x 2 grid of axes, one panel per value of n
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
plt.subplots_adjust(hspace=0.4)
axes = axes.flatten()
ns = [1, 2, 4, 8]
dom = list(range(9))

for ax, n in zip(axes, ns):
    b = binom(n, 0.5)
    ax.bar(dom, b.pmf(dom), alpha=0.6, align='center')
    ax.set(xlim=(-0.5, 8.5), ylim=(0, 0.55),
           xticks=list(range(9)), yticks=(0, 0.2, 0.4),
           title=f'$n = {n}$')

plt.show()

When 𝑛 = 1, the distribution is flat — one success or no successes have the same probability.
When 𝑛 = 2 we can either have 0, 1 or 2 successes.
Notice the peak in probability mass at the mid-point 𝑘 = 1.
The reason is that there are more ways to get 1 success (“fail then succeed” or “succeed then fail”) than to get zero or two
successes.
Moreover, the two trials are independent, so the outcomes “fail then succeed” and “succeed then fail” are just as likely as
the outcomes “fail then fail” and “succeed then succeed”.
(If there was positive correlation, say, then “succeed then fail” would be less likely than “succeed then succeed”)
Here, already we have the essence of the CLT: addition under independence leads probability mass to pile up in the middle
and thin out at the tails.
For 𝑛 = 4 and 𝑛 = 8 we again get a peak at the “middle” value (halfway between the minimum and the maximum
possible value).
The intuition is the same — there are simply more ways to get these middle outcomes.
If we continue, the bell-shaped curve becomes even more pronounced.
We are witnessing the binomial approximation of the normal distribution.
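The closeness of the binomial and normal shapes can be quantified. The sketch below takes 𝑛 = 100 fair trials (an arbitrary choice) and compares the probability mass function of 𝑌𝑛 with the density of a normal distribution that has the same mean 𝑛/2 and variance 𝑛/4.

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.5                  # illustrative number of trials, fair coin
k = np.arange(n + 1)             # possible numbers of successes

# Exact pmf of the number of successes
pmf = binom(n, p).pmf(k)

# Normal density with matching mean n*p and variance n*p*(1-p)
approx = norm(loc=n * p, scale=np.sqrt(n * p * (1 - p))).pdf(k)

# Largest pointwise gap between the pmf and the normal density
max_gap = np.max(np.abs(pmf - approx))
print(f"pmf at the mean: {pmf[n // 2]:.5f}")
print(f"normal density at the mean: {approx[n // 2]:.5f}")
print(f"largest gap: {max_gap:.2e}")
```

The gap is small relative to the peak height, and repeating the comparison with larger 𝑛 is a simple way to watch it shrink.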

9.4.3 Simulation 1
Since the CLT seems almost magical, running simulations that verify its implications is one good way to build intuition.
To this end, we now perform the following simulation:
1. Choose an arbitrary distribution 𝐹 for the underlying observations 𝑋𝑖 .
2. Generate independent draws of $Y_n := \sqrt{n}(\bar{X}_n - \mu)$.
3. Use these draws to compute some measure of their distribution, such as a histogram.
4. Compare the latter to $N(0, \sigma^2)$.
Here’s some code that does exactly this for the exponential distribution $F(x) = 1 - e^{-\lambda x}$.
(Please experiment with other choices of 𝐹 , but remember that, to conform with the conditions of the CLT, the distribution
must have a finite second moment.)
# Set parameters
n = 250                        # Choice of n
k = 100000                     # Number of draws of Y_n
distribution = expon(scale=2)  # Exponential distribution, λ = 1/2 (scale = 1/λ)
μ, s = distribution.mean(), distribution.std()

# Draw underlying RVs. Each row contains a draw of X_1,..,X_n
data = distribution.rvs((k, n))
# Compute mean of each row, producing k draws of \bar X_n
sample_means = data.mean(axis=1)
# Generate observations of Y_n
Y = np.sqrt(n) * (sample_means - μ)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
xmin, xmax = -3 * s, 3 * s
ax.set_xlim(xmin, xmax)
ax.hist(Y, bins=60, alpha=0.5, density=True)
xgrid = np.linspace(xmin, xmax, 200)
ax.plot(xgrid, norm.pdf(xgrid, scale=s), 'k-', lw=2, label=r'$N(0, \sigma^2)$')
ax.legend()
plt.show()

Notice the absence of for loops — every operation is vectorized, meaning that the major calculations are all shifted to
highly optimized C code.
The fit to the normal density is already tight and can be further improved by increasing n.
You can also experiment with other specifications of 𝐹 .
9.4.4 Simulation 2
Our next simulation is somewhat like the first, except that we aim to track the distribution of $Y_n := \sqrt{n}(\bar{X}_n - \mu)$ as 𝑛 increases.
In the simulation, we’ll be working with random variables having 𝜇 = 0.
Thus, when 𝑛 = 1, we have 𝑌1 = 𝑋1 , so the first distribution is just the distribution of the underlying random variable.
For 𝑛 = 2, the distribution of 𝑌2 is that of $(X_1 + X_2)/\sqrt{2}$, and so on.
What we expect is that, regardless of the distribution of the underlying random variable, the distribution of 𝑌𝑛 will smooth
out into a bell-shaped curve.
The next figure shows this process for 𝑋𝑖 ∼ 𝑓, where 𝑓 was specified as the convex combination of three different beta
densities.
(Taking a convex combination is an easy way to produce an irregular shape for 𝑓.)
In the figure, the closest density is that of 𝑌1 , while the furthest is that of 𝑌5 .
beta_dist = beta(2, 2)

def gen_x_draws(k):
    """
    Returns a flat array containing k independent draws from the
    distribution of X, the underlying random variable.  This distribution
    is itself a convex combination of three beta distributions.
    """
    bdraws = beta_dist.rvs((3, k))
    # Transform rows, so each represents a different distribution
    bdraws[0, :] -= 0.5
    bdraws[1, :] += 0.6
    bdraws[2, :] -= 1.1
    # Set X[i] = bdraws[j, i], where j is a random draw from {0, 1, 2}
    js = np.random.randint(0, 3, size=k)   # upper bound is exclusive
    X = bdraws[js, np.arange(k)]
    # Rescale, so that the random variable has zero mean and unit variance
    m, sigma = X.mean(), X.std()
    return (X - m) / sigma

nmax = 5
reps = 100000
ns = list(range(1, nmax + 1))

# Form a matrix Z such that each column is reps independent draws of X
Z = np.empty((reps, nmax))
for i in range(nmax):
    Z[:, i] = gen_x_draws(reps)
# Take cumulative sum across columns
S = Z.cumsum(axis=1)
# Divide j-th column by sqrt j
Y = (1 / np.sqrt(ns)) * S

# Plot
ax = plt.figure(figsize = (10, 6)).add_subplot(projection='3d')

a, b = -3, 3
gs = 100
xs = np.linspace(a, b, gs)

# Build verts
greys = np.linspace(0.3, 0.7, nmax)
verts = []
for n in ns:
    density = gaussian_kde(Y[:, n-1])
    ys = density(xs)
    verts.append(list(zip(xs, ys)))

poly = PolyCollection(verts, facecolors=[str(g) for g in greys])
poly.set_alpha(0.85)
ax.add_collection3d(poly, zs=ns, zdir='x')

ax.set(xlim3d=(1, nmax), xticks=(ns), ylabel='$Y_n$', zlabel='$p(y_n)$',
       xlabel=("n"), yticks=((-3, 0, 3)), ylim3d=(a, b),
       zlim3d=(0, 0.4), zticks=((0.2, 0.4)))
ax.invert_xaxis()
# Set the viewing angle: elevation 30 degrees, azimuth 45 degrees
ax.view_init(30, 45)
plt.show()

As expected, the distribution smooths out into a bell curve as 𝑛 increases.

We leave you to investigate the code above if you wish to know more.
If you run the file from the ordinary IPython shell, the figure should pop up in a window that you can rotate with your
mouse, giving different views on the density sequence.
9.4.5 The Multivariate Case
The law of large numbers and central limit theorem work just as nicely in multidimensional settings.
To state the results, let’s recall some elementary facts about random vectors.
A random vector X is just a sequence of 𝑘 random variables (𝑋1 , … , 𝑋𝑘 ).
Each realization of X is an element of ℝ𝑘 .
A collection of random vectors X1 , … , X𝑛 is called independent if, given any 𝑛 vectors x1 , … , x𝑛 in ℝ𝑘 , we have
ℙ{X1 ≤ x1 , … , X𝑛 ≤ x𝑛 } = ℙ{X1 ≤ x1 } × ⋯ × ℙ{X𝑛 ≤ x𝑛 }
(The vector inequality X ≤ x means that 𝑋𝑗 ≤ 𝑥𝑗 for 𝑗 = 1, … , 𝑘)

Let 𝜇𝑗 ∶= 𝔼[𝑋𝑗 ] for all 𝑗 = 1, … , 𝑘.

The expectation 𝔼[X] of X is defined to be the vector of expectations:
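$$
\mathbb{E}[\mathbf{X}] :=
\begin{pmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_k] \end{pmatrix}
=
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}
=: \mu
$$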
The variance-covariance matrix of random vector X is defined as
Var[X] ∶= 𝔼[(X − 𝜇)(X − 𝜇)′ ]
Expanding this out, we get
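$$
\mathrm{Var}[\mathbf{X}] =
\begin{pmatrix}
\mathbb{E}[(X_1 - \mu_1)(X_1 - \mu_1)] & \cdots & \mathbb{E}[(X_1 - \mu_1)(X_k - \mu_k)] \\
\mathbb{E}[(X_2 - \mu_2)(X_1 - \mu_1)] & \cdots & \mathbb{E}[(X_2 - \mu_2)(X_k - \mu_k)] \\
\vdots & \vdots & \vdots \\
\mathbb{E}[(X_k - \mu_k)(X_1 - \mu_1)] & \cdots & \mathbb{E}[(X_k - \mu_k)(X_k - \mu_k)]
\end{pmatrix}
$$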
The 𝑗, 𝑘-th term is the scalar covariance between 𝑋𝑗 and 𝑋𝑘 .

With this notation, we can proceed to the multivariate LLN and CLT.
Let X1 , … , X𝑛 be a sequence of independent and identically distributed random vectors, each one taking values in ℝ𝑘 .
Let 𝜇 be the vector 𝔼[X𝑖 ], and let Σ be the variance-covariance matrix of X𝑖 .
Interpreting vector addition and scalar multiplication in the usual way (i.e., pointwise), let
$$
\bar{\mathbf{X}}_n := \frac{1}{n} \sum_{i=1}^n \mathbf{X}_i
$$
In this setting, the LLN tells us that
ℙ {X̄ 𝑛 → 𝜇 as 𝑛 → ∞} = 1 (9.6)
Here X̄ 𝑛 → 𝜇 means that ‖X̄ 𝑛 − 𝜇‖ → 0, where ‖ ⋅ ‖ is the standard Euclidean norm.

The CLT tells us that, provided Σ is finite,
$$
\sqrt{n}(\bar{\mathbf{X}}_n - \mu) \stackrel{d}{\to} N(0, \Sigma) \quad \text{as} \quad n \to \infty \tag{9.7}
$$
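A small simulation gives a feel for (9.7). The sketch below is one possible setup, not the only one. It assumes two-dimensional vectors X𝑖 = (𝑈𝑖 , 𝑈𝑖 + 𝑉𝑖 ) with 𝑈𝑖 and 𝑉𝑖 independent and uniform on [−1, 1], so that 𝜇 = 0 and Σ has entries 1/3, 1/3, 1/3, 2/3. It then checks that the sample covariance of the scaled means is close to Σ and that the first coordinate lands within one standard deviation of zero about 68% of the time, as a normal variable would.

```python
import numpy as np

rng = np.random.default_rng(0)

n, reps = 200, 10_000    # sample size and number of replications (illustrative)

# Build IID random vectors X_i = (U_i, U_i + V_i)
U = rng.uniform(-1, 1, size=(reps, n))
V = rng.uniform(-1, 1, size=(reps, n))
X = np.stack((U, U + V), axis=-1)          # shape (reps, n, 2)

μ = np.zeros(2)
Σ = np.array([[1 / 3, 1 / 3],
              [1 / 3, 2 / 3]])             # population variance-covariance matrix

# One draw of sqrt(n) * (sample mean - μ) per replication
Y = np.sqrt(n) * (X.mean(axis=1) - μ)      # shape (reps, 2)

# Sample covariance across replications
Σ_hat = np.cov(Y, rowvar=False)

# Share of first-coordinate draws within one standard deviation of zero
share_within_1sd = np.mean(np.abs(Y[:, 0]) <= np.sqrt(Σ[0, 0]))

print(Σ_hat)
print(share_within_1sd)
```

The covariance match holds for any 𝑛 by construction. The 68% figure is what reflects the approach to normality.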
9.5 Exercises
Exercise 9.5.1
One very useful consequence of the central limit theorem is as follows.
Assume the conditions of the CLT as stated above.
If 𝑔 ∶ ℝ → ℝ is differentiable at 𝜇 and 𝑔′ (𝜇) ≠ 0, then
$$
\sqrt{n}\{g(\bar{X}_n) - g(\mu)\} \stackrel{d}{\to} N(0, g'(\mu)^2 \sigma^2) \quad \text{as} \quad n \to \infty \tag{9.8}
$$
This theorem is used frequently in statistics to obtain the asymptotic distribution of estimators — many of which can be
expressed as functions of sample means.
(These kinds of results are often said to use the “delta method”.)

The proof is based on a Taylor expansion of 𝑔 around the point 𝜇.

Taking the result as given, let the distribution 𝐹 of each 𝑋𝑖 be uniform on [0, 𝜋/2] and let 𝑔(𝑥) = sin(𝑥).
Derive the asymptotic distribution of $\sqrt{n} \{ g(\bar X_n) - g(\mu) \}$ and illustrate convergence in the same spirit as the program
discussed above.
What happens when you replace [0, 𝜋/2] with [0, 𝜋]?
What is the source of the problem?

Here is one solution
"""
Illustrates the delta method, a consequence of the central limit theorem.
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform, norm

# Set parameters
n = 250
replications = 100000
distribution = uniform(loc=0, scale=(np.pi / 2))
μ, s = distribution.mean(), distribution.std()

g = np.sin
g_prime = np.cos

# Generate obs of sqrt{n} (g(X_n) - g(μ))
data = distribution.rvs((replications, n))
sample_means = data.mean(axis=1)  # Compute mean of each row
error_obs = np.sqrt(n) * (g(sample_means) - g(μ))

# Plot
asymptotic_sd = g_prime(μ) * s
fig, ax = plt.subplots(figsize=(10, 6))
xmin = -3 * g_prime(μ) * s
xmax = -xmin
ax.set_xlim(xmin, xmax)
ax.hist(error_obs, bins=60, alpha=0.5, density=True)
xgrid = np.linspace(xmin, xmax, 200)
lb = r"$N(0, g'(\mu)^2 \sigma^2)$"
ax.plot(xgrid, norm.pdf(xgrid, scale=asymptotic_sd), 'k-', lw=2, label=lb)
ax.legend()

plt.show()

What happens when you replace [0, 𝜋/2] with [0, 𝜋]?
In this case, the mean 𝜇 of this distribution is 𝜋/2, and since 𝑔′ = cos, we have 𝑔′ (𝜇) = 0.
Hence the conditions of the delta theorem are not satisfied.
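The failure can also be seen numerically. The following sketch is an illustration, with sample sizes and the number of replications chosen arbitrarily. It repeats the experiment on [0, 𝜋] and records the standard deviation of the normalized errors for several values of 𝑛.

```python
import numpy as np
from scipy.stats import uniform

rng = np.random.default_rng(2)
replications = 5_000
distribution = uniform(loc=0, scale=np.pi)   # uniform on [0, π]
μ = distribution.mean()                      # equals π/2, where g' = cos is zero

spreads = {}
for n in (100, 400, 1600):
    data = distribution.rvs((replications, n), random_state=rng)
    sample_means = data.mean(axis=1)
    error_obs = np.sqrt(n) * (np.sin(sample_means) - np.sin(μ))
    spreads[n] = error_obs.std()             # spread of sqrt(n) {g(X̄_n) - g(μ)}

print(spreads)
```

Rather than settling at a positive value, the spread keeps shrinking as 𝑛 grows, which is consistent with a degenerate limit when 𝑔′(𝜇) = 0.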
Exercise 9.5.2
Here’s a result that’s often used in developing statistical tests, and is connected to the multivariate central limit theorem.
If you study econometric theory, you will see this result used again and again.
Assume the setting of the multivariate CLT discussed above, so that
1. X1 , … , X𝑛 is a sequence of IID random vectors, each taking values in ℝ𝑘 .
2. 𝜇 ∶= 𝔼[X𝑖 ], and Σ is the variance-covariance matrix of X𝑖 .
3. The convergence
$$
\sqrt{n} (\bar{\mathbf X}_n - \boldsymbol\mu) \stackrel{d}{\to} N(\mathbf 0, \Sigma)
\tag{9.9}
$$
is valid.
In a statistical setting, one often wants the right-hand side to be standard normal so that confidence intervals are easily
computed.
This normalization can be achieved on the basis of three observations.
First, if X is a random vector in ℝ𝑘 and A is constant and 𝑘 × 𝑘, then
Var[AX] = A Var[X]A′
Second, by the continuous mapping theorem, if $\mathbf Z_n \stackrel{d}{\to} \mathbf Z$ in ℝ𝑘 and A is constant and 𝑘 × 𝑘, then
$$
\mathbf A \mathbf Z_n \stackrel{d}{\to} \mathbf A \mathbf Z
$$

Third, if S is a 𝑘 × 𝑘 symmetric positive definite matrix, then there exists a symmetric positive definite matrix Q, called
the inverse square root of S, such that
QSQ′ = I
Here I is the 𝑘 × 𝑘 identity matrix.
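The third observation is easy to verify numerically. Here is a small sketch in which the matrix `S` is an assumed example; `scipy.linalg.sqrtm` and `scipy.linalg.inv` are the same functions mentioned in the hint below.

```python
import numpy as np
from scipy.linalg import sqrtm, inv

# Assumed example of a symmetric positive definite matrix
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])

Q = inv(sqrtm(S))        # inverse of the symmetric square root of S

is_symmetric = np.allclose(Q, Q.T)
is_identity = np.allclose(Q @ S @ Q.T, np.eye(2))
print(is_symmetric, is_identity)
```

Because `Q` is symmetric, `Q @ S @ Q.T` and `Q @ S @ Q` coincide.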

Putting these things together, your first exercise is to show that if Q is the inverse square root of Σ, then
$$
\mathbf Z_n := \sqrt{n} \mathbf Q (\bar{\mathbf X}_n - \boldsymbol\mu) \stackrel{d}{\to} \mathbf Z \sim N(\mathbf 0, \mathbf I)
$$
Applying the continuous mapping theorem one more time tells us that
$$
\| \mathbf Z_n \|^2 \stackrel{d}{\to} \| \mathbf Z \|^2
$$
Given the distribution of Z, we conclude that

$$
n \| \mathbf Q (\bar{\mathbf X}_n - \boldsymbol\mu) \|^2 \stackrel{d}{\to} \chi^2(k)
\tag{9.10}
$$
where 𝜒2 (𝑘) is the chi-squared distribution with 𝑘 degrees of freedom.

(Recall that 𝑘 is the dimension of X𝑖 , the underlying random vectors.)
Your second exercise is to illustrate the convergence in (9.10) with a simulation.
In doing so, let
$$
\mathbf X_i := \begin{pmatrix} W_i \\ U_i + W_i \end{pmatrix}
$$
where
• each 𝑊𝑖 is an IID draw from the uniform distribution on [−1, 1].
• each 𝑈𝑖 is an IID draw from the uniform distribution on [−2, 2].
• 𝑈𝑖 and 𝑊𝑖 are independent of each other.
Hint:
1. scipy.linalg.sqrtm(A) computes the square root of A. You still need to invert it.
2. You should be able to work out Σ from the preceding information.

First we want to verify the claim that
$$
\sqrt{n} \mathbf Q (\bar{\mathbf X}_n - \boldsymbol\mu) \stackrel{d}{\to} N(\mathbf 0, \mathbf I)
$$
This is straightforward given the facts presented in the exercise.

Let
$$
\mathbf Y_n := \sqrt{n} (\bar{\mathbf X}_n - \boldsymbol\mu)
\quad \text{and} \quad
\mathbf Y \sim N(\mathbf 0, \Sigma)
$$
By the multivariate CLT and the continuous mapping theorem, we have

$$
\mathbf Q \mathbf Y_n \stackrel{d}{\to} \mathbf Q \mathbf Y
$$

Since linear combinations of normal random variables are normal, the vector QY is also normal.
Its mean is clearly 0, and its variance-covariance matrix is
Var[QY] = QVar[Y]Q′ = QΣQ′ = I
In conclusion, $\mathbf Q \mathbf Y_n \stackrel{d}{\to} \mathbf Q \mathbf Y \sim N(\mathbf 0, \mathbf I)$, which is what we aimed to show.
Now we turn to the simulation exercise.
Our solution is as follows
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform, chi2
from scipy.linalg import inv, sqrtm

# Set parameters
n = 250
replications = 50000
dw = uniform(loc=-1, scale=2)  # Uniform(-1, 1)
du = uniform(loc=-2, scale=4)  # Uniform(-2, 2)
sw, su = dw.std(), du.std()
vw, vu = sw**2, su**2
Σ = ((vw, vw), (vw, vw + vu))
Σ = np.array(Σ)

# Compute Σ^{-1/2}
Q = inv(sqrtm(Σ))

# Generate observations of the normalized sample mean
error_obs = np.empty((2, replications))
for i in range(replications):
    # Generate one sequence of bivariate shocks
    X = np.empty((2, n))
    W = dw.rvs(n)
    U = du.rvs(n)
    # Construct the n observations of the random vector
    X[0, :] = W
    X[1, :] = W + U
    # Construct the i-th observation of Y_n (here μ = 0)
    error_obs[:, i] = np.sqrt(n) * X.mean(axis=1)

# Premultiply by Q and then take the squared norm
temp = Q @ error_obs
chisq_obs = np.sum(temp**2, axis=0)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
xmax = 8
ax.set_xlim(0, xmax)
xgrid = np.linspace(0, xmax, 200)
lb = "Chi-squared with 2 degrees of freedom"
ax.plot(xgrid, chi2.pdf(xgrid, 2), 'k-', lw=2, label=lb)
ax.legend()
ax.hist(chisq_obs, bins=50, density=True)

plt.show()



CHAPTER
TEN
TWO MEANINGS OF PROBABILITY
10.1 Overview
This lecture illustrates two distinct interpretations of a probability distribution

• A frequentist interpretation as relative frequencies anticipated to occur in a large i.i.d. sample
• A Bayesian interpretation as a personal opinion (about a parameter or list of parameters) after seeing a collection
of observations
We recommend watching this video about hypothesis testing within the frequentist approach
https://youtu.be/8JIe_cz6qGA
After you watch that video, please watch the following video on the Bayesian approach to constructing coverage intervals
https://youtu.be/Pahyv9i_X2k
After you are familiar with the material in these videos, this lecture uses the Socratic method to help consolidate your
understanding of the different questions that are answered by
• a frequentist confidence interval
• a Bayesian coverage interval
We do this by inviting you to write some Python code.
It would be especially useful if you tried doing this after each question that we pose for you, before proceeding to read
the rest of the lecture.
We provide our own answers as the lecture unfolds, but you’ll learn more if you try writing your own code before reading
and running ours.
Code for answering questions:
In addition to what’s in Anaconda, this lecture will deploy the following library:
pip install prettytable
To answer our coding questions, we’ll start with some imports
import numpy as np
import pandas as pd
import prettytable as pt
import matplotlib.pyplot as plt
from scipy.stats import binom
import scipy.stats as st
Empowered with these Python tools, we’ll now explore the two meanings described above.
10.2 Frequentist Interpretation
Consider the following classic example.

The random variable 𝑋 takes on possible values 𝑘 = 0, 1, 2, … , 𝑛 with probabilities
$$
\textrm{Prob}(X = k | \theta) = \left( \frac{n!}{k!(n-k)!} \right) \theta^k (1 - \theta)^{n-k}
$$
where the fixed parameter 𝜃 ∈ (0, 1).

This is called the binomial distribution.
Here
• 𝜃 is the probability that one toss of a coin will be a head, an outcome that we encode as 𝑌 = 1.
• 1 − 𝜃 is the probability that one toss of the coin will be a tail, an outcome that we denote 𝑌 = 0.
• 𝑋 is the total number of heads that came up after flipping the coin 𝑛 times.
Consider the following experiment:
Take 𝐼 independent sequences of 𝑛 independent flips of the coin
Notice the repeated use of the adjective independent:
• we use it once to describe that we are drawing 𝑛 independent times from a Bernoulli distribution with parameter
𝜃 to arrive at one draw from a Binomial distribution with parameters 𝜃, 𝑛.
• we use it again to describe that we are then drawing 𝐼 sequences of 𝑛 coin draws.
Let $y_h^i \in \{0, 1\}$ be the realized value of 𝑌 on the ℎth flip during the 𝑖th sequence of flips.
Let $\sum_{h=1}^n y_h^i$ denote the total number of times heads come up during the 𝑖th sequence of 𝑛 independent coin flips.
Let $f_k^I$ record the fraction of samples of length 𝑛 for which $\sum_{h=1}^n y_h^i = k$:
$$
f_k^I = \frac{\text{number of samples of length } n \text{ for which } \sum_{h=1}^n y_h^i = k}{I}
$$
The probability Prob(𝑋 = 𝑘|𝜃) answers the following question:
• As 𝐼 becomes large, in what fraction of 𝐼 independent draws of 𝑛 coin flips should we anticipate 𝑘 heads to occur?
As usual, a law of large numbers justifies this answer.
Exercise 10.2.1
1. Please write a Python class to compute 𝑓𝑘𝐼
2. Please use your code to compute 𝑓𝑘𝐼 , 𝑘 = 0, … , 𝑛 and compare them to Prob(𝑋 = 𝑘|𝜃) for various values of 𝜃, 𝑛
and 𝐼
3. With the Law of Large numbers in mind, use your code to say something

Here is one solution:

class frequentist:

    def __init__(self, θ, n, I):
        '''
        initialization
        -----------------
        parameters:
        θ : probability that one toss of a coin will be a head with Y = 1
        n : number of independent flips in each independent sequence of draws
        I : number of independent sequence of draws
        '''
        self.θ, self.n, self.I = θ, n, I

    def binomial(self, k):
        '''compute the theoretical probability for specific input k'''
        θ, n = self.θ, self.n
        self.k = k
        self.P = binom.pmf(k, n, θ)

    def draw(self):
        '''draw n independent flips for I independent sequences'''
        θ, n, I = self.θ, self.n, self.I

        sample = np.random.rand(I, n)
        Y = (sample <= θ) * 1
        self.Y = Y

    def compute_fk(self, kk):
        '''compute f_{k}^I for specific input k'''
        Y, I = self.Y, self.I
        K = np.sum(Y, 1)
        f_kI = np.sum(K == kk) / I
        self.f_kI = f_kI
        self.kk = kk

    def compare(self):
        '''compute and print the comparison'''
        n = self.n
        comp = pt.PrettyTable()
        comp.field_names = ['k', 'Theoretical', 'Frequentist']
        self.draw()
        for i in range(n):
            self.binomial(i+1)
            self.compute_fk(i+1)
            comp.add_row([i+1, self.P, self.f_kI])
        print(comp)

θ, n, k, I = 0.7, 20, 10, 1_000_000
freq = frequentist(θ, n, I)
freq.compare()
+----+------------------------+-------------+
| k | Theoretical | Frequentist |
+----+------------------------+-------------+
| 1 | 1.6271660538000033e-09 | 0.0 |
| 2 | 3.606884752589999e-08 | 0.0 |
| 3 | 5.04963865362601e-07 | 2e-06 |
| 4 | 5.007558331512455e-06 | 3e-06 |
| 5 | 3.7389768875293014e-05 | 4.9e-05 |
| 6 | 0.00021810698510587546 | 0.000211 |
| 7 | 0.001017832597160754 | 0.001035 |
| 8 | 0.003859281930901185 | 0.003907 |
| 9 | 0.012006654896137007 | 0.011892 |
| 10 | 0.030817080900085007 | 0.03103 |
| 11 | 0.06536956554563476 | 0.065302 |
| 12 | 0.11439673970486108 | 0.11459 |
| 13 | 0.1642619852172365 | 0.164278 |
| 14 | 0.19163898275344252 | 0.191064 |
| 15 | 0.17886305056987967 | 0.179323 |
| 16 | 0.1304209743738704 | 0.130184 |
| 17 | 0.07160367220526209 | 0.071683 |
| 18 | 0.027845872524268643 | 0.027709 |
| 19 | 0.006839337111223895 | 0.006971 |
| 20 | 0.0007979226629761189 | 0.000767 |
+----+------------------------+-------------+
From the table above, can you see the law of large numbers at work?
Let’s do some more calculations.

Comparison with different 𝜃
Now we fix
𝑛 = 20, 𝑘 = 10, 𝐼 = 1,000,000
We’ll vary 𝜃 from 0.01 to 0.99 and plot outcomes against 𝜃.
θ_low, θ_high, npt = 0.01, 0.99, 50
thetas = np.linspace(θ_low, θ_high, npt)
P = []
f_kI = []
for i in range(npt):
    freq = frequentist(thetas[i], n, I)
    freq.binomial(k)
    freq.draw()
    freq.compute_fk(k)
    P.append(freq.P)
    f_kI.append(freq.f_kI)

fig, ax = plt.subplots(figsize=(8, 6))
ax.grid()
ax.plot(thetas, P, 'k-.', label='Theoretical')
ax.plot(thetas, f_kI, 'r--', label='Fraction')
plt.title(r'Comparison with different $\theta$', fontsize=16)
plt.xlabel(r'$\theta$', fontsize=15)
plt.ylabel('Fraction', fontsize=15)
plt.tick_params(labelsize=13)
plt.legend()
plt.show()
Comparison with different 𝑛

Now we fix 𝜃 = 0.7, 𝑘 = 10, 𝐼 = 1,000,000 and vary 𝑛 from 1 to 100.
Then we’ll plot outcomes.
n_low, n_high, nn = 1, 100, 50
ns = np.linspace(n_low, n_high, nn, dtype='int')
P = []
f_kI = []
for i in range(nn):
    freq = frequentist(θ, ns[i], I)
    freq.binomial(k)
    freq.draw()
    freq.compute_fk(k)
    P.append(freq.P)
    f_kI.append(freq.f_kI)

fig, ax = plt.subplots(figsize=(8, 6))
ax.grid()
ax.plot(ns, P, 'k-.', label='Theoretical')
ax.plot(ns, f_kI, 'r--', label='Frequentist')
plt.title(r'Comparison with different $n$', fontsize=16)
plt.xlabel(r'$n$', fontsize=15)
plt.legend()
plt.show()
Comparison with different 𝐼

Now we fix 𝜃 = 0.7, 𝑛 = 20, 𝑘 = 10 and vary log10(𝐼) from 2 to 6.
I_log_low, I_log_high, nI = 2, 6, 200
log_Is = np.linspace(I_log_low, I_log_high, nI)
Is = np.power(10, log_Is).astype(int)
P = []
f_kI = []
for i in range(nI):
    freq = frequentist(θ, n, Is[i])
    freq.binomial(k)
    freq.draw()
    freq.compute_fk(k)
    P.append(freq.P)
    f_kI.append(freq.f_kI)

fig, ax = plt.subplots(figsize=(8, 6))
ax.grid()
ax.plot(Is, P, 'k-.', label='Theoretical')
ax.plot(Is, f_kI, 'r--', label='Fraction')
plt.title(r'Comparison with different $I$', fontsize=16)
plt.xlabel(r'$I$', fontsize=15)
plt.legend()
plt.show()
From the above graphs, we can see that 𝐼, the number of independent sequences, plays an important role.
When 𝐼 becomes larger, the difference between theoretical probability and frequentist estimate becomes smaller.

Also, as long as 𝐼 is large enough, changing 𝜃 or 𝑛 does not substantially change the accuracy of the observed fraction as
an approximation of Prob(𝑋 = 𝑘|𝜃).
The Law of Large Numbers is at work here.
For each draw of an independent sequence, Prob(𝑋𝑖 = 𝑘|𝜃) is the same, so aggregating all draws forms an i.i.d. sequence
of a binary random variable 𝜌𝑘,𝑖 , 𝑖 = 1, 2, … , 𝐼, with a mean of Prob(𝑋 = 𝑘|𝜃) and a variance of
$$
\textrm{Prob}(X = k | \theta) \cdot (1 - \textrm{Prob}(X = k | \theta)).
$$
So, by the LLN, the average of 𝜌𝑘,𝑖 converges to
$$
E[\rho_{k,i}] = \textrm{Prob}(X = k | \theta) = \left( \frac{n!}{k!(n-k)!} \right) \theta^k (1 - \theta)^{n-k}
$$
as 𝐼 goes to infinity.
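The indicator-variable argument can be sketched in a few lines. This is an illustration rather than the lecture's own code: it draws 𝑋 directly with NumPy's binomial sampler instead of flipping coins one by one, and the names `ρ` and `std_err` are introduced here.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)
θ, n, k, I = 0.7, 20, 10, 200_000

p = binom.pmf(k, n, θ)                 # Prob(X = k | θ)
X = rng.binomial(n, θ, size=I)         # I independent draws of X
ρ = (X == k).astype(float)             # indicator that a draw equals k

f_kI = ρ.mean()                        # fraction of draws with X = k
std_err = np.sqrt(p * (1 - p) / I)     # standard deviation of that fraction

print(p, f_kI, std_err)
```

Since the standard deviation of the fraction shrinks like one over the square root of 𝐼, quadrupling 𝐼 roughly halves the typical gap between `f_kI` and `p`.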
10.3 Bayesian Interpretation
We again use a binomial distribution.

But now we don’t regard 𝜃 as being a fixed number.
Instead, we think of it as a random variable.
𝜃 is described by a probability distribution.
But now this probability distribution means something different than a relative frequency that we can anticipate to occur
in a large i.i.d. sample.
Instead, the probability distribution of 𝜃 is now a summary of our views about likely values of 𝜃 either
• before we have seen any data at all, or
• after we have seen some data, but before we have seen more data
Thus, suppose that, before seeing any data, you have a personal prior probability distribution saying that
$$
P(\theta) = \frac{\theta^{\alpha-1} (1 - \theta)^{\beta-1}}{B(\alpha, \beta)}
$$
where 𝐵(𝛼, 𝛽) is a beta function, so that 𝑃(𝜃) is a beta distribution with parameters 𝛼, 𝛽.
Exercise 10.3.1
a) Please write down the likelihood function for a sample of length 𝑛 from a binomial distribution with parameter 𝜃.
b) Please write down the posterior distribution for 𝜃 after observing one flip of the coin.
c) Now pretend that the true value of 𝜃 = .4 and that someone who doesn’t know this has a beta prior distribution with
parameters 𝛽 = 𝛼 = .5. Please write a Python class to simulate this person’s personal posterior distribution for 𝜃
for a single sequence of 𝑛 draws.
d) Please plot the posterior distribution for 𝜃 as a function of 𝜃 as 𝑛 grows as 1, 2, ….
e) For various 𝑛’s, please describe and compute a Bayesian coverage interval for the interval [.45, .55].
f) Please tell what question a Bayesian coverage interval answers.
g) Please compute the posterior probability that 𝜃 ∈ [.45, .55] for various values of sample size 𝑛.

h) Please use your Python class to study what happens to the posterior distribution as 𝑛 → +∞, again assuming that the
true value of 𝜃 = .4, though it is unknown to the person doing the updating via Bayes’ Law.

a) Please write down the likelihood function and the posterior distribution for 𝜃 after observing one flip of our coin.
Suppose the outcome is Y.
The likelihood function is:
$$
L(Y | \theta) = \textrm{Prob}(X = Y | \theta) = \theta^Y (1 - \theta)^{1-Y}
$$
b) Please write the posterior distribution for 𝜃 after observing one flip of our coin.
The prior distribution is
$$
\textrm{Prob}(\theta) = \frac{\theta^{\alpha-1} (1 - \theta)^{\beta-1}}{B(\alpha, \beta)}
$$
We can derive the posterior distribution for 𝜃 via
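$$
\begin{aligned}
\textrm{Prob}(\theta | Y)
&= \frac{\textrm{Prob}(Y | \theta) \textrm{Prob}(\theta)}{\textrm{Prob}(Y)} \\
&= \frac{\textrm{Prob}(Y | \theta) \textrm{Prob}(\theta)}{\int_0^1 \textrm{Prob}(Y | \theta) \textrm{Prob}(\theta) d\theta} \\
&= \frac{\theta^Y (1 - \theta)^{1-Y} \frac{\theta^{\alpha-1} (1 - \theta)^{\beta-1}}{B(\alpha, \beta)}}{\int_0^1 \theta^Y (1 - \theta)^{1-Y} \frac{\theta^{\alpha-1} (1 - \theta)^{\beta-1}}{B(\alpha, \beta)} d\theta} \\
&= \frac{\theta^{Y+\alpha-1} (1 - \theta)^{1-Y+\beta-1}}{\int_0^1 \theta^{Y+\alpha-1} (1 - \theta)^{1-Y+\beta-1} d\theta}
\end{aligned}
$$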
which means that
Prob(𝜃|𝑌 ) ∼ Beta(𝛼 + 𝑌 , 𝛽 + (1 − 𝑌 ))
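As a sanity check on this result, one can apply Bayes' law numerically on a grid of 𝜃 values and compare with the closed-form beta posterior. The sketch below is illustrative only: the prior parameters 𝛼 = 2, 𝛽 = 3 and the outcome 𝑌 = 1 are assumptions chosen so that all densities are bounded on the grid.

```python
import numpy as np
import scipy.stats as st

α, β = 2.0, 3.0          # assumed prior parameters, chosen for illustration
Y = 1                    # assumed outcome of the single flip (a head)

θ_grid = np.linspace(0, 1, 2_001)
dθ = θ_grid[1] - θ_grid[0]

prior = st.beta(α, β).pdf(θ_grid)
likelihood = θ_grid**Y * (1 - θ_grid)**(1 - Y)

# Bayes' law on a grid: normalize likelihood × prior so it integrates to one
unnormalized = likelihood * prior
posterior_grid = unnormalized / (unnormalized.sum() * dθ)

# Closed form implied by the derivation above
posterior_exact = st.beta(α + Y, β + 1 - Y).pdf(θ_grid)

max_gap = np.max(np.abs(posterior_grid - posterior_exact))
print(max_gap)
```

The gap between the two densities should be negligible, reflecting only the discretization of the integral in the denominator.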
Now please pretend that the true value of 𝜃 = .4 and that someone who doesn’t know this has a beta prior with 𝛽 = 𝛼 = .5.
c) Now pretend that the true value of 𝜃 = .4 and that someone who doesn’t know this has a beta prior distribution with
parameters 𝛽 = 𝛼 = .5. Please write a Python class to simulate this person’s personal posterior distribution for 𝜃
for a single sequence of 𝑛 draws.
class Bayesian:

    def __init__(self, θ=0.4, n=1_000_000, α=0.5, β=0.5):
        """
        Parameters:
        ----------
        θ : float, ranging from [0,1].
            probability that one toss of a coin will be a head with Y = 1
        n : int.
            number of independent flips in an independent sequence of draws
        α&β : int or float.
            parameters of the prior distribution on θ
        """
        self.θ, self.n, self.α, self.β = θ, n, α, β
        self.prior = st.beta(α, β)

    def draw(self):
        """
        simulate a single sequence of draws of length n, given probability θ
        """
        array = np.random.rand(self.n)
        self.draws = (array < self.θ).astype(int)

    def form_single_posterior(self, step_num):
        """
        form a posterior distribution after observing the first step_num elements of the draws

        Parameters
        ----------
        step_num: int.
            number of steps observed to form a posterior distribution

        Returns
        ------
        the posterior distribution for sake of plotting in the subsequent steps
        """
        heads_num = self.draws[:step_num].sum()
        tails_num = step_num - heads_num
        return st.beta(self.α+heads_num, self.β+tails_num)

    def form_posterior_series(self, num_obs_list):
        """
        form a series of posterior distributions that form after observing different number of draws.

        Parameters
        ----------
        num_obs_list: a list of int.
            a list of the number of observations used to form a series of posterior distributions.
        """
        self.posterior_list = []
        for num in num_obs_list:
            self.posterior_list.append(self.form_single_posterior(num))
d) Please plot the posterior distribution for 𝜃 as a function of 𝜃 as 𝑛 grows from 1, 2, ….
Bay_stat = Bayesian()
Bay_stat.draw()

num_list = [1, 2, 3, 4, 5, 10, 20, 30, 50, 70, 100, 300, 500, 1000,  # this line for finite n
            5000, 10_000, 50_000, 100_000, 200_000, 300_000]  # this line for approximately infinite n
Bay_stat.form_posterior_series(num_list)

θ_values = np.linspace(0.01, 1, 100)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(θ_values, Bay_stat.prior.pdf(θ_values), label='Prior Distribution', color='k',
        linestyle='--')
for ii, num in enumerate(num_list[:14]):
    ax.plot(θ_values, Bay_stat.posterior_list[ii].pdf(θ_values),
            label='Posterior with n = %d' % num)

ax.set_title('P.D.F of Posterior Distributions', fontsize=15)
ax.set_xlabel(r"$\theta$", fontsize=15)
ax.legend(fontsize=11)
plt.show()
e) For various 𝑛’s, please describe and compute .05 and .95 quantiles for posterior probabilities.
lower_bound = [ii.ppf(0.05) for ii in Bay_stat.posterior_list[:14]]
upper_bound = [ii.ppf(0.95) for ii in Bay_stat.posterior_list[:14]]

interval_df = pd.DataFrame()
interval_df['lower'] = lower_bound
interval_df['upper'] = upper_bound
interval_df.index = num_list[:14]
interval_df = interval_df.T
interval_df

              1         2         3        4         5        10        20  \
lower  0.228520  0.097308  0.062413  0.16528  0.260634  0.347322  0.280091
upper  0.998457  0.902692  0.764466  0.83472  0.872224  0.814884  0.629953

             30        50        70       100       300       500      1000
lower  0.293487  0.329116  0.389167  0.418512  0.373839  0.391977  0.393532
upper  0.582293  0.555887  0.583119  0.581488  0.467296  0.464637  0.444813
As 𝑛 increases, we can see that Bayesian coverage intervals narrow and move toward 0.4.
f) Please tell what question a Bayesian coverage interval answers.
The Bayesian coverage interval tells the range of 𝜃 that corresponds to the [𝑝1 , 𝑝2 ] quantiles of the cumulative
distribution function (CDF) of the posterior distribution.
To construct the coverage interval we first compute a posterior distribution of the unknown parameter 𝜃.
If the CDF is 𝐹 (𝜃), then the Bayesian coverage interval [𝑎, 𝑏] for the interval [𝑝1 , 𝑝2 ] is described by
𝐹 (𝑎) = 𝑝1 , 𝐹 (𝑏) = 𝑝2
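In practice the two equations are solved with the inverse CDF. Here is a minimal sketch, assuming for illustration a posterior formed from the Beta(0.5, 0.5) prior after 40 heads in 100 flips; SciPy calls the inverse CDF `ppf`.

```python
import scipy.stats as st

# Assumed posterior: Beta(0.5 + 40, 0.5 + 60), i.e. 40 heads in 100 flips
posterior = st.beta(0.5 + 40, 0.5 + 60)
p1, p2 = 0.05, 0.95

# Solve F(a) = p1 and F(b) = p2 with the inverse CDF
a, b = posterior.ppf(p1), posterior.ppf(p2)

coverage = posterior.cdf(b) - posterior.cdf(a)   # posterior probability of [a, b]
print(a, b, coverage)
```

By construction the posterior probability that 𝜃 lies in [𝑎, 𝑏] is 𝑝2 − 𝑝1, which is 0.9 in this example.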
g) Please compute the posterior probability that 𝜃 ∈ [.45, .55] for various values of sample size 𝑛.
left_value, right_value = 0.45, 0.55

posterior_prob_list = [ii.cdf(right_value) - ii.cdf(left_value)
                       for ii in Bay_stat.posterior_list]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(posterior_prob_list)
ax.set_title('Posterior Probability that ' + r"$\theta$" + ' Ranges from %.2f to %.2f'
             % (left_value, right_value),
             fontsize=13)
ax.set_xticks(np.arange(0, len(posterior_prob_list), 3))
ax.set_xticklabels(num_list[::3])
ax.set_xlabel('Number of Observations', fontsize=11)
plt.show()

Notice that in the graph above the posterior probability that 𝜃 ∈ [.45, .55] typically exhibits a hump shape as 𝑛 increases.
Two opposing forces are at work.
The first force is that the individual adjusts his belief as he observes new outcomes, so his posterior probability distribution
becomes more and more realistic, which explains the rise of the posterior probability.
However, [.45, .55] actually excludes the true 𝜃 = .4 that generates the data.
As a result, the posterior probability drops as larger and larger samples refine his posterior probability distribution of 𝜃.
The descent seems precipitous only because of the scale of the graph that has the number of observations increasing
disproportionately.
When the number of observations becomes large enough, our Bayesian becomes so confident about 𝜃 that he considers
𝜃 ∈ [.45, .55] very unlikely.
That is why we see a nearly horizontal line when the number of observations exceeds 500.
h) Please use your Python class to study what happens to the posterior distribution as 𝑛 → +∞, again assuming that the
true value of 𝜃 = .4, though it is unknown to the person doing the updating via Bayes’ Law.
Using the Python class we made above, we can see the evolution of posterior distributions as 𝑛 approaches infinity.
fig, ax = plt.subplots(figsize=(10, 6))

for ii, num in enumerate(num_list[14:]):
    ii += 14
    ax.plot(θ_values, Bay_stat.posterior_list[ii].pdf(θ_values),
            label='Posterior with n=%d thousand' % (num/1000))

ax.set_title('P.D.F of Posterior Distributions', fontsize=15)
ax.set_xlabel(r"$\theta$", fontsize=15)
ax.set_xlim(0.3, 0.5)
ax.legend(fontsize=11)
plt.show()
As 𝑛 increases, we can see that the probability density functions concentrate on 0.4, the true value of 𝜃.
Here the posterior means converge to 0.4 while the posterior standard deviations converge to 0 from above.
To show this, we compute the means and standard deviations of the posterior distributions.
mean_list = [ii.mean() for ii in Bay_stat.posterior_list]
std_list = [ii.std() for ii in Bay_stat.posterior_list]

fig, ax = plt.subplots(1, 2, figsize=(14, 5))

ax[0].plot(mean_list)
ax[0].set_title('Mean Values of Posterior Distribution', fontsize=13)
ax[0].set_xticks(np.arange(0, len(mean_list), 3))
ax[0].set_xticklabels(num_list[::3])
ax[0].set_xlabel('Number of Observations', fontsize=11)

ax[1].plot(std_list)
ax[1].set_title('Standard Deviations of Posterior Distribution', fontsize=13)
ax[1].set_xticks(np.arange(0, len(std_list), 3))
ax[1].set_xticklabels(num_list[::3])
ax[1].set_xlabel('Number of Observations', fontsize=11)

plt.show()
How shall we interpret the patterns above?

The answer is encoded in the Bayesian updating formulas.
It is natural to extend the one-step Bayesian update to an 𝑛-step Bayesian update.
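$$
\begin{aligned}
\textrm{Prob}(\theta | k)
&= \frac{\textrm{Prob}(\theta, k)}{\textrm{Prob}(k)}
 = \frac{\textrm{Prob}(k | \theta) \textrm{Prob}(\theta)}{\textrm{Prob}(k)}
 = \frac{\textrm{Prob}(k | \theta) \textrm{Prob}(\theta)}{\int_0^1 \textrm{Prob}(k | \theta) \textrm{Prob}(\theta) d\theta} \\
&= \frac{\binom{N}{k} (1 - \theta)^{N-k} \theta^k \frac{\theta^{\alpha-1} (1 - \theta)^{\beta-1}}{B(\alpha, \beta)}}{\int_0^1 \binom{N}{k} (1 - \theta)^{N-k} \theta^k \frac{\theta^{\alpha-1} (1 - \theta)^{\beta-1}}{B(\alpha, \beta)} d\theta} \\
&= \frac{(1 - \theta)^{\beta+N-k-1} \theta^{\alpha+k-1}}{\int_0^1 (1 - \theta)^{\beta+N-k-1} \theta^{\alpha+k-1} d\theta} \\
&= \textrm{Beta}(\alpha + k, \beta + N - k)
\end{aligned}
$$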
A beta distribution with parameters 𝛼 and 𝛽 has the following mean and variance.
The mean is $\frac{\alpha}{\alpha + \beta}$
The variance is $\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$
• 𝛼 can be viewed as the number of successes

• 𝛽 can be viewed as the number of failures
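These formulas are easy to confirm against SciPy. The sketch below also evaluates them along a hypothetical path in which exactly 40 percent of the 𝑁 flips are heads; that idealized path is an assumption made for illustration, not a simulation.

```python
import scipy.stats as st

α, β = 0.5, 0.5          # prior parameters used in the exercise

rows = []
for N in (10, 100, 1_000, 10_000):
    k = 2 * N // 5                       # idealized count of heads (40 percent)
    a, b = α + k, β + N - k              # posterior is Beta(a, b)
    mean = a / (a + b)
    var = a * b / ((a + b)**2 * (a + b + 1))
    dist = st.beta(a, b)
    rows.append((N, mean, var, dist.mean(), dist.var()))

for row in rows:
    print(row)
```

Along this path the posterior mean approaches 0.4 and the posterior variance shrinks at a rate of roughly 1/𝑁.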
The random variables 𝑘 and 𝑁 − 𝑘 are governed by a binomial distribution with 𝜃 = 0.4.
Call this the true data generating process.
According to the Law of Large Numbers, for a large number of observations, observed frequencies of 𝑘 and 𝑁 − 𝑘
will be described by the true data generating process, i.e., the population probability distribution that we assumed when
generating the observations on the computer. (See Exercise 10.2.1).
Consequently, the mean of the posterior distribution converges to 0.4 and the variance withers to zero.
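To make this step explicit, here is a short sketch based on the Beta(𝛼 + 𝑘, 𝛽 + 𝑁 − 𝑘) posterior and the beta mean and variance formulas stated earlier.

```latex
% Posterior mean: a weighted average of the prior mean and the sample frequency
\frac{\alpha + k}{\alpha + \beta + N}
  = \frac{\alpha + \beta}{\alpha + \beta + N} \cdot \frac{\alpha}{\alpha + \beta}
  + \frac{N}{\alpha + \beta + N} \cdot \frac{k}{N}

% Posterior variance: bounded above by a term of order 1/N
\frac{(\alpha + k)(\beta + N - k)}{(\alpha + \beta + N)^2 (\alpha + \beta + N + 1)}
  \le \frac{1}{4 (\alpha + \beta + N + 1)}
```

As 𝑁 grows, the weight on the prior mean vanishes while 𝑘/𝑁 converges to 0.4 by the LLN, and the variance bound goes to zero.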

upper_bound = [ii.ppf(0.95) for ii in Bay_stat.posterior_list]
lower_bound = [ii.ppf(0.05) for ii in Bay_stat.posterior_list]

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(np.arange(len(upper_bound)), upper_bound, label='95 th Quantile')
ax.scatter(np.arange(len(lower_bound)), lower_bound, label='05 th Quantile')

ax.set_xticks(np.arange(0, len(upper_bound), 2))
ax.set_xticklabels(num_list[::2])
ax.set_xlabel('Number of Observations', fontsize=12)
ax.set_title('Bayesian Coverage Intervals of Posterior Distributions', fontsize=15)
ax.legend(fontsize=11)
plt.show()
After observing a large number of outcomes, the posterior distribution collapses around 0.4.
Thus, the Bayesian statistician comes to believe that 𝜃 is near .4.
As shown in the figure above, as the number of observations grows, the Bayesian coverage intervals (BCIs) become
narrower and narrower around 0.4.
However, if you take a closer look, you will find that the centers of the BCIs are not exactly 0.4, due to the persistent
influence of the prior distribution and the randomness of the simulation path.

10.4 Role of a Conjugate Prior
We have made assumptions that link functional forms of our likelihood function and our prior in a way that has eased our
calculations considerably.
In particular, our assumptions that the likelihood function is binomial and that the prior distribution is a beta distribution
have the consequence that the posterior distribution implied by Bayes’ Law is also a beta distribution.
So posterior and prior are both beta distributions, albeit ones with different parameters.
When a likelihood function and prior fit together like hand and glove in this way, we can say that the prior and posterior
are conjugate distributions.
In this situation, we also sometimes say that we have a conjugate prior for the likelihood function Prob(𝑋|𝜃).
Typically, the functional form of the likelihood function determines the functional form of a conjugate prior.
A natural question to ask is why should a person’s personal prior about a parameter 𝜃 be restricted to be described by a
conjugate prior?
Why not some other functional form that more sincerely describes the person’s beliefs?
To be argumentative, one could ask, why should the form of the likelihood function have anything to say about my personal
beliefs about 𝜃?
A dignified response to that question is, well, it shouldn’t, but if you want to compute a posterior easily you’ll just be
happier if your prior is conjugate to your likelihood.
Otherwise, your posterior won’t have a convenient analytical form and you’ll be in the situation of wanting to apply the
Markov chain Monte Carlo techniques deployed in this quantecon lecture.
We also apply these powerful methods to approximating Bayesian posteriors for non-conjugate priors in this quantecon
lecture and this quantecon lecture
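For a one-dimensional parameter such as 𝜃, a simple alternative to Markov chain Monte Carlo is to approximate the posterior on a grid. The sketch below uses that swapped-in technique, not the MCMC methods of the linked lectures, and the triangular prior, sample size, and grid are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
θ_true, n = 0.4, 500
k = rng.binomial(n, θ_true)                  # number of heads observed

θ_grid = np.linspace(0.0005, 0.9995, 1_000)
dθ = θ_grid[1] - θ_grid[0]

# Assumed non-conjugate prior: a triangular density peaked at 0.5
prior = 1 - np.abs(2 * θ_grid - 1)
prior /= prior.sum() * dθ

# Work in logs to avoid underflow in θ^k (1 - θ)^(n - k)
log_post = k * np.log(θ_grid) + (n - k) * np.log(1 - θ_grid) + np.log(prior)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dθ                      # normalize to integrate to one

post_mean = (θ_grid * post).sum() * dθ
print(k / n, post_mean)
```

Grid methods like this become impractical as the number of parameters grows, which is one reason simulation-based methods are used in higher dimensions.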


CHAPTER
ELEVEN
MULTIVARIATE HYPERGEOMETRIC DISTRIBUTION
Contents
• Multivariate Hypergeometric Distribution

– Overview
– The Administrator’s Problem
– Usage
11.1 Overview
This lecture describes how an administrator deployed a multivariate hypergeometric distribution in order to assess
the fairness of a procedure for awarding research grants.
In the lecture we’ll learn about
• properties of the multivariate hypergeometric distribution
• first and second moments of a multivariate hypergeometric distribution
• using a Monte Carlo simulation of a multivariate normal distribution to evaluate the quality of a normal approximation
• the administrator’s problem and why the multivariate hypergeometric distribution is the right tool
11.2 The Administrator’s Problem
An administrator in charge of allocating research grants is in the following situation.

To help us forget details that are none of our business here and to protect the anonymity of the administrator and the
subjects, we call research proposals balls and continents of residence of authors of a proposal a color.
There are 𝐾𝑖 balls (proposals) of color 𝑖.
There are 𝑐 distinct colors (continents of residence).
Thus, 𝑖 = 1, 2, … , 𝑐
So there is a total of $N = \sum_{i=1}^c K_i$ balls.
All 𝑁 of these balls are placed in an urn.
Then 𝑛 balls are drawn randomly.

The selection procedure is supposed to be color blind meaning that ball quality, a random variable that is supposed to
be independent of ball color, governs whether a ball is drawn.
Thus, the selection procedure is supposed randomly to draw 𝑛 balls from the urn.
The 𝑛 balls drawn represent successful proposals and are awarded research funds.
The remaining 𝑁 − 𝑛 balls receive no research funds.
11.2.1 Details of the Awards Procedure Under Study
Let 𝑘𝑖 be the number of balls of color 𝑖 that are drawn.

Things have to add up so $\sum_{i=1}^c k_i = n$.
Under the hypothesis that the selection process judges proposals on their quality and that quality is independent of
the author’s continent of residence, the administrator views the outcome of the selection procedure as a random
vector
$$
X = \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_c \end{pmatrix}.
$$
To evaluate whether the selection procedure is color blind the administrator wants to study whether the particular
realization of 𝑋 drawn can plausibly be said to be a random draw from the probability distribution that is implied by the
color blind hypothesis.
The appropriate probability distribution is the one described here.
Let’s now instantiate the administrator’s problem, while continuing to use the colored balls metaphor.
The administrator has an urn with 𝑁 = 238 balls.
157 balls are blue, 11 balls are green, 46 balls are yellow, and 24 balls are black.
So (𝐾1 , 𝐾2 , 𝐾3 , 𝐾4 ) = (157, 11, 46, 24) and 𝑐 = 4.
15 balls are drawn without replacement.
So 𝑛 = 15.
The administrator wants to know the probability distribution of outcomes
$$
X = \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_4 \end{pmatrix}.
$$
In particular, he wants to know whether a particular outcome - in the form of a 4 × 1 vector of integers recording the
numbers of blue, green, yellow, and black balls, respectively, - contains evidence against the hypothesis that the selection
process is fair, which here means color blind and truly are random draws without replacement from the population of 𝑁
balls.
The right tool for the administrator’s job is the multivariate hypergeometric distribution.
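Before turning to formulas, it may help to see one award round simulated directly. This is a sketch of the color-blind procedure using the urn composition given above; the representation of the urn as an array of color labels is a choice made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

K_arr = np.array([157, 11, 46, 24])      # blue, green, yellow, black
n = 15

# One entry per ball, labelled 0..3 by color
urn = np.repeat(np.arange(len(K_arr)), K_arr)

# A color-blind award round: n balls drawn without replacement
drawn = rng.choice(urn, size=n, replace=False)
X = np.bincount(drawn, minlength=len(K_arr))

print(len(urn))    # total number of balls N
print(X)           # one realization of (k_1, k_2, k_3, k_4)
```

Repeating the draw many times would trace out the distribution of 𝑋 under the fairness hypothesis.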

11.2.2 Multivariate Hypergeometric Distribution
Let’s start with some imports.

import numpy as np
from scipy.special import comb
from scipy.stats import normaltest
from numba import njit, prange
To recapitulate, we assume there are in total 𝑐 types of objects in an urn.

If there are 𝐾𝑖 type 𝑖 objects in the urn and we take 𝑛 draws at random without replacement, then the numbers of type 𝑖
objects in the sample (𝑘1 , 𝑘2 , … , 𝑘𝑐 ) have the multivariate hypergeometric distribution.
Note again that $N = \sum_{i=1}^c K_i$ is the total number of objects in the urn and $n = \sum_{i=1}^c k_i$.
Notation
We use the following notation for binomial coefficients: $\binom{m}{q} = \frac{m!}{q!(m-q)!}$.
The multivariate hypergeometric distribution has the following properties:

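Probability mass function:
$$
\Pr\{X_i = k_i \ \forall i\} = \frac{\prod_{i=1}^c \binom{K_i}{k_i}}{\binom{N}{n}}
$$
Mean:
$$
\mathrm{E}(X_i) = n \frac{K_i}{N}
$$
Variances and covariances:
$$
\mathrm{Var}(X_i) = n \frac{N - n}{N - 1} \frac{K_i}{N} \left( 1 - \frac{K_i}{N} \right)
$$
$$
\mathrm{Cov}(X_i, X_j) = -n \frac{N - n}{N - 1} \frac{K_i}{N} \frac{K_j}{N}
$$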
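The formulas above can be cross-checked against SciPy's built-in `scipy.stats.multivariate_hypergeom` distribution (available in SciPy 1.6 and later). The urn composition and outcome below are assumed purely for illustration.

```python
import numpy as np
from scipy.special import comb
from scipy.stats import multivariate_hypergeom

K_arr = np.array([5, 10, 15])     # assumed urn composition
k_arr = np.array([2, 2, 2])       # assumed outcome to evaluate
N, n = K_arr.sum(), k_arr.sum()

# Formulas from the text
pmf = np.prod(comb(K_arr, k_arr)) / comb(N, n)
mean = n * K_arr / N
p = K_arr / N
cov = n * (N - n) / (N - 1) * (np.diag(p) - np.outer(p, p))

# SciPy's built-in distribution
dist = multivariate_hypergeom(m=K_arr, n=n)
print(pmf, dist.pmf(k_arr))
print(np.allclose(mean, dist.mean()), np.allclose(cov, dist.cov()))
```

Writing the covariance matrix as a scaled `diag(p) - outer(p, p)` reproduces both the variance and covariance formulas in one expression.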
To do our work for us, we’ll write an Urn class.
class Urn:

    def __init__(self, K_arr):
        """
        Initialization given the number of each type i object in the urn.

        Parameters
        ----------
        K_arr: ndarray(int)
            number of each type i object.
        """
        self.K_arr = np.array(K_arr)
        self.N = np.sum(K_arr)
        self.c = len(K_arr)

    def pmf(self, k_arr):
        """
        Probability mass function.

        Parameters
        ----------
        k_arr: ndarray(int)
            number of observed successes of each object.
        """
        K_arr, N = self.K_arr, self.N

        k_arr = np.atleast_2d(k_arr)
        n = np.sum(k_arr, 1)

        num = np.prod(comb(K_arr, k_arr), 1)
        denom = comb(N, n)

        pr = num / denom

        return pr

    def moments(self, n):
        """
        Compute the mean and variance-covariance matrix for
        multivariate hypergeometric distribution.

        Parameters
        ----------
        n: int
            number of draws.
        """
        K_arr, N, c = self.K_arr, self.N, self.c

        # mean
        μ = n * K_arr / N

        # variance-covariance matrix
        Σ = np.full((c, c), n * (N - n) / (N - 1) / N ** 2)
        for i in range(c-1):
            Σ[i, i] *= K_arr[i] * (N - K_arr[i])
            for j in range(i+1, c):
                Σ[i, j] *= - K_arr[i] * K_arr[j]
                Σ[j, i] = Σ[i, j]

        Σ[-1, -1] *= K_arr[-1] * (N - K_arr[-1])

        return μ, Σ

    def simulate(self, n, size=1, seed=None):
        """
        Simulate a sample from multivariate hypergeometric
        distribution where at each draw we take n objects
        from the urn without replacement.

        Parameters
        ----------
        n: int
            number of objects for each draw.
        size: int(optional)
            sample size.
        seed: int(optional)
            random seed.
        """
        K_arr = self.K_arr

        gen = np.random.Generator(np.random.PCG64(seed))
        sample = gen.multivariate_hypergeometric(K_arr, n, size=size)

        return sample
11.3 Usage
11.3.1 First example
Apply this to an example from wiki:

Suppose there are 5 black, 10 white, and 15 red marbles in an urn. If six marbles are chosen without replacement, the
probability that exactly two of each color are chosen is
$$
P(2 \text{ black}, 2 \text{ white}, 2 \text{ red}) = \frac{\binom{5}{2} \binom{10}{2} \binom{15}{2}}{\binom{30}{6}} = 0.079575596816976
$$
# construct the urn

K_arr = [5, 10, 15]
urn = Urn(K_arr)
Now use the pmf method of the Urn class to compute the probability of the outcome $X = (2, 2, 2)$.
k_arr = [2, 2, 2] # array of number of observed successes

urn.pmf(k_arr)
array([0.0795756])
We can use the code to compute probabilities of a list of possible outcomes by constructing a 2-dimensional array k_arr
and pmf will return an array of probabilities for observing each case.
k_arr = [[2, 2, 2], [1, 3, 2]]

urn.pmf(k_arr)
array([0.0795756, 0.1061008])
Now let’s compute the mean vector and variance-covariance matrix.
n = 6
μ, Σ = urn.moments(n)
array([1., 2., 3.])
array([[ 0.68965517, -0.27586207, -0.4137931 ],

[-0.27586207, 1.10344828, -0.82758621],
[-0.4137931 , -0.82758621, 1.24137931]])
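For an independent cross-check of these numbers, the sketch below uses `scipy.stats.multivariate_hypergeom`, which is available in recent SciPy releases (an assumption about the installed version). In SciPy's parameterization, `m` is the vector of object counts by type and `n` is the number of draws. The variable names are chosen for this sketch.

```python
import numpy as np
from scipy.stats import multivariate_hypergeom

K_check = [5, 10, 15]   # urn composition from the marble example
n_check = 6             # number of draws

# probability of drawing exactly two marbles of each color
p_222 = multivariate_hypergeom.pmf([2, 2, 2], m=K_check, n=n_check)

# mean vector and variance-covariance matrix
mean_check = multivariate_hypergeom.mean(m=K_check, n=n_check)
cov_check = multivariate_hypergeom.cov(m=K_check, n=n_check)

print(p_222)
print(mean_check)
print(cov_check)
```

The results should agree with the output of `urn.pmf` and `urn.moments` shown above up to floating point error.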
11.3.2 Back to The Administrator’s Problem
Now let’s turn to the grant administrator’s problem.

Here the array of numbers of type 𝑖 objects in the urn is (157, 11, 46, 24).
K_arr = [157, 11, 46, 24]

urn = Urn(K_arr)
Let’s compute the probability of the outcome (10, 1, 4, 0).
k_arr = [10, 1, 4, 0]
urn.pmf(k_arr)
array([0.01547738])
We can compute probabilities of three possible outcomes by constructing a 2-dimensional array k_arr with one row per outcome and applying the pmf method of the Urn class.
k_arr = [[5, 5, 4 ,1], [10, 1, 2, 2], [13, 0, 2, 0]]

urn.pmf(k_arr)
array([6.21412534e-06, 2.70935969e-02, 1.61839976e-02])
Now let’s compute the mean and variance-covariance matrix of 𝑋 when 𝑛 = 6.
n = 6 # number of draws
μ, Σ = urn.moments(n)
# mean
μ
array([3.95798319, 0.27731092, 1.15966387, 0.60504202])

# variance-covariance matrix
Σ
array([[ 1.31862604, -0.17907267, -0.74884935, -0.39070401],

[-0.17907267, 0.25891399, -0.05246715, -0.02737417],
[-0.74884935, -0.05246715, 0.91579029, -0.11447379],
[-0.39070401, -0.02737417, -0.11447379, 0.53255196]])
We can simulate a large sample and verify that sample means and covariances closely approximate the population means
and covariances.
size = 10_000_000
sample = urn.simulate(n, size=size)
# mean
np.mean(sample, 0)
array([3.9573046, 0.2774102, 1.1597064, 0.6055788])
# variance covariance matrix

np.cov(sample.T)
array([[ 1.31949123, -0.17936828, -0.74889015, -0.39123281],

[-0.17936828, 0.25914361, -0.05241489, -0.02736044],
[-0.74889015, -0.05241489, 0.91570316, -0.11439812],
[-0.39123281, -0.02736044, -0.11439812, 0.53299137]])
Evidently, the sample means and covariances approximate their population counterparts well.
11.3.3 Quality of Normal Approximation
To judge the quality of a multivariate normal approximation to the multivariate hypergeometric distribution, we draw
a large sample from a multivariate normal distribution with the mean vector and covariance matrix for the correspond-
ing multivariate hypergeometric distribution and compare the simulated distribution with the population multivariate
hypergeometric distribution.
sample_normal = np.random.multivariate_normal(μ, Σ, size=size)
def bivariate_normal(x, y, μ, Σ, i, j):

    μ_x, μ_y = μ[i], μ[j]
    σ_x, σ_y = np.sqrt(Σ[i, i]), np.sqrt(Σ[j, j])
    σ_xy = Σ[i, j]

    x_μ = x - μ_x
    y_μ = y - μ_y

    ρ = σ_xy / (σ_x * σ_y)
    z = x_μ**2 / σ_x**2 + y_μ**2 / σ_y**2 - 2 * ρ * x_μ * y_μ / (σ_x * σ_y)
    denom = 2 * np.pi * σ_x * σ_y * np.sqrt(1 - ρ**2)

    return np.exp(-z / (2 * (1 - ρ**2))) / denom
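@njit
def count(vec1, vec2, n):
    # number of observations, read from the argument rather than the global `sample`
    size = vec1.shape[0]
    count_mat = np.zeros((n+1, n+1))

    # without parallel=True in the decorator, prange behaves like range
    for i in prange(size):
        count_mat[vec1[i], vec2[i]] += 1

    return count_mat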
c = urn.c
fig, axs = plt.subplots(c, c, figsize=(14, 14))

# grids for plotting the bivariate Gaussian
x_grid = np.linspace(-2, n+1, 100)
y_grid = np.linspace(-2, n+1, 100)
X, Y = np.meshgrid(x_grid, y_grid)

for i in range(c):
    axs[i, i].hist(sample[:, i], bins=np.arange(0, n, 1), alpha=0.5,
                   density=True, label='hypergeom')
    axs[i, i].hist(sample_normal[:, i], bins=np.arange(0, n, 1), alpha=0.5,
                   density=True, label='normal')
    axs[i, i].legend()
    axs[i, i].set_title('$k_{' +str(i+1) +'}$')
    for j in range(c):
        if i == j:
            continue

        # bivariate Gaussian density function
        Z = bivariate_normal(X, Y, μ, Σ, i, j)
        cs = axs[i, j].contour(X, Y, Z, 4, colors="black", alpha=0.6)
        axs[i, j].clabel(cs, inline=1, fontsize=10)

        # empirical multivariate hypergeometric distribution
        count_mat = count(sample[:, i], sample[:, j], n)
        axs[i, j].pcolor(count_mat.T/size, cmap='Blues')
        axs[i, j].set_title('$(k_{' +str(i+1) +'}, k_{' + str(j+1) + '})$')

plt.show()

The diagonal graphs plot the marginal distributions of 𝑘𝑖 for each 𝑖 using histograms.
Note the substantial differences between hypergeometric distribution and the approximating normal distribution.
The off-diagonal graphs plot the empirical joint distribution of 𝑘𝑖 and 𝑘𝑗 for each pair (𝑖, 𝑗).
The darker the blue, the more data points are contained in the corresponding cell. (Note that 𝑘𝑖 is on the x-axis and 𝑘𝑗 is
on the y-axis).
The contour maps plot the bivariate Gaussian density function of (𝑘𝑖 , 𝑘𝑗 ) with the population mean and covariance given
by slices of 𝜇 and Σ that we computed above.
Let’s also test the normality for each 𝑘𝑖 using scipy.stats.normaltest that implements D’Agostino and Pearson’s
test that combines skew and kurtosis to form an omnibus test of normality.
The null hypothesis is that the sample follows a normal distribution.
normaltest returns an array of p-values associated with tests for each 𝑘𝑖 sample.
test_multihyper = normaltest(sample)
test_multihyper.pvalue
array([0., 0., 0., 0.])
As we can see, all the p-values are almost 0 and the null hypothesis is soundly rejected.
By contrast, the test does not reject the null hypothesis for the sample drawn from the normal distribution.
test_normal = normaltest(sample_normal)
test_normal.pvalue
array([0.8969004 , 0.27041724, 0.9152563 , 0.71988042])
The lesson to take away from this is that the normal approximation is imperfect.
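To put a rough number on that imperfection, the following sketch compares the exact marginal distribution of one count with the probabilities that a moment-matched normal assigns to unit-width bins around each integer. It assumes the administrator's urn with 238 balls, of which 11 are of the type in question, and 6 draws; `scipy.stats.hypergeom` supplies the exact marginal, and the variable names are chosen for this sketch.

```python
import numpy as np
from scipy.stats import hypergeom, norm

N_total, K_i, n_draws = 238, 11, 6     # urn size, objects of type i, draws
k_vals = np.arange(n_draws + 1)

# exact marginal pmf of the count of type i objects in the sample
exact = hypergeom.pmf(k_vals, N_total, K_i, n_draws)

# normal approximation with the same mean and variance
m = n_draws * K_i / N_total
v = (n_draws * (N_total - n_draws) / (N_total - 1)
     * (K_i / N_total) * (1 - K_i / N_total))
sd = np.sqrt(v)

# probability the normal puts on the unit-width bin centered at each integer
approx = norm.cdf(k_vals + 0.5, m, sd) - norm.cdf(k_vals - 0.5, m, sd)

max_gap = np.max(np.abs(exact - approx))
print(np.round(exact, 4))
print(np.round(approx, 4))
print(max_gap)
```

Because this marginal distribution is concentrated near zero and strongly skewed, the largest gap occurs at a count of zero, where the normal approximation also places some mass on negative values.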

CHAPTER
TWELVE
MULTIVARIATE NORMAL DISTRIBUTION
Contents
• Multivariate Normal Distribution

– Overview
– The Multivariate Normal Distribution
– Bivariate Example
– Trivariate Example
– One Dimensional Intelligence (IQ)
– Information as Surprise
– Cholesky Factor Magic
– Math and Verbal Intelligence
– Univariate Time Series Analysis
– Stochastic Difference Equation
– Application to Stock Price Model
– Filtering Foundations
– Classic Factor Analysis Model
– PCA and Factor Analysis
12.1 Overview
This lecture describes a workhorse in probability theory, statistics, and economics, namely, the multivariate normal
distribution.
In this lecture, you will learn formulas for
• the joint distribution of a random vector 𝑥 of length 𝑁
• marginal distributions for all subvectors of 𝑥
• conditional distributions for subvectors of 𝑥 conditional on other subvectors of 𝑥
We will use the multivariate normal distribution to formulate some useful models:
• a factor analytic model of an intelligence quotient, i.e., IQ

• a factor analytic model of two independent inherent abilities, say, mathematical and verbal.
• a more general factor analytic model
• Principal Components Analysis (PCA) as an approximation to a factor analytic model
• time series generated by linear stochastic difference equations
• optimal linear filtering theory
12.2 The Multivariate Normal Distribution
This lecture defines a Python class MultivariateNormal to be used to generate marginal and conditional distributions associated with a multivariate normal distribution.
For a multivariate normal distribution it is very convenient that
• conditional expectations equal linear least squares projections
• conditional distributions are characterized by multivariate linear regressions
We apply our Python class to some examples.
We use the following imports:

import matplotlib.pyplot as plt
import numpy as np
from numba import njit
import statsmodels.api as sm
Assume that an 𝑁 × 1 random vector 𝑧 has a multivariate normal probability density.

This means that the probability density takes the form
$$
f(z; \mu, \Sigma) = (2\pi)^{-\frac{N}{2}} \det(\Sigma)^{-\frac{1}{2}} \exp\left(-.5 (z - \mu)' \Sigma^{-1} (z - \mu)\right)
$$

where $\mu = Ez$ is the mean of the random vector $z$ and $\Sigma = E (z - \mu)(z - \mu)'$ is the covariance matrix of $z$.
The covariance matrix Σ is symmetric and positive definite.
@njit
def f(z, μ, Σ):
    """
    The density function of multivariate normal distribution.

    Parameters
    ---------------
    z: ndarray(float, dim=2)
        random vector, N by 1
    μ: ndarray(float, dim=1 or 2)
        the mean of z, N by 1
    Σ: ndarray(float, dim=2)
        the covariance matrix of z, N by N
    """

    z = np.atleast_2d(z)
    μ = np.atleast_2d(μ)
    Σ = np.atleast_2d(Σ)

    N = z.size

    temp1 = np.linalg.det(Σ) ** (-1/2)
    temp2 = np.exp(-.5 * (z - μ).T @ np.linalg.inv(Σ) @ (z - μ))

    return (2 * np.pi) ** (-N/2) * temp1 * temp2
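As a quick check of the density formula, the sketch below evaluates it with plain NumPy on 1-dimensional arrays and compares the result with `scipy.stats.multivariate_normal`. The helper `mvn_density` and the test point `z_ex` are introduced only for this illustration; the mean and covariance are arbitrary example values.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_density(z, μ, Σ):
    "Evaluate the multivariate normal density at a 1-D point z."
    z = np.asarray(z, dtype=float)
    μ = np.asarray(μ, dtype=float)
    Σ = np.atleast_2d(Σ)
    N = z.size
    dev = z - μ
    # quadratic form (z - μ)' Σ^{-1} (z - μ), computed without forming the inverse
    quad = dev @ np.linalg.solve(Σ, dev)
    return (2 * np.pi) ** (-N / 2) * np.linalg.det(Σ) ** (-1 / 2) * np.exp(-.5 * quad)

μ_ex = np.array([.5, 1.])
Σ_ex = np.array([[1., .5], [.5, 1.]])
z_ex = np.array([0., 2.])

by_formula = mvn_density(z_ex, μ_ex, Σ_ex)
by_scipy = multivariate_normal(mean=μ_ex, cov=Σ_ex).pdf(z_ex)
print(by_formula, by_scipy)
```

Using `np.linalg.solve` instead of an explicit inverse is a design choice for numerical stability; it does not change the formula.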
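For some integer $k \in \{1, \dots, N-1\}$, partition $z$ as

$$
z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix},
$$

where $z_1$ is an $(N-k) \times 1$ vector and $z_2$ is a $k \times 1$ vector.

Let

$$
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
$$

be corresponding partitions of $\mu$ and $\Sigma$.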

The marginal distribution of 𝑧1 is
• multivariate normal with mean 𝜇1 and covariance matrix Σ11 .
The marginal distribution of 𝑧2 is
• multivariate normal with mean 𝜇2 and covariance matrix Σ22 .
The distribution of $z_1$ conditional on $z_2$ is
• multivariate normal with mean

$$
\hat{\mu}_1 = \mu_1 + \beta (z_2 - \mu_2)
$$

and covariance matrix

$$
\hat{\Sigma}_{11} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = \Sigma_{11} - \beta \Sigma_{22} \beta'
$$

where

$$
\beta = \Sigma_{12} \Sigma_{22}^{-1}
$$

is an $(N-k) \times k$ matrix of population regression coefficients of the $(N-k) \times 1$ random vector $z_1 - \mu_1$ on the $k \times 1$ random vector $z_2 - \mu_2$.
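The conditional mean and covariance formulas can be applied directly with a few lines of NumPy. The sketch below uses a made-up three-variable covariance matrix in which $z_1$ is the first component and $z_2$ collects the other two; all numbers and variable names are assumptions chosen for illustration.

```python
import numpy as np

# illustrative mean and (positive definite) covariance matrix
μ_ex = np.array([0., 1., 2.])
Σ_ex = np.array([[2., .5, .3],
                 [.5, 1., .2],
                 [.3, .2, 1.]])

# z1 is the first component, z2 the remaining two
μ1, μ2 = μ_ex[:1], μ_ex[1:]
Σ11, Σ12 = Σ_ex[:1, :1], Σ_ex[:1, 1:]
Σ21, Σ22 = Σ_ex[1:, :1], Σ_ex[1:, 1:]

# population regression coefficients of z1 - μ1 on z2 - μ2
β = Σ12 @ np.linalg.inv(Σ22)

# conditional mean given an observed value of z2
z2_obs = np.array([2., 5.])
μ1_cond = μ1 + β @ (z2_obs - μ2)

# two equivalent expressions for the conditional covariance matrix
Σ11_cond = Σ11 - Σ12 @ np.linalg.inv(Σ22) @ Σ21
Σ11_cond_alt = Σ11 - β @ Σ22 @ β.T

print(μ1_cond, Σ11_cond, Σ11_cond_alt)
```

Note that the conditional covariance does not depend on the observed value of $z_2$, and that conditioning can only reduce the variance of $z_1$.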
The following class constructs a multivariate normal distribution instance with two methods.
• a method partition computes 𝛽, taking 𝑘 as an input
• a method cond_dist computes either the distribution of 𝑧1 conditional on 𝑧2 or the distribution of 𝑧2 conditional
on 𝑧1
class MultivariateNormal:
    """
    Class of multivariate normal distribution.

    Parameters
    ----------
    μ: ndarray(float, dim=1)
        the mean of z, N by 1
    Σ: ndarray(float, dim=2)
        the covariance matrix of z, N by N

    Arguments
    ---------
    μ, Σ:
        see parameters
    μs: list(ndarray(float, dim=1))
        list of mean vectors μ1 and μ2 in order
    Σs: list(list(ndarray(float, dim=2)))
        2 dimensional list of covariance matrices
        Σ11, Σ12, Σ21, Σ22 in order
    βs: list(ndarray(float, dim=1))
        list of regression coefficients β1 and β2 in order
    """

    def __init__(self, μ, Σ):
        "initialization"
        self.μ = np.array(μ)
        self.Σ = np.atleast_2d(Σ)

    def partition(self, k):
        """
        Given k, partition the random vector z into a size k vector z1
        and a size N-k vector z2. Partition the mean vector μ into
        μ1 and μ2, and the covariance matrix Σ into Σ11, Σ12, Σ21, Σ22
        correspondingly. Compute the regression coefficients β1 and β2
        using the partitioned arrays.
        """
        μ = self.μ
        Σ = self.Σ

        self.μs = [μ[:k], μ[k:]]
        self.Σs = [[Σ[:k, :k], Σ[:k, k:]],
                   [Σ[k:, :k], Σ[k:, k:]]]

        self.βs = [self.Σs[0][1] @ np.linalg.inv(self.Σs[1][1]),
                   self.Σs[1][0] @ np.linalg.inv(self.Σs[0][0])]

    def cond_dist(self, ind, z):
        """
        Compute the conditional distribution of z1 given z2, or reversely.
        Argument ind determines whether we compute the conditional
        distribution of z1 (ind=0) or z2 (ind=1).

        Returns
        ---------
        μ_hat: ndarray(float, ndim=1)
            The conditional mean of z1 or z2.
        Σ_hat: ndarray(float, ndim=2)
            The conditional covariance matrix of z1 or z2.
        """
        β = self.βs[ind]
        μs = self.μs
        Σs = self.Σs

        μ_hat = μs[ind] + β @ (z - μs[1-ind])
        Σ_hat = Σs[ind][ind] - β @ Σs[1-ind][1-ind] @ β.T

        return μ_hat, Σ_hat
Let’s put this code to work on a suite of examples.

We begin with a simple bivariate example; after that we’ll turn to a trivariate example.
We’ll compute population moments of some conditional distributions using our MultivariateNormal class.
For fun we’ll also compute sample analogs of the associated population regressions by generating simulations and then
computing linear least squares regressions.
We’ll compare those linear least squares regressions for the simulated data to their population counterparts.
12.3 Bivariate Example
We start with a bivariate normal distribution pinned down by

$$
\mu = \begin{bmatrix} .5 \\ 1.0 \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} 1 & .5 \\ .5 & 1 \end{bmatrix}
$$
μ = np.array([.5, 1.])
Σ = np.array([[1., .5], [.5 ,1.]])
# construction of the multivariate normal instance

multi_normal = MultivariateNormal(μ, Σ)
k = 1 # choose partition
# partition and compute regression coefficients

multi_normal.partition(k)
multi_normal.βs[0],multi_normal.βs[1]
(array([[0.5]]), array([[0.5]]))
Let’s illustrate the fact that you can regress anything on anything else.
We have computed everything we need to compute two regression lines, one of 𝑧2 on 𝑧1 , the other of 𝑧1 on 𝑧2 .
We’ll represent these regressions as

$$
z_1 = a_1 + b_1 z_2 + \epsilon_1
$$

and

$$
z_2 = a_2 + b_2 z_1 + \epsilon_2
$$

where we have the population least squares orthogonality conditions

$$
E \epsilon_1 z_2 = 0
$$

and

$$
E \epsilon_2 z_1 = 0
$$

Let’s compute $a_1, a_2, b_1, b_2$.
beta = multi_normal.βs
a1 = μ[0] - beta[0]*μ[1]
b1 = beta[0]
a2 = μ[1] - beta[1]*μ[0]
b2 = beta[1]
Let’s print out the intercepts and slopes.

For the regression of 𝑧1 on 𝑧2 we have
print ("a1 = ", a1)

print ("b1 = ", b1)
a1 = [[0.]]
b1 = [[0.5]]
For the regression of 𝑧2 on 𝑧1 we have
print ("a2 = ", a2)

print ("b2 = ", b2)
a2 = [[0.75]]
b2 = [[0.5]]
Now let’s plot the two regression lines and stare at them.
z2 = np.linspace(-4,4,100)
a1 = np.squeeze(a1)
b1 = np.squeeze(b1)
a2 = np.squeeze(a2)
b2 = np.squeeze(b2)
z1 = b1*z2 + a1
z1h = z2/b2 - a2/b2
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(1, 1, 1)
ax.set(xlim=(-4, 4), ylim=(-4, 4))
ax.spines['left'].set_position('center')
ax.spines['bottom'].set_position('zero')
ax.spines['right'].set_color('none')


ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
plt.ylabel('$z_1$', loc = 'top')
plt.xlabel('$z_2$', loc = 'right')
plt.title('two regressions')
plt.plot(z2,z1, 'r', label = "$z_1$ on $z_2$")
plt.plot(z2,z1h, 'b', label = "$z_2$ on $z_1$")
plt.legend()
plt.show()
The red line is the expectation of 𝑧1 conditional on 𝑧2 .

The intercept and slope of the red line are
print("a1 = ", a1)

print("b1 = ", b1)
a1 = 0.0
b1 = 0.5
The blue line is the expectation of 𝑧2 conditional on 𝑧1 .

The intercept and slope of the blue line are
print("-a2/b2 = ", - a2/b2)

print("1/b2 = ", 1/b2)
-a2/b2 = -1.5
1/b2 = 2.0
We can use these regression lines or our code to compute conditional expectations.
Let’s compute the mean and variance of the distribution of 𝑧2 conditional on 𝑧1 = 5.
After that we’ll reverse what are on the left and right sides of the regression.
# compute the cond. dist. of z2 given z1

ind = 1
z1 = np.array([5.]) # given z1
μ2_hat, Σ2_hat = multi_normal.cond_dist(ind, z1)

print('μ2_hat, Σ2_hat = ', μ2_hat, Σ2_hat)
μ2_hat, Σ2_hat = [3.25] [[0.75]]
Now let’s compute the mean and variance of the distribution of 𝑧1 conditional on 𝑧2 = 5.
# compute the cond. dist. of z1 given z2
ind = 0
z2 = np.array([5.]) # given z2
μ1_hat, Σ1_hat = multi_normal.cond_dist(ind, z2)
print('μ1_hat, Σ1_hat = ', μ1_hat, Σ1_hat)
μ1_hat, Σ1_hat = [2.5] [[0.75]]
Let’s compare the preceding population mean and variance with outcomes from drawing a large sample and then regressing
𝑧1 − 𝜇1 on 𝑧2 − 𝜇2 .
We know that

$$
E z_1 | z_2 = (\mu_1 - \beta \mu_2) + \beta z_2
$$

which can be rearranged to

$$
z_1 - \mu_1 = \beta (z_2 - \mu_2) + \epsilon.
$$

We anticipate that for larger and larger sample sizes, estimated OLS coefficients will converge to $\beta$ and the estimated variance of $\epsilon$ will converge to $\hat{\Sigma}_1$.

n = 1_000_000 # sample size
# simulate multivariate normal random vectors

data = np.random.multivariate_normal(μ, Σ, size=n)
z1_data = data[:, 0]
z2_data = data[:, 1]
# OLS regression
μ1, μ2 = multi_normal.μs
results = sm.OLS(z1_data - μ1, z2_data - μ2).fit()
Let’s compare the preceding population $\beta$ with the OLS sample estimate of the coefficient on $z_2 - \mu_2$.
multi_normal.βs[0], results.params
(array([[0.5]]), array([0.50068711]))
Let’s compare our population Σ̂ 1 with the degrees-of-freedom adjusted estimate of the variance of 𝜖
Σ1_hat, results.resid @ results.resid.T / (n - 1)
(array([[0.75]]), 0.7504621422788655)
Lastly, let’s compute the estimate of $\hat{E} z_1 | z_2$ and compare it with $\hat{\mu}_1$.
μ1_hat, results.predict(z2 - μ2) + μ1
(array([2.5]), array([2.50274842]))
Thus, in each case, for our very large sample size, the sample analogues closely approximate their population counterparts.
A Law of Large Numbers explains why sample analogues approximate population objects.
12.4 Trivariate Example
Let’s apply our code to a trivariate example.

We’ll specify the mean vector and the covariance matrix as follows.
μ = np.random.random(3)
C = np.random.random((3, 3))
Σ = C @ C.T # positive semi-definite
μ, Σ
(array([0.96647091, 0.52989787, 0.54470206]),

array([[1.05309198, 0.68622856, 0.92507853],
[0.68622856, 0.45333322, 0.63969818],
[0.92507853, 0.63969818, 1.03456211]]))
multi_normal = MultivariateNormal(μ, Σ)
k = 1
multi_normal.partition(k)
Let’s compute the distribution of $z_1$ conditional on $z_2 = \begin{bmatrix} 2 \\ 5 \end{bmatrix}$.
ind = 0
z2 = np.array([2., 5.])
μ1_hat, Σ1_hat = multi_normal.cond_dist(ind, z2)
n = 1_000_000
data = np.random.multivariate_normal(μ, Σ, size=n)
z1_data = data[:, :k]
z2_data = data[:, k:]
μ1, μ2 = multi_normal.μs
results = sm.OLS(z1_data - μ1, z2_data - μ2).fit()
As above, we compare population and sample regression coefficients, the conditional covariance matrix, and the condi-
tional mean vector in that order.
multi_normal.βs[0], results.params
(array([[ 1.97658029, -0.32799991]]), array([ 1.97658228, -0.32800479]))
Σ1_hat, results.resid @ results.resid.T / (n - 1)
(array([[0.00013182]]), 0.0001318492235146378)
μ1_hat, results.predict(z2 - μ2) + μ1
(array([2.41090846]), array([2.41088967]))
Once again, sample analogues do a good job of approximating their population counterparts.
12.5 One Dimensional Intelligence (IQ)
Let’s move closer to a real-life example, namely, inferring a one-dimensional measure of intelligence called IQ from a list
of test scores.
The 𝑖th test score 𝑦𝑖 equals the sum of an unknown scalar IQ 𝜃 and a random variable 𝑤𝑖 .
𝑦𝑖 = 𝜃 + 𝜎𝑦 𝑤𝑖 , 𝑖 = 1, … , 𝑛
The distribution of IQ’s for a cross-section of people is a normal random variable described by
𝜃 = 𝜇𝜃 + 𝜎𝜃 𝑤𝑛+1 .
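We assume that the noises $\{w_i\}_{i=1}^{n}$ in the test scores are IID and not correlated with IQ.

We also assume that $\{w_i\}_{i=1}^{n+1}$ are i.i.d. standard normal:

$$
w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \\ w_{n+1} \end{bmatrix} \sim N(0, I_{n+1})
$$

The following system describes the $(n+1) \times 1$ random vector $X$ that interests us:

$$
X = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ \theta \end{bmatrix}
= \begin{bmatrix} \mu_\theta \\ \mu_\theta \\ \vdots \\ \mu_\theta \\ \mu_\theta \end{bmatrix}
+ \begin{bmatrix}
\sigma_y & 0 & \cdots & 0 & \sigma_\theta \\
0 & \sigma_y & \cdots & 0 & \sigma_\theta \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sigma_y & \sigma_\theta \\
0 & 0 & \cdots & 0 & \sigma_\theta
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \\ w_{n+1} \end{bmatrix},
$$

or equivalently,

$$
X = \mu_\theta 1_{n+1} + D w
$$

where $X = \begin{bmatrix} y \\ \theta \end{bmatrix}$, $1_{n+1}$ is a vector of 1s of size $n+1$, and $D$ is an $(n+1)$ by $(n+1)$ matrix.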
Let’s define a Python function that constructs the mean 𝜇 and covariance matrix Σ of the random vector 𝑋 that we know
is governed by a multivariate normal distribution.
As arguments, the function takes the number of tests 𝑛, the mean 𝜇𝜃 and the standard deviation 𝜎𝜃 of the IQ distribution,
and the standard deviation of the randomness in test scores 𝜎𝑦 .
def construct_moments_IQ(n, μθ, σθ, σy):

    μ_IQ = np.full(n+1, μθ)

    D_IQ = np.zeros((n+1, n+1))
    D_IQ[range(n), range(n)] = σy
    D_IQ[:, n] = σθ

    Σ_IQ = D_IQ @ D_IQ.T

    return μ_IQ, Σ_IQ, D_IQ
Now let’s consider a specific instance of this model.

Assume we have recorded 50 test scores and we know that 𝜇𝜃 = 100, 𝜎𝜃 = 10, and 𝜎𝑦 = 10.
We can compute the mean vector and covariance matrix of 𝑋 easily with our construct_moments_IQ function as
follows.
n = 50
μθ, σθ, σy = 100., 10., 10.
μ_IQ, Σ_IQ, D_IQ = construct_moments_IQ(n, μθ, σθ, σy)

μ_IQ, Σ_IQ, D_IQ
(array([100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100.]),
array([[200., 100., 100., ..., 100., 100., 100.],
[100., 200., 100., ..., 100., 100., 100.],
[100., 100., 200., ..., 100., 100., 100.],
...,
[100., 100., 100., ..., 200., 100., 100.],
[100., 100., 100., ..., 100., 200., 100.],
[100., 100., 100., ..., 100., 100., 100.]]),
array([[10., 0., 0., ..., 0., 0., 10.],
[ 0., 10., 0., ..., 0., 0., 10.],
[ 0., 0., 10., ..., 0., 0., 10.],
...,
[ 0., 0., 0., ..., 10., 0., 10.],
[ 0., 0., 0., ..., 0., 10., 10.],
[ 0., 0., 0., ..., 0., 0., 10.]]))
We can now use our MultivariateNormal class to construct an instance, then partition the mean vector and covariance matrix as we wish.
We want to regress IQ, the random variable 𝜃 (what we don’t know), on the vector 𝑦 of test scores (what we do know).
We choose k=n so that 𝑧1 = 𝑦 and 𝑧2 = 𝜃.
multi_normal_IQ = MultivariateNormal(μ_IQ, Σ_IQ)
k = n
multi_normal_IQ.partition(k)
Using the generator multivariate_normal, we can make one draw of the random vector from our distribution and
then compute the distribution of 𝜃 conditional on our test scores.
Let’s do that and then print out some pertinent quantities.
x = np.random.multivariate_normal(μ_IQ, Σ_IQ)
y = x[:-1] # test scores
θ = x[-1] # IQ
# the true value

θ
103.64946988092446
The method cond_dist takes test scores 𝑦 as input and returns the conditional normal distribution of the IQ 𝜃.
In the following code, ind sets the variables on the right side of the regression.
Given the way we have defined the vector 𝑋, we want to set ind=1 in order to make 𝜃 the left side variable in the
population regression.
ind = 1
multi_normal_IQ.cond_dist(ind, y)
(array([106.80818783]), array([[1.96078431]]))

The first number is the conditional mean $\hat{\mu}_\theta$ and the second is the conditional variance $\hat{\Sigma}_\theta$.
How do additional test scores affect our inferences?
To shed light on this, we compute a sequence of conditional distributions of 𝜃 by varying the number of test scores in the
conditioning set from 1 to 𝑛.
We’ll make a pretty graph showing how our judgment of the person’s IQ changes as more test results come in.
# array for containing moments
μθ_hat_arr = np.empty(n)
Σθ_hat_arr = np.empty(n)

# loop over number of test scores
for i in range(1, n+1):
    # construction of multivariate normal distribution instance
    μ_IQ_i, Σ_IQ_i, D_IQ_i = construct_moments_IQ(i, μθ, σθ, σy)
    multi_normal_IQ_i = MultivariateNormal(μ_IQ_i, Σ_IQ_i)

    # partition and compute conditional distribution
    multi_normal_IQ_i.partition(i)
    scores_i = y[:i]
    μθ_hat_i, Σθ_hat_i = multi_normal_IQ_i.cond_dist(1, scores_i)

    # store the results
    μθ_hat_arr[i-1] = μθ_hat_i[0]
    Σθ_hat_arr[i-1] = Σθ_hat_i[0, 0]

# transform variance to standard deviation
σθ_hat_arr = np.sqrt(Σθ_hat_arr)

μθ_hat_lower = μθ_hat_arr - 1.96 * σθ_hat_arr
μθ_hat_higher = μθ_hat_arr + 1.96 * σθ_hat_arr

plt.hlines(θ, 1, n+1, ls='--', label='true $θ$')
plt.plot(range(1, n+1), μθ_hat_arr, color='b', label=r'$\hat{μ}_{θ}$')
plt.plot(range(1, n+1), μθ_hat_lower, color='b', ls='--')
plt.plot(range(1, n+1), μθ_hat_higher, color='b', ls='--')
plt.fill_between(range(1, n+1), μθ_hat_lower, μθ_hat_higher,
                 color='b', alpha=0.2, label='95%')

plt.xlabel('number of test scores')
plt.ylabel(r'$\hat{θ}$')
plt.legend()

plt.show()
The solid blue line in the plot above shows $\hat{\mu}_\theta$ as a function of the number of test scores that we have recorded and conditioned on.
The blue area shows the span that comes from adding or subtracting $1.96 \hat{\sigma}_\theta$ from $\hat{\mu}_\theta$.
Therefore, 95% of the probability mass of the conditional distribution falls in this range.
The value of the random $\theta$ that we drew is shown by the black dotted line.
As more and more test scores come in, our estimate of the person’s $\theta$ becomes more and more reliable.
By staring at the changes in the conditional distributions, we see that adding more test scores makes $\hat{\theta}$ settle down and approach $\theta$.
Thus, each $y_i$ adds information about $\theta$.
If we were to drive the number of tests $n \to +\infty$, the conditional standard deviation $\hat{\sigma}_\theta$ would converge to $0$ at rate $\frac{1}{n^{.5}}$.
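For this particular model the conditional moments also have a well-known closed form: the conditional precision is $1/\sigma_\theta^2 + n/\sigma_y^2$, and the conditional mean is the precision-weighted average of $\mu_\theta$ and the test scores. The sketch below checks that closed form against the regression formulas and illustrates the $n^{-.5}$ rate. The function names `posterior_closed_form` and `posterior_regression` and the simulated scores are assumptions made for this illustration.

```python
import numpy as np

def posterior_closed_form(y, μθ, σθ, σy):
    "Conditional mean and variance of θ via the precision-weighting formula."
    n_obs = len(y)
    precision = 1 / σθ**2 + n_obs / σy**2
    var = 1 / precision
    mean = var * (μθ / σθ**2 + np.sum(y) / σy**2)
    return mean, var

def posterior_regression(y, μθ, σθ, σy):
    "Conditional mean and variance of θ via population regression formulas."
    n_obs = len(y)
    Σ_yy = σθ**2 * np.ones((n_obs, n_obs)) + σy**2 * np.eye(n_obs)
    Σ_θy = σθ**2 * np.ones(n_obs)
    β = np.linalg.solve(Σ_yy, Σ_θy)     # Σ_yy is symmetric
    mean = μθ + β @ (y - μθ)
    var = σθ**2 - β @ Σ_θy
    return mean, var

# simulate one person's IQ and 50 test scores
rng = np.random.default_rng(0)
μθ_ex, σθ_ex, σy_ex, n_ex = 100., 10., 10., 50
θ_ex = μθ_ex + σθ_ex * rng.standard_normal()
y_ex = θ_ex + σy_ex * rng.standard_normal(n_ex)

mean_cf, var_cf = posterior_closed_form(y_ex, μθ_ex, σθ_ex, σy_ex)
mean_reg, var_reg = posterior_regression(y_ex, μθ_ex, σθ_ex, σy_ex)
print(mean_cf, mean_reg)
print(var_cf, var_reg)

# sqrt(n) times the conditional standard deviation approaches σy as n grows
for n_big in (10, 1_000, 100_000):
    sd_big = np.sqrt(1 / (1 / σθ_ex**2 + n_big / σy_ex**2))
    print(n_big, np.sqrt(n_big) * sd_big)
```

With 50 tests and $\sigma_\theta = \sigma_y = 10$ the conditional variance is $100/51 \approx 1.9608$, the value reported by `cond_dist` above.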
12.6 Information as Surprise
By using a different representation, let’s look at things from a different perspective.

We can represent the random vector 𝑋 defined above as
𝑋 = 𝜇𝜃 1𝑛+1 + 𝐶𝜖, 𝜖 ∼ 𝑁 (0, 𝐼)
where 𝐶 is a lower triangular Cholesky factor of Σ so that
Σ ≡ 𝐷𝐷′ = 𝐶𝐶 ′
and
𝐸𝜖𝜖′ = 𝐼.
It follows that
𝜖 ∼ 𝑁 (0, 𝐼).
Let $G = C^{-1}$.

$G$ is also lower triangular.

We can compute 𝜖 from the formula
𝜖 = 𝐺 (𝑋 − 𝜇𝜃 1𝑛+1 )
This formula confirms that the orthonormal vector 𝜖 contains the same information as the non-orthogonal vector
(𝑋 − 𝜇𝜃 1𝑛+1 ).
We can say that 𝜖 is an orthogonal basis for (𝑋 − 𝜇𝜃 1𝑛+1 ).
Let 𝑐𝑖 be the 𝑖th element in the last row of 𝐶.
Then we can write
𝜃 = 𝜇𝜃 + 𝑐1 𝜖1 + 𝑐2 𝜖2 + ⋯ + 𝑐𝑛 𝜖𝑛 + 𝑐𝑛+1 𝜖𝑛+1 (12.1)
The mutual orthogonality of the 𝜖𝑖 ’s provides us with an informative way to interpret them in light of equation (12.1).
Thus, relative to what is known from tests $1, \dots, i-1$, $c_i \epsilon_i$ is the amount of new information about $\theta$ brought by test number $i$.
Here new information means surprise or what could not be predicted from earlier information.
Formula (12.1) also provides us with an enlightening way to express conditional means and conditional variances that we
computed earlier.
In particular,

$$
E [\theta \mid y_1, \dots, y_k] = \mu_\theta + c_1 \epsilon_1 + \dots + c_k \epsilon_k
$$

and

$$
Var (\theta \mid y_1, \dots, y_k) = c_{k+1}^2 + c_{k+2}^2 + \dots + c_{n+1}^2.
$$
C = np.linalg.cholesky(Σ_IQ)
G = np.linalg.inv(C)
ε = G @ (x - μθ)
cε = C[n, :] * ε
# compute the sequence of μθ and Σθ conditional on y1, y2, ..., yk

μθ_hat_arr_C = np.array([np.sum(cε[:k+1]) for k in range(n)]) + μθ
Σθ_hat_arr_C = np.array([np.sum(C[n, i+1:n+1] ** 2) for i in range(n)])
To confirm that these formulas give the same answers that we computed earlier, we can compare the means and variances of $\theta$ conditional on $\{y_i\}_{i=1}^k$ with what we obtained above using the formulas implemented in the class MultivariateNormal built on our original representation of conditional distributions for multivariate normal distributions.
# conditional mean
np.max(np.abs(μθ_hat_arr - μθ_hat_arr_C)) < 1e-10
True
# conditional variance
np.max(np.abs(Σθ_hat_arr - Σθ_hat_arr_C)) < 1e-10
True
12.7 Cholesky Factor Magic
Evidently, the Cholesky factorization automatically computes the population regression coefficients and associated statistics that are produced by our MultivariateNormal class.
The Cholesky factorization computes these things recursively.
Indeed, in formula (12.1),
• the random variable 𝑐𝑖 𝜖𝑖 is information about 𝜃 that is not contained by the information in 𝜖1 , 𝜖2 , … , 𝜖𝑖−1
• the coefficient 𝑐𝑖 is the simple population regression coefficient of 𝜃 − 𝜇𝜃 on 𝜖𝑖
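These two claims can be checked numerically. Since $\epsilon = G (X - \mu_\theta 1_{n+1})$ has an identity covariance matrix, the covariance between $X$ and $\epsilon$ is $\Sigma G' = C$, so the regression coefficient of $\theta - \mu_\theta$ on $\epsilon_i$ is the $i$th entry of the last row of $C$. The sketch below uses a small instance with three tests; the parameter values and variable names are assumptions chosen for illustration.

```python
import numpy as np

n_t, σθ_c, σy_c = 3, 10., 10.   # illustrative: three tests

# build D and Σ = D D' as in the one-dimensional IQ model
D_c = np.zeros((n_t + 1, n_t + 1))
D_c[range(n_t), range(n_t)] = σy_c
D_c[:, n_t] = σθ_c
Σ_c = D_c @ D_c.T

C_c = np.linalg.cholesky(Σ_c)   # lower triangular factor
G_c = np.linalg.inv(C_c)

# covariance matrix of ε = G (X - μ) is G Σ G', which should be the identity
cov_ε = G_c @ Σ_c @ G_c.T

# covariance between X - μ and ε is Σ G', which should equal C
cov_X_ε = Σ_c @ G_c.T

# Var(θ | y_1, ..., y_k) from squared entries of the last row of C,
# compared with the closed form 1 / (1/σθ² + k/σy²), for k = 0, ..., n_t
var_chol = np.array([np.sum(C_c[n_t, k:] ** 2) for k in range(n_t + 1)])
var_closed = np.array([1 / (1 / σθ_c**2 + k / σy_c**2) for k in range(n_t + 1)])

print(np.round(cov_X_ε[-1], 4), np.round(C_c[-1], 4))
print(var_chol, var_closed)
```

Because each $\epsilon_i$ has unit variance and the $\epsilon_i$ are mutually uncorrelated, the simple regression coefficient on $\epsilon_i$ coincides with the covariance between $\theta$ and $\epsilon_i$.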
12.8 Math and Verbal Intelligence
We can alter the preceding example to be more realistic.

There is ample evidence that IQ is not a scalar.
Some people are good in math skills but poor in language skills.
Other people are good in language skills but poor in math skills.
So now we shall assume that there are two dimensions of IQ, 𝜃 and 𝜂.
These determine average performances in math and language tests, respectively.
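We observe math scores $\{y_i\}_{i=1}^{n}$ and language scores $\{y_i\}_{i=n+1}^{2n}$.

When $n = 2$, we assume that outcomes are draws from a multivariate normal distribution with representation

$$
X = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ \theta \\ \eta \end{bmatrix}
= \begin{bmatrix} \mu_\theta \\ \mu_\theta \\ \mu_\eta \\ \mu_\eta \\ \mu_\theta \\ \mu_\eta \end{bmatrix}
+ \begin{bmatrix}
\sigma_y & 0 & 0 & 0 & \sigma_\theta & 0 \\
0 & \sigma_y & 0 & 0 & \sigma_\theta & 0 \\
0 & 0 & \sigma_y & 0 & 0 & \sigma_\eta \\
0 & 0 & 0 & \sigma_y & 0 & \sigma_\eta \\
0 & 0 & 0 & 0 & \sigma_\theta & 0 \\
0 & 0 & 0 & 0 & 0 & \sigma_\eta
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \end{bmatrix}
$$

where $w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_6 \end{bmatrix}$ is a standard normal random vector.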
We construct a Python function construct_moments_IQ2d to construct the mean vector and covariance matrix of
the joint normal distribution.
def construct_moments_IQ2d(n, μθ, σθ, μη, ση, σy):

    μ_IQ2d = np.empty(2*(n+1))
    μ_IQ2d[:n] = μθ
    μ_IQ2d[2*n] = μθ
    μ_IQ2d[n:2*n] = μη
    μ_IQ2d[2*n+1] = μη

    D_IQ2d = np.zeros((2*(n+1), 2*(n+1)))
    D_IQ2d[range(2*n), range(2*n)] = σy
    D_IQ2d[:n, 2*n] = σθ
    D_IQ2d[2*n, 2*n] = σθ
    D_IQ2d[n:2*n, 2*n+1] = ση
    D_IQ2d[2*n+1, 2*n+1] = ση

    Σ_IQ2d = D_IQ2d @ D_IQ2d.T

    return μ_IQ2d, Σ_IQ2d, D_IQ2d
Let’s put the function to work.
n = 2
# means of θ and η, and standard deviations of θ, η, and the test score noise
μθ, σθ, μη, ση, σy = 100., 10., 100., 10, 10
μ_IQ2d, Σ_IQ2d, D_IQ2d = construct_moments_IQ2d(n, μθ, σθ, μη, ση, σy)

μ_IQ2d, Σ_IQ2d, D_IQ2d
(array([100., 100., 100., 100., 100., 100.]),

array([[200., 100., 0., 0., 100., 0.],
[100., 200., 0., 0., 100., 0.],
[ 0., 0., 200., 100., 0., 100.],
[ 0., 0., 100., 200., 0., 100.],
[100., 100., 0., 0., 100., 0.],
[ 0., 0., 100., 100., 0., 100.]]),
array([[10., 0., 0., 0., 10., 0.],
[ 0., 10., 0., 0., 10., 0.],
[ 0., 0., 10., 0., 0., 10.],
[ 0., 0., 0., 10., 0., 10.],
[ 0., 0., 0., 0., 10., 0.],
[ 0., 0., 0., 0., 0., 10.]]))
# take one draw

x = np.random.multivariate_normal(μ_IQ2d, Σ_IQ2d)
y1 = x[:n]
y2 = x[n:2*n]
θ = x[2*n]
η = x[2*n+1]
# the true values

θ, η
(83.26886447129678, 112.92159885842455)
We first compute the joint normal distribution of (𝜃, 𝜂) conditional on all four test scores.
multi_normal_IQ2d = MultivariateNormal(μ_IQ2d, Σ_IQ2d)
k = 2*n # the length of data vector

multi_normal_IQ2d.partition(k)
multi_normal_IQ2d.cond_dist(1, [*y1, *y2])
(array([ 85.61557319, 105.80129067]),

array([[33.33333333, 0. ],
[ 0. , 33.33333333]]))
Now let’s compute distributions of $\theta$ and $\eta$ separately conditional on various subsets of test scores.
It will be fun to compare outcomes with the help of an auxiliary function cond_dist_IQ2d that we now construct.
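def cond_dist_IQ2d(μ, Σ, data):

    n = len(μ)

    # build the instance from the moments passed in, not from a global object
    multi_normal = MultivariateNormal(μ, Σ)
    multi_normal.partition(n-1)
    μ_hat, Σ_hat = multi_normal.cond_dist(1, data)

    return μ_hat, Σ_hat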
Let’s see how things work for an example.
for indices, IQ, conditions in [([*range(2*n), 2*n], 'θ', 'y1, y2, y3, y4'),
                                ([*range(n), 2*n], 'θ', 'y1, y2'),
                                ([*range(n, 2*n), 2*n], 'θ', 'y3, y4'),
                                ([*range(2*n), 2*n+1], 'η', 'y1, y2, y3, y4'),
                                ([*range(n), 2*n+1], 'η', 'y1, y2'),
                                ([*range(n, 2*n), 2*n+1], 'η', 'y3, y4')]:

    μ_hat, Σ_hat = cond_dist_IQ2d(μ_IQ2d[indices], Σ_IQ2d[indices][:, indices],
                                  x[indices[:-1]])
    print(f'The mean and variance of {IQ} conditional on {conditions: <15} are ' +
          f'{μ_hat[0]:1.2f} and {Σ_hat[0, 0]:1.2f} respectively')
The mean and variance of θ conditional on y1, y2, y3, y4  are 85.62 and 33.33 respectively
The mean and variance of θ conditional on y1, y2          are 85.62 and 33.33 respectively
The mean and variance of θ conditional on y3, y4          are 100.00 and 100.00 respectively
The mean and variance of η conditional on y1, y2, y3, y4  are 105.80 and 33.33 respectively
The mean and variance of η conditional on y1, y2          are 100.00 and 100.00 respectively
The mean and variance of η conditional on y3, y4          are 105.80 and 33.33 respectively
Evidently, math tests provide no information about $\eta$ and language tests provide no information about $\theta$.

12.9 Univariate Time Series Analysis
We can use the multivariate normal distribution and a little matrix algebra to present foundations of univariate linear time
series analysis.
Let 𝑥𝑡 , 𝑦𝑡 , 𝑣𝑡 , 𝑤𝑡+1 each be scalars for 𝑡 ≥ 0.
Consider the following model:
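$$
\begin{aligned}
x_0 &\sim N(0, \sigma_0^2) \\
x_{t+1} &= a x_t + b w_{t+1}, \quad w_{t+1} \sim N(0, 1), \quad t \geq 0 \\
y_t &= c x_t + d v_t, \quad v_t \sim N(0, 1), \quad t \geq 0
\end{aligned}
$$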
We can compute the moments of 𝑥𝑡

1. $E x_{t+1}^2 = a^2 E x_t^2 + b^2$, $t \geq 0$, where $E x_0^2 = \sigma_0^2$
2. $E x_{t+j} x_t = a^j E x_t^2$ for all $t \geq 0$ and all $j \geq 0$
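The two moment formulas can be turned into a short recursion. Here is a minimal sketch; the helper name `second_moments` is our own (hypothetical), and the parameter values are chosen only for illustration.

```python
import numpy as np

def second_moments(a, b, sigma0, T):
    """Return E[x_t^2] for t = 0..T and the matrix of E[x_t x_s]."""
    var = np.empty(T + 1)
    var[0] = sigma0 ** 2
    for t in range(T):
        # formula 1: E x_{t+1}^2 = a^2 E x_t^2 + b^2
        var[t + 1] = a ** 2 * var[t] + b ** 2
    cov = np.empty((T + 1, T + 1))
    for t in range(T + 1):
        for s in range(T + 1):
            # formula 2: E x_{t+j} x_t = a^j E x_t^2, with t the earlier date
            cov[t, s] = a ** abs(t - s) * var[min(t, s)]
    return var, cov

var, cov = second_moments(a=0.9, b=1.0, sigma0=1.0, T=3)
print(var)
print(cov)
```

The matrix `cov` plays the role of the covariance matrix of the stacked vector of states.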
Given some $T$, we can formulate the sequence $\{x_t\}_{t=0}^T$ as a random vector

$$
X = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_T \end{bmatrix}
$$
and the covariance matrix Σ𝑥 can be constructed using the moments we have computed above.
Similarly, we can define
$$
Y = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_T \end{bmatrix}, \quad
V = \begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_T \end{bmatrix}
$$
and therefore
𝑌 = 𝐶𝑋 + 𝐷𝑉
where 𝐶 and 𝐷 are both diagonal matrices with constant 𝑐 and 𝑑 as diagonal respectively.
Consequently, the covariance matrix of 𝑌 is
Σ𝑦 = 𝐸𝑌 𝑌 ′ = 𝐶Σ𝑥 𝐶 ′ + 𝐷𝐷′
By stacking 𝑋 and 𝑌 , we can write
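$$
Z = \begin{bmatrix} X \\ Y \end{bmatrix}
$$

and

$$
\Sigma_z = E Z Z' = \begin{bmatrix} \Sigma_x & \Sigma_x C' \\ C \Sigma_x & \Sigma_y \end{bmatrix}
$$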
Thus, the stacked sequences $\{x_t\}_{t=0}^T$ and $\{y_t\}_{t=0}^T$ jointly follow the multivariate normal distribution $N(0, \Sigma_z)$.
# as an example, consider the case where T = 3

T = 3

# standard deviation of the initial distribution x_0

σ0 = 1.
# parameters of the equation system

a = .9
b = 1.
c = 1.0
d = .05
# construct the covariance matrix of X

Σx = np.empty((T+1, T+1))
Σx[0, 0] = σ0 ** 2
for i in range(T):
    Σx[i, i+1:] = Σx[i, i] * a ** np.arange(1, T+1-i)
    Σx[i+1:, i] = Σx[i, i+1:]
    Σx[i+1, i+1] = a ** 2 * Σx[i, i] + b ** 2
Σx
array([[1. , 0.9 , 0.81 , 0.729 ],

[0.9 , 1.81 , 1.629 , 1.4661 ],
[0.81 , 1.629 , 2.4661 , 2.21949 ],
[0.729 , 1.4661 , 2.21949 , 2.997541]])
# construct the covariance matrix of Y

C = np.eye(T+1) * c
D = np.eye(T+1) * d
Σy = C @ Σx @ C.T + D @ D.T
# construct the covariance matrix of Z

Σz = np.empty((2*(T+1), 2*(T+1)))
Σz[:T+1, :T+1] = Σx
Σz[:T+1, T+1:] = Σx @ C.T
Σz[T+1:, :T+1] = C @ Σx
Σz[T+1:, T+1:] = Σy
Σz
array([[1. , 0.9 , 0.81 , 0.729 , 1. , 0.9 ,

0.81 , 0.729 ],
[0.9 , 1.81 , 1.629 , 1.4661 , 0.9 , 1.81 ,
1.629 , 1.4661 ],
[0.81 , 1.629 , 2.4661 , 2.21949 , 0.81 , 1.629 ,
2.4661 , 2.21949 ],
[0.729 , 1.4661 , 2.21949 , 2.997541, 0.729 , 1.4661 ,
2.21949 , 2.997541],
[1. , 0.9 , 0.81 , 0.729 , 1.0025 , 0.9 ,
0.81 , 0.729 ],


[0.9 , 1.81 , 1.629 , 1.4661 , 0.9 , 1.8125 ,
1.629 , 1.4661 ],
[0.81 , 1.629 , 2.4661 , 2.21949 , 0.81 , 1.629 ,
2.4686 , 2.21949 ],
[0.729 , 1.4661 , 2.21949 , 2.997541, 0.729 , 1.4661 ,
2.21949 , 3.000041]])
# construct the mean vector of Z

μz = np.zeros(2*(T+1))
The following Python code lets us sample random vectors 𝑋 and 𝑌 .

This is going to be very useful for doing the conditioning to be used in the fun exercises below.
z = np.random.multivariate_normal(μz, Σz)
x = z[:T+1]
y = z[T+1:]
12.9.1 Smoothing Example
This is an instance of a classic smoothing calculation whose purpose is to compute 𝐸𝑋 ∣ 𝑌 .

An interpretation of this example is
• 𝑋 is a random sequence of hidden Markov state variables 𝑥𝑡
• 𝑌 is a sequence of observed signals 𝑦𝑡 bearing information about the hidden state
# construct a MultivariateNormal instance

multi_normal_ex1 = MultivariateNormal(μz, Σz)
x = z[:T+1]
y = z[T+1:]
# partition Z into X and Y

multi_normal_ex1.partition(T+1)
# compute the conditional mean and covariance matrix of X given Y=y
print("X = ", x)
print("Y = ", y)
print(" E [ X | Y] = ", )
multi_normal_ex1.cond_dist(0, y)
X = [0.84498196 0.39657404 1.96415412 1.34909681]

Y = [0.84836004 0.36291572 1.96174386 1.3549349 ]
E [ X | Y] =
(array([0.84536178, 0.36755731, 1.95676737, 1.35594775]),

array([[2.48875094e-03, 5.57449314e-06, 1.24861729e-08, 2.80236945e-11],


[5.57449314e-06, 2.48876343e-03, 5.57452116e-06, 1.25113944e-08],
[1.24861728e-08, 5.57452116e-06, 2.48876346e-03, 5.58575339e-06],
[2.80236945e-11, 1.25113941e-08, 5.58575339e-06, 2.49377812e-03]]))
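As a cross-check on the smoothing calculation, the conditional mean and covariance can also be computed directly from the standard partitioned-normal formulas $E[X \mid Y = y] = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$ and $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$. The sketch below does not use the MultivariateNormal class; the function name `cond_normal` is hypothetical, and the covariance matrix is rebuilt from the parameter values used above.

```python
import numpy as np

def cond_normal(mu, Sigma, k, y):
    """Distribution of the first k entries of z ~ N(mu, Sigma) given the rest equal y."""
    mu1, mu2 = mu[:k], mu[k:]
    S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
    S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
    beta = np.linalg.solve(S22.T, S12.T).T      # beta = S12 @ inv(S22)
    return mu1 + beta @ (y - mu2), S11 - beta @ S21

# rebuild the covariance matrix of Z = [X, Y] for T = 3
T, sigma0, a, b, c, d = 3, 1.0, 0.9, 1.0, 1.0, 0.05
var = [sigma0 ** 2]
for _ in range(T):
    var.append(a ** 2 * var[-1] + b ** 2)
Sigma_x = np.array([[a ** abs(t - s) * var[min(t, s)] for s in range(T + 1)]
                    for t in range(T + 1)])
C, D = c * np.eye(T + 1), d * np.eye(T + 1)
Sigma_y = C @ Sigma_x @ C.T + D @ D.T
Sigma_z = np.block([[Sigma_x, Sigma_x @ C.T], [C @ Sigma_x, Sigma_y]])

# the observed signal Y printed above
y_obs = np.array([0.84836004, 0.36291572, 1.96174386, 1.3549349])
x_smooth, Sigma_smooth = cond_normal(np.zeros(2 * (T + 1)), Sigma_z, T + 1, y_obs)
print(x_smooth)
print(Sigma_smooth)
```

Using `np.linalg.solve` rather than an explicit inverse is a design choice for numerical stability; the result should agree with the output of `cond_dist` up to rounding.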
12.9.2 Filtering Exercise
Compute 𝐸 [𝑥𝑡 ∣ 𝑦𝑡−1 , 𝑦𝑡−2 , … , 𝑦0 ].

To do so, we need to first construct the mean vector and the covariance matrix of the subvector [𝑥𝑡 , 𝑦0 , … , 𝑦𝑡−2 , 𝑦𝑡−1 ].
For example, let’s say that we want the conditional distribution of 𝑥3 .
t = 3
# mean of the subvector

sub_μz = np.zeros(t+1)
# covariance matrix of the subvector

sub_Σz = np.empty((t+1, t+1))
sub_Σz[0, 0] = Σz[t, t] # x_t

sub_Σz[0, 1:] = Σz[t, T+1:T+t+1]
sub_Σz[1:, 0] = Σz[T+1:T+t+1, t]
sub_Σz[1:, 1:] = Σz[T+1:T+t+1, T+1:T+t+1]
sub_Σz
array([[2.997541, 0.729 , 1.4661 , 2.21949 ],

[0.729 , 1.0025 , 0.9 , 0.81 ],
[1.4661 , 0.9 , 1.8125 , 1.629 ],
[2.21949 , 0.81 , 1.629 , 2.4686 ]])
multi_normal_ex2 = MultivariateNormal(sub_μz, sub_Σz)

multi_normal_ex2.partition(1)
sub_y = y[:t]
multi_normal_ex2.cond_dist(0, sub_y)
(array([1.76190901]), array([[1.00201996]]))

12.9.3 Prediction Exercise
Compute 𝐸 [𝑦𝑡 ∣ 𝑦𝑡−𝑗 , … , 𝑦0 ].

As in the filtering exercise, we will construct the mean vector and covariance matrix of the subvector
$[y_t, y_0, \ldots, y_{t-j-1}, y_{t-j}]$.
For example, we take a case in which 𝑡 = 3 and 𝑗 = 2.
t = 3
j = 2
sub_μz = np.zeros(t-j+2)
sub_Σz = np.empty((t-j+2, t-j+2))
sub_Σz[0, 0] = Σz[T+t+1, T+t+1]

sub_Σz[0, 1:] = Σz[T+t+1, T+1:T+t-j+2]
sub_Σz[1:, 0] = Σz[T+1:T+t-j+2, T+t+1]
sub_Σz[1:, 1:] = Σz[T+1:T+t-j+2, T+1:T+t-j+2]
sub_Σz
array([[3.000041, 0.729 , 1.4661 ],

[0.729 , 1.0025 , 0.9 ],
[1.4661 , 0.9 , 1.8125 ]])
multi_normal_ex3 = MultivariateNormal(sub_μz, sub_Σz)

multi_normal_ex3.partition(1)
sub_y = y[:t-j+1]
multi_normal_ex3.cond_dist(0, sub_y)
(array([0.29476547]), array([[1.81413617]]))
12.9.4 Constructing a Wold Representation
Now we’ll apply Cholesky decomposition to decompose $\Sigma_y = H H'$ and form

$$
\epsilon = H^{-1} Y.
$$

Then we can represent $y_t$ as

$$
y_t = h_{t,t}\epsilon_t + h_{t,t-1}\epsilon_{t-1} + \cdots + h_{t,0}\epsilon_0.
$$
H = np.linalg.cholesky(Σy)
H
array([[1.00124922, 0.        , 0.        , 0.        ],
       [0.8988771 , 1.00225743, 0.        , 0.        ],
       [0.80898939, 0.89978675, 1.00225743, 0.        ],
       [0.72809046, 0.80980808, 0.89978676, 1.00225743]])
ε = np.linalg.inv(H) @ y
ε
array([ 0.84730157, -0.39780625, 1.63054582, -0.40605746])
y
array([0.84836004, 0.36291572, 1.96174386, 1.3549349 ])
This example is an instance of what is known as a Wold representation in time series analysis.
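To see why this construction is useful, note that the innovations $\epsilon = H^{-1}Y$ have covariance matrix $H^{-1}\Sigma_y (H^{-1})' = I$, so they are mutually orthogonal with unit variance. Here is a minimal sketch that checks this for the $T = 3$ example; it rebuilds $\Sigma_y$ from the parameters above and uses a triangular solve instead of an explicit inverse, which is our own choice rather than the code used in the lecture.

```python
import numpy as np
from scipy.linalg import solve_triangular

# covariance of Y for T = 3, a = 0.9, b = 1, c = 1, d = 0.05, sigma0 = 1
T, a, b, c, d = 3, 0.9, 1.0, 1.0, 0.05
var = [1.0]
for _ in range(T):
    var.append(a ** 2 * var[-1] + b ** 2)
Sigma_x = np.array([[a ** abs(t - s) * var[min(t, s)] for s in range(T + 1)]
                    for t in range(T + 1)])
Sigma_y = c ** 2 * Sigma_x + d ** 2 * np.eye(T + 1)

H = np.linalg.cholesky(Sigma_y)            # lower triangular, Sigma_y = H H'
H_inv = solve_triangular(H, np.eye(T + 1), lower=True)

# covariance of eps = H^{-1} Y
Sigma_eps = H_inv @ Sigma_y @ H_inv.T

# recover the innovations for the observed y, then rebuild y from them
y = np.array([0.84836004, 0.36291572, 1.96174386, 1.3549349])
eps = solve_triangular(H, y, lower=True)
y_rebuilt = H @ eps
print(Sigma_eps.round(10))
print(eps)
```

Because $H$ is lower triangular, row $t$ of `H @ eps` uses only $\epsilon_0, \ldots, \epsilon_t$, which is the moving-average form displayed above.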
12.10 Stochastic Difference Equation
Consider the stochastic second-order linear difference equation

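$$
y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + u_t
$$

where $u_t \sim N(0, \sigma_u^2)$ and

$$
\begin{bmatrix} y_{-1} \\ y_0 \end{bmatrix} \sim N(\mu_{\tilde y}, \Sigma_{\tilde y})
$$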
It can be written as a stacked system
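$$
\underbrace{\begin{bmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
-\alpha_1 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
-\alpha_2 & -\alpha_1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -\alpha_2 & -\alpha_1 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & -\alpha_2 & -\alpha_1 & 1
\end{bmatrix}}_{\equiv A}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ \vdots \\ y_T \end{bmatrix}
=
\underbrace{\begin{bmatrix}
\alpha_0 + \alpha_1 y_0 + \alpha_2 y_{-1} \\
\alpha_0 + \alpha_2 y_0 \\
\alpha_0 \\
\alpha_0 \\
\vdots \\
\alpha_0
\end{bmatrix}}_{\equiv b}
+
\underbrace{\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_T \end{bmatrix}}_{\equiv u}
$$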
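We can compute $y$ by solving the system

$$
y = A^{-1}(b + u)
$$

We have

$$
\begin{aligned}
\mu_y &= A^{-1}\mu_b \\
\Sigma_y &= A^{-1} E\left[(b - \mu_b + u)(b - \mu_b + u)'\right] (A^{-1})' \\
&= A^{-1}(\Sigma_b + \Sigma_u)(A^{-1})'
\end{aligned}
$$

where

$$
\mu_b = \begin{bmatrix}
\alpha_0 + \alpha_1 \mu_{y_0} + \alpha_2 \mu_{y_{-1}} \\
\alpha_0 + \alpha_2 \mu_{y_0} \\
\alpha_0 \\
\vdots \\
\alpha_0
\end{bmatrix}
$$

$$
\Sigma_b = \begin{bmatrix}
C \Sigma_{\tilde y} C' & 0_{2 \times (T-2)} \\
0_{(T-2) \times 2} & 0_{(T-2) \times (T-2)}
\end{bmatrix}, \quad
C = \begin{bmatrix} \alpha_2 & \alpha_1 \\ 0 & \alpha_2 \end{bmatrix}
$$

$$
\Sigma_u = \begin{bmatrix}
\sigma_u^2 & 0 & \cdots & 0 \\
0 & \sigma_u^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_u^2
\end{bmatrix}
$$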
# set parameters (the second assignment of T overrides the first)
T = 80
T = 160
# coefficients of the second order difference equation
α0 = 10
α1 = 1.53
α2 = -.9
# standard deviation of u (the second assignment overrides the first)
σu = 1.
σu = 10.
# distribution of y_{-1} and y_{0}
μy_tilde = np.array([1., 0.5])
Σy_tilde = np.array([[2., 1.], [1., 0.5]])
# construct A and its inverse
A = np.zeros((T, T))
for i in range(T):
    A[i, i] = 1
    if i-1 >= 0:
        A[i, i-1] = -α1
    if i-2 >= 0:
        A[i, i-2] = -α2
A_inv = np.linalg.inv(A)
# compute the mean vectors of b and y
# (float dtype so that the fractional updates below are not truncated)
μb = np.full(T, α0, dtype=float)
μb[0] += α1 * μy_tilde[1] + α2 * μy_tilde[0]
μb[1] += α2 * μy_tilde[1]
μy = A_inv @ μb
# compute the covariance matrices of b and y
Σu = np.eye(T) * σu ** 2
Σb = np.zeros((T, T))
C = np.array([[α2, α1], [0, α2]])
Σb[:2, :2] = C @ Σy_tilde @ C.T
Σy = A_inv @ (Σb + Σu) @ A_inv.T
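One way to gain confidence in the stacked representation is to check that $\mu_y = A^{-1}\mu_b$ reproduces the mean path obtained by iterating the difference equation directly. The following is a minimal sketch under assumed values (a short horizon `T = 6` and ASCII variable names of our own choosing).

```python
import numpy as np

T = 6
alpha0, alpha1, alpha2 = 10.0, 1.53, -0.9
mu_y_tilde = np.array([1.0, 0.5])          # means of (y_{-1}, y_0)

# A has ones on the diagonal and -alpha1, -alpha2 on the first two subdiagonals
A = np.eye(T) - alpha1 * np.eye(T, k=-1) - alpha2 * np.eye(T, k=-2)

mu_b = np.full(T, alpha0)                  # float array because alpha0 is a float
mu_b[0] += alpha1 * mu_y_tilde[1] + alpha2 * mu_y_tilde[0]
mu_b[1] += alpha2 * mu_y_tilde[1]

mu_y = np.linalg.solve(A, mu_b)            # same as inv(A) @ mu_b

# direct recursion mu_t = alpha0 + alpha1 * mu_{t-1} + alpha2 * mu_{t-2}
path = [mu_y_tilde[0], mu_y_tilde[1]]
for _ in range(T):
    path.append(alpha0 + alpha1 * path[-1] + alpha2 * path[-2])
mu_direct = np.array(path[2:])

print(mu_y)
print(mu_direct)
```

Solving the triangular system and iterating the recursion forward are two routes to the same mean path.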

12.11 Application to Stock Price Model
Let
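$$
p_t = \sum_{j=0}^{T-t} \beta^j y_{t+j}
$$

Form

$$
\underbrace{\begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_T \end{bmatrix}}_{\equiv p}
=
\underbrace{\begin{bmatrix}
1 & \beta & \beta^2 & \cdots & \beta^{T-1} \\
0 & 1 & \beta & \cdots & \beta^{T-2} \\
0 & 0 & 1 & \cdots & \beta^{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}}_{\equiv B}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_T \end{bmatrix}
$$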
we have
𝜇𝑝 = 𝐵𝜇𝑦
Σ𝑝 = 𝐵Σ𝑦 𝐵′
β = .96
# construct B
B = np.zeros((T, T))
for i in range(T):
    B[i, i:] = β ** np.arange(0, T-i)
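The matrix $B$ encodes the present-value sum for every date at once. An equivalent way to compute prices is the backward recursion $p_T = y_T$, $p_t = y_t + \beta p_{t+1}$. Here is a minimal sketch comparing the two on a small, arbitrary dividend path (the path and the short horizon are illustrative assumptions).

```python
import numpy as np

T, beta = 5, 0.96
B = np.zeros((T, T))
for i in range(T):
    B[i, i:] = beta ** np.arange(0, T - i)

y = np.array([1.0, 2.0, 0.5, 3.0, 1.5])    # an arbitrary dividend path
p = B @ y

# backward recursion: p_T = y_T and p_t = y_t + beta * p_{t+1}
p_rec = np.empty(T)
p_rec[-1] = y[-1]
for t in range(T - 2, -1, -1):
    p_rec[t] = y[t] + beta * p_rec[t + 1]

print(p)
print(p_rec)
```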
Denote
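$$
z = \begin{bmatrix} y \\ p \end{bmatrix}
= \underbrace{\begin{bmatrix} I \\ B \end{bmatrix}}_{\equiv D} y
$$

Thus, $\{y_t\}_{t=1}^T$ and $\{p_t\}_{t=1}^T$ jointly follow the multivariate normal distribution $N(\mu_z, \Sigma_z)$, where

$$
\mu_z = D\mu_y, \qquad \Sigma_z = D\Sigma_y D'
$$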
D = np.vstack([np.eye(T), B])
μz = D @ μy
Σz = D @ Σy @ D.T
We can simulate paths of $y_t$ and $p_t$ and compute the conditional mean $E[p_t \mid y_{t-1}, y_t]$ using the MultivariateNormal class.
z = np.random.multivariate_normal(μz, Σz)
y, p = z[:T], z[T:]
cond_Ep = np.empty(T-1)
sub_μ = np.empty(3)
sub_Σ = np.empty((3, 3))
for t in range(2, T+1):
    sub_μ[:] = μz[[t-2, t-1, T-1+t]]
    sub_Σ[:, :] = Σz[[t-2, t-1, T-1+t], :][:, [t-2, t-1, T-1+t]]
    multi_normal = MultivariateNormal(sub_μ, sub_Σ)
    multi_normal.partition(2)
    cond_Ep[t-2] = multi_normal.cond_dist(1, y[t-2:t])[0][0]
plt.plot(range(1, T), y[1:], label='$y_{t}$')

plt.plot(range(1, T), y[:-1], label='$y_{t-1}$')
plt.plot(range(1, T), p[1:], label='$p_{t}$')
plt.plot(range(1, T), cond_Ep, label='$Ep_{t}|y_{t}, y_{t-1}$')
plt.xlabel('t')
plt.legend(loc=1)
plt.show()
In the above graph, the green line is what the price of the stock would be if people had perfect foresight about the path of
dividends, while the red line is the conditional expectation $E[p_t \mid y_t, y_{t-1}]$, which is what the price would be if people did
not have perfect foresight but were optimally predicting future dividends on the basis of the information $y_t, y_{t-1}$ at time
$t$.

12.12 Filtering Foundations
Assume that 𝑥0 is an 𝑛 × 1 random vector and that 𝑦0 is a 𝑝 × 1 random vector determined by the observation equation
$$
y_0 = G x_0 + v_0, \quad x_0 \sim \mathcal{N}(\hat x_0, \Sigma_0), \quad v_0 \sim \mathcal{N}(0, R)
$$
where 𝑣0 is orthogonal to 𝑥0 , 𝐺 is a 𝑝 × 𝑛 matrix, and 𝑅 is a 𝑝 × 𝑝 positive definite matrix.

We consider the problem of someone who
• observes 𝑦0
• does not observe 𝑥0 ,
• knows $\hat x_0, \Sigma_0, G, R$ and therefore the joint probability distribution of the vector $\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$
• wants to infer 𝑥0 from 𝑦0 in light of what he knows about that joint probability distribution.
Therefore, the person wants to construct the probability distribution of 𝑥0 conditional on the random vector 𝑦0 .
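The joint distribution of $\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$ is multivariate normal $\mathcal{N}(\mu, \Sigma)$ with

$$
\mu = \begin{bmatrix} \hat x_0 \\ G \hat x_0 \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} \Sigma_0 & \Sigma_0 G' \\ G \Sigma_0 & G \Sigma_0 G' + R \end{bmatrix}
$$

By applying an appropriate instance of the above formulas for the mean vector $\hat \mu_1$ and covariance matrix $\hat \Sigma_{11}$ of $z_1$
conditional on $z_2$, we find that the probability distribution of $x_0$ conditional on $y_0$ is $\mathcal{N}(\tilde x_0, \tilde \Sigma_0)$ where

$$
\begin{aligned}
\beta_0 &= \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} \\
\tilde x_0 &= \hat x_0 + \beta_0 (y_0 - G \hat x_0) \\
\tilde \Sigma_0 &= \Sigma_0 - \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} G \Sigma_0
\end{aligned}
$$

We can express our finding that the probability distribution of $x_0$ conditional on $y_0$ is $\mathcal{N}(\tilde x_0, \tilde \Sigma_0)$ by representing $x_0$ as

$$
x_0 = \tilde x_0 + \zeta_0 \tag{12.2}
$$

where $\zeta_0$ is a Gaussian random vector that is orthogonal to $\tilde x_0$ and $y_0$ and that has mean vector $0$ and conditional covariance
matrix $E[\zeta_0 \zeta_0' \mid y_0] = \tilde \Sigma_0$.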
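The three formulas for $\beta_0$, $\tilde x_0$ and $\tilde \Sigma_0$ translate line by line into NumPy. Here is a minimal sketch; all numerical values are illustrative assumptions (in particular, we pick a symmetric prior covariance `Sigma0`), and the variable names are our own.

```python
import numpy as np

# illustrative values (assumptions), with a symmetric prior covariance
G = np.array([[1.0, 3.0]])
R = np.array([[1.0]])
x0_hat = np.array([0.0, 1.0])
Sigma0 = np.array([[1.0, 0.5], [0.5, 2.0]])
y0 = np.array([2.3])

S = G @ Sigma0 @ G.T + R                     # covariance matrix of y0
beta0 = Sigma0 @ G.T @ np.linalg.inv(S)      # regression coefficient of x0 on y0
x0_tilde = x0_hat + beta0 @ (y0 - G @ x0_hat)
Sigma0_tilde = Sigma0 - beta0 @ G @ Sigma0

print(x0_tilde)
print(Sigma0_tilde)
```

Observing $y_0$ moves the mean toward values consistent with the signal and shrinks each diagonal entry of the covariance matrix relative to the prior.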
12.12.1 Step toward dynamics
Now suppose that we are in a time series setting and that we have the one-step state transition equation
𝑥1 = 𝐴𝑥0 + 𝐶𝑤1 , 𝑤1 ∼ 𝒩(0, 𝐼)
where 𝐴 is an 𝑛 × 𝑛 matrix and 𝐶 is an 𝑛 × 𝑚 matrix.

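Using equation (12.2), we can also represent $x_1$ as

$$
x_1 = A(\tilde x_0 + \zeta_0) + C w_1
$$

It follows that

$$
E[x_1 \mid y_0] = A \tilde x_0
$$

and that the corresponding conditional covariance matrix $E(x_1 - E[x_1 \mid y_0])(x_1 - E[x_1 \mid y_0])' \equiv \Sigma_1$ is

$$
\Sigma_1 = A \tilde \Sigma_0 A' + C C'
$$

or

$$
\Sigma_1 = A \Sigma_0 A' - A \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} G \Sigma_0 A' + C C'
$$

We can write the mean of $x_1$ conditional on $y_0$ as

$$
\hat x_1 = A \hat x_0 + A \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} (y_0 - G \hat x_0)
$$

or

$$
\hat x_1 = A \hat x_0 + K_0 (y_0 - G \hat x_0)
$$

where

$$
K_0 = A \Sigma_0 G' (G \Sigma_0 G' + R)^{-1}
$$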
12.12.2 Dynamic version
Suppose now that for $t \geq 0$, $\{x_{t+1}, y_t\}_{t=0}^\infty$ are governed by the equations

$$
\begin{aligned}
x_{t+1} &= A x_t + C w_{t+1} \\
y_t &= G x_t + v_t
\end{aligned}
$$

where as before $x_0 \sim \mathcal{N}(\hat x_0, \Sigma_0)$, $w_{t+1}$ is the $(t+1)$th component of an i.i.d. stochastic process distributed as $w_{t+1} \sim \mathcal{N}(0, I)$, $v_t$ is the $t$th component of an i.i.d. process distributed as $v_t \sim \mathcal{N}(0, R)$, and the $\{w_{t+1}\}_{t=0}^\infty$ and $\{v_t\}_{t=0}^\infty$
processes are orthogonal at all pairs of dates.
The logic and formulas that we applied above imply that the probability distribution of $x_t$ conditional on $y^{t-1} \equiv (y_0, y_1, \ldots, y_{t-1})$ is

$$
x_t \mid y^{t-1} \sim \mathcal{N}(A \tilde x_{t-1}, A \tilde \Sigma_{t-1} A' + C C')
$$

where $\{\tilde x_t, \tilde \Sigma_t\}_{t=1}^\infty$ can be computed by iterating on the following equations starting from $t = 1$ and initial conditions for
$\tilde x_0, \tilde \Sigma_0$ computed as we have above:

$$
\begin{aligned}
\Sigma_t &= A \tilde \Sigma_{t-1} A' + C C' \\
\hat x_t &= A \tilde x_{t-1} \\
\beta_t &= \Sigma_t G' (G \Sigma_t G' + R)^{-1} \\
\tilde x_t &= \hat x_t + \beta_t (y_t - G \hat x_t) \\
\tilde \Sigma_t &= \Sigma_t - \Sigma_t G' (G \Sigma_t G' + R)^{-1} G \Sigma_t
\end{aligned}
$$

If we shift the first equation forward one period and then substitute the expression for $\tilde \Sigma_t$ on the right side of the fifth
equation into it we obtain

$$
\Sigma_{t+1} = C C' + A \Sigma_t A' - A \Sigma_t G' (G \Sigma_t G' + R)^{-1} G \Sigma_t A'.
$$
This is a matrix Riccati difference equation that is closely related to another matrix Riccati difference equation that appears
in a quantecon lecture on the basics of linear quadratic control theory.

That equation has the form
𝑃𝑡−1 = 𝑅 + 𝐴′ 𝑃𝑡 𝐴 − 𝐴′ 𝑃𝑡 𝐵(𝐵′ 𝑃𝑡 𝐵 + 𝑄)−1 𝐵′ 𝑃𝑡 𝐴.
Stare at the two preceding equations for a moment or two, the first being a matrix difference equation for a conditional
covariance matrix, the second being a matrix difference equation in the matrix appearing in a quadratic form for an
intertemporal cost or value function.
Although the two equations are not identical, they display striking family resemblances.
• the first equation tells dynamics that work forward in time
• the second equation tells dynamics that work backward in time
• while many of the terms are similar, one equation seems to apply matrix transformations to some matrices that play
similar roles in the other equation
The family resemblances of these two equations reflect a transcendent duality that prevails between control theory and
filtering theory.
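To make the forward-in-time recursion for the conditional covariance matrix concrete, here is a minimal sketch that iterates the filtering Riccati equation $\Sigma_{t+1} = CC' + A\Sigma_t A' - A\Sigma_t G'(G\Sigma_t G' + R)^{-1}G\Sigma_t A'$ until it settles down. The function name `riccati_step`, the starting value and the matrices are illustrative assumptions of ours.

```python
import numpy as np

def riccati_step(Sigma, A, C, G, R):
    """One step of the forward Riccati recursion for the conditional covariance."""
    S = G @ Sigma @ G.T + R
    K = A @ Sigma @ G.T @ np.linalg.inv(S)
    return C @ C.T + A @ Sigma @ A.T - K @ G @ Sigma @ A.T

# illustrative matrices (assumptions)
A = np.array([[0.5, 0.2], [-0.1, 0.3]])
C = np.array([[2.0], [1.0]])
G = np.array([[1.0, 3.0]])
R = np.array([[1.0]])

Sigma = np.eye(2)                 # an arbitrary symmetric starting value
for _ in range(200):
    Sigma = riccati_step(Sigma, A, C, G, R)

print(Sigma)
```

With a stable $A$, as in this example, the iterates approach a fixed point, which is the stationary one-step-ahead conditional covariance matrix.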
12.12.3 An example
We can use the Python class MultivariateNormal to construct examples.

Here is an example for a single period problem at time 0
G = np.array([[1., 3.]])
R = np.array([[1.]])
x0_hat = np.array([0., 1.])

Σ0 = np.array([[1., .5], [.3, 2.]])
μ = np.hstack([x0_hat, G @ x0_hat])
Σ = np.block([[Σ0, Σ0 @ G.T], [G @ Σ0, G @ Σ0 @ G.T + R]])
# construction of the multivariate normal instance
multi_normal = MultivariateNormal(μ, Σ)
multi_normal.partition(2)
# the observation of y
y0 = 2.3
# conditional distribution of x0
μ1_hat, Σ11 = multi_normal.cond_dist(0, y0)
μ1_hat, Σ11
(array([-0.078125, 0.803125]),
array([[ 0.72098214, -0.203125 ],
[-0.403125 , 0.228125 ]]))
A = np.array([[0.5, 0.2], [-0.1, 0.3]])

C = np.array([[2.], [1.]])


# conditional distribution of x1
x1_cond = A @ μ1_hat
Σ1_cond = C @ C.T + A @ Σ11 @ A.T
x1_cond, Σ1_cond
(array([0.1215625, 0.24875 ]),

array([[4.12874554, 1.95523214],
[1.92123214, 1.04592857]]))
12.12.4 Code for Iterating
Here is code for solving a dynamic filtering problem by iterating on our equations, followed by an example.
def iterate(x0_hat, Σ0, A, C, G, R, y_seq):
    p, n = G.shape
    T = len(y_seq)
    x_hat_seq = np.empty((T+1, n))
    Σ_hat_seq = np.empty((T+1, n, n))
    x_hat_seq[0] = x0_hat
    Σ_hat_seq[0] = Σ0
    for t in range(T):
        xt_hat = x_hat_seq[t]
        Σt = Σ_hat_seq[t]
        μ = np.hstack([xt_hat, G @ xt_hat])
        Σ = np.block([[Σt, Σt @ G.T], [G @ Σt, G @ Σt @ G.T + R]])
        # filtering
        multi_normal = MultivariateNormal(μ, Σ)
        multi_normal.partition(n)
        x_tilde, Σ_tilde = multi_normal.cond_dist(0, y_seq[t])
        # forecasting
        x_hat_seq[t+1] = A @ x_tilde
        Σ_hat_seq[t+1] = C @ C.T + A @ Σ_tilde @ A.T
    return x_hat_seq, Σ_hat_seq
iterate(x0_hat, Σ0, A, C, G, R, [2.3, 1.2, 3.2])
(array([[0. , 1. ],
[0.1215625 , 0.24875 ],
[0.18680212, 0.06904689],
[0.75576875, 0.05558463]]),
array([[[1. , 0.5 ],
[0.3 , 2. ]],
[[4.12874554, 1.95523214],
[1.92123214, 1.04592857]],

[[4.08198663, 1.99218488],
[1.98640488, 1.00886423]],
[[4.06457628, 2.00041999],
[1.99943739, 1.00275526]]]))
The iterative algorithm just described is a version of the celebrated Kalman filter.
We describe the Kalman filter and some applications of it in A First Look at the Kalman Filter.
12.13 Classic Factor Analysis Model
The factor analysis model widely used in psychology and other fields can be represented as
𝑌 = Λ𝑓 + 𝑈
where
1. 𝑌 is 𝑛 × 1 random vector, 𝐸𝑈 𝑈 ′ = 𝐷 is a diagonal matrix,
2. Λ is 𝑛 × 𝑘 coefficient matrix,
3. 𝑓 is 𝑘 × 1 random vector, 𝐸𝑓𝑓 ′ = 𝐼,
4. 𝑈 is 𝑛 × 1 random vector, and 𝑈 ⟂ 𝑓 (i.e., 𝐸𝑈 𝑓 ′ = 0 )
5. It is presumed that 𝑘 is small relative to 𝑛; often 𝑘 is only 1 or 2, as in our IQ examples.
This implies that
Σ𝑦 = 𝐸𝑌 𝑌 ′ = ΛΛ′ + 𝐷
𝐸𝑌 𝑓 ′ = Λ
𝐸𝑓𝑌 ′ = Λ′
Thus, the covariance matrix $\Sigma_y$ is the sum of a diagonal matrix $D$ and a positive semi-definite matrix $\Lambda\Lambda'$ of rank $k$.
This means that all covariances among the $n$ components of the $Y$ vector are intermediated by their common dependencies
on the $k < n$ factors.
Form
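$$
Z = \begin{pmatrix} f \\ Y \end{pmatrix}
$$

the covariance matrix of the expanded random vector $Z$ can be computed as

$$
\Sigma_z = E Z Z' = \begin{pmatrix} I & \Lambda' \\ \Lambda & \Lambda\Lambda' + D \end{pmatrix}
$$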
In the following, we first construct the mean vector and the covariance matrix for the case where 𝑁 = 10 and 𝑘 = 2.
N = 10
k = 2

We set the coefficient matrix Λ and the covariance matrix of 𝑈 to be

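$$
\Lambda = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix}, \quad
D = \begin{pmatrix}
\sigma_u^2 & 0 & \cdots & 0 \\
0 & \sigma_u^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_u^2
\end{pmatrix}
$$

where the first half of the first column of $\Lambda$ is filled with 1s and the second half with 0s, and the reverse holds for the second
column.
$D$ is a diagonal matrix with parameter $\sigma_u^2$ on the diagonal.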
Λ = np.zeros((N, k))
Λ[:N//2, 0] = 1
Λ[N//2:, 1] = 1
σu = .5
D = np.eye(N) * σu ** 2
# compute Σy
Σy = Λ @ Λ.T + D
We can now construct the mean vector and the covariance matrix for 𝑍.
μz = np.zeros(k+N)
Σz = np.empty((k+N, k+N))
Σz[:k, :k] = np.eye(k)

Σz[:k, k:] = Λ.T
Σz[k:, :k] = Λ
Σz[k:, k:] = Σy
z = np.random.multivariate_normal(μz, Σz)
f = z[:k]
y = z[k:]
multi_normal_factor = MultivariateNormal(μz, Σz)

multi_normal_factor.partition(k)
Let’s compute the conditional distribution of the hidden factor 𝑓 on the observations 𝑌 , namely, 𝑓 ∣ 𝑌 = 𝑦.
multi_normal_factor.cond_dist(0, y)
(array([-0.30191322, 1.22653669]),
array([[0.04761905, 0. ],
[0. , 0.04761905]]))
We can verify that the conditional mean $E[f \mid Y = y] = B y$ where $B = \Lambda' \Sigma_y^{-1}$.

B = Λ.T @ np.linalg.inv(Σy)
B @ y
array([-0.30191322, 1.22653669])
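The conditional covariance matrix of $f$ given $Y$ does not depend on the realized $y$: the partitioned-normal formula gives $I - \Lambda'\Sigma_y^{-1}\Lambda$. By the Woodbury matrix identity this equals $(I + \Lambda' D^{-1}\Lambda)^{-1}$, which only requires inverting $k \times k$ and diagonal matrices. The sketch below checks both expressions for the $N = 10$, $k = 2$ example; the ASCII variable names are ours.

```python
import numpy as np

N, k, sigma_u = 10, 2, 0.5
Lam = np.zeros((N, k))
Lam[:N // 2, 0] = 1
Lam[N // 2:, 1] = 1
D = np.eye(N) * sigma_u ** 2
Sigma_y = Lam @ Lam.T + D

# conditional covariance of f given Y from the partitioned-normal formula
B = Lam.T @ np.linalg.inv(Sigma_y)
cov_direct = np.eye(k) - B @ Lam

# the same matrix via the Woodbury identity
cov_woodbury = np.linalg.inv(np.eye(k) + Lam.T @ np.linalg.inv(D) @ Lam)

print(cov_direct)
print(cov_woodbury)
```

Each factor loads on five observations with noise variance 0.25, so each diagonal entry equals $1/(1 + 5/0.25) = 1/21$.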
Similarly, we can compute the conditional distribution 𝑌 ∣ 𝑓.
multi_normal_factor.cond_dist(1, f)
(array([-0.1949429 , -0.1949429 , -0.1949429 , -0.1949429 , -0.1949429 ,

1.36894286, 1.36894286, 1.36894286, 1.36894286, 1.36894286]),
array([[0.25, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.25, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0.25, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.25, 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.25, 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0.25, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.25, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.25, 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.25, 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.25]]))
It can be verified that the mean is $\Lambda I^{-1} f = \Lambda f$.
Λ @ f
array([-0.1949429 , -0.1949429 , -0.1949429 , -0.1949429 , -0.1949429 ,

1.36894286, 1.36894286, 1.36894286, 1.36894286, 1.36894286])
12.14 PCA and Factor Analysis
To learn about Principal Components Analysis (PCA), please see this lecture Singular Value Decompositions.
For fun, let’s apply a PCA decomposition to a covariance matrix Σ𝑦 that in fact is governed by our factor-analytic model.
Technically, this means that the PCA model is misspecified. (Can you explain why?)
Nevertheless, this exercise will let us study how well the first two principal components from a PCA can approximate the
conditional expectations 𝐸𝑓𝑖 |𝑌 for our two factors 𝑓𝑖 , 𝑖 = 1, 2 for the factor analytic model that we have assumed truly
governs the data on 𝑌 we have generated.
So we compute the PCA decomposition

$$
\Sigma_y = P \tilde\Lambda P'
$$

where $\tilde\Lambda$ is a diagonal matrix.

We have
𝑌 = 𝑃𝜖

and
𝜖 = 𝑃 ′𝑌
Note that we will arrange the eigenvectors in 𝑃 in the descending order of eigenvalues.
λ_tilde, P = np.linalg.eigh(Σy)
# arrange the eigenvectors by eigenvalues
ind = sorted(range(N), key=lambda x: λ_tilde[x], reverse=True)
P = P[:, ind]
λ_tilde = λ_tilde[ind]
Λ_tilde = np.diag(λ_tilde)
print('λ_tilde =', λ_tilde)
λ_tilde = [5.25 5.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25]
# verify the orthogonality of eigenvectors

np.abs(P @ P.T - np.eye(N)).max()
4.440892098500626e-16
# verify the eigenvalue decomposition is correct

P @ Λ_tilde @ P.T
array([[1.25, 1. , 1. , 1. , 1. , 0. , 0. , 0. , 0. , 0. ],
[1. , 1.25, 1. , 1. , 1. , 0. , 0. , 0. , 0. , 0. ],
[1. , 1. , 1.25, 1. , 1. , 0. , 0. , 0. , 0. , 0. ],
[1. , 1. , 1. , 1.25, 1. , 0. , 0. , 0. , 0. , 0. ],
[1. , 1. , 1. , 1. , 1.25, 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 1.25, 1. , 1. , 1. , 1. ],
[0. , 0. , 0. , 0. , 0. , 1. , 1.25, 1. , 1. , 1. ],
[0. , 0. , 0. , 0. , 0. , 1. , 1. , 1.25, 1. , 1. ],
[0. , 0. , 0. , 0. , 0. , 1. , 1. , 1. , 1.25, 1. ],
[0. , 0. , 0. , 0. , 0. , 1. , 1. , 1. , 1. , 1.25]])
ε = P.T @ y
print("ε = ", ε)
ε = [ 2.87975038 -0.70885341 -0.0648366 0.11824707 0.23763429 -0.25914236

-0.06501703 -0.25015218 -0.30772868 -0.37248783]
# print the values of the two factors
print('f = ', f)
f = [-0.1949429 1.36894286]
Below we’ll plot several things

• the 𝑁 values of 𝑦
• the 𝑁 values of the principal components 𝜖
• the value of the first factor 𝑓1 plotted only for the first 𝑁 /2 observations of 𝑦 for which it receives a non-zero
loading in Λ
• the value of the second factor 𝑓2 plotted only for the final 𝑁 /2 observations for which it receives a non-zero loading
in Λ
plt.scatter(range(N), y, label='y')
plt.scatter(range(N), ε, label=r'$\epsilon$')
plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$')
plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$')
plt.legend()
plt.show()
Consequently, the first two 𝜖𝑗 correspond to the largest two eigenvalues.

Let’s look at them, after which we’ll look at 𝐸𝑓|𝑦 = 𝐵𝑦
ε[:2]
array([ 2.87975038, -0.70885341])
# compare with Ef|y

B @ y
array([-0.30191322, 1.22653669])
The fraction of variance in 𝑦𝑡 explained by the first two principal components can be computed as below.
λ_tilde[:2].sum() / λ_tilde.sum()

0.84
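The value 0.84 can be reproduced from the structure of $\Sigma_y$ alone: the two large eigenvalues are $5 + \sigma_u^2 = 5.25$ and the remaining eight equal $\sigma_u^2 = 0.25$. Here is a minimal, self-contained sketch; it uses `np.argsort` for the ordering, which is our own choice, and also forms the rank-two approximation of $\Sigma_y$.

```python
import numpy as np

N, k, sigma_u = 10, 2, 0.5
Lam = np.zeros((N, k))
Lam[:N // 2, 0] = 1
Lam[N // 2:, 1] = 1
Sigma_y = Lam @ Lam.T + sigma_u ** 2 * np.eye(N)

eigvals, P = np.linalg.eigh(Sigma_y)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # indices for descending order
eigvals, P = eigvals[order], P[:, order]

# share of total variance carried by the first two principal components
share = eigvals[:2].sum() / eigvals.sum()

# rank-two approximation of Sigma_y built from those two components
Sigma_y_hat = P[:, :2] @ np.diag(eigvals[:2]) @ P[:, :2].T

print(eigvals)
print(share)
```

Because the two largest eigenvalues are equal, the individual eigenvectors returned by `eigh` are not unique, but the rank-two approximation is.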
Compute
$$
\hat Y = P_j \epsilon_j + P_k \epsilon_k
$$

where $P_j$ and $P_k$ correspond to the largest two eigenvalues.
y_hat = P[:, :2] @ ε[:2]
In this example, it turns out that the projection $\hat Y$ of $Y$ on the first two principal components does a good job of approximating $E[f \mid y]$.
We confirm this in the following plot of $f$, $E[y \mid f]$, $E[f \mid y]$, and $\hat y$, each plotted against the index of the corresponding component of $y$.
plt.scatter(range(N), Λ @ f, label='$Ey|f$')
plt.scatter(range(N), y_hat, label=r'$\hat{y}$')
plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$')
plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$')
Efy = B @ y
plt.hlines(Efy[0], 0, N//2-1, ls='--', color='b', label='$Ef_{1}|y$')
plt.hlines(Efy[1], N//2, N-1, ls='-.', color='b', label='$Ef_{2}|y$')
plt.legend()
plt.show()
The covariance matrix of $\hat Y$ can be computed by first constructing the covariance matrix of $\epsilon$ and then using the upper left
block for $\epsilon_1$ and $\epsilon_2$.
Σεjk = (P.T @ Σy @ P)[:2, :2]
Pjk = P[:, :2]
Σy_hat = Pjk @ Σεjk @ Pjk.T

print('Σy_hat = \n', Σy_hat)

Σy_hat =
[[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 1.05 1.05 1.05 1.05 1.05]
[0. 0. 0. 0. 0. 1.05 1.05 1.05 1.05 1.05]
[0. 0. 0. 0. 0. 1.05 1.05 1.05 1.05 1.05]
[0. 0. 0. 0. 0. 1.05 1.05 1.05 1.05 1.05]
[0. 0. 0. 0. 0. 1.05 1.05 1.05 1.05 1.05]]

CHAPTER
THIRTEEN
FAULT TREE UNCERTAINTIES
13.1 Overview
This lecture puts elementary tools to work to approximate probability distributions of the annual failure rates of a system
consisting of a number of critical parts.
We’ll use log normal distributions to approximate probability distributions of critical component parts.
To approximate the probability distribution of the sum of 𝑛 log normal probability distributions that describes the failure
rate of the entire system, we’ll compute the convolution of those 𝑛 log normal probability distributions.
We’ll use the following concepts and tools:
• log normal distributions
• the convolution theorem that describes the probability distribution of the sum of independent random variables
• fault tree analysis for approximating a failure rate of a multi-component system
• a hierarchical probability model for describing uncertain probabilities
• Fourier transforms and inverse Fourier transforms as efficient ways of computing convolutions of sequences
For more about Fourier transforms see this quantecon lecture Circulant Matrices as well as these lectures Covariance
Stationary Processes and Estimation of Spectra.
El-Shanawany, Ardron, and Walker [El-Shanawany et al., 2018] and Greenfield and Sargent [Greenfield and Sargent,
1993] used some of the methods described here to approximate probabilities of failures of safety systems in nuclear
facilities.
These methods respond to some of the recommendations made by Apostolakis [Apostolakis, 1990] for constructing
procedures for quantifying uncertainty about the reliability of a safety system.
We’ll start by bringing in some Python machinery.
!pip install tabulate
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import fftconvolve
from tabulate import tabulate
import time
13.2 Log normal distribution
If a random variable $x$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$, then $y = \exp(x)$ follows a log normal distribution with parameters $\mu, \sigma^2$. Equivalently, the natural logarithm of $y$, namely $x = \log(y)$, is normally distributed.
Notice that we said parameters and not mean and variance $\mu, \sigma^2$.
• $\mu$ and $\sigma^2$ are the mean and variance of $x = \log(y)$
• they are not the mean and variance of $y$
• instead, the mean of $y$ is $e^{\mu + \frac{1}{2}\sigma^2}$ and the variance of $y$ is $(e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$
A log normal random variable $y$ is nonnegative.
The density for a log normal random variate $y$ is

$$
f(y) = \frac{1}{y \sigma \sqrt{2\pi}} \exp\left(\frac{-(\log y - \mu)^2}{2\sigma^2}\right)
$$

for $y > 0$.
Important features of a log normal random variable are
mean: $e^{\mu + \frac{1}{2}\sigma^2}$
variance: $(e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$
median: $e^{\mu}$
mode: $e^{\mu - \sigma^2}$
.95 quantile: $e^{\mu + 1.645\sigma}$
.95-.05 quantile ratio: $e^{3.290\sigma}$
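Several of these formulas can be checked against SciPy. Here is a minimal sketch, assuming the parameter values $\mu = 5$, $\sigma = 1$ for illustration. SciPy's `lognorm` uses a shape parameter `s` equal to $\sigma$ and a `scale` equal to $e^{\mu}$.

```python
import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 5.0, 1.0
# SciPy's parameterization: shape s = sigma and scale = exp(mu)
dist = lognorm(s=sigma, scale=np.exp(mu))

mean_formula = np.exp(mu + 0.5 * sigma ** 2)
var_formula = (np.exp(sigma ** 2) - 1) * np.exp(2 * mu + sigma ** 2)
median_formula = np.exp(mu)
# norm.ppf(0.95) is the standard normal .95 quantile, roughly 1.645
q95_formula = np.exp(mu + norm.ppf(0.95) * sigma)

print(dist.mean(), mean_formula)
print(dist.var(), var_formula)
print(dist.median(), median_formula)
print(dist.ppf(0.95), q95_formula)
```

Each printed pair should agree up to floating point error.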
Recall the following stability property of two independent normally distributed random variables:
If 𝑥1 is normal with mean 𝜇1 and variance 𝜎12 and 𝑥2 is independent of 𝑥1 and normal with mean 𝜇2 and variance 𝜎22 ,
then 𝑥1 + 𝑥2 is normally distributed with mean 𝜇1 + 𝜇2 and variance 𝜎12 + 𝜎22 .
Independent log normal distributions have a different stability property.
The product of independent log normal random variables is also log normal.
In particular, if 𝑦1 is log normal with parameters (𝜇1 , 𝜎12 ) and 𝑦2 is log normal with parameters (𝜇2 , 𝜎22 ), then the product
𝑦1 𝑦2 is log normal with parameters (𝜇1 + 𝜇2 , 𝜎12 + 𝜎22 ).
Note: While the product of two log normal distributions is log normal, the sum of two log normal distributions is not
log normal.

This observation sets the stage for the challenge that confronts us in this lecture, namely, to approximate probability distributions of sums of independent log normal random variables.
To compute the probability distribution of the sum of two log normal distributions, we can use the following convolution
property of a probability distribution that is a sum of independent random variables.
13.3 The Convolution Property
Let 𝑥 be a random variable with probability density 𝑓(𝑥), where 𝑥 ∈ R.

Let 𝑦 be a random variable with probability density 𝑔(𝑦), where 𝑦 ∈ R.
Let 𝑥 and 𝑦 be independent random variables and let 𝑧 = 𝑥 + 𝑦 ∈ R.
Then the probability density of $z$ is

$$
h(z) = (f * g)(z) \equiv \int_{-\infty}^{\infty} f(\tau) g(z - \tau) \, d\tau
$$

where $(f * g)$ denotes the convolution of the two functions $f$ and $g$.

If the random variables are both nonnegative, then the above formula specializes to

$$
h(z) = (f * g)(z) \equiv \int_{0}^{\infty} f(\tau) g(z - \tau) \, d\tau
$$
Below, we’ll use a discretized version of the preceding formula.

In particular, we’ll replace both 𝑓 and 𝑔 with discretized counterparts, normalized to sum to 1 so that they are probability
distributions.
• by discretized we mean an equally spaced sampled version
Then we’ll use the following version of the above formula
$$
h_n = (f * g)_n = \sum_{m=0}^{\infty} f_m g_{n-m}, \quad n \geq 0
$$
to compute a discretized version of the probability distribution of the sum of two random variables, one with probability
mass function 𝑓, the other with probability mass function 𝑔.
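The discrete formula can be implemented directly with two loops. Here is a minimal sketch; the function name `convolve_pmf` is hypothetical, and the two probability mass functions are small illustrative examples supported on $0, 1, 2, \ldots$.

```python
import numpy as np

def convolve_pmf(f, g):
    """Probability mass function of X + Y for independent X ~ f and Y ~ g."""
    h = np.zeros(len(f) + len(g) - 1)
    for m, fm in enumerate(f):
        for j, gj in enumerate(g):
            # the event X = m and Y = j contributes to Prob(X + Y = m + j)
            h[m + j] += fm * gj
    return h

f = [0.75, 0.25]
g = [0.0, 0.6, 0.0, 0.4]
h = convolve_pmf(f, g)
print(h, h.sum())
```

The double loop costs on the order of `len(f) * len(g)` operations, which is why faster methods matter for long sequences.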
Before applying the convolution property to sums of log normal distributions, let’s practice on some simple discrete
distributions.
To take one example, let’s consider the following two probability distributions
𝑓𝑗 = Prob(𝑋 = 𝑗), 𝑗 = 0, 1
and
𝑔𝑗 = Prob(𝑌 = 𝑗), 𝑗 = 0, 1, 2, 3
and
ℎ𝑗 = Prob(𝑍 ≡ 𝑋 + 𝑌 = 𝑗), 𝑗 = 0, 1, 2, 3, 4
The convolution property tells us that
ℎ=𝑓 ∗𝑔 =𝑔∗𝑓
Let’s compute an example using the numpy.convolve and scipy.signal.fftconvolve.

f = [.75, .25]
g = [0., .6, 0., .4]
h = np.convolve(f,g)
hf = fftconvolve(f,g)
print("f = ", f, ", np.sum(f) = ", np.sum(f))

print("g = ", g, ", np.sum(g) = ", np.sum(g))
print("h = ", h, ", np.sum(h) = ", np.sum(h))
print("hf = ", hf, ",np.sum(hf) = ", np.sum(hf))
f = [0.75, 0.25] , np.sum(f) = 1.0

g = [0.0, 0.6, 0.0, 0.4] , np.sum(g) = 1.0
h = [0. 0.45 0.15 0.3 0.1 ] , np.sum(h) = 1.0
hf = [0. 0.45 0.15 0.3 0.1 ] ,np.sum(hf) = 1.0000000000000002
A little later we’ll explain some advantages that come from using scipy.signal.fftconvolve rather than numpy.convolve.
They provide the same answers but scipy.signal.fftconvolve is much faster for long sequences.
That’s why we rely on it later in this lecture.
13.4 Approximating Distributions
We’ll construct an example to verify that discretized distributions can do a good job of approximating samples drawn
from underlying continuous distributions.
We’ll start by generating samples of size 25000 of three independent log normal random variates as well as pairwise and
triple-wise sums.
Then we’ll plot histograms and compare them with convolutions of appropriate discretized log normal distributions.
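## create sums of two and three log normal random variates ssum2 = s1 + s2 and ssum3 = s1 + s2 + s3
mu1, sigma1 = 5., 1. # mean and standard deviation of log(s1)
s1 = np.random.lognormal(mu1, sigma1, 25000)
mu2, sigma2 = 5., 1. # mean and standard deviation of log(s2)
s2 = np.random.lognormal(mu2, sigma2, 25000)
mu3, sigma3 = 5., 1. # mean and standard deviation of log(s3) (assumed equal to the first two)
s3 = np.random.lognormal(mu3, sigma3, 25000)
ssum2 = s1 + s2
ssum3 = s1 + s2 + s3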
count, bins, ignored = plt.hist(s1, 1000, density=True, align='mid')

count, bins, ignored = plt.hist(ssum2, 1000, density=True, align='mid')

samp_mean2 = np.mean(s2)
pop_mean2 = np.exp(mu2+ (sigma2**2)/2)
pop_mean2, samp_mean2, mu2, sigma2
(244.69193226422038, 245.39218776762786, 5.0, 1.0)
Here are helper functions that create a discretized version of a log normal probability density function.
def p_log_normal(x,μ,σ):
    p = 1 / (σ*x*np.sqrt(2*np.pi)) * np.exp(-1/2*((np.log(x) - μ)/σ)**2)
    return p
def pdf_seq(μ,σ,I,m):
    x = np.arange(1e-7,I,m)
    p_array = p_log_normal(x,μ,σ)
    p_array_norm = p_array/np.sum(p_array)
    return p_array,p_array_norm,x
Now we shall set a grid length 𝐼 and a grid increment size 𝑚 = 0.1 for our discretizations.
Note: We set 𝐼 equal to a power of two because we want to be free to use a Fast Fourier Transform to compute a
convolution of two sequences (discrete distributions).
We recommend experimenting with different values of the power 𝑝 of 2.

Setting it to 15 rather than 12, for example, improves how well the discretized probability mass function approximates
the original continuous probability density function being studied.
p=15
I = 2**p # Truncation value
m = .1 # increment size
## Cell to check -- note what happens when don't normalize!

## things match up without adjustment. Compare with above
p1,p1_norm,x = pdf_seq(mu1,sigma1,I,m)
## compute number of points to evaluate the probability mass function
NT = x.size
plt.figure(figsize = (8,8))
plt.subplot(2,1,1)
plt.plot(x[:int(NT)],p1[:int(NT)],label = '')
plt.xlim(0,2500)
count, bins, ignored = plt.hist(s1, 1000, density=True, align='mid')
plt.show()
# Compute mean from discretized pdf and compare with the theoretical value
mean= np.sum(np.multiply(x[:NT],p1_norm[:NT]))
meantheory = np.exp(mu1+.5*sigma1**2)
mean, meantheory
(244.69059898302908, 244.69193226422038)

13.5 Convolving Probability Mass Functions
Now let’s use the convolution theorem to compute the probability distribution of a sum of the two log normal random
variables we have parameterized above.
We’ll also compute the probability distribution of a sum of the three log normal random variables constructed above.
Before we do these things, we shall explain our choice of Python algorithm to compute a convolution of two sequences.
Because the sequences that we convolve are long, we use the scipy.signal.fftconvolve function rather than the numpy.convolve function.
These two functions give virtually equivalent answers but for long sequences scipy.signal.fftconvolve is much faster.
The program scipy.signal.fftconvolve uses fast Fourier transforms and their inverses to calculate convolutions.
Let’s define the Fourier transform and the inverse Fourier transform.
The Fourier transform of a sequence $\{x_t\}_{t=0}^{T-1}$ is a sequence of complex numbers $\{x(\omega_j)\}_{j=0}^{T-1}$ given by

$$ x(\omega_j) = \sum_{t=0}^{T-1} x_t \exp(-i \omega_j t) \tag{13.1} $$

where $\omega_j = \frac{2 \pi j}{T}$ for $j = 0, 1, \ldots, T-1$.
The inverse Fourier transform of the sequence $\{x(\omega_j)\}_{j=0}^{T-1}$ is

$$ x_t = T^{-1} \sum_{j=0}^{T-1} x(\omega_j) \exp(i \omega_j t) \tag{13.2} $$

The sequences $\{x_t\}_{t=0}^{T-1}$ and $\{x(\omega_j)\}_{j=0}^{T-1}$ contain the same information.
The pair of equations (13.1) and (13.2) tell how to recover one series from its Fourier partner.
The program scipy.signal.fftconvolve deploys the theorem that a convolution of two sequences {𝑓𝑘 }, {𝑔𝑘 }
can be computed in the following way:
• Compute Fourier transforms 𝐹 (𝜔), 𝐺(𝜔) of the {𝑓𝑘 } and {𝑔𝑘 } sequences, respectively
• Form the product 𝐻(𝜔) = 𝐹 (𝜔)𝐺(𝜔)
• The convolution 𝑓 ∗ 𝑔 is the inverse Fourier transform of 𝐻(𝜔)
The fast Fourier transform and the associated inverse fast Fourier transform execute these calculations very quickly.
This is the algorithm that scipy.signal.fftconvolve uses.
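As a small check of the three steps listed above, the following sketch carries them out by hand with NumPy's FFT routines and compares the result with `np.convolve`. The helper name `fft_convolve_by_hand` and the two short sequences are illustrative assumptions, not part of the lecture's code. Zero-padding both sequences to length `len(f) + len(g) - 1` is what turns the circular convolution implied by the discrete Fourier transform into the ordinary (linear) convolution.

```python
import numpy as np

def fft_convolve_by_hand(f, g):
    """Convolve two 1-D sequences using the convolution theorem."""
    n = len(f) + len(g) - 1      # length of the linear convolution
    F = np.fft.fft(f, n)         # Fourier transform of f, zero-padded to length n
    G = np.fft.fft(g, n)         # Fourier transform of g, zero-padded to length n
    H = F * G                    # pointwise product H(ω) = F(ω) G(ω)
    return np.real(np.fft.ifft(H))   # inverse transform; imaginary parts are rounding noise

# Two short illustrative probability mass functions
f = np.array([0.2, 0.5, 0.3])
g = np.array([0.6, 0.4])

h_fft = fft_convolve_by_hand(f, g)
h_direct = np.convolve(f, g)
print(np.allclose(h_fft, h_direct))
```

Because both inputs sum to one, the convolution also sums to one, as a probability mass function for a sum of two independent discrete random variables should.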
Let’s do a warmup calculation that compares the times taken by numpy.convolve and scipy.signal.fftconvolve.
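p2,p2_norm,x = pdf_seq(mu2,sigma2,I,m)
p3,p3_norm,x = pdf_seq(mu3,sigma3,I,m)

# time the direct convolutions
tic = time.perf_counter()
c1 = np.convolve(p1_norm,p2_norm)
c2 = np.convolve(c1,p3_norm)
toc = time.perf_counter()
tdiff1 = toc - tic

# time the FFT-based convolutions
tic = time.perf_counter()
c1f = fftconvolve(p1_norm,p2_norm)
c2f = fftconvolve(c1f,p3_norm)
toc = time.perf_counter()
tdiff2 = toc - tic
print("time with np.convolve = ", tdiff1, "; time with fftconvolve = ", tdiff2)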
time with np.convolve = 47.5052065660002 ; time with fftconvolve = 0.16856744300002902
The fast Fourier transform is two orders of magnitude faster than numpy.convolve.
Now let’s plot our computed probability mass function approximation for the sum of two log normal random variables
against the histogram of the sample that we formed above.
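NT= np.size(x)
plt.subplot(2,1,1)
plt.plot(x[:int(NT)],c1f[:int(NT)]/m,label = '')
plt.xlim(0,5000)
# histogram of the simulated sums s1 + s2 for comparison
count, bins, ignored = plt.hist(ssum2, 1000, density=True, align='mid')
# plt.plot(P2P3[:10000],label = 'FFT method',linestyle = '--')
plt.show()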

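NT= np.size(x)
plt.subplot(2,1,1)
plt.plot(x[:int(NT)],c2f[:int(NT)]/m,label = '')
plt.xlim(0,5000)
# histogram of the simulated sums s1 + s2 + s3 for comparison
count, bins, ignored = plt.hist(ssum3, 1000, density=True, align='mid')
# plt.plot(P2P3[:10000],label = 'FFT method',linestyle = '--')
plt.show()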
## Let's compute the mean of the discretized pdf

mean= np.sum(np.multiply(x[:NT],c1f[:NT]))
# meantheory = np.exp(mu1+.5*sigma1**2)
mean, 2*meantheory
(489.3810974093853, 489.38386452844077)
## Let's compute the mean of the discretized pdf

mean= np.sum(np.multiply(x[:NT],c2f[:NT]))
# meantheory = np.exp(mu1+.5*sigma1**2)
mean, 3*meantheory
(734.0714863312272, 734.0757967926611)

13.6 Failure Tree Analysis
We shall soon apply the convolution theorem to compute the probability of a top event in a failure tree analysis.
Before applying the convolution theorem, we first describe the model that connects constituent events to the top event whose failure rate we seek to quantify.
The model is an example of the widely used failure tree analysis described by El-Shanawany, Ardron, and Walker
[El-Shanawany et al., 2018].
To construct the statistical model, we repeatedly use what is called the rare event approximation.
We want to compute the probability of an event 𝐴 ∪ 𝐵.
• the union 𝐴 ∪ 𝐵 is the event that 𝐴 OR 𝐵 occurs
A law of probability tells us that 𝐴 OR 𝐵 occurs with probability
𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵)
where the intersection 𝐴 ∩ 𝐵 is the event that 𝐴 AND 𝐵 both occur.
If 𝐴 and 𝐵 are independent, then
𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐴)𝑃 (𝐵)
If 𝑃 (𝐴) and 𝑃 (𝐵) are both small, then 𝑃 (𝐴)𝑃 (𝐵) is even smaller.
The rare event approximation is
𝑃 (𝐴 ∪ 𝐵) ≈ 𝑃 (𝐴) + 𝑃 (𝐵)
This approximation is widely used in evaluating system failures.
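To get a feel for how tight the rare event approximation is, here is a minimal numerical sketch for two independent events. The function names and the two probabilities are illustrative assumptions. Under independence the approximation error is exactly 𝑃(𝐴)𝑃(𝐵), so the approximation always overstates the probability of the union, and only slightly when both probabilities are small.

```python
def union_exact(p_a, p_b):
    """P(A ∪ B) for independent events A and B."""
    return p_a + p_b - p_a * p_b

def union_rare_event(p_a, p_b):
    """Rare event approximation to P(A ∪ B)."""
    return p_a + p_b

# Illustrative (hypothetical) small failure probabilities
p_a, p_b = 1e-3, 2e-3

exact = union_exact(p_a, p_b)
approx = union_rare_event(p_a, p_b)
abs_error = approx - exact          # equals p_a * p_b under independence
rel_error = abs_error / exact

print(exact, approx, abs_error, rel_error)
```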
13.7 Application
A system has been designed with the feature that a system failure occurs when any of its 𝑛 critical components fails.
The failure probability 𝑃 (𝐴𝑖 ) of each event 𝐴𝑖 is small.
We assume that failures of the components are statistically independent events.
We repeatedly apply a rare event approximation to obtain the following formula for the probability of a system failure:

$$ P(F) \approx P(A_1) + P(A_2) + \cdots + P(A_n) $$

or

$$ P(F) \approx \sum_{i=1}^{n} P(A_i) \tag{13.3} $$
Probabilities for each event are recorded as failure rates per year.

13.8 Failure Rates Unknown
Now we come to the problem that really interests us, following [El-Shanawany et al., 2018] and Greenfield and Sargent
[Greenfield and Sargent, 1993] in the spirit of Apostolakis [Apostolakis, 1990].
The constituent probabilities or failure rates 𝑃 (𝐴𝑖 ) are not known a priori and have to be estimated.
We address this problem by specifying probabilities of probabilities that capture one notion of not knowing the constituent probabilities that are inputs into a failure tree analysis.
Thus, we assume that a system analyst is uncertain about the failure rates 𝑃 (𝐴𝑖 ), 𝑖 = 1, … , 𝑛 for components of a system.
The analyst copes with this situation by regarding the system’s failure probability 𝑃 (𝐹 ) and each of the component probabilities 𝑃 (𝐴𝑖 ) as random variables.
• the dispersion of the probability distribution of 𝑃 (𝐴𝑖 ) characterizes the analyst’s uncertainty about the failure probability 𝑃 (𝐴𝑖 )
• the dispersion of the implied probability distribution of 𝑃 (𝐹 ) characterizes his uncertainty about the probability of a system’s failure.
This leads to what is sometimes called a hierarchical model in which the analyst has probabilities about the probabilities
𝑃 (𝐴𝑖 ).
The analyst formalizes his uncertainty by assuming that
• the failure probability 𝑃 (𝐴𝑖 ) is itself a log normal random variable with parameters (𝜇𝑖 , 𝜎𝑖 ).
• failure rates 𝑃 (𝐴𝑖 ) and 𝑃 (𝐴𝑗 ) are statistically independent for all pairs with 𝑖 ≠ 𝑗.
The analyst calibrates the parameters (𝜇𝑖 , 𝜎𝑖 ) for the failure events 𝑖 = 1, … , 𝑛 by reading reliability studies in engineering
papers that have studied historical failure rates of components that are as similar as possible to the components being used
in the system under study.
The analyst assumes that such information about the observed dispersion of annual failure rates, or times to failure, can
inform him of what to expect about parts’ performances in his system.
The analyst assumes that the random variables 𝑃 (𝐴𝑖 ) are statistically mutually independent.
The analyst wants to approximate a probability mass function and cumulative distribution function of the system’s failure probability 𝑃 (𝐹 ).
• We say probability mass function because of how we discretize each random variable, as described earlier.
The analyst calculates the probability mass function for the top event 𝐹 , i.e., a system failure, by repeatedly applying
the convolution theorem to compute the probability distribution of a sum of independent log normal random variables, as
described in equation (13.3).
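The repeated application of the convolution theorem described above can be sketched as a fold over a list of discretized distributions. The code below is a minimal illustration under stated assumptions: the helper names `lognormal_pmf` and `pmf_of_sum`, the three `(mu, sigma)` pairs, and the small grid are all hypothetical choices made so the example runs quickly; they are not the hoist parameters. The sketch reuses the discretization idea from earlier (evaluate the log normal density on a grid and normalize) and checks that the mean of the resulting probability mass function is close to the sum of the theoretical means.

```python
from functools import reduce

import numpy as np
from scipy.signal import fftconvolve

def lognormal_pmf(mu, sigma, I, m, x0=1e-7):
    """Normalized discretization of a log normal density on the grid x0, x0+m, ..."""
    x = np.arange(x0, I, m)
    p = np.exp(-0.5 * ((np.log(x) - mu) / sigma) ** 2) / (sigma * x * np.sqrt(2 * np.pi))
    return p / p.sum()

def pmf_of_sum(params, I, m, x0=1e-7):
    """PMF of a sum of independent log normals, built by repeated FFT convolution."""
    pmfs = [lognormal_pmf(mu, sigma, I, m, x0) for mu, sigma in params]
    pmf = reduce(fftconvolve, pmfs)                 # ((p1 * p2) * p3) * ...
    # Index k of the convolution corresponds to the value n*x0 + k*m
    x_sum = len(params) * x0 + m * np.arange(pmf.size)
    return x_sum, pmf

# Hypothetical parameters chosen only for illustration
params = [(0.0, 0.5), (0.5, 0.5), (1.0, 0.25)]
x_sum, pmf = pmf_of_sum(params, I=2**7, m=0.05)

mean_discrete = np.sum(x_sum * pmf)
mean_theory = sum(np.exp(mu + 0.5 * sigma**2) for mu, sigma in params)
print(mean_discrete, mean_theory)
```

Folding with `reduce` keeps the number of convolutions at one fewer than the number of random variables, which is the same count as writing the convolutions out one by one.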
13.9 Waste Hoist Failure Rate
We’ll work through an example that is close to a real-world one by assuming that 𝑛 = 14.
The example estimates the annual failure rate of a critical hoist at a nuclear waste facility.
A regulatory agency wants the system to be designed in a way that makes the failure rate of the top event small with high probability.
This example is Design Option B-2 (Case I) described in Table 10 on page 27 of [Greenfield and Sargent, 1993].
The table describes parameters 𝜇𝑖 , 𝜎𝑖 for fourteen log normal random variables that consist of seven pairs of random
variables that are identically and independently distributed.

• Within a pair, parameters 𝜇𝑖 , 𝜎𝑖 are the same

• As described in table 10 of [Greenfield and Sargent, 1993] p. 27, parameters of log normal distributions for the
seven unique probabilities 𝑃 (𝐴𝑖 ) have been calibrated to be the values in the following Python code:
mu1, sigma1 = 4.28, 1.1947

mu2, sigma2 = 3.39, 1.1947
mu3, sigma3 = 2.795, 1.1947
mu4, sigma4 = 2.717, 1.1947
mu5, sigma5 = 2.717, 1.1947
mu6, sigma6 = 1.444, 1.4632
mu7, sigma7 = -.040, 1.4632
Note: Because the failure rates are all very small, log normal distributions with the above parameter values actually describe 𝑃 (𝐴𝑖 ) measured in units of $10^{-9}$.
So the probabilities that we’ll put on the 𝑥 axis of the probability mass function and associated cumulative distribution function should be multiplied by $10^{-9}$.
To extract a table that summarizes computed quantiles, we’ll use a helper function
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return idx
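To see how this helper picks out a quantile, here is a toy usage sketch. The five-point grid and probability mass function are hypothetical values chosen only for illustration. Note that `find_nearest` returns the grid point whose cumulative probability is closest to the target, which approximates a quantile on a fine grid but is not the same as the smallest point whose cumulative probability reaches the target.

```python
import numpy as np

def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return idx

# Hypothetical toy PMF on the grid 0, 1, 2, 3, 4
x_toy = np.arange(5)
pmf_toy = np.array([0.1, 0.2, 0.3, 0.3, 0.1])
cdf_toy = np.cumsum(pmf_toy)          # approximately 0.1, 0.3, 0.6, 0.9, 1.0

# Index of the grid point whose CDF value is closest to 0.5
median_idx = find_nearest(cdf_toy, 0.5)
print(x_toy[median_idx])
```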
We compute the required thirteen convolutions in the following code.

(Please feel free to try different values of the power parameter 𝑝 that we use to set the number of points in our grid for
constructing the probability mass functions that discretize the continuous log normal distributions.)
We’ll plot a counterpart to the cumulative distribution function (CDF) in figure 5 on page 29 of [Greenfield and Sargent,
1993] and we’ll also present a counterpart to their Table 11 on page 28.
p=15
I = 2**p # Truncation value
m = .05 # increment size


c1 = fftconvolve(p1_norm,p2_norm)
c2 = fftconvolve(c1,p3_norm)
tdiff13 = toc - tic
print("time for 13 convolutions = ", tdiff13)
time for 13 convolutions = 11.15301869599989
d13 = np.cumsum(c13)
Nx=int(1400)
plt.figure()
plt.plot(x[0:int(Nx/m)],d13[0:int(Nx/m)])
plt.hlines(0.5,min(x),Nx,linestyles='dotted',colors='black')
plt.ylim(0,1)
plt.xlim(0,Nx)
plt.xlabel("$x10^{-9}$",loc = "right")
plt.show()
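x_1 = x[find_nearest(d13,0.01)]
x_5 = x[find_nearest(d13,0.05)]
x_10 = x[find_nearest(d13,0.1)]
x_50 = x[find_nearest(d13,0.50)]
x_66 = x[find_nearest(d13,0.665)]
x_85 = x[find_nearest(d13,0.85)]
x_90 = x[find_nearest(d13,0.90)]
x_95 = x[find_nearest(d13,0.95)]
x_99 = x[find_nearest(d13,0.99)]
x_9978 = x[find_nearest(d13,0.9978)]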
print(tabulate([
['1%',f"{x_1}"],
['5%',f"{x_5}"],
['10%',f"{x_10}"],
['50%',f"{x_50}"],


['66.5%',f"{x_66}"],
['85%',f"{x_85}"],
['90%',f"{x_90}"],
['95%',f"{x_95}"],
['99%',f"{x_99}"],
['99.78%',f"{x_9978}"]],
headers = ['Percentile', 'x * 1e-9']))
Percentile x * 1e-9
------------ ----------
1% 76.15
5% 106.5
10% 128.2
50% 260.55
66.5% 338.55
85% 509.4
90% 608.8
95% 807.6
99% 1470.2
99.78% 2474.85
The above table agrees closely with column 2 of Table 11 on p. 28 of [Greenfield and Sargent, 1993].
Discrepancies are probably due to slight differences in the number of digits retained in inputting 𝜇𝑖 , 𝜎𝑖 , 𝑖 = 1, … , 14 and
in the number of points deployed in the discretizations.


CHAPTER
FOURTEEN
INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS
!pip install --upgrade jax jaxlib

!conda install -y -c plotly plotly plotly-orca retrying
Note: If you are running this on Google Colab the above cell will present an error. This is because Google Colab doesn’t
use Anaconda to manage the Python packages. However this lecture will still execute as Google Colab has plotly
installed.
14.1 Overview
Substantial parts of machine learning and artificial intelligence are about

• approximating an unknown function with a known function
• estimating the known function from a set of data on the left- and right-hand variables
This lecture describes the structure of a plain vanilla artificial neural network (ANN) of a type that is widely used to
approximate a function 𝑓 that maps 𝑥 in a space 𝑋 into 𝑦 in a space 𝑌 .
To introduce elementary concepts, we study an example in which 𝑥 and 𝑦 are scalars.
We’ll describe the following concepts that are brick and mortar for neural networks:
• a neuron
• an activation function
• a network of neurons
• A neural network as a composition of functions
• back-propagation and its relationship to the chain rule of differential calculus
14.2 A Deep (but not Wide) Artificial Neural Network
We describe a “deep” neural network of “width” one.

Deep means that the network composes a large number of functions organized into nodes of a graph.
Width refers to the number of right-hand-side variables of the function being approximated.
Setting “width” to one means that the network composes just univariate functions.
Let 𝑥 ∈ ℝ be a scalar and 𝑦 ∈ ℝ be another scalar.
We assume that 𝑦 is a nonlinear function of 𝑥:
𝑦 = 𝑓(𝑥)
We want to approximate 𝑓(𝑥) with another function that we define recursively.

For a network of depth 𝑁 ≥ 1, each layer 𝑖 = 1, … , 𝑁 consists of
• an input 𝑥𝑖
• an affine function 𝑤𝑖 𝑥𝑖 + 𝑏𝑖 , where 𝑤𝑖 is a scalar weight placed on the input 𝑥𝑖 and 𝑏𝑖 is a scalar bias
• an activation function ℎ𝑖 that takes (𝑤𝑖 𝑥𝑖 + 𝑏𝑖 ) as an argument and produces an output 𝑥𝑖+1
An example of an activation function ℎ is the sigmoid function

$$ h(z) = \frac{1}{1 + e^{-z}} $$
Another popular activation function is the rectified linear unit (ReLU) function
ℎ(𝑧) = max(0, 𝑧)
Yet another activation function is the identity function
ℎ(𝑧) = 𝑧
As activation functions below, we’ll use the sigmoid function for layers 1 to 𝑁 − 1 and the identity function for layer 𝑁 .
To approximate a function $f(x)$ we construct $\hat{f}(x)$ by proceeding as follows.
Let

$$ l_i(x) = w_i x + b_i . $$

We construct $\hat{f}$ by iterating on compositions of functions $h_i \circ l_i$:

$$ f(x) \approx \hat{f}(x) = h_N \circ l_N \circ h_{N-1} \circ l_{N-1} \circ \cdots \circ h_1 \circ l_1 (x) $$
If 𝑁 > 1, we call the right side a “deep” neural net.

The larger is the integer 𝑁 , the “deeper” is the neural net.
Evidently, if we know the parameters $\{w_i, b_i\}_{i=1}^{N}$, then we can compute $\hat{f}(x)$ for a given $x = \tilde{x}$ by iterating on the recursion

$$ x_{i+1} = h_i \circ l_i (x_i), \quad i = 1, \ldots, N \tag{14.1} $$

starting from $x_1 = \tilde{x}$.
The value of $x_{N+1}$ that emerges from this iterative scheme equals $\hat{f}(\tilde{x})$.
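The recursion (14.1) can be sketched in a few lines of plain NumPy. This is only an illustration under stated assumptions: the function name `forward` and the particular weights and biases are hypothetical, and the sketch hard-codes the activation choice described above (sigmoid for layers 1 to 𝑁 − 1, identity for layer 𝑁).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(ws, bs, x):
    """Iterate x_{i+1} = h_i(w_i x_i + b_i) and return [x_1, ..., x_{N+1}]."""
    xs = [float(x)]
    N = len(ws)
    for i, (w, b) in enumerate(zip(ws, bs), start=1):
        z = w * xs[-1] + b                       # affine map l_i
        xs.append(z if i == N else sigmoid(z))   # identity in the last layer only
    return xs

# Hypothetical weights and biases for a depth-2 network
ws = [0.0, 4.0]
bs = [0.0, 1.0]

xs = forward(ws, bs, x=5.0)
f_hat = xs[-1]      # the network's output x_{N+1}
print(xs, f_hat)
```

With these values the first layer outputs sigmoid(0) = 0.5 regardless of the input, so the network returns 4 × 0.5 + 1 = 3.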

14.3 Calibrating Parameters
We now consider a neural network like the one described above with width 1, depth 𝑁 , and activation functions ℎ𝑖 for 1 ⩽ 𝑖 ⩽ 𝑁 that map ℝ into itself.
Let $\{(w_i, b_i)\}_{i=1}^{N}$ denote a sequence of weights and biases.
As mentioned above, for a given input 𝑥1 , our approximating function 𝑓 ̂ evaluated at 𝑥1 equals the “output” 𝑥𝑁+1 from
our network that can be computed by iterating on 𝑥𝑖+1 = ℎ𝑖 (𝑤𝑖 𝑥𝑖 + 𝑏𝑖 ).
For a given prediction $\hat{y}(x)$ and target $y = f(x)$, consider the loss function

$$ \mathcal{L}(\hat{y}, y)(x) = \frac{1}{2} (\hat{y} - y)^2 (x) . $$

This criterion is a function of the parameters $\{(w_i, b_i)\}_{i=1}^{N}$ and the point $x$.
We’re interested in solving the following problem:

$$ \min_{\{(w_i, b_i)\}_{i=1}^{N}} \int \mathcal{L}(x_{N+1}, y)(x) \, d\mu(x) $$

where $\mu(x)$ is some measure of points $x \in \mathbb{R}$ over which we want a good approximation $\hat{f}(x)$ to $f(x)$.
Stack weights and biases into a vector of parameters $p$:

$$ p = \begin{bmatrix} w_1 \\ b_1 \\ w_2 \\ b_2 \\ \vdots \\ w_N \\ b_N \end{bmatrix} $$
Applying a “poor man’s version” of a stochastic gradient descent algorithm for finding a zero of a function leads to the following update rule for parameters:

$$ p_{k+1} = p_k - \alpha \frac{d \mathcal{L}}{d x_{N+1}} \frac{d x_{N+1}}{d p_k} \tag{14.2} $$

where $\frac{d \mathcal{L}}{d x_{N+1}} = x_{N+1} - y$ and $\alpha > 0$ is a step size.
(See this and this to gather insights about how stochastic gradient descent relates to Newton’s method.)
To implement one step of this parameter update rule, we want the vector of derivatives $\frac{d x_{N+1}}{d p_k}$.
In the neural network literature, this step is accomplished by what is known as back propagation.
14.4 Back Propagation and the Chain Rule
Thanks to properties of
• the chain and product rules for differentiation from differential calculus, and
• lower triangular matrices
back propagation can actually be accomplished in one step by

• inverting a lower triangular matrix, and

• matrix multiplication
(This idea is from the last 7 minutes of this great youtube video by MIT’s Alan Edelman)
https://youtu.be/rZS2LGiurKY
Here goes.
Define the derivative of $h(z)$ with respect to $z$ evaluated at $z = z_i$ as $\delta_i$:

$$ \delta_i = \frac{d}{dz} h(z) \Big|_{z = z_i} $$

or

$$ \delta_i = h'(w_i x_i + b_i) . $$
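Repeated application of the chain rule and product rule to our recursion (14.1) allows us to obtain:

$$ dx_{i+1} = \delta_i (dw_i \, x_i + w_i \, dx_i + db_i) $$

After imposing $dx_1 = 0$, we get the following system of equations:

$$
\begin{pmatrix} dx_2 \\ \vdots \\ dx_{N+1} \end{pmatrix}
= \underbrace{\begin{pmatrix}
\delta_1 x_1 & \delta_1 & 0 & 0 & 0 \\
0 & 0 & \ddots & 0 & 0 \\
0 & 0 & 0 & \delta_N x_N & \delta_N
\end{pmatrix}}_{D}
\begin{pmatrix} dw_1 \\ db_1 \\ \vdots \\ dw_N \\ db_N \end{pmatrix}
+ \underbrace{\begin{pmatrix}
0 & 0 & 0 & 0 \\
\delta_2 w_2 & 0 & 0 & 0 \\
0 & \ddots & 0 & 0 \\
0 & 0 & \delta_N w_N & 0
\end{pmatrix}}_{L}
\begin{pmatrix} dx_2 \\ \vdots \\ dx_{N+1} \end{pmatrix}
$$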
or
𝑑𝑥 = 𝐷𝑑𝑝 + 𝐿𝑑𝑥
which implies that
𝑑𝑥 = (𝐼 − 𝐿)−1 𝐷𝑑𝑝
which in turn implies

$$
\begin{pmatrix} dx_{N+1}/dw_1 \\ dx_{N+1}/db_1 \\ \vdots \\ dx_{N+1}/dw_N \\ dx_{N+1}/db_N \end{pmatrix}
= e_N (I - L)^{-1} D .
$$

We can then solve the above problem by applying our update for $p$ multiple times for a collection of input-output pairs $\{(x_1^i, y^i)\}_{i=1}^{M}$ that we’ll call our “training set”.
14.5 Training Set
Choosing a training set amounts to a choice of measure 𝜇 in the above formulation of our function approximation problem
as a minimization problem.
In this spirit, we shall use a uniform grid of, say, 50 or 200 points.
There are many possible approaches to the minimization problem posed above:

• batch gradient descent in which you use an average gradient over the training set
• stochastic gradient descent in which you sample points randomly and use individual gradients
• something in-between (so-called “mini-batch gradient descent”)
The update rule (14.2) described above amounts to a stochastic gradient descent algorithm.
from IPython.display import Image

import jax.numpy as jnp
from jax import grad, jit, jacfwd, vmap
from jax import random
import jax
import plotly.graph_objects as go
# A helper function to randomly initialize weights and biases
# for a dense neural network layer
def random_layer_params(m, n, key, scale=1.):
    w_key, b_key = random.split(key)
    return scale * random.normal(w_key, (n, m)), scale * random.normal(b_key, (n,))

# Initialize all layers for a fully-connected neural network with sizes "sizes"
def init_network_params(sizes, key):
    keys = random.split(key, len(sizes))
    return [random_layer_params(m, n, k) for m, n, k in zip(sizes[:-1], sizes[1:], keys)]
def compute_xδw_seq(params, x):
    # Initialize arrays
    δ = jnp.zeros(len(params))
    xs = jnp.zeros(len(params) + 1)
    ws = jnp.zeros(len(params))
    bs = jnp.zeros(len(params))
    h = jax.nn.sigmoid
    xs = xs.at[0].set(x)
    # Layers 1 to N-1 use the sigmoid activation
    for i, (w, b) in enumerate(params[:-1]):
        output = w * xs[i] + b
        activation = h(output[0, 0])
        # Store elements
        δ = δ.at[i].set(grad(h)(output[0, 0]))
        ws = ws.at[i].set(w[0, 0])
        bs = bs.at[i].set(b[0])
        xs = xs.at[i+1].set(activation)
    # Layer N uses the identity activation, whose derivative is 1
    final_w, final_b = params[-1]
    preds = final_w * xs[-2] + final_b
    # Store elements
    δ = δ.at[-1].set(1.)
    ws = ws.at[-1].set(final_w[0, 0])
    bs = bs.at[-1].set(final_b[0])
    xs = xs.at[-1].set(preds[0, 0])
    return xs, δ, ws, bs


def loss(params, x, y):
    xs, δ, ws, bs = compute_xδw_seq(params, x)
    preds = xs[-1]
    return 1 / 2 * (y - preds) ** 2
# Parameters
N = 3 # Number of layers
layer_sizes = [1, ] * (N + 1)
param_scale = 0.1
step_size = 0.01
params = init_network_params(layer_sizes, random.PRNGKey(1))
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not␣
↪installed. Falling back to cpu.
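x = 5
y = 3
# Forward pass: store inputs, activation derivatives, weights, and biases
xs, δ, ws, bs = compute_xδw_seq(params, x)
dxs_ad = jacfwd(lambda params, x: compute_xδw_seq(params, x)[0], argnums=0)(params, x)
dxs_ad_mat = jnp.block([dx.reshape((-1, 1)) for dx_tuple in dxs_ad for dx in dx_tuple])[1:]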
jnp.block([[δ * xs[:-1]], [δ]])
Array([[8.5726520e-03, 4.0850646e-04, 6.1021698e-01],

[1.7145304e-03, 2.3785222e-01, 1.0000000e+00]], dtype=float32)
L = jnp.diag(δ * ws, k=-1)

L = L[1:, 1:]
D = jax.scipy.linalg.block_diag(*[row.reshape((1, 2)) for row in jnp.block([[δ * xs[:-1]], [δ]]).T])
dxs_la = jax.scipy.linalg.solve_triangular(jnp.eye(N) - L, D, lower=True)
# Check that the `dx` generated by the linear algebra method

# are the same as the ones generated using automatic differentiation
jnp.max(jnp.abs(dxs_ad_mat - dxs_la))
Array(0., dtype=float32)
grad_loss_ad = jnp.block([dx.reshape((-1, 1)) for dx_tuple in grad(loss)(params, x, y) for dx in dx_tuple])

# Check that the gradient of the loss is the same for both approaches
jnp.max(jnp.abs(-(y - xs[-1]) * dxs_la[-1] - grad_loss_ad))
Array(1.4901161e-08, dtype=float32)
@jit
def update_ad(params, x, y):
    grads = grad(loss)(params, x, y)
    return [(w - step_size * dw, b - step_size * db)
            for (w, b), (dw, db) in zip(params, grads)]

@jit
def update_la(params, x, y):
    # Recompute the forward pass at the current parameters and input
    xs, δ, ws, bs = compute_xδw_seq(params, x)
    N = len(params)
    L = jnp.diag(δ * ws, k=-1)
    L = L[1:, 1:]
    D = jax.scipy.linalg.block_diag(*[row.reshape((1, 2)) for row in jnp.block([[δ * xs[:-1]], [δ]]).T])
    dxs_la = jax.scipy.linalg.solve_triangular(jnp.eye(N) - L, D, lower=True)
    grads = -(y - xs[-1]) * dxs_la[-1]
    return [(w - step_size * dw, b - step_size * db)
            for (w, b), (dw, db) in zip(params, grads.reshape((-1, 2)))]
# Check that both updates are the same

update_la(params, x, y)
[(Array([[-1.3489482]], dtype=float32), Array([0.37956238], dtype=float32)),

(Array([[-0.00782906]], dtype=float32), Array([0.44972023], dtype=float32)),
(Array([[0.22937916]], dtype=float32), Array([-0.04793657], dtype=float32))]
update_ad(params, x, y)
[(Array([[-1.3489482]], dtype=float32), Array([0.37956238], dtype=float32)),

(Array([[-0.00782906]], dtype=float32), Array([0.44972023], dtype=float32)),
(Array([[0.22937916]], dtype=float32), Array([-0.04793657], dtype=float32))]

14.6 Example 1
Consider the function
𝑓 (𝑥) = −3𝑥 + 2
on [0.5, 3].
We use a uniform grid of 200 points and, in each epoch, update the parameters once for every point on the grid; the run below uses 500 epochs.
ℎ𝑖 is the sigmoid activation function for all layers except the final one for which we use the identity function and 𝑁 = 3.
Weights are initialized randomly.
def f(x):
    return -3 * x + 2

M = 200
grid = jnp.linspace(0.5, 3, num=M)
f_val = f(grid)
indices = jnp.arange(M)
key = random.PRNGKey(0)

def train(params, grid, f_val, key, num_epochs=300):
    for epoch in range(num_epochs):
        # Draw a fresh subkey so each epoch visits the grid in a new random order
        key, subkey = random.split(key)
        random_permutation = random.permutation(subkey, indices)
        for x, y in zip(grid[random_permutation], f_val[random_permutation]):
            params = update_la(params, x, y)
    return params
# Parameters
layer_sizes = [1, ] * (N + 1)
params_ex1 = init_network_params(layer_sizes, key)
%%time
params_ex1 = train(params_ex1, grid, f_val, key, num_epochs=500)
CPU times: user 4.83 s, sys: 1.7 ms, total: 4.83 s

Wall time: 4.79 s
predictions = vmap(compute_xδw_seq, in_axes=(None, 0))(params_ex1, grid)[0][:, -1]
fig = go.Figure()
fig.add_trace(go.Scatter(x=grid, y=f_val, name=r'$-3x+2$'))
fig.add_trace(go.Scatter(x=grid, y=predictions, name='Approximation'))
# Export to PNG file

Image(fig.to_image(format="png"))
# fig.show() will provide interactive plot when running
# notebook locally

14.7 How Deep?
It is fun to think about how deepening the neural net for the above example affects the quality of approximation:
• If the network is too deep, you’ll run into the vanishing gradient problem
• Other parameters such as the step size and the number of epochs can be as important or more important than the
number of layers in the situation considered in this lecture.
• Indeed, since 𝑓 is a linear function of 𝑥, a one-layer network with the identity map as an activation would probably
work best.
14.8 Example 2
We use the same setup as for the previous example with
𝑓 (𝑥) = log (𝑥)
def f(x):
    return jnp.log(x)
grid = jnp.linspace(0.5, 3, num=M)

f_val = f(grid)

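# Parameters
N = 1 # Number of layers
layer_sizes = [1, ] * (N + 1)
params_ex2_1 = init_network_params(layer_sizes, key)
# Parameters
N = 2 # Number of layers
layer_sizes = [1, ] * (N + 1)
params_ex2_2 = init_network_params(layer_sizes, key)
# Parameters
N = 3 # Number of layers
layer_sizes = [1, ] * (N + 1)
params_ex2_3 = init_network_params(layer_sizes, key)
params_ex2_1 = train(params_ex2_1, grid, f_val, key, num_epochs=300)
params_ex2_2 = train(params_ex2_2, grid, f_val, key, num_epochs=300)
params_ex2_3 = train(params_ex2_3, grid, f_val, key, num_epochs=300)
predictions_1 = vmap(compute_xδw_seq, in_axes=(None, 0))(params_ex2_1, grid)[0][:, -1]
predictions_2 = vmap(compute_xδw_seq, in_axes=(None, 0))(params_ex2_2, grid)[0][:, -1]
predictions_3 = vmap(compute_xδw_seq, in_axes=(None, 0))(params_ex2_3, grid)[0][:, -1]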

fig = go.Figure()
fig.add_trace(go.Scatter(x=grid, y=f_val, name=r'$\log{x}$'))
fig.add_trace(go.Scatter(x=grid, y=predictions_1, name='One-layer neural network'))
fig.add_trace(go.Scatter(x=grid, y=predictions_2, name='Two-layer neural network'))
fig.add_trace(go.Scatter(x=grid, y=predictions_3, name='Three-layer neural network'))
# Export to PNG file

Image(fig.to_image(format="png"))
# fig.show() will provide interactive plot when running
# notebook locally

## to check that gpu is activated in environment
from jax.lib import xla_bridge

print(xla_bridge.get_backend().platform)
cpu
Note: Cloud Environment: This lecture site is built in a server environment that doesn’t have access to a GPU. If you run this lecture locally, the output above tells you where your code is being executed, either on the CPU or on the GPU.


CHAPTER
FIFTEEN
RANDOMIZED RESPONSE SURVEYS
15.1 Overview
Social stigmas can inhibit people from confessing potentially embarrassing activities or opinions.
When people are reluctant to participate in a sample survey about personally sensitive issues, they might decline to participate, and even if they do participate, they might choose to provide incorrect answers to sensitive questions.
These problems induce selection biases that present challenges to interpreting and designing surveys.
To illustrate how social scientists have thought about estimating the prevalence of such embarrassing activities and opinions, this lecture describes a classic approach of S. L. Warner [Warner, 1965].
Warner used elementary probability to construct a way to protect the privacy of individual respondents to surveys while
still estimating the fraction of a collection of individuals who have a socially stigmatized characteristic or who engage in
a socially stigmatized activity.
Warner’s idea was to add noise between the respondent’s answer and the signal about that answer that the survey maker
ultimately receives.
Knowing about the structure of the noise assures the respondent that the survey maker does not observe his answer.
Statistical properties of the noise injection procedure provide the respondent plausible deniability.
Related ideas underlie modern differential privacy systems.
(See https://en.wikipedia.org/wiki/Differential_privacy)
15.2 Warner’s Strategy
As usual, let’s bring in the Python modules we’ll be using.
import numpy as np
import pandas as pd
Suppose that every person in a population belongs either to Group A or to Group B.

We want to estimate the proportion 𝜋 who belong to Group A while protecting individual respondents’ privacy.
Warner [Warner, 1965] proposed and analyzed the following procedure.
• A random sample of 𝑛 people is drawn with replacement from the population and each person is interviewed.
• Prepare a random spinner that points to the letter A with probability 𝑝 and to the letter B with probability (1 − 𝑝).
• Each subject spins a random spinner and sees an outcome (A or B) that the interviewer does not observe.
• The subject states whether he belongs to the group to which the spinner points.
• If the spinner points to the group to which the subject belongs, the subject reports “yes”; otherwise he reports “no”.
• The subject answers the question truthfully.
Warner constructed a maximum likelihood estimator of the proportion of the population in set A.
Let
• 𝜋 : True probability of A in the population
• 𝑝 : Probability that the spinner points to A
• $X_i = \begin{cases} 1, & \text{if the } i\text{th subject says yes} \\ 0, & \text{if the } i\text{th subject says no} \end{cases}$
Index the sample set so that the first 𝑛1 report “yes”, while the second 𝑛 − 𝑛1 report “no”.
The likelihood function of a sample set is

$$ L = \left[\pi p + (1-\pi)(1-p)\right]^{n_1} \left[(1-\pi)p + \pi(1-p)\right]^{n-n_1} \tag{15.1} $$
The log of the likelihood function is:
log(𝐿) = 𝑛1 log [𝜋𝑝 + (1 − 𝜋)(1 − 𝑝)] + (𝑛 − 𝑛1 ) log [(1 − 𝜋)𝑝 + 𝜋(1 − 𝑝)] (15.2)
The first-order necessary condition for maximizing the log likelihood function with respect to $\pi$ is:

$$ \frac{(n-n_1)(2p-1)}{(1-\pi)p + \pi(1-p)} = \frac{n_1(2p-1)}{\pi p + (1-\pi)(1-p)} $$

or

$$ \pi p + (1-\pi)(1-p) = \frac{n_1}{n} \tag{15.3} $$

If $p \neq \frac{1}{2}$, then the maximum likelihood estimator (MLE) of $\pi$ is:

$$ \hat{\pi} = \frac{p-1}{2p-1} + \frac{n_1}{(2p-1)n} \tag{15.4} $$
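We compute the mean and variance of the MLE estimator $\hat{\pi}$ to be:

$$
\begin{aligned}
\mathbb{E}(\hat{\pi}) &= \frac{1}{2p-1}\left[p - 1 + \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}X_i\right] \\
&= \frac{1}{2p-1}\left[p - 1 + \pi p + (1-\pi)(1-p)\right] \\
&= \pi
\end{aligned}
\tag{15.5}
$$

and

$$
\begin{aligned}
Var(\hat{\pi}) &= \frac{n Var(X_i)}{(2p-1)^2 n^2} \\
&= \frac{\left[\pi p + (1-\pi)(1-p)\right]\left[(1-\pi)p + \pi(1-p)\right]}{(2p-1)^2 n} \\
&= \frac{\frac{1}{4} + (2p^2 - 2p + \frac{1}{2})(-2\pi^2 + 2\pi - \frac{1}{2})}{(2p-1)^2 n} \\
&= \frac{1}{n}\left[\frac{1}{16(p-\frac{1}{2})^2} - \left(\pi - \frac{1}{2}\right)^2\right]
\end{aligned}
\tag{15.6}
$$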

Equation (15.5) indicates that $\hat{\pi}$ is an unbiased estimator of $\pi$ while equation (15.6) tells us the variance of the estimator.
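Warner's procedure and the estimator (15.4) can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: the function name `warner_mle`, the seed, and the values of the true proportion, the spinner probability, and the sample size are all hypothetical choices. A subject says "yes" exactly when the spinner points to the subject's own group, so the probability of a "yes" is 𝜋𝑝 + (1 − 𝜋)(1 − 𝑝).

```python
import numpy as np

def warner_mle(n1, n, p):
    """MLE of π from n1 'yes' answers out of n, spinner probability p ≠ 1/2."""
    return (p - 1) / (2 * p - 1) + n1 / ((2 * p - 1) * n)

rng = np.random.default_rng(0)

# Hypothetical design: true share in Group A, spinner probability, sample size
pi_true, p, n = 0.3, 0.7, 100_000

in_A = rng.random(n) < pi_true        # True if the subject belongs to Group A
points_to_A = rng.random(n) < p       # True if the spinner points to A
says_yes = in_A == points_to_A        # "yes" when the spinner points to own group

n1 = says_yes.sum()
pi_hat = warner_mle(n1, n, p)
print(pi_hat)
```

The interviewer only ever sees `says_yes`, never `in_A` or `points_to_A`, which is the source of the respondent's plausible deniability.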
To compute a confidence interval, first rewrite (15.6) as:

$$ Var(\hat{\pi}) = \frac{\frac{1}{4} - \left(\pi - \frac{1}{2}\right)^2}{n} + \frac{\frac{1}{16(p-\frac{1}{2})^2} - \frac{1}{4}}{n} \tag{15.7} $$
This equation indicates that the variance of 𝜋̂ can be represented as a sum of the variance due to sampling plus the variance
due to the random device.
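The decomposition in (15.7) is easy to check numerically. The sketch below, with hypothetical function names and illustrative parameter values, computes the sampling component, the random-device component, and the total from (15.6), and confirms that the two components add up to the total. The sampling component equals 𝜋(1 − 𝜋)/𝑛, the variance of a sample proportion under truthful direct answers.

```python
def var_total(pi, p, n):
    """Variance of the randomized-response MLE, equation (15.6)."""
    return (1 / (16 * (p - 0.5) ** 2) - (pi - 0.5) ** 2) / n

def var_sampling(pi, n):
    """Component of (15.7) due to sampling; equals pi * (1 - pi) / n."""
    return (0.25 - (pi - 0.5) ** 2) / n

def var_device(p, n):
    """Component of (15.7) due to the random device."""
    return (1 / (16 * (p - 0.5) ** 2) - 0.25) / n

# Illustrative (hypothetical) parameter values
pi, p, n = 0.6, 0.8, 1000

total = var_total(pi, p, n)
parts = var_sampling(pi, n) + var_device(p, n)
print(total, parts)
```

As 𝑝 approaches 1 the device component shrinks to zero, and as 𝑝 approaches 1/2 it grows without bound.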
From the expressions above we can find that:
• When $p$ is $\frac{1}{2}$, expression (15.1) degenerates to a constant.
• When $p$ is 1 or 0, the randomized estimate degenerates to an estimator without randomized sampling.
We shall only discuss situations in which $p \in (\frac{1}{2}, 1)$ (a situation in which $p \in (0, \frac{1}{2})$ is symmetric).
From expressions (15.5) and (15.7) we can deduce that:
• The MSE of 𝜋̂ decreases as 𝑝 increases.
15.3 Comparing Two Survey Designs
Let’s compare the preceding randomized-response method with a stylized non-randomized response method.
In our non-randomized response method, we suppose that:
• Members of Group A tell the truth with probability 𝑇𝑎 while members of Group B tell the truth with probability 𝑇𝑏
• 𝑌𝑖 is 1 or 0 according to whether the 𝑖th sample member reports being in Group A or not.
Then we can estimate $\pi$ as:

$$ \hat{\pi} = \frac{\sum_{i=1}^{n} Y_i}{n} \tag{15.8} $$
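We calculate the expectation, bias, and variance of the estimator to be:

$$ \mathbb{E}(\hat{\pi}) = \pi T_a + \left[(1-\pi)(1-T_b)\right] \tag{15.9} $$

$$
\begin{aligned}
Bias(\hat{\pi}) &= \mathbb{E}(\hat{\pi} - \pi) \\
&= \pi\left[T_a + T_b - 2\right] + \left[1 - T_b\right]
\end{aligned}
\tag{15.10}
$$

$$ Var(\hat{\pi}) = \frac{\left[\pi T_a + (1-\pi)(1-T_b)\right]\left[1 - \pi T_a - (1-\pi)(1-T_b)\right]}{n} \tag{15.11} $$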
It is useful to define a

$$ \text{MSE Ratio} = \frac{\text{Mean Square Error Randomized}}{\text{Mean Square Error Regular}} $$
We can compute MSE Ratios for different survey designs associated with different parameter values.
The following Python code computes objects we want to stare at in order to make comparisons under different values of
𝜋𝐴 and 𝑛:

class Comparison:
    def __init__(self, A, n):
        self.A = A
        self.n = n
        TaTb = np.array([[0.95, 1], [0.9, 1], [0.7, 1],
                         [0.5, 1], [1, 0.95], [1, 0.9],
                         [1, 0.7], [1, 0.5], [0.95, 0.95],
                         [0.9, 0.9], [0.7, 0.7], [0.5, 0.5]])
        self.p_arr = np.array([0.6, 0.7, 0.8, 0.9])
        self.p_map = dict(zip(self.p_arr, [f"MSE Ratio: p = {x}" for x in self.p_arr]))
        self.template = pd.DataFrame(columns=self.p_arr)
        self.template[['T_a','T_b']] = TaTb
        self.template['Bias'] = None

    def theoretical(self):
        A = self.A
        n = self.n
        df = self.template.copy()
        # Bias of the non-randomized estimator, equation (15.10)
        df['Bias'] = A * (df['T_a'] + df['T_b'] - 2) + (1 - df['T_b'])
        for p in self.p_arr:
            # Ratio of (15.6) to squared bias plus variance (15.11)
            df[p] = (1 / (16 * (p - 1/2)**2) - (A - 1/2)**2) / n / \
                    (df['Bias']**2 + ((A * df['T_a'] + (1 - A) * (1 - df['T_b'])) *
                     (1 - A * df['T_a'] - (1 - A) * (1 - df['T_b'])) / n))
            df[p] = df[p].round(2)
        df = df.set_index(["T_a", "T_b", "Bias"]).rename(columns=self.p_map)
        return df

    def MCsimulation(self, size=1000, seed=123456):
        A = self.A
        n = self.n
        df = self.template.copy()
        np.random.seed(seed)
        sample = np.random.rand(size, self.n) <= A
        random_device = np.random.rand(size, n)
        mse_rd = {}
        # Randomized-response estimator for each spinner probability p
        for p in self.p_arr:
            spinner = random_device <= p
            rd_answer = sample * spinner + (1 - sample) * (1 - spinner)
            n1 = rd_answer.sum(axis=1)
            pi_hat = (p - 1) / (2 * p - 1) + n1 / n / (2 * p - 1)
            mse_rd[p] = np.sum((pi_hat - A)**2)
        # Non-randomized estimator for each (T_a, T_b) pair
        for inum, irow in df.iterrows():
            truth_a = np.random.rand(size, self.n) <= irow.T_a
            truth_b = np.random.rand(size, self.n) <= irow.T_b
            trad_answer = sample * truth_a + (1 - sample) * (1 - truth_b)
            pi_trad = trad_answer.sum(axis=1) / n
            df.loc[inum, 'Bias'] = pi_trad.mean() - A
            mse_trad = np.sum((pi_trad - A)**2)
            for p in self.p_arr:
                df.loc[inum, p] = (mse_rd[p] / mse_trad).round(2)
        df = df.set_index(["T_a", "T_b", "Bias"]).rename(columns=self.p_map)
        return df
Let’s put the code to work for parameter values

• 𝜋𝐴 = 0.6

• 𝑛 = 1000
We can generate MSE Ratios theoretically using the above formulas.
We can also perform Monte Carlo simulations of an MSE Ratio.
cp1 = Comparison(0.6, 1000)

df1_theoretical = cp1.theoretical()
df1_theoretical
MSE Ratio: p = 0.6 MSE Ratio: p = 0.7 MSE Ratio: p = 0.8 \

T_a T_b Bias
0.95 1.00 -0.03 5.45 1.36 0.60
0.90 1.00 -0.06 1.62 0.40 0.18
0.70 1.00 -0.18 0.19 0.05 0.02
0.50 1.00 -0.30 0.07 0.02 0.01
1.00 0.95 0.02 9.82 2.44 1.08
0.90 0.04 3.41 0.85 0.37
0.70 0.12 0.43 0.11 0.05
0.50 0.20 0.16 0.04 0.02
0.95 0.95 -0.01 18.25 4.54 2.00
0.90 0.90 -0.02 9.70 2.41 1.06
0.70 0.70 -0.06 1.62 0.40 0.18
0.50 0.50 -0.10 0.61 0.15 0.07
MSE Ratio: p = 0.9

T_a T_b Bias
0.95 1.00 -0.03 0.33
0.90 1.00 -0.06 0.10
0.70 1.00 -0.18 0.01
0.50 1.00 -0.30 0.00
1.00 0.95 0.02 0.60
0.90 0.04 0.21
0.70 0.12 0.03
0.50 0.20 0.01
0.95 0.95 -0.01 1.11
0.90 0.90 -0.02 0.59
0.70 0.70 -0.06 0.10
0.50 0.50 -0.10 0.04
df1_mc = cp1.MCsimulation()
df1_mc
MSE Ratio: p = 0.6 MSE Ratio: p = 0.7 MSE Ratio: p = 0.8 \

T_a T_b Bias
0.95 1.00 -0.030060 5.76 1.36 0.63
0.90 1.00 -0.060045 1.73 0.41 0.19
0.70 1.00 -0.179530 0.21 0.05 0.02
0.50 1.00 -0.300077 0.07 0.02 0.01
1.00 0.95 0.019770 10.59 2.5 1.15
0.90 0.040050 3.63 0.86 0.39
0.70 0.120052 0.46 0.11 0.05
0.50 0.199746 0.17 0.04 0.02
0.95 0.95 -0.010137 18.65 4.41 2.02
0.90 0.90 -0.020103 10.48 2.48 1.14
0.70 0.70 -0.060488 1.71 0.4 0.19
0.50 0.50 -0.099341 0.66 0.16 0.07
MSE Ratio: p = 0.9

T_a T_b Bias
0.95 1.00 -0.030060 0.35
0.90 1.00 -0.060045 0.1
0.70 1.00 -0.179530 0.01
0.50 1.00 -0.300077 0.0
1.00 0.95 0.019770 0.64
0.90 0.040050 0.22
0.70 0.120052 0.03
0.50 0.199746 0.01
0.95 0.95 -0.010137 1.12
0.90 0.90 -0.020103 0.63
0.70 0.70 -0.060488 0.1
0.50 0.50 -0.099341 0.04
The theoretical calculations do a good job of predicting Monte Carlo results.

We see that in many situations, especially when the bias is not small, the MSE of the randomized-sampling methods is
smaller than that of the non-randomized sampling method.
These differences become larger as 𝑝 increases.
By adjusting parameters 𝜋𝐴 and 𝑛, we can study outcomes in different situations.
For example, for another situation described in Warner [Warner, 1965]:
• 𝜋𝐴 = 0.5
• 𝑛 = 1000
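we can use the code
cp2 = Comparison(0.5, 1000)
df2_theoretical = cp2.theoretical()
df2_theoretical
MSE Ratio: p = 0.6 MSE Ratio: p = 0.7 MSE Ratio: p = 0.8 \
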
T_a T_b Bias
0.95 1.00 -0.025 7.15 1.79 0.79
0.90 1.00 -0.050 2.27 0.57 0.25
0.70 1.00 -0.150 0.27 0.07 0.03
0.50 1.00 -0.250 0.10 0.02 0.01
1.00 0.95 0.025 7.15 1.79 0.79
0.90 0.050 2.27 0.57 0.25
0.70 0.150 0.27 0.07 0.03
0.50 0.250 0.10 0.02 0.01
0.95 0.95 0.000 25.00 6.25 2.78
0.90 0.90 0.000 25.00 6.25 2.78
0.70 0.70 0.000 25.00 6.25 2.78
0.50 0.50 0.000 25.00 6.25 2.78
MSE Ratio: p = 0.9

T_a T_b Bias
0.95 1.00 -0.025 0.45
0.90 1.00 -0.050 0.14
0.70 1.00 -0.150 0.02
0.50 1.00 -0.250 0.01
1.00 0.95 0.025 0.45
0.90 0.050 0.14
0.70 0.150 0.02
0.50 0.250 0.01
0.95 0.95 0.000 1.56
0.90 0.90 0.000 1.56
0.70 0.70 0.000 1.56
0.50 0.50 0.000 1.56
df2_mc = cp2.MCsimulation()
df2_mc
MSE Ratio: p = 0.6 MSE Ratio: p = 0.7 MSE Ratio: p = 0.8 \

T_a T_b Bias
0.95 1.00 -0.025230 7.0 1.69 0.75
0.90 1.00 -0.050279 2.23 0.54 0.24
0.70 1.00 -0.149866 0.27 0.07 0.03
0.50 1.00 -0.250211 0.1 0.02 0.01
1.00 0.95 0.024410 7.38 1.78 0.79
0.90 0.049839 2.26 0.54 0.24
0.70 0.149769 0.27 0.07 0.03
0.50 0.249851 0.1 0.02 0.01
0.95 0.95 -0.000260 24.29 5.86 2.59
0.90 0.90 -0.000109 25.73 6.2 2.74
0.70 0.70 -0.000439 25.75 6.21 2.74
0.50 0.50 0.000768 24.91 6.01 2.65
MSE Ratio: p = 0.9

T_a T_b Bias
0.95 1.00 -0.025230 0.44
0.90 1.00 -0.050279 0.14
0.70 1.00 -0.149866 0.02
0.50 1.00 -0.250211 0.01
1.00 0.95 0.024410 0.46
0.90 0.049839 0.14
0.70 0.149769 0.02
0.50 0.249851 0.01
0.95 0.95 -0.000260 1.52
0.90 0.90 -0.000109 1.61
0.70 0.70 -0.000439 1.61
0.50 0.50 0.000768 1.56
We can also revisit a calculation in the concluding section of Warner [Warner, 1965] in which
• 𝜋𝐴 = 0.6
• 𝑛 = 2000
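We use the code
cp3 = Comparison(0.6, 2000)
df3_theoretical = cp3.theoretical()
df3_theoretical
MSE Ratio: p = 0.6 MSE Ratio: p = 0.7 MSE Ratio: p = 0.8 \
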
T_a T_b Bias
0.95 1.00 -0.03 3.05 0.76 0.33
0.90 1.00 -0.06 0.84 0.21 0.09
0.70 1.00 -0.18 0.10 0.02 0.01
0.50 1.00 -0.30 0.03 0.01 0.00
1.00 0.95 0.02 6.03 1.50 0.66
0.90 0.04 1.82 0.45 0.20
0.70 0.12 0.22 0.05 0.02
0.50 0.20 0.08 0.02 0.01
0.95 0.95 -0.01 14.12 3.51 1.55
0.90 0.90 -0.02 5.98 1.49 0.66
0.70 0.70 -0.06 0.84 0.21 0.09
0.50 0.50 -0.10 0.31 0.08 0.03
MSE Ratio: p = 0.9

T_a T_b Bias
0.95 1.00 -0.03 0.19
0.90 1.00 -0.06 0.05
0.70 1.00 -0.18 0.01
0.50 1.00 -0.30 0.00
1.00 0.95 0.02 0.37
0.90 0.04 0.11
0.70 0.12 0.01
0.50 0.20 0.00
0.95 0.95 -0.01 0.86
0.90 0.90 -0.02 0.36
0.70 0.70 -0.06 0.05
0.50 0.50 -0.10 0.02
df3_mc = cp3.MCsimulation()
df3_mc
MSE Ratio: p = 0.6 MSE Ratio: p = 0.7 MSE Ratio: p = 0.8 \

T_a T_b Bias
0.95 1.00 -0.030316 3.27 0.8 0.34
0.90 1.00 -0.060352 0.91 0.22 0.09
0.70 1.00 -0.180087 0.11 0.03 0.01
0.50 1.00 -0.299849 0.04 0.01 0.0
1.00 0.95 0.019734 6.7 1.64 0.69
0.90 0.039766 2.01 0.49 0.21
0.70 0.119789 0.24 0.06 0.02
0.50 0.200138 0.09 0.02 0.01
0.95 0.95 -0.010475 14.78 3.61 1.53
0.90 0.90 -0.020373 6.32 1.54 0.65
0.70 0.70 -0.059945 0.92 0.23 0.1
0.50 0.50 -0.100103 0.34 0.08 0.03
MSE Ratio: p = 0.9

T_a T_b Bias
0.95 1.00 -0.030316 0.19
0.90 1.00 -0.060352 0.05
0.70 1.00 -0.180087 0.01
0.50 1.00 -0.299849 0.0
1.00 0.95 0.019734 0.39
0.90 0.039766 0.12
0.70 0.119789 0.01
0.50 0.200138 0.0
0.95 0.95 -0.010475 0.85
0.90 0.90 -0.020373 0.36
0.70 0.70 -0.059945 0.05
0.50 0.50 -0.100103 0.02
Evidently, as 𝑛 increases, the randomized response method performs better in more situations.
15.4 Concluding Remarks
This QuantEcon lecture describes some alternative randomized response surveys.

That lecture presents a utilitarian analysis of those alternatives conducted by Lars Ljungqvist [Ljungqvist, 1993].

import numpy as np


CHAPTER
SIXTEEN
EXPECTED UTILITIES OF RANDOM RESPONSES
16.1 Overview
This QuantEcon lecture describes randomized response surveys in the tradition of Warner [Warner, 1965] that are designed
to protect respondents’ privacy.
Lars Ljungqvist [Ljungqvist, 1993] analyzed how a respondent’s decision about whether to answer truthfully depends on
expected utility.
The lecture tells how Ljungqvist used his framework to shed light on alternative randomized response survey techniques
proposed, for example, by [Lanke, 1975], [Lanke, 1976], [Leysieffer and Warner, 1976], [Anderson, 1976], [Fligner et
al., 1977], [Greenberg et al., 1977], [Greenberg et al., 1969].
16.2 Privacy Measures
We consider randomized response models with only two possible answers, “yes” and “no.”
The design determines probabilities
$$
\Pr(\text{yes}|A) = 1 - \Pr(\text{no}|A)
$$
$$
\Pr(\text{yes}|A') = 1 - \Pr(\text{no}|A')
$$
These design probabilities in turn can be used to compute the conditional probability of belonging to the sensitive group
𝐴 for a given response, say 𝑟:
$$
\Pr(A|r) = \frac{\pi_A \Pr(r|A)}{\pi_A \Pr(r|A) + (1 - \pi_A) \Pr(r|A')} \tag{16.1}
$$
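To make (16.1) concrete, here is a minimal sketch that computes both conditional probabilities for one design. The function name `posterior_A` and the parameter values are hypothetical. The design is assumed to be of Warner's type, with Pr(yes|𝐴) = 𝑝 and Pr(yes|𝐴′) = 1 − 𝑝.

```python
def posterior_A(pi_A, pr_r_given_A, pr_r_given_Ac):
    """Pr(A | r) from Bayes' rule, equation (16.1)."""
    num = pi_A * pr_r_given_A
    return num / (num + (1 - pi_A) * pr_r_given_Ac)


pi_A, p = 0.3, 0.8   # illustrative values, not taken from the text

pr_A_yes = posterior_A(pi_A, p, 1 - p)       # response r = "yes"
pr_A_no = posterior_A(pi_A, 1 - p, p)        # response r = "no"

print(round(pr_A_yes, 4), round(pr_A_no, 4))
```

For these values a “yes” answer raises the probability of membership in 𝐴 above 𝜋𝐴 and a “no” answer lowers it.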
16.3 Zoo of Concepts
At this point we describe some concepts proposed by various researchers.
16.3.1 Leysieffer and Warner (1976)

The response 𝑟 is regarded as jeopardizing with respect to 𝐴 or 𝐴′ if
$$
\Pr(A|r) > \pi_A \quad \text{or} \quad \Pr(A'|r) > 1 - \pi_A \tag{16.2}
$$
From Bayes’s rule:
$$
\frac{\Pr(A|r)}{\Pr(A'|r)} \times \frac{1 - \pi_A}{\pi_A} = \frac{\Pr(r|A)}{\Pr(r|A')} \tag{16.3}
$$
If this expression is greater (less) than unity, it follows that 𝑟 is jeopardizing with respect to 𝐴 (𝐴′). The natural
measures of jeopardy are then:
$$
g(r|A) = \frac{\Pr(r|A)}{\Pr(r|A')} \quad \text{and} \quad g(r|A') = \frac{\Pr(r|A')}{\Pr(r|A)} \tag{16.4}
$$
Suppose, without loss of generality, that Pr(yes|𝐴) > Pr(yes|𝐴′). Then a yes (no) answer is jeopardizing with respect
to 𝐴 (𝐴′), that is,
$$
g(\text{yes}|A) > 1 \quad \text{and} \quad g(\text{no}|A') > 1
$$
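The two jeopardy measures in (16.4) are reciprocals of each other for a given response, which a short sketch makes easy to see. The function name `jeopardy` and the value of `p` are hypothetical; the design is again assumed to have Pr(yes|𝐴) = 𝑝 and Pr(yes|𝐴′) = 1 − 𝑝 with 𝑝 > 1/2.

```python
def jeopardy(pr_r_given_A, pr_r_given_Ac):
    """Return (g(r|A), g(r|A')) from equation (16.4).

    Assumes both design probabilities are strictly positive.
    """
    g_A = pr_r_given_A / pr_r_given_Ac
    return g_A, 1 / g_A


p = 0.8   # illustrative design probability, so Pr(yes|A) > Pr(yes|A')

g_yes_A, g_yes_Ac = jeopardy(p, 1 - p)     # response "yes"
g_no_A, g_no_Ac = jeopardy(1 - p, p)       # response "no"

# A "yes" is jeopardizing with respect to A, a "no" with respect to A'
print(g_yes_A > 1, g_no_Ac > 1)
```

When a design sets Pr(no|𝐴) = 0, the ratio 𝑔(no|𝐴′) is unbounded, so a guard against division by zero would be needed before applying this helper to such a design.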
Leysieffer and Warner proved that the variance of the estimate can only be decreased through an increase in one or both
of these two measures of jeopardy.
An efficient randomized response model is, therefore, any model that attains the maximum acceptable levels of jeopardy
that are consistent with cooperation of the respondents.
As a special example, Leysieffer and Warner considered “a problem in which there is no jeopardy in a no answer”; that
is, 𝑔(no|𝐴′) can be of unlimited magnitude.
Evidently, an optimal design must have
Pr(yes|𝐴) = 1
which implies that
Pr(𝐴|no) = 0
16.3.2 Lanke (1976)
Lanke (1975) [Lanke, 1975] argued that “it is membership in Group A that people may want to hide, not membership in
the complementary Group A’.”
For that reason, Lanke (1976) [Lanke, 1976] argued that an appropriate measure of protection is to minimize
max {Pr(𝐴|yes), Pr(𝐴|no)} (16.5)
Holding this measure constant, he explained under what conditions the smallest variance of the estimate was achieved
with the unrelated question model or Warner’s (1965) original model.

16.3.3 Fligner, Policello, and Singh
Fligner, Policello, and Singh [Fligner et al., 1977] reached a conclusion similar to that of Lanke (1976).
They measured “private protection” as
$$
\frac{1 - \max \{\Pr(A|\text{yes}), \Pr(A|\text{no})\}}{1 - \pi_A} \tag{16.6}
$$
16.3.4 Greenberg, Kuebler, Abernathy, and Horvitz (1977)
[Greenberg et al., 1977]

Greenberg, Kuebler, Abernathy, and Horvitz (1977) stressed the importance of examining the risk to respondents who
do not belong to 𝐴 as well as the risk to those who do belong to the sensitive group.
They defined the hazard for an individual in 𝐴 as the probability that he or she is perceived as belonging to 𝐴:
Pr(yes|𝐴) × Pr(𝐴|yes) + Pr(no|𝐴) × Pr(𝐴|no) (16.7)
Similarly, the hazard for an individual who does not belong to 𝐴 would be
$$
\Pr(\text{yes}|A') \times \Pr(A|\text{yes}) + \Pr(\text{no}|A') \times \Pr(A|\text{no}) \tag{16.8}
$$
Greenberg et al. (1977) also considered an alternative related measure of hazard that “is likely to be closer to the actual
concern felt by a respondent.”
The “limited hazard” for an individual in 𝐴 and 𝐴′ is
$$
\Pr(\text{yes}|A) \times \Pr(A|\text{yes}) \tag{16.9}
$$
and
$$
\Pr(\text{yes}|A') \times \Pr(A|\text{yes}) \tag{16.10}
$$
This measure is just the first term in (16.7), i.e., the probability that an individual answers “yes” and is perceived to belong
to 𝐴.
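The hazard measures (16.7) to (16.10) can be evaluated directly from the design probabilities. The sketch below is illustrative only: the function names and parameter values are hypothetical, and the design is assumed to have Pr(yes|𝐴) = 0.8 and Pr(yes|𝐴′) = 0.2.

```python
def posterior_A(pi_A, pr_r_given_A, pr_r_given_Ac):
    """Pr(A | r) from Bayes' rule, equation (16.1)."""
    num = pi_A * pr_r_given_A
    return num / (num + (1 - pi_A) * pr_r_given_Ac)


def hazards(pi_A, pr_yes_A, pr_yes_Ac):
    """Hazard and limited hazard for members of A and of A'."""
    post_yes = posterior_A(pi_A, pr_yes_A, pr_yes_Ac)
    post_no = posterior_A(pi_A, 1 - pr_yes_A, 1 - pr_yes_Ac)
    return {
        "hazard_A": pr_yes_A * post_yes + (1 - pr_yes_A) * post_no,      # (16.7)
        "hazard_Ac": pr_yes_Ac * post_yes + (1 - pr_yes_Ac) * post_no,   # (16.8)
        "limited_A": pr_yes_A * post_yes,                                # (16.9)
        "limited_Ac": pr_yes_Ac * post_yes,                              # (16.10)
    }


pi_A = 0.3                                  # illustrative value
h = hazards(pi_A, pr_yes_A=0.8, pr_yes_Ac=0.2)
for name, value in h.items():
    print(name, round(value, 4))
```

A useful sanity check is that the population-weighted average of the two hazards, 𝜋𝐴 × hazard for 𝐴 plus (1 − 𝜋𝐴) × hazard for 𝐴′, equals 𝜋𝐴 by the law of total probability.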
16.4 Respondent’s Expected Utility
16.4.1 Truth Border
Key assumptions that underlie a randomized response technique for estimating the fraction of a population that belongs
to 𝐴 are:
• Assumption 1: Respondents feel discomfort from being thought of as belonging to 𝐴.
• Assumption 2: Respondents prefer answering questions truthfully to lying, so long as the cost of doing so is not
too high. The cost is taken to be the discomfort in Assumption 1.
Let 𝑟𝑖 denote individual 𝑖’s response to the randomized question.
𝑟𝑖 can only take values “yes” or “no”.
For a given design of a randomized response interview and a given belief about the fraction of the population that belongs
to 𝐴, the respondent’s answer is associated with a conditional probability Pr(𝐴|𝑟𝑖 ) that the individual belongs to 𝐴.
Given 𝑟𝑖 and complete privacy, the individual’s utility is higher if 𝑟𝑖 represents a truthful answer rather than a lie.
In terms of a respondent’s expected utility as a function of Pr(𝐴|𝑟𝑖 ) and 𝑟𝑖 :
• The higher is Pr(𝐴|𝑟𝑖 ), the lower is individual 𝑖’s expected utility.
• Expected utility is higher if 𝑟𝑖 represents a truthful answer rather than a lie.
Define:
• 𝜙𝑖 ∈ {truth, lie}, a dichotomous variable that indicates whether or not 𝑟𝑖 is a truthful statement.
• 𝑈𝑖 (Pr(𝐴|𝑟𝑖 ), 𝜙𝑖 ), a utility function that is differentiable in its first argument, summarizes individual 𝑖’s expected
utility.
Then there is an 𝑟𝑖 such that
$$
\frac{\partial U_i(\Pr(A|r_i), \phi_i)}{\partial \Pr(A|r_i)} < 0, \quad \text{for } \phi_i \in \{\text{truth}, \text{lie}\} \tag{16.11}
$$
and
𝑈𝑖 (Pr(𝐴|𝑟𝑖 ), truth) > 𝑈𝑖 (Pr(𝐴|𝑟𝑖 ), lie) , for Pr(𝐴|𝑟𝑖 ) ∈ [0, 1] (16.12)
Suppose now that the correct answer for individual 𝑖 is “yes”.

Individual 𝑖 would choose to answer truthfully if
𝑈𝑖 (Pr(𝐴|yes), truth) ≥ 𝑈𝑖 (Pr(𝐴|no), lie) (16.13)
If the correct answer is “no”, individual 𝑖 would volunteer the correct answer only if
𝑈𝑖 (Pr(𝐴|no), truth) ≥ 𝑈𝑖 (Pr(𝐴|yes), lie) (16.14)
Assume that
Pr(𝐴|yes) > 𝜋𝐴 > Pr(𝐴|no)
so that a “yes” answer increases the odds that an individual belongs to 𝐴.

Constraint (16.14) holds for sure.
Consequently, constraint (16.13) becomes the single necessary condition for individual 𝑖 always to answer truthfully.
At equality, constraint (16.13) determines conditional probabilities that make the individual indifferent between telling the
truth and lying when the correct answer is “yes”:
𝑈𝑖 (Pr(𝐴|yes), truth) = 𝑈𝑖 (Pr(𝐴|no), lie) (16.15)
Equation (16.15) defines a “truth border”.

Differentiating (16.15) with respect to the conditional probabilities shows that the truth border has a positive slope in the
space of conditional probabilities:
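$$
\frac{\partial \Pr(A|\text{no})}{\partial \Pr(A|\text{yes})} = \frac{\dfrac{\partial U_i(\Pr(A|\text{yes}), \text{truth})}{\partial \Pr(A|\text{yes})}}{\dfrac{\partial U_i(\Pr(A|\text{no}), \text{lie})}{\partial \Pr(A|\text{no})}} > 0 \tag{16.16}
$$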
The source of the positive relationship is:

• The individual is willing to volunteer a truthful “yes” answer so long as the utility from doing so (i.e., the left side
of (16.15)) is at least as high as the utility of lying on the right side of (16.15).
• Suppose now that Pr(𝐴|yes) increases. That reduces the utility of telling the truth. To preserve indifference between
a truthful answer and a lie, Pr(𝐴|no) must increase to reduce the utility of lying.
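The truth-telling condition (16.13) and the border (16.15) can be written out for one specific utility function. The sketch below assumes the linear form 𝑈𝑖 = −Pr(𝐴|𝑟𝑖 ) + 𝑓(𝜙𝑖 ), normalizes 𝑓(lie) to 0, and sets 𝑓(truth) = 0.4, the gap used in this chapter's plotting code. The function names and the `truth_premium` parameter are hypothetical.

```python
def answers_truthfully(pr_A_yes, pr_A_no, truth_premium=0.4):
    """Condition (16.13) for a respondent whose correct answer is "yes".

    Assumes U_i = -Pr(A|r_i) + f(phi_i) with f(lie) = 0 and
    f(truth) = truth_premium.
    """
    u_truth = -pr_A_yes + truth_premium   # answer "yes" truthfully
    u_lie = -pr_A_no                      # answer "no" falsely
    return u_truth >= u_lie


def truth_border(pr_A_yes, truth_premium=0.4):
    """Pr(A|no) that makes the respondent indifferent, equation (16.15)."""
    return pr_A_yes - truth_premium


print(answers_truthfully(0.5, 0.2))   # True: enough privacy
print(answers_truthfully(0.8, 0.1))   # False: a "yes" is too revealing
```

Raising Pr(𝐴|yes) along this border requires raising Pr(𝐴|no) one for one, which is the positive slope described above.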

16.4.2 Drawing a Truth Border
We can deduce two things about the truth border:

• The truth border divides the space of conditional probabilities into two subsets: “truth telling” and “lying”. Thus,
sufficient privacy elicits a truthful answer, whereas insufficient privacy results in a lie. The truth border depends on
a respondent’s utility function.
• Assumptions (16.11) and (16.12) are sufficient only to guarantee a positive slope of the truth border. The truth
border can have either a concave or a convex shape.
We can draw some truth borders with the following Python code:
import numpy as np
import matplotlib.pyplot as plt

# Linear utility: Pr(A|no) = Pr(A|yes) - 0.4
x1 = np.arange(0, 1, 0.001)
y1 = x1 - 0.4
# Square-root utility: sqrt(Pr(A|no)) = sqrt(Pr(A|yes)) - 0.4
x2 = np.arange(0.4**2, 1, 0.001)
y2 = (pow(x2, 0.5) - 0.4)**2
# Quadratic utility: Pr(A|no)^2 = Pr(A|yes)^2 - 0.4
x3 = np.arange(0.4**0.5, 1, 0.001)
y3 = pow(x3**2 - 0.4, 0.5)

# Raw strings keep the LaTeX backslashes intact; each label matches the
# utility function that generates the curve it is attached to
plt.plot(x1, y1, 'r-', label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=-Pr(A|r_i)+f(\phi_i)$')
plt.fill_between(x1, 0, y1, facecolor='red', alpha=0.05)
plt.plot(x2, y2, 'b-', label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=-\sqrt{Pr(A|r_i)}+f(\phi_i)$')
plt.fill_between(x2, 0, y2, facecolor='blue', alpha=0.05)
plt.plot(x3, y3, 'y-', label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=-Pr(A|r_i)^{2}+f(\phi_i)$')
plt.fill_between(x3, 0, y3, facecolor='green', alpha=0.05)
plt.plot(x1, x1, ':', linewidth=2)
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.xlabel('Pr(A|yes)')
plt.ylabel('Pr(A|no)')
plt.text(0.42, 0.3, "Truth Telling", fontdict={'size':28, 'style':'italic'})
plt.text(0.8, 0.1, "Lying", fontdict={'size':28, 'style':'italic'})
plt.legend(loc=0, fontsize='large')
plt.title('Figure 1.1')
plt.show()

Figure 1.1 three types of truth border.

Without loss of generality, we consider the truth border:
𝑈𝑖 (Pr(𝐴|𝑟𝑖 ), 𝜙𝑖 ) = −Pr(𝐴|𝑟𝑖 ) + 𝑓(𝜙𝑖 )
and plot the “truth telling” and “lying area” of individual 𝑖 in Figure 1.2:
x1 = np.arange(0, 1, 0.001)
y1 = x1 - 0.4    # truth border for the linear utility function
z1 = x1          # 45-degree line
z2 = 0
plt.plot(x1, y1, 'r-', label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=-Pr(A|r_i)+f(\phi_i)$')
plt.plot(x1, x1, ':', linewidth=2)
plt.fill_between(x1, y1, z1, facecolor='blue', alpha=0.05, label='truth telling')
plt.fill_between(x1, z2, y1, facecolor='green', alpha=0.05, label='lying')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.title('Figure 1.2')
plt.show()
16.5 Utilitarian View of Survey Design
16.5.1 Iso-variance Curves
A statistician’s objective is
• to find a randomized response survey design that minimizes the bias and the variance of the estimator.
Given a design that ensures truthful answers by all respondents, Anderson (1976, Theorem 1) [Anderson, 1976] showed
that the minimum variance estimate in the two-response model has variance
$$
V(\Pr(A|\text{yes}), \Pr(A|\text{no})) = \frac{\pi_A^2 (1 - \pi_A)^2}{n} \times \frac{1}{\Pr(A|\text{yes}) - \pi_A} \times \frac{1}{\pi_A - \Pr(A|\text{no})} \tag{16.17}
$$
where the random sample with replacement consists of 𝑛 individuals.

We can use Expression (16.17) to draw iso-variance curves.
The following inequalities restrict the shapes of iso-variance curves:
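$$
\left. \frac{d \Pr(A|\text{no})}{d \Pr(A|\text{yes})} \right|_{\text{constant variance}} = \frac{\pi_A - \Pr(A|\text{no})}{\Pr(A|\text{yes}) - \pi_A} > 0 \tag{16.18}
$$
$$
\left. \frac{d^2 \Pr(A|\text{no})}{d \Pr(A|\text{yes})^2} \right|_{\text{constant variance}} = - \frac{2 [\pi_A - \Pr(A|\text{no})]}{[\Pr(A|\text{yes}) - \pi_A]^2} < 0 \tag{16.19}
$$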
From expression (16.17), (16.18) and (16.19) we can see that:
• Variance can be reduced only by increasing the distance of Pr(𝐴|yes) and/or Pr(𝐴|no) from 𝜋𝐴 .
• Iso-variance curves are always upward-sloping and concave.
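Expression (16.17) and the iso-variance curve it implies are simple enough to code directly. In the sketch below the function names are hypothetical; `iso_variance_pr_no` just solves (16.17) for Pr(𝐴|no) at a given variance level, and the parameter values 𝜋𝐴 = 0.3 and 𝑛 = 100 are used for illustration.

```python
def anderson_variance(pr_A_yes, pr_A_no, pi_A, n):
    """Variance of the estimate under truthful answers, equation (16.17)."""
    const = pi_A**2 * (1 - pi_A)**2 / n
    return const / ((pr_A_yes - pi_A) * (pi_A - pr_A_no))


def iso_variance_pr_no(pr_A_yes, V, pi_A, n):
    """Pr(A|no) on the iso-variance curve with variance V, given Pr(A|yes)."""
    const = pi_A**2 * (1 - pi_A)**2 / n
    return pi_A - const / (V * (pr_A_yes - pi_A))


pi_A, n = 0.3, 100

# Direct questioning with truthful answers: Pr(A|yes) = 1, Pr(A|no) = 0
v_direct = anderson_variance(1.0, 0.0, pi_A, n)
print(v_direct, pi_A * (1 - pi_A) / n)

# Moving the conditional probabilities closer to pi_A raises the variance
v_private = anderson_variance(0.5, 0.1, pi_A, n)
print(v_private > v_direct)
```

At Pr(𝐴|yes) = 1 and Pr(𝐴|no) = 0 the expression collapses to the familiar binomial variance 𝜋𝐴 (1 − 𝜋𝐴 )/𝑛, which is a convenient check on the formula.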
16.5.2 Drawing Iso-variance Curves
We use Python code to draw iso-variance curves.

The pairs of conditional probabilities can be attained using Warner’s (1965) model.
Note that:
• Any point on the iso-variance curves can be attained with the unrelated question model as long as the statistician
can completely control the model design.
• Warner’s (1965) original randomized response model is less flexible than the unrelated question model.
class Iso_Variance:
    def __init__(self, pi, n):
        self.pi = pi    # fraction of the population in the sensitive group
        self.n = n      # sample size

    def plotting_iso_variance_curve(self):
        pi = self.pi
        n = self.n
        # variance levels (scaled by n) of the curves V1, ..., V9
        nv = np.array([0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7])
        x = np.arange(0, 1, 0.001)
        x0 = np.arange(pi, 1, 0.001)
        x2 = np.arange(0, pi, 0.001)
        y1 = [pi for i in x0]    # horizontal guide line at Pr(A|no) = pi
        y2 = [pi for i in x2]    # vertical guide line at Pr(A|yes) = pi
        # pairs of conditional probabilities attainable with Warner's model
        y0 = 1 / (1 + (x0 * (1 - pi)**2) / ((1 - x0) * pi**2))
        plt.plot(x0, y0, 'm-', label='Warner')
        plt.plot(x, x, 'c:', linewidth=2)
        plt.plot(x0, y1,'c:', linewidth=2)
        plt.plot(y2, x2, 'c:', linewidth=2)
        for i in range(len(nv)):
            # solve (16.17) for Pr(A|no) at variance nv[i] / n
            y = pi - (pi**2 * (1 - pi)**2) / (n * (nv[i] / n) * (x0 - pi + 1e-8))
            plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}')
        plt.xlim([0, 1])
        plt.ylim([0, 0.5])
        plt.text(0.32, 0.28, "High Var", fontdict={'size':15, 'style':'italic'})
        plt.text(0.91, 0.01, "Low Var", fontdict={'size':15, 'style':'italic'})
        plt.title('Figure 2')
        plt.show()
Properties of iso-variance curves are:

• All points on one iso-variance curve share the same variance
• From 𝑉1 to 𝑉9 , the variance associated with the iso-variance curves increases monotonically, and the curves are drawn progressively lighter
Suppose the parameters of the iso-variance model follow those in Ljungqvist [Ljungqvist, 1993], which are:
• 𝜋 = 0.3
• 𝑛 = 100
Then we can plot the iso-variance curve in Figure 2:
var = Iso_Variance(pi=0.3, n=100)

var.plotting_iso_variance_curve()

16.5.3 Optimal Survey
Any point on an iso-variance curve can be attained with the unrelated question design.
We now focus on finding an “optimal survey design” that
• Minimizes the variance of the estimator subject to privacy restrictions.
To obtain an optimal design, we first superimpose all individuals’ truth borders on the iso-variance mapping.
To construct an optimal design
• The statistician should find the intersection of areas above all truth borders; that is, the set of conditional probabilities
ensuring truthful answers from all respondents.
• The point where this set touches the lowest possible iso-variance curve determines an optimal survey design.
Consequently, a minimum variance unbiased estimator is pinned down by the individual who is least willing to volunteer
a truthful answer.
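The construction just described can be approximated numerically. The sketch below is a simple grid search under strong assumptions: there is a single binding truth border (that of the most reluctant respondent), the optimum lies on that border, and the border is the linear one Pr(𝐴|no) = Pr(𝐴|yes) − 0.4 with 𝜋𝐴 = 0.3 and 𝑛 = 100. The function names are hypothetical.

```python
import numpy as np


def anderson_variance(pr_A_yes, pr_A_no, pi_A, n):
    """Variance of the estimate under truthful answers, equation (16.17)."""
    const = pi_A**2 * (1 - pi_A)**2 / n
    return const / ((pr_A_yes - pi_A) * (pi_A - pr_A_no))


def optimal_design(border, pi_A, n, step=0.001):
    """Minimize (16.17) along a truth border by grid search.

    `border` maps Pr(A|yes) to the Pr(A|no) at which the binding
    respondent is indifferent between truth and lie.
    """
    x = np.arange(pi_A + step, 1, step)          # candidate Pr(A|yes) > pi_A
    y = border(x)                                # implied Pr(A|no)
    ok = (y >= 0) & (y < pi_A)                   # keep admissible pairs only
    v = anderson_variance(x[ok], y[ok], pi_A, n)
    k = np.argmin(v)
    return x[ok][k], y[ok][k], v[k]


x_opt, y_opt, v_opt = optimal_design(lambda x: x - 0.4, pi_A=0.3, n=100)
print(round(x_opt, 3), round(y_opt, 3), round(v_opt, 5))
```

For the linear border the search returns approximately Pr(𝐴|yes) = 0.5 and Pr(𝐴|no) = 0.1. Other borders can be passed in as a different `border` function.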
Here are some comments about the model design:
• An individual’s decision of whether or not to answer truthfully depends on his or her belief about other respondents’
behavior, because this determines the individual’s calculation of Pr(𝐴|yes) and Pr(𝐴|no).
• An equilibrium of the optimal design model is a Nash equilibrium of a noncooperative game.
• Assumption (16.12) is sufficient to guarantee existence of an optimal model design. By choosing Pr(𝐴|yes) and
Pr(𝐴|no) sufficiently close to each other, all respondents will find it optimal to answer truthfully. The closer are
these probabilities, the higher the variance of the estimator becomes.
• If respondents experience a large enough increase in expected utility from telling the truth, then there is no need to
use a randomized response model. The smallest possible variance of the estimate is then obtained at Pr(𝐴|yes) = 1
and Pr(𝐴|no) = 0 ; that is, when respondents answer truthfully to direct questioning.
• A more general design problem would be to minimize some weighted sum of the estimator’s variance and bias. It
would be optimal to accept some lies from the most “reluctant” respondents.
16.6 Criticisms of Proposed Privacy Measures
We can use a utilitarian approach to analyze some privacy measures.

We’ll enlist Python Code to help us.
16.6.1 Analysis of the Method of Lanke (1976)
Lanke (1976) recommends a privacy protection criterion that minimizes:
max {Pr(𝐴|yes), Pr(𝐴|no)} (16.20)
Following Lanke’s suggestion, the statistician should find the highest possible Pr(𝐴|yes) consistent with truth telling while
Pr(𝐴|no) is fixed at 0. The variance is then minimized at point 𝑋 in Figure 3.
However, we can see that in Figure 3, point 𝑍 offers a smaller variance that still allows cooperation of the respondents,
and it is achievable following our discussion of the truth border in Part III:

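pi = 0.3
n = 100
nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7]
x = np.arange(0, 1, 0.001)
y = x - 0.4    # linear truth border
z = x          # 45-degree line
x0 = np.arange(pi, 1, 0.001)
x2 = np.arange(0, pi, 0.001)
y1 = [pi for i in x0]    # horizontal guide line at Pr(A|no) = pi
y2 = [pi for i in x2]    # vertical guide line at Pr(A|yes) = pi
plt.plot(x0, y1, 'c:', linewidth=2)
plt.plot(y2, x2, 'c:', linewidth=2)
plt.plot(x, y, 'r-', label='Truth Border')
plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, label='truth telling')
plt.fill_between(x, 0, y, facecolor='green', alpha=0.05, label='lying')
# iso-variance curves, computed as in Iso_Variance above
for i in range(len(nv)):
    y_iso = pi - (pi**2 * (1 - pi)**2) / (n * (nv[i] / n) * (x0 - pi + 1e-8))
    plt.plot(x0, y_iso, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}')
plt.scatter(0.498, 0.1, c='b', marker='*', label='Z', s=150)
plt.scatter(0.4, 0, c='y', label='X', s=150)
plt.xlim([0, 1])
plt.ylim([0, 0.5])
plt.text(0.85, 0.35, "Lying", fontdict={'size':28, 'style':'italic'})
plt.text(0.515, 0.095, "Optimal Design", fontdict={'size':16, 'color':'b'})
plt.show()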

16.6.2 Method of Leysieffer and Warner (1976)
Leysieffer and Warner (1976) recommend a two-dimensional measure of jeopardy that reduces to a single dimension
when there is “no jeopardy in a ‘no’ answer”, which means that
Pr(yes|𝐴) = 1
and
Pr(𝐴|no) = 0
This is not an optimal choice under a utilitarian approach.

16.6.3 Analysis of the Method of Chaudhuri and Mukerjee (1988)
[Chadhuri and Mukerjee, 1988]

Chaudhuri and Mukerjee (1988) argued that, since a “yes” answer may sometimes relate to the sensitive group A, a clever
respondent may be inclined always to respond “no”, falsely but safely. In this situation, the truth border
is such that individuals choose to lie whenever the truthful answer is “yes” and
Pr(𝐴|no) = 0
Here the gain from lying is too high for someone to volunteer a “yes” answer.
This means that
𝑈𝑖 (Pr(𝐴|yes), truth) < 𝑈𝑖 (Pr(𝐴|no), lie)
in every situation.

As a result, there is no attainable model design.
However, under a utilitarian approach there should exist other survey designs that are consistent with truthful answers.
In particular, respondents will choose to answer truthfully if the relative advantage from lying is eliminated.
We can use Python to show that the optimal model design corresponds to point Q in Figure 4:
def f(x):
    # truth border for the square-root utility function; zero below x = 0.16
    if x < 0.16:
        return 0
    else:
        return (pow(x, 0.5) - 0.4)**2

pi = 0.3
n = 100
nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7]
x = np.arange(0, 1, 0.001)
y = [f(i) for i in x]
z = x
x0 = np.arange(pi, 1, 0.001)
x2 = np.arange(0, pi, 0.001)
y1 = [pi for i in x0]    # horizontal guide line at Pr(A|no) = pi
y2 = [pi for i in x2]    # vertical guide line at Pr(A|yes) = pi
x3 = np.arange(0.16, 1, 0.001)
y3 = (pow(x3, 0.5) - 0.4)**2
plt.plot(x0, y1, 'c:', linewidth=2)
plt.plot(y2, x2, 'c:', linewidth=2)
plt.plot(x3, y3, 'b-', label='Truth Border')
plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, label='Truth telling')
plt.fill_between(x3, 0, y3, facecolor='green', alpha=0.05, label='Lying')
# iso-variance curves, computed as in Iso_Variance above
for i in range(len(nv)):
    y_iso = pi - (pi**2 * (1 - pi)**2) / (n * (nv[i] / n) * (x0 - pi + 1e-8))
    plt.plot(x0, y_iso, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}')
plt.scatter(0.61, 0.146, c='r', marker='*', label='Z', s=150)
plt.xlim([0, 1])
plt.ylim([0, 0.5])
plt.text(0.63, 0.141, "Optimal Design", fontdict={'size':16, 'color':'r'})
plt.show()
16.6.4 Method of Greenberg et al. (1977)
[Greenberg et al., 1977]

Greenberg et al. (1977) defined the hazard for an individual in 𝐴 as the probability that he or she is perceived as belonging
to 𝐴:
Pr(yes|𝐴) × Pr(𝐴|yes) + Pr(no|𝐴) × Pr(𝐴|no) (16.21)
The hazard for an individual who does not belong to 𝐴 is

$$
\Pr(\text{yes}|A') \times \Pr(A|\text{yes}) + \Pr(\text{no}|A') \times \Pr(A|\text{no}) \tag{16.22}
$$

They also considered an alternative related measure of hazard that they said “is likely to be closer to the actual concern
felt by a respondent.”
Their “limited hazard” for an individual in 𝐴 and 𝐴′ is
$$
\Pr(\text{yes}|A) \times \Pr(A|\text{yes}) \tag{16.23}
$$
and
$$
\Pr(\text{yes}|A') \times \Pr(A|\text{yes}) \tag{16.24}
$$
According to Greenberg et al. (1977), a respondent commits himself or herself to answer truthfully on the basis of a
probability in (16.21) or (16.23) before randomly selecting the question to be answered.
Suppose that the appropriate privacy measure is captured by the notion of “limited hazard” in (16.23) and (16.24).
Consider an unrelated question model where the unrelated question is replaced by the instruction “Say the word ‘no’”,
which implies that
Pr(𝐴|yes) = 1
and it follows that:

• Hazard for an individual in 𝐴′ is 0.
• Hazard for an individual in 𝐴 can also be made arbitrarily small by choosing a sufficiently small Pr(yes|𝐴).
Even though this hazard can be set arbitrarily close to 0, an individual in 𝐴 will completely reveal his or her identity
whenever truthfully answering the sensitive question.
Under a utilitarian framework, however, this is contradictory.
If individuals are willing to volunteer this information, the randomized response design was not necessary
in the first place.
The measure ignores the fact that respondents retain the option of lying until they have seen the question to be answered.
The justifications for a randomized response procedure are that

• Respondents are thought to feel discomfort from being perceived as belonging to the sensitive group.
• Respondents prefer answering questions truthfully to lying, unless a truthful answer is too revealing.
If a privacy measure is not completely consistent with the rational behavior of the respondents, all efforts to derive an
optimal model design are futile.
A utilitarian approach provides a systematic way to model respondents’ behavior under the assumption that they maximize
their expected utilities.
In a utilitarian analysis:
• A truth border divides the space of conditional probabilities of being perceived as belonging to the sensitive group,
Pr(𝐴|yes) and Pr(𝐴|no), into the truth-telling region and the lying region.
• The optimal model design is obtained at the point where the truth border touches the lowest possible iso-variance
curve.
A practical implication of the analysis of [Ljungqvist, 1993] is that uncertainty about respondents’ demands for privacy
can be acknowledged by choosing Pr(𝐴|yes) and Pr(𝐴|no) sufficiently close to each other.


Part III
Linear Programming
CHAPTER
SEVENTEEN
OPTIMAL TRANSPORT
17.1 Overview
The transportation or optimal transport problem is interesting both because of its many applications and because of
its important role in the history of economic theory.
In this lecture, we describe the problem, tell how linear programming is a key tool for solving it, and then provide some
examples.
We will provide other applications in followup lectures.
The optimal transport problem was studied in early work about linear programming, as summarized for example by
[Dorfman et al., 1958]. A modern reference about applications in economics is [Galichon, 2016].
Below, we show how to solve the optimal transport problem using several implementations of linear programming, in-
cluding, in order,
1. the linprog solver from SciPy,
2. the linprog_simplex solver from QuantEcon and
3. the simplex-based solvers included in the Python Optimal Transport package.
!pip install --upgrade quantecon

!pip install --upgrade POT
import numpy as np
from scipy.optimize import linprog
from quantecon.optimize.linprog_simplex import linprog_simplex
import ot
from scipy.stats import betabinom
import networkx as nx
17.2 The Optimal Transport Problem
Suppose that 𝑚 factories produce goods that must be sent to 𝑛 locations.

Let
• 𝑥𝑖𝑗 denote the quantity shipped from factory 𝑖 to location 𝑗
• 𝑐𝑖𝑗 denote the cost of shipping one unit from factory 𝑖 to location 𝑗
• 𝑝𝑖 denote the capacity of factory 𝑖 and 𝑞𝑗 denote the amount required at location 𝑗.
• 𝑖 = 1, 2, … , 𝑚 and 𝑗 = 1, 2, … , 𝑛.
A planner wants to minimize total transportation costs subject to the following constraints:
• The amount shipped from each factory must equal its capacity.
• The amount shipped to each location must equal the quantity required there.
The figure below shows one visualization of this idea, when factories and target locations are distributed in the plane.
The sizes of the vertices in the figure are proportional to

• capacity, for the factories, and
• demand (amount required) for the target locations.
The arrows show one possible transport plan, which respects the constraints stated above.
The planner’s problem can be expressed as the following constrained minimization problem:
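$$
\begin{aligned}
\min_{x_{ij}} \ & \sum_{i=1}^m \sum_{j=1}^n c_{ij} x_{ij} \\
\text{subject to } \ & \sum_{j=1}^n x_{ij} = p_i, & i = 1, 2, \dots, m \\
& \sum_{i=1}^m x_{ij} = q_j, & j = 1, 2, \dots, n \\
& x_{ij} \geq 0
\end{aligned} \tag{17.1}
$$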
This is an optimal transport problem with

• 𝑚𝑛 decision variables, namely, the entries 𝑥𝑖𝑗 and
• 𝑚 + 𝑛 constraints.

Summing the 𝑞𝑗 ’s across all 𝑗’s and the 𝑝𝑖 ’s across all 𝑖’s indicates that the total capacity of all the factories equals total
requirements at all locations:
$$
\sum_{j=1}^n q_j = \sum_{j=1}^n \sum_{i=1}^m x_{ij} = \sum_{i=1}^m \sum_{j=1}^n x_{ij} = \sum_{i=1}^m p_i \tag{17.2}
$$
The presence of the restrictions in (17.2) will be the source of one redundancy in the complete set of restrictions that we
describe below.
More about this later.
17.3 The Linear Programming Approach
In this section we discuss using standard linear programming solvers to tackle the optimal transport problem.
17.3.1 Vectorizing a Matrix of Decision Variables
A matrix of decision variables 𝑥𝑖𝑗 appears in problem (17.1).

The SciPy function linprog expects to see a vector of decision variables.
This situation impels us to rewrite our problem in terms of a vector of decision variables.
Let
• 𝑋, 𝐶 be 𝑚 × 𝑛 matrices with entries 𝑥𝑖𝑗 , 𝑐𝑖𝑗 ,
• 𝑝 be an 𝑚-dimensional vector with entries 𝑝𝑖 ,
• 𝑞 be an 𝑛-dimensional vector with entries 𝑞𝑗 .
With 1𝑛 denoting the 𝑛-dimensional column vector (1, 1, … , 1)′ , our problem can now be expressed compactly as:
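$$
\begin{aligned}
\min_{X} \ & \operatorname{tr}(C' X) \\
\text{subject to } \ & X \mathbf{1}_n = p \\
& X' \mathbf{1}_m = q \\
& X \geq 0
\end{aligned}
$$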
We can convert the matrix 𝑋 into a vector by stacking all of its columns into a column vector.
Doing this is called vectorization, an operation that we denote vec(𝑋).
Similarly, we convert the matrix 𝐶 into an 𝑚𝑛-dimensional vector vec(𝐶).
The objective function can be expressed as the inner product between vec(𝐶) and vec(𝑋):
vec(𝐶)′ ⋅ vec(𝑋).
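The equality between the trace form of the objective and the inner product of the vectorized matrices is easy to confirm numerically. The following sketch uses random matrices and a hypothetical helper `vec` that stacks columns, which is what NumPy's `order='F'` reshape does.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
C = rng.random((m, n))
X = rng.random((m, n))


def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order='F')


trace_form = np.trace(C.T @ X)       # tr(C'X)
inner_product = vec(C) @ vec(X)      # vec(C)' vec(X)
elementwise = (C * X).sum()          # sum over i, j of c_ij * x_ij

print(np.isclose(trace_form, inner_product), np.isclose(trace_form, elementwise))
```

All three expressions compute the same total shipping cost; the vectorized form is the one a linear programming solver expects.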
To express the constraints in terms of vec(𝑋), we use a Kronecker product denoted by ⊗ and defined as follows.
Suppose 𝐴 is an 𝑚 × 𝑠 matrix with entries (𝑎𝑖𝑗 ) and that 𝐵 is an 𝑛 × 𝑡 matrix.
The Kronecker product of 𝐴 and 𝐵 is defined, in block matrix form, by
$$
A \otimes B =
\begin{pmatrix}
a_{11} B & a_{12} B & \dots & a_{1s} B \\
a_{21} B & a_{22} B & \dots & a_{2s} B \\
 & \vdots & & \\
a_{m1} B & a_{m2} B & \dots & a_{ms} B
\end{pmatrix}.
$$
𝐴 ⊗ 𝐵 is an 𝑚𝑛 × 𝑠𝑡 matrix.
It has the property that for any 𝑚 × 𝑛 matrix 𝑋
$$
\operatorname{vec}(A' X B) = (B' \otimes A') \operatorname{vec}(X). \tag{17.3}
$$
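Property (17.3) can be checked on random matrices with `np.kron`. The dimensions below are arbitrary illustrative choices, and `vec` is a hypothetical helper that stacks columns.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s, n, t = 3, 2, 4, 5
A = rng.random((m, s))     # m x s
B = rng.random((n, t))     # n x t
X = rng.random((m, n))     # m x n


def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order='F')


lhs = vec(A.T @ X @ B)                  # vec(A'XB), length s*t
rhs = np.kron(B.T, A.T) @ vec(X)        # (B' kron A') vec(X)

print(np.kron(A, B).shape)              # (m*n, s*t)
print(np.allclose(lhs, rhs))
```

The identity depends on column stacking; with NumPy's default row-major flattening the roles of the two Kronecker factors are interchanged.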
We can now express our constraints in terms of vec(𝑋).

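Let 𝐴 = I′𝑚 and 𝐵 = 1𝑛 .
By equation (17.3),
$$
X \mathbf{1}_n = \operatorname{vec}(X \mathbf{1}_n) = \operatorname{vec}(\mathrm{I}_m X \mathbf{1}_n) = (\mathbf{1}_n' \otimes \mathrm{I}_m) \operatorname{vec}(X),
$$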
where I𝑚 denotes the 𝑚 × 𝑚 identity matrix.

The constraint 𝑋 1𝑛 = 𝑝 can now be written as:
$$
(\mathbf{1}_n' \otimes \mathrm{I}_m) \operatorname{vec}(X) = p.
$$
Similarly, the constraint 𝑋 ′ 1𝑚 = 𝑞 can be rewritten as:
$$
(\mathrm{I}_n \otimes \mathbf{1}_m') \operatorname{vec}(X) = q.
$$
With 𝑧 ∶= vec(𝑋), our problem can now be expressed in terms of an 𝑚𝑛-dimensional vector of decision variables:
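$$
\begin{aligned}
\min_{z} \ & \operatorname{vec}(C)' z \\
\text{subject to } \ & A z = b \\
& z \geq 0
\end{aligned}
\tag{17.4}
$$
where
$$
A = \begin{pmatrix} \mathbf{1}_n' \otimes \mathrm{I}_m \\ \mathrm{I}_n \otimes \mathbf{1}_m' \end{pmatrix}
\quad \text{and} \quad
b = \begin{pmatrix} p \\ q \end{pmatrix}
$$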
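To see what the stacked constraint matrix does, the following sketch builds 𝐴 for a tiny case and confirms that 𝐴 vec(𝑋) returns the row sums of 𝑋 followed by its column sums. The 2 × 3 matrix used here is an assumed toy example, not the data of the application.

```python
import numpy as np

m, n = 2, 3
X = np.arange(1.0, m * n + 1).reshape(m, n)   # toy plan [[1, 2, 3], [4, 5, 6]]
z = X.reshape(-1, order='F')                  # vec(X), column-major

A = np.vstack([np.kron(np.ones((1, n)), np.identity(m)),   # 1_n' ⊗ I_m
               np.kron(np.identity(n), np.ones((1, m)))])  # I_n ⊗ 1_m'

Az = A @ z
row_sums = X.sum(axis=1)   # X 1_n, to be matched with p
col_sums = X.sum(axis=0)   # X' 1_m, to be matched with q

print(Az)
print(row_sums, col_sums)
```

The first 𝑚 entries of 𝐴𝑧 are therefore compared with 𝑝 and the last 𝑛 entries with 𝑞.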
17.3.2 An Application
We now provide an example that takes the form (17.4) that we’ll solve by deploying the function linprog.
The table below provides numbers for the requirements vector 𝑞, the capacity vector 𝑝, and entries 𝑐𝑖𝑗 of the cost-of-
shipping matrix 𝐶.
The numbers in the above table tell us to set 𝑚 = 3, 𝑛 = 5, and construct the following objects:
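$$
p = \begin{pmatrix} 50 \\ 100 \\ 150 \end{pmatrix}, \quad
q = \begin{pmatrix} 25 \\ 115 \\ 60 \\ 30 \\ 70 \end{pmatrix}
\quad \text{and} \quad
C = \begin{pmatrix}
10 & 15 & 20 & 20 & 40 \\
20 & 40 & 15 & 30 & 30 \\
30 & 35 & 40 & 55 & 25
\end{pmatrix}.
$$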
Let’s write Python code that sets up the problem and solves it.
# Define parameters
m = 3
n = 5

p = np.array([50, 100, 150])
q = np.array([25, 115, 60, 30, 70])

C = np.array([[10, 15, 20, 20, 40],
              [20, 40, 15, 30, 30],
              [30, 35, 40, 55, 25]])

# Vectorize matrix C
C_vec = C.reshape((m*n, 1), order='F')

# Construct matrix A by Kronecker product
A1 = np.kron(np.ones((1, n)), np.identity(m))
A2 = np.kron(np.identity(n), np.ones((1, m)))
A = np.vstack([A1, A2])

# Construct vector b
b = np.hstack([p, q])

# Solve the primal problem
res = linprog(C_vec, A_eq=A, b_eq=b)

# Print results
print("message:", res.message)
print("nit:", res.nit)
print("fun:", res.fun)
print("z:", res.x)
print("X:", res.x.reshape((m,n), order='F'))
message: Optimization terminated successfully. (HiGHS Status 7: Optimal)

nit: 8
fun: 7225.0
z: [ 0. 10. 15. 50. 0. 65. 0. 60. 0. 0. 30. 0. 0. 0. 70.]
X: [[ 0. 50. 0. 0. 0.]
[10. 0. 60. 30. 0.]
[15. 65. 0. 0. 70.]]
Notice how, in the line C_vec = C.reshape((m*n, 1), order='F'), we are careful to vectorize using the
flag order='F'.
This is consistent with converting 𝐶 into a vector by stacking all of its columns into a column vector.
Here 'F' stands for “Fortran”, and we are using Fortran style column-major order.
(For an alternative approach, using Python’s default row-major ordering, see this lecture by Alfred Galichon.)
Interpreting the rank deficiency:
Earlier SciPy solver methods issued a warning that A is not of full row rank. The HiGHS output above shows no such warning, but A is rank deficient all the same.
This indicates that the linear program has been set up to include one or more redundant constraints.
Here, the source of the redundancy is the structure of restrictions (17.2).
Let’s explore this further by printing out 𝐴 and staring at it.

array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0.],
[0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
[1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.]])
The rank deficiency of 𝐴 reflects the fact that the first three constraints and the last five constraints each imply that “total requirements equal total capacities”, as expressed in (17.2).
One equality constraint here is redundant.
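The redundancy can be made concrete with a rank computation. The sketch below rebuilds 𝐴 for 𝑚 = 3 and 𝑛 = 5 exactly as in the listing above and uses `np.linalg.matrix_rank`; the variable names are illustrative.

```python
import numpy as np

m, n = 3, 5
A = np.vstack([np.kron(np.ones((1, n)), np.identity(m)),
               np.kron(np.identity(n), np.ones((1, m)))])

rank_A = np.linalg.matrix_rank(A)             # rank of the full 8 x 15 matrix
rank_reduced = np.linalg.matrix_rank(A[:-1])  # rank after dropping the last row

# Adding up the first m rows and adding up the last n rows gives the same
# vector of ones, so the rows are linearly dependent.
sum_first_m = A[:m].sum(axis=0)
sum_last_n = A[m:].sum(axis=0)

print(rank_A, rank_reduced)
print(np.array_equal(sum_first_m, sum_last_n))
```

The matrix has 𝑚 + 𝑛 = 8 rows but rank 𝑚 + 𝑛 − 1 = 7, and dropping one row leaves the rank unchanged, which is the sense in which exactly one constraint is redundant.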
Below we drop one of the equality constraints, and use only 7 of them.
After doing this, we attain the same minimized cost.
However, we find a different transportation plan.
Though it is a different plan, it attains the same cost!
linprog(C_vec, A_eq=A[:-1], b_eq=b[:-1])

success: True
status: 0
fun: 7225.0
x: [ 0.000e+00 1.000e+01 ... 0.000e+00 7.000e+01]
nit: 8
lower: residual: [ 0.000e+00 1.000e+01 ... 0.000e+00
7.000e+01]
marginals: [ 0.000e+00 0.000e+00 ... 1.500e+01
0.000e+00]
upper: residual: [ inf inf ... inf
inf]
marginals: [ 0.000e+00 0.000e+00 ... 0.000e+00
0.000e+00]
eqlin: residual: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00]
marginals: [ 5.000e+00 1.500e+01 2.500e+01 5.000e+00
1.000e+01 -0.000e+00 1.500e+01]
ineqlin: residual: []
marginals: []
mip_node_count: 0
mip_dual_bound: 0.0
mip_gap: 0.0
%time linprog(C_vec, A_eq=A[:-1], b_eq=b[:-1])

Wall time: 1.54 ms

success: True
status: 0
fun: 7225.0
x: [ 0.000e+00 1.000e+01 ... 0.000e+00 7.000e+01]
nit: 8
lower: residual: [ 0.000e+00 1.000e+01 ... 0.000e+00
7.000e+01]
marginals: [ 0.000e+00 0.000e+00 ... 1.500e+01
0.000e+00]
upper: residual: [ inf inf ... inf
inf]
marginals: [ 0.000e+00 0.000e+00 ... 0.000e+00
0.000e+00]
eqlin: residual: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00]
marginals: [ 5.000e+00 1.500e+01 2.500e+01 5.000e+00
1.000e+01 -0.000e+00 1.500e+01]
ineqlin: residual: []
marginals: []
mip_node_count: 0
mip_dual_bound: 0.0
mip_gap: 0.0
%time linprog(C_vec, A_eq=A, b_eq=b)

Wall time: 1.58 ms

success: True
status: 0
fun: 7225.0
x: [ 0.000e+00 1.000e+01 ... 0.000e+00 7.000e+01]
nit: 8
lower: residual: [ 0.000e+00 1.000e+01 ... 0.000e+00
7.000e+01]
marginals: [ 0.000e+00 0.000e+00 ... 1.500e+01
0.000e+00]
upper: residual: [ inf inf ... inf
inf]
marginals: [ 0.000e+00 0.000e+00 ... 0.000e+00
0.000e+00]
eqlin: residual: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00]
marginals: [ 1.000e+01 2.000e+01 3.000e+01 -0.000e+00
5.000e+00 -5.000e+00 1.000e+01 -5.000e+00]
ineqlin: residual: []
marginals: []
mip_node_count: 0
mip_dual_bound: 0.0
mip_gap: 0.0
In this run, it is marginally quicker to work with the system that removed a redundant constraint (1.54 ms versus 1.58 ms), although a gap this small is within ordinary timing noise.
Let’s drill down and do some more calculations to help us understand whether or not our finding two different optimal
transport plans reflects our having dropped a redundant equality constraint.

Hint
It will turn out that dropping a redundant equality isn’t really what mattered.
To verify our hint, we shall simply use all of the original equality constraints (including a redundant one), but we’ll just
shuffle the order of the constraints.
arr = np.arange(m+n)

sol_found = []
cost = []

# simulate 1000 times
for i in range(1000):
    np.random.shuffle(arr)
    res_shuffle = linprog(C_vec, A_eq=A[arr], b_eq=b[arr])

    # if find a new solution
    sol = tuple(res_shuffle.x)
    if sol not in sol_found:
        sol_found.append(sol)
        cost.append(res_shuffle.fun)

for i in range(len(sol_found)):
    print(f"transportation plan {i}: ", sol_found[i])
    print(f"     minimized cost {i}: ", cost[i])
transportation plan 0: (0.0, 10.0, 15.0, 50.0, 0.0, 65.0, 0.0, 60.0, 0.0, 0.0, 30.0, 0.0, 0.0, 0.0, 70.0)
minimized cost 0: 7225.0
Ah hah! As you can see, putting constraints in different orders in this case uncovers two optimal transportation plans that
achieve the same minimized cost.
These are the same two plans computed earlier.
Next, we show that leaving out the first constraint “accidentally” leads to the initial plan that we computed.
linprog(C_vec, A_eq=A[1:], b_eq=b[1:])

success: True
status: 0
fun: 7225.0
x: [ 0.000e+00 1.000e+01 ... 0.000e+00 7.000e+01]
nit: 8
lower: residual: [ 0.000e+00 1.000e+01 ... 0.000e+00
7.000e+01]
marginals: [ 0.000e+00 0.000e+00 ... 1.500e+01
0.000e+00]
upper: residual: [ inf inf ... inf
inf]
marginals: [ 0.000e+00 0.000e+00 ... 0.000e+00
0.000e+00]
eqlin: residual: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00]
marginals: [ 1.000e+01 2.000e+01 1.000e+01 1.500e+01
5.000e+00 2.000e+01 5.000e+00]
ineqlin: residual: []
marginals: []
mip_node_count: 0
mip_dual_bound: 0.0
mip_gap: 0.0
Let’s compare this transport plan with
res.x
array([ 0., 10., 15., 50., 0., 65., 0., 60., 0., 0., 30., 0., 0.,
0., 70.])
Here the matrix 𝑋 contains entries 𝑥𝑖𝑗 that tell amounts shipped from factory 𝑖 = 1, 2, 3 to location 𝑗 = 1, 2, … , 5.
The vector 𝑧 evidently equals vec(𝑋).
The minimized cost from the optimal transport plan is given by the fun attribute of the result.
17.3.3 Using a Just-in-Time Compiler
We can also solve optimal transportation problems using a powerful tool from QuantEcon, namely, quantecon.optimize.linprog_simplex.
While this routine uses the same simplex algorithm as scipy.optimize.linprog, the code is accelerated by using a just-in-time compiler shipped in the numba library.
As you will see very soon, by using quantecon.optimize.linprog_simplex the time required to solve an optimal transportation problem can be reduced significantly.
# construct matrices/vectors for linprog_simplex
c = C.flatten()

# Equality constraints
A_eq = np.zeros((m+n, m*n))
for i in range(m):
    for j in range(n):
        A_eq[i, i*n+j] = 1
        A_eq[m+j, i*n+j] = 1

b_eq = np.hstack([p, q])
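Because `C.flatten()` uses row-major order, the constraint matrix built by the loop above differs from the column-major matrix 𝐴 of (17.4). A minimal sketch, under the assumption that entry 𝑥𝑖𝑗 sits at position `i*n + j`, shows that the loop reproduces the Kronecker products with the two factors swapped. The names `A_loop` and `A_kron` are illustrative.

```python
import numpy as np

m, n = 3, 5

# Loop construction, as in the listing above (row-major indexing i*n + j)
A_loop = np.zeros((m + n, m * n))
for i in range(m):
    for j in range(n):
        A_loop[i, i * n + j] = 1       # row-sum constraint for factory i
        A_loop[m + j, i * n + j] = 1   # column-sum constraint for location j

# Kronecker form for row-major stacking: factors swapped relative to (17.4)
A_kron = np.vstack([np.kron(np.identity(m), np.ones((1, n))),
                    np.kron(np.ones((1, m)), np.identity(n))])
same = np.array_equal(A_loop, A_kron)

# Check on a toy plan: the product returns row sums, then column sums
X = np.arange(m * n, dtype=float).reshape(m, n)
constraints = A_loop @ X.flatten()

print(same)
print(constraints)
```

Either ordering works, provided the cost vector and the constraint matrix are flattened in the same way.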
Since quantecon.optimize.linprog_simplex does maximization instead of minimization, we need to put a negative sign before vector c.
res_qe = linprog_simplex(-c, A_eq=A_eq, b_eq=b_eq)
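Since the two LP solvers use the same simplex algorithm, we might expect to get exactly the same solutions. In fact, the two plans printed below differ, although both attain the minimized cost of 7225.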
res_qe.x.reshape((m, n), order='C')
array([[15., 35., 0., 0., 0.],

[10., 0., 60., 30., 0.],
[ 0., 80., 0., 0., 70.]])
res.x.reshape((m, n), order='F')
array([[ 0., 50., 0., 0., 0.],

[10., 0., 60., 30., 0.],
[15., 65., 0., 0., 70.]])
Let’s do a speed comparison between scipy.optimize.linprog and quantecon.optimize.linprog_simplex.
# scipy.optimize.linprog
%time res = linprog(C_vec, A_eq=A[:-1, :], b_eq=b[:-1])
Wall time: 2.04 ms
# quantecon.optimize.linprog_simplex
%time out = linprog_simplex(-c, A_eq=A_eq, b_eq=b_eq)
CPU times: user 114 µs, sys: 0 ns, total: 114 µs

Wall time: 121 µs
As you can see, quantecon.optimize.linprog_simplex is much faster (121 µs versus 2.04 ms in this run).
(Note however, that the SciPy version is probably more stable than the QuantEcon version, having been tested more
extensively over a longer period of time.)
17.4 The Dual Problem
Let 𝑢, 𝑣 denote vectors of dual decision variables with entries (𝑢𝑖 ), (𝑣𝑗 ).
The dual to minimization problem (17.1) is the maximization problem:
$$
\begin{aligned}
\max_{u_i, v_j} \ & \sum_{i=1}^m p_i u_i + \sum_{j=1}^n q_j v_j \\
\text{subject to } \ & u_i + v_j \leq c_{ij}, \quad i = 1, 2, \dots, m; \ j = 1, 2, \dots, n
\end{aligned}
\tag{17.5}
$$
The dual problem is also a linear programming problem.

It has 𝑚 + 𝑛 dual variables and 𝑚𝑛 constraints.
Vectors 𝑢 and 𝑣 of values are attached to the first and the second sets of primal constraints, respectively.
Thus, 𝑢 is attached to the constraints
• (1′𝑛 ⊗ I𝑚 ) vec(𝑋) = 𝑝

and 𝑣 is attached to constraints

• (I𝑛 ⊗ 1′𝑚 ) vec(𝑋) = 𝑞.
Components of the vectors 𝑢 and 𝑣 of per unit values are shadow prices of the quantities appearing on the right sides of
those constraints.
We can write the dual problem as
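$$
\begin{aligned}
\max_{u_i, v_j} \ & p' u + q' v \\
\text{subject to } \ & A' \begin{pmatrix} u \\ v \end{pmatrix} \leq \operatorname{vec}(C)
\end{aligned}
\tag{17.6}
$$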
For the same numerical example described above, let’s solve the dual problem.
# Solve the dual problem
res_dual = linprog(-b, A_ub=A.T, b_ub=C_vec,
                   bounds=[(None, None)]*(m+n))

# Print results
print("message:", res_dual.message)
print("nit:", res_dual.nit)
print("fun:", res_dual.fun)
print("u:", res_dual.x[:m])
print("v:", res_dual.x[-n:])

nit: 9
fun: -7225.0
u: [-20. -10. 0.]
v: [30. 35. 25. 40. 25.]
We can also solve the dual problem using quantecon.optimize.linprog_simplex.
res_dual_qe = linprog_simplex(b_eq, A_ub=A_eq.T, b_ub=c)
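The shadow prices computed by the two programs are not identical, but they are equivalent: adding 25 to each 𝑢𝑖 and subtracting 25 from each 𝑣𝑗 in the SciPy solution gives the QuantEcon solution.
Such a shift leaves every sum 𝑢𝑖 + 𝑣𝑗 unchanged and, because total capacity equals total requirements, it leaves the dual objective unchanged too.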
res_dual_qe.x
array([ 5., 15., 25., 5., 10., 0., 15., 0.])
res_dual.x
array([-20., -10., 0., 30., 35., 25., 40., 25.])
We can compare computational times from using our two tools.
%time linprog(-b, A_ub=A.T, b_ub=C_vec, bounds=[(None, None)]*(m+n))

Wall time: 2.35 ms
success: True
status: 0
fun: -7225.0
x: [-2.000e+01 -1.000e+01 0.000e+00 3.000e+01 3.500e+01
2.500e+01 4.000e+01 2.500e+01]
nit: 9
lower: residual: [ inf inf inf inf
inf inf inf inf]
marginals: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00]
upper: residual: [ inf inf inf inf
inf inf inf inf]
marginals: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00]
eqlin: residual: []
marginals: []
ineqlin: residual: [ 0.000e+00 0.000e+00 ... 1.500e+01
0.000e+00]
marginals: [-0.000e+00 -1.000e+01 ... -0.000e+00
-7.000e+01]
mip_node_count: 0
mip_dual_bound: 0.0
mip_gap: 0.0
%time linprog_simplex(b_eq, A_ub=A_eq.T, b_ub=c)

Wall time: 412 µs
SimplexResult(x=array([ 5., 15., 25., 5., 10., 0., 15., 0.]), lambd=array([ 0., 35., 0., 15., 0., 25., 0., 60., 15., 0., 0., 80., 0., 0., 70.]), fun=7225.0, success=True, status=0, num_iter=24)
In this run, quantecon.optimize.linprog_simplex solves the dual problem roughly 6 times faster (412 µs versus 2.35 ms).

Just for completeness, let’s solve the dual problems with full-row-rank 𝐴 matrices that we create by dropping a redundant equality constraint.
Try first leaving out the first constraint:
linprog(-b[1:], A_ub=A[1:].T, b_ub=C_vec,

bounds=[(None, None)]*(m+n-1))

success: True
status: 0
fun: -7225.0
x: [ 1.000e+01 2.000e+01 1.000e+01 1.500e+01 5.000e+00
2.000e+01 5.000e+00]
nit: 12
inf inf inf]
marginals: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00


0.000e+00 0.000e+00 0.000e+00]
inf inf inf]
marginals: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00]
eqlin: residual: []
marginals: []
0.000e+00]
marginals: [-1.500e+01 -1.000e+01 ... -0.000e+00
-7.000e+01]
mip_node_count: 0
mip_dual_bound: 0.0
mip_gap: 0.0
Now let’s instead leave out the last constraint:
linprog(-b[:-1], A_ub=A[:-1].T, b_ub=C_vec,

bounds=[(None, None)]*(m+n-1))
message: Optimization terminated successfully. (HiGHS Status 7: Optimal)
success: True
status: 0
fun: -7225.0
x: [ 5.000e+00 1.500e+01 2.500e+01 5.000e+00 1.000e+01
-0.000e+00 1.500e+01]
nit: 9
inf inf inf]
marginals: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00]
inf inf inf]
marginals: [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00]
eqlin: residual: []
marginals: []
0.000e+00]
marginals: [-0.000e+00 -1.000e+01 ... -0.000e+00
-7.000e+01]
mip_node_count: 0
mip_dual_bound: 0.0
mip_gap: 0.0
17.4.1 Interpretation of dual problem
By strong duality (please see this lecture Linear Programming), we know that:
$$
\sum_{i=1}^m \sum_{j=1}^n c_{ij} x_{ij} = \sum_{i=1}^m p_i u_i + \sum_{j=1}^n q_j v_j
$$
One unit more capacity in factory 𝑖, i.e. 𝑝𝑖 , results in 𝑢𝑖 more transportation costs.
Thus, 𝑢𝑖 describes the cost of shipping one unit from factory 𝑖.
Call this the ship-out cost of one unit shipped from factory 𝑖.
Similarly, 𝑣𝑗 is the cost of shipping one unit to location 𝑗.
Call this the ship-in cost of one unit to location 𝑗.
Strong duality implies that total transportation costs equal total ship-out costs plus total ship-in costs.
It is reasonable that, for one unit of a product, ship-out cost 𝑢𝑖 plus ship-in cost 𝑣𝑗 should equal transportation cost 𝑐𝑖𝑗 .
This equality is assured by complementary slackness conditions that state that whenever 𝑥𝑖𝑗 > 0, meaning that there
are positive shipments from factory 𝑖 to location 𝑗, it must be true that 𝑢𝑖 + 𝑣𝑗 = 𝑐𝑖𝑗 .
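These conditions can be verified directly on the numerical example. The sketch below copies the primal plan 𝑋 and the dual prices 𝑢, 𝑣 from the SciPy output reported earlier and checks dual feasibility, complementary slackness and strong duality. The variable names `slack`, `dual_feasible` and `comp_slack` are illustrative.

```python
import numpy as np

p = np.array([50, 100, 150])
q = np.array([25, 115, 60, 30, 70])
C = np.array([[10, 15, 20, 20, 40],
              [20, 40, 15, 30, 30],
              [30, 35, 40, 55, 25]])

# Primal plan and dual prices copied from the SciPy output reported above
X = np.array([[ 0, 50,  0,  0,  0],
              [10,  0, 60, 30,  0],
              [15, 65,  0,  0, 70]])
u = np.array([-20, -10, 0])
v = np.array([30, 35, 25, 40, 25])

slack = C - (u[:, None] + v[None, :])      # c_ij - u_i - v_j for every pair

dual_feasible = bool((slack >= 0).all())   # u_i + v_j <= c_ij everywhere
comp_slack = bool((slack[X > 0] == 0).all())  # equality wherever x_ij > 0
primal_cost = int(np.sum(C * X))           # total transportation cost
dual_value = int(p @ u + q @ v)            # ship-out plus ship-in costs

print(dual_feasible, comp_slack)
print(primal_cost, dual_value)
```

The same check can be run with the prices returned by the QuantEcon solver, since those differ from the SciPy prices only by a constant shift between 𝑢 and 𝑣.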
17.5 The Python Optimal Transport Package
There is an excellent Python package for optimal transport that simplifies some of the steps we took above.
In particular, the package takes care of the vectorization steps before passing the data out to a linear programming routine.
(That said, the discussion provided above on vectorization remains important, since we want to understand what happens
under the hood.)
17.5.1 Replicating Previous Results
The following line of code solves the example application discussed above using linear programming.
X = ot.emd(p, q, C)
X
/tmp/ipykernel_9255/1617639716.py:1: UserWarning: Input histogram consists of integer. The transport plan will be casted accordingly, possibly resulting in a loss of precision. If this behaviour is unwanted, please make sure your input histogram consists of floating point elements.
  X = ot.emd(p, q, C)
array([[15, 35, 0, 0, 0],

[10, 0, 60, 30, 0],
[ 0, 80, 0, 0, 70]])
Sure enough, we have the same solution and the same cost
total_cost = np.sum(X * C)
total_cost

7225
17.5.2 A Larger Application
Now let’s try using the same package on a slightly larger application.
The application has the same interpretation as above but we will also give each node (i.e., vertex) a location in the plane.
This will allow us to plot the resulting transport plan as edges in a graph.
The following class defines a node by
• its location (𝑥, 𝑦) ∈ ℝ2 ,
• its group (factory or location, denoted by p or q) and
• its mass (e.g., 𝑝𝑖 or 𝑞𝑗 ).
class Node:

    def __init__(self, x, y, mass, group, name):

        self.x, self.y = x, y
        self.mass, self.group = mass, group
        self.name = name
Next we write a function that repeatedly calls the class above to build instances.
It allocates to the nodes it creates their location, mass, and group.
Locations are assigned randomly.
def build_nodes_of_one_type(group='p', n=100, seed=123):

    nodes = []
    np.random.seed(seed)  # use the seed so that node locations are reproducible

    for i in range(n):

        if group == 'p':
            m = 1/n
            x = np.random.uniform(-2, 2)
            y = np.random.uniform(-2, 2)
        else:
            m = betabinom.pmf(i, n-1, 2, 2)
            x = 0.6 * np.random.uniform(-1.5, 1.5)
            y = 0.6 * np.random.uniform(-1.5, 1.5)

        name = group + str(i)
        nodes.append(Node(x, y, m, group, name))

    return nodes
Now we build two lists of nodes, each one containing one type (factories or locations)
n_p = 32
n_q = 32

p_list = build_nodes_of_one_type(group='p', n=n_p)
q_list = build_nodes_of_one_type(group='q', n=n_q)

p_probs = [p.mass for p in p_list]
q_probs = [q.mass for q in q_list]
For the cost matrix 𝐶, we use the Euclidean distance between each factory and location.
c = np.empty((n_p, n_q))
for i in range(n_p):
    for j in range(n_q):
        x0, y0 = p_list[i].x, p_list[i].y
        x1, y1 = q_list[j].x, q_list[j].y
        c[i, j] = np.sqrt((x0-x1)**2 + (y0-y1)**2)
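The double loop is easy to read, but the same Euclidean cost matrix can be computed in one step with NumPy broadcasting. The sketch below is self-contained and uses randomly drawn coordinates stored in arrays (an assumption made for illustration, since the lecture stores coordinates on Node instances); it confirms that the loop and the broadcast version agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 4, 3
p_xy = rng.uniform(-2, 2, size=(n_p, 2))   # hypothetical factory coordinates
q_xy = rng.uniform(-1, 1, size=(n_q, 2))   # hypothetical location coordinates

# Loop version, mirroring the listing above
c_loop = np.empty((n_p, n_q))
for i in range(n_p):
    for j in range(n_q):
        x0, y0 = p_xy[i]
        x1, y1 = q_xy[j]
        c_loop[i, j] = np.sqrt((x0 - x1)**2 + (y0 - y1)**2)

# Broadcast version: pairwise differences have shape (n_p, n_q, 2)
diff = p_xy[:, None, :] - q_xy[None, :, :]
c_broadcast = np.sqrt((diff**2).sum(axis=-1))

print(np.allclose(c_loop, c_broadcast))
```

With Node instances, the coordinate arrays could be assembled first, for example with `np.array([(p.x, p.y) for p in p_list])`.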
Now we are ready to apply the solver
%time pi = ot.emd(p_probs, q_probs, c)

Wall time: 471 µs
Finally, let’s plot the results using networkx.

In the plot below,
• node size is proportional to probability mass
• an edge (arrow) from 𝑖 to 𝑗 is drawn when a positive transfer is made from 𝑖 to 𝑗 under the optimal transport plan.
g = nx.DiGraph()
g.add_nodes_from([p.name for p in p_list])
g.add_nodes_from([q.name for q in q_list])

for i in range(n_p):
    for j in range(n_q):
        if pi[i, j] > 0:
            g.add_edge(p_list[i].name, q_list[j].name, weight=pi[i, j])

node_pos_dict = {}
for p in p_list:
    node_pos_dict[p.name] = (p.x, p.y)

for q in q_list:
    node_pos_dict[q.name] = (q.x, q.y)

node_color_list = []
node_size_list = []
scale = 8_000
for p in p_list:
    node_color_list.append('blue')
    node_size_list.append(p.mass * scale)
for q in q_list:
    node_color_list.append('red')
    node_size_list.append(q.mass * scale)

# Create the figure and the axes object that the node-drawing call refers to
fig, ax = plt.subplots()
plt.axis('off')

nx.draw_networkx_nodes(g,
                       node_pos_dict,
                       node_color=node_color_list,
                       node_size=node_size_list,
                       edgecolors='grey',
                       linewidths=1,
                       alpha=0.5,
                       ax=ax)

nx.draw_networkx_edges(g,
                       node_pos_dict,
                       arrows=True,
                       connectionstyle='arc3,rad=0.1',
                       alpha=0.6)
plt.show()


CHAPTER
EIGHTEEN
VON NEUMANN GROWTH MODEL (AND A GENERALIZATION)
Contents
• Von Neumann Growth Model (and a Generalization)

– Notation
– Model Ingredients and Assumptions
– Dynamic Interpretation
– Duality
– Interpretation as Two-player Zero-sum Game
This lecture uses the class Neumann to calculate key objects of a linear growth model of John von Neumann [von
Neumann, 1937] that was generalized by Kemeny, Morgenstern and Thompson [Kemeny et al., 1956].
Objects of interest are the maximal expansion rate (𝛼), the interest factor (𝛽), the optimal intensities (𝑥), and prices (𝑝).
In addition to watching how the towering mind of John von Neumann formulated an equilibrium model of price and
quantity vectors in balanced growth, this lecture shows how fruitfully to employ the following important tools:
• a zero-sum two-player game
• linear programming
• the Perron-Frobenius theorem
We’ll begin with some imports:
import numpy as np
from scipy.optimize import fsolve, linprog
from textwrap import dedent
np.set_printoptions(precision=2)
The code below provides the Neumann class
class Neumann(object):

    """
    This class describes the Generalized von Neumann growth model as it was
    discussed in Kemeny et al. (1956, ECTA) and Gale (1960, Chapter 9.5):

    Let:
    n ... number of goods
    m ... number of activities
    A ... input matrix is m-by-n
        a_{i,j} - amount of good j consumed by activity i
    B ... output matrix is m-by-n
        b_{i,j} - amount of good j produced by activity i

    x ... intensity vector (m-vector) with non-negative entries
        x'B - the vector of goods produced
        x'A - the vector of goods consumed
    p ... price vector (n-vector) with non-negative entries
        Bp - the revenue vector for every activity
        Ap - the cost of each activity

    Both A and B have non-negative entries. Moreover, we assume that
    (1) Assumption I (every good which is consumed is also produced):
        for all j, b_{.,j} > 0, i.e. at least one entry is strictly positive
    (2) Assumption II (no free lunch):
        for all i, a_{i,.} > 0, i.e. at least one entry is strictly positive

    Parameters
    ----------
    A : array_like or scalar(float)
        Input matrix. It should be `m x n`
    B : array_like or scalar(float)
        Output matrix. It should be `m x n`
    """

    def __init__(self, A, B):

        self.A, self.B = list(map(self.convert, (A, B)))
        self.m, self.n = self.A.shape

        # Check if (A, B) satisfy the basic assumptions
        assert self.A.shape == self.B.shape, 'The input and output matrices \
              must have the same dimensions!'
        assert (self.A >= 0).all() and (self.B >= 0).all(), 'The input and \
              output matrices must have only non-negative entries!'

        # (1) Check whether Assumption I is satisfied:
        if (np.sum(self.B, 0) <= 0).any():
            self.AI = False
        else:
            self.AI = True

        # (2) Check whether Assumption II is satisfied:
        if (np.sum(self.A, 1) <= 0).any():
            self.AII = False
        else:
            self.AII = True

    def __repr__(self):
        return self.__str__()

    def __str__(self):

        me = """
        Generalized von Neumann expanding model:
          - number of goods          : {n}
          - number of activities     : {m}

        Assumptions:
          - AI:  every column of B has a positive entry    : {AI}
          - AII: every row of A has a positive entry       : {AII}

        """
        # Irreducible                                       : {irr}
        return dedent(me.format(n=self.n, m=self.m,
                                AI=self.AI, AII=self.AII))

    def convert(self, x):
        """
        Convert array_like objects (lists of lists, floats, etc.) into
        well-formed 2D NumPy arrays
        """
        return np.atleast_2d(np.asarray(x))

    def bounds(self):
        """
        Calculate the trivial upper and lower bounds for alpha (expansion rate)
        and beta (interest factor). See the proof of Theorem 9.8 in Gale (1960)
        """

        n, m = self.n, self.m
        A, B = self.A, self.B

        f = lambda α: ((B - α * A) @ np.ones((n, 1))).max()
        g = lambda β: (np.ones((1, m)) @ (B - β * A)).min()

        UB = fsolve(f, 1).item()  # Upper bound for α, β
        LB = fsolve(g, 2).item()  # Lower bound for α, β

        return LB, UB

    def zerosum(self, γ, dual=False):
        """
        Given gamma, calculate the value and optimal strategies of a
        two-player zero-sum game given by the matrix

                M(gamma) = B - gamma * A

        Row player maximizing, column player minimizing

        Zero-sum game as an LP (primal --> α)

            max (0', 1) @ (x', v)
            subject to
            [-M', ones(n, 1)] @ (x', v)' <= 0
            (x', v) @ (ones(m, 1), 0) = 1
            (x', v) >= (0', -inf)

        Zero-sum game as an LP (dual --> beta)

            min (0', 1) @ (p', u)
            subject to
            [M, -ones(m, 1)] @ (p', u)' <= 0
            (p', u) @ (ones(n, 1), 0) = 1
            (p', u) >= (0', -inf)

        Outputs:
        --------
        value: scalar
            value of the zero-sum game

        strategy: vector
            if dual = False, it is the intensity vector,
            if dual = True, it is the price vector
        """

        A, B, n, m = self.A, self.B, self.n, self.m
        M = B - γ * A

        if not dual:
            # Solve the primal LP (for details see the description)
            # (1) Define the problem for v as a maximization (linprog minimizes)
            c = np.hstack([np.zeros(m), -1])

            # (2) Add constraints :
            # ... non-negativity constraints
            bounds = tuple(m * [(0, None)] + [(None, None)])
            # ... inequality constraints
            A_iq = np.hstack([-M.T, np.ones((n, 1))])
            b_iq = np.zeros((n, 1))
            # ... normalization
            A_eq = np.hstack([np.ones(m), 0]).reshape(1, m + 1)
            b_eq = 1

            res = linprog(c, A_ub=A_iq, b_ub=b_iq, A_eq=A_eq, b_eq=b_eq,
                          bounds=bounds)

        else:
            # Solve the dual LP (for details see the description)
            # (1) Define the problem for u as a minimization
            c = np.hstack([np.zeros(n), 1])

            # (2) Add constraints :
            # ... non-negativity constraints
            bounds = tuple(n * [(0, None)] + [(None, None)])
            # ... inequality constraints
            A_iq = np.hstack([M, -np.ones((m, 1))])
            b_iq = np.zeros((m, 1))
            # ... normalization
            A_eq = np.hstack([np.ones(n), 0]).reshape(1, n + 1)
            b_eq = 1

            res = linprog(c, A_ub=A_iq, b_ub=b_iq, A_eq=A_eq, b_eq=b_eq,
                          bounds=bounds)

        if res.status != 0:
            print(res.message)

        # Pull out the required quantities
        value = res.x[-1]
        strategy = res.x[:-1]

        return value, strategy

    def expansion(self, tol=1e-8, maxit=1000):
        """
        The algorithm used here is described in Hamburger-Thompson-Weil
        (1967, ECTA). It is based on a simple bisection argument and utilizes
        the idea that for a given γ (= α or β), the matrix "M = B - γ * A"
        defines a two-player zero-sum game, where the optimal strategies are
        the (normalized) intensity and price vector.

        Outputs:
        --------
        alpha: scalar
            optimal expansion rate
        """

        LB, UB = self.bounds()

        for iter in range(maxit):

            γ = (LB + UB) / 2
            ZS = self.zerosum(γ=γ)
            V = ZS[0]     # value of the game with γ

            if V >= 0:
                LB = γ
            else:
                UB = γ

            if abs(UB - LB) < tol:
                γ = (UB + LB) / 2
                x = self.zerosum(γ=γ)[1]
                p = self.zerosum(γ=γ, dual=True)[1]
                break

        return γ, x, p

    def interest(self, tol=1e-8, maxit=1000):
        """
        The algorithm used here is described in Hamburger-Thompson-Weil
        (1967, ECTA). It is based on a simple bisection argument and utilizes
        the idea that for a given gamma (= alpha or beta),
        the matrix "M = B - γ * A" defines a two-player zero-sum game,
        where the optimal strategies are the (normalized) intensity and price
        vector

        Outputs:
        --------
        beta: scalar
            optimal interest rate
        """

        LB, UB = self.bounds()

        for iter in range(maxit):
            γ = (LB + UB) / 2
            ZS = self.zerosum(γ=γ, dual=True)
            V = ZS[0]

            if V > 0:
                LB = γ
            else:
                UB = γ

            if abs(UB - LB) < tol:
                γ = (UB + LB) / 2
                p = self.zerosum(γ=γ, dual=True)[1]
                x = self.zerosum(γ=γ)[1]
                break

        return γ, x, p
18.1 Notation
We use the following notation.

0 denotes a vector of zeros.
We call an 𝑛-vector positive and write 𝑥 ≫ 0 if 𝑥𝑖 > 0 for all 𝑖 = 1, 2, … , 𝑛.
We call a vector non-negative and write 𝑥 ≥ 0 if 𝑥𝑖 ≥ 0 for all 𝑖 = 1, 2, … , 𝑛.
We call a vector semi-positive and write 𝑥 > 0 if 𝑥 ≥ 0 and 𝑥 ≠ 0.
For two conformable vectors 𝑥 and 𝑦, 𝑥 ≫ 𝑦, 𝑥 ≥ 𝑦 and 𝑥 > 𝑦 mean 𝑥 − 𝑦 ≫ 0, 𝑥 − 𝑦 ≥ 0, and 𝑥 − 𝑦 > 0, respectively.
We let all vectors in this lecture be column vectors; 𝑥𝑇 denotes the transpose of 𝑥 (i.e., a row vector).
Let 𝜄𝑛 denote a column vector composed of 𝑛 ones, i.e. 𝜄𝑛 = (1, 1, … , 1)𝑇 .
Let 𝑒𝑖 denote a vector (of arbitrary size) containing zeros except for the 𝑖 th position where it is one.
We denote matrices by capital letters. For an arbitrary matrix 𝐴, 𝑎𝑖,𝑗 represents the entry in its 𝑖 th row and 𝑗 th column.
𝑎⋅𝑗 and 𝑎𝑖⋅ denote the 𝑗 th column and 𝑖 th row of 𝐴, respectively.

18.2 Model Ingredients and Assumptions
A pair (𝐴, 𝐵) of 𝑚 × 𝑛 non-negative matrices defines an economy.

• 𝑚 is the number of activities (or sectors)
• 𝑛 is the number of goods (produced and/or consumed).
• 𝐴 is called the input matrix; 𝑎𝑖,𝑗 denotes the amount of good 𝑗 consumed by activity 𝑖
• 𝐵 is called the output matrix; 𝑏𝑖,𝑗 represents the amount of good 𝑗 produced by activity 𝑖
Two key assumptions restrict economy (𝐴, 𝐵):
• Assumption I: (every good that is consumed is also produced)
𝑏.,𝑗 > 0 ∀𝑗 = 1, 2, … , 𝑛
• Assumption II: (no free lunch)
𝑎𝑖,. > 0 ∀𝑖 = 1, 2, … , 𝑚
A semi-positive intensity 𝑚-vector 𝑥 denotes levels at which activities are operated.

Therefore,
• vector 𝑥𝑇 𝐴 gives the total amount of goods used in production
• vector 𝑥𝑇 𝐵 gives total outputs
An economy (𝐴, 𝐵) is said to be productive, if there exists a non-negative intensity vector 𝑥 ≥ 0 such that 𝑥𝑇 𝐵 > 𝑥𝑇 𝐴.
The semi-positive 𝑛-vector 𝑝 contains prices assigned to the 𝑛 goods.
The 𝑝 vector implies cost and revenue vectors
• the vector 𝐴𝑝 tells costs of the vector of activities
• the vector 𝐵𝑝 tells revenues from the vector of activities
Satisfaction of a property of an input-output pair (𝐴, 𝐵) called irreducibility (or indecomposability) determines whether an economy can be decomposed into multiple “sub-economies”.
Definition: For an economy (𝐴, 𝐵), the set of goods 𝑆 ⊂ {1, 2, … , 𝑛} is called an independent subset if it is possible
to produce every good in 𝑆 without consuming goods from outside 𝑆. Formally, the set 𝑆 is independent if ∃𝑇 ⊂
{1, 2, … , 𝑚} (a subset of activities) such that 𝑎𝑖,𝑗 = 0 ∀𝑖 ∈ 𝑇 and 𝑗 ∈ 𝑆 𝑐 and for all 𝑗 ∈ 𝑆, ∃𝑖 ∈ 𝑇 for which 𝑏𝑖,𝑗 > 0.
The economy is irreducible if there are no proper independent subsets.
We study two examples, both in Chapter 9.6 of Gale [Gale, 1989]
# (1) Irreducible (A, B) example: α_0 = β_0
A1 = np.array([[0, 1, 0, 0],
               [1, 0, 0, 1],
               [0, 0, 1, 0]])

B1 = np.array([[1, 0, 0, 0],
               [0, 0, 2, 0],
               [0, 1, 0, 1]])

# (2) Reducible (A, B) example: β_0 < α_0
A2 = np.array([[0, 1, 0, 0, 0, 0],
               [1, 0, 1, 0, 0, 0],
               [0, 0, 0, 1, 0, 0],
               [0, 0, 1, 0, 0, 1],
               [0, 0, 0, 0, 1, 0]])

B2 = np.array([[1, 0, 0, 1, 0, 0],
               [0, 1, 0, 0, 0, 0],
               [0, 0, 1, 0, 0, 0],
               [0, 0, 0, 0, 2, 0],
               [0, 0, 0, 1, 0, 1]])
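The definition of an independent subset can be checked by brute force for small economies such as these two. The sketch below is not part of the Neumann class; `proper_independent_subsets` is a hypothetical helper that enumerates every nonempty proper subset 𝑆 of goods (indexed from 0), takes 𝑇 to be the largest admissible set of activities (those consuming nothing outside 𝑆) and tests whether every good in 𝑆 is produced by some activity in 𝑇. The enumeration is exponential in the number of goods, so it is only meant for tiny examples.

```python
from itertools import combinations
import numpy as np

def proper_independent_subsets(A, B):
    """Return all nonempty proper independent subsets of goods (0-indexed)."""
    A, B = np.asarray(A), np.asarray(B)
    m, n = A.shape
    found = []
    for size in range(1, n):                      # nonempty, proper subsets
        for S in combinations(range(n), size):
            outside = [j for j in range(n) if j not in S]
            # Largest candidate T: activities that consume nothing outside S
            T = np.flatnonzero(A[:, outside].sum(axis=1) == 0)
            # S is independent if each good in S is produced by some i in T
            if all(B[T, j].sum() > 0 for j in S):
                found.append(S)
    return found

A1 = np.array([[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]])
B1 = np.array([[1, 0, 0, 0], [0, 0, 2, 0], [0, 1, 0, 1]])
A2 = np.array([[0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0],
               [0, 0, 1, 0, 0, 1], [0, 0, 0, 0, 1, 0]])
B2 = np.array([[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0],
               [0, 0, 0, 0, 2, 0], [0, 0, 0, 1, 0, 1]])

subsets_1 = proper_independent_subsets(A1, B1)
subsets_2 = proper_independent_subsets(A2, B2)
print(subsets_1)
print(subsets_2)
```

An empty list for the first economy and a nonempty list for the second is consistent with the labels "irreducible" and "reducible" in the comments of the listing above.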
The following code sets up our first Neumann economy or Neumann instance
n1 = Neumann(A1, B1)
n1

- number of goods : 4
- number of activities : 3
Assumptions:
- AI: every column of B has a positive entry : True
- AII: every row of A has a positive entry : True
Here is a second instance of a Neumann economy
n2 = Neumann(A2, B2)
n2

- number of goods : 6
- number of activities : 5
Assumptions:
- AI: every column of B has a positive entry : True
- AII: every row of A has a positive entry : True
18.3 Dynamic Interpretation
Attach a time index 𝑡 to the preceding objects, regard an economy as a dynamic system, and study sequences
{(𝐴𝑡 , 𝐵𝑡 )}𝑡≥0 , {𝑥𝑡 }𝑡≥0 , {𝑝𝑡 }𝑡≥0
An interesting special case holds the technology process constant and investigates the dynamics of quantities and prices
only.
Accordingly, in the rest of this lecture, we assume that (𝐴𝑡 , 𝐵𝑡 ) = (𝐴, 𝐵) for all 𝑡 ≥ 0.
A crucial element of the dynamic interpretation involves the timing of production.
We assume that production (consumption of inputs) takes place in period 𝑡, while the consequent output materializes in period 𝑡 + 1, i.e., consumption of $x_t^T A$ in period 𝑡 results in $x_t^T B$ amounts of output in period 𝑡 + 1.
These timing conventions imply the following feasibility condition:
$$
x_t^T B \geq x_{t+1}^T A \quad \forall t \geq 1
$$
which asserts that no more goods can be used today than were produced yesterday.
Accordingly, 𝐴𝑝𝑡 tells the costs of production in period 𝑡 and 𝐵𝑝𝑡 tells revenues in period 𝑡 + 1.
18.3.1 Balanced Growth
We follow John von Neumann in studying “balanced growth”.

Let ./ denote an elementwise division of one vector by another and let 𝛼 > 0 be a scalar.
Then balanced growth is a situation in which
$$
x_{t+1} ./ x_t = \alpha, \quad \forall t \geq 0
$$
With balanced growth, the law of motion of 𝑥 is evidently $x_{t+1} = \alpha x_t$ and so we can rewrite the feasibility constraint as
$$
x_t^T B \geq \alpha x_t^T A \quad \forall t
$$
In the same spirit, define 𝛽 ∈ ℝ as the interest factor per unit of time.
We assume that it is always possible to earn a gross return equal to the constant interest factor 𝛽 by investing “outside the
model”.
Under this assumption about outside investment opportunities, a no-arbitrage condition gives rise to the following (no
profit) restriction on the price sequence:
𝛽𝐴𝑝𝑡 ≥ 𝐵𝑝𝑡 ∀𝑡
This says that production cannot yield a return greater than that offered by the outside investment opportunity (here we
compare values in period 𝑡 + 1).
The balanced growth assumption allows us to drop time subscripts and conduct an analysis purely in terms of a time-
invariant growth rate 𝛼 and interest factor 𝛽.
18.4 Duality
Two problems are connected by a remarkable dual relationship between technological and valuation characteristics of the
economy:
Definition: The technological expansion problem (TEP) for the economy (𝐴, 𝐵) is to find a semi-positive 𝑚-vector 𝑥 > 0
and a number 𝛼 ∈ ℝ that satisfy
$$
\begin{aligned}
\max_{\alpha} \ & \alpha \\
\text{s.t. } \ & x^T B \geq \alpha x^T A
\end{aligned}
$$
Theorem 9.3 of David Gale’s book [Gale, 1989] asserts that if Assumptions I and II are both satisfied, then a maximum
value of 𝛼 exists and that it is positive.
The maximal value is called the technological expansion rate and is denoted by 𝛼0 . The associated intensity vector 𝑥0 is
the optimal intensity vector.
Definition: The economic expansion problem (EEP) for (𝐴, 𝐵) is to find a semi-positive 𝑛-vector 𝑝 > 0 and a number
𝛽 ∈ ℝ that satisfy
$$
\begin{aligned}
\min_{\beta} \ & \beta \\
\text{s.t. } \ & B p \leq \beta A p
\end{aligned}
$$
Assumptions I and II imply existence of a minimum value 𝛽0 > 0 called the economic expansion rate.
The corresponding price vector 𝑝0 is the optimal price vector.
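For a fixed intensity vector 𝑥, the largest 𝛼 compatible with the TEP constraint is a componentwise minimum of output-to-input ratios, and for a fixed price vector 𝑝 the smallest 𝛽 compatible with the EEP constraint is a componentwise maximum of revenue-to-cost ratios. The sketch below illustrates this with the first example economy and uniform vectors; `best_alpha` and `best_beta` are hypothetical helper names, and the sketch assumes that at least one good is used and at least one activity has positive cost.

```python
import numpy as np

def best_alpha(x, A, B):
    """Largest α with x'B >= α x'A for a given semi-positive x."""
    used, made = x @ A, x @ B
    mask = used > 0                 # goods not used impose no restriction
    return np.min(made[mask] / used[mask])

def best_beta(p, A, B):
    """Smallest β with Bp <= β Ap for a given semi-positive p."""
    cost, revenue = A @ p, B @ p
    if np.any((cost == 0) & (revenue > 0)):
        return np.inf               # no finite β satisfies the constraint
    mask = cost > 0
    return np.max(revenue[mask] / cost[mask])

# First example economy of this lecture
A1 = np.array([[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]])
B1 = np.array([[1, 0, 0, 0], [0, 0, 2, 0], [0, 1, 0, 1]])

alpha_uniform = best_alpha(np.ones(3), A1, B1)   # a lower bound on α_0
beta_uniform = best_beta(np.ones(4), A1, B1)     # an upper bound on β_0
print(alpha_uniform, beta_uniform)
```

Any such pair only brackets the optimal values: maximizing `best_alpha` over 𝑥 gives 𝛼0 and minimizing `best_beta` over 𝑝 gives 𝛽0, which is what the bisection routines in the Neumann class compute through zero-sum games.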
Because the criterion functions in the technological expansion problem and the economic expansion problem are both
linearly homogeneous, the optimal 𝑥0 and 𝑝0 are defined only up to a positive scale factor.
For convenience (and to emphasize a close connection to zero-sum games), we normalize both vectors 𝑥0 and 𝑝0 to have
unit length.
A standard duality argument (see Lemma 9.4 in Gale (1960) [Gale, 1989]) implies that under Assumptions I and II,
$\beta_0 \leq \alpha_0$.
But to deduce that 𝛽0 ≥ 𝛼0 , Assumptions I and II are not sufficient.
Therefore, von Neumann [von Neumann, 1937] went on to prove the following remarkable “duality” result that connects
TEP and EEP.
Theorem 1 (von Neumann): If the economy (𝐴, 𝐵) satisfies Assumptions I and II, then there exist $(\gamma^*, x_0, p_0)$, where
$\gamma^* \in [\beta_0, \alpha_0] \subset \mathbb{R}$, $x_0 > 0$ is an 𝑚-vector, $p_0 > 0$ is an 𝑛-vector, and the following are true
$$x_0^T B \geq \gamma^* x_0^T A$$
$$B p_0 \leq \gamma^* A p_0$$
$$x_0^T (B - \gamma^* A) p_0 = 0$$
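Note: Proof (Sketch): Assumptions I and II imply that there exist $(\alpha_0, x_0)$ and $(\beta_0, p_0)$ that solve the TEP and EEP,
respectively. If $\gamma^* > \alpha_0$, then by definition of $\alpha_0$, there cannot exist a semi-positive $x$ that satisfies $x^T B \geq \gamma^* x^T A$.
Similarly, if $\gamma^* < \beta_0$, there is no semi-positive $p$ for which $B p \leq \gamma^* A p$. Let $\gamma^* \in [\beta_0, \alpha_0]$; then
$x_0^T B \geq \alpha_0 x_0^T A \geq \gamma^* x_0^T A$. Moreover, $B p_0 \leq \beta_0 A p_0 \leq \gamma^* A p_0$. Post-multiplying the first inequality by $p_0$ gives
$x_0^T (B - \gamma^* A) p_0 \geq 0$, and pre-multiplying the second by $x_0^T$ gives $x_0^T (B - \gamma^* A) p_0 \leq 0$, so together they imply $x_0^T (B - \gamma^* A) p_0 = 0$.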
Here the constant 𝛾 ∗ is both an expansion factor and an interest factor (not necessarily optimal).
We have already encountered and discussed the first two inequalities that represent feasibility and no-profit conditions.
Moreover, the equality $x_0^T (B - \gamma^* A) p_0 = 0$ concisely expresses the requirements that if any good grows at a rate larger
than $\gamma^*$ (i.e., if it is oversupplied), then its price must be zero; and that if any activity provides negative profit, it must be
unused.
Therefore, the conditions stated in Theorem I encode all equilibrium conditions.
So Theorem I essentially states that under Assumptions I and II there always exists an equilibrium (𝛾 ∗ , 𝑥0 , 𝑝0 ) with
balanced growth.
Note that Theorem I is silent about uniqueness of the equilibrium. In fact, it does not rule out (trivial) cases with
$x_0^T B p_0 = 0$ so that nothing of value is produced.
To exclude such uninteresting cases, Kemeny, Morgenstern and Thompson [Kemeny et al., 1956] add an extra requirement
$$x_0^T B p_0 > 0$$
and call the associated equilibria economic solutions.

They show that this extra condition does not affect the existence result, while it significantly reduces the number of
(relevant) solutions.

18.5 Interpretation as Two-player Zero-sum Game
To compute the equilibrium $(\gamma^*, x_0, p_0)$, we follow the algorithm proposed by Hamburger, Thompson and Weil (1967),
building on the key insight that an equilibrium (with balanced growth) solves a particular two-player zero-sum
game. First, we introduce some notation.
Consider the 𝑚 × 𝑛 matrix 𝐶 as a payoff matrix, with the entries representing payoffs from the minimizing column
player to the maximizing row player and assume that the players can use mixed strategies. Thus,
• the row player chooses the 𝑚-vector $x > 0$ subject to $\iota_m^T x = 1$
• the column player chooses the 𝑛-vector $p > 0$ subject to $\iota_n^T p = 1$.
Definition: The 𝑚 × 𝑛 matrix game 𝐶 has the solution $(x^*, p^*, V(C))$ in mixed strategies if
$$(x^*)^T C e^j \geq V(C) \;\; \forall j \in \{1, \dots, n\} \quad \text{and} \quad (e^i)^T C p^* \leq V(C) \;\; \forall i \in \{1, \dots, m\}$$
The number 𝑉 (𝐶) is called the value of the game.

From the above definition, it is clear that the value 𝑉 (𝐶) has two alternative interpretations:
• by playing the appropriate mixed strategy, the maximizing player can assure himself at least 𝑉 (𝐶) (no matter what
the column player chooses)
• by playing the appropriate mixed strategy, the minimizing player can make sure that the maximizing player will not
get more than 𝑉 (𝐶) (irrespective of the maximizing player’s choice)
A famous theorem of Nash (1951) tells us that there always exists a mixed strategy Nash equilibrium for any finite two-
player zero-sum game.
Moreover, von Neumann’s Minmax Theorem [von Neumann, 1928] implies that
$$V(C) = \max_x \min_p \; x^T C p = \min_p \max_x \; x^T C p = (x^*)^T C p^*$$
18.5.1 Connection with Linear Programming (LP)
Nash equilibria of a finite two-player zero-sum game solve a linear programming problem.
To see this, we introduce the following notation
• For a fixed $x$, let $v$ be the value of the minimization problem: $v \equiv \min_p x^T C p = \min_j x^T C e^j$
• For a fixed $p$, let $u$ be the value of the maximization problem: $u \equiv \max_x x^T C p = \max_i (e^i)^T C p$
Then the max-min problem (the game from the maximizing player’s point of view) can be written as the primal LP
$$\begin{aligned} V(C) = \max \; & v \\ \text{s.t.} \; & v \iota_n^T \leq x^T C \\ & x \geq 0 \\ & \iota_m^T x = 1 \end{aligned}$$
while the min-max problem (the game from the minimizing player’s point of view) is the dual LP
$$\begin{aligned} V(C) = \min \; & u \\ \text{s.t.} \; & u \iota_m \geq C p \\ & p \geq 0 \\ & \iota_n^T p = 1 \end{aligned}$$
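The primal LP above can be handed to a generic LP solver. The following is a minimal sketch, assuming SciPy's `scipy.optimize.linprog` is available; the function name `solve_matrix_game` and the matching-pennies payoff matrix are illustrative choices and are not the lecture's own `zerosum` method.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(C):
    """Value and optimal row strategy of the matrix game C (row player maximizes)."""
    C = np.asarray(C, dtype=float)
    m, n = C.shape
    # Decision variables are (x_1, ..., x_m, v); linprog minimizes, so minimize -v
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # v - (x^T C)_j <= 0 for every column j
    A_ub = np.hstack([-C.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The mixed strategy x sums to one
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]   # x >= 0, v unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Matching pennies (hypothetical example): the value is 0 with uniform mixing
value, x_star = solve_matrix_game([[1, -1], [-1, 1]])
```

The dual LP can be solved the same way with the roles of rows and columns exchanged, which yields the minimizing player's strategy.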
Hamburger, Thompson and Weil [Hamburger et al., 1967] view the input-output pair of the economy as payoff matrices
of two-player zero-sum games.
Using this interpretation, they restate Assumptions I and II as follows
𝑉 (−𝐴) < 0 and 𝑉 (𝐵) > 0
Note: Proof (Sketch):

• ⇒ 𝑉 (𝐵) > 0 implies 𝑥𝑇0 𝐵 ≫ 0, where 𝑥0 is a maximizing vector. Since 𝐵 is non-negative, this requires that
each column of 𝐵 has at least one positive entry, which is Assumption I.
• ⇐ From Assumption I and the fact that 𝑝 > 0, it follows that 𝐵𝑝 > 0. This implies that the maximizing player
can always choose 𝑥 so that 𝑥𝑇 𝐵𝑝 > 0 so that it must be the case that 𝑉 (𝐵) > 0.
In order to (re)state Theorem I in terms of a particular two-player zero-sum game, we define a matrix for 𝛾 ∈ ℝ
𝑀 (𝛾) ≡ 𝐵 − 𝛾𝐴
For fixed 𝛾, treating 𝑀 (𝛾) as a matrix game, calculating the solution of the game implies
• If $\gamma > \alpha_0$, then for all $x > 0$ there exists $j \in \{1, \dots, n\}$ such that $[x^T M(\gamma)]_j < 0$, implying that $V(M(\gamma)) < 0$.
• If $\gamma < \beta_0$, then for all $p > 0$ there exists $i \in \{1, \dots, m\}$ such that $[M(\gamma) p]_i > 0$, implying that $V(M(\gamma)) > 0$.
• If $\gamma \in \{\beta_0, \alpha_0\}$, then (by Theorem I) the optimal intensity and price vectors $x_0$ and $p_0$ satisfy
$$x_0^T M(\gamma) \geq 0^T \quad \text{and} \quad M(\gamma) p_0 \leq 0$$
That is, $(x_0, p_0, 0)$ is a solution of the game $M(\gamma)$ so that $V(M(\beta_0)) = V(M(\alpha_0)) = 0$.
• If $\beta_0 < \alpha_0$ and $\gamma \in (\beta_0, \alpha_0)$, then $V(M(\gamma)) = 0$.
Moreover, if $x'$ is optimal for the maximizing player in $M(\gamma')$ for $\gamma' \in (\beta_0, \alpha_0)$ and $p''$ is optimal for the minimizing
player in $M(\gamma'')$ where $\gamma'' \in (\beta_0, \gamma')$, then $(x', p'', 0)$ is a solution for $M(\gamma)$ $\forall \gamma \in (\gamma'', \gamma')$.
Proof (Sketch): If $x'$ is optimal for a maximizing player in game $M(\gamma')$, then $(x')^T M(\gamma') \geq 0^T$ and so for all $\gamma < \gamma'$,
$$(x')^T M(\gamma) = (x')^T M(\gamma') + (x')^T (\gamma' - \gamma) A \geq 0^T$$
hence $V(M(\gamma)) \geq 0$. If $p''$ is optimal for a minimizing player in game $M(\gamma'')$, then $M(\gamma'') p'' \leq 0$ and so for all $\gamma'' < \gamma$,
$$M(\gamma) p'' = M(\gamma'') p'' + (\gamma'' - \gamma) A p'' \leq 0$$
hence $V(M(\gamma)) \leq 0$.
It is clear from the above argument that 𝛽0 , 𝛼0 are the minimal and maximal 𝛾 for which 𝑉 (𝑀 (𝛾)) = 0.
Furthermore, Hamburger et al. [Hamburger et al., 1967] show that the function 𝛾 ↦ 𝑉 (𝑀 (𝛾)) is continuous and
nonincreasing in 𝛾.
This suggests an algorithm to compute (𝛼0 , 𝑥0 ) and (𝛽0 , 𝑝0 ) for a given input-output pair (𝐴, 𝐵).

18.5.2 Algorithm
Hamburger, Thompson and Weil [Hamburger et al., 1967] propose a simple bisection algorithm to find the minimal and
maximal roots (i.e. 𝛽0 and 𝛼0 ) of the function 𝛾 ↦ 𝑉 (𝑀 (𝛾)).
Step 1
First, notice that we can easily find trivial upper and lower bounds for 𝛼0 and 𝛽0 .
• TEP requires that $x^T (B - \alpha A) \geq 0^T$ and $x > 0$, so if $\alpha$ is so large that $\max_i \{[(B - \alpha A) \iota_n]_i\} < 0$, then TEP
ceases to have a solution.
Accordingly, let UB be the $\alpha^*$ that solves $\max_i \{[(B - \alpha^* A) \iota_n]_i\} = 0$.
• Similar to the upper bound, if $\beta$ is so low that $\min_j \{[\iota_m^T (B - \beta A)]_j\} > 0$, then the EEP has no solution and so
we can define LB as the $\beta^*$ that solves $\min_j \{[\iota_m^T (B - \beta^* A)]_j\} = 0$.
The bounds method calculates these trivial bounds for us
n1.bounds()
(1.0, 2.0)
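To make the two trivial bounds concrete, here is a small sketch that computes them directly with NumPy. The matrices `A` and `B` are hypothetical (they are not Example 1), and the sketch assumes that every row sum and every column sum of `A` is strictly positive, so that each bound reduces to a ratio of sums.

```python
import numpy as np

# Hypothetical input-output pair (m = 2 activities, n = 2 goods)
A = np.array([[1.0, 1.0],
              [1.0, 3.0]])
B = np.array([[2.0, 2.0],
              [1.0, 3.0]])

# UB solves max_i [(B - alpha A) iota_n]_i = 0: the largest ratio of row sums
UB = np.max(B.sum(axis=1) / A.sum(axis=1))

# LB solves min_j [iota_m^T (B - beta A)]_j = 0: the smallest ratio of column sums
LB = np.min(B.sum(axis=0) / A.sum(axis=0))
```

When some column of `A` sums to zero the ratio is undefined and that column has to be handled separately, which this sketch does not attempt.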
Step 2
Compute 𝛼0 and 𝛽0
• Finding 𝛼0
1. Fix $\gamma = \frac{UB + LB}{2}$ and compute the solution of the two-player zero-sum game associated with $M(\gamma)$. We can
use either the primal or the dual LP problem.
2. If 𝑉 (𝑀 (𝛾)) ≥ 0, then set 𝐿𝐵 = 𝛾, otherwise let 𝑈 𝐵 = 𝛾.
3. Iterate on 1. and 2. until |𝑈 𝐵 − 𝐿𝐵| < 𝜖.
• Finding 𝛽0
1. Fix $\gamma = \frac{UB + LB}{2}$ and compute the solution of the two-player zero-sum game associated with $M(\gamma)$. We can
use either the primal or the dual LP problem.
2. If 𝑉 (𝑀 (𝛾)) > 0, then set 𝐿𝐵 = 𝛾, otherwise let 𝑈 𝐵 = 𝛾.
3. Iterate on 1. and 2. until |𝑈 𝐵 − 𝐿𝐵| < 𝜖.
• Existence: Since 𝑉 (𝑀 (𝐿𝐵)) > 0 and 𝑉 (𝑀 (𝑈 𝐵)) < 0 and 𝑉 (𝑀 (⋅)) is a continuous, nonincreasing function,
there is at least one 𝛾 ∈ [𝐿𝐵, 𝑈 𝐵], s.t. 𝑉 (𝑀 (𝛾)) = 0.
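The two bisection variants differ only in how they treat a midpoint at which the value is exactly zero. The sketch below illustrates this on a toy function; `bisect_root` and `V_toy` are hypothetical names, and `V_toy` is merely a stand-in for the map from 𝛾 to 𝑉(𝑀(𝛾)) that is continuous, nonincreasing, and zero on a whole interval.

```python
def bisect_root(V, lb, ub, find_max=True, tol=1e-10):
    """Largest (find_max=True) or smallest root of a continuous,
    nonincreasing function V with V(lb) > 0 > V(ub)."""
    while ub - lb > tol:
        gamma = (lb + ub) / 2
        v = V(gamma)
        # Largest root: raise LB while V >= 0. Smallest root: raise LB only while V > 0.
        raise_lb = (v >= 0) if find_max else (v > 0)
        if raise_lb:
            lb = gamma
        else:
            ub = gamma
    return (lb + ub) / 2

# Toy value function (an assumption for illustration): zero on [1.0, 1.26]
def V_toy(gamma):
    return max(1.0 - gamma, 0.0) + min(1.26 - gamma, 0.0)

alpha_0 = bisect_root(V_toy, 0.5, 2.0, find_max=True)    # approaches 1.26
beta_0 = bisect_root(V_toy, 0.5, 2.0, find_max=False)    # approaches 1.0
```

With a weak inequality the search keeps moving right across the flat region and ends at its upper end, while the strict inequality stops at its lower end.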
The zerosum method calculates the value and optimal strategies associated with a given 𝛾.
γ = 2
print(f'Value of the game with γ = {γ}')

print(n1.zerosum(γ=γ)[0])
print('Intensity vector (from the primal)')
print(n1.zerosum(γ=γ)[1])
print('Price vector (from the dual)')
print(n1.zerosum(γ=γ, dual=True)[1])

Value of the game with γ = 2

-0.24
Intensity vector (from the primal)
[0.32 0.28 0.4 ]
Price vector (from the dual)
[0.4 0.32 0.28 0. ]
numb_grid = 100
γ_grid = np.linspace(0.4, 2.1, numb_grid)

# Value of the game M(γ) on a grid of γ values, for both examples
value_ex1_grid = np.asarray([n1.zerosum(γ=γ_grid[i])[0]
                             for i in range(numb_grid)])
value_ex2_grid = np.asarray([n2.zerosum(γ=γ_grid[i])[0]
                             for i in range(numb_grid)])

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
fig.suptitle(r'The function $V(M(\gamma))$', fontsize=16)

for ax, grid, N, i in zip(axes, (value_ex1_grid, value_ex2_grid),
                          (n1, n2), (1, 2)):
    ax.plot(γ_grid, grid)
    ax.set(title=f'Example {i}', xlabel=r'$\gamma$')
    ax.axhline(0, c='k', lw=1)
    ax.axvline(N.bounds()[0], c='r', ls='--', label='lower bound')
    ax.axvline(N.bounds()[1], c='g', ls='--', label='upper bound')

plt.show()
The expansion method implements the bisection algorithm for 𝛼0 (and uses the primal LP problem for 𝑥0 )
α_0, x, p = n1.expansion()
print(f'α_0 = {α_0}')
print(f'x_0 = {x}')
print(f'The corresponding p from the dual = {p}')
α_0 = 1.2599210478365421
x_0 = [0.33 0.26 0.41]
The corresponding p from the dual = [0.41 0.33 0.26 0. ]
The interest method implements the bisection algorithm for 𝛽0 (and uses the dual LP problem for 𝑝0 )
β_0, x, p = n1.interest()
print(f'β_0 = {β_0}')
print(f'p_0 = {p}')
print(f'The corresponding x from the primal = {x}')
β_0 = 1.2599210478365421
p_0 = [0.41 0.33 0.26 0. ]
The corresponding x from the primal = [0.33 0.26 0.41]
Of course, when 𝛾 ∗ is unique, it is irrelevant which one of the two methods we use – both work.
In particular, as will be shown below, in the case of an irreducible (𝐴, 𝐵) (as in Example 1), the maximal and minimal
roots of 𝑉 (𝑀 (𝛾)) necessarily coincide, implying a “full duality” result, i.e. $\alpha_0 = \beta_0 = \gamma^*$, so that the expansion (and
interest) rate $\gamma^*$ is unique.
18.5.3 Uniqueness and Irreducibility
As an illustration, compute first the maximal and minimal roots of 𝑉 (𝑀 (⋅)) for our Example 2 that has a reducible
input-output pair (𝐴, 𝐵)
α_0, x, p = n2.expansion()
print(f'α_0 = {α_0}')
print(f'x_0 = {x}')
print(f'The corresponding p from the dual = {p}')
α_0 = 1.259921052493155
x_0 = [5.27e-10 0.00e+00 3.27e-01 2.60e-01 4.13e-01]
The corresponding p from the dual = [0. 0.21 0.33 0.26 0.21 0. ]
β_0, x, p = n2.interest()
print(f'β_0 = {β_0}')
print(f'p_0 = {p}')
print(f'The corresponding x from the primal = {x}')
β_0 = 1.0000000009313226
p_0 = [ 5.00e-01 5.00e-01 -1.55e-09 -1.24e-09 -9.31e-10 0.00e+00]
The corresponding x from the primal = [-0. 0. 0.25 0.25 0.5 ]
As we can see, with a reducible (𝐴, 𝐵), the roots found by the bisection algorithms might differ, so there might be multiple
$\gamma^*$ that make the value of the game with $M(\gamma^*)$ zero (see the figure above).
Indeed, although the von Neumann theorem assures existence of the equilibrium, Assumptions I and II are not sufficient
for uniqueness. Nonetheless, Kemeny et al. (1967) show that there are at most finitely many economic solutions, meaning
that there are only finitely many 𝛾 ∗ that satisfy 𝑉 (𝑀 (𝛾 ∗ )) = 0 and 𝑥𝑇0 𝐵𝑝0 > 0 and that for each such 𝛾𝑖∗ , there is a
self-contained part of the economy (a sub-economy) that in equilibrium can expand independently with the expansion
coefficient 𝛾𝑖∗ .

The following theorem (see Theorem 9.10. in Gale [Gale, 1989]) asserts that imposing irreducibility is sufficient for
uniqueness of (𝛾 ∗ , 𝑥0 , 𝑝0 ).
Theorem II: Adopt the conditions of Theorem 1. If the economy (𝐴, 𝐵) is irreducible, then 𝛾 ∗ = 𝛼0 = 𝛽0 .
18.5.4 A Special Case
There is a special (𝐴, 𝐵) that allows us to simplify the solution method significantly by invoking the powerful Perron-
Frobenius theorem for non-negative matrices.
Definition: We call an economy simple if it satisfies
• 𝑛=𝑚
• Each activity produces exactly one good
• Each good is produced by one and only one activity.
These assumptions imply that 𝐵 = 𝐼𝑛 , i.e., that 𝐵 can be written as an identity matrix (possibly after reshuffling its rows
and columns).
The simple model has the following special property (Theorem 9.11. in Gale [Gale, 1989]): if 𝑥0 and 𝛼0 > 0 solve the
TEP with (𝐴, 𝐼𝑛 ), then
$$x_0^T = \alpha_0 x_0^T A \quad \Leftrightarrow \quad x_0^T A = \left( \frac{1}{\alpha_0} \right) x_0^T$$
The latter shows that 1/𝛼0 is a positive eigenvalue of 𝐴 and 𝑥0 is the corresponding non-negative left eigenvector.
The classic result of Perron and Frobenius implies that a non-negative matrix has a non-negative eigenvalue-eigenvector
pair.
Moreover, if 𝐴 is irreducible, then the optimal intensity vector 𝑥0 is positive and unique up to multiplication by a positive
scalar.
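For a simple economy the eigenvalue characterization can be checked numerically. The following sketch uses a hypothetical irreducible input matrix `A` (with 𝐵 equal to the identity) and assumes, in line with the Perron-Frobenius result quoted above, that the relevant eigenvalue is the dominant one; the intensity vector is rescaled to sum to one.

```python
import numpy as np

# Hypothetical irreducible input matrix for a simple economy (B = I)
A = np.array([[0.50, 0.25],
              [0.25, 0.50]])

# Left eigenvectors of A are right eigenvectors of A.T
eigvals, eigvecs = np.linalg.eig(A.T)
k = np.argmax(eigvals.real)           # index of the dominant eigenvalue
alpha_0 = 1 / eigvals[k].real         # expansion factor
x_0 = np.abs(eigvecs[:, k].real)      # fix the sign of the eigenvector
x_0 = x_0 / x_0.sum()                 # rescale so the intensities sum to one
```

Here the dominant eigenvalue is 0.75, so the sketch returns an expansion factor of 4/3 with equal intensities on the two activities.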
Suppose that 𝐴 is reducible with 𝑘 irreducible subsets 𝑆1 , … , 𝑆𝑘 . Let 𝐴𝑖 be the submatrix corresponding to 𝑆𝑖 and let
𝛼𝑖 and 𝛽𝑖 be the associated expansion and interest factors, respectively. Then we have
$$\alpha_0 = \max_i \{\alpha_i\} \quad \text{and} \quad \beta_0 = \min_i \{\beta_i\}$$

Part IV
Introduction to Dynamics
CHAPTER
NINETEEN
FINITE MARKOV CHAINS
Contents
• Finite Markov Chains

– Overview
– Definitions
– Simulation
– Marginal Distributions
– Irreducibility and Aperiodicity
– Stationary Distributions
– Ergodicity
– Computing Expectations
– Exercises
!pip install quantecon
19.1 Overview
Markov chains are one of the most useful classes of stochastic processes, being
• simple, flexible and supported by many elegant theoretical results
• valuable for building intuition about random dynamic models
• central to quantitative modeling in their own right
You will find them in many of the workhorse models of economics and finance.
In this lecture, we review some of the theory of Markov chains.
We will also introduce some of the high-quality routines for working with Markov chains available in QuantEcon.py.
Prerequisite knowledge is basic probability and linear algebra.
Let’s start with some standard imports:
import matplotlib.pyplot as plt   # used for the figures later in this lecture
import quantecon as qe
import numpy as np
19.2 Definitions
The following concepts are fundamental.
19.2.1 Stochastic Matrices
A stochastic matrix (or Markov matrix) is an 𝑛 × 𝑛 square matrix 𝑃 such that

1. each element of 𝑃 is nonnegative, and
2. each row of 𝑃 sums to one
Each row of 𝑃 can be regarded as a probability mass function over 𝑛 possible outcomes.
It is not too difficult to check¹ that if 𝑃 is a stochastic matrix, then so is the 𝑘-th power $P^k$ for all 𝑘 ∈ ℕ.
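A quick numerical check of this claim, as a sketch with an illustrative 2 × 2 stochastic matrix and an arbitrary power (both are assumptions made only for the example):

```python
import numpy as np

P = np.array([[0.4, 0.6],
              [0.2, 0.8]])

P_5 = np.linalg.matrix_power(P, 5)        # the fifth power of P
is_nonnegative = bool(np.all(P_5 >= 0))   # every entry is nonnegative
row_sums = P_5 @ np.ones(2)               # post-multiply by a column of ones
```

Both defining properties survive: the entries stay nonnegative and each row still sums to one.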
19.2.2 Markov Chains
There is a close connection between stochastic matrices and Markov chains.

To begin, let 𝑆 be a finite set with 𝑛 elements {𝑥1 , … , 𝑥𝑛 }.
The set 𝑆 is called the state space and 𝑥1 , … , 𝑥𝑛 are the state values.
A Markov chain {𝑋𝑡 } on 𝑆 is a sequence of random variables on 𝑆 that have the Markov property.
This means that, for any date 𝑡 and any state 𝑦 ∈ 𝑆,
ℙ{𝑋𝑡+1 = 𝑦 | 𝑋𝑡 } = ℙ{𝑋𝑡+1 = 𝑦 | 𝑋𝑡 , 𝑋𝑡−1 , …} (19.1)
In other words, knowing the current state is enough to know probabilities for future states.
In particular, the dynamics of a Markov chain are fully determined by the set of values
𝑃 (𝑥, 𝑦) ∶= ℙ{𝑋𝑡+1 = 𝑦 | 𝑋𝑡 = 𝑥} (𝑥, 𝑦 ∈ 𝑆) (19.2)
By construction,
• 𝑃 (𝑥, 𝑦) is the probability of going from 𝑥 to 𝑦 in one unit of time (one step)
• 𝑃 (𝑥, ⋅) is the conditional distribution of 𝑋𝑡+1 given 𝑋𝑡 = 𝑥
We can view 𝑃 as a stochastic matrix where
𝑃𝑖𝑗 = 𝑃 (𝑥𝑖 , 𝑥𝑗 ) 1 ≤ 𝑖, 𝑗 ≤ 𝑛
Going the other way, if we take a stochastic matrix 𝑃 , we can generate a Markov chain {𝑋𝑡 } as follows:
1 Hint: First show that if 𝑃 and 𝑄 are stochastic matrices then so is their product. To check the row sums, try postmultiplying by a column vector
of ones. Finally, argue that $P^k$ is a stochastic matrix using induction.
• draw 𝑋0 from a marginal distribution 𝜓

• for each 𝑡 = 0, 1, …, draw 𝑋𝑡+1 from 𝑃 (𝑋𝑡 , ⋅)
By construction, the resulting process satisfies (19.2).
19.2.3 Example 1
Consider a worker who, at any given time 𝑡, is either unemployed (state 0) or employed (state 1).
Suppose that, over a one month period,
1. An unemployed worker finds a job with probability 𝛼 ∈ (0, 1).
2. An employed worker loses her job and becomes unemployed with probability 𝛽 ∈ (0, 1).
In terms of a Markov model, we have
• 𝑆 = {0, 1}
• 𝑃 (0, 1) = 𝛼 and 𝑃 (1, 0) = 𝛽
We can write out the transition probabilities in matrix form as
$$P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix} \tag{19.3}$$
Once we have the values 𝛼 and 𝛽, we can address a range of questions, such as
• What is the average duration of unemployment?
• Over the long-run, what fraction of time does a worker find herself unemployed?
• Conditional on employment, what is the probability of becoming unemployed at least once over the next 12 months?
We’ll cover such applications below.
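Two of these questions have simple closed forms under the model's assumptions, sketched below for hypothetical values of 𝛼 and 𝛽. An unemployment spell ends each month with probability 𝛼, so its length is geometrically distributed, and an employed worker avoids unemployment for twelve months in a row with probability (1 − 𝛽)¹².

```python
alpha, beta = 0.25, 0.05    # hypothetical monthly job-finding and job-loss probabilities

# Mean of a geometric distribution with success probability alpha
mean_duration = 1 / alpha

# Complement of staying employed for 12 consecutive months
prob_job_loss_within_year = 1 - (1 - beta)**12
```

With these numbers the average spell lasts four months.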
19.2.4 Example 2
From US unemployment data, Hamilton [Hamilton, 2005] estimated the stochastic matrix
$$P = \begin{pmatrix} 0.971 & 0.029 & 0 \\ 0.145 & 0.778 & 0.077 \\ 0 & 0.508 & 0.492 \end{pmatrix}$$
where
• the frequency is monthly
• the first state represents “normal growth”
• the second state represents “mild recession”
• the third state represents “severe recession”
For example, the matrix tells us that when the state is normal growth, the state will again be normal growth next month
with probability 0.97.
In general, large values on the main diagonal indicate persistence in the process {𝑋𝑡 }.
This Markov process can also be represented as a directed graph, with edges labeled by transition probabilities
Here “ng” is normal growth, “mr” is mild recession, etc.
19.3 Simulation
One natural way to answer questions about Markov chains is to simulate them.
(To approximate the probability of event 𝐸, we can simulate many times and count the fraction of times that 𝐸 occurs).
Nice functionality for simulating Markov chains exists in QuantEcon.py.
• Efficient, bundled with lots of other useful routines for handling Markov chains.
However, it’s also a good exercise to roll our own routines — let’s do that first and then come back to the methods in
QuantEcon.py.
In these exercises, we’ll take the state space to be 𝑆 = {0, … , 𝑛 − 1}.
19.3.1 Rolling Our Own
To simulate a Markov chain, we need its stochastic matrix 𝑃 and a marginal probability distribution 𝜓 from which to
draw a realization of 𝑋0 .
The Markov chain is then constructed as discussed above. To repeat:
1. At time 𝑡 = 0, draw a realization of 𝑋0 from 𝜓.
2. At each subsequent time 𝑡, draw a realization of the new state 𝑋𝑡+1 from 𝑃 (𝑋𝑡 , ⋅).
To implement this simulation procedure, we need a method for generating draws from a discrete distribution.
For this task, we’ll use random.draw from QuantEcon, which works as follows:
ψ = (0.3, 0.7)           # probabilities over {0, 1}
cdf = np.cumsum(ψ)       # convert into cumulative distribution
qe.random.draw(cdf, 5)   # generate 5 independent draws from ψ
array([1, 0, 1, 1, 1])
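Drawing from a discrete distribution through its cumulative distribution is an instance of inverse-transform sampling. The NumPy-only sketch below illustrates the idea; it is not a description of how `qe.random.draw` is implemented, and the seed and sample size are arbitrary.

```python
import numpy as np

psi = np.array([0.3, 0.7])
cdf = np.cumsum(psi)

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)                    # uniform draws on [0, 1)
draws = np.searchsorted(cdf, u, side='right')    # smallest index i with u < cdf[i]
freq_one = np.mean(draws == 1)                   # share of draws equal to 1
```

A uniform draw below 0.3 maps to state 0 and any other draw maps to state 1, so the sample frequency of state 1 should be close to 0.7.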
We’ll write our code as a function that accepts the following three arguments
• A stochastic matrix P
• An initial distribution ψ_0 (if it is omitted, the chain starts in state 0)
• A positive integer sample_size representing the length of the time series the function should return

def mc_sample_path(P, ψ_0=None, sample_size=1_000):

    # set up
    P = np.asarray(P)
    X = np.empty(sample_size, dtype=int)

    # Convert each row of P into a cdf
    n = len(P)
    P_dist = [np.cumsum(P[i, :]) for i in range(n)]

    # draw initial state, defaulting to 0
    if ψ_0 is not None:
        X_0 = qe.random.draw(np.cumsum(ψ_0))
    else:
        X_0 = 0

    # simulate
    X[0] = X_0
    for t in range(sample_size - 1):
        X[t+1] = qe.random.draw(P_dist[X[t]])

    return X
Let’s see how it works using the small matrix
P = [[0.4, 0.6],
[0.2, 0.8]]
As we’ll see later, for a long series drawn from P, the fraction of the sample that takes value 0 will be about 0.25.
Moreover, this is true, regardless of the initial distribution from which 𝑋0 is drawn.
The following code illustrates this
X = mc_sample_path(P, ψ_0=[0.1, 0.9], sample_size=100_000)

np.mean(X == 0)
0.25041
You can try changing the initial distribution to confirm that the output is always close to 0.25, at least for the P matrix
above.
19.3.2 Using QuantEcon’s Routines
As discussed above, QuantEcon.py has routines for handling Markov chains, including simulation.
Here’s an illustration using the same P as the preceding example
from quantecon import MarkovChain
mc = qe.MarkovChain(P)
X = mc.simulate(ts_length=1_000_000)
np.mean(X == 0)
0.249516
The QuantEcon.py routine is JIT compiled and much faster.
%time mc_sample_path(P, sample_size=1_000_000) # Our homemade code version
CPU times: user 1.49 s, sys: 0 ns, total: 1.49 s

Wall time: 1.49 s
array([0, 1, 1, ..., 1, 1, 1])
%time mc.simulate(ts_length=1_000_000) # qe code version
CPU times: user 19.6 ms, sys: 4.75 ms, total: 24.3 ms
Wall time: 23.8 ms
array([0, 1, 1, ..., 1, 0, 1])
Adding State Values and Initial Conditions
If we wish to, we can provide a specification of state values to MarkovChain.

These state values can be integers, floats, or even strings.
The following code illustrates
mc = qe.MarkovChain(P, state_values=('unemployed', 'employed'))

mc.simulate(ts_length=4, init='employed')
array(['employed', 'employed', 'employed', 'unemployed'], dtype='<U10')
mc.simulate(ts_length=4, init='unemployed')
array(['unemployed', 'employed', 'employed', 'unemployed'], dtype='<U10')
mc.simulate(ts_length=4) # Start at randomly chosen initial state
array(['employed', 'unemployed', 'employed', 'employed'], dtype='<U10')
If we want to see indices rather than state values as outputs, we can use
mc.simulate_indices(ts_length=4)
array([1, 1, 1, 1])

19.4 Marginal Distributions
Suppose that
1. {𝑋𝑡 } is a Markov chain with stochastic matrix 𝑃
2. the marginal distribution of 𝑋𝑡 is known to be 𝜓𝑡
What then is the marginal distribution of 𝑋𝑡+1 , or, more generally, of 𝑋𝑡+𝑚 ?
To answer this, we let 𝜓𝑡 be the marginal distribution of 𝑋𝑡 for 𝑡 = 0, 1, 2, ….
Our first aim is to find 𝜓𝑡+1 given 𝜓𝑡 and 𝑃 .
To begin, pick any 𝑦 ∈ 𝑆.
Using the law of total probability, we can decompose the probability that 𝑋𝑡+1 = 𝑦 as follows:
$$\mathbb{P}\{X_{t+1} = y\} = \sum_{x \in S} \mathbb{P}\{X_{t+1} = y \mid X_t = x\} \cdot \mathbb{P}\{X_t = x\}$$
In words, to get the probability of being at 𝑦 tomorrow, we account for all ways this can happen and sum their probabilities.
Rewriting this statement in terms of marginal and conditional probabilities gives
$$\psi_{t+1}(y) = \sum_{x \in S} P(x, y) \psi_t(x)$$
There are 𝑛 such equations, one for each 𝑦 ∈ 𝑆.

If we think of 𝜓𝑡+1 and 𝜓𝑡 as row vectors, these 𝑛 equations are summarized by the matrix expression
𝜓𝑡+1 = 𝜓𝑡 𝑃 (19.4)
Thus, to move a marginal distribution forward one unit of time, we postmultiply by 𝑃 .

By postmultiplying 𝑚 times, we move a marginal distribution forward 𝑚 steps into the future.
Hence, iterating on (19.4), the expression 𝜓𝑡+𝑚 = 𝜓𝑡 𝑃 𝑚 is also valid — here 𝑃 𝑚 is the 𝑚-th power of 𝑃 .
As a special case, we see that if 𝜓0 is the initial distribution from which 𝑋0 is drawn, then 𝜓0 𝑃 𝑚 is the distribution of
𝑋𝑚 .
This is very important, so let’s repeat it
$$X_0 \sim \psi_0 \quad \implies \quad X_m \sim \psi_0 P^m \tag{19.5}$$
and, more generally,
$$X_t \sim \psi_t \quad \implies \quad X_{t+m} \sim \psi_t P^m \tag{19.6}$$
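The equivalence between repeated postmultiplication and a single matrix power is easy to confirm numerically. This is a sketch with an illustrative matrix and a hypothetical initial distribution:

```python
import numpy as np

P = np.array([[0.4, 0.6],
              [0.2, 0.8]])
psi_0 = np.array([0.1, 0.9])     # hypothetical initial distribution

# Move the distribution forward one step at a time, ten times
psi = psi_0.copy()
for _ in range(10):
    psi = psi @ P

# Compute the same object in one shot as psi_0 P^10
psi_10 = psi_0 @ np.linalg.matrix_power(P, 10)
```

The two results agree up to floating point error, and the result is still a probability distribution.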
19.4.1 Multiple Step Transition Probabilities
We know that the probability of transitioning from 𝑥 to 𝑦 in one step is 𝑃 (𝑥, 𝑦).
It turns out that the probability of transitioning from 𝑥 to 𝑦 in 𝑚 steps is 𝑃 𝑚 (𝑥, 𝑦), the (𝑥, 𝑦)-th element of the 𝑚-th
power of 𝑃 .
To see why, consider again (19.6), but now with a 𝜓𝑡 that puts all probability on state 𝑥, that is, the row vector with
1 in the 𝑥-th position and zero elsewhere.
Inserting this into (19.6), we see that, conditional on 𝑋𝑡 = 𝑥, the distribution of 𝑋𝑡+𝑚 is the 𝑥-th row of 𝑃 𝑚 .
In particular
$$\mathbb{P}\{X_{t+m} = y \mid X_t = x\} = P^m(x, y) = (x, y)\text{-th element of } P^m$$
19.4.2 Example: Probability of Recession
Recall the stochastic matrix 𝑃 for recession and growth considered above.
Suppose that the current state is unknown — perhaps statistics are available only at the end of the current month.
We guess that the probability that the economy is in state 𝑥 is 𝜓(𝑥).
The probability of being in recession (either mild or severe) in 6 months time is given by the inner product
$$\psi P^6 \cdot \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}$$
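In code, this inner product is a one-liner. The sketch below uses Hamilton's matrix from above together with a hypothetical belief vector `psi` about the current state; only the belief vector is an assumption.

```python
import numpy as np

P = np.array([[0.971, 0.029, 0.000],
              [0.145, 0.778, 0.077],
              [0.000, 0.508, 0.492]])
psi = np.array([0.2, 0.2, 0.6])     # hypothetical beliefs about the current state
recession = np.array([0, 1, 1])     # indicator of mild or severe recession

# Distribution six months ahead, then sum the mass on the two recession states
psi_6 = psi @ np.linalg.matrix_power(P, 6)
prob_recession = psi_6 @ recession
```

Equivalently, the probability of recession is one minus the probability of normal growth six months ahead.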
19.4.3 Example 2: Cross-Sectional Distributions
The marginal distributions we have been studying can be viewed either as probabilities or as cross-sectional frequencies
that a Law of Large Numbers leads us to anticipate for large samples.
To illustrate, recall our model of employment/unemployment dynamics for a given worker discussed above.
Consider a large population of workers, each of whose lifetime experience is described by the specified dynamics, with
each worker’s outcomes being realizations of processes that are statistically independent of all other workers’ processes.
Let 𝜓 be the current cross-sectional distribution over {0, 1}.
The cross-sectional distribution records fractions of workers employed and unemployed at a given moment.
• For example, 𝜓(0) is the unemployment rate.
What will the cross-sectional distribution be 10 periods hence?
The answer is $\psi P^{10}$, where 𝑃 is the stochastic matrix in (19.3).
This is because each worker’s state evolves according to 𝑃, so $\psi P^{10}$ is a marginal distribution for a single randomly selected
worker.
But when the sample is large, outcomes and probabilities are roughly equal (by an application of the Law of Large
Numbers).
So for a very large (tending to infinite) population, $\psi P^{10}$ also represents fractions of workers in each state.
This is exactly the cross-sectional distribution.
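The Law of Large Numbers argument can be illustrated by simulating a large cross-section of independent workers. In this sketch the transition probabilities, the initial cross-section, the population size and the seed are all hypothetical choices.

```python
import numpy as np

alpha, beta = 0.3, 0.1                 # hypothetical job-finding and job-loss rates
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
psi = np.array([0.5, 0.5])             # hypothetical current cross-section

rng = np.random.default_rng(1234)
n_workers = 200_000
# Initial states: 1 (employed) with probability psi[1], otherwise 0
state = (rng.uniform(size=n_workers) < psi[1]).astype(int)

for _ in range(10):
    u = rng.uniform(size=n_workers)
    # Probability of being employed next period given the current state
    prob_employed = np.where(state == 0, alpha, 1 - beta)
    state = (u < prob_employed).astype(int)

simulated = np.array([np.mean(state == 0), np.mean(state == 1)])
theoretical = psi @ np.linalg.matrix_power(P, 10)
```

The simulated fractions should lie close to the theoretical distribution, with the gap shrinking as the population grows.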

19.5 Irreducibility and Aperiodicity
Irreducibility and aperiodicity are central concepts of modern Markov chain theory.
Let’s see what they’re about.
19.5.1 Irreducibility
Let 𝑃 be a fixed stochastic matrix.

Two states 𝑥 and 𝑦 are said to communicate with each other if there exist positive integers 𝑗 and 𝑘 such that
$$P^j(x, y) > 0 \quad \text{and} \quad P^k(y, x) > 0$$
In view of our discussion above, this means precisely that

• state 𝑥 can eventually be reached from state 𝑦, and
• state 𝑦 can eventually be reached from state 𝑥
The stochastic matrix 𝑃 is called irreducible if all states communicate; that is, if 𝑥 and 𝑦 communicate for all (𝑥, 𝑦) in
𝑆 × 𝑆.
For example, consider the following transition probabilities for wealth of a fictitious set of households
We can translate this into a stochastic matrix, putting zeros where there’s no edge between nodes
$$P := \begin{pmatrix} 0.9 & 0.1 & 0 \\ 0.4 & 0.4 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$$
It’s clear from the graph that this stochastic matrix is irreducible: we can eventually reach any state from any other state.
We can also test this using QuantEcon.py’s MarkovChain class
P = [[0.9, 0.1, 0.0],
     [0.4, 0.4, 0.2],
     [0.1, 0.1, 0.8]]

mc = qe.MarkovChain(P, ('poor', 'middle', 'rich'))
mc.is_irreducible
True
Here’s a more pessimistic scenario in which poor people remain poor forever
This stochastic matrix is not irreducible, since, for example, rich is not accessible from poor.
Let’s confirm this
P = [[1.0, 0.0, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]

mc = qe.MarkovChain(P, ('poor', 'middle', 'rich'))
mc.is_irreducible
False
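Irreducibility of the two wealth matrices can also be checked by hand. The sketch below relies on a standard fact about nonnegative 𝑛 × 𝑛 matrices, namely that 𝑃 is irreducible exactly when every entry of (𝐼 + 𝑃)ⁿ⁻¹ is strictly positive; the function name `is_irreducible` is hypothetical and this is not necessarily how QuantEcon.py performs the test.

```python
import numpy as np

def is_irreducible(P):
    """Check irreducibility via (I + P)^(n-1) > 0 elementwise."""
    P = np.asarray(P, dtype=float)
    n = len(P)
    reach = np.linalg.matrix_power(np.eye(n) + P, n - 1)
    return bool(np.all(reach > 0))

P_mobile = [[0.9, 0.1, 0.0],
            [0.4, 0.4, 0.2],
            [0.1, 0.1, 0.8]]
P_trap = [[1.0, 0.0, 0.0],
          [0.1, 0.8, 0.1],
          [0.0, 0.2, 0.8]]

mobile_ok = is_irreducible(P_mobile)   # every state reaches every other state
trap_ok = is_irreducible(P_trap)       # the poor state is never left
```

The identity term keeps track of paths shorter than 𝑛 − 1 steps, so a positive entry means the column state can be reached from the row state in at most 𝑛 − 1 steps.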
We can also determine the “communication classes”
mc.communication_classes
[array(['poor'], dtype='<U6'), array(['middle', 'rich'], dtype='<U6')]
It might be clear to you already that irreducibility is going to be important in terms of long run outcomes.
For example, poverty is a life sentence in the second graph but not the first.
We’ll come back to this a bit later.

19.5.2 Aperiodicity
Loosely speaking, a Markov chain is called periodic if it cycles in a predictable way, and aperiodic otherwise.
Here’s a trivial example with three states
The chain cycles with period 3:
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

mc = qe.MarkovChain(P)
mc.period
More formally, the period of a state 𝑥 is the greatest common divisor of the set of integers
$$D(x) := \{j \geq 1 : P^j(x, x) > 0\}$$
In the last example, 𝐷(𝑥) = {3, 6, 9, …} for every state 𝑥, so the period is 3.
A stochastic matrix is called aperiodic if the period of every state is 1, and periodic otherwise.
For example, the stochastic matrix associated with the transition probabilities below is periodic because, for example,
state 𝑎 has period 2
We can confirm that the stochastic matrix is periodic with the following code
P = [[0.0, 1.0, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.0, 0.0, 1.0, 0.0]]

mc = qe.MarkovChain(P)
mc.period
mc.is_aperiodic
False
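The definition of the period through the set 𝐷(𝑥) can be turned into a direct, if crude, computation. The sketch below truncates 𝐷(𝑥) at a hypothetical horizon `max_steps`, so it is a heuristic illustration of the definition rather than the algorithm QuantEcon.py uses.

```python
import math
import numpy as np

def period_of_state(P, x, max_steps=50):
    """gcd of the return times j <= max_steps with P^j(x, x) > 0."""
    P = np.asarray(P, dtype=float)
    P_j = np.eye(len(P))
    d = 0
    for j in range(1, max_steps + 1):
        P_j = P_j @ P                # P_j now holds the j-th power of P
        if P_j[x, x] > 0:
            d = math.gcd(d, j)       # gcd(0, j) equals j, so the first hit sets d
    return d

P_cycle = [[0, 1, 0],
           [0, 0, 1],
           [1, 0, 0]]
P_walk = [[0.0, 1.0, 0.0, 0.0],
          [0.5, 0.0, 0.5, 0.0],
          [0.0, 0.5, 0.0, 0.5],
          [0.0, 0.0, 1.0, 0.0]]

period_cycle = period_of_state(P_cycle, 0)
period_walk = period_of_state(P_walk, 0)
```

For the three-state cycle the only return times are multiples of 3, while in the four-state example returns happen only at even dates.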
19.6 Stationary Distributions
As seen in (19.4), we can shift a marginal distribution forward one unit of time via postmultiplication by 𝑃 .
Some distributions are invariant under this updating process — for example,
P = np.array([[0.4, 0.6],
[0.2, 0.8]])
ψ = (0.25, 0.75)
ψ @ P
array([0.25, 0.75])
Such distributions are called stationary or invariant.

Formally, a marginal distribution 𝜓∗ on 𝑆 is called stationary for 𝑃 if 𝜓∗ = 𝜓∗ 𝑃 .
(This is the same notion of stationarity that we learned about in the lecture on AR(1) processes applied to a different
setting.)
From this equality, we immediately get 𝜓∗ = 𝜓∗ 𝑃 𝑡 for all 𝑡.
This tells us an important fact: If the distribution of 𝑋0 is a stationary distribution, then 𝑋𝑡 will have this same distribution
for all 𝑡.
Hence stationary distributions have a natural interpretation as stochastic steady states — we’ll discuss this more soon.
Mathematically, a stationary distribution is a fixed point of 𝑃 when 𝑃 is thought of as the map 𝜓 ↦ 𝜓𝑃 from (row)
vectors to (row) vectors.
Theorem. Every stochastic matrix 𝑃 has at least one stationary distribution.
(We are assuming here that the state space 𝑆 is finite; if not, more assumptions are required.)
For proof of this result, you can apply Brouwer’s fixed point theorem, or see EDTC, theorem 4.3.5.
There can be many stationary distributions corresponding to a given stochastic matrix 𝑃 .
• For example, if 𝑃 is the identity matrix, then all marginal distributions are stationary.
To get uniqueness of an invariant distribution, the transition matrix 𝑃 must have the property that no nontrivial subsets of
the state space are infinitely persistent.
A subset of the state space is infinitely persistent if other parts of the state space cannot be accessed from it.
Thus, infinite persistence of a non-trivial subset is the opposite of irreducibility.
This gives some intuition for the following fundamental theorem.
Theorem. If 𝑃 is both aperiodic and irreducible, then
1. 𝑃 has exactly one stationary distribution 𝜓∗ .
2. For any initial marginal distribution 𝜓0 , we have ‖𝜓0 𝑃 𝑡 − 𝜓∗ ‖ → 0 as 𝑡 → ∞.

For a proof, see, for example, theorem 5.2 of [Häggström, 2002].

(Note that part 1 of the theorem only requires irreducibility, whereas part 2 requires both irreducibility and aperiodicity)
A stochastic matrix that satisfies the conditions of the theorem is sometimes called uniformly ergodic.
A sufficient condition for aperiodicity and irreducibility is that every element of 𝑃 is strictly positive.
• Try to convince yourself of this.
19.6.1 Example
Recall our model of the employment/unemployment dynamics of a particular worker discussed above.
Assuming 𝛼 ∈ (0, 1) and 𝛽 ∈ (0, 1), the uniform ergodicity condition is satisfied.
Let 𝜓∗ = (𝑝, 1 − 𝑝) be the stationary distribution, so that 𝑝 corresponds to unemployment (state 0).
Using 𝜓∗ = 𝜓∗ 𝑃 and a bit of algebra yields
$$p = \frac{\beta}{\alpha + \beta}$$
This is, in some sense, a steady state probability of unemployment — more about the interpretation of this below.
Not surprisingly it tends to zero as 𝛽 → 0, and to one as 𝛼 → 0.
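The formula can be checked numerically. The following is a minimal sketch using only NumPy; the values of `alpha` and `beta` are illustrative assumptions, not taken from the text. It confirms that (𝑝, 1 − 𝑝) is unchanged by 𝑃 and that iterating 𝜓 ↦ 𝜓𝑃 from an arbitrary starting point approaches it.

```python
import numpy as np

alpha, beta = 0.1, 0.3   # illustrative values (assumed)

# Row 0 = unemployed, row 1 = employed
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

p = beta / (alpha + beta)
psi_star = np.array([p, 1 - p])

# Stationarity: psi_star should be unchanged by one step of the chain
is_stationary = np.allclose(psi_star @ P, psi_star)

# Convergence: iterate psi -> psi P from an arbitrary initial distribution
psi = np.array([0.0, 1.0])
for _ in range(200):
    psi = psi @ P

print(is_stationary, psi, psi_star)
```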
19.6.2 Calculating Stationary Distributions
As discussed above, a particular Markov matrix 𝑃 can have many stationary distributions.
That is, there can be many row vectors 𝜓 such that 𝜓 = 𝜓𝑃 .
In fact if 𝑃 has two distinct stationary distributions 𝜓1 , 𝜓2 then it has infinitely many, since in this case, as you can verify,
for any 𝜆 ∈ [0, 1]
𝜓3 ∶= 𝜆𝜓1 + (1 − 𝜆)𝜓2
is a stationary distribution for 𝑃 .
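As a quick illustration of this claim, here is a small sketch with a reducible chain chosen for the purpose (the matrix is an assumption, not one from the lecture). It has two distinct stationary distributions, and every convex combination of them is stationary as well.

```python
import numpy as np

# A reducible chain: state 0 is absorbing, states 1 and 2 only visit each other
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.5, 0.5]])

psi_1 = np.array([1.0, 0.0, 0.0])   # stationary: all mass on state 0
psi_2 = np.array([0.0, 0.5, 0.5])   # stationary: mass split over states 1 and 2

checks = []
for lam in np.linspace(0, 1, 5):
    psi_3 = lam * psi_1 + (1 - lam) * psi_2
    checks.append(np.allclose(psi_3 @ P, psi_3))

print(all(checks))
```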

If we restrict attention to the case in which only one stationary distribution exists, one way to find it is to solve the
system
𝜓(𝐼𝑛 − 𝑃 ) = 0 (19.7)
for 𝜓, where 𝐼𝑛 is the 𝑛 × 𝑛 identity.

But the zero vector solves system (19.7), so we must proceed cautiously.
We want to impose the restriction that 𝜓 is a probability distribution.
There are various ways to do this.
One option is to regard solving system (19.7) as an eigenvector problem: a vector 𝜓 such that 𝜓 = 𝜓𝑃 is a left eigenvector
associated with the unit eigenvalue 𝜆 = 1.
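The eigenvector route can be sketched with plain NumPy as follows. This is only an illustration of the idea, not the specialized routine recommended below; it assumes the unit eigenvalue is simple, so that the stationary distribution is unique.

```python
import numpy as np

P = np.array([[0.4, 0.6],
              [0.2, 0.8]])

# Left eigenvectors of P are right eigenvectors of its transpose
eigvals, eigvecs = np.linalg.eig(P.T)

# Pick the eigenvector whose eigenvalue is closest to 1
i = np.argmin(np.abs(eigvals - 1.0))
v = np.real(eigvecs[:, i])

# Rescale so the entries sum to one; this also rules out the zero vector
psi = v / v.sum()
print(psi)
```

Dividing by the sum imposes the restriction that 𝜓 is a probability distribution and fixes the arbitrary sign and scale of the eigenvector.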
A stable and sophisticated algorithm specialized for stochastic matrices is implemented in QuantEcon.py.
This is the one we recommend:

P = [[0.4, 0.6],
     [0.2, 0.8]]

mc = MarkovChain(P)          # assumes MarkovChain has been imported from quantecon
mc.stationary_distributions  # Show all stationary distributions
array([[0.25, 0.75]])
19.6.3 Convergence to Stationarity
Part 2 of the Markov chain convergence theorem stated above tells us that the marginal distribution of 𝑋𝑡 converges to
the stationary distribution regardless of where we begin.
This adds considerable authority to our interpretation of 𝜓∗ as a stochastic steady state.
The convergence in the theorem is illustrated in the next figure
P = ((0.971, 0.029, 0.000),
     (0.145, 0.778, 0.077),
     (0.000, 0.508, 0.492))
P = np.array(P)

ψ = (0.0, 0.2, 0.8)        # Initial condition

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

ax.set(xlim=(0, 1), ylim=(0, 1), zlim=(0, 1),
       xticks=(0.25, 0.5, 0.75),
       yticks=(0.25, 0.5, 0.75),
       zticks=(0.25, 0.5, 0.75))

x_vals, y_vals, z_vals = [], [], []
for t in range(20):
    x_vals.append(ψ[0])
    y_vals.append(ψ[1])
    z_vals.append(ψ[2])
    ψ = ψ @ P              # Update the marginal distribution

ax.scatter(x_vals, y_vals, z_vals, c='r', s=60)
ax.view_init(30, 210)

mc = MarkovChain(P)        # Build the chain for this P before asking for ψ*
ψ_star = mc.stationary_distributions[0]
ax.scatter(ψ_star[0], ψ_star[1], ψ_star[2], c='k', s=60)

plt.show()

Here
• 𝑃 is the stochastic matrix for recession and growth considered above.
• The highest red dot is an arbitrarily chosen initial marginal probability distribution 𝜓, represented as a vector in ℝ³.
• The other red dots are the marginal distributions 𝜓𝑃^𝑡 for 𝑡 = 1, 2, ….
• The black dot is 𝜓∗ .
You might like to try experimenting with different initial conditions.
19.7 Ergodicity
Under irreducibility, yet another important result obtains: for all 𝑥 ∈ 𝑆,
$$
\frac{1}{m} \sum_{t=1}^{m} \mathbf{1}\{X_t = x\} \to \psi^*(x)
\quad \text{as } m \to \infty
\tag{19.8}
$$
Here
• 1{𝑋𝑡 = 𝑥} = 1 if 𝑋𝑡 = 𝑥 and zero otherwise
• convergence is with probability one
• the result does not depend on the marginal distribution of 𝑋0

The result tells us that the fraction of time the chain spends at state 𝑥 converges to 𝜓∗ (𝑥) as time goes to infinity.
This gives us another way to interpret the stationary distribution — provided that the convergence result in (19.8) is valid.
The convergence asserted in (19.8) is a special case of a law of large numbers result for Markov chains — see EDTC,
section 4.3.4 for some additional information.
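A small simulation makes (19.8) concrete. The sketch below uses a two-state chain with illustrative parameters (an assumption for this example) and NumPy only; it records the fraction of time spent in state 0 along a single long path and compares it with the stationary probability of that state.

```python
import numpy as np

alpha, beta = 0.3, 0.2            # illustrative parameters (assumed)
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
p = beta / (alpha + beta)         # stationary probability of state 0

rng = np.random.default_rng(1234)
m = 100_000
u = rng.random(m)                 # one uniform draw per period

x = 1                             # arbitrary initial state
visits_to_0 = 0
for t in range(m):
    # Move to state 0 with probability P[x, 0], otherwise to state 1
    x = 0 if u[t] < P[x, 0] else 1
    visits_to_0 += (x == 0)

frac = visits_to_0 / m
print(frac, p)
```

Changing the initial state should leave the long-run fraction essentially unchanged, in line with the third bullet above.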
19.7.1 Example
Recall our cross-sectional interpretation of the employment/unemployment model discussed above.

Assume that 𝛼 ∈ (0, 1) and 𝛽 ∈ (0, 1), so that irreducibility and aperiodicity both hold.
We saw that the stationary distribution is (𝑝, 1 − 𝑝), where
$$
p = \frac{\beta}{\alpha + \beta}
$$
In the cross-sectional interpretation, this is the fraction of people unemployed.
In view of our latest (ergodicity) result, it is also the fraction of time that a single worker can expect to spend unemployed.
Thus, in the long-run, cross-sectional averages for a population and time-series averages for a given person coincide.
This is one aspect of the concept of ergodicity.
19.8 Computing Expectations
We sometimes want to compute mathematical expectations of functions of 𝑋𝑡 of the form
𝔼[ℎ(𝑋𝑡 )] (19.9)
and conditional expectations such as
𝔼[ℎ(𝑋𝑡+𝑘 ) ∣ 𝑋𝑡 = 𝑥] (19.10)
where
• {𝑋𝑡 } is a Markov chain generated by 𝑛 × 𝑛 stochastic matrix 𝑃
• ℎ is a given function, which, in terms of matrix algebra, we’ll think of as the column vector
$$
h = \begin{pmatrix} h(x_1) \\ \vdots \\ h(x_n) \end{pmatrix}
$$
Computing the unconditional expectation (19.9) is easy.
We just sum over the marginal distribution of 𝑋𝑡 to get
$$
\mathbb{E}[h(X_t)] = \sum_{x \in S} (\psi P^t)(x) \, h(x)
$$
Here 𝜓 is the distribution of 𝑋0 .

Since 𝜓 and hence 𝜓𝑃^𝑡 are row vectors, we can also write this as
$$
\mathbb{E}[h(X_t)] = \psi P^t h
$$
For the conditional expectation (19.10), we need to sum over the conditional distribution of 𝑋𝑡+𝑘 given 𝑋𝑡 = 𝑥.

We already know that this is 𝑃 𝑘 (𝑥, ⋅), so
𝔼[ℎ(𝑋𝑡+𝑘 ) ∣ 𝑋𝑡 = 𝑥] = (𝑃 𝑘 ℎ)(𝑥) (19.11)
The vector 𝑃 𝑘 ℎ stores the conditional expectation 𝔼[ℎ(𝑋𝑡+𝑘 ) ∣ 𝑋𝑡 = 𝑥] over all 𝑥.
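Both formulas translate directly into matrix products. The sketch below reuses the three-state matrix from the convergence example above; the function `h`, the initial distribution `psi` and the horizons `t` and `k` are illustrative assumptions.

```python
import numpy as np
from numpy.linalg import matrix_power

P = np.array([[0.971, 0.029, 0.000],
              [0.145, 0.778, 0.077],
              [0.000, 0.508, 0.492]])

h = np.array([1.0, 0.0, -1.0])     # hypothetical values h(x_1), h(x_2), h(x_3)
psi = np.array([0.0, 0.2, 0.8])    # distribution of X_0
t, k = 5, 3

# Unconditional expectation E[h(X_t)] = psi P^t h (a scalar)
uncond = psi @ matrix_power(P, t) @ h

# Conditional expectations E[h(X_{t+k}) | X_t = x], one entry per state x
cond = matrix_power(P, k) @ h

print(uncond)
print(cond)
```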
19.8.1 Iterated Expectations
The law of iterated expectations states that
𝔼 [𝔼[ℎ(𝑋𝑡+𝑘 ) ∣ 𝑋𝑡 = 𝑥]] = 𝔼[ℎ(𝑋𝑡+𝑘 )]
where the outer 𝔼 on the left side is an unconditional expectation taken with respect to the marginal distribution 𝜓𝑡 of 𝑋𝑡
(again see equation (19.6)).
To verify the law of iterated expectations, use equation (19.11) to substitute (𝑃 𝑘 ℎ)(𝑥) for 𝐸[ℎ(𝑋𝑡+𝑘 ) ∣ 𝑋𝑡 = 𝑥], write
𝔼 [𝔼[ℎ(𝑋𝑡+𝑘 ) ∣ 𝑋𝑡 = 𝑥]] = 𝜓𝑡 𝑃 𝑘 ℎ,
and note 𝜓𝑡 𝑃 𝑘 ℎ = 𝜓𝑡+𝑘 ℎ = 𝔼[ℎ(𝑋𝑡+𝑘 )].
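The same identity can be confirmed numerically. In this sketch the matrix, the vector `h`, the initial distribution and the horizons are all illustrative assumptions.

```python
import numpy as np
from numpy.linalg import matrix_power

P = np.array([[0.9, 0.1, 0.0],
              [0.4, 0.4, 0.2],
              [0.1, 0.1, 0.8]])
h = np.array([2.0, -1.0, 0.5])      # hypothetical function values
psi_0 = np.array([0.3, 0.3, 0.4])   # distribution of X_0
t, k = 4, 6

psi_t = psi_0 @ matrix_power(P, t)              # marginal distribution of X_t

# Left side: average the conditional expectations P^k h over psi_t
lhs = psi_t @ (matrix_power(P, k) @ h)

# Right side: E[h(X_{t+k})] computed from the marginal of X_{t+k}
rhs = (psi_0 @ matrix_power(P, t + k)) @ h

print(lhs, rhs)
```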
19.8.2 Expectations of Geometric Sums
Sometimes we want to compute the mathematical expectation of a geometric sum, such as ∑_𝑡 𝛽^𝑡 ℎ(𝑋𝑡 ).
In view of the preceding discussion, this is
$$
\mathbb{E}\left[ \sum_{j=0}^{\infty} \beta^j h(X_{t+j}) \mid X_t = x \right]
= [(I - \beta P)^{-1} h](x)
$$
where
$$
(I - \beta P)^{-1} = I + \beta P + \beta^2 P^2 + \cdots
$$
Premultiplication by (𝐼 − 𝛽𝑃 )−1 amounts to “applying the resolvent operator”.
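In code, it is usually preferable to solve the linear system (𝐼 − 𝛽𝑃)𝑣 = ℎ rather than form the inverse explicitly. The sketch below, with an illustrative 𝑃, ℎ and 𝛽 (all assumptions), compares that solution with a long truncation of the series.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
h = np.array([1.0, 2.0])     # hypothetical payoff in each state
beta = 0.95                  # discount factor in (0, 1)
n = len(h)

# v = (I - beta P)^{-1} h, obtained by solving a linear system
v = np.linalg.solve(np.eye(n) - beta * P, h)

# Truncated series h + beta P h + beta^2 P^2 h + ...
v_series = np.zeros(n)
term = h.copy()
for j in range(2000):
    v_series += term
    term = beta * (P @ term)

print(v, v_series)
```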
19.9 Exercises
Exercise 19.9.1
According to the discussion above, if a worker’s employment dynamics obey the stochastic matrix
$$
P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}
$$
with 𝛼 ∈ (0, 1) and 𝛽 ∈ (0, 1), then, in the long-run, the fraction of time spent unemployed will be
$$
p := \frac{\beta}{\alpha + \beta}
$$
In other words, if {𝑋𝑡 } represents the Markov chain for employment, then 𝑋̄ 𝑚 → 𝑝 as 𝑚 → ∞, where
$$
\bar{X}_m := \frac{1}{m} \sum_{t=1}^{m} \mathbf{1}\{X_t = 0\}
$$

This exercise asks you to illustrate convergence by computing 𝑋̄ 𝑚 for large 𝑚 and checking that it is close to 𝑝.
You will see that this statement is true regardless of the choice of initial condition or the values of 𝛼, 𝛽, provided both lie
in (0, 1).

We will address this exercise graphically.
The plots show the time series of 𝑋̄ 𝑚 − 𝑝 for two initial conditions.
As 𝑚 gets large, both series converge to zero.
α = β = 0.1
N = 10000
p = β / (α + β)

P = ((1 - α,       α),               # Careful: P and p are distinct
     (    β,   1 - β))
mc = MarkovChain(P)

fig, ax = plt.subplots(figsize=(9, 6))
ax.set_ylim(-0.25, 0.25)
ax.grid()
ax.hlines(0, 0, N, lw=2, alpha=0.6)   # Horizontal line at zero

for x0, col in ((0, 'blue'), (1, 'green')):
    # Generate time series for worker that starts at x0
    X = mc.simulate(N, init=x0)
    # Compute fraction of time spent unemployed, for each n
    X_bar = (X == 0).cumsum() / (1 + np.arange(N, dtype=float))
    # Plot
    ax.fill_between(range(N), np.zeros(N), X_bar - p, color=col, alpha=0.1)
    ax.plot(X_bar - p, color=col, label=rf'$X_0 = \, {x0} $')
    # Overlay in black--make lines clearer
    ax.plot(X_bar - p, 'k-', alpha=0.6)

ax.legend(loc='upper right')
plt.show()

Exercise 19.9.2
A topic of interest for economics and many other disciplines is ranking.
Let’s now consider one of the most practical and important ranking problems — the rank assigned to web pages by search
engines.
(Although the problem is motivated from outside of economics, there is in fact a deep connection between search ranking
systems and prices in certain competitive equilibria — see [Du et al., 2013].)
To understand the issue, consider the set of results returned by a query to a web search engine.
For the user, it is desirable to
1. receive a large set of accurate matches
2. have the matches returned in order, where the order corresponds to some measure of “importance”
Ranking according to a measure of importance is the problem we now consider.
The methodology developed to solve this problem by Google founders Larry Page and Sergey Brin is known as PageRank.
To illustrate the idea, consider the following diagram
Imagine that this is a miniature version of the WWW, with
• each node representing a web page
• each arrow representing the existence of a link from one page to another
Now let’s think about which pages are likely to be important, in the sense of being valuable to a search engine user.
One possible criterion for the importance of a page is the number of inbound links — an indication of popularity.

By this measure, m and j are the most important pages, with 5 inbound links each.
However, what if the pages linking to m, say, are not themselves important?
Thinking this way, it seems appropriate to weight the inbound nodes by relative importance.
The PageRank algorithm does precisely this.
A slightly simplified presentation that captures the basic idea is as follows.
Letting 𝑗 be (the integer index of) a typical page and 𝑟𝑗 be its ranking, we set
$$
r_j = \sum_{i \in L_j} \frac{r_i}{\ell_i}
$$
where
• ℓ𝑖 is the total number of outbound links from 𝑖
• 𝐿𝑗 is the set of all pages 𝑖 such that 𝑖 has a link to 𝑗
This is a measure of the number of inbound links, weighted by their own ranking (and normalized by 1/ℓ𝑖 ).
There is, however, another interpretation, and it brings us back to Markov chains.
Let 𝑃 be the matrix given by 𝑃 (𝑖, 𝑗) = 1{𝑖 → 𝑗}/ℓ𝑖 where 1{𝑖 → 𝑗} = 1 if 𝑖 has a link to 𝑗 and zero otherwise.
The matrix 𝑃 is a stochastic matrix provided that each page has at least one link.
With this definition of 𝑃 we have
$$
r_j
= \sum_{i \in L_j} \frac{r_i}{\ell_i}
= \sum_{\text{all } i} \mathbf{1}\{i \to j\} \frac{r_i}{\ell_i}
= \sum_{\text{all } i} P(i, j) r_i
$$
Writing 𝑟 for the row vector of rankings, this becomes 𝑟 = 𝑟𝑃 .

Hence 𝑟 is the stationary distribution of the stochastic matrix 𝑃 .

Let’s think of 𝑃 (𝑖, 𝑗) as the probability of “moving” from page 𝑖 to page 𝑗.

The value 𝑃 (𝑖, 𝑗) has the interpretation
• 𝑃 (𝑖, 𝑗) = 1/𝑘 if 𝑖 has 𝑘 outbound links and 𝑗 is one of them
• 𝑃 (𝑖, 𝑗) = 0 if 𝑖 has no direct link to 𝑗
Thus, motion from page to page is that of a web surfer who moves from one page to another by randomly clicking on one
of the links on that page.
Here “random” means that each link is selected with equal probability.
Since 𝑟 is the stationary distribution of 𝑃 , assuming that the uniform ergodicity condition is valid, we can interpret 𝑟𝑗 as
the fraction of time that a (very persistent) random surfer spends at page 𝑗.
Your exercise is to apply this ranking algorithm to the graph pictured above and return the list of pages ordered by rank.
There is a total of 14 nodes (i.e., web pages), the first named a and the last named n.
A typical line from the file has the form
d -> h;
This should be interpreted as meaning that there exists a link from d to h.

The data for this graph is shown below, and is written to a file called web_graph_data.txt when the cell is executed.
%%file web_graph_data.txt
a -> d;
a -> f;
b -> j;
b -> k;
b -> m;
c -> c;
c -> g;
c -> j;
c -> m;
d -> f;
d -> h;
d -> k;
e -> d;
e -> h;
e -> l;
f -> a;
f -> b;
f -> j;
f -> l;
g -> b;
g -> j;
h -> d;
h -> g;
h -> l;
h -> m;
i -> g;
i -> h;
i -> n;
j -> e;
j -> i;
j -> k;
k -> n;
l -> m;
m -> g;
n -> c;
n -> j;
n -> m;
Overwriting web_graph_data.txt
To parse this file and extract the relevant information, you can use regular expressions.
The following code snippet provides a hint as to how you can go about this
import re
re.findall(r'\w', 'x +++ y ****** z') # \w matches alphanumerics
['x', 'y', 'z']
re.findall(r'\w', 'a ^^ b &&& $$ c')
['a', 'b', 'c']
When you solve for the ranking, you will find that the highest ranked node is in fact g, while the lowest is a.

"""
Return list of pages, ordered by rank
"""
import re
from operator import itemgetter

infile = 'web_graph_data.txt'
alphabet = 'abcdefghijklmnopqrstuvwxyz'

n = 14  # Total number of web pages (nodes)

# Create a matrix Q indicating existence of links
#  * Q[i, j] = 1 if there is a link from i to j
#  * Q[i, j] = 0 otherwise
Q = np.zeros((n, n), dtype=int)
with open(infile) as f:
    edges = f.readlines()
for edge in edges:
    # Each line looks like 'd -> h;', so \w picks out the two node names
    from_node, to_node = re.findall(r'\w', edge)
    i, j = alphabet.index(from_node), alphabet.index(to_node)
    Q[i, j] = 1

# Create the corresponding Markov matrix P by normalizing each row of Q
P = np.empty((n, n))
for i in range(n):
    P[i, :] = Q[i, :] / Q[i, :].sum()
mc = MarkovChain(P)

# Compute the stationary distribution r
r = mc.stationary_distributions[0]
ranked_pages = {alphabet[i] : r[i] for i in range(n)}

# Print solution, sorted from highest to lowest rank
print('Rankings\n ***')
for name, rank in sorted(ranked_pages.items(), key=itemgetter(1), reverse=True):
    print(f'{name}: {rank:.4}')
Rankings
***
g: 0.1607
j: 0.1594
m: 0.1195
n: 0.1088
k: 0.09106
b: 0.08326
e: 0.05312
i: 0.05312
c: 0.04834
h: 0.0456
l: 0.03202
d: 0.03056
f: 0.01164
a: 0.002911
Exercise 19.9.3
In numerical work, it is sometimes convenient to replace a continuous model with a discrete one.
In particular, Markov chains are routinely generated as discrete approximations to AR(1) processes of the form
𝑦𝑡+1 = 𝜌𝑦𝑡 + 𝑢𝑡+1
Here 𝑢𝑡 is assumed to be IID and 𝑁 (0, 𝜎𝑢²).

The variance of the stationary probability distribution of {𝑦𝑡 } is
$$
\sigma_y^2 := \frac{\sigma_u^2}{1 - \rho^2}
$$
Tauchen’s method [Tauchen, 1986] is the most common method for approximating this continuous state process with a
finite state Markov chain.
A routine for this already exists in QuantEcon.py but let’s write our own version as an exercise.
As a first step, we choose
• 𝑛, the number of states for the discrete approximation
• 𝑚, an integer that parameterizes the width of the state space
Next, we create a state space {𝑥0 , … , 𝑥𝑛−1 } ⊂ ℝ and a stochastic 𝑛 × 𝑛 matrix 𝑃 such that
• 𝑥0 = −𝑚 𝜎𝑦
• 𝑥𝑛−1 = 𝑚 𝜎𝑦

• 𝑥𝑖+1 = 𝑥𝑖 + 𝑠 where 𝑠 = (𝑥𝑛−1 − 𝑥0 )/(𝑛 − 1)

Let 𝐹 be the cumulative distribution function of the normal distribution 𝑁 (0, 𝜎𝑢²).
The values 𝑃 (𝑥𝑖 , 𝑥𝑗 ) are computed to approximate the AR(1) process — omitting the derivation, the rules are as follows:
1. If 𝑗 = 0, then set
𝑃 (𝑥𝑖 , 𝑥𝑗 ) = 𝑃 (𝑥𝑖 , 𝑥0 ) = 𝐹 (𝑥0 − 𝜌𝑥𝑖 + 𝑠/2)
2. If 𝑗 = 𝑛 − 1, then set
𝑃 (𝑥𝑖 , 𝑥𝑗 ) = 𝑃 (𝑥𝑖 , 𝑥𝑛−1 ) = 1 − 𝐹 (𝑥𝑛−1 − 𝜌𝑥𝑖 − 𝑠/2)
3. Otherwise, set
𝑃 (𝑥𝑖 , 𝑥𝑗 ) = 𝐹 (𝑥𝑗 − 𝜌𝑥𝑖 + 𝑠/2) − 𝐹 (𝑥𝑗 − 𝜌𝑥𝑖 − 𝑠/2)
The exercise is to write a function approx_markov(rho, sigma_u, m=3, n=7) that returns {𝑥0 , … , 𝑥𝑛−1 } ⊂
ℝ and 𝑛 × 𝑛 matrix 𝑃 as described above.
• Even better, write a function that returns an instance of QuantEcon.py’s MarkovChain class.

A solution from the QuantEcon.py library can be found here.
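For orientation, here is one possible sketch of such a function, following rules 1 to 3 above directly. It is not the QuantEcon.py implementation; the helper `F` (built from `math.erf`) and the test values `rho=0.9`, `sigma_u=0.1` are assumptions made for this illustration.

```python
import numpy as np
from math import erf, sqrt

def approx_markov(rho, sigma_u, m=3, n=7):
    "Sketch of Tauchen's method: returns the grid x and the n x n matrix P."
    sigma_y = sqrt(sigma_u**2 / (1 - rho**2))       # stationary std dev of y
    x = np.linspace(-m * sigma_y, m * sigma_y, n)   # x_0 = -m sigma_y, x_{n-1} = m sigma_y
    s = (x[-1] - x[0]) / (n - 1)                    # step size

    def F(z):
        # CDF of N(0, sigma_u^2)
        return 0.5 * (1 + erf(z / (sigma_u * sqrt(2))))

    P = np.empty((n, n))
    for i in range(n):
        P[i, 0] = F(x[0] - rho * x[i] + s / 2)              # rule 1
        P[i, n-1] = 1 - F(x[n-1] - rho * x[i] - s / 2)      # rule 2
        for j in range(1, n - 1):                           # rule 3
            z = x[j] - rho * x[i]
            P[i, j] = F(z + s / 2) - F(z - s / 2)
    return x, P

x, P = approx_markov(0.9, 0.1)
print(P.sum(axis=1))   # each row should sum to one
```

The rows sum to one because consecutive terms telescope: the upper cut-off used for state 𝑥𝑗 is the lower cut-off used for state 𝑥𝑗+1.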

CHAPTER
TWENTY
INVENTORY DYNAMICS
Contents
• Inventory Dynamics
– Overview
– Sample Paths
– Marginal Distributions
– Exercises
20.1 Overview
In this lecture we will study the time path of inventories for firms that follow so-called s-S inventory dynamics.
Such firms
1. wait until inventory falls below some level 𝑠 and then
2. order sufficient quantities to bring their inventory back up to capacity 𝑆.
These kinds of policies are common in practice and also optimal in certain circumstances.
A review of early literature and some macroeconomic implications can be found in [Caplin, 1985].
Here our main aim is to learn more about simulation, time series and Markov dynamics.
While our Markov environment and many of the concepts we consider are related to those found in our lecture on finite
Markov chains, the state space is a continuum in the current application.
Let’s start with some imports

import numpy as np
import matplotlib.pyplot as plt   # needed for the plotting calls below
from numba import njit, float64, prange
from numba.experimental import jitclass
20.2 Sample Paths
Consider a firm with inventory 𝑋𝑡 .

The firm waits until 𝑋𝑡 ≤ 𝑠 and then restocks up to 𝑆 units.
It faces stochastic demand {𝐷𝑡 }, which we assume is IID.
With notation 𝑎+ ∶= max{𝑎, 0}, inventory dynamics can be written as
$$
X_{t+1} =
\begin{cases}
(S - D_{t+1})^+ & \text{if } X_t \leq s \\
(X_t - D_{t+1})^+ & \text{if } X_t > s
\end{cases}
$$
In what follows, we will assume that each 𝐷𝑡 is lognormal, so that
𝐷𝑡 = exp(𝜇 + 𝜎𝑍𝑡 )
where 𝜇 and 𝜎 are parameters and {𝑍𝑡 } is IID and standard normal.
Here’s a class that stores parameters and generates time paths for inventory.
firm_data = [
   ('s', float64),          # restock trigger level
   ('S', float64),          # capacity
   ('mu', float64),         # shock location parameter
   ('sigma', float64)       # shock scale parameter
]


@jitclass(firm_data)
class Firm:

    def __init__(self, s=10, S=100, mu=1.0, sigma=0.5):

        self.s, self.S, self.mu, self.sigma = s, S, mu, sigma

    def update(self, x):
        "Update the state from t to t+1 given current state x."

        Z = np.random.randn()
        D = np.exp(self.mu + self.sigma * Z)
        if x <= self.s:
            return max(self.S - D, 0)
        else:
            return max(x - D, 0)

    def sim_inventory_path(self, x_init, sim_length):

        X = np.empty(sim_length)
        X[0] = x_init

        for t in range(sim_length-1):
            X[t+1] = self.update(X[t])
        return X
Let’s run a first simulation, of a single path:

firm = Firm()

s, S = firm.s, firm.S
sim_length = 100
x_init = 50

X = firm.sim_inventory_path(x_init, sim_length)

fig, ax = plt.subplots()
bbox = (0., 1.02, 1., .102)
legend_args = {'ncol': 3,
               'bbox_to_anchor': bbox,
               'loc': 3,
               'mode': 'expand'}

ax.plot(X, label="inventory")
ax.plot(np.full(sim_length, s), 'k--', label="$s$")
ax.plot(np.full(sim_length, S), 'k-', label="$S$")
ax.set_ylim(0, S+10)
ax.set_xlabel("time")
ax.legend(**legend_args)

plt.show()
Now let’s simulate multiple paths in order to build a more complete picture of the probabilities of different outcomes:
sim_length = 200
fig, ax = plt.subplots()

ax.plot(np.full(sim_length, s), 'k--', label="$s$")
ax.plot(np.full(sim_length, S), 'k-', label="$S$")
ax.set_ylim(0, S+10)
ax.legend(**legend_args)

for i in range(400):   # number of paths; any moderately large value works
    X = firm.sim_inventory_path(x_init, sim_length)
    ax.plot(X, 'b', alpha=0.2, lw=0.5)

plt.show()
20.3 Marginal Distributions
Now let’s look at the marginal distribution 𝜓𝑇 of 𝑋𝑇 for some fixed 𝑇 .

We will do this by generating many draws of 𝑋𝑇 given initial condition 𝑋0 .
With these draws of 𝑋𝑇 we can build up a picture of its distribution 𝜓𝑇 .
Here’s one visualization, with 𝑇 = 50.
T = 50
M = 200  # Number of draws

ymin, ymax = 0, S + 10

fig, axes = plt.subplots(1, 2, figsize=(11, 6))

for ax in axes:
    ax.grid(alpha=0.4)

ax = axes[0]

ax.set_ylim(ymin, ymax)
ax.set_ylabel('$X_t$', fontsize=16)
ax.vlines((T,), -1.5, 1.5)

ax.set_xticks((T,))
ax.set_xticklabels((r'$T$',))

sample = np.empty(M)
for m in range(M):
    X = firm.sim_inventory_path(x_init, 2 * T)
    ax.plot(X, 'b-', lw=1, alpha=0.5)
    ax.plot((T,), (X[T],), 'ko', alpha=0.5)   # mark the draw of X_T
    sample[m] = X[T]

axes[1].set_ylim(ymin, ymax)

axes[1].hist(sample,
             bins=16,
             density=True,
             orientation='horizontal',
             histtype='bar',
             alpha=0.5)

plt.show()
We can build up a clearer picture by drawing more samples
T = 50
M = 50_000

fig, ax = plt.subplots()

sample = np.empty(M)
for m in range(M):
    X = firm.sim_inventory_path(x_init, T+1)
    sample[m] = X[T]

ax.hist(sample,
        bins=36,
        density=True,
        histtype='bar',
        alpha=0.75)

plt.show()
Note that the distribution is bimodal

• Most firms have restocked twice but a few have restocked only once (see figure with paths above).
• Firms in the second category have lower inventory.
We can also approximate the distribution using a kernel density estimator.
Kernel density estimators can be thought of as smoothed histograms.
They are preferable to histograms when the distribution being estimated is likely to be smooth.
We will use a kernel density estimator from scikit-learn
from sklearn.neighbors import KernelDensity
def plot_kde(sample, ax, label=''):

    xmin, xmax = 0.9 * min(sample), 1.1 * max(sample)
    xgrid = np.linspace(xmin, xmax, 200)   # evaluation grid; 200 points is a choice
    kde = KernelDensity(kernel='gaussian').fit(sample[:, None])
    log_dens = kde.score_samples(xgrid[:, None])

    ax.plot(xgrid, np.exp(log_dens), label=label)

fig, ax = plt.subplots()
plot_kde(sample, ax)
plt.show()

The allocation of probability mass is similar to what was shown by the histogram just above.
20.4 Exercises
Exercise 20.4.1
This model is asymptotically stationary, with a unique stationary distribution.
(See the discussion of stationarity in our lecture on AR(1) processes for background — the fundamental concepts are the
same.)
In particular, the sequence of marginal distributions {𝜓𝑡 } is converging to a unique limiting distribution that does not
depend on initial conditions.
Although we will not prove this here, we can investigate it using simulation.
Your task is to generate and plot the sequence {𝜓𝑡 } at times 𝑡 = 10, 50, 250, 500, 750 based on the discussion above.
(The kernel density estimator is probably the best way to present each distribution.)
You should see convergence, in the sense that differences between successive distributions are getting smaller.
Try different initial conditions to verify that, in the long run, the distribution is invariant across initial conditions.

Below is one possible solution:
The computations involve a lot of CPU cycles so we have tried to write the code efficiently.
This meant writing a specialized function rather than using the class above.
s, S, mu, sigma = firm.s, firm.S, firm.mu, firm.sigma

@njit(parallel=True)
def shift_firms_forward(current_inventory_levels, num_periods):

    num_firms = len(current_inventory_levels)
    new_inventory_levels = np.empty(num_firms)

    for f in prange(num_firms):
        x = current_inventory_levels[f]
        for t in range(num_periods):
            Z = np.random.randn()          # standard normal shock
            D = np.exp(mu + sigma * Z)     # lognormal demand
            if x <= s:
                x = max(S - D, 0)
            else:
                x = max(x - D, 0)
        new_inventory_levels[f] = x

    return new_inventory_levels

x_init = 50
num_firms = 50_000

sample_dates = 0, 10, 50, 250, 500, 750

first_diffs = np.diff(sample_dates)

fig, ax = plt.subplots()

X = np.full(num_firms, x_init)

current_date = 0
for d in first_diffs:
    X = shift_firms_forward(X, d)
    current_date += d
    plot_kde(X, ax, label=f't = {current_date}')

ax.set_xlabel('inventory')
ax.set_ylabel('probability')
ax.legend()
plt.show()

Notice that by 𝑡 = 500 or 𝑡 = 750 the densities are barely changing.

We have reached a reasonable approximation of the stationary density.
You can convince yourself that initial conditions don’t matter by testing a few of them.
For example, try rerunning the code above with all firms starting at 𝑋0 = 20 or 𝑋0 = 80.
Exercise 20.4.2
Using simulation, calculate the probability that firms that start with 𝑋0 = 70 need to order twice or more in the first 50
periods.
You will need a large sample size to get an accurate reading.

Here is one solution.
Again, the computations are relatively intensive so we have written a specialized function rather than using the class
above.
We will also use parallelization across firms.
@njit(parallel=True)
def compute_freq(sim_length=50, x_init=70, num_firms=1_000_000):

    firm_counter = 0  # Records number of firms that restock 2x or more
    for m in prange(num_firms):
        x = x_init
        restock_counter = 0  # Will record number of restocks for firm m

        for t in range(sim_length):
            Z = np.random.randn()          # standard normal shock
            D = np.exp(mu + sigma * Z)     # lognormal demand
            if x <= s:
                x = max(S - D, 0)
                restock_counter += 1
            else:
                x = max(x - D, 0)

        if restock_counter > 1:
            firm_counter += 1

    return firm_counter / num_firms
Note the time the routine takes to run, as well as the output.
%%time
freq = compute_freq()
print(f"Frequency of at least two stock outs = {freq}")
Frequency of at least two stock outs = 0.447305

Wall time: 918 ms
Try switching the parallel flag to False in the jitted function above.
Depending on your system, the difference can be substantial.
(On our desktop machine, the speed up is by a factor of 5.)

CHAPTER
TWENTYONE
LINEAR STATE SPACE MODELS
Contents
• Linear State Space Models

– Overview
– The Linear State Space Model
– Distributions and Moments
– Stationarity and Ergodicity
– Noisy Observations
– Prediction
– Code
– Exercises
“We may regard the present state of the universe as the effect of its past and the cause of its future” – Marquis
de Laplace
21.1 Overview
This lecture introduces the linear state space dynamic system.

The linear state space system is a generalization of the scalar AR(1) process we studied before.
This model is a workhorse that carries a powerful theory of prediction.
Its many applications include:
• representing dynamics of higher-order linear systems
• predicting the position of a system 𝑗 steps into the future
• predicting a geometric sum of future values of a variable like
– non-financial income
– dividends on a stock
– the money supply

– a government deficit or surplus, etc.
• key ingredient of useful models
– Friedman’s permanent income model of consumption smoothing.
– Barro’s model of smoothing total tax collections.
– Rational expectations version of Cagan’s model of hyperinflation.
– Sargent and Wallace’s “unpleasant monetarist arithmetic,” etc.

import numpy as np
import matplotlib.pyplot as plt   # needed for the plotting calls below
from quantecon import LinearStateSpace
from scipy.stats import norm
import random
21.2 The Linear State Space Model
The objects in play are:

• An 𝑛 × 1 vector 𝑥𝑡 denoting the state at time 𝑡 = 0, 1, 2, ….
• An IID sequence of 𝑚 × 1 random vectors 𝑤𝑡 ∼ 𝑁 (0, 𝐼).
• A 𝑘 × 1 vector 𝑦𝑡 of observations at time 𝑡 = 0, 1, 2, ….
• An 𝑛 × 𝑛 matrix 𝐴 called the transition matrix.
• An 𝑛 × 𝑚 matrix 𝐶 called the volatility matrix.
• A 𝑘 × 𝑛 matrix 𝐺 sometimes called the output matrix.
Here is the linear state-space system

$$
\begin{aligned}
x_{t+1} &= A x_t + C w_{t+1} \\
y_t &= G x_t \\
x_0 &\sim N(\mu_0, \Sigma_0)
\end{aligned}
\tag{21.1}
$$
21.2.1 Primitives
The primitives of the model are

1. the matrices 𝐴, 𝐶, 𝐺
2. shock distribution, which we have specialized to 𝑁 (0, 𝐼)
3. the distribution of the initial condition 𝑥0 , which we have set to 𝑁 (𝜇0 , Σ0 )
Given 𝐴, 𝐶, 𝐺 and draws of 𝑥0 and 𝑤1 , 𝑤2 , …, the model (21.1) pins down the values of the sequences {𝑥𝑡 } and {𝑦𝑡 }.
Even without these draws, the primitives 1–3 pin down the probability distributions of {𝑥𝑡 } and {𝑦𝑡 }.
Later we’ll see how to compute these distributions and their moments.
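To make the first statement concrete, here is a minimal NumPy sketch that generates {𝑥𝑡 } and {𝑦𝑡 } from the primitives and a sequence of random draws. The function name `simulate_lss` and the example matrices are assumptions for illustration; the lecture itself relies on the `LinearStateSpace` class from QuantEcon.py.

```python
import numpy as np

def simulate_lss(A, C, G, mu_0, Sigma_0, ts_length, seed=None):
    "Simulate x_{t+1} = A x_t + C w_{t+1}, y_t = G x_t, with w_t ~ N(0, I)."
    A, C, G = (np.asarray(M, dtype=float) for M in (A, C, G))
    n, m = C.shape                        # C must be a 2-D, n x m array
    rng = np.random.default_rng(seed)

    x = np.empty((n, ts_length))
    x[:, 0] = rng.multivariate_normal(mu_0, Sigma_0)   # draw of x_0
    for t in range(ts_length - 1):
        w = rng.standard_normal(m)                     # shock w_{t+1}
        x[:, t + 1] = A @ x[:, t] + C @ w
    y = G @ x                             # observations, one column per date
    return x, y

# Hypothetical primitives with n = 2, m = 1, k = 1
A = [[0.9, 0.1],
     [0.0, 0.5]]
C = [[0.2],
     [0.1]]
G = [[1.0, 0.0]]

x, y = simulate_lss(A, C, G, mu_0=np.zeros(2), Sigma_0=0.1 * np.eye(2),
                    ts_length=50, seed=0)
print(x.shape, y.shape)
```

Fixing the seed fixes the draws of 𝑥0 and 𝑤1 , 𝑤2 , …, and hence the whole path.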

Martingale Difference Shocks
We’ve made the common assumption that the shocks are independent standardized normal vectors.
But some of what we say will be valid under the assumption that {𝑤𝑡+1 } is a martingale difference sequence.
A martingale difference sequence is a sequence that is zero mean when conditioned on past information.
In the present case, since {𝑥𝑡 } is our state sequence, this means that it satisfies
𝔼[𝑤𝑡+1 |𝑥𝑡 , 𝑥𝑡−1 , …] = 0
This is a weaker condition than that {𝑤𝑡 } is IID with 𝑤𝑡+1 ∼ 𝑁 (0, 𝐼).
21.2.2 Examples
By appropriate choice of the primitives, a variety of dynamics can be represented in terms of the linear state space model.
The following examples help to highlight this point.
They also illustrate the wise dictum "finding the state is an art".
Second-order Difference Equation
Let {𝑦𝑡 } be a deterministic sequence that satisfies
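$$
y_{t+1} = \phi_0 + \phi_1 y_t + \phi_2 y_{t-1}
\quad \text{s.t.} \quad y_0, y_{-1} \text{ given}
$$
To map this difference equation into our state space system, we set
$$
x_t = \begin{bmatrix} 1 \\ y_t \\ y_{t-1} \end{bmatrix}
\qquad
A = \begin{bmatrix} 1 & 0 & 0 \\ \phi_0 & \phi_1 & \phi_2 \\ 0 & 1 & 0 \end{bmatrix}
\qquad
C = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\qquad
G = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}
$$
You can confirm that under these definitions, the state space system and the difference equation agree.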
The next figure shows the dynamics of this process when 𝜙0 = 1.1, 𝜙1 = 0.8, 𝜙2 = −0.8, 𝑦0 = 𝑦−1 = 1.
def plot_lss(A,
             C,
             G,
             n=3,
             ts_length=50):

    ar = LinearStateSpace(A, C, G, mu_0=np.ones(n))
    x, y = ar.simulate(ts_length)

    fig, ax = plt.subplots()
    y = y.flatten()
    ax.plot(y, 'b-', lw=2, alpha=0.7)
    ax.grid()
    ax.set_xlabel('time', fontsize=12)
    ax.set_ylabel('$y_t$', fontsize=12)
    plt.show()

ϕ_0, ϕ_1, ϕ_2 = 1.1, 0.8, -0.8
A = [[1, 0, 0 ],
[ϕ_0, ϕ_1, ϕ_2],
[0, 1, 0 ]]
C = np.zeros((3, 1))
G = [0, 1, 0]
plot_lss(A, C, G)
Later you’ll be asked to recreate this figure.
Univariate Autoregressive Processes
We can use (21.1) to represent the model
𝑦𝑡+1 = 𝜙1 𝑦𝑡 + 𝜙2 𝑦𝑡−1 + 𝜙3 𝑦𝑡−2 + 𝜙4 𝑦𝑡−3 + 𝜎𝑤𝑡+1 (21.2)
where {𝑤𝑡 } is IID and standard normal.

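To put this in the linear state space format we take 𝑥𝑡 = [𝑦𝑡 𝑦𝑡−1 𝑦𝑡−2 𝑦𝑡−3 ]′ and
$$
A = \begin{bmatrix}
\phi_1 & \phi_2 & \phi_3 & \phi_4 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\qquad
C = \begin{bmatrix} \sigma \\ 0 \\ 0 \\ 0 \end{bmatrix}
\qquad
G = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}
$$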
The matrix 𝐴 has the form of the companion matrix to the vector [𝜙1 𝜙2 𝜙3 𝜙4 ].
The next figure shows the dynamics of this process when
𝜙1 = 0.5, 𝜙2 = −0.2, 𝜙3 = 0, 𝜙4 = 0.5, 𝜎 = 0.2, 𝑦0 = 𝑦−1 = 𝑦−2 = 𝑦−3 = 1

ϕ_1, ϕ_2, ϕ_3, ϕ_4 = 0.5, -0.2, 0, 0.5

σ = 0.2
A_1 = [[ϕ_1, ϕ_2, ϕ_3, ϕ_4],

[1, 0, 0, 0 ],
[0, 1, 0, 0 ],
[0, 0, 1, 0 ]]
C_1 = [[σ],
[0],
[0],
[0]]
G_1 = [1, 0, 0, 0]
plot_lss(A_1, C_1, G_1, n=4, ts_length=200)
Vector Autoregressions
Now suppose that

• 𝑦𝑡 is a 𝑘 × 1 vector
• 𝜙𝑗 is a 𝑘 × 𝑘 matrix and
• 𝑤𝑡 is 𝑘 × 1
Then (21.2) is termed a vector autoregression.
To map this into (21.1), we set
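$$
x_t = \begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \\ y_{t-3} \end{bmatrix}
\qquad
A = \begin{bmatrix}
\phi_1 & \phi_2 & \phi_3 & \phi_4 \\
I & 0 & 0 & 0 \\
0 & I & 0 & 0 \\
0 & 0 & I & 0
\end{bmatrix}
\qquad
C = \begin{bmatrix} \sigma \\ 0 \\ 0 \\ 0 \end{bmatrix}
\qquad
G = \begin{bmatrix} I & 0 & 0 & 0 \end{bmatrix}
$$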
where 𝐼 is the 𝑘 × 𝑘 identity matrix and 𝜎 is a 𝑘 × 𝑘 matrix.

Seasonals
We can use (21.1) to represent

1. the deterministic seasonal 𝑦𝑡 = 𝑦𝑡−4
2. the indeterministic seasonal 𝑦𝑡 = 𝜙4 𝑦𝑡−4 + 𝑤𝑡
In fact, both are special cases of (21.2).
With the deterministic seasonal, the transition matrix becomes
$$
A = \begin{bmatrix}
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
$$
It is easy to check that 𝐴⁴ = 𝐼, which implies that 𝑥𝑡 is strictly periodic with period 4 (see footnote 1):
$$
x_{t+4} = x_t
$$
Such an 𝑥𝑡 process can be used to model deterministic seasonals in quarterly time series.
The indeterministic seasonal produces recurrent, but aperiodic, seasonal fluctuations.
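The periodicity claim for the deterministic seasonal is easy to confirm numerically. In the sketch below the starting vector is an arbitrary assumption.

```python
import numpy as np

A = np.array([[0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]])

# Fourth power of the transition matrix
A4 = np.linalg.matrix_power(A, 4)

# Iterate x_{t+1} = A x_t from an arbitrary initial state
path = [np.array([1.0, 2.0, 3.0, 4.0])]
for t in range(8):
    path.append(A @ path[-1])

print(A4)
print(path[0], path[4], path[8])
```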
Time Trends
The model 𝑦𝑡 = 𝑎𝑡 + 𝑏 is known as a linear time trend.

We can represent this model in the linear state space form by taking
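$$
A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
\qquad
C = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\qquad
G = \begin{bmatrix} a & b \end{bmatrix}
\tag{21.3}
$$
and starting at initial condition 𝑥0 = [0 1]′.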
In fact, it’s possible to use the state-space system to represent polynomial trends of any order.
For instance, we can represent the model 𝑦𝑡 = 𝑎𝑡2 + 𝑏𝑡 + 𝑐 in the linear state space form by taking
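$$
A = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}
\qquad
C = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\qquad
G = \begin{bmatrix} 2a & a + b & c \end{bmatrix}
$$
and starting at initial condition 𝑥0 = [0 0 1]′.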
It follows that
$$
A^t = \begin{bmatrix} 1 & t & t(t-1)/2 \\ 0 & 1 & t \\ 0 & 0 & 1 \end{bmatrix}
$$
Then 𝑥𝑡′ = [𝑡(𝑡 − 1)/2 𝑡 1]. You can now confirm that 𝑦𝑡 = 𝐺𝑥𝑡 has the correct form.
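The confirmation can also be done numerically. The sketch below picks arbitrary coefficients `a`, `b`, `c` (assumed values) and compares 𝐺𝐴^𝑡 𝑥0 with 𝑎𝑡² + 𝑏𝑡 + 𝑐 for the first few dates.

```python
import numpy as np
from numpy.linalg import matrix_power

a, b, c = 0.5, -1.0, 2.0            # arbitrary trend coefficients (assumed)

A = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
G = np.array([2 * a, a + b, c])
x0 = np.array([0, 0, 1])

ts = np.arange(10)
# y_t = G x_t with x_t = A^t x_0 (the shocks play no role because C = 0)
y_state_space = np.array([G @ matrix_power(A, int(t)) @ x0 for t in ts])
y_direct = a * ts**2 + b * ts + c

print(np.allclose(y_state_space, y_direct))
```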
Footnote 1: The eigenvalues of 𝐴 are (1, −1, 𝑖, −𝑖).

21.2.3 Moving Average Representations
A nonrecursive expression for 𝑥𝑡 as a function of 𝑥0 , 𝑤1 , 𝑤2 , … , 𝑤𝑡 can be found by using (21.1) repeatedly to obtain
$$
\begin{aligned}
x_t &= A x_{t-1} + C w_t \\
    &= A^2 x_{t-2} + A C w_{t-1} + C w_t \\
    &\;\; \vdots \\
    &= \sum_{j=0}^{t-1} A^j C w_{t-j} + A^t x_0
\end{aligned}
$$
The final expression is a moving average representation.

It expresses {𝑥𝑡 } as a linear function of
1. current and past values of the process {𝑤𝑡 } and
2. the initial condition 𝑥0
As an example of a moving average representation, let the model be
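$$
A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
\qquad
C = \begin{bmatrix} 1 \\ 0 \end{bmatrix}
$$
You will be able to show that
$$
A^t = \begin{bmatrix} 1 & t \\ 0 & 1 \end{bmatrix}
\quad \text{and} \quad
A^j C = \begin{bmatrix} 1 & 0 \end{bmatrix}'
$$
Substituting into the moving average representation, we obtain
$$
x_{1t} = \sum_{j=0}^{t-1} w_{t-j} + \begin{bmatrix} 1 & t \end{bmatrix} x_0
$$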
where 𝑥1𝑡 is the first entry of 𝑥𝑡 .

The first term on the right is a cumulated sum of martingale differences and is therefore a martingale.
The second term is a translated linear function of time.
For this reason, 𝑥1𝑡 is called a martingale with drift.
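As a check on the algebra, the following sketch compares the recursion with the moving average formula for this example. The initial condition, horizon and random seed are illustrative assumptions.

```python
import numpy as np
from numpy.linalg import matrix_power

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
C = np.array([[1.0],
              [0.0]])
x0 = np.array([0.5, 1.0])         # hypothetical initial condition
T = 10

rng = np.random.default_rng(0)
w = rng.standard_normal(T + 1)    # w[1], ..., w[T] are used; w[0] is ignored

# Recursive computation: x_t = A x_{t-1} + C w_t
x = x0.copy()
for t in range(1, T + 1):
    x = A @ x + C[:, 0] * w[t]

# Moving average representation: sum_j A^j C w_{T-j} + A^T x_0
x_ma = matrix_power(A, T) @ x0
for j in range(T):
    x_ma = x_ma + matrix_power(A, j) @ C[:, 0] * w[T - j]

# Closed form for the first entry: cumulated shocks plus [1  T] x_0
x1_formula = w[1:].sum() + np.array([1, T]) @ x0

print(x, x_ma, x1_formula)
```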
21.3 Distributions and Moments
21.3.1 Unconditional Moments
Using (21.1), it’s easy to obtain expressions for the (unconditional) means of 𝑥𝑡 and 𝑦𝑡 .
We’ll explain what unconditional and conditional mean soon.
Letting 𝜇𝑡 ∶= 𝔼[𝑥𝑡 ] and using linearity of expectations, we find that
𝜇𝑡+1 = 𝐴𝜇𝑡 with 𝜇0 given (21.4)
Here 𝜇0 is a primitive given in (21.1).

The variance-covariance matrix of 𝑥𝑡 is Σ𝑡 ∶= 𝔼[(𝑥𝑡 − 𝜇𝑡 )(𝑥𝑡 − 𝜇𝑡 )′ ].
Using 𝑥𝑡+1 − 𝜇𝑡+1 = 𝐴(𝑥𝑡 − 𝜇𝑡 ) + 𝐶𝑤𝑡+1 , we can determine this matrix recursively via
Σ𝑡+1 = 𝐴Σ𝑡 𝐴′ + 𝐶𝐶 ′ with Σ0 given (21.5)

As with 𝜇0 , the matrix Σ0 is a primitive given in (21.1).

As a matter of terminology, we will sometimes call
• 𝜇𝑡 the unconditional mean of 𝑥𝑡
• Σ𝑡 the unconditional variance-covariance matrix of 𝑥𝑡
This is to distinguish 𝜇𝑡 and Σ𝑡 from related objects that use conditioning information, to be defined below.
However, you should be aware that these “unconditional” moments do depend on the initial distribution 𝑁 (𝜇0 , Σ0 ).
Moments of the Observables
Using linearity of expectations again we have
𝔼[𝑦𝑡 ] = 𝔼[𝐺𝑥𝑡 ] = 𝐺𝜇𝑡 (21.6)
The variance-covariance matrix of 𝑦𝑡 is easily shown to be
Var[𝑦𝑡 ] = Var[𝐺𝑥𝑡 ] = 𝐺Σ𝑡 𝐺′ (21.7)
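The recursions (21.4) and (21.5), together with (21.6) and (21.7), can be iterated directly. Below is a minimal sketch with hypothetical primitives; because the assumed 𝐴 has all eigenvalues inside the unit circle, the iterates settle down to a fixed point.

```python
import numpy as np

# Hypothetical primitives (n = 2, m = 1, k = 1)
A = np.array([[0.8, 0.1],
              [0.0, 0.5]])
C = np.array([[0.3],
              [0.2]])
G = np.array([[1.0, 0.0]])

mu = np.array([1.0, 1.0])        # mu_0
Sigma = np.zeros((2, 2))         # Sigma_0

T = 200
for t in range(T):
    mu = A @ mu                          # mu_{t+1} = A mu_t
    Sigma = A @ Sigma @ A.T + C @ C.T    # Sigma_{t+1} = A Sigma_t A' + C C'

mean_y = G @ mu                  # E[y_T] = G mu_T
var_y = G @ Sigma @ G.T          # Var[y_T] = G Sigma_T G'

print(mean_y, var_y)
```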
21.3.2 Distributions
In general, knowing the mean and variance-covariance matrix of a random vector is not quite as good as knowing the full
distribution.
However, there are some situations where these moments alone tell us all we need to know.
These are situations in which the mean vector and covariance matrix are all of the parameters that pin down the population
distribution.
One such situation is when the vector in question is Gaussian (i.e., normally distributed).
This is the case here, given
1. our Gaussian assumptions on the primitives
2. the fact that normality is preserved under linear operations
In fact, it’s well-known that
$$
u \sim N(\bar u, S) \quad \text{and} \quad v = a + B u \implies v \sim N(a + B \bar u, B S B') \tag{21.8}
$$
In particular, given our Gaussian assumptions on the primitives and the linearity of (21.1) we can see immediately that
both 𝑥𝑡 and 𝑦𝑡 are Gaussian for all 𝑡 ≥ 0 (see footnote 2).
Since 𝑥𝑡 is Gaussian, to find the distribution, all we need to do is find its mean and variance-covariance matrix.
But in fact we’ve already done this, in (21.4) and (21.5).
Letting 𝜇𝑡 and Σ𝑡 be as defined by these equations, we have
𝑥𝑡 ∼ 𝑁 (𝜇𝑡 , Σ𝑡 ) (21.9)
By similar reasoning combined with (21.6) and (21.7),
𝑦𝑡 ∼ 𝑁 (𝐺𝜇𝑡 , 𝐺Σ𝑡 𝐺′ ) (21.10)
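The fact that affine maps preserve normality, with mean $a + B\bar u$ and covariance $BSB'$, can be illustrated by Monte Carlo. The sketch below assumes NumPy only; all numerical values (`u_bar`, `S`, `a`, `B`, the sample size and the seed) are hypothetical. It checks the first two moments only, not normality itself.

```python
import numpy as np

rng = np.random.default_rng(1234)

u_bar = np.array([1.0, -1.0])
S = np.array([[1.0, 0.3],
              [0.3, 0.5]])
a = np.array([0.5, 0.0])
B = np.array([[2.0, 1.0],
              [0.0, 1.0]])

# Draw u ~ N(u_bar, S), one draw per row, and apply v = a + B u
u = rng.multivariate_normal(u_bar, S, size=200_000)
v = a + u @ B.T

sample_mean = v.mean(axis=0)
sample_cov = np.cov(v, rowvar=False)

print(sample_mean, a + B @ u_bar)   # sample mean vs. a + B u_bar
print(sample_cov)                   # sample covariance ...
print(B @ S @ B.T)                  # ... vs. B S B'
```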

2 The correct way to argue this is by induction. Suppose that $x_t$ is Gaussian. Then (21.1) and (21.8) imply that $x_{t+1}$ is Gaussian. Since $x_0$ is assumed to be Gaussian, it follows that every $x_t$ is Gaussian. Evidently, this implies that each $y_t$ is Gaussian.

21.3.3 Ensemble Interpretations
How should we interpret the distributions defined by (21.9)–(21.10)?

Intuitively, the probabilities in a distribution correspond to relative frequencies in a large population drawn from that
distribution.
Let’s apply this idea to our setting, focusing on the distribution of 𝑦𝑇 for fixed 𝑇 .
We can generate independent draws of 𝑦𝑇 by repeatedly simulating the evolution of the system up to time 𝑇 , using an
independent set of shocks each time.
The next figure shows 20 simulations, producing 20 time series for {𝑦𝑡 }, and hence 20 draws of 𝑦𝑇 .
The system in question is the univariate autoregressive model (21.2).
The values of 𝑦𝑇 are represented by black dots in the left-hand figure
import random
import numpy as np
import matplotlib.pyplot as plt
from quantecon import LinearStateSpace

def cross_section_plot(A,
                       C,
                       G,
                       T=20,              # Set the time
                       ymin=-0.8,
                       ymax=1.25,
                       sample_size=20,    # 20 observations/simulations
                       n=4):              # The number of dimensions for the initial x0
    ar = LinearStateSpace(A, C, G, mu_0=np.ones(n))

    # Left panel: simulated paths; right panel: histogram of y_T
    # (the figure size is an assumption)
    fig, axes = plt.subplots(1, 2, figsize=(16, 5))

    for ax in axes:
        ax.grid(alpha=0.4)

    ax = axes[0]
    ax.vlines((T,), -1.5, 1.5)
    ax.set_xticks((T,))
    ax.set_xticklabels(('$T$',))

    sample = []
    for i in range(sample_size):
        rcolor = random.choice(('c', 'g', 'b', 'k'))
        x, y = ar.simulate(ts_length=T+15)
        y = y.flatten()
        ax.plot(y, color=rcolor, lw=1, alpha=0.5)
        ax.plot((T,), (y[T],), 'ko', alpha=0.5)   # mark y_T on each path
        sample.append(y[T])

    axes[1].set_ylim(ymin, ymax)
    axes[1].set_ylabel('$y_t$', fontsize=12)
    axes[1].set_xlabel('relative frequency', fontsize=12)
    axes[1].hist(sample, bins=16, density=True, orientation='horizontal', alpha=0.5)
    plt.show()

ϕ_1, ϕ_2, ϕ_3, ϕ_4 = 0.5, -0.2, 0, 0.5
σ = 0.1

A_2 = [[ϕ_1, ϕ_2, ϕ_3, ϕ_4],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0]]
C_2 = [[σ], [0], [0], [0]]
G_2 = [1, 0, 0, 0]

cross_section_plot(A_2, C_2, G_2)
In the right-hand figure, these values are converted into a rotated histogram that shows relative frequencies from our
sample of 20 𝑦𝑇 ’s.
Here is another figure, this time with 100 observations
cross_section_plot(A_2, C_2, G_2, sample_size=100)
Let’s now try with 500,000 observations, showing only the histogram (without rotation)
from scipy.stats import norm

ymin = -0.8
ymax = 1.25
sample_size = 500_000

ar = LinearStateSpace(A_2, C_2, G_2, mu_0=np.ones(4))
fig, ax = plt.subplots()
x, y = ar.simulate(sample_size)
mu_x, mu_y, Sigma_x, Sigma_y, Sigma_yx = ar.stationary_distributions()
# .item() extracts the single element, avoiding NumPy's deprecated
# conversion of an array with ndim > 0 to a scalar
f_y = norm(loc=np.asarray(mu_y).item(), scale=np.sqrt(Sigma_y).item())
y = y.flatten()
ygrid = np.linspace(ymin, ymax, 150)

ax.hist(y, bins=50, density=True, alpha=0.4)
ax.plot(ygrid, f_y.pdf(ygrid), 'k-', lw=2, alpha=0.8, label='true density')
ax.set_xlim(ymin, ymax)
ax.set_xlabel('$y_t$', fontsize=12)
ax.set_ylabel('relative frequency', fontsize=12)
plt.show()
The black line is the population density of 𝑦𝑇 calculated from (21.10).

The histogram and population distribution are close, as expected.
By looking at the figures and experimenting with parameters, you will gain a feel for how the population distribution
depends on the model primitives listed above, as intermediated by the distribution’s parameters.

Ensemble Means
In the preceding figure, we approximated the population distribution of 𝑦𝑇 by

1. generating 𝐼 sample paths (i.e., time series) where 𝐼 is a large number
2. recording each observation $y_T^i$
3. histogramming this sample
Just as the histogram approximates the population distribution, the ensemble or cross-sectional average
$$
\bar y_T := \frac{1}{I} \sum_{i=1}^I y_T^i
$$
approximates the expectation 𝔼[𝑦𝑇 ] = 𝐺𝜇𝑇 (as implied by the law of large numbers).
Here’s a simulation comparing the ensemble averages and population means at time points 𝑡 = 0, … , 50.
The parameters are the same as for the preceding figures, and the sample size is relatively small (𝐼 = 20).
I = 20
T = 50
ymin = -0.5
ymax = 1.15

ar = LinearStateSpace(A_2, C_2, G_2, mu_0=np.ones(4))
fig, ax = plt.subplots()
ax.set_ylim(ymin, ymax)

# Accumulate the cross-sectional (ensemble) average over I simulated paths
ensemble_mean = np.zeros(T)
for i in range(I):
    x, y = ar.simulate(ts_length=T)
    y = y.flatten()
    ax.plot(y, 'c-', lw=0.8, alpha=0.5)
    ensemble_mean = ensemble_mean + y

ensemble_mean = ensemble_mean / I
ax.plot(ensemble_mean, color='b', lw=2, alpha=0.8, label=r'$\bar y_t$')

# Population means G mu_t from the moment generator
m = ar.moment_sequence()
population_means = []
for t in range(T):
    μ_x, μ_y, Σ_x, Σ_y = next(m)
    population_means.append(np.asarray(μ_y).item())
ax.plot(population_means, color='g', lw=2, alpha=0.8, label=r'$G\mu_t$')

ax.legend(ncol=2)
plt.show()

The ensemble mean for 𝑥𝑡 is
$$
\bar x_T := \frac{1}{I} \sum_{i=1}^I x_T^i \to \mu_T \qquad (I \to \infty)
$$
The limit 𝜇𝑇 is a “long-run average”.

(By long-run average we mean the average for an infinite (𝐼 = ∞) number of sample 𝑥𝑇 ’s)
Another application of the law of large numbers assures us that
$$
\frac{1}{I} \sum_{i=1}^I (x_T^i - \bar x_T)(x_T^i - \bar x_T)' \to \Sigma_T \qquad (I \to \infty)
$$
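Both limits can be illustrated without any plotting. The following is a minimal sketch, assuming NumPy only and a hypothetical two-dimensional system (`A`, `C`, `mu_0`, `T`, `I` and the seed are illustrative choices); the initial state is fixed at `mu_0`, which corresponds to the degenerate case $\Sigma_0 = 0$.

```python
import numpy as np

rng = np.random.default_rng(42)

A = np.array([[0.8, 0.1],
              [0.0, 0.5]])
C = np.array([[0.3],
              [0.2]])
mu_0 = np.array([1.0, 1.0])
T, I = 10, 100_000

# Population moments at time T from the recursions for mu_t and Sigma_t
mu_T, Sigma_T = mu_0.copy(), np.zeros((2, 2))
for _ in range(T):
    mu_T = A @ mu_T
    Sigma_T = A @ Sigma_T @ A.T + C @ C.T

# Simulate I independent paths up to time T; row i holds x_T^i
x = np.tile(mu_0, (I, 1))
for _ in range(T):
    w = rng.standard_normal((I, 1))
    x = x @ A.T + w @ C.T

# Ensemble mean and ensemble covariance
x_bar = x.mean(axis=0)
Sigma_hat = (x - x_bar).T @ (x - x_bar) / I

print(x_bar, mu_T)
print(Sigma_hat)
print(Sigma_T)
```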
21.3.4 Joint Distributions
In the preceding discussion, we looked at the distributions of 𝑥𝑡 and 𝑦𝑡 in isolation.

This gives us useful information but doesn’t allow us to answer questions like
• what’s the probability that 𝑥𝑡 ≥ 0 for all 𝑡?
• what’s the probability that the process {𝑦𝑡 } exceeds some value 𝑎 before falling below 𝑏?
• etc., etc.
Such questions concern the joint distributions of these sequences.
To compute the joint distribution of 𝑥0 , 𝑥1 , … , 𝑥𝑇 , recall that joint and conditional densities are linked by the rule
𝑝(𝑥, 𝑦) = 𝑝(𝑦 | 𝑥)𝑝(𝑥) (joint = conditional × marginal)
From this rule we get 𝑝(𝑥0 , 𝑥1 ) = 𝑝(𝑥1 | 𝑥0 )𝑝(𝑥0 ).

The Markov property 𝑝(𝑥𝑡 | 𝑥𝑡−1 , … , 𝑥0 ) = 𝑝(𝑥𝑡 | 𝑥𝑡−1 ) and repeated applications of the preceding rule lead us to
$$
p(x_0, x_1, \ldots, x_T) = p(x_0) \prod_{t=0}^{T-1} p(x_{t+1} \mid x_t)
$$

The marginal 𝑝(𝑥0 ) is just the primitive 𝑁 (𝜇0 , Σ0 ).

In view of (21.1), the conditional densities are
𝑝(𝑥𝑡+1 | 𝑥𝑡 ) = 𝑁 (𝐴𝑥𝑡 , 𝐶𝐶 ′ )
Autocovariance Functions
An important object related to the joint distribution is the autocovariance function
Σ𝑡+𝑗,𝑡 ∶= 𝔼[(𝑥𝑡+𝑗 − 𝜇𝑡+𝑗 )(𝑥𝑡 − 𝜇𝑡 )′ ] (21.11)
Elementary calculations show that
$$
\Sigma_{t+j,t} = A^j \Sigma_t \tag{21.12}
$$
Notice that Σ𝑡+𝑗,𝑡 in general depends on both 𝑗, the gap between the two dates, and 𝑡, the earlier date.
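As a rough check of (21.12), the sketch below compares $A^j \Sigma_t$ with a sample autocovariance in a scalar model. It assumes NumPy only; the helper `autocov` and the parameter values are hypothetical.

```python
import numpy as np

def autocov(A, Sigma_t, j):
    """Return Sigma_{t+j,t} = A^j Sigma_t."""
    A, Sigma_t = np.atleast_2d(A), np.atleast_2d(Sigma_t)
    return np.linalg.matrix_power(A, j) @ Sigma_t

rng = np.random.default_rng(0)
a, c = 0.9, 0.5          # scalar model x_{s+1} = a x_s + c w_{s+1}
t, j, I = 5, 3, 200_000

# Simulate I scalar paths from x_0 = 0 and record x_t and x_{t+j}
x = np.zeros(I)
for s in range(1, t + j + 1):
    x = a * x + c * rng.standard_normal(I)
    if s == t:
        x_t = x.copy()
x_tj = x

# Var(x_t) when Sigma_0 = 0
Sigma_t = c**2 * sum(a**(2 * k) for k in range(t))

sample_cov = np.mean((x_tj - x_tj.mean()) * (x_t - x_t.mean()))
theory = autocov([[a]], [[Sigma_t]], j).item()
print(sample_cov, theory)
```

Changing `t` while holding `j` fixed changes `theory`, in line with the remark that the autocovariance depends on both the gap and the earlier date.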
21.4 Stationarity and Ergodicity
Stationarity and ergodicity are two properties that, when they hold, greatly aid analysis of linear state space models.
Let’s start with the intuition.
21.4.1 Visualizing Stability
Let’s look at some more time series from the same model that we analyzed above.
This picture shows cross-sectional distributions for 𝑦 at times 𝑇 , 𝑇 ′ , 𝑇 ″
def cross_plot(A,
               C,
               G,
               steady_state='False',
               T0=10,
               T1=50,
               T2=75,
               T4=100):
    ar = LinearStateSpace(A, C, G, mu_0=np.ones(4))

    # Optionally start x_0 from the stationary distribution
    if steady_state == 'True':
        μ_x, μ_y, Σ_x, Σ_y, Σ_yx = ar.stationary_distributions()
        ar_state = LinearStateSpace(A, C, G, mu_0=μ_x, Sigma_0=Σ_x)

    ymin, ymax = -0.6, 0.6
    fig, ax = plt.subplots()

    ax.grid(alpha=0.4)
    ax.set_ylim(ymin, ymax)
    ax.set_xlabel('$time$', fontsize=12)
    ax.vlines((T0, T1, T2), -1.5, 1.5)

    ax.set_xticks((T0, T1, T2))
    ax.set_xticklabels(("$T$", "$T'$", "$T''$"), fontsize=12)

    for i in range(80):
        rcolor = random.choice(('c', 'g', 'b'))
        if steady_state == 'True':
            x, y = ar_state.simulate(ts_length=T4)
        else:
            x, y = ar.simulate(ts_length=T4)
        y = y.flatten()
        ax.plot(y, color=rcolor, lw=0.8, alpha=0.5)
        ax.plot((T0, T1, T2), (y[T0], y[T1], y[T2],), 'ko', alpha=0.5)

    plt.show()

cross_plot(A_2, C_2, G_2)
Note how the time series “settle down” in the sense that the distributions at 𝑇 ′ and 𝑇 ″ are relatively similar to each other
— but unlike the distribution at 𝑇 .
Apparently, the distributions of 𝑦𝑡 converge to a fixed long-run distribution as 𝑡 → ∞.
When such a distribution exists it is called a stationary distribution.
21.4.2 Stationary Distributions
In our setting, a distribution 𝜓∞ is said to be stationary for 𝑥𝑡 if
𝑥𝑡 ∼ 𝜓 ∞ and 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐶𝑤𝑡+1 ⟹ 𝑥𝑡+1 ∼ 𝜓∞
Since
1. in the present case, all distributions are Gaussian
2. a Gaussian distribution is pinned down by its mean and variance-covariance matrix

we can restate the definition as follows: 𝜓∞ is stationary for 𝑥𝑡 if
𝜓∞ = 𝑁 (𝜇∞ , Σ∞ )
where 𝜇∞ and Σ∞ are fixed points of (21.4) and (21.5) respectively.
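One simple way to find such a fixed point numerically is to iterate the recursion (21.5) until it stops changing. The following is a minimal sketch, assuming NumPy only and a stable $A$; the helper `stationary_cov` is hypothetical (QuantEcon.py provides `stationary_distributions()` for the same purpose), and the parameters copy the fourth-order autoregression used in the figures.

```python
import numpy as np

def stationary_cov(A, C, tol=1e-12, max_iter=100_000):
    """Iterate Sigma <- A Sigma A' + C C' to an approximate fixed point."""
    A, C = np.atleast_2d(A), np.atleast_2d(C)
    Sigma = np.zeros(A.shape)
    for _ in range(max_iter):
        Sigma_new = A @ Sigma @ A.T + C @ C.T
        if np.max(np.abs(Sigma_new - Sigma)) < tol:
            return Sigma_new
        Sigma = Sigma_new
    raise ValueError("iteration did not converge; is A stable?")

phi_1, phi_2, phi_3, phi_4 = 0.5, -0.2, 0, 0.5
sigma = 0.1
A_2 = np.array([[phi_1, phi_2, phi_3, phi_4],
                [1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0]])
C_2 = np.array([[sigma], [0], [0], [0]])

Sigma_inf = stationary_cov(A_2, C_2)
mu_inf = np.zeros(4)   # mu = 0 always satisfies mu = A mu

print(np.abs(np.linalg.eigvals(A_2)).max())   # spectral radius of A_2
print(Sigma_inf)
```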
21.4.3 Covariance Stationary Processes
Let’s see what happens to the preceding figure if we start 𝑥0 at the stationary distribution.
cross_plot(A_2, C_2, G_2, steady_state='True')
Now the differences in the observed distributions at 𝑇 , 𝑇 ′ and 𝑇 ″ come entirely from random fluctuations due to the
finite sample size.
By
• our choosing 𝑥0 ∼ 𝑁 (𝜇∞ , Σ∞ )
• the definitions of 𝜇∞ and Σ∞ as fixed points of (21.4) and (21.5) respectively
we’ve ensured that
𝜇𝑡 = 𝜇∞ and Σ𝑡 = Σ∞ for all 𝑡
Moreover, in view of (21.12), the autocovariance function takes the form $\Sigma_{t+j,t} = A^j \Sigma_\infty$, which depends on $j$ but not on $t$.
This motivates the following definition.
A process {𝑥𝑡 } is said to be covariance stationary if
• both 𝜇𝑡 and Σ𝑡 are constant in 𝑡
• Σ𝑡+𝑗,𝑡 depends on the time gap 𝑗 but not on time 𝑡
In our setting, {𝑥𝑡 } will be covariance stationary if 𝜇0 , Σ0 , 𝐴, 𝐶 assume values that imply that none of 𝜇𝑡 , Σ𝑡 , Σ𝑡+𝑗,𝑡
depends on 𝑡.

21.4.4 Conditions for Stationarity
The Globally Stable Case
The difference equation 𝜇𝑡+1 = 𝐴𝜇𝑡 is known to have unique fixed point 𝜇∞ = 0 if all eigenvalues of 𝐴 have moduli
strictly less than unity.
That is, if (np.absolute(np.linalg.eigvals(A)) < 1).all() == True.
The difference equation (21.5) also has a unique fixed point in this case, and, moreover
𝜇𝑡 → 𝜇 ∞ = 0 and Σ𝑡 → Σ∞ as 𝑡→∞
regardless of the initial conditions 𝜇0 and Σ0 .

This is the globally stable case; see these notes for a more theoretical treatment.
However, global stability is more than we need for stationary solutions, and often more than we want.
To illustrate, consider our second order difference equation example.
Here the state is $x_t = \begin{bmatrix} 1 & y_t & y_{t-1} \end{bmatrix}'$.
Because of the constant first component in the state vector, we will never have 𝜇𝑡 → 0.
How can we find stationary solutions that respect a constant state component?
Processes with a Constant State Component
To investigate such a process, suppose that 𝐴 and 𝐶 take the form
$$
A = \begin{bmatrix} A_1 & a \\ 0 & 1 \end{bmatrix}
\qquad
C = \begin{bmatrix} C_1 \\ 0 \end{bmatrix}
$$
where
• 𝐴1 is an (𝑛 − 1) × (𝑛 − 1) matrix
• 𝑎 is an (𝑛 − 1) × 1 column vector
Let $x_t = \begin{bmatrix} x_{1t}' & 1 \end{bmatrix}'$ where $x_{1t}$ is $(n-1) \times 1$.
It follows that
𝑥1,𝑡+1 = 𝐴1 𝑥1𝑡 + 𝑎 + 𝐶1 𝑤𝑡+1
Let 𝜇1𝑡 = 𝔼[𝑥1𝑡 ] and take expectations on both sides of this expression to get
𝜇1,𝑡+1 = 𝐴1 𝜇1,𝑡 + 𝑎 (21.13)
Assume now that the moduli of the eigenvalues of 𝐴1 are all strictly less than one.
Then (21.13) has a unique stationary solution, namely,
$$
\mu_{1\infty} = (I - A_1)^{-1} a
$$
The stationary value of $\mu_t$ itself is then $\mu_\infty := \begin{bmatrix} \mu_{1\infty}' & 1 \end{bmatrix}'$.
The stationary values of Σ𝑡 and Σ𝑡+𝑗,𝑡 satisfy
$$
\begin{aligned}
\Sigma_\infty &= A \Sigma_\infty A' + C C' \\
\Sigma_{t+j,t} &= A^j \Sigma_\infty
\end{aligned}
$$

Notice that here Σ𝑡+𝑗,𝑡 depends on the time gap 𝑗 but not on calendar time 𝑡.
In conclusion, if
• 𝑥0 ∼ 𝑁 (𝜇∞ , Σ∞ ) and
• the moduli of the eigenvalues of 𝐴1 are all strictly less than unity
then the {𝑥𝑡 } process is covariance stationary, with constant state component.
Note: If the eigenvalues of 𝐴1 are less than unity in modulus, then (a) starting from any initial value, the mean and
variance-covariance matrix both converge to their stationary values; and (b) iterations on (21.5) converge to the fixed
point of the discrete Lyapunov equation in the first line of (21.14).
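To make the constant-state-component case concrete, here is a minimal sketch for a second-order autoregression with an intercept, written in the partitioned form above. It assumes NumPy only, and the parameter values (`phi_0`, `phi_1`, `phi_2`, `sigma`) are hypothetical.

```python
import numpy as np

phi_0, phi_1, phi_2, sigma = 1.1, 0.8, -0.8, 0.1   # hypothetical values

# x_1t = [y_t, y_{t-1}]', with the constant 1 stored as the last state entry
A1 = np.array([[phi_1, phi_2],
               [1.0,   0.0]])
a = np.array([phi_0, 0.0])
C1 = np.array([[sigma],
               [0.0]])

A = np.block([[A1, a.reshape(-1, 1)],
              [np.zeros((1, 2)), np.ones((1, 1))]])
C = np.vstack([C1, [[0.0]]])

# Only A1 needs to be stable; A itself has a unit eigenvalue
moduli_A1 = np.abs(np.linalg.eigvals(A1))
moduli_A = np.abs(np.linalg.eigvals(A))

# Stationary mean: mu_1_inf = (I - A1)^{-1} a, then append the constant
mu_1_inf = np.linalg.solve(np.eye(2) - A1, a)
mu_inf = np.append(mu_1_inf, 1.0)

print(moduli_A1, moduli_A)
print(mu_inf, A @ mu_inf)   # mu_inf is a fixed point of mu -> A mu
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a small numerical-hygiene choice; either approach implements the same formula.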
21.4.5 Ergodicity
Let’s suppose that we’re working with a covariance stationary process.

In this case, we know that the ensemble mean will converge to 𝜇∞ as the sample size 𝐼 approaches infinity.
Averages over Time
Ensemble averages across simulations are interesting theoretically, but in real life, we usually observe only a single realization $\{x_t, y_t\}_{t=0}^T$.
So now let’s take a single realization and form the time-series averages
$$
\bar x := \frac{1}{T} \sum_{t=1}^T x_t
\quad \text{and} \quad
\bar y := \frac{1}{T} \sum_{t=1}^T y_t
$$
Do these time series averages converge to something interpretable in terms of our basic state-space representation?
The answer depends on something called ergodicity.
Ergodicity is the property that time series and ensemble averages coincide.
More formally, ergodicity implies that time series sample averages converge to their expectation under the stationary
distribution.
In particular,
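• $\frac{1}{T} \sum_{t=1}^T x_t \to \mu_\infty$
• $\frac{1}{T} \sum_{t=1}^T (x_t - \bar x_T)(x_t - \bar x_T)' \to \Sigma_\infty$
• $\frac{1}{T} \sum_{t=1}^T (x_{t+j} - \bar x_T)(x_t - \bar x_T)' \to A^j \Sigma_\infty$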
In our linear Gaussian setting, any covariance stationary process is also ergodic.
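The three limits can be illustrated with a single long simulated path. The sketch below assumes NumPy only and a hypothetical scalar process with an intercept, $x_{t+1} = a_0 + a_1 x_t + c\, w_{t+1}$, which is a special case of the constant-state-component setup; the parameter values, path length and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
a0, a1, c = 1.0, 0.5, 0.2      # hypothetical parameters
T = 200_000                    # length of the single realization

# Stationary mean and variance of the scalar process
mu_inf = a0 / (1 - a1)
Sigma_inf = c**2 / (1 - a1**2)

# One realization, with x_0 drawn from the stationary distribution
x = np.empty(T)
x[0] = mu_inf + np.sqrt(Sigma_inf) * rng.standard_normal()
w = rng.standard_normal(T)
for t in range(T - 1):
    x[t + 1] = a0 + a1 * x[t] + c * w[t + 1]

# Time-series averages
x_bar = x.mean()
var_hat = np.mean((x - x_bar)**2)
j = 2
autocov_hat = np.mean((x[j:] - x_bar) * (x[:-j] - x_bar))

print(x_bar, mu_inf)
print(var_hat, Sigma_inf)
print(autocov_hat, a1**j * Sigma_inf)
```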

21.5 Noisy Observations
In some settings, the observation equation 𝑦𝑡 = 𝐺𝑥𝑡 is modified to include an error term.
Often this error term represents the idea that the true state can only be observed imperfectly.
To include an error term in the observation we introduce
• An IID sequence of ℓ × 1 random vectors 𝑣𝑡 ∼ 𝑁 (0, 𝐼).
• A 𝑘 × ℓ matrix 𝐻.
and extend the linear state-space system to

$$
\begin{aligned}
x_{t+1} &= A x_t + C w_{t+1} \\
y_t &= G x_t + H v_t \\
x_0 &\sim N(\mu_0, \Sigma_0)
\end{aligned}
$$
The sequence {𝑣𝑡 } is assumed to be independent of {𝑤𝑡 }.

The process {𝑥𝑡 } is not modified by noise in the observation equation and its moments, distributions and stability properties remain the same.
The unconditional moments of 𝑦𝑡 from (21.6) and (21.7) now become
𝔼[𝑦𝑡 ] = 𝔼[𝐺𝑥𝑡 + 𝐻𝑣𝑡 ] = 𝐺𝜇𝑡 (21.14)
The variance-covariance matrix of 𝑦𝑡 is easily shown to be
Var[𝑦𝑡 ] = Var[𝐺𝑥𝑡 + 𝐻𝑣𝑡 ] = 𝐺Σ𝑡 𝐺′ + 𝐻𝐻 ′ (21.15)
The distribution of 𝑦𝑡 is therefore
𝑦𝑡 ∼ 𝑁 (𝐺𝜇𝑡 , 𝐺Σ𝑡 𝐺′ + 𝐻𝐻 ′ )
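The moments of the noisy observable follow directly from the moments of the state. Here is a minimal sketch, assuming NumPy only; the helper `observable_moments` and the numerical values are hypothetical.

```python
import numpy as np

def observable_moments(G, H, mu_t, Sigma_t):
    """Return E[y_t] = G mu_t and Var[y_t] = G Sigma_t G' + H H'."""
    G, H = np.atleast_2d(G), np.atleast_2d(H)
    mean_y = G @ np.asarray(mu_t, dtype=float)
    var_y = G @ np.asarray(Sigma_t, dtype=float) @ G.T + H @ H.T
    return mean_y, var_y

# Hypothetical state moments at some date t
mu_t = [1.0, 2.0]
Sigma_t = [[2.0, 0.5],
           [0.5, 1.0]]

G = [1.0, 0.0]      # observe the first state entry ...
H = [[0.5]]         # ... plus measurement noise with standard deviation 0.5

mean_y, var_y = observable_moments(G, H, mu_t, Sigma_t)
mean_y0, var_y0 = observable_moments(G, [[0.0]], mu_t, Sigma_t)  # no noise

print(mean_y, var_y)
print(mean_y0, var_y0)
```

Setting `H` to zero recovers (21.6) and (21.7): the mean is unchanged and the variance falls by exactly $HH'$.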
21.6 Prediction
The theory of prediction for linear state space systems is elegant and simple.
21.6.1 Forecasting Formulas – Conditional Means
The natural way to predict variables is to use conditional distributions.

For example, the optimal forecast of 𝑥𝑡+1 given information known at time 𝑡 is
𝔼𝑡 [𝑥𝑡+1 ] ∶= 𝔼[𝑥𝑡+1 ∣ 𝑥𝑡 , 𝑥𝑡−1 , … , 𝑥0 ] = 𝐴𝑥𝑡
The right-hand side follows from 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐶𝑤𝑡+1 and the fact that 𝑤𝑡+1 is zero mean and independent of
𝑥𝑡 , 𝑥𝑡−1 , … , 𝑥0 .
That 𝔼𝑡 [𝑥𝑡+1 ] = 𝔼[𝑥𝑡+1 ∣ 𝑥𝑡 ] is an implication of {𝑥𝑡 } having the Markov property.
The one-step-ahead forecast error is
𝑥𝑡+1 − 𝔼𝑡 [𝑥𝑡+1 ] = 𝐶𝑤𝑡+1

The covariance matrix of the forecast error is
𝔼[(𝑥𝑡+1 − 𝔼𝑡 [𝑥𝑡+1 ])(𝑥𝑡+1 − 𝔼𝑡 [𝑥𝑡+1 ])′ ] = 𝐶𝐶 ′
More generally, we’d like to compute the 𝑗-step ahead forecasts 𝔼𝑡 [𝑥𝑡+𝑗 ] and 𝔼𝑡 [𝑦𝑡+𝑗 ].
With a bit of algebra, we obtain
$$
x_{t+j} = A^j x_t + A^{j-1} C w_{t+1} + A^{j-2} C w_{t+2} + \cdots + A^0 C w_{t+j}
$$
In view of the IID property, current and past state values provide no information about future values of the shock.
Hence 𝔼𝑡 [𝑤𝑡+𝑘 ] = 𝔼[𝑤𝑡+𝑘 ] = 0.
It now follows from linearity of expectations that the 𝑗-step ahead forecast of 𝑥 is
$$
\mathbb{E}_t[x_{t+j}] = A^j x_t
$$
The 𝑗-step ahead forecast of 𝑦 is therefore
$$
\mathbb{E}_t[y_{t+j}] = \mathbb{E}_t[G x_{t+j} + H v_{t+j}] = G A^j x_t
$$
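These forecasting formulas amount to one matrix power and one matrix product. The following is a minimal sketch, assuming NumPy only; the helper `forecast` is hypothetical, and the example reuses the martingale-with-drift matrix $A$ from earlier in the lecture with an illustrative current state.

```python
import numpy as np

def forecast(A, G, x_t, j):
    """Return (E_t[x_{t+j}], E_t[y_{t+j}]) = (A^j x_t, G A^j x_t)."""
    A, G = np.atleast_2d(A), np.atleast_2d(G)
    x_hat = np.linalg.matrix_power(A, j) @ np.asarray(x_t, dtype=float)
    return x_hat, G @ x_hat

A = [[1, 1],
     [0, 1]]
G = [1, 0]

# Hypothetical current state: level 2.0 and drift 0.5
x_hat, y_hat = forecast(A, G, x_t=[2.0, 0.5], j=4)
print(x_hat, y_hat)
```

Note that $H$ does not appear: the measurement noise has zero conditional mean, so it drops out of the forecast of $y$.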
21.6.2 Covariance of Prediction Errors
It is useful to obtain the covariance matrix of the vector of 𝑗-step-ahead prediction errors
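$$
x_{t+j} - \mathbb{E}_t[x_{t+j}] = \sum_{s=0}^{j-1} A^s C w_{t-s+j} \tag{21.16}
$$
Evidently,
$$
V_j := \mathbb{E}_t\left[(x_{t+j} - \mathbb{E}_t[x_{t+j}])(x_{t+j} - \mathbb{E}_t[x_{t+j}])'\right] = \sum_{k=0}^{j-1} A^k C C' (A^k)' \tag{21.17}
$$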
𝑉𝑗 defined in (21.17) can be calculated recursively via 𝑉1 = 𝐶𝐶 ′ and
𝑉𝑗 = 𝐶𝐶 ′ + 𝐴𝑉𝑗−1 𝐴′ , 𝑗≥2 (21.18)
𝑉𝑗 is the conditional covariance matrix of the errors in forecasting 𝑥𝑡+𝑗 , conditioned on time 𝑡 information 𝑥𝑡 .
Under particular conditions, 𝑉𝑗 converges to
𝑉∞ = 𝐶𝐶 ′ + 𝐴𝑉∞ 𝐴′ (21.19)
Equation (21.19) is an example of a discrete Lyapunov equation in the covariance matrix 𝑉∞ .

A sufficient condition for 𝑉𝑗 to converge is that the eigenvalues of 𝐴 be strictly less than one in modulus.
Weaker sufficient conditions for convergence associate eigenvalues equaling or exceeding one in modulus with elements
of 𝐶 that equal 0.
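The recursion (21.18) can be checked against the direct sum (21.17), and iterating it for many steps approximates the solution of the Lyapunov equation (21.19). The sketch below assumes NumPy only; the helper `forecast_error_cov` and the matrices are hypothetical.

```python
import numpy as np

def forecast_error_cov(A, C, j):
    """Compute V_j via V_1 = CC' and V_j = CC' + A V_{j-1} A'."""
    A, C = np.atleast_2d(A), np.atleast_2d(C)
    CC = C @ C.T
    V = CC
    for _ in range(j - 1):
        V = CC + A @ V @ A.T
    return V

# Hypothetical stable system
A = np.array([[0.8, 0.1],
              [0.0, 0.5]])
C = np.array([[0.3],
              [0.2]])

j = 5
V_rec = forecast_error_cov(A, C, j)

# Direct evaluation of the sum in (21.17)
V_sum = sum(np.linalg.matrix_power(A, k) @ C @ C.T @ np.linalg.matrix_power(A, k).T
            for k in range(j))

# Many iterations approximate the fixed point V_inf of (21.19)
V_inf = forecast_error_cov(A, C, 500)

print(V_rec)
print(V_sum)
print(V_inf)
```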
21.7 Code
Our preceding simulations and calculations are based on code in the file lss.py from the QuantEcon.py package.
The code implements a class for handling linear state space models (simulations, calculating moments, etc.).

One Python construct you might not be familiar with is the use of a generator function in the method moment_sequence().
Go back and read the relevant documentation if you’ve forgotten how generator functions work.
Examples of usage are given in the solutions to the exercises.
21.8 Exercises
Exercise 21.8.1
In several contexts, we want to compute forecasts of geometric sums of future random variables governed by the linear
state-space system (21.1).
We want the following objects
• Forecast of a geometric sum of future 𝑥’s, or $\mathbb{E}_t\left[\sum_{j=0}^\infty \beta^j x_{t+j}\right]$.
• Forecast of a geometric sum of future 𝑦’s, or $\mathbb{E}_t\left[\sum_{j=0}^\infty \beta^j y_{t+j}\right]$.
These objects are important components of some famous and interesting dynamic models.
For example,
• if {𝑦𝑡 } is a stream of dividends, then $\mathbb{E}\left[\sum_{j=0}^\infty \beta^j y_{t+j} \mid x_t\right]$ is a model of a stock price
• if {𝑦𝑡 } is the money supply, then $\mathbb{E}\left[\sum_{j=0}^\infty \beta^j y_{t+j} \mid x_t\right]$ is a model of the price level
Show that:
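$$
\mathbb{E}_t\left[\sum_{j=0}^\infty \beta^j x_{t+j}\right] = [I - \beta A]^{-1} x_t
$$
and
$$
\mathbb{E}_t\left[\sum_{j=0}^\infty \beta^j y_{t+j}\right] = G [I - \beta A]^{-1} x_t
$$
What must the modulus of every eigenvalue of 𝐴 be less than?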

Suppose that every eigenvalue of $A$ has modulus strictly less than $\frac{1}{\beta}$.
It then follows that $I + \beta A + \beta^2 A^2 + \cdots = [I - \beta A]^{-1}$.
This leads to our formulas:
• Forecast of a geometric sum of future 𝑥’s
$$
\mathbb{E}_t\left[\sum_{j=0}^\infty \beta^j x_{t+j}\right] = [I + \beta A + \beta^2 A^2 + \cdots] x_t = [I - \beta A]^{-1} x_t
$$
• Forecast of a geometric sum of future 𝑦’s
$$
\mathbb{E}_t\left[\sum_{j=0}^\infty \beta^j y_{t+j}\right] = G [I + \beta A + \beta^2 A^2 + \cdots] x_t = G [I - \beta A]^{-1} x_t
$$
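The closed-form expressions can be compared with a long truncated sum. Here is a minimal sketch, assuming NumPy only; the matrices `A` and `G`, the discount factor `beta` and the state `x_t` are hypothetical values chosen so that the eigenvalue condition holds.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.0, 0.7]])
G = np.array([[1.0, 0.5]])
beta = 0.95
x_t = np.array([1.0, 2.0])

# Condition from the solution: every eigenvalue modulus is below 1 / beta
max_modulus = np.max(np.abs(np.linalg.eigvals(A)))

# Closed forms: [I - beta A]^{-1} x_t and G [I - beta A]^{-1} x_t
closed_x = np.linalg.solve(np.eye(2) - beta * A, x_t)
closed_y = G @ closed_x

# Truncated geometric sum: sum over j of beta^j A^j x_t
series_x = np.zeros(2)
term = x_t.copy()
for _ in range(2000):
    series_x += term
    term = (beta * A) @ term
series_y = G @ series_x

print(max_modulus, 1 / beta)
print(closed_x, series_x)
print(closed_y, series_y)
```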

CHAPTER
TWENTYTWO
SAMUELSON MULTIPLIER-ACCELERATOR
Contents
• Samuelson Multiplier-Accelerator
– Overview
– Details
– Implementation
– Stochastic Shocks
– Government Spending
– Wrapping Everything Into a Class
– Using the LinearStateSpace Class
– Pure Multiplier Model
– Summary
22.1 Overview
This lecture creates non-stochastic and stochastic versions of Paul Samuelson’s celebrated multiplier accelerator model
[Samuelson, 1939].
In doing so, we extend the example of the Solow model class in our second OOP lecture.
Our objectives are to
• provide a more detailed example of OOP and classes
• review a famous model
• review linear difference equations, both deterministic and stochastic

import matplotlib.pyplot as plt
import numpy as np
We’ll also use the following for various tasks described below:
from quantecon import LinearStateSpace

import cmath
import math
import sympy
from sympy import Symbol, init_printing
from cmath import sqrt
22.1.1 Samuelson’s Model
Samuelson used a second-order linear difference equation to represent a model of national output based on three components:
• a national output identity asserting that national output or national income is the sum of consumption plus investment
plus government purchases.
• a Keynesian consumption function asserting that consumption at time 𝑡 is equal to a constant times national output
at time 𝑡 − 1.
• an investment accelerator asserting that investment at time 𝑡 equals a constant called the accelerator coefficient times
the difference in output between period 𝑡 − 1 and 𝑡 − 2.
Consumption plus investment plus government purchases constitute aggregate demand, which automatically calls forth an
equal amount of aggregate supply.
(To read about linear difference equations see here or chapter IX of [Sargent, 1987].)
Samuelson used the model to analyze how particular values of the marginal propensity to consume and the accelerator
coefficient might give rise to transient business cycles in national output.
Possible dynamic properties include
• smooth convergence to a constant level of output
• damped business cycles that eventually converge to a constant level of output
• persistent business cycles that neither dampen nor explode
Later we present an extension that adds a random shock to the right side of the national income identity representing
random fluctuations in aggregate demand.
This modification makes national output become governed by a second-order stochastic linear difference equation that,
with appropriate parameter values, gives rise to recurrent irregular business cycles.
(To read about stochastic linear difference equations see chapter XI of [Sargent, 1987].)

22.2 Details
Let’s assume that

• {𝐺𝑡 } is a sequence of levels of government expenditures – we’ll start by setting 𝐺𝑡 = 𝐺 for all 𝑡.
• {𝐶𝑡 } is a sequence of levels of aggregate consumption expenditures, a key endogenous variable in the model.
• {𝐼𝑡 } is a sequence of rates of investment, another key endogenous variable.
• {𝑌𝑡 } is a sequence of levels of national income, yet another endogenous variable.
• 𝑎 is the marginal propensity to consume in the Keynesian consumption function 𝐶𝑡 = 𝑎𝑌𝑡−1 + 𝛾.
• 𝑏 is the “accelerator coefficient” in the “investment accelerator” 𝐼𝑡 = 𝑏(𝑌𝑡−1 − 𝑌𝑡−2 ).
• {𝜖𝑡 } is an IID sequence of standard normal random variables.
• 𝜎 ≥ 0 is a “volatility” parameter — setting 𝜎 = 0 recovers the non-stochastic case that we’ll start with.
The model combines the consumption function
𝐶𝑡 = 𝑎𝑌𝑡−1 + 𝛾 (22.1)
with the investment accelerator
𝐼𝑡 = 𝑏(𝑌𝑡−1 − 𝑌𝑡−2 ) (22.2)
and the national income identity
𝑌𝑡 = 𝐶𝑡 + 𝐼𝑡 + 𝐺𝑡 (22.3)
• The parameter 𝑎 is people’s marginal propensity to consume out of income - equation (22.1) asserts that people
consume a fraction 𝑎 ∈ (0, 1) of each additional dollar of income.
• The parameter 𝑏 > 0 is the investment accelerator coefficient - equation (22.2) asserts that people invest in physical
capital when income is increasing and disinvest when it is decreasing.
Equations (22.1), (22.2), and (22.3) imply the following second-order linear difference equation for national income:
𝑌𝑡 = (𝑎 + 𝑏)𝑌𝑡−1 − 𝑏𝑌𝑡−2 + (𝛾 + 𝐺𝑡 )
or
𝑌𝑡 = 𝜌1 𝑌𝑡−1 + 𝜌2 𝑌𝑡−2 + (𝛾 + 𝐺𝑡 ) (22.4)
where 𝜌1 = (𝑎 + 𝑏) and 𝜌2 = −𝑏.
To complete the model, we require two initial conditions.
If the model is to generate time series for $t = 0, \ldots, T$, we require initial values
$$
Y_{-1} = \bar Y_{-1}, \quad Y_{-2} = \bar Y_{-2}
$$
We’ll ordinarily set the parameters $(a, b)$ so that starting from an arbitrary pair of initial conditions $(\bar Y_{-1}, \bar Y_{-2})$, national income $Y_t$ converges to a constant value as $t$ becomes large.
We are interested in studying
• the transient fluctuations in 𝑌𝑡 as it converges to its steady state level
• the rate at which it converges to a steady state level
The deterministic version of the model described so far — meaning that no random shocks hit aggregate demand — has
only transient fluctuations.
We can convert the model to one that has persistent irregular fluctuations by adding a random shock to aggregate demand.

22.2.1 Stochastic Version of the Model
We create a random or stochastic version of the model by adding a random process of shocks or disturbances {𝜎𝜖𝑡 }
to the right side of equation (22.4), leading to the second-order scalar linear stochastic difference equation:
𝑌𝑡 = 𝛾 + 𝐺𝑡 + (𝑎 + 𝑏)𝑌𝑡−1 − 𝑏𝑌𝑡−2 + 𝜎𝜖𝑡 (22.5)
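A direct simulation of the difference equation obtained by adding the shock $\sigma \epsilon_t$ to the right side of (22.4) can be sketched as follows. This is a minimal illustration assuming NumPy only and constant government spending; the function name `simulate_samuelson` and its signature are hypothetical, and it is not the class developed later in this lecture. The parameter values mirror defaults that appear in the code further below.

```python
import numpy as np

def simulate_samuelson(a, b, gamma, G, y_m1, y_m2, T, sigma=0.0, seed=None):
    """Simulate Y_t = rho1 Y_{t-1} + rho2 Y_{t-2} + gamma + G + sigma eps_t
    for t = 0, ..., T, with rho1 = a + b, rho2 = -b and constant G.
    """
    rho1, rho2 = a + b, -b
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T + 1)
    Y = np.empty(T + 1)
    prev1, prev2 = y_m1, y_m2          # Y_{t-1} and Y_{t-2}
    for t in range(T + 1):
        Y[t] = rho1 * prev1 + rho2 * prev2 + gamma + G + sigma * eps[t]
        prev1, prev2 = Y[t], prev1
    return Y

# Non-stochastic case (sigma = 0)
Y = simulate_samuelson(a=0.92, b=0.5, gamma=10, G=0,
                       y_m1=80, y_m2=100, T=400)

# Steady state solves Y = (a + b) Y - b Y + gamma + G
steady_state = (10 + 0) / (1 - 0.92)

# Stochastic case: same parameters with sigma > 0
Y_stoch = simulate_samuelson(a=0.92, b=0.5, gamma=10, G=0,
                             y_m1=80, y_m2=100, T=400, sigma=2.0, seed=0)

print(Y[0], Y[-1], steady_state)
```

With $\sigma = 0$ the path settles at the steady state $(\gamma + G)/(1 - a)$, while with $\sigma > 0$ it keeps fluctuating around that level.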
22.2.2 Mathematical Analysis of the Model
To get started, let’s set 𝐺𝑡 ≡ 0, 𝜎 = 0, and 𝛾 = 0.

Then we can write equation (22.5) as
𝑌𝑡 = 𝜌1 𝑌𝑡−1 + 𝜌2 𝑌𝑡−2
or
𝑌𝑡+2 − 𝜌1 𝑌𝑡+1 − 𝜌2 𝑌𝑡 = 0 (22.6)
To discover the properties of the solution of (22.6), it is useful first to form the characteristic polynomial for (22.6):
$$
z^2 - \rho_1 z - \rho_2 \tag{22.7}
$$
where 𝑧 is possibly a complex number.

We want to find the two zeros (a.k.a. roots) – namely 𝜆1 , 𝜆2 – of the characteristic polynomial.
These are two special values of 𝑧, say 𝑧 = 𝜆1 and 𝑧 = 𝜆2 , such that if we set 𝑧 equal to one of these values in expression
(22.7), the characteristic polynomial (22.7) equals zero:
$$
z^2 - \rho_1 z - \rho_2 = (z - \lambda_1)(z - \lambda_2) = 0 \tag{22.8}
$$
Equation (22.8) is said to factor the characteristic polynomial.

When the roots are complex, they will occur as a complex conjugate pair.
When the roots are complex, it is convenient to represent them in the polar form
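$$
\lambda_1 = r e^{i\omega}, \quad \lambda_2 = r e^{-i\omega}
$$
where 𝑟 is the amplitude of the complex number and 𝜔 is its angle or phase.
These can also be represented as
$$
\begin{aligned}
\lambda_1 &= r(\cos(\omega) + i \sin(\omega)) \\
\lambda_2 &= r(\cos(\omega) - i \sin(\omega))
\end{aligned}
$$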
(To read about the polar form, see here)
Given initial conditions 𝑌−1 , 𝑌−2 , we want to generate a solution of the difference equation (22.6).
It can be represented as
$$
Y_t = \lambda_1^t c_1 + \lambda_2^t c_2
$$
where 𝑐1 and 𝑐2 are constants that depend on the two initial conditions and on 𝜌1 , 𝜌2 .
When the roots are complex, it is useful to pursue the following calculations.

Notice that
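$$
\begin{aligned}
Y_t &= c_1 (r e^{i\omega})^t + c_2 (r e^{-i\omega})^t \\
    &= c_1 r^t e^{i\omega t} + c_2 r^t e^{-i\omega t} \\
    &= c_1 r^t [\cos(\omega t) + i \sin(\omega t)] + c_2 r^t [\cos(\omega t) - i \sin(\omega t)] \\
    &= (c_1 + c_2) r^t \cos(\omega t) + i (c_1 - c_2) r^t \sin(\omega t)
\end{aligned}
$$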
The only way that 𝑌𝑡 can be a real number for each 𝑡 is if 𝑐1 + 𝑐2 is a real number and 𝑐1 − 𝑐2 is an imaginary number.
This happens only when 𝑐1 and 𝑐2 are complex conjugates, in which case they can be written in the polar forms
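$$
c_1 = v e^{i\theta}, \quad c_2 = v e^{-i\theta}
$$
So we can write
$$
\begin{aligned}
Y_t &= v e^{i\theta} r^t e^{i\omega t} + v e^{-i\theta} r^t e^{-i\omega t} \\
    &= v r^t [e^{i(\omega t + \theta)} + e^{-i(\omega t + \theta)}] \\
    &= 2 v r^t \cos(\omega t + \theta)
\end{aligned}
$$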
where 𝑣 and 𝜃 are constants that must be chosen to satisfy initial conditions for 𝑌−1 , 𝑌−2 .
This formula shows that when the roots are complex, $Y_t$ displays oscillations with period $\check p = \frac{2\pi}{\omega}$ and damping factor $r$.
We say that $\check p$ is the period because in that amount of time the cosine wave $\cos(\omega t + \theta)$ goes through exactly one complete cycle.
(Draw a cosine function to convince yourself of this please)
Remark: Following [Samuelson, 1939], we want to choose the parameters 𝑎, 𝑏 of the model so that the absolute values
(of the possibly complex) roots 𝜆1 , 𝜆2 of the characteristic polynomial are both strictly less than one:
|𝜆𝑗 | < 1 for 𝑗 = 1, 2
Remark: When both roots 𝜆1 , 𝜆2 of the characteristic polynomial have absolute values strictly less than one, the absolute
value of the larger one governs the rate of convergence to the steady state of the non stochastic version of the model.
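The algebra above can be verified numerically: compute the roots, recover $r$, $\omega$ and the period, solve for $c_1, c_2$ from the initial conditions, and compare both closed forms with direct iteration of the difference equation. The following is a minimal sketch, assuming NumPy only; the values of `a`, `b` and the initial conditions are hypothetical and chosen so that the roots are complex.

```python
import numpy as np

a, b = 0.6, 0.9                 # hypothetical values giving complex roots
rho1, rho2 = a + b, -b
y_m1, y_m2 = 1.0, 0.5           # hypothetical initial conditions Y_{-1}, Y_{-2}

# Zeros of the characteristic polynomial z^2 - rho1 z - rho2
lam1, lam2 = np.roots([1, -rho1, -rho2])
r = abs(lam1)                   # common modulus of the conjugate pair
omega = abs(np.angle(lam1))     # angle (phase) in radians
period = 2 * np.pi / omega

# Solve Y_{-1} = c1/lam1 + c2/lam2 and Y_{-2} = c1/lam1^2 + c2/lam2^2
M = np.array([[lam1**-1, lam2**-1],
              [lam1**-2, lam2**-2]])
c1, c2 = np.linalg.solve(M, np.array([y_m1, y_m2], dtype=complex))

T = 30
ts = np.arange(T)

# Closed form Y_t = c1 lam1^t + c2 lam2^t (imaginary parts cancel)
closed_form = (c1 * lam1**ts + c2 * lam2**ts).real

# Cosine form Y_t = 2 v r^t cos(omega t + theta), with v = |c1|, theta = arg(c1);
# the signed angle of lam1 is used so that it pairs correctly with c1
cos_form = 2 * abs(c1) * r**ts * np.cos(np.angle(lam1) * ts + np.angle(c1))

# Direct iteration of Y_t = rho1 Y_{t-1} + rho2 Y_{t-2}
direct = np.empty(T)
prev1, prev2 = y_m1, y_m2
for t in range(T):
    direct[t] = rho1 * prev1 + rho2 * prev2
    prev1, prev2 = direct[t], prev1

print(r, omega, period)
print(np.max(np.abs(closed_form - direct)), np.max(np.abs(cos_form - direct)))
```

Because the product of the roots equals $-\rho_2 = b$, the modulus here is $r = \sqrt{b}$, so the oscillations are damped whenever $b < 1$.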
22.2.3 Things This Lecture Does
We write a function to generate simulations of a {𝑌𝑡 } sequence as a function of time.

The function requires that we put in initial conditions for 𝑌−1 , 𝑌−2 .
The function checks that 𝑎, 𝑏 are set so that 𝜆1 , 𝜆2 are less than unity in absolute value (also called “modulus”).
The function also tells us whether the roots are complex, and, if they are complex, returns both their real and imaginary parts.
If the roots are both real, the function returns their values.
We also use a function to simulate paths that are stochastic (when 𝜎 > 0).
We have written the function in a way that allows us to input {𝐺𝑡 } paths of a few simple forms, e.g.,
• one time jumps in 𝐺 at some time
• a permanent jump in 𝐺 that occurs at some time
We proceed to use the Samuelson multiplier-accelerator model as a laboratory to make a simple OOP example.
The “state” that determines next period’s 𝑌𝑡+1 is now not just the current value 𝑌𝑡 but also the once lagged value 𝑌𝑡−1 .
This involves a little more bookkeeping than is required in the Solow model class definition.

We use the Samuelson multiplier-accelerator model as a vehicle for teaching how we can gradually add more features to
the class.
We want to have a method in the class that automatically generates a simulation, either non-stochastic (𝜎 = 0) or stochastic
(𝜎 > 0).
We also show how to map the Samuelson model into a simple instance of the LinearStateSpace class described
here.
We can use a LinearStateSpace instance to do various things that we did above with our homemade function and
class.
Among other things, we show by example that the eigenvalues of the matrix 𝐴 that we use to form the instance of the
LinearStateSpace class for the Samuelson model equal the roots of the characteristic polynomial (22.7) for the
Samuelson multiplier accelerator model.
Here is the formula for the matrix 𝐴 in the linear state space system in the case that government expenditures are a
constant 𝐺:
$$
A = \begin{bmatrix} 1 & 0 & 0 \\ \gamma + G & \rho_1 & \rho_2 \\ 0 & 1 & 0 \end{bmatrix}
$$
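The link between the eigenvalues of this matrix and the roots of the characteristic polynomial can be checked directly. The sketch below assumes NumPy only, and the parameter values are hypothetical. Since the first state component is the constant 1, the matrix also has a unit eigenvalue in addition to the two roots.

```python
import numpy as np

a, b, gamma, G = 0.6, 0.9, 10.0, 5.0     # hypothetical parameter values
rho1, rho2 = a + b, -b

A = np.array([[1.0,       0.0,  0.0],
              [gamma + G, rho1, rho2],
              [0.0,       1.0,  0.0]])

eigs = np.linalg.eigvals(A)
roots = np.roots([1, -rho1, -rho2])      # zeros of z^2 - rho1 z - rho2

# Each root, plus the unit eigenvalue from the constant, should match
# some eigenvalue of A up to rounding error
targets = np.append(roots, 1.0)
max_gap = max(np.min(np.abs(eigs - z)) for z in targets)

print(eigs)
print(roots)
print(max_gap)
```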
22.3 Implementation
We’ll start by drawing an informative graph from page 189 of [Sargent, 1987]
def param_plot():
    """This function creates the graph on page 189 of
    Sargent Macroeconomic Theory, second edition, 1987.
    """
    fig, ax = plt.subplots(figsize=(10, 6))   # figure size is an assumption
    ax.set_aspect('equal')

    # Set axis
    xmin, ymin = -3, -2
    xmax, ymax = -xmin, -ymin
    plt.axis([xmin, xmax, ymin, ymax])

    # Set axis labels
    ax.set(xticks=[], yticks=[])
    ax.set_xlabel(r'$\rho_2$', fontsize=16)
    ax.xaxis.set_label_position('top')
    ax.set_ylabel(r'$\rho_1$', rotation=0, fontsize=16)
    ax.yaxis.set_label_position('right')

    # Draw (t1, t2) points
    ρ1 = np.linspace(-2, 2, 100)
    ax.plot(ρ1, -abs(ρ1) + 1, c='black')
    ax.plot(ρ1, np.full_like(ρ1, -1), c='black')
    ax.plot(ρ1, -(ρ1**2 / 4), c='black')

    # Turn normal axes off
    for spine in ['left', 'bottom', 'top', 'right']:
        ax.spines[spine].set_visible(False)

    # Add arrows to represent axes
    axes_arrows = {'arrowstyle': '<|-|>', 'lw': 1.3}
    ax.annotate('', xy=(xmin, 0), xytext=(xmax, 0), arrowprops=axes_arrows)
    ax.annotate('', xy=(0, ymin), xytext=(0, ymax), arrowprops=axes_arrows)

    # Annotate the plot with equations
    plot_arrowsl = {'arrowstyle': '-|>', 'connectionstyle': "arc3, rad=-0.2"}
    plot_arrowsr = {'arrowstyle': '-|>', 'connectionstyle': "arc3, rad=0.2"}
    ax.annotate(r'$\rho_1 + \rho_2 < 1$', xy=(0.5, 0.3), xytext=(0.8, 0.6),
                arrowprops=plot_arrowsr, fontsize=12)
    ax.annotate(r'$\rho_1 + \rho_2 = 1$', xy=(0.38, 0.6), xytext=(0.6, 0.8),
                arrowprops=plot_arrowsr, fontsize=12)
    ax.annotate(r'$\rho_2 < 1 + \rho_1$', xy=(-0.5, 0.3), xytext=(-1.3, 0.6),
                arrowprops=plot_arrowsl, fontsize=12)
    # The arrowprops lines of the next two calls were lost in extraction;
    # the left-arrow style used here is an assumption
    ax.annotate(r'$\rho_2 = 1 + \rho_1$', xy=(-0.38, 0.6), xytext=(-1, 0.8),
                arrowprops=plot_arrowsl, fontsize=12)
    ax.annotate(r'$\rho_2 = -1$', xy=(1.5, -1), xytext=(1.8, -1.3),
                arrowprops=plot_arrowsl, fontsize=12)
    ax.annotate(r'${\rho_1}^2 + 4\rho_2 = 0$', xy=(1.15, -0.35),
                xytext=(1.5, -0.3), arrowprops=plot_arrowsr, fontsize=12)
    ax.annotate(r'${\rho_1}^2 + 4\rho_2 < 0$', xy=(1.4, -0.7),
                xytext=(1.8, -0.6), arrowprops=plot_arrowsr, fontsize=12)

    # Label categories of solutions
    ax.text(1.5, 1, 'Explosive\n growth', ha='center', fontsize=16)
    ax.text(-1.5, 1, 'Explosive\n oscillations', ha='center', fontsize=16)
    ax.text(0.05, -1.5, 'Explosive oscillations', ha='center', fontsize=16)
    ax.text(0.09, -0.5, 'Damped oscillations', ha='center', fontsize=16)

    # Add small marker to y-axis
    ax.axhline(y=1.005, xmin=0.495, xmax=0.505, c='black')
    ax.text(-0.12, -1.12, '-1', fontsize=10)
    ax.text(-0.12, 0.98, '1', fontsize=10)

    return fig

param_plot()
plt.show()

The graph portrays regions in which the (𝜆1 , 𝜆2 ) root pairs implied by the (𝜌1 = (𝑎 + 𝑏), 𝜌2 = −𝑏) difference equation
parameter pairs in the Samuelson model are such that:
• (𝜆1 , 𝜆2 ) are complex with modulus less than 1 - in this case, the {𝑌𝑡 } sequence displays damped oscillations.
• (𝜆1 , 𝜆2 ) are both real, but one is strictly greater than 1 - this leads to explosive growth.
• (𝜆1 , 𝜆2 ) are both real, but one is strictly less than −1 - this leads to explosive oscillations.
• (𝜆1 , 𝜆2 ) are both real and both are less than 1 in absolute value - in this case, there is smooth convergence to the
steady state without damped cycles.
Later we’ll present the graph with a red mark showing the particular point implied by the setting of (𝑎, 𝑏).
22.3.1 Function to Describe Implications of Characteristic Polynomial
def categorize_solution(ρ1, ρ2):
    """This function takes values of ρ1 and ρ2 and uses them
    to classify the type of solution
    """
    discriminant = ρ1 ** 2 + 4 * ρ2
    if ρ2 > 1 + ρ1 or ρ2 < -1:
        print('Explosive oscillations')
    elif ρ1 + ρ2 > 1:
        print('Explosive growth')
    elif discriminant < 0:
        print('Roots are complex with modulus less than one; '
              'therefore damped oscillations')
    else:
        print('Roots are real and absolute values are less than one; '
              'therefore get smooth convergence to a steady state')
### Test the categorize_solution function
categorize_solution(1.3, -.4)
Roots are real and absolute values are less than one; therefore get smooth convergence to a steady state
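As a cross-check, the same classification can be read off the roots themselves. The sketch below is not part of the original lecture code, and the helper name `roots_of` is hypothetical. It computes the zeros of z² − ρ1 z − ρ2 with `np.roots` and checks that (ρ1, ρ2) = (1.3, −0.4) gives two real roots inside the unit circle.

```python
import numpy as np

def roots_of(ρ1, ρ2):
    """Zeros of the characteristic polynomial z**2 - ρ1*z - ρ2."""
    return np.roots([1, -ρ1, -ρ2])

ρ1, ρ2 = 1.3, -0.4
roots = roots_of(ρ1, ρ2)
discriminant = ρ1**2 + 4 * ρ2

print(roots)                        # the two roots
print(discriminant >= 0)            # True when the roots are real
print(np.all(np.abs(roots) < 1))    # True when both lie inside the unit circle
```

Working by hand, the roots are (1.3 ± 0.3)/2, that is, 0.8 and 0.5, which agrees with the "smooth convergence" label.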
22.3.2 Function for Plotting Paths
A useful function for our work below is
def plot_y(function=None):
    """Function plots path of Y_t"""
    plt.subplots(figsize=(10, 6))
    plt.plot(function)
    plt.xlabel('Time $t$')
    plt.ylabel('$Y_t$', rotation=0)
    plt.grid()
    plt.show()
22.3.3 Manual or “by hand” Root Calculations
The following function calculates roots of the characteristic polynomial using high school algebra.
(We’ll calculate the roots in other ways later)
The function also plots a 𝑌𝑡 starting from initial conditions that we set
# This is a 'manual' method
def y_nonstochastic(y_0=100, y_1=80, α=.92, β=.5, γ=10, n=80):
    """Takes values of parameters and computes the roots of characteristic
    polynomial. It tells whether they are real or complex and whether they
    are less than unity in absolute value. It also computes a simulation of
    length n starting from the two given initial conditions for national
    income
    """
    roots = []
    ρ1 = α + β
    ρ2 = -β

    print(f'ρ_1 is {ρ1}')
    print(f'ρ_2 is {ρ2}')

    discriminant = ρ1 ** 2 + 4 * ρ2

    # Roots of z**2 - ρ1*z - ρ2 = 0 are (ρ1 ± sqrt(discriminant)) / 2
    if discriminant == 0:
        roots.append(ρ1 / 2)
        print('Single real root: ')
        print(''.join(str(roots)))
    elif discriminant > 0:
        roots.append((ρ1 + sqrt(discriminant).real) / 2)
        roots.append((ρ1 - sqrt(discriminant).real) / 2)
        print('Two real roots: ')
        print(''.join(str(roots)))
    else:
        roots.append((ρ1 + sqrt(discriminant)) / 2)
        roots.append((ρ1 - sqrt(discriminant)) / 2)
        print('Two complex roots: ')
        print(''.join(str(roots)))

    if all(abs(root) < 1 for root in roots):
        print('Absolute values of roots are less than one')
    else:
        print('Absolute values of roots are not less than one')

    def transition(x, t): return ρ1 * x[t - 1] + ρ2 * x[t - 2] + γ

    y_t = [y_0, y_1]

    for t in range(2, n):
        y_t.append(transition(y_t, t))

    return y_t

plot_y(y_nonstochastic())
ρ_1 is 1.42
ρ_2 is -0.5
Two real roots:
[0.7740312423743284, 0.6459687576256715]
Absolute values of roots are less than one

22.3.4 Reverse-Engineering Parameters to Generate Damped Cycles
The next cell writes code that takes as inputs the modulus 𝑟 and phase 𝜙 of a conjugate pair of complex numbers in polar
form
𝜆1 = 𝑟 exp(𝑖𝜙), 𝜆2 = 𝑟 exp(−𝑖𝜙)
• The code assumes that these two complex numbers are the roots of the characteristic polynomial
• It then reverse-engineers (𝑎, 𝑏) and (𝜌1 , 𝜌2 ), pairs that would generate those roots
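The reverse-engineering step can also be written out by hand. The derivation below is a worked restatement, not part of the original text; it matches coefficients in $z^2 - \rho_1 z - \rho_2 = (z - \lambda_1)(z - \lambda_2)$ and then uses $\rho_1 = a + b$, $\rho_2 = -b$:

```latex
\rho_1 = \lambda_1 + \lambda_2 = r e^{i\phi} + r e^{-i\phi} = 2 r \cos\phi,
\qquad
\rho_2 = -\lambda_1 \lambda_2 = -r^2,
\qquad
b = -\rho_2 = r^2,
\qquad
a = \rho_1 - b = 2 r \cos\phi - r^2 .
```

For example, $r = 0.95$ and $\phi = 2\pi/10$ give $b = 0.9025$ and $a \approx 0.6346$.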
### code to reverse-engineer a cycle
### y_t = r^t (c_1 cos(ϕ t) + c2 sin(ϕ t))
###

def f(r, ϕ):
    """
    Takes modulus r and angle ϕ of complex number r exp(j ϕ)
    and creates ρ1 and ρ2 of characteristic polynomial for which
    r exp(j ϕ) and r exp(- j ϕ) are complex roots.

    Returns ρ1 and ρ2 together with the multiplier coefficient a and
    the accelerator coefficient b that generate those roots.
    """
    g1 = cmath.rect(r, ϕ)   # Generate two complex roots
    g2 = cmath.rect(r, -ϕ)
    ρ1 = g1 + g2            # Implied ρ1, ρ2
    ρ2 = -g1 * g2
    b = -ρ2                 # Reverse-engineer a and b that validate these
    a = ρ1 - b
    return ρ1, ρ2, a, b

## Now let's use the function in an example
## Here are the example parameters

r = .95
period = 10                 # Length of cycle in units of time
ϕ = 2 * math.pi/period

## Apply the function
ρ1, ρ2, a, b = f(r, ϕ)

print(f"a, b = {a}, {b}")
print(f"ρ1, ρ2 = {ρ1}, {ρ2}")
a, b = (0.6346322893124001+0j), (0.9024999999999999-0j)
ρ1, ρ2 = (1.5371322893124+0j), (-0.9024999999999999+0j)
## Print the real components of ρ1 and ρ2
ρ1 = ρ1.real
ρ2 = ρ2.real
ρ1, ρ2
(1.5371322893124, -0.9024999999999999)
22.3.5 Root Finding Using Numpy
Here we’ll use numpy to compute the roots of the characteristic polynomial
r1, r2 = np.roots([1, -ρ1, -ρ2])

p1 = cmath.polar(r1)
p2 = cmath.polar(r2)

print(f"r, ϕ = {r}, {ϕ}")
print(f"p1, p2 = {p1}, {p2}")
# print(f"g1, g2 = {g1}, {g2}")

print(f"a, b = {a}, {b}")
print(f"ρ1, ρ2 = {ρ1}, {ρ2}")
r, ϕ = 0.95, 0.6283185307179586
p1, p2 = (0.95, 0.6283185307179586), (0.95, -0.6283185307179586)
a, b = (0.6346322893124001+0j), (0.9024999999999999-0j)
ρ1, ρ2 = 1.5371322893124, -0.9024999999999999

##=== This method uses numpy to calculate roots ===#
def y_nonstochastic(y_0=100, y_1=80, α=.9, β=.8, γ=10, n=80):
    """ Rather than computing the roots of the characteristic
    polynomial by hand as we did earlier, this function
    enlists numpy to do the work for us
    """
    # Useful constants
    ρ1 = α + β
    ρ2 = -β

    categorize_solution(ρ1, ρ2)

    # Find roots of polynomial
    roots = np.roots([1, -ρ1, -ρ2])
    print(f'Roots are {roots}')

    # Check if real or complex
    if all(isinstance(root, complex) for root in roots):
        print('Roots are complex')
    else:
        print('Roots are real')

    # Check if roots are less than one
    if all(abs(root) < 1 for root in roots):
        print('Roots are less than one')
    else:
        print('Roots are not less than one')

    # Define transition equation
    def transition(x, t): return ρ1 * x[t - 1] + ρ2 * x[t - 2] + γ

    # Set initial conditions
    y_t = [y_0, y_1]

    # Generate y_t series
    for t in range(2, n):
        y_t.append(transition(y_t, t))

    return y_t

plot_y(y_nonstochastic())
Roots are complex with modulus less than one; therefore damped oscillations
Roots are [0.85+0.27838822j 0.85-0.27838822j]
Roots are complex
Roots are less than one

22.3.6 Reverse-Engineered Complex Roots: Example
The next cell studies the implications of reverse-engineered complex roots.

We’ll generate an undamped cycle of period 10
r = 1   # Generates undamped, nonexplosive cycles

## Apply the reverse-engineering function f
ρ1, ρ2, a, b = f(r, ϕ)

# Drop the imaginary part so that it is a valid input into y_nonstochastic
a = a.real
b = b.real

print(f"a, b = {a}, {b}")

ytemp = y_nonstochastic(α=a, β=b, y_0=20, y_1=30)
plot_y(ytemp)
a, b = 0.6180339887498949, 1.0
Roots are [0.80901699+0.58778525j 0.80901699-0.58778525j]
Roots are complex
Roots are not less than one
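One way to confirm that the cycle really has period 10 is to check that the simulated path repeats itself every 10 steps. The sketch below is an illustrative check, not part of the original lecture. It rebuilds ρ1 and ρ2 directly from r = 1 and ϕ = 2π/10 and iterates the difference equation from the same initial conditions, Y_0 = 20 and Y_1 = 30, with γ = 10.

```python
import numpy as np

r, period = 1.0, 10
ϕ = 2 * np.pi / period

# Coefficients implied by the conjugate roots r*exp(±iϕ)
ρ1 = 2 * r * np.cos(ϕ)
ρ2 = -r**2
γ = 10

# Iterate Y_t = ρ1*Y_{t-1} + ρ2*Y_{t-2} + γ
n = 60
y = np.empty(n)
y[0], y[1] = 20, 30
for t in range(2, n):
    y[t] = ρ1 * y[t - 1] + ρ2 * y[t - 2] + γ

# With modulus 1 and phase 2π/10 the path should repeat every 10 periods
repeats = np.allclose(y[period:], y[:-period])
print(repeats)
```

Because the modulus is exactly one, deviations from the steady state neither shrink nor grow, so the comparison holds up to floating-point error.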

22.3.7 Digression: Using Sympy to Find Roots
We can also use sympy to compute analytic formulas for the roots
init_printing()
r1 = Symbol("ρ_1")
r2 = Symbol("ρ_2")
z = Symbol("z")
sympy.solve(z**2 - r1*z - r2, z)
$$\left[ \frac{\rho_1}{2} - \frac{\sqrt{\rho_1^2 + 4\rho_2}}{2}, \; \frac{\rho_1}{2} + \frac{\sqrt{\rho_1^2 + 4\rho_2}}{2} \right]$$
a = Symbol("α")
b = Symbol("β")
r1 = a + b
r2 = -b
sympy.solve(z**2 - r1*z - r2, z)
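$$\left[ \frac{\alpha}{2} + \frac{\beta}{2} - \frac{\sqrt{\alpha^2 + 2\alpha\beta + \beta^2 - 4\beta}}{2}, \; \frac{\alpha}{2} + \frac{\beta}{2} + \frac{\sqrt{\alpha^2 + 2\alpha\beta + \beta^2 - 4\beta}}{2} \right]$$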

22.4 Stochastic Shocks
Now we’ll construct some code to simulate the stochastic version of the model that emerges when we add a random shock
process to aggregate demand
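def y_stochastic(y_0=0, y_1=0, α=0.8, β=0.2, γ=10, n=100, σ=5):
    """This function takes parameters of a stochastic version of
    the model and proceeds to analyze the roots of the characteristic
    polynomial and also generate a simulation.
    """
    # Useful constants
    ρ1 = α + β
    ρ2 = -β

    # Categorize solution
    categorize_solution(ρ1, ρ2)

    # Find roots of polynomial
    roots = np.roots([1, -ρ1, -ρ2])
    print(roots)

    # Check if real or complex
    if all(isinstance(root, complex) for root in roots):
        print('Roots are complex')
    else:
        print('Roots are real')

    # Check if roots are less than one
    if all(abs(root) < 1 for root in roots):
        print('Roots are less than one')
    else:
        print('Roots are not less than one')

    # Generate shocks
    ϵ = np.random.normal(0, 1, n)

    # Define transition equation
    def transition(x, t): return ρ1 * \
        x[t - 1] + ρ2 * x[t - 2] + γ + σ * ϵ[t]

    # Set initial conditions
    y_t = [y_0, y_1]

    # Generate y_t series
    for t in range(2, n):
        y_t.append(transition(y_t, t))

    return y_t

plot_y(y_stochastic())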
[0.7236068 0.2763932]


Roots are real
Let’s do a simulation in which there are shocks and the characteristic polynomial has complex roots
r = .97

### Apply the reverse-engineering function f
ρ1, ρ2, a, b = f(r, ϕ)

# Drop the imaginary part so that it is a valid input into y_stochastic
a = a.real
b = b.real

print(f"a, b = {a}, {b}")

plot_y(y_stochastic(y_0=40, y_1=42, α=a, β=b, σ=2, n=100))
a, b = 0.6285929690873979, 0.9409000000000001
[0.78474648+0.57015169j 0.78474648-0.57015169j]
Roots are complex

22.5 Government Spending
This function computes a response to either a permanent or one-off increase in government expenditures
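def y_stochastic_g(y_0=20,
                   y_1=20,
                   α=0.8,
                   β=0.2,
                   γ=10,
                   n=100,
                   σ=2,
                   g=0,
                   g_t=0,
                   duration='permanent'):
    """This program computes a response to a permanent or one-off
    increase in government expenditures that occurs at time g_t
    """
    # Useful constants
    ρ1 = α + β
    ρ2 = -β

    # Categorize solution
    categorize_solution(ρ1, ρ2)

    # Find roots of polynomial
    roots = np.roots([1, -ρ1, -ρ2])
    print(roots)

    # Check if real or complex
    if all(isinstance(root, complex) for root in roots):
        print('Roots are complex')
    else:
        print('Roots are real')

    # Check if roots are less than one
    if all(abs(root) < 1 for root in roots):
        print('Roots are less than one')
    else:
        print('Roots are not less than one')

    # Generate shocks
    ϵ = np.random.normal(0, 1, n)

    def transition(x, t, g):
        # Non-stochastic - separated to avoid generating random series
        # when not needed
        if σ == 0:
            return ρ1 * x[t - 1] + ρ2 * x[t - 2] + γ + g
        # Stochastic
        else:
            return ρ1 * x[t - 1] + ρ2 * x[t - 2] + γ + g + σ * ϵ[t]

    # Create list and set initial conditions
    y_t = [y_0, y_1]

    # Generate y_t series
    for t in range(2, n):
        # No government spending
        if g == 0:
            y_t.append(transition(y_t, t, g=0))
        # Government spending (no shock)
        elif g != 0 and duration is None:
            y_t.append(transition(y_t, t, g=g))
        # Permanent government spending shock
        elif duration == 'permanent':
            if t < g_t:
                y_t.append(transition(y_t, t, g=0))
            else:
                y_t.append(transition(y_t, t, g=g))
        # One-off government spending shock
        elif duration == 'one-off':
            if t == g_t:
                y_t.append(transition(y_t, t, g=g))
            else:
                y_t.append(transition(y_t, t, g=0))

    return y_t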

A permanent government spending shock can be simulated as follows
plot_y(y_stochastic_g(g=10, g_t=20, duration='permanent'))
[0.7236068 0.2763932]
Roots are real
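The long-run effect of the permanent shock can be worked out from the difference equation itself. With shocks switched off and government spending constant at g, a steady state Y* satisfies Y* = ρ1 Y* + ρ2 Y* + γ + g, so Y* = (γ + g)/(1 − ρ1 − ρ2) = (γ + g)/(1 − α). The sketch below is an illustrative check under the assumption σ = 0, and the helper `steady_state` is a hypothetical name. It starts output at the pre-shock steady state and confirms that it settles at the post-shock one.

```python
# Default parameters of y_stochastic_g, with shocks switched off (σ = 0)
α, β, γ, g = 0.8, 0.2, 10, 10
ρ1, ρ2 = α + β, -β

def steady_state(g):
    """Fixed point of Y = ρ1*Y + ρ2*Y + γ + g."""
    return (γ + g) / (1 - ρ1 - ρ2)

# Start at the pre-shock steady state and switch on g at t = 20
y = [steady_state(0), steady_state(0)]
for t in range(2, 200):
    g_now = g if t >= 20 else 0
    y.append(ρ1 * y[t - 1] + ρ2 * y[t - 2] + γ + g_now)

print(steady_state(0), steady_state(g), y[-1])
```

With these parameters the steady state moves from 10/0.2 = 50 to 20/0.2 = 100, so a permanent increase of 10 in spending raises long-run output by 50, the familiar multiplier 1/(1 − α) = 5.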
We can also see the response to a one time jump in government expenditures
plot_y(y_stochastic_g(g=500, g_t=50, duration='one-off'))
[0.7236068 0.2763932]
Roots are real

22.6 Wrapping Everything Into a Class
Up to now, we have written functions to do the work.

Now we’ll roll up our sleeves and write a Python class called Samuelson for the Samuelson model
class Samuelson():
    r"""This class represents the Samuelson model, otherwise known as the
    multiplier-accelerator model. The model combines the Keynesian multiplier
    with the accelerator theory of investment.

    The path of output is governed by a linear second-order difference equation

    .. math::

        Y_t = \gamma + g + (\alpha + \beta) Y_{t-1} - \beta Y_{t-2}
              + \sigma \epsilon_t

    Parameters
    ----------
    y_0 : scalar
        Initial condition for Y_0
    y_1 : scalar
        Initial condition for Y_1
    α : scalar
        Marginal propensity to consume
    β : scalar
        Accelerator coefficient
    γ : scalar
        Constant term in the difference equation
    n : int
        Number of iterations
    σ : scalar
        Volatility parameter. It must be greater than or equal to 0. Set
        equal to 0 for a non-stochastic model.
    g : scalar
        Government spending shock
    g_t : int
        Time at which government spending shock occurs. Must be specified
        when duration != None.
    duration : {None, 'permanent', 'one-off'}
        Specifies type of government spending shock. If None, government
        spending is equal to g for all t.
    """

    def __init__(self,
                 y_0=100,
                 y_1=50,
                 α=1.3,
                 β=0.2,
                 γ=10,
                 n=100,
                 σ=0,
                 g=0,
                 g_t=0,
                 duration=None):

        self.y_0, self.y_1, self.α, self.β = y_0, y_1, α, β
        self.n, self.g, self.g_t, self.duration = n, g, g_t, duration
        self.γ, self.σ = γ, σ
        self.ρ1 = α + β
        self.ρ2 = -β
        self.roots = np.roots([1, -self.ρ1, -self.ρ2])

    def root_type(self):
        if all(isinstance(root, complex) for root in self.roots):
            return 'Complex conjugate'
        elif len(self.roots) > 1:
            return 'Double real'
        else:
            return 'Single real'

    def root_less_than_one(self):
        return all(abs(root) < 1 for root in self.roots)

    def solution_type(self):
        ρ1, ρ2 = self.ρ1, self.ρ2
        discriminant = ρ1 ** 2 + 4 * ρ2
        if ρ2 >= 1 + ρ1 or ρ2 <= -1:
            return 'Explosive oscillations'
        elif ρ1 + ρ2 >= 1:
            return 'Explosive growth'
        elif discriminant < 0:
            return 'Damped oscillations'
        else:
            return 'Steady state'

    def _transition(self, x, t, g):
        # Non-stochastic - separated to avoid generating random series
        # when not needed
        if self.σ == 0:
            return self.ρ1 * x[t - 1] + self.ρ2 * x[t - 2] + self.γ + g
        # Stochastic: draw one standard normal shock per period
        else:
            ϵ = np.random.normal(0, 1)
            return self.ρ1 * x[t - 1] + self.ρ2 * x[t - 2] + self.γ + g \
                + self.σ * ϵ

    def generate_series(self):
        # Create list and set initial conditions
        y_t = [self.y_0, self.y_1]

        # Generate y_t series
        for t in range(2, self.n):
            # No government spending
            if self.g == 0:
                y_t.append(self._transition(y_t, t, g=0))
            # Government spending (no shock)
            elif self.g != 0 and self.duration is None:
                y_t.append(self._transition(y_t, t, g=self.g))
            # Permanent government spending shock
            elif self.duration == 'permanent':
                if t < self.g_t:
                    y_t.append(self._transition(y_t, t, g=0))
                else:
                    y_t.append(self._transition(y_t, t, g=self.g))
            # One-off government spending shock
            elif self.duration == 'one-off':
                if t == self.g_t:
                    y_t.append(self._transition(y_t, t, g=self.g))
                else:
                    y_t.append(self._transition(y_t, t, g=0))
        return y_t

    def summary(self):
        print('Summary\n' + '-' * 50)
        print(f'Root type: {self.root_type()}')
        print(f'Solution type: {self.solution_type()}')
        print(f'Roots: {str(self.roots)}')

        if self.root_less_than_one():
            print('Absolute value of roots is less than one')
        else:
            print('Absolute value of roots is not less than one')

        if self.σ > 0:
            print('Stochastic series with σ = ' + str(self.σ))
        else:
            print('Non-stochastic series')

        if self.g != 0:
            print('Government spending equal to ' + str(self.g))

        if self.duration is not None:
            print(self.duration.capitalize() +
                  ' government spending shock at t = ' + str(self.g_t))

    def plot(self):
        fig, ax = plt.subplots(figsize=(10, 6))
        ax.plot(self.generate_series())
        ax.set(xlabel='Iteration', xlim=(0, self.n))
        ax.set_ylabel('$Y_t$', rotation=0)
        ax.grid()

        # Add parameter values to plot
        paramstr = (f'$\\alpha={self.α:.2f}$ \n $\\beta={self.β:.2f}$ \n '
                    f'$\\gamma={self.γ:.2f}$ \n $\\sigma={self.σ:.2f}$ \n '
                    f'$\\rho_1={self.ρ1:.2f}$ \n $\\rho_2={self.ρ2:.2f}$')
        props = dict(fc='white', pad=10, alpha=0.5)
        ax.text(0.87, 0.05, paramstr, transform=ax.transAxes,
                fontsize=12, bbox=props, va='bottom')

        return fig

    def param_plot(self):
        # Uses the param_plot() function defined earlier (it is then able
        # to be used standalone or as part of the model)
        fig = param_plot()
        ax = fig.gca()

        # Add λ values to legend
        for i, root in enumerate(self.roots):
            if isinstance(root, complex):
                # The '+' format flag prints the sign of the imaginary part
                label = (rf'$\lambda_{i+1} = {root.real:.2f} '
                         rf'{root.imag:+.2f}i$')
            else:
                label = rf'$\lambda_{i+1} = {root.real:.2f}$'
            ax.scatter(0, 0, 0, label=label)  # dummy to add to legend

        # Add ρ pair to plot
        ax.scatter(self.ρ1, self.ρ2, 100, 'red', '+',
                   label=r'$(\ \rho_1, \ \rho_2 \ )$', zorder=5)

        plt.legend(fontsize=12, loc=3)

        return fig

22.6.1 Illustration of Samuelson Class
Now we’ll put our Samuelson class to work on an example
sam = Samuelson(α=0.8, β=0.5, σ=2, g=10, g_t=20, duration='permanent')

sam.summary()
Summary
--------------------------------------------------
Root type: Complex conjugate
Solution type: Damped oscillations
Roots: [0.65+0.27838822j 0.65-0.27838822j]
Absolute value of roots is less than one
Stochastic series with σ = 2
Government spending equal to 10
Permanent government spending shock at t = 20
sam.plot()
plt.show()

22.6.2 Using the Graph
We’ll use our graph to show where the parameters lie and how their location is consistent with the behavior of the path just graphed.
The red + sign shows the location of the (𝜌1 , 𝜌2 ) pair implied by the parameters, and the legend reports the corresponding roots.
sam.param_plot()
plt.show()
22.7 Using the LinearStateSpace Class
It turns out that we can use the QuantEcon.py LinearStateSpace class to do much of the work that we have done from
scratch above.
Here is how we map the Samuelson model into an instance of a LinearStateSpace class
"""This script maps the Samuelson model into the
``LinearStateSpace`` class
"""
α = 0.8
β = 0.9
ρ1 = α + β
ρ2 = -β
γ = 10
σ = 1
g = 10
n = 100

A = [[1,     0,  0],
     [γ + g, ρ1, ρ2],
     [0,     1,  0]]

G = [[γ + g, ρ1, ρ2],   # this is Y_{t+1}
     [γ,     α,  0],    # this is C_{t+1}
     [0,     β,  -β]]   # this is I_{t+1}

μ_0 = [1, 100, 50]
C = np.zeros((3, 1))
C[1] = σ  # stochastic

sam_t = LinearStateSpace(A, C, G, mu_0=μ_0)

x, y = sam_t.simulate(ts_length=n)

fig, axes = plt.subplots(3, 1, sharex=True, figsize=(12, 8))
titles = ['Output ($Y_t$)', 'Consumption ($C_t$)', 'Investment ($I_t$)']
colors = ['darkblue', 'red', 'purple']
for ax, series, title, color in zip(axes, y, titles, colors):
    ax.plot(series, color=color)
    ax.set(title=title, xlim=(0, n))
    ax.grid()

axes[-1].set_xlabel('Iteration')

plt.show()

22.7.1 Other Methods in the LinearStateSpace Class
Let’s plot impulse response functions for the instance of the Samuelson model using a method in the LinearStateSpace class
imres = sam_t.impulse_response()
imres = np.asarray(imres)
y1 = imres[:, :, 0]
y2 = imres[:, :, 1]
y1.shape
(2, 6, 1)
Now let’s compute the zeros of the characteristic polynomial by simply calculating the eigenvalues of 𝐴
A = np.asarray(A)
w, v = np.linalg.eig(A)
print(w)
[0.85+0.42130749j 0.85-0.42130749j 1. +0.j ]
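The link between the eigenvalues of 𝐴 and the characteristic polynomial can be verified numerically. The sketch below is an illustrative check, not part of the original lecture. It rebuilds 𝐴 from the same parameter values and compares `np.linalg.eigvals(A)` with `np.roots([1, -ρ1, -ρ2])`.

```python
import numpy as np

α, β, γ, g = 0.8, 0.9, 10, 10
ρ1, ρ2 = α + β, -β

A = np.array([[1,     0,  0],
              [γ + g, ρ1, ρ2],
              [0,     1,  0]])

eigs = np.linalg.eigvals(A)
char_roots = np.roots([1, -ρ1, -ρ2])

# Every root of z**2 - ρ1*z - ρ2 should appear among the eigenvalues of A;
# the remaining eigenvalue, 1, comes from the constant in the state vector
matched = all(np.isclose(eigs, root).any() for root in char_roots)
has_unit = np.isclose(eigs, 1).any()
print(matched, has_unit)
```

The match is a consequence of 𝐴 being block lower triangular: its eigenvalues are 1 together with those of the lower-right 2 × 2 block, whose characteristic polynomial is z² − ρ1 z − ρ2.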

22.7.2 Inheriting Methods from LinearStateSpace
We could also create a subclass of LinearStateSpace (inheriting all its methods and attributes) to add more functions
to use
class SamuelsonLSS(LinearStateSpace):
    """
    This subclass creates a Samuelson multiplier-accelerator model
    as a linear state space system.
    """
    def __init__(self,
                 y_0=100,
                 y_1=50,
                 α=0.8,
                 β=0.9,
                 γ=10,
                 σ=1,
                 g=10):

        self.α, self.β = α, β
        self.y_0, self.y_1, self.g = y_0, y_1, g
        self.γ, self.σ = γ, σ

        # Define initial conditions
        self.μ_0 = [1, y_0, y_1]

        self.ρ1 = α + β
        self.ρ2 = -β

        # Define transition matrix
        self.A = [[1,     0,       0],
                  [γ + g, self.ρ1, self.ρ2],
                  [0,     1,       0]]

        # Define output matrix
        self.G = [[γ + g, self.ρ1, self.ρ2],   # this is Y_{t+1}
                  [γ,     α,       0],         # this is C_{t+1}
                  [0,     β,       -β]]        # this is I_{t+1}

        self.C = np.zeros((3, 1))
        self.C[1] = σ  # stochastic

        # Initialize LSS with parameters from Samuelson model
        LinearStateSpace.__init__(self, self.A, self.C, self.G, mu_0=self.μ_0)

    def plot_simulation(self, ts_length=100, stationary=True):

        # Temporarily store original parameters
        temp_mu = self.mu_0
        temp_Sigma = self.Sigma_0

        # Set distribution parameters equal to their stationary
        # values for simulation
        if stationary == True:
            try:
                self.mu_x, self.mu_y, self.Sigma_x, self.Sigma_y, self.Sigma_yx = \
                    self.stationary_distributions()
                self.mu_0 = self.mu_x
                self.Sigma_0 = self.Sigma_x
            # Exception where no convergence achieved when
            # calculating stationary distributions
            except ValueError:
                print('Stationary distribution does not exist')

        x, y = self.simulate(ts_length)

        fig, axes = plt.subplots(3, 1, sharex=True, figsize=(12, 8))
        titles = ['Output ($Y_t$)', 'Consumption ($C_t$)', 'Investment ($I_t$)']
        colors = ['darkblue', 'red', 'purple']
        for ax, series, title, color in zip(axes, y, titles, colors):
            ax.plot(series, color=color)
            ax.set(title=title, xlim=(0, ts_length))
            ax.grid()

        axes[-1].set_xlabel('Iteration')

        # Reset distribution parameters to their initial values
        self.mu_0 = temp_mu
        self.Sigma_0 = temp_Sigma

        return fig

    def plot_irf(self, j=5):

        x, y = self.impulse_response(j)

        # Reshape into 3 x j matrix for plotting purposes
        yimf = np.array(y).flatten().reshape(j+1, 3).T

        fig, axes = plt.subplots(3, 1, sharex=True, figsize=(12, 8))
        labels = ['$Y_t$', '$C_t$', '$I_t$']
        colors = ['darkblue', 'red', 'purple']
        for ax, series, label, color in zip(axes, yimf, labels, colors):
            ax.plot(series, color=color)
            ax.set(xlim=(0, j))
            ax.set_ylabel(label, rotation=0, fontsize=14, labelpad=10)
            ax.grid()

        axes[0].set_title('Impulse Response Functions')
        axes[-1].set_xlabel('Iteration')

        return fig

    def multipliers(self, j=5):
        x, y = self.impulse_response(j)
        return np.sum(np.array(y).flatten().reshape(j+1, 3), axis=0)

22.7.3 Illustrations
Let’s show how we can use the SamuelsonLSS
samlss = SamuelsonLSS()
samlss.plot_simulation(100, stationary=False)
plt.show()
samlss.plot_simulation(100, stationary=True)
plt.show()

samlss.plot_irf(100)
plt.show()

samlss.multipliers()
array([7.414389, 6.835896, 0.578493])
22.8 Pure Multiplier Model
Let’s shut down the accelerator by setting 𝑏 = 0 to get a pure multiplier model
• the absence of cycles gives an idea about why Samuelson included the accelerator
pure_multiplier = SamuelsonLSS(α=0.95, β=0)
pure_multiplier.plot_simulation()


pure_multiplier = SamuelsonLSS(α=0.8, β=0)
pure_multiplier.plot_simulation()

pure_multiplier.plot_irf(100)


22.9 Summary
In this lecture, we wrote functions and classes to represent non-stochastic and stochastic versions of the multiplier-accelerator model of [Samuelson, 1939].
We saw that different parameter values lead to different output paths, which can converge smoothly to a steady state, oscillate with damped or explosive amplitude, or grow explosively.
We also were able to represent the model using the QuantEcon.py LinearStateSpace class.

CHAPTER
TWENTYTHREE
KESTEN PROCESSES AND FIRM DYNAMICS
Contents
• Kesten Processes and Firm Dynamics

– Overview
– Kesten Processes
– Heavy Tails
– Application: Firm Dynamics
– Exercises

!pip install --upgrade yfinance
23.1 Overview
Previously we learned about linear scalar-valued stochastic processes (AR(1) models).

Now we generalize these linear models slightly by allowing the multiplicative coefficient to be stochastic.
Such processes are known as Kesten processes after German–American mathematician Harry Kesten (1931–2019).
Although simple to write down, Kesten processes are interesting for at least two reasons:
1. A number of significant economic processes are or can be described as Kesten processes.
2. Kesten processes generate interesting dynamics, including, in some cases, heavy-tailed cross-sectional distributions.
We will discuss these issues as we go along.

import matplotlib.pyplot as plt
import numpy as np
The following two lines are only added to avoid a FutureWarning caused by compatibility issues between pandas and
matplotlib.
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
Additional technical background related to this lecture can be found in the monograph of [Buraczewski et al., 2016].
23.2 Kesten Processes
A Kesten process is a stochastic process of the form
𝑋𝑡+1 = 𝑎𝑡+1 𝑋𝑡 + 𝜂𝑡+1 (23.1)
where {𝑎𝑡 }𝑡≥1 and {𝜂𝑡 }𝑡≥1 are IID sequences.

We are interested in the dynamics of {𝑋𝑡 }𝑡≥0 when 𝑋0 is given.
We will focus on the nonnegative scalar case, where 𝑋𝑡 takes values in ℝ+ .
In particular, we will assume that
• the initial condition 𝑋0 is nonnegative,
• {𝑎𝑡 }𝑡≥1 is a nonnegative IID stochastic process and
• {𝜂𝑡 }𝑡≥1 is another nonnegative IID stochastic process, independent of the first.
23.2.1 Example: GARCH Volatility
The GARCH model is common in financial applications, where time series such as asset returns exhibit time varying
volatility.
For example, consider the following plot of daily returns on the Nasdaq Composite Index for the period 1st January 2006
to 1st November 2019.
import yfinance as yf
s = yf.download('^IXIC', '2006-1-1', '2019-11-1')['Adj Close']
r = s.pct_change()
fig, ax = plt.subplots()
ax.plot(r, alpha=0.7)
ax.set_ylabel('returns', fontsize=12)
ax.set_xlabel('date', fontsize=12)
plt.show()
[*********************100%%**********************] 1 of 1 completed

Notice how the series exhibits bursts of volatility (high variance) and then settles down again.
GARCH models can replicate this feature.
The GARCH(1, 1) volatility process takes the form
$$\sigma_{t+1}^2 = \alpha_0 + \sigma_t^2 (\alpha_1 \xi_{t+1}^2 + \beta) \tag{23.2}$$
where $\{\xi_t\}$ is IID with $\mathbb{E} \xi_t^2 = 1$ and all parameters are positive.
Returns on a given asset are then modeled as
𝑟𝑡 = 𝜎𝑡 𝜁𝑡 (23.3)
where {𝜁𝑡 } is again IID and independent of {𝜉𝑡 }.

The volatility sequence $\{\sigma_t^2\}$, which drives the dynamics of returns, is a Kesten process.
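To see the Kesten structure explicitly, one can match (23.2) to (23.1) term by term. The identification below is a worked restatement rather than part of the original text:

```latex
X_t = \sigma_t^2,
\qquad
a_{t+1} = \alpha_1 \xi_{t+1}^2 + \beta,
\qquad
\eta_{t+1} = \alpha_0 .
```

Here $\{a_t\}$ is IID and nonnegative because $\{\xi_t\}$ is IID and the parameters are positive, while $\{\eta_t\}$ is a constant (and hence trivially IID) sequence.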
23.2.2 Example: Wealth Dynamics
Suppose that a given household saves a fixed fraction 𝑠 of its current wealth in every period.
The household earns labor income 𝑦𝑡 at the start of time 𝑡.
Wealth then evolves according to
𝑤𝑡+1 = 𝑅𝑡+1 𝑠𝑤𝑡 + 𝑦𝑡+1 (23.4)
where {𝑅𝑡 } is the gross rate of return on assets.

If {𝑅𝑡 } and {𝑦𝑡 } are both IID, then (23.4) is a Kesten process.
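A short simulation makes the wealth recursion concrete. The sketch below is illustrative only. The function name `wealth_path`, the savings rate, and the lognormal distributions for returns and labor income are all assumptions, since the text only requires the two sequences to be IID.

```python
import numpy as np

def wealth_path(w_0=1.0, s=0.9, ts_length=200, seed=0):
    """Simulate w_{t+1} = R_{t+1} * s * w_t + y_{t+1}.

    R and y are lognormal here purely for illustration; the text only
    requires them to be IID.
    """
    rng = np.random.default_rng(seed)
    w = np.empty(ts_length)
    w[0] = w_0
    for t in range(ts_length - 1):
        R = np.exp(0.02 + 0.1 * rng.standard_normal())   # gross return
        y = np.exp(rng.standard_normal())                # labor income
        w[t + 1] = R * s * w[t] + y
    return w

w = wealth_path()
print(w[:5])
```

In the notation of (23.1), the multiplicative coefficient is 𝑅𝑡+1 𝑠 and the additive term is 𝑦𝑡+1.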

23.2.3 Stationarity
In earlier lectures, such as the one on AR(1) processes, we introduced the notion of a stationary distribution.
In the present context, we can define a stationary distribution as follows:
The distribution 𝐹 ∗ on ℝ is called stationary for the Kesten process (23.1) if
𝑋𝑡 ∼ 𝐹 ∗ ⟹ 𝑎𝑡+1 𝑋𝑡 + 𝜂𝑡+1 ∼ 𝐹 ∗ (23.5)
In other words, if the current state 𝑋𝑡 has distribution 𝐹 ∗ , then so does the next period state 𝑋𝑡+1 .
We can write this alternatively as
𝐹 ∗ (𝑦) = ∫ ℙ{𝑎𝑡+1 𝑥 + 𝜂𝑡+1 ≤ 𝑦}𝐹 ∗ (𝑑𝑥) for all 𝑦 ≥ 0. (23.6)
The right hand side is the distribution of the next period state when the current state is drawn from 𝐹 ∗ .
The equality in (23.6) states that this distribution is unchanged.
23.2.4 Cross-Sectional Interpretation
There is an important cross-sectional interpretation of stationary distributions, discussed previously but worth repeating
here.
Suppose, for example, that we are interested in the wealth distribution — that is, the current distribution of wealth across
households in a given country.
Suppose further that
• the wealth of each household evolves independently according to (23.4),
• 𝐹 ∗ is a stationary distribution for this stochastic process and
• there are many households.
Then 𝐹 ∗ is a steady state for the cross-sectional wealth distribution in this country.
In other words, if 𝐹 ∗ is the current wealth distribution then it will remain so in subsequent periods, ceteris paribus.
To see this, suppose that 𝐹 ∗ is the current wealth distribution.
What is the fraction of households with wealth less than 𝑦 next period?
To obtain this, we sum the probability that wealth is less than 𝑦 tomorrow, given that current wealth is 𝑤, weighted by the
fraction of households with wealth 𝑤.
Noting that the fraction of households with wealth in interval 𝑑𝑤 is 𝐹 ∗ (𝑑𝑤), we get
∫ ℙ{𝑅𝑡+1 𝑠𝑤 + 𝑦𝑡+1 ≤ 𝑦}𝐹 ∗ (𝑑𝑤)
By the definition of stationarity and the assumption that 𝐹 ∗ is stationary for the wealth process, this is just 𝐹 ∗ (𝑦).
Hence the fraction of households with wealth in [0, 𝑦] is the same next period as it is this period.
Since 𝑦 was chosen arbitrarily, the distribution is unchanged.

23.2.5 Conditions for Stationarity
The Kesten process 𝑋𝑡+1 = 𝑎𝑡+1 𝑋𝑡 + 𝜂𝑡+1 does not always have a stationary distribution.
For example, if 𝑎𝑡 ≡ 𝜂𝑡 ≡ 1 for all 𝑡, then 𝑋𝑡 = 𝑋0 + 𝑡, which diverges to infinity.
To prevent this kind of divergence, we require that {𝑎𝑡 } is strictly less than 1 most of the time.
In particular, if
𝔼 ln 𝑎𝑡 < 0 and 𝔼𝜂𝑡 < ∞ (23.7)
then a unique stationary distribution exists on ℝ+ .

• See, for example, theorem 2.1.3 of [Buraczewski et al., 2016], which provides slightly weaker conditions.
As one application of this result, we see that the wealth process (23.4) will have a unique stationary distribution whenever
labor income has finite mean and 𝔼 ln 𝑅𝑡 + ln 𝑠 < 0.
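The condition 𝔼 ln 𝑎𝑡 < 0 can be checked by Monte Carlo for a given distribution. The sketch below assumes, purely for illustration, a lognormal multiplier with ln 𝑎𝑡 ∼ N(μ, σ²), μ = −0.5 and σ = 1. It estimates 𝔼 ln 𝑎𝑡 and also reports how often 𝑎𝑡 exceeds 1.

```python
import numpy as np

rng = np.random.default_rng(1234)

# Hypothetical lognormal multiplier: ln a_t ~ N(μ, σ²)
μ, σ = -0.5, 1.0
a = np.exp(μ + σ * rng.standard_normal(100_000))

mean_log_a = np.log(a).mean()      # Monte Carlo estimate of E ln a_t
share_above_one = (a > 1).mean()   # fraction of draws with a_t > 1

print(mean_log_a, share_above_one)
```

The estimate of 𝔼 ln 𝑎𝑡 is close to μ and therefore negative, even though a sizable share of draws exceeds 1. This is the sense in which 𝑎𝑡 needs to be below 1 only "most of the time".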
23.3 Heavy Tails
Under certain conditions, the stationary distribution of a Kesten process has a Pareto tail.
(See our earlier lecture on heavy-tailed distributions for background.)
This fact is significant for economics because of the prevalence of Pareto-tailed distributions.
23.3.1 The Kesten–Goldie Theorem
To state the conditions under which the stationary distribution of a Kesten process has a Pareto tail, we first recall that a
random variable is called nonarithmetic if its distribution is not concentrated on {… , −2𝑡, −𝑡, 0, 𝑡, 2𝑡, …} for any 𝑡 ≥ 0.
For example, any random variable with a density is nonarithmetic.
The famous Kesten–Goldie Theorem (see, e.g., [Buraczewski et al., 2016], theorem 2.4.4) states that if
1. the stationarity conditions in (23.7) hold,
2. the random variable 𝑎𝑡 is positive with probability one and nonarithmetic,
3. ℙ{𝑎𝑡 𝑥 + 𝜂𝑡 = 𝑥} < 1 for all 𝑥 ∈ ℝ+ and
4. there exists a positive constant 𝛼 such that
$$\mathbb{E} a_t^\alpha = 1, \quad \mathbb{E} \eta_t^\alpha < \infty, \quad \text{and} \quad \mathbb{E} [a_t^{\alpha+1}] < \infty$$
then the stationary distribution of the Kesten process has a Pareto tail with tail index 𝛼.
More precisely, if 𝐹 ∗ is the unique stationary distribution and 𝑋 ∗ ∼ 𝐹 ∗ , then
$$\lim_{x \to \infty} x^\alpha \, \mathbb{P}\{X^* > x\} = c$$
for some positive constant 𝑐.
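The tail index in condition 4 can be found numerically once a distribution for 𝑎𝑡 is fixed. The sketch below assumes a lognormal 𝑎𝑡 with hypothetical parameters μ = −0.5 and σ = 1, for which 𝔼 𝑎𝑡^α = exp(αμ + α²σ²/2). It solves 𝔼 𝑎𝑡^α = 1 for α > 0 using `scipy.optimize.brentq` and compares the result with the closed form −2μ/σ².

```python
import numpy as np
from scipy.optimize import brentq

μ, σ = -0.5, 1.0   # hypothetical lognormal parameters for a_t

def log_moment(α):
    """ln E[a_t**α] for lognormal a_t, i.e. α*μ + α²σ²/2."""
    return α * μ + 0.5 * (α * σ)**2

# α = 0 is a trivial zero, so bracket the positive root away from it
α_star = brentq(log_moment, 1e-3, 50)
print(α_star, -2 * μ / σ**2)
```

Working on the log scale keeps the function well behaved over a wide bracket; for other distributions of 𝑎𝑡 the moment function would have to be replaced, for example by a Monte Carlo estimate.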

23.3.2 Intuition
Later we will illustrate the Kesten–Goldie Theorem using rank-size plots.

Prior to doing so, we can give the following intuition for the conditions.
Two important conditions are that 𝔼 ln 𝑎𝑡 < 0, so the model is stationary, and $\mathbb{E} a_t^\alpha = 1$ for some 𝛼 > 0.
The first condition implies that the distribution of 𝑎𝑡 has a large amount of probability mass below 1.
The second condition implies that the distribution of 𝑎𝑡 has at least some probability mass at or above 1.
The first condition gives us existence of the stationary distribution.
The second condition means that the current state can be expanded by 𝑎𝑡 .
If this occurs for several consecutive periods, the effects compound each other, since 𝑎𝑡 is multiplicative.
This leads to spikes in the time series, which fill out the extreme right hand tail of the distribution.
The spikes in the time series are visible in the following simulation, which generates 10 paths when 𝑎𝑡 and 𝜂𝑡 are lognormal (𝜂𝑡 is called b in the code).
μ = -0.5
σ = 1.0

def kesten_ts(ts_length=100):
    x = np.zeros(ts_length)
    for t in range(ts_length-1):
        a = np.exp(μ + σ * np.random.randn())
        b = np.exp(np.random.randn())
        x[t+1] = a * x[t] + b
    return x

fig, ax = plt.subplots()

num_paths = 10
np.random.seed(12)

for i in range(num_paths):
    ax.plot(kesten_ts())

ax.set(xlabel='time', ylabel='$X_t$')
plt.show()

23.4 Application: Firm Dynamics
As noted in our lecture on heavy tails, for common measures of firm size such as revenue or employment, the US firm
size distribution exhibits a Pareto tail (see, e.g., [Axtell, 2001], [Gabaix, 2016]).
Let us try to explain this rather striking fact using the Kesten–Goldie Theorem.
23.4.1 Gibrat’s Law
It was postulated many years ago by Robert Gibrat [Gibrat, 1931] that firm size evolves according to a simple rule whereby
size next period is proportional to current size.
This is now known as Gibrat’s law of proportional growth.
We can express this idea by stating that a suitably defined measure 𝑠𝑡 of firm size obeys
$$\frac{s_{t+1}}{s_t} = a_{t+1} \tag{23.8}$$
for some positive IID sequence {𝑎𝑡 }.

One implication of Gibrat’s law is that the growth rate of individual firms does not depend on their size.
However, over the last few decades, research contradicting Gibrat’s law has accumulated in the literature.
For example, it is commonly found that, on average,
1. small firms grow faster than large firms (see, e.g., [Evans, 1987] and [Hall, 1987]) and
2. the growth rate of small firms is more volatile than that of large firms [Dunne et al., 1989].
On the other hand, Gibrat’s law is generally found to be a reasonable approximation for large firms [Evans, 1987].
We can accommodate these empirical findings by modifying (23.8) to
𝑠𝑡+1 = 𝑎𝑡+1 𝑠𝑡 + 𝑏𝑡+1 (23.9)
where {𝑎𝑡 } and {𝑏𝑡 } are both IID and independent of each other.

In the exercises you are asked to show that (23.9) is more consistent with the empirical findings presented above than
Gibrat’s law in (23.8).
23.4.2 Heavy Tails
So what has this to do with Pareto tails?

The answer is that (23.9) is a Kesten process.
If the conditions of the Kesten–Goldie Theorem are satisfied, then the firm size distribution is predicted to have heavy
tails — which is exactly what we see in the data.
In the exercises below we explore this idea further, generalizing the firm size dynamics and examining the corresponding
rank-size plots.
We also try to illustrate why the Pareto tail finding is significant for quantitative analysis.
23.5 Exercises
Exercise 23.5.1
Simulate and plot 15 years of daily returns (consider each year as having 250 working days) using the GARCH(1, 1)
process in (23.2)–(23.3).
Take 𝜉𝑡 and 𝜁𝑡 to be independent and standard normal.
Set 𝛼0 = 0.00001, 𝛼1 = 0.1, 𝛽 = 0.9 and 𝜎0 = 0.
Compare visually with the Nasdaq Composite Index returns shown above.
While the time path differs, you should see bursts of high volatility.

α_0 = 1e-5
α_1 = 0.1
β = 0.9

years = 15
days = years * 250

def garch_ts(ts_length=days):
    σ2 = 0
    r = np.zeros(ts_length)
    for t in range(ts_length):
        ξ = np.random.randn()
        σ2 = α_0 + σ2 * (α_1 * ξ**2 + β)
        r[t] = np.sqrt(σ2) * np.random.randn()
    return r

fig, ax = plt.subplots()

np.random.seed(12)

ax.plot(garch_ts(), alpha=0.7)

ax.set(xlabel='time', ylabel='returns')
plt.show()
Exercise 23.5.2
In our discussion of firm dynamics, it was claimed that (23.9) is more consistent with the empirical literature than Gibrat’s
law in (23.8).
(The empirical literature was reviewed immediately above (23.9).)
In what sense is this true (or false)?

The empirical findings are that
1. small firms grow faster than large firms and
2. the growth rate of small firms is more volatile than that of large firms.
Also, Gibrat’s law is generally found to be a more reasonable approximation for large firms than for small firms.
The claim is that the dynamics in (23.9) are more consistent with points 1-2 than Gibrat’s law.
To see why, we rewrite (23.9) in terms of growth dynamics:
$$\frac{s_{t+1}}{s_t} = a_{t+1} + \frac{b_{t+1}}{s_t} \tag{23.10}$$
Taking 𝑠𝑡 = 𝑠 as given, the mean and variance of firm growth are
$$\mathbb{E} a + \frac{\mathbb{E} b}{s} \quad \text{and} \quad \mathbb{V} a + \frac{\mathbb{V} b}{s^2}$$

Both of these decline with firm size 𝑠, consistent with the data.
Moreover, the law of motion (23.10) clearly approaches Gibrat’s law (23.8) as 𝑠𝑡 gets large.
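As a quick numerical check of this argument, the sketch below simulates the one-period growth rate implied by (23.10) at several firm sizes. The lognormal shocks, their parameter values and the function name `growth_moments` are assumptions made only for illustration.

```python
import numpy as np

def growth_moments(s, n=200_000, μ_a=-0.5, σ_a=0.1, μ_b=0.0, σ_b=0.5, seed=0):
    """
    Sample mean and variance of the growth rate s_{t+1} / s_t implied by
    s_{t+1} = a_{t+1} s_t + b_{t+1}, conditional on s_t = s.

    The lognormal shocks and parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    a = np.exp(μ_a + σ_a * rng.standard_normal(n))
    b = np.exp(μ_b + σ_b * rng.standard_normal(n))
    growth = a + b / s
    return growth.mean(), growth.var()

for s in (0.5, 1.0, 5.0, 50.0):
    m, v = growth_moments(s)
    print(f"s = {s:5.1f}: mean growth = {m:.3f}, variance = {v:.4f}")
```

Both columns should fall as s rises, and for large s the moments should be close to those of the multiplicative shock alone, which is the Gibrat case.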
Exercise 23.5.3
Consider an arbitrary Kesten process as given in (23.1).
Suppose that {𝑎𝑡 } is lognormal with parameters (𝜇, 𝜎).
In other words, each 𝑎𝑡 has the same distribution as exp(𝜇 + 𝜎𝑍) when 𝑍 is standard normal.
Suppose further that $\mathbb{E} \eta_t^r < \infty$ for every $r > 0$, as would be the case if, say, $\eta_t$ is also lognormal.
Show that the conditions of the Kesten–Goldie theorem are satisfied if and only if 𝜇 < 0.
Obtain the value of 𝛼 that makes the Kesten–Goldie conditions hold.

Since 𝑎𝑡 has a density it is nonarithmetic.
Since 𝑎𝑡 has the same density as 𝑎 = exp(𝜇 + 𝜎𝑍) when 𝑍 is standard normal, we have
𝔼 ln 𝑎𝑡 = 𝔼(𝜇 + 𝜎𝑍) = 𝜇,
and since 𝜂𝑡 has finite moments of all orders, the stationarity condition holds if and only if 𝜇 < 0.
Given the properties of the lognormal distribution (which has finite moments of all orders), the only other condition in doubt is existence of a positive constant $\alpha$ such that $\mathbb{E} a_t^\alpha = 1$.
This is equivalent to the statement
$$\exp\left(\alpha \mu + \frac{\alpha^2 \sigma^2}{2}\right) = 1.$$
Solving for $\alpha$ gives $\alpha = -2\mu/\sigma^2$.
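The formula can be checked numerically. The following sketch evaluates $\mathbb{E} a^\alpha$ at $\alpha = -2\mu/\sigma^2$, once through the lognormal moment formula and once by Monte Carlo. The values of μ and σ and the helper name `kesten_alpha` are assumptions chosen for illustration.

```python
import numpy as np

def kesten_alpha(μ, σ):
    "Return α = -2μ/σ², the positive solution of E[a^α] = 1 for lognormal a."
    if μ >= 0:
        raise ValueError("μ must be negative")
    return -2 * μ / σ**2

μ, σ = -0.5, 1.0          # illustrative values, not taken from the text
α = kesten_alpha(μ, σ)

# Closed form: E[a^α] = exp(αμ + α²σ²/2)
closed_form = np.exp(α * μ + α**2 * σ**2 / 2)

# Monte Carlo estimate of E[a^α]
rng = np.random.default_rng(1234)
a = np.exp(μ + σ * rng.standard_normal(1_000_000))
monte_carlo = np.mean(a**α)

print(α, closed_form, monte_carlo)
```

The closed form should equal one up to rounding, and the Monte Carlo estimate should be close to one.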
Exercise 23.5.4
One unrealistic aspect of the firm dynamics specified in (23.9) is that it ignores entry and exit.
In any given period and in any given market, we observe significant numbers of firms entering and exiting the market.
Empirical discussion of this can be found in a famous paper by Hugo Hopenhayn [Hopenhayn, 1992].
In the same paper, Hopenhayn builds a model of entry and exit that incorporates profit maximization by firms and market
clearing quantities, wages and prices.
In his model, a stationary equilibrium occurs when the number of entrants equals the number of exiting firms.
In this setting, firm dynamics can be expressed as
$$s_{t+1} = e_{t+1} \mathbb{1}\{s_t < \bar s\} + (a_{t+1} s_t + b_{t+1}) \mathbb{1}\{s_t \geq \bar s\} \tag{23.11}$$
Here
• the state variable 𝑠𝑡 represents productivity (which is a proxy for output and hence firm size),
• the IID sequence {𝑒𝑡 } is thought of as a productivity draw for a new entrant and

• the variable $\bar s$ is a threshold value that we take as given, although it is determined endogenously in Hopenhayn’s model.
The idea behind (23.11) is that firms stay in the market as long as their productivity $s_t$ remains at or above $\bar s$.
• In this case, their productivity updates according to (23.9).
Firms choose to exit when their productivity $s_t$ falls below $\bar s$.
• In this case, they are replaced by a new firm with productivity $e_{t+1}$.
What can we say about dynamics?
Although (23.11) is not a Kesten process, it does update in the same way as a Kesten process when 𝑠𝑡 is large.
So perhaps its stationary distribution still has Pareto tails?
Your task is to investigate this question via simulation and rank-size plots.
The approach will be to
1. generate 𝑀 draws of 𝑠𝑇 when 𝑀 and 𝑇 are large and
2. plot the largest 1,000 of the resulting draws in a rank-size plot.
(The distribution of 𝑠𝑇 will be close to the stationary distribution when 𝑇 is large.)
In the simulation, assume that
• each of 𝑎𝑡 , 𝑏𝑡 and 𝑒𝑡 is lognormal,
• the parameters are
μ_a = -0.5 # location parameter for a

σ_a = 0.1 # scale parameter for a
μ_b = 0.0 # location parameter for b
σ_b = 0.5 # scale parameter for b
μ_e = 0.0 # location parameter for e
σ_e = 0.5 # scale parameter for e
s_bar = 1.0 # threshold
T = 500 # sampling date
M = 1_000_000 # number of firms
s_init = 1.0 # initial condition for each firm

Here’s one solution. First we generate the observations:

from numba import njit, prange
from numpy.random import randn

@njit(parallel=True)
def generate_draws(μ_a=-0.5,
                   σ_a=0.1,
                   μ_b=0.0,
                   σ_b=0.5,
                   μ_e=0.0,
                   σ_e=0.5,
                   s_bar=1.0,
                   T=500,
                   M=1_000_000,
                   s_init=1.0):

    draws = np.empty(M)
    for m in prange(M):
        s = s_init
        for t in range(T):
            if s < s_bar:
                # Exit: the firm is replaced by an entrant with productivity e
                new_s = np.exp(μ_e + σ_e * randn())
            else:
                # Incumbent: update as in (23.9)
                a = np.exp(μ_a + σ_a * randn())
                b = np.exp(μ_b + σ_b * randn())
                new_s = a * s + b
            s = new_s
        draws[m] = s

    return draws

data = generate_draws()
Now we produce the rank-size plot:
fig, ax = plt.subplots()

rank_data, size_data = qe.rank_size(data, c=0.01)
ax.loglog(rank_data, size_data, 'o', markersize=3.0, alpha=0.5)
ax.set_xlabel("log rank")
ax.set_ylabel("log size")

plt.show()
The plot produces a straight line, consistent with a Pareto tail.
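To see why a straight line points to a Pareto tail, recall that for Pareto data with tail index α the log of the size of the observation with rank k is approximately linear in log k, with slope −1/α. The sketch below builds rank-size data by hand from exact Pareto draws and fits that slope by least squares. The helper `rank_size_data`, the tail index and the sample size are assumptions for illustration; this is not the library's implementation.

```python
import numpy as np

def rank_size_data(data, c=0.01):
    """
    Return (rank, size) for the largest fraction c of the observations,
    with sizes sorted from largest to smallest.
    """
    data = np.sort(np.asarray(data))[::-1]
    k = int(len(data) * c)
    size = data[:k]
    rank = np.arange(1, k + 1)
    return rank, size

rng = np.random.default_rng(0)
α = 2.0                                  # assumed tail index
u = 1.0 - rng.random(100_000)            # uniform on (0, 1], avoids division by zero
draws = u**(-1/α)                        # Pareto draws with tail index α

rank, size = rank_size_data(draws, c=0.01)

# Least squares slope of log size on log rank
slope, intercept = np.polyfit(np.log(rank), np.log(size), 1)
print(f"slope = {slope:.3f}, compared with -1/α = {-1/α:.3f}")
```

A fitted slope near −1/α is what a Pareto tail predicts. For thin-tailed data, the rank-size plot bends instead of following a line.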

CHAPTER
TWENTYFOUR
WEALTH DISTRIBUTION DYNAMICS
Contents
• Wealth Distribution Dynamics

– Overview
– Lorenz Curves and the Gini Coefficient
– A Model of Wealth Dynamics
– Implementation
– Applications
– Exercises
See also:
A version of this lecture using a GPU is available here
24.1 Overview
This notebook gives an introduction to wealth distribution dynamics, with a focus on

• modeling and computing the wealth distribution via simulation,
• measures of inequality such as the Lorenz curve and Gini coefficient, and
• how inequality is affected by the properties of wage income and returns on assets.
One interesting property of the wealth distribution we discuss is Pareto tails.
The wealth distribution in many countries exhibits a Pareto tail.
• See this lecture for a definition.
• For a review of the empirical evidence, see, for example, [Benhabib and Bisin, 2018].
This is consistent with high concentration of wealth amongst the richest households.
It also gives us a way to quantify such concentration, in terms of the tail index.
One question of interest is whether or not we can replicate Pareto tails from a relatively simple model.
24.1.1 A Note on Assumptions
The evolution of wealth for any given household depends on their savings behavior.
Modeling such behavior will form an important part of this lecture series.
However, in this particular lecture, we will be content with rather ad hoc (but plausible) savings rules.
We do this to more easily explore the implications of different specifications of income dynamics and investment returns.
At the same time, all of the techniques discussed here can be plugged into models that use optimization to obtain savings
rules.
We will use the following imports.

import numpy as np
import matplotlib.pyplot as plt
import quantecon as qe
from numba import njit, float64, prange
24.2 Lorenz Curves and the Gini Coefficient
Before we investigate wealth dynamics, we briefly review some measures of inequality.
24.2.1 Lorenz Curves
One popular graphical measure of inequality is the Lorenz curve.

The package QuantEcon.py, already imported above, contains a function to compute Lorenz curves.
To illustrate, suppose that
n = 10_000 # size of sample

w = np.exp(np.random.randn(n)) # lognormal draws
is data representing the wealth of 10,000 households.

We can compute and plot the Lorenz curve as follows:
f_vals, l_vals = qe.lorenz_curve(w)

fig, ax = plt.subplots()
ax.plot(f_vals, l_vals, label='Lorenz curve, lognormal sample')
ax.plot(f_vals, f_vals, label='Lorenz curve, equality')
ax.legend()
plt.show()

This curve can be understood as follows: if point (𝑥, 𝑦) lies on the curve, it means that, collectively, the bottom (100𝑥)%
of the population holds (100𝑦)% of the wealth.
The “equality” line is the 45 degree line (which might not be exactly 45 degrees in the figure, depending on the aspect
ratio).
A sample that produces this line exhibits perfect equality.
The other line in the figure is the Lorenz curve for the lognormal sample, which deviates significantly from perfect equality.
For example, the bottom 80% of the population holds around 40% of total wealth.
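To make the interpretation of points on the curve concrete, here is a minimal sketch that computes Lorenz curve points by hand for a tiny sample. The function `lorenz_points` and the four wealth values are made up for illustration and are not part of QuantEcon.py.

```python
import numpy as np

def lorenz_points(w):
    """
    Return cumulative population shares f and cumulative wealth shares l
    for a sample w, both starting at 0 and ending at 1.
    """
    w = np.sort(np.asarray(w, dtype=float))
    n = len(w)
    f = np.arange(n + 1) / n
    l = np.concatenate(([0.0], np.cumsum(w) / w.sum()))
    return f, l

# Four households with wealth 1, 1, 2 and 4 (made-up numbers)
f, l = lorenz_points([4, 1, 2, 1])
print(f)   # population shares
print(l)   # wealth shares
```

In this sample the point (0.5, 0.25) lies on the curve: the poorest half of the households holds a quarter of total wealth.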
Here is another example, which shows how the Lorenz curve shifts as the underlying distribution changes.
We generate 10,000 observations using the Pareto distribution with a range of parameters, and then compute the Lorenz
curve corresponding to each set of observations.
a_vals = (1, 2, 5)   # Pareto tail index
n = 10_000           # size of each sample

fig, ax = plt.subplots()
for a in a_vals:
    u = np.random.uniform(size=n)
    y = u**(-1/a)    # distributed as Pareto with tail index a
    f_vals, l_vals = qe.lorenz_curve(y)
    ax.plot(f_vals, l_vals, label=f'$a = {a}$')

ax.plot(f_vals, f_vals, label='equality')
ax.legend()
plt.show()

You can see that, as the tail parameter of the Pareto distribution increases, inequality decreases.
This is to be expected, because a higher tail index implies less weight in the tail of the Pareto distribution.
24.2.2 The Gini Coefficient
The definition and interpretation of the Gini coefficient can be found on the corresponding Wikipedia page.
A value of 0 indicates perfect equality (corresponding to the case where the Lorenz curve matches the 45 degree line) and
a value of 1 indicates complete inequality (all wealth held by the richest household).
The QuantEcon.py library contains a function to calculate the Gini coefficient.
We can test it on the Weibull distribution with parameter 𝑎, where the Gini coefficient is known to be
$$G = 1 - 2^{-1/a}$$
Let’s see if the Gini coefficient computed from a simulated sample matches this at each fixed value of 𝑎.
a_vals = range(1, 20)
ginis = []
ginis_theoretical = []
n = 100

fig, ax = plt.subplots()
for a in a_vals:
    y = np.random.weibull(a, size=n)
    ginis.append(qe.gini_coefficient(y))
    ginis_theoretical.append(1 - 2**(-1/a))

ax.plot(a_vals, ginis, label='estimated gini coefficient')
ax.plot(a_vals, ginis_theoretical, label='theoretical gini coefficient')
ax.legend()
ax.set_xlabel("Weibull parameter $a$")
ax.set_ylabel("Gini coefficient")
plt.show()

The simulation shows that the fit is good.
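For readers who want to see what such a calculation involves, here is a sketch of a Gini coefficient computed directly from sorted data, using the standard sample formula shown in the docstring. The function `gini` is written for illustration only and is not necessarily how the library routine is implemented.

```python
import numpy as np

def gini(y):
    """
    Gini coefficient of a sample of non-negative values, computed from
    the sorted data y_(1) <= ... <= y_(n) as

        G = 2 Σ_i i y_(i) / (n Σ_i y_(i)) - (n + 1) / n
    """
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    i = np.arange(1, n + 1)
    return 2 * np.sum(i * y) / (n * np.sum(y)) - (n + 1) / n

print(gini([1, 1, 1, 1]))      # perfect equality
print(gini([0, 0, 0, 1]))      # one household holds everything

# Compare with the theoretical value for a Weibull sample
rng = np.random.default_rng(42)
a = 2
sample = rng.weibull(a, size=100_000)
print(gini(sample), 1 - 2**(-1/a))
```

With only four households, the most unequal sample gives 0.75 rather than 1, because the sample Gini coefficient is bounded above by (n − 1)/n.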
24.3 A Model of Wealth Dynamics
Having discussed inequality measures, let us now turn to wealth dynamics.

The model we will study is
$$w_{t+1} = (1 + r_{t+1}) s(w_t) + y_{t+1} \tag{24.1}$$
where
• $w_t$ is wealth at time $t$ for a given household,
• $r_t$ is the rate of return of financial assets,
• $y_t$ is current non-financial (e.g., labor) income and
• $s(w_t)$ is current wealth net of consumption.
Letting $\{z_t\}$ be a correlated state process of the form
$$z_{t+1} = a z_t + b + \sigma_z \epsilon_{t+1}$$
we’ll assume that
$$R_t := 1 + r_t = c_r \exp(z_t) + \exp(\mu_r + \sigma_r \xi_t)$$
and
$$y_t = c_y \exp(z_t) + \exp(\mu_y + \sigma_y \zeta_t)$$
Here $\{(\epsilon_t, \xi_t, \zeta_t)\}$ is IID and standard normal in $\mathbb{R}^3$.

The value of 𝑐𝑟 should be close to zero, since rates of return on assets do not exhibit large trends.
When we simulate a population of households, we will assume all shocks are idiosyncratic (i.e., specific to individual
households and independent across them).

Regarding the savings function 𝑠, our default model will be
$$s(w) = s_0 w \cdot \mathbb{1}\{w \geq \hat w\} \tag{24.2}$$
where 𝑠0 is a positive constant.

Thus, for $w < \hat w$, the household saves nothing. For $w \geq \hat w$, the household saves a fraction $s_0$ of their wealth.
We are using something akin to a fixed savings rate model, while acknowledging that low wealth households tend to save
very little.
24.4 Implementation
Here’s some type information to help Numba.
wealth_dynamics_data = [
    ('w_hat', float64),    # savings parameter
    ('s_0', float64),      # savings parameter
    ('c_y', float64),      # labor income parameter
    ('μ_y', float64),      # labor income parameter
    ('σ_y', float64),      # labor income parameter
    ('c_r', float64),      # rate of return parameter
    ('μ_r', float64),      # rate of return parameter
    ('σ_r', float64),      # rate of return parameter
    ('a', float64),        # aggregate shock parameter
    ('b', float64),        # aggregate shock parameter
    ('σ_z', float64),      # aggregate shock parameter
    ('z_mean', float64),   # mean of z process
    ('z_var', float64),    # variance of z process
    ('y_mean', float64),   # mean of y process
    ('R_mean', float64)    # mean of R process
]
Here’s a class that stores instance data and implements methods that update the aggregate state and household wealth.
from numba.experimental import jitclass

@jitclass(wealth_dynamics_data)
class WealthDynamics:

    def __init__(self,
                 w_hat=1.0,
                 s_0=0.75,
                 c_y=1.0,
                 μ_y=1.0,
                 σ_y=0.2,
                 c_r=0.05,
                 μ_r=0.1,
                 σ_r=0.5,
                 a=0.5,
                 b=0.0,
                 σ_z=0.1):

        self.w_hat, self.s_0 = w_hat, s_0
        self.c_y, self.μ_y, self.σ_y = c_y, μ_y, σ_y
        self.c_r, self.μ_r, self.σ_r = c_r, μ_r, σ_r
        self.a, self.b, self.σ_z = a, b, σ_z

        # Record stationary moments
        self.z_mean = b / (1 - a)
        self.z_var = σ_z**2 / (1 - a**2)
        exp_z_mean = np.exp(self.z_mean + self.z_var / 2)
        self.R_mean = c_r * exp_z_mean + np.exp(μ_r + σ_r**2 / 2)
        self.y_mean = c_y * exp_z_mean + np.exp(μ_y + σ_y**2 / 2)

        # Test a stability condition that ensures wealth does not diverge
        # to infinity.
        α = self.R_mean * self.s_0
        if α >= 1:
            raise ValueError("Stability condition failed.")

    def parameters(self):
        """
        Collect and return parameters.
        """
        parameters = (self.w_hat, self.s_0,
                      self.c_y, self.μ_y, self.σ_y,
                      self.c_r, self.μ_r, self.σ_r,
                      self.a, self.b, self.σ_z)
        return parameters

    def update_states(self, w, z):
        """
        Update one period, given current wealth w and persistent
        state z.
        """

        # Simplify names
        params = self.parameters()
        w_hat, s_0, c_y, μ_y, σ_y, c_r, μ_r, σ_r, a, b, σ_z = params
        zp = a * z + b + σ_z * np.random.randn()

        # Update wealth
        y = c_y * np.exp(zp) + np.exp(μ_y + σ_y * np.random.randn())
        wp = y
        if w >= w_hat:
            R = c_r * np.exp(zp) + np.exp(μ_r + σ_r * np.random.randn())
            wp += R * s_0 * w
        return wp, zp
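The stationary moments recorded in the constructor rely on two facts: the AR(1) process for z has stationary mean b/(1 − a) and variance σ_z²/(1 − a²), and exp(X) has mean exp(m + v/2) when X is normal with mean m and variance v. The following plain NumPy sketch, which uses neither Numba nor the class itself, compares the resulting formula for the mean of R with a Monte Carlo estimate at the default parameter values. The sample size and seed are arbitrary choices.

```python
import numpy as np

# Default parameter values from the class above
a, b, σ_z = 0.5, 0.0, 0.1
c_r, μ_r, σ_r = 0.05, 0.1, 0.5
s_0 = 0.75

# Stationary moments of the AR(1) process z
z_mean = b / (1 - a)
z_var = σ_z**2 / (1 - a**2)

# Mean of R = c_r exp(z) + exp(μ_r + σ_r ξ), using lognormal means
R_mean = c_r * np.exp(z_mean + z_var / 2) + np.exp(μ_r + σ_r**2 / 2)

# Monte Carlo version of the same quantity
rng = np.random.default_rng(0)
n = 1_000_000
z = z_mean + np.sqrt(z_var) * rng.standard_normal(n)
R = c_r * np.exp(z) + np.exp(μ_r + σ_r * rng.standard_normal(n))

print(R_mean, R.mean())
print("stability condition holds:", R_mean * s_0 < 1)
```

At the defaults the product of the mean return and the savings rate is just below one, so the stability test in the constructor passes with little room to spare.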
Here’s a function to simulate the time series of wealth for an individual household.
@njit
def wealth_time_series(wdy, w_0, n):
    """
    Generate a single time series of length n for wealth given
    initial value w_0.

    The initial persistent state z_0 for each household is drawn from
    the stationary distribution of the AR(1) process.

        * wdy: an instance of WealthDynamics
        * w_0: scalar
        * n: int

    """
    z = wdy.z_mean + np.sqrt(wdy.z_var) * np.random.randn()
    w = np.empty(n)
    w[0] = w_0
    for t in range(n-1):
        w[t+1], z = wdy.update_states(w[t], z)
    return w
Now here’s a function to simulate a cross section of households forward in time.

Note the use of parallelization to speed up computation.
@njit(parallel=True)
def update_cross_section(wdy, w_distribution, shift_length=500):
    """
    Shifts a cross-section of households forward in time

        * wdy: an instance of WealthDynamics
        * w_distribution: array_like, represents current cross-section

    Takes a current distribution of wealth values as w_distribution
    and updates each w_t in w_distribution to w_{t+j}, where
    j = shift_length.

    Returns the new distribution.
    """
    new_distribution = np.empty_like(w_distribution)

    # Update each household
    for i in prange(len(new_distribution)):
        z = wdy.z_mean + np.sqrt(wdy.z_var) * np.random.randn()
        w = w_distribution[i]
        for t in range(shift_length-1):
            w, z = wdy.update_states(w, z)
        new_distribution[i] = w
    return new_distribution
Parallelization is very effective in the function above because the time path of each household can be calculated independently once the path for the aggregate state is known.

24.5 Applications
Let’s try simulating the model at different parameter values and investigate the implications for the wealth distribution.
24.5.1 Time Series
Let’s look at the wealth dynamics of an individual household.
wdy = WealthDynamics()
ts_length = 200
w = wealth_time_series(wdy, wdy.y_mean, ts_length)

fig, ax = plt.subplots()
ax.plot(w)
plt.show()
Notice the large spikes in wealth over time.

Such spikes are similar to what we observed in time series when we studied Kesten processes.
24.5.2 Inequality Measures
Let’s look at how inequality varies with returns on financial assets.

The next function generates a cross section and then computes the Lorenz curve and Gini coefficient.
def generate_lorenz_and_gini(wdy, num_households=100_000, T=500):
    """
    Generate the Lorenz curve data and Gini coefficient corresponding to a
    WealthDynamics model by simulating num_households forward to time T.
    """
    ψ_0 = np.full(num_households, wdy.y_mean)
    z_0 = wdy.z_mean

    ψ_star = update_cross_section(wdy, ψ_0, shift_length=T)
    return qe.gini_coefficient(ψ_star), qe.lorenz_curve(ψ_star)
Now we investigate how the Lorenz curves associated with the wealth distribution change as return to savings varies.
The code below plots Lorenz curves for three different values of 𝜇𝑟 .
If you are running this yourself, note that it will take one or two minutes to execute.
This is unavoidable because we are executing a CPU intensive task.
In fact the code, which is JIT compiled and parallelized, runs extremely fast relative to the number of computations.
%%time

fig, ax = plt.subplots()
μ_r_vals = (0.0, 0.025, 0.05)
gini_vals = []

for μ_r in μ_r_vals:
    wdy = WealthDynamics(μ_r=μ_r)
    gv, (f_vals, l_vals) = generate_lorenz_and_gini(wdy)
    ax.plot(f_vals, l_vals, label=rf'$\psi^*$ at $\mu_r = {μ_r:0.2}$')
    gini_vals.append(gv)

ax.legend(loc="upper left")
plt.show()
CPU times: user 1min 29s, sys: 96.4 ms, total: 1min 29s
Wall time: 12.3 s
The Lorenz curve shifts downwards as returns on financial income rise, indicating a rise in inequality.
We will look at this again via the Gini coefficient immediately below, but first consider the following image of our system
resources when the code above is executing:

Since the code is both efficiently JIT compiled and fully parallelized, it’s close to impossible to make this sequence of
tasks run faster without changing hardware.
Now let’s check the Gini coefficient.
fig, ax = plt.subplots()
ax.plot(μ_r_vals, gini_vals, label='gini coefficient')
ax.set_xlabel(r"$\mu_r$")
ax.legend()
plt.show()
Once again, we see that inequality increases as returns on financial income rise.
Let’s finish this section by investigating what happens when we change the volatility term 𝜎𝑟 in financial returns.
%%time

fig, ax = plt.subplots()
σ_r_vals = (0.35, 0.45, 0.52)
gini_vals = []

for σ_r in σ_r_vals:
    wdy = WealthDynamics(σ_r=σ_r)
    gv, (f_vals, l_vals) = generate_lorenz_and_gini(wdy)
    ax.plot(f_vals, l_vals, label=rf'$\psi^*$ at $\sigma_r = {σ_r:0.2}$')
    gini_vals.append(gv)

ax.legend(loc="upper left")
plt.show()

CPU times: user 1min 28s, sys: 23.3 ms, total: 1min 28s
Wall time: 11.4 s
We see that greater volatility has the effect of increasing inequality in this model.
24.6 Exercises
Exercise 24.6.1
For a wealth or income distribution with Pareto tail, a higher tail index suggests lower inequality.
Indeed, it is possible to prove that the Gini coefficient of the Pareto distribution with tail index 𝑎 is 1/(2𝑎 − 1).
To the extent that you can, confirm this by simulation.
In particular, generate a plot of the Gini coefficient against the tail index using both the theoretical value just given and
the value computed from a sample via qe.gini_coefficient.
For the values of the tail index, use a_vals = np.linspace(1, 10, 25).
Use a sample of size 1,000 for each 𝑎 and the sampling method for generating Pareto draws employed in the discussion of
Lorenz curves for the Pareto distribution.
To the extent that you can, interpret the monotone relationship between the Gini index and 𝑎.

Here is one solution, which produces a good match between theory and simulation.
a_vals = np.linspace(1, 10, 25)   # Pareto tail index
ginis = np.empty_like(a_vals)
n = 1000                          # size of each sample

fig, ax = plt.subplots()
for i, a in enumerate(a_vals):
    y = np.random.uniform(size=n)**(-1/a)
    ginis[i] = qe.gini_coefficient(y)

ax.plot(a_vals, ginis, label='sampled')
ax.plot(a_vals, 1/(2*a_vals - 1), label='theoretical')
ax.legend()
plt.show()
In general, for a Pareto distribution, a higher tail index implies less weight in the right hand tail.
This means less extreme values for wealth and hence more equality.
More equality translates to a lower Gini index.
Exercise 24.6.2
The wealth process (24.1) is similar to a Kesten process.
This is because, according to (24.2), the savings rate is constant for all wealth levels above $\hat w$.
When the savings rate is constant, the wealth process has the same quasi-linear structure as a Kesten process, with multiplicative and additive shocks.
The Kesten–Goldie theorem tells us that Kesten processes have Pareto tails under a range of parameterizations.
The theorem does not directly apply here, since the savings rate is not always constant and since the multiplicative and additive terms in (24.1) are not IID.
At the same time, given the similarities, perhaps Pareto tails will arise.
To test this, run a simulation that generates a cross-section of wealth and generate a rank-size plot.
If you like, you can use the function rank_size from the quantecon library (documentation here).
In viewing the plot, remember that Pareto tails generate a straight line. Is this what you see?
For sample size and initial conditions, use

num_households = 250_000
T = 500 # shift forward T periods
ψ_0 = np.full(num_households, wdy.y_mean) # initial distribution
z_0 = wdy.z_mean

First let’s generate the distribution:
num_households = 250_000
T = 500 # how far to shift forward in time
wdy = WealthDynamics()
ψ_0 = np.full(num_households, wdy.y_mean)
z_0 = wdy.z_mean
ψ_star = update_cross_section(wdy, ψ_0, shift_length=T)
Now let’s see the rank-size plot:
fig, ax = plt.subplots()

rank_data, size_data = qe.rank_size(ψ_star, c=0.001)
ax.loglog(rank_data, size_data, 'o', markersize=3.0, alpha=0.5)
ax.set_xlabel("log rank")
ax.set_ylabel("log size")

plt.show()

CHAPTER
TWENTYFIVE
A FIRST LOOK AT THE KALMAN FILTER
Contents
• A First Look at the Kalman Filter

– Overview
– The Basic Idea
– Convergence
– Implementation
– Exercises
25.1 Overview
This lecture provides a simple and intuitive introduction to the Kalman filter, for those who either
• have heard of the Kalman filter but don’t know how it works, or
• know the Kalman filter equations, but don’t know where they come from
For additional (more advanced) reading on the Kalman filter, see
• [Ljungqvist and Sargent, 2018], section 2.7
• [Anderson and Moore, 2005]
The second reference presents a comprehensive treatment of the Kalman filter.
Required knowledge: Familiarity with matrix manipulations, multivariate normal distributions, covariance matrices, etc.

from scipy import linalg
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from quantecon import Kalman, LinearStateSpace
from scipy.integrate import quad
from scipy.linalg import eigvals
25.2 The Basic Idea
The Kalman filter has many applications in economics, but for now let’s pretend that we are rocket scientists.
A missile has been launched from country Y and our mission is to track it.
Let $x \in \mathbb{R}^2$ denote the current location of the missile, a pair indicating latitude-longitude coordinates on a map.
At the present moment in time, the precise location 𝑥 is unknown, but we do have some beliefs about 𝑥.
One way to summarize our knowledge is a point prediction 𝑥̂
• But what if the President wants to know the probability that the missile is currently over the Sea of Japan?
• Then it is better to summarize our initial beliefs with a bivariate probability density 𝑝
– ∫𝐸 𝑝(𝑥)𝑑𝑥 indicates the probability that we attach to the missile being in region 𝐸.
The density 𝑝 is called our prior for the random variable 𝑥.
To keep things tractable in our example, we assume that our prior is Gaussian.
In particular, we take
$$p = N(\hat x, \Sigma) \tag{25.1}$$
where $\hat x$ is the mean of the distribution and Σ is a 2 × 2 covariance matrix. In our simulations, we will suppose that
$$\hat x = \begin{pmatrix} 0.2 \\ -0.2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 0.4 & 0.3 \\ 0.3 & 0.45 \end{pmatrix} \tag{25.2}$$
This density 𝑝(𝑥) is shown below as a contour map, with the center of the red ellipse being equal to $\hat x$.
# Set up the Gaussian prior density p

Σ = [[0.4, 0.3], [0.3, 0.45]]
Σ = np.matrix(Σ)
x_hat = np.matrix([0.2, -0.2]).T
# Define the matrices G and R from the equation y = G x + N(0, R)
G = [[1, 0], [0, 1]]
G = np.matrix(G)
R = 0.5 * Σ
# The matrices A and Q
A = [[1.2, 0], [0, -0.2]]
A = np.matrix(A)
Q = 0.3 * Σ
# The observed value of y
y = np.matrix([2.3, -1.9]).T
# Set up grid for plotting

x_grid = np.linspace(-1.5, 2.9, 100)
y_grid = np.linspace(-3.1, 1.7, 100)
X, Y = np.meshgrid(x_grid, y_grid)

def bivariate_normal(x, y, σ_x=1.0, σ_y=1.0, μ_x=0.0, μ_y=0.0, σ_xy=0.0):
    """
    Compute and return the probability density function of bivariate normal
    distribution of normal random variables x and y

    Parameters
    ----------
    x : array_like(float)
        Random variable
    y : array_like(float)
        Random variable
    σ_x : array_like(float)
          Standard deviation of random variable x
    σ_y : array_like(float)
          Standard deviation of random variable y
    μ_x : scalar(float)
          Mean value of random variable x
    μ_y : scalar(float)
          Mean value of random variable y
    σ_xy : array_like(float)
           Covariance of random variables x and y
    """
    x_μ = x - μ_x
    y_μ = y - μ_y

    ρ = σ_xy / (σ_x * σ_y)
    z = x_μ**2 / σ_x**2 + y_μ**2 / σ_y**2 - 2 * ρ * x_μ * y_μ / (σ_x * σ_y)
    denom = 2 * np.pi * σ_x * σ_y * np.sqrt(1 - ρ**2)
    return np.exp(-z / (2 * (1 - ρ**2))) / denom

def gen_gaussian_plot_vals(μ, C):
    "Z values for plotting the bivariate Gaussian N(μ, C)"
    m_x, m_y = float(μ[0]), float(μ[1])
    s_x, s_y = np.sqrt(C[0, 0]), np.sqrt(C[1, 1])
    s_xy = C[0, 1]
    return bivariate_normal(X, Y, s_x, s_y, m_x, m_y, s_xy)
# Plot the figure

fig, ax = plt.subplots()
ax.grid()

Z = gen_gaussian_plot_vals(x_hat, Σ)
ax.contourf(X, Y, Z, 6, alpha=0.6, cmap=cm.jet)
cs = ax.contour(X, Y, Z, 6, colors="black")
ax.clabel(cs, inline=1, fontsize=10)

plt.show()


25.2.1 The Filtering Step
We are now presented with some good news and some bad news.
The good news is that the missile has been located by our sensors, which report that the current location is 𝑦 = (2.3, −1.9).
The next figure shows the original prior 𝑝(𝑥) and the new reported location 𝑦

fig, ax = plt.subplots()
ax.grid()

ax.contourf(X, Y, Z, 6, alpha=0.6, cmap=cm.jet)
cs = ax.contour(X, Y, Z, 6, colors="black")
ax.text(float(y[0]), float(y[1]), "$y$", fontsize=20, color="black")

plt.show()


The bad news is that our sensors are imprecise.

In particular, we should interpret the output of our sensor not as 𝑦 = 𝑥, but rather as
$$y = G x + v, \quad \text{where} \quad v \sim N(0, R) \tag{25.3}$$
Here 𝐺 and 𝑅 are 2 × 2 matrices with 𝑅 positive definite. Both are assumed known, and the noise term 𝑣 is assumed to
be independent of 𝑥.

How then should we combine our prior $p(x) = N(\hat x, \Sigma)$ and this new information 𝑦 to improve our understanding of the location of the missile?
As you may have guessed, the answer is to use Bayes’ theorem, which tells us to update our prior 𝑝(𝑥) to 𝑝(𝑥 | 𝑦) via
$$p(x \mid y) = \frac{p(y \mid x) \, p(x)}{p(y)}$$
where 𝑝(𝑦) = ∫ 𝑝(𝑦 | 𝑥) 𝑝(𝑥)𝑑𝑥.
In solving for 𝑝(𝑥 | 𝑦), we observe that
• $p(x) = N(\hat x, \Sigma)$.
• In view of (25.3), the conditional density 𝑝(𝑦 | 𝑥) is 𝑁 (𝐺𝑥, 𝑅).
• 𝑝(𝑦) does not depend on 𝑥, and enters into the calculations only as a normalizing constant.
Because we are in a linear and Gaussian framework, the updated density can be computed by calculating population linear
regressions.
In particular, the solution is known¹ to be
$$p(x \mid y) = N(\hat x^F, \Sigma^F)$$
where
$$\hat x^F := \hat x + \Sigma G' (G \Sigma G' + R)^{-1} (y - G \hat x) \quad \text{and} \quad \Sigma^F := \Sigma - \Sigma G' (G \Sigma G' + R)^{-1} G \Sigma \tag{25.4}$$
Here $\Sigma G' (G \Sigma G' + R)^{-1}$ is the matrix of population regression coefficients of the hidden object $x - \hat x$ on the surprise $y - G \hat x$.
This new density $p(x \mid y) = N(\hat x^F, \Sigma^F)$ is shown in the next figure via contour lines and the color map.
The original density is left in as contour lines for comparison.

fig, ax = plt.subplots()
ax.grid()

# Prior density
cs1 = ax.contour(X, Y, Z, 6, colors="black")
ax.clabel(cs1, inline=1, fontsize=10)

# Filtering density, computed from (25.4)
M = Σ * G.T * linalg.inv(G * Σ * G.T + R)
x_hat_F = x_hat + M * (y - G * x_hat)
Σ_F = Σ - M * G * Σ
new_Z = gen_gaussian_plot_vals(x_hat_F, Σ_F)
cs2 = ax.contour(X, Y, new_Z, 6, colors="black")
ax.contourf(X, Y, new_Z, 6, alpha=0.6, cmap=cm.jet)

plt.show()


1 See, for example, page 93 of [Bishop, 2006]. To get from his expressions to the ones used above, you will also need to apply the Woodbury matrix
identity.


Our new density twists the prior 𝑝(𝑥) in a direction determined by the new information $y - G \hat x$.
In generating the figure, we set 𝐺 to the identity matrix and 𝑅 = 0.5Σ for Σ defined in (25.2).
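With these choices the update in (25.4) can be worked out by hand: $G \Sigma G' + R = 1.5 \Sigma$, so the regression coefficient matrix is $(2/3) I$, the filtered mean moves two thirds of the way from $\hat x$ toward $y$, and $\Sigma^F = \Sigma / 3$. The sketch below confirms this using ordinary NumPy arrays rather than `np.matrix`. The function name `filter_update` is an assumption introduced for illustration.

```python
import numpy as np

def filter_update(x_hat, Σ, y, G, R):
    """
    Map the prior N(x_hat, Σ) and observation y into the filtering
    distribution N(x_hat_F, Σ_F) using (25.4).
    """
    # Regression coefficients Σ G' (G Σ G' + R)^{-1}
    M = Σ @ G.T @ np.linalg.inv(G @ Σ @ G.T + R)
    x_hat_F = x_hat + M @ (y - G @ x_hat)
    Σ_F = Σ - M @ G @ Σ
    return x_hat_F, Σ_F

Σ = np.array([[0.4, 0.3], [0.3, 0.45]])
x_hat = np.array([0.2, -0.2])
y = np.array([2.3, -1.9])
G = np.eye(2)
R = 0.5 * Σ

x_hat_F, Σ_F = filter_update(x_hat, Σ, y, G, R)
print(x_hat_F)                      # two thirds of the way from x_hat to y
print(np.allclose(Σ_F, Σ / 3))
```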
25.2.2 The Forecast Step
What have we achieved so far?

We have obtained probabilities for the current location of the state (missile) given prior and current information.
This is called “filtering” rather than forecasting because we are filtering out noise rather than looking into the future.
• $p(x \mid y) = N(\hat x^F, \Sigma^F)$ is called the filtering distribution
But now let’s suppose that we are given another task: to predict the location of the missile after one unit of time (whatever
that may be) has elapsed.
To do this we need a model of how the state evolves.

Let’s suppose that we have one, and that it’s linear and Gaussian. In particular,
$$x_{t+1} = A x_t + w_{t+1}, \quad \text{where} \quad w_t \sim N(0, Q) \tag{25.5}$$
Our aim is to combine this law of motion and our current distribution $p(x \mid y) = N(\hat x^F, \Sigma^F)$ to come up with a new predictive distribution for the location in one unit of time.
In view of (25.5), all we have to do is introduce a random vector $x^F \sim N(\hat x^F, \Sigma^F)$ and work out the distribution of $A x^F + w$ where $w$ is independent of $x^F$ and has distribution $N(0, Q)$.
Since linear combinations of Gaussians are Gaussian, $A x^F + w$ is Gaussian.
Elementary calculations and the expressions in (25.4) tell us that
$$\mathbb{E}[A x^F + w] = A \mathbb{E} x^F + \mathbb{E} w = A \hat x^F = A \hat x + A \Sigma G' (G \Sigma G' + R)^{-1} (y - G \hat x)$$
and
$$\operatorname{Var}[A x^F + w] = A \operatorname{Var}[x^F] A' + Q = A \Sigma^F A' + Q = A \Sigma A' - A \Sigma G' (G \Sigma G' + R)^{-1} G \Sigma A' + Q$$
The matrix $A \Sigma G' (G \Sigma G' + R)^{-1}$ is often written as $K_\Sigma$ and called the Kalman gain.
• The subscript Σ has been added to remind us that $K_\Sigma$ depends on Σ, but not on $y$ or $\hat x$.
Using this notation, we can summarize our results as follows.
Our updated prediction is the density $N(\hat x_{new}, \Sigma_{new})$ where
$$\hat x_{new} := A \hat x + K_\Sigma (y - G \hat x)$$
$$\Sigma_{new} := A \Sigma A' - K_\Sigma G \Sigma A' + Q$$
• The density $p_{new}(x) = N(\hat x_{new}, \Sigma_{new})$ is called the predictive distribution.
The predictive distribution is the new density shown in the following figure, where the update has used the parameters
$$A = \begin{pmatrix} 1.2 & 0.0 \\ 0.0 & -0.2 \end{pmatrix}, \qquad Q = 0.3 \, \Sigma$$

fig, ax = plt.subplots()
ax.grid()

# Density 1: the prior
cs1 = ax.contour(X, Y, Z, 6, colors="black")

# Density 2: the filtering distribution
M = Σ * G.T * linalg.inv(G * Σ * G.T + R)
x_hat_F = x_hat + M * (y - G * x_hat)
Σ_F = Σ - M * G * Σ
Z_F = gen_gaussian_plot_vals(x_hat_F, Σ_F)
cs2 = ax.contour(X, Y, Z_F, 6, colors="black")

# Density 3: the predictive distribution
new_x_hat = A * x_hat_F
new_Σ = A * Σ_F * A.T + Q
new_Z = gen_gaussian_plot_vals(new_x_hat, new_Σ)
cs3 = ax.contour(X, Y, new_Z, 6, colors="black")

ax.contourf(X, Y, new_Z, 6, alpha=0.6, cmap=cm.jet)
plt.show()
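The plotting code reaches the predictive distribution in two stages: filter first, then push the result through the law of motion. The summary formulas written in terms of the Kalman gain should give the same answer in one stage. Here is a small check of that equivalence using plain NumPy arrays and the parameter values from the figures. The variable names are chosen for illustration.

```python
import numpy as np

Σ = np.array([[0.4, 0.3], [0.3, 0.45]])
x_hat = np.array([0.2, -0.2])
y = np.array([2.3, -1.9])
G = np.eye(2)
R = 0.5 * Σ
A = np.array([[1.2, 0.0], [0.0, -0.2]])
Q = 0.3 * Σ

S_inv = np.linalg.inv(G @ Σ @ G.T + R)

# Route 1: filter, then push the filtering distribution through (25.5)
M = Σ @ G.T @ S_inv
x_hat_F = x_hat + M @ (y - G @ x_hat)
Σ_F = Σ - M @ G @ Σ
x_new_1 = A @ x_hat_F
Σ_new_1 = A @ Σ_F @ A.T + Q

# Route 2: use the Kalman gain K = A Σ G' (G Σ G' + R)^{-1} directly
K = A @ Σ @ G.T @ S_inv
x_new_2 = A @ x_hat + K @ (y - G @ x_hat)
Σ_new_2 = A @ Σ @ A.T - K @ G @ Σ @ A.T + Q

print(x_new_1)
print(np.allclose(x_new_1, x_new_2), np.allclose(Σ_new_1, Σ_new_2))
```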



25.2.3 The Recursive Procedure
Let’s look back at what we’ve done.

We started the current period with a prior 𝑝(𝑥) for the location 𝑥 of the missile.
We then used the current measurement 𝑦 to update to 𝑝(𝑥 | 𝑦).
Finally, we used the law of motion (25.5) for {𝑥𝑡 } to update to 𝑝𝑛𝑒𝑤 (𝑥).
If we now step into the next period, we are ready to go round again, taking 𝑝𝑛𝑒𝑤 (𝑥) as the current prior.
Swapping notation 𝑝𝑡 (𝑥) for 𝑝(𝑥) and 𝑝𝑡+1 (𝑥) for 𝑝𝑛𝑒𝑤 (𝑥), the full recursive procedure is:
1. Start the current period with prior $p_t(x) = N(\hat x_t, \Sigma_t)$.
2. Observe current measurement $y_t$.
3. Compute the filtering distribution $p_t(x \mid y) = N(\hat x_t^F, \Sigma_t^F)$ from $p_t(x)$ and $y_t$, applying Bayes rule and the conditional distribution (25.3).
4. Compute the predictive distribution $p_{t+1}(x) = N(\hat x_{t+1}, \Sigma_{t+1})$ from the filtering distribution and (25.5).
5. Increment $t$ by one and go to step 1.
Repeating (25.6), the dynamics for $\hat x_t$ and $\Sigma_t$ are as follows
$$\hat x_{t+1} = A \hat x_t + K_{\Sigma_t} (y_t - G \hat x_t)$$
$$\Sigma_{t+1} = A \Sigma_t A' - K_{\Sigma_t} G \Sigma_t A' + Q$$
These are the standard dynamic equations for the Kalman filter (see, for example, [Ljungqvist and Sargent, 2018], page
58).
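The recursion can be sketched as a short loop over simulated data. In the code below the function `kalman_step` applies the two dynamic equations once, and the loop alternates between observing, updating and letting the hidden state move. The stable matrix A, the initial state and the number of periods are assumptions made for this illustration and differ from the values used in the figures.

```python
import numpy as np

def kalman_step(x_hat, Σ, y, A, G, Q, R):
    "Map the prior (x_hat, Σ) and the observation y into the next prior."
    K = A @ Σ @ G.T @ np.linalg.inv(G @ Σ @ G.T + R)   # Kalman gain
    x_hat_next = A @ x_hat + K @ (y - G @ x_hat)
    Σ_next = A @ Σ @ A.T - K @ G @ Σ @ A.T + Q
    return x_hat_next, Σ_next

# Illustrative (assumed) primitives: a stable A so that the state stays bounded
A = np.array([[0.9, 0.0], [0.0, -0.2]])
G = np.eye(2)
Σ_0 = np.array([[0.4, 0.3], [0.3, 0.45]])
Q, R = 0.3 * Σ_0, 0.5 * Σ_0

rng = np.random.default_rng(0)
x = np.array([1.0, 1.0])                 # true (hidden) state
x_hat, Σ = np.array([0.2, -0.2]), Σ_0    # initial prior

for t in range(50):
    y = G @ x + rng.multivariate_normal(np.zeros(2), R)   # observe y_t
    x_hat, Σ = kalman_step(x_hat, Σ, y, A, G, Q, R)       # prior for t+1
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)   # state moves on

print("final prediction error:", x - x_hat)
print("final Σ:\n", Σ)
```

Note that the update of Σ never uses the data, so the sequence of covariance matrices can be computed before any observation arrives.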
25.3 Convergence
The matrix $\Sigma_t$ is a measure of the uncertainty of our prediction $\hat x_t$ of $x_t$.

Apart from special cases, this uncertainty will never be fully resolved, regardless of how much time elapses.
One reason is that our prediction $\hat x_t$ is made based on information available at $t - 1$, not $t$.
Even if we know the precise value of $x_{t-1}$ (which we don’t), the transition equation (25.5) implies that $x_t = A x_{t-1} + w_t$.
Since the shock $w_t$ is not observable at $t - 1$, any time $t - 1$ prediction of $x_t$ will incur some error (unless $w_t$ is degenerate).
However, it is certainly possible that Σ𝑡 converges to a constant matrix as 𝑡 → ∞.
To study this topic, let’s expand the second equation in (25.6):
$$\Sigma_{t+1} = A \Sigma_t A' - A \Sigma_t G' (G \Sigma_t G' + R)^{-1} G \Sigma_t A' + Q \tag{25.6}$$
This is a nonlinear difference equation in Σ𝑡 .

A fixed point of (25.6) is a constant matrix Σ such that
$$\Sigma = A \Sigma A' - A \Sigma G' (G \Sigma G' + R)^{-1} G \Sigma A' + Q \tag{25.7}$$
Equation (25.6) is known as a discrete-time Riccati difference equation.

Equation (25.7) is known as a discrete-time algebraic Riccati equation.
Conditions under which a fixed point exists and the sequence {Σ𝑡 } converges to it are discussed in [Anderson et al., 1996]
and [Anderson and Moore, 2005], chapter 4.

A sufficient (but not necessary) condition is that all the eigenvalues 𝜆𝑖 of 𝐴 satisfy |𝜆𝑖 | < 1 (cf. e.g., [Anderson and
Moore, 2005], p. 77).
(This strong condition assures that the unconditional distribution of 𝑥𝑡 converges as 𝑡 → +∞.)
In this case, for any initial choice of Σ0 that is both non-negative and symmetric, the sequence {Σ𝑡 } in (25.6) converges
to a non-negative symmetric matrix Σ that solves (25.7).
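To see this convergence numerically, one can simply iterate on the Riccati difference equation until successive iterates stop changing. The sketch below does this in plain NumPy; the helper names `riccati_step` and `stationary_sigma` and the parameter values are assumptions made for illustration (the eigenvalues of this $A$ are 0.9 and -0.1, so the sufficient condition holds).

```python
import numpy as np

def riccati_step(Sigma, A, G, Q, R):
    """Apply the right-hand side of the Riccati difference equation once."""
    S = G @ Sigma @ G.T + R
    return (A @ Sigma @ A.T
            - A @ Sigma @ G.T @ np.linalg.solve(S, G @ Sigma @ A.T)
            + Q)

def stationary_sigma(Sigma_0, A, G, Q, R, tol=1e-12, max_iter=10_000):
    """Iterate from Sigma_0 until the largest elementwise change is below tol."""
    Sigma = Sigma_0
    for _ in range(max_iter):
        Sigma_next = riccati_step(Sigma, A, G, Q, R)
        if np.max(np.abs(Sigma_next - Sigma)) < tol:
            return Sigma_next
        Sigma = Sigma_next
    raise RuntimeError("Riccati iteration did not converge")

# Illustrative (assumed) parameters
A = np.array([[0.5, 0.4], [0.6, 0.3]])
G = np.eye(2)
Q = 0.3 * np.eye(2)
R = 0.5 * np.eye(2)

# Two different symmetric, non-negative starting points
S_from_identity = stationary_sigma(np.eye(2), A, G, Q, R)
S_from_zero = stationary_sigma(np.zeros((2, 2)), A, G, Q, R)
```

Under these assumptions both starting points should lead to the same symmetric matrix, which is then (up to the tolerance) a solution of the algebraic Riccati equation.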
25.4 Implementation
The class Kalman from the QuantEcon.py package implements the Kalman filter.
• Instance data consists of:
– the moments $(\hat x_t, \Sigma_t)$ of the current prior.
– an instance of the LinearStateSpace class from QuantEcon.py.
The latter represents a linear state space model of the form
$$x_{t+1} = A x_t + C w_{t+1}$$
$$y_t = G x_t + H v_t$$
where the shocks $w_t$ and $v_t$ are IID standard normals.
To connect this with the notation of this lecture we set
$$Q := C C' \quad \text{and} \quad R := H H'$$
• The class Kalman from the QuantEcon.py package has a number of methods, some of which we will not use until we study more advanced applications in subsequent lectures.
• Methods pertinent for this lecture are:
– prior_to_filtered, which updates $(\hat x_t, \Sigma_t)$ to $(\hat x_t^F, \Sigma_t^F)$
– filtered_to_forecast, which updates the filtering distribution to the predictive distribution, which becomes the new prior $(\hat x_{t+1}, \Sigma_{t+1})$
– update, which combines the last two methods
– stationary_values, which computes the solution to (25.7) and the corresponding (stationary) Kalman gain
You can view the program on GitHub.
25.5 Exercises
Exercise 25.5.1
Consider the following simple application of the Kalman filter, loosely based on [Ljungqvist and Sargent, 2018], section
2.9.2.
Suppose that
• all variables are scalars
• the hidden state {𝑥𝑡 } is in fact constant, equal to some 𝜃 ∈ ℝ unknown to the modeler

State dynamics are therefore given by (25.5) with 𝐴 = 1, 𝑄 = 0 and 𝑥0 = 𝜃.

The measurement equation is 𝑦𝑡 = 𝜃 + 𝑣𝑡 where 𝑣𝑡 is 𝑁 (0, 1) and IID.
The task of this exercise is to simulate the model and, using the code from kalman.py, plot the first five predictive densities $p_t(x) = N(\hat x_t, \Sigma_t)$.
As shown in [Ljungqvist and Sargent, 2018], sections 2.9.1–2.9.2, these distributions asymptotically put all mass on the
unknown value 𝜃.
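A quick way to see why the variance part of this claim holds: with $A = G = R = 1$ and $Q = 0$, the variance recursion collapses to $\Sigma_{t+1} = \Sigma_t - \Sigma_t^2 / (\Sigma_t + 1) = \Sigma_t / (1 + \Sigma_t)$, so $1/\Sigma_{t+1} = 1/\Sigma_t + 1$. The short sketch below iterates this scalar recursion from an assumed prior variance of 1 and stores the results; the variable names are illustrative only.

```python
Σ = 1.0                      # assumed prior variance Σ_0
variances = [Σ]
for t in range(5):
    # Scalar Riccati step with A = G = R = 1 and Q = 0
    Σ = Σ - Σ * Σ / (Σ + 1.0)
    variances.append(Σ)

# Closed form implied by 1/Σ_{t+1} = 1/Σ_t + 1 when Σ_0 = 1
closed_form = [1.0 / (1 + t) for t in range(6)]
```

Under this assumption the variances are 1, 1/2, 1/3 and so on, which shrink to zero as $t$ grows.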
In the simulation, take 𝜃 = 10, 𝑥0̂ = 8 and Σ0 = 1.
Your figure should, modulo randomness, look something like the one produced by the following code.
# Parameters
θ = 10  # Constant value of state x_t
A, C, G, H = 1, 0, 1, 1
ss = LinearStateSpace(A, C, G, H, mu_0=θ)

# Set prior, initialize kalman filter
x_hat_0, Σ_0 = 8, 1
kalman = Kalman(ss, x_hat_0, Σ_0)

# Draw observations of y from state space model
N = 5
x, y = ss.simulate(N)
y = y.flatten()

# Set up plot
fig, ax = plt.subplots(figsize=(10,8))
xgrid = np.linspace(θ - 5, θ + 2, 200)

for i in range(N):
    # Record the current predicted mean and variance
    m, v = [float(z) for z in (kalman.x_hat, kalman.Sigma)]
    # Plot, update filter
    ax.plot(xgrid, norm.pdf(xgrid, loc=m, scale=np.sqrt(v)), label=f'$t={i}$')
    kalman.update(y[i])

ax.set_title(f'First {N} densities when $\\theta = {θ:.1f}$')
ax.legend()
plt.show()

Exercise 25.5.2
The preceding figure gives some support to the idea that probability mass converges to 𝜃.
To get a better idea, choose a small $\epsilon > 0$ and calculate
$$z_t := 1 - \int_{\theta - \epsilon}^{\theta + \epsilon} p_t(x) \, dx$$
for $t = 0, 1, 2, \ldots, T$.
Plot $z_t$ against $t$, setting $\epsilon = 0.1$ and $T = 600$.
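Because each $p_t$ is a normal density, the integral in $z_t$ can also be written with the normal CDF instead of numerical quadrature. The sketch below shows this using only the standard library; the helper names `normal_cdf` and `tail_mass` are assumptions made for illustration.

```python
from math import erf, sqrt

def normal_cdf(x, m, v):
    """CDF at x of a normal distribution with mean m and variance v."""
    return 0.5 * (1.0 + erf((x - m) / sqrt(2.0 * v)))

def tail_mass(m, v, θ, ϵ):
    """Probability mass that N(m, v) places outside [θ - ϵ, θ + ϵ]."""
    inside = normal_cdf(θ + ϵ, m, v) - normal_cdf(θ - ϵ, m, v)
    return 1.0 - inside

# A diffuse density centered at θ leaves most mass outside the interval,
# while a very concentrated one leaves almost none
z_diffuse = tail_mass(10.0, 1.0, 10.0, 0.1)
z_concentrated = tail_mass(10.0, 1e-6, 10.0, 0.1)
```

In the exercise, `m` and `v` would be the current predictive mean and variance, so that $z_t$ falls toward zero as the variance shrinks and the mean settles near $\theta$.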
Your figure should show the error declining erratically, something like the one produced by the following code.
ϵ = 0.1
θ = 10  # Constant value of state x_t
A, C, G, H = 1, 0, 1, 1
ss = LinearStateSpace(A, C, G, H, mu_0=θ)

x_hat_0, Σ_0 = 8, 1
kalman = Kalman(ss, x_hat_0, Σ_0)

T = 600
z = np.empty(T)
x, y = ss.simulate(T)
y = y.flatten()

for t in range(T):
    # Record the current predicted mean and variance
    m, v = [float(temp) for temp in (kalman.x_hat, kalman.Sigma)]

    # Integrate the predictive density over [θ - ϵ, θ + ϵ]
    f = lambda x: norm.pdf(x, loc=m, scale=np.sqrt(v))
    integral, error = quad(f, θ - ϵ, θ + ϵ)
    z[t] = 1 - integral

    kalman.update(y[t])

fig, ax = plt.subplots()
ax.set_ylim(0, 1)
ax.set_xlim(0, T)
ax.plot(range(T), z)
ax.fill_between(range(T), np.zeros(T), z, color="blue", alpha=0.2)
plt.show()

Exercise 25.5.3
As discussed above, if the shock sequence {𝑤𝑡 } is not degenerate, then it is not in general possible to predict 𝑥𝑡 without
error at time 𝑡 − 1 (and this would be the case even if we could observe 𝑥𝑡−1 ).
Let’s now compare the prediction 𝑥𝑡̂ made by the Kalman filter against a competitor who is allowed to observe 𝑥𝑡−1 .
This competitor will use the conditional expectation 𝔼[𝑥𝑡 | 𝑥𝑡−1 ], which in this case is 𝐴𝑥𝑡−1 .
The conditional expectation is known to be the optimal prediction method in terms of minimizing mean squared error.
(More precisely, the minimizer of $\mathbb{E} \, \| x_t - g(x_{t-1}) \|^2$ with respect to $g$ is $g^*(x_{t-1}) := \mathbb{E}[x_t \mid x_{t-1}]$.)
Thus we are comparing the Kalman filter against a competitor who has more information (in the sense of being able to
observe the latent state) and behaves optimally in terms of minimizing squared error.
Our horse race will be assessed in terms of squared error.
In particular, your task is to generate a graph plotting observations of both $\| x_t - A x_{t-1} \|^2$ and $\| x_t - \hat x_t \|^2$ against $t$ for $t = 1, \ldots, 50$.
For the parameters, set 𝐺 = 𝐼, 𝑅 = 0.5𝐼 and 𝑄 = 0.3𝐼, where 𝐼 is the 2 × 2 identity.
Set
$$A = \begin{pmatrix} 0.5 & 0.4 \\ 0.6 & 0.3 \end{pmatrix}$$
To initialize the prior density, set
$$\Sigma_0 = \begin{pmatrix} 0.9 & 0.3 \\ 0.3 & 0.9 \end{pmatrix}$$
and $\hat x_0 = (8, 8)$.
Finally, set $x_0 = (0, 0)$.
You should end up with a figure similar to the following (modulo randomness)
Observe how, after an initial learning period, the Kalman filter performs quite well, even relative to the competitor who
predicts optimally with knowledge of the latent state.
# Define A, C, G, H
G = np.identity(2)
H = np.sqrt(0.5) * np.identity(2)

A = [[0.5, 0.4],
     [0.6, 0.3]]
C = np.sqrt(0.3) * np.identity(2)

# Set up state space model, initial value x_0 set to zero
ss = LinearStateSpace(A, C, G, H, mu_0=np.zeros(2))

# Define the prior density
Σ = [[0.9, 0.3],
     [0.3, 0.9]]
Σ = np.array(Σ)
x_hat = np.array([8, 8])

# Initialize the Kalman filter
kn = Kalman(ss, x_hat, Σ)

# Print eigenvalues of A
print("Eigenvalues of A:")
print(eigvals(A))

# Print stationary Σ
S, K = kn.stationary_values()
print("Stationary prediction error variance:")
print(S)

# Simulate the state and observation paths
T = 50
x, y = ss.simulate(T)

# Record the squared errors of the two predictors
e1 = np.empty(T-1)
e2 = np.empty(T-1)

for t in range(1, T):
    kn.update(y[:,t])
    e1[t-1] = np.sum((x[:, t] - kn.x_hat.flatten())**2)
    e2[t-1] = np.sum((x[:, t] - A @ x[:, t-1])**2)

# Generate the plot
fig, ax = plt.subplots()
ax.plot(range(1, T), e1, 'k-', lw=2, alpha=0.6,
        label='Kalman filter error')
ax.plot(range(1, T), e2, 'g-', lw=2, alpha=0.6,
        label='Conditional expectation error')
ax.legend()
plt.show()
Eigenvalues of A:
[ 0.9+0.j -0.1+0.j]
Stationary prediction error variance:
[[0.40329108 0.1050718 ]
[0.1050718 0.41061709]]

Exercise 25.5.4
Try varying the coefficient 0.3 in 𝑄 = 0.3𝐼 up and down.
Observe how the diagonal values in the stationary solution Σ (see (25.7)) increase and decrease in line with this coefficient.
The interpretation is that more randomness in the law of motion for 𝑥𝑡 causes more (permanent) uncertainty in prediction.

CHAPTER
TWENTYSIX
ANOTHER LOOK AT THE KALMAN FILTER
Contents
• Another Look at the Kalman Filter

– A worker’s output
– A firm’s wage-setting policy
– A state-space representation
– An Innovations Representation
– Some Computational Experiments
– Future Extensions
In this quantecon lecture A First Look at the Kalman filter, we used a Kalman filter to estimate locations of a rocket.
In this lecture, we’ll use the Kalman filter to infer a worker’s human capital and the effort that the worker devotes to
accumulating human capital, neither of which the firm observes directly.
The firm learns about those things only by observing a history of the output that the worker generates for the firm, and
from understanding how that output depends on the worker’s human capital and how human capital evolves as a function
of the worker’s effort.
We’ll posit a rule that expresses how much the firm pays the worker each period as a function of the firm’s information each period.
To conduct simulations, we bring in these imports, as in A First Look at the Kalman filter.

import matplotlib.pyplot as plt
import numpy as np
from quantecon import Kalman, LinearStateSpace
from collections import namedtuple
from scipy.stats import multivariate_normal
import matplotlib as mpl
mpl.rcParams['text.usetex'] = True
mpl.rcParams['text.latex.preamble'] = r'\usepackage{{amsmath}}'
26.1 A worker’s output
A representative worker is permanently employed at a firm.

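The worker’s output is described by the following dynamic process:
$$
\begin{aligned}
h_{t+1} &= \alpha h_t + \beta u_t + c w_{t+1}, \quad w_{t+1} \sim \mathcal{N}(0, 1) \\
u_{t+1} &= u_t \\
y_t &= g h_t + v_t, \quad v_t \sim \mathcal{N}(0, R)
\end{aligned}
\tag{26.1}
$$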
Here
• ℎ𝑡 is the logarithm of human capital at time 𝑡
• 𝑢𝑡 is the logarithm of the worker’s effort at accumulating human capital at 𝑡
• 𝑦𝑡 is the logarithm of the worker’s output at time 𝑡
• ℎ0 ∼ 𝒩(ℎ̂ 0 , 𝜎ℎ,0 )
• 𝑢0 ∼ 𝒩(𝑢̂0 , 𝜎𝑢,0 )
Parameters of the model are 𝛼, 𝛽, 𝑐, 𝑅, 𝑔, ℎ̂ 0 , 𝑢̂0 , 𝜎ℎ , 𝜎𝑢 .

At time 0, a firm has hired the worker.
The worker is permanently attached to the firm and so works for the same firm at all dates 𝑡 = 0, 1, 2, ….
At the beginning of time 0, the firm observes neither the worker’s innate initial human capital ℎ0 nor its hard-wired
permanent effort level 𝑢0 .
The firm believes that 𝑢0 for a particular worker is drawn from a Gaussian probability distribution, and so is described by
𝑢0 ∼ 𝒩(𝑢̂0 , 𝜎𝑢,0 ).
The ℎ𝑡 part of a worker’s “type” moves over time, but the effort component of the worker’s type is 𝑢𝑡 = 𝑢0 .
This means that from the firm’s point of view, the worker’s effort is effectively an unknown fixed “parameter”.
At time $t \geq 1$, for a particular worker the firm has observed the history $y^{t-1} = [y_{t-1}, y_{t-2}, \ldots, y_0]$.
The firm does not observe the worker’s “type” $(h_0, u_0)$.
But the firm does observe the worker’s output $y_t$ at time $t$ and remembers the worker’s past outputs $y^{t-1}$.
26.2 A firm’s wage-setting policy
Based on information about the worker that the firm has at time $t \geq 1$, the firm pays the worker log wage
$$w_t = g E[h_t \mid y^{t-1}], \quad t \geq 1$$
and at time 0 pays the worker a log wage equal to the unconditional mean of $y_0$:
$$w_0 = g \hat h_0$$
In using this payment rule, the firm is taking into account that the worker’s log output today is partly due to the random
component 𝑣𝑡 that comes entirely from luck, and that is assumed to be independent of ℎ𝑡 and 𝑢𝑡 .

26.3 A state-space representation
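Write system (26.1) in the state-space form
$$
\begin{aligned}
\begin{bmatrix} h_{t+1} \\ u_{t+1} \end{bmatrix} &= \begin{bmatrix} \alpha & \beta \\ 0 & 1 \end{bmatrix} \begin{bmatrix} h_t \\ u_t \end{bmatrix} + \begin{bmatrix} c \\ 0 \end{bmatrix} w_{t+1} \\
y_t &= \begin{bmatrix} g & 0 \end{bmatrix} \begin{bmatrix} h_t \\ u_t \end{bmatrix} + v_t
\end{aligned}
$$
which is equivalent to
$$
\begin{aligned}
x_{t+1} &= A x_t + C w_{t+1} \\
y_t &= G x_t + v_t \\
x_0 &\sim \mathcal{N}(\hat x_0, \Sigma_0)
\end{aligned}
\tag{26.2}
$$
where
$$
x_t = \begin{bmatrix} h_t \\ u_t \end{bmatrix}, \quad \hat x_0 = \begin{bmatrix} \hat h_0 \\ \hat u_0 \end{bmatrix}, \quad \Sigma_0 = \begin{bmatrix} \sigma_{h,0} & 0 \\ 0 & \sigma_{u,0} \end{bmatrix}
$$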
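To make the state-space form concrete, the following sketch builds $A$, $C$ and $G$ exactly as displayed above and simulates a short path by hand in NumPy. The parameter values, the initial state and the random seed are assumptions chosen for illustration, and the loop is a plain-NumPy stand-in rather than the LinearStateSpace class.

```python
import numpy as np

# Illustrative (assumed) parameter values
α, β, c, g, R = 0.8, 0.2, 0.2, 1.0, 0.5

A = np.array([[α, β],
              [0.0, 1.0]])
C = np.array([[c],
              [0.0]])
G = np.array([[g, 0.0]])

rng = np.random.default_rng(0)
T = 20
x = np.empty((2, T))          # rows are h_t and u_t
y = np.empty(T)
x[:, 0] = [4.0, 4.0]          # assumed initial (h_0, u_0)

for t in range(T):
    # Measurement: y_t = G x_t + v_t with v_t ~ N(0, R)
    y[t] = (G @ x[:, t]).item() + np.sqrt(R) * rng.standard_normal()
    if t < T - 1:
        # Transition: x_{t+1} = A x_t + C w_{t+1} with w_{t+1} ~ N(0, 1)
        x[:, t + 1] = A @ x[:, t] + C[:, 0] * rng.standard_normal()
```

Because the second row of $A$ is $(0, 1)$ and the second entry of $C$ is zero, the effort component `x[1, :]` stays at its initial value along the whole path, while `x[0, :]` fluctuates.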
To compute the firm’s wage-setting policy, we first create a namedtuple to store the parameters of the model
WorkerModel = namedtuple("WorkerModel",
                ('A', 'C', 'G', 'R', 'xhat_0', 'Σ_0'))

def create_worker(α=.8, β=.2, c=.2,
                  R=.5, g=1.0, hhat_0=4, uhat_0=4,
                  σ_h=4, σ_u=4):

    A = np.array([[α, β],
                  [0, 1]])
    C = np.array([[c],
                  [0]])
    G = np.array([g, 1])

    # Define initial state and covariance matrix
    xhat_0 = np.array([[hhat_0],
                       [uhat_0]])

    Σ_0 = np.array([[σ_h, 0],
                    [0, σ_u]])

    return WorkerModel(A=A, C=C, G=G, R=R, xhat_0=xhat_0, Σ_0=Σ_0)
Please note how the WorkerModel namedtuple creates all of the objects required to compute an associated state-space
representation (26.2).
This is handy, because in order to simulate a history $\{y_t, h_t\}$ for a worker, we’ll want to form a state space system for him/her by using the LinearStateSpace class.
# Define A, C, G, R, xhat_0, Σ_0
worker = create_worker()
A, C, G, R = worker.A, worker.C, worker.G, worker.R
xhat_0, Σ_0 = worker.xhat_0, worker.Σ_0

# Create a LinearStateSpace object
ss = LinearStateSpace(A, C, G, np.sqrt(R),
        mu_0=xhat_0, Sigma_0=np.zeros((2,2)))

# Simulate T periods of states x and observations y
T = 100
x, y = ss.simulate(T)
y = y.flatten()

h_0, u_0 = x[0, 0], x[1, 0]
Next, to compute the firm’s policy for setting the log wage based on the information it has about the worker, we use the
Kalman filter described in this quantecon lecture A First Look at the Kalman filter.
In particular, we want to compute all of the objects in an “innovation representation”.
26.4 An Innovations Representation
We have all the objects in hand required to form an innovations representation for the output process $\{y_t\}_{t=0}^T$ for a worker.
Let’s code that up now.
$$
\begin{aligned}
\hat x_{t+1} &= A \hat x_t + K_t a_t \\
y_t &= G \hat x_t + a_t
\end{aligned}
$$
where $K_t$ is the Kalman gain matrix at time $t$.

We accomplish this in the following code that uses the Kalman class.
kalman = Kalman(ss, xhat_0, Σ_0)
Σ_t = np.zeros((*Σ_0.shape, T-1))
y_hat_t = np.zeros(T-1)
x_hat_t = np.zeros((2, T-1))

# Update the filter with each observation and store the results
for t in range(1, T):
    kalman.update(y[t])
    x_hat, Σ = kalman.x_hat, kalman.Sigma
    Σ_t[:, :, t-1] = Σ
    x_hat_t[:, t-1] = x_hat.reshape(-1)
    y_hat_t[t-1] = worker.G @ x_hat

x_hat_t = np.concatenate((x[:, 1][:, np.newaxis],
                    x_hat_t), axis=1)
Σ_t = np.concatenate((worker.Σ_0[:, :, np.newaxis],
                    Σ_t), axis=2)
u_hat_t = x_hat_t[1, :]

For a draw of $h_0, u_0$, we plot $E y_t = G \hat x_t$ where $\hat x_t = E[x_t \mid y^{t-1}]$.
We also plot $E[u_0 \mid y^{t-1}]$, which is the firm’s inference about a worker’s hard-wired “work ethic” $u_0$, conditioned on information $y^{t-1}$ that it has about him or her coming into period $t$.
We can watch as the firm’s inference $E[u_0 \mid y^{t-1}]$ of the worker’s work ethic converges toward the hidden $u_0$, which is not directly observed by the firm.
fig, ax = plt.subplots(1, 2)
ax[0].plot(y_hat_t, label=r'$E[y_t| y^{t-1}]$')

ax[0].set_xlabel('Time')
ax[0].set_ylabel(r'$E[y_t]$')
ax[0].set_title(r'$E[y_t]$ over time')
ax[0].legend()
ax[1].plot(u_hat_t, label=r'$E[u_t|y^{t-1}]$')
ax[1].axhline(y=u_0, color='grey',
linestyle='dashed', label=fr'$u_0={u_0:.2f}$')
ax[1].set_ylabel(r'$E[u_t|y^{t-1}]$')
ax[1].set_title('Inferred work ethic over time')
ax[1].legend()
fig.tight_layout()
plt.show()

26.5 Some Computational Experiments
Let’s look at Σ0 and Σ𝑇 in order to see how much the firm learns about the hidden state during the horizon we have set.
print(Σ_t[:, :, 0])
[[4. 0.]
[0. 4.]]
print(Σ_t[:, :, -1])
[[0.08805027 0.00100377]
[0.00100377 0.00398351]]
Evidently, entries in the conditional covariance matrix become smaller over time.
It is enlightening to portray how conditional covariance matrices $\Sigma_t$ evolve by plotting confidence ellipsoids around $E[x_t \mid y^{t-1}]$ at various $t$’s.
# Create a grid of points for contour plotting
h_range = np.linspace(x_hat_t[0, :].min()-0.5*Σ_t[0, 0, 1],
                      x_hat_t[0, :].max()+0.5*Σ_t[0, 0, 1], 100)
u_range = np.linspace(x_hat_t[1, :].min()-0.5*Σ_t[1, 1, 1],
                      x_hat_t[1, :].max()+0.5*Σ_t[1, 1, 1], 100)
h, u = np.meshgrid(h_range, u_range)

# Create a figure with subplots for each time step
fig, axs = plt.subplots(1, 3, figsize=(12, 3))

# Iterate through each time step
for i, t in enumerate(np.linspace(0, T-1, 3, dtype=int)):
    # Create a multivariate normal distribution with x_hat and Σ at time step t
    mu = x_hat_t[:, t]
    cov = Σ_t[:, :, t]
    mvn = multivariate_normal(mean=mu, cov=cov)

    # Evaluate the multivariate normal PDF on the grid
    pdf_values = mvn.pdf(np.dstack((h, u)))

    # Create a contour plot for the PDF
    con = axs[i].contour(h, u, pdf_values, cmap='viridis')
    axs[i].clabel(con, inline=1, fontsize=10)
    axs[i].set_title(f'Time Step {t+1}')
    axs[i].set_xlabel(r'$h_{{{}}}$'.format(str(t+1)))
    axs[i].set_ylabel(r'$u_{{{}}}$'.format(str(t+1)))

    cov_latex = r'$\Sigma_{{{}}}= \begin{{bmatrix}} {:.2f} & {:.2f} \\ {:.2f} & {:.2f} \end{{bmatrix}}$'.format(
        t+1, cov[0, 0], cov[0, 1], cov[1, 0], cov[1, 1]
    )
    axs[i].text(0.33, -0.15, cov_latex, transform=axs[i].transAxes)

plt.tight_layout()
plt.show()

Note how the accumulation of evidence 𝑦𝑡 affects the shape of the confidence ellipsoid as sample size 𝑡 grows.
Now let’s use our code to set the hidden state 𝑥0 to a particular vector in order to watch how a firm learns starting from
some 𝑥0 we are interested in.
For example, let’s say ℎ0 = 0 and 𝑢0 = 4.
Here is one way to do this.
# For example, we might want h_0 = 0 and u_0 = 4

mu_0 = np.array([0.0, 4.0])
# Create a LinearStateSpace object with Sigma_0 as a matrix of zeros

ss_example = LinearStateSpace(A, C, G, np.sqrt(R), mu_0=mu_0,
# This line forces exact h_0=0 and u_0=4
Sigma_0=np.zeros((2, 2))
)
T = 100
x, y = ss_example.simulate(T)
y = y.flatten()
# Now h_0=0 and u_0=4

h_0, u_0 = x[0, 0], x[1, 0]
print('h_0 =', h_0)
print('u_0 =', u_0)
h_0 = 0.0
u_0 = 4.0
Another way to accomplish the same goal is to use the following code.

# If we want to set the initial
# h_0 = hhat_0 = 0 and u_0 = uhat_0 = 4.0:
worker = create_worker(hhat_0=0.0, uhat_0=4.0)

ss_example = LinearStateSpace(A, C, G, np.sqrt(R),
                              # This line takes h_0=hhat_0 and u_0=uhat_0
                              mu_0=worker.xhat_0,
                              # This line forces exact h_0=hhat_0 and u_0=uhat_0
                              Sigma_0=np.zeros((2, 2))
                             )

T = 100
x, y = ss_example.simulate(T)
y = y.flatten()

# Now h_0 and u_0 will be exactly hhat_0 and uhat_0

h_0, u_0 = x[0, 0], x[1, 0]
print('h_0 =', h_0)
print('u_0 =', u_0)
h_0 = 0.0
u_0 = 4.0
For this worker, let’s generate a plot like the one above.
# First we compute the Kalman filter with initial xhat_0 and Σ_0
kalman = Kalman(ss, xhat_0, Σ_0)
Σ_t = []
y_hat_t = np.zeros(T-1)
u_hat_t = np.zeros(T-1)

# Then we iteratively update the Kalman filter class using
# observation y based on the linear state model above:
for t in range(1, T):
    kalman.update(y[t])
    x_hat, Σ = kalman.x_hat, kalman.Sigma
    Σ_t.append(Σ)
    y_hat_t[t-1] = worker.G @ x_hat
    u_hat_t[t-1] = x_hat[1]
# Generate plots for y_hat_t and u_hat_t

fig, ax = plt.subplots(1, 2)
ax[0].plot(y_hat_t, label=r'$E[y_t| y^{t-1}]$')

ax[0].set_ylabel(r'$E[y_t]$')
ax[0].set_title(r'$E[y_t]$ over time')
ax[0].legend()
ax[1].plot(u_hat_t, label=r'$E[u_t|y^{t-1}]$')
ax[1].axhline(y=u_0, color='grey',
linestyle='dashed', label=fr'$u_0={u_0:.2f}$')
ax[1].set_ylabel(r'$E[u_t|y^{t-1}]$')
ax[1].set_title('Inferred work ethic over time')


ax[1].legend()
fig.tight_layout()
plt.show()


More generally, we can change some or all of the parameters defining a worker when calling our create_worker function.
Here is an example.
# We can set these parameters when creating a worker -- just like classes!
hard_working_worker = create_worker(α=.4, β=.8,
hhat_0=7.0, uhat_0=100, σ_h=2.5, σ_u=3.2)
print(hard_working_worker)

WorkerModel(A=array([[0.4, 0.8],
[0. , 1. ]]), C=array([[0.2],
[0. ]]), G=array([1., 1.]), R=0.5, xhat_0=array([[ 7.],
[100.]]), Σ_0=array([[2.5, 0. ],
[0. , 3.2]]))
We can also simulate the system for 𝑇 = 50 periods for different workers.
The difference between the inferred work ethics and true work ethics converges to 0 over time.
This shows that the filter is gradually teaching the worker and firm about the worker’s effort.
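The experiments that follow rely on a helper named simulate_workers. A minimal sketch of such a helper is given here, with the signature inferred from the calls in the code that follows (`worker, T, ax, mu_0, Sigma_0, diff, name, title`). It is an assumption-laden stand-in, not the original function: the Kalman recursion is written directly in NumPy instead of using the Kalman class, the `seed` argument and the return value are additions made for convenience, and `ax` only needs to provide the usual Matplotlib axes methods.

```python
import numpy as np

def simulate_workers(worker, T, ax, mu_0=None, Sigma_0=None,
                     diff=True, name=None, title=None, seed=None):
    """
    Hypothetical helper: simulate one worker for T periods, run the
    Kalman recursion, and plot the firm's inference about effort on ax.
    """
    A, C, R = worker.A, worker.C, worker.R
    G = np.atleast_2d(worker.G)
    x_hat, Σ = worker.xhat_0.reshape(-1), worker.Σ_0

    # True initial state is drawn from N(mu_0, Sigma_0), defaulting to the prior
    mu_0 = x_hat if mu_0 is None else np.asarray(mu_0, dtype=float).reshape(-1)
    Sigma_0 = Σ if Sigma_0 is None else Sigma_0
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu_0, Sigma_0)
    u_0 = x[1]

    u_hat_t = np.empty(T)
    for t in range(T):
        # Observe output, then update the prior to E[x_{t+1} | y^t]
        y = G @ x + np.sqrt(R) * rng.standard_normal(1)
        K = A @ Σ @ G.T @ np.linalg.inv(G @ Σ @ G.T + R)
        x_hat = A @ x_hat + K @ (y - G @ x_hat)
        Σ = A @ Σ @ A.T - K @ G @ Σ @ A.T + C @ C.T
        u_hat_t[t] = x_hat[1]
        # Advance the true (hidden) state
        x = A @ x + C[:, 0] * rng.standard_normal()

    if diff:
        ax.plot(u_hat_t - u_0, alpha=0.5)
        ax.axhline(y=0, color='grey', linestyle='dashed')
        ax.set_ylabel(r'$E[u_t|y^{t-1}] - u_0$')
        ax.set_title(title or
                     'Difference between inferred and true work ethic over time')
    else:
        ax.plot(u_hat_t, label=name or r'$E[u_t|y^{t-1}]$')
        ax.axhline(y=u_0, color='grey', linestyle='dashed', alpha=0.5)
        ax.set_ylabel(r'$E[u_t|y^{t-1}]$')
        ax.set_title(title or 'Inferred work ethic over time')
    ax.set_xlabel('Time')
    return u_hat_t, u_0
```

With `diff=True` the sketch plots the gap between inferred and true effort; with `diff=False` it plots the inferred effort itself together with a dashed line at the true value.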
num_workers = 3
T = 50
fig, ax = plt.subplots()

for i in range(num_workers):
    worker = create_worker(uhat_0=4+2*i)
    simulate_workers(worker, T, ax)
ax.set_ylim(ymin=-2, ymax=2)
plt.show()

# We can also generate plots of u_t:
T = 50
fig, ax = plt.subplots()

uhat_0s = [2, -2, 1]
αs = [0.2, 0.3, 0.5]
βs = [0.1, 0.9, 0.3]

for i, (uhat_0, α, β) in enumerate(zip(uhat_0s, αs, βs)):
    worker = create_worker(uhat_0=uhat_0, α=α, β=β)
    simulate_workers(worker, T, ax,
                     # By setting diff=False, it will give u_t
                     diff=False, name=r'$u_{{{}, t}}$'.format(i))

ax.axhline(y=u_0, xmin=0, xmax=0, color='grey',
           linestyle='dashed', label=r'$u_{i, 0}$')
ax.legend(bbox_to_anchor=(1, 0.5))
plt.show()


# We can also use exact h_0=1 and u_0=2 for all workers
T = 50
fig, ax = plt.subplots()

# These two lines set h_0=1 and u_0=2 for all workers
# (the state vector is ordered as [h, u])
mu_0 = np.array([[1],
                 [2]])
Sigma_0 = np.zeros((2,2))

uhat_0s = [2, -2, 1]
αs = [0.2, 0.3, 0.5]
βs = [0.1, 0.9, 0.3]

for i, (uhat_0, α, β) in enumerate(zip(uhat_0s, αs, βs)):
    worker = create_worker(uhat_0=uhat_0, α=α, β=β)
    simulate_workers(worker, T, ax, mu_0=mu_0, Sigma_0=Sigma_0,
                     diff=False, name=r'$u_{{{}, t}}$'.format(i))

# This controls the boundary of plots
ax.set_ylim(ymin=-3, ymax=3)
plt.show()



# We can generate a plot for only one of the workers:
T = 50
fig, ax = plt.subplots()

mu_0_1 = np.array([[1],
                   [100]])
mu_0_2 = np.array([[1],
                   [30]])
Sigma_0 = np.zeros((2,2))

uhat_0s = 100
αs = 0.5
βs = 0.3

worker = create_worker(uhat_0=uhat_0s, α=αs, β=βs)
simulate_workers(worker, T, ax, mu_0=mu_0_1, Sigma_0=Sigma_0,
                 diff=False, name=r'Hard-working worker')
simulate_workers(worker, T, ax, mu_0=mu_0_2, Sigma_0=Sigma_0,
                 diff=False,
                 title='A hard-working worker and a less hard-working worker',
                 name=r'Normal worker')
ax.legend()
plt.show()



26.6 Future Extensions
We can do lots of enlightening experiments by creating new types of workers and letting the firm learn about their hidden
(to the firm) states by observing just their output histories.

Part V
Search
CHAPTER
TWENTYSEVEN
JOB SEARCH I: THE MCCALL SEARCH MODEL
Contents
• Job Search I: The McCall Search Model

– Overview
– The McCall Model
– Computing the Optimal Policy: Take 1
– Computing an Optimal Policy: Take 2
– Exercises
“Questioning a McCall worker is like having a conversation with an out-of-work friend: ‘Maybe you are
setting your sights too high’, or ‘Why did you quit your old job before you had a new one lined up?’ This is
real social science: an attempt to model, to understand, human behavior by visualizing the situation people
find themselves in, the options they face and the pros and cons as they themselves see them.” – Robert E.
Lucas, Jr.
27.1 Overview
The McCall search model [McCall, 1970] helped transform economists’ way of thinking about labor markets.
To clarify notions such as “involuntary” unemployment, McCall modeled the decision problem of an unemployed worker
in terms of factors including
• current and likely future wages
• impatience
• unemployment compensation
To solve the decision problem McCall used dynamic programming.
Here we set up McCall’s model and use dynamic programming to analyze it.
As we’ll see, McCall’s model is not only interesting in its own right but also an excellent vehicle for learning dynamic
programming.

import matplotlib.pyplot as plt
import numpy as np
from numba import jit, float64
from numba.experimental import jitclass
from quantecon.distributions import BetaBinomial
27.2 The McCall Model
An unemployed agent receives in each period a job offer at wage 𝑤𝑡 .

In this lecture, we adopt the following simple environment:
• The offer sequence {𝑤𝑡 }𝑡≥0 is IID, with 𝑞(𝑤) being the probability of observing wage 𝑤 in finite set 𝕎.
• The agent observes 𝑤𝑡 at the start of 𝑡.
• The agent knows that {𝑤𝑡 } is IID with common distribution 𝑞 and can use this when computing expectations.
(In later lectures, we will relax these assumptions.)
At time 𝑡, our agent has two choices:
1. Accept the offer and work permanently at constant wage 𝑤𝑡 .
2. Reject the offer, receive unemployment compensation 𝑐, and reconsider next period.
The agent is infinitely lived and aims to maximize the expected discounted sum of earnings
$$\mathbb{E} \sum_{t=0}^{\infty} \beta^t y_t$$
The constant 𝛽 lies in (0, 1) and is called a discount factor.

The smaller is 𝛽, the more the agent discounts future utility relative to current utility.
The variable 𝑦𝑡 is income, equal to
• his/her wage 𝑤𝑡 when employed
• unemployment compensation 𝑐 when unemployed
27.2.1 A Trade-Off
The worker faces a trade-off:

• Waiting too long for a good offer is costly, since the future is discounted.
• Accepting too early is costly, since better offers might arrive in the future.
To decide optimally in the face of this trade-off, we use dynamic programming.
Dynamic programming can be thought of as a two-step procedure that
1. first assigns values to “states” and
2. then deduces optimal actions given those values
We’ll go through these steps in turn.

27.2.2 The Value Function
In order to optimally trade off current and future rewards, we need to think about two things:
1. the current payoffs we get from different choices
2. the different states that those choices will lead to in the next period
To weigh these two aspects of the decision problem, we need to assign values to states.
To this end, let $v^*(w)$ be the total lifetime value accruing to an unemployed worker who enters the current period unemployed when the wage is $w \in \mathbb{W}$.
In particular, the agent has wage offer $w$ in hand.
More precisely, $v^*(w)$ denotes the value of the objective function above (the expected discounted sum of earnings) when an agent in this situation makes optimal decisions now and at all future points in time.
Of course 𝑣∗ (𝑤) is not trivial to calculate because we don’t yet know what decisions are optimal and what aren’t!
But think of $v^*$ as a function that assigns to each possible wage $w$ the maximal lifetime value that can be obtained with that offer in hand.
A crucial observation is that this function 𝑣∗ must satisfy the recursion
$$v^*(w) = \max \left\{ \frac{w}{1 - \beta}, \; c + \beta \sum_{w' \in \mathbb{W}} v^*(w') q(w') \right\} \tag{27.1}$$
for every possible 𝑤 in 𝕎.

This important equation is a version of the Bellman equation, which is ubiquitous in economic dynamics and other fields
involving planning over time.
The intuition behind it is as follows:
• the first term inside the max operation is the lifetime payoff from accepting the current offer, since
$$\frac{w}{1 - \beta} = w + \beta w + \beta^2 w + \cdots$$
• the second term inside the max operation is the continuation value, which is the lifetime payoff from rejecting the current offer and then behaving optimally in all subsequent periods
If we optimize and pick the best of these two options, we obtain maximal lifetime value from today, given current offer
𝑤.
But this is precisely 𝑣∗ (𝑤), which is the left-hand side of (27.1).
27.2.3 The Optimal Policy
Suppose for now that we are able to solve (27.1) for the unknown function 𝑣∗ .
Once we have this function in hand we can behave optimally (i.e., make the right choice between accept and reject).
All we have to do is select the maximal choice on the right-hand side of (27.1).
The optimal action is best thought of as a policy, which is, in general, a map from states to actions.
Given any 𝑤, we can read off the corresponding best choice (accept or reject) by picking the max on the right-hand side
of (27.1).
Thus, we have a map from ℝ to {0, 1}, with 1 meaning accept and 0 meaning reject.

We can write the policy as follows
$$\sigma(w) := \mathbf{1} \left\{ \frac{w}{1 - \beta} \geq c + \beta \sum_{w' \in \mathbb{W}} v^*(w') q(w') \right\}$$
Here 1{𝑃 } = 1 if statement 𝑃 is true and equals 0 otherwise.

We can also write this as
$$\sigma(w) := \mathbf{1} \{ w \geq \bar w \}$$
where
$$\bar w := (1 - \beta) \left\{ c + \beta \sum_{w'} v^*(w') q(w') \right\} \tag{27.2}$$
Here 𝑤̄ (called the reservation wage) is a constant depending on 𝛽, 𝑐 and the wage distribution.
The agent should accept if and only if the current wage offer exceeds the reservation wage.
In view of (27.2), we can compute this reservation wage if we can compute the value function.
27.3 Computing the Optimal Policy: Take 1
To put the above ideas into action, we need to compute the value function at each possible state 𝑤 ∈ 𝕎.
To simplify notation, let’s set
$$\mathbb{W} := \{w_1, \ldots, w_n\} \quad \text{and} \quad v^*(i) := v^*(w_i)$$
The value function is then represented by the vector $v^* = (v^*(i))_{i=1}^n$.
In view of (27.1), this vector satisfies the nonlinear system of equations
$$v^*(i) = \max \left\{ \frac{w(i)}{1 - \beta}, \; c + \beta \sum_{1 \leq j \leq n} v^*(j) q(j) \right\} \quad \text{for } i = 1, \ldots, n \tag{27.3}$$
27.3.1 The Algorithm
To compute this vector, we use successive approximations:

Step 1: pick an arbitrary initial guess $v \in \mathbb{R}^n$.
Step 2: compute a new vector $v' \in \mathbb{R}^n$ via
$$v'(i) = \max \left\{ \frac{w(i)}{1 - \beta}, \; c + \beta \sum_{1 \leq j \leq n} v(j) q(j) \right\} \quad \text{for } i = 1, \ldots, n \tag{27.4}$$
Step 3: calculate a measure of the discrepancy between $v$ and $v'$, such as $\max_i |v(i) - v'(i)|$.
Step 4: if the deviation is larger than some fixed tolerance, set 𝑣 = 𝑣′ and go to step 2, else continue.
Step 5: return 𝑣.
For a small tolerance, the returned function 𝑣 is a close approximation to the value function 𝑣∗ .
The theory below elaborates on this point.
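The five steps above can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions: the function name `value_function_iteration`, the uniform offer distribution and the parameter values are all chosen for the example and are not the defaults used elsewhere in this lecture.

```python
import numpy as np

def value_function_iteration(w, q, c=25.0, β=0.99, tol=1e-6, max_iter=5000):
    """Successive approximations on the Bellman equation (illustrative sketch)."""
    v = w / (1 - β)                               # Step 1: initial guess
    for _ in range(max_iter):
        # Step 2: accept value at each wage vs. common continuation value
        v_new = np.maximum(w / (1 - β), c + β * np.sum(v * q))
        error = np.max(np.abs(v_new - v))         # Step 3: discrepancy
        v = v_new
        if error <= tol:                          # Step 4: stop once within tolerance
            break
    return v                                      # Step 5

# Assumed wage grid and a uniform offer distribution
w = np.linspace(10, 60, 51)
q = np.full(51, 1 / 51)

v = value_function_iteration(w, q)

# Reservation wage implied by the approximate value function
w_bar = (1 - 0.99) * (25.0 + 0.99 * np.sum(v * q))
```

The returned vector approximately satisfies the Bellman equation, and the implied reservation wage lies strictly inside the wage grid for these assumed values.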

27.3.2 Fixed Point Theory
What’s the mathematics behind these ideas?

First, one defines a mapping $T$ from $\mathbb{R}^n$ to itself via
$$(Tv)(i) = \max \left\{ \frac{w(i)}{1 - \beta}, \; c + \beta \sum_{1 \leq j \leq n} v(j) q(j) \right\} \quad \text{for } i = 1, \ldots, n \tag{27.5}$$
(A new vector $Tv$ is obtained from a given vector $v$ by evaluating the r.h.s. at each $i$.)
The element $v_k$ in the sequence $\{v_k\}$ of successive approximations corresponds to $T^k v$.
• This is $T$ applied $k$ times, starting at the initial guess $v$
One can show that the conditions of the Banach fixed point theorem are satisfied by $T$ on $\mathbb{R}^n$.
One implication is that $T$ has a unique fixed point in $\mathbb{R}^n$.
• That is, a unique vector $\bar v$ such that $T \bar v = \bar v$.
Moreover, it’s immediate from the definition of $T$ that this fixed point is $v^*$.
A second implication of the Banach contraction mapping theorem is that $\{T^k v\}$ converges to the fixed point $v^*$ regardless of $v$.
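The key condition behind the Banach fixed point theorem here is that $T$ is a contraction in the sup norm with modulus $\beta$, that is, $\max_i |(Tv)(i) - (Tv')(i)| \leq \beta \max_i |v(i) - v'(i)|$ for any $v, v'$. The sketch below checks this inequality numerically for one pair of randomly drawn vectors; the operator name `T_op`, the wage grid, the uniform $q$ and the parameter values are illustrative assumptions.

```python
import numpy as np

def T_op(v, w, q, c=25.0, β=0.99):
    """The Bellman operator: (Tv)(i) = max{w(i)/(1-β), c + β Σ_j v(j) q(j)}."""
    return np.maximum(w / (1 - β), c + β * np.sum(v * q))

# Assumed wage grid and uniform offer probabilities
w = np.linspace(10, 60, 51)
q = np.full(51, 1 / 51)

# Two arbitrary candidate value functions
rng = np.random.default_rng(0)
v1 = rng.uniform(0, 6000, size=51)
v2 = rng.uniform(0, 6000, size=51)

lhs = np.max(np.abs(T_op(v1, w, q) - T_op(v2, w, q)))   # distance after applying T
rhs = 0.99 * np.max(np.abs(v1 - v2))                    # β times distance before
```

The inequality `lhs <= rhs` holds because taking a maximum with a fixed number cannot increase distances, and averaging with weights $q$ cannot exceed the largest gap.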
27.3.3 Implementation
Our default for 𝑞, the distribution of the state process, will be Beta-binomial.
n, a, b = 50, 200, 100 # default parameters

q_default = BetaBinomial(n, a, b).pdf() # default choice of q
Our default set of values for wages will be
w_min, w_max = 10, 60

w_default = np.linspace(w_min, w_max, n+1)
Here’s a plot of the probabilities of different wage outcomes:
fig, ax = plt.subplots()
ax.plot(w_default, q_default, '-o', label='$q(w(i))$')
ax.set_xlabel('wages')
ax.set_ylabel('probabilities')
plt.show()

We are going to use Numba to accelerate our code.

• See, in particular, the discussion of @jitclass in our lecture on Numba.
The following helps Numba by providing some type information.
mccall_data = [
('c', float64), # unemployment compensation
('β', float64), # discount factor
('w', float64[:]), # array of wage values, w[i] = wage at state i
('q', float64[:]) # array of probabilities
]
Here’s a class that stores the data and computes the values of state-action pairs, i.e. the value in the maximum bracket on
the right hand side of the Bellman equation (27.4), given the current state and an arbitrary feasible action.
Default parameter values are embedded in the class.
@jitclass(mccall_data)
class McCallModel:

    def __init__(self, c=25, β=0.99, w=w_default, q=q_default):
        self.c, self.β = c, β
        # Store the arrays passed in (which default to w_default, q_default)
        self.w, self.q = w, q

    def state_action_values(self, i, v):
        """
        The values of state-action pairs.
        """
        # Simplify names
        c, β, w, q = self.c, self.β, self.w, self.q
        # Evaluate value for each state-action pair
        # Consider action = accept or reject the current offer
        accept = w[i] / (1 - β)
        reject = c + β * np.sum(v * q)

        return np.array([accept, reject])

Based on these defaults, let’s try plotting the first few approximate value functions in the sequence {𝑇 𝑘 𝑣}.
We will start from guess 𝑣 given by 𝑣(𝑖) = 𝑤(𝑖)/(1 − 𝛽), which is the value of accepting at every given wage.
Here’s a function to implement this:
def plot_value_function_seq(mcm, ax, num_plots=6):
    """
    Plot a sequence of value functions.

        * mcm is an instance of McCallModel
        * ax is an axes object that implements a plot method.

    """

    n = len(mcm.w)
    v = mcm.w / (1 - mcm.β)
    v_next = np.empty_like(v)
    for i in range(num_plots):
        ax.plot(mcm.w, v, '-', alpha=0.4, label=f"iterate {i}")
        # Update guess
        for j in range(n):
            v_next[j] = np.max(mcm.state_action_values(j, v))
        v[:] = v_next  # copy contents into v

    ax.legend(loc='lower right')
Now let’s create an instance of McCallModel and watch iterations 𝑇 𝑘 𝑣 converge from below:
mcm = McCallModel()

fig, ax = plt.subplots()
ax.set_xlabel('wage')
ax.set_ylabel('value')
plot_value_function_seq(mcm, ax)
plt.show()
You can see that convergence is occurring: successive iterates are getting closer together.

Here’s a more serious iteration effort to compute the limit, which continues until measured deviation between successive
iterates is below tol.
Once we obtain a good approximation to the limit, we will use it to calculate the reservation wage.
We’ll be using JIT compilation via Numba to turbocharge our loops.
@jit(nopython=True)
def compute_reservation_wage(mcm,
                             max_iter=500,
                             tol=1e-6):

    # Simplify names
    c, β, w, q = mcm.c, mcm.β, mcm.w, mcm.q

    # == First compute the value function == #
    n = len(w)
    v = w / (1 - β)          # initial guess
    v_next = np.empty_like(v)
    i = 0                    # iteration counter, kept separate from the state index j
    error = tol + 1
    while i < max_iter and error > tol:

        for j in range(n):
            v_next[j] = np.max(mcm.state_action_values(j, v))

        error = np.max(np.abs(v_next - v))
        i += 1

        v[:] = v_next  # copy contents into v

    # == Now compute the reservation wage == #
    return (1 - β) * (c + β * np.sum(v * q))
The next line computes the reservation wage at default parameters:
compute_reservation_wage(mcm)
47.316499710024964
27.3.4 Comparative Statics
Now that we know how to compute the reservation wage, let’s see how it varies with parameters.
In particular, let’s look at what happens when we change 𝛽 and 𝑐.
grid_size = 25
R = np.empty((grid_size, grid_size))

c_vals = np.linspace(10.0, 30.0, grid_size)
β_vals = np.linspace(0.9, 0.99, grid_size)

for i, c in enumerate(c_vals):
    for j, β in enumerate(β_vals):
        mcm = McCallModel(c=c, β=β)
        R[i, j] = compute_reservation_wage(mcm)

fig, ax = plt.subplots()

cs1 = ax.contourf(c_vals, β_vals, R.T, alpha=0.75)
ctr1 = ax.contour(c_vals, β_vals, R.T)

plt.colorbar(cs1, ax=ax)

ax.set_title("reservation wage")
ax.set_xlabel("$c$", fontsize=16)
ax.set_ylabel("$β$", fontsize=16)

ax.ticklabel_format(useOffset=False)

plt.show()
As expected, the reservation wage increases both with patience and with unemployment compensation.

27.4 Computing an Optimal Policy: Take 2
The approach to dynamic programming just described is standard and broadly applicable.
But for our McCall search model there’s also an easier way that circumvents the need to compute the value function.
Let ℎ denote the continuation value:
$$h = c + \beta \sum_{s'} v^*(s') q(s') \tag{27.6}$$
The Bellman equation can now be written as
$$v^*(s') = \max \left\{ \frac{w(s')}{1 - \beta}, \; h \right\}$$
Substituting this last equation into (27.6) gives
$$h = c + \beta \sum_{s' \in \mathbb{S}} \max \left\{ \frac{w(s')}{1 - \beta}, \; h \right\} q(s') \tag{27.7}$$
This is a nonlinear equation that we can solve for ℎ.

As before, we will use successive approximations:
Step 1: pick an initial guess ℎ.
Step 2: compute the update ℎ′ via
$$h' = c + \beta \sum_{s' \in \mathbb{S}} \max \left\{ \frac{w(s')}{1 - \beta}, \; h \right\} q(s') \tag{27.8}$$
Step 3: calculate the deviation |ℎ − ℎ′ |.

Step 4: if the deviation is larger than some fixed tolerance, set ℎ = ℎ′ and go to step 2, else return ℎ.
One can again use the Banach contraction mapping theorem to show that this process always converges.
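A short sketch of the key inequality may help here. Write 𝑄 for the map ℎ ↦ ℎ′ defined by (27.8); the symbol 𝑄 is a label introduced only for this sketch. The argument uses the elementary bound |max{a, x} − max{a, y}| ≤ |x − y| and the fact that 𝑞 sums to one:

```latex
\begin{aligned}
|Qh_1 - Qh_2|
  &= \beta \left| \sum_{s' \in \mathbb{S}}
     \left[ \max\left\{ \frac{w(s')}{1-\beta}, h_1 \right\}
          - \max\left\{ \frac{w(s')}{1-\beta}, h_2 \right\} \right] q(s') \right| \\
  &\le \beta \sum_{s' \in \mathbb{S}} |h_1 - h_2| \, q(s')
   = \beta \, |h_1 - h_2|
\end{aligned}
```

Since 0 < 𝛽 < 1, this shows that 𝑄 is a contraction of modulus 𝛽 on ℝ, which is the property Banach's theorem requires.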
The big difference here, however, is that we’re iterating on a scalar ℎ, rather than an 𝑛-vector, 𝑣(𝑖), 𝑖 = 1, … , 𝑛.
Here’s an implementation:
@jit(nopython=True)
def compute_reservation_wage_two(mcm,
                                 max_iter=500,
                                 tol=1e-5):

    # Simplify names
    c, β, w, q = mcm.c, mcm.β, mcm.w, mcm.q

    # == First compute h == #
    h = np.sum(w * q) / (1 - β)
    i = 0
    error = tol + 1
    while i < max_iter and error > tol:

        s = np.maximum(w / (1 - β), h)
        h_next = c + β * np.sum(s * q)

        error = np.abs(h_next - h)
        i += 1

        h = h_next

    # == Now compute the reservation wage == #
    return (1 - β) * h
You can use this code to solve the exercise below.
27.5 Exercises
Exercise 27.5.1
Compute the average duration of unemployment when 𝛽 = 0.99 and 𝑐 takes the following values
c_vals = np.linspace(10, 40, 25)
That is, start the agent off as unemployed, compute their reservation wage given the parameters, and then simulate to see
how long it takes to accept.
Repeat a large number of times and take the average.
Plot mean unemployment duration as a function of 𝑐 in c_vals.

Here’s one solution
cdf = np.cumsum(q_default)

@jit(nopython=True)
def compute_stopping_time(w_bar, seed=1234):

    np.random.seed(seed)
    t = 1
    while True:
        # Generate a wage draw
        w = w_default[qe.random.draw(cdf)]
        # Stop when the draw is above the reservation wage
        if w >= w_bar:
            stopping_time = t
            break
        else:
            t += 1
    return stopping_time

@jit(nopython=True)
def compute_mean_stopping_time(w_bar, num_reps=100000):
    obs = np.empty(num_reps)
    for i in range(num_reps):
        obs[i] = compute_stopping_time(w_bar, seed=i)
    return obs.mean()

c_vals = np.linspace(10, 40, 25)
stop_times = np.empty_like(c_vals)

# Compute the mean stopping time for each value of c
for i, c in enumerate(c_vals):
    mcm = McCallModel(c=c)
    w_bar = compute_reservation_wage_two(mcm)
    stop_times[i] = compute_mean_stopping_time(w_bar)

fig, ax = plt.subplots()

ax.plot(c_vals, stop_times, label="mean unemployment duration")
ax.set(xlabel="unemployment compensation", ylabel="months")
ax.legend()

plt.show()
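As a rough cross-check on a simulation of this kind: because offers are IID, each offer is accepted with the same probability p = ℙ{w ≥ w̄}, so the duration of unemployment is geometric with mean 1/p. The sketch below compares that formula with a simulated average. The wage grid, the uniform pmf and the value of w_bar are illustrative assumptions, not the lecture's defaults.

```python
import numpy as np

# Illustrative offer distribution (an assumption, not w_default / q_default)
w_vals = np.linspace(10, 20, 11)                  # wages 10, 11, ..., 20
q_vals = np.full(len(w_vals), 1 / len(w_vals))    # uniform pmf
w_bar = 14.5                                      # hypothetical reservation wage

# Probability that a single offer is accepted
p_accept = q_vals[w_vals >= w_bar].sum()

# Mean of a geometric random variable supported on {1, 2, ...}
analytic_mean = 1 / p_accept

# Simulate: each row is one unemployment spell of up to 60 offers
rng = np.random.default_rng(1234)
draws = rng.choice(w_vals, size=(50_000, 60), p=q_vals)
accepted = draws >= w_bar
# Index of the first accepted offer in each row, counted from period 1
durations = accepted.argmax(axis=1) + 1
simulated_mean = durations.mean()

print(analytic_mean, simulated_mean)
```

The horizon of 60 offers per spell is a truncation chosen for convenience; with these numbers the chance of no acceptance within 60 draws is negligible.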
Exercise 27.5.2
The purpose of this exercise is to show how to replace the discrete wage offer distribution used above with a continuous
distribution.
This is a significant topic because many convenient distributions are continuous (i.e., have a density).
Fortunately, the theory changes little in our simple model.
Recall that ℎ in (27.6) denotes the value of not accepting a job in this period but then behaving optimally in all subsequent periods.
To shift to a continuous offer distribution, we can replace (27.6) by
ℎ = 𝑐 + 𝛽 ∫ 𝑣∗ (𝑠′ )𝑞(𝑠′ )𝑑𝑠′ . (27.9)

Equation (27.7) becomes
$$h = c + \beta \int \max \left\{ \frac{w(s')}{1 - \beta}, \; h \right\} q(s') \, ds' \tag{27.10}$$
The aim is to solve this nonlinear equation by iteration, and from it obtain the reservation wage.
Try to carry this out, setting
• the state sequence {𝑠𝑡 } to be IID and standard normal and
• the wage function to be 𝑤(𝑠) = exp(𝜇 + 𝜎𝑠).
You will need to implement a new version of the McCallModel class that assumes a lognormal wage distribution.
Calculate the integral by Monte Carlo, by averaging over a large number of wage draws.
For default parameters, use c=25, β=0.99, σ=0.5, μ=2.5.
Once your code is working, investigate how the reservation wage changes with 𝑐 and 𝛽.

mccall_data_continuous = [
    ('c', float64),          # unemployment compensation
    ('β', float64),          # discount factor
    ('σ', float64),          # scale parameter in lognormal distribution
    ('μ', float64),          # location parameter in lognormal distribution
    ('w_draws', float64[:])  # draws of wages for Monte Carlo
]

@jitclass(mccall_data_continuous)
class McCallModelContinuous:

    def __init__(self, c=25, β=0.99, σ=0.5, μ=2.5, mc_size=1000):
        self.c, self.β, self.σ, self.μ = c, β, σ, μ

        # Draw and store shocks
        np.random.seed(1234)
        s = np.random.randn(mc_size)
        self.w_draws = np.exp(μ + σ * s)


@jit(nopython=True)
def compute_reservation_wage_continuous(mcmc, max_iter=500, tol=1e-5):

    c, β, σ, μ, w_draws = mcmc.c, mcmc.β, mcmc.σ, mcmc.μ, mcmc.w_draws

    h = np.mean(w_draws) / (1 - β)  # initial guess
    i = 0
    error = tol + 1
    while i < max_iter and error > tol:

        # Monte Carlo estimate of the integral in (27.10)
        integral = np.mean(np.maximum(w_draws / (1 - β), h))
        h_next = c + β * integral

        error = np.abs(h_next - h)
        i += 1

        h = h_next

    # == Now compute the reservation wage == #
    return (1 - β) * h
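The Monte Carlo step replaces an integral against the lognormal density with a sample average. As a rough sanity check of that idea, the sketch below compares the sample mean of w = exp(μ + σs) with the closed-form lognormal mean exp(μ + σ²/2). The sample size and seed are arbitrary choices made for illustration.

```python
import numpy as np

mu, sigma = 2.5, 0.5                     # the exercise's default parameters
rng = np.random.default_rng(1234)        # arbitrary seed

s = rng.standard_normal(1_000_000)       # standard normal state draws
w_draws = np.exp(mu + sigma * s)         # lognormal wage draws

mc_mean = w_draws.mean()                 # Monte Carlo estimate of E[w]
exact_mean = np.exp(mu + sigma**2 / 2)   # closed-form lognormal mean
rel_error = abs(mc_mean - exact_mean) / exact_mean

print(mc_mean, exact_mean, rel_error)
```

The same averaging idea is what a sample mean of max{w/(1 − β), h} over stored wage draws is doing in the fixed point iteration.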
Now we investigate how the reservation wage changes with 𝑐 and 𝛽.

We will do this using a contour plot.
grid_size = 25
R = np.empty((grid_size, grid_size))

c_vals = np.linspace(10.0, 30.0, grid_size)
β_vals = np.linspace(0.9, 0.99, grid_size)

for i, c in enumerate(c_vals):
    for j, β in enumerate(β_vals):
        mcmc = McCallModelContinuous(c=c, β=β)
        R[i, j] = compute_reservation_wage_continuous(mcmc)

fig, ax = plt.subplots()

cs1 = ax.contourf(c_vals, β_vals, R.T, alpha=0.75)
ctr1 = ax.contour(c_vals, β_vals, R.T)

plt.colorbar(cs1, ax=ax)

ax.set_title("reservation wage")
ax.set_xlabel("$c$", fontsize=16)
ax.set_ylabel("$β$", fontsize=16)

ax.ticklabel_format(useOffset=False)

plt.show()



CHAPTER
TWENTYEIGHT
JOB SEARCH II: SEARCH AND SEPARATION
Contents
• Job Search II: Search and Separation

– Overview
– The Model
– Solving the Model
– Implementation
– Impact of Parameters
– Exercises
28.1 Overview
Previously we looked at the McCall job search model [McCall, 1970] as a way of understanding unemployment and
worker decisions.
One unrealistic feature of the model is that every job is permanent.
In this lecture, we extend the McCall model by introducing job separation.
Once separation enters the picture, the agent comes to view
• the loss of a job as a capital loss, and
• a spell of unemployment as an investment in searching for an acceptable job
The other minor addition is that a utility function will be included to make worker preferences slightly more sophisticated.
We’ll need the following imports

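import matplotlib.pyplot as plt
import numpy as np
from numba import njit, float64
from numba.experimental import jitclass
from quantecon.distributions import BetaBinomial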

28.2 The Model
The model is similar to the baseline McCall job search model.

It concerns the life of an infinitely lived worker and
• the opportunities he or she (let’s say he to save one character) has to work at different wages
• exogenous events that destroy his current job
• his decision making process while unemployed
The worker can be in one of two states: employed or unemployed.
He wants to maximize
$$\mathbb{E} \sum_{t=0}^{\infty} \beta^t u(y_t) \tag{28.1}$$
At this stage the only difference from the baseline model is that we’ve added some flexibility to preferences by introducing
a utility function 𝑢.
It satisfies 𝑢′ > 0 and 𝑢″ < 0.
28.2.1 The Wage Process
For now we will drop the separation of state process and wage process that we maintained for the baseline model.
In particular, we simply suppose that wage offers {𝑤𝑡 } are IID with common distribution 𝑞.
The set of possible wage values is denoted by 𝕎.
(Later we will go back to having a separate state process {𝑠𝑡 } driving random outcomes, since this formulation is usually
convenient in more sophisticated models.)
28.2.2 Timing and Decisions
At the start of each period, the agent can be either

• unemployed or
• employed at some existing wage level 𝑤𝑒 .
At the start of a given period, the current wage offer 𝑤𝑡 is observed.
If currently employed, the worker
1. receives utility 𝑢(𝑤𝑒 ) and
2. is fired with some (small) probability 𝛼.

If currently unemployed, the worker either accepts or rejects the current offer 𝑤𝑡 .
If he accepts, then he begins work immediately at wage 𝑤𝑡 .
If he rejects, then he receives unemployment compensation 𝑐.
The process then repeats.
Note: We do not allow for job search while employed—this topic is taken up in a later lecture.
28.3 Solving the Model
We drop time subscripts in what follows and primes denote next period values.
Let
• 𝑣(𝑤𝑒 ) be total lifetime value accruing to a worker who enters the current period employed with existing wage 𝑤𝑒
• ℎ(𝑤) be total lifetime value accruing to a worker who enters the current period unemployed and receives wage offer 𝑤.
Here value means the value of the objective function (28.1) when the worker makes optimal decisions at all future points
in time.
Our first aim is to obtain these functions.
28.3.1 The Bellman Equations
Suppose for now that the worker can calculate the functions 𝑣 and ℎ and use them in his decision making.
Then 𝑣 and ℎ should satisfy
$$v(w_e) = u(w_e) + \beta \left[ (1 - \alpha) v(w_e) + \alpha \sum_{w' \in \mathbb{W}} h(w') q(w') \right] \tag{28.2}$$
and
$$h(w) = \max \left\{ v(w), \; u(c) + \beta \sum_{w' \in \mathbb{W}} h(w') q(w') \right\} \tag{28.3}$$
Equation (28.2) expresses the value of being employed at wage 𝑤𝑒 in terms of

• current reward 𝑢(𝑤𝑒 ) plus
• discounted expected reward tomorrow, given the 𝛼 probability of being fired
Equation (28.3) expresses the value of being unemployed with offer 𝑤 in hand as a maximum over the value of two
options: accept or reject the current offer.
Accepting transitions the worker to employment and hence yields reward 𝑣(𝑤).
Rejecting leads to unemployment compensation and unemployment tomorrow.
Equations (28.2) and (28.3) are the Bellman equations for this model.
They provide enough information to solve for both 𝑣 and ℎ.

28.3.2 A Simplifying Transformation
Rather than jumping straight into solving these equations, let’s see if we can simplify them somewhat.
(This process will be analogous to our second pass at the plain vanilla McCall model, where we simplified the Bellman
equation.)
First, let
$$d := \sum_{w' \in \mathbb{W}} h(w') q(w') \tag{28.4}$$
be the expected value of unemployment tomorrow.

We can now write (28.3) as
ℎ(𝑤) = max {𝑣(𝑤), 𝑢(𝑐) + 𝛽𝑑}
or, shifting time forward one period
$$\sum_{w' \in \mathbb{W}} h(w') q(w') = \sum_{w' \in \mathbb{W}} \max \left\{ v(w'), \; u(c) + \beta d \right\} q(w')$$
Using (28.4) again now gives
$$d = \sum_{w' \in \mathbb{W}} \max \left\{ v(w'), \; u(c) + \beta d \right\} q(w') \tag{28.5}$$
Finally, (28.2) can now be rewritten as
𝑣(𝑤) = 𝑢(𝑤) + 𝛽 [(1 − 𝛼)𝑣(𝑤) + 𝛼𝑑] (28.6)
In the last expression, we wrote 𝑤𝑒 as 𝑤 to make the notation simpler.
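Because (28.6) is linear in 𝑣(𝑤), it can be rearranged to express 𝑣(𝑤) directly in terms of 𝑑. This step is not needed for the iterative method, but it is a convenient sketch of how 𝑣 depends on 𝑤 and 𝑑:

```latex
v(w) - \beta (1 - \alpha) v(w) = u(w) + \beta \alpha d
\quad \Longrightarrow \quad
v(w) = \frac{u(w) + \beta \alpha d}{1 - \beta (1 - \alpha)}
```

With 0 < 𝛽 < 1 and 0 ≤ 𝛼 ≤ 1 the denominator is strictly positive, so for a given 𝑑 the function 𝑣 is increasing in 𝑤 whenever 𝑢 is.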
28.3.3 The Reservation Wage
Suppose we can use (28.5) and (28.6) to solve for 𝑑 and 𝑣.

(We will do this soon.)
We can then determine optimal behavior for the worker.
From (28.3), we see that an unemployed agent accepts current offer 𝑤 if 𝑣(𝑤) ≥ 𝑢(𝑐) + 𝛽𝑑.
This means precisely that the value of accepting is higher than the expected value of rejecting.
It is clear that 𝑣 is (at least weakly) increasing in 𝑤, since the agent is never made worse off by a higher wage offer.
Hence, we can express the optimal choice as accepting wage offer 𝑤 if and only if
𝑤 ≥ 𝑤̄ where 𝑤̄ solves 𝑣(𝑤̄) = 𝑢(𝑐) + 𝛽𝑑
28.3.4 Solving the Bellman Equations
We’ll use the same iterative approach to solving the Bellman equations that we adopted in the first job search lecture.
Here this amounts to
1. make guesses for 𝑑 and 𝑣
2. plug these guesses into the right-hand sides of (28.5) and (28.6)

3. update the left-hand sides from this rule and then repeat
In other words, we are iterating using the rules
$$d_{n+1} = \sum_{w' \in \mathbb{W}} \max \left\{ v_n(w'), \; u(c) + \beta d_n \right\} q(w') \tag{28.7}$$

$$v_{n+1}(w) = u(w) + \beta \left[ (1 - \alpha) v_n(w) + \alpha d_n \right] \tag{28.8}$$

starting from some initial conditions 𝑑0 , 𝑣0 .
As before, the system always converges to the true solutions—in this case, the 𝑣 and 𝑑 that solve (28.5) and (28.6).
(A proof can be obtained via the Banach contraction mapping theorem.)
28.4 Implementation
Let’s implement this iterative process.

In the code, you’ll see that we use a class to store the various parameters and other objects associated with a given model.
This helps to tidy up the code and provides an object that’s easy to pass to functions.
The default utility function is a CRRA utility function
@njit
def u(c, σ=2.0):
    return (c**(1 - σ) - 1) / (1 - σ)
Also, here’s a default wage distribution, based around the BetaBinomial distribution:
n = 60 # n possible outcomes for w

w_default = np.linspace(10, 20, n) # wages between 10 and 20
a, b = 600, 400 # shape parameters
dist = BetaBinomial(n-1, a, b)
q_default = dist.pdf()
Here’s our jitted class for the McCall model with separation.
mccall_data = [
    ('α', float64),      # job separation rate
    ('β', float64),      # discount factor
    ('c', float64),      # unemployment compensation
    ('w', float64[:]),   # list of wage values
    ('q', float64[:])    # pmf of random variable w
]

@jitclass(mccall_data)
class McCallModel:
    """
    Stores the parameters and functions associated with a given model.
    """

    def __init__(self, α=0.2, β=0.98, c=6.0, w=w_default, q=q_default):
        self.α, self.β, self.c, self.w, self.q = α, β, c, w, q

    def update(self, v, d):
        α, β, c, w, q = self.α, self.β, self.c, self.w, self.q
        # Update v via (28.8)
        v_new = np.empty_like(v)
        for i in range(len(w)):
            v_new[i] = u(w[i]) + β * ((1 - α) * v[i] + α * d)
        # Update d via (28.7)
        d_new = np.sum(np.maximum(v, u(c) + β * d) * q)
        return v_new, d_new
Now we iterate until successive iterates are closer together than some small tolerance level.
We then return the current iterate as an approximate solution.
@njit
def solve_model(mcm, tol=1e-5, max_iter=2000):
    """
    Iterates to convergence on the Bellman equations
    """
    v = np.ones_like(mcm.w)    # Initial guess of v
    d = 1                      # Initial guess of d
    i = 0
    error = tol + 1

    while error > tol and i < max_iter:
        v_new, d_new = mcm.update(v, d)
        error_1 = np.max(np.abs(v_new - v))
        error_2 = np.abs(d_new - d)
        error = max(error_1, error_2)
        v = v_new
        d = d_new
        i += 1

    return v, d
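As a rough check on an iteration of this kind, the sketch below runs the updates (28.7) and (28.8) in plain NumPy and then confirms that the limit satisfies (28.6), rearranged as v(w) = (u(w) + βαd) / (1 − β(1 − α)). The five-point wage grid, the uniform pmf and log utility are assumptions made for illustration; they are not the defaults used above.

```python
import numpy as np

alpha, beta, c = 0.2, 0.98, 6.0       # separation rate, discount factor, compensation
w = np.linspace(10, 20, 5)            # illustrative wage grid (assumption)
q = np.full(5, 0.2)                   # illustrative uniform pmf (assumption)
u = np.log                            # log utility, chosen for simplicity

v = np.ones_like(w)                   # initial guess of v
d = 1.0                               # initial guess of d
for _ in range(5000):
    v_new = u(w) + beta * ((1 - alpha) * v + alpha * d)     # rule (28.8)
    d_new = np.sum(np.maximum(v, u(c) + beta * d) * q)      # rule (28.7)
    error = max(np.max(np.abs(v_new - v)), abs(d_new - d))
    v, d = v_new, d_new
    if error < 1e-9:
        break

# Closed-form v implied by (28.6) at the computed d
v_closed = (u(w) + beta * alpha * d) / (1 - beta * (1 - alpha))
gap = np.max(np.abs(v - v_closed))

print(error, gap)
```

A small gap indicates that the iterates have settled on a pair (v, d) consistent with (28.6).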
28.4.1 The Reservation Wage: First Pass
The optimal choice of the agent is summarized by the reservation wage.

As discussed above, the reservation wage is the 𝑤̄ that solves 𝑣(𝑤̄) = ℎ where ℎ ∶= 𝑢(𝑐) + 𝛽𝑑 is the continuation value.
Let’s compare 𝑣 and ℎ to see what they look like.
We’ll use the default parameterizations found in the code above.
mcm = McCallModel()
v, d = solve_model(mcm)
h = u(mcm.c) + mcm.β * d

fig, ax = plt.subplots()

ax.plot(mcm.w, v, 'b-', lw=2, alpha=0.7, label='$v$')
ax.plot(mcm.w, [h] * len(mcm.w),
        'g-', lw=2, alpha=0.7, label='$h$')
ax.set_xlim(min(mcm.w), max(mcm.w))
ax.legend()

plt.show()
The value 𝑣 is increasing because higher 𝑤 generates a higher wage flow conditional on staying employed.
28.4.2 The Reservation Wage: Computation
Here’s a function compute_reservation_wage that takes an instance of McCallModel and returns the associated reservation wage.
@njit
def compute_reservation_wage(mcm):
    """
    Computes the reservation wage of an instance of the McCall model
    by finding the smallest w such that v(w) >= h.
    If no such w exists, then w_bar is set to np.inf.
    """
    v, d = solve_model(mcm)
    h = u(mcm.c) + mcm.β * d

    i = np.searchsorted(v, h, side='right')
    if i < len(mcm.w):
        w_bar = mcm.w[i]
    else:
        w_bar = np.inf   # no wage on the grid is acceptable

    return w_bar
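To see how np.searchsorted locates a reservation wage, here is a small sketch with made-up numbers (the arrays are purely illustrative). With side='right', the call returns the number of entries of the sorted array v that are less than or equal to h, which is the index of the first entry strictly greater than h. If no entry exceeds h, the returned index equals len(v), so using it to index w directly would be out of bounds.

```python
import numpy as np

w = np.array([10.0, 12.0, 14.0, 16.0])   # illustrative wage grid
v = np.array([1.0, 2.0, 3.0, 4.0])       # illustrative increasing values v(w)

def first_wage_above(h):
    "Return the first wage whose value strictly exceeds h, or np.inf if none does."
    i = np.searchsorted(v, h, side='right')
    return w[i] if i < len(w) else np.inf

print(first_wage_above(2.5))   # 14.0
print(first_wage_above(2.0))   # 14.0, since v = 2.0 is not strictly above h
print(first_wage_above(5.0))   # inf
```

Passing side='left' instead would treat ties as acceptable, returning the first index with v >= h.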
Next we will investigate how the reservation wage varies with parameters.

28.5 Impact of Parameters
In each instance below, we’ll show you a figure and then ask you to reproduce it in the exercises.
28.5.1 The Reservation Wage and Unemployment Compensation
First, let’s look at how 𝑤̄ varies with unemployment compensation.

In the figure below, we use the default parameters in the McCallModel class, apart from c (which takes the values given on the horizontal axis).
As expected, higher unemployment compensation causes the worker to hold out for higher wages.
In effect, the cost of continuing job search is reduced.
28.5.2 The Reservation Wage and Discounting
Next, let’s investigate how 𝑤̄ varies with the discount factor.

The next figure plots the reservation wage associated with different values of 𝛽.
Again, the results are intuitive: More patient workers will hold out for higher wages.

28.5.3 The Reservation Wage and Job Destruction
Finally, let’s look at how 𝑤̄ varies with the job separation rate 𝛼.
Higher 𝛼 translates to a greater chance that a worker will face termination in each period once employed.
Once more, the results are in line with our intuition.
If the separation rate is high, then the benefit of holding out for a higher wage falls.
Hence the reservation wage is lower.
28.6 Exercises
Exercise 28.6.1
Reproduce all the reservation wage figures shown above.
Regarding the values on the horizontal axis, use
grid_size = 25
c_vals = np.linspace(2, 12, grid_size) # unemployment compensation
beta_vals = np.linspace(0.8, 0.99, grid_size) # discount factors
alpha_vals = np.linspace(0.05, 0.5, grid_size) # separation rate

Here’s the first figure.
mcm = McCallModel()
w_bar_vals = np.empty_like(c_vals)

fig, ax = plt.subplots()

for i, c in enumerate(c_vals):
    mcm.c = c
    w_bar = compute_reservation_wage(mcm)
    w_bar_vals[i] = w_bar

ax.set(xlabel='unemployment compensation',
       ylabel='reservation wage')
ax.plot(c_vals, w_bar_vals, label=r'$\bar w$ as a function of $c$')
ax.legend()

plt.show()

Here’s the second one.
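fig, ax = plt.subplots()

mcm = McCallModel()   # reset to defaults so that only β varies
for i, β in enumerate(beta_vals):
    mcm.β = β
    w_bar = compute_reservation_wage(mcm)
    w_bar_vals[i] = w_bar

ax.set(xlabel='discount factor', ylabel='reservation wage')
ax.plot(beta_vals, w_bar_vals, label=r'$\bar w$ as a function of $\beta$')
ax.legend()

plt.show()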
Here’s the third.
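fig, ax = plt.subplots()

mcm = McCallModel()   # reset to defaults so that only α varies
for i, α in enumerate(alpha_vals):
    mcm.α = α
    w_bar = compute_reservation_wage(mcm)
    w_bar_vals[i] = w_bar

ax.set(xlabel='separation rate', ylabel='reservation wage')
ax.plot(alpha_vals, w_bar_vals, label=r'$\bar w$ as a function of $\alpha$')
ax.legend()

plt.show()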

CHAPTER
TWENTYNINE
JOB SEARCH III: FITTED VALUE FUNCTION ITERATION
Contents
• Job Search III: Fitted Value Function Iteration

– Overview
– The Algorithm
– Implementation
– Exercises
29.1 Overview
In this lecture we again study the McCall job search model with separation, but now with a continuous wage distribution.
While we already considered continuous wage distributions briefly in the exercises of the first job search lecture, the change
was relatively trivial in that case.
This is because we were able to reduce the problem to solving for a single scalar value (the continuation value).
Here, with separation, the change is less trivial, since a continuous wage distribution leads to an uncountably infinite state
space.
The infinite state space leads to additional challenges, particularly when it comes to applying value function iteration (VFI).
These challenges will lead us to modify VFI by adding an interpolation step.
The combination of VFI and this interpolation step is called fitted value function iteration (fitted VFI).
Fitted VFI is very common in practice, so we will take some time to work through the details.
We will use the following imports:

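import matplotlib.pyplot as plt
import numpy as np
from numba import njit, float64
from numba.experimental import jitclass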
29.2 The Algorithm
The model is the same as the McCall model with job separation we studied before, except that the wage offer distribution
is continuous.
We are going to start with the two Bellman equations we obtained for the model with job separation after a simplifying
transformation.
Modified to accommodate continuous wage draws, they take the following form:
𝑑 = ∫ max {𝑣(𝑤′ ), 𝑢(𝑐) + 𝛽𝑑} 𝑞(𝑤′ )𝑑𝑤′ (29.1)
and
𝑣(𝑤) = 𝑢(𝑤) + 𝛽 [(1 − 𝛼)𝑣(𝑤) + 𝛼𝑑] (29.2)
The unknowns here are the function 𝑣 and the scalar 𝑑.

The differences between these and the pair of Bellman equations we previously worked on are
1. In (29.1), what used to be a sum over a finite number of wage values is an integral over an infinite set.
2. The function 𝑣 in (29.2) is defined over all 𝑤 ∈ ℝ+.
The function 𝑞 in (29.1) is the density of the wage offer distribution.
Its support is taken as equal to ℝ+ .
29.2.1 Value Function Iteration
In theory, we should now proceed as follows:

1. Begin with a guess 𝑣, 𝑑 for the solutions to (29.1)–(29.2).
2. Plug 𝑣, 𝑑 into the right hand side of (29.1)–(29.2) and compute the left hand side to obtain updates 𝑣′ , 𝑑′
3. Unless some stopping condition is satisfied, set (𝑣, 𝑑) = (𝑣′ , 𝑑′ ) and go to step 2.
However, there is a problem we must confront before we implement this procedure: The iterates of the value function
can neither be calculated exactly nor stored on a computer.
To see the issue, consider (29.2).
Even if 𝑣 is a known function, the only way to store its update 𝑣′ is to record its value 𝑣′ (𝑤) for every 𝑤 ∈ ℝ+ .
Clearly, this is impossible.
29.2.2 Fitted Value Function Iteration
What we will do instead is use fitted value function iteration.

The procedure is as follows:
Let a current guess 𝑣 be given.
Now we record the value of the function 𝑣′ at only finitely many “grid” points 𝑤1 < 𝑤2 < ⋯ < 𝑤𝐼 and then reconstruct
𝑣′ from this information when required.
More precisely, the algorithm will be
1. Begin with an array v representing the values of an initial guess of the value function on some grid points {𝑤𝑖 }.
2. Build a function 𝑣 on the state space ℝ+ by interpolation or approximation, based on v and {𝑤𝑖 }.
3. Obtain and record the samples of the updated function 𝑣′ (𝑤𝑖 ) on each grid point 𝑤𝑖 .
4. Unless some stopping condition is satisfied, take this as the new array and go to step 1.
How should we go about step 2?
This is a problem of function approximation, and there are many ways to approach it.
What’s important here is that the function approximation scheme must not only produce a good approximation to each 𝑣,
but also that it combines well with the broader iteration algorithm described above.
One good choice in both respects is continuous piecewise linear interpolation.
This method
1. combines well with value function iteration (see, e.g., [Gordon, 1995] or [Stachurski, 2008]) and
2. preserves useful shape properties such as monotonicity and concavity/convexity.
Linear interpolation will be implemented using numpy.interp.
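As a quick illustration of how np.interp behaves (the grid and values below are made up for the example): it evaluates the piecewise linear interpolant at the query points, and outside the grid it returns the boundary values by default rather than extrapolating.

```python
import numpy as np

grid = np.array([0.0, 1.0, 2.0])    # illustrative grid points
vals = np.array([0.0, 2.0, 3.0])    # increasing values recorded on the grid

# Query points: left of the grid, two interior points, right of the grid
x = np.array([-1.0, 0.5, 1.5, 3.0])
y = np.interp(x, grid, vals)
print(y)   # values: 0.0, 1.0, 2.5, 3.0

# The interpolant of increasing data is itself increasing
fine = np.linspace(0, 2, 201)
is_monotone = np.all(np.diff(np.interp(fine, grid, vals)) >= 0)
print(is_monotone)
```

The constant behavior outside the grid matters in fitted VFI whenever draws fall beyond the largest grid point, so the grid is usually chosen to cover most of the mass of the distribution.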
The next figure illustrates piecewise linear interpolation of an arbitrary function on grid points 0, 0.2, 0.4, 0.6, 0.8, 1.
def f(x):
    y1 = 2 * np.cos(6 * x) + np.sin(14 * x)
    return y1 + 2.5

c_grid = np.linspace(0, 1, 6)     # interpolation grid
f_grid = np.linspace(0, 1, 150)   # fine grid for plotting

def Af(x):
    return np.interp(x, c_grid, f(c_grid))

fig, ax = plt.subplots()

ax.plot(f_grid, f(f_grid), 'b-', label='true function')
ax.plot(f_grid, Af(f_grid), 'g-', label='linear approximation')
ax.vlines(c_grid, c_grid * 0, f(c_grid), linestyle='dashed', alpha=0.5)

ax.legend(loc="upper center")
ax.set(xlim=(0, 1), ylim=(0, 6))

plt.show()

29.3 Implementation
The first step is to build a jitted class for the McCall model with separation and a continuous wage offer distribution.
We will take the utility function to be the log function for this application, with 𝑢(𝑐) = ln 𝑐.
We will adopt the lognormal distribution for wages, with 𝑤 = exp(𝜇 + 𝜎𝑧) when 𝑧 is standard normal and 𝜇, 𝜎 are
parameters.
@njit
def lognormal_draws(n=1000, μ=2.5, σ=0.5, seed=1234):
    np.random.seed(seed)   # make the draws reproducible
    z = np.random.randn(n)
    w_draws = np.exp(μ + σ * z)
    return w_draws
Here’s our class.
mccall_data_continuous = [
    ('c', float64),          # unemployment compensation
    ('α', float64),          # job separation rate
    ('β', float64),          # discount factor
    ('w_grid', float64[:]),  # grid of points for fitted VFI
    ('w_draws', float64[:])  # draws of wages for Monte Carlo
]

@jitclass(mccall_data_continuous)
class McCallModelContinuous:

    def __init__(self,
                 c=1,
                 α=0.1,
                 β=0.96,
                 grid_min=1e-10,
                 grid_max=5,
                 grid_size=100,
                 w_draws=lognormal_draws()):

        self.c, self.α, self.β = c, α, β

        self.w_grid = np.linspace(grid_min, grid_max, grid_size)
        self.w_draws = w_draws

    def update(self, v, d):

        # Simplify names
        c, α, β = self.c, self.α, self.β
        w = self.w_grid
        u = lambda x: np.log(x)

        # Interpolate array represented value function
        vf = lambda x: np.interp(x, w, v)

        # Update d using Monte Carlo to evaluate integral
        d_new = np.mean(np.maximum(vf(self.w_draws), u(c) + β * d))

        # Update v
        v_new = u(w) + β * ((1 - α) * v + α * d)

        return v_new, d_new
We now iterate until successive iterates are closer together than some small tolerance level, and then return the current iterate as an approximate solution.
@njit
def solve_model(mcm, tol=1e-5, max_iter=2000):
    """
    Iterates to convergence on the Bellman equations
    """
    v = np.ones_like(mcm.w_grid)    # Initial guess of v
    d = 1                           # Initial guess of d
    i = 0
    error = tol + 1

    while error > tol and i < max_iter:
        v_new, d_new = mcm.update(v, d)
        error_1 = np.max(np.abs(v_new - v))
        error_2 = np.abs(d_new - d)
        error = max(error_1, error_2)
        v = v_new
        d = d_new
        i += 1

    return v, d
Here’s a function compute_reservation_wage that takes an instance of McCallModelContinuous and returns the associated reservation wage.
If 𝑣(𝑤) < ℎ for all 𝑤, then the function returns np.inf.
@njit
def compute_reservation_wage(mcm):
    """
    Computes the reservation wage of an instance of the McCall model
    by finding the smallest w such that v(w) >= h.
    If no such w exists, then w_bar is set to np.inf.
    """
    u = lambda x: np.log(x)

    v, d = solve_model(mcm)
    h = u(mcm.c) + mcm.β * d

    w_bar = np.inf
    for i, wage in enumerate(mcm.w_grid):
        if v[i] > h:
            w_bar = wage
            break

    return w_bar
The exercises ask you to explore the solution and how it changes with parameters.
29.4 Exercises
Exercise 29.4.1
Use the code above to explore what happens to the reservation wage when the wage parameter 𝜇 changes.
Use the default parameters and 𝜇 in mu_vals = np.linspace(0.0, 2.0, 15).
Is the impact on the reservation wage as you expected?

mcm = McCallModelContinuous()
mu_vals = np.linspace(0.0, 2.0, 15)
w_bar_vals = np.empty_like(mu_vals)

fig, ax = plt.subplots()

for i, m in enumerate(mu_vals):
    mcm.w_draws = lognormal_draws(μ=m)
    w_bar = compute_reservation_wage(mcm)
    w_bar_vals[i] = w_bar

ax.set(xlabel='mean', ylabel='reservation wage')
ax.plot(mu_vals, w_bar_vals, label=r'$\bar w$ as a function of $\mu$')
ax.legend()

plt.show()
Not surprisingly, the agent is more inclined to wait when the distribution of offers shifts to the right.
Exercise 29.4.2
Let us now consider how the agent responds to an increase in volatility.
To try to understand this, compute the reservation wage when the wage offer distribution is uniform on (𝑚 − 𝑠, 𝑚 + 𝑠)
and 𝑠 varies.
The idea here is that we are holding the mean constant and spreading the support.
(This is a form of mean-preserving spread.)
Use s_vals = np.linspace(1.0, 2.0, 15) and m = 2.0.
State how you expect the reservation wage to vary with 𝑠.

Now compute it. Is this as you expected?

mcm = McCallModelContinuous()
s_vals = np.linspace(1.0, 2.0, 15)
m = 2.0
w_bar_vals = np.empty_like(s_vals)

fig, ax = plt.subplots()

for i, s in enumerate(s_vals):
    a, b = m - s, m + s
    mcm.w_draws = np.random.uniform(low=a, high=b, size=10_000)
    w_bar = compute_reservation_wage(mcm)
    w_bar_vals[i] = w_bar

ax.set(xlabel='volatility', ylabel='reservation wage')
ax.plot(s_vals, w_bar_vals, label=r'$\bar w$ as a function of wage volatility')
ax.legend()

plt.show()
The reservation wage increases with volatility.

One might think that higher volatility would make the agent more inclined to take a given offer, since doing so represents
certainty and waiting represents risk.

But job search is like holding an option: the worker is only exposed to upside risk (since, in a free market, no one can
force them to take a bad offer).
More volatility means higher upside potential, which encourages the agent to wait.
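The option-value argument can be illustrated with a small calculation. Hold fixed a continuation value h (a hypothetical number picked for illustration, not a solution of the model) and consider 𝔼 max{w, h} for w uniform on (m − s, m + s). This quantity rises with s even though the mean of w stays at m. The sketch below estimates it by Monte Carlo, reusing the same underlying draws for every s.

```python
import numpy as np

m, h = 2.0, 2.0                            # mean of offers; hypothetical fallback value
s_vals = np.linspace(1.0, 2.0, 5)          # half-widths of the support

rng = np.random.default_rng(0)             # arbitrary seed
z = rng.uniform(-1.0, 1.0, size=200_000)   # common draws, rescaled for each s

# For each s, w = m + s * z is uniform on (m - s, m + s)
expected_max = np.array([np.mean(np.maximum(m + s * z, h)) for s in s_vals])

print(expected_max)
```

With h = m the exact value is m + s/4, so the estimates should be close to 2.25 at s = 1 and 2.5 at s = 2. The downside draws are truncated at h, while the upside grows with s.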

CHAPTER
THIRTY
JOB SEARCH IV: CORRELATED WAGE OFFERS
Contents
• Job Search IV: Correlated Wage Offers

– Overview
– The Model
– Implementation
– Unemployment Duration
– Exercises
30.1 Overview
In this lecture we solve a McCall style job search model with persistent and transitory components to wages.
In other words, we relax the unrealistic assumption that randomness in wages is independent over time.
At the same time, we will go back to assuming that jobs are permanent and no separation occurs.
This is to keep the model relatively simple as we study the impact of correlation.

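We will use the following imports:

import matplotlib.pyplot as plt
import numpy as np
import quantecon as qe
from numpy.random import randn
from numba import njit, prange, float64
from numba.experimental import jitclass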
30.2 The Model
Wages at each point in time are given by
𝑤𝑡 = exp(𝑧𝑡 ) + 𝑦𝑡
where
𝑦𝑡 ∼ exp(𝜇 + 𝑠𝜁𝑡 ) and 𝑧𝑡+1 = 𝑑 + 𝜌𝑧𝑡 + 𝜎𝜖𝑡+1
Here {𝜁𝑡 } and {𝜖𝑡 } are both IID and standard normal.
Here {𝑦𝑡 } is a transitory component and {𝑧𝑡 } is persistent.
As before, the worker can either
1. accept an offer and work permanently at that wage, or
2. take unemployment compensation 𝑐 and wait till next period.
The value function satisfies the Bellman equation
$$v^*(w, z) = \max \left\{ \frac{u(w)}{1 - \beta}, \; u(c) + \beta \, \mathbb{E}_z v^*(w', z') \right\}$$
In this expression, 𝑢 is a utility function and 𝔼𝑧 is the expectation of next period variables given current 𝑧.
The variable 𝑧 enters as a state in the Bellman equation because its current value helps predict future wages.
30.2.1 A Simplification
There is a way that we can reduce dimensionality in this problem, which greatly accelerates computation.
To start, let 𝑓 ∗ be the continuation value function, defined by
𝑓 ∗ (𝑧) ∶= 𝑢(𝑐) + 𝛽 𝔼𝑧 𝑣∗ (𝑤′ , 𝑧 ′ )
The Bellman equation can now be written
$$v^*(w, z) = \max \left\{ \frac{u(w)}{1 - \beta}, \; f^*(z) \right\}$$
Combining the last two expressions, we see that the continuation value function satisfies
$$f^*(z) = u(c) + \beta \, \mathbb{E}_z \max \left\{ \frac{u(w')}{1 - \beta}, \; f^*(z') \right\}$$
We’ll solve this functional equation for 𝑓 ∗ by introducing the operator
$$Qf(z) = u(c) + \beta \, \mathbb{E}_z \max \left\{ \frac{u(w')}{1 - \beta}, \; f(z') \right\}$$
By construction, 𝑓 ∗ is a fixed point of 𝑄, in the sense that 𝑄𝑓 ∗ = 𝑓 ∗ .

Under mild assumptions, it can be shown that 𝑄 is a contraction mapping over a suitable space of continuous functions
on ℝ.
By Banach’s contraction mapping theorem, this means that 𝑓 ∗ is the unique fixed point and we can calculate it by iterating
with 𝑄 from any reasonable initial condition.

Once we have 𝑓 ∗ , we can solve the search problem by stopping when the reward for accepting exceeds the continuation
value, or
$$\frac{u(w)}{1 - \beta} \geq f^*(z)$$
For utility we take 𝑢(𝑐) = ln(𝑐).
The reservation wage is the wage where equality holds in the last expression.
That is,
$$\bar{w}(z) := \exp(f^*(z)(1 - \beta)) \tag{30.1}$$
Our main aim is to solve for the reservation rule and study its properties and implications.
30.3 Implementation
Let 𝑓 be our initial guess of 𝑓 ∗ .

When we iterate, we use the fitted value function iteration algorithm.
In particular, 𝑓 and all subsequent iterates are stored as a vector of values on a grid.
These points are interpolated into a function as required, using piecewise linear interpolation.
The integral in the definition of 𝑄𝑓 is calculated by Monte Carlo.
The following list helps Numba by providing some type information about the data we will work with.
job_search_data = [
    ('μ', float64),             # transient shock log mean
    ('s', float64),             # transient shock log variance
    ('d', float64),             # shift coefficient of persistent state
    ('ρ', float64),             # correlation coefficient of persistent state
    ('σ', float64),             # state volatility
    ('β', float64),             # discount factor
    ('c', float64),             # unemployment compensation
    ('z_grid', float64[:]),     # grid over the state space
    ('e_draws', float64[:,:])   # Monte Carlo draws for integration
]
Here’s a class that stores the data and the right hand side of the Bellman equation.
Default parameter values are embedded in the class.
@jitclass(job_search_data)
class JobSearch:

    def __init__(self,
                 μ=0.0,        # transient shock log mean
                 s=1.0,        # transient shock log variance
                 d=0.0,        # shift coefficient of persistent state
                 ρ=0.9,        # correlation coefficient of persistent state
                 σ=0.1,        # state volatility
                 β=0.98,       # discount factor
                 c=5,          # unemployment compensation
                 mc_size=1000,
                 grid_size=100):

        self.μ, self.s, self.d = μ, s, d
        self.ρ, self.σ, self.β, self.c = ρ, σ, β, c

        # Set up grid
        z_mean = d / (1 - ρ)
        z_sd = σ / np.sqrt(1 - ρ**2)
        k = 3  # std devs from mean
        a, b = z_mean - k * z_sd, z_mean + k * z_sd
        self.z_grid = np.linspace(a, b, grid_size)

        # Draw and store shocks
        self.e_draws = randn(2, mc_size)

    def parameters(self):
        """
        Return all parameters as a tuple.
        """
        return self.μ, self.s, self.d, \
               self.ρ, self.σ, self.β, self.c
Next we implement the 𝑄 operator.
@njit(parallel=True)
def Q(js, f_in, f_out):
    """
    Apply the operator Q.

        * js is an instance of JobSearch
        * f_in and f_out are arrays that represent f and Qf respectively

    """
    μ, s, d, ρ, σ, β, c = js.parameters()
    M = js.e_draws.shape[1]

    for i in prange(len(js.z_grid)):
        z = js.z_grid[i]
        expectation = 0.0
        for m in range(M):
            e1, e2 = js.e_draws[:, m]
            z_next = d + ρ * z + σ * e1
            go_val = np.interp(z_next, js.z_grid, f_in)  # f(z')
            y_next = np.exp(μ + s * e2)                  # y' draw
            w_next = np.exp(z_next) + y_next             # w' draw
            stop_val = np.log(w_next) / (1 - β)
            expectation += max(stop_val, go_val)
        expectation = expectation / M
        f_out[i] = np.log(c) + β * expectation
Here’s a function to compute an approximation to the fixed point of 𝑄.
def compute_fixed_point(js,
                        use_parallel=True,
                        tol=1e-4,
                        max_iter=1000,
                        verbose=True,
                        print_skip=25):

    f_init = np.full(len(js.z_grid), np.log(js.c))
    f_out = np.empty_like(f_init)

    # Set up loop
    f_in = f_init
    i = 0
    error = tol + 1

    # Iterate with Q until the sup-norm change falls below tol
    while i < max_iter and error > tol:
        Q(js, f_in, f_out)
        error = np.max(np.abs(f_in - f_out))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        f_in[:] = f_out

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return f_out
Let’s try generating an instance and solving the model.
js = JobSearch()
qe.tic()
f_star = compute_fixed_point(js, verbose=True)
qe.toc()
Error at iteration 25 is 0.5762477839587632.


Converged in 178 iterations.
TOC: Elapsed: 0:00:6.43
6.438979864120483
Next we will compute and plot the reservation wage function defined in (30.1).
res_wage_function = np.exp(f_star * (1 - js.β))

fig, ax = plt.subplots()
ax.plot(js.z_grid, res_wage_function, label="reservation wage given $z$")
ax.set(xlabel="$z$", ylabel="wage")
ax.legend()
plt.show()
Notice that the reservation wage is increasing in the current state 𝑧.

This is because a higher state leads the agent to predict higher future wages, increasing the option value of waiting.
Let’s try changing unemployment compensation and look at its impact on the reservation wage:
c_vals = 1, 2, 3

fig, ax = plt.subplots()

for c in c_vals:
    js = JobSearch(c=c)
    f_star = compute_fixed_point(js, verbose=False)
    res_wage_function = np.exp(f_star * (1 - js.β))
    ax.plot(js.z_grid, res_wage_function, label=rf"$\bar w$ at $c = {c}$")

ax.set(xlabel="$z$", ylabel="wage")
ax.legend()
plt.show()
As expected, higher unemployment compensation shifts the reservation wage up at all state values.
30.4 Unemployment Duration
Next we study how mean unemployment duration varies with unemployment compensation.
For simplicity we’ll fix the initial state at 𝑧𝑡 = 0.
def compute_unemployment_duration(js, seed=1234):

    f_star = compute_fixed_point(js, verbose=False)
    μ, s, d, ρ, σ, β, c = js.parameters()
    z_grid = js.z_grid

    @njit
    def f_star_function(z):
        return np.interp(z, z_grid, f_star)

    @njit
    def draw_tau(t_max=10_000):
        z = 0
        t = 0

        unemployed = True
        while unemployed and t < t_max:
            # draw current wage
            y = np.exp(μ + s * np.random.randn())
            w = np.exp(z) + y
            res_wage = np.exp(f_star_function(z) * (1 - β))
            # if optimal to stop, record t
            if w >= res_wage:
                unemployed = False
                τ = t
            # else increment data and state
            else:
                z = ρ * z + d + σ * np.random.randn()
                t += 1
        return τ

    # prange only runs in parallel inside a function jitted with parallel=True
    @njit(parallel=True)
    def compute_expected_tau(num_reps=100_000):
        sum_value = 0
        for i in prange(num_reps):
            sum_value += draw_tau()
        return sum_value / num_reps

    return compute_expected_tau()
Let’s test this out with some possible values for unemployment compensation.
c_vals = np.linspace(1.0, 10.0, 8)
durations = np.empty_like(c_vals)

for i, c in enumerate(c_vals):
    js = JobSearch(c=c)
    τ = compute_unemployment_duration(js)
    durations[i] = τ
Here is a plot of the results.
fig, ax = plt.subplots()
ax.plot(c_vals, durations)
ax.set_xlabel("unemployment compensation")
ax.set_ylabel("mean unemployment duration")
plt.show()

Not surprisingly, unemployment duration increases when unemployment compensation is higher.

This is because the value of waiting increases with unemployment compensation.
30.5 Exercises
Exercise 30.5.1
Investigate how mean unemployment duration varies with the discount factor 𝛽.
• What is your prior expectation?
• Do your results match up?

beta_vals = np.linspace(0.94, 0.99, 8)
durations = np.empty_like(beta_vals)

for i, β in enumerate(beta_vals):
    js = JobSearch(β=β)
    τ = compute_unemployment_duration(js)
    durations[i] = τ

fig, ax = plt.subplots()
ax.plot(beta_vals, durations)
ax.set_xlabel(r"$\beta$")
ax.set_ylabel("mean unemployment duration")
plt.show()
The figure shows that more patient individuals tend to wait longer before accepting an offer.

CHAPTER
THIRTYONE
JOB SEARCH V: MODELING CAREER CHOICE
Contents
• Job Search V: Modeling Career Choice

– Overview
– Model
– Implementation
– Exercises
31.1 Overview
Next, we study a computational problem concerning career and job choices.

The model is originally due to Derek Neal [Neal, 1999].
This exposition draws on the presentation in [Ljungqvist and Sargent, 2018], section 6.5.
We begin with some imports:

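import matplotlib.pyplot as plt
import numpy as np
import quantecon as qe
from matplotlib import cm
from numba import njit, prange
from quantecon.distributions import BetaBinomial
from scipy.special import binom, beta
from mpl_toolkits.mplot3d.axes3d import Axes3D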
31.1.1 Model Features
• Career and job within career both chosen to maximize expected discounted wage flow.
• Infinite horizon dynamic programming with two state variables.
31.2 Model
In what follows we distinguish between a career and a job, where

• a career is understood to be a general field encompassing many possible jobs, and
• a job is understood to be a position with a particular firm
For workers, wages can be decomposed into the contribution of job and career
• 𝑤𝑡 = 𝜃𝑡 + 𝜖𝑡 , where
– 𝜃𝑡 is the contribution of career at time 𝑡
– 𝜖𝑡 is the contribution of the job at time 𝑡
At the start of time 𝑡, a worker has the following options
• retain a current (career, job) pair (𝜃𝑡 , 𝜖𝑡 ) — referred to hereafter as “stay put”
• retain a current career 𝜃𝑡 but redraw a job 𝜖𝑡 — referred to hereafter as “new job”
• redraw both a career 𝜃𝑡 and a job 𝜖𝑡 — referred to hereafter as “new life”
Draws of 𝜃 and 𝜖 are independent of each other and past values, with
• 𝜃𝑡 ∼ 𝐹
• 𝜖𝑡 ∼ 𝐺
Notice that the worker does not have the option to retain a job but redraw a career — starting a new career always requires
starting a new job.
A young worker aims to maximize the expected sum of discounted wages
$$ \mathbb{E} \sum_{t=0}^{\infty} \beta^t w_t \tag{31.1} $$
subject to the choice restrictions specified above.

Let 𝑣(𝜃, 𝜖) denote the value function, which is the maximum of (31.1) over all feasible (career, job) policies, given the
initial state (𝜃, 𝜖).
The value function obeys
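$$ v(\theta, \epsilon) = \max\{I, II, III\} $$
where
$$
\begin{aligned}
I &= \theta + \epsilon + \beta v(\theta, \epsilon) \\
II &= \theta + \int \epsilon' G(d\epsilon') + \beta \int v(\theta, \epsilon') G(d\epsilon') \\
III &= \int \theta' F(d\theta') + \int \epsilon' G(d\epsilon') + \beta \int \int v(\theta', \epsilon') G(d\epsilon') F(d\theta')
\end{aligned}
\tag{31.2}
$$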
Evidently 𝐼, 𝐼𝐼 and 𝐼𝐼𝐼 correspond to “stay put”, “new job” and “new life”, respectively.

31.2.1 Parameterization
As in [Ljungqvist and Sargent, 2018], section 6.5, we will focus on a discrete version of the model, parameterized as
follows:
• both 𝜃 and 𝜖 take values in the set np.linspace(0, B, grid_size) — an even grid of points between 0
and 𝐵 inclusive
• grid_size = 50
• B = 5
• β = 0.95
The distributions 𝐹 and 𝐺 are discrete distributions generating draws from the grid points np.linspace(0, B,
grid_size).
A very useful family of discrete distributions is the Beta-binomial family, with probability mass function
$$ p(k \,|\, n, a, b) = \binom{n}{k} \frac{B(k + a, n - k + b)}{B(a, b)}, \qquad k = 0, \ldots, n $$
Interpretation:
• draw 𝑞 from a Beta distribution with shape parameters (𝑎, 𝑏)
• run 𝑛 independent binary trials, each with success probability 𝑞
• 𝑝(𝑘 | 𝑛, 𝑎, 𝑏) is the probability of 𝑘 successes in these 𝑛 trials
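The two-stage interpretation above can be checked by simulation. The following is a minimal sketch, with hypothetical values n = 10, a = 2, b = 3 chosen only for illustration: it draws 𝑞 from a Beta distribution, runs 𝑛 binary trials with success probability 𝑞, and compares the empirical frequencies of 𝑘 with the pmf formula.

```python
import numpy as np
from scipy.special import binom, beta

def beta_binomial_pmf(n, a, b):
    "Beta-binomial probabilities p(k | n, a, b) for k = 0, ..., n."
    k = np.arange(n + 1)
    return binom(n, k) * beta(k + a, n - k + b) / beta(a, b)

rng = np.random.default_rng(0)
n, a, b = 10, 2.0, 3.0                 # hypothetical values for illustration

# Stage 1: draw q ~ Beta(a, b); stage 2: count successes in n trials
q = rng.beta(a, b, size=200_000)
k_draws = rng.binomial(n, q)

empirical = np.bincount(k_draws, minlength=n + 1) / k_draws.size
pmf = beta_binomial_pmf(n, a, b)

print(np.round(np.abs(empirical - pmf).max(), 4))   # small sampling error
```

With a large number of draws, the empirical frequencies should agree with the formula up to sampling error.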
Nice properties:
• very flexible class of distributions, including uniform, symmetric unimodal, etc.
• only three parameters
Here’s a figure showing the effect on the pmf of different shape parameters when 𝑛 = 50.
def gen_probs(n, a, b):
    probs = np.zeros(n+1)
    for k in range(n+1):
        probs[k] = binom(n, k) * beta(k + a, n - k + b) / beta(a, b)
    return probs

n = 50
a_vals = [0.5, 1, 100]
b_vals = [0.5, 1, 100]

fig, ax = plt.subplots()
for a, b in zip(a_vals, b_vals):
    ab_label = f'$a = {a:.1f}$, $b = {b:.1f}$'
    ax.plot(list(range(0, n+1)), gen_probs(n, a, b), '-o', label=ab_label)
ax.legend()
plt.show()

31.3 Implementation
We will first create a class CareerWorkerProblem which will hold the default parameterizations of the model and
an initial guess for the value function.
class CareerWorkerProblem:

    def __init__(self,
                 B=5.0,          # Upper bound
                 β=0.95,         # Discount factor
                 grid_size=50,   # Grid size
                 F_a=1,
                 F_b=1,
                 G_a=1,
                 G_b=1):

        self.β, self.grid_size, self.B = β, grid_size, B

        self.θ = np.linspace(0, B, grid_size)     # Set of θ values
        self.ϵ = np.linspace(0, B, grid_size)     # Set of ϵ values

        self.F_probs = BetaBinomial(grid_size - 1, F_a, F_b).pdf()
        self.G_probs = BetaBinomial(grid_size - 1, G_a, G_b).pdf()
        self.F_mean = np.sum(self.θ * self.F_probs)
        self.G_mean = np.sum(self.ϵ * self.G_probs)

        # Store these parameters for str and repr methods
        self._F_a, self._F_b = F_a, F_b
        self._G_a, self._G_b = G_a, G_b

The following function takes an instance of CareerWorkerProblem and returns the corresponding Bellman operator
𝑇 and the greedy policy function.
In this model, 𝑇 is defined by 𝑇 𝑣(𝜃, 𝜖) = max{𝐼, 𝐼𝐼, 𝐼𝐼𝐼}, where 𝐼, 𝐼𝐼 and 𝐼𝐼𝐼 are as given in (31.2).
def operator_factory(cw, parallel_flag=True):
    """
    Returns jitted versions of the Bellman operator and the
    greedy policy function

    cw is an instance of ``CareerWorkerProblem``
    """

    θ, ϵ, β = cw.θ, cw.ϵ, cw.β
    F_probs, G_probs = cw.F_probs, cw.G_probs
    F_mean, G_mean = cw.F_mean, cw.G_mean

    @njit(parallel=parallel_flag)
    def T(v):
        "The Bellman operator"

        v_new = np.empty_like(v)

        for i in prange(len(v)):
            for j in prange(len(v)):
                v1 = θ[i] + ϵ[j] + β * v[i, j]                    # Stay put
                v2 = θ[i] + G_mean + β * v[i, :] @ G_probs        # New job
                v3 = G_mean + F_mean + β * F_probs @ v @ G_probs  # New life
                v_new[i, j] = max(v1, v2, v3)

        return v_new

    @njit
    def get_greedy(v):
        "Computes the v-greedy policy"

        σ = np.empty(v.shape)

        for i in range(len(v)):
            for j in range(len(v)):
                v1 = θ[i] + ϵ[j] + β * v[i, j]
                v2 = θ[i] + G_mean + β * v[i, :] @ G_probs
                v3 = G_mean + F_mean + β * F_probs @ v @ G_probs
                if v1 > max(v2, v3):
                    action = 1
                elif v2 > max(v1, v3):
                    action = 2
                else:
                    action = 3
                σ[i, j] = action

        return σ

    return T, get_greedy
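As a cross-check on the loop-based operator above, the same Bellman update can be written with NumPy broadcasting. The function bellman_T below is a hypothetical helper rather than part of the lecture code. It is a sketch that assumes uniform 𝐹 and 𝐺, which is what the Beta-binomial pmf reduces to when 𝑎 = 𝑏 = 1.

```python
import numpy as np

def bellman_T(v, θ, ϵ, F_probs, G_probs, β):
    "Vectorized Bellman update; v[i, j] is the value at (θ[i], ϵ[j])."
    F_mean, G_mean = F_probs @ θ, G_probs @ ϵ
    stay_put = θ[:, None] + ϵ[None, :] + β * v                 # option I
    new_job = θ + G_mean + β * (v @ G_probs)                   # option II, depends on θ only
    new_life = F_mean + G_mean + β * (F_probs @ v @ G_probs)   # option III, a scalar
    return np.maximum(np.maximum(stay_put, new_job[:, None]), new_life)

grid_size, B, β = 50, 5.0, 0.95
θ = ϵ = np.linspace(0, B, grid_size)
F_probs = G_probs = np.full(grid_size, 1 / grid_size)   # uniform (a = b = 1)

# Successive approximation from the same initial guess used in the text
v = np.full((grid_size, grid_size), 100.0)
for _ in range(1000):
    v_new = bellman_T(v, θ, ϵ, F_probs, G_probs, β)
    error = np.max(np.abs(v_new - v))
    v = v_new
    if error < 1e-8:
        break

print(v[-1, -1])   # value at the best (career, job) pair
```

At the best pair (𝜃, 𝜖) = (𝐵, 𝐵) staying put forever is optimal, so the value there should be close to 2𝐵/(1 − 𝛽) = 200, which gives a quick sanity check on the iteration.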
Lastly, solve_model will take an instance of CareerWorkerProblem and iterate using the Bellman operator to
find the fixed point of the Bellman equation.

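def solve_model(cw,
                use_parallel=True,
                tol=1e-4,
                max_iter=1000,
                verbose=True,
                print_skip=25):

    T, _ = operator_factory(cw, parallel_flag=use_parallel)

    # Set up loop
    v = np.full((cw.grid_size, cw.grid_size), 100.)  # Initial guess
    i = 0
    error = tol + 1

    # Iterate with T until the sup-norm change falls below tol
    while i < max_iter and error > tol:
        v_new = T(v)
        error = np.max(np.abs(v - v_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        v = v_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return v_new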
Here’s the solution to the model – an approximate value function
cw = CareerWorkerProblem()
T, get_greedy = operator_factory(cw)
v_star = solve_model(cw, verbose=False)
greedy_star = get_greedy(v_star)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
tg, eg = np.meshgrid(cw.θ, cw.ϵ)
ax.plot_surface(tg,
                eg,
                v_star.T,
                cmap=cm.jet,
                alpha=0.5,
                linewidth=0.25)
ax.set(xlabel='θ', ylabel='ϵ', zlim=(150, 200))
ax.view_init(ax.elev, 225)
plt.show()

And here is the optimal policy

fig, ax = plt.subplots()
tg, eg = np.meshgrid(cw.θ, cw.ϵ)
lvls = (0.5, 1.5, 2.5, 3.5)
ax.contourf(tg, eg, greedy_star.T, levels=lvls, cmap=cm.winter, alpha=0.5)
ax.contour(tg, eg, greedy_star.T, colors='k', levels=lvls, linewidths=2)
ax.set(xlabel='θ', ylabel='ϵ')
ax.text(1.8, 2.5, 'new life', fontsize=14)
ax.text(4.5, 2.5, 'new job', fontsize=14, rotation='vertical')
ax.text(4.0, 4.5, 'stay put', fontsize=14)
plt.show()

Interpretation:
• If both job and career are poor or mediocre, the worker will experiment with a new job and new career.
• If career is sufficiently good, the worker will hold it and experiment with new jobs until a sufficiently good one is
found.
• If both job and career are good, the worker will stay put.
Notice that the worker will always hold on to a sufficiently good career, but not necessarily hold on to even the best paying
job.
The reason is that high lifetime wages require both variables to be large, and the worker cannot change careers without
changing jobs.
• Sometimes a good job must be sacrificed in order to change to a better career.

31.4 Exercises
Exercise 31.4.1
Using the default parameterization in the class CareerWorkerProblem, generate and plot typical sample paths for 𝜃
and 𝜖 when the worker follows the optimal policy.
In particular, modulo randomness, reproduce the following figure (where the horizontal axis represents time)
Hint: To generate the draws from the distributions 𝐹 and 𝐺, use quantecon.random.draw().

Simulate job/career paths.
In reading the code, recall that optimal_policy[i, j] = policy at (𝜃𝑖 , 𝜖𝑗 ) = either 1, 2 or 3; meaning ‘stay put’,
‘new job’ and ‘new life’.

F = np.cumsum(cw.F_probs)
G = np.cumsum(cw.G_probs)

def gen_path(optimal_policy, F, G, t=20):
    i = j = 0
    θ_index = []
    ϵ_index = []
    for _ in range(t):
        if optimal_policy[i, j] == 1:       # Stay put
            pass
        elif optimal_policy[i, j] == 2:     # New job
            j = qe.random.draw(G)
        else:                               # New life
            i, j = qe.random.draw(F), qe.random.draw(G)
        θ_index.append(i)
        ϵ_index.append(j)
    return cw.θ[θ_index], cw.ϵ[ϵ_index]

fig, axes = plt.subplots(2, 1)
for ax in axes:
    θ_path, ϵ_path = gen_path(greedy_star, F, G)
    ax.plot(ϵ_path, label='ϵ')
    ax.plot(θ_path, label='θ')
    ax.set_ylim(0, 6)

plt.legend()
plt.show()

Exercise 31.4.2
Let’s now consider how long it takes for the worker to settle down to a permanent job, given a starting point of (𝜃, 𝜖) =
(0, 0).
In other words, we want to study the distribution of the random variable
𝑇 ∗ ∶= the first point in time from which the worker's job no longer changes
Evidently, the worker’s job becomes permanent if and only if (𝜃𝑡 , 𝜖𝑡 ) enters the “stay put” region of (𝜃, 𝜖) space.
Letting 𝑆 denote this region, 𝑇 ∗ can be expressed as the first passage time to 𝑆 under the optimal policy:
𝑇 ∗ ∶= inf{𝑡 ≥ 0 | (𝜃𝑡 , 𝜖𝑡 ) ∈ 𝑆}
Collect 25,000 draws of this random variable and compute the median (which should be about 7).
Repeat the exercise with 𝛽 = 0.99 and interpret the change.

The median for the original parameterization can be computed as follows

cw = CareerWorkerProblem()
F = np.cumsum(cw.F_probs)
G = np.cumsum(cw.G_probs)

# Solve for the optimal policy associated with this instance
T, get_greedy = operator_factory(cw)
v_star = solve_model(cw, verbose=False)
greedy_star = get_greedy(v_star)

@njit
def passage_time(optimal_policy, F, G):
    t = 0
    i = j = 0
    while True:
        if optimal_policy[i, j] == 1:    # Stay put
            return t
        elif optimal_policy[i, j] == 2:  # New job
            j = qe.random.draw(G)
        else:                            # New life
            i, j = qe.random.draw(F), qe.random.draw(G)
        t += 1

@njit(parallel=True)
def median_time(optimal_policy, F, G, M=25000):
    samples = np.empty(M)
    for i in prange(M):
        samples[i] = passage_time(optimal_policy, F, G)
    return np.median(samples)

median_time(greedy_star, F, G)
7.0
To compute the median with 𝛽 = 0.99 instead of the default value 𝛽 = 0.95, replace cw = CareerWorkerProblem()
with cw = CareerWorkerProblem(β=0.99).
The medians are subject to randomness but should be about 7 and 14 respectively.
Not surprisingly, more patient workers will wait longer to settle down to their final job.
Exercise 31.4.3
Set the parameterization to G_a = G_b = 100 and generate a new optimal policy figure – interpret.

cw = CareerWorkerProblem(G_a=100, G_b=100)
T, get_greedy = operator_factory(cw)
v_star = solve_model(cw, verbose=False)
greedy_star = get_greedy(v_star)

fig, ax = plt.subplots()
tg, eg = np.meshgrid(cw.θ, cw.ϵ)
lvls = (0.5, 1.5, 2.5, 3.5)
ax.contourf(tg, eg, greedy_star.T, levels=lvls, cmap=cm.winter, alpha=0.5)
ax.contour(tg, eg, greedy_star.T, colors='k', levels=lvls, linewidths=2)
ax.set(xlabel='θ', ylabel='ϵ')
ax.text(1.8, 2.5, 'new life', fontsize=14)
ax.text(4.5, 1.5, 'new job', fontsize=14, rotation='vertical')
ax.text(4.0, 4.5, 'stay put', fontsize=14)
plt.show()
In the new figure, you see that the region for which the worker stays put has grown because the distribution for 𝜖 has
become more concentrated around the mean, making high-paying jobs less realistic.


CHAPTER
THIRTYTWO
JOB SEARCH VI: ON-THE-JOB SEARCH
Contents
• Job Search VI: On-the-Job Search

– Overview
– Model
– Implementation
– Solving for Policies
– Exercises
32.1 Overview
In this section, we solve a simple on-the-job search model

• based on [Ljungqvist and Sargent, 2018], exercise 6.18, and [Jovanovic, 1979]

Let’s start with some imports:

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from numba import njit, prange

The model features

• job-specific human capital accumulation combined with on-the-job search
• infinite-horizon dynamic programming with one state variable and two controls
32.2 Model
Let 𝑥𝑡 denote the time-𝑡 job-specific human capital of a worker employed at a given firm and let 𝑤𝑡 denote current wages.
Let 𝑤𝑡 = 𝑥𝑡 (1 − 𝑠𝑡 − 𝜙𝑡 ), where
• 𝜙𝑡 is investment in job-specific human capital for the current role and
• 𝑠𝑡 is search effort, devoted to obtaining new offers from other firms.
For as long as the worker remains in the current job, evolution of {𝑥𝑡 } is given by 𝑥𝑡+1 = 𝑔(𝑥𝑡 , 𝜙𝑡 ).
When search effort at 𝑡 is 𝑠𝑡 , the worker receives a new job offer with probability 𝜋(𝑠𝑡 ) ∈ [0, 1].
The value of the offer, measured in job-specific human capital, is 𝑢𝑡+1 , where {𝑢𝑡 } is IID with common distribution 𝑓.
The worker can reject the current offer and continue with existing job.
Hence 𝑥𝑡+1 = 𝑢𝑡+1 if he/she accepts and 𝑥𝑡+1 = 𝑔(𝑥𝑡 , 𝜙𝑡 ) otherwise.
Let 𝑏𝑡+1 ∈ {0, 1} be a binary random variable, where 𝑏𝑡+1 = 1 indicates that the worker receives an offer at the end of
time 𝑡.
We can write
𝑥𝑡+1 = (1 − 𝑏𝑡+1 )𝑔(𝑥𝑡 , 𝜙𝑡 ) + 𝑏𝑡+1 max{𝑔(𝑥𝑡 , 𝜙𝑡 ), 𝑢𝑡+1 } (32.1)
Agent’s objective: maximize expected discounted sum of wages via controls {𝑠𝑡 } and {𝜙𝑡 }.
Taking the expectation of 𝑣(𝑥𝑡+1 ) and using (32.1), the Bellman equation for this problem can be written as
$$ v(x) = \max_{s + \phi \leq 1} \left\{ x(1 - s - \phi) + \beta (1 - \pi(s)) v[g(x, \phi)] + \beta \pi(s) \int v[g(x, \phi) \vee u] f(du) \right\} \tag{32.2} $$
Here nonnegativity of 𝑠 and 𝜙 is understood, while 𝑎 ∨ 𝑏 ∶= max{𝑎, 𝑏}.
In the implementation below, we will focus on the parameterization

$$ g(x, \phi) = A (x \phi)^{\alpha}, \quad \pi(s) = \sqrt{s} \quad \text{and} \quad f = \text{Beta}(2, 2) $$
with default parameter values

• 𝐴 = 1.4
• 𝛼 = 0.6
• 𝛽 = 0.96
The Beta(2, 2) distribution is supported on (0, 1) - it has a unimodal, symmetric density peaked at 0.5.
32.2.2 Back-of-the-Envelope Calculations
Before we solve the model, let’s make some quick calculations that provide intuition on what the solution should look like.
To begin, observe that the worker has two instruments to build capital and hence wages:
1. invest in capital specific to the current job via 𝜙
2. search for a new job with better job-specific capital match via 𝑠

Since wages are 𝑥(1 − 𝑠 − 𝜙), marginal cost of investment via either 𝜙 or 𝑠 is identical.
Our risk-neutral worker should focus on whatever instrument has the highest expected return.
The relative expected return will depend on 𝑥.
For example, suppose first that 𝑥 = 0.05
• If 𝑠 = 1 and 𝜙 = 0, then since 𝑔(𝑥, 𝜙) = 0, taking expectations of (32.1) gives expected next period capital equal
to 𝜋(𝑠)𝔼𝑢 = 𝔼𝑢 = 0.5.
• If 𝑠 = 0 and 𝜙 = 1, then next period capital is 𝑔(𝑥, 𝜙) = 𝑔(0.05, 1) ≈ 0.23.
Both rates of return are good, but the return from search is better.
Next, suppose that 𝑥 = 0.4
• If 𝑠 = 1 and 𝜙 = 0, then expected next period capital is again 0.5
• If 𝑠 = 0 and 𝜙 = 1, then 𝑔(𝑥, 𝜙) = 𝑔(0.4, 1) ≈ 0.8
Return from investment via 𝜙 dominates expected return from search.
Combining these observations gives us two informal predictions:
1. At any given state 𝑥, the two controls 𝜙 and 𝑠 will function primarily as substitutes — worker will focus on whichever
instrument has the higher expected return.
2. For sufficiently small 𝑥, search will be preferable to investment in job-specific human capital. For larger 𝑥, the
reverse will be true.
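The back-of-the-envelope numbers above can be reproduced in a few lines. This sketch uses the default values 𝐴 = 1.4 and 𝛼 = 0.6 together with 𝜋(𝑠) = √𝑠 and 𝔼𝑢 = 0.5 for the Beta(2, 2) distribution. The names search_only, invest_only and results are introduced here purely for illustration.

```python
A, α = 1.4, 0.6          # default parameter values from the text

def g(x, ϕ):
    "Transition function for job-specific human capital."
    return A * (x * ϕ)**α

mean_u = 0.5             # mean of the Beta(2, 2) offer distribution

results = {}
for x in (0.05, 0.4):
    # s = 1, ϕ = 0: g(x, 0) = 0, so expected capital is π(1) * E[u]
    search_only = 1.0**0.5 * mean_u
    # s = 0, ϕ = 1: no offers arrive, so capital is g(x, 1) for sure
    invest_only = g(x, 1.0)
    results[x] = (search_only, invest_only)
    print(f"x = {x}: search gives {search_only:.2f}, "
          f"investment gives {invest_only:.2f}")
```

The comparison flips between the two values of 𝑥, which is the basis for the second prediction.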
Now let’s turn to implementation, and see if we can match our predictions.
32.3 Implementation
We will set up a class JVWorker that holds the parameters of the model described above
class JVWorker:
    r"""
    A Jovanovic-type model of employment with on-the-job search.

    """

    def __init__(self,
                 A=1.4,
                 α=0.6,
                 β=0.96,         # Discount factor
                 π=np.sqrt,      # Search effort function
                 a=2,            # Parameter of f
                 b=2,            # Parameter of f
                 grid_size=50,
                 mc_size=100,
                 ɛ=1e-4):

        self.A, self.α, self.β, self.π = A, α, β, π
        self.mc_size, self.ɛ = mc_size, ɛ

        self.g = njit(lambda x, ϕ: A * (x * ϕ)**α)    # Transition function
        self.f_rvs = np.random.beta(a, b, mc_size)

        # Max of grid is the max of a large quantile value for f and the
        # fixed point y = g(y, 1)
        ɛ = 1e-4
        grid_max = max(A**(1 / (1 - α)), stats.beta(a, b).ppf(1 - ɛ))

        # Human capital
        self.x_grid = np.linspace(ɛ, grid_max, grid_size)
The function operator_factory takes an instance of this class and returns a jitted version of the Bellman operator
T, i.e.
$$ T v(x) = \max_{s + \phi \leq 1} w(s, \phi) $$
where
$$ w(s, \phi) := x(1 - s - \phi) + \beta (1 - \pi(s)) v[g(x, \phi)] + \beta \pi(s) \int v[g(x, \phi) \vee u] f(du) \tag{32.3} $$
When we represent 𝑣, it will be with a NumPy array v giving values on grid x_grid.
But to evaluate the right-hand side of (32.3), we need a function, so we replace the arrays v and x_grid with a function
v_func that gives linear interpolation of v on x_grid.
Inside the for loop, for each x in the grid over the state space, we set up the function 𝑤(𝑧) = 𝑤(𝑠, 𝜙) defined in (32.3).
The function is maximized over all feasible (𝑠, 𝜙) pairs.
Another function, get_greedy returns the optimal choice of 𝑠 and 𝜙 at each 𝑥, given a value function.
def operator_factory(jv, parallel_flag=True):
    """
    Returns jitted versions of the Bellman operator T and the
    greedy policy function

    jv is an instance of JVWorker
    """

    π, β = jv.π, jv.β
    x_grid, ɛ, mc_size = jv.x_grid, jv.ɛ, jv.mc_size
    f_rvs, g = jv.f_rvs, jv.g

    @njit
    def state_action_values(z, x, v):
        s, ϕ = z
        v_func = lambda x: np.interp(x, x_grid, v)

        # Monte Carlo approximation of the integral in (32.3)
        integral = 0
        for m in range(mc_size):
            u = f_rvs[m]
            integral += v_func(max(g(x, ϕ), u))
        integral = integral / mc_size

        q = π(s) * integral + (1 - π(s)) * v_func(g(x, ϕ))
        return x * (1 - ϕ - s) + β * q

    @njit(parallel=parallel_flag)
    def T(v):
        """
        The Bellman operator
        """

        v_new = np.empty_like(v)
        for i in prange(len(x_grid)):
            x = x_grid[i]

            # Search on a grid
            search_grid = np.linspace(ɛ, 1, 15)
            max_val = -1
            for s in search_grid:
                for ϕ in search_grid:
                    current_val = state_action_values((s, ϕ), x, v) \
                        if s + ϕ <= 1 else -1
                    if current_val > max_val:
                        max_val = current_val
            v_new[i] = max_val

        return v_new

    @njit
    def get_greedy(v):
        """
        Computes the v-greedy policy of a given function v
        """
        s_policy, ϕ_policy = np.empty_like(v), np.empty_like(v)

        for i in range(len(x_grid)):
            x = x_grid[i]
            # Search on a grid
            search_grid = np.linspace(ɛ, 1, 15)
            max_val = -1
            for s in search_grid:
                for ϕ in search_grid:
                    current_val = state_action_values((s, ϕ), x, v) \
                        if s + ϕ <= 1 else -1
                    if current_val > max_val:
                        max_val = current_val
                        max_s, max_ϕ = s, ϕ
            s_policy[i], ϕ_policy[i] = max_s, max_ϕ

        return s_policy, ϕ_policy

    return T, get_greedy
To solve the model, we will write a function that uses the Bellman operator and iterates to find a fixed point.
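def solve_model(jv,
                use_parallel=True,
                tol=1e-4,
                max_iter=1000,
                verbose=True,
                print_skip=25):
    """
    Solves the model by value function iteration

    * jv is an instance of JVWorker

    """

    T, _ = operator_factory(jv, parallel_flag=use_parallel)

    # Set up loop
    v = jv.x_grid * 0.5  # Initial condition
    i = 0
    error = tol + 1

    # Iterate with T until the sup-norm change falls below tol
    while i < max_iter and error > tol:
        v_new = T(v)
        error = np.max(np.abs(v - v_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        v = v_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return v_new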
32.4 Solving for Policies
Let’s generate the optimal policies and see what they look like.
jv = JVWorker()
T, get_greedy = operator_factory(jv)
v_star = solve_model(jv)
s_star, ϕ_star = get_greedy(v_star)

Here are the plots:
plots = [s_star, ϕ_star, v_star]
titles = ["s policy", "ϕ policy", "value function"]

fig, axes = plt.subplots(3, 1)

for ax, plot, title in zip(axes, plots, titles):
    ax.plot(jv.x_grid, plot)
    ax.set(title=title)
    ax.grid()

axes[-1].set_xlabel("x")
plt.show()

The horizontal axis is the state 𝑥, while the vertical axis gives 𝑠(𝑥) and 𝜙(𝑥).
Overall, the policies match well with our predictions from above
• Worker switches from one investment strategy to the other depending on relative return.
• For low values of 𝑥, the best option is to search for a new job.
• Once 𝑥 is larger, worker does better by investing in human capital specific to the current position.

32.5 Exercises
Exercise 32.5.1
Let’s look at the dynamics for the state process {𝑥𝑡 } associated with these policies.
The dynamics are given by (32.1) when 𝜙𝑡 and 𝑠𝑡 are chosen according to the optimal policies, and ℙ{𝑏𝑡+1 = 1} = 𝜋(𝑠𝑡 ).
Since the dynamics are random, analysis is a bit subtle.
One way to do it is to plot, for each 𝑥 in a relatively fine grid called plot_grid, a large number 𝐾 of realizations of
𝑥𝑡+1 given 𝑥𝑡 = 𝑥.
Plot this with one dot for each realization, in the form of a 45 degree diagram, setting
jv = JVWorker(grid_size=25, mc_size=50)
plot_grid_max, plot_grid_size = 1.2, 100
plot_grid = np.linspace(0, plot_grid_max, plot_grid_size)
fig, ax = plt.subplots()
ax.set_xlim(0, plot_grid_max)
ax.set_ylim(0, plot_grid_max)
By examining the plot, argue that under the optimal policies, the state 𝑥𝑡 will converge to a constant value 𝑥̄ close to unity.
Argue that at the steady state, 𝑠𝑡 ≈ 0 and 𝜙𝑡 ≈ 0.6.

Here’s code to produce the 45 degree diagram
jv = JVWorker(grid_size=25, mc_size=50)
π, g, f_rvs, x_grid = jv.π, jv.g, jv.f_rvs, jv.x_grid
T, get_greedy = operator_factory(jv)
v_star = solve_model(jv, verbose=False)
s_policy, ϕ_policy = get_greedy(v_star)

# Turn the policy function arrays into actual functions
s = lambda y: np.interp(y, x_grid, s_policy)
ϕ = lambda y: np.interp(y, x_grid, ϕ_policy)

def h(x, b, u):
    return (1 - b) * g(x, ϕ(x)) + b * max(g(x, ϕ(x)), u)

plot_grid_max, plot_grid_size = 1.2, 100
plot_grid = np.linspace(0, plot_grid_max, plot_grid_size)

fig, ax = plt.subplots()
ticks = (0.25, 0.5, 0.75, 1.0)
ax.set(xticks=ticks, yticks=ticks,
       xlim=(0, plot_grid_max),
       ylim=(0, plot_grid_max),
       xlabel='$x_t$', ylabel='$x_{t+1}$')

ax.plot(plot_grid, plot_grid, 'k--', alpha=0.6)  # 45 degree line
for x in plot_grid:
    for i in range(jv.mc_size):
        b = 1 if np.random.uniform(0, 1) < π(s(x)) else 0
        u = f_rvs[i]
        y = h(x, b, u)
        ax.plot(x, y, 'go', alpha=0.25)

plt.show()
Looking at the dynamics, we can see that

• If 𝑥𝑡 is below about 0.2 the dynamics are random, but 𝑥𝑡+1 > 𝑥𝑡 is very likely.
• As 𝑥𝑡 increases the dynamics become deterministic, and 𝑥𝑡 converges to a steady state value close to 1.
Referring back to the policy plots in the previous section, we see that 𝑥𝑡 ≈ 1 means that 𝑠𝑡 = 𝑠(𝑥𝑡 ) ≈ 0 and 𝜙𝑡 = 𝜙(𝑥𝑡 ) ≈ 0.6.
Exercise 32.5.2

In Exercise 32.5.1, we found that 𝑠𝑡 converges to zero and 𝜙𝑡 converges to about 0.6.
Since these results were calculated at a value of 𝛽 close to one, let’s compare them to the best choice for an infinitely
patient worker.
Intuitively, an infinitely patient worker would like to maximize steady state wages, which are a function of steady state
capital.
You can take it as given—it’s certainly true—that the infinitely patient worker does not search in the long run (i.e., 𝑠𝑡 = 0
for large 𝑡).
Thus, given 𝜙, steady state capital is the positive fixed point 𝑥∗ (𝜙) of the map 𝑥 ↦ 𝑔(𝑥, 𝜙).
Steady state wages can be written as 𝑤∗ (𝜙) = 𝑥∗ (𝜙)(1 − 𝜙).
Graph 𝑤∗ (𝜙) with respect to 𝜙, and examine the best choice of 𝜙.
Can you give a rough interpretation for the value that you see?

The figure can be produced as follows
jv = JVWorker()

def xbar(ϕ):
    A, α = jv.A, jv.α
    return (A * ϕ**α)**(1 / (1 - α))

ϕ_grid = np.linspace(0, 1, 100)

fig, ax = plt.subplots()
ax.set(xlabel=r'$\phi$')
ax.plot(ϕ_grid, [xbar(ϕ) * (1 - ϕ) for ϕ in ϕ_grid], label=r'$w^*(\phi)$')
ax.legend()
plt.show()

Observe that the maximizer is around 0.6.

This is similar to the long-run value for 𝜙 obtained in Exercise 32.5.1.
Hence the behavior of the infinitely patient worker is similar to that of the worker with 𝛽 = 0.96.
This seems reasonable and helps us confirm that our dynamic programming solutions are probably correct.
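The location of the maximizer can also be checked directly. Under the parameterization used here, 𝑤∗(𝜙) is proportional to 𝜙^(𝛼/(1−𝛼)) (1 − 𝜙), and setting the derivative of its logarithm to zero gives 𝛼(1 − 𝜙) = (1 − 𝛼)𝜙, that is, 𝜙 = 𝛼. The following sketch confirms this numerically on a fine grid. The function name steady_state_wage is introduced only for this illustration.

```python
import numpy as np

A, α = 1.4, 0.6          # default parameter values from the text

def steady_state_wage(ϕ):
    "w*(ϕ) = x*(ϕ) (1 - ϕ), where x*(ϕ) is the positive fixed point of x -> g(x, ϕ)."
    x_star = (A * ϕ**α)**(1 / (1 - α))
    return x_star * (1 - ϕ)

ϕ_grid = np.linspace(0, 1, 1001)
ϕ_best = ϕ_grid[np.argmax(steady_state_wage(ϕ_grid))]

print(f"numerical maximizer: {ϕ_best:.3f}, α = {α}")
```

So one rough interpretation of the value seen in the figure is that the infinitely patient worker invests a share of time equal to the curvature parameter 𝛼.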

CHAPTER
THIRTYTHREE
JOB SEARCH VII: A MCCALL WORKER Q-LEARNS
33.1 Overview
This lecture illustrates a powerful machine learning technique called Q-learning.

[Sutton and Barto, 2018] presents Q-learning and a variety of other statistical learning procedures.
The Q-learning algorithm combines ideas from
• dynamic programming
• a recursive version of least squares known as temporal difference learning.
This lecture applies a Q-learning algorithm to the situation faced by a McCall worker.
This lecture also considers the case where a McCall worker is given an option to quit the current job.
Relative to the dynamic programming formulation of the McCall worker model that we studied in an earlier quantecon lecture, a
Q-learning algorithm gives the worker less knowledge about
• the random process that generates a sequence of wages
• the reward function that tells consequences of accepting or rejecting a job
The Q-learning algorithm invokes a statistical learning model to learn about these things.
Statistical learning often comes down to some version of least squares, and it will be here too.
Any time we say statistical learning, we have to say what object is being learned.
For Q-learning, the object that is learned is not the value function that is a focus of dynamic programming.
But it is something that is closely affiliated with it.
In the finite-action, finite state context studied in this lecture, the object to be learned statistically is a Q-table, an instance
of a Q-function for finite sets.
Sometimes a Q-function or Q-table is called a quality-function or quality-table.
The rows and columns of a Q-table correspond to possible states that an agent might encounter, and possible actions that
he can take in each state.
An equation that resembles a Bellman equation plays an important role in the algorithm.
It differs from the Bellman equation for the McCall model that we have seen in this quantecon lecture.
In this lecture, we’ll learn a little about
• the Q-function or quality function that is affiliated with any Markov decision problem whose optimal value
function satisfies a Bellman equation
• temporal difference learning, a key component of a Q-learning algorithm
As usual, let’s import some Python modules.
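import numpy as np
import matplotlib.pyplot as plt
from numba import jit, float64, int64
from numba.experimental import jitclass
from quantecon.distributions import BetaBinomial

np.random.seed(123)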
33.2 Review of McCall Model
We begin by reviewing the McCall model described in this quantecon lecture.

We’ll compute an optimal value function and a policy that attains it.
We’ll eventually compare that optimal policy to what the Q-learning McCall worker learns.
The McCall model is characterized by parameters 𝛽, 𝑐 and a known distribution of wage offers 𝐹 .
A McCall worker wants to maximize an expected discounted sum of lifetime incomes
$$ \mathbb{E} \sum_{t=0}^{\infty} \beta^t y_t $$
The worker’s income 𝑦𝑡 equals his wage 𝑤 if he is employed, and unemployment compensation 𝑐 if he is unemployed.
An optimal value 𝑉 (𝑤) for a McCall worker who has just received a wage offer 𝑤 and is deciding whether to accept or
reject it satisfies the Bellman equation
$$ V(w) = \max_{\text{accept, reject}} \left\{ \frac{w}{1 - \beta}, \; c + \beta \int V(w') \, dF(w') \right\} \tag{33.1} $$
To form a benchmark to compare with results from Q-learning, we first approximate the optimal value function.
With possible states residing in a finite discrete state space indexed by {1, 2, ..., 𝑛}, we make an initial guess for the value
function of 𝑣 ∈ ℝ𝑛 and then iterate on the Bellman equation:
$$ v'(i) = \max \left\{ \frac{w(i)}{1 - \beta}, \; c + \beta \sum_{1 \leq j \leq n} v(j) q(j) \right\} \quad \text{for } i = 1, \ldots, n $$
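Before turning to the class-based code, the iteration just described can be sketched compactly with NumPy. The function mccall_vfi is a hypothetical helper, and the uniform offer distribution used here is an assumption made only to keep the example self-contained. It is not the distribution used in the rest of the lecture.

```python
import numpy as np

def mccall_vfi(w, q, c=25.0, β=0.99, tol=1e-6, max_iter=10_000):
    "Iterate v'(i) = max{ w(i) / (1 - β), c + β Σ_j v(j) q(j) } to convergence."
    accept = w / (1 - β)
    v = accept.copy()                        # initial guess
    for _ in range(max_iter):
        reject = c + β * (q @ v)             # same scalar for every state
        v_new = np.maximum(accept, reject)
        if np.max(np.abs(v_new - v)) <= tol:
            return v_new
        v = v_new
    return v

w = np.linspace(10, 60, 11)                  # wage grid
q = np.full(len(w), 1 / len(w))              # uniform offers (an assumption)
v = mccall_vfi(w, q)

print(np.round(v, 2))
```

Because the value of rejecting does not depend on the current offer, the fixed point is flat for low wages and equal to 𝑤/(1 − 𝛽) once the wage is high enough to accept.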
Let’s use Python code from this quantecon lecture.

We use a Python method called VFI to compute the optimal value function using value function iterations.
We construct an assumed distribution of wages and plot it with the following Python code

n, a, b = 10, 200, 100                     # default parameters
q_default = BetaBinomial(n, a, b).pdf()    # default choice of q

w_min, w_max = 10, 60
w_default = np.linspace(w_min, w_max, n+1)

# plot distribution of wage offer
fig, ax = plt.subplots()
ax.plot(w_default, q_default, '-o', label='$q(w(i))$')
plt.show()
Next we’ll compute the worker’s optimal value function by iterating to convergence on the Bellman equation.
Then we’ll plot various iterates on the Bellman operator.
mccall_data = [
    ('c', float64),       # unemployment compensation
    ('β', float64),       # discount factor
    ('w', float64[:]),    # array of wage values, w[i] = wage at state i
    ('q', float64[:]),    # array of probabilities
]

@jitclass(mccall_data)
class McCallModel:

    def __init__(self, c=25, β=0.99, w=w_default, q=q_default):
        self.c, self.β = c, β
        self.w, self.q = w, q

    def state_action_values(self, i, v):
        """
        The values of state-action pairs.
        """
        # Simplify names
        c, β, w, q = self.c, self.β, self.w, self.q
        # Evaluate value for each state-action pair
        # Consider action = accept or reject the current offer
        accept = w[i] / (1 - β)
        reject = c + β * np.sum(v * q)
        return np.array([accept, reject])

    def VFI(self, eps=1e-5, max_iter=500):
        """
        Find the optimal value function.
        """
        n = len(self.w)
        v = self.w / (1 - self.β)
        v_next = np.empty_like(v)    # storage for the updated guess
        flag = 0                     # set to 1 once the iteration has converged
        for i in range(max_iter):
            for j in range(n):
                v_next[j] = np.max(self.state_action_values(j, v))
            if np.max(np.abs(v_next - v)) <= eps:
                flag = 1
                break
            v[:] = v_next
        return v, flag
def plot_value_function_seq(mcm, ax, num_plots=8):
    """
    Plot a sequence of value functions.

    * mcm is an instance of McCallModel
    * ax is an axes object that implements a plot method.
    """
    n = len(mcm.w)
    v = mcm.w / (1 - mcm.β)
    v_next = np.empty_like(v)
    for i in range(num_plots):
        ax.plot(mcm.w, v, '-', alpha=0.4, label=f"iterate {i}")
        # Update guess
        for j in range(n):
            v_next[j] = np.max(mcm.state_action_values(j, v))
        v[:] = v_next  # copy contents into v
    ax.legend(loc='lower right')

mcm = McCallModel()
valfunc_VFI, flag = mcm.VFI()

fig, ax = plt.subplots()
ax.set_xlabel('wage')
ax.set_ylabel('value')
plot_value_function_seq(mcm, ax)
plt.show()
Next we’ll print out the limit of the sequence of iterates.

This is the approximation to the McCall worker’s value function that is produced by value function iteration.
We’ll use this value function as a benchmark later after we have done some Q-learning.
print(valfunc_VFI)
[5322.27935875 5322.27935875 5322.27935875 5322.27935875 5322.27935875

5322.27935875 5322.27935875 5322.27935875 5322.27935875 5500.
6000. ]
33.3 Implied Quality Function 𝑄
A quality function 𝑄 maps state-action pairs into optimal values.

Quality functions are tightly linked to optimal value functions.
But value functions are functions just of states, and not actions.
For each given state, the quality function gives a list of optimal values that can be attained starting from that state, with
each component of the list indicating one of the possible actions that is taken.
For our McCall worker with a finite set of possible wages
• the state space 𝒲 = {𝑤1 , 𝑤2 , ..., 𝑤𝑛 } is indexed by integers 1, 2, ..., 𝑛
• the action space is 𝒜 = {accept, reject}
Let 𝑎 ∈ 𝒜 be one of the two possible actions, i.e., accept or reject.
For our McCall worker, an optimal Q-function 𝑄(𝑤, 𝑎) equals the maximum value that a previously unemployed
worker who has offer 𝑤 in hand can attain if he takes action 𝑎.
This definition of 𝑄(𝑤, 𝑎) presumes that in subsequent periods the worker takes optimal actions.
An optimal Q-function for our McCall worker satisfies
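$$
\begin{aligned}
Q(w, \text{accept}) &= \frac{w}{1-\beta} \\
Q(w, \text{reject}) &= c + \beta \int \max_{\text{accept, reject}} \left\{ \frac{w'}{1-\beta},\; Q(w', \text{reject}) \right\} dF(w')
\end{aligned} \tag{33.2}
$$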
Note that the first equation of system (33.2) presumes that after the agent has accepted an offer, he will not have the
option to reject that same offer in the future.
These equations are aligned with the Bellman equation for the worker’s optimal value function that we studied in this
quantecon lecture.
Evidently, the optimal value function 𝑉 (𝑤) described in that lecture is related to our Q-function by
$$ V(w) = \max_{\text{accept, reject}} \left\{ Q(w, \text{accept}),\; Q(w, \text{reject}) \right\} $$
If we stare at the second equation of system (33.2), we notice that since the wage process is independently and identically
distributed over time, the right side of the equation, and hence 𝑄(𝑤, reject), is independent of the current state 𝑤.
So we can denote it as a scalar
𝑄𝑟 ∶= 𝑄 (𝑤, reject) ∀ 𝑤 ∈ 𝒲.
This fact provides us with an alternative, and as it turns out in this case, a faster way to compute an optimal value
function and associated optimal policy for the McCall worker.
Instead of using the value function iterations that we deployed above, we can iterate to convergence on a version
of the second equation in system (33.2) that maps an estimate of 𝑄𝑟 into an improved estimate 𝑄′𝑟 :
$$ Q_r' = c + \beta \int \max \left\{ \frac{w'}{1-\beta},\; Q_r \right\} dF(w') $$
After a 𝑄𝑟 sequence has converged, we can recover the optimal value function 𝑉 (𝑤) for the McCall worker from
$$ V(w) = \max \left\{ \frac{w}{1-\beta},\; Q_r \right\} $$
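The scalar iteration described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's own code: the function name `solve_q_reject`, the uniform offer probabilities, and the tolerance are hypothetical choices, while 𝑐 = 25 and 𝛽 = 0.99 mirror the defaults of `McCallModel`.

```python
import numpy as np

def solve_q_reject(w, q, c=25.0, β=0.99, tol=1e-10, max_iter=10_000):
    """Iterate Q_r' = c + β Σ_j max{w_j/(1-β), Q_r} q_j until it settles."""
    accept = w / (1 - β)              # Q(w, accept) for every wage on the grid
    q_r = 0.0                         # arbitrary initial guess
    for _ in range(max_iter):
        q_r_new = c + β * np.sum(np.maximum(accept, q_r) * q)
        if abs(q_r_new - q_r) < tol:
            return q_r_new
        q_r = q_r_new
    return q_r

# Hypothetical wage grid and probabilities (assumed, not the BetaBinomial above)
w_grid = np.linspace(10, 60, 11)
q_probs = np.full(11, 1 / 11)

q_r = solve_q_reject(w_grid, q_probs)
v = np.maximum(w_grid / (1 - 0.99), q_r)   # recover V(w) from Q_r
```

Because only one number is updated per iteration, each pass costs a single weighted sum instead of a loop over all states.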

33.4 From Probabilities to Samples
We noted above that the optimal Q function for our McCall worker satisfies the Bellman equations
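$$
\begin{aligned}
w + \beta \max_{\text{accept, reject}} \left\{ Q(w, \text{accept}),\; Q(w, \text{reject}) \right\} - Q(w, \text{accept}) &= 0 \\
c + \beta \int \max_{\text{accept, reject}} \left\{ Q(w', \text{accept}),\; Q(w', \text{reject}) \right\} dF(w') - Q(w, \text{reject}) &= 0
\end{aligned} \tag{33.3}
$$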
Notice the integral over 𝐹 (𝑤′ ) on the second line.

Erasing the integral sign sets the stage for an illegitimate argument that can get us started thinking about Q-learning.
Thus, construct a difference equation system that keeps the first equation of (33.3) but replaces the second by removing
integration over 𝐹 (𝑤′ ):
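$$
\begin{aligned}
w + \beta \max_{\text{accept, reject}} \left\{ Q(w, \text{accept}),\; Q(w, \text{reject}) \right\} - Q(w, \text{accept}) &= 0 \\
c + \beta \max_{\text{accept, reject}} \left\{ Q(w', \text{accept}),\; Q(w', \text{reject}) \right\} - Q(w, \text{reject}) &\approx 0
\end{aligned} \tag{33.4}
$$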
The second equation can’t hold for all 𝑤, 𝑤′ pairs in the appropriate Cartesian product of our state space.
But maybe an appeal to a law of large numbers could let us hope that it would hold on average for a long time series
of draws of 𝑤𝑡 , 𝑤𝑡+1 pairs, where we are thinking of 𝑤𝑡 as 𝑤 and 𝑤𝑡+1 as 𝑤′ .
The basic idea of Q-learning is to draw a long sample of wage offers from 𝐹 (we know 𝐹 though we assume that the
worker doesn’t) and iterate on a recursion that maps an estimate 𝑄̂ 𝑡 of a Q-function at date 𝑡 into an improved estimate
𝑄̂ 𝑡+1 at date 𝑡 + 1.
To set up such an algorithm, we first define some errors or “differences”
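$$
\begin{aligned}
w_t + \beta \max_{\text{accept, reject}} \left\{ \hat{Q}_t(w_t, \text{accept}),\; \hat{Q}_t(w_t, \text{reject}) \right\} - \hat{Q}_t(w_t, \text{accept}) &= \text{diff}_{\text{accept},t} \\
c + \beta \max_{\text{accept, reject}} \left\{ \hat{Q}_t(w_{t+1}, \text{accept}),\; \hat{Q}_t(w_{t+1}, \text{reject}) \right\} - \hat{Q}_t(w_t, \text{reject}) &= \text{diff}_{\text{reject},t}
\end{aligned} \tag{33.5}
$$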
The adaptive learning scheme would then be some version of
𝑄̂ 𝑡+1 = 𝑄̂ 𝑡 + 𝛼 diff𝑡 (33.6)
where 𝛼 ∈ (0, 1) is a small gain parameter that governs the rate of learning and 𝑄̂ 𝑡 and diff𝑡 are 2 × 1 vectors
corresponding to objects in equation system (33.5).
This informal argument takes us to the threshold of Q-learning.
33.5 Q-Learning
Let’s first describe a 𝑄-learning algorithm precisely.

Then we’ll implement it.
The algorithm works by using a Monte Carlo method to update estimates of a Q-function.
We begin with an initial guess for a Q-function.
In the example studied in this lecture, we have a finite action space and also a finite state space.
That means that we can represent a Q-function as a matrix or Q-table, $\tilde{Q}(w, a)$.
Q-learning proceeds by updating the Q-function as the decision maker acquires experience along a path of wage draws
generated by simulation.
During the learning process, our McCall worker takes actions and experiences rewards that are consequences of those
actions.
He learns simultaneously about the environment, in this case the distribution of wages, and the reward function, in this
case the unemployment compensation 𝑐 and the present value of wages.
The updating algorithm is based on a slight modification (to be described soon) of a recursion like
$$ \tilde{Q}^{new}(w, a) = \tilde{Q}^{old}(w, a) + \alpha \widetilde{TD}(w, a) \tag{33.7} $$
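where
$$
\begin{aligned}
\widetilde{TD}(w, \text{accept}) &= \left[ w + \beta \max_{a' \in \mathcal{A}} \tilde{Q}^{old}(w, a') \right] - \tilde{Q}^{old}(w, \text{accept}) \\
\widetilde{TD}(w, \text{reject}) &= \left[ c + \beta \max_{a' \in \mathcal{A}} \tilde{Q}^{old}(w', a') \right] - \tilde{Q}^{old}(w, \text{reject}), \quad w' \sim F
\end{aligned} \tag{33.8}
$$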
The terms $\widetilde{TD}(w, a)$ for $a \in \{\text{accept}, \text{reject}\}$ are the temporal difference errors that drive the updates.
This system is thus a version of the adaptive system that we sketched informally in equation (33.6).
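An aspect of the algorithm not yet captured by equation system (33.8) is random experimentation that we add by
occasionally randomly replacing
$$ \operatorname{argmax}_{a' \in \mathcal{A}} \tilde{Q}^{old}(w, a') $$
with
$$ \operatorname{argmin}_{a' \in \mathcal{A}} \tilde{Q}^{old}(w, a') $$
and occasionally replacing
$$ \operatorname{argmax}_{a' \in \mathcal{A}} \tilde{Q}^{old}(w', a') $$
with
$$ \operatorname{argmin}_{a' \in \mathcal{A}} \tilde{Q}^{old}(w', a') $$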
We activate such experimentation with probability 𝜖 in step 3 of the following pseudo-code for our McCall worker to do
Q-learning:
1. Set an arbitrary initial Q-table.
2. Draw an initial wage offer 𝑤 from 𝐹 .
3. From the appropriate row in the Q-table, choose an action using the following 𝜖-greedy algorithm:
• with probability 1 − 𝜖, choose the action that maximizes the value, and
• with probability 𝜖, choose the alternative action.
4. Update the state associated with the chosen action, compute $\widetilde{TD}$ according to (33.8), and update $\tilde{Q}$ according
to (33.7).
5. Either draw a new state 𝑤′ if required or else keep the existing wage, and update the Q-table again according to (33.7).
6. Stop when the old and new Q-tables are close enough, i.e., ‖𝑄̃ 𝑛𝑒𝑤 − 𝑄̃ 𝑜𝑙𝑑 ‖∞ ≤ 𝛿 for given 𝛿 or if the worker keeps
accepting for 𝑇 periods for a prescribed 𝑇 .

7. Return to step 2 with the updated Q-table.

Repeat this procedure for 𝑁 episodes or until the updated Q-table has converged.
We call one pass through steps 2 to 7 an “episode” or “epoch” of temporal difference learning.
In our context, each episode starts with an agent drawing an initial wage offer, i.e., a new state.
The agent then takes actions based on the preset Q-table, receives rewards, and then enters a new state implied by this
period’s actions.
The Q-table is updated via temporal difference learning.
We iterate this until convergence of the Q-table or the maximum length of an episode is reached.
Multiple episodes allow the agent to start afresh and visit states that she was less likely to visit from the terminal state of
a previous episode.
For example, an agent who has accepted a wage offer based on her Q-table will be less likely to draw a new offer from
other parts of the wage distribution.
By using the 𝜖-greedy method and also by increasing the number of episodes, the Q-learning algorithm balances gains
from exploration and from exploitation.
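The 𝜖-greedy rule in step 3 can be sketched on its own. The helper name `epsilon_greedy`, the seed, and the two Q-values below are hypothetical and serve only as an illustration; the column order (reject first, accept second) follows the Q-table convention used in this lecture.

```python
import numpy as np

def epsilon_greedy(q_row, eps, rng):
    """Return 0 (reject) or 1 (accept) for one row of a two-column Q-table."""
    greedy = int(np.argmax(q_row))      # action with the highest current estimate
    if rng.random() < eps:              # explore with probability eps
        return 1 - greedy               # the alternative action
    return greedy                       # otherwise exploit

rng = np.random.default_rng(0)          # seed chosen arbitrarily
row = np.array([5300.0, 5500.0])        # hypothetical Q-values: [reject, accept]
draws = [epsilon_greedy(row, 0.1, rng) for _ in range(10_000)]
share_accept = np.mean(draws)           # fraction of draws that choose "accept"
```

With 𝜖 = 0.1 the greedy action (here "accept") should be chosen in roughly 90 percent of draws, so about one choice in ten is spent on exploration.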
Remark: Notice that the $\widetilde{TD}$ associated with an optimal Q-table defined in (33.7) above automatically satisfies $\widetilde{TD} = 0$ for
all state-action pairs. Whether our Q-learning algorithm converges to an optimal Q-table depends on whether
the algorithm visits all state-action pairs often enough.
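The remark can be checked numerically. The sketch below is an illustration under assumptions: the wage grid and uniform probabilities are hypothetical, and the accept row uses the form 𝑤 + 𝛽𝑄(𝑤, accept) − 𝑄(𝑤, accept), which matches the first equation of (33.2) where an accepted wage is kept forever. For the reject action the temporal difference depends on the random draw 𝑤′, so the sketch evaluates its mean under the offer distribution.

```python
import numpy as np

β, c = 0.99, 25.0
w = np.linspace(10, 60, 11)             # hypothetical wage grid
q = np.full(11, 1 / 11)                 # hypothetical uniform offer probabilities

# Solve for the scalar Q_r = Q(w, reject) by fixed-point iteration
accept = w / (1 - β)
q_r = 0.0
for _ in range(5_000):
    q_r = c + β * np.sum(np.maximum(accept, q_r) * q)

# Optimal Q-table with columns [reject, accept]
qtable = np.column_stack([np.full(11, q_r), accept])
best = qtable.max(axis=1)

# TD for accept when the accepted wage is kept forever
td_accept = w + β * qtable[:, 1] - qtable[:, 1]
# TD for reject, averaged over next-period draws w' with probabilities q
td_reject_mean = c + β * np.sum(best * q) - qtable[:, 0]
```

Both arrays should be zero up to floating-point error, which is what it means for the optimal Q-table to be a rest point of recursion (33.7) on average.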
We implement this pseudo code in a Python class.
For simplicity and convenience, we let s represent the state index between 0 and 𝑛 (here 𝑛 = 10) and 𝑤𝑠 = 𝑤[𝑠].
The first column of the Q-table represents the value associated with rejecting the wage and the second represents accepting
the wage.
We use numba compilation to accelerate computations.
params = [
    ('c', float64),            # unemployment compensation
    ('β', float64),            # discount factor
    ('w', float64[:]),         # array of wage values, w[i] = wage at state i
    ('q', float64[:]),         # array of probabilities
    ('eps', float64),          # for epsilon greedy algorithm
    ('δ', float64),            # Q-table threshold
    ('lr', float64),           # the learning rate α
    ('T', int64),              # maximum periods of accepting
    ('quit_allowed', int64)    # whether quit is allowed after accepting the wage offer
]

@jitclass(params)
class Qlearning_McCall:

    def __init__(self, c=25, β=0.99, w=w_default, q=q_default, eps=0.1,
                 δ=1e-5, lr=0.5, T=10000, quit_allowed=0):
        self.c, self.β = c, β
        self.w, self.q = w, q
        self.eps, self.δ, self.lr, self.T = eps, δ, lr, T
        self.quit_allowed = quit_allowed

    def draw_offer_index(self):
        """
        Draw a state index from the wage distribution.
        """
        q = self.q
        return np.searchsorted(np.cumsum(q), np.random.random(), side="right")

    def temp_diff(self, qtable, state, accept):
        """
        Compute the TD associated with state and action.
        """
        c, β, w = self.c, self.β, self.w
        if accept == 0:
            # reject: receive c and draw a new offer for next period
            state_next = self.draw_offer_index()
            TD = c + β*np.max(qtable[state_next, :]) - qtable[state, accept]
        else:
            # accept: the wage, and hence the state, stays the same
            state_next = state
            if self.quit_allowed == 0:
                # continuation value is the max over both actions
                TD = w[state_next] + β*np.max(qtable[state_next, :]) - qtable[state, accept]
            else:
                # continuation value is the value of accepting again
                TD = w[state_next] + β*qtable[state_next, 1] - qtable[state, accept]
        return TD, state_next

    def run_one_epoch(self, qtable, max_times=20000):
        """
        Run an "epoch".
        """
        c, β, w = self.c, self.β, self.w
        eps, δ, lr, T = self.eps, self.δ, self.lr, self.T

        s0 = self.draw_offer_index()
        s = s0
        accept_count = 0

        for t in range(max_times):
            # choose action
            accept = np.argmax(qtable[s, :])
            if np.random.random() <= eps:
                accept = 1 - accept

            if accept == 1:
                accept_count += 1
            else:
                accept_count = 0

            TD, s_next = self.temp_diff(qtable, s, accept)

            # update qtable
            qtable_new = qtable.copy()
            qtable_new[s, accept] = qtable[s, accept] + lr*TD

            if np.max(np.abs(qtable_new-qtable)) <= δ:
                break

            if accept_count == T:
                break

            s, qtable = s_next, qtable_new

        return qtable_new
@jit(nopython=True)
def run_epochs(N, qlmc, qtable):
    """
    Run epochs N times with qtable from the last iteration each time.
    """
    for n in range(N):
        if n%(N/10)==0:
            print(f"Progress: EPOCHs = {n}")
        new_qtable = qlmc.run_one_epoch(qtable)
        qtable = new_qtable
    return qtable

def valfunc_from_qtable(qtable):
    return np.max(qtable, axis=1)

def compute_error(valfunc, valfunc_VFI):
    return np.mean(np.abs(valfunc-valfunc_VFI))

# create an instance of Qlearning_McCall
qlmc = Qlearning_McCall()

# run
qtable0 = np.zeros((len(w_default), 2))
qtable = run_epochs(20000, qlmc, qtable0)
Progress: EPOCHs = 0
print(qtable)

[[5305.16088055 2489.50966804]
[5253.8159475 5226.64004039]
[5364.14097411 5238.91426927]
[5300.18402637 5281.45326205]
[5249.23054515 5258.66740661]
[5293.42512907 5172.94256847]
[5360.27507946 5174.88150579]
[5242.78951826 5182.73171124]
[5326.34204678 5214.53853639]
[5295.90805617 5500.00000052]
[5316.42837414 6000. ]]
# inspect value function

valfunc_qlr = valfunc_from_qtable(qtable)
print(valfunc_qlr)
[5305.16088055 5253.8159475 5364.14097411 5300.18402637 5258.66740661

5293.42512907 5360.27507946 5242.78951826 5326.34204678 5500.00000052
6000. ]
# plot
fig, ax = plt.subplots()
ax.plot(w_default, valfunc_VFI, '-o', label='VFI')
ax.plot(w_default, valfunc_qlr, '-o', label='QL')
ax.set_xlabel('wages')
ax.set_ylabel('optimal value')
ax.legend()
plt.show()

Now, let us compute the case with a larger state space: 𝑛 = 30 instead of 𝑛 = 10.

n = 30                                      # larger state space; a and b are as before
q_new = BetaBinomial(n, a, b).pdf()         # new choice of q

w_min, w_max = 10, 60
w_new = np.linspace(w_min, w_max, n+1)

# plot distribution of wage offer
fig, ax = plt.subplots()
ax.plot(w_new, q_new, '-o', label='$q(w(i))$')
ax.legend()
plt.show()

# VFI
mcm = McCallModel(w=w_new, q=q_new)
valfunc_VFI, flag = mcm.VFI()

valfunc_VFI
array([4859.77015703, 4859.77015703, 4859.77015703, 4859.77015703,

4859.77015703, 4859.77015703, 4859.77015703, 4859.77015703,
4859.77015703, 4859.77015703, 4859.77015703, 4859.77015703,
4859.77015703, 4859.77015703, 4859.77015703, 4859.77015703,
4859.77015703, 4859.77015703, 4859.77015703, 4859.77015703,


4859.77015703, 4859.77015703, 4859.77015703, 4859.77015703,
5000. , 5166.66666667, 5333.33333333, 5500. ,
5666.66666667, 5833.33333333, 6000. ])
def plot_epochs(epochs_to_plot, quit_allowed=1):
    "Plot value function implied by outcomes of an increasing number of epochs."
    qlmc_new = Qlearning_McCall(w=w_new, q=q_new, quit_allowed=quit_allowed)
    qtable = np.zeros((len(w_new), 2))
    epochs_to_plot = np.asarray(epochs_to_plot)

    # plot
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(w_new, valfunc_VFI, '-o', label='VFI')

    max_epochs = np.max(epochs_to_plot)
    # iterate on epoch numbers
    for n in range(max_epochs + 1):
        if n%(max_epochs/10)==0:
            print(f"Progress: EPOCHs = {n}")
        if n in epochs_to_plot:
            valfunc_qlr = valfunc_from_qtable(qtable)
            error = compute_error(valfunc_qlr, valfunc_VFI)
            ax.plot(w_new, valfunc_qlr, '-o', label=f'QL:epochs={n}, mean error={error}')

        new_qtable = qlmc_new.run_one_epoch(qtable)
        qtable = new_qtable

    ax.set_xlabel('wages')
    ax.set_ylabel('optimal value')
    ax.legend()
    plt.show()

plot_epochs(epochs_to_plot=[100, 1000, 10000, 100000, 200000])

The graphs above indicate that

• the Q-learning algorithm has trouble learning the Q-table well for wages that are rarely drawn
• the quality of approximation to the “true” value function computed by value function iteration improves as the
number of epochs increases
33.6 Employed Worker Can’t Quit
The preceding version of temporal difference Q-learning described in equation system (33.8) lets an employed worker
quit, i.e., reject her wage as an incumbent and instead receive unemployment compensation this period and draw a new
offer next period.
This is an option that the McCall worker described in this quantecon lecture would not take.
See [Ljungqvist and Sargent, 2018], chapter 6 on search, for a proof.
But in the context of Q-learning, giving an employed worker the option to quit and collect unemployment compensation
turns out to accelerate the learning process by promoting experimentation rather than premature exploitation.
To illustrate this, we’ll amend our formulas for temporal differences to forbid an employed worker from quitting a job she
had accepted earlier.
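With this understanding about available choices, we obtain the following temporal difference values:
$$
\begin{aligned}
\widetilde{TD}(w, \text{accept}) &= \left[ w + \beta \tilde{Q}^{old}(w, \text{accept}) \right] - \tilde{Q}^{old}(w, \text{accept}) \\
\widetilde{TD}(w, \text{reject}) &= \left[ c + \beta \max_{a' \in \mathcal{A}} \tilde{Q}^{old}(w', a') \right] - \tilde{Q}^{old}(w, \text{reject}), \quad w' \sim F
\end{aligned} \tag{33.9}
$$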
It turns out that formulas (33.9) combined with our Q-learning recursion (33.7) can lead our agent to eventually learn the
optimal value function as well as in the case where an option to redraw can be exercised.
But learning is slower because an agent who ends up accepting a wage offer prematurely loses the option to explore new
states in the same episode and to adjust the value associated with that state.
This can lead to inferior outcomes when the number of epochs/episodes is low.
But if we increase the number of epochs/episodes, we can observe that the error decreases and the outcomes get better.
We illustrate these possibilities with the following code and graph.
plot_epochs(epochs_to_plot=[100, 1000, 10000, 100000, 200000], quit_allowed=0)

33.7 Possible Extensions
To extend the algorithm to handle problems with continuous state spaces, a typical approach is to restrict Q-functions and
policy functions to take particular functional forms.
This is the approach in deep Q-learning where the idea is to use a multilayer neural network as a good function
approximator.
We will take up this topic in a subsequent quantecon lecture.
Part VI
Consumption, Savings and Capital
CHAPTER
THIRTYFOUR
CASS-KOOPMANS MODEL
Contents
• Cass-Koopmans Model
– Overview
– The Model
– Planning Problem
– Shooting Algorithm
– Setting Initial Capital to Steady State Capital
– A Turnpike Property
– A Limiting Infinite Horizon Economy
– Concluding Remarks
34.1 Overview
This lecture and Cass-Koopmans Competitive Equilibrium describe a model that Tjalling Koopmans [Koopmans, 1965]
and David Cass [Cass, 1965] used to analyze optimal growth.
The model can be viewed as an extension of the model of Robert Solow described in an earlier lecture but adapted to
make the saving rate be a choice.
(Solow assumed a constant saving rate determined outside the model.)
We describe two versions of the model, one in this lecture and the other in Cass-Koopmans Competitive Equilibrium.
Together, the two lectures illustrate what is, in fact, a more general connection between a planned economy and a
decentralized economy organized as a competitive equilibrium.
This lecture is devoted to the planned economy version.
In the planned economy, there are
• no prices
• no budget constraints
Instead there is a dictator that tells people
• what to produce
• what to invest in physical capital

• who is to consume what and when
The lecture uses important ideas including
• A min-max problem for solving a planning problem.
• A shooting algorithm for solving difference equations subject to initial and terminal conditions.
• A turnpike property that describes optimal paths for long but finite-horizon economies.

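import matplotlib.pyplot as plt
from numba import njit, float64
from numba.experimental import jitclass
import numpy as np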
34.2 The Model
Time is discrete and takes values 𝑡 = 0, 1, … , 𝑇 where 𝑇 is finite.

(We’ll eventually study a limiting case in which 𝑇 = +∞)
A single good can either be consumed or invested in physical capital.
The consumption good is not durable and depreciates completely if not consumed immediately.
The capital good is durable but depreciates.
We let 𝐶𝑡 be the total consumption of a nondurable consumption good at time 𝑡.
Let 𝐾𝑡 be the stock of physical capital at time 𝑡.
Let $\vec{C} = \{C_0, \ldots, C_T\}$ and $\vec{K} = \{K_0, \ldots, K_{T+1}\}$.
34.2.1 Digression: Aggregation Theory
We use a concept of a representative consumer to be thought of as follows.

There is a unit mass of identical consumers indexed by 𝜔 ∈ [0, 1].
Consumption of consumer 𝜔 is 𝑐(𝜔).
Aggregate consumption is
$$ C = \int_0^1 c(\omega) \, d\omega $$
Consider a welfare problem that chooses an allocation {𝑐(𝜔)} across consumers to maximize
$$ \int_0^1 u(c(\omega)) \, d\omega $$
where 𝑢(⋅) is a concave utility function with 𝑢′ > 0, 𝑢″ < 0 and maximization is subject to
$$ C = \int_0^1 c(\omega) \, d\omega. \tag{34.1} $$
Form a Lagrangian $L = \int_0^1 u(c(\omega)) \, d\omega + \lambda \left[ C - \int_0^1 c(\omega) \, d\omega \right]$.
Differentiate under the integral signs with respect to 𝑐(𝜔) for each 𝜔 to obtain the first-order necessary conditions
𝑢′ (𝑐(𝜔)) = 𝜆.
These conditions imply that 𝑐(𝜔) equals a constant 𝑐 that is independent of 𝜔.

To find 𝑐, use feasibility constraint (34.1) to conclude that
𝑐(𝜔) = 𝑐 = 𝐶.
This line of argument indicates the special aggregation theory that lies beneath outcomes in which a representative
consumer consumes amount 𝐶.
It appears often in aggregate economics.
We shall use this aggregation theory here and also in the lecture Cass-Koopmans Competitive Equilibrium.
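The equal-allocation result can be illustrated numerically. The sketch below rests on assumptions that are not part of the argument above: a CRRA utility with 𝛾 = 2 as one example of a concave 𝑢, a grid of 1,000 consumers standing in for the unit interval, and randomly generated unequal allocations that are rescaled to have the same average 𝐶.

```python
import numpy as np

def u(c, γ=2.0):
    """CRRA utility, used here only as an example of a concave u."""
    return c ** (1 - γ) / (1 - γ)

rng = np.random.default_rng(0)           # arbitrary seed
C = 1.0                                  # aggregate (= average) consumption
m = 1_000                                # consumers on a grid approximating [0, 1]

equal_welfare = np.mean(u(np.full(m, C)))

# Random unequal allocations rescaled so that their mean is exactly C
unequal_welfare = []
for _ in range(100):
    alloc = rng.uniform(0.5, 1.5, size=m)
    alloc *= C / alloc.mean()
    unequal_welfare.append(np.mean(u(alloc)))

max_unequal = max(unequal_welfare)
```

Every unequal allocation yields lower average utility than the equal one, as Jensen's inequality implies for a strictly concave 𝑢.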
An Economy
A representative household is endowed with one unit of labor at each 𝑡 and likes the consumption good at each 𝑡.
The representative household inelastically supplies a single unit of labor 𝑁𝑡 at each 𝑡, so that 𝑁𝑡 = 1 for all 𝑡 ∈
{0, 1, … , 𝑇 }.
The representative household has preferences over consumption bundles ordered by the utility functional:
$$ U(\vec{C}) = \sum_{t=0}^{T} \beta^t \frac{C_t^{1-\gamma}}{1-\gamma} \tag{34.2} $$
where 𝛽 ∈ (0, 1) is a discount factor and 𝛾 > 0 governs the curvature of the one-period utility function.
Larger 𝛾’s imply more curvature.
Note that
$$ u(C_t) = \frac{C_t^{1-\gamma}}{1-\gamma} \tag{34.3} $$
satisfies 𝑢′ > 0, 𝑢″ < 0.

𝑢′ > 0 asserts that the consumer prefers more to less.
𝑢″ < 0 asserts that marginal utility declines with increases in 𝐶𝑡 .
We assume that 𝐾0 > 0 is an exogenous initial capital stock.
There is an economy-wide production function
$$ F(K_t, N_t) = A K_t^{\alpha} N_t^{1-\alpha} \tag{34.4} $$
with 0 < 𝛼 < 1, 𝐴 > 0.

A feasible allocation $\vec{C}, \vec{K}$ satisfies
$$ C_t + K_{t+1} \le F(K_t, N_t) + (1 - \delta) K_t \quad \text{for all } t \in \{0, 1, \ldots, T\} \tag{34.5} $$
where 𝛿 ∈ (0, 1) is a depreciation rate of capital.
34.3 Planning Problem
A planner chooses an allocation $\{\vec{C}, \vec{K}\}$ to maximize (34.2) subject to (34.5).
Let $\vec{\mu} = \{\mu_0, \ldots, \mu_T\}$ be a sequence of nonnegative Lagrange multipliers.

To find an optimal allocation, form a Lagrangian
$$ \mathcal{L}(\vec{C}, \vec{K}, \vec{\mu}) = \sum_{t=0}^{T} \beta^t \left\{ u(C_t) + \mu_t \left( F(K_t, 1) + (1 - \delta) K_t - C_t - K_{t+1} \right) \right\} \tag{34.6} $$
and pose the following min-max problem:
$$ \min_{\vec{\mu}} \max_{\vec{C}, \vec{K}} \mathcal{L}(\vec{C}, \vec{K}, \vec{\mu}) \tag{34.7} $$
• Extremization means maximization with respect to $\vec{C}, \vec{K}$ and minimization with respect to $\vec{\mu}$.
• Our problem satisfies conditions that assure that second-order conditions are satisfied at an allocation that satisfies
the first-order necessary conditions that we are about to compute.
Before computing first-order conditions, we present some handy formulas.
34.3.1 Useful Properties of Linearly Homogeneous Production Function
The following technicalities will help us.

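Notice that
$$ F(K_t, N_t) = A K_t^{\alpha} N_t^{1-\alpha} = N_t A \left( \frac{K_t}{N_t} \right)^{\alpha} $$
Define the output per-capita production function
$$ \frac{F(K_t, N_t)}{N_t} \equiv f\left( \frac{K_t}{N_t} \right) = A \left( \frac{K_t}{N_t} \right)^{\alpha} $$
whose argument is capital per-capita.
It is useful to recall the following calculations for the marginal product of capital
$$
\begin{aligned}
\frac{\partial F(K_t, N_t)}{\partial K_t} &= \frac{\partial N_t f\left( \frac{K_t}{N_t} \right)}{\partial K_t} \\
&= N_t f'\left( \frac{K_t}{N_t} \right) \frac{1}{N_t} \quad \text{(Chain rule)} \\
&= f'\left( \frac{K_t}{N_t} \right) \Big|_{N_t = 1} \\
&= f'(K_t)
\end{aligned} \tag{34.8}
$$
and the marginal product of labor
$$
\begin{aligned}
\frac{\partial F(K_t, N_t)}{\partial N_t} &= \frac{\partial N_t f\left( \frac{K_t}{N_t} \right)}{\partial N_t} \quad \text{(Product rule)} \\
&= f\left( \frac{K_t}{N_t} \right) + N_t f'\left( \frac{K_t}{N_t} \right) \frac{-K_t}{N_t^2} \quad \text{(Chain rule)} \\
&= f\left( \frac{K_t}{N_t} \right) - \frac{K_t}{N_t} f'\left( \frac{K_t}{N_t} \right) \Big|_{N_t = 1} \\
&= f(K_t) - f'(K_t) K_t
\end{aligned}
$$
(Here we are using that $N_t = 1$ for all $t$, so that $K_t = \frac{K_t}{N_t}$.)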
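The two marginal-product formulas can be checked against finite differences of 𝐹. This is a minimal sketch: the parameter values 𝐴 = 1 and 𝛼 = 0.33 are illustrative, and the evaluation point 𝐾 = 2 and the step size are arbitrary assumptions.

```python
import numpy as np

A, α = 1.0, 0.33                          # illustrative parameter values

def F(K, N):
    return A * K**α * N**(1 - α)

def f(k):
    return A * k**α

def f_prime(k):
    return α * A * k**(α - 1)

K, N, h = 2.0, 1.0, 1e-6                  # evaluation point and step size (arbitrary)

# Central finite differences of F in each argument
mpk_fd = (F(K + h, N) - F(K - h, N)) / (2 * h)
mpl_fd = (F(K, N + h) - F(K, N - h)) / (2 * h)

# Closed forms from the derivation, evaluated at N = 1
mpk = f_prime(K)
mpl = f(K) - f_prime(K) * K
```

Since 𝐹 is linearly homogeneous, the closed forms also satisfy `mpk * K + mpl * N == F(K, N)` up to rounding, so factor payments at marginal products exhaust output.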
34.3.2 First-order necessary conditions
We now compute first-order necessary conditions for extremization of Lagrangian (34.6):
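$$ C_t: \quad u'(C_t) - \mu_t = 0 \quad \text{for all } t = 0, 1, \ldots, T \tag{34.9} $$
$$ K_t: \quad \beta \mu_t \left[ (1 - \delta) + f'(K_t) \right] - \mu_{t-1} = 0 \quad \text{for all } t = 1, 2, \ldots, T \tag{34.10} $$
$$ \mu_t: \quad F(K_t, 1) + (1 - \delta) K_t - C_t - K_{t+1} = 0 \quad \text{for all } t = 0, 1, \ldots, T \tag{34.11} $$
$$ K_{T+1}: \quad -\mu_T \le 0, \quad \le 0 \text{ if } K_{T+1} = 0; \quad = 0 \text{ if } K_{T+1} > 0 \tag{34.12} $$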
In computing (34.10) we recognize that 𝐾𝑡 appears in both the time 𝑡 and time 𝑡 − 1 feasibility constraints (34.5).
Restrictions (34.12) come from differentiating with respect to 𝐾𝑇 +1 and applying the following Karush-Kuhn-Tucker
condition (KKT) (see Karush-Kuhn-Tucker conditions):
𝜇𝑇 𝐾𝑇 +1 = 0 (34.13)
Combining (34.9) and (34.10) gives
$$ \beta u'(C_t) \left[ (1 - \delta) + f'(K_t) \right] - u'(C_{t-1}) = 0 \quad \text{for all } t = 1, 2, \ldots, T $$
which can be rearranged to become
$$ \beta u'(C_{t+1}) \left[ (1 - \delta) + f'(K_{t+1}) \right] = u'(C_t) \quad \text{for all } t = 0, 1, \ldots, T - 1 \tag{34.14} $$
Applying the inverse marginal utility of consumption function on both sides of the above equation gives
$$ C_{t+1} = u'^{-1} \left( \left( \frac{\beta}{u'(C_t)} \left[ f'(K_{t+1}) + (1 - \delta) \right] \right)^{-1} \right) $$
which for our utility function (34.3) becomes the consumption Euler equation
$$ C_{t+1} = \left( \beta C_t^{\gamma} \left[ f'(K_{t+1}) + (1 - \delta) \right] \right)^{1/\gamma} $$
which we can combine with the feasibility constraint (34.5) to get
$$
\begin{aligned}
C_{t+1} &= C_t \left( \beta \left[ f'\left( F(K_t, 1) + (1 - \delta) K_t - C_t \right) + (1 - \delta) \right] \right)^{1/\gamma} \\
K_{t+1} &= F(K_t, 1) + (1 - \delta) K_t - C_t.
\end{aligned}
$$
This is a pair of non-linear first-order difference equations that map $C_t, K_t$ into $C_{t+1}, K_{t+1}$ and that an optimal sequence
$\vec{C}, \vec{K}$ must satisfy.
It must also satisfy the initial condition that $K_0$ is given and the terminal condition $K_{T+1} = 0$.
Below we define a jitclass that stores parameters and functions that define our economy.
planning_data = [
    ('γ', float64),    # Coefficient of relative risk aversion
    ('β', float64),    # Discount factor
    ('δ', float64),    # Depreciation rate on capital
    ('α', float64),    # Return to capital per capita
    ('A', float64)     # Technology
]

@jitclass(planning_data)
class PlanningProblem():

    def __init__(self, γ=2, β=0.95, δ=0.02, α=0.33, A=1):
        self.γ, self.β = γ, β
        self.δ, self.α, self.A = δ, α, A

    def u(self, c):
        '''
        Utility function
        ASIDE: If you have a utility function that is hard to solve by hand
        you can use automatic or symbolic differentiation
        See https://github.com/HIPS/autograd
        '''
        γ = self.γ
        return c ** (1 - γ) / (1 - γ) if γ != 1 else np.log(c)

    def u_prime(self, c):
        'Derivative of utility'
        γ = self.γ
        return c ** (-γ)

    def u_prime_inv(self, c):
        'Inverse of derivative of utility'
        γ = self.γ
        return c ** (-1 / γ)

    def f(self, k):
        'Production function'
        α, A = self.α, self.A
        return A * k ** α

    def f_prime(self, k):
        'Derivative of production function'
        α, A = self.α, self.A
        return α * A * k ** (α - 1)

    def f_prime_inv(self, k):
        'Inverse of derivative of production function'
        α, A = self.α, self.A
        return (k / (A * α)) ** (1 / (α - 1))

    def next_k_c(self, k, c):
        '''
        Given the current capital Kt and an arbitrary feasible
        consumption choice Ct, computes Kt+1 by state transition law
        and optimal Ct+1 by Euler equation.
        '''
        β, δ = self.β, self.δ
        u_prime, u_prime_inv = self.u_prime, self.u_prime_inv
        f, f_prime = self.f, self.f_prime

        k_next = f(k) + (1 - δ) * k - c
        c_next = u_prime_inv(u_prime(c) / (β * (f_prime(k_next) + (1 - δ))))

        return k_next, c_next
We can construct an economy with the Python code:
pp = PlanningProblem()
34.4 Shooting Algorithm
We use shooting to compute an optimal allocation $\vec{C}, \vec{K}$ and an associated Lagrange multiplier sequence $\vec{\mu}$.
First-order necessary conditions (34.9), (34.10), and (34.11) for the planning problem form a system of difference
equations with two boundary conditions:
• 𝐾0 is a given initial condition for capital
• 𝐾𝑇 +1 = 0 is a terminal condition for capital that we deduced from the first-order necessary condition for 𝐾𝑇 +1
and the KKT condition (34.13)
We have no initial condition for the Lagrange multiplier 𝜇0 .
If we did, our job would be easy:
• Given 𝜇0 and 𝑘0 , we could compute 𝑐0 from equation (34.9) and then 𝑘1 from equation (34.11) and 𝜇1 from
equation (34.10).
• We could continue in this way to compute the remaining elements of $\vec{C}, \vec{K}, \vec{\mu}$.
However, we would not be assured that the Kuhn-Tucker condition (34.13) would be satisfied.
Furthermore, we don’t have an initial condition for 𝜇0 .
So this won’t work.
Indeed, part of our task is to compute the optimal value of 𝜇0 .
To compute 𝜇0 and the other objects we want, a simple modification of the above procedure will work.
It is called the shooting algorithm.
It is an instance of a guess and verify algorithm that consists of the following steps:
• Guess an initial Lagrange multiplier 𝜇0 .
• Apply the simple algorithm described above.
• Compute 𝐾𝑇 +1 and check whether it equals zero.
• If 𝐾𝑇 +1 = 0, we have solved the problem.
• If 𝐾𝑇 +1 > 0, lower 𝜇0 and try again.
• If 𝐾𝑇 +1 < 0, raise 𝜇0 and try again.
The following Python code implements the shooting algorithm for the planning problem.
(Actually, we modified the preceding algorithm slightly by starting with a guess for 𝑐0 instead of 𝜇0 in the following code.)
@njit
def shooting(pp, c0, k0, T=10):
    '''
    Given the initial condition of capital k0 and an initial guess
    of consumption c0, computes the whole paths of c and k
    using the state transition law and Euler equation for T periods.
    '''
    if c0 > pp.f(k0):
        print("initial consumption is not feasible")
        return None

    # initialize vectors of c and k
    c_vec = np.empty(T+1)
    k_vec = np.empty(T+2)

    c_vec[0] = c0
    k_vec[0] = k0

    for t in range(T):
        k_vec[t+1], c_vec[t+1] = pp.next_k_c(k_vec[t], c_vec[t])

    k_vec[T+1] = pp.f(k_vec[T]) + (1 - pp.δ) * k_vec[T] - c_vec[T]

    return c_vec, k_vec
We’ll start with an incorrect guess.
paths = shooting(pp, 0.2, 0.3, T=10)

fig, axs = plt.subplots(1, 2, figsize=(14, 4))

colors = ['blue', 'red']
titles = ['Consumption', 'Capital']
ylabels = ['$c_t$', '$k_t$']

T = paths[0].size - 1
for i in range(2):
    axs[i].plot(paths[i], c=colors[i])
    axs[i].set(xlabel='t', ylabel=ylabels[i], title=titles[i])

axs[1].scatter(T+1, 0, s=80)
axs[1].axvline(T+1, color='k', ls='--', lw=1)

plt.show()

Evidently, our initial guess for 𝜇0 is too high, so initial consumption is too low.
We know this because we miss our 𝐾𝑇 +1 = 0 target on the high side.
Now we automate things with a search-for-a-good 𝜇0 algorithm that stops when we hit the target 𝐾𝑇 +1 = 0.
We use a bisection method.
We make an initial guess for 𝐶0 (we can eliminate 𝜇0 because 𝐶0 is an exact function of 𝜇0 ).
We know that the lowest 𝐶0 can ever be is 0 and that the largest it can be is initial output 𝑓(𝐾0 ).
Guess 𝐶0 and shoot forward to 𝑇 + 1.
If 𝐾𝑇 +1 > 0, we take it to be our new lower bound on 𝐶0 .
If 𝐾𝑇 +1 < 0, we take it to be our new upper bound.
Make a new guess for 𝐶0 that is halfway between our new upper and lower bounds.
Shoot forward again, iterating on these steps until we converge.
When 𝐾𝑇 +1 gets close enough to 0 (i.e., within an error tolerance), we stop.
@njit
def bisection(pp, c0, k0, T=10, tol=1e-4, max_iter=500, k_ter=0, verbose=True):

    # initial boundaries for guess c0
    c0_upper = pp.f(k0)
    c0_lower = 0

    i = 0
    while True:
        c_vec, k_vec = shooting(pp, c0, k0, T)
        error = k_vec[-1] - k_ter

        # check if the terminal condition is satisfied
        if np.abs(error) < tol:
            if verbose:
                print('Converged successfully on iteration ', i+1)
            return c_vec, k_vec

        i += 1
        if i == max_iter:
            if verbose:
                print('Convergence failed.')
            return c_vec, k_vec

        # if iteration continues, updates boundaries and guess of c0
        if error > 0:
            c0_lower = c0
        else:
            c0_upper = c0

        c0 = (c0_lower + c0_upper) / 2
def plot_paths(pp, c0, k0, T_arr, k_ter=0, k_ss=None, axs=None):

    if axs is None:
        fig, axs = plt.subplots(1, 3, figsize=(16, 4))
    ylabels = ['$c_t$', '$k_t$', r'$\mu_t$']
    titles = ['Consumption', 'Capital', 'Lagrange Multiplier']

    c_paths = []
    k_paths = []
    for T in T_arr:
        c_vec, k_vec = bisection(pp, c0, k0, T, k_ter=k_ter, verbose=False)
        c_paths.append(c_vec)
        k_paths.append(k_vec)

        μ_vec = pp.u_prime(c_vec)
        paths = [c_vec, k_vec, μ_vec]

        for i in range(3):
            axs[i].plot(paths[i])
            axs[i].set(xlabel='t', ylabel=ylabels[i], title=titles[i])

        # Plot steady state value of capital
        if k_ss is not None:
            axs[1].axhline(k_ss, c='k', ls='--', lw=1)

        axs[1].axvline(T+1, c='k', ls='--', lw=1)
        axs[1].scatter(T+1, paths[1][-1], s=80)

    return c_paths, k_paths
Now we can solve the model and plot the paths of consumption, capital, and the Lagrange multiplier.
plot_paths(pp, 0.3, 0.3, [10]);

34.5 Setting Initial Capital to Steady State Capital
When 𝑇 → +∞, the optimal allocation converges to steady state values of 𝐶𝑡 and 𝐾𝑡 .
It is instructive to set 𝐾0 equal to the lim𝑇 →+∞ 𝐾𝑡 , which we’ll call steady state capital.
In a steady state 𝐾𝑡+1 = 𝐾𝑡 = 𝐾̄ for all very large 𝑡.
Evaluating the feasibility constraint (34.5) at $\bar K$ gives
$$ f(\bar K) - \delta \bar K = \bar C \tag{34.15} $$
Substituting $K_t = \bar K$ and $C_t = \bar C$ for all $t$ into (34.14) gives
$$ 1 = \beta \frac{u'(\bar C)}{u'(\bar C)} [f'(\bar K) + (1 - \delta)] $$
Defining $\beta = \frac{1}{1+\rho}$ and cancelling gives
$$ 1 + \rho = f'(\bar K) + (1 - \delta) $$
Simplifying gives
𝑓 ′ (𝐾)̄ = 𝜌 + 𝛿
and
$$ \bar K = f'^{-1}(\rho + \delta) $$
For production function (34.4), this becomes
$$ \alpha \bar K^{\alpha - 1} = \rho + \delta $$
As an example, after setting $\alpha = .33$, $\rho = 1/\beta - 1 = 1/(19/20) - 1 = 20/19 - 19/19 = 1/19$, $\delta = 1/50$, we get
$$ \bar K = \left( \frac{\frac{33}{100}}{\frac{1}{50} + \frac{1}{19}} \right)^{\frac{100}{67}} \approx 9.57583 $$
Let’s verify this with Python and then use this steady state 𝐾̄ as our initial capital stock 𝐾0 .
ρ = 1 / pp.β - 1
k_ss = pp.f_prime_inv(ρ+pp.δ)
print(f'steady state for capital is: {k_ss}')

steady state for capital is: 9.57583816331462
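The steady state conditions derived above can also be checked without the `pp` object. The following is a minimal sketch that assumes the parameter values used in this lecture (α = 0.33, β = 0.95, δ = 0.02, A = 1); the names `k_bar`, `mpk_gap` and `euler_gap` are introduced here for illustration only.

```python
# Parameters assumed to match the lecture's defaults
α, β, δ, A = 0.33, 0.95, 0.02, 1.0
ρ = 1 / β - 1

def f_prime(k):
    "Marginal product of capital for f(k) = A k^α."
    return α * A * k ** (α - 1)

# Closed form: α A K^(α-1) = ρ + δ  implies  K = (α A / (ρ + δ))^(1 / (1 - α))
k_bar = (α * A / (ρ + δ)) ** (1 / (1 - α))

# Residuals of two equivalent statements of the steady state condition
mpk_gap = f_prime(k_bar) - (ρ + δ)              # f'(K) = ρ + δ
euler_gap = β * (f_prime(k_bar) + 1 - δ) - 1    # 1 = β [f'(K) + (1 - δ)]

print(k_bar, mpk_gap, euler_gap)
```

Both residuals should be zero up to floating point error, and `k_bar` should agree with the value printed above.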
Now we plot
plot_paths(pp, 0.3, k_ss, [150], k_ss=k_ss);
Evidently, with a large value of 𝑇 , 𝐾𝑡 stays near 𝐾0 until 𝑡 approaches 𝑇 closely.

Let’s see what the planner does when we set 𝐾0 below 𝐾̄.
plot_paths(pp, 0.3, k_ss/3, [150], k_ss=k_ss);
Notice how the planner pushes capital toward the steady state, stays near there for a while, then pushes 𝐾𝑡 toward the
terminal value 𝐾𝑇 +1 = 0 when 𝑡 closely approaches 𝑇 .
The following graphs compare optimal outcomes as we vary 𝑇 .
plot_paths(pp, 0.3, k_ss/3, [150, 75, 50, 25], k_ss=k_ss);

34.6 A Turnpike Property
The following calculation indicates that when 𝑇 is very large, the optimal capital stock stays close to its steady state value
most of the time.
plot_paths(pp, 0.3, k_ss/3, [250, 150, 50, 25], k_ss=k_ss);
In the above graphs, different colors are associated with different horizons 𝑇 .
Notice that as the horizon increases, the planner keeps 𝐾𝑡 closer to the steady state value 𝐾̄ for longer.
This pattern reflects a turnpike property of the steady state.
A rule of thumb for the planner is
• from 𝐾0 , push 𝐾𝑡 toward the steady state and stay close to the steady state until time approaches 𝑇 .
The planner accomplishes this by adjusting the saving rate $\frac{f(K_t) - C_t}{f(K_t)}$ over time.
Let’s calculate and plot the saving rate.
@njit
def saving_rate(pp, c_path, k_path):
    'Given paths of c and k, computes the path of saving rate.'
    production = pp.f(k_path[:-1])

    return (production - c_path) / production

def plot_saving_rate(pp, c0, k0, T_arr, k_ter=0, k_ss=None, s_ss=None):

    # 2x2 grid: plot_paths fills the first three panels,
    # the saving rate goes in the fourth
    fig, axs = plt.subplots(2, 2, figsize=(12, 9))

    c_paths, k_paths = plot_paths(pp, c0, k0, T_arr, k_ter=k_ter, k_ss=k_ss,
                                  axs=axs.flatten())

    for i, T in enumerate(T_arr):
        s_path = saving_rate(pp, c_paths[i], k_paths[i])
        axs[1, 1].plot(s_path)

    axs[1, 1].set(xlabel='t', ylabel='$s_t$', title='Saving rate')

    if s_ss is not None:
        axs[1, 1].hlines(s_ss, 0, np.max(T_arr), linestyle='--')

plot_saving_rate(pp, 0.3, k_ss/3, [250, 150, 75, 50], k_ss=k_ss)
34.7 A Limiting Infinite Horizon Economy
We want to set 𝑇 = +∞.

The appropriate thing to do is to replace terminal condition (34.12) with
$$ \lim_{T \to +\infty} \beta^T u'(C_T) K_{T+1} = 0, $$
a condition that will be satisfied by a path that converges to an optimal steady state.
We can approximate the optimal path by starting from an arbitrary initial 𝐾0 and shooting towards the optimal steady state 𝐾̄ at a large but finite 𝑇 + 1.
In the following code, we do this for a large 𝑇 and plot consumption, capital, and the saving rate.
We know that in the steady state the saving rate is constant and that $\bar s = \frac{f(\bar K) - \bar C}{f(\bar K)}$.
From (34.15) the steady state saving rate equals
$$ \bar s = \frac{\delta \bar K}{f(\bar K)} $$
The steady state level of saving $\bar S = \bar s f(\bar K)$ is the amount required to offset capital depreciation each period.
We first study optimal capital paths that start below the steady state.
# steady state of saving rate

s_ss = pp.δ * k_ss / pp.f(k_ss)
plot_saving_rate(pp, 0.3, k_ss/3, [130], k_ter=k_ss, k_ss=k_ss, s_ss=s_ss)
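As a cross-check on the steady state saving rate plotted above, the sketch below evaluates it in three ways under the lecture's default parameters (assumed here: α = 0.33, β = 0.95, δ = 0.02, A = 1). The third expression, αδ/(ρ + δ), is not stated in the lecture; it follows for the Cobb-Douglas case by combining $\bar s = \delta \bar K / f(\bar K)$ with $f'(\bar K) = \rho + \delta$.

```python
α, β, δ, A = 0.33, 0.95, 0.02, 1.0
ρ = 1 / β - 1

def f(k):
    "Production function f(k) = A k^α."
    return A * k ** α

k_bar = (α * A / (ρ + δ)) ** (1 / (1 - α))
c_bar = f(k_bar) - δ * k_bar                 # steady state consumption, from (34.15)

s_from_definition = (f(k_bar) - c_bar) / f(k_bar)
s_from_depreciation = δ * k_bar / f(k_bar)
s_closed_form = α * δ / (ρ + δ)              # Cobb-Douglas shortcut derived here

print(s_from_definition, s_from_depreciation, s_closed_form)
```

The three numbers should coincide up to rounding, at roughly 9 percent of output.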
Since 𝐾0 < 𝐾̄, 𝑓 ′ (𝐾0 ) > 𝜌 + 𝛿.

The planner chooses a positive saving rate that is higher than the steady state saving rate.
Note that 𝑓 ″ (𝐾) < 0, so as 𝐾 rises, 𝑓 ′ (𝐾) declines.
The planner slowly lowers the saving rate until reaching a steady state in which 𝑓 ′ (𝐾) = 𝜌 + 𝛿.
34.7.1 Exercise
Exercise 34.7.1
• Plot the optimal consumption, capital, and saving paths when the initial capital level begins at 1.5 times the steady
state level as we shoot towards the steady state at 𝑇 = 130.
• Why does the saving rate respond as it does?
plot_saving_rate(pp, 0.3, k_ss*1.5, [130], k_ter=k_ss, k_ss=k_ss, s_ss=s_ss)

In Cass-Koopmans Competitive Equilibrium, we study a decentralized version of an economy with exactly the same technology and preference structure as deployed here.
In that lecture, we replace the planner of this lecture with Adam Smith’s invisible hand.
In place of quantity choices made by the planner, there are market prices that are set by a deus ex machina from outside
the model, a so-called invisible hand.
Equilibrium market prices must reconcile distinct decisions that are made independently by a representative household
and a representative firm.
The relationship between a command economy like the one studied in this lecture and a market economy like that studied
in Cass-Koopmans Competitive Equilibrium is a foundational topic in general equilibrium theory and welfare economics.


CHAPTER
THIRTYFIVE
CASS-KOOPMANS COMPETITIVE EQUILIBRIUM
Contents
• Cass-Koopmans Competitive Equilibrium

– Overview
– Review of Cass-Koopmans Model
– Competitive Equilibrium
– Market Structure
– Firm Problem
– Household Problem
– Computing a Competitive Equilibrium
– Yield Curves and Hicks-Arrow Prices
35.1 Overview
This lecture continues the analysis begun in Cass-Koopmans Planning Model of the model that Tjalling Koopmans [Koopmans, 1965] and David Cass [Cass, 1965] used to study optimal capital accumulation.
This lecture illustrates what is, in fact, a more general connection between a planned economy and an economy organized
as a competitive equilibrium or a market economy.
The earlier lecture Cass-Koopmans Planning Model studied a planning problem and used ideas including
• A Lagrangian formulation of the planning problem that leads to a system of difference equations.
• A shooting algorithm for solving difference equations subject to initial and terminal conditions.
• A turnpike property that describes optimal paths for long-but-finite horizon economies.
The present lecture uses additional ideas including
• Hicks-Arrow prices, named after John R. Hicks and Kenneth Arrow.
• A connection between some Lagrange multipliers from the planning problem and the Hicks-Arrow prices.
• A Big 𝐾 , little 𝑘 trick widely used in macroeconomic dynamics.
– We shall encounter this trick in this lecture and also in this lecture.
• A non-stochastic version of a theory of the term structure of interest rates.
• An intimate connection between two ways to organize an economy, namely:

– socialism in which a central planner commands the allocation of resources, and
– competitive markets in which competitive equilibrium prices induce individual consumers and producers
to choose a socially optimal allocation as unintended consequences of their selfish decisions

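import matplotlib.pyplot as plt
import numpy as np
from numba import njit, float64
from numba.experimental import jitclass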
35.2 Review of Cass-Koopmans Model
The physical setting is identical with that in Cass-Koopmans Planning Model.

Time is discrete and takes values 𝑡 = 0, 1, … , 𝑇 .
Output of a single good can either be consumed or invested in physical capital.
The capital good is durable but partially depreciates each period at a constant rate.
We let 𝐶𝑡 be a nondurable consumption good at time t.
Let 𝐾𝑡 be the stock of physical capital at time t.
Let 𝐶 ⃗ = {𝐶0 , … , 𝐶𝑇 } and 𝐾⃗ = {𝐾0 , … , 𝐾𝑇 +1 }.
A representative household is endowed with one unit of labor at each 𝑡 and likes the consumption good at each 𝑡.
The representative household inelastically supplies a single unit of labor 𝑁𝑡 at each 𝑡, so that 𝑁𝑡 = 1 for all 𝑡 ∈
{0, 1, … , 𝑇 }.
The representative household has preferences over consumption bundles ordered by the utility functional:
$$ U(\vec C) = \sum_{t=0}^{T} \beta^t \frac{C_t^{1-\gamma}}{1-\gamma} $$
where 𝛽 ∈ (0, 1) is a discount factor and 𝛾 > 0 governs the curvature of the one-period utility function.
We assume that 𝐾0 > 0.
There is an economy-wide production function
$$ F(K_t, N_t) = A K_t^{\alpha} N_t^{1-\alpha} $$
with 0 < 𝛼 < 1, 𝐴 > 0.

A feasible allocation 𝐶,⃗ 𝐾⃗ satisfies
𝐶𝑡 + 𝐾𝑡+1 ≤ 𝐹 (𝐾𝑡 , 𝑁𝑡 ) + (1 − 𝛿)𝐾𝑡 for all 𝑡 ∈ {0, 1, … , 𝑇 }
where 𝛿 ∈ (0, 1) is a depreciation rate of capital.
35.2.1 Planning Problem
In the lecture Cass-Koopmans Planning Model, we studied a problem in which a planner chooses an allocation $\{\vec C, \vec K\}$ to maximize (34.2) subject to (34.5).
The allocation that solves the planning problem reappears in a competitive equilibrium, as we shall see below.
35.3 Competitive Equilibrium
We now study a decentralized version of the economy.

It shares the same technology and preference structure as the planned economy studied in this lecture Cass-Koopmans
Planning Model.
But now there is no planner.
There are (unit masses of) price-taking consumers and firms.
Market prices are set to reconcile distinct decisions that are made separately by a representative consumer and a representative firm.
There is a representative consumer who has the same preferences over consumption plans as did a consumer in the planned
economy.
Instead of being told what to consume and save by a planner, a consumer (also known as a household) chooses for itself
subject to a budget constraint.
• At each time 𝑡, the consumer receives wages and rentals of capital from a firm – these comprise its income at time
𝑡.
• The consumer decides how much income to allocate to consumption or to savings.
• The household can save either by acquiring additional physical capital (it trades one for one with time 𝑡 consumption)
or by acquiring claims on consumption at dates other than 𝑡.
• The household owns physical capital and labor and rents them to the firm.
• The household consumes, supplies labor, and invests in physical capital.
• A profit-maximizing representative firm operates the production technology.
• The firm rents labor and capital each period from the representative household and sells its output each period to
the household.
• The representative household and the representative firm are both price takers who believe that prices are not
affected by their choices
Note: Again, we can think of there being unit measures of identical representative consumers and identical representative
firms.
35.4 Market Structure
The representative household and the representative firm are both price takers.
The household owns both factors of production, namely, labor and physical capital.
Each period, the firm rents both factors from the household.
There is a single grand competitive market in which a household trades date 0 goods for goods at all other dates 𝑡 =
1, 2, … , 𝑇 .
35.4.1 Prices
There are sequences of prices $\{w_t, \eta_t\}_{t=0}^T = \{\vec w, \vec \eta\}$ where

• 𝑤𝑡 is a wage, i.e., a rental rate, for labor at time 𝑡
• 𝜂𝑡 is a rental rate for capital at time 𝑡
In addition there is a vector $\{q_t^0\}$ of intertemporal prices where
• $q_t^0$ is the price at time 0 of one unit of the good at date $t$.
We call $\{q_t^0\}_{t=0}^T$ a vector of Hicks-Arrow prices, named after the 1972 economics Nobel prize winners.
Because $q_t^0$ is a relative price, the unit of account in terms of which the prices $q_t^0$ are stated is arbitrary; we are free to re-normalize them by multiplying all of them by a positive scalar, say $\lambda > 0$.
Units of $q_t^0$ could be set so that they are
$$ \frac{\text{number of time 0 goods}}{\text{number of time } t \text{ goods}} $$
In this case, we would be taking the time 0 consumption good to be the numeraire.
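The freedom to re-normalize can be illustrated with a few lines of code. The sketch below uses made-up price and excess-demand vectors (the numbers are hypothetical and chosen only for illustration) to show that multiplying every price by λ > 0 rescales the present value of a plan without changing its sign, and that dividing by the time 0 price makes the time 0 good the numeraire.

```python
q = [1.0, 0.93, 0.88, 0.85]      # hypothetical Hicks-Arrow prices
e = [0.2, -0.1, -0.05, -0.1]     # hypothetical net excess demands at each date

def present_value(prices, excess):
    "Value of a sequence of excess demands at the given intertemporal prices."
    return sum(p * x for p, x in zip(prices, excess))

λ = 2.5
q_scaled = [λ * p for p in q]

pv = present_value(q, e)
pv_scaled = present_value(q_scaled, e)   # equals λ * pv, so the sign is unchanged

# Taking the time 0 good as numeraire means dividing all prices by the time 0 price
q_numeraire = [p / q_scaled[0] for p in q_scaled]

print(pv, pv_scaled, q_numeraire)
```

Since only the sign of the present value matters for whether a plan is affordable, the two price vectors describe the same budget set.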
35.5 Firm Problem
At time 𝑡 a representative firm hires labor 𝑛̃ 𝑡 and capital 𝑘̃ 𝑡 .

The firm’s profits at time 𝑡 are
𝐹 (𝑘̃ 𝑡 , 𝑛̃ 𝑡 ) − 𝑤𝑡 𝑛̃ 𝑡 − 𝜂𝑡 𝑘̃ 𝑡
where 𝑤𝑡 is a wage rate at 𝑡 and 𝜂𝑡 is the rental rate on capital at 𝑡.

As in the planned economy model
$$ F(\tilde k_t, \tilde n_t) = A \tilde k_t^{\alpha} \tilde n_t^{1-\alpha} $$

35.5.1 Zero Profit Conditions
Zero-profits conditions for capital and labor are
𝐹𝑘 (𝑘̃ 𝑡 , 𝑛̃ 𝑡 ) = 𝜂𝑡
and
𝐹𝑛 (𝑘̃ 𝑡 , 𝑛̃ 𝑡 ) = 𝑤𝑡 (35.1)
These conditions emerge from a no-arbitrage requirement.

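To describe this no-arbitrage reasoning, we begin by applying a theorem of Euler about linearly homogeneous functions.
The theorem applies to the Cobb-Douglas production function because it displays constant returns to scale:
$$ \alpha F(\tilde k_t, \tilde n_t) = F(\alpha \tilde k_t, \alpha \tilde n_t) $$
for $\alpha \in (0, 1)$.
Taking partial derivatives $\frac{\partial}{\partial \alpha}$ on both sides of the above equation gives
$$ F(\tilde k_t, \tilde n_t) = \frac{\partial F}{\partial \tilde k_t} \tilde k_t + \frac{\partial F}{\partial \tilde n_t} \tilde n_t $$
Rewrite the firm’s profits as
$$ \frac{\partial F}{\partial \tilde k_t} \tilde k_t + \frac{\partial F}{\partial \tilde n_t} \tilde n_t - w_t \tilde n_t - \eta_t \tilde k_t $$
or
$$ \left( \frac{\partial F}{\partial \tilde k_t} - \eta_t \right) \tilde k_t + \left( \frac{\partial F}{\partial \tilde n_t} - w_t \right) \tilde n_t $$
Because $F$ is homogeneous of degree 1, it follows that $\frac{\partial F}{\partial \tilde k_t}$ and $\frac{\partial F}{\partial \tilde n_t}$ are homogeneous of degree 0 and therefore fixed with respect to proportional changes in $\tilde k_t$ and $\tilde n_t$.
If $\frac{\partial F}{\partial \tilde k_t} > \eta_t$, then the firm makes positive profits on each additional unit of $\tilde k_t$, so it would want to make $\tilde k_t$ arbitrarily large.
But setting $\tilde k_t = +\infty$ is not physically feasible, so equilibrium prices must take values that present the firm with no such arbitrage opportunity.
A similar argument applies if $\frac{\partial F}{\partial \tilde n_t} > w_t$.
If $\frac{\partial F}{\partial \tilde k_t} < \eta_t$, the firm would want to set $\tilde k_t$ to zero, which is not feasible.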
It is convenient to define 𝑤⃗ = {𝑤0 , … , 𝑤𝑇 } and 𝜂 ⃗ = {𝜂0 , … , 𝜂𝑇 }.
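The zero-profit argument can be checked numerically. The following sketch assumes the Cobb-Douglas technology with A = 1 and α = 0.33 and an arbitrary illustrative input pair (k, n) = (3, 1); the function names are introduced here and are not part of the lecture's code. It confirms Euler's theorem, shows that profits are zero at any scale when factor prices equal marginal products, and shows that profits grow with scale when capital is rented below its marginal product.

```python
A, α = 1.0, 0.33

def F(k, n):
    "Cobb-Douglas production function."
    return A * k ** α * n ** (1 - α)

def F_k(k, n):
    "Marginal product of capital."
    return α * A * k ** (α - 1) * n ** (1 - α)

def F_n(k, n):
    "Marginal product of labor."
    return (1 - α) * A * k ** α * n ** (-α)

def profits(k, n, w, η):
    return F(k, n) - w * n - η * k

k, n = 3.0, 1.0
η, w = F_k(k, n), F_n(k, n)      # factor prices set equal to marginal products

# Euler's theorem: F = F_k * k + F_n * n
euler_gap = F(k, n) - (F_k(k, n) * k + F_n(k, n) * n)

profit_at_scale_1 = profits(k, n, w, η)
profit_at_scale_5 = profits(5 * k, 5 * n, w, η)    # scaling inputs keeps profits at zero

# If capital were rented below its marginal product, profits would grow with scale
cheap_η = 0.9 * η
profit_cheap_1 = profits(k, n, w, cheap_η)
profit_cheap_5 = profits(5 * k, 5 * n, w, cheap_η)

print(euler_gap, profit_at_scale_1, profit_at_scale_5, profit_cheap_1, profit_cheap_5)
```

With the cheaper rental rate, profits at five times the scale are five times as large, which is the unbounded incentive to expand that equilibrium prices must rule out.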
35.6 Household Problem
A representative household lives at 𝑡 = 0, 1, … , 𝑇 .

At 𝑡, the household rents 1 unit of labor and 𝑘𝑡 units of capital to a firm and receives income
𝑤𝑡 1 + 𝜂𝑡 𝑘𝑡
At 𝑡 the household allocates its income between the following two categories of purchases:
• consumption 𝑐𝑡
• net investment 𝑘𝑡+1 − (1 − 𝛿)𝑘𝑡
Here (𝑘𝑡+1 − (1 − 𝛿)𝑘𝑡 ) is the household’s net investment in physical capital and 𝛿 ∈ (0, 1) is again a depreciation rate
of capital.
In period 𝑡, the consumer is free to purchase more goods to be consumed and invested in physical capital than its income
from supplying capital and labor to the firm, provided that in some other periods its income exceeds its purchases.
A consumer’s net excess demand for time 𝑡 consumption goods is the gap
𝑒𝑡 ≡ (𝑐𝑡 + (𝑘𝑡+1 − (1 − 𝛿)𝑘𝑡 )) − (𝑤𝑡 1 + 𝜂𝑡 𝑘𝑡 )
Let 𝑐 ⃗ = {𝑐0 , … , 𝑐𝑇 } and let 𝑘⃗ = {𝑘1 , … , 𝑘𝑇 +1 }.

𝑘0 is given to the household.
The household faces a single budget constraint that requires that the present value of the household’s net excess demands be no greater than zero:
$$ \sum_{t=0}^T q_t^0 e_t \le 0 $$
or
$$ \sum_{t=0}^T q_t^0 \left( c_t + (k_{t+1} - (1-\delta) k_t) \right) \le \sum_{t=0}^T q_t^0 (w_t 1 + \eta_t k_t) $$
The household faces price system $\{q_t^0, w_t, \eta_t\}$ as a price-taker and chooses an allocation to solve the constrained optimization problem:
$$ \max_{\vec c, \vec k} \sum_{t=0}^T \beta^t u(c_t) $$
$$ \text{subject to } \sum_{t=0}^T q_t^0 \left( c_t + (k_{t+1} - (1-\delta) k_t) - (w_t 1 + \eta_t k_t) \right) \le 0 $$
Components of a price system have the following units:

• 𝑤𝑡 is measured in units of the time 𝑡 good per unit of time 𝑡 labor hired
• 𝜂𝑡 is measured in units of the time 𝑡 good per unit of time 𝑡 capital hired
• 𝑞𝑡0 is measured in units of a numeraire per unit of the time 𝑡 good
35.6.1 Definitions
• A price system is a sequence $\{q_t^0, \eta_t, w_t\}_{t=0}^T = \{\vec q, \vec \eta, \vec w\}$.
• An allocation is a sequence $\{c_t, k_{t+1}, n_t = 1\}_{t=0}^T = \{\vec c, \vec k, \vec n\}$.
• A competitive equilibrium is a price system and an allocation with the following properties:
– Given the price system, the allocation solves the household’s problem.
– Given the price system, the allocation solves the firm’s problem.
The vision here is that an equilibrium price system and allocation are determined once and for all.
In effect, we imagine that all trades occur just before time 0.

35.7 Computing a Competitive Equilibrium
We compute a competitive equilibrium by using a guess and verify approach.

• We guess equilibrium price sequences $\{\vec q, \vec \eta, \vec w\}$.
• We then verify that at those prices, the household and the firm choose the same allocation.
35.7.1 Guess for Price System
In this lecture Cass-Koopmans Planning Model, we computed an allocation {𝐶,⃗ 𝐾,⃗ 𝑁⃗ } that solves a planning problem.
We use that allocation to construct a guess for the equilibrium price system.
Note: This allocation will constitute the Big 𝐾 to be in the present instance of the Big 𝐾 , little 𝑘 trick that we’ll apply
to a competitive equilibrium in the spirit of this lecture and this lecture.
In particular, we shall use the following procedure:

• obtain first-order conditions for the representative firm and the representative consumer.
• from these equations, obtain a new set of equations by replacing the firm’s choice variables $\tilde k, \tilde n$ and the consumer’s choice variables with the quantities $\vec C, \vec K$ that solve the planning problem.
• solve the resulting equations for $\{\vec q, \vec \eta, \vec w\}$ as functions of $\vec C, \vec K$.
• verify that at these prices, 𝑐𝑡 = 𝐶𝑡 , 𝑘𝑡 = 𝑘̃ 𝑡 = 𝐾𝑡 , 𝑛̃ 𝑡 = 1 for 𝑡 = 0, 1, … , 𝑇 .
Thus, we guess that for $t = 0, \dots, T$:
$$ q_t^0 = \beta^t u'(C_t) \tag{35.2} $$
$$ w_t = f(K_t) - K_t f'(K_t) \tag{35.3} $$
$$ \eta_t = f'(K_t) \tag{35.4} $$
At these prices, let capital chosen by the household be
$$ k_t^*(\vec q, \vec w, \vec \eta), \quad t \ge 0 \tag{35.5} $$
and let the allocation chosen by the firm be
$$ \tilde k_t^*(\vec q, \vec w, \vec \eta), \quad t \ge 0 $$
and so on.
If our guess for the equilibrium price system is correct, then it must occur that
$$ k_t^* = \tilde k_t^* \tag{35.6} $$
$$ 1 = \tilde n_t^* \tag{35.7} $$
$$ c_t^* + k_{t+1}^* - (1-\delta) k_t^* = F(\tilde k_t^*, \tilde n_t^*) $$
We shall verify that for $t = 0, \dots, T$ allocations chosen by the household and the firm both equal the allocation that solves the planning problem:
$$ k_t^* = \tilde k_t^* = K_t, \quad \tilde n_t = 1, \quad c_t^* = C_t \tag{35.8} $$
35.7.2 Verification Procedure
Our approach is first to stare at first-order necessary conditions for optimization problems of the household and the firm.
At the price system we have guessed, we’ll then verify that both sets of first-order conditions are satisfied at the allocation
that solves the planning problem.
35.7.3 Household’s Lagrangian
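To solve the household’s problem, we formulate the Lagrangian
$$ \mathcal{L}(\vec c, \vec k, \lambda) = \sum_{t=0}^T \beta^t u(c_t) + \lambda \left( \sum_{t=0}^T q_t^0 \left( ((1-\delta) k_t + w_t) + \eta_t k_t - c_t - k_{t+1} \right) \right) $$
and attack the min-max problem:
$$ \min_{\lambda} \max_{\vec c, \vec k} \mathcal{L}(\vec c, \vec k, \lambda) $$
First-order conditions are
$$ c_t: \quad \beta^t u'(c_t) - \lambda q_t^0 = 0, \quad t = 0, 1, \dots, T \tag{35.9} $$
$$ k_t: \quad -\lambda q_t^0 [(1-\delta) + \eta_t] + \lambda q_{t-1}^0 = 0, \quad t = 1, 2, \dots, T+1 \tag{35.10} $$
$$ \lambda: \quad \left( \sum_{t=0}^T q_t^0 \left( c_t + (k_{t+1} - (1-\delta) k_t) - w_t - \eta_t k_t \right) \right) \le 0 \tag{35.11} $$
$$ k_{T+1}: \quad -\lambda q_{T+1}^0 \le 0, \quad \le 0 \text{ if } k_{T+1} = 0; \quad = 0 \text{ if } k_{T+1} > 0 \tag{35.12} $$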

Now we plug in our guesses of prices and do some algebra in the hope of recovering all first-order necessary conditions
(34.9)-(34.12) for the planning problem from this lecture Cass-Koopmans Planning Model.
Combining (35.9) and (35.2), we get:
𝑢′ (𝐶𝑡 ) = 𝜇𝑡
which is (34.9).
Combining (35.10), (35.2), and (35.4), we get:
$$ -\lambda \beta^t \mu_t [(1-\delta) + f'(K_t)] + \lambda \beta^{t-1} \mu_{t-1} = 0 \tag{35.13} $$
Rewriting (35.13) by dividing by $\lambda$ on both sides (which is nonzero since $u' > 0$) we get:
$$ \beta^t \mu_t [(1-\delta) + f'(K_t)] = \beta^{t-1} \mu_{t-1} $$
or
$$ \beta \mu_t [(1-\delta) + f'(K_t)] = \mu_{t-1} $$
which is (34.10).
Combining (35.11), (35.2), (35.3) and (35.4) after multiplying both sides of (35.11) by $\lambda$, we get
$$ \sum_{t=0}^T \beta^t \mu_t \left( C_t + (K_{t+1} - (1-\delta) K_t) - f(K_t) + K_t f'(K_t) - f'(K_t) K_t \right) \le 0 $$
which simplifies to
$$ \sum_{t=0}^T \beta^t \mu_t \left( C_t + K_{t+1} - (1-\delta) K_t - F(K_t, 1) \right) \le 0 $$
Since $\beta^t \mu_t > 0$ for $t = 0, \dots, T$, it follows that
$$ C_t + K_{t+1} - (1-\delta) K_t - F(K_t, 1) = 0 \quad \text{for all } t \text{ in } \{0, 1, \dots, T\} $$
which is (34.11).
Combining (35.12) and (35.2), we get:
$$ -\beta^{T+1} \mu_{T+1} \le 0 $$
Dividing both sides by $\beta^{T+1}$ gives
$$ -\mu_{T+1} \le 0 $$
which is (34.12) for the planning problem.

Thus, at our guess of the equilibrium price system, the allocation that solves the planning problem also solves the problem
faced by a representative household living in a competitive equilibrium.
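The algebra above can be mirrored numerically. The sketch below is an illustration under stated assumptions: it uses the lecture's default parameters, builds a short path from an arbitrary pair (K0, C0) = (3, 1) using only the feasibility constraint and the planner's Euler equation (it does not impose the terminal condition), and then checks that the guessed prices (35.2)-(35.4) satisfy (35.10) with λ = 1 and make each period's net excess demand zero.

```python
α, β, γ, δ, A = 0.33, 0.95, 2.0, 0.02, 1.0

f = lambda k: A * k ** α
f_prime = lambda k: α * A * k ** (α - 1)
u_prime = lambda c: c ** (-γ)

# Build a path obeying feasibility and the Euler equation (illustrative K0, C0)
T = 5
K, C = [3.0], [1.0]
for t in range(T):
    K.append(f(K[t]) + (1 - δ) * K[t] - C[t])
    C.append(C[t] * (β * (f_prime(K[t + 1]) + 1 - δ)) ** (1 / γ))
K.append(f(K[T]) + (1 - δ) * K[T] - C[T])      # K_{T+1}

# Guessed prices (35.2)-(35.4)
q = [β ** t * u_prime(C[t]) for t in range(T + 1)]
w = [f(K[t]) - K[t] * f_prime(K[t]) for t in range(T + 1)]
η = [f_prime(K[t]) for t in range(T + 1)]

# (35.10) for t = 1, ..., T with λ = 1: q_t [(1 - δ) + η_t] - q_{t-1} = 0
foc_k_gaps = [q[t] * ((1 - δ) + η[t]) - q[t - 1] for t in range(1, T + 1)]

# Net excess demand e_t evaluated at the planner's quantities
e = [C[t] + K[t + 1] - (1 - δ) * K[t] - (w[t] + η[t] * K[t]) for t in range(T + 1)]
budget_gap = sum(q[t] * e[t] for t in range(T + 1))

print(max(abs(g) for g in foc_k_gaps), budget_gap)
```

Because $w_t + \eta_t K_t = f(K_t)$ at the guessed prices, each $e_t$ vanishes, so the budget constraint holds with equality.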
35.7.4 Representative Firm’s Problem
We now turn to the problem faced by a firm in a competitive equilibrium:

If we plug (35.8) into (35.1) for all t, we get
$$ \frac{\partial F(K_t, 1)}{\partial K_t} = f'(K_t) = \eta_t $$
which is (35.4).
If we now plug (35.8) into (35.1) for all t, we get:
$$ \frac{\partial F(K_t, 1)}{\partial N_t} = f(K_t) - f'(K_t) K_t = w_t $$
which is exactly (35.3).
Thus, at our guess for the equilibrium price system, the allocation that solves the planning problem also solves the problem
faced by a firm within a competitive equilibrium.
By (35.6) and (35.7) this allocation is identical to the one that solves the consumer’s problem.
Note: Because budget sets are affected only by relative prices, {𝑞𝑡0 } is determined only up to multiplication by a positive
constant.
Normalization: We are free to choose a {𝑞𝑡0 } that makes 𝜆 = 1 so that we are measuring 𝑞𝑡0 in units of the marginal
utility of time 0 goods.
We will plot 𝑞, 𝑤, 𝜂 below to show these equilibrium prices induce the same aggregate movements that we saw earlier in
the planning problem.
To proceed, we bring in Python code that Cass-Koopmans Planning Model used to solve the planning problem.
First let’s define a jitclass that stores parameters and functions that characterize an economy.

planning_data = [
    ('γ', float64),    # Coefficient of relative risk aversion
    ('β', float64),    # Discount factor
    ('δ', float64),    # Depreciation rate on capital
    ('α', float64),    # Return to capital per capita
    ('A', float64)     # Technology
]

@jitclass(planning_data)
class PlanningProblem():

    def __init__(self, γ=2, β=0.95, δ=0.02, α=0.33, A=1):

        self.γ, self.β = γ, β
        self.δ, self.α, self.A = δ, α, A

    def u(self, c):
        '''
        Utility function
        ASIDE: If you have a utility function that is hard to solve by hand
        you can use automatic or symbolic differentiation
        See https://github.com/HIPS/autograd
        '''
        γ = self.γ

        return c ** (1 - γ) / (1 - γ) if γ != 1 else np.log(c)

    def u_prime(self, c):
        'Derivative of utility'
        γ = self.γ

        return c ** (-γ)

    def u_prime_inv(self, c):
        'Inverse of derivative of utility'
        γ = self.γ

        return c ** (-1 / γ)

    def f(self, k):
        'Production function'
        α, A = self.α, self.A

        return A * k ** α

    def f_prime(self, k):
        'Derivative of production function'
        α, A = self.α, self.A

        return α * A * k ** (α - 1)

    def f_prime_inv(self, k):
        'Inverse of derivative of production function'
        α, A = self.α, self.A

        return (k / (A * α)) ** (1 / (α - 1))

    def next_k_c(self, k, c):
        '''
        Given the current capital Kt and an arbitrary feasible
        consumption choice Ct, computes Kt+1 by state transition law
        and optimal Ct+1 by Euler equation.
        '''
        β, δ = self.β, self.δ
        u_prime, u_prime_inv = self.u_prime, self.u_prime_inv
        f, f_prime = self.f, self.f_prime

        k_next = f(k) + (1 - δ) * k - c
        c_next = u_prime_inv(u_prime(c) / (β * (f_prime(k_next) + (1 - δ))))

        return k_next, c_next

@njit
def shooting(pp, c0, k0, T=10):
    '''
    Given the initial condition of capital k0 and an initial guess
    of consumption c0, computes the whole paths of c and k
    using the state transition law and Euler equation for T periods.
    '''
    if c0 > pp.f(k0):
        print("initial consumption is not feasible")

        return None

    # initialize vectors of c and k
    c_vec = np.empty(T+1)
    k_vec = np.empty(T+2)

    c_vec[0] = c0
    k_vec[0] = k0

    for t in range(T):
        k_vec[t+1], c_vec[t+1] = pp.next_k_c(k_vec[t], c_vec[t])

    k_vec[T+1] = pp.f(k_vec[T]) + (1 - pp.δ) * k_vec[T] - c_vec[T]

    return c_vec, k_vec

@njit
def bisection(pp, c0, k0, T=10, tol=1e-4, max_iter=500, k_ter=0, verbose=True):

    # initial boundaries for guess c0
    c0_upper = pp.f(k0)
    c0_lower = 0

    i = 0
    while True:
        c_vec, k_vec = shooting(pp, c0, k0, T)
        error = k_vec[-1] - k_ter

        # check if the terminal condition is satisfied
        if np.abs(error) < tol:
            if verbose:
                print('Converged successfully on iteration ', i+1)
            return c_vec, k_vec

        i += 1
        if i == max_iter:
            if verbose:
                print('Convergence failed.')
            return c_vec, k_vec

        # if iteration continues, updates boundaries and guess of c0
        if error > 0:
            c0_lower = c0
        else:
            c0_upper = c0

        c0 = (c0_lower + c0_upper) / 2

pp = PlanningProblem()
# Steady states
ρ = 1 / pp.β - 1
k_ss = pp.f_prime_inv(ρ+pp.δ)
c_ss = pp.f(k_ss) - pp.δ * k_ss
The above code from this lecture Cass-Koopmans Planning Model lets us compute an optimal allocation for the planning
problem.
• from the preceding analysis, we know that it will also be an allocation associated with a competitive equilibrium.
Now we’re ready to bring in Python code that we require to compute additional objects that appear in a competitive
equilibrium.
@njit
def q(pp, c_path):
    # Here we choose numeraire to be u'(c_0) -- this is q^(t_0)_t
    T = len(c_path) - 1
    q_path = np.ones(T+1)
    q_path[0] = 1
    # fill in q_t = β^t u'(c_t) for t = 1, ..., T
    for t in range(1, T+1):
        q_path[t] = pp.β ** t * pp.u_prime(c_path[t])
    return q_path

@njit
def w(pp, k_path):
    w_path = pp.f(k_path) - k_path * pp.f_prime(k_path)
    return w_path

@njit
def η(pp, k_path):
    η_path = pp.f_prime(k_path)
    return η_path

Now we calculate and plot for each 𝑇

T_arr = [250, 150, 75, 50]

# six panels, one for each entry of titles
fig, axs = plt.subplots(2, 3, figsize=(13, 6))
titles = ['Arrow-Hicks Prices', 'Labor Rental Rate', 'Capital Rental Rate',
          'Consumption', 'Capital', 'Lagrange Multiplier']
ylabels = ['$q_t^0$', '$w_t$', r'$\eta_t$', '$c_t$', '$k_t$', r'$\mu_t$']

for T in T_arr:
    c_path, k_path = bisection(pp, 0.3, k_ss/3, T, verbose=False)
    μ_path = pp.u_prime(c_path)

    q_path = q(pp, c_path)
    w_path = w(pp, k_path)[:-1]
    η_path = η(pp, k_path)[:-1]
    paths = [q_path, w_path, η_path, c_path, k_path, μ_path]

    for i, ax in enumerate(axs.flatten()):
        ax.plot(paths[i])
        ax.set(title=titles[i], ylabel=ylabels[i], xlabel='t')
        if titles[i] == 'Capital':
            ax.axhline(k_ss, lw=1, ls='--', c='k')
        if titles[i] == 'Consumption':
            ax.axhline(c_ss, lw=1, ls='--', c='k')

plt.tight_layout()
plt.show()

Varying Curvature
Now we see how our results change if we keep 𝑇 constant but allow the curvature parameter 𝛾 to vary, starting with 𝐾0 below the steady state.
We plot the results for 𝑇 = 150.
T = 150
γ_arr = [1.1, 4, 6, 8]

# six panels, one for each entry of titles
fig, axs = plt.subplots(2, 3, figsize=(13, 6))

for γ in γ_arr:
    pp_γ = PlanningProblem(γ=γ)
    c_path, k_path = bisection(pp_γ, 0.3, k_ss/3, T, verbose=False)
    μ_path = pp_γ.u_prime(c_path)

    q_path = q(pp_γ, c_path)
    w_path = w(pp_γ, k_path)[:-1]
    η_path = η(pp_γ, k_path)[:-1]
    paths = [q_path, w_path, η_path, c_path, k_path, μ_path]

    for i, ax in enumerate(axs.flatten()):
        ax.plot(paths[i], label=rf'$\gamma = {γ}$')
        ax.set(title=titles[i], ylabel=ylabels[i], xlabel='t')
        if titles[i] == 'Capital':
            ax.axhline(k_ss, lw=1, ls='--', c='k')
        if titles[i] == 'Consumption':
            ax.axhline(c_ss, lw=1, ls='--', c='k')

axs[0, 0].legend()
plt.tight_layout()
plt.show()
Adjusting 𝛾 means adjusting how much individuals prefer to smooth consumption.

Higher 𝛾 means individuals prefer to smooth more resulting in slower convergence to a steady state allocation.
Lower 𝛾 means individuals prefer to smooth less, resulting in faster convergence to a steady state allocation.

35.8 Yield Curves and Hicks-Arrow Prices
We return to Hicks-Arrow prices and calculate how they are related to yields on loans of alternative maturities.
This will let us plot a yield curve that graphs yields on bonds of maturities 𝑗 = 1, 2, … against 𝑗 = 1, 2, ….
We use the following formulas.
A yield to maturity on a loan made at time $t_0$ that matures at time $t > t_0$ is
$$ r_{t_0, t} = -\frac{\log q_t^{t_0}}{t - t_0} $$
A Hicks-Arrow price system for a base-year $t_0 \le t$ satisfies
$$ q_t^{t_0} = \beta^{t-t_0} \frac{u'(c_t)}{u'(c_{t_0})} = \beta^{t-t_0} \frac{c_t^{-\gamma}}{c_{t_0}^{-\gamma}} $$
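Two special cases make these formulas concrete. The sketch below assumes CRRA utility with β = 0.95 and γ = 2 and uses two hypothetical consumption paths: a constant path, as in a steady state, and a path growing at 2 percent per period. The function names are introduced here for illustration only.

```python
import math

β, γ = 0.95, 2.0

def q_price(c_path, t0, t):
    "Hicks-Arrow price of the time t good in units of the time t0 good (CRRA utility)."
    return β ** (t - t0) * c_path[t] ** (-γ) / c_path[t0] ** (-γ)

def yield_to_maturity(c_path, t0, t):
    "Yield on a loan made at t0 that matures at t."
    return -math.log(q_price(c_path, t0, t)) / (t - t0)

flat = [1.0] * 11                            # constant consumption
growing = [1.02 ** t for t in range(11)]     # hypothetical 2% consumption growth

r_flat = [yield_to_maturity(flat, 0, t) for t in range(1, 11)]
r_growing = [yield_to_maturity(growing, 0, t) for t in range(1, 11)]

print(r_flat[0], r_growing[0])
```

With constant consumption the yield curve is flat at $-\log \beta$; with steady consumption growth at gross rate $g$ it is flat at $-\log \beta + \gamma \log g$, so yields vary with maturity only when consumption growth varies over time.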
We redefine our function for 𝑞 to allow arbitrary base years, and define a new function for 𝑟, then plot both.
We begin by continuing to assume that 𝑡0 = 0 and plot things for different maturities 𝑡 = 𝑇 , with 𝐾0 below the steady
state
@njit
def q_generic(pp, t0, c_path):
    # simplify notations
    β = pp.β
    u_prime = pp.u_prime

    T = len(c_path) - 1
    q_path = np.zeros(T+1-t0)
    q_path[0] = 1
    for t in range(t0+1, T+1):
        q_path[t-t0] = β ** (t-t0) * u_prime(c_path[t]) / u_prime(c_path[t0])
    return q_path

@njit
def r(pp, t0, q_path):
    '''Yield to maturity'''
    r_path = - np.log(q_path[1:]) / np.arange(1, len(q_path))
    return r_path

def plot_yield_curves(pp, t0, c0, k0, T_arr):

    # left panel: prices, right panel: yields
    fig, axs = plt.subplots(1, 2, figsize=(10, 5))

    for T in T_arr:
        c_path, k_path = bisection(pp, c0, k0, T, verbose=False)
        q_path = q_generic(pp, t0, c_path)
        r_path = r(pp, t0, q_path)

        axs[0].plot(range(t0, T+1), q_path)
        axs[0].set(xlabel='t', ylabel='$q_t^0$', title='Hicks-Arrow Prices')

        axs[1].plot(range(t0+1, T+1), r_path)
        axs[1].set(xlabel='t', ylabel='$r_t^0$', title='Yields')

T_arr = [150, 75, 50]

plot_yield_curves(pp, 0, 0.3, k_ss/3, T_arr)
Now we plot when 𝑡0 = 20
plot_yield_curves(pp, 20, 0.3, k_ss/3, T_arr)

CHAPTER
THIRTYSIX
CAKE EATING I: INTRODUCTION TO OPTIMAL SAVING
Contents
• Cake Eating I: Introduction to Optimal Saving

– Overview
– The Model
– The Value Function
– The Optimal Policy
– The Euler Equation
– Exercises
36.1 Overview
In this lecture we introduce a simple “cake eating” problem.

The intertemporal problem is: how much to enjoy today and how much to leave for the future?
Although the topic sounds trivial, this kind of trade-off between current and future utility is at the heart of many savings
and consumption problems.
Once we master the ideas in this simple environment, we will apply them to progressively more challenging—and useful—
problems.
The main tool we will use to solve the cake eating problem is dynamic programming.
Readers might find it helpful to review the following lectures before reading this one:
• The shortest paths lecture
• The basic McCall model
• The McCall model with separation
• The McCall model with separation and a continuous wage distribution
In what follows, we require the following imports:

import numpy as np
36.2 The Model
We consider an infinite time horizon 𝑡 = 0, 1, 2, 3, …

At 𝑡 = 0 the agent is given a complete cake with size 𝑥̄.
Let 𝑥𝑡 denote the size of the cake at the beginning of each period, so that, in particular, 𝑥0 = 𝑥̄.
We choose how much of the cake to eat in any given period 𝑡.
After choosing to consume 𝑐𝑡 of the cake in period 𝑡 there is
𝑥𝑡+1 = 𝑥𝑡 − 𝑐𝑡
left in period 𝑡 + 1.
Consuming quantity 𝑐 of the cake gives current utility 𝑢(𝑐).
We adopt the CRRA utility function
$$ u(c) = \frac{c^{1-\gamma}}{1-\gamma} \qquad (\gamma > 0, \, \gamma \neq 1) \tag{36.1} $$
In Python this is
def u(c, γ):
    return c**(1 - γ) / (1 - γ)

Future cake consumption utility is discounted according to 𝛽 ∈ (0, 1).

In particular, consumption of 𝑐 units 𝑡 periods hence has present value $\beta^t u(c)$.
The agent’s problem can be written as
$$ \max_{\{c_t\}} \sum_{t=0}^{\infty} \beta^t u(c_t) \tag{36.2} $$
subject to
𝑥𝑡+1 = 𝑥𝑡 − 𝑐𝑡 and 0 ≤ 𝑐𝑡 ≤ 𝑥 𝑡 (36.3)
for all 𝑡.
A consumption path {𝑐𝑡 } satisfying (36.3) where 𝑥0 = 𝑥̄ is called feasible.
In this problem, the following terminology is standard:
• 𝑥𝑡 is called the state variable
• 𝑐𝑡 is called the control variable or the action
• 𝛽 and 𝛾 are parameters
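To make the state and control variables concrete, here is a minimal sketch that simulates one feasible path. It assumes a hypothetical rule in which the agent eats a fixed fraction of the remaining cake each period; this rule is chosen only for illustration and is not claimed to be optimal. The infinite sum in (36.2) is truncated at T periods, and γ = 0.5 is an arbitrary choice.

```python
def u(c, γ):
    "CRRA utility (36.1)."
    return c ** (1 - γ) / (1 - γ)

def simulate(x_bar, rate, β, γ, T):
    """
    Eat a fixed fraction `rate` of the remaining cake for T periods.
    Returns the cake sizes, the consumption path, and the discounted
    utility of the truncated path.
    """
    x, x_path, c_path, total = x_bar, [], [], 0.0
    for t in range(T):
        c = rate * x                  # control: today's consumption
        x_path.append(x)              # state: cake at the start of period t
        c_path.append(c)
        total += β ** t * u(c, γ)
        x = x - c                     # law of motion x_{t+1} = x_t - c_t
    return x_path, c_path, total

x_path, c_path, value = simulate(x_bar=1.0, rate=0.1, β=0.95, γ=0.5, T=200)
print(x_path[:3], c_path[:3], value)
```

Every element of `c_path` lies between zero and the corresponding element of `x_path`, so the path satisfies (36.3).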
36.2.1 Trade-Off
The key trade-off in the cake-eating problem is this:

• Delaying consumption is costly because of the discount factor.
• But delaying some consumption is also attractive because 𝑢 is concave.
The concavity of 𝑢 implies that the consumer gains value from consumption smoothing, which means spreading consumption out over time.
This is because concavity implies diminishing marginal utility—a progressively smaller gain in utility for each additional
spoonful of cake consumed within one period.
36.2.2 Intuition
The reasoning given above suggests that the discount factor 𝛽 and the curvature parameter 𝛾 will play a key role in
determining the rate of consumption.
Here’s an educated guess as to what impact these parameters will have.
First, higher 𝛽 implies less discounting, and hence the agent is more patient, which should reduce the rate of consumption.
Second, higher 𝛾 implies that marginal utility $u'(c) = c^{-\gamma}$ falls faster with 𝑐.
This suggests more smoothing, and hence a lower rate of consumption.
In summary, we expect the rate of consumption to be decreasing in both parameters.
Let’s see if this is true.
36.3 The Value Function
The first step of our dynamic programming treatment is to obtain the Bellman equation.
The next step is to use it to calculate the solution.
36.3.1 The Bellman Equation
To this end, we let 𝑣(𝑥) be maximum lifetime utility attainable from the current time when 𝑥 units of cake are left.
That is,
$$
v(x) = \max \sum_{t=0}^{\infty} \beta^t u(c_t) \tag{36.4}
$$
where the maximization is over all paths {𝑐𝑡 } that are feasible from 𝑥0 = 𝑥.
At this point, we do not have an expression for 𝑣, but we can still make inferences about it.
For example, as was the case with the McCall model, the value function will satisfy a version of the Bellman equation.
In the present case, this equation states that 𝑣 satisfies
$$
v(x) = \max_{0 \leq c \leq x} \{u(c) + \beta v(x - c)\} \quad \text{for any given } x \geq 0. \tag{36.5}
$$
The intuition here is essentially the same as it was for the McCall model.
Choosing 𝑐 optimally means trading off current vs future rewards.

Current rewards from choice 𝑐 are just 𝑢(𝑐).

Future rewards given current cake size 𝑥, measured from next period and assuming optimal behavior, are 𝑣(𝑥 − 𝑐).
These are the two terms on the right hand side of (36.5), after suitable discounting.
If 𝑐 is chosen optimally using this trade off strategy, then we obtain maximal lifetime rewards from our current state 𝑥.
Hence, 𝑣(𝑥) equals the right hand side of (36.5), as claimed.
36.3.2 An Analytical Solution
It has been shown that, with 𝑢 as the CRRA utility function in (36.1), the function
$$
v^*(x_t) = \left( 1 - \beta^{1/\gamma} \right)^{-\gamma} u(x_t) \tag{36.6}
$$
solves the Bellman equation and hence is equal to the value function.
You are asked to confirm that this is true in the exercises below.
The solution (36.6) depends heavily on the CRRA utility function.
In fact, if we move away from CRRA utility, usually there is no analytical solution at all.
In other words, beyond CRRA utility, we know that the value function still satisfies the Bellman equation, but we do not
have a way of writing it explicitly, as a function of the state variable and the parameters.
We will deal with that situation numerically when the time comes.
Here is a Python representation of the value function:
def v_star(x, β, γ):
    return (1 - β**(1 / γ))**(-γ) * u(x, γ)
And here’s a figure showing the function for fixed parameters:
β, γ = 0.95, 1.2
x_grid = np.linspace(0.1, 5, 100)
fig, ax = plt.subplots()
ax.plot(x_grid, v_star(x_grid, β, γ), label='value function')
ax.set_xlabel('$x$', fontsize=12)
plt.show()
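As a quick sanity check on the claim that (36.6) solves the Bellman equation, the following sketch maximizes the right hand side of (36.5) numerically at a few cake sizes and compares the result with 𝑣∗(𝑥). The helper name `bellman_rhs`, the test points and the tolerance passed to SciPy are assumptions made for illustration; the check is numerical evidence, not a proof.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def u(c, γ):
    return c**(1 - γ) / (1 - γ)

def v_star(x, β, γ):
    return (1 - β**(1 / γ))**(-γ) * u(x, γ)

def bellman_rhs(x, β, γ):
    "Maximize u(c) + β v*(x - c) over c in (0, x) numerically."
    objective = lambda c: -(u(c, γ) + β * v_star(x - c, β, γ))
    result = minimize_scalar(objective,
                             bounds=(1e-10, x - 1e-10),
                             method='bounded',
                             options={'xatol': 1e-10})
    return -result.fun

β, γ = 0.95, 1.2

# Gap between the two sides of the Bellman equation at each test point
gaps = {x: abs(v_star(x, β, γ) - bellman_rhs(x, β, γ))
        for x in (0.5, 1.0, 2.5)}

for x, gap in gaps.items():
    print(f"x = {x}: |v*(x) - RHS| = {gap:.2e}")
```

The gaps should be close to zero, up to the accuracy of the numerical optimizer.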

36.4 The Optimal Policy
Now that we have the value function, it is straightforward to calculate the optimal action at each state.
We should choose consumption to maximize the right hand side of the Bellman equation (36.5).
$$
c^* = \arg\max_{c} \{u(c) + \beta v(x - c)\}
$$
We can think of this optimal choice as a function of the state 𝑥, in which case we call it the optimal policy.
We denote the optimal policy by 𝜎∗ , so that
$$
\sigma^*(x) := \arg\max_{c} \{u(c) + \beta v(x - c)\} \quad \text{for all } x
$$
If we plug the analytical expression (36.6) for the value function into the right hand side and compute the optimum, we
find that
$$
\sigma^*(x) = \left( 1 - \beta^{1/\gamma} \right) x \tag{36.7}
$$
Now let’s recall our intuition on the impact of parameters.

We guessed that the consumption rate would be decreasing in both parameters.
This is in fact the case, as can be seen from (36.7).
Here are some plots that illustrate this.
def c_star(x, β, γ):
    return (1 - β ** (1/γ)) * x
Continuing with the values for 𝛽 and 𝛾 used above, the plot is

fig, ax = plt.subplots()
ax.plot(x_grid, c_star(x_grid, β, γ), label='default parameters')
ax.plot(x_grid, c_star(x_grid, β + 0.02, γ), label=r'higher $\beta$')
ax.plot(x_grid, c_star(x_grid, β, γ + 0.2), label=r'higher $\gamma$')
ax.set_ylabel(r'$\sigma(x)$')
ax.set_xlabel('$x$')
ax.legend()
plt.show()
36.5 The Euler Equation
In the discussion above we have provided a complete solution to the cake eating problem in the case of CRRA utility.
There is in fact another way to solve for the optimal policy, based on the so-called Euler equation.
Although we already have a complete solution, now is a good time to study the Euler equation.
This is because, for more difficult problems, this equation provides key insights that are hard to obtain by other methods.
36.5.1 Statement and Implications
The Euler equation for the present problem can be stated as
$$
u'(c_t^*) = \beta u'(c_{t+1}^*) \tag{36.8}
$$
This is a necessary condition for the optimal path.

It says that, along the optimal path, marginal rewards are equalized across time, after appropriate discounting.
This makes sense: optimality is obtained by smoothing consumption up to the point where no marginal gains remain.
We can also state the Euler equation in terms of the policy function.
A feasible consumption policy is a map 𝑥 ↦ 𝜎(𝑥) satisfying 0 ≤ 𝜎(𝑥) ≤ 𝑥.

The last restriction says that we cannot consume more than the remaining quantity of cake.
A feasible consumption policy 𝜎 is said to satisfy the Euler equation if, for all 𝑥 > 0,
𝑢′ (𝜎(𝑥)) = 𝛽𝑢′ (𝜎(𝑥 − 𝜎(𝑥))) (36.9)
Evidently (36.9) is just the policy equivalent of (36.8).

It turns out that a feasible policy is optimal if and only if it satisfies the Euler equation.
In the exercises, you are asked to verify that the optimal policy (36.7) does indeed satisfy this functional equation.
Note: A functional equation is an equation where the unknown object is a function.
For a proof of sufficiency of the Euler equation in a very general setting, see proposition 2.2 of [Ma et al., 2020].
The following arguments focus on necessity, explaining why an optimal path or policy should satisfy the Euler equation.
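Before turning to the derivations, here is a small numerical illustration of what it means for a policy to satisfy the functional equation (36.9). The sketch evaluates both sides of (36.9) for the linear policy (36.7) on a grid; the grid, the parameter values and the name `euler_gap` are illustrative assumptions, and a numerical check of this kind does not replace the algebraic verification requested in the exercises.

```python
import numpy as np

def u_prime(c, γ):
    return c**(-γ)

def c_star(x, β, γ):
    return (1 - β**(1 / γ)) * x

β, γ = 0.95, 1.2
x = np.linspace(0.1, 5, 50)

c_today = c_star(x, β, γ)                  # σ(x)
c_tomorrow = c_star(x - c_today, β, γ)     # σ(x - σ(x))

lhs = u_prime(c_today, γ)                  # u'(σ(x))
rhs = β * u_prime(c_tomorrow, γ)           # β u'(σ(x - σ(x)))

# Largest relative gap between the two sides of (36.9) on the grid
euler_gap = np.max(np.abs(lhs - rhs) / lhs)
```

Replacing `c_star` with another feasible policy, such as consuming half of the cake each period, generally produces a gap that is far from zero.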
36.5.2 Derivation I: A Perturbation Approach
Let’s write 𝑐 as a shorthand for the consumption path $\{c_t\}_{t=0}^{\infty}$.
The overall cake-eating maximization problem can be written as

$$
\max_{c \in F} U(c) \quad \text{where} \quad U(c) := \sum_{t=0}^{\infty} \beta^t u(c_t)
$$
and 𝐹 is the set of feasible consumption paths.

We know that differentiable functions have a zero gradient at a maximizer.
So the optimal path $c^* := \{c_t^*\}_{t=0}^{\infty}$ must satisfy $U'(c^*) = 0$.
Note: If you want to know exactly how the derivative 𝑈 ′ (𝑐∗ ) is defined, given that the argument 𝑐∗ is a vector of infinite
length, you can start by learning about Gateaux derivatives. However, such knowledge is not assumed in what follows.
In other words, the rate of change in 𝑈 must be zero for any infinitesimally small (and feasible) perturbation away from
the optimal path.
So consider a feasible perturbation that reduces consumption at time 𝑡 to $c_t^* - h$ and increases it in the next period to $c_{t+1}^* + h$.
Consumption does not change in any other period.
We call this perturbed path $c^h$.
By the preceding argument about zero gradients, we have
$$
\lim_{h \to 0} \frac{U(c^h) - U(c^*)}{h} = U'(c^*) = 0
$$
Recalling that consumption only changes at 𝑡 and 𝑡 + 1, this becomes
$$
\lim_{h \to 0} \frac{\beta^t u(c_t^* - h) + \beta^{t+1} u(c_{t+1}^* + h) - \beta^t u(c_t^*) - \beta^{t+1} u(c_{t+1}^*)}{h} = 0
$$
After rearranging, the same expression can be written as
$$
\lim_{h \to 0} \frac{u(c_t^* - h) - u(c_t^*)}{h} + \beta \lim_{h \to 0} \frac{u(c_{t+1}^* + h) - u(c_{t+1}^*)}{h} = 0
$$

or, taking the limit,
$$
-u'(c_t^*) + \beta u'(c_{t+1}^*) = 0
$$
This is just the Euler equation.
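The perturbation argument can be illustrated numerically. The sketch below takes the optimal path implied by (36.7), shifts a small amount ℎ of consumption from period 𝑡 to period 𝑡 + 1, and computes the difference quotient for a truncated version of 𝑈. The truncation horizon `T`, the chosen period `t`, the step sizes and the helper names `U` and `perturb` are assumptions made for illustration.

```python
import numpy as np

def u(c, γ):
    return c**(1 - γ) / (1 - γ)

β, γ = 0.95, 1.2
θ = 1 - β**(1 / γ)                 # optimal consumption rate from (36.7)
T = 50                              # truncation horizon (illustrative)
x = (1 - θ)**np.arange(T)           # cake sizes along the optimal path, x_0 = 1
c_opt = θ * x                       # optimal consumption path

def U(c):
    "Truncated discounted sum of utility along the path c."
    return np.sum(β**np.arange(len(c)) * u(c, γ))

def perturb(c, t, h):
    "Shift h units of consumption from period t to period t + 1."
    c_h = c.copy()
    c_h[t] -= h
    c_h[t + 1] += h
    return c_h

t = 3
slopes = []
for h in (1e-3, 1e-4, 1e-5):
    slope = (U(perturb(c_opt, t, h)) - U(c_opt)) / h
    slopes.append(slope)
    print(f"h = {h:.0e}: difference quotient = {slope:.2e}")
```

The quotients are negative, since any deviation from the optimal path lowers lifetime utility, and they shrink toward zero roughly in proportion to ℎ because the first order term vanishes.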
36.5.3 Derivation II: Using the Bellman Equation
Another way to derive the Euler equation is to use the Bellman equation (36.5).
Taking the derivative on the right hand side of the Bellman equation with respect to 𝑐 and setting it to zero, we get
𝑢′ (𝑐) = 𝛽𝑣′ (𝑥 − 𝑐) (36.10)
To obtain 𝑣′ (𝑥 − 𝑐), we set 𝑔(𝑐, 𝑥) = 𝑢(𝑐) + 𝛽𝑣(𝑥 − 𝑐), so that, at the optimal choice of consumption,
𝑣(𝑥) = 𝑔(𝑐, 𝑥) (36.11)
Differentiating both sides while acknowledging that the maximizing consumption will depend on 𝑥, we get
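$$
v'(x) = \frac{\partial}{\partial c} g(c, x) \frac{\partial c}{\partial x} + \frac{\partial}{\partial x} g(c, x)
$$
When 𝑔(𝑐, 𝑥) is maximized at 𝑐, we have $\frac{\partial}{\partial c} g(c, x) = 0$.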
Hence the derivative simplifies to
$$
v'(x) = \frac{\partial g(c, x)}{\partial x} = \frac{\partial}{\partial x} \beta v(x - c) = \beta v'(x - c) \tag{36.12}
$$
(This argument is an example of the Envelope Theorem.)
But now an application of (36.10) gives
𝑢′ (𝑐) = 𝑣′ (𝑥) (36.13)
Thus, the derivative of the value function is equal to marginal utility.

Combining this fact with (36.12), and applying (36.13) at both the current state 𝑥 and the next state 𝑥 − 𝑐, recovers the Euler equation.
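For the CRRA case, where closed forms are available, the envelope result (36.13) can be checked numerically. The sketch below approximates 𝑣′(𝑥) by a central finite difference of (36.6) and compares it with marginal utility evaluated at the optimal policy (36.7). The step size `h`, the grid and the name `envelope_gap` are illustrative assumptions.

```python
import numpy as np

def u(c, γ):
    return c**(1 - γ) / (1 - γ)

def u_prime(c, γ):
    return c**(-γ)

def v_star(x, β, γ):
    return (1 - β**(1 / γ))**(-γ) * u(x, γ)

def c_star(x, β, γ):
    return (1 - β**(1 / γ)) * x

β, γ = 0.95, 1.2
x = np.linspace(0.5, 5, 10)
h = 1e-6                                    # finite difference step

# Central difference approximation of v'(x)
v_prime_fd = (v_star(x + h, β, γ) - v_star(x - h, β, γ)) / (2 * h)

# Marginal utility at the optimal choice, u'(σ*(x))
marginal_utility = u_prime(c_star(x, β, γ), γ)

envelope_gap = np.max(np.abs(v_prime_fd - marginal_utility) / marginal_utility)
```

The relative gap should be small, limited only by the accuracy of the finite difference.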
36.6 Exercises
Exercise 36.6.1
How does one obtain the expressions for the value function and optimal policy given in (36.6) and (36.7) respectively?
The first step is to make a guess of the functional form for the consumption policy.
So suppose that we do not know the solutions and start with a guess that the optimal policy is linear.
In other words, we conjecture that there exists a positive 𝜃 such that setting 𝑐𝑡∗ = 𝜃𝑥𝑡 for all 𝑡 produces an optimal path.
Starting from this conjecture, try to obtain the solutions (36.6) and (36.7).
In doing so, you will need to use the definition of the value function and the Bellman equation.


We start with the conjecture 𝑐𝑡∗ = 𝜃𝑥𝑡 , which leads to a path for the state variable (cake size) given by
𝑥𝑡+1 = 𝑥𝑡 (1 − 𝜃)
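Then $x_t = x_0 (1 - \theta)^t$ and hence

$$
\begin{aligned}
v(x_0) & = \sum_{t=0}^{\infty} \beta^t u(\theta x_t) \\
       & = \sum_{t=0}^{\infty} \beta^t u(\theta x_0 (1 - \theta)^t) \\
       & = \sum_{t=0}^{\infty} \theta^{1-\gamma} \beta^t (1 - \theta)^{t(1-\gamma)} u(x_0) \\
       & = \frac{\theta^{1-\gamma}}{1 - \beta (1 - \theta)^{1-\gamma}} u(x_0)
\end{aligned}
$$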
From the Bellman equation, then,
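$$
\begin{aligned}
v(x) & = \max_{0 \leq c \leq x} \left\{ u(c) + \beta \frac{\theta^{1-\gamma}}{1 - \beta (1 - \theta)^{1-\gamma}} \cdot u(x - c) \right\} \\
     & = \max_{0 \leq c \leq x} \left\{ \frac{c^{1-\gamma}}{1-\gamma} + \beta \frac{\theta^{1-\gamma}}{1 - \beta (1 - \theta)^{1-\gamma}} \cdot \frac{(x - c)^{1-\gamma}}{1-\gamma} \right\}
\end{aligned}
$$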
From the first order condition, we obtain
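$$
c^{-\gamma} + \beta \frac{\theta^{1-\gamma}}{1 - \beta (1 - \theta)^{1-\gamma}} \cdot (x - c)^{-\gamma} (-1) = 0
$$
or
$$
c^{-\gamma} = \beta \frac{\theta^{1-\gamma}}{1 - \beta (1 - \theta)^{1-\gamma}} \cdot (x - c)^{-\gamma}
$$
With 𝑐 = 𝜃𝑥 we get
$$
(\theta x)^{-\gamma} = \beta \frac{\theta^{1-\gamma}}{1 - \beta (1 - \theta)^{1-\gamma}} \cdot (x (1 - \theta))^{-\gamma}
$$
Some rearrangement produces
$$
\theta = 1 - \beta^{\frac{1}{\gamma}}
$$
This confirms our earlier expression for the optimal policy:
$$
c_t^* = \left( 1 - \beta^{\frac{1}{\gamma}} \right) x_t
$$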
Substituting 𝜃 into the value function above gives

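$$
v^*(x_t) = \frac{\left( 1 - \beta^{\frac{1}{\gamma}} \right)^{1-\gamma}}{1 - \beta \left( \beta^{\frac{1-\gamma}{\gamma}} \right)} u(x_t)
$$
Rearranging gives
$$
v^*(x_t) = \left( 1 - \beta^{\frac{1}{\gamma}} \right)^{-\gamma} u(x_t)
$$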
Our claims are now verified.
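The steps of this solution can also be checked numerically. The sketch below compares the closed-form value of a linear policy with a truncated version of the infinite sum, confirms that it coincides with (36.6) at 𝜃 = 1 − 𝛽^(1/𝛾), and searches a grid for the best 𝜃. The function names `v_theta` and `v_theta_sum`, the truncation length and the grid for 𝜃 are assumptions made for illustration.

```python
import numpy as np

def u(c, γ):
    return c**(1 - γ) / (1 - γ)

def v_theta(x0, θ, β, γ):
    "Closed-form lifetime value of the linear policy c_t = θ x_t."
    return θ**(1 - γ) / (1 - β * (1 - θ)**(1 - γ)) * u(x0, γ)

def v_theta_sum(x0, θ, β, γ, T=2_000):
    "The same value computed as a truncated discounted sum."
    t = np.arange(T)
    return np.sum(β**t * u(θ * x0 * (1 - θ)**t, γ))

β, γ, x0 = 0.95, 1.2, 1.0
θ_star = 1 - β**(1 / γ)

closed = v_theta(x0, θ_star, β, γ)
summed = v_theta_sum(x0, θ_star, β, γ)
v_analytical = (1 - β**(1 / γ))**(-γ) * u(x0, γ)    # equation (36.6)

# Search over linear policies for which the geometric series converges
θ_grid = np.linspace(0.01, 0.2, 400)
θ_best = θ_grid[np.argmax(v_theta(x0, θ_grid, β, γ))]
```

With these parameter values, `θ_best` should land on the grid point closest to `θ_star`.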


CHAPTER
THIRTYSEVEN
CAKE EATING II: NUMERICAL METHODS
Contents
• Cake Eating II: Numerical Methods

– Overview
– Reviewing the Model
– Value Function Iteration
– Time Iteration
– Exercises
37.1 Overview
In this lecture we continue the study of the cake eating problem.

The aim of this lecture is to solve the problem using numerical methods.
At first this might appear unnecessary, since we already obtained the optimal policy analytically.
However, the cake eating problem is too simple to be useful without modifications, and once we start modifying the
problem, numerical methods become essential.
Hence it makes sense to introduce numerical methods now, and test them on this simple problem.
Since we know the analytical solution, this will allow us to assess the accuracy of alternative numerical methods.

import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import minimize_scalar, bisect
37.2 Reviewing the Model
You might like to review the details before we start.

Recall in particular that the Bellman equation is
$$
v(x) = \max_{0 \leq c \leq x} \{u(c) + \beta v(x - c)\} \quad \text{for all } x \geq 0. \tag{37.1}
$$
where 𝑢 is the CRRA utility function.

The analytical solutions for the value function and optimal policy were found to be as follows.
def c_star(x, β, γ):
    return (1 - β ** (1/γ)) * x

def v_star(x, β, γ):
    return (1 - β**(1 / γ))**(-γ) * (x**(1-γ) / (1-γ))
Our first aim is to obtain these analytical solutions numerically.
37.3 Value Function Iteration
The first approach we will take is value function iteration.

This is a form of successive approximation, and was discussed in our lecture on job search.
The basic idea is:
1. Take an arbitrary initial guess of 𝑣.
2. Obtain an update 𝑤 defined by
$$
w(x) = \max_{0 \leq c \leq x} \{u(c) + \beta v(x - c)\}
$$
3. Stop if 𝑤 is approximately equal to 𝑣, otherwise set 𝑣 = 𝑤 and go back to step 2.

Let’s write this a bit more mathematically.
37.3.1 The Bellman Operator
We introduce the Bellman operator 𝑇 that takes a function 𝑣 as an argument and returns a new function 𝑇𝑣 defined by
$$
Tv(x) = \max_{0 \leq c \leq x} \{u(c) + \beta v(x - c)\}
$$
From 𝑣 we get 𝑇𝑣, and applying 𝑇 to this yields $T^2 v := T(Tv)$ and so on.

This is called iterating with the Bellman operator from initial guess 𝑣.
As we discuss in more detail in later lectures, one can use Banach’s contraction mapping theorem to prove that the sequence of functions $T^n v$ converges to the solution to the Bellman equation.
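To see the contraction property at work, here is a small sketch on a simplified version of the problem. It restricts next period's cake to lie on a finite grid and swaps in a square-root utility, which is bounded on the grid, so that the operator can be written with plain array operations. These simplifications, the grid size and the names `reward` and `ρ` are assumptions made for illustration and differ from the fitted scheme used in the rest of this lecture.

```python
import numpy as np

β = 0.96
n = 60
x_grid = np.linspace(0, 2.5, n)
u = np.sqrt                              # assumed utility, bounded on the grid

# Rewards u(x_i - x_j) for feasible next states x_j <= x_i, -inf otherwise
diff = x_grid[:, None] - x_grid[None, :]
reward = np.where(diff >= 0, u(np.maximum(diff, 0)), -np.inf)

def T(v):
    "Bellman operator when next period's cake is restricted to the grid."
    return np.max(reward + β * v[None, :], axis=1)

ρ = lambda g, h: np.max(np.abs(g - h))   # supremum distance on the grid

# One application of T shrinks distances by at least the factor β
rng = np.random.default_rng(0)
v, w = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
ratio = ρ(T(v), T(w)) / ρ(v, w)          # should not exceed β, up to rounding

# Successive approximation from the initial guess v = 0
v_k = np.zeros(n)
for _ in range(500):
    v_next = T(v_k)
    error = ρ(v_next, v_k)
    v_k = v_next
```

The distance between successive iterates falls geometrically, which is why the iteration settles down to a fixed point of 𝑇.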

37.3.2 Fitted Value Function Iteration
Both consumption 𝑐 and the state variable 𝑥 are continuous.

This causes complications when it comes to numerical work.
For example, we need to store each function $T^n v$ in order to compute the next iterate $T^{n+1} v$.
But this means we have to store $T^n v(x)$ at infinitely many 𝑥, which is, in general, impossible.
To circumvent this issue we will use fitted value function iteration, as discussed previously in one of the lectures on job
search.
The process looks like this:
1. Begin with an array of values $\{v_0, \ldots, v_I\}$ representing the values of some initial function 𝑣 on the grid points $\{x_0, \ldots, x_I\}$.
2. Build a function $\hat v$ on the state space $\mathbb{R}_+$ by linear interpolation, based on these data points.
3. Obtain and record the value $T \hat v(x_i)$ on each grid point $x_i$ by repeatedly solving the maximization problem in the Bellman equation.
4. Unless some stopping condition is satisfied, set $\{v_0, \ldots, v_I\} = \{T \hat v(x_0), \ldots, T \hat v(x_I)\}$ and go to step 2.
In step 2 we’ll use continuous piecewise linear interpolation.
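Here is a minimal sketch of what the interpolation step does, using `np.interp`. The five-point grid and the square-root function used to fill the array are illustrative assumptions, chosen only to make the numbers easy to follow.

```python
import numpy as np

x_grid = np.linspace(0.0, 2.0, 5)       # grid points 0, 0.5, 1, 1.5, 2
v_vals = np.sqrt(x_grid)                 # values of an illustrative function on the grid

# Piecewise linear interpolant built from the stored values
v_hat = lambda x: np.interp(x, x_grid, v_vals)

at_grid_point = v_hat(0.5)     # agrees with the stored value at a grid point
between_points = v_hat(0.75)   # linear blend of the values at 0.5 and 1.0
beyond_grid = v_hat(3.0)       # outside the grid, np.interp returns the boundary value
```

Only the array `v_vals` needs to be stored between iterations; the function `v_hat` can be rebuilt from it whenever it is needed.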
The maximize function below is a small helper function that converts a SciPy minimization routine into a maximization
routine.
def maximize(g, a, b, args):
    """
    Maximize the function g over the interval [a, b].

    We use the fact that the maximizer of g on any interval is
    also the minimizer of -g. The tuple args collects any extra
    arguments to g.

    Returns the maximizer and the maximal value.

    """

    objective = lambda x: -g(x, *args)
    result = minimize_scalar(objective, bounds=(a, b), method='bounded')
    maximizer, maximum = result.x, -result.fun
    return maximizer, maximum
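As a usage example, the sketch below applies the helper to a function whose maximizer is known. The helper is restated in compact form so that the snippet runs on its own; the test function `g` and its extra arguments `peak` and `height` are hypothetical and serve only to show how `args` is passed through.

```python
from scipy.optimize import minimize_scalar

def maximize(g, a, b, args):
    "Compact restatement of the helper: returns (maximizer, maximum)."
    objective = lambda x: -g(x, *args)
    result = minimize_scalar(objective, bounds=(a, b), method='bounded')
    return result.x, -result.fun

def g(x, peak, height):
    "A concave function with maximum value `height` attained at `peak`."
    return height - (x - peak)**2

# Extra arguments to g are collected in a tuple
maximizer, maximum = maximize(g, 0.0, 5.0, (2.0, 3.0))
```

The first returned value is the location of the maximum and the second is its value, which is the order relied on when indexing the result with `[0]` or `[1]`.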
We’ll store the parameters 𝛽 and 𝛾 in a class called CakeEating.

The same class will also provide a method called state_action_value that returns the value of a consumption
choice given a particular state and guess of 𝑣.
class CakeEating:

    def __init__(self,
                 β=0.96,           # discount factor
                 γ=1.5,            # degree of relative risk aversion
                 x_grid_min=1e-3,  # exclude zero for numerical stability
                 x_grid_max=2.5,   # size of cake
                 x_grid_size=120):

        self.β, self.γ = β, γ

        # Set up grid
        self.x_grid = np.linspace(x_grid_min, x_grid_max, x_grid_size)

    # Utility function
    def u(self, c):

        γ = self.γ

        if γ == 1:
            return np.log(c)
        else:
            return (c ** (1 - γ)) / (1 - γ)

    # first derivative of utility function
    def u_prime(self, c):

        return c ** (-self.γ)

    def state_action_value(self, c, x, v_array):
        """
        Right hand side of the Bellman equation given x and c.
        """

        u, β = self.u, self.β
        v = lambda x: np.interp(x, self.x_grid, v_array)

        return u(c) + β * v(x - c)
We now define the Bellman operator:
def T(v, ce):
    """
    The Bellman operator. Updates the guess of the value function.

    * ce is an instance of CakeEating
    * v is an array representing a guess of the value function

    """
    v_new = np.empty_like(v)

    for i, x in enumerate(ce.x_grid):
        # Maximize RHS of Bellman equation at state x
        v_new[i] = maximize(ce.state_action_value, 1e-10, x, (x, v))[1]

    return v_new
After defining the Bellman operator, we are ready to solve the model.
Let’s start by creating a CakeEating instance using the default parameterization.

ce = CakeEating()
Now let’s see the iteration of the value function in action.

We start from guess 𝑣 given by 𝑣(𝑥) = 𝑢(𝑥) for every 𝑥 grid point.
x_grid = ce.x_grid
v = ce.u(x_grid)       # Initial guess
n = 12                 # Number of iterations

fig, ax = plt.subplots()

ax.plot(x_grid, v, color=plt.cm.jet(0),
        lw=2, alpha=0.6, label='Initial guess')

for i in range(n):
    v = T(v, ce)  # Apply the Bellman operator
    ax.plot(x_grid, v, color=plt.cm.jet(i / n), lw=2, alpha=0.6)

ax.legend()
ax.set_ylabel('value', fontsize=12)
ax.set_xlabel('cake size $x$', fontsize=12)
ax.set_title('Value function iterations')

plt.show()
To do this more systematically, we introduce a wrapper function called compute_value_function that iterates until some convergence conditions are satisfied.
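def compute_value_function(ce,
                           tol=1e-4,
                           max_iter=1000,
                           verbose=True,
                           print_skip=25):

    # Set up loop
    v = np.zeros(len(ce.x_grid)) # Initial guess
    i = 0
    error = tol + 1

    while i < max_iter and error > tol:
        v_new = T(v, ce)

        error = np.max(np.abs(v - v_new))
        i += 1

        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")

        v = v_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return v_new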
Now let’s call it, noting that it takes a little while to run.
v = compute_value_function(ce)

Now we can plot and see what the converged value function looks like.
fig, ax = plt.subplots()

ax.plot(x_grid, v, label='Approximate value function')
ax.set_ylabel('$V(x)$', fontsize=12)
ax.set_title('Value function')
ax.legend()
plt.show()
Next let’s compare it to the analytical solution.

v_analytical = v_star(ce.x_grid, ce.β, ce.γ)
fig, ax = plt.subplots()

ax.plot(x_grid, v_analytical, label='analytical solution')
ax.plot(x_grid, v, label='numerical solution')
ax.set_ylabel('$V(x)$', fontsize=12)
ax.legend()
ax.set_title('Comparison between analytical and numerical value functions')
plt.show()
The quality of approximation is reasonably good for large 𝑥, but less so near the lower boundary.
The reason is that the utility function and hence value function is very steep near the lower boundary, and hence hard to
approximate.

37.3.4 Policy Function
Let’s see how this plays out in terms of computing the optimal policy.
In the first lecture on cake eating, the optimal consumption policy was shown to be
$$
\sigma^*(x) = \left( 1 - \beta^{1/\gamma} \right) x
$$
Let’s see if our numerical results lead to something similar.

Our numerical strategy will be to compute
$$
\sigma(x) = \arg\max_{0 \leq c \leq x} \{u(c) + \beta v(x - c)\}
$$
on a grid of 𝑥 points and then interpolate.

For 𝑣 we will use the approximation of the value function we obtained above.
Here’s the function:
def σ(ce, v):
    """
    The optimal policy function. Given the value function,
    it finds optimal consumption in each state.

    * ce is an instance of CakeEating
    * v is a value function array

    """
    c = np.empty_like(v)

    for i in range(len(ce.x_grid)):
        x = ce.x_grid[i]
        # Maximize RHS of Bellman equation at state x
        c[i] = maximize(ce.state_action_value, 1e-10, x, (x, v))[0]

    return c
Now let’s pass the approximate value function and compute optimal consumption:
c = σ(ce, v)
Let’s plot this next to the true analytical solution:
c_analytical = c_star(ce.x_grid, ce.β, ce.γ)

fig, ax = plt.subplots()

ax.plot(ce.x_grid, c_analytical, label='analytical')

ax.plot(ce.x_grid, c, label='numerical')
ax.set_ylabel(r'$\sigma(x)$')
ax.legend()
plt.show()

The fit is reasonable but not perfect.

We can improve it by increasing the grid size or reducing the error tolerance in the value function iteration routine.
However, both changes will lead to a longer compute time.
Another possibility is to use an alternative algorithm, which offers the possibility of faster compute time and, at the same
time, more accuracy.
We explore this next.
37.4 Time Iteration
Now let’s look at a different strategy to compute the optimal policy.

Recall that the optimal policy satisfies the Euler equation
𝑢′ (𝜎(𝑥)) = 𝛽𝑢′ (𝜎(𝑥 − 𝜎(𝑥))) for all 𝑥 > 0 (37.2)
Computationally, we can start with any initial guess $\sigma_0$ and now choose 𝑐 to solve
$$
u'(c) = \beta u'(\sigma_0(x - c))
$$
Choosing 𝑐 to satisfy this equation at all 𝑥 > 0 produces a function of 𝑥.

Call this new function 𝜎1 , treat it as the new guess and repeat.
This is called time iteration.
As with value function iteration, we can view the update step as the action of an operator, this time denoted by 𝐾.

• In particular, 𝐾𝜎 is the policy updated from 𝜎 using the procedure just described.
• We will use this terminology in the exercises below.
The main advantage of time iteration relative to value function iteration is that it operates in policy space rather than value
function space.
This is helpful because the policy function has less curvature, and hence is easier to approximate.
In the exercises you are asked to implement time iteration and compare it to value function iteration.
You should find that the method is faster and more accurate.
This is due to
1. the curvature issue mentioned just above and
2. the fact that we are using more information — in this case, the first order conditions.
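To fix ideas, here is a sketch of a single update at one state. It starts from the guess that the agent eats the whole cake, solves the Euler equation for 𝑐 with a root finder, and compares the result with the closed form that happens to be available for this particular guess. The use of `brentq` (rather than any specific routine), the helper name `update_at` and the parameter values are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import brentq

β, γ = 0.96, 1.5                    # illustrative parameter values

def u_prime(c):
    return c**(-γ)

σ_0 = lambda x: x                   # initial guess: eat the whole cake

def update_at(x, σ):
    "Solve u'(c) = β u'(σ(x - c)) for c in (0, x)."
    euler_diff = lambda c: u_prime(c) - β * u_prime(σ(x - c))
    return brentq(euler_diff, 1e-10, x - 1e-10)

x = 1.0
c_numeric = update_at(x, σ_0)

# With σ_0(x) = x the equation reduces to c**(-γ) = β (x - c)**(-γ),
# which can be solved by hand
c_exact = x / (1 + β**(1 / γ))
```

Repeating this calculation at every grid point produces the next policy guess, and iterating on that step is what the operator 𝐾 represents.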
37.5 Exercises
Exercise 37.5.1
Try the following modification of the problem.
Instead of the cake size changing according to 𝑥𝑡+1 = 𝑥𝑡 − 𝑐𝑡 , let it change according to
$$
x_{t+1} = (x_t - c_t)^{\alpha}
$$
where 𝛼 is a parameter satisfying 0 < 𝛼 < 1.

(We will see this kind of update rule when we study optimal growth models.)
Make the required changes to the value function iteration code and plot the value and policy functions.
Try to reuse as much code as possible.

We need to create a class to hold our primitives and return the right hand side of the Bellman equation.
We will use inheritance to maximize code reuse.
class OptimalGrowth(CakeEating):
    """
    A subclass of CakeEating that adds the parameter α and overrides
    the state_action_value method.
    """

    def __init__(self,
                 β=0.96,           # discount factor
                 γ=1.5,            # degree of relative risk aversion
                 α=0.4,            # productivity parameter
                 x_grid_min=1e-3,  # exclude zero for numerical stability
                 x_grid_max=2.5,   # size of cake
                 x_grid_size=120):

        self.α = α
        CakeEating.__init__(self, β, γ, x_grid_min, x_grid_max, x_grid_size)

    def state_action_value(self, c, x, v_array):
        """
        Right hand side of the Bellman equation given x and c.
        """

        u, β, α = self.u, self.β, self.α
        v = lambda x: np.interp(x, self.x_grid, v_array)

        return u(c) + β * v((x - c)**α)

og = OptimalGrowth()
Here’s the computed value function.
v = compute_value_function(og, verbose=False)
fig, ax = plt.subplots()

ax.plot(x_grid, v, lw=2, alpha=0.6)
ax.set_ylabel('value', fontsize=12)
ax.set_xlabel('state $x$', fontsize=12)
plt.show()
Here’s the computed policy, combined with the solution we derived above for the standard cake eating case 𝛼 = 1.

c_new = σ(og, v)
fig, ax = plt.subplots()

ax.plot(ce.x_grid, c_analytical, label=r'$\alpha=1$ solution')
ax.plot(ce.x_grid, c_new, label=fr'$\alpha={og.α}$ solution')

ax.set_ylabel('consumption', fontsize=12)
ax.legend()

plt.show()
Consumption is higher when 𝛼 < 1 because, at least for large 𝑥, the return to savings is lower.
Exercise 37.5.2
Implement time iteration, returning to the original case (i.e., dropping the modification in the exercise above).

Here’s one way to implement time iteration.

def K(σ_array, ce):
    """
    The policy function operator. Given the policy function,
    it updates the optimal consumption using Euler equation.

    * σ_array is an array of policy function values on the grid

    """

    u_prime, β, x_grid = ce.u_prime, ce.β, ce.x_grid
    σ_new = np.empty_like(σ_array)

    σ = lambda x: np.interp(x, x_grid, σ_array)

    def euler_diff(c, x):
        return u_prime(c) - β * u_prime(σ(x - c))

    for i, x in enumerate(x_grid):

        # handle small x separately --- helps numerical stability
        if x < 1e-12:
            σ_new[i] = 0.0

        # handle other x
        else:
            σ_new[i] = bisect(euler_diff, 1e-10, x - 1e-10, x)

    return σ_new
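def iterate_euler_equation(ce,
                           max_iter=500,
                           tol=1e-5,
                           verbose=True,
                           print_skip=25):

    x_grid = ce.x_grid

    σ = np.copy(x_grid)        # initial guess

    i = 0
    error = tol + 1
    while i < max_iter and error > tol:

        σ_new = K(σ, ce)

        error = np.max(np.abs(σ_new - σ))
        i += 1

        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")

        σ = σ_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return σ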
ce = CakeEating(x_grid_min=0.0)
c_euler = iterate_euler_equation(ce)
Error at iteration 125 is 6.417740905302616e-05.
fig, ax = plt.subplots()

ax.plot(ce.x_grid, c_analytical, label='analytical solution')
ax.plot(ce.x_grid, c_euler, label='time iteration solution')

ax.set_ylabel('consumption')
ax.legend()

plt.show()


CHAPTER
THIRTYEIGHT
OPTIMAL GROWTH I: THE STOCHASTIC OPTIMAL GROWTH MODEL
Contents
• Optimal Growth I: The Stochastic Optimal Growth Model

– Overview
– The Model
– Computation
– Exercises
38.1 Overview
In this lecture, we’re going to study a simple optimal growth model with one agent.
The model is a version of the standard one sector infinite horizon growth model studied in
• [Stokey et al., 1989], chapter 2
• [Ljungqvist and Sargent, 2018], section 3.1
• EDTC, chapter 1
• [Sundaram, 1996], chapter 12
It is an extension of the simple cake eating problem we looked at earlier.
The extension involves
• nonlinear returns to saving, through a production function, and
• stochastic returns, due to shocks to production.
Despite these additions, the model is still relatively simple.
We regard it as a stepping stone to more sophisticated models.
We solve the model using dynamic programming and a range of numerical techniques.
In this first lecture on optimal growth, the solution method will be value function iteration (VFI).
While the code in this first lecture runs slowly, we will use a variety of techniques to drastically improve execution time
over the next few lectures.

import numpy as np
from scipy.interpolate import interp1d
from scipy.optimize import minimize_scalar
38.2 The Model
Consider an agent who owns an amount 𝑦𝑡 ∈ ℝ+ ∶= [0, ∞) of a consumption good at time 𝑡.

This output can either be consumed or invested.
When the good is invested, it is transformed one-for-one into capital.
The resulting capital stock, denoted here by 𝑘𝑡+1 , will then be used for production.
Production is stochastic, in that it also depends on a shock 𝜉𝑡+1 realized at the end of the current period.
Next period output is
𝑦𝑡+1 ∶= 𝑓(𝑘𝑡+1 )𝜉𝑡+1
where 𝑓 ∶ ℝ+ → ℝ+ is called the production function.

The resource constraint is
𝑘𝑡+1 + 𝑐𝑡 ≤ 𝑦𝑡 (38.1)
and all variables are required to be nonnegative.
38.2.1 Assumptions and Comments
In what follows,
• The sequence {𝜉𝑡 } is assumed to be IID.
• The common distribution of each 𝜉𝑡 will be denoted by 𝜙.
• The production function 𝑓 is assumed to be increasing and continuous.
• Depreciation of capital is not made explicit but can be incorporated into the production function.
While many other treatments of the stochastic growth model use 𝑘𝑡 as the state variable, we will use 𝑦𝑡 .
This will allow us to treat a stochastic model while maintaining only one state variable.
We consider alternative states and timing specifications in some of our other lectures.
38.2.2 Optimization
Taking 𝑦0 as given, the agent wishes to maximize

$$
\mathbb{E} \left[ \sum_{t=0}^{\infty} \beta^t u(c_t) \right] \tag{38.2}
$$
subject to
$$
y_{t+1} = f(y_t - c_t) \xi_{t+1} \quad \text{and} \quad 0 \leq c_t \leq y_t \quad \text{for all } t \tag{38.3}
$$
where
• 𝑢 is a bounded, continuous and strictly increasing utility function and
• 𝛽 ∈ (0, 1) is a discount factor.
In (38.3) we are assuming that the resource constraint (38.1) holds with equality — which is reasonable because 𝑢 is
strictly increasing and no output will be wasted at the optimum.
In summary, the agent’s aim is to select a path 𝑐0 , 𝑐1 , 𝑐2 , … for consumption that is
1. nonnegative,
2. feasible in the sense of (38.1),
3. optimal, in the sense that it maximizes (38.2) relative to all other feasible consumption sequences, and
4. adapted, in the sense that the action 𝑐𝑡 depends only on observable outcomes, not on future outcomes such as 𝜉𝑡+1 .
In the present context
• 𝑦𝑡 is called the state variable — it summarizes the “state of the world” at the start of each period.
• 𝑐𝑡 is called the control variable — a value chosen by the agent each period after observing the state.
38.2.3 The Policy Function Approach
One way to think about solving this problem is to look for the best policy function.
A policy function is a map from past and present observables into current action.
We’ll be particularly interested in Markov policies, which are maps from the current state 𝑦𝑡 into a current action 𝑐𝑡 .
For dynamic programming problems such as this one (in fact for any Markov decision process), the optimal policy is
always a Markov policy.
In other words, the current state 𝑦𝑡 provides a sufficient statistic for the history in terms of making an optimal decision
today.
This is quite intuitive, but if you wish you can find proofs in texts such as [Stokey et al., 1989] (section 4.1).
Hereafter we focus on finding the best Markov policy.
In our context, a Markov policy is a function 𝜎 ∶ ℝ+ → ℝ+ , with the understanding that states are mapped to actions via
𝑐𝑡 = 𝜎(𝑦𝑡 ) for all 𝑡
In what follows, we will call 𝜎 a feasible consumption policy if it satisfies
0 ≤ 𝜎(𝑦) ≤ 𝑦 for all 𝑦 ∈ ℝ+ (38.4)
In other words, a feasible consumption policy is a Markov policy that respects the resource constraint.

The set of all feasible consumption policies will be denoted by Σ.

Each 𝜎 ∈ Σ determines a continuous state Markov process {𝑦𝑡 } for output via
𝑦𝑡+1 = 𝑓(𝑦𝑡 − 𝜎(𝑦𝑡 ))𝜉𝑡+1 , 𝑦0 given (38.5)
This is the time path for output when we choose and stick with the policy 𝜎.
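Here is a minimal sketch of how one draw of the process (38.5) can be simulated. The production function 𝑓(𝑘) = 𝑘^𝛼, the lognormal shock distribution, the policy of consuming half of output and the helper name `simulate_output` are all assumptions made for illustration; the model itself leaves 𝑓, 𝜙 and 𝜎 general.

```python
import numpy as np

def simulate_output(y0, σ, f, shocks):
    "Generate y_0, ..., y_T from y_{t+1} = f(y_t - σ(y_t)) ξ_{t+1}."
    y = np.empty(len(shocks) + 1)
    y[0] = y0
    for t, ξ in enumerate(shocks):
        k_next = y[t] - σ(y[t])      # investment becomes next period's capital
        y[t + 1] = f(k_next) * ξ     # output after the shock is realized
    return y

α = 0.4                               # assumed curvature of production
f = lambda k: k**α                    # assumed production function
σ = lambda y: 0.5 * y                 # assumed feasible policy: consume half of output

rng = np.random.default_rng(1234)
shocks = np.exp(0.1 * rng.standard_normal(100))   # assumed IID lognormal shocks

y = simulate_output(1.0, σ, f, shocks)

# With the shock switched off, output settles at a deterministic steady state
y_det = simulate_output(1.0, σ, f, np.ones(200))
```

Different policies generate different output processes, and hence different streams of consumption and utility.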
We insert this process into the objective function to get
$$
\mathbb{E} \left[ \sum_{t=0}^{\infty} \beta^t u(c_t) \right] = \mathbb{E} \left[ \sum_{t=0}^{\infty} \beta^t u(\sigma(y_t)) \right] \tag{38.6}
$$
This is the total expected present value of following policy 𝜎 forever, given initial income 𝑦0 .
The aim is to select a policy that makes this number as large as possible.
The next section covers these ideas more formally.
38.2.4 Optimality
The policy value function 𝑣𝜎 associated with a given policy 𝜎 is the mapping defined by

$$
v_\sigma(y) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \beta^t u(\sigma(y_t)) \right] \tag{38.7}
$$
when {𝑦𝑡 } is given by (38.5) with 𝑦0 = 𝑦.

In other words, it is the lifetime value of following policy 𝜎 starting at initial condition 𝑦.
The value function is then defined as
$$
v^*(y) := \sup_{\sigma \in \Sigma} v_\sigma(y) \tag{38.8}
$$
The value function gives the maximal value that can be obtained from state 𝑦, after considering all feasible policies.
A policy 𝜎 ∈ Σ is called optimal if it attains the supremum in (38.8) for all 𝑦 ∈ ℝ+ .
38.2.5 The Bellman Equation
With our assumptions on utility and production functions, the value function as defined in (38.8) also satisfies a Bellman
equation.
For this problem, the Bellman equation takes the form
$$
v(y) = \max_{0 \leq c \leq y} \left\{ u(c) + \beta \int v(f(y - c) z) \phi(dz) \right\} \qquad (y \in \mathbb{R}_+) \tag{38.9}
$$
This is a functional equation in 𝑣.

The term ∫ 𝑣(𝑓(𝑦 − 𝑐)𝑧)𝜙(𝑑𝑧) can be understood as the expected next period value when
• 𝑣 is used to measure value
• the state is 𝑦
• consumption is set to 𝑐
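In practice the integral is usually approximated. The sketch below does this by Monte Carlo, averaging 𝑣 over draws of the shock. The lognormal choice for 𝜙, the production function 𝑓(𝑘) = 𝑘^𝛼, the use of the logarithm as a stand-in for 𝑣 and the name `expected_next_value` are assumptions made for illustration; the logarithm is chosen only because it yields a closed form to compare against.

```python
import numpy as np

α, μ, s = 0.4, 0.0, 0.1               # assumed production and shock parameters
f = lambda k: k**α                    # assumed production function
v = np.log                            # assumed stand-in for the function v

rng = np.random.default_rng(0)
# Draws from an assumed lognormal shock distribution ϕ
z_draws = np.exp(μ + s * rng.standard_normal(100_000))

def expected_next_value(y, c):
    "Monte Carlo estimate of the integral of v(f(y - c) z) with respect to ϕ."
    return np.mean(v(f(y - c) * z_draws))

y, c = 2.0, 0.5
estimate = expected_next_value(y, c)

# Closed form, available here only because v = log and the shock is lognormal
exact = α * np.log(y - c) + μ
```

Because the same draws are reused for every (𝑦, 𝑐) pair, the estimate is a smooth function of 𝑐, which helps when it is passed to a numerical optimizer.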
As shown in EDTC, theorem 10.1.11, and a range of other texts, the value function 𝑣∗ satisfies the Bellman equation.

In other words, (38.9) holds when 𝑣 = 𝑣∗ .
The intuition is that maximal value from a given state can be obtained by optimally trading off
• current reward from a given action, vs
• expected discounted future value of the state resulting from that action
The Bellman equation is important because it gives us more information about the value function.
It also suggests a way of computing the value function, which we discuss below.
38.2.6 Greedy Policies
The primary importance of the value function is that we can use it to compute optimal policies.
The details are as follows.
Given a continuous function 𝑣 on ℝ+ , we say that 𝜎 ∈ Σ is 𝑣-greedy if 𝜎(𝑦) is a solution to
$$
\max_{0 \leq c \leq y} \left\{ u(c) + \beta \int v(f(y - c) z) \phi(dz) \right\} \tag{38.10}
$$
for every 𝑦 ∈ ℝ+ .
In other words, 𝜎 ∈ Σ is 𝑣-greedy if it optimally trades off current and future rewards when 𝑣 is taken to be the value
function.
In our setting, we have the following key result:
• A feasible consumption policy is optimal if and only if it is 𝑣∗ -greedy.
The intuition is similar to the intuition for the Bellman equation, which was provided after (38.9).
See, for example, theorem 10.1.11 of EDTC.
Hence, once we have a good approximation to 𝑣∗ , we can compute the (approximately) optimal policy by computing the
corresponding greedy policy.
The advantage is that we are now solving a much lower dimensional optimization problem.
How, then, should we compute the value function?

One way is to use the so-called Bellman operator.
(An operator is a map that sends functions into functions.)
The Bellman operator is denoted by 𝑇 and defined by
$$
Tv(y) := \max_{0 \leq c \leq y} \left\{ u(c) + \beta \int v(f(y - c) z) \phi(dz) \right\} \qquad (y \in \mathbb{R}_+) \tag{38.11}
$$
In other words, 𝑇 sends the function 𝑣 into the new function 𝑇 𝑣 defined by (38.11).
By construction, the set of solutions to the Bellman equation (38.9) exactly coincides with the set of fixed points of 𝑇 .
For example, if 𝑇 𝑣 = 𝑣, then, for any 𝑦 ≥ 0,
$$
v(y) = Tv(y) = \max_{0 \leq c \leq y} \left\{ u(c) + \beta \int v(f(y - c) z) \phi(dz) \right\}
$$

which says precisely that 𝑣 is a solution to the Bellman equation.

It follows that 𝑣∗ is a fixed point of 𝑇 .
38.2.8 Review of Theoretical Results
One can also show that 𝑇 is a contraction mapping on the set of continuous bounded functions on ℝ+ under the supremum
distance
$$
\rho(g, h) = \sup_{y \geq 0} |g(y) - h(y)|
$$
See EDTC, lemma 10.1.18.

Hence, it has exactly one fixed point in this set, which we know is equal to the value function.
It follows that
• The value function 𝑣∗ is bounded and continuous.
• Starting from any bounded and continuous 𝑣, the sequence $v, Tv, T^2 v, \ldots$ generated by iteratively applying 𝑇 converges uniformly to 𝑣∗.
This iterative method is called value function iteration.
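The contraction property and the resulting convergence can be checked numerically on a toy problem. The sketch below uses a small finite-state problem with randomly generated rewards and transition probabilities as a stand-in for the growth model; the arrays `r` and `P` and the helper `sup_dist` are hypothetical names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, β = 5, 3, 0.96

r = rng.uniform(size=(n_states, n_actions))            # rewards r(i, a)
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                      # transition probabilities

def T(v):
    "Bellman operator of the toy problem: max over actions of r + β P v."
    return np.max(r + β * (P @ v), axis=1)

def sup_dist(g, h):
    "Supremum distance between two functions on the finite state space."
    return np.max(np.abs(g - h))

# Contraction: applying T shrinks the sup distance by at least the factor β
v1, v2 = rng.normal(size=n_states), rng.normal(size=n_states)
print(sup_dist(T(v1), T(v2)), β * sup_dist(v1, v2))

# Value function iteration from an arbitrary starting point
v = np.zeros(n_states)
for _ in range(500):
    v = T(v)
print(sup_dist(T(v), v))    # near zero once v is close to the fixed point
```

The first printed number should not exceed the second, and the last should be very small, since after 500 steps the iterate is within roughly β^500 of the fixed point.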
We also know that a feasible policy is optimal if and only if it is 𝑣∗ -greedy.
It’s not too hard to show that a 𝑣∗ -greedy policy exists (see EDTC, theorem 10.1.11 if you get stuck).
Hence, at least one optimal policy exists.
Our problem now is how to compute it.
38.2.9 Unbounded Utility
The results stated above assume that the utility function is bounded.
In practice economists often work with unbounded utility functions — and so will we.
In the unbounded setting, various optimality theories exist.
Unfortunately, they tend to be case-specific, as opposed to valid for a large range of applications.
Nevertheless, their main conclusions are usually in line with those stated for the bounded case just above (as long as we
drop the word “bounded”).
Consult, for example, section 12.2 of EDTC, [Kamihigashi, 2012] or [Martins-da-Rocha and Vailakis, 2010].
38.3 Computation
Let’s now look at computing the value function and the optimal policy.
Our implementation in this lecture will focus on clarity and flexibility.
Both of these things are helpful, but they do cost us some speed — as you will see when you run the code.
Later we will sacrifice some of this clarity and flexibility in order to accelerate our code with just-in-time (JIT) compilation.
The algorithm we will use is fitted value function iteration, which was described in earlier lectures on the McCall model and
cake eating.
The algorithm will be
1. Begin with an array of values $\{v_1, \ldots, v_I\}$ representing the values of some initial function $v$ on the grid points $\{y_1, \ldots, y_I\}$.
2. Build a function $\hat v$ on the state space $\mathbb{R}_+$ by linear interpolation, based on these data points.
3. Obtain and record the value $T \hat v(y_i)$ on each grid point $y_i$ by repeatedly solving (38.11).
4. Unless some stopping condition is satisfied, set $\{v_1, \ldots, v_I\} = \{T \hat v(y_1), \ldots, T \hat v(y_I)\}$ and go to step 2.
38.3.1 Scalar Maximization
To maximize the right hand side of the Bellman equation (38.9), we are going to use the minimize_scalar routine
from SciPy.
Since we are maximizing rather than minimizing, we will use the fact that the maximizer of 𝑔 on the interval [𝑎, 𝑏] is the
minimizer of −𝑔 on the same interval.
To this end, and to keep the interface tidy, we will wrap minimize_scalar in an outer function as follows:
from scipy.optimize import minimize_scalar

def maximize(g, a, b, args):
    """
    Maximize the function g over the interval [a, b].

    We use the fact that the maximizer of g on any interval is
    also the minimizer of -g.  The tuple args collects any extra
    arguments to g.

    Returns the maximizer and the maximal value.
    """
    objective = lambda x: -g(x, *args)
    result = minimize_scalar(objective, bounds=(a, b), method='bounded')
    maximizer, maximum = result.x, -result.fun
    return maximizer, maximum
38.3.2 Optimal Growth Model
We will assume for now that 𝜙 is the distribution of 𝜉 ∶= exp(𝜇 + 𝑠𝜁) where
• 𝜁 is standard normal,
• 𝜇 is a shock location parameter and
• 𝑠 is a shock scale parameter.
We will store this and other primitives of the optimal growth model in a class.
The class, defined below, combines both parameters and a method that realizes the right hand side of the Bellman equation
(38.9).
class OptimalGrowthModel:

    def __init__(self,
                 u,            # utility function
                 f,            # production function
                 β=0.96,       # discount factor
                 μ=0,          # shock location parameter
                 s=0.1,        # shock scale parameter
                 grid_max=4,
                 grid_size=120,
                 shock_size=250,
                 seed=1234):

        self.u, self.f, self.β, self.μ, self.s = u, f, β, μ, s

        # Set up grid
        self.grid = np.linspace(1e-4, grid_max, grid_size)

        # Store shocks (with a seed, so results are reproducible)
        np.random.seed(seed)
        self.shocks = np.exp(μ + s * np.random.randn(shock_size))

    def state_action_value(self, c, y, v_array):
        """
        Right hand side of the Bellman equation.
        """
        u, f, β, shocks = self.u, self.f, self.β, self.shocks
        v = interp1d(self.grid, v_array)
        return u(c) + β * np.mean(v(f(y - c) * shocks))
In the second last line we are using linear interpolation.
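As a small illustration of the interpolation step, the following sketch builds a piecewise linear function from values on a grid. It uses `np.interp` rather than the `interp1d` call in the class above, and the grid and the initial values are assumptions chosen for the example.

```python
import numpy as np

grid = np.linspace(1e-4, 4, 120)          # grid points y_1, ..., y_I
v_vals = 5 * np.log(grid)                 # values v_1, ..., v_I of an initial guess

def v_hat(x):
    "Piecewise linear interpolant of the points (grid[i], v_vals[i])."
    return np.interp(x, grid, v_vals)

mid = 0.5 * (grid[10] + grid[11])
print(v_hat(grid[10]), v_vals[10])                       # agrees on grid points
print(v_hat(mid), 0.5 * (v_vals[10] + v_vals[11]))       # linear in between
```

Note that `np.interp` returns the boundary values for points outside the grid instead of extrapolating, which is one design choice among several.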

In the last line, the expectation in (38.11) is computed via Monte Carlo, using the approximation
$$
\int v(f(y - c) z) \phi(dz) \approx \frac{1}{n} \sum_{i=1}^n v(f(y - c) \xi_i)
$$
where $\{\xi_i\}_{i=1}^n$ are IID draws from $\phi$.

Monte Carlo is not always the most efficient way to compute integrals numerically but it does have some theoretical
advantages in the present setting.
(For example, it preserves the contraction mapping property of the Bellman operator — see, e.g., [Pál and Stachurski,
2013].)
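To get a feel for the accuracy of this kind of approximation, the sketch below estimates the mean of the lognormal shock by Monte Carlo and compares it with the known value exp(μ + s²/2). The sample size, the seed and the helper name `mc_mean` are assumptions made for the illustration.

```python
import numpy as np

μ, s = 0.0, 0.1
rng = np.random.default_rng(1234)

def mc_mean(h, n):
    "Monte Carlo estimate of E[h(ξ)] with ξ = exp(μ + s ζ), ζ standard normal."
    ξ = np.exp(μ + s * rng.standard_normal(n))
    return np.mean(h(ξ))

exact = np.exp(μ + s**2 / 2)      # known mean of the lognormal distribution
estimate = mc_mean(lambda z: z, 100_000)
print(estimate, exact)
```

The error of such an estimate shrinks at rate 1/√n, so the 250 draws used in the class above give a noticeably rougher approximation than the 100,000 used here.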
The next function implements the Bellman operator.

(We could have added it as a method to the OptimalGrowthModel class, but we prefer small classes rather than
monolithic ones for this kind of numerical work.)
def T(v, og):
    """
    The Bellman operator.  Updates the guess of the value function
    and also computes a v-greedy policy.

      * og is an instance of OptimalGrowthModel
    """
    v_new = np.empty_like(v)
    v_greedy = np.empty_like(v)

    for i in range(len(og.grid)):
        y = og.grid[i]
        # Maximize RHS of Bellman equation at state y
        c_star, v_max = maximize(og.state_action_value, 1e-10, y, (y, v))
        v_new[i] = v_max
        v_greedy[i] = c_star

    return v_greedy, v_new
38.3.4 An Example
Let’s suppose now that
$$
f(k) = k^{\alpha} \quad \text{and} \quad u(c) = \ln c
$$
For this particular problem, an exact analytical solution is available (see [Ljungqvist and Sargent, 2018], section 3.1.2),
with
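$$
v^*(y) = \frac{\ln(1 - \alpha\beta)}{1 - \beta} + \frac{\mu + \alpha \ln(\alpha\beta)}{1 - \alpha} \left[ \frac{1}{1 - \beta} - \frac{1}{1 - \alpha\beta} \right] + \frac{1}{1 - \alpha\beta} \ln y \tag{38.12}
$$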
and optimal consumption policy
𝜎∗ (𝑦) = (1 − 𝛼𝛽)𝑦
It is valuable to have these closed-form solutions because it lets us check whether our code works for this particular case.
In Python, the functions above can be expressed as:
def v_star(y, α, β, μ):
    """
    True value function
    """
    c1 = np.log(1 - α * β) / (1 - β)
    c2 = (μ + α * np.log(α * β)) / (1 - α)
    c3 = 1 / (1 - β)
    c4 = 1 / (1 - α * β)
    return c1 + c2 * (c3 - c4) + c4 * np.log(y)

def σ_star(y, α, β):
    """
    True optimal policy
    """
    return (1 - α * β) * y
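As a sanity check on these closed forms, the sketch below verifies numerically that 𝑣∗ satisfies the Bellman equation when consumption is (1 − αβ)y. It relies on the fact that E[ln ξ] = μ for the lognormal shock, so the expectation can be computed exactly rather than by Monte Carlo. The parameter values and the helper name `bellman_rhs` are assumptions made for this example, and `v_star` is redefined with fixed parameters so that the block is self-contained.

```python
import numpy as np

α, β, μ = 0.4, 0.96, 0.1     # assumed parameter values

def v_star(y):
    "Closed-form value function (38.12) at the parameters above."
    c1 = np.log(1 - α * β) / (1 - β)
    c2 = (μ + α * np.log(α * β)) / (1 - α)
    c3 = 1 / (1 - β)
    c4 = 1 / (1 - α * β)
    return c1 + c2 * (c3 - c4) + c4 * np.log(y)

def bellman_rhs(c, y):
    "u(c) + β E[v*(f(y - c) ξ)], using E[ln ξ] = μ to integrate exactly."
    k = y - c
    expected_v = v_star(k**α) + μ / (1 - α * β)
    return np.log(c) + β * expected_v

y = np.linspace(0.5, 4, 8)
c_opt = (1 - α * β) * y
print(np.max(np.abs(bellman_rhs(c_opt, y) - v_star(y))))    # Bellman equation residual
print(np.all(bellman_rhs(0.9 * c_opt, y) < v_star(y)))      # a lower c does worse
print(np.all(bellman_rhs(1.1 * c_opt, y) < v_star(y)))      # a higher c does worse
```

The residual should be zero up to floating point error, and perturbing consumption in either direction should lower the right hand side.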
Next let’s create an instance of the model with the above primitives and assign it to the variable og.

α = 0.4

def fcd(k):
    return k**α

og = OptimalGrowthModel(u=np.log, f=fcd)
Now let’s see what happens when we apply our Bellman operator to the exact solution 𝑣∗ in this case.
In theory, since 𝑣∗ is a fixed point, the resulting function should again be 𝑣∗ .
In practice, we expect some small numerical error.
grid = og.grid
v_init = v_star(grid, α, og.β, og.μ)    # Start at the solution
v_greedy, v = T(v_init, og)             # Apply T once

fig, ax = plt.subplots()
ax.set_ylim(-35, -24)
ax.plot(grid, v, lw=2, alpha=0.6, label='$Tv^*$')
ax.plot(grid, v_init, lw=2, alpha=0.6, label='$v^*$')
ax.legend()
plt.show()
The two functions are essentially indistinguishable, so we are off to a good start.
Now let’s have a look at iterating with the Bellman operator, starting from an arbitrary initial condition.
The initial condition we’ll start with is, somewhat arbitrarily, 𝑣(𝑦) = 5 ln(𝑦).
v = 5 * np.log(grid)  # An initial condition
n = 35

fig, ax = plt.subplots()
ax.plot(grid, v, color=plt.cm.jet(0),
        lw=2, alpha=0.6, label='Initial condition')

for i in range(n):
    v_greedy, v = T(v, og)  # Apply the Bellman operator
    ax.plot(grid, v, color=plt.cm.jet(i / n), lw=2, alpha=0.6)

ax.plot(grid, v_star(grid, α, og.β, og.μ), 'k-', lw=2,
        alpha=0.8, label='True value function')
ax.legend()
ax.set(ylim=(-40, 10), xlim=(np.min(grid), np.max(grid)))
plt.show()
The figure shows

1. the first 36 functions generated by the fitted value function iteration algorithm, with hotter colors given to higher
iterates
2. the true value function 𝑣∗ drawn in black
The sequence of iterates converges towards 𝑣∗ .
We are clearly getting closer.
38.3.5 Iterating to Convergence
We can write a function that iterates until the difference is below a particular tolerance level.
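def solve_model(og,
                tol=1e-4,
                max_iter=1000,
                verbose=True,
                print_skip=25):
    """
    Solve model by iterating with the Bellman operator.
    """

    # Set up loop
    v = og.u(og.grid)  # Initial condition
    i = 0
    error = tol + 1

    while i < max_iter and error > tol:
        v_greedy, v_new = T(v, og)
        error = np.max(np.abs(v - v_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        v = v_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return v_greedy, v_new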
Let’s use this function to compute an approximate solution at the defaults.
v_greedy, v_solution = solve_model(og)
Now we check our result by plotting it against the true value:
fig, ax = plt.subplots()

ax.plot(grid, v_solution, lw=2, alpha=0.6,
        label='Approximate value function')
ax.plot(grid, v_star(grid, α, og.β, og.μ), lw=2,
        alpha=0.6, label='True value function')
ax.legend()
ax.set_ylim(-35, -24)
plt.show()
The figure shows that we are pretty much on the money.
38.3.6 The Policy Function
The policy v_greedy computed above corresponds to an approximate optimal policy.

The next figure compares it to the exact solution, which, as mentioned above, is 𝜎(𝑦) = (1 − 𝛼𝛽)𝑦
fig, ax = plt.subplots()

ax.plot(grid, v_greedy, lw=2,
        alpha=0.6, label='approximate policy function')
ax.plot(grid, σ_star(grid, α, og.β), '--',
        lw=2, alpha=0.6, label='true policy function')
ax.legend()
plt.show()

The figure shows that we’ve done a good job in this instance of approximating the true policy.
38.4 Exercises
Exercise 38.4.1
A common choice for utility function in this kind of work is the CRRA specification
$$
u(c) = \frac{c^{1 - \gamma}}{1 - \gamma}
$$
Maintaining the other defaults, including the Cobb-Douglas production function, solve the optimal growth model with
this utility specification.
Setting 𝛾 = 1.5, compute and plot an estimate of the optimal policy.
Time how long this function takes to run, so you can compare it to faster code developed in the next lecture.

Here we set up the model.
γ = 1.5   # Preference parameter

def u_crra(c):
    return (c**(1 - γ) - 1) / (1 - γ)

og = OptimalGrowthModel(u=u_crra, f=fcd)
Now let’s run it, with a timer.
%%time
v_greedy, v_solution = solve_model(og)
CPU times: user 36.6 s, sys: 40 ms, total: 36.6 s
Wall time: 36.5 s
Let’s plot the policy function just to see what it looks like:
fig, ax = plt.subplots()

ax.plot(grid, v_greedy, lw=2,
        alpha=0.6, label='Approximate optimal policy')
ax.legend()
plt.show()

Exercise 38.4.2
Time how long it takes to iterate with the Bellman operator 20 times, starting from initial condition 𝑣(𝑦) = 𝑢(𝑦).
Use the model specification in the previous exercise.
(As before, we will compare this number with that for the faster code developed in the next lecture.)

Let’s set up:
og = OptimalGrowthModel(u=u_crra, f=fcd)
v = og.u(og.grid)
Here’s the timing:
%%time
for i in range(20):
    v_greedy, v_new = T(v, og)
    v = v_new
CPU times: user 3.12 s, sys: 0 ns, total: 3.12 s
Wall time: 3.11 s
CHAPTER
THIRTYNINE
OPTIMAL GROWTH II: ACCELERATING THE CODE WITH NUMBA
Contents
• Optimal Growth II: Accelerating the Code with Numba

– Overview
– The Model
– Computation
– Exercises
39.1 Overview
Previously, we studied a stochastic optimal growth model with one representative agent.
We solved the model using dynamic programming.
In writing our code, we focused on clarity and flexibility.
These are important, but there’s often a trade-off between flexibility and speed.
The reason is that, when code is less flexible, we can exploit structure more easily.
(This is true about algorithms and mathematical problems more generally: more specific problems have more structure,
which, with some thought, can be exploited for better results.)
So, in this lecture, we are going to accept less flexibility while gaining speed, using just-in-time (JIT) compilation to
accelerate our code.

import matplotlib.pyplot as plt
import numpy as np
from numba import jit, njit
from numba.experimental import jitclass
from quantecon.optimize.scalar_maximization import brent_max
The function brent_max is designed for embedding in JIT-compiled code.
It is an alternative to similar functions in SciPy (which, unfortunately, are not JIT-aware).
39.2 The Model
The model is the same as discussed in our previous lecture on optimal growth.
We will start with log utility:
𝑢(𝑐) = ln(𝑐)
We continue to assume that

• $f(k) = k^{\alpha}$
• 𝜙 is the distribution of 𝜉 ∶= exp(𝜇 + 𝑠𝜁) when 𝜁 is standard normal
We will once again use value function iteration to solve the model.
In particular, the algorithm is unchanged, and the only difference is in the implementation itself.
As before, we will be able to compare with the true solutions

def v_star(y, α, β, μ):
    """
    True value function
    """
    c1 = np.log(1 - α * β) / (1 - β)
    c2 = (μ + α * np.log(α * β)) / (1 - α)
    c3 = 1 / (1 - β)
    c4 = 1 / (1 - α * β)
    return c1 + c2 * (c3 - c4) + c4 * np.log(y)

def σ_star(y, α, β):
    """
    True optimal policy
    """
    return (1 - α * β) * y
39.3 Computation
We will again store the primitives of the optimal growth model in a class.
But now we are going to use Numba’s @jitclass decorator to target our class for JIT compilation.
Because we are going to use Numba to compile our class, we need to specify the data types.
You will see this as a list called opt_growth_data above our class.
Unlike in the previous lecture, we hardwire the production and utility specifications into the class.
This is where we sacrifice flexibility in order to gain more speed.
from numba import float64
from numba.experimental import jitclass

opt_growth_data = [
    ('α', float64),          # Production parameter
    ('β', float64),          # Discount factor
    ('μ', float64),          # Shock location parameter
    ('s', float64),          # Shock scale parameter
    ('grid', float64[:]),    # Grid (array)
    ('shocks', float64[:])   # Shock draws (array)
]

@jitclass(opt_growth_data)
class OptimalGrowthModel:

    def __init__(self,
                 α=0.4,
                 β=0.96,
                 μ=0,
                 s=0.1,
                 grid_max=4,
                 grid_size=120,
                 shock_size=250,
                 seed=1234):

        self.α, self.β, self.μ, self.s = α, β, μ, s

        # Set up grid (the lower bound 1e-5 is assumed here)
        self.grid = np.linspace(1e-5, grid_max, grid_size)

        # Store shocks (with a seed, so results are reproducible)
        np.random.seed(seed)
        self.shocks = np.exp(μ + s * np.random.randn(shock_size))

    def f(self, k):
        "The production function"
        return k**self.α

    def u(self, c):
        "The utility function"
        return np.log(c)

    def f_prime(self, k):
        "Derivative of f"
        return self.α * (k**(self.α - 1))

    def u_prime(self, c):
        "Derivative of u"
        return 1/c

    def u_prime_inv(self, c):
        "Inverse of u'"
        return 1/c
The class includes some methods such as u_prime that we do not need now but will use in later lectures.

We will use JIT compilation to accelerate the Bellman operator.

First, here’s a function that returns the value of a particular consumption choice c, given state y, as per the Bellman
equation (38.9).
@njit
def state_action_value(c, y, v_array, og):
    """
    Right hand side of the Bellman equation.

     * c is consumption
     * y is income
     * og is an instance of OptimalGrowthModel
     * v_array represents a guess of the value function on the grid
    """
    u, f, β, shocks = og.u, og.f, og.β, og.shocks
    v = lambda x: np.interp(x, og.grid, v_array)
    return u(c) + β * np.mean(v(f(y - c) * shocks))
Now we can implement the Bellman operator, which maximizes the right hand side of the Bellman equation:
@jit(nopython=True)
def T(v, og):
    """
    The Bellman operator.
    """
    v_new = np.empty_like(v)
    v_greedy = np.empty_like(v)

    for i in range(len(og.grid)):
        y = og.grid[i]
        # Maximize RHS of Bellman equation at state y
        result = brent_max(state_action_value, 1e-10, y, args=(y, v, og))
        v_greedy[i], v_new[i] = result[0], result[1]

    return v_greedy, v_new
We use the solve_model function to perform iteration until convergence.
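def solve_model(og,
                tol=1e-4,
                max_iter=1000,
                verbose=True,
                print_skip=25):
    """
    Solve model by iterating with the Bellman operator.
    """

    # Set up loop
    v = og.u(og.grid)  # Initial condition
    i = 0
    error = tol + 1

    while i < max_iter and error > tol:
        v_greedy, v_new = T(v, og)
        error = np.max(np.abs(v - v_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        v = v_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return v_greedy, v_new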
Let’s compute the approximate solution at the default parameters.

First we create an instance:
og = OptimalGrowthModel()
Now we call solve_model, using the %%time magic to check how long it takes.
%%time
v_greedy, v_solution = solve_model(og)
CPU times: user 5.81 s, sys: 419 ms, total: 6.23 s
Wall time: 6.23 s
You will notice that this is much faster than our original implementation.
Here is a plot of the resulting policy, compared with the true policy:
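fig, ax = plt.subplots()

ax.plot(og.grid, v_greedy, lw=2,
        alpha=0.6, label='approximate policy function')
ax.plot(og.grid, σ_star(og.grid, og.α, og.β), 'k--',
        lw=2, alpha=0.6, label='true policy function')
ax.legend()
plt.show()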
Again, the fit is excellent — this is as expected since we have not changed the algorithm.
The maximal absolute deviation between the two policies is
np.max(np.abs(v_greedy - σ_star(og.grid, og.α, og.β)))
0.0010480495434626036
39.4 Exercises
Exercise 39.4.1
Time how long it takes to iterate with the Bellman operator 20 times, starting from initial condition 𝑣(𝑦) = 𝑢(𝑦).
Use the default parameterization.

Let’s set up the initial condition.
v = og.u(og.grid)
Here’s the timing:
%%time
for i in range(20):
    v_greedy, v_new = T(v, og)
    v = v_new
CPU times: user 199 ms, sys: 0 ns, total: 199 ms
Wall time: 199 ms
Compared with our timing for the non-compiled version of value function iteration, the JIT-compiled code is usually an
order of magnitude faster.
Exercise 39.4.2
Modify the optimal growth model to use the CRRA utility specification.
$$
u(c) = \frac{c^{1 - \gamma}}{1 - \gamma}
$$
Set γ = 1.5 as the default value and maintain the other specifications.
(Note that jitclass currently does not support inheritance, so you will have to copy the class and change the relevant
parameters and methods.)
Compute an estimate of the optimal policy, plot it and compare visually with the same plot from the analogous exercise
in the first optimal growth lecture.
Compare execution time as well.

Here’s our CRRA version of OptimalGrowthModel:


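opt_growth_data = [
    ('α', float64),          # Production parameter
    ('β', float64),          # Discount factor
    ('μ', float64),          # Shock location parameter
    ('γ', float64),          # Preference parameter
    ('s', float64),          # Shock scale parameter
    ('grid', float64[:]),    # Grid (array)
    ('shocks', float64[:])   # Shock draws (array)
]

@jitclass(opt_growth_data)
class OptimalGrowthModel_CRRA:

    def __init__(self,
                 α=0.4,
                 β=0.96,
                 μ=0,
                 s=0.1,
                 γ=1.5,
                 grid_max=4,
                 grid_size=120,
                 shock_size=250,
                 seed=1234):

        self.α, self.β, self.γ, self.μ, self.s = α, β, γ, μ, s

        # Set up grid (the lower bound 1e-5 is assumed here)
        self.grid = np.linspace(1e-5, grid_max, grid_size)

        # Store shocks (with a seed, so results are reproducible)
        np.random.seed(seed)
        self.shocks = np.exp(μ + s * np.random.randn(shock_size))

    def f(self, k):
        "The production function."
        return k**self.α

    def u(self, c):
        "The utility function."
        return c**(1 - self.γ) / (1 - self.γ)

    def f_prime(self, k):
        "Derivative of f."
        return self.α * (k**(self.α - 1))

    def u_prime(self, c):
        "Derivative of u."
        return c**(-self.γ)

    def u_prime_inv(self, c):
        "Inverse of u'."
        return c**(-1 / self.γ)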
Let’s create an instance:
og_crra = OptimalGrowthModel_CRRA()
Now we call solve_model, using the %%time magic to check how long it takes.
%%time
v_greedy, v_solution = solve_model(og_crra)

Wall time: 4.73 s
Here is a plot of the resulting policy:
fig, ax = plt.subplots()

ax.plot(og_crra.grid, v_greedy, lw=2,
        alpha=0.6, label='Approximate optimal policy')
ax.legend()
plt.show()

This matches the solution that we obtained in our non-jitted code, in the exercises.
Execution time is an order of magnitude faster.
Exercise 39.4.3
In this exercise we return to the original log utility specification.
Once an optimal consumption policy 𝜎 is given, income follows
$$
y_{t+1} = f(y_t - \sigma(y_t)) \xi_{t+1}
$$
The next figure shows a simulation of 100 elements of this sequence for three different discount factors (and hence three
different policies).
In each sequence, the initial condition is 𝑦0 = 0.1.
The discount factors are discount_factors = (0.8, 0.9, 0.98).
We have also dialed down the shocks a bit with s = 0.05.
Otherwise, the parameters and primitives are the same as the log-linear model discussed earlier in the lecture.
Notice that more patient agents typically have higher wealth.
Replicate the figure modulo randomness.

Here’s one solution:
def simulate_og(σ_func, og, y0=0.1, ts_length=100):
    '''
    Compute a time series given consumption policy σ.
    '''
    y = np.empty(ts_length)
    ξ = np.random.randn(ts_length-1)
    y[0] = y0
    for t in range(ts_length-1):
        y[t+1] = (y[t] - σ_func(y[t]))**og.α * np.exp(og.μ + og.s * ξ[t])
    return y

fig, ax = plt.subplots()

for β in (0.8, 0.9, 0.98):
    og = OptimalGrowthModel(β=β, s=0.05)
    v_greedy, v_solution = solve_model(og, verbose=False)
    # Define an optimal policy function
    σ_func = lambda x: np.interp(x, og.grid, v_greedy)
    y = simulate_og(σ_func, og)
    ax.plot(y, lw=2, alpha=0.6, label=rf'$\beta = {β}$')

ax.legend()
plt.show()

CHAPTER
FORTY
OPTIMAL GROWTH III: TIME ITERATION
Contents
• Optimal Growth III: Time Iteration

– Overview
– The Euler Equation
– Implementation
– Exercises
40.1 Overview
In this lecture, we’ll continue our earlier study of the stochastic optimal growth model.
In that lecture, we solved the associated dynamic programming problem using value function iteration.
The beauty of this technique is its broad applicability.
With numerical problems, however, we can often attain higher efficiency in specific applications by deriving methods that
are carefully tailored to the application at hand.
The stochastic optimal growth model has plenty of structure to exploit for this purpose, especially when we adopt some
concavity and smoothness assumptions over primitives.
We’ll use this structure to obtain an Euler equation based method.
This will be an extension of the time iteration method considered in our elementary lecture on cake eating.
In a subsequent lecture, we’ll see that time iteration can be further adjusted to obtain even more efficiency.

import matplotlib.pyplot as plt
import numpy as np
from quantecon.optimize import brentq
from numba import njit
40.2 The Euler Equation
Our first step is to derive the Euler equation, which is a generalization of the Euler equation we obtained in the lecture on
cake eating.
We take the model set out in the stochastic growth model lecture and add the following assumptions:
1. 𝑢 and 𝑓 are continuously differentiable and strictly concave
2. 𝑓(0) = 0
3. $\lim_{c \to 0} u'(c) = \infty$ and $\lim_{c \to \infty} u'(c) = 0$
4. $\lim_{k \to 0} f'(k) = \infty$ and $\lim_{k \to \infty} f'(k) = 0$
The last two conditions are usually called Inada conditions.
Recall the Bellman equation
$$
v^*(y) = \max_{0 \leq c \leq y} \left\{ u(c) + \beta \int v^*(f(y - c) z) \phi(dz) \right\} \quad \text{for all } y \in \mathbb{R}_+ \tag{40.1}
$$
Let the optimal consumption policy be denoted by 𝜎∗ .

We know that 𝜎∗ is a 𝑣∗ -greedy policy so that 𝜎∗ (𝑦) is the maximizer in (40.1).
The conditions above imply that
• 𝜎∗ is the unique optimal policy for the stochastic optimal growth model
• the optimal policy is continuous, strictly increasing and also interior, in the sense that 0 < 𝜎∗ (𝑦) < 𝑦 for all strictly
positive 𝑦, and
• the value function is strictly concave and continuously differentiable, with
(𝑣∗ )′ (𝑦) = 𝑢′ (𝜎∗ (𝑦)) ∶= (𝑢′ ∘ 𝜎∗ )(𝑦) (40.2)
The last result is called the envelope condition due to its relationship with the envelope theorem.
To see why (40.2) holds, write the Bellman equation in the equivalent form
$$
v^*(y) = \max_{0 \leq k \leq y} \left\{ u(y - k) + \beta \int v^*(f(k) z) \phi(dz) \right\}
$$
Differentiating with respect to 𝑦, and then evaluating at the optimum yields (40.2).
(Section 12.1 of EDTC contains full proofs of these results, and closely related discussions can be found in many other
texts.)
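The envelope condition can be checked numerically in the one case where closed forms are known, namely log utility with Cobb-Douglas production. The sketch below compares a central finite-difference derivative of the closed-form value function with 𝑢′(𝜎∗(𝑦)); the parameter values and the step size `h` are assumptions made for this example.

```python
import numpy as np

α, β, μ = 0.4, 0.96, 0.0     # assumed parameter values

def v_star(y):
    "Closed-form value function for log utility and Cobb-Douglas production."
    c1 = np.log(1 - α * β) / (1 - β)
    c2 = (μ + α * np.log(α * β)) / (1 - α)
    c3 = 1 / (1 - β)
    c4 = 1 / (1 - α * β)
    return c1 + c2 * (c3 - c4) + c4 * np.log(y)

σ_star = lambda y: (1 - α * β) * y     # closed-form optimal policy
u_prime = lambda c: 1 / c              # derivative of log utility

y = np.linspace(0.5, 4, 8)
h = 1e-6
v_prime_fd = (v_star(y + h) - v_star(y - h)) / (2 * h)   # central difference
print(np.max(np.abs(v_prime_fd - u_prime(σ_star(y)))))
```

The printed gap should be small, limited only by finite-difference error.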
Differentiability of the value function and interiority of the optimal policy imply that optimal consumption satisfies the
first order condition associated with (40.1), which is
𝑢′ (𝜎∗ (𝑦)) = 𝛽 ∫(𝑣∗ )′ (𝑓(𝑦 − 𝜎∗ (𝑦))𝑧)𝑓 ′ (𝑦 − 𝜎∗ (𝑦))𝑧𝜙(𝑑𝑧) (40.3)
Combining (40.2) and the first-order condition (40.3) gives the Euler equation
(𝑢′ ∘ 𝜎∗ )(𝑦) = 𝛽 ∫(𝑢′ ∘ 𝜎∗ )(𝑓(𝑦 − 𝜎∗ (𝑦))𝑧)𝑓 ′ (𝑦 − 𝜎∗ (𝑦))𝑧𝜙(𝑑𝑧) (40.4)
We can think of the Euler equation as a functional equation
(𝑢′ ∘ 𝜎)(𝑦) = 𝛽 ∫(𝑢′ ∘ 𝜎)(𝑓(𝑦 − 𝜎(𝑦))𝑧)𝑓 ′ (𝑦 − 𝜎(𝑦))𝑧𝜙(𝑑𝑧) (40.5)
over interior consumption policies 𝜎, one solution of which is the optimal policy 𝜎∗ .
Our aim is to solve the functional equation (40.5) and hence obtain 𝜎∗ .

40.2.1 The Coleman-Reffett Operator
Recall the Bellman operator
$$
Tv(y) := \max_{0 \leq c \leq y} \left\{ u(c) + \beta \int v(f(y - c) z) \phi(dz) \right\} \tag{40.6}
$$
Just as we introduced the Bellman operator to solve the Bellman equation, we will now introduce an operator over policies
to help us solve the Euler equation.
This operator 𝐾 will act on the set of all 𝜎 ∈ Σ that are continuous, strictly increasing and interior.
Henceforth we denote this set of policies by 𝒫.
1. The operator 𝐾 takes as its argument a 𝜎 ∈ 𝒫 and
2. returns a new function 𝐾𝜎, where 𝐾𝜎(𝑦) is the 𝑐 ∈ (0, 𝑦) that solves
𝑢′ (𝑐) = 𝛽 ∫(𝑢′ ∘ 𝜎)(𝑓(𝑦 − 𝑐)𝑧)𝑓 ′ (𝑦 − 𝑐)𝑧𝜙(𝑑𝑧) (40.7)
We call this operator the Coleman-Reffett operator to acknowledge the work of [Coleman, 1990] and [Reffett, 1996].
In essence, 𝐾𝜎 is the consumption policy that the Euler equation tells you to choose today when your future consumption
policy is 𝜎.
The important thing to note about 𝐾 is that, by construction, its fixed points coincide with solutions to the functional
equation (40.5).
In particular, the optimal policy 𝜎∗ is a fixed point.
Indeed, for fixed 𝑦, the value 𝐾𝜎∗ (𝑦) is the 𝑐 that solves
𝑢′ (𝑐) = 𝛽 ∫(𝑢′ ∘ 𝜎∗ )(𝑓(𝑦 − 𝑐)𝑧)𝑓 ′ (𝑦 − 𝑐)𝑧𝜙(𝑑𝑧)
In view of the Euler equation, this is exactly 𝜎∗ (𝑦).
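This fixed point property can be verified directly in the log utility, Cobb-Douglas case, where 𝜎∗(𝑦) = (1 − 𝛼𝛽)𝑦 is known in closed form. The sketch below evaluates the difference between the two sides of the Euler equation at 𝑐 = 𝜎∗(𝑦), replacing the integral with a sample average. The parameter values, the shock draws and the helper name `euler_residual` are assumptions made for this example.

```python
import numpy as np

α, β, μ, s = 0.4, 0.96, 0.0, 0.1          # assumed parameter values
rng = np.random.default_rng(1234)
shocks = np.exp(μ + s * rng.standard_normal(250))

σ_star = lambda y: (1 - α * β) * y        # closed-form optimal policy
u_prime = lambda c: 1 / c                 # derivative of log utility
f = lambda k: k**α                        # Cobb-Douglas production
f_prime = lambda k: α * k**(α - 1)

def euler_residual(c, y, σ):
    "Left side minus right side of the Euler equation at consumption c."
    k = y - c
    vals = u_prime(σ(f(k) * shocks)) * f_prime(k) * shocks
    return u_prime(c) - β * np.mean(vals)

y_grid = np.linspace(0.1, 4, 10)
residuals = np.array([euler_residual(σ_star(y), y, σ_star) for y in y_grid])
print(np.max(np.abs(residuals)))
```

In this special case the shocks cancel inside the expectation, so the residual is zero up to floating point error whatever draws are used.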
40.2.2 Is the Coleman-Reffett Operator Well Defined?
In particular, is there always a unique 𝑐 ∈ (0, 𝑦) that solves (40.7)?

The answer is yes, under our assumptions.
For any 𝜎 ∈ 𝒫, the right side of (40.7)
• is continuous and strictly increasing in 𝑐 on (0, 𝑦)
• diverges to +∞ as 𝑐 ↑ 𝑦
The left side of (40.7)
• is continuous and strictly decreasing in 𝑐 on (0, 𝑦)
• diverges to +∞ as 𝑐 ↓ 0
Sketching these curves and using the information above will convince you that they cross exactly once as 𝑐 ranges over
(0, 𝑦).
With a bit more analysis, one can show in addition that 𝐾𝜎 ∈ 𝒫 whenever 𝜎 ∈ 𝒫.

40.2.3 Comparison with VFI (Theory)
It is possible to prove that there is a tight relationship between iterates of 𝐾 and iterates of the Bellman operator.
Mathematically, the two operators are topologically conjugate.
Loosely speaking, this means that if iterates of one operator converge then so do iterates of the other, and vice versa.
Moreover, there is a sense in which they converge at the same rate, at least in theory.
However, it turns out that the operator 𝐾 is more stable numerically and hence more efficient in the applications we
consider.
Examples are given below.
40.3 Implementation
As in our previous study, we continue to assume that

• 𝑢(𝑐) = ln 𝑐
• $f(k) = k^{\alpha}$
• 𝜙 is the distribution of 𝜉 ∶= exp(𝜇 + 𝑠𝜁) when 𝜁 is standard normal
This will allow us to compare our results to the analytical solutions

def v_star(y, α, β, μ):
    """
    True value function
    """
    c1 = np.log(1 - α * β) / (1 - β)
    c2 = (μ + α * np.log(α * β)) / (1 - α)
    c3 = 1 / (1 - β)
    c4 = 1 / (1 - α * β)
    return c1 + c2 * (c3 - c4) + c4 * np.log(y)

def σ_star(y, α, β):
    """
    True optimal policy
    """
    return (1 - α * β) * y
As discussed above, our plan is to solve the model using time iteration, which means iterating with the operator 𝐾.
For this we need access to the functions 𝑢′ and 𝑓, 𝑓 ′ .
These are available in a class called OptimalGrowthModel that we constructed in an earlier lecture.

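from numba import float64
from numba.experimental import jitclass

opt_growth_data = [
    ('α', float64),          # Production parameter
    ('β', float64),          # Discount factor
    ('μ', float64),          # Shock location parameter
    ('s', float64),          # Shock scale parameter
    ('grid', float64[:]),    # Grid (array)
    ('shocks', float64[:])   # Shock draws (array)
]

@jitclass(opt_growth_data)
class OptimalGrowthModel:

    def __init__(self,
                 α=0.4,
                 β=0.96,
                 μ=0,
                 s=0.1,
                 grid_max=4,
                 grid_size=120,
                 shock_size=250,
                 seed=1234):

        self.α, self.β, self.μ, self.s = α, β, μ, s

        # Set up grid (the lower bound 1e-5 is assumed here)
        self.grid = np.linspace(1e-5, grid_max, grid_size)

        # Store shocks (with a seed, so results are reproducible)
        np.random.seed(seed)
        self.shocks = np.exp(μ + s * np.random.randn(shock_size))

    def f(self, k):
        "The production function"
        return k**self.α

    def u(self, c):
        "The utility function"
        return np.log(c)

    def f_prime(self, k):
        "Derivative of f"
        return self.α * (k**(self.α - 1))

    def u_prime(self, c):
        "Derivative of u"
        return 1/c

    def u_prime_inv(self, c):
        "Inverse of u'"
        return 1/c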
Now we implement a method called euler_diff, which returns
𝑢′ (𝑐) − 𝛽 ∫(𝑢′ ∘ 𝜎)(𝑓(𝑦 − 𝑐)𝑧)𝑓 ′ (𝑦 − 𝑐)𝑧𝜙(𝑑𝑧) (40.8)
@njit
def euler_diff(c, σ, y, og):
    """
    Set up a function such that the root with respect to c,
    given y and σ, is equal to Kσ(y).
    """
    β, shocks, grid = og.β, og.shocks, og.grid
    f, f_prime, u_prime = og.f, og.f_prime, og.u_prime

    # First turn σ into a function via interpolation
    σ_func = lambda x: np.interp(x, grid, σ)

    # Now set up the function we need to find the root of.
    vals = u_prime(σ_func(f(y - c) * shocks)) * f_prime(y - c) * shocks
    return u_prime(c) - β * np.mean(vals)
The function euler_diff evaluates integrals by Monte Carlo and approximates functions using linear interpolation.
We will use a root-finding algorithm to solve (40.8) for 𝑐 given state 𝑦 and 𝜎, the current guess of the policy.
Here’s the operator 𝐾, that implements the root-finding step.
@njit
def K(σ, og):
    """
    The Coleman-Reffett operator

    Here og is an instance of OptimalGrowthModel.
    """
    β = og.β
    f, f_prime, u_prime = og.f, og.f_prime, og.u_prime
    grid, shocks = og.grid, og.shocks

    σ_new = np.empty_like(σ)
    for i, y in enumerate(grid):
        # Solve for optimal c at y
        c_star = brentq(euler_diff, 1e-10, y-1e-10, args=(σ, y, og))[0]
        σ_new[i] = c_star

    return σ_new
40.3.1 Testing
Let’s generate an instance and plot some iterates of 𝐾, starting from 𝜎(𝑦) = 𝑦.
og = OptimalGrowthModel()
grid = og.grid

n = 15
σ = grid.copy()  # Set initial condition

fig, ax = plt.subplots()
lb = r'initial condition $\sigma(y) = y$'
ax.plot(grid, σ, color=plt.cm.jet(0), alpha=0.6, label=lb)

for i in range(n):
    σ = K(σ, og)
    ax.plot(grid, σ, color=plt.cm.jet(i / n), alpha=0.6)

# Update one more time and plot the last iterate in black
σ = K(σ, og)
ax.plot(grid, σ, color='k', alpha=0.8, label='last iterate')
ax.legend()
plt.show()
We see that the iteration process converges quickly to a limit that resembles the solution we obtained in the previous
lecture.
Here is a function called solve_model_time_iter that takes an instance of OptimalGrowthModel and returns
an approximation to the optimal policy, using time iteration.
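def solve_model_time_iter(model,    # Class with model information
                          σ,        # Initial condition
                          tol=1e-4,
                          max_iter=1000,
                          verbose=True,
                          print_skip=25):

    # Set up loop
    i = 0
    error = tol + 1

    while i < max_iter and error > tol:
        σ_new = K(σ, model)
        error = np.max(np.abs(σ - σ_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        σ = σ_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return σ_new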
Let’s call it:
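σ_init = np.copy(og.grid)
σ = solve_model_time_iter(og, σ_init)

fig, ax = plt.subplots()
ax.plot(og.grid, σ, lw=2,
        alpha=0.6, label='approximate policy function')
ax.plot(og.grid, σ_star(og.grid, og.α, og.β), 'k--',
        lw=2, alpha=0.6, label='true policy function')
ax.legend()
plt.show()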

Again, the fit is excellent.

np.max(np.abs(σ - σ_star(og.grid, og.α, og.β)))
2.532910601971139e-05
How long does it take to converge?
%%timeit -n 3 -r 1
σ = solve_model_time_iter(og, σ_init, verbose=False)
80.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 3 loops each)
Convergence is very fast, even compared to our JIT-compiled value function iteration.
Overall, we find that time iteration provides a very high degree of efficiency and accuracy, at least for this model.

40.4 Exercises
Exercise 40.4.1
Solve the model with CRRA utility
$$
u(c) = \frac{c^{1 - \gamma}}{1 - \gamma}
$$
Set γ = 1.5.
Compute and plot the optimal policy.

We use the class OptimalGrowthModel_CRRA from our VFI lecture.

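from numba import float64
from numba.experimental import jitclass

opt_growth_data = [
    ('α', float64),          # Production parameter
    ('β', float64),          # Discount factor
    ('μ', float64),          # Shock location parameter
    ('γ', float64),          # Preference parameter
    ('s', float64),          # Shock scale parameter
    ('grid', float64[:]),    # Grid (array)
    ('shocks', float64[:])   # Shock draws (array)
]

@jitclass(opt_growth_data)
class OptimalGrowthModel_CRRA:

    def __init__(self,
                 α=0.4,
                 β=0.96,
                 μ=0,
                 s=0.1,
                 γ=1.5,
                 grid_max=4,
                 grid_size=120,
                 shock_size=250,
                 seed=1234):

        self.α, self.β, self.γ, self.μ, self.s = α, β, γ, μ, s

        # Set up grid (the lower bound 1e-5 is assumed here)
        self.grid = np.linspace(1e-5, grid_max, grid_size)

        # Store shocks (with a seed, so results are reproducible)
        np.random.seed(seed)
        self.shocks = np.exp(μ + s * np.random.randn(shock_size))

    def f(self, k):
        "The production function."
        return k**self.α

    def u(self, c):
        "The utility function."
        return c**(1 - self.γ) / (1 - self.γ)

    def f_prime(self, k):
        "Derivative of f."
        return self.α * (k**(self.α - 1))

    def u_prime(self, c):
        "Derivative of u."
        return c**(-self.γ)

    def u_prime_inv(self, c):
        "Inverse of u'."
        return c**(-1 / self.γ)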
Let’s create an instance:
og_crra = OptimalGrowthModel_CRRA()
Now we solve and plot the policy:
%%time
σ = solve_model_time_iter(og_crra, σ_init)

fig, ax = plt.subplots()
ax.plot(og_crra.grid, σ, lw=2,
        alpha=0.6, label='Approximate policy function')
ax.legend()
plt.show()
Wall time: 2.44 s

CHAPTER
FORTYONE
OPTIMAL GROWTH IV: THE ENDOGENOUS GRID METHOD
Contents
• Optimal Growth IV: The Endogenous Grid Method

– Overview
– Key Idea
– Implementation
41.1 Overview
Previously, we solved the stochastic optimal growth model using

1. value function iteration
2. Euler equation based time iteration
We found time iteration to be significantly more accurate and efficient.
In this lecture, we’ll look at a clever twist on time iteration called the endogenous grid method (EGM).
EGM is a numerical method for implementing policy iteration invented by Chris Carroll.
The original reference is [Carroll, 2006].

import matplotlib.pyplot as plt
import numpy as np
from numba import njit
41.2 Key Idea
Let’s start by reminding ourselves of the theory and then see how the numerics fit in.
41.2.1 Theory
Take the model set out in the time iteration lecture, following the same terminology and notation.
The Euler equation is
(𝑢′ ∘ 𝜎∗ )(𝑦) = 𝛽 ∫(𝑢′ ∘ 𝜎∗ )(𝑓(𝑦 − 𝜎∗ (𝑦))𝑧)𝑓 ′ (𝑦 − 𝜎∗ (𝑦))𝑧𝜙(𝑑𝑧) (41.1)
As we saw, the Coleman-Reffett operator is a nonlinear operator 𝐾 engineered so that 𝜎∗ is a fixed point of 𝐾.
It takes as its argument a continuous strictly increasing consumption policy 𝜎 ∈ Σ.
It returns a new function 𝐾𝜎, where (𝐾𝜎)(𝑦) is the 𝑐 ∈ (0, 𝑦) that solves
𝑢′ (𝑐) = 𝛽 ∫(𝑢′ ∘ 𝜎)(𝑓(𝑦 − 𝑐)𝑧)𝑓 ′ (𝑦 − 𝑐)𝑧𝜙(𝑑𝑧) (41.2)
41.2.2 Exogenous Grid
As discussed in the lecture on time iteration, to implement the method on a computer, we need a numerical approximation.
In particular, we represent a policy function by a set of values on a finite grid.
The function itself is reconstructed from this representation when necessary, using interpolation or some other method.
Previously, to obtain a finite representation of an updated consumption policy, we
• fixed a grid of income points {𝑦𝑖 }
• calculated the consumption value 𝑐𝑖 corresponding to each 𝑦𝑖 using (41.2) and a root-finding routine
Each 𝑐𝑖 is then interpreted as the value of the function 𝐾𝜎 at 𝑦𝑖 .
Thus, with the points {𝑦𝑖 , 𝑐𝑖 } in hand, we can reconstruct 𝐾𝜎 via approximation.
Iteration then continues…
41.2.3 Endogenous Grid
The method discussed above requires a root-finding routine to find the 𝑐𝑖 corresponding to a given income value 𝑦𝑖 .
Root-finding is costly because it typically involves a significant number of function evaluations.
As pointed out by Carroll [Carroll, 2006], we can avoid this if 𝑦𝑖 is chosen endogenously.
The only assumption required is that 𝑢′ is invertible on (0, ∞).
Let (𝑢′ )−1 be the inverse function of 𝑢′ .
The idea is this:
• First, we fix an exogenous grid {𝑘𝑖 } for capital (𝑘 = 𝑦 − 𝑐).
• Then we obtain 𝑐𝑖 via
𝑐𝑖 = (𝑢′ )−1 {𝛽 ∫(𝑢′ ∘ 𝜎)(𝑓(𝑘𝑖 )𝑧) 𝑓 ′ (𝑘𝑖 ) 𝑧 𝜙(𝑑𝑧)} (41.3)
• Finally, for each 𝑐𝑖 we set 𝑦𝑖 = 𝑐𝑖 + 𝑘𝑖 .

It is clear that each (𝑦𝑖 , 𝑐𝑖 ) pair constructed in this manner satisfies (41.2).
With the points {𝑦𝑖 , 𝑐𝑖 } in hand, we can reconstruct 𝐾𝜎 via approximation as before.
The name EGM comes from the fact that the grid {𝑦𝑖 } is determined endogenously.
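The three steps above can be sketched in a few lines. This is a minimal illustration, assuming log utility, Cobb-Douglas production, made-up parameter values and an arbitrary guess 𝜎(𝑦) = 0.5𝑦; the names `rhs`, `k_grid` and `residuals` are hypothetical and are not part of the lecture's code. It builds the pairs (𝑦𝑖, 𝑐𝑖) from (41.3) and then checks numerically that each pair satisfies (41.2).

```python
import numpy as np

# Illustrative primitives (assumed values, not the lecture's class)
α, β = 0.4, 0.96
rng = np.random.default_rng(0)
shocks = np.exp(0.1 * rng.standard_normal(250))   # lognormal draws of z

f = lambda k: k**α                   # Cobb-Douglas production
f_prime = lambda k: α * k**(α - 1)   # its derivative
u_prime = lambda c: 1 / c            # marginal utility under log utility
u_prime_inv = lambda x: 1 / x        # inverse of u'

σ = lambda y: 0.5 * y                # an arbitrary increasing policy guess

def rhs(k):
    "Right-hand side of (41.2) at savings k; the integral is a sample mean."
    return β * np.mean(u_prime(σ(f(k) * shocks)) * f_prime(k) * shocks)

# Step 1: exogenous grid for capital k = y - c
k_grid = np.linspace(0.1, 4.0, 10)

# Step 2: consumption from (41.3), with no root-finding
c = np.array([u_prime_inv(rhs(k)) for k in k_grid])

# Step 3: endogenous income grid
y = c + k_grid

# Check (41.2) at each pair: u'(c_i) minus the right-hand side at y_i - c_i
residuals = np.array([u_prime(ci) - rhs(yi - ci) for yi, ci in zip(y, c)])
```

The residuals are zero up to floating point error, because 𝑦𝑖 − 𝑐𝑖 recovers 𝑘𝑖 by construction.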
41.3 Implementation
As before, we will start with a simple setting where

• 𝑢(𝑐) = ln 𝑐,
• production is Cobb-Douglas, and
• the shocks are lognormal.
This will allow us to make comparisons with the analytical solutions.

def v_star(y, α, β, μ):
    """
    True value function
    """
    c1 = np.log(1 - α * β) / (1 - β)
    c2 = (μ + α * np.log(α * β)) / (1 - α)
    c3 = 1 / (1 - β)
    c4 = 1 / (1 - α * β)
    return c1 + c2 * (c3 - c4) + c4 * np.log(y)

def σ_star(y, α, β):
    """
    True optimal policy
    """
    return (1 - α * β) * y
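We reuse the OptimalGrowthModel class

opt_growth_data = [
    ('α', float64),          # Production parameter
    ('β', float64),          # Discount factor
    ('μ', float64),          # Shock location parameter
    ('s', float64),          # Shock scale parameter
    ('grid', float64[:]),    # Grid (array)
    ('shocks', float64[:])   # Shock draws (array)
]

@jitclass(opt_growth_data)
class OptimalGrowthModel:

    def __init__(self,
                 α=0.4,
                 β=0.96,
                 μ=0,
                 s=0.1,
                 grid_max=4,
                 grid_size=120,
                 shock_size=250,
                 seed=1234):

        self.α, self.β, self.μ, self.s = α, β, μ, s

        # Set up grid
        self.grid = np.linspace(1e-5, grid_max, grid_size)

        # Store lognormal shocks (seeded, so results are reproducible)
        np.random.seed(seed)
        self.shocks = np.exp(μ + s * np.random.randn(shock_size))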
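    def f(self, k):
        "The production function"
        return k**self.α

    def u(self, c):
        "The utility function"
        return np.log(c)

    def f_prime(self, k):
        "Derivative of f"
        return self.α * (k**(self.α - 1))

    def u_prime(self, c):
        "Derivative of u"
        return 1/c

    def u_prime_inv(self, c):
        "Inverse of u'"
        return 1/c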
41.3.1 The Operator
Here’s an implementation of 𝐾 using EGM as described above.
@njit
def K(σ_array, og):
    """
    The Coleman-Reffett operator using EGM
    """

    # Simplify names
    f, β = og.f, og.β
    f_prime, u_prime = og.f_prime, og.u_prime
    u_prime_inv = og.u_prime_inv
    grid, shocks = og.grid, og.shocks

    # Determine endogenous grid
    y = grid + σ_array  # y_i = k_i + c_i

    # Linear interpolation of policy using endogenous grid
    σ = lambda x: np.interp(x, y, σ_array)

    # Allocate memory for new consumption array
    c = np.empty_like(grid)

    # Solve for updated consumption value
    for i, k in enumerate(grid):
        vals = u_prime(σ(f(k) * shocks)) * f_prime(k) * shocks
        c[i] = u_prime_inv(β * np.mean(vals))

    return c
Note the lack of any root-finding algorithm.
41.3.2 Testing
First we create an instance.
og = OptimalGrowthModel()
grid = og.grid
Here’s our solver routine:

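def solve_model_time_iter(model,    # Class with model information
                          σ,        # Initial condition
                          tol=1e-4,
                          max_iter=1000,
                          verbose=True,
                          print_skip=25):

    # Set up loop
    i = 0
    error = tol + 1

    while i < max_iter and error > tol:
        σ_new = K(σ, model)
        error = np.max(np.abs(σ - σ_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        σ = σ_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return σ_new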
Let’s call it:
σ_init = np.copy(grid)
σ = solve_model_time_iter(og, σ_init)
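y = grid + σ  # y_i = k_i + c_i

fig, ax = plt.subplots()
ax.plot(y, σ, lw=2, alpha=0.8, label='approximate policy function')
ax.plot(y, σ_star(y, og.α, og.β), 'k--', lw=2, alpha=0.8, label='true policy function')
ax.legend()
plt.show()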
np.max(np.abs(σ - σ_star(y, og.α, og.β)))
1.5302749144296968e-05
How long does it take to converge?
%%timeit -n 3 -r 1
σ = solve_model_time_iter(og, σ_init, verbose=False)
11.2 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 3 loops each)
Relative to time iteration, which was already found to be highly efficient, EGM has managed to shave off still more run time without compromising accuracy.
This is due to the lack of a numerical root-finding step.
We can now solve the optimal growth model at given parameters extremely fast.

CHAPTER
FORTYTWO
THE INCOME FLUCTUATION PROBLEM I: BASIC MODEL
Contents
• The Income Fluctuation Problem I: Basic Model

– Overview
– The Optimal Savings Problem
– Computation
– Implementation
– Exercises
42.1 Overview
In this lecture, we study an optimal savings problem for an infinitely lived consumer—the “common ancestor” described
in [Ljungqvist and Sargent, 2018], section 1.3.
This is an essential sub-problem for many representative macroeconomic models
• [Aiyagari, 1994]
• [Huggett, 1993]
• etc.
It is related to the decision problem in the stochastic optimal growth model and yet differs in important ways.
For example, the choice problem for the agent includes an additive income term that leads to an occasionally binding
constraint.
Moreover, in this and the following lectures, we will inject more realistic features such as correlated shocks.
To solve the model we will use Euler equation based time iteration, which proved to be fast and accurate in our investigation of the stochastic optimal growth model.
Time iteration is globally convergent under mild assumptions, even when utility is unbounded (both above and below).

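import matplotlib.pyplot as plt
import numpy as np
from quantecon.optimize import brentq
from quantecon import MarkovChain
from numba import njit, float64
from numba.experimental import jitclass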
42.1.1 References
Our presentation is a simplified version of [Ma et al., 2020].

Other references include [Deaton, 1991], [Den Haan, 2010], [Kuhn, 2013], [Rabault, 2002], [Reiter, 2009] and [Schechtman and Escudero, 1977].
42.2 The Optimal Savings Problem
Let’s write down the model and then discuss how to solve it.
42.2.1 Set-Up
Consider a household that chooses a state-contingent consumption plan {𝑐𝑡 }𝑡≥0 to maximize
$$\mathbb{E} \sum_{t=0}^{\infty} \beta^t u(c_t)$$
subject to
𝑎𝑡+1 ≤ 𝑅(𝑎𝑡 − 𝑐𝑡 ) + 𝑌𝑡+1 , 𝑐𝑡 ≥ 0, 𝑎𝑡 ≥ 0 𝑡 = 0, 1, … (42.1)
Here
• 𝛽 ∈ (0, 1) is the discount factor
• 𝑎𝑡 is asset holdings at time 𝑡, with borrowing constraint 𝑎𝑡 ≥ 0
• 𝑐𝑡 is consumption
• 𝑌𝑡 is non-capital income (wages, unemployment compensation, etc.)
• 𝑅 ∶= 1 + 𝑟, where 𝑟 > 0 is the interest rate on savings
The timing here is as follows:
1. At the start of period 𝑡, the household chooses consumption 𝑐𝑡 .
2. Labor is supplied by the household throughout the period and labor income 𝑌𝑡+1 is received at the end of period 𝑡.
3. Financial income 𝑅(𝑎𝑡 − 𝑐𝑡 ) is received at the end of period 𝑡.
4. Time shifts to 𝑡 + 1 and the process repeats.
Non-capital income 𝑌𝑡 is given by 𝑌𝑡 = 𝑦(𝑍𝑡 ), where {𝑍𝑡 } is an exogenous state process.
As is common in the literature, we take {𝑍𝑡 } to be a finite state Markov chain taking values in Z with Markov matrix 𝑃 .
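To make the role of the Markov matrix 𝑃 concrete, here is a minimal sketch that simulates a two-state chain and compares visit frequencies with the stationary distribution. The matrix entries match the defaults of the IFP class used later in this lecture; the helper names `simulate_chain` and `stationary_dist` are hypothetical, and the simulation uses plain NumPy rather than any particular library routine.

```python
import numpy as np

# Two-state Markov matrix (same entries as the IFP default)
P = np.array([[0.6, 0.4],
              [0.05, 0.95]])

def simulate_chain(P, T, z0=0, seed=0):
    "Simulate T observations of a finite Markov chain with matrix P."
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    z = np.empty(T, dtype=np.int64)
    z[0] = z0
    for t in range(T - 1):
        # Draw the next state from row z[t] of P
        z[t + 1] = rng.choice(n, p=P[z[t]])
    return z

def stationary_dist(P):
    "Solve ψ P = ψ with ψ summing to one (assumes uniqueness)."
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    ψ, *_ = np.linalg.lstsq(A, b, rcond=None)
    return ψ

z_seq = simulate_chain(P, 20_000)
ψ = stationary_dist(P)
freq = np.bincount(z_seq, minlength=2) / len(z_seq)
```

For this matrix the stationary distribution is (1/9, 8/9), and the simulated frequencies in `freq` should be close to it.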
We further assume that

1. 𝛽𝑅 < 1
2. 𝑢 is smooth, strictly increasing and strictly concave with lim𝑐→0 𝑢′ (𝑐) = ∞ and lim𝑐→∞ 𝑢′ (𝑐) = 0
The asset space is ℝ+ and the state is the pair (𝑎, 𝑧) ∈ S ∶= ℝ+ × Z.
A feasible consumption path from (𝑎, 𝑧) ∈ S is a consumption sequence {𝑐𝑡 } such that {𝑐𝑡 } and its induced asset path
{𝑎𝑡 } satisfy
1. (𝑎0 , 𝑧0 ) = (𝑎, 𝑧)
2. the feasibility constraints in (42.1), and
3. measurability, which means that 𝑐𝑡 is a function of random outcomes up to date 𝑡 but not after.
The meaning of the third point is just that consumption at time 𝑡 cannot be a function of outcomes that are yet to be observed.
In fact, for this problem, consumption can be chosen optimally by taking it to be contingent only on the current state.
Optimality is defined below.
42.2.2 Value Function and Euler Equation
The value function 𝑉 ∶ S → ℝ is defined by

$$V(a, z) := \max \, \mathbb{E} \left\{ \sum_{t=0}^{\infty} \beta^t u(c_t) \right\} \tag{42.2}$$
where the maximization is over all feasible consumption paths from (𝑎, 𝑧).
An optimal consumption path from (𝑎, 𝑧) is a feasible consumption path from (𝑎, 𝑧) that attains the supremum in (42.2).
To pin down such paths we can use a version of the Euler equation, which in the present setting is
𝑢′ (𝑐𝑡 ) ≥ 𝛽𝑅 𝔼𝑡 𝑢′ (𝑐𝑡+1 ) (42.3)
and
𝑐𝑡 < 𝑎𝑡 ⟹ 𝑢′ (𝑐𝑡 ) = 𝛽𝑅 𝔼𝑡 𝑢′ (𝑐𝑡+1 ) (42.4)
When 𝑐𝑡 = 𝑎𝑡 we obviously have 𝑢′ (𝑐𝑡 ) = 𝑢′ (𝑎𝑡 ).

When 𝑐𝑡 hits the upper bound 𝑎𝑡 , the strict inequality 𝑢′ (𝑐𝑡 ) > 𝛽𝑅 𝔼𝑡 𝑢′ (𝑐𝑡+1 ) can occur because 𝑐𝑡 cannot increase
sufficiently to attain equality.
(The lower boundary case 𝑐𝑡 = 0 never arises at the optimum because 𝑢′ (0) = ∞.)
With some thought, one can show that (42.3) and (42.4) are equivalent to
𝑢′ (𝑐𝑡 ) = max {𝛽𝑅 𝔼𝑡 𝑢′ (𝑐𝑡+1 ) , 𝑢′ (𝑎𝑡 )} (42.5)
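One way to see the equivalence is sketched below. It uses only the constraint 𝑐𝑡 ≤ 𝑎𝑡 and the fact that 𝑢′ is strictly decreasing, which follows from strict concavity of 𝑢.

```latex
\begin{aligned}
&\text{Since } c_t \le a_t \text{ and } u' \text{ is strictly decreasing: }
  u'(c_t) \ge u'(a_t), \text{ with equality iff } c_t = a_t. \\[4pt]
&\text{(42.3) and (42.4) imply (42.5):} \\
&\quad c_t < a_t: \quad u'(c_t) = \beta R \, \mathbb{E}_t u'(c_{t+1})
  \text{ and } u'(c_t) > u'(a_t),
  \text{ so } \max\{\beta R \, \mathbb{E}_t u'(c_{t+1}), \, u'(a_t)\} = u'(c_t). \\
&\quad c_t = a_t: \quad u'(a_t) = u'(c_t) \ge \beta R \, \mathbb{E}_t u'(c_{t+1}),
  \text{ so } \max\{\beta R \, \mathbb{E}_t u'(c_{t+1}), \, u'(a_t)\} = u'(a_t) = u'(c_t). \\[4pt]
&\text{(42.5) implies (42.3) and (42.4):} \\
&\quad u'(c_t) = \max\{\beta R \, \mathbb{E}_t u'(c_{t+1}), \, u'(a_t)\}
  \ge \beta R \, \mathbb{E}_t u'(c_{t+1}), \text{ which is (42.3).} \\
&\quad \text{If } c_t < a_t \text{, then } u'(c_t) > u'(a_t),
  \text{ so the max is attained by the first term, which is (42.4).}
\end{aligned}
```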
42.2.3 Optimality Results
As shown in [Ma et al., 2020],

1. For each (𝑎, 𝑧) ∈ S, a unique optimal consumption path from (𝑎, 𝑧) exists
2. This path is the unique feasible path from (𝑎, 𝑧) satisfying the Euler equality (42.5) and the transversality condition
$$\lim_{t \to \infty} \beta^t \, \mathbb{E} \left[ u'(c_t) \, a_{t+1} \right] = 0 \tag{42.6}$$
Moreover, there exists an optimal consumption function 𝜎∗ ∶ S → ℝ+ such that the path from (𝑎, 𝑧) generated by
(𝑎0 , 𝑧0 ) = (𝑎, 𝑧), 𝑐𝑡 = 𝜎∗ (𝑎𝑡 , 𝑍𝑡 ) and 𝑎𝑡+1 = 𝑅(𝑎𝑡 − 𝑐𝑡 ) + 𝑌𝑡+1
satisfies both (42.5) and (42.6), and hence is the unique optimal path from (𝑎, 𝑧).
Thus, to solve the optimization problem, we need to compute the policy 𝜎∗ .
42.3 Computation
There are two standard ways to solve for 𝜎∗

1. time iteration using the Euler equality and
2. value function iteration.
Our investigation of the cake eating problem and stochastic optimal growth model suggests that time iteration will be
faster and more accurate.
This is the approach that we apply below.
42.3.1 Time Iteration
We can rewrite (42.5) to make it a statement about functions rather than random variables.
In particular, consider the functional equation
$$(u' \circ \sigma)(a, z) = \max \left\{ \beta R \, \mathbb{E}_z (u' \circ \sigma)[R(a - \sigma(a, z)) + \hat Y, \, \hat Z], \; u'(a) \right\} \tag{42.7}$$
where
• (𝑢′ ∘ 𝜎)(𝑠) ∶= 𝑢′ (𝜎(𝑠)).
• 𝔼𝑧 conditions on current state 𝑧 and 𝑋̂ indicates next period value of random variable 𝑋 and
• 𝜎 is the unknown function.
We need a suitable class of candidate solutions for the optimal consumption policy.
The right way to pick such a class is to consider what properties the solution is likely to have, in order to restrict the search
space and ensure that iteration is well behaved.
To this end, let 𝒞 be the space of continuous functions 𝜎 ∶ S → ℝ such that 𝜎 is increasing in the first argument, 0 <
𝜎(𝑎, 𝑧) ≤ 𝑎 for all (𝑎, 𝑧) ∈ S, and
$$\sup_{(a, z) \in \mathsf{S}} \left| (u' \circ \sigma)(a, z) - u'(a) \right| < \infty \tag{42.8}$$
This will be our candidate class.

In addition, let 𝐾 ∶ 𝒞 → 𝒞 be defined as follows.
For given 𝜎 ∈ 𝒞, the value 𝐾𝜎(𝑎, 𝑧) is the unique 𝑐 ∈ [0, 𝑎] that solves
$$u'(c) = \max \left\{ \beta R \, \mathbb{E}_z (u' \circ \sigma)[R(a - c) + \hat Y, \, \hat Z], \; u'(a) \right\} \tag{42.9}$$
We refer to 𝐾 as the Coleman–Reffett operator.
The operator 𝐾 is constructed so that fixed points of 𝐾 coincide with solutions to the functional equation (42.7).
It is shown in [Ma et al., 2020] that the unique optimal policy can be computed by picking any 𝜎 ∈ 𝒞 and iterating with
the operator 𝐾 defined in (42.9).

42.3.2 Some Technical Details
The proof of the last statement is somewhat technical but here is a quick summary:
It is shown in [Ma et al., 2020] that 𝐾 is a contraction mapping on 𝒞 under the metric
$$\rho(\sigma_1, \sigma_2) := \| u' \circ \sigma_1 - u' \circ \sigma_2 \| := \sup_{s \in \mathsf{S}} | u'(\sigma_1(s)) - u'(\sigma_2(s)) | \qquad (\sigma_1, \sigma_2 \in \mathscr{C})$$
which evaluates the maximal difference in terms of marginal utility.

(The benefit of this measure of distance is that, while elements of 𝒞 are not generally bounded, 𝜌 is always finite under
our assumptions.)
It is also shown that the metric 𝜌 is complete on 𝒞.
In consequence, 𝐾 has a unique fixed point 𝜎∗ ∈ 𝒞 and 𝐾 𝑛 𝜎 → 𝜎∗ as 𝑛 → ∞ for any 𝜎 ∈ 𝒞.
By the definition of 𝐾, the fixed points of 𝐾 in 𝒞 coincide with the solutions to (42.7) in 𝒞.
As a consequence, the path {𝑐𝑡 } generated from (𝑎0 , 𝑧0 ) ∈ 𝑆 using policy function 𝜎∗ is the unique optimal path from
(𝑎0 , 𝑧0 ) ∈ 𝑆.
42.4 Implementation
We use the CRRA utility specification
$$u(c) = \frac{c^{1 - \gamma}}{1 - \gamma}$$
The exogenous state process {𝑍𝑡 } defaults to a two-state Markov chain with state space {0, 1} and transition matrix 𝑃 .
Here we build a class called IFP that stores the model primitives.
ifp_data = [
    ('R', float64),              # Interest rate 1 + r
    ('β', float64),              # Discount factor
    ('γ', float64),              # Preference parameter
    ('P', float64[:, :]),        # Markov matrix for binary Z_t
    ('y', float64[:]),           # Income is Y_t = y[Z_t]
    ('asset_grid', float64[:])   # Grid (array)
]

@jitclass(ifp_data)
class IFP:

    def __init__(self,
                 r=0.01,
                 β=0.96,
                 γ=1.5,
                 P=((0.6, 0.4),
                    (0.05, 0.95)),
                 y=(0.0, 2.0),
                 grid_max=16,
                 grid_size=50):

        self.R = 1 + r
        self.β, self.γ = β, γ
        self.P, self.y = np.array(P), np.array(y)
        self.asset_grid = np.linspace(0, grid_max, grid_size)

        # Recall that we need R β < 1 for convergence.
        assert self.R * self.β < 1, "Stability condition violated."

    def u_prime(self, c):
        "Marginal utility for CRRA preferences"
        return c**(-self.γ)

Next we provide a function to compute the difference
$$u'(c) - \max \left\{ \beta R \, \mathbb{E}_z (u' \circ \sigma)[R(a - c) + \hat Y, \, \hat Z], \; u'(a) \right\} \tag{42.10}$$
@njit
def euler_diff(c, a, z, σ_vals, ifp):
    """
    The difference between the left- and right-hand side
    of the Euler Equation, given current policy σ.

        * c is the consumption choice
        * (a, z) is the state, with z in {0, 1}
        * σ_vals is a policy represented as a matrix.
        * ifp is an instance of IFP

    """

    # Simplify names
    R, P, y, β, γ = ifp.R, ifp.P, ifp.y, ifp.β, ifp.γ
    asset_grid, u_prime = ifp.asset_grid, ifp.u_prime
    n = len(P)

    # Convert policy into a function by linear interpolation
    def σ(a, z):
        return np.interp(a, asset_grid, σ_vals[:, z])

    # Calculate the expectation conditional on current z
    expect = 0.0
    for z_hat in range(n):
        expect += u_prime(σ(R * (a - c) + y[z_hat], z_hat)) * P[z, z_hat]

    return u_prime(c) - max(β * R * expect, u_prime(a))
Note that we use linear interpolation along the asset grid to approximate the policy function.
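One detail of `np.interp` is worth keeping in mind: it interpolates linearly between grid points but returns the boundary value outside the grid instead of extrapolating. The sketch below illustrates this with a hypothetical policy array `σ_vals` (half of assets at each grid point) on a grid shaped like the default asset grid; the numbers are for illustration only.

```python
import numpy as np

asset_grid = np.linspace(0, 16, 50)   # same shape as the default asset grid
σ_vals = 0.5 * asset_grid             # hypothetical policy values on the grid

# Between grid points, np.interp interpolates linearly
mid = 0.5 * (asset_grid[3] + asset_grid[4])
inside = np.interp(mid, asset_grid, σ_vals)

# Above the largest grid point, np.interp returns the last value
above = np.interp(20.0, asset_grid, σ_vals)

# Below the smallest grid point, it returns the first value
below = np.interp(-1.0, asset_grid, σ_vals)
```

Since next period assets 𝑅(𝑎 − 𝑐) + 𝑦 can exceed the largest grid point, the interpolated policy is treated as flat beyond `grid_max`, which is one reason to choose `grid_max` comfortably above the region of interest.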
The next step is to obtain the root of the Euler difference.
@njit
def K(σ, ifp):
    """
    The operator K.

    """
    σ_new = np.empty_like(σ)
    for i, a in enumerate(ifp.asset_grid):
        for z in (0, 1):
            result = brentq(euler_diff, 1e-8, a, args=(a, z, σ, ifp))
            σ_new[i, z] = result.root

    return σ_new
With the operator 𝐾 in hand, we can choose an initial condition and start to iterate.
The following function iterates to convergence and returns the approximate optimal policy.

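def solve_model_time_iter(model,    # Class with model information
                          σ,        # Initial condition
                          tol=1e-4,
                          max_iter=1000,
                          verbose=True,
                          print_skip=25):

    # Set up loop
    i = 0
    error = tol + 1

    while i < max_iter and error > tol:
        σ_new = K(σ, model)
        error = np.max(np.abs(σ - σ_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        σ = σ_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return σ_new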
Let’s carry this out using the default parameters of the IFP class:
ifp = IFP()
# Set up initial consumption policy of consuming all assets at all z

z_size = len(ifp.P)
a_grid = ifp.asset_grid
a_size = len(a_grid)
σ_init = np.repeat(a_grid.reshape(a_size, 1), z_size, axis=1)
σ_star = solve_model_time_iter(ifp, σ_init)

Here’s a plot of the resulting policy for each exogenous state 𝑧.
fig, ax = plt.subplots()
for z in range(z_size):
    label = rf'$\sigma^*(\cdot, {z})$'
    ax.plot(a_grid, σ_star[:, z], label=label)
ax.set(xlabel='assets', ylabel='consumption')
ax.legend()
plt.show()
The following exercises walk you through several applications where policy functions are computed.
42.4.1 A Sanity Check
One way to check our results is to

• set labor income to zero in each state and
• set the gross interest rate 𝑅 to unity.
In this case, our income fluctuation problem is just a cake eating problem.
We know that, in this case, the value function and optimal consumption policy are given by
def c_star(x, β, γ):
    return (1 - β ** (1/γ)) * x

def v_star(x, β, γ):
    return (1 - β**(1 / γ))**(-γ) * (x**(1-γ) / (1-γ))
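Before comparing with the numerical solution, it can be reassuring to confirm that the analytical policy satisfies the Euler equation of the cake eating problem, which with 𝑅 = 1 and no labor income reads 𝑢′(𝑐𝑡) = 𝛽𝑢′(𝑐𝑡+1). The following is a minimal sketch of that check; the grid `x` and the name `euler_resid` are illustrative assumptions, and `c_star` is redefined locally so that the block runs on its own.

```python
import numpy as np

β, γ = 0.96, 1.5                      # default preference parameters

def c_star(x, β, γ):
    "Analytical cake eating policy."
    return (1 - β**(1 / γ)) * x

u_prime = lambda c: c**(-γ)            # CRRA marginal utility

x = np.linspace(0.5, 16, 25)           # current cake sizes (illustrative)
c_now = c_star(x, β, γ)
x_next = x - c_now                     # R = 1 and zero labor income
c_next = c_star(x_next, β, γ)

# Euler residual u'(c_t) - β u'(c_{t+1}), which should vanish
euler_resid = u_prime(c_now) - β * u_prime(c_next)
```

The residual is zero up to rounding because 𝑥𝑡+1 = 𝛽^{1/𝛾} 𝑥𝑡 under this policy, so 𝑢′(𝑐𝑡+1) = 𝑢′(𝑐𝑡)/𝛽.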
Let’s see if we match up:
ifp_cake_eating = IFP(r=0.0, y=(0.0, 0.0))
σ_star = solve_model_time_iter(ifp_cake_eating, σ_init)
fig, ax = plt.subplots()
ax.plot(a_grid, σ_star[:, 0], label='numerical')
ax.plot(a_grid, c_star(a_grid, ifp.β, ifp.γ), '--', label='analytical')
ax.set(xlabel='assets', ylabel='consumption')
ax.legend()
plt.show()


Success!
42.5 Exercises
Exercise 42.5.1
Let’s consider how the interest rate affects consumption.
Reproduce the following figure, which shows (approximately) optimal consumption policies for different interest rates
• Other than r, all parameters are at their default values.
• r steps through np.linspace(0, 0.04, 4).
• Consumption is plotted against assets for income shock fixed at the smallest value.
The figure shows that higher interest rates boost savings and hence suppress consumption.

Here’s one solution:
r_vals = np.linspace(0, 0.04, 4)

fig, ax = plt.subplots()
for r_val in r_vals:
    ifp = IFP(r=r_val)
    σ_star = solve_model_time_iter(ifp, σ_init, verbose=False)
    ax.plot(ifp.asset_grid, σ_star[:, 0], label=f'$r = {r_val:.3f}$')

ax.set(xlabel='asset level', ylabel='consumption (low income)')
ax.legend()
plt.show()

Exercise 42.5.2
Now let’s consider the long run asset levels held by households under the default parameters.
The following figure is a 45 degree diagram showing the law of motion for assets when consumption is optimal
ifp = IFP()
σ_star = solve_model_time_iter(ifp, σ_init, verbose=False)
a = ifp.asset_grid
R, y = ifp.R, ifp.y

fig, ax = plt.subplots()
for z, lb in zip((0, 1), ('low income', 'high income')):
    ax.plot(a, R * (a - σ_star[:, z]) + y[z], label=lb)

ax.plot(a, a, 'k--')
ax.set(xlabel='current assets', ylabel='next period assets')
ax.legend()
plt.show()

The unbroken lines show the update function for assets at each 𝑧, which is
𝑎 ↦ 𝑅(𝑎 − 𝜎∗ (𝑎, 𝑧)) + 𝑦(𝑧)
The dashed line is the 45 degree line.

We can see from the figure that the dynamics will be stable — assets do not diverge even in the highest state.
In fact there is a unique stationary distribution of assets that we can calculate by simulation
• Can be proved via theorem 2 of [Hopenhayn and Prescott, 1992].
• It represents the long run dispersion of assets across households when households have idiosyncratic shocks.
Ergodicity is valid here, so stationary probabilities can be calculated by averaging over a single long time series.
Hence to approximate the stationary distribution we can simulate a long time series for assets and histogram it.
Your task is to generate such a histogram.
• Use a single time series {𝑎𝑡 } of length 500,000.
• Given the length of this time series, the initial condition (𝑎0 , 𝑧0 ) will not matter.
• You might find it helpful to use the MarkovChain class from quantecon.

First we write a function to compute a long asset series.

def compute_asset_series(ifp, T=500_000, seed=1234):
    """
    Simulates a time series of length T for assets, given optimal
    savings behavior.

    ifp is an instance of IFP
    """
    P, y, R = ifp.P, ifp.y, ifp.R  # Simplify names

    # Solve for the optimal policy
    σ_star = solve_model_time_iter(ifp, σ_init, verbose=False)
    σ = lambda a, z: np.interp(a, ifp.asset_grid, σ_star[:, z])

    # Simulate the exogenous state process
    mc = MarkovChain(P)
    z_seq = mc.simulate(T, random_state=seed)

    # Simulate the asset path
    a = np.zeros(T+1)
    for t in range(T):
        z = z_seq[t]
        a[t+1] = R * (a[t] - σ(a[t], z)) + y[z]
    return a
Now we call the function, generate the series and then histogram it:
ifp = IFP()
a = compute_asset_series(ifp)
fig, ax = plt.subplots()
ax.hist(a, bins=20, alpha=0.5, density=True)
ax.set(xlabel='assets')
plt.show()

The shape of the asset distribution is unrealistic.

Here it is left skewed when in reality it has a long right tail.
In a subsequent lecture we will rectify this by adding more realistic features to the model.
Exercise 42.5.3
Following on from exercises 1 and 2, let’s look at how savings and aggregate asset holdings vary with the interest rate
Note: [Ljungqvist and Sargent, 2018] section 18.6 can be consulted for more background on the topic treated in this
exercise.
For a given parameterization of the model, the mean of the stationary distribution of assets can be interpreted as aggregate
capital in an economy with a unit mass of ex-ante identical households facing idiosyncratic shocks.
Your task is to investigate how this measure of aggregate capital varies with the interest rate.
Following tradition, put the price (i.e., interest rate) on the vertical axis.
On the horizontal axis put aggregate capital, computed as the mean of the stationary distribution given the interest rate.

Here’s one solution

M = 25
r_vals = np.linspace(0, 0.02, M)
fig, ax = plt.subplots()

asset_mean = []
for r in r_vals:
    print(f'Solving model at r = {r}')
    ifp = IFP(r=r)
    mean = np.mean(compute_asset_series(ifp, T=250_000))
    asset_mean.append(mean)
ax.plot(asset_mean, r_vals)

ax.set(xlabel='capital', ylabel='interest rate')
plt.show()
Solving model at r = 0.0

As expected, aggregate savings increases with the interest rate.


CHAPTER
FORTYTHREE
THE INCOME FLUCTUATION PROBLEM II: STOCHASTIC RETURNS ON ASSETS
Contents
• The Income Fluctuation Problem II: Stochastic Returns on Assets

– Overview
– The Savings Problem
– Solution Algorithm
– Implementation
– Exercises
43.1 Overview
In this lecture, we continue our study of the income fluctuation problem.

While the interest rate was previously taken to be fixed, we now allow returns on assets to be state-dependent.
This matches the fact that most households with a positive level of assets face some capital income risk.
It has been argued that modeling capital income risk is essential for understanding the joint distribution of income and
wealth (see, e.g., [Benhabib et al., 2015] or [Stachurski and Toda, 2019]).
Theoretical properties of the household savings model presented here are analyzed in detail in [Ma et al., 2020].
In terms of computation, we use a combination of time iteration and the endogenous grid method to solve the model
quickly and accurately.
We require the following imports:

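import matplotlib.pyplot as plt
import numpy as np
from quantecon import MarkovChain
from numba import njit, float64
from numba.experimental import jitclass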
43.2 The Savings Problem
In this section we review the household problem and optimality results.
43.2.1 Set Up
A household chooses a consumption-asset path {(𝑐𝑡 , 𝑎𝑡 )} to maximize

$$\mathbb{E} \left\{ \sum_{t=0}^{\infty} \beta^t u(c_t) \right\} \tag{43.1}$$
subject to
𝑎𝑡+1 = 𝑅𝑡+1 (𝑎𝑡 − 𝑐𝑡 ) + 𝑌𝑡+1 and 0 ≤ 𝑐𝑡 ≤ 𝑎𝑡 , (43.2)
with initial condition (𝑎0 , 𝑍0 ) = (𝑎, 𝑧) treated as given.

Note that {𝑅𝑡 }𝑡≥1 , the gross rate of return on wealth, is allowed to be stochastic.
The sequence {𝑌𝑡 }𝑡≥1 is non-financial income.
The stochastic components of the problem obey
𝑅𝑡 = 𝑅(𝑍𝑡 , 𝜁𝑡 ) and 𝑌𝑡 = 𝑌 (𝑍𝑡 , 𝜂𝑡 ), (43.3)
where
• the maps 𝑅 and 𝑌 are time-invariant nonnegative functions,
• the innovation processes {𝜁𝑡 } and {𝜂𝑡 } are IID and independent of each other, and
• {𝑍𝑡 }𝑡≥0 is an irreducible time-homogeneous Markov chain on a finite set Z
Let 𝑃 represent the Markov matrix for the chain {𝑍𝑡 }𝑡≥0 .
Our assumptions on preferences are the same as our previous lecture on the income fluctuation problem.
As before, 𝔼𝑧 𝑋̂ means expectation of next period value 𝑋̂ given current value 𝑍 = 𝑧.
43.2.2 Assumptions
We need restrictions to ensure that the objective (43.1) is finite and the solution methods described below converge.
We also need to ensure that the present discounted value of wealth does not grow too quickly.
When {𝑅𝑡 } was constant we required that 𝛽𝑅 < 1.
Now it is stochastic, we require that
$$\beta G_R < 1, \quad \text{where} \quad G_R := \lim_{n \to \infty} \left( \mathbb{E} \prod_{t=1}^{n} R_t \right)^{1/n} \tag{43.4}$$
Notice that, when {𝑅𝑡 } takes some constant value 𝑅, this reduces to the previous restriction 𝛽𝑅 < 1.
The value 𝐺𝑅 can be thought of as the long run (geometric) average gross rate of return.
More intuition behind (43.4) is provided in [Ma et al., 2020].
Discussion on how to check it is given below.
Finally, we impose some routine technical restrictions on non-financial income.

𝔼 𝑌𝑡 < ∞ and 𝔼 𝑢′ (𝑌𝑡 ) < ∞
One relatively simple setting where all these restrictions are satisfied is the IID and CRRA environment of [Benhabib et
al., 2015].
43.2.3 Optimality
Let the class of candidate consumption policies 𝒞 be defined as before.

In [Ma et al., 2020] it is shown that, under the stated assumptions,
• any 𝜎 ∈ 𝒞 satisfying the Euler equation is an optimal policy and
• exactly one such policy exists in 𝒞.
In the present setting, the Euler equation takes the form
$$(u' \circ \sigma)(a, z) = \max \left\{ \beta \, \mathbb{E}_z \hat R \, (u' \circ \sigma)[\hat R (a - \sigma(a, z)) + \hat Y, \, \hat Z], \; u'(a) \right\} \tag{43.5}$$
(Intuition and derivation are similar to our earlier lecture on the income fluctuation problem.)
We again solve the Euler equation using time iteration, iterating with a Coleman–Reffett operator 𝐾 defined to match the
Euler equation (43.5).
43.3 Solution Algorithm
43.3.1 A Time Iteration Operator
Our definition of the candidate class 𝜎 ∈ 𝒞 of consumption policies is the same as in our earlier lecture on the income
fluctuation problem.
For fixed 𝜎 ∈ 𝒞 and (𝑎, 𝑧) ∈ S, the value 𝐾𝜎(𝑎, 𝑧) of the function 𝐾𝜎 at (𝑎, 𝑧) is defined as the 𝜉 ∈ (0, 𝑎] that solves
$$u'(\xi) = \max \left\{ \beta \, \mathbb{E}_z \hat R \, (u' \circ \sigma)[\hat R (a - \xi) + \hat Y, \, \hat Z], \; u'(a) \right\} \tag{43.6}$$
The idea behind 𝐾 is that, as can be seen from the definitions, 𝜎 ∈ 𝒞 satisfies the Euler equation if and only if 𝐾𝜎(𝑎, 𝑧) =
𝜎(𝑎, 𝑧) for all (𝑎, 𝑧) ∈ S.
This means that fixed points of 𝐾 in 𝒞 and optimal consumption policies exactly coincide (see [Ma et al., 2020] for more
details).
43.3.2 Convergence Properties
As before, we pair 𝒞 with the distance

$$\rho(c, d) := \sup_{(a, z) \in \mathsf{S}} \left| (u' \circ c)(a, z) - (u' \circ d)(a, z) \right|$$
It can be shown that

1. (𝒞, 𝜌) is a complete metric space,
2. there exists an integer 𝑛 such that 𝐾 𝑛 is a contraction mapping on (𝒞, 𝜌), and
3. The unique fixed point of 𝐾 in 𝒞 is the unique optimal policy in 𝒞.
We now have a clear path to successfully approximating the optimal policy: choose some 𝜎 ∈ 𝒞 and then iterate with 𝐾
until convergence (as measured by the distance 𝜌).

43.3.3 Using an Endogenous Grid
In our study of the optimal growth model we found that it was possible to further accelerate time iteration via the endogenous grid method.
We will use the same method here.
The methodology is the same as it was for the optimal growth model, with the minor exception that we need to remember
that consumption is not always interior.
In particular, optimal consumption can be equal to assets when the level of assets is low.
Finding Optimal Consumption
The endogenous grid method (EGM) calls for us to take a grid of savings values 𝑠𝑖 , where each such 𝑠 is interpreted as
𝑠 = 𝑎 − 𝑐.
For the lowest grid point we take 𝑠0 = 0.
For the corresponding 𝑎0 , 𝑐0 pair we have 𝑎0 = 𝑐0 .
This happens close to the origin, where assets are low and the household consumes all that it can.
Although there are many solutions, the one we take is 𝑎0 = 𝑐0 = 0, which pins down the policy at the origin, aiding
interpolation.
For 𝑠 > 0, we have, by definition, 𝑐 < 𝑎, and hence consumption is interior.
Hence the max component of (43.5) drops out, and we solve for
$$c_i = (u')^{-1} \left\{ \beta \, \mathbb{E}_z \hat R \, (u' \circ \sigma)[\hat R s_i + \hat Y, \, \hat Z] \right\} \tag{43.7}$$
at each 𝑠𝑖 .
Iterating
Once we have the pairs {𝑠𝑖 , 𝑐𝑖 }, the endogenous asset grid is obtained by 𝑎𝑖 = 𝑐𝑖 + 𝑠𝑖 .
Also, we held 𝑧 ∈ Z fixed in the discussion above, so we can pair it with 𝑎𝑖 .
An approximation of the policy (𝑎, 𝑧) ↦ 𝜎(𝑎, 𝑧) can be obtained by interpolating {𝑎𝑖 , 𝑐𝑖 } at each 𝑧.
In what follows, we use linear interpolation.
43.3.4 Testing the Assumptions
Convergence of time iteration is dependent on the condition 𝛽𝐺𝑅 < 1 being satisfied.
One can check this using the fact that 𝐺𝑅 is equal to the spectral radius of the matrix 𝐿 defined by
$$L(z, \hat z) := P(z, \hat z) \int R(\hat z, x) \phi(x) \, dx$$
This identity is proved in [Ma et al., 2020], where 𝜙 is the density of the innovation 𝜁𝑡 to returns on assets.
(Remember that Z is a finite set, so this expression defines a matrix.)
Checking the condition is even easier when {𝑅𝑡 } is IID.
In that case, it is clear from the definition of 𝐺𝑅 that 𝐺𝑅 is just 𝔼𝑅𝑡 .
We test the condition 𝛽𝔼𝑅𝑡 < 1 in the code below.
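For readers who want to see the general spectral radius test in action, here is a minimal sketch under the lognormal specification for returns, using the parameter values that appear as defaults in the class below. In this specification 𝔼𝑅(ẑ, 𝜁) does not depend on ẑ, so the spectral radius of 𝐿 should coincide with 𝔼𝑅𝑡. The names `mean_R`, `L`, `G_R` and `stable` are hypothetical.

```python
import numpy as np

# Parameter values matching the class defaults in this lecture
β = 0.96
a_r, b_r = 0.1, 0.0
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# E R(ẑ, ζ) for R = exp(a_r ζ + b_r) with ζ standard normal;
# under this specification it is the same for every ẑ
ER = np.exp(b_r + a_r**2 / 2)
mean_R = np.full(len(P), ER)

# L(z, ẑ) = P(z, ẑ) * E R(ẑ, ζ); broadcasting scales column ẑ by mean_R[ẑ]
L = P * mean_R

# Spectral radius = largest modulus among the eigenvalues of L
G_R = np.max(np.abs(np.linalg.eigvals(L)))
stable = β * G_R < 1
```

If returns depended on the Markov state, only the construction of `mean_R` would change; the eigenvalue step would stay the same.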
43.4 Implementation
We will assume that 𝑅𝑡 = exp(𝑎𝑟 𝜁𝑡 + 𝑏𝑟 ) where 𝑎𝑟 , 𝑏𝑟 are constants and {𝜁𝑡 } is IID standard normal.
We allow labor income to be correlated, with
𝑌𝑡 = exp(𝑎𝑦 𝜂𝑡 + 𝑍𝑡 𝑏𝑦 )
where {𝜂𝑡 } is also IID standard normal and {𝑍𝑡 } is a Markov chain taking values in {0, 1}.
ifp_data = [
    ('γ', float64),              # utility parameter
    ('β', float64),              # discount factor
    ('P', float64[:, :]),        # transition probs for z_t
    ('a_r', float64),            # scale parameter for R_t
    ('b_r', float64),            # additive parameter for R_t
    ('a_y', float64),            # scale parameter for Y_t
    ('b_y', float64),            # additive parameter for Y_t
    ('s_grid', float64[:]),      # Grid over savings
    ('η_draws', float64[:]),     # Draws of innovation η for MC
    ('ζ_draws', float64[:])      # Draws of innovation ζ for MC
]

@jitclass(ifp_data)
class IFP:
    """
    A class that stores primitives for the income fluctuation
    problem.
    """

    def __init__(self,
                 γ=1.5,
                 β=0.96,
                 P=np.array([(0.9, 0.1),
                             (0.1, 0.9)]),
                 a_r=0.1,
                 b_r=0.0,
                 a_y=0.2,
                 b_y=0.5,
                 shock_draw_size=50,
                 grid_max=10,
                 grid_size=100,
                 seed=1234):

        np.random.seed(seed)  # arbitrary seed

        self.P, self.γ, self.β = P, γ, β
        self.a_r, self.b_r, self.a_y, self.b_y = a_r, b_r, a_y, b_y
        self.η_draws = np.random.randn(shock_draw_size)
        self.ζ_draws = np.random.randn(shock_draw_size)
        self.s_grid = np.linspace(0, grid_max, grid_size)

        # Test stability assuming {R_t} is IID and adopts the lognormal
        # specification given below.  The test is then β E R_t < 1.
        ER = np.exp(b_r + a_r**2 / 2)
        assert β * ER < 1, "Stability condition failed."

    # Marginal utility
    def u_prime(self, c):
        return c**(-self.γ)

    # Inverse of marginal utility
    def u_prime_inv(self, c):
        return c**(-1/self.γ)

    def R(self, z, ζ):
        return np.exp(self.a_r * ζ + self.b_r)

    def Y(self, z, η):
        return np.exp(self.a_y * η + (z * self.b_y))
Here’s the Coleman-Reffett operator based on EGM:
@njit
def K(a_in, σ_in, ifp):
    """
    The Coleman--Reffett operator for the income fluctuation problem,
    using the endogenous grid method.

        * a_in[i, z] is an asset grid
        * σ_in[i, z] is consumption at a_in[i, z]
    """

    # Simplify names
    u_prime, u_prime_inv = ifp.u_prime, ifp.u_prime_inv
    R, Y, P, β = ifp.R, ifp.Y, ifp.P, ifp.β
    s_grid, η_draws, ζ_draws = ifp.s_grid, ifp.η_draws, ifp.ζ_draws
    n = len(P)

    # Create consumption function by linear interpolation
    σ = lambda a, z: np.interp(a, a_in[:, z], σ_in[:, z])

    # Allocate memory
    σ_out = np.empty_like(σ_in)

    # Obtain c_i at each s_i, z, store in σ_out[i, z], computing
    # the expectation term by Monte Carlo
    for i, s in enumerate(s_grid):
        for z in range(n):
            # Compute expectation
            Ez = 0.0
            for z_hat in range(n):
                for η in ifp.η_draws:
                    for ζ in ifp.ζ_draws:
                        R_hat = R(z_hat, ζ)
                        Y_hat = Y(z_hat, η)
                        U = u_prime(σ(R_hat * s + Y_hat, z_hat))
                        Ez += R_hat * U * P[z, z_hat]
            Ez = Ez / (len(η_draws) * len(ζ_draws))
            σ_out[i, z] = u_prime_inv(β * Ez)

    # Calculate endogenous asset grid
    a_out = np.empty_like(σ_out)
    for z in range(n):
        a_out[:, z] = s_grid + σ_out[:, z]

    # Fixing a consumption-asset pair at (0, 0) improves interpolation
    σ_out[0, :] = 0
    a_out[0, :] = 0

    return a_out, σ_out
The next function solves for an approximation of the optimal consumption policy via time iteration.

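def solve_model_time_iter(model,        # Class with model information
                          a_vec,        # Initial condition for assets
                          σ_vec,        # Initial condition for consumption
                          tol=1e-4,
                          max_iter=1000,
                          verbose=True,
                          print_skip=25):

    # Set up loop
    i = 0
    error = tol + 1

    while i < max_iter and error > tol:
        a_new, σ_new = K(a_vec, σ_vec, model)
        error = np.max(np.abs(σ_vec - σ_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        a_vec, σ_vec = np.copy(a_new), np.copy(σ_new)

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return a_new, σ_new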
Now we are ready to create an instance at the default parameters.
ifp = IFP()
Next we set up an initial condition, which corresponds to consuming all assets.
# Initial guess of σ = consume all assets

k = len(ifp.s_grid)
n = len(ifp.P)
σ_init = np.empty((k, n))
for z in range(n):
    σ_init[:, z] = ifp.s_grid
a_init = np.copy(σ_init)
Let’s generate an approximate solution.

a_star, σ_star = solve_model_time_iter(ifp, a_init, σ_init, print_skip=5)
Here’s a plot of the resulting consumption policy.
fig, ax = plt.subplots()
for z in range(len(ifp.P)):
    ax.plot(a_star[:, z], σ_star[:, z], label=f"consumption when $z={z}$")
plt.legend()
plt.show()
Notice that we consume all assets in the lower range of the asset space.
This is because we anticipate income 𝑌𝑡+1 tomorrow, which makes the need to save less urgent.
Can you explain why consuming all assets ends earlier (for lower values of assets) when 𝑧 = 0?
43.4.1 Law of Motion
Let’s try to get some idea of what will happen to assets over the long run under this consumption policy.
As with our earlier lecture on the income fluctuation problem, we begin by producing a 45 degree diagram showing the
law of motion for assets
# Good and bad state mean labor income
Y_mean = [np.mean(ifp.Y(z, ifp.η_draws)) for z in (0, 1)]
# Mean returns (R does not depend on z under this specification)
R_mean = np.mean(ifp.R(0, ifp.ζ_draws))

a = a_star
fig, ax = plt.subplots()
for z, lb in zip((0, 1), ('bad state', 'good state')):
    ax.plot(a[:, z], R_mean * (a[:, z] - σ_star[:, z]) + Y_mean[z], label=lb)

ax.plot(a[:, 0], a[:, 0], 'k--')
ax.set(xlabel='current assets', ylabel='next period assets')
ax.legend()
plt.show()

The unbroken lines represent, for each 𝑧, an average update function for assets, given by
$$a \mapsto \bar R (a - \sigma^*(a, z)) + \bar Y(z)$$
Here
• 𝑅̄ = 𝔼𝑅𝑡 , which is mean returns and
• 𝑌 ̄ (𝑧) = 𝔼𝑧 𝑌 (𝑧, 𝜂𝑡 ), which is mean labor income in state 𝑧.
The dashed line is the 45 degree line.
We can see from the figure that the dynamics will be stable — assets do not diverge even in the highest state.
43.5 Exercises
Exercise 43.5.1
Let’s repeat our earlier exercise on the long-run cross sectional distribution of assets.
In that exercise, we used a relatively simple income fluctuation model.
In the solution, we found the shape of the asset distribution to be unrealistic.
In particular, we failed to match the long right tail of the wealth distribution.
Your task is to try again, repeating the exercise, but now with our more sophisticated model.
Use the default parameters.

First we write a function to compute a long asset series.
Because we want to JIT-compile the function, we code the solution in a way that breaks some rules on good programming
style.
For example, we will pass in the solutions a_star, σ_star along with ifp, even though it would be more natural
to just pass in ifp and then solve inside the function.
The reason we do this is that solve_model_time_iter is not JIT-compiled.
@njit
def compute_asset_series(ifp, a_star, σ_star, z_seq, T=500_000):
    """
    Simulates a time series of length T for assets, given optimal
    savings behavior.

        * a_star is the endogenous grid solution
        * σ_star is optimal consumption on the grid
        * z_seq is a time path for {Z_t}

    """

    # Create consumption function by linear interpolation
    σ = lambda a, z: np.interp(a, a_star[:, z], σ_star[:, z])

    # Simulate the asset path
    a = np.zeros(T+1)
    for t in range(T):
        z = z_seq[t]
        ζ, η = np.random.randn(), np.random.randn()
        R = ifp.R(z, ζ)
        Y = ifp.Y(z, η)
        a[t+1] = R * (a[t] - σ(a[t], z)) + Y
    return a
Now we call the function, generate the series and then histogram it, using the solutions computed above.
T = 1_000_000
mc = MarkovChain(ifp.P)
z_seq = mc.simulate(T, random_state=1234)
a = compute_asset_series(ifp, a_star, σ_star, z_seq, T=T)

fig, ax = plt.subplots()
ax.hist(a, bins=40, alpha=0.5, density=True)
plt.show()

Now we have managed to successfully replicate the long right tail of the wealth distribution.
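One rough way to quantify a long right tail is to compare the mean with the median and to compute the share of the total held by the top percentile. The sketch below does this for a synthetic sample; the Pareto draw is a hypothetical stand-in for the simulated asset series `a`, and the helper `tail_summary` is not part of the lecture code.

```python
import numpy as np

def tail_summary(x, top=0.01):
    # Sort so that the largest observations sit at the end of the array
    x = np.sort(np.asarray(x, dtype=float))
    n_top = max(1, int(round(top * x.size)))
    return {
        "mean_over_median": x.mean() / np.median(x),   # > 1 signals right skew
        "top_share": x[-n_top:].sum() / x.sum(),       # share held by the top fraction
    }

rng = np.random.default_rng(1234)
sample = rng.pareto(2.5, size=100_000) + 1.0   # hypothetical heavy-tailed sample
stats = tail_summary(sample)
print(stats)
```

Applied to the simulated series, one would call `tail_summary(a)` instead of using the synthetic sample.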
Here’s another view of this using a horizontal violin plot.
fig, ax = plt.subplots()
ax.violinplot(a, vert=False, showmedians=True)
plt.show()

Part VII
Bayes Law
CHAPTER
FORTYFOUR
NON-CONJUGATE PRIORS
This lecture is a sequel to the quantecon lecture https://python.quantecon.org/prob_meaning.html.

That lecture offers a Bayesian interpretation of probability in a setting in which the likelihood function and the prior
distribution over parameters just happened to form a conjugate pair in which
• application of Bayes’ Law produces a posterior distribution that has the same functional form as the prior
Having a likelihood and prior that are conjugate can simplify calculation of a posterior, facilitating analytical or nearly
analytical calculations.
But in many situations the likelihood and prior need not form a conjugate pair.
• after all, a person’s prior is his or her own business and would take a form conjugate to a likelihood only by remote
coincidence
In these situations, computing a posterior can become very challenging.
In this lecture, we illustrate how modern Bayesians confront non-conjugate priors by using Monte Carlo techniques that
involve
• first cleverly forming a Markov chain whose invariant distribution is the posterior distribution we want
• simulating the Markov chain until it has converged and then sampling from the invariant distribution to approximate
the posterior
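The two steps above can be sketched with a hand-written sampler. The example below uses a plain random-walk Metropolis chain, which is a simpler relative of the NUTS sampler used later in this lecture, not the same algorithm. The data (`k` successes in `n` trials), the Beta(5, 5) prior and the proposal scale `step` are illustrative assumptions; a Beta prior is chosen only so that the simulated mean can be checked against a known closed form.

```python
import numpy as np

def log_posterior_kernel(theta, k, n, alpha=5.0, beta=5.0):
    # Log of likelihood times prior, up to an additive constant
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return ((k + alpha - 1.0) * np.log(theta)
            + (n - k + beta - 1.0) * np.log(1.0 - theta))

def metropolis(k, n, num_samples=20_000, num_warmup=2_000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.5
    logp = log_posterior_kernel(theta, k, n)
    draws = np.empty(num_samples)
    for i in range(num_warmup + num_samples):
        proposal = theta + step * rng.standard_normal()
        logp_new = log_posterior_kernel(proposal, k, n)
        # Accept with probability min(1, ratio of posterior kernels)
        if np.log(rng.uniform()) < logp_new - logp:
            theta, logp = proposal, logp_new
        # Keep draws only after the warm-up phase
        if i >= num_warmup:
            draws[i - num_warmup] = theta
    return draws

k, n = 16, 20
draws = metropolis(k, n)
exact_mean = (5.0 + k) / (10.0 + n)   # mean of Beta(alpha + k, beta + n - k)
print(draws.mean(), exact_mean)
```

The chain only needs the posterior up to a normalizing constant, which is why this approach carries over to priors for which the normalizing integral has no closed form.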
We shall illustrate the approach by deploying two powerful Python modules that implement this approach as well as
another closely related one to be described below.
The two Python modules are
• numpyro
• pyro
As usual, we begin by importing some Python code.
# install dependencies
!pip install numpyro pyro-ppl torch jax
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import binom
import scipy.stats as st
import torch
# jax
import jax.numpy as jnp
from jax import lax, random
# pyro
import pyro
from pyro import distributions as dist
import pyro.distributions.constraints as constraints
from pyro.infer import MCMC, NUTS, SVI, ELBO, Trace_ELBO
from pyro.optim import Adam
# numpyro
import numpyro
from numpyro import distributions as ndist
import numpyro.distributions.constraints as nconstraints
from numpyro.infer import MCMC as nMCMC
from numpyro.infer import NUTS as nNUTS
from numpyro.infer import SVI as nSVI
from numpyro.infer import ELBO as nELBO
from numpyro.infer import Trace_ELBO as nTrace_ELBO
from numpyro.optim import Adam as nAdam
44.1 Unleashing MCMC on a Binomial Likelihood
This lecture begins with the binomial example in the quantecon lecture.
That lecture computed a posterior
• analytically via choosing the conjugate priors,
This lecture instead computes posteriors
• numerically by sampling from the posterior distribution through MCMC methods, and
• using a variational inference (VI) approximation.
We use both the packages pyro and numpyro with assistance from jax to approximate a posterior distribution.
We use several alternative prior distributions.
We compare computed posteriors with ones associated with a conjugate prior as described in the quantecon lecture.
44.1.1 Analytical Posterior
Assume that the random variable 𝑋 ∼ 𝐵𝑖𝑛𝑜𝑚 (𝑛, 𝜃).

This defines a likelihood function
$$ L(Y \mid \theta) = \textrm{Prob}(X = k \mid \theta) = \left( \frac{n!}{k!(n-k)!} \right) \theta^k (1-\theta)^{n-k} $$
where 𝑌 = 𝑘 is an observed data point.
We view 𝜃 as a random variable for which we assign a prior distribution having density 𝑓(𝜃).
We will try alternative priors later, but for now, suppose the prior is distributed as 𝜃 ∼ 𝐵𝑒𝑡𝑎 (𝛼, 𝛽), i.e.,
$$ f(\theta) = \textrm{Prob}(\theta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)} $$

We choose this as our prior for now because we know that a conjugate prior for the binomial likelihood function is a beta
distribution.
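After observing 𝑘 successes among 𝑁 sample observations, the posterior probability distribution of 𝜃 is
$$
\begin{aligned}
\textrm{Prob}(\theta \mid k)
&= \frac{\textrm{Prob}(\theta, k)}{\textrm{Prob}(k)}
 = \frac{\textrm{Prob}(k \mid \theta)\,\textrm{Prob}(\theta)}{\textrm{Prob}(k)}
 = \frac{\textrm{Prob}(k \mid \theta)\,\textrm{Prob}(\theta)}{\int_0^1 \textrm{Prob}(k \mid \theta)\,\textrm{Prob}(\theta)\, d\theta} \\
&= \frac{\binom{N}{k} (1-\theta)^{N-k} \theta^{k} \, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}}
        {\int_0^1 \binom{N}{k} (1-\theta)^{N-k} \theta^{k} \, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} \, d\theta} \\
&= \frac{(1-\theta)^{\beta+N-k-1} \theta^{\alpha+k-1}}{\int_0^1 (1-\theta)^{\beta+N-k-1} \theta^{\alpha+k-1} \, d\theta} .
\end{aligned}
$$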
Thus,
Prob(𝜃|𝑘) ∼ 𝐵𝑒𝑡𝑎(𝛼 + 𝑘, 𝛽 + 𝑁 − 𝑘)
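As a sanity check on this result, the following sketch normalizes likelihood times prior numerically on a grid and compares it with the Beta(𝛼 + 𝑘, 𝛽 + 𝑁 − 𝑘) density. The values of `alpha`, `beta`, `N` and `k` are illustrative assumptions, and `beta_pdf` is a small helper written for this example rather than a function from the lecture.

```python
import math
import numpy as np

def beta_pdf(theta, a, b):
    # Beta density computed from log-gamma functions
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return np.exp((a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta) - log_B)

alpha, beta, N, k = 5.0, 5.0, 20, 16      # illustrative values
theta = np.linspace(1e-6, 1 - 1e-6, 20_001)
dtheta = theta[1] - theta[0]

# Unnormalized posterior: likelihood times prior
# (the binomial coefficient cancels between numerator and denominator)
unnorm = theta**k * (1 - theta)**(N - k) * beta_pdf(theta, alpha, beta)
numerical = unnorm / (unnorm.sum() * dtheta)   # Riemann-sum normalization

analytical = beta_pdf(theta, alpha + k, beta + N - k)
print(np.max(np.abs(numerical - analytical)))
```

The printed number is the largest pointwise gap between the two densities on the grid; it should be small if the conjugate formula is right.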
The analytical posterior for a given conjugate beta prior is coded in the following Python code.
def simulate_draw(theta, n):
    """
    Draws a Bernoulli sample of size n with probability P(Y=1) = theta
    """
    rand_draw = np.random.rand(n)
    draw = (rand_draw < theta).astype(int)
    return draw


def analytical_beta_posterior(data, alpha0, beta0):
    """
    Computes analytically the posterior distribution with beta prior
    parametrized by (alpha0, beta0), given the observed data

    Parameters
    ---------
    data : np.ndarray.
        the 0/1 observations after which we calculate the posterior
    alpha0, beta0 : float.
        the parameters for the beta distribution as a prior

    Returns
    ---------
    The posterior beta distribution
    """
    num = len(data)
    up_num = data.sum()
    down_num = num - up_num
    return st.beta(alpha0 + up_num, beta0 + down_num)

44.1.2 Two Ways to Approximate Posteriors
Suppose that we don’t have a conjugate prior.

Then we can’t compute posteriors analytically.
Instead, we use computational tools to approximate the posterior distribution for a set of alternative prior distributions
using both Pyro and Numpyro packages in Python.
We first use the Markov Chain Monte Carlo (MCMC) algorithm.
We implement the NUTS sampler to sample from the posterior.
In that way we construct a sampling distribution that approximates the posterior.
After doing that we deploy another procedure called Variational Inference (VI).
In particular, we implement Stochastic Variational Inference (SVI) machinery in both Pyro and Numpyro.
The MCMC algorithm supposedly generates a more accurate approximation since in principle it directly samples from
the posterior distribution.
But it can be computationally expensive, especially when dimension is large.
A VI approach can be cheaper, but it is likely to produce an inferior approximation to the posterior, for the simple reason
that it requires guessing a parametric guide functional form that we use to approximate a posterior.
This guide function is likely at best to be an imperfect approximation.
By paying the cost of restricting the putative posterior to have a restricted functional form, the problem of approximating
a posterior is transformed to a well-posed optimization problem that seeks parameters of the putative posterior that
minimize a Kullback-Leibler (KL) divergence between the true posterior and the putative posterior distribution.
• minimizing the KL divergence is equivalent to maximizing a criterion called the Evidence Lower Bound
(ELBO), as we shall verify soon.
44.2 Prior Distributions
In order to be able to apply MCMC sampling or VI, Pyro and Numpyro require that a prior distribution satisfy special
properties:
• we must be able to sample from it;
• we must be able to compute the log pdf pointwise;
• the pdf must be differentiable with respect to the parameters.
We’ll want to define a distribution class.
We will use the following priors:
• a uniform distribution on $[\underline{\theta}, \overline{\theta}]$, where $0 \leq \underline{\theta} < \overline{\theta} \leq 1$.
• a truncated log-normal distribution with support on [0, 1] with parameters (𝜇, 𝜎).
– To implement this, let 𝑍 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙(𝜇, 𝜎) and let 𝑍̃ be 𝑍 truncated to the support [log(0), log(1)]; then
exp(𝑍̃) has a log normal distribution with bounded support [0, 1]. This can be easily coded since Numpyro
has a built-in truncated normal distribution, and Torch provides a TransformedDistribution class
that includes an exponential transformation.
– Alternatively, we can use a rejection sampling strategy by assigning the probability rate to 0 outside the bounds
and rescaling accepted samples, i.e., realizations that are within the bounds, by the total probability computed

via CDF of the original distribution. This can be implemented by defining a truncated distribution class with
pyro’s dist.Rejector class.
– We implement both methods in the below section and verify that they produce the same result.
• a shifted von Mises distribution that has support confined to [0, 1] with parameter (𝜇, 𝜅).
– Let 𝑋 ∼ 𝑣𝑜𝑛𝑀 𝑖𝑠𝑒𝑠(0, 𝜅). We know that 𝑋 has bounded support [−𝜋, 𝜋]. We can define a shifted von
Mises random variable 𝑋̃ = 𝑎 + 𝑏𝑋 where 𝑎 = 0.5, 𝑏 = 1/(2𝜋) so that 𝑋̃ is supported on [0, 1].
– This can be implemented using Torch’s TransformedDistribution class with its AffineTransform method.
– If instead, we want the prior to be von-Mises distributed with center 𝜇 = 0.5, we can choose a high concentration
level 𝜅 so that most mass is located between 0 and 1. Then we can truncate the distribution using the
above strategy. This can be implemented using pyro’s dist.Rejector class. We choose 𝜅 > 40 in this
case.
• a truncated Laplace distribution.
– We also considered a truncated Laplace distribution because its density comes in a piece-wise non-smooth
form and has a distinctive spiked shape.
– The truncated Laplace can be created using Numpyro’s TruncatedDistribution class.
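The rejection strategy described above for the truncated log-normal can be sketched with NumPy alone. This is an illustration of the idea rather than the Pyro `dist.Rejector` implementation used below; the function name `truncated_lognormal_rejection` and the parameter values are assumptions made for the example. The share of proposals that survive rejection should match the total probability of the accepted region computed from the CDF, which is the rescaling constant mentioned in the text.

```python
import math
import numpy as np
from statistics import NormalDist

def truncated_lognormal_rejection(mu, sigma, size, upp=1.0, seed=0):
    rng = np.random.default_rng(seed)
    accepted, n_proposed, n_accepted = [], 0, 0
    while n_accepted < size:
        x = rng.lognormal(mean=mu, sigma=sigma, size=size)
        keep = x[x < upp]                 # reject draws outside the bound
        accepted.append(keep)
        n_proposed += x.size
        n_accepted += keep.size
    draws = np.concatenate(accepted)[:size]
    return draws, n_accepted / n_proposed

mu, sigma = 0.0, 2.0
draws, accept_rate = truncated_lognormal_rejection(mu, sigma, size=50_000)

# Total probability of the accepted region, using log(X) ~ Normal(mu, sigma)
mass = NormalDist(mu, sigma).cdf(math.log(1.0))
print(accept_rate, mass)
```

Dividing the untruncated density by `mass` on the accepted region gives a density that integrates to one, which is why the log of this quantity is passed as a scale in the rejection-based classes.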
# used by Numpyro
def TruncatedLogNormal_trans(loc, scale):
    """
    Obtains the truncated log normal distribution using numpyro's TruncatedNormal and ExpTransform
    """
    base_dist = ndist.TruncatedNormal(low=jnp.log(0), high=jnp.log(1), loc=loc, scale=scale)
    return ndist.TransformedDistribution(
        base_dist, ndist.transforms.ExpTransform()
    )


def ShiftedVonMises(kappa):
    """
    Obtains the shifted von Mises distribution using AffineTransform
    """
    base_dist = ndist.VonMises(0, kappa)
    return ndist.TransformedDistribution(
        base_dist, ndist.transforms.AffineTransform(loc=0.5, scale=1/(2*jnp.pi))
    )


def TruncatedLaplace(loc, scale):
    """
    Obtains the truncated Laplace distribution on [0,1]
    """
    base_dist = ndist.Laplace(loc, scale)
    return ndist.TruncatedDistribution(
        base_dist, low=0.0, high=1.0
    )
# used by Pyro
class TruncatedLogNormal(dist.Rejector):
    """
    Define a TruncatedLogNormal distribution through rejection sampling in Pyro
    """
    def __init__(self, loc, scale_0, upp=1):
        self.upp = upp
        propose = dist.LogNormal(loc, scale_0)

        def log_prob_accept(x):
            return (x < upp).type_as(x).log()

        log_scale = dist.LogNormal(loc, scale_0).cdf(torch.as_tensor(upp)).log()
        super(TruncatedLogNormal, self).__init__(propose, log_prob_accept, log_scale)

    @constraints.dependent_property
    def support(self):
        return constraints.interval(0, self.upp)


class TruncatedvonMises(dist.Rejector):
    """
    Define a TruncatedvonMises distribution through rejection sampling in Pyro
    """
    def __init__(self, kappa, mu=0.5, low=0.0, upp=1.0):
        self.low, self.upp = low, upp
        propose = dist.VonMises(mu, kappa)

        def log_prob_accept(x):
            return ((x > low) & (x < upp)).type_as(x).log()

        log_scale = torch.log(
            torch.tensor(
                st.vonmises(kappa=kappa, loc=mu).cdf(upp)
                - st.vonmises(kappa=kappa, loc=mu).cdf(low))
        )
        super(TruncatedvonMises, self).__init__(propose, log_prob_accept, log_scale)

    @constraints.dependent_property
    def support(self):
        return constraints.interval(self.low, self.upp)
44.2.1 Variational Inference
Instead of directly sampling from the posterior, the variational inference method approximates an unknown posterior
distribution with a family of tractable distributions/densities.
It then seeks to minimize a measure of statistical discrepancy between the approximating and true posteriors.
Thus variational inference (VI) approximates a posterior by solving a minimization problem.
Let the latent parameter/variable that we want to infer be 𝜃.
Let the prior be 𝑝(𝜃) and the likelihood be 𝑝 (𝑌 |𝜃).
We want 𝑝 (𝜃|𝑌 ).
Bayes’ rule implies
$$ p(\theta \mid Y) = \frac{p(Y, \theta)}{p(Y)} = \frac{p(Y \mid \theta) \, p(\theta)}{p(Y)} $$
where
$$ p(Y) = \int d\theta \, p(Y \mid \theta) \, p(\theta) . \tag{44.1} $$
The integral on the right side of (44.1) is typically difficult to compute.

Consider a guide distribution 𝑞𝜙 (𝜃) parameterized by 𝜙 that we’ll use to approximate the posterior.
We choose parameters 𝜙 of the guide distribution to minimize a Kullback-Leibler (KL) divergence between the approximate
posterior 𝑞𝜙 (𝜃) and the posterior:
$$ D_{KL}(q(\theta; \phi) \, \| \, p(\theta \mid Y)) \equiv - \int d\theta \, q(\theta; \phi) \log \frac{p(\theta \mid Y)}{q(\theta; \phi)} $$
Thus, we want a variational distribution 𝑞 that solves
$$ \min_{\phi} \; D_{KL}(q(\theta; \phi) \, \| \, p(\theta \mid Y)) $$
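Note that
$$
\begin{aligned}
D_{KL}(q(\theta; \phi) \, \| \, p(\theta \mid Y))
&= - \int d\theta \, q(\theta; \phi) \log \frac{p(\theta \mid Y)}{q(\theta; \phi)} \\
&= - \int d\theta \, q(\theta) \log \frac{\frac{p(\theta, Y)}{p(Y)}}{q(\theta)} \\
&= - \int d\theta \, q(\theta) \log \frac{p(\theta, Y)}{q(\theta) \, p(Y)} \\
&= - \int d\theta \, q(\theta) \left[ \log \frac{p(\theta, Y)}{q(\theta)} - \log p(Y) \right] \\
&= - \int d\theta \, q(\theta) \log \frac{p(\theta, Y)}{q(\theta)} + \int d\theta \, q(\theta) \log p(Y) \\
&= - \int d\theta \, q(\theta) \log \frac{p(\theta, Y)}{q(\theta)} + \log p(Y)
\end{aligned}
$$
so that
$$ \log p(Y) = D_{KL}(q(\theta; \phi) \, \| \, p(\theta \mid Y)) + \int d\theta \, q_{\phi}(\theta) \log \frac{p(\theta, Y)}{q_{\phi}(\theta)} $$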
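For observed data 𝑌, log 𝑝(𝑌) is a constant that does not depend on 𝜙, so minimizing the KL divergence is equivalent to maximizing
$$ ELBO \equiv \int d\theta \, q_{\phi}(\theta) \log \frac{p(\theta, Y)}{q_{\phi}(\theta)} = \mathbb{E}_{q_{\phi}(\theta)} \left[ \log p(\theta, Y) - \log q_{\phi}(\theta) \right] \tag{44.2} $$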
Formula (44.2) is called the evidence lower bound (ELBO).
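The decomposition of log 𝑝(𝑌) into a KL term plus the ELBO can be checked numerically in a case where everything is computable on a grid. The sketch below assumes a binomial likelihood, a Beta(5, 5) prior and a few Beta guides; the data values and the helper names `log_beta_pdf` and `elbo_and_kl` are illustrative assumptions, and the integrals are approximated by Riemann sums rather than by the stochastic estimates that SVI uses.

```python
import math
import numpy as np

def log_beta_pdf(theta, a, b):
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta) - log_B

n, k = 20, 16                      # illustrative data: k successes in n trials
alpha0, beta0 = 5.0, 5.0           # illustrative Beta prior
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)
dtheta = theta[1] - theta[0]

log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
log_joint = (log_binom + k * np.log(theta) + (n - k) * np.log(1 - theta)
             + log_beta_pdf(theta, alpha0, beta0))           # log p(theta, Y)
log_evidence = np.log(np.sum(np.exp(log_joint)) * dtheta)    # log p(Y)
log_post = log_joint - log_evidence                          # log p(theta | Y)

def elbo_and_kl(a_q, b_q):
    # Guide q is Beta(a_q, b_q); both integrals are Riemann sums on the grid
    log_q = log_beta_pdf(theta, a_q, b_q)
    q = np.exp(log_q)
    elbo = np.sum(q * (log_joint - log_q)) * dtheta
    kl = np.sum(q * (log_q - log_post)) * dtheta
    return elbo, kl

for a_q, b_q in [(2.0, 2.0), (10.0, 5.0), (alpha0 + k, beta0 + n - k)]:
    elbo, kl = elbo_and_kl(a_q, b_q)
    print(f"Beta({a_q}, {b_q}) guide: ELBO={elbo:.4f}, KL={kl:.4f}, sum={elbo + kl:.4f}")
print(f"log p(Y) = {log_evidence:.4f}")
```

For every guide the two terms add up to the same log 𝑝(𝑌), and the KL term vanishes when the guide coincides with the conjugate posterior, so the ELBO is largest exactly where the KL divergence is smallest.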

A standard optimization routine can be used to search for the optimal 𝜙 in our parametrized distribution 𝑞𝜙 (𝜃).
The parameterized distribution 𝑞𝜙 (𝜃) is called the variational distribution.
We can implement Stochastic Variational Inference (SVI) in Pyro and Numpyro using the Adam gradient descent algorithm
to approximate the posterior.
We use two sets of variational distributions: Beta and TruncatedNormal with support [0, 1].
• Learnable parameters for the Beta distribution are (alpha, beta), both of which are positive.
• Learnable parameters for the Truncated Normal distribution are (loc, scale).
We restrict the truncated Normal parameter ‘loc’ to be in the interval [0, 1].

44.3 Implementation
We have constructed a Python class BayesianInference that requires the following arguments to be initialized:
• param: a tuple/scalar of parameters dependent on distribution types
• name_dist: a string that specifies distribution names
• solver: a string, either ‘pyro’ or ‘numpyro’
The (name_dist, param) pairs include:
• (‘beta’, alpha, beta)
• (‘uniform’, lower_bound, upper_bound)
• (‘lognormal’, loc, scale)
– Note: This is the truncated log normal.
• (‘vonMises’, kappa), where kappa denotes concentration parameter, and center location is set to 0.5.
– Note: When using Pyro, this is the truncated version of the original vonMises distribution;
– Note: When using Numpyro, this is the shifted distribution.
• (‘laplace’, loc, scale)
– Note: This is the truncated Laplace
The class BayesianInference has several key methods:
• sample_prior:
– This can be used to draw a single sample from the given prior distribution.
• show_prior:
– Plots the approximate prior distribution by repeatedly drawing samples and fitting a kernel density curve.
• MCMC_sampling:
– INPUT: (data, num_samples, num_warmup=1000)
– Take a np.array data and generate MCMC sampling of posterior of size num_samples.
• SVI_run:
– INPUT: (data, guide_dist, n_steps=10000)
– guide_dist = ‘normal’ - use a truncated normal distribution as the parametrized guide
– guide_dist = ‘beta’ - use a beta distribution as the parametrized guide
– RETURN: (params, losses) - the learned parameters in a dict and the vector of loss at each step.
class BayesianInference:
    def __init__(self, param, name_dist, solver):
        """
        Parameters
        ---------
        param : tuple.
            a tuple object that contains all relevant parameters for the distribution
        name_dist : str.
            name of the distribution - 'beta', 'uniform', 'lognormal', 'vonMises', 'laplace'
        solver : str.
            either pyro or numpyro
        """
        self.param = param
        self.name_dist = name_dist
        self.solver = solver

        # jax requires explicit PRNG state to be passed
        self.rng_key = random.PRNGKey(0)
    def sample_prior(self):
        """
        Define the prior distribution to sample from in Pyro/Numpyro models.
        """
        if self.name_dist=='beta':
            # unpack parameters
            alpha0, beta0 = self.param
            if self.solver=='pyro':
                sample = pyro.sample('theta', dist.Beta(alpha0, beta0))
            else:
                sample = numpyro.sample('theta', ndist.Beta(alpha0, beta0), rng_key=self.rng_key)

        elif self.name_dist=='uniform':
            # unpack parameters
            lb, ub = self.param
            if self.solver=='pyro':
                sample = pyro.sample('theta', dist.Uniform(lb, ub))
            else:
                sample = numpyro.sample('theta', ndist.Uniform(lb, ub), rng_key=self.rng_key)

        elif self.name_dist=='lognormal':
            # unpack parameters
            loc, scale = self.param
            if self.solver=='pyro':
                sample = pyro.sample('theta', TruncatedLogNormal(loc, scale))
            else:
                sample = numpyro.sample('theta', TruncatedLogNormal_trans(loc, scale), rng_key=self.rng_key)

        elif self.name_dist=='vonMises':
            # unpack parameters
            kappa = self.param
            if self.solver=='pyro':
                sample = pyro.sample('theta', TruncatedvonMises(kappa))
            else:
                sample = numpyro.sample('theta', ShiftedVonMises(kappa), rng_key=self.rng_key)

        elif self.name_dist=='laplace':
            # unpack parameters
            loc, scale = self.param
            if self.solver=='pyro':
                print("WARNING: Please use Numpyro for truncated Laplace.")
                sample = None
            else:
                sample = numpyro.sample('theta', TruncatedLaplace(loc, scale), rng_key=self.rng_key)

        return sample
    def show_prior(self, size=1e5, bins=20, disp_plot=1):
        """
        Visualizes prior distribution by sampling from prior and plots the approximated sampling distribution
        """
        self.bins = bins

        if self.solver=='pyro':
            with pyro.plate('show_prior', size=size):
                sample = self.sample_prior()
            # to numpy
            sample_array = sample.numpy()

        elif self.solver=='numpyro':
            with numpyro.plate('show_prior', size=size):
                sample = self.sample_prior()
            # to numpy
            sample_array=jnp.asarray(sample)

        # plot histogram and kernel density
        if disp_plot==1:
            sns.displot(sample_array, kde=True, stat='density', bins=bins, height=5, aspect=1.5)
            plt.xlim(0, 1)
            plt.show()
        else:
            return sample_array
    def model(self, data):
        """
        Define the probabilistic model by specifying prior, conditional likelihood, and data conditioning
        """
        if not torch.is_tensor(data):
            data = torch.tensor(data)
        # set prior
        theta = self.sample_prior()

        # sample from conditional likelihood
        if self.solver=='pyro':
            output = pyro.sample('obs', dist.Binomial(len(data), theta), obs=torch.sum(data))
        else:
            # Note: numpyro.sample() requires obs=np.ndarray
            output = numpyro.sample('obs', ndist.Binomial(len(data), theta), obs=torch.sum(data).numpy())
        return output

    def MCMC_sampling(self, data, num_samples, num_warmup=1000):
        """
        Computes numerically the posterior distribution for the chosen prior,
        given data, using MCMC
        """
        # tensorize
        data = torch.tensor(data)

        # use pyro
        if self.solver=='pyro':
            nuts_kernel = NUTS(self.model)
            mcmc = MCMC(nuts_kernel, num_samples=num_samples, warmup_steps=num_warmup, disable_progbar=True)
            mcmc.run(data)

        # use numpyro
        elif self.solver=='numpyro':
            nuts_kernel = nNUTS(self.model)
            mcmc = nMCMC(nuts_kernel, num_samples=num_samples, num_warmup=num_warmup, progress_bar=False)
            mcmc.run(self.rng_key, data=data)

        # collect samples
        samples = mcmc.get_samples()['theta']
        return samples
    def beta_guide(self, data):
        """
        Defines the candidate parametrized variational distribution that we train to approximate posterior with Pyro/Numpyro
        Here we use parameterized beta
        """
        if self.solver=='pyro':
            alpha_q = pyro.param('alpha_q', torch.tensor(0.5),
                            constraint=constraints.positive)
            beta_q = pyro.param('beta_q', torch.tensor(0.5),
                            constraint=constraints.positive)
            pyro.sample('theta', dist.Beta(alpha_q, beta_q))

        else:
            alpha_q = numpyro.param('alpha_q', 10,
                            constraint=nconstraints.positive)
            beta_q = numpyro.param('beta_q', 10,
                            constraint=nconstraints.positive)
            numpyro.sample('theta', ndist.Beta(alpha_q, beta_q))

    def truncnormal_guide(self, data):
        """
        Defines the candidate parametrized variational distribution that we train to approximate posterior with Pyro/Numpyro
        Here we use truncated normal on [0,1]
        """
        loc = numpyro.param('loc', 0.5,
                        constraint=nconstraints.interval(0.0, 1.0))
        scale = numpyro.param('scale', 1,
                        constraint=nconstraints.positive)
        numpyro.sample('theta', ndist.TruncatedNormal(loc, scale, low=0.0, high=1.0))
    def SVI_init(self, guide_dist, lr=0.0005):
        """
        Initiate SVI training mode with Adam optimizer
        NOTE: truncnormal_guide can only be used with numpyro solver
        """
        adam_params = {"lr": lr}

        if guide_dist=='beta':
            if self.solver=='pyro':
                optimizer = Adam(adam_params)
                svi = SVI(self.model, self.beta_guide, optimizer, loss=Trace_ELBO())

            elif self.solver=='numpyro':
                optimizer = nAdam(step_size=lr)
                svi = nSVI(self.model, self.beta_guide, optimizer, loss=nTrace_ELBO())

        elif guide_dist=='normal':
            # only allow numpyro
            if self.solver=='pyro':
                print("WARNING: Please use Numpyro with TruncatedNormal guide")
                svi = None

            elif self.solver=='numpyro':
                optimizer = nAdam(step_size=lr)
                svi = nSVI(self.model, self.truncnormal_guide, optimizer, loss=nTrace_ELBO())
        else:
            print("WARNING: Please input either 'beta' or 'normal'")
            svi = None

        return svi
    def SVI_run(self, data, guide_dist, n_steps=10000):
        """
        Runs SVI and returns optimized parameters and losses

        Returns
        --------
        params : the learned parameters for guide
        losses : a vector of loss at each step
        """
        # tensorize data
        if not torch.is_tensor(data):
            data = torch.tensor(data)

        # initiate SVI
        svi = self.SVI_init(guide_dist=guide_dist)

        # do gradient steps
        if self.solver=='pyro':
            # store loss vector
            losses = np.zeros(n_steps)
            for step in range(n_steps):
                losses[step] = svi.step(data)

            # pyro only supports beta VI distribution
            params = {
                'alpha_q': pyro.param('alpha_q').item(),
                'beta_q': pyro.param('beta_q').item()
            }

        elif self.solver=='numpyro':
            result = svi.run(self.rng_key, n_steps, data, progress_bar=False)
            params = dict(
                (key, np.asarray(value)) for key, value in result.params.items()
            )
            losses = np.asarray(result.losses)

        return params, losses
44.4 Alternative Prior Distributions
Let’s see how well our sampling algorithm does in approximating

• a log normal distribution
• a uniform distribution
To examine our alternative prior distributions, we’ll plot approximate prior distributions below by calling the
show_prior method.
We verify that the rejection sampling strategy under Pyro produces the same log normal distribution as the truncated
normal transformation under Numpyro.
# truncated log normal

exampleLN = BayesianInference(param=(0,2), name_dist='lognormal', solver='numpyro')
exampleLN.show_prior(size=100000,bins=20)
# truncated uniform
exampleUN = BayesianInference(param=(0.1,0.8), name_dist='uniform', solver='numpyro')
exampleUN.show_prior(size=100000,bins=20)


The above graphs show that sampling seems to work well with both distributions.
Now let’s see how well things work with a couple of von Mises distributions.
# shifted von Mises

exampleVM = BayesianInference(param=10, name_dist='vonMises', solver='numpyro')
exampleVM.show_prior(size=100000,bins=20)
# truncated von Mises

exampleVM_trunc = BayesianInference(param=20, name_dist='vonMises', solver='pyro')
exampleVM_trunc.show_prior(size=100000,bins=20)


These graphs look good too.

Now let’s try with a Laplace distribution.
# truncated Laplace
exampleLP = BayesianInference(param=(0.5,0.05), name_dist='laplace', solver='numpyro')
exampleLP.show_prior(size=100000,bins=40)

Having assured ourselves that our sampler seems to do a good job, let’s put it to work in using MCMC to compute posterior
probabilities.
44.5 Posteriors Via MCMC and VI
We construct a class BayesianInferencePlot to implement MCMC or VI algorithms and plot multiple posteriors
for different updating data sizes and different possible priors.
This class takes as inputs the true data generating parameter ‘theta’, a list of updating data sizes for multiple posterior
plotting, and a defined and parametrized BayesianInference class.
It has two key methods:
• BayesianInferencePlot.MCMC_plot() takes the desired MCMC sample size as input and plots the output
posteriors together with the prior defined in the BayesianInference class.
• BayesianInferencePlot.SVI_plot() takes the desired VI distribution class (‘beta’ or ‘normal’) as input and
plots the posteriors together with the prior.
class BayesianInferencePlot:
    """
    Easily implement the MCMC and VI inference for a given instance of BayesianInference class and
    plot the prior together with multiple posteriors

    Parameters
    ----------
    theta : float.
        the true DGP parameter
    N_list : list.
        a list of sample size
    BayesianInferenceClass : class.
        a class initiated using BayesianInference()
    """

    def __init__(self, theta, N_list, BayesianInferenceClass, binwidth=0.02):
        """
        Enter Parameters for data generation and plotting
        """
        self.theta = theta
        self.N_list = N_list
        self.BayesianInferenceClass = BayesianInferenceClass

        # plotting parameters
        self.binwidth = binwidth
        self.linewidth=0.05
        self.colorlist = sns.color_palette(n_colors=len(N_list))

        # data generation
        N_max = max(N_list)
        self.data = simulate_draw(theta, N_max)
    def MCMC_plot(self, num_samples, num_warmup=1000):
        """
        Parameters as in MCMC_sampling except that data is already defined
        """
        fig, ax = plt.subplots()

        # plot prior
        prior_sample = self.BayesianInferenceClass.show_prior(disp_plot=0)
        sns.histplot(
            data=prior_sample, kde=True, stat='density',
            binwidth=self.binwidth,
            color='#4C4E52',
            linewidth=self.linewidth,
            alpha=0.1,
            ax=ax,
            label='Prior Distribution'
        )

        # plot posteriors
        for id, n in enumerate(self.N_list):
            samples = self.BayesianInferenceClass.MCMC_sampling(
                self.data[:n], num_samples, num_warmup
            )
            sns.histplot(
                samples, kde=True, stat='density',
                alpha=0.2,
                color=self.colorlist[id-1],
                ax=ax,
                label=f'Posterior with $n={n}$'
            )
        ax.legend()
        ax.set_title('MCMC Sampling density of Posterior Distributions', fontsize=15)
        plt.xlim(0, 1)
        plt.show()
    def SVI_fitting(self, guide_dist, params):
        """
        Fit the beta/truncnormal curve using parameters trained by SVI.
        I create plot using PDF given by scipy.stats distributions since torch.dist do not have embedded PDF methods.
        """
        # create x axis
        xaxis = np.linspace(0,1,1000)
        if guide_dist=='beta':
            y = st.beta.pdf(xaxis, a=params['alpha_q'], b=params['beta_q'])

        elif guide_dist=='normal':
            # rescale upper/lower bound. See Scipy's truncnorm doc
            lower, upper = (0, 1)
            loc, scale = params['loc'], params['scale']
            a, b = (lower - loc) / scale, (upper - loc) / scale
            y = st.truncnorm.pdf(xaxis, a=a, b=b, loc=params['loc'], scale=params['scale'])
        return (xaxis, y)
    def SVI_plot(self, guide_dist, n_steps=2000):
        """
        Parameters as in SVI_run except that data is already defined
        """
        fig, ax = plt.subplots()

        # plot prior
        prior_sample = self.BayesianInferenceClass.show_prior(disp_plot=0)
        sns.histplot(
            data=prior_sample, kde=True, stat='density',
            color='#4C4E52',
            alpha=0.1,
            ax=ax,
            label='Prior Distribution'
        )

        # plot posteriors
        for id, n in enumerate(self.N_list):
            (params, losses) = self.BayesianInferenceClass.SVI_run(self.data[:n], guide_dist, n_steps)
            x, y = self.SVI_fitting(guide_dist, params)
            ax.plot(x, y,
                alpha=1,
                color=self.colorlist[id-1],
                label=f'Posterior with $n={n}$'
            )
        ax.legend()
        ax.set_title(f'SVI density of Posterior Distributions with {guide_dist} guide', fontsize=15)
        plt.xlim(0, 1)
        plt.show()
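The rescaling step in SVI_fitting reflects the fact that scipy's truncnorm expects its bounds `a` and `b` in standardized units, that is, measured in standard deviations from `loc`. The sketch below writes out the truncated normal density by hand with the same standardized bounds and checks that it integrates to one on [0, 1]. The function `truncnorm_pdf` and the values `loc=0.8`, `scale=0.1` are assumptions made for this illustration, not learned SVI parameters.

```python
import numpy as np
from statistics import NormalDist

def truncnorm_pdf(x, loc, scale, lower=0.0, upper=1.0):
    x = np.asarray(x, dtype=float)
    std = NormalDist()                                    # standard normal
    a, b = (lower - loc) / scale, (upper - loc) / scale   # standardized bounds
    mass = std.cdf(b) - std.cdf(a)      # probability that the untruncated normal lands in [lower, upper]
    z = (x - loc) / scale
    pdf = np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * scale * mass)
    return np.where((x >= lower) & (x <= upper), pdf, 0.0)

xaxis = np.linspace(0, 1, 1001)
y = truncnorm_pdf(xaxis, loc=0.8, scale=0.1)
area = np.sum((y[1:] + y[:-1]) / 2 * np.diff(xaxis))   # trapezoid rule
print(area)
```

Passing the unstandardized bounds 0 and 1 directly as `a` and `b` would instead truncate at `loc` and `loc + scale`, which is why the conversion is needed.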
Let’s set some parameters that we’ll use in all of the examples below.
To save computer time at first, we’ll set MCMC_num_samples = 2000 and SVI_num_steps = 5000.
(Later, to increase accuracy of approximations, we’ll want to increase these.)
num_list = [5,10,50,100,1000]
MCMC_num_samples = 2000
SVI_num_steps = 5000
# theta is the data generating process

true_theta = 0.8
44.5.1 Beta Prior and Posteriors:
Let’s compare outcomes when we use a Beta prior.

For the same Beta prior, we shall
• compute posteriors analytically
• compute posteriors using MCMC via Pyro and Numpyro.
• compute posteriors using VI via Pyro and Numpyro.
Let’s start with the analytical method that we described in this quantecon lecture https://python.quantecon.org/prob_meaning.html
# First examine Beta priors
BETA_pyro = BayesianInference(param=(5,5), name_dist='beta', solver='pyro')
BETA_numpyro = BayesianInference(param=(5,5), name_dist='beta', solver='numpyro')

BETA_pyro_plot = BayesianInferencePlot(true_theta, num_list, BETA_pyro)
BETA_numpyro_plot = BayesianInferencePlot(true_theta, num_list, BETA_numpyro)

# plot analytical Beta prior and posteriors
xaxis = np.linspace(0,1,1000)
y_prior = st.beta.pdf(xaxis, 5, 5)

fig, ax = plt.subplots()
# plot analytical beta prior
ax.plot(xaxis, y_prior, label='Analytical Beta Prior', color='#4C4E52')

data, colorlist, N_list = BETA_pyro_plot.data, BETA_pyro_plot.colorlist, BETA_pyro_plot.N_list
# plot analytical beta posteriors
for id, n in enumerate(N_list):
    func = analytical_beta_posterior(data[:n], alpha0=5, beta0=5)
    y_posterior = func.pdf(xaxis)
    ax.plot(
        xaxis, y_posterior, color=colorlist[id-1], label=f'Analytical Beta Posterior with $n={n}$')
ax.legend()
ax.set_title('Analytical Beta Prior and Posterior', fontsize=15)
plt.xlim(0, 1)
plt.show()
Now let’s compute posteriors numerically while still using a beta prior.

We’ll do this for both MCMC and VI.
BayesianInferencePlot(true_theta, num_list, BETA_pyro).MCMC_plot(num_samples=MCMC_num_samples)
BayesianInferencePlot(true_theta, num_list, BETA_numpyro).SVI_plot(guide_dist='beta', n_steps=SVI_num_steps)


Here the MCMC approximation looks good.

But the VI approximation doesn’t look so good.
• even though we use the beta distribution as our guide, the VI approximated posterior distributions do not closely
resemble the posteriors that we had just computed analytically.
(Here, our initial parameter for Beta guide is (0.5, 0.5).)
But if we increase the number of steps in VI from 5000 to 100000, as we now shall do, the VI-approximated posteriors
will be more accurate, as we shall see next.
(Increasing the number of steps increases computational time though.)
BayesianInferencePlot(true_theta, num_list, BETA_numpyro).SVI_plot(guide_dist='beta', n_steps=100000)

44.6 Non-conjugate Prior Distributions
Having assured ourselves that our MCMC and VI methods can work well when we have a conjugate prior and so can
also compute posteriors analytically, we next proceed to situations in which our prior is not a beta distribution, so we don’t have a
conjugate prior.
With non-conjugate priors, we can’t calculate posteriors analytically.
44.6.1 MCMC
First, we implement and display MCMC.

We first initialize the BayesianInference classes and then can directly call BayesianInferencePlot to plot
both MCMC and SVI approximating posteriors.
# Initialize BayesianInference classes
# try uniform
STD_UNIFORM_pyro = BayesianInference(param=(0,1), name_dist='uniform', solver='pyro')
UNIFORM_numpyro = BayesianInference(param=(0.2,0.7), name_dist='uniform', solver='numpyro')

# try truncated lognormal
LOGNORMAL_numpyro = BayesianInference(param=(0,2), name_dist='lognormal', solver='numpyro')
LOGNORMAL_pyro = BayesianInference(param=(0,2), name_dist='lognormal', solver='pyro')

# try von Mises
# shifted von Mises
VONMISES_numpyro = BayesianInference(param=10, name_dist='vonMises', solver='numpyro')
# truncated von Mises
VONMISES_pyro = BayesianInference(param=40, name_dist='vonMises', solver='pyro')

# try laplace
LAPLACE_numpyro = BayesianInference(param=(0.5, 0.07), name_dist='laplace', solver='numpyro')
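# Uniform
example_CLASS = STD_UNIFORM_pyro
print(f'=======INFO=======\nParameters: {example_CLASS.param}\nPrior Dist: {example_CLASS.name_dist}\nSolver: {example_CLASS.solver}')
BayesianInferencePlot(true_theta, num_list, example_CLASS).MCMC_plot(num_samples=MCMC_num_samples)

example_CLASS = UNIFORM_numpyro
print(f'=======INFO=======\nParameters: {example_CLASS.param}\nPrior Dist: {example_CLASS.name_dist}\nSolver: {example_CLASS.solver}')
BayesianInferencePlot(true_theta, num_list, example_CLASS).MCMC_plot(num_samples=MCMC_num_samples)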
=======INFO=======
Parameters: (0, 1)
Prior Dist: uniform
Solver: pyro

=======INFO=======
Parameters: (0.2, 0.7)
Prior Dist: uniform
Solver: numpyro



In the situation depicted above, we have assumed a $Uniform(\underline{\theta}, \overline{\theta})$ prior that puts zero probability outside a bounded
support that excludes the true value.
Consequently, the posterior cannot put positive probability above $\overline{\theta}$ or below $\underline{\theta}$.
Note how, when the true data-generating 𝜃 is located at 0.8 as it is here, the posterior concentrates on
the upper bound of the support of the prior, 0.7 here, as 𝑛 gets large.
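To see the mechanics behind this concentration, here is a minimal sketch that computes the posterior on a grid, independently of the BayesianInference class used in this lecture. The function name grid_posterior, the grid size, and the idealized data (exactly 80% successes in each sample) are our own illustrative assumptions.

```python
import numpy as np

def grid_posterior(k, n, lower, upper, grid_size=2001):
    """
    Posterior of theta on a grid after k successes in n Bernoulli trials,
    under a Uniform(lower, upper) prior (an illustrative sketch).
    """
    theta = np.linspace(0.0, 1.0, grid_size)
    # The prior is zero outside [lower, upper]; also drop 0 and 1 to avoid log(0)
    in_support = (theta > 0) & (theta < 1) & (theta >= lower) & (theta <= upper)
    log_post = np.full(grid_size, -np.inf)
    t = theta[in_support]
    log_post[in_support] = k * np.log(t) + (n - k) * np.log(1 - t)
    post = np.exp(log_post - log_post.max())   # stabilize before normalizing
    return theta, post / post.sum()

true_theta = 0.8
for n in (10, 100, 10_000):
    k = int(round(true_theta * n))             # idealized data: 80% successes
    theta, post = grid_posterior(k, n, lower=0.2, upper=0.7)
    print(f"n = {n:6d}, posterior mean = {(theta * post).sum():.4f}")
```

As 𝑛 grows, the posterior mean in this sketch moves toward the upper bound of the prior's support, because the likelihood keeps increasing in 𝜃 up to that bound.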
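# Log Normal
example_CLASS = LOGNORMAL_numpyro
print(f'=======INFO=======\nParameters: {example_CLASS.param}\nPrior Dist: {example_CLASS.name_dist}\nSolver: {example_CLASS.solver}')
BayesianInferencePlot(true_theta, num_list, example_CLASS).MCMC_plot(num_samples=MCMC_num_samples)

example_CLASS = LOGNORMAL_pyro
print(f'=======INFO=======\nParameters: {example_CLASS.param}\nPrior Dist: {example_CLASS.name_dist}\nSolver: {example_CLASS.solver}')
BayesianInferencePlot(true_theta, num_list, example_CLASS).MCMC_plot(num_samples=MCMC_num_samples)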
=======INFO=======
Parameters: (0, 2)
Prior Dist: lognormal
Solver: numpyro

=======INFO=======
Parameters: (0, 2)
Solver: pyro


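# Von Mises
example_CLASS = VONMISES_numpyro
print(f'=======INFO=======\nParameters: {example_CLASS.param}\nPrior Dist: {example_CLASS.name_dist}\nSolver: {example_CLASS.solver}')
print('\nNOTE: Shifted von Mises')
BayesianInferencePlot(true_theta, num_list, example_CLASS).MCMC_plot(num_samples=MCMC_num_samples)

example_CLASS = VONMISES_pyro
print(f'=======INFO=======\nParameters: {example_CLASS.param}\nPrior Dist: {example_CLASS.name_dist}\nSolver: {example_CLASS.solver}')
print('\nNOTE: Truncated von Mises')
BayesianInferencePlot(true_theta, num_list, example_CLASS).MCMC_plot(num_samples=MCMC_num_samples)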
=======INFO=======
Parameters: 10
Prior Dist: vonMises
Solver: numpyro
NOTE: Shifted von Mises


=======INFO=======
Parameters: 40
Solver: pyro
NOTE: Truncated von Mises

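# Laplace
example_CLASS = LAPLACE_numpyro
print(f'=======INFO=======\nParameters: {example_CLASS.param}\nPrior Dist: {example_CLASS.name_dist}\nSolver: {example_CLASS.solver}')
BayesianInferencePlot(true_theta, num_list, example_CLASS).MCMC_plot(num_samples=MCMC_num_samples)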
=======INFO=======
Prior Dist: laplace
Solver: numpyro

To get more accuracy, we will now increase the number of steps for Variational Inference (VI).

SVI_num_steps = 50000
VI with a Truncated Normal Guide
# Uniform
example_CLASS = BayesianInference(param=(0,1), name_dist='uniform', solver='numpyro')
BayesianInferencePlot(true_theta, num_list, example_CLASS).SVI_plot(guide_dist='normal', n_steps=SVI_num_steps)
=======INFO=======
Parameters: (0, 1)
Prior Dist: uniform
Solver: numpyro
# Log Normal


=======INFO=======
Parameters: (0, 2)
Solver: numpyro
# Von Mises
print('\nNB: Shifted von Mises')

=======INFO=======
Parameters: 10
Solver: numpyro
NB: Shifted von Mises


# Laplace

=======INFO=======
Prior Dist: laplace
Solver: numpyro

Variational Inference with a Beta Guide Distribution
# Uniform
example_CLASS = STD_UNIFORM_pyro
BayesianInferencePlot(true_theta, num_list, example_CLASS).SVI_plot(guide_dist='beta', n_steps=SVI_num_steps)
=======INFO=======
Parameters: (0, 1)
Prior Dist: uniform
Solver: pyro

# Log Normal

example_CLASS = LOGNORMAL_pyro

=======INFO=======
Parameters: (0, 2)
Solver: numpyro

=======INFO=======
Parameters: (0, 2)
Solver: pyro

# Von Mises
print('\nNB: Shifted von Mises')

example_CLASS = VONMISES_pyro
print('\nNB: Truncated von Mises')

=======INFO=======
Parameters: 10
Solver: numpyro
NB: Shifted von Mises

=======INFO=======
Parameters: 40
Solver: pyro
NB: Truncated von Mises

# Laplace

=======INFO=======
Prior Dist: laplace
Solver: numpyro



CHAPTER
FORTYFIVE
POSTERIOR DISTRIBUTIONS FOR AR(1) PARAMETERS
We’ll begin with some Python imports.
!pip install arviz pymc numpyro jax
import arviz as az
import pymc as pmc
import numpyro
from numpyro import distributions as dist
import numpy as np
import jax.numpy as jnp
from jax import random
import matplotlib.pyplot as plt
import logging
logging.basicConfig()
logger = logging.getLogger('pymc')
logger.setLevel(logging.CRITICAL)
This lecture uses Bayesian methods offered by pymc and numpyro to make statistical inferences about two parameters of
a univariate first-order autoregression.
The model is a good laboratory for illustrating consequences of alternative ways of modeling the distribution of the initial
𝑦0 :
• As a fixed number
• As a random variable drawn from the stationary distribution of the {𝑦𝑡 } stochastic process
The first component of the statistical model is
$$y_{t+1} = \rho y_t + \sigma_x \epsilon_{t+1}, \quad t \geq 0 \tag{45.1}$$
where the scalars 𝜌 and 𝜎𝑥 satisfy |𝜌| < 1 and 𝜎𝑥 > 0; {𝜖𝑡+1 } is a sequence of i.i.d. normal random variables with mean
0 and variance 1.
The second component of the statistical model is
$$y_0 \sim N(\mu_0, \sigma_0^2) \tag{45.2}$$
Consider a sample $\{y_t\}_{t=0}^T$ governed by this statistical model.

The model implies that the likelihood function of $\{y_t\}_{t=0}^T$ can be factored:
$$f(y_T, y_{T-1}, \ldots, y_0) = f(y_T | y_{T-1}) f(y_{T-1} | y_{T-2}) \cdots f(y_1 | y_0) f(y_0)$$
where we use 𝑓 to denote a generic probability density.

The statistical model (45.1)-(45.2) implies
$$f(y_t | y_{t-1}) \sim \mathcal{N}(\rho y_{t-1}, \sigma_x^2)$$
$$f(y_0) \sim \mathcal{N}(\mu_0, \sigma_0^2)$$
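The factorization can be turned directly into a log-likelihood function. The following is a minimal sketch in plain NumPy; the function names and the convention that omitting mu0 and sigma0 means "condition on the observed 𝑦0" are our own assumptions, not part of the lecture's code.

```python
import numpy as np

def normal_logpdf(x, mean, sd):
    "Log density of N(mean, sd^2) evaluated at x."
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mean) / sd)**2

def ar1_log_likelihood(y, rho, sigma_x, mu0=None, sigma0=None):
    """
    Log of f(y_T | y_{T-1}) ... f(y_1 | y_0) f(y_0) for the AR(1) model.

    If mu0 and sigma0 are omitted, the f(y_0) term is dropped, which
    corresponds to conditioning on the observed initial value.
    """
    y = np.asarray(y, dtype=float)
    # Transition terms: y_t | y_{t-1} ~ N(rho * y_{t-1}, sigma_x^2)
    loglik = np.sum(normal_logpdf(y[1:], rho * y[:-1], sigma_x))
    # Optional initial-condition term: y_0 ~ N(mu0, sigma0^2)
    if mu0 is not None and sigma0 is not None:
        loglik += normal_logpdf(y[0], mu0, sigma0)
    return loglik

y_demo = np.array([1.0, 0.4, 0.3, -0.2])   # made-up data for illustration
print(ar1_log_likelihood(y_demo, rho=0.5, sigma_x=1.0))
print(ar1_log_likelihood(y_demo, rho=0.5, sigma_x=1.0, mu0=0.0, sigma0=1.0))
```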
We want to study how inferences about the unknown parameters (𝜌, 𝜎𝑥 ) depend on what is assumed about the parameters
𝜇0 , 𝜎0 of the distribution of 𝑦0 .
Below, we study two widely used alternative assumptions:
• (𝜇0 , 𝜎0 ) = (𝑦0 , 0) which means that 𝑦0 is drawn from the distribution 𝒩(𝑦0 , 0); in effect, we are conditioning
on an observed initial value.
• 𝜇0 , 𝜎0 are functions of 𝜌, 𝜎𝑥 because 𝑦0 is drawn from the stationary distribution implied by 𝜌, 𝜎𝑥 .
Note: We do not treat a third possible case in which 𝜇0 , 𝜎0 are free parameters to be estimated.
Unknown parameters are 𝜌, 𝜎𝑥 .
We have independent prior probability distributions for 𝜌, 𝜎𝑥 and want to compute a posterior probability distribution
after observing a sample $\{y_t\}_{t=0}^T$.
The notebook uses pymc4 and numpyro to compute a posterior distribution of 𝜌, 𝜎𝑥 . We will use NUTS samplers to
generate samples from the posterior in a chain. Both of these libraries support NUTS samplers.
NUTS is a form of Markov chain Monte Carlo (MCMC) algorithm that avoids random walk behaviour and so converges
to a target distribution more quickly. This not only has the advantage of speed, but also allows complex
models to be fitted without having to employ specialised knowledge regarding the theory underlying those fitting methods.
Thus, we explore consequences of making these alternative assumptions about the distribution of 𝑦0 :
• A first procedure is to condition on whatever value of 𝑦0 is observed. This amounts to assuming that the probability
distribution of the random variable 𝑦0 is a Dirac delta function that puts probability one on the observed value of
𝑦0 .
• A second procedure assumes that 𝑦0 is drawn from the stationary distribution of a process described by (45.1) so
that $y_0 \sim N\left(0, \frac{\sigma_x^2}{1-\rho^2}\right)$
When the initial value 𝑦0 is far out in a tail of the stationary distribution, conditioning on an initial value gives a posterior
that is more accurate in a sense that we’ll explain.
Basically, when 𝑦0 happens to be in a tail of the stationary distribution and we don’t condition on 𝑦0 , the likelihood
function for $\{y_t\}_{t=0}^T$ adjusts the posterior distribution of the parameter pair 𝜌, 𝜎𝑥 to make the observed value of 𝑦0 more
likely than it really is under the stationary distribution, thereby adversely twisting the posterior in short samples.
An example below shows how not conditioning on 𝑦0 adversely shifts the posterior probability distribution of 𝜌 toward
larger values.
We begin by solving a direct problem that simulates an AR(1) process.
How we select the initial value 𝑦0 matters.
• If we think 𝑦0 is drawn from the stationary distribution $\mathcal{N}\left(0, \frac{\sigma_x^2}{1-\rho^2}\right)$, then it is a good idea to use this distribution
as 𝑓(𝑦0 ). Why? Because 𝑦0 contains information about 𝜌, 𝜎𝑥 .

• If we suspect that 𝑦0 is far in the tails of the stationary distribution, so that variation in early observations in the
sample has a significant transient component, it is better to condition on 𝑦0 by setting 𝑓(𝑦0 ) = 1.
To illustrate the issue, we’ll begin by choosing an initial 𝑦0 that is far out in a tail of the stationary distribution.

def ar1_simulate(rho, sigma, y0, T):

    # Allocate space and draw epsilons
    y = np.empty(T)
    eps = np.random.normal(0., sigma, T)

    # Initial condition and step forward
    y[0] = y0
    for t in range(1, T):
        y[t] = rho*y[t-1] + eps[t]

    return y

sigma = 1.
rho = 0.5
T = 50
y = ar1_simulate(rho, sigma, 10, T)
plt.plot(y)
plt.tight_layout()
Now we shall use Bayes’ law to construct a posterior distribution, conditioning on the initial value of 𝑦0 .
(Later we’ll assume that 𝑦0 is drawn from the stationary distribution, but not now.)
First we’ll use pymc4.
45.1 PyMC Implementation
For a normal distribution in pymc, 𝑣𝑎𝑟 = 1/𝜏 = 𝜎2 .
AR1_model = pmc.Model()

with AR1_model:

    # Start with priors
    rho = pmc.Uniform('rho', lower=-1., upper=1.)  # Assume stable rho
    sigma = pmc.HalfNormal('sigma', sigma=np.sqrt(10))

    # Expected value of y at the next period (rho * y)
    yhat = rho * y[:-1]

    # Likelihood of the actual realization
    y_like = pmc.Normal('y_obs', mu=yhat, sigma=sigma, observed=y[1:])
By default, pmc.sample uses the NUTS sampler to generate samples, as shown in the cell below:
with AR1_model:
    trace = pmc.sample(50000, tune=10000, return_inferencedata=True)
with AR1_model:
    az.plot_trace(trace, figsize=(17,6))

Evidently, the posteriors aren’t centered on the true values of .5, 1 that we used to generate the data.
This is a symptom of the classic Hurwicz bias for first order autoregressive processes (see Leonid Hurwicz [Hurwicz,
1950].)
The Hurwicz bias is worse the smaller is the sample (see [Orcutt and Winokur, 1969]).
Be that as it may, here is more information about the posterior.
with AR1_model:
    summary = az.summary(trace, round_to=4)

summary
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk \

rho 0.5364 0.0707 0.4004 0.6673 0.0002 0.0001 170111.0304
sigma 1.0102 0.1064 0.8231 1.2149 0.0003 0.0002 180608.9018
ess_tail r_hat
rho 121993.3820 1.0000
sigma 141816.1908 1.0001
Now we shall compute a posterior distribution after seeing the same data but instead assuming that 𝑦0 is drawn from the
stationary distribution.
This means that
$$y_0 \sim N\left(0, \frac{\sigma_x^2}{1-\rho^2}\right)$$
We alter the code as follows:
AR1_model_y0 = pmc.Model()

with AR1_model_y0:

    # Start with priors
    rho = pmc.Uniform('rho', lower=-1., upper=1.)  # Assume stable rho
    sigma = pmc.HalfNormal('sigma', sigma=np.sqrt(10))

    # Standard deviation of ergodic y
    y_sd = sigma / np.sqrt(1 - rho**2)

    # yhat
    yhat = rho * y[:-1]
    y_data = pmc.Normal('y_obs', mu=yhat, sigma=sigma, observed=y[1:])
    y0_data = pmc.Normal('y0_obs', mu=0., sigma=y_sd, observed=y[0])
with AR1_model_y0:
    trace_y0 = pmc.sample(50000, tune=10000, return_inferencedata=True)

# Grey vertical lines are the cases of divergence
with AR1_model_y0:
    az.plot_trace(trace_y0, figsize=(17,6))

with AR1_model_y0:
    summary_y0 = az.summary(trace_y0, round_to=4)

summary_y0
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk \

rho 0.8759 0.0817 0.7322 0.9994 0.0002 0.0002 99962.0282
sigma 1.4047 0.1473 1.1418 1.6853 0.0004 0.0003 129456.3061
ess_tail r_hat
rho 76692.0659 1.0
sigma 106632.5402 1.0
Please note how the posterior for 𝜌 has shifted to the right relative to when we conditioned on 𝑦0 instead of assuming that
𝑦0 is drawn from the stationary distribution.
Think about why this happens.
Hint: It is connected to how Bayes Law (conditional probability) solves an inverse problem by putting high probability
on parameter values that make observations more likely.
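A quick calculation makes the hint concrete. The sketch below (our own illustration, with a hypothetical helper name) evaluates the log density of the observed initial value 𝑦0 = 10 under the stationary distribution for several values of 𝜌, holding 𝜎𝑥 = 1 fixed.

```python
import numpy as np

def log_stationary_density(y0, rho, sigma):
    "Log density of y0 under N(0, sigma^2 / (1 - rho^2))."
    var0 = sigma**2 / (1 - rho**2)
    return -0.5 * np.log(2 * np.pi * var0) - 0.5 * y0**2 / var0

for rho_val in (0.5, 0.9, 0.99):
    print(f"rho = {rho_val:4.2f}, log f(y0=10) = "
          f"{log_stationary_density(10.0, rho_val, 1.0):8.2f}")
```

For these values, the log density of 𝑦0 = 10 rises sharply as 𝜌 moves toward 1, because a larger 𝜌 inflates the stationary variance. A posterior that includes the 𝑓(𝑦0 ) term is therefore pulled toward larger 𝜌.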
We’ll return to this issue after we use numpyro to compute posteriors under our two alternative assumptions about the
distribution of 𝑦0 .
We’ll now repeat the calculations using numpyro.
45.2 Numpyro Implementation
def plot_posterior(sample):
    """
    Plot trace and histogram
    """
    # To np array
    rhos = sample['rho']
    sigmas = sample['sigma']
    rhos, sigmas, = np.array(rhos), np.array(sigmas)

    fig, axs = plt.subplots(2, 2, figsize=(17, 6))
    # Plot trace
    axs[0, 0].plot(rhos)    # rho
    axs[1, 0].plot(sigmas)  # sigma

    # Plot posterior
    axs[0, 1].hist(rhos, bins=50, density=True, alpha=0.7)
    axs[0, 1].set_xlim([0, 1])
    axs[1, 1].hist(sigmas, bins=50, density=True, alpha=0.7)

    axs[0, 0].set_title("rho")
    axs[0, 1].set_title("rho")
    axs[1, 0].set_title("sigma")
    axs[1, 1].set_title("sigma")
    plt.show()

def AR1_model(data):
    # set prior
    rho = numpyro.sample('rho', dist.Uniform(low=-1., high=1.))
    sigma = numpyro.sample('sigma', dist.HalfNormal(scale=np.sqrt(10)))

    # Expected value of y at the next period (rho * y)
    yhat = rho * data[:-1]

    # Likelihood of the actual realization.
    y_data = numpyro.sample('y_obs', dist.Normal(loc=yhat, scale=sigma), obs=data[1:])

# Make jnp array
y = jnp.array(y)

# Set NUTS kernel
NUTS_kernel = numpyro.infer.NUTS(AR1_model)

# Run MCMC
mcmc = numpyro.infer.MCMC(NUTS_kernel, num_samples=50000, num_warmup=10000, progress_bar=False)
mcmc.run(rng_key=random.PRNGKey(1), data=y)
plot_posterior(mcmc.get_samples())
mcmc.print_summary()
mean std median 5.0% 95.0% n_eff r_hat

rho 0.54 0.07 0.54 0.42 0.66 34735.88 1.00
sigma 1.01 0.11 1.00 0.83 1.18 33245.42 1.00
Number of divergences: 0
Next, we again compute the posterior under the assumption that 𝑦0 is drawn from the stationary distribution, so that
$$y_0 \sim N\left(0, \frac{\sigma_x^2}{1-\rho^2}\right)$$
Here’s the new code to achieve this.

def AR1_model_y0(data):
    # Set prior
    rho = numpyro.sample('rho', dist.Uniform(low=-1., high=1.))
    sigma = numpyro.sample('sigma', dist.HalfNormal(scale=np.sqrt(10)))

    # Standard deviation of ergodic y
    y_sd = sigma / jnp.sqrt(1 - rho**2)

    # Expected value of y at the next period (rho * y)
    yhat = rho * data[:-1]

    # Likelihood of the actual realization, including the initial value
    y_data = numpyro.sample('y_obs', dist.Normal(loc=yhat, scale=sigma), obs=data[1:])
    y0_data = numpyro.sample('y0_obs', dist.Normal(loc=0., scale=y_sd), obs=data[0])

# Make jnp array
y = jnp.array(y)

# Set NUTS kernel
NUTS_kernel = numpyro.infer.NUTS(AR1_model_y0)

# Run MCMC
mcmc2 = numpyro.infer.MCMC(NUTS_kernel, num_samples=50000, num_warmup=10000, progress_bar=False)
mcmc2.run(rng_key=random.PRNGKey(1), data=y)
plot_posterior(mcmc2.get_samples())
mcmc2.print_summary()
mean std median 5.0% 95.0% n_eff r_hat

rho 0.88 0.08 0.89 0.76 1.00 29063.44 1.00
sigma 1.40 0.15 1.39 1.17 1.64 25674.09 1.00
Number of divergences: 0
Look what happened to the posterior!

It has moved far from the true values of the parameters used to generate the data because of how Bayes’ Law (i.e.,
conditional probability) is telling numpyro to explain what it interprets as “explosive” observations early in the sample.
Bayes’ Law is able to generate a plausible likelihood for the first observation by driving 𝜌 → 1 and 𝜎 ↑ in order to raise
the variance of the stationary distribution.
Our example illustrates the importance of what you assume about the distribution of initial conditions.

CHAPTER
FORTYSIX
FORECASTING AN AR(1) PROCESS
!pip install arviz pymc
This lecture describes methods for forecasting statistics that are functions of future values of a univariate autoregressive
process.
The methods are designed to take into account two possible sources of uncertainty about these statistics:
• random shocks that impinge on the transition law
• uncertainty about the parameter values of the AR(1) process
We consider two sorts of statistics:
• prospective values 𝑦𝑡+𝑗 of a random process {𝑦𝑡 } that is governed by the AR(1) process
• sample path properties that are defined as non-linear functions of future values {𝑦𝑡+𝑗 }𝑗≥1 at time 𝑡
Sample path properties are things like “time to next turning point” or “time to next recession”.
To investigate sample path properties we’ll use a simulation procedure recommended by Wecker [Wecker, 1979].
To acknowledge uncertainty about parameters, we’ll deploy pymc to construct a Bayesian joint posterior distribution for
unknown parameters.
import numpy as np
import arviz as az
import pymc as pmc
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
colors = sns.color_palette()
import logging
logging.basicConfig()
logger = logging.getLogger('pymc')
logger.setLevel(logging.CRITICAL)
46.1 A Univariate First-Order Autoregressive Process
Consider the univariate AR(1) model:
𝑦𝑡+1 = 𝜌𝑦𝑡 + 𝜎𝜖𝑡+1 , 𝑡≥0 (46.1)
where the scalars 𝜌 and 𝜎 satisfy |𝜌| < 1 and 𝜎 > 0; {𝜖𝑡+1 } is a sequence of i.i.d. normal random variables with mean 0
and variance 1.
The initial condition 𝑦0 is a known number.
Equation (46.1) implies that for 𝑡 ≥ 0, the conditional density of 𝑦𝑡+1 is
$$f(y_{t+1} | y_t; \rho, \sigma) \sim \mathcal{N}(\rho y_t, \sigma^2) \tag{46.2}$$
Further, equation (46.1) also implies that for 𝑡 ≥ 0, the conditional density of 𝑦𝑡+𝑗 for 𝑗 ≥ 1 is
$$f(y_{t+j} | y_t; \rho, \sigma) \sim \mathcal{N}\left(\rho^j y_t, \sigma^2 \frac{1 - \rho^{2j}}{1 - \rho^2}\right) \tag{46.3}$$
The predictive distribution (46.3) assumes that the parameters 𝜌, 𝜎 are known, which we express by conditioning on
them.
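As a sanity check on (46.3), the following sketch compares the closed-form conditional mean and variance with a Monte Carlo estimate obtained by iterating (46.1) forward 𝑗 steps. The function name and the parameter values are illustrative assumptions.

```python
import numpy as np

def predictive_moments(y_t, rho, sigma, j):
    "Mean and variance of y_{t+j} given y_t, following (46.3)."
    mean = rho**j * y_t
    var = sigma**2 * (1 - rho**(2 * j)) / (1 - rho**2)
    return mean, var

# Monte Carlo check: iterate y_{s+1} = rho * y_s + sigma * eps_{s+1} for j steps
rng = np.random.default_rng(0)
rho, sigma, y_t, j = 0.9, 1.0, 10.0, 5
y = np.full(200_000, y_t)
for _ in range(j):
    y = rho * y + sigma * rng.standard_normal(y.size)

mean, var = predictive_moments(y_t, rho, sigma, j)
print(f"closed form: mean = {mean:.3f}, var = {var:.3f}")
print(f"simulated:   mean = {y.mean():.3f}, var = {y.var():.3f}")
```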
We also want to compute a predictive distribution that does not condition on 𝜌, 𝜎 but instead takes account of our uncertainty about them.
We form this predictive distribution by integrating (46.3) with respect to a joint posterior distribution $\pi_t(\rho, \sigma | y^t)$ that
conditions on an observed history $y^t = \{y_s\}_{s=0}^t$:
$$f(y_{t+j} | y^t) = \int f(y_{t+j} | y_t; \rho, \sigma) \pi_t(\rho, \sigma | y^t) \, d\rho \, d\sigma \tag{46.4}$$
Predictive distribution (46.3) assumes that parameters (𝜌, 𝜎) are known.

Predictive distribution (46.4) assumes that parameters (𝜌, 𝜎) are uncertain, but have known probability distribution
$\pi_t(\rho, \sigma | y^t)$.
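In practice, the integral in (46.4) can be approximated by Monte Carlo: for each posterior draw of (𝜌, 𝜎), draw one value of 𝑦𝑡+𝑗 from (46.3). The sketch below illustrates the idea; the uniform arrays are hypothetical stand-ins for genuine posterior draws, and sample_predictive is our own helper name.

```python
import numpy as np

def sample_predictive(y_t, j, rho_draws, sigma_draws, rng):
    """
    One draw of y_{t+j} per parameter draw, using the conditional
    distribution (46.3) evaluated at each (rho, sigma) pair.
    """
    rho_draws = np.asarray(rho_draws, dtype=float)
    sigma_draws = np.asarray(sigma_draws, dtype=float)
    mean = rho_draws**j * y_t
    var = sigma_draws**2 * (1 - rho_draws**(2 * j)) / (1 - rho_draws**2)
    return mean + np.sqrt(var) * rng.standard_normal(rho_draws.size)

rng = np.random.default_rng(0)
# Hypothetical stand-ins for posterior draws (not output of a real sampler)
rho_draws = rng.uniform(0.8, 0.95, size=50_000)
sigma_draws = rng.uniform(0.9, 1.1, size=50_000)

y_future = sample_predictive(10.0, 5, rho_draws, sigma_draws, rng)
print(f"predictive mean = {y_future.mean():.3f}, "
      f"predictive std = {y_future.std():.3f}")
```

The resulting sample is a mixture over parameter values, so its spread reflects both shock uncertainty and parameter uncertainty.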
We also want to compute some predictive distributions of “sample path statistics” that might include, for example
• the time until the next “recession”,
• the minimum value of 𝑌 over the next 8 periods,
• “severe recession”, and
• the time until the next turning point (positive or negative).
To accomplish that for situations in which we are uncertain about parameter values, we shall extend Wecker’s [Wecker,
1979] approach in the following way.
• first simulate an initial path of length 𝑇0 ;
• for a given prior, draw a sample of size 𝑁 from the posterior joint distribution of parameters (𝜌, 𝜎) after observing
the initial path;
• for each draw 𝑛 = 0, 1, ..., 𝑁 , simulate a “future path” of length 𝑇1 with parameters (𝜌𝑛 , 𝜎𝑛 ) and compute our
three “sample path statistics”;
• finally, plot the desired statistics from the 𝑁 samples as an empirical distribution.

46.2 Implementation
First, we’ll simulate a sample path from which to launch our forecasts.
In addition to plotting the sample path, under the assumption that the true parameter values are known, we’ll plot .9 and
.95 coverage intervals using conditional distribution (46.3) described above.
We’ll also plot a bunch of samples of sequences of future values and watch where they fall relative to the coverage interval.
def AR1_simulate(rho, sigma, y0, T):

    # Allocate space and draw epsilons
    y = np.empty(T)
    eps = np.random.normal(0, sigma, T)

    # Initial condition and step forward
    y[0] = y0
    for t in range(1, T):
        y[t] = rho * y[t-1] + eps[t]

    return y

def plot_initial_path(initial_path):
    """
    Plot the initial path and the preceding predictive densities
    """
    # Compute .9 and .95 coverage intervals
    y0 = initial_path[-1]
    center = np.array([rho**j * y0 for j in range(T1)])
    vars = np.array([sigma**2 * (1 - rho**(2 * j)) / (1 - rho**2) for j in range(T1)])
    y_bounds1_c95, y_bounds2_c95 = center + 1.96 * np.sqrt(vars), center - 1.96 * np.sqrt(vars)
    y_bounds1_c90, y_bounds2_c90 = center + 1.645 * np.sqrt(vars), center - 1.645 * np.sqrt(vars)

    # Plot
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.set_title("Initial Path and Predictive Densities", fontsize=15)
    ax.plot(np.arange(-T0 + 1, 1), initial_path)
    ax.set_xlim([-T0, T1])
    ax.axvline(0, linestyle='--', alpha=.4, color='k', lw=1)

    # Simulate future paths
    for i in range(10):
        y_future = AR1_simulate(rho, sigma, y0, T1)
        ax.plot(np.arange(T1), y_future, color='grey', alpha=.5)

    # Plot coverage intervals
    ax.fill_between(np.arange(T1), y_bounds1_c95, y_bounds2_c95, alpha=.3, label='95% CI')
    ax.fill_between(np.arange(T1), y_bounds1_c90, y_bounds2_c90, alpha=.35, label='90% CI')
    ax.plot(np.arange(T1), center, color='red', alpha=.7, label='expected mean')
    ax.legend()
    plt.show()

sigma = 1
rho = 0.9
T0, T1 = 100, 100
y0 = 10
# Simulate
np.random.seed(145)
initial_path = AR1_simulate(rho, sigma, y0, T0)
# Plot
plot_initial_path(initial_path)
As functions of forecast horizon, the coverage intervals have shapes like those described in https://python.quantecon.org/
perm_income_cons.html
46.3 Predictive Distributions of Path Properties
Wecker [Wecker, 1979] proposed using simulation techniques to characterize predictive distributions of some statistics
that are non-linear functions of 𝑦.
He called these functions “path properties” to contrast them with properties of single data points.
He studied two special prospective path properties of a given series {𝑦𝑡 }.
The first was time until the next turning point.
• he defined a “turning point” to be the date of the second of two successive declines in 𝑦.
To examine this statistic, let 𝑍 be an indicator process

$$Z_t(Y(\omega)) := \begin{cases} 1 & \text{if } Y_t(\omega) < Y_{t-1}(\omega) < Y_{t-2}(\omega) \geq Y_{t-3}(\omega) \\ 0 & \text{otherwise} \end{cases}$$
Then the random variable time until the next turning point is defined as the following stopping time with respect to
𝑍:
𝑊𝑡 (𝜔) ∶= inf{𝑘 ≥ 1 ∣ 𝑍𝑡+𝑘 (𝜔) = 1}
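The indicator process and the stopping time can be computed directly from a path. The following sketch mirrors the two definitions above; the function names and the convention of returning None when no turning point occurs within the sample are our own assumptions.

```python
import numpy as np

def turning_point_indicator(y):
    """
    Z_s = 1 if y_s < y_{s-1} < y_{s-2} >= y_{s-3}, else 0.
    Entries with s < 3 are set to 0 because the condition needs three lags.
    """
    y = np.asarray(y, dtype=float)
    z = np.zeros(y.size, dtype=int)
    for s in range(3, y.size):
        z[s] = int(y[s] < y[s-1] < y[s-2] >= y[s-3])
    return z

def time_to_next_turning_point(y, t):
    """
    W_t = inf{k >= 1 : Z_{t+k} = 1}, computed within the sample.
    Returns None if no turning point occurs after date t.
    """
    z = turning_point_indicator(y)
    hits = np.flatnonzero(z[t+1:])
    return int(hits[0]) + 1 if hits.size else None

path = [1, 2, 3, 2, 1, 0]                     # made-up path for illustration
print(turning_point_indicator(path))          # Z is 1 only at date 4
print(time_to_next_turning_point(path, 0))    # 4 periods ahead of date 0
```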
Wecker [Wecker, 1979] also studied the minimum value of 𝑌 over the next 8 quarters which can be defined as the
random variable.
𝑀𝑡 (𝜔) ∶= min{𝑌𝑡+1 (𝜔); 𝑌𝑡+2 (𝜔); … ; 𝑌𝑡+8 (𝜔)}
It is interesting to study yet another possible concept of a turning point.

Thus, let
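$$T_t(Y(\omega)) := \begin{cases} 1 & \text{if } Y_{t-2}(\omega) > Y_{t-1}(\omega) > Y_t(\omega) \text{ and } Y_t(\omega) < Y_{t+1}(\omega) < Y_{t+2}(\omega) \\ -1 & \text{if } Y_{t-2}(\omega) < Y_{t-1}(\omega) < Y_t(\omega) \text{ and } Y_t(\omega) > Y_{t+1}(\omega) > Y_{t+2}(\omega) \\ 0 & \text{otherwise} \end{cases}$$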
Define a positive turning point today or tomorrow statistic as
$$P_t(\omega) := \begin{cases} 1 & \text{if } T_t(\omega) = 1 \text{ or } T_{t+1}(\omega) = 1 \\ 0 & \text{otherwise} \end{cases}$$
This is designed to express the event

• “after one or two decrease(s), 𝑌 will grow for two consecutive quarters”
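The two statistics just defined translate into a few lines of code. This is a minimal sketch with hypothetical function names; it assumes that the index 𝑡 is far enough from both ends of the path for all required leads and lags to exist.

```python
def turning_indicator(y, t):
    """
    T_t: 1 at a trough, -1 at a peak, 0 otherwise.
    Requires 2 <= t <= len(y) - 3 so that y[t-2] and y[t+2] exist.
    """
    if y[t-2] > y[t-1] > y[t] and y[t] < y[t+1] < y[t+2]:
        return 1
    if y[t-2] < y[t-1] < y[t] and y[t] > y[t+1] > y[t+2]:
        return -1
    return 0

def positive_turn_today_or_tomorrow(y, t):
    "P_t: 1 if T_t = 1 or T_{t+1} = 1, else 0 (requires t <= len(y) - 4)."
    return int(turning_indicator(y, t) == 1 or turning_indicator(y, t + 1) == 1)

trough_path = [5, 4, 3, 2, 3, 4, 5]   # made-up path with a trough at t = 3
print(turning_indicator(trough_path, 3))                # 1
print(positive_turn_today_or_tomorrow(trough_path, 2))  # 1, since T_3 = 1
```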
Following [Wecker, 1979], we can use simulations to calculate probabilities of 𝑃𝑡 and 𝑁𝑡 for each period 𝑡.
46.4 A Wecker-Like Algorithm
The procedure consists of the following steps:

• index a sample path by 𝜔𝑖
• for a given date 𝑡, simulate 𝐼 sample paths of length 𝑁
$$Y(\omega_i) = \{Y_{t+1}(\omega_i), Y_{t+2}(\omega_i), \ldots, Y_{t+N}(\omega_i)\}_{i=1}^I$$
• for each path 𝜔𝑖 , compute the associated value of 𝑊𝑡 (𝜔𝑖 ), 𝑊𝑡+1 (𝜔𝑖 ), …
• consider the sets $\{W_t(\omega_i)\}_{i=1}^T, \{W_{t+1}(\omega_i)\}_{i=1}^T, \ldots, \{W_{t+N}(\omega_i)\}_{i=1}^T$ as samples from the predictive distributions
$f(W_{t+1} \mid y_t, \ldots), f(W_{t+2} \mid y_t, y_{t-1}, \ldots), \ldots, f(W_{t+N} \mid y_t, y_{t-1}, \ldots)$.

46.5 Using Simulations to Approximate a Posterior Distribution
The next code cells use pymc to compute the time 𝑡 posterior distribution of 𝜌, 𝜎.
Note that in defining the likelihood function, we choose to condition on the initial value 𝑦0 .
def draw_from_posterior(sample):
    """
    Draw a sample of size N from the posterior distribution.
    """

    AR1_model = pmc.Model()

    with AR1_model:

        # Start with priors
        rho = pmc.Uniform('rho', lower=-1., upper=1.)  # Assume stable rho
        sigma = pmc.HalfNormal('sigma', sigma=np.sqrt(10))

        # Expected value of y at the next period (rho * y)
        yhat = rho * sample[:-1]

        # Likelihood of the actual realization
        y_like = pmc.Normal('y_obs', mu=yhat, sigma=sigma, observed=sample[1:])

    with AR1_model:
        trace = pmc.sample(10000, tune=5000)

    # check condition
    with AR1_model:
        az.plot_trace(trace, figsize=(17, 6))

    rhos = trace.posterior.rho.values.flatten()
    sigmas = trace.posterior.sigma.values.flatten()

    post_sample = {
        'rho': rhos,
        'sigma': sigmas
    }

    return post_sample

post_samples = draw_from_posterior(initial_path)

The graphs on the left portray posterior marginal distributions.
46.6 Calculating Sample Path Statistics
Our next step is to prepare Python code to compute our sample path statistics.
# define statistics
def next_recession(omega):
    n = omega.shape[0] - 3
    z = np.zeros(n, dtype=int)

    for i in range(n):
        z[i] = int(omega[i] <= omega[i+1] and omega[i+1] > omega[i+2] and omega[i+2] > omega[i+3])

    if not np.any(z):
        return 500
    else:
        return np.where(z == 1)[0][0] + 1

def minimum_value(omega):
    return min(omega[:8])

def severe_recession(omega):
    z = np.diff(omega)
    n = z.shape[0]

    sr = (z < -.02).astype(int)
    indices = np.where(sr == 1)[0]

    if len(indices) == 0:
        return T1
    else:
        return indices[0] + 1

def next_turning_point(omega):
    """
    Suppose that omega is of length 6

        y_{t-2}, y_{t-1}, y_{t}, y_{t+1}, y_{t+2}, y_{t+3}

    that is sufficient for determining the value of P/N
    """

    n = np.asarray(omega).shape[0] - 4
    T = np.zeros(n, dtype=int)

    for i in range(n):
        if ((omega[i] > omega[i+1]) and (omega[i+1] > omega[i+2]) and
            (omega[i+2] < omega[i+3]) and (omega[i+3] < omega[i+4])):
            T[i] = 1

        elif ((omega[i] < omega[i+1]) and (omega[i+1] < omega[i+2]) and
              (omega[i+2] > omega[i+3]) and (omega[i+3] > omega[i+4])):
            T[i] = -1

    up_turn = np.where(T == 1)[0][0] + 1 if (1 in T) else T1
    down_turn = np.where(T == -1)[0][0] + 1 if (-1 in T) else T1

    return up_turn, down_turn
46.7 Original Wecker Method
Now we apply Wecker’s original method by simulating future paths and computing predictive distributions, conditioning on
the true parameters associated with the data-generating model.
def plot_Wecker(initial_path, N, ax):
    """
    Plot the predictive distributions from "pure" Wecker's method.
    """
    # Store outcomes
    next_reces = np.zeros(N)
    severe_rec = np.zeros(N)
    min_vals = np.zeros(N)
    next_up_turn, next_down_turn = np.zeros(N), np.zeros(N)

    # Compute .9 and .95 coverage intervals
    y0 = initial_path[-1]
    center = np.array([rho**j * y0 for j in range(T1)])
    vars = np.array([sigma**2 * (1 - rho**(2 * j)) / (1 - rho**2) for j in range(T1)])
    y_bounds1_c95, y_bounds2_c95 = center + 1.96 * np.sqrt(vars), center - 1.96 * np.sqrt(vars)
    y_bounds1_c90, y_bounds2_c90 = center + 1.645 * np.sqrt(vars), center - 1.645 * np.sqrt(vars)

    # Plot
    ax[0, 0].set_title("Initial path and predictive densities", fontsize=15)
    ax[0, 0].plot(np.arange(-T0 + 1, 1), initial_path)
    ax[0, 0].set_xlim([-T0, T1])
    ax[0, 0].axvline(0, linestyle='--', alpha=.4, color='k', lw=1)

    # Plot coverage intervals
    ax[0, 0].fill_between(np.arange(T1), y_bounds1_c95, y_bounds2_c95, alpha=.3)
    ax[0, 0].fill_between(np.arange(T1), y_bounds1_c90, y_bounds2_c90, alpha=.35)
    ax[0, 0].plot(np.arange(T1), center, color='red', alpha=.7)

    # Simulate future paths and compute the sample path statistics
    for n in range(N):
        sim_path = AR1_simulate(rho, sigma, initial_path[-1], T1)
        next_reces[n] = next_recession(np.hstack([initial_path[-3:-1], sim_path]))
        severe_rec[n] = severe_recession(sim_path)
        min_vals[n] = minimum_value(sim_path)
        next_up_turn[n], next_down_turn[n] = next_turning_point(sim_path)

        if n % (N / 10) == 0:
            ax[0, 0].plot(np.arange(T1), sim_path, color='gray', alpha=.3, lw=1)

    # Plot the predictive distributions
    sns.histplot(next_reces, kde=True, stat='density', ax=ax[0, 1], alpha=.8, label='True parameters')
    ax[0, 1].set_title("Predictive distribution of time until the next recession", fontsize=13)

    sns.histplot(severe_rec, kde=False, stat='density', ax=ax[1, 0], binwidth=0.9, alpha=.7, label='True parameters')
    ax[1, 0].set_title(r"Predictive distribution of stopping time of growth$<-2\%$", fontsize=13)

    sns.histplot(min_vals, kde=True, stat='density', ax=ax[1, 1], alpha=.8, label='True parameters')
    ax[1, 1].set_title("Predictive distribution of minimum value in the next 8 periods", fontsize=13)

    sns.histplot(next_up_turn, kde=True, stat='density', ax=ax[2, 0], alpha=.8, label='True parameters')
    ax[2, 0].set_title("Predictive distribution of time until the next positive turn", fontsize=13)

    sns.histplot(next_down_turn, kde=True, stat='density', ax=ax[2, 1], alpha=.8, label='True parameters')
    ax[2, 1].set_title("Predictive distribution of time until the next negative turn", fontsize=13)

fig, ax = plt.subplots(3, 2, figsize=(15,12))
plot_Wecker(initial_path, 1000, ax)
plt.show()



46.8 Extended Wecker Method
Now we apply our “extended” Wecker method based on predictive densities of 𝑦 defined by (46.4) that acknowledge
posterior uncertainty in the parameters 𝜌, 𝜎.
To approximate the integration on the right side of (46.4), we repeatedly draw parameters from the joint posterior
distribution each time we simulate a sequence of future values from model (46.1).
def plot_extended_Wecker(post_samples, initial_path, N, ax):
    """
    Plot the extended Wecker's predictive distribution
    """
    # Select a sample
    index = np.random.choice(np.arange(len(post_samples['rho'])), N + 1, replace=False)
    rho_sample = post_samples['rho'][index]
    sigma_sample = post_samples['sigma'][index]

    # Store outcomes
    next_reces = np.zeros(N)
    severe_rec = np.zeros(N)
    min_vals = np.zeros(N)
    next_up_turn, next_down_turn = np.zeros(N), np.zeros(N)

    # Plot
    ax[0, 0].set_title("Initial path and future paths simulated from posterior draws", fontsize=15)
    ax[0, 0].plot(np.arange(-T0 + 1, 1), initial_path)
    ax[0, 0].set_xlim([-T0, T1])
    ax[0, 0].axvline(0, linestyle='--', alpha=.4, color='k', lw=1)

    # Simulate future paths, one per posterior draw
    for n in range(N):
        sim_path = AR1_simulate(rho_sample[n], sigma_sample[n], initial_path[-1], T1)
        next_reces[n] = next_recession(np.hstack([initial_path[-3:-1], sim_path]))
        severe_rec[n] = severe_recession(sim_path)
        min_vals[n] = minimum_value(sim_path)
        next_up_turn[n], next_down_turn[n] = next_turning_point(sim_path)

        if n % (N / 10) == 0:
            ax[0, 0].plot(np.arange(T1), sim_path, color='gray', alpha=.3, lw=1)

    # Plot the predictive distributions
    sns.histplot(next_reces, kde=True, stat='density', ax=ax[0, 1], alpha=.6, color=colors[1], label='Sampling from posterior')
    ax[0, 1].set_title("Predictive distribution of time until the next recession", fontsize=13)

    sns.histplot(severe_rec, kde=False, stat='density', ax=ax[1, 0], binwidth=.9, alpha=.6, color=colors[1], label='Sampling from posterior')
    ax[1, 0].set_title(r"Predictive distribution of stopping time of growth$<-2\%$", fontsize=13)

    sns.histplot(min_vals, kde=True, stat='density', ax=ax[1, 1], alpha=.6, color=colors[1], label='Sampling from posterior')
    ax[1, 1].set_title("Predictive distribution of minimum value in the next 8 periods", fontsize=13)

    sns.histplot(next_up_turn, kde=True, stat='density', ax=ax[2, 0], alpha=.6, color=colors[1], label='Sampling from posterior')
    ax[2, 0].set_title("Predictive distribution of time until the next positive turn", fontsize=13)

    sns.histplot(next_down_turn, kde=True, stat='density', ax=ax[2, 1], alpha=.6, color=colors[1], label='Sampling from posterior')
    ax[2, 1].set_title("Predictive distribution of time until the next negative turn", fontsize=13)

fig, ax = plt.subplots(3, 2, figsize=(15, 12))
plot_extended_Wecker(post_samples, initial_path, 1000, ax)
plt.show()





46.9 Comparison
Finally, we plot both the original Wecker method and the extended method with parameter values drawn from the posterior
together to compare the differences that emerge from pretending to know parameter values when they are actually
uncertain.
fig, ax = plt.subplots(3, 2, figsize=(15,12))

plot_Wecker(initial_path, 1000, ax)
ax[0, 0].clear()
plot_extended_Wecker(post_samples, initial_path, 1000, ax)
plt.legend()
plt.show()













Part VIII
Information
CHAPTER
FORTYSEVEN
JOB SEARCH VII: SEARCH WITH LEARNING
Contents
• Job Search VII: Search with Learning

– Overview
– Model
– Take 1: Solution by VFI
– Take 2: A More Efficient Method
– Another Functional Equation
– Solving the RWFE
– Implementation
– Exercises
– Solutions
– Appendix A
– Appendix B
– Examples
In addition to what’s in Anaconda, this lecture deploys the libraries:
!pip install interpolation
47.1 Overview
In this lecture, we consider an extension of the previously studied job search model of McCall [McCall, 1970].
We’ll build on a model of Bayesian learning discussed in this lecture on the topic of exchangeability and its relationship
to the concept of IID (identically and independently distributed) random variables and to Bayesian updating.
In the McCall model, an unemployed worker decides when to accept a permanent job at a specific fixed wage, given
• his or her discount factor
• the level of unemployment compensation
• the distribution from which wage offers are drawn
In the version considered below, the wage distribution is unknown and must be learned.
• The following is based on the presentation in [Ljungqvist and Sargent, 2018], section 6.6.

from numba import njit, prange, vectorize
from interpolation import mlinterp
from math import gamma
import numpy as np
import scipy.optimize as op
from scipy.stats import cumfreq, beta
• Infinite horizon dynamic programming with two states and one binary control.
• Bayesian updating to learn the unknown distribution.
47.2 Model
Let’s first review the basic McCall model [McCall, 1970] and then add the variation we want to consider.
47.2.1 The Basic McCall Model
Recall that, in the baseline model, an unemployed worker is presented in each period with a permanent job offer at wage
𝑊𝑡 .
At time 𝑡, our worker either
1. accepts the offer and works permanently at constant wage 𝑊𝑡
2. rejects the offer, receives unemployment compensation 𝑐 and reconsiders next period
The wage sequence 𝑊𝑡 is IID and generated from known density 𝑞.
The worker aims to maximize the expected discounted sum of earnings $\mathbb{E} \sum_{t=0}^{\infty} \beta^t y_t$.
Let 𝑣(𝑤) be the optimal value of the problem for a previously unemployed worker who has just received offer 𝑤 and is
yet to decide whether to accept or reject the offer.
The value function 𝑣 satisfies the recursion
$$
v(w) = \max \left\{ \frac{w}{1 - \beta}, \, c + \beta \int v(w') q(w') \, dw' \right\} \tag{47.1}
$$
The optimal policy has the form 𝟙{𝑤 ≥ 𝑤̄}, where 𝑤̄ is a constant called the reservation wage.

47.2.2 Offer Distribution Unknown
Now let’s extend the model by considering the variation presented in [Ljungqvist and Sargent, 2018], section 6.6.
The model is as above, apart from the fact that
• the density 𝑞 is unknown
• the worker learns about 𝑞 by starting with a prior and updating based on wage offers that he/she observes
The worker knows there are two possible distributions 𝐹 and 𝐺.
These two distributions have densities 𝑓 and 𝑔, respectively.
Just before time starts, “nature” selects 𝑞 to be either 𝑓 or 𝑔.
This is then the wage distribution from which the entire sequence 𝑊𝑡 will be drawn.
The worker does not know which distribution nature has drawn, but the worker does know the two possible distributions
𝑓 and 𝑔.
The worker puts a (subjective) prior probability 𝜋0 on 𝑓 having been chosen.
The worker’s time 0 subjective distribution for the distribution of 𝑊0 is
𝜋0 𝑓 + (1 − 𝜋0 )𝑔
The worker’s time 𝑡 subjective belief about the distribution of 𝑊𝑡 is
𝜋𝑡 𝑓 + (1 − 𝜋𝑡 )𝑔,
where 𝜋𝑡 updates via
$$
\pi_{t+1} = \frac{\pi_t f(w_{t+1})}{\pi_t f(w_{t+1}) + (1 - \pi_t) g(w_{t+1})} \tag{47.2}
$$
This last expression follows from Bayes’ rule, which tells us that
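$$
\mathbb{P}\{q = f \mid W = w\} = \frac{\mathbb{P}\{W = w \mid q = f\} \mathbb{P}\{q = f\}}{\mathbb{P}\{W = w\}}
\quad \text{and} \quad
\mathbb{P}\{W = w\} = \sum_{\omega \in \{f, g\}} \mathbb{P}\{W = w \mid q = \omega\} \mathbb{P}\{q = \omega\}
$$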
The fact that (47.2) is recursive allows us to progress to a recursive solution method.
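The recursive structure of (47.2) can be seen in a short sketch. It is only an illustration: it uses `scipy.stats.beta` densities with the Beta(1, 1) and Beta(3, 1.2) parameters that the lecture adopts as its baseline below, and the helper name `update_belief`, the prior of 0.5, the seed and the sample size are assumptions made for this example.

```python
import numpy as np
from scipy.stats import beta

def update_belief(pi, w, f_pdf, g_pdf):
    """One step of (47.2): posterior probability that q = f after observing w."""
    pf, pg = pi * f_pdf(w), (1 - pi) * g_pdf(w)
    return pf / (pf + pg)

f_pdf = beta(1, 1).pdf        # f is Beta(1, 1)
g_pdf = beta(3, 1.2).pdf      # g is Beta(3, 1.2)

rng = np.random.default_rng(0)        # seed fixed only for reproducibility
draws = rng.beta(3, 1.2, size=200)    # here nature draws from g

pi = 0.5                              # illustrative prior π_0
path = [pi]
for w in draws:
    pi = update_belief(pi, w, f_pdf, g_pdf)   # π_{t+1} depends only on π_t and w_{t+1}
    path.append(pi)
path = np.array(path)

# Because the update is recursive, processing the first 20 draws one at a time
# gives the same posterior as applying Bayes' rule to all 20 at once.
lik_f = np.prod(f_pdf(draws[:20]))
lik_g = np.prod(g_pdf(draws[:20]))
batch_posterior = 0.5 * lik_f / (0.5 * lik_f + 0.5 * lik_g)
```

The last lines show why 𝜋 alone is a sufficient summary of the observed history: the worker never needs to store past wage offers.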
Letting
$$
q_{\pi}(w) := \pi f(w) + (1 - \pi) g(w)
\quad \text{and} \quad
\kappa(w, \pi) := \frac{\pi f(w)}{\pi f(w) + (1 - \pi) g(w)}
$$
we can express the value function for the unemployed worker recursively as follows
$$
v(w, \pi) = \max \left\{ \frac{w}{1 - \beta}, \, c + \beta \int v(w', \pi') \, q_{\pi}(w') \, dw' \right\}
\quad \text{where} \quad \pi' = \kappa(w', \pi) \tag{47.3}
$$
Notice that the current guess 𝜋 is a state variable, since it affects the worker’s perception of probabilities for future rewards.
47.2.3 Parameterization
Following section 6.6 of [Ljungqvist and Sargent, 2018], our baseline parameterization will be
• 𝑓 is Beta(1, 1)
• 𝑔 is Beta(3, 1.2)

• 𝛽 = 0.95 and 𝑐 = 0.3

The densities 𝑓 and 𝑔 have the following shape
@vectorize
def p(x, a, b):
    # Density of the Beta(a, b) distribution evaluated at x
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x**(a-1) * (1 - x)**(b-1)

x_grid = np.linspace(0, 1, 100)

f = lambda x: p(x, 1, 1)
g = lambda x: p(x, 3, 1.2)

fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(x_grid, f(x_grid), label='$f$', lw=2)
ax.plot(x_grid, g(x_grid), label='$g$', lw=2)
ax.legend()
plt.show()

47.2.4 Looking Forward
What kind of optimal policy might result from (47.3) and the parameterization specified above?
Intuitively, if we accept at 𝑤𝑎 and 𝑤𝑎 ≤ 𝑤𝑏 , then — all other things being given — we should also accept at 𝑤𝑏 .
This suggests a policy of accepting whenever 𝑤 exceeds some threshold value 𝑤̄.
But 𝑤̄ should depend on 𝜋 — in fact, it should be decreasing in 𝜋 because
• 𝑓 is a less attractive offer distribution than 𝑔
• larger 𝜋 means more weight on 𝑓 and less on 𝑔
Thus, larger 𝜋 depresses the worker’s assessment of her future prospects, so relatively low current offers become more
attractive.
Summary: We conjecture that the optimal policy is of the form 𝟙{𝑤 ≥ 𝑤̄(𝜋)} for some decreasing function 𝑤̄.
47.3 Take 1: Solution by VFI
Let’s set about solving the model and see how our results match with our intuition.
We begin by solving via value function iteration (VFI), which is natural but ultimately turns out to be second best.
The class SearchProblem is used to store parameters and methods needed to compute optimal actions.
class SearchProblem:
    """
    A class to store a given parameterization of the "offer distribution
    unknown" model.
    """

    def __init__(self,
                 β=0.95,            # Discount factor
                 c=0.3,             # Unemployment compensation
                 F_a=1,
                 F_b=1,
                 G_a=3,
                 G_b=1.2,
                 w_max=1,           # Maximum wage possible
                 w_grid_size=100,
                 π_grid_size=100,
                 mc_size=500):

        self.β, self.c, self.w_max = β, c, w_max

        self.f = njit(lambda x: p(x, F_a, F_b))
        self.g = njit(lambda x: p(x, G_a, G_b))

        self.π_min, self.π_max = 1e-3, 1-1e-3    # Avoids instability
        self.w_grid = np.linspace(0, w_max, w_grid_size)
        self.π_grid = np.linspace(self.π_min, self.π_max, π_grid_size)

        # Monte Carlo draws from f and g used to approximate integrals
        self.mc_size = mc_size
        self.w_f = np.random.beta(F_a, F_b, mc_size)
        self.w_g = np.random.beta(G_a, G_b, mc_size)

The following function takes an instance of this class and returns jitted versions of the Bellman operator T, and a
get_greedy() function to compute the approximate optimal policy from a guess v of the value function
def operator_factory(sp, parallel_flag=True):

    f, g = sp.f, sp.g
    w_f, w_g = sp.w_f, sp.w_g
    β, c = sp.β, sp.c
    mc_size = sp.mc_size
    w_grid, π_grid = sp.w_grid, sp.π_grid

    @njit
    def v_func(x, y, v):
        # Bilinear interpolation of v at the point (x, y)
        return mlinterp((w_grid, π_grid), v, (x, y))

    @njit
    def κ(w, π):
        """
        Updates π using Bayes' rule and the current wage observation w.
        """
        pf, pg = π * f(w), (1 - π) * g(w)
        π_new = pf / (pf + pg)
        return π_new

    @njit(parallel=parallel_flag)
    def T(v):
        """
        The Bellman operator.
        """
        v_new = np.empty_like(v)

        for i in prange(len(w_grid)):
            for j in prange(len(π_grid)):
                w = w_grid[i]
                π = π_grid[j]

                v_1 = w / (1 - β)

                # Monte Carlo approximation of the integral in (47.3)
                integral_f, integral_g = 0.0, 0.0
                for m in prange(mc_size):
                    integral_f += v_func(w_f[m], κ(w_f[m], π), v)
                    integral_g += v_func(w_g[m], κ(w_g[m], π), v)
                integral = (π * integral_f + (1 - π) * integral_g) / mc_size

                v_2 = c + β * integral
                v_new[i, j] = max(v_1, v_2)

        return v_new

    @njit(parallel=parallel_flag)
    def get_greedy(v):
        """
        Compute optimal actions taking v as the value function.
        """
        σ = np.empty_like(v)

        for i in prange(len(w_grid)):
            for j in prange(len(π_grid)):
                w = w_grid[i]
                π = π_grid[j]

                v_1 = w / (1 - β)

                integral_f, integral_g = 0.0, 0.0
                for m in prange(mc_size):
                    integral_f += v_func(w_f[m], κ(w_f[m], π), v)
                    integral_g += v_func(w_g[m], κ(w_g[m], π), v)
                integral = (π * integral_f + (1 - π) * integral_g) / mc_size

                v_2 = c + β * integral
                σ[i, j] = v_1 > v_2  # Evaluates to 1 or 0

        return σ

    return T, get_greedy
We will omit a detailed discussion of the code because there is a more efficient solution method that we will use later.
To solve the model we will use the following function that iterates using T to find a fixed point
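def solve_model(sp,
                use_parallel=True,
                tol=1e-4,
                max_iter=1000,
                verbose=True,
                print_skip=5):
    """
    Solves for the value function

    * sp is an instance of SearchProblem
    """

    T, _ = operator_factory(sp, use_parallel)

    # Set up loop
    i = 0
    error = tol + 1
    m, n = len(sp.w_grid), len(sp.π_grid)

    # Initialize v
    v = np.zeros((m, n)) + sp.c / (1 - sp.β)

    # Iterate with T until successive guesses are within tol
    while i < max_iter and error > tol:
        v_new = T(v)
        error = np.max(np.abs(v - v_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        v = v_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return v_new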
Let’s look at solutions computed from value function iteration
sp = SearchProblem()
v_star = solve_model(sp)

fig, ax = plt.subplots(figsize=(6, 6))
ax.contourf(sp.π_grid, sp.w_grid, v_star, 12, alpha=0.6, cmap=cm.jet)
cs = ax.contour(sp.π_grid, sp.w_grid, v_star, 12, colors="black")
ax.set(xlabel=r'$\pi$', ylabel='$w$')
plt.show()

We will also plot the optimal policy
T, get_greedy = operator_factory(sp)
σ_star = get_greedy(v_star)

fig, ax = plt.subplots(figsize=(6, 6))
ax.contourf(sp.π_grid, sp.w_grid, σ_star, 1, alpha=0.6, cmap=cm.jet)
ax.contour(sp.π_grid, sp.w_grid, σ_star, 1, colors="black")
ax.text(0.5, 0.6, 'reject')
ax.text(0.7, 0.9, 'accept')
plt.show()

The results fit well with our intuition from the section Looking Forward.
• The black line in the figure above corresponds to the function 𝑤̄(𝜋) introduced there.
• It is decreasing as expected.
47.4 Take 2: A More Efficient Method
Let’s consider another method to solve for the optimal policy.

We will use iteration with an operator that has the same contraction rate as the Bellman operator, but
• one dimensional rather than two dimensional
• no maximization step
As a consequence, the algorithm is orders of magnitude faster than VFI.
This section illustrates the point that when it comes to programming, a bit of mathematical analysis goes a long way.

47.5 Another Functional Equation
To begin, note that when 𝑤 = 𝑤̄(𝜋), the worker is indifferent between accepting and rejecting.
Hence the two choices on the right-hand side of (47.3) have equal value:
$$
\frac{\bar w(\pi)}{1 - \beta} = c + \beta \int v(w', \pi') \, q_{\pi}(w') \, dw' \tag{47.4}
$$
Together, (47.3) and (47.4) give
$$
v(w, \pi) = \max \left\{ \frac{w}{1 - \beta}, \, \frac{\bar w(\pi)}{1 - \beta} \right\} \tag{47.5}
$$
Combining (47.4) and (47.5), we obtain
$$
\frac{\bar w(\pi)}{1 - \beta} = c + \beta \int \max \left\{ \frac{w'}{1 - \beta}, \, \frac{\bar w(\pi')}{1 - \beta} \right\} q_{\pi}(w') \, dw'
$$
Multiplying by 1 − 𝛽, substituting in 𝜋′ = 𝜅(𝑤′, 𝜋) and using ∘ for composition of functions yields
$$
\bar w(\pi) = (1 - \beta) c + \beta \int \max \left\{ w', \, \bar w \circ \kappa(w', \pi) \right\} q_{\pi}(w') \, dw' \tag{47.6}
$$
Equation (47.6) can be understood as a functional equation, where 𝑤̄ is the unknown function.
• Let’s call it the reservation wage functional equation (RWFE).
• The solution 𝑤̄ to the RWFE is the object that we wish to compute.
47.6 Solving the RWFE
To solve the RWFE, we will first show that its solution is the fixed point of a contraction mapping.
To this end, let
• 𝑏[0, 1] be the bounded real-valued functions on [0, 1]
• $\|\omega\| := \sup_{x \in [0, 1]} |\omega(x)|$
Consider the operator 𝑄 mapping 𝜔 ∈ 𝑏[0, 1] into 𝑄𝜔 ∈ 𝑏[0, 1] via
(𝑄𝜔)(𝜋) = (1 − 𝛽)𝑐 + 𝛽 ∫ max {𝑤′ , 𝜔 ∘ 𝜅(𝑤′ , 𝜋)} 𝑞𝜋 (𝑤′ ) 𝑑𝑤′ (47.7)
Comparing (47.6) and (47.7), we see that the set of fixed points of 𝑄 exactly coincides with the set of solutions to the
RWFE.
• If 𝑄𝑤̄ = 𝑤̄ then 𝑤̄ solves (47.6) and vice versa.
Moreover, for any 𝜔, 𝜔′ ∈ 𝑏[0, 1], basic algebra and the triangle inequality for integrals tell us that
|(𝑄𝜔)(𝜋) − (𝑄𝜔′ )(𝜋)| ≤ 𝛽 ∫ |max {𝑤′ , 𝜔 ∘ 𝜅(𝑤′ , 𝜋)} − max {𝑤′ , 𝜔′ ∘ 𝜅(𝑤′ , 𝜋)}| 𝑞𝜋 (𝑤′ ) 𝑑𝑤′ (47.8)
Working case by case, it is easy to check that for real numbers 𝑎, 𝑏, 𝑐 we always have
| max{𝑎, 𝑏} − max{𝑎, 𝑐}| ≤ |𝑏 − 𝑐| (47.9)
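For readers who want the case-by-case check behind (47.9) spelled out, one way to organize it is sketched below. The labeling of the cases is our own; only the inequality itself comes from the text.

```latex
Without loss of generality let $b \geq c$, so that
$\max\{a, b\} \geq \max\{a, c\}$ and $|b - c| = b - c$. Then
\begin{align*}
  a \leq c:      \quad & \max\{a, b\} - \max\{a, c\} = b - c, \\
  c < a \leq b:  \quad & \max\{a, b\} - \max\{a, c\} = b - a < b - c, \\
  a > b:         \quad & \max\{a, b\} - \max\{a, c\} = a - a = 0 \leq b - c.
\end{align*}
In every case $|\max\{a, b\} - \max\{a, c\}| \leq |b - c|$.
The case $b < c$ follows by exchanging the roles of $b$ and $c$.
```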

Combining (47.8) and (47.9) yields
|(𝑄𝜔)(𝜋) − (𝑄𝜔′ )(𝜋)| ≤ 𝛽 ∫ |𝜔 ∘ 𝜅(𝑤′ , 𝜋) − 𝜔′ ∘ 𝜅(𝑤′ , 𝜋)| 𝑞𝜋 (𝑤′ ) 𝑑𝑤′ ≤ 𝛽‖𝜔 − 𝜔′ ‖ (47.10)
Taking the supremum over 𝜋 now gives us
‖𝑄𝜔 − 𝑄𝜔′ ‖ ≤ 𝛽‖𝜔 − 𝜔′ ‖ (47.11)
In other words, 𝑄 is a contraction of modulus 𝛽 on the complete metric space (𝑏[0, 1], ‖ ⋅ ‖).
Hence
• A unique solution 𝑤̄ to the RWFE exists in 𝑏[0, 1].
• $Q^k \omega \to \bar w$ uniformly as 𝑘 → ∞, for any 𝜔 ∈ 𝑏[0, 1].
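The contraction property can be observed numerically with a small, self-contained sketch. Note that this is not the lecture's implementation: to keep the example deterministic it replaces the Monte Carlo integration used in the lecture with a midpoint quadrature rule, uses `scipy.stats.beta` for the densities, and picks the grid sizes and tolerance arbitrarily. The baseline values 𝛽 = 0.95, 𝑐 = 0.3, Beta(1, 1) and Beta(3, 1.2) are taken from the text.

```python
import numpy as np
from scipy.stats import beta

β, c = 0.95, 0.3
f_pdf, g_pdf = beta(1, 1).pdf, beta(3, 1.2).pdf

# Midpoint quadrature nodes on (0, 1); the node count is an illustrative choice
n_nodes = 400
w_nodes = (np.arange(n_nodes) + 0.5) / n_nodes
f_w, g_w = f_pdf(w_nodes), g_pdf(w_nodes)

π_grid = np.linspace(1e-3, 1 - 1e-3, 50)

def Q(ω):
    "Apply the operator in (47.7) to ω, a vector of values on π_grid."
    ω_new = np.empty_like(ω)
    for i, π in enumerate(π_grid):
        q = π * f_w + (1 - π) * g_w              # mixture density q_π(w')
        π_next = π * f_w / q                     # κ(w', π)
        ω_next = np.interp(π_next, π_grid, ω)    # ω ∘ κ(w', π)
        integral = np.mean(np.maximum(w_nodes, ω_next) * q)
        ω_new[i] = (1 - β) * c + β * integral
    return ω_new

ω = np.ones_like(π_grid)      # any bounded initial guess will do
errors = []                   # sup-norm distance between successive iterates
for k in range(500):
    ω_new = Q(ω)
    errors.append(np.max(np.abs(ω_new - ω)))
    ω = ω_new
    if errors[-1] < 1e-6:
        break
```

Since successive iterates satisfy ‖𝑄𝜔ₖ₊₁ − 𝑄𝜔ₖ‖ ≤ 𝛽‖𝜔ₖ₊₁ − 𝜔ₖ‖, the entries of `errors` should shrink by a factor of roughly 𝛽 or better at every step (up to a small quadrature error), which is what (47.11) predicts.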
47.7 Implementation
The following function takes an instance of SearchProblem and returns the operator Q
def Q_factory(sp, parallel_flag=True):

    f, g = sp.f, sp.g
    w_f, w_g = sp.w_f, sp.w_g
    β, c = sp.β, sp.c
    mc_size = sp.mc_size
    w_grid, π_grid = sp.w_grid, sp.π_grid

    @njit
    def ω_func(p, ω):
        # Linear interpolation of ω at belief p
        return np.interp(p, π_grid, ω)

    @njit
    def κ(w, π):
        """
        Updates π using Bayes' rule and the current wage observation w.
        """
        pf, pg = π * f(w), (1 - π) * g(w)
        π_new = pf / (pf + pg)
        return π_new

    @njit(parallel=parallel_flag)
    def Q(ω):
        """
        Updates the reservation wage function guess ω via the operator Q.
        """
        ω_new = np.empty_like(ω)

        for i in prange(len(π_grid)):
            π = π_grid[i]

            # Monte Carlo approximation of the integral in (47.7)
            integral_f, integral_g = 0.0, 0.0
            for m in prange(mc_size):
                integral_f += max(w_f[m], ω_func(κ(w_f[m], π), ω))
                integral_g += max(w_g[m], ω_func(κ(w_g[m], π), ω))
            integral = (π * integral_f + (1 - π) * integral_g) / mc_size

            ω_new[i] = (1 - β) * c + β * integral

        return ω_new

    return Q
In the next exercise, you are asked to compute an approximation to 𝑤̄.
47.8 Exercises
Exercise 47.8.1
Use the default parameters and Q_factory to compute an optimal policy.
Your result should coincide closely with the figure for the optimal policy shown above.
Try experimenting with different parameters, and confirm that the change in the optimal policy coincides with your
intuition.
47.9 Solutions

This code solves the “Offer Distribution Unknown” model by iterating on a guess of the reservation wage function.
You should find that the run time is shorter than that of the value function approach.
Similar to above, we set up a function to iterate with Q to find the fixed point
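def solve_wbar(sp,
               use_parallel=True,
               tol=1e-4,
               max_iter=1000,
               verbose=True,
               print_skip=5):

    Q = Q_factory(sp, use_parallel)

    # Set up loop
    i = 0
    error = tol + 1
    m, n = len(sp.w_grid), len(sp.π_grid)

    # Initialize w
    w = np.ones_like(sp.π_grid)

    # Iterate with Q until successive guesses are within tol
    while i < max_iter and error > tol:
        w_new = Q(w)
        error = np.max(np.abs(w - w_new))
        i += 1
        if verbose and i % print_skip == 0:
            print(f"Error at iteration {i} is {error}.")
        w = w_new

    if error > tol:
        print("Failed to converge!")
    elif verbose:
        print(f"\nConverged in {i} iterations.")

    return w_new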
The solution can be plotted as follows
sp = SearchProblem()
w_bar = solve_wbar(sp)

fig, ax = plt.subplots(figsize=(9, 7))
ax.plot(sp.π_grid, w_bar, color='k')
ax.fill_between(sp.π_grid, 0, w_bar, color='blue', alpha=0.15)
ax.fill_between(sp.π_grid, w_bar, sp.w_max, color='green', alpha=0.15)
ax.text(0.5, 0.6, 'reject')
ax.text(0.7, 0.9, 'accept')
ax.set(xlabel=r'$\pi$', ylabel='$w$')
ax.grid()
plt.show()


47.10 Appendix A
The next piece of code generates a fun simulation to see how a change in the underlying distribution affects the
unemployment rate.
At a point in the simulation, the distribution becomes significantly worse.
It takes a while for agents to learn this, and in the meantime, they are too optimistic and turn down too many jobs.
As a result, the unemployment rate spikes.
F_a, F_b, G_a, G_b = 1, 1, 3, 1.2

sp = SearchProblem(F_a=F_a, F_b=F_b, G_a=G_a, G_b=G_b)
f, g = sp.f, sp.g

# Solve for reservation wage
w_bar = solve_wbar(sp, verbose=False)

# Interpolate reservation wage function
π_grid = sp.π_grid
w_func = njit(lambda x: np.interp(x, π_grid, w_bar))

@njit
def update(a, b, e, π):
    "Update e and π by drawing wage offer from beta distribution with parameters a and b"

    if e == False:
        w = np.random.beta(a, b)       # Draw random wage
        if w >= w_func(π):
            e = True                   # Take new job
        else:
            π = 1 / (1 + ((1 - π) * g(w)) / (π * f(w)))

    return e, π

@njit
def simulate_path(F_a=F_a,
                  F_b=F_b,
                  G_a=G_a,
                  G_b=G_b,
                  N=5000,       # Number of agents
                  T=600,        # Simulation length
                  d=200,        # Change date
                  s=0.025):     # Separation rate

    """Simulates path of employment for N workers over T periods"""

    e = np.ones((N, T+1))
    π = np.full((N, T+1), 1e-3)

    a, b = G_a, G_b   # Initial distribution parameters

    # Stop at T - 1 so that the write to column t + 1 stays inside the arrays
    for t in range(T):

        if t == d:
            a, b = F_a, F_b  # Change distribution parameters

        # Update each agent
        for n in range(N):
            if e[n, t] == 1:                    # If agent is currently employed
                p = np.random.uniform(0, 1)
                if p <= s:                      # Randomly separate with probability s
                    e[n, t] = 0

            new_e, new_π = update(a, b, e[n, t], π[n, t])
            e[n, t+1] = new_e
            π[n, t+1] = new_π

    return e[:, 1:]

d = 200  # Change distribution at time d
unemployment_rate = 1 - simulate_path(d=d).mean(axis=0)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(unemployment_rate)
ax.axvline(d, color='r', alpha=0.6, label='Change date')
ax.set_xlabel('Time')
ax.set_title('Unemployment rate')
ax.legend()
plt.show()
47.11 Appendix B
In this appendix we provide more details about how Bayes’ Law contributes to the workings of the model.
We present some graphs that bring out additional insights about how learning works.
We build on graphs proposed in this lecture.
In particular, we’ll add actions of our searching worker to a key graph presented in that lecture.
To begin, we first define two functions for computing the empirical distributions of unemployment duration and π at the
time of employment.
@njit
def empirical_dist(F_a, F_b, G_a, G_b, w_bar, π_grid,
                   N=10000, T=600):
    """
    Simulates population for computing empirical cumulative
    distribution of unemployment duration and π at time when
    the worker accepts the wage offer. For each job searching
    problem, we simulate for two cases that either f or g is
    the true offer distribution.

    Parameters
    ----------
    F_a, F_b, G_a, G_b : parameters of beta distributions F and G.
    w_bar : the reservation wage
    π_grid : grid points of π, for interpolation
    N : number of workers for simulation, optional
    T : maximum of time periods for simulation, optional

    Returns
    -------
    accept_t : 2 by N ndarray. the empirical distribution of
               unemployment duration when f or g generates offers.
    accept_π : 2 by N ndarray. the empirical distribution of
               π at the time of employment when f or g generates offers.
    """

    accept_t = np.empty((2, N))
    accept_π = np.empty((2, N))

    # f or g generates offers
    for i, (a, b) in enumerate([(F_a, F_b), (G_a, G_b)]):
        # update each agent
        for n in range(N):

            # initial prior
            π = 0.5

            for t in range(T+1):

                # Draw random wage
                w = np.random.beta(a, b)
                lw = p(w, F_a, F_b) / p(w, G_a, G_b)
                π = π * lw / (π * lw + 1 - π)

                # move to next agent if accepts
                if w >= np.interp(π, π_grid, w_bar):
                    break

            # record the unemployment duration
            # and π at the time of acceptance
            accept_t[i, n] = t
            accept_π[i, n] = π

    return accept_t, accept_π

def cumfreq_x(res):
    """
    A helper function for calculating the x grids of
    the cumulative frequency histogram.
    """

    cumcount = res.cumcount
    lowerlimit, binsize = res.lowerlimit, res.binsize

    x = lowerlimit + np.linspace(0, binsize*cumcount.size, cumcount.size)

    return x
Now we define a wrapper function for analyzing job search models with learning under different parameterizations.
The wrapper takes parameters of beta distributions and unemployment compensation as inputs and then displays various
things we want to know to interpret the solution of our search model.
In addition, it computes empirical cumulative distributions of two key objects.
def job_search_example(F_a=1, F_b=1, G_a=3, G_b=1.2, c=0.3):
    """
    Given the parameters that specify F and G distributions,
    calculate and display the rejection and acceptance area,
    the evolution of belief π, and the probability of accepting
    an offer at different π level, and simulate and calculate
    the empirical cumulative distribution of the duration of
    unemployment and π at the time the worker accepts the offer.
    """

    # construct a search problem
    sp = SearchProblem(F_a=F_a, F_b=F_b, G_a=G_a, G_b=G_b, c=c)
    f, g = sp.f, sp.g
    π_grid = sp.π_grid

    # Solve for reservation wage
    w_bar = solve_wbar(sp, verbose=False)

    # l(w) = f(w) / g(w)
    l = lambda w: f(w) / g(w)
    # objective function for solving l(w) = 1
    obj = lambda w: l(w) - 1.

    # the mode of beta distribution
    # use this to divide w into two intervals for root finding
    G_mode = (G_a - 1) / (G_a + G_b - 2)
    roots = np.empty(2)
    roots[0] = op.root_scalar(obj, bracket=[1e-10, G_mode]).root
    roots[1] = op.root_scalar(obj, bracket=[G_mode, 1-1e-10]).root

    fig, axs = plt.subplots(2, 2, figsize=(12, 9))

    # part 1: display the details of the model settings and some results
    w_grid = np.linspace(1e-12, 1-1e-12, 100)

    axs[0, 0].plot(l(w_grid), w_grid, label='$l$', lw=2)
    axs[0, 0].vlines(1., 0., 1., linestyle="--")
    axs[0, 0].hlines(roots, 0., 2., linestyle="--")
    axs[0, 0].set_xlim([0., 2.])
    axs[0, 0].legend(loc=4)
    axs[0, 0].set(xlabel='$l(w)=f(w)/g(w)$', ylabel='$w$')

    axs[0, 1].plot(sp.π_grid, w_bar, color='k')
    axs[0, 1].fill_between(sp.π_grid, 0, w_bar, color='blue', alpha=0.15)
    axs[0, 1].fill_between(sp.π_grid, w_bar, sp.w_max, color='green', alpha=0.15)
    axs[0, 1].text(0.5, 0.6, 'reject')
    axs[0, 1].text(0.7, 0.9, 'accept')

    # Arrows showing how π moves at each (π, w) pair; w itself does not move
    W = np.arange(0.01, 0.99, 0.08)
    Π = np.arange(0.01, 0.99, 0.08)

    ΔW = np.zeros((len(W), len(Π)))
    ΔΠ = np.empty((len(W), len(Π)))
    for i, w in enumerate(W):
        for j, π in enumerate(Π):
            lw = l(w)
            ΔΠ[i, j] = π * (lw / (π * lw + 1 - π) - 1)

    q = axs[0, 1].quiver(Π, W, ΔΠ, ΔW, scale=2, color='r', alpha=0.8)

    axs[0, 1].set(xlabel=r'$\pi$', ylabel='$w$')
    axs[0, 1].grid()

    axs[1, 0].plot(f(x_grid), x_grid, label='$f$', lw=2)
    axs[1, 0].plot(g(x_grid), x_grid, label='$g$', lw=2)
    axs[1, 0].vlines(1., 0., 1., linestyle="--")
    axs[1, 0].set(xlabel='$f(w), g(w)$', ylabel='$w$')

    axs[1, 1].plot(sp.π_grid, 1 - beta.cdf(w_bar, F_a, F_b), label='$f$')
    axs[1, 1].plot(sp.π_grid, 1 - beta.cdf(w_bar, G_a, G_b), label='$g$')
    axs[1, 1].set_ylim([0., 1.])
    axs[1, 1].grid()
    axs[1, 1].set(xlabel=r'$\pi$', ylabel=r'$\mathbb{P}\{w > \overline{w} (\pi)\}$')

    plt.show()

    # part 2: simulate empirical cumulative distribution
    accept_t, accept_π = empirical_dist(F_a, F_b, G_a, G_b, w_bar, π_grid)
    N = accept_t.shape[1]

    cfq_t_F = cumfreq(accept_t[0, :], numbins=100)
    cfq_π_F = cumfreq(accept_π[0, :], numbins=100)

    cfq_t_G = cumfreq(accept_t[1, :], numbins=100)
    cfq_π_G = cumfreq(accept_π[1, :], numbins=100)

    fig, axs = plt.subplots(2, 1, figsize=(12, 9))

    axs[0].plot(cumfreq_x(cfq_t_F), cfq_t_F.cumcount/N, label="f generates")
    axs[0].plot(cumfreq_x(cfq_t_G), cfq_t_G.cumcount/N, label="g generates")
    axs[0].grid(linestyle='--')
    axs[0].legend(loc=4)
    axs[0].title.set_text('CDF of duration of unemployment')
    axs[0].set(xlabel='time', ylabel='Prob(time)')

    axs[1].plot(cumfreq_x(cfq_π_F), cfq_π_F.cumcount/N, label="f generates")
    axs[1].plot(cumfreq_x(cfq_π_G), cfq_π_G.cumcount/N, label="g generates")
    axs[1].grid(linestyle='--')
    axs[1].legend(loc=4)
    axs[1].title.set_text('CDF of π at time worker accepts wage and leaves unemployment')
    axs[1].set(xlabel='π', ylabel='Prob(π)')

    plt.show()
We now present some examples that provide insights into how the model works.
47.12 Examples
47.12.1 Example 1 (Baseline)
𝐹 ~ Beta(1, 1), 𝐺 ~ Beta(3, 1.2), 𝑐=0.3.

In the graphs below, the red arrows in the upper right figure show how 𝜋𝑡 is updated in response to the new information
𝑤𝑡 .
Recall the following formula from this lecture
$$
\frac{\pi_{t+1}}{\pi_t} = \frac{l(w_{t+1})}{\pi_t l(w_{t+1}) + (1 - \pi_t)}
\begin{cases}
> 1 & \text{if } l(w_{t+1}) > 1 \\
\leq 1 & \text{if } l(w_{t+1}) \leq 1
\end{cases}
$$
The formula implies that the direction of motion of 𝜋𝑡 is determined by the relationship between 𝑙(𝑤𝑡 ) and 1.
The magnitude is small if
• 𝑙(𝑤) is close to 1, which means the new 𝑤 is not very informative for distinguishing two distributions,
• 𝜋𝑡−1 is close to either 0 or 1, which means the prior is strong.
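The two bullet points can be checked with a few numbers. The sketch below evaluates the implied change in belief, which is the same expression used for the red arrows in the plotting code; the function name `delta_pi` and the sample values of 𝜋 and 𝑙 are illustrative choices.

```python
def delta_pi(pi, lw):
    """
    Change in belief π_{t+1} − π_t implied by the updating formula,
    as a function of the current belief pi and the likelihood ratio lw.
    Algebraically this equals pi * (1 - pi) * (lw - 1) / (pi * lw + 1 - pi).
    """
    return pi * (lw / (pi * lw + 1 - pi) - 1)

# Direction is governed by whether the likelihood ratio is above or below 1
up = delta_pi(0.5, 2.0)        # l > 1: belief in f rises
down = delta_pi(0.5, 0.5)      # l < 1: belief in f falls
none = delta_pi(0.5, 1.0)      # l = 1: the offer is uninformative

# Magnitude is small when l is close to 1 or when the prior is strong
weak_signal = delta_pi(0.5, 1.05)
strong_prior_low = delta_pi(0.01, 2.0)
strong_prior_high = delta_pi(0.99, 2.0)
```

The factored form in the docstring makes both effects visible: the factor 𝑙 − 1 vanishes for uninformative offers, and the factor 𝜋(1 − 𝜋) vanishes as 𝜋 approaches 0 or 1.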
Will an unemployed worker accept an offer earlier or not, when the actual ruling distribution is 𝑔 instead of 𝑓?
Two countervailing effects are at work.
• if 𝑓 generates successive wage offers, then 𝑤 is more likely to be low, but 𝜋 is moving up toward 1, which lowers
the reservation wage, i.e., the worker becomes less selective the longer he or she remains unemployed.
• if 𝑔 generates wage offers, then 𝑤 is more likely to be high, but 𝜋 is moving downward toward 0, increasing the
reservation wage, i.e., the worker becomes more selective the longer he or she remains unemployed.
Quantitatively, the lower right figure sheds light on which effect dominates in this example.
It shows the probability that a previously unemployed worker accepts an offer at different values of 𝜋 when 𝑓 or 𝑔 generates
wage offers.
That graph shows that for the particular 𝑓 and 𝑔 in this example, the worker is always more likely to accept an offer when
𝑓 generates the data even when 𝜋 is close to zero so that the worker believes the true distribution is 𝑔 and therefore is
relatively more selective.
The empirical cumulative distribution of the duration of unemployment verifies our conjecture.
job_search_example()


47.12.2 Example 2
𝐹 ~ Beta(1, 1), 𝐺 ~ Beta(1.2, 1.2), 𝑐=0.3.

Now 𝐺 has the same mean as 𝐹 with a smaller variance.
Since the unemployment compensation 𝑐 serves as a lower bound for bad wage offers, 𝐺 is now an “inferior” distribution
to 𝐹 .
Consequently, we observe that the optimal policy 𝑤̄(𝜋) is increasing in 𝜋.
job_search_example(1, 1, 1.2, 1.2, 0.3)


47.12.3 Example 3
𝐹 ~ Beta(1, 1), 𝐺 ~ Beta(2, 2), 𝑐=0.3.

If the variance of 𝐺 is smaller, we observe in the result that 𝐺 is even more “inferior” and the slope of 𝑤̄(𝜋) is larger.
job_search_example(1, 1, 2, 2, 0.3)


47.12.4 Example 4
𝐹 ~ Beta(1, 1), 𝐺 ~ Beta(3, 1.2), and 𝑐=0.8.

In this example, we keep the parameters of the beta distributions the same as in the baseline case but increase the
unemployment compensation 𝑐.
Compared with the baseline case (Example 1), in which unemployment compensation is low (𝑐=0.3), the
worker can now afford a longer learning period.
As a result, the worker tends to accept wage offers much later.
Furthermore, at the time of accepting employment, the belief 𝜋 is closer to either 0 or 1.
That means that the worker has a better idea about what the true distribution is when he eventually chooses to accept a
wage offer.
job_search_example(1, 1, 3, 1.2, c=0.8)


47.12.5 Example 5
𝐹 ~ Beta(1, 1), 𝐺 ~ Beta(3, 1.2), and 𝑐=0.1.

As expected, a smaller 𝑐 makes an unemployed worker accept wage offers earlier after having acquired less information
about the wage distribution.
job_search_example(1, 1, 3, 1.2, c=0.1)


CHAPTER
FORTYEIGHT
LIKELIHOOD RATIO PROCESSES
Contents
• Likelihood Ratio Processes

– Overview
– Likelihood Ratio Process
– Nature Permanently Draws from Density g
– Peculiar Property
– Nature Permanently Draws from Density f
– Likelihood Ratio Test
– Kullback–Leibler Divergence
– Sequels
48.1 Overview
This lecture describes likelihood ratio processes and some of their uses.
We’ll use a setting described in this lecture.
Among things that we’ll learn are
• A peculiar property of likelihood ratio processes
• How a likelihood ratio process is a key ingredient in frequentist hypothesis testing
• How a receiver operator characteristic curve summarizes information about a false alarm probability and power
in frequentist hypothesis testing
• How during World War II the United States Navy devised a decision rule that Captain Garret L. Schyler challenged
and asked Milton Friedman to justify to him, a topic to be studied in this lecture
Let’s start by importing some Python tools.

import matplotlib.pyplot as plt
import numpy as np
from numba import vectorize, njit
from math import gamma

48.2 Likelihood Ratio Process
A nonnegative random variable 𝑊 has one of two probability density functions, either 𝑓 or 𝑔.
Before the beginning of time, nature once and for all decides whether she will draw a sequence of IID draws from either
𝑓 or 𝑔.
We will sometimes let 𝑞 be the density that nature chose once and for all, so that 𝑞 is either 𝑓 or 𝑔, permanently.
Nature knows which density it permanently draws from, but we the observers do not.
We do know both 𝑓 and 𝑔 but we don’t know which density nature chose.
But we want to know.
To do that, we use observations.
We observe a sequence $\{w_t\}_{t=1}^T$ of 𝑇 IID draws from either 𝑓 or 𝑔.
We want to use these observations to infer whether nature chose 𝑓 or 𝑔.
A likelihood ratio process is a useful tool for this task.
To begin, we define the key component of a likelihood ratio process, namely, the time 𝑡 likelihood ratio, as the random variable
$$
\ell(w_t) = \frac{f(w_t)}{g(w_t)}, \quad t \geq 1.
$$
We assume that 𝑓 and 𝑔 both put positive probabilities on the same intervals of possible realizations of the random variable
𝑊.
That means that under the 𝑔 density, $\ell(w_t) = \frac{f(w_t)}{g(w_t)}$ is evidently a nonnegative random variable with mean 1.
A likelihood ratio process for sequence $\{w_t\}_{t=1}^{\infty}$ is defined as
$$
L(w^t) = \prod_{i=1}^{t} \ell(w_i),
$$
where $w^t = \{w_1, \ldots, w_t\}$ is a history of observations up to and including time 𝑡.
Sometimes for shorthand we’ll write $L_t = L(w^t)$.
Notice that the likelihood process satisfies the recursion or multiplicative decomposition
$$
L(w^t) = \ell(w_t) L(w^{t-1}).
$$
The likelihood ratio and its logarithm are key tools for making inferences using a classic frequentist approach due to
Neyman and Pearson [Neyman and Pearson, 1933].
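As a quick check on the definitions, the following sketch builds a short likelihood ratio path in two ways: as a running product, and through the multiplicative recursion. It assumes Beta(1, 1) and Beta(3, 1.2) densities from `scipy.stats` (the pair used in the lecture's own code below); the seed and the history length are arbitrary.

```python
import numpy as np
from scipy.stats import beta

f_pdf, g_pdf = beta(1, 1).pdf, beta(3, 1.2).pdf

rng = np.random.default_rng(1)      # seed fixed only for reproducibility
w = rng.beta(3, 1.2, size=10)       # a short history drawn from g
ell = f_pdf(w) / g_pdf(w)           # ℓ(w_t) for t = 1, ..., 10
L = np.cumprod(ell)                 # L(w^t) as a running product

# Rebuild the same path with the recursion L(w^t) = ℓ(w_t) L(w^{t-1})
L_rec = np.empty_like(L)
L_prev = 1.0                        # convention: the empty product equals 1
for t in range(len(w)):
    L_prev = ell[t] * L_prev
    L_rec[t] = L_prev

# In logs the product becomes a sum, which is numerically safer for long histories
log_L = np.cumsum(np.log(ell))
```

Working with `log_L` rather than `L` is a common design choice when 𝑡 is large, because products of many ratios can underflow or overflow.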
To help us appreciate how things work, the following Python code evaluates 𝑓 and 𝑔 as two different beta distributions,
then computes and simulates an associated likelihood ratio process by generating a sequence $w^t$ from one of the two
probability distributions, for example, a sequence of IID draws from 𝑔.

# Parameters in the two beta distributions.
F_a, F_b = 1, 1
G_a, G_b = 3, 1.2

@vectorize
def p(x, a, b):
    # Density of the Beta(a, b) distribution evaluated at x
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x** (a-1) * (1 - x) ** (b-1)

# The two density functions.
f = njit(lambda x: p(x, F_a, F_b))
g = njit(lambda x: p(x, G_a, G_b))

@njit
def simulate(a, b, T=50, N=500):
    '''
    Generate N sets of T observations of the likelihood ratio,
    return as N x T matrix.
    '''

    l_arr = np.empty((N, T))

    for i in range(N):
        for j in range(T):
            w = np.random.beta(a, b)       # Draw from Beta(a, b)
            l_arr[i, j] = f(w) / g(w)

    return l_arr
48.3 Nature Permanently Draws from Density g
We first simulate the likelihood ratio process when nature permanently draws from 𝑔.
l_arr_g = simulate(G_a, G_b)
l_seq_g = np.cumprod(l_arr_g, axis=1)

N, T = l_arr_g.shape

for i in range(N):
    plt.plot(range(T), l_seq_g[i, :], color='b', lw=0.8, alpha=0.5)

plt.ylim([0, 3])
plt.title("$L(w^{t})$ paths");

Evidently, as sample length 𝑇 grows, most probability mass shifts toward zero.
To see this more clearly, we plot over time the fraction of paths $L(w^t)$ that fall in the interval [0, 0.01].
plt.plot(range(T), np.sum(l_seq_g <= 0.01, axis=0) / N)
Despite the evident convergence of most probability mass to a very small interval near 0, the unconditional mean of $L(w^t)$
under probability density 𝑔 is identically 1 for all 𝑡.
To verify this assertion, first notice that as mentioned earlier the unconditional mean $E[\ell(w_t) \mid q = g]$ is 1 for all 𝑡:
$$
\begin{aligned}
E[\ell(w_t) \mid q = g] &= \int \frac{f(w_t)}{g(w_t)} g(w_t) \, dw_t \\
&= \int f(w_t) \, dw_t \\
&= 1,
\end{aligned}
$$
which immediately implies
$$
\begin{aligned}
E[L(w^1) \mid q = g] &= E[\ell(w_1) \mid q = g] \\
&= 1.
\end{aligned}
$$
Because $L(w^t) = \ell(w_t) L(w^{t-1})$ and $\{w_t\}_{t=1}^t$ is an IID sequence, we have
$$
\begin{aligned}
E[L(w^t) \mid q = g] &= E[L(w^{t-1}) \ell(w_t) \mid q = g] \\
&= E[L(w^{t-1}) E[\ell(w_t) \mid q = g, w^{t-1}] \mid q = g] \\
&= E[L(w^{t-1}) E[\ell(w_t) \mid q = g] \mid q = g] \\
&= E[L(w^{t-1}) \mid q = g]
\end{aligned}
$$
for any 𝑡 ≥ 1.
Mathematical induction implies $E[L(w^t) \mid q = g] = 1$ for all 𝑡 ≥ 1.
48.4 Peculiar Property
How can 𝐸 [𝐿 (𝑤𝑡 ) ∣ 𝑞 = 𝑔] = 1 possibly be true when most probability mass of the likelihood ratio process is piling up
near 0 as 𝑡 → +∞?
The answer has to be that as 𝑡 → +∞, the distribution of 𝐿𝑡 becomes more and more fat-tailed: enough mass shifts to
larger and larger values of 𝐿𝑡 to make the mean of 𝐿𝑡 continue to be one despite most of the probability mass piling up
near 0.
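One way to see how both facts can hold at once is to compare the mean of ℓ with the mean of log ℓ under 𝑔. By Jensen's inequality the latter is strictly negative whenever 𝑓 ≠ 𝑔, so by the law of large numbers (1/𝑡) log 𝐿𝑡 converges to a negative number and 𝐿𝑡 itself converges to 0 along almost every path, even though its mean stays at 1. The sketch below checks this numerically; it assumes the Beta(1, 1) and Beta(3, 1.2) densities used in this lecture and relies on `scipy.integrate.quad`, neither of which is part of the lecture's own code at this point.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

f_dist, g_dist = beta(1, 1), beta(3, 1.2)

def log_ell(w):
    "Log likelihood ratio log ℓ(w) = log f(w) − log g(w)."
    return f_dist.logpdf(w) - g_dist.logpdf(w)

# E[ℓ | q = g]: integrates to one, as shown analytically above
mean_ell, _ = quad(lambda w: np.exp(log_ell(w)) * g_dist.pdf(w), 0, 1)

# E[log ℓ | q = g]: strictly negative by Jensen's inequality
mean_log_ell, _ = quad(lambda w: log_ell(w) * g_dist.pdf(w), 0, 1)

# Because f is uniform on [0, 1], log f = 0, so E[log ℓ | g] equals the
# differential entropy of g; scipy provides it as an independent check.
entropy_g = g_dist.entropy()
```

The negative value of `mean_log_ell` is the drift of the random walk log 𝐿𝑡 under 𝑔, which is why typical paths collapse toward zero while rare, very large realizations keep the mean of 𝐿𝑡 at one.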
To illustrate this peculiar property, we simulate many paths and calculate the unconditional mean of 𝐿 (𝑤𝑡 ) by averaging
across these many paths at each 𝑡.
l_arr_g = simulate(G_a, G_b, N=50000)
l_seq_g = np.cumprod(l_arr_g, axis=1)
It would be useful to use simulations to verify that unconditional means 𝐸 [𝐿 (𝑤𝑡 )] equal unity by averaging across sample
paths.
But it would be too computer-time-consuming for us to do that here simply by applying a standard Monte Carlo simulation
approach.
The reason is that the distribution of 𝐿 (𝑤𝑡 ) is extremely skewed for large values of 𝑡.
Because the probability density in the right tail is close to 0, it just takes too much computer time to sample enough points
from the right tail.
We explain the problem in more detail in this lecture.
There we describe an alternative way to compute the mean of a likelihood ratio by computing the mean of a
different random variable by sampling from a different probability distribution.

48.5 Nature Permanently Draws from Density f
Now suppose that before time 0 nature permanently decided to draw repeatedly from density 𝑓.
While the mean of the likelihood ratio ℓ (𝑤𝑡 ) under density 𝑔 is 1, its mean under the density 𝑓 exceeds one.
To see this, we compute
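$$
\begin{aligned}
E[\ell(w_t) \mid q = f] &= \int \frac{f(w_t)}{g(w_t)} f(w_t) \, dw_t \\
&= \int \frac{f(w_t)}{g(w_t)} \frac{f(w_t)}{g(w_t)} g(w_t) \, dw_t \\
&= \int \ell(w_t)^2 g(w_t) \, dw_t \\
&= E[\ell(w_t)^2 \mid q = g] \\
&= E[\ell(w_t) \mid q = g]^2 + Var(\ell(w_t) \mid q = g) \\
&> E[\ell(w_t) \mid q = g]^2 = 1
\end{aligned}
$$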
This in turn implies that the unconditional mean of the likelihood ratio process 𝐿(𝑤𝑡 ) diverges toward +∞.
Simulations below confirm this conclusion.
Please note the scale of the 𝑦 axis.
l_arr_f = simulate(F_a, F_b, N=50000)

l_seq_f = np.cumprod(l_arr_f, axis=1)
N, T = l_arr_f.shape
plt.plot(range(T), np.mean(l_seq_f, axis=0))
We also plot the probability that 𝐿 (𝑤𝑡 ) falls into the interval [10000, ∞) as a function of time and watch how fast
probability mass diverges to +∞.

plt.plot(range(T), np.sum(l_seq_f > 10000, axis=0) / N)
48.6 Likelihood Ratio Test
We now describe how to employ the machinery of Neyman and Pearson [Neyman and Pearson, 1933] to test the hypothesis
that history 𝑤𝑡 is generated by repeated IID draws from density 𝑔.
Denote 𝑞 as the data generating process, so that 𝑞 = 𝑓 or 𝑔.
Upon observing a sample $\{W_i\}_{i=1}^t$, we want to decide whether nature is drawing from 𝑔 or from 𝑓 by performing a
(frequentist) hypothesis test.
We specify
• Null hypothesis 𝐻0 : 𝑞 = 𝑓,
• Alternative hypothesis 𝐻1 : 𝑞 = 𝑔.
Neyman and Pearson proved that the best way to test this hypothesis is to use a likelihood ratio test that takes the form:
• reject 𝐻0 if $L(W^t) < c$,
• accept 𝐻0 otherwise.
where 𝑐 is a given discrimination threshold, to be chosen in a way we’ll soon describe.
This test is best in the sense that it is a uniformly most powerful test.
To understand what this means, we have to define probabilities of two important events that allow us to characterize a test
associated with a given threshold 𝑐.
The two probabilities are:
• Probability of detection (= power = 1 minus probability of Type II error):
1 − 𝛽 ≡ Pr {𝐿 (𝑤𝑡 ) < 𝑐 ∣ 𝑞 = 𝑔}
• Probability of false alarm (= significance level = probability of Type I error):
𝛼 ≡ Pr {𝐿 (𝑤𝑡 ) < 𝑐 ∣ 𝑞 = 𝑓}
The Neyman-Pearson Lemma states that among all possible tests, a likelihood ratio test maximizes the probability of
detection for a given probability of false alarm.
Another way to say the same thing is that among all possible tests, a likelihood ratio test maximizes power for a given
significance level.
To have made a good inference, we want a small probability of false alarm and a large probability of detection.
With sample size 𝑡 fixed, we can change our two probabilities by adjusting 𝑐.
A troublesome “that’s life” fact is that these two probabilities move in the same direction as we vary the critical value 𝑐.
Without specifying quantitative losses from making Type I and Type II errors, there is little that we can say about how
we should trade off probabilities of the two types of mistakes.
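The claim that the two probabilities move in the same direction as 𝑐 varies can be illustrated with a small self-contained simulation. This is a sketch under stated assumptions: it takes 𝑓 = Beta(1, 1) and 𝑔 = Beta(3, 1.2), and the helper names `beta_pdf` and `simulate_L`, the sample size `t = 5` and the list of thresholds are choices made here for illustration only.

```python
import numpy as np
from math import gamma

def beta_pdf(w, a, b):
    "Density of a Beta(a, b) distribution evaluated at w."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * w**(a - 1) * (1 - w)**(b - 1)

def simulate_L(a, b, t, N, rng, F=(1, 1), G=(3, 1.2)):
    "Draw N histories of length t from Beta(a, b) and return L(w^t) for each."
    w = rng.beta(a, b, size=(N, t))
    l = beta_pdf(w, *F) / beta_pdf(w, *G)
    return np.prod(l, axis=1)

rng = np.random.default_rng(0)
t, N = 5, 20_000
L_f = simulate_L(1, 1, t, N, rng)      # histories generated by f
L_g = simulate_L(3, 1.2, t, N, rng)    # histories generated by g

thresholds = [0.1, 0.5, 1.0, 2.0, 10.0]
PD = [np.mean(L_g < c) for c in thresholds]    # probability of detection
PFA = [np.mean(L_f < c) for c in thresholds]   # probability of false alarm

for c, d, a in zip(thresholds, PD, PFA):
    print(f"c={c:5.1f}  detection={d:.3f}  false alarm={a:.3f}")
```

Raising 𝑐 enlarges the rejection region {𝐿 < 𝑐}, so both empirical frequencies can only rise together.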
We do know that increasing sample size 𝑡 improves statistical inference.
Below we plot some informative figures that illustrate this.
We also present a classical frequentist method for choosing a sample size 𝑡.
Let’s start with a case in which we fix the threshold 𝑐 at 1.
c = 1
Below we plot empirical distributions of logarithms of the cumulative likelihood ratios simulated above, which are
generated by either 𝑓 or 𝑔.
Taking logarithms has no effect on calculating the probabilities because the log is a monotonic transformation.
As 𝑡 increases, the probabilities of making Type I and Type II errors both decrease, which is good.
This is because most of the probability mass of log(𝐿(𝑤𝑡 )) moves toward −∞ when 𝑔 is the data generating process,
while log(𝐿(𝑤𝑡 )) goes to ∞ when data are generated by 𝑓.
That disparate behavior of log(𝐿(𝑤𝑡 )) under 𝑓 and 𝑔 is what makes it possible to distinguish 𝑞 = 𝑓 from 𝑞 = 𝑔.

fig, axs = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('distribution of $log(L(w^t))$ under f or g', fontsize=15)

for i, t in enumerate([1, 7, 14, 21]):
    nr = i // 2
    nc = i % 2

    axs[nr, nc].axvline(np.log(c), color="k", ls="--")

    hist_f, x_f = np.histogram(np.log(l_seq_f[:, t]), 200, density=True)
    hist_g, x_g = np.histogram(np.log(l_seq_g[:, t]), 200, density=True)

    axs[nr, nc].plot(x_f[1:], hist_f, label="dist under f")
    axs[nr, nc].plot(x_g[1:], hist_g, label="dist under g")

    # shade the Type I error region under f and the Type II error region under g
    for j, (x, hist, label) in enumerate(zip([x_f, x_g], [hist_f, hist_g],
                                             ["Type I error", "Type II error"])):
        ind = x[1:] <= np.log(c) if j == 0 else x[1:] > np.log(c)
        axs[nr, nc].fill_between(x[1:][ind], hist[ind], alpha=0.5, label=label)

    axs[nr, nc].legend()
    axs[nr, nc].set_title(f"t={t}")

plt.show()
The graph below shows more clearly that, when we hold the threshold 𝑐 fixed, the probability of detection monotonically
increases with increases in 𝑡 and that the probability of a false alarm monotonically decreases.
PD = np.empty(T)
PFA = np.empty(T)

for t in range(T):
    PD[t] = np.sum(l_seq_g[:, t] < c) / N
    PFA[t] = np.sum(l_seq_f[:, t] < c) / N

plt.plot(range(T), PD, label="Probability of detection")
plt.plot(range(T), PFA, label="Probability of false alarm")
plt.xlabel("t")
plt.title("$c=1$")
plt.legend()
plt.show()

For a given sample size 𝑡, the threshold 𝑐 uniquely pins down probabilities of both types of error.
If for a fixed 𝑡 we now free up and move 𝑐, we will sweep out the probability of detection as a function of the probability
of false alarm.
This produces what is called a receiver operating characteristic curve.
Below, we plot receiver operating characteristic curves for different sample sizes 𝑡.
PFA = np.arange(0, 100, 1)

for t in range(1, 15, 4):
    percentile = np.percentile(l_seq_f[:, t], PFA)
    PD = [np.sum(l_seq_g[:, t] < p) / N for p in percentile]

    plt.plot(PFA / 100, PD, label=f"t={t}")

plt.scatter(0, 1, label="perfect detection")
plt.plot([0, 1], [0, 1], color='k', ls='--', label="random detection")

plt.arrow(0.5, 0.5, -0.15, 0.15, head_width=0.03)
plt.text(0.35, 0.7, "better")
plt.xlabel("Probability of false alarm")
plt.ylabel("Probability of detection")
plt.legend()
plt.title("Receiver Operating Characteristic Curve")
plt.show()

Notice that as 𝑡 increases, we are assured a larger probability of detection and a smaller probability of false alarm associated
with a given discrimination threshold 𝑐.
As 𝑡 → +∞, we approach the perfect detection curve that is indicated by a right angle hinging on the blue dot.
For a given sample size 𝑡, the discrimination threshold 𝑐 determines a point on the receiver operating characteristic curve.
It is up to the test designer to trade off probabilities of making the two types of errors.
But we know how to choose the smallest sample size to achieve given targets for the probabilities.
Typically, frequentists aim for a high probability of detection that respects an upper bound on the probability of false
alarm.
Below we show an example in which we fix the probability of false alarm at 0.05.
The required sample size for making a decision is then determined by a target probability of detection, for example, 0.9,
as depicted in the following graph.
PFA = 0.05
PD = np.empty(T)

for t in range(T):
    c = np.percentile(l_seq_f[:, t], PFA * 100)
    PD[t] = np.sum(l_seq_g[:, t] < c) / N

plt.plot(range(T), PD)
plt.axhline(0.9, color="k", ls="--")
plt.xlabel("t")
plt.title(f"Probability of false alarm={PFA}")
plt.show()

The United States Navy evidently used a procedure like this to select a sample size 𝑡 for doing quality control tests during
World War II.
A Navy Captain who had been ordered to perform tests of this kind had doubts about it that he presented to Milton
Friedman, as we describe in this lecture.
48.7 Kullback–Leibler Divergence
Now let’s consider a case in which neither 𝑔 nor 𝑓 generates the data.
Instead, a third distribution ℎ does.
Let’s watch how the cumulated likelihood ratios 𝑓/𝑔 behave when ℎ governs the data.
A key tool here is called Kullback–Leibler divergence.
It is also called relative entropy.
It measures how one probability distribution differs from another.
In our application, we want to measure how 𝑓 or 𝑔 diverges from ℎ.
The two Kullback–Leibler divergences pertinent for us are 𝐾𝑓 and 𝐾𝑔 defined as
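$$
\begin{aligned}
K_f &= E_h\left[\log\left(\frac{f(w)}{h(w)}\right) \frac{f(w)}{h(w)}\right] \\
    &= \int \log\left(\frac{f(w)}{h(w)}\right) \frac{f(w)}{h(w)} h(w) \, dw \\
    &= \int \log\left(\frac{f(w)}{h(w)}\right) f(w) \, dw
\end{aligned}
$$

$$
\begin{aligned}
K_g &= E_h\left[\log\left(\frac{g(w)}{h(w)}\right) \frac{g(w)}{h(w)}\right] \\
    &= \int \log\left(\frac{g(w)}{h(w)}\right) \frac{g(w)}{h(w)} h(w) \, dw \\
    &= \int \log\left(\frac{g(w)}{h(w)}\right) g(w) \, dw
\end{aligned}
$$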
When 𝐾𝑔 < 𝐾𝑓 , 𝑔 is closer to ℎ than 𝑓 is.
• In that case we’ll find that 𝐿 (𝑤𝑡 ) → 0.
When 𝐾𝑔 > 𝐾𝑓 , 𝑓 is closer to ℎ than 𝑔 is.
• In that case we’ll find that 𝐿 (𝑤𝑡 ) → +∞.
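A complementary way to see which way 𝐿 (𝑤𝑡 ) moves is to look at the sign of the average of log ℓ(𝑤) under ℎ: since log 𝐿 (𝑤𝑡 ) is a sum of IID terms log ℓ(𝑤𝑖 ) when ℎ generates the data, a negative average pushes 𝐿 (𝑤𝑡 ) toward 0 and a positive one pushes it toward +∞. The sketch below estimates that average by simulation. It assumes 𝑓 = Beta(1, 1) and 𝑔 = Beta(3, 1.2), and the helper names `beta_pdf` and `mean_log_l` are hypothetical; the two parameter pairs for ℎ are the ones this section experiments with.

```python
import numpy as np
from math import gamma

def beta_pdf(w, a, b):
    "Density of a Beta(a, b) distribution evaluated at w."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * w**(a - 1) * (1 - w)**(b - 1)

F_a, F_b = 1, 1       # assumed parameters of f
G_a, G_b = 3, 1.2     # assumed parameters of g

def mean_log_l(H_a, H_b, N=200_000, seed=0):
    "Monte Carlo estimate of E_h[log(f(w) / g(w))] with w drawn from Beta(H_a, H_b)."
    rng = np.random.default_rng(seed)
    w = rng.beta(H_a, H_b, size=N)
    return np.mean(np.log(beta_pdf(w, F_a, F_b) / beta_pdf(w, G_a, G_b)))

drift_near_g = mean_log_l(3.5, 1.8)   # h chosen close to g
drift_near_f = mean_log_l(1.2, 1.2)   # h chosen close to f

print(drift_near_g, drift_near_f)
```

A negative first number and a positive second number are consistent with the two cases listed above.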
We’ll now experiment with an ℎ that is also a beta distribution.
We’ll start by setting parameters 𝐻𝑎 and 𝐻𝑏 so that ℎ is closer to 𝑔.
H_a, H_b = 3.5, 1.8
h = njit(lambda x: p(x, H_a, H_b))
x_range = np.linspace(0, 1, 100)

plt.plot(x_range, f(x_range), label='f')
plt.plot(x_range, g(x_range), label='g')
plt.plot(x_range, h(x_range), label='h')
plt.legend()
plt.show()
Let’s compute the Kullback–Leibler discrepancies by quadrature integration.
def KL_integrand(w, q, h):
    m = q(w) / h(w)
    return np.log(m) * q(w)

def compute_KL(h, f, g):
    Kf, _ = quad(KL_integrand, 0, 1, args=(f, h))
    Kg, _ = quad(KL_integrand, 0, 1, args=(g, h))
    return Kf, Kg
Kf, Kg = compute_KL(h, f, g)
Kf, Kg
(0.7902536603660161, 0.08554075759988769)
We have 𝐾𝑔 < 𝐾𝑓 .
Next, we can verify our conjecture about 𝐿 (𝑤𝑡 ) by simulation.
l_arr_h = simulate(H_a, H_b)

l_seq_h = np.cumprod(l_arr_h, axis=1)
The figure below plots over time the fraction of paths 𝐿 (𝑤𝑡 ) that fall in the interval [0, 0.01].
Notice that it converges to 1 as expected when 𝑔 is closer to ℎ than 𝑓 is.
N, T = l_arr_h.shape
plt.plot(range(T), np.sum(l_seq_h <= 0.01, axis=0) / N)
We can also try an ℎ that is closer to 𝑓 than is 𝑔 so that now 𝐾𝑔 is larger than 𝐾𝑓 .
H_a, H_b = 1.2, 1.2

h = njit(lambda x: p(x, H_a, H_b))

Kf, Kg = compute_KL(h, f, g)
Kf, Kg
(0.01239249754452668, 0.35377684280997646)
l_arr_h = simulate(H_a, H_b)

l_seq_h = np.cumprod(l_arr_h, axis=1)
Now the fraction of paths on which 𝐿 (𝑤𝑡 ) exceeds 10000 rises toward 1 as probability mass diverges to +∞.
N, T = l_arr_h.shape
plt.plot(range(T), np.sum(l_seq_h > 10000, axis=0) / N)
48.8 Sequels
Likelihood processes play an important role in Bayesian learning, as described in this lecture and as applied in this lecture.
Likelihood ratio processes appear again in this lecture, which contains another illustration of the peculiar property of
likelihood ratio processes described above.
CHAPTER
FORTYNINE
COMPUTING MEAN OF A LIKELIHOOD RATIO PROCESS
Contents
• Computing Mean of a Likelihood Ratio Process

– Overview
– Mathematical Expectation of Likelihood Ratio
– Importance sampling
– Selecting a Sampling Distribution
– Approximating a cumulative likelihood ratio
– Distribution of Sample Mean
– More Thoughts about Choice of Sampling Distribution
49.1 Overview
In this lecture we described a peculiar property of a likelihood ratio process, namely, that its mean equals one for all 𝑡 ≥ 0
despite its converging to zero almost surely.
While it is easy to verify that peculiar property analytically (i.e., in population), it is challenging to use a computer
simulation to verify it via an application of a law of large numbers that entails studying sample averages of repeated
simulations.
To confront this challenge, this lecture puts importance sampling to work to accelerate convergence of sample averages
to population means.
We use importance sampling to estimate the mean of a cumulative likelihood ratio $L(\omega^t) = \prod_{i=1}^t \ell(\omega_i)$.
We start by importing some Python packages.
import numpy as np
import matplotlib.pyplot as plt
from numba import njit, vectorize, prange
from math import gamma
49.2 Mathematical Expectation of Likelihood Ratio
In this lecture, we studied a likelihood ratio ℓ (𝜔𝑡 )
$$
\ell(\omega_t) = \frac{f(\omega_t)}{g(\omega_t)}
$$
where 𝑓 and 𝑔 are densities for Beta distributions with parameters 𝐹𝑎 , 𝐹𝑏 , 𝐺𝑎 , 𝐺𝑏 .

Assume that an i.i.d. random variable 𝜔𝑡 ∈ Ω is generated by 𝑔.
The cumulative likelihood ratio 𝐿 (𝜔𝑡 ) is
$$
L(\omega^t) = \prod_{i=1}^t \ell(\omega_i)
$$
Our goal is to approximate the mathematical expectation 𝐸 [𝐿 (𝜔𝑡 )] well.

In this lecture, we showed that 𝐸 [𝐿 (𝜔𝑡 )] equals 1 for all 𝑡. We want to check how well this holds if we replace 𝐸
with sample averages from simulations.
This turns out to be easier said than done because, for the Beta distributions assumed above, 𝐿 (𝜔𝑡 ) has a very skewed
distribution with a very long tail as 𝑡 → ∞.
This property makes it difficult to estimate the mean efficiently and accurately by standard Monte Carlo simulation
methods.
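The skewness can be seen directly in a small simulation. The sketch below draws paths of length 𝑇 = 10 from 𝑔 = Beta(3, 1.2) and compares the sample median of the cumulative likelihood ratio with its sample mean; the helper name `beta_pdf`, the seed and the numbers of paths and periods are choices made here for illustration.

```python
import numpy as np
from math import gamma

def beta_pdf(w, a, b):
    "Density of a Beta(a, b) distribution evaluated at w."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * w**(a - 1) * (1 - w)**(b - 1)

rng = np.random.default_rng(0)
T, N = 10, 10_000

w = rng.beta(3, 1.2, size=(N, T))                       # draws from g
L = np.prod(beta_pdf(w, 1, 1) / beta_pdf(w, 3, 1.2), axis=1)

sample_mean = np.mean(L)
sample_median = np.median(L)
share_below_one = np.mean(L < 1)

print(sample_mean, sample_median, share_below_one)
```

Most simulated paths end up far below one, so the population mean of one has to be supported by rare, very large realizations that a finite sample can easily miss.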
In this lecture we explore how a standard Monte Carlo method fails and how importance sampling provides a more
computationally efficient way to approximate the mean of the cumulative likelihood ratio.
We first take a look at the density functions 𝑓 and 𝑔.

F_a, F_b = 1, 1
G_a, G_b = 3, 1.2

@vectorize
def p(w, a, b):
    # normalizing constant of the Beta(a, b) density
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * w ** (a-1) * (1 - w) ** (b-1)

f = njit(lambda w: p(w, F_a, F_b))
g = njit(lambda w: p(w, G_a, G_b))

w_range = np.linspace(1e-5, 1-1e-5, 1000)

plt.plot(w_range, g(w_range), label='g')
plt.plot(w_range, f(w_range), label='f')
plt.xlabel('$\omega$')
plt.legend()
plt.title('density functions $f$ and $g$')
plt.show()
The likelihood ratio is l(w)=f(w)/g(w).
l = njit(lambda w: f(w) / g(w))
plt.plot(w_range, l(w_range))
plt.title('$\ell(\omega)$')
plt.xlabel('$\omega$')
plt.show()
The plot above shows that as 𝜔 → 0, 𝑓 (𝜔) is unchanged and 𝑔 (𝜔) → 0, so the likelihood ratio approaches infinity.
A Monte Carlo approximation of $\hat{E}[L(\omega^t)] = \hat{E}\left[\prod_{i=1}^t \ell(\omega_i)\right]$ would repeatedly draw 𝜔 from 𝑔, calculate the likelihood
ratio $\ell(\omega) = \frac{f(\omega)}{g(\omega)}$ for each draw, then average these over all draws.
Because 𝑔(𝜔) → 0 as 𝜔 → 0, such a simulation procedure undersamples a part of the sample space [0, 1] that it is
important to visit often in order to do a good job of approximating the mathematical expectation of the likelihood ratio
ℓ(𝜔).
We illustrate this numerically below.
49.3 Importance sampling
We circumvent the issue by using a change of distribution called importance sampling.

Instead of drawing from 𝑔 to generate data during the simulation, we use an alternative distribution ℎ to generate draws
of 𝜔.
The idea is to design ℎ so that it oversamples the region of Ω where ℓ (𝜔𝑡 ) has large values but low density under 𝑔.
After we construct a sample in this way, we must then weight each realization by the likelihood ratio of 𝑔 and ℎ when we
compute the empirical mean of the likelihood ratio.
By doing this, we properly account for the fact that we are using ℎ and not 𝑔 to simulate data.
To illustrate, suppose we’re interested in 𝐸 [ℓ (𝜔)].

We could simply compute:
$$
\hat{E}^g[\ell(\omega)] = \frac{1}{N} \sum_{i=1}^N \ell(\omega_i^g)
$$
where $\omega_i^g$ indicates that $\omega_i$ is drawn from 𝑔.

But using our insight from importance sampling, we could instead calculate the object:
$$
\hat{E}^h\left[\ell(\omega) \frac{g(\omega)}{h(\omega)}\right] = \frac{1}{N} \sum_{i=1}^N \ell(\omega_i^h) \frac{g(\omega_i^h)}{h(\omega_i^h)}
$$
where $\omega_i^h$ is now drawn from importance distribution ℎ.

Notice that the above two are exactly the same population objects:
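$$
E^g[\ell(\omega)] = \int_\Omega \ell(\omega) g(\omega) \, d\omega = \int_\Omega \ell(\omega) \frac{g(\omega)}{h(\omega)} h(\omega) \, d\omega = E^h\left[\ell(\omega) \frac{g(\omega)}{h(\omega)}\right]
$$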
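The two sample averages above can be compared in a few lines. This is a minimal NumPy sketch rather than the lecture's own implementation: it takes 𝑓 = Beta(1, 1) and 𝑔 = Beta(3, 1.2), picks ℎ = Beta(0.5, 0.5) as an illustrative importance distribution, and uses a hypothetical helper `beta_pdf`.

```python
import numpy as np
from math import gamma

def beta_pdf(w, a, b):
    "Density of a Beta(a, b) distribution evaluated at w."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * w**(a - 1) * (1 - w)**(b - 1)

f = lambda w: beta_pdf(w, 1, 1)
g = lambda w: beta_pdf(w, 3, 1.2)
h = lambda w: beta_pdf(w, 0.5, 0.5)    # illustrative importance distribution
l = lambda w: f(w) / g(w)              # likelihood ratio

rng = np.random.default_rng(0)
N = 10_000

# Standard Monte Carlo: draw from g and average l
w_g = rng.beta(3, 1.2, size=N)
mc_estimate = np.mean(l(w_g))

# Importance sampling: draw from h and reweight each term by g / h
w_h = rng.beta(0.5, 0.5, size=N)
is_estimate = np.mean(l(w_h) * g(w_h) / h(w_h))

print(mc_estimate, is_estimate)
```

Both numbers estimate the same population mean, which equals one; they differ only in how much sampling noise they carry.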
49.4 Selecting a Sampling Distribution
Since we must use an ℎ that has larger mass in parts of the distribution to which 𝑔 puts low mass, we use ℎ =
𝐵𝑒𝑡𝑎(0.5, 0.5) as our importance distribution.
The plots compare 𝑔 and ℎ.
g_a, g_b = G_a, G_b

h_a, h_b = 0.5, 0.5
plt.plot(w_range, g(w_range), label=f'g=Beta({g_a}, {g_b})')

plt.plot(w_range, p(w_range, 0.5, 0.5), label=f'h=Beta({h_a}, {h_b})')
plt.title('real data generating process $g$ and importance distribution $h$')
plt.legend()
plt.ylim([0., 3.])
plt.show()
49.5 Approximating a cumulative likelihood ratio

We now study how to use importance sampling to approximate $E[L(\omega^t)] = E\left[\prod_{i=1}^T \ell(\omega_i)\right]$.
As above, our plan is to draw sequences 𝜔𝑡 from 𝑞 and then re-weight the likelihood ratio appropriately:
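$$
\hat{E}^p\left[L(\omega^t)\right] = \hat{E}^p\left[\prod_{t=1}^T \ell(\omega_t)\right] = \hat{E}^q\left[\prod_{t=1}^T \ell(\omega_t) \frac{p(\omega_t)}{q(\omega_t)}\right] = \frac{1}{N} \sum_{i=1}^N \left(\prod_{t=1}^T \ell(\omega_{i,t}^q) \frac{p(\omega_{i,t}^q)}{q(\omega_{i,t}^q)}\right)
$$
where the last equality uses $\omega_{i,t}^q$ drawn from the importance distribution 𝑞.
Here $\frac{p(\omega_{i,t}^q)}{q(\omega_{i,t}^q)}$ is the weight we assign to each data point $\omega_{i,t}^q$.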
Below we prepare a Python function for computing the importance sampling estimates given any beta distributions 𝑝, 𝑞.
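@njit(parallel=True)
def estimate(p_a, p_b, q_a, q_b, T=1, N=10000):

    μ_L = 0
    for i in prange(N):

        L = 1
        weight = 1

        for t in range(T):
            # draw from the sampling distribution q
            w = np.random.beta(q_a, q_b)
            l = f(w) / g(w)

            L *= l
            # reweight by the ratio of target density p to sampling density q
            weight *= p(w, p_a, p_b) / p(w, q_a, q_b)

        μ_L += L * weight

    μ_L /= N

    return μ_L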
Consider the case when 𝑇 = 1, which amounts to approximating 𝐸0 [ℓ (𝜔)].

For the standard Monte Carlo estimate, we can set 𝑝 = 𝑔 and 𝑞 = 𝑔.
estimate(g_a, g_b, g_a, g_b, T=1, N=10000)
0.9545643628272709
For our importance sampling estimate, we set 𝑞 = ℎ.
estimate(g_a, g_b, h_a, h_b, T=1, N=10000)
1.0034498466312063
Evidently, even at T=1, our importance sampling estimate is closer to 1 than is the Monte Carlo estimate.
Bigger differences arise when computing expectations over longer sequences, 𝐸0 [𝐿 (𝜔𝑡 )].
Setting 𝑇 = 10, we find that the Monte Carlo method severely underestimates the mean while importance sampling still
produces an estimate close to its theoretical value of unity.
estimate(g_a, g_b, g_a, g_b, T=10, N=10000)
0.4115594531178296
estimate(g_a, g_b, h_a, h_b, T=10, N=10000)
0.9836224253217175
49.6 Distribution of Sample Mean
We next study the bias and efficiency of the Monte Carlo and importance sampling approaches.
The code below produces distributions of estimates using both Monte Carlo and importance sampling methods.
@njit(parallel=True)
def simulate(p_a, p_b, q_a, q_b, N_simu, T=1):

    μ_L_p = np.empty(N_simu)
    μ_L_q = np.empty(N_simu)

    for i in prange(N_simu):
        μ_L_p[i] = estimate(p_a, p_b, p_a, p_b, T=T)   # standard Monte Carlo
        μ_L_q[i] = estimate(p_a, p_b, q_a, q_b, T=T)   # importance sampling

    return μ_L_p, μ_L_q
Again, we first consider estimating 𝐸 [ℓ (𝜔)] by setting T=1.

We simulate 1000 times for each method.
N_simu = 1000
μ_L_p, μ_L_q = simulate(g_a, g_b, h_a, h_b, N_simu)
# standard Monte Carlo (mean and variance)

np.nanmean(μ_L_p), np.nanvar(μ_L_p)
(0.9965017531985612, 0.008583971261348671)
# importance sampling (mean and variance)

np.nanmean(μ_L_q), np.nanvar(μ_L_q)
(0.9998655901548411, 2.464706844392727e-05)
Although both methods tend to provide a mean estimate of 𝐸 [ℓ (𝜔)] close to 1, the importance sampling estimates have
smaller variance.
Next, we present distributions of estimates for 𝐸̂ [𝐿 (𝜔𝑡 )], in cases for 𝑇 = 1, 5, 10, 20.
fig, axs = plt.subplots(2, 2, figsize=(14, 10))

μ_range = np.linspace(0, 2, 100)

for i, t in enumerate([1, 5, 10, 20]):
    row = i // 2
    col = i % 2

    μ_L_p, μ_L_q = simulate(g_a, g_b, h_a, h_b, N_simu, T=t)
    μ_hat_p, μ_hat_q = np.nanmean(μ_L_p), np.nanmean(μ_L_q)
    σ_hat_p, σ_hat_q = np.nanvar(μ_L_p), np.nanvar(μ_L_q)

    axs[row, col].set_xlabel('$μ_L$')
    axs[row, col].set_ylabel('frequency')
    axs[row, col].set_title(f'$T$={t}')
    n_p, bins_p, _ = axs[row, col].hist(μ_L_p, bins=μ_range, color='r', alpha=0.5, label='$g$ generating')
    n_q, bins_q, _ = axs[row, col].hist(μ_L_q, bins=μ_range, color='b', alpha=0.5, label='$h$ generating')
    axs[row, col].legend(loc=4)

    for n, bins, μ_hat, σ_hat in [[n_p, bins_p, μ_hat_p, σ_hat_p],
                                  [n_q, bins_q, μ_hat_q, σ_hat_q]]:
        idx = np.argmax(n)
        axs[row, col].text(bins[idx], n[idx], '$\hat{μ}$='+f'{μ_hat:.4g}'+', $\hat{σ}=$'+f'{σ_hat:.4g}')

plt.show()
The simulation exercises above show that the importance sampling estimates are unbiased under all 𝑇 while the standard
Monte Carlo estimates are biased downwards.
Evidently, the bias increases with increases in 𝑇 .
49.7 More Thoughts about Choice of Sampling Distribution
Above, we arbitrarily chose ℎ = 𝐵𝑒𝑡𝑎(0.5, 0.5) as the importance distribution.

Is there an optimal importance distribution?
In our particular case, we know in advance that 𝐸0 [𝐿 (𝜔𝑡 )] = 1, and
we can use that knowledge to our advantage.
Thus, suppose that we simply use ℎ = 𝑓.
When estimating the mean of the likelihood ratio (T=1), we get:
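$$
\hat{E}^f\left[\ell(\omega) \frac{g(\omega)}{f(\omega)}\right] = \hat{E}^f\left[\frac{f(\omega)}{g(\omega)} \frac{g(\omega)}{f(\omega)}\right] = \frac{1}{N} \sum_{i=1}^N \ell(\omega_i^f) \frac{g(\omega_i^f)}{f(\omega_i^f)} = 1
$$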
μ_L_p, μ_L_q = simulate(g_a, g_b, F_a, F_b, N_simu)
# importance sampling (mean and variance)

np.nanmean(μ_L_q), np.nanvar(μ_L_q)
(1.0, 0.0)
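The zero variance is no accident: with ℎ = 𝑓, every individual term in the importance sampling average equals one, whatever value of 𝜔 is drawn. The short check below illustrates this with plain NumPy; the helper name `beta_pdf` is hypothetical, and the parameters 𝑓 = Beta(1, 1) and 𝑔 = Beta(3, 1.2) are the ones used in this lecture.

```python
import numpy as np
from math import gamma

def beta_pdf(w, a, b):
    "Density of a Beta(a, b) distribution evaluated at w."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * w**(a - 1) * (1 - w)**(b - 1)

rng = np.random.default_rng(0)
w = rng.beta(1, 1, size=1_000)          # draws from h = f

f_w = beta_pdf(w, 1, 1)
g_w = beta_pdf(w, 3, 1.2)

terms = (f_w / g_w) * (g_w / f_w)       # l(w) times the weight g(w) / f(w)

print(terms.min(), terms.max())
```

Because each term is one up to floating point rounding, the sample mean is one for any sample size.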
We could also use other distributions as our importance distribution.

Below we choose just a few and compare their sampling properties.
a_list = [0.5, 1., 2.]
b_list = [0.5, 1.2, 5.]

plt.plot(w_range, g(w_range), label=f'g=Beta({g_a}, {g_b})')
plt.plot(w_range, p(w_range, a_list[0], b_list[0]), label=f'$h_1$=Beta({a_list[0]}, {b_list[0]})')
plt.plot(w_range, p(w_range, a_list[1], b_list[1]), label=f'$h_2$=Beta({a_list[1]}, {b_list[1]})')
plt.plot(w_range, p(w_range, a_list[2], b_list[2]), label=f'$h_3$=Beta({a_list[2]}, {b_list[2]})')
plt.title('real data generating process $g$ and importance distribution $h$')
plt.legend()
plt.ylim([0., 3.])
plt.show()

We consider two additional distributions.

As a reminder ℎ1 is the original 𝐵𝑒𝑡𝑎(0.5, 0.5) distribution that we used above.
ℎ2 is the 𝐵𝑒𝑡𝑎(1, 1.2) distribution.
Note how ℎ2 has a similar shape to 𝑔 at higher values of 𝜔 but more mass at lower values.
Our hunch is that ℎ2 should be a good importance sampling distribution.
ℎ3 is the 𝐵𝑒𝑡𝑎(2, 5) distribution.
Note how ℎ3 has almost no mass at values very close to 0 and at values close to 1.
Our hunch is that ℎ3 will be a poor importance sampling distribution.
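One way to make these hunches concrete is to look at the per-draw term ℓ(𝜔)𝑔(𝜔)/ℎ(𝜔), which equals 𝑓(𝜔)/ℎ(𝜔), that importance sampling averages. The sketch below evaluates it on a grid for the three candidates. It is only a rough diagnostic, and the helper name `beta_pdf`, the grid and the dictionary layout are choices made here for illustration.

```python
import numpy as np
from math import gamma

def beta_pdf(w, a, b):
    "Density of a Beta(a, b) distribution evaluated at w."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * w**(a - 1) * (1 - w)**(b - 1)

w_grid = np.linspace(1e-5, 1 - 1e-5, 1000)

candidates = {"h1": (0.5, 0.5), "h2": (1.0, 1.2), "h3": (2.0, 5.0)}

# Largest value on the grid of f(w) / h(w), the term averaged when T = 1
max_term = {name: np.max(beta_pdf(w_grid, 1, 1) / beta_pdf(w_grid, a, b))
            for name, (a, b) in candidates.items()}

for name, value in max_term.items():
    print(name, value)
```

A term that stays moderate over the whole grid suggests a well-behaved average, while a term that explodes where the candidate puts little mass suggests that a few draws will dominate the estimate.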
We first simulate and plot the distribution of estimates for 𝐸̂ [𝐿 (𝜔𝑡 )] using ℎ2 as the importance sampling distribution.
h_a = a_list[1]
h_b = b_list[1]

fig, axs = plt.subplots(1,2, figsize=(14, 10))

for i, t in enumerate([1, 20]):
    μ_L_p, μ_L_q = simulate(g_a, g_b, h_a, h_b, N_simu, T=t)
    μ_hat_p, μ_hat_q = np.nanmean(μ_L_p), np.nanmean(μ_L_q)
    σ_hat_p, σ_hat_q = np.nanvar(μ_L_p), np.nanvar(μ_L_q)

    axs[i].set_xlabel('$μ_L$')
    axs[i].set_ylabel('frequency')
    axs[i].set_title(f'$T$={t}')
    n_p, bins_p, _ = axs[i].hist(μ_L_p, bins=μ_range, color='r', alpha=0.5, label='$g$ generating')
    n_q, bins_q, _ = axs[i].hist(μ_L_q, bins=μ_range, color='b', alpha=0.5, label='$h_2$ generating')
    axs[i].legend(loc=4)

    for n, bins, μ_hat, σ_hat in [[n_p, bins_p, μ_hat_p, σ_hat_p],
                                  [n_q, bins_q, μ_hat_q, σ_hat_q]]:
        idx = np.argmax(n)
        axs[i].text(bins[idx], n[idx], '$\hat{μ}$='+f'{μ_hat:.4g}'+', $\hat{σ}=$'+f'{σ_hat:.4g}')

plt.show()
Our simulations suggest that indeed ℎ2 is a quite good importance sampling distribution for our problem.
Even at 𝑇 = 20, the mean is very close to 1 and the variance is small.
h_a = a_list[2]
h_b = b_list[2]

fig, axs = plt.subplots(1,2, figsize=(14, 10))

for i, t in enumerate([1, 20]):
    μ_L_p, μ_L_q = simulate(g_a, g_b, h_a, h_b, N_simu, T=t)
    μ_hat_p, μ_hat_q = np.nanmean(μ_L_p), np.nanmean(μ_L_q)
    σ_hat_p, σ_hat_q = np.nanvar(μ_L_p), np.nanvar(μ_L_q)

    axs[i].set_xlabel('$μ_L$')
    axs[i].set_ylabel('frequency')
    axs[i].set_title(f'$T$={t}')
    n_p, bins_p, _ = axs[i].hist(μ_L_p, bins=μ_range, color='r', alpha=0.5, label='$g$ generating')
    n_q, bins_q, _ = axs[i].hist(μ_L_q, bins=μ_range, color='b', alpha=0.5, label='$h_3$ generating')
    axs[i].legend(loc=4)

    for n, bins, μ_hat, σ_hat in [[n_p, bins_p, μ_hat_p, σ_hat_p],
                                  [n_q, bins_q, μ_hat_q, σ_hat_q]]:
        idx = np.argmax(n)
        axs[i].text(bins[idx], n[idx], '$\hat{μ}$='+f'{μ_hat:.4g}'+', $\hat{σ}=$'+f'{σ_hat:.4g}')

plt.show()

However, ℎ3 is evidently a poor importance sampling distribution for our problem, with a mean estimate far away from 1
for 𝑇 = 20.
Notice that even at 𝑇 = 1, the mean estimate with importance sampling is more biased than just sampling with 𝑔 itself.
Thus, our simulations suggest that we would be better off simply using Monte Carlo approximations under 𝑔 than using
ℎ3 as an importance sampling distribution for our problem.

CHAPTER
FIFTY
A PROBLEM THAT STUMPED MILTON FRIEDMAN
(and that Abraham Wald solved by inventing sequential analysis)
Contents
• A Problem that Stumped Milton Friedman

– Overview
– Origin of the Problem
– A Dynamic Programming Approach
– Implementation
– Analysis
– Comparison with Neyman-Pearson Formulation
– Sequels
50.1 Overview
This lecture describes a statistical decision problem presented to Milton Friedman and W. Allen Wallis during World War
II when they were analysts at the U.S. Government’s Statistical Research Group at Columbia University.
This problem led Abraham Wald [Wald, 1947] to formulate sequential analysis, an approach to statistical decision
problems intimately related to dynamic programming.
In this lecture, we apply dynamic programming algorithms to Friedman and Wallis and Wald’s problem.
Key ideas in play will be:
• Bayes’ Law
• Dynamic programming
• Type I and type II statistical errors
– a type I error occurs when you reject a null hypothesis that is true
– a type II error occurs when you accept a null hypothesis that is false
• Abraham Wald’s sequential probability ratio test
• The power of a statistical test
• The critical region of a statistical test
• A uniformly most powerful test

We’ll begin with some imports:
import numpy as np
import matplotlib.pyplot as plt
from numba import jit, prange, float64, int64
from numba.experimental import jitclass
from math import gamma
This lecture uses ideas studied in this lecture, this lecture, and this lecture.
50.2 Origin of the Problem
On pages 137-139 of his 1998 book Two Lucky People with Rose Friedman [Friedman and Friedman, 1998], Milton
Friedman described a problem presented to him and Allen Wallis during World War II, when they worked at the US
Government’s Statistical Research Group at Columbia University.
Note: See pages 25 and 26 of Allen Wallis’s 1980 article [Wallis, 1980] about the Statistical Research Group at Columbia
University during World War II for his account of the episode and for important contributions that Harold Hotelling made
to formulating the problem. Also see chapter 5 of Jennifer Burns’s book about Milton Friedman [Burns, 2023].
Let’s listen to Milton Friedman tell us what happened:

In order to understand the story, it is necessary to have an idea of a simple statistical problem, and of the
standard procedure for dealing with it. The actual problem out of which sequential analysis grew will serve.
The Navy has two alternative designs (say A and B) for a projectile. It wants to determine which is superior.
To do so it undertakes a series of paired firings. On each round, it assigns the value 1 or 0 to A accordingly as
its performance is superior or inferior to that of B and conversely 0 or 1 to B. The Navy asks the statistician
how to conduct the test and how to analyze the results.
The standard statistical answer was to specify a number of firings (say 1,000) and a pair of percentages (e.g.,
53% and 47%) and tell the client that if A receives a 1 in more than 53% of the firings, it can be regarded
as superior; if it receives a 1 in fewer than 47%, B can be regarded as superior; if the percentage is between
47% and 53%, neither can be so regarded.
When Allen Wallis was discussing such a problem with (Navy) Captain Garret L. Schyler, the captain objected
that such a test, to quote from Allen’s account, may prove wasteful. If a wise and seasoned ordnance
officer like Schyler were on the premises, he would see after the first few thousand or even few hundred
[rounds] that the experiment need not be completed either because the new method is obviously inferior or
because it is obviously superior beyond what was hoped for ….
Friedman and Wallis struggled with the problem but, after realizing that they were not able to solve it, described the
problem to Abraham Wald.
That started Wald on the path that led him to Sequential Analysis [Wald, 1947].
We’ll formulate the problem using dynamic programming.
50.3 A Dynamic Programming Approach
The following presentation of the problem closely follows Dimitri Bertsekas’s treatment in Dynamic Programming and
Stochastic Control [Bertsekas, 1975].
A decision-maker can observe a sequence of draws of a random variable 𝑧.
He (or she) wants to know which of two probability distributions 𝑓0 or 𝑓1 governs 𝑧.
Conditional on knowing that successive observations are drawn from distribution 𝑓0 , the sequence of random variables is
independently and identically distributed (IID).
Conditional on knowing that successive observations are drawn from distribution 𝑓1 , the sequence of random variables is
also independently and identically distributed (IID).
But the observer does not know which of the two distributions generated the sequence.
For reasons explained in Exchangeability and Bayesian Updating, this means that the sequence is not IID.
The observer has something to learn, namely, whether the observations are drawn from 𝑓0 or from 𝑓1 .
The decision maker wants to decide which of the two distributions is generating outcomes.
We adopt a Bayesian formulation.
The decision maker begins with a prior probability
𝜋−1 = ℙ{𝑓 = 𝑓0 ∣ no observations} ∈ (0, 1)
After observing 𝑘+1 observations 𝑧𝑘 , 𝑧𝑘−1 , … , 𝑧0 , he updates his personal probability that the observations are described
by distribution 𝑓0 to
𝜋𝑘 = ℙ{𝑓 = 𝑓0 ∣ 𝑧𝑘 , 𝑧𝑘−1 , … , 𝑧0 }
which is calculated recursively by applying Bayes’ law:
$$
\pi_{k+1} = \frac{\pi_k f_0(z_{k+1})}{\pi_k f_0(z_{k+1}) + (1 - \pi_k) f_1(z_{k+1})}, \quad k = -1, 0, 1, \ldots
$$
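The recursion can be sketched as a short function that maps a current belief and a new observation into an updated belief. The sketch assumes, purely for illustration, that 𝑓0 = Beta(1, 1) and 𝑓1 = Beta(9, 9); the names `beta_pdf` and `update` and the sample observations are hypothetical.

```python
from math import gamma

def beta_pdf(z, a, b):
    "Density of a Beta(a, b) distribution evaluated at z."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * z**(a - 1) * (1 - z)**(b - 1)

f0 = lambda z: beta_pdf(z, 1, 1)   # assumed density under f0
f1 = lambda z: beta_pdf(z, 9, 9)   # assumed density under f1

def update(π, z):
    "One step of Bayes' law: belief that f = f0 after observing z."
    num = π * f0(z)
    return num / (num + (1 - π) * f1(z))

π = 0.5                            # prior π_{-1}
for z in [0.5, 0.45, 0.05]:        # illustrative observations z_0, z_1, z_2
    π = update(π, z)
    print(z, π)
```

Observations near 0.5 are more likely under the assumed 𝑓1, so they lower the belief in 𝑓0, while an observation near the edge of the unit interval raises it sharply.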
After observing 𝑧𝑘 , 𝑧𝑘−1 , … , 𝑧0 , the decision-maker believes that 𝑧𝑘+1 has probability distribution
𝑓𝜋𝑘 (𝑣) = 𝜋𝑘 𝑓0 (𝑣) + (1 − 𝜋𝑘 )𝑓1 (𝑣),
which is a mixture of distributions 𝑓0 and 𝑓1 , with the weight on 𝑓0 being the posterior probability that 𝑓 = 𝑓0.¹
To illustrate such a distribution, let’s inspect some mixtures of beta distributions.
The density of a beta probability distribution with parameters 𝑎 and 𝑏 is
$$
f(z; a, b) = \frac{\Gamma(a+b) z^{a-1} (1-z)^{b-1}}{\Gamma(a) \Gamma(b)}
\quad \text{where} \quad
\Gamma(t) := \int_0^\infty x^{t-1} e^{-x} \, dx
$$
The next figure shows two beta distributions in the top panel.
The bottom panel presents mixtures of these distributions, with various mixing probabilities 𝜋𝑘.
¹ The decision maker acts as if he believes that the sequence of random variables [𝑧0 , 𝑧1 , …] is exchangeable. See Exchangeability and Bayesian
Updating and [Kreps, 1988] chapter 11, for discussions of exchangeability.
@jit(nopython=True)
def p(x, a, b):
    # normalizing constant of the Beta(a, b) density
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x**(a-1) * (1 - x)**(b-1)

f0 = lambda x: p(x, 1, 1)
f1 = lambda x: p(x, 9, 9)
grid = np.linspace(0, 1, 50)

fig, axes = plt.subplots(2, figsize=(10, 8))

axes[0].set_title("Original Distributions")
axes[0].plot(grid, f0(grid), lw=2, label="$f_0$")
axes[0].plot(grid, f1(grid), lw=2, label="$f_1$")

axes[1].set_title("Mixtures")
for π in 0.25, 0.5, 0.75:
    y = π * f0(grid) + (1 - π) * f1(grid)
    # plot against grid so that the x-axis shows z values
    axes[1].plot(grid, y, lw=2, label=f"$\pi_k$ = {π}")

for ax in axes:
    ax.legend()
    ax.set(xlabel="$z$ values", ylabel="probability of $z_k$")

plt.tight_layout()
plt.show()

50.3.1 Losses and Costs
After observing 𝑧𝑘 , 𝑧𝑘−1 , … , 𝑧0 , the decision-maker chooses among three distinct actions:
• He decides that 𝑓 = 𝑓0 and draws no more 𝑧’s
• He decides that 𝑓 = 𝑓1 and draws no more 𝑧’s
• He postpones deciding now and instead chooses to draw a 𝑧𝑘+1
Associated with these three actions, the decision-maker can suffer three kinds of losses:
• A loss 𝐿0 if he decides 𝑓 = 𝑓0 when actually 𝑓 = 𝑓1
• A loss 𝐿1 if he decides 𝑓 = 𝑓1 when actually 𝑓 = 𝑓0
• A cost 𝑐 if he postpones deciding and chooses instead to draw another 𝑧

50.3.2 Digression on Type I and Type II Errors
If we regard 𝑓 = 𝑓0 as a null hypothesis and 𝑓 = 𝑓1 as an alternative hypothesis, then 𝐿1 and 𝐿0 are losses associated
with two types of statistical errors
• a type I error is an incorrect rejection of a true null hypothesis (a “false positive”)
• a type II error is a failure to reject a false null hypothesis (a “false negative”)
So when we treat 𝑓 = 𝑓0 as the null hypothesis
• We can think of 𝐿1 as the loss associated with a type I error.
• We can think of 𝐿0 as the loss associated with a type II error.
50.3.3 Intuition
Before proceeding, let’s try to guess what an optimal decision rule might look like.
Suppose at some given point in time that 𝜋 is close to 1.
Then our prior beliefs and the evidence so far point strongly to 𝑓 = 𝑓0 .
If, on the other hand, 𝜋 is close to 0, then 𝑓 = 𝑓1 is strongly favored.
Finally, if 𝜋 is in the middle of the interval [0, 1], then we are confronted with more uncertainty.
This reasoning suggests a decision rule such as the one shown in the figure
As we’ll see, this is indeed the correct form of the decision rule.
Our problem is to determine threshold values 𝛼, 𝛽 that somehow depend on the parameters described above.
You might like to pause at this point and try to predict the impact of a parameter such as 𝑐 or 𝐿0 on 𝛼 or 𝛽.
50.3.4 A Bellman Equation
Let 𝐽 (𝜋) be the total loss for a decision-maker with current belief 𝜋 who chooses optimally.
With some thought, you will agree that 𝐽 should satisfy the Bellman equation
𝐽 (𝜋) = min {(1 − 𝜋)𝐿0 , 𝜋𝐿1 , 𝑐 + 𝔼[𝐽 (𝜋′ )]} (50.1)
where 𝜋′ is the random variable defined by Bayes’ Law

$$
\pi' = \kappa(z', \pi) = \frac{\pi f_0(z')}{\pi f_0(z') + (1 - \pi) f_1(z')}
$$
when 𝜋 is fixed and 𝑧′ is drawn from the current best guess, which is the distribution 𝑓𝜋 defined by
𝑓𝜋 (𝑣) = 𝜋𝑓0 (𝑣) + (1 − 𝜋)𝑓1 (𝑣)
In the Bellman equation, minimization is over three actions:

1. Accept the hypothesis that 𝑓 = 𝑓0

2. Accept the hypothesis that 𝑓 = 𝑓1
3. Postpone deciding and draw again
We can represent the Bellman equation as
𝐽 (𝜋) = min {(1 − 𝜋)𝐿0 , 𝜋𝐿1 , ℎ(𝜋)} (50.2)
where 𝜋 ∈ [0, 1] and

• (1 − 𝜋)𝐿0 is the expected loss associated with accepting 𝑓0 (i.e., the cost of making a type II error).
• 𝜋𝐿1 is the expected loss associated with accepting 𝑓1 (i.e., the cost of making a type I error).
• ℎ(𝜋) ∶= 𝑐 + 𝔼[𝐽 (𝜋′ )]; this is the continuation value; i.e., the expected cost associated with drawing one more 𝑧.
The optimal decision rule is characterized by two numbers 𝛼, 𝛽 ∈ (0, 1) × (0, 1) that satisfy
(1 − 𝜋)𝐿0 < min{𝜋𝐿1 , 𝑐 + 𝔼[𝐽 (𝜋′ )]} if 𝜋 ≥ 𝛼
and
𝜋𝐿1 < min{(1 − 𝜋)𝐿0 , 𝑐 + 𝔼[𝐽 (𝜋′ )]} if 𝜋 ≤ 𝛽
The optimal decision rule is then
accept 𝑓 = 𝑓0 if 𝜋 ≥ 𝛼
accept 𝑓 = 𝑓1 if 𝜋 ≤ 𝛽
draw another 𝑧 if 𝛽 < 𝜋 < 𝛼
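To see how a rule of this form operates, the sketch below runs it on simulated data. The thresholds 𝛼 = 0.8 and 𝛽 = 0.2 are hypothetical placeholders rather than optimal cutoffs, the densities 𝑓0 = Beta(1, 1) and 𝑓1 = Beta(3, 1.2) are assumed for illustration, and the names `beta_pdf` and `sequential_test` are not part of the lecture's code.

```python
import numpy as np
from math import gamma

def beta_pdf(z, a, b):
    "Density of a Beta(a, b) distribution evaluated at z."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * z**(a - 1) * (1 - z)**(b - 1)

def sequential_test(draw, f0, f1, π=0.5, α=0.8, β=0.2, max_draws=10_000):
    """
    Keep drawing while β < π < α, then report the decision,
    the number of draws used and the final belief.
    """
    n = 0
    while β < π < α and n < max_draws:
        z = draw()
        num = π * f0(z)
        π = num / (num + (1 - π) * f1(z))   # Bayes' law
        n += 1
    if π >= α:
        decision = "f0"
    elif π <= β:
        decision = "f1"
    else:
        decision = "undecided"
    return decision, n, π

rng = np.random.default_rng(0)
f0 = lambda z: beta_pdf(z, 1, 1)
f1 = lambda z: beta_pdf(z, 3, 1.2)

# In this run nature draws from f0
decision, n, π_final = sequential_test(lambda: rng.beta(1, 1), f0, f1)
print(decision, n, round(π_final, 3))
```

The number of draws is random: it depends on how quickly the evidence pushes the belief across one of the two thresholds.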
Our aim is to compute the cost function 𝐽 , and from it the associated cutoffs 𝛼 and 𝛽.
To make our computations manageable, using (50.2), we can write the continuation cost ℎ(𝜋) as
$$
\begin{aligned}
h(\pi) &= c + \mathbb{E}[J(\pi')] \\
&= c + \mathbb{E}_{\pi'} \min\{(1 - \pi') L_0, \pi' L_1, h(\pi')\} \\
&= c + \int \min\{(1 - \kappa(z', \pi)) L_0, \kappa(z', \pi) L_1, h(\kappa(z', \pi))\} f_\pi(z') \, dz'
\end{aligned}
\tag{50.3}
$$
The equality
ℎ(𝜋) = 𝑐 + ∫ min{(1 − 𝜅(𝑧 ′ , 𝜋))𝐿0 , 𝜅(𝑧 ′ , 𝜋)𝐿1 , ℎ(𝜅(𝑧 ′ , 𝜋))}𝑓𝜋 (𝑧 ′ )𝑑𝑧 ′ (50.4)
is a functional equation in an unknown function ℎ.

Using the functional equation, (50.4), for the continuation cost, we can back out optimal choices using the right side of
(50.2).
This functional equation can be solved by taking an initial guess and iterating to find a fixed point.
Thus, we iterate with an operator 𝑄, where
𝑄ℎ(𝜋) = 𝑐 + ∫ min{(1 − 𝜅(𝑧 ′ , 𝜋))𝐿0 , 𝜅(𝑧 ′ , 𝜋)𝐿1 , ℎ(𝜅(𝑧 ′ , 𝜋))}𝑓𝜋 (𝑧 ′ )𝑑𝑧 ′
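To see the iteration at work before turning to the full implementation, here is a small pure-Python sketch of 𝑄. It is not the implementation used in this lecture: the integral is approximated with a midpoint rule rather than Monte Carlo draws, and the helper names (`beta_pdf`, `interp`, `Q_quad`), the grid sizes and the parameter values are assumptions made for illustration.

```python
from math import gamma


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x ** (a - 1) * (1 - x) ** (b - 1)


def interp(x, grid, vals):
    """Linear interpolation on an evenly spaced grid covering [0, 1]."""
    n = len(grid) - 1
    pos = min(max(x, 0.0), 1.0) * n
    i = min(int(pos), n - 1)
    w = pos - i
    return (1 - w) * vals[i] + w * vals[i + 1]


def Q_quad(h, grid, f0_vals, f1_vals, c, L0, L1):
    """One application of Q, with the integral replaced by a midpoint rule."""
    dz = 1 / len(f0_vals)
    h_new = []
    for pi in grid:
        total = 0.0
        for f0z, f1z in zip(f0_vals, f1_vals):
            f_pi = pi * f0z + (1 - pi) * f1z      # predictive density of z'
            pi_next = pi * f0z / f_pi             # Bayes update kappa(z', pi)
            cost = min((1 - pi_next) * L0, pi_next * L1,
                       interp(pi_next, grid, h))
            total += cost * f_pi * dz
        h_new.append(c + total)
    return h_new


# Illustrative parameter values (assumptions of this sketch)
c, L0, L1 = 1.25, 25.0, 25.0
grid = [i / 40 for i in range(41)]
z_nodes = [(k + 0.5) / 80 for k in range(80)]     # midpoints of [0, 1]
f0_vals = [beta_pdf(z, 1, 1) for z in z_nodes]
f1_vals = [beta_pdf(z, 3, 1.2) for z in z_nodes]

# Iterate h <- Qh from h = 0 until successive iterates are close
h, error, tol = [0.0] * len(grid), 1.0, 1e-4
for iteration in range(1000):
    h_next = Q_quad(h, grid, f0_vals, f1_vals, c, L0, L1)
    error = max(abs(x - y) for x, y in zip(h, h_next))
    h = h_next
    if error < tol:
        break

print(f"stopped after {iteration + 1} iterations, error = {error:.2e}")
```

At 𝜋 = 0 and 𝜋 = 1 the belief never moves and stopping costs nothing, so the computed continuation cost equals 𝑐 there, which is a convenient sanity check.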

50.4 Implementation
First, we will construct a jitclass to store the parameters of the model
wf_data = [('a0', float64),          # Parameters of beta distributions
           ('b0', float64),
           ('a1', float64),
           ('b1', float64),
           ('c', float64),           # Cost of another draw
           ('π_grid_size', int64),
           ('L0', float64),          # Cost of selecting f0 when f1 is true
           ('L1', float64),          # Cost of selecting f1 when f0 is true
           ('π_grid', float64[:]),
           ('mc_size', int64),
           ('z0', float64[:]),       # Monte Carlo draws from f0
           ('z1', float64[:])]       # Monte Carlo draws from f1

@jitclass(wf_data)
class WaldFriedman:

    def __init__(self,
                 c=1.25,
                 a0=1,
                 b0=1,
                 a1=3,
                 b1=1.2,
                 L0=25,
                 L1=25,
                 π_grid_size=200,
                 mc_size=1000):

        self.a0, self.b0 = a0, b0
        self.a1, self.b1 = a1, b1
        self.c, self.π_grid_size = c, π_grid_size
        self.L0, self.L1 = L0, L1
        self.π_grid = np.linspace(0, 1, π_grid_size)
        self.mc_size = mc_size

        # Draws used by Q to approximate the integral
        self.z0 = np.random.beta(a0, b0, mc_size)
        self.z1 = np.random.beta(a1, b1, mc_size)

    def f0(self, x):
        return p(x, self.a0, self.b0)

    def f1(self, x):
        return p(x, self.a1, self.b1)

    def f0_rvs(self):
        return np.random.beta(self.a0, self.b0)

    def f1_rvs(self):
        return np.random.beta(self.a1, self.b1)

    def κ(self, z, π):
        """
        Updates π using Bayes' rule and the current observation z
        """
        f0, f1 = self.f0, self.f1

        π_f0, π_f1 = π * f0(z), (1 - π) * f1(z)
        π_new = π_f0 / (π_f0 + π_f1)

        return π_new
As in the optimal growth lecture, to approximate a continuous value function

• We iterate at a finite grid of possible values of 𝜋.
• When we evaluate 𝔼[𝐽 (𝜋′ )] between grid points, we use linear interpolation.
We define the operator function Q below.
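@jit(nopython=True, parallel=True)
def Q(h, wf):

    c, π_grid = wf.c, wf.π_grid
    L0, L1 = wf.L0, wf.L1
    z0, z1 = wf.z0, wf.z1
    mc_size = wf.mc_size

    κ = wf.κ

    h_new = np.empty_like(π_grid)
    h_func = lambda p: np.interp(p, π_grid, h)

    # prange is numba's parallel range (from numba import prange)
    for i in prange(len(π_grid)):
        π = π_grid[i]

        # Find the expected value of J by integrating over z
        integral_f0, integral_f1 = 0, 0
        for m in range(mc_size):
            π_0 = κ(z0[m], π)  # Draw z from f0 and update π
            integral_f0 += min((1 - π_0) * L0, π_0 * L1, h_func(π_0))

            π_1 = κ(z1[m], π)  # Draw z from f1 and update π
            integral_f1 += min((1 - π_1) * L0, π_1 * L1, h_func(π_1))

        # Weight the two averages by the current belief, as in f_π
        integral = (π * integral_f0 + (1 - π) * integral_f1) / mc_size

        h_new[i] = c + integral

    return h_new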
To solve the key functional equation, we will iterate using Q to find the fixed point
@jit(nopython=True)
def solve_model(wf, tol=1e-4, max_iter=1000):
    """
    Compute the continuation cost function

    * wf is an instance of WaldFriedman
    """

    # Set up loop
    h = np.zeros(len(wf.π_grid))
    i = 0
    error = tol + 1

    while i < max_iter and error > tol:
        h_new = Q(h, wf)
        error = np.max(np.abs(h - h_new))
        i += 1
        h = h_new

    if error > tol:
        print("Failed to converge!")

    return h
50.5 Analysis
Let’s inspect outcomes.

We will be using the default parameterization with distributions like so
wf = WaldFriedman()

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(wf.f0(wf.π_grid), label="$f_0$")
ax.plot(wf.f1(wf.π_grid), label="$f_1$")
ax.set(ylabel="probability of $z_k$", xlabel="$z_k$", title="Distributions")
ax.legend()
plt.show()

50.5.1 Value Function
To solve the model, we will call our solve_model function
h_star = solve_model(wf) # Solve the model
We will also set up a function to compute the cutoffs 𝛼 and 𝛽 and plot these on our cost function plot
@jit(nopython=True)
def find_cutoff_rule(wf, h):
    """
    This function takes a continuation cost function and returns the
    corresponding cutoffs of where you transition between continuing and
    choosing a specific model
    """

    π_grid = wf.π_grid
    L0, L1 = wf.L0, wf.L1

    # Evaluate cost at all points on grid for choosing a model
    payoff_f0 = (1 - π_grid) * L0
    payoff_f1 = π_grid * L1

    # The cutoff points can be found by differencing these costs with
    # the Bellman equation (J is always less than or equal to p_c_i)
    β = π_grid[np.searchsorted(
                              payoff_f1 - np.minimum(h, payoff_f0),
                              1e-10)
               - 1]
    α = π_grid[np.searchsorted(
                              np.minimum(h, payoff_f1) - payoff_f0,
                              1e-10)
               - 1]

    return (β, α)

β, α = find_cutoff_rule(wf, h_star)
cost_L0 = (1 - wf.π_grid) * wf.L0
cost_L1 = wf.π_grid * wf.L1

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(wf.π_grid, h_star, label='sample again')
ax.plot(wf.π_grid, cost_L1, label='choose f1')
ax.plot(wf.π_grid, cost_L0, label='choose f0')
ax.plot(wf.π_grid,
        np.amin(np.column_stack([h_star, cost_L0, cost_L1]), axis=1),
        lw=15, alpha=0.1, color='b', label=r'$J(\pi)$')

ax.annotate(r"$\beta$", xy=(β + 0.01, 0.5), fontsize=14)
ax.annotate(r"$\alpha$", xy=(α + 0.01, 0.5), fontsize=14)

# J equals π L1 at β and (1 - π) L0 at α
plt.vlines(β, 0, β * wf.L1, linestyle="--")
plt.vlines(α, 0, (1 - α) * wf.L0, linestyle="--")

ax.set(xlim=(0, 1), ylim=(0, 0.5 * max(wf.L0, wf.L1)), ylabel="cost",
       xlabel=r"$\pi$", title=r"Cost function $J(\pi)$")

plt.legend(borderpad=1.1)
plt.show()

The cost function 𝐽 equals 𝜋𝐿1 for 𝜋 ≤ 𝛽, and (1 − 𝜋)𝐿0 for 𝜋 ≥ 𝛼.

The slopes of the two linear pieces of the cost function 𝐽 (𝜋) are determined by 𝐿1 and −𝐿0 .
The cost function 𝐽 is smooth in the interior region, where the posterior probability assigned to 𝑓0 is in the indecisive
region 𝜋 ∈ (𝛽, 𝛼).
The decision-maker continues to sample until the probability that he attaches to model 𝑓0 falls below 𝛽 or above 𝛼.
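The two linear pieces and the middle region can also be read off a tabulated continuation cost by direct comparison. The sketch below is a plain-Python alternative to the `searchsorted` approach used above, not a copy of it; the function name `cutoffs` and the flat continuation cost are hypothetical and serve only to make the example easy to check by hand.

```python
def cutoffs(h, grid, L0, L1):
    """
    Read cutoffs off a continuation cost h tabulated on an increasing grid.

    beta  : largest grid point at which accepting f1 is (weakly) cheapest
    alpha : smallest grid point at which accepting f0 is (weakly) cheapest
    """
    beta = max(pi for pi, hv in zip(grid, h)
               if pi * L1 <= min(hv, (1 - pi) * L0))
    alpha = min(pi for pi, hv in zip(grid, h)
                if (1 - pi) * L0 <= min(hv, pi * L1))
    return beta, alpha


# Hypothetical flat continuation cost, used only to make the example checkable
grid = [i / 100 for i in range(101)]
h_flat = [5.1] * len(grid)
beta, alpha = cutoffs(h_flat, grid, L0=25, L1=25)
print(beta, alpha)
```

With a flat cost of 5.1 and 𝐿0 = 𝐿1 = 25, accepting 𝑓1 is cheapest when 25𝜋 ≤ 5.1 and accepting 𝑓0 is cheapest when 25(1 − 𝜋) ≤ 5.1, so the cutoffs land on the grid points nearest 0.204 and 0.796 from the appropriate side.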
50.5.2 Simulations
The next figure shows the outcomes of 500 simulations of the decision process.
On the left is a histogram of stopping times, i.e., the number of draws of 𝑧𝑘 required to make a decision.
The average number of draws is around 6.6.
On the right is the fraction of correct decisions at the stopping time.
In this case, the decision-maker is correct 80% of the time.
def simulate(wf, true_dist, h_star, π_0=0.5):
    """
    This function takes an initial condition and simulates until it
    stops (when a decision is made)
    """

    f0, f1 = wf.f0, wf.f1
    f0_rvs, f1_rvs = wf.f0_rvs, wf.f1_rvs
    κ = wf.κ

    if true_dist == "f0":
        f, f_rvs = wf.f0, wf.f0_rvs
    elif true_dist == "f1":
        f, f_rvs = wf.f1, wf.f1_rvs

    # Find cutoffs
    β, α = find_cutoff_rule(wf, h_star)

    # Initialize a couple of useful variables
    decision_made = False
    π = π_0
    t = 0

    while decision_made is False:
        # Draws come from the distribution named by true_dist
        z = f_rvs()
        t = t + 1
        π = κ(z, π)
        if π < β:
            decision_made = True
            decision = 1
        elif π > α:
            decision_made = True
            decision = 0

    if true_dist == "f0":
        if decision == 0:
            correct = True
        else:
            correct = False

    elif true_dist == "f1":
        if decision == 1:
            correct = True
        else:
            correct = False

    return correct, π, t

def stopping_dist(wf, h_star, ndraws=250, true_dist="f0"):
    """
    Simulates repeatedly to get distributions of time needed to make a
    decision and how often they are correct
    """

    tdist = np.empty(ndraws, int)
    cdist = np.empty(ndraws, bool)

    for i in range(ndraws):
        correct, π, t = simulate(wf, true_dist, h_star)
        tdist[i] = t
        cdist[i] = correct

    return cdist, tdist

def simulation_plot(wf):
    h_star = solve_model(wf)
    ndraws = 500
    cdist, tdist = stopping_dist(wf, h_star, ndraws)

    fig, ax = plt.subplots(1, 2, figsize=(16, 5))

    ax[0].hist(tdist, bins=np.max(tdist))
    ax[0].set_title(f"Stopping times over {ndraws} replications")
    ax[0].set(xlabel="time", ylabel="number of stops")
    ax[0].annotate(f"mean = {np.mean(tdist)}", xy=(max(tdist) / 2,
                   max(np.histogram(tdist, bins=max(tdist))[0]) / 2))

    ax[1].hist(cdist.astype(int), bins=2)
    ax[1].set_title(f"Correct decisions over {ndraws} replications")
    ax[1].annotate(f"% correct = {np.mean(cdist)}",
                   xy=(0.05, ndraws / 2))

    plt.show()

simulation_plot(wf)
50.5.3 Comparative Statics
Now let’s consider the following exercise.

We double the cost of drawing an additional observation.
Before you look, think about what will happen:
• Will the decision-maker be correct more or less often?
• Will he make decisions sooner or later?
wf = WaldFriedman(c=2.5)
simulation_plot(wf)
Increased cost per draw has induced the decision-maker to take fewer draws before deciding.
Because he decides with fewer draws, the percentage of time he is correct drops.
This leads to him having a higher expected loss when he puts equal weight on both models.
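One way to see the mechanism is to note that a higher 𝑐 raises the continuation cost, which can only shrink the interval (𝛽, 𝛼) in which sampling continues. The pure-Python sketch below compares two nested continuation regions on identical random draws. The cutoff pairs are hypothetical values picked for illustration, not cutoffs solved from the model, and the names `stopping_time` and `summarize` are ours.

```python
import random
from math import gamma


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x ** (a - 1) * (1 - x) ** (b - 1)


def stopping_time(seed, beta, alpha, pi0=0.5, max_draws=10_000):
    """Draw z from f0 = Beta(1, 1) until the belief leaves (beta, alpha)."""
    rng = random.Random(seed)
    pi = pi0
    for t in range(1, max_draws + 1):
        z = rng.betavariate(1, 1)
        num = pi * beta_pdf(z, 1, 1)
        pi = num / (num + (1 - pi) * beta_pdf(z, 3, 1.2))
        if pi < beta:
            return t, False      # accepted f1, which is wrong here
        if pi > alpha:
            return t, True       # accepted f0, which is correct here
    raise RuntimeError("no decision reached")


def summarize(beta, alpha, n=500):
    """Mean stopping time and share of correct decisions over n runs."""
    runs = [stopping_time(seed, beta, alpha) for seed in range(n)]
    mean_t = sum(t for t, _ in runs) / n
    share_correct = sum(ok for _, ok in runs) / n
    return mean_t, share_correct


# Hypothetical cutoffs: the second pair is a narrower continuation region
wide = summarize(beta=0.2, alpha=0.8)
narrow = summarize(beta=0.35, alpha=0.65)
print("wide  :", wide)
print("narrow:", narrow)
```

Because run `i` uses the same seed under both cutoff pairs, each belief path is identical, so the narrower region can never stop later than the wider one on any path.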
50.5.4 A Notebook Implementation
To facilitate comparative statics, we provide a Jupyter notebook that generates the same plots, but with sliders.
With these sliders, you can adjust parameters and immediately observe
• effects on the smoothness of the value function in the indecisive middle range as we increase the number of grid
points in the piecewise linear approximation.
• effects of different settings for the cost parameters 𝐿0 , 𝐿1 , 𝑐, the parameters of two beta distributions 𝑓0 and 𝑓1 ,
and the number of points and linear functions 𝑚 to use in the piece-wise continuous approximation to the value
function.
• various simulations from 𝑓0 and associated distributions of waiting times to making a decision.
• associated histograms of correct and incorrect decisions.
50.6 Comparison with Neyman-Pearson Formulation
For several reasons, it is useful to describe the theory underlying the test that Navy Captain G. S. Schuyler had been told
to use and that led him to approach Milton Friedman and Allan Wallis to convey his conjecture that superior practical
procedures existed.
Evidently, the Navy had told Captain Schuyler to use what it knew to be a state-of-the-art Neyman-Pearson test.
We’ll rely on Abraham Wald’s [Wald, 1947] elegant summary of Neyman-Pearson theory.
For our purposes, watch for these features of the setup:
• the assumption of a fixed sample size 𝑛
• the application of laws of large numbers, conditioned on alternative probability models, to interpret the probabilities
𝛼 and 𝛽 defined in the Neyman-Pearson theory
Recall that in the sequential analytic formulation above,
• The sample size 𝑛 is not fixed but rather an object to be chosen; technically 𝑛 is a random variable.
• The parameters 𝛽 and 𝛼 characterize cut-off rules used to determine 𝑛 as a random variable.

• Laws of large numbers make no appearances in the sequential construction.

In chapter 1 of Sequential Analysis [Wald, 1947] Abraham Wald summarizes the Neyman-Pearson approach to hypothesis testing.
Wald frames the problem as making a decision about a probability distribution that is partially known.
(You have to assume that something is already known in order to state a well-posed problem – usually, something means
a lot.)
By limiting what is unknown, Wald uses the following simple structure to illustrate the main ideas:
• A decision-maker wants to decide which of two distributions 𝑓0 , 𝑓1 governs an IID random variable 𝑧.
• The null hypothesis 𝐻0 is the statement that 𝑓0 governs the data.
• The alternative hypothesis 𝐻1 is the statement that 𝑓1 governs the data.
• The problem is to devise and analyze a test of hypothesis 𝐻0 against the alternative hypothesis 𝐻1 on the basis of
a sample of a fixed number 𝑛 of independent observations 𝑧1 , 𝑧2 , … , 𝑧𝑛 of the random variable 𝑧.
To quote Abraham Wald,
A test procedure leading to the acceptance or rejection of the [null] hypothesis in question is simply a rule
specifying, for each possible sample of size 𝑛, whether the [null] hypothesis should be accepted or rejected
on the basis of the sample. This may also be expressed as follows: A test procedure is simply a subdivision of
the totality of all possible samples of size 𝑛 into two mutually exclusive parts, say part 1 and part 2, together
with the application of the rule that the [null] hypothesis be accepted if the observed sample is contained in
part 2. Part 1 is also called the critical region. Since part 2 is the totality of all samples of size 𝑛 which are
not included in part 1, part 2 is uniquely determined by part 1. Thus, choosing a test procedure is equivalent
to determining a critical region.
Let’s listen to Wald longer:
As a basis for choosing among critical regions the following considerations have been advanced by Neyman
and Pearson: In accepting or rejecting 𝐻0 we may commit errors of two kinds. We commit an error of the
first kind if we reject 𝐻0 when it is true; we commit an error of the second kind if we accept 𝐻0 when 𝐻1
is true. After a particular critical region 𝑊 has been chosen, the probability of committing an error of the
first kind, as well as the probability of committing an error of the second kind is uniquely determined. The
probability of committing an error of the first kind is equal to the probability, determined by the assumption
that 𝐻0 is true, that the observed sample will be included in the critical region 𝑊 . The probability of
committing an error of the second kind is equal to the probability, determined on the assumption that 𝐻1
is true, that the sample will fall outside the critical region 𝑊 . For any given critical region 𝑊 we shall
denote the probability of an error of the first kind by 𝛼 and the probability of an error of the second kind by
𝛽.
Let’s listen carefully to how Wald applies a law of large numbers to interpret 𝛼 and 𝛽:
The probabilities 𝛼 and 𝛽 have the following important practical interpretation: Suppose that we draw a large
number of samples of size 𝑛. Let 𝑀 be the number of such samples drawn. Suppose that for each of these
𝑀 samples we reject 𝐻0 if the sample is included in 𝑊 and accept 𝐻0 if the sample lies outside 𝑊 . In this
way we make 𝑀 statements of rejection or acceptance. Some of these statements will in general be wrong.
If 𝐻0 is true and if 𝑀 is large, the probability is nearly 1 (i.e., it is practically certain) that the proportion
of wrong statements (i.e., the number of wrong statements divided by 𝑀 ) will be approximately 𝛼. If 𝐻1 is
true, the probability is nearly 1 that the proportion of wrong statements will be approximately 𝛽. Thus, we
can say that in the long run [ here Wald applies law of large numbers by driving 𝑀 → ∞ (our comment,
not Wald’s) ] the proportion of wrong statements will be 𝛼 if 𝐻0 is true and 𝛽 if 𝐻1 is true.
The quantity 𝛼 is called the size of the critical region, and the quantity 1 − 𝛽 is called the power of the critical region.
Wald notes that
one critical region 𝑊 is more desirable than another if it has smaller values of 𝛼 and 𝛽. Although either 𝛼
or 𝛽 can be made arbitrarily small by a proper choice of the critical region 𝑊 , it is impossible to make both 𝛼
and 𝛽 arbitrarily small for a fixed value of 𝑛, i.e., a fixed sample size.
Wald summarizes Neyman and Pearson’s setup as follows:
Neyman and Pearson show that a region consisting of all samples (𝑧1 , 𝑧2 , … , 𝑧𝑛 ) which satisfy the inequality
$$\frac{f_1(z_1) \cdots f_1(z_n)}{f_0(z_1) \cdots f_0(z_n)} \geq k$$
is a most powerful critical region for testing the hypothesis 𝐻0 against the alternative hypothesis 𝐻1 . The
term 𝑘 on the right side is a constant chosen so that the region will have the required size 𝛼.
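The critical region and Wald's long-run interpretation of 𝛼 and 𝛽 can be sketched numerically. The code below is an illustration under stated assumptions: 𝑓0 and 𝑓1 are taken to be the Beta(1, 1) and Beta(3, 1.2) densities used as defaults earlier in this lecture, the threshold 𝑘 = 1 is arbitrary rather than chosen to hit a required size, and the function names are ours.

```python
import random
from math import gamma, log


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x ** (a - 1) * (1 - x) ** (b - 1)


def log_likelihood_ratio(sample):
    """log of f1(z_1)...f1(z_n) / (f0(z_1)...f0(z_n))."""
    return sum(log(beta_pdf(z, 3, 1.2)) - log(beta_pdf(z, 1, 1))
               for z in sample)


def reject_H0(sample, k):
    """True if the sample lies in the critical region {ratio >= k}."""
    return log_likelihood_ratio(sample) >= log(k)


def error_rates(n, k, M=2000, seed=0):
    """Monte Carlo estimates of (alpha, beta) for a fixed sample size n."""
    rng = random.Random(seed)
    # Share of M samples drawn under H0 that are (wrongly) rejected
    type_1 = sum(reject_H0([rng.betavariate(1, 1) for _ in range(n)], k)
                 for _ in range(M)) / M
    # Share of M samples drawn under H1 that are (wrongly) accepted
    type_2 = sum(not reject_H0([rng.betavariate(3, 1.2) for _ in range(n)], k)
                 for _ in range(M)) / M
    return type_1, type_2


print(error_rates(n=5, k=1.0))
print(error_rates(n=20, k=1.0))
```

Here `M` plays the role of Wald's number of samples of size 𝑛: the two returned fractions are the proportions of wrong statements under 𝐻0 and under 𝐻1.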
Wald goes on to discuss Neyman and Pearson’s concept of uniformly most powerful test.
Here is how Wald introduces the notion of a sequential test:
A rule is given for making one of the following three decisions at any stage of the experiment (at the m-th
trial for each integral value of m): (1) to accept the hypothesis H, (2) to reject the hypothesis H, (3) to
continue the experiment by making an additional observation. Thus, such a test procedure is carried out
sequentially. On the basis of the first observation, one of the aforementioned decisions is made. If the first or
second decision is made, the process is terminated. If the third decision is made, a second trial is performed.
Again, on the basis of the first two observations, one of the three decisions is made. If the third decision
is made, a third trial is performed, and so on. The process is continued until either the first or the second
decision is made. The number n of observations required by such a test procedure is a random variable,
since the value of n depends on the outcome of the observations.
50.7 Sequels
We’ll dig deeper into some of the ideas used here in the following lectures:
• this lecture discusses the key concept of exchangeability that rationalizes statistical learning
• this lecture describes likelihood ratio processes and their role in frequentist and Bayesian statistical theories
• this lecture discusses the role of likelihood ratio processes in Bayesian learning
• this lecture returns to the subject of this lecture and studies the Captain’s hunch by asking whether the
(frequentist) decision rule that the Navy had ordered him to use can be expected to be better or worse than the
sequential rule that Abraham Wald designed

CHAPTER
FIFTYONE
EXCHANGEABILITY AND BAYESIAN UPDATING
Contents
• Exchangeability and Bayesian Updating

– Overview
– Independently and Identically Distributed
– A Setting in Which Past Observations Are Informative
– Relationship Between IID and Exchangeable
– Exchangeability
– Bayes’ Law
– More Details about Bayesian Updating
– Appendix
– Sequels
51.1 Overview
This lecture studies learning via Bayes’ Law.

We touch on foundations of Bayesian statistical inference invented by Bruno de Finetti [de Finetti, 1937].
The relevance of de Finetti’s work for economists is presented forcefully in chapter 11 of [Kreps, 1988] by David Kreps.
An example that we study in this lecture is a key component of this lecture that augments the classic job search model of
McCall [McCall, 1970] by presenting an unemployed worker with a statistical inference problem.
Here we create graphs that illustrate the role that a likelihood ratio plays in Bayes’ Law.
We’ll use such graphs to provide insights into mechanics driving outcomes in this lecture about learning in an augmented
McCall job search model.
Among other things, this lecture discusses connections between the statistical concepts of sequences of random variables
that are
• independently and identically distributed
• exchangeable (also known as conditionally independently and identically distributed)
Understanding these concepts is essential for appreciating how Bayesian updating works.
You can read about exchangeability here.
Because another term for exchangeable is conditionally independent, we want to convey an answer to the question
conditional on what?
We also tell why an assumption of independence precludes learning while an assumption of conditional independence
makes learning possible.
Below, we’ll often use
• 𝑊 to denote a random variable
• 𝑤 to denote a particular realization of a random variable 𝑊

import matplotlib.pyplot as plt
from numba import njit, vectorize
from math import gamma
import scipy.optimize as op
from scipy.integrate import quad
import numpy as np
51.2 Independently and Identically Distributed
We begin by looking at the notion of an independently and identically distributed sequence of random variables.
An independently and identically distributed sequence is often abbreviated as IID.
Two notions are involved:
• independence
• identically distributed
A sequence 𝑊0 , 𝑊1 , … is independently distributed if the joint probability density of the sequence is the product of
the densities of the components of the sequence.
The sequence 𝑊0 , 𝑊1 , … is independently and identically distributed (IID) if in addition the marginal density of 𝑊𝑡
is the same for all 𝑡 = 0, 1, ….
For example, let 𝑝(𝑊0 , 𝑊1 , …) be the joint density of the sequence and let 𝑝(𝑊𝑡 ) be the marginal density for a
particular 𝑊𝑡 for all 𝑡 = 0, 1, ….
Then the sequence 𝑊0 , 𝑊1 , … is IID if
𝑝(𝑊0 , 𝑊1 , …) = 𝑝(𝑊0 )𝑝(𝑊1 ) ⋯
so that the joint density is the product of a sequence of identical marginal densities.
51.2.1 IID Means Past Observations Don’t Tell Us Anything About Future Observations
If a sequence of random variables is IID, past information provides no information about future realizations.
Therefore, there is nothing to learn from the past about the future.
To understand these statements, let the joint distribution of a sequence of random variables $\{W_t\}_{t=0}^T$ that is not
necessarily IID be
𝑝(𝑊𝑇 , 𝑊𝑇 −1 , … , 𝑊1 , 𝑊0 )
Using the laws of probability, we can always factor such a joint density into a product of conditional densities:
$$p(W_T, W_{T-1}, \ldots, W_1, W_0) = p(W_T | W_{T-1}, \ldots, W_0) \, p(W_{T-1} | W_{T-2}, \ldots, W_0) \cdots p(W_1 | W_0) \, p(W_0)$$
In general,
𝑝(𝑊𝑡 |𝑊𝑡−1 , … , 𝑊0 ) ≠ 𝑝(𝑊𝑡 )
which states that the conditional density on the left side does not equal the marginal density on the right side.
But in the special IID case,
𝑝(𝑊𝑡 |𝑊𝑡−1 , … , 𝑊0 ) = 𝑝(𝑊𝑡 )
and partial history 𝑊𝑡−1 , … , 𝑊0 contains no information about the probability of 𝑊𝑡 .

So in the IID case, there is nothing to learn about the densities of future random variables from past random variables.
But when the sequence is not IID, there is something to learn about the future from observations of past random variables.
We turn next to an instance of the general case in which the sequence is not IID.
Please watch for what can be learned from the past and when.
51.3 A Setting in Which Past Observations Are Informative
Let $\{W_t\}_{t=0}^\infty$ be a sequence of nonnegative scalar random variables with a joint probability distribution
constructed as follows.
There are two distinct cumulative distribution functions 𝐹 and 𝐺 that have densities 𝑓 and 𝑔, respectively, for a nonnegative
scalar random variable 𝑊 .
Before the start of time, say at time 𝑡 = −1, “nature” once and for all selects either 𝑓 or 𝑔.
Thereafter at each time 𝑡 ≥ 0, nature draws a random variable 𝑊𝑡 from the selected distribution.
So the data are permanently generated as independently and identically distributed (IID) draws from either 𝐹 or 𝐺.
We could say that objectively, meaning after nature has chosen either 𝐹 or 𝐺, the probability that the data are generated
as draws from 𝐹 is either 0 or 1.
We now drop into this setting a partially informed decision maker who knows
• both 𝐹 and 𝐺, but
• not the 𝐹 or 𝐺 that nature drew once-and-for-all at 𝑡 = −1
So our decision maker does not know which of the two distributions nature selected.
The decision maker describes his ignorance with a subjective probability 𝜋̃ and reasons as if nature had selected 𝐹 with
probability 𝜋̃ ∈ (0, 1) and 𝐺 with probability 1 − 𝜋̃.
Thus, we assume that the decision maker
• knows both 𝐹 and 𝐺
• doesn’t know which of these two distributions nature has drawn
• expresses his ignorance by acting as if or thinking that nature chose distribution 𝐹 with probability 𝜋̃ ∈ (0, 1)
and distribution 𝐺 with probability 1 − 𝜋̃
• at date 𝑡 ≥ 0 knows the partial history 𝑤𝑡 , 𝑤𝑡−1 , … , 𝑤0
To proceed, we want to know the decision maker’s belief about the joint distribution of the partial history.
We’ll discuss that next and in the process describe the concept of exchangeability.
51.4 Relationship Between IID and Exchangeable
Conditional on nature selecting 𝐹 , the joint density of the sequence 𝑊0 , 𝑊1 , … is
𝑓(𝑊0 )𝑓(𝑊1 ) ⋯
Conditional on nature selecting 𝐺, the joint density of the sequence 𝑊0 , 𝑊1 , … is
𝑔(𝑊0 )𝑔(𝑊1 ) ⋯
Thus, conditional on nature having selected 𝐹 , the sequence 𝑊0 , 𝑊1 , … is independently and identically distributed.
Furthermore, conditional on nature having selected 𝐺, the sequence 𝑊0 , 𝑊1 , … is also independently and identically
distributed.
But what about the unconditional distribution of a partial history?
The unconditional distribution of 𝑊0 , 𝑊1 , … is evidently
$$h(W_0, W_1, \ldots) \equiv \tilde\pi [f(W_0) f(W_1) \cdots] + (1 - \tilde\pi) [g(W_0) g(W_1) \cdots] \tag{51.1}$$
Under the unconditional distribution ℎ(𝑊0 , 𝑊1 , …), the sequence 𝑊0 , 𝑊1 , … is not independently and identically dis-
tributed.
To verify this claim, it is sufficient to notice, for example, that
$$h(W_0, W_1) = \tilde\pi f(W_0) f(W_1) + (1 - \tilde\pi) g(W_0) g(W_1) \neq (\tilde\pi f(W_0) + (1 - \tilde\pi) g(W_0)) (\tilde\pi f(W_1) + (1 - \tilde\pi) g(W_1))$$
Thus, the conditional distribution
$$h(W_1 | W_0) \equiv \frac{h(W_0, W_1)}{\tilde\pi f(W_0) + (1 - \tilde\pi) g(W_0)} \neq \tilde\pi f(W_1) + (1 - \tilde\pi) g(W_1)$$
This means that random variable 𝑊0 contains information about random variable 𝑊1 .
So there is something to learn from the past about the future.
But what and how?

51.5 Exchangeability
While the sequence 𝑊0 , 𝑊1 , … is not IID, it can be verified that it is exchangeable, which means that the “re-ordered”
joint distributions ℎ(𝑊0 , 𝑊1 ) and ℎ(𝑊1 , 𝑊0 ) satisfy
ℎ(𝑊0 , 𝑊1 ) = ℎ(𝑊1 , 𝑊0 )
and so on.
More generally, a sequence of random variables is said to be exchangeable if the joint probability distribution for a
sequence does not change when the positions in the sequence in which finitely many of the random variables appear are
altered.
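Both claims, that the mixture is exchangeable and that it is not IID, are easy to check numerically for two observations. The sketch below assumes illustrative choices that are not part of the argument: 𝐹 uniform on [0, 1], 𝐺 a Beta(3, 1.2) distribution, and a prior of 0.5. The names `h_joint` and `h_marginal` are ours.

```python
from math import gamma


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x ** (a - 1) * (1 - x) ** (b - 1)


f = lambda w: beta_pdf(w, 1, 1)      # illustrative F: uniform on [0, 1]
g = lambda w: beta_pdf(w, 3, 1.2)    # illustrative G
prior = 0.5                          # illustrative prior probability on F


def h_joint(w0, w1):
    """Unconditional joint density of (W0, W1): a mixture of two IID laws."""
    return prior * f(w0) * f(w1) + (1 - prior) * g(w0) * g(w1)


def h_marginal(w):
    """Unconditional marginal density of a single W_t."""
    return prior * f(w) + (1 - prior) * g(w)


w0, w1 = 0.2, 0.9
print("joint              :", h_joint(w0, w1))
print("re-ordered joint   :", h_joint(w1, w0))
print("product of marginals:", h_marginal(w0) * h_marginal(w1))
```

The first two numbers coincide (exchangeability) while the third differs (no independence). The gap between the joint density and the product of marginals equals 𝜋̃(1 − 𝜋̃)(𝑓(𝑤0) − 𝑔(𝑤0))(𝑓(𝑤1) − 𝑔(𝑤1)), which vanishes only when 𝑓 and 𝑔 agree at one of the two points or the prior is degenerate.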
Equation (51.1) represents our instance of an exchangeable joint density over a sequence of random variables as a mixture
of two IID joint densities over a sequence of random variables.
For a Bayesian statistician, the mixing parameter 𝜋̃ ∈ (0, 1) has a special interpretation as a subjective prior probability
that nature selected probability distribution 𝐹 .
De Finetti [de Finetti, 1937] established a related representation of an exchangeable process created by mixing sequences
of IID Bernoulli random variables with parameter 𝜃 ∈ (0, 1) and mixing probability density 𝜋(𝜃) that a Bayesian statis-
tician would interpret as a prior over the unknown Bernoulli parameter 𝜃.
51.6 Bayes’ Law
We noted above that in our example model there is something to learn about the future from past data drawn from
our particular instance of a process that is exchangeable but not IID.
But how can we learn?
And about what?
The answer to the about what question is 𝜋̃.
The answer to the how question is to use Bayes’ Law.
Another way to say use Bayes’ Law is to say from a (subjective) joint distribution, compute an appropriate conditional
distribution.
Let’s dive into Bayes’ Law in this context.
Let 𝑞 represent the distribution that nature actually draws 𝑤 from and let
𝜋 = ℙ{𝑞 = 𝑓}
where we regard 𝜋 as a decision maker’s subjective probability (also called a personal probability).
Suppose that at 𝑡 ≥ 0, the decision maker has observed a history $w^t \equiv [w_t, w_{t-1}, \ldots, w_0]$.
We let
$$\pi_t = \mathbb{P}\{q = f \mid w^t\}$$
where we adopt the convention
$$\pi_{-1} = \tilde\pi$$
The distribution of $w_{t+1}$ conditional on $w^t$ is then
$$\pi_t f + (1 - \pi_t) g.$$
Bayes’ rule for updating 𝜋𝑡+1 is
$$\pi_{t+1} = \frac{\pi_t f(w_{t+1})}{\pi_t f(w_{t+1}) + (1 - \pi_t) g(w_{t+1})} \tag{51.2}$$
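Equation (51.2) follows from Bayes’ rule, which tells us that
$$\mathbb{P}\{q = f \mid W = w\} = \frac{\mathbb{P}\{W = w \mid q = f\} \mathbb{P}\{q = f\}}{\mathbb{P}\{W = w\}}$$
where
$$\mathbb{P}\{W = w\} = \sum_{a \in \{f, g\}} \mathbb{P}\{W = w \mid q = a\} \mathbb{P}\{q = a\}$$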
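Applying (51.2) one observation at a time gives the posterior after a whole history. A minimal sketch follows, assuming for illustration that 𝑓 is the uniform density on [0, 1], 𝑔 is a Beta(3, 1.2) density, and the prior is 0.5; the function names and the history values are ours.

```python
from math import gamma


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x ** (a - 1) * (1 - x) ** (b - 1)


f = lambda w: beta_pdf(w, 1, 1)      # illustrative f
g = lambda w: beta_pdf(w, 3, 1.2)    # illustrative g


def update(pi, w):
    """One step of Bayes' rule: posterior probability that q = f."""
    num = pi * f(w)
    return num / (num + (1 - pi) * g(w))


def posterior(history, prior=0.5):
    """Apply the one-step update to each observation in turn."""
    pi = prior
    for w in history:
        pi = update(pi, w)
    return pi


history = [0.2, 0.9, 0.5, 0.7]       # hypothetical observations
print(posterior(history))
print(posterior(list(reversed(history))))
```

The two printed posteriors agree up to floating-point rounding: the posterior depends on the history only through the product of the likelihoods, so the order in which observations arrive does not matter. This is the exchangeability property at work.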
51.7 More Details about Bayesian Updating
Let’s stare at and rearrange Bayes’ Law as represented in equation (51.2) with the aim of understanding how the posterior
probability 𝜋𝑡+1 is influenced by the prior probability 𝜋𝑡 and the likelihood ratio
$$l(w) = \frac{f(w)}{g(w)}$$
It is convenient for us to rewrite the updating rule (51.2) as
$$\pi_{t+1} = \frac{\pi_t f(w_{t+1})}{\pi_t f(w_{t+1}) + (1 - \pi_t) g(w_{t+1})} = \frac{\pi_t \frac{f(w_{t+1})}{g(w_{t+1})}}{\pi_t \frac{f(w_{t+1})}{g(w_{t+1})} + (1 - \pi_t)} = \frac{\pi_t l(w_{t+1})}{\pi_t l(w_{t+1}) + (1 - \pi_t)}$$
This implies that
$$
\frac{\pi_{t+1}}{\pi_t} = \frac{l(w_{t+1})}{\pi_t l(w_{t+1}) + (1 - \pi_t)}
\begin{cases}
> 1 & \text{if } l(w_{t+1}) > 1 \\
\leq 1 & \text{if } l(w_{t+1}) \leq 1
\end{cases}
\tag{51.3}
$$
Notice how the likelihood ratio and the prior interact to determine whether an observation 𝑤𝑡+1 leads the decision maker
to increase or decrease the subjective probability he/she attaches to distribution 𝐹 .
When the likelihood ratio 𝑙(𝑤𝑡+1 ) exceeds one, the observation 𝑤𝑡+1 nudges the probability 𝜋 put on distribution 𝐹
upward, and when the likelihood ratio 𝑙(𝑤𝑡+1 ) is less than one, the observation 𝑤𝑡+1 nudges 𝜋 downward.
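A quick numerical check of the likelihood-ratio form may help. The sketch assumes, purely for illustration, that 𝑓 is uniform on [0, 1] and 𝑔 is a Beta(3, 1.2) density; the function names are ours.

```python
from math import gamma


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x ** (a - 1) * (1 - x) ** (b - 1)


f = lambda w: beta_pdf(w, 1, 1)      # illustrative f
g = lambda w: beta_pdf(w, 3, 1.2)    # illustrative g


def likelihood_ratio(w):
    return f(w) / g(w)


def update_direct(pi, w):
    """Bayes' rule written in terms of f and g."""
    return pi * f(w) / (pi * f(w) + (1 - pi) * g(w))


def update_lr(pi, w):
    """The same update written in terms of the likelihood ratio l(w)."""
    lw = likelihood_ratio(w)
    return pi * lw / (pi * lw + 1 - pi)


pi = 0.5
for w in (0.2, 0.9):
    print(f"w = {w}: l(w) = {likelihood_ratio(w):.3f}, "
          f"new belief = {update_lr(pi, w):.3f}")
```

Under these assumptions the observation 𝑤 = 0.2 has a likelihood ratio above one and pushes the belief up from 0.5, while 𝑤 = 0.9 has a likelihood ratio below one and pushes it down.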
Representation (51.3) is the foundation of some graphs that we’ll use to display the dynamics of $\{\pi_t\}_{t=0}^\infty$ that
are induced by Bayes’ Law.
We’ll plot 𝑙 (𝑤) as a way to enlighten us about how learning – i.e., Bayesian updating of the probability 𝜋 that nature has
chosen distribution 𝑓 – works.
To create the Python infrastructure to do our work for us, we construct a wrapper function that displays informative graphs
given parameters of 𝑓 and 𝑔.
@vectorize
def p(x, a, b):
    "The general beta distribution function."
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x ** (a-1) * (1 - x) ** (b-1)

def learning_example(F_a=1, F_b=1, G_a=3, G_b=1.2):
    """
    A wrapper function that displays the updating rule of belief π,
    given the parameters which specify F and G distributions.
    """

    f = njit(lambda x: p(x, F_a, F_b))
    g = njit(lambda x: p(x, G_a, G_b))

    # l(w) = f(w) / g(w)
    l = lambda w: f(w) / g(w)
    # objective function for solving l(w) = 1
    obj = lambda w: l(w) - 1

    x_grid = np.linspace(0, 1, 100)
    π_grid = np.linspace(1e-3, 1-1e-3, 100)

    w_max = 1
    w_grid = np.linspace(1e-12, w_max-1e-12, 100)

    # the mode of beta distribution
    # use this to divide w into two intervals for root finding
    G_mode = (G_a - 1) / (G_a + G_b - 2)
    roots = np.empty(2)
    roots[0] = op.root_scalar(obj, bracket=[1e-10, G_mode]).root
    roots[1] = op.root_scalar(obj, bracket=[G_mode, 1-1e-10]).root

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))

    # left panel: likelihood ratio
    ax1.plot(l(w_grid), w_grid, label='$l$', lw=2)
    ax1.vlines(1., 0., 1., linestyle="--")
    ax1.hlines(roots, 0., 2., linestyle="--")
    ax1.set_xlim([0., 2.])
    ax1.legend(loc=4)
    ax1.set(xlabel='$l(w)=f(w)/g(w)$', ylabel='$w$')

    # middle panel: the two densities and the areas between the roots
    ax2.plot(f(x_grid), x_grid, label='$f$', lw=2)
    ax2.plot(g(x_grid), x_grid, label='$g$', lw=2)
    ax2.vlines(1., 0., 1., linestyle="--")
    ax2.legend(loc=4)
    ax2.set(xlabel='$f(w), g(w)$', ylabel='$w$')

    area1 = quad(f, 0, roots[0])[0]
    area2 = quad(g, roots[0], roots[1])[0]
    area3 = quad(f, roots[1], 1)[0]

    ax2.text((f(0) + f(roots[0])) / 4, roots[0] / 2, f"{area1: .3g}")
    ax2.fill_between([0, 1], 0, roots[0], color='blue', alpha=0.15)
    ax2.text(np.mean(g(roots)) / 2, np.mean(roots), f"{area2: .3g}")
    w_roots = np.linspace(roots[0], roots[1], 20)
    ax2.fill_betweenx(w_roots, 0, g(w_roots), color='orange', alpha=0.15)
    ax2.text((f(roots[1]) + f(1)) / 4, (roots[1] + 1) / 2, f"{area3: .3g}")
    ax2.fill_between([0, 1], roots[1], 1, color='blue', alpha=0.15)

    # right panel: direction and size of the belief update
    W = np.arange(0.01, 0.99, 0.08)
    Π = np.arange(0.01, 0.99, 0.08)

    ΔW = np.zeros((len(W), len(Π)))
    ΔΠ = np.empty((len(W), len(Π)))
    for i, w in enumerate(W):
        for j, π in enumerate(Π):
            lw = l(w)
            ΔΠ[i, j] = π * (lw / (π * lw + 1 - π) - 1)

    q = ax3.quiver(Π, W, ΔΠ, ΔW, scale=2, color='r', alpha=0.8)

    ax3.fill_between(π_grid, 0, roots[0], color='blue', alpha=0.15)
    ax3.fill_between(π_grid, roots[0], roots[1], color='green', alpha=0.15)
    ax3.fill_between(π_grid, roots[1], w_max, color='blue', alpha=0.15)
    ax3.set(xlabel=r'$\pi$', ylabel='$w$')
    ax3.grid()

    plt.show()
Now we’ll create a group of graphs that illustrate dynamics induced by Bayes’ Law.
We’ll begin with Python function default values of various objects, then change them in a subsequent example.
learning_example()
Please look at the three graphs above created for an instance in which 𝑓 is a uniform distribution on [0, 1] (i.e., a Beta
distribution with parameters 𝐹𝑎 = 1, 𝐹𝑏 = 1), while 𝑔 is a Beta distribution with the default parameter values 𝐺𝑎 =
3, 𝐺𝑏 = 1.2.
The graph on the left plots the likelihood ratio 𝑙(𝑤) on the abscissa against 𝑤 on the ordinate.
The middle graph plots both 𝑓(𝑤) and 𝑔(𝑤) against 𝑤, with the horizontal dotted lines showing values of 𝑤 at which the
likelihood ratio equals 1.
The graph on the right plots arrows to the right that show when Bayes’ Law makes 𝜋 increase and arrows to the left that
show when Bayes’ Law makes 𝜋 decrease.
Lengths of the arrows show magnitudes of the force from Bayes’ Law impelling 𝜋 to change.
These lengths depend on both the prior probability 𝜋 on the abscissa axis and the evidence in the form of the current draw
of 𝑤 on the ordinate axis.
The fractions in the colored areas of the middle graph are probabilities under 𝐹 and 𝐺, respectively, that realizations of
𝑤 fall into the interval that updates the belief 𝜋 in a correct direction (i.e., toward 0 when 𝐺 is the true distribution, and
toward 1 when 𝐹 is the true distribution).

For example, in the above example, under true distribution 𝐹 , 𝜋 will be updated toward 0 if 𝑤 falls into the interval
[0.524, 0.999], which occurs with probability 1 − .524 = .476 under 𝐹 .
But this would occur with probability 0.816 if 𝐺 were the true distribution.
The fraction 0.816 in the orange region is the integral of 𝑔(𝑤) over this interval.
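The interval and the two probabilities quoted above can be reproduced with a short calculation. The following is a minimal sketch, assuming `scipy.stats.beta` as a stand-in for the lecture's own density code; the root-finding brackets are chosen by inspection and are an assumption of this sketch.

```python
from scipy.optimize import brentq
from scipy.stats import beta

F_a, F_b = 1, 1      # f is uniform on [0, 1]
G_a, G_b = 3, 1.2    # default parameters of g

def l(w):
    "Likelihood ratio f(w) / g(w)"
    return beta.pdf(w, F_a, F_b) / beta.pdf(w, G_a, G_b)

# Endpoints of the interval on which l(w) < 1, so that observing w lowers π
# (brackets are an assumption: l crosses 1 exactly once inside each of them)
w_lo = brentq(lambda w: l(w) - 1, 0.1, 0.9)
w_hi = brentq(lambda w: l(w) - 1, 0.95, 1 - 1e-9)

# Probability that w falls in [w_lo, w_hi] under F and under G
prob_F = beta.cdf(w_hi, F_a, F_b) - beta.cdf(w_lo, F_a, F_b)
prob_G = beta.cdf(w_hi, G_a, G_b) - beta.cdf(w_lo, G_a, G_b)

print(f"interval = [{w_lo:.3f}, {w_hi:.3f}]")
print(f"probability under F = {prob_F:.3f}, under G = {prob_G:.3f}")
```

The two probabilities are differences of c.d.f. values, which is the same quantity as the integral of the corresponding density over the interval.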
Next we use our code to create graphs for another instance of our model.
We keep 𝐹 the same as in the preceding instance, namely a uniform distribution, but now assume that 𝐺 is a Beta
distribution with parameters 𝐺𝑎 = 2, 𝐺𝑏 = 1.6.
learning_example(G_a=2, G_b=1.6)
Notice how the likelihood ratio, the middle graph, and the arrows compare with the previous instance of our example.
51.8 Appendix
51.8.1 Sample Paths of 𝜋𝑡
Now we’ll have some fun by plotting multiple realizations of sample paths of 𝜋𝑡 under two possible assumptions about
nature’s choice of distribution, namely
• that nature permanently draws from 𝐹
• that nature permanently draws from 𝐺
Outcomes depend on a peculiar property of likelihood ratio processes discussed in this lecture.
To proceed, we create some Python code.
def function_factory(F_a=1, F_b=1, G_a=3, G_b=1.2):

    # define f and g (these two lines were lost in extraction; reconstructed
    # assuming the beta density p(x, a, b) defined earlier in this lecture)
    f = njit(lambda x: p(x, F_a, F_b))
    g = njit(lambda x: p(x, G_a, G_b))

    @njit
    def update(a, b, π):
        "Update π by drawing from beta distribution with parameters a and b"
        # Draw
        w = np.random.beta(a, b)
        # Update belief
        π = 1 / (1 + ((1 - π) * g(w)) / (π * f(w)))
        return π

    @njit
    def simulate_path(a, b, T=50):
        "Simulates a path of beliefs π with length T"
        π = np.empty(T+1)
        # initial condition
        π[0] = 0.5
        for t in range(1, T+1):
            π[t] = update(a, b, π[t-1])
        return π

    def simulate(a=1, b=1, T=50, N=200, display=True):
        "Simulates N paths of beliefs π with length T"
        π_paths = np.empty((N, T+1))
        if display:
            fig = plt.figure()
        for i in range(N):
            π_paths[i] = simulate_path(a=a, b=b, T=T)
            if display:
                plt.plot(range(T+1), π_paths[i], color='b', lw=0.8, alpha=0.5)
        if display:
            plt.show()
        return π_paths

    return simulate

simulate = function_factory()
We begin by generating 𝑁 simulated {𝜋𝑡 } paths with 𝑇 periods when the sequence is truly IID draws from 𝐹 . We set an
initial prior 𝜋−1 = .5.
T = 50
# when nature selects F

π_paths_F = simulate(a=1, b=1, T=T, N=1000)

In the above example, for most paths 𝜋𝑡 → 1.

So Bayes’ Law evidently eventually discovers the truth for most of our paths.
Next, we generate paths with 𝑇 periods when the sequence is truly IID draws from 𝐺. Again, we set the initial prior
𝜋−1 = .5.
# when nature selects G

π_paths_G = simulate(a=3, b=1.2, T=T, N=1000)
In the above graph we observe that now most paths 𝜋𝑡 → 0.

51.8.2 Rates of convergence
We study rates of convergence of 𝜋𝑡 to 1 when nature generates the data as IID draws from 𝐹 and of convergence of 𝜋𝑡
to 0 when nature generates IID draws from 𝐺.
We do this by averaging across simulated paths of $\{\pi_t\}_{t=0}^T$.
Using $N$ simulated $\pi_t$ paths, we compute $1 - \frac{1}{N}\sum_{i=1}^N \pi_{i,t}$ at each $t$ when the data are generated as draws from $F$ and
compute $\frac{1}{N}\sum_{i=1}^N \pi_{i,t}$ when the data are generated as draws from $G$.
plt.plot(range(T+1), 1 - np.mean(π_paths_F, 0), label='F generates')

plt.plot(range(T+1), np.mean(π_paths_G, 0), label='G generates')
plt.legend()
plt.title("convergence");
From the above graph, rates of convergence appear not to depend on whether 𝐹 or 𝐺 generates the data.
51.8.3 Graph of Ensemble Dynamics of 𝜋𝑡

More insights about the dynamics of $\{\pi_t\}$ can be gleaned by computing conditional expectations of $\frac{\pi_{t+1}}{\pi_t}$ as functions of
$\pi_t$ via integration with respect to the pertinent probability distribution:

$$
\begin{aligned}
E\left[\frac{\pi_{t+1}}{\pi_t} \,\Big|\, q = a, \pi_t\right]
&= E\left[\frac{l(w_{t+1})}{\pi_t l(w_{t+1}) + (1 - \pi_t)} \,\Big|\, q = a, \pi_t\right] \\
&= \int_0^1 \frac{l(w_{t+1})}{\pi_t l(w_{t+1}) + (1 - \pi_t)} \, a(w_{t+1}) \, dw_{t+1}
\end{aligned}
$$

where $a = f, g$.
The following code approximates the integral above:
def expected_ratio(F_a=1, F_b=1, G_a=3, G_b=1.2):

    # define f and g (these definitions were lost in extraction; reconstructed
    # assuming the beta density p(x, a, b) and scipy.integrate.quad are available)
    f = njit(lambda x: p(x, F_a, F_b))
    g = njit(lambda x: p(x, G_a, G_b))

    # likelihood ratio
    l = lambda w: f(w) / g(w)

    integrand_f = lambda w, π: f(w) * l(w) / (π * l(w) + 1 - π)
    integrand_g = lambda w, π: g(w) * l(w) / (π * l(w) + 1 - π)

    π_grid = np.linspace(0.02, 0.98, 100)
    ratios = np.empty(len(π_grid))

    for q, inte in zip(["f", "g"], [integrand_f, integrand_g]):
        for i, π in enumerate(π_grid):
            ratios[i] = quad(inte, 0, 1, args=(π,))[0]
        plt.plot(π_grid, ratios, label=f"{q} generates")

    plt.hlines(1, 0, 1, linestyle="--")
    plt.xlabel("$π_t$")
    plt.ylabel(r"$E[\pi_{t+1}/\pi_t]$")
    plt.legend()
    plt.show()
First, consider the case where 𝐹𝑎 = 𝐹𝑏 = 1 and 𝐺𝑎 = 3, 𝐺𝑏 = 1.2.
expected_ratio()
The above graph shows that when 𝐹 generates the data, 𝜋𝑡 on average always heads north, while when 𝐺 generates the
data, 𝜋𝑡 heads south.
Next, we’ll look at a degenerate case in which 𝑓 and 𝑔 are identical beta distributions, and 𝐹𝑎 = 𝐺𝑎 = 3, 𝐹𝑏 = 𝐺𝑏 = 1.2.
In a sense, here there is nothing to learn.
expected_ratio(F_a=3, F_b=1.2)

The above graph says that 𝜋𝑡 is inert and remains at its initial value.
Finally, let’s look at a case in which 𝑓 and 𝑔 are neither very different nor identical, in particular one in which 𝐹𝑎 =
2, 𝐹𝑏 = 1 and 𝐺𝑎 = 3, 𝐺𝑏 = 1.2.
expected_ratio(F_a=2, F_b=1, G_a=3, G_b=1.2)

51.9 Sequels
We’ll apply and dig deeper into some of the ideas presented in this lecture:
• this lecture describes likelihood ratio processes and their role in frequentist and Bayesian statistical theories
• this lecture studies a World War II US Navy Captain’s hunch that a (frequentist) decision rule that the Navy
had told him to use was inferior to a sequential rule that Abraham Wald had not yet designed.


CHAPTER
FIFTYTWO
LIKELIHOOD RATIO PROCESSES AND BAYESIAN LEARNING
52.1 Overview
This lecture describes the role that likelihood ratio processes play in Bayesian learning.
As in this lecture, we’ll use a simple statistical setting from this lecture.
We’ll focus on how a likelihood ratio process and a prior probability determine a posterior probability.
We’ll derive a convenient recursion for today’s posterior as a function of yesterday’s posterior and today’s multiplicative
increment to a likelihood process.
We’ll also present a useful generalization of that formula that represents today’s posterior in terms of an initial prior and
today’s realization of the likelihood ratio process.
We’ll study how, at least in our setting, a Bayesian eventually learns the probability distribution that generates the data, an
outcome that rests on the asymptotic behavior of likelihood ratio processes studied in this lecture.
We’ll also drill down into the psychology of our Bayesian learner and study dynamics under his subjective beliefs.
This lecture provides technical results that underlie outcomes to be studied in this lecture and this lecture and this lecture.
We’ll begin by loading some Python modules.

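import matplotlib.pyplot as plt
import numpy as np
from numba import vectorize, njit, prange
from math import gamma
import pandas as pd

@njit
def set_seed():
    # the body of this function was lost in extraction;
    # the seed value below is a placeholder
    np.random.seed(0)

set_seed()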
52.2 The Setting
We begin by reviewing the setting in this lecture, which we adopt here too.
A nonnegative random variable 𝑊 has one of two probability density functions, either 𝑓 or 𝑔.
Before the beginning of time, nature once and for all decides whether she will draw a sequence of IID draws from 𝑓 or
from 𝑔.
We will sometimes let 𝑞 be the density that nature chose once and for all, so that 𝑞 is either 𝑓 or 𝑔, permanently.
Nature knows which density it permanently draws from, but we the observers do not.
We do know both 𝑓 and 𝑔, but we don’t know which density nature chose.
But we want to know.
To do that, we use observations.
We observe a sequence {𝑤𝑡 }𝑇𝑡=1 of 𝑇 IID draws from either 𝑓 or 𝑔.
We want to use these observations to infer whether nature chose 𝑓 or 𝑔.
A likelihood ratio process is a useful tool for this task.
To begin, we define the key component of a likelihood ratio process, namely, the time 𝑡 likelihood ratio as the random
variable
$$
\ell(w_t) = \frac{f(w_t)}{g(w_t)}, \quad t \geq 1.
$$
We assume that $f$ and $g$ both put positive probabilities on the same intervals of possible realizations of the random variable
$W$.
That means that under the $g$ density, $\ell(w_t) = \frac{f(w_t)}{g(w_t)}$ is evidently a nonnegative random variable with mean 1.
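The mean-one property follows because $\int \frac{f(w)}{g(w)} g(w) dw = \int f(w) dw = 1$. The following is a minimal numerical check, assuming `scipy.stats.beta` densities with the parameter values used elsewhere in this lecture; it is a sketch and not the lecture's own code.

```python
from scipy.integrate import quad
from scipy.stats import beta

f = lambda w: beta.pdf(w, 1, 1)      # uniform density
g = lambda w: beta.pdf(w, 3, 1.2)    # beta density

# E_g[l(w)]: integrate the likelihood ratio against the density g
mean_l_under_g, abs_err = quad(lambda w: f(w) / g(w) * g(w), 0, 1)

print(mean_l_under_g)
```

Quadrature is used here instead of simulation because, for these parameter values, the likelihood ratio has a heavy right tail under $g$, which makes sample averages converge slowly.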
A likelihood ratio process for sequence $\{w_t\}_{t=1}^\infty$ is defined as

$$
L(w^t) = \prod_{i=1}^t \ell(w_i),
$$

where $w^t = \{w_1, \ldots, w_t\}$ is a history of observations up to and including time $t$.

Sometimes for shorthand we’ll write $L_t = L(w^t)$.
Notice that the likelihood process satisfies the recursion or multiplicative decomposition

$$
L(w^t) = \ell(w_t) L(w^{t-1}).
$$
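The recursion says that the likelihood ratio process is a running product of one-period likelihood ratios. A small sketch, assuming `scipy.stats.beta` densities and an arbitrary seed (both are choices made for this illustration), compares the recursion with a cumulative product.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)                   # seed is arbitrary
w = rng.beta(3, 1.2, size=20)                    # IID draws from g
l = beta.pdf(w, 1, 1) / beta.pdf(w, 3, 1.2)      # one-period ratios l(w_t)

# Product form: L(w^t) = l(w_1) * ... * l(w_t)
L_cumprod = np.cumprod(l)

# Recursive form: L(w^t) = l(w_t) * L(w^{t-1})
L_rec = np.empty_like(l)
L_rec[0] = l[0]
for t in range(1, len(l)):
    L_rec[t] = l[t] * L_rec[t - 1]

print(np.allclose(L_cumprod, L_rec))
```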
The likelihood ratio and its logarithm are key tools for making inferences using a classic frequentist approach due to
Neyman and Pearson [Neyman and Pearson, 1933].
We’ll again deploy the following Python code from this lecture that evaluates 𝑓 and 𝑔 as two different beta distributions,
then computes and simulates an associated likelihood ratio process by generating a sequence 𝑤𝑡 from some probability
distribution, for example, a sequence of IID draws from 𝑔.

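from math import gamma

# Parameters in the two beta distributions.
F_a, F_b = 1, 1
G_a, G_b = 3, 1.2

@vectorize
def p(x, a, b):
    # normalizing constant of the beta density
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x** (a-1) * (1 - x) ** (b-1)

# The two density functions.
f = njit(lambda x: p(x, F_a, F_b))
g = njit(lambda x: p(x, G_a, G_b))

# NOTE: parts of this listing were lost in extraction; the function name,
# signature and default arguments below are reconstructed
@njit
def simulate(a, b, T=50, N=500):
    '''
    Generate N sets of T observations of the likelihood ratio,
    return as N x T matrix.
    '''
    l_arr = np.empty((N, T))
    for i in range(N):
        for j in range(T):
            w = np.random.beta(a, b)
            l_arr[i, j] = f(w) / g(w)
    return l_arr

We’ll also use the following Python code to prepare some informative simulations

# likelihood ratios and their cumulative products (the value of N is an assumption)
l_arr_g = simulate(G_a, G_b, N=50000)
l_seq_g = np.cumprod(l_arr_g, axis=1)
l_arr_f = simulate(F_a, F_b, N=50000)
l_seq_f = np.cumprod(l_arr_f, axis=1)
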
52.3 Likelihood Ratio Process and Bayes’ Law
Let 𝜋𝑡 be a Bayesian posterior defined as
$$
\pi_t = {\rm Prob}(q = f | w^t)
$$
The likelihood ratio process is a principal actor in the formula that governs the evolution of the posterior probability 𝜋𝑡 ,
an instance of Bayes’ Law.
Bayes’ law implies that {𝜋𝑡 } obeys the recursion
$$
\pi_t = \frac{\pi_{t-1} l_t(w_t)}{\pi_{t-1} l_t(w_t) + 1 - \pi_{t-1}} \tag{52.1}
$$
with 𝜋0 being a Bayesian prior probability that 𝑞 = 𝑓, i.e., a personal or subjective belief about 𝑞 based on our having
seen no data.
Below we define a Python function that updates belief 𝜋 using likelihood ratio ℓ according to recursion (52.1)

@njit
def update(π, l):
    "Update π using likelihood l"
    # Update belief
    π = π * l / (π * l + 1 - π)
    return π
Formula (52.1) can be generalized by iterating on it and thereby deriving an expression for the time $t+1$ posterior $\pi_{t+1}$ as a
function of the time 0 prior $\pi_0$ and the likelihood ratio process $L(w^{t+1})$ at time $t+1$.
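To begin, notice that the updating rule

$$
\pi_{t+1} = \frac{\pi_t \ell(w_{t+1})}{\pi_t \ell(w_{t+1}) + (1 - \pi_t)}
$$

implies

$$
\begin{aligned}
\frac{1}{\pi_{t+1}} &= \frac{\pi_t \ell(w_{t+1}) + (1 - \pi_t)}{\pi_t \ell(w_{t+1})} \\
&= 1 - \frac{1}{\ell(w_{t+1})} + \frac{1}{\ell(w_{t+1})} \frac{1}{\pi_t}.
\end{aligned}
$$

$$
\Rightarrow \frac{1}{\pi_{t+1}} - 1 = \frac{1}{\ell(w_{t+1})} \left(\frac{1}{\pi_t} - 1\right).
$$

Therefore

$$
\frac{1}{\pi_{t+1}} - 1 = \frac{1}{\prod_{i=1}^{t+1} \ell(w_i)} \left(\frac{1}{\pi_0} - 1\right) = \frac{1}{L(w^{t+1})} \left(\frac{1}{\pi_0} - 1\right).
$$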
Since 𝜋0 ∈ (0, 1) and 𝐿 (𝑤𝑡+1 ) > 0, we can verify that 𝜋𝑡+1 ∈ (0, 1).
After rearranging the preceding equation, we can express $\pi_{t+1}$ as a function of $L(w^{t+1})$, the likelihood ratio process at
$t + 1$, and the initial prior $\pi_0$

$$
\pi_{t+1} = \frac{\pi_0 L(w^{t+1})}{\pi_0 L(w^{t+1}) + 1 - \pi_0}. \tag{52.2}
$$
Formula (52.2) generalizes formula (52.1).
Formula (52.2) can be regarded as a one step revision of prior probability $\pi_0$ after seeing the batch of data $\{w_i\}_{i=1}^{t+1}$.
Formula (52.2) shows the key role that the likelihood ratio process 𝐿 (𝑤𝑡+1 ) plays in determining the posterior probability
𝜋𝑡+1 .
Formula (52.2) is the foundation for the insight that, because of how the likelihood ratio process behaves as 𝑡 → +∞,
the likelihood ratio process dominates the initial prior 𝜋0 in determining the limiting behavior of 𝜋𝑡 .
To illustrate this insight, below we will plot graphs showing one simulated path of the likelihood ratio process 𝐿𝑡 along
with two paths of 𝜋𝑡 that are associated with the same realization of the likelihood ratio process but different initial prior
probabilities 𝜋0 .
First, we tell Python two values of 𝜋0 .
π1, π2 = 0.2, 0.8
Next we generate paths of the likelihood ratio process 𝐿𝑡 and the posterior 𝜋𝑡 for a history of IID draws from density 𝑓.

T = l_arr_f.shape[1]
π_seq_f = np.empty((2, T+1))
π_seq_f[:, 0] = π1, π2

for t in range(T):
    for i in range(2):
        π_seq_f[i, t+1] = update(π_seq_f[i, t], l_arr_f[0, t])

fig, ax1 = plt.subplots()

for i in range(2):
    ax1.plot(range(T+1), π_seq_f[i, :], label=rf"$\pi_0$={π_seq_f[i, 0]}")

ax1.set_ylabel(r"$\pi_t$")
ax1.set_xlabel("t")
ax1.legend()
ax1.set_title("when f governs data")

ax2 = ax1.twinx()
ax2.plot(range(1, T+1), np.log(l_seq_f[0, :]), '--', color='b')
ax2.set_ylabel("$log(L(w^{t}))$")

plt.show()
The dotted line in the graph above records the logarithm of the likelihood ratio process log 𝐿(𝑤𝑡 ).
Please note that there are two different scales on the 𝑦 axis.
Now let’s study what happens when the history consists of IID draws from density 𝑔
T = l_arr_g.shape[1]
π_seq_g = np.empty((2, T+1))
π_seq_g[:, 0] = π1, π2

for t in range(T):
    for i in range(2):
        π_seq_g[i, t+1] = update(π_seq_g[i, t], l_arr_g[0, t])

# create the figure (this line was missing from the extracted listing)
fig, ax1 = plt.subplots()

for i in range(2):
    ax1.plot(range(T+1), π_seq_g[i, :], label=rf"$\pi_0$={π_seq_g[i, 0]}")

ax1.set_ylabel(r"$\pi_t$")
ax1.set_xlabel("t")
ax1.legend()
ax1.set_title("when g governs data")

ax2 = ax1.twinx()
ax2.plot(range(1, T+1), np.log(l_seq_g[0, :]), '--', color='b')
ax2.set_ylabel("$log(L(w^{t}))$")

plt.show()
Below we offer Python code that verifies that formula (52.2) reproduces the posteriors generated by recursion (52.1) when nature chose permanently to draw from density 𝑓.
π_seq = np.empty((2, T+1))
π_seq[:, 0] = π1, π2

for i in range(2):
    πL = π_seq[i, 0] * l_seq_f[0, :]
    π_seq[i, 1:] = πL / (πL + 1 - π_seq[i, 0])

np.abs(π_seq - π_seq_f).max() < 1e-10
True
We thus conclude that the likelihood ratio process is a key ingredient of the formula (52.2) for a Bayesian’s posterior
probability that nature has drawn history $w^t$ as repeated draws from density $f$.

52.4 Behavior of posterior probability {𝜋𝑡 } under the subjective probability distribution
We’ll end this lecture by briefly studying what our Bayesian learner expects to learn under the subjective beliefs 𝜋𝑡 cranked
out by Bayes’ law.
This will provide us with some perspective on our application of Bayes’ law as a theory of learning.
As we shall see, at each time 𝑡, the Bayesian learner knows that he will be surprised.
But he expects that new information will not lead him to change his beliefs.
And it won’t on average under his subjective beliefs.
We’ll continue with our setting in which a McCall worker knows that successive draws of his wage are drawn from either
𝐹 or 𝐺, but does not know which of these two distributions nature has drawn once-and-for-all before time 0.
We’ll review and reiterate and rearrange some formulas that we have encountered above and in associated lectures.
The worker’s initial beliefs induce a joint probability distribution over a potentially infinite sequence of draws 𝑤0 , 𝑤1 , ….
Bayes’ law is simply an application of laws of probability to compute the conditional distribution of the 𝑡th draw 𝑤𝑡
conditional on [𝑤0 , … , 𝑤𝑡−1 ].
After our worker puts a subjective probability $\pi_{-1}$ on nature having selected distribution $F$, we have in effect assumed
from the start that the decision maker knows the joint distribution for the process $\{w_t\}_{t=0}^\infty$.
We assume that the worker also knows the laws of probability theory.
A respectable view is that Bayes’ law is less a theory of learning than a statement about the consequences of information
inflows for a decision maker who thinks he knows the truth (i.e., a joint probability distribution) from the beginning.
52.4.1 Mechanical details again
At time 0 before drawing a wage offer, the worker attaches probability 𝜋−1 ∈ (0, 1) to the distribution being 𝐹 .
Before drawing a wage at time 0, the worker thus believes that the density of 𝑤0 is
ℎ(𝑤0 ; 𝜋−1 ) = 𝜋−1 𝑓(𝑤0 ) + (1 − 𝜋−1 )𝑔(𝑤0 ).
Let 𝑎 ∈ {𝑓, 𝑔} be an index that indicates whether nature chose permanently to draw from distribution 𝑓 or from distri-
bution 𝑔.
After drawing $w_0$, the worker uses Bayes’ law to deduce that the posterior probability $\pi_0 = {\rm Prob}(a = f | w_0)$ that the density
is $f(w)$ is

$$
\pi_0 = \frac{\pi_{-1} f(w_0)}{\pi_{-1} f(w_0) + (1 - \pi_{-1}) g(w_0)}.
$$
More generally, after making the $t$th draw and having observed $w_t, w_{t-1}, \ldots, w_0$, the worker believes that the probability
that $w_{t+1}$ is being drawn from distribution $F$ is

$$
\pi_t = \pi_t(w_t | \pi_{t-1}) \equiv \frac{\pi_{t-1} f(w_t)/g(w_t)}{\pi_{t-1} f(w_t)/g(w_t) + (1 - \pi_{t-1})} \tag{52.3}
$$

or

$$
\pi_t = \frac{\pi_{t-1} l_t(w_t)}{\pi_{t-1} l_t(w_t) + 1 - \pi_{t-1}}
$$

and that the density of 𝑤𝑡+1 conditional on 𝑤𝑡 , 𝑤𝑡−1 , … , 𝑤0 is
ℎ(𝑤𝑡+1 ; 𝜋𝑡 ) = 𝜋𝑡 𝑓(𝑤𝑡+1 ) + (1 − 𝜋𝑡 )𝑔(𝑤𝑡+1 ).
Notice that
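$$
\begin{aligned}
E(\pi_t | \pi_{t-1}) &= \int \Bigl[\frac{\pi_{t-1} f(w)}{\pi_{t-1} f(w) + (1 - \pi_{t-1}) g(w)}\Bigr] \Bigl[\pi_{t-1} f(w) + (1 - \pi_{t-1}) g(w)\Bigr] dw \\
&= \pi_{t-1} \int f(w) dw \\
&= \pi_{t-1},
\end{aligned}
$$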
so that the process 𝜋𝑡 is a martingale.

Indeed, it is a bounded martingale because each 𝜋𝑡 , being a probability, is between 0 and 1.
In the first line in the above string of equalities, the term in the first set of brackets is just 𝜋𝑡 as a function of 𝑤𝑡 , while
the term in the second set of brackets is the density of 𝑤𝑡 conditional on 𝑤𝑡−1 , … , 𝑤0 or equivalently conditional on the
sufficient statistic 𝜋𝑡−1 for 𝑤𝑡−1 , … , 𝑤0 .
Notice that here we are computing 𝐸(𝜋𝑡 |𝜋𝑡−1 ) under the subjective density described in the second term in brackets.
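The martingale property can be checked numerically by evaluating the integral above for a few values of $\pi_{t-1}$. This is a minimal sketch, assuming `scipy.stats.beta` densities with the lecture's default parameters; the helper name `expected_posterior` is hypothetical.

```python
from scipy.integrate import quad
from scipy.stats import beta

f = lambda w: beta.pdf(w, 1, 1)
g = lambda w: beta.pdf(w, 3, 1.2)

def expected_posterior(pi_prev):
    "E(π_t | π_{t-1}) under the subjective mixture density"
    h = lambda w: pi_prev * f(w) + (1 - pi_prev) * g(w)    # density of w_t
    post = lambda w: pi_prev * f(w) / h(w)                 # π_t as a function of w_t
    return quad(lambda w: post(w) * h(w), 0, 1)[0]

results = {pi_prev: expected_posterior(pi_prev) for pi_prev in (0.1, 0.5, 0.9)}

for pi_prev, value in results.items():
    print(pi_prev, value)
```

Each computed expectation should coincide with the value of $\pi_{t-1}$ it was conditioned on, up to quadrature error.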
Because $\{\pi_t\}$ is a bounded martingale sequence, it follows from the martingale convergence theorem that $\pi_t$ converges
almost surely to a random variable in $[0, 1]$.
Practically, this means that probability one is attached to sample paths $\{\pi_t\}_{t=0}^\infty$ that converge.
According to the theorem, different sample paths can converge to different limiting values.
Thus, let $\{\pi_t(\omega)\}_{t=0}^\infty$ denote a particular sample path indexed by a particular $\omega \in \Omega$.
We can think of nature as drawing an $\omega \in \Omega$ from a probability distribution ${\rm Prob}\,\Omega$ and then generating a single realization
(or simulation) $\{\pi_t(\omega)\}_{t=0}^\infty$ of the process.
The limit points of $\{\pi_t(\omega)\}_{t=0}^\infty$ as $t \to +\infty$ are realizations of a random variable that is swept out as we sample $\omega$ from
$\Omega$ and construct repeated draws of $\{\pi_t(\omega)\}_{t=0}^\infty$.
By staring at law of motion (52.1) or (52.3), we can figure out some things about the probability distribution of the limit
points

$$
\pi_\infty(\omega) = \lim_{t \to +\infty} \pi_t(\omega).
$$

Evidently, since the likelihood ratio $\ell(w_t)$ differs from 1 when $f \neq g$, as we have assumed, the only possible fixed points
of (52.3) are

$$
\pi_\infty(\omega) = 1
$$

and

$$
\pi_\infty(\omega) = 0
$$

Thus, for some realizations, $\lim_{t \to +\infty} \pi_t(\omega) = 1$ while for other realizations, $\lim_{t \to +\infty} \pi_t(\omega) = 0$.
Now let’s remember that $\{\pi_t\}_{t=0}^\infty$ is a martingale and apply the law of iterated expectations.
The law of iterated expectations implies
𝐸𝑡 𝜋𝑡+𝑗 = 𝜋𝑡
and in particular
𝐸−1 𝜋𝑡+𝑗 = 𝜋−1 .

Applying the above formula to 𝜋∞ , we obtain
𝐸−1 𝜋∞ (𝜔) = 𝜋−1
where the mathematical expectation 𝐸−1 here is taken with respect to the probability measure Prob(Ω).
Since the only two values that 𝜋∞ (𝜔) can take are 1 and 0, we know that for some 𝜆 ∈ [0, 1]
Prob(𝜋∞ (𝜔) = 1) = 𝜆, Prob(𝜋∞ (𝜔) = 0) = 1 − 𝜆
and consequently that
𝐸−1 𝜋∞ (𝜔) = 𝜆 ⋅ 1 + (1 − 𝜆) ⋅ 0 = 𝜆
Combining this equation with the preceding equality $E_{-1} \pi_\infty(\omega) = \pi_{-1}$, we deduce that the probability that ${\rm Prob}(\Omega)$ attaches to $\pi_\infty(\omega)$ being 1 must
be $\pi_{-1}$.
Thus, under the worker’s subjective distribution, 𝜋−1 of the sample paths of {𝜋𝑡 } will converge pointwise to 1 and 1−𝜋−1
of the sample paths will converge pointwise to 0.
52.4.2 Some simulations
Let’s watch the martingale convergence theorem at work in some simulations of our learning model under the worker’s
subjective distribution.
Let us simulate $\{\pi_t\}_{t=0}^T$, $\{w_t\}_{t=0}^T$ paths where for each $t \geq 0$, $w_t$ is drawn from the subjective distribution

$$
\pi_{t-1} f(w_t) + (1 - \pi_{t-1}) g(w_t)
$$
We’ll plot a large sample of paths.
@njit
def martingale_simulate(π0, N=5000, T=200):
    π_path = np.empty((N,T+1))
    w_path = np.empty((N,T))
    π_path[:,0] = π0

    for n in range(N):
        π = π0
        for t in range(T):
            # draw w from the subjective mixture distribution
            if np.random.rand() <= π:
                w = np.random.beta(F_a, F_b)
            else:
                w = np.random.beta(G_a, G_b)
            # update belief with Bayes' law
            π = π*f(w)/g(w)/(π*f(w)/g(w) + 1 - π)
            π_path[n,t+1] = π
            w_path[n,t] = w

    return π_path, w_path

def fraction_0_1(π0, N, T, decimals):
    π_path, w_path = martingale_simulate(π0, N=N, T=T)
    # tabulate rounded terminal beliefs
    values, counts = np.unique(np.round(π_path[:,-1], decimals=decimals), return_counts=True)
    return values, counts

def create_table(π0s, N=10000, T=500, decimals=2):
    outcomes = []
    for π0 in π0s:
        values, counts = fraction_0_1(π0, N=N, T=T, decimals=decimals)
        freq = counts/N
        outcomes.append(dict(zip(values, freq)))

    table = pd.DataFrame(outcomes).sort_index(axis=1).fillna(0)
    table.index = π0s
    return table
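# simulate
T = 200
π0 = .5

π_path, w_path = martingale_simulate(π0=π0, T=T, N=10000)

fig, ax = plt.subplots()
# plot a subset of the simulated paths (the loop header was lost in
# extraction; plotting 100 paths is an assumption)
for i in range(100):
    ax.plot(range(T+1), π_path[i, :])

ax.set_xlabel('$t$')
ax.set_ylabel(r'$\pi_t$')
plt.show()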
The above graph indicates that

• each of the paths converges
• some of the paths converge to 1
• some of the paths converge to 0

• none of the paths converge to a limit point not equal to 0 or 1

Convergence actually occurs pretty fast, as the following graph of the cross-ensemble distribution of 𝜋𝑡 for various small
𝑡’s indicates.
fig, ax = plt.subplots()
for t in [1, 10, T-1]:
    ax.hist(π_path[:,t], bins=20, alpha=0.4, label=f'T={t}')

ax.set_ylabel('count')
ax.set_xlabel(r'$\pi_T$')
ax.legend()
plt.show()
Evidently, by 𝑡 = 199, 𝜋𝑡 has converged to either 0 or 1.

The fraction of paths that have converged to 1 is .5.
The fraction of paths that have converged to 0 is also .5.
Does the fraction .5 ring a bell?
Yes, it does: it equals the value of 𝜋0 = .5 that we used to generate each sequence in the ensemble.
So let’s change 𝜋0 to .3 and watch what happens to the distribution of the ensemble of 𝜋𝑡 ’s for various 𝑡’s.
# simulate
T = 200
π0 = .3

π_path3, w_path3 = martingale_simulate(π0=π0, T=T, N=10000)

fig, ax = plt.subplots()
for t in [1, 10, T-1]:
    ax.hist(π_path3[:,t], bins=20, alpha=0.4, label=f'T={t}')

ax.set_ylabel('count')
ax.set_xlabel(r'$\pi_T$')
ax.legend()
plt.show()
For the preceding ensemble that assumed 𝜋0 = .5, the following graph shows two paths of 𝑤𝑡 ’s and the 𝜋𝑡 sequences that
gave rise to them.
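Notice that one of the paths involves systematically higher 𝑤𝑡 ’s. With the default parameters, draws of 𝑤 in roughly [0.524, 0.999] are more likely under 𝐺 than under 𝐹 and therefore push 𝜋𝑡 downward.
The luck of the draw early in a simulation pushes the subjective distribution to draw from one of the two distributions more frequently along a sample
path: more frequent draws from 𝐹 push 𝜋𝑡 toward 1, while more frequent draws from 𝐺 push 𝜋𝑡 toward 0.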
fig, ax = plt.subplots()
# assumption: any two Matplotlib colors will do; the original palette definition is not shown
colors = ['C0', 'C1']

for i, j in enumerate([10, 100]):
    ax.plot(range(T+1), π_path[j,:], color=colors[i], label=rf'$\pi$_path, {j}-th simulation')
    ax.plot(range(1,T+1), w_path[j,:], color=colors[i], label=rf'$w$_path, {j}-th simulation', alpha=0.3)

ax.legend()
ax.set_xlabel('$t$')
ax.set_ylabel(r'$\pi_t$')
ax2 = ax.twinx()
ax2.set_ylabel("$w_t$")
plt.show()

52.5 Initial Prior is Verified by Paths Drawn from Subjective Conditional Densities
Now let’s use our Python code to generate a table that checks out our earlier claims about the probability distribution of
the pointwise limits 𝜋∞ (𝜔).
We’ll use our simulations to generate a histogram of this distribution.
In the following table, the left column in bold face reports an assumed value of 𝜋−1 .
The second column reports the fraction of 𝑁 = 10000 simulations for which 𝜋𝑡 had converged to 0 at the terminal date
𝑇 = 500 for each simulation.
The third column reports the fraction of 𝑁 = 10000 simulations for which 𝜋𝑡 had converged to 1 at the terminal date
𝑇 = 500 for each simulation.
# create table
table = create_table(list(np.linspace(0,1,11)), N=10000, T=500)
table
𝜋−1     0.0     1.0
0.0  1.0000  0.0000
0.1  0.8929  0.1071
0.2  0.7994  0.2006
0.3  0.7014  0.2986
0.4  0.5939  0.4061
0.5  0.5038  0.4962
0.6  0.3982  0.6018
0.7  0.3092  0.6908
0.8  0.1963  0.8037
0.9  0.0963  0.9037
1.0  0.0000  1.0000
The fraction of simulations for which 𝜋𝑡 had converged to 1 is indeed always close to 𝜋−1 , as anticipated.
52.6 Drilling Down a Little Bit
To understand how the local dynamics of $\pi_t$ behave, it is enlightening to consult the variance of $\pi_t$ conditional on $\pi_{t-1}$.
Under the subjective distribution this conditional variance is defined as

$$
\sigma^2(\pi_t | \pi_{t-1}) = \int \Bigl[\frac{\pi_{t-1} f(w)}{\pi_{t-1} f(w) + (1 - \pi_{t-1}) g(w)} - \pi_{t-1}\Bigr]^2 \Bigl[\pi_{t-1} f(w) + (1 - \pi_{t-1}) g(w)\Bigr] dw
$$
We can use a Monte Carlo simulation to approximate this conditional variance.

We approximate it for a grid of points 𝜋𝑡−1 ∈ [0, 1].
Then we’ll plot it.
@njit
def compute_cond_var(pi, mc_size=int(1e6)):
    # create monte carlo draws from the subjective mixture distribution
    mc_draws = np.zeros(mc_size)

    for i in prange(mc_size):
        if np.random.rand() <= pi:
            mc_draws[i] = np.random.beta(F_a, F_b)
        else:
            mc_draws[i] = np.random.beta(G_a, G_b)

    # deviation of the updated belief from the prior belief
    dev = pi*f(mc_draws)/(pi*f(mc_draws) + (1-pi)*g(mc_draws)) - pi
    return np.mean(dev**2)

pi_array = np.linspace(0, 1, 40)
cond_var_array = []

for pi in pi_array:
    cond_var_array.append(compute_cond_var(pi))

fig, ax = plt.subplots()
ax.plot(pi_array, cond_var_array)
ax.set_xlabel(r'$\pi_{t-1}$')
ax.set_ylabel(r'$\sigma^{2}(\pi_{t}\vert \pi_{t-1})$')
plt.show()

The shape of the conditional variance as a function of 𝜋𝑡−1 is informative about the behavior of sample paths of {𝜋𝑡 }.
Notice how the conditional variance approaches 0 for 𝜋𝑡−1 near either 0 or 1.
The conditional variance is nearly zero only when the agent is almost sure that 𝑤𝑡 is drawn from 𝐹 , or is almost sure it is
drawn from 𝐺.
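As a cross-check on the Monte Carlo approximation, the conditional variance integral can also be evaluated by numerical quadrature. The sketch below assumes `scipy.stats.beta` densities with the lecture's default parameters; the function name `cond_var_quad` is hypothetical and is not part of the lecture's code.

```python
from scipy.integrate import quad
from scipy.stats import beta

f = lambda w: beta.pdf(w, 1, 1)
g = lambda w: beta.pdf(w, 3, 1.2)

def cond_var_quad(pi_prev):
    "Conditional variance of π_t given π_{t-1}, computed by quadrature"
    h = lambda w: pi_prev * f(w) + (1 - pi_prev) * g(w)    # subjective density of w
    dev = lambda w: pi_prev * f(w) / h(w) - pi_prev        # π_t minus its conditional mean
    return quad(lambda w: dev(w) ** 2 * h(w), 0, 1)[0]

for pi_prev in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(pi_prev, cond_var_quad(pi_prev))
```

At the endpoints the integrand is identically zero, which matches the observation that beliefs stop moving once the agent is certain.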
52.7 Sequels
This lecture has been devoted to building some useful infrastructure that will help us understand inferences that are the
foundations of results described in this lecture and this lecture and this lecture.


CHAPTER
FIFTYTHREE
INCORRECT MODELS
!pip install numpyro jax
53.1 Overview
This is a sequel to this quantecon lecture.

We discuss two ways to create a compound lottery and their consequences.
A compound lottery can be said to create a mixture distribution.
Our two ways of constructing a compound lottery will differ in their timing.
• in one, mixing between two possible probability distributions will occur once and for all at the beginning of time
• in the other, mixing between the same two possible probability distributions will occur each period
The statistical setting is close but not identical to the problem studied in that quantecon lecture.
In that lecture, there were two i.i.d. processes that could possibly govern successive draws of a non-negative random
variable 𝑊 .
Nature decided once and for all whether to make a sequence of IID draws from either 𝑓 or from 𝑔.
That lecture studied an agent who knew both 𝑓 and 𝑔 but did not know which distribution nature chose at time −1.
The agent represented that ignorance by assuming that nature had chosen 𝑓 or 𝑔 by flipping an unfair coin that put
probability 𝜋−1 on probability distribution 𝑓.
That assumption allowed the agent to construct a subjective joint probability distribution over the random sequence
$\{W_t\}_{t=0}^\infty$.
We studied how the agent would then use the laws of conditional probability and an observed history $w^t = \{w_s\}_{s=0}^t$ to
form

$$
\pi_t = E[\text{nature chose distribution } f \mid w^t], \quad t = 0, 1, 2, \ldots
$$
However, in the setting of this lecture, that rule imputes to the agent an incorrect model.
The reason is that now the wage sequence is actually described by a different statistical model.
Thus, we change the quantecon lecture specification in the following way.
Now, each period 𝑡 ≥ 0, nature flips a possibly unfair coin that comes up 𝑓 with probability 𝛼 and 𝑔 with probability
1 − 𝛼.
Thus, nature perpetually draws from the mixture distribution with c.d.f.
𝐻(𝑤) = 𝛼𝐹 (𝑤) + (1 − 𝛼)𝐺(𝑤), 𝛼 ∈ (0, 1)
We’ll study two agents who try to learn about the wage process, but who use different statistical models.
Both types of agent know 𝑓 and 𝑔 but neither knows 𝛼.
Our first type of agent erroneously thinks that at time −1 nature once and for all chose 𝑓 or 𝑔 and thereafter permanently
draws from that distribution.
Our second type of agent knows, correctly, that nature mixes 𝑓 and 𝑔 with mixing probability 𝛼 ∈ (0, 1) each period,
though the agent doesn’t know the mixing parameter.
Our first type of agent applies the learning algorithm described in this quantecon lecture.
In the context of the statistical model that prevailed in that lecture, that was a good learning algorithm and it enabled the
Bayesian learner eventually to learn the distribution that nature had drawn at time −1.
This is because the agent’s statistical model was correct in the sense of being aligned with the data generating process.
But in the present context, our type 1 decision maker’s model is incorrect because the model ℎ that actually generates the
data is neither 𝑓 nor 𝑔 and so is beyond the support of the models that the agent thinks are possible.
Nevertheless, we’ll see that our first type of agent muddles through and eventually learns something interesting and useful,
even though it is not true.
Instead, it turns out that our type 1 agent who is armed with a wrong statistical model ends up learning whichever probability
distribution, 𝑓 or 𝑔, is in a special sense closest to the ℎ that actually generates the data.
We’ll tell the sense in which it is closest.
Our second type of agent understands that nature mixes between 𝑓 and 𝑔 each period with a fixed mixing probability 𝛼.
But the agent doesn’t know 𝛼.
The agent sets out to learn 𝛼 using Bayes’ law applied to his model.
His model is correct in the sense that it includes the actual data generating process ℎ as a possible distribution.
In this lecture, we’ll learn about
• how nature can mix between two distributions 𝑓 and 𝑔 to create a new distribution ℎ
• the Kullback-Leibler statistical divergence https://en.wikipedia.org/wiki/Kullback–Leibler_divergence that governs statistical learning under an incorrect statistical model
• a useful Python function numpy.searchsorted that, in conjunction with a uniform random number generator, can be used to sample from an arbitrary distribution
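To make the notion of closeness concrete, the sketch below computes the Kullback-Leibler divergences of $f$ and $g$ from the mixture $h$ for a given mixing probability. It assumes `scipy.stats.beta` densities with the default parameters used in these lectures. The function name `kl_to_f_and_g` and the small truncation `eps` of the integration range are assumptions of this sketch.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

f = lambda w: beta.pdf(w, 1, 1)
g = lambda w: beta.pdf(w, 3, 1.2)

def kl_to_f_and_g(alpha, eps=1e-10):
    "Return KL(h, f) and KL(h, g) for the mixture h = alpha*f + (1-alpha)*g"
    h = lambda w: alpha * f(w) + (1 - alpha) * g(w)
    # integrate over [eps, 1 - eps] to stay away from the endpoints, where g vanishes
    KL_f = quad(lambda w: h(w) * np.log(h(w) / f(w)), eps, 1 - eps)[0]
    KL_g = quad(lambda w: h(w) * np.log(h(w) / g(w)), eps, 1 - eps)[0]
    return KL_f, KL_g

for alpha in (0.1, 0.5, 0.9):
    KL_f, KL_g = kl_to_f_and_g(alpha)
    print(f"alpha = {alpha}: KL(h, f) = {KL_f:.4f}, KL(h, g) = {KL_g:.4f}")
```

When the mixing probability is close to 1, $h$ is nearly $f$ and the divergence from $f$ is the smaller of the two; when it is close to 0, the divergence from $g$ is smaller.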
As usual, we’ll start by importing some Python tools.

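import matplotlib.pyplot as plt
import numpy as np
from numba import vectorize, njit
from math import gamma
import pandas as pd
import scipy.stats as sp

import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

@njit
def set_seed():
    # the body of this function was lost in extraction;
    # the seed value below is a placeholder
    np.random.seed(0)

set_seed()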
Let’s use Python to generate two beta distributions

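# Parameters in the two beta distributions.
F_a, F_b = 1, 1
G_a, G_b = 3, 1.2

@vectorize
def p(x, a, b):
    # normalizing constant of the beta density (gamma is math.gamma)
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x** (a-1) * (1 - x) ** (b-1)

# The two density functions.
f = njit(lambda x: p(x, F_a, F_b))
g = njit(lambda x: p(x, G_a, G_b))

# NOTE: parts of this listing were lost in extraction; the function name,
# signature and default arguments below are reconstructed
@njit
def simulate(a, b, T=50, N=500):
    '''
    Generate N sets of T observations of the likelihood ratio,
    return as N x T matrix.
    '''
    l_arr = np.empty((N, T))
    for i in range(N):
        for j in range(T):
            w = np.random.beta(a, b)
            l_arr[i, j] = f(w) / g(w)
    return l_arr

We’ll also use the following Python code to prepare some informative simulations

# likelihood ratios and their cumulative products (the value of N is an assumption)
l_arr_g = simulate(G_a, G_b, N=50000)
l_seq_g = np.cumprod(l_arr_g, axis=1)
l_arr_f = simulate(F_a, F_b, N=50000)
l_seq_f = np.cumprod(l_arr_f, axis=1)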

53.2 Sampling from Compound Lottery 𝐻
We implement two methods to draw samples from our mixture model 𝛼𝐹 + (1 − 𝛼)𝐺.
We’ll generate samples using each of them and verify that they match well.
Here is pseudo code for a direct “method 1” for drawing from our compound lottery:
• Step one:
– use the numpy.random.choice function to flip an unfair coin that selects distribution 𝐹 with prob 𝛼 and 𝐺
with prob 1 − 𝛼
• Step two:
– draw from either 𝐹 or 𝐺, as determined by the coin flip.
• Step three:
– put the first two steps in a big loop and do them for each realization of 𝑤
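The three steps above can be sketched in vectorized form, with the coin flips for all realizations drawn at once instead of inside an explicit loop. This is a minimal illustration and not the lecture's implementation: it uses NumPy's `Generator.choice` in place of `numpy.random.choice`, and the values of `alpha`, `N` and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)     # seed is arbitrary
F_a, F_b = 1, 1
G_a, G_b = 3, 1.2
alpha, N = 0.5, 100_000            # hypothetical mixing probability and sample size

# Step one: flip the unfair coin once per realization
from_F = rng.choice([True, False], size=N, p=[alpha, 1 - alpha])

# Step two: keep the draw from F or from G as dictated by each coin flip
draws = np.where(from_F,
                 rng.beta(F_a, F_b, size=N),
                 rng.beta(G_a, G_b, size=N))

# Step three is implicit: the arrays hold all N realizations
print(draws.mean())
```

The sample mean should be close to the mixture mean, namely `alpha` times the mean of $F$ plus `1 - alpha` times the mean of $G$.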
Our second method uses a uniform distribution and the following fact that we also described and used in the quantecon
lecture https://python.quantecon.org/prob_matrix.html:
• If a random variable $X$ has c.d.f. $F(x)$, then a random variable $F^{-1}(U)$ also has c.d.f. $F(x)$, where $U$ is a
uniform random variable on $[0, 1]$.
In other words, if $X \sim F(x)$ we can generate a random sample from $F$ by drawing a random sample from a uniform
distribution on $[0, 1]$ and computing $F^{-1}(U)$.
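A distribution with a closed-form inverse c.d.f. gives a simple illustration of this fact. The sketch below uses an exponential distribution, which is not part of this lecture's model and is chosen only because $F^{-1}$ is available analytically; the rate parameter and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)     # seed is arbitrary
rate = 2.0                         # hypothetical rate parameter

# F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate
U = rng.random(100_000)            # uniform draws on [0, 1)
X = -np.log(1 - U) / rate          # draws with c.d.f. F

print(X.mean())                    # should be close to 1 / rate
```

When no closed form for $F^{-1}$ is available, the inverse can be approximated by evaluating the c.d.f. on a grid and searching that grid for each uniform draw.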
We’ll use this fact in conjunction with the numpy.searchsorted command to sample from 𝐻 directly.
See https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html for the searchsorted function.
See the Mr. P Solver video on Monte Carlo simulation to see other applications of this powerful trick.
In the Python code below, we’ll use both of our methods and confirm that each of them does a good job of sampling from
our target mixture distribution.
@njit
def draw_lottery(p, N):
    "Draw from the compound lottery directly."
    draws = []
    for i in range(0, N):
        if np.random.rand()<=p:
            draws.append(np.random.beta(F_a, F_b))
        else:
            draws.append(np.random.beta(G_a, G_b))
    return np.array(draws)

def draw_lottery_MC(p, N):
    "Draw from the compound lottery using the Monte Carlo trick."
    xs = np.linspace(1e-8,1-(1e-8),10000)
    CDF = p*sp.beta.cdf(xs, F_a, F_b) + (1-p)*sp.beta.cdf(xs, G_a, G_b)
    Us = np.random.rand(N)
    draws = xs[np.searchsorted(CDF[:-1], Us)]
    return draws

# verify
N = 100000
α = 0.0
sample1 = draw_lottery(α, N)
sample2 = draw_lottery_MC(α, N)
# plot draws and density function

plt.hist(sample1, 50, density=True, alpha=0.5, label='direct draws')
plt.hist(sample2, 50, density=True, alpha=0.5, label='MC draws')
xs = np.linspace(0,1,1000)
plt.plot(xs, α*f(xs)+(1-α)*g(xs), color='red', label='density')
plt.legend()
plt.show()
# %%timeit # compare speed

# sample1 = draw_lottery(α, N=int(1e6))
# %%timeit
# sample2 = draw_lottery_MC(α, N=int(1e6))
Note: With numba acceleration the first method is only slightly slower than the second when we generate
1,000,000 samples.

53.3 Type 1 Agent
We’ll now study what our type 1 agent learns.

Remember that our type 1 agent uses the wrong statistical model, thinking that nature mixed between 𝑓 and 𝑔 once and
for all at time −1.
The type 1 agent thus uses the learning algorithm studied in this quantecon lecture.
We’ll briefly review that learning algorithm now.
Let 𝜋𝑡 be a Bayesian posterior defined as
𝜋𝑡 = Prob(𝑞 = 𝑓|𝑤𝑡 )
The likelihood ratio process plays a principal role in the formula that governs the evolution of the posterior probability
𝜋𝑡 , an instance of Bayes’ Law.
Bayes’ law implies that {𝜋𝑡 } obeys the recursion
$$
\pi_t = \frac{\pi_{t-1} l_t(w_t)}{\pi_{t-1} l_t(w_t) + 1 - \pi_{t-1}} \tag{53.1}
$$
with 𝜋0 being a Bayesian prior probability that 𝑞 = 𝑓, i.e., a personal or subjective belief about 𝑞 based on our having
seen no data.
Below we define a Python function that updates belief 𝜋 using likelihood ratio ℓ according to recursion (53.1)
@njit
def update(π, l):
    "Update π using likelihood l"
    # Update belief
    π = π * l / (π * l + 1 - π)
    return π
Formula (53.1) can be generalized by iterating on it and thereby deriving an expression for the time 𝑡 + 1 posterior 𝜋𝑡+1 as a
function of the time 0 prior 𝜋0 and the likelihood ratio process 𝐿(𝑤𝑡+1 ) at time 𝑡 + 1.
To begin, notice that the updating rule
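$$
\pi_{t+1} = \frac{\pi_t \ell(w_{t+1})}{\pi_t \ell(w_{t+1}) + (1 - \pi_t)}
$$
implies
$$
\frac{1}{\pi_{t+1}} = \frac{\pi_t \ell(w_{t+1}) + (1 - \pi_t)}{\pi_t \ell(w_{t+1})}
= 1 - \frac{1}{\ell(w_{t+1})} + \frac{1}{\ell(w_{t+1})} \frac{1}{\pi_t}
$$
$$
\Rightarrow \frac{1}{\pi_{t+1}} - 1 = \frac{1}{\ell(w_{t+1})} \left( \frac{1}{\pi_t} - 1 \right).
$$
Therefore
$$
\frac{1}{\pi_{t+1}} - 1 = \frac{1}{\prod_{i=1}^{t+1} \ell(w_i)} \left( \frac{1}{\pi_0} - 1 \right)
= \frac{1}{L(w^{t+1})} \left( \frac{1}{\pi_0} - 1 \right).
$$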

Since 𝜋0 ∈ (0, 1) and 𝐿 (𝑤𝑡+1 ) > 0, we can verify that 𝜋𝑡+1 ∈ (0, 1).
After rearranging the preceding equation, we can express 𝜋𝑡+1 as a function of 𝐿 (𝑤𝑡+1 ), the likelihood ratio process at
𝑡 + 1, and the initial prior 𝜋0
$$
\pi_{t+1} = \frac{\pi_0 L(w^{t+1})}{\pi_0 L(w^{t+1}) + 1 - \pi_0}. \tag{53.2}
$$
Formula (53.2) generalizes formula (53.1).
Formula (53.2) can be regarded as a one step revision of prior probability 𝜋0 after seeing the batch of data $\{w_i\}_{i=1}^{t+1}$.
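The equivalence between iterating the one-step recursion and applying the batch formula can be checked numerically. The sketch below is only an illustration: the likelihood ratios in `l_seq` are hypothetical positive numbers rather than ratios computed from 𝑓 and 𝑔, and `update` is re-declared without `@njit` so that the snippet runs on its own.

```python
import numpy as np

def update(π, l):
    "One-step Bayes update of π given likelihood ratio l, as in (53.1)."
    return π * l / (π * l + 1 - π)

rng = np.random.default_rng(1)
l_seq = np.exp(rng.normal(size=20))   # hypothetical positive likelihood ratios
π0 = 0.3

# Iterate the recursion one observation at a time
π_recursive = π0
for l in l_seq:
    π_recursive = update(π_recursive, l)

# Apply the batch formula (53.2) using the cumulative product
L = np.prod(l_seq)
π_batch = π0 * L / (π0 * L + 1 - π0)

print(π_recursive, π_batch)
```

The two numbers agree up to floating point error, which is why the code later in this lecture can work either with the recursion or with cumulative products of likelihood ratios.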
53.4 What a type 1 Agent Learns when Mixture 𝐻 Generates Data
We now study what happens when the mixture distribution ℎ(𝑤|𝛼) truly generates the data each period.
A submartingale or supermartingale continues to describe 𝜋𝑡.
It raises its ugly head and causes 𝜋𝑡 to converge either to 0 or to 1.
This is true even though in truth nature always mixes between 𝑓 and 𝑔.
After verifying that claim about possible limit points of 𝜋𝑡 sequences, we’ll drill down and study what fundamental force
determines the limiting value of 𝜋𝑡 .
Let’s set a value of 𝛼 and then watch how 𝜋𝑡 evolves.
def simulate_mixed(α, T=50, N=500):
    """
    Generate N simulated paths of length T of likelihood ratios
    as an N x T matrix, when the true density is the mixture h;α
    """
    w_s = draw_lottery(α, N*T).reshape(N, T)
    l_arr = f(w_s) / g(w_s)
    return l_arr
def plot_π_seq(α, π1=0.2, π2=0.8, T=200):
    """
    Compute and plot π_seq and the log likelihood ratio process
    when the mixed distribution governs the data.
    """
    l_arr_mixed = simulate_mixed(α, T=T, N=50)
    l_seq_mixed = np.cumprod(l_arr_mixed, axis=1)
    T = l_arr_mixed.shape[1]
    π_seq_mixed = np.empty((2, T+1))
    π_seq_mixed[:, 0] = π1, π2
    for t in range(T):
        for i in range(2):
            π_seq_mixed[i, t+1] = update(π_seq_mixed[i, t], l_arr_mixed[0, t])
    # plot
    fig, ax1 = plt.subplots()
    for i in range(2):
        ax1.plot(range(T+1), π_seq_mixed[i, :], label=f"$\pi_0$={π_seq_mixed[i, 0]}")
    ax1.plot(np.nan, np.nan, '--', color='b', label='Log likelihood ratio process')
    ax1.set_xlabel("t")
    ax1.legend()
    ax1.set_title("when $\\alpha F + (1-\\alpha) G$ governs data")
    ax2 = ax1.twinx()
    ax2.plot(range(1, T+1), np.log(l_seq_mixed[0, :]), '--', color='b')
    plt.show()
plot_π_seq(α = 0.6)
The above graph shows a sample path of the log likelihood ratio process as the blue dotted line, together with sample
paths of 𝜋𝑡 that start from two distinct initial conditions.
Let’s see what happens when we change 𝛼
plot_π_seq(α = 0.2)

Evidently, 𝛼 has a big effect on the destination of 𝜋𝑡 as 𝑡 → +∞.
53.5 Kullback-Leibler Divergence Governs Limit of 𝜋𝑡
To understand what determines whether the limit point of 𝜋𝑡 is 0 or 1 and how the answer depends on the true value of
the mixing probability 𝛼 ∈ (0, 1) that generates
ℎ(𝑤) ≡ ℎ(𝑤|𝛼) = 𝛼𝑓(𝑤) + (1 − 𝛼)𝑔(𝑤)
we shall compute the following two Kullback-Leibler divergences
$$
KL_g(\alpha) = \int \log\left(\frac{h(w)}{g(w)}\right) h(w) \, dw
$$
and
$$
KL_f(\alpha) = \int \log\left(\frac{h(w)}{f(w)}\right) h(w) \, dw
$$
We shall plot both of these functions against 𝛼 as we use 𝛼 to vary ℎ(𝑤) = ℎ(𝑤|𝛼).
The limit of 𝜋𝑡 is determined by
$$
\min_{f, g} \{KL_g, KL_f\}
$$
The only possible limits are 0 and 1.
As 𝑡 → +∞, 𝜋𝑡 goes to one if and only if 𝐾𝐿𝑓 < 𝐾𝐿𝑔.
@vectorize
def KL_g(α):
    "Compute the KL divergence of g from h, the integral of log(h/g) h."
    err = 1e-8 # to avoid 0 at end points
    ws = np.linspace(err, 1-err, 10000)
    gs, fs = g(ws), f(ws)
    hs = α*fs + (1-α)*gs
    return np.sum(np.log(hs/gs)*hs)/10000

@vectorize
def KL_f(α):
    "Compute the KL divergence of f from h, the integral of log(h/f) h."
    err = 1e-8 # to avoid 0 at end points
    ws = np.linspace(err, 1-err, 10000)
    gs, fs = g(ws), f(ws)
    hs = α*fs + (1-α)*gs
    return np.sum(np.log(hs/fs)*hs)/10000

# compute KL using quad in Scipy
def KL_g_quad(α):
    "Compute the KL divergence of g from h using scipy.integrate."
    h = lambda x: α*f(x) + (1-α)*g(x)
    return quad(lambda x: np.log(h(x)/g(x))*h(x), 0, 1)[0]

def KL_f_quad(α):
    "Compute the KL divergence of f from h using scipy.integrate."
    h = lambda x: α*f(x) + (1-α)*g(x)
    return quad(lambda x: np.log(h(x)/f(x))*h(x), 0, 1)[0]

# vectorize
KL_g_quad_v = np.vectorize(KL_g_quad)
KL_f_quad_v = np.vectorize(KL_f_quad)
# Let us find the limit point

def π_lim(α, T=5000, π_0=0.4):
    "Find limit of π sequence."
    π_seq = np.zeros(T+1)
    π_seq[0] = π_0
    l_arr = simulate_mixed(α, T, N=1)[0]
    for t in range(T):
        π_seq[t+1] = update(π_seq[t], l_arr[t])
    return π_seq[-1]

π_lim_v = np.vectorize(π_lim)
Let us first plot the KL divergences 𝐾𝐿𝑔 (𝛼) , 𝐾𝐿𝑓 (𝛼) for each 𝛼.
α_arr = np.linspace(0, 1, 100)

KL_g_arr = KL_g(α_arr)
KL_f_arr = KL_f(α_arr)
fig, ax = plt.subplots(1, figsize=[10, 6])
ax.plot(α_arr, KL_g_arr, label='KL(g, h)')

ax.plot(α_arr, KL_f_arr, label='KL(f, h)')
ax.set_ylabel('K-L divergence')
ax.set_xlabel(r'$\alpha$')
plt.show()

# # using Scipy to compute KL divergence
# α_arr = np.linspace(0, 1, 100)

# KL_g_arr = KL_g_quad_v(α_arr)
# KL_f_arr = KL_f_quad_v(α_arr)
# fig, ax = plt.subplots(1, figsize=[10, 6])
# ax.plot(α_arr, KL_g_arr, label='KL(g, h)')

# ax.plot(α_arr, KL_f_arr, label='KL(f, h)')
# ax.set_ylabel('K-L divergence')
# ax.legend(loc='upper right')
# plt.show()
Let’s compute an 𝛼 for which the KL divergence between ℎ and 𝑔 is the same as that between ℎ and 𝑓.
# where KL_f = KL_g

α_arr[np.argmin(np.abs(KL_g_arr-KL_f_arr))]
0.31313131313131315
We can compute and plot the convergence point 𝜋∞ for each 𝛼 to verify that the convergence is indeed governed by the
KL divergence.
The blue circles show the limiting values of 𝜋𝑡 that simulations discover for different values of 𝛼 recorded on the 𝑥 axis.
Thus, the graph below confirms how a minimum KL divergence governs what our type 1 agent eventually learns.
α_arr_x = α_arr[(α_arr<0.28)|(α_arr>0.38)]
π_lim_arr = π_lim_v(α_arr_x)

# plot
fig, ax = plt.subplots(1, figsize=[10, 6])
ax.plot(α_arr, KL_g_arr, label='KL(g, h)')

ax.plot(α_arr, KL_f_arr, label='KL(f, h)')
ax.set_ylabel('K-L divergence')
ax.set_xlabel(r'$\alpha$')
# plot KL
ax2 = ax.twinx()
# plot limit point
ax2.scatter(α_arr_x, π_lim_arr, facecolors='none', edgecolors='tab:blue', label='$\pi$ lim')
ax2.set_ylabel('π lim')
ax.legend(loc=[0.85, 0.8])
ax2.legend(loc=[0.85, 0.73])
plt.show()
Evidently, our type 1 learner who applies Bayes’ law to his misspecified set of statistical models eventually learns an
approximating model that is as close as possible to the true model, as measured by its Kullback-Leibler divergence.

53.6 Type 2 Agent
We now describe how our type 2 agent formulates his learning problem and what he eventually learns.
Our type 2 agent understands the correct statistical model but acknowledges that he does not know 𝛼.
We apply Bayes’ law to deduce an algorithm for learning 𝛼 under the assumption that the agent knows that
ℎ(𝑤) = ℎ(𝑤|𝛼)
but does not know 𝛼.

We’ll assume that the person starts out with a prior probability 𝜋0 (𝛼) on 𝛼 ∈ (0, 1) where the prior has one of the forms
that we deployed in this quantecon lecture.
We’ll fire up numpyro and apply it to the present situation.
Bayes’ law now takes the form
$$
\pi_{t+1}(\alpha) = \frac{h(w_{t+1} | \alpha) \pi_t(\alpha)}{\int h(w_{t+1} | \hat\alpha) \pi_t(\hat\alpha) \, d\hat\alpha}
$$
We’ll use numpyro to approximate this equation.
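Before turning to MCMC, it may help to see the same Bayes’ law approximated on a grid. The sketch below is not the numpyro approach used in this lecture: it swaps in a plain grid approximation with a uniform prior. The beta parameters, the sample size `T`, the value `α_true` and the grid itself are assumptions made only for this illustration.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# Assumed parameters, for illustration only
F_a, F_b = 1, 1
G_a, G_b = 3, 1.2
α_true = 0.8
T = 5000

# Simulate from the mixture h(w | α) = α f(w) + (1 - α) g(w)
from_f = rng.uniform(size=T) < α_true
w = np.where(from_f, rng.beta(F_a, F_b, size=T), rng.beta(G_a, G_b, size=T))

# Grid over α; a uniform prior makes the posterior proportional to the likelihood
α_grid = np.linspace(0.002, 0.998, 499)
f_w, g_w = beta.pdf(w, F_a, F_b), beta.pdf(w, G_a, G_b)

# Log likelihood of the whole sample at each grid point
log_lik = np.log(α_grid[:, None] * f_w + (1 - α_grid[:, None]) * g_w).sum(axis=1)

# Normalize in log space to avoid numerical underflow
weights = np.exp(log_lik - log_lik.max())
dα = α_grid[1] - α_grid[0]
posterior = weights / (weights.sum() * dα)
posterior_mean = (α_grid * posterior).sum() * dα

print(posterior_mean)
```

A grid is feasible here because 𝛼 is a scalar; MCMC becomes the more practical tool when the parameter vector has several dimensions.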

We’ll create graphs of the posterior 𝜋𝑡 (𝛼) as 𝑡 → +∞ corresponding to ones presented in the quantecon lecture
https://python.quantecon.org/bayes_nonconj.html.
We anticipate that a posterior distribution will collapse around the true 𝛼 as 𝑡 → +∞.
Let us try a uniform prior first.
We use the Mixture class in Numpyro to construct the likelihood function.
α = 0.8
# simulate data with true α

data = draw_lottery(α, 1000)
sizes = [5, 20, 50, 200, 1000, 25000]
def model(w):
    α = numpyro.sample('α', dist.Uniform(low=0.0, high=1.0))
    y_samp = numpyro.sample('w',
        dist.Mixture(dist.Categorical(jnp.array([α, 1-α])),
                     [dist.Beta(F_a, F_b), dist.Beta(G_a, G_b)]), obs=w)

def MCMC_run(ws):
    "Compute posterior using MCMC with observed ws"
    kernal = NUTS(model)
    mcmc = MCMC(kernal, num_samples=5000, num_warmup=1000, progress_bar=False)
    mcmc.run(rng_key=random.PRNGKey(142857), w=jnp.array(ws))
    sample = mcmc.get_samples()
    return sample['α']
The following code generates the graph below that displays Bayesian posteriors for 𝛼 at various history lengths.
fig, ax = plt.subplots(figsize=(10, 6))
for i in range(len(sizes)):
    sample = MCMC_run(data[:sizes[i]])
    sns.histplot(
        data=sample, kde=True, stat='density', alpha=0.2, ax=ax,
        color=colors[i], binwidth=0.02, linewidth=0.05, label=f't={sizes[i]}'
    )
ax.set_title('$\pi_t(\\alpha)$ as $t$ increases')
ax.legend()
ax.set_xlabel('$\\alpha$')
plt.show()

Evidently, the Bayesian posterior narrows in on the true value 𝛼 = .8 of the mixing parameter as the length of a history
of observations grows.
Our type 1 person deploys an incorrect statistical model.

He believes that either 𝑓 or 𝑔 generated the 𝑤 process, but just doesn’t know which one.
That is wrong because nature is actually mixing each period with mixing probability 𝛼.
Our type 1 agent eventually believes that either 𝑓 or 𝑔 generated the 𝑤 sequence, the outcome being determined by the
model, either 𝑓 or 𝑔, whose KL divergence relative to ℎ is smaller.
Our type 2 agent has a different statistical model, one that is correctly specified.
He knows the parametric form of the statistical model but not the mixing parameter 𝛼.
He knows that he does not know it.
But by using Bayes’ law in conjunction with his statistical model and a history of data, he eventually acquires a more and
more accurate inference about 𝛼.
This little laboratory exhibits some important general principles that govern outcomes of Bayesian learning of misspecified
models.
Thus, the following situation prevails quite generally in empirical work.
A scientist approaches the data with a manifold 𝑆 of statistical models 𝑠(𝑋|𝜃) , where 𝑠 is a probability distribution over
a random vector 𝑋, 𝜃 ∈ Θ is a vector of parameters, and Θ indexes the manifold of models.
The scientist, with observations that he interprets as realizations 𝑥 of the random vector 𝑋, wants to solve an inverse
problem of somehow inverting 𝑠(𝑥|𝜃) to infer 𝜃 from 𝑥.

But the scientist’s model is misspecified, being only an approximation to an unknown model ℎ that nature uses to generate
𝑋.
If the scientist uses Bayes’ law or a related likelihood-based method to infer 𝜃, it occurs quite generally that for large
sample sizes the inverse problem infers a 𝜃 that minimizes the KL divergence of the scientist’s model 𝑠 relative to nature’s
model ℎ.

CHAPTER
FIFTYFOUR
BAYESIAN VERSUS FREQUENTIST DECISION RULES
Contents
• Bayesian versus Frequentist Decision Rules

– Overview
– Setup
– Frequentist Decision Rule
– Bayesian Decision Rule
– Was the Navy Captain’s Hunch Correct?
– More Details
– Distribution of Bayesian Decision Rule’s Time to Decide
– Probability of Making Correct Decision
– Distribution of Likelihood Ratios at Frequentist’s 𝑡

import matplotlib.pyplot as plt
import numpy as np
from math import gamma
from numba import njit, prange, float64, int64
from numba.experimental import jitclass
from scipy.optimize import minimize
54.1 Overview
This lecture follows up on ideas presented in the following lectures:

• A Problem that Stumped Milton Friedman
• Exchangeability and Bayesian Updating
• Likelihood Ratio Processes
In A Problem that Stumped Milton Friedman we described a problem that a Navy Captain presented to Milton Friedman
during World War II.
The Navy had instructed the Captain to use a decision rule for quality control that the Captain suspected could be
dominated by a better rule.
(The Navy had ordered the Captain to use an instance of a frequentist decision rule.)
Milton Friedman recognized the Captain’s conjecture as posing a challenging statistical problem that he and other
members of the US Government’s Statistical Research Group at Columbia University proceeded to try to solve.
One of the members of the group, the great mathematician Abraham Wald, soon solved the problem.
A good way to formulate the problem is to use some ideas from Bayesian statistics that we describe in this lecture
Exchangeability and Bayesian Updating and in this lecture Likelihood Ratio Processes, which describes the link between
Bayesian updating and likelihood ratio processes.
The present lecture uses Python to generate simulations that evaluate expected losses under frequentist and Bayesian
decision rules for an instance of the Navy Captain’s decision problem.
The simulations validate the Navy Captain’s hunch that there is a better rule than the one the Navy had ordered him to
use.
54.2 Setup
To formalize the problem of the Navy Captain whose questions posed the problem that Milton Friedman and Allan Wallis
handed over to Abraham Wald, we consider a setting with the following parts.
• Each period a decision maker draws a non-negative random variable 𝑍 from a probability distribution that he does
not completely understand. He knows that two probability distributions are possible, 𝑓0 and 𝑓1 , and that whichever
distribution it is remains fixed over time. The decision maker believes that before the beginning of time, nature
once and for all selected either 𝑓0 or 𝑓1 and that the probability that it selected 𝑓0 is 𝜋∗ .
• The decision maker observes a sample $\{z_i\}_{i=0}^t$ from the distribution chosen by nature.
The decision maker wants to decide which distribution actually governs 𝑍 and is worried by two types of errors and the
losses that they impose on him.
• a loss 𝐿̄ 1 from a type I error that occurs when he decides that 𝑓 = 𝑓1 when actually 𝑓 = 𝑓0
• a loss 𝐿̄ 0 from a type II error that occurs when he decides that 𝑓 = 𝑓0 when actually 𝑓 = 𝑓1
The decision maker pays a cost 𝑐 for drawing another 𝑧.
We mainly borrow parameters from the quantecon lecture A Problem that Stumped Milton Friedman except that we
increase both 𝐿̄ 0 and 𝐿̄ 1 from 25 to 100 to encourage the frequentist Navy Captain to take more draws before deciding.
We set the cost 𝑐 of taking one more draw at 1.25.
We set the probability distributions 𝑓0 and 𝑓1 to be beta distributions with 𝑎0 = 𝑏0 = 1, 𝑎1 = 3, and 𝑏1 = 1.2,
respectively.
Below is some Python code that sets up these objects.
@njit
def p(x, a, b):
    "Beta distribution."
    # normalizing constant of the beta density
    r = gamma(a + b) / (gamma(a) * gamma(b))
    return r * x**(a-1) * (1 - x)**(b-1)
We start by defining a jitclass that stores parameters and functions we need to solve problems for both the Bayesian
and frequentist Navy Captains.

wf_data = [
    ('c', float64),           # cost of drawing another observation
    ('a0', float64),          # parameters of beta distribution
    ('b0', float64),
    ('a1', float64),
    ('b1', float64),
    ('L0', float64),          # cost of selecting f0 when f1 is true
    ('L1', float64),          # cost of selecting f1 when f0 is true
    ('π_grid', float64[:]),   # grid of beliefs π
    ('π_grid_size', int64),
    ('mc_size', int64),       # size of Monte Carlo simulation
    ('z0', float64[:]),       # sequence of random values
    ('z1', float64[:])        # sequence of random values
]
@jitclass(wf_data)
class WaldFriedman:

    def __init__(self,
                 c=1.25,
                 a0=1,
                 b0=1,
                 a1=3,
                 b1=1.2,
                 L0=100,
                 L1=100,
                 π_grid_size=200,
                 mc_size=1000):
        self.c, self.π_grid_size = c, π_grid_size
        self.a0, self.b0, self.a1, self.b1 = a0, b0, a1, b1
        self.L0, self.L1 = L0, L1
        self.π_grid = np.linspace(0, 1, π_grid_size)
        self.mc_size = mc_size
        # Monte Carlo draws from each distribution
        self.z0 = np.random.beta(a0, b0, mc_size)
        self.z1 = np.random.beta(a1, b1, mc_size)

    def f0(self, x):
        return p(x, self.a0, self.b0)

    def f1(self, x):
        return p(x, self.a1, self.b1)

    def κ(self, z, π):
        """
        Updates π using Bayes' rule and the current observation z
        """
        a0, b0, a1, b1 = self.a0, self.b0, self.a1, self.b1
        π_f0, π_f1 = π * p(z, a0, b0), (1 - π) * p(z, a1, b1)
        π_new = π_f0 / (π_f0 + π_f1)
        return π_new

wf = WaldFriedman()
grid = np.linspace(0, 1, 50)
plt.figure()
plt.title("Two Distributions")
plt.plot(grid, wf.f0(grid), lw=2, label="$f_0$")
plt.plot(grid, wf.f1(grid), lw=2, label="$f_1$")
plt.legend()
plt.xlabel("$z$ values")
plt.ylabel("density of $z_k$")
plt.tight_layout()
plt.show()
Above, we plot the two possible probability densities 𝑓0 and 𝑓1.

54.3 Frequentist Decision Rule
The Navy told the Captain to use a frequentist decision rule.

In particular, it gave him a decision rule that the Navy had designed by using frequentist statistical theory to minimize an
expected loss function.
That decision rule is characterized by a sample size 𝑡 and a cutoff 𝑑 associated with a likelihood ratio.
Let $L(z^t) = \prod_{i=0}^t \frac{f_0(z_i)}{f_1(z_i)}$ be the likelihood ratio associated with observing the sequence $\{z_i\}_{i=0}^t$.
The decision rule associated with a sample size 𝑡 is:
• decide that 𝑓0 is the distribution if the likelihood ratio is greater than 𝑑
• decide that 𝑓1 is the distribution if the likelihood ratio is less than 𝑑
To understand how that rule was engineered, let null and alternative hypotheses be
• null: 𝐻0 : 𝑓 = 𝑓0 ,
• alternative 𝐻1 : 𝑓 = 𝑓1 .
Given sample size 𝑡 and cutoff 𝑑, under the model described above, the mathematical expectation of total loss is
$$
\bar V_{fre}(t, d) = ct + \pi^* PFA \times \bar L_1 + (1 - \pi^*) (1 - PD) \times \bar L_0 \tag{54.1}
$$
where
$$
PFA = \Pr\{L(z^t) < d \mid q = f_0\}, \qquad PD = \Pr\{L(z^t) < d \mid q = f_1\}
$$
Here
• 𝑃 𝐹 𝐴 denotes the probability of a false alarm, i.e., rejecting 𝐻0 when it is true
• 𝑃 𝐷 denotes the probability of detection, i.e., rejecting 𝐻0 when 𝐻1 is true, so that 1 − 𝑃 𝐷 is the probability of a detection error
For a given sample size 𝑡, the pairs (𝑃 𝐹 𝐴, 𝑃 𝐷) lie on a receiver operating characteristic curve and can be uniquely
pinned down by choosing 𝑑.
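To make these objects concrete, here is a minimal self-contained sketch that estimates 𝑃 𝐹 𝐴 and 𝑃 𝐷 by simulation for one (𝑡, 𝑑) pair and plugs them into (54.1). The values of `N`, `t` and `d` are illustrative assumptions, the sketch uses `scipy.stats.beta` rather than the jitclass defined above, and it uses 𝑡 draws per path.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

a0, b0, a1, b1 = 1, 1, 3, 1.2     # beta parameters from the Setup section
N, t, d = 10_000, 5, 1.0          # N, t and d are illustrative choices
π_star, c, L0_bar, L1_bar = 0.5, 1.25, 100, 100

def likelihood_ratio(z):
    "Likelihood ratio: product over each row of z of f0(z_i) / f1(z_i)."
    return np.prod(beta.pdf(z, a0, b0) / beta.pdf(z, a1, b1), axis=1)

L_under_f0 = likelihood_ratio(rng.beta(a0, b0, size=(N, t)))
L_under_f1 = likelihood_ratio(rng.beta(a1, b1, size=(N, t)))

PFA = np.mean(L_under_f0 < d)     # reject H0 although f0 generated the data
PD = np.mean(L_under_f1 < d)      # reject H0 when f1 generated the data

# Expected total loss from equation (54.1)
V_fre = c * t + π_star * PFA * L1_bar + (1 - π_star) * (1 - PD) * L0_bar

print(PFA, PD, V_fre)
```

Sweeping `d` over a range of values while holding `t` fixed traces out one receiver operating characteristic curve.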
To see some receiver operating characteristic curves, please see this lecture Likelihood Ratio Processes.
To solve for $\bar V_{fre}(t, d)$ numerically, we first simulate sequences of 𝑧 when either 𝑓0 or 𝑓1 generates data.
N = 10000
T = 100
z0_arr = np.random.beta(wf.a0, wf.b0, (N, T))

z1_arr = np.random.beta(wf.a1, wf.b1, (N, T))
plt.hist(z0_arr.flatten(), bins=50, alpha=0.4, label='f0')

plt.hist(z1_arr.flatten(), bins=50, alpha=0.4, label='f1')
plt.legend()
plt.show()

We can compute sequences of likelihood ratios using simulated samples.
l = lambda z: wf.f0(z) / wf.f1(z)
l0_arr = l(z0_arr)
l1_arr = l(z1_arr)
L0_arr = np.cumprod(l0_arr, 1)
L1_arr = np.cumprod(l1_arr, 1)
With an empirical distribution of likelihood ratios in hand, we can draw receiver operating characteristic curves by
enumerating (𝑃 𝐹 𝐴, 𝑃 𝐷) pairs given each sample size 𝑡.
PFA = np.arange(0, 100, 1)
for t in range(1, 15, 4):
    percentile = np.percentile(L0_arr[:, t], PFA)
    PD = [np.sum(L1_arr[:, t] < p) / N for p in percentile]
    plt.plot(PFA / 100, PD, label=f"t={t}")
plt.scatter(0, 1, label="perfect detection")
plt.plot([0, 1], [0, 1], color='k', ls='--', label="random detection")
plt.arrow(0.5, 0.5, -0.15, 0.15, head_width=0.03)
plt.text(0.35, 0.7, "better")
plt.xlabel("Probability of false alarm")
plt.ylabel("Probability of detection")
plt.legend()
plt.title("Receiver Operating Characteristic Curve")
plt.show()
Our frequentist minimizes the expected total loss presented in equation (54.1) by choosing (𝑡, 𝑑).
Doing that delivers an expected loss
$$
\bar V_{fre} = \min_{t, d} \bar V_{fre}(t, d).
$$
We first consider the case in which 𝜋∗ = Pr {nature selects 𝑓0 } = 0.5.

We can solve the minimization problem in two steps.
First, we fix 𝑡 and find the optimal cutoff 𝑑 and consequently the minimal $\bar V_{fre}(t)$.
Here is Python code that does that and then plots a useful graph.
@njit
def V_fre_d_t(d, t, L0_arr, L1_arr, π_star, wf):
    N = L0_arr.shape[0]
    PFA = np.sum(L0_arr[:, t-1] < d) / N
    PD = np.sum(L1_arr[:, t-1] < d) / N
    V = π_star * PFA * wf.L1 + (1 - π_star) * (1 - PD) * wf.L0
    return V

def V_fre_t(t, L0_arr, L1_arr, π_star, wf):
    res = minimize(V_fre_d_t, 1, args=(t, L0_arr, L1_arr, π_star, wf),
                   method='Nelder-Mead')
    V = res.fun
    d = res.x
    PFA = np.sum(L0_arr[:, t-1] < d) / N
    PD = np.sum(L1_arr[:, t-1] < d) / N
    return V, PFA, PD

def compute_V_fre(L0_arr, L1_arr, π_star, wf):
    T = L0_arr.shape[1]
    V_fre_arr = np.empty(T)
    PFA_arr = np.empty(T)
    PD_arr = np.empty(T)
    # loop over candidate sample sizes t = 1, ..., T
    for t in range(1, T+1):
        V, PFA, PD = V_fre_t(t, L0_arr, L1_arr, π_star, wf)
        V_fre_arr[t-1] = wf.c * t + V
        PFA_arr[t-1] = PFA
        PD_arr[t-1] = PD
    return V_fre_arr, PFA_arr, PD_arr
π_star = 0.5
V_fre_arr, PFA_arr, PD_arr = compute_V_fre(L0_arr, L1_arr, π_star, wf)
plt.plot(range(T), V_fre_arr, label='$\min_{d} \overline{V}_{fre}(t,d)$')

plt.xlabel('t')
plt.title('$\pi^*=0.5$')
plt.legend()
plt.show()

t_optimal = np.argmin(V_fre_arr) + 1
msg = f"The above graph indicates that minimizing over t tells the frequentist to draw {t_optimal} observations and then decide."
print(msg)
The above graph indicates that minimizing over t tells the frequentist to draw 9 observations and then decide.
Let’s now change the value of 𝜋∗ and watch how the decision rule changes.
n_π = 20
π_star_arr = np.linspace(0.1, 0.9, n_π)
V_fre_bar_arr = np.empty(n_π)
t_optimal_arr = np.empty(n_π)
PFA_optimal_arr = np.empty(n_π)
PD_optimal_arr = np.empty(n_π)
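for i, π_star in enumerate(π_star_arr):
    # recompute the frequentist loss for this value of π_star
    V_fre_arr, PFA_arr, PD_arr = compute_V_fre(L0_arr, L1_arr, π_star, wf)
    t_idx = np.argmin(V_fre_arr)
    V_fre_bar_arr[i] = V_fre_arr[t_idx]
    t_optimal_arr[i] = t_idx + 1
    PFA_optimal_arr[i] = PFA_arr[t_idx]
    PD_optimal_arr[i] = PD_arr[t_idx]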
plt.plot(π_star_arr, V_fre_bar_arr)
plt.xlabel('$\pi^*$')
plt.title('$\overline{V}_{fre}$')
plt.show()
The following shows how optimal sample size 𝑡 and targeted (𝑃 𝐹 𝐴, 𝑃 𝐷) change as 𝜋∗ varies.
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
axs[0].plot(π_star_arr, t_optimal_arr)
axs[0].set_xlabel('$\pi^*$')
axs[0].set_title('optimal sample size given $\pi^*$')
axs[1].plot(π_star_arr, PFA_optimal_arr, label='$PFA^*(\pi^*)$')
axs[1].plot(π_star_arr, PD_optimal_arr, label='$PD^*(\pi^*)$')
axs[1].legend()
axs[1].set_title('optimal PFA and PD given $\pi^*$')
plt.show()

54.4 Bayesian Decision Rule
In A Problem that Stumped Milton Friedman, we learned how Abraham Wald confirmed the Navy Captain’s hunch that
there is a better decision rule.
We presented a Bayesian procedure that instructed the Captain to make decisions by comparing his current Bayesian
posterior probability 𝜋 with two cutoff probabilities called 𝛼 and 𝛽.
To proceed, we borrow some Python code from the quantecon lecture A Problem that Stumped Milton Friedman that
computes 𝛼 and 𝛽.
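@njit(parallel=True)
def Q(h, wf):
    c, π_grid = wf.c, wf.π_grid
    L0, L1 = wf.L0, wf.L1
    z0, z1 = wf.z0, wf.z1
    mc_size = wf.mc_size
    κ = wf.κ
    h_new = np.empty_like(π_grid)
    h_func = lambda p: np.interp(p, π_grid, h)
    for i in prange(len(π_grid)):
        π = π_grid[i]
        # Find the expected value of J by integrating over z
        integral_f0, integral_f1 = 0, 0
        for m in range(mc_size):
            π_0 = κ(z0[m], π)  # draw z from f0 and update π
            integral_f0 += min((1 - π_0) * L0, π_0 * L1, h_func(π_0))
            π_1 = κ(z1[m], π)  # draw z from f1 and update π
            integral_f1 += min((1 - π_1) * L0, π_1 * L1, h_func(π_1))
        integral = (π * integral_f0 + (1 - π) * integral_f1) / mc_size
        h_new[i] = c + integral
    return h_new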
@njit
def solve_model(wf, tol=1e-4, max_iter=1000):
    """
    Compute the continuation value function
    * wf is an instance of WaldFriedman
    """
    # Set up loop
    h = np.zeros(len(wf.π_grid))
    i = 0
    error = tol + 1
    while i < max_iter and error > tol:
        h_new = Q(h, wf)
        error = np.max(np.abs(h - h_new))
        i += 1
        h = h_new
    if error > tol:
        print("Failed to converge!")
    return h_new
h_star = solve_model(wf)
@njit
def find_cutoff_rule(wf, h):
    """
    This function takes a continuation value function and returns the
    corresponding cutoffs of where you transition between continuing and
    choosing a specific model
    """
    π_grid = wf.π_grid
    L0, L1 = wf.L0, wf.L1
    # Evaluate cost at all points on grid for choosing a model
    payoff_f0 = (1 - π_grid) * L0
    payoff_f1 = π_grid * L1
    # The cutoff points can be found by differencing these costs with
    # The Bellman equation (J is always less than or equal to p_c_i)
    β = π_grid[np.searchsorted(
                  payoff_f1 - np.minimum(h, payoff_f0),
                  1e-10)
               - 1]
    α = π_grid[np.searchsorted(
                  np.minimum(h, payoff_f1) - payoff_f0,
                  1e-10)
               - 1]
    return (β, α)

β, α = find_cutoff_rule(wf, h_star)
cost_L0 = (1 - wf.π_grid) * wf.L0
cost_L1 = wf.π_grid * wf.L1
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(wf.π_grid, h_star, label='continuation value')
ax.plot(wf.π_grid,
        np.amin(np.column_stack([h_star, cost_L0, cost_L1]), axis=1),
        lw=15, alpha=0.1, color='b', label='minimum cost')
ax.annotate(r"$\beta$", xy=(β + 0.01, 0.5), fontsize=14)
ax.annotate(r"$\alpha$", xy=(α + 0.01, 0.5), fontsize=14)
plt.vlines(β, 0, β * wf.L0, linestyle="--")
plt.vlines(α, 0, (1 - α) * wf.L1, linestyle="--")
ax.set(xlim=(0, 1), ylim=(0, 0.5 * max(wf.L0, wf.L1)), ylabel="cost",
       xlabel="$\pi$", title="Value function")
plt.legend(borderpad=1.1)
plt.show()

The above figure portrays the value function plotted against the decision maker’s Bayesian posterior.
It also shows the probabilities 𝛼 and 𝛽.
The Bayesian decision rule is:
• accept 𝐻0 if 𝜋 ≥ 𝛼
• accept 𝐻1 if 𝜋 < 𝛽
• delay deciding and draw another 𝑧 if 𝛽 ≤ 𝜋 < 𝛼
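The rule can be sketched as a simple loop that keeps drawing until the posterior leaves the continuation region. This is only an illustration: the cutoffs `α_cut` and `β_cut` below are hypothetical round numbers, not the values computed by `find_cutoff_rule`, and the helper `run_sequential_rule` is a name introduced here.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

a0, b0, a1, b1 = 1, 1, 3, 1.2    # beta parameters from the Setup section
α_cut, β_cut = 0.8, 0.2          # hypothetical cutoffs, NOT the computed α and β

def run_sequential_rule(π0, draw, max_draws=1000):
    "Update π after each draw until a cutoff is crossed; return (decision, draws used)."
    π = π0
    for n in range(1, max_draws + 1):
        z = draw()
        num = π * beta.pdf(z, a0, b0)
        π = num / (num + (1 - π) * beta.pdf(z, a1, b1))   # Bayes' law
        if π >= α_cut:
            return 0, n          # accept H0: f = f0
        if π < β_cut:
            return 1, n          # accept H1: f = f1
    return -1, max_draws         # undecided within max_draws

# Simulate the rule many times when f0 actually generates the data
results = [run_sequential_rule(0.5, lambda: rng.beta(a0, b0)) for _ in range(500)]
decisions = np.array([r[0] for r in results])
times = np.array([r[1] for r in results])
share_correct = np.mean(decisions == 0)

print(share_correct, times.mean())
```

Unlike the frequentist rule, the number of draws here is a random variable that depends on the data.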
We can calculate two “objective” loss functions under this situation conditioning on knowing for sure that nature has
selected 𝑓0 , in the first case, or 𝑓1 , in the second case.
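1. under 𝑓0 ,
$$
V^0(\pi) = \begin{cases}
0 & \text{if } \alpha \leq \pi, \\
c + E V^0(\pi') & \text{if } \beta \leq \pi < \alpha, \\
\bar L_1 & \text{if } \pi < \beta.
\end{cases}
$$
2. under 𝑓1
$$
V^1(\pi) = \begin{cases}
\bar L_0 & \text{if } \alpha \leq \pi, \\
c + E V^1(\pi') & \text{if } \beta \leq \pi < \alpha, \\
0 & \text{if } \pi < \beta.
\end{cases}
$$
where $\pi' = \frac{\pi f_0(z')}{\pi f_0(z') + (1 - \pi) f_1(z')}$.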
Given a prior probability 𝜋0 , the expected loss for the Bayesian is
$$
\bar V_{Bayes}(\pi_0) = \pi^* V^0(\pi_0) + (1 - \pi^*) V^1(\pi_0).
$$
Below we write some Python code that computes 𝑉 0 (𝜋) and 𝑉 1 (𝜋) numerically.
@njit(parallel=True)
def V_q(wf, flag):
    V = np.zeros(wf.π_grid_size)
    if flag == 0:
        z_arr = wf.z0
        V[wf.π_grid < β] = wf.L1
    else:
        z_arr = wf.z1
        V[wf.π_grid >= α] = wf.L0
    V_old = np.empty_like(V)
    while True:
        V_old[:] = V[:]
        V[(β <= wf.π_grid) & (wf.π_grid < α)] = 0
        for i in prange(len(wf.π_grid)):
            π = wf.π_grid[i]
            if π >= α or π < β:
                continue
            for j in prange(len(z_arr)):
                π_next = wf.κ(z_arr[j], π)
                V[i] += wf.c + np.interp(π_next, wf.π_grid, V_old)
            V[i] /= wf.mc_size
        if np.abs(V - V_old).max() < 1e-5:
            break
    return V
V0 = V_q(wf, 0)
V1 = V_q(wf, 1)
plt.plot(wf.π_grid, V0, label='$V^0$')

plt.plot(wf.π_grid, V1, label='$V^1$')
plt.vlines(β, 0, wf.L0, linestyle='--')
plt.text(β+0.01, wf.L0/2, 'β')
plt.vlines(α, 0, wf.L0, linestyle='--')
plt.text(α+0.01, wf.L0/2, 'α')
plt.xlabel('$\pi$')
plt.title('Objective value function $V(\pi)$')
plt.legend()
plt.show()
Given an assumed value for 𝜋∗ = Pr {nature selects 𝑓0 }, we can then compute $\bar V_{Bayes}(\pi_0)$.

We can then determine an initial Bayesian prior 𝜋0∗ that minimizes this objective concept of expected loss.
The figure below plots four cases corresponding to 𝜋∗ = 0.25, 0.3, 0.5, 0.7.
We observe that in each case 𝜋0∗ equals 𝜋∗ .
def compute_V_baye_bar(π_star, V0, V1, wf):
    V_baye = π_star * V0 + (1 - π_star) * V1
    π_idx = np.argmin(V_baye)
    π_optimal = wf.π_grid[π_idx]
    V_baye_bar = V_baye[π_idx]
    return V_baye, π_optimal, V_baye_bar

π_star_arr = [0.25, 0.3, 0.5, 0.7]
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
for i, π_star in enumerate(π_star_arr):
    row_i = i // 2
    col_i = i % 2
    V_baye, π_optimal, V_baye_bar = compute_V_baye_bar(π_star, V0, V1, wf)
    axs[row_i, col_i].plot(wf.π_grid, V_baye)
    axs[row_i, col_i].hlines(V_baye_bar, 0, 1, linestyle='--')
    axs[row_i, col_i].vlines(π_optimal, V_baye_bar, V_baye.max(), linestyle='--')
    axs[row_i, col_i].text(π_optimal+0.05, (V_baye_bar + V_baye.max()) / 2,
                           '${\pi_0^*}=$'+f'{π_optimal:0.2f}')
    axs[row_i, col_i].set_xlabel('$\pi$')
    axs[row_i, col_i].set_ylabel('$\overline{V}_{baye}(\pi)$')
    axs[row_i, col_i].set_title('$\pi^*=$' + f'{π_star}')
fig.suptitle('$\overline{V}_{baye}(\pi)=\pi^*V^0(\pi) + (1-\pi^*)V^1(\pi)$', fontsize=16)
plt.show()

This pattern of outcomes holds more generally.

Thus, the following Python code generates the associated graph that verifies the equality of 𝜋0∗ to 𝜋∗ holds for all 𝜋∗ .
π_star_arr = np.linspace(0.1, 0.9, n_π)
V_baye_bar_arr = np.empty_like(π_star_arr)
π_optimal_arr = np.empty_like(π_star_arr)
for i, π_star in enumerate(π_star_arr):
    V_baye, π_optimal, V_baye_bar = compute_V_baye_bar(π_star, V0, V1, wf)
    V_baye_bar_arr[i] = V_baye_bar
    π_optimal_arr[i] = π_optimal
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
axs[0].plot(π_star_arr, V_baye_bar_arr)
axs[0].set_title('$\overline{V}_{baye}$')
axs[1].plot(π_star_arr, π_optimal_arr, label='optimal prior')
axs[1].plot([π_star_arr.min(), π_star_arr.max()],
            [π_star_arr.min(), π_star_arr.max()],
            c='k', linestyle='--', label='45 degree line')
axs[1].set_title('optimal prior given $\pi^*$')
axs[1].legend()
plt.show()
54.5 Was the Navy Captain’s Hunch Correct?
We now compare average (i.e., frequentist) losses obtained by the frequentist and Bayesian decision rules.
As a starting point, let’s compare average loss functions when 𝜋∗ = 0.5.
π_star = 0.5
# frequentist
V_fre_arr, PFA_arr, PD_arr = compute_V_fre(L0_arr, L1_arr, π_star, wf)
# bayesian
V_baye = π_star * V0 + (1 - π_star) * V1
V_baye_bar = V_baye.min()
plt.plot(range(T), V_fre_arr, label='$\min_{d} \overline{V}_{fre}(t,d)$')
plt.plot([0, T], [V_baye_bar, V_baye_bar], label='$\overline{V}_{baye}$')
plt.xlabel('t')
plt.title('$\pi^*=0.5$')
plt.legend()
plt.show()

Evidently, there is no sample size 𝑡 at which the frequentist decision rule attains a lower loss function than does the
Bayesian rule.
Furthermore, the following graph indicates that the Bayesian decision rule does better on average for all values of 𝜋∗ .
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
axs[0].plot(π_star_arr, V_fre_bar_arr, label='$\overline{V}_{fre}$')
axs[0].plot(π_star_arr, V_baye_bar_arr, label='$\overline{V}_{baye}$')
axs[0].legend()
axs[1].plot(π_star_arr, V_fre_bar_arr - V_baye_bar_arr, label='$diff$')
axs[1].legend()
plt.show()
The right panel of the above graph plots the difference $\bar V_{fre} - \bar V_{Bayes}$.
It is always positive.
54.6 More Details
We can provide more insights by focusing on the case in which 𝜋∗ = 0.5 = 𝜋0 .
π_star = 0.5
Recall that when 𝜋∗ = 0.5, the frequentist decision rule sets a sample size t_optimal ex ante.
For our parameter settings, we can compute its value:
t_optimal
For convenience, let’s define t_idx as the Python array index corresponding to t_optimal sample size.
t_idx = t_optimal - 1
54.7 Distribution of Bayesian Decision Rule’s Time to Decide
By using simulations, we compute the frequency distribution of time to deciding for the Bayesian decision rule and
compare that time to the frequentist rule’s fixed 𝑡.
The following Python code creates a graph that shows the frequency distribution of the Bayesian decision maker’s times
to decide, conditional on distribution 𝑞 = 𝑓0 or 𝑞 = 𝑓1 generating the data.
The blue and red dotted lines show averages for the Bayesian decision rule, while the black dotted line shows the frequentist
optimal sample size 𝑡.
On average the Bayesian rule decides earlier than the frequentist rule when 𝑞 = 𝑓0 and later when 𝑞 = 𝑓1 .

@njit(parallel=True)
def check_results(L_arr, α, β, flag, π0):
    N, T = L_arr.shape
    time_arr = np.empty(N)
    correctness = np.empty(N)
    π_arr = π0 * L_arr / (π0 * L_arr + 1 - π0)
    for i in prange(N):
        for t in range(T):
            if (π_arr[i, t] < β) or (π_arr[i, t] > α):
                time_arr[i] = t + 1
                correctness[i] = (flag == 0 and π_arr[i, t] > α) or (flag == 1 and π_arr[i, t] < β)
                break
    return time_arr, correctness
time_arr0, correctness0 = check_results(L0_arr, α, β, 0, π_star)

time_arr1, correctness1 = check_results(L1_arr, α, β, 1, π_star)
# unconditional distribution
time_arr_u = np.concatenate((time_arr0, time_arr1))
correctness_u = np.concatenate((correctness0, correctness1))
n1 = plt.hist(time_arr0, bins=range(1, 30), alpha=0.4, label='f0 generates')[0]

n2 = plt.hist(time_arr1, bins=range(1, 30), alpha=0.4, label='f1 generates')[0]
plt.vlines(t_optimal, 0, max(n1.max(), n2.max()), linestyle='--', label='frequentist')
plt.vlines(np.mean(time_arr0), 0, max(n1.max(), n2.max()),
linestyle='--', color='b', label='E(t) under f0')
plt.vlines(np.mean(time_arr1), 0, max(n1.max(), n2.max()),
linestyle='--', color='r', label='E(t) under f1')
plt.legend();
plt.xlabel('t')
plt.ylabel('n')
plt.title('Conditional frequency distribution of times')
plt.show()

Later we’ll figure out how these distributions ultimately affect objective expected values under the two decision rules.
To begin, let’s look at simulations of the Bayesian’s beliefs over time.
We can easily compute the updated beliefs at any time 𝑡 using the one-to-one mapping from 𝐿𝑡 to 𝜋𝑡 given 𝜋0 described
in the lecture Likelihood Ratio Processes.
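As a small self-contained illustration of that mapping, the sketch below converts likelihood ratios into posteriors and back again. The helper names `posterior_from_lr` and `lr_from_posterior` and the sample values are hypothetical and are not part of the lecture's code; the inverse map is the same formula used later to turn the Bayesian cutoffs into likelihood ratio cutoffs.

```python
import numpy as np

def posterior_from_lr(L, π0):
    """Posterior π_t implied by likelihood ratio L_t and prior π0."""
    return π0 * L / (π0 * L + 1 - π0)

def lr_from_posterior(π, π0):
    """Inverse map: the likelihood ratio that moves prior π0 to posterior π."""
    return (1 - π0) * π / (π0 * (1 - π))

L = np.array([0.25, 1.0, 4.0])      # illustrative likelihood ratios
π = posterior_from_lr(L, 0.5)       # with π0 = 0.5 this is L / (1 + L)
L_back = lr_from_posterior(π, 0.5)  # recovers the original ratios
```

With a prior of 0.5, a likelihood ratio of 1 leaves the belief unchanged, which is a quick way to check the formula.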
π0_arr = π_star * L0_arr / (π_star * L0_arr + 1 - π_star)

π1_arr = π_star * L1_arr / (π_star * L1_arr + 1 - π_star)
# Two side-by-side panels: mean (left) and variance (right) of beliefs
fig, axs = plt.subplots(1, 2, figsize=(14, 4))

axs[0].plot(np.arange(1, π0_arr.shape[1]+1), np.mean(π0_arr, 0), label='f0 generates')
axs[0].plot(np.arange(1, π1_arr.shape[1]+1), 1 - np.mean(π1_arr, 0), label='f1 generates')
axs[0].set_xlabel('t')
axs[0].set_ylabel(r'$E(\pi_t)$ or ($1 - E(\pi_t)$)')
axs[0].set_title('Expectation of beliefs after drawing t observations')
axs[0].legend()

axs[1].plot(np.arange(1, π0_arr.shape[1]+1), np.var(π0_arr, 0), label='f0 generates')
axs[1].plot(np.arange(1, π1_arr.shape[1]+1), np.var(π1_arr, 0), label='f1 generates')
axs[1].set_xlabel('t')
axs[1].set_ylabel(r'var($\pi_t$)')
axs[1].set_title('Variance of beliefs after drawing t observations')
axs[1].legend()

plt.show()
The above figures compare averages and variances of updated Bayesian posteriors after 𝑡 draws.
The left graph compares 𝐸 (𝜋𝑡 ) under 𝑓0 to 1 − 𝐸 (𝜋𝑡 ) under 𝑓1 : they lie on top of each other.
However, as the right-hand side graph shows, there is a significant difference in variances when 𝑡 is small: the variance is
lower under 𝑓1 .
The difference in variances is the reason that the Bayesian decision maker waits longer to decide when 𝑓1 generates the
data.
The code below plots outcomes of constructing an unconditional distribution by simply pooling the simulated data across
the two possible distributions 𝑓0 and 𝑓1 .
The pooled distribution describes a sense in which on average the Bayesian decides earlier, an outcome that seems at least
partly to confirm the Navy Captain’s hunch.
n = plt.hist(time_arr_u, bins=range(1, 30), alpha=0.4, label='bayesian')[0]

plt.vlines(np.mean(time_arr_u), 0, n.max(), linestyle='--',
color='b', label='bayesian E(t)')
plt.vlines(t_optimal, 0, n.max(), linestyle='--', label='frequentist')
plt.legend()
plt.xlabel('t')
plt.ylabel('n')
plt.title('Unconditional distribution of times')
plt.show()

54.8 Probability of Making Correct Decision
Now we use simulations to compute the fraction of samples in which the Bayesian and the frequentist decision rules decide
correctly.
For the frequentist rule, the probability of making the correct decision under 𝑓1 is the optimal probability of detection
given 𝑡 that we defined earlier, and similarly it equals 1 minus the optimal probability of a false alarm under 𝑓0 .
Below we plot these two probabilities for the frequentist rule, along with the conditional probabilities that the Bayesian
rule decides before 𝑡 and that the decision is correct.
# optimal PFA and PD of frequentist with optimal sample size

V, PFA, PD = V_fre_t(t_optimal, L0_arr, L1_arr, π_star, wf)
plt.plot([1, 20], [PD, PD], linestyle='--', label='PD: fre. chooses f1 correctly')
plt.plot([1, 20], [1-PFA, 1-PFA], linestyle='--', label='1-PFA: fre. chooses f0 correctly')
plt.vlines(t_optimal, 0, 1, linestyle='--', label='frequentist optimal sample size')
N = time_arr0.size
T_arr = np.arange(1, 21)
plt.plot(T_arr, [np.sum(correctness0[time_arr0 <= t] == 1) / N for t in T_arr],
         label='q=f0 and baye. choose f0')
plt.plot(T_arr, [np.sum(correctness1[time_arr1 <= t] == 1) / N for t in T_arr],
         label='q=f1 and baye. choose f1')
plt.legend(loc=4)
plt.xlabel('t')
plt.ylabel('Probability')
plt.title('Cond. probability of making correct decisions before t')
plt.show()
By averaging using 𝜋∗ , we also plot the unconditional distribution.
plt.plot([1, 20], [(PD + 1 - PFA) / 2, (PD + 1 - PFA) / 2],

linestyle='--', label='fre. makes correct decision')
plt.vlines(t_optimal, 0, 1, linestyle='--', label='frequentist optimal sample size')
N = time_arr_u.size
plt.plot(T_arr, [np.sum(correctness_u[time_arr_u <= t] == 1) / N for t in T_arr],
label="bayesian makes correct decision")
plt.legend()
plt.xlabel('t')
plt.ylabel('Probability')
plt.title('Uncond. probability of making correct decisions before t')
plt.show()

54.9 Distribution of Likelihood Ratios at Frequentist’s 𝑡
Next we use simulations to construct distributions of likelihood ratios after 𝑡 draws.

To serve as useful reference points, we also show likelihood ratios that correspond to the Bayesian cutoffs 𝛼 and 𝛽.
In order to exhibit the distribution more clearly, we report logarithms of likelihood ratios.
The graph below reports two distributions, one conditional on 𝑓0 generating the data, the other conditional on 𝑓1 generating
the data.
Lα = (1 - π_star) * α / (π_star - π_star * α)

Lβ = (1 - π_star) * β / (π_star - π_star * β)
L_min = min(L0_arr[:, t_idx].min(), L1_arr[:, t_idx].min())

L_max = max(L0_arr[:, t_idx].max(), L1_arr[:, t_idx].max())
bin_range = np.linspace(np.log(L_min), np.log(L_max), 50)
n0 = plt.hist(np.log(L0_arr[:, t_idx]), bins=bin_range, alpha=0.4, label='f0 generates')[0]
n1 = plt.hist(np.log(L1_arr[:, t_idx]), bins=bin_range, alpha=0.4, label='f1 generates')[0]
plt.vlines(np.log(Lβ), 0, max(n0.max(), n1.max()), linestyle='--', color='r', label='log($L_β$)')
plt.vlines(np.log(Lα), 0, max(n0.max(), n1.max()), linestyle='--', color='b', label='log($L_α$)')


plt.legend()
plt.xlabel('log(L)')
plt.ylabel('n')
plt.title('Cond. distribution of log likelihood ratio at frequentist t')
plt.show()
The next graph plots the unconditional distribution of log likelihood ratios at the frequentist’s 𝑡, constructed as earlier by
pooling the two conditional distributions.
plt.hist(np.log(np.concatenate([L0_arr[:, t_idx], L1_arr[:, t_idx]])),
         bins=50, alpha=0.4, label='unconditional dist. of log(L)')
plt.vlines(np.log(Lβ), 0, max(n0.max(), n1.max()), linestyle='--', color='r', label='log($L_β$)')
plt.vlines(np.log(Lα), 0, max(n0.max(), n1.max()), linestyle='--', color='b', label='log($L_α$)')
plt.legend()
plt.xlabel('log(L)')
plt.ylabel('n')
plt.title('Uncond. distribution of log likelihood ratio at frequentist t')
plt.show()


Part IX
LQ Control
CHAPTER
FIFTYFIVE
LQ CONTROL: FOUNDATIONS
Contents
• LQ Control: Foundations
– Overview
– Introduction
– Optimality – Finite Horizon
– Implementation
– Extensions and Comments
– Further Applications
– Exercises
55.1 Overview
Linear quadratic (LQ) control refers to a class of dynamic optimization problems that have found applications in almost
every scientific field.
This lecture provides an introduction to LQ control and its economic applications.
As we will see, LQ systems have a simple structure that makes them an excellent workhorse for a wide variety of economic
problems.
Moreover, while the linear-quadratic structure is restrictive, it is in fact far more flexible than it may appear initially.
These themes appear repeatedly below.
Mathematically, LQ control problems are closely related to the Kalman filter:
• Recursive formulations of linear-quadratic control problems and Kalman filtering problems both involve matrix
Riccati equations.
• Classical formulations of linear control and linear filtering problems make use of similar matrix decompositions
(see for example this lecture and this lecture).
In reading what follows, it will be useful to have some familiarity with
• matrix manipulations
• vectors of random variables
• dynamic programming and the Bellman equation (see for example this lecture and this lecture)
For additional reading on LQ control, see, for example,
• [Ljungqvist and Sargent, 2018], chapter 5
• [Hansen and Sargent, 2008], chapter 4
• [Hernandez-Lerma and Lasserre, 1996], section 3.5
In order to focus on computation, we leave longer proofs to these sources (while trying to provide as much intuition as
possible).

import numpy as np
import matplotlib.pyplot as plt  # needed for the plots below
from quantecon import LQ
55.2 Introduction
The “linear” part of LQ is a linear law of motion for the state, while the “quadratic” part refers to preferences.
Let’s begin with the former, move on to the latter, and then put them together into an optimization problem.
55.2.1 The Law of Motion
Let 𝑥𝑡 be a vector describing the state of some economic system.

Suppose that 𝑥𝑡 follows a linear law of motion given by
𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 + 𝐶𝑤𝑡+1 , 𝑡 = 0, 1, 2, … (55.1)
Here
• 𝑢𝑡 is a “control” vector, incorporating choices available to a decision-maker confronting the current state 𝑥𝑡
• {𝑤𝑡 } is an uncorrelated zero mean shock process satisfying 𝔼𝑤𝑡 𝑤𝑡′ = 𝐼, where the right-hand side is the identity
matrix
Regarding the dimensions
• 𝑥𝑡 is 𝑛 × 1, 𝐴 is 𝑛 × 𝑛
• 𝑢𝑡 is 𝑘 × 1, 𝐵 is 𝑛 × 𝑘
• 𝑤𝑡 is 𝑗 × 1, 𝐶 is 𝑛 × 𝑗
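To make the dimensions concrete, here is a minimal sketch that simulates the law of motion (55.1) with NumPy. The matrices, the horizon and the choice of holding the control at zero are purely illustrative assumptions, not values used elsewhere in this lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, j, T = 2, 1, 1, 5                  # illustrative dimensions and horizon
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])               # n x n
B = np.array([[0.0],
              [1.0]])                    # n x k
C = np.array([[0.0],
              [0.5]])                    # n x j

x = np.zeros((n, T + 1))                 # x[:, t] holds the state x_t
u = np.zeros((k, T))                     # controls held at zero for illustration

for t in range(T):
    w_next = rng.standard_normal(j)      # shock w_{t+1}, zero mean, identity covariance
    x[:, t + 1] = A @ x[:, t] + B @ u[:, t] + C @ w_next
```

Each product in the update has shape (n,), so the three terms can be added directly.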

Example 1
Consider a household budget constraint given by
𝑎𝑡+1 + 𝑐𝑡 = (1 + 𝑟)𝑎𝑡 + 𝑦𝑡
Here 𝑎𝑡 is assets, 𝑟 is a fixed interest rate, 𝑐𝑡 is current consumption, and 𝑦𝑡 is current non-financial income.
If we suppose that {𝑦𝑡 } is serially uncorrelated and 𝑁 (0, 𝜎²), then, taking {𝑤𝑡 } to be standard normal, we can write the
system as
𝑎𝑡+1 = (1 + 𝑟)𝑎𝑡 − 𝑐𝑡 + 𝜎𝑤𝑡+1
This is clearly a special case of (55.1), with assets being the state and consumption being the control.
Example 2
One unrealistic feature of the previous model is that non-financial income has a zero mean and is often negative.
This can easily be overcome by adding a sufficiently large mean.
Hence in this example, we take 𝑦𝑡 = 𝜎𝑤𝑡+1 + 𝜇 for some positive real number 𝜇.
Another alteration that’s useful to introduce (we’ll see why soon) is to change the control variable from consumption to
the deviation of consumption from some “ideal” quantity 𝑐̄.
(Most parameterizations will be such that 𝑐̄ is large relative to the amount of consumption that is attainable in each period,
and hence the household wants to increase consumption.)
For this reason, we now take our control to be 𝑢𝑡 ∶= 𝑐𝑡 − 𝑐̄.
In terms of these variables, the budget constraint 𝑎𝑡+1 = (1 + 𝑟)𝑎𝑡 − 𝑐𝑡 + 𝑦𝑡 becomes
𝑎𝑡+1 = (1 + 𝑟)𝑎𝑡 − 𝑢𝑡 − 𝑐̄ + 𝜎𝑤𝑡+1 + 𝜇 (55.2)
How can we write this new system in the form of equation (55.1)?
If, as in the previous example, we take 𝑎𝑡 as the state, then we run into a problem: the law of motion contains some
constant terms on the right-hand side.
This means that we are dealing with an affine function, not a linear one (recall this discussion).
Fortunately, we can easily circumvent this problem by adding an extra state variable.
In particular, if we write
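$$
\begin{pmatrix} a_{t+1} \\ 1 \end{pmatrix}
= \begin{pmatrix} 1 + r & -\bar{c} + \mu \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} a_t \\ 1 \end{pmatrix}
+ \begin{pmatrix} -1 \\ 0 \end{pmatrix} u_t
+ \begin{pmatrix} \sigma \\ 0 \end{pmatrix} w_{t+1}
\tag{55.3}
$$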
then the first row is equivalent to (55.2).

Moreover, the model is now linear and can be written in the form of (55.1) by setting
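$$
x_t := \begin{pmatrix} a_t \\ 1 \end{pmatrix}, \quad
A := \begin{pmatrix} 1 + r & -\bar{c} + \mu \\ 0 & 1 \end{pmatrix}, \quad
B := \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \quad
C := \begin{pmatrix} \sigma \\ 0 \end{pmatrix}
\tag{55.4}
$$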
In effect, we’ve bought ourselves linearity by adding another state.
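The following sketch checks numerically that the augmented system (55.4) reproduces the scalar budget constraint (55.2). The parameter values match those used later in the lecture's application, while the test values for assets, control and shock are arbitrary assumptions chosen only for the check.

```python
import numpy as np

r, c_bar, μ, σ = 0.05, 2.0, 1.0, 0.25

A = np.array([[1 + r, -c_bar + μ],
              [0.0,   1.0]])
B = np.array([[-1.0],
              [0.0]])
C = np.array([[σ],
              [0.0]])

a_t, u_t, w_next = 3.0, -0.5, 0.4        # arbitrary test values
x_t = np.array([a_t, 1.0])               # augmented state (a_t, 1)

# One step of the linear law of motion x_{t+1} = A x_t + B u_t + C w_{t+1}
x_next = A @ x_t + B @ np.array([u_t]) + C @ np.array([w_next])

# The same step computed directly from the budget constraint (55.2)
a_next_direct = (1 + r) * a_t - u_t - c_bar + σ * w_next + μ
```

The first entry of `x_next` matches `a_next_direct`, and the second entry stays equal to one, so the constant is carried forward as a state.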

55.2.2 Preferences
In the LQ model, the aim is to minimize a flow of losses, where time-𝑡 loss is given by the quadratic expression
𝑥′𝑡 𝑅𝑥𝑡 + 𝑢′𝑡 𝑄𝑢𝑡 (55.5)
Here
• 𝑅 is assumed to be 𝑛 × 𝑛, symmetric and nonnegative definite.
• 𝑄 is assumed to be 𝑘 × 𝑘, symmetric and positive definite.
Note: In fact, for many economic problems, the definiteness conditions on 𝑅 and 𝑄 can be relaxed. It is sufficient that
certain submatrices of 𝑅 and 𝑄 be nonnegative definite. See [Hansen and Sargent, 2008] for details.
Example 1
A very simple example that satisfies these assumptions is to take 𝑅 and 𝑄 to be identity matrices so that current loss is
𝑥′𝑡 𝐼𝑥𝑡 + 𝑢′𝑡 𝐼𝑢𝑡 = ‖𝑥𝑡 ‖² + ‖𝑢𝑡 ‖²
Thus, for both the state and the control, loss is measured as squared distance from the origin.
(In fact, the general case (55.5) can also be understood in this way, but with 𝑅 and 𝑄 identifying other – non-Euclidean
– notions of “distance” from the zero vector.)
Intuitively, we can often think of the state 𝑥𝑡 as representing deviation from a target, such as
• deviation of inflation from some target level
• deviation of a firm’s capital stock from some desired quantity
The aim is to put the state close to the target, while using controls parsimoniously.
Example 2
In the household problem studied above, setting 𝑅 = 0 and 𝑄 = 1 yields preferences
𝑥′𝑡 𝑅𝑥𝑡 + 𝑢′𝑡 𝑄𝑢𝑡 = 𝑢𝑡² = (𝑐𝑡 − 𝑐̄)²
Under this specification, the household’s current loss is the squared deviation of consumption from the ideal level 𝑐̄.
55.3 Optimality – Finite Horizon
Let’s now be precise about the optimization problem we wish to consider, and look at how to solve it.

55.3.1 The Objective
We will begin with the finite horizon case, with terminal time 𝑇 ∈ ℕ.
In this case, the aim is to choose a sequence of controls {𝑢0 , … , 𝑢𝑇 −1 } to minimize the objective
$$
\mathbb{E} \left\{ \sum_{t=0}^{T-1} \beta^t (x_t' R x_t + u_t' Q u_t) + \beta^T x_T' R_f x_T \right\}
\tag{55.6}
$$
subject to the law of motion (55.1) and initial state 𝑥0 .

The new objects introduced here are 𝛽 and the matrix 𝑅𝑓 .
The scalar 𝛽 is the discount factor, while 𝑥′ 𝑅𝑓 𝑥 gives terminal loss associated with state 𝑥.
Comments:
• We assume 𝑅𝑓 to be 𝑛 × 𝑛, symmetric and nonnegative definite.
• We allow 𝛽 = 1, and hence include the undiscounted case.
• 𝑥0 may itself be random, in which case we require it to be independent of the shock sequence 𝑤1 , … , 𝑤𝑇 .
55.3.2 Information
There’s one constraint we’ve neglected to mention so far, which is that the decision-maker who solves this LQ problem
knows only the present and the past, not the future.
To clarify this point, consider the sequence of controls {𝑢0 , … , 𝑢𝑇 −1 }.
When choosing these controls, the decision-maker is permitted to take into account the effects of the shocks {𝑤1 , … , 𝑤𝑇 }
on the system.
However, it is typically assumed — and will be assumed here — that the time-𝑡 control 𝑢𝑡 can be made with knowledge
of past and present shocks only.
The fancy measure-theoretic way of saying this is that 𝑢𝑡 must be measurable with respect to the 𝜎-algebra generated by
𝑥0 , 𝑤1 , 𝑤2 , … , 𝑤𝑡 .
This is in fact equivalent to stating that 𝑢𝑡 can be written in the form 𝑢𝑡 = 𝑔𝑡 (𝑥0 , 𝑤1 , 𝑤2 , … , 𝑤𝑡 ) for some Borel measurable
function 𝑔𝑡 .
(Just about every function that’s useful for applications is Borel measurable, so, for the purposes of intuition, you can read
that last phrase as “for some function 𝑔𝑡 ”)
Now note that 𝑥𝑡 will ultimately depend on the realizations of 𝑥0 , 𝑤1 , 𝑤2 , … , 𝑤𝑡 .
In fact, it turns out that 𝑥𝑡 summarizes all the information about these historical shocks that the decision-maker needs to
set controls optimally.
More precisely, it can be shown that any optimal control 𝑢𝑡 can always be written as a function of the current state alone.
Hence in what follows we restrict attention to control policies (i.e., functions) of the form 𝑢𝑡 = 𝑔𝑡 (𝑥𝑡 ).
Actually, the preceding discussion applies to all standard dynamic programming problems.
What’s special about the LQ case is that, as we shall soon see, the optimal 𝑢𝑡 turns out to be a linear function of 𝑥𝑡 .

55.3.3 Solution
To solve the finite horizon LQ problem we can use a dynamic programming strategy based on backward induction that is
conceptually similar to the approach adopted in this lecture.
For reasons that will soon become clear, we first introduce the notation 𝐽𝑇 (𝑥) = 𝑥′ 𝑅𝑓 𝑥.
Now consider the problem of the decision-maker in the second to last period.
In particular, let the time be 𝑇 − 1, and suppose that the state is 𝑥𝑇 −1 .
The decision-maker must trade off current and (discounted) final losses, and hence solves
$$
\min_u \{ x_{T-1}' R x_{T-1} + u' Q u + \beta \, \mathbb{E} J_T(A x_{T-1} + B u + C w_T) \}
$$
At this stage, it is convenient to define the function
$$
J_{T-1}(x) = \min_u \{ x' R x + u' Q u + \beta \, \mathbb{E} J_T(A x + B u + C w_T) \}
\tag{55.7}
$$
The function 𝐽𝑇 −1 will be called the 𝑇 −1 value function, and 𝐽𝑇 −1 (𝑥) can be thought of as representing total “loss-to-go”
from state 𝑥 at time 𝑇 − 1 when the decision-maker behaves optimally.
Now let’s step back to 𝑇 − 2.
For a decision-maker at 𝑇 −2, the value 𝐽𝑇 −1 (𝑥) plays a role analogous to that played by the terminal loss 𝐽𝑇 (𝑥) = 𝑥′ 𝑅𝑓 𝑥
for the decision-maker at 𝑇 − 1.
That is, 𝐽𝑇 −1 (𝑥) summarizes the future loss associated with moving to state 𝑥.
The decision-maker chooses her control 𝑢 to trade off current loss against future loss, where
• the next period state is 𝑥𝑇 −1 = 𝐴𝑥𝑇 −2 + 𝐵𝑢 + 𝐶𝑤𝑇 −1 , and hence depends on the choice of current control.
• the “cost” of landing in state 𝑥𝑇 −1 is 𝐽𝑇 −1 (𝑥𝑇 −1 ).
Her problem is therefore
$$
\min_u \{ x_{T-2}' R x_{T-2} + u' Q u + \beta \, \mathbb{E} J_{T-1}(A x_{T-2} + B u + C w_{T-1}) \}
$$
Letting
$$
J_{T-2}(x) = \min_u \{ x' R x + u' Q u + \beta \, \mathbb{E} J_{T-1}(A x + B u + C w_{T-1}) \}
$$
the pattern for backward induction is now clear.

In particular, we define a sequence of value functions {𝐽0 , … , 𝐽𝑇 } via
$$
J_{t-1}(x) = \min_u \{ x' R x + u' Q u + \beta \, \mathbb{E} J_t(A x + B u + C w_t) \}
\quad \text{and} \quad
J_T(x) = x' R_f x
$$
The first equality is the Bellman equation from dynamic programming theory specialized to the finite horizon LQ problem.
Now that we have {𝐽0 , … , 𝐽𝑇 }, we can obtain the optimal controls.
As a first step, let’s find out what the value functions look like.
It turns out that every 𝐽𝑡 has the form 𝐽𝑡 (𝑥) = 𝑥′ 𝑃𝑡 𝑥 + 𝑑𝑡 where 𝑃𝑡 is an 𝑛 × 𝑛 matrix and 𝑑𝑡 is a constant.
We can show this by induction, starting from 𝑃𝑇 ∶= 𝑅𝑓 and 𝑑𝑇 = 0.
Using this notation, (55.7) becomes
$$
J_{T-1}(x) = \min_u \{ x' R x + u' Q u + \beta \, \mathbb{E} (A x + B u + C w_T)' P_T (A x + B u + C w_T) \}
\tag{55.8}
$$
To obtain the minimizer, we can take the derivative of the r.h.s. with respect to 𝑢 and set it equal to zero.

Applying the relevant rules of matrix calculus, this gives
$$
u = -(Q + \beta B' P_T B)^{-1} \beta B' P_T A x
\tag{55.9}
$$
Plugging this back into (55.8) and rearranging yields
𝐽𝑇 −1 (𝑥) = 𝑥′ 𝑃𝑇 −1 𝑥 + 𝑑𝑇 −1
where
$$
P_{T-1} = R - \beta^2 A' P_T B (Q + \beta B' P_T B)^{-1} B' P_T A + \beta A' P_T A
\tag{55.10}
$$
and
$$
d_{T-1} := \beta \, \mathrm{trace}(C' P_T C)
\tag{55.11}
$$
(The algebra is a good exercise — we’ll leave it up to you.)
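As a hint for that exercise, the two key steps can be sketched as follows. The sketch assumes, as in the text, that $w_T$ has zero mean, satisfies $\mathbb{E} w_T w_T' = I$, and is independent of $x$ and $u$, and that $P_T$ is symmetric.

```latex
% Step 1: evaluate the expectation in (55.8); the cross terms vanish because E w_T = 0
\mathbb{E} (Ax + Bu + Cw_T)' P_T (Ax + Bu + Cw_T)
  = (Ax + Bu)' P_T (Ax + Bu) + \mathrm{trace}(C' P_T C)

% Step 2: differentiate the objective in (55.8) with respect to u and set to zero
2 Q u + 2 \beta B' P_T (Ax + Bu) = 0
  \quad \Longrightarrow \quad
  (Q + \beta B' P_T B) u = - \beta B' P_T A x
```

Solving the last equality for $u$ gives (55.9), and the trace term in the first step is the source of the constant in (55.11).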

If we continue working backwards in this manner, it soon becomes clear that 𝐽𝑡 (𝑥) = 𝑥′ 𝑃𝑡 𝑥 + 𝑑𝑡 as claimed, where
{𝑃𝑡 } and {𝑑𝑡 } satisfy the recursions
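$$
P_{t-1} = R - \beta^2 A' P_t B (Q + \beta B' P_t B)^{-1} B' P_t A + \beta A' P_t A
\quad \text{with} \quad P_T = R_f
\tag{55.12}
$$
and
$$
d_{t-1} = \beta (d_t + \mathrm{trace}(C' P_t C))
\quad \text{with} \quad d_T = 0
\tag{55.13}
$$
Recalling (55.9), the minimizers from these backward steps are
$$
u_t = -F_t x_t
\quad \text{where} \quad
F_t := (Q + \beta B' P_{t+1} B)^{-1} \beta B' P_{t+1} A
\tag{55.14}
$$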
These are the linear optimal control policies we discussed above.

In particular, the sequence of controls given by (55.14) and (55.1) solves our finite horizon LQ problem.
Rephrasing this more precisely, the sequence 𝑢0 , … , 𝑢𝑇 −1 given by
𝑢𝑡 = −𝐹𝑡 𝑥𝑡 with 𝑥𝑡+1 = (𝐴 − 𝐵𝐹𝑡 )𝑥𝑡 + 𝐶𝑤𝑡+1 (55.15)
for 𝑡 = 0, … , 𝑇 − 1 attains the minimum of (55.6) subject to our constraints.
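The backward recursion in (55.12), (55.13) and (55.14) can be sketched in a few lines of NumPy. This is a minimal illustration, not the QuantEcon.py implementation; the function names `riccati_step` and `solve_finite_horizon` are hypothetical, and the scalar test problem is chosen only because it can be checked by hand.

```python
import numpy as np

def riccati_step(P, d, A, B, C, Q, R, β):
    """Map (P_t, d_t) into (P_{t-1}, d_{t-1}, F_{t-1}) using (55.12)-(55.14)."""
    S = Q + β * B.T @ P @ B
    F = np.linalg.solve(S, β * B.T @ P @ A)               # policy matrix, (55.14)
    P_prev = R - β * A.T @ P @ B @ F + β * A.T @ P @ A    # Riccati update, (55.12)
    d_prev = β * (d + np.trace(C.T @ P @ C))              # constant term, (55.13)
    return P_prev, d_prev, F

def solve_finite_horizon(A, B, C, Q, R, Rf, β, T):
    """Iterate backwards from P_T = Rf, d_T = 0; return P_0, d_0 and [F_0, ..., F_{T-1}]."""
    P, d = Rf, 0.0
    policies = []
    for _ in range(T):
        P, d, F = riccati_step(P, d, A, B, C, Q, R, β)
        policies.append(F)
    policies.reverse()            # computed last-to-first, so flip into time order
    return P, d, policies

# Scalar check: every matrix equals 1, no discounting, two periods
one = np.array([[1.0]])
P0, d0, F = solve_finite_horizon(one, one, one, one, one, one, β=1.0, T=2)
```

By hand, the first backward step gives $P_1 = 1.5$, $d_1 = 1$ and $F_1 = 0.5$, and the second gives $P_0 = 1.6$, $d_0 = 2.5$ and $F_0 = 0.6$, which is what the code returns.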
55.4 Implementation
We will use code from lqcontrol.py in QuantEcon.py to solve finite and infinite horizon linear quadratic control problems.
In the module, the various updating, simulation and fixed point methods are wrapped in a class called LQ, which includes
• Instance data:
– The required parameters 𝑄, 𝑅, 𝐴, 𝐵 and optional parameters 𝐶, 𝛽, 𝑇 , 𝑅𝑓 , 𝑁 specifying a given LQ model
∗ set 𝑇 and 𝑅𝑓 to None in the infinite horizon case
∗ set C = None (or zero) in the deterministic case
– the value function and policy data
∗ 𝑑𝑡 , 𝑃𝑡 , 𝐹𝑡 in the finite horizon case

∗ 𝑑, 𝑃 , 𝐹 in the infinite horizon case

• Methods:
– update_values: shifts 𝑑𝑡 , 𝑃𝑡 , 𝐹𝑡 to their 𝑡 − 1 values via (55.12), (55.13) and (55.14)
– stationary_values: computes 𝑃 , 𝑑, 𝐹 in the infinite horizon case
– compute_sequence: simulates the dynamics of 𝑥𝑡 , 𝑢𝑡 , 𝑤𝑡 given 𝑥0 and assuming standard normal
shocks
55.4.1 An Application
Early Keynesian models assumed that households have a constant marginal propensity to consume from current income.
Data contradicted the constancy of the marginal propensity to consume.
In response, Milton Friedman, Franco Modigliani and others built models based on a consumer’s preference for an
intertemporally smooth consumption stream.
(See, for example, [Friedman, 1956] or [Modigliani and Brumberg, 1954].)
One property of those models is that households purchase and sell financial assets to make consumption streams smoother
than income streams.
The household savings problem outlined above captures these ideas.
The optimization problem for the household is to choose a consumption sequence in order to minimize
$$
\mathbb{E} \left\{ \sum_{t=0}^{T-1} \beta^t (c_t - \bar{c})^2 + \beta^T q a_T^2 \right\}
\tag{55.16}
$$
subject to the sequence of budget constraints 𝑎𝑡+1 = (1 + 𝑟)𝑎𝑡 − 𝑐𝑡 + 𝑦𝑡 , 𝑡 ≥ 0.

Here 𝑞 is a large positive constant, the role of which is to induce the consumer to target zero debt at the end of her life.
(Without such a constraint, the optimal choice is to choose 𝑐𝑡 = 𝑐̄ in each period, letting assets adjust accordingly.)
As before we set 𝑦𝑡 = 𝜎𝑤𝑡+1 + 𝜇 and 𝑢𝑡 ∶= 𝑐𝑡 − 𝑐̄, after which the constraint can be written as in (55.2).
We saw how this constraint could be manipulated into the LQ formulation 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 + 𝐶𝑤𝑡+1 by setting
𝑥𝑡 = (𝑎𝑡 1)′ and using the definitions in (55.4).
To match with this state and control, the objective function (55.16) can be written in the form of (55.6) by choosing
$$
Q := 1, \quad
R := \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad \text{and} \quad
R_f := \begin{pmatrix} q & 0 \\ 0 & 0 \end{pmatrix}
$$
Now that the problem is expressed in LQ form, we can proceed to the solution by applying (55.12) and (55.14).
After generating shocks 𝑤1 , … , 𝑤𝑇 , the dynamics for assets and consumption can be simulated via (55.15).
The following figure was computed using 𝑟 = 0.05, 𝛽 = 1/(1 + 𝑟), 𝑐̄ = 2, 𝜇 = 1, 𝜎 = 0.25, 𝑇 = 45 and 𝑞 = 10⁶.
The shocks {𝑤𝑡 } were taken to be IID and standard normal.
# Model parameters
r = 0.05
β = 1/(1 + r)
T = 45
c_bar = 2
σ = 0.25


μ = 1
q = 1e6
# Formulate as an LQ problem
Q = 1
R = np.zeros((2, 2))
Rf = np.zeros((2, 2))
Rf[0, 0] = q
A = [[1 + r, -c_bar + μ],
[0, 1]]
B = [[-1],
[ 0]]
C = [[σ],
[0]]
# Compute solutions and simulate

lq = LQ(Q, R, A, B, C, beta=β, T=T, Rf=Rf)
x0 = (0, 1)
xp, up, wp = lq.compute_sequence(x0)
# Convert back to assets, consumption and income

assets = xp[0, :] # a_t
c = up.flatten() + c_bar # c_t
income = σ * wp[0, 1:] + μ # y_t
# Plot results
n_rows = 2
fig, axes = plt.subplots(n_rows, 1, figsize=(12, 10))
bbox = (0., 1.02, 1., .102)

legend_args = {'bbox_to_anchor': bbox, 'loc': 3, 'mode': 'expand'}
p_args = {'lw': 2, 'alpha': 0.7}
axes[0].plot(list(range(1, T+1)), income, 'g-', label="non-financial income",

**p_args)
axes[0].plot(list(range(T)), c, 'k-', label="consumption", **p_args)
axes[1].plot(list(range(1, T+1)), np.cumsum(income - μ), 'r-',

label="cumulative unanticipated income", **p_args)
axes[1].plot(list(range(T+1)), assets, 'b-', label="assets", **p_args)
axes[1].plot(list(range(T)), np.zeros(T), 'k-')
for ax in axes:
    ax.grid()
    ax.legend(ncol=2, **legend_args)
plt.show()

The top panel shows the time path of consumption 𝑐𝑡 and income 𝑦𝑡 in the simulation.
As anticipated by the discussion on consumption smoothing, the time path of consumption is much smoother than that
for income.
(But note that consumption becomes more irregular towards the end of life, when the zero final asset requirement impinges
more on consumption choices.)
The second panel in the figure shows that the time path of assets 𝑎𝑡 is closely correlated with cumulative unanticipated
income, where the latter is defined as
$$
z_t := \sum_{j=0}^{t} \sigma w_j
$$
A key message is that unanticipated windfall gains are saved rather than consumed, while unanticipated negative shocks
are met by reducing assets.
(Again, this relationship breaks down towards the end of life due to the zero final asset requirement.)
These results are relatively robust to changes in parameters.
For example, let’s increase 𝛽 from 1/(1 + 𝑟) ≈ 0.952 to 0.96 while keeping other parameters fixed.
This consumer is slightly more patient than the last one, and hence puts relatively more weight on later consumption values.


# Compute solutions and simulate
lq = LQ(Q, R, A, B, C, beta=0.96, T=T, Rf=Rf)
x0 = (0, 1)
xp, up, wp = lq.compute_sequence(x0)  # re-simulate under the new discount factor

# Convert back to assets, consumption and income
assets = xp[0, :]            # a_t
c = up.flatten() + c_bar     # c_t
income = σ * wp[0, 1:] + μ   # y_t

# Plot results
n_rows = 2
fig, axes = plt.subplots(n_rows, 1, figsize=(12, 10))
bbox = (0., 1.02, 1., .102)
legend_args = {'bbox_to_anchor': bbox, 'loc': 3, 'mode': 'expand'}
p_args = {'lw': 2, 'alpha': 0.7}
axes[0].plot(list(range(1, T+1)), income, 'g-', label="non-financial income",
             **p_args)
axes[0].plot(list(range(T)), c, 'k-', label="consumption", **p_args)
axes[1].plot(list(range(1, T+1)), np.cumsum(income - μ), 'r-',
             label="cumulative unanticipated income", **p_args)
axes[1].plot(list(range(T+1)), assets, 'b-', label="assets", **p_args)
axes[1].plot(list(range(T)), np.zeros(T), 'k-')
for ax in axes:
    ax.grid()
    ax.legend(ncol=2, **legend_args)
plt.show()

We now have a slowly rising consumption stream and a hump-shaped build-up of assets in the middle periods to fund
rising consumption.
However, the essential features are the same: consumption is smooth relative to income, and assets are strongly positively
correlated with cumulative unanticipated income.
55.5 Extensions and Comments
Let’s now consider a number of standard extensions to the LQ problem treated above.

55.5.1 Time-Varying Parameters
In some settings, it can be desirable to allow 𝐴, 𝐵, 𝐶, 𝑅 and 𝑄 to depend on 𝑡.

For the sake of simplicity, we’ve chosen not to treat this extension in our implementation given below.
However, the loss of generality is not as large as you might first imagine.
In fact, we can tackle many models with time-varying parameters by suitable choice of state variables.
One illustration is given below.
For further examples and a more systematic treatment, see [Hansen and Sargent, 2013], section 2.4.
55.5.2 Adding a Cross-Product Term
In some LQ problems, preferences include a cross-product term 𝑢′𝑡 𝑁 𝑥𝑡 , so that the objective function becomes
$$
\mathbb{E} \left\{ \sum_{t=0}^{T-1} \beta^t (x_t' R x_t + u_t' Q u_t + 2 u_t' N x_t) + \beta^T x_T' R_f x_T \right\}
\tag{55.17}
$$
Our results extend to this case in a straightforward way.

The sequence {𝑃𝑡 } from (55.12) becomes
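$$
P_{t-1} = R - (\beta B' P_t A + N)' (Q + \beta B' P_t B)^{-1} (\beta B' P_t A + N) + \beta A' P_t A
\quad \text{with} \quad P_T = R_f
\tag{55.18}
$$
The policies in (55.14) are modified to
$$
u_t = -F_t x_t
\quad \text{where} \quad
F_t := (Q + \beta B' P_{t+1} B)^{-1} (\beta B' P_{t+1} A + N)
\tag{55.19}
$$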
The sequence {𝑑𝑡 } is unchanged from (55.13).

We leave interested readers to confirm these results (the calculations are long but not overly difficult).
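A single backward step with the cross-product term can be sketched as below. The function name `riccati_step_cross` is hypothetical, and the scalar values are chosen only so that the result can be verified by hand; setting $N = 0$ recovers the step without a cross-product term.

```python
import numpy as np

def riccati_step_cross(P, A, B, Q, R, N, β):
    """One backward step of (55.18), returning P_{t-1} and the policy from (55.19)."""
    S = Q + β * B.T @ P @ B
    G = β * B.T @ P @ A + N                    # N is k x n, like B' P A
    F = np.linalg.solve(S, G)                  # F = S^{-1} G
    P_prev = R - G.T @ F + β * A.T @ P @ A     # R - G' S^{-1} G + β A' P A
    return P_prev, F

one = np.array([[1.0]])

# With N = 0.5: S = 2, G = 1.5, so F = 0.75 and P_prev = 1 - 1.5 * 0.75 + 1 = 0.875
P_cross, F_cross = riccati_step_cross(one, one, one, one, one, N=0.5 * one, β=1.0)

# With N = 0 the step reduces to (55.12) and (55.14): F = 0.5 and P_prev = 1.5
P_plain, F_plain = riccati_step_cross(one, one, one, one, one, N=0.0 * one, β=1.0)
```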
55.5.3 Infinite Horizon
Finally, we consider the infinite horizon case, with cross-product term, unchanged dynamics and objective function given
by
$$
\mathbb{E} \left\{ \sum_{t=0}^{\infty} \beta^t (x_t' R x_t + u_t' Q u_t + 2 u_t' N x_t) \right\}
\tag{55.20}
$$
In the infinite horizon case, optimal policies can depend on time only if time itself is a component of the state vector 𝑥𝑡 .
In other words, there exists a fixed matrix 𝐹 such that 𝑢𝑡 = −𝐹 𝑥𝑡 for all 𝑡.
That decision rules are constant over time is intuitive — after all, the decision-maker faces the same infinite horizon at
every stage, with only the current state changing.
Not surprisingly, 𝑃 and 𝑑 are also constant.
The stationary matrix 𝑃 is the solution to the discrete-time algebraic Riccati equation.
$$
P = R - (\beta B' P A + N)' (Q + \beta B' P B)^{-1} (\beta B' P A + N) + \beta A' P A
\tag{55.21}
$$

Equation (55.21) is also called the LQ Bellman equation, and the map that sends a given 𝑃 into the right-hand side of
(55.21) is called the LQ Bellman operator.
The stationary optimal policy for this model is
$$
u = -F x
\quad \text{where} \quad
F = (Q + \beta B' P B)^{-1} (\beta B' P A + N)
\tag{55.22}
$$
The sequence {𝑑𝑡 } from (55.13) is replaced by the constant value
$$
d := \frac{\beta}{1 - \beta} \, \mathrm{trace}(C' P C)
\tag{55.23}
$$
The state evolves according to the time-homogeneous process 𝑥𝑡+1 = (𝐴 − 𝐵𝐹 )𝑥𝑡 + 𝐶𝑤𝑡+1 .
An example infinite horizon problem is treated below.
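One simple way to approximate the stationary solution is to iterate the LQ Bellman operator in (55.21) until it stops changing. The sketch below does this with plain fixed-point iteration, which is an assumption made for transparency and is not necessarily the algorithm used inside QuantEcon.py; the function name `stationary_lq`, the tolerance and the scalar test problem are all hypothetical.

```python
import numpy as np

def stationary_lq(A, B, C, Q, R, N, β, tol=1e-10, max_iter=10_000):
    """Approximate P, F, d from (55.21)-(55.23) by iterating the Bellman operator."""
    P = np.zeros_like(A, dtype=float)          # start from P = 0
    for _ in range(max_iter):
        S = Q + β * B.T @ P @ B
        G = β * B.T @ P @ A + N
        P_new = R - G.T @ np.linalg.solve(S, G) + β * A.T @ P @ A
        converged = np.max(np.abs(P_new - P)) < tol
        P = P_new
        if converged:
            break
    F = np.linalg.solve(Q + β * B.T @ P @ B, β * B.T @ P @ A + N)   # (55.22)
    d = β / (1 - β) * np.trace(C.T @ P @ C)                         # (55.23), needs β < 1
    return P, F, d

# Scalar test problem: A = B = C = Q = R = 1, N = 0, β = 0.95
one = np.array([[1.0]])
β = 0.95
P, F, d = stationary_lq(one, one, one, one, one, 0.0 * one, β)

# In this scalar case (55.21) reduces to β P² + (1 - 2β) P - 1 = 0
P_closed_form = (0.9 + np.sqrt(0.9**2 + 4 * β)) / (2 * β)
```

In the scalar case the positive root of the quadratic gives a closed form to compare against, which is a useful check on the iteration.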
55.5.4 Certainty Equivalence
Linear quadratic control problems of the class discussed above have the property of certainty equivalence.
By this, we mean that the optimal policy 𝐹 is not affected by the parameters in 𝐶, which specify the shock process.
This can be confirmed by inspecting (55.22) or (55.19).
It follows that we can ignore uncertainty when solving for optimal behavior, and plug it back in when examining optimal
state dynamics.
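Certainty equivalence can be illustrated numerically in the finite horizon household problem: changing $\sigma$ (and hence $C$) leaves every policy matrix $F_t$ unchanged and affects only the constant $d_t$. The sketch below is a self-contained NumPy version of the recursions (55.12) to (55.14); the function name `backward_policies`, the short horizon and the two values of $\sigma$ are illustrative assumptions.

```python
import numpy as np

def backward_policies(A, B, C, Q, R, Rf, β, T):
    """Return the policies [F_0, ..., F_{T-1}] and the constant d_0."""
    P, d, policies = Rf, 0.0, []
    for _ in range(T):
        F = np.linalg.solve(Q + β * B.T @ P @ B, β * B.T @ P @ A)
        d = β * (d + np.trace(C.T @ P @ C))              # the only place C enters
        P = R - β * A.T @ P @ B @ F + β * A.T @ P @ A
        policies.append(F)
    return policies[::-1], d

r, c_bar, μ, q, T = 0.05, 2.0, 1.0, 1e6, 10
β = 1 / (1 + r)
A = np.array([[1 + r, -c_bar + μ], [0.0, 1.0]])
B = np.array([[-1.0], [0.0]])
Q = np.array([[1.0]])
R = np.zeros((2, 2))
Rf = np.array([[q, 0.0], [0.0, 0.0]])

results = {}
for σ in (0.25, 2.0):                                    # low and high income risk
    C = np.array([[σ], [0.0]])
    results[σ] = backward_policies(A, B, C, Q, R, Rf, β, T)

F_low, d_low = results[0.25]
F_high, d_high = results[2.0]
```

The two policy sequences coincide, while the constant term is larger when shocks are more volatile.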
55.6 Further Applications
55.6.1 Application 1: Age-Dependent Income Process
Previously we studied a permanent income model that generated consumption smoothing.

One unrealistic feature of that model is the assumption that the mean of the random income process does not depend on
the consumer’s age.
A more realistic income profile is one that rises in early working life, peaks towards the middle, perhaps declines toward
the end of working life, and falls further during retirement.
In this section, we will model this rise and fall as a symmetric inverted “U” using a polynomial in age.
As before, the consumer seeks to minimize
$$
\mathbb{E} \left\{ \sum_{t=0}^{T-1} \beta^t (c_t - \bar{c})^2 + \beta^T q a_T^2 \right\}
\tag{55.24}
$$
subject to 𝑎𝑡+1 = (1 + 𝑟)𝑎𝑡 − 𝑐𝑡 + 𝑦𝑡 , 𝑡 ≥ 0.

For income we now take 𝑦𝑡 = 𝑝(𝑡) + 𝜎𝑤𝑡+1 where 𝑝(𝑡) ∶= 𝑚0 + 𝑚1 𝑡 + 𝑚2 𝑡² .
(In the next section we employ some tricks to implement a more sophisticated model.)
The coefficients 𝑚0 , 𝑚1 , 𝑚2 are chosen such that 𝑝(0) = 0, 𝑝(𝑇 /2) = 𝜇, and 𝑝(𝑇 ) = 0.
You can confirm that the specification 𝑚0 = 0, 𝑚1 = 𝑇 𝜇/(𝑇 /2)² , 𝑚2 = −𝜇/(𝑇 /2)² satisfies these constraints.
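A quick numerical confirmation of these coefficient choices is sketched below. The values of $T$ and $\mu$ are illustrative assumptions and need not match those used in the exercises.

```python
import numpy as np

T, μ = 50, 2.0                      # illustrative horizon and peak income

m0 = 0.0
m1 = T * μ / (T / 2)**2
m2 = -μ / (T / 2)**2

def p(t):
    """Quadratic income profile p(t) = m0 + m1 t + m2 t²."""
    return m0 + m1 * t + m2 * t**2

endpoints_and_peak = (p(0), p(T / 2), p(T))   # should equal (0, μ, 0)
```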
To put this into an LQ setting, consider the budget constraint, which becomes
𝑎𝑡+1 = (1 + 𝑟)𝑎𝑡 − 𝑢𝑡 − 𝑐̄ + 𝑚1 𝑡 + 𝑚2 𝑡² + 𝜎𝑤𝑡+1 (55.25)

The fact that 𝑎𝑡+1 is a linear function of (𝑎𝑡 , 1, 𝑡, 𝑡² ) suggests taking these four variables as the state vector 𝑥𝑡 .
Once a good choice of state and control (recall 𝑢𝑡 = 𝑐𝑡 − 𝑐̄) has been made, the remaining specifications fall into place
relatively easily.
Thus, for the dynamics we set
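$$
x_t := \begin{pmatrix} a_t \\ 1 \\ t \\ t^2 \end{pmatrix}, \quad
A := \begin{pmatrix}
1 + r & -\bar{c} & m_1 & m_2 \\
0 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 \\
0 & 1 & 2 & 1
\end{pmatrix}, \quad
B := \begin{pmatrix} -1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \quad
C := \begin{pmatrix} \sigma \\ 0 \\ 0 \\ 0 \end{pmatrix}
\tag{55.26}
$$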
If you expand the expression 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 + 𝐶𝑤𝑡+1 using this specification, you will find that assets follow (55.25)
as desired and that the other state variables also update appropriately.
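The expansion can also be checked numerically. In the sketch below the parameter values, the date $t$ and the test values for assets, control and shock are arbitrary assumptions; the point is only that the last two rows of $A$ advance $t$ to $t + 1$ and $t^2$ to $(t + 1)^2 = 1 + 2t + t^2$.

```python
import numpy as np

r, c_bar, σ = 0.05, 1.5, 0.15            # illustrative parameter values
m1, m2 = 0.16, -0.0032                   # illustrative polynomial coefficients

A = np.array([[1 + r, -c_bar, m1, m2],
              [0.0,    1.0,   0.0, 0.0],
              [0.0,    1.0,   1.0, 0.0],
              [0.0,    1.0,   2.0, 1.0]])
B = np.array([[-1.0], [0.0], [0.0], [0.0]])
C = np.array([[σ], [0.0], [0.0], [0.0]])

t, a_t, u_t, w_next = 7, 1.2, -0.3, 0.5  # arbitrary test values
x_t = np.array([a_t, 1.0, t, t**2])

x_next = A @ x_t + B @ np.array([u_t]) + C @ np.array([w_next])

# Assets computed directly from the budget constraint (55.25)
a_next_direct = (1 + r) * a_t - u_t - c_bar + m1 * t + m2 * t**2 + σ * w_next
```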
To implement preference specification (55.24) we take
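$$
Q := 1, \quad
R := \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix} \quad \text{and} \quad
R_f := \begin{pmatrix}
q & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix}
\tag{55.27}
$$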
The next figure shows a simulation of consumption and assets computed using the compute_sequence method of
lqcontrol.py with initial assets set to zero.
Once again, smooth consumption is a dominant feature of the sample paths.
The asset path exhibits dynamics consistent with standard life cycle theory.
Exercise 55.7.1 gives the full set of parameters used here and asks you to replicate the figure.
55.6.2 Application 2: A Permanent Income Model with Retirement
In the previous application, we generated income dynamics with an inverted U shape using polynomials and placed them
in an LQ framework.
It is arguably the case that this income process still contains unrealistic features.
A more common earning profile is where
1. income grows over working life, fluctuating around an increasing trend, with growth flattening off in later years
2. retirement follows, with lower but relatively stable (non-financial) income
Letting 𝐾 be the retirement date, we can express these income dynamics by
$$
y_t =
\begin{cases}
p(t) + \sigma w_{t+1} & \text{if } t \le K \\
s & \text{otherwise}
\end{cases}
\tag{55.28}
$$
Here
• 𝑝(𝑡) ∶= 𝑚1 𝑡 + 𝑚2 𝑡² with the coefficients 𝑚1 , 𝑚2 chosen such that 𝑝(𝐾) = 𝜇 and 𝑝(0) = 𝑝(2𝐾) = 0
• 𝑠 is retirement income
We suppose that preferences are unchanged and given by (55.16).
The budget constraint is also unchanged and given by 𝑎𝑡+1 = (1 + 𝑟)𝑎𝑡 − 𝑐𝑡 + 𝑦𝑡 .
Our aim is to solve this problem and simulate paths using the LQ techniques described in this lecture.
In fact, this is a nontrivial problem, as the kink in the dynamics (55.28) at 𝐾 makes it very difficult to express the law of
motion as a fixed-coefficient linear system.


However, we can still use our LQ methods here by suitably linking two component LQ problems.
These two LQ problems describe the consumer’s behavior during her working life (lq_working) and retirement
(lq_retired).
(This is possible because, in the two separate periods of life, the respective income processes [polynomial trend and
constant] each fit the LQ framework.)
The basic idea is that although the whole problem is not a single time-invariant LQ problem, it is still a dynamic programming problem, and hence we can use appropriate Bellman equations at every stage.
Based on this logic, we can
1. solve lq_retired by the usual backward induction procedure, iterating back to the start of retirement.
2. take the start-of-retirement value function generated by this process, and use it as the terminal condition 𝑅𝑓 to feed
into the lq_working specification.
3. solve lq_working by backward induction from this choice of 𝑅𝑓 , iterating back to the start of working life.
This process gives the entire life-time sequence of value functions and optimal policies.
The next figure shows one simulation based on this procedure.

The full set of parameters used in the simulation is discussed in Exercise 55.7.2, where you are asked to replicate the
figure.
Once again, the dominant feature observable in the simulation is consumption smoothing.
The asset path fits well with standard life cycle theory, with dissaving early in life followed by later saving.
Assets peak at retirement and subsequently decline.
55.6.3 Application 3: Monopoly with Adjustment Costs
Consider a monopolist facing the stochastic inverse demand function
𝑝𝑡 = 𝑎0 − 𝑎1 𝑞𝑡 + 𝑑𝑡
Here 𝑞𝑡 is output, and the demand shock 𝑑𝑡 follows
𝑑𝑡+1 = 𝜌𝑑𝑡 + 𝜎𝑤𝑡+1
where {𝑤𝑡 } is IID and standard normal.

The monopolist maximizes the expected discounted sum of present and future profits
$$
\mathbb{E} \left\{ \sum_{t=0}^{\infty} \beta^t \pi_t \right\}
\quad \text{where} \quad
\pi_t := p_t q_t - c q_t - \gamma (q_{t+1} - q_t)^2
\tag{55.29}
$$
Here
• $\gamma (q_{t+1} - q_t)^2$ represents adjustment costs
• 𝑐 is average cost of production
This can be formulated as an LQ problem and then solved and simulated, but first let’s study the problem and try to get
some intuition.
One way to start thinking about the problem is to consider what would happen if 𝛾 = 0.
Without adjustment costs there is no intertemporal trade-off, so the monopolist will choose output to maximize current
profit in each period.
It’s not difficult to show that profit-maximizing output is
$$
\bar q_t := \frac{a_0 - c + d_t}{2 a_1}
$$
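As a quick numerical check of this formula, the sketch below compares the closed-form output with a brute-force search over a grid of output levels. The parameter values $a_0 = 5$, $a_1 = 0.5$, $c = 2$ are those of Exercise 55.7.3; the demand shock $d = 0.3$ and the helper names are assumptions made for illustration.

```python
def profit(q, d, a0=5.0, a1=0.5, c=2.0):
    """Current profit p*q - c*q with p = a0 - a1*q + d and no adjustment cost."""
    return (a0 - a1 * q + d - c) * q


def q_bar(d, a0=5.0, a1=0.5, c=2.0):
    """Static profit-maximizing output (a0 - c + d) / (2 * a1)."""
    return (a0 - c + d) / (2 * a1)


d = 0.3                      # assumed value of the demand shock
q_star = q_bar(d)

# Grid of candidate outputs centered on q_star (step 0.01)
grid = [q_star + 0.01 * k for k in range(-200, 201)]
best_on_grid = max(grid, key=lambda q: profit(q, d))

print(q_star, best_on_grid)
```

Because profit is a concave quadratic in output, the grid maximizer coincides with the closed-form value.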
In light of this discussion, what we might expect for general 𝛾 is that
• if 𝛾 is close to zero, then 𝑞𝑡 will track the time path of 𝑞𝑡̄ relatively closely.
• if 𝛾 is larger, then 𝑞𝑡 will be smoother than 𝑞𝑡̄ , as the monopolist seeks to avoid adjustment costs.
This intuition turns out to be correct.
The following figures show simulations produced by solving the corresponding LQ problem.
The only difference in parameters across the figures is the size of 𝛾.
To produce these figures we converted the monopolist problem into an LQ problem.
The key to this conversion is to choose the right state — which can be a bit of an art.
Here we take 𝑥𝑡 = (𝑞𝑡̄ 𝑞𝑡 1)′ , while the control is chosen as 𝑢𝑡 = 𝑞𝑡+1 − 𝑞𝑡 .
We also manipulated the profit function slightly.




In (55.29), current profits are $\pi_t := p_t q_t - c q_t - \gamma (q_{t+1} - q_t)^2$.

Let's now replace $\pi_t$ in (55.29) with $\hat \pi_t := \pi_t - a_1 \bar q_t^2$.
This makes no difference to the solution, since $a_1 \bar q_t^2$ does not depend on the controls.
(In fact, we are just adding a constant term to (55.29), and optimizers are not affected by constant terms.)
The reason for making this substitution is that, as you will be able to verify, $\hat \pi_t$ reduces to the simple quadratic
$$
\hat \pi_t = -a_1 (q_t - \bar q_t)^2 - \gamma u_t^2
$$
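The verification is short. A sketch of the algebra, using only the demand curve, the definition of $\bar q_t$ and $u_t = q_{t+1} - q_t$:

```latex
\begin{aligned}
\pi_t &= (a_0 - a_1 q_t + d_t) q_t - c q_t - \gamma u_t^2 \\
      &= (a_0 - c + d_t) q_t - a_1 q_t^2 - \gamma u_t^2 \\
      &= 2 a_1 \bar q_t q_t - a_1 q_t^2 - \gamma u_t^2
         \qquad \text{since } a_0 - c + d_t = 2 a_1 \bar q_t, \\
\hat \pi_t &= \pi_t - a_1 \bar q_t^2
      = -a_1 \left( q_t^2 - 2 \bar q_t q_t + \bar q_t^2 \right) - \gamma u_t^2
      = -a_1 (q_t - \bar q_t)^2 - \gamma u_t^2 .
\end{aligned}
```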
After negation to convert to a minimization problem, the objective becomes

$$
\min \mathbb{E} \sum_{t=0}^{\infty} \beta^t \left\{ a_1 (q_t - \bar q_t)^2 + \gamma u_t^2 \right\}
\tag{55.30}
$$
It’s now relatively straightforward to find 𝑅 and 𝑄 such that (55.30) can be written as (55.20).
Furthermore, the matrices 𝐴, 𝐵 and 𝐶 from (55.1) can be found by writing down the dynamics of each element of the
state.
Exercise 55.7.3 asks you to complete this process, and reproduce the preceding figures.
55.7 Exercises
Exercise 55.7.1
Replicate the figure with polynomial income shown above.
The parameters are $r = 0.05$, $\beta = 1/(1 + r)$, $\bar c = 1.5$, $\mu = 2$, $\sigma = 0.15$, $T = 50$ and $q = 10^4$.

Here’s one solution.
We use some fancy plot commands to get a certain style — feel free to use simpler ones.
The model is an LQ permanent income / life-cycle model with hump-shaped income
$$
y_t = m_1 t + m_2 t^2 + \sigma w_{t+1}
$$
where $\{w_t\}$ is IID $N(0, 1)$ and the coefficients $m_1$ and $m_2$ are chosen so that $p(t) = m_1 t + m_2 t^2$ has an inverted U shape with
• $p(0) = 0$, $p(T/2) = \mu$, and
• $p(T) = 0$
# Model parameters
r = 0.05
β = 1/(1 + r)
T = 50
c_bar = 1.5
σ = 0.15
μ = 2


q = 1e4
m1 = T * (μ/(T/2)**2)
m2 = -(μ/(T/2)**2)
Q = 1
R = np.zeros((4, 4))
Rf = np.zeros((4, 4))
Rf[0, 0] = q
A = [[1 + r, -c_bar, m1, m2],
[0, 1, 0, 0],
[0, 1, 1, 0],
[0, 1, 2, 1]]
B = [[-1],
[ 0],
[ 0],
[ 0]]
C = [[σ],
[0],
[0],
[0]]

lq = LQ(Q, R, A, B, C, beta=β, T=T, Rf=Rf)
x0 = (0, 1, 0, 0)
xp, up, wp = lq.compute_sequence(x0)
# Convert results back to assets, consumption and income

ap = xp[0, :] # Assets
c = up.flatten() + c_bar # Consumption
time = np.arange(1, T+1)
income = σ * wp[0, 1:] + m1 * time + m2 * time**2 # Income
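# Plot results
n_rows = 2
fig, axes = plt.subplots(n_rows, 1, figsize=(12, 10))
plt.subplots_adjust(hspace=0.5)
bbox = (0., 1.02, 1., .102)
legend_args = {'bbox_to_anchor': bbox, 'loc': 3, 'mode': 'expand'}
p_args = {'lw': 2, 'alpha': 0.7}
axes[0].plot(range(1, T+1), income, 'g-', label="non-financial income",
             **p_args)
axes[0].plot(range(T), c, 'k-', label="consumption", **p_args)
axes[1].plot(range(T+1), ap.flatten(), 'b-', label="assets", **p_args)
axes[1].plot(range(T+1), np.zeros(T+1), 'k-')
for ax in axes:
    ax.grid()
    ax.set_xlabel('Time')
    ax.legend(ncol=2, **legend_args)
plt.show()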

Exercise 55.7.2
Replicate the figure on work and retirement shown above.
The parameters are $r = 0.05$, $\beta = 1/(1 + r)$, $\bar c = 4$, $\mu = 4$, $\sigma = 0.35$, $K = 40$, $T = 60$, $s = 1$ and $q = 10^4$.
To understand the overall procedure, carefully read the section containing that figure.
Hint: First, in order to make our approach work, we must ensure that both LQ problems have the same state variables
and control.
As with previous applications, the control can be set to $u_t = c_t - \bar c$.
For lq_working, 𝑥𝑡 , 𝐴, 𝐵, 𝐶 can be chosen as in (55.26).
• Recall that 𝑚1 , 𝑚2 are chosen so that 𝑝(𝐾) = 𝜇 and 𝑝(2𝐾) = 0.
For lq_retired, use the same definition of 𝑥𝑡 and 𝑢𝑡 , but modify 𝐴, 𝐵, 𝐶 to correspond to constant income 𝑦𝑡 = 𝑠.
For lq_retired, set preferences as in (55.27).

For lq_working, preferences are the same, except that 𝑅𝑓 should be replaced by the final value function that emerges
from iterating lq_retired back to the start of retirement.
With some careful footwork, the simulation can be generated by patching together the simulations from these two separate
models.

This is a permanent income / life-cycle model with polynomial growth in income over working life followed by a fixed
retirement income.
The model is solved by combining two LQ programming problems as described in the lecture.
# Model parameters
r = 0.05
β = 1/(1 + r)
T = 60
K = 40
c_bar = 4
σ = 0.35
μ = 4
q = 1e4
s = 1
m1 = 2 * μ/K
m2 = -μ/K**2
# Formulate LQ problem 1 (retirement)

Q = 1
R = np.zeros((4, 4))
Rf = np.zeros((4, 4))
Rf[0, 0] = q
A = [[1 + r, s - c_bar, 0, 0],
[0, 1, 0, 0],
[0, 1, 1, 0],
[0, 1, 2, 1]]
B = [[-1],
[ 0],
[ 0],
[ 0]]
C = [[0],
[0],
[0],
[0]]
# Initialize LQ instance for retired agent

lq_retired = LQ(Q, R, A, B, C, beta=β, T=T-K, Rf=Rf)
# Iterate back to start of retirement, record final value function
for i in range(T-K):
    lq_retired.update_values()
Rf2 = lq_retired.P
# Formulate LQ problem 2 (working life)

A = [[1 + r, -c_bar, m1, m2],
[0, 1, 0, 0],
[0, 1, 1, 0],
[0, 1, 2, 1]]
B = [[-1],
[ 0],
[ 0],
[ 0]]
C = [[σ],
[0],
[0],
[0]]
# Set up working life LQ instance with terminal Rf from lq_retired

lq_working = LQ(Q, R, A, B, C, beta=β, T=K, Rf=Rf2)
# Simulate working state / control paths

x0 = (0, 1, 0, 0)
xp_w, up_w, wp_w = lq_working.compute_sequence(x0)
# Simulate retirement paths (note the initial condition)
xp_r, up_r, wp_r = lq_retired.compute_sequence(xp_w[:, K])
# Convert results back to assets, consumption and income

xp = np.column_stack((xp_w, xp_r[:, 1:]))
assets = xp[0, :] # Assets
up = np.column_stack((up_w, up_r))
c = up.flatten() + c_bar # Consumption
time = np.arange(1, K+1)

income_w = σ * wp_w[0, 1:K+1] + m1 * time + m2 * time**2 # Income
income_r = np.full(T-K, s)
income = np.concatenate((income_w, income_r))
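# Plot results
n_rows = 2
fig, axes = plt.subplots(n_rows, 1, figsize=(12, 10))
plt.subplots_adjust(hspace=0.5)
bbox = (0., 1.02, 1., .102)
legend_args = {'bbox_to_anchor': bbox, 'loc': 3, 'mode': 'expand'}
p_args = {'lw': 2, 'alpha': 0.7}
axes[0].plot(range(1, T+1), income, 'g-', label="non-financial income",
             **p_args)
axes[0].plot(range(T), c, 'k-', label="consumption", **p_args)
axes[1].plot(range(T+1), assets, 'b-', label="assets", **p_args)
axes[1].plot(range(T+1), np.zeros(T+1), 'k-')
for ax in axes:
    ax.grid()
    ax.set_xlabel('Time')
    ax.legend(ncol=2, **legend_args)
plt.show()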

Exercise 55.7.3
Reproduce the figures from the monopolist application given above.
For parameters, use 𝑎0 = 5, 𝑎1 = 0.5, 𝜎 = 0.15, 𝜌 = 0.9, 𝛽 = 0.95 and 𝑐 = 2, while 𝛾 varies between 1 and 50 (see
figures).

The first task is to find the matrices 𝐴, 𝐵, 𝐶, 𝑄, 𝑅 that define the LQ problem.
Recall that 𝑥𝑡 = (𝑞𝑡̄ 𝑞𝑡 1)′ , while 𝑢𝑡 = 𝑞𝑡+1 − 𝑞𝑡 .
Letting $m_0 := (a_0 - c)/(2a_1)$ and $m_1 := 1/(2a_1)$, we can write $\bar q_t = m_0 + m_1 d_t$, and then, with some manipulation
$$
\bar q_{t+1} = m_0 (1 - \rho) + \rho \bar q_t + m_1 \sigma w_{t+1}
$$
By our definition of 𝑢𝑡 , the dynamics of 𝑞𝑡 are 𝑞𝑡+1 = 𝑞𝑡 + 𝑢𝑡 .
Using these facts you should be able to build the correct 𝐴, 𝐵, 𝐶 matrices (and then check them against those found in
the solution code below).
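The law of motion for $\bar q_t$ follows from substituting $d_{t+1} = \rho d_t + \sigma w_{t+1}$ and $d_t = (\bar q_t - m_0)/m_1$ into $\bar q_{t+1} = m_0 + m_1 d_{t+1}$. The following sketch checks the recursion by simulation, using the exercise's parameter values; the seed, the starting value $d_0 = 0$ and the variable names are assumptions made for illustration.

```python
import random

# Parameters from Exercise 55.7.3
a0, a1, c, rho, sigma = 5.0, 0.5, 2.0, 0.9, 0.15
m0 = (a0 - c) / (2 * a1)
m1 = 1 / (2 * a1)

rng = random.Random(0)       # fixed seed, chosen arbitrarily
d = 0.0                      # assumed initial demand shock
q_bar = m0 + m1 * d
max_gap = 0.0

for _ in range(100):
    w = rng.gauss(0.0, 1.0)
    d_next = rho * d + sigma * w
    # Recursion stated in the text
    q_bar_next = m0 * (1 - rho) + rho * q_bar + m1 * sigma * w
    # Compare with the direct definition q_bar = m0 + m1 * d
    max_gap = max(max_gap, abs(q_bar_next - (m0 + m1 * d_next)))
    d, q_bar = d_next, q_bar_next

print(max_gap)               # should be of the order of rounding error
```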

Suitable 𝑅, 𝑄 matrices can be found by inspecting the objective function, which we repeat here for convenience:
$$
\min \mathbb{E} \sum_{t=0}^{\infty} \beta^t \left\{ a_1 (q_t - \bar q_t)^2 + \gamma u_t^2 \right\}
$$
Our solution code is
# Model parameters
a0 = 5
a1 = 0.5
σ = 0.15
ρ = 0.9
γ = 1
β = 0.95
c = 2
T = 120
# Useful constants
m0 = (a0-c)/(2 * a1)
m1 = 1/(2 * a1)
# Formulate LQ problem
Q = γ
R = [[ a1, -a1, 0],
[-a1, a1, 0],
[ 0, 0, 0]]
A = [[ρ, 0, m0 * (1 - ρ)],
[0, 1, 0],
[0, 0, 1]]
B = [[0],
[1],
[0]]
C = [[m1 * σ],
[ 0],
[ 0]]
lq = LQ(Q, R, A, B, C=C, beta=β)
# Simulate state / control paths

x0 = (m0, 2, 1)
xp, up, wp = lq.compute_sequence(x0, ts_length=150)
q_bar = xp[0, :]
q = xp[1, :]
# Plot simulation results
fig, ax = plt.subplots(figsize=(10, 6.5))
# Some fancy plotting stuff -- simplify if you prefer
bbox = (0., 1.01, 1., .101)
legend_args = {'bbox_to_anchor': bbox, 'loc': 3, 'mode': 'expand'}
p_args = {'lw': 2, 'alpha': 0.6}
time = range(len(q))
ax.set(xlabel='Time', xlim=(0, max(time)))
ax.plot(time, q_bar, 'k-', lw=2, alpha=0.6, label=r'$\bar q_t$')
ax.plot(time, q, 'b-', lw=2, alpha=0.6, label='$q_t$')
ax.legend(ncol=2, **legend_args)
s = rf'dynamics with $\gamma = {γ}$'
ax.text(max(time) * 0.6, 1 * q_bar.max(), s, fontsize=14)
plt.show()


CHAPTER
FIFTYSIX
LAGRANGIAN FOR LQ CONTROL
import numpy as np
from quantecon import LQ
from scipy.linalg import schur
56.1 Overview
This lecture is a sequel to the lecture on linear quadratic dynamic programming.

It can also be regarded as presenting invariant subspace techniques that extend ones that we encountered earlier in the lecture on stability in linear rational expectations models.
We present a Lagrangian formulation of an infinite horizon linear quadratic undiscounted dynamic programming problem.
Such a problem is also sometimes called an optimal linear regulator problem.
A Lagrangian formulation
• carries insights about connections between stability and optimality
• is the basis for fast algorithms for solving Riccati equations
• opens the way to constructing solutions of dynamic systems that don’t come directly from an intertemporal optimization problem
A key tool in this lecture is the concept of an 𝑛 × 𝑛 symplectic matrix.
A symplectic matrix has eigenvalues that occur in reciprocal pairs, meaning that if $\lambda_i \in (-1, 1)$ is an eigenvalue, then so is $\lambda_i^{-1}$.
This reciprocal pairs property of the eigenvalues of a matrix is a tell-tale sign that the matrix describes the joint dynamics
of a system of equations describing the states and costates that constitute first-order necessary conditions for solving an
undiscounted linear-quadratic infinite-horizon optimization problem.
The symplectic matrix that will interest us describes the first-order dynamics of state and co-state vectors of an optimally
controlled system.
In focusing on eigenvalues and eigenvectors of this matrix, we capitalize on an analysis of invariant subspaces.
These invariant subspace formulations of LQ dynamic programming problems provide a bridge between recursive (i.e.,
dynamic programming) formulations and classical formulations of linear control and linear filtering problems that make
use of related matrix decompositions (see for example this lecture and this lecture).
While most of this lecture focuses on undiscounted problems, later sections describe handy ways of transforming discounted problems to undiscounted ones.
The techniques in this lecture will prove useful when we study Stackelberg and Ramsey problems in this lecture.
56.2 Undiscounted LQ DP Problem
The problem is to choose a sequence of controls $\{u_t\}_{t=0}^{\infty}$ to maximize the criterion
$$
- \sum_{t=0}^{\infty} \{ x_t' R x_t + u_t' Q u_t \}
$$
subject to 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 , where 𝑥0 is a given initial state vector.

Here 𝑥𝑡 is an (𝑛 × 1) vector of state variables, 𝑢𝑡 is a (𝑘 × 1) vector of controls, 𝑅 is a positive semidefinite symmetric
matrix, 𝑄 is a positive definite symmetric matrix, 𝐴 is an (𝑛 × 𝑛) matrix, and 𝐵 is an (𝑛 × 𝑘) matrix.
The optimal value function turns out to be quadratic, 𝑉 (𝑥) = −𝑥′ 𝑃 𝑥, where 𝑃 is a positive semidefinite symmetric
matrix.
Using the transition law to eliminate next period’s state, the Bellman equation becomes
$$
-x' P x = \max_u \{ -x' R x - u' Q u - (Ax + Bu)' P (Ax + Bu) \}
\tag{56.1}
$$
Note: We use the following rules for differentiating quadratic and bilinear matrix forms: $\frac{\partial x' A x}{\partial x} = (A + A') x$; $\frac{\partial y' B z}{\partial y} = B z$, $\frac{\partial y' B z}{\partial z} = B' y$.
The first-order necessary conditions for the maximum problem on the right side of equation (56.1) are
$$
(Q + B' P B) u = -B' P A x,
$$
which implies that an optimal decision rule for 𝑢 is
𝑢 = −(𝑄 + 𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 𝐴𝑥
or
𝑢 = −𝐹 𝑥,
where
𝐹 = (𝑄 + 𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 𝐴.
Substituting 𝑢 = −(𝑄 + 𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 𝐴𝑥 into the right side of equation (56.1) and rearranging gives
𝑃 = 𝑅 + 𝐴′ 𝑃 𝐴 − 𝐴′ 𝑃 𝐵(𝑄 + 𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 𝐴. (56.2)
Equation (56.2) is called an algebraic matrix Riccati equation.

There are multiple solutions of equation (56.2).
But only one of them is positive definite.
The positive definite solution is associated with the maximum of our problem.
Equation (56.2) expresses the matrix $P$ as an implicit function of the matrices $R, Q, A, B$.
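One simple way to compute the relevant solution of (56.2) is to iterate on the right side of the equation starting from $P = 0$. The following is a minimal sketch of that idea, not the routine used later in this lecture; the function name `solve_riccati` and the example matrices are assumptions chosen for illustration.

```python
import numpy as np


def solve_riccati(A, B, R, Q, tol=1e-10, max_iter=10_000):
    """Iterate P <- R + A'PA - A'PB (Q + B'PB)^{-1} B'PA from P = 0."""
    P = np.zeros_like(R)
    for _ in range(max_iter):
        F = np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A)
        P_new = R + A.T @ P @ A - A.T @ P @ B @ F
        if np.max(np.abs(P_new - P)) < tol:
            # Recompute the decision rule F at the converged P
            F = np.linalg.solve(Q + B.T @ P_new @ B, B.T @ P_new @ A)
            return P_new, F
        P = P_new
    raise RuntimeError("Riccati iteration did not converge")


# Illustrative (assumed) matrices: a controllable two-state system
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
R = np.eye(2)
Q = np.array([[1.0]])

P, F = solve_riccati(A, B, R, Q)

# Residual of the algebraic Riccati equation (56.2) at the computed P
residual = (R + A.T @ P @ A
            - A.T @ P @ B @ np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A) - P)
print(np.max(np.abs(residual)))
```

The decision rule $u = -Fx$ implied by the computed $P$ should make $A - BF$ a stable matrix.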

Notice that the gradient of the value function is
$$
\frac{\partial V(x)}{\partial x} = -2 P x
\tag{56.3}
$$
We shall use fact (56.3) later.
56.3 Lagrangian
For the undiscounted optimal linear regulator problem, form the Lagrangian
$$
L = - \sum_{t=0}^{\infty} \left\{ x_t' R x_t + u_t' Q u_t + 2 \mu_{t+1}' [A x_t + B u_t - x_{t+1}] \right\}
\tag{56.4}
$$
where 2𝜇𝑡+1 is a vector of Lagrange multipliers on the time 𝑡 transition law 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 .
(We put the 2 in front of 𝜇𝑡+1 to make things match up nicely with equation (56.3).)
First-order conditions for maximization with respect to $\{u_t, x_{t+1}\}_{t=0}^{\infty}$ are
$$
\begin{aligned}
2 Q u_t + 2 B' \mu_{t+1} &= 0, \quad t \ge 0 \\
\mu_t &= R x_t + A' \mu_{t+1}, \quad t \ge 1.
\end{aligned}
\tag{56.5}
$$
Define $\mu_0$ to be a vector of shadow prices of $x_0$ and apply an envelope condition to (56.4) to deduce that
$$
\mu_0 = R x_0 + A' \mu_1,
$$
which is a time 𝑡 = 0 counterpart to the second equation of system (56.5).
An important fact is that
𝜇𝑡+1 = 𝑃 𝑥𝑡+1 (56.6)
where $P$ is a positive definite matrix that solves the algebraic Riccati equation (56.2).
Thus, from equations (56.3) and (56.6), −2𝜇𝑡 is the gradient of the value function with respect to 𝑥𝑡 .
The Lagrange multiplier vector 𝜇𝑡 is often called the costate vector that corresponds to the state vector 𝑥𝑡 .
It is useful to proceed with the following steps:
• solve the first equation of (56.5) for 𝑢𝑡 in terms of 𝜇𝑡+1 .
• substitute the result into the law of motion 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 .
• arrange the resulting equation and the second equation of (56.5) into the form
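$$
L \begin{pmatrix} x_{t+1} \\ \mu_{t+1} \end{pmatrix}
= N \begin{pmatrix} x_t \\ \mu_t \end{pmatrix}, \quad t \ge 0,
\tag{56.7}
$$
where
$$
L = \begin{pmatrix} I & B Q^{-1} B' \\ 0 & A' \end{pmatrix}, \quad
N = \begin{pmatrix} A & 0 \\ -R & I \end{pmatrix}.
$$
When $L$ is of full rank (i.e., when $A$ is of full rank), we can write system (56.7) as
$$
\begin{pmatrix} x_{t+1} \\ \mu_{t+1} \end{pmatrix}
= M \begin{pmatrix} x_t \\ \mu_t \end{pmatrix}
\tag{56.8}
$$
where
$$
M \equiv L^{-1} N =
\begin{pmatrix}
A + B Q^{-1} B' A'^{-1} R & -B Q^{-1} B' A'^{-1} \\
-A'^{-1} R & A'^{-1}
\end{pmatrix}.
\tag{56.9}
$$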

56.4 State-Costate Dynamics
We seek to solve the difference equation system (56.8) for a sequence $\{x_t\}_{t=0}^{\infty}$ that satisfies
• an initial condition for $x_0$
• a terminal condition $\lim_{t \to +\infty} x_t = 0$
This terminal condition reflects our desire for a stable solution, one that does not diverge as 𝑡 → ∞.
We inherit our wish for stability of the {𝑥𝑡 } sequence from a desire to maximize
$$
- \sum_{t=0}^{\infty} [ x_t' R x_t + u_t' Q u_t ],
$$
which requires that 𝑥′𝑡 𝑅𝑥𝑡 converge to zero as 𝑡 → +∞.
56.5 Reciprocal Pairs Property
To proceed, we study properties of the (2𝑛 × 2𝑛) matrix 𝑀 defined in (56.9).

It helps to introduce a (2𝑛 × 2𝑛) matrix
$$
J = \begin{pmatrix} 0 & -I_n \\ I_n & 0 \end{pmatrix}.
$$
The rank of 𝐽 is 2𝑛.

Definition: A matrix 𝑀 is called symplectic if
𝑀𝐽𝑀′ = 𝐽. (56.10)
Salient properties of symplectic matrices that are readily verified include:

• If 𝑀 is symplectic, then 𝑀 2 is symplectic
• If $M$ is symplectic, then $\det(M) = 1$
It can be verified directly that 𝑀 in equation (56.9) is symplectic.
It follows from equation (56.10) and from the fact 𝐽 −1 = 𝐽 ′ = −𝐽 that for any symplectic matrix 𝑀 ,
𝑀 ′ = 𝐽 −1 𝑀 −1 𝐽 . (56.11)
Equation (56.11) states that 𝑀 ′ is related to the inverse of 𝑀 by a similarity transformation.

For square matrices, recall that
• similar matrices share eigenvalues
• eigenvalues of the inverse of a matrix are inverses of eigenvalues of the matrix
• a matrix and its transpose share eigenvalues
It then follows from equation (56.11) that the eigenvalues of 𝑀 occur in reciprocal pairs: if 𝜆 is an eigenvalue of 𝑀 , so
is 𝜆−1 .
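The symplectic and reciprocal pairs properties are easy to check numerically. The sketch below builds $M = L^{-1} N$ from (56.7) for a small example and inspects $M J M' - J$, $\det(M)$ and the moduli of the eigenvalues. The helper `build_M` and the example matrices are assumptions made for illustration; $A$ must be nonsingular.

```python
import numpy as np


def build_M(A, B, R, Q):
    """State-costate matrix M = L^{-1} N from (56.7); requires A nonsingular."""
    n = A.shape[0]
    L = np.block([[np.eye(n), B @ np.linalg.solve(Q, B.T)],
                  [np.zeros((n, n)), A.T]])
    N = np.block([[A, np.zeros((n, n))],
                  [-R, np.eye(n)]])
    return np.linalg.solve(L, N)


# Illustrative (assumed) matrices
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
R = np.eye(2)
Q = np.array([[1.0]])

n = A.shape[0]
M = build_M(A, B, R, Q)
J = np.block([[np.zeros((n, n)), -np.eye(n)],
              [np.eye(n), np.zeros((n, n))]])

# M J M' - J should vanish up to rounding error
symplectic_gap = np.max(np.abs(M @ J @ M.T - J))

# Sort eigenvalue moduli; the smallest pairs with the largest, and so on
moduli = np.sort(np.abs(np.linalg.eigvals(M)))
pair_products = moduli * moduli[::-1]

print(symplectic_gap, np.linalg.det(M), pair_products)
```

Each entry of `pair_products` multiplies the modulus of an eigenvalue inside the unit circle by that of its partner outside, so all entries should be close to one.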
Write equation (56.8) as
𝑦𝑡+1 = 𝑀 𝑦𝑡 (56.12)

where $y_t = \begin{pmatrix} x_t \\ \mu_t \end{pmatrix}$.
Consider a triangularization of $M$
$$
V^{-1} M V = \begin{pmatrix} W_{11} & W_{12} \\ 0 & W_{22} \end{pmatrix}
\tag{56.13}
$$
where
• each block on the right side is (𝑛 × 𝑛)
• 𝑉 is nonsingular
• all eigenvalues of 𝑊22 exceed 1 in modulus
• all eigenvalues of 𝑊11 are less than 1 in modulus
56.6 Schur decomposition
The Schur decomposition and the eigenvalue decomposition are two decompositions of the form (56.13).
Using (56.13), write equation (56.12) as
$$
y_{t+1} = V W V^{-1} y_t.
\tag{56.14}
$$
A solution of equation (56.14) for arbitrary initial condition $y_0$ is evidently
$$
y_t = V \begin{bmatrix} W_{11}^t & W_{12,t} \\ 0 & W_{22}^t \end{bmatrix} V^{-1} y_0
\tag{56.15}
$$
where $W_{12,t} = W_{12}$ for $t = 1$ and for $t \ge 2$ obeys the recursion
$$
W_{12,t} = W_{11}^{t-1} W_{12} + W_{12,t-1} W_{22}
$$
and where $W_{ii}^t$ is $W_{ii}$ raised to the $t$th power.

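Premultiplying (56.15) by $V^{-1}$ gives
$$
\begin{pmatrix} y_{1t}^* \\ y_{2t}^* \end{pmatrix}
= \begin{bmatrix} W_{11}^t & W_{12,t} \\ 0 & W_{22}^t \end{bmatrix}
\begin{pmatrix} y_{10}^* \\ y_{20}^* \end{pmatrix}
$$
where $y_t^* = V^{-1} y_t$, and in particular where
$$
y_{2t}^* = V^{21} x_t + V^{22} \mu_t,
\tag{56.16}
$$
and where $V^{ij}$ denotes the $(i, j)$ piece of the partitioned $V^{-1}$ matrix.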

Because $W_{22}$ is an unstable matrix, $y_t^*$ will diverge unless $y_{20}^* = 0$.
To attain stability, we must impose $y_{20}^* = 0$, which from equation (56.16) implies
$$
V^{21} x_0 + V^{22} \mu_0 = 0
$$
or
$$
\mu_0 = -(V^{22})^{-1} V^{21} x_0.
$$

This equation replicates itself over time in the sense that it implies
$$
\mu_t = -(V^{22})^{-1} V^{21} x_t.
$$
But notice that because $(V^{21} \; V^{22})$ is the second row block of the inverse of $V$, it follows that
$$
(V^{21} \; V^{22}) \begin{pmatrix} V_{11} \\ V_{21} \end{pmatrix} = 0
$$
which implies
$$
V^{21} V_{11} + V^{22} V_{21} = 0.
$$
Therefore,
$$
-(V^{22})^{-1} V^{21} = V_{21} V_{11}^{-1}.
$$
So we can write
$$
\mu_0 = V_{21} V_{11}^{-1} x_0
$$
and
$$
\mu_t = V_{21} V_{11}^{-1} x_t.
$$
However, we know that $\mu_t = P x_t$, where $P$ is the matrix that solves the Riccati equation.
Thus, the preceding argument establishes that
$$
P = V_{21} V_{11}^{-1}.
\tag{56.17}
$$
Remarkably, formula (56.17) provides us with a computationally efficient way of computing the positive definite matrix
𝑃 that solves the algebraic Riccati equation (56.2) that emerges from dynamic programming.
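Formula (56.17) can be illustrated with an eigenvalue decomposition, which is one of the decompositions of the form (56.13) when $M$ is diagonalizable. The sketch below stacks the eigenvectors associated with eigenvalues inside the unit circle, forms $V_{21} V_{11}^{-1}$, and checks that the result satisfies (56.2). It is a sketch under assumed example matrices and assumes distinct eigenvalues; it is not the Schur-based routine used in the application that follows.

```python
import numpy as np

# Illustrative (assumed) matrices
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
R = np.eye(2)
Q = np.array([[1.0]])
n = A.shape[0]

# Build M as in (56.9)
BQB = B @ np.linalg.solve(Q, B.T)
A_T_inv = np.linalg.inv(A.T)
M = np.block([[A + BQB @ A_T_inv @ R, -BQB @ A_T_inv],
              [-A_T_inv @ R, A_T_inv]])

# Keep the eigenvectors whose eigenvalues lie inside the unit circle
eigvals, eigvecs = np.linalg.eig(M)
stable = np.abs(eigvals) < 1
V_stable = eigvecs[:, stable]
V11, V21 = V_stable[:n, :], V_stable[n:, :]

# P = V21 V11^{-1}; any imaginary part is rounding noise
P = np.real(V21 @ np.linalg.inv(V11))

# Residual of the algebraic Riccati equation (56.2)
riccati_gap = np.max(np.abs(
    R + A.T @ P @ A
    - A.T @ P @ B @ np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A) - P))
print(P)
print(riccati_gap)
```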
This same method can be applied to compute the solution of any system of the form (56.8) if a solution exists, even if
eigenvalues of 𝑀 fail to occur in reciprocal pairs.
The method will typically work so long as the eigenvalues of 𝑀 split half inside and half outside the unit circle.
Systems in which eigenvalues (properly adjusted for discounting) fail to occur in reciprocal pairs arise when the system
being solved is an equilibrium of a model in which there are distortions that prevent there being any optimum problem
that the equilibrium solves. See [Ljungqvist and Sargent, 2018], ch 12.
56.7 Application
Here we demonstrate the computation with an example which is the deterministic version of an example borrowed from
this quantecon lecture.
# Model parameters
r = 0.05
c_bar = 2
μ = 1
Q = np.array([[1]])
R = np.zeros((2, 2))


A = [[1 + r, -c_bar + μ],
[0, 1]]
B = [[-1],
[0]]
# Construct an LQ instance
lq = LQ(Q, R, A, B)
Given matrices 𝐴, 𝐵, 𝑄, 𝑅, we can then compute 𝐿, 𝑁 , and 𝑀 = 𝐿−1 𝑁 .
def construct_LNM(A, B, Q, R):
    # dimension of the state, read from A rather than from a global LQ instance
    n = np.asarray(A).shape[0]
    # construct L and N
    L = np.zeros((2*n, 2*n))
    L[:n, :n] = np.eye(n)
    L[:n, n:] = B @ np.linalg.inv(Q) @ B.T
    L[n:, n:] = A.T
    N = np.zeros((2*n, 2*n))
    N[:n, :n] = A
    N[n:, :n] = -R
    N[n:, n:] = np.eye(n)
    # compute M
    M = np.linalg.inv(L) @ N
    return L, N, M
L, N, M = construct_LNM(lq.A, lq.B, lq.Q, lq.R)
M
array([[ 1.05      , -1.        , -0.95238095,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.95238095,  0.        ],
       [ 0.        ,  0.        ,  0.95238095,  1.        ]])
Let’s verify that 𝑀 is symplectic.
n = lq.n
J = np.zeros((2*n, 2*n))
J[n:, :n] = np.eye(n)
J[:n, n:] = -np.eye(n)
M @ J @ M.T - J
array([[-1.32169408e-17, 0.00000000e+00, 0.00000000e+00,

0.00000000e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00]])
We can compute the eigenvalues of 𝑀 using np.linalg.eigvals, arranged in ascending order.
eigvals = sorted(np.linalg.eigvals(M))
eigvals
[0.9523809523809523, 1.0, 1.0, 1.05]
When we apply Schur decomposition such that 𝑀 = 𝑉 𝑊 𝑉 −1 , we want

• the upper left block of 𝑊 , 𝑊11 , to have all of its eigenvalues less than 1 in modulus, and
• the lower right block 𝑊22 to have eigenvalues that exceed 1 in modulus.
To get what we want, let’s define a sorting function that tells scipy.linalg.schur to sort the corresponding eigenvalues with modulus smaller than 1 to the upper left.
stable_eigvals = eigvals[:n]
def sort_fun(x):
    "Sort the eigenvalues with modulus smaller than 1 to the top-left."
    if x in stable_eigvals:
        stable_eigvals.pop(stable_eigvals.index(x))
        return True
    else:
        return False
W, V, _ = schur(M, sort=sort_fun)
W
array([[ 1.        , -0.02316402, -1.00085948, -0.95000594],
       [ 0.        ,  0.95238095, -0.00237501, -0.95325452],
       [ 0.        ,  0.        ,  1.05      ,  0.02432222],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])
V
array([[ 0.99875234, 0.00121459, -0.04992284, 0. ],

[ 0.04993762, -0.02429188, 0.99845688, 0. ],
[ 0. , 0.04992284, 0.00121459, 0.99875234],
[ 0. , -0.99845688, -0.02429188, 0.04993762]])
We can check the modulus of eigenvalues of 𝑊11 and 𝑊22 .

Since they are both triangular matrices, eigenvalues are the diagonal elements.

# W11
np.diag(W[:n, :n])
array([1. , 0.95238095])
# W22
np.diag(W[n:, n:])
array([1.05, 1. ])
The following functions wrap 𝑀 matrix construction, Schur decomposition, and stability-imposing computation of 𝑃 .
def stable_solution(M, verbose=True):
    """
    Given a system of linear difference equations

        y' = |a b| y
        x' = |c d| x

    which is potentially unstable, find the solution
    by imposing stability.

    Parameter
    ---------
    M : np.ndarray(float)
        The matrix represents the linear difference equations system.
    """
    n = M.shape[0] // 2
    stable_eigvals = list(sorted(np.linalg.eigvals(M))[:n])

    def sort_fun(x):
        "Sort the eigenvalues with modulus smaller than 1 to the top-left."
        if x in stable_eigvals:
            stable_eigvals.pop(stable_eigvals.index(x))
            return True
        else:
            return False

    W, V, _ = schur(M, sort=sort_fun)
    if verbose:
        print('eigenvalues:\n')
        print(' W11: {}'.format(np.diag(W[:n, :n])))
        print(' W22: {}'.format(np.diag(W[n:, n:])))
    # compute V21 V11^{-1}
    P = V[n:, :n] @ np.linalg.inv(V[:n, :n])
    return W, V, P
def stationary_P(lq, verbose=True):
    """
    Computes the matrix :math:`P` that represents the value function

        V(x) = x' P x

    in the infinite horizon case. Computation is via imposing stability
    on the solution path and using Schur decomposition.

    Parameters
    ----------
    lq : qe.LQ
        QuantEcon class for analyzing linear quadratic optimal control
        problems of infinite horizon form.

    Returns
    -------
    P : array_like(float)
        P matrix in the value function representation.
    """
    Q = lq.Q
    R = lq.R
    A = lq.A * lq.beta ** (1/2)
    B = lq.B * lq.beta ** (1/2)
    n, k = lq.n, lq.k
    L, N, M = construct_LNM(A, B, Q, R)
    W, V, P = stable_solution(M, verbose=verbose)
    return P
# compute P
stationary_P(lq)
eigenvalues:
W11: [1. 0.95238095]

W22: [1.05 1. ]
array([[ 0.1025, -2.05 ],

[-2.05 , 41. ]])
Note that the matrix 𝑃 computed in this way is close to what we get from the routine in quantecon that solves an algebraic
Riccati equation by iterating to convergence on a Riccati difference equation.
The small difference comes from computational errors and will decrease as we increase the maximum number of iterations
or decrease the tolerance for convergence.
lq.stationary_values()
(array([[ 0.1025, -2.05 ],

[-2.05 , 41.01 ]]),
array([[-0.09761905, 1.95238095]]),
0)
Using a Schur decomposition is much more efficient.

%%timeit
stationary_P(lq, verbose=False)
139 µs ± 186 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
lq.stationary_values()
2.15 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
56.8 Other Applications
The preceding approach to imposing stability on a system of potentially unstable linear difference equations is not limited
to linear quadratic dynamic optimization problems.
For example, the same method is used in our Stability in Linear Rational Expectations Models lecture.
Let’s try to solve the model described in that lecture by applying the stable_solution function defined in this lecture
above.
def construct_H(ρ, λ, δ):
    "Construct matrix H given parameters."
    H = np.empty((2, 2))
    H[0, :] = ρ, δ
    H[1, :] = -(1 - λ) / λ, 1 / λ
    return H
H = construct_H(ρ=.9, λ=.5, δ=0)
W, V, P = stable_solution(H)
P
eigenvalues:
W11: [0.9]
W22: [2.]
array([[0.90909091]])

56.9 Discounted Problems
56.9.1 Transforming States and Controls to Eliminate Discounting
A pair of useful transformations allows us to convert a discounted problem into an undiscounted one.
Thus, suppose that we have a discounted problem with objective
∞
− ∑ 𝛽 𝑡 {𝑥′𝑡 𝑅𝑥𝑡 + 𝑢′𝑡 𝑄𝑢𝑡 }
𝑡=0
and that the state transition equation is again 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 .
Define the transformed state and control variables
• $\hat x_t = \beta^{t/2} x_t$
• $\hat u_t = \beta^{t/2} u_t$
and the transformed transition equation matrices
• $\hat A = \beta^{1/2} A$
• $\hat B = \beta^{1/2} B$
so that the adjusted state and control variables obey the transition law
$$
\hat x_{t+1} = \hat A \hat x_t + \hat B \hat u_t.
$$
Then a discounted optimal control problem defined by $A, B, R, Q, \beta$ having optimal policy characterized by $P, F$ is associated with an equivalent undiscounted problem defined by $\hat A, \hat B, Q, R$ having optimal policy characterized by $\hat F, \hat P$ that satisfy the following equations:
$$
\hat F = (Q + \hat B' \hat P \hat B)^{-1} \hat B' \hat P \hat A
$$
and
$$
\hat P = R + \hat A' \hat P \hat A - \hat A' \hat P \hat B (Q + \hat B' \hat P \hat B)^{-1} \hat B' \hat P \hat A
$$
It follows immediately from the definitions of $\hat A, \hat B$ that $\hat F = F$ and $\hat P = P$.
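The equivalence can be checked numerically by solving both problems with a simple Riccati iteration. In the sketch below, the discounted recursion $P = R + \beta A' P A - \beta^2 A' P B (Q + \beta B' P B)^{-1} B' P A$ is the standard discounted counterpart of (56.2); the function name, the example matrices and the value of $\beta$ are assumptions made for illustration.

```python
import numpy as np


def riccati(A, B, R, Q, beta=1.0, tol=1e-10, max_iter=10_000):
    """Iterate on the (possibly discounted) Riccati equation from P = 0."""
    def policy(P):
        return beta * np.linalg.solve(Q + beta * B.T @ P @ B, B.T @ P @ A)

    P = np.zeros_like(R)
    for _ in range(max_iter):
        P_new = R + beta * A.T @ P @ A - beta * A.T @ P @ B @ policy(P)
        if np.max(np.abs(P_new - P)) < tol:
            return P_new, policy(P_new)
        P = P_new
    raise RuntimeError("Riccati iteration did not converge")


# Illustrative (assumed) matrices and discount factor
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
R = np.eye(2)
Q = np.array([[1.0]])
beta = 1 / 1.05

# Discounted problem solved directly
P, F = riccati(A, B, R, Q, beta=beta)

# Undiscounted problem with transformed matrices
A_hat, B_hat = np.sqrt(beta) * A, np.sqrt(beta) * B
P_hat, F_hat = riccati(A_hat, B_hat, R, Q, beta=1.0)

print(np.max(np.abs(P - P_hat)), np.max(np.abs(F - F_hat)))
```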

By exploiting these transformations, we can solve a discounted problem by solving an associated undiscounted problem.
In particular, we can first transform a discounted LQ problem to an undiscounted one and then solve that undiscounted optimal regulator problem using the Lagrangian and invariant subspace methods described above.
For example, when $\beta = \frac{1}{1+r}$, we can solve for $P$ with $\hat A = \beta^{1/2} A$ and $\hat B = \beta^{1/2} B$.
These settings are adopted by default in the function stationary_P defined above.
β = 1 / (1 + r)
lq.beta = β
stationary_P(lq)
eigenvalues:
W11: [0.97590007 0.97590007]

W22: [1.02469508 1.02469508]

array([[ 0.0525, -1.05 ],

[-1.05 , 21. ]])
We can verify that the solution agrees with one that comes from applying the routine LQ.stationary_values in the quantecon package.
lq.stationary_values()
(array([[ 0.0525, -1.05 ],

[-1.05 , 21. ]]),
array([[-0.05, 1. ]]),
0.0)
56.9.2 Lagrangian for Discounted Problem
For several purposes, it is useful to describe briefly a Lagrangian for a discounted problem.
Thus, for the discounted optimal linear regulator problem, form the Lagrangian
$$
L = - \sum_{t=0}^{\infty} \beta^t \left\{ x_t' R x_t + u_t' Q u_t + 2 \beta \mu_{t+1}' [A x_t + B u_t - x_{t+1}] \right\}
\tag{56.18}
$$
where 2𝜇𝑡+1 is a vector of Lagrange multipliers on the state vector 𝑥𝑡+1 .

First-order conditions for maximization with respect to $\{u_t, x_{t+1}\}_{t=0}^{\infty}$ are
$$
\begin{aligned}
2 Q u_t + 2 \beta B' \mu_{t+1} &= 0, \quad t \ge 0 \\
\mu_t &= R x_t + \beta A' \mu_{t+1}, \quad t \ge 1.
\end{aligned}
\tag{56.19}
$$
Define $2\mu_0$ to be the vector of shadow prices of $x_0$ and apply an envelope condition to (56.18) to deduce that
$$
\mu_0 = R x_0 + \beta A' \mu_1,
$$
which is a time 𝑡 = 0 counterpart to the second equation of system (56.19).

Proceeding as we did above with the undiscounted system (56.5), we can rearrange the first-order conditions into the
system
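$$
\begin{bmatrix} I & \beta B Q^{-1} B' \\ 0 & \beta A' \end{bmatrix}
\begin{bmatrix} x_{t+1} \\ \mu_{t+1} \end{bmatrix}
=
\begin{bmatrix} A & 0 \\ -R & I \end{bmatrix}
\begin{bmatrix} x_t \\ \mu_t \end{bmatrix}
\tag{56.20}
$$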
which in the special case that 𝛽 = 1 agrees with equation (56.5), as expected.
By staring at system (56.20), we can infer identities that shed light on the structure of optimal linear regulator problems,
some of which will be useful in this lecture when we apply and extend the methods of this lecture to study Stackelberg
and Ramsey problems.
First, note that the first block of equation system (56.20) asserts that when 𝜇𝑡+1 = 𝑃 𝑥𝑡+1 , then
$$
(I + \beta B Q^{-1} B' P) x_{t+1} = A x_t,
$$
which can be rearranged to be
$$
x_{t+1} = (I + \beta B Q^{-1} B' P)^{-1} A x_t.
$$

This expression for the optimal closed loop dynamics of the state must agree with an alternative expression that we had
derived with dynamic programming, namely,
𝑥𝑡+1 = (𝐴 − 𝐵𝐹 )𝑥𝑡 .
But using
𝐹 = 𝛽(𝑄 + 𝛽𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 𝐴 (56.21)
it follows that
𝐴 − 𝐵𝐹 = (𝐼 − 𝛽𝐵(𝑄 + 𝛽𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 )𝐴.
Thus, our two expressions for the closed loop dynamics agree if and only if
(𝐼 + 𝛽𝐵𝑄−1 𝐵′ 𝑃 )−1 = (𝐼 − 𝛽𝐵(𝑄 + 𝛽𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 ). (56.22)
Matrix equation (56.22) can be verified by applying a partitioned inverse formula.
Note: Just use the formula (𝑎 − 𝑏𝑑−1 𝑐)−1 = 𝑎−1 + 𝑎−1 𝑏(𝑑 − 𝑐𝑎−1 𝑏)−1 𝑐𝑎−1 for appropriate choices of the matrices
𝑎, 𝑏, 𝑐, 𝑑.
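Identity (56.22) is also easy to confirm numerically for particular matrices. The sketch below does so for one set of assumed matrices, with $Q$ and $P$ symmetric positive definite; all numerical values are chosen only for illustration.

```python
import numpy as np

beta = 0.95
# Illustrative (assumed) matrices: B is 3 x 2, Q is 2 x 2, P is 3 x 3
B = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 2.0]])
Q = np.array([[2.0, 0.3],
              [0.3, 1.0]])
P = np.array([[3.0, 0.5, 0.0],
              [0.5, 2.0, 0.4],
              [0.0, 0.4, 1.5]])
I = np.eye(3)

# Left side of (56.22): (I + beta B Q^{-1} B' P)^{-1}
lhs = np.linalg.inv(I + beta * B @ np.linalg.solve(Q, B.T) @ P)

# Right side of (56.22): I - beta B (Q + beta B' P B)^{-1} B' P
rhs = I - beta * B @ np.linalg.solve(Q + beta * B.T @ P @ B, B.T @ P)

gap = np.max(np.abs(lhs - rhs))
print(gap)
```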
Next, note that for any fixed $F$ for which eigenvalues of $A - BF$ are less than $1/\beta$ in modulus, the value function associated with using this rule forever is $-x_0' \tilde P x_0$ where $\tilde P$ obeys the following matrix equation:
$$
\tilde P = (R + F' Q F) + \beta (A - BF)' \tilde P (A - BF).
\tag{56.23}
$$
Evidently, 𝑃 ̃ = 𝑃 only when 𝐹 obeys formula (56.21).

Next, note that the second equation of system (56.20) implies the “forward looking” equation for the Lagrange multiplier
𝜇𝑡 = 𝑅𝑥𝑡 + 𝛽𝐴′ 𝜇𝑡+1
whose solution is
𝜇𝑡 = 𝑃 𝑥𝑡 ,
where
𝑃 = 𝑅 + 𝛽𝐴′ 𝑃 (𝐴 − 𝐵𝐹 ) (56.24)
where we must require that 𝐹 obeys equation (56.21).

Equations (56.23) and (56.24) provide different perspectives on the optimal value function.

CHAPTER
FIFTYSEVEN
ELIMINATING CROSS PRODUCTS
57.1 Overview
This lecture describes formulas for eliminating

• cross products between states and control in linear-quadratic dynamic programming problems
• covariances between state and measurement noises in Kalman filtering problems
For a linear-quadratic dynamic programming problem, the idea involves these steps
• transform states and controls in a way that leads to an equivalent problem with no cross-products between trans-
formed states and controls
• solve the transformed problem using standard formulas for problems with no cross-products between states and
controls presented in this lecture Linear Control: Foundations
• transform the optimal decision rule for the altered problem into the optimal decision rule for the original problem
with cross-products between states and controls
57.2 Undiscounted Dynamic Programming Problem
Here is a nonstochastic undiscounted LQ dynamic programming with cross products between states and controls in the
objective function.
The problem is defined by the 5-tuple of matrices (𝐴, 𝐵, 𝑅, 𝑄, 𝐻) where 𝑅 and 𝑄 are positive definite symmetric matrices
and 𝐴 ∼ 𝑚 × 𝑚, 𝐵 ∼ 𝑚 × 𝑘, 𝑄 ∼ 𝑘 × 𝑘, 𝑅 ∼ 𝑚 × 𝑚 and 𝐻 ∼ 𝑘 × 𝑚.
The problem is to choose $\{x_{t+1}, u_t\}_{t=0}^\infty$ to maximize
$$- \sum_{t=0}^\infty (x_t' R x_t + u_t' Q u_t + 2 u_t' H x_t)$$
subject to the linear constraints
𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 , 𝑡≥0
where 𝑥0 is a given initial condition.

The solution to this undiscounted infinite-horizon problem is a time-invariant feedback rule
𝑢𝑡 = −𝐹 𝑥𝑡
where
$$F = (Q + B'PB)^{-1} (B'PA + H)$$
and 𝑃 ∼ 𝑚 × 𝑚 is a positive definite solution of the algebraic matrix Riccati equation
𝑃 = 𝑅 + 𝐴′ 𝑃 𝐴 − (𝐴′ 𝑃 𝐵 + 𝐻 ′ )(𝑄 + 𝐵′ 𝑃 𝐵)−1 (𝐵′ 𝑃 𝐴 + 𝐻).
It can be verified that an equivalent problem without cross-products between states and controls is defined by a 4-tuple of matrices: $(A^*, B, R^*, Q)$.
Omitting the matrix $H$ from this tuple signals that $H = 0$, i.e., that there are no cross products between states and controls in the equivalent problem.
The matrices (𝐴∗ , 𝐵, 𝑅∗ , 𝑄) defining the equivalent problem and the value function, policy function matrices 𝑃 , 𝐹 ∗ that
solve it are related to the matrices (𝐴, 𝐵, 𝑅, 𝑄, 𝐻) defining the original problem and the value function, policy function
matrices 𝑃 , 𝐹 that solve the original problem by
$$
\begin{aligned}
A^* &= A - B Q^{-1} H, \\
R^* &= R - H' Q^{-1} H, \\
P &= R^* + A^{*\prime} P A^* - (A^{*\prime} P B)(Q + B'PB)^{-1} B' P A^*, \\
F^* &= (Q + B'PB)^{-1} B' P A^*, \\
F &= F^* + Q^{-1} H.
\end{aligned}
$$
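The following is a minimal numerical sketch of these relations, not part of the original lecture. The matrices are hypothetical values chosen only for illustration, and the Riccati equations are solved by plain fixed-point iteration from a zero matrix rather than by a library solver. The policy matrix for the original problem is computed as $(Q + B'PB)^{-1}(B'PA + H)$, which is what the Riccati equation with cross products implies.

```python
import numpy as np

def riccati_step(P, A, B, R, Q, H):
    """One step of the undiscounted Riccati recursion with cross-product matrix H."""
    G = B.T @ P @ A + H                      # k x m; G' = A'PB + H' when P is symmetric
    return R + A.T @ P @ A - G.T @ np.linalg.solve(Q + B.T @ P @ B, G)

# Hypothetical problem data: m = 2 states, k = 1 control
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
R = np.eye(2)
Q = np.array([[1.0]])
H = np.array([[0.2, 0.1]])

# Transformed problem without cross products
Q_inv_H = np.linalg.solve(Q, H)
A_star = A - B @ Q_inv_H
R_star = R - H.T @ Q_inv_H
H_zero = np.zeros_like(H)

# Iterate both Riccati recursions from the same starting point
P = np.zeros((2, 2))
P_star = np.zeros((2, 2))
for _ in range(500):
    P = riccati_step(P, A, B, R, Q, H)
    P_star = riccati_step(P_star, A_star, B, R_star, Q, H_zero)

# Policy matrices for the original and transformed problems
F = np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A + H)
F_star = np.linalg.solve(Q + B.T @ P_star @ B, B.T @ P_star @ A_star)
F_recovered = F_star + Q_inv_H

print(np.allclose(P, P_star), np.allclose(F, F_recovered))
```

Both problems deliver the same $P$, and adding $Q^{-1}H$ to $F^*$ recovers the decision rule of the original problem.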
57.3 Kalman Filter
The duality that prevails between a linear-quadratic optimal control problem and a Kalman filtering problem means that there is an analogous transformation that allows us to transform a Kalman filtering problem with non-zero covariance matrix between shocks to states and shocks to measurements into an equivalent Kalman filtering problem with zero covariance between shocks to states and measurements.
Let’s look at the appropriate transformations.
First, let’s recall the Kalman filter with covariance between noises to states and measurements.
The hidden Markov model is
$$
\begin{aligned}
x_{t+1} &= A x_t + B w_{t+1}, \\
z_{t+1} &= D x_t + F w_{t+1},
\end{aligned}
$$
where $A \sim m \times m$, $B \sim m \times p$ and $D \sim k \times m$, $F \sim k \times p$, and $w_{t+1}$ is the time $t+1$ component of a sequence of i.i.d. $p \times 1$ normally distributed random vectors with mean vector zero and covariance matrix equal to a $p \times p$ identity matrix.
Thus, 𝑥𝑡 is 𝑚 × 1 and 𝑧𝑡 is 𝑘 × 1.
The Kalman filtering formulas are
$$
\begin{aligned}
K(\Sigma_t) &= (A \Sigma_t D' + B F')(D \Sigma_t D' + F F')^{-1}, \\
\Sigma_{t+1} &= A \Sigma_t A' + B B' - (A \Sigma_t D' + B F')(D \Sigma_t D' + F F')^{-1}(D \Sigma_t A' + F B').
\end{aligned}
$$
Define transformed matrices
$$
\begin{aligned}
A^* &= A - B F' (F F')^{-1} D, \\
B^* B^{*\prime} &= B B' - B F' (F F')^{-1} F B'.
\end{aligned}
$$

57.3.1 Algorithm
A consequence of the transformation formulas above is that we can use the following algorithm to solve Kalman filtering problems that involve non-zero covariances between state and signal noises.
First, compute Σ, 𝐾 ∗ using the ordinary Kalman filtering formula with 𝐵𝐹 ′ = 0, i.e., with zero covariance matrix
between random shocks to states and random shocks to measurements.
That is, compute 𝐾 ∗ and Σ that satisfy
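$$
\begin{aligned}
K^* &= (A^* \Sigma D')(D \Sigma D' + F F')^{-1}, \\
\Sigma &= A^* \Sigma A^{*\prime} + B^* B^{*\prime} - (A^* \Sigma D')(D \Sigma D' + F F')^{-1}(D \Sigma A^{*\prime}).
\end{aligned}
$$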
The Kalman gain for the original problem with non-zero covariance between shocks to states and measurements is then
$$K = K^* + B F' (F F')^{-1}.$$
The state reconstruction covariance matrix Σ for the original problem equals the state reconstruction covariance matrix for the transformed problem.
57.4 Duality table
Here is a handy table to remember how the Kalman filter and dynamic program are related.
Dynamic Program    Kalman Filter
---------------    -------------
𝐴                  𝐴′
𝐵                  𝐷′
𝐻                  𝐹𝐵′
𝑄                  𝐹𝐹′
𝑅                  𝐵𝐵′
𝐹                  𝐾′
𝑃                  Σ

CHAPTER
FIFTYEIGHT
THE PERMANENT INCOME MODEL
Contents
• The Permanent Income Model

– Overview
– The Savings Problem
– Alternative Representations
– Two Classic Examples
– Further Reading
– Appendix: The Euler Equation
58.1 Overview
This lecture describes a rational expectations version of the famous permanent income model of Milton Friedman [Friedman, 1956].
Robert Hall cast Friedman’s model within a linear-quadratic setting [Hall, 1978].
Like Hall, we formulate an infinite-horizon linear-quadratic savings problem.
We use the model as a vehicle for illustrating
• alternative formulations of the state of a dynamic system
• the idea of cointegration
• impulse response functions
• the idea that changes in consumption are useful as predictors of movements in income
Background readings on the linear-quadratic-Gaussian permanent income model are Hall’s [Hall, 1978] and chapter 2 of
[Ljungqvist and Sargent, 2018].

import matplotlib.pyplot as plt
import numpy as np
import random
from numba import njit
58.2 The Savings Problem
In this section, we state and solve the savings and consumption problem faced by the consumer.
58.2.1 Preliminaries
We use a class of stochastic processes called martingales.

A discrete-time martingale is a stochastic process (i.e., a sequence of random variables) {𝑋𝑡 } with finite mean at each 𝑡
and satisfying
𝔼𝑡 [𝑋𝑡+1 ] = 𝑋𝑡 , 𝑡 = 0, 1, 2, …
Here 𝔼𝑡 ∶= 𝔼[⋅ | ℱ𝑡 ] is a conditional mathematical expectation conditional on the time 𝑡 information set ℱ𝑡 .
The latter is just a collection of random variables that the modeler declares to be visible at 𝑡.
• When not explicitly defined, it is usually understood that ℱ𝑡 = {𝑋𝑡 , 𝑋𝑡−1 , … , 𝑋0 }.
Martingales have the feature that the history of past outcomes provides no predictive power for changes between current
and future outcomes.
For example, the current wealth of a gambler engaged in a “fair game” has this property.
One common class of martingales is the family of random walks.
A random walk is a stochastic process {𝑋𝑡 } that satisfies
𝑋𝑡+1 = 𝑋𝑡 + 𝑤𝑡+1
for some IID zero mean innovation sequence {𝑤𝑡 }.

Evidently, 𝑋𝑡 can also be expressed as
$$X_t = \sum_{j=1}^t w_j + X_0$$
Not every martingale arises as a random walk (see, for example, Wald’s martingale).
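To make the random walk example concrete, here is a small simulation sketch. The number of paths, the horizon and the initial value are arbitrary illustrative choices. It builds many Gaussian random walk paths and looks at sample analogues of the martingale property: increments have mean close to zero and are essentially uncorrelated with the current level.

```python
import numpy as np

rng = np.random.default_rng(1234)
n_paths, T = 10_000, 50        # illustrative choices
X0 = 2.0

w = rng.standard_normal((n_paths, T))              # w_1, ..., w_T for each path
X = X0 + np.cumsum(w, axis=1)                      # X_1, ..., X_T
X = np.hstack([np.full((n_paths, 1), X0), X])      # prepend X_0

increments = X[:, 1:] - X[:, :-1]                  # X_{t+1} - X_t

# Sample analogues of E_t[X_{t+1} - X_t] = 0
mean_increment = increments.mean()
corr = np.corrcoef(X[:, T-1], increments[:, T-1])[0, 1]   # level X_{T-1} vs. next increment

print(mean_increment, corr)
```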
58.2.2 The Decision Problem
A consumer has preferences over consumption streams that are ordered by the utility functional
$$\mathbb{E}_0 \left[ \sum_{t=0}^\infty \beta^t u(c_t) \right] \tag{58.1}$$
where
• 𝔼𝑡 is the mathematical expectation conditioned on the consumer’s time 𝑡 information
• 𝑐𝑡 is time 𝑡 consumption
• 𝑢 is a strictly concave one-period utility function
• 𝛽 ∈ (0, 1) is a discount factor

The consumer maximizes (58.1) by choosing a consumption, borrowing plan $\{c_t, b_{t+1}\}_{t=0}^\infty$ subject to the sequence of budget constraints
$$c_t + b_t = \frac{1}{1 + r} b_{t+1} + y_t, \quad t \geq 0 \tag{58.2}$$
Here
• 𝑦𝑡 is an exogenous endowment process.
• 𝑟 > 0 is a time-invariant risk-free net interest rate.
• 𝑏𝑡 is one-period risk-free debt maturing at 𝑡.
The consumer also faces initial conditions 𝑏0 and 𝑦0 , which can be fixed or random.
58.2.3 Assumptions
For the remainder of this lecture, we follow Friedman and Hall in assuming that $(1 + r)^{-1} = \beta$.
Regarding the endowment process, we assume it has the state-space representation
$$
\begin{aligned}
z_{t+1} &= A z_t + C w_{t+1} \\
y_t &= U z_t
\end{aligned}
\tag{58.3}
$$
where
• $\{w_t\}$ is an IID vector process with $\mathbb{E} w_t = 0$ and $\mathbb{E} w_t w_t' = I$.
• The spectral radius of $A$ satisfies $\rho(A) < \sqrt{1/\beta}$.
• 𝑈 is a selection vector that pins down 𝑦𝑡 as a particular linear combination of components of 𝑧𝑡 .
The restriction on 𝜌(𝐴) prevents income from growing so fast that discounted geometric sums of some quadratic forms
to be described below become infinite.
Regarding preferences, we assume the quadratic utility function
$$u(c_t) = -(c_t - \gamma)^2$$
where 𝛾 is a bliss level of consumption.
Note: Along with this quadratic utility specification, we allow consumption to be negative. However, by choosing
parameters appropriately, we can make the probability that the model generates negative consumption paths over finite
time horizons as low as desired.
Finally, we impose the no Ponzi scheme condition

$$\mathbb{E}_0 \left[ \sum_{t=0}^\infty \beta^t b_t^2 \right] < \infty \tag{58.4}$$
This condition rules out an always-borrow scheme that would allow the consumer to enjoy bliss consumption forever.

58.2.4 First-Order Conditions
First-order conditions for maximizing (58.1) subject to (58.2) are
𝔼𝑡 [𝑢′ (𝑐𝑡+1 )] = 𝑢′ (𝑐𝑡 ), 𝑡 = 0, 1, … (58.5)
These optimality conditions are also known as Euler equations.

If you’re not sure where they come from, you can find a proof sketch in the appendix.
With our quadratic preference specification, (58.5) has the striking implication that consumption follows a martingale:
𝔼𝑡 [𝑐𝑡+1 ] = 𝑐𝑡 (58.6)
(In fact, quadratic preferences are necessary for this conclusion1 .)

One way to interpret (58.6) is that consumption will change only when “new information” about permanent income is
revealed.
These ideas will be clarified below.
58.2.5 The Optimal Decision Rule
Now let’s deduce the optimal decision rule2 .
Note: One way to solve the consumer’s problem is to apply dynamic programming as in this lecture. We do this later. But
first we use an alternative approach that is revealing and shows the work that dynamic programming does for us behind
the scenes.
In doing so, we need to combine

1. the optimality condition (58.6)
2. the period-by-period budget constraint (58.2), and
3. the boundary condition (58.4)
To accomplish this, observe first that (58.4) implies $\lim_{t \to \infty} \beta^{t/2} b_{t+1} = 0$.
Using this restriction on the debt path and solving (58.2) forward yields
$$b_t = \sum_{j=0}^\infty \beta^j (y_{t+j} - c_{t+j}) \tag{58.7}$$
Take conditional expectations on both sides of (58.7) and use the martingale property of consumption and the law of
iterated expectations to deduce
$$b_t = \sum_{j=0}^\infty \beta^j \mathbb{E}_t [y_{t+j}] - \frac{c_t}{1 - \beta} \tag{58.8}$$
Expressed in terms of 𝑐𝑡 we get

$$c_t = (1 - \beta) \left[ \sum_{j=0}^\infty \beta^j \mathbb{E}_t [y_{t+j}] - b_t \right] = \frac{r}{1 + r} \left[ \sum_{j=0}^\infty \beta^j \mathbb{E}_t [y_{t+j}] - b_t \right] \tag{58.9}$$
1 A linear marginal utility is essential for deriving (58.6) from (58.5). Suppose instead that we had imposed the following more standard assumptions on the utility function: $u'(c) > 0$, $u''(c) < 0$, $u'''(c) > 0$ and required that $c \geq 0$. The Euler equation remains (58.5). But the fact that $u''' > 0$ implies via Jensen's inequality that $\mathbb{E}_t[u'(c_{t+1})] > u'(\mathbb{E}_t[c_{t+1}])$. This inequality together with (58.5) implies that $\mathbb{E}_t[c_{t+1}] > c_t$ (consumption is said to be a 'submartingale'), so that consumption stochastically diverges to $+\infty$. The consumer's savings also diverge to $+\infty$.
2 An optimal decision rule is a map from the current state into current actions—in this case, consumption.

where the last equality uses (1 + 𝑟)𝛽 = 1.

These last two equations assert that consumption equals economic income
• financial wealth equals $-b_t$
• non-financial wealth equals $\sum_{j=0}^\infty \beta^j \mathbb{E}_t [y_{t+j}]$
• total wealth equals the sum of financial and non-financial wealth
• a marginal propensity to consume out of total wealth equals the interest factor $\frac{r}{1+r}$
• economic income equals

– a constant marginal propensity to consume times the sum of non-financial wealth and financial wealth
– the amount the consumer can consume while leaving its wealth intact
Responding to the State
The state vector confronting the consumer at 𝑡 is [𝑏𝑡 𝑧𝑡 ].

Here
• 𝑧𝑡 is an exogenous component, unaffected by consumer behavior.
• 𝑏𝑡 is an endogenous component (since it depends on the decision rule).
Note that 𝑧𝑡 contains all variables useful for forecasting the consumer’s future endowment.
It is plausible that current decisions 𝑐𝑡 and 𝑏𝑡+1 should be expressible as functions of 𝑧𝑡 and 𝑏𝑡 .
This is indeed the case.
In fact, from this discussion, we see that
$$\sum_{j=0}^\infty \beta^j \mathbb{E}_t [y_{t+j}] = \mathbb{E}_t \left[ \sum_{j=0}^\infty \beta^j y_{t+j} \right] = U (I - \beta A)^{-1} z_t$$
Combining this with (58.9) gives

$$c_t = \frac{r}{1 + r} \left[ U (I - \beta A)^{-1} z_t - b_t \right] \tag{58.10}$$
Using this equality to eliminate 𝑐𝑡 in the budget constraint (58.2) gives
$$
\begin{aligned}
b_{t+1} &= (1 + r)(b_t + c_t - y_t) \\
&= (1 + r) b_t + r [U (I - \beta A)^{-1} z_t - b_t] - (1 + r) U z_t \\
&= b_t + U [r (I - \beta A)^{-1} - (1 + r) I] z_t \\
&= b_t + U (I - \beta A)^{-1} (A - I) z_t
\end{aligned}
$$
To get from the second last to the last expression in this chain of equalities is not trivial.
A key is to use the fact that $(1 + r) \beta = 1$ and $(I - \beta A)^{-1} = \sum_{j=0}^\infty \beta^j A^j$.
We’ve now successfully written 𝑐𝑡 and 𝑏𝑡+1 as functions of 𝑏𝑡 and 𝑧𝑡 .
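For readers who want the intermediate algebra, here is one way to fill in the final step of the chain for $b_{t+1}$. It is a sketch that factors out $(I - \beta A)^{-1}$ and uses only $(1 + r)\beta = 1$, which gives $(1 + r)(I - \beta A) = (1 + r) I - A$.

```latex
\begin{aligned}
r (I - \beta A)^{-1} - (1 + r) I
  &= (I - \beta A)^{-1} \left[ r I - (1 + r)(I - \beta A) \right] \\
  &= (I - \beta A)^{-1} \left[ r I - (1 + r) I + A \right] \\
  &= (I - \beta A)^{-1} (A - I)
\end{aligned}
```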

A State-Space Representation
We can summarize our dynamics in the form of a linear state-space system governing consumption, debt and income:
$$
\begin{aligned}
b_{t+1} &= b_t + U [(I - \beta A)^{-1} (A - I)] z_t \\
c_t &= (1 - \beta) [U (I - \beta A)^{-1} z_t - b_t]
\end{aligned}
\tag{58.11}
$$
To write this more succinctly, let
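$$
x_t = \begin{bmatrix} z_t \\ b_t \end{bmatrix}, \quad
\tilde A = \begin{bmatrix} A & 0 \\ U (I - \beta A)^{-1} (A - I) & 1 \end{bmatrix}, \quad
\tilde C = \begin{bmatrix} C \\ 0 \end{bmatrix}
$$
and
$$
\tilde U = \begin{bmatrix} U & 0 \\ (1 - \beta) U (I - \beta A)^{-1} & -(1 - \beta) \end{bmatrix}, \quad
\tilde y_t = \begin{bmatrix} y_t \\ c_t \end{bmatrix}
$$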
Then we can express equation (58.11) as
$$
\begin{aligned}
x_{t+1} &= \tilde A x_t + \tilde C w_{t+1} \\
\tilde y_t &= \tilde U x_t
\end{aligned}
\tag{58.12}
$$
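We can use the following formulas from linear state space models to compute population mean $\mu_t = \mathbb{E} x_t$ and covariance $\Sigma_t := \mathbb{E} [(x_t - \mu_t)(x_t - \mu_t)']$
$$\mu_{t+1} = \tilde A \mu_t \quad \text{with } \mu_0 \text{ given} \tag{58.13}$$
$$\Sigma_{t+1} = \tilde A \Sigma_t \tilde A' + \tilde C \tilde C' \quad \text{with } \Sigma_0 \text{ given} \tag{58.14}$$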
We can then compute the mean and covariance of 𝑦𝑡̃ from
$$
\begin{aligned}
\mu_{y,t} &= \tilde U \mu_t \\
\Sigma_{y,t} &= \tilde U \Sigma_t \tilde U'
\end{aligned}
\tag{58.15}
$$
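The moment recursions can be sketched in a few lines of NumPy. The example below is illustrative only: it assumes an IID income specification with $z_t = (z^1_t, 1)'$, $A$ with a single nonzero entry for the constant, $U = [1 \; \mu]$ and $C = (\sigma, 0)'$, together with hypothetical parameter values, and it starts from a deterministic initial state.

```python
import numpy as np

r = 0.05
β = 1 / (1 + r)
μ, σ = 1.0, 0.15                      # illustrative parameter values

# Primitives for IID income: z_t = (z_t^1, 1)'
A = np.array([[0.0, 0.0], [0.0, 1.0]])
C = np.array([[σ], [0.0]])
U = np.array([[1.0, μ]])

n = A.shape[0]
UM = U @ np.linalg.inv(np.eye(n) - β * A)        # U (I - βA)^{-1}

# Stacked system for x_t = (z_t, b_t)' and observables (y_t, c_t)'
A_tilde = np.block([[A, np.zeros((n, 1))],
                    [UM @ (A - np.eye(n)), np.ones((1, 1))]])
C_tilde = np.vstack([C, np.zeros((1, 1))])
U_tilde = np.block([[U, np.zeros((1, 1))],
                    [(1 - β) * UM, -(1 - β) * np.ones((1, 1))]])

def moments(μ0, Σ0, T):
    "Iterate the mean and covariance recursions T periods; return moments of (y_T, c_T)."
    μ_x, Σ_x = μ0, Σ0
    for _ in range(T):
        μ_x = A_tilde @ μ_x
        Σ_x = A_tilde @ Σ_x @ A_tilde.T + C_tilde @ C_tilde.T
    return U_tilde @ μ_x, U_tilde @ Σ_x @ U_tilde.T

T = 40
μ0 = np.array([0.0, 1.0, 0.0])        # z_0^1 = 0, constant = 1, b_0 = 0
Σ0 = np.zeros((3, 3))
μ_y, Σ_y = moments(μ0, Σ0, T)

print(μ_y, Σ_y[1, 1], (1 - β)**2 * σ**2 * T)
```

In this special case the variance of consumption after $T$ periods equals $(1 - \beta)^2 \sigma^2 T$, so it grows linearly with the horizon while the variance of income stays at $\sigma^2$.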
A Simple Example with IID Income
To gain some preliminary intuition on the implications of (58.11), let’s look at a highly stylized example where income is
just IID.
(Later examples will investigate more realistic income streams.)
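In particular, let $\{w_t\}_{t=1}^\infty$ be IID and scalar standard normal, and let
$$
z_t = \begin{bmatrix} z^1_t \\ 1 \end{bmatrix}, \quad
A = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}, \quad
U = \begin{bmatrix} 1 & \mu \end{bmatrix}, \quad
C = \begin{bmatrix} \sigma \\ 0 \end{bmatrix}
$$
Finally, let $b_0 = z^1_0 = 0$.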

Under these assumptions, we have $y_t = \mu + \sigma w_t \sim N(\mu, \sigma^2)$.
Further, if you work through the state space representation, you will see that
$$
\begin{aligned}
b_t &= -\sigma \sum_{j=1}^{t-1} w_j \\
c_t &= \mu + (1 - \beta) \sigma \sum_{j=1}^{t} w_j
\end{aligned}
$$

Thus, income is IID and debt and consumption are both Gaussian random walks.
Defining assets as −𝑏𝑡 , we see that assets are just the cumulative sum of unanticipated incomes prior to the present date.
The next figure shows a typical realization with 𝑟 = 0.05, 𝜇 = 1, and 𝜎 = 0.15
r = 0.05
β = 1 / (1 + r)
σ = 0.15
μ = 1
T = 60

@njit
def time_path(T):
    w = np.random.randn(T+1)  # w_0, w_1, ..., w_T
    w[0] = 0
    b = np.zeros(T+1)
    # b_t is (minus σ times) the sum of shocks w_1, ..., w_{t-1}
    for t in range(1, T+1):
        b[t] = w[1:t].sum()
    b = -σ * b
    c = μ + (1 - β) * (σ * w - b)
    return w, b, c

w, b, c = time_path(T)

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(μ + σ * w, 'g-', label="Non-financial income")
ax.plot(c, 'k-', label="Consumption")
ax.plot(b, 'b-', label="Debt")
ax.legend(ncol=3, mode='expand', bbox_to_anchor=(0., 1.02, 1., .102))
ax.grid()
ax.set_xlabel('Time')

plt.show()

Observe that consumption is considerably smoother than income.

The figure below shows the consumption paths of 250 consumers with independent income streams
fig, ax = plt.subplots(figsize=(10, 6))

b_sum = np.zeros(T+1)
for i in range(250):
    w, b, c = time_path(T)  # Generate new time path
    rcolor = random.choice(('c', 'g', 'b', 'k'))
    ax.plot(c, color=rcolor, lw=0.8, alpha=0.7)

ax.grid()
ax.set(xlabel='Time', ylabel='Consumption')

plt.show()

58.3 Alternative Representations
In this section, we shed more light on the evolution of savings, debt and consumption by representing their dynamics in
several different ways.
58.3.1 Hall’s Representation
Hall [Hall, 1978] suggested an insightful way to summarize the implications of LQ permanent income theory.
First, to represent the solution for 𝑏𝑡 , shift (58.9) forward one period and eliminate 𝑏𝑡+1 by using (58.2) to obtain
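$$c_{t+1} = (1 - \beta) \sum_{j=0}^\infty \beta^j \mathbb{E}_{t+1} [y_{t+j+1}] - (1 - \beta) \left[ \beta^{-1} (c_t + b_t - y_t) \right]$$
If we add and subtract $\beta^{-1} (1 - \beta) \sum_{j=0}^\infty \beta^j \mathbb{E}_t y_{t+j}$ from the right side of the preceding equation and rearrange, we obtain
$$c_{t+1} - c_t = (1 - \beta) \sum_{j=0}^\infty \beta^j \left\{ \mathbb{E}_{t+1} [y_{t+j+1}] - \mathbb{E}_t [y_{t+j+1}] \right\} \tag{58.16}$$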
The right side is the time 𝑡 + 1 innovation to the expected present value of the endowment process {𝑦𝑡 }.
We can represent the optimal decision rule for (𝑐𝑡 , 𝑏𝑡+1 ) in the form of (58.16) and (58.8), which we repeat:
$$b_t = \sum_{j=0}^\infty \beta^j \mathbb{E}_t [y_{t+j}] - \frac{1}{1 - \beta} c_t \tag{58.17}$$
Equation (58.17) asserts that the consumer’s debt due at 𝑡 equals the expected present value of its endowment minus the
expected present value of its consumption stream.
A high debt thus indicates a large expected present value of surpluses 𝑦𝑡 − 𝑐𝑡 .

Recalling again our discussion on forecasting geometric sums, we have
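$$
\begin{aligned}
\mathbb{E}_t \sum_{j=0}^\infty \beta^j y_{t+j} &= U (I - \beta A)^{-1} z_t \\
\mathbb{E}_{t+1} \sum_{j=0}^\infty \beta^j y_{t+j+1} &= U (I - \beta A)^{-1} z_{t+1} \\
\mathbb{E}_t \sum_{j=0}^\infty \beta^j y_{t+j+1} &= U (I - \beta A)^{-1} A z_t
\end{aligned}
$$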
Using these formulas together with (58.3) and substituting into (58.16) and (58.17) gives the following representation for
the consumer’s optimum decision rule:
$$
\begin{aligned}
c_{t+1} &= c_t + (1 - \beta) U (I - \beta A)^{-1} C w_{t+1} \\
b_t &= U (I - \beta A)^{-1} z_t - \frac{1}{1 - \beta} c_t
\end{aligned}
\tag{58.18}
$$
Representation (58.18) makes clear that

• The state can be taken as (𝑐𝑡 , 𝑧𝑡 ).
– The endogenous part is 𝑐𝑡 and the exogenous part is 𝑧𝑡 .
– Debt 𝑏𝑡 has disappeared as a component of the state because it is encoded in 𝑐𝑡 .
• Consumption is a random walk with innovation (1 − 𝛽)𝑈 (𝐼 − 𝛽𝐴)−1 𝐶𝑤𝑡+1 .
– This is a more explicit representation of the martingale result in (58.6).
58.3.2 Cointegration
Representation (58.18) reveals that the joint process {𝑐𝑡 , 𝑏𝑡 } possesses the property that Engle and Granger [Engle and
Granger, 1987] called cointegration.
Cointegration is a tool that allows us to apply powerful results from the theory of stationary stochastic processes to (certain
transformations of) nonstationary models.
To apply cointegration in the present context, suppose that 𝑧𝑡 is asymptotically stationary3 .
Despite this, both 𝑐𝑡 and 𝑏𝑡 will be non-stationary because they have unit roots (see (58.11) for 𝑏𝑡 ).
Nevertheless, there is a linear combination of 𝑐𝑡 , 𝑏𝑡 that is asymptotically stationary.
In particular, from the second equality in (58.18) we have
(1 − 𝛽)𝑏𝑡 + 𝑐𝑡 = (1 − 𝛽)𝑈 (𝐼 − 𝛽𝐴)−1 𝑧𝑡 (58.19)
Hence the linear combination (1 − 𝛽)𝑏𝑡 + 𝑐𝑡 is asymptotically stationary.

Accordingly, Granger and Engle would call $[(1 - \beta) \quad 1]$ a cointegrating vector for the state.
When applied to the nonstationary vector process $[b_t \quad c_t]'$, it yields a process that is asymptotically stationary.
3 This would be the case if, for example, the spectral radius of 𝐴 is strictly less than one.

Equation (58.19) can be rearranged to take the form

$$(1 - \beta) b_t + c_t = (1 - \beta) \mathbb{E}_t \sum_{j=0}^\infty \beta^j y_{t+j} \tag{58.20}$$
Equation (58.20) asserts that the cointegrating residual on the left side equals the conditional expectation of the geometric
sum of future incomes on the right4 .
58.3.3 Cross-Sectional Implications
Consider again (58.18), this time in light of our discussion of distribution dynamics in the lecture on linear systems.
The dynamics of 𝑐𝑡 are given by
𝑐𝑡+1 = 𝑐𝑡 + (1 − 𝛽)𝑈 (𝐼 − 𝛽𝐴)−1 𝐶𝑤𝑡+1 (58.21)
or
$$c_t = c_0 + \sum_{j=1}^t \hat w_j \quad \text{for} \quad \hat w_{t+1} := (1 - \beta) U (I - \beta A)^{-1} C w_{t+1}$$
The unit root affecting $c_t$ causes the time $t$ variance of $c_t$ to grow linearly with $t$.
In particular, since $\{\hat w_t\}$ is IID, we have
$$\mathrm{Var}[c_t] = \mathrm{Var}[c_0] + t \, \hat\sigma^2 \tag{58.22}$$
where
$$\hat\sigma^2 := (1 - \beta)^2 U (I - \beta A)^{-1} C C' (I - \beta A')^{-1} U'$$
When 𝜎̂ > 0, {𝑐𝑡 } has no asymptotic distribution.

Let’s consider what this means for a cross-section of ex-ante identical consumers born at time 0.
Let the distribution of 𝑐0 represent the cross-section of initial consumption values.
Equation (58.22) tells us that the variance of 𝑐𝑡 increases over time at a rate proportional to 𝑡.
A number of different studies have investigated this prediction and found some support for it (see, e.g., [Deaton and
Paxson, 1994], [Storesletten et al., 2004]).
58.3.4 Impulse Response Functions
Impulse response functions measure responses to various impulses (i.e., temporary shocks).
The impulse response function of {𝑐𝑡 } to the innovation {𝑤𝑡 } is a box.
In particular, the response of 𝑐𝑡+𝑗 to a unit increase in the innovation 𝑤𝑡+1 is (1 − 𝛽)𝑈 (𝐼 − 𝛽𝐴)−1 𝐶 for all 𝑗 ≥ 1.
4 See [John Y. Campbell, 1988], [Lettau and Ludvigson, 2001], [Lettau and Ludvigson, 2004] for interesting applications of related ideas.

58.3.5 Moving Average Representation
It’s useful to express the innovation to the expected present value of the endowment process in terms of a moving average
representation for income 𝑦𝑡 .
The endowment process defined by (58.3) has the moving average representation
𝑦𝑡+1 = 𝑑(𝐿)𝑤𝑡+1 (58.23)
where
• $d(L) = \sum_{j=0}^\infty d_j L^j$ for some sequence $d_j$, where $L$ is the lag operator⁵
• at time $t$, the consumer has an information set⁶ $w^t = [w_t, w_{t-1}, \ldots]$

Notice that
𝑦𝑡+𝑗 − 𝔼𝑡 [𝑦𝑡+𝑗 ] = 𝑑0 𝑤𝑡+𝑗 + 𝑑1 𝑤𝑡+𝑗−1 + ⋯ + 𝑑𝑗−1 𝑤𝑡+1
It follows that
𝔼𝑡+1 [𝑦𝑡+𝑗 ] − 𝔼𝑡 [𝑦𝑡+𝑗 ] = 𝑑𝑗−1 𝑤𝑡+1 (58.24)
Using (58.24) in (58.16) gives
𝑐𝑡+1 − 𝑐𝑡 = (1 − 𝛽)𝑑(𝛽)𝑤𝑡+1 (58.25)
The object 𝑑(𝛽) is the present value of the moving average coefficients in the representation for the endowment process
𝑦𝑡 .
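As a hedged numerical illustration, the present value $d(\beta)$ can be computed in two ways when the endowment has a state space form with $d(L) = U (I - AL)^{-1} C$, so that $d_j = U A^j C$. The matrices below are hypothetical and chosen only so that the geometric sum converges quickly.

```python
import numpy as np

β = 1 / 1.05
# Hypothetical endowment process (illustrative values only)
A = np.array([[0.9, 0.0], [0.0, 0.5]])
C = np.array([[0.15], [0.10]])
U = np.array([[1.0, 1.0]])

# Moving average coefficients d_j = U A^j C
J = 2000
d = np.empty(J)
A_power = np.eye(2)
for j in range(J):
    d[j] = (U @ A_power @ C).item()
    A_power = A_power @ A

d_beta_series = np.sum(β ** np.arange(J) * d)                        # truncated sum of β^j d_j
d_beta_closed = (U @ np.linalg.inv(np.eye(2) - β * A) @ C).item()    # U (I - βA)^{-1} C

# Response of c_{t+1} - c_t to a unit innovation w_{t+1}
consumption_response = (1 - β) * d_beta_closed

print(d_beta_series, d_beta_closed, consumption_response)
```

The truncated series and the closed form agree, which is the same geometric-sum identity used earlier for forecasting discounted income.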
58.4 Two Classic Examples
We illustrate some of the preceding ideas with two examples.

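In both examples, the endowment follows the process $y_t = z_{1t} + z_{2t}$ where
$$
\begin{bmatrix} z_{1t+1} \\ z_{2t+1} \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} z_{1t} \\ z_{2t} \end{bmatrix}
+ \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}
\begin{bmatrix} w_{1t+1} \\ w_{2t+1} \end{bmatrix}
$$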
Here
• 𝑤𝑡+1 is an IID 2 × 1 process distributed as 𝑁 (0, 𝐼).
• 𝑧1𝑡 is a permanent component of 𝑦𝑡 .
• 𝑧2𝑡 is a purely transitory component of 𝑦𝑡 .
58.4.1 Example 1
Assume as before that the consumer observes the state 𝑧𝑡 at time 𝑡.

In view of (58.18) we have
𝑐𝑡+1 − 𝑐𝑡 = 𝜎1 𝑤1𝑡+1 + (1 − 𝛽)𝜎2 𝑤2𝑡+1 (58.26)
Formula (58.26) shows how an increment 𝜎1 𝑤1𝑡+1 to the permanent component of income 𝑧1𝑡+1 leads to
5 Representation (58.3) implies that $d(L) = U (I - AL)^{-1} C$.
6 A moving average representation for a process $y_t$ is said to be fundamental if the linear space spanned by $y^t$ is equal to the linear space spanned by $w^t$. A time-invariant innovations representation, attained via the Kalman filter, is by construction fundamental.

• a permanent one-for-one increase in consumption and

• no increase in savings −𝑏𝑡+1
But the purely transitory component of income 𝜎2 𝑤2𝑡+1 leads to a permanent increment in consumption by a fraction
1 − 𝛽 of transitory income.
The remaining fraction 𝛽 is saved, leading to a permanent increment in −𝑏𝑡+1 .
Application of the formula for debt in (58.11) to this example shows that
𝑏𝑡+1 − 𝑏𝑡 = −𝑧2𝑡 = −𝜎2 𝑤2𝑡 (58.27)
This confirms that none of 𝜎1 𝑤1𝑡 is saved, while all of 𝜎2 𝑤2𝑡 is saved.
The next figure displays impulse-response functions that illustrate these very different reactions to transitory and permanent income shocks.
r = 0.05
β = 1 / (1 + r)
S = 5   # Impulse date
σ1 = σ2 = 0.15

@njit
def time_path(T, permanent=False):
    "Time path of consumption and debt given shock sequence"
    w1 = np.zeros(T+1)
    w2 = np.zeros(T+1)
    b = np.zeros(T+1)
    c = np.zeros(T+1)
    # Place a unit impulse in either the permanent or the transitory shock
    if permanent:
        w1[S+1] = 1.0
    else:
        w2[S+1] = 1.0
    for t in range(1, T):
        b[t+1] = b[t] - σ2 * w2[t]
        c[t+1] = c[t] + σ1 * w1[t+1] + (1 - β) * σ2 * w2[t+1]
    return b, c


fig, axes = plt.subplots(2, 1, figsize=(10, 8))
titles = ['permanent', 'transitory']

L = 0.175

for ax, truefalse, title in zip(axes, (True, False), titles):
    b, c = time_path(T=20, permanent=truefalse)
    ax.set_title(f'Impulse response: {title} income shock')
    ax.plot(c, 'g-', label="consumption")
    ax.plot(b, 'b-', label="debt")
    ax.plot((S, S), (-L, L), 'k-', lw=0.5)
    ax.grid(alpha=0.5)
    ax.set(xlabel=r'Time', ylim=(-L, L))

axes[0].legend(loc='lower right')

plt.tight_layout()
plt.show()

Notice how the permanent income shock provokes no change in assets −𝑏𝑡+1 and an immediate permanent change in
consumption equal to the permanent increment in non-financial income.
In contrast, notice how most of a transitory income shock is saved and only a small amount is consumed.
The box-like impulse responses of consumption to both types of shock reflect the random walk property of the optimal
consumption decision.
58.4.2 Example 2
Assume now that at time 𝑡 the consumer observes 𝑦𝑡 , and its history up to 𝑡, but not 𝑧𝑡 .
Under this assumption, it is appropriate to use an innovation representation to form 𝐴, 𝐶, 𝑈 in (58.18).
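The discussion in sections 2.9.1 and 2.11.3 of [Ljungqvist and Sargent, 2018] shows that the pertinent state space representation for $y_t$ is
$$
\begin{aligned}
\begin{bmatrix} y_{t+1} \\ a_{t+1} \end{bmatrix}
&= \begin{bmatrix} 1 & -(1 - K) \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} y_t \\ a_t \end{bmatrix}
+ \begin{bmatrix} 1 \\ 1 \end{bmatrix} a_{t+1} \\
y_t &= \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} y_t \\ a_t \end{bmatrix}
\end{aligned}
$$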
where
• 𝐾 ∶= the stationary Kalman gain

• 𝑎𝑡 ∶= 𝑦𝑡 − 𝐸[𝑦𝑡 | 𝑦𝑡−1 , … , 𝑦0 ]
In the same discussion in [Ljungqvist and Sargent, 2018] it is shown that 𝐾 ∈ [0, 1] and that 𝐾 increases as 𝜎1 /𝜎2 does.
In other words, 𝐾 increases as the ratio of the standard deviation of the permanent shock to that of the transitory shock
increases.
Please see first look at the Kalman filter.
Applying formulas (58.18) implies
𝑐𝑡+1 − 𝑐𝑡 = [1 − 𝛽(1 − 𝐾)]𝑎𝑡+1 (58.28)
where the endowment process can now be represented in terms of the univariate innovation to 𝑦𝑡 as
𝑦𝑡+1 − 𝑦𝑡 = 𝑎𝑡+1 − (1 − 𝐾)𝑎𝑡 (58.29)
Equation (58.29) indicates that the consumer regards

• fraction 𝐾 of an innovation 𝑎𝑡+1 to 𝑦𝑡+1 as permanent
• fraction 1 − 𝐾 as purely transitory
The consumer permanently increases his consumption by the full amount of his estimate of the permanent part of 𝑎𝑡+1 ,
but by only (1 − 𝛽) times his estimate of the purely transitory part of 𝑎𝑡+1 .
Therefore, in total, he permanently increments his consumption by a fraction 𝐾 + (1 − 𝛽)(1 − 𝐾) = 1 − 𝛽(1 − 𝐾) of
𝑎𝑡+1 .
He saves the remaining fraction 𝛽(1 − 𝐾).
According to equation (58.29), the first difference of income is a first-order moving average.
Equation (58.28) asserts that the first difference of consumption is IID.
Application of the formula for debt in (58.11) to this example shows that
𝑏𝑡+1 − 𝑏𝑡 = (𝐾 − 1)𝑎𝑡 (58.30)
This indicates how the fraction 𝐾 of the innovation to 𝑦𝑡 that is regarded as permanent influences the fraction of the
innovation that is saved.
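The coefficients in (58.28) and (58.30) can be reproduced numerically from the innovations representation. This is a minimal sketch; the value of the Kalman gain `K` below is a hypothetical number picked for illustration, not one computed from $\sigma_1, \sigma_2$.

```python
import numpy as np

β = 1 / 1.05
K = 0.4          # hypothetical stationary Kalman gain, for illustration only

# Innovations representation for (y_t, a_t)', with a_{t+1} playing the role of the shock
A = np.array([[1.0, -(1 - K)], [0.0, 0.0]])
C = np.array([[1.0], [1.0]])
U = np.array([[1.0, 0.0]])

UM = U @ np.linalg.inv(np.eye(2) - β * A)     # U (I - βA)^{-1}

c_coef = ((1 - β) * UM @ C).item()            # coefficient on a_{t+1} in c_{t+1} - c_t
b_coef = UM @ (A - np.eye(2))                 # coefficients on (y_t, a_t) in b_{t+1} - b_t

print(c_coef, 1 - β * (1 - K))
print(b_coef, K - 1)
```

The first number matches $1 - \beta(1 - K)$, and the debt increment loads only on $a_t$ with coefficient $K - 1$.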
58.5 Further Reading
The model described above significantly changed how economists think about consumption.
While Hall’s model does a remarkably good job as a first approximation to consumption data, it’s widely believed that it
doesn’t capture important aspects of some consumption/savings data.
For example, liquidity constraints and precautionary savings appear to be present sometimes.
Further discussion can be found in, e.g., [Hall and Mishkin, 1982], [Parker, 1999], [Deaton, 1991], [Carroll, 2001].

58.6 Appendix: The Euler Equation
Where does the first-order condition (58.5) come from?

Here we’ll give a proof for the two-period case, which is representative of the general argument.
The finite horizon equivalent of the no-Ponzi condition is that the agent cannot end her life in debt, so 𝑏2 = 0.
From the budget constraint (58.2) we then have
$$c_0 = \frac{b_1}{1 + r} - b_0 + y_0 \quad \text{and} \quad c_1 = y_1 - b_1$$
Here 𝑏0 and 𝑦0 are given constants.
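Substituting these constraints into our two-period objective $u(c_0) + \beta \mathbb{E}_0 [u(c_1)]$ and writing $R := 1 + r$ for the gross interest rate gives
$$\max_{b_1} \left\{ u \left( \frac{b_1}{R} - b_0 + y_0 \right) + \beta \, \mathbb{E}_0 [u(y_1 - b_1)] \right\}$$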
You will be able to verify that the first-order condition is
𝑢′ (𝑐0 ) = 𝛽𝑅 𝔼0 [𝑢′ (𝑐1 )]
Using 𝛽𝑅 = 1 gives (58.5) in the two-period case.

The proof for the general case is similar.
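For completeness, here is a sketch of the differentiation step in the two-period case. It writes $R := 1 + r$ and assumes that differentiation can be passed through the expectation.

```latex
\begin{aligned}
0 &= \frac{\partial}{\partial b_1}
     \left\{ u\!\left( \frac{b_1}{R} - b_0 + y_0 \right)
     + \beta \, \mathbb{E}_0 \left[ u(y_1 - b_1) \right] \right\} \\
  &= \frac{1}{R} u'(c_0) - \beta \, \mathbb{E}_0 \left[ u'(c_1) \right]
  \quad \Longrightarrow \quad
  u'(c_0) = \beta R \, \mathbb{E}_0 \left[ u'(c_1) \right]
\end{aligned}
```

With the quadratic utility $u(c) = -(c - \gamma)^2$ we have $u'(c) = -2(c - \gamma)$, so when $\beta R = 1$ the condition reduces to $c_0 = \mathbb{E}_0[c_1]$.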

CHAPTER
FIFTYNINE
PERMANENT INCOME II: LQ TECHNIQUES
Contents
• Permanent Income II: LQ Techniques

– Overview
– Setup
– The LQ Approach
– Implementation
– Two Example Economies
59.1 Overview
This lecture continues our analysis of the linear-quadratic (LQ) permanent income model of savings and consumption.
As we saw in our previous lecture on this topic, Robert Hall [Hall, 1978] used the LQ permanent income model to restrict
and interpret intertemporal comovements of nondurable consumption, nonfinancial income, and financial wealth.
For example, we saw how the model asserts that for any covariance stationary process for nonfinancial income
• consumption is a random walk
• financial wealth has a unit root and is cointegrated with consumption
Other applications use the same LQ framework.
For example, a model isomorphic to the LQ permanent income model has been used by Robert Barro [Barro, 1979] to
interpret intertemporal comovements of a government’s tax collections, its expenditures net of debt service, and its public
debt.
This isomorphism means that in analyzing the LQ permanent income model, we are in effect also analyzing the Barro tax
smoothing model.
It is just a matter of appropriately relabeling the variables in Hall’s model.
In this lecture, we’ll
• show how the solution to the LQ permanent income model can be obtained using LQ control methods.
• represent the model as a linear state space system as in this lecture.

• apply QuantEcon’s LinearStateSpace class to characterize statistical features of the consumer’s optimal consumption
and borrowing plans.
We’ll then use these characterizations to construct a simple model of cross-section wealth and consumption dynamics in
the spirit of Truman Bewley [Bewley, 1986].
(Later we’ll study other Bewley models—see this lecture.)
The model will prove useful for illustrating concepts such as
• stationarity
• ergodicity
• ensemble moments and cross-section observations

import numpy as np
import scipy.linalg as la
59.2 Setup
Let’s recall the basic features of the model discussed in the permanent income model.
Consumer preferences are ordered by
$$E_0 \sum_{t=0}^\infty \beta^t u(c_t) \tag{59.1}$$
where $u(c) = -(c - \gamma)^2$.

The consumer maximizes (59.1) by choosing a consumption, borrowing plan $\{c_t, b_{t+1}\}_{t=0}^\infty$ subject to the sequence of budget constraints
$$c_t + b_t = \frac{1}{1 + r} b_{t+1} + y_t, \quad t \geq 0 \tag{59.2}$$
and the no-Ponzi condition
$$E_0 \sum_{t=0}^\infty \beta^t b_t^2 < \infty \tag{59.3}$$
The interpretation of all variables and parameters is the same as in the previous lecture.
We continue to assume that (1 + 𝑟)𝛽 = 1.
The dynamics of $\{y_t\}$ again follow the linear state space model
$$
\begin{aligned}
z_{t+1} &= A z_t + C w_{t+1} \\
y_t &= U z_t
\end{aligned}
\tag{59.4}
$$
The restrictions on the shock process and parameters are the same as in our previous lecture.

59.2.1 Digression on a Useful Isomorphism
The LQ permanent income model of consumption is mathematically isomorphic with a version of Barro’s [Barro, 1979]
model of tax smoothing.
In the LQ permanent income model
• the household faces an exogenous process of nonfinancial income
• the household wants to smooth consumption across states and time
In the Barro tax smoothing model
• a government faces an exogenous sequence of government purchases (net of interest payments on its debt)
• a government wants to smooth tax collections across states and time
If we set
• 𝑇𝑡 , total tax collections in Barro’s model to consumption 𝑐𝑡 in the LQ permanent income model.
• 𝐺𝑡 , exogenous government expenditures in Barro’s model to nonfinancial income 𝑦𝑡 in the permanent income
model.
• 𝐵𝑡 , government risk-free one-period assets falling due in Barro’s model to risk-free one-period consumer debt 𝑏𝑡
falling due in the LQ permanent income model.
• 𝑅, the gross rate of return on risk-free one-period government debt in Barro’s model to the gross rate of return
1 + 𝑟 on financial assets in the permanent income model of consumption.
then the two models are mathematically equivalent.
All characterizations of a {𝑐𝑡 , 𝑦𝑡 , 𝑏𝑡 } in the LQ permanent income model automatically apply to a {𝑇𝑡 , 𝐺𝑡 , 𝐵𝑡 } process
in the Barro model of tax smoothing.
See consumption and tax smoothing models for further exploitation of an isomorphism between consumption and tax
smoothing models.
59.2.2 A Specification of the Nonfinancial Income Process
For the purposes of this lecture, let’s assume {𝑦𝑡 } is a second-order univariate autoregressive process:
𝑦𝑡+1 = 𝛼 + 𝜌1 𝑦𝑡 + 𝜌2 𝑦𝑡−1 + 𝜎𝑤𝑡+1
We can map this into the linear state space framework in (59.4), as discussed in our lecture on linear models.
To do so we take
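$$
z_t = \begin{bmatrix} 1 \\ y_t \\ y_{t-1} \end{bmatrix}, \quad
A = \begin{bmatrix} 1 & 0 & 0 \\ \alpha & \rho_1 & \rho_2 \\ 0 & 1 & 0 \end{bmatrix}, \quad
C = \begin{bmatrix} 0 \\ \sigma \\ 0 \end{bmatrix}, \quad \text{and} \quad
U = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}
$$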
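A quick way to convince yourself that this mapping is right is to simulate income both ways with the same shocks. The sketch below uses hypothetical AR(2) parameters and initial conditions, chosen only for illustration.

```python
import numpy as np

# Hypothetical AR(2) parameters, for illustration only
α, ρ1, ρ2, σ = 10.0, 0.9, -0.2, 1.0

A = np.array([[1.0, 0.0, 0.0],
              [α,   ρ1,  ρ2],
              [0.0, 1.0, 0.0]])
C = np.array([0.0, σ, 0.0])
U = np.array([0.0, 1.0, 0.0])

rng = np.random.default_rng(0)
T = 100
w = rng.standard_normal(T + 1)        # w[t] plays the role of w_t; w[0] is unused

y_init, y_lag_init = 30.0, 28.0       # arbitrary starting values for y_0 and y_{-1}

# Simulate via the state space form z_{t+1} = A z_t + C w_{t+1}, y_t = U z_t
z = np.array([1.0, y_init, y_lag_init])
y_ss = np.empty(T + 1)
for t in range(T):
    y_ss[t] = U @ z
    z = A @ z + C * w[t + 1]
y_ss[T] = U @ z

# Simulate the AR(2) recursion directly
y_direct = np.empty(T + 1)
y_direct[0] = y_init
y_prev = y_lag_init
for t in range(T):
    y_next = α + ρ1 * y_direct[t] + ρ2 * y_prev + σ * w[t + 1]
    y_prev = y_direct[t]
    y_direct[t + 1] = y_next

print(np.allclose(y_ss, y_direct))
```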

59.3 The LQ Approach
Previously we solved the permanent income model by solving a system of linear expectational difference equations subject
to two boundary conditions.
Here we solve the same model using LQ methods based on dynamic programming.
After confirming that answers produced by the two methods agree, we apply QuantEcon’s LinearStateSpace class to
illustrate features of the model.
Why solve a model in two distinct ways?
Because by doing so we gather insights about the structure of the model.
Our earlier approach based on solving a system of expectational difference equations brought to the fore the role of the
consumer’s expectations about future nonfinancial income.
On the other hand, formulating the model in terms of an LQ dynamic programming problem reminds us that
• finding the state (of a dynamic programming problem) is an art, and
• iterations on a Bellman equation implicitly jointly solve both a forecasting problem and a control problem
59.3.1 The LQ Problem
Recall from our lecture on LQ theory that the optimal linear regulator problem is to choose a decision rule for 𝑢𝑡 to
minimize
$$\mathbb{E} \sum_{t=0}^\infty \beta^t \{ x_t' R x_t + u_t' Q u_t \},$$
subject to $x_0$ given and the law of motion
$$x_{t+1} = \tilde A x_t + \tilde B u_t + \tilde C w_{t+1}, \quad t \geq 0, \tag{59.5}$$
where 𝑤𝑡+1 is IID with mean vector zero and 𝔼𝑤𝑡 𝑤𝑡′ = 𝐼.
The tildes in $\tilde A, \tilde B, \tilde C$ are to avoid clashing with notation in (59.4).
The value function for this problem is $v(x) = -x' P x - d$, where
• $P$ is the unique positive semidefinite solution of the corresponding matrix Riccati equation.
• The scalar $d$ is given by $d = \beta (1 - \beta)^{-1} \operatorname{trace}(P \tilde C \tilde C')$.
The optimal policy is $u_t = -F x_t$, where $F := \beta (Q + \beta \tilde B' P \tilde B)^{-1} \tilde B' P \tilde A$.
Under an optimal decision rule $F$, the state vector $x_t$ evolves according to $x_{t+1} = (\tilde A - \tilde B F) x_t + \tilde C w_{t+1}$.
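To make the roles of 𝑃 and 𝐹 concrete, here is a minimal sketch that iterates on the discounted matrix Riccati equation for a small example. The function name solve_discounted_lq and the matrices below are assumptions made for illustration; they are not the permanent income matrices, and the lecture itself relies on QuantEcon's LQ class rather than on this hand-rolled iteration.

```python
import numpy as np

def solve_discounted_lq(A, B, R, Q, β, tol=1e-10, max_iter=10_000):
    """
    Iterate on P = R + β A'PA - β² A'PB (Q + β B'PB)^{-1} B'PA
    starting from P = 0, and return P with F = β (Q + β B'PB)^{-1} B'PA.
    """
    P = np.zeros_like(R, dtype=float)
    for _ in range(max_iter):
        F = β * np.linalg.solve(Q + β * B.T @ P @ B, B.T @ P @ A)
        P_new = R + β * A.T @ P @ A - β * A.T @ P @ B @ F
        converged = np.max(np.abs(P_new - P)) < tol
        P = P_new
        if converged:
            break
    F = β * np.linalg.solve(Q + β * B.T @ P @ B, B.T @ P @ A)
    return P, F

# A small hypothetical example: two states, one control
β = 0.95
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
R = np.eye(2)
Q = np.array([[1.0]])

P, F = solve_discounted_lq(A, B, R, Q, β)

# Residual of the Riccati equation at the computed P
residual = P - (R + β * A.T @ P @ A - β * A.T @ P @ B @ F)
# Closed-loop matrix scaled by sqrt(β); its spectral radius should be below one
radius = np.max(np.abs(np.linalg.eigvals(np.sqrt(β) * (A - B @ F))))
print(np.max(np.abs(residual)), radius)
```

The residual check confirms that the computed 𝑃 is (numerically) a fixed point of the Riccati operator, and the spectral radius check confirms that the discounted closed-loop system is stable in this example.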
59.3.2 Mapping into the LQ Framework
To map into the LQ framework, we’ll use
$$
x_t := \begin{bmatrix} z_t \\ b_t \end{bmatrix} = \begin{bmatrix} 1 \\ y_t \\ y_{t-1} \\ b_t \end{bmatrix}
$$
as the state vector and 𝑢𝑡 ∶= 𝑐𝑡 − 𝛾 as the control.

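With this notation and $U_\gamma := \begin{bmatrix} \gamma & 0 & 0 \end{bmatrix}$, we can write the state dynamics as in (59.5) when
$$
\tilde A := \begin{bmatrix} A & 0 \\ (1 + r)(U_\gamma - U) & 1 + r \end{bmatrix}, \quad
\tilde B := \begin{bmatrix} 0 \\ 1 + r \end{bmatrix} \quad \text{and} \quad
\tilde C := \begin{bmatrix} C \\ 0 \end{bmatrix}
$$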
Please confirm for yourself that, with these definitions, the LQ dynamics (59.5) match the dynamics of 𝑧𝑡 and 𝑏𝑡 described
above.
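One way to carry out this confirmation is numerical. The sketch below assumes that the one-period budget constraint from the setup takes the form 𝑏𝑡+1 = (1 + 𝑟)(𝑐𝑡 + 𝑏𝑡 − 𝑦𝑡), and it evaluates both descriptions of the dynamics at an arbitrary test point; the test values and variable names are hypothetical.

```python
import numpy as np

# Parameter values matching those used later in the lecture; γ = 1 is the bliss level
α, β, ρ1, ρ2, σ, γ = 10.0, 0.95, 0.9, 0.0, 1.0, 1.0
r = 1 / β - 1

A = np.array([[1.0, 0.0, 0.0],
              [α,   ρ1,  ρ2],
              [0.0, 1.0, 0.0]])
C = np.array([[0.0], [σ], [0.0]])
U = np.array([[0.0, 1.0, 0.0]])
U_γ = np.array([[γ, 0.0, 0.0]])

# Stack the exogenous block and the debt equation
A_tilde = np.block([[A, np.zeros((3, 1))],
                    [(1 + r) * (U_γ - U), np.array([[1 + r]])]])
B_tilde = np.array([[0.0], [0.0], [0.0], [1 + r]])
C_tilde = np.vstack([C, [[0.0]]])

# An arbitrary (hypothetical) test point
y, y_lag, b, c, w = 3.0, 2.0, 5.0, 4.0, 0.7
x = np.array([1.0, y, y_lag, b])
u = np.array([c - γ])

# Next-period state according to the LQ law of motion
x_next = A_tilde @ x + B_tilde @ u + C_tilde @ np.array([w])

# Next-period state according to the AR(2) process and the assumed budget constraint
y_next = α + ρ1 * y + ρ2 * y_lag + σ * w
b_next = (1 + r) * (c + b - y)
expected = np.array([1.0, y_next, y, b_next])

print(np.allclose(x_next, expected))
```

The 𝛾 term in 𝑈𝛾 offsets the fact that the control is 𝑐𝑡 − 𝛾 rather than 𝑐𝑡, which is why the last row of the stacked system reproduces the budget constraint.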
To map utility into the quadratic form 𝑥′𝑡 𝑅𝑥𝑡 + 𝑢′𝑡 𝑄𝑢𝑡 we can set
• 𝑄 ∶= 1 (remember that we are minimizing) and
• 𝑅 ∶= a 4 × 4 matrix of zeros
However, there is one problem remaining.
We have no direct way to capture the non-recursive restriction (59.3) on the debt sequence {𝑏𝑡 } from within the LQ
framework.
To try to enforce it, we’re going to use a trick: put a small penalty on $b_t^2$ in the criterion function.
In the present setting, this means adding a small entry 𝜖 > 0 in the (4, 4) position of 𝑅.
That will induce a (hopefully) small approximation error in the decision rule.
We’ll check whether it really is small numerically soon.
59.4 Implementation
Let’s write some code to solve the model.

One comment before we start is that the bliss level of consumption 𝛾 in the utility function has no effect on the optimal
decision rule.
We saw this in the previous lecture permanent income.
The reason is that it drops out of the Euler equation for consumption.
In what follows we set it equal to unity.
59.4.1 The Exogenous Nonfinancial Income Process
First, we create the objects for the optimal linear regulator
# Set parameters
α, β, ρ1, ρ2, σ = 10.0, 0.95, 0.9, 0.0, 1.0
R = 1 / β
A = np.array([[1., 0., 0.],
[α, ρ1, ρ2],
[0., 1., 0.]])
C = np.array([[0.], [σ], [0.]])
G = np.array([[0., 1., 0.]])
# Form LinearStateSpace system and pull off steady state moments

μ_z0 = np.array([[1.0], [0.0], [0.0]])
Σ_z0 = np.zeros((3, 3))
Lz = qe.LinearStateSpace(A, C, G, mu_0=μ_z0, Sigma_0=Σ_z0)


μ_z, μ_y, Σ_z, Σ_y, Σ_yx = Lz.stationary_distributions()
# Mean vector of state for the savings problem

mxo = np.vstack([μ_z, 0.0])
# Create stationary covariance matrix of x -- start everyone off at b=0

a1 = np.zeros((3, 1))
aa = np.hstack([Σ_z, a1])
bb = np.zeros((1, 4))
sxo = np.vstack([aa, bb])
# These choices will initialize the state vector of an individual at zero

# debt and the ergodic distribution of the endowment process. Use these to
# create the Bewley economy.
mxbewley = mxo
sxbewley = sxo
The next step is to create the matrices for the LQ system
A12 = np.zeros((3,1))
ALQ_l = np.hstack([A, A12])
ALQ_r = np.array([[0, -R, 0, R]])
ALQ = np.vstack([ALQ_l, ALQ_r])
RLQ = np.array([[0., 0., 0., 0.],

[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 1e-9]])
QLQ = np.array([1.0])
BLQ = np.array([0., 0., 0., R]).reshape(4,1)
CLQ = np.array([0., σ, 0., 0.]).reshape(4,1)
β_LQ = β
Let’s print these out and have a look at them
print(f"A = \n {ALQ}")
print(f"B = \n {BLQ}")
print(f"R = \n {RLQ}")
print(f"Q = \n {QLQ}")
A =
[[ 1. 0. 0. 0. ]
[10. 0.9 0. 0. ]
[ 0. 1. 0. 0. ]
[ 0. -1.05263158 0. 1.05263158]]
B =
[[0. ]
[0. ]
[0. ]
[1.05263158]]
R =
[[0.e+00 0.e+00 0.e+00 0.e+00]
[0.e+00 0.e+00 0.e+00 0.e+00]
[0.e+00 0.e+00 0.e+00 0.e+00]


[0.e+00 0.e+00 0.e+00 1.e-09]]
Q =
[1.]
Now create the appropriate instance of an LQ model
lqpi = qe.LQ(QLQ, RLQ, ALQ, BLQ, C=CLQ, beta=β_LQ)
We’ll save the implied optimal policy function and soon compare it with what we get by employing an alternative solution method
P, F, d = lqpi.stationary_values() # Compute value function and decision rule

ABF = ALQ - BLQ @ F # Form closed loop system
59.4.2 Comparison with the Difference Equation Approach
In our first lecture on the infinite horizon permanent income problem we used a different solution method.
The method was based around
• deducing the Euler equations that are the first-order conditions with respect to consumption and savings.
• using the budget constraints and boundary condition to complete a system of expectational linear difference equations.
• solving those equations to obtain the solution.
Expressed in state space notation, the solution took the form

𝑏𝑡+1 = 𝑏𝑡 + 𝑈 [(𝐼 − 𝛽𝐴)−1 (𝐴 − 𝐼)]𝑧𝑡
𝑐𝑡 = (1 − 𝛽)[𝑈 (𝐼 − 𝛽𝐴)−1 𝑧𝑡 − 𝑏𝑡 ]
Now we’ll apply the formulas in this system
# Use the above formulas to create the optimal policies for b_{t+1} and c_t
b_pol = G @ la.inv(np.eye(3, 3) - β * A) @ (A - np.eye(3, 3))
c_pol = (1 - β) * G @ la.inv(np.eye(3, 3) - β * A)
# Create the A matrix for a LinearStateSpace instance

A_LSS1 = np.vstack([A, b_pol])
A_LSS2 = np.eye(4, 1, -3)
A_LSS = np.hstack([A_LSS1, A_LSS2])
# Create the C matrix for LSS methods

C_LSS = np.vstack([C, np.zeros(1)])
# Create the G matrix for LSS methods

G_LSS1 = np.vstack([G, c_pol])
G_LSS2 = np.vstack([np.zeros(1), -(1 - β)])
G_LSS = np.hstack([G_LSS1, G_LSS2])
# Use the following values to start everyone off at b=0, initial incomes zero


μ_0 = np.array([1., 0., 0., 0.])
Σ_0 = np.zeros((4, 4))
A_LSS calculated as we have here should equal ABF calculated above using the LQ model
ABF - A_LSS
array([[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,

0.00000000e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00],
[-9.51248175e-06, 9.51247915e-08, 0.00000000e+00,
-1.99999923e-08]])
Now compare pertinent elements of c_pol and F
print(c_pol, "\n", -F)
[[65.51724138 0.34482759 0. ]]
[[ 6.55172323e+01 3.44827677e-01 -0.00000000e+00 -5.00000190e-02]]
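We have verified that the two methods give the same solution, up to a small approximation error (of order $10^{-5}$ or less here) induced by the penalty 𝜖 on debt.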
Now let’s create instances of the LinearStateSpace class and use it to do some interesting experiments.
To do this, we’ll use the outcomes from our second method.
59.5 Two Example Economies
In the spirit of Bewley models [Bewley, 1986], we’ll generate panels of consumers.
The examples differ only in the initial states with which we endow the consumers.
All other parameter values are kept the same in the two examples
• In the first example, all consumers begin with zero nonfinancial income and zero debt.
– The consumers are thus ex-ante identical.
• In the second example, while all begin with zero debt, we draw their initial income levels from the invariant distribution of nonfinancial income.
– Consumers are ex-ante heterogeneous.
In the first example, consumers’ nonfinancial income paths display pronounced transients early in the sample
• these will affect outcomes in striking ways
Those transient effects will not be present in the second example.
We use methods affiliated with the LinearStateSpace class to simulate the model.

59.5.1 First Set of Initial Conditions
We generate 25 paths of the exogenous non-financial income process and the associated optimal consumption and debt
paths.
In the first set of graphs, darker lines depict a particular sample path, while the lighter lines describe 24 other paths.
A second graph plots a collection of simulations against the population distribution that we extract from the LinearStateSpace instance lss.
Comparing sample paths with population distributions at each date 𝑡 is a useful exercise—see our discussion of the laws
of large numbers
lss = qe.LinearStateSpace(A_LSS, C_LSS, G_LSS, mu_0=μ_0, Sigma_0=Σ_0)
59.5.2 Population and Sample Panels
In the code below, we use the LinearStateSpace class to

• compute and plot population quantiles of the distributions of consumption and debt for a population of consumers.
• simulate a group of 25 consumers and plot sample paths on the same graph as the population distribution.
def income_consumption_debt_series(A, C, G, μ_0, Σ_0, T=150, npaths=25):
    """
    This function takes initial conditions (μ_0, Σ_0) and uses the
    LinearStateSpace class from QuantEcon to simulate an economy
    npaths times for T periods. It returns the simulated paths together
    with the population moments used in the graphs discussed below.
    """
    lss = qe.LinearStateSpace(A, C, G, mu_0=μ_0, Sigma_0=Σ_0)

    # Simulation/Moment Parameters
    moment_generator = lss.moment_sequence()

    # Simulate various paths
    bsim = np.empty((npaths, T))
    csim = np.empty((npaths, T))
    ysim = np.empty((npaths, T))

    for i in range(npaths):
        sims = lss.simulate(T)
        bsim[i, :] = sims[0][-1, :]    # debt is the last state
        csim[i, :] = sims[1][1, :]     # consumption is the second observable
        ysim[i, :] = sims[1][0, :]     # income is the first observable

    # Get the moments
    cons_mean = np.empty(T)
    cons_var = np.empty(T)
    debt_mean = np.empty(T)
    debt_var = np.empty(T)

    for t in range(T):
        μ_x, μ_y, Σ_x, Σ_y = next(moment_generator)
        cons_mean[t], cons_var[t] = μ_y[1], Σ_y[1, 1]
        debt_mean[t], debt_var[t] = μ_x[3], Σ_x[3, 3]

    return bsim, csim, ysim, cons_mean, cons_var, debt_mean, debt_var


def consumption_income_debt_figure(bsim, csim, ysim):
    # Get T
    T = bsim.shape[1]

    # Create the first figure
    fig, ax = plt.subplots(2, 1, figsize=(10, 8))
    xvals = np.arange(T)

    # Plot consumption and income
    ax[0].plot(csim[0, :], label="c", color="b")
    ax[0].plot(ysim[0, :], label="y", color="g")
    ax[0].plot(csim.T, alpha=.1, color="b")
    ax[0].plot(ysim.T, alpha=.1, color="g")
    ax[0].legend(loc=4)
    ax[0].set(title="Nonfinancial Income, Consumption, and Debt",
              xlabel="t", ylabel="y and c")

    # Plot debt
    ax[1].plot(bsim[0, :], label="b", color="r")
    ax[1].plot(bsim.T, alpha=.1, color="r")
    ax[1].legend(loc=4)
    ax[1].set(xlabel="t", ylabel="debt")

    fig.tight_layout()
    return fig
def consumption_debt_fanchart(csim, cons_mean, cons_var,
                              bsim, debt_mean, debt_var):
    # Get T
    T = bsim.shape[1]

    # Create percentiles of cross-section distributions (consumption)
    cmean = np.mean(cons_mean)
    c90 = 1.65 * np.sqrt(cons_var)
    c95 = 1.96 * np.sqrt(cons_var)
    c_perc_95p, c_perc_95m = cons_mean + c95, cons_mean - c95
    c_perc_90p, c_perc_90m = cons_mean + c90, cons_mean - c90

    # Create percentiles of cross-section distributions (debt)
    dmean = np.mean(debt_mean)
    d90 = 1.65 * np.sqrt(debt_var)
    d95 = 1.96 * np.sqrt(debt_var)
    d_perc_95p, d_perc_95m = debt_mean + d95, debt_mean - d95
    d_perc_90p, d_perc_90m = debt_mean + d90, debt_mean - d90

    # Create second figure
    fig, ax = plt.subplots(2, 1, figsize=(10, 8))
    xvals = np.arange(T)

    # Consumption fan
    ax[0].plot(xvals, cons_mean, color="k")
    ax[0].plot(csim.T, color="k", alpha=.25)
    ax[0].fill_between(xvals, c_perc_95m, c_perc_95p, alpha=.25, color="b")
    ax[0].fill_between(xvals, c_perc_90m, c_perc_90p, alpha=.25, color="r")
    ax[0].set(title="Consumption/Debt over time",
              ylim=(cmean-15, cmean+15), ylabel="consumption")

    # Debt fan
    ax[1].plot(xvals, debt_mean, color="k")
    ax[1].plot(bsim.T, color="k", alpha=.25)
    ax[1].fill_between(xvals, d_perc_95m, d_perc_95p, alpha=.25, color="b")
    ax[1].fill_between(xvals, d_perc_90m, d_perc_90p, alpha=.25, color="r")
    ax[1].set(xlabel="t", ylabel="debt")

    fig.tight_layout()
    return fig
Now let’s create figures with initial conditions of zero for 𝑦0 and 𝑏0
out = income_consumption_debt_series(A_LSS, C_LSS, G_LSS, μ_0, Σ_0)

bsim0, csim0, ysim0 = out[:3]
cons_mean0, cons_var0, debt_mean0, debt_var0 = out[3:]
consumption_income_debt_figure(bsim0, csim0, ysim0)
plt.show()



consumption_debt_fanchart(csim0, cons_mean0, cons_var0,

bsim0, debt_mean0, debt_var0)
plt.show()

Here is what is going on in the above graphs.

For our simulation, we have set initial conditions 𝑏0 = 𝑦−1 = 𝑦−2 = 0.
Because 𝑦−1 = 𝑦−2 = 0, nonfinancial income 𝑦𝑡 starts far below its stationary mean 𝜇𝑦,∞ and rises early in each
simulation.
Recall from the previous lecture that we can represent the optimal decision rule for consumption in terms of the cointegrating relationship
$$
(1 - \beta) b_t + c_t = (1 - \beta) E_t \sum_{j=0}^{\infty} \beta^j y_{t+j} \tag{59.6}
$$
So at time 0, since 𝑏0 = 0, we have
$$
c_0 = (1 - \beta) E_0 \sum_{t=0}^{\infty} \beta^t y_t
$$
This tells us that consumption starts at the income that would be paid by an annuity whose value equals the expected
discounted value of nonfinancial income at time 𝑡 = 0.
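The annuity value at time 0 can be computed directly. The following is a minimal sketch, assuming the parameter values used in the implementation above and the initial state 𝑧0 = (1, 0, 0); it evaluates the expected discounted sum both in closed form, via 𝑈(𝐼 − 𝛽𝐴)−1𝑧0, and by brute-force truncation. The truncation horizon is an arbitrary choice.

```python
import numpy as np

α, β, ρ1, ρ2 = 10.0, 0.95, 0.9, 0.0
A = np.array([[1.0, 0.0, 0.0],
              [α,   ρ1,  ρ2],
              [0.0, 1.0, 0.0]])
U = np.array([0.0, 1.0, 0.0])
z0 = np.array([1.0, 0.0, 0.0])      # zero initial income, so E_0 y_0 = 0

# Closed form: sum_t β^t E_0 y_t = U (I - βA)^{-1} z_0
pv_income = U @ np.linalg.solve(np.eye(3) - β * A, z0)
c0 = (1 - β) * pv_income

# Brute force: E_0 z_t = A^t z_0 because the shocks have mean zero
T = 2000                            # truncation horizon (an assumption)
z, pv_trunc = z0.copy(), 0.0
for t in range(T):
    pv_trunc += β**t * (U @ z)
    z = A @ z
c0_trunc = (1 - β) * pv_trunc

print(c0, c0_trunc)
```

Both numbers should be close to 65.517, the coefficient on the constant in c_pol printed earlier, which is the level at which consumption starts when 𝑏0 = 0 and initial income is zero.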
To support that level of consumption, the consumer borrows a lot early and consequently builds up substantial debt.
In fact, he or she incurs so much debt that eventually, in the stochastic steady state, he or she consumes less each period than his or her nonfinancial income.
The consumer uses the gap between consumption and nonfinancial income mostly to service the interest payments due on the debt.

Thus, when we look at the panel of debt in the accompanying graph, we see that this is a group of ex-ante identical people
each of whom starts with zero debt.
All of them accumulate debt in anticipation of rising nonfinancial income.
They expect their nonfinancial income to rise toward the invariant distribution of income, a consequence of our having
started them at 𝑦−1 = 𝑦−2 = 0.
Cointegration Residual
The following figure plots realizations of the left side of (59.6), which, as discussed in our last lecture, is called the
cointegrating residual.
As mentioned above, the right side can be thought of as an annuity payment on the expected present value of future income $E_t \sum_{j=0}^{\infty} \beta^j y_{t+j}$.
Early along a realization, $c_t$ is approximately constant while $(1 - \beta) b_t$ and $(1 - \beta) E_t \sum_{j=0}^{\infty} \beta^j y_{t+j}$ both rise markedly as the household’s present value of income and borrowing rise pretty much together.
This example illustrates the following point: the definition of cointegration implies that the cointegrating residual is
asymptotically covariance stationary, not covariance stationary.
The cointegrating residual for the specification with zero income and zero debt initially has a notable transient component
that dominates its behavior early in the sample.
By altering initial conditions, we shall remove this transient in our second example to be presented below
def cointegration_figure(bsim, csim):
    """
    Plots the cointegrating residual (1 - β) b_t + c_t for each
    simulated path (β is taken from the enclosing scope).
    """
    # Create figure
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot((1 - β) * bsim[0, :] + csim[0, :], color="k")
    ax.plot((1 - β) * bsim.T + csim.T, color="k", alpha=.1)
    ax.set(title="Cointegration of Assets and Consumption", xlabel="t")
    return fig
cointegration_figure(bsim0, csim0)
plt.show()

59.5.3 A “Borrowers and Lenders” Closed Economy
When we set 𝑦−1 = 𝑦−2 = 0 and 𝑏0 = 0 in the preceding exercise, we make debt “head north” early in the sample.
Average debt in the cross-section rises and approaches the asymptote.
We can regard these as outcomes of a “small open economy” that borrows from abroad at the fixed gross interest rate
𝑅 = 𝑟 + 1 in anticipation of rising incomes.
So with the economic primitives set as above, the economy converges to a steady state in which there is an excess aggregate
supply of risk-free loans at a gross interest rate of 𝑅.
This excess supply is filled by “foreign lenders” willing to make those loans.
We can use virtually the same code to rig a “poor man’s Bewley [Bewley, 1986] model” in the following way
• as before, we start everyone at 𝑏0 = 0.
• But instead of starting everyone at $y_{-1} = y_{-2} = 0$, we draw $\begin{bmatrix} y_{-1} \\ y_{-2} \end{bmatrix}$ from the invariant distribution of the $\{y_t\}$ process.
This rigs a closed economy in which people are borrowing and lending with each other at a gross risk-free interest rate of $R = \beta^{-1}$.

Across the group of people being analyzed, risk-free loans are in zero excess supply.
We have arranged primitives so that $R = \beta^{-1}$ clears the market for risk-free loans at zero aggregate excess supply.
So the risk-free loans are being made from one person to another within our closed set of agents.
There is no need for foreigners to lend to our group.
Let’s have a look at the corresponding figures
out = income_consumption_debt_series(A_LSS, C_LSS, G_LSS, mxbewley, sxbewley)

bsimb, csimb, ysimb = out[:3]
cons_meanb, cons_varb, debt_meanb, debt_varb = out[3:]
consumption_income_debt_figure(bsimb, csimb, ysimb)
plt.show()



consumption_debt_fanchart(csimb, cons_meanb, cons_varb,

bsimb, debt_meanb, debt_varb)
plt.show()

The graphs confirm the following outcomes:

• As before, the consumption distribution spreads out over time.
But now there is some initial dispersion because there is ex-ante heterogeneity in the initial draws of $\begin{bmatrix} y_{-1} \\ y_{-2} \end{bmatrix}$.
• As before, the cross-section distribution of debt spreads out over time.
• Unlike before, the average level of debt stays at zero, confirming that this is a closed borrower-and-lender economy.
• Now the cointegrating residual seems stationary, and not just asymptotically stationary.
Let’s have a look at the cointegration figure
cointegration_figure(bsimb, csimb)
plt.show()



CHAPTER
SIXTY
PRODUCTION SMOOTHING VIA INVENTORIES
Contents
• Production Smoothing via Inventories

– Overview
– Example 1
– Inventories Not Useful
– Inventories Useful but are Hardwired to be Zero Always
– Example 2
– Example 3
– Example 4
– Example 5
– Example 6
– Exercises
In addition to what’s in Anaconda, this lecture employs the following library:
!pip install quantecon
60.1 Overview
This lecture can be viewed as an application of this quantecon lecture about linear quadratic control theory.
It formulates a discounted dynamic program for a firm that chooses a production schedule to balance
• minimizing costs of production across time, against
• keeping costs of holding inventories low
In the tradition of a classic book by Holt, Modigliani, Muth, and Simon [Holt et al., 1960], we simplify the firm’s problem
by formulating it as a linear quadratic discounted dynamic programming problem of the type studied in this quantecon
lecture.
Because its costs of production are increasing and quadratic in production, the firm holds inventories as a buffer stock in
order to smooth production across time, provided that holding inventories is not too costly.
But the firm also wants to make its sales out of existing inventories, a preference that we represent by a cost that is quadratic
in the difference between sales in a period and the firm’s beginning of period inventories.
We compute examples designed to indicate how the firm optimally smooths production while keeping inventories close
to sales.
To introduce components of the model, let
• 𝑆𝑡 be sales at time 𝑡
• 𝑄𝑡 be production at time 𝑡
• 𝐼𝑡 be inventories at the beginning of time 𝑡
• 𝛽 ∈ (0, 1) be a discount factor
• $c(Q_t) = c_1 Q_t + c_2 Q_t^2$ be a cost of production function, where $c_1 > 0$, $c_2 > 0$
• $d(I_t, S_t) = d_1 I_t + d_2 (S_t - I_t)^2$, where $d_1 > 0$, $d_2 > 0$, be a cost-of-holding-inventories function, consisting of two components:
– a cost $d_1 I_t$ of carrying inventories, and
– a cost $d_2 (S_t - I_t)^2$ of having inventories deviate from sales
• 𝑝𝑡 = 𝑎0 − 𝑎1 𝑆𝑡 + 𝑣𝑡 be an inverse demand function for a firm’s product, where 𝑎0 > 0, 𝑎1 > 0 and 𝑣𝑡 is a demand
shock at time 𝑡
• $\pi_t = p_t S_t - c(Q_t) - d(I_t, S_t)$ be the firm’s profits at time $t$
• $\sum_{t=0}^{\infty} \beta^t \pi_t$ be the present value of the firm’s profits at time 0
• 𝐼𝑡+1 = 𝐼𝑡 + 𝑄𝑡 − 𝑆𝑡 be the law of motion of inventories
• 𝑧𝑡+1 = 𝐴22 𝑧𝑡 + 𝐶2 𝜖𝑡+1 be a law of motion for an exogenous state vector 𝑧𝑡 that contains time 𝑡 information useful
for predicting the demand shock 𝑣𝑡
• 𝑣𝑡 = 𝐺𝑧𝑡 link the demand shock to the information set 𝑧𝑡
• the constant 1 be the first component of 𝑧𝑡
To map our problem into a linear-quadratic discounted dynamic programming problem (also known as an optimal linear
regulator), we define the state vector at time 𝑡 as
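$$
x_t = \begin{bmatrix} I_t \\ z_t \end{bmatrix}
$$
and the control vector as
$$
u_t = \begin{bmatrix} Q_t \\ S_t \end{bmatrix}
$$
The law of motion for the state vector $x_t$ is evidently
$$
\begin{bmatrix} I_{t+1} \\ z_{t+1} \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 0 & A_{22} \end{bmatrix} \begin{bmatrix} I_t \\ z_t \end{bmatrix} +
\begin{bmatrix} 1 & -1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} Q_t \\ S_t \end{bmatrix} +
\begin{bmatrix} 0 \\ C_2 \end{bmatrix} \epsilon_{t+1}
$$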
or
𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 + 𝐶𝜖𝑡+1
(At this point, we ask that you please forgive us for using 𝑄𝑡 to be the firm’s production at time 𝑡, while below we use 𝑄
as the matrix in the quadratic form 𝑢′𝑡 𝑄𝑢𝑡 that appears in the firm’s one-period profit function)
We can express the firm’s profit as a function of states and controls as
𝜋𝑡 = −(𝑥′𝑡 𝑅𝑥𝑡 + 𝑢′𝑡 𝑄𝑢𝑡 + 2𝑢′𝑡 𝑁 𝑥𝑡 )

To form the matrices 𝑅, 𝑄, 𝑁 in an LQ dynamic programming problem, we note that the firm’s time 𝑡 profit function can be expressed as
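$$
\begin{aligned}
\pi_t &= p_t S_t - c(Q_t) - d(I_t, S_t) \\
&= (a_0 - a_1 S_t + v_t) S_t - c_1 Q_t - c_2 Q_t^2 - d_1 I_t - d_2 (S_t - I_t)^2 \\
&= a_0 S_t - a_1 S_t^2 + G z_t S_t - c_1 Q_t - c_2 Q_t^2 - d_1 I_t - d_2 S_t^2 - d_2 I_t^2 + 2 d_2 S_t I_t \\
&= -\Big( \underbrace{d_1 I_t + d_2 I_t^2}_{x_t' R x_t}
   + \underbrace{a_1 S_t^2 + d_2 S_t^2 + c_2 Q_t^2}_{u_t' Q u_t}
   \underbrace{- a_0 S_t - G z_t S_t + c_1 Q_t - 2 d_2 S_t I_t}_{2 u_t' N x_t} \Big) \\
&= -\Bigg(
   \begin{bmatrix} I_t & z_t' \end{bmatrix}
   \underbrace{\begin{bmatrix} d_2 & \frac{d_1}{2} S_c \\ \frac{d_1}{2} S_c' & 0 \end{bmatrix}}_{\equiv R}
   \begin{bmatrix} I_t \\ z_t \end{bmatrix}
   + \begin{bmatrix} Q_t & S_t \end{bmatrix}
   \underbrace{\begin{bmatrix} c_2 & 0 \\ 0 & a_1 + d_2 \end{bmatrix}}_{\equiv Q}
   \begin{bmatrix} Q_t \\ S_t \end{bmatrix}
   + 2 \begin{bmatrix} Q_t & S_t \end{bmatrix}
   \underbrace{\begin{bmatrix} 0 & \frac{c_1}{2} S_c \\ -d_2 & -\frac{a_0}{2} S_c - \frac{G}{2} \end{bmatrix}}_{\equiv N}
   \begin{bmatrix} I_t \\ z_t \end{bmatrix}
   \Bigg)
\end{aligned}
$$
where $S_c = [1, 0]$.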
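The decomposition into 𝑅, 𝑄 and 𝑁 can be checked numerically. The sketch below assumes the default parameter values listed later in this lecture and a two-dimensional 𝑧𝑡 = (1, 𝑣𝑡) with 𝐺 = [0, 1]; the test point is arbitrary. It evaluates profits directly and through the quadratic form, and the two should agree.

```python
import numpy as np

# Default-style parameter values (assumed for this sketch)
c1, c2, d1, d2, a0, a1 = 1.0, 1.0, 1.0, 1.0, 10.0, 1.0
G = np.array([0.0, 1.0])
Sc = np.array([1.0, 0.0])

# State x = (I, 1, v), control u = (Q, S)
R = np.zeros((3, 3))
R[0, 0] = d2
R[1:, 0] = d1 / 2 * Sc
R[0, 1:] = d1 / 2 * Sc

Q = np.zeros((2, 2))
Q[0, 0] = c2
Q[1, 1] = a1 + d2

N = np.zeros((2, 3))
N[1, 0] = -d2
N[0, 1:] = c1 / 2 * Sc
N[1, 1:] = -a0 / 2 * Sc - G / 2

# An arbitrary test point
rng = np.random.default_rng(1)
I_t, Q_t, S_t, v_t = rng.uniform(0.0, 5.0, size=4)
x = np.array([I_t, 1.0, v_t])
u = np.array([Q_t, S_t])

# Profits computed directly from the primitives
p_t = a0 - a1 * S_t + v_t
profit_direct = (p_t * S_t - (c1 * Q_t + c2 * Q_t**2)
                 - (d1 * I_t + d2 * (S_t - I_t)**2))

# Profits computed from the quadratic form
profit_quadratic = -(x @ R @ x + u @ Q @ u + 2 * u @ N @ x)

print(profit_direct, profit_quadratic)
```

Splitting the linear term 𝑑1𝐼𝑡 symmetrically across the off-diagonal blocks of 𝑅 works because the first component of 𝑧𝑡 is the constant 1.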

Remark on notation: The notation for cross product term in the QuantEcon library is 𝑁 .
The firm’s optimal decision rule takes the form
𝑢𝑡 = −𝐹 𝑥𝑡
and the evolution of the state under the optimal decision rule is
𝑥𝑡+1 = (𝐴 − 𝐵𝐹 )𝑥𝑡 + 𝐶𝜖𝑡+1
The firm chooses a decision rule for 𝑢𝑡 that maximizes

$$
E_0 \sum_{t=0}^{\infty} \beta^t \pi_t
$$
subject to a given 𝑥0 .
This is a stochastic discounted LQ dynamic program.
Here is code for computing an optimal decision rule and for analyzing its consequences.

import numpy as np
import quantecon as qe
import matplotlib.pyplot as plt


class SmoothingExample:
    """
    Class for constructing, solving, and plotting results for
    inventories and sales smoothing problem.
    """

    def __init__(self,
                 β=0.96,           # Discount factor
                 c1=1,             # Cost-of-production
                 c2=1,
                 d1=1,             # Cost-of-holding inventories
                 d2=1,
                 a0=10,            # Inverse demand function
                 a1=1,
                 A22=[[1,   0],    # z process
                      [1, 0.9]],
                 C2=[[0], [1]],
                 G=[0, 1]):

        self.β = β
        self.c1, self.c2 = c1, c2
        self.d1, self.d2 = d1, d2
        self.a0, self.a1 = a0, a1
        self.A22 = np.atleast_2d(A22)
        self.C2 = np.atleast_2d(C2)
        self.G = np.atleast_2d(G)

        # Dimensions
        k, j = self.C2.shape        # Dimensions for randomness part
        n = k + 1                   # Number of states
        m = 2                       # Number of controls

        Sc = np.zeros(k)
        Sc[0] = 1                   # Selects the constant in z

        # Construct matrices of transition law
        A = np.zeros((n, n))
        A[0, 0] = 1
        A[1:, 1:] = self.A22

        B = np.zeros((n, m))
        B[0, :] = 1, -1

        C = np.zeros((n, j))
        C[1:, :] = self.C2

        self.A, self.B, self.C = A, B, C

        # Construct matrices of one period profit function
        R = np.zeros((n, n))
        R[0, 0] = d2
        R[1:, 0] = d1 / 2 * Sc
        R[0, 1:] = d1 / 2 * Sc

        Q = np.zeros((m, m))
        Q[0, 0] = c2
        Q[1, 1] = a1 + d2

        N = np.zeros((m, n))
        N[1, 0] = - d2
        N[0, 1:] = c1 / 2 * Sc
        N[1, 1:] = - a0 / 2 * Sc - self.G / 2

        self.R, self.Q, self.N = R, Q, N

        # Construct LQ instance
        self.LQ = qe.LQ(Q, R, A, B, C, N, beta=β)
        self.LQ.stationary_values()

    def simulate(self, x0, T=100):

        c1, c2 = self.c1, self.c2
        d1, d2 = self.d1, self.d2
        a0, a1 = self.a0, self.a1
        G = self.G

        x_path, u_path, w_path = self.LQ.compute_sequence(x0, ts_length=T)

        I_path = x_path[0, :-1]
        z_path = x_path[1:, :-1]
        ν_path = (G @ z_path)[0, :]     # Demand shock path

        Q_path = u_path[0, :]
        S_path = u_path[1, :]

        revenue = (a0 - a1 * S_path + ν_path) * S_path
        cost_production = c1 * Q_path + c2 * Q_path ** 2
        cost_inventories = d1 * I_path + d2 * (S_path - I_path) ** 2

        # Production under the two alternative firm problems
        Q_no_inventory = (a0 + ν_path - c1) / (2 * (a1 + c2))
        Q_hardwired = (a0 + ν_path - c1) / (2 * (a1 + c2 + d2))

        fig, ax = plt.subplots(2, 2, figsize=(15, 10))

        ax[0, 0].plot(range(T), I_path, label="inventories")
        ax[0, 0].plot(range(T), S_path, label="sales")
        ax[0, 0].plot(range(T), Q_path, label="production")
        ax[0, 0].legend(loc=1)
        ax[0, 0].set_title("inventories, sales, and production")

        ax[0, 1].plot(range(T), (Q_path - S_path), color='b')
        ax[0, 1].set_ylabel("change in inventories", color='b')
        span = max(abs(Q_path - S_path))
        ax[0, 1].set_ylim(0-span*1.1, 0+span*1.1)
        ax[0, 1].set_title("demand shock and change in inventories")

        ax1_ = ax[0, 1].twinx()
        ax1_.plot(range(T), ν_path, color='r')
        ax1_.set_ylabel("demand shock", color='r')
        span = max(abs(ν_path))
        ax1_.set_ylim(0-span*1.1, 0+span*1.1)
        ax1_.plot([0, T], [0, 0], '--', color='k')

        ax[1, 0].plot(range(T), revenue, label="revenue")
        ax[1, 0].plot(range(T), cost_production, label="cost_production")
        ax[1, 0].plot(range(T), cost_inventories, label="cost_inventories")
        ax[1, 0].legend(loc=1)
        ax[1, 0].set_title("profits decomposition")

        ax[1, 1].plot(range(T), Q_path, label="production")
        ax[1, 1].plot(range(T), Q_hardwired,
                      label="production when $I_t$ forced to be zero")
        ax[1, 1].plot(range(T), Q_no_inventory,
                      label="production when inventories not useful")
        ax[1, 1].legend(loc=1)
        ax[1, 1].set_title('three production concepts')

        plt.show()
Notice that the above code sets parameters at the following default values
• discount factor 𝛽 = 0.96,
• inverse demand function: 𝑎0 = 10, 𝑎1 = 1
• cost of production 𝑐1 = 1, 𝑐2 = 1
• costs of holding inventories 𝑑1 = 1, 𝑑2 = 1
In the examples below, we alter some or all of these parameter values.
60.2 Example 1
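In this example, the demand shock follows an AR(1) process:
$$
\nu_t = \alpha + \rho \nu_{t-1} + \epsilon_t,
$$
which implies
$$
z_{t+1} = \begin{bmatrix} 1 \\ \nu_{t+1} \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ \alpha & \rho \end{bmatrix}
\underbrace{\begin{bmatrix} 1 \\ \nu_t \end{bmatrix}}_{z_t}
+ \begin{bmatrix} 0 \\ 1 \end{bmatrix} \epsilon_{t+1}.
$$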
We set 𝛼 = 1 and 𝜌 = 0.9, their default values.
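As a small illustration of this representation, the sketch below iterates the mean of 𝑧𝑡 forward under the default values 𝛼 = 1 and 𝜌 = 0.9, starting from a demand shock of zero. Because the innovation has mean zero, the expected demand shock follows the deterministic part of the recursion and should approach 𝛼/(1 − 𝜌) = 10. The horizon of 500 periods is an arbitrary choice.

```python
import numpy as np

α, ρ = 1.0, 0.9
A22 = np.array([[1.0, 0.0],
                [α,   ρ]])
G = np.array([0.0, 1.0])

# Iterate E z_{t+1} = A22 E z_t, starting from z_0 = (1, 0)
z_mean = np.array([1.0, 0.0])
mean_path = []
for t in range(500):
    mean_path.append(G @ z_mean)
    z_mean = A22 @ z_mean

mean_path = np.array(mean_path)
long_run_mean = α / (1 - ρ)

print(mean_path[:3], mean_path[-1], long_run_mean)
```

The expected demand shock rises monotonically from zero, so a firm that starts at this initial condition anticipates growing demand in the early periods of a simulation.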

We’ll calculate and display outcomes, then discuss them below the pertinent figures.
ex1 = SmoothingExample()
x0 = [0, 1, 0]
ex1.simulate(x0)

The figures above illustrate various features of an optimal production plan.

Starting from zero inventories, the firm builds up a stock of inventories and uses them to smooth costly production in the
face of demand shocks.
Optimal decisions evidently respond to demand shocks.
Inventories are always less than sales, so some sales come from current production, a consequence of the cost 𝑑1 𝐼𝑡 of holding inventories.
The lower right panel shows differences between optimal production and two alternative production concepts that come
from altering the firm’s cost structure – i.e., its technology.
These two concepts correspond to these distinct altered firm problems.
• a setting in which inventories are not needed
• a setting in which they are needed but we arbitrarily prevent the firm from holding inventories by forcing it to set
𝐼𝑡 = 0 always
We use these two alternative production concepts in order to shed light on the baseline model.

60.3 Inventories Not Useful
Let’s turn first to the setting in which inventories aren’t needed.

In this problem, the firm forms an output plan that maximizes the expected value of
$$
\sum_{t=0}^{\infty} \beta^t \{ p_t Q_t - c(Q_t) \}
$$
When inventories aren’t required or used, sales always equal production.
This simplifies the problem: the optimal plan for $Q_t$ also solves the sequence of static problems $\max_{Q_t} \{ p_t Q_t - c(Q_t) \}$ with $p_t = a_0 - a_1 Q_t + \nu_t$.
The first-order condition $a_0 + \nu_t - 2 a_1 Q_t - c_1 - 2 c_2 Q_t = 0$ gives the optimal decision rule
$$
Q_t^{ni} = \frac{a_0 + \nu_t - c_1}{2 (c_2 + a_1)}.
$$
60.4 Inventories Useful but are Hardwired to be Zero Always
Next, we turn to a distinct problem in which inventories are useful, meaning that there are costs of $d_2 (I_t - S_t)^2$ associated with having sales not equal to inventories, but we arbitrarily impose on the firm the costly restriction that it never hold inventories.
Here the firm’s maximization problem is
$$
\max_{\{I_t, Q_t, S_t\}} \sum_{t=0}^{\infty} \beta^t \{ p_t S_t - c(Q_t) - d(I_t, S_t) \}
$$
subject to the restrictions that $I_t = 0$ for all $t$ and that $I_{t+1} = I_t + Q_t - S_t$.
The restriction that $I_t = 0$ implies that $Q_t = S_t$ and that the maximization problem reduces to
$$
\max_{Q_t} \sum_{t=0}^{\infty} \beta^t \{ p_t Q_t - c(Q_t) - d(0, Q_t) \}
$$
Here the optimal production plan is
$$
Q_t^{h} = \frac{a_0 + \nu_t - c_1}{2 (c_2 + a_1 + d_2)}.
$$
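The two static decision rules can be checked by brute force. The sketch below assumes the default parameter values and an arbitrary value of the demand shock; it maximizes each one-period objective over a fine grid and compares the maximizers with the expressions used for Q_no_inventory and Q_hardwired in the simulate method, namely (𝑎0 + 𝜈𝑡 − 𝑐1)/(2(𝑎1 + 𝑐2)) and (𝑎0 + 𝜈𝑡 − 𝑐1)/(2(𝑎1 + 𝑐2 + 𝑑2)). The helper name static_profit is hypothetical.

```python
import numpy as np

a0, a1, c1, c2, d2 = 10.0, 1.0, 1.0, 1.0, 1.0
ν = 2.0                              # an arbitrary demand-shock value

def static_profit(Q, hardwired):
    """One-period profit when sales equal production."""
    profit = (a0 - a1 * Q + ν) * Q - c1 * Q - c2 * Q**2
    if hardwired:
        profit = profit - d2 * Q**2  # d(0, Q) = d2 * Q^2 when I = 0 and S = Q
    return profit

# Grid search over production levels
Q_grid = np.linspace(0.0, 10.0, 100_001)
Q_ni_grid = Q_grid[np.argmax(static_profit(Q_grid, hardwired=False))]
Q_h_grid = Q_grid[np.argmax(static_profit(Q_grid, hardwired=True))]

# Closed forms from the first-order conditions
Q_ni = (a0 + ν - c1) / (2 * (a1 + c2))
Q_h = (a0 + ν - c1) / (2 * (a1 + c2 + d2))

print(Q_ni_grid, Q_ni)
print(Q_h_grid, Q_h)
```

The hardwired rule always prescribes lower production than the no-inventory rule when the numerator is positive, because the extra term 𝑑2 raises the effective marginal cost of selling.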
We introduce this “𝐼𝑡 is hardwired to zero” specification in order to shed light on the role that inventories play by comparing outcomes with those under our two other versions of the problem.
The bottom right panel displays a production path for the original problem that we are interested in (the blue line) as well as an optimal production path for the model in which inventories are not useful (the green path) and also for the model in which, although inventories are useful, they are hardwired to zero and the firm pays cost 𝑑(0, 𝑄𝑡 ) for not setting sales 𝑆𝑡 = 𝑄𝑡 equal to zero (the orange line).
Notice that it is typically optimal for the firm to produce more when inventories aren’t useful. Here there is no requirement
to sell out of inventories and no costs from having sales deviate from inventories.

But “typical” does not mean “always”.

Thus, if we look closely, we notice that for small 𝑡, the green “production when inventories aren’t useful” line in the lower
right panel is below optimal production in the original model.
High optimal production in the original model early on occurs because the firm wants to accumulate inventories quickly
in order to acquire high inventories for use in later periods.
But how the green line compares to the blue line early on depends on the evolution of the demand shock, as we will see
in a deterministically seasonal demand shock example to be analyzed below.
In that example, the original firm optimally accumulates inventories slowly because the next positive demand shock is in
the distant future.
To make the green-blue model production comparison easier to see, let’s confine the graphs to the first 10 periods:
ex1.simulate(x0, T=10)
60.5 Example 2
Next, we shut down randomness in demand and assume that the demand shock 𝜈𝑡 follows a deterministic path:
𝜈𝑡 = 𝛼 + 𝜌𝜈𝑡−1
Again, we’ll compute and display outcomes in some figures
ex2 = SmoothingExample(C2=[[0], [0]])


x0 = [0, 1, 0]
ex2.simulate(x0)
60.6 Example 3
Now we’ll put randomness back into the demand shock process and also assume that there are zero costs of holding
inventories.
In particular, we’ll look at a situation in which 𝑑1 = 0 but 𝑑2 > 0.
Now it becomes optimal to set sales approximately equal to inventories and to use inventories to smooth production quite
well, as the following figures confirm
ex3 = SmoothingExample(d1=0)
x0 = [0, 1, 0]
ex3.simulate(x0)

60.7 Example 4
To bring out some features of the optimal policy that are related to some technical issues in linear control theory, we’ll
now temporarily assume that it is costless to hold inventories.
When we completely shut down the cost of holding inventories by setting 𝑑1 = 0 and 𝑑2 = 0, something absurd happens
(because the Bellman equation is opportunistic and very smart).
(Technically, we have set parameters that end up violating conditions needed to assure stability of the optimally controlled
state.)
The firm finds it optimal to set $Q_t \equiv Q^* = \frac{-c_1}{2 c_2}$, an output level that minimizes the cost of production by setting marginal cost $c_1 + 2 c_2 Q_t$ to zero (when $c_1 > 0$, as it is with our default settings, then it is optimal to set production negative, whatever that means!).
Recall the law of motion for inventories
$$
I_{t+1} = I_t + Q_t - S_t
$$
So when $d_1 = d_2 = 0$, so that the firm finds it optimal to set $Q_t = \frac{-c_1}{2 c_2}$ for all $t$, then
$$
I_{t+1} - I_t = \frac{-c_1}{2 c_2} - S_t < 0
$$
for almost all values of 𝑆𝑡 under our default parameters that keep demand positive almost all of the time.
The dynamic program instructs the firm to drive production costs as low as possible and to run a Ponzi scheme by running inventories down forever.
(We can interpret this as the firm somehow going short in or borrowing inventories)
The following figures confirm that inventories head south without limit
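The mechanics can be illustrated without solving the dynamic program. The sketch below assumes the default cost parameters and, purely for illustration, a constant positive level of sales (a hypothetical value, not an outcome of the model); it evaluates cost and marginal cost at 𝑄∗ and iterates the law of motion for inventories.

```python
c1, c2 = 1.0, 1.0

# Production level at which marginal cost c1 + 2 * c2 * Q equals zero
Q_star = -c1 / (2 * c2)
cost_at_Q_star = c1 * Q_star + c2 * Q_star**2
marginal_cost = c1 + 2 * c2 * Q_star

# Inventories under I_{t+1} = I_t + Q_t - S_t with Q_t = Q_star
S = 4.0                      # hypothetical constant sales
inventories = [0.0]
for t in range(10):
    inventories.append(inventories[-1] + Q_star - S)

print(Q_star, cost_at_Q_star, marginal_cost)
print(inventories)
```

With these numbers inventories fall by 4.5 units every period, which is the unbounded decline described in the text.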

ex4 = SmoothingExample(d1=0, d2=0)
x0 = [0, 1, 0]
ex4.simulate(x0)
Let’s shorten the time span displayed in order to highlight what is going on.
We’ll set the horizon 𝑇 = 30 with the following code
# shorter period
ex4.simulate(x0, T=30)

60.8 Example 5
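Now we’ll assume that the demand shock follows a linear time trend
$$
v_t = b + a t, \quad a > 0, \; b > 0
$$
To represent this, we set $C_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ and
$$
A_{22} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \quad
z_0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad
G = \begin{bmatrix} b & a \end{bmatrix}
$$
Here $z_0$ is the exogenous part of the initial state; initial inventories are stacked on top of it to form $x_0$.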
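A quick check that this specification delivers a linear trend: with the initial exogenous state (1, 0), the second component of 𝑧𝑡 simply counts time, so 𝐺𝑧𝑡 = 𝑏 + 𝑎𝑡. The sketch below uses the values of 𝑎 and 𝑏 from the code that follows; the six-period horizon is arbitrary.

```python
import numpy as np

a, b = 0.5, 3.0
A22 = np.array([[1.0, 0.0],
                [1.0, 1.0]])
G = np.array([b, a])

z = np.array([1.0, 0.0])     # exogenous part of the initial state
v_path = []
for t in range(6):
    v_path.append(G @ z)     # v_t = b + a * t
    z = A22 @ z              # second component increases by one each period

v_path = np.array(v_path)
print(v_path)
```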
# Set parameters
a = 0.5
b = 3.
ex5 = SmoothingExample(A22=[[1, 0], [1, 1]], C2=[[0], [0]], G=[b, a])
x0 = [0, 1, 0] # set the initial inventory as 0


60.9 Example 6
Now we’ll assume a deterministically seasonal demand shock.

To represent this we’ll set
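$$
A_{22} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0
\end{bmatrix}, \quad
C_2 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
G' = \begin{bmatrix} b \\ a \\ 0 \\ 0 \\ 0 \end{bmatrix}
$$
where $a > 0$, $b > 0$ and the exogenous part of the initial state is
$$
z_0 = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}
$$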
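To see the seasonal pattern implied by these matrices, the sketch below iterates the exogenous state forward and records the demand shock. It assumes the values 𝑎 = 0.5 and 𝑏 = 3 from Example 5 and the initial exogenous state (1, 0, 1, 0, 0); the lower-right block of 𝐴22 is a cyclic shift, so the shock should repeat every four periods.

```python
import numpy as np

a, b = 0.5, 3.0
A22 = np.array([[1, 0, 0, 0, 0],
                [0, 0, 0, 0, 1],
                [0, 1, 0, 0, 0],
                [0, 0, 1, 0, 0],
                [0, 0, 0, 1, 0]], dtype=float)
G = np.array([b, a, 0.0, 0.0, 0.0])

z = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # exogenous part of the initial state
v_path = []
for t in range(8):
    v_path.append(G @ z)                  # shock equals b, plus a in one season
    z = A22 @ z                           # rotate the seasonal indicator

v_path = np.array(v_path)
period_four = np.linalg.matrix_power(A22, 4)

print(v_path)
```

Starting the seasonal indicator in a different position only shifts the date at which the high-demand season first arrives.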
ex6 = SmoothingExample(A22=[[1, 0, 0, 0, 0],

[0, 0, 0, 0, 1],
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]],
C2=[[0], [0], [0], [0], [0]],
G=[b, a, 0, 0, 0])

x00 = [0, 1, 0, 1, 0, 0] # Set the initial inventory as 0

Now we’ll generate some more examples that differ only in the initial season of the year in which we begin the demand shock
x01 = [0, 1, 1, 0, 0, 0]

x02 = [0, 1, 0, 0, 1, 0]

x03 = [0, 1, 0, 0, 0, 1]
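The seasonal pattern implied by these matrices can be checked without the SmoothingExample class. The sketch below assumes $v_t = G z_t$ and $z_{t+1} = A_{22} z_t$, takes the exogenous part of `x00` (all entries after the inventory level) as the starting state, and reuses the illustrative values `a = 0.5`, `b = 3.0`.

```python
import numpy as np

a, b = 0.5, 3.0                      # illustrative values reused from Example 5
A22 = np.array([[1, 0, 0, 0, 0],
                [0, 0, 0, 0, 1],
                [0, 1, 0, 0, 0],
                [0, 0, 1, 0, 0],
                [0, 0, 0, 1, 0]])
G = np.array([b, a, 0, 0, 0])

z = np.array([1, 0, 1, 0, 0])        # exogenous part of x00 (inventory dropped)
v = []
for t in range(8):
    v.append(G @ z)                  # v_t = G z_t
    z = A22 @ z                      # the seasonal indicator rotates one slot

print(v)
```

The indicator returns to its starting slot every four periods, so the shock equals `b + a` once per cycle and `b` otherwise. Changing the position of the 1 in the starting state shifts the season in which the peak occurs.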
60.10 Exercises
Please try to analyze some inventory sales smoothing problems using the SmoothingExample class.
Exercise 60.10.1
Assume that the demand shock follows the AR(2) process

$$\nu_t = \alpha + \rho_1 \nu_{t-1} + \rho_2 \nu_{t-2} + \epsilon_t$$

where $\alpha = 1$, $\rho_1 = 1.2$, and $\rho_2 = -0.3$. You need to construct the $A_{22}$, $C$, and $G$ matrices properly and then pass them as keyword arguments to the SmoothingExample class. Simulate paths starting from the initial condition $x_0 = [0, 1, 0, 0]'$.
After this, try to construct a very similar SmoothingExample with the same demand shock process but exclude the randomness $\epsilon_t$. Compute the stationary states $\bar x$ by simulating for a long period. Then try to add shocks of different magnitudes to $\bar \nu_t$ and simulate paths. You should see how firms respond differently by staring at the production plans.

# set parameters
α = 1
ρ1 = 1.2
ρ2 = -.3
# construct matrices
A22 =[[1, 0, 0],
[1, ρ1, ρ2],
[0, 1, 0]]
C2 = [[0], [1], [0]]
G = [0, 1, 0]
ex1 = SmoothingExample(A22=A22, C2=C2, G=G)
x0 = [0, 1, 0, 0] # initial condition

ex1.simulate(x0)
# now silence the noise

ex1_no_noise = SmoothingExample(A22=A22, C2=[[0], [0], [0]], G=G)
# initial condition
x0 = [0, 1, 0, 0]
# compute stationary states

x_bar = ex1_no_noise.LQ.compute_sequence(x0, ts_length=250)[0][:, -1]
x_bar

array([ 3.69387755, 1. , 10. , 10. ])
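The last two entries of `x_bar` can be cross-checked by hand: without noise, a stationary value of the shock solves $\bar\nu = \alpha + \rho_1 \bar\nu + \rho_2 \bar\nu$. A minimal sketch of that check, together with a stability test on the AR(2) companion matrix (the names `ν_bar`, `companion` and `moduli` are introduced here for illustration):

```python
import numpy as np

α, ρ1, ρ2 = 1, 1.2, -0.3

# Noise-free fixed point of ν = α + ρ1 ν + ρ2 ν
ν_bar = α / (1 - ρ1 - ρ2)

# Stability: both eigenvalues of the companion matrix must lie
# strictly inside the unit circle
companion = np.array([[ρ1, ρ2],
                      [1,  0]])
moduli = np.abs(np.linalg.eigvals(companion))

print(ν_bar, moduli)
```

Both moduli are below one, which is why the long simulation settles down at $\bar\nu = 10$.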
In the following, we add small and large shocks to $\bar \nu_t$ and compare how the firm's production responds in each case. As the shock is not very persistent under the parameterization we are using, we focus on the short-run response.
T = 40
# small shock
x_bar1 = x_bar.copy()
x_bar1[2] += 2
ex1_no_noise.simulate(x_bar1, T=T)
# large shock
x_bar1 = x_bar.copy()
x_bar1[2] += 10
ex1_no_noise.simulate(x_bar1, T=T)

Exercise 60.10.2
Change parameters of 𝐶(𝑄𝑡 ) and 𝑑(𝐼𝑡 , 𝑆𝑡 ).
1. Make production more costly, by setting 𝑐2 = 5.
2. Increase the cost of having inventories deviate from sales, by setting 𝑑2 = 5.
x0 = [0, 1, 0]
SmoothingExample(c2=5).simulate(x0)

SmoothingExample(d2=5).simulate(x0)


Part X
Multiple Agent Models
CHAPTER
SIXTYONE
A LAKE MODEL OF EMPLOYMENT AND UNEMPLOYMENT
Contents
• A Lake Model of Employment and Unemployment

– Overview
– The Model
– Implementation
– Dynamics of an Individual Worker
– Endogenous Job Finding Rate
– Exercises
61.1 Overview
This lecture describes what has come to be called a lake model.

The lake model is a basic tool for modeling unemployment.
It allows us to analyze
• flows between unemployment and employment.
• how these flows influence steady state employment and unemployment rates.
It is a good model for interpreting monthly labor department reports on gross and net jobs created and jobs destroyed.
The “lakes” in the model are the pools of employed and unemployed.
The “flows” between the lakes are caused by
• firing and hiring
• entry and exit from the labor force
For the first part of this lecture, the parameters governing transitions into and out of unemployment and employment are
exogenous.
Later, we’ll determine some of these transition rates endogenously using the McCall search model.
We’ll also use some nifty concepts like ergodicity, which provides a fundamental link between cross-sectional and long
run time series distributions.
These concepts will help us build an equilibrium model of ex-ante homogeneous workers whose different luck generates
variations in their ex post experiences.

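import numpy as np
import matplotlib.pyplot as plt
from numba import jit
from quantecon import MarkovChain
from quantecon.distributions import BetaBinomial
from scipy.optimize import brentq
from scipy.stats import norm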
61.1.1 Prerequisites
Before working through what follows, we recommend you read the lecture on finite Markov chains.
You will also need some basic linear algebra and probability.
61.2 The Model
The economy is inhabited by a very large number of ex-ante identical workers.

The workers live forever, spending their lives moving between unemployment and employment.
Their rates of transition between employment and unemployment are governed by the following parameters:
• 𝜆, the job finding rate for currently unemployed workers
• 𝛼, the dismissal rate for currently employed workers
• 𝑏, the entry rate into the labor force
• 𝑑, the exit rate from the labor force
The growth rate of the labor force evidently equals 𝑔 = 𝑏 − 𝑑.
61.2.1 Aggregate Variables
We want to derive the dynamics of the following aggregates

• 𝐸𝑡 , the total number of employed workers at date 𝑡
• 𝑈𝑡 , the total number of unemployed workers at 𝑡
• 𝑁𝑡 , the number of workers in the labor force at 𝑡
We also want to know the values of the following objects
• The employment rate 𝑒𝑡 ∶= 𝐸𝑡 /𝑁𝑡 .
• The unemployment rate 𝑢𝑡 ∶= 𝑈𝑡 /𝑁𝑡 .
(Here and below, capital letters represent aggregates and lowercase letters represent rates)

61.2.2 Laws of Motion for Stock Variables
We begin by constructing laws of motion for the aggregate variables 𝐸𝑡 , 𝑈𝑡 , 𝑁𝑡 .

Of the mass of workers 𝐸𝑡 who are employed at date 𝑡,
• (1 − 𝑑)𝐸𝑡 will remain in the labor force
• of these, (1 − 𝛼)(1 − 𝑑)𝐸𝑡 will remain employed
Of the mass of workers 𝑈𝑡 who are currently unemployed,
• (1 − 𝑑)𝑈𝑡 will remain in the labor force
• of these, (1 − 𝑑)𝜆𝑈𝑡 will become employed
Therefore, the number of workers who will be employed at date 𝑡 + 1 will be
𝐸𝑡+1 = (1 − 𝑑)(1 − 𝛼)𝐸𝑡 + (1 − 𝑑)𝜆𝑈𝑡
A similar analysis implies
𝑈𝑡+1 = (1 − 𝑑)𝛼𝐸𝑡 + (1 − 𝑑)(1 − 𝜆)𝑈𝑡 + 𝑏(𝐸𝑡 + 𝑈𝑡 )
The value 𝑏(𝐸𝑡 + 𝑈𝑡 ) is the mass of new workers entering the labor force unemployed.
The total stock of workers 𝑁𝑡 = 𝐸𝑡 + 𝑈𝑡 evolves as
𝑁𝑡+1 = (1 + 𝑏 − 𝑑)𝑁𝑡 = (1 + 𝑔)𝑁𝑡
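Letting $X_t := \begin{pmatrix} U_t \\ E_t \end{pmatrix}$, the law of motion for $X$ is

$$X_{t+1} = A X_t \quad \text{where} \quad A := \begin{pmatrix} (1-d)(1-\lambda) + b & (1-d)\alpha + b \\ (1-d)\lambda & (1-d)(1-\alpha) \end{pmatrix}$$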
This law tells us how total employment and unemployment evolve over time.
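As a quick numerical check of these laws of motion, the sketch below builds $A$ for one set of rates and applies it once to a hypothetical stock vector $(U_0, E_0) = (12, 138)$. The rates are illustrative. If the algebra above is right, the labor force should grow by exactly the factor $1 + g$.

```python
import numpy as np

λ, α, b, d = 0.283, 0.013, 0.0124, 0.00822   # illustrative rates
g = b - d

A = np.array([[(1 - d) * (1 - λ) + b, (1 - d) * α + b],
              [(1 - d) * λ,           (1 - d) * (1 - α)]])

X = np.array([12.0, 138.0])          # hypothetical initial stocks (U_0, E_0)
X_next = A @ X                       # stocks one period later

print(X_next, X_next.sum() / X.sum())
```

The ratio printed at the end equals $1 + g$ because each column of $A$ sums to $1 + b - d$.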
61.2.3 Laws of Motion for Rates
Now let’s derive the law of motion for rates.

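To get these we can divide both sides of $X_{t+1} = A X_t$ by $N_{t+1}$ to get

$$\begin{pmatrix} U_{t+1}/N_{t+1} \\ E_{t+1}/N_{t+1} \end{pmatrix} = \frac{1}{1+g} A \begin{pmatrix} U_t/N_t \\ E_t/N_t \end{pmatrix}$$

Letting

$$x_t := \begin{pmatrix} u_t \\ e_t \end{pmatrix} = \begin{pmatrix} U_t/N_t \\ E_t/N_t \end{pmatrix}$$

we can also write this as

$$x_{t+1} = \hat A x_t \quad \text{where} \quad \hat A := \frac{1}{1+g} A$$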
You can check that 𝑒𝑡 + 𝑢𝑡 = 1 implies that 𝑒𝑡+1 + 𝑢𝑡+1 = 1.
This follows from the fact that the columns of 𝐴 ̂ sum to 1.
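The adding-up property can also be confirmed numerically. A minimal sketch, with illustrative rates and a hypothetical starting point $(u_0, e_0) = (0.08, 0.92)$:

```python
import numpy as np

λ, α, b, d = 0.283, 0.013, 0.0124, 0.00822   # illustrative rates
g = b - d
A_hat = np.array([[(1 - d) * (1 - λ) + b, (1 - d) * α + b],
                  [(1 - d) * λ,           (1 - d) * (1 - α)]]) / (1 + g)

col_sums = A_hat.sum(axis=0)         # each column should sum to one

x = np.array([0.08, 0.92])           # hypothetical initial rates (u_0, e_0)
for t in range(5):
    x = A_hat @ x                    # x_{t+1} = A_hat x_t

print(col_sums, x, x.sum())
```

Because the columns sum to one, multiplying by $\hat A$ preserves the sum of the entries of $x_t$, so the rates continue to add up to one in every period.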

61.3 Implementation
Let’s code up these equations.

To do this we’re going to use a class that we’ll call LakeModel.
This class will
1. store the primitives 𝛼, 𝜆, 𝑏, 𝑑
2. compute and store the implied objects 𝑔, 𝐴, 𝐴 ̂
3. provide methods to simulate dynamics of the stocks and rates
4. provide a method to compute the steady state vector 𝑥̄ of employment and unemployment rates using a technique
we previously introduced for computing stationary distributions of Markov chains
Please be careful: the implied objects $g$, $A$, $\hat A$ will not change if you only change the primitives on an existing instance.
For example, if you would like to use a different value of a primitive, such as $\alpha = 0.03$, you need to create a new instance via lm = LakeModel(α=0.03).
In the exercises, we show how to avoid this issue by using getter and setter methods.
class LakeModel:
    """
    Solves the lake model and computes dynamics of unemployment stocks and
    rates.

    Parameters:
    ------------
    λ : scalar
        The job finding rate for currently unemployed workers
    α : scalar
        The dismissal rate for currently employed workers
    b : scalar
        Entry rate into the labor force
    d : scalar
        Exit rate from the labor force

    """
    def __init__(self, λ=0.283, α=0.013, b=0.0124, d=0.00822):
        self.λ, self.α, self.b, self.d = λ, α, b, d

        λ, α, b, d = self.λ, self.α, self.b, self.d
        self.g = b - d
        self.A = np.array([[(1-d) * (1-λ) + b, (1 - d) * α + b],
                           [(1-d) * λ,         (1 - d) * (1 - α)]])

        self.A_hat = self.A / (1 + self.g)

    def rate_steady_state(self, tol=1e-6):
        """
        Finds the steady state of the system :math:`x_{t+1} = \hat A x_{t}`

        Returns
        --------
        xbar : steady state vector of unemployment and employment rates
        """
        # For a 2 x 2 matrix whose columns sum to one, the stationary
        # vector is proportional to the off-diagonal elements
        x = np.array([self.A_hat[0, 1], self.A_hat[1, 0]])
        x /= x.sum()
        return x

    def simulate_stock_path(self, X0, T):
        """
        Simulates the sequence of unemployment and employment stocks

        Parameters
        ------------
        X0 : array
            Contains initial values (U0, E0)
        T : int
            Number of periods to simulate

        Returns
        ---------
        X : iterator
            Contains sequence of unemployment and employment stocks
        """
        X = np.atleast_1d(X0)  # Recast as array just in case
        for t in range(T):
            yield X
            X = self.A @ X

    def simulate_rate_path(self, x0, T):
        """
        Simulates the sequence of unemployment and employment rates

        Parameters
        ------------
        x0 : array
            Contains initial values (u0, e0)
        T : int
            Number of periods to simulate

        Returns
        ---------
        x : iterator
            Contains sequence of unemployment and employment rates
        """
        x = np.atleast_1d(x0)  # Recast as array just in case
        for t in range(T):
            yield x
            x = self.A_hat @ x
As explained, if we create an instance and update it by lm = LakeModel(α=0.03), derived objects like 𝐴 will also
change.
lm = LakeModel()
lm.α
0.013

lm.A
array([[0.72350626, 0.02529314],
[0.28067374, 0.97888686]])
lm = LakeModel(α = 0.03)
lm.A
array([[0.72350626, 0.0421534 ],
[0.28067374, 0.9620266 ]])
61.3.1 Aggregate Dynamics
Let’s run a simulation under the default parameters (see above) starting from 𝑋0 = (12, 138)
lm = LakeModel()
N_0 = 150      # Population
e_0 = 0.92     # Initial employment rate
u_0 = 1 - e_0  # Initial unemployment rate
T = 50         # Simulation length

U_0 = u_0 * N_0
E_0 = e_0 * N_0

fig, axes = plt.subplots(3, 1, figsize=(10, 8))
X_0 = (U_0, E_0)
X_path = np.vstack(tuple(lm.simulate_stock_path(X_0, T)))

axes[0].plot(X_path[:, 0], lw=2)
axes[0].set_title('Unemployment')

axes[1].plot(X_path[:, 1], lw=2)
axes[1].set_title('Employment')

axes[2].plot(X_path.sum(1), lw=2)
axes[2].set_title('Labor force')

for ax in axes:
    ax.grid()

plt.tight_layout()
plt.show()

The aggregates 𝐸𝑡 and 𝑈𝑡 don’t converge because their sum 𝐸𝑡 + 𝑈𝑡 grows at rate 𝑔.
On the other hand, the vector of employment and unemployment rates $x_t$ can be in a steady state $\bar x$ if there exists an $\bar x$ such that
• $\bar x = \hat A \bar x$
• the components satisfy $\bar e + \bar u = 1$
This equation tells us that a steady state level $\bar x$ is an eigenvector of $\hat A$ associated with a unit eigenvalue.
We also have $x_t \to \bar x$ as $t \to \infty$ provided that the remaining eigenvalue of $\hat A$ has modulus less than 1.
This is the case for our default parameters:
lm = LakeModel()
e, f = np.linalg.eigvals(lm.A_hat)
abs(e), abs(f)
(0.6953067378358462, 1.0)
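The eigenvector characterization suggests another way to obtain $\bar x$. The following sketch rebuilds $\hat A$ from the default rates, picks out the eigenvector attached to the unit eigenvalue with `np.linalg.eig`, and rescales it so that its entries sum to one. It is an illustration of the idea and not the method used inside `rate_steady_state`.

```python
import numpy as np

λ, α, b, d = 0.283, 0.013, 0.0124, 0.00822   # default rates of LakeModel
g = b - d
A_hat = np.array([[(1 - d) * (1 - λ) + b, (1 - d) * α + b],
                  [(1 - d) * λ,           (1 - d) * (1 - α)]]) / (1 + g)

eigvals, eigvecs = np.linalg.eig(A_hat)
k = np.argmin(np.abs(eigvals - 1))   # position of the unit eigenvalue
x_bar = np.real(eigvecs[:, k])
x_bar = x_bar / x_bar.sum()          # rescale so that u + e = 1

print(x_bar)
```

Dividing by the sum also takes care of the arbitrary sign that eigenvector routines may return.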
Let’s look at the convergence of the unemployment and employment rate to steady state levels (dashed red line)
lm = LakeModel()
e_0 = 0.92     # Initial employment rate
u_0 = 1 - e_0  # Initial unemployment rate
T = 50         # Simulation length

xbar = lm.rate_steady_state()

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
x_0 = (u_0, e_0)
x_path = np.vstack(tuple(lm.simulate_rate_path(x_0, T)))

titles = ['Unemployment rate', 'Employment rate']

for i, title in enumerate(titles):
    axes[i].plot(x_path[:, i], lw=2, alpha=0.5)
    axes[i].hlines(xbar[i], 0, T, 'r', '--')
    axes[i].set_title(title)
    axes[i].grid()

plt.tight_layout()
plt.show()

61.4 Dynamics of an Individual Worker
An individual worker’s employment dynamics are governed by a finite state Markov process.
The worker can be in one of two states:
• 𝑠𝑡 = 0 means unemployed
• 𝑠𝑡 = 1 means employed
Let’s start off under the assumption that 𝑏 = 𝑑 = 0.
The associated transition matrix is then
$$P = \begin{pmatrix} 1 - \lambda & \lambda \\ \alpha & 1 - \alpha \end{pmatrix}$$
Let 𝜓𝑡 denote the marginal distribution over employment/unemployment states for the worker at time 𝑡.
As usual, we regard it as a row vector.
We know from an earlier discussion that 𝜓𝑡 follows the law of motion
𝜓𝑡+1 = 𝜓𝑡 𝑃
We also know from the lecture on finite Markov chains that if 𝛼 ∈ (0, 1) and 𝜆 ∈ (0, 1), then 𝑃 has a unique stationary
distribution, denoted here by 𝜓∗ .
The unique stationary distribution satisfies

$$\psi^*[0] = \frac{\alpha}{\alpha + \lambda}$$
Not surprisingly, probability mass on the unemployment state increases with the dismissal rate and falls with the job
finding rate.
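The formula for $\psi^*[0]$ can be confirmed by iterating the law of motion $\psi_{t+1} = \psi_t P$ until it settles down. A minimal sketch, with illustrative values for $\alpha$ and $\lambda$ and an arbitrary initial distribution:

```python
import numpy as np

α, λ = 0.013, 0.283                  # illustrative rates
P = np.array([[1 - λ, λ],
              [α,     1 - α]])

ψ = np.array([1.0, 0.0])             # start unemployed with probability one
for t in range(1000):
    ψ = ψ @ P                        # ψ_{t+1} = ψ_t P (row vector convention)

print(ψ, α / (α + λ))
```

The first entry of `ψ` agrees with $\alpha / (\alpha + \lambda)$ after enough iterations, whatever the starting distribution.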
61.4.1 Ergodicity
Let’s look at a typical lifetime of employment-unemployment spells.

We want to compute the average amounts of time an infinitely lived worker would spend employed and unemployed.
Let

$$\bar s_{u,T} := \frac{1}{T} \sum_{t=1}^T \mathbb{1}\{s_t = 0\}$$

and

$$\bar s_{e,T} := \frac{1}{T} \sum_{t=1}^T \mathbb{1}\{s_t = 1\}$$
(As usual, 𝟙{𝑄} = 1 if statement 𝑄 is true and 0 otherwise)
These are the fraction of time a worker spends unemployed and employed, respectively, up until period 𝑇 .
If $\alpha \in (0, 1)$ and $\lambda \in (0, 1)$, then $P$ is ergodic, and hence we have

$$\lim_{T \to \infty} \bar s_{u,T} = \psi^*[0] \quad \text{and} \quad \lim_{T \to \infty} \bar s_{e,T} = \psi^*[1]$$

with probability one.

Inspection tells us that 𝑃 is exactly the transpose of 𝐴 ̂ under the assumption 𝑏 = 𝑑 = 0.
Thus, the percentages of time that an infinitely lived worker spends employed and unemployed equal the fractions of
workers employed and unemployed in the steady state distribution.

61.4.2 Convergence Rate
How long does it take for time series sample averages to converge to cross-sectional averages?
We can use QuantEcon.py’s MarkovChain class to investigate this.
Let’s plot the path of the sample averages over 5,000 periods
lm = LakeModel(d=0, b=0)
T = 5000  # Simulation length

α, λ = lm.α, lm.λ

P = [[1 - λ,     λ],
     [    α, 1 - α]]

mc = MarkovChain(P)

xbar = lm.rate_steady_state()

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
s_path = mc.simulate(T, init=1)
s_bar_e = s_path.cumsum() / range(1, T+1)
s_bar_u = 1 - s_bar_e

to_plot = [s_bar_u, s_bar_e]
titles = ['Percent of time unemployed', 'Percent of time employed']

for i, plot in enumerate(to_plot):
    axes[i].plot(plot, lw=2, alpha=0.5)
    axes[i].hlines(xbar[i], 0, T, 'r', '--')
    axes[i].set_title(titles[i])
    axes[i].grid()

plt.tight_layout()
plt.show()

The stationary probabilities are given by the dashed red line.

In this case it takes much of the sample for these two objects to converge.
This is largely due to the high persistence in the Markov chain.
61.5 Endogenous Job Finding Rate
We now make the hiring rate endogenous.

The transition rate from unemployment to employment will be determined by the McCall search model [McCall, 1970].
All details relevant to the following discussion can be found in our treatment of that model.

61.5.1 Reservation Wage
The most important thing to remember about the model is that optimal decisions are characterized by a reservation wage $\bar w$
• If the wage offer $w$ in hand is greater than or equal to $\bar w$, then the worker accepts.
• Otherwise, the worker rejects.
As we saw in our discussion of the model, the reservation wage depends on the wage offer distribution and the parameters
• 𝛼, the separation rate
• 𝛽, the discount factor
• 𝛾, the offer arrival rate
• 𝑐, unemployment compensation
61.5.2 Linking the McCall Search Model to the Lake Model
Suppose that all workers inside a lake model behave according to the McCall search model.
The exogenous probability of leaving employment remains 𝛼.
But their optimal decision rules determine the probability 𝜆 of leaving unemployment.
This is now

$$\lambda = \gamma \mathbb{P}\{w_t \geq \bar w\} = \gamma \sum_{w' \geq \bar w} p(w') \tag{61.1}$$
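As a small worked instance of (61.1), the sketch below evaluates the job finding rate for a made-up four-point wage distribution. The wage grid, probabilities, offer arrival rate and reservation wage are all hypothetical and serve only to show the computation.

```python
import numpy as np

γ = 0.7                                      # hypothetical offer arrival rate
w_vec = np.array([10.0, 12.0, 14.0, 16.0])   # hypothetical wage grid
p_vec = np.array([0.1, 0.4, 0.3, 0.2])       # hypothetical offer probabilities
w_bar = 14.0                                 # hypothetical reservation wage

# λ = γ * P{w >= w_bar}: arrival rate times probability of acceptance
λ = γ * p_vec[w_vec >= w_bar].sum()

print(λ)
```

A higher reservation wage shrinks the set of acceptable offers and so lowers $\lambda$, which is the channel through which policy affects unemployment in what follows.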
61.5.3 Fiscal Policy
We can use the McCall search version of the Lake Model to find an optimal level of unemployment insurance.
We assume that the government sets unemployment compensation 𝑐.
The government imposes a lump-sum tax 𝜏 sufficient to finance total unemployment payments.
To attain a balanced budget at a steady state, taxes, the steady state unemployment rate $u$, and the unemployment compensation rate must satisfy

$$\tau = u c$$
The lump-sum tax applies to everyone, including unemployed workers.

Thus, the post-tax income of an employed worker with wage 𝑤 is 𝑤 − 𝜏 .
The post-tax income of an unemployed worker is 𝑐 − 𝜏 .
For each specification (𝑐, 𝜏 ) of government policy, we can solve for the worker’s optimal reservation wage.
This determines 𝜆 via (61.1) evaluated at post tax wages, which in turn determines a steady state unemployment rate
𝑢(𝑐, 𝜏 ).
For a given level of unemployment benefit 𝑐, we can solve for a tax that balances the budget in the steady state
𝜏 = 𝑢(𝑐, 𝜏 )𝑐
To evaluate alternative government tax-unemployment compensation pairs, we require a welfare criterion.

We use a steady state welfare criterion
𝑊 ∶= 𝑒 𝔼[𝑉 | employed] + 𝑢 𝑈
where the notation 𝑉 and 𝑈 is as defined in the McCall search model lecture.
The wage offer distribution will be a discretized version of the lognormal distribution 𝐿𝑁 (log(20), 1), as shown in the
next figure
We take a period to be a month.

We set 𝑏 and 𝑑 to match monthly birth and death rates, respectively, in the U.S. population
• 𝑏 = 0.0124
• 𝑑 = 0.00822
Following [Davis et al., 2006], we set 𝛼, the hazard rate of leaving employment, to
• 𝛼 = 0.013
61.5.4 Fiscal Policy Code
We will make use of techniques from the McCall model lecture

The first piece of code implements value function iteration
# A default utility function

@jit
def u(c, σ):
    if c > 0:
        return (c**(1 - σ) - 1) / (1 - σ)
    else:
        return -10e6


class McCallModel:
    """
    Stores the parameters and functions associated with a given model.
    """

    def __init__(self,
                 α=0.2,        # Job separation rate
                 β=0.98,       # Discount factor
                 γ=0.7,        # Job offer rate
                 c=6.0,        # Unemployment compensation
                 σ=2.0,        # Utility parameter
                 w_vec=None,   # Possible wage values
                 p_vec=None):  # Probabilities over w_vec

        self.α, self.β, self.γ, self.c = α, β, γ, c
        self.σ = σ

        # Add a default wage vector and probabilities over the vector using
        # the beta-binomial distribution
        if w_vec is None:
            n = 60  # Number of possible outcomes for wage
            # Wages between 10 and 20
            self.w_vec = np.linspace(10, 20, n)
            a, b = 600, 400  # Shape parameters
            dist = BetaBinomial(n-1, a, b)
            self.p_vec = dist.pdf()
        else:
            self.w_vec = w_vec
            self.p_vec = p_vec


@jit
def _update_bellman(α, β, γ, c, σ, w_vec, p_vec, V, V_new, U):
    """
    A jitted function to update the Bellman equations. Note that V_new is
    modified in place (i.e., modified by this function). The new value of U
    is returned.

    """
    for w_idx, w in enumerate(w_vec):
        # w_idx indexes the vector of possible wages
        V_new[w_idx] = u(w, σ) + β * ((1 - α) * V[w_idx] + α * U)

    U_new = u(c, σ) + β * (1 - γ) * U + \
        β * γ * np.sum(np.maximum(U, V) * p_vec)

    return U_new
def solve_mccall_model(mcm, tol=1e-5, max_iter=2000):
    """
    Iterates to convergence on the Bellman equations

    Parameters
    ----------
    mcm : an instance of McCallModel
    tol : float
        error tolerance
    max_iter : int
        the maximum number of iterations
    """

    V = np.ones(len(mcm.w_vec))  # Initial guess of V
    V_new = np.empty_like(V)     # To store updates to V
    U = 1                        # Initial guess of U
    i = 0
    error = tol + 1

    # Iterate until both V and U change by less than tol
    while error > tol and i < max_iter:
        U_new = _update_bellman(mcm.α, mcm.β, mcm.γ,
                                mcm.c, mcm.σ, mcm.w_vec, mcm.p_vec,
                                V, V_new, U)
        error_1 = np.max(np.abs(V_new - V))
        error_2 = np.abs(U_new - U)
        error = max(error_1, error_2)
        V[:] = V_new
        U = U_new
        i += 1

    return V, U
The second piece of code is used to compute the reservation wage:
def compute_reservation_wage(mcm, return_values=False):
    """
    Computes the reservation wage of an instance of the McCall model
    by finding the smallest w such that V(w) > U.

    If V(w) > U for all w, then the reservation wage w_bar is set to
    the lowest wage in mcm.w_vec.

    If V(w) < U for all w, then w_bar is set to np.inf.

    Parameters
    ----------
    mcm : an instance of McCallModel
    return_values : bool (optional, default=False)
        Return the value functions as well

    Returns
    -------
    w_bar : scalar
        The reservation wage

    """

    V, U = solve_mccall_model(mcm)
    w_idx = np.searchsorted(V - U, 0)

    if w_idx == len(V):
        w_bar = np.inf
    else:
        w_bar = mcm.w_vec[w_idx]

    if not return_values:
        return w_bar
    else:
        return w_bar, V, U
Now let’s compute and plot welfare, employment, unemployment, and tax revenue as a function of the unemployment
compensation rate
# Some global variables that will stay constant

α = 0.013
α_q = (1-(1-α)**3) # Quarterly (α is monthly)
b = 0.0124
d = 0.00822
β = 0.98
γ = 1.0
σ = 2.0
# The default wage distribution --- a discretized lognormal

log_wage_mean, wage_grid_size, max_wage = 20, 200, 170
logw_dist = norm(np.log(log_wage_mean), 1)
w_vec = np.linspace(1e-8, max_wage, wage_grid_size + 1)
cdf = logw_dist.cdf(np.log(w_vec))
pdf = cdf[1:] - cdf[:-1]
p_vec = pdf / pdf.sum()
w_vec = (w_vec[1:] + w_vec[:-1]) / 2
def compute_optimal_quantities(c, τ):
    """
    Compute the reservation wage, job finding rate and value functions
    of the workers given c and τ.

    """

    mcm = McCallModel(α=α_q,
                      β=β,
                      γ=γ,
                      c=c-τ,          # Post tax compensation
                      σ=σ,
                      w_vec=w_vec-τ,  # Post tax wages
                      p_vec=p_vec)

    w_bar, V, U = compute_reservation_wage(mcm, return_values=True)
    λ = γ * np.sum(p_vec[w_vec - τ > w_bar])
    return w_bar, λ, V, U


def compute_steady_state_quantities(c, τ):
    """
    Compute the steady state unemployment rate given c and τ using optimal
    quantities from the McCall model and computing corresponding steady
    state quantities

    """
    w_bar, λ, V, U = compute_optimal_quantities(c, τ)

    # Compute steady state employment and unemployment rates
    lm = LakeModel(α=α_q, λ=λ, b=b, d=d)
    x = lm.rate_steady_state()
    u, e = x

    # Compute steady state welfare
    accept = w_vec - τ > w_bar
    w = np.sum(V * p_vec * accept) / np.sum(p_vec * accept)
    welfare = e * w + u * U

    return e, u, welfare


def find_balanced_budget_tax(c):
    """
    Find the tax level that will induce a balanced budget.

    """
    def steady_state_budget(t):
        e, u, w = compute_steady_state_quantities(c, t)
        return t - u * c

    τ = brentq(steady_state_budget, 0.0, 0.9 * c)
    return τ
# Levels of unemployment insurance we wish to study

c_vec = np.linspace(5, 140, 60)
tax_vec = []
unempl_vec = []
empl_vec = []
welfare_vec = []
for c in c_vec:
    t = find_balanced_budget_tax(c)
    e_rate, u_rate, welfare = compute_steady_state_quantities(c, t)
    tax_vec.append(t)
    unempl_vec.append(u_rate)
    empl_vec.append(e_rate)
    welfare_vec.append(welfare)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

plots = [unempl_vec, empl_vec, tax_vec, welfare_vec]
titles = ['Unemployment', 'Employment', 'Tax', 'Welfare']

for ax, plot, title in zip(axes.flatten(), plots, titles):
    ax.plot(c_vec, plot, lw=2, alpha=0.7)
    ax.set_title(title)
    ax.grid()

plt.tight_layout()
plt.show()

Welfare first increases and then decreases as unemployment benefits rise.

The level that maximizes steady state welfare is approximately 62.
61.6 Exercises
Exercise 61.6.1
In the Lake Model, there is derived data such as 𝐴 which depends on primitives like 𝛼 and 𝜆.
So, when a user alters these primitives, we need the derived data to update automatically.
(For example, if a user changes the value of 𝑏 for a given instance of the class, we would like 𝑔 = 𝑏 − 𝑑 to update
automatically)
In the code above, we took care of this issue by creating new instances every time we wanted to change parameters.
That way the derived data is always matched to current parameter values.
However, we can use descriptors instead, so that derived data is updated whenever parameters are changed.
This is safer and means we don’t need to create a fresh instance for every new parameterization.
(On the other hand, the code becomes denser, which is why we don’t always use the descriptor approach in our lectures.)

In this exercise, your task is to arrange the LakeModel class by using descriptors and decorators such as @property.
(If you need to refresh your understanding of how these work, consult this lecture.)
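As a warm-up, here is a minimal sketch of the `@property` idea on a deliberately tiny example. The class `GrowthRates` is hypothetical and not part of the lecture's code; it recomputes the derived value $g = b - d$ on every access, which is one simple way to keep derived data in step with the primitives.

```python
class GrowthRates:
    """Hypothetical toy class: g = b - d is recomputed on every access."""

    def __init__(self, b=0.0124, d=0.00822):
        self.b, self.d = b, d

    @property
    def g(self):
        # Derived value computed from the current primitives
        return self.b - self.d


rates = GrowthRates()
g_before = rates.g
rates.b = 0.025          # change a primitive on the existing instance
g_after = rates.g        # the derived value reflects the change

print(g_before, g_after)
```

Recomputing on access is convenient when the derived object is cheap. When it is costly, an alternative is to store it and refresh it inside setter methods whenever a primitive changes.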

class LakeModelModified:
    """
    Solves the lake model and computes dynamics of unemployment stocks and
    rates.

    Parameters:
    ------------
    λ : scalar
        The job finding rate for currently unemployed workers
    α : scalar
        The dismissal rate for currently employed workers
    b : scalar
        Entry rate into the labor force
    d : scalar
        Exit rate from the labor force

    """
    def __init__(self, λ=0.283, α=0.013, b=0.0124, d=0.00822):
        self._λ, self._α, self._b, self._d = λ, α, b, d
        self.compute_derived_values()

    def compute_derived_values(self):
        # Unpack names to simplify expression
        λ, α, b, d = self._λ, self._α, self._b, self._d

        self._g = b - d
        self._A = np.array([[(1-d) * (1-λ) + b, (1 - d) * α + b],
                            [(1-d) * λ,         (1 - d) * (1 - α)]])

        self._A_hat = self._A / (1 + self._g)

    @property
    def g(self):
        return self._g

    @property
    def A(self):
        return self._A

    @property
    def A_hat(self):
        return self._A_hat

    # Each setter stores the new primitive and then refreshes g, A, A_hat

    @property
    def λ(self):
        return self._λ

    @λ.setter
    def λ(self, new_value):
        self._λ = new_value
        self.compute_derived_values()

    @property
    def α(self):
        return self._α

    @α.setter
    def α(self, new_value):
        self._α = new_value
        self.compute_derived_values()

    @property
    def b(self):
        return self._b

    @b.setter
    def b(self, new_value):
        self._b = new_value
        self.compute_derived_values()

    @property
    def d(self):
        return self._d

    @d.setter
    def d(self, new_value):
        self._d = new_value
        self.compute_derived_values()

    def rate_steady_state(self, tol=1e-6):
        """
        Finds the steady state of the system :math:`x_{t+1} = \hat A x_{t}`

        Returns
        --------
        xbar : steady state vector of unemployment and employment rates
        """
        x = np.array([self.A_hat[0, 1], self.A_hat[1, 0]])
        x /= x.sum()
        return x

    def simulate_stock_path(self, X0, T):
        """
        Simulates the sequence of unemployment and employment stocks

        Parameters
        ------------
        X0 : array
            Contains initial values (U0, E0)
        T : int
            Number of periods to simulate

        Returns
        ---------
        X : iterator
            Contains sequence of unemployment and employment stocks
        """
        X = np.atleast_1d(X0)  # Recast as array just in case
        for t in range(T):
            yield X
            X = self.A @ X

    def simulate_rate_path(self, x0, T):
        """
        Simulates the sequence of unemployment and employment rates

        Parameters
        ------------
        x0 : array
            Contains initial values (u0, e0)
        T : int
            Number of periods to simulate

        Returns
        ---------
        x : iterator
            Contains sequence of unemployment and employment rates
        """
        x = np.atleast_1d(x0)  # Recast as array just in case
        for t in range(T):
            yield x
            x = self.A_hat @ x
Exercise 61.6.2
Consider an economy with an initial stock of workers 𝑁0 = 100 at the steady state level of employment in the baseline
parameterization
• 𝛼 = 0.013
• 𝜆 = 0.283
• 𝑏 = 0.0124
• 𝑑 = 0.00822
(The values for 𝛼 and 𝜆 follow [Davis et al., 2006])
Suppose that in response to new legislation the hiring rate reduces to 𝜆 = 0.2.
Plot the transition dynamics of the unemployment and employment stocks for 50 periods.
Plot the transition dynamics for the rates.
How long does the economy take to converge to its new steady state?
What is the new steady state level of employment?
Note: It may be easier to use the class created in exercise 1 to help with changing variables.


We begin by constructing the class containing the default parameters and assigning the steady state values to x0
lm = LakeModelModified()
x0 = lm.rate_steady_state()
print(f"Initial Steady State: {x0}")
Initial Steady State: [0.08266627 0.91733373]
Initialize the simulation values
N0 = 100
T = 50
New legislation changes 𝜆 to 0.2
lm.λ = 0.2
xbar = lm.rate_steady_state() # new steady state

X_path = np.vstack(tuple(lm.simulate_stock_path(x0 * N0, T)))
x_path = np.vstack(tuple(lm.simulate_rate_path(x0, T)))
print(f"New Steady State: {xbar}")
New Steady State: [0.11309295 0.88690705]
Now plot stocks
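fig, axes = plt.subplots(3, 1, figsize=[10, 9])

axes[0].plot(X_path[:, 0])
axes[0].set_title('Unemployment')

axes[1].plot(X_path[:, 1])
axes[1].set_title('Employment')

axes[2].plot(X_path.sum(1))
axes[2].set_title('Labor force')

for ax in axes:
    ax.grid()

plt.tight_layout()
plt.show()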

And how the rates evolve

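fig, axes = plt.subplots(2, 1, figsize=(10, 8))
titles = ['Unemployment rate', 'Employment rate']

for i, title in enumerate(titles):
    axes[i].plot(x_path[:, i])
    axes[i].set_title(title)
    axes[i].grid()

plt.tight_layout()
plt.show()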

We see that it takes 20 periods for the economy to converge to its new steady state levels.
Exercise 61.6.3
Consider an economy with an initial stock of workers 𝑁0 = 100 at the steady state level of employment in the baseline
parameterization.
Suppose that for 20 periods the birth rate was temporarily high (𝑏 = 0.025) and then returned to its original level.
Plot the transition dynamics of the unemployment and employment stocks for 50 periods.
Plot the transition dynamics for the rates.
How long does the economy take to return to its original steady state?

This next exercise has the economy experiencing a boom in entrances to the labor market and then later returning to the
original levels.
For 20 periods the economy has a new entry rate into the labor market.
Let’s start off at the baseline parameterization and record the steady state

lm = LakeModelModified()
x0 = lm.rate_steady_state()
Here are the other parameters:
b_hat = 0.025
T_hat = 20
Let’s increase 𝑏 to the new value and simulate for 20 periods
lm.b = b_hat
# Simulate stocks
X_path1 = np.vstack(tuple(lm.simulate_stock_path(x0 * N0, T_hat)))
# Simulate rates
x_path1 = np.vstack(tuple(lm.simulate_rate_path(x0, T_hat)))
Now we reset 𝑏 to the original value and then, using the state after 20 periods for the new initial conditions, we simulate
for the additional 30 periods
lm.b = 0.0124
# Simulate stocks
X_path2 = np.vstack(tuple(lm.simulate_stock_path(X_path1[-1, :2], T-T_hat+1)))
# Simulate rates
x_path2 = np.vstack(tuple(lm.simulate_rate_path(x_path1[-1, :2], T-T_hat+1)))
Finally, we combine these two paths and plot
# note [1:] to avoid doubling period 20

x_path = np.vstack([x_path1, x_path2[1:]])
X_path = np.vstack([X_path1, X_path2[1:]])
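fig, axes = plt.subplots(3, 1, figsize=[10, 9])

axes[0].plot(X_path[:, 0])    # Unemployment stock
axes[1].plot(X_path[:, 1])    # Employment stock
axes[2].plot(X_path.sum(1))   # Labor force

for ax in axes:
    ax.grid()

plt.tight_layout()
plt.show()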

And the rates

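fig, axes = plt.subplots(2, 1, figsize=[10, 6])

for i in range(2):  # 0: unemployment rate, 1: employment rate
    axes[i].plot(x_path[:, i])
    axes[i].hlines(x0[i], 0, T, 'r', '--')
    axes[i].grid()

plt.tight_layout()
plt.show()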



CHAPTER
SIXTYTWO
RATIONAL EXPECTATIONS EQUILIBRIUM
Contents
• Rational Expectations Equilibrium

– Overview
– Rational Expectations Equilibrium
– Computing an Equilibrium
– Exercises
“If you’re so smart, why aren’t you rich?”

62.1 Overview
This lecture introduces the concept of a rational expectations equilibrium.

To illustrate it, we describe a linear quadratic version of a model due to Lucas and Prescott [Lucas and Prescott, 1971].
That 1971 paper is one of a small number of research articles that ignited a rational expectations revolution.
We follow Lucas and Prescott by employing a setting that is readily “Bellmanized” (i.e., susceptible to being formulated as a dynamic programming problem).
Because we use linear quadratic setups for demand and costs, we can deploy the LQ programming techniques described
in this lecture.
We will learn about how a representative agent’s problem differs from a planner’s, and how a planning problem can be
used to compute quantities and prices in a rational expectations equilibrium.
We will also learn about how a rational expectations equilibrium can be characterized as a fixed point of a mapping from
a perceived law of motion to an actual law of motion.
Equality between a perceived and an actual law of motion for endogenous market-wide objects captures in a nutshell what
the rational expectations equilibrium concept is all about.
Finally, we will learn about the important “Big 𝐾, little 𝑘” trick, a modeling device widely used in macroeconomics.
Except that for us
• Instead of “Big 𝐾” it will be “Big 𝑌 ”.

• Instead of “little 𝑘” it will be “little 𝑦”.

import numpy as np
We’ll also use the LQ class from QuantEcon.py.
62.1.1 The Big Y, little y Trick
This widely used method applies in contexts in which a representative firm or agent is a “price taker” operating within
a competitive equilibrium.
The following setting justifies the concept of a representative firm that stands in for a large number of other firms too.
There is a uniform unit measure of identical firms named 𝜔 ∈ Ω = [0, 1].
The output of firm 𝜔 is 𝑦(𝜔).
The output of all firms is $Y = \int_0^1 y(\omega) \, d\omega$.
All firms end up choosing to produce the same output, so that at the end of the day $y(\omega) = y$ and $Y = y = \int_0^1 y(\omega) \, d\omega$.
This setting allows us to speak of a representative firm that chooses to produce 𝑦.
We want to impose that
• The representative firm or individual firm takes aggregate 𝑌 as given when it chooses individual 𝑦(𝜔), but ….
• At the end of the day, 𝑌 = 𝑦(𝜔) = 𝑦, so that the representative firm is indeed representative.
The Big 𝑌 , little 𝑦 trick accomplishes these two goals by
• Taking 𝑌 as beyond control when posing the choice problem of the firm that chooses 𝑦; but ….
• Imposing 𝑌 = 𝑦 after having solved the individual’s optimization problem.
Please watch for how this strategy is applied as the lecture unfolds.
We begin by applying the Big 𝑌 , little 𝑦 trick in a very simple static context.
A Simple Static Example of the Big Y, little y Trick
Consider a static model in which a unit measure of firms produce a homogeneous good that is sold in a competitive market.
Each of these firms ends up producing and selling output 𝑦(𝜔) = 𝑦.
The price 𝑝 of the good lies on an inverse demand curve
$$p = a_0 - a_1 Y \tag{62.1}$$
where
• 𝑎𝑖 > 0 for 𝑖 = 0, 1
• $Y = \int_0^1 y(\omega) \, d\omega$ is the market-wide level of output
For convenience, we’ll often just write 𝑦 instead of 𝑦(𝜔) when we are describing the choice problem of an individual firm
𝜔 ∈ Ω.
Each firm has a total cost function
$$c(y) = c_1 y + 0.5 c_2 y^2, \quad c_i > 0 \text{ for } i = 1, 2$$
The profits of a representative firm are 𝑝𝑦 − 𝑐(𝑦).

Using (62.1), we can express the problem of the representative firm as
$$\max_{y} \left[ (a_0 - a_1 Y) y - c_1 y - 0.5 c_2 y^2 \right] \tag{62.2}$$
In posing problem (62.2), we want the firm to be a price taker.

We do that by regarding 𝑝 and therefore 𝑌 as exogenous to the firm.
The essence of the Big 𝑌 , little 𝑦 trick is not to set 𝑌 = 𝑦 before taking the first-order condition with respect to 𝑦 in
problem (62.2).
This assures that the firm is a price taker.
The first-order condition for problem (62.2) is
$$a_0 - a_1 Y - c_1 - c_2 y = 0 \tag{62.3}$$
At this point, but not before, we substitute 𝑌 = 𝑦 into (62.3) to obtain the following linear equation
𝑎0 − 𝑐1 − (𝑎1 + 𝑐2 )𝑌 = 0 (62.4)
to be solved for the competitive equilibrium market-wide output 𝑌 .

After solving for 𝑌 , we can compute the competitive equilibrium price 𝑝 from the inverse demand curve (62.1).
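As a quick numerical illustration of (62.3) and (62.4), the sketch below solves the linear equation for 𝑌 and backs out the price from (62.1). The parameter values are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
# Hypothetical parameter values (assumptions, not from the lecture)
a0, a1 = 100.0, 0.05   # inverse demand curve p = a0 - a1 * Y
c1, c2 = 10.0, 0.1     # cost function c(y) = c1 * y + 0.5 * c2 * y**2

# Equation (62.4): a0 - c1 - (a1 + c2) * Y = 0
Y = (a0 - c1) / (a1 + c2)

# Competitive equilibrium price from the inverse demand curve (62.1)
p = a0 - a1 * Y

# First-order condition (62.3), evaluated after imposing y = Y
foc = a0 - a1 * Y - c1 - c2 * Y

print(f"Y = {Y:.4f}, p = {p:.4f}, FOC residual = {foc:.2e}")
```

Note that 𝑌 = 𝑦 is imposed only after the first-order condition has been taken, which is exactly the order of operations the trick requires.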
62.1.2 Related Planning Problem
Define consumer surplus as the area under the inverse demand curve:
$$S_c(Y) = \int_0^Y (a_0 - a_1 s) \, ds = a_0 Y - \frac{a_1}{2} Y^2 .$$
Define the social cost of production as

$$S_p(Y) = c_1 Y + \frac{c_2}{2} Y^2$$
Consider the planning problem
$$\max_{Y} \left[ S_c(Y) - S_p(Y) \right]$$
The first-order necessary condition for the planning problem is equation (62.4).
Thus, a 𝑌 that solves (62.4) is a competitive equilibrium output as well as an output that solves the planning problem.
This type of outcome provides an intellectual justification for liking a competitive equilibrium.
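The equivalence between the planner's choice and the competitive outcome can be checked numerically. The sketch below maximizes 𝑆𝑐(𝑌) − 𝑆𝑝(𝑌) on a grid and compares the maximizer with the solution of (62.4). The parameter values and the grid are hypothetical choices made for this illustration.

```python
import numpy as np

# Hypothetical parameter values (assumptions, not from the lecture)
a0, a1 = 100.0, 0.05
c1, c2 = 10.0, 0.1

def welfare(Y):
    "Consumer surplus minus the social cost of production."
    S_c = a0 * Y - 0.5 * a1 * Y**2
    S_p = c1 * Y + 0.5 * c2 * Y**2
    return S_c - S_p

# Brute-force maximization on a grid with step 0.01
grid = np.linspace(0.0, 1200.0, 120_001)
Y_planner = grid[np.argmax(welfare(grid))]

# Competitive equilibrium output from (62.4)
Y_competitive = (a0 - c1) / (a1 + c2)

print(Y_planner, Y_competitive)
```

A grid search is used here only because it makes no use of the first-order condition, so the agreement of the two numbers is not built in by construction.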
62.1.3 Further Reading
References for this lecture include

• [Lucas and Prescott, 1971]
• [Sargent, 1987], chapter XIV
• [Ljungqvist and Sargent, 2018], chapter 7
62.2 Rational Expectations Equilibrium
Our first illustration of a rational expectations equilibrium involves a market with a unit measure of identical firms, each
of which seeks to maximize the discounted present value of profits in the face of adjustment costs.
The adjustment costs induce the firms to make gradual adjustments, which in turn requires consideration of future prices.
Individual firms understand that, via the inverse demand curve, the price is determined by the amounts supplied by other
firms.
Hence each firm wants to forecast future total industry output.
In our context, a forecast is generated by a belief about the law of motion for the aggregate state.
Rational expectations equilibrium prevails when this belief coincides with the actual law of motion generated by production
choices induced by this belief.
We formulate a rational expectations equilibrium in terms of a fixed point of an operator that maps beliefs into optimal
beliefs.
62.2.1 Competitive Equilibrium with Adjustment Costs
To illustrate, consider a unit measure of firms producing a homogeneous good that is sold in a competitive market.
Each firm sells output 𝑦𝑡 (𝜔) = 𝑦𝑡 .
The price 𝑝𝑡 of the good lies on the inverse demand curve
𝑝𝑡 = 𝑎0 − 𝑎1 𝑌𝑡 (62.5)
where
• 𝑎𝑖 > 0 for 𝑖 = 0, 1
• $Y_t = \int_0^1 y_t(\omega) \, d\omega = y_t$ is the market-wide level of output
The Firm’s Problem
Each firm is a price taker.

While it faces no uncertainty, it does face adjustment costs.
In particular, it chooses a production plan to maximize
$$\sum_{t=0}^{\infty} \beta^t r_t \tag{62.6}$$

where
$$r_t := p_t y_t - \frac{\gamma (y_{t+1} - y_t)^2}{2}, \qquad y_0 \text{ given} \tag{62.7}$$
Regarding the parameters,
• 𝛽 ∈ (0, 1) is a discount factor
• 𝛾 > 0 measures the cost of adjusting the rate of output
Regarding timing, the firm observes 𝑝𝑡 and 𝑦𝑡 when it chooses 𝑦𝑡+1 at time 𝑡.
To state the firm’s optimization problem completely requires that we specify dynamics for all state variables.
This includes ones that the firm cares about but does not control like 𝑝𝑡 .
We turn to this problem now.
Prices and Aggregate Output
In view of (62.5), the firm’s incentive to forecast the market price translates into an incentive to forecast aggregate output
𝑌𝑡 .
Aggregate output depends on the choices of other firms.
The output $y_t(\omega)$ of a single firm $\omega$ has a negligible effect on aggregate output $\int_0^1 y_t(\omega) \, d\omega$.
That justifies firms in regarding their forecasts of aggregate output as being unaffected by their own output decisions.
Representative Firm’s Beliefs
We suppose the firm believes that market-wide output 𝑌𝑡 follows the law of motion
𝑌𝑡+1 = 𝐻(𝑌𝑡 ) (62.8)
where 𝑌0 is a known initial condition.

The belief function 𝐻 is an equilibrium object, and hence remains to be determined.
Optimal Behavior Given Beliefs
For now, let’s fix a particular belief 𝐻 in (62.8) and investigate the firm’s response to it.
Let 𝑣 be the optimal value function for the firm’s problem given 𝐻.
The value function satisfies the Bellman equation
$$v(y, Y) = \max_{y'} \left\{ a_0 y - a_1 y Y - \frac{\gamma (y' - y)^2}{2} + \beta v(y', H(Y)) \right\} \tag{62.9}$$
Let’s denote the firm’s optimal policy function by ℎ, so that
𝑦𝑡+1 = ℎ(𝑦𝑡 , 𝑌𝑡 ) (62.10)
where
$$h(y, Y) := \operatorname*{argmax}_{y'} \left\{ a_0 y - a_1 y Y - \frac{\gamma (y' - y)^2}{2} + \beta v(y', H(Y)) \right\} \tag{62.11}$$
Evidently 𝑣 and ℎ both depend on 𝐻.
Characterization with First-Order Necessary Conditions
In what follows it will be helpful to have a second characterization of ℎ, based on first-order conditions.
The first-order necessary condition for choosing 𝑦′ is
−𝛾(𝑦′ − 𝑦) + 𝛽𝑣𝑦 (𝑦′ , 𝐻(𝑌 )) = 0 (62.12)
An important and useful envelope result of Benveniste and Scheinkman [Benveniste and Scheinkman, 1979] implies that to
differentiate 𝑣 with respect to 𝑦 we can naively differentiate the right side of (62.9), giving
𝑣𝑦 (𝑦, 𝑌 ) = 𝑎0 − 𝑎1 𝑌 + 𝛾(𝑦′ − 𝑦)
Substituting this equation into (62.12) gives the Euler equation
−𝛾(𝑦𝑡+1 − 𝑦𝑡 ) + 𝛽[𝑎0 − 𝑎1 𝑌𝑡+1 + 𝛾(𝑦𝑡+2 − 𝑦𝑡+1 )] = 0 (62.13)
The firm optimally sets an output path that satisfies (62.13), taking (62.8) as given, and subject to
• the initial conditions for (𝑦0 , 𝑌0 ).
• the terminal condition $\lim_{t \to \infty} \beta^t y_t v_y(y_t, Y_t) = 0$.
This last condition is called the transversality condition, and acts as a first-order necessary condition “at infinity”.
A representative firm’s decision rule solves the difference equation (62.13) subject to the given initial condition 𝑦0 and the
transversality condition.
Note that solving the Bellman equation (62.9) for 𝑣 and then ℎ in (62.11) yields a decision rule that automatically imposes
both the Euler equation (62.13) and the transversality condition.
The Actual Law of Motion for Output
As we’ve seen, a given belief translates into a particular decision rule ℎ.

Recalling that in equilibrium 𝑌𝑡 = 𝑦𝑡 , the actual law of motion for market-wide output is then
𝑌𝑡+1 = ℎ(𝑌𝑡 , 𝑌𝑡 ) (62.14)
Thus, when firms believe that the law of motion for market-wide output is (62.8), their optimizing behavior makes the
actual law of motion be (62.14).
62.2.2 Definition of Rational Expectations Equilibrium
A rational expectations equilibrium or recursive competitive equilibrium of the model with adjustment costs is a decision
rule ℎ and an aggregate law of motion 𝐻 such that
1. Given belief 𝐻, the map ℎ is the firm’s optimal policy function.
2. The law of motion 𝐻 satisfies 𝐻(𝑌 ) = ℎ(𝑌 , 𝑌 ) for all 𝑌 .
Thus, a rational expectations equilibrium equates the perceived and actual laws of motion (62.8) and (62.14).

Fixed Point Characterization
As we’ve seen, the firm’s optimum problem induces a mapping Φ from a perceived law of motion 𝐻 for market-wide
output to an actual law of motion Φ(𝐻).
The mapping Φ is the composition of two mappings, the first of which maps a perceived law of motion into a decision
rule via (62.9)–(62.11), the second of which maps a decision rule into an actual law via (62.14).
The 𝐻 component of a rational expectations equilibrium is a fixed point of Φ.
62.3 Computing an Equilibrium
Now let’s compute a rational expectations equilibrium.
62.3.1 Failure of Contractivity
Readers accustomed to dynamic programming arguments might try to address this problem by choosing some guess 𝐻0
for the aggregate law of motion and then iterating with Φ.
Unfortunately, the mapping Φ is not a contraction.
Indeed, there is no guarantee that direct iterations on Φ converge [1].
There are examples in which these iterations diverge.
Fortunately, another method works here.
The method exploits a connection between equilibrium and Pareto optimality expressed in the fundamental theorems of
welfare economics (see, e.g., [Mas-Colell et al., 1995]).
Lucas and Prescott [Lucas and Prescott, 1971] used this method to construct a rational expectations equilibrium.
Some details follow.
62.3.2 A Planning Problem Approach
Our plan of attack is to match the Euler equations of the market problem with those for a single-agent choice problem.
As we’ll see, this planning problem can be solved by LQ control (linear regulator).
Optimal quantities from the planning problem are rational expectations equilibrium quantities.
The rational expectations equilibrium price can be obtained as a shadow price in the planning problem.
We first compute a sum of consumer and producer surplus at time 𝑡
$$s(Y_t, Y_{t+1}) := \int_0^{Y_t} (a_0 - a_1 x) \, dx - \frac{\gamma (Y_{t+1} - Y_t)^2}{2} \tag{62.15}$$
The first term is the area under the demand curve, while the second measures the social costs of changing output.
[1] A literature that studies whether models populated with agents who learn can converge to rational expectations equilibria features iterations on a
modification of the mapping Φ that can be approximated as 𝛾Φ + (1 − 𝛾)𝐼. Here 𝐼 is the identity operator and 𝛾 ∈ (0, 1) is a relaxation parameter.
See [Marcet and Sargent, 1989] and [Evans and Honkapohja, 2001] for statements and applications of this approach to establish conditions under which
collections of adaptive agents who use least squares learning converge to a rational expectations equilibrium.
The planning problem is to choose a production plan {𝑌𝑡 } to maximize

$$\sum_{t=0}^{\infty} \beta^t s(Y_t, Y_{t+1})$$
subject to an initial condition for 𝑌0 .
62.3.3 Solution of Planning Problem
Evaluating the integral in (62.15) yields the quadratic form $a_0 Y_t - a_1 Y_t^2 / 2$.
As a result, the Bellman equation for the planning problem is
$$V(Y) = \max_{Y'} \left\{ a_0 Y - \frac{a_1}{2} Y^2 - \frac{\gamma (Y' - Y)^2}{2} + \beta V(Y') \right\} \tag{62.16}$$
The associated first-order condition is
−𝛾(𝑌 ′ − 𝑌 ) + 𝛽𝑉 ′ (𝑌 ′ ) = 0 (62.17)
Applying the same Benveniste-Scheinkman formula gives
𝑉 ′ (𝑌 ) = 𝑎0 − 𝑎1 𝑌 + 𝛾(𝑌 ′ − 𝑌 )
Substituting this into equation (62.17) and rearranging leads to the Euler equation
𝛽𝑎0 + 𝛾𝑌𝑡 − [𝛽𝑎1 + 𝛾(1 + 𝛽)]𝑌𝑡+1 + 𝛾𝛽𝑌𝑡+2 = 0 (62.18)
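Because (62.18) is a linear second-order difference equation, its stable solution can be read off from the roots of its characteristic polynomial. The sketch below does this with NumPy alone, using the parameter values from the exercises in this lecture. Selecting the root inside the unit circle is an assumption of this sketch, motivated by the requirement that the solution satisfy the terminal condition; it is offered as a cross-check, not as the lecture's solution method.

```python
import numpy as np

# Parameter values used in the exercises of this lecture
a0, a1, β, γ = 100, 0.05, 0.95, 10.0

# Characteristic polynomial of (62.18), written in the forward shift z:
#   γβ z² − [β a1 + γ(1 + β)] z + γ = 0
roots = np.roots([γ * β, -(β * a1 + γ * (1 + β)), γ])

# Keep the root inside the unit circle (the stable one)
κ1 = roots[np.abs(roots) < 1][0].real

# A constant Y solves (62.18) when β a0 − β a1 Y = 0
Y_bar = a0 / a1

# A linear law Y_{t+1} = κ0 + κ1 Y_t with fixed point Y_bar needs
κ0 = (1 - κ1) * Y_bar

print(κ0, κ1)
```

The resulting pair can be compared with the (𝜅0 , 𝜅1 ) values that the LQ solver reports for the planner's problem in the exercises.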
62.3.4 Key Insight
Return to equation (62.13) and set 𝑦𝑡 = 𝑌𝑡 for all 𝑡.

A small amount of algebra will convince you that when 𝑦𝑡 = 𝑌𝑡 , equations (62.18) and (62.13) are identical.
Thus, the Euler equation for the planning problem matches the second-order difference equation that we derived by
1. finding the Euler equation of the representative firm and
2. substituting into it the expression 𝑌𝑡 = 𝑦𝑡 that “makes the representative firm be representative”.
If it is appropriate to apply the same terminal conditions for these two difference equations, which it is, then we have
verified that a solution of the planning problem is also a rational expectations equilibrium quantity sequence.
It follows that for this example we can compute equilibrium quantities by forming the optimal linear regulator problem
corresponding to the Bellman equation (62.16).
The optimal policy function for the planning problem is the aggregate law of motion 𝐻 that the representative firm faces
within a rational expectations equilibrium.
Structure of the Law of Motion
As you are asked to show in the exercises, the fact that the planner’s problem is an LQ control problem implies an optimal
policy — and hence aggregate law of motion — taking the form
𝑌𝑡+1 = 𝜅0 + 𝜅1 𝑌𝑡 (62.19)
for some parameter pair 𝜅0 , 𝜅1 .

Now that we know the aggregate law of motion is linear, we can see from the firm’s Bellman equation (62.9) that the
firm’s problem can also be framed as an LQ problem.
As you’re asked to show in the exercises, the LQ formulation of the firm’s problem implies a law of motion that looks as
follows
𝑦𝑡+1 = ℎ0 + ℎ1 𝑦𝑡 + ℎ2 𝑌𝑡 (62.20)
Hence a rational expectations equilibrium will be defined by the parameters (𝜅0 , 𝜅1 , ℎ0 , ℎ1 , ℎ2 ) in (62.19)–(62.20).
62.4 Exercises
Exercise 62.4.1
Consider the firm problem described above.
Let the firm’s belief function 𝐻 be as given in (62.19).
Formulate the firm’s problem as a discounted optimal linear regulator problem, being careful to describe all of the objects
needed.
Use the class LQ from the QuantEcon.py package to solve the firm’s problem for the following parameter values:
𝑎0 = 100, 𝑎1 = 0.05, 𝛽 = 0.95, 𝛾 = 10, 𝜅0 = 95.5, 𝜅1 = 0.95
Express the solution of the firm’s problem in the form (62.20) and give the values for each ℎ𝑗 .
If there were a unit measure of identical competitive firms all behaving according to (62.20), what would (62.20) imply
for the actual law of motion (62.8) for market supply?

To map a problem into a discounted optimal linear control problem, we need to define
• state vector 𝑥𝑡 and control vector 𝑢𝑡
• matrices 𝐴, 𝐵, 𝑄, 𝑅 that define preferences and the law of motion for the state
For the state and control vectors, we choose
$$x_t = \begin{bmatrix} y_t \\ Y_t \\ 1 \end{bmatrix}, \qquad u_t = y_{t+1} - y_t$$
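For 𝐴, 𝐵, 𝑄, 𝑅 we set
$$A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \kappa_1 & \kappa_0 \\ 0 & 0 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad R = \begin{bmatrix} 0 & a_1/2 & -a_0/2 \\ a_1/2 & 0 & 0 \\ -a_0/2 & 0 & 0 \end{bmatrix}, \quad Q = \gamma/2$$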
By multiplying out you can confirm that
• 𝑥′𝑡 𝑅𝑥𝑡 + 𝑢′𝑡 𝑄𝑢𝑡 = −𝑟𝑡
• 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡
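The "multiplying out" step can also be done numerically. The sketch below evaluates both sides of the two identities at a single arbitrary point; the test values for 𝑦, 𝑌 and 𝑦′ are hypothetical and carry no economic meaning.

```python
import numpy as np

# Parameters and beliefs from the exercise statement
a0, a1, γ = 100, 0.05, 10.0
κ0, κ1 = 95.5, 0.95

A = np.array([[1, 0, 0], [0, κ1, κ0], [0, 0, 1]])
B = np.array([1, 0, 0])
R = np.array([[0, a1 / 2, -a0 / 2], [a1 / 2, 0, 0], [-a0 / 2, 0, 0]])
Q = γ / 2

# Hypothetical test point (assumption for illustration)
y, Y, y_next = 3.0, 7.0, 3.5
x = np.array([y, Y, 1.0])
u = y_next - y

# Left side: the quadratic loss used by the LQ formulation
loss = x @ R @ x + Q * u**2

# Right side: minus the one-period return r_t from (62.7), with p_t from (62.5)
r = (a0 - a1 * Y) * y - γ * (y_next - y)**2 / 2

# Law of motion for the state
x_next = A @ x + B * u

print(loss, -r, x_next)
```

A check at one point does not prove the identities, but since both sides are quadratic forms in the same variables it catches sign and factor-of-two mistakes quickly.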

We’ll use the module lqcontrol.py to solve the firm’s problem at the stated parameter values.
This will return an LQ policy 𝐹 with the interpretation 𝑢𝑡 = −𝐹 𝑥𝑡 , or
𝑦𝑡+1 − 𝑦𝑡 = −𝐹0 𝑦𝑡 − 𝐹1 𝑌𝑡 − 𝐹2
Matching parameters with 𝑦𝑡+1 = ℎ0 + ℎ1 𝑦𝑡 + ℎ2 𝑌𝑡 leads to
ℎ0 = −𝐹2 , ℎ 1 = 1 − 𝐹0 , ℎ2 = −𝐹1
Here’s our solution
# Model parameters
a0 = 100
a1 = 0.05
β = 0.95
γ = 10.0
# Beliefs
κ0 = 95.5
κ1 = 0.95
# Formulate the LQ problem
A = np.array([[1, 0, 0], [0, κ1, κ0], [0, 0, 1]])

B = np.array([1, 0, 0])
B.shape = 3, 1
R = np.array([[0, a1/2, -a0/2], [a1/2, 0, 0], [-a0/2, 0, 0]])
Q = 0.5 * γ
lq = LQ(Q, R, A, B, beta=β)
P, F, d = lq.stationary_values()
F = F.flatten()
out1 = f"F = [{F[0]:.3f}, {F[1]:.3f}, {F[2]:.3f}]"
h0, h1, h2 = -F[2], 1 - F[0], -F[1]
out2 = f"(h0, h1, h2) = ({h0:.3f}, {h1:.3f}, {h2:.3f})"
print(out1)
print(out2)
F = [-0.000, 0.046, -96.949]

(h0, h1, h2) = (96.949, 1.000, -0.046)
The implication is that
𝑦𝑡+1 = 96.949 + 𝑦𝑡 − 0.046 𝑌𝑡
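For the case 𝑛 > 1, recall that 𝑌𝑡 = 𝑛𝑦𝑡 , which, combined with the previous equation, yields
$$Y_{t+1} = n (96.949 + y_t - 0.046 Y_t) = 96.949 n + (1 - 0.046 n) Y_t$$
With a unit measure of firms, $Y_t = y_t$, which corresponds to $n = 1$, so the actual law of motion is $Y_{t+1} = 96.949 + 0.954 Y_t$.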
Exercise 62.4.2

Consider the following 𝜅0 , 𝜅1 pairs as candidates for the aggregate law of motion component of a rational expectations
equilibrium (see (62.19)).
Extending the program that you wrote for Exercise 62.4.1, determine which if any satisfy the definition of a rational
expectations equilibrium
• (94.0886298678, 0.923409232937)
• (93.2119845412, 0.984323478873)
• (95.0818452486, 0.952459076301)
Describe an iterative algorithm that uses the program that you wrote for Exercise 62.4.1 to compute a rational expectations
equilibrium.
(You are not being asked actually to use the algorithm you are suggesting)

To determine whether a 𝜅0 , 𝜅1 pair forms the aggregate law of motion component of a rational expectations equilibrium,
we can proceed as follows:
• Determine the corresponding firm law of motion 𝑦𝑡+1 = ℎ0 + ℎ1 𝑦𝑡 + ℎ2 𝑌𝑡 .
• Test whether the associated aggregate law 𝑌𝑡+1 = 𝑛ℎ(𝑌𝑡 /𝑛, 𝑌𝑡 ) evaluates to 𝑌𝑡+1 = 𝜅0 + 𝜅1 𝑌𝑡 .
In the second step, a unit measure of firms corresponds to 𝑛 = 1, so 𝑌𝑡 = 𝑦𝑡 and 𝑌𝑡+1 = 𝑛ℎ(𝑌𝑡 /𝑛, 𝑌𝑡 ) becomes
𝑌𝑡+1 = ℎ(𝑌𝑡 , 𝑌𝑡 ) = ℎ0 + (ℎ1 + ℎ2 )𝑌𝑡
Hence to test the second step we can test 𝜅0 = ℎ0 and 𝜅1 = ℎ1 + ℎ2 .

The following code implements this test
candidates = ((94.0886298678, 0.923409232937),
              (93.2119845412, 0.984323478873),
              (95.0818452486, 0.952459076301))
for κ0, κ1 in candidates:
    # Form the associated law of motion
    A = np.array([[1, 0, 0], [0, κ1, κ0], [0, 0, 1]])
    # Solve the LQ problem for the firm
    lq = LQ(Q, R, A, B, beta=β)
    P, F, d = lq.stationary_values()
    F = F.flatten()
    h0, h1, h2 = -F[2], 1 - F[0], -F[1]
    # Test the equilibrium condition
    if np.allclose((κ0, κ1), (h0, h1 + h2)):
        print(f'Equilibrium pair = {κ0}, {κ1}')
        print(f'(h0, h1, h2) = {h0:.4f}, {h1:.4f}, {h2:.4f}')
        break
Equilibrium pair = 95.0818452486, 0.952459076301
(h0, h1, h2) = 95.0819, 1.0000, -0.0475

The output tells us that the answer is pair (iii), which implies (ℎ0 , ℎ1 , ℎ2 ) = (95.0819, 1.0000, −.0475).
(Notice we use np.allclose to test equality of floating-point numbers, since exact equality is too strict).
Regarding the iterative algorithm, one could loop from a given (𝜅0 , 𝜅1 ) pair to the associated firm law and then to a new
(𝜅0 , 𝜅1 ) pair.
This amounts to implementing the operator Φ described in the lecture.
(There is in general no guarantee that this iterative process will converge to a rational expectations equilibrium)
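A solver-free cross-check is also possible. In equilibrium the aggregate path must satisfy the Euler equation (62.18), so one can simulate 𝑌𝑡+1 = 𝜅0 + 𝜅1 𝑌𝑡 for each candidate and inspect the Euler residuals. The sketch below does this with NumPy only. The initial condition 𝑌0 = 100 and the horizon are arbitrary assumptions, and a small residual is a necessary condition for equilibrium, not a full verification of the definition.

```python
import numpy as np

a0, a1, β, γ = 100, 0.05, 0.95, 10.0
candidates = ((94.0886298678, 0.923409232937),
              (93.2119845412, 0.984323478873),
              (95.0818452486, 0.952459076301))

def max_euler_residual(κ0, κ1, Y0=100.0, T=50):
    "Largest absolute residual of (62.18) along Y_{t+1} = κ0 + κ1 Y_t."
    Y = np.empty(T + 2)
    Y[0] = Y0
    for t in range(T + 1):
        Y[t + 1] = κ0 + κ1 * Y[t]
    res = (β * a0 + γ * Y[:-2]
           - (β * a1 + γ * (1 + β)) * Y[1:-1]
           + γ * β * Y[2:])
    return np.max(np.abs(res))

residuals = [max_euler_residual(κ0, κ1) for κ0, κ1 in candidates]
for (κ0, κ1), r in zip(candidates, residuals):
    print(f"κ0 = {κ0:.4f}, κ1 = {κ1:.4f}: max residual = {r:.2e}")
```

Only the third candidate should produce residuals close to zero, in line with the answer obtained from the LQ-based test.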
Exercise 62.4.3
Recall the planner’s problem described above
1. Formulate the planner’s problem as an LQ problem.
2. Solve it using the same parameter values as in Exercise 62.4.1
• 𝑎0 = 100, 𝑎1 = 0.05, 𝛽 = 0.95, 𝛾 = 10
3. Represent the solution in the form 𝑌𝑡+1 = 𝜅0 + 𝜅1 𝑌𝑡 .
4. Compare your answer with the results from Exercise 62.4.2.

We are asked to write the planner problem as an LQ problem.
For the state and control vectors, we choose
$$x_t = \begin{bmatrix} Y_t \\ 1 \end{bmatrix}, \qquad u_t = Y_{t+1} - Y_t$$
For the LQ matrices, we set
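$$A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad R = \begin{bmatrix} a_1/2 & -a_0/2 \\ -a_0/2 & 0 \end{bmatrix}, \quad Q = \gamma/2$$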
By multiplying out you can confirm that

• 𝑥′𝑡 𝑅𝑥𝑡 + 𝑢′𝑡 𝑄𝑢𝑡 = −𝑠(𝑌𝑡 , 𝑌𝑡+1 )
• 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡
By obtaining the optimal policy and using 𝑢𝑡 = −𝐹 𝑥𝑡 or
𝑌𝑡+1 − 𝑌𝑡 = −𝐹0 𝑌𝑡 − 𝐹1
we can obtain the implied aggregate law of motion via 𝜅0 = −𝐹1 and 𝜅1 = 1 − 𝐹0 .
The Python code to solve this problem is below:
# Formulate the planner's LQ problem
A = np.array([[1, 0], [0, 1]])

B = np.array([[1], [0]])
R = np.array([[a1 / 2, -a0 / 2], [-a0 / 2, 0]])
Q = γ / 2
# Solve for the optimal policy
lq = LQ(Q, R, A, B, beta=β)
P, F, d = lq.stationary_values()
# Print the results
F = F.flatten()
κ0, κ1 = -F[1], 1 - F[0]
print(κ0, κ1)
95.08187459215002 0.9524590627039248
The output yields the same (𝜅0 , 𝜅1 ) pair obtained as an equilibrium from the previous exercise.
Exercise 62.4.4
A monopolist faces the industry demand curve (62.5) and chooses $\{Y_t\}$ to maximize $\sum_{t=0}^{\infty} \beta^t r_t$ where
$$r_t = p_t Y_t - \frac{\gamma (Y_{t+1} - Y_t)^2}{2}$$
Formulate this problem as an LQ problem.
Compute the optimal policy using the same parameters as Exercise 62.4.2.
In particular, solve for the parameters in
𝑌𝑡+1 = 𝑚0 + 𝑚1 𝑌𝑡
Compare your results with Exercise 62.4.2 – comment.

The monopolist’s LQ problem is almost identical to the planner’s problem from the previous exercise, except that
$$R = \begin{bmatrix} a_1 & -a_0/2 \\ -a_0/2 & 0 \end{bmatrix}$$
The problem can be solved as follows
A = np.array([[1, 0], [0, 1]])

B = np.array([[1], [0]])
R = np.array([[a1, -a0 / 2], [-a0 / 2, 0]])
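Q = γ / 2
# Solve for the monopolist's optimal policy
lq = LQ(Q, R, A, B, beta=β)
P, F, d = lq.stationary_values()
F = F.flatten()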
m0, m1 = -F[1], 1 - F[0]
print(m0, m1)

73.47294403502818 0.9265270559649701
We see that the law of motion for the monopolist is approximately 𝑌𝑡+1 = 73.4729 + 0.9265𝑌𝑡 .
In the rational expectations case, the law of motion was approximately 𝑌𝑡+1 = 95.0818 + 0.9525𝑌𝑡 .
One way to compare these two laws of motion is by their fixed points, which give long-run equilibrium output in each
case.
For laws of the form 𝑌𝑡+1 = 𝑐0 + 𝑐1 𝑌𝑡 , the fixed point is 𝑐0 /(1 − 𝑐1 ).
If you crunch the numbers, you will see that the monopolist adopts a lower long-run quantity than obtained by the
competitive market, implying a higher market price.
This is analogous to the elementary static-case results.
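Crunching the numbers takes only a few lines. The sketch below applies the fixed-point formula 𝑐0 /(1 − 𝑐1 ) to the two laws of motion reported above and converts each long-run quantity into a price using the inverse demand curve (62.5). The helper name long_run is introduced here for illustration.

```python
a0, a1 = 100, 0.05

def long_run(c0, c1):
    "Fixed point of Y_{t+1} = c0 + c1 Y_t and the implied price from (62.5)."
    Y_bar = c0 / (1 - c1)
    return Y_bar, a0 - a1 * Y_bar

# Coefficients reported by the LQ solutions in Exercises 62.4.4 and 62.4.3
Y_mono, p_mono = long_run(73.47294403502818, 0.9265270559649701)
Y_comp, p_comp = long_run(95.08187459215002, 0.9524590627039248)

print(f"monopoly:    Y = {Y_mono:.2f}, p = {p_mono:.2f}")
print(f"competitive: Y = {Y_comp:.2f}, p = {p_comp:.2f}")
```

With these parameter values the monopolist's long-run output is about half the competitive level.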

CHAPTER
SIXTYTHREE
STABILITY IN LINEAR RATIONAL EXPECTATIONS MODELS
Contents
• Stability in Linear Rational Expectations Models

– Overview
– Linear Difference Equations
– Illustration: Cagan’s Model
– Some Python Code
– Alternative Code
– Another Perspective
– Log money Supply Feeds Back on Log Price Level
– Big 𝑃 , Little 𝑝 Interpretation
– Fun with SymPy
In addition to what’s in Anaconda, this lecture deploys the following libraries:

import matplotlib.pyplot as plt
import numpy as np
import quantecon as qe
from sympy import init_printing, symbols, Matrix
init_printing()
63.1 Overview
This lecture studies stability in the context of an elementary rational expectations model.
We study a rational expectations version of Philip Cagan’s model [Cagan, 1956] linking the price level to the money
supply.
Cagan did not use a rational expectations version of his model, but Sargent [Sargent, 1977] did.
We study a rational expectations version of this model because it is intrinsically interesting and because it has a
mathematical structure that appears in virtually all linear rational expectations models, namely, that a key endogenous variable
equals a mathematical expectation of a geometric sum of future values of another variable.
The model determines the price level or rate of inflation as a function of the money supply or the rate of change in the
money supply.
In this lecture, we’ll encounter:
• a convenient formula for the expectation of a geometric sum of future values of a variable
• a way of solving an expectational difference equation by mapping it into a vector first-order difference equation and
appropriately manipulating an eigen decomposition of the transition matrix in order to impose stability
• a way to use a Big 𝐾, little 𝑘 argument to allow apparent feedback from endogenous to exogenous variables within
a rational expectations equilibrium
• a use of eigenvector decompositions of matrices that allowed Blanchard and Kahn [Blanchard and Kahn,
1980] and Whiteman [Whiteman, 1983] to solve a class of linear rational expectations models
• how to use SymPy to get analytical formulas for some key objects comprising a rational expectations equilibrium
Matrix decompositions employed here are described in more depth in the lecture on Lagrangian formulations.
We formulate a version of Cagan’s model under rational expectations as an expectational difference equation whose
solution is a rational expectations equilibrium.
We’ll start this lecture with a quick review of deterministic (i.e., non-random) first-order and second-order linear difference
equations.
63.2 Linear Difference Equations
We’ll use the backward shift or lag operator 𝐿.

The lag operator $L$ maps a sequence $\{x_t\}_{t=0}^{\infty}$ into the sequence $\{x_{t-1}\}_{t=0}^{\infty}$.
We’ll deploy 𝐿 by using the equality 𝐿𝑥𝑡 ≡ 𝑥𝑡−1 in algebraic expressions.

Further, the inverse $L^{-1}$ of the lag operator is the forward shift operator.
We’ll often use the equality $L^{-1} x_t \equiv x_{t+1}$ below.
The algebra of lag and forward shift operators can simplify representing and solving linear difference equations.
63.2.1 First Order
We want to solve a linear first-order scalar difference equation.

Let $|\lambda| < 1$ and let $\{u_t\}_{t=-\infty}^{\infty}$ be a bounded sequence of scalar real numbers.
Let $L$ be the lag operator defined by $L x_t \equiv x_{t-1}$ and let $L^{-1}$ be the forward shift operator defined by $L^{-1} x_t \equiv x_{t+1}$.
Then
(1 − 𝜆𝐿)𝑦𝑡 = 𝑢𝑡 , ∀𝑡 (63.1)
has solutions
$$y_t = (1 - \lambda L)^{-1} u_t + k \lambda^t \tag{63.2}$$
or
$$y_t = \sum_{j=0}^{\infty} \lambda^j u_{t-j} + k \lambda^t$$
for any real number 𝑘.

You can verify this fact by applying $(1 - \lambda L)$ to both sides of equation (63.2) and noting that $(1 - \lambda L) \lambda^t = 0$.
To pin down 𝑘 we need one condition imposed from outside (e.g., an initial or terminal condition) on the path of 𝑦.
Now let |𝜆| > 1.
Rewrite equation (63.1) as
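$$y_{t-1} = \lambda^{-1} y_t - \lambda^{-1} u_t, \quad \forall t \tag{63.3}$$
or
$$(1 - \lambda^{-1} L^{-1}) y_t = -\lambda^{-1} u_{t+1} . \tag{63.4}$$
A solution is
$$y_t = -\lambda^{-1} \left( \frac{1}{1 - \lambda^{-1} L^{-1}} \right) u_{t+1} + k \lambda^t \tag{63.5}$$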
for any 𝑘.
To verify that this is a solution, check the consequences of operating on both sides of equation (63.5) by (1 − 𝜆𝐿) and
compare to equation (63.1).
For any bounded {𝑢𝑡 } sequence, solution (63.2) exists for |𝜆| < 1 because the distributed lag in 𝑢 converges.
Solution (63.5) exists when |𝜆| > 1 because the distributed lead in 𝑢 converges.
When |𝜆| > 1, the distributed lag in 𝑢 in (63.2) may diverge, in which case a solution of this form does not exist.
The distributed lead in 𝑢 in (63.5) need not converge when |𝜆| < 1.
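Both solution formulas can be checked numerically by truncating the infinite sums. The sketch below builds the backward solution (63.2) for |𝜆| < 1 and the forward solution (63.5) for |𝜆| > 1 and measures how well each satisfies (63.1). The forcing sequence, the values of 𝜆 and 𝑘, and the truncation point are all hypothetical choices made for illustration.

```python
import numpy as np

J = 400  # truncation point for the infinite sums (assumption)

def u(t):
    "A bounded forcing sequence (hypothetical choice)."
    return np.cos(0.3 * t)

def y_backward(t, λ, k):
    "Solution (63.2): distributed lag in u, appropriate when |λ| < 1."
    j = np.arange(J)
    return np.sum(λ**j * u(t - j)) + k * λ**t

def y_forward(t, λ, k):
    "Solution (63.5): distributed lead in u, appropriate when |λ| > 1."
    j = np.arange(J)
    return -np.sum(λ**(-j - 1.0) * u(t + 1 + j)) + k * λ**t

def max_residual(y, λ, k, periods=range(1, 20)):
    "Largest |(1 - λL) y_t - u_t| over the chosen periods."
    return max(abs(y(t, λ, k) - λ * y(t - 1, λ, k) - u(t)) for t in periods)

res_backward = max_residual(y_backward, 0.8, 2.5)
res_forward = max_residual(y_forward, 1.25, 2.5)
print(res_backward, res_forward)
```

Both residuals should be at the level of floating-point rounding error, for any choice of 𝑘, which illustrates why an extra condition is needed to pin 𝑘 down.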
63.2.2 Second Order
Now consider the second order difference equation
$$(1 - \lambda_1 L)(1 - \lambda_2 L) y_{t+1} = u_t \tag{63.6}$$
where $\{u_t\}$ is a bounded sequence, $y_0$ is an initial condition, $|\lambda_1| < 1$ and $|\lambda_2| > 1$.
We seek a bounded sequence $\{y_t\}_{t=0}^{\infty}$ that satisfies (63.6).
Using insights from our analysis of the first-order equation, operate on both sides of (63.6) by the forward inverse of $(1 - \lambda_2 L)$ to rewrite equation (63.6) as
$$(1 - \lambda_1 L) y_{t+1} = -\frac{\lambda_2^{-1}}{1 - \lambda_2^{-1} L^{-1}} u_{t+1}$$
or
$$y_{t+1} = \lambda_1 y_t - \lambda_2^{-1} \sum_{j=0}^{\infty} \lambda_2^{-j} u_{t+j+1} . \tag{63.7}$$
Thus, we obtained equation (63.7) by solving a stable root (in this case 𝜆1 ) backward, and an unstable root (in this case
𝜆2 ) forward.
Equation (63.7) has a form that we shall encounter often.
• $\lambda_1 y_t$ is called the feedback part
• $-\frac{\lambda_2^{-1}}{1 - \lambda_2^{-1} L^{-1}} u_{t+1}$ is called the feedforward part
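To see the feedback and feedforward parts at work, consider the special case of a constant forcing sequence, for which the infinite sum in (63.7) has a closed form. The sketch below simulates (63.7) and checks that the resulting path satisfies the original second-order equation (63.6). The roots, the constant forcing value and the initial condition are hypothetical.

```python
import numpy as np

λ1, λ2 = 0.5, 1.5   # hypothetical stable and unstable roots
u_bar = 2.0         # constant forcing sequence u_t = u_bar (assumption)
y0, T = 10.0, 30

# Feedforward part of (63.7) when u is constant:
#   -λ2^{-1} Σ_j λ2^{-j} u_bar = -u_bar / (λ2 - 1)
feedforward = -u_bar / (λ2 - 1)

y = np.empty(T + 1)
y[0] = y0
for t in range(T):
    y[t + 1] = λ1 * y[t] + feedforward   # feedback part + feedforward part

# Residual of (63.6): y_{t+1} - (λ1 + λ2) y_t + λ1 λ2 y_{t-1} - u_bar
residual = y[2:] - (λ1 + λ2) * y[1:-1] + λ1 * λ2 * y[:-2] - u_bar

print(np.max(np.abs(residual)), y[-1])
```

The simulated path stays bounded and settles at feedforward / (1 − 𝜆1 ), even though one root of (63.6) lies outside the unit circle.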
63.3 Illustration: Cagan’s Model
Now let’s use linear difference equations to represent and solve Sargent’s [Sargent, 1977] rational expectations version of
Cagan’s model [Cagan, 1956] that connects the price level to the public’s anticipations of future money supplies.
Cagan did not use a rational expectations version of his model, but Sargent [Sargent, 1977] did.
Let
• $m_t^d$ be the log of the demand for money
• 𝑚𝑡 be the log of the supply of money
• 𝑝𝑡 be the log of the price level
It follows that 𝑝𝑡+1 − 𝑝𝑡 is the rate of inflation.
The logarithm of the demand for real money balances $m_t^d - p_t$ is an inverse function of the expected rate of inflation
$p_{t+1} - p_t$ for $t \geq 0$:
$$m_t^d - p_t = -\beta (p_{t+1} - p_t), \quad \beta > 0$$
Equate the demand for log money $m_t^d$ to the supply of log money $m_t$ in the above equation and rearrange to deduce that
the logarithm of the price level $p_t$ is related to the logarithm of the money supply $m_t$ by
𝑝𝑡 = (1 − 𝜆)𝑚𝑡 + 𝜆𝑝𝑡+1 (63.8)

where $\lambda \equiv \frac{\beta}{1 + \beta} \in (0, 1)$.
(We note that the characteristic polynomial is $1 - \lambda^{-1} z^{-1} = 0$ so that the zero of the characteristic polynomial in this
case is $\lambda \in (0, 1)$ which here is inside the unit circle.)
Solving the first order difference equation (63.8) forward gives
$$p_t = (1 - \lambda) \sum_{j=0}^{\infty} \lambda^j m_{t+j} , \tag{63.9}$$

which is the unique stable solution of difference equation (63.8) among a class of more general solutions
$$p_t = (1 - \lambda) \sum_{j=0}^{\infty} \lambda^j m_{t+j} + c \lambda^{-t} \tag{63.10}$$
that is indexed by the real number $c \in \mathbb{R}$.

Because we want to focus on stable solutions, we set 𝑐 = 0.
Equation (63.10) attributes perfect foresight about the money supply sequence to the holders of real balances.
We begin by assuming that the log of the money supply is exogenous in the sense that it is an autonomous process that
does not feed back on the log of the price level.
In particular, we assume that the log of the money supply is described by the linear state space system
$$\begin{aligned} m_t &= G x_t \\ x_{t+1} &= A x_t \end{aligned} \tag{63.11}$$
where $x_t$ is an $n \times 1$ vector that does not include $p_t$ or lags of $p_t$, $A$ is an $n \times n$ matrix with eigenvalues that are less than
$\lambda^{-1}$ in absolute value, and $G$ is a $1 \times n$ selector matrix.
Variables appearing in the vector 𝑥𝑡 contain information that might help predict future values of the money supply.
We’ll start with an example in which 𝑥𝑡 includes only 𝑚𝑡 , possibly lagged values of 𝑚, and a constant.
An example of such an $\{m_t\}$ process that fits into state space system (63.11) is one that satisfies the second order linear
difference equation
$$m_{t+1} = \alpha + \rho_1 m_t + \rho_2 m_{t-1}$$
where the zeros of the characteristic polynomial $(1 - \rho_1 z - \rho_2 z^2)$ are strictly greater than 1 in modulus.
(Please see this QuantEcon lecture for more about characteristic polynomials and their role in solving linear difference
equations.)
We seek a stable or non-explosive solution of the difference equation (63.8) that obeys the system comprised of (63.8)-
(63.11).
By stable or non-explosive, we mean that neither 𝑚𝑡 nor 𝑝𝑡 diverges as 𝑡 → +∞.
This requires that we shut down the term $c \lambda^{-t}$ in equation (63.10) above by setting $c = 0$.
The solution we are after is
𝑝𝑡 = 𝐹 𝑥𝑡 (63.12)
where
$$F = (1 - \lambda) G (I - \lambda A)^{-1} \tag{63.13}$$
Note: As mentioned above, an explosive solution of difference equation (63.8) can be constructed by adding to the right-hand
side of (63.12) a sequence $c \lambda^{-t}$ where $c$ is an arbitrary positive constant.
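Formula (63.13) is the closed form of the geometric sum in (63.9) once 𝑚𝑡+𝑗 = 𝐺𝐴^𝑗 𝑥𝑡 is substituted in. The sketch below compares the closed form with a truncated version of that sum, using the parameter values that appear in the Python code of this lecture. The truncation length is an arbitrary assumption.

```python
import numpy as np

# Parameter values used in the lecture's example
λ = 0.9
α, ρ1, ρ2 = 0.0, 0.9, 0.05
A = np.array([[1, 0, 0], [α, ρ1, ρ2], [0, 1, 0]])
G = np.array([[0, 1, 0]])

# Closed form (63.13)
F = (1 - λ) * G @ np.linalg.inv(np.eye(3) - λ * A)

# Truncated version of (63.9): (1 - λ) Σ_j λ^j G A^j
F_sum = np.zeros((1, 3))
term = G.astype(float)          # G (λA)^0
for j in range(2000):
    F_sum += (1 - λ) * term
    term = λ * term @ A         # G (λA)^(j+1)

print(F)
print(F_sum)
```

The sum converges because every eigenvalue of 𝜆𝐴 is smaller than one in modulus, which is the role played by the eigenvalue restriction stated below (63.11).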
63.4 Some Python Code
We’ll construct examples that illustrate (63.11).

Our first example takes as the law of motion for the log money supply the second order difference equation
𝑚𝑡+1 = 𝛼 + 𝜌1 𝑚𝑡 + 𝜌2 𝑚𝑡−1 (63.14)
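that is parameterized by $\rho_1, \rho_2, \alpha$.
To capture this parameterization with system (63.11) we set
$$x_t = \begin{bmatrix} 1 \\ m_t \\ m_{t-1} \end{bmatrix}, \quad A = \begin{bmatrix} 1 & 0 & 0 \\ \alpha & \rho_1 & \rho_2 \\ 0 & 1 & 0 \end{bmatrix}, \quad G = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}$$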
Here is Python code
λ = .9
α = 0
ρ1 = .9
ρ2 = .05
A = np.array([[1, 0, 0],
[α, ρ1, ρ2],
[0, 1, 0]])
G = np.array([[0, 1, 0]])
The matrix 𝐴 has one eigenvalue equal to unity.

It is associated with the 𝐴11 component that captures a constant component of the state 𝑥𝑡 .
We can verify that the two eigenvalues of 𝐴 not associated with the constant in the state 𝑥𝑡 are strictly less than unity in
modulus.
eigvals = np.linalg.eigvals(A)
print(eigvals)
[-0.05249378 0.95249378 1. ]
(abs(eigvals) <= 1).all()
True
Now let’s compute 𝐹 in formulas (63.12) and (63.13).
# compute the solution, i.e. formula (63.13)

F = (1 - λ) * G @ np.linalg.inv(np.eye(A.shape[0]) - λ * A)
print("F= ",F)
F= [[0. 0.66889632 0.03010033]]
Now let’s simulate paths of 𝑚𝑡 and 𝑝𝑡 starting from an initial value 𝑥0 .

# set the initial state
x0 = np.array([1, 1, 0])
T = 100 # length of simulation
m_seq = np.empty(T+1)
p_seq = np.empty(T+1)
# G @ x and F @ x are length-1 arrays, so take element 0 to store a scalar
m_seq[0] = (G @ x0)[0]
p_seq[0] = (F @ x0)[0]
# simulate for T periods
x_old = x0
for t in range(T):
    x = A @ x_old
    m_seq[t+1] = (G @ x)[0]
    p_seq[t+1] = (F @ x)[0]
    x_old = x
plt.figure()
plt.plot(range(T+1), m_seq, label='$m_t$')
plt.plot(range(T+1), p_seq, label='$p_t$')
plt.xlabel('t')
plt.title(f'λ={λ}, α={α}, $ρ_1$={ρ1}, $ρ_2$={ρ2}')
plt.legend()
plt.show()
In the above graph, why is the log of the price level always less than the log of the money supply?
Because
• according to equation (63.9), 𝑝𝑡 is a geometric weighted average of current and future values of 𝑚𝑡 , and
• it happens that in this example future 𝑚’s are always less than the current 𝑚
63.5 Alternative Code
We could also have run the simulation using the quantecon LinearStateSpace code.
The following code block performs the calculation with that code.
# construct a LinearStateSpace instance
# stack G and F
G_ext = np.vstack([G, F])
C = np.zeros((A.shape[0], 1))
ss = qe.LinearStateSpace(A, C, G_ext, mu_0=x0)
T = 100
# simulate using LinearStateSpace

x, y = ss.simulate(ts_length=T)
# plot
plt.figure()
plt.plot(range(T), y[0,:], label='$m_t$')
plt.plot(range(T), y[1,:], label='$p_t$')
plt.xlabel('t')
plt.title(f'λ={λ}, α={α}, $ρ_1$={ρ1}, $ρ_2$={ρ2}')


plt.legend()
plt.show()
63.5.1 Special Case
To simplify our presentation in ways that will let us focus on an important idea, in the above second-order difference equation
(63.14) that governs 𝑚𝑡 , we now set 𝛼 = 0, 𝜌1 = 𝜌 ∈ (−1, 1), and 𝜌2 = 0 so that the law of motion for 𝑚𝑡 becomes
𝑚𝑡+1 = 𝜌𝑚𝑡 (63.15)
and the state 𝑥𝑡 becomes
𝑥𝑡 = 𝑚 𝑡 .
Consequently, we can set $G = 1$, $A = \rho$, making our formula (63.13) for $F$ become
$$F = (1 - \lambda)(1 - \lambda \rho)^{-1} ,$$
so that the log price level satisfies
𝑝𝑡 = 𝐹 𝑚𝑡 .
Please keep these formulas in mind as we investigate an alternative route to and interpretation of our formula for 𝐹 .
63.6 Another Perspective
Above, we imposed stability or non-explosiveness on the solution of the key difference equation (63.8) in Cagan’s model
by solving the unstable root of the characteristic polynomial forward.
To shed light on the mechanics involved in imposing stability on a solution of a potentially unstable system of linear
difference equations and to prepare the way for generalizations of our model in which the money supply is allowed to feed
back on the price level itself, we stack equations (63.8) and (63.15) to form the system
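$$\begin{bmatrix} m_{t+1} \\ p_{t+1} \end{bmatrix} = \begin{bmatrix} \rho & 0 \\ -(1-\lambda)/\lambda & \lambda^{-1} \end{bmatrix} \begin{bmatrix} m_t \\ p_t \end{bmatrix} \tag{63.16}$$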
or
𝑦𝑡+1 = 𝐻𝑦𝑡 , 𝑡≥0 (63.17)
where
$$H = \begin{bmatrix} \rho & 0 \\ -(1-\lambda)/\lambda & \lambda^{-1} \end{bmatrix}. \tag{63.18}$$
Transition matrix 𝐻 has eigenvalues 𝜌 ∈ (0, 1) and 𝜆−1 > 1.

Because an eigenvalue of $H$ exceeds unity, if we iterate on equation (63.17) starting from an arbitrary initial vector $y_0 = \begin{bmatrix} m_0 \\ p_0 \end{bmatrix}$ with $m_0 > 0$, $p_0 > 0$, we discover that in general absolute values of both components of $y_t$ diverge toward $+\infty$ as $t \to +\infty$.
To substantiate this claim, we can use the eigenvector matrix decomposition of 𝐻 that is available to us because the
eigenvalues of 𝐻 are distinct
𝐻 = 𝑄Λ𝑄−1 .
Here Λ is a diagonal matrix of eigenvalues of 𝐻 and 𝑄 is a matrix whose columns are eigenvectors associated with the
corresponding eigenvalues.
Note that
𝐻 𝑡 = 𝑄Λ𝑡 𝑄−1
so that
𝑦𝑡 = 𝑄Λ𝑡 𝑄−1 𝑦0
For almost all initial vectors 𝑦0 , the presence of the eigenvalue 𝜆−1 > 1 causes both components of 𝑦𝑡 to diverge in
absolute value to +∞.
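The decomposition can be checked numerically. The following sketch builds 𝐻 for illustrative parameter values (an assumption), confirms that 𝐻^𝑡 computed directly agrees with 𝑄Λ^𝑡 𝑄−1 , and iterates on (63.17) from an arbitrary initial vector. It tracks the 𝑝𝑡 component, whose absolute value grows at the rate 𝜆−1 ; in this particular 𝐻 the first row does not involve 𝑝𝑡 , so 𝑚𝑡 = 𝜌^𝑡 𝑚0 whatever 𝑝0 is.

```python
import numpy as np

ρ, λ = 0.9, 0.5                        # illustrative values (assumed)
H = np.array([[ρ, 0.0],
              [-(1 - λ) / λ, 1 / λ]])

# eigenvector decomposition H = Q Λ Q^{-1}
eigvals, Q = np.linalg.eig(H)

t = 10
H_t_direct = np.linalg.matrix_power(H, t)
H_t_eig = Q @ np.diag(eigvals ** t) @ np.linalg.inv(Q)
print(np.allclose(H_t_direct, H_t_eig))

# iterate y_{t+1} = H y_t from an arbitrary initial vector (assumed)
y = np.array([1.0, 1.0])
for _ in range(30):
    y = H @ y

# second component is p_t; first component is m_t = ρ^t m_0
print(y)
```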
To explore this outcome in more detail, we can use the following transformation
𝑦𝑡∗ = 𝑄−1 𝑦𝑡
that allows us to represent the dynamics in a way that isolates the source of the propensity of paths to diverge:
$$y^*_{t+1} = \Lambda y^*_t$$
Staring at this equation indicates that unless
$$y^*_0 = \begin{bmatrix} y^*_{1,0} \\ 0 \end{bmatrix} \tag{63.19}$$
the path of $y^*_t$, and therefore the paths of both components of $y_t = Q y^*_t$, will diverge in absolute value as $t \to +\infty$. (We say that the paths explode.)
Equation (63.19) also leads us to conclude that there is a unique setting for the initial vector 𝑦0 for which both components
of 𝑦𝑡 do not diverge.
The required setting of $y_0$ must evidently have the property that
$$Q^{-1} y_0 = y^*_0 = \begin{bmatrix} y^*_{1,0} \\ 0 \end{bmatrix}.$$
But note that since $y_0 = \begin{bmatrix} m_0 \\ p_0 \end{bmatrix}$ and $m_0$ is given to us as an initial condition, $p_0$ has to do all the adjusting to satisfy this equation.
Sometimes this situation is described by saying that while 𝑚0 is truly a state variable, 𝑝0 is a jump variable that must
adjust at 𝑡 = 0 in order to satisfy the equation.
Thus, in a nutshell the unique value of the vector 𝑦0 for which the paths of 𝑦𝑡 do not diverge must have second component
𝑝0 that verifies equality (63.19) by setting the second component of 𝑦0∗ equal to zero.
The component $p_0$ of the initial vector $y_0 = \begin{bmatrix} m_0 \\ p_0 \end{bmatrix}$ must evidently satisfy
$$Q^{\{2\}} y_0 = 0$$
where $Q^{\{2\}}$ denotes the second row of $Q^{-1}$, a restriction that is equivalent to
$$Q^{21} m_0 + Q^{22} p_0 = 0 \tag{63.20}$$
where $Q^{ij}$ denotes the $(i, j)$ component of $Q^{-1}$.

Solving this equation for $p_0$, we find
$$p_0 = -(Q^{22})^{-1} Q^{21} m_0. \tag{63.21}$$
This is the unique stabilizing value of 𝑝0 expressed as a function of 𝑚0 .
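A minimal sketch of this calculation follows. It uses illustrative parameter values (an assumption), reads 𝑝0 off the second row of the inverse of the eigenvector matrix as in (63.21), and then iterates on (63.17) to see that the resulting path stays bounded over a short horizon. The horizon is kept short because rounding errors are themselves amplified by the unstable eigenvalue.

```python
import numpy as np

ρ, λ = 0.9, 0.5                         # illustrative values (assumed)
H = np.array([[ρ, 0.0],
              [-(1 - λ) / λ, 1 / λ]])

eigvals, Q = np.linalg.eig(H)

# order the columns so the stable eigenvalue comes first
order = np.argsort(np.abs(eigvals))
eigvals, Q = eigvals[order], Q[:, order]

# the second row of Q^{-1} is associated with the unstable eigenvalue
Q_inv = np.linalg.inv(Q)
m0 = 1.0
p0 = -Q_inv[1, 0] / Q_inv[1, 1] * m0

# iterate y_{t+1} = H y_t from the stabilizing initial vector
y = np.array([m0, p0])
path = [y]
for _ in range(20):
    y = H @ y
    path.append(y)
path = np.array(path)

print(p0)
print(path[-1])
```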
63.6.1 Refining the Formula
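We can get an even more convenient formula for $p_0$ that is cast in terms of components of $Q$ instead of components of $Q^{-1}$.
Here superscripts index components of $Q^{-1}$ and subscripts index components of $Q$.
To get this formula, first note that because $(Q^{21} \ Q^{22})$ is the second row of the inverse of $Q$ and because $Q^{-1} Q = I$, it follows that
$$\begin{bmatrix} Q^{21} & Q^{22} \end{bmatrix} \begin{bmatrix} Q_{11} \\ Q_{21} \end{bmatrix} = 0$$
which implies that
$$Q^{21} Q_{11} + Q^{22} Q_{21} = 0.$$
Therefore,
$$-(Q^{22})^{-1} Q^{21} = Q_{21} Q_{11}^{-1}.$$
So we can write
$$p_0 = Q_{21} Q_{11}^{-1} m_0. \tag{63.22}$$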

It can be verified that this formula replicates itself over time in the sense that
$$p_t = Q_{21} Q_{11}^{-1} m_t. \tag{63.23}$$
To implement formula (63.23), we want to compute $Q_1$, the eigenvector of $H$ associated with the stable eigenvalue $\rho$ of $H$.
By hand it can be verified that the eigenvector associated with the stable eigenvalue $\rho$ is proportional to
$$Q_1 = \begin{bmatrix} 1 - \lambda\rho \\ 1 - \lambda \end{bmatrix}.$$
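Notice that if we set $A = \rho$ and $G = 1$ in our earlier formula for $p_t$ we get
$$p_t = (1 - \lambda) G (I - \lambda A)^{-1} m_t = (1 - \lambda)(1 - \lambda\rho)^{-1} m_t,$$
a formula that is equivalent to
$$p_t = Q_{21} Q_{11}^{-1} m_t,$$
where
$$Q_1 = \begin{bmatrix} Q_{11} \\ Q_{21} \end{bmatrix}.$$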
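The hand calculation can be confirmed numerically. The sketch below, with illustrative parameter values (an assumption), checks that the vector with entries 1 − 𝜆𝜌 and 1 − 𝜆 is an eigenvector of 𝐻 for the eigenvalue 𝜌, and that the ratio of its second to its first entry equals (1 − 𝜆)(1 − 𝜆𝜌)−1 .

```python
import numpy as np

ρ, λ = 0.9, 0.5                         # illustrative values (assumed)
H = np.array([[ρ, 0.0],
              [-(1 - λ) / λ, 1 / λ]])

# candidate eigenvector for the stable eigenvalue ρ
Q1 = np.array([1 - λ * ρ, 1 - λ])

# H Q1 should equal ρ Q1
is_eigenvector = np.allclose(H @ Q1, ρ * Q1)

# the two expressions for the coefficient on m_t
F_eig = Q1[1] / Q1[0]
F_formula = (1 - λ) / (1 - λ * ρ)

print(is_eigenvector, F_eig, F_formula)
```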
63.6.2 Remarks about Feedback
We have expressed (63.16) in what superficially appears to be a form in which 𝑦𝑡+1 feeds back on 𝑦𝑡 , even though what we
actually want to represent is that the component 𝑝𝑡 feeds forward on 𝑝𝑡+1 , and through it, on future 𝑚𝑡+𝑗 , 𝑗 = 0, 1, 2, ….
A tell-tale sign that we should look beyond its superficial “feedback” form is that 𝜆−1 > 1 so that the matrix 𝐻 in (63.16)
is unstable
• it has one eigenvalue 𝜌 that is less than one in modulus that does not imperil stability, but …
• it has a second eigenvalue 𝜆−1 that exceeds one in modulus and that makes 𝐻 an unstable matrix
We’ll keep these observations in mind as we turn now to a case in which the log money supply actually does feed back on
the log of the price level.
63.7 Log money Supply Feeds Back on Log Price Level
An arrangement of eigenvalues that split around unity, with one being below unity and another being greater than unity,
sometimes prevails when there is feedback from the log price level to the log money supply.
Let the feedback rule be
𝑚𝑡+1 = 𝜌𝑚𝑡 + 𝛿𝑝𝑡 (63.24)
where 𝜌 ∈ (0, 1) and where we shall now allow 𝛿 ≠ 0.

Warning: If things are to fit together as we wish to deliver a stable system for some initial value 𝑝0 that we want to
determine uniquely, 𝛿 cannot be too large.
The forward-looking equation (63.8) continues to describe equality between the demand and supply of money.
We assume that equations (63.8) and (63.24) govern $y_t \equiv \begin{bmatrix} m_t \\ p_t \end{bmatrix}$ for $t \geq 0$.

The transition matrix 𝐻 in the law of motion
𝑦𝑡+1 = 𝐻𝑦𝑡
now becomes
$$H = \begin{bmatrix} \rho & \delta \\ -(1-\lambda)/\lambda & \lambda^{-1} \end{bmatrix}.$$
We take 𝑚0 as a given initial condition and as before seek an initial value 𝑝0 that stabilizes the system in the sense that
𝑦𝑡 converges as 𝑡 → +∞.
Our approach is identical with the one followed above and is based on an eigenvalue decomposition in which, cross our
fingers, one eigenvalue exceeds unity and the other is less than unity in absolute value.
When $\delta \neq 0$ as we now assume, the eigenvalues of $H$ will no longer be $\rho \in (0, 1)$ and $\lambda^{-1} > 1$.
We’ll just calculate them and apply the same algorithm that we used above.
That algorithm remains valid so long as the eigenvalues split around unity as before.
Again we assume that $m_0$ is a given initial condition, while $p_0$ is not given and must be solved for.
Let’s write and execute some Python code that will let us explore how outcomes depend on 𝛿.
def construct_H(ρ, λ, δ):
    "construct matrix H given parameters."
    H = np.empty((2, 2))
    H[0, :] = ρ, δ
    H[1, :] = - (1 - λ) / λ, 1 / λ
    return H

def H_eigvals(ρ=.9, λ=.5, δ=0):
    "compute the eigenvalues of matrix H given parameters."
    # construct H matrix
    H = construct_H(ρ, λ, δ)
    # compute eigenvalues
    eigvals = np.linalg.eigvals(H)
    return eigvals
H_eigvals()
array([2. , 0.9])
Notice that a negative 𝛿 will not imperil the stability of the matrix 𝐻, even if it has a big absolute value.
# small negative δ
H_eigvals(δ=-0.05)
array([0.8562829, 2.0437171])
# large negative δ
H_eigvals(δ=-1.5)
array([0.10742784, 2.79257216])
A sufficiently small positive 𝛿 also causes no problem.
# sufficiently small positive δ

H_eigvals(δ=0.05)
array([0.94750622, 1.95249378])
But a large enough positive 𝛿 makes both eigenvalues of 𝐻 strictly greater than unity in modulus.
For example,
H_eigvals(δ=0.2)
array([1.12984379, 1.77015621])
We want to study systems in which one eigenvalue exceeds unity in modulus while the other is less than unity in modulus,
so we avoid values of 𝛿 that are too large.
That is, we want to avoid too much positive feedback from 𝑝𝑡 to 𝑚𝑡+1 .
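To see roughly where "too large" begins, one can scan a grid of 𝛿 values and record the modulus of the smaller eigenvalue. The sketch below does this for the default parameters used above; the helper name `smaller_eig_modulus` and the grid are illustrative assumptions. A direct calculation gives det(𝐻 − 𝐼) = ((1 − 𝜆)/𝜆)(𝜌 − 1 + 𝛿), which suggests that an eigenvalue crosses unity at 𝛿 = 1 − 𝜌, and the scan can be compared with that value.

```python
import numpy as np

def smaller_eig_modulus(δ, ρ=0.9, λ=0.5):
    "Modulus of the smaller eigenvalue of H (hypothetical helper)."
    H = np.array([[ρ, δ],
                  [-(1 - λ) / λ, 1 / λ]])
    return np.min(np.abs(np.linalg.eigvals(H)))

# grid of feedback parameters (illustrative)
δ_grid = np.linspace(-0.05, 0.2, 251)
moduli = np.array([smaller_eig_modulus(δ) for δ in δ_grid])

# values of δ for which the eigenvalues still split around unity
split = moduli < 1
δ_max = δ_grid[split].max()

print(δ_max)
```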
def magic_p0(m0, ρ=.9, λ=.5, δ=0):
    """
    Use the magic formula (63.22) to compute the level of p0
    that makes the system stable.
    """
    # construct H matrix
    H = construct_H(ρ, λ, δ)
    eigvals, Q = np.linalg.eig(H)
    # find the index of the smaller eigenvalue
    ind = 0 if eigvals[0] < eigvals[1] else 1
    # verify that the eigenvalue is less than unity
    if eigvals[ind] > 1:
        print("both eigenvalues exceed unity in modulus")
        return None
    p0 = Q[1, ind] / Q[0, ind] * m0
    return p0
Let’s plot how the solution 𝑝0 changes as 𝑚0 changes for different settings of 𝛿.
m_range = np.arange(0.1, 2., 0.1)
for δ in [-0.05, 0, 0.05]:
    plt.plot(m_range, [magic_p0(m0, δ=δ) for m0 in m_range], label=f"δ={δ}")
plt.legend()
plt.xlabel("$m_0$")
plt.ylabel("$p_0$")
plt.show()
To look at things from a different angle, we can fix the initial value 𝑚0 and see how 𝑝0 changes as 𝛿 changes.
m0 = 1
δ_range = np.linspace(-0.05, 0.05, 100)

plt.plot(δ_range, [magic_p0(m0, δ=δ) for δ in δ_range])
plt.xlabel(r'$\delta$')
plt.ylabel('$p_0$')
plt.title(f'$m_0$={m0}')
plt.show()
Notice that when 𝛿 is large enough, both eigenvalues exceed unity in modulus, causing a stabilizing value of 𝑝0 not to
exist.
magic_p0(1, δ=0.2)
both eigenvalues exceed unity in modulus
63.8 Big 𝑃 , Little 𝑝 Interpretation
It is helpful to view our solutions of difference equations having feedback from the price level or inflation to money or the
rate of money creation in terms of the Big 𝐾, little 𝑘 idea discussed in Rational Expectations Models.
This will help us sort out what is taken as given by the decision makers who use the difference equation (63.9) to determine
𝑝𝑡 as a function of their forecasts of future values of 𝑚𝑡 .
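Let’s write the stabilizing solution that we have computed using the eigenvector decomposition of $H$ as $P_t = F^* m_t$, where
$$F^* = Q_{21} Q_{11}^{-1}.$$
Then from $P_{t+1} = F^* m_{t+1}$ and $m_{t+1} = \rho m_t + \delta P_t$ we can deduce the recursion $P_{t+1} = F^* \rho m_t + F^* \delta P_t$ and create the stacked system
$$\begin{bmatrix} m_{t+1} \\ P_{t+1} \end{bmatrix} = \begin{bmatrix} \rho & \delta \\ F^* \rho & F^* \delta \end{bmatrix} \begin{bmatrix} m_t \\ P_t \end{bmatrix}$$
or
$$x_{t+1} = A x_t$$
where $x_t = \begin{bmatrix} m_t \\ P_t \end{bmatrix}$.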

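Apply formula (63.13) for $F$ to deduce that
$$p_t = F \begin{bmatrix} m_t \\ P_t \end{bmatrix} = F \begin{bmatrix} m_t \\ F^* m_t \end{bmatrix}$$
which implies that
$$p_t = \begin{bmatrix} F_1 & F_2 \end{bmatrix} \begin{bmatrix} m_t \\ F^* m_t \end{bmatrix} = F_1 m_t + F_2 F^* m_t$$
so that we can anticipate that
$$F^* = F_1 + F_2 F^*$$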
We shall verify this equality in the next block of Python code that implements the following computations.
1. For the system with $\delta \neq 0$ so that there is feedback, we compute the stabilizing solution for $p_t$ in the form $p_t = F^* m_t$ where $F^* = Q_{21} Q_{11}^{-1}$ as above.
2. Recalling the system (63.11), (63.12), and (63.13) above, we define $x_t = \begin{bmatrix} m_t \\ P_t \end{bmatrix}$ and notice that it is Big $P_t$ and not little $p_t$ here. Then we form $A$ and $G$ as $A = \begin{bmatrix} \rho & \delta \\ F^* \rho & F^* \delta \end{bmatrix}$ and $G = \begin{bmatrix} 1 & 0 \end{bmatrix}$ and we compute $\begin{bmatrix} F_1 & F_2 \end{bmatrix} \equiv F$ from equation (63.13) above.
3. We compute $F_1 + F_2 F^*$ and compare it with $F^*$ and check for the anticipated equality.
# set parameters
ρ = .9
λ = .5
δ = .05
# solve for F_star
H = construct_H(ρ, λ, δ)
eigvals, Q = np.linalg.eig(H)
ind = 0 if eigvals[0] < eigvals[1] else 1
F_star = Q[1, ind] / Q[0, ind]
F_star
0.950124378879109
# solve for F_check

A = np.empty((2, 2))
A[0, :] = ρ, δ
A[1, :] = F_star * A[0, :]
G = np.array([1, 0])
F_check= (1 - λ) * G @ np.linalg.inv(np.eye(2) - λ * A)
F_check
array([0.92755597, 0.02375311])
Compare 𝐹 ∗ with 𝐹1 + 𝐹2 𝐹 ∗

F_check[0] + F_check[1] * F_star, F_star
(0.95012437887911, 0.950124378879109)
63.9 Fun with SymPy
This section is a gift for readers who have made it this far.
It puts SymPy to work on our model.
Thus, we use SymPy to compute some key objects comprising the eigenvector decomposition of 𝐻.
We start by generating an 𝐻 with nonzero 𝛿.
λ, δ, ρ = symbols('λ, δ, ρ')
H1 = Matrix([[ρ,δ], [- (1 - λ) / λ, λ ** -1]])
H1
$$\begin{bmatrix} \rho & \delta \\ \frac{\lambda - 1}{\lambda} & \frac{1}{\lambda} \end{bmatrix}$$
H1.eigenvals()
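$$\left\{ \frac{\lambda\rho + 1}{2\lambda} - \frac{\sqrt{4\delta\lambda^2 - 4\delta\lambda + \lambda^2\rho^2 - 2\lambda\rho + 1}}{2\lambda} : 1, \quad \frac{\lambda\rho + 1}{2\lambda} + \frac{\sqrt{4\delta\lambda^2 - 4\delta\lambda + \lambda^2\rho^2 - 2\lambda\rho + 1}}{2\lambda} : 1 \right\}$$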
H1.eigenvects()
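$$\left[ \left( \mu_-, \ 1, \ \left[ \begin{bmatrix} \frac{\lambda \mu_-}{\lambda - 1} - \frac{1}{\lambda - 1} \\ 1 \end{bmatrix} \right] \right), \ \left( \mu_+, \ 1, \ \left[ \begin{bmatrix} \frac{\lambda \mu_+}{\lambda - 1} - \frac{1}{\lambda - 1} \\ 1 \end{bmatrix} \right] \right) \right]$$
where $\mu_{\mp} = \frac{\lambda\rho + 1}{2\lambda} \mp \frac{\sqrt{4\delta\lambda^2 - 4\delta\lambda + \lambda^2\rho^2 - 2\lambda\rho + 1}}{2\lambda}$ are the two eigenvalues of $H$.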
Now let’s compute 𝐻 when 𝛿 is zero.
H2 = Matrix([[ρ,0], [- (1 - λ) / λ, λ ** -1]])
H2
$$\begin{bmatrix} \rho & 0 \\ \frac{\lambda - 1}{\lambda} & \frac{1}{\lambda} \end{bmatrix}$$

H2.eigenvals()
$$\left\{ \frac{1}{\lambda} : 1, \quad \rho : 1 \right\}$$
H2.eigenvects()
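$$\left[ \left( \frac{1}{\lambda}, \ 1, \ \left[ \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right] \right), \ \left( \rho, \ 1, \ \left[ \begin{bmatrix} \frac{\lambda\rho - 1}{\lambda - 1} \\ 1 \end{bmatrix} \right] \right) \right]$$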
Below we induce SymPy to do the following fun things for us analytically:

1. We compute the matrix $Q$ whose first column is the eigenvector associated with $\rho$ and whose second column is the eigenvector associated with $\lambda^{-1}$.
2. We use SymPy to compute the inverse $Q^{-1}$ of $Q$ (both in symbols).
3. We use SymPy to compute $Q_{21} Q_{11}^{-1}$ (in symbols).
4. Where $Q^{ij}$ denotes the $(i, j)$ component of $Q^{-1}$, we use SymPy to compute $-(Q^{22})^{-1} Q^{21}$ (again in symbols).
# construct Q
vec = []
for i, (eigval, _, eigvec) in enumerate(H2.eigenvects()):
    vec.append(eigvec[0])
    if eigval == ρ:
        ind = i
Q = vec[ind].col_insert(1, vec[1-ind])
Q
$$\begin{bmatrix} \frac{\lambda\rho - 1}{\lambda - 1} & 0 \\ 1 & 1 \end{bmatrix}$$
$Q^{-1}$
Q_inv = Q ** (-1)
Q_inv
$$\begin{bmatrix} \frac{\lambda - 1}{\lambda\rho - 1} & 0 \\ \frac{1 - \lambda}{\lambda\rho - 1} & 1 \end{bmatrix}$$
$Q_{21} Q_{11}^{-1}$
Q[1, 0] / Q[0, 0]
$$\frac{\lambda - 1}{\lambda\rho - 1}$$
$-(Q^{22})^{-1} Q^{21}$
- Q_inv[1, 0] / Q_inv[1, 1]
$$-\frac{1 - \lambda}{\lambda\rho - 1}$$

CHAPTER
SIXTYFOUR
MARKOV PERFECT EQUILIBRIUM
Contents
• Markov Perfect Equilibrium

– Overview
– Background
– Linear Markov Perfect Equilibria
– Application
– Exercises
64.1 Overview
This lecture describes the concept of Markov perfect equilibrium.

Markov perfect equilibrium is a key notion for analyzing economic problems involving dynamic strategic interaction, and
a cornerstone of applied game theory.
In this lecture, we teach Markov perfect equilibrium by example.
We will focus on settings with
• two players
• quadratic payoff functions
• linear transition rules for the state
Other references include chapter 7 of [Ljungqvist and Sargent, 2018].

import numpy as np
import quantecon as qe
import matplotlib.pyplot as plt
64.2 Background
Markov perfect equilibrium is a refinement of the concept of Nash equilibrium.

It is used to study settings where multiple decision-makers interact non-cooperatively over time, each pursuing its own
objective.
The agents in the model face a common state vector, the time path of which is influenced by – and influences – their
decisions.
In particular, the transition law for the state that confronts each agent is affected by decision rules of other agents.
Individual payoff maximization requires that each agent solve a dynamic programming problem that includes this transition
law.
Markov perfect equilibrium prevails when no agent wishes to revise its policy, taking as given the policies of all other
agents.
Well known examples include
• Choice of price, output, location or capacity for firms in an industry (e.g., [Ericson and Pakes, 1995], [Ryan, 2012],
[Doraszelski and Satterthwaite, 2010]).
• Rate of extraction from a shared natural resource, such as a fishery (e.g., [Levhari and Mirman, 1980], [Van Long,
2011]).
Let’s examine a model of the first type.
64.2.1 Example: A Duopoly Model
Two firms are the only producers of a good, the demand for which is governed by a linear inverse demand function
𝑝 = 𝑎0 − 𝑎1 (𝑞1 + 𝑞2 ) (64.1)
Here 𝑝 = 𝑝𝑡 is the price of the good, 𝑞𝑖 = 𝑞𝑖𝑡 is the output of firm 𝑖 = 1, 2 at time 𝑡 and 𝑎0 > 0, 𝑎1 > 0.
In (64.1) and what follows,
• the time subscript is suppressed when possible to simplify notation
• 𝑥̂ denotes a next period value of variable 𝑥
Each firm recognizes that its output affects total output and therefore the market price.
The one-period payoff function of firm 𝑖 is price times quantity minus adjustment costs:
𝜋𝑖 = 𝑝𝑞𝑖 − 𝛾(𝑞𝑖̂ − 𝑞𝑖 )2 , 𝛾 > 0, (64.2)
Substituting the inverse demand curve (64.1) into (64.2) lets us express the one-period payoff as
𝜋𝑖 (𝑞𝑖 , 𝑞−𝑖 , 𝑞𝑖̂ ) = 𝑎0 𝑞𝑖 − 𝑎1 𝑞𝑖2 − 𝑎1 𝑞𝑖 𝑞−𝑖 − 𝛾(𝑞𝑖̂ − 𝑞𝑖 )2 , (64.3)
where 𝑞−𝑖 denotes the output of the firm other than 𝑖.

The objective of the firm is to maximize $\sum_{t=0}^{\infty} \beta^t \pi_{it}$.
Firm 𝑖 chooses a decision rule that sets next period quantity 𝑞𝑖̂ as a function 𝑓𝑖 of the current state (𝑞𝑖 , 𝑞−𝑖 ).
An essential aspect of a Markov perfect equilibrium is that each firm takes the decision rule of the other firm as known
and given.

Given $f_{-i}$, the Bellman equation of firm $i$ is
$$v_i(q_i, q_{-i}) = \max_{\hat q_i} \left\{ \pi_i(q_i, q_{-i}, \hat q_i) + \beta v_i(\hat q_i, f_{-i}(q_{-i}, q_i)) \right\} \tag{64.4}$$
Definition A Markov perfect equilibrium of the duopoly model is a pair of value functions (𝑣1 , 𝑣2 ) and a pair of policy
functions (𝑓1 , 𝑓2 ) such that, for each 𝑖 ∈ {1, 2} and each possible state,
• The value function 𝑣𝑖 satisfies Bellman equation (64.4).
• The maximizer on the right side of (64.4) equals 𝑓𝑖 (𝑞𝑖 , 𝑞−𝑖 ).
The adjective “Markov” denotes that the equilibrium decision rules depend only on the current values of the state variables,
not other parts of their histories.
“Perfect” means complete, in the sense that the equilibrium is constructed by backward induction and hence builds in
optimizing behavior for each firm at all possible future states.
• These include many states that will not be reached when we iterate forward on the pair of equilibrium strategies 𝑓𝑖
starting from a given initial state.
64.2.2 Computation
One strategy for computing a Markov perfect equilibrium is iterating to convergence on pairs of Bellman equations and
decision rules.
In particular, let $v_i^j, f_i^j$ be the value function and policy function for firm $i$ at the $j$-th iteration.
Imagine constructing the iterates
$$v_i^{j+1}(q_i, q_{-i}) = \max_{\hat q_i} \left\{ \pi_i(q_i, q_{-i}, \hat q_i) + \beta v_i^j(\hat q_i, f_{-i}(q_{-i}, q_i)) \right\} \tag{64.5}$$
These iterations can be challenging to implement computationally.

However, they simplify for the case in which one-period payoff functions are quadratic and transition laws are linear —
which takes us to our next topic.
64.3 Linear Markov Perfect Equilibria
As we saw in the duopoly example, the study of Markov perfect equilibria in games with two players leads us to an
interrelated pair of Bellman equations.
In linear-quadratic dynamic games, these “stacked Bellman equations” become “stacked Riccati equations” with a tractable
mathematical structure.
We’ll lay out that structure in a general setup and then apply it to some simple problems.
64.3.1 Coupled Linear Regulator Problems
We consider a general linear-quadratic regulator game with two players.

For convenience, we’ll start with a finite horizon formulation, where 𝑡0 is the initial date and 𝑡1 is the common terminal
date.
Player $i$ takes $\{u_{-it}\}$ as given and minimizes
$$\sum_{t=t_0}^{t_1 - 1} \beta^{t - t_0} \left\{ x_t' R_i x_t + u_{it}' Q_i u_{it} + u_{-it}' S_i u_{-it} + 2 x_t' W_i u_{it} + 2 u_{-it}' M_i u_{it} \right\} \tag{64.6}$$

while the state evolves according to
𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵1 𝑢1𝑡 + 𝐵2 𝑢2𝑡 (64.7)
Here
• 𝑥𝑡 is an 𝑛 × 1 state vector and 𝑢𝑖𝑡 is a 𝑘𝑖 × 1 vector of controls for player 𝑖
• 𝑅𝑖 is 𝑛 × 𝑛
• 𝑆𝑖 is 𝑘−𝑖 × 𝑘−𝑖
• 𝑄𝑖 is 𝑘𝑖 × 𝑘𝑖
• 𝑊𝑖 is 𝑛 × 𝑘𝑖
• 𝑀𝑖 is 𝑘−𝑖 × 𝑘𝑖
• 𝐴 is 𝑛 × 𝑛
• 𝐵𝑖 is 𝑛 × 𝑘𝑖
64.3.2 Computing Equilibrium
We formulate a linear Markov perfect equilibrium as follows.

Player 𝑖 employs linear decision rules 𝑢𝑖𝑡 = −𝐹𝑖𝑡 𝑥𝑡 , where 𝐹𝑖𝑡 is a 𝑘𝑖 × 𝑛 matrix.
A Markov perfect equilibrium is a pair of sequences {𝐹1𝑡 , 𝐹2𝑡 } over 𝑡 = 𝑡0 , … , 𝑡1 − 1 such that
• {𝐹1𝑡 } solves player 1’s problem, taking {𝐹2𝑡 } as given, and
• {𝐹2𝑡 } solves player 2’s problem, taking {𝐹1𝑡 } as given
If we take $u_{2t} = -F_{2t} x_t$ and substitute it into (64.6) and (64.7), then player 1’s problem becomes minimization of
$$\sum_{t=t_0}^{t_1 - 1} \beta^{t - t_0} \left\{ x_t' \Pi_{1t} x_t + u_{1t}' Q_1 u_{1t} + 2 u_{1t}' \Gamma_{1t} x_t \right\} \tag{64.8}$$
subject to
$$x_{t+1} = \Lambda_{1t} x_t + B_1 u_{1t}, \tag{64.9}$$
where
• $\Lambda_{it} := A - B_{-i} F_{-it}$
• $\Pi_{it} := R_i + F_{-it}' S_i F_{-it}$
• $\Gamma_{it} := W_i' - M_i' F_{-it}$
This is an LQ dynamic programming problem that can be solved by working backwards.
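Decision rules that solve this problem are
$$F_{1t} = (Q_1 + \beta B_1' P_{1t+1} B_1)^{-1} (\beta B_1' P_{1t+1} \Lambda_{1t} + \Gamma_{1t}) \tag{64.10}$$
where $P_{1t}$ solves the matrix Riccati difference equation
$$P_{1t} = \Pi_{1t} - (\beta B_1' P_{1t+1} \Lambda_{1t} + \Gamma_{1t})' (Q_1 + \beta B_1' P_{1t+1} B_1)^{-1} (\beta B_1' P_{1t+1} \Lambda_{1t} + \Gamma_{1t}) + \beta \Lambda_{1t}' P_{1t+1} \Lambda_{1t} \tag{64.11}$$
Similarly, decision rules that solve player 2’s problem are
$$F_{2t} = (Q_2 + \beta B_2' P_{2t+1} B_2)^{-1} (\beta B_2' P_{2t+1} \Lambda_{2t} + \Gamma_{2t}) \tag{64.12}$$
where $P_{2t}$ solves
$$P_{2t} = \Pi_{2t} - (\beta B_2' P_{2t+1} \Lambda_{2t} + \Gamma_{2t})' (Q_2 + \beta B_2' P_{2t+1} B_2)^{-1} (\beta B_2' P_{2t+1} \Lambda_{2t} + \Gamma_{2t}) + \beta \Lambda_{2t}' P_{2t+1} \Lambda_{2t} \tag{64.13}$$
Here, in all cases $t = t_0, \ldots, t_1 - 1$ and the terminal conditions are $P_{i t_1} = 0$.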

The solution procedure is to use equations (64.10), (64.11), (64.12), and (64.13), and “work backwards” from time 𝑡1 −1.
Since we’re working backward, 𝑃1𝑡+1 and 𝑃2𝑡+1 are taken as given at each stage.
Moreover, since
• some terms on the right-hand side of (64.10) contain 𝐹2𝑡
• some terms on the right-hand side of (64.12) contain 𝐹1𝑡
we need to solve these 𝑘1 + 𝑘2 equations simultaneously.
Key Insight
A key insight is that equations (64.10) and (64.12) are linear in 𝐹1𝑡 and 𝐹2𝑡 .
After these equations have been solved, we can take 𝐹𝑖𝑡 and solve for 𝑃𝑖𝑡 in (64.11) and (64.13).
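The linearity can be made concrete. Substituting the definitions of Λ and Γ into (64.10) and (64.12) and collecting terms in 𝐹1𝑡 and 𝐹2𝑡 gives one stacked linear system per date. The sketch below solves that system for a single backward step; the function name `solve_stage` and the illustrative continuation matrices `P1` and `P2` are assumptions made for this example, and the code is not the implementation used inside QuantEcon.py's nnash.

```python
import numpy as np

def solve_stage(A, B1, B2, Q1, Q2, W1, W2, M1, M2, P1, P2, β):
    """
    One backward step (hypothetical helper): given next-period
    matrices P1, P2, solve (64.10) and (64.12) jointly for F1, F2.
    """
    k1 = B1.shape[1]
    # coefficients on F1 and F2 after moving all F terms to the left
    H11 = Q1 + β * B1.T @ P1 @ B1
    H12 = β * B1.T @ P1 @ B2 + M1.T
    H21 = β * B2.T @ P2 @ B1 + M2.T
    H22 = Q2 + β * B2.T @ P2 @ B2
    lhs = np.block([[H11, H12], [H21, H22]])
    rhs = np.vstack([β * B1.T @ P1 @ A + W1.T,
                     β * B2.T @ P2 @ A + W2.T])
    F = np.linalg.solve(lhs, rhs)
    return F[:k1], F[k1:]

# illustrative inputs with the dimensions of the duopoly example
A = np.eye(3)
B1 = np.array([[0.], [1.], [0.]])
B2 = np.array([[0.], [0.], [1.]])
Q1 = Q2 = np.array([[12.0]])
W1 = W2 = np.zeros((3, 1))
M1 = M2 = np.zeros((1, 1))
β = 0.96
P1 = np.array([[2.0, 0.5, 0.3],
               [0.5, 1.0, 0.2],
               [0.3, 0.2, 1.0]])   # assumed continuation matrix
P2 = P1.copy()

F1, F2 = solve_stage(A, B1, B2, Q1, Q2, W1, W2, M1, M2, P1, P2, β)

# check that F1 satisfies (64.10) with F2 taken as given
Λ1 = A - B2 @ F2
Γ1 = W1.T - M1.T @ F2
resid1 = F1 - np.linalg.solve(Q1 + β * B1.T @ P1 @ B1,
                              β * B1.T @ P1 @ Λ1 + Γ1)
print(np.abs(resid1).max())
```

Iterating this step backward, together with the Riccati updates (64.11) and (64.13), would give the full finite-horizon solution.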
Infinite Horizon
We often want to compute the solutions of such games for infinite horizons, in the hope that the decision rules 𝐹𝑖𝑡 settle
down to be time-invariant as 𝑡1 → +∞.
In practice, we usually fix 𝑡1 and compute the equilibrium of an infinite horizon game by driving 𝑡0 → −∞.
This is the approach we adopt in the next section.
We use the function nnash from QuantEcon.py that computes a Markov perfect equilibrium of the infinite horizon linear-
quadratic dynamic game in the manner described above.
64.4 Application
Let’s use these procedures to treat some applications, starting with the duopoly model.
64.4.1 A Duopoly Model
To map the duopoly model into coupled linear-quadratic dynamic programming problems, define the state and controls
as
$$x_t := \begin{bmatrix} 1 \\ q_{1t} \\ q_{2t} \end{bmatrix} \quad \text{and} \quad u_{it} := q_{i,t+1} - q_{it}, \quad i = 1, 2$$
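If we write
$$x_t' R_i x_t + u_{it}' Q_i u_{it}$$
where $Q_1 = Q_2 = \gamma$,
$$R_1 := \begin{bmatrix} 0 & -\frac{a_0}{2} & 0 \\ -\frac{a_0}{2} & a_1 & \frac{a_1}{2} \\ 0 & \frac{a_1}{2} & 0 \end{bmatrix} \quad \text{and} \quad R_2 := \begin{bmatrix} 0 & 0 & -\frac{a_0}{2} \\ 0 & 0 & \frac{a_1}{2} \\ -\frac{a_0}{2} & \frac{a_1}{2} & a_1 \end{bmatrix}$$
then we recover the one-period payoffs in expression (64.3).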

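The law of motion for the state $x_t$ is $x_{t+1} = A x_t + B_1 u_{1t} + B_2 u_{2t}$ where
$$A := \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad B_1 := \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad B_2 := \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$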
The optimal decision rule of firm 𝑖 will take the form 𝑢𝑖𝑡 = −𝐹𝑖 𝑥𝑡 , inducing the following closed-loop system for the
evolution of 𝑥 in the Markov perfect equilibrium:
$$x_{t+1} = (A - B_1 F_1 - B_2 F_2) x_t \tag{64.14}$$
64.4.2 Parameters and Solution
Consider the previously presented duopoly model with parameter values of:
• 𝑎0 = 10
• 𝑎1 = 2
• 𝛽 = 0.96
• 𝛾 = 12
From these, we compute the infinite horizon MPE using the preceding code
import numpy as np
# Parameters
a0 = 10.0
a1 = 2.0
β = 0.96
γ = 12.0
# In LQ form
A = np.eye(3)
B1 = np.array([[0.], [1.], [0.]])
B2 = np.array([[0.], [0.], [1.]])
R1 = [[ 0., -a0 / 2, 0.],

[-a0 / 2., a1, a1 / 2.],
[ 0, a1 / 2., 0.]]
R2 = [[ 0., 0., -a0 / 2],

[ 0., 0., a1 / 2.],
[-a0 / 2, a1 / 2., a1]]
Q1 = Q2 = γ
S1 = S2 = W1 = W2 = M1 = M2 = 0.0

# Solve using QE's nnash function

F1, F2, P1, P2 = qe.nnash(A, B1, B2, R1, R2, Q1,
Q2, S1, S2, W1, W2, M1,
M2, beta=β)
# Display policies
print("Computed policies for firm 1 and firm 2:\n")
print(f"F1 = {F1}")
print(f"F2 = {F2}")
print("\n")
Running the code produces the following output.
Computed policies for firm 1 and firm 2:
F1 = [[-0.66846615 0.29512482 0.07584666]]
F2 = [[-0.66846615 0.07584666 0.29512482]]

One way to see that $F_1$ is indeed optimal for firm 1 taking $F_2$ as given is to use QuantEcon.py’s LQ class.
In particular, let’s take F2 as computed above, plug it into (64.8) and (64.9) to get firm 1’s problem and solve it using LQ.
We hope that the resulting policy will agree with F1 as computed above.
Λ1 = A - B2 @ F2
lq1 = qe.LQ(Q1, R1, Λ1, B1, beta=β)
P1_ih, F1_ih, d = lq1.stationary_values()
F1_ih
array([[-0.66846613, 0.29512482, 0.07584666]])
This is close enough for rock and roll, as they say in the trade.
Indeed, np.allclose agrees with our assessment
np.allclose(F1, F1_ih)
True
64.4.3 Dynamics
Let’s now investigate the dynamics of price and output in this simple duopoly model under the MPE policies.
Given our optimal policies $F_1$ and $F_2$, the state evolves according to (64.14).
The following program
• imports $F_1$ and $F_2$ from the previous program along with all parameters.
• computes the evolution of 𝑥𝑡 using (64.14).
• extracts and plots industry output 𝑞𝑡 = 𝑞1𝑡 + 𝑞2𝑡 and price 𝑝𝑡 = 𝑎0 − 𝑎1 𝑞𝑡 .

AF = A - B1 @ F1 - B2 @ F2
n = 20
x = np.empty((3, n))
x[:, 0] = 1, 1, 1
# iterate on the closed-loop law of motion (64.14)
for t in range(n-1):
    x[:, t+1] = AF @ x[:, t]
q1 = x[1, :]
q2 = x[2, :]
q = q1 + q2       # Total output, MPE
p = a0 - a1 * q   # Price, MPE

fig, ax = plt.subplots()
ax.plot(q, 'b-', lw=2, alpha=0.75, label='total output')
ax.plot(p, 'g-', lw=2, alpha=0.75, label='price')
ax.set_title('Output and prices, duopoly MPE')
ax.legend(frameon=False)
plt.show()
Note that the initial condition has been set to 𝑞10 = 𝑞20 = 1.0.
To gain some perspective we can compare this to what happens in the monopoly case.
The first panel in the next figure compares output of the monopolist and industry output under the MPE, as a function of
time.
The second panel shows analogous curves for price.
Here parameters are the same as above for both the MPE and monopoly solutions.
The monopolist initial condition is 𝑞0 = 2.0 to mimic the industry initial condition 𝑞10 = 𝑞20 = 1.0 in the MPE case.


As expected, output is higher and prices are lower under duopoly than monopoly.
64.5 Exercises
Exercise 64.5.1
Replicate the pair of figures showing the comparison of output and prices for the monopolist and duopoly under MPE.
Parameters are as in duopoly_mpe.py and you can use that code to compute MPE policies under duopoly.
The optimal policy in the monopolist case can be computed using QuantEcon.py’s LQ class.

First, let’s compute the duopoly MPE under the stated parameters
# == Parameters == #
a0 = 10.0
a1 = 2.0
β = 0.96
γ = 12.0
# == In LQ form == #
A = np.eye(3)
B1 = np.array([[0.], [1.], [0.]])
B2 = np.array([[0.], [0.], [1.]])
R1 = [[ 0., -a0/2, 0.],
[-a0 / 2., a1, a1 / 2.],
[ 0, a1 / 2., 0.]]
R2 = [[ 0., 0., -a0 / 2],

[ 0., 0., a1 / 2.],
[-a0 / 2, a1 / 2., a1]]
Q1 = Q2 = γ
S1 = S2 = W1 = W2 = M1 = M2 = 0.0
# == Solve using QE's nnash function == #

F1, F2, P1, P2 = qe.nnash(A, B1, B2, R1, R2, Q1,
Q2, S1, S2, W1, W2, M1,
M2, beta=β)
Now we evaluate the time path of industry output and prices given initial condition 𝑞10 = 𝑞20 = 1.
AF = A - B1 @ F1 - B2 @ F2
n = 20
x = np.empty((3, n))
x[:, 0] = 1, 1, 1
# iterate on the closed-loop law of motion
for t in range(n-1):
    x[:, t+1] = AF @ x[:, t]
q1 = x[1, :]
q2 = x[2, :]
q = q1 + q2       # Total output, MPE
p = a0 - a1 * q   # Price, MPE

Next, let’s have a look at the monopoly solution.

For the state and control, we take
𝑥𝑡 = 𝑞𝑡 − 𝑞 ̄ and 𝑢𝑡 = 𝑞𝑡+1 − 𝑞𝑡
To convert to an LQ problem we set
𝑅 = 𝑎1 and 𝑄 = 𝛾
in the payoff function 𝑥′𝑡 𝑅𝑥𝑡 + 𝑢′𝑡 𝑄𝑢𝑡 and
𝐴=𝐵=1
in the law of motion 𝑥𝑡+1 = 𝐴𝑥𝑡 + 𝐵𝑢𝑡 .

We solve for the optimal policy 𝑢𝑡 = −𝐹 𝑥𝑡 and track the resulting dynamics of {𝑞𝑡 }, starting at 𝑞0 = 2.0.
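The choices 𝑅 = 𝑎1 and 𝑄 = 𝛾 can be motivated by completing the square. The derivation below is a sketch under the assumption, not spelled out above, that the monopolist's one-period payoff is $pq - \gamma(\hat q - q)^2$ with $p = a_0 - a_1 q$, and that $\bar q := a_0 / (2 a_1)$, the value used in the code that follows.

```latex
\begin{aligned}
\pi &= (a_0 - a_1 q)\, q - \gamma (\hat q - q)^2 \\
    &= -a_1 \left( q - \frac{a_0}{2 a_1} \right)^2 + \frac{a_0^2}{4 a_1} - \gamma (\hat q - q)^2 \\
    &= -\left( a_1 x^2 + \gamma u^2 \right) + \frac{a_0^2}{4 a_1},
    \qquad x := q - \bar q, \quad u := \hat q - q .
\end{aligned}
```

Because the constant $a_0^2 / (4 a_1)$ does not affect the optimal policy, maximizing discounted payoffs is equivalent to minimizing the discounted sum of $a_1 x_t^2 + \gamma u_t^2$, and $x_{t+1} = x_t + u_t$ gives $A = B = 1$.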
R = a1
Q = γ
A = B = 1
lq_alt = qe.LQ(Q, R, A, B, beta=β)
P, F, d = lq_alt.stationary_values()
q_bar = a0 / (2.0 * a1)
qm = np.empty(n)
qm[0] = 2
x0 = qm[0] - q_bar
x = x0
for i in range(1, n):
    x = A * x - B * F * x
    qm[i] = float(x) + q_bar
pm = a0 - a1 * qm
Let’s have a look at the different time paths
fig, axes = plt.subplots(2, 1, figsize=(9, 9))
ax = axes[0]
ax.plot(qm, 'b-', lw=2, alpha=0.75, label='monopolist output')
ax.plot(q, 'g-', lw=2, alpha=0.75, label='MPE total output')
ax.set(ylabel="output", xlabel="time", ylim=(2, 4))
ax.legend(loc='upper left', frameon=0)
ax = axes[1]
ax.plot(pm, 'b-', lw=2, alpha=0.75, label='monopolist price')
ax.plot(p, 'g-', lw=2, alpha=0.75, label='MPE price')
ax.set(ylabel="price", xlabel="time")
ax.legend(loc='upper right', frameon=0)
plt.show()

Exercise 64.5.2
In this exercise, we consider a slightly more sophisticated duopoly problem.
It takes the form of an infinite horizon linear-quadratic game proposed by Judd [Judd, 1990].
Two firms set prices and quantities of two goods interrelated through their demand curves.
Relevant variables are defined as follows:
• 𝐼𝑖𝑡 = inventories of firm 𝑖 at beginning of 𝑡
• 𝑞𝑖𝑡 = production of firm 𝑖 during period 𝑡
• 𝑝𝑖𝑡 = price charged by firm 𝑖 during period 𝑡
• 𝑆𝑖𝑡 = sales made by firm 𝑖 during period 𝑡
• 𝐸𝑖𝑡 = costs of production of firm 𝑖 during period 𝑡

• 𝐶𝑖𝑡 = costs of carrying inventories for firm 𝑖 during 𝑡

The firms’ cost functions are
• $C_{it} = c_{i1} + c_{i2} I_{it} + 0.5 c_{i3} I_{it}^2$
• $E_{it} = e_{i1} + e_{i2} q_{it} + 0.5 e_{i3} q_{it}^2$ where $e_{ij}, c_{ij}$ are positive scalars
Inventories obey the laws of motion
𝐼𝑖,𝑡+1 = (1 − 𝛿)𝐼𝑖𝑡 + 𝑞𝑖𝑡 − 𝑆𝑖𝑡
Demand is governed by the linear schedule
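$$S_t = D p_t + b$$
where
• $S_t = \begin{bmatrix} S_{1t} & S_{2t} \end{bmatrix}'$ and $p_t = \begin{bmatrix} p_{1t} & p_{2t} \end{bmatrix}'$
• $D$ is a $2 \times 2$ negative definite matrix and
• $b$ is a vector of constants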
Firm 𝑖 maximizes the undiscounted sum
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} \left( p_{it} S_{it} - E_{it} - C_{it} \right)$$
We can convert this to a linear-quadratic problem by taking
$$u_{it} = \begin{bmatrix} p_{it} \\ q_{it} \end{bmatrix} \quad \text{and} \quad x_t = \begin{bmatrix} I_{1t} \\ I_{2t} \\ 1 \end{bmatrix}$$
Decision rules for price and quantity take the form 𝑢𝑖𝑡 = −𝐹𝑖 𝑥𝑡 .
The Markov perfect equilibrium of Judd’s model can be computed by filling in the matrices appropriately.
The exercise is to calculate these matrices and compute the following figures.
The first figure shows the dynamics of inventories for each firm when the parameters are
δ = 0.02
D = np.array([[-1, 0.5], [0.5, -1]])
b = np.array([25, 25])
c1 = c2 = np.array([1, -2, 1])
e1 = e2 = np.array([10, 10, 3])
Inventories trend to a common steady state.

If we increase the depreciation rate to 𝛿 = 0.05, then we expect steady state inventories to fall.
This is indeed the case, as the next figure shows.
In this exercise, reproduce the figure when 𝛿 = 0.02.

We treat the case 𝛿 = 0.02


δ = 0.02
D = np.array([[-1, 0.5], [0.5, -1]])
b = np.array([25, 25])
c1 = c2 = np.array([1, -2, 1])
e1 = e2 = np.array([10, 10, 3])
δ_1 = 1 - δ
Recalling that the control and state are
$$u_{it} = \begin{bmatrix} p_{it} \\ q_{it} \end{bmatrix} \quad \text{and} \quad x_t = \begin{bmatrix} I_{1t} \\ I_{2t} \\ 1 \end{bmatrix}$$
we set up the matrices as follows:
# == Create matrices needed to compute the Nash feedback equilibrium == #
A = np.array([[δ_1, 0, -δ_1 * b[0]],

[ 0, δ_1, -δ_1 * b[1]],
[ 0, 0, 1]])
B1 = δ_1 * np.array([[1, -D[0, 0]],

[0, -D[1, 0]],
[0, 0]])
B2 = δ_1 * np.array([[0, -D[0, 1]],
[1, -D[1, 1]],
[0, 0]])
R1 = -np.array([[0.5 * c1[2], 0, 0.5 * c1[1]],

[ 0, 0, 0],
[0.5 * c1[1], 0, c1[0]]])
R2 = -np.array([[0, 0, 0],
[0, 0.5 * c2[2], 0.5 * c2[1]],
[0, 0.5 * c2[1], c2[0]]])
Q1 = np.array([[-0.5 * e1[2], 0], [0, D[0, 0]]])

Q2 = np.array([[-0.5 * e2[2], 0], [0, D[1, 1]]])
S1 = np.zeros((2, 2))
S2 = np.copy(S1)
W1 = np.array([[ 0, 0],
[ 0, 0],
[-0.5 * e1[1], b[0] / 2.]])
W2 = np.array([[ 0, 0],
[ 0, 0],
[-0.5 * e2[1], b[1] / 2.]])
M1 = np.array([[0, 0], [0, D[0, 1] / 2.]])

M2 = np.copy(M1)
We can now compute the equilibrium using qe.nnash
F1, F2, P1, P2 = qe.nnash(A, B1, B2, R1,

R2, Q1, Q2, S1,


S2, W1, W2, M1, M2)
print("\nFirm 1's feedback rule:\n")

print(F1)
print("\nFirm 2's feedback rule:\n")

print(F2)
Firm 1's feedback rule:
[[ 2.43666582e-01 2.72360627e-02 -6.82788293e+00]

[ 3.92370734e-01 1.39696451e-01 -3.77341073e+01]]
Firm 2's feedback rule:
[[ 2.72360627e-02 2.43666582e-01 -6.82788293e+00]

[ 1.39696451e-01 3.92370734e-01 -3.77341073e+01]]
Now let’s look at the dynamics of inventories, and reproduce the graph corresponding to 𝛿 = 0.02
AF = A - B1 @ F1 - B2 @ F2
n = 25
x = np.empty((3, n))
x[:, 0] = 2, 0, 1
# iterate on the closed-loop law of motion
for t in range(n-1):
    x[:, t+1] = AF @ x[:, t]
I1 = x[0, :]
I2 = x[1, :]
fig, ax = plt.subplots()
ax.plot(I1, 'b-', lw=2, alpha=0.75, label='inventories, firm 1')
ax.plot(I2, 'g-', lw=2, alpha=0.75, label='inventories, firm 2')
ax.set_title(rf'$\delta = {δ}$')
ax.legend()
plt.show()



CHAPTER
SIXTYFIVE
UNCERTAINTY TRAPS
Contents
• Uncertainty Traps
– Overview
– The Model
– Implementation
– Results
– Exercises
65.1 Overview
In this lecture, we study a simplified version of an uncertainty traps model of Fajgelbaum, Schaal and Taschereau-
Dumouchel [Fajgelbaum et al., 2015].
The model features self-reinforcing uncertainty that has big impacts on economic activity.
In the model,
• Fundamentals vary stochastically and are not fully observable.
• At any moment there are both active and inactive entrepreneurs; only active entrepreneurs produce.
• Agents – active and inactive entrepreneurs – have beliefs about the fundamentals expressed as probability distributions.
• Greater uncertainty means greater dispersions of these distributions.
• Entrepreneurs are risk-averse and hence less inclined to be active when uncertainty is high.
• The output of active entrepreneurs is observable, supplying a noisy signal that helps everyone inside the model infer
fundamentals.
• Entrepreneurs update their beliefs about fundamentals using Bayes’ Law, implemented via Kalman filtering.
Uncertainty traps emerge because:
• High uncertainty discourages entrepreneurs from becoming active.
• A low level of participation – i.e., a smaller number of active entrepreneurs – diminishes the flow of information
about fundamentals.
• Less information translates to higher uncertainty, further discouraging entrepreneurs from choosing to be active,
and so on.
Uncertainty traps stem from a positive externality: high aggregate economic activity levels generate valuable information.

import numpy as np
65.2 The Model
The original model described in [Fajgelbaum et al., 2015] has many interesting moving parts.
Here we examine a simplified version that nonetheless captures many of the key ideas.
65.2.1 Fundamentals
The evolution of the fundamental process {𝜃𝑡 } is given by
𝜃𝑡+1 = 𝜌𝜃𝑡 + 𝜎𝜃 𝑤𝑡+1
where
• 𝜎𝜃 > 0 and 0 < 𝜌 < 1
• {𝑤𝑡 } is IID and standard normal
The random variable 𝜃𝑡 is not observable at any time.
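As a quick illustration of this law of motion, the following minimal sketch simulates one path of the fundamental. The function name `simulate_fundamental`, the horizon `T` and the seed are illustrative choices and not part of the lecture's own code; the values of 𝜌 and 𝜎𝜃 match the defaults used later in the lecture.

```python
import numpy as np

def simulate_fundamental(ρ=0.99, σ_θ=0.5, θ_init=0.0, T=200, seed=0):
    """
    Simulate θ_{t+1} = ρ θ_t + σ_θ w_{t+1} with IID standard normal shocks.
    (Illustrative helper; name and arguments are assumptions.)
    """
    rng = np.random.default_rng(seed)
    θ = np.empty(T)
    θ[0] = θ_init
    w = rng.standard_normal(T - 1)
    for t in range(T - 1):
        θ[t + 1] = ρ * θ[t] + σ_θ * w[t]
    return θ

θ_path = simulate_fundamental()

# With 0 < ρ < 1 the process is stationary, with standard deviation
# σ_θ / sqrt(1 - ρ^2)
stationary_std = 0.5 / np.sqrt(1 - 0.99**2)
```

Because 𝜌 is close to one, the stationary standard deviation is several times larger than 𝜎𝜃, so the fundamental can drift far from zero for long stretches.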
65.2.2 Output
There is a total 𝑀̄ of risk-averse entrepreneurs.

Output of the 𝑚-th entrepreneur, conditional on being active in the market at time 𝑡, is equal to
$$x_m = \theta + \epsilon_m \quad \text{where} \quad \epsilon_m \sim N(0, \gamma_x^{-1}) \tag{65.1}$$
Here the time subscript has been dropped to simplify notation.

The inverse of the shock variance, 𝛾𝑥 , is called the shock’s precision.
The higher is the precision, the more informative 𝑥𝑚 is about the fundamental.
Output shocks are independent across time and firms.
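To see why more active firms means more information, the sketch below checks by simulation that the average of 𝑀 independent output signals has precision 𝑀𝛾𝑥. The number of firms, the fixed value of 𝜃, the number of replications and the seed are all illustrative assumptions.

```python
import numpy as np

γ_x = 0.5            # Precision of each output shock
M = 50               # Number of active firms (illustrative)
θ = 1.0              # Fundamental, held fixed for this check
num_reps = 20_000    # Number of simulated cross-sections

rng = np.random.default_rng(1234)
σ_x = np.sqrt(1 / γ_x)

# Each row is one cross-section of M firm outputs x_m = θ + ε_m
x = θ + σ_x * rng.standard_normal((num_reps, M))
X = x.mean(axis=1)   # Average output in each cross-section

implied_precision = 1 / X.var()      # Inverse of the simulated variance of X
theoretical_precision = M * γ_x      # Precision of a mean of M IID signals
```

The two precision values should be close, up to simulation error.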

65.2.3 Information and Beliefs
All entrepreneurs start with identical beliefs about 𝜃0 .

Signals are publicly observable and hence all agents have identical beliefs always.
Dropping time subscripts, beliefs for current 𝜃 are represented by the normal distribution 𝑁 (𝜇, 𝛾 −1 ).
Here 𝛾 is the precision of beliefs; its inverse is the degree of uncertainty.
These parameters are updated by Kalman filtering.
Let
• 𝕄 ⊂ {1, … , 𝑀̄ } denote the set of currently active firms.
• 𝑀 ∶= |𝕄| denote the number of currently active firms.
• 𝑋 be the average output $\frac{1}{M} \sum_{m \in \mathbb{M}} x_m$ of the active firms.
With this notation and primes for next period values, we can write the updating of the mean and precision via
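$$\mu' = \rho \frac{\gamma \mu + M \gamma_x X}{\gamma + M \gamma_x} \tag{65.2}$$

$$\gamma' = \left( \frac{\rho^2}{\gamma + M \gamma_x} + \sigma_\theta^2 \right)^{-1} \tag{65.3}$$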
These are standard Kalman filtering results applied to the current setting.
Exercise 1 provides more details on how (65.2) and (65.3) are derived and then asks you to fill in remaining steps.
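The updating rules (65.2) and (65.3) can be sketched as a small pure function. The name `kalman_update` and its signature are illustrative assumptions; the lecture's own implementation appears later as a method of a class.

```python
def kalman_update(μ, γ, X, M, ρ=0.99, γ_x=0.5, σ_θ=0.5):
    """
    Map current beliefs (μ, γ) and aggregates (X, M) into next period
    beliefs, following (65.2) and (65.3). Illustrative helper.
    """
    post_precision = γ + M * γ_x                    # Precision after seeing X
    post_mean = (γ * μ + M * γ_x * X) / post_precision
    μ_next = ρ * post_mean                          # (65.2)
    γ_next = 1 / (ρ**2 / post_precision + σ_θ**2)   # (65.3)
    return μ_next, γ_next

# No active firms: no new information arrives
μ_none, γ_none = kalman_update(0.0, 4.0, X=1.0, M=0)

# Ten active firms: beliefs move toward X and precision is higher
μ_many, γ_many = kalman_update(0.0, 4.0, X=1.0, M=10)
```

When 𝑀 = 0 the signal 𝑋 is ignored and the mean simply decays at rate 𝜌, while precision falls because the fundamental has moved by an unobserved shock.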
The next figure plots the law of motion for the precision in (65.3) as a 45 degree diagram, with one curve for each
𝑀 ∈ {0, … , 6}.
The other parameter values are 𝜌 = 0.99, 𝛾𝑥 = 0.5, 𝜎𝜃 = 0.5.
Points where the curves hit the 45 degree line are long-run steady states for precision for different values of 𝑀 .
Thus, if one of these values for 𝑀 remains fixed, a corresponding steady state is the equilibrium level of precision:
• high values of 𝑀 correspond to greater information about the fundamental, and hence more precision in steady
state
• low values of 𝑀 correspond to less information and more uncertainty in steady state
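The steady states described above can also be computed directly. Setting 𝛾′ = 𝛾 in (65.3) and rearranging gives the quadratic 𝜎𝜃²𝛾² + (𝜌² + 𝜎𝜃²𝑀𝛾𝑥 − 1)𝛾 − 𝑀𝛾𝑥 = 0, and the sketch below takes its larger root. The function name `steady_state_precision` is an illustrative assumption.

```python
import numpy as np

def steady_state_precision(M, ρ=0.99, γ_x=0.5, σ_θ=0.5):
    """
    Larger root of σ_θ² γ² + (ρ² + σ_θ² M γ_x - 1) γ - M γ_x = 0,
    i.e. the positive fixed point of the precision law of motion
    when M is held fixed. Illustrative helper.
    """
    m = M * γ_x
    b = ρ**2 + σ_θ**2 * m - 1
    return (-b + np.sqrt(b**2 + 4 * σ_θ**2 * m)) / (2 * σ_θ**2)

# Steady state precision for M = 0, ..., 6
γ_star = np.array([steady_state_precision(M) for M in range(7)])
```

The resulting values increase with 𝑀, in line with the two bullet points above.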
In practice, as we’ll see, the number of active firms fluctuates stochastically.
65.2.4 Participation
Omitting time subscripts once more, entrepreneurs enter the market in the current period if
𝔼[𝑢(𝑥𝑚 − 𝐹𝑚 )] > 𝑐 (65.4)
Here
• the mathematical expectation of 𝑥𝑚 is based on (65.1) and beliefs 𝑁 (𝜇, 𝛾 −1 ) for 𝜃
• 𝐹𝑚 is a stochastic but pre-visible fixed cost, independent across time and firms
• 𝑐 is a constant reflecting opportunity costs


The statement that 𝐹𝑚 is pre-visible means that it is realized at the start of the period and treated as a constant in (65.4).
The utility function has the constant absolute risk aversion form
$$u(x) = \frac{1}{a} \left( 1 - \exp(-ax) \right) \tag{65.5}$$
where 𝑎 is a positive parameter.
Combining (65.4) and (65.5), entrepreneur 𝑚 participates in the market (or is said to be active) when
$$\frac{1}{a} \left\{ 1 - \mathbb{E} \left[ \exp \left( -a(\theta + \epsilon_m - F_m) \right) \right] \right\} > c$$
Using standard formulas for expectations of lognormal random variables, this is equivalent to the condition
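$$\psi(\mu, \gamma, F_m) := \frac{1}{a} \left( 1 - \exp \left( -a\mu + aF_m + \frac{a^2 \left( \frac{1}{\gamma} + \frac{1}{\gamma_x} \right)}{2} \right) \right) - c > 0 \tag{65.6}$$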
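To get a feel for condition (65.6), here is a minimal sketch that evaluates 𝜓 on a handful of fixed costs. The standalone function and the grid `F_grid` are illustrative assumptions; the parameter values mirror the defaults used in the implementation later in the lecture.

```python
import numpy as np

def ψ(μ, γ, F, a=1.5, γ_x=0.5, c=-420):
    """
    Left-hand side of the participation condition (65.6).
    An entrepreneur with fixed cost F is active when ψ > 0.
    """
    risk_term = a**2 * (1 / γ + 1 / γ_x) / 2
    return (1 / a) * (1 - np.exp(-a * μ + a * F + risk_term)) - c

F_grid = np.array([-1.0, 0.0, 2.0, 4.0])   # Illustrative fixed costs

# Which entrepreneurs are active under beliefs μ = 0, γ = 4?
active = ψ(0.0, 4.0, F_grid) > 0

# The same fixed cost is less attractive when precision is lower
ψ_high_precision = ψ(0.0, 4.0, 2.0)
ψ_low_precision = ψ(0.0, 0.5, 2.0)
```

Since 𝜓 is decreasing in 𝐹𝑚 and increasing in 𝛾, entrepreneurs with high fixed costs drop out first, and more of them drop out as uncertainty rises.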
65.3 Implementation
We want to simulate this economy.

As a first step, let’s put together a class that bundles
• the parameters, the current value of 𝜃 and the current values of the two belief parameters 𝜇 and 𝛾
• methods to update 𝜃, 𝜇 and 𝛾, as well as to determine the number of active firms and their outputs
The updating methods follow the laws of motion for 𝜃, 𝜇 and 𝛾 given above.
The method to evaluate the number of active firms generates 𝐹1 , … , 𝐹𝑀̄ and tests condition (65.6) for each firm.
The __init__ method encodes as default values the parameters we’ll use in the simulations below.
class UncertaintyTrapEcon:

    def __init__(self,
                 a=1.5,          # Risk aversion
                 γ_x=0.5,        # Production shock precision
                 ρ=0.99,         # Correlation coefficient for θ
                 σ_θ=0.5,        # Standard dev of θ shock
                 num_firms=100,  # Number of firms
                 σ_F=1.5,        # Standard dev of fixed costs
                 c=-420,         # External opportunity cost
                 μ_init=0,       # Initial value for μ
                 γ_init=4,       # Initial value for γ
                 θ_init=0):      # Initial value for θ

        # == Record values == #
        self.a, self.γ_x, self.ρ, self.σ_θ = a, γ_x, ρ, σ_θ
        self.num_firms, self.σ_F, self.c = num_firms, σ_F, c
        self.σ_x = np.sqrt(1/γ_x)

        # == Initialize states == #
        self.γ, self.μ, self.θ = γ_init, μ_init, θ_init

    def ψ(self, F):
        """
        Evaluate the participation condition (65.6) at fixed cost(s) F.
        """
        temp1 = -self.a * (self.μ - F)
        temp2 = self.a**2 * (1/self.γ + 1/self.γ_x) / 2
        return (1 / self.a) * (1 - np.exp(temp1 + temp2)) - self.c

    def update_beliefs(self, X, M):
        """
        Update beliefs (μ, γ) based on aggregates X and M.
        """
        # Simplify names
        γ_x, ρ, σ_θ = self.γ_x, self.ρ, self.σ_θ
        # Update μ
        temp1 = ρ * (self.γ * self.μ + M * γ_x * X)
        temp2 = self.γ + M * γ_x
        self.μ = temp1 / temp2
        # Update γ
        self.γ = 1 / (ρ**2 / (self.γ + M * γ_x) + σ_θ**2)

    def update_θ(self, w):
        """
        Update the fundamental state θ given shock w.
        """
        self.θ = self.ρ * self.θ + self.σ_θ * w

    def gen_aggregates(self):
        """
        Generate aggregates based on current beliefs (μ, γ). This
        is a simulation step that depends on the draws for F.
        """
        F_vals = self.σ_F * np.random.randn(self.num_firms)
        M = np.sum(self.ψ(F_vals) > 0)  # Counts number of active firms
        if M > 0:
            x_vals = self.θ + self.σ_x * np.random.randn(M)
            X = x_vals.mean()
        else:
            X = 0
        return X, M
In the results below we use this code to simulate time series for the major variables.
65.4 Results
Let’s look first at the dynamics of 𝜇, which the agents use to track 𝜃.
We see that 𝜇 tracks 𝜃 well when there are sufficient firms in the market.
However, there are times when 𝜇 tracks 𝜃 poorly due to insufficient information.
These are episodes where the uncertainty traps take hold.
During these episodes
• precision is low and uncertainty is high
• few firms are in the market
To get a clearer idea of the dynamics, let’s look at all the main time series at once, for a given set of shocks
Notice how the traps only take hold after a sequence of bad draws for the fundamental.

Thus, the model gives us a propagation mechanism that maps bad random draws into long downturns in economic activity.
65.5 Exercises
Exercise 65.5.1
Fill in the details behind (65.2) and (65.3) based on the following standard result (see, e.g., p. 24 of [Young and Smith,
2005]).
Fact Let x = (𝑥1 , … , 𝑥𝑀 ) be a vector of IID draws from common distribution 𝑁 (𝜃, 1/𝛾𝑥 ) and let 𝑥̄ be the sample
mean. If 𝛾𝑥 is known and the prior for 𝜃 is 𝑁 (𝜇, 1/𝛾), then the posterior distribution of 𝜃 given x is
𝜋(𝜃 | x) = 𝑁 (𝜇0 , 1/𝛾0 )
where
$$\mu_0 = \frac{\mu \gamma + M \bar{x} \gamma_x}{\gamma + M \gamma_x} \quad \text{and} \quad \gamma_0 = \gamma + M \gamma_x$$

This exercise asked you to validate the laws of motion for 𝛾 and 𝜇 given in the lecture, based on the stated result about
Bayesian updating in a scalar Gaussian setting. The stated result tells us that after observing average output 𝑋 of the 𝑀
firms, our posterior beliefs will be
𝑁 (𝜇0 , 1/𝛾0 )


where
$$\mu_0 = \frac{\mu \gamma + M X \gamma_x}{\gamma + M \gamma_x} \quad \text{and} \quad \gamma_0 = \gamma + M \gamma_x$$
If we take a random variable 𝜃 with this distribution and then evaluate the distribution of 𝜌𝜃+𝜎𝜃 𝑤 where 𝑤 is independent
and standard normal, we get the expressions for 𝜇′ and 𝛾 ′ given in the lecture.
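The last step can be written out as follows. If 𝜃 has the posterior distribution 𝑁(𝜇0, 1/𝛾0) and 𝑤 is independent and standard normal, then 𝜌𝜃 + 𝜎𝜃𝑤 is normal, and its mean and variance follow from linearity of expectations and independence:

```latex
\begin{aligned}
\mathbb{E}[\rho \theta + \sigma_\theta w]
    &= \rho \mu_0
     = \rho \, \frac{\gamma \mu + M \gamma_x X}{\gamma + M \gamma_x}
     = \mu' \\
\operatorname{Var}[\rho \theta + \sigma_\theta w]
    &= \rho^2 \cdot \frac{1}{\gamma_0} + \sigma_\theta^2
     = \frac{\rho^2}{\gamma + M \gamma_x} + \sigma_\theta^2
     = \frac{1}{\gamma'}
\end{aligned}
```

Inverting the variance gives the precision in (65.3).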
Exercise 65.5.2
Modulo randomness, replicate the simulation figures shown above.
• Use the parameter values listed as defaults in the init method of the UncertaintyTrapEcon class.

First, let’s replicate the plot that illustrates the law of motion for precision, which is
$$\gamma_{t+1} = \left( \frac{\rho^2}{\gamma_t + M \gamma_x} + \sigma_\theta^2 \right)^{-1}$$
Here 𝑀 is the number of active firms. The next figure plots 𝛾𝑡+1 against 𝛾𝑡 on a 45 degree diagram for different values
of 𝑀
econ = UncertaintyTrapEcon()
ρ, σ_θ, γ_x = econ.ρ, econ.σ_θ, econ.γ_x  # Simplify names
γ = np.linspace(1e-10, 3, 200)            # γ grid

fig, ax = plt.subplots(figsize=(9, 9))
ax.plot(γ, γ, 'k-')                       # 45 degree line

for M in range(7):
    γ_next = 1 / (ρ**2 / (γ + M * γ_x) + σ_θ**2)
    label_string = f"$M = {M}$"
    ax.plot(γ, γ_next, lw=2, label=label_string)

ax.legend(loc='lower right', fontsize=14)
ax.set_xlabel(r'$\gamma$', fontsize=16)
ax.set_ylabel(r"$\gamma'$", fontsize=16)
ax.grid()
plt.show()

The points where the curves hit the 45 degree line are the long-run steady states corresponding to each 𝑀 , if that value
of 𝑀 were to remain fixed. As the number of firms falls, so does the long-run steady state of precision.
Next let’s generate time series for beliefs and the aggregates – that is, the number of active firms and average output
sim_length = 2000

μ_vec = np.empty(sim_length)
θ_vec = np.empty(sim_length)
γ_vec = np.empty(sim_length)
X_vec = np.empty(sim_length)
M_vec = np.empty(sim_length)

μ_vec[0] = econ.μ
γ_vec[0] = econ.γ
θ_vec[0] = 0

w_shocks = np.random.randn(sim_length)

for t in range(sim_length-1):
    X, M = econ.gen_aggregates()
    X_vec[t] = X
    M_vec[t] = M

    econ.update_beliefs(X, M)
    econ.update_θ(w_shocks[t])

    μ_vec[t+1] = econ.μ
    γ_vec[t+1] = econ.γ
    θ_vec[t+1] = econ.θ

# Record final values of aggregates
X, M = econ.gen_aggregates()
X_vec[-1] = X
M_vec[-1] = M
First, let’s see how well 𝜇 tracks 𝜃 in these simulations

fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(range(sim_length), θ_vec, alpha=0.6, lw=2, label=r"$\theta$")
ax.plot(range(sim_length), μ_vec, alpha=0.6, lw=2, label=r"$\mu$")
ax.legend(fontsize=16)
ax.grid()
plt.show()

Now let’s plot the whole thing together

fig, axes = plt.subplots(4, 1, figsize=(12, 20))

# Add some spacing
fig.subplots_adjust(hspace=0.3)

series = (θ_vec, μ_vec, γ_vec, M_vec)
names = r'$\theta$', r'$\mu$', r'$\gamma$', r'$M$'

for ax, vals, name in zip(axes, series, names):
    # Determine suitable y limits
    s_max, s_min = max(vals), min(vals)
    s_range = s_max - s_min
    y_max = s_max + s_range * 0.1
    y_min = s_min - s_range * 0.1
    ax.set_ylim(y_min, y_max)
    # Plot series
    ax.plot(range(sim_length), vals, alpha=0.6, lw=2)
    ax.set_title(f"time series for {name}", fontsize=16)
    ax.grid()

plt.show()


If you run the code above you’ll get different plots, of course.
Try experimenting with different parameters to see the effects on the time series.
(It would also be interesting to experiment with non-Gaussian distributions for the shocks, but this is a big exercise since
it takes us outside the world of the standard Kalman filter)

CHAPTER
SIXTYSIX
THE AIYAGARI MODEL
Contents
• The Aiyagari Model

– Overview
– The Economy
– Firms
– Code
66.1 Overview
In this lecture, we describe the structure of a class of models that build on work by Truman Bewley [Bewley, 1977].
We begin by discussing an example of a Bewley model due to Rao Aiyagari [Aiyagari, 1994].
The model features
• Heterogeneous agents
• A single exogenous vehicle for borrowing and lending
• Limits on amounts individual agents may borrow
The Aiyagari model has been used to investigate many topics, including
• precautionary savings and the effect of liquidity constraints [Aiyagari, 1994]
• risk sharing and asset pricing [Heaton and Lucas, 1996]
• the shape of the wealth distribution [Benhabib et al., 2015]
• etc., etc., etc.

import matplotlib.pyplot as plt
import numpy as np
from quantecon.markov import DiscreteDP
from numba import jit
66.1.1 References
The primary reference for this lecture is [Aiyagari, 1994].

A textbook treatment is available in chapter 18 of [Ljungqvist and Sargent, 2018].
A continuous time version of the model by SeHyoun Ahn and Benjamin Moll can be found here.
66.2 The Economy
66.2.1 Households
Infinitely lived households / consumers face idiosyncratic income shocks.

A unit interval of ex-ante identical households face a common borrowing constraint.
The savings problem faced by a typical household is
$$\max \mathbb{E} \sum_{t=0}^{\infty} \beta^t u(c_t)$$
subject to
$$a_{t+1} + c_t \leq w z_t + (1 + r) a_t, \quad c_t \geq 0, \quad \text{and} \quad a_t \geq -B$$
where
• 𝑐𝑡 is current consumption
• 𝑎𝑡 is assets
• 𝑧𝑡 is an exogenous component of labor income capturing stochastic unemployment risk, etc.
• 𝑤 is a wage rate
• 𝑟 is a net interest rate
• 𝐵 is the maximum amount that the agent is allowed to borrow
The exogenous process {𝑧𝑡 } follows a finite state Markov chain with given stochastic matrix 𝑃 .
The wage and interest rate are fixed over time.
In this simple version of the model, households supply labor inelastically because they do not value leisure.

66.3 Firms
Firms produce output by hiring capital and labor.

Firms act competitively and face constant returns to scale.
Since returns to scale are constant the number of firms does not matter.
Hence we can consider a single (but nonetheless competitive) representative firm.
The firm’s output is
$$Y_t = A K_t^{\alpha} N^{1 - \alpha}$$
where
• 𝐴 and 𝛼 are parameters with 𝐴 > 0 and 𝛼 ∈ (0, 1)
• 𝐾𝑡 is aggregate capital
• 𝑁 is total labor supply (which is constant in this simple version of the model)
The firm’s problem is
$$\max_{K, N} \left\{ A K_t^{\alpha} N^{1 - \alpha} - (r + \delta) K - w N \right\}$$
The parameter 𝛿 is the depreciation rate.

From the first-order condition with respect to capital, the firm’s inverse demand for capital is
$$r = A \alpha \left( \frac{N}{K} \right)^{1 - \alpha} - \delta \tag{66.1}$$
Using this expression and the firm’s first-order condition for labor, we can pin down the equilibrium wage rate as a function
of 𝑟 as
$$w(r) = A (1 - \alpha) \left( A \alpha / (r + \delta) \right)^{\alpha / (1 - \alpha)} \tag{66.2}$$
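As a sanity check on (66.1) and (66.2), the sketch below confirms numerically that the wage implied by (66.2) at the interest rate from (66.1) coincides with the marginal product of labor 𝐴(1 − 𝛼)(𝐾/𝑁)^𝛼. The helper `wage_from_foc` and the test value of 𝐾 are illustrative assumptions; the parameter values match those used in the code later in this lecture.

```python
import numpy as np

A, N, α, δ = 1.0, 1.0, 0.33, 0.05

def rd(K):
    "Inverse demand for capital, equation (66.1)."
    return A * α * (N / K)**(1 - α) - δ

def r_to_w(r):
    "Equilibrium wage as a function of r, equation (66.2)."
    return A * (1 - α) * (A * α / (r + δ))**(α / (1 - α))

def wage_from_foc(K):
    "Marginal product of labor at capital stock K (illustrative helper)."
    return A * (1 - α) * (K / N)**α

K = 8.0                          # Arbitrary test value for capital
w_via_r = r_to_w(rd(K))          # Wage computed through the interest rate
w_direct = wage_from_foc(K)      # Wage computed from the labor FOC
```

The two wages agree because (66.1) can be inverted to give 𝐾/𝑁 = (𝐴𝛼/(𝑟 + 𝛿))^(1/(1−𝛼)), which is then substituted into the first-order condition for labor.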
66.3.1 Equilibrium
We construct a stationary rational expectations equilibrium (SREE).

In such an equilibrium
• prices induce behavior that generates aggregate quantities consistent with the prices
• aggregate quantities and prices are constant over time
In more detail, an SREE lists a set of prices, savings and production policies such that
• households want to choose the specified savings policies taking the prices as given
• firms maximize profits taking the same prices as given
• the resulting aggregate quantities are consistent with the prices; in particular, the demand for capital equals the
supply
• aggregate quantities (defined as cross-sectional averages) are constant
In practice, once parameter values are set, we can check for an SREE by the following steps
1. pick a proposed quantity 𝐾 for aggregate capital

2. determine corresponding prices, with interest rate 𝑟 determined by (66.1) and a wage rate 𝑤(𝑟) as given in (66.2)
3. determine the common optimal savings policy of the households given these prices
4. compute aggregate capital as the mean of steady state capital given this savings policy
If this final quantity agrees with 𝐾 then we have an SREE.
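The four steps above amount to finding a zero of an excess supply function in 𝐾. The sketch below illustrates the logic with bisection. Note that `toy_capital_supply` is a purely hypothetical stand-in for steps 3 and 4 (solving the household problem and aggregating), chosen only so that the example runs on its own; it is not the household block of the Aiyagari model.

```python
A, N, α, δ = 1.0, 1.0, 0.33, 0.05

def rd(K):
    "Interest rate implied by capital K, equation (66.1)."
    return A * α * (N / K)**(1 - α) - δ

def toy_capital_supply(r):
    # HYPOTHETICAL stand-in for steps 3-4: an increasing function of r
    return 5.0 + 100.0 * r

def excess_supply(K):
    "Capital supplied at the prices implied by K, minus K itself."
    return toy_capital_supply(rd(K)) - K

def bisect(f, lo, hi, tol=1e-10, max_iter=200):
    "Find a root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid * f_lo > 0:
            lo, f_lo = mid, f_mid    # Root lies in the upper half
        else:
            hi = mid                 # Root lies in the lower half
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

K_star = bisect(excess_supply, 1.0, 20.0)
r_star = rd(K_star)
```

In the actual model each evaluation of the excess supply function requires solving a dynamic program, which is why the lecture first inspects the equilibrium graphically.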
66.4 Code
Let’s look at how we might compute such an equilibrium in practice.

To solve the household’s dynamic programming problem we’ll use the DiscreteDP class from QuantEcon.py.
Our first task is the least exciting one: write code that maps parameters for a household problem into the R and Q matrices
needed to generate an instance of DiscreteDP.
Below is a piece of boilerplate code that does just this.
In reading the code, the following information will be helpful
• R needs to be a matrix where R[s, a] is the reward at state s under action a.
• Q needs to be a three-dimensional array where Q[s, a, s'] is the probability of transitioning to state s' when
the current state is s and the current action is a.
(A more detailed discussion of DiscreteDP is available in the Discrete State Dynamic Programming lecture in the
Advanced Quantitative Economics with Python lecture series.)
Here we take the state to be 𝑠𝑡 ∶= (𝑎𝑡 , 𝑧𝑡 ), where 𝑎𝑡 is assets and 𝑧𝑡 is the shock.
The action is the choice of next period asset level 𝑎𝑡+1 .
We use Numba to speed up the loops so we can update the matrices efficiently when the parameters change.
The class also includes a default set of parameters that we’ll adopt unless otherwise specified.
class Household:
    """
    This class takes the parameters that define a household asset accumulation
    problem and computes the corresponding reward and transition matrices R
    and Q required to generate an instance of DiscreteDP, and thereby solve
    for the optimal policy.

    Comments on indexing: We need to enumerate the state space S as a sequence
    S = {0, ..., n-1}. To this end, (a_i, z_i) index pairs are mapped to s_i
    indices according to the rule

        s_i = a_i * z_size + z_i

    To invert this map, use

        a_i = s_i // z_size  (integer division)
        z_i = s_i % z_size

    """

    def __init__(self,
                 r=0.01,                      # Interest rate
                 w=1.0,                       # Wages
                 β=0.96,                      # Discount factor
                 a_min=1e-10,
                 Π=[[0.9, 0.1], [0.1, 0.9]],  # Markov chain
                 z_vals=[0.1, 1.0],           # Exogenous states
                 a_max=18,
                 a_size=200):

        # Store values, set up grids over a and z
        self.r, self.w, self.β = r, w, β
        self.a_min, self.a_max, self.a_size = a_min, a_max, a_size

        self.Π = np.asarray(Π)
        self.z_vals = np.asarray(z_vals)
        self.z_size = len(z_vals)

        self.a_vals = np.linspace(a_min, a_max, a_size)
        self.n = a_size * self.z_size

        # Build the array Q
        self.Q = np.zeros((self.n, a_size, self.n))
        self.build_Q()

        # Build the array R
        self.R = np.empty((self.n, a_size))
        self.build_R()

    def set_prices(self, r, w):
        """
        Use this method to reset prices. Calling the method will trigger a
        re-build of R.
        """
        self.r, self.w = r, w
        self.build_R()

    def build_Q(self):
        populate_Q(self.Q, self.a_size, self.z_size, self.Π)

    def build_R(self):
        self.R.fill(-np.inf)
        populate_R(self.R,
                   self.a_size,
                   self.z_size,
                   self.a_vals,
                   self.z_vals,
                   self.r,
                   self.w)
# Do the hard work using JIT-ed functions

@jit(nopython=True)
def populate_R(R, a_size, z_size, a_vals, z_vals, r, w):
    n = a_size * z_size
    for s_i in range(n):
        a_i = s_i // z_size
        z_i = s_i % z_size
        a = a_vals[a_i]
        z = z_vals[z_i]
        for new_a_i in range(a_size):
            a_new = a_vals[new_a_i]
            c = w * z + (1 + r) * a - a_new
            if c > 0:
                R[s_i, new_a_i] = np.log(c)  # Utility
@jit(nopython=True)
def populate_Q(Q, a_size, z_size, Π):
    n = a_size * z_size
    for s_i in range(n):
        z_i = s_i % z_size
        for a_i in range(a_size):
            for next_z_i in range(z_size):
                Q[s_i, a_i, a_i*z_size + next_z_i] = Π[z_i, next_z_i]
@jit(nopython=True)
def asset_marginal(s_probs, a_size, z_size):
    a_probs = np.zeros(a_size)
    for a_i in range(a_size):
        for z_i in range(z_size):
            a_probs[a_i] += s_probs[a_i*z_size + z_i]
    return a_probs
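To make the indexing rule concrete, here is a small pure NumPy sketch that builds Q for a tiny grid with three asset levels and two shock states, without Numba. The grid sizes are illustrative assumptions; the construction follows the rule s_i = a_i * z_size + z_i stated in the class docstring.

```python
import numpy as np

a_size, z_size = 3, 2                      # Tiny illustrative grid
Π = np.array([[0.9, 0.1], [0.1, 0.9]])     # Markov chain for z
n = a_size * z_size

Q = np.zeros((n, a_size, n))
for s_i in range(n):
    z_i = s_i % z_size                     # Current shock index
    for a_i in range(a_size):              # Action: next period asset index
        for next_z_i in range(z_size):
            Q[s_i, a_i, a_i * z_size + next_z_i] = Π[z_i, next_z_i]

# Recover (a_i, z_i) from each flat state index s_i
pairs = [(s_i // z_size, s_i % z_size) for s_i in range(n)]
```

Each Q[s, a, :] is a probability distribution over next period states: the action pins down the next asset index, and the row of Π for the current shock determines the next shock.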
As a first example of what we can do, let’s compute and plot an optimal accumulation policy at fixed prices.
# Example prices
r = 0.03
w = 0.956

# Create an instance of Household
am = Household(a_max=20, r=r, w=w)

# Use the instance to build a discrete dynamic program
am_ddp = DiscreteDP(am.R, am.Q, am.β)

# Solve using policy function iteration
results = am_ddp.solve(method='policy_iteration')

# Simplify names
z_size, a_size = am.z_size, am.a_size
z_vals, a_vals = am.z_vals, am.a_vals
n = a_size * z_size

# Get all optimal actions across the set of a indices with z fixed in each row
a_star = np.empty((z_size, a_size))
for s_i in range(n):
    a_i = s_i // z_size
    z_i = s_i % z_size
    a_star[z_i, a_i] = a_vals[results.sigma[s_i]]

fig, ax = plt.subplots(figsize=(9, 9))
ax.plot(a_vals, a_vals, 'k--')  # 45 degrees
for i in range(z_size):
    lb = f'$z = {z_vals[i]:.2}$'
    ax.plot(a_vals, a_star[i, :], lw=2, alpha=0.6, label=lb)
ax.set_xlabel('current assets')
ax.set_ylabel('next period assets')
ax.legend(loc='upper left')
plt.show()
The plot shows asset accumulation policies at different values of the exogenous state.
Now we want to calculate the equilibrium.
Let’s do this visually as a first pass.
The following code draws aggregate supply and demand curves.
The intersection gives equilibrium interest rates and capital.

A = 1.0
N = 1.0
α = 0.33
β = 0.96
δ = 0.05


def r_to_w(r):
    """
    Equilibrium wages associated with a given interest rate r.
    """
    return A * (1 - α) * (A * α / (r + δ))**(α / (1 - α))


def rd(K):
    """
    Inverse demand curve for capital. The interest rate associated with a
    given demand for capital K.
    """
    return A * α * (N / K)**(1 - α) - δ


def prices_to_capital_stock(am, r):
    """
    Map prices to the induced level of capital stock.

    Parameters:
    ----------
    am : Household
        An instance of an aiyagari_household.Household
    r : float
        The interest rate
    """
    w = r_to_w(r)
    am.set_prices(r, w)
    aiyagari_ddp = DiscreteDP(am.R, am.Q, β)
    # Compute the optimal policy
    results = aiyagari_ddp.solve(method='policy_iteration')
    # Compute the stationary distribution
    stationary_probs = results.mc.stationary_distributions[0]
    # Extract the marginal distribution for assets
    asset_probs = asset_marginal(stationary_probs, am.a_size, am.z_size)
    # Return K
    return np.sum(asset_probs * am.a_vals)


# Create an instance of Household
am = Household(a_max=20)

# Use the instance to build a discrete dynamic program
am_ddp = DiscreteDP(am.R, am.Q, am.β)

# Create a grid of r values at which to compute demand and supply of capital
num_points = 20
r_vals = np.linspace(0.005, 0.04, num_points)

# Compute supply of capital
k_vals = np.empty(num_points)
for i, r in enumerate(r_vals):
    k_vals[i] = prices_to_capital_stock(am, r)

# Plot against demand for capital by firms
fig, ax = plt.subplots(figsize=(11, 8))
ax.plot(k_vals, r_vals, lw=2, alpha=0.6, label='supply of capital')
ax.plot(k_vals, rd(k_vals), lw=2, alpha=0.6, label='demand for capital')
ax.grid()
ax.set_xlabel('capital')
ax.set_ylabel('interest rate')
ax.legend(loc='upper right')
plt.show()


Part XI
Asset Pricing and Finance
CHAPTER
SIXTYSEVEN
ASSET PRICING: FINITE STATE MODELS
Contents
• Asset Pricing: Finite State Models

– Overview
– Pricing Models
– Prices in the Risk-Neutral Case
– Risk Aversion and Asset Prices
– Exercises
“A little knowledge of geometric series goes a long way” – Robert E. Lucas, Jr.
“Asset pricing is all about covariances” – Lars Peter Hansen
67.1 Overview
An asset is a claim on one or more future payoffs.

The spot price of an asset depends primarily on
• the anticipated income stream
• attitudes about risk
• rates of time preference
In this lecture, we consider some standard pricing models and dividend stream specifications.
We study how prices and dividend-price ratios respond in these different scenarios.
We also look at creating and pricing derivative assets that repackage income streams.
Key tools for the lecture are
• Markov processes
• formulas for predicting future values of functions of a Markov state
• a formula for predicting the discounted sum of future values of a Markov state

import matplotlib.pyplot as plt
import numpy as np
import quantecon as qe
from numpy.linalg import eigvals, solve
67.2 Pricing Models
Let {𝑑𝑡 }𝑡≥0 be a stream of dividends

• A time-𝑡 cum-dividend asset is a claim to the stream 𝑑𝑡 , 𝑑𝑡+1 , ….
• A time-𝑡 ex-dividend asset is a claim to the stream 𝑑𝑡+1 , 𝑑𝑡+2 , ….
Let’s look at some equations that we expect to hold for prices of assets under ex-dividend contracts (we will consider
cum-dividend pricing in the exercises).
67.2.1 Risk-Neutral Pricing
Our first scenario is risk-neutral pricing.

Let 𝛽 = 1/(1 + 𝜌) be an intertemporal discount factor, where 𝜌 is the rate at which agents discount the future.
The basic risk-neutral asset pricing equation for pricing one unit of an ex-dividend asset is
𝑝𝑡 = 𝛽𝔼𝑡 [𝑑𝑡+1 + 𝑝𝑡+1 ] (67.1)

This is a simple “cost equals expected benefit” relationship.
Here 𝔼𝑡 [𝑦] denotes the best forecast of 𝑦, conditioned on information available at time 𝑡.
More precisely, 𝔼𝑡 [𝑦] is the mathematical expectation of 𝑦 conditional on information available at time 𝑡.
67.2.2 Pricing with Random Discount Factor
What happens if for some reason traders discount payouts differently depending on the state of the world?
Michael Harrison and David Kreps [Harrison and Kreps, 1979] and Lars Peter Hansen and Scott Richard [Hansen and
Richard, 1987] showed that in quite general settings the price of an ex-dividend asset obeys
𝑝𝑡 = 𝔼𝑡 [𝑚𝑡+1 (𝑑𝑡+1 + 𝑝𝑡+1 )] (67.2)
for some stochastic discount factor 𝑚𝑡+1 .
Here the fixed discount factor 𝛽 in (67.1) has been replaced by the random variable 𝑚𝑡+1 .
How anticipated future payoffs are evaluated now depends on statistical properties of 𝑚𝑡+1 .
The stochastic discount factor can be specified to capture the idea that assets that tend to have good payoffs in bad states
of the world are valued more highly than other assets whose payoffs don’t behave that way.
This is because such assets pay well when funds are more urgently wanted.
We give examples of how the stochastic discount factor has been modeled below.

67.2.3 Asset Pricing and Covariances
Recall that, from the definition of a conditional covariance cov𝑡 (𝑥𝑡+1 , 𝑦𝑡+1 ), we have
𝔼𝑡 (𝑥𝑡+1 𝑦𝑡+1 ) = cov𝑡 (𝑥𝑡+1 , 𝑦𝑡+1 ) + 𝔼𝑡 𝑥𝑡+1 𝔼𝑡 𝑦𝑡+1 (67.3)
If we apply this definition to the asset pricing equation (67.2) we obtain
𝑝𝑡 = 𝔼𝑡 𝑚𝑡+1 𝔼𝑡 (𝑑𝑡+1 + 𝑝𝑡+1 ) + cov𝑡 (𝑚𝑡+1 , 𝑑𝑡+1 + 𝑝𝑡+1 ) (67.4)
It is useful to regard equation (67.4) as a generalization of equation (67.1)

• In equation (67.1), the stochastic discount factor 𝑚𝑡+1 = 𝛽, a constant.
• In equation (67.1), the covariance term cov𝑡 (𝑚𝑡+1 , 𝑑𝑡+1 + 𝑝𝑡+1 ) is zero because 𝑚𝑡+1 = 𝛽.
• In equation (67.1), 𝔼𝑡 𝑚𝑡+1 can be interpreted as the reciprocal of the one-period risk-free gross interest rate.
• When 𝑚𝑡+1 covaries more negatively with the payout 𝑝𝑡+1 + 𝑑𝑡+1 , the price of the asset is lower.
Equation (67.4) asserts that the covariance of the stochastic discount factor with the one period payout 𝑑𝑡+1 + 𝑝𝑡+1 is an
important determinant of the price 𝑝𝑡 .
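A small numerical example may help fix ideas about (67.4). The three-state distribution, the values of the stochastic discount factor and the two payout profiles below are all hypothetical numbers chosen for illustration; state 0 plays the role of a bad state in which the discount factor is high.

```python
import numpy as np

probs = np.array([0.25, 0.5, 0.25])     # Hypothetical state probabilities
m = np.array([1.10, 0.95, 0.85])        # Hypothetical SDF, high in bad state 0
payout_risky = np.array([0.8, 1.0, 1.2])   # Pays poorly in the bad state
payout_hedge = np.array([1.2, 1.0, 0.8])   # Pays well in the bad state

def E(x):
    "Expectation under probs."
    return probs @ x

def cov(x, y):
    "Covariance under probs."
    return E((x - E(x)) * (y - E(y)))

def price_direct(payout):
    "Price from (67.2): E[m * payout]."
    return E(m * payout)

def price_decomposed(payout):
    "Price from (67.4): E[m] E[payout] + cov(m, payout)."
    return E(m) * E(payout) + cov(m, payout)

p_risky = price_direct(payout_risky)
p_hedge = price_direct(payout_hedge)
```

Both payouts have the same mean, so the difference in price comes entirely from the covariance term: the asset that pays well when the discount factor is high is the more valuable one.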
Later in this lecture, and again in a later lecture, we give examples of models of stochastic discount factors that have
been proposed.
67.2.4 The Price-Dividend Ratio
Aside from prices, another quantity of interest is the price-dividend ratio 𝑣𝑡 ∶= 𝑝𝑡 /𝑑𝑡 .
Let’s write down an expression that this ratio should satisfy.
We can divide both sides of (67.2) by 𝑑𝑡 to get
$$v_t = \mathbb{E}_t \left[ m_{t+1} \frac{d_{t+1}}{d_t} (1 + v_{t+1}) \right] \tag{67.5}$$
Below we’ll discuss the implication of this equation.
67.3 Prices in the Risk-Neutral Case
What can we say about price dynamics on the basis of the models described above?
The answer to this question depends on
1. the process we specify for dividends
2. the stochastic discount factor and how it correlates with dividends
For now we’ll study the risk-neutral case in which the stochastic discount factor is constant.
We’ll focus on how an asset price depends on a dividend process.

67.3.1 Example 1: Constant Dividends
The simplest case is the risk-neutral price of a constant, non-random dividend stream 𝑑𝑡 = 𝑑 > 0.
Removing the expectation from (67.1) and iterating forward gives
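$$
\begin{aligned}
p_t &= \beta (d + p_{t+1}) \\
    &= \beta (d + \beta (d + p_{t+2})) \\
    &\;\; \vdots \\
    &= \beta (d + \beta d + \beta^2 d + \cdots + \beta^{k-1} d + \beta^{k-1} p_{t+k})
\end{aligned}
$$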
If lim𝑘→+∞ 𝛽 𝑘−1 𝑝𝑡+𝑘 = 0, this sequence converges to
$$\bar{p} := \frac{\beta d}{1 - \beta} \tag{67.6}$$
This is the equilibrium price in the constant dividend case.
Indeed, simple algebra shows that setting 𝑝𝑡 = 𝑝̄ for all 𝑡 satisfies the difference equation 𝑝𝑡 = 𝛽(𝑑 + 𝑝𝑡+1 ).
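The following minimal sketch checks both claims numerically. The values of 𝛽 and 𝑑 and the truncation point `k` are illustrative assumptions.

```python
β, d = 0.96, 1.0       # Illustrative discount factor and dividend

# Equilibrium price from (67.6)
p_bar = β * d / (1 - β)

# Residual of the difference equation p = β (d + p) at p = p_bar
residual = p_bar - β * (d + p_bar)

# Present value of the first k dividends, β d + β² d + ... + β^k d
k = 500
partial_sum = sum(β**j * d for j in range(1, k + 1))
```

The residual is zero up to rounding error, and the truncated present value approaches the equilibrium price as `k` grows because the omitted tail is of order 𝛽^𝑘.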
67.3.2 Example 2: Dividends with Deterministic Growth Paths
Consider a growing, non-random dividend process 𝑑𝑡+1 = 𝑔𝑑𝑡 where 0 < 𝑔𝛽 < 1.
While prices are not usually constant when dividends grow over time, the price-dividend ratio can be.
If we guess this, substituting 𝑣𝑡 = 𝑣 into (67.5) as well as our other assumptions, we get 𝑣 = 𝛽𝑔(1 + 𝑣).
Since 𝛽𝑔 < 1, we have a unique positive solution:
$$v = \frac{\beta g}{1 - \beta g}$$
The price is then
$$p_t = \frac{\beta g}{1 - \beta g} d_t$$
If, in this example, we take 𝑔 = 1 + 𝜅 and let 𝜌 ∶= 1/𝛽 − 1, then the price becomes
$$p_t = \frac{1 + \kappa}{\rho - \kappa} d_t$$
This is called the Gordon formula.
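The algebra behind this substitution is short. With 𝛽 = 1/(1 + 𝜌) and 𝑔 = 1 + 𝜅, multiplying the numerator and denominator by 1 + 𝜌 gives

```latex
\frac{\beta g}{1 - \beta g}
  = \frac{\dfrac{1 + \kappa}{1 + \rho}}{1 - \dfrac{1 + \kappa}{1 + \rho}}
  = \frac{1 + \kappa}{(1 + \rho) - (1 + \kappa)}
  = \frac{1 + \kappa}{\rho - \kappa}
```

The condition 𝛽𝑔 < 1 is equivalent to 𝜅 < 𝜌, so the denominator is positive.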
67.3.3 Example 3: Markov Growth, Risk-Neutral Pricing
Next, we consider a dividend process
𝑑𝑡+1 = 𝑔𝑡+1 𝑑𝑡 (67.7)
The stochastic growth factor {𝑔𝑡 } is given by
𝑔𝑡 = 𝑔(𝑋𝑡 ), 𝑡 = 1, 2, …
where

1. {𝑋𝑡 } is a finite Markov chain with state space 𝑆 and transition probabilities
𝑃 (𝑥, 𝑦) ∶= ℙ{𝑋𝑡+1 = 𝑦 | 𝑋𝑡 = 𝑥} (𝑥, 𝑦 ∈ 𝑆)
2. 𝑔 is a given function on 𝑆 taking nonnegative values

You can think of
• 𝑆 as 𝑛 possible “states of the world” and 𝑋𝑡 as the current state.
• 𝑔 as a function that maps a given state 𝑋𝑡 into a growth of dividends factor 𝑔𝑡 = 𝑔(𝑋𝑡 ).
• ln 𝑔𝑡+1 = ln(𝑑𝑡+1 /𝑑𝑡 ) is the growth rate of dividends.
(For a refresher on notation and theory for finite Markov chains see this lecture)
The next figure shows a simulation, where
• {𝑋𝑡 } evolves as a discretized AR1 process produced using Tauchen’s method.
• 𝑔𝑡 = exp(𝑋𝑡 ), so that ln 𝑔𝑡 = 𝑋𝑡 is the growth rate.
n = 7
mc = qe.tauchen(n, 0.96, 0.25)
sim_length = 80

x_series = mc.simulate(sim_length, init=np.median(mc.state_values))
g_series = np.exp(x_series)
d_series = np.cumprod(g_series)  # Assumes d_0 = 1

series = [x_series, g_series, d_series, np.log(d_series)]
labels = ['$X_t$', '$g_t$', '$d_t$', r'$\log \, d_t$']

fig, axes = plt.subplots(2, 2)
for ax, s, label in zip(axes.flatten(), series, labels):
    ax.plot(s, 'b-', lw=2, label=label)
    ax.legend(loc='upper left', frameon=False)
plt.tight_layout()
plt.show()

Pricing Formula
To obtain asset prices in this setting, let’s adapt our analysis from the case of deterministic growth.
In that case, we found that 𝑣 is constant.
This encourages us to guess that, in the current case, 𝑣𝑡 is a fixed function of the state 𝑋𝑡 .
We seek a function 𝑣 such that the price-dividend ratio satisfies 𝑣𝑡 = 𝑣(𝑋𝑡 ).
We can substitute this guess into (67.5) to get
𝑣(𝑋𝑡 ) = 𝛽𝔼𝑡 [𝑔(𝑋𝑡+1 )(1 + 𝑣(𝑋𝑡+1 ))]
If we condition on 𝑋𝑡 = 𝑥, this becomes
$$v(x) = \beta \sum_{y \in S} g(y) (1 + v(y)) P(x, y)$$

or

$$v(x) = \beta \sum_{y \in S} K(x, y) (1 + v(y)) \quad \text{where} \quad K(x, y) := g(y) P(x, y) \tag{67.8}$$
Suppose that there are 𝑛 possible states 𝑥1 , … , 𝑥𝑛 .

We can then think of (67.8) as 𝑛 stacked equations, one for each state, and write it in matrix form as
𝑣 = 𝛽𝐾(𝟙 + 𝑣) (67.9)
Here
• 𝑣 is understood to be the column vector (𝑣(𝑥1 ), … , 𝑣(𝑥𝑛 ))′ .
• 𝐾 is the matrix (𝐾(𝑥𝑖 , 𝑥𝑗 ))1≤𝑖,𝑗≤𝑛 .
• 𝟙 is a column vector of ones.
When does equation (67.9) have a unique solution?
From the Neumann series lemma and Gelfand’s formula, equation (67.9) has a unique solution when 𝛽𝐾 has spectral
radius strictly less than one.
Thus, we require that the eigenvalues of 𝐾 be strictly less than 𝛽 −1 in modulus.
The solution is then
𝑣 = (𝐼 − 𝛽𝐾)−1 𝛽𝐾𝟙 (67.10)
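Before turning to a larger state space, here is a two-state sketch of (67.10). The transition matrix `P` and the growth factors `g` are hypothetical values chosen for illustration.

```python
import numpy as np

β = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])        # Hypothetical transition matrix
g = np.array([0.98, 1.03])        # Hypothetical growth factors g(x_1), g(x_2)

# K[i, j] = g(x_j) P(x_i, x_j); broadcasting scales column j by g[j]
K = P * g

# Check the spectral radius condition before solving
spectral_radius = np.max(np.abs(np.linalg.eigvals(K)))
if not spectral_radius < 1 / β:
    raise ValueError("Spectral radius condition fails")

# Solve (I - βK) v = βK 1 rather than forming the inverse explicitly
v = np.linalg.solve(np.eye(2) - β * K, β * K @ np.ones(2))
```

The solution satisfies 𝑣 = 𝛽𝐾(𝟙 + 𝑣) and is higher in the second state, where dividend growth is higher and persistent.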
67.3.4 Code
Let’s calculate and plot the price-dividend ratio at some parameters.

As before, we’ll generate {𝑋𝑡 } as a discretized AR1 process and set 𝑔𝑡 = exp(𝑋𝑡 ).
Here’s the code, including a test of the spectral radius condition
n = 25  # Size of state space
β = 0.9
mc = qe.tauchen(n, 0.96, 0.02)

K = mc.P * np.exp(mc.state_values)

warning_message = "Spectral radius condition fails"
assert np.max(np.abs(eigvals(K))) < 1 / β, warning_message

I = np.identity(n)
v = solve(I - β * K, β * K @ np.ones(n))

fig, ax = plt.subplots()
ax.plot(mc.state_values, v, 'g-o', lw=2, alpha=0.7, label='$v$')
ax.set_ylabel("price-dividend ratio")
ax.set_xlabel("state")
ax.legend(loc='upper left')
plt.show()
Why does the price-dividend ratio increase with the state?

The reason is that this Markov process is positively correlated, so high current states suggest high future states.
Moreover, dividend growth is increasing in the state.
The anticipation of high future dividend growth leads to a high price-dividend ratio.
67.4 Risk Aversion and Asset Prices
Now let’s turn to the case where agents are risk averse.
We’ll price several distinct assets, including
• An endowment stream
• A consol (a type of bond issued by the UK government in the 19th century)
• Call options on a consol

67.4.1 Pricing a Lucas Tree
Let’s start with a version of the celebrated asset pricing model of Robert E. Lucas, Jr. [Lucas, 1978].
Lucas considered an abstract pure exchange economy with these features:
• a single non-storable consumption good
• a Markov process that governs the total amount of the consumption good available each period
• a single tree that each period yields fruit that equals the total amount of consumption available to the economy
• a competitive market in shares in the tree that entitles their owners to corresponding shares of the dividend stream,
i.e., the fruit stream, yielded by the tree
• a representative consumer who in a competitive equilibrium
– consumes the economy’s entire endowment each period
– owns 100 percent of the shares in the tree
As in [Lucas, 1978], we suppose that the stochastic discount factor takes the form
$$m_{t+1} = \beta \frac{u'(c_{t+1})}{u'(c_t)} \tag{67.11}$$
where 𝑢 is a concave utility function and 𝑐𝑡 is time 𝑡 consumption of a representative consumer.
(A derivation of this expression is given in a later lecture)
Assume the existence of an endowment that follows growth process (67.7).
The asset being priced is a claim on the endowment process, i.e., the Lucas tree described above.
Following [Lucas, 1978], we suppose that in equilibrium the representative consumer’s consumption equals the aggregate
endowment, so that 𝑑𝑡 = 𝑐𝑡 for all 𝑡.
For utility, we’ll assume the constant relative risk aversion (CRRA) specification
$$ u(c) = \frac{c^{1-\gamma}}{1-\gamma} \quad \text{with } \gamma > 0 \tag{67.12} $$
When 𝛾 = 1 we let 𝑢(𝑐) = ln 𝑐.
Inserting the CRRA specification into (67.11) and using 𝑐𝑡 = 𝑑𝑡 gives
$$ m_{t+1} = \beta \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} = \beta g_{t+1}^{-\gamma} \tag{67.13} $$
Substituting this into (67.5) gives the price-dividend ratio formula
$$ v(X_t) = \beta \mathbb{E}_t \left[ g(X_{t+1})^{1-\gamma} (1 + v(X_{t+1})) \right] \tag{67.14} $$
Conditioning on $X_t = x$, we can write this as
$$ v(x) = \beta \sum_{y \in S} g(y)^{1-\gamma} (1 + v(y)) P(x, y) $$
If we let
$$ J(x, y) := g(y)^{1-\gamma} P(x, y) $$
then we can rewrite equation (67.14) in vector form as
$$ v = \beta J (\mathbb{1} + v) $$
Assuming that the spectral radius of $J$ is strictly less than $\beta^{-1}$, this equation has the unique solution
$$ v = (I - \beta J)^{-1} \beta J \mathbb{1} \tag{67.15} $$
We will define a function tree_price to compute 𝑣 given parameters stored in the class AssetPriceModel
class AssetPriceModel:
    """
    A class that stores the primitives of the asset pricing model.

    Parameters
    ----------
    β : scalar, float
        Discount factor
    mc : MarkovChain
        Contains the transition matrix and set of state values for the state
        process
    γ : scalar(float)
        Coefficient of risk aversion
    g : callable
        The function mapping states to growth rates
    """
    def __init__(self, β=0.96, mc=None, γ=2.0, g=np.exp):
        self.β, self.γ = β, γ
        self.g = g

        # A default process for the Markov chain
        # (n is the global number of states used earlier in the lecture)
        if mc is None:
            self.ρ = 0.9
            self.σ = 0.02
            self.mc = qe.tauchen(n, self.ρ, self.σ)
        else:
            self.mc = mc

        self.n = self.mc.P.shape[0]

    def test_stability(self, Q):
        """
        Stability test for a given matrix Q.
        """
        sr = np.max(np.abs(eigvals(Q)))
        if not sr < 1 / self.β:
            msg = f"Spectral radius condition failed with radius = {sr}"
            raise ValueError(msg)
def tree_price(ap):
    """
    Computes the price-dividend ratio of the Lucas tree.

    Parameters
    ----------
    ap: AssetPriceModel
        An instance of AssetPriceModel containing primitives

    Returns
    -------
    v : array_like(float)
        Lucas tree price-dividend ratio
    """
    # Simplify names, set up matrices
    β, γ, P, y = ap.β, ap.γ, ap.mc.P, ap.mc.state_values
    J = P * ap.g(y)**(1 - γ)

    # Make sure that a unique solution exists
    ap.test_stability(J)

    # Compute v
    I = np.identity(ap.n)
    Ones = np.ones(ap.n)
    v = solve(I - β * J, β * J @ Ones)

    return v
Here’s a plot of 𝑣 as a function of the state for several values of 𝛾, with a positively correlated Markov process and
𝑔(𝑥) = exp(𝑥)
γs = [1.2, 1.4, 1.6, 1.8, 2.0]
ap = AssetPriceModel()
states = ap.mc.state_values

fig, ax = plt.subplots()
for γ in γs:
    ap.γ = γ
    v = tree_price(ap)
    ax.plot(states, v, lw=2, alpha=0.6, label=rf"$\gamma = {γ}$")

ax.set_title('Price-dividend ratio as a function of the state')
ax.set_ylabel("price-dividend ratio")
ax.set_xlabel("state")
ax.legend()
plt.show()

Notice that 𝑣 is decreasing in each case.

This is because, with a positively correlated state process, higher states indicate higher future consumption growth.
With the stochastic discount factor (67.13), higher growth decreases the discount factor, lowering the weight placed on
future dividends.
Special Cases
In the special case 𝛾 = 1, we have 𝐽 = 𝑃 .

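Recalling that $P^i \mathbb{1} = \mathbb{1}$ for all $i$ and applying Neumann’s geometric series lemma, we are led to
$$ v = \beta (I - \beta P)^{-1} \mathbb{1} = \beta \sum_{i=0}^{\infty} \beta^i P^i \mathbb{1} = \beta \frac{1}{1-\beta} \mathbb{1} $$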
Thus, with log preferences, the price-dividend ratio for a Lucas tree is constant.
Alternatively, if 𝛾 = 0, then 𝐽 = 𝐾 and we recover the risk-neutral solution (67.10).
This is as expected, since 𝛾 = 0 implies 𝑢(𝑐) = 𝑐 (and hence agents are risk-neutral).
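As a quick numerical check of the log-preference case, here is a minimal sketch. The two-state chain, the state values and the helper name price_dividend_ratio are hypothetical choices made for illustration; they are not the lecture's default model.

```python
import numpy as np

# Hypothetical two-state example (illustrative values, not from the lecture)
β = 0.96
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])        # transition matrix, rows sum to one
x = np.array([-0.02, 0.03])       # state values
g = np.exp(x)                     # growth rate g(x) = exp(x) in each state

def price_dividend_ratio(γ):
    "Solve v = βJ(1 + v) with J(x, y) = g(y)^(1-γ) P(x, y)."
    J = P * g**(1 - γ)            # scales column y of P by g(y)^(1-γ)
    I = np.identity(len(x))
    return np.linalg.solve(I - β * J, β * J @ np.ones(len(x)))

v_log = price_dividend_ratio(1.0)      # γ = 1, so J = P
v_neutral = price_dividend_ratio(0.0)  # γ = 0, so J = K

print(v_log, β / (1 - β))
```

With γ = 1 both entries of v_log should agree with β/(1 − β), while v_neutral varies with the state.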
67.4.2 A Risk-Free Consol
Consider the same pure exchange representative agent economy.

A risk-free consol promises to pay a constant amount 𝜁 > 0 each period.
Recycling notation, let 𝑝𝑡 now be the price of an ex-coupon claim to the consol.
An ex-coupon claim to the consol entitles an owner at the end of period 𝑡 to
• 𝜁 in period 𝑡 + 1, plus
• the right to sell the claim for 𝑝𝑡+1 next period
The price satisfies (67.2) with 𝑑𝑡 = 𝜁, or
𝑝𝑡 = 𝔼𝑡 [𝑚𝑡+1 (𝜁 + 𝑝𝑡+1 )]

With the stochastic discount factor (67.13), this becomes

$$ p_t = \mathbb{E}_t \left[ \beta g_{t+1}^{-\gamma} (\zeta + p_{t+1}) \right] \tag{67.16} $$
Guessing a solution of the form $p_t = p(X_t)$ and conditioning on $X_t = x$, we get
$$ p(x) = \beta \sum_{y \in S} g(y)^{-\gamma} (\zeta + p(y)) P(x, y) $$
Letting $M(x, y) = P(x, y) g(y)^{-\gamma}$ and rewriting in vector notation yields the solution
$$ p = (I - \beta M)^{-1} \beta M \zeta \mathbb{1} \tag{67.17} $$
The above is implemented in the function consol_price.
def consol_price(ap, ζ):
    """
    Computes price of a consol bond with payoff ζ

    Parameters
    ----------
    ap: AssetPriceModel
        An instance of AssetPriceModel containing primitives
    ζ : scalar(float)
        Coupon of the consol

    Returns
    -------
    p : array_like(float)
        Consol bond prices
    """
    # Simplify names, set up matrices
    β, γ, P, y = ap.β, ap.γ, ap.mc.P, ap.mc.state_values
    M = P * ap.g(y)**(- γ)

    # Make sure that a unique solution exists
    ap.test_stability(M)

    # Compute price
    I = np.identity(ap.n)
    Ones = np.ones(ap.n)
    p = solve(I - β * M, β * ζ * M @ Ones)

    return p
67.4.3 Pricing an Option to Purchase the Consol
Let’s now price options of various maturities.

We’ll study an option that gives the owner the right to purchase a consol at a price 𝑝𝑆 .

An Infinite Horizon Call Option
We want to price an infinite horizon option to purchase a consol at a price 𝑝𝑆 .

The option entitles the owner at the beginning of a period either
1. to purchase the bond at price 𝑝𝑆 now, or
2. not to exercise the option to purchase the asset now but to retain the right to exercise it later
Thus, the owner either exercises the option now or chooses not to exercise and wait until next period.
This is termed an infinite-horizon call option with strike price 𝑝𝑆 .
The owner of the option is entitled to purchase the consol at price 𝑝𝑆 at the beginning of any period, after the coupon has
been paid to the previous owner of the bond.
The fundamentals of the economy are identical with the one above, including the stochastic discount factor and the process
for consumption.
Let 𝑤(𝑋𝑡 , 𝑝𝑆 ) be the value of the option when the time 𝑡 growth state is known to be 𝑋𝑡 but before the owner has decided
whether to exercise the option at time 𝑡 (i.e., today).
Recalling that 𝑝(𝑋𝑡 ) is the value of the consol when the initial growth state is 𝑋𝑡 , the value of the option satisfies
$$ w(X_t, p_S) = \max \left\{ \beta \, \mathbb{E}_t \frac{u'(c_{t+1})}{u'(c_t)} w(X_{t+1}, p_S), \; p(X_t) - p_S \right\} $$
The first term on the right is the value of waiting, while the second is the value of exercising now.
We can also write this as
$$ w(x, p_S) = \max \left\{ \beta \sum_{y \in S} P(x, y) g(y)^{-\gamma} w(y, p_S), \; p(x) - p_S \right\} \tag{67.18} $$
With $M(x, y) = P(x, y) g(y)^{-\gamma}$ and $w$ as the vector of values $(w(x_i, p_S))_{i=1}^n$, we can express (67.18) as the nonlinear vector equation
$$ w = \max\{\beta M w, \; p - p_S \mathbb{1}\} \tag{67.19} $$
To solve (67.19), form an operator $T$ that maps vector $w$ into vector $Tw$ via
$$ Tw = \max\{\beta M w, \; p - p_S \mathbb{1}\} $$
Start at some initial $w$ and iterate with $T$ to convergence.

We can find the solution with the following function call_option
def call_option(ap, ζ, p_s, ϵ=1e-7):
    """
    Computes price of a call option on a consol bond.

    Parameters
    ----------
    ap: AssetPriceModel
        An instance of AssetPriceModel containing primitives
    ζ : scalar(float)
        Coupon of the consol
    p_s : scalar(float)
        Strike price
    ϵ : scalar(float), optional(default=1e-7)
        Tolerance for infinite horizon problem

    Returns
    -------
    w : array_like(float)
        Infinite horizon call option prices
    """
    # Simplify names, set up matrices
    β, γ, P, y = ap.β, ap.γ, ap.mc.P, ap.mc.state_values
    M = P * ap.g(y)**(- γ)

    # Make sure that a unique consol price exists
    ap.test_stability(M)

    # Compute option price
    p = consol_price(ap, ζ)
    w = np.zeros(ap.n)
    error = ϵ + 1
    while error > ϵ:
        # Elementwise maximum of continuation value and exercise value
        w_new = np.maximum(β * M @ w, p - p_s)
        # Find maximal difference of each component and update
        error = np.amax(np.abs(w - w_new))
        w = w_new

    return w
Here’s a plot of $w$ compared to the consol price when $p_S = 40$
ap = AssetPriceModel(β=0.9)
ζ = 1.0
strike_price = 40

x = ap.mc.state_values
p = consol_price(ap, ζ)
w = call_option(ap, ζ, strike_price)

fig, ax = plt.subplots()
ax.plot(x, p, 'b-', lw=2, label='consol price')
ax.plot(x, w, 'g-', lw=2, label='value of call option')
ax.set_xlabel("state")
ax.legend()
plt.show()

In high values of the Markov growth state, the value of the option is close to zero.
This is despite the facts that the Markov chain is irreducible and that low states — where the consol prices are high —
will be visited recurrently.
The reason for low valuations in high Markov growth states is that 𝛽 = 0.9, so future payoffs are discounted substantially.
67.4.4 Risk-Free Rates
Let’s look at risk-free interest rates over different periods.
The One-period Risk-free Interest Rate

As before, the stochastic discount factor is $m_{t+1} = \beta g_{t+1}^{-\gamma}$.
It follows that the reciprocal $R_t^{-1}$ of the gross risk-free interest rate $R_t$ in state $x$ is
$$ \mathbb{E}_t m_{t+1} = \beta \sum_{y \in S} P(x, y) g(y)^{-\gamma} $$
We can write this as
𝑚1 = 𝛽𝑀 𝟙
where the 𝑖-th element of 𝑚1 is the reciprocal of the one-period gross risk-free interest rate in state 𝑥𝑖 .
Other Terms
Let 𝑚𝑗 be an 𝑛 × 1 vector whose 𝑖 th component is the reciprocal of the 𝑗 -period gross risk-free interest rate in state 𝑥𝑖 .
Then $m_1 = \beta M \mathbb{1}$, and $m_{j+1} = \beta M m_j$ for $j \geq 1$, so that $m_j = (\beta M)^j \mathbb{1}$.
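The recursion for multi-period rates can be sketched as follows. We assume that the $j$-period discount is the product of $j$ one-period stochastic discount factors, so that each extra period applies $\beta M$ once more to a vector of ones. The two-state chain, the parameter values and the function name inverse_gross_rates are hypothetical.

```python
import numpy as np

# Hypothetical primitives (illustrative values, not from the lecture)
β, γ = 0.96, 2.0
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
g = np.exp(np.array([-0.02, 0.03]))   # growth rate in each state
M = P * g**(-γ)                       # M(x, y) = P(x, y) g(y)^(-γ)

def inverse_gross_rates(j_max):
    "Return an array whose row j-1 is m_j, for j = 1, ..., j_max."
    m = np.ones(P.shape[0])
    rows = []
    for _ in range(j_max):
        m = β * M @ m                 # one more period of discounting
        rows.append(m)
    return np.array(rows)

m = inverse_gross_rates(3)
R1 = 1 / m[0]                         # one-period gross risk-free rates by state
print(R1)
```

Row j-1 of m holds the reciprocal of the j-period gross risk-free rate in each state under this assumption.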

67.5 Exercises
Exercise 67.5.1
In the lecture, we considered ex-dividend assets.
A cum-dividend asset is a claim to the stream 𝑑𝑡 , 𝑑𝑡+1 , ….
Following (67.1), find the risk-neutral asset pricing equation for one unit of a cum-dividend asset.
With a constant, non-random dividend stream 𝑑𝑡 = 𝑑 > 0, what is the equilibrium price of a cum-dividend asset?
With a growing, non-random dividend process $d_{t+1} = g d_t$ where $0 < g\beta < 1$, what is the equilibrium price of a cum-dividend asset?

For a cum-dividend asset, the basic risk-neutral asset pricing equation is
$$ p_t = d_t + \beta \mathbb{E}_t [p_{t+1}] $$
With constant dividends, the equilibrium price is
$$ p_t = \frac{1}{1-\beta} d_t $$
With a growing, non-random dividend process, the equilibrium price is
$$ p_t = \frac{1}{1 - \beta g} d_t $$
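These closed forms can be checked against a truncated discounted sum. The sketch below uses hypothetical values for β, g and d, chosen so that 0 < gβ < 1, and a hypothetical helper truncated_price.

```python
# Hypothetical parameter values with 0 < gβ < 1 (not from the lecture)
β, g, d = 0.95, 1.02, 1.0

def truncated_price(d, growth, horizon=5_000):
    "Sum of β^j * growth^j * d over j = 0, ..., horizon - 1."
    return sum((β * growth)**j * d for j in range(horizon))

p_const = truncated_price(d, 1.0)   # constant dividends
p_grow = truncated_price(d, g)      # dividends growing at gross rate g

print(p_const, d / (1 - β))
print(p_grow, d / (1 - β * g))
```

The truncation horizon is long enough that the omitted tail is negligible for these parameter values.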
Exercise 67.5.2
Consider the following primitives
n = 5 # Size of State Space

P = np.full((n, n), 0.0125)
P[range(n), range(n)] += 1 - P.sum(1)
# State values of the Markov chain
s = np.array([0.95, 0.975, 1.0, 1.025, 1.05])
γ = 2.0
β = 0.94
Let 𝑔 be defined by 𝑔(𝑥) = 𝑥 (that is, 𝑔 is the identity map).

Compute the price of the Lucas tree.
Do the same for
• the price of the risk-free consol when 𝜁 = 1
• the call option on the consol when 𝜁 = 1 and 𝑝𝑆 = 150.0

First, let’s enter the parameters:

n = 5
P = np.full((n, n), 0.0125)
P[range(n), range(n)] += 1 - P.sum(1)
s = np.array([0.95, 0.975, 1.0, 1.025, 1.05]) # State values
mc = qe.MarkovChain(P, state_values=s)
γ = 2.0
β = 0.94
ζ = 1.0
p_s = 150.0
Next, we’ll create an instance of AssetPriceModel to feed into the functions
apm = AssetPriceModel(β=β, mc=mc, γ=γ, g=lambda x: x)
Now we just need to call the relevant functions on the data:
tree_price(apm)
array([29.47401578, 21.93570661, 17.57142236, 14.72515002, 12.72221763])
consol_price(apm, ζ)
array([753.87100476, 242.55144082, 148.67554548, 109.25108965,
       87.56860139])
call_option(apm, ζ, p_s)
array([603.87100476, 176.8393343 , 108.67734499, 80.05179254,
       64.30843748])
Let’s show the last two functions as a plot
fig, ax = plt.subplots()
ax.plot(s, consol_price(apm, ζ), label='consol')
ax.plot(s, call_option(apm, ζ, p_s), label='call option')
ax.legend()
plt.show()

Exercise 67.5.3
Let’s consider finite horizon call options, which are more common than infinite horizon ones.
Finite horizon options obey functional equations closely related to (67.18).
A 𝑘 period option expires after 𝑘 periods.
If we view today as date zero, a 𝑘 period option gives the owner the right to exercise the option to purchase the risk-free
consol at the strike price 𝑝𝑆 at dates 0, 1, … , 𝑘 − 1.
The option expires at time 𝑘.
Thus, for 𝑘 = 1, 2, …, let 𝑤(𝑥, 𝑘) be the value of a 𝑘-period option.
It obeys
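$$ w(x, k) = \max \left\{ \beta \sum_{y \in S} P(x, y) g(y)^{-\gamma} w(y, k-1), \; p(x) - p_S \right\} $$
where $w(x, 0) = 0$ for all $x$.
We can express this as a sequence of nonlinear vector equations
$$ w_k = \max\{\beta M w_{k-1}, \; p - p_S \mathbb{1}\}, \quad k = 1, 2, \ldots \quad \text{with } w_0 = 0 $$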
Write a function that computes 𝑤𝑘 for any given 𝑘.

Compute the value of the option with k = 5 and k = 25 using parameter values as in Exercise 67.5.2.
Is one higher than the other? Can you give intuition?

Here’s a suitable function:

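def finite_horizon_call_option(ap, ζ, p_s, k):
    """
    Computes k period option value.
    """
    # Simplify names, set up matrices
    β, γ, P, y = ap.β, ap.γ, ap.mc.P, ap.mc.state_values
    M = P * ap.g(y)**(- γ)

    # Make sure that a unique solution exists
    ap.test_stability(M)

    # Compute option price
    p = consol_price(ap, ζ)
    w = np.zeros(ap.n)
    for i in range(k):
        # Elementwise maximum of continuation value and exercise value
        w = np.maximum(β * M @ w, p - p_s)

    return w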
Now let’s compute the option values at k=5 and k=25
fig, ax = plt.subplots()
for k in [5, 25]:
    w = finite_horizon_call_option(apm, ζ, p_s, k)
    ax.plot(s, w, label=rf'$k = {k}$')
ax.legend()
plt.show()
Not surprisingly, options with larger 𝑘 are worth more.

This is because an owner has a longer horizon over which the option can be exercised.


CHAPTER
SIXTYEIGHT
COMPETITIVE EQUILIBRIA WITH ARROW SECURITIES
68.1 Introduction
This lecture presents Python code for experimenting with competitive equilibria of an infinite-horizon pure exchange
economy with
• Heterogeneous agents
• Endowments of a single consumption good that are person-specific functions of a common Markov state
• Complete markets in one-period Arrow state-contingent securities
• Discounted expected utility preferences of a kind often used in macroeconomics and finance
• Common expected utility preferences across agents
• Common beliefs across agents
• A constant relative risk aversion (CRRA) one-period utility function that implies the existence of a representative
consumer whose consumption process can be plugged into a formula for the pricing kernel for one-step Arrow
securities and thereby determine equilibrium prices before determining an equilibrium distribution of wealth
Diverse endowments across agents provide motivations for individuals to want to reallocate consumption goods across
time and Markov states.
We impose restrictions that allow us to Bellmanize competitive equilibrium prices and quantities.
We use Bellman equations to describe
• asset prices
• continuation wealth levels for each person
• state-by-state natural debt limits for each person
In the course of presenting the model we shall describe these important ideas
• a resolvent operator widely used in this class of models
• absence of borrowing limits in finite horizon economies
• state-by-state borrowing limits required in infinite horizon economies
• a counterpart of the law of iterated expectations known as a law of iterated values
• a state-variable degeneracy that prevails within a competitive equilibrium and that opens the way to various
appearances of resolvent operators
68.2 The setting
In effect, this lecture implements a Python version of the model presented in section 9.3.3 of Ljungqvist and Sargent
[Ljungqvist and Sargent, 2018].
68.2.1 Preferences and endowments
In each period $t \geq 0$, a stochastic event $s_t \in \mathbf{S}$ is realized.
Let the history of events up until time $t$ be denoted $s^t = [s_0, s_1, \ldots, s_{t-1}, s_t]$.
(Sometimes we inadvertently reverse the recording order and denote a history as $s^t = [s_t, s_{t-1}, \ldots, s_1, s_0]$.)
The unconditional probability of observing a particular sequence of events $s^t$ is given by a probability measure $\pi_t(s^t)$.
For $t > \tau$, we write the probability of observing $s^t$ conditional on the realization of $s^\tau$ as $\pi_t(s^t | s^\tau)$.
We assume that trading occurs after observing $s_0$, which we capture by setting $\pi_0(s_0) = 1$ for the initially given value of $s_0$.
In this lecture we shall follow much of macroeconomics and econometrics and assume that $\pi_t(s^t)$ is induced by a Markov process.
There are $K$ consumers named $k = 1, \ldots, K$.
Consumer $k$ owns a stochastic endowment of one good $y_t^k(s^t)$ that depends on the history $s^t$.
The history $s^t$ is publicly observable.
Consumer $k$ purchases a history-dependent consumption plan $c^k = \{c_t^k(s^t)\}_{t=0}^\infty$.
Consumer $k$ orders consumption plans by
$$ U_k(c^k) = \sum_{t=0}^\infty \sum_{s^t} \beta^t u_k[c_t^k(s^t)] \pi_t(s^t), $$
where $0 < \beta < 1$.

The right side is equal to $E_0 \sum_{t=0}^\infty \beta^t u_k(c_t^k)$, where $E_0$ is the mathematical expectation operator, conditioned on $s_0$.
Here $u_k(c)$ is an increasing, twice continuously differentiable, strictly concave function of consumption $c \geq 0$ of one good.
The utility function of person $k$ satisfies the Inada condition
$$ \lim_{c \downarrow 0} u'_k(c) = +\infty. $$
This condition implies that each agent chooses strictly positive consumption for every date-history pair $(t, s^t)$.
Those interior solutions enable us to confine our analysis to Euler equations that hold with equality and also guarantee
that natural debt limits don’t bind in economies like ours with sequential trading of Arrow securities.
We adopt the assumption, routinely employed in much of macroeconomics, that consumers share probabilities $\pi_t(s^t)$ for all $t$ and $s^t$.
A feasible allocation satisfies
$$ \sum_k c_t^k(s^t) \leq \sum_k y_t^k(s^t) $$
for all $t$ and for all $s^t$.
68.3 Recursive Formulation
Following the descriptions in section 9.3.3 of Ljungqvist and Sargent [Ljungqvist and Sargent, 2018], we set up a
competitive equilibrium of a pure exchange economy with complete markets in one-period Arrow securities.
When endowments $y^k(s)$ are all functions of a common Markov state $s$, the pricing kernel takes the form $Q(s'|s)$, where
$Q(s'|s)$ is the price of one unit of consumption in state $s'$ at date $t+1$ when the Markov state at date $t$ is $s$.
This enables us to provide a recursive formulation of a consumer’s optimization problem.
Consumer $k$’s state at time $t$ is its financial wealth $a_t^k$ and Markov state $s_t$.
Let $v^k(a, s)$ be the optimal value of consumer $k$’s problem starting from state $(a, s)$.
• $v^k(a, s)$ is the maximum expected discounted utility that consumer $k$ with current financial wealth $a$ can attain in
Markov state $s$.
The optimal value function satisfies the Bellman equation
$$ v^k(a, s) = \max_{c, \hat a(s')} \left\{ u_k(c) + \beta \sum_{s'} v^k[\hat a(s'), s'] \pi(s'|s) \right\} $$
where maximization is subject to the budget constraint
$$ c + \sum_{s'} \hat a(s') Q(s'|s) \leq y^k(s) + a $$
and also the constraints
$$ c \geq 0, \qquad -\hat a(s') \leq \bar A^k(s'), \quad \forall s' \in \mathbf{S} $$
with the second constraint evidently being a set of state-by-state debt limits.
Note that the value function and decision rule that solve the Bellman equation implicitly depend on the pricing kernel
𝑄(⋅|⋅) because it appears in the agent’s budget constraint.
Use the first-order conditions for the problem on the right of the Bellman equation and a Benveniste-Scheinkman formula
and rearrange to get
$$ Q(s_{t+1}|s_t) = \frac{\beta u'_k(c_{t+1}^k) \pi(s_{t+1}|s_t)}{u'_k(c_t^k)}, $$
where it is understood that $c_t^k = c^k(s_t)$ and $c_{t+1}^k = c^k(s_{t+1})$.
A recursive competitive equilibrium is an initial distribution of wealth $\vec a_0$, a set of borrowing limits $\{\bar A^k(s)\}_{k=1}^K$, a
pricing kernel $Q(s'|s)$, sets of value functions $\{v^k(a, s)\}_{k=1}^K$, and decision rules $\{c^k(s), a^k(s)\}_{k=1}^K$ such that
• The state-by-state borrowing constraints satisfy the recursion
$$ \bar A^k(s) = y^k(s) + \sum_{s'} Q(s'|s) \bar A^k(s') $$
• For all $k$, given $a_0^k$, $\bar A^k(s)$, and the pricing kernel, the value functions and decision rules solve the consumers’
problems;
• For all realizations of $\{s_t\}_{t=0}^\infty$, the consumption and asset portfolios $\{\{c_t^k, \{\hat a_{t+1}^k(s')\}_{s'}\}_k\}_t$ satisfy
$\sum_k c_t^k = \sum_k y^k(s_t)$ and $\sum_k \hat a_{t+1}^k(s') = 0$ for all $t$ and $s'$.
• The initial financial wealth vector $\vec a_0$ satisfies $\sum_{k=1}^K a_0^k = 0$.
The third condition asserts that there are zero net aggregate claims in all Markov states.
The fourth condition asserts that the economy is closed and starts from a situation in which there are zero net aggregate
claims.
68.4 State Variable Degeneracy
Please see Ljungqvist and Sargent [Ljungqvist and Sargent, 2018] for a description of a timing protocol for trades consistent
with an Arrow-Debreu vision in which
• at time 0 there are complete markets in a complete menu of history $s^t$-contingent claims on consumption at all
dates
• all trades occur once and for all at time 0
If an allocation and pricing kernel $Q$ in a recursive competitive equilibrium are to be consistent with the equilibrium
allocation and price system that prevail in a corresponding complete markets economy with such history-contingent
commodities and all trades occurring at time 0, we must impose that $a_0^k = 0$ for $k = 1, \ldots, K$.
That is what assures that at time 0 the present value of each agent’s consumption equals the present value of his endowment
stream, the single budget constraint in the arrangement with all trades occurring at time 0.
Starting the system with $a_0^k = 0$ for all $k$ has a striking implication that we can call state variable degeneracy.
Here is what we mean by state variable degeneracy:
Although two state variables $a, s$ appear in the value function $v^k(a, s)$, within a recursive competitive equilibrium starting
from $a_0^k = 0$ for all $k$ at initial Markov state $s_0$, two outcomes prevail:
• $a_t^k = 0$ for all $k$ whenever the Markov state $s_t$ returns to $s_0$.
• Financial wealth 𝑎 is an exact function of the Markov state 𝑠.
The first finding asserts that each household recurrently visits the zero financial wealth state with which it began life.
The second finding asserts that within a competitive equilibrium the exogenous Markov state is all we require to track an
individual.
Financial wealth turns out to be redundant because it is an exact function of the Markov state for each individual.
This outcome depends critically on there being complete markets in Arrow securities.
For example, it does not prevail in the incomplete markets setting of the lecture The Aiyagari Model.
68.5 Markov Asset Prices
Let’s start with a brief summary of formulas for computing asset prices in a Markov setting.
The setup assumes the following infrastructure
• Markov states: $s \in S = [\bar s_1, \ldots, \bar s_n]$ governed by an $n$-state Markov chain with transition probability
$$ P_{ij} = \Pr\{s_{t+1} = \bar s_j \mid s_t = \bar s_i\} $$
• A collection $h = 1, \ldots, H$ of $n \times 1$ vectors of $H$ assets that pay off $d^h(s)$ in state $s$
• An $n \times n$ matrix pricing kernel $Q$ for one-period Arrow securities, where $Q_{ij}$ = price at time $t$ in state $s_t = \bar s_i$ of
one unit of consumption when $s_{t+1} = \bar s_j$ at time $t+1$:
$$ Q_{ij} = \text{Price}\{s_{t+1} = \bar s_j \mid s_t = \bar s_i\} $$
• The price of a risk-free one-period bond in state $i$ is $R_i^{-1} = \sum_j Q_{i,j}$
• The gross rate of return on a one-period risk-free bond in Markov state $\bar s_i$ is $R_i = (\sum_j Q_{i,j})^{-1}$

68.5.1 Exogenous Pricing Kernel
At this point, we’ll take the pricing kernel 𝑄 as exogenous, i.e., determined outside the model
Two examples would be
• 𝑄 = 𝛽𝑃 where 𝛽 ∈ (0, 1)
• 𝑄 = 𝑆𝑃 where 𝑆 is an 𝑛 × 𝑛 matrix of stochastic discount factors
We’ll write down implications of Markov asset pricing in a nutshell for two types of assets
• the price in Markov state $s$ at time $t$ of a cum dividend stock that entitles the owner at the beginning of time
$t$ to the time $t$ dividend and the option to sell the asset at time $t+1$. The price evidently satisfies
$p^h(\bar s_i) = d^h(\bar s_i) + \sum_j Q_{ij} p^h(\bar s_j)$, which implies that the vector $p^h$ satisfies $p^h = d^h + Q p^h$, which implies the formula
$$ p^h = (I - Q)^{-1} d^h $$
• the price in Markov state $s$ at time $t$ of an ex dividend stock that entitles the owner at the end of time $t$ to the time
$t+1$ dividend and the option to sell the stock at time $t+1$. The price is
$$ p^h = (I - Q)^{-1} Q d^h $$
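The two pricing formulas can be sketched in a few lines. Here we take the first example kernel, $Q = \beta P$; the two-state transition matrix and the payoff vector d are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical two-state example (illustrative values)
β = 0.95
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
Q = β * P                       # exogenous kernel Q = βP
d = np.array([1.0, 2.0])        # payoff d(s) in each state

I = np.identity(2)
p_cum = np.linalg.solve(I - Q, d)       # solves p = d + Q p
p_ex = np.linalg.solve(I - Q, Q @ d)    # solves p = Q (d + p)

print(p_cum, p_ex)
```

The two prices differ state by state by exactly the current dividend, since the cum dividend claim includes today's payout and the ex dividend claim does not.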
Below, we describe an equilibrium model with trading of one-period Arrow securities in which the pricing kernel is
endogenous.
In constructing our model, we’ll repeatedly encounter formulas that remind us of our asset pricing formulas.
68.5.2 Multi-Step-Forward Transition Probabilities and Pricing Kernels
The $(i, j)$ component of the $\ell$-step ahead transition probability $P^\ell$ is
$$ \text{Prob}(s_{t+\ell} = \bar s_j | s_t = \bar s_i) = P^\ell_{i,j} $$
The $(i, j)$ component of the $\ell$-step ahead pricing kernel $Q^\ell$ is
$$ Q^{(\ell)}(s_{t+\ell} = \bar s_j | s_t = \bar s_i) = Q^\ell_{i,j} $$
We’ll use these objects to state a useful property in asset pricing theory.
68.5.3 Laws of Iterated Expectations and Iterated Values
A law of iterated values has a mathematical structure that parallels a law of iterated expectations.
We can describe its structure readily in the Markov setting of this lecture.
Recall the following recursion satisfied by $j$ step ahead transition probabilities for our finite state Markov chain:
$$ P_j(s_{t+j}|s_t) = \sum_{s_{t+1}} P_{j-1}(s_{t+j}|s_{t+1}) P(s_{t+1}|s_t) $$
We can use this recursion to verify the law of iterated expectations applied to computing the conditional expectation of a
random variable 𝑑(𝑠𝑡+𝑗 ) conditioned on 𝑠𝑡 via the following string of equalities
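$$
\begin{aligned}
E\left[ E\left[ d(s_{t+j}) | s_{t+1} \right] | s_t \right]
&= \sum_{s_{t+1}} \left[ \sum_{s_{t+j}} d(s_{t+j}) P_{j-1}(s_{t+j}|s_{t+1}) \right] P(s_{t+1}|s_t) \\
&= \sum_{s_{t+j}} d(s_{t+j}) \left[ \sum_{s_{t+1}} P_{j-1}(s_{t+j}|s_{t+1}) P(s_{t+1}|s_t) \right] \\
&= \sum_{s_{t+j}} d(s_{t+j}) P_j(s_{t+j}|s_t) \\
&= E\left[ d(s_{t+j}) | s_t \right]
\end{aligned}
$$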
The pricing kernel for 𝑗 step ahead Arrow securities satisfies the recursion
$$ Q_j(s_{t+j}|s_t) = \sum_{s_{t+1}} Q_{j-1}(s_{t+j}|s_{t+1}) Q(s_{t+1}|s_t) $$
The time $t$ value in Markov state $s_t$ of a time $t+j$ payout $d(s_{t+j})$ is
$$ V(d(s_{t+j})|s_t) = \sum_{s_{t+j}} d(s_{t+j}) Q_j(s_{t+j}|s_t) $$
The law of iterated values states
$$ V\left[ V(d(s_{t+j})|s_{t+1}) | s_t \right] = V(d(s_{t+j})|s_t) $$
We verify it by pursuing the following string of equalities that are counterparts to those we used to verify the law of
iterated expectations:
$$
\begin{aligned}
V\left[ V(d(s_{t+j})|s_{t+1}) | s_t \right]
&= \sum_{s_{t+1}} \left[ \sum_{s_{t+j}} d(s_{t+j}) Q_{j-1}(s_{t+j}|s_{t+1}) \right] Q(s_{t+1}|s_t) \\
&= \sum_{s_{t+j}} d(s_{t+j}) \left[ \sum_{s_{t+1}} Q_{j-1}(s_{t+j}|s_{t+1}) Q(s_{t+1}|s_t) \right] \\
&= \sum_{s_{t+j}} d(s_{t+j}) Q_j(s_{t+j}|s_t) \\
&= V(d(s_{t+j})|s_t)
\end{aligned}
$$
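In matrix form, the law of iterated values says that pricing a payout in one step with the $j$-step kernel gives the same answer as pricing it $j-1$ steps back and then one more step. A minimal numerical sketch, using a hypothetical two-state kernel Q and payout vector d:

```python
import numpy as np

# Hypothetical one-period pricing kernel and payout (illustrative values)
Q = np.array([[0.5, 0.4],
              [0.3, 0.6]])
d = np.array([1.0, 3.0])      # payout d(s_{t+j}) in each state
j = 4

Q_j = np.linalg.matrix_power(Q, j)          # j-step ahead kernel
Q_jm1 = np.linalg.matrix_power(Q, j - 1)    # (j-1)-step ahead kernel

direct = Q_j @ d              # V(d(s_{t+j}) | s_t) for each s_t
inner = Q_jm1 @ d             # V(d(s_{t+j}) | s_{t+1}) for each s_{t+1}
iterated = Q @ inner          # V[V(d(s_{t+j}) | s_{t+1}) | s_t]

print(direct, iterated)
```

The two vectors agree up to floating point error because matrix multiplication is associative, which is the matrix counterpart of the interchange of sums in the derivation.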
68.6 General Equilibrium
Now we are ready to do some fun calculations.

We find it interesting to think in terms of analytical inputs into and outputs from our general equilibrium theorizing.
68.6.1 Inputs
• Markov states: 𝑠 ∈ 𝑆 = [𝑠1̄ , … , 𝑠𝑛̄ ] governed by an 𝑛-state Markov chain with transition probability
𝑃𝑖𝑗 = Pr {𝑠𝑡+1 = 𝑠𝑗̄ ∣ 𝑠𝑡 = 𝑠𝑖̄ }
• A collection of $n \times 1$ vectors of individual $k$ endowments: $y^k(s), k = 1, \ldots, K$
• An $n \times 1$ vector of aggregate endowment: $y(s) \equiv \sum_{k=1}^K y^k(s)$
• A collection of $n \times 1$ vectors of individual $k$ consumptions: $c^k(s), k = 1, \ldots, K$
• A collection of restrictions on feasible consumption allocations for $s \in S$:
$$ c(s) = \sum_{k=1}^K c^k(s) \leq y(s) $$
• Preferences: a common utility functional across agents $E_0 \sum_{t=0}^\infty \beta^t u(c_t^k)$ with CRRA one-period utility function
$u(c)$ and discount factor $\beta \in (0, 1)$
The one-period utility function is
$$ u(c) = \frac{c^{1-\gamma}}{1-\gamma} $$
so that
$$ u'(c) = c^{-\gamma} $$
68.6.2 Outputs
• An 𝑛 × 𝑛 matrix pricing kernel 𝑄 for one-period Arrow securities, where 𝑄𝑖𝑗 = price at time 𝑡 in state 𝑠𝑡 = 𝑠𝑖̄ of
one unit of consumption when 𝑠𝑡+1 = 𝑠𝑗̄ at time 𝑡 + 1
• pure exchange so that 𝑐 (𝑠) = 𝑦 (𝑠)
• a $K \times 1$ vector $\alpha$ describing the distribution of wealth, with $\alpha_k \geq 0$ and $\sum_{k=1}^K \alpha_k = 1$
• A collection of 𝑛 × 1 vectors of individual 𝑘 consumptions: 𝑐𝑘 (𝑠) , 𝑘 = 1, … , 𝐾
68.6.3 𝑄 is the Pricing Kernel
For any agent 𝑘 ∈ [1, … , 𝐾], at the equilibrium allocation, the one-period Arrow securities pricing kernel satisfies
$$ Q_{ij} = \beta \left( \frac{c^k(\bar s_j)}{c^k(\bar s_i)} \right)^{-\gamma} P_{ij} $$
where $Q$ is an $n \times n$ matrix.
This follows from agent $k$’s first-order necessary conditions.
But with the CRRA preferences that we have assumed, individual consumptions vary proportionately with aggregate
consumption and therefore with the aggregate endowment.
• This is a consequence of our preference specification implying that Engel curves are affine in wealth and therefore
satisfy conditions for Gorman aggregation
Thus,
$$ c^k(s) = \alpha_k c(s) = \alpha_k y(s) $$
for an arbitrary distribution of wealth in the form of a $K \times 1$ vector $\alpha$ that satisfies
$$ \alpha_k \in (0, 1), \quad \sum_{k=1}^K \alpha_k = 1 $$
This means that we can compute the pricing kernel from

$$ Q_{ij} = \beta \left( \frac{y_j}{y_i} \right)^{-\gamma} P_{ij} \tag{68.1} $$
Note that 𝑄𝑖𝑗 is independent of vector 𝛼.
Key finding: We can compute competitive equilibrium prices prior to computing a distribution of wealth.
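Formula (68.1) needs only the aggregate endowment, the transition matrix and the preference parameters. A vectorized sketch follows; the function name pricing_kernel and the numerical values are illustrative assumptions, and no wealth distribution enters the calculation.

```python
import numpy as np

def pricing_kernel(y, P, β, γ):
    "Q[i, j] = β (y[j] / y[i])^(-γ) P[i, j], following formula (68.1)."
    ratio = y[None, :] / y[:, None]     # ratio[i, j] = y[j] / y[i]
    return β * ratio**(-γ) * P

# Hypothetical two-state example
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
y = np.array([1.0, 2.0])                # aggregate endowment by state
Q = pricing_kernel(y, P, β=0.98, γ=0.5)

print(Q)
```

Moving to a state with a higher aggregate endowment lowers marginal utility there, so the corresponding Arrow security is cheaper than the one that pays off in the low-endowment state.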
68.6.4 Values
Having computed an equilibrium pricing kernel 𝑄, we can compute several values that are required to pose or represent
the solution of an individual household’s optimum problem.
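We denote a $K \times 1$ vector of state-dependent values of agents’ endowments in Markov state $s$ as
$$ A(s) = \begin{bmatrix} A^1(s) \\ \vdots \\ A^K(s) \end{bmatrix}, \quad s \in [\bar s_1, \ldots, \bar s_n] $$
and an $n \times 1$ vector of continuation endowment values for each individual $k$ as
$$ A^k = \begin{bmatrix} A^k(\bar s_1) \\ \vdots \\ A^k(\bar s_n) \end{bmatrix}, \quad k \in [1, \ldots, K] $$
$A^k$ of consumer $k$ satisfies
$$ A^k = [I - Q]^{-1} [y^k] $$
where
$$ y^k = \begin{bmatrix} y^k(\bar s_1) \\ \vdots \\ y^k(\bar s_n) \end{bmatrix} \equiv \begin{bmatrix} y_1^k \\ \vdots \\ y_n^k \end{bmatrix} $$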
In a competitive equilibrium of an infinite horizon economy with sequential trading of one-period Arrow securities,
𝐴𝑘 (𝑠) serves as a state-by-state vector of debt limits on the quantities of one-period Arrow securities paying off in state
𝑠 at time 𝑡 + 1 that individual 𝑘 can issue at time 𝑡.
These are often called natural debt limits.
Evidently, they equal the maximum amount that it is feasible for individual 𝑘 to repay even if he consumes zero goods
forevermore.
Remark: If we have an Inada condition at zero consumption or just impose that consumption be nonnegative, then in a
finite horizon economy with sequential trading of one-period Arrow securities there is no need to impose natural debt
limits. See the section below on a Finite Horizon Economy.
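For the infinite horizon case, the natural debt limits can be computed with a single linear solve. The sketch below assumes a hypothetical two-state, two-agent economy in which each agent is endowed in only one state; the array layout ys[i, k] is also a choice made here for illustration.

```python
import numpy as np

# Hypothetical economy (illustrative values)
β, γ = 0.98, 0.5
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
ys = np.array([[1.0, 0.0],
               [0.0, 1.0]])     # ys[i, k] = endowment of agent k in state i
y = ys.sum(axis=1)              # aggregate endowment by state

Q = β * (y[None, :] / y[:, None])**(-γ) * P      # pricing kernel, formula (68.1)
A = np.linalg.solve(np.identity(2) - Q, ys)      # column k holds A^k

print(A)
```

Each column satisfies the recursion $A^k = y^k + Q A^k$, so A[i, k] is the value in state i of agent k's entire current and future endowment stream.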
68.6.5 Continuation Wealth
Continuation wealth plays an important role in Bellmanizing a competitive equilibrium with sequential trading of a
complete set of one-period Arrow securities.
We denote a $K \times 1$ vector of state-dependent continuation wealths in Markov state $s$ as
$$ \psi(s) = \begin{bmatrix} \psi^1(s) \\ \vdots \\ \psi^K(s) \end{bmatrix}, \quad s \in [\bar s_1, \ldots, \bar s_n] $$
and an $n \times 1$ vector of continuation wealths for each individual $k$ as
$$ \psi^k = \begin{bmatrix} \psi^k(\bar s_1) \\ \vdots \\ \psi^k(\bar s_n) \end{bmatrix}, \quad k \in [1, \ldots, K] $$
Continuation wealth $\psi^k$ of consumer $k$ satisfies
$$ \psi^k = [I - Q]^{-1} [\alpha_k y - y^k] \tag{68.2} $$
where
$$ y^k = \begin{bmatrix} y^k(\bar s_1) \\ \vdots \\ y^k(\bar s_n) \end{bmatrix}, \quad y = \begin{bmatrix} y(\bar s_1) \\ \vdots \\ y(\bar s_n) \end{bmatrix} $$
Note that $\sum_{k=1}^K \psi^k = 0_{n \times 1}$.
Remark: At the initial state 𝑠0 ∈ [𝑠1̄ , … , 𝑠𝑛̄ ], the continuation wealth 𝜓𝑘 (𝑠0 ) = 0 for all agents 𝑘 = 1, … , 𝐾. This
indicates that the economy begins with all agents being debt-free and financial-asset-free at time 0, state 𝑠0 .
Remark: Note that all agents’ continuation wealths recurrently return to zero when the Markov state returns to whatever
value 𝑠0 it had at time 0.
68.6.6 Optimal Portfolios
A nifty feature of the model is that an optimal portfolio of a type 𝑘 agent equals the continuation wealth that we just
computed.
Thus, agent 𝑘’s state-by-state purchases of Arrow securities next period depend only on next period’s Markov state and
equal
$$ a^k(s) = \psi^k(s), \quad s \in [\bar s_1, \ldots, \bar s_n] \tag{68.3} $$
68.6.7 Equilibrium Wealth Distribution 𝛼
With the initial state being a particular state 𝑠0 ∈ [𝑠1̄ , … , 𝑠𝑛̄ ], we must have
𝜓𝑘 (𝑠0 ) = 0, 𝑘 = 1, … , 𝐾
which means the equilibrium distribution of wealth satisfies
$$ \alpha_k = \frac{V_z y^k}{V_z y} \tag{68.4} $$
where $V \equiv [I - Q]^{-1}$ and $z$ is the row index corresponding to the initial state $s_0$.
Since $\sum_{k=1}^K V_z y^k = V_z y$, $\sum_{k=1}^K \alpha_k = 1$.
In summary, here is the logical flow of an algorithm to compute a competitive equilibrium:
• compute 𝑄 from the aggregate allocation and formula (68.1)
• compute the distribution of wealth 𝛼 from the formula (68.4)
• Using 𝛼 assign each consumer 𝑘 the share 𝛼𝑘 of the aggregate endowment at each state
• return to the 𝛼-dependent formula (68.2) and compute continuation wealths
• via formula (68.3) equate agent 𝑘’s portfolio to its continuation wealth state by state
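The steps above can be sketched as one short function for the infinite horizon case. The function name equilibrium, the array layout ys[i, k] and the example economy are hypothetical; the sketch is a compact illustration of formulas (68.1), (68.4), (68.2) and (68.3), not a replacement for a full implementation.

```python
import numpy as np

def equilibrium(ys, P, β, γ, s0_idx):
    """
    ys[i, k] is agent k's endowment in state i; s0_idx is the index of s_0.
    Returns the pricing kernel Q, wealth shares α and continuation wealths ψ.
    """
    n, K = ys.shape
    y = ys.sum(axis=1)                               # aggregate endowment
    Q = β * (y[None, :] / y[:, None])**(-γ) * P      # formula (68.1)
    V = np.linalg.inv(np.identity(n) - Q)
    α = V[s0_idx] @ ys / (V[s0_idx] @ y)             # formula (68.4)
    ψ = V @ (np.outer(y, α) - ys)                    # formula (68.2); column k is ψ^k
    return Q, α, ψ

# Hypothetical two-state, two-agent economy
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
ys = np.array([[1.0, 0.0],
               [0.0, 1.0]])
Q, α, ψ = equilibrium(ys, P, β=0.98, γ=0.5, s0_idx=0)

print(α)
print(ψ)   # by (68.3), column k is also agent k's portfolio of Arrow securities
```

Two properties from the text are visible in the output: the row of ψ for the initial state is zero, and continuation wealths sum to zero across agents in every state.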
We can also add formulas for optimal value functions in a competitive equilibrium with trades in a complete set of one-
period state-contingent Arrow securities.
Call the optimal value functions 𝐽 𝑘 for consumer 𝑘.
For the infinite horizon economy now under study, the formula is
$$ J^k = (I - \beta P)^{-1} u(\alpha_k y), \quad u(c) = \frac{c^{1-\gamma}}{1-\gamma} $$
where it is understood that $u(\alpha_k y)$ is a vector.
68.7 Python Code
We are ready to dive into some Python code.

As usual, we start with Python imports.
import numpy as np
np.set_printoptions(suppress=True)
First, we create a Python class to compute the objects that comprise a competitive equilibrium with sequential trading of
one-period Arrow securities.
In addition to handling infinite-horizon economies, the code is set up to handle finite-horizon economies indexed by horizon
$T$.
We’ll study some finite horizon economies after we look at some infinite-horizon economies.
class RecurCompetitive:
    """
    A class that represents a recursive competitive economy
    with one-period Arrow securities.
    """
    def __init__(self,
                 s,         # state vector
                 P,         # transition matrix
                 ys,        # endowments ys = [y1, y2, .., yK]
                 γ=0.5,     # risk aversion
                 β=0.98,    # discount factor
                 T=None):   # time horizon, None if infinite

        # preference parameters
        self.γ = γ
        self.β = β

        # variables dependent on state
        self.s = s
        self.P = P
        self.ys = ys
        self.y = np.sum(ys, 1)

        # dimensions
        self.n, self.K = ys.shape
        n = self.n

        # compute pricing kernel
        self.Q = self.pricing_kernel()

        # compute price of risk-free one-period bond
        self.PRF = self.price_risk_free_bond()

        # compute risk-free rate
        self.R = self.risk_free_rate()

        # V = [I - Q]^{-1} (infinite case)
        if T is None:
            self.T = None
            self.V = np.empty((1, n, n))
            self.V[0] = np.linalg.inv(np.eye(n) - self.Q)
        # V = [I + Q + Q^2 + ... + Q^T] (finite case)
        else:
            self.T = T
            self.V = np.empty((T+1, n, n))
            self.V[0] = np.eye(n)

            Qt = np.eye(n)
            for t in range(1, T+1):
                Qt = Qt.dot(self.Q)
                self.V[t] = self.V[t-1] + Qt

        # natural debt limit
        self.A = self.V[-1] @ ys

    def u(self, c):
        "The CRRA utility"
        return c ** (1 - self.γ) / (1 - self.γ)

    def u_prime(self, c):
        "The first derivative of CRRA utility"
        return c ** (-self.γ)

    def pricing_kernel(self):
        "Compute the pricing kernel matrix Q"
        c = self.y
        n = self.n

        Q = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                ratio = self.u_prime(c[j]) / self.u_prime(c[i])
                Q[i, j] = self.β * ratio * self.P[i, j]

        self.Q = Q
        return Q

    def wealth_distribution(self, s0_idx):
        "Solve for wealth distribution α"
        # set initial state
        self.s0_idx = s0_idx

        # simplify notations
        n = self.n
        Q = self.Q
        y, ys = self.y, self.ys

        # row of V corresponding to s0
        Vs0 = self.V[-1, s0_idx, :]
        α = Vs0 @ self.ys / (Vs0 @ self.y)
        self.α = α
        return α

    def continuation_wealths(self):
        "Given α, compute the continuation wealths ψ"
        n, K = self.n, self.K
        diff = np.empty((n, K))
        for k in range(K):
            diff[:, k] = self.α[k] * self.y - self.ys[:, k]

        ψ = self.V @ diff
        self.ψ = ψ
        return ψ

    def price_risk_free_bond(self):
        "Given Q, compute price of one-period risk free bond"
        # row sums, matching R_i^{-1} = Σ_j Q[i, j]
        PRF = np.sum(self.Q, 1)
        self.PRF = PRF
        return PRF

    def risk_free_rate(self):
        "Given Q, compute one-period gross risk-free interest rate R"
        # reciprocal of the row sums of Q
        R = np.sum(self.Q, 1)
        R = np.reciprocal(R)
        self.R = R
        return R
def value_functionss(self):
"Given α, compute the optimal value functions J in equilibrium"
n, T = self.n, self.T
β = self.β


P = self.P
# compute (I - βP)^(-1) in infinite case

if T is None:
P_seq = np.empty((1, n, n))
P_seq[0] = np.linalg.inv(np.eye(n) - β * P)
# and (I + βP + ... + β^T P^T) in finite case
else:
P_seq = np.empty((T+1, n, n))
P_seq[0] = np.eye(n)
Pt = np.eye(n)
Pt = Pt.dot(P)
P_seq[t] = P_seq[t-1] + Pt * β ** t
# compute the matrix [u(α_1 y), ..., u(α_K, y)]

flow = np.empty((n, K))
for k in range(K):
flow[:, k] = self.u(self.α[k] * self.y)
J = P_seq @ flow
self.J = J
return J
68.7.1 Example 1
Please read the preceding class for default parameter values and the following Python code for the fundamentals of the
economy.
Here goes.
# dimensions
K, n = 2, 2
# states
s = np.array([0, 1])
# transition
P = np.array([[.5, .5], [.5, .5]])
# endowments
ys = np.empty((n, K))
ys[:, 0] = 1 - s # y1
ys[:, 1] = s # y2
ex1 = RecurCompetitive(s, P, ys)
# endowments
ex1.ys

array([[1., 0.],
[0., 1.]])
# pricing kernel
ex1.Q
array([[0.49, 0.49],
[0.49, 0.49]])
# Risk free rate R

ex1.R
array([1.02040816, 1.02040816])
# natural debt limit, A = [A1, A2, ..., AI]

ex1.A
array([[25.5, 24.5],
[24.5, 25.5]])
# when the initial state is state 1

print(f'α = {ex1.wealth_distribution(s0_idx=0)}')
print(f'ψ = \n{ex1.continuation_wealths()}')
print(f'J = \n{ex1.value_functionss()}')
α = [0.51 0.49]
ψ =
[[[ 0. 0.]
[ 1. -1.]]]
J =
[[[71.41428429 70. ]
[71.41428429 70. ]]]
# when the initial state is state 2
print(f'α = {ex1.wealth_distribution(s0_idx=1)}')
print(f'ψ = \n{ex1.continuation_wealths()}')
print(f'J = \n{ex1.value_functionss()}')

α = [0.49 0.51]
ψ =
[[[-1. 1.]
[ 0. -0.]]]
J =
[[[70. 71.41428429]
[70. 71.41428429]]]

68.7.2 Example 2
# dimensions
K, n = 2, 2

# states
s = np.array([1, 2])

# transition
P = np.array([[.5, .5], [.5, .5]])

# endowments
ys = np.empty((n, K))
ys[:, 0] = 1.5         # y1
ys[:, 1] = s           # y2

ex2 = RecurCompetitive(s, P, ys)

# endowments
print("ys = \n", ex2.ys)

# pricing kernel
print("Q = \n", ex2.Q)

# Risk free rate R
print("R = ", ex2.R)
ys =
[[1.5 1. ]
[1.5 2. ]]
Q =
[[0.49 0.41412558]
[0.57977582 0.49 ]]
R = [0.93477529 1.10604104]
# pricing kernel
ex2.Q
array([[0.49 , 0.41412558],
[0.57977582, 0.49 ]])
# Risk free rate R

ex2.R
array([0.93477529, 1.10604104])

ex2.A

array([[69.30941886, 66.91255848],
[81.73318641, 79.98879094]])
# when the initial state is state 1
print(f'α = {ex2.wealth_distribution(s0_idx=0)}')
print(f'ψ = \n{ex2.continuation_wealths()}')
print(f'J = \n{ex2.value_functionss()}')

α = [0.50879763 0.49120237]
ψ =
[[[-0. -0. ]
[ 0.55057195 -0.55057195]]]
J =
[[[122.907875 120.76397493]
[123.32114686 121.17003803]]]
# when the initial state is state 2
print(f'α = {ex2.wealth_distribution(s0_idx=1)}')
print(f'ψ = \n{ex2.continuation_wealths()}')
print(f'J = \n{ex2.value_functionss()}')

α = [0.50539319 0.49460681]
ψ =
[[[-0.46375886 0.46375886]
[ 0. -0. ]]]
J =
[[[122.49598809 121.18174895]
[122.907875 121.58921679]]]
68.7.3 Example 3
# dimensions
K, n = 2, 2

# states
s = np.array([1, 2])

# transition
λ = 0.9
P = np.array([[1-λ, λ], [0, 1]])

# endowments
ys = np.empty((n, K))
ys[:, 0] = [1, 0]         # y1
ys[:, 1] = [0, 1]         # y2

ex3 = RecurCompetitive(s, P, ys)

# endowments
print("ys = ", ex3.ys)

# pricing kernel
print("Q = ", ex3.Q)

# Risk free rate R
print("R = ", ex3.R)
ys = [[1. 0.]
[0. 1.]]
Q = [[0.098 0.882]
[0. 0.98 ]]
R = [10.20408163 0.53705693]
# pricing kernel
ex3.Q
array([[0.098, 0.882],
[0. , 0.98 ]])

ex3.A
array([[ 1.10864745, 48.89135255],

[ 0. , 50. ]])
Note that the natural debt limit for agent 1 in state 2 is 0.
# when the initial state is state 1
print(f'α = {ex3.wealth_distribution(s0_idx=0)}')
print(f'ψ = \n{ex3.continuation_wealths()}')
print(f'J = \n{ex3.value_functionss()}')

α = [0.02217295 0.97782705]
ψ =
[[[ 0. -0. ]
[ 1.10864745 -1.10864745]]]
J =
[[[14.89058394 98.88513796]
[14.89058394 98.88513796]]]
# when the initial state is state 2
print(f'α = {ex3.wealth_distribution(s0_idx=1)}')
print(f'ψ = \n{ex3.continuation_wealths()}')
print(f'J = \n{ex3.value_functionss()}')

α = [0. 1.]
ψ =
[[[-1.10864745 1.10864745]
[ 0. 0. ]]]


J =
[[[ 0. 100.]
[ 0. 100.]]]
For the specification of the Markov chain in example 3, let’s take a look at how the equilibrium allocation changes as a
function of transition probability 𝜆.
λ_seq = np.linspace(0, 1, 100)

# prepare containers
αs0_seq = np.empty((len(λ_seq), 2))
αs1_seq = np.empty((len(λ_seq), 2))

for i, λ in enumerate(λ_seq):
    # rebuild the economy for each value of λ
    P = np.array([[1-λ, λ], [0, 1]])
    ex3 = RecurCompetitive(s, P, ys)

    # initial state s0 = 1
    α = ex3.wealth_distribution(s0_idx=0)
    αs0_seq[i, :] = α

    # initial state s0 = 2
    α = ex3.wealth_distribution(s0_idx=1)
    αs1_seq[i, :] = α
/tmp/ipykernel_5655/2194301487.py:126: RuntimeWarning: divide by zero encountered in reciprocal
  R = np.reciprocal(R)
fig, axs = plt.subplots(1, 2, figsize=(12, 4))

for i, αs_seq in enumerate([αs0_seq, αs1_seq]):
    for j in range(2):
        axs[i].plot(λ_seq, αs_seq[:, j], label=f'α{j+1}')
    axs[i].set_xlabel('λ')
    axs[i].set_title(f'initial state s0 = {s[i]}')
    axs[i].legend()

plt.show()

68.7.4 Example 4
# dimensions
K, n = 2, 3

# states
s = np.array([1, 2, 3])

# transition
λ = .9
μ = .9
δ = .05
P = np.array([[1-λ, λ, 0], [μ/2, μ, μ/2], [(1-δ)/2, (1-δ)/2, δ]])

# endowments
ys = np.empty((n, K))
ys[:, 0] = [.25, .75, .2]       # y1
ys[:, 1] = [1.25, .25, .2]      # y2

ex4 = RecurCompetitive(s, P, ys)

# endowments
print("ys = \n", ex4.ys)

# pricing kernel
print("Q = \n", ex4.Q)

# Risk free rate R
print("R = ", ex4.R)

# natural debt limit, A = [A1, A2, ..., AI]
print("A = \n", ex4.A)
print('')

for i in range(1, 4):
    # when the initial state is state i
    print(f"when the initial state is state {i}")
    print(f'α = {ex4.wealth_distribution(s0_idx=i-1)}')
    print(f'ψ = \n{ex4.continuation_wealths()}')
    print(f'J = \n{ex4.value_functionss()}\n')
ys =
[[0.25 1.25]
[0.75 0.25]
[0.2 0.2 ]]
Q =
[[0.098 1.08022498 0. ]
[0.36007499 0.882 0.69728222]
[0.24038317 0.29440805 0.049 ]]
R = [1.43172499 0.44313807 1.33997564]
A =
[[-1.4141307 -0.45854174]
[-1.4122483 -1.54005386]
[-0.58434331 -0.3823659 ]]
when the initial state is state 1

α = [0.75514045 0.24485955]
ψ =
[[[ 0. 0. ]
[-0.81715447 0.81715447]
[-0.14565791 0.14565791]]]
J =
[[[-2.65741909 -1.51322919]
[-5.13103133 -2.92179221]
[-2.65649938 -1.51270548]]]

when the initial state is state 2

α = [0.47835493 0.52164507]
ψ =
[[[ 0.5183286 -0.5183286 ]
[ 0. -0. ]
[ 0.12191319 -0.12191319]]]
J =
[[[-2.11505328 -2.20868477]
[-4.08381377 -4.26460049]
[-2.11432128 -2.20792037]]]

when the initial state is state 3

α = [0.60446648 0.39553352]
ψ =
[[[ 0.28216299 -0.28216299]
[-0.37231938 0.37231938]
[-0. -0. ]]]
J =
[[[-2.37756442 -1.92325926]
[-4.59067883 -3.71349163]
[-2.37674158 -1.92259365]]]

68.8 Finite Horizon
The Python class RecurCompetitive provided above also can be used to compute competitive equilibrium allocations
and Arrow securities prices for finite horizon economies.
The setting is a finite-horizon version of the one above except that time now runs for 𝑇 + 1 periods 𝑡 ∈ T = {0, 1, … , 𝑇 }.
Consequently, we want 𝑇 + 1 counterparts to objects described above, with one important exception: we won’t need
borrowing limits.
• Borrowing limits aren’t required for a finite horizon economy in which a one-period utility function 𝑢(𝑐) satisfies
an Inada condition that makes the marginal utility of consumption infinite at zero consumption.
• Nonnegativity of consumption choices at all 𝑡 ∈ T automatically limits borrowing.
68.8.1 Continuation Wealths
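We denote a $K \times 1$ vector of state-dependent continuation wealths in Markov state $s$ at time $t$ as
$$
\psi_t(s) = \begin{bmatrix} \psi^1(s) \\ \vdots \\ \psi^K(s) \end{bmatrix}, \quad s \in [\bar s_1, \ldots, \bar s_n]
$$
and an $n \times 1$ vector of continuation wealths for each individual $k$ as
$$
\psi_t^k = \begin{bmatrix} \psi_t^k(\bar s_1) \\ \vdots \\ \psi_t^k(\bar s_n) \end{bmatrix}, \quad k \in [1, \ldots, K]
$$
Continuation wealths $\psi^k$ of consumer $k$ satisfy
$$
\begin{aligned}
\psi_T^k &= [\alpha_k y - y^k] \\
\psi_{T-1}^k &= [I + Q] [\alpha_k y - y^k] \\
&\;\;\vdots \\
\psi_0^k &= [I + Q + Q^2 + \cdots + Q^T] [\alpha_k y - y^k]
\end{aligned} \tag{68.5}
$$
where
$$
y^k = \begin{bmatrix} y^k(\bar s_1) \\ \vdots \\ y^k(\bar s_n) \end{bmatrix}, \quad
y = \begin{bmatrix} y(\bar s_1) \\ \vdots \\ y(\bar s_n) \end{bmatrix}
$$
Note that $\sum_{k=1}^K \psi_t^k = 0_{n \times 1}$ for all $t \in \mathbf{T}$.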
Remark: At the initial state 𝑠0 ∈ [𝑠1̄ , … , 𝑠𝑛̄ ], for all agents 𝑘 = 1, … , 𝐾, continuation wealth 𝜓0𝑘 (𝑠0 ) = 0. This
indicates that the economy begins with all agents being debt-free and financial-asset-free at time 0, state 𝑠0 .
Remark: Note that all agents’ continuation wealths return to zero when the Markov state returns to whatever value 𝑠0 it
had at time 0. This will recur if the Markov chain makes the initial state 𝑠0 recurrent.
With the initial state being a particular state 𝑠0 ∈ [𝑠1̄ , … , 𝑠𝑛̄ ], we must have
$$
\psi_0^k(s_0) = 0, \quad k = 1, \ldots, K
$$
which means the equilibrium distribution of wealth satisfies
$$
\alpha_k = \frac{V_z y^k}{V_z y} \tag{68.6}
$$
where now in our finite-horizon economy
$$
V = [I + Q + Q^2 + \cdots + Q^T] \tag{68.7}
$$
and $z$ is the row index corresponding to the initial state $s_0$.
Since $\sum_{k=1}^K V_z y^k = V_z y$, $\sum_{k=1}^K \alpha_k = 1$.
In summary, here is the logical flow of an algorithm to compute a competitive equilibrium with Arrow securities in our
finite-horizon Markov economy:
• compute 𝑄 from the aggregate allocation and formula (68.1)
• compute the distribution of wealth 𝛼 from formulas (68.6) and (68.7)
• using 𝛼, assign each consumer 𝑘 the share 𝛼𝑘 of the aggregate endowment at each state
• return to the 𝛼-dependent formula (68.5) for continuation wealths and compute continuation wealths
• equate agent 𝑘’s portfolio to its continuation wealth state by state
While for the infinite horizon economy, the formula for value functions is
$$
J^k = (I - \beta P)^{-1} u(\alpha_k y), \quad u(c) = \frac{c^{1-\gamma}}{1-\gamma}
$$
for the finite horizon economy the formula is
$$
J_0^k = (I + \beta P + \cdots + \beta^T P^T) u(\alpha_k y),
$$
where it is understood that $u(\alpha_k y)$ is a vector.
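The finite-horizon formulas can be checked directly by accumulating the two matrix sums term by term. The sketch below is a minimal illustration that assumes CRRA utility, two states, two agents and $T = 10$; the variable names `V`, `B`, `ψ0` and `J0` are chosen here for illustration.

```python
import numpy as np

β, γ, T = 0.98, 0.5, 10
P = np.array([[0.5, 0.5], [0.5, 0.5]])
ys = np.array([[1.0, 0.0], [0.0, 1.0]])      # column k is agent k's endowment
y = ys.sum(axis=1)

mu = y ** (-γ)                               # CRRA marginal utility (assumed)
Q = β * P * mu[None, :] / mu[:, None]        # pricing kernel

# V = I + Q + ... + Q^T and B = I + βP + ... + β^T P^T, built term by term
V, B = np.eye(2), np.eye(2)
Qt, Pt = np.eye(2), np.eye(2)
for t in range(1, T + 1):
    Qt, Pt = Qt @ Q, Pt @ (β * P)
    V, B = V + Qt, B + Pt

z = 0                                        # row index of the initial state
α = V[z] @ ys / (V[z] @ y)                   # formulas (68.6) and (68.7)
ψ0 = V @ (np.outer(y, α) - ys)               # time-0 continuation wealths, (68.5)
J0 = B @ (np.outer(y, α) ** (1 - γ) / (1 - γ))
print(α, ψ0[z], J0[z])
```

Accumulating powers in a loop avoids recomputing $Q^t$ from scratch and mirrors the way the partial sums index the remaining horizon.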
68.8.2 Finite Horizon Example
Below we revisit the economy defined in example 1, but set the time horizon to be 𝑇 = 10.
# dimensions
K, n = 2, 2

# states
s = np.array([0, 1])

# transition
P = np.array([[.5, .5], [.5, .5]])

# endowments
ys = np.empty((n, K))
ys[:, 0] = 1 - s       # y1
ys[:, 1] = s           # y2

ex1_finite = RecurCompetitive(s, P, ys, T=10)
# (I + Q + Q^2 + ... + Q^T)

ex1_finite.V[-1]
array([[5.48171623, 4.48171623],
[4.48171623, 5.48171623]])

# endowments
ex1_finite.ys
array([[1., 0.],
[0., 1.]])
# pricing kernel
ex1_finite.Q
array([[0.49, 0.49],
[0.49, 0.49]])
# Risk free rate R

ex1_finite.R
array([1.02040816, 1.02040816])
In the finite time horizon case, ψ and J are returned as sequences.

Components are ordered from 𝑡 = 𝑇 to 𝑡 = 0.

print(f'α = {ex1_finite.wealth_distribution(s0_idx=0)}')
print(f'ψ = \n{ex1_finite.continuation_wealths()}\n')
print(f'J = \n{ex1_finite.value_functionss()}')
α = [0.55018351 0.44981649]
ψ =
[[[-0.44981649 0.44981649]
[ 0.55018351 -0.55018351]]
[[-0.40063665 0.40063665]
[ 0.59936335 -0.59936335]]
[[-0.35244041 0.35244041]
[ 0.64755959 -0.64755959]]
[[-0.30520809 0.30520809]
[ 0.69479191 -0.69479191]]
[[-0.25892042 0.25892042]
[ 0.74107958 -0.74107958]]
[[-0.21355851 0.21355851]
[ 0.78644149 -0.78644149]]
[[-0.16910383 0.16910383]
[ 0.83089617 -0.83089617]]
[[-0.12553824 0.12553824]
[ 0.87446176 -0.87446176]]


[[-0.08284397 0.08284397]
[ 0.91715603 -0.91715603]]
[[-0.04100358 0.04100358]
[ 0.95899642 -0.95899642]]
[[-0. 0. ]
[ 1. -1. ]]]
J =
[[[ 1.48348712 1.3413672 ]
[ 1.48348712 1.3413672 ]]
[[ 2.9373045 2.65590706]
[ 2.9373045 2.65590706]]
[[ 4.36204553 3.94415611]
[ 4.36204553 3.94415611]]
[[ 5.75829174 5.20664019]
[ 5.75829174 5.20664019]]
[[ 7.12661302 6.44387459]
[ 7.12661302 6.44387459]]
[[ 8.46756788 7.6563643 ]
[ 8.46756788 7.6563643 ]]
[[ 9.78170364 8.84460421]
[ 9.78170364 8.84460421]]
[[11.06955669 10.00907933]
[11.06955669 10.00907933]]
[[12.33165268 11.15026494]
[12.33165268 11.15026494]]
[[13.56850674 12.26862684]
[13.56850674 12.26862684]]
[[14.78062373 13.3646215 ]
[14.78062373 13.3646215 ]]]

print(f'α = {ex1_finite.wealth_distribution(s0_idx=1)}')
print(f'ψ = \n{ex1_finite.continuation_wealths()}\n')
print(f'J = \n{ex1_finite.value_functionss()}')
α = [0.44981649 0.55018351]
ψ =
[[[-0.55018351 0.55018351]
[ 0.44981649 -0.44981649]]
[[-0.59936335 0.59936335]
[ 0.40063665 -0.40063665]]

[[-0.64755959 0.64755959]
[ 0.35244041 -0.35244041]]
[[-0.69479191 0.69479191]
[ 0.30520809 -0.30520809]]
[[-0.74107958 0.74107958]
[ 0.25892042 -0.25892042]]
[[-0.78644149 0.78644149]
[ 0.21355851 -0.21355851]]
[[-0.83089617 0.83089617]
[ 0.16910383 -0.16910383]]
[[-0.87446176 0.87446176]
[ 0.12553824 -0.12553824]]
[[-0.91715603 0.91715603]
[ 0.08284397 -0.08284397]]
[[-0.95899642 0.95899642]
[ 0.04100358 -0.04100358]]
[[-1. 1. ]
[ 0. -0. ]]]
J =
[[[ 1.3413672 1.48348712]
[ 1.3413672 1.48348712]]
[[ 2.65590706 2.9373045 ]
[ 2.65590706 2.9373045 ]]
[[ 3.94415611 4.36204553]
[ 3.94415611 4.36204553]]
[[ 5.20664019 5.75829174]
[ 5.20664019 5.75829174]]
[[ 6.44387459 7.12661302]
[ 6.44387459 7.12661302]]
[[ 7.6563643 8.46756788]
[ 7.6563643 8.46756788]]
[[ 8.84460421 9.78170364]
[ 8.84460421 9.78170364]]
[[10.00907933 11.06955669]
[10.00907933 11.06955669]]
[[11.15026494 12.33165268]
[11.15026494 12.33165268]]


[[12.26862684 13.56850674]
[12.26862684 13.56850674]]
[[13.3646215 14.78062373]
[13.3646215 14.78062373]]]
We can check that the finite-horizon results converge to the infinite-horizon ones as 𝑇 → ∞.
ex1_large = RecurCompetitive(s, P, ys, T=10000)

ex1_large.wealth_distribution(s0_idx=1)
array([0.49, 0.51])
ex1.V, ex1_large.V[-1]
(array([[[25.5, 24.5],
[24.5, 25.5]]]),
array([[25.5, 24.5],
[24.5, 25.5]]))
ex1_large.continuation_wealths()
ex1.ψ, ex1_large.ψ[-1]
(array([[[-1., 1.],
[ 0., -0.]]]),
array([[-1., 1.],
[ 0., -0.]]))
ex1_large.value_functionss()
ex1.J, ex1_large.J[-1]
(array([[[70. , 71.41428429],
[70. , 71.41428429]]]),
array([[70. , 71.41428429],
[70. , 71.41428429]]))

CHAPTER
SIXTYNINE
HETEROGENEOUS BELIEFS AND BUBBLES
Contents
• Heterogeneous Beliefs and Bubbles

– Overview
– Structure of the Model
– Solving the Model
– Exercises
In addition to what’s in Anaconda, this lecture uses the following libraries:
!pip install quantecon
69.1 Overview
This lecture describes a version of a model of Harrison and Kreps [Harrison and Kreps, 1978].
The model determines the price of a dividend-yielding asset that is traded by two types of self-interested investors.
The model features
• heterogeneous beliefs
• incomplete markets
• short sales constraints, and possibly …
• (leverage) limits on an investor’s ability to borrow in order to finance purchases of a risky asset
import numpy as np
import quantecon as qe
import scipy.linalg as la
69.1.1 References
Prior to reading the following, you might like to review our lectures on
• Markov chains
• Asset pricing with finite state space
69.1.2 Bubbles
Economists differ in how they define a bubble.

The Harrison-Kreps model illustrates the following notion of a bubble that attracts many economists:
A component of an asset price can be interpreted as a bubble when all investors agree that the current price of
the asset exceeds what they believe the asset’s underlying dividend stream justifies.
69.2 Structure of the Model
The model simplifies things by ignoring alterations in the distribution of wealth among investors who have hard-wired
different beliefs about the fundamentals that determine asset payouts.
There is a fixed number 𝐴 of shares of an asset.
Each share entitles its owner to a stream of dividends {𝑑𝑡 } governed by a Markov chain defined on a state space 𝑆 = {0, 1}.
The dividend obeys
$$
d_t = \begin{cases} 0 & \text{if } s_t = 0 \\ 1 & \text{if } s_t = 1 \end{cases}
$$
An owner of a share at the end of time 𝑡 and the beginning of time 𝑡 + 1 is entitled to the dividend paid at time 𝑡 + 1.
Thus, the stock is traded ex dividend.
An owner of a share at the beginning of time 𝑡 + 1 is also entitled to sell the share to another investor during time 𝑡 + 1.
Two types ℎ = 𝑎, 𝑏 of investors differ only in their beliefs about a Markov transition matrix 𝑃 with typical element
𝑃 (𝑖, 𝑗) = ℙ{𝑠𝑡+1 = 𝑗 ∣ 𝑠𝑡 = 𝑖}
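Investors of type $a$ believe the transition matrix is
$$
P_a = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{2}{3} & \frac{1}{3} \end{bmatrix}
$$
Investors of type $b$ think the transition matrix is
$$
P_b = \begin{bmatrix} \frac{2}{3} & \frac{1}{3} \\ \frac{1}{4} & \frac{3}{4} \end{bmatrix}
$$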
Thus, in state 0, a type 𝑎 investor is more optimistic about next period’s dividend than is investor 𝑏.
But in state 1, a type 𝑎 investor is more pessimistic about next period’s dividend than is investor 𝑏.
The stationary (i.e., invariant) distributions of these two matrices can be calculated as follows:
qa = np.array([[1/2, 1/2], [2/3, 1/3]])

qb = np.array([[2/3, 1/3], [1/4, 3/4]])
mca = qe.MarkovChain(qa)
mcb = qe.MarkovChain(qb)
mca.stationary_distributions
array([[0.57142857, 0.42857143]])
mcb.stationary_distributions
array([[0.42857143, 0.57142857]])
The stationary distribution of 𝑃𝑎 is approximately 𝜋𝑎 = [.57 .43].

The stationary distribution of 𝑃𝑏 is approximately 𝜋𝑏 = [.43 .57].
Thus, a type 𝑎 investor is more pessimistic on average.
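The same stationary distributions can be recovered without `quantecon` by solving $\pi P = \pi$ together with $\sum_i \pi_i = 1$ as a linear system. The sketch below is one way to do that; the helper name `stationary` is hypothetical, and the approach assumes the chain has a unique stationary distribution, as both matrices here do.

```python
import numpy as np

def stationary(P):
    "Solve π P = π with sum(π) = 1 (assumes a unique stationary distribution)."
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])   # append the normalization row
    b = np.append(np.zeros(n), 1.0)
    π, *_ = np.linalg.lstsq(A, b, rcond=None)
    return π

Pa = np.array([[1/2, 1/2], [2/3, 1/3]])
Pb = np.array([[2/3, 1/3], [1/4, 3/4]])
πa, πb = stationary(Pa), stationary(Pb)
print(πa, πb)
```

In exact terms the two distributions are $(4/7, 3/7)$ and $(3/7, 4/7)$, which is where the rounded values .57 and .43 come from.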
69.2.1 Ownership Rights
An owner of the asset at the end of time 𝑡 is entitled to the dividend at time 𝑡 + 1 and also has the right to sell the asset
at time 𝑡 + 1.
Both types of investors are risk-neutral and both have the same fixed discount factor 𝛽 ∈ (0, 1).
In our numerical example, we’ll set 𝛽 = .75, just as Harrison and Kreps [Harrison and Kreps, 1978] did.
We’ll eventually study the consequences of two alternative assumptions about the number of shares 𝐴 relative to the
resources that our two types of investors can invest in the stock.
1. Both types of investors have enough resources (either wealth or the capacity to borrow) so that they can purchase
the entire available stock of the asset1 .
2. No single type of investor has sufficient resources to purchase the entire stock.
Case 1 is the case studied in Harrison and Kreps.
In case 2, both types of investors always hold at least some of the asset.
69.2.2 Short Sales Prohibited
No short sales are allowed.

This matters because it limits how pessimists can express their opinions.
• They can express themselves by selling their shares.
• They cannot express themselves more loudly by artificially “manufacturing shares” – that is, they cannot borrow
shares from more optimistic investors and then immediately sell them.
1 By assuming that both types of agents always have “deep enough pockets” to purchase all of the asset, the model takes wealth dynamics off the
table. The Harrison-Kreps model generates high trading volume when the state changes either from 0 to 1 or from 1 to 0.
69.2.3 Optimism and Pessimism
The above specifications of the perceived transition matrices 𝑃𝑎 and 𝑃𝑏 , taken directly from Harrison and Kreps, build in
stochastically alternating temporary optimism and pessimism.
Remember that state 1 is the high dividend state.
• In state 0, a type 𝑎 agent is more optimistic about next period’s dividend than a type 𝑏 agent.
• In state 1, a type 𝑏 agent is more optimistic about next period’s dividend than a type 𝑎 agent is.
However, the stationary distributions 𝜋𝑎 = [.57 .43] and 𝜋𝑏 = [.43 .57] tell us that a type 𝑏 person is more optimistic
about the dividend process in the long run than is a type 𝑎 person.
69.2.4 Information
Investors know a price function mapping the state 𝑠𝑡 at 𝑡 into the equilibrium price 𝑝(𝑠𝑡 ) that prevails in that state.
This price function is endogenous and to be determined below.
When investors choose whether to purchase or sell the asset at 𝑡, they also know 𝑠𝑡 .
69.3 Solving the Model
Now let’s turn to solving the model.

We’ll determine equilibrium prices under a particular specification of beliefs and constraints on trading selected from one
of the specifications described above.
We shall compare equilibrium price functions under the following alternative assumptions about beliefs:
1. There is only one type of agent, either 𝑎 or 𝑏.
2. There are two types of agents differentiated only by their beliefs. Each type of agent has sufficient resources to
purchase all of the asset (Harrison and Kreps’s setting).
3. There are two types of agents with different beliefs, but because of limited wealth and/or limited leverage, both
types of investors hold the asset each period.
69.3.1 Summary Table
The following table gives a summary of the findings obtained in the remainder of the lecture (in an exercise you will be
asked to recreate the table and also reinterpret parts of it).
The table reports implications of Harrison and Kreps’s specifications of 𝑃𝑎 , 𝑃𝑏 , 𝛽.
$s_t$            0      1
$p_a$          1.33   1.22
$p_b$          1.45   1.91
$p_o$          1.85   2.08
$p_p$          1      1
$\hat{p}_a$    1.85   1.69
$\hat{p}_b$    1.69   2.08
Here

• 𝑝𝑎 is the equilibrium price function under homogeneous beliefs 𝑃𝑎

• 𝑝𝑏 is the equilibrium price function under homogeneous beliefs 𝑃𝑏
• 𝑝𝑜 is the equilibrium price function under heterogeneous beliefs with optimistic marginal investors
• 𝑝𝑝 is the equilibrium price function under heterogeneous beliefs with pessimistic marginal investors
• 𝑝𝑎̂ is the amount type 𝑎 investors are willing to pay for the asset
• 𝑝𝑏̂ is the amount type 𝑏 investors are willing to pay for the asset
We’ll explain these values and how they are calculated one row at a time.
The row corresponding to 𝑝𝑜 applies when both types of investor have enough resources to purchase the entire stock of
the asset and strict short sales constraints prevail so that temporarily optimistic investors always price the asset.
The row corresponding to 𝑝𝑝 would apply if neither type of investor has enough resources to purchase the entire stock of
the asset and both types must hold the asset.
The row corresponding to 𝑝𝑝 would also apply if both types have enough resources to buy the entire stock of the asset but
short sales are also possible so that temporarily pessimistic investors price the asset.
69.3.2 Single Belief Prices
We’ll start by pricing the asset under homogeneous beliefs.

(This is the case treated in the lecture on asset pricing with finite Markov states)
Suppose that there is only one type of investor, either of type 𝑎 or 𝑏, and that this investor always “prices the asset”.
Let $p_h = \begin{bmatrix} p_h(0) \\ p_h(1) \end{bmatrix}$ be the equilibrium price vector when all investors are of type $h$.
The price today equals the expected discounted value of tomorrow’s dividend and tomorrow’s price of the asset:
𝑝ℎ (𝑠) = 𝛽 (𝑃ℎ (𝑠, 0)(0 + 𝑝ℎ (0)) + 𝑃ℎ (𝑠, 1)(1 + 𝑝ℎ (1))) , 𝑠 = 0, 1 (69.1)
These equations imply that the equilibrium price vector is
$$
\begin{bmatrix} p_h(0) \\ p_h(1) \end{bmatrix} = \beta [I - \beta P_h]^{-1} P_h \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{69.2}
$$
The first two rows of the table report 𝑝𝑎 (𝑠) and 𝑝𝑏 (𝑠).
Here’s a function that can be used to compute these values
def price_single_beliefs(transition, dividend_payoff, β=.75):
    """
    Function to Solve Single Beliefs
    """
    # First compute inverse piece
    imbq_inv = la.inv(np.eye(transition.shape[0]) - β * transition)

    # Next compute prices
    prices = β * imbq_inv @ transition @ dividend_payoff

    return prices
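To see that the closed form (69.2) and the recursion (69.1) agree, one can compute the price vector for type $a$ beliefs and substitute it back into the recursion. The following is a self-contained sketch that uses `np.linalg.solve` in place of an explicit inverse; the variable names are illustrative.

```python
import numpy as np

β = 0.75
Pa = np.array([[1/2, 1/2], [2/3, 1/3]])
d = np.array([0.0, 1.0])                      # dividend in states 0 and 1

# closed form (69.2), via a linear solve instead of an explicit inverse
p = β * np.linalg.solve(np.eye(2) - β * Pa, Pa @ d)

# the same vector must satisfy the recursion (69.1): p = β P (d + p)
residual = p - β * Pa @ (d + p)
print(p, np.abs(residual).max())
```

The resulting prices are $4/3$ and $11/9$, which round to the values 1.33 and 1.22 reported for $p_a$ in the table.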

Single Belief Prices as Benchmarks
These equilibrium prices under homogeneous beliefs are important benchmarks for the subsequent analysis.
• 𝑝ℎ (𝑠) tells what a type ℎ investor thinks is the “fundamental value” of the asset.
• Here “fundamental value” means the expected discounted present value of future dividends.
We will compare these fundamental values of the asset with equilibrium values when traders have different beliefs.
69.3.3 Pricing under Heterogeneous Beliefs
There are several cases to consider.

The first is when both types of agents have sufficient wealth to purchase all of the asset themselves.
In this case, the marginal investor who prices the asset is the more optimistic type so that the equilibrium price 𝑝̄ satisfies
Harrison and Kreps’s key equation:
$$
\bar p(s) = \beta \max \left\{ P_a(s,0) \bar p(0) + P_a(s,1)(1 + \bar p(1)), \; P_b(s,0) \bar p(0) + P_b(s,1)(1 + \bar p(1)) \right\} \tag{69.3}
$$
for $s = 0, 1$.
In the above equation, the max on the right side is over the two prospective values of next period’s payout from owning
the asset.
The marginal investor who prices the asset in state $s$ is of type $a$ if
$$
P_a(s,0) \bar p(0) + P_a(s,1)(1 + \bar p(1)) > P_b(s,0) \bar p(0) + P_b(s,1)(1 + \bar p(1))
$$
The marginal investor is of type $b$ if
$$
P_a(s,0) \bar p(0) + P_a(s,1)(1 + \bar p(1)) < P_b(s,0) \bar p(0) + P_b(s,1)(1 + \bar p(1))
$$
Thus the marginal investor is the (temporarily) optimistic type.

Equation (69.3) is a functional equation that, like a Bellman equation, can be solved by
• starting with a guess for the price vector 𝑝̄ and
• iterating to convergence on the operator that maps a guess 𝑝̄𝑗 into an updated guess 𝑝̄𝑗+1 defined by the right side
of (69.3), namely
𝑝̄𝑗+1 (𝑠) = 𝛽 max {𝑃𝑎 (𝑠, 0)𝑝̄𝑗 (0) + 𝑃𝑎 (𝑠, 1)(1 + 𝑝̄𝑗 (1)), 𝑃𝑏 (𝑠, 0)𝑝̄𝑗 (0) + 𝑃𝑏 (𝑠, 1)(1 + 𝑝̄𝑗 (1))} (69.4)
for 𝑠 = 0, 1.
The third row of the table labeled 𝑝𝑜 reports equilibrium prices that solve the functional equation when 𝛽 = .75.
Here the type that is optimistic about 𝑠𝑡+1 prices the asset in state 𝑠𝑡 .
It is instructive to compare these prices with the equilibrium prices of the homogeneous belief economies under
beliefs 𝑃𝑎 and 𝑃𝑏 , reported in the rows labeled 𝑝𝑎 and 𝑝𝑏 , respectively.
Equilibrium prices 𝑝𝑜 in the heterogeneous beliefs economy evidently exceed what any prospective investor regards as the
fundamental value of the asset in each possible state.
Nevertheless, the economy recurrently visits a state that makes each investor want to purchase the asset for more than he
believes its future dividends are worth.
An investor is willing to pay more than what he believes is warranted by fundamental value of the prospective dividend
stream because he expects to have the option later to sell the asset to another investor who will value the asset more highly
than he will then.
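One way to quantify this resale option is to subtract the highest fundamental value any investor assigns from the equilibrium price in each state. The sketch below does this under the Harrison and Kreps parameter values; the names `fundamental`, `p_bar` and `bubble` are illustrative, and a fixed number of iterations on (69.4) stands in for a convergence test.

```python
import numpy as np

β = 0.75
Pa = np.array([[1/2, 1/2], [2/3, 1/3]])
Pb = np.array([[2/3, 1/3], [1/4, 3/4]])
d = np.array([0.0, 1.0])

def fundamental(P):
    "Expected discounted dividends under beliefs P, as in (69.2)"
    return β * np.linalg.solve(np.eye(2) - β * P, P @ d)

# iterate on (69.4); rows of `vals` are investor types, columns are states
p_bar = np.zeros(2)
for _ in range(1000):
    vals = np.array([P @ (d + p_bar) for P in (Pa, Pb)])
    p_bar = β * vals.max(axis=0)

# excess of the price over the largest fundamental value, state by state
bubble = p_bar - np.maximum(fundamental(Pa), fundamental(Pb))
print(p_bar.round(2), bubble.round(2))
```

A strictly positive entry in `bubble` means that even the investor with the highest fundamental valuation in that state pays more than that valuation.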

• Investors of type $a$ are willing to pay the following price for the asset
$$
\hat p_a(s) = \begin{cases} \bar p(0) & \text{if } s_t = 0 \\ \beta (P_a(1,0) \bar p(0) + P_a(1,1)(1 + \bar p(1))) & \text{if } s_t = 1 \end{cases}
$$
• Investors of type $b$ are willing to pay the following price for the asset
$$
\hat p_b(s) = \begin{cases} \beta (P_b(0,0) \bar p(0) + P_b(0,1)(1 + \bar p(1))) & \text{if } s_t = 0 \\ \bar p(1) & \text{if } s_t = 1 \end{cases}
$$
Evidently, $\hat p_a(1) < \bar p(1)$ and $\hat p_b(0) < \bar p(0)$.
Investors of type 𝑎 want to sell the asset in state 1 while investors of type 𝑏 want to sell it in state 0.
• The asset changes hands whenever the state changes from 0 to 1 or from 1 to 0.
• The valuations $\hat p_a(s)$ and $\hat p_b(s)$ are displayed in the fifth and sixth rows of the table.
• Even pessimistic investors who don’t buy the asset think that it is worth more than they think future dividends are
worth.
Here’s code to solve for $\bar p$, $\hat p_a$ and $\hat p_b$ using the iterative method described above
def price_optimistic_beliefs(transitions, dividend_payoff, β=.75,
                             max_iter=50000, tol=1e-16):
    """
    Function to Solve Optimistic Beliefs
    """
    # We will guess an initial price vector of [0, 0]
    p_new = np.array([[0], [0]])
    p_old = np.array([[10.], [10.]])

    # We know this is a contraction mapping, so we can iterate to conv
    for i in range(max_iter):
        p_old = p_new
        # axis 0 indexes investor types, so the max is taken state by state
        p_new = β * np.max([q @ p_old
                            + q @ dividend_payoff for q in transitions],
                           0)

        # If we succeed in converging, break out of for loop
        if np.max(np.sqrt((p_new - p_old)**2)) < tol:
            break

    # valuation of the less optimistic type in each state
    ptwiddle = β * np.min([q @ p_old
                           + q @ dividend_payoff for q in transitions],
                          0)

    phat_a = np.array([p_new[0], ptwiddle[1]])
    phat_b = np.array([ptwiddle[0], p_new[1]])

    return p_new, phat_a, phat_b

69.3.4 Insufficient Funds
Outcomes differ when the more optimistic type of investor has insufficient wealth — or insufficient ability to borrow
enough — to hold the entire stock of the asset.
In this case, the asset price must adjust to attract pessimistic investors.
Instead of equation (69.3), the equilibrium price satisfies
$$
\check p(s) = \beta \min \left\{ P_a(s,0) \check p(0) + P_a(s,1)(1 + \check p(1)), \; P_b(s,0) \check p(0) + P_b(s,1)(1 + \check p(1)) \right\} \tag{69.5}
$$
and the marginal investor who prices the asset is always the one that values it less highly than does the other type.
Now the marginal investor is always the (temporarily) pessimistic type.
Notice from the row of the table labeled 𝑝𝑝 that the pessimistic price 𝑝𝑝 is lower than the homogeneous belief prices 𝑝𝑎 and 𝑝𝑏 in both
states.
When pessimistic investors price the asset according to (69.5), optimistic investors think that the asset is underpriced.
If they could, optimistic investors would willingly borrow at a one-period risk-free gross interest rate 𝛽 −1 to purchase
more of the asset.
Implicit constraints on leverage prohibit them from doing so.
When optimistic investors price the asset as in equation (69.3), pessimistic investors think that the asset is overpriced and
would like to sell the asset short.
Constraints on short sales prevent that.
Here’s code to solve for $\check p$ using iteration
def price_pessimistic_beliefs(transitions, dividend_payoff, β=.75,
                              max_iter=50000, tol=1e-16):
    """
    Function to Solve Pessimistic Beliefs
    """
    # We will guess an initial price vector of [0, 0]
    p_new = np.array([[0], [0]])
    p_old = np.array([[10.], [10.]])

    # We know this is a contraction mapping, so we can iterate to conv
    for i in range(max_iter):
        p_old = p_new
        # axis 0 indexes investor types, so the min is taken state by state
        p_new = β * np.min([q @ p_old
                            + q @ dividend_payoff for q in transitions],
                           0)

        # If we succeed in converging, break out of for loop
        if np.max(np.sqrt((p_new - p_old)**2)) < tol:
            break

    return p_new
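The iteration on (69.5) also reveals which type is the marginal investor in each state. The sketch below is self-contained and uses one-dimensional arrays; the names `p_check` and `marginal` are illustrative, and a fixed iteration count is assumed to be enough for convergence because the operator is a contraction with modulus $\beta$.

```python
import numpy as np

β = 0.75
Pa = np.array([[1/2, 1/2], [2/3, 1/3]])
Pb = np.array([[2/3, 1/3], [1/4, 3/4]])
d = np.array([0.0, 1.0])

p_check = np.zeros(2)
for _ in range(1000):
    # rows of `vals` are investor types, columns are states
    vals = np.array([P @ (d + p_check) for P in (Pa, Pb)])
    p_check = β * vals.min(axis=0)       # the pessimist prices the asset, as in (69.5)

marginal = vals.argmin(axis=0)           # 0 = type a, 1 = type b, for each state
print(p_check, marginal)
```

Here type $b$ is marginal in state 0 and type $a$ in state 1, the reverse of the assignment under the optimistic pricing equation (69.3).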

69.3.5 Further Interpretation
Jose Scheinkman [Scheinkman, 2014] interprets the Harrison-Kreps model as a model of a bubble — a situation in which
an asset price exceeds what every investor thinks is merited by his or her beliefs about the value of the asset’s underlying
dividend stream.
Scheinkman stresses these features of the Harrison-Kreps model:
• High volume occurs when the Harrison-Kreps pricing formula (69.3) prevails.
• Type 𝑎 investors sell the entire stock of the asset to type 𝑏 investors every time the state switches from 𝑠𝑡 = 0 to
𝑠𝑡 = 1.
• Type 𝑏 investors sell the asset to type 𝑎 investors every time the state switches from 𝑠𝑡 = 1 to 𝑠𝑡 = 0.
Scheinkman takes this as a strength of the model because he observes high volume during famous bubbles.
• If the supply of the asset is increased sufficiently either physically (more “houses” are built) or artificially (ways
are invented to short sell “houses”), bubbles end when the asset supply has grown enough to outstrip optimistic
investors’ resources for purchasing the asset.
• If optimistic investors finance their purchases by borrowing, tightening leverage constraints can extinguish a bubble.
Scheinkman extracts insights about the effects of financial regulations on bubbles.
He emphasizes how limiting short sales and limiting leverage have opposite effects.
69.4 Exercises
Exercise 69.4.1
This exercise invites you to recreate the summary table using the functions we have built above.
$s_t$            0      1
$p_a$          1.33   1.22
$p_b$          1.45   1.91
$p_o$          1.85   2.08
$p_p$          1      1
$\hat{p}_a$    1.85   1.69
$\hat{p}_b$    1.69   2.08
You will want first to define the transition matrices and dividend payoff vector.
In addition, below we’ll add an interpretation of the row corresponding to 𝑝𝑜 by inventing two additional types of agents,
one of whom is permanently optimistic, the other who is permanently pessimistic.
We construct subjective transition probability matrices for our permanently optimistic and permanently pessimistic in-
vestors as follows.
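The permanently optimistic investor (i.e., the investor with the most optimistic beliefs in each state) believes the transition
matrix is
$$
P_o = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{4} & \frac{3}{4} \end{bmatrix}
$$
The permanently pessimistic investor believes the transition matrix is
$$
P_p = \begin{bmatrix} \frac{2}{3} & \frac{1}{3} \\ \frac{2}{3} & \frac{1}{3} \end{bmatrix}
$$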

We’ll use these transition matrices when we present our solution of exercise 1 below.
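These two matrices can be assembled mechanically from $P_a$ and $P_b$ by picking, in each state, the row that assigns the higher (or lower) probability to the high-dividend state. The sketch below illustrates that construction; the name `a_more_optimistic` is introduced here for illustration only.

```python
import numpy as np

qa = np.array([[1/2, 1/2], [2/3, 1/3]])
qb = np.array([[2/3, 1/3], [1/4, 3/4]])

# in each state (row), compare the probability each type assigns to state 1
a_more_optimistic = qa[:, 1] >= qb[:, 1]

qopt = np.where(a_more_optimistic[:, None], qa, qb)    # most optimistic row per state
qpess = np.where(a_more_optimistic[:, None], qb, qa)   # most pessimistic row per state
print(qopt)
print(qpess)
```

Row-wise selection works here because with two states a belief is fully ranked by the probability it puts on state 1.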

First, we will obtain equilibrium price vectors with homogeneous beliefs, including when all investors are optimistic or
pessimistic.
qa = np.array([[1/2, 1/2], [2/3, 1/3]]) # Type a transition matrix

qb = np.array([[2/3, 1/3], [1/4, 3/4]]) # Type b transition matrix
# Optimistic investor transition matrix
qopt = np.array([[1/2, 1/2], [1/4, 3/4]])
# Pessimistic investor transition matrix
qpess = np.array([[2/3, 1/3], [2/3, 1/3]])
dividendreturn = np.array([[0], [1]])
transitions = [qa, qb, qopt, qpess]

labels = ['p_a', 'p_b', 'p_optimistic', 'p_pessimistic']
for transition, label in zip(transitions, labels):
    print(label)
    print("=" * 20)
    s0, s1 = np.round(price_single_beliefs(transition, dividendreturn), 2)
    print(f"State 0: {s0}")
    print(f"State 1: {s1}")
    print("-" * 20)
p_a
====================
State 0: [1.33]
State 1: [1.22]
--------------------
p_b
====================
State 0: [1.45]
State 1: [1.91]
--------------------
p_optimistic
====================
State 0: [1.85]
State 1: [2.08]
--------------------
p_pessimistic
====================
State 0: [1.]
State 1: [1.]
--------------------
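As a cross-check on the homogeneous-belief prices, the following is a minimal sketch that solves the pricing equation directly. It assumes the recursion p = β P (p + d) and a discount factor β = 0.75, neither of which is restated in this section, and the helper name price_homogeneous is hypothetical (it is not the price_single_beliefs function used above, whose details may differ).

```python
import numpy as np

beta = 0.75  # assumed discount factor; not restated in this section

def price_homogeneous(P, d, beta=beta):
    """Solve p = beta * P @ (p + d) for the price vector p."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - beta * P, beta * P @ d)

d = np.array([0.0, 1.0])                     # dividend in states 0 and 1
P_a = np.array([[1/2, 1/2], [2/3, 1/3]])     # type a beliefs
P_pess = np.array([[2/3, 1/3], [2/3, 1/3]])  # permanently pessimistic beliefs

p_a = price_homogeneous(P_a, d)
p_pess = price_homogeneous(P_pess, d)
print(np.round(p_a, 2), np.round(p_pess, 2))
```

Under these assumptions the rounded values agree with the p_a and p_pessimistic rows printed above.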
We will use the price_optimistic_beliefs function to find the price under heterogeneous beliefs.
opt_beliefs = price_optimistic_beliefs([qa, qb], dividendreturn)

labels = ['p_optimistic', 'p_hat_a', 'p_hat_b']
for p, label in zip(opt_beliefs, labels):
    print(label)
    print("=" * 20)
    s0, s1 = np.round(p, 2)
    print(f"State 0: {s0}")
    print(f"State 1: {s1}")
    print("-" * 20)
p_optimistic
====================
State 0: [1.85]
State 1: [2.08]
--------------------
p_hat_a
====================
State 0: [1.85]
State 1: [1.69]
--------------------
p_hat_b
====================
State 0: [1.69]
State 1: [2.08]
--------------------
Notice that the equilibrium price with heterogeneous beliefs is equal to the price under single beliefs with permanently
optimistic investors - this is due to the marginal investor in the heterogeneous beliefs equilibrium always being the type
who is temporarily optimistic.


Part XII
Data and Empirics
CHAPTER
SEVENTY
PANDAS FOR PANEL DATA
Contents
• Pandas for Panel Data

– Overview
– Slicing and Reshaping Data
– Merging Dataframes and Filling NaNs
– Grouping and Summarizing Data
– Final Remarks
– Exercises
70.1 Overview
In an earlier lecture on pandas, we looked at working with simple data sets.

Econometricians often need to work with more complex data sets, such as panels.
Common tasks include
• Importing data, cleaning it and reshaping it across several axes.
• Selecting a time series or cross-section from a panel.
• Grouping and summarizing data.
pandas (derived from ‘panel’ and ‘data’) contains powerful and easy-to-use tools for solving exactly these kinds of
problems.
In what follows, we will use a panel data set of real minimum wages from the OECD to create:
• summary statistics over multiple dimensions of our data
• a time series of the average minimum wage of countries in the dataset
• kernel density estimates of wages by continent
We will begin by reading in our long format panel data from a CSV file and reshaping the resulting DataFrame with
pivot_table to build a MultiIndex.
Additional detail will be added to our DataFrame using pandas’ merge function, and data will be summarized with
the groupby function.
70.2 Slicing and Reshaping Data
We will read in a dataset from the OECD of real minimum wages in 32 countries and assign it to realwage.
The dataset can be accessed with the following link:
url1 = 'https://raw.githubusercontent.com/QuantEcon/lecture-python/master/source/_static/lecture_specific/pandas_panel/realwage.csv'
import pandas as pd
# Display 6 columns for viewing purposes

pd.set_option('display.max_columns', 6)
# Reduce decimal points to 2

pd.options.display.float_format = '{:,.2f}'.format
realwage = pd.read_csv(url1)
Let’s have a look at what we’ve got to work with
realwage.head() # Show first 5 rows
Unnamed: 0 Time Country Series \

0 0 2006-01-01 Ireland In 2015 constant prices at 2015 USD PPPs
Pay period value

0 Annual 17,132.44
1 Annual 18,100.92
2 Annual 17,747.41
3 Annual 18,580.14
4 Annual 18,755.83
The data is currently in long format, which is difficult to analyze when there are several dimensions to the data.
We will use pivot_table to create a wide format panel, with a MultiIndex to handle higher dimensional data.
pivot_table arguments should specify the data (values), the index, and the columns we want in our resulting
dataframe.
By passing a list in columns, we can create a MultiIndex in our column axis
realwage = realwage.pivot_table(values='value',
index='Time',
columns=['Country', 'Series', 'Pay period'])
realwage.head()
Country Australia \
Series In 2015 constant prices at 2015 USD PPPs
Pay period Annual Hourly
Time
2006-01-01 20,410.65 10.33
2007-01-01 21,087.57 10.67
2008-01-01 20,718.24 10.48
2009-01-01 20,984.77 10.62
2010-01-01 20,879.33 10.57
Country ... \
Series In 2015 constant prices at 2015 USD exchange rates ...
Pay period Annual ...
Time ...
2006-01-01 23,826.64 ...
2007-01-01 24,616.84 ...
2008-01-01 24,185.70 ...
2009-01-01 24,496.84 ...
2010-01-01 24,373.76 ...
Country United States \

Pay period Hourly
Time
2006-01-01 6.05
2007-01-01 6.24
2008-01-01 6.78
2009-01-01 7.58
2010-01-01 7.88
Country
Series In 2015 constant prices at 2015 USD exchange rates
Time
2006-01-01 12,594.40 6.05
2007-01-01 12,974.40 6.24
2008-01-01 14,097.56 6.78
2009-01-01 15,756.42 7.58
2010-01-01 16,391.31 7.88
[5 rows x 128 columns]
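To see the long-to-wide reshaping on a scale that fits on one screen, here is a small sketch with made-up data. The country names 'A' and 'B' and all values are hypothetical; only the pivot_table call mirrors the one used above.

```python
import pandas as pd

# Hypothetical long-format panel: one row per (time, country, pay period)
long_df = pd.DataFrame({
    'Time': ['2006', '2006', '2006', '2006', '2007', '2007', '2007', '2007'],
    'Country': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Pay period': ['Annual', 'Hourly'] * 4,
    'value': [20000.0, 10.0, 8000.0, 4.0, 21000.0, 10.5, 8400.0, 4.2],
})

# Passing a list to `columns` produces a two-level column MultiIndex
wide_df = long_df.pivot_table(values='value', index='Time',
                              columns=['Country', 'Pay period'])
print(wide_df)
print(wide_df.columns.nlevels)
```

Each row of the result is a time period, and each column is identified by a (Country, Pay period) pair.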
To make it easier to filter our time series data later on, we will convert the index into a DatetimeIndex
realwage.index = pd.to_datetime(realwage.index)
type(realwage.index)
pandas.core.indexes.datetimes.DatetimeIndex
The columns contain multiple levels of indexing, known as a MultiIndex, with levels being ordered hierarchically
(Country > Series > Pay period).
A MultiIndex is the simplest and most flexible way to manage panel data in pandas
type(realwage.columns)
pandas.core.indexes.multi.MultiIndex
realwage.columns.names
FrozenList(['Country', 'Series', 'Pay period'])
Like before, we can select the country (the top level of our MultiIndex)
realwage['United States'].head()
Series In 2015 constant prices at 2015 USD PPPs \

Time
2006-01-01 12,594.40 6.05
2007-01-01 12,974.40 6.24
2008-01-01 14,097.56 6.78
2009-01-01 15,756.42 7.58
2010-01-01 16,391.31 7.88

Time
2006-01-01 12,594.40 6.05
2007-01-01 12,974.40 6.24
2008-01-01 14,097.56 6.78
2009-01-01 15,756.42 7.58
2010-01-01 16,391.31 7.88
Stacking and unstacking levels of the MultiIndex will be used throughout this lecture to reshape our dataframe into a
format we need.
.stack() rotates the lowest level of the column MultiIndex to the row index (.unstack() works in the opposite
direction - try it out)
realwage.stack().head()
Country Australia \
Time Pay period
2006-01-01 Annual 20,410.65
Hourly 10.33
2007-01-01 Annual 21,087.57
Hourly 10.67
2008-01-01 Annual 20,718.24
Country \
Time Pay period
2006-01-01 Annual 23,826.64
Hourly 12.06
2007-01-01 Annual 24,616.84
Hourly 12.46
2008-01-01 Annual 24,185.70
Country Belgium ... \

Series In 2015 constant prices at 2015 USD PPPs ...


Time Pay period ...
2006-01-01 Annual 21,042.28 ...
Hourly 10.09 ...
2007-01-01 Annual 21,310.05 ...
Hourly 10.22 ...
2008-01-01 Annual 21,416.96 ...
Country United Kingdom \

Time Pay period
2006-01-01 Annual 20,376.32
Hourly 9.81
2007-01-01 Annual 20,954.13
Hourly 10.07
2008-01-01 Annual 20,902.87
Country United States \

Time Pay period
2006-01-01 Annual 12,594.40
Hourly 6.05
2007-01-01 Annual 12,974.40
Hourly 6.24
2008-01-01 Annual 14,097.56
Country
Time Pay period
2006-01-01 Annual 12,594.40
Hourly 6.05
2007-01-01 Annual 12,974.40
Hourly 6.24
2008-01-01 Annual 14,097.56
We can also pass in an argument to select the level we would like to stack
realwage.stack(level='Country').head()
Series In 2015 constant prices at 2015 USD PPPs \

Time Country
2006-01-01 Australia 20,410.65 10.33
Belgium 21,042.28 10.09
Brazil 3,310.51 1.41
Canada 13,649.69 6.56
Chile 5,201.65 2.22

Time Country
2006-01-01 Australia 23,826.64 12.06
Belgium 20,228.74 9.70
Brazil 2,032.87 0.87
Canada 14,335.12 6.89
Chile 3,333.76 1.42
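The effect of .stack() and .unstack() is easier to follow on a tiny frame. The sketch below uses a hypothetical two-country panel with made-up values; recent pandas releases may print a FutureWarning about the stack implementation, which does not affect the result here.

```python
import pandas as pd

# Hypothetical wide panel with (Country, Pay period) column levels
cols = pd.MultiIndex.from_product([['A', 'B'], ['Annual', 'Hourly']],
                                  names=['Country', 'Pay period'])
wide_df = pd.DataFrame([[20000.0, 10.0, 8000.0, 4.0],
                        [21000.0, 10.5, 8400.0, 4.2]],
                       index=pd.Index(['2006', '2007'], name='Time'),
                       columns=cols)

stacked = wide_df.stack()                    # moves 'Pay period' into the row index
restored = stacked.unstack()                 # moves it back to the columns
by_country = wide_df.stack(level='Country')  # stack a named level instead

print(stacked)
print(by_country)
```

In stacked the rows are indexed by (Time, Pay period), while in by_country they are indexed by (Time, Country).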
Using a DatetimeIndex makes it easy to select a particular time period.

Selecting one year and stacking the two lower levels of the MultiIndex creates a cross-section of our panel data
realwage.loc['2015'].stack(level=(1, 2)).transpose().head()
Time 2015-01-01 \
Country
Australia 21,715.53 10.99
Belgium 21,588.12 10.35
Brazil 4,628.63 2.00
Canada 16,536.83 7.95
Chile 6,633.56 2.80
Time
Country
Australia 25,349.90 12.83
Belgium 20,753.48 9.95
Brazil 2,842.28 1.21
Canada 17,367.24 8.35
Chile 4,251.49 1.81
For the rest of the lecture, we will work with a dataframe of the hourly real minimum wages across countries and time,
measured in 2015 US dollars.
To create our filtered dataframe (realwage_f), we can use the xs method to select values at lower levels in the
multiindex, while keeping the higher levels (countries in this case)
realwage_f = realwage.xs(('Hourly', 'In 2015 constant prices at 2015 USD exchange rates'),
                         level=('Pay period', 'Series'), axis=1)

realwage_f.head()
Country Australia Belgium Brazil ... Turkey United Kingdom \

Time ...
2006-01-01 12.06 9.70 0.87 ... 2.27 9.81
2007-01-01 12.46 9.82 0.92 ... 2.26 10.07
2008-01-01 12.24 9.87 0.96 ... 2.22 10.04
2009-01-01 12.40 10.21 1.03 ... 2.28 10.15
2010-01-01 12.34 10.05 1.08 ... 2.30 9.96
Country United States

Time
2006-01-01 6.05
2007-01-01 6.24
2008-01-01 6.78
2009-01-01 7.58


2010-01-01 7.88
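A cross-section with xs can be sketched on a small hypothetical panel. The level values ('Exchange rate', 'PPP') and the numbers below are invented for illustration; the call pattern matches the one used to build realwage_f.

```python
import pandas as pd

# Hypothetical three-level column index: Country > Series > Pay period
cols = pd.MultiIndex.from_product(
    [['A', 'B'], ['Exchange rate', 'PPP'], ['Annual', 'Hourly']],
    names=['Country', 'Series', 'Pay period'])
data = [[float(i + 10 * t) for i in range(8)] for t in range(2)]
panel = pd.DataFrame(data, index=pd.Index(['2006', '2007'], name='Time'),
                     columns=cols)

# Fix the two lower levels and keep one column per country
hourly_fx = panel.xs(('Hourly', 'Exchange rate'),
                     level=('Pay period', 'Series'), axis=1)
print(hourly_fx)
```

The selected levels are dropped from the result, so only the Country level remains in the columns.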
70.3 Merging Dataframes and Filling NaNs
Similar to relational databases like SQL, pandas has built in methods to merge datasets together.
Using country information from WorldData.info, we’ll add the continent of each country to realwage_f with the
merge function.
↪static/lecture_specific/pandas_panel/countries.csv'
worlddata = pd.read_csv(url2, sep=';')

worlddata.head()
Country (en) Country (de) Country (local) ... Deathrate \

0 Afghanistan Afghanistan Afganistan/Afqanestan ... 13.70
1 Egypt Ägypten Misr ... 4.70
2 Åland Islands Ålandinseln Åland ... 0.00
3 Albania Albanien Shqipëria ... 6.70
4 Algeria Algerien Al-Jaza’ir/Algérie ... 4.30
Life expectancy Url

0 51.30 https://www.laenderdaten.info/Asien/Afghanista...
1 72.70 https://www.laenderdaten.info/Afrika/Aegypten/...
2 0.00 https://www.laenderdaten.info/Europa/Aland/ind...
3 78.30 https://www.laenderdaten.info/Europa/Albanien/...
4 76.80 https://www.laenderdaten.info/Afrika/Algerien/...
First, we’ll select just the country and continent variables from worlddata and rename the column to ‘Country’
worlddata = worlddata[['Country (en)', 'Continent']]

worlddata = worlddata.rename(columns={'Country (en)': 'Country'})
worlddata.head()
Country Continent
0 Afghanistan Asia
1 Egypt Africa
2 Åland Islands Europe
3 Albania Europe
4 Algeria Africa
We want to merge our new dataframe, worlddata, with realwage_f.

The pandas merge function allows dataframes to be joined together by rows.
Our dataframes will be merged using country names, requiring us to use the transpose of realwage_f so that rows
correspond to country names in both dataframes
realwage_f.transpose().head()
Time 2006-01-01 2007-01-01 2008-01-01 ... 2014-01-01 2015-01-01 \

Country ...
Australia 12.06 12.46 12.24 ... 12.67 12.83
Belgium 9.70 9.82 9.87 ... 10.01 9.95
Brazil 0.87 0.92 0.96 ... 1.21 1.21
Canada 6.89 6.96 7.24 ... 8.22 8.35
Chile 1.42 1.45 1.44 ... 1.76 1.81
Time 2016-01-01
Country
Australia 12.98
Belgium 9.76
Brazil 1.24
Canada 8.48
Chile 1.91
We can use either left, right, inner, or outer join to merge our datasets:
• left join includes only countries from the left dataset
• right join includes only countries from the right dataset
• outer join includes countries that are in either the left or the right dataset
• inner join includes only countries common to both the left and right datasets
By default, merge will use an inner join.
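The four join types can be compared side by side on a toy example. The two small dataframes below are hypothetical; they are set up so that 'Korea' appears only on the left and 'Chile' only on the right.

```python
import pandas as pd

# Hypothetical data: wages indexed by country, continents in a column
wages = pd.DataFrame({'wage': [12.0, 9.7, 5.3]},
                     index=pd.Index(['Australia', 'Belgium', 'Korea'],
                                    name='Country'))
info = pd.DataFrame({'Country': ['Australia', 'Belgium', 'Chile'],
                     'Continent': ['Australia', 'Europe', 'South America']})

results = {}
for how in ['left', 'right', 'inner', 'outer']:
    results[how] = pd.merge(wages, info, how=how,
                            left_index=True, right_on='Country')
    print(how, len(results[how]), 'rows')
```

The left join keeps 'Korea' with a missing continent, the right join keeps 'Chile' with a missing wage, the inner join keeps only the two shared countries, and the outer join keeps all four.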
Here we will pass how='left' to keep all countries in realwage_f, but discard countries in worlddata that do
not have a corresponding data entry in realwage_f.
This is illustrated by the red shading in the following diagram
We will also need to specify where the country name is located in each dataframe, which will be the key that is used to
merge the dataframes ‘on’.
Our ‘left’ dataframe (realwage_f.transpose()) contains countries in the index, so we set left_index=True.
Our ‘right’ dataframe (worlddata) contains countries in the ‘Country’ column, so we set right_on='Country'
merged = pd.merge(realwage_f.transpose(), worlddata,

how='left', left_index=True, right_on='Country')
merged.head()
2006-01-01 00:00:00 2007-01-01 00:00:00 2008-01-01 00:00:00 ... \

17.00 12.06 12.46 12.24 ...
23.00 9.70 9.82 9.87 ...
32.00 0.87 0.92 0.96 ...
100.00 6.89 6.96 7.24 ...
38.00 1.42 1.45 1.44 ...
2016-01-01 00:00:00 Country Continent



17.00 12.98 Australia Australia
23.00 9.76 Belgium Europe
32.00 1.24 Brazil South America
100.00 8.48 Canada North America
38.00 1.91 Chile South America
Countries that appeared in realwage_f but not in worlddata will have NaN in the Continent column.
To check whether this has occurred, we can use .isnull() on the continent column and filter the merged dataframe
merged[merged['Continent'].isnull()]
2006-01-01 00:00:00 2007-01-01 00:00:00 2008-01-01 00:00:00 ... \

NaN 3.42 3.74 3.87 ...
NaN 0.23 0.45 0.39 ...
NaN 1.50 1.64 1.71 ...

NaN 5.28 Korea NaN
NaN 0.55 Russian Federation NaN
NaN 2.08 Slovak Republic NaN
We have three missing values!

One option to deal with NaN values is to create a dictionary containing these countries and their respective continents.
.map() will match countries in merged['Country'] with their continent from the dictionary.
Notice how countries not in our dictionary are mapped with NaN
missing_continents = {'Korea': 'Asia',
                      'Russian Federation': 'Europe',
                      'Slovak Republic': 'Europe'}
merged['Country'].map(missing_continents)
17.00 NaN
23.00 NaN
32.00 NaN
100.00 NaN
38.00 NaN
108.00 NaN
41.00 NaN
225.00 NaN
53.00 NaN
58.00 NaN
45.00 NaN
68.00 NaN
233.00 NaN
86.00 NaN
88.00 NaN
91.00 NaN
NaN Asia
117.00 NaN
122.00 NaN
123.00 NaN
138.00 NaN
153.00 NaN
151.00 NaN
174.00 NaN
175.00 NaN
NaN Europe
NaN Europe
198.00 NaN
200.00 NaN
227.00 NaN
241.00 NaN
240.00 NaN
Name: Country, dtype: object
We don’t want to overwrite the entire series with this mapping.

.fillna() only fills in NaN values in merged['Continent'] with the mapping, while leaving other values in the
column unchanged
merged['Continent'] = merged['Continent'].fillna(merged['Country'].map(missing_continents))
# Check for whether continents were correctly mapped
merged[merged['Country'] == 'Korea']

2006-01-01 00:00:00 2007-01-01 00:00:00 2008-01-01 00:00:00 ... \

NaN 3.42 3.74 3.87 ...

NaN 5.28 Korea Asia
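The map-then-fillna pattern can be checked on a three-row example. The dataframe and dictionary below are hypothetical stand-ins for merged and missing_continents.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'Korea', 'Chile'],
                   'Continent': ['Europe', None, 'South America']})
missing = {'Korea': 'Asia'}

mapped = df['Country'].map(missing)               # NaN except for Korea
df['Continent'] = df['Continent'].fillna(mapped)  # fill only the gaps
print(df)
```

Existing continent values are left untouched because fillna only replaces entries that are missing.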
We will also combine the Americas into a single continent - this will make our visualization nicer later on.
To do this, we will use .replace() and loop through a list of the continent values we want to replace
replace = ['Central America', 'North America', 'South America']
for country in replace:
    # Assign the result back rather than using inplace=True on a column selection
    merged['Continent'] = merged['Continent'].replace(to_replace=country,
                                                      value='America')
Now that we have all the data we want in a single DataFrame, we will reshape it back into panel form with a Multi-
Index.
We should also make sure to sort the index using .sort_index() so that we can efficiently filter our dataframe later on.
By default, levels will be sorted top-down
merged = merged.set_index(['Continent', 'Country']).sort_index()

merged.head()
2006-01-01 2007-01-01 2008-01-01 ... 2014-01-01 \

Continent Country ...
America Brazil 0.87 0.92 0.96 ... 1.21
Canada 6.89 6.96 7.24 ... 8.22
Chile 1.42 1.45 1.44 ... 1.76
Colombia 1.01 1.02 1.01 ... 1.13
Costa Rica NaN NaN NaN ... 2.41
2015-01-01 2016-01-01
Continent Country
America Brazil 1.21 1.24
Canada 8.35 8.48
Chile 1.81 1.91
Colombia 1.13 1.12
Costa Rica 2.56 2.63
While merging, we lost our DatetimeIndex, as we merged columns that were not in datetime format
merged.columns
Index([2006-01-01 00:00:00, 2007-01-01 00:00:00, 2008-01-01 00:00:00,

2009-01-01 00:00:00, 2010-01-01 00:00:00, 2011-01-01 00:00:00,
2012-01-01 00:00:00, 2013-01-01 00:00:00, 2014-01-01 00:00:00,
2015-01-01 00:00:00, 2016-01-01 00:00:00],
dtype='object')

Now that we have set the merged columns as the index, we can recreate a DatetimeIndex using .to_datetime()
merged.columns = pd.to_datetime(merged.columns)
merged.columns = merged.columns.rename('Time')
merged.columns
DatetimeIndex(['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01',

'2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01',
'2014-01-01', '2015-01-01', '2016-01-01'],
dtype='datetime64[ns]', name='Time', freq=None)
The DatetimeIndex tends to work more smoothly in the row axis, so we will go ahead and transpose merged
merged = merged.transpose()
merged.head()
Continent America ... Europe

Country Brazil Canada Chile ... Slovenia Spain United Kingdom
Time ...
2006-01-01 0.87 6.89 1.42 ... 3.92 3.99 9.81
2007-01-01 0.92 6.96 1.45 ... 3.88 4.10 10.07
2008-01-01 0.96 7.24 1.44 ... 3.96 4.14 10.04
2009-01-01 1.03 7.67 1.52 ... 4.08 4.32 10.15
2010-01-01 1.08 7.94 1.56 ... 4.81 4.30 9.96
70.4 Grouping and Summarizing Data
Grouping and summarizing data can be particularly useful for understanding large panel datasets.
A simple way to summarize data is to call an aggregation method on the dataframe, such as .mean() or .max().
For example, we can calculate the average real minimum wage for each country over the period 2006 to 2016 (the default
is to aggregate over rows)
merged.mean().head(10)
Continent Country
America Brazil 1.09
Canada 7.82
Chile 1.62
Colombia 1.07
Costa Rica 2.53
Mexico 0.53
United States 7.15
Asia Israel 5.95
Japan 6.18
Korea 4.22
dtype: float64
Using this series, we can plot the average real minimum wage over the past decade for each country in our data set


import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()
merged.mean().sort_values(ascending=False).plot(kind='bar',
                                                title="Average real minimum wage 2006 - 2016")
# Set country labels
country_labels = merged.mean().sort_values(ascending=False).index.get_level_values('Country').tolist()
plt.xticks(range(0, len(country_labels)), country_labels)

plt.xlabel('Country')
plt.show()
Passing in axis=1 to .mean() will aggregate over columns (giving the average minimum wage for all countries over
time)
merged.mean(axis=1).head()
Time
2006-01-01 4.69
2007-01-01 4.84
2008-01-01 4.90
2009-01-01 5.08
2010-01-01 5.11
dtype: float64
We can plot this time series as a line graph
merged.mean(axis=1).plot()
plt.title('Average real minimum wage 2006 - 2016')
plt.ylabel('2015 USD')
plt.xlabel('Year')
plt.show()
We can also specify a level of the MultiIndex (in the column axis) to aggregate over
merged.groupby(level='Continent', axis=1).mean().head()
/tmp/ipykernel_9357/646686994.py:1: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.
  merged.groupby(level='Continent', axis=1).mean().head()
Continent America Asia Australia Europe

Time
2006-01-01 2.80 4.29 10.25 4.80
2007-01-01 2.85 4.44 10.73 4.94
2008-01-01 2.99 4.45 10.76 4.99
2009-01-01 3.23 4.53 10.97 5.16
2010-01-01 3.34 4.53 10.95 5.17
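The FutureWarning printed above recommends grouping the transposed frame instead of passing axis=1. A minimal sketch of that alternative, on a small hypothetical frame with the same (Continent, Country) column layout, might look like this:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('America', 'Brazil'), ('America', 'Canada'), ('Europe', 'Spain')],
    names=['Continent', 'Country'])
df = pd.DataFrame([[1.0, 7.0, 4.0], [2.0, 8.0, 5.0]],
                  index=pd.to_datetime(['2006-01-01', '2007-01-01']),
                  columns=cols)

# Group the transposed frame by continent, average, then transpose back
by_continent = df.T.groupby(level='Continent').mean().T
print(by_continent)
```

The result has one column per continent and one row per date, which is the same layout the axis=1 call produces.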
We can plot the average minimum wages in each continent as a time series
merged.groupby(level='Continent', axis=1).mean().plot()
plt.title('Average real minimum wage')
plt.xlabel('Year')
plt.show()

We will drop Australia as a continent for plotting purposes

merged = merged.drop('Australia', level='Continent', axis=1)
merged.groupby(level='Continent', axis=1).mean().plot()
plt.title('Average real minimum wage')
plt.xlabel('Year')
plt.show()

.describe() is useful for quickly retrieving a number of common summary statistics
merged.stack().describe()
Continent America Asia Europe

count 69.00 44.00 200.00
mean 3.19 4.70 5.15
std 3.02 1.56 3.82
min 0.52 2.22 0.23
25% 1.03 3.37 2.02
50% 1.44 5.48 3.54
75% 6.96 5.95 9.70
max 8.48 6.65 12.39

This is a simplified way to use groupby.

Using groupby generally follows a ‘split-apply-combine’ process:
• split: data is grouped based on one or more keys
• apply: a function is called on each group independently
• combine: the results of the function calls are combined into a new data structure
The groupby method achieves the first step of this process, creating a new DataFrameGroupBy object with data
split into groups.
Let’s split merged by continent again, this time using the groupby function, and name the resulting object grouped
grouped = merged.groupby(level='Continent', axis=1)

grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f06f4023950>
Calling an aggregation method on the object applies the function to each group, the results of which are combined in a
new data structure.
For example, we can return the number of countries in our dataset for each continent using .size().
In this case, our new data structure is a Series
grouped.size()
Continent
America 7
Asia 4
Europe 19
dtype: int64
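The three split-apply-combine steps can be written out by hand to show what an aggregation method does in a single call. The data below is hypothetical, and grouping is done on an ordinary column rather than on a MultiIndex level to keep the sketch short.

```python
import pandas as pd

df = pd.DataFrame({'Continent': ['America', 'America', 'Europe'],
                   'Country': ['Brazil', 'Canada', 'Spain'],
                   'wage': [1.2, 8.5, 4.3]})

grouped = df.groupby('Continent')          # split
means = {}
for name, group in grouped:
    means[name] = group['wage'].mean()     # apply
manual = pd.Series(means, name='wage')     # combine

print(manual)
print(grouped['wage'].mean())              # the same steps in one call
```

Both printed series contain one average per continent.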
Calling .get_group() to return just the countries in a single group, we can create a kernel density estimate of the
distribution of real minimum wages in 2015 for each continent.
grouped.groups.keys() will return the keys from the groupby object
continents = grouped.groups.keys()
for continent in continents:
    sns.kdeplot(grouped.get_group(continent).loc['2015'].unstack(), label=continent, fill=True)
plt.title('Real minimum wages in 2015')
plt.xlabel('US dollars')
plt.legend()
plt.show()



70.5 Final Remarks
This lecture has provided an introduction to some of pandas’ more advanced features, including multiindices, merging,
grouping and plotting.
Other tools that may be useful in panel data analysis include xarray, a Python package that extends pandas to N-dimensional
data structures.

70.6 Exercises
Exercise 70.6.1
In these exercises, you’ll work with a dataset of employment rates in Europe by age and sex from Eurostat.
↪static/lecture_specific/pandas_panel/employ.csv'
Reading in the CSV file returns a panel dataset in long format. Use .pivot_table() to construct a wide format
dataframe with a MultiIndex in the columns.
Start off by exploring the dataframe and the variables available in the MultiIndex levels.
Write a program that quickly returns all values in the MultiIndex.
employ = pd.read_csv(url3)
employ = employ.pivot_table(values='Value',
index=['DATE'],
columns=['UNIT','AGE', 'SEX', 'INDIC_EM', 'GEO'])
employ.index = pd.to_datetime(employ.index) # ensure that dates are datetime format
employ.head()
UNIT Percentage of total population ... \

AGE From 15 to 24 years ...
SEX Females ...
INDIC_EM Active population ...
GEO Austria Belgium Bulgaria ...
DATE ...
2007-01-01 56.00 31.60 26.00 ...
2008-01-01 56.20 30.80 26.10 ...
2009-01-01 56.20 29.90 24.80 ...
2010-01-01 54.00 29.80 26.60 ...
2011-01-01 54.80 29.80 24.80 ...
UNIT Thousand persons \

AGE From 55 to 64 years
SEX Total
INDIC_EM Total employment (resident population concept - LFS)
GEO Switzerland Turkey
DATE
2007-01-01 NaN 1,282.00
2008-01-01 NaN 1,354.00
2009-01-01 NaN 1,449.00
2010-01-01 640.00 1,583.00
2011-01-01 661.00 1,760.00
UNIT
AGE
SEX


INDIC_EM
GEO United Kingdom
DATE
2007-01-01 4,131.00
2008-01-01 4,204.00
2009-01-01 4,193.00
2010-01-01 4,186.00
2011-01-01 4,164.00
This is a large dataset so it is useful to explore the levels and variables available
employ.columns.names
FrozenList(['UNIT', 'AGE', 'SEX', 'INDIC_EM', 'GEO'])
Variables within levels can be quickly retrieved with a loop
for name in employ.columns.names:
    print(name, employ.columns.get_level_values(name).unique())
UNIT Index(['Percentage of total population', 'Thousand persons'], dtype='object', name='UNIT')
AGE Index(['From 15 to 24 years', 'From 25 to 54 years', 'From 55 to 64 years'], dtype='object', name='AGE')
SEX Index(['Females', 'Males', 'Total'], dtype='object', name='SEX')
INDIC_EM Index(['Active population', 'Total employment (resident population concept - LFS)'], dtype='object', name='INDIC_EM')
GEO Index(['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic',

'Denmark', 'Estonia', 'Euro area (17 countries)',
'Euro area (18 countries)', 'Euro area (19 countries)',
'European Union (15 countries)', 'European Union (27 countries)',
'European Union (28 countries)', 'Finland',
'Former Yugoslav Republic of Macedonia, the', 'France',
'France (metropolitan)',
'Germany (until 1990 former territory of the FRG)', 'Greece', 'Hungary',
'Iceland', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg',
'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania',
'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey',
'United Kingdom'],
dtype='object', name='GEO')
Exercise 70.6.2
Filter the above dataframe to only include employment as a percentage of ‘active population’.
Create a grouped boxplot using seaborn of employment rates in 2015 by age group and sex.
Hint: GEO includes both areas and countries.


To easily filter by country, swap GEO to the top level and sort the MultiIndex
employ.columns = employ.columns.swaplevel(0,-1)
employ = employ.sort_index(axis=1)
We need to get rid of a few items in GEO which are not countries.
A fast way to get rid of the EU areas is to use a list comprehension to find the level values in GEO that begin with ‘Euro’
geo_list = employ.columns.get_level_values('GEO').unique().tolist()
countries = [x for x in geo_list if not x.startswith('Euro')]
employ = employ[countries]
employ.columns.get_level_values('GEO').unique()
Index(['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic',

'Denmark', 'Estonia', 'Finland',
'Former Yugoslav Republic of Macedonia, the', 'France',
'France (metropolitan)',
'Germany (until 1990 former territory of the FRG)', 'Greece', 'Hungary',
'Iceland', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg',
'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania',
'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey',
'United Kingdom'],
dtype='object', name='GEO')
Select only percentage employed in the active population from the dataframe
employ_f = employ.xs(('Percentage of total population', 'Active population'),

level=('UNIT', 'INDIC_EM'),
axis=1)
employ_f.head()
GEO Austria ... United Kingdom \

AGE From 15 to 24 years ... From 55 to 64 years
SEX Females Males Total ... Females Males
DATE ...
2007-01-01 56.00 62.90 59.40 ... 49.90 68.90
2008-01-01 56.20 62.90 59.50 ... 50.20 69.80
2009-01-01 56.20 62.90 59.50 ... 50.60 70.30
2010-01-01 54.00 62.60 58.30 ... 51.10 69.20
2011-01-01 54.80 63.60 59.20 ... 51.30 68.40
GEO
AGE
SEX Total
DATE
2007-01-01 59.30
2008-01-01 59.80
2009-01-01 60.30
2010-01-01 60.00
2011-01-01 59.70

Drop the ‘Total’ value before creating the grouped boxplot
employ_f = employ_f.drop('Total', level='SEX', axis=1)
box = employ_f.loc['2015'].unstack().reset_index()
sns.boxplot(x="AGE", y=0, hue="SEX", data=box, palette=("husl"), showfliers=False)
plt.xlabel('')
plt.xticks(rotation=35)
plt.ylabel('Percentage of population (%)')
plt.title('Employment in Europe (2015)')
plt.legend(bbox_to_anchor=(1,0.5))
plt.show()

CHAPTER
SEVENTYONE
LINEAR REGRESSION IN PYTHON
Contents
• Linear Regression in Python

– Overview
– Simple Linear Regression
– Extending the Linear Regression Model
– Endogeneity
– Summary
– Exercises
!pip install linearmodels
71.1 Overview
Linear regression is a standard tool for analyzing the relationship between two or more variables.
In this lecture, we’ll use the Python package statsmodels to estimate, interpret, and visualize linear regression models.
Along the way, we’ll discuss a variety of topics, including
• simple and multivariate linear regression
• visualization
• endogeneity and omitted variable bias
• two-stage least squares
As an example, we will replicate results from Acemoglu, Johnson and Robinson’s seminal paper [Acemoglu et al., 2001].
• You can download a copy here.
In the paper, the authors emphasize the importance of institutions in economic development.
The main contribution is the use of settler mortality rates as a source of exogenous variation in institutional differences.
Such variation is needed to determine whether it is institutions that give rise to greater economic growth, rather than the
other way around.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
from linearmodels.iv import IV2SLS
sns.set_theme()
This lecture assumes you are familiar with basic econometrics.

For an introductory text covering these topics, see, for example, [Wooldridge, 2015].
71.2 Simple Linear Regression
[Acemoglu et al., 2001] wish to determine whether or not differences in institutions can help to explain observed economic
outcomes.
How do we measure institutional differences and economic outcomes?
In this paper,
• economic outcomes are proxied by log GDP per capita in 1995, adjusted for exchange rates.
• institutional differences are proxied by an index of protection against expropriation on average over 1985-95,
constructed by the Political Risk Services Group.
These variables and other data used in the paper are available for download on Daron Acemoglu’s webpage.
We will use pandas’ .read_stata() function to read in data contained in the .dta files to dataframes
df1 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable1.dta?raw=true')
df1.head()
shortnam euro1900 excolony avexpr logpgp95 cons1 cons90 democ00a \

0 AFG 0.000000 1.0 NaN NaN 1.0 2.0 1.0
1 AGO 8.000000 1.0 5.363636 7.770645 3.0 3.0 0.0
2 ARE 0.000000 1.0 7.181818 9.804219 NaN NaN NaN
3 ARG 60.000004 1.0 6.386364 9.133459 1.0 6.0 3.0
4 ARM 0.000000 0.0 NaN 7.682482 NaN NaN NaN
cons00a extmort4 logem4 loghjypl baseco

0 1.0 93.699997 4.540098 NaN NaN
1 1.0 280.000000 5.634789 -3.411248 1.0
2 NaN NaN NaN NaN NaN
3 3.0 68.900002 4.232656 -0.872274 1.0
4 NaN NaN NaN NaN NaN
Let’s use a scatterplot to see whether any obvious relationship exists between GDP per capita and the protection against
expropriation index
df1.plot(x='avexpr', y='logpgp95', kind='scatter')

plt.show()
The plot shows a fairly strong positive relationship between protection against expropriation and log GDP per capita.
Specifically, if higher protection against expropriation is a measure of institutional quality, then better institutions appear
to be positively correlated with better economic outcomes (higher GDP per capita).
Given the plot, choosing a linear model to describe this relationship seems like a reasonable assumption.
We can write our model as
𝑙𝑜𝑔𝑝𝑔𝑝95𝑖 = 𝛽0 + 𝛽1 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 + 𝑢𝑖
where:
• 𝛽0 is the intercept of the linear trend line on the y-axis
• 𝛽1 is the slope of the linear trend line, representing the marginal effect of protection against expropriation risk on log GDP per
capita
• 𝑢𝑖 is a random error term (deviations of observations from the linear trend due to factors not included in the model)
Visually, this linear model involves choosing a straight line that best fits the data, as in the following plot (Figure 2 in
[Acemoglu et al., 2001])
# Dropping NA's is required to use numpy's polyfit

df1_subset = df1.dropna(subset=['logpgp95', 'avexpr'])
# Use only 'base sample' for plotting purposes

df1_subset = df1_subset[df1_subset['baseco'] == 1]
X = df1_subset['avexpr']
y = df1_subset['logpgp95']
labels = df1_subset['shortnam']
# Replace markers with country labels
fig, ax = plt.subplots()
ax.scatter(X, y, marker='')
for i, label in enumerate(labels):
    ax.annotate(label, (X.iloc[i], y.iloc[i]))
# Fit a linear trend line
# Fit a linear trend line

ax.plot(np.unique(X),
np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),
color='black')
ax.set_xlim([3.3,10.5])
ax.set_ylim([4,10.5])
ax.set_xlabel('Average Expropriation Risk 1985-95')
ax.set_ylabel('Log GDP per capita, PPP, 1995')
ax.set_title('Figure 2: OLS relationship between expropriation \
risk and income')
plt.show()
The most common technique to estimate the parameters (𝛽’s) of the linear model is Ordinary Least Squares (OLS).
As the name implies, an OLS model is solved by finding the parameters that minimize the sum of squared residuals, i.e.
$$\min_{\hat{\beta}} \sum_{i=1}^{N} \hat{u}_i^2$$
where 𝑢̂𝑖 is the difference between the observation and the predicted value of the dependent variable.
To estimate the constant term 𝛽0 , we need to add a column of 1’s to our dataset (consider the equation if 𝛽0 was replaced
with 𝛽0 𝑥𝑖 and 𝑥𝑖 = 1)
df1['const'] = 1
Now we can construct our model in statsmodels using the OLS function.
We will use pandas dataframes with statsmodels; however, standard arrays can also be used as arguments.
reg1 = sm.OLS(endog=df1['logpgp95'], exog=df1[['const', 'avexpr']],
              missing='drop')
type(reg1)
statsmodels.regression.linear_model.OLS
So far we have simply constructed our model.

We need to use .fit() to obtain parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$
results = reg1.fit()
type(results)
statsmodels.regression.linear_model.RegressionResultsWrapper
We now have the fitted regression model stored in results.

To view the OLS regression results, we can call the .summary() method.
Note that an observation was mistakenly dropped from the results in the original paper (see the note located in
maketable2.do from Acemoglu’s webpage), and thus the coefficients differ slightly.
print(results.summary())
OLS Regression Results

==============================================================================
Dep. Variable: logpgp95 R-squared: 0.611
Model: OLS Adj. R-squared: 0.608
Method: Least Squares F-statistic: 171.4
Date: Tue, 30 Apr 2024 Prob (F-statistic): 4.16e-24
Time: 00:37:04 Log-Likelihood: -119.71
No. Observations: 111 AIC: 243.4
Df Residuals: 109 BIC: 248.8
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.6261 0.301 15.391 0.000 4.030 5.222
avexpr 0.5319 0.041 13.093 0.000 0.451 0.612
==============================================================================
Omnibus: 9.251 Durbin-Watson: 1.689
Prob(Omnibus): 0.010 Jarque-Bera (JB): 9.170
Skew: -0.680 Prob(JB): 0.0102
Kurtosis: 3.362 Cond. No. 33.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From our results, we see that

• The intercept $\hat{\beta}_0 = 4.63$.
• The slope $\hat{\beta}_1 = 0.53$.
• The positive $\hat{\beta}_1$ parameter estimate implies that institutional quality has a positive effect on economic outcomes, as we saw in the figure.
• The p-value of 0.000 for $\hat{\beta}_1$ implies that the effect of institutions on GDP is statistically significant (using p < 0.05 as a rejection rule).
• The R-squared value of 0.611 indicates that around 61% of variation in log GDP per capita is explained by protection against expropriation.
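To make the R-squared interpretation concrete, here is a minimal sketch on synthetic data (the data-generating values are assumptions, not the paper's data). It computes R-squared as one minus the ratio of the residual sum of squares to the total sum of squares and checks that, with a single regressor and a constant, it equals the squared sample correlation between the regressor and the dependent variable.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(3, 10, size=150)
y = 4.6 + 0.5 * x + rng.normal(scale=0.7, size=150)

# OLS with a constant
X = np.column_stack((np.ones_like(x), x))
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

ssr = resid @ resid                    # residual sum of squares
sst = np.sum((y - y.mean())**2)        # total sum of squares
r_squared = 1 - ssr / sst

# Squared sample correlation between x and y
corr_sq = np.corrcoef(x, y)[0, 1]**2

print(r_squared, corr_sq)
```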
Using our parameter estimates, we can now write our estimated relationship as
$$\widehat{logpgp95}_i = 4.63 + 0.53 \, avexpr_i$$
This equation describes the line that best fits our data, as shown in Figure 2.
We can use this equation to predict the level of log GDP per capita for a value of the index of expropriation protection.
For example, for a country with an index value of 7.07 (the average for the dataset), we find that their predicted level of
log GDP per capita in 1995 is 8.38.
mean_expr = np.mean(df1_subset['avexpr'])
mean_expr
6.515625
predicted_logpgp95 = 4.63 + 0.53 * 7.07
predicted_logpgp95
8.3771
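An easier (and more accurate) way to obtain a prediction is to use .predict() and set constant = 1 and avexpr_i = mean_expr.
Note that mean_expr is 6.52, the average over df1_subset, rather than the 7.07 used above, so the prediction below (8.09) differs from 8.38.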
results.predict(exog=[1, mean_expr])
array([8.09156367])
We can obtain an array of predicted 𝑙𝑜𝑔𝑝𝑔𝑝95𝑖 for every value of 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 in our dataset by calling .predict() on
our results.
Plotting the predicted values against 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 shows that the predicted values lie along the linear line that we fitted above.
The observed values of 𝑙𝑜𝑔𝑝𝑔𝑝95𝑖 are also plotted for comparison purposes
# Drop missing observations from whole sample
df1_plot = df1.dropna(subset=['logpgp95', 'avexpr'])
# Plot predicted values
fig, ax = plt.subplots()
ax.scatter(df1_plot['avexpr'], results.predict(), alpha=0.5,
           label='predicted')
# Plot observed values
ax.scatter(df1_plot['avexpr'], df1_plot['logpgp95'], alpha=0.5,
           label='observed')
ax.legend()
ax.set_title('OLS predicted values')
ax.set_xlabel('avexpr')
ax.set_ylabel('logpgp95')
plt.show()
71.3 Extending the Linear Regression Model
So far we have only accounted for institutions affecting economic performance - almost certainly there are numerous other
factors affecting GDP that are not included in our model.
Leaving out variables that affect 𝑙𝑜𝑔𝑝𝑔𝑝95𝑖 will result in omitted variable bias, yielding biased and inconsistent parameter
estimates.
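Omitted variable bias can be illustrated with a small simulation. The sketch below is purely hypothetical: the variable `z` stands for an omitted factor, and all coefficients are assumed values. When `z` is left out, the slope on `x` absorbs part of its effect; in this design the population bias equals the coefficient on `z` times Cov(x, z)/Var(x).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

z = rng.normal(size=n)                     # omitted factor
x = 0.8 * z + rng.normal(size=n)           # regressor correlated with z
y = 1.0 + 0.5 * x + 1.0 * z + rng.normal(size=n)

def ols(X, y):
    "OLS coefficients via the normal equations."
    return np.linalg.solve(X.T @ X, X.T @ y)

const = np.ones(n)
b_short = ols(np.column_stack((const, x)), y)       # z omitted
b_long = ols(np.column_stack((const, x, z)), y)     # z included

# Population bias: 1.0 * Cov(x, z) / Var(x) = 0.8 / (0.8**2 + 1)
expected_bias = 0.8 / 1.64

print(f"slope with z omitted:  {b_short[1]:.3f}")
print(f"slope with z included: {b_long[1]:.3f}")
print(f"0.5 + expected bias:   {0.5 + expected_bias:.3f}")
```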
We can extend our bivariate regression model to a multivariate regression model by adding in other factors that may
affect 𝑙𝑜𝑔𝑝𝑔𝑝95𝑖 .
[Acemoglu et al., 2001] consider other factors such as:
• the effect of climate on economic outcomes; latitude is used to proxy this
• differences that affect both economic performance and institutions, eg. cultural, historical, etc.; controlled for with
the use of continent dummies
Let’s estimate some of the extended models considered in the paper (Table 2) using data from maketable2.dta
# Add constant term to dataset
df2['const'] = 1
# Create lists of variables to be used in each regression

X1 = ['const', 'avexpr']
X2 = ['const', 'avexpr', 'lat_abst']
X3 = ['const', 'avexpr', 'lat_abst', 'asia', 'africa', 'other']
# Estimate an OLS regression for each set of variables
reg1 = sm.OLS(df2['logpgp95'], df2[X1], missing='drop').fit()
reg2 = sm.OLS(df2['logpgp95'], df2[X2], missing='drop').fit()
reg3 = sm.OLS(df2['logpgp95'], df2[X3], missing='drop').fit()
Now that we have fitted our models, we will use summary_col to display the results in a single table (model numbers correspond to those in the paper)
info_dict={'R-squared' : lambda x: f"{x.rsquared:.2f}",

'No. observations' : lambda x: f"{int(x.nobs):d}"}
results_table = summary_col(results=[reg1,reg2,reg3],
float_format='%0.2f',
stars = True,
model_names=['Model 1',
'Model 3',
'Model 4'],
info_dict=info_dict,
regressor_order=['const',
'avexpr',
'lat_abst',
'asia',
'africa'])
results_table.add_title('Table 2 - OLS Regressions')
print(results_table)
Table 2 - OLS Regressions

=========================================
Model 1 Model 3 Model 4
-----------------------------------------
const 4.63*** 4.87*** 5.85***
(0.30) (0.33) (0.34)
avexpr 0.53*** 0.46*** 0.39***
(0.04) (0.06) (0.05)
lat_abst 0.87* 0.33
(0.49) (0.45)
asia -0.15
(0.15)
africa -0.92***
(0.17)
other 0.30
(0.37)
R-squared 0.61 0.62 0.72
R-squared Adj. 0.61 0.62 0.70
R-squared 0.61 0.62 0.72


No. observations 111 111 111
=========================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
71.4 Endogeneity
As [Acemoglu et al., 2001] discuss, the OLS models likely suffer from endogeneity issues, resulting in biased and incon-
sistent model estimates.
Namely, there is likely a two-way relationship between institutions and economic outcomes:
• richer countries may be able to afford or prefer better institutions
• variables that affect income may also be correlated with institutional differences
• the construction of the index may be biased; analysts may be biased towards seeing countries with higher income
having better institutions
To deal with endogeneity, we can use two-stage least squares (2SLS) regression, which is an extension of OLS regres-
sion.
This method requires replacing the endogenous variable 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 with a variable that is:
1. correlated with 𝑎𝑣𝑒𝑥𝑝𝑟𝑖
2. not correlated with the error term (ie. it should not directly affect the dependent variable, otherwise it would be
correlated with 𝑢𝑖 due to omitted variable bias)
The new set of regressors is called an instrument, which aims to remove endogeneity in our proxy of institutional dif-
ferences.
The main contribution of [Acemoglu et al., 2001] is the use of settler mortality rates to instrument for institutional
differences.
They hypothesize that higher mortality rates of colonizers led to the establishment of institutions that were more extractive
in nature (less protection against expropriation), and these institutions still persist today.
Using a scatterplot (Figure 3 in [Acemoglu et al., 2001]), we can see protection against expropriation is negatively cor-
related with settler mortality rates, coinciding with the authors’ hypothesis and satisfying the first condition of a valid
instrument.
# Dropping NA's is required to use numpy's polyfit

df1_subset2 = df1.dropna(subset=['logem4', 'avexpr'])
X = df1_subset2['logem4']
y = df1_subset2['avexpr']
labels = df1_subset2['shortnam']
# Replace markers with country labels
fig, ax = plt.subplots()
ax.scatter(X, y, marker='')
for i, label in enumerate(labels):
    ax.annotate(label, (X.iloc[i], y.iloc[i]))
# Fit a linear trend line

ax.plot(np.unique(X),
np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),
color='black')
ax.set_xlim([1.8,8.4])
ax.set_ylim([3.3,10.4])
ax.set_xlabel('Log of Settler Mortality')
ax.set_ylabel('Average Expropriation Risk 1985-95')
ax.set_title('Figure 3: First-stage relationship between settler mortality \
and expropriation risk')
plt.show()
The second condition may not be satisfied if settler mortality rates in the 17th to 19th centuries have a direct effect on
current GDP (in addition to their indirect effect through institutions).
For example, settler mortality rates may be related to the current disease environment in a country, which could affect
current economic performance.
[Acemoglu et al., 2001] argue this is unlikely because:
• The majority of settler deaths were due to malaria and yellow fever and had a limited effect on local people.
• The disease burden on local people in Africa or India, for example, did not appear to be higher than average,
supported by relatively high population densities in these areas before colonization.
As we appear to have a valid instrument, we can use 2SLS regression to obtain consistent parameter estimates (2SLS is consistent but, in finite samples, generally not unbiased).
First stage
The first stage involves regressing the endogenous variable (𝑎𝑣𝑒𝑥𝑝𝑟𝑖 ) on the instrument.
The instrument is the set of all exogenous variables in our model (and not just the variable we have replaced).
Using model 1 as an example, our instrument is simply a constant and settler mortality rates 𝑙𝑜𝑔𝑒𝑚4𝑖 .
Therefore, we will estimate the first-stage regression as
𝑎𝑣𝑒𝑥𝑝𝑟𝑖 = 𝛿0 + 𝛿1 𝑙𝑜𝑔𝑒𝑚4𝑖 + 𝑣𝑖

The data we need to estimate this equation is located in maketable4.dta (only complete data, indicated by baseco
= 1, is used for estimation)
# Import and select the data

df4 = df4[df4['baseco'] == 1]
# Add a constant variable

df4['const'] = 1
# Fit the first stage regression and print summary

results_fs = sm.OLS(df4['avexpr'],
df4[['const', 'logem4']],
missing='drop').fit()
print(results_fs.summary())

==============================================================================
Dep. Variable: avexpr R-squared: 0.270
Df Model: 1
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 9.3414 0.611 15.296 0.000 8.121 10.562
logem4 -0.6068 0.127 -4.790 0.000 -0.860 -0.354
==============================================================================
Skew: 0.045 Prob(JB): 0.918
==============================================================================
Notes:
↪specified.
Second stage
We need to retrieve the predicted values of 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 using .predict().
We then replace the endogenous variable $avexpr_i$ with the predicted values $\widehat{avexpr}_i$ in the original linear model.
Our second stage regression is thus
$$logpgp95_i = \beta_0 + \beta_1 \widehat{avexpr}_i + u_i$$
df4['predicted_avexpr'] = results_fs.predict()
results_ss = sm.OLS(df4['logpgp95'],
                    df4[['const', 'predicted_avexpr']]).fit()
print(results_ss.summary())

==============================================================================
Df Model: 1
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                1.9097      0.823      2.320      0.024       0.264       3.555
predicted_avexpr     0.9443      0.126      7.523      0.000       0.693       1.195
==============================================================================
Skew: -0.790 Prob(JB): 0.00407
==============================================================================
Notes:
↪specified.
The second-stage regression results give us a consistent estimate of the effect of institutions on economic outcomes.
The result suggests a stronger positive relationship than what the OLS results indicated.
Note that while our parameter estimates are correct, our standard errors are not and for this reason, computing 2SLS
‘manually’ (in stages with OLS) is not recommended.
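The reason the manual standard errors are off can be seen in a short numpy sketch on simulated data (all coefficients and variable names here are illustrative assumptions). The second-stage OLS computes residuals from the fitted regressor, whereas the 2SLS covariance should use residuals computed from the actual endogenous regressor. Both versions below divide by n, so the only difference is the residual used; statsmodels OLS would additionally apply a degrees-of-freedom correction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

z = rng.normal(size=n)                          # instrument
e = rng.normal(size=n)                          # unobserved confounder
x = 1.0 - 0.6 * z + e + rng.normal(size=n)      # endogenous regressor
y = 2.0 + 0.9 * x - 1.0 * e + rng.normal(size=n)

const = np.ones(n)
Z = np.column_stack((const, z))     # instruments, including the constant
X = np.column_stack((const, x))     # regressors

# First stage: project the regressors on the instruments
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

# Second stage: OLS of y on the fitted values
XhXh_inv = np.linalg.inv(X_hat.T @ X_hat)
beta_2sls = XhXh_inv @ X_hat.T @ y

# Naive residuals use X_hat; the correct 2SLS residuals use the actual X
u_naive = y - X_hat @ beta_2sls
u_correct = y - X @ beta_2sls

se_naive = np.sqrt(np.diag(XhXh_inv) * (u_naive @ u_naive) / n)
se_correct = np.sqrt(np.diag(XhXh_inv) * (u_correct @ u_correct) / n)

print("2SLS estimates:   ", beta_2sls)
print("naive std errors: ", se_naive)
print("2SLS std errors:  ", se_correct)
```

The point estimates are identical in both approaches; only the residual variance that scales the covariance matrix changes.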
We can correctly estimate a 2SLS regression in one step using the linearmodels package, an extension of statsmodels.
Note that when using IV2SLS, the exogenous and instrument variables are split up in the function arguments (whereas before the instrument included exogenous variables).
iv = IV2SLS(dependent=df4['logpgp95'],
exog=df4['const'],
endog=df4['avexpr'],
instruments=df4['logem4']).fit(cov_type='unadjusted')
print(iv.summary)
IV-2SLS Estimation Summary

==============================================================================


Estimator: IV-2SLS Adj. R-squared: 0.1739
No. Observations: 64 F-statistic: 37.568
Date: Tue, Apr 30 2024 P-value (F-stat) 0.0000
Time: 00:37:06 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 1.9097 1.0106 1.8897 0.0588 -0.0710 3.8903
avexpr 0.9443 0.1541 6.1293 0.0000 0.6423 1.2462
==============================================================================
Endogenous: avexpr
Instruments: logem4
Unadjusted Covariance (Homoskedastic)
Debiased: False
Given that we now have consistent estimates, we can infer from the model we have estimated that institutional differences (stemming from institutions set up during colonization) can help to explain differences in income levels across countries today.
[Acemoglu et al., 2001] use a marginal effect of 0.94 to calculate that the difference in the index between Chile and
Nigeria (ie. institutional quality) implies up to a 7-fold difference in income, emphasizing the significance of institutions
in economic development.
71.5 Summary
We have demonstrated basic OLS and 2SLS regression in statsmodels and linearmodels.
If you are familiar with R, you may want to use the formula interface to statsmodels, or consider using rpy2 to call R from within Python.
71.6 Exercises
Exercise 71.6.1
In the lecture, we think the original model suffers from endogeneity bias due to the likely effect income has on institutional
development.
Although endogeneity is often best identified by thinking about the data and model, we can formally test for endogeneity
using the Hausman test.
We want to test for correlation between the endogenous variable, 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 , and the errors, 𝑢𝑖
𝐻0 ∶ 𝐶𝑜𝑣(𝑎𝑣𝑒𝑥𝑝𝑟𝑖 , 𝑢𝑖 ) = 0 (𝑛𝑜 𝑒𝑛𝑑𝑜𝑔𝑒𝑛𝑒𝑖𝑡𝑦)

𝐻1 ∶ 𝐶𝑜𝑣(𝑎𝑣𝑒𝑥𝑝𝑟𝑖 , 𝑢𝑖 ) ≠ 0 (𝑒𝑛𝑑𝑜𝑔𝑒𝑛𝑒𝑖𝑡𝑦)
This test is run in two stages.
First, we regress 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 on the instrument, 𝑙𝑜𝑔𝑒𝑚4𝑖
𝑎𝑣𝑒𝑥𝑝𝑟𝑖 = 𝜋0 + 𝜋1 𝑙𝑜𝑔𝑒𝑚4𝑖 + 𝜐𝑖
Second, we retrieve the residuals $\hat{\upsilon}_i$ and include them in the original equation
$$logpgp95_i = \beta_0 + \beta_1 avexpr_i + \alpha \hat{\upsilon}_i + u_i$$
If 𝛼 is statistically significant (with a p-value < 0.05), then we reject the null hypothesis and conclude that 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 is
endogenous.
Using the above information, estimate a Hausman test and interpret your results.
# Load in data
# Add a constant term

df4['const'] = 1
# Estimate the first stage regression
reg1 = sm.OLS(endog=df4['avexpr'],
              exog=df4[['const', 'logem4']],
              missing='drop').fit()
# Retrieve the residuals
df4['resid'] = reg1.resid
# Estimate the second stage regression
reg2 = sm.OLS(endog=df4['logpgp95'],
              exog=df4[['const', 'avexpr', 'resid']],
              missing='drop').fit()
print(reg2.summary())

==============================================================================
Df Model: 2
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.4782 0.547 4.530 0.000 1.386 3.570
avexpr 0.8564 0.082 10.406 0.000 0.692 1.021
resid -0.4951 0.099 -5.017 0.000 -0.692 -0.298
==============================================================================


Skew: -1.054 Prob(JB): 9.19e-06
==============================================================================
Notes:
↪specified.
The output shows that the coefficient on the residuals is statistically significant, indicating 𝑎𝑣𝑒𝑥𝑝𝑟𝑖 is endogenous.
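A useful property of this two-stage test is that, on any single sample, the coefficient on the endogenous regressor in the augmented regression is numerically identical to the 2SLS estimate. The following sketch checks this on simulated data; the data-generating process and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000

z = rng.normal(size=n)                          # instrument
e = rng.normal(size=n)                          # source of endogeneity
x = 1.0 - 0.6 * z + e + rng.normal(size=n)      # endogenous regressor
y = 2.0 + 0.9 * x - 1.0 * e + rng.normal(size=n)

def ols(X, y):
    "OLS coefficients via the normal equations."
    return np.linalg.solve(X.T @ X, X.T @ y)

const = np.ones(n)
Z = np.column_stack((const, z))

# First stage: fitted values and residuals
x_hat = Z @ ols(Z, x)
v_hat = x - x_hat

# Augmented regression: y on a constant, x and the first-stage residuals
b_aug = ols(np.column_stack((const, x, v_hat)), y)

# 2SLS: y on a constant and the first-stage fitted values
b_2sls = ols(np.column_stack((const, x_hat)), y)

print("augmented regression:", b_aug)
print("2SLS:                ", b_2sls)
```

In this simulated design the population value of the coefficient on the residuals is -0.5, so a clearly negative estimate signals endogeneity, as in the exercise.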
Exercise 71.6.2
The OLS parameter 𝛽 can also be estimated using matrix algebra and numpy (you may need to review the numpy lecture
to complete this exercise).
The linear equation we want to estimate is (written in matrix form)
𝑦 = 𝑋𝛽 + 𝑢
To solve for the unknown parameter $\beta$, we want to minimize the sum of squared residuals
$$\min_{\hat{\beta}} \hat{u}' \hat{u}$$
Rearranging the first equation and substituting into the second equation, we can write
$$\min_{\hat{\beta}} \ (y - X \hat{\beta})' (y - X \hat{\beta})$$
Solving this optimization problem gives the solution for the $\hat{\beta}$ coefficients
$$\hat{\beta} = (X'X)^{-1} X'y$$
Using the above information, compute $\hat{\beta}$ from model 1 using numpy - your results should be the same as those in the statsmodels output from earlier in the lecture.
# Load in data
df1 = df1.dropna(subset=['logpgp95', 'avexpr'])
# Add a constant term

df1['const'] = 1
# Define the X and y variables

y = np.asarray(df1['logpgp95'])
X = np.asarray(df1[['const', 'avexpr']])
# Compute β_hat


β_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Print out the results from the 2 x 1 vector β_hat

print(f'β_0 = {β_hat[0]:.2}')
print(f'β_1 = {β_hat[1]:.2}')
β_0 = 4.6
β_1 = 0.53
It is also possible to use np.linalg.inv(X.T @ X) @ X.T @ y to solve for 𝛽, however .solve() is preferred
as it involves fewer computations.
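As a quick check on synthetic data (an assumed design, not the lecture's dataset), the sketch below compares three ways of computing the OLS coefficients: `np.linalg.solve` on the normal equations, an explicit inverse, and `np.linalg.lstsq`, which avoids forming X'X altogether.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Synthetic design matrix with a constant and two regressors
X = np.column_stack((np.ones(n), rng.normal(size=(n, 2))))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

b_solve = np.linalg.solve(X.T @ X, X.T @ y)        # solves the normal equations
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y           # forms the inverse explicitly
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # least squares on X directly

print(b_solve)
print(b_inv)
print(b_lstsq)
```

All three agree to floating-point precision on a well-conditioned problem like this one.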

CHAPTER
SEVENTYTWO
MAXIMUM LIKELIHOOD ESTIMATION
Contents
• Maximum Likelihood Estimation

– Overview
– Set Up and Assumptions
– Conditional Distributions
– Maximum Likelihood Estimation
– MLE with Numerical Methods
– Maximum Likelihood Estimation with statsmodels
– Summary
– Exercises
72.1 Overview
In a previous lecture, we estimated the relationship between dependent and explanatory variables using linear regression.
But what if a linear relationship is not an appropriate assumption for our model?
One widely used alternative is maximum likelihood estimation, which involves specifying a class of distributions, indexed
by unknown parameters, and then using the data to pin down these parameter values.
The benefit relative to linear regression is that it allows more flexibility in the probabilistic relationships between variables.
Here we illustrate maximum likelihood by replicating Daniel Treisman’s (2016) paper, Russia’s Billionaires, which con-
nects the number of billionaires in a country to its economic characteristics.
The paper concludes that Russia has a higher number of billionaires than economic factors such as market size and tax
rate predict.
We’ll require the following imports:

import matplotlib.pyplot as plt
import numpy as np
from numpy import exp
from scipy.special import factorial
import pandas as pd
from statsmodels.api import Poisson
from statsmodels.iolib.summary2 import summary_col
We assume familiarity with basic probability and multivariate calculus.
72.2 Set Up and Assumptions
Let’s consider the steps we need to go through in maximum likelihood estimation and how they pertain to this study.
72.2.1 Flow of Ideas
The first step with maximum likelihood estimation is to choose the probability distribution believed to be generating the
data.
More precisely, we need to make an assumption as to which parametric class of distributions is generating the data.
• e.g., the class of all normal distributions, or the class of all gamma distributions.
Each such class is a family of distributions indexed by a finite number of parameters.
• e.g., the class of normal distributions is a family of distributions indexed by its mean 𝜇 ∈ (−∞, ∞) and standard
deviation 𝜎 ∈ (0, ∞).
We’ll let the data pick out a particular element of the class by pinning down the parameters.
The parameter estimates so produced will be called maximum likelihood estimates.
72.2.2 Counting Billionaires
Treisman [Treisman, 2016] is interested in estimating the number of billionaires in different countries.
The number of billionaires is integer-valued.
Hence we consider distributions that take values only in the nonnegative integers.
(This is one reason least squares regression is not the best tool for the present problem, since the dependent variable in
linear regression is not restricted to integer values)
One integer distribution is the Poisson distribution, the probability mass function (pmf) of which is
$$f(y) = \frac{\mu^y}{y!} e^{-\mu}, \qquad y = 0, 1, 2, \ldots, \infty$$
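As a sanity check on this formula, the following sketch verifies numerically that the pmf sums to one and that both the mean and the variance equal $\mu$. The helper `poisson_pmf_stable` is a hypothetical name introduced here; it evaluates the pmf on the log scale so that large values of $y$ do not overflow, and the infinite support is truncated at 200.

```python
from math import exp, lgamma, log

def poisson_pmf_stable(y, mu):
    "Poisson pmf evaluated on the log scale; lgamma(y + 1) equals log(y!)."
    return exp(y * log(mu) - mu - lgamma(y + 1))

mu = 5.0
support = range(0, 200)   # truncation of the infinite support

total = sum(poisson_pmf_stable(y, mu) for y in support)
mean = sum(y * poisson_pmf_stable(y, mu) for y in support)
var = sum((y - mean)**2 * poisson_pmf_stable(y, mu) for y in support)

print(total, mean, var)
```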
We can plot the Poisson distribution over 𝑦 for different values of 𝜇 as follows
poisson_pmf = lambda y, μ: μ**y / factorial(y) * exp(-μ)
y_values = range(0, 25)
fig, ax = plt.subplots()
for μ in [1, 5, 10]:
    distribution = []
    for y_i in y_values:
        distribution.append(poisson_pmf(y_i, μ))
    ax.plot(y_values,
            distribution,
            label=rf'$\mu$={μ}',
            alpha=0.5,
            marker='o',
            markersize=8)
ax.grid()
ax.set_xlabel('$y$', fontsize=14)
ax.set_ylabel(r'$f(y \mid \mu)$', fontsize=14)
ax.axis(xmin=0, ymin=0)
ax.legend()
plt.show()
Notice that the Poisson distribution begins to resemble a normal distribution as the mean of 𝑦 increases.
Let’s have a look at the distribution of the data we’ll be working with in this lecture.
Treisman’s main source of data is Forbes’ annual rankings of billionaires and their estimated net worth.
The dataset mle/fp.dta can be downloaded from here or its AER page.
pd.options.display.max_columns = 10
# Load in data and view

df = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_
↪static/lecture_specific/mle/fp.dta?raw=true')
df.head()
country ccode year cyear numbil ... topint08 rintr \

0 United States 2.0 1990.0 21990.0 NaN ... 39.799999 4.988405
noyrs roflaw nrrents

0 20.0 1.61 NaN
1 20.0 1.61 NaN
2 20.0 1.61 NaN
3 20.0 1.61 NaN
4 20.0 1.61 NaN
Using a histogram, we can view the distribution of the number of billionaires per country, numbil0, in 2008 (the United
States is dropped for plotting purposes)
numbil0_2008 = df[(df['year'] == 2008) & (

df['country'] != 'United States')].loc[:, 'numbil0']
plt.subplots(figsize=(12, 8))
plt.hist(numbil0_2008, bins=30)
plt.xlim(left=0)
plt.grid()
plt.xlabel('Number of billionaires in 2008')
plt.ylabel('Count')
plt.show()

From the histogram, it appears that the Poisson assumption is not unreasonable (albeit with a very low 𝜇 and some
outliers).
72.3 Conditional Distributions
In Treisman’s paper, the dependent variable (the number of billionaires $y_i$ in country $i$) is modeled as a function of GDP per capita, population size, and years of membership in GATT and WTO.
Hence, the distribution of $y_i$ needs to be conditioned on the vector of explanatory variables $\mathbf{x}_i$.
The standard formulation, the so-called Poisson regression model, is as follows:
$$f(y_i \mid \mathbf{x}_i) = \frac{\mu_i^{y_i}}{y_i!} e^{-\mu_i}; \qquad y_i = 0, 1, 2, \ldots, \infty. \tag{72.1}$$
where $\mu_i = \exp(\mathbf{x}_i' \beta) = \exp(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik})$
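One consequence of the exponential link is worth spelling out: $\mu_i$ is always positive, and raising regressor $j$ by one unit multiplies $\mu_i$ by $\exp(\beta_j)$. The sketch below checks this with arbitrary, purely illustrative values for $\beta$ and $\mathbf{x}_i$.

```python
import numpy as np

# Arbitrary illustrative values (not estimates)
beta = np.array([0.26, 0.18, 0.25, -0.1, -0.22])
x = np.array([2.0, 3.0, 2.0, 4.0, 0.0])

mu = np.exp(x @ beta)

# Raise the second regressor by one unit and recompute the conditional mean
x_up = x.copy()
x_up[1] += 1
mu_up = np.exp(x_up @ beta)

ratio = mu_up / mu
print(mu, mu_up, ratio, np.exp(beta[1]))
```

This is why Poisson regression coefficients are usually read as approximate proportional effects on the expected count rather than as additive effects.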

To illustrate the idea that the distribution of 𝑦𝑖 depends on x𝑖 let’s run a simple simulation.
We use our poisson_pmf function from above and arbitrary values for 𝛽 and x𝑖
y_values = range(0, 20)
# Define a parameter vector with estimates
β = np.array([0.26, 0.18, 0.25, -0.1, -0.22])
# Create some observations X
datasets = [np.array([0, 1, 1, 1, 2]),
            np.array([2, 3, 2, 4, 0]),
            np.array([3, 4, 5, 3, 2]),
            np.array([6, 5, 4, 4, 7])]
fig, ax = plt.subplots()
for X in datasets:
    μ = exp(X @ β)
    distribution = []
    for y_i in y_values:
        distribution.append(poisson_pmf(y_i, μ))
    ax.plot(y_values,
            distribution,
            label=rf'$\mu_i$={μ:.1}',
            marker='o',
            markersize=8,
            alpha=0.5)
ax.grid()
ax.legend()
ax.set_xlabel(r'$y \mid x_i$')
ax.set_ylabel(r'$f(y \mid x_i; \beta )$')
ax.axis(xmin=0, ymin=0)
plt.show()
We can see that the distribution of 𝑦𝑖 is conditional on x𝑖 (𝜇𝑖 is no longer constant).

72.4 Maximum Likelihood Estimation
In our model for number of billionaires, the conditional distribution contains 4 (𝑘 = 4) parameters that we need to
estimate.
We will label our entire parameter vector as $\beta$ where
$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}$$
To estimate the model using MLE, we want to maximize the likelihood that our estimate 𝛽̂ is the true parameter 𝛽.
Intuitively, we want to find the 𝛽̂ that best fits our data.
First, we need to construct the likelihood function ℒ(𝛽), which is similar to a joint probability density function.
Assume we have some data 𝑦𝑖 = {𝑦1 , 𝑦2 } and 𝑦𝑖 ∼ 𝑓(𝑦𝑖 ).
If 𝑦1 and 𝑦2 are independent, the joint pmf of these data is 𝑓(𝑦1 , 𝑦2 ) = 𝑓(𝑦1 ) ⋅ 𝑓(𝑦2 ).
If $y_i$ follows a Poisson distribution with $\mu = 7$, we can visualize the joint pmf like so
def plot_joint_poisson(μ=7, y_n=20):
    yi_values = np.arange(0, y_n, 1)

    # Create coordinate points of X and Y
    X, Y = np.meshgrid(yi_values, yi_values)

    # Multiply distributions together
    Z = poisson_pmf(X, μ) * poisson_pmf(Y, μ)

    # Create a 3D axis for the surface plot
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.plot_surface(X, Y, Z.T, cmap='terrain', alpha=0.6)
    ax.scatter(X, Y, Z.T, color='black', alpha=0.5, linewidths=1)
    ax.set(xlabel='$y_1$', ylabel='$y_2$')
    ax.set_zlabel('$f(y_1, y_2)$', labelpad=10)
    plt.show()

plot_joint_poisson(μ=7, y_n=20)
Similarly, the joint pmf of our data (which is distributed as a conditional Poisson distribution) can be written as
$$f(y_1, y_2, \ldots, y_n \mid \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n; \beta) = \prod_{i=1}^{n} \frac{\mu_i^{y_i}}{y_i!} e^{-\mu_i}$$
𝑦𝑖 is conditional on both the values of x𝑖 and the parameters 𝛽.

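The likelihood function is the same as the joint pmf, but treats the parameter $\beta$ as the variable and takes the observations $(y_i, \mathbf{x}_i)$ as given
$$\begin{aligned}
\mathcal{L}(\beta \mid y_1, y_2, \ldots, y_n; \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) &= \prod_{i=1}^{n} \frac{\mu_i^{y_i}}{y_i!} e^{-\mu_i} \\
&= f(y_1, y_2, \ldots, y_n \mid \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n; \beta)
\end{aligned}$$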
Now that we have our likelihood function, we want to find the $\hat{\beta}$ that yields the maximum likelihood value
$$\max_{\beta} \mathcal{L}(\beta)$$
In doing so it is generally easier to maximize the log-likelihood (consider differentiating 𝑓(𝑥) = 𝑥 exp(𝑥) vs. 𝑓(𝑥) =
log(𝑥) + 𝑥).
Given that taking a logarithm is a monotone increasing transformation, a maximizer of the likelihood function will also
be a maximizer of the log-likelihood function.
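The equivalence of the two maximizers can be checked numerically. The sketch below uses a handful of made-up counts and an i.i.d. Poisson model without covariates (a simplifying assumption relative to the regression model), evaluates both the likelihood and the log-likelihood on a grid of candidate values of $\mu$, and confirms that they peak at the same grid point, which lies next to the sample mean.

```python
import numpy as np
from math import lgamma

y = np.array([2, 0, 3, 1, 4, 2, 1])      # made-up counts
mu_grid = np.linspace(0.1, 6, 591)       # grid with step 0.01

# log(y_i!) for each observation
log_fact = np.array([lgamma(k + 1) for k in y])

# Log-likelihood of the sample at each candidate value of mu
logL = np.array([np.sum(y * np.log(m) - m - log_fact) for m in mu_grid])
L = np.exp(logL)

mu_from_L = mu_grid[np.argmax(L)]
mu_from_logL = mu_grid[np.argmax(logL)]

print(mu_from_L, mu_from_logL, y.mean())
```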
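In our case the log-likelihood is
$$\begin{aligned}
\log \mathcal{L}(\beta) &= \log \big( f(y_1; \beta) \cdot f(y_2; \beta) \cdot \ldots \cdot f(y_n; \beta) \big) \\
&= \sum_{i=1}^{n} \log f(y_i; \beta) \\
&= \sum_{i=1}^{n} \log \Big( \frac{\mu_i^{y_i}}{y_i!} e^{-\mu_i} \Big) \\
&= \sum_{i=1}^{n} y_i \log \mu_i - \sum_{i=1}^{n} \mu_i - \sum_{i=1}^{n} \log y_i!
\end{aligned}$$
The Poisson MLE $\hat{\beta}$ can be obtained by solving
$$\max_{\beta} \Big( \sum_{i=1}^{n} y_i \log \mu_i - \sum_{i=1}^{n} \mu_i - \sum_{i=1}^{n} \log y_i! \Big)$$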
However, no analytical solution exists to the above problem – to find the MLE we need to use numerical methods.
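To see why, it helps to write out the first-order condition. Substituting $\mu_i = \exp(\mathbf{x}_i' \beta)$ into the log-likelihood and differentiating gives the gradient and Hessian below (a standard derivation, stated here in the notation used above):

```latex
\begin{aligned}
\log \mathcal{L}(\beta)
  &= \sum_{i=1}^{n} \left( y_i \, \mathbf{x}_i' \beta - \exp(\mathbf{x}_i' \beta) - \log y_i! \right) \\
\frac{\partial \log \mathcal{L}(\beta)}{\partial \beta}
  &= \sum_{i=1}^{n} \left( y_i - \mu_i \right) \mathbf{x}_i \\
\frac{\partial^2 \log \mathcal{L}(\beta)}{\partial \beta \, \partial \beta'}
  &= - \sum_{i=1}^{n} \mu_i \, \mathbf{x}_i \mathbf{x}_i'
\end{aligned}
```

Setting the gradient to zero yields a system of equations that is nonlinear in $\beta$, because $\mu_i$ depends on $\beta$ through the exponential. The Hessian is negative semidefinite, so the log-likelihood is concave and any stationary point is a maximum.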
72.5 MLE with Numerical Methods
Many distributions do not have nice, analytical solutions and therefore require numerical methods to solve for parameter
estimates.
One such numerical method is the Newton-Raphson algorithm.
Our goal is to find the maximum likelihood estimate $\hat{\beta}$.
At $\hat{\beta}$, the first derivative of the log-likelihood function will be equal to 0.
Let’s illustrate this by supposing
$$\log \mathcal{L}(\beta) = -(\beta - 10)^2 - 10$$
β = np.linspace(1, 20)
logL = -(β - 10) ** 2 - 10
dlogL = -2 * β + 20
fig, (ax1, ax2) = plt.subplots(2, sharex=True, figsize=(12, 8))
ax1.plot(β, logL, lw=2)

ax2.plot(β, dlogL, lw=2)
ax1.set_ylabel(r'$log \mathcal{L(\beta)}$',
rotation=0,
labelpad=35,
fontsize=15)
ax2.set_ylabel(r'$\frac{dlog \mathcal{L(\beta)}}{d \beta}$ ',
               rotation=0,
               labelpad=35,
               fontsize=19)
ax2.set_xlabel(r'$\beta$', fontsize=15)
ax1.grid(), ax2.grid()
plt.axhline(c='black')
plt.show()
The plot shows that the maximum likelihood value (the top plot) occurs when $\frac{d \log \mathcal{L}(\beta)}{d \beta} = 0$ (the bottom plot).
Therefore, the likelihood is maximized when 𝛽 = 10.
We can also ensure that this value is a maximum (as opposed to a minimum) by checking that the second derivative (slope
of the bottom plot) is negative.
The Newton-Raphson algorithm finds a point where the first derivative is 0.
To use the algorithm, we take an initial guess at the maximum value, $\beta^{(0)}$ (the OLS parameter estimates might be a reasonable guess), then
1. Use the updating rule to iterate the algorithm
$$\beta^{(k+1)} = \beta^{(k)} - H^{-1}(\beta^{(k)}) \, G(\beta^{(k)})$$
where:
$$G(\beta^{(k)}) = \frac{d \log \mathcal{L}(\beta^{(k)})}{d \beta^{(k)}}$$
$$H(\beta^{(k)}) = \frac{d^2 \log \mathcal{L}(\beta^{(k)})}{d \beta^{(k)} \, d \beta'^{(k)}}$$
2. Check whether $|\beta^{(k+1)} - \beta^{(k)}| < tol$
• If true, then stop iterating and set $\hat{\beta} = \beta^{(k+1)}$
• If false, then return to step 1 and update $\beta^{(k+1)}$
As can be seen from the updating equation, $\beta^{(k+1)} = \beta^{(k)}$ only when $G(\beta^{(k)}) = 0$, i.e. where the first derivative is equal to 0.
(In practice, we stop iterating when the difference is below a small tolerance threshold.)
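Before moving to the vector case, the two steps above can be sketched for a single parameter, using the toy log-likelihood $\log \mathcal{L}(\beta) = -(\beta - 10)^2 - 10$ from earlier. The function name `newton_scalar` and its arguments are hypothetical, introduced only for this illustration. Because the toy function is quadratic, the first update lands exactly on the maximizer and the second iteration only confirms convergence.

```python
def newton_scalar(grad, hess, beta0, tol=1e-8, max_iter=100):
    "Scalar Newton-Raphson: iterate beta <- beta - grad(beta) / hess(beta)."
    beta = beta0
    for k in range(max_iter):
        beta_new = beta - grad(beta) / hess(beta)
        # Stop once the update is smaller than the tolerance
        if abs(beta_new - beta) < tol:
            return beta_new, k + 1
        beta = beta_new
    return beta, max_iter

# Toy log-likelihood: log L(beta) = -(beta - 10)**2 - 10
grad = lambda b: -2 * (b - 10)    # first derivative
hess = lambda b: -2.0             # second derivative (constant)

beta_hat, n_iter = newton_scalar(grad, hess, beta0=1.0)
print(beta_hat, n_iter)
```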
Let’s have a go at implementing the Newton-Raphson algorithm.
First, we’ll create a class called PoissonRegression so we can easily recompute the values of the log likelihood,
gradient and Hessian for every iteration
class PoissonRegression:
    def __init__(self, y, X, β):
        self.X = X
        self.n, self.k = X.shape
        # Reshape y as a n_by_1 column vector
        self.y = y.reshape(self.n,1)
        # Reshape β as a k_by_1 column vector
        self.β = β.reshape(self.k,1)

    def μ(self):
        return np.exp(self.X @ self.β)

    def logL(self):
        y = self.y
        μ = self.μ()
        return np.sum(y * np.log(μ) - μ - np.log(factorial(y)))

    def G(self):
        X = self.X  # use the instance's data rather than a global X
        y = self.y
        μ = self.μ()
        return X.T @ (y - μ)

    def H(self):
        X = self.X
        μ = self.μ()
        return -(X.T @ (μ * X))
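Analytical derivatives are easy to get wrong, so a finite-difference check is a cheap safeguard. The sketch below is self-contained: it restates the Poisson log-likelihood, gradient and Hessian as plain functions (the names `logL`, `G`, `H` and `numerical_gradient` are local to this example), compares the analytical gradient with a central-difference approximation on a small made-up dataset, and confirms that the Hessian is negative definite there.

```python
import numpy as np
from math import lgamma

# Small made-up dataset: 5 observations, constant plus two regressors
X = np.array([[1, 2, 5],
              [1, 1, 3],
              [1, 4, 2],
              [1, 5, 2],
              [1, 3, 1]], dtype=float)
y = np.array([1, 0, 1, 1, 0], dtype=float)
log_fact = np.array([lgamma(v + 1) for v in y])   # log(y_i!)

def logL(b):
    mu = np.exp(X @ b)
    return np.sum(y * np.log(mu) - mu - log_fact)

def G(b):
    "Analytical gradient X'(y - mu)."
    return X.T @ (y - np.exp(X @ b))

def H(b):
    "Analytical Hessian -X' diag(mu) X."
    mu = np.exp(X @ b)
    return -(X.T @ (mu[:, None] * X))

def numerical_gradient(f, b, h=1e-6):
    "Central-difference approximation to the gradient of f at b."
    g = np.zeros_like(b)
    for j in range(b.size):
        step = np.zeros_like(b)
        step[j] = h
        g[j] = (f(b + step) - f(b - step)) / (2 * h)
    return g

b0 = np.array([0.1, 0.1, 0.1])
print(G(b0))
print(numerical_gradient(logL, b0))
print(np.linalg.eigvalsh(H(b0)))   # all negative: concave at b0
```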
Our function newton_raphson will take a PoissonRegression object that has an initial guess of the parameter vector $\beta^{(0)}$.
The algorithm will update the parameter vector according to the updating rule, and recalculate the gradient and Hessian matrices at the new parameter estimates.
Iteration will end when either:
• The difference between the parameter and the updated parameter is below a tolerance level.
• The maximum number of iterations has been reached (meaning convergence is not achieved).
So we can get an idea of what’s going on while the algorithm is running, an option display=True is added to print
out values at each iteration.
def newton_raphson(model, tol=1e-3, max_iter=1000, display=True):
    i = 0
    error = 100  # Initial error value

    # Print header of output
    if display:
        header = f'{"Iteration_k":<13}{"Log-likelihood":<16}{"θ":<60}'
        print(header)
        print("-" * len(header))

    # While loop runs while any value in error is greater
    # than the tolerance until max iterations are reached
    while np.any(error > tol) and i < max_iter:
        H, G = model.H(), model.G()
        β_new = model.β - (np.linalg.inv(H) @ G)
        error = np.abs(β_new - model.β)
        model.β = β_new

        # Print iterations
        if display:
            β_list = [f'{t:.3}' for t in list(model.β.flatten())]
            update = f'{i:<13}{model.logL():<16.8}{β_list}'
            print(update)

        i += 1

    print(f'Number of iterations: {i}')
    print(f'β_hat = {model.β.flatten()}')

    # Return a flat array for β (instead of a k_by_1 column vector)
    return model.β.flatten()
Let’s try out our algorithm with a small dataset of 5 observations and 3 variables in X.
X = np.array([[1, 2, 5],
[1, 1, 3],
[1, 4, 2],
[1, 5, 2],
[1, 3, 1]])
y = np.array([1, 0, 1, 1, 0])
# Take a guess at initial βs

init_β = np.array([0.1, 0.1, 0.1])
# Create an object with Poisson model values

poi = PoissonRegression(y, X, β=init_β)
# Use newton_raphson to find the MLE

β_hat = newton_raphson(poi, display=True)
Iteration_k Log-likelihood θ
-----------------------------------------------------------------------------------
↪------
0 -4.3447622 ['-1.49', '0.265', '0.244']

1 -3.5742413 ['-3.38', '0.528', '0.474']
2 -3.3999526 ['-5.06', '0.782', '0.702']


3 -3.3788646 ['-5.92', '0.909', '0.82']
4 -3.3783559 ['-6.07', '0.933', '0.843']
5 -3.3783555 ['-6.08', '0.933', '0.843']
6 -3.3783555 ['-6.08', '0.933', '0.843']
Number of iterations: 7
β_hat = [-6.07848573 0.9334028 0.84329677]
As this was a simple model with few observations, the algorithm achieved convergence in only 7 iterations.
You can see that with each iteration, the log-likelihood value increased.
Remember, our objective was to maximize the log-likelihood function, which the algorithm has worked to achieve.
Also, note that the increase in log ℒ(𝛽 (𝑘) ) becomes smaller with each iteration.
This is because the gradient is approaching 0 as we reach the maximum, and therefore the numerator in our updating
equation is becoming smaller.
The gradient vector should be close to 0 at 𝛽̂
poi.G()
array([[-2.54574140e-13],
[-6.44040377e-13],
[-4.99100761e-13]])
The iterative process can be visualized in the following diagram, where the maximum is found at 𝛽 = 10
logL = lambda x: -(x - 10) ** 2 - 10

def find_tangent(β, a=0.01):
    y1 = logL(β)
    y2 = logL(β+a)
    x = np.array([[β, 1], [β+a, 1]])
    m, c = np.linalg.lstsq(x, np.array([y1, y2]), rcond=None)[0]
    return m, c

β = np.linspace(2, 18)
# Create the figure and axes used by the plotting calls below
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(β, logL(β), lw=2, c='black')

for β in [7, 8.5, 9.5, 10]:
    β_line = np.linspace(β-2, β+2)
    m, c = find_tangent(β)
    y = m * β_line + c
    ax.plot(β_line, y, '-', c='purple', alpha=0.8)
    ax.text(β+2.05, y[-1], f'$G({β}) = {abs(m):.0f}$', fontsize=12)
    ax.vlines(β, -24, logL(β), linestyles='--', alpha=0.5)
    ax.hlines(logL(β), 6, β, linestyles='--', alpha=0.5)

ax.set(ylim=(-24, -4), xlim=(6, 13))
ax.set_xlabel(r'$\beta$', fontsize=15)
ax.set_ylabel(r'$\log \mathcal{L}(\beta)$',
              rotation=0,
              labelpad=25,
              fontsize=15)
ax.grid(alpha=0.3)
plt.show()
Note that our implementation of the Newton-Raphson algorithm is rather basic — for more robust implementations see,
for example, scipy.optimize.
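As a rough illustration of that remark, the sketch below maximizes the same Poisson log-likelihood on the five-observation dataset by handing its negative to scipy.optimize.minimize. The function names neg_logL and neg_grad and the choice of the BFGS method are assumptions made for this example; they are not part of the lecture's code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Same small dataset as in the Newton-Raphson example
X = np.array([[1, 2, 5],
              [1, 1, 3],
              [1, 4, 2],
              [1, 5, 2],
              [1, 3, 1]], dtype=float)
y = np.array([1, 0, 1, 1, 0], dtype=float)

def neg_logL(beta):
    # Negative Poisson log-likelihood: -sum(y x'β - exp(x'β) - log y!)
    xb = X @ beta
    return -np.sum(y * xb - np.exp(xb) - gammaln(y + 1))

def neg_grad(beta):
    # Negative gradient: -sum((y - exp(x'β)) x)
    return -X.T @ (y - np.exp(X @ beta))

# BFGS is a quasi-Newton method (an illustrative choice, not the lecture's algorithm)
res = minimize(neg_logL, x0=np.array([0.1, 0.1, 0.1]), jac=neg_grad, method='BFGS')
beta_scipy = res.x
print(beta_scipy, -res.fun)
```

The estimates should be close to the β_hat reported above, although a quasi-Newton method takes a different path to the maximum than the Newton-Raphson iterations do.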
72.6 Maximum Likelihood Estimation with statsmodels
Now that we know what’s going on under the hood, we can apply MLE to an interesting application.
We’ll use the Poisson regression model in statsmodels to obtain a richer output with standard errors, test values, and
more.
statsmodels uses the same algorithm as above to find the maximum likelihood estimates.
Before we begin, let’s re-estimate our simple model with statsmodels to confirm we obtain the same coefficients and
log-likelihood value.
X = np.array([[1, 2, 5],
[1, 1, 3],
[1, 4, 2],
[1, 5, 2],
[1, 3, 1]])
y = np.array([1, 0, 1, 1, 0])
stats_poisson = Poisson(y, X).fit()

print(stats_poisson.summary())

Optimization terminated successfully.

Current function value: 0.675671
Iterations 7
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 5
Model: Poisson Df Residuals: 2
Method: MLE Df Model: 2
Date: Tue, 30 Apr 2024 Pseudo R-squ.: 0.2546
converged: True LL-Null: -4.5325
Covariance Type: nonrobust LLR p-value: 0.3153
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -6.0785 5.279 -1.151 0.250 -16.425 4.268
x1 0.9334 0.829 1.126 0.260 -0.691 2.558
x2 0.8433 0.798 1.057 0.291 -0.720 2.407
==============================================================================
Now let’s replicate results from Daniel Treisman’s paper, Russia’s Billionaires, mentioned earlier in the lecture.
Treisman starts by estimating equation (72.1), where:
• 𝑦𝑖 is 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑏𝑖𝑙𝑙𝑖𝑜𝑛𝑎𝑖𝑟𝑒𝑠𝑖
• 𝑥𝑖1 is log 𝐺𝐷𝑃 𝑝𝑒𝑟 𝑐𝑎𝑝𝑖𝑡𝑎𝑖
• 𝑥𝑖2 is log 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛𝑖
• 𝑥𝑖3 is 𝑦𝑒𝑎𝑟𝑠 𝑖𝑛 𝐺𝐴𝑇 𝑇 𝑖 – years membership in GATT and WTO (to proxy access to international markets)
The paper only considers the year 2008 for estimation.
We will set up our variables for estimation like so (you should have the data assigned to df from earlier in the lecture)
# Keep only year 2008

df = df[df['year'] == 2008]
# Add a constant
df['const'] = 1
# Variable sets
reg1 = ['const', 'lngdppc', 'lnpop', 'gattwto08']
reg2 = ['const', 'lngdppc', 'lnpop',
'gattwto08', 'lnmcap08', 'rintr', 'topint08']
reg3 = ['const', 'lngdppc', 'lnpop', 'gattwto08', 'lnmcap08',
'rintr', 'topint08', 'nrrents', 'roflaw']
Then we can use the Poisson function from statsmodels to fit the model.
We’ll use robust standard errors as in the author’s paper
# Specify model
poisson_reg = sm.Poisson(df[['numbil0']], df[reg1],
missing='drop').fit(cov_type='HC0')
print(poisson_reg.summary())


Iterations 9
Poisson Regression Results
==============================================================================
Dep. Variable: numbil0 No. Observations: 197
Model: Poisson Df Residuals: 193
Covariance Type: HC0 LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -29.0495 2.578 -11.268 0.000 -34.103 -23.997
lngdppc 1.0839 0.138 7.834 0.000 0.813 1.355
lnpop 1.1714 0.097 12.024 0.000 0.980 1.362
gattwto08 0.0060 0.007 0.868 0.386 -0.008 0.019
==============================================================================
Success! The algorithm was able to achieve convergence in 9 iterations.

Our output indicates that GDP per capita, population, and years of membership in the General Agreement on Tariffs and
Trade (GATT) are positively related to the number of billionaires a country has, as expected.
Let’s also estimate the author’s more full-featured models and display them in a single table
regs = [reg1, reg2, reg3]

reg_names = ['Model 1', 'Model 2', 'Model 3']
info_dict = {'Pseudo R-squared': lambda x: f"{x.prsquared:.2f}",
'No. observations': lambda x: f"{int(x.nobs):d}"}
regressor_order = ['const',
'lngdppc',
'lnpop',
'gattwto08',
'lnmcap08',
'rintr',
'topint08',
'nrrents',
'roflaw']
results = []
for reg in regs:

result = sm.Poisson(df[['numbil0']], df[reg],
missing='drop').fit(cov_type='HC0',
maxiter=100, disp=0)
results.append(result)
results_table = summary_col(results=results,
float_format='%0.3f',
stars=True,
model_names=reg_names,
info_dict=info_dict,
regressor_order=regressor_order)
results_table.add_title('Table 1 - Explaining the Number of Billionaires \
in 2008')
print(results_table)

Table 1 - Explaining the Number of Billionaires in 2008

=================================================
Model 1 Model 2 Model 3
-------------------------------------------------
const -29.050*** -19.444*** -20.858***
(2.578) (4.820) (4.255)
lngdppc 1.084*** 0.717*** 0.737***
(0.138) (0.244) (0.233)
lnpop 1.171*** 0.806*** 0.929***
(0.097) (0.213) (0.195)
gattwto08 0.006 0.007 0.004
(0.007) (0.006) (0.006)
lnmcap08 0.399** 0.286*
(0.172) (0.167)
rintr -0.010 -0.009
(0.010) (0.010)
topint08 -0.051*** -0.058***
(0.011) (0.012)
nrrents -0.005
(0.010)
roflaw 0.203
(0.372)
Pseudo R-squared 0.86 0.90 0.90
No. observations 197 131 131
=================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
The output suggests that the frequency of billionaires is positively correlated with GDP per capita, population size, stock
market capitalization, and negatively correlated with top marginal income tax rate.
To analyze our results by country, we can plot the difference between the predicted and actual values, then sort from highest
to lowest and plot the first 15
data = ['const', 'lngdppc', 'lnpop', 'gattwto08', 'lnmcap08', 'rintr',

'topint08', 'nrrents', 'roflaw', 'numbil0', 'country']
results_df = df[data].dropna()
# Use last model (model 3)

results_df['prediction'] = results[-1].predict()
# Calculate difference
results_df['difference'] = results_df['numbil0'] - results_df['prediction']
# Sort in descending order

results_df.sort_values('difference', ascending=False, inplace=True)
# Plot the first 15 data points

results_df[:15].plot('country', 'difference', kind='bar',
figsize=(12,8), legend=False)
plt.ylabel('Number of billionaires above predicted level')
plt.xlabel('Country')
plt.show()

As we can see, Russia has by far the highest number of billionaires in excess of what is predicted by the model (around
50 more than expected).
Treisman uses this empirical result to discuss possible reasons for Russia’s excess of billionaires, including the origination
of wealth in Russia, the political climate, and the history of privatization in the years after the USSR.
72.7 Summary
In this lecture, we used Maximum Likelihood Estimation to estimate the parameters of a Poisson model.
statsmodels contains other built-in likelihood models such as Probit and Logit.
For further flexibility, statsmodels provides a way to specify the distribution manually using the
GenericLikelihoodModel class; an example notebook can be found in the statsmodels documentation.

72.8 Exercises
Exercise 72.8.1
Suppose we wanted to estimate the probability of an event 𝑦𝑖 occurring, given some observations.
We could use a probit regression model, where the pmf of 𝑦𝑖 is
$$f(y_i; \beta) = \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}, \quad y_i = 0, 1$$
where 𝜇𝑖 = Φ(x′𝑖 𝛽)
Φ represents the cumulative normal distribution and constrains the predicted 𝑦𝑖 to be between 0 and 1 (as required for a
probability).
𝛽 is a vector of coefficients.
Following the example in the lecture, write a class to represent the Probit model.
To begin, find the log-likelihood function and derive the gradient and Hessian.
The scipy module stats.norm contains the functions needed to compute the cdf and pdf of the normal distribution.

The log-likelihood can be written as
$$\log \mathcal{L} = \sum_{i=1}^n \left[ y_i \log \Phi(\mathbf{x}_i' \beta) + (1 - y_i) \log (1 - \Phi(\mathbf{x}_i' \beta)) \right]$$
Using the fundamental theorem of calculus, the derivative of a cumulative distribution function is its probability
density function
$$\frac{\partial}{\partial s} \Phi(s) = \phi(s)$$
where 𝜙 is the standard normal probability density function.
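The gradient vector of the Probit model is
$$\frac{\partial \log \mathcal{L}}{\partial \beta} = \sum_{i=1}^n \left[ y_i \frac{\phi(\mathbf{x}_i' \beta)}{\Phi(\mathbf{x}_i' \beta)} - (1 - y_i) \frac{\phi(\mathbf{x}_i' \beta)}{1 - \Phi(\mathbf{x}_i' \beta)} \right] \mathbf{x}_i$$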
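The Hessian of the Probit model is
$$\frac{\partial^2 \log \mathcal{L}}{\partial \beta \partial \beta'} = -\sum_{i=1}^n \phi(\mathbf{x}_i' \beta) \left[ y_i \frac{\phi(\mathbf{x}_i' \beta) + \mathbf{x}_i' \beta \, \Phi(\mathbf{x}_i' \beta)}{[\Phi(\mathbf{x}_i' \beta)]^2} + (1 - y_i) \frac{\phi(\mathbf{x}_i' \beta) - \mathbf{x}_i' \beta \, (1 - \Phi(\mathbf{x}_i' \beta))}{[1 - \Phi(\mathbf{x}_i' \beta)]^2} \right] \mathbf{x}_i \mathbf{x}_i'$$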
Using these results, we can write a class for the Probit model as follows
class ProbitRegression:

    def __init__(self, y, X, β):
        self.X, self.y, self.β = X, y, β
        self.n, self.k = X.shape

    def μ(self):
        return norm.cdf(self.X @ self.β.T)

    def ϕ(self):
        return norm.pdf(self.X @ self.β.T)

    def logL(self):
        # Use the data stored on the instance rather than global variables
        y = self.y
        μ = self.μ()
        return np.sum(y * np.log(μ) + (1 - y) * np.log(1 - μ))

    def G(self):
        X, y = self.X, self.y
        μ = self.μ()
        ϕ = self.ϕ()
        return np.sum((X.T * y * ϕ / μ - X.T * (1 - y) * ϕ / (1 - μ)),
                      axis=1)

    def H(self):
        X, y = self.X, self.y
        β = self.β
        μ = self.μ()
        ϕ = self.ϕ()
        a = (ϕ + (X @ β.T) * μ) / μ**2
        b = (ϕ - (X @ β.T) * (1 - μ)) / (1 - μ)**2
        return -(ϕ * (y * a + (1 - y) * b) * X.T) @ X
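One way to gain some confidence in a hand-derived gradient is to compare it with a finite-difference approximation of the log-likelihood. The sketch below does this for the Probit gradient on the small dataset from Exercise 72.8.2; the helper names probit_logL, probit_grad and numerical_grad, and the step size h, are assumptions introduced for this check only.

```python
import numpy as np
from scipy.stats import norm

# Data from Exercise 72.8.2
X = np.array([[1, 2, 4],
              [1, 1, 1],
              [1, 4, 3],
              [1, 5, 6],
              [1, 3, 5]], dtype=float)
y = np.array([1, 0, 1, 1, 0], dtype=float)

def probit_logL(beta):
    mu = norm.cdf(X @ beta)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def probit_grad(beta):
    # Analytical gradient: sum_i [y_i φ/Φ - (1 - y_i) φ/(1 - Φ)] x_i
    xb = X @ beta
    mu, phi = norm.cdf(xb), norm.pdf(xb)
    weights = y * phi / mu - (1 - y) * phi / (1 - mu)
    return X.T @ weights

def numerical_grad(f, beta, h=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (f(beta + step) - f(beta - step)) / (2 * h)
    return grad

beta0 = np.array([0.1, 0.1, 0.1])
analytic = probit_grad(beta0)
numeric = numerical_grad(probit_logL, beta0)
print(np.max(np.abs(analytic - numeric)))
```

If the two vectors agree to several decimal places, the analytical expression is very likely coded correctly; a large gap usually points to a sign or indexing mistake.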
Exercise 72.8.2
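Use the following dataset and initial values of 𝛽 to estimate the MLE with the Newton-Raphson algorithm developed
earlier in the lecture
$$X = \begin{bmatrix} 1 & 2 & 4 \\ 1 & 1 & 1 \\ 1 & 4 & 3 \\ 1 & 5 & 6 \\ 1 & 3 & 5 \end{bmatrix} \quad y = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} \quad \beta_{(0)} = \begin{bmatrix} 0.1 \\ 0.1 \\ 0.1 \end{bmatrix}$$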
Verify your results with statsmodels - you can import the Probit function with the following import statement
from statsmodels.discrete.discrete_model import Probit
Note that the simple Newton-Raphson algorithm developed in this lecture is very sensitive to initial values, and therefore
you may fail to achieve convergence with different starting values.

X = np.array([[1, 2, 4],
[1, 1, 1],
[1, 4, 3],
[1, 5, 6],
[1, 3, 5]])
y = np.array([1, 0, 1, 1, 0])

# Take a guess at initial βs

β = np.array([0.1, 0.1, 0.1])
# Create instance of Probit regression class

prob = ProbitRegression(y, X, β)
# Run Newton-Raphson algorithm

newton_raphson(prob)
Iteration_k Log-likelihood θ
-----------------------------------------------------------------------------------
↪------
0 -2.3796884 ['-1.34', '0.775', '-0.157']

1 -2.3687526 ['-1.53', '0.775', '-0.0981']
2 -2.3687294 ['-1.55', '0.778', '-0.0971']
3 -2.3687294 ['-1.55', '0.778', '-0.0971']
Number of iterations: 4
β_hat = [-1.54625858 0.77778952 -0.09709757]
array([-1.54625858, 0.77778952, -0.09709757])
# Use statsmodels to verify results
print(Probit(y, X).fit().summary())

Iterations 6
Probit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 5
Model: Probit Df Residuals: 2
Covariance Type: nonrobust LLR p-value: 0.3692
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.5463 1.866 -0.829 0.407 -5.204 2.111
x1 0.7778 0.788 0.986 0.324 -0.768 2.323
x2 -0.0971 0.590 -0.165 0.869 -1.254 1.060
==============================================================================


Part XIII
Auctions
CHAPTER
SEVENTYTHREE
FIRST-PRICE AND SECOND-PRICE AUCTIONS
This lecture is designed to set the stage for a subsequent lecture about Multiple Good Allocation Mechanisms
In that lecture, a planner or auctioneer simultaneously allocates several goods to a set of people.
In the present lecture, a single good is allocated to one person within a set of people.
Here we’ll learn about and simulate two classic auctions:
• a First-Price Sealed-Bid Auction (FPSB)
• a Second-Price Sealed-Bid Auction (SPSB) created by William Vickrey [Vickrey, 1961]
We’ll also learn about and apply a
• Revenue Equivalence Theorem
We recommend watching this video about second price auctions by Anders Munk-Nielsen:
https://youtu.be/qwWk_Bqtue8
and
https://youtu.be/eYTGQCGpmXI
Anders Munk-Nielsen put his code on GitHub.

Much of our Python code below is based on his.
73.1 First-Price Sealed-Bid Auction (FPSB)
Protocols:
• A single good is auctioned.
• Prospective buyers simultaneously submit sealed bids.
• Each bidder knows only his/her own bid.
• The good is allocated to the person who submits the highest bid.
• The winning bidder pays the price she has bid.
Detailed Setting:
There are 𝑛 > 2 prospective buyers named 𝑖 = 1, 2, … , 𝑛.
Buyer 𝑖 attaches value 𝑣𝑖 to the good being sold.
Buyer 𝑖 wants to maximize the expected value of her surplus defined as 𝑣𝑖 − 𝑝, where 𝑝 is the price that she pays,
conditional on her winning the auction.
Evidently,
• If 𝑖 bids exactly 𝑣𝑖 , she pays what she thinks it is worth and gathers no surplus value.
• Buyer 𝑖 will never want to bid more than 𝑣𝑖 .
• If buyer 𝑖 bids 𝑏 < 𝑣𝑖 and wins the auction, then she gathers surplus value 𝑣𝑖 − 𝑏 > 0.
• If buyer 𝑖 bids 𝑏 < 𝑣𝑖 and someone else bids more than 𝑏, buyer 𝑖 loses the auction and gets no surplus value.
• To proceed, buyer 𝑖 wants to know the probability that she wins the auction as a function of her bid 𝑏
– this requires that she know a probability distribution of bids 𝑣𝑗 made by prospective buyers 𝑗 ≠ 𝑖
• Given her idea about that probability distribution, buyer 𝑖 wants to set a bid that maximizes the mathematical
expectation of her surplus value.
Bids are sealed, so no bidder knows bids submitted by other prospective buyers.
This means that bidders are in effect participating in a game in which players do not know payoffs of other players.
This is a Bayesian game, a Nash equilibrium of which is called a Bayesian Nash equilibrium.
To complete the specification of the situation, we’ll assume that prospective buyers’ valuations are independently and
identically distributed according to a probability distribution that is known by all bidders.
Bidder 𝑖 optimally chooses to bid less than 𝑣𝑖 .
73.1.1 Characterization of FPSB Auction
A FPSB auction has a unique symmetric Bayesian Nash Equilibrium.

The optimal bid of buyer 𝑖 is
$$\mathbb{E}[y_i \mid y_i < v_i] \tag{73.1}$$
where 𝑣𝑖 is the valuation of bidder 𝑖 and 𝑦𝑖 is the maximum valuation of all other bidders:
$$y_i = \max_{j \neq i} v_j \tag{73.2}$$
A proof of this assertion is available at the Wikipedia page about Vickrey auctions.
73.2 Second-Price Sealed-Bid Auction (SPSB)
Protocols: In a second-price sealed-bid (SPSB) auction, the winner pays the second-highest bid.
73.3 Characterization of SPSB Auction
In a SPSB auction bidders optimally choose to bid their values.

Formally, a dominant strategy profile in a SPSB auction with a single, indivisible item has each bidder bidding its value.
A proof is provided at the Wikipedia page about Vickrey auctions.
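The dominance claim can be illustrated numerically. In the sketch below, a bidder with a hypothetical value v_i = 0.6 faces the highest bid m among four rivals, drawn here from U(0, 1) purely for illustration, and we compare her surplus from bidding truthfully with her surplus from a few alternative bids. The names spsb_surplus and alternatives, the seed, and the particular bids are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
n, R = 5, 100_000
v_i = 0.6                        # hypothetical valuation of bidder i

# Highest bid among the n-1 rivals in each of R auctions
m = rng.uniform(0, 1, size=(n - 1, R)).max(axis=0)

def spsb_surplus(bid):
    # Bidder i wins when her bid exceeds the rivals' highest bid,
    # and then pays that second-highest bid m
    wins = bid > m
    return np.where(wins, v_i - m, 0.0)

truthful = spsb_surplus(v_i)
alternatives = {b: spsb_surplus(b) for b in (0.2, 0.4, 0.8, 1.0)}

for b, s in alternatives.items():
    print(f"bid {b:.1f}: mean surplus {s.mean():.4f} "
          f"(truthful: {truthful.mean():.4f})")
```

In every simulated auction the truthful bid does at least as well as each alternative: underbidding forgoes some profitable wins, while overbidding adds only wins at a price above 𝑣𝑖.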

73.4 Uniform Distribution of Private Values

We assume valuation 𝑣𝑖 of bidder 𝑖 is distributed $v_i \stackrel{\text{i.i.d.}}{\sim} U(0, 1)$.
Under this assumption, we can analytically compute probability distributions of prices bid in both FPSB and SPSB.
We’ll simulate outcomes and, by using a law of large numbers, verify that the simulated outcomes agree with analytical
ones.
We can use our simulation to illustrate a Revenue Equivalence Theorem that asserts that on average first-price and
second-price sealed bid auctions provide a seller the same revenue.
To read about the revenue equivalence theorem, see this Wikipedia page.
73.5 Setup
There are 𝑛 bidders.

Each bidder knows that there are 𝑛 − 1 other bidders.
73.6 First price sealed bid auction
An optimal bid for bidder 𝑖 in a FPSB is described by equations (73.1) and (73.2).
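When valuations are i.i.d. draws from a uniform distribution on [0, 1], the CDF of 𝑦𝑖 is
$$\begin{aligned} \tilde{F}_{n-1}(y) = \mathbf{P}(y_i \leq y) &= \mathbf{P}(\max_{j \neq i} v_j \leq y) \\ &= \prod_{j \neq i} \mathbf{P}(v_j \leq y) \\ &= y^{n-1} \end{aligned}$$
and the PDF of 𝑦𝑖 is $\tilde{f}_{n-1}(y) = (n-1) y^{n-2}$.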
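Then bidder 𝑖’s optimal bid in a FPSB auction is:
$$\begin{aligned} \mathbf{E}(y_i \mid y_i < v_i) &= \frac{\int_0^{v_i} y_i \tilde{f}_{n-1}(y_i) \, dy_i}{\int_0^{v_i} \tilde{f}_{n-1}(y_i) \, dy_i} \\ &= \frac{\int_0^{v_i} (n-1) y_i^{n-1} \, dy_i}{\int_0^{v_i} (n-1) y_i^{n-2} \, dy_i} \\ &= \frac{n-1}{n} y_i \bigg|_0^{v_i} \\ &= \frac{n-1}{n} v_i \end{aligned}$$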
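The closed form can be checked with a small Monte Carlo experiment. The sketch below draws the valuations of the 𝑛 − 1 rivals, keeps the auctions in which their maximum falls below a hypothetical 𝑣𝑖 = 0.7, and compares the conditional mean with (𝑛 − 1)/𝑛 × 𝑣𝑖. The seed, the sample size and the value of v_i are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
n, R = 5, 200_000
v_i = 0.7                        # hypothetical valuation of bidder i

# Valuations of the n-1 rivals in each of R auctions, and their maximum y_i
rivals = rng.uniform(0, 1, size=(n - 1, R))
y = rivals.max(axis=0)

# Conditional mean E[y_i | y_i < v_i], estimated from the retained draws
simulated = y[y < v_i].mean()
analytical = (n - 1) / n * v_i

print(f"simulated: {simulated:.4f}, analytical: {analytical:.4f}")
```

With this many draws the two numbers should agree to roughly two or three decimal places.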

73.7 Second Price Sealed Bid Auction
In a SPSB, it is optimal for bidder 𝑖 to bid 𝑣𝑖 .
73.8 Python Code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import scipy.interpolate as interp

# for plots
plt.rcParams.update({"text.usetex": True, 'font.size': 14})
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# ensure the notebook generates the same randomness

We repeat an auction with 5 bidders 100,000 times.

The valuation of each bidder is distributed 𝑈 (0, 1).
N = 5
R = 100_000
v = np.random.uniform(0,1,(N,R))
# BNE in first-price sealed bid
b_star = lambda vi,N :((N-1)/N) * vi

b = b_star(v,N)
We compute and sort bid price distributions that emerge under both FPSB and SPSB.
idx = np.argsort(v, axis=0)  # Bidders' values are sorted in ascending order in each auction.
# We record the order because we want to apply it to bid price and their id.

v = np.take_along_axis(v, idx, axis=0)  # same as np.sort(v, axis=0), except now we retain the idx
b = np.take_along_axis(b, idx, axis=0)

ii = np.repeat(np.arange(1, N+1)[:, None], R, axis=1)  # the id for the bidders is created.
ii = np.take_along_axis(ii, idx, axis=0)  # the id is sorted according to bid price as well.

winning_player = ii[-1, :]  # In FPSB and SPSB, winners are those with highest values.

winner_pays_fpsb = b[-1, :]  # highest bid
winner_pays_spsb = v[-2, :]  # 2nd-highest valuation
Let’s now plot the winning bids 𝑏(𝑛) (i.e. the payment) against valuations, 𝑣(𝑛) for both FPSB and SPSB.

Note that
• FPSB: There is a unique bid corresponding to each valuation
• SPSB: Because it equals the valuation of a second-highest bidder, what a winner pays varies even holding fixed the
winner’s valuation. So here there is a frequency distribution of payments for each valuation.
# We intend to compute average payments of different groups of bidders
binned = stats.binned_statistic(v[-1,:], v[-2,:], statistic='mean', bins=20)
xx = binned.bin_edges
xx = [(xx[ii]+xx[ii+1])/2 for ii in range(len(xx)-1)]
yy = binned.statistic

# Create the figure and axes used by the plotting calls below
fig, ax = plt.subplots(figsize=(6, 4))

ax.plot(xx, yy, label='SPSB average payment')
ax.plot(v[-1,:], b[-1,:], '--', alpha = 0.8, label = 'FPSB analytic')
ax.plot(v[-1,:], v[-2,:], 'o', alpha = 0.05, markersize = 0.1, label = 'SPSB: actual bids')

ax.legend(loc='best')
ax.set_xlabel('Valuation, $v_i$')
ax.set_ylabel('Bid, $b_i$')
sns.despine()

73.9 Revenue Equivalence Theorem
We now compare FPSB and SPSB auctions from the point of view of the revenues that a seller can expect to acquire.
Expected Revenue FPSB:
The winner with valuation 𝑦 pays $\frac{n-1}{n} y$, where 𝑛 is the number of bidders.
By the same argument as above, the CDF of the highest of the 𝑛 valuations is $F_n(y) = y^n$ and its PDF is $f_n(y) = n y^{n-1}$.
Consequently, expected revenue is
$$\mathbf{R} = \int_0^1 \frac{n-1}{n} v_i \times n v_i^{n-1} \, dv_i = \frac{n-1}{n+1}$$
Expected Revenue SPSB:

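The expected revenue equals 𝑛 × expected payment of a bidder.
Computing this we get
$$\begin{aligned} \mathbf{TR} &= n \mathbf{E}_{v_i} \left[ \mathbf{E}_{y_i}[y_i \mid y_i < v_i] \mathbf{P}(y_i < v_i) + 0 \times \mathbf{P}(y_i > v_i) \right] \\ &= n \mathbf{E}_{v_i} \left[ \mathbf{E}_{y_i}[y_i \mid y_i < v_i] \tilde{F}_{n-1}(v_i) \right] \\ &= n \mathbf{E}_{v_i} \left[ \frac{n-1}{n} \times v_i \times v_i^{n-1} \right] \\ &= (n-1) \mathbf{E}_{v_i}[v_i^n] \\ &= \frac{n-1}{n+1} \end{aligned}$$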
Thus, while probability distributions of winning bids typically differ across the two types of auction, we deduce that
expected payments are identical in FPSB and SPSB.
# Create the figure and axes used by the plotting calls below
fig, ax = plt.subplots(figsize=(6, 4))

for payment, label in zip([winner_pays_fpsb, winner_pays_spsb], ['FPSB', 'SPSB']):
    print('The average payment of %s: %.4f. Std.: %.4f. Median: %.4f'
          % (label, payment.mean(), payment.std(), np.median(payment)))
    ax.hist(payment, density=True, alpha=0.6, label=label, bins=100)

ax.axvline(winner_pays_fpsb.mean(), ls='--', c='g', label='Mean')
ax.axvline(winner_pays_spsb.mean(), ls='--', c='r', label='Mean')

ax.legend(loc='best')
ax.set_xlabel('Bid')
ax.set_ylabel('Density')
sns.despine()
The average payment of FPSB: 0.6665. Std.: 0.1129. Median: 0.6967

The average payment of SPSB: 0.6667. Std.: 0.1782. Median: 0.6862
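The same comparison can be repeated for other numbers of bidders. The sketch below simulates both auctions for a few hypothetical values of 𝑛 and sets the average payments beside the theoretical revenue (𝑛 − 1)/(𝑛 + 1); the seed, the values of n and the results dictionary are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
R = 100_000                      # number of simulated auctions per n
results = {}

for n in (2, 5, 10):
    # Sort valuations within each auction so the last row holds the winner
    v = np.sort(rng.uniform(0, 1, size=(n, R)), axis=0)

    fpsb = ((n - 1) / n * v[-1]).mean()   # winner pays her own equilibrium bid
    spsb = v[-2].mean()                   # winner pays the second-highest value
    theory = (n - 1) / (n + 1)

    results[n] = (fpsb, spsb, theory)
    print(f"n={n:2d}: FPSB {fpsb:.4f}, SPSB {spsb:.4f}, theory {theory:.4f}")
```

Average revenue rises with 𝑛 in both formats, and for each 𝑛 the two simulated means should sit close to the same theoretical value.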

Summary of FPSB and SPSB results with uniform distribution on [0, 1]

Auction: Sealed-Bid        | First-Price                   | Second-Price
---------------------------|-------------------------------|----------------------------------------
Winner                     | Agent with highest bid        | Agent with highest bid
Winner pays                | Winner’s bid                  | Second-highest bid
Loser pays                 | 0                             | 0
Dominant strategy          | No dominant strategy          | Bidding truthfully is dominant strategy
Bayesian Nash equilibrium  | Bidder 𝑖 bids (𝑛 − 1)/𝑛 × 𝑣𝑖     | Bidder 𝑖 truthfully bids 𝑣𝑖
Auctioneer’s revenue       | (𝑛 − 1)/(𝑛 + 1)                 | (𝑛 − 1)/(𝑛 + 1)
Detour: Computing a Bayesian Nash Equilibrium for FPSB

The Revenue Equivalence Theorem lets us deduce an optimal bidding strategy for a FPSB auction from outcomes of a SPSB
auction.
Let 𝑏(𝑣𝑖 ) be the optimal bid in a FPSB auction.
The revenue equivalence theorem tells us that a bidder with value 𝑣𝑖 on average makes the same payment in the
two types of auction.
Consequently,
𝑏(𝑣𝑖 )P(𝑦𝑖 < 𝑣𝑖 ) + 0 ∗ P(𝑦𝑖 ≥ 𝑣𝑖 ) = E𝑦𝑖 [𝑦𝑖 |𝑦𝑖 < 𝑣𝑖 ]P(𝑦𝑖 < 𝑣𝑖 ) + 0 ∗ P(𝑦𝑖 ≥ 𝑣𝑖 )
It follows that an optimal bidding strategy in a FPSB auction is 𝑏(𝑣𝑖 ) = E𝑦𝑖 [𝑦𝑖 |𝑦𝑖 < 𝑣𝑖 ].

73.10 Calculation of Bid Price in FPSB
In equations (73.1) and (73.2), we displayed formulas for optimal bids in a symmetric Bayesian Nash Equilibrium of a
FPSB auction.
$$\mathbb{E}[y_i \mid y_i < v_i]$$
where
• 𝑣𝑖 = value of bidder 𝑖
• 𝑦𝑖 = maximum value of all bidders except 𝑖, i.e., $y_i = \max_{j \neq i} v_j$
Above, we computed an optimal bid price in a FPSB auction analytically for a case in which private values are uniformly
distributed.
For most probability distributions of private values, analytical solutions aren’t easy to compute.
Instead, we can compute bid prices in FPSB auctions numerically as functions of the distribution of private values.
def evaluate_largest(v_hat, array, order=1):
    """
    A method to estimate the largest (or certain-order largest) value of the other bidders,
    conditional on player 1 winning the auction.

    Parameters:
    ----------
    v_hat : float, the value of player 1. The biggest value in the auction that player 1 wins.

    array: 2 dimensional array of bidders' values in shape of (N,R),
           where N: number of players, R: number of auctions

    order: int. The order of largest number among bidders who lose.
           e.g. the order for largest number beside winner is 1.
                the order for second-largest number beside winner is 2.
    """
    N, R = array.shape
    # drop the first row because we assume first row is the winner's bid
    array_residual = array[1:, :].copy()

    index = (array_residual < v_hat).all(axis=0)
    array_conditional = array_residual[:, index].copy()

    array_conditional = np.sort(array_conditional, axis=0)
    return array_conditional[-order, :].mean()
We can check the accuracy of our evaluate_largest method by comparing it with an analytical solution.
We find that, despite small discrepancies, the evaluate_largest method functions well.
Furthermore, if we take a very large number of auctions, say 1 million, the discrepancy disappears.
v_grid = np.linspace(0.3,1,8)
bid_analytical = b_star(v_grid,N)
bid_simulated = [evaluate_largest(ii, v) for ii in v_grid]

# Create the figure and axes used by the plotting calls below
fig, ax = plt.subplots(figsize=(6, 4))

ax.plot(v_grid, bid_analytical, '-', color='k', label='Analytical')
ax.plot(v_grid, bid_simulated, '--', color='r', label='Simulated')

ax.legend(loc='best')
ax.set_ylabel('Bid, $b_i$')
ax.set_title('Solution for FPSB')
sns.despine()
73.11 𝜒2 Distribution
Let’s try an example in which the distribution of private values is a 𝜒2 distribution.

We’ll start by taking a look at a 𝜒2 distribution with the help of the following Python code:
v = np.random.chisquare(df=2, size=(N*R,))
plt.hist(v, bins=50, edgecolor='w')

plt.xlabel('Values: $v$')
plt.show()

Now we’ll get Python to construct a bid price function
v = np.random.chisquare(df=2, size=(N,R))
# we compute the quantile of v as our grid

pct_quantile = np.linspace(0, 100, 101)[1:-1]
v_grid = np.percentile(v.flatten(), q=pct_quantile)
EV=[evaluate_largest(ii, v) for ii in v_grid]

# nan values are returned for some low quantiles due to lack of observations
/tmp/ipykernel_9758/521884726.py:25: RuntimeWarning: Mean of empty slice.

/opt/conda/envs/quantecon/lib/python3.11/site-packages/numpy/core/_methods.py:129:␣
↪RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
# we insert 0 into our grid and bid price function as a complement

EV=np.insert(EV,0,0)
v_grid=np.insert(v_grid,0,0)
b_star_num = interp.interp1d(v_grid, EV, fill_value="extrapolate")
We check our bid price function by computing and visualizing the result.

pct_quantile_fine = np.linspace(0, 100, 1001)[1:-1]
v_grid_fine = np.percentile(v.flatten(), q=pct_quantile_fine)

# Create the figure and axes used by the plotting calls below
fig, ax = plt.subplots(figsize=(6, 4))

ax.plot(v_grid, EV, 'or', label='Simulation on Grid')
ax.plot(v_grid_fine, b_star_num(v_grid_fine) , '-', label='Interpolation Solution')

ax.legend(loc='best')
ax.set_ylabel('Optimal Bid in FPSB')
sns.despine()
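As a cross-check on the simulated bid function, the conditional expectation can also be computed by numerical integration. Integrating by parts, with 𝐹 the CDF of a single valuation supported on [0, ∞), gives E[𝑦𝑖 | 𝑦𝑖 < 𝑣𝑖] = 𝑣𝑖 − ∫₀^𝑣𝑖 𝐹(𝑥)^(𝑛−1) 𝑑𝑥 / 𝐹(𝑣𝑖)^(𝑛−1). The sketch below applies this identity with scipy.integrate.quad; the function name fpsb_bid and the evaluation points are assumptions for illustration, and this is an alternative to, not a description of, the simulation approach used in the lecture.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def fpsb_bid(v_i, dist, n):
    """
    E[y_i | y_i < v_i] for n bidders with i.i.d. values drawn from the
    frozen scipy.stats distribution `dist`, assumed supported on [0, inf).
    """
    if v_i <= 0:
        return 0.0
    # CDF of the maximum of the n-1 rival valuations
    G = lambda x: dist.cdf(x) ** (n - 1)
    integral, _ = quad(G, 0, v_i)
    return v_i - integral / G(v_i)

n = 5
bid_uniform = fpsb_bid(0.7, stats.uniform(), n)     # should equal (n-1)/n * 0.7
bid_chi2 = fpsb_bid(3.0, stats.chi2(df=2), n)       # chi-square values, df=2

print(bid_uniform, bid_chi2)
```

For the uniform case this reproduces the analytical bid (𝑛 − 1)/𝑛 × 𝑣𝑖, and for the 𝜒2 case it gives values that can be set against the interpolated bid function.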
Now we can use Python to compute the probability distribution of the price paid by the winning bidder
b=b_star_num(v)
idx = np.argsort(v, axis=0)

v = np.take_along_axis(v, idx, axis=0)  # same as np.sort(v, axis=0), except now we retain the idx
b = np.take_along_axis(b, idx, axis=0)
ii = np.repeat(np.arange(1,N+1)[:,None], R, axis=1)
ii = np.take_along_axis(ii, idx, axis=0)
winning_player = ii[-1,:]
winner_pays_fpsb = b[-1,:] # highest bid

winner_pays_spsb = v[-2,:] # 2nd-highest valuation

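# Create the figure and axes for the payment histograms
fig, ax = plt.subplots(figsize=(6, 4))

for payment, label in zip([winner_pays_fpsb, winner_pays_spsb], ['FPSB', 'SPSB']):
    print('The average payment of %s: %.4f. Std.: %.4f. Median: %.4f'
          % (label, payment.mean(), payment.std(), np.median(payment)))
    ax.hist(payment, density=True, alpha=0.6, label=label, bins=100)

ax.legend(loc='best')
sns.despine()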

73.12 Code Summary
We assemble the functions that we have used into a Python class
class bid_price_solution:

    def __init__(self, array):
        """
        A class that can plot the value distribution of bidders,
        compute the optimal bid price for bidders in FPSB
        and plot the distribution of winner's payment in both FPSB and SPSB

        Parameters:
        ----------
        array: 2 dimensional array of bidders' values in shape of (N,R),
               where N: number of players, R: number of auctions
        """
        self.value_mat = array.copy()

        return None

    def plot_value_distribution(self):
        plt.hist(self.value_mat.flatten(), bins=50, edgecolor='w')
        plt.xlabel('Values: $v$')
        plt.show()

        return None

    def evaluate_largest(self, v_hat, order=1):
        N, R = self.value_mat.shape
        # drop the first row because we assume first row is the winner's bid
        array_residual = self.value_mat[1:, :].copy()

        index = (array_residual < v_hat).all(axis=0)
        array_conditional = array_residual[:, index].copy()

        array_conditional = np.sort(array_conditional, axis=0)
        # return the conditional mean, as in the standalone function above
        return array_conditional[-order, :].mean()

    def compute_optimal_bid_FPSB(self):
        # we compute the quantile of v as our grid
        pct_quantile = np.linspace(0, 100, 101)[1:-1]
        v_grid = np.percentile(self.value_mat.flatten(), q=pct_quantile)

        EV = [self.evaluate_largest(ii) for ii in v_grid]
        # nan values are returned for some low quantiles due to lack of observations

        # we insert 0 into our grid and bid price function as a complement
        EV = np.insert(EV, 0, 0)
        v_grid = np.insert(v_grid, 0, 0)

        self.b_star_num = interp.interp1d(v_grid, EV, fill_value="extrapolate")

        pct_quantile_fine = np.linspace(0, 100, 1001)[1:-1]
        v_grid_fine = np.percentile(self.value_mat.flatten(), q=pct_quantile_fine)

        # create the figure and axes used by the plotting calls below
        fig, ax = plt.subplots(figsize=(6, 4))

        ax.plot(v_grid, EV, 'or', label='Simulation on Grid')
        ax.plot(v_grid_fine, self.b_star_num(v_grid_fine), '-', label='Interpolation Solution')

        ax.legend(loc='best')
        ax.set_ylabel('Optimal Bid in FPSB')
        sns.despine()

        return None

    def plot_winner_payment_distribution(self):
        # take N and R from the stored data rather than from global variables
        N, R = self.value_mat.shape

        self.b = self.b_star_num(self.value_mat)

        idx = np.argsort(self.value_mat, axis=0)
        # same as np.sort(v, axis=0), except now we retain the idx
        self.v = np.take_along_axis(self.value_mat, idx, axis=0)
        self.b = np.take_along_axis(self.b, idx, axis=0)

        self.ii = np.repeat(np.arange(1, N+1)[:, None], R, axis=1)
        self.ii = np.take_along_axis(self.ii, idx, axis=0)

        winning_player = self.ii[-1, :]

        winner_pays_fpsb = self.b[-1, :]  # highest bid
        winner_pays_spsb = self.v[-2, :]  # 2nd-highest valuation

        fig, ax = plt.subplots(figsize=(6, 4))

        for payment, label in zip([winner_pays_fpsb, winner_pays_spsb], ['FPSB', 'SPSB']):
            print('The average payment of %s: %.4f. Std.: %.4f. Median: %.4f'
                  % (label, payment.mean(), payment.std(), np.median(payment)))
            ax.hist(payment, density=True, alpha=0.6, label=label, bins=100)

        ax.legend(loc='best')
        sns.despine()

        return None
v = np.random.chisquare(df=2, size=(N,R))
chi_squ_case = bid_price_solution(v)
chi_squ_case.plot_value_distribution()

chi_squ_case.compute_optimal_bid_FPSB()
/tmp/ipykernel_9758/919518230.py:37: RuntimeWarning: Mean of empty slice.

/opt/conda/envs/quantecon/lib/python3.11/site-packages/numpy/core/_methods.py:129:␣
↪RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)

chi_squ_case.plot_winner_payment_distribution()


73.13 References
1. Wikipedia for FPSB: https://en.wikipedia.org/wiki/First-price_sealed-bid_auction
2. Wikipedia for SPSB: https://en.wikipedia.org/wiki/Vickrey_auction
3. Chandra Chekuri’s lecture note for algorithmic game theory: http://chekuri.cs.illinois.edu/teaching/spring2008/Lectures/scribed/Notes20.pdf
4. Tim Salmon. ECO 4400 Supplemental Handout: All About Auctions: https://s2.smu.edu/tsalmon/auctions.pdf
5. Auction Theory- Revenue Equivalence Theorem: https://michaellevet.wordpress.com/2015/07/06/auction-theory-revenue-equivalence-theorem/
6. Order Statistics: https://online.stat.psu.edu/stat415/book/export/html/834


CHAPTER
SEVENTYFOUR
MULTIPLE GOOD ALLOCATION MECHANISMS
!pip install prettytable
74.1 Overview
This lecture describes two mechanisms for allocating 𝑛 private goods (“houses”) to 𝑚 people (“buyers”).
We assume that 𝑚 > 𝑛 so that there are more potential buyers than there are houses.
Prospective buyers regard the houses as substitutes.
Buyer 𝑗 attaches value 𝑣𝑖𝑗 to house 𝑖.
These values are private
• 𝑣𝑖𝑗 is known only to person 𝑗 unless person 𝑗 chooses to tell someone.
We require that a mechanism allocate at most one house to one prospective buyer.
We describe two distinct mechanisms
• A multiple rounds, ascending bid auction
• A special case of a Groves-Clarke [Groves, 1973], [Clarke, 1971] mechanism with a benevolent social planner
Note: In 1994, the multiple rounds, ascending bid auction was actually used by Stanford University to sell leases to 9
lots on the Stanford campus to eligible faculty members.
We begin with overviews of the two mechanisms.
74.2 Ascending Bids Auction for Multiple Goods
An auction is administered by an auctioneer

The auctioneer has an 𝑛 × 1 vector 𝑟 of reservation prices on the 𝑛 houses.
The auctioneer sells house 𝑖 only if the final price bid for it exceeds 𝑟𝑖
The auctioneer allocates all 𝑛 houses simultaneously
The auctioneer does not know bidders’ private values 𝑣𝑖𝑗
There are multiple rounds
• during each round, active participants can submit bids on any of the 𝑛 houses
• each bidder can bid on only one house during one round
• a person who was high bidder on a particular house in one round is understood to submit that same bid for the same
house in the next round
• between rounds, a bidder who was not a high bidder can change the house on which he/she chooses to bid
• the auction ends when the price of no house changes from one round to the next
• all 𝑛 houses are allocated after the final round
• house 𝑖 is retained by the auctioneer if no prospective buyer offers more than 𝑟𝑖 for the house
In this auction, person 𝑗 never tells anyone else his/her private values 𝑣𝑖𝑗
74.3 A Benevolent Planner
This mechanism is designed so that all prospective buyers voluntarily choose to reveal their private values to a social
planner who uses them to construct a socially optimal allocation.
Among all feasible allocations, a socially optimal allocation maximizes the sum of private values across all prospective
buyers.
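To make the planner's objective concrete, here is a minimal brute-force sketch. It is not the sequential algorithm that the planner uses; it simply enumerates every feasible allocation for a small value matrix and keeps the one with the largest sum of private values. The function name `brute_force_optimum` is hypothetical, the value matrix is the one used in this lecture's worked example, and the sketch assumes 𝑚 ≥ 𝑛 and non-negative values, so that allocating every house is weakly optimal.

```python
from itertools import permutations

import numpy as np

# Rows are houses, columns are buyers (matrix from the lecture's example).
v = np.array([[8, 5, 9, 4],
              [4, 11, 7, 4],
              [9, 7, 6, 4]])


def brute_force_optimum(v):
    """Return (best_total, best_assignment), where best_assignment[i] is the
    buyer who receives house i. Hypothetical helper; assumes m >= n."""
    n, m = v.shape
    best_total, best_assignment = -np.inf, None
    # Each ordered selection of n distinct buyers is one feasible allocation.
    for buyers in permutations(range(m), n):
        total = sum(v[i, j] for i, j in enumerate(buyers))
        if total > best_total:
            best_total, best_assignment = total, buyers
    return best_total, best_assignment


best_total, best_assignment = brute_force_optimum(v)
print(best_total, best_assignment)
```

Enumeration is only practical for small 𝑛 and 𝑚, since the number of candidate allocations grows factorially.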
The planner tells everyone in advance how he/she will allocate houses based on the matrix of values that prospective
buyers report.
The mechanism provides every prospective buyer an incentive to reveal his/her vector of private values to the planner.
After the planner receives everyone’s vector of private values, the planner deploys a sequential algorithm to determine
an allocation of houses and a set of fees that he/she charges awardees for the negative externality that their presence imposes
on other prospective buyers.
74.4 Equivalence of Allocations
Remarkably, these two mechanisms can produce virtually identical allocations.

We construct Python code for both mechanisms.
We also work out some examples by hand or almost by hand.
Next, let’s dive down into the details.
74.5 Ascending Bid Auction
74.5.1 Basic Setting
We start with a more detailed description of the setting.

• A seller owns 𝑛 houses that he wants to sell for the maximum possible amounts to a set of 𝑚 prospective eligible
buyers.
• The seller wants to sell at most one house to each potential buyer.
• There are 𝑚 potential eligible buyers, identified by 𝑗 = [1, 2, … , 𝑚]

– Each potential buyer is permitted to buy at most one house.

– Buyer 𝑗 would be willing to pay at most 𝑣𝑖𝑗 for house 𝑖.
– Buyer 𝑗 knows 𝑣𝑖𝑗 , 𝑖 = 1, … , 𝑛, but no one else does.
– If buyer 𝑗 pays 𝑝𝑖 for house 𝑖, he enjoys surplus value 𝑣𝑖𝑗 − 𝑝𝑖 .
– Each buyer 𝑗 wants to choose the 𝑖 that maximizes his/her surplus value 𝑣𝑖𝑗 − 𝑝𝑖 .
– The seller wants to maximize ∑𝑖 𝑝𝑖 .
The seller conducts a simultaneous, multiple goods, ascending bid auction.
Outcomes of the auction are
• An 𝑛 × 1 vector 𝑝 of sales prices 𝑝 = [𝑝1 , … , 𝑝𝑛 ] for the 𝑛 houses.
• An 𝑛 × 𝑚 matrix 𝑄 of 0’s and 1’s, where 𝑄𝑖𝑗 = 1 if and only if person 𝑗 bought house 𝑖.
• An 𝑛 × 𝑚 matrix 𝑆 of surplus values consisting of all zeros unless person 𝑗 bought house 𝑖, in which case 𝑆𝑖𝑗 =
𝑣𝑖𝑗 − 𝑝𝑖
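As a small illustration of these outcome objects, the sketch below builds 𝑄 and 𝑆 from a price vector and an allocation. The prices and the allocation dictionary (house → buyer) are illustrative inputs chosen for this sketch, not something the auction has produced at this point in the lecture.

```python
import numpy as np

v = np.array([[8, 5, 9, 4],
              [4, 11, 7, 4],
              [9, 7, 6, 4]])       # value matrix: rows are houses, columns are buyers
p = np.array([3, 4, 3])            # illustrative sales prices
allocation = {0: 2, 1: 1, 2: 0}    # illustrative allocation: house i -> buyer j

n, m = v.shape
Q = np.zeros((n, m), dtype=int)    # Q[i, j] = 1 iff buyer j bought house i
S = np.zeros((n, m))               # S[i, j] = v[i, j] - p[i] for the buyer of house i
for i, j in allocation.items():
    Q[i, j] = 1
    S[i, j] = v[i, j] - p[i]

print(Q)
print(S)
```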
We describe rules for the auction in terms of pseudocode.
The pseudocode will provide a road map for writing Python code to implement the auction.
74.6 Pseudocode
Here is a quick sketch of a possible simple structure for our Python code
Inputs:
• 𝑛, 𝑚.
• an 𝑛 × 𝑚 non-negative matrix 𝑣 of private values
• an 𝑛 × 1 vector 𝑟 of seller-specified reservation prices
• the seller will not accept a price less than 𝑟𝑖 for house 𝑖
• we are free to think of these reservation prices as private values of a fictitious 𝑚 + 1 th buyer who does not actually
participate in the auction
• initial bids can be thought of starting at 𝑟
• a scalar 𝜖 of seller-specified minimum price-bid increments
For each round of the auction, new bids on a house must be at least the prevailing highest bid so far plus 𝜖
Auction Protocols
• the auction consists of a finite number of rounds
• in each round, a prospective buyer can bid on one and only one house
• after each round, a house is temporarily awarded to the person who made the highest bid for that house
– temporarily winning bids on each house are announced
– this sets the stage to move on to the next round
• a new round is held
– bids for temporary winners from the previous round are again attached to the houses on which they bid;
temporary winners of the last round leave their bids from the previous round unchanged

– all other active prospective buyers must submit a new bid on some house
– new bids on a house must be at least equal to the prevailing temporary price that won the last round plus 𝜖
– if a person does not submit a new bid and was also not a temporary winner from the previous round, that
person must drop out of the auction permanently
– for each house, the highest bid, whether it is a new bid or was the temporary winner from the previous round,
is announced, with the person who made that new (temporarily) winning bid being (temporarily) awarded the
house to start the next round
• rounds continue until no price on any house changes from the previous round
• houses are sold to the winning bidders from the final round at the prices that they bid
Outputs:
• an 𝑛 × 1 vector 𝑝 of sales prices
• an 𝑛 × 𝑚 matrix 𝑆 of surplus values consisting of all zeros unless person 𝑗 bought house 𝑖, in which case 𝑆𝑖𝑗 =
𝑣𝑖𝑗 − 𝑝𝑖
• an 𝑛 × (𝑚 + 1) matrix 𝑄 of 0’s and 1’s that tells which buyer bought which house. (The last column accounts for
unsold houses.)
Proposed buyer strategy:
In this pseudo code and the actual Python code below, we’ll assume that all buyers choose to use the following strategy
• The strategy is optimal for each buyer
Each buyer 𝑗 = 1, … , 𝑚 uses the same strategy.
The strategy has the form:
• Let 𝑝̌𝑡 be the 𝑛 × 1 vector of prevailing highest-bid prices at the beginning of round 𝑡
• Let 𝜖 > 0 be the minimum bid increment specified by the seller
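• For each prospective buyer 𝑗, compute the index of the best house to bid on during round 𝑡, namely $\hat{i}_t = \operatorname{argmax}_i \{ v_{ij} - \check{p}^t_i - \epsilon \}$
• If $\max_i \{ v_{ij} - \check{p}^t_i - \epsilon \} \leq 0$, person 𝑗 permanently drops out of the auction at round 𝑡
• If $v_{\hat{i}_t, j} - \check{p}^t_{\hat{i}_t} - \epsilon > 0$, person 𝑗 bids $\check{p}^t_{\hat{i}_t} + \epsilon$ on house $\hat{i}_t$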
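The strategy just described can be sketched for a single buyer as follows. The function name `choose_bid` is hypothetical, and for simplicity the sketch breaks ties by taking the first maximizer rather than randomizing.

```python
import numpy as np


def choose_bid(v_j, p_check, eps):
    """Return (house, bid) for one buyer, or None if the buyer drops out.

    v_j     : length-n array of the buyer's private values
    p_check : length-n array of prevailing highest bids
    eps     : minimum bid increment
    Ties are broken by taking the first maximizer (a simplification).
    """
    surplus = v_j - p_check - eps
    i_hat = int(np.argmax(surplus))
    if surplus[i_hat] <= 0:
        return None                       # no house offers positive surplus
    return i_hat, p_check[i_hat] + eps    # bid the minimum admissible price


v_3 = np.array([4, 4, 4])                 # a buyer who values every house at 4
bid = choose_bid(v_3, np.array([3, 2, 3]), 1)
no_bid = choose_bid(v_3, np.array([3, 4, 3]), 1)
print(bid, no_bid)
```

In the first call only the second house (index 1) leaves positive surplus after the increment, so the buyer bids there; in the second call no house does, so the buyer drops out.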
Resolving ambiguities: The protocols we have described so far leave open two possible sources of ambiguity.
(1) The optimal bid choice for buyers in each round. It is possible that a buyer has the same surplus value for multiple
houses. NumPy’s np.argmax function always returns the first maximizing index. We instead prefer to randomize among
such tied houses. For that reason, we write our own argmax function below.
(2) Seller’s choice of winner if same price bid cast by several buyers. To resolve this ambiguity, we use the
np.random.choice function below.
Given the randomness in outcomes, it is possible that different allocations of houses could emerge from the same inputs.
However, this will happen only when the bid price increment 𝜖 is nonnegligible.
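The difference between taking the first maximizer and randomizing among ties can be seen in a short sketch. The surplus vector is hypothetical, and the sketch uses a local `np.random.default_rng` generator with an arbitrary seed, whereas the lecture's own code uses `np.random.choice`.

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator; the seed is arbitrary

surplus = np.array([3, 3, 1])    # hypothetical surplus values with a tie

first = np.argmax(surplus)                        # np.argmax returns the first maximizer
ties = np.flatnonzero(surplus == surplus.max())   # indices of all maximizers
random_pick = rng.choice(ties)                    # one of the tied indices, chosen uniformly

print(first, ties, random_pick)
```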
import numpy as np
import prettytable as pt  # used below to print tables

np.random.seed(100)

74.7 An Example
Before building a Python class, let’s solve an example step by step, almost “by hand”, to grasp how the auction proceeds.
A step-by-step procedure also helps reduce bugs, especially when the value matrix is peculiar (e.g. the differences between
values are negligible, a column contains identical values, or multiple buyers have the same valuation).
Fortunately, our auction behaves well and robustly with various peculiar matrices.
We provide some examples later in this lecture.
v = np.array([[8, 5, 9, 4],
              [4, 11, 7, 4],
              [9, 7, 6, 4]])
n, m = v.shape
r = np.array([2, 1, 0])
ϵ = 1
p = r.copy()
buyer_list = np.arange(m)
house_list = np.arange(n)
array([[ 8, 5, 9, 4],
[ 4, 11, 7, 4],
[ 9, 7, 6, 4]])
Remember that column indexes 𝑗 indicate buyers and row indexes 𝑖 indicate houses.
The above value matrix 𝑣 is peculiar in the sense that Buyer 3 (indexed from 0) puts the same value 4 on every house
being sold.
Maybe buyer 3 is a bureaucrat who purchases these houses simply by following instructions from his superior.
array([2, 1, 0])
def find_argmax_with_randomness(v):
    """
    Our own version of the argmax function, which returns an index chosen
    at random when there are multiple maximum values. This function is
    similar to np.argmax(v, axis=0).

    Parameters:
    ----------
    v: 2 dimensional np.array
    """
    n, m = v.shape
    index_array = np.arange(n)
    result = []

    for ii in range(m):
        max_value = v[:, ii].max()
        result.append(np.random.choice(index_array[v[:, ii] == max_value]))

    return np.array(result)
def present_dict(dt):
    """
    A function that presents the information in a table.

    Parameters:
    ----------
    dt: dictionary.
    """
    ymtb = pt.PrettyTable()  # pt is the prettytable module (import prettytable as pt)
    ymtb.field_names = ['House Number', *dt.keys()]
    ymtb.add_row(['Buyer', *dt.values()])
    print(ymtb)
Check Kick Off Condition
def check_kick_off_condition(v, r, ϵ):
    """
    A function that checks whether the auction can be initiated given the
    reservation prices and the value matrix.
    It guards against reservation prices that are so high that no one
    would bid even in the first round.

    Parameters:
    ----------
    v: value matrix of shape (n, m).
    r: the reservation prices
    ϵ: the minimum price increment in each round
    """
    # we convert the price vector to a matrix in the same shape as the value
    # matrix to facilitate subtraction
    p_start = (ϵ + r)[:, None] @ np.ones(m)[None, :]
    surplus_value = v - p_start
    buyer_decision = (surplus_value > 0).any(axis=0)
    return buyer_decision.any()
check_kick_off_condition(v, r, ϵ)
True
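For contrast, here is a minimal sketch of a case in which the kick-off check fails. The helper `can_start` is hypothetical: it performs the same test with NumPy broadcasting instead of an explicit matrix product, and the high reservation prices are made up for illustration.

```python
import numpy as np


def can_start(v, r, eps):
    """True if at least one buyer has positive surplus on some house
    when bidding r + eps (hypothetical helper using broadcasting)."""
    surplus = v - (r + eps)[:, None]   # shape (n, m)
    return bool((surplus > 0).any())


v = np.array([[8, 5, 9, 4],
              [4, 11, 7, 4],
              [9, 7, 6, 4]])

ok = can_start(v, np.array([2, 1, 0]), 1)           # reservation prices from the example
blocked = can_start(v, np.array([10, 12, 10]), 1)   # hypothetical high reservation prices
print(ok, blocked)
```

With the high reservation prices, every opening bid 𝑟𝑖 + 𝜖 exceeds the largest value any buyer places on house 𝑖, so no one bids.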

74.7.1 round 1
submit bid
def submit_initial_bid(p_initial, ϵ, v):
    """
    A function that describes the bid information in the first round.

    Parameters:
    ----------
    p_initial: the price (or the reservation prices) at the beginning of auction.
    v: the value matrix
    ϵ: the minimum price increment in each round

    Returns:
    ----------
    p: price array after this round of bidding
    bid_info: a dictionary that contains bidding information
              (house number as keys and buyer as values).
    """
    p = p_initial.copy()
    p_start_mat = (ϵ + p)[:, None] @ np.ones(m)[None, :]
    surplus_value = v - p_start_mat

    # we only care about active buyers who have positive surplus values
    active_buyer_diagnosis = (surplus_value > 0).any(axis=0)
    active_buyer_list = buyer_list[active_buyer_diagnosis]
    active_buyer_surplus_value = surplus_value[:, active_buyer_diagnosis]
    active_buyer_choice = find_argmax_with_randomness(active_buyer_surplus_value)
    # choice means the favourite houses given the current price and ϵ

    # we only retain the unique house index because prices increase once at one round
    house_bid = list(set(active_buyer_choice))
    p[house_bid] += ϵ

    bid_info = {}
    for house_num in house_bid:
        bid_info[house_num] = active_buyer_list[active_buyer_choice == house_num]

    return p, bid_info
p, bid_info = submit_initial_bid(p, ϵ, v)
array([3, 2, 1])
present_dict(bid_info)

+--------------+-----+-----+-------+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-------+
| Buyer | [2] | [1] | [0 3] |
+--------------+-----+-----+-------+
check terminal condition

Notice that two buyers bid for house 2 (indexed from 0).
Because the auction protocol does not specify a selection rule in this case, we simply select a winner randomly.
This is reasonable because the seller can’t distinguish these buyers and doesn’t know the valuation of each buyer.
It is both convenient and practical for him to just pick a winner randomly.
There is a 50% probability that Buyer 3 is chosen as the winner for house 2, although he values it less than buyer 0.
In this case, buyer 0 has to bid one more time with a higher price, which crowds out Buyer 3.
Therefore, the final price could be 3 or 4, depending on the winner in the last round.
def check_terminal_condition(bid_info, p, v):
    """
    A function that checks whether the auction ends.
    Recall that the auction ends when either losers have non-positive
    surplus values for each house or there is no loser (every buyer gets a house).

    Parameters:
    ----------
    bid_info: a dictionary that contains bidding information of house numbers
              (as keys) and buyers (as values).
    p: np.array. price array of houses
    v: value matrix

    Returns:
    ----------
    allocation: a dictionary that describes how the houses bid on are assigned.
    winner_list: a list of winners
    loser_list: a list of losers
    """
    # there may be several buyers bidding on one house, we choose a winner randomly
    winner_list = [np.random.choice(bid_info[ii]) for ii in bid_info.keys()]
    allocation = {house_num: winner
                  for house_num, winner in zip(bid_info.keys(), winner_list)}

    loser_set = set(buyer_list).difference(set(winner_list))
    loser_list = list(loser_set)
    loser_num = len(loser_list)

    if loser_num == 0:
        print('The auction ends because every buyer gets one house.')
        return allocation, winner_list, loser_list

    p_mat = (ϵ + p)[:, None] @ np.ones(loser_num)[None, :]
    loser_surplus_value = v[:, loser_list] - p_mat
    loser_decision = (loser_surplus_value > 0).any(axis=0)

    print(~(loser_decision.any()))

    return allocation, winner_list, loser_list
allocation,winner_list,loser_list = check_terminal_condition(bid_info, p, v)
False
present_dict(allocation)
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Buyer | 2 | 1 | 0 |
+--------------+---+---+---+
winner_list
[2, 1, 0]
loser_list
[3]
74.7.2 round 2
From the second round on, the auction proceeds differently from the first round.
Now only active losers (those who have positive surplus values) have an incentive to submit bids to displace temporary
winners from the previous round.
def submit_bid(loser_list, p, ϵ, v, bid_info):
    """
    A function that executes the bid operation after the first round.
    After the first round, only active losers cast a new bid, at the old
    price plus the increment.
    Through such bids, winners of the last round are replaced by the active losers.

    Parameters:
    ----------
    loser_list: a list that includes the indexes of losers
    p: np.array. price array of houses
    ϵ: minimum increment of bid price
    v: value matrix
    bid_info: a dictionary that contains bidding information of house numbers
              (as keys) and buyers (as values).

    Returns:
    ----------
    p_end: a price array after this round of bidding
    bid_info: a dictionary that contains updated bidding information.
    """
    p_end = p.copy()

    loser_num = len(loser_list)
    p_mat = (ϵ + p_end)[:, None] @ np.ones(loser_num)[None, :]
    loser_surplus_value = v[:, loser_list] - p_mat
    loser_decision = (loser_surplus_value > 0).any(axis=0)

    active_loser_list = np.array(loser_list)[loser_decision]
    active_loser_surplus_value = loser_surplus_value[:, loser_decision]
    active_loser_choice = find_argmax_with_randomness(active_loser_surplus_value)

    # we retain the unique house indexes and increase the corresponding bid prices
    house_bid = list(set(active_loser_choice))
    p_end[house_bid] += ϵ

    # we record the bidding information from active losers
    bid_info_active_loser = {}
    for house_num in house_bid:
        bid_info_active_loser[house_num] = active_loser_list[active_loser_choice == house_num]

    # we update the bidding information according to the bids from active losers
    for house_num in bid_info_active_loser.keys():
        bid_info[house_num] = bid_info_active_loser[house_num]

    return p_end, bid_info
p,bid_info = submit_bid(loser_list, p, ϵ, v, bid_info)
array([3, 2, 2])

+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [3] |
+--------------+-----+-----+-----+
False
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Buyer | 2 | 1 | 3 |
+--------------+---+---+---+
74.7.3 round 3
array([3, 2, 3])
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
False
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Buyer | 2 | 1 | 0 |
+--------------+---+---+---+

74.7.4 round 4
array([3, 3, 3])
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [3] | [0] |
+--------------+-----+-----+-----+
Notice that Buyer 3 now switches to bid for house 1, having recognized that house 2 is no longer his best option.
False
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Buyer | 2 | 3 | 0 |
+--------------+---+---+---+
74.7.5 round 5
array([3, 4, 3])
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+

Now Buyer 1 bids for house 1 again with price at 4, which crowds out Buyer 3, marking the end of the auction.
True
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Buyer | 2 | 1 | 0 |
+--------------+---+---+---+
# as for the houses unsold
house_unsold_list = list(set(house_list).difference(set(allocation.keys())))
house_unsold_list
[]
total_revenue = p[list(allocation.keys())].sum()
total_revenue
10
74.8 A Python Class
Above we simulated an ascending bid auction step by step.

When defining functions, we repeatedly computed some intermediate objects because a Python function loses track of
its local variables once it has finished executing.
That of course led to redundancy in our code.
It is much more efficient to collect all of the aforementioned code into a class that records information about all rounds.
class ascending_bid_auction:

    def __init__(self, v, r, ϵ):
        """
        A class that simulates an ascending bid auction for houses.

        Given buyers' value matrix, sellers' reservation prices and minimum
        increment of bid prices, this class can execute an ascending bid
        auction and present information round by round until the end.

        Tables are printed with prettytable (import prettytable as pt).

        Parameters:
        ----------
        v: 2 dimensional value matrix
        r: np.array of reservation prices
        ϵ: minimum increment of bid price
        """
        self.v = v.copy()
        self.n, self.m = self.v.shape
        self.r = r
        self.ϵ = ϵ
        self.p = r.copy()
        self.buyer_list = np.arange(self.m)
        self.house_list = np.arange(self.n)
        self.bid_info_history = []
        self.allocation_history = []
        self.winner_history = []
        self.loser_history = []

    def find_argmax_with_randomness(self, v):
        n, m = v.shape
        index_array = np.arange(n)
        result = []

        for ii in range(m):
            max_value = v[:, ii].max()
            result.append(np.random.choice(index_array[v[:, ii] == max_value]))

        return np.array(result)

    def check_kick_off_condition(self):
        # we convert the price vector to a matrix in the same shape as the
        # value matrix to facilitate subtraction
        p_start = (self.ϵ + self.r)[:, None] @ np.ones(self.m)[None, :]
        self.surplus_value = self.v - p_start
        buyer_decision = (self.surplus_value > 0).any(axis=0)
        return buyer_decision.any()

    def submit_initial_bid(self):
        # we intend to find the optimal choice of each buyer
        p_start_mat = (self.ϵ + self.p)[:, None] @ np.ones(self.m)[None, :]
        self.surplus_value = self.v - p_start_mat

        # we only care about active buyers who have positive surplus values
        active_buyer_diagnosis = (self.surplus_value > 0).any(axis=0)
        active_buyer_list = self.buyer_list[active_buyer_diagnosis]
        active_buyer_surplus_value = self.surplus_value[:, active_buyer_diagnosis]
        active_buyer_choice = self.find_argmax_with_randomness(active_buyer_surplus_value)

        # we only retain the unique house index because prices increase once at one round
        house_bid = list(set(active_buyer_choice))
        self.p[house_bid] += self.ϵ

        bid_info = {}
        for house_num in house_bid:
            bid_info[house_num] = active_buyer_list[active_buyer_choice == house_num]
        self.bid_info_history.append(bid_info)

        print('The bid information is')
        ymtb = pt.PrettyTable()
        ymtb.field_names = ['House Number', *bid_info.keys()]
        ymtb.add_row(['Buyer', *bid_info.values()])
        print(ymtb)

        print('The bid prices for houses are')
        ymtb = pt.PrettyTable()
        ymtb.field_names = ['House Number', *self.house_list]
        ymtb.add_row(['Price', *self.p])
        print(ymtb)

        self.winner_list = [np.random.choice(bid_info[ii]) for ii in bid_info.keys()]
        self.winner_history.append(self.winner_list)

        self.allocation = {house_num: [winner]
                           for house_num, winner in zip(bid_info.keys(), self.winner_list)}
        self.allocation_history.append(self.allocation)

        loser_set = set(self.buyer_list).difference(set(self.winner_list))
        self.loser_list = list(loser_set)
        self.loser_history.append(self.loser_list)

        print('The winners are')
        print(self.winner_list)

        print('The losers are')
        print(self.loser_list)
        print('\n')

    def check_terminal_condition(self):
        loser_num = len(self.loser_list)

        if loser_num == 0:
            print('The auction ends because every buyer gets one house.')
            print('\n')
            return True

        p_mat = (self.ϵ + self.p)[:, None] @ np.ones(loser_num)[None, :]
        self.loser_surplus_value = self.v[:, self.loser_list] - p_mat
        self.loser_decision = (self.loser_surplus_value > 0).any(axis=0)

        return ~(self.loser_decision.any())

    def submit_bid(self):
        bid_info = self.allocation_history[-1].copy()  # we only record the bid info of winners

        loser_num = len(self.loser_list)
        p_mat = (self.ϵ + self.p)[:, None] @ np.ones(loser_num)[None, :]
        self.loser_surplus_value = self.v[:, self.loser_list] - p_mat
        self.loser_decision = (self.loser_surplus_value > 0).any(axis=0)

        active_loser_list = np.array(self.loser_list)[self.loser_decision]
        active_loser_surplus_value = self.loser_surplus_value[:, self.loser_decision]
        active_loser_choice = self.find_argmax_with_randomness(active_loser_surplus_value)

        # we retain the unique house indexes and increase the corresponding bid prices
        house_bid = list(set(active_loser_choice))
        self.p[house_bid] += self.ϵ

        # we record the bidding information from active losers
        bid_info_active_loser = {}
        for house_num in house_bid:
            bid_info_active_loser[house_num] = active_loser_list[active_loser_choice == house_num]

        # we update the bidding information according to the bids from active losers
        for house_num in bid_info_active_loser.keys():
            bid_info[house_num] = bid_info_active_loser[house_num]
        self.bid_info_history.append(bid_info)

        print('The bid information is')
        ymtb = pt.PrettyTable()
        ymtb.field_names = ['House Number', *bid_info.keys()]
        ymtb.add_row(['Buyer', *bid_info.values()])
        print(ymtb)

        print('The bid prices for houses are')
        ymtb = pt.PrettyTable()
        ymtb.field_names = ['House Number', *self.house_list]
        ymtb.add_row(['Price', *self.p])
        print(ymtb)

        self.winner_list = [np.random.choice(bid_info[ii]) for ii in bid_info.keys()]
        self.winner_history.append(self.winner_list)

        self.allocation = {house_num: [winner]
                           for house_num, winner in zip(bid_info.keys(), self.winner_list)}
        self.allocation_history.append(self.allocation)

        loser_set = set(self.buyer_list).difference(set(self.winner_list))
        self.loser_list = list(loser_set)
        self.loser_history.append(self.loser_list)

        print('The winners are')
        print(self.winner_list)

        print('The losers are')
        print(self.loser_list)
        print('\n')

    def start_auction(self):
        print('The Ascending Bid Auction for Houses')
        print('\n')

        print('Basic Information: %d houses, %d buyers' % (self.n, self.m))

        print('The valuation matrix is as follows')
        ymtb = pt.PrettyTable()
        ymtb.field_names = ['Buyer Number', *(np.arange(self.m))]
        for ii in range(self.n):
            ymtb.add_row(['House %d' % (ii), *self.v[ii, :]])
        print(ymtb)

        print('The reservation prices for houses are')
        ymtb = pt.PrettyTable()
        ymtb.field_names = ['House Number', *self.house_list]
        ymtb.add_row(['Price', *self.r])
        print(ymtb)

        print('The minimum increment of bid price is %.2f' % self.ϵ)
        print('\n')

        ctr = 1
        if self.check_kick_off_condition():
            print('Auction starts successfully')
            print('\n')
            print('Round %d' % ctr)

            self.submit_initial_bid()

            while True:
                if self.check_terminal_condition():
                    print('Auction ends')
                    print('\n')

                    print('The final result is as follows')
                    print('\n')
                    print('The allocation plan is')
                    ymtb = pt.PrettyTable()
                    ymtb.field_names = ['House Number', *self.allocation.keys()]
                    ymtb.add_row(['Buyer', *self.allocation.values()])
                    print(ymtb)

                    print('The bid prices for houses are')
                    ymtb = pt.PrettyTable()
                    ymtb.field_names = ['House Number', *self.house_list]
                    ymtb.add_row(['Price', *self.p])
                    print(ymtb)

                    print('The winners are')
                    print(self.winner_list)

                    print('The losers are')
                    print(self.loser_list)

                    self.house_unsold_list = list(
                        set(self.house_list).difference(set(self.allocation.keys())))
                    print('The houses unsold are')
                    print(self.house_unsold_list)

                    self.total_revenue = self.p[list(self.allocation.keys())].sum()
                    print('The total revenue is %.2f' % self.total_revenue)

                    break

                ctr += 1
                print('Round %d' % ctr)
                self.submit_bid()

            # we compute the surplus matrix S and the quantity matrix Q described above
            self.S = np.zeros((self.n, self.m))
            for ii, jj in zip(self.allocation.keys(), self.allocation.values()):
                self.S[ii, jj] = self.v[ii, jj] - self.p[ii]

            self.Q = np.zeros((self.n, self.m + 1))  # the last column records the houses unsold
            for ii, jj in zip(self.allocation.keys(), self.allocation.values()):
                self.Q[ii, jj] = 1
            for ii in self.house_unsold_list:
                self.Q[ii, -1] = 1

            # we sort the allocation result by the house number
            house_sold_list = list(self.allocation.keys())
            house_sold_list.sort()

            dict_temp = {}
            for ii in house_sold_list:
                dict_temp[ii] = self.allocation[ii]
            self.allocation = dict_temp

        else:
            print('The auction can not start because of high reservation prices')
Let’s use our class to conduct the auction described in one of the above examples.
v = np.array([[8,5,9,4],[4,11,7,4],[9,7,6,4]])
r = np.array([2,1,0])
ϵ = 1
auction_1 = ascending_bid_auction(v, r, ϵ)
auction_1.start_auction()
The Ascending Bid Auction for Houses
Basic Information: 3 houses, 4 buyers

The valuation matrix is as follows
+--------------+---+----+---+---+
| Buyer Number | 0 | 1 | 2 | 3 |
+--------------+---+----+---+---+
| House 0 | 8 | 5 | 9 | 4 |
| House 1 | 4 | 11 | 7 | 4 |
| House 2 | 9 | 7 | 6 | 4 |
+--------------+---+----+---+---+
The reservation prices for houses are
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 2 | 1 | 0 |
+--------------+---+---+---+
The minimum increment of bid price is 1.00
Auction starts successfully
Round 1
The bid information is
+--------------+-----+-----+-------+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-------+
| Buyer | [2] | [1] | [0 3] |
+--------------+-----+-----+-------+
The bid prices for houses are
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 2 | 1 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[3]
Round 2
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [3] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 2 | 2 |
+--------------+---+---+---+
The winners are
[2, 1, 3]
The losers are
[0]


Round 3
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 2 | 3 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[3]
Round 4
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [3] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 3 | 3 |
+--------------+---+---+---+
The winners are
[2, 3, 0]
The losers are
[1]
Round 5
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 4 | 3 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[3]

Auction ends
The final result is as follows
The allocation plan is

+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 4 | 3 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[3]
The houses unsold are
[]
The total revenue is 10.00
# the surplus matrix S
auction_1.S
array([[0., 0., 6., 0.],
       [0., 7., 0., 0.],
       [6., 0., 0., 0.]])
# the quantity matrix Q
auction_1.Q
array([[0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.]])
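One quick sanity check on the quantity matrix is that it respects the rules of the mechanism. The sketch below copies the matrix reported above and verifies that each house is either sold to exactly one buyer or flagged as unsold, and that no buyer receives more than one house. The variable names are illustrative.

```python
import numpy as np

# Quantity matrix reported above; the last column flags unsold houses.
Q = np.array([[0., 0., 1., 0., 0.],
              [0., 1., 0., 0., 0.],
              [1., 0., 0., 0., 0.]])

# every house is sold to exactly one buyer or recorded as unsold
each_house_once = bool(np.all(Q.sum(axis=1) == 1))
# no buyer column (all but the last) contains more than one house
one_house_per_buyer = bool(np.all(Q[:, :-1].sum(axis=0) <= 1))
# buyer index for each house; meaningful here because no house is unsold
buyers = Q[:, :-1].argmax(axis=1)

print(each_house_once, one_house_per_buyer, buyers)
```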

74.9 Robustness Checks
Let’s do some stress testing of our code by applying it to auctions characterized by different matrices of private values.
1. number of houses = number of buyers
v2 = np.array([[8,5,9],[4,11,7],[9,7,6]])
auction_2 = ascending_bid_auction(v2, r, ϵ)
auction_2.start_auction()
+--------------+---+----+---+
| Buyer Number | 0 | 1 | 2 |
+--------------+---+----+---+
| House 0 | 8 | 5 | 9 |
| House 1 | 4 | 11 | 7 |
| House 2 | 9 | 7 | 6 |
+--------------+---+----+---+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 2 | 1 | 0 |
+--------------+---+---+---+
Round 1
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 2 | 1 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[]
The auction ends because every buyer gets one house.


Auction ends

+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 2 | 1 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[]
[]
2. multiple excess buyers
v3 = np.array([[8,5,9,4,3],[4,11,7,4,6],[9,7,6,4,2]])
auction_3 = ascending_bid_auction(v3, r, ϵ)
auction_3.start_auction()
+--------------+---+----+---+---+---+
| Buyer Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+----+---+---+---+
| House 0 | 8 | 5 | 9 | 4 | 3 |
| House 1 | 4 | 11 | 7 | 4 | 6 |
| House 2 | 9 | 7 | 6 | 4 | 2 |
+--------------+---+----+---+---+---+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 2 | 1 | 0 |
+--------------+---+---+---+

Round 1
+--------------+-----+-------+-------+
| House Number | 0 | 1 | 2 |
+--------------+-----+-------+-------+
| Buyer | [2] | [1 4] | [0 3] |
+--------------+-----+-------+-------+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 2 | 1 |
+--------------+---+---+---+
The winners are
[2, 4, 3]
The losers are
[0, 1]
Round 2
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 3 | 2 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[3, 4]
Round 3
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [4] | [3] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 4 | 3 |
+--------------+---+---+---+


The winners are
[2, 4, 3]
The losers are
[0, 1]
Round 4
+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 5 | 4 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[3, 4]
Auction ends

+--------------+-----+-----+-----+
| House Number | 0 | 1 | 2 |
+--------------+-----+-----+-----+
| Buyer | [2] | [1] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+
| House Number | 0 | 1 | 2 |
+--------------+---+---+---+
| Price | 3 | 5 | 4 |
+--------------+---+---+---+
The winners are
[2, 1, 0]
The losers are
[3, 4]
[]
3. more houses than buyers
v4 = np.array([[8,5,4],[4,11,7],[9,7,9],[6,4,5],[2,2,2]])
r2 = np.array([2,1,0,1,1])


auction_4 = ascending_bid_auction(v4, r2, ϵ)
auction_4.start_auction()
+--------------+---+----+---+
| Buyer Number | 0 | 1 | 2 |
+--------------+---+----+---+
| House 0 | 8 | 5 | 4 |
| House 1 | 4 | 11 | 7 |
| House 2 | 9 | 7 | 9 |
| House 3 | 6 | 4 | 5 |
| House 4 | 2 | 2 | 2 |
+--------------+---+----+---+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 2 | 1 | 0 | 1 | 1 |
+--------------+---+---+---+---+---+
Round 1
+--------------+-----+-------+
| House Number | 1 | 2 |
+--------------+-----+-------+
| Buyer | [1] | [0 2] |
+--------------+-----+-------+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 2 | 2 | 1 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[1, 2]
The losers are
[0]
Round 2
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [0] |


+--------------+-----+-----+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 2 | 2 | 2 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[1, 0]
The losers are
[2]
Round 3
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [2] |
+--------------+-----+-----+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 2 | 2 | 3 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[1, 2]
The losers are
[0]
Round 4
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [0] |
+--------------+-----+-----+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 2 | 2 | 4 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[1, 0]
The losers are
[2]
Round 5
+--------------+-----+-----+
+--------------+-----+-----+


| Buyer | [2] | [0] |
+--------------+-----+-----+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 2 | 3 | 4 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[2, 0]
The losers are
[1]
Round 6
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [0] |
+--------------+-----+-----+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 2 | 4 | 4 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[1, 0]
The losers are
[2]
Round 7
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [2] |
+--------------+-----+-----+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 2 | 4 | 5 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[1, 2]
The losers are
[0]
Round 8
+--------------+-----+-----+-----+
| House Number | 1 | 2 | 0 |


+--------------+-----+-----+-----+
| Buyer | [1] | [2] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 3 | 4 | 5 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[1, 2, 0]
The losers are
[]
Auction ends

+--------------+-----+-----+-----+
| House Number | 1 | 2 | 0 |
+--------------+-----+-----+-----+
| Buyer | [1] | [2] | [0] |
+--------------+-----+-----+-----+
+--------------+---+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+---+---+---+---+---+
| Price | 3 | 4 | 5 | 1 | 1 |
+--------------+---+---+---+---+---+
The winners are
[1, 2, 0]
The losers are
[]
[3, 4]
4. some houses have extremely high reservation prices
v5 = np.array([[8,5,4],[4,11,7],[9,7,9],[6,4,5],[2,2,2]])
r3 = np.array([10,1,0,1,1])
auction_5 = ascending_bid_auction(v5, r3, ϵ)


+--------------+---+----+---+
| Buyer Number | 0 | 1 | 2 |
+--------------+---+----+---+
| House 0 | 8 | 5 | 4 |
| House 1 | 4 | 11 | 7 |
| House 2 | 9 | 7 | 9 |
| House 3 | 6 | 4 | 5 |
| House 4 | 2 | 2 | 2 |
+--------------+---+----+---+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 1 | 0 | 1 | 1 |
+--------------+----+---+---+---+---+
Round 1
+--------------+-----+-------+
+--------------+-----+-------+
| Buyer | [1] | [0 2] |
+--------------+-----+-------+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 2 | 1 | 1 | 1 |
+--------------+----+---+---+---+---+
The winners are
[1, 0]
The losers are
[2]
Round 2
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [2] |
+--------------+-----+-----+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 2 | 2 | 1 | 1 |
+--------------+----+---+---+---+---+


The winners are
[1, 2]
The losers are
[0]
Round 3
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [0] |
+--------------+-----+-----+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 2 | 3 | 1 | 1 |
+--------------+----+---+---+---+---+
The winners are
[1, 0]
The losers are
[2]
Round 4
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [2] |
+--------------+-----+-----+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 2 | 4 | 1 | 1 |
+--------------+----+---+---+---+---+
The winners are
[1, 2]
The losers are
[0]
Round 5
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [0] |
+--------------+-----+-----+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 2 | 5 | 1 | 1 |


+--------------+----+---+---+---+---+
The winners are
[1, 0]
The losers are
[2]
Round 6

+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [2] | [0] |
+--------------+-----+-----+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 3 | 5 | 1 | 1 |
+--------------+----+---+---+---+---+
The winners are
[2, 0]
The losers are
[1]
Round 7
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [0] |
+--------------+-----+-----+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 4 | 5 | 1 | 1 |
+--------------+----+---+---+---+---+
The winners are
[1, 0]
The losers are
[2]
Round 8
+--------------+-----+-----+
+--------------+-----+-----+
| Buyer | [1] | [2] |
+--------------+-----+-----+
+--------------+----+---+---+---+---+


| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 4 | 6 | 1 | 1 |
+--------------+----+---+---+---+---+
The winners are
[1, 2]
The losers are
[0]
Round 9
+--------------+-----+-----+-----+
| House Number | 1 | 2 | 3 |
+--------------+-----+-----+-----+
| Buyer | [1] | [2] | [0] |
+--------------+-----+-----+-----+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 4 | 6 | 2 | 1 |
+--------------+----+---+---+---+---+
The winners are
[1, 2, 0]
The losers are
[]
Auction ends

+--------------+-----+-----+-----+
| House Number | 1 | 2 | 3 |
+--------------+-----+-----+-----+
| Buyer | [1] | [2] | [0] |
+--------------+-----+-----+-----+
+--------------+----+---+---+---+---+
| House Number | 0 | 1 | 2 | 3 | 4 |
+--------------+----+---+---+---+---+
| Price | 10 | 4 | 6 | 2 | 1 |
+--------------+----+---+---+---+---+
The winners are
[1, 2, 0]
The losers are
[]
[0, 4]

5. reservation prices are so high that the auction can’t start
r4 = np.array([15,15,15])
auction_6 = ascending_bid_auction(v, r4, ϵ)

+--------------+---+----+---+---+
| Buyer Number | 0 | 1 | 2 | 3 |
+--------------+---+----+---+---+
| House 0 | 8 | 5 | 9 | 4 |
| House 1 | 4 | 11 | 7 | 4 |
| House 2 | 9 | 7 | 6 | 4 |
+--------------+---+----+---+---+
+--------------+----+----+----+
| House Number | 0 | 1 | 2 |
+--------------+----+----+----+
| Price | 15 | 15 | 15 |
+--------------+----+----+----+
The auction can not start because of high reservation prices
74.10 A Groves-Clarke Mechanism
We now describe an alternative way for society to allocate 𝑛 houses to 𝑚 possible buyers in a way that maximizes total
value across all potential buyers.
We continue to assume that each buyer can purchase at most one house.
The mechanism is a very special case of a Groves-Clarke mechanism [Groves, 1973], [Clarke, 1971].
Its special structure substantially simplifies writing Python code to find an optimal allocation.
Our mechanism works like this.
• The values 𝑉𝑖𝑗 are private information to person 𝑗
• The mechanism makes each person 𝑗 willing to tell a social planner his private values 𝑉𝑖,𝑗 for all 𝑖 = 1, … , 𝑛.
• The social planner asks all potential bidders to tell the planner their private values 𝑉𝑖𝑗
• The social planner tells no one these, but uses them to allocate houses and set prices
• The mechanism is designed in a way that makes all prospective buyers want to tell the planner their private values
– truth telling is a dominant strategy for each potential buyer
• The planner finds a (house, bidder) pair with highest private value by computing $(\tilde{i}, \tilde{j}) = \operatorname{argmax}(V_{ij})$
• The planner assigns house $\tilde{i}$ to buyer $\tilde{j}$
• The planner charges buyer $\tilde{j}$ the price $\max_{-\tilde{j}} V_{\tilde{i}, j}$, where $-\tilde{j}$ means all $j$’s except $\tilde{j}$.
• The planner creates a matrix of private values for the remaining houses $-\tilde{i}$ by deleting row (i.e., house) $\tilde{i}$ and
column (i.e., buyer) $\tilde{j}$ from $V$.
– (But in doing this, the planner keeps track of the real names of the bidders and the houses).
• The planner returns to the first step and repeats it.
• The planner iterates until all 𝑛 houses are allocated and the charges for all 𝑛 houses are set.
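One pass of the steps above can be sketched in a few lines of NumPy. The helper name `allocate_one` is hypothetical, and the sketch assumes that ties are broken by taking the first maximizer in row-major order, which is simpler than the tie-breaking rule described in the remark that follows.

```python
import numpy as np

def allocate_one(V):
    """Return (house, buyer, price) for the highest entry of V.

    Hypothetical helper: ties go to the first maximizer in row-major order.
    """
    i, j = np.unravel_index(np.argmax(V), V.shape)
    # The winner pays the highest value any *other* buyer places on house i
    price = np.max(np.delete(V[i, :], j))
    return int(i), int(j), int(price)

V_demo = np.array([[10, 9, 8, 7, 6],
                   [9, 9, 7, 6, 6],
                   [8, 6, 6, 9, 4],
                   [7, 5, 6, 4, 9]])
house, buyer, price = allocate_one(V_demo)
print(house, buyer, price)
```

Repeating this step after removing the sold house and the winning buyer gives the full allocation.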
74.11 An Example Solved by Hand
Let’s see how our Groves-Clarke algorithm would work for the following simple matrix $V$ of private values

$$
V = \begin{bmatrix}
10 & 9 & 8 & 7 & 6 \\
9 & 9 & 7 & 6 & 6 \\
8 & 6 & 6 & 9 & 4 \\
7 & 5 & 6 & 4 & 9
\end{bmatrix}
$$
Remark: In the first step, when the highest private value corresponds to more than one (house, bidder) pair, we choose
the pair with the highest sale price. If the highest sale price corresponds to two or more pairs with the highest private value,
we randomly choose one.
np.random.seed(666)
V_orig = np.array([[10, 9, 8, 7, 6],   # record the original values
                   [9, 9, 7, 6, 6],
                   [8, 6, 6, 9, 4],
                   [7, 5, 6, 4, 9]])
V = np.copy(V_orig)                    # used iteratively
n, m = V.shape
p = np.zeros(n)                        # prices of houses
Q = np.zeros((n, m))                   # records which house is sold to which buyer
First assignment
First, we find the (house, bidder) pair with the highest private value.
i, j = np.where(V==np.max(V))
i, j
(array([0]), array([0]))
So, house 0 will be sold to buyer 0 at a price of 9. We then update the sale price of house 0 and the status matrix Q.
p[i] = np.max(np.delete(V[i, :], j))

Q[i, j] = 1
p, Q
(array([9., 0., 0., 0.]),

array([[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]))

Then we remove row 0 and column 0 from 𝑉 . To keep the real number of houses and buyers, we set this row and this
column to -1, which will have the same result as removing them since 𝑉 ≥ 0.
V[i, :] = -1
V[:, j] = -1
V
array([[-1, -1, -1, -1, -1],

[-1, 9, 7, 6, 6],
[-1, 6, 6, 9, 4],
[-1, 5, 6, 4, 9]])
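The claim that masking with -1 is equivalent to deleting can be checked directly. The following is a small sketch (the names `V_full`, `V_masked`, and `V_deleted` are illustrative): both approaches give the same highest remaining value, but only the masked matrix keeps the original house and buyer indices.

```python
import numpy as np

V_full = np.array([[10, 9, 8, 7, 6],
                   [9, 9, 7, 6, 6],
                   [8, 6, 6, 9, 4],
                   [7, 5, 6, 4, 9]])

# Option 1: mask row 0 and column 0, keeping the matrix shape
V_masked = V_full.copy()
V_masked[0, :] = -1
V_masked[:, 0] = -1

# Option 2: physically delete row 0 and column 0
V_deleted = np.delete(np.delete(V_full, 0, axis=0), 0, axis=1)

# Both give the same highest remaining value ...
same_max = np.max(V_masked) == np.max(V_deleted)

# ... but only the masked version reports the original house and buyer labels
masked_pairs = list(zip(*np.where(V_masked == V_masked.max())))
deleted_pairs = list(zip(*np.where(V_deleted == V_deleted.max())))
print(same_max, masked_pairs, deleted_pairs)
```

With deletion, every index after the removed row or column shifts down by one, so a separate lookup table would be needed to recover the real names.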
Second assignment
We find the (house, bidder) pair with the highest private value again.
i, j = np.where(V==np.max(V))
i, j
(array([1, 2, 3]), array([1, 3, 4]))
In this special example, there are three pairs (1, 1), (2, 3) and (3, 4) with the highest private value. To solve this problem,
we choose the one with highest sale price.
p_candidate = np.zeros(len(i))
for k in range(len(i)):
    p_candidate[k] = np.max(np.delete(V[i[k], :], j[k]))
k, = np.where(p_candidate==np.max(p_candidate))
i, j = i[k], j[k]
i, j
So, house 1 will be sold to buyer 1 at a price of 7. We update matrices.
p[i] = np.max(np.delete(V[i, :], j))
Q[i, j] = 1
V[i, :] = -1
V[:, j] = -1
p, Q, V
(array([9., 7., 0., 0.]),

array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[-1, -1, -1, -1, -1],
[-1, -1, -1, -1, -1],
[-1, -1, 6, 9, 4],
[-1, -1, 6, 4, 9]]))
Third assignment
i, j = np.where(V==np.max(V))
i, j
(array([2, 3]), array([3, 4]))
In this special example, there are two pairs (2, 3) and (3, 4) with the highest private value.
To resolve the assignment, we choose the one with highest sale price.
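p_candidate = np.zeros(len(i))
for k in range(len(i)):
    p_candidate[k] = np.max(np.delete(V[i[k], :], j[k]))
k, = np.where(p_candidate==np.max(p_candidate))
i, j = i[k], j[k]
i, j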
(array([2, 3]), array([3, 4]))
The two pairs even have the same sale price.

We randomly choose one pair.
k = np.random.choice(len(i))
i, j = i[k], j[k]
i, j
(2, 3)
Finally, house 2 will be sold to buyer 3.

We update matrices accordingly.
p[i] = np.max(np.delete(V[i, :], j))
Q[i, j] = 1
V[i, :] = -1
V[:, j] = -1
p, Q, V
(array([9., 7., 6., 0.]),

array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0.]]),
array([[-1, -1, -1, -1, -1],
[-1, -1, -1, -1, -1],
[-1, -1, -1, -1, -1],
[-1, -1, 6, -1, 9]]))
Fourth assignment
i, j = np.where(V==np.max(V))
i, j
House 3 will be sold to buyer 4.

The final outcome follows.
p[i] = np.max(np.delete(V[i, :], j))
Q[i, j] = 1
V[i, :] = -1
V[:, j] = -1
S = V_orig*Q - np.diag(p)@Q
p, Q, V, S
(array([9., 7., 6., 6.]),

array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]]),
array([[-1, -1, -1, -1, -1],
[-1, -1, -1, -1, -1],
[-1, -1, -1, -1, -1],
[-1, -1, -1, -1, -1]]),
array([[1., 0., 0., 0., 0.],
[0., 2., 0., 0., 0.],
[0., 0., 0., 3., 0.],
[0., 0., 0., 0., 3.]]))
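The surplus formula can be read entry by entry: `V_orig*Q` keeps the value of each realized (house, buyer) match, and `np.diag(p)@Q` places the price of house 𝑖 in the same cell. A minimal check, using the final `p` and `Q` reported above, compares it with an equivalent broadcasting form (the names `S_matrix`, `S_broadcast`, and `buyer_surplus` are illustrative, not part of the lecture code):

```python
import numpy as np

V_orig = np.array([[10, 9, 8, 7, 6],
                   [9, 9, 7, 6, 6],
                   [8, 6, 6, 9, 4],
                   [7, 5, 6, 4, 9]])
p = np.array([9., 7., 6., 6.])          # final prices from the example
Q = np.zeros((4, 5))
Q[[0, 1, 2, 3], [0, 1, 3, 4]] = 1       # final (house, buyer) matches

# Formula used in the lecture
S_matrix = V_orig * Q - np.diag(p) @ Q

# Equivalent form: subtract each house's price from its row, then keep matches
S_broadcast = (V_orig - p[:, None]) * Q

# Surplus per buyer is the column sum (zero for the unmatched buyer 2)
buyer_surplus = S_matrix.sum(axis=0)
print(buyer_surplus)
```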
74.12 Another Python Class
It is efficient to assemble our calculations in a single Python Class.
class GC_Mechanism:

    def __init__(self, V):
        """
        Implementation of the special Groves Clarke Mechanism for house auction.

        Parameters:
        ----------
        V: 2 dimensional private value matrix
        """
        self.V_orig = V.copy()
        self.V = V.copy()
        self.n, self.m = self.V.shape
        self.p = np.zeros(self.n)
        self.Q = np.zeros((self.n, self.m))
        self.S = np.copy(self.Q)

    def find_argmax(self):
        """
        Find the house-buyer pair with the highest value.

        When the highest private value corresponds to more than one
        (house, bidder) pair, we choose the pair with the highest sale price.
        Moreover, if the highest sale price corresponds to two or more pairs
        with the highest private value, we randomly choose one.

        Parameters:
        ----------
        V: 2 dimensional private value matrix with -1 indicating removed rows
           and columns

        Returns:
        ----------
        i: the index of the sold house
        j: the index of the buyer
        """
        i, j = np.where(self.V==np.max(self.V))

        if (len(i)>1):
            # break ties by the highest sale price
            p_candidate = np.zeros(len(i))
            for k in range(len(i)):
                p_candidate[k] = np.max(np.delete(self.V[i[k], :], j[k]))
            k, = np.where(p_candidate==np.max(p_candidate))
            i, j = i[k], j[k]

            if (len(i)>1):
                # still tied: pick one pair at random
                k = np.random.choice(len(i))
                k = np.array([k])
                i, j = i[k], j[k]

        return i, j

    def update_status(self, i, j):
        self.p[i] = np.max(np.delete(self.V[i, :], j))
        self.Q[i, j] = 1
        self.V[i, :] = -1
        self.V[:, j] = -1

    def calculate_surplus(self):
        self.S = self.V_orig*self.Q - np.diag(self.p)@self.Q

    def start(self):
        while (np.max(self.V)>=0):
            i, j = self.find_argmax()
            self.update_status(i, j)
            print("House %i is sold to buyer %i at price %i"%(i[0], j[0], self.p[i[0]]))
            print("\n")
        self.calculate_surplus()
        print("Prices of house:\n", self.p)
        print("\n")
        print("The status matrix:\n", self.Q)
        print("\n")
        print("The surplus matrix:\n", self.S)

np.random.seed(666)
V_orig = np.array([[10, 9, 8, 7, 6],

[9, 9, 7, 6, 6],
[8, 6, 6, 9, 4],
[7, 5, 6, 4, 9]])
gc_mechanism = GC_Mechanism(V_orig)
gc_mechanism.start()
House 0 is sold to buyer 0 at price 9
Prices of house:
[9. 7. 6. 6.]
The status matrix:

[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
The surplus matrix:

[[1. 0. 0. 0. 0.]
[0. 2. 0. 0. 0.]
[0. 0. 0. 3. 0.]
[0. 0. 0. 0. 3.]]
74.12.1 Elaborations
Here we use some additional notation designed to conform with standard notation in parts of the VCG literature.
We want to verify that our pseudo code is indeed a pivot mechanism, also called a VCG (Vickrey-Clarke-Groves)
mechanism.
• The mechanism is named after [Groves, 1973], [Clarke, 1971], and [Vickrey, 1961].
To prepare for verifying this, we add some notation.
Let 𝑋 be the set of feasible allocations of houses under the protocols above (i.e., at most one house to each person).
Let 𝑋(𝑣) be the allocation that the mechanism chooses for matrix 𝑣 of private values.
The mechanism maps a matrix 𝑣 of private values into an 𝑥 ∈ 𝑋.
Let 𝑣𝑗 (𝑥) be the value that person 𝑗 attaches to allocation 𝑥 ∈ 𝑋.
Let $\check{t}_j(v)$ be the payment that the mechanism charges person $j$.

The VCG mechanism chooses the allocation

$$
X(v) = \operatorname*{argmax}_{x \in X} \sum_{j=1}^{m} v_j(x) \tag{74.1}
$$
and charges person 𝑗 a “social cost”
𝑡𝑗̌ (𝑣) = max ∑ 𝑣𝑘 (𝑥) − ∑ 𝑣𝑘 (𝑋(𝑣)) (74.2)

𝑥∈𝑋
𝑘≠𝑗 𝑘≠𝑗
In our setting, equation (74.1) says that the VCG allocation allocates houses to maximize the total value of the successful
prospective buyers.
In our setting, equation (74.2) says that the mechanism charges people for the externality that their presence in society
imposes on other prospective buyers.
Thus, notice that according to equation (74.2):
• unsuccessful prospective buyers pay 0 because removing them from “society” would not affect the allocation chosen
by the mechanism
• successful prospective buyers pay the difference between the total value society could achieve without them present
and the total value that others present in society do achieve under the mechanism.
The generalized second-price auction described in our pseudo code above does indeed satisfy (74.1). We want to compute
$\check{t}_j$ for $j = 1, \ldots, m$ and compare with $p_j$ from the second-price auction.
74.12.2 Social Cost
Using the GC_Mechanism class, we can calculate the social cost of each buyer.
Let’s see a simpler example with private value matrix
$$
V = \begin{bmatrix}
10 & 9 & 8 & 7 & 6 \\
9 & 8 & 7 & 6 & 6 \\
8 & 7 & 6 & 5 & 4
\end{bmatrix}
$$
To begin with, we implement the GC mechanism and see the outcome.
np.random.seed(666)
V_orig = np.array([[10, 9, 8, 7, 6],

[9, 8, 7, 6, 6],
[8, 7, 6, 5, 4]])
gc_mechanism = GC_Mechanism(V_orig)
gc_mechanism.start()
Prices of house:


[9. 7. 5.]
The status matrix:

[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]]
The surplus matrix:

[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]]
We exclude buyer 0 and calculate the allocation.
V_exc_0 = np.copy(V_orig)
V_exc_0[:, 0] = -1
V_exc_0
gc_mechanism_exc_0 = GC_Mechanism(V_exc_0)
gc_mechanism_exc_0.start()
Prices of house:
[8. 6. 4.]
The status matrix:

[[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]]
The surplus matrix:

[[-0. 1. 0. 0. 0.]
[-0. 0. 1. 0. 0.]
[-0. 0. 0. 1. 0.]]
Calculate the social cost of buyer 0.
print("The social cost of buyer 0:",
      np.sum(gc_mechanism_exc_0.Q*gc_mechanism_exc_0.V_orig)
      - np.sum(np.delete(gc_mechanism.Q*gc_mechanism.V_orig, 0, axis=1)))
The social cost of buyer 0: 7.0
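Repeat this process for buyer 1 and buyer 2.
V_exc_1 = np.copy(V_orig)
V_exc_1[:, 1] = -1
V_exc_1
gc_mechanism_exc_1 = GC_Mechanism(V_exc_1)
gc_mechanism_exc_1.start()
print("\nThe social cost of buyer 1:",
      np.sum(gc_mechanism_exc_1.Q*gc_mechanism_exc_1.V_orig)
      - np.sum(np.delete(gc_mechanism.Q*gc_mechanism.V_orig, 1, axis=1)))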
Prices of house:
[8. 6. 4.]
The status matrix:

[[1. 0. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]]
The surplus matrix:

[[ 2. -0. 0. 0. 0.]
[ 0. -0. 1. 0. 0.]
[ 0. -0. 0. 1. 0.]]
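V_exc_2 = np.copy(V_orig)
V_exc_2[:, 2] = -1
V_exc_2
gc_mechanism_exc_2 = GC_Mechanism(V_exc_2)
gc_mechanism_exc_2.start()
print("\nThe social cost of buyer 2:",
      np.sum(gc_mechanism_exc_2.Q*gc_mechanism_exc_2.V_orig)
      - np.sum(np.delete(gc_mechanism.Q*gc_mechanism.V_orig, 2, axis=1)))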
Prices of house:
[9. 6. 4.]
The status matrix:

[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0.]]
The surplus matrix:

[[ 1. 0. -0. 0. 0.]
[ 0. 2. -0. 0. 0.]
[ 0. 0. -0. 1. 0.]]

Part XIV
Other
CHAPTER
SEVENTYFIVE
TROUBLESHOOTING
Contents
• Troubleshooting
– Fixing Your Local Environment
– Reporting an Issue
This page is for readers experiencing errors when running the code from the lectures.
75.1 Fixing Your Local Environment
The basic assumption of the lectures is that code in a lecture should execute whenever
1. it is executed in a Jupyter notebook and
2. the notebook is running on a machine with the latest version of Anaconda Python.
You have installed Anaconda, haven’t you, following the instructions in this lecture?
Assuming that you have, the most common source of problems for our readers is that their Anaconda distribution is not
up to date.
Here’s a useful article on how to update Anaconda.
Another option is to simply remove Anaconda and reinstall.
You also need to keep the external code libraries, such as QuantEcon.py up to date.
For this task you can either
• use conda install -y quantecon on the command line, or
• execute !conda install -y quantecon within a Jupyter notebook.
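Before updating, it can help to see which versions are currently installed. A minimal sketch using only the standard library follows; the helper name `installed_version` is hypothetical, and the package list is just an example.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Example packages; extend the list with whatever a lecture imports
for name in ["quantecon", "numpy"]:
    print(name, installed_version(name))
```

A result of `None` means the package is not installed in the environment that runs the notebook.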
If your local environment is still not working you can do two things.
First, you can use a remote machine instead, by clicking on the Launch Notebook icon available for each lecture.
Second, you can report an issue, so we can try to fix your local setup.
We like getting feedback on the lectures so please don’t hesitate to get in touch.
75.2 Reporting an Issue
One way to give feedback is to raise an issue through our issue tracker.
Please be as specific as possible. Tell us where the problem is and give as much detail about your local setup as you can
provide.
Another feedback option is to use our discourse forum.
Finally, you can provide direct feedback to contact@quantecon.org.
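When describing a local setup in a report, a few lines of standard-library Python can collect the basics. This is only a suggestion, and the function name `environment_report` is hypothetical; paste the printed lines into the report alongside the error message.

```python
import platform
import sys

def environment_report():
    """Collect basic facts about the Python environment running this code."""
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }

report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```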

CHAPTER
SEVENTYSIX
REFERENCES

CHAPTER
SEVENTYSEVEN
EXECUTION STATISTICS
This table contains the latest execution statistics.
Document Modified Method Run Time (s) Status

aiyagari 2024-04-29 22:59 cache 26.24 ✅
ar1_bayes 2024-04-29 23:06 cache 407.78 ✅
ar1_turningpts 2024-04-29 23:06 cache 43.41 ✅
back_prop 2024-04-29 23:07 cache 65.17 ✅
bayes_nonconj 2024-04-30 00:15 cache 4031.25 ✅
cake_eating_numerical 2024-04-30 00:15 cache 27.83 ✅
cake_eating_problem 2024-04-30 00:15 cache 2.18 ✅
career 2024-04-30 00:15 cache 17.43 ✅
cass_koopmans_1 2024-04-30 00:16 cache 10.17 ✅
cass_koopmans_2 2024-04-30 00:16 cache 8.72 ✅
coleman_policy_iter 2024-04-30 00:16 cache 17.32 ✅
cross_product_trick 2024-04-30 00:16 cache 1.05 ✅
egm_policy_iter 2024-04-30 00:16 cache 6.54 ✅
eig_circulant 2024-04-30 00:16 cache 4.83 ✅
exchangeable 2024-04-30 00:16 cache 10.37 ✅
finite_markov 2024-04-30 00:17 cache 10.86 ✅
ge_arrow 2024-04-30 00:17 cache 2.43 ✅
harrison_kreps 2024-04-30 00:17 cache 7.74 ✅
hoist_failure 2024-04-30 00:18 cache 76.53 ✅
house_auction 2024-04-30 00:18 cache 6.89 ✅
ifp 2024-04-30 00:19 cache 48.72 ✅
ifp_advanced 2024-04-30 00:19 cache 30.78 ✅
imp_sample 2024-04-30 00:24 cache 279.77 ✅
intro 2024-04-30 00:24 cache 4.02 ✅
inventory_dynamics 2024-04-30 00:24 cache 12.15 ✅
jv 2024-04-30 00:25 cache 19.95 ✅
kalman 2024-04-30 00:25 cache 11.52 ✅
kalman_2 2024-04-30 00:26 cache 35.19 ✅
kesten_processes 2024-04-30 00:26 cache 47.58 ✅
lagrangian_lqdp 2024-04-30 00:27 cache 20.5 ✅
lake_model 2024-04-30 00:27 cache 19.97 ✅
likelihood_bayes 2024-04-30 00:28 cache 49.8 ✅
likelihood_ratio_process 2024-04-30 00:28 cache 10.79 ✅
linear_algebra 2024-04-30 00:28 cache 2.85 ✅
linear_models 2024-04-30 00:28 cache 11.05 ✅
lln_clt 2024-04-30 00:29 cache 13.96 ✅
lq_inventories 2024-04-30 00:29 cache 20.38 ✅
lqcontrol 2024-04-30 00:29 cache 9.43 ✅
markov_asset 2024-04-30 00:29 cache 9.94 ✅
markov_perf 2024-04-30 00:29 cache 8.76 ✅
mccall_correlated 2024-04-30 00:31 cache 90.58 ✅
mccall_fitted_vfi 2024-04-30 00:31 cache 12.41 ✅
mccall_model 2024-04-30 00:31 cache 19.69 ✅
mccall_model_with_separation 2024-04-30 00:32 cache 11.8 ✅
mccall_q 2024-04-30 00:32 cache 24.58 ✅
mix_model 2024-04-30 00:33 cache 35.12 ✅
mle 2024-04-30 00:33 cache 5.7 ✅
multi_hyper 2024-04-30 00:33 cache 25.28 ✅
multivariate_normal 2024-04-30 00:33 cache 5.59 ✅
navy_captain 2024-04-30 00:34 cache 34.93 ✅
newton_method 2024-04-30 00:35 cache 86.8 ✅
odu 2024-04-30 00:36 cache 65.96 ✅
ols 2024-04-30 00:37 cache 15.36 ✅
opt_transport 2024-04-30 00:37 cache 27.16 ✅
optgrowth 2024-04-30 00:38 cache 80.91 ✅
optgrowth_fast 2024-04-30 00:39 cache 26.98 ✅
pandas_panel 2024-04-30 00:39 cache 5.75 ✅
perm_income 2024-04-30 00:39 cache 4.77 ✅
perm_income_cons 2024-04-30 00:39 cache 10.81 ✅
prob_matrix 2024-04-30 00:40 cache 16.38 ✅
prob_meaning 2024-04-30 00:41 cache 75.45 ✅
qr_decomp 2024-04-30 00:41 cache 1.55 ✅
rand_resp 2024-04-30 00:41 cache 3.27 ✅
rational_expectations 2024-04-30 00:41 cache 7.76 ✅
re_with_feedback 2024-04-30 00:41 cache 12.69 ✅
samuelson 2024-04-30 00:42 cache 16.93 ✅
sir_model 2024-04-30 00:42 cache 3.61 ✅
status 2024-04-30 00:42 cache 9.27 ✅
svd_intro 2024-04-30 00:42 cache 1.81 ✅
troubleshooting 2024-04-30 00:24 cache 4.02 ✅
two_auctions 2024-04-30 00:42 cache 20.61 ✅
uncertainty_traps 2024-04-30 00:42 cache 3.29 ✅
util_rand_resp 2024-04-30 00:42 cache 3.51 ✅
var_dmd 2024-04-30 00:24 cache 4.02 ✅
von_neumann_model 2024-04-30 00:42 cache 2.68 ✅
wald_friedman 2024-04-30 00:43 cache 17.67 ✅
wealth_dynamics 2024-04-30 00:43 cache 42.21 ✅
zreferences 2024-04-30 00:24 cache 4.02 ✅
These lectures are built on Linux instances through GitHub Actions.

These lectures use the following Python version
!python --version

Python 3.11.7
and the following package versions
!conda list

BIBLIOGRAPHY
[AJR01] Daron Acemoglu, Simon Johnson, and James A Robinson. The colonial origins of comparative development:
an empirical investigation. The American Economic Review, 91(5):1369–1401, 2001.
[Aiy94] S Rao Aiyagari. Uninsured Idiosyncratic Risk and Aggregate Saving. The Quarterly Journal of Economics,
109(3):659–684, 1994.
[AM05] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Dover Publications, 2005.
[AHMS96] E. W. Anderson, L. P. Hansen, E. R. McGrattan, and T. J. Sargent. Mechanics of Forming and Estimating
Dynamic Linear Economies. In Handbook of Computational Economics. Elsevier, vol 1 edition, 1996.
[And76] Harald Anderson. Estimation of a proportion through randomized response. International Statistical
Review/Revue Internationale de Statistique, pages 213–217, 1976.
[Apo90] George Apostolakis. The concept of probability in safety assessments of technological systems. Science,
250(4986):1359–1364, 1990.
[Axt01] Robert L Axtell. Zipf distribution of US firm sizes. Science, 293(5536):1818–1820, 2001.
[Bar79] Robert J Barro. On the Determination of the Public Debt. Journal of Political Economy, 87(5):940–971,
1979.
[BB18] Jess Benhabib and Alberto Bisin. Skewed wealth distributions: theory and empirics. Journal of Economic
Literature, 56(4):1261–91, 2018.
[BBZ15] Jess Benhabib, Alberto Bisin, and Shenghao Zhu. The wealth distribution in bewley economies with capital
income risk. Journal of Economic Theory, 159:489–515, 2015.
[BS79] L M Benveniste and J A Scheinkman. On the Differentiability of the Value Function in Dynamic Models of
Economics. Econometrica, 47(3):727–732, 1979.
[Ber75] Dmitri Bertsekas. Dynamic Programming and Stochastic Control. Academic Press, New York, 1975.
[Bew77] Truman Bewley. The permanent income hypothesis: a theoretical formulation. Journal of Economic Theory,
16(2):252–292, 1977.
[Bew86] Truman F Bewley. Stationary monetary equilibrium with a continuum of independently fluctuating consumers.
In Werner Hildenbrand and Andreu Mas-Colell, editors, Contributions to Mathematical Economics in Honor of
Gerard Debreu, pages 27–102. North-Holland, Amsterdam, 1986.
[Bis06] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[BK80] Olivier Jean Blanchard and Charles M Kahn. The Solution of Linear Difference Models under Rational
Expectations. Econometrica, 48(5):1305–1311, July 1980.
[BK19] Steven L. Brunton and J. Nathan Kutz. Data-Driven Science and Engineering: Machine Learning, Dynamical
Systems, and Control. Cambridge University Press, 2019.
[BK22] Steven L. Brunton and J. Nathan Kutz. Data-Driven Science and Engineering, Second Edition. Cambridge
University Press, New York, 2022.
[BDM+16] Dariusz Buraczewski, Ewa Damek, Thomas Mikosch, and others. Stochastic models with power-law tails.
Springer, 2016.
[Bur23] Jennifer Burns. Milton Friedman: The Last Conservative. Farrar, Straus, and Giroux, New York, 2023.
[Cag56] Philip Cagan. The monetary dynamics of hyperinflation. In Milton Friedman, editor, Studies in the Quantity
Theory of Money, pages 25–117. University of Chicago Press, Chicago, 1956.
[Cap85] Andrew S Caplin. The variability of aggregate demand with (s, s) inventory policies. Econometrica, pages
1395–1409, 1985.
[Car01] Christopher D Carroll. A Theory of the Consumption Function, with and without Liquidity Constraints.
Journal of Economic Perspectives, 15(3):23–45, 2001.
[Car06] Christopher D Carroll. The method of endogenous gridpoints for solving dynamic stochastic optimization
problems. Economics Letters, 91(3):312–320, 2006.
[Cas65] David Cass. Optimum growth in an aggregative model of capital accumulation. Review of Economic Studies,
32(3):233–240, 1965.
[CM88] A Chaudhuri and R Mukerjee. Randomized Response: Theory and Technique. Marcel Dekker, New York,
1988.
[Cla71] E. Clarke. Multipart pricing of public goods. Public Choice, 8:19–33, 1971.
[Col90] Wilbur John Coleman. Solving the Stochastic Growth Model by Policy-Function Iteration. Journal of Business
& Economic Statistics, 8(1):27–29, 1990.
[DFH06] Steven J Davis, R Jason Faberman, and John Haltiwanger. The flow approach to labor markets: new data
sources, micro-macro links and the recent downturn. Journal of Economic Perspectives, 2006.
[dF37] Bruno de Finetti. La prevision: ses lois logiques, ses sources subjectives. Annales de l'Institute Henri Poincare',
7:1 – 68, 1937. English translation in Kyburg and Smokler (eds.), Studies in Subjective Probability, Wiley,
New York, 1964.
[Dea91] Angus Deaton. Saving and Liquidity Constraints. Econometrica, 59(5):1221–1248, 1991.
[DP94] Angus Deaton and Christina Paxson. Intertemporal Choice and Inequality. Journal of Political Economy,
102(3):437–467, 1994.
[DH10] Wouter J Den Haan. Comparison of solutions to the incomplete markets model with aggregate uncertainty.
Journal of Economic Dynamics and Control, 34(1):4–27, 2010.
[DS10] Ulrich Doraszelski and Mark Satterthwaite. Computable markov-perfect industry dynamics. The RAND
Journal of Economics, 41(2):215–243, 2010.
[DSS58] Robert Dorfman, Paul A. Samuelson, and Robert M. Solow. Linear Programming and Economic Analysis:
Revised Edition. McGraw Hill, New York, 1958.
[DLP13] Y E Du, Ehud Lehrer, and A D Y Pauzner. Competitive economy as a ranking device over networks.
submitted, 2013.
[Dud02] R M Dudley. Real Analysis and Probability. Cambridge Studies in Advanced Mathematics. Cambridge
University Press, 2002.
[DRS89] Timothy Dunne, Mark J Roberts, and Larry Samuelson. The growth and failure of us manufacturing plants.
The Quarterly Journal of Economics, 104(4):671–698, 1989.
[ESAW18] Ashraf Ben El-Shanawany, Keith H Ardron, and Simon P Walker. Lognormal approximations of fault tree
uncertainty distributions. Risk Analysis, 38(8):1576–1584, 2018.
[EG87] Robert F Engle and Clive W J Granger. Co-integration and Error Correction: Representation, Estimation,
and Testing. Econometrica, 55(2):251–276, 1987.
[EP95] Richard Ericson and Ariel Pakes. Markov-perfect industry dynamics: a framework for empirical work. The
Review of Economic Studies, 62(1):53–82, 1995.
[Eva87] David S Evans. The relationship between firm growth, size, and age: estimates for 100 manufacturing
industries. The Journal of Industrial Economics, pages 567–581, 1987.
[EH01] G W Evans and S Honkapohja. Learning and Expectations in Macroeconomics. Frontiers of Economic
Research. Princeton University Press, 2001.
[FSTD15] Pablo Fajgelbaum, Edouard Schaal, and Mathieu Taschereau-Dumouchel. Uncertainty traps. Technical
Report, National Bureau of Economic Research, 2015.
[FPS77] Michael A Fligner, George E Policello, and Jagbir Singh. A comparison of two randomized response survey
methods with consideration for the level of respondent protection. Communications in Statistics-Theory and
Methods, 6(15):1511–1524, 1977.
[Fri56] M. Friedman. A Theory of the Consumption Function. Princeton University Press, 1956.
[FF98] Milton Friedman and Rose D Friedman. Two Lucky People. University of Chicago Press, 1998.
[Gab16] Xavier Gabaix. Power laws in economics: an introduction. Journal of Economic Perspectives, 30(1):185–206,
2016.
[Gal89] David Gale. The theory of linear economic models. University of Chicago press, 1989.
[Gal16] Alfred Galichon. Optimal Transport Methods in Economics. Princeton University Press, Princeton,
New Jersey, 2016.
[Gib31] Robert Gibrat. Les inégalités économiques: Applications d'une loi nouvelle, la loi de l'effet proportionnel. PhD
thesis, Recueil Sirey, 1931.
[Gor95] Geoffrey J Gordon. Stable function approximation in dynamic programming. In Machine Learning
Proceedings 1995, pages 261–268. Elsevier, 1995.
[GAESH69] Bernard G Greenberg, Abdel-Latif A Abul-Ela, Walt R Simmons, and Daniel G Horvitz. The unrelated
question randomized response model: theoretical framework. Journal of the American Statistical Association,
64(326):520–539, 1969.
[GKAH77] Bernard G Greenberg, Roy R Kuebler, James R Abernathy, and Daniel G Horvitz. Respondent hazards in
the unrelated question randomized response model. Journal of Statistical Planning and Inference, 1(1):53–60,
1977.
[GS93] Moses A Greenfield and Thomas J Sargent. A probabilistic analysis of a catastrophic transuranic waste hoist
accident at the WIPP. Environmental Evaluation Group, Albuquerque, New Mexico, June 1993.
URL: http://www.tomsargent.com/research/EEG-53.pdf.
[Gro73] T. Groves. Incentives in teams. Econometrica, 41:617–631, 1973.
[Hal87] Bronwyn H Hall. The relationship between firm size and firm growth in the US manufacturing sector. The
Journal of Industrial Economics, pages 583–606, 1987.
[Hal78] Robert E Hall. Stochastic Implications of the Life Cycle-Permanent Income Hypothesis: Theory and
Evidence. Journal of Political Economy, 86(6):971–987, 1978.
[HM82] Robert E Hall and Frederic S Mishkin. The Sensitivity of Consumption to Transitory Income: Estimates from
Panel Data on Households. National Bureau of Economic Research Working Paper Series, 1982.
[HTW67] Michael J Hamburger, Gerald L Thompson, and Roman L Weil. Computation of expansion rates for the
generalized von Neumann model of an expanding economy. Econometrica, Journal of the Econometric Society,
pages 542–547, 1967.
[Ham05] James D Hamilton. What's real about the business cycle? Federal Reserve Bank of St. Louis Review, pages
435–452, 2005.
[HS08] L P Hansen and T J Sargent. Robustness. Princeton University Press, 2008.
[HS13] L P Hansen and T J Sargent. Recursive Models of Dynamic Linear Economies. The Gorman Lectures in
Economics. Princeton University Press, 2013.
[HR87] Lars Peter Hansen and Scott F Richard. The role of conditioning information in deducing testable restrictions
implied by dynamic asset pricing models. Econometrica, 55(3):587–613, May 1987.
[HK78] J. Michael Harrison and David M. Kreps. Speculative investor behavior in a stock market with heterogeneous
expectations. The Quarterly Journal of Economics, 92(2):323–336, 1978.
[HK79] J. Michael Harrison and David M. Kreps. Martingales and arbitrage in multiperiod securities markets. Journal
of Economic Theory, 20(3):381–408, June 1979.
[HL96] John Heaton and Deborah J Lucas. Evaluating the effects of incomplete markets on risk sharing and asset
pricing. Journal of Political Economy, pages 443–487, 1996.
[HLL96] O Hernandez-Lerma and J B Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria.
Number Vol 1 in Applications of Mathematics Stochastic Modelling and Applied Probability. Springer, 1996.
[HMMS60] Charles Holt, Franco Modigliani, John F. Muth, and Herbert Simon. Planning Production, Inventories, and
Work Force. Prentice-Hall International Series in Management, New Jersey, 1960.
[Hop92] Hugo A Hopenhayn. Entry, exit, and firm dynamics in long run equilibrium. Econometrica: Journal of the
Econometric Society, pages 1127–1150, 1992.
[HP92] Hugo A Hopenhayn and Edward C Prescott. Stochastic Monotonicity and Stationary Distributions for
Dynamic Economies. Econometrica, 60(6):1387–1406, 1992.
[Hug93] Mark Huggett. The risk-free rate in heterogeneous-agent incomplete-insurance economies. Journal of
Economic Dynamics and Control, 17(5-6):953–969, 1993.
[Hur50] Leonid Hurwicz. Least squares bias in time series. Statistical inference in dynamic economic models,
10:365–383, 1950.
[Haggstrom02] Olle Häggström. Finite Markov chains and algorithmic applications. Volume 52. Cambridge University
Press, 2002.
[Janich94] K Jänich. Linear Algebra. Springer Undergraduate Texts in Mathematics and Technology. Springer, 1994.
[JYC88] John Y. Campbell and Robert J. Shiller. The Dividend-Price Ratio and Expectations of Future Dividends and
Discount Factors. Review of Financial Studies, 1(3):195–228, 1988.
[Jov79] Boyan Jovanovic. Firm-specific capital and turnover. Journal of Political Economy, 87(6):1246–1260, 1979.
[Jud90] K L Judd. Cournot versus Bertrand: a dynamic resolution. Technical Report, Hoover Institution, Stanford
University, 1990.
[Kam12] Takashi Kamihigashi. Elementary results on solutions to the Bellman equation of dynamic programming:
existence, uniqueness, and convergence. Technical Report, Kobe University, 2012.
[KMT56] John G Kemeny, Oskar Morgenstern, and Gerald L Thompson. A generalization of the von Neumann model
of an expanding economy. Econometrica, Journal of the Econometric Society, pages 115–135, 1956.
[Koo65] Tjalling C. Koopmans. On the concept of optimal economic growth. In Tjalling C. Koopmans, editor, The
Economic Approach to Development Planning, pages 225–287. Chicago, 1965.
[Kre88] David M. Kreps. Notes on the Theory of Choice. Westview Press, Boulder, Colorado, 1988.
[Kuh13] Moritz Kuhn. Recursive Equilibria In An Aiyagari-Style Economy With Permanent Income Shocks.
International Economic Review, 54:807–835, 2013.
[KBBWP16] J. N. Kutz, S. L. Brunton, B. W. Brunton, and J. L. Proctor. Dynamic mode decomposition: data-driven
modeling of complex systems. SIAM, 2016.
[Lan75] Jan Lanke. On the choice of the unrelated question in Simmons' version of randomized response. Journal of
the American Statistical Association, 70(349):80–83, 1975.
[Lan76] Jan Lanke. On the degree of protection in randomized interviews. International Statistical Review/Revue
Internationale de Statistique, pages 197–203, 1976.
[LL01] Martin Lettau and Sydney Ludvigson. Consumption, Aggregate Wealth, and Expected Stock Returns. Journal
of Finance, 56(3):815–849, June 2001.
[LL04] Martin Lettau and Sydney C. Ludvigson. Understanding Trend and Cycle in Asset Values: Reevaluating the
Wealth Effect on Consumption. American Economic Review, 94(1):276–299, March 2004.
[LM80] David Levhari and Leonard J Mirman. The great fish war: an example using a dynamic Cournot-Nash solution.
The Bell Journal of Economics, pages 322–334, 1980.
[LW76] Frederick W Leysieffer and Stanley L Warner. Respondent jeopardy and optimal designs in randomized
response models. Journal of the American Statistical Association, 71(355):649–656, 1976.
[LS18] L Ljungqvist and T J Sargent. Recursive Macroeconomic Theory. MIT Press, 4th edition, 2018.
[Lju93] Lars Ljungqvist. A unified approach to measures of privacy in randomized response models: a utilitarian
perspective. Journal of the American Statistical Association, 88(421):97–103, 1993.
[Luc78] Robert E Lucas, Jr. Asset prices in an exchange economy. Econometrica: Journal of the Econometric Society,
46(6):1429–1445, 1978.
[LP71] Robert E Lucas, Jr. and Edward C Prescott. Investment under uncertainty. Econometrica: Journal of the
Econometric Society, pages 659–681, 1971.
[MST20] Qingyin Ma, John Stachurski, and Alexis Akira Toda. The income fluctuation problem and the evolution of
wealth. Journal of Economic Theory, 187:105003, 2020.
[MS89] Albert Marcet and Thomas J Sargent. Convergence of Least-Squares Learning in Environments with Hidden
State Variables and Private Information. Journal of Political Economy, 97(6):1306–1322, 1989.
[MdRV10] V Filipe Martins-da-Rocha and Yiannis Vailakis. Existence and Uniqueness of a Fixed Point for Local
Contractions. Econometrica, 78(3):1127–1141, 2010.
[MCWG95] A Mas-Colell, M D Whinston, and J R Green. Microeconomic Theory. Volume 1. Oxford University Press,
1995.
[McC70] J J McCall. Economics of Information and Job Search. The Quarterly Journal of Economics, 84(1):113–126,
1970.
[MB54] F. Modigliani and R. Brumberg. Utility analysis and the consumption function: An interpretation of
cross-section data. In K. K. Kurihara, editor, Post-Keynesian Economics. 1954.
[Nea99] Derek Neal. The Complexity of Job Mobility among Young Men. Journal of Labor Economics,
17(2):237–261, 1999.
[NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical
Transactions of the Royal Society of London A, 231(694–706):289–337, 1933.
[OW69] Guy H. Orcutt and Herbert S. Winokur. First order autoregression: inference, estimation, and prediction.
Econometrica, 37(1):1–14, 1969.
[Par99] Jonathan A Parker. The Reaction of Household Consumption to Predictable Changes in Social Security Taxes.
American Economic Review, 89(4):959–973, 1999.
[PalS13] Jenő Pál and John Stachurski. Fitted value function iteration with probability one contractions. Journal of
Economic Dynamics and Control, 37(1):251–264, 2013.
[Rab02] Guillaume Rabault. When do borrowing constraints bind? Some new results on the income fluctuation
problem. Journal of Economic Dynamics and Control, 26(2):217–245, 2002.
[Ref96] Kevin L Reffett. Production-based asset pricing in monetary economies with transactions costs. Economica,
pages 427–443, 1996.
[Rei09] Michael Reiter. Solving heterogeneous-agent models by projection and perturbation. Journal of Economic
Dynamics and Control, 33(3):649–665, 2009.
[Rya12] Stephen P Ryan. The costs of environmental regulation in a concentrated industry. Econometrica,
80(3):1019–1061, 2012.
[Sam39] Paul A. Samuelson. Interactions between the multiplier analysis and the principle of acceleration. The Review
of Economics and Statistics, 21(2):75–78, 1939.
[Sar77] Thomas J Sargent. The Demand for Money During Hyperinflations under Rational Expectations: I.
International Economic Review, 18(1):59–82, February 1977.
[Sar87] Thomas J Sargent. Macroeconomic Theory. Academic Press, New York, 2nd edition, 1987.
[SE77] Jack Schechtman and Vera L S Escudero. Some results on an income fluctuation problem. Journal of Economic
Theory, 16(2):151–166, 1977.
[Sch14] Jose A. Scheinkman. Speculation, Trading, and Bubbles. Columbia University Press, New York, 2014.
[Sch10] Peter J Schmid. Dynamic mode decomposition of numerical and experimental data. Journal of Fluid
Mechanics, 656:5–28, 2010.
[Sta08] John Stachurski. Continuous state dynamic programming via nonexpansive approximation. Computational
Economics, 31(2):141–160, 2008.
[ST19] John Stachurski and Alexis Akira Toda. An impossibility theorem for wealth in heterogeneous-agent models
with limited heterogeneity. Journal of Economic Theory, 182:1–24, 2019.
[SLP89] N L Stokey, R E Lucas, and E C Prescott. Recursive Methods in Economic Dynamics. Harvard University
Press, 1989.
[STY04] Kjetil Storesletten, Christopher I Telmer, and Amir Yaron. Consumption and risk sharing over the life cycle.
Journal of Monetary Economics, 51(3):609–633, 2004.
[Sun96] R K Sundaram. A First Course in Optimization Theory. Cambridge University Press, 1996.
[SB18] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[Tau86] George Tauchen. Finite state Markov-chain approximations to univariate and vector autoregressions.
Economics Letters, 20(2):177–181, 1986.
[Tre16] Daniel Treisman. Russia's billionaires. The American Economic Review, 106(5):236–241, 2016.
[TRL+14] J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz. On dynamic mode decomposition:
theory and applications. Journal of Computational Dynamics, 1(2):391–421, 2014.
[VL11] Ngo Van Long. Dynamic games in the economics of natural resources: a survey. Dynamic Games and
Applications, 1(1):115–148, 2011.
[Vic61] W. Vickrey. Counterspeculation, auctions, and competitive sealed tenders. Journal of Finance, 16:8–37,
1961.
[vN28] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
[vN37] John von Neumann. Über ein ökonomisches Gleichungssystem und eine Verallgemeinerung des Brouwerschen
Fixpunktsatzes. In Erge. Math. Kolloq., volume 8, 73–83. 1937.
[Wal47] Abraham Wald. Sequential Analysis. John Wiley and Sons, New York, 1947.
[Wal80] W Allen Wallis. The statistical research group, 1942–1945. Journal of the American Statistical Association,
75(370):320–330, 1980.
[War65] Stanley L Warner. Randomized response: a survey technique for eliminating evasive answer bias. Journal of
the American Statistical Association, 60(309):63–69, 1965.
[Wec79] William E Wecker. Predicting the turning points of a time series. Journal of Business, pages 35–50, 1979.
[Whi83] Charles Whiteman. Linear Rational Expectations Models: A User's Guide. University of Minnesota Press,
Minneapolis, Minnesota, 1983.
[Woo15] Jeffrey M Wooldridge. Introductory econometrics: A modern approach. Nelson Education, 2015.
[YS05] G Alastair Young and Richard L Smith. Essentials of statistical inference. Cambridge University Press, 2005.
INDEX
A
A Problem that Stumped Milton Friedman, 869
An Introduction to Job Search, 513
Asset Pricing: Finite State Models, 1179

C
Central Limit Theorem, 163, 168
  Intuition, 168
  Multivariate Case, 173
CLT, 163

D
Dynamic Programming
  Computation, 656, 668
  Theory, 656
  Unbounded Utility, 656

E
Eigenvalues, 17, 33
Eigenvectors, 17, 33
Ergodicity, 339, 353

F
Finite Markov Asset Pricing
  Lucas Tree, 1186
Finite Markov Chains, 339, 340
  Stochastic Matrices, 340

I
Irreducibility and Aperiodicity, 339, 347

J
Job Search VI: On-the-Job Search, 559

K
Kalman Filter, 459
  Programming Implementation, 469
  Recursive Procedure, 468
Kalman Filter 2, 479
Kesten processes
  heavy tails, 434

L
Lake Model, 1073
Law of Large Numbers, 163, 164
  Illustration, 166
  Multivariate Case, 173
  Proof, 165
Linear Algebra, 17
  Differentiating Linear and Quadratic Forms, 37
  Eigenvalues, 33
  Eigenvectors, 33
  Matrices, 26
  Matrix Norms, 36
  Neumann's Theorem, 37
  Positive Definite Matrices, 37
  SciPy, 33
  Series Expansions, 36
  Spectral Radius, 37
  Vectors, 18
Linear Markov Perfect Equilibria, 1137
Linear State Space Models, 373, 433
  Distributions, 379, 380
  Ergodicity, 386
  Martingale Difference Shocks, 375
  Moments, 379
  Moving Average Representations, 379
  Prediction, 391
  Seasonals, 378
  Stationarity, 386
  Time Trends, 378
  Univariate Autoregressive Processes, 376
  Vector Autoregressions, 377
LLN, 163
LQ Control, 965
  Infinite Horizon, 977
  Optimality (Finite Horizon), 968

M
Marginal Distributions, 339, 345
Markov Asset Pricing
  Overview, 1179
Markov Chains, 340
  Calculating Stationary Distributions, 351
  Convergence to Stationarity, 352
  Cross-Sectional Distributions, 346
  Ergodicity, 353
  Forecasting Future Values, 354
  Future Probabilities, 346
  Irreducibility, Aperiodicity, 347
  Marginal Distributions, 345
  Simulation, 342
  Stationary Distributions, 350
Markov Perfect Equilibrium, 1135
  Applications, 1139
  Background, 1136
  Overview, 1135
Markov process, inventory, 363
Matrix
  Determinants, 31
  Inverse, 31
  Maps, 28
  Numpy, 27
  Operations, 26
  Solving Systems of Equations, 28
Modeling
  Career Choice, 545
Modeling COVID 19, 7
Models
  Harrison Kreps, 1225
  Linear State Space, 374
  Markov Asset Pricing, 1179
  McCall, 498
  On-the-Job Search, 559
  Permanent Income, 1013, 1029
  Pricing, 1180
  Sequential analysis, 869

N
Neumann's Theorem, 37

O
On-the-Job Search
  Model, 560
  Model Features, 559
  Parameterization, 560
  Programming Implementation, 561
  Solving for Policies, 564
Optimal Growth
  Model, 652, 668
  Policy Function, 663
  Policy Function Approach, 653
Optimal Growth I: The Stochastic Optimal Growth Model, 651
Optimal Growth II: Accelerating the Code with Numba, 667
Optimal Growth III: Time Iteration, 679
Optimal Growth IV: The Endogenous Grid Method, 691
Optimal Savings
  Computation, 702, 719
  Problem, 700
  Programming Implementation, 703

P
Pandas for Panel Data, 1239
Permanent Income II: LQ Techniques, 1029
Permanent Income Model
  Hall's Representation, 1021
  Savings Problem, 1014
Positive Definite Matrices, 37
Pricing Models, 1179, 1180
  Risk Aversion, 1180
  Risk-Neutral, 1180
Python
  Pandas, 1239
python, 93, 320

R
Rational Expectations Equilibrium, 1101
  Competitive Equilibrium (w. Adjustment Costs), 1104
  Computation, 1107
  Definition, 1104
  Planning Problem Approach, 1107

S
Spectral Radius, 37
Stability in Linear Rational Expectations Models, 1115
Stationary Distributions, 339, 350
Stochastic Matrices, 340

T
The Income Fluctuation Problem I: Basic Model, 699
The Income Fluctuation Problem II: Stochastic Returns on Assets, 717
The Permanent Income Model, 1013

U
Unbounded Utility, 656

V
Vectors, 17, 18
  Inner Product, 22
  Linear Independence, 25
  Norm, 22
  Operations, 19
  Span, 22
