Modern Bayesian Econometrics
LECTURES
BY TONY LANCASTER
January 2006
AN OVERVIEW
The main software used is WinBUGS:
http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
This is free software.
Practical classes using WinBUGS accompany these lectures.
The main programming and statistical software is R.
http://www.r-project.org/
This is also free software.
AIM
The aim of the course is to explain how to do econometrics the
Bayesian way.
Rev. Thomas Bayes (1702-1761)
METHOD
By computation.
Dominant approach since 1990.
Superseding the earlier, algebra-heavy approach.
OUTLINE
Examples
PRINCIPLES (Chapter 1)
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B).    (1)

The same rule holds for densities:

p(x|y) = p(y|x) p(x) / p(y).

Bayes' theorem for parameters θ and data y:

p(θ|y) = p(y|θ) p(θ) / p(y).    (2)

Notation for the data: y or y_obs.
By Bayes' theorem,

p(n|m) = p(m|n) p(n) / p(m) ∝ p(m|n) p(n).

Jeffreys' solution: take p(n) ∝ 1/n and p(m|n) = 1/n for n ≥ m, i.e. uniform. Then

p(n|m) ∝ 1/n²,    n ≥ m,

strictly decreasing in n with median (about) 2m. A reasonable guess, if he had to give a single number, would therefore be about twice the observed value.
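A small R sketch of this posterior (with an illustrative observed value m = 100, not from the lecture): the posterior is proportional to 1/n² for n ≥ m and its median is close to 2m.

m <- 100
n <- m:(200 * m)                  # grid wide enough to cover most of the tail
post <- (1 / n^2) / sum(1 / n^2)  # posterior proportional to 1/n^2 for n >= m
cdf <- cumsum(post)
n[which(cdf >= 0.5)[1]]           # posterior median, roughly 2 * m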
2. A Medical Shock

A rare but horrible disease D, or its absence D̄.
A powerful diagnostic test with results + (!) or −.
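A hedged numerical illustration in R (the prevalence, sensitivity and specificity below are made-up values, not taken from the lecture): even a powerful test yields a modest Pr(D|+) when D is rare.

prD  <- 0.001                       # assumed prevalence of D
sens <- 0.99                        # assumed Pr(+|D)
spec <- 0.99                        # assumed Pr(-|no D)
prPlus <- sens * prD + (1 - spec) * (1 - prD)
sens * prD / prPlus                 # Pr(D|+) by Bayes' theorem, about 0.09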
3. Paradise Lost?1
If your friend read you her favourite line of poetry and told you it was
line (2, 5, 12, 32, 67) of the poem, what would you predict for the total
length of the poem?
Let l be the total length of the poem and y the observed line number. Then by Bayes' theorem

p(l|y) ∝ p(y|l) p(l).

Take p(y|l) = 1/l (uniform) and p(l) ∝ l^(−γ). Then

p(l|y) ∝ l^(−(1+γ)),    l ≥ y.    (*)

The density p(y|l) captures the idea that the favourite line is equally likely to be anywhere in the poem; the density p(l) is empirically roughly accurate for some γ.
Experimental subjects asked these (and many similar) questions reply with predictions consistent with the median of (*).
1. Psychological Science.
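A quick check of the prediction implied by (*): the posterior survivor function is Pr(L > l | y) = (y/l)^γ for l ≥ y, so the posterior median m solves (y/m)^γ = 1/2, giving m = y · 2^(1/γ). With γ = 1, for example, the predicted total length is simply twice the observed line number.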
INTERPRETATION OF Pr(.)
Probability as rational degree of belief in a proposition.
Not limiting relative frequency. Not equally likely cases.
Ramsey, "Truth and Probability" (1926).
See the web page for links to Ramsey's essay.
Persi Diaconis: "Coins don't have probabilities, people do. Coins don't have little numbers P hidden inside them."
Later, de Finetti: "Probability does not exist."
The likelihood is p(y|θ). The posterior distribution is

p(θ|y) = p(y|θ) p(θ) / ∫ p(y|θ) p(θ) dθ.
Figure 2: [sequence of posterior densities as the number of observations increases]
Note:
1. The previous posterior becomes the new prior
2. Beliefs seem to become more concentrated as the number of observations increases.
3. Posteriors seem to look more normal as the number of observations
increases.
4. (Not shown here) The prior has less and less influence on the posterior as n → ∞.
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
p(θ) ∝ 1,    −∞ < θ < ∞,    (3)

p(σ) ∝ 1/σ,    σ > 0.    (4)

These priors are improper; they can nonetheless be used provided the resulting posterior is proper.
Some people want objective priors that can be generated by applying a rule.
In particular there is a desire for a rule that can generate noninformative priors.
Others are content to form priors subjectively and then to study the effect, if any, of changing them.
There are several general rules, which I'll mention fairly briefly. They all have drawbacks.
Subjective Priors
Economic agents have subjective priors, but what about econometricians?
Econometric modelling is arguably subjective.
Arguments that an instrumental variable is valid are typically subjective.
Why is randomized allocation of treatments convincing?
Can always study sensitivity of inferences to changes in the prior.
It is often useful to think about the prior for a vector of parameters in stages. Suppose that θ = (θ_1, θ_2, ..., θ_n) and λ is a parameter of lower dimension than θ. Then to get p(θ) consider p(θ|λ) p(λ) = p(θ, λ), so that p(θ) = ∫ p(θ|λ) p(λ) dλ. λ is a hyperparameter. And

p(θ, λ, y) = p(y|θ) p(θ|λ) p(λ).
E(θ_i|y_1, ..., y_n) = (a y_i + b ȳ)/(a + b),

a weighted average of the i'th observation and the grand mean, with weights determined by the prior precisions.
p(θ_i|y) = ∫ p(θ|y) dθ_(i),    (9)

where θ_(i) denotes the elements of θ other than θ_i.
The prior has precision 1/τ², so β is normally distributed. The likelihood has the shape of a normal density with mean b and precision Σ_i y_i².
[Figure: histogram and likelihood plots for beta]

r = Σ_{t=2}^T y_t y_{t−1} / Σ_{t=2}^T y_{t−1}²

[Figure: a realization of the series and the likelihood plotted against rho]
[Figure: likelihood plotted against beta, two panels]

[Figure: a density plotted against y, −∞ < y < ∞, and exp(−val) plotted against theta]
Pr(Y = y_l) = p_l,    l = 1, 2, ..., L,    (10)

ℓ(p; y) ∝ Π_{l=1}^L p_l^{n_l},    (11)

where n_l is the number of observations equal to y_l. The posterior weights become (normalized) unit exponentials as the Dirichlet prior parameters {α_l} → 0.
Σ_l p_l y_l = μ.    (12)

A draw from the posterior distribution of μ is then

μ = Σ_i g_i y_i / Σ_i g_i,    g_i iid unit exponential.    (13)
For example, in R:

g <- rexp(n)                # one unit exponential weight per observation
mu <- sum(g * y) / sum(g)   # one draw from the posterior of mu
Equation (12) is a moment condition. This is a Bayesian version of the method of moments. (We'll give another later.) It is also called the Bayesian Bootstrap.
To see why this is called a bootstrap, and for the precise connection with the frequentist bootstrap, see my paper "A Note on Bootstraps and Robustness" on the web site.
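A minimal sketch (with simulated data, purely for illustration) that repeats the draw above to build up the Bayesian bootstrap posterior for the mean:

set.seed(1)
y <- rnorm(100, mean = 2)          # any sample will do
B <- 5000
mu.draws <- replicate(B, {
  g <- rexp(length(y))             # fresh exponential weights each time
  sum(g * y) / sum(g)              # one posterior draw of the mean
})
quantile(mu.draws, c(0.025, 0.5, 0.975))   # posterior summary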
p(s|n, θ) = C(n, s) θ^s (1 − θ)^(n−s),    s = 0, 1, 2, ..., n,    0 ≤ θ ≤ 1.    (14)

Here n as well as θ can be treated as a parameter; with, for example, θ = 1/2 and s = 7 successes observed, the posterior for n is supported on n ≥ 7.
[Figure: posterior distribution of n]
Another example: which model is true? The label of the true(!) model is a parameter. It will have a prior distribution and, if data are available, it will have a posterior distribution.
Bayesian inference is based entirely upon the (marginal) posterior distribution of the quantity of interest.
Point Estimation

Posterior mode(s), mean etc.
Or take a decision theory perspective (pp. 56-57): choose θ̂ to minimize the expected posterior loss

∫ loss(θ̂, θ) p(θ|y) dθ

with respect to θ̂. Quadratic loss,

loss(θ̂, θ) = (θ̂ − θ)²,

leads to the posterior mean; absolute error loss,

loss(θ̂, θ) = |θ̂ − θ|,

leads to the posterior median.
The posterior is approximately such that

E(θ|y) = r,    V(θ|y) = r(1 − r)/n.

Notice the asymptotic irrelevance of the prior (if it is NOT dogmatic).
p(y|θ) = 1/θ for 0 ≤ y ≤ θ, and 0 elsewhere.    (15)

With n observations and the prior p(θ) ∝ 1/θ the posterior is

p(θ|y) ∝ θ^(−(n+1)),    θ ≥ y_max,    (16)

which has mean (n/(n − 1)) y_max.
Use

p(y) = ∫ p(y|θ) p(θ) dθ,

p(ỹ|y) = ∫ p(ỹ|y, θ) p(θ|y) dθ.
47
s2
1
2
ry
)
}
(y
exp{
n+1
n
2 s2 + yn2
which is normal with mean equal to ryn and precision s2/(s2 +yn2 ) < 1.
p(yn+1|y) is the predictive density of yn+1.
p(y) says what you think the data should look like.
You can use it to check a model by:
1. Choose a test statistic, T(y)
2. Calculate its predictive distribution from that of y
3. Find T(y_obs) and see if it is probable or not.
Step 2 can be done by sampling (a sketch follows below):
1. Sample θ from p(θ)
2. Sample y from p(y|θ) and form T(y)
3. Repeat many times.
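A minimal R sketch of this predictive check, assuming (purely for illustration) a normal model y_i ~ n(μ, 1) with prior μ ~ n(0, 1), a sample of size 50, and T(y) = max(y):

set.seed(1)
n <- 50
Tobs <- 3.1                    # illustrative value of T(y_obs)
Tsim <- replicate(2000, {
  mu <- rnorm(1, 0, 1)         # 1. sample from the prior
  y  <- rnorm(n, mu, 1)        # 2. sample data given the parameter
  max(y)                       #    and form T(y)
})
mean(Tsim >= Tobs)             # predictive p-value for the observed statistic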
Let M_j denote the j'th of J models and let the data be y. Then by Bayes' theorem the posterior probability of this model is

P(M_j|y) = p(y|M_j) P_j / p(y),

where p(y) = Σ_{j=1}^J p(y|M_j) P_j.
The p(y|M_j) are the predictive distributions of the data under the two hypotheses and their ratio is the Bayes factor.
P(y_obs|θ = 1) / P(y_obs|θ = −1) = e^(−(1/2)(y−1)²) / e^(−(1/2)(y+1)²),    (18)

and so, if the hypotheses are equally probable a priori, the posterior odds are

P(θ = 1|y_obs) / P(θ = −1|y_obs) = e^(2y).

If y > 0 then θ = 1 is more probable than θ = −1; y < 0 makes θ = −1 more probable.
(19)
where the R_j are the residual sums of squares in the two models and the k_j are the numbers of coefficients.
For example:

Model 1:    y = β_1 x_1 + β_2 x_2 + ε_1    (20)

Model 2:    y = β_1 x_1 + β_2 x_3 + ε_2    (21)
Model Averaging
For prediction purposes one might not want to use the most probable
model. Instead it is optimal, for certain loss functions, to predict from
an average model using
p(ỹ|y) = Σ_j p(ỹ, M_j|y) = Σ_j P(M_j|y) p(ỹ|M_j, y).
y = Xβ + ε.    (22)
p_i = g_i / Σ_{j=1}^n g_j    for i = 1, 2, ..., n.    (27)

Write

β̃ = (X′GX)^(−1) X′Gy,    G = diag{g_i}.
The approximate posterior variance of β̃ is

(X′X)^(−1) X′DX (X′X)^(−1),    D = diag{e_i²},    (28)

where e = y − Xb.
This posterior distribution for β is the Bayesian bootstrap distribution. It is robust against heteroscedasticity and non-normality.
The BB can be implemented by weighted regression with weights equal to rexp(n); see the exercises and the sketch below.
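A minimal sketch (with simulated, heteroscedastic data, purely for illustration) of the Bayesian bootstrap for a regression slope using weighted least squares:

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n) * (1 + abs(x))    # heteroscedastic errors
B <- 2000
bb <- numeric(B)
for (b in 1:B) {
  w <- rexp(n)                                # exponential weights
  bb[b] <- coef(lm(y ~ x, weights = w))["x"]  # one BB draw of the slope
}
quantile(bb, c(0.025, 0.5, 0.975))            # BB posterior interval for the slope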
coefficient   ols   se   BB mean   White se   BB se
b0            .     .    .         .128       .124
b1            .     .    .         .091       .096
b2            .     .    .         .134       .134
Suppose that all you have are moment restrictions of the form E g(y, θ) = 0. But Bayesian inference needs a likelihood. One way to proceed (Schennach, Biometrika 92(1), 2005) is to construct a maximum entropy distribution supported on the observed data. This gives probability p_i to observation y_i. As we have seen, the unrestricted maxent distribution assigns probability 1/n to each data point, which is the solution to
max_p − Σ_{i=1}^n p_i log p_i    subject to    Σ_{i=1}^n p_i = 1.    (29)

When the moment constraint Σ_{i=1}^n p_i g(y_i, θ) = 0 is added, the solution has the form

p_i(θ) = exp{λ(θ)′g(y_i, θ)} / Σ_{j=1}^n exp{λ(θ)′g(y_j, θ)},

where the {λ(θ)} are the Lagrange multipliers associated with the moment restrictions.
Here is an example:
Estimation of the 25% quantile
Use the single moment
1(y ≤ θ) − 0.25,

which has expectation zero when θ is the 25% quantile. Figure 7 shows the posterior density of the 25% quantile based on a sample of size 100 under a uniform prior. The vertical line is the sample 25% quantile. This method extends to any GMM setting including linear and non-linear models, discrete choice, instrumental variables etc.
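A rough R sketch of this calculation under stated assumptions (simulated data, a uniform prior over a grid, and the tilting parameter λ(θ) found from its dual problem at each grid point):

set.seed(1)
y <- rnorm(100)                            # illustrative sample of size 100
gfun <- function(theta) (y <= theta) - 0.25
loglik <- function(theta) {
  g <- gfun(theta)
  # dual problem: the tilting parameter minimizes sum(exp(lambda * g))
  lam <- optimize(function(l) sum(exp(l * g)), c(-30, 30))$minimum
  p <- exp(lam * g); p <- p / sum(p)       # maxent probabilities p_i(theta)
  sum(log(p))                              # log likelihood at theta
}
grid <- seq(quantile(y, 0.10), quantile(y, 0.45), length.out = 200)
ll <- sapply(grid, loglik)
post <- exp(ll - max(ll)); post <- post / sum(post)   # uniform prior on the grid
plot(grid, post, type = "l")
abline(v = quantile(y, 0.25))              # sample 25% quantile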
When the object of interest is, say, h(θ), a scalar or vector function of θ, Bayesian inferences are based on the marginal distribution of h. How do we obtain this?
The answer is the sampling principle (just illustrated in the normal linear model and on several other occasions) that underlies all modern Bayesian work.
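A minimal sketch of the sampling principle, assuming (for illustration only) that posterior draws of two coefficients are available and that the object of interest is the ratio h(θ) = θ_1/θ_2:

set.seed(1)
# pretend these are draws from the joint posterior (MCMC or direct sampling)
theta1 <- rnorm(5000, 1.0, 0.1)
theta2 <- rnorm(5000, 2.0, 0.2)
h <- theta1 / theta2              # apply h to every joint draw
plot(density(h))                  # marginal posterior of h
quantile(h, c(0.025, 0.975))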
[Figure: kernel density estimate of the marginal posterior of h, density(x = h, width = 1)]
A chain is characterized by its transition kernel, whose elements provide the conditional probabilities of θ^(t+1) given the value of θ^(t). The kernel is denoted by K(x, y).
For a two-state chain with Pr(2|1) = α and Pr(1|2) = β, the t-step kernel is

K^t = 1/(α+β) | β  α |  +  (1 − α − β)^t/(α+β) |  α  −α |,    (30)
              | β  α |                          | −β   β |

so that, as t → ∞, every row converges to the same invariant distribution with

Pr(θ = 2) = α/(α + β).    (31)
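A small R check of (30) and (31), with illustrative values α = 0.3, β = 0.1: iterating the kernel drives every row toward the invariant distribution.

alpha <- 0.3; beta <- 0.1
K <- matrix(c(1 - alpha, alpha,
              beta, 1 - beta), nrow = 2, byrow = TRUE)
Kt <- diag(2)
for (t in 1:50) Kt <- Kt %*% K    # K^50
Kt                                # both rows are (nearly) the invariant distribution
c(beta, alpha) / (alpha + beta)   # (Pr(theta = 1), Pr(theta = 2)) from (31)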
The invariant distribution p satisfies p = pK, or

p(y) = ∫ K(x, y) p(x) dx.    (*)

For example, take

K(x, y) = (1/√(2π)) e^(−(y − ρx)²/2),    p(x) = (√(1 − ρ²)/√(2π)) e^(−(1 − ρ²)x²/2).

Try (*):

∫ K(x, y) p(x) dx = ∫ (1/√(2π)) e^(−(y − ρx)²/2) (√(1 − ρ²)/√(2π)) e^(−(1 − ρ²)x²/2) dx
                  = (√(1 − ρ²)/√(2π)) e^(−(1 − ρ²)y²/2) ∫ (1/√(2π)) e^(−(x − ρy)²/2) dx
                  = (√(1 − ρ²)/√(2π)) e^(−(1 − ρ²)y²/2)
                  = p(y).
1. Choose θ_1^(0)
2. Sample θ_2^(1) from p(θ_2|θ_1^(0))
3. Sample θ_1^(1) from p(θ_1|θ_2^(1))
and so on, alternating between the two conditionals (a sketch follows below).
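A minimal sketch of this alternation for a bivariate normal target with correlation ρ (an assumed illustrative example), where both full conditionals are known normals:

set.seed(1)
rho <- 0.9
S <- 2000
th <- matrix(0, S, 2)            # th[s, ] holds (theta1, theta2) at step s
th1 <- 0                         # 1. choose a starting value for theta1
for (s in 1:S) {
  th2 <- rnorm(1, rho * th1, sqrt(1 - rho^2))   # 2. theta2 | theta1
  th1 <- rnorm(1, rho * th2, sqrt(1 - rho^2))   # 3. theta1 | theta2
  th[s, ] <- c(th1, th2)
}
plot(th[, 1], th[, 2])           # draws from the joint distribution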
Figure 11: [two scatter plots of the sampled values, plotted against y1]
y* = x′β + ε,    ε ~ n(0, 1),    (32)

y = I{y* > 0}.    (33)
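A rough sketch of data augmentation here, assuming a simplified version of (32)-(33) with a single unknown mean μ and a flat prior: the latent y* are drawn from truncated normals given μ, then μ from its normal conditional given y*.

set.seed(1)
n <- 200
y <- as.numeric(rnorm(n, 0.5) > 0)        # simulated binary data (true mean 0.5)
rtnorm1 <- function(mean, lower, upper) {
  # inverse-cdf draw from N(mean, 1) truncated to (lower, upper)
  u <- runif(1, pnorm(lower - mean), pnorm(upper - mean))
  mean + qnorm(u)
}
S <- 2000; mu <- 0; mu.draws <- numeric(S)
for (s in 1:S) {
  ystar <- sapply(1:n, function(i)
    if (y[i] == 1) rtnorm1(mu, 0, Inf) else rtnorm1(mu, -Inf, 0))
  mu <- rnorm(1, mean(ystar), sqrt(1 / n))   # mu | ystar under a flat prior
  mu.draws[s] <- mu
}
quantile(mu.draws[-(1:500)], c(0.025, 0.5, 0.975))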
For another example consider optimal job search. Agents receive job offers and accept the first offer to exceed a reservation wage w*. The econometrician observes the time to acceptance, t, and the accepted wage, w_a. Offers come from a distribution function F(w) (with F̄ = 1 − F) and arrive in a Poisson process of rate λ. Duration and accepted wage then have joint density

λ e^(−λF̄(w*)t) f(w_a);    w_a ≥ w*,  t ≥ 0.

This is rather awkward. But consider latent data consisting of the rejected wages (if any) and the times at which these offers were received. Let θ = (λ, w*) plus any parameters of the wage offer distribution, and let w, s be the rejected offers and their times of arrival. Data augmentation includes w, s as additional parameters, and a Gibbs algorithm would sample in turn from p(θ|w, s, w_a, t) and p(w, s|θ, w_a, t), both of which take a very simple form.
A judicious choice of latent data radically simplifies inference about
quite complex structural models.
But the most important development has been the production of black box general purpose software that enables the
user to input his model and data and receive MCMC realizations from the posterior as output without the user worrying
about the particular chain that is being used for his problem.
(This is somewhat analogous to the development in the frequentist literature of general purpose function minimization
routines.)
Of the packages available now probably the most widely used is BUGS
which is freely distributed from
http://www.mrc-bsu.cam.ac.uk/bugs/
BUGS stands for Bayesian inference Using Gibbs Sampling, though in fact it uses a variety of algorithms and not merely the Gibbs sampler.
As with any package you need to provide the program with two things:
The model
The data
To supply the model you do not simply choose from a menu of models.
BUGS is more flexible in that you can give it any model you like!
(Though there are some models that require some thought before they
can be written in a way acceptable to BUGS.)
For a Bayesian analysis the model is, of course, the likelihood and the
prior.
Here is an example of a BUGS model statement for a first order autoregressive model with autoregression coefficient rho, intercept alpha and error precision tau.
model{
for( i in 2:T){y[i] ~dnorm(mu[i], tau)
mu[i] <- alpha + rho * y[i-1]
}
alpha ~dnorm(0, 0.001)
rho ~dnorm(0, 0.001)
tau ~dgamma(0.001,0.001)
}
Lines two and three are the likelihood. Lines five, six and seven are the prior. In this case alpha, rho and tau are independent with distributions having low precision (high variance). For example, alpha has mean zero and precision 0.001, i.e. variance 1,000.
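As a sketch of the other ingredient, the data, one might simulate an AR(1) series in R and pass it to WinBUGS as a list whose names match the variables in the model statement (the parameter values below are illustrative):

set.seed(1)
T <- 100
y <- numeric(T)                                        # y[1] left at 0
for (t in 2:T) y[t] <- 0.1 + 0.9 * y[t-1] + rnorm(1)   # alpha = 0.1, rho = 0.9, tau = 1
# data for WinBUGS, e.g. pasted into the data window:
# list(T = 100, y = c(...))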
Model:

y_1 = b0 + b1 y_2 + ε_1
y_2 = c0 + c1 z_1 + c2 z_2 + ε_2.

# 2-equation overidentified recursive model with 2 exogenous variables.
# Modelled as a restricted reduced form.
model{
for(i in 1:n){
y[i,1:2] ~dmnorm(mu[i,],R[,])
mu[i,1] <- b0 + b1*c0 + b1*c1*z[i,1] + b1*c2*z[i,2]
mu[i,2] <- c0 + c1*z[i,1] + c2*z[i,2]
}
R[1:2,1:2] ~dwish(Omega[,],4)
b0 ~dnorm(0,0.0001)
b1 ~dnorm(0,0.0001)
c0 ~dnorm(0,0.0001)
c1 ~dnorm(0,0.0001)
c2 ~dnorm(0,0.0001)
}
node     mean     sd        MC error   2.5%      median   97.5%    start   sample
R[1,1]   0.9647   0.04334   4.708E-4   0.8806    0.9645   1.052    2501    7500
R[2,1]   0.0766   0.03233   3.458E-4   0.01316   0.077    0.1396   2501    7500
R[2,2]   1.055    0.064     .          .         1.054    1.15     2501    7500
b0       .        .         .          .         .        .        2501    7500
b1       0.6396   0.2248    0.004277   0.2641    .        .        2501    7500
c0       0.0407   0.03111   .          .         .        .        2501    7500
c1       0.1442   0.0284    4.031E-4   .         .        .        2501    7500
c2       .        .         .          .         .        .        2501    7500
Years of education by quarter of birth:

                       first     second    third     fourth
years of education     13.003    13.013    12.989    13.116

The differences are small; the difference between 13.116 and 12.989 is about six and a half weeks of education, and this is about one percent of the median years of education in the whole sample.
The instruments qualify as weak.
0.1.1 Diffuse Prior

The sampler was run through 2000 steps with the first 1000 discarded, and three parallel chains were run to check convergence; this gives a total of 3000 realizations from the joint posterior. The first figure shows the marginal posterior for β under the diffuse prior setup. The left frame gives the entire histogram for all 3000 realizations and the right frame provides a close-up of the middle 90% of the distribution.
[Figure: histogram of the marginal posterior of beta (left) and a close-up of the central portion, "Rate of Return - Central Portion" (right)]
The distribution is not normal: it is thick tailed. It is also very far from the prior which, if superimposed on the left hand frame, would look almost flat.
We have learned something but not much. As can be seen on the right
frame, the posterior points towards rates of return of the order of 10
to 20% a year, but very considerable uncertainty remains. The ideal
might be to be able to pin down the rate of return to within one or
two percentage points, but we are very far from this accuracy. The
posterior mean of β under this prior is 17% and the median is 15%.
75% of all realizations lie within the range from 10% to 21%.
0.1.2 Informative Prior

For contrast let us see what happens when we use the relatively informative prior in which β is n(0.10, 100) and the three coefficients are constrained to be similar. We again ran the sampler for 2000 steps three times and retained the final 1000 of each chain, giving 3000 realizations from the joint posterior distribution. The marginal posterior histogram for β is shown in the next figure, with the prior density superimposed as the solid line.
[Figure: "Posterior of Beta" histogram with the prior density superimposed as a solid line]
0.1.3

              lower   upper
diffuse         .     46%
informative    1%     25%

The length of the confidence interval has been sharply reduced, but it is still very wide. Another comparison is to look at the quantiles and these, expressed in percentages, are

              min    q25   q50   mean   q75   max
diffuse       -40    10    15    17     21    220
informative   -15     8    12     .     16     .
0.1.4 Conclusions
What do we conclude?
It appears that even with about 36,000 observations and a simple model, weak instruments leave the rate of return very imprecisely determined.
0.2 Is Education Endogenous?

[Figure: scatter plot of wage against years of education]
The coefficients are very sharply determined and the posterior mean, median and mode of the rate of return is 6.8% with a standard error of one tenth of one percent. Why can't we accept this estimate? The answer is, of course, that we have been presuming that education is endogenous, that its effect on wages is confounded with that of the numerous other potential determinants of wages. And under this belief the least squares estimates are (very precisely) wrong. But there is still hope for the least squares estimate since we have not, yet, shown that this presumption is true!
Consider how we might check to see whether education really is endogenous. We reproduce the structural form of the model for convenience here. It is
educ_i = α + π z_i + ε_{1i}
wage_i = δ + β educ_i + ε_{2i},
where z stands for the quarter of birth instruments. We showed earlier that if ε_2 and ε_1 are uncorrelated then the second equation is a regression, the system is fully recursive, and standard regression methods, like least squares, can be used to make inferences about β. But if ε_2 and ε_1 are correlated then these methods are inappropriate and we must proceed as above, with the consequent disappointing precision.
So why don't we see if these errors are correlated? One way of doing this is to look at the posterior distribution of the correlation coefficient of ε_2 and ε_1. We can do this as follows. For purposes of our calculation we represented the model as the restricted reduced form

educ_i = α + π z_i + ν_{1i}
wage_i = γ + βπ z_i + ν_{2i}.
The relation between the structural and reduced form errors is

( ε_1 )   (  1    0 ) ( ν_1 )
(     ) = (          ) (     )    (37)
( ε_2 )   ( −β    1 ) ( ν_2 )
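A sketch of how the posterior of this correlation could be computed from the MCMC output, assuming that the saved draws of the reduced form precision matrix R and of β (b1 in the BUGS program) are held in vectors R11, R21, R22 and b1 (hypothetical object names):

# one correlation draw per retained MCMC step
rho.eps <- sapply(seq_along(b1), function(s) {
  Sigma <- solve(matrix(c(R11[s], R21[s], R21[s], R22[s]), 2, 2))  # reduced form error covariance
  A <- matrix(c(1, -b1[s], 0, 1), 2, 2)                            # eps = A %*% nu, as in (37)
  V <- A %*% Sigma %*% t(A)                                        # structural error covariance
  V[1, 2] / sqrt(V[1, 1] * V[2, 2])
})
hist(rho.eps)                                                      # posterior of the correlation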
[Figure: histogram of the posterior draws of the correlation of the structural errors]
The posterior suggests that the structural form errors are negatively correlated, and this is a bit surprising on the hypothesis that a major element of both ε_1 and ε_2 is ability and this variable tends to affect both education and wages positively. But the evidence is very far from conclusive.