Modern Bayesian Econometrics
LECTURES
BY TONY LANCASTER
January 2006
AN OVERVIEW
The main software used is WinBUGS:
http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
This is free software.
Practical classes using WinBUGS accompany these lectures.
The main programming and statistical software is R.
http://www.r-project.org/
This is also free software.
AIM
The aim of the course is to explain how to do econometrics the
Bayesian way.
Rev. Thomas Bayes (1702-1761)
METHOD
By computation.
Dominant approach since 1990.
Superseding the earlier, algebra-heavy approach.
OUTLINE
Examples
PRINCIPLES (Chapter 1)
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B).    (1)

The same rule holds for densities:

p(x|y) = p(y|x) p(x) / p(y).

Bayes' theorem for parameters θ and data y:

p(θ|y) = p(y|θ) p(θ) / p(y).    (2)

Notation for the data: y or y_obs.
By Bayes' theorem,

p(n|m) = p(m|n) p(n) / p(m) ∝ p(m|n) p(n).

Jeffreys' solution: take p(n) ∝ 1/n and p(m|n) = 1/n for n ≥ m, i.e. uniform. Then

p(n|m) ∝ 1/n²,    n ≥ m,

strictly decreasing in n with median (about) 2m. A reasonable guess, if he had to give a single number, would therefore be about twice the observed value.
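A small R sketch of this posterior (with an illustrative observed value m = 100, not from the lecture): the posterior is proportional to 1/n² for n ≥ m and its median is close to 2m.

m <- 100
n <- m:(200 * m)                  # grid wide enough to cover most of the tail
post <- (1 / n^2) / sum(1 / n^2)  # posterior proportional to 1/n^2 for n >= m
cdf <- cumsum(post)
n[which(cdf >= 0.5)[1]]           # posterior median, roughly 2 * m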
2. A Medical Shock

A rare but horrible disease D, or its absence D̄.
A powerful diagnostic test with results + (!) or −.
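A hedged numerical illustration in R (the prevalence, sensitivity and specificity below are made-up values, not taken from the lecture): even a powerful test yields a modest Pr(D|+) when D is rare.

prD  <- 0.001                       # assumed prevalence of D
sens <- 0.99                        # assumed Pr(+|D)
spec <- 0.99                        # assumed Pr(-|no D)
prPlus <- sens * prD + (1 - spec) * (1 - prD)
sens * prD / prPlus                 # Pr(D|+) by Bayes' theorem, about 0.09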
3. Paradise Lost?1
If your friend read you her favourite line of poetry and told you it was
line (2, 5, 12, 32, 67) of the poem, what would you predict for the total
length of the poem?
Let l be the total length of the poem and y the observed line number. Then by Bayes' theorem

p(l|y) ∝ p(y|l) p(l).

Take p(y|l) = 1/l (uniform) and p(l) ∝ l^(−γ). Then

p(l|y) ∝ l^(−(1+γ)),    l ≥ y.    (*)

The density p(y|l) captures the idea that the favourite line is equally likely to be anywhere in the poem; the density p(l) is empirically roughly accurate for some γ.
Experimental subjects asked these (and many similar) questions reply with predictions consistent with the median of (*).
1. Psychological Science.
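A quick check of the prediction implied by (*): the posterior survivor function is Pr(L > l | y) = (y/l)^γ for l ≥ y, so the posterior median m solves (y/m)^γ = 1/2, giving m = y · 2^(1/γ). With γ = 1, for example, the predicted total length is simply twice the observed line number.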
INTERPRETATION OF Pr(.)
Probability as rational degree of belief in a proposition.
Not limiting relative frequency. Not equally likely cases.
Ramsey, "Truth and Probability" (1926).
See the web page for links to Ramsey's essay.
Persi Diaconis: "Coins don't have probabilities, people do. Coins don't have little numbers P hidden inside them."
Later, de Finetti: "Probability does not exist."
The likelihood is p(y|θ). The posterior distribution is

p(θ|y) = p(y|θ) p(θ) / ∫ p(y|θ) p(θ) dθ.
Figure 2: [sequence of posterior densities as the number of observations increases]
Note:
1. The previous posterior becomes the new prior
2. Beliefs seem to become more concentrated as the number of observations increases.
3. Posteriors seem to look more normal as the number of observations
increases.
4. (Not shown here) The prior has less and less influence on the posterior as n → ∞.
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
p(θ) ∝ 1,    −∞ < θ < ∞,    (3)

p(σ) ∝ 1/σ,    σ > 0.    (4)

These priors are improper; they can nonetheless be used provided the resulting posterior is proper.
Some people want objective priors that can be generated by applying a rule.
In particular there is a desire for a rule that can generate noninformative priors.
Others are content to form priors subjectively and then to study the effect, if any, of changing them.
There are several general rules, which I'll mention fairly briefly. They all have drawbacks.
Subjective Priors
Economic agents have subjective priors, but what about econometricians?
Econometric modelling is arguably subjective.
Arguments that an instrumental variable is valid are typically subjective.
Why is randomized allocation of treatments convincing?
Can always study sensitivity of inferences to changes in the prior.
It is often useful to think about the prior for a vector of parameters in stages. Suppose that θ = (θ_1, θ_2, ..., θ_n) and λ is a parameter of lower dimension than θ. Then to get p(θ) consider p(θ|λ) p(λ) = p(θ, λ), so that p(θ) = ∫ p(θ|λ) p(λ) dλ. λ is a hyperparameter. And

p(θ, λ, y) = p(y|θ) p(θ|λ) p(λ).
E(θ_i|y_1, ..., y_n) = (a y_i + b ȳ)/(a + b),

a weighted average of the i'th observation and the grand mean, with weights determined by the prior precisions.
p(θ_i|y) = ∫ p(θ|y) dθ_(i),    (9)

where θ_(i) denotes the elements of θ other than θ_i.
The prior has precision 1/τ², so β is normally distributed. The likelihood has the shape of a normal density with mean b and precision Σ_i y_i².
[Figure: histogram and likelihood plots for beta]

r = Σ_{t=2}^T y_t y_{t−1} / Σ_{t=2}^T y_{t−1}²

[Figure: a realization of the series and the likelihood plotted against rho]
[Figure: likelihood plotted against beta, two panels]

[Figure: a density plotted against y, −∞ < y < ∞, and exp(−val) plotted against theta]
Pr(Y = y_l) = p_l,    l = 1, 2, ..., L,    (10)

ℓ(p; y) ∝ Π_{l=1}^L p_l^{n_l},    (11)

where n_l is the number of observations equal to y_l. The posterior weights become (normalized) unit exponentials as the Dirichlet prior parameters {α_l} → 0.
Σ_l p_l y_l = μ.    (12)

A draw from the posterior distribution of μ is then

μ = Σ_i g_i y_i / Σ_i g_i,    g_i iid unit exponential.    (13)
For example, in R:

g <- rexp(n)                # one unit exponential weight per observation
mu <- sum(g * y) / sum(g)   # one draw from the posterior of mu
Equation (12) is a moment condition. This is a Bayesian version of the method of moments. (We'll give another later.) It is also called the Bayesian Bootstrap.
To see why this is called a bootstrap, and for the precise connection with the frequentist bootstrap, see my paper "A Note on Bootstraps and Robustness" on the web site.
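A minimal sketch (with simulated data, purely for illustration) that repeats the draw above to build up the Bayesian bootstrap posterior for the mean:

set.seed(1)
y <- rnorm(100, mean = 2)          # any sample will do
B <- 5000
mu.draws <- replicate(B, {
  g <- rexp(length(y))             # fresh exponential weights each time
  sum(g * y) / sum(g)              # one posterior draw of the mean
})
quantile(mu.draws, c(0.025, 0.5, 0.975))   # posterior summary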
p(s|n, θ) = C(n, s) θ^s (1 − θ)^(n−s),    s = 0, 1, 2, ..., n,    0 ≤ θ ≤ 1.    (14)

Here n as well as θ can be treated as a parameter; with, for example, θ = 1/2 and s = 7 successes observed, the posterior for n is supported on n ≥ 7.
[Figure: posterior distribution of n]
Another example: which model is true? The label of the true(!) model is a parameter. It will have a prior distribution and, if data are available, it will have a posterior distribution.
Bayesian inference is based entirely upon the (marginal) posterior distribution of the quantity of interest.
Point Estimation

Posterior mode(s), mean etc.
Or take a decision theory perspective (pp. 56-57): choose θ̂ to minimize the expected posterior loss

∫ loss(θ̂, θ) p(θ|y) dθ

with respect to θ̂. Quadratic loss,

loss(θ̂, θ) = (θ̂ − θ)²,

leads to the posterior mean; absolute error loss,

loss(θ̂, θ) = |θ̂ − θ|,

leads to the posterior median.
The posterior is approximately such that

E(θ|y) = r,    V(θ|y) = r(1 − r)/n.

Notice the asymptotic irrelevance of the prior (if it is NOT dogmatic).
p(y|θ) = 1/θ for 0 ≤ y ≤ θ, and 0 elsewhere.    (15)

With n observations and the prior p(θ) ∝ 1/θ the posterior is

p(θ|y) ∝ θ^(−(n+1)),    θ ≥ y_max,    (16)

which has mean (n/(n − 1)) y_max.
Use

p(y) = ∫ p(y|θ) p(θ) dθ,

p(ỹ|y) = ∫ p(ỹ|y, θ) p(θ|y) dθ.
47
s2
1
2
ry
)
}
(y
exp{
n+1
n
2 s2 + yn2
which is normal with mean equal to ryn and precision s2/(s2 +yn2 ) < 1.
p(yn+1|y) is the predictive density of yn+1.
p(y) says what you think the data should look like.
You can use it to check a model by:
1. Choose a test statistic, T(y)
2. Calculate its predictive distribution from that of y
3. Find T(y_obs) and see if it is probable or not.
Step 2 can be done by sampling (a sketch follows below):
1. Sample θ from p(θ)
2. Sample y from p(y|θ) and form T(y)
3. Repeat many times.
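A minimal R sketch of this predictive check, assuming (purely for illustration) a normal model y_i ~ n(μ, 1) with prior μ ~ n(0, 1), a sample of size 50, and T(y) = max(y):

set.seed(1)
n <- 50
Tobs <- 3.1                    # illustrative value of T(y_obs)
Tsim <- replicate(2000, {
  mu <- rnorm(1, 0, 1)         # 1. sample from the prior
  y  <- rnorm(n, mu, 1)        # 2. sample data given the parameter
  max(y)                       #    and form T(y)
})
mean(Tsim >= Tobs)             # predictive p-value for the observed statistic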
Let M_j denote the j'th of J models and let the data be y. Then by Bayes' theorem the posterior probability of this model is

P(M_j|y) = p(y|M_j) P_j / p(y),

where p(y) = Σ_{j=1}^J p(y|M_j) P_j.
The p(y|M_j) are the predictive distributions of the data under the two hypotheses and their ratio is the Bayes factor.
P(y_obs|θ = 1) / P(y_obs|θ = −1) = e^(−(1/2)(y−1)²) / e^(−(1/2)(y+1)²),    (18)

and so, if the hypotheses are equally probable a priori, the posterior odds are

P(θ = 1|y_obs) / P(θ = −1|y_obs) = e^(2y).

If y > 0 then θ = 1 is more probable than θ = −1; y < 0 makes θ = −1 more probable.
(19)
where the R_j are the residual sums of squares in the two models and the k_j are the numbers of coefficients.
For example:

Model 1:    y = β_1 x_1 + β_2 x_2 + ε_1    (20)

Model 2:    y = β_1 x_1 + β_2 x_3 + ε_2    (21)
Model Averaging
For prediction purposes one might not want to use the most probable
model. Instead it is optimal, for certain loss functions, to predict from
an average model using
p(ỹ|y) = Σ_j p(ỹ, M_j|y) = Σ_j P(M_j|y) p(ỹ|M_j, y).
y = Xβ + ε.    (22)
p_i = g_i / Σ_{j=1}^n g_j    for i = 1, 2, ..., n.    (27)

Write

β̃ = (X′GX)^(−1) X′Gy,    G = diag{g_i}.
The approximate posterior variance of β̃ is

(X′X)^(−1) X′DX (X′X)^(−1),    D = diag{e_i²},    (28)

where e = y − Xb.
This posterior distribution for β is the Bayesian bootstrap distribution. It is robust against heteroscedasticity and non-normality.
The BB can be implemented by weighted regression with weights equal to rexp(n); see the exercises and the sketch below.
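A minimal sketch (with simulated, heteroscedastic data, purely for illustration) of the Bayesian bootstrap for a regression slope using weighted least squares:

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n) * (1 + abs(x))    # heteroscedastic errors
B <- 2000
bb <- numeric(B)
for (b in 1:B) {
  w <- rexp(n)                                # exponential weights
  bb[b] <- coef(lm(y ~ x, weights = w))["x"]  # one BB draw of the slope
}
quantile(bb, c(0.025, 0.5, 0.975))            # BB posterior interval for the slope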
coefficient   ols   se   BB mean   White se   BB se
b0            .     .    .         .128       .124
b1            .     .    .         .091       .096
b2            .     .    .         .134       .134
Suppose that all you have are moment restrictions of the form E g(y, θ) = 0. But Bayesian inference needs a likelihood. One way to proceed (Schennach, Biometrika 92(1), 2005) is to construct a maximum entropy distribution supported on the observed data. This gives probability p_i to observation y_i. As we have seen, the unrestricted maxent distribution assigns probability 1/n to each data point, which is the solution to
max_p − Σ_{i=1}^n p_i log p_i    subject to    Σ_{i=1}^n p_i = 1.    (29)

When the moment constraint Σ_{i=1}^n p_i g(y_i, θ) = 0 is added, the solution has the form

p_i(θ) = exp{λ(θ)′g(y_i, θ)} / Σ_{j=1}^n exp{λ(θ)′g(y_j, θ)},

where the {λ(θ)} are the Lagrange multipliers associated with the moment restrictions.
Here is an example:
Estimation of the 25% quantile
Use the single moment
1(y ≤ θ) − 0.25,

which has expectation zero when θ is the 25% quantile. Figure 7 shows the posterior density of the 25% quantile based on a sample of size 100 under a uniform prior. The vertical line is the sample 25% quantile. This method extends to any GMM setting including linear and non-linear models, discrete choice, instrumental variables etc.
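A rough R sketch of this calculation under stated assumptions (simulated data, a uniform prior over a grid, and the tilting parameter λ(θ) found from its dual problem at each grid point):

set.seed(1)
y <- rnorm(100)                            # illustrative sample of size 100
gfun <- function(theta) (y <= theta) - 0.25
loglik <- function(theta) {
  g <- gfun(theta)
  # dual problem: the tilting parameter minimizes sum(exp(lambda * g))
  lam <- optimize(function(l) sum(exp(l * g)), c(-30, 30))$minimum
  p <- exp(lam * g); p <- p / sum(p)       # maxent probabilities p_i(theta)
  sum(log(p))                              # log likelihood at theta
}
grid <- seq(quantile(y, 0.10), quantile(y, 0.45), length.out = 200)
ll <- sapply(grid, loglik)
post <- exp(ll - max(ll)); post <- post / sum(post)   # uniform prior on the grid
plot(grid, post, type = "l")
abline(v = quantile(y, 0.25))              # sample 25% quantile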
When the object of interest is, say, h(θ), a scalar or vector function of θ, Bayesian inferences are based on the marginal distribution of h. How do we obtain this?
The answer is the sampling principle (just illustrated in the normal linear model and on several other occasions) that underlies all modern Bayesian work.
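A minimal sketch of the sampling principle, assuming (for illustration only) that posterior draws of two coefficients are available and that the object of interest is the ratio h(θ) = θ_1/θ_2:

set.seed(1)
# pretend these are draws from the joint posterior (MCMC or direct sampling)
theta1 <- rnorm(5000, 1.0, 0.1)
theta2 <- rnorm(5000, 2.0, 0.2)
h <- theta1 / theta2              # apply h to every joint draw
plot(density(h))                  # marginal posterior of h
quantile(h, c(0.025, 0.975))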
[Figure: kernel density estimate of the marginal posterior of h, density(x = h, width = 1)]
A chain is characterized by its transition kernel, whose elements provide the conditional probabilities of θ^(t+1) given the value of θ^(t). The kernel is denoted by K(x, y).
For a two-state chain with Pr(2|1) = α and Pr(1|2) = β, the t-step kernel is

K^t = 1/(α+β) | β  α |  +  (1 − α − β)^t/(α+β) |  α  −α |,    (30)
              | β  α |                          | −β   β |

so that, as t → ∞, every row converges to the same invariant distribution with

Pr(θ = 2) = α/(α + β).    (31)
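A small R check of (30) and (31), with illustrative values α = 0.3, β = 0.1: iterating the kernel drives every row toward the invariant distribution.

alpha <- 0.3; beta <- 0.1
K <- matrix(c(1 - alpha, alpha,
              beta, 1 - beta), nrow = 2, byrow = TRUE)
Kt <- diag(2)
for (t in 1:50) Kt <- Kt %*% K    # K^50
Kt                                # both rows are (nearly) the invariant distribution
c(beta, alpha) / (alpha + beta)   # (Pr(theta = 1), Pr(theta = 2)) from (31)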
The invariant distribution p satisfies p = pK, or

p(y) = ∫ K(x, y) p(x) dx.    (*)

For example, take

K(x, y) = (1/√(2π)) e^(−(y − ρx)²/2),    p(x) = (√(1 − ρ²)/√(2π)) e^(−(1 − ρ²)x²/2).

Try (*):

∫ K(x, y) p(x) dx = ∫ (1/√(2π)) e^(−(y − ρx)²/2) (√(1 − ρ²)/√(2π)) e^(−(1 − ρ²)x²/2) dx
                  = (√(1 − ρ²)/√(2π)) e^(−(1 − ρ²)y²/2) ∫ (1/√(2π)) e^(−(x − ρy)²/2) dx
                  = (√(1 − ρ²)/√(2π)) e^(−(1 − ρ²)y²/2)
                  = p(y).
1. Choose θ_1^(0)
2. Sample θ_2^(1) from p(θ_2|θ_1^(0))
3. Sample θ_1^(1) from p(θ_1|θ_2^(1))
and so on, alternating between the two conditionals (a sketch follows below).
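A minimal sketch of this alternation for a bivariate normal target with correlation ρ (an assumed illustrative example), where both full conditionals are known normals:

set.seed(1)
rho <- 0.9
S <- 2000
th <- matrix(0, S, 2)            # th[s, ] holds (theta1, theta2) at step s
th1 <- 0                         # 1. choose a starting value for theta1
for (s in 1:S) {
  th2 <- rnorm(1, rho * th1, sqrt(1 - rho^2))   # 2. theta2 | theta1
  th1 <- rnorm(1, rho * th2, sqrt(1 - rho^2))   # 3. theta1 | theta2
  th[s, ] <- c(th1, th2)
}
plot(th[, 1], th[, 2])           # draws from the joint distribution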
Figure 11: [two scatter plots of the sampled values, plotted against y1]
y* = x′β + ε,    ε ~ n(0, 1),    (32)

y = I{y* > 0}.    (33)
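A rough sketch of data augmentation here, assuming a simplified version of (32)-(33) with a single unknown mean μ and a flat prior: the latent y* are drawn from truncated normals given μ, then μ from its normal conditional given y*.

set.seed(1)
n <- 200
y <- as.numeric(rnorm(n, 0.5) > 0)        # simulated binary data (true mean 0.5)
rtnorm1 <- function(mean, lower, upper) {
  # inverse-cdf draw from N(mean, 1) truncated to (lower, upper)
  u <- runif(1, pnorm(lower - mean), pnorm(upper - mean))
  mean + qnorm(u)
}
S <- 2000; mu <- 0; mu.draws <- numeric(S)
for (s in 1:S) {
  ystar <- sapply(1:n, function(i)
    if (y[i] == 1) rtnorm1(mu, 0, Inf) else rtnorm1(mu, -Inf, 0))
  mu <- rnorm(1, mean(ystar), sqrt(1 / n))   # mu | ystar under a flat prior
  mu.draws[s] <- mu
}
quantile(mu.draws[-(1:500)], c(0.025, 0.5, 0.975))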
For another example consider optimal job search. Agents receive job offers and accept the first offer to exceed a reservation wage w*. The econometrician observes the time to acceptance, t, and the accepted wage, w_a. Offers come from a distribution function F(w) (with F̄ = 1 − F) and arrive in a Poisson process of rate λ. Duration and accepted wage then have joint density

λ e^(−λF̄(w*)t) f(w_a);    w_a ≥ w*,  t ≥ 0.

This is rather awkward. But consider latent data consisting of the rejected wages (if any) and the times at which these offers were received. Let θ = (λ, w*) plus any parameters of the wage offer distribution, and let w, s be the rejected offers and their times of arrival. Data augmentation includes w, s as additional parameters, and a Gibbs algorithm would sample in turn from p(θ|w, s, w_a, t) and p(w, s|θ, w_a, t), both of which take a very simple form.
A judicious choice of latent data radically simplifies inference about
quite complex structural models.
But the most important development has been the production of black box general purpose software that enables the
user to input his model and data and receive MCMC realizations from the posterior as output without the user worrying
about the particular chain that is being used for his problem.
(This is somewhat analogous to the development in the frequentist literature of general purpose function minimization
routines.)
Of the packages available now probably the most widely used is BUGS
which is freely distributed from
http://www.mrc-bsu.cam.ac.uk/bugs/
BUGS stands for Bayesian inference Using Gibbs Sampling, though in fact it uses a variety of algorithms and not merely the Gibbs sampler.
As with any package you need to provide the program with two things:
The model
The data
To supply the model you do not simply choose from a menu of models.
BUGS is more flexible in that you can give it any model you like!
(Though there are some models that require some thought before they
can be written in a way acceptable to BUGS.)
For a Bayesian analysis the model is, of course, the likelihood and the
prior.
Here is an example of a BUGS model statement for a first order autoregressive model with autoregression coefficient rho, intercept alpha and error precision tau.
model{
for( i in 2:T){y[i] ~dnorm(mu[i], tau)
mu[i] <- alpha + rho * y[i-1]
}
alpha ~dnorm(0, 0.001)
rho ~dnorm(0, 0.001)
tau ~dgamma(0.001,0.001)
}
Lines two and three are the likelihood. Lines five, six and seven are the prior. In this case alpha, rho and tau are independent with distributions having low precision (high variance). For example, alpha has mean zero and precision 0.001, i.e. variance 1,000.
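As a sketch of the other ingredient, the data, one might simulate an AR(1) series in R and pass it to WinBUGS as a list whose names match the variables in the model statement (the parameter values below are illustrative):

set.seed(1)
T <- 100
y <- numeric(T)                                        # y[1] left at 0
for (t in 2:T) y[t] <- 0.1 + 0.9 * y[t-1] + rnorm(1)   # alpha = 0.1, rho = 0.9, tau = 1
# data for WinBUGS, e.g. pasted into the data window:
# list(T = 100, y = c(...))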
Model:

y_1 = b0 + b1 y_2 + ε_1
y_2 = c0 + c1 z_1 + c2 z_2 + ε_2.

# 2-equation overidentified recursive model with 2 exogenous variables.
# Modelled as a restricted reduced form.
model{
for(i in 1:n){
y[i,1:2] ~dmnorm(mu[i,],R[,])
mu[i,1] <- b0 + b1*c0 + b1*c1*z[i,1] + b1*c2*z[i,2]
mu[i,2] <- c0 + c1*z[i,1] + c2*z[i,2]
}
R[1:2,1:2] ~dwish(Omega[,],4)
b0 ~dnorm(0,0.0001)
b1 ~dnorm(0,0.0001)
c0 ~dnorm(0,0.0001)
c1 ~dnorm(0,0.0001)
c2 ~dnorm(0,0.0001)
}
node     mean     sd        MC error   2.5%      median   97.5%    start   sample
R[1,1]   0.9647   0.04334   4.708E-4   0.8806    0.9645   1.052    2501    7500
R[2,1]   0.0766   0.03233   3.458E-4   0.01316   0.077    0.1396   2501    7500
R[2,2]   1.055    0.064     .          .         1.054    1.15     2501    7500
b0       .        .         .          .         .        .        2501    7500
b1       0.6396   0.2248    0.004277   0.2641    .        .        2501    7500
c0       0.0407   0.03111   .          .         .        .        2501    7500
c1       0.1442   0.0284    4.031E-4   .         .        .        2501    7500
c2       .        .         .          .         .        .        2501    7500
Years of education by quarter of birth:

                       first     second    third     fourth
years of education     13.003    13.013    12.989    13.116

The differences are small; the difference between 13.116 and 12.989 is about six and a half weeks of education, and this is about one percent of the median years of education in the whole sample.
The instruments qualify as weak.
0.1.1 Diffuse Prior

The sampler was run through 2000 steps with the first 1000 discarded, and three parallel chains were run to check convergence; this gives a total of 3000 realizations from the joint posterior. The first figure shows the marginal posterior for β under the diffuse prior setup. The left frame gives the entire histogram for all 3000 realizations and the right frame provides a close-up of the middle 90% of the distribution.
[Figure: histogram of the marginal posterior of beta (left) and a close-up of the central portion, "Rate of Return - Central Portion" (right)]
The distribution is not normal: it is thick tailed. It is also very far from the prior which, if superimposed on the left hand frame, would look almost flat.
We have learned something but not much. As can be seen on the right
frame, the posterior points towards rates of return of the order of 10
to 20% a year, but very considerable uncertainty remains. The ideal
might be to be able to pin down the rate of return to within one or
two percentage points, but we are very far from this accuracy. The
posterior mean of β under this prior is 17% and the median is 15%.
75% of all realizations lie within the range from 10% to 21%.
0.1.2 Informative Prior

For contrast let us see what happens when we use the relatively informative prior in which β is n(0.10, 100) and the three coefficients are constrained to be similar. We again ran the sampler for 2000 steps three times and retained the final 1000 of each chain, giving 3000 realizations from the joint posterior distribution. The marginal posterior histogram for β is shown in the next figure, with the prior density superimposed as the solid line.
[Figure: "Posterior of Beta" histogram with the prior density superimposed as a solid line]
0.1.3

              lower   upper
diffuse         .     46%
informative    1%     25%

The length of the confidence interval has been sharply reduced, but it is still very wide. Another comparison is to look at the quantiles and these, expressed in percentages, are

              min    q25   q50   mean   q75   max
diffuse       -40    10    15    17     21    220
informative   -15     8    12     .     16     .
0.1.4 Conclusions
What do we conclude?
It appears that even with about 36,000 observations and a simple model, weak instruments leave the rate of return very imprecisely determined.
0.2 Is Education Endogenous?

[Figure: scatter plot of wage against years of education]
The coefficients are very sharply determined and the posterior mean, median and mode of the rate of return is 6.8% with a standard error of one tenth of one percent. Why can't we accept this estimate? The answer is, of course, that we have been presuming that education is endogenous, that its effect on wages is confounded with that of the numerous other potential determinants of wages. And under this belief the least squares estimates are (very precisely) wrong. But there is still hope for the least squares estimate since we have not, yet, shown that this presumption is true!
Consider how we might check to see whether education really is endogenous. We reproduce the structural form of the model for convenience here. It is
educ_i = α + π z_i + ε_{1i}
wage_i = δ + β educ_i + ε_{2i},
where z stands for the quarter of birth instruments. We showed earlier that if ε_2 and ε_1 are uncorrelated then the second equation is a regression, the system is fully recursive, and standard regression methods, like least squares, can be used to make inferences about β. But if ε_2 and ε_1 are correlated then these methods are inappropriate and we must proceed as above, with the consequent disappointing precision.
So why don't we see if these errors are correlated? One way of doing this is to look at the posterior distribution of the correlation coefficient of ε_2 and ε_1. We can do this as follows. For purposes of our calculation we represented the model as the restricted reduced form

educ_i = α + π z_i + ν_{1i}
wage_i = γ + βπ z_i + ν_{2i}.
The relation between the structural and reduced form errors is

( ε_1 )   (  1    0 ) ( ν_1 )
(     ) = (          ) (     )    (37)
( ε_2 )   ( −β    1 ) ( ν_2 )
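A sketch of how the posterior of this correlation could be computed from the MCMC output, assuming that the saved draws of the reduced form precision matrix R and of β (b1 in the BUGS program) are held in vectors R11, R21, R22 and b1 (hypothetical object names):

# one correlation draw per retained MCMC step
rho.eps <- sapply(seq_along(b1), function(s) {
  Sigma <- solve(matrix(c(R11[s], R21[s], R21[s], R22[s]), 2, 2))  # reduced form error covariance
  A <- matrix(c(1, -b1[s], 0, 1), 2, 2)                            # eps = A %*% nu, as in (37)
  V <- A %*% Sigma %*% t(A)                                        # structural error covariance
  V[1, 2] / sqrt(V[1, 1] * V[2, 2])
})
hist(rho.eps)                                                      # posterior of the correlation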
[Figure: histogram of the posterior draws of the correlation of the structural errors]
The posterior suggests that the structural form errors are negatively correlated, and this is a bit surprising on the hypothesis that a major element of both ε_1 and ε_2 is ability and this variable tends to affect both education and wages positively. But the evidence is very far from conclusive.