Notes BMDA PDF
Course contents
Introduction of Bayesian concepts using single-parameter models.
Multiple-parameter models and hierarchical models.
Computation: approximations to the posterior, rejection and importance sampling, and MCMC.
Course emphasis
Notes draw heavily on the book by Gelman et al., Bayesian Data
Analysis 2nd. ed., and many of the figures are borrowed directly from
that book.
We focus on the implementation of Bayesian methods and interpretation
of results.
Little theory, but some is needed to understand methods.
Lots of examples. Some are not directly drawn from biological problems,
but still serve to illustrate methodology.
Biggest idea to get across: inference by simulation.
Software: R (or SPlus) early on, and WinBUGS for most of the
examples after we discuss computational methods.
All of the programs used to construct examples can be downloaded
from www.public.iastate.edu/ alicia.
On my website, go to Teaching and then to ENAR Short Course.
Binomial model
Of historical importance: Bayes derived his theorem and Laplace
provided computations for the Binomial model.
Before Bayes, the question was: given θ, what are the probabilities of the
possible outcomes of y?
Bayes asked: what is Pr(θ1 < θ < θ2 | y)?
Using a uniform prior for θ, Bayes showed that
Pr(θ1 < θ < θ2 | y) = ∫_{θ1}^{θ2} (n choose y) θ^y (1 − θ)^{n−y} dθ / p(y)
and that
p(θ | y) = (n choose y) θ^y (1 − θ)^{n−y} / ∫ (n choose y) θ^y (1 − θ)^{n−y} dθ
         = (n + 1) [n! / (y!(n − y)!)] θ^y (1 − θ)^{n−y}
         = [(n + 1)! / (y!(n − y)!)] θ^y (1 − θ)^{n−y}
         = [Γ(n + 2) / (Γ(y + 1)Γ(n − y + 1))] θ^{y+1−1} (1 − θ)^{n−y+1−1}
         = Beta(y + 1, n − y + 1)
Point estimation
Posterior mean: E(θ | y) = (y + 1)/(n + 2).
Note the posterior mean is a compromise between the prior mean (1/2) and the
sample proportion y/n.
Posterior mode: y/n.
Posterior median: the value θ* such that Pr(θ ≤ θ* | y) = 0.5.
The best point estimator minimizes the expected loss (more later).
Posterior variance
Var(θ | y) = (y + 1)(n − y + 1) / [(n + 2)^2 (n + 3)].
Interval estimation
A 95% credible set (or central posterior interval) is (a, b) such that
∫_{−∞}^{a} p(θ | y) dθ = 0.025 and ∫_{−∞}^{b} p(θ | y) dθ = 0.975.
Inference by simulation:
Draw values from posterior: easy for closed form posteriors, can also
do in other cases (later)
Monte Carlo estimates of point and interval estimates
Added MC error (due to sampling)
Easy to get interval estimates and estimators for functions of
parameters.
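As a small illustration of inference by simulation, the sketch below (not from the notes; the data values are made up) draws from the Beta posterior of the binomial model with a uniform prior and computes Monte Carlo point and interval estimates, including for a function of θ.

# Sketch: Monte Carlo inference for the binomial model with uniform prior,
# where theta | y ~ Beta(y + 1, n - y + 1). Data (y, n) are hypothetical.
y <- 7; n <- 20
draws <- rbeta(10000, y + 1, n - y + 1)
mean(draws)                          # Monte Carlo estimate of E(theta | y)
quantile(draws, c(0.025, 0.975))     # central 95% credible interval
odds <- draws / (1 - draws)          # a function of theta is just as easy
quantile(odds, c(0.025, 0.975))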
Prediction:
Prior predictive distribution:
p(y) = ∫_0^1 (n choose y) θ^y (1 − θ)^{n−y} dθ = 1/(n + 1).
Posterior predictive distribution (for a new trial ỹ):
Pr(ỹ = 1 | y) = ∫_0^1 Pr(ỹ = 1 | y, θ) p(θ | y) dθ
              = ∫_0^1 Pr(ỹ = 1 | θ) p(θ | y) dθ
              = ∫_0^1 θ p(θ | y) dθ = E(θ | y) = (y + 1)/(n + 2).
Conjugate prior
Suppose that we choose a Beta prior for θ:
p(θ | α, β) ∝ θ^{α−1} (1 − θ)^{β−1}
The posterior is now
p(θ | y) ∝ θ^{y+α−1} (1 − θ)^{n−y+β−1},
so the posterior is again proportional to a Beta:
p(θ | y) = Beta(y + α, n − y + β).
For now, α and β are considered fixed and known, but they can also get their
own prior distribution (hierarchical model).
Conjugate priors
Formal definition: let F be a class of sampling distributions and P a class
of prior distributions. Then P is conjugate for F if p(θ) ∈ P and
p(y | θ) ∈ F implies p(θ | y) ∈ P.
If F is an exponential family, then distributions in F have natural conjugate
priors. A distribution in F has the form
p(y_i | θ) = f(y_i) g(θ) exp{ φ(θ)' u(y_i) }.
For an iid sequence, the likelihood function is
p(y | θ) ∝ g(θ)^n exp{ φ(θ)' t(y) },
where t(y) = Σ_i u(y_i) is a sufficient statistic.
The binomial is of this form:
p(y | θ) ∝ (1 − θ)^n exp{ y log[θ/(1 − θ)] },
so the Beta prior is the natural conjugate prior. With a Beta(α, β) prior, the
posterior mean
E(θ | y) = (α + y) / (α + β + n)
is always between the prior mean α/(α + β) and the MLE y/n.
Posterior variance
Var(θ | y) = (α + y)(β + n − y) / [(α + β + n)^2 (α + β + n + 1)]
           = E(θ | y)[1 − E(θ | y)] / (α + β + n + 1).
and therefore
θ | y ∼ Beta(y + 1, n − y + 1) = Beta(438, 544),
with variance (438 × 544) / [(438 + 544)^2 (438 + 544 + 1)].
Sensitivity to the choice of prior:

Prior mean   α+β   Post. median   Post. 2.5th pctile   Post. 97.5th pctile
0.500          2       0.446           0.415                0.477
0.485          2       0.446           0.415                0.477
0.485          5       0.446           0.415                0.477
0.485         10       0.446           0.415                0.477
0.485         20       0.447           0.416                0.478
0.485        100       0.450           0.420                0.479
0.485        200       0.453           0.424                0.481

Results are robust to the choice of prior, even a very informative prior. The prior mean
is not in the posterior 95% credible set.
Poisson sampling model:
p(y | θ) = ∏_{i=1}^{n} θ^{y_i} e^{−θ} / y_i!  =  θ^{n ȳ} e^{−nθ} / ∏_i y_i!.
With a Gamma(α, β) prior, for α* = n ȳ + α and β* = n + β, the posterior is Gamma(α*, β*).
Note:
Prior mean of θ is α/β.
Posterior mean of θ is
E(θ | y) = (n ȳ + α) / (n + β).
If the sample size n → ∞ then E(θ | y) approaches the MLE of θ.
If the sample size goes to zero, then E(θ | y) approaches the prior mean.
Normal mean with known variance: for a single observation,
p(y_i | θ) ∝ exp{ −(y_i − θ)^2 / (2σ^2) }.
With prior θ ∼ N(μ0, τ0^2), the posterior is
p(θ | y) ∝ exp{ −(1/2) [ Σ_i (y_i − θ)^2 / σ^2 + (θ − μ0)^2 / τ0^2 ] }
        ∝ exp{ −(θ − μ_n)^2 / (2 τ_n^2) },
where
μ_n = ( μ0/τ0^2 + n ȳ/σ^2 ) / ( 1/τ0^2 + n/σ^2 ).
Note that
μ_n = [ ȳ τ0^2 + μ0 (σ^2/n) ] / [ (σ^2/n) + τ0^2 ]
    = μ0 (σ^2/n) / [ (σ^2/n) + τ0^2 ]  +  ȳ τ0^2 / [ (σ^2/n) + τ0^2 ].
Add and subtract μ0 τ0^2 / [ (σ^2/n) + τ0^2 ] to see that
μ_n = μ0 + ( ȳ − μ0 ) τ0^2 / [ (σ^2/n) + τ0^2 ].
Posterior mean is prior mean shrunken towards observed value. Amount
of shrinkage depends on relative size of precisions
Posterior variance:
Recall that
p(θ | y) ∝ exp{ −(1/2) [ ((σ^2/n) + τ0^2) / ((σ^2/n) τ0^2) ] (θ − μ_n)^2 },
so the posterior precision is the sum of the prior and data precisions:
1/τ_n^2 = ((σ^2/n) + τ0^2) / ((σ^2/n) τ0^2) = n/σ^2 + 1/τ0^2.
Posterior predictive distribution:
p(ỹ | y) = ∫ p(ỹ | θ) p(θ | y) dθ.
We know that, given θ, ỹ ∼ N(θ, σ^2), and θ | y ∼ N(μ_n, τ_n^2). Then
p(ỹ | y) ∝ ∫ exp{ −(ỹ − θ)^2 / (2σ^2) } exp{ −(θ − μ_n)^2 / (2τ_n^2) } dθ
         = ∫ exp{ −(1/2) [ (ỹ − θ)^2/σ^2 + (θ − μ_n)^2/τ_n^2 ] } dθ.
We need E(ỹ | y) and var(ỹ | y):
E(ỹ | y) = E[ E(ỹ | y, θ) | y ] = E(θ | y) = μ_n,
because E(ỹ | y, θ) = E(ỹ | θ) = θ.
var(ỹ | y) = E[ var(ỹ | y, θ) | y ] + var[ E(ỹ | y, θ) | y ]
           = E(σ^2 | y) + var(θ | y)
           = σ^2 + τ_n^2,
because var(ỹ | y, θ) = var(ỹ | θ) = σ^2.
The variance includes the additional term τ_n^2 as a penalty for not knowing
the true value of θ.
Recall
μ_n = ( μ0/τ0^2 + n ȳ/σ^2 ) / ( 1/τ0^2 + n/σ^2 ).
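A minimal numerical sketch of this normal-normal update follows; the prior values and data are hypothetical, chosen only to show how μ_n, τ_n^2, and the predictive variance combine.

# Sketch: normal-normal posterior with known sigma^2 (hypothetical numbers)
mu0 <- 0; tau0 <- 2          # assumed prior mean and sd
sigma <- 1                   # assumed known sampling sd
y <- c(0.8, 1.2, 0.3, 1.5)   # made-up data
n <- length(y); ybar <- mean(y)
prec_n <- 1 / tau0^2 + n / sigma^2
mu_n   <- (mu0 / tau0^2 + n * ybar / sigma^2) / prec_n
tau_n2 <- 1 / prec_n
# posterior predictive for a new observation: N(mu_n, sigma^2 + tau_n^2)
pred <- rnorm(5000, mu_n, sqrt(sigma^2 + tau_n2))
c(mu_n, tau_n2, var(pred))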
Normal variance
Example of a scale model
Assume that y ∼ N(θ, σ^2) with θ known.
For iid y_1, ..., y_n:
p(y | σ^2) ∝ (σ^2)^{−n/2} exp{ −(1/(2σ^2)) Σ_{i=1}^{n} (y_i − θ)^2 }
           = (σ^2)^{−n/2} exp{ −n v / (2σ^2) }
for the sufficient statistic
v = (1/n) Σ_{i=1}^{n} (y_i − θ)^2.
The conjugate prior is the scaled inverse chi-square,
p(σ^2) ∝ (σ^2)^{−(ν0/2 + 1)} exp{ −ν0 σ0^2 / (2σ^2) },
which corresponds to Inv-χ^2(ν0, σ0^2), i.e., a Gamma(ν0/2, ν0σ0^2/2) on 1/σ^2.
The posterior is
p(σ^2 | y) ∝ (σ^2)^{−n/2} exp{ −n v / (2σ^2) } (σ^2)^{−(ν0/2+1)} exp{ −ν0σ0^2 / (2σ^2) }
           ∝ (σ^2)^{−(ν1/2 + 1)} exp{ −ν1 σ1^2 / (2σ^2) },
with ν1 = ν0 + n and
σ1^2 = (n v + ν0 σ0^2) / (n + ν0).
Poisson model
Appropriate for count data such as number of cases, number of
accidents, etc. For a vector of n iid observations:
p(y | θ) = ∏_{i=1}^{n} θ^{y_i} e^{−θ} / y_i!  =  θ^{t(y)} e^{−nθ} / ∏_i y_i!,
where θ is the rate, y_i = 0, 1, ... and t(y) = Σ_{i=1}^{n} y_i is the sufficient
statistic for θ.
The conjugate prior is a Gamma(α, β), p(θ) ∝ θ^{α−1} e^{−βθ}, and
p(θ | y) ∝ p(θ) p(y | θ) ∝ θ^{Σ_i y_i + α − 1} e^{−(n + β)θ},
a Gamma(α + Σ_i y_i, β + n).
With exposures x_i,
p(y | θ) ∝ θ^{Σ_i y_i} exp{ −θ Σ_i x_i },
and the posterior is Gamma(α + Σ_i y_i, β + Σ_i x_i).
With the normal-mean posterior precision 1/τ_n^2 = 1/τ0^2 + n/σ^2:
For τ0 → ∞, the prior carries no information and
θ | y → N(ȳ, σ^2/n).
The same result could have been obtained using p(θ) ∝ 1.
The uniform prior is a natural non-informative prior for location
parameters (see later).
It is improper:
∫ p(θ) dθ = ∫ dθ = ∞,
yet it leads to a proper posterior for θ. This is not always the case.
Poisson example: y_1, ..., y_n iid Poisson(θ); consider p(θ) ∝ θ^{−1/2}.
∫_0^∞ θ^{−1/2} dθ = ∞: improper.
Yet
p(θ | y) ∝ θ^{Σ_i y_i} e^{−nθ} θ^{−1/2} = θ^{Σ_i y_i + 1/2 − 1} e^{−nθ},
proportional to a Gamma(Σ_i y_i + 1/2, n): proper.
Consider the transformation φ = log θ, so θ = exp(φ) and dθ/dφ = exp(φ) is the
Jacobian.
Then, if p(θ) is the prior for θ, p(φ) = p(exp(φ)) exp(φ) is the corresponding
prior under the transformation.
For p(θ) ∝ c, p(φ) ∝ exp(φ): informative.
An informative prior is needed to arrive at the same answer in both
parameterizations.
Jeffreys prior
In 1961, Jeffreys proposed a method for finding non-informative priors
that are invariant to one-to-one transformations
Jeffreys proposed
p(θ) ∝ [I(θ)]^{1/2},
where I(θ) is the expected Fisher information:
I(θ) = −E_y[ d^2 log p(y | θ) / dθ^2 ].
If θ is a vector, then I(θ) is a matrix with (i, j) element equal to
−E_y[ ∂^2 log p(y | θ) / ∂θ_i ∂θ_j ],
and
p(θ) ∝ |I(θ)|^{1/2}.
Theorem: Jeffreys' prior is locally uniform and therefore non-informative.
Binomial example: log p(y | θ) = constant + y log θ + (n − y) log(1 − θ), so
−d^2 log p(y | θ) / dθ^2 = y θ^{−2} + (n − y)(1 − θ)^{−2}.
Taking expectations:
I(θ) = E(y) θ^{−2} + (n − E(y))(1 − θ)^{−2}
     = nθ θ^{−2} + (n − nθ)(1 − θ)^{−2} = n/θ + n/(1 − θ) = n / [θ(1 − θ)].
Then
p(θ) ∝ [I(θ)]^{1/2} ∝ θ^{−1/2}(1 − θ)^{−1/2} ∝ Beta(1/2, 1/2).
Normal mean example: d^2 log p(y | θ) / dθ^2 is constant with respect to θ.
Then I(θ) ∝ constant and p(θ) ∝ constant.
Normal variance example (θ known):
d^2 log p(y | σ^2) / d(σ^2)^2 = n / (2σ^4) − Σ_i (y_i − θ)^2 / σ^6,
so
I(σ^2) = −n/(2σ^4) + nσ^2/σ^6 = n/(2σ^4),
and p(σ^2) ∝ [I(σ^2)]^{1/2} ∝ σ^{−2}.
Invariance: under a one-to-one transformation φ = h(θ),
p(φ) = p(θ) |dθ/dφ|.
Also,
I(φ) = I(θ) [dθ/dφ]^2,
so [I(φ)]^{1/2} = [I(θ)]^{1/2} |dθ/dφ|, as required.
Nuisance parameters
Consider a model with two parameters (θ1, θ2) (e.g., a normal
distribution with unknown mean and variance).
We are interested in θ1, so θ2 is a nuisance parameter.
The marginal posterior distribution of interest is p(θ1 | y).
It can be obtained directly from the joint posterior density
p(θ1, θ2 | y) ∝ p(θ1, θ2) p(y | θ1, θ2)
by integrating with respect to θ2:
p(θ1 | y) = ∫ p(θ1, θ2 | y) dθ2.
Joint posterior (with the non-informative prior p(μ, σ^2) ∝ σ^{−2}):
p(μ, σ^2 | y) ∝ p(μ, σ^2) p(y | μ, σ^2)
             ∝ σ^{−n−2} exp{ −(1/(2σ^2)) Σ_{i=1}^{n} (y_i − μ)^2 }.
Note that
Σ_i (y_i − μ)^2 = Σ_i (y_i^2 − 2 y_i μ + μ^2) = Σ_i y_i^2 − 2 n ȳ μ + n μ^2
                = Σ_i (y_i − ȳ)^2 + n(ȳ − μ)^2,
and define
s^2 = (1/(n − 1)) Σ_i (y_i − ȳ)^2.
Then
p(μ, σ^2 | y) ∝ σ^{−n−2} exp{ −(1/(2σ^2)) [ (n − 1)s^2 + n(ȳ − μ)^2 ] }.
Marginal posterior of σ^2: integrating over μ,
p(σ^2 | y) ∝ σ^{−n−2} exp{ −(n − 1)s^2 / (2σ^2) } ∫ exp{ −n(ȳ − μ)^2 / (2σ^2) } dμ
           ∝ σ^{−n−2} exp{ −(n − 1)s^2 / (2σ^2) } √(2πσ^2/n)
           ∝ (σ^2)^{−(n+1)/2} exp{ −(n − 1)s^2 / (2σ^2) },
which is proportional to a scaled inverse-χ^2 distribution with degrees
of freedom (n − 1) and scale s^2.
p(|y) =
Z
p(, 2|y)d 2
1
1 n/2+1
( 2)
exp( 2 [(n 1)s2 + n(
y )2])d 2
2
2
A
2 2
where A = (n 1)s2 + n(
y )2. Then
z=
d 2
A
= 2
dz
2z
ENAR - March 2006
12
and
Z
z n A
( ) 2 +1 2 exp(z)dz
A
z
0
Z
n 1
n/2
A
z 2 exp(z)dz
p(|y)
13
p(|y) A
n 1
2
exp (z)dz
14
y
|y) = tn1
s/ n
15
y
p( |y) = tn1
s/ n
16
Posterior predictive distribution:
p(ỹ | y) = ∫∫ p(ỹ | y, μ, σ^2) p(μ, σ^2 | y) dμ dσ^2.
The first factor in the integrand is just the normal model, and it does not depend
on y at all.
To simulate ỹ from the posterior predictive distribution, do the following:
1. Draw σ^2 from Inv-χ^2(n − 1, s^2)
2. Draw μ from N(ȳ, σ^2/n)
3. Draw ỹ from N(μ, σ^2)
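The three steps above translate directly into R; the data vector below is hypothetical and used only to make the sketch runnable.

# Sketch of steps 1-3 with the noninformative prior p(mu, sigma^2) ∝ 1/sigma^2
y <- c(26, 33, 24, 31, 29, 28)                    # made-up data
n <- length(y); ybar <- mean(y); s2 <- var(y)
M <- 5000
sigma2 <- (n - 1) * s2 / rchisq(M, df = n - 1)    # sigma^2 ~ Inv-chi^2(n-1, s^2)
mu     <- rnorm(M, ybar, sqrt(sigma2 / n))        # mu | sigma^2 ~ N(ybar, sigma^2/n)
ytilde <- rnorm(M, mu, sqrt(sigma2))              # ytilde | mu, sigma^2 ~ N(mu, sigma^2)
quantile(ytilde, c(0.025, 0.5, 0.975))            # posterior predictive summary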
Analytically, conditioning on σ^2,
p(ỹ | σ^2, y) ∝ ∫ exp{ −(ỹ − μ)^2 / (2σ^2) } exp{ −(μ − ȳ)^2 / (2σ^2/n) } dμ.
After some algebra:
p(ỹ | σ^2, y) = N(ȳ, (1 + 1/n) σ^2).
Conjugate prior for (μ, σ^2):
μ | σ^2 ∼ N(μ0, σ^2/κ0),   σ^2 ∼ Inv-χ^2(ν0, σ0^2).
Jointly:
p(μ, σ^2) ∝ σ^{−1} (σ^2)^{−(ν0/2+1)} exp{ −[ ν0σ0^2 + κ0(μ0 − μ)^2 ] / (2σ^2) }.
The posterior has the same normal-Inv-χ^2 form, with
μ_n = (κ0 μ0 + n ȳ) / (κ0 + n),   κ_n = κ0 + n,   ν_n = ν0 + n,
ν_n σ_n^2 = ν0 σ0^2 + (n − 1) s^2 + [κ0 n / (κ0 + n)] (ȳ − μ0)^2.
Semi-conjugate prior:
μ ∼ N(μ0, τ0^2),   σ^2 ∼ Inv-χ^2(ν0, σ0^2).
The conditional posterior of μ is
μ | σ^2, y ∼ N(μ_n, τ_n^2),
with
μ_n = ( μ0/τ0^2 + n ȳ/σ^2 ) / ( 1/τ0^2 + n/σ^2 ),   1/τ_n^2 = 1/τ0^2 + n/σ^2.
NOTE: Even though μ and σ^2 are independent a priori, they are not
independent in the posterior.
The marginal posterior of σ^2 follows from p(σ^2 | y) = p(μ, σ^2 | y) / p(μ | σ^2, y),
so that
p(σ^2 | y) ∝ [ N(μ | μ0, τ0^2) Inv-χ^2(σ^2 | ν0, σ0^2) ∏_i N(y_i | μ, σ^2) ] / N(μ | μ_n, τ_n^2),
which holds for any value of μ and can be evaluated on a grid of σ^2 values.
Grid of values with their probabilities and cumulative probabilities (for
inverse-cdf sampling):

Value   Prob   CDF
  1     0.03   0.03
  2     0.04   0.07
  3     0.08   0.15
  4     0.15   0.30
  5     0.20   0.50
  6     0.30   0.80
  7     0.10   0.90
  8     0.05   0.95
  9     0.03   0.98
 10     0.02   1.00

To draw a value, generate u ∼ U(0, 1) and take the first grid value whose CDF exceeds u.
Rounded data example: observations are recorded to the nearest integer, so the
exact value lies in (y_i − 0.5, y_i + 0.5).
Prior: p(μ, σ^2) ∝ σ^{−2}.
We are interested in posterior inference about (μ, σ^2) and in differences
between the rounded and exact analyses.
[Figure: contours of the joint posterior of (mu, sigma2) under the exact normal
analysis and under the rounded-data analysis.]
Multinomial model: data y = (y_1, ..., y_k) with Σ_{j=1}^{k} y_j = n and parameters
θ = (θ_1, ..., θ_k) with Σ_{j=1}^{k} θ_j = 1.
Sampling distribution:
p(y | θ) ∝ ∏_{j=1}^{k} θ_j^{y_j}.
Conjugate (Dirichlet) prior:
p(θ | α) ∝ ∏_{j=1}^{k} θ_j^{α_j − 1},
with α_j > 0 for all j, α_0 = Σ_{j=1}^{k} α_j, θ_j > 0 for all j, and Σ_{j=1}^{k} θ_j = 1.
Prior mean: E(θ_j) = α_j / α_0.
Dirichlet distribution
The Dirichlet distribution is the conjugate prior for the parameters of
the multinomial model.
If (θ_1, ..., θ_K) ∼ Dirichlet(α_1, ..., α_K), then
p(θ_1, ..., θ_K) = [ Γ(α_0) / ∏_j Γ(α_j) ] ∏_j θ_j^{α_j − 1},
where θ_j ≥ 0, Σ_j θ_j = 1, α_j > 0 and α_0 = Σ_j α_j.
Some properties:
E(θ_j) = α_j / α_0,
Var(θ_j) = α_j (α_0 − α_j) / [ α_0^2 (α_0 + 1) ],
Cov(θ_i, θ_j) = −α_i α_j / [ α_0^2 (α_0 + 1) ].
Integrating out the other components (e.g., for K = 3, changing variables so that
θ_2 = v(1 − θ_1)) gives
p(θ_1) ∝ θ_1^{α_1 − 1} (1 − θ_1)^{α_2 + α_3 − 1},
so
θ_1 ∼ Beta(α_1, α_2 + α_3).
This generalizes to what is called the clumping property of the Dirichlet.
In general, if (θ_1, ..., θ_K) ∼ Dirichlet(α_1, ..., α_K):
p(θ_i) = Beta(α_i, α_0 − α_i).
Posterior:
p(θ | y) ∝ ∏_{j=1}^{k} θ_j^{y_j} θ_j^{α_j − 1} = ∏_{j=1}^{k} θ_j^{y_j + α_j − 1},
i.e., Dirichlet(α_1 + y_1, ..., α_k + y_k), with posterior mean
E(θ_j | y) = (α_j + y_j) / (α_0 + n),
where y_j is the number of observations of the jth outcome and n the total number of
observations.
For α_j = 1 for all j: a uniform, non-informative prior on all vectors θ such
that Σ_j θ_j = 1.
For α_j = 0: an improper prior; the posterior is proper only when every outcome is
observed (y_j ≥ 1 for all j), subject to the same restriction Σ_j θ_j = 1.
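A short sketch of posterior simulation for the multinomial-Dirichlet model follows. The counts and prior are hypothetical; the Dirichlet draw uses the standard construction from independent Gamma variables.

# Sketch: draws from the Dirichlet posterior of multinomial probabilities
rdirichlet <- function(m, a) {
  x <- matrix(rgamma(m * length(a), shape = a), nrow = m, byrow = TRUE)
  x / rowSums(x)                       # normalize Gamma(a_j, 1) draws
}
y     <- c(727, 583, 137)              # hypothetical multinomial counts
alpha <- c(1, 1, 1)                    # uniform prior on the simplex
theta <- rdirichlet(5000, alpha + y)
# posterior of a contrast, e.g. theta_1 - theta_2
quantile(theta[, 1] - theta[, 2], c(0.025, 0.5, 0.975))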
[Figure: histograms of posterior draws of theta_1, theta_2, and theta_1 − theta_2.]
Bioassay data:

Dose x_i    n_i    No. of deaths y_i
-0.863       5            0
-0.296       5            1
-0.053       5            3
 0.727       5            5
Logistic model for the death probabilities:
logit(θ_i) = log[ θ_i / (1 − θ_i) ] = α + β x_i,
so that
θ_i = exp(α + β x_i) / [ 1 + exp(α + β x_i) ].
Then
p(y_i | α, β, n_i, x_i) ∝ [ exp(α + β x_i) / (1 + exp(α + β x_i)) ]^{y_i}
                          [ 1 / (1 + exp(α + β x_i)) ]^{n_i − y_i}.
Prior: p(α, β) ∝ 1.
Posterior:
p(α, β | y, n, x) ∝ ∏_{i=1}^{k} p(y_i | α, β, n_i, x_i).
We first evaluate p(α, β | y, n, x) on the grid
(α, β) ∈ [−5, 10] × [−10, 40]
and use the inverse-cdf method to sample from the posterior.
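The sketch below implements a grid evaluation of this posterior in R, using the data above and the grid limits from the notes. For brevity it samples grid cells with probability proportional to the posterior (equivalent in effect to the inverse-cdf approach described here); the grid size of 200 is an arbitrary choice.

# Sketch: grid approximation to the bioassay posterior (flat prior)
x <- c(-0.863, -0.296, -0.053, 0.727)
n <- c(5, 5, 5, 5)
y <- c(0, 1, 3, 5)
a_grid <- seq(-5, 10, length.out = 200)
b_grid <- seq(-10, 40, length.out = 200)
logpost <- outer(a_grid, b_grid, function(a, b) {
  sapply(seq_along(a), function(k) {
    p <- plogis(a[k] + b[k] * x)                  # inverse logit
    sum(dbinom(y, n, p, log = TRUE))              # log likelihood = log posterior
  })
})
post <- exp(logpost - max(logpost)); post <- post / sum(post)
idx  <- sample(length(post), 2000, replace = TRUE, prob = as.vector(post))
alpha <- a_grid[row(post)[idx]]
beta  <- b_grid[col(post)[idx]]
quantile(-alpha / beta, c(0.025, 0.5, 0.975))     # LD50 = -alpha/beta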
...
...
200
p(1|y)
p(2|y)
..
..
p(200|y)
1
2
..
..
200
p(1|y)
p(2|y)
...
...
p(200|y)
51
Z
p(j |, y)p(|y)d
X
p(j |, y)
=
52
53
LD50: the dose x at which the death probability equals one half,
y_i / n_i = θ_i = 0.5,
so that α + β(LD50) = logit(0.5) = 0 and LD50 = −α/β.
[Figure: box plots and histograms of posterior draws for the bioassay example.]
Multivariate normal likelihood:
p(y | μ, Σ) ∝ |Σ|^{−n/2} exp{ −(1/2) Σ_i (y_i − μ)' Σ^{−1} (y_i − μ) }
           = |Σ|^{−n/2} exp{ −(1/2) tr(Σ^{−1} S) },
with
S = Σ_i (y_i − μ)(y_i − μ)'.
With Σ known and a conjugate normal prior μ ∼ N(μ0, Λ0), the posterior is
μ | y ∼ N(μ_n, Λ_n), with
μ_n = (Λ0^{−1} + n Σ^{−1})^{−1} (Λ0^{−1} μ0 + n Σ^{−1} ȳ),
Λ_n = (Λ0^{−1} + n Σ^{−1})^{−1}:
a precision-weighted average of the prior mean μ0 and the sample mean ȳ.
Partition μ = (μ^(1), μ^(2)) with corresponding blocks of the posterior mean and
variance,
μ_n = (μ_n^(1), μ_n^(2)),   Λ_n = [ Λ_n^(1,1)  Λ_n^(1,2) ; Λ_n^(2,1)  Λ_n^(2,2) ].
Then the conditional posterior of μ^(1) given μ^(2) is normal with
mean     μ^(1|2) = μ_n^(1) + Λ_n^(1,2) (Λ_n^(2,2))^{−1} (μ^(2) − μ_n^(2)),
variance Λ^(1|2) = Λ_n^(1,1) − Λ_n^(1,2) (Λ_n^(2,2))^{−1} Λ_n^(2,1).
Posterior predictive distribution:
p(ỹ | y) = ∫ p(ỹ | μ, y) p(μ | y) dμ.
Then:
1. Draw μ from p(μ | y) = N(μ_n, Λ_n)
2. Draw ỹ from p(ỹ | μ, y) = N(μ, Σ)
Alternatively (better), draw ỹ directly from
p(ỹ | y) = N(μ_n, Σ + Λ_n).
N (
(12)
( (12))2
(1)
2(2)
+ 2(2) (y1 ),
2(1) )
65
66
67
Conjugate prior for (μ, Σ): the normal-inverse-Wishart,
p(μ, Σ) ∝ |Σ|^{−((ν0 + d)/2 + 1)} exp{ −(1/2) tr(Λ0 Σ^{−1}) − (κ0/2)(μ − μ0)' Σ^{−1} (μ − μ0) }.
Posterior results:
Σ | y ∼ Inv-Wishart_{ν_n}(Λ_n^{−1})
μ | Σ, y ∼ N(μ_n, Σ/κ_n)
μ | y ∼ Mult-t_{ν_n − d + 1}( μ_n, Λ_n / (κ_n (ν_n − d + 1)) )
ỹ | y ∼ Mult-t
Here:
μ_n = (κ0 μ0 + n ȳ) / (κ0 + n),   κ_n = κ0 + n,   ν_n = ν0 + n,
Λ_n = Λ0 + S + [κ0 n / (κ0 + n)] (ȳ − μ0)(ȳ − μ0)',
S = Σ_i (y_i − ȳ)(y_i − ȳ)'.
If Q ∼ Wishart(S) then Q^{−1} ∼ Inv-Wishart(S^{−1}).
72
Prior hyperparameters: prior correlations ρ_ij = 0 for all i ≠ j, and ν0 = 7.
2.5%
5.0%
50.0%
95.0%
97.5%
213.5079 216.7256 231.6443 243.5567 247.1686
# bismuth
c(mean(sample.mu[,4]),sqrt(var(sample.mu[,4])))
[1] 127.248741
1.553327
quantile(sample.mu[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
2.5%
5.0%
50.0%
95.0%
97.5%
124.3257 124.6953 127.2468 129.7041 130.759
# silver
c(mean(sample.mu[,5]),sqrt(var(sample.mu[,5])))
[1] 38.2072916 0.7918199
quantile(sample.mu[,5],probs=c(0.025,0.05,0.5,0.95,0.975))
2.5%
5.0%
50.0%
95.0%
97.5%
36.70916 36.93737 38.1971 39.54205 39.73169
Posterior of 1/2 of
10
11
12
2.5%
5.0%
50.0%
95.0%
97.5%
0.5560137 0.5685715 0.6429459 0.7013519 0.7135556
# j = silver
c(mean(sample.rho[,5]),sqrt(var(sample.rho[,5])))
[1] 0.03642082 0.07232010
quantile(sample.rho[,5],probs=c(0.025,0.05,0.5,0.95,0.975))
2.5%
5.0%
50.0%
95.0%
97.5%
-0.09464575 -0.0831816 0.03370379 0.164939 0.1765523
13
S-Plus code
#
#
#
#
#
#
#
#
Cascade example
We need to define two functions, one to generate random vectors from
a multivariate normal distribution and the other to generate random
matrices from a Wishart distribution.
Note, that a function that generates random matrices from a
inverse Wishart distribution is not necessary to be defined
since if W ~ Wishart(S) then W^(-1) ~ Inv-Wishart(S^(-1))
14
rwishart_function(a,b) {
k_ncol(b)
m_matrix(0,nrow=k,ncol=1)
cc_matrix(0,nrow=a,ncol=k)
for (i in 1:a) { cc[i,]_rmnorm(m,b) }
w_t(cc)%*%cc
w
}
#
#Read the data
#
y_scan(file="cascade.data")
y_matrix(y,ncol=5,byrow=T)
y[,1]_y[,1]/100
means_rep(0,5)
for (i in 1:5){ means[i]_mean(y[,i]) }
means_matrix(means,ncol=1,byrow=T)
n_nrow(y)
#
#Assign values to the prior parameters
#
v0_7
Delta0_diag(c(100,5000,15000,1000,100))
k0_10
mu0_matrix(c(200,200,200,100,50),ncol=1,byrow=T)
# Calculate the values of the parameters of the posterior
#
mu.n_(k0*mu0+n*means)/(k0+n)
v.n_v0+n
k.n_k0+n
15
ones_matrix(1,nrow=n,ncol=1)
S = t(y-ones%*%t(means))%*%(y-ones%*%t(means))
Delta.n_Delta0+S+(k0*n/(k0+n))*(means-mu0)%*%t(means-mu0)
#
# Draw Sigma and mu from their posteriors
#
samplesize_200
lambda.samp_matrix(0,nrow=samplesize,ncol=5) #this matrix will store
#eigenvalues of Sigma
rho.samp_matrix(0,nrow=samplesize,ncol=5)
mu.samp_matrix(0,nrow=samplesize,ncol=5)
for (j in 1:samplesize) {
# Sigma
SS = solve(Delta.n)
# The following makes sure that SS is symmetric
for (pp in 1:5) { for (jj in 1:5) {
if(pp<jj){SS[pp,jj]=SS[jj,pp]} } }
Sigma_solve(rwishart(v.n,SS))
# The following makes sure that Sigma is symmetric
for (pp in 1:5) { for (jj in 1:5) {
if(pp<jj){Sigma[pp,jj]=Sigma[jj,pp]} } }
# Eigenvalue of Sigma
lambda.samp[j,]_eigen(Sigma)$values
# Correlation coefficients
for (pp in 1:5){
rho.samp[j,pp]_Sigma[pp,2]/sqrt(Sigma[pp,pp]*Sigma[2,2]) }
# mu
mu.samp[j,]_rmnorm(mu.n,Sigma/k.n)
16
}
# Graphics and summary statistics
sink("cascade.output")
# Ratio between the largest eigenvalue of Sigma and the second largest
ratio.l_lambda.samp[,1]/lambda.samp[,2]
# "Histogram of the draws of the ratio"
postscript("cascade_lambda.eps",height=5,width=6)
hist(ratio.l,nclass=30,axes = F,
xlab="eigenvalue1/eigenvalue2",xlim=c(3,9))
axis(1)
dev.off()
# Summary statistics of the ratio of the eigenvalues
c(mean(ratio.l),sqrt(var(ratio.l)))
quantile(ratio.l,probs=c(0.025,0.05,0.5,0.95,0.975))
#
# correlations with copper
#
postscript("cascade_corr.eps",height=8,width=8)
par(mfrow=c(2,2))
# "Histogram of the draws of corr(antimony,copper)"
hist(rho.samp[,1],nclass=30,axes = F,xlab="corr(antimony,copper)",
xlim=c(0.3,0.7))
axis(1)
# "Histogram of the draws of corr(arsenic,copper)"
hist(rho.samp[,3],nclass=30,axes = F,xlab="corr(arsenic,copper)")
axis(1)
18
19
Advanced Computation
Approximations based on posterior modes
Simulation from posterior distributions
Markov chain simulation
Why do we need advanced computational methods?
Except for simpler cases, computation is not possible with available
methods:
Logistic regression with random effects
Normal-normal model with unknown sampling variances j2
Poisson-lognormal hierarchical model for counts.
ENAR - March 26, 2006
d
log p(|y)
( )
2
d
=
d2
I() = 2 log p(|y),
d
the observed information matrix.
1
or [I()]
k N(k , Vk )
k
ENAR - March 26, 2006
X
k
1
k )}.
q(k |y) exp{ ( k )0V1
(
k
2
k
k
10
11
12
Rejection sampling: to draw from a target density proportional to f, draw X from
an instrumental density g and U ∼ U(0, 1), and accept X if
U ≤ f(X) / (M g(X)),
where M bounds f/g. The accepted draws have the target distribution:
P(X ≤ y | accepted) = P( X ≤ y, U ≤ f(X)/(M g(X)) ) / P( U ≤ f(X)/(M g(X)) )
  = [ ∫_{−∞}^{y} ∫_0^{f(x)/(M g(x))} du g(x) dx ] / [ ∫ ∫_0^{f(x)/(M g(x))} du g(x) dx ]
  = [ (1/M) ∫_{−∞}^{y} f(x) dx ] / [ (1/M) ∫ f(x) dx ].
2. For each f , there will be many instrumental densities g1, g2, ....
Choose the g that requires smallest bound M
3. M is necessarily larger than 1, and will approach minimum value 1
when g closely imitates f .
In general, g needs to have thicker tails than f for f /g to remain
bounded for all x.
Cannot use a normal g to generate values from a Cauchy f .
Can do the opposite, however.
Rejection sampling can be used within other algorithms, such as the
Gibbs sampler (see later).
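A minimal rejection-sampling sketch in R follows; the target (a Beta(3, 5) density) and the uniform instrumental density are illustrative choices, not from the notes.

# Sketch: rejection sampling with f = Beta(3,5) density, g = Uniform(0,1)
f <- function(x) dbeta(x, 3, 5)
g <- function(x) dunif(x)
M <- optimize(function(x) f(x) / g(x), c(0, 1), maximum = TRUE)$objective
draws <- replicate(5000, {
  repeat {
    x <- runif(1)                                  # propose from g
    if (runif(1) <= f(x) / (M * g(x))) break       # accept with prob f/(M g)
  }
  x
})
c(mean(draws), 3 / (3 + 5))   # Monte Carlo mean vs. true Beta(3,5) mean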
15
p()p(y; ) p(y; )
q(|y)
=
.
=
Those from the prior that are likely under the likelihood are kept in
the posterior sample.
ENAR - March 26, 2006
16
Importance sampling
Also known as SIR (Sampling Importance Resampling), the method is
no more than a weighted bootstrap.
Suppose we have a sample of draws (1, 2, ..., n) from the proposal
distribution g(). Can we convert it to a sample from q(|y)?
For each θ_i, compute the importance ratio and the normalized weight
φ_i = q(θ_i | y) / g(θ_i),   w_i = φ_i / Σ_j φ_j.
Draw θ from the discrete distribution over {θ_1, ..., θ_n} with weight w_i
on θ_i, without replacement.
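A sketch of the weighted-bootstrap (SIR) step follows. The unnormalized posterior (a Beta kernel) and the uniform proposal are purely illustrative choices.

# Sketch: sampling importance resampling
q <- function(th) th^4 * (1 - th)^14          # unnormalized "posterior"
g <- function(th) dunif(th)                   # proposal density
theta <- runif(2000)                          # draws from the proposal
w <- q(theta) / g(theta)
w <- w / sum(w)                               # normalized importance weights
post <- sample(theta, 500, replace = FALSE, prob = w)   # resample without replacement
quantile(post, c(0.025, 0.5, 0.975))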
n
X
wi1,a(i)
i=1
i 1,a (i )
i P
n1 i i
Eg q(|y)
g() 1,a (i )
Eg q(|y)
g()
R
q(|y)d
R a
q(|y)d
Z a
p(|y)d.
18
i.
19
Importance sampling for expectations: if θ_1, ..., θ_N were draws from p(θ | y),
E[h(θ) | y] ≈ (1/N) Σ_i h(θ_i).
Sampling from p(θ | y) may be difficult. But note:
E[h(θ) | y] = ∫ h(θ) [ p(θ | y) / g(θ) ] g(θ) dθ,
so with draws θ_i from g,
E[h(θ) | y] ≈ (1/N) Σ_i h(θ_i) w(θ_i),   where w(θ_i) = p(θ_i | y) / g(θ_i)
are the same importance weights from earlier.
This will not work if the tails of g are short relative to the tails of p.
Markov chains
A process X_t in discrete time t = 0, 1, 2, ..., T where
p(X_t | X_0, X_1, ..., X_{t−1}) = p(X_t | X_{t−1})
is called a Markov chain.
A Markov chain is irreducible if it is possible to reach all states from
any other state: for any pair of states (i, j) there exist m, n > 0 such that
p^n(j | i) > 0 and p^m(i | j) > 0.
26
1X
a(Xt)
n t
E{a(X)} as n .
a
n is an ergodic average.
Also, rate of convergence can be calculated and is geometric.
27
28
29
MCMC methods all based on the same idea; difference is just in how
the transitions in the MC are created.
In MCMC simulation, we generate at least one MC for each parameter
in the model. Often, more than one (independent) chain for each
parameter
30
31
32
Example: bivariate normal. A single observation (y_1, y_2) with
( y_1 ; y_2 ) ∼ N( ( θ_1 ; θ_2 ), [ 1  ρ ; ρ  1 ] )
and a flat prior on θ gives θ | y ∼ N( (y_1 ; y_2), [ 1  ρ ; ρ  1 ] ).
The Gibbs full conditionals are
θ_1 | θ_2, y ∼ N( y_1 + ρ(θ_2 − y_2), 1 − ρ^2 )
θ_2 | θ_1, y ∼ N( y_2 + ρ(θ_1 − y_1), 1 − ρ^2 ).
After 50 iterations...
Normal example with semi-conjugate prior: the joint posterior is
p(μ, σ^2 | y) ∝ (σ^2)^{−n/2} exp{ −(1/(2σ^2)) Σ_i (y_i − μ)^2 }
               × exp{ −(μ − μ0)^2 / (2τ0^2) } (σ^2)^{−(ν0/2 + 1)} exp{ −ν0σ0^2 / (2σ^2) }.
For μ:
p(μ | σ^2, y) ∝ exp{ −(1/2) [ (1/σ^2) Σ_i (y_i − μ)^2 + (1/τ0^2)(μ − μ0)^2 ] }
             ∝ exp{ −(1/(2σ^2τ0^2)) [ (nτ0^2 + σ^2) μ^2 − 2(nτ0^2 ȳ + σ^2 μ0) μ ] }
             ∝ exp{ −(nτ0^2 + σ^2) [ μ − (nτ0^2 ȳ + σ^2 μ0)/(nτ0^2 + σ^2) ]^2 / (2σ^2τ0^2) }
             = N(μ_n, τ_n^2),
where
μ_n = ( μ0/τ0^2 + n ȳ/σ^2 ) / ( 1/τ0^2 + n/σ^2 )   and   τ_n^2 = 1 / ( 1/τ0^2 + n/σ^2 ).
For σ^2:
p(σ^2 | μ, y) ∝ (σ^2)^{−((n + ν0)/2 + 1)} exp{ −(1/(2σ^2)) [ Σ_i (y_i − μ)^2 + ν0σ0^2 ] }
             = Inv-χ^2(ν_n, σ_n^2),
where
ν_n = ν0 + n,   ν_n σ_n^2 = n S^2 + ν0 σ0^2,   and   S^2 = (1/n) Σ_i (y_i − μ)^2.
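These two full conditionals can be alternated directly; the sketch below is a minimal Gibbs sampler for this model with made-up data and hyperparameters.

# Sketch: Gibbs sampler for the normal model with semi-conjugate prior
y <- c(9.2, 10.1, 8.7, 11.0, 9.8, 10.4)          # made-up data
n <- length(y)
mu0 <- 10; tau02 <- 4; nu0 <- 1; s02 <- 1        # assumed hyperparameters
M <- 5000
mu <- numeric(M); sig2 <- numeric(M)
sig2[1] <- var(y)                                # starting value
for (t in 2:M) {
  # mu | sigma^2, y ~ N(mu_n, tau_n^2)
  prec  <- 1 / tau02 + n / sig2[t - 1]
  mu_n  <- (mu0 / tau02 + n * mean(y) / sig2[t - 1]) / prec
  mu[t] <- rnorm(1, mu_n, sqrt(1 / prec))
  # sigma^2 | mu, y ~ Inv-chi^2(nu0 + n, (nu0*s0^2 + sum((y - mu)^2)) / (nu0 + n))
  nun     <- nu0 + n
  scale_n <- (nu0 * s02 + sum((y - mu[t])^2)) / nun
  sig2[t] <- nun * scale_n / rchisq(1, df = nun)
}
c(mean(mu[-(1:1000)]), mean(sig2[-(1:1000)]))    # posterior means after burn-in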
with
n( 2(0)) =
2(0)
y + 12 0
n
2(0)
and
n2( 2(0))
mean
+ 12
and
variance
1
= n
1 .
+
2
2(0)
0
37
(1)
S (
)=n
(yi (1))2.
38
Change-point example. For i = 1, ..., m,
y_i ∼ Poisson(λ),
and for i = m + 1, ..., n,
y_i ∼ Poisson(φ),
where m is unknown.
Priors on (λ, φ, m):
λ ∼ Gamma(α, β),   φ ∼ Gamma(γ, δ),   m ∼ U[1, n].
Joint posterior:
p(λ, φ, m | y) ∝ ∏_{i=1}^{m} e^{−λ} λ^{y_i}  ∏_{i=m+1}^{n} e^{−φ} φ^{y_i}
                 λ^{α−1} e^{−βλ}  φ^{γ−1} e^{−δφ}  n^{−1}
               ∝ λ^{Y_1 + α − 1} e^{−(m + β)λ}  φ^{Y_2 + γ − 1} e^{−(n − m + δ)φ},
with Y_1 = Σ_{i=1}^{m} y_i and Y_2 = Σ_{i=m+1}^{n} y_i.
Conditional of λ:
p(λ | m, φ, y) ∝ λ^{Σ_{i=1}^{m} y_i + α − 1} exp{ −(m + β)λ }
              = Gamma( α + Σ_{i=1}^{m} y_i,  m + β ).
Conditional of φ:
p(φ | λ, m, y) ∝ φ^{Σ_{i=m+1}^{n} y_i + γ − 1} exp{ −(n − m + δ)φ }
              = Gamma( γ + Σ_{i=m+1}^{n} y_i,  n − m + δ ).
Conditional of m = 1, 2, ..., n:
p(m | λ, φ, y) = c^{−1} q(m | λ, φ),
where
q(m | λ, φ) = λ^{Y_1 + α − 1} exp{ −(m + β)λ } φ^{Y_2 + γ − 1} exp{ −(n − m + δ)φ }
and
c = Σ_{k=1}^{n} q(k | λ, φ, y).
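A runnable sketch of this Gibbs sampler follows; the simulated data and the unit hyperparameters are hypothetical, chosen only to exercise the three full conditionals above.

# Sketch: Gibbs sampler for the Poisson change-point model
set.seed(1)
y <- c(rpois(30, 4), rpois(20, 1))            # simulated data with a change point
n <- length(y)
a <- b <- cc <- d <- 1                        # assumed hyperparameters
M <- 5000
lambda <- phi <- numeric(M); m <- integer(M)
m[1] <- n %/% 2
for (t in 2:M) {
  s1 <- sum(y[1:m[t - 1]]); s2 <- sum(y) - s1
  lambda[t] <- rgamma(1, a + s1, b + m[t - 1])
  phi[t]    <- rgamma(1, cc + s2, d + n - m[t - 1])
  logq <- sapply(1:n, function(k) {           # discrete conditional for m
    sum(dpois(y[1:k], lambda[t], log = TRUE)) +
      sum(dpois(y[-(1:k)], phi[t], log = TRUE))
  })
  m[t] <- sample(1:n, 1, prob = exp(logq - max(logq)))
}
quantile(m[-(1:1000)], c(0.025, 0.5, 0.975))  # posterior summary of the change point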
Non-standard distributions
It may happen that one or more of the full conditionals is not a standard
distribution
What to do then?
Try direct simulation: grid approximation, rejection sampling
Try approximation: normal or t approximation, need mode at each
iteration (see later)
Try more general Markov chain algorithms: Metropolis or MetropolisHastings.
43
Metropolis-Hastings algorithm
More flexible transition kernel: rather than requiring sampling from
conditional distributions, M-H permits using many other proposal
densities
Idea: instead of drawing sequentially from conditionals as in Gibbs,
M-H jumps around the parameter space
The algorithm is the following:
1. Given the draw θ^t in iteration t, sample a candidate draw θ* from a
proposal distribution J(θ* | θ^t).
2. Accept the draw with probability
r = [ p(θ* | y) / J(θ* | θ^t) ] / [ p(θ^t | y) / J(θ^t | θ*) ].
3. Stay in place (do not accept the draw) with probability 1 − r, i.e.,
θ^{(t+1)} = θ^{(t)}.
Remarkably, the proposal distribution (the text calls it the jumping distribution)
can have just about any form.
When the proposal distribution is symmetric, i.e.
J(θ* | θ) = J(θ | θ*),
the Metropolis-Hastings acceptance probability reduces to
r = p(θ* | y) / p(θ^t | y).
This is the Metropolis algorithm.
ENAR - March 26, 2006
45
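A minimal random-walk Metropolis sketch in R is given below. The target is an artificial unnormalized log posterior (a N(3, 2^2) density), used only to illustrate the accept/reject mechanics; the proposal sd is a tuning choice.

# Sketch: random-walk Metropolis for a generic log posterior
log_post <- function(theta) dnorm(theta, mean = 3, sd = 2, log = TRUE)
M <- 10000; theta <- numeric(M); theta[1] <- 0
step <- 1.5                                   # proposal sd (tuning parameter)
for (t in 2:M) {
  cand <- rnorm(1, theta[t - 1], step)        # symmetric jumping distribution
  logr <- log_post(cand) - log_post(theta[t - 1])
  theta[t] <- if (log(runif(1)) < logr) cand else theta[t - 1]
}
c(mean(theta[-(1:2000)]), sd(theta[-(1:2000)]))   # should approach 3 and 2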
Proposal distributions
Convergence does not depend on J, but rate of convergence does.
Optimal J is p(|y) in which case r = 1.
Else, how do we choose J?
1. It is easy to get samples from J
2. It is easy to compute r
3. It leads to rapid convergence and mixing: jumps should be large
enough to take us everywhere in the parameter space, but not so
large that draws are rarely accepted (see figure from Gilks et al. (1995)).
Three main approaches:
random walk M-H (most popular),
independence sampler, and approximation M-H
ENAR - March 26, 2006
46
Independence sampler
Proposal distribution Jt does not depend on t.
Just find a distribution g() and generate values from it
Can work very well if g(.) is a good approximation to p(|y) and g has
heavier tails than p.
Can work awfully bad otherwise
The acceptance probability is
r = [ p(θ* | y) / g(θ*) ] / [ p(θ^t | y) / g(θ^t) ].
48
49
50
Approximation M-H
Idea is to improve approximation of J to p(|y) as we know more about
.
E.g., in random walk M-H, can perhaps increase acceptance rate by
considering
J(|(t1)) = N (|(t1), V(t1) )
variance also depends on current draw
Proposals here typically not symmetric, requires full r expression.
51
Starting values
If chain is irreducible, choice of 0 will not affect convergence.
With multiple chains (see later), can choose over-dispersed starting
values for each chain. Possible algorithm:
1. Find posterior modes
2. Create over-dispersed approximation to posterior at mode (e.g., t4)
3. Sample values from that distribution
Not much research on this topic.
52
stationary
distribution
53
Convergence
Impossible to decide whether chain has converged, can only monitor
behavior.
Easiest approach: graphical displays (trace plots in WinBUGS).
A bit more formal (and most popular in terms of use): the potential scale
reduction factor √R̂ of Gelman and Rubin (1992).
To use the G-R diagnostic, must generate multiple independent chains
for each parameter.
The G-R diagnostic compares the within-chain sd to the between-chain
sd.
ENAR - March 26, 2006
54
Between-chain variance (J chains, each of length n, with chain means θ̄_{.j}):
θ̄_{..} = (1/J) Σ_j θ̄_{.j},   B = [n / (J − 1)] Σ_j (θ̄_{.j} − θ̄_{..})^2.
Within-chain variance:
W = [1 / (J(n − 1))] Σ_j Σ_i (θ_{ij} − θ̄_{.j})^2.
Estimated posterior variance:
var̂(θ | y) = [(n − 1)/n] W + (1/n) B.
Early in the iterations, var̂(θ | y) overestimates the posterior variance, and
R̂ = √( var̂(θ | y) / W ),
which goes to 1 as n → ∞.
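The statistic is easy to compute from stored chains; the sketch below assumes J chains of equal length stored as the columns of a matrix and uses simulated, well-mixed chains only to have something to evaluate.

# Sketch: Gelman-Rubin statistic for chains stored column-wise
rhat <- function(chains) {
  n <- nrow(chains); J <- ncol(chains)
  B <- n * var(colMeans(chains))           # between-chain variance
  W <- mean(apply(chains, 2, var))         # within-chain variance
  var_hat <- (n - 1) / n * W + B / n       # pooled estimate of var(theta | y)
  sqrt(var_hat / W)
}
chains <- matrix(rnorm(4000, mean = 5), ncol = 4)   # 4 fake, well-mixed chains
rhat(chains)                                        # should be close to 1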
Hierarchical binomial model:
Sampling distribution: p(y | θ) = Bin(n, θ)
Prior: p(θ | α, β) = Beta(α, β)
Hyperprior: p(α, β)
The posterior for θ in the current experiment (4 tumors out of 14 rats) is
p(θ | y) = Beta(α + 4, β + 10).
Where can we get good guesses for and ?
One possibility: from the literature, look at data from similar
experiments. In this example, information from 70 other experiments
is available.
In jth study, yj is the number of rats with tumors and nj is the sample
size, j = 1, .., 70.
See Table 5.1 and Figure 5.1 in Gelman et al..
Model the yj as independent binomial data given the study-specific j
and sample size nj
ENAR - March 26, 2006
Matching moments of the historical tumor rates:
α/(α + β) = 0.136,   so   α + β = α/0.136,
and
αβ / [ (α + β)^2 (α + β + 1) ] = (0.103)^2.
The resulting estimate for (α, β) is (1.4, 8.6).
Then posterior is
p(|y) = Beta(5.4, 18.6)
with posterior mean 0.223, lower than the sample mean, and posterior
standard deviation 0.083.
Posterior point estimate is lower than crude estimate; indicates that in
current experiment, number of tumors was unusually high.
Why not go back now and use the prior to obtain better estimates for
tumor rates in earlier 70 experiments?
Should not do that because:
1. Cant use the data twice: we use the historical data to get the
prior, and cannot now combine that prior with the data from the
experiments for inference
2. Using a point estimate for (, ) suggests that there is no uncertainty
about (, ): not true
ENAR - March 26, 2006
Jj=1p(j |)p()d.
10
11
12
p(, |y)d,
13
p(,|y)
p(|y,) .
14
15
p(, ) j
16
p(, , |y)
p(|y, , )
( + ) ( + yj )( + nj yj )
()()
( + + nj )
17
But what is p(, )? As it turns out, many of the obvious choices for
the prior lead to improper posterior. (See solution to Exercise 5.7, on
the course web site.)
Some obvious choices such as p(, |y) 1 lead to non-integrable
posterior.
Integrability can be checked analytically by evaluating the behavior of
the posterior as , (or functions) go to .
An empirical assessment can be made by looking at the contour plot
of p(, |y) over a grid. Significant mass extending towards infinity
suggests that the posterior will not integrate to a constant.
Other choices leading to non-integrability of p(, |y) include:
ENAR - March 26, 2006
18
p(
, + ) 1.
+
A flat prior on the log(mean) and log(degrees of freedom)
p(log(/), log( + )) 1.
A reasonable choice for the prior is a flat distribution on the prior mean
and square root of inverse of the degrees of freedom:
p(
, ( + )1/2) 1.
+
This is equivalent to
p(, ) ( + )5/2.
ENAR - March 26, 2006
19
Also equivalent to
p(log(/), log( + ) ( + )5/2.
We use the parameterization (log(/), log( + )).
Idea for drawing values of and from their posterior is the usual:
evaluate p(log(/), log( + )|y) over grid of values of (, ), and
then use inverse cdf method.
For this problem, easier to evaluate the log posterior and then
exponentiate.
Grid: From earlier estimates, potential centers for the grid are =
1.4, = 8.6. In the new parameterization, this translates into u =
log(/) = 1.8, v = log( + ) = 2.3.
ENAR - March 26, 2006
20
Grid too narrow leaves mass outside. Try [2.3, 1.3] [1, 5].
Steps for computation:
1. Given draws (u, v), transform back to (, ).
2. Finally, sample j from Beta( + yj , + nj yj )
21
Let ȳ_{.j} = (1/n_j) Σ_i y_{ij}; then
ȳ_{.j} | θ_j ∼ N(θ_j, σ_j^2),
with σ_j^2 = σ^2 / n_j.
Sampling model for y.j is quite general. For nj large, y.j is normal
even if yij is not.
Need now to think of priors for the j .
What type of posterior estimates for j might be reasonable?
1.
2.
j = y.. = ( j )
.j : pooled estimate, reasonable if
j j y
we believe that all means are the same.
23
ANOVA table:

Source           df       SS    MS             E(MS)
Between groups   J-1      SSB   SSB / (J-1)    n τ^2 + σ^2
Within groups    J(n-1)   SSE   SSE / J(n-1)   σ^2

A method-of-moments estimate of τ^2 is (MSB − MSE) / n.
25
Normal-normal model
Set-up (for σ^2 known):
ȳ_{.j} | θ_j ∼ N(θ_j, σ_j^2)
θ_1, ..., θ_J | μ, τ ∼ ∏_j N(μ, τ^2)
p(μ, τ^2): hyperprior.
It follows that
p(θ_1, ..., θ_J) = ∫ ∏_j N(θ_j | μ, τ^2) p(μ, τ^2) dμ dτ^2.
The joint prior can be written as p(μ, τ) = p(μ | τ) p(τ). For μ, we will
use a flat prior given τ.
Conditional posterior of θ_j given (μ, τ):
θ_j | μ, τ, y ∼ N(θ̂_j, V_j),
with
θ̂_j = ( σ_j^2 μ + τ^2 ȳ_{.j} ) / ( σ_j^2 + τ^2 ),   V_j^{−1} = 1/σ_j^2 + 1/τ^2.
The joint posterior factors as
p(θ, μ, τ | y) = p(θ | μ, τ, y) p(μ, τ | y).
p(y|, , )p(|, )d
Z Y
j
N(
y.j ; j , j2)
29
+var[E(
y.j |j , , )|, ]
= E(j2|, ) + var(j |, )
= j2 + 2
Therefore
p(, |y) p( )
N(
y.j ; , j2 + 2)
We know that
p(, |y) p( )p(| )
"
Y
1
2
(
y
)
.j
2(j2 + 2)
30
Conditional posterior of μ given τ:
p(μ | τ, y) = N(μ̂, V_μ),
with
μ̂ = [ Σ_j ȳ_{.j}/(σ_j^2 + τ^2) ] / [ Σ_j 1/(σ_j^2 + τ^2) ],   V_μ^{−1} = Σ_j 1/(σ_j^2 + τ^2).
Marginal posterior of τ:
p(τ | y) = p(μ, τ | y) / p(μ | τ, y)
        ∝ p(τ) V_μ^{1/2} ∏_j N( ȳ_{.j}; μ̂, σ_j^2 + τ^2 ).
33
34
35
36
P 2
P
2
2
1
j , and 2 = (J
(yij y.j ) ,
= J
j2 = (nj 1)1
P
1
1)
(
y.j y..)2.
Data
The following table contains data that represents coagulation time
in seconds for blood drawn from 24 animals randomly allocated to four
different diets. Different treatments have different numbers of observations
because the randomization was unrestricted.
Diet   Measurements
1      62, 60, 63, 59
2      63, 67, 71, 64, 65, 66
3      68, 66, 71, 67, 68, 68
4      56, 62, 60, 61, 63, 64, 63, 59
Conditional maximization for the θ_j: given (μ, σ, τ),
θ̂_j = ( μ/τ^2 + n_j ȳ_{.j}/σ^2 ) / ( 1/τ^2 + n_j/σ^2 ),
and V_j^{−1} = 1/τ^2 + n_j/σ^2.
For j = 1, ..., J, maximize the conditional posteriors by using θ̂_j in place of the
current estimate θ_j.
Conditional mode of μ:
μ | θ, σ, τ, y ∼ N(μ̂, τ^2/J),   with μ̂ = (1/J) Σ_j θ_j.
Conditional maximization
Conditional mode of log σ: first derive the conditional posterior for σ^2:
σ^2 | θ, μ, τ, y ∼ Inv-χ^2(n, σ̂^2),
with σ̂^2 = n^{−1} Σ_j Σ_i (y_{ij} − θ_j)^2.
The mode of the Inv-χ^2 is σ̂^2 n/(n + 2). To get the mode for log σ,
use the transformation. The term n/(n + 2) disappears with the Jacobian, so the
conditional mode of log σ is log σ̂.
Conditional mode of log τ: same reasoning. Note that
τ^2 | θ, μ, σ^2, y ∼ Inv-χ^2(J − 1, τ̂^2),
with τ̂^2 = (J − 1)^{−1} Σ_j (θ_j − μ)^2. After accounting for the Jacobian of the
transformation, the conditional mode of log τ is log τ̂.
Conditional maximization
Starting from crude estimates, conditional maximization required only
three iterations to converge approximately (see table)
Log posterior increased at each step
Values in final iteration is approximate joint mode.
When J is large relative to the nj , joint mode may not provide good
summary of posterior. Try to get marginal modes.
In this problem, factor:
p(, , log , log |y) = p(|, log , log , y)p(, log , log |y).
Marginal of , log , log is three-dimensional regardless of J and nj .
ENAR - March 26, 2006
41
Stepwise ascent:

Parameter            Crude      First      Second     Third      Fourth
                     estimate   iteration  iteration  iteration  iteration
theta_1              61.000     61.282     61.288     61.290     61.290
theta_2              66.000     65.871     65.869     65.868     65.868
theta_3              68.000     67.742     67.737     67.736     67.736
theta_4              61.000     61.148     61.152     61.152     61.152
mu                   64.000     64.010     64.011     64.011     64.011
sigma                 2.291      2.160      2.160      2.160      2.160
tau                   3.559      3.318      3.318      3.312      3.312
log p(params.|y)    -61.604    -61.420    -61.420    -61.420    -61.420
Marginal maximization
Recall algebraic trick:
p(, log , log |y) =
Using in place of :
1/2
43
Can maximize p(, log , log |y) using EM. Here, (j ) are the missing
data.
Steps in EM:
1. Average over missing data in E-step
2. Maximize over (, log , log ) in M-step
44
E-step: average over θ using the conditioning trick (and p(θ | rest)). Two
expectations are needed:
E_old[(θ_j − μ)^2] = E[(θ_j − μ)^2 | μ_old, σ_old, τ_old, y]
                   = [E_old(θ_j) − μ]^2 + var_old(θ_j) = (θ̂_j − μ)^2 + V_j.
Similarly:
E_old[(y_{ij} − θ_j)^2] = (y_{ij} − θ̂_j)^2 + V_j.
M-step: maximize over (μ, log σ, log τ):
μ_new = (1/J) Σ_j θ̂_j,
σ_new = [ (1/n) Σ_j Σ_i ( (y_{ij} − θ̂_j)^2 + V_j ) ]^{1/2},
τ_new = [ (1/(J − 1)) Σ_j ( (θ̂_j − μ_new)^2 + V_j ) ]^{1/2}.
EM iterations:

Parameter   Value at joint mode   First iteration   Second iteration   Third iteration
mu          64.01                 64.01             64.01              64.01
sigma        2.17                  2.33              2.36               2.36
tau          3.31                  3.46              3.47               3.47
48
j |all N (J of them)
|all N
2|all Inv 2
2|all Inv 2.
Starting values for the Gibbs sampler can be drawn from, e.g., a t4
approximation to the marginal and conditional mode.
ENAR - March 26, 2006
49
50
51
Posterior quantiles:

Estimand                              2.5%     25.0%    50.0%    75.0%    97.5%
theta_1                              58.92    60.44    61.23    62.08    63.69
theta_2                              63.96    65.26    65.91    66.57    67.94
theta_3                              65.72    67.11    67.77    68.44    69.75
theta_4                              59.39    60.56    61.14    61.71    62.89
mu                                   55.64    62.32    63.99    65.69    73.00
sigma                                 1.83     2.17     2.41     2.70     3.47
tau                                   1.98     3.45     4.97     7.76    24.60
log p(mu, log sigma, log tau | y)   -70.79   -66.87   -65.36   -64.20   -62.71
log p(theta, mu, log sigma, log tau | y)  -71.07   -66.88   -65.25   -64.00   -62.42
Checking convergence

Parameter                  R       97.5% upper
theta_1                    1.001   1.003
theta_2                    1.001   1.003
theta_3                    1.000   1.002
theta_4                    1.000   1.001
mu                         1.005   1.005
sigma                      1.000   1.001
tau                        1.012   1.013
log posterior (marginal)   1.000   1.000
log posterior (joint)      1.000   1.001
[Figures: trace plots of the Gibbs draws of theta_1-theta_4 (iterations 500-1000);
histograms of the posterior draws of theta_1-theta_4; chains for mu, sigma2, and
tau2; and posterior histograms for mu, sigma, and tau.]
For study j, the estimated log-odds ratio is
y_j = log[ ( y_{1j} / (n_{1j} − y_{1j}) ) / ( y_{0j} / (n_{0j} − y_{0j}) ) ],
with estimated sampling variance
σ_j^2 = 1/y_{1j} + 1/(n_{1j} − y_{1j}) + 1/y_{0j} + 1/(n_{0j} − y_{0j}).
See Table 5.4 for the estimated log-odds ratios and their estimated
standard errors.
ENAR - March 26, 2006
60
Raw data:

Study   Control           Treated           Log-odds    sd
j       deaths   total    deaths   total    y_j         sigma_j
1          3       39        3       38      0.0282     0.8503
2         14      116        7      114     -0.7410     0.4832
3         11       93        5       69     -0.5406     0.5646
4        127     1520      102     1533     -0.2461     0.1382
5         27      365       28      355      0.0695     0.2807
6          6       52        4       59     -0.5842     0.6757
7        152      939       98      945     -0.5124     0.1387
8         48      471       60      632     -0.0786     0.2040
9         37      282       25      278     -0.4242     0.2740
10       188     1921      138     1916     -0.3348     0.1171
11        52      583       64      873     -0.2134     0.1949
12        47      266       45      263     -0.0389     0.2295
13        16      293        9      291     -0.5933     0.4252
14        45      883       57      858      0.2815     0.2054
15        31      147       25      154     -0.3213     0.2977
16        38      213       33      207     -0.1353     0.2609
17        12      122       28      251      0.1406     0.3642
18         6      154        8      151      0.3220     0.5526
19         3      134        6      174      0.4444     0.7166
20        40      218       32      209     -0.2175     0.2598
21        43      364       27      391     -0.5911     0.2572
22        39      674       22      680     -0.6081     0.2724
Goals of analysis
Goal 1: if studies can be assumed to be exchangeable, we wish to
estimate the mean of the distribution of effect sizes, or overall average
effect.
Goal 2: the average effect size in each of the exchangeable studies
Goal 3: the effect size that could be expected if a new, exchangeable
study, were to be conducted.
62
63
Posterior 95% intervals for the study effects:

Study j    2.5%     97.5%
1         -0.58      0.13
2         -0.63     -0.03
3         -0.64      0.06
4         -0.44     -0.04
5         -0.44      0.13
6         -0.62      0.04
7         -0.62     -0.16
8         -0.43      0.08
9         -0.56     -0.05
10        -0.48     -0.12
11        -0.47     -0.01
12        -0.42      0.09
13        -0.65      0.02
14        -0.34      0.30
15        -0.54      0.00
16        -0.50      0.06
17        -0.46      0.15
18        -0.53      0.15
19        -0.51      0.17
20        -0.51      0.04
21        -0.67     -0.09
22        -0.69     -0.07
Posterior quantiles:

Estimand                     2.5%    25.0%   50.0%   75.0%   97.5%
Mean, mu                    -0.38   -0.29   -0.25   -0.21   -0.12
Standard deviation, tau      0.01    0.08    0.13    0.18    0.32
Predicted effect, theta_j   -0.57   -0.33   -0.25   -0.17    0.08

[Figure: marginal posterior density of tau.]
65
0.4
19
0.2
18
14
0.0
-0.4
-0.2
1
12
8
16
20
11
4
10
15
9
13
22
-0.6
17
7
3
6
21
2
0.0
0.5
1.0
1.5
2.0
tau
66
0.8
0.6
3
18
2
0.4
13
17
15
5
9 16
22 20
21
12
8 14
11
0.2
19
6
10
0.0
0.5
1.0
1.5
4 7
2.0
tau
67
-1.0
-0.6
-0.2
0.2
-0.8
Study 2
-0.8
-0.4
Study 7
-0.4
0.0
0.4
Study 18
0.0
-0.6
-0.2
0.2
0.6
Study 14
68
-0.24
-0.26
-0.25
E(mu|tau,y)
-0.23
0.0
0.5
1.0
1.5
2.0
tau
69
Method
Government
Lab 1
5.19
5.09
Sheffield
3.26
3.38
3.24
3.41
3.35
3.04
Lab 2
4.09
3.0
3.75
4.04
4.06
3.02
3.32
2.83
2.96
3.23
3.07
Lab 3
4.62
4.32
4.35
4.59
Lab 4
3.71
3.86
3.79
3.63
3.08
2.95
2.98
2.74
3.07
2.70
2.98
2.89
2.75
3.04
2.88
3.20
[Model fragment and WinBUGS output: N(0, sigma^2) priors on the random effects,
followed by trace plots, autocorrelation plots, and posterior histograms for the
model parameters.]
Model checking
What do we need to check?
Remember: models are never true; they may just fit the data well and
allow for useful inference.
Model checking strategies must address various parts of models:
priors
sampling distribution
hierarchical structure
ENAR - March 26, 2006
Bayes p-values
p-values attempt to measure tail-area probabilities.
Classical definition:
class p value = Pr(T (y rep) T (y)|)
Probability is taken over distribution of y rep with fixed
Point estimate typically used to compute the pvalue.
Posterior predictive p-values:
Bayes p value = Pr(T (y rep, ) T (y, )|y)
ENAR - March 26, 2006
= Pr
n
y
S
= P
This is special case:
parameters.
tn1
rep
n
y
n
y 2
|
S
Bayes approach:
p value = P r(T (y rep) T (y)|y)
Z Z
=
IT (yrep)T (y)p(y rep| 2)p( 2|y)dy repd 2
Note that
IT (yrep)T (y)p(y rep| 2) = P (T (y rep) T (y)| 2)
ENAR - March 26, 2006
10
= classical p-value
Then:
Bayes p value = E{p valueclass|y}
where the expectation is taken with respect to p( 2|y).
In this example, classical pvalue and Bayes pvalue are the same.
In general, Bayes can handle easily nuisance parameters.
11
The posterior is proportional to θ^s (1 − θ)^{n−s}, with s = Σ_i y_i.
Sample:
1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
Is the assumption of independence warranted?
Consider discrepancy statistic
T (y, ) = T (y) = number of switches between 0 and 1.
ENAR - March 26, 2006
13
In sample, T (y) = 3.
Posterior is Beta(8,14).
To test assumption of independence, do:
1. For j = 1, ..., M draw j from Beta(8,14).
rep,j
rep,j
, ..., y20 } independent Bernoulli variables with
2. Draw {y1
probability j .
3. In each of the M replicate samples, compute T (y), the number of
switches between 0 and 1.
p_B = Prob( T(y^rep) ≥ T(y) | y ) = 0.98.
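The check is easy to run directly; the sketch below reproduces the steps above for the binary sequence given earlier.

# Sketch: posterior predictive check with T(y) = number of switches
y <- c(1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0)
T_obs <- sum(diff(y) != 0)                     # observed number of switches (= 3)
M <- 10000
T_rep <- replicate(M, {
  theta <- rbeta(1, 8, 14)                     # draw from the Beta(8, 14) posterior
  yrep  <- rbinom(length(y), 1, theta)         # replicated data
  sum(diff(yrep) != 0)
})
mean(T_rep >= T_obs)                           # posterior predictive p-value (~0.98)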
14
15
16
17
18
19
20
21
s2y
1
= i(yi y)2
65
22
23
24
25
Posterior predictive checks for three test quantities under two model fits:

T(y)    95% CI for T(y^rep)   p-value     95% CI for T(y^rep)   p-value
77.3    [75.5, 78.2]          0.27        [74.8, 79.9]          0.53
 5.1    [5.0, 6.5]            0.95        [3.8, 6.3]            0.44
 8.4    [5.3, 7.9]            0.005       [4.9, 7.8]            0.004
Omnibus tests
It is often useful to also consider summary statistics:
χ^2 discrepancy: T(y, θ) = Σ_i ( y_i − E(y_i | θ) )^2 / var(y_i | θ).
28
29
30
31
Model comparison
Which model fits the data best?
Often, models are nested: a model with parameters is nested within
a model with parameters (, ).
Comparison involves deciding whether adding to the model improves
its fit. Improvement in fit may not justify additional complexity.
It is also possible to compare non-nested models.
We focus on predictive performance and on model posterior probabilities
to compare models.
32
Expected deviance
We compare the observed data to several models to see which predicts
more accurately.
We summarize model fit using the deviance
D(y, ) = 2 log p(y|).
It can be shown that the model with the lowest expected deviance is
best in the sense of minimizing the (Kullback-Leibler) distance between
the model p(y|) and the true distribution of y, f (y).
We compute the expected deviance by simulation:
Davg(y) = E|y (D(y, )|y).
ENAR - March 26, 2006
33
An estimate is
D̂_avg(y) = (1/M) Σ_{j=1}^{M} D(y, θ^j).
35
36
j
2
= log(2J ) +
((yj j )2/j2)
The DIC for the three models were: 70.3 (no pooling), 61.5 (complete
pooling), 63.4 (hierarchical model).
Based on DIC, we would pick the complete pooling model.
ENAR - March 26, 2006
37
38
Bayes Factors
Suppose that we wish to decide between two models M1 and M2
(different prior, sampling distribution, parameters).
Priors on models are p(M1) and p(M2) = 1 p(M1).
The posterior odds favoring M1 over M2 are
p(M1|y) p(y|M1) p(M1)
=
.
p(M2|y) p(y|M2) p(M2)
The ratio p(y|M1)/p(y|M2) is called a Bayes factor.
It tells us how much the data support one model over the other.
ENAR - March 26, 2006
39
BF = [ ∫ p(y | θ_1, M_1) p(θ_1 | M_1) dθ_1 ] / [ ∫ p(y | θ_2, M_2) p(θ_2 | M_2) dθ_2 ].
Each factor is a marginal likelihood of the form
p(y) = ∫ p(y | θ) p(θ) dθ.
The simplest approach is to draw many values of from p() and get
a Monte Carlo approximation to the integral.
May not work well because the s from the prior may not come from
the parameter space region where the sampling distribution has most
mass.
A better Monte Carlo approximation is similar to importance sampling.
ENAR - March 26, 2006
41
Note that
1
p(y)
h()
d
p(y)
h()
p(|y)d.
p(y|)p()
=
=
42
43
Linear regression: E(y_i | β, X) = Σ_{j=1}^{k} β_j x_{ij}, with the non-informative
prior p(β, σ^2) ∝ σ^{−2}.
Joint posterior:
p(β, σ^2 | y) ∝ (σ^2)^{−n/2 − 1} exp{ −(1/(2σ^2)) (y − Xβ)'(y − Xβ) }.
Consider the factorization
p(β, σ^2 | y) = p(β | σ^2, y) p(σ^2 | y).
Conditional posterior for β:
p(β | σ^2, y) ∝ exp{ −(1/(2σ^2)) (β − β̂)' X'X (β − β̂) },
for
β̂ = (X'X)^{−1} X'y.
Then:
β | σ^2, y ∼ N( β̂, σ^2 (X'X)^{−1} ).
Marginal posterior of σ^2:
p(σ^2 | y) = ∫ p(β, σ^2 | y) dβ
          ∝ ∫ (σ^2)^{−n/2 − 1} exp{ −(1/(2σ^2)) [ (n − k)S^2 + (β − β̂)' X'X (β − β̂) ] } dβ
          ∝ (σ^2)^{−(n − k)/2 − 1} exp{ −(n − k)S^2 / (2σ^2) },
i.e., σ^2 | y ∼ Inv-χ^2(n − k, S^2), with
S^2 = (y − Xβ̂)'(y − Xβ̂) / (n − k).
Prediction: ỹ are new observations with covariate matrix X̃. We wish to draw values
from p(ỹ | y, X̃).
By simulation:
1. Draw σ^2 from Inv-χ^2(n − k, S^2)
2. Draw β from N( β̂, σ^2 (X'X)^{−1} )
3. Draw ỹ_i for i = 1, ..., m from N( x̃_i'β, σ^2 )
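The sketch below carries out this simulation; the design matrix and response are simulated, hypothetical data used only to make the three steps concrete.

# Sketch: posterior simulation for the normal linear model, prior ∝ 1/sigma^2
set.seed(2)
n <- 50; X <- cbind(1, rnorm(n)); k <- ncol(X)
y <- X %*% c(2, 1.5) + rnorm(n)                          # made-up data
XtX_inv  <- solve(crossprod(X))
beta_hat <- XtX_inv %*% crossprod(X, y)
S2 <- sum((y - X %*% beta_hat)^2) / (n - k)
M <- 2000
sigma2 <- (n - k) * S2 / rchisq(M, df = n - k)           # Inv-chi^2(n-k, S^2)
beta <- t(sapply(sigma2, function(s2) {
  beta_hat + t(chol(s2 * XtX_inv)) %*% rnorm(k)          # N(beta_hat, s2 (X'X)^-1)
}))
apply(beta, 2, quantile, c(0.025, 0.5, 0.975))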
Analytically, p(ỹ | y) = ∫ p(ỹ | β, σ^2) p(β | σ^2, y) p(σ^2 | y) dβ dσ^2, with
E(ỹ | σ^2, y) = E[ E(ỹ | β, σ^2, y) | σ^2, y ] = E(X̃β | σ^2, y) = X̃β̂,
var(ỹ | σ^2, y) = E[ var(ỹ | β, σ^2, y) | σ^2, y ] + var[ E(ỹ | β, σ^2, y) | σ^2, y ]
               = E[ σ^2 I | σ^2, y ] + var[ X̃β | σ^2, y ]
               = σ^2 ( I + X̃ (X'X)^{−1} X̃' ).
The variance has two terms: σ^2 I is sampling variation, and σ^2 X̃ (X'X)^{−1} X̃' is
due to uncertainty about β.
To complete the specification of p(ỹ | y), integrate p(ỹ | y, σ^2) with respect to
the marginal posterior distribution of σ^2. The result is:
p(ỹ | y) = t_{n−k}( X̃β̂, S^2 [ I + X̃ (X'X)^{−1} X̃' ] ).
[WinBUGS output: trace plots (iterations 3001-5000) and posterior histograms for the
regression coefficients beta[1]-beta[4].]
Regression - known y
As before, let p() 1 be the non-informative prior
Since y is positive definite and symmetric, it has an upper-triangular
1/2
1/2 1/20
square root matrix (Cholesky factor) y such that y y = y
so that if:
y = X + e, e N (0, y )
then
1/2
y = 1/2
X + 1/2
e,
y
y
y
1/2
e N (0, I)
y
17
is equivalent to computing
1
1
= (X 01
y X) Xy y
1
= (X 01
X)
y
18
y are n
new observations given a n
k matrix of regressors X.
Joint distribution of y, y is:
y
N
|X, X,
y
y|y, , y
ENAR - March 2006
X
y yy
,
yy yy
X
N (y, Vy)
19
where
+ yy1
y = X
yy (y X)
Vy = yy yy1
yy yy
20
Regression - unknown y
To draw inferences about (, y ), we proceed in steps:
p(|y , y) and then p(y |y).
derive
p(|y , y)
p(|y , y)
p(y ) N(y|, y )
,
N(|, V )
V ) depend on y .
where (,
ENAR - March 2006
21
Expression for p(y |y) must hold for any , so we set = .
Note that
V ) |V |1/2
p(|y , y) = N(,
Then
for = .
1
01(y X )]
22
Regression - y = 2Qy
Suppose that we know y upto a scalar factor 2.
Non-informative prior on , 2 is p(, 2) 2
Results follow directly from ordinary linear regression, by using
1/2
1/2
transformation Qy y and Qy X, which is equivalent to computing
1 0 1
= (X 0Q1
X)
X Qy y
y
1
= (X 0Q1
y X)
0Q1(y X )
s2 = (n k)1(y X )
y
To estimate the joint posterior distribution p(, 2|y) do
ENAR - March 2006
23
24
25
= 1 ii = 2/wi i
In practice, to uncouple from the scale of the weights, we multiply
weights by a factor so that their product equals 1.
26
27
p(, 2, |y)
p(, 2|, y)
p(, 2, |y)
p(| 2, , y)p( 2|, y)
28
= (X X ) X y
V
ENAR - March 2006
= (X X )1
29
0(y X )
s2 = (n k)1(y X )
Draw 2 from Inv-2(n k, s2)
2 V )
Draw from N(,
30
31
variance 2j .
To include prior information, do the following:
1. Append one more data point to vector y with value j0.
2. Add one row to X with zeroes except in jth column.
3. Add a diagonal element with value 2j to y .
Now apply computational methods for non-informative prior.
Given y , posterior for obtained by weighted linear regression.
32
y
0
,
X =
X
Ik
, =
y 0
0
33
0
0
).
2|y Inv-2(n0 + n,
n0 + n
34
Inequality constraints on
Sometimes we wish to impose inequality constraints such as
1 > 0
or
2 < 3 < 4.
The easiest way is to ignore the constraint until the end:
Simulate (, 2) from posterior
discard all the draws that do not satisfy the constraint.
Typically a reasonably efficient way to proceed unless constraint
eliminates a large portion of unconstrained posterior distribution.
If so, data tend to contradict the model.
ENAR - March 2006
35
Poisson regression: each observation has density
p(y_i | β) = (1/y_i!) exp{ −exp(η_i) } exp(η_i)^{y_i},
with η_i = (Xβ)_i.
Logistic regression for binomial data uses the logit link,
g(μ_i) = log[ μ_i / (1 − μ_i) ] = (Xβ)_i = η_i.
In practice, inference from logit and probit models is almost the same,
except in extremes of the tails of the distribution.
Overdispersion
In many applications, the model can be formulated to allow for extra
variability or overdispersion.
E.g. in Poisson model, the variance is constrained to be equal to the
mean.
As an example, suppose that data are the number of fatal car accidents
at K intersections over T years. Covariates might include intersection
characteristics and traffic control devices (stop lights, etc).
To accommodate overdispersion we model the (log)rate as a linear
combination of covariates and add a random effect for intersection with
its own population distribution.
Setting up GLIMs
Canonical link functions: Canonical link is function of mean that
appears in exponent of exponential family form of sampling distribution.
All links discussed so far are canonical except for the probit.
Offset: Arises when counts are obtained from different population sizes
or volumes or time periods and we need to use an exposure. Offset is
a covariate with a known coefficient.
Example: Number of incidents in a given exposure time T are Poisson
with rate per unit of time. Mean number of incidents is T .
Link function would be log() = i, but here mean of y is not but
T .
ENAR - March 26, 2006
Interpreting GLIMs
In linear models, j represents the change in the outcome when xj is
changed by one unit.
Here, j reflects changes in g() when xj is changed.
The effect of changing xj depends of current value of x.
To translate effects into the scale of y, measure changes relative to a
baseline
y_0 = g^{−1}(x_0'β).
A change in x of Δx takes the outcome from y_0 to y, where
g(y_0) = x_0'β and
y = g^{−1}( g(y_0) + β'Δx ).
Priors in GLIM
We focus on priors for although sometimes is present and has its
own prior.
Non-informative prior for :
With p() 1, posterior mode = MLE for
Approximate posterior inference can be
approximation to posterior at mode.
based
on
normal
11
12
Computation
Posterior distributions of parameters can be estimated using MCMC
methods in WinBUGS or other software.
Metropolis within Gibbs will often be necessary: in GLIM, most often
full conditionals do not have standard form.
An alternative is to approximate the sampling distribution with a
cleverly chosen approximation.
Idea: approximate the GLM likelihood by a normal likelihood in η_i based on
pseudo-data. Writing L(y_i | η_i, φ) for the log-likelihood of observation i and
expanding around a current guess η̂_i, the normal approximation has "observation"
z_i = η̂_i − L'(y_i | η̂_i, φ) / L''(y_i | η̂_i, φ)
and variance
σ_i^2 = −1 / L''(y_i | η̂_i, φ).
For the binomial model with logit link:
L' = y_i − n_i exp(η_i) / (1 + exp(η_i)),   L'' = −n_i exp(η_i) / (1 + exp(η_i))^2,
so
z_i = η̂_i + [ y_i/n_i − exp(η̂_i)/(1 + exp(η̂_i)) ] (1 + exp(η̂_i))^2 / exp(η̂_i),
σ_i^2 = (1/n_i) (1 + exp(η̂_i))^2 / exp(η̂_i).
Multinomial responses: p(y | θ) ∝ ∏_{j=1}^{k} θ_j^{y_j}, with θ_j the probability of
the jth outcome, Σ_{j=1}^{k} θ_j = 1 and Σ_{j=1}^{k} y_j = n.
With covariates, for observation i with counts y_{ij} (Σ_j y_{ij} = n_i and
Σ_j π_{ij} = 1), model the log-odds relative to a baseline category j = 1:
log( π_{ij} / π_{i1} ) = η_{ij} = (Xβ_j)_i,
so the likelihood contribution of observation i is
∏_j [ exp(η_{ij}) / Σ_{l=1}^{k} exp(η_{il}) ]^{y_{ij}}.
In the alligator food-choice example,
π_{ijk} = exp(η_{ijk}) / Σ_l exp(η_{ijl}),   with   η_{ijk} = α_k + β_{ik} + γ_{jk}.
Here,
α_k is the baseline indicator for food category k,
β_{ik} is the coefficient for the indicator for lake i,
γ_{jk} is the coefficient for the indicator for size j.
[Figure: box plots of the posterior draws of alpha[2]-alpha[5].]

Node       Mean      Std      2.5%     Median    97.5%
alpha[2]   -1.838    0.5278   -2.935   -1.823    -0.8267
alpha[3]   -2.655    0.706    -4.261   -2.593    -1.419
alpha[4]   -2.188    0.58     -3.382   -2.153    -1.172
alpha[5]   -0.7844   0.3691   -1.531   -0.7687   -0.1168
[Figure: box plots of the posterior draws of the lake effects beta[j,k].]

Node        Mean      Std      2.5%      Median    97.5%
beta[2,2]    2.706    0.6431    1.51      2.659     4.008
beta[2,3]    1.398    0.8571   -0.2491    1.367     3.171
beta[2,4]   -1.799    1.413    -5.1      -1.693     0.5832
beta[2,5]   -0.9353   0.7692   -2.555    -0.867     0.4413
beta[3,2]    2.932    0.6864    1.638     2.922     4.297
beta[3,3]    1.935    0.8461    0.37      1.886     3.842
beta[3,4]    0.3768   0.7936   -1.139     0.398     1.931
beta[3,5]    0.7328   0.5679   -0.3256    0.7069    1.849
beta[4,2]    1.75     0.6116    0.636     1.73      3.023
beta[4,3]   -1.595    1.447    -4.847    -1.433     0.9117
beta[4,4]   -0.7617   0.8026   -2.348    -0.7536    0.746
beta[4,5]   -0.843    0.5717   -1.97     -0.8492    0.281
[Figure: box plots of the posterior draws of the size effects gamma[2,k].]

Node         Mean      Std      2.5%     Median    97.5%
gamma[2,2]   -1.523    0.4101   -2.378   -1.523    -0.7253
gamma[2,3]    0.342    0.5842   -0.788    0.3351    1.476
gamma[2,4]    0.7098   0.6808   -0.651    0.7028    2.035
gamma[2,5]   -0.353    0.4679   -1.216   -0.3461    0.518
[Figure: box plots of the posterior draws of the estimated cell probabilities
p[lake, size, food].]
p[1,1,1]
p[1,1,2]
p[1,1,3]
p[1,1,4]
p[1,1,5]
p[1,2,1]
p[1,2,2]
p[1,2,3]
p[1,2,4]
p[1,2,5]
p[2,1,1]
p[2,1,2]
p[2,1,3]
p[2,1,4]
p[2,1,5]
p[2,2,1]
p[2,2,2]
p[2,2,3]
p[2,2,4]
p[2,2,5]
p[3,1,1]
p[3,1,2]
p[3,1,3]
p[3,1,4]
p[3,1,5]
p[3,2,1]
0.5392
0.09435
0.04585
0.06837
0.2523
0.5679
0.02322
0.06937
0.1469
0.1927
0.2578
0.5968
0.08137
0.0095
0.05445
0.4612
0.2455
0.1947
0.0302
0.06833
0.1794
0.5178
0.09403
0.03554
0.1732
0.2937
0.07161
0.04288
0.02844
0.03418
0.06411
0.09029
0.01428
0.04539
0.07437
0.07171
0.07217
0.08652
0.0427
0.01186
0.03517
0.08211
0.06879
0.07042
0.02921
0.03984
0.05581
0.09136
0.04716
0.02504
0.06253
0.07225
0.4046
0.03016
0.007941
0.01923
0.1389
0.3959
0.005301
0.01191
0.03647
0.0793
0.1367
0.4159
0.01939
0.00533
0.009716
0.3055
0.1263
0.07776
7.653E-4
0.01357
0.08555
0.3334
0.02652
0.006063
0.07053
0.1618
0.5384
0.08735
0.04043
0.06237
0.2496
0.5684
0.02013
0.05918
0.1344
0.1852
0.2496
0.6
0.0728
0.04105
0.0476
0.4632
0.2426
0.189
0.02009
0.06097
0.1732
0.5185
0.08517
0.02996
0.167
0.2901
0.676
0.2067
0.1213
0.1457
0.3808
0.7342
0.06186
0.1768
0.3182
0.349
0.4265
0.7551
0.1841
0.143
0.1398
0.6175
0.3867
0.3502
0.1122
0.1643
0.296
0.6933
0.2219
0.1055
0.3261
0.4469
Interpretation of results
Because we set to zero several model parameters, interpreting results is
tricky. For example:
Beta[2,2] has posterior mean 2.714. This is the effect of lake Oklawaha
relative to fish.
Since beta[2,2] > 0, we conclude that alligators in Oklawaha eat more
invertebrates than do alligators in Hancock (even though both may
prefer fish!).
Gamma[2,2] is the effect of size 2 relative to size 1 on the relative
preference for invertebrates. Since gamma[2,2] < 0, we conclude that
large alligators prefer fish more than do small alligators.
The alpha are baseline counts for each type of food relative to fish.
Hierarchical Poisson-Gamma model:
y_i ∼ Poisson(λ_i),   λ_i ∼ Gamma(α, β),
with hyperpriors
α ∼ Gamma(a, b),   β ∼ Gamma(c, d).
The conditional for λ_i is
p(λ_i | all) ∝ λ_i^{y_i + α − 1} exp{ −λ_i(β + 1) },
which is proportional to a Gamma with parameters (y_i + α, β + 1).
The full conditional for α is
p(α | all) ∝ β^{nα} Γ(α)^{−n} ( ∏_i λ_i )^{α − 1} α^{a − 1} exp{ −bα }.
The conditional for α does not have a standard form.
For β:
p(β | all) ∝ ∏_i [ β^{α} exp{ −βλ_i } ] β^{c − 1} exp{ −dβ }
          ∝ β^{nα + c − 1} exp{ −β( Σ_i λ_i + d ) },
a Gamma(nα + c, Σ_i λ_i + d).
Computation:
Given , , draw each i from the corresponding Gamma conditional.
Draw using a Metropolis step or rejection sampling or inverse cdf
method.
Draw from the Gamma conditional.
See Italian marriages example.
28
WinBUGS code:
model {
for (i in 1:16) {
y[i] ~ dpois(l[i])
l[i] ~ dgamma(alpha, beta)
}
alpha ~ dgamma(1,1)
beta ~ dgamma(1,1)
warave <- (l[4]+l[5]+ l[6]+l[7]+l[8]+l[9]+l[10]) / 7
nonwarave<- (l[1]+l[2]+l[3]+l[11]+l[12]+l[13]+l[14]+l[15]+l[16]) / 9
diff <- nonwarave - warave
}
list(y = c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7))
Results
[Figure: 95% credible intervals for lambda[1]-lambda[16].]
Posterior summaries (two rows of the original output; node labels were not preserved):

        mean    sd      2.5%    median   97.5%
        7.362   1.068   5.508   7.285    9.665
        2.281   0.394   1.59    2.266    3.125

[Figure: posterior histograms.]
Poisson regression
When rates are not exchangeable, we need to incorporate covariates
into the model. Often we are interested in the association between one
or more covariate and the outcome.
It is possible (but not easy) to incorporate covariates into the PoissonGamma model.
Christiansen and Morris (1997, JASA) propose the following model:
Sampling distribution, where ei is a known exposure:
yi|i Poisson(iei).
Under model, E(yi/ei) = i.
ENAR - March 26, 2006
29
E(i) =
/i
= i
CV2(i) =
=
ENAR - March 26, 2006
2i 1
2i
1
30
Gamma(, ),
are
y0
,
2
( + y0)
31
32
The second-level distribution for the s will typically be flat (if covariate
is a fixed effect) or normal
j Normal(j0, 2j )
if jth covariate is a random effect. The variance 2j represents the
between batch variability.
33
Epilepsy example
From Breslow and Clayton, 1993, JASA.
Fifty nine epilectic patients in a clinical trial were randomized to a new
drug: T = 1 is the drug and T = 0 is the placebo.
Covariates included:
Baseline data: number of seizures during eight weeks preceding trial
Age in years.
Outcomes: number of seizures during the two weeks preceding each of
four clinical visits.
Data suggest that number of seizures was significantly lower prior to
fourth visit, so an indicator was used for V4 versus the others.
ENAR - March 26, 2006
34
35
alpha.BT ~ dnorm(0.0,1.0E-4)
alpha.Age ~ dnorm(0.0,1.0E-4)
alpha.V4 ~ dnorm(0.0,1.0E-4)
tau.b1 ~ dgamma(1.0E-3,1.0E-3); sigma.b1 <- 1.0 / sqrt(tau.b1)
tau.b
~ dgamma(1.0E-3,1.0E-3); sigma.b <- 1.0/ sqrt(tau.b)
# re-calculate intercept on original scale:
alpha0 <- a0 - alpha.Base * log.Base4.bar - alpha.Trt * Trt.bar
- alpha.BT * BT.bar - alpha.Age * log.Age.bar - alpha.V4 * V4.bar
}
Results

Parameter    Mean      Std       2.5th     Median     97.5th
alpha.Age     0.4677   0.3557   -0.2407     0.4744     1.172
alpha.Base    0.8815   0.1459    0.5908     0.8849     1.165
alpha.Trt    -0.9587   0.4557   -1.794     -0.9637    -0.06769
alpha.V4     -0.1013   0.08818  -0.273     -0.09978    0.07268
alpha.BT      0.3778   0.2427   -0.1478     0.3904     0.7886
sigma.b1      0.4983   0.07189   0.3704     0.4931     0.6579
sigma.b       0.3641   0.04349   0.2871     0.362      0.4552
[Figures: box plots of the subject-level random effects b1[1]-b1[59] and b[i,k],
with autocorrelation and trace plots for selected parameters.]
Disease mapping model:
y_i ∼ Poisson(θ_i),   θ_i = λ_i E_i,   λ_i ∼ Gamma(α, β),
where E_i is a known expected count.

County   E_i   y_i      County   E_i   y_i
1         94     5      6         31    19
2         15     1      7          2     1
3         62     5      8          2     1
4        126    14      9          9     4
5          5     3      10       100    22
[Figure: box plots of the posterior draws of theta[1]-theta[10].]

node        mean      sd        2.5%       median    97.5%
theta[1]    0.06033   0.02414   0.02127    0.05765   0.1147
theta[2]    0.1074    0.08086   0.009868   0.08742   0.306
theta[3]    0.09093   0.03734   0.03248    0.08618   0.1767
theta[4]    0.1158    0.03085   0.06335    0.1132    0.1831
theta[5]    0.5482    0.2859    0.131      0.4954    1.242
theta[6]    0.6028    0.1342    0.3651     0.5962    0.8932
theta[7]    0.4586    0.3549    0.0482     0.3646    1.343
theta[8]    0.4658    0.3689    0.04501    0.3805    1.41
theta[9]    0.4378    0.2055    0.1352     0.3988    0.9255
theta[10]   0.2238    0.0486    0.1386     0.2211    0.33
40.0
[10]
30.0
[6]
[4]
20.0
[1]
[3]
10.0
[9]
[5]
[2]
[7]
0.0
[8]
node
mean
sd
2.5%
median 97.5%
lambda[1]
lambda[2]
lambda[3]
lambda[4]
lambda[5]
lambda[6]
lambda[7]
lambda[8]
lambda[9]
lambda[10]
5.671
1.611
5.638
14.59
2.741
18.69
0.9171
0.9317
3.94
22.38
2.269
1.213
2.315
3.887
1.43
4.16
0.7097
0.7377
1.85
0.1028
2.0
0.148
2.014
7.982
0.6551
11.32
0.09641
0.09002
1.217
13.86
5.419
1.311
5.343
14.26
2.477
18.48
0.7292
0.761
3.589
22.11
node
mean
sd
2.5%
median
alpha
beta
mean
10.78
4.59
10.95
23.07
6.21
27.69
2.687
2.819
8.33
33.0
97.5%
Congdon considers the incidence of heart disease mortality in 758 electoral wards in
the Greater London area over three years (1990-92). These small areas are grouped
administratively into 33 boroughs.
Regressors:
at ward level: xij , index of socio-economic deprivation.
at borough level: wj , where
wi =
1
0
We assume borough level variation in the intercepts and in the impacts of deprivation;
this variation is linked to the category of borough (inner vs outer).
KSU - April 2005
Model
First level of the hierarchy
Oij |ij Poisson(ij )
ij N (0, )
Death counts Oij are Poisson with means ij .
log(Eij ) is an offset, i.e. an explanatory variable with known coefficient (equal
to 1).
j = (1j , 2j )0 are random coefficients for the intercepts and the impacts of
deprivation at the borough level.
ij is a random error for Poisson over-dispersion. We fitted two models: one
without and another with this random error term.
Model (contd)
Second level of the hierarchy
j =
where
1j
2j
N2(j , )
11 12
=
21 22
1j = 11 + 12wj
2j = 21 + 22wj
11, 12, 21, and 22 are the population coefficients for the intercepts, the
impact of borough category, the impact of deprivations, and, respectively, the
interaction impact of the level-2 regressors and level-1 regressors.
Hyperparameters
1
1
Wishart (R) ,
ij N (0, 0.1) i, j {1, 2}
2
Inv-Gamma(a, b)
KSU - April 2005
Computation of Eij
Suppose that p is an overall disease rate. Then E_ij = n_ij p and θ_ij = p_ij / p.
The E_ij are said to be:
1. externally standardized if p is obtained from another data source (such as a standard
reference table);
2. internally standardized if p is obtained from the given dataset, e.g.
p = Σ_ij O_ij / Σ_ij n_ij.
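A one-index version of internal standardization in R; the O and n vectors here are made-up observed counts and population sizes, just to show the computation:

# Internal standardization: overall rate estimated from the data themselves
O <- c(12, 5, 30, 8)           # hypothetical observed counts by area
n <- c(1500, 900, 4200, 1100)  # hypothetical population counts by area

p.hat <- sum(O) / sum(n)       # overall disease rate
E     <- n * p.hat             # expected counts E_i = n_i * p.hat
E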
Model without the over-dispersion term ε_ij:

node       mean     std. dev.   2.5%      median    97.5%
γ_11      -0.075    0.074      -0.224    -0.075     0.070
γ_12       0.078    0.106      -0.128     0.078     0.290
γ_21       0.621    0.138       0.354     0.620     0.896
γ_22       0.104    0.197      -0.282     0.103     0.486
σ_1        0.294    0.038       0.231     0.290     0.380
σ_2        0.445    0.077       0.318     0.437     0.618
Deviance 945.800   11.320     925.000   945.300   969.400

Model with the over-dispersion term ε_ij:

node       mean     std. dev.   2.5%      median    97.5%
γ_11      -0.069    0.074      -0.218    -0.069     0.076
γ_12       0.068    0.106      -0.141     0.068     0.278
γ_21       0.616    0.141       0.335     0.615     0.892
γ_22       0.105    0.198      -0.276     0.104     0.499
σ_1        0.292    0.037       0.228     0.289     0.376
σ_2        0.431    0.075       0.306     0.423     0.600
Deviance 802.400   40.290     726.600   800.900   883.500
Including the ward-level random variability ε_ij reduces the average Poisson GLM deviance
to 802, with a 95% credible interval from 726 to 883. This is in line with the expected
value of the GLM deviance for N = 758 areas if the Poisson model is appropriate.
Often data are also collected over time, so models that include spatial
and temporal correlations of outcomes are needed.
We focus on spatial, rather than spatio-temporal models.
We also focus on models for univariate rather than multivariate
outcomes.
θ = (σ², τ², φ). Then
Σ(θ)_ii' = σ² exp(−φ d_ii') + τ² I(i = i'),
with (σ², τ², φ) > 0.
This is an example of an isotropic covariance function: the spatial
correlation is only a function of the distance d_ii'.
Examples of semi-variograms
Semi-variograms for the linear, spherical and exponential models.
Stationarity
Strict stationarity implies weak stationarity but the converse is not true
except in Gaussian processes.
Weak stationarity implies intrinsic stationarity, but the converse is not
true in general.
Notice that intrinsic stationarity is defined on the differences between
outcomes at two locations and thus says nothing about the joint
distribution of outcomes.
Semivariogram (contd)
If γ(h) depends on h only through its length ||h||, then the spatial
process is isotropic. Else it is anisotropic.
There are many choices for isotropic models. The exponential model
is popular and has good properties. For t = ||h||:
γ(t) = τ² + σ²(1 − exp(−φ t)) if t > 0,
γ(t) = 0 otherwise.
See figures, page 24.
The powered exponential model has an extra parameter κ for smoothness:
γ(t) = τ² + σ²(1 − exp(−(φ t)^κ)) if t > 0.
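A small R sketch of these two semivariograms; the parameter values are arbitrary:

# Exponential semivariogram: gamma(t) = tau2 + sigma2 * (1 - exp(-phi * t)) for t > 0
sv.exp <- function(t, tau2, sigma2, phi) {
  ifelse(t > 0, tau2 + sigma2 * (1 - exp(-phi * t)), 0)
}

# Powered exponential: extra exponent kappa controls smoothness
sv.powexp <- function(t, tau2, sigma2, phi, kappa) {
  ifelse(t > 0, tau2 + sigma2 * (1 - exp(-(phi * t)^kappa)), 0)
}

t <- seq(0, 3, length.out = 200)
plot(t, sv.exp(t, tau2 = 0.2, sigma2 = 1, phi = 2), type = "l",
     xlab = "distance", ylab = "semivariogram")
lines(t, sv.powexp(t, tau2 = 0.2, sigma2 = 1, phi = 2, kappa = 0.5), lty = 2)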
E[Y(s_o) | y] = x_o'β̂ + γ'Σ⁻¹(y − Xβ̂),
Var[Y(s_o) | y] = σ² + τ² − γ'Σ⁻¹γ,
where
γ = (σ²ρ(φ, d_o1), ..., σ²ρ(φ, d_on))'
β̂ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y
Σ = σ²H(φ) + τ²I.
Solution assumes that we have observed the covariates x_o at the new
site.
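A minimal R sketch of these kriging formulas on simulated data, assuming the exponential correlation H(φ)_ii' = exp(−φ d_ii'); all parameter values are treated as known here, and the coordinates, covariate and true parameter values are made up:

set.seed(2)

# Simulated illustration of the kriging formulas above
n <- 50
coords <- cbind(runif(n), runif(n))
X <- cbind(1, coords[, 1])                    # intercept + a trend covariate
beta.true <- c(1, 2); sigma2 <- 1; tau2 <- 0.1; phi <- 3

D <- as.matrix(dist(coords))                  # inter-site distances d_ii'
H <- exp(-phi * D)                            # exponential correlation H(phi)
Sigma <- sigma2 * H + tau2 * diag(n)
y <- drop(X %*% beta.true + t(chol(Sigma)) %*% rnorm(n))

# New site s_o and its covariate vector x_o
so <- c(0.5, 0.5); xo <- c(1, so[1])
d.o <- sqrt(colSums((t(coords) - so)^2))      # distances from s_o to the data sites
gam <- sigma2 * exp(-phi * d.o)               # gamma vector

Sigma.inv <- solve(Sigma)
beta.hat <- solve(t(X) %*% Sigma.inv %*% X, t(X) %*% Sigma.inv %*% y)

krig.mean <- drop(xo %*% beta.hat + gam %*% Sigma.inv %*% (y - X %*% beta.hat))
krig.var  <- sigma2 + tau2 - drop(gam %*% Sigma.inv %*% gam)
c(mean = krig.mean, var = krig.var)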
Y | β, W, τ² ~ N(Xβ + W, τ²I)
W | σ², φ ~ N(0, σ²H(φ)).
Given draws (σ²(g), φ(g)) from the Gibbs sampler on the marginal
model, we can generate W from
p(W | σ²(g), φ(g)) = N(0, σ²(g) H(φ(g))).
Analytical marginalization over W is possible only if the model has a Gaussian
form.
Bayesian kriging
Let Y_o = Y(s_o) and x_o = x(s_o). Kriging is accomplished by obtaining
the posterior predictive distribution
p(y_o | x_o, X, y) = ∫ p(y_o, θ | y, X, x_o) dθ
                   = ∫ p(y_o | θ, y, x_o) p(θ | y, X) dθ.
Since (Y_o, Y) are jointly multivariate normal (see expressions 2.18 and
2.19 on page 51), p(y_o | θ, y, x_o) is a conditional normal distribution.
Given MCMC draws of the parameters (θ(1), ..., θ(G)) from the posterior
distribution p(θ | y, X), we draw a value y_o(g) for each θ(g) as
y_o(g) ~ p(y_o | θ(g), y, x_o).
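Continuing the simulated example above (reusing D, d.o, X, y and xo from that sketch), the following is a sketch of this composition sampling: for each posterior draw (β(g), σ²(g), τ²(g), φ(g)), stored here in a hypothetical matrix called draws, one value of y_o is generated from the corresponding conditional normal.

# One predictive draw of y_o for a given parameter draw
# par = c(beta1, beta2, sigma2, tau2, phi); D, d.o, X, y, xo come from the sketch above
krige.draw <- function(par) {
  beta <- par[1:2]; sigma2 <- par[3]; tau2 <- par[4]; phi <- par[5]
  Sigma <- sigma2 * exp(-phi * D) + tau2 * diag(nrow(D))
  gam <- sigma2 * exp(-phi * d.o)
  Sigma.inv <- solve(Sigma)
  m <- drop(xo %*% beta + gam %*% Sigma.inv %*% (y - X %*% beta))
  v <- sigma2 + tau2 - drop(gam %*% Sigma.inv %*% gam)
  rnorm(1, mean = m, sd = sqrt(v))
}
# y.o.draws <- apply(draws, 1, krige.draw)  # one y_o(g) per row of the hypothetical draws matrix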
model {
  # Spatially structured multivariate normal likelihood
  height[1:N] ~ spatial.exp(mu[], x[], y[], tau, phi, kappa)
  for(i in 1:N) {
    mu[i] <- beta
  }
  # Priors
  beta ~ dflat()
  tau ~ dgamma(0.001, 0.001)
  sigma2 <- 1/tau
  # priors for spatial.exp parameters
  phi ~ dunif(0.05, 20)
  # prior decay for correlation at min distance (0.2 x 50 ft) is 0.02 to 0.99
  # prior range for correlation at max distance (8.3 x 50 ft) is 0 to 0.66
  kappa ~ dunif(0.05, 1.95)
  # Spatial prediction
  # Single site prediction
  for(j in 1:M) {
    height.pred[j] ~ spatial.unipred(beta, x.pred[j], y.pred[j], height[])
  }
  # Only use joint prediction for small subset of points, due to length of time it takes to run
  for(j in 1:10) { mu.pred[j] <- beta }
  height.pred.multi[1:10] ~ spatial.pred(mu.pred[], x.pred[1:10], y.pred[1:10], height[])
}
[Figures: trace plots of beta, phi, and kappa over 2000 iterations]
[Map: height values; legend: (40) 750.0-800.0, (80) 800.0-850.0, (61) 850.0-900.0, (26) 900.0-950.0, (1) >= 950.0]
Defining neighbors
A proximity matrix W with entries wij spatially connects areas i and
j in some fashion.
Typically, wii = 0.
There are many choices for the wij :
Binary: wij = 1 if areas i and j share a common boundary, and 0
otherwise.
Continuous: decreasing function of intercentroidal distance.
Combo: wij = 1 if areas are within a certain distance.
W need not be symmetric.
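A small R sketch that builds a binary proximity matrix from an adjacency list stored the way the WinBUGS data blocks later in these notes store it (num[i] = number of neighbors of area i, adj = concatenated neighbor lists); the 4-area map is made up:

# Toy map with 4 areas
num <- c(2, 3, 2, 1)
adj <- c(2, 3,      # neighbors of area 1
         1, 3, 4,   # neighbors of area 2
         1, 2,      # neighbors of area 3
         2)         # neighbors of area 4

N <- length(num)
W <- matrix(0, N, N)
pos <- cumsum(c(0, num))
for (i in 1:N) {
  W[i, adj[(pos[i] + 1):pos[i + 1]]] <- 1   # binary weights: 1 if areas share a boundary
}
W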
E_i = Σ_j n_ij r_j,
where r_j is the risk for persons of age group j (from some existing
table of risks by age) and n_ij is the number of persons of age group j in area
i.
The standardized mortality ratio (SMR) for area i is SMR_i = Y_i / E_i.
The variance of the SMR can be estimated as
Var(SMR_i) = Var(Y_i)/E_i² ≈ Y_i/E_i² = SMR_i²/Y_i.
p = Prob(X ≥ Y_i | E_i) = 1 − Prob(X < Y_i | E_i)
  = 1 − Σ_{x=0}^{Y_i − 1} exp(−E_i) E_i^x / x!.
If p < 0.05 we reject H0.
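In R the tail probability can be computed directly from the Poisson c.d.f.; the counts below are made up for illustration:

# p = Prob(X >= y_i | E_i) for one area (hypothetical y and E)
y <- 15
E <- 8.2
p <- 1 - ppois(y - 1, lambda = E)
p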
Poisson-Gamma model
Consider
Y_i | θ_i ~ Poisson(E_i θ_i)
θ_i | a, b ~ Gamma(a, b).
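Since the Gamma prior is conjugate here, the posterior of θ_i is Gamma(y_i + a, E_i + b) (rate parameterization), which is consistent with the posterior mean derived next. A small R sketch using the ten-county data from before and arbitrary values of a and b:

a <- 2; b <- 4                      # arbitrary illustration values for the prior
E <- c(94, 15, 62, 126, 5, 31, 2, 2, 9, 100)
y <- c( 5,  1,  5,  14, 3, 19, 1, 1, 4,  22)

post.mean <- (y + a) / (E + b)      # posterior means of theta_i
ci <- cbind(qgamma(0.025, y + a, E + b),
            qgamma(0.975, y + a, E + b))
round(cbind(post.mean, ci), 3)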
A point estimate of θ_i is
E(θ_i | y) = E(θ_i | y_i) = (y_i + a)/(E_i + b)
 = (y_i + [E(θ_i)]²/Var(θ_i)) / (E_i + E(θ_i)/Var(θ_i))
 = (E_i (y_i/E_i) + [E(θ_i)/Var(θ_i)] E(θ_i)) / (E_i + E(θ_i)/Var(θ_i))
 = w_i SMR_i + (1 − w_i) E(θ_i),
where w_i = E_i / [E_i + (E(θ_i)/Var(θ_i))].
The Bayesian point estimate is a weighted average of the data-based SMR_i
and the prior mean E(θ_i).
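The weighted-average form is easy to verify numerically. For a Gamma(a, b) prior with rate b, E(θ) = a/b and Var(θ) = a/b², so E(θ)/Var(θ) = b; the sketch below continues the illustration above with the same made-up a and b:

a <- 2; b <- 4                                   # same illustrative prior as above
E <- c(94, 15, 62, 126, 5, 31, 2, 2, 9, 100)
y <- c( 5,  1,  5,  14, 3, 19, 1, 1, 4,  22)

w <- E / (E + b)                                 # shrinkage weights w_i
shrink <- w * (y / E) + (1 - w) * (a / b)        # w_i * SMR_i + (1 - w_i) * E(theta)
round(cbind(w, shrink, direct = (y + a) / (E + b)), 3)   # agrees with (y_i + a)/(E_i + b)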
Y_i ~ Poisson(E_i exp(η_i))
η_i = x_i'β + θ_i + φ_i,
where x_i are area-level covariates.
The θ_i are assumed to be exchangeable and model between-area
variability:
θ_i ~ N(0, 1/τ_h).
CAR model
More reasonable to think of a neighbor-based proximity measure and
consider a conditionally autoregressive (CAR) model for φ:
φ_i | φ_{j≠i} ~ N(φ̄_i, 1/(c m_i)),
where
φ̄_i = (1/m_i) Σ_{j≠i} w_ij φ_j
and m_i is the number of neighbors of area i.
For Gibbs sampling, the full conditional of each φ_i is needed:
p(φ_i | y, β, θ, φ_{j≠i}) ∝ Poisson(y_i | E_i exp(x_i'β + θ_i + φ_i)) N(φ_i | φ̄_i, 1/(m_i c)).
Consider
τ_h ~ Gamma(a_h, b_h),
τ_c ~ Gamma(a_c, b_c).
A choice for which
sd(θ_i) = 1/√τ_h ≈ sd(φ_i) ≈ 0.7/√(m τ_c),
with m the average number of neighbors, is more fair (Bernardinelli
et al. 1995, Statistics in Medicine).
Model
model {
  # Likelihood
  for (i in 1 : N) {
    O[i] ~ dpois(mu[i])
    log(mu[i]) <- log(E[i]) + alpha0 + alpha1 * X[i]/10 + b[i]
    RR[i] <- exp(alpha0 + alpha1 * X[i]/10 + b[i])   # Area-specific relative risk (for maps)
  }
  # CAR prior distribution for random effects:
  b[1:N] ~ car.normal(adj[], weights[], num[], tau)
  for(k in 1:sumNumNeigh) {
    weights[k] <- 1
  }
  # Other priors:
  alpha0 ~ dflat()
  alpha1 ~ dnorm(0.0, 1.0E-5)
  tau ~ dgamma(0.5, 0.0005)   # prior on precision
  sigma <- sqrt(1 / tau)      # standard deviation
}
Data
list(N = 56,
O = c( 9, 39, 11, 9, 15, 8, 26, 7, 6, 20,
13, 5, 3, 8, 17, 9, 2, 7, 9, 7,
16, 31, 11, 7, 19, 15, 7, 10, 16, 11,
5, 3, 7, 8, 11, 9, 11, 8, 6, 4,
10, 8, 2, 6, 19, 3, 2, 3, 28, 6,
1, 1, 1, 1, 0, 0),
E = c( 1.4, 8.7, 3.0, 2.5, 4.3, 2.4, 8.1, 2.3, 2.0, 6.6,
4.4, 1.8, 1.1, 3.3, 7.8, 4.6, 1.1, 4.2, 5.5, 4.4,
10.5,22.7, 8.8, 5.6,15.5,12.5, 6.0, 9.0,14.4,10.2,
4.8, 2.9, 7.0, 8.5,12.3,10.1,12.7, 9.4, 7.2, 5.3,
18.8,15.8, 4.3,14.6,50.7, 8.2, 5.6, 9.3,88.7,19.6,
3.4, 3.6, 5.7, 7.0, 4.2, 1.8),
X = c(16,16,10,24,10,24,10, 7, 7,16,
7,16,10,24, 7,16,10, 7, 7,10,
7,16,10, 7, 1, 1, 7, 7,10,10,
7,24,10, 7, 7, 0,10, 1,16, 0,
1,16,16, 0, 1, 7, 1, 1, 0, 1,
1, 0, 1, 1,16,10),
num = c(3, 2, 1, 3, 3, 0, 5, 0, 5, 4,
0, 2, 3, 3, 2, 6, 6, 6, 5, 3,
3, 2, 4, 8, 3, 3, 4, 4, 11, 6,
7, 3, 4, 9, 4, 2, 4, 6, 3, 4,
5, 5, 4, 5, 4, 6, 6, 4, 9, 2,
4, 4, 4, 5, 6, 5),
adj = c(
19, 9, 5,
10, 7,
12,
28, 20, 18,
19, 12, 1,
17, 16, 13, 10, 2,
29, 23, 19, 17, 1,
22, 16, 7, 2,
5, 3,
19, 17, 7,
35, 32, 31,
29, 25,
29, 22, 21, 17, 10, 7,
29, 19, 16, 13, 9, 7,
56, 55, 33, 28, 20, 4,
17, 13, 9, 5, 1,
56, 18, 4,
50, 29, 16,
16, 10,
39, 34, 29, 9,
56, 55, 48, 47, 44, 31, 30, 27,
29, 26, 15,
43, 29, 25,
56, 32, 31, 24,
Results
The proportion of individuals working in agriculture appears to be
associated with the incidence of cancer.
[Figures: box plots of the spatial random effects b[i]; map of posterior mean relative risks RR; map of observed counts O]
Model
model {
  for (i in 1 : N) {
    # Likelihood
    O[i] ~ dpois(mu[i])
    log(mu[i]) <- log(E[i]) + alpha + beta * depriv[i] + b[i] + h[i]
    RR[i] <- exp(alpha + beta * depriv[i] + b[i] + h[i])   # Area-specific relative risk (for maps)
    # Exchangeable prior on unstructured random effects
    h[i] ~ dnorm(0, tau.h)
  }
  # CAR prior distribution for spatial random effects:
  b[1:N] ~ car.normal(adj[], weights[], num[], tau.b)
  for(k in 1:sumNumNeigh) {
    weights[k] <- 1
  }
  # Other priors:
  alpha ~ dflat()
  beta ~ dnorm(0.0, 1.0E-5)
  tau.b ~ dgamma(0.5, 0.0005)
  tau.h ~ dgamma(0.5, 0.0005)
}
Note that the priors on the precisions for the exchangeable and the spatial
random effects are Gamma(0.5, 0.0005).
That means that a priori, the expected value of the standard deviations is
approximately 0.03 with a relatively large prior standard deviation.
This is not a fair prior as discussed in class. The average number of neighbors
is 4.8 and this is not taken into account in the choice of priors.
Results
Parameter                       Mean     Std      2.5th perc.   97.5th perc.
Alpha                          -0.208    0.1      -0.408        -0.024
Beta                            0.0474   0.0179    0.0133        0.0838
Relative size of spatial std    0.358    0.243     0.052         0.874
node     mean     sd       2.5%     median   97.5%
RR[1]    0.828    0.1539   0.51     0.8286   1.148
RR[2]    1.18     0.2061   0.7593   1.188    1.6
RR[3]    0.8641   0.1695   0.5621   0.8508   1.247
RR[4]    0.8024   0.1522   0.5239   0.7873   1.154
RR[5]    0.7116   0.1519   0.379    0.7206   0.9914
RR[6]    1.05     0.2171   0.7149   1.015    1.621
RR[7]    1.122    0.1955   0.7556   1.116    1.589
RR[8]    0.821    0.1581   0.4911   0.8284   1.132
RR[9]    1.112    0.2167   0.7951   1.07     1.702
RR[10]   1.546    0.2823   1.072    1.505    2.146
RR[11]   0.7697   0.1425   0.4859   0.7788   1.066
RR[12]   0.865    0.16     0.6027   0.8464   1.26
RR[13]   1.237    0.3539   0.8743   1.117    2.172
RR[14]   0.8359   0.1807   0.5279   0.8084   1.3
RR[15]   0.7876   0.1563   0.489    0.7869   1.141
[Maps: observed counts O, spatial random effects b, and posterior mean relative risks RR over the study region]