Bayesian Statistics With R and BUGS
Day 1
Lecture 1: Introduction to Bayesian Inference
Lecture 2: Bayesian analysis for single parameter models
Lecture 3: Prior distributions: univariate
Day 2
Lecture 4: Bayesian analysis for multiple parameter models with R
Lecture 5: An introduction to WinBUGS
Lecture 6: Multivariate models with WinBUGS
Day 3
Lecture 7: An introduction to MCMC computations
Lecture 8: Bayesian regression with WinBUGS
Lecture 9: Introduction to Hierarchical Statistical modeling
Learning objectives
Focus on statistical modeling rather than running code, checking convergence etc.
Recommended bibliography
Bayesian Data Analysis (Second edition). Andrew Gelman, John Carlin, Hal Stern and Donald Rubin. 2004. Chapman & Hall/CRC.
Lecture 1:
Introduction to Modern Bayesian Inference
I shall not assume the truth of Bayes axiom (...) theorems which are
useless for scientific purposes.
-Ronald A. Fisher (1935) The Design of Experiments, page 6.
PROBLEM.
Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.
Experience in other hospitals indicates that the risk for each patient is around
10 %
We can directly express uncertainty about the patient risk with a probability
distribution
[Figure: probability distribution for the mortality rate (%)]
The probability that the mortality risk is greater than 15% is Pr(θ > 0.15) = 0.17
10
Tells us what we want: what are plausible values for the parameter of interest?
No confidence intervals: just report central area that contains 95% of distribution
11
Requires the specification of what we thought before new evidence is taken into
account: the prior distribution
12
13
P(A | B) = P(B | A) P(A) / P(B),
which is Bayes' rule.
Sometimes it is useful to work with P(B) = P(A) P(B | A) + P(Ā) P(B | Ā),
curiously called "extending the conversation" (Lindley 2006, pag. 68)
14
Suppose that a priori only 10% of clinical trials are truly effective treatments.
Assume each trial is carried out with a design with enough sample size such that α = 5% and power 1 − β = 80%.
Question: What is the chance that the treatment is truly effective given a significant test result, p(H1 | significant results)?

p(H1 | significant results) = (1 − β) × 0.1 / [ (1 − β) × 0.1 + α × 0.9 ]
                            = (0.8 × 0.1) / (0.8 × 0.1 + 0.05 × 0.9) = 0.64

Answer: This says that if truly effective treatments are relatively rare, then a statistically significant result still stands a good chance of being a false positive.
16
In the classical framework, parameters are fixed non-random quantities and the
probability statements concern the data
17
Bayesian Inference
The Bayesian analysis starts like a classical statistical analysis by specifying the sampling model:
p(y | θ),
this is the likelihood function.
18
Then we use Bayes' theorem to obtain the conditional probability distribution for unobserved quantities of interest given the data:
p(θ | y) = p(θ) p(y | θ) / ∫ p(θ) p(y | θ) dθ
19
20
21
θ      Prior p(θ)   Likelihood p(y=1|θ)   Likelihood x prior p(y=1|θ)p(θ)   Posterior p(θ|y=1)
0.25   0.33         0.25                  0.0825                            0.167
0.50   0.33         0.50                  0.1650                            0.333
0.75   0.33         0.75                  0.2475                            0.500
Sum    1.0          1.50                  0.495                             1.0
So, observing a head on a single flip of the coin means that there is now a 50%
probability that the chance of heads is 0.75 and only a 16.7% that the chance of
heads is 0.25.
Note that if we normalize the likelihood p(y = 1 | θ) we obtain exactly the same results.
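As a quick check, the table above can be reproduced in R by normalizing prior times likelihood (a minimal sketch; the variable names are illustrative):

theta <- c(0.25, 0.50, 0.75)
prior <- c(0.33, 0.33, 0.33)                    # discrete prior
lik   <- dbinom(1, size = 1, prob = theta)      # p(y = 1 | theta) = theta
post  <- prior * lik / sum(prior * lik)         # normalize
round(cbind(theta, prior, lik, lik.x.prior = lik * prior, post), 4)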
22
23
24
Monte Carlo error: how accurate are the empirical estimates obtained by simulation?
Suppose that we obtain the empirical mean of g(X) and the empirical variance V = Var(g(X)) based on T simulations. The Monte Carlo standard error of the empirical mean is then approximately sqrt(V / T), so it can be reduced by increasing the number of simulations T.
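A minimal sketch in R (the Beta(3, 27) "posterior" is only an illustrative assumption): the empirical mean of simulated values and its Monte Carlo error.

set.seed(1)
T <- 10000
theta <- rbeta(T, 3, 27)                 # illustrative posterior sample (assumption)
g <- theta                               # here g(X) is the identity, i.e. the posterior mean
c(estimate = mean(g), mc.error = sqrt(var(g) / T))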
25
26
27
Sequential learning
Suppose we obtain data y1 and form the posterior p(θ | y1), and then we obtain further data y2. The posterior based on y1, y2 is given by:
p(θ | y1, y2) ∝ p(y2 | θ) p(θ | y1).
28
θ      Prior p(θ)   Likelihood p(y=1|θ)   Likelihood x prior p(y=1|θ)p(θ)   Posterior p(θ|y=1)
0.25   0.167        0.25                  0.042                             0.071
0.50   0.333        0.50                  0.167                             0.286
0.75   0.500        0.75                  0.375                             0.644
Sum    1.0          1.50                  0.583                             1.0
After observing a second head, there is now a 64.4% probability that the chance
of heads is 0.75 and only a 7.1% that the chance of heads is 0.25.
29
The prior distribution p(θ) expresses our uncertainty about θ before seeing the data; it can be objective or subjective.
In the classical setting probabilities are defined in terms of long run frequencies and are interpreted as physical properties of systems. In this way we use: probability of ...
30
Summary
Bayesian statistics:
Formal combination of external information with the data model by Bayes' rule
31
Lecture 2:
Bayesian Inference for Single Parameter Models
32
Summary
1. Conjugate Analysis for:
I
Binomial model
Poisson model
2. Using R for:
I
33
34
To represent external evidence that some response rates are more plausible than others, it is mathematically convenient to use a Beta(a, b) prior distribution for θ:
p(θ) ∝ θ^(a−1) (1 − θ)^(b−1)
Combining this with the binomial likelihood gives a posterior distribution
p(θ | r, n) ∝ p(r | θ, n) p(θ)
           ∝ θ^r (1 − θ)^(n−r) θ^(a−1) (1 − θ)^(b−1)
           = θ^(r+a−1) (1 − θ)^(n−r+b−1)
           ∝ Beta(r + a, n − r + b).
35
When the prior and posterior come from the same family of distributions the prior
is said to be conjugate to the likelihood
36
37
[Figure: Beta density examples: Beta(1/2, 1/2), Beta(1, 1), Beta(2, 5) and Beta(2, 2)]
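These densities can be reproduced in R with dbeta(); a minimal sketch:

par(mfrow = c(2, 2))
curve(dbeta(x, 0.5, 0.5), from = 0, to = 1, main = "Beta(1/2, 1/2)")
curve(dbeta(x, 1, 1),     from = 0, to = 1, main = "Beta(1, 1)")
curve(dbeta(x, 2, 5),     from = 0, to = 1, main = "Beta(2, 5)")
curve(dbeta(x, 2, 2),     from = 0, to = 1, main = "Beta(2, 2)")
par(mfrow = c(1, 1))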
38
Experience with similar compounds has suggested that response rates between 0.2 and 0.6 could be feasible.
Then, with r = 15 successes and 5 failures observed, we update the Beta(9.2, 13.8) prior and the posterior is Beta(15 + 9.2, 5 + 13.8) = Beta(24.2, 18.8).
39
40
[Figure: prior Beta(9.2, 13.8), likelihood dbinom(15, 20, θ) and posterior Beta(24.2, 18.8), each as a function of the probability of success]
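A minimal R sketch that reproduces the three panels (prior, likelihood and posterior for the drug example):

par(mfrow = c(3, 1))
curve(dbeta(x, 9.2, 13.8),  from = 0, to = 1, main = "Prior",      xlab = "prob of success")
curve(dbinom(15, 20, x),    from = 0, to = 1, main = "Likelihood", xlab = "prob of success")
curve(dbeta(x, 24.2, 18.8), from = 0, to = 1, main = "Posterior",  xlab = "prob of success")
par(mfrow = c(1, 1))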
41
If we are able to sample values from the posterior p(θ | r, n), then we can extend the inferential scope.
For example, one important application of this simulation process is when we need to estimate the posterior of a function of the parameter, e.g., the odds ratio:
φ = f(θ) = θ / (1 − θ)
For simple models we can directly simulate from the posterior density and use these values to empirically approximate the posterior quantiles.
42
In R, the posterior of the odds ratio and its quantiles are calculated as follows:
> theta.star <- rbeta(20000, 24.2, 18.8)
> odds.star <- theta.star/(1-theta.star)
> quantile(odds.star, prob = c(0.05, 0.5, 0.75, 0.95))
5%
50%
75%
95%
0.7730276 1.2852558 1.5784318 2.1592678
> hist(odds.star, breaks=100, xlab="odds ratio", freq=FALSE, xlim=c(0, 4))
> lines(density(odds.star), lwd =2, lty = 1, col ="red")
43
[Figure: histogram and density estimate of odds.star, the posterior of the odds ratio]
44
The predictive posterior distribution for r.new (the number of successes in m further trials) follows a Beta-Binomial distribution with density:
p(r.new) = choose(m, r.new) × [Γ(a + b) / (Γ(a) Γ(b))] × [Γ(a + r.new) Γ(b + m − r.new) / Γ(a + b + m)]
In R:
# Beta-binomial density
betabin <- function(r,a,b,m)
{
gamma(a+b)/(gamma(a)*gamma(b)) * choose(m,r) *
gamma(a+r)*gamma(b+m-r)/gamma(a+b+m)
}
45
46
[Figure: predictive posterior distribution for the number of successes in m further trials]
47
In R we can calculate:
# probability of at least 25 successes out of 40 further trials
> sum(betabin(25:40,24.2,18.8,m=40))
[1] 0.3290134
>
48
49
50
[Figure: predictive probability distribution]
51
i = 1, . . . , n
52
Posterior variance is based on an implicit sample size n0 and the data sample size
n.
53
The posterior mean is a weighted average of the prior mean and the sample mean:
μ_n = w μ0 + (1 − w) ȳ,   where   w = n0 / (n + n0),
or equivalently
μ_n = μ0 + (ȳ − μ0) n / (n + n0) = ȳ − (ȳ − μ0) n0 / (n + n0).
54
Prediction
Denoting the posterior mean and variance as μ_n and σ_n² = σ² / (n0 + n), the predictive posterior distribution for a new observation ỹ is
p(ỹ | y) = ∫ p(ỹ | θ) p(θ | y) dθ
55
i = 1, . . . , n,
56
p(w | μ, y) = Gamma( α + n/2,  β + (1/2) Σ_{i=1}^{n} (y_i − μ)² )
57
Clearly we can think of α = n0/2, where n0 is the effective prior sample size.
Since Σ_{i=1}^{n} (y_i − μ)² / n estimates σ² = 1/w, we can interpret 2β as representing n0 prior estimates of σ0², i.e. β = n0 σ0² / 2.
58
The kernel of the Poisson likelihood as a function of λ has the same form as a Gamma(a, b) prior for λ:
p(λ) ∝ λ^(a−1) exp(−b λ).
59
p(λ | y) ∝ λ^(a−1) e^(−bλ) Π_{i=1}^{n} e^(−λ t_i) (λ t_i)^{y_i}
         ∝ λ^(a + Y_n − 1) e^(−(b + T_n) λ)
         = Gamma(a + Y_n, b + T_n),
where Y_n = Σ_{i=1}^{n} y_i and T_n = Σ_{i=1}^{n} t_i.
60
E(λ | y) = (a + Y_n) / (b + T_n) = (Y_n / T_n) × T_n / (T_n + b) + (a/b) × (1 − T_n / (T_n + b)).
The posterior mean is a compromise between the prior mean a/b and the MLE Y_n / T_n.
Thus b can be interpreted as an effective exposure and a/b as a prior estimate of the Poisson mean.
61
Annual numbers of cases were available from a specialist center at Birmingham, from 1970 to 1988. For the analysis we consider the observed values from the first decade.
year       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
cases, xi     1    5    3    2    1    0    0    2    1    1
Because we are interested in counts of disease over time, a simple model is a Poisson process:
y_i | λ ~ Poisson(λ),   i = 1, . . . , 10.
62
Similar data are collected by another centre, giving a mean rate of 2.3 with s.d. 2.79.
With a Gamma(a, b) prior, the prior mean is a/b = 2.3 and the prior sd is sqrt(a)/b = 2.79, which gives a = 0.679 and b = 0.295.
Then
p(λ | y) = Gamma(Σ_i x_i + 0.679, 10 + 0.295) = Gamma(16.679, 10.295)
E(λ | y) = 16.679 / 10.295 = 1.620;   sd(λ | y) = sqrt(16.679) / 10.295 = 0.396
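A minimal check of these numbers in R (a sketch; the data vector is the first decade of counts from the table above):

x <- c(1, 5, 3, 2, 1, 0, 0, 2, 1, 1)          # cases 1970-1979
a <- 0.679; b <- 0.295                         # prior Gamma(a, b)
a.post <- a + sum(x); b.post <- b + length(x)  # posterior Gamma(16.679, 10.295)
c(mean = a.post / b.post, sd = sqrt(a.post) / b.post)
qgamma(c(0.025, 0.975), a.post, b.post)        # 95% credible interval for the rate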
63
[Figure: prior Gamma(0.679, 0.295) and posterior Gamma(16.679, 10.295) densities for the rate λ]
64
65
[Figure: predictive distribution for the number of counts]
66
Some comments
For all these examples, we see that
the posterior mean is a compromise between the prior mean and the MLE,
the posterior s.d. is less than each of the prior s.d. and the s.e.(MLE).
As n → ∞, the posterior mean tends to the MLE and the posterior s.d. tends to the s.e.(MLE), so the prior has a vanishing influence.
These observations are generally true, when the MLE exists and is unique.
67
Priors
When the posterior is in the same family as the prior then we have what is known as conjugacy.

Distribution of y    Parameter                 Conjugate prior
Binomial             Prob. of success          Beta
Poisson              Mean                      Gamma
Exponential          Reciprocal of mean        Gamma
Normal               Mean (variance known)     Normal
Normal               Variance (mean known)     Inverse Gamma
68
Practical
Exercise: Conjugate inference for a binomial experiment
Drug investigation example from this Lecture
We treat n = 20 volunteers with a new compound and observe r = 15 positive
responses. We use as prior Beta(9.2, 13.8):
1. What is the posterior mean and median for the response rate?
2. What are the 2.5th and 97.5th percentiles of the posterior?
3. What is the probability that the true response rate is greater than 0.6?
4. How is this value affected if a uniform prior is adopted?
5. Using the original Beta(9.2, 13.8) prior, suppose 40 more patients were entered
into the study. What is the chance that at least 25 of them respond positively?
69
Solution
# 1) Posteriors mean and median
# The posterior mean is: (r+a) / (n + a+b)
> (15 + 9.2) /( 20 + 9.2 + 13.8)
[1] 0.5627907
# the posterior median is
> qbeta(0.5, 24.2, 18.8 )
[1] 0.5637731
>
# 2) Posterior percentiles:
> qbeta(0.025, 24.2, 18.8)   # 2.5%
[1] 0.4142266
> qbeta(0.975, 24.2, 18.8)   # 97.5%
[1] 0.7058181
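A sketch for the remaining questions (assumptions: question 3 uses the Beta(24.2, 18.8) posterior, question 4 replaces the prior by a uniform Beta(1, 1), and question 5 reuses the betabin() function defined earlier in this lecture):

# 3) Pr(true response rate > 0.6)
> 1 - pbeta(0.6, 24.2, 18.8)
# 4) same probability under a uniform Beta(1, 1) prior, posterior Beta(16, 6)
> 1 - pbeta(0.6, 15 + 1, 5 + 1)
# 5) Pr(at least 25 out of 40 further patients respond), Beta(9.2, 13.8) prior
> sum(betabin(25:40, 24.2, 18.8, m = 40))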
70
71
Lecture 3:
Prior Distributions
72
Summary
I
Mixture of priors
73
The prior is not necessarily unique! In a recent article, Lambert et al. (2005) analyze the use of 13 different priors for the between-study variance parameter in random-effects meta-analysis.
The prior may not be completely specified. In Empirical Bayes inference priors
have unknown parameters that are estimated from the data.
74
Inference may rely only on priors. There are situations where no further data are available to combine with our priors, or there is no intention to update the priors. This is the typical case of risk analysis, sample size determination in experiments, simulation of complex processes, etc. In these analytical scenarios priors are usually used to simulate hypothetical data and we refer to this as prior predictive analysis.
75
76
77
The Fisher information is
I(θ) = E_{x|θ}[ ( d log p(X|θ) / dθ )² ] = −E_{x|θ}[ d² log p(X|θ) / dθ² ].
78
For the binomial likelihood this gives
I(θ) = n / ( θ (1 − θ) ).
79
For the normal mean with known variance v, the log-likelihood is
− Σ (x_i − θ)² / (2v) + C,
with
I(θ) = n / v.
80
If s = Σ (x_i − m)², then the log-likelihood for σ² is −(n/2) log σ² − s/(2σ²), and
I(σ²) = n / (2σ⁴).
81
82
They are invariant: whatever the scale we choose to measure the unknown parameter, the same prior results when the scale is transformed to any particular scale.
Applying this rule to the normal case with both mean and variance parameters
unknown does not lead to the same prior as applying separately the rule for the mean
and the variance and assuming a priori independence between these parameters.
83
a trial designed to have two-sided Type I error α and Type II error β in detecting a true difference of θ in mean response between the groups will require a sample size per group of
n = (2σ² / θ²) (z_{1−β} + z_{1−α/2})²,
and, conversely, for a fixed sample size n per group the power is
Power = Φ( sqrt(n/2) θ/σ − z_{1−α/2} ).
84
However, we accept uncertainty about θ and σ and we wish to include this feature in the sample size and power calculations.
85
Then
1. Simulate values θ ~ N(0.5, 0.1²) and σ ~ N(1, 0.3²) (subject to the constraint that σ is positive).
2. Substitute them in the formulae and generate n and Power.
3. Use the histograms of n and Power as their corresponding predictive distributions.
4. In R we have
## predictive prior distribution for sample size and power
set.seed(123)
theta <- rnorm(10000, 0.5, 0.1)
sigma <- rnorm(10000, 1, 0.3)
sigma <- ifelse(sigma <0, -1*sigma, sigma)
n <- 2*sigma^2 /(theta^2)*(0.84 + 1.96)^2
pow <- pnorm( sqrt( 63/2) * theta /sigma - 1.96)
par(mfrow=c(1,2))
hist(n, xlim = c(0, 400), breaks=50)
hist(pow)
par(mfrow=c(1,1))
86
[Figure: histograms of the simulated sample size n and power pow, their predictive prior distributions]
87
> round(quantile(n),2)
     0%     25%     50%     75%    100%
   0.00   37.09   62.03   99.45 1334.07
> round(quantile(pow),2)
  0%  25%  50%  75% 100%
0.09 0.61 0.81 0.95 1.00
> sum(pow<0.7)/10000
[1] 0.3645
88
We may want to express a more complex prior opinion that cannot be encapsulated by a beta distribution
89
In R:
# mixture of betas: posterior density under a two-component beta mixture prior
mixbeta <- function(x, r, a1, b1, a2, b2, q, n)
{
  # updated weight of the first component (uses betabin() defined above)
  qstar <- q*betabin(r, a1, b1, n) /
           (q*betabin(r, a1, b1, n) + (1-q)*betabin(r, a2, b2, n))
  p1 <- dbeta(x, a1 + r, n - r + b1)
  p2 <- dbeta(x, a2 + r, n - r + b2)
  qstar*p1 + (1-qstar)*p2     # return the mixture posterior density
}
90
We want to combine:
I
a non-informative Beta(1, 1)
91
92
[Figure: mixture priors and the corresponding posteriors as functions of the probability of response]
93
Hierarchical priors for large dimensional problems (e.g. regression with more predictors than observations)
Summary:
It is reasonable that the prior should influence the analysis, as long as the influence is recognized and justified
94
Practical
Suppose that most drugs (95%) are assumed to come from the stated Beta(9.2,
13.8) prior, but there is a small chance that the drug might be a winner.
Winners are assumed to have a prior distribution with mean 0.8 and standard
deviation 0.1.
95
1. What Beta distribution might represent the winners prior? (Hint: use the formulas of the mean and the variance of a Beta(a, b) distribution from lecture 2. These can be re-arranged to
a = m (m(1 − m)/v − 1)
and
b = (1 − m) (m(1 − m)/v − 1),
where m and v are the mean and the variance of a Beta(a, b) distribution.)
2. Enter this mixture prior model into the R code for mixtures and plot it.
3. Calculate the posterior mixture model based on 15 success out of 20 trials and
plot it.
4. What is the chance that the drug is a winner?
96
Solution
# 1)
# to work out a and b for the winner drug prior use the relationship
# between the mean and variance of the beta distribution and its parameters
> m <- 0.8
> v <- 0.1^2
> a <- m*(m*(1-m)/v -1)
> a
[1] 12
> b <- (1-m)*(m*(1-m)/v -1)
> b
[1] 3
# 2) plot for the mixture prior
curve(dbeta(x, 12, 3)*0.05 + dbeta(x, 9.2, 13.8)*0.95, from = 0, to =1,
main = "prior", xlab = "rate")
97
98
99
Lecture 4:
Introduction to Multiparameter Models
100
Summary
Normal model with unknown mean and variance: standard non-informative priors
101
Introduction
We have a model for the observed data which defines a likelihood p(y | θ) on the parameter vector θ.
Problems:
102
The observations are the mean times (in minutes) for men running the Marathon, with ages between 20 and 29 years old:
> library(LearnBayes)
> data(marathontimes)
> marathontimes$time
[1] 182 201 221 234 237 251 261 266 267 273 286 291 292 ...
>
103
We assume a Normal model for the data given the mean μ and the variance σ²:
y_1, . . . , y_20 | μ, σ² ~ N(μ, σ²)
Then the joint posterior density for the mean and the variance, p(μ, σ² | y), is called the Normal-χ² distribution
104
We can visualize the contour levels of the joint posterior with the function mycontour. The arguments include the log-density to plot, the rectangular area (xlo, xhi, ylo, yhi) and the data:
> mycontour(normchi2post, c(220,330,500,9000), time,
xlab = "mean", ylab = "var")
[Figure: contour plot of the joint posterior of (mean, variance) for the marathon times]
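Samples from this joint posterior (under the standard non-informative prior) can be drawn directly; a minimal sketch in R, which also creates the vectors mu and y.star used in the predictive checks further below:

# direct simulation from p(mu, sigma^2 | y) under the non-informative prior
time   <- marathontimes$time
n      <- length(time)
S      <- sum((time - mean(time))^2)
sigma2 <- S / rchisq(1000, n - 1)                    # draws of sigma^2 | y
mu     <- rnorm(1000, mean(time), sqrt(sigma2 / n))  # draws of mu | sigma^2, y
y.star <- rnorm(1000, mu, sqrt(sigma2))              # posterior predictive draws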
106
107
[Figure: histogram of the original marathon times and histograms of several simulated predictive datasets y.sim]
108
T4 = |y_(18) − μ| − |y_(2) − μ|,
where the 18th and 2nd order statistics approximate the 90% and 10% quantiles respectively.
These measures are compared with the corresponding values based on the observed data:
T1 = min(y),   T2 = max(y),   T3 = q75(y) − q25(y)
and
T4 = |y_(18) − μ| − |y_(2) − μ|.
109
In R we have:
# Analysis of the minimum, maximum, variability and asymmetry
min.y <- max.y <- asy1 <- asy2 <- inter.q <- rep(0,1000)
time <- sort(time)
for (b in 1:1000){
y.sim <- sample(y.star, 20)
mu.star <- sample(mu, 20)
min.y[b] <- min(y.sim)
max.y[b] <- max(y.sim)
y.sim <- sort(y.sim)
asy1[b] <- abs(y.sim[18] - mean(mu.star)) - abs(y.sim[2] - mean(mu.star))
asy2[b] <- abs(time[18] - mean(mu.star)) - abs(time[2] - mean(mu.star))
inter.q[b] <- quantile(y.sim, prob=0.75) - quantile(y.sim, prob=0.25)
}
110
111
[Figure: histograms of the predictive checking statistics: minimum y*, maximum y*, variance y* and asymmetry]
112
In this model we observe n independent trials, each of which has p possible outcomes with associated probabilities
θ = (θ_1, . . . , θ_p).
Letting y = (y_1, . . . , y_p) be the number of times each outcome is observed, we have
y | θ ~ Multinomial(n, θ),
with likelihood
p(y | θ) = ( (Σ_i y_i)! / Π_i y_i! ) Π_i θ_i^{y_i},   where Σ_i y_i = n and Σ_i θ_i = 1.
113
The conjugate prior is the Dirichlet distribution with parameters a_i > 0:
p(θ) ∝ Π_i θ_i^{a_i − 1},
and then the posterior is
p(θ | y) ∝ Π_i θ_i^{y_i + a_i − 1}.
114
Some comments
We are going to use a surrogate Poisson model for the multinomial distribution when analyzing contingency tables.
115
             Intervention
             New        Control
Death        θ_{1,1}    θ_{1,2}
No death     θ_{2,1}    θ_{2,2}

Data model:
p(y | θ) ∝ Π_{j=1}^{2} Π_{i=1}^{2} θ_{i,j}^{y_{i,j}}
Prior model:
p(θ) ∝ Π_{j=1}^{2} Π_{i=1}^{2} θ_{i,j}^{a_{i,j} − 1}
Posterior model:
p(θ | y) ∝ Π_{j=1}^{2} Π_{i=1}^{2} θ_{i,j}^{y_{i,j} + a_{i,j} − 1}
116
The quantity of interest is the odds ratio
ψ = (θ_{1,1} θ_{2,2}) / (θ_{1,2} θ_{2,1}).
Simulate a large number of values for the vector θ from its posterior.
117
118
             Intervention
             New    Control   Total
Death         13        23       36
No death     150       125      275
Total        163       148      311
We use a Dirichlet prior with parameters a1,1 = a1,2 = a2,1 = a2,2 = 1, which
corresponds to a uniform distribution for (1,1 , 1,2 , 2,1 , 2,2 ).
119
We can simulate from a Dirichlet distributions with the function rdirichlet() from
the package LearnBayes:
> library(LearnBayes)
> draws <- rdirichlet(10000, c(13,23,150,125) )
> odds <-draws[,1]*draws[,4]/(draws[,2]*draws[,3])
> hist(odds, breaks = 100, xlab="Odds Ratio", freq = FALSE)
> lines(density(odds), lty =2, lwd=2, col ="blue")
> abline(v=quantile(odds, prob=c(0.025, 0.5, 0.975)), lty=3, col="red")
>
> quantile(odds, prob=c(0.025, 0.5, 0.975))
2.5%
50%
97.5%
0.2157422 0.4649921 0.9698272
120
[Figure: histogram and density of the posterior odds ratio, with the prior odds density for comparison]
121
To test H0: ψ ≥ 1 vs. H1: ψ < 1, we compute the posterior probability Pr(ψ > 1 | y):
> sum(odds > 1)/10000
[1] 0.0187
It is interesting to compare these results with the exact Fisher test.
122
Now, if we work with a prior with parameters a_{1,1} = 2, a_{1,2} = 1, a_{2,1} = 1 and a_{2,2} = 2, with density
p(θ_{1,1}, θ_{1,2}, θ_{2,1}, θ_{2,2}) ∝ θ_{1,1} θ_{2,2},
we get the following results:
> # Fisher exact test
> param <- c(2,1,1,2)  # parameters of the dirichlet
> draws <- rdirichlet(10000, c(13,23,150,125)+param )
> odds <-draws[,1]*draws[,4]/(draws[,2]*draws[,3])
> sum(odds>1)/10000
[1] 0.0277
This shows (empirically) that the Fisher test is NOT based on a non-informative prior. Some weak information of association is implied in the test.
However, there is a difference in interpretation between the Bayesian and sampling theory results.
123
[Figure: perspective plot of the joint density of theta11 and theta22]
124
Practical
Exercise: Normal data with unknown mean and variance
I
Now, change the first observation of the marathon times and generate an outlier
as follows:
> library(LearnBayes)
> data(marathontimes)
> marathontimes$time
[1] 182 201 221 234 237 251 261 266 267 273 286 291 292 ...
>
# generate an outlier in the first observation:
data1 <- marathontimes$time
data1[1] <- 82
125
Lecture 5:
Introduction to WinBUGS
126
Summary
Introduction to BUGS
Making predictions
127
Introduction to BUGS
The BUGS project began at the Medical Research Council Biostatistics Unit in
Cambridge in 1989, before the classic Gibbs sampling paper by Gelfand and Smith in
1990. An excellent review and future directions of the BUGS project is given in Lunn
et al. (2009).
BUGS stands for Bayesian inference using Gibbs sampling, reflecting the basic
computational techniques originally adopted.
BUGS has been just one part of the tremendous growth in the application of Bayesian
ideas over the last 20 years.
At this time BUGS has over 30,000 registered users worldwide, and an active on-line community comprising over 8,000 members.
Typing WinBUGS in Google generates over 100,000 hits; in Google Scholar over 5,000; and searching in Statistics in Medicine on-line gives about 100 hits.
128
Knowledge base:
I
Inference engine:
I
This approach forces us to think first about the model used to describe the problem at hand. The declarative programming approach sometimes confuses statistical analysts used to working with procedural or functional statistical languages (SAS, SPSS, R, etc.).
129
130
Example: Drug
In n = 20 patients we observed r = 15 positive responses.
y ~ Binomial(θ, n)
and we assume a conjugate prior for θ:
θ ~ Beta(a, b)
Of course, we know the posterior distribution is
θ | y ~ Beta(a + r, n − r + b)
and no simulation is necessary. But just to illustrate WinBUGS ...
131
132
133
134
We work on the logit scale:
φ = log( θ / (1 − θ) ),
so that −∞ < φ < ∞.
We assume a normal prior for φ:
φ ~ Normal(μ, τ)
for a suitable mean μ and precision τ = 1/σ².
This is a non-conjugate prior with no simple form for the posterior.
Straightforward in WinBUGS!
135
136
137
Making predictions
I
making predictions!
Easy in MCMC/WinBUGS: just specify a stochastic node without a data value and it is automatically predicted.
The easiest case is where there is no data at all! Just forward sampling from the prior to make a Monte Carlo analysis.
138
139
The prediction node y.pred is conditionally independent of the data node y given the rate θ.
140
141
142
143
144
145
The R object m1 generated in this analysis belongs to the class bugs and can be further manipulated, transformed, etc.
> class(m1)
[1] "bugs"
> names(m1)
 [1] "n.chains"        "n.iter"          "n.burnin"
 [4] "n.thin"          "n.keep"          "n.sims"
 [7] "sims.array"      "sims.list"       "sims.matrix"
[10] "summary"         "mean"            "sd"
[13] "median"          "root.short"      "long.short"
[16] "dimension.short" "indexes.short"   "last.values"
[19] "isDIC"           "DICbyR"          "pD"
[22] "DIC"             "model.file"      "program"
146
The object m1 is a list in R, so we can extract elements of the list by using the $
operator, for example:
> m1$pD
[1] 1.065
> m1$n.chains
[1] 1
> m1$sims.array[1:10, ,"theta"]
[1] 0.8389 0.9124 0.7971 0.8682 0.7025 0.7696 0.8417 0.6782 0.6146
[10] 0.8647
> m1$sims.array[1:10, ,"y.pred"]
[1] 15 19 19 20 15 20 18 17 10 17
> theta <- m1$sims.array[1:1000, ,"theta"]
> hist(theta, breaks = 50, prob = TRUE)
> lines(density(theta), col = "blue", lwd =2)
147
3
0
Density
Histogram of theta
0.5
0.6
0.7
0.8
0.9
1.0
theta
148
149
150
mean(p[]) to take mean of whole array, mean(p[m : n]) to take mean of elements m
to n. Also for sum(p[])
dnorm(0, 1)I(0, ) means the random variable will be restricted to the range (0, ∞).
151
s <- 1/sqrt(tau) sets s = 1/sqrt(τ).
152
Data transformations
Although transformations of data can always be carried out before using WinBUGS, it
is convenient to be able to try various transformations of dependent variables within a
model description.
For example, we may wish to try both y and sqrt(y) as dependent variables without
creating a separate variable z = sqrt(y) in the data file.
The BUGS language therefore permits the following type of structure to occur:
for (i in 1:N) {
z[i] <- sqrt(y[i])
z[i] ~ dnorm(mu, tau)
}
Strictly speaking, this goes against the declarative structure of the model specification.
Dr. Pablo E. Verde
153
Binomial: r ~ dbin(p, n)
Poisson: r ~ dpois(lambda)
Uniform: x ~ dunif(a, b)
Gamma: x ~ dgamma(a, b)
154
155
156
Practical
157
y_i = log(survival_i)
y_i ~ N(μ, τ),   τ = 1/σ²
158
Solution
# WinBUGS script:
model{
  for ( i in 1:N)
  {
    y[i] <- log(survtime[i])         # data transform
    y[i] ~ dnorm(mu, tau)            # sampling model
  }
  sigma <- 1/sqrt(tau)               # standard deviation
  mu ~ dnorm(0, 1.0E-3)              # prior for mu
  tau ~ dgamma(1.0E-3, 1.0E-3)       # prior for tau
  y.new ~ dnorm(mu, tau)             # predicted data
  y.dif <- step(y.new - log(150))    # pr(survival > 150)
}
159
Solution
160
Lecture 6:
Multivariate Models with WinBUGS
161
162
163
Some characteristics
When p = 1,
W_1(k, R) ≡ Gamma(k/2, R/2) ≡ χ²_k / R.
The Jeffreys prior is
p(Σ) ∝ |Σ|^{−(p+1)/2},
equivalent to k → 0. This is not currently implemented in WinBUGS.
For a weakly informative prior we can set R/k to be a rough prior guess at the unknown true covariance matrix, and taking k = p indicates a minimal effective prior sample size.
164
Known problems
If you do not use Wishart priors for precision matrices in BUGS, you need to make sure that the covariance matrix at each iteration is positive-definite, otherwise the program may crash.
165
      [,1] [,2]
 [1,]    1    1
 [2,]    1   -1
 [3,]   -1    1
 [4,]   -1   -1
 [5,]    2   NA
 [6,]    2   NA
 [7,]   -2   NA
 [8,]   -2   NA
166
The following script in WinBUGS implement a Bayesian analysis for this problem.
model
{
for (i in 1 : 12){
Y[i, 1 : 2] ~ dmnorm(mu[], tau[ , ])
}
mu[1] <- 0
mu[2] <- 0
tau[1 : 2,1 : 2] ~ dwish(R[ , ], 2)
R[1, 1] <- 0.001
R[1, 2] <- 0
R[2, 1] <- 0
R[2, 2] <- 0.001
Sigma2[1 : 2,1 : 2] <- inverse(tau[ , ])
rho <- Sigma2[1, 2] / sqrt(Sigma2[1, 1] * Sigma2[2, 2]) }
167
[Figure: posterior histogram of the correlation, P(rho | data)]
Using posterior predictive values in each iteration, WinBUGS imputes the missing values.
168
In some applications the cell probabilities of contingency tables can be made function
of more basic parameters
Example: Population genetics example
                       Maternal genotype
Offspring genotype     AA      AB      BB
AA                    427     108       -
AB                     95     161      64
BB                      -      71      74
169
                       Maternal genotype
Offspring genotype     AA                  AB                    BB
AA                    (1−σ)p + σ          (1−σ)p/2 + σ/4        -
AB                    (1−σ)q              1/2                   (1−σ)p
BB                    -                   (1−σ)q/2 + σ/4        (1−σ)q + σ
170
171
[WinBUGS summary table omitted; n.eff = 10000 for each monitored parameter]
172
[Figure: posterior histograms and scatter plot for sigma]
173
Lecture 7:
Bayesian Computations with MCMC Methods
174
Summary
175
p(y | θ) and p(θ) are usually available in closed form, but p(θ | y) is usually not analytically tractable. Also, we may want to obtain marginal posterior distributions and posterior expectations, which involve integrals of p(θ | y) that cannot be computed in closed form.
176
Suppose we can draw samples from the joint posterior distribution for θ, i.e.
θ^(1), θ^(2), . . . , θ^(N) ~ p(θ | y).
Then the empirical average converges to the posterior expectation:
(1/N) Σ_{i=1}^{N} g(θ^(i)) → E(g(θ) | y)   as N → ∞.
177
Then, we design a Markov chain which has p(θ | y) as its stationary distribution.
Run the chain until it appears to have settled down to equilibrium, say
θ^(k), θ^(k+1), . . . , θ^(K) ~ p(θ | y).
Problem: how do we design a Markov chain with p(θ | y) as its unique stationary distribution?
179
Answer: This is surprisingly easy and several standard recipes are available.
See Gilks, Richardson and Spiegelhalter (1996) for a gentle introduction and many worked examples.
180
The Gibbs sampler:
1. Choose starting values θ^(0) = (θ_1^(0), . . . , θ_k^(0)).
2. Sample each component from its full conditional distribution given the current values of the others:
   θ_1^(1) ~ p(θ_1 | θ_2^(0), θ_3^(0), . . . , θ_k^(0), y)
   θ_2^(1) ~ p(θ_2 | θ_1^(1), θ_3^(0), . . . , θ_k^(0), y)
   ...
   θ_k^(1) ~ p(θ_k | θ_1^(1), θ_2^(1), . . . , θ_{k−1}^(1), y)
3. Repeat step 2 many 1000s of times. Eventually we obtain samples from p(θ | y).
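A minimal Gibbs sampler sketch in R for a normal model with unknown mean and precision (the kind of example shown in the following figures); the data, priors and starting values below are assumptions for illustration only:

set.seed(42)
y <- rnorm(30, mean = 150, sd = 20)        # toy data (assumption)
n <- length(y)
mu0 <- 0; prec0 <- 1e-6                    # vague normal prior for mu
a <- 0.001; b <- 0.001                     # vague gamma prior for tau
T <- 1000
mu <- tau <- numeric(T)
mu[1] <- mean(y); tau[1] <- 1/var(y)       # starting values
for (t in 2:T) {
  # full conditional of mu | tau, y : Normal
  prec.n <- prec0 + n*tau[t-1]
  mean.n <- (prec0*mu0 + tau[t-1]*sum(y)) / prec.n
  mu[t]  <- rnorm(1, mean.n, sqrt(1/prec.n))
  # full conditional of tau | mu, y : Gamma
  tau[t] <- rgamma(1, a + n/2, b + 0.5*sum((y - mu[t])^2))
}
par(mfrow = c(1, 2)); plot(mu, type = "l"); plot(tau, type = "l"); par(mfrow = c(1, 1))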
181
Gamma(a, b)
[Figure: Gibbs sampler sequences in the (mu, tau) plane]
Four independent sequences of the Gibbs sampler for a normal distribution with
unknown mean and variance.
Left panel: first 10 steps. Right panel: last 500 iterations in each chain.
184
[Figure: trace plots and histograms for the posteriors of mu and tau]
Traces of one sequence of Gibbs simulations. Upper panels: 500 iterations from the posterior of mu. Lower panels: 500 iterations from the posterior of tau.
185
Reversibility:
I
The key property is reversibility or detailed balance, i.e., a balance in the flow of
transition probabilities between the states of the Markov chain
186
Irreducibility:
An irreducible chain is one in which for any point θ^(k) in the parameter space, it is possible to move from θ^(k) to any other point θ^(l) in a finite number of steps.
This guarantees the chain can visit all possible values of θ irrespective of the starting value θ^(0).
Aperiodicity:
If R1, R2, . . . , Rk are disjoint regions in the parameter space, the chain does not cycle around them.
To sample from p(θ | y) we construct a Markov chain with transition matrix P which satisfies reversibility, irreducibility and aperiodicity.
187
Metropolis algorithm
The algorithm proceeds as follows: at iteration t a candidate value θ* is proposed from a symmetric distribution centred at the current value θ^(t−1); we then compute the ratio
r = p(θ* | y) / p(θ^(t−1) | y)
and accept θ* with probability min(1, r), otherwise we keep θ^(t) = θ^(t−1).
The algorithm continues until we sample from p(θ | y). The coin tosses allow it to go to less plausible θ's, and keep it from getting stuck in local maxima.
Dr. Pablo E. Verde
188
The Gibbs sampler. The Gibbs transition can be regarded as a special case of a
Metropolis transition
189
Metropolis in R
The current implementation uses a random walk Metropolis with proposal density
multivariate Normals.
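A minimal random-walk Metropolis sketch in R (assumptions: the target is the Beta(24.2, 18.8) posterior from the drug example, and the proposal standard deviation 0.1 is just a guess):

set.seed(1)
log.post <- function(theta) dbeta(theta, 24.2, 18.8, log = TRUE)   # log target
T <- 5000; theta <- numeric(T); theta[1] <- 0.5
for (t in 2:T) {
  cand <- theta[t-1] + rnorm(1, 0, 0.1)            # random-walk proposal
  logr <- if (cand > 0 && cand < 1)
            log.post(cand) - log.post(theta[t-1]) else -Inf
  theta[t] <- if (log(runif(1)) < logr) cand else theta[t-1]   # accept/reject
}
hist(theta[-(1:500)], breaks = 50)                 # discard burn-in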
190
It is well known that the pair of marginal distributions does not uniquely determine a bivariate distribution. For example, a bivariate distribution with normal marginal distributions need not be jointly normal (Feller 1966, p. 69).
The full conditional distributions are
x | y ~ N( (By + C1) / (Ay² + 1),  1 / (Ay² + 1) ),
y | x ~ N( (Bx + C1) / (Ax² + 1),  1 / (Ax² + 1) ).
192
[Figure: Gibbs sampler output for the bivariate example, sample path and scatter plot of res[,1] versus res[,2]]
[Figure: trace plots of the two components over 1000 iterations]
Checking convergence
Once convergence is reached, samples should look like a random scatter about a
stable mean value.
One approach is to run many long chains with widely differing starting values.
199
Lecture 8:
Bayesian Regression Models
200
Modeling Examples
201
Introduction
Standard (and non standard) regression models can be easily formulated within a
Bayesian framework.
I
Specify prior distribution for regression coefficients and any other unknown
(nuisance) parameters
202
203
Linear regression
Example: Stack loss data, WinBUGS Volume 1
In this example, we illustrate a more complete regression analysis including outlier
checking, model adequacy and variable selection methods.
This is a very often analyzed data set from Brownlee (1965, p. 454).
204
[Figure: pairs plot of the stack loss data variables X1, X2 and X3]
205
Model 1 specification:
y_i ~ Normal(μ_i, τ),   i = 1, . . . , 21
τ = 1/σ²
σ ~ Uniform(0.01, 100)
β_k ~ Normal(0, 0.001),   k = 0, . . . , 3.
206
207
208
209
y_i ~ t(μ_i, τ, ν),   i = 1, . . . , 21
τ = 1/σ²
σ ~ Uniform(0.01, 100)
β_k ~ Normal(0, 0.001),   k = 0, . . . , 3.
210
In the WinBUGS code we modify the model section by commenting out the normal
model and adding one line with the model based on the t-distribution and we run the
MCMC again.
#y[i] ~ dnorm(mu[i], tau)
y[i] ~ dt(mu[i], tau, 4)
#DIC Normal
Dbar Dhat pD DIC
Y 110.537 105.800 4.736 115.273
total 110.537 105.800 4.736 115.273
#DIC t
Dbar Dhat pD DIC
Y 109.043 104.103 4.940 113.983
total 109.043 104.103 4.940 113.983
A very modest difference between the two models.
Dr. Pablo E. Verde
211
Variable Selection
We modify the structure of the distribution of the regression coefficients by adding a common distribution with unknown variance. This is called a ridge regression model, where the β's are assumed exchangeable.
Model 3 specification:
y_i ~ Normal(μ_i, τ),   i = 1, . . . , 21
τ = 1/σ²
σ ~ Uniform(0.01, 100)
β_k ~ Normal(0, φ),   k = 0, . . . , 3
φ = 1/σ_B²
σ_B ~ Uniform(0.01, 100).
212
In WinBUGS
for (j in 1 : p) {
# beta[j] ~ dnorm(0, 0.001)
# coefs independent
beta[j] ~ dnorm(0, phi) # coefs exchangeable (ridge regression)
}
phi <- 1/(sigmaB*sigmaB)
sigmaB ~ dunif(0.01, 100) #Gelmans prior
213
214
We see that β_3 is not a relevant variable in the model. Now, we try another variable selection procedure, by modifying the distribution of the regression coefficients.
Model 5 specification:
y_i ~ Normal(μ_i, σ²),   i = 1, . . . , 21
σ ~ Uniform(0.01, 100)
β_k ~ Normal(0, 100),   k = 0, . . . , 3
δ_1 ~ Bernoulli(0.5)
δ_2 ~ Bernoulli(0.5)
δ_3 ~ Bernoulli(0.5).
215
216
217
library(MASS)
data(michelson)
attach(michelson)
plot(Expt, Speed, main="Speed of Light Data",
xlab="Experiment No.",ylab="Speed")
218
[Figure: boxplots of Speed by Experiment No. for the Michelson speed of light data]
219
Model 1:
y_{i,j} ~ Normal(μ_j, τ),   τ = 1/σ²,   i = 1, . . . , n_j,   j = 1, . . . , J,
μ_j = μ + α_j,
μ ~ Normal(0, 0.001),
α_j ~ Normal(0, 0.001),   j = 2, . . . , J,
σ ~ Uniform(0.01, 100).
220
221
# mean differences:
d[1] <- (m.g[2] - m.g[1] )
d[2]<- (m.g[3] - m.g[1] )
...
# Priors
mu0 ~ dnorm(0, 0.0001)
for( j in 2:J){
alpha[j] ~ dnorm(0, 0.0001)
}
tau <- 1/(sigma*sigma)
sigma ~ dunif(0.01, 100) #Gelmans prior
}
222
One serious problem of ANOVA modeling is the lack of variance homogeneity between
groups. We can extend our Bayesian set up to include this data feature as follows:
Model 2:
y_{i,j} ~ Normal(μ_j, τ_j),   τ_j = 1/σ_j²,   i = 1, . . . , n_j,   j = 1, . . . , J,
μ_j = μ + α_j,
μ ~ Normal(0, 0.001),
α_j ~ Normal(0, 0.001),   j = 2, . . . , J,
σ_j ~ Uniform(0.01, 100),   j = 1, . . . , J.
This model includes a structural dispersion sub-model for each treatment group.
223
224
For models with more than 3 groups the estimation based on simple means can be improved by adding an exchangeability structure to the α_j's (Stein 1956, Lindley 1962, Casella and Berger, pag. 574). This assumes a sub-model for α_j in the following way:
Model 3:
y_{i,j} ~ Normal(μ_j, τ_j),   τ_j = 1/σ_j²,   i = 1, . . . , n_j,   j = 1, . . . , J,
μ_j = μ + α_j,
α_j ~ Normal(0, ψ),   ψ = 1/σ_α²,   j = 2, . . . , J,
μ ~ Normal(0, 0.001),
σ_α ~ Uniform(0.01, 100),
σ_j ~ Uniform(0.01, 100),   j = 1, . . . , J.
225
226
Model 4:
y_{i,j} ~ Normal(μ_j, τ_j),   τ_j = 1/σ_j²,   i = 1, . . . , n_j,   j = 1, . . . , J,
μ_j = μ + α_j δ_j,
μ ~ Normal(0, 0.001),
α_j ~ Normal(0, 0.001),   j = 2, . . . , J,
δ_j ~ Bernoulli(0.5),   j = 2, . . . , J,
σ_j ~ Uniform(0.01, 100),   j = 1, . . . , J.
227
The next slides present some graphical results of the main features of each model. The
full WinBUGS script for this analysis is anova.odc, this includes some other
calculations for residual analysis as well.
228
[Figure: posterior distributions for mean differences between all groups, μ_k − μ_l (k ≠ l), Model 1]
[Figure: posterior distributions for mean differences between all groups, μ_k − μ_l (k ≠ l), Model 2]
[Figure: posterior distributions for mean differences between all groups with exchangeability, Model 3]
[Figure: posterior distributions for the group means with mixture distributions, Model 4]
The Space Shuttle Challenger's final mission was on January 28th 1986, on an unusually cold morning (31 F / -0.5 C).
It disintegrated 73 seconds into its flight after an O-ring seal in its right solid rocket booster failed.
The uncontrolled flame caused structural failure of the external tank, and the resulting aerodynamic forces broke up the orbiter.
235
Technical background
236
On the night of January 27, 1986, the night before the space shuttle Challenger accident, there was a three-hour teleconference among people at Morton Thiokol (manufacturer of the solid rocket motor), Marshall Space Center and Kennedy Space Center.
The discussion focused on the forecast of a 31 F temperature for launch time the next morning, and the effect of low temperature on O-ring performance.
237
238
[Figure: observed O-ring distress (0/1) against temperature in F]
239
These data have been analyzed several times in the literature. I found that fitting a change point model makes an improvement on previously published analyses. The model is:
Pr(y_i = 1) = logit^{-1}(β_0 + β_1 z_i)
where
z_i = 1 for x_i ≥ K, and 0 otherwise.
In this model K is unknown and represents the temperature from which the probability of distress is lowest.
240
241
> print(chall.1)
          mean  sd   2.5%   25%   50%   75%  97.5%
beta0      6.2 3.9   -0.2   3.3   5.8   8.4   15.3
beta1     -8.0 3.9  -17.2 -10.2  -7.6  -5.1   -2.1
K         63.8 2.9   58.2  63.1  64.1  65.1   67.0
deviance  19.1 2.8   16.6  16.9  18.0  20.4   26.9
pD = 2.4 and DIC = 21.5
242
[Figure: posterior histogram of the change point K, and fitted probability of distress against temperature in F]
243
244
A catastrophic failure of a field joint is expected when all four events occur during the operation of the solid rocket boosters:
p_field = p_a × p_b × p_c × p_d
Since there are 6 field joints per launch, assuming that all 6 joint failures are independent, the probability of at least one failure is
Risk = 1 − (1 − p_field)^6
The available data for the other events were very sparse:
Event b: 2 in 7 flights
Event c: 1 in 2 flights
Event d: 0 in 1 flight
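A hedged sketch of this calculation in R (assumptions: p.a is fixed at an illustrative value rather than taken from the change-point model, and uniform Beta(1, 1) priors are used for events b, c and d):

set.seed(7)
M   <- 10000
p.a <- 0.95                        # assumed Pr(event a) at 31 F, for illustration only
p.b <- rbeta(M, 1 + 2, 1 + 5)      # event b: 2 failures in 7 flights
p.c <- rbeta(M, 1 + 1, 1 + 1)      # event c: 1 in 2 flights
p.d <- rbeta(M, 1 + 0, 1 + 1)      # event d: 0 in 1 flight
p.field <- p.a * p.b * p.c * p.d   # all four events for a single field joint
risk <- 1 - (1 - p.field)^6        # at least one of the 6 field joints fails
round(quantile(risk, c(0.025, 0.5, 0.975)), 3)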
245
246
247
Practical
Exercise: a nonconjugate nonlinear model
Volume 2 in WinBUGS help: Dugongs
Originally, Carlin and Gelfand (1991) consider data on length yi and age xi
measurements for 27 dugongs (sea cows) and use the following nonlinear growth curve
with no inflection point and an asymptote as xi tends to infinity:
y_i ~ Normal(μ_i, σ²)
μ_i = α − β γ^{x_i}
248
Practical
I
Change the last observation 2.57 to 2.17 and run the model. Did you see different
results?
...
df <- 5
249
Lecture 9:
Introduction to Hierarchical Models
250
During the previous lectures we have seen several examples, where we performed
a single statistical analysis.
Now, suppose that instead of having a single analysis, you have several ones.
Each one giving independent results from each other.
251
From the classical point of view, Charles Stein demonstrated in the 50s a
fundamental theoretical result, which shows the benefit of combining individual
results in a single analysis.
Charles Stein (born 22nd of March 1920) ... and he is still working every day!
252
Hierarchical Models
It has become common practice to construct statistical models which reflect the underlying complexity of the problem under study, such as different patterns of heterogeneity, dependence, mis-measurement, missing data, etc.
Statistical models which reflect the complexity of the data involve multiple parameters.
Examples are:
I
...
253
254
Exchangeability
I
255
Example:
Suppose that X1, X2, X3 are the outcomes of three patients enrolled in a clinical trial, where Xi = 1 indicates a positive reaction to a treatment and Xi = 0 no reaction. We may judge
p(X1 = 1, X2 = 0, X3 = 1) = p(X2 = 1, X1 = 0, X3 = 1) = p(X1 = 1, X3 = 0, X2 = 1)
Note that this is a very strong assumption. In reality we may expect that patients behave in different ways. For example, they may fail to comply with a treatment.
256
The hierarchical structure is:
y_{ij} ~ p(y_{ij} | θ_i),
θ_i ~ p(θ_i | φ),
φ ~ p(φ),
where p(φ) can be considered as a common prior for all units, but with unknown parameters.
257
Assuming that θ_1, . . . , θ_N are drawn from some common prior distribution whose parameters are unknown is known as a hierarchical or multilevel model.
Empirical Bayes techniques omit p(φ). They are useful in practice, but they may bias variability estimates by re-using y to estimate φ.
258
Note that there does not need to be any actual sampling; perhaps these N units are the only ones that exist. This is very common in meta-analysis.
259
260
An exchangeable model therefore leads to inferences for each unit having narrower intervals than if the units are assumed independent, but shrunk towards the prior mean response.
w_i controls the shrinkage of the estimate towards μ, and the reduction in the width of the interval for θ_i.
261
These data present the relation between rainfall and the proportions of people
with toxoplasmosis for 34 cities in El Salvador (Efron 1986)
The data have been used to illustrate non-linear relationship between the amount
of rain and toxoplasmosis prevalence
262
[Figure: toxoplasmosis rate against rainfall and against sample size for the 34 cities]
264
Pooled approach:
To start, we ignore the possible (complex) relation between rainfall and proportion and we fit the same beta-binomial model to all the cities:
p(r | θ, n) = Π_{i=1}^{34} p(r_i | θ, n_i) = Π_{i=1}^{34} Binomial(n_i, θ)
p(θ) = Beta(a, b)
p(θ | r, n) ∝ Π_{i=1}^{34} p(r_i | θ, n_i) p(θ)
           = Beta( a + Σ_i r_i,  b + Σ_i (n_i − r_i) )
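A minimal R sketch of this pooled posterior (an assumption: a uniform Beta(1, 1) prior), using the r and n vectors given in the WinBUGS data list further below:

r <- c(2,3,1,3,2,3,2,7,3,8,7,0,15,4,0,6,0,33,4,5,2,0,8,
       41,24,7,46,9,23,53,8,3,1,23)
n <- c(4,20,5,10,2,5,8,19,6,10,24,1,30,22,1,11,1,54,9,18,12,1,11,77,51,
       16,82,13,43,75,13,10,6,37)
a <- 1; b <- 1                                # uniform prior (assumption)
a.post <- a + sum(r); b.post <- b + sum(n - r)
qbeta(c(0.025, 0.5, 0.975), a.post, b.post)   # pooled rate with 95% interval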
265
Some comments:
I
Does this model adequately describe the random variation in outcomes for each city?
Are the city rates more variable than our model assumes?
266
Model 1:
r_i ~ Binomial(n_i, θ_i)
logit(θ_i) ~ N(μ, σ²)
μ ~ N(0, 100)
σ ~ Uniform(0.01, 10)
This model combines information between cities in a single model.
267
Model 2:
Modeling each city rate independently
r_i ~ Binomial(n_i, θ_i)
268
[WinBUGS model code omitted; it fits the exchangeable model for r[i], an independent model for the copy r2[i], and the priors]
269
#data
list(I=34,
r=c(2,3,1,3,2,3,2,7,3,8,7,0,15,4 ,0,6 ,0,33,4,5 ,2 ,0,8,
41,24,7, 46,9,23,53,8,3,1,23),
r2=c(2,3,1,3,2,3,2,7,3,8,7,0,15,4 ,0,6 ,0,33,4,5 ,2 ,0,8,41,24,7, 46,
9,23,53,8,3,1,23),
n=c(4,20,5,10,2,5,8,19,6,10,24,1,30,22,1,11,1,54,9,18,12,1,11,77,51,
16,82,13,43,75,13,10,6,37)
)
270
[Figure: shrinkage effect, ML versus Bayes estimates of the city rates]
272
Results:
DIC for Model 1 (GLMM) = 145.670
DIC for Model 2 (independent) = 176.504
The use of Model 1 has the effect of reducing the variability between cities.
The use of Model 1 gives better results at the level of the city.
273
Model 3:
r_i ~ Binomial(n_i, θ_i)
logit(θ_i) = a_i + β_1 (rain_i − mean(rain)) + β_2 (log n_i − mean(log n))
a_i ~ N(α, σ²)
σ ~ Uniform(0.01, 10)
α, β_1, β_2 ~ N(0, 100)
274
In WinBUGS:
model
{
for(i in 1:N)
{log.n[i] <- log(n[i])}
for( i in 1 : N ) {
r[i] ~ dbin(p[i], n[i])
a[i] ~ dnorm(alpha,tau)
logit(p[i]) <- a[i] + beta1 * (rain[i]-mean(rain[])) +
beta2 * (log.n[i] - mean(log.n[]))
}
# Priors
tau <- 1/(sigma*sigma)
sigma ~ dunif(0.01, 5)
alpha ~ dnorm(0,0.01)
beta1 ~ dnorm(0, 0.01)
beta2 ~ dnorm(0, 0.01)}
275
[Figure: posterior densities of beta1 and beta2]
276
Example: Ranking of the eighteen baseball players (Efron and Morris, 1977)
How can we rank the players ability ? Who was the best player of the season 1970?
We use the ranking function in WinBUGS to answer this question.
Name        hit/AB    TRUTH    James-Stein
Clemente    18/45     .346     .290
Robinson    17/45     .298     .286
...         ...       ...      ...
Total squared error:  .077     .022
277
[Figure: batting averages, ML versus Bayes (shrinkage) estimates]
278
[Figure: posterior distributions of Prob(Hits) for each of the eighteen players]
279
model {
for( i in 1 : N ) {
r[i] ~ dbin(p[i],n[i])
logit(p[i]) <- b[i]
b[i] ~ dnorm(mu, tau)
p.rank[i] <- rank(p[], i)
}
# hyper-priors
mu ~ dnorm(0.0,1.0E-6)
sigma~ dunif(0,100)
tau<-1/(sigma*sigma)
}
# Data
list(r = c(145, 144, 160, 76, 128, 140, 167, 41, 148, 57, 83, 79, 142,
152, 52, 168, 137, 21),
n= c(412, 471, 566, 320, 463, 511, 631, 183, 555, 245, 322, 315, 480, 583,
231, 603, 453, 115), N=18 )
280
281
282
The posterior distribution of θ_i for each unit borrows information from the likelihood contributions of all other units. Estimation is more efficient.
283
Uses of the prior distribution include: sensitivity analysis, prior predictions, model building, etc.
284
Finally
Nice to see that most people attended and survived the course!!
Hope that everybody will now be an enthusiastic modern Bayesian ;-)
Thank you very much ... Muchas gracias !!!
285