Bayesian Statistics With R and BUGS

Download as pdf or txt
Download as pdf or txt
You are on page 1of 143

Introduction to Bayesian Data Analysis using R and WinBUGS

Dr. Pablo E. Verde

Department of Mathematics and Statistics


Masaryk University
Czech Republic
April 2013
pabloemilio.verde@uni-duesseldorf.de

Dr. Pablo E. Verde

Overview of the course

Day 1
Lecture 1: Introduction to Bayesian Inference
Lecture 2: Bayesian analysis for single parameter models
Lecture 3: Prior distributions: univariate

Dr. Pablo E. Verde

Day 2
Lecture 4: Bayesian analysis for multiple parameter models with R
Lecture 5: An introduction to WinBUGS
Lecture 6: Multivariate models with WinBUGS
Day 3
Lecture 7: An introduction to MCMC computations
Lecture 8: Bayesian regression with WinBUGS
Lecture 9: Introduction to Hierarchical Statistical modeling

Dr. Pablo E. Verde

Learning objectives

Understanding of the potential role of Bayesian methods for making inference


about real-world problems

Insight into modern computations techniques used for Bayesian analysis

Learning Bayesian statistical analysis with R and WinBUGS

An interest in using Bayesian methods in your own field of work

Dr. Pablo E. Verde

Style and emphasis


I

Immediately applicable methods rather than latest theory

Attention to real problems: case studies

Implementation in R and WinBUGS (although not a full tutorial)

Focus on statistical modeling rather than running code, checking convergence etc.

Make more emphasis to the complementary aspects of Bayesian Statistics to


Classical Statistics rather than one vs. the other

Dr. Pablo E. Verde

Recommended bibliography
I

The BUGS Book: A Practical Introduction to Bayesian Analysis. David


Lunn; Chris Jackson; Nicky Best; Andrew Thomas; David Spiegelhalter. CRC
Press, October 3, 2012.

Bayesian Data Analysis (Second edition). Andrew Gelman, John Carlin, Hal
Stern and Donald Rubin. 2004 Chapman & Hall/CRC.

Bayesian Computation with R (Second edition). Jim Albert. 2009. Springer


Verlag.

An introduction of Bayesian data analysis with R and BUGS: a simple


worked example. Verde, P.E. Estadistica (2010), 62, pp. 21-44

Dr. Pablo E. Verde

Lecture 1: Introduction to Modern Bayesian Inference

Lecture 1:
Introduction to Modern Bayesian Inference

I shall not assume the truth of Bayes axiom (...) theorems which are
useless for scientific purposes.
-Ronald A. Fisher (1935) The Design of Experiments, page 6.

Dr. Pablo E. Verde

Lecture 1: Introduction to Modern Bayesian Inference

How did it all start?


In 1763, Reverend Thomas Bayes wrote:

PROBLEM.
Given the number of times in which an unknown event has happened and
faild: Requiered the chance that the probability of its happening in a single
trial lies somewhere between any two degrees of probability that can be
named.

In modern language, given r Binomial(, n), what is Pr (1 < < 2 |r , n) ?

Dr. Pablo E. Verde

Lecture 1: Introduction to Modern Bayesian Inference

Example: surgical procedure


I

Suppose a hospital is considering a new high-risk operation

Experience in other hospitals indicates that the risk for each patient is around
10 %

It would be surprising to be less than 3% or more than 20%

We can directly express uncertainty about the patient risk with a probability
distribution

Dr. Pablo E. Verde

4
0

Density(mortality rate)

Lecture 1: Introduction to Modern Bayesian Inference

10

20

30

40

50

Mortality rate (in %)

Probability that mortality risk is greater than 15% is Pr( > 0.15) = 0.17
Dr. Pablo E. Verde

10

Lecture 1: Introduction to Modern Bayesian Inference

Why a direct probability distribution?


I

Tells us what we want: what are plausible values for the parameter of interest?

No P-values: just calculate relevant tail areas

No confidence intervals: just report central area that contains 95% of distribution

Easy to make predictions (see later)

Fits naturally into decision analysis / risk analysis / cost-effectiveness analysis

There is a procedure for adapting the distribution in the light of additional


evidence: i.e. Bayes theorem allows us to learn from experience

Dr. Pablo E. Verde

11

Lecture 1: Introduction to Modern Bayesian Inference

What about disadvantages?


I

Requires the specification of what we thought before new evidence is taken into
account: the prior distribution

Explicit allowance for quantitative subjective judgment in the analysis

Analysis may be more complex than a traditional approach

Computation may be more difficult

Currently no established standards for Bayesian reporting

Dr. Pablo E. Verde

12

Lecture 1: Introduction to Modern Bayesian Inference

Why does Bayesian statistics is popular today?


I

Bayesian methods optimally combine multiple sources of information in a common


model

The computational revolution produced by the rediscovery of Markov chain Monte


Carlo (MCMC) techniques in statistics

Free available software implementation of MCMC (e.g. WinBUGS, JAGS, STAN,


large number of packages in R, etc.)

As a result, we can routinely construct sophisticated statistical models that may


reflect the complexity for phenomena of interest

Dr. Pablo E. Verde

13

Lecture 1: Introduction to Modern Bayesian Inference

Bayes theorem for observable quantities


Let A and B be events; then it is provable from axioms of probability theory that
P(A|B) =

P(B|A)P(A)
P(B)

P(A) is the marginal probability of A, i.e., prior to taking account of the


information in B

P(A|B) is the conditional probability of A given B, i.e., posterior to taking


account of the value of B

P(B|A) is the conditional probability of B given A

P(B) is the marginal probability of B

which is
Sometimes is useful to work with P(B) = P(A)P(B|A) + P(A)P(B|
A),
curiously called extending the conversation (Lindley 2006, pag. 68)

Dr. Pablo E. Verde

14

Lecture 1: Introduction to Modern Bayesian Inference

Example: Problems with statistical significance (Simon, 1994)


I

Suppose that a priory only 10% of clinical trials are truly effective treatments

Assume each trial is carried out with a design with enough sample size such that
= 5% and power 1 = 80%

Question: What is the chance that the treatment is true effective given a significant
test results?
p(H1 |significant results)?

Dr. Pablo E. Verde

15

Lecture 1: Introduction to Modern Bayesian Inference

Let A be the event that H1 is true, then p(H1 ) = 0.1

Let B be the event that the statistical test is significant

We want p(A|B) = p(H1 |significant results)

We have: p(B|A) = p(significant results|H1 ) = 1 = 0.8

We have: p(B|A) = p(significant results|H0 ) = = 0.05

Now, Bayes theorem says


p(H1 |significant results) =

(1 ) 0.1
0.8 0.1
=
= 0.64
(1 ) 0.1 + 0.9
0.8 0.1 + 0.05 0.9

Answer: This says that if truly effective treatments are relatively rare, then a
statistically significant results stands a good chance of being a false positive.
Dr. Pablo E. Verde

16

Lecture 1: Introduction to Modern Bayesian Inference

Bayesian inference for unknown quantities


I

Makes fundamental distinction between:


I

Observable quantities y , i.e., data.

Unknown quantities , that can be statistical parameters, missing data, predicted


values, mismeasured data, indicators of variable selected, etc.

Technically, in the Bayesian framework parameters are treated as values of


random variables.

Differences with classical statistical inference:


I

In Bayesian inference, we make probability statements about model parameters

In the classical framework, parameters are fixed non-random quantities and the
probability statements concern the data

Dr. Pablo E. Verde

17

Lecture 1: Introduction to Modern Bayesian Inference

Bayesian Inference
I

Suppose that we have observed some data y

We want to make inference about unknown quantities : model parameters,


missing data, predicted values, mismeasured data, etc.

The Bayesian analysis starts like a classical statistical analysis by specifying the
sampling model:
p(y |)
this is the likelihood function.

From a Bayesian point of view, is unknown so should have a probability


distribution reflecting our uncertainty about it before seeing the data. We need to
specify a prior distribution
p()

Together they define a full probability model:


p(y , ) = p(y |)p()

Dr. Pablo E. Verde

18

Lecture 1: Introduction to Modern Bayesian Inference

Then we use the Bayes theorem to obtain the conditional probability distribution for
unobserved quantities of interest given the data:
p(|y ) = R

p()p(y |)
p()p(y |)
p()p(y |)d

This is the posterior distribution for ,


posterior likelihood prior.

Dr. Pablo E. Verde

19

Lecture 1: Introduction to Modern Bayesian Inference

Inference with binary data


Example: Inference on proportions using a discrete prior
I

Suppose I have 3 coins in my pocket:


1. A biased coin: p(heads) = 0.25
2. A fair coin: p(heads) = 0.5
3. A biased coin: p(heads) = 0.75

I randomly select one coin, I flip it once and it comes a head.

What is the probability that I have chosen coin number 3 ?

Dr. Pablo E. Verde

20

Lecture 1: Introduction to Modern Bayesian Inference

Inference with binary data

1. Let y = 1 the event that I observed a head


2. Let denote the probability of a head: (0.25, 0.5, 0.75)
3. Prior: p( = 0.25) = p( = 0.5) = p( = 0.75) = 1/3
4. Sampling distribution:
y | Binomial(, 1),
with likelihood
p(y |) = y (1 )(1y ) .

Dr. Pablo E. Verde

21

Lecture 1: Introduction to Modern Bayesian Inference

If we observe a single positive response (y = 1), how is our belief revised?


Coin
1
2
3

0.25
0.50
0.75
P

Prior
p()
0.33
0.33
0.33
1.0

Likelihood
p(y = 1|)
0.25
0.50
0.75
1.50

Likelihood prior
p(y = 1|)p()
0.0825
0.1650
0.2475
0.495

Posterior
p(|y = 1)
0.167
0.333
0.500
1.0

So, observing a head on a single flip of the coin means that there is now a 50%
probability that the chance of heads is 0.75 and only a 16.7% that the chance of
heads is 0.25.

Note that if we normalize the likelihood p(y = 1|) we have exactly the same
results.

Dr. Pablo E. Verde

22

Lecture 1: Introduction to Modern Bayesian Inference

Example: Three coins by simulation


We can approximate the results of the three coins example with the following
computer simulation algorithm:
1. Generate a random index i from (1, 2, 3) with prior probability 1/3 for each
component
2. Evaluate the probability of y = 1 at = (i )
3. Generate a Binomial y with probability
4. Repeat 1 to 3 a large number of times
5. Average and normalize the outcomes of y

Dr. Pablo E. Verde

23

Lecture 1: Introduction to Modern Bayesian Inference

Example: Three coins by simulation in R


>
>
>
>
>
>

coin.pri <- rep(1/3, 3)


# prior for coins 1, 2, 3.
theta
<- c(0.25, 0.5, 0.75) # pr. heads under 1, 2, 3.
sim <- 100000
# number of simulations
y <- rep(0, 3*sim)
dim(y) <- c(sim, 3)
for( i in 1:sim)
{
ind <- sample(c(1,2,3), size = 1, prob = coin.pri, replace = TRUE)
y.new <- rbinom(1, 1, prob = theta[ind])
y[i, ind] <- y.new
}
> # average each class and normalize
> apply(y, 2, mean) / sum( apply(y, 2, mean))
[1] 0.1661 0.3306 0.5034

Dr. Pablo E. Verde

24

Lecture 1: Introduction to Modern Bayesian Inference

Monte Carlo error: How accurate are the empirical estimation by simulation?
I

= Var
(g (X ))
Suppose that we obtain empirical mean E (g (X )) and variance V
based on T simulations.

Then since E is a sample mean based on T independent samples, it has true


/T
sample variance V = Var (g (X ))/T which may be estimated by V

Hence E has estimated standard error


error.

/T , which is known as Monte Carlo


V

> th.post <- apply(y, 2, mean)/sum( apply(y, 2, mean))


> # monte carlo error
> V <- th.post * (1- th.post)/sim
> sqrt(V)
[1] 0.001177 0.001488 0.001581
>
Dr. Pablo E. Verde

25

Lecture 1: Introduction to Modern Bayesian Inference

Posterior predictive distributions


The predictive posterior distribution for a new observation y new is given by
Z
new
p(y
|y ) = p(y new |y , )p(|y )d.
Assuming that past and future observations are conditionally independent given ,
this simplify to
Z
new
p(y
|y ) = p(y new |)p(|y )d.
For the discrete case of , integrals are replaced by sums:
X
p(y new |y ) =
p(y new |i )p(i |y )
i

where the p(i |y ) can be thought of as posterior weights.

Dr. Pablo E. Verde

26

Lecture 1: Introduction to Modern Bayesian Inference

Example: Three coins continue ...


Suppose we want to predict the probability that in the next toss is a head. We have:
X
p(y new = 1|y = 1) =
i , p(i |y = 1)
i

= (0.25 0.167) + (0.50 0.333) + (0.75 0.5)


= 0.5833
By using the previous simulation results in R we have:
> sum(theta * apply(y, 2, mean)/sum( apply(y, 2, mean)))
[1] 0.5843
>

Dr. Pablo E. Verde

27

Lecture 1: Introduction to Modern Bayesian Inference

Sequential learning

Suppose we obtain data y1 and form the posterior p(|y1 ) and then we obtain further
data y2 . The posterior based on y1 , y2 is given by:
p(|y1 , y2 ) p(y2 |) p(|y1 ).

Todays posterior is tomorrows prior!


The resultant posterior is the same as if we have obtained the data y1 , y2 together:
p(|y1 , y2 ) p(y1 , y2 |) p().

Dr. Pablo E. Verde

28

Lecture 1: Introduction to Modern Bayesian Inference

Example: Three coins continue ...


I

Now suppose that after observing y1 = 1 we observe y2 = 1, how is our belief


revised?
Coin
1
2
3

0.25
0.50
0.75
P

Prior
p()
0.167
0.333
0.500
1.0

Likelihood
p(y = 1|)
0.25
0.50
0.75
1.50

Likelihood prior
p(y = 1|)p()
0.042
0.167
0.375
0.583

Posterior
p(|y = 1)
0.071
0.286
0.644
1.0

After observing a second head, there is now a 64.4% probability that the chance
of heads is 0.75 and only a 7.1% that the chance of heads is 0.25.

Dr. Pablo E. Verde

29

Lecture 1: Introduction to Modern Bayesian Inference

A bit of philosophy: Probability for and probability of


I

The prior distribution p(), expresses our uncertainty about before seeing the
data, that could be objective or subjective

The Bayesian inference allows the combination of different types of probabilities

Subjective probability implied a mental construct where probabilities are used to


express our uncertainty. This is why we use a pedantic: probability for an
event...

In the classical setting probabilities are defined in terms of long run frequencies
and are interpreted as physical properties of systems. In this way we use:
probability of ...

In general we follow Bayesian Statistical Modeling, which is dynamic view of


data analysis, which includes model building and model checking as well as
statistical inference

Dr. Pablo E. Verde

30

Lecture 1: Introduction to Modern Bayesian Inference

Summary
Bayesian statistics:
I

Formal combination of external information with data model by the Bayesian rule

Uses of subjective probability to make inductive inference

Straightforward inference by manipulating probabilities

No need of repeated sampling ideas

Specification of prior or external evidence may be difficult or controversial

Prediction and sequential learning is straightforward.

Dr. Pablo E. Verde

31

Lecture 2: Bayesian inference for single parameter

Lecture 2:
Bayesian Inference for Single Parameter Models

Dr. Pablo E. Verde

32

Lecture 2: Bayesian inference for single parameter

Summary
1. Conjugate Analysis for:
I

Binomial model

Normal model known variance

Normal model known mean

Poisson model

2. Using R for:
I

Graphical visualization of posteriors

Direct Monte Carlo Simulation Methods

Calculations of posteriors for functional parameters

Dr. Pablo E. Verde

33

Lecture 2: Bayesian inference for single parameter

Inference of proportions using a continuous prior


Suppose we observe r positive responses out of n patients. Assuming patients are
independent, with common unknown response rate , leads to a binomial likelihood
 
n
p(r |n, ) =
r (1 )nr r (1 )nr
r
We consider the response rate to be a continuous parameter, i.e., we need to give
a continuous prior distribution.
Suppose that before the data is observed we believe all values for are equally likely
(very unrealistic!), then we give Unif(0, 1), i.e., p() = 1.
Posterior is then
p(|r , n) r (1 )nr 1
this has a form of the kernel of a Beta(r + 1, n r + 1).
Dr. Pablo E. Verde

34

Lecture 2: Bayesian inference for single parameter

To represent external evidence that some response rates are more plausible than
others, it is mathematical convenient to use a Beta(a, b) prior distribution for
p() a1 (1 )b1
Combining this with the binomial likelihood gives a posterior distribution
p(|r , n) p(r |, n)p()
r (1 )nr
= r +a1 (1 )nr +b1
Beta(r + a, n r + b).

Dr. Pablo E. Verde

35

Lecture 2: Bayesian inference for single parameter

When the prior and posterior come from the same family of distributions the prior
is said to be conjugate to the likelihood

A Beta(a, b) distribution has


E () = a/(a + b)
var () = ab/[(a + b)2 (a + b + 1)]
Hence the posterior mean is E (|r , n) = (r + a)/(n + a + b)

a and b are equivalent to observing a prior a 1 successes in a + b 2 trials, then


it can be elicited.

With fixed a and b, as r and n increase, E (|r , n) r /n (the MLE).

A Beta(1,1) is equivalent to Uniform(0,1).

Dr. Pablo E. Verde

36

Lecture 2: Bayesian inference for single parameter

Example: Shape of the Beta density function


The Beta(a, b) prior is a flexible distribution. We can explore it shape by using the
following lines in R:
# Beta flexibility
par(mfrow=c(2,2))
curve(dbeta(x, 1, 1), from =0, to=1, lty = 1, lwd=2,
main="Beta(1, 1)", xlab=expression(theta))
curve(dbeta(x, 1/2, 1/2), from = 0, to = 1, lty = 2,
lwd=2, col="red", main="Beta(1/2, 1/2)", xlab=expression(theta))
curve(dbeta(x, 2, 5), from = 0, to = 1, lty = 2, lwd=2,
col="blue", main="Beta(2, 5)", xlab=expression(theta))
curve(dbeta(x, 2, 2), from = 0, to = 1, lty = 2, lwd=2,
col="green", main="Beta(2, 2)", xlab=expression(theta))
par(mfrow=c(1,1))

Dr. Pablo E. Verde

37

Lecture 2: Bayesian inference for single parameter

dbeta(x, 1/2, 1/2)

1.2
1.0
0.8
0.6

dbeta(x, 1, 1)

1.0 1.5 2.0 2.5 3.0

Beta(1/2, 1/2)

1.4

Beta(1, 1)

0.2

0.4

0.6

0.8

1.0

0.2

0.4

0.6

Beta(2, 5)

Beta(2, 2)

1.0

0.8

1.0

0.5

1.0

1.5

0.8

0.0

dbeta(x, 2, 2)
0.0

0.2

0.4

0.6

Dr. Pablo E. Verde

0.0

0.0 0.5 1.0 1.5 2.0 2.5

dbeta(x, 2, 5)

0.0

0.8

1.0

0.0

0.2

0.4

0.6

38

Lecture 2: Bayesian inference for single parameter

Example: drug investigation


I

Consider an early investigation of a new drug

Experience with similar compounds has suggested that response rates between 0.2
and 0.6 could be feasible

Interpret this as a distribution with mean=0.4 and standard deviation 0.1

A Beta(9.2, 13.8) distribution has these properties

Suppose we treat n = 20 volunteers with the compound and observe r = 15


positive responses.

Then we update the prior and the posterior is Beta(15 + 9.2, 5 + 13.8)

Dr. Pablo E. Verde

39

Lecture 2: Bayesian inference for single parameter

R script to perform the analysis:


> par(mfrow=c(3,1)) # graphical output: 3 by 1 panels
# draw a curve for a beta density
> curve(dbeta(x,9.2, 13.8),from=0, to =1,
xlab= "prob of success",main="Prior")
# draw a curve for a binomial density
> curve(dbinom(15, 20,x),from=0, to =1,
col="blue", xlab="prob of sucess", main="Likelihood")
# draw the posterior
> curve(dbeta(x, 24.2, 18.8),from=0, to =1,
col="red", xlab="prob of sucess", main="Posterior")
>

par(mfrow=c(1,1))

Dr. Pablo E. Verde

40

Lecture 2: Bayesian inference for single parameter

0 1 2 3 4

dbeta(x, 9.2, 13.8)

Prior

0.0

0.2

0.4

0.6

0.8

1.0

0.6

0.8

1.0

0.6

0.8

1.0

prob of success

0.20
0.10
0.00

dbinom(15, 20, x)

Likelihood

0.0

0.2

0.4
prob of sucess

4
2
0

dbeta(x, 24.2, 18.8)

Posterior

0.0

0.2

0.4
prob of sucess

Dr. Pablo E. Verde

41

Lecture 2: Bayesian inference for single parameter

Monte Carlo Simulation


I

If we are able to sample values from the posterior p(|r , n), then we can extend
the inferential scope.

For example, one important application of this simulation process is when we need
to estimate the posterior of a functional parameter, e.g., the odds ratio:
= f () =

For simple models we can directly simulate from the posterior density and use this
values to empirically approximate the posterior quantiles.

Dr. Pablo E. Verde

42

Lecture 2: Bayesian inference for single parameter

Example: Posterior for the odds ratio


The posterior of
=

is calculated in R as follows:
> theta.star <- rbeta(20000, 24.2, 18.8)
> odds.star <- theta.star/(1-theta.star)
> quantile(odds.star, prob = c(0.05, 0.5, 0.75, 0.95))
5%
50%
75%
95%
0.7730276 1.2852558 1.5784318 2.1592678
> hist(odds.star, breaks=100, xlab="odds ratio", freq=FALSE, xlim=c(0, 4))
> lines(density(odds.star), lwd =2, lty = 1, col ="red")

Dr. Pablo E. Verde

43

Lecture 2: Bayesian inference for single parameter

0.6
0.0

0.2

0.4

Density

0.8

1.0

Histogram of odds.star

odds ratio

Dr. Pablo E. Verde

44

Lecture 2: Bayesian inference for single parameter

Making predictions for binary data


I

Suppose that we want to make predictions from the model

The predictive posterior distribution for r new (the number of successes in m trials)
follows a Beta-Binomial distribution with density:


(a + r new )(b + m r new )
(a + b)
m
new
p(r
)=
(a)(b) r new
(a + b + m)

In R:
# Beta-binomial density
betabin <- function(r,a,b,m)
{
gamma(a+b)/(gamma(a)*gamma(b)) * choose(m,r) *
gamma(a+r)*gamma(b+m-r)/gamma(a+b+m)
}

Dr. Pablo E. Verde

45

Lecture 2: Bayesian inference for single parameter

Example: drug investigation continue ...


I

Suppose that we are interested in the predictive posterior of the number of


success in the next 40 trials:

Using the betabin function in R we have:


# Beta-binomial distribution of the number of successes x in the next
# 40 trials with mean 22.5 and standard deviation 4.3
x <- 0:40; px <- betabin(0:40, a=24.2,b=18.8,m=40)
plot(x, px, type="h", xlab="number of successes out of 40",
main="Predictive Posterior")

Dr. Pablo E. Verde

46

Lecture 2: Bayesian inference for single parameter

0.00

0.02

0.04

px

0.06

0.08

Predictive Posterior

10

20

30

40

number of successes out of 40

Dr. Pablo E. Verde

47

Lecture 2: Bayesian inference for single parameter

Example: drug investigation continue ...


I

Suppose that we would consider continuing a development program if the drug


managed to achieve at least a further 25 successes out of these 40 future trials

In R we can calculate:
# probability of at least 25 successes out of 40 further trials
> sum(betabin(25:40,24.2,18.8,m=40))
[1] 0.3290134
>

Dr. Pablo E. Verde

48

Lecture 2: Bayesian inference for single parameter

Simulations for predictive data

Instead of using the analytical approximation based on the Beta-Binomial distribution,


we can simulate predicted observations in the following way:
I

We simulate 1 , . . . , B from the posterior Beta(24.2, 18.8)

Then we simulate y from a binomial distribution with rate and n = 40.

We tabulate and normalize predictive frequencies.

Dr. Pablo E. Verde

49

Lecture 2: Bayesian inference for single parameter

Simulations for predictive data


In R notation
> # Simulation of predictive data with a Beta-Binomial model ...
> theta.star <- rbeta(10000, 24.2, 18.8)
> y <- rbinom(10000, 40, theta.star)
> freq.y <- table(y)
> ys <- as.integer(names(freq.y))
> predprob <- freq.y/sum(freq.y)
> plot(ys, predprob, type = "h", xlab = "y",
+
ylab = "Predictive Probability")
>
> # Probability of at least 25 out of future 40 trials.
> sum(predprob[ys>24])
[1] 0.3244

Dr. Pablo E. Verde

50

0.06
0.04
0.00

0.02

Predictive Probability

0.08

Lecture 2: Bayesian inference for single parameter

10

15

20

25

30

35

Dr. Pablo E. Verde

51

Lecture 2: Bayesian inference for single parameter

Bayesian analysis for Normal data

Known variance, unknown mean


Suppose we have a sample of Normal data
yi N(, 2 ),

i = 1, . . . , n

where 2 is known and is unknown. The conjugate prior of is


N(0 , 2 ).
It is convenient to write 2 as 2 /n0 , where n0 represents the effective number of
observations in the prior distribution.

Dr. Pablo E. Verde

52

Lecture 2: Bayesian inference for single parameter

Then the posterior distribution for is given by




2
n0 0 + n
y
p(|y ) = N
,
n0 + n n0 + n
I

Prior variance is based on an implicit sample size n0

As n0 tends to 0, the distribution becomes flatter

Posterior mean is a weighted average of the prior mean 0 and parameter


estimate x , weighted by their precisions, i.e., relative sample sizes.

Posterior variance is based on an implicit sample size n0 and the data sample size
n.

Dr. Pablo E. Verde

53

Lecture 2: Bayesian inference for single parameter

Alternative expressions for the posterior mean n are :


n = w 0 + (1 w )
y

where

n
,
n + n0
n0
= y (
y 0 )
.
n + n0

w=

n0
,
n + n0

n = 0 + (
y 0 )
n

That shows shrinkage towards prior mean.

Dr. Pablo E. Verde

54

Lecture 2: Bayesian inference for single parameter

Prediction
Denoting the posterior mean and variance as n and n2 = 2 /(n0 + n), the predictive
posterior distribution for a new observation y is
Z

p(y |y ) = p(y |y , )p(|y )d


which is generally equal to

p(y |y ) =

p(y |)p(|y )d

which can be shown to give


p(y |y ) N(n , n2 + 2 )
The predictive posterior distribution is centered around the posterior mean n with
variance equal to sum of the posterior variance of plus the data variance.
Dr. Pablo E. Verde

55

Lecture 2: Bayesian inference for single parameter

Bayesian inference for Normal data


Unknown variance, know mean
Suppose we have sample of Normal data
yi N(, 2 ),

i = 1, . . . , n,

where is known and 2 is unknown.


It is convenient to change parameterization to the precision w = 1/ 2 . The conjugate
prior for w is then
w Gamma(, ),
where
p(w ) w 1 exp(w ).
Then 2 is then said to have an inverse-gamma distribution.

Dr. Pablo E. Verde

56

Lecture 2: Bayesian inference for single parameter

The posterior distribution for w takes the form


#
n
X
n
w
(yi )2 .
p(w |, y ) w 1 exp(w ) w 2 exp
2
"

i=1

Collecting terms gives


n
1X

n
p(w |, y ) = Gamma + , +
2
2

!
(yi )2

i=1

Dr. Pablo E. Verde

57

Lecture 2: Bayesian inference for single parameter

Clearly we can think of = n0 /2, where n0 is the effective prior sample size.

P
Since ni=1 (yi )2 /n estimate 2 = 1/w , then we interpret 2 as representing
n0 prior estimate of02 .

Alternative, we can write our conjugate prior as




n0 n0 02
w Gamma
,
,
2
2
which can be seen as a scale 2n0 distribution. This is useful when assessing prior
distributions for sample variances.

Dr. Pablo E. Verde

58

Lecture 2: Bayesian inference for single parameter

Bayesian inference with count data

Suppose we have an independent sample of counts y1 . . . , yn which can be assumed to


follow a Poisson distribution with unknown mean ti , where is the rate per unit t :
Y (ti )yi exp(ti )
p(y |) =
yi !
i

The kernel of the Poisson likelihood as a function of has the same form as a
Gamma(a, b) prior for :
p() a1 exp(b).

Dr. Pablo E. Verde

59

Lecture 2: Bayesian inference for single parameter

This implies the following posterior


p(|y ) p()p(y |)

(a1) b

n
Y

e ti yi

i=1
a+Yn 1 (b+Tn )

= Gamma(a + Yn , b + Tn ).

where Yn =

Dr. Pablo E. Verde

Pn

i=1 yi and Tn =

Pn

i=1 ti .

60

Lecture 2: Bayesian inference for single parameter

The posterior mean is:


a + Yn
Yn
E (|y ) =
=
b + Tn
Tn

n
n+b

a
+
b


1

n
n+b


.

The posterior mean is a compromise between the prior mean a/b and the MLE

Yn
Tn .

Thus b can be interpreted as an effective exposure and a/b as a prior estimate of the
Poisson mean.

Dr. Pablo E. Verde

61

Lecture 2: Bayesian inference for single parameter

Example: Annual number of cases of haemolytic uraemic


syndrome Henderson and Matthews, (1993)

Annual numbers of cases were available from two a specialist center at Birmingham,
from 1970 to 1988. For analysis we consider observed values from the first decade.
year
cases, xi

1970
1

1971
5

1972
3

1973
2

1974
1

1975
0

1976
0

1977
2

1978
1

1979
1

Because we are interested in counts of disease over time, a simple model is a Poisson
process.
yi | Poisson()

Dr. Pablo E. Verde

i = 1, . . . , 10.

62

Lecture 2: Bayesian inference for single parameter

Similar data is collected by another centre, given a mean rate of 2.3 with s.d. 2.79.
I

with prior mean a/b = 2.3

prior sd

Solving for a and b, this information can be translated to a prior Gamma(0.679,


0.295) distribution

a/b = 2.79

Then
X
p(|y ) = Gamma(
xi + 0.679, 10 + 0.295) = Gamma(16.679, 10.295)
i

E (|y ) =

16.679
= 1.620;
10.295

sd(|y ) =

16.679
= 0.396
10.295

Dr. Pablo E. Verde

63

0.6
0.4

prior
posterior

0.0

0.2

dgamma(x, 0.679, 0.295)

0.8

1.0

Lecture 2: Bayesian inference for single parameter

Dr. Pablo E. Verde

64

Lecture 2: Bayesian inference for single parameter

The predictive posterior distribution for a new count y is


y |y Negative-Binomial(a + n
y , b + n)
If we ask what is the probability to observe 7 or more cases in the next year we have
> sum(dnbinom(6:17, prob=10.295/(1+10.295), size=16.679))
[1] 0.0096
>
Note: the observed values for the following years were 7,11,4,7,10,... Indicating a
possible structural break.
We can visualize the density by
plot(dnbinom(0:10, prob=p, size=16.679, type="h",
lwd=2, col="red", xlab= "counts")

Dr. Pablo E. Verde

65

0.25
0.20
0.15
0.10
0.05
0.00

dnbinom(0:10, prob = p, size = 16.679)

0.30

Lecture 2: Bayesian inference for single parameter

10

counts

Dr. Pablo E. Verde

66

Lecture 2: Bayesian inference for single parameter

Some comments
For all these examples, we see that
I

the posterior mean is a compromise between the prior mean and the MLE

the posterior s.d. is less that each of the prior s.d. and the s.e. (MLE)

As n ,
I

the posterior mean the MLE

the posterior s.d. the s.e. (MLE)

the posterior does not depend on the prior.

These observations are generally true, when the MLE exists and is unique.

Dr. Pablo E. Verde

67

Lecture 2: Bayesian inference for single parameter

Priors
I

When the posterior is in the same family as the prior then we have what is known
as conjugacy.

Conjugate models have the advantage that:


I

Prior parameters can usually be interpreted as implied prior sample size: n0

They involve simple mathematical models, examples include:

Distribution of y
Binomial
Poisson
Exponential
Normal
Normal
I

Parameter
Prob. of success
Mean
Reciprocal of mean
Mean (variance known)
Variance (mean known)

conjugate prior
Beta
Gamma
Gamma
Normal
Inverse Gamma

Unfortunately conjugate priors only exists for small catalog of likelihoods.

Dr. Pablo E. Verde

68

Lecture 2: Bayesian inference for single parameter

Practical
Exercise: Conjugate inference for a binomial experiment
Drug investigation example from this Lecture
We treat n = 20 volunteers with a new compound and observe r = 15 positive
responses. We use as prior Beta(9.2, 13.8):
1. What is the posterior mean and median for the response rate?
2. What are the 2.5th and 97.5th percentiles of the posterior?
3. What is the probability that the true response rate is greater than 0.6?
4. How is this value affected if a uniform prior is adopted?
5. Using the original Beta(9.2, 13.8) prior, suppose 40 more patients were entered
into the study. What is the chance that at least 25 of them respond positively?
Dr. Pablo E. Verde

69

Lecture 2: Bayesian inference for single parameter

Solution
# 1) Posteriors mean and median
# The posterior mean is: (r+a) / (n + a+b)
> (15 + 9.2) /( 20 + 9.2 + 13.8)
[1] 0.5627907
# the posterior median is
> qbeta(0.5, 24.2, 18.8 )
[1] 0.5637731
>
# 2) Posterior percentiles:
> qbeta(0.025, 24.2, 18.8 )
[1] 0.4142266
> qbeta(0.975, 24.2, 18.8 )
[1] 0.7058181
>
Dr. Pablo E. Verde

# 2.5%
# 97.5%

70

Lecture 2: Bayesian inference for single parameter

# 3) Posterior prob that the rate is greter than 0.6


> 1 - pbeta(0.6, 24.2, 18.8 )
[1] 0.3156323
# 4) Uniform prior is beta(1,1)

then the posterior is beta( r+1, n-r+1)

> 1 - pbeta(0.6, 15+1 ,20-15+1)


[1] 0.9042598
# 5) Posterior probability of at least 25 successes out of 40 further trials
> sum(betabin(25:40,24.2,18.8,m=40))
[1] 0.3290134

Dr. Pablo E. Verde

71

Lecture 3: Prior distributions

Lecture 3:
Priors Distributions

Dr. Pablo E. Verde

72

Lecture 3: Prior distributions

Summary
I

Misunderstandings about prior distributions

Non-informative Priors and Jeffreys invariance priors

Sensitivity analysis and making priors predictions

Adjustment of priors based on historical data and judgement

Mixture of priors

Dr. Pablo E. Verde

73

Lecture 3: Prior distributions

Misunderstandings about prior distributions


It is worth pointing out some misunderstanding regarding prior distributions:
I

The name prior suggests a temporal relationship, however, this is


misleading. The prior distribution models the uncertainty given by the external
evidence. Cox (1999)

The prior is not necessarily unique! In a recent article Lambert et. al. (2005)
analyze the use of 13 different priors for the between study variance parameter in
random-effects meta-analysis.

There is no such thing as thecorrect prior. Bayesian analysis is regarded as


transforming prior into posterior opinion, rather than producing the posterior
distribution.

The prior may not be completely specified. In Empirical Bayes inference priors
have unknown parameters that are estimated from the data.

Dr. Pablo E. Verde

74

Lecture 3: Prior distributions

Priors can be overparametrized. Sometimes we intentionally overparametrized


the priors in order to accelerate convergence of simulation methods, see Gelman,
Carlin, Stern and Rubin (2004) Chapter 6.

Inference may rely only on priors. There are situations where no further data
are available to combine with our priors or there is no intention to update the
priors. This is the typical case of risk analysis, sample size determination in
experiments, simulation of complex process , etc. In these analytical scenarios
priors are usually used to simulate hypothetical data and we refer to that prior
predictive analysis.

Finally, priors are not necessarily important! In many scientific applications, as


the amount of data increases, the prior is overwhelmed by the likelihood and the
influence of the prior disappears, see Box and Tiao (1973) (pag. 20-25).

Dr. Pablo E. Verde

75

Lecture 3: Prior distributions

Non-informative Priors: Classical Bayesian Perspective

We may be interested to introduce an initial state of ignorance in our Bayesian


analysis.
But representing ignorance raises formidable difficulties!
There has been a long and complex search for various non-informative,
reference or objective priors during the last century. A sort of off-the-shelf
objective prior that can be applied in all circumstances.
The synthesis of this search is that those magic priors do not exists, although useful
guidance exists (Berger, 2006).

Dr. Pablo E. Verde

76

Lecture 3: Prior distributions

Problems with uniform priors for continuous parameters


Example: uniform prior on proportions
Let be the chance that a bias coin comes down heads, we assume
Uniform(0, 1).
Let = 2 the chance that it coming down heads in both of the next 2 throws.
Now, the density of is
1
p() = ,
2
which corresponds to a Beta(0.5,1) distribution and is certainly not uniform!

Dr. Pablo E. Verde

77

Lecture 3: Prior distributions

Jeffreys invariance priors


Consider a 1-to-1 transformation of : = g ()
Transformation of variables: prior p() is equivalent to prior on of p() = p() |

d
d

Jeffreys proposed defining a non-informative prior for as


p() I ()1/2
where I () is Fisher information for

I () = Ex|

Dr. Pablo E. Verde

"

 #
2 log p(X |)
log p(X |) 2
= Ex|
.
2

78

Lecture 3: Prior distributions

Non-informative priors for proportions


Data model is Binomial: we consider r successes from n trials
r | Binomial(, n)
we have
log p(x|) = r log() + (n r ) log(1 ) + C
then
I () =

n
.
(1 )

So the Jeffreys prior is


p() ((1 ))1/2 ,
which is a Beta(1/2, 1/2).

Dr. Pablo E. Verde

79

Lecture 3: Prior distributions

Non-informative priors for location parameters


A location parameters is such that p(|y ) is a function of (y ) and so the
distribution of (y ) is independent of .
Example: data model is Normal with unknown mean and known variance v
x1 , x2 , . . . , xn | Normal(, v )
then we have
log p(x|) =

X (xi )2
+C
2v

with
I () =

n
.
v

So the Jeffreys prior is


p() 1
which is the Uniform distribution for .
Dr. Pablo E. Verde

80

Lecture 3: Prior distributions

Non-informative priors for scale parameters


A scale parameters is such that p(y |) is a function of 1/f (y /) and so the
distribution of (y /) is independent of .
Example: data model is Normal with known mean m and unknown variance
x1 , x2 , . . . , xn | Normal(m, )
then we have
log p(x|) = n/2 log()
where s =

P
(xi m)2 , then
I () =

s
,
2

n
.
22

So the Jeffreys prior on variance is


p()

This Jeffreys improper prior is approximated by a Gamma(, ) with  0.


Dr. Pablo E. Verde

81

Lecture 3: Prior distributions

Priors for counts and rates

Data model is Poisson: x| Poisson(), then we have


log p(x|) = + x log + C
with
I () = 1/.
So the Jeffreys prior is
p() 1/2 .
This improper prior is approximated by a Gamma distribution with = 1/2 and 0.

Dr. Pablo E. Verde

82

Lecture 3: Prior distributions

Comments Jeffreys rule

They are invariant, whatever the scale we choose to measure the unknown
parameter, the same prior results when the scale is transformed to any particular
scale

Some inconsistencies associated with Jeffreys priors have been discussed:


I

Applying this rule to the normal case with both mean and variance parameters
unknown does not lead to the same prior as applying separately the rule for the mean
and the variance and assuming a priori independence between these parameters.

Although Jeffreys rule is suggestive, it cannot be applied blindly. It should be


thought of as guideline to consider particularly if there is no other obvious way of
finding a prior distribution.

Dr. Pablo E. Verde

83

Lecture 3: Prior distributions

Predictive Prior Analysis


In practice it is not necessary to adopt a full Bayesian approach. Sometimes is very
useful to use Bayesian methods for some analyzes and classical approaches for others.
Example: predictive distribution of the power in sample size calculations
I

a randomize trial is planned with n patients in each of two arms.

the response within each treatment arm is assumed to have between-patient


standard deviation .

the treatment estimate = y1 y2 is approximately distributed as


Normal(, 2 2 /n).

a trial designed to have two-sided Type I error and Type II error in detecting
a true difference of in mean response between the groups will require a sample
size per group of
2 2
n = 2 (z1 z1/2 )2 ,

Dr. Pablo E. Verde

84

Lecture 3: Prior distributions

Alternatively, for fixed n, the power of this experiment is


!
r
n2
Power =
z1/2 .
2 2

If we assume / = 0.5, = 0.05 and = 0.2, we have z1 = 0.84,


z1/2 = 1.96, then the power of the trial is 80% and n = 63 in each trial arm.

However, we accept uncertainty about and and we wish to include this feature
into the sample size and power calculations.

We assume from previous studies that it is reasonable that


N(0.5, 1)

and N(1, 0.32 )I (0, ).

Dr. Pablo E. Verde

85

Lecture 3: Prior distributions

Then
1. Simulate values N(0.5, 1) and N(1, 0.32 ) (subject to the constrain of
being positive).
2. Substitute them in the formulae and generate n and Power .
3. Use the histogram of n and Power as their corresponding predictive distribution.
4. In R we have
## predictive prior distribution for sample size and power
set.seed(123)
theta <- rnorm(10000, 0.5, 0.1)
sigma <- rnorm(10000, 1, 0.3)
sigma <- ifelse(sigma <0, -1*sigma, sigma)
n <- 2*sigma^2 /(theta^2)*(0.84 + 1.96)^2
pow <- pnorm( sqrt( 63/2) * theta /sigma - 1.96)
par(mfrow=c(1,2))
hist(n, xlim = c(0, 400), breaks=50)
hist(pow)
par(mfrow=c(1,1))
Dr. Pablo E. Verde

86

Lecture 3: Prior distributions

Histogram of pow

500

500

1000

1500

Frequency

1000

Frequency

1500

2000

2500

2000

Histogram of n

100

200

300

400

0.2

0.4

0.6

0.8

1.0

pow

Dr. Pablo E. Verde

87

Lecture 3: Prior distributions

> round(quantile(n),2)
0%
25%
50%
0.00
37.09
62.03
> round(quantile(pow),2)
0% 25% 50% 75% 100%
0.09 0.61 0.81 0.95 1.00
> sum(pow<0.7)/10000
[1] 0.3645
>

75%
100%
99.45 1334.07

It is clear that there is huge uncertainty to the appropriate sample size.


For n=63 the median power is 81% and a trial of 63 patients per group could be
seriously underpowered. There is a 37% chance that the power is less than 70%.

Dr. Pablo E. Verde

88

Lecture 3: Prior distributions

Mixture of priors for proportions


I

We may want to express a more complex prior opinion that can not be
encapsulated by a beta distribution

A prior which is a mixture of beta distributions


p() = qp1 () + (1 q)p2 ()
where pi = Beta(ai , bi ).

Now if we observe r successes out of n trials,the posterior is


p(|r , n) = q p1 (|r , n) + (1 q )p2 (|r , n)
where
pi (|r , n) pi ()p(r |, n)
qp1 (r |n)
q =
qp1 (r |n) + (1 q)p2 (r |n)

Dr. Pablo E. Verde

89

Lecture 3: Prior distributions

pi (r |n) is a beta-binomial predictive probability or r successes in n trials assuming


has distribution Beta(ai , bi ).

The posterior is a mixture of beta posteriors, with mixture weights adapted to


support prior that provides best prediction for the observed data.

In R:
# mixture of betas
mixbeta <- function(x,r,a1,b1,a2,b2,q,n)
{
qstar <- q*betabin(r,a1,b1,n)/
(q*betabin(r,a1,b1,n)+(1-q)*betabin(r,a2,b2,n))
p1 <- dbeta(x,a1+r,n-r+b1)
p2 <- dbeta(x,a2+r,n-r+b2)
posterior <- qstar*p1 + (1-qstar)*p2
}

Dr. Pablo E. Verde

90

Lecture 3: Prior distributions

Example: drug investigation continue ...


I

We want to combine:
I

an informative Beta(9.2, 13.8)

a non-informative Beta(1, 1)

we give 80 % of prior weight to the informative prior

Suppose we treat n = 20 volunteers with the compound and observe r = 15


positive responses.

Dr. Pablo E. Verde

91

Lecture 3: Prior distributions

We can visualize this analysis in R as follows:


par(mfrow=c(2,1))
# informative beta prior
curve(dbeta(x,9.2,13.8),from=0, to=1,col="red",
xlab="probability of response",main="priors")
# mixture beta prior with 80% informative and 20% flat
curve(0.8*dbeta(x, 9.2, 13.8)+0.2*dbeta(x, 1, 1),from=0,
to=1,col="blue",add=T)
# beta posterior
curve(dbeta(x,24.2,18.8),from=0,to=1,col="red",
xlab="probability of response", main="posteriors")
# posterior from a mixture prior
curve(mixbeta(x, r=15, a1=9.2, b1=13.8, a2=1, b2=1, q=.8, 20),
from=0,to=1, col="blue",add=T)
par(mfrow=c(1,1))

Dr. Pablo E. Verde

92

Lecture 3: Prior distributions

3
2
1
0

dbeta(x, 9.2, 13.8)

priors

0.0

0.2

0.4

0.6

0.8

1.0

0.8

1.0

probability of response

4
3
2
0

dbeta(x, 24.2, 18.8)

posteriors

0.0

0.2

0.4

0.6

probability of response

Dr. Pablo E. Verde

93

Lecture 3: Prior distributions

Further topics on priors:


I

Adjustment of priors based on historical data and judgement


1. Power priors (Ibrahim and Shen, 2000)
2. Bias modelling

Hirearchical priors for large dimensional problems (e.g regression with more
predictors than observations)

Summary:
I

The need for priors distributions should not be an embarrassment

It is reasonamble that the prior should influence the analysis, as long as the
influence is recognized and justified

Importance of transparency and sensitivity analysis

Dr. Pablo E. Verde

94

Lecture 3: Prior distributions

Practical

Exercise: Conjugate mixture priors inference for a binomial experiment


I

Suppose that most drugs (95%) are assumed to come from the stated Beta(9.2,
13.8) prior, but there is a small chance that the drug might be a winner.

Winners are assumed to have a prior distribution with mean 0.8 and standard
deviation 0.1.

Dr. Pablo E. Verde

95

Lecture 3: Prior distributions

1. What Beta distribution might represent the winners prior? ( Hint: Use the
formulas of the mean and the variance for a Beta(a, b) distribution from lecture 2.
This can be re-arranged to
a = m(m(1 m)/v 1)
and
b = (1 m)(m(1 m)/v 1),
where m and v are the mean and the variance of a Beta(a,b) distribution).
2. Enter this mixture prior model into the R code for mixtures and plot it.
3. Calculate the posterior mixture model based on 15 success out of 20 trials and
plot it.
4. What is the chance that the drug is a winner?

Dr. Pablo E. Verde

96

Lecture 3: Prior distributions

Solution
# 1)
# to workout the a and b for the winner drug prior use the realtionship
# of the mean and variance of the beta distribution and its parameters
> m <- 0.8
> v <- 0.1^2
> a <- m*(m*(1-m)/v -1)
> a
[1] 12
> b <- (1-m)*(m*(1-m)/v -1)
> b
[1] 3
# 2) plot for the mixture prior
curve(dbeta(x, 12, 3)*0.05 + dbeta(x, 9.2, 13.8)*0.95, from = 0, to =1,
main = "prior", xlab = "rate")

Dr. Pablo E. Verde

97

Lecture 3: Prior distributions

# 3) plot for the mixture posterior with 15 success out of 20 trials


# mixture posterior of betas
mixbeta <- function(x,r,a1,b1,a2,b2,q,n)
{
qstar <- q*betabin(r,a1,b1,n)/(q*betabin(r,a1,b1,n)+
(1-q)*betabin(r,a2,b2,n))
p1 <- dbeta(x,a1+r,n-r+b1)
p2 <- dbeta(x,a2+r,n-r+b2)
posterior <- qstar*p1 + (1-qstar)*p2
}
# then the mixture beta has parameters:
# a = 12, b = 3 with probability 5%
# a = 9.2, b = 13.8 with probabilty 95%
# r = 15, n = 20
curve(mixbeta(x, r = 15, a1 = 12, b1 = 3, a2 = 9.2, b2 = 13.8,
q = 0.05, n = 20), from = 0, to = 1, main = "posterior", xlab = "rate")
Dr. Pablo E. Verde

98

Lecture 3: Prior distributions

# 4) The probabilty that the drug is a winner is given by the posterior


# weight to belong to the winner class
qstar <- 0.05*betabin(15, 12 , 3 ,20 )/(0.05*betabin(15, 12 ,3 ,20)
+(1-0.05)*betabin(15, 9.2, 13.8, 20 ))
> qstar
[1] 0.3951545

Dr. Pablo E. Verde

99

Lecture 4 - Multiparameters with R

Lecture 4:
Introduction to Multiparameter Models

Dr. Pablo E. Verde

100

Lecture 4 - Multiparameters with R

Summary

Introduction to multiparameter inference

Normal model with unknown mean and variance: standard non-informative priors

Using R for predictive model checking

Multinomial Model conjugate analysis and its application in R

Comparing classical and Bayesian multiparameter models

Dr. Pablo E. Verde

101

Lecture 4 - Multiparameters with R

Introduction
I

The basic ideas is similar to one parameter models:


I

We have a model for the observed data which defines a likelihood p(y |) on the
vector parameter

We specify a joint prior distribution p() for the possible values of

We use Bayes rule to obtain the posterior of ,


p(|y ) p() p(y |)

Problems:

Dr. Pablo E. Verde

In many applications is difficut to specify priors on multivariate

Computations could be very difficult, they involve multivariate integration

102

Lecture 4 - Multiparameters with R

Example: Normal with unknown mean and variance 2


I

In this example we use marathontimes in the R package LearnBayes

The obervations are the mean times (in minutes) for men running Marathon with
age classes between 20 to 29 years old:
> library(LearnBayes)
> data(marathontimes)
> marathontimes$time
[1] 182 201 221 234 237 251 261 266 267 273 286 291 292 ...
>

Dr. Pablo E. Verde

103

Lecture 4 - Multiparameters with R

We assume a Normal model for the data given the mean and the variance 2 :
y1 , . . . , y20 |, 2 N(, 2 )

We use a non-informative Jeffreys prior assuming independence between location


and scale:
1
p(, ) ,

Then posterior density for the mean and the variance p(, |y ) is called the
Normal-2 distribution

Dr. Pablo E. Verde

This posterior delivers same results as classical analysis: E (|y ) = y , E ( 2 |y ) = S 2

The posterior p(|y ) is proportional to a t-distribution with n 1 df, and so on

Direct simulation from these posteriors is straghtforwared

104

Lecture 4 - Multiparameters with R

The Normal-2 posterior is implemented in the function normchis2post in


LearnBayes, which computes the logarithm of the joint posterior density of
(, 2 )

We can visualize the % levels contourns with the function mycontour. The
arguments include the log-density to plot, the rectangular area (xlo , xhi , ylo , yhi )
and the data:
> mycontour(normchi2post, c(220,330,500,9000), time,
xlab = "mean", ylab = "var")

We can simulate directly form the marginal posteriors of and :


> SS <- sum((time - mean(time))^2)
> n <- length(time)
> sigma2 <- SS/rchisq(1000, n - 1)
> mu <- rnorm(1000, mean = mean(time), sd = sqrt(sigma2)/sqrt(n))
> points(mu, sigma2, col="blue")
> quantile(mu, c(0.025, 0.975))
# 95% posterior interval for mu
2.5%
97.5%
253.1121 301.1133
Dr. Pablo E. Verde
105
I

Lecture 4 - Multiparameters with R

8000

4.6


2.3

4000

variance

6000

2000

6.9

220

240

260

280

300

320

mean

Dr. Pablo E. Verde

106

Lecture 4 - Multiparameters with R

Predictive model checking


Is the statistical model consistent with the data ?
We answer this question by simulating predictive y values from the model and
comparing some data features with the predictive data.
We start by looking at some histograms between the original data and the simulated
data with the same sample size.
y.star <- rnorm(1000, mean = mu, sd =sqrt(sigma2)) #predictive data
par(mfrow=c(3,4))
hist(time, breaks=10, xlim =c(150, 400), main="Original Data", col="green")
for(i in 1:11){
y.sim <- sample(y.star, 20)
hist(y.sim,breaks=10,xlim=c(150, 400),main="Simulated Data",col="blue")
}

Dr. Pablo E. Verde

107

Lecture 4 - Multiparameters with R

Frequency

3
2

Frequency

3
2

Frequency

250

300

350

400

1
150

200

250

300

350

400

0
150

200

250

300

350

400

150

250

300
y.sim

Simulated Data

Simulated Data

Simulated Data

Simulated Data

400

350

400

350

400

3
2

Frequency

3
2

Frequency

3
2

250

300

350

400

150

200

250

300

350

400

1
0
150

200

250

300

350

400

150

200

250

300

y.sim

y.sim

Simulated Data

Simulated Data

Simulated Data

Simulated Data

150

200

250

300
y.sim

Dr. Pablo E. Verde

350

400

3
2

Frequency

Frequency

1
150

200

250

300
y.sim

350

400

Frequency

3
2

Frequency

y.sim

y.sim

200

150

Frequency

350

y.sim

y.sim

Frequency

200

time

200

150

Frequency

Simulated Data

Simulated Data

Simulated Data

Original Data

150

200

250

300
y.sim

350

400

150

200

250

300
y.sim

108

Lecture 4 - Multiparameters with R

Predictive model checking


We define the following features between the observed data y and the predictive data
y :
T1 = min(y ), T2 = max(y ), T 3 = q75 (y ) q25 (y )
and for asymmetry

T4 = |y(18)
| |y(2)
|

where the 18th and 2sd order statistics approximate the 90% and 10% respectively.
These measures are compared with the corresponding values based on the observed
data:
T1 = min(y ), T2 = max(y ), T 3 = q75 (y ) q25 (y )
and
T4 = |y(18) | |y(2) |.

Dr. Pablo E. Verde

109

Lecture 4 - Multiparameters with R

In R we have:
# Analysis of the minimum, maximum, variability and asymmetry
min.y <- max.y <- asy1 <- asy2 <- inter.q <- rep(0,1000)
time <- sort(time)
for (b in 1:1000){
y.sim <- sample(y.star, 20)
mu.star <- sample(mu, 20)
min.y[b] <- min(y.sim)
max.y[b] <- max(y.sim)
y.sim <- sort(y.sim)
asy1[b] <- abs(y.sim[18] - mean(mu.star)) - abs(y.sim[2] - mean(mu.star))
asy2[b] <- abs(time[18] - mean(mu.star)) - abs(time[2] - mean(mu.star))
inter.q[b] <- quantile(y.sim, prob=0.75) - quantile(y.sim, prob=0.25)
}

Dr. Pablo E. Verde

110

Lecture 4 - Multiparameters with R

To display this quantities


par(mfrow =c(2,2))
hist(min.y, breaks = 50, col="blue", main = "Minimum y*")
abline(v=min(time), lwd=3, lty=2)
hist(max.y, breaks = 50, col="red", main = "Maximum y*")
abline(v=max(time), lwd=3, lty=2)
hist(inter.q, breaks = 50, col="magenta", main = "Variability y*")
abline(v=quantile(time,prob=0.75)-quantile(time,prob=0.25), lwd=3,lty=2)
plot(asy1, asy2, main ="Asymmetry", xlab="Asymmetry predicted data",
ylab ="Asymmetry original data")
abline(a=0, b=1, lwd=2)
par(mfrow=c(1,1))
This analysis shows that the data and the model are compatibles and deviations can be
easily explained by sampling variation.

Dr. Pablo E. Verde

111

Lecture 4 - Multiparameters with R

Maximum y*

30
10

20

Frequency

60
40

20

Frequency

80

40

100

Minimum y*

150

200

250

300

320

340

360

380

min.y

max.y

Variance y*

Asymmetry
15

100

10

420

40

400

5
0
5

15

1000

2000

3000

4000
variance

Dr. Pablo E. Verde

5000

6000

7000

10

20
10

Frequency

30

Asymmetry original data

150

100

50

50

Asymmetry predicted data

112

Lecture 4 - Multiparameters with R

Multinomial distribution with unknown cell probabilities

In this model we observe n independent trials each of which has p possible outcomes
with associated probabilities
= (1 , . . . , p ).
Letting y = (y1 , . . . , yp ) be the number of times each outcome is observe we have
y | Multinomial(n, ),
with likelihood
P
( yi )! Y yi
i ,
p(y |) = Q
yi !

yi = n,

i = 1.

Dr. Pablo E. Verde

113

Lecture 4 - Multiparameters with R

The kernel of this likelihood is proportional to a Dirichlet distribution. If a-prior we


assume
Dirichlet(a1 , . . . , ap ),

ai > 0,

then
p()

ai 1

and the posterior for is


p(|y )

yi +ai 1

so taking i = ai + yi , then the posterior for is


p(|y ) = Dirichlet(1 , . . . , p ).

Dr. Pablo E. Verde

114

Lecture 4 - Multiparameters with R

Some comments
P

The prior distribution is equivalent to a likelihood resulting from


observations, with ai observations in the ith category.

A uniform density is obtained by setting ai = 1 for all i. This distribution assigns


equal probability to each i .

In general it is very difficult to work with this distribution in practice. Proportions


are in general not independent and modeling prior information in this context it is
very difficult.

We are going to use for contingency tables a surrogate Poisson model for the
multinomial distribution.

Moments and properties for the Dirichlet distribution are in Appendix A of


Gelman et.al.

Dr. Pablo E. Verde

ai

115

Lecture 4 - Multiparameters with R

Example: A model for a two by two contingency table

Death
No death

Intervention
New
1,1
2,1

Control
1,2
2,2
N

Data model:
p(y |)

2 Y
2
Y

i,ji,j

j=1 i=1

Prior model:
p()

2 Y
2
Y

a 1

i,ji,j

j=1 i=1

Posterior model:
p(|y )

2 Y
2
Y

y +ai,j 1

i,ji,j

j=1 i=1
Dr. Pablo E. Verde

116

Lecture 4 - Multiparameters with R

We are interested in make inference for the Odds ratio:


=

1,1 2,2
.
1,2 2,1

We use direct simulation methods:


I

Simulate a large number of values for the vector from its posterior.

For each simulated value calculate .

Inference for is based on the histogram of .

Dr. Pablo E. Verde

117

Lecture 4 - Multiparameters with R

Example: GREAT trial, Spiegelhalter et.al. pag 69.


Intervention: Thrombolytic therapy after myocardial infarction, given at home by
general practitioners.
Aim of study: to compare a new drug treatment to be given at home as soon as
possible after a myocardial infarction and placebo.
Outcome measure: Thirty-day mortality rate under each treatment, with the benefit
of the new treatment measured by the odds ratio, i.e., the ratio of the odds of death
following the new treatment to the odds of death on the conventional: OR < 1
therefore favors the new treatment.
Prospective Bayesian analysis: NO. It was carried out after the trial reported its
results.

Dr. Pablo E. Verde

118

Lecture 4 - Multiparameters with R

Example: GREAT trial continue

Death
No death

Intervention
New
13
150
163

Control
23
125
148

36
275
311

We use a Dirichlet prior with parameters a1,1 = a1,2 = a2,1 = a2,2 = 1, which
corresponds to a uniform distribution for (1,1 , 1,2 , 2,1 , 2,2 ).

Dr. Pablo E. Verde

119

Lecture 4 - Multiparameters with R

We can simulate from a Dirichlet distributions with the function rdirichlet() from
the package LearnBayes:
> library(LearnBayes)
> draws <- rdirichlet(10000, c(13,23,150,125) )
> odds <-draws[,1]*draws[,4]/(draws[,2]*draws[,3])
> hist(odds, breaks = 100, xlab="Odds Ratio", freq = FALSE)
> lines(density(odds), lty =2, lwd=2, col ="blue")
> abline(v=quantile(odds, prob=c(0.025, 0.5, 0.975)), lty=3, col="red")
>
> quantile(odds, prob=c(0.025, 0.5, 0.975))
2.5%
50%
97.5%
0.2157422 0.4649921 0.9698272

Dr. Pablo E. Verde

120

Lecture 4 - Multiparameters with R

1.5

2.0

2.5

Histogram of odds

0.0

0.5

1.0

Density

prior odds
posterior odds

0.5

1.0

1.5

Odds Ratio

Dr. Pablo E. Verde

121

Lecture 4 - Multiparameters with R

Now suppose that we want to calculate the p-value of


H0 : 1

vs.

H1 : < 1

then,
> sum(odds > 1)/10000
[1] 0.0187
It is interesting to compare this results with the exact Fisher test:
>

fisher.test(matrix(c(13, 23, 150, 125), nrow=2, byrow=TRUE),


alternative="less")
Fishers Exact Test for Count Data
p-value = 0.02817
Thus, the uniform prior on (1,1 , 1,2 , 2,1 , 2,2 ) does not corresponds to a standard
statistical analysis.

Dr. Pablo E. Verde

122

Lecture 4 - Multiparameters with R

Now, if we work with a prior with parameters a1,1 = 2, a1,2 = 1, a2,1 = 1 and a2,2 = 2,
with density
p(1,1 , 1,2 , 2,1 , 2,2 ) 1,1 2,2
we get the following results
> # Fisher exact test
> param <- c(2,1,1,2) # parameter of the dirichlet
> draws <- rdirichlet(10000, c(13,23,150,125)+param )
> odds <-draws[,1]*draws[,4]/(draws[,2]*draws[,3])
> sum(odds>1)/10000
[1] 0.0277
This shows (empirically) that the Fisher test is NOT based in a non-informative prior.
Some weakly information of association is implied in the test.
However, there is a difference in interpretation between the Bayesian and sampling
theory results.
Dr. Pablo E. Verde

123

Lecture 4 - Multiparameters with R

Dirichlet with a11=2, a12=1, a21=1, a22=2

f(the
theta

theta11

ta11,
22)
theta22

Dr. Pablo E. Verde

124

Lecture 4 - Multiparameters with R

Practical
Exercise: Normal data with unknown mean and variance
I

Review the Bayesian analysis presented in this lecture

Generate predictive posterior data and check model fitness

Now, change the first observation of the marathon times and generate an outlier
as follows:
> library(LearnBayes)
> data(marathontimes)
> marathontimes$time
[1] 182 201 221 234 237 251 261 266 267 273 286 291 292 ...
>
# generate an outlier in the first observation:
data1 <- marathontimes$time
data1[1] <- 82

Make the same analysis for the contaminated data

Which is the effect of this outlier in the analysis?

Dr. Pablo E. Verde

125

Lecture 5 - Introduction to WinBUGS

Lecture 5:
Introduction to WinBUGS

Dr. Pablo E. Verde

126

Lecture 5 - Introduction to WinBUGS

Summary

Introduction to BUGS

The BUGS language

Some simple examples

Making predictions

Connecting WinBUGS with R

Dr. Pablo E. Verde

127

Lecture 5 - Introduction to WinBUGS

Introduction to BUGS
The BUGS project began at the Medical Research Council Biostatistics Unit in
Cambridge in 1989, before the classic Gibbs sampling paper by Gelfand and Smith in
1990. An excellent review and future directions of the BUGS project is given in Lunn
et al. (2009).
BUGS stands for Bayesian inference using Gibbs sampling, reflecting the basic
computational techniques originally adopted.
BUGS has been just one part of the tremendous growth in the application of Bayesian
ideas over the last 20 years.
At this time BUGS has approximately over 30,000 registered users worldwide, and an
active on-line community comprising over 8,000 members.
Typing WinBUGS in Google generates over 100,000 hits; in Google scholars over
5,000; and searching in Statistics in Medicine on-line gives about 100 hits.
Dr. Pablo E. Verde

128

Lecture 5 - Introduction to WinBUGS

Think different ...


The modelling philosophy of BUGS was strongly influenced by developments in
artificial intelligence in the 1980s. In particular the focus was on Expert systems,
where a basic principle was to separate:
I

Knowledge base:
I

assumed model for the world

makes use of a declarative form of programming

structures described using a series of local relationships

Inference engine:
I

used to draw conclusions in specific circumstances

This approach forces to think first about the model use to describe the problem at
hand. The declarative programming approach, sometimes, confuses statistical analysts
used to work with procedural or functional statistical languages (SAS, SPSS, R, etc).
Dr. Pablo E. Verde

129

Lecture 5 - Introduction to WinBUGS

The BUGS language:


I

Language for specifying complex Bayesian models.

Constructs object-oriented internal representation of the model graph by


identifying parents and children. This is done with a DAG (Directed Acyclic
Graph).

Builds up an arbitrary complex model through specification of local structure.

Simulation from full conditionals using Gibbs sampling.

Current version is WinBUGS 1.4.3, it runs in Windows, and incorporates the


DoodleBUGS graphical model editor and a script language for running in batch
mode.

WinBUGS is freely available from http://www.mrc-bsu.cam.ac.uk/bugs


An open source version of BUGS called OpenBUGS is under development, and includes
versions of BUGS that run under LINUX (LinBUGS) and that can be run directly from
R (BRugs). See http://www.rni.helsinki.fi/openbugs
Dr. Pablo E. Verde

130

Lecture 5 - Introduction to WinBUGS

Example: Drug
In n = 20 patients we observed r = 15 positive responses.
y Bin(, n)
and we assume a conjugate prior for :
Beta(a, b)
Of course, we know the posterior distribution is
|y Beta(a + y , n r + b)
and no simulation is necessary. But just to illustrate WinBUGS ...

Dr. Pablo E. Verde

131

Lecture 5 - Introduction to WinBUGS

Dr. Pablo E. Verde

132

Lecture 5 - Introduction to WinBUGS

Directed Acyclic Graph (DAG) representation:


I

Ovals nodes represent random variables

Rectangular nodes represent constants

Arrows parent child relationships

Dr. Pablo E. Verde

133

Lecture 5 - Introduction to WinBUGS

1. Write BUGS code to specify the model ...


model {
y ~ dbin(theta, n)
theta ~ dbeta(a, b)
}
2. Check the model syntax ...
3. Load the data ...
list(n=20, y = 15, a = 3, b =2)
... and compile.
4. Then load initial values ...
list(theta = 0.5)
5. Select the nodes (variables) to monitor (just one in this case)...
6. Set the trace for all selected nodes (*) to monitor the MCMC simulations ...
7. Using the Update tools, select 10,000 simulations ...
8. Results ...
Dr. Pablo E. Verde

134

Lecture 5 - Introduction to WinBUGS

Example: Drug continue ...


Now we assume a different prior. A logit transform of



,
= log
1
so that < <
We assume a normal prior for :
Normal(, )
for suitable mean and precision = 1/ 2 .
This is a non-conjugate prior with no simple form for the posterior.
Straightforward in WinBUGS!
Dr. Pablo E. Verde

135

Lecture 5 - Introduction to WinBUGS

Double arrows represent a logical node or a mathematical functional relationship.


Dr. Pablo E. Verde

136

Lecture 5 - Introduction to WinBUGS

The BUGS code for the model is ...


model
{
y ~ dbin(theta, n)
logit(theta) <- phi
phi ~ dnorm(0, 0.001)
}

Dr. Pablo E. Verde

137

Lecture 5 - Introduction to WinBUGS

Making predictions
I

Important to be able to predict unobserved quantities for


I

filling-in missing or censored data

model checking - are predictions similar to observed data?

making predictions!

Easy in MCMC/WinBUGS, just specify a stochastic node without a data value it automatically predicted

Provides automatic imputation of missing data

Easiest case is where there is no data at all!! Just forwards sampling from prior
to make a Monte Carlo analysis.

Dr. Pablo E. Verde

138

Lecture 5 - Introduction to WinBUGS

Example: making predictions


The BUGS code for the model is ...
model
{
y ~ dbin(theta, n)
logit(theta) <- phi
phi ~ dnorm(mu, tau)
y.pred ~ dbin(theta, n) # defines a stochastic node for
# predictions
}

Dr. Pablo E. Verde

139

Lecture 5 - Introduction to WinBUGS

The prediction node y.pred is conditionally independent of the data node y given rate
.
Dr. Pablo E. Verde

140

Lecture 5 - Introduction to WinBUGS

Summary: Running WinBUGS


1. Open Specification tool and Update from Model menu, and Samples from
Inference menu.
2. Highlight model by double-click. Click on Check model.
3. Highlight start of data. Click on Load data.
4. Click on Compile.
5. Highlight start of initial values. Click on Load inits.
6. Click on Gen Inits if model initials values are needed.
7. Click on Update to burn in.
8. Type nodes to be monitored into Sample Monitor, and click set after each.
9. Perform more updates.
10. Type * into Sample Monitor, and click stats, etc. to see results on all monitored
nodes.
Dr. Pablo E. Verde

141

Lecture 5 - Introduction to WinBUGS

The R2WinBUGS package in R


The R package R2WinBUGS offers a versatile approach for making MCMC
computations within WinBUGS and returning them to R.
Let see a step by step example of linking R and WinBUGS with R2WinBUGS.
Example: non-conjugate priors inference for a binomial experiment
First save the following file with WinBUGS code in your working directory ...
# WinBUGS code: binary problem non-conjugate analysis
model
{
y ~ dbin(theta, n)
logit(theta) <- phi
phi ~ dnorm(0,0.001)
y.pred ~ dbin(theta,n) # making prediction
}
Dr. Pablo E. Verde

142

Lecture 5 - Introduction to WinBUGS

Then in R console ...


# load R2WinBUGS package
library(R2WinBUGS)
# setup WinBUGS directory and your working directory
bugsdir <- "C:/Programme/WinBUGS14"
workdir <- getwd()
# define you data nodes
n <- 20
y <- 15
data1 <- list ("n", "y")

Dr. Pablo E. Verde

143

Lecture 5 - Introduction to WinBUGS

# define parameters of interest


par1 <- c("theta", "y.pred")
# run the bugs() function:
m1 <- bugs(data1, inits=NULL, par1, "model1.txt",
n.chains = 1, n.iter = 2000, n.thin=1,
bugs.directory = bugsdir,
working.directory = getwd(),
debug=TRUE)

Dr. Pablo E. Verde

144

Lecture 5 - Introduction to WinBUGS

> print(m1, digits.summary = 3)


Inference for Bugs model at "model1.txt", fit using WinBUGS,
1 chains, each with 2000 iterations (first 1000 discarded)
n.sims = 1000 iterations saved
mean
sd 2.5%
25%
50%
75% 97.5%
theta
0.749 0.096 0.543 0.690 0.760 0.819 0.914
y.pred
14.962 2.733 9.000 13.000 15.000 17.000 19.000
deviance 4.277 1.584 3.198 3.286 3.667 4.551 9.014
DIC info (using the rule, pD = Dbar-Dhat)
pD = 1.1 and DIC = 5.3
DIC is an estimate of expected predictive error (lower deviance is better).

Dr. Pablo E. Verde

145

Lecture 5 - Introduction to WinBUGS

The R object m1 generated in this analysis belong to the class bug and can be further
manipulated, transformed, etc.
> class(m1)
[1] "bugs"
> names(m1)
[1] "n.chains"
[4] "n.thin"
[7] "sims.array"
[10] "summary"
[13] "median"
[16] "dimension.short"
[19] "isDIC"
[22] "DIC"

Dr. Pablo E. Verde

"n.iter"
"n.keep"
"sims.list"
"mean"
"root.short"
"indexes.short"
"DICbyR"
"model.file"

"n.burnin"
"n.sims"
"sims.matrix"
"sd"
"long.short"
"last.values"
"pD"
"program"

146

Lecture 5 - Introduction to WinBUGS

The object m1 is a list in R, so we can extract elements of the list by using the $
operator, for example:
> m1$pD
[1] 1.065
> m1$n.chains
[1] 1
> m1$sims.array[1:10, ,"theta"]
[1] 0.8389 0.9124 0.7971 0.8682 0.7025 0.7696 0.8417 0.6782 0.6146
[10] 0.8647
> m1$sims.array[1:10, ,"y.pred"]
[1] 15 19 19 20 15 20 18 17 10 17
> theta <- m1$sims.array[1:1000, ,"theta"]
> hist(theta, breaks = 50, prob = TRUE)
> lines(density(theta), col = "blue", lwd =2)

Dr. Pablo E. Verde

147

Lecture 5 - Introduction to WinBUGS

3
0

Density

Histogram of theta

0.5

0.6

0.7

0.8

0.9

1.0

theta

Dr. Pablo E. Verde

148

Lecture 5 - Introduction to WinBUGS

Some aspects of the BUGS language


I

<- represents logical dependence, e.g. m <- a + b*x

~ represents stochastic dependence, e.g. r ~ dunif(a,b)

Can use arrays and loops


model{
....
for (i in 1:N){
r[i] ~ dbin(p[i], n[i])
}
...
}

Dr. Pablo E. Verde

149

Lecture 5 - Introduction to WinBUGS

A for loop is represented as a plate in the DAGs model

Dr. Pablo E. Verde

150

Lecture 5 - Introduction to WinBUGS

Some aspects of the BUGS language

Some functions can appear on the left-hand-side of an expression, e.g.


logit(p[i]) <- a + b*x[i]
log(m[i]) < - c + d*y[i]

mean(p[]) to take mean of whole array, mean(p[m : n]) to take mean of elements m
to n. Also for sum(p[])

dnorm(0, 1)I(0, ) means the random variable will be restricted to the range
(0, ).

Dr. Pablo E. Verde

151

Lecture 5 - Introduction to WinBUGS

Functions in the BUGS language


I

p <- step(0.05 - x) = 1 if x 0.05, 0 otherwise. Hence monitoring p and


recording its mean will give the probability that x 0.05. This is useful to
calculate Bayesian p-values.

p <- equals(x, 0.7) = 1 if x = 0.7, 0 otherwise.

tau <- 1/pow(s,2) sets = 1/s 2 .

p
s <- 1/sqrt(tau) sets s = 1/ ( )

p[i,k] <- inprod(p[], Lambda[i,k]) sets pik =

See Model Specification/Logical nodes in the manual for full syntax.

Dr. Pablo E. Verde

j ij .

152

Lecture 5 - Introduction to WinBUGS

Data transformations
Although transformations of data can always be carried out before using WinBUGS, it
is convenient to be able to try various transformations of dependent variables within a
model description.
For example, we may wish to try both y and sqrt(y) as dependent variables without
creating a separate variable z = sqrt(y) in the data file.
The BUGS language therefore permits the following type of structure to occur:
for (i in 1:N) {
z[i] <- sqrt(y[i])
z[i] ~ dnorm(mu, tau)
}
Strictly speaking, this goes against the declarative structure of the model specification.
Dr. Pablo E. Verde

153

Lecture 5 - Introduction to WinBUGS

Some common distributions


I

Binomial: r dbin(p, n)

Normal: x dnorm(mu, tau)

Poisson: r dpois(lambda)

Uniform: x dunif(a, b)

Gamma: x dgamma(a, b)

Note: The normal distribution is parameterized in terms of its mean and


precision = 1/variance = 1/ 2 .
Functions cannot be used as arguments in distributions. You need to create
new nodes.

Dr. Pablo E. Verde

154

Lecture 5 - Introduction to WinBUGS

The WinBUGS data formats


WinBUGS accepts data files in:
1. Rectangular formant
n[] r[]
50 2
....
20 4
END
2. R list format:
list(N =12, n = c(50,12,...), r = c(2, 1,...))

Dr. Pablo E. Verde

155

Lecture 5 - Introduction to WinBUGS

Double indexing: Specifying categorical explanatory variables


y[] x[]
12 1
#subject with x=level 1
34 2
#subject with x=level 2
...
for( i in 1:N) {
y[i] ~dnorm(mu[i], tau)
mu[i] <- alpha + beta[x[i]]
}
alpha ~ dunif(-100,100)
beta[1] <- 0
# alias first level of beta
beta[2] ~ dunif(-100, 100)
beta[3] ~ dunif(-100, 100)
tau ~ dgamma(0.1,0.1)

Dr. Pablo E. Verde

156

Lecture 5 - Introduction to WinBUGS

Practical

Exercise: Survival data


Aitchison & Dunsmore (1975) give data on survival times (in weeks) of 20 patients
after treatment for a particular carcinoma. The data set is
list(survtime=c(25, 45, 238, 194, 16, 23, 30, 16,
22, 123, 51, 412, 45, 162,
14, 72, 5, 43, 45, 91), N=20)

Dr. Pablo E. Verde

157

Lecture 5 - Introduction to WinBUGS

1. Write a WinBUGS script for the following model:


yi
yi |,

= log(survival[i])
N(, ),

= 1/ 2

with non-informative priors for N(0, 103 ) and Gamma(103 , 103 ).


2. Use as initial values
list( mu = 0, tau = 1)
3. Draw a DAG for this model.
4. Run the model with 10,000 iterations and discard the first 5,000. Analyze visually
convergence.
5. Estimate the probability that a new patient has survival time more than 150
weeks.
Dr. Pablo E. Verde

158

Lecture 5 - Introduction to WinBUGS

Solution
# WinBUGS script:
model{
for ( i in 1:N)
{
y[i] <- log(survtime[i])
y[i] ~ dnorm(mu, tau)
}
sigma <- 1/sqrt(tau)
mu ~ dnorm(0, 1.0E-3)
tau ~ dgamma(1.0E-3, 1.0E-3)
y.new ~ dnorm(mu, tau)
y.dif <- step(y.new - log(150))
}

#data transform
#sampling model
#standard deviation
#prior for mu
#prior for tau
#predicted data
#pr(survival>150)

Dr. Pablo E. Verde

159

Lecture 5 - Introduction to WinBUGS

Solution

Dr. Pablo E. Verde

160

Lecture 6 - Multivariate Models with WinBUGS

Lecture 6:
Multivariate Models with WinBUGS

Dr. Pablo E. Verde

161

Lecture 6 - Multivariate Models with WinBUGS

Multivariate Normal Distribution


Multivariate Normal for p-dimensional vector y
y Normalp (, )
The conjugate prior for is also a MVN. WinBUGS follows parametrization using the
precision matrix = 1 .
In BUGS notation we have
y[1:p] ~ dmnorm(mu[], Omega[,])
mu[1:p] ~ dmnorm(mu.prior[], Omega.prior[])
Sigma[1:p, 1:p] <- inverse(Omega[,])

Dr. Pablo E. Verde

162

Lecture 6 - Multivariate Models with WinBUGS

Priors on precision matrix of multivariate normals

Conjugate prior is the Wishart distribution, which is analogous to Gamma or 2 .


Arises in classical statistics as the distribution of the sum-of-squares and products
matrix in multivariate normal sampling.
The Wishart distribution Wp (k, R) for a symmetric positive definite p p matrix
has density


1
p() |R|k/2 ||(kp1)/2 exp tr(R) ,
2
defined for a real scalar k > p 1 and a symmetric positive definite matrix R.

Dr. Pablo E. Verde

163

Lecture 6 - Multivariate Models with WinBUGS

Some characteristics
I

When p = 1
W1 (k, R) Gamma(k/2, R/2) 2k /R.

The expectation of the Wp (k, R) is


E [] = kR 1 .

Jeffreys prior is
p() ||(p+1)/2 ,
equivalent to k 0. This is not currently implemented in WinBUGS.

For weakly informative we can set R/k to be a rough prior guess at the
unknown true covariance matrix and taking k = p indicates minimal effective
prior sample size.

Dr. Pablo E. Verde

164

Lecture 6 - Multivariate Models with WinBUGS

Known problems
I

Every element must have same precision

Incomprehensible! It is a good idea to make some prior predictions in practice to


understand what we are doing.

Gelman at al (2004) page 483 outline alternative strategies

If do not use Wishart priors for precision matrices in BUGS, you need to make
sure that the covariance matrix at each iteration is positive-definite, otherwise
may crash.

Dr. Pablo E. Verde

165

Lecture 6 - Multivariate Models with WinBUGS

Exercise 1: Correlation Analysis with missing data


The following is an artificial data from the book Tools for Statistical Inference (Tanner
pag 63.(1994)):
> Y
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]
[7,]
[8,]
...

Dr. Pablo E. Verde

[,1] [,2]
1
1
1
-1
-1
1
-1
-1
2
NA
2
NA
-2
NA
-2
NA

166

Lecture 6 - Multivariate Models with WinBUGS

The following script in WinBUGS implement a Bayesian analysis for this problem.
model
{
for (i in 1 : 12){
Y[i, 1 : 2] ~ dmnorm(mu[], tau[ , ])
}
mu[1] <- 0
mu[2] <- 0
tau[1 : 2,1 : 2] ~ dwish(R[ , ], 2)
R[1, 1] <- 0.001
R[1, 2] <- 0
R[2, 1] <- 0
R[2, 2] <- 0.001
Sigma2[1 : 2,1 : 2] <- inverse(tau[ , ])
rho <- Sigma2[1, 2] / sqrt(Sigma2[1, 1] * Sigma2[2, 2]) }

Dr. Pablo E. Verde

167

Lecture 6 - Multivariate Models with WinBUGS

80
60
0

20

40

Frequency

100

120

P(rho|data)

1.0

0.5

0.0

0.5

1.0

Using predictive posterior values in each iteration WinBUGS impute the missing values.
Dr. Pablo E. Verde

168

Lecture 6 - Multivariate Models with WinBUGS

Multinomial model: non-conjugate analysis

In some applications the cell probabilities of contingency tables can be made function
of more basic parameters
Example: Population genetics example

Maternal genotype
AA
AB
BB

Offspring genotype
AA
427
108
-

AB
95
161
64

BB
71
74

Dr. Pablo E. Verde

169

Lecture 6 - Multivariate Models with WinBUGS

The model equations are given by the following table:

Maternal genotype
AA
AB
BB

Offspring genotype
AA
(1 )p +
(1 )p/2 + /4
-

AB
(1 )q
1/2
(1 )p

BB
(1 )q/2 + /4
(1 )q +

where p is the frequency of A in outcross pollen, is the rate of self-fertilization and


q =1p

Dr. Pablo E. Verde

170

Lecture 6 - Multivariate Models with WinBUGS

To implement this model in WinBUGS, we equate cell probabilites with requred


function:
model{
XAA[1]<- (1-sigma)*p + sigma; XAA[2]<- (1- sigma)*q; XAA[3]<- 0
XAB[1]<-(1-sigma)*p/2 +sigma/4; XAB[2]<-0.5; XAB[3]<-(1-sigma)*q/2 +sigma/4
XBB[1]<- 0; XBB[2]<- (1-sigma)*p; XBB[3]<- (1- sigma)*q + sigma
KAA <- sum(NAA[]); KAB <- sum(NAB[]); KBB <- sum(NBB[])
NAA[1:3]
~ dmulti(XAA[], KAA)
NAB[1:3]
~ dmulti(XAB[], KAB)
NBB[1:3]
~ dmulti(XBB[], KBB)
p ~ dunif(0, 1)
# uniform prior for p
sigma ~ dunif(0, 1)
# uniform prior for sigma
q <- 1 -p
}
list(NAA = c(427, 95, 0), NAB=c(108, 161, 71), NBB=c(0, 64, 74))
Dr. Pablo E. Verde

171

Lecture 6 - Multivariate Models with WinBUGS

> print(m.popgen, digits=3)


Inference for Bugs model at "popgen.bug", fit using WinBUGS,
2 chains, each with 10000 iterations (first 5000 discarded)
n.sims = 10000 iterations saved
mean
sd
2.5%
25%
50%
75% 97.5% Rhat
p
0.705 0.024 0.655 0.690 0.706 0.721 0.749 1.001
sigma
0.371 0.042 0.288 0.343 0.371 0.400 0.451 1.001
deviance 27.151 1.972 25.210 25.730 26.540 27.960 32.410 1.001

n.eff
10000
10000
10000

For each parameter, n.eff is a crude measure of effective sample size,


and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
DIC info (using the rule, pD = Dbar-Dhat)
pD = 2.0 and DIC = 29.1

Dr. Pablo E. Verde

172

Lecture 6 - Multivariate Models with WinBUGS

Frequency

0.25

sigma

Frequency

0.35

0.45

0.72

0.76

0.45

0.68

0.35

0.64

0.25

0.64

0.68

0.72

0.76
x

Dr. Pablo E. Verde

173

Lecture 7 - MCMC-Short-version

Lecture 7:
Bayesian Computations with MCMC Methods

Dr. Pablo E. Verde

174

Lecture 7 - MCMC-Short-version

Summary

Introduction to discrete Markov chains.

Construction of Markov chain Monte Carlo algorithms.

Gibbs sampling methods.

Metropolis and Metropolis-Hastings method.

Issues in applications of MCMC (convergence checking, proposal distributions,


etc.)

Dr. Pablo E. Verde

175

Lecture 7 - MCMC-Short-version

Why is computation important ?


I

Bayesian inference centers around the posterior distribution


p(|y ) p(y |) p()
where may be a vector of thousand components!

p(y |) and p() are usually available in closed form, but p(|y ) is usually not
analytical tractable. Also, we may want to obtain
R

marginal posteriors p(1 |y ) =

calculate properties of p(1 |y ), such as mean E (1 |y ) =


etc.

p(|y )d(1)
R

1 p(|y )d(1) , tail areas,

We see that numerical integration becomes crucial in Bayesian inference!

Dr. Pablo E. Verde

176

Lecture 7 - MCMC-Short-version

General Monte Carlo integration


If we have algorithms for sampling from arbitrary (typically high-dimensional) posterior
distributions, we could use Monte Carlo methods for Bayesian inference:
I

Suppose we can draw samples from the joint posterior distribution for , i.e.
(1) , (2) , . . . , (N) p(|y )

Then Monte Carlo integration

(1) , (2) , . . . (N) p(|y )


R
E (g ()) = g ()p(|y )d

Theorems exist which prove convergence even if the sample is dependent,i.e.

1
N

PN

i=1

g (i )

N
1 X
g ((i) ) E (g ())
N

as

i=1

Dr. Pablo E. Verde

177

Lecture 7 - MCMC-Short-version

Markov Chain Monte Carlo (MCMC)

Independent sampling from p(|y ) may be very difficult in high dimensions

Alternative strategy based on dependent sampling:


I

We know p(|y ) up to a normalizing constant

Then, we design a Markov chain which has p(|y ) as its stationary distribution

A sequence of random variables (0) , (1) , (3) , . . . forms a Markov chain if


(i+1) p(|(i) )
i.e. conditional on the value of (i) , (i+1) is independent of (i1) , . . . , (0) .

Dr. Pablo E. Verde

178

Lecture 7 - MCMC-Short-version

Run the chain until it appears to have settled down to equilibrium, say
(k) , (k+1) , . . . , (K ) p(|y )

Use these sampling values to empirically estimate the posterior of , say,


p(|y )

Problem: Design a Markov chain with p(|y ) as its unique stationary distribution?

Dr. Pablo E. Verde

179

Lecture 7 - MCMC-Short-version

Answer: This is surprisingly easy and several standard recipes are available
I

Metropolis et al. (1953) showed how to do it

Hastings (1970) generalized Metropolis algorithm

Geman and Geman (1984) introduced the Gibbs Sampling algorithm

Gelfand and Smith (1990) popularized Gibbs sampling in statistics

See Gilks, Richardson and Spiegelhalter (1996) for a gentle introduction and many
worked examples.

Robert and Casella (2004) for more detailed theoretical reference

Dr. Pablo E. Verde

180

Lecture 7 - MCMC-Short-version

The Gibbs sampler


Let our vector of unknowns consist of k sub-components = (1 , . . . , k )
1. Choose starting values 10 , . . . , k0
2. Sample from
(1)

p(1 |2 , 30 , . . . , k , y )

(1)

p(2 |1 , 30 , . . . , k , y )

(0)

(0)

(1)

(0)

(1)

(1)

...
(1)

p(k |1 , 21 , . . . , k1 , y )

3. Repeat step 2 many 1000s of times. Eventually we obtain samples from p(|y )

Dr. Pablo E. Verde

181

Lecture 7 - MCMC-Short-version

Example: Normal distribution with unknown mean and variance


I

Suppose that the observations y1 , . . . , yn N(, 1 )

We assume that yi are conditionally independent given and precision , and


and are themselves independent.

We put conjugate priors on and :


N(0 , 1
0 ),

Gamma(a, b)

Then the full conditionals for and are:




0 0 + n
y
1
,
p(|, y ) = N
0 + n
0 + n


n
1X
p( |, y ) = Gamma a + , b +
(yi )2
2
2

Dr. Pablo E. Verde

182

Lecture 7 - MCMC-Short-version

Programming the Gibbs sampler in R


Example: Normal distribution with unknown mean and variance
gibbsNormal <- function(y, mu1, tau1, N, mu0, phi0, a, b) {
mu <- numeric(N+1) # N = number of iterations
mu[1] <- mu1 # the initial value for mu
tau <- numeric(N+1)
tau[1] <- tau1 # the initial value for tau
n <- length(y) ; ybar <- mean(y)
for(i in 2:(N+1)) {
# generate samples from full conditional
mu[i] <- rnorm(1, (mu0*phi0 + n*ybar*tau[i-1])/(phi0 + n*tau[i-1]),
1/sqrt(phi0 + n*tau[i-1]))
tau[i] <- rgamma(1, (n+2*a)/2, (sum((y-mu[i])^2) + 2*b)/2)
}
output <- cbind(mu, tau) }

Dr. Pablo E. Verde

183

Lecture 7 - MCMC-Short-version

0.004

0.004

0.003

0.003

tau

0.002

tau

0.002

0.001

0.001

0.000

0.000

149.0

149.5

150.0
mu

150.5

151.0

149.0

149.5

150.0

150.5

151.0

mu

Four independent sequences of the Gibbs sampler for a normal distribution with
unknown mean and variance.
Left panel: first 10 steps. Right panel: last 500 iterations in each chain.
Dr. Pablo E. Verde

184

Lecture 7 - MCMC-Short-version

20
0

10

Frequency

30

40

149.0 149.5 150.0 150.5 151.0

mu

Posterior of mu

100

200

300

400

500

149.0

149.5

iteration

150.0

150.5

151.0

mu

30
10

20

Frequency

0.0015

0.0005

tau

0.0025

40

Posterior of tau

100

200

300

400

500

0.0000

0.0010

iteration

0.0020

0.0030

tau

Traces of one sequence of Gibbs simulations. Upper panels: 500 iterations from the
posterior of . Lower panels: 500 iterations from the posterior of .
Dr. Pablo E. Verde

185

Lecture 7 - MCMC-Short-version

General properties of a MCMC algorithm

Reversibility:
I

The key property is reversibility or detailed balance, i.e., a balance in the flow of
transition probabilities between the states of the Markov chain

A Markov chain which is reversible has stationary distribution

Thus we build a kernel or transition matrix P such that it ensures reversibility


p(x)P(y |x) = p(y )P(x|y )
where P(y |x) represents the flow of probability x y

Dr. Pablo E. Verde

186

Lecture 7 - MCMC-Short-version

Irreducibility:
I

An irreducible chain is one in which for any point (k) in the parameter space, it is
possible to move from (k) to other point (l) in a finite number of steps.

This guarantees the chain can visit all possible values of irrespective of the starting
value (0) .

Aperiodicity:
I

A chain is aperiodic if does not exhibits periodic behavior.

If R1 , R2 , . . . , Rk are disjoint regions in parameter space the chain does not cycle
around them.

To sample form p(|y ) we construct a Markov chain with transition matrix P which
satisfies reversibility, irreducibility and aperiodicity.

Dr. Pablo E. Verde

187

Lecture 7 - MCMC-Short-version

Metropolis algorithm
The algorithm proceeds as follows:
I

Start with a preliminary guess, say 0 .

At iteration t sample a proposal t P(t |t1 ).

The jumping distribution must be symmetric, satisfying the condition


P(a |b ) = P(b |a ) for all a and b .

If p(t |y ) > p(t1 |y ) then accept t

If not, flip a coin with probability r =

If the coin toss comes up tails, stay at t1

p(t |y )
,
p(t1 |y )

if it comes up heads, accept t .

The algorithm continues, until we sample from p(|y ). The coin tosses allow it to go
to less plausible t s, and keep it from getting stuck in local maxima.
Dr. Pablo E. Verde

188

Lecture 7 - MCMC-Short-version

Some samplers for the Metropolis algorithm


The following are commonly implemented samplers for the proposal distribution
P(t |t1 )
I

Random walk sampler, observations are generated by t = t1 + zt with zt f .


There are many common choices for f , including the uniform in the unit disc, a
multivariate normal or a t - distribution

The independent sampler, the candidate observation is sample independently from


the current state of the chain

The Gibbs sampler. The Gibbs transition can be regarded as a special case of a
Metropolis transition

The STAN software implements samplers based on Hamiltonian dynamics

Dr. Pablo E. Verde

189

Lecture 7 - MCMC-Short-version

Metropolis in R

The package MCMCpack implements Metropolis for a large number of statistical


models.

The current implementation uses a random walk Metropolis with proposal density
multivariate Normals.

The proposal density is centered at the current (t) with variance-covariance


matrix given by the user or automatically calculated proportional to the Hessian of
the posterior evaluated at its mode.

Dr. Pablo E. Verde

190

Lecture 7 - MCMC-Short-version

Example: an interesting bivariate distribution


I

It is well know that the pair of marginal distributions doest not uniquely
determine a bivariate distribution. For example, a bivariate distribution with
normal marginal distributions need not be jointly normal (Feller 1966, p. 69)

In contrast, the conditional distribution functions uniquely determine a joint


density function (Arnold and Press 1989).

A natural question arises: Must a bivariate distribution with normal


conditionals be jointly normal?

Dr. Pablo E. Verde

191

Lecture 7 - MCMC-Short-version

The answer is NO!


Gelman and Meng (1991) introduced an interesting distribution, where the join
distribution is non-normal:


f (x, y ) exp 1/2 Ax 2 y 2 + x 2 + y 2 2Bxy 2C1 2C2 y .
But it has conditional distributions which are normals:
x|y N(

By + C1
1
,
),
2
2
Ay + 1 Ay + 1

y |x N(

Bx + C1
1
, 2
).
2
Ax + 1 Ax + 1

The question is how to sample form f (x, y )?

Dr. Pablo E. Verde

192

Lecture 7 - MCMC-Short-version

Here we show an example where the parameters are A = 1, B = 0 C1 = 3 and C2 = 3.


GelmanMeng distribtuion A=1, B=0, C1=3 and C2=3

z
y

Dr. Pablo E. Verde

193

Lecture 7 - MCMC-Short-version

11
7

13

15

GelmanMeng distribtuion A=1, B=0, C1=3 and C2=3

33
19

29

31

13

21

27

11

17

25
15

35
23

1
1

13

11

Dr. Pablo E. Verde

194

Lecture 7 - MCMC-Short-version

We implement a Metropolis sampling algorithm based on a random walk and bivariate


normals as proposals.
## Gelman and Meng (1991) kernel function:
f.g.m <- function(xy , A = 1, B = 0, C1 = 3, C2 = 3)
{ x <- xy[1]; y <- xy[2]
r <- -.5 * (A * x^2 * y^2 + x^2 + y^2
- 2 * B * x * y - 2 * C1 * x - 2 * C2 * y)
as.vector(r) }
## Metropolis sampling with MCMCpack
res <- MCMCmetrop1R(f.g.m, theta.init=c(0, 1), mcmc=10000, burnin=5000,
logfun=FALSE)
>
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The Metropolis acceptance rate was 0.47580

Dr. Pablo E. Verde

195

3
1

15

13

Lecture 7 - MCMC-Short-version

9 1011

27

35

25

31 3

17

29

21 25

30

19

23

20
3


15

11

15

10

5
7
9

1
3

3
5
5
9
0
5
1
3
1
7

2
0

res[,2]

res[,1]

Dr. Pablo E. Verde

196

1 0 1 2 3 4

Lecture 7 - MCMC-Short-version

200

400

600

800

1000

600

800

1000

0 1 2 3 4

iteration

200

400
iteration

Dr. Pablo E. Verde

197

Lecture 7 - MCMC-Short-version

Performance of MCMC methods

There are three main issues to consider


I

Convergence: How quickly does the distribution of (t) approach p(|x) ?

Efficiency: How well are functionals of p(|x) estimated from (t) ?

Simplicity: How convenient is the method to use?

In general computer effort should be measured in seconds, not iterations!

Dr. Pablo E. Verde

198

Lecture 7 - MCMC-Short-version

Checking convergence

Convergence is to target distribution and not to a single value

Once convergence is reached, samples should look like a random scatter about a
stable mean value.

One approach is to run many long chains with widely differing starting values.

Plot traces of the simulated chain

Plot ACF, etc.

Dr. Pablo E. Verde

199

Lecture 8 - Regression Models

Lecture 8:
Bayesian Regression Models

Dr. Pablo E. Verde

200

Lecture 8 - Regression Models

Modeling Examples

Multiple linear regression: model diagnostics and variable selection

ANOVA models: an alternative to multiple testing

Logistic regression: combining multiple information in Risk analysis

Dr. Pablo E. Verde

201

Lecture 8 - Regression Models

Introduction

Standard (and non standard) regression models can be easily formulated within a
Bayesian framework.
I

Specify probability distribution for the data

Specify form of relationship between response and explanatory variables

Specify prior distribution for regression coefficients and any other unknown
(nuisance) parameters

Dr. Pablo E. Verde

202

Lecture 8 - Regression Models

Some advantages of a Bayesian formulation in regression modeling include:


I

Easy to include parameter restrictions and other relevant prior knowledge

It is simple to extended to non-linear regression

We can easily robustified a model

Easy to make inference about functions of regression parameters and/or


predictions

We can handle missing data and covariate measurement error

Transparent variable selection by prior modeling regression coefficients

Dr. Pablo E. Verde

203

Lecture 8 - Regression Models

Linear regression
Example: Stack loss data, WinBUGS Volume 1
In this example, we illustrate a more complete regression analysis including outlier
checking, model adequacy and variable selection methods.
This is a very often analyzed data of Brownlee (1965, p. 454).
I

21 daily responses of stack loss, yi the amount of ammonia escaping from


industrial chimneys

Covariates: air flow x1 , temperature x2 and acid concentration x3

Transformed covariates: zk,i = (xk,i xk )/sd(xk ) for k = 1, 2, 3.

Dr. Pablo E. Verde

204

Lecture 8 - Regression Models

30

40

50

60

70

80

90

20

60

80

40

X1

X2

40

80

60

20

40

60

80

10 15 20 25 30 35 40

20

80

60

X3

20

40

20

10

15

20

25

30

35

40

20

40

60

80

Dr. Pablo E. Verde

205

Lecture 8 - Regression Models

Model 1 specification:

yi

Normal(i , ) i = 1, . . . , 21

= 1/ 2

= 0 + 1 z1,i + 2 z2,i + 3 z3,i

Uniform(0.01, 100)
k

Dr. Pablo E. Verde

Normal(0, 0.001),

k = 0, . . . , 3.

206

Lecture 8 - Regression Models

In these model we also calculate some diagnostic quantities:


By analogy to the classical regression analysis we calculate the standardized residuals as
yi E (yi |z)
di = p
.
Var (yi |z)
In a Bayesian setting these residuals have posterior distributions as well. They can be
used similarly to residual analysis in classical statistics.

Dr. Pablo E. Verde

207

Lecture 8 - Regression Models

We specify the model in WinBUGS as:


model{
# Model 1
for (i in 1 : N) {
y[i] ~ dnorm(mu[i], tau)
mu[i] <- beta0 + beta[1] * z[i, 1] + beta[2] * z[i, 2] +
beta[3] * z[i, 3]
stres[i] <- (y[i] - mu[i]) / sigma
# Priors
beta0 ~ dnorm(0, 0.001)
for (j in 1 : p) {
beta[j] ~ dnorm(0, 0.001) } # coefficients independent
tau <- 1/(sigma*sigma)
sigma ~ dunif(0.01, 100)
# Gelmans prior
}

Dr. Pablo E. Verde

208

Lecture 8 - Regression Models

We run the model in WinBUGS, we look at the possible outliers.

Dr. Pablo E. Verde

209

Lecture 8 - Regression Models

As an alternative model we replace the Normal likelihood by a t-distribution with = 4


degrees of freedom. The idea is to have a more robust model against outliers in the y s.
Model 2:

yi

t(i , , )

= 1/ 2

i = 1, . . . , 21

= 0 + 1 z1,i + 2 z2,i + 3 z3,i

Uniform(0.01, 100)
k

Dr. Pablo E. Verde

Normal(0, 0.001),

k = 0, . . . , 3.

210

Lecture 8 - Regression Models

In the WinBUGS code we modify the model section by commenting out the normal
model and adding one line with the model based on the t-distribution and we run the
MCMC again.
#y[i] ~ dnorm(mu[i], tau)
y[i] ~ dt(mu[i], tau, 4)
#DIC Normal
Dbar Dhat pD DIC
Y 110.537 105.800 4.736 115.273
total 110.537 105.800 4.736 115.273
#DIC t
Dbar Dhat pD DIC
Y 109.043 104.103 4.940 113.983
total 109.043 104.103 4.940 113.983
A very modest difference between the two models.
Dr. Pablo E. Verde

211

Lecture 8 - Regression Models

Variable Selection
We modify the structure of the distribution of the regression coefficients by adding a
common distribution with unknown variance. This is called a ridge regression
model, where the s are assumed exchangeable.
Model 3 specification:
= 1/ 2 ,

yi

Normal(i , ),

= 0 + 1 z1,i + 2 z2,i + 3 z3,i

i = 1, . . . , 21

Uniform(0.01, 100)
k

Normal(0, ),

= 1/2

k = 0, . . . , 3

Uniform(0.01, 100).

Dr. Pablo E. Verde

212

Lecture 8 - Regression Models

In WinBUGS
for (j in 1 : p) {
# beta[j] ~ dnorm(0, 0.001)
# coefs independent
beta[j] ~ dnorm(0, phi) # coefs exchangeable (ridge regression)
}
phi <- 1/(sigmaB*sigmaB)
sigmaB ~ dunif(0.01, 100) #Gelmans prior

Dr. Pablo E. Verde

213

Lecture 8 - Regression Models

Dr. Pablo E. Verde

214

Lecture 8 - Regression Models

We see that 3 is not a relevant variable in the model. Now, we try another variable
selection procedure, by modifying the distribution of the regression coefficients.
Model 5 specification:

yi

Normal(i , 2 ) i = 1, . . . , 21

= 0 + 1 1 z1,i + 2 2 z2,i + 3 3 z3,i

Uniform(0.01, 100)
k

Normal(0, 100),

k = 0, . . . , 3,

1 Bernoulli(0.5)
2 Bernoulli(0.5)
3 Bernoulli(0.5).

Dr. Pablo E. Verde

215

Lecture 8 - Regression Models

In WinBUGS we change the equation of i :


#mean equation model
mu[i] <- beta0 + beta[1] * pi[1]* z[i, 1] +
beta[2]* pi[2] * z[i, 2] + beta[3] * pi[3]* z[i, 3]
#priors
for (j in 1 : p) {
beta[j] ~ dnorm(0, 0.001)
pi[j] ~ dbern(0.5)
}
We run the model for 40,000 iterations and we discard the first 20,000. Results are:
pi[1] 0.9987 0.0367 <- X1 very important
pi[2] 0.8361 0.3702 <- X2 non meaningful
pi[3] 0.0425 0.2018 <- X1 definitively no important
Dr. Pablo E. Verde

216

Lecture 8 - Regression Models

Dr. Pablo E. Verde

217

Lecture 8 - Regression Models

ANOVA and Experimental Design


Example: Speed of light measurements by Michelson
This is a classical data set available in R and corresponds to Measurements of the
speed of light in air, made between 5th June and 2nd July, 1879. The data consists of
five experiments, each consisting of 20 consecutive runs. The response is the speed of
light in km/s, less 299000. The currently accepted value, on this scale of
measurement, is 734.5.
We start by comparing the 5 experiments with boxplots:
>
>
>
>

library(MASS)
data(michelson)
attach(michelson)
plot(Expt, Speed, main="Speed of Light Data",
xlab="Experiment No.",ylab="Speed")

Dr. Pablo E. Verde

218

Lecture 8 - Regression Models

1000

Speed of Light Data

800

Speed

900

700

Experiment No.

Dr. Pablo E. Verde

219

Lecture 8 - Regression Models

ANOVA model: Bayesian analysis in WinBUGS


The Bayesian approach to the ANOVA model is similar to the regression model, with
mean equation according to the mean groups parametrization. So if we have nj
P
observations in J groups with n = Jj=1 nj .
Model 1
yi,j

Normal(j , ),

= 1/ 2

= + j ,

Normal(0, 0.001),

i = 1, . . . , nj ,

j = 1, . . . , J,
j = 2, . . . , J,

Normal(0, 0.001),
Uniform(0.01, 100).

Dr. Pablo E. Verde

220

Lecture 8 - Regression Models

The WinBUGS implementation of Model 1 is:


model{
for(i in 1:n){
y[i] ~ dnorm( mu[i], tau)
mu[i] <- mu0 + alpha[ a[i] ]
}
# Corner constrains parametrization
# alpha[1] <- 0
# Sum to zero parametrization
alpha[1] <- -sum( alpha[2:J] )
# group means:
for( i in 1:J){
m.g[i] <- mu0 + alpha[i]}

Dr. Pablo E. Verde

221

Lecture 8 - Regression Models

# mean differences:
d[1] <- (m.g[2] - m.g[1] )
d[2]<- (m.g[3] - m.g[1] )
...
# Priors
mu0 ~ dnorm(0, 0.0001)
for( j in 2:J){
alpha[j] ~ dnorm(0, 0.0001)
}
tau <- 1/(sigma*sigma)
sigma ~ dunif(0.01, 100) #Gelmans prior
}

Dr. Pablo E. Verde

222

Lecture 8 - Regression Models

One serious problem of ANOVA modeling is the lack of variance homogeneity between
groups. We can extend our Bayesian set up to include this data feature as follows:
Model 2
yi,j

Normal(j , j ),

= + j ,

Normal(0, 0.001),

j = 1/j2

i = 1, . . . , nj ,

j = 1, . . . , J,
j = 2, . . . , J,

Normal(0, 0.001),
j

Uniform(0.01, 100),

j = 1, . . . , J, .

This model includes a structural dispersion sub-model for each treatment group.

Dr. Pablo E. Verde

223

Lecture 8 - Regression Models

The WinBUGS implementation of Model 2 change to:


{
for(i in 1:n){
#y[i] ~ dnorm( mu[i], tau)
y[i] ~ dnorm( mu[i], tau[i])
mu[i] <- mu0 + alpha[ a[i] ]
tau[i] <- gamma[ a[i] ]
}
# Priors for groups dispersion model
for( j in 1:J){
gamma[j] ~ dgamma(0.001, 0.001)
sigma2[j] <- 1/ gamma[j]
sigma[j] <- pow(sigma2[j], 0.5)
}

Dr. Pablo E. Verde

224

Lecture 8 - Regression Models

For models with more than 3 groups the estimation based on simple means can be
improved by adding exchangeability structure to the i s (Stein, 1956, Lindley 1962,
Casella and Berger, pag.574). This assumed a sub-model for i in the following way:
Model 3
yi,j

j = 1/j2

Normal(j , j ),

= + j ,

Normal(0, ),

i = 1, . . . , nj ,

j = 1, . . . , J,
= 1/

j = 2, . . . , J,

Normal(0, 0.001),
Uniform(0.01, 100),
j

Uniform(0.01, 100),

j = 1, . . . , J,
j = 1, . . . , J, .

Dr. Pablo E. Verde

225

Lecture 8 - Regression Models

The WinBUGS implementation of Model 3 change to:


# Priors
mu0 ~ dnorm(0, 0.0001)
for( j in 2:J){
alpha[j] ~ dnorm(0, xi)
}
xi <- 1/(sigma.alpha*sigma.alpha)
sigma.alpha ~ dunif(0.01, 100) #Gelmans prior

Dr. Pablo E. Verde

226

Lecture 8 - Regression Models

One alternative to exchangeability between i s is to use a mixture model to


investigate the structure of the data, for example:
Model 4
yi,j

j = 1/j2

Normal(j , j ),

= + j j ,

Normal(0, 0.001),

i = 1, . . . , nj ,

j = 1, . . . , J,
j = 2, . . . , J,

Normal(0, 0.001),
j

Bernoulli(0.5),

Uniform(0.01, 100),

j = 2, . . . , J,
j = 1, . . . , J, .

Dr. Pablo E. Verde

227

Lecture 8 - Regression Models

The WinBUGS implementation of Model 3 change to:

# group means mixture model:


for( i in 1:J){
m.g[i] <- mu0 + alpha[i]*pi[i]
pi[i] ~ dbern(0.5)
}

The next slides present some graphical results of the main features of each model. The
full WinBUGS script for this analysis is anova.odc, this includes some other
calculations for residual analysis as well.

Dr. Pablo E. Verde

228

Lecture 8 - Regression Models

ANOVA: classical model

Posterior distributions for mean groups j , Model 1.


Dr. Pablo E. Verde

229

Lecture 8 - Regression Models

ANOVA: classical model

Posterior distributions for mean differences between all groups k l (k 6= l), Model 1
Dr. Pablo E. Verde

230

Lecture 8 - Regression Models

ANOVA: variance heterogeneity model

Posterior distributions for j , Model 2.


Dr. Pablo E. Verde

231

Lecture 8 - Regression Models

ANOVA: variance heterogeneity model

Posterior distributions for mean differences between all groups k l (k 6= l), Model 2
Dr. Pablo E. Verde

232

Lecture 8 - Regression Models

ANOVA: exchangeability model

Posterior distributions for mean differences between all groups with exchangeability,
Model 3.
Dr. Pablo E. Verde

233

Lecture 8 - Regression Models

ANOVA: Mixture models

Posterior distributions for the group means with mixture distributions, Model 4.
Dr. Pablo E. Verde

234

Lecture 8 - Regression Models

Logistic regression: combining multiple information in Risk analysis


Example: The Challenger O-Ring Data: A Bayesian Risk Analysis
Historical background
I

The Space Shuttle Challengers final mission was on January 28th 1986, on an
unusually cold morning (31F/-0.5C)

It disintegrated 73 seconds into its flight after an O-ring seal in its right solid
rocket booster failed

The uncontrolled flame caused structural failure of the external tank, and the
resulting aerodynamic forces broke up the orbiter

All seven members of the crew were killed

Dr. Pablo E. Verde

235

Lecture 8 - Regression Models

Technical background

Dr. Pablo E. Verde

236

Lecture 8 - Regression Models

The Challengers O-Ring Thermal-Distress Data


Example: Regression modelling with change point
I

On the night of January 27, 1986, the night before the space shuttle Challenger
accident, there was a tree-hour teleconference among people at Marton Thiokol
(manufacturer of the solid rocket motor), Marshal Space Center and Kennedy
Space Center.

The discussion focused on the forecast of a 31 F temperature for lunch time the
next morning, and the effect of low temperature on O-ring performance.

The available data of previous shuttle flights consisted on occurrence of thermal


distress (yes, no) and temperature at launch time. Lets take a look at these data
and fit a logistic regression in R:

Dr. Pablo E. Verde

237

Lecture 8 - Regression Models

> #distress yes=1 , no =0


> y = c(1,1,1, 1,0,0,0, 0,0,0,0, 0,1,1,0,
0,0,1,0, 0,0,0,0)
> # temperature at launch time
> x = c(53,57,58, 63,66, 67,67, 67,68, 69,70,
70,70, 70,72, 73,75, 75,76, 76,78, 79,81)
> # number of previous flights
> N = 23
> summary(f.b1 <- glm(y ~ x, family=binomial))
...
AIC: 24.315
> summary(f.b2 <- update(f.b1, family=binomial(link=cloglog)))
...
AIC: 23.531

Dr. Pablo E. Verde

238

1.0

Lecture 8 - Regression Models

0.6
0.4
0.2

Prob of at least one oring incident

0.8

0.0

55

60

65

70

75

80

Temperature in F

Dr. Pablo E. Verde

239

Lecture 8 - Regression Models

These data have been analyzed several times in the literature. I found that fitting
a change point model makes an improvement on previous published analysis. The
model is:
Pr(yi = 1) = logit1 (0 + 1 zi )
where


zi =

1
0

for
xi K ,
otherwise.

In this model K is unknown and represents the temperature from which the
probability of distress is the lowest.

The parameter 1 represents the reduction of risk under temperatures greater


than K .

Dr. Pablo E. Verde

240

Lecture 8 - Regression Models

This model is implemented in WinBUGS as following:


model
{
for(i in 1 : N) {
y[i] ~ dbern(p[i])
logit(p[i]) <- beta0 + beta1 * z[i] }
for(i in 1:N){z[i] <- step(x[i] - K) } # step = 0 until K
#priors
K ~dunif(53, 81)
beta0 ~ dnorm(0.0, 0.01)
beta1 ~ dnorm(0.0, 0.01)
}

Dr. Pablo E. Verde

241

Lecture 8 - Regression Models

We run the model with R2WinBUGS:


>
>
>
>

data.b <- c("x", "y", "N")


par.b <- c("beta0", "beta1", "K")
chall.1 <- bugs(data.b, inits=NULL, par.b, "challenger-1.txt",
n.chains=1, n.iter=20000, n.thin=1, bugs.directory = bugsdir,
working.directory = getwd(), clearWD=TRUE, debug=TRUE)

>

print(chall.1)
mean sd 2.5%
25% 50% 75% 97.5%
beta0
6.2 3.9 -0.2
3.3 5.8 8.4 15.3
beta1
-8.0 3.9 -17.2 -10.2 -7.6 -5.1 -2.1
K
63.8 2.9 58.2 63.1 64.1 65.1 67.0
deviance 19.1 2.8 16.6 16.9 18.0 20.4 26.9
pD = 2.4 and DIC = 21.5

Dr. Pablo E. Verde

242

Lecture 8 - Regression Models

1.0

Histogram of K

500

0.6
0.2

0.4

Prob of at least one oring incident

300
100

200

Frequency

400

0.8

0.0

55

60

65

70
K

75

80

55

60

65

70

75

80

Temperature in F

Dr. Pablo E. Verde

243

Lecture 8 - Regression Models

Example: Probabilistic Risk Assessment

Dr. Pablo E. Verde

244

Lecture 8 - Regression Models

A catastrophic failure of a field joint is expected when all four events occur
during the operation of the solid rocket boosters:
pfield = pa pb pc pd

Since there are 6 field joints per launch, assuming the all 6 joint failures are
independent, the probability of at least one failure is
Risk = 1 (1 pfield )6

pa is calculated from the logistic regression model as:


pa = Pr(O-Ring Thermal-Distress at 31F)

The available data for the other events was very sparse:
I
I
I

Event b: 2 in 7 flights
Event c: 1 in 2 flights
Event d: 0 in 1 flight

We model the probabilities of events b, c and d with a beta-binomial model with


uninformative priors Beta(0.5, 0.5).
Dr. Pablo E. Verde

245

Lecture 8 - Regression Models

In WinBUGS we modify the model file challenger-1.txt by adding these risk


calculations. We add, also, the calculation of risk at temperature K or greater.
To run the model in R by adding the date of events b, c, and d:
# Risk analysis
y.b <- 2; y.c <- 1; y.d <- 0
data.r <- c("x", "y", "N", "y.b", "y.c", "y.d")
par.r <- c("beta0", "beta1", "K", "p.cat", "p.catK")
chall.2<- bugs(data.r, inits=NULL, par.r, "challenger-1.txt", n.chains = 1,
...
> print(chall.2, 3)
mean
sd
2.5%
25%
50%
75% 97.5%
K
63.726 2.653 58.190 63.090 64.120 65.120 66.010
p.cat
0.173 0.204
0.000
0.019 0.088 0.258 0.736
p.catK
0.038 0.061
0.000
0.003 0.014 0.047 0.213

Dr. Pablo E. Verde

246

Lecture 8 - Regression Models

Summary results of our analysis:


I

The risk of a catastrophic failure with launching temperature of 31 F is 17%

The risk of a catastrophic failure with temperatures greater than 63 F is 3.8%

The odds ratio of the risk between these temperatures is 9.8

One interesting historical note:


The Roger Commission, appointed by president Reagan to investigate the causes of
the accident, pointed out:
... a mistake in the analysis of the thermal distress data was that the flights with zero
incidents were left off because it was felt that these flights did not contribute any
information about the temperature effect. (p. 145)

Dr. Pablo E. Verde

247

Lecture 8 - Regression Models

Practical
Exercise: a nonconjugate nonlinear model
Volume 2 in WinBUGS help: Dugongs
Originally, Carlin and Gelfand (1991) consider data on length yi and age xi
measurements for 27 dugongs (sea cows) and use the following nonlinear growth curve
with no inflection point and an asymptote as xi tends to infinity:
yi

Normal(i , 2 )

= xi ,

where , > 0 and (0, 1).

Dr. Pablo E. Verde

248

Lecture 8 - Regression Models

Practical
I

Open the file in WinBUGS and run the model

Change the last observation 2.57 to 2.17 and run the model. Did you see different
results?

Modify the WinBUGS code as:


#

y[i] ~ dnorm(mu[i], tau)


y[i] ~ dt(mu[i], tau, df)

# fit robust t distribution

...
df <- 5

Run the model with this change and compare results

Dr. Pablo E. Verde

249

Lecture 9: Introduction to Hierarchical Models

Lecture 9:
Introduction to Hierarchical Models

Dr. Pablo E. Verde

250

Lecture 9: Introduction to Hierarchical Models

Learning from the information of the others ...

During the previous lectures we have seen several examples, where we performed
a single statistical analysis.

Now, suppose that instead of having a single analysis, you have several ones.
Each one giving independent results from each other.

One might ask:


Should we combine these independent results in a single global analysis?

Dr. Pablo E. Verde

251

Lecture 9: Introduction to Hierarchical Models

The general answer is:


YES!!
I

Hierarchical models give us the statistical framework to combine multiple sources


of information in a single analysis.

Informally, by combining several individual results, each individual analysis is


improve by the information we have from the others. This information is
summarized in an Empirical Prior distribution.

From the classical point of view, Charles Stein demonstrated in the 50s a
fundamental theoretical result, which shows the benefit of combining individual
results in a single analysis.

Charles Stein (22nd of March 1920) ... and he still working everyday!

Dr. Pablo E. Verde

252

Lecture 9: Introduction to Hierarchical Models

Hierarchical Models
It becomes common practice to construct statistical models which reflects the
underline complexity of the problem under study. Such us different patterns of
heterogeneity, dependence, miss-measurements, missing data, etc.
Statistical models which reflect complexity of the data involve multiple parameters.
Examples are:
I

Study effect in meta-analysis

Subject effect in growth curves models

Identification of hidden process in sequence of observations

Relative risks of disease outcome in different areas/time periods

Frailty effects in correlated survival data

...

Dr. Pablo E. Verde

253

Lecture 9: Introduction to Hierarchical Models

Statistical Inference for Multiple Parameters


How to make inference on multiple parameters {1 , . . . , N } measured in N units
(person, centers, areas, ...) which are related or connected by the structure of the
problem?
We can identify three different scenarios
1. Identical parameters: All the s are identical, i.e. all the data can be pooled
and the individual units ignored.
2. Independent parameters: All the s are entirely unrelated, i.e. the results from
each unit can be analyzed independently.
3. Exchangeable parameters: The s are assumed to be similar in the sense that
the labels convey no information.

Dr. Pablo E. Verde

254

Lecture 9: Introduction to Hierarchical Models

Exchangeability
I

The assumption of exchangeable units is mathematically equivalent to assuming


that the s are drawn at random from some population distribution.

Exchangeability express formally the idea that we find no systematic reason to


distinguish the individual random variables
X1 , . . . , Xn .

We assume that X1 , . . . , Xn are exchangeable if the probability that we assign to


any set of potential outcomes,
p(x1 , . . . , xn ),
is unaffected by permutations of the labels attached to the variables.

Dr. Pablo E. Verde

255

Lecture 9: Introduction to Hierarchical Models

Example:
Suppose that X1 , X2 , X3 are the outcomes of the three patients enrolled in a clinical
trial. Were Xi = 1 indicates positive reaction to a treatment and Xi = 0 no reaction.
We may judge
p(X1 = 1, X2 = 0, X3 = 1) = p(X2 = 1, X1 = 0, X3 = 1) = p(X1 = 1, X3 = 0, X2 = 1)

i.e. the probability of getting 2 positive outcomes is NOT affected by the


particular patient on which the positive outcome comes.

Note that this is a very strong assumption. In reality we may expect that patients
may behave in different way. For example they may fail to comply a treatment.

Note that exchangeability does not mean we believe that X1 , X2 , . . . Xn are


independent!

Dr. Pablo E. Verde

256

Lecture 9: Introduction to Hierarchical Models

Exchangeability and Hierarchical Models


Suppose yij is outcome for individual j, unit i, with unit-specific parameter i
I

Assumption of partial exchangeability of individuals within units is represented by


yij

p(yij |i )

p(i |)

Assumption of exchangeability of units can be represented by


i

p(i |)

p()
where, p() can be considered as a common prior for all units, but with
unknown parameters.

Dr. Pablo E. Verde

257

Lecture 9: Introduction to Hierarchical Models

Assuming that 1 , . . . , N are drawn from some common prior distribution whose
parameters are unknown is known as a hierarchical or multilevel model.

Bayesian statistical inference is based on:


p(1 , . . . , N , |y) p() p(1 , . . . , N |) p(y|1 , . . . , N , )

The dimension of (1 , . . . , N ) could be very large in practice.

Empirical Bayes techniques omit p(). They are useful in practice, but they may
biased variability estimates be reusing y to estimate .

Dr. Pablo E. Verde

258

Lecture 9: Introduction to Hierarchical Models

Exchangeability some further comments

Note that there does not need to be any actual sampling - perhaps these N units
are the only ones that exists - this is very common in meta-analysis.

The probability structure is a consequence of the belief in exchangeability rather


than a physical randomization mechanism.

We emphasis that an assumption of exchangeability is a judgement based on our


knowledge of the context.

Dr. Pablo E. Verde

259

Lecture 9: Introduction to Hierarchical Models

Hierarchical models and shrinkage


Suppose in each unit we observe a response yi assumed to have a Normal likelihood
yi N(i , si2 )
Unit means i are assumed to be exchangeable, and to have a Normal distribution
i N(, 2 )
where and 2 are hyper-parameters for the moment assumed known.
After observing yi , Bayes theorem gives
i |yi N(wi + (1 wi ) yi , (1 wi ) si2 )
where wi = si2 /(si2 + 2 ) is the weight given to the prior mean.

Dr. Pablo E. Verde

260

Lecture 9: Introduction to Hierarchical Models

Hierarchical models and shrinkage

An exchangeable model therefore leads to the inferences for each unit having
narrowed intervals that if they are assumed independent, but shrunk towards the
prior mean response.

wi controls the shrinkage of the estimate towards , and the reduction in the
width of the interval for i

Shrinkage (wi ) depends on precision of the individual unit i relative to the


variability between units.

Dr. Pablo E. Verde

261

Lecture 9: Introduction to Hierarchical Models

Example: Toxoplasmosis data


I

These data present the relation between rainfall and the proportions of people
with toxoplasmosis for 34 cities in El Salvador (Efron 1986)

The question is how to combine results of 34 different cities in a single model?

The data have been used to illustrate non-linear relationship between the amount
of rain and toxoplasmosis prevalence

We illustrate a series of hierarchical models

...we take some preliminary visualization

Dr. Pablo E. Verde

262

Lecture 9: Introduction to Hierarchical Models

Toxoplasmosis Rates: El Salvador 34 cities


0.0
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1

0.2

0.4

0.6

0.8

1.0

Dr. Pablo E. Verde

263

Lecture 9: Introduction to Hierarchical Models

1.0
0.8

Toxoplasmosis Rate

0.6

0.4

Toxoplasmosis Rate

0.6

0.8

1800

2000

0.2

0.0

0.2
0.0

2200

Amount of Rain (mm)

Dr. Pablo E. Verde

1600

0.4

1.0

Toxoplasmosis Rates: El Salvador 34 cities

20

40

60

80

Sample Size

264

Lecture 9: Introduction to Hierarchical Models

Pooled approach:
To start we ignore possible (complex) relation between rainfall and proportion and we
fit same beta-binomial model to all the cities
34
Y
i=1

p(ri |, ni ) =

34
Y

Binomial(ni , )

i=1

p() = Beta(a, b)
34
Y
p(|r , n)
p(ri |, ni )p()
i=1

= Beta a +

X
i

!
X
ri , b +
(ni ri )
i

Dr. Pablo E. Verde

265

Lecture 9: Introduction to Hierarchical Models

Some comments:
I

Now, is it reasonable to assume common probability of Toxoplasmosis for


each city?

The beta-binomial model assumes that each outcome is independent and


identically distributed according to the binomial probability distribution with
parameter

Does this model adequately describe the random variation in outcomes for each
city?

Are the cities rates more variable that our model assumes ?

Question: How to model the excess of variation in the data?

Dr. Pablo E. Verde

266

Lecture 9: Introduction to Hierarchical Models

Modeling the excess of variation


Model 1:
Combining information with a Generalized Linear Mixed Model (GLMM)

ri

Binomial(ni , i )

logit(i ) N(, 2 )
N(0, 100)
Uniform(0.01, 10)
This model combines information between cities in a single model.

Dr. Pablo E. Verde

267

Lecture 9: Introduction to Hierarchical Models

Model 2:
Modeling each city rate independently

ri

Binomial(ni , i )

logit(i ) N(0, 100).


This model does NOT use information between cities. Each rate i is estimated
independently.

Dr. Pablo E. Verde

268

Lecture 9: Introduction to Hierarchical Models

In WinBUGS we run both models simultaneously as follows:


model{
for( i in 1 : I ) {
r[i] ~ dbin(p[i,1],n[i])
logit(p[i,1]) <- a[i]
a[i] ~ dnorm(alpha1, tau1)
r2[i] ~ dbin(p[i,2],n[i])
logit(p[i,2]) <- b[i]
b[i] ~ dnorm(0, 0.01)
}
tau1 <- 1/(sigma1*sigma1)
sigma1 ~ dunif(0.01, 5)
alpha1 ~ dnorm(0,0.01)
}

# Exchangeable
# Copy of r[i]
# Independent
# Priors

Dr. Pablo E. Verde

269

Lecture 9: Introduction to Hierarchical Models

#data
list(I=34,
r=c(2,3,1,3,2,3,2,7,3,8,7,0,15,4 ,0,6 ,0,33,4,5 ,2 ,0,8,
41,24,7, 46,9,23,53,8,3,1,23),
r2=c(2,3,1,3,2,3,2,7,3,8,7,0,15,4 ,0,6 ,0,33,4,5 ,2 ,0,8,41,24,7, 46,
9,23,53,8,3,1,23),
n=c(4,20,5,10,2,5,8,19,6,10,24,1,30,22,1,11,1,54,9,18,12,1,11,77,51,
16,82,13,43,75,13,10,6,37)
)

Dr. Pablo E. Verde

270

Lecture 9: Introduction to Hierarchical Models

Toxoplasmosis Rates: El Salvador 34 cities


0.0
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1

0.2

0.4

0.6

0.8

1.0

Dr. Pablo E. Verde

271

Lecture 9: Introduction to Hierarchical Models

0.0

0.2

0.4

0.6

0.8

1.0

Schrikage effect

ML

Dr. Pablo E. Verde

Bayes

272

Lecture 9: Introduction to Hierarchical Models

Results:
DIC for Model 1 (GLMM) = 145.670
DIC for Model 2 (independent) =176.504
I

Clearly Model 1 which combines information between cities is the winer.

The use of Model 2 has the effect of reduce the variability between cities

The use of Model 2 gives better results at the level of the city.

Clear conclusion: We should take advantage of modeling simultaneously multiple


results!

Dr. Pablo E. Verde

273

Lecture 9: Introduction to Hierarchical Models

Example: Toxoplasmosis data continue ...


Two important questions are:
I

Which is the influence of the amount of rain in toxoplasmosis prevalence?

Which is the influence of the number of participants in each town?

Model 3:
ri

Binomial(ni , i )

logit(i ) = ai + 1 (xi,1 x1 ) + 2 (xi,1 x1 )


ai

N(, 2 )

Uniform(0.01, 10)
, 1 , 2 N(0, 100),

Dr. Pablo E. Verde

274

Lecture 9: Introduction to Hierarchical Models

In WinBUGS:
model
{
for(i in 1:N)
{log.n[i] <- log(n[i])}
for( i in 1 : N ) {
r[i] ~ dbin(p[i], n[i])
a[i] ~ dnorm(alpha,tau)
logit(p[i]) <- a[i] + beta1 * (rain[i]-mean(rain[])) +
beta2 * (log.n[i] - mean(log.n[]))
}
# Priors
tau <- 1/(sigma*sigma)
sigma ~ dunif(0.01, 5)
alpha ~ dnorm(0,0.01)
beta1 ~ dnorm(0, 0.01)
beta2 ~ dnorm(0, 0.01)}
Dr. Pablo E. Verde

275

Lecture 9: Introduction to Hierarchical Models

Posterior for Beta_2

0.0

0.5

100

1.0

1.5

Density

300
200

Density

2.0

400

2.5

500

3.0

Posterior for Beta_1

0.003

0.000
beta1

0.002

0.2

0.2

0.6
beta2

Posterior of 1 clearly indicates no linear influence of the amount of rain. Posterior of


2 shows a clear trend.
Dr. Pablo E. Verde

276

Lecture 9: Introduction to Hierarchical Models

Example: Ranking of the eighteen baseball players (Efron and Morris, 1977)
How can we rank the players ability ? Who was the best player of the season 1970?
We use the ranking function in WinBUGS to answer this question.
Name
Clemente
Robinson
..
.

hit/AB
18/45
17/45
..
.

Total Squared error

Observed Avg (ML)


.400
.378
..
.

TRUTH
.346
.298
..
.

.077

James-Stein
.290
.286
..
.
0.022

Dr. Pablo E. Verde

277

0.35
0.30
0.20

0.25

Batting Average

0.40

0.45

Lecture 9: Introduction to Hierarchical Models

ML

Dr. Pablo E. Verde

Bayes

278

Lecture 9: Introduction to Hierarchical Models

Posteriors: Prob(Hits)
0.1
Max Alvis
Thurman Munson
Bert Campaneris

0.3

George Scott

Billy Williams

Del Unser

Ron Swoboda

Ron Santo

Jim Spencer

Ken Berry

Frank Howard
Frank Robinson
Roberto Clemente

0.6

Rico Petrocelli

Jay Johnstone

0.5

Don Kessinger

0.4

Ellie Rodriguez

Luis Alvarado

Dr. Pablo E. Verde

0.2

279

Lecture 9: Introduction to Hierarchical Models

model {
for( i in 1 : N ) {
r[i] ~ dbin(p[i],n[i])
logit(p[i]) <- b[i]
b[i] ~ dnorm(mu, tau)
p.rank[i] <- rank(p[], i)
}
# hyper-priors
mu ~ dnorm(0.0,1.0E-6)
sigma~ dunif(0,100)
tau<-1/(sigma*sigma)
}
# Data
list(r = c(145, 144, 160, 76, 128, 140, 167, 41, 148, 57, 83, 79, 142,
152, 52, 168, 137, 21),
n= c(412, 471, 566, 320, 463, 511, 631, 183, 555, 245, 322, 315, 480, 583,
231, 603, 453, 115), N=18 )
Dr. Pablo E. Verde

280

Lecture 9: Introduction to Hierarchical Models

Rank distributions for Robinson, Howard and Clemente.


Dr. Pablo E. Verde

281

Lecture 9: Introduction to Hierarchical Models

Left panel: posterior distributions of ranks. Right panel: posterior distributions of


performances.
Dr. Pablo E. Verde

282

Lecture 9: Introduction to Hierarchical Models

Summary on hierarchical models


I

Hierarchical models allows to borrow strength across units

Posteriors distribution of i for each unit borrows information from the likelihood
contributions for all other units. Estimation is more efficient.

MCMC allows considerable flexibility over choice of random effects distribution


(not restricted to normal random errors)

MCMC allows to make inference on difficult questions, e.g. ranking estimation of


random effects

Easy to extend to more complicated models (e.g. non-linear repeated


measurements, etc.)

Dr. Pablo E. Verde

283

Lecture 9: Introduction to Hierarchical Models

Summary of the course


Bayesian statistics:
I

A new interpretation of probability as a subjective mental construct

Relationship with classical statistics

Uses of prior distribution as: sensitivity analysis, prior predictions, model building,
etc.

Making inference using MCMC

Thinking different about regression models

Different modeling tools: DAGs, DIC and WinBUGS

Introduction to hierarchical modeling

Dr. Pablo E. Verde

284

Lecture 9: Introduction to Hierarchical Models

Finally

Nice to see that most people attended and survived the course!!
Hope that everybody will be now enthusiastic modern Bayesians ;-)
Thank you very much ... Muchas gracias !!!

Dr. Pablo E. Verde

285

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy