Bayesian Talk
ca505@york.ac.uk
Part 1
Talk overview
The Frequentist paradigm has been the mainstay of probability theory during the
19th and 20th centuries, with important contributions by e.g. Jerzy Neyman, Egon
Pearson, John Venn, R.A. Fisher, and Richard von Mises.
Bayes' Theorem
Bayes' Theorem can be derived easily from the expression of the joint
probability of two events A and B:
Let p(A) denote the probability that event A will occur, let p(B) denote the
probability that event B will occur, and let p(A,B) denote the probability that
both of the events occur.
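One way to see this (a standard identity, written out here for completeness): the joint probability can be factorised in either order, and equating the two factorisations and dividing by p(A) gives the theorem below.

p(A, B) = p(A)\, p(B \mid A) = p(B)\, p(A \mid B)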
p(B \mid A) = \frac{p(B)\, p(A \mid B)}{p(A)}

In Bayesian inference, the unknown quantity x plays the role of B and the observed data play the role of A:

p(x \mid \mathrm{data}) = \frac{p(x)\, p(\mathrm{data} \mid x)}{p(\mathrm{data})}

Here p(x) is the prior, p(data | x) is the likelihood, and p(x | data) is the posterior.
The denominator p(data) does not depend on x; it is a normalising constant and can usually be ignored.
This approach has been compared to learning in humans, where experience supports a continual updating of a person's belief system.
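In the proportional form this reads: the posterior is proportional to the prior times the likelihood. Applying the theorem datum by datum (a standard identity, added here for illustration) shows how each posterior acts as the prior for the next observation:

p(x \mid \mathrm{data}) \propto p(x)\, p(\mathrm{data} \mid x)

p(x \mid d_1, d_2) \propto \big[\, p(x)\, p(d_1 \mid x) \,\big]\, p(d_2 \mid x, d_1), \qquad \text{where } p(x)\, p(d_1 \mid x) \propto p(x \mid d_1)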
Definition of “Probability”
FREQUENTIST
The “probability” of an event A occurring (or of a quantity taking a value in
a given interval) is a frequency. Imagine many (hypothetical or actual)
circumstances in which the data have been observed. The proportion of
circumstances in which event A occurs (out of all circumstances) is the
“probability” of A. This probability is objective.
BAYESIAN
Here, “probability” expresses a degree of belief. The value of each parameter is unknown; the data are known, they have been observed. A Bayesian statistician evaluates how probable different values of the underlying quantities are, given the observed data. Thus, statements can be made about the probability of the unknown quantity taking a value in a certain credibility interval.
Hypothesis testing
FREQUENTIST
Given two hypotheses, H0 and H1, calculate the probability of observing the data (or more extreme data) if H0 is true. If this probability (the p-value) is low, reject H0.
Because a hypothesis is either true or false (this is just not known)
and only the likelihood of observing the data is calculated, a
Frequentist cannot assign a probability to each hypothesis.
BAYESIAN
A Bayesian, in contrast, can make direct probability statements about the hypotheses (or parameters) themselves, based on their posterior distribution. As a simple illustration, consider an unknown proportion z with a flat prior, p(z) = Beta(1, 1), and binomially distributed data.
In this simple example, when the prior is from a particular family (Beta) and the likelihood of the data is also from a particular family (Binomial), the posterior distribution also belongs to a particular family (Beta). The Beta prior and the Binomial likelihood are called conjugate.
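As a concrete sketch of the conjugate update (a standard result, the notation here is mine): with a Beta(a, b) prior for z and r successes observed in n trials,

z \sim \mathrm{Beta}(a, b), \quad r \mid z \sim \mathrm{Binomial}(n, z) \;\;\Longrightarrow\;\; z \mid r \sim \mathrm{Beta}(a + r,\; b + n - r)

With the flat Beta(1, 1) prior above, the posterior is simply Beta(1 + r, 1 + n - r).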
Priors - again...
Different choices of prior distributions lead to different posterior
distributions and thus to different credibility intervals.
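A small numerical illustration (hypothetical data, using the conjugate update above): suppose r = 7 successes are observed in n = 10 trials.

\mathrm{Beta}(1, 1) \text{ prior} \;\Rightarrow\; \mathrm{Beta}(8, 4) \text{ posterior, mean } 8/12 \approx 0.67

\mathrm{Beta}(10, 10) \text{ prior} \;\Rightarrow\; \mathrm{Beta}(17, 13) \text{ posterior, mean } 17/30 \approx 0.57

The same data lead to different posteriors, and hence different credibility intervals, under the two priors.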
Model Selection
In Bayesian statistics, it is relatively straightforward to evaluate different
explanations for a data-set (nested models or totally different models).
The models are all evaluated simultaneously, together with additional parameters m_i for the probability of each of the models.
The posteriors for the parameters m_i summarise how well each of the competing models fits the data. Depending on the application, either a single most suitable model is selected, or predictions are made from all models simultaneously, using the posterior values of the m_i as weights (model averaging).
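A minimal sketch of model averaging, assuming the posterior model probabilities and each model's prediction have already been extracted from a fitted run (the numbers are illustrative placeholders):

# Model averaging: weight each model's prediction by its posterior model probability.
# Both lists below are illustrative placeholders, not results from the talk.
post_model_prob = [0.55, 0.30, 0.15]   # posterior probabilities of models 1..3 (sum to 1)
predictions     = [0.42, 0.47, 0.39]   # each model's posterior mean prediction

model_averaged = sum(w * p for w, p in zip(post_model_prob, predictions))
print(model_averaged)                  # model-averaged prediction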
Summary
FREQUENTIST vs BAYESIAN
Interval estimates: a confidence interval (a statement about repeated sampling) versus a credibility interval (a probability statement about the parameter itself).
Good Bayesian text book that starts with a comparison of Bayesian and
Frequentist methods:
The problem is that, for each possible set of parameter values, Bayes'
Theorem gives the posterior probability, but if the parametric form of the
distribution cannot be recognised, there is no obvious method for
calculating e.g. its mean value, or for sampling from it.
p(x \mid \mathrm{data}) \propto p(x)\, p(\mathrm{data} \mid x)
The posterior probabilities are calculated for both x_i and x*. Depending on the posterior probability of x* relative to that of x_i, an acceptance probability is calculated, and the chain either moves to x* (x_{i+1} = x*) or stays at its current value (x_{i+1} = x_i).
Ergodic theory ensures that, in the limit, the distribution of the values in the chain C converges to the posterior distribution of interest. The beginning of the chain is discarded because it is dominated by the arbitrary starting values (“burn-in”).
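A minimal Metropolis-type sketch in Python of the scheme described above, assuming a generic unnormalised posterior (here a standard normal kernel, purely for illustration) and a symmetric random-walk proposal:

import math
import random

def unnorm_posterior(x):
    # Stand-in for p(x) * p(data | x); a standard normal kernel, for illustration only.
    return math.exp(-0.5 * x * x)

def metropolis(n_iter=10000, step=1.0, x0=0.0):
    chain = [x0]
    x = x0
    for _ in range(n_iter):
        x_star = x + random.gauss(0.0, step)                  # propose a candidate x*
        accept = min(1.0, unnorm_posterior(x_star) / unnorm_posterior(x))
        if random.random() < accept:
            x = x_star                                        # move to x* ...
        chain.append(x)                                       # ... or stay at the current value
    return chain

chain = metropolis()
samples = chain[1000:]                                        # discard the burn-in
print(sum(samples) / len(samples))                            # estimate of the posterior mean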
Example
1. Suggest a candidate for a.
(In this example, a* is accepted.)
2. Suggest a candidate for b.
(In this example, b* is accepted.)
3. The chain moves to [a*, b*].
Gibbs sampling
Gibbs sampling is a special case of MCMC. Here, the posterior
parameter space is divided into blocks of parameters, such that for each
block, the conditional posterior probabilities are known.
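A minimal Gibbs sketch in Python for a toy target, a bivariate normal with correlation rho, chosen only because both full conditionals are known normal distributions (so each “block” is a single coordinate):

import random

rho = 0.8                              # correlation of the toy bivariate normal target
cond_sd = (1.0 - rho ** 2) ** 0.5      # standard deviation of each full conditional

x, y = 0.0, 0.0
draws = []
for _ in range(10000):
    # Update each block by sampling from its conditional posterior given the other block.
    x = random.gauss(rho * y, cond_sd)   # x | y ~ Normal(rho * y, 1 - rho^2)
    y = random.gauss(rho * x, cond_sd)   # y | x ~ Normal(rho * x, 1 - rho^2)
    draws.append((x, y))

kept = draws[1000:]                      # discard the burn-in before summarising
print(sum(d[0] for d in kept) / len(kept))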
Sequential Importance Sampling (SIS)
The posterior density function is split into factors, and at each step of the algorithm all particles are resampled according to weights derived from those factors. For example, the first step might weight the particle sample according to the Bayesian prior. The second step might weight the updated set of particles according to the factor that corresponds to the first datum. The next resampling takes the next datum into account, and so on, until the data are used up.
[Figure: a particle sample drawn from the prior for a parameter a (values such as a = 2.5, 3.1, -1, 4, 2.7, 1.7, ...), followed by weighted resampling.]
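A minimal Python sketch of the weighted-resampling idea for a single parameter a, assuming (for illustration only) a normal prior and one normally distributed datum; here the initial particles are drawn directly from the prior rather than weighted by it:

import math
import random

def normal_pdf(value, mean, sd):
    return math.exp(-0.5 * ((value - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# 1. Draw an initial particle set from the prior for a (illustrative prior: Normal(2, 1.5)).
particles = [random.gauss(2.0, 1.5) for _ in range(5000)]

# 2. Weight each particle by the likelihood factor for the first datum (illustrative datum).
datum = 2.4
weights = [normal_pdf(datum, a, 1.0) for a in particles]

# 3. Resample the particles in proportion to their weights;
#    steps 2-3 are repeated for each further datum until the data are used up.
particles = random.choices(particles, weights=weights, k=len(particles))
print(sum(particles) / len(particles))   # approximate posterior mean of a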
Implementations
For MCMC, many ready-made implementations exist. A good place to
start is the package OpenBUGS (ongoing development of WinBUGS),
which implements the Gibbs and other samplers. With a familiar
Windows interface and a very general symbolic language to specify
models, OpenBUGS can solve most classes of Bayesian models.
For SIS, I am not aware of any ready-made packages, but there are
ongoing developments.
Hands-on Example
Here I demonstrate the use of OpenBUGS. I've made up this example –
but the basic approach carries through to real applications in health
economics.
Statistical model
An evidence synthesis model is required to combine the information
from the 8 RCTs.
The model yields a probability for each trial arm and the data are event counts, so the only sensible choice of sampling distribution is the binomial:
Sampling distribution:

r_i^A \sim \mathrm{Binomial}(p_i^A, n_i^A), \qquad r_i^B \sim \mathrm{Binomial}(p_i^B, n_i^B)

By now we have 20 unknown parameters (8 t_i, 8 μ_i, M, σ_M, T and σ_T).
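Written out in full (my notation, but matching the OpenBUGS code on the next slides, where dnorm is parameterised by precision), the model for trials i = 1, ..., 8 is:

r_i^A \sim \mathrm{Binomial}(p_i^A, n_i^A), \qquad r_i^B \sim \mathrm{Binomial}(p_i^B, n_i^B)

\mathrm{logit}(p_i^A) = \mu_i, \qquad \mathrm{logit}(p_i^B) = \mu_i + t_i

\mu_i \sim \mathrm{Normal}(M, \sigma_M^2), \qquad t_i \sim \mathrm{Normal}(T, \sigma_T^2)

M \sim \mathrm{Normal}(0, 100^2), \qquad T \sim \mathrm{Normal}(0, 100^2), \qquad \sigma_M, \sigma_T \sim \mathrm{Uniform}(0, 2)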
OpenBUGS
Let us fit this Bayesian model using OpenBUGS.
model {
  for (i in 1:N) {
    logit(pA[i]) <- mu[i]          # logit(p_i^A) = mu_i
    logit(pB[i]) <- mu[i] + t[i]   # logit(p_i^B) = mu_i + t_i
    rA[i] ~ dbin(pA[i], nA[i])     # r_i^A ~ Binomial(p_i^A, n_i^A)
    rB[i] ~ dbin(pB[i], nB[i])     # r_i^B ~ Binomial(p_i^B, n_i^B)
    mu[i] ~ dnorm(M, precM)        # mu_i ~ Normal(M, sigma_M^2)
    t[i] ~ dnorm(T, precT)         # t_i ~ Normal(T, sigma_T^2)
  }
  M ~ dnorm(0, 0.0001)             # vague priors for the means
  T ~ dnorm(0, 0.0001)
  precM <- 1/pow(sigmaM, 2)        # dnorm is parameterised by precision = 1/variance
  precT <- 1/pow(sigmaT, 2)
  sigmaM ~ dunif(0, 2)             # uniform priors for the standard deviations
  sigmaT ~ dunif(0, 2)
}
OpenBUGS (2)
The data are specified in a separate section so that they can be entered
or changed easily.
model {
  for (i in 1:N) {
    logit(pA[i]) <- mu[i]
    logit(pB[i]) <- mu[i] + t[i]
    rA[i] ~ dbin(pA[i], nA[i])
    rB[i] ~ dbin(pB[i], nB[i])
    mu[i] ~ dnorm(M, precM)
    t[i] ~ dnorm(T, precT)
  }
  M ~ dnorm(0, 0.0001)
  T ~ dnorm(0, 0.0001)
  precM <- 1/pow(sigmaM, 2)
  precT <- 1/pow(sigmaT, 2)
  sigmaM ~ dunif(0, 2)
  sigmaT ~ dunif(0, 2)
}

#data
list(N=8,
     nA=c(120,15,84,398, 80,40, 97,121),
     rA=c( 65, 9,39,202, 45,17, 48, 63),
     nB=c(120,16,45,402, 77,20,100,115),
     rB=c( 81,15,29,270, 52,12, 68, 80))
OpenBUGS (3)
The OpenBUGS window can look like this.
OpenBUGS (4)
In this example, OpenBUGS explores the model's posterior reasonably well.
OpenBUGS (5)
Here's a screenshot from another model, in which the sampler did not
converge (just to give you an idea of what to look for...).
Model convergence
WinBUGS and OpenBUGS provide a few formal diagnostics to check for
convergence and performance of the sampler, for example the Brooks-
Gelman-Rubin diagram and plots of within-chain autocorrelation.
High “MC_error” can also indicate convergence problems.
Example continued
When you are satisfied with the posterior sampling, you can generate any desired summary statistic for your posterior. MC_error is another indication of how well the sampler performed.
Example continued
But what is the probability P that treatment B is the cost-effective choice
at a willingness-to-pay (WTP) of λ=SEK 50,000?
Let's assume that the underlying baseline and treatment effect would apply to the target population, i.e. in the target population logit(p^A) = M and logit(p^B) = M + T. We calculate the net benefit (NB) of the treatments, using the costs C and utilities U:

NB^A = \big[\, p^A\, U_{\mathrm{free}} + (1 - p^A)\, U_{\mathrm{symptoms}} \,\big] \cdot \lambda - C^A

NB^B = \big[\, p^B\, U_{\mathrm{free}} + (1 - p^B)\, U_{\mathrm{symptoms}} \,\big] \cdot \lambda - C^B

The probability P is given by P = \Pr(NB^B > NB^A).
Example continued
But what is the probability P that treatment B is the cost-effective choice
at a willingness-to-pay (WTP) of λ=SEK 50,000?
Numerically, we simply look at all the draws from the posterior and check which of them fulfill the condition. This proportion is the posterior probability P. There is no need for any further tests.

model {
  ...
  logit(PA) <- M
  logit(PB) <- M + T
  Uf ~ dbeta(9, 1)                 # utility when symptom-free
  Us ~ dbeta(5, 5)                 # utility with symptoms
  NBA <- (PA*Uf + (1-PA)*Us)*WTP - CA
  NBB <- (PB*Uf + (1-PB)*Us)*WTP - CB
  P <- step(NBB - NBA)             # 1 if NBB >= NBA; its posterior mean is P
}
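The same calculation can be done outside BUGS as a simple proportion over the posterior draws; a minimal Python sketch, assuming the sampled values of NBA and NBB have been exported (the arrays are illustrative placeholders):

# Posterior draws of the two net benefits, e.g. exported from the sampler.
# The short arrays below are illustrative placeholders only.
nb_a = [12000.0, 15500.0, 9800.0, 14100.0]
nb_b = [13500.0, 15200.0, 11900.0, 16000.0]

# P = proportion of posterior draws in which treatment B has the higher net benefit.
p_b_cost_effective = sum(b > a for a, b in zip(nb_a, nb_b)) / len(nb_a)
print(p_b_cost_effective)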
Example continued
OpenBUGS can also produce graphical output for all quantities
of interest.
For example, here are the posterior densities for the net benefits NBA
and NBB.
[Figure: posterior densities of NBA and NBB, each based on 30,000 posterior samples.]
They both show quite wide distributions and their support on the x-axes
overlaps substantially.
Summary
Frequentist modelling centres on the likelihood function, i.e. how likely the data are given a particular model. Bayesian modelling combines the likelihood with a prior distribution to give the posterior distribution of the parameters.
Both can be used equally well to fit models and to make inferences on model parameters.
Numerical methods for fitting Bayesian models require some care and
experience.