Bayesian Modelling Tuts-4-9
Theorem 1 (Bayes’ Theorem) Let A be any event. Then for any 1 ≤ k ≤ K we have

    P(Bk | A) = P(A | Bk) P(Bk) / P(A) = P(A | Bk) P(Bk) / Σ_{j=1}^{K} P(A | Bj) P(Bj).
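As a quick numerical illustration of Theorem 1, the discrete update can be computed directly. The setup below (three machines B1, B2, B3 producing items and a defect event A) and its probabilities are invented purely for the example:

```python
# Discrete Bayes' Theorem: P(B_k | A) = P(A | B_k) P(B_k) / sum_j P(A | B_j) P(B_j)
# Hypothetical example: three machines produce items; A = "item is defective".
priors = [0.5, 0.3, 0.2]          # P(B_k): share of items from each machine
likelihoods = [0.01, 0.02, 0.05]  # P(A | B_k): defect rate of each machine

# Marginal probability of a defective item: P(A) = sum_j P(A | B_j) P(B_j)
p_A = sum(l * p for l, p in zip(likelihoods, priors))

# Posterior probability that each machine produced the defective item
posteriors = [l * p / p_A for l, p in zip(likelihoods, priors)]
print(posteriors)  # the three posteriors sum to 1
```

Note how the least-used machine can still be the most likely culprit once its higher defect rate is taken into account.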
Of course there is also a continuous version of Bayes’ Theorem with sums replaced by integrals. Bayes’
Theorem provides us with a simple rule for updating probabilities when new information appears. In Bayesian
modeling and statistics this new information is the observed data and it allows us to update our prior beliefs
about parameters of interest which are themselves assumed to be random variables.
Writing π(θ) for the prior distribution over the parameters and p(y | θ) for the likelihood of the data, the joint distribution is

    p(θ, y) = π(θ) p(y | θ)

and we can integrate the joint distribution to get the marginal distribution of y, namely

    p(y) = ∫_θ π(θ) p(y | θ) dθ.
We can compute the posterior distribution via Bayes’ Theorem and obtain

    π(θ | y) = π(θ) p(y | θ) / p(y) = π(θ) p(y | θ) / ∫_θ π(θ) p(y | θ) dθ.    (1)
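When the integral in the denominator of (1) has no closed form, the posterior can still be approximated on a grid. A minimal sketch with an invented setup (uniform prior on (0, 1), binomial likelihood with n = 10 trials and y = 7 successes):

```python
import math

# Grid approximation of the posterior pi(theta | y) in (1).
# Hypothetical setup: uniform prior on (0,1), y = 7 successes in n = 10 Bernoulli trials.
n, y = 10, 7
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoint grid over (0, 1)
prior = [1.0 for _ in grid]                      # pi(theta) = 1 (uniform)
like = [math.comb(n, y) * t**y * (1 - t)**(n - y) for t in grid]  # p(y | theta)

unnorm = [p * l for p, l in zip(prior, like)]    # pi(theta) p(y | theta)
p_y = sum(unnorm) / len(grid)                    # Riemann sum for the denominator p(y)
posterior = [u / p_y for u in unnorm]            # pi(theta | y) evaluated on the grid

# With a uniform prior the exact posterior is Beta(8, 4), whose mean is 8/12.
post_mean = sum(t * d for t, d in zip(grid, posterior)) / len(grid)
```

Grid approximation is only feasible in low dimensions, which is precisely why the sampling methods discussed later become necessary.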
A Little History
Bayesian modeling has a long history going back to the discovery of Bayes’ Theorem by the Reverend Thomas
Bayes in the 18th century and its later but independent discovery by Laplace towards the end of the 18th
century. Both Bayes and Laplace were objective Bayesians and were skeptical about conveying too much
information through the prior distribution. For over a century only the Bayesian paradigm existed but this
changed in the early 20th century with major contributions from Fisher (maximum likelihood estimation),
Neyman and Wald. Neyman developed the frequentist approach to statistics whereby statistical procedures
are evaluated w.r.t. a probability distribution over all possible data-sets. This approach gave rise to the
concepts of confidence intervals and hypothesis testing as well as the (somewhat awkward) interpretation
of their results. Despite the work of de Finetti, Savage and others who continued to push the Bayesian
paradigm, the frequentist and maximum likelihood approaches held sway over most of the 20th century.
However, Bayesian methods have made a strong comeback in recent decades with the advent of new MCMC
algorithms and the impressive growth in computing power.
It’s interesting to note that frequentists take the parameter vector θ to be fixed (albeit unknown) and
introduce uncertainty over the possible data-sets y. In contrast the Bayesian approach treats the data-set
y as given and instead introduces uncertainty over θ. It’s worth noting, however, that the frequentist and
Bayesian approaches do share some ideas and statisticians today seem to be much more at ease moving
between these differing approaches. Moreover, some well-known and popular statistical procedures combine
elements of both approaches. For example, empirical Bayesian methods (see Section 8) are very Bayesian
in spirit but they are not strictly Bayesian and in fact the analysis of these methods is often frequentist in
nature.
Robert [38] provides a thorough introduction to Bayesian statistics as well as its connections to and differences
with the frequentist approach.
where the final equality follows because the data are assumed i.i.d. given θ. As its name suggests, the
posterior predictive distribution can be used to predict new values of y but it also plays an important role
in model checking and selection as we shall see in Section 7.
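The posterior predictive distribution can be simulated by first drawing θ from the posterior and then drawing a new observation given θ. A minimal Monte Carlo sketch in a beta-binomial setting, where the Beta(7, 2) posterior and the binomial size of 5 are hypothetical values chosen for illustration:

```python
import random

# Monte Carlo sketch of the posterior predictive p(y_new | y) = integral of
# p(y_new | theta) pi(theta | y) dtheta, in a hypothetical beta-binomial setting:
# posterior theta | y ~ Beta(7, 2), new observation y_new | theta ~ Bin(5, theta).
random.seed(0)
draws = []
for _ in range(20000):
    theta = random.betavariate(7, 2)                        # theta ~ pi(theta | y)
    y_new = sum(random.random() < theta for _ in range(5))  # y_new | theta ~ Bin(5, theta)
    draws.append(y_new)

# The predictive mean should be close to 5 * E[theta | y] = 5 * 7/9.
pred_mean = sum(draws) / len(draws)
```

Simulated draws like these are exactly what posterior predictive checks compare against the observed data.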
Much of Bayesian analysis is concerned with “understanding” the posterior π(θ | y). Note that
π(θ | y) ∝ π(θ)p(y | θ)
which is what we often work with in practice. Sometimes we can recognize the form of the posterior by
simply inspecting π(θ)p(y | θ). But typically we cannot recognize the posterior and cannot compute the
denominator in (1) either. In such cases approximate inference techniques such as MCMC are required. We
begin with a simple example.
    π(θ) = θ^{α−1} (1 − θ)^{β−1} / B(α, β),   0 < θ < 1.

We also assume that y | θ ∼ Bin(n, θ) so that p(y | θ) = (n choose y) θ^y (1 − θ)^{n−y}, y = 0, . . . , n. The posterior then
satisfies
    p(θ | y) ∝ π(θ) p(y | θ)
             = [θ^{α−1} (1 − θ)^{β−1} / B(α, β)] · (n choose y) θ^y (1 − θ)^{n−y}
             ∝ θ^{α+y−1} (1 − θ)^{β+n−y−1}
which we recognize as the Beta(α + y, β + n − y) distribution! See Figure 20.1 for a numerical example and
a visualization of how the data and prior interact to produce the posterior distribution.
We say the prior π(θ; α0) is a conjugate prior for the likelihood p(y | θ) if the posterior satisfies

    π(θ | y) = π(θ; α(y))

so that the observations influence the posterior only via a parameter change α0 → α(y). In particular, the form or type of the distribution is unchanged. In Example 1, for example, we saw that the beta distribution is conjugate for the binomial likelihood. Here are two further examples.
Figure 20.1 Prior and posterior densities for α = β = 2 and n = y = 5, respectively. The dashed vertical line shows the location of the posterior mode at θ = 6/7 ≈ 0.857.
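The conjugate update in Example 1 is easy to verify numerically. The sketch below uses the values from Figure 20.1 (α = β = 2, n = y = 5), under which the posterior should be Beta(7, 2) with mode 6/7:

```python
import math

# Beta(alpha, beta) prior + Bin(n, theta) likelihood => Beta(alpha + y, beta + n - y) posterior.
alpha, beta, n, y = 2, 2, 5, 5  # values from Figure 20.1

def beta_pdf(t, a, b):
    """Beta density: t^(a-1) (1-t)^(b-1) / B(a, b)."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t ** (a - 1) * (1 - t) ** (b - 1) / B

a_post, b_post = alpha + y, beta + n - y  # Beta(7, 2) here

# prior * likelihood should be *proportional* to the Beta(7, 2) density,
# so the ratio below must be the same constant at every theta.
ratios = []
for t in (0.2, 0.5, 0.9):
    unnorm = beta_pdf(t, alpha, beta) * math.comb(n, y) * t ** y * (1 - t) ** (n - y)
    ratios.append(unnorm / beta_pdf(t, a_post, b_post))

mode = (a_post - 1) / (a_post + b_post - 2)  # posterior mode, 6/7 here
```

The constant ratio is of course p(y) itself, the normalizing denominator in (1).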
Combining (3) and (4) we see the posterior takes the form

    p(θ | y, α, γ) ∝ e^{θ⊤u(y) − ψ(θ)} e^{θ⊤α − γψ(θ)} = e^{θ⊤(α + u(y)) − (γ+1)ψ(θ)}
                   = π(θ | α + u(y), γ + 1)
² See Gelman et al. [19] for these assumptions as well as a more detailed discussion of the Bernstein-von Mises Theorem.

Recall that we can evaluate

    π(θ | y) ∝ π(θ) p(y | θ)

without knowing the constant of proportionality given by the denominator in (1). This can be viewed as a specific instance of a more general sampling problem. Specifically, suppose we are given a distribution

    p(z) = p̃(z) / Zp    (5)

where p̃(z) ≥ 0 is easy to compute but Zp is (too) hard to compute. This very important situation arises in several contexts:
1. In Bayesian models where p̃(θ) := p(y | θ) π(θ) is easy to compute but Zp := p(y) = ∫_θ π(θ) p(y | θ) dθ, i.e. the denominator in (1), can be very difficult or impossible to compute. In this case Zp is often referred to as the marginal likelihood or evidence.
2. In models from statistical physics, e.g. the Ising model, we only know p̃(z) = e−E(z) , where E(z) is an
“energy” function. (The Ising model is an example of a Markov network or an undirected graphical
model.) In this case Zp is often known as the partition function.
3. Dealing with evidence in directed graphical models such as belief networks or directed acyclic graphs
(DAGs).
The sampling problem is the problem of simulating from p(z) in (5) without knowing the constant Zp . While
the well-known acceptance-rejection3 algorithm can be used, it is very inefficient in high dimensions and an
alternative approach is required. That alternative approach is Markov Chain Monte-Carlo (MCMC).
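As a minimal sketch of MCMC, the random-walk Metropolis algorithm below samples from p(z) in (5) using only the unnormalized density p̃(z); the normalizing constant Zp cancels in the acceptance ratio. The standard-normal target is chosen arbitrarily for illustration:

```python
import math
import random

def metropolis(p_tilde, z0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: sample from p(z) = p_tilde(z) / Z_p without knowing Z_p."""
    rng = random.Random(seed)
    z, samples = z0, []
    for _ in range(n_samples):
        z_prop = z + rng.gauss(0, step)  # symmetric Gaussian proposal
        # Accept with probability min(1, p_tilde(z_prop) / p_tilde(z)); Z_p cancels here.
        if rng.random() < min(1.0, p_tilde(z_prop) / p_tilde(z)):
            z = z_prop
        samples.append(z)
    return samples

# Unnormalized standard-normal target: p_tilde(z) = exp(-z^2 / 2).
samples = metropolis(lambda z: math.exp(-z * z / 2), z0=0.0, n_samples=50000)
mean = sum(samples) / len(samples)                # should be near 0
var = sum(s * s for s in samples) / len(samples)  # should be near 1
```

The successive draws are correlated, which is the price MCMC pays for avoiding Zp; diagnostics for this are discussed in most MCMC treatments.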
1.5 Exercises
1. (Interpreting the Prior)
How can we interpret the prior distribution in Example 1?
3. (Conjugate Priors)
(a) Consider the following form of the Normal distribution

        p(y | µ, κ) = (κ^{1/2} / √(2π)) exp(−κ(y − µ)²/2),

    where κ (the inverse of the variance) is called the precision parameter. Show that this distribution can be written as an Exponential Family distribution of the form

        p(y | θ1, θ2) = h(y) exp(−θ1 y²/2 + θ2 y − ψ(θ1, θ2)).

    Characterize h(y), (θ1, θ2) and the function ψ(θ1, θ2).
(b) Recall that the generic conjugate prior for an exponential family distribution is given by

        π(θ1, θ2) ∝ exp(a1 θ1 + a2 θ2 − γψ(θ1, θ2)).    (6)
    Substitute your expression for (θ1, θ2) from part (a) to show that the conjugate prior for the Normal model is of the form

        π(κ | a0, b0) · π(µ | µ0, γκ) ∝ κ^{a0−1} e^{−κ/b0} · κ^{1/2} e^{−γκ(µ−µ0)²/2},    (7)

    where the first factor is the Gamma(κ | a0, b0) density and the second is the Normal(µ | µ0, γκ) density (each up to normalization).
Your expressions for a0 , b0 and µ0 should be in terms of γ, a1 and a2 . (This prior is known as
the Normal-Gamma prior.)
3 The acceptance-rejection algorithm is a standard Monte-Carlo algorithm that is covered in just about every introduction
(c) Suppose (µ, κ) ∼ Normal-Gamma(a0, b0, µ0, γ), and the likelihood of the data y is

        p(y | µ, κ) = (κ^{1/2} / √(2π)) exp(−κ(y − µ)²/2).

    Compute the posterior distribution after you see n IID samples {y1, . . . , yn}. (The results of parts (a) and (b) can help simplify the calculations.)
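A Normal-Gamma(a0, b0, µ0, γ) draw can be generated hierarchically from (7): first κ from a Gamma distribution, then µ given κ from a Normal with precision γκ. Such simulation is a handy sanity check for conjugate calculations like the one in part (c); the parameter values below are arbitrary:

```python
import math
import random

def sample_normal_gamma(a0, b0, mu0, gamma, rng):
    """Draw (mu, kappa) from the Normal-Gamma prior in (7):
    kappa ~ Gamma(shape=a0, scale=b0)  (density proportional to kappa^(a0-1) e^(-kappa/b0)),
    mu | kappa ~ Normal(mu0, precision gamma * kappa)."""
    kappa = rng.gammavariate(a0, b0)
    mu = rng.gauss(mu0, 1.0 / math.sqrt(gamma * kappa))
    return mu, kappa

rng = random.Random(1)
a0, b0, mu0, gamma = 3.0, 0.5, 0.0, 2.0  # arbitrary hypothetical values
draws = [sample_normal_gamma(a0, b0, mu0, gamma, rng) for _ in range(40000)]

mean_kappa = sum(k for _, k in draws) / len(draws)  # E[kappa] = a0 * b0 = 1.5
mean_mu = sum(m for m, _ in draws) / len(draws)     # E[mu] = mu0 = 0
```

Note that Python's `gammavariate(a0, b0)` takes the scale b0, matching the e^{−κ/b0} factor in (7).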
Remark 1 The Dirichlet prior and multinomial likelihood form a conjugate pair that includes the beta-binomial pair as a special case. Your answer to part (c) explains why E[θj | y, φ] is sometimes called a shrinkage estimator of θj.
(a) Prove that this algorithm works, i.e. terminates with a random variable X having the correct
distribution.
(b) Why must we have M ≥ 1?
(c) On average how many samples of Y are required until one is accepted?