Cra I U Rosenthal Ann Rev
1 Introduction
A search for Markov chain Monte Carlo (or MCMC) articles on Google Scholar yields
over 100,000 hits, and a general web search on Google yields 1.7 million hits. These
results stem largely from the ubiquitous use of these algorithms in modern computa-
tional statistics, as we shall now describe.
MCMC algorithms are used to solve problems in many scientific fields, includ-
ing physics (where many MCMC algorithms originated) and chemistry and computer
science. However, the widespread popularity of MCMC samplers is largely due to
their impact on solving statistical computation problems related to Bayesian infer-
ence. Specifically, suppose we are given an independent and identically distributed
(henceforth iid) sample {x1 , . . . , xn } from a parametric sampling density f (x|θ), where
x ∈ X ⊂ Rk and θ ∈ Θ ⊂ Rd . Suppose we also have some prior density p(θ). Then
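the Bayesian paradigm prescribes that all aspects of inference should be based on the
posterior density
\[ \pi(\theta \mid \vec{x}) = \frac{p(\theta)\, f(\vec{x} \mid \theta)}{\int_{\Theta} p(\theta)\, f(\vec{x} \mid \theta)\, d\theta} \tag{1} \]
where $\vec{x} = \{x_1, \dots, x_n\}$. Of greatest interest are the posterior means of functionals
g : Θ → R, defined by
\[ I = \int_{\Theta} g(\theta)\, \pi(\theta \mid \vec{x})\, d\theta = \frac{\int_{\Theta} g(\theta)\, p(\theta)\, f(\vec{x} \mid \theta)\, d\theta}{\int_{\Theta} p(\theta)\, f(\vec{x} \mid \theta)\, d\theta}. \tag{2} \]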
Such expectations are usually impossible to compute directly, because of the integrals
that appear in the denominator of (1) and in (2). However, we can still study (2) as
long as we can sample from π. In the traditional Monte Carlo paradigm, we generate
an iid sample θ1 , . . . , θM from π, and then estimate (2) using
\[ \hat{I}_M = \frac{1}{M} \sum_{i=1}^{M} g(\theta_i). \tag{3} \]
This estimate generally works well in cases where the iid sample θ1 , . . . , θM can be
generated, and in particular IˆM → I with probability 1 as M → ∞.
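As a minimal sketch of the estimator (3), the following Python snippet averages g over iid draws. The target N(1, 2²), the functional g(θ) = θ², and the names `monte_carlo_mean` and `sampler` are illustrative assumptions, chosen only because the true value E[θ²] = 5 is known in that case.

```python
import random

def monte_carlo_mean(g, sampler, M, rng):
    """Average g over M iid draws from the target, as in estimator (3)."""
    return sum(g(sampler(rng)) for _ in range(M)) / M

rng = random.Random(0)
# Stand-in target (an assumption): pi = N(1, 2^2), for which E[theta^2] = 1 + 4 = 5.
estimate = monte_carlo_mean(lambda th: th * th, lambda r: r.gauss(1.0, 2.0), 100_000, rng)
print(estimate)  # close to 5 for large M
```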
However, when π is complicated and high-dimensional, classical methods devised
to draw independent samples from the distribution of interest are not implementable.
In this case, a Markov chain Monte Carlo (MCMC) algorithm proceeds by instead
constructing an updating algorithm for generating θt+1 once we know θt . MCMC
updating algorithms are constructed by specifying a set of transition probabilities for
an associated Markov chain (see e.g. (44; 43)). The algorithm then uses the realizations
θ1 , . . . , θM obtained from the Markov chain as the Monte Carlo sample in (3), or more
commonly with the slight modification
\[ \hat{I}_M = \frac{1}{M - B} \sum_{i=B+1}^{M} g(\theta_i), \tag{4} \]
where B is a fixed non-negative integer (e.g. 1,000) indicating the amount of burn-in,
i.e. the number of initial samples that will be discarded due to being excessively biased
towards the (arbitrary) initial value θ0 . If the Markov chain has π as an invariant
distribution, and if it satisfies the mild technical conditions of being aperiodic and
irreducible, then the ergodic theorem implies that with probability one, IˆM → I as
M → ∞ (see Section 8.1).
Now, unlike traditional Monte Carlo where the samples are independent, MCMC
samplers yield dependent draws. This makes the theoretical study of these algorithms,
and the assessment of their speed of convergence and Monte Carlo variance, much more
difficult. The evolution of the field can be traced through the
articles included in the volumes edited by (1) and (2) as well as the books devoted
to Monte Carlo methods in statistics, e.g. (3), (4), (5) and (6). We recognize that
for those scientists who are not familiar with MCMC techniques, but need them for
their statistical analysis, the wealth of information contained in the literature can be
overwhelming. Therefore, in this review we provide, in concise form, the ingredients
needed for using MCMC in most applications. Along the way, we will point the user
in need of more sophisticated methods to the relevant literature.
Table 1: The number of latent membranous lupus nephritis cases (numerator), and
the total number of cases (denominator), for each combination of the values of the
two covariates.

                        IgA
∆IgG     0       0.5     1       1.5     2
-3.0     0/1     -       -       -       -
-2.5     0/3     -       -       -       -
-2.0     0/7     -       -       -       0/1
-1.5     0/6     0/1     -       -       -
-1.0     0/6     0/1     0/1     -       0/1
-0.5     0/4     -       -       1/1     -
 0       0/3     -       0/1     1/1     -
 0.5     3/4     -       1/1     1/1     1/1
 1.0     1/1     -       1/1     1/1     4/4
 1.5     1/1     -       -       2/2     -
where Φ(·) is the CDF of N(0, 1), xi = (1, ∆IgGi, IgAi) is the vector of covariates,
and β is a 3 × 1 vector of parameters. We assume a flat prior p(β) ∝ 1 throughout
the paper.
For the PR model, the posterior is thus
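\[ \pi_{PR}(\beta \mid \vec{Y}, \vec{\mathrm{IgA}}, \vec{\Delta\mathrm{IgG}}) \propto \prod_{i=1}^{55} \Big[ \Phi(\beta_0 + \Delta\mathrm{IgG}_i\, \beta_1 + \mathrm{IgA}_i\, \beta_2)^{Y_i} \times \big(1 - \Phi(\beta_0 + \Delta\mathrm{IgG}_i\, \beta_1 + \mathrm{IgA}_i\, \beta_2)\big)^{(1 - Y_i)} \Big]. \tag{6} \]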
Step 1 A proposal ωt is drawn from a proposal density q(ω|θt );
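Step 2 Set
\[ \theta_{t+1} = \begin{cases} \omega_t & \text{with probability } r \\ \theta_t & \text{with probability } 1 - r \end{cases} \]
where
\[ r = \min\left\{ 1, \frac{\pi(\omega_t)\, q(\theta_t \mid \omega_t)}{\pi(\theta_t)\, q(\omega_t \mid \theta_t)} \right\}. \tag{7} \]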
The acceptance probability (7) is independent of the normalizing constant for π (i.e.,
it does not require the value of the denominator in (1)), and is chosen precisely to
ensure that π is an invariant distribution, the key condition to ensure that IˆM → I
as M → ∞ as discussed above; see Section 8.2.
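The two steps and the acceptance probability (7) can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the results in this paper: the function name `metropolis_hastings`, the toy target (N(0, 1) known only up to the arbitrary constant 7) and the N(0, 2²) independence proposal are all hypothetical choices. Working on the log scale shows concretely that the normalising constant of π never enters.

```python
import math
import random

def metropolis_hastings(log_pi, propose, log_q, theta0, M, rng):
    """Run M MH steps. log_pi may omit the normalising constant;
    log_q(a, b) is the log proposal density of a given current state b."""
    theta, chain = theta0, []
    for _ in range(M):
        omega = propose(theta, rng)                          # Step 1
        log_r = (log_pi(omega) + log_q(theta, omega)
                 - log_pi(theta) - log_q(omega, theta))      # log of the ratio in (7)
        if rng.random() < math.exp(min(0.0, log_r)):         # Step 2: accept w.p. r
            theta = omega
        chain.append(theta)
    return chain

# Toy target (assumption): pi proportional to 7 * exp(-theta^2 / 2).
log_pi = lambda th: math.log(7.0) - 0.5 * th * th
# Independence proposal q(omega) = N(0, 2^2); constants in log q cancel in (7).
propose = lambda th, r: r.gauss(0.0, 2.0)
log_q = lambda a, b: -a * a / 8.0

M, B = 20_000, 1_000
chain = metropolis_hastings(log_pi, propose, log_q, 0.0, M, random.Random(1))
estimate = sum(th * th for th in chain[B:]) / (M - B)        # estimator (4), g = theta^2
print(estimate)  # should be near Var(theta) = 1
```

Passing a symmetric random-walk proposal instead makes the two `log_q` terms cancel, which recovers the plain Metropolis ratio.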
The most popular variant of the MH algorithm is the random walk Metropolis
(RWM) in which ωt = θt + εt where εt is generated from a spherically symmetric
distribution, e.g. the Gaussian case with εt ∼ N(0, Σ). Another common choice is
the independence sampler (IS) in which q(ω|θt ) = q(ω), i.e. ωt does not depend on
the current state of the chain, θt . Generally, the RWM is used in situations in which
we have little idea about the shape of the target distribution and, therefore, we need
to meander through the parameter space. The opposite situation is one in which we
have a pretty good idea about the target π and we are able to produce a credible
approximation q which can be used as the proposal in the IS algorithm.
Modifications of these MH samplers include the delayed-rejection (12), multiple-
try Metropolis (13; 14), and reversible-jump algorithms (15; 16), among others.
In practice, one must decide which sampler to use and, maybe more importantly,
which simulation parameter values to choose. For instance, in the case of the RWM,
the proposal covariance matrix Σ plays a crucial role in the performance of the sam-
pling algorithm (17; 18).
2.1 Application to the Lupus Data
To see the effect of these choices in action, let us consider the lupus data under the PR
model formulation. The target distribution has density (6). Since we have little idea
of the shape of πP R , it will be hard to come up with a good independence proposal
distribution. Instead, we will use the RWM algorithm with a Gaussian proposal.
We illustrate using two possible choices for the variance-covariance matrix Σ of the
Gaussian proposal distribution: Σ1 = 0.6 I3 and Σ2 = 1.2 I3 , where Id is the identity
matrix in Rd×d .
In Figure 1(a) we plot 5,000 samples for (β0 , β1 ) that are obtained from the RWM
with proposal variance Σ1 . The plot is superimposed on the two-dimensional projec-
tion of the contour plot for the density πP R . The contour plot used is obtained from
a large Monte Carlo sample produced by a state-of-the-art sampler and offers an ac-
curate description of the target. The two red lines mark the coordinates of the initial
value of the chain which has been chosen to be the maximum likelihood estimate for
β. One can see that the samples do not cover the entire support of the distribution.
Moreover, from the autocorrelation plots shown in Figure 1(b), we can see that the
chain is very “sticky”, i.e. there is strong dependence between the realizations of the
chain, despite having an acceptance rate of 39% which, as we will see in Section 5, is
usually considered reasonably high. One may be tempted to believe that the strong
dependence between the Monte Carlo draws is due to the proposal variance being too
small since sampling from a normal distribution with a small variance will result in
draws close to the mean.
We consider doubling the variance and use Σ2 = 1.2 I3 as the proposal’s covariance
matrix. The larger variance brings the acceptance rate down to 24%. Figure 1(c)
shows the same plot as Figure 1(a) but for the sampler that uses Σ2 . The chain seems
to travel further into the tails of the distribution, but the serial correlation remains
extremely high. Such a high autocorrelation implies that the Monte Carlo sample of 5,000
draws contains only as much information as a much
smaller sample of independent realizations. This reduction in effective sample size
is computationally wasteful, since we spend a lot of time collecting samples that are
essentially uninformative. In fact, under certain conditions, (19) has shown that the
asymptotic variance of $\hat{I}_M$ is σ²/M, where
\[ \sigma^2 = \mathrm{Var}_\pi\{g(\theta)\} + 2 \sum_{k=1}^{\infty} \mathrm{cov}\{g(\theta_1), g(\theta_{k+1})\}, \tag{8} \]
which illustrates the importance of having small correlations between the successive
draws θt .
The high autocorrelation between the successive draws can be explained if we
consider the strong posterior dependence between the parameters, as illustrated
by Figure 2 where we have plotted the samples obtained in pairs. The plots provide an
intuitive explanation for the poor mixing exhibited by the two RWM samplers, since
their proposals have independent components and, therefore, deviate significantly
from the target configuration.
We shall use these RWM algorithms for this lupus data to illustrate various MCMC
theoretical considerations in Section 8.4.
3 Gibbs Sampler
The Gibbs sampler is an algorithm that was first used by (20) in the context of
image restoration and, subsequently, (21) and (22) have recognized the algorithm’s
power for fitting statistical models. Assume that the vector of parameters θ ∈ Rd is
partitioned into s subvectors so that θ = (η1 , . . . , ηs ). Assume that the current state
of the chain is $\theta^{(t)} = (\eta_1^{(t)}, \dots, \eta_s^{(t)})$. The transition kernel for the Gibbs chain requires
updating, in turn, each subvector by sampling it from its conditional distribution,
given all the other subvectors. More precisely, step t + 1 of the sampler involves the
following updates:
\[ \begin{aligned}
\eta_1^{(t+1)} &\sim \pi(\eta_1 \mid \eta_2^{(t)}, \dots, \eta_s^{(t)}) \\
\eta_2^{(t+1)} &\sim \pi(\eta_2 \mid \eta_1^{(t+1)}, \eta_3^{(t)}, \dots, \eta_s^{(t)}) \\
&\;\;\vdots \\
\eta_s^{(t+1)} &\sim \pi(\eta_s \mid \eta_1^{(t+1)}, \eta_2^{(t+1)}, \dots, \eta_{s-1}^{(t+1)}).
\end{aligned} \tag{9} \]
Cycling through the blocks in a fixed order defines the Gibbs sampler with deterministic
scan; an alternative implementation involves a random scan in which the
next block to be updated is sampled at random and each ηj has strictly positive
probability of being updated. In general, it is not known whether the random-scan Gibbs
sampler is more or less efficient than the deterministic-scan version (23; 24; 25).
An obvious choice for the blocks η is obtained when s = d and ηj = θj for
1 ≤ j ≤ d. However, whenever possible it is recommended to have the blocks η
containing as many individual components of θ as possible while being able to sample
from the conditional distributions in (9) (see the analysis of 26).
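To make the deterministic-scan updates concrete, here is a minimal Python sketch for a case where the full conditionals are available in closed form. The target is an assumption made purely for illustration (it is not the lupus model): a standard bivariate normal with correlation ρ, for which each coordinate given the other is N(ρ × other, 1 − ρ²). The function name `gibbs_bivariate_normal` is likewise hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(rho, M, rng, start=(0.0, 0.0)):
    """Deterministic-scan Gibbs sampler with s = d = 2 scalar blocks."""
    x, y = start
    sd = math.sqrt(1.0 - rho * rho)      # conditional standard deviation
    chain = []
    for _ in range(M):
        x = rng.gauss(rho * y, sd)       # eta_1 drawn given the current eta_2
        y = rng.gauss(rho * x, sd)       # eta_2 drawn given the freshly updated eta_1
        chain.append((x, y))
    return chain

M, B = 20_000, 500
chain = gibbs_bivariate_normal(0.8, M, random.Random(2))
# Estimator (4) with g(x, y) = x * y, whose true mean is rho = 0.8.
estimate = sum(x * y for x, y in chain[B:]) / (M - B)
print(estimate)
```

Every draw is accepted, which is what distinguishes the Gibbs update from a Metropolis-style move.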
expanding the model so that conditional distributions are available in closed form is
known as the Data Augmentation (DA) algorithm (22).
The Gibbs sampler (or DA) for the lupus data alternates between sampling from
4 Variable-at-a-time Metropolis
It is possible to combine Metropolis-style moves with Gibbs-style variable-at-a-time,
to create a variable-at-a-time Metropolis algorithm. (This algorithm is also sometimes
called Metropolis-within-Gibbs, but actually this was the original form of the algorithm
used by (10).)
Assume again that the vector of parameters θ ∈ Rd is partitioned into s subvectors
so that θ = (η1 , . . . , ηs ). Variable-at-a-time Metropolis then proceeds by proposing
to move just one coordinate at a time (or a subset of coordinates), leaving all the
other coordinates fixed. In its most common form, we might try to move the ith
coordinate by proposing a new state ωt+1 , where ωt+1,j = ηt,j for all j ≠ i, and where
ωt+1,i ∼ N(ηt,i , σ²). (Here ωt+1,j is the jth coordinate of ωt+1 , etc.) We then accept the
proposal ωt+1 according to the Metropolis-Hastings rule (7).
As with the Gibbs sampler, we need to choose which coordinate to update each
time, and again can proceed either by choosing coordinates in the sequence 1, 2, . . . , d,
1, 2, . . . (“systematic-scan”), or by choosing the coordinate uniformly from {1, 2, . . . , d}
each time (“random-scan”). (In this formulation, one systematic-scan iteration is
roughly equivalent to d random-scan ones.)
The variable-at-a-time Metropolis algorithm is often a good generic choice, since
unlike the full Metropolis algorithm it does not require moving all coordinates at once
(which can be very challenging to do efficiently), and unlike Gibbs sampling it does
not require being able to sample from the full conditional distributions (which could
be infeasible).
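A systematic-scan version of this scheme might look as follows in Python. This is a sketch under assumptions: the target (a bivariate normal with correlation 0.5, specified only up to a constant), the proposal scales and the name `variable_at_a_time_metropolis` are illustrative and are not taken from the paper.

```python
import math
import random

def variable_at_a_time_metropolis(log_pi, theta0, sigmas, sweeps, rng):
    """Systematic scan: propose a normal move in one coordinate at a time."""
    theta = list(theta0)
    chain = []
    for _ in range(sweeps):
        for i in range(len(theta)):
            proposal = list(theta)                        # all other coordinates fixed
            proposal[i] = theta[i] + rng.gauss(0.0, sigmas[i])
            log_r = log_pi(proposal) - log_pi(theta)      # symmetric proposal, so q cancels
            if rng.random() < math.exp(min(0.0, log_r)):
                theta = proposal
        chain.append(tuple(theta))
    return chain

RHO = 0.5
def log_pi(th):
    """Unnormalised log density of a bivariate normal with unit variances."""
    x, y = th
    return -(x * x - 2.0 * RHO * x * y + y * y) / (2.0 * (1.0 - RHO * RHO))

sweeps, B = 20_000, 1_000
chain = variable_at_a_time_metropolis(log_pi, (0.0, 0.0), (2.0, 2.0), sweeps, random.Random(3))
mean_x = sum(x for x, _ in chain[B:]) / (sweeps - B)
mean_xx = sum(x * x for x, _ in chain[B:]) / (sweeps - B)
print(mean_x, mean_xx)  # near 0 and 1 for this target
```

Replacing the inner `for` loop with a single uniformly chosen index would give the random-scan variant.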
N(βt,h , σh²) and is accepted with probability
\[ \min\left\{ 1, \frac{\pi(\omega_{t+1,h} \mid \beta_{t,[-h]})}{\pi(\beta_{t,h} \mid \beta_{t,[-h]})} \right\}, \tag{10} \]
where βt,[−h] is the vector of the most recent updates for all the components of β except
βh . Note that the ratio involved in (10) is identical to π(ωt+1,h , βt,[−h] )/π(βt,h , βt,[−h] )
and it can be computed in closed form since it is independent of any unknown nor-
malizing constants.
We have implemented the algorithm using σ = (√5, 5, 2√2). These values were
chosen so that the acceptance rates for each component are between 20-25%. Fig-
ure 3(c) shows the samples obtained and Figure 3(d) presents the autocorrelation
functions. Notice that the serial dependence is smaller than in the full MH imple-
mentation, but remains high. Also, the samples cover most of the support of the
posterior density π. In the one-at-a-time implementation, we are no longer forcing
all components of the chain to move together at the same time, and this seems to
improve the spread of the resulting sample.
will usually be very far from the previous states θt . This means (especially in high
dimensions) that they are quite likely to be out in the tails of the target density π
in at least one coordinate, and thus to have much lower π values. This implies that
they will almost always be rejected, which is again clearly sub-optimal.
It follows that the optimal scaling is somewhere in between these two extremes.
That is, we want our proposal scaling to be not too small, and not too large, but
rather “just right” (this is sometimes called the Goldilocks principle).
In a pioneering paper, (17) took this a step further, proving that (in a certain
idealised high-dimensional limit, at least) the optimal acceptance rate (i.e., limiting
fraction of proposed moves which are accepted) is equal to the specific fraction 0.234,
or about 23%. On the other hand, any acceptance rate between about 15% and 50%
is still fairly efficient (see e.g. Figure 3 of (18)).
Later optimal scaling results obtained by (18) and (30) indicate that (again, in a
certain idealised high-dimensional limit, at least) the optimal proposal covariance Σ
should be chosen to be proportional to the true covariance matrix of the target distri-
bution π (with the constant of proportionality chosen to achieve the 0.234 acceptance
rate).
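The Goldilocks behaviour is easy to reproduce on a toy problem. The Python sketch below runs RWM on a one-dimensional N(0, 1) target (an assumption made for illustration) with a very small, a moderate and a very large proposal scale, and reports the acceptance rate of each. Because the target is one-dimensional, the 0.234 figure, which is a high-dimensional limit, should not be expected to appear exactly; only the qualitative pattern matters here.

```python
import math
import random

def rwm_acceptance_rate(log_pi, sigma, M, rng, theta0=0.0):
    """Fraction of accepted proposals for RWM with proposal N(theta, sigma^2)."""
    theta, accepted = theta0, 0
    for _ in range(M):
        omega = theta + rng.gauss(0.0, sigma)
        if rng.random() < math.exp(min(0.0, log_pi(omega) - log_pi(theta))):
            theta, accepted = omega, accepted + 1
    return accepted / M

log_pi = lambda th: -0.5 * th * th          # N(0, 1) up to a constant
scales = (0.05, 2.4, 50.0)                  # too small, moderate, too large
rates = [rwm_acceptance_rate(log_pi, s, 20_000, random.Random(4)) for s in scales]
for s, r in zip(scales, rates):
    print(f"sigma = {s:5.2f}  acceptance rate = {r:.3f}")
```

The smallest scale accepts almost everything but barely moves, while the largest scale is rejected almost always; a high acceptance rate on its own is therefore not evidence of good mixing.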
Alternatively, one can build upon the recent advances in Adaptive MCMC (AM-
CMC) where the proposal distribution is changed on the go and continuously at any
time t using the information contained in the samples obtained up to that time (see
e.g. 31; 32; 33; 34). Such an approach does not require re-starting of the chain, and
can be made to be fully automatic. On the other hand, it requires careful theoretical
analysis since, by using the past realizations of the chain (and not only the current
state), the process loses its Markovian property and asymptotic ergodicity must be
proven on a case-by-case basis. However, proving the validity of adaptive samplers
has been made easier by the general frameworks developed by e.g. (35) and (36).
\[ \Sigma_t = \frac{(2.4)^2}{3}\, \mathrm{SamVar}_t + \epsilon I_3, \tag{11} \]
where ε = 0.01 and SamVar_t is the sample variance of all samples drawn up to time
t − 1. This is an attempt to approximately mimic the theoretical optimal scaling
results discussed in Section 5.1 above; indeed if SamVar_t happened to actually equal
the true covariance matrix of π, and if ε = 0, then (11) would indeed be the optimal
proposal covariance. In Figure 5 we show the same plots as for Figures 1(a)-1(d).
The reduction in serial autocorrelation is apparent. For instance, the mean, median,
lower and upper quartiles for the autocorrelations computed up to lag 200 equal
(0.537, 0.513, 0.377, 0.664) for the RWM sampler with Σ2 , and equal the much smaller
values (0.065, 0.029, 0.007, 0.059) for the adaptive RWM.
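A one-dimensional analogue of the adaptive rule (11) can be sketched in Python as below. Several things here are assumptions rather than the procedure used for the lupus data: the dimension is d = 1 (so the factor is (2.4)²/1 and no matrix algebra is needed), the target is a N(3, 5²) stand-in, a fixed unit-variance proposal is used for the first 100 iterations so that a sample variance exists, and the name `adaptive_rwm_1d` is hypothetical. The running variance is maintained with Welford's update.

```python
import math
import random

def adaptive_rwm_1d(log_pi, theta0, M, rng, eps=0.01, warmup=100):
    """RWM whose proposal variance at time t is (2.4^2 / d) * SamVar_t + eps, with d = 1."""
    theta, chain, accepted = theta0, [theta0], 0
    n, mean, m2 = 1, theta0, 0.0                 # running count, mean, sum of squared deviations
    for t in range(1, M + 1):
        if t <= warmup:
            var_prop = 1.0                       # fixed proposal until some history exists
        else:
            sam_var = m2 / (n - 1)               # sample variance of draws up to time t - 1
            var_prop = (2.4 ** 2) * sam_var + eps
        omega = theta + rng.gauss(0.0, math.sqrt(var_prop))
        if rng.random() < math.exp(min(0.0, log_pi(omega) - log_pi(theta))):
            theta, accepted = omega, accepted + 1
        chain.append(theta)
        n += 1                                   # Welford update with the new state
        delta = theta - mean
        mean += delta / n
        m2 += delta * (theta - mean)
    return chain, accepted / M

log_pi = lambda th: -((th - 3.0) ** 2) / 50.0    # N(3, 5^2) up to a constant
chain, acc_rate = adaptive_rwm_1d(log_pi, 0.0, 20_000, random.Random(5))
kept = chain[2_000:]
kept_mean = sum(kept) / len(kept)
kept_var = sum((x - kept_mean) ** 2 for x in kept) / (len(kept) - 1)
print(acc_rate, kept_var)
```

The proposal scale settles near 2.4 times the target standard deviation without any manual tuning, which is the point of the adaptive construction.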
6 Simulated Tempering
Particular challenges arise in MCMC when the target density π is multi-modal, i.e.
has distinct high-probability regions which are separated by low-probability barriers
which are difficult for the Markov chain to traverse. In such cases, it is often easy for
a simple MCMC algorithm like RWM to explore well within any one of the modal
regions, but it may take unfeasibly long for the chain to move between modes. This
leads to extremely slow convergence, and very poor resulting estimates.
The idea of simulated tempering is to “flatten out” the distribution into related
distributions which have less pronounced modes and hence can be more easily sam-
pled. If done carefully, this flattening out can be compensated for, to ultimately yield
good estimates for expected values from the original target density π, as we now
explain.
Specifically, simulated tempering requires a sequence π1 , π2 , . . . , πm of target den-
sities, where π1 = π is the original density, and πτ is flatter for the larger-τ distribu-
tions. (The parameter τ is usually referred to as the “temperature”; then π1 is the
“cold” density, and πτ for larger τ are called the “hot” densities.) These different
densities can then be combined to define a single joint density π̄ on Θ × {1, 2, . . . , m},
defined by π̄(θ, τ) = (1/m) πτ(θ) for 1 ≤ τ ≤ m and θ ∈ Θ. (It is also possible to use
other weights besides the uniform choice 1/m.)
Simulated tempering then uses π̄ to define a joint Markov chain (θ, τ) on Θ ×
{1, 2, . . . , m}, with target density π̄. In the simplest case, this chain is a version
of variable-at-a-time Metropolis which alternates (say) between spatial moves which
propose (say) θ′ ∼ N(θ, σθ²) and then accept with the usual Metropolis probability
\[ \min\left\{ 1, \frac{\bar{\pi}(\theta', \tau)}{\bar{\pi}(\theta, \tau)} \right\} = \min\left\{ 1, \frac{\pi_\tau(\theta')}{\pi_\tau(\theta)} \right\}, \]
and temperature moves which propose (say) τ′ = τ ± 1 (prob 1/2 each) and then accept
with the usual Metropolis probability
\[ \min\left\{ 1, \frac{\bar{\pi}(\theta, \tau')}{\bar{\pi}(\theta, \tau)} \right\} = \min\left\{ 1, \frac{\pi_{\tau'}(\theta)}{\pi_\tau(\theta)} \right\}. \]
As usual for Metropolis algorithms, this chain should converge in distribution to
the density π̄. But of course, our interest is in the original density π = π1 , not in π̄.
The genius of simulated tempering is that in the end, we only “count” those samples
corresponding to τ = 1. That is, once we have a good sample from π̄, then we simply
discard all the sample values corresponding to τ ≠ 1, and what remains is then a
good sample from π, as we now explain.
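The alternation of spatial and temperature moves can be sketched in Python as follows. The example is an illustration under assumptions: the tempered family is the fully normalised mixture ½ N(0, τ²) + ½ N(20, τ²) with m = 10, proposals to τ′ outside {1, . . . , m} are simply rejected, and the spatial proposal standard deviation is taken to be τ (rather than one fixed σθ) so that the hot levels cross between modes quickly. The function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def pi_tau(theta, tau):
    """Normalised tempered density: equal mixture of N(0, tau^2) and N(20, tau^2)."""
    return 0.5 * normal_pdf(theta, 0.0, tau) + 0.5 * normal_pdf(theta, 20.0, tau)

def simulated_tempering(M, rng, m=10):
    theta, tau, cold = 0.0, 1, []
    for _ in range(M):
        # Spatial move at the current temperature (proposal sd = tau is an assumption).
        prop = theta + rng.gauss(0.0, float(tau))
        if rng.random() < min(1.0, pi_tau(prop, tau) / pi_tau(theta, tau)):
            theta = prop
        # Temperature move tau' = tau +/- 1, probability 1/2 each.
        tau_prop = tau + rng.choice((-1, 1))
        if 1 <= tau_prop <= m:
            if rng.random() < min(1.0, pi_tau(theta, tau_prop) / pi_tau(theta, tau)):
                tau = tau_prop
        if tau == 1:                      # only the cold samples are kept
            cold.append(theta)
    return cold

cold = simulated_tempering(100_000, random.Random(6))
share_right = sum(1 for th in cold if th > 10.0) / len(cold)
print(len(cold), share_right)             # both modes should be visited about equally
```

Because each πτ is normalised here, the chain spends roughly a fraction 1/m of its time at τ = 1, so about one tenth of the iterations contribute to the final sample.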
One promising approach is to let the hotter densities πτ (θ) correspond to taking
smaller and smaller powers of the original target density π(θ), i.e. to let πτ (θ) =
cτ (π(θ))1/τ for appropriate normalising constant cτ . (It is common to write β = 1/τ ,
and refer to β as the inverse temperature.) This formula guarantees that π1 = π,
and that πτ will be flatter for larger τ (since small positive powers move all positive
numbers closer to 1), which is precisely what we need.
As a specific example, if it happened that π(θ) is the density of N(µ, σ 2 ), then
cτ (π(θ))1/τ would be the density of N(µ, τ σ 2 ). This is indeed a flatter density, similar
to the simple example above, which confirms that this is a promising approach.
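The Gaussian claim can be checked in two lines. The following worked step is a sketch added for the reader's convenience; the explicit expression for the normalising constant is derived here and is not quoted from the paper.

```latex
\begin{align*}
\pi(\theta) &= (2\pi\sigma^2)^{-1/2} \exp\!\left( -\frac{(\theta-\mu)^2}{2\sigma^2} \right), \\
\pi(\theta)^{1/\tau} &= (2\pi\sigma^2)^{-1/(2\tau)} \exp\!\left( -\frac{(\theta-\mu)^2}{2\tau\sigma^2} \right).
\end{align*}
% The exponent is the kernel of N(mu, tau sigma^2), whose normalising factor is
% (2 pi tau sigma^2)^{-1/2}. Matching the two prefactors gives
\[
c_\tau = (2\pi\tau\sigma^2)^{-1/2}\, (2\pi\sigma^2)^{1/(2\tau)},
\qquad c_1 = 1 .
\]
```

So raising a normal density to the power 1/τ multiplies its variance by τ, i.e. its standard deviation by √τ.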
One problem with this approach is the following. With this formula, if we propose
to move τ to τ ′ , then we should accept this proposal with probability
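\[ \min\left\{ 1, \frac{\pi_{\tau'}(\theta)}{\pi_\tau(\theta)} \right\} = \min\left\{ 1, \frac{c_{\tau'}}{c_\tau}\, (\pi(\theta))^{(1/\tau') - (1/\tau)} \right\}. \]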
This formula explicitly depends on the normalising constants cτ and cτ ′ , i.e. they do
not “cancel” as for ordinary RWM. This is quite problematic since the cτ are usually
unknown and infeasible to calculate. So, what can be done?
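Crucially, the chain can also choose temperatures τ and τ′ (say, each chosen uniformly
from {1, 2, . . . , m}), and then propose to “swap” the values θt,τ and θt,τ′ . This
proposal will then be accepted with its usual Metropolis probability,
\[ \min\left\{ 1, \frac{\pi_\tau(\theta_{t,\tau'})\, \pi_{\tau'}(\theta_{t,\tau})}{\pi_\tau(\theta_{t,\tau})\, \pi_{\tau'}(\theta_{t,\tau'})} \right\}. \]
The beauty of parallel tempering is that now the normalising constants do indeed cancel.
That is, if $\pi_\tau(\theta) = c_\tau (\pi(\theta))^{1/\tau}$, then the acceptance probability becomes:
\[ \min\left\{ 1, \frac{c_\tau\, \pi(\theta_{t,\tau'})^{1/\tau}\, c_{\tau'}\, \pi(\theta_{t,\tau})^{1/\tau'}}{c_\tau\, \pi(\theta_{t,\tau})^{1/\tau}\, c_{\tau'}\, \pi(\theta_{t,\tau'})^{1/\tau'}} \right\} = \min\left\{ 1, \frac{\pi(\theta_{t,\tau'})^{1/\tau}\, \pi(\theta_{t,\tau})^{1/\tau'}}{\pi(\theta_{t,\tau})^{1/\tau}\, \pi(\theta_{t,\tau'})^{1/\tau'}} \right\}. \]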
So, the values of cτ and cτ ′ are not required to run the algorithm.
As a first test, we can apply parallel tempering to the above simple example, again
with πτ(θ) = ½ N(0, τ²; θ) + ½ N(20, τ²; θ) for τ = 1, 2, . . . , 10. We see that parallel
tempering again works pretty well in this case (Figure 7(d)).
Of course, in this simple example the normalising constants were known, so paral-
lel tempering wasn’t really required. However, in many applications the normalising
constants are unknown, in which case parallel tempering is often a very useful sam-
pling method.
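A compact Python sketch of parallel tempering on the same two-mode example is given below. To emphasise that normalising constants are not needed, each tempered density is deliberately coded only up to a temperature-dependent constant. The within-temperature RWM update with proposal standard deviation τ, the run length, and the function names are assumptions of this sketch rather than details taken from the paper.

```python
import math
import random

def log_unnorm(theta, tau):
    """log of exp(-theta^2 / (2 tau^2)) + exp(-(theta - 20)^2 / (2 tau^2)):
    the mixture density with its tau-dependent normalising constant dropped."""
    a = -0.5 * (theta / tau) ** 2
    b = -0.5 * ((theta - 20.0) / tau) ** 2
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))   # stable log-sum-exp

def parallel_tempering(M, rng, m=10):
    thetas = [0.0] * m                    # thetas[k] is the chain at temperature tau = k + 1
    cold = []
    for _ in range(M):
        for k in range(m):                # one RWM update per temperature
            tau = k + 1
            prop = thetas[k] + rng.gauss(0.0, float(tau))
            log_r = log_unnorm(prop, tau) - log_unnorm(thetas[k], tau)
            if rng.random() < math.exp(min(0.0, log_r)):
                thetas[k] = prop
        i, j = rng.randrange(m), rng.randrange(m)    # two temperatures chosen uniformly
        ti, tj = i + 1, j + 1
        log_swap = (log_unnorm(thetas[j], ti) + log_unnorm(thetas[i], tj)
                    - log_unnorm(thetas[i], ti) - log_unnorm(thetas[j], tj))
        if rng.random() < math.exp(min(0.0, log_swap)):
            thetas[i], thetas[j] = thetas[j], thetas[i]
        cold.append(thetas[0])            # the tau = 1 chain targets pi
    return cold

cold = parallel_tempering(20_000, random.Random(7))
share_right = sum(1 for th in cold if th > 10.0) / len(cold)
print(share_right)                        # both modes should be visited about equally
```

Unlike simulated tempering, every iteration yields a cold sample, at the cost of updating all m chains each time.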
The simplest way to estimate standard error from an MCMC estimate is to re-run
the entire Markov chain over again, a number of times, using the exact same values
of run length M and burn-in B as in (4), but started from different initial values θ0
drawn from the same “overdispersed” (i.e., well spread-out) initial distribution. This
leads to a sequence of iid estimates of the target expectation I, and standard errors
from this sequence of estimates can then be computed in the usual iid manner. (We
shall illustrate this in Section 8.4 below, for the RWM algorithms for the lupus data
presented in Section 2.1 above.)
However, such a procedure is often too inefficient, leading to the question of how
to estimate standard error from a single run of a single Markov chain. Specifically,
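we would like to estimate
\[ v \equiv \mathrm{Var}\left( \frac{1}{M-B} \sum_{i=B+1}^{M} g(\theta_i) \right). \]
Write $\bar{g} = g - E_\pi(g)$ for the centred functional, and assume that B is large enough
that the chain is approximately in stationarity after burn-in. Then
\[ \begin{aligned}
v &\approx E_\pi\left[ \left( \frac{1}{M-B} \sum_{i=B+1}^{M} \bar{g}(\theta_i) \right)^{2} \right] \\
&= \frac{1}{(M-B)^2} \Big[ (M-B)\, E(\bar{g}(\theta_i)^2) + 2(M-B-1)\, E(\bar{g}(\theta_i)\, \bar{g}(\theta_{i+1})) \\
&\qquad\qquad\qquad + 2(M-B-2)\, E(\bar{g}(\theta_i)\, \bar{g}(\theta_{i+2})) + \dots \Big] \\
&\approx \frac{1}{M-B} \Big( E(\bar{g}(\theta_i)^2) + 2\, E(\bar{g}(\theta_i)\, \bar{g}(\theta_{i+1})) + 2\, E(\bar{g}(\theta_i)\, \bar{g}(\theta_{i+2})) + \dots \Big) \\
&= \frac{1}{M-B} \Big( \mathrm{Var}_\pi(g) + 2\, \mathrm{Cov}_\pi(g(\theta_i), g(\theta_{i+1})) + 2\, \mathrm{Cov}_\pi(g(\theta_i), g(\theta_{i+2})) + \dots \Big) \\
&= \frac{1}{M-B}\, \mathrm{Var}_\pi(g) \Big( 1 + 2\, \mathrm{Corr}_\pi(g(\theta_i), g(\theta_{i+1})) + 2\, \mathrm{Corr}_\pi(g(\theta_i), g(\theta_{i+2})) + \dots \Big) \\
&\equiv \frac{1}{M-B}\, \mathrm{Var}_\pi(g)\, (\mathrm{ACT}) = (\text{iid variance})\, (\mathrm{ACT}),
\end{aligned} \]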
where “iid variance” is the value for the variance that we would obtain if the samples
{θi } were in fact iid, and
\[ \mathrm{ACT} = 1 + 2 \sum_{k=1}^{\infty} \mathrm{Corr}_\pi\big( g(\theta_0), g(\theta_k) \big) \equiv 1 + 2 \sum_{k=1}^{\infty} \rho_k = \sum_{k=-\infty}^{\infty} \rho_k = 2 \left( \sum_{k=0}^{\infty} \rho_k \right) - 1 \]
is the factor by which the variance is multiplied due to the serial correlations from
the Markov chain (sometimes called the “integrated auto-correlation time”). Here
“Corrπ ” means the theoretical correlation that would arise from a sequence $\{\theta_i\}_{i=-\infty}^{\infty}$
which was in stationarity (so each θi had density π) and which followed the Markov
chain transitions; this in turn implies that the correlations are a function of the time
lag between the two variables, and in particular ρ−k = ρk as above. The standard
error is then, of course, given by se = √v = (iid-se) √ACT.
Now, both the iid variance, and the quantity ACT, can be estimated from the
sample run. (For example, R’s built-in function “acf” automatically computes the
lag correlations ρk . Note also that when computing ACT in practice, we don’t sum
over all k, just until, say, |ρk | < 0.05 or ρk < 0, since for large k we should have ρk ≈ 0
but the estimates of ρk will always contain some sampling error.) This provides a
method of estimating the standard error of your sample. It also provides a method of
comparing different MCMC algorithms, since usually ACT ≫ 1, and “better” chains
would have smaller values of ACT. In the most extreme case, one sometimes even
tries to design “antithetic” chains for which ACT < 1 (see 8; 9; 39; 40; 41).
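The recipe above can be sketched in a few lines of Python. The truncation rule is the one quoted in the text (stop at the first lag with |ρk| < 0.05 or ρk < 0); the function names, and the AR(1) test series with coefficient 0.9 (whose theoretical ACT is (1 + 0.9)/(1 − 0.9) = 19), are assumptions introduced for illustration.

```python
import math
import random

def estimate_act(xs, cutoff=0.05):
    """ACT = 1 + 2 * (sum of lag correlations), truncated at the first small or negative one."""
    n = len(xs)
    mean = sum(xs) / n
    centred = [x - mean for x in xs]
    var = sum(c * c for c in centred) / n
    act = 1.0
    for k in range(1, n):
        rho_k = sum(centred[i] * centred[i + k] for i in range(n - k)) / (n * var)
        if abs(rho_k) < cutoff or rho_k < 0:
            break
        act += 2.0 * rho_k
    return act

def standard_error(xs):
    """se = (iid-se) * sqrt(ACT) for the sample mean of xs."""
    n = len(xs)
    mean = sum(xs) / n
    iid_var = sum((x - mean) ** 2 for x in xs) / (n - 1) / n
    return math.sqrt(iid_var * estimate_act(xs))

rng = random.Random(8)
iid_draws = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
ar1, x = [], 0.0
for _ in range(20_000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)     # strongly autocorrelated stand-in for MCMC output
    ar1.append(x)

act_iid, act_ar1 = estimate_act(iid_draws), estimate_act(ar1)
print(act_iid, act_ar1, standard_error(ar1))
```

For the independent draws the loop stops at lag 1 and the estimate is exactly 1; for the correlated series the truncated sum falls somewhat below the theoretical value because the tail of small correlations is discarded.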
follows as usual that (e − u) v^(−1/2) ≈ N(0, 1), so P(−1.96 < (e − u) v^(−1/2) < 1.96) ≈
0.95, so P(−1.96 √v < e − u < 1.96 √v) ≈ 0.95. This gives us our desired confidence
interval: with prob 95%, the interval (e − 1.96 √v, e + 1.96 √v) will contain u. (Strictly
speaking, we should use the “t” distribution, not normal distribution. But if M − B
is at all large, then that doesn’t really matter, so we will ignore this issue for now.)
Such confidence intervals allow us to more appropriately assess the uncertainty of our
MCMC estimates (e.g. 42).
The above analysis raises the question of whether a CLT even holds in the Markov
chain setting. This and other questions will be answered when we consider the theory
of MCMC in the following section.
measurable S ⊆ Θ.
for simplicity that π(θ) > 0 for all θ ∈ Θ. We also let P (i, j) = P(θ1 = j | θ0 = i) be
the Markov chain’s transition probabilities.
We say that π is stationary for the Markov chain if it is preserved under the
chain’s dynamics, i.e. if it has the property that whenever θ0 ∼ π (meaning that
P(θ0 = i) = π(i) for all i ∈ Θ), then also θ1 ∼ π (i.e., P(θ1 = i) = π(i) for all i ∈ Θ).
Equivalently, $\sum_{i \in \Theta} \pi(i)\, P(i, j) = \pi(j)$ for all j ∈ Θ. Intuitively this means that the
probabilities π are left invariant by the chain, which explains why the chain might
perhaps converge to those probabilities in the limit.
We now show that reversibility is automatically satisfied by Metropolis-Hastings
algorithms; this explains why the Metropolis acceptance probabilities are defined
as they are. Let q(i, j) = P(ωt = j | θt−1 = i) be the proposal distribution,
which is then accepted with probability min{1, π(j) q(j, i) / (π(i) q(i, j))}. Then, for
i, j ∈ Θ with i ≠ j, we have that
\[ P(i, j) = q(i, j)\, \min\left\{ 1, \frac{\pi(j)\, q(j, i)}{\pi(i)\, q(i, j)} \right\}. \]
It follows that
\[ \pi(i)\, P(i, j) = \pi(i)\, q(i, j)\, \min\left\{ 1, \frac{\pi(j)\, q(j, i)}{\pi(i)\, q(i, j)} \right\} = \min\big( \pi(i)\, q(i, j),\; \pi(j)\, q(j, i) \big). \]
By inspection, this last expression is symmetric in i and j. It follows that π(i) P (i, j) =
π(j) P (j, i) for all i, j ∈ Θ (shown above for i ≠ j; the case i = j is trivial). This
property is described as π being reversible for the chain. (Intuitively, it implies that if
θ0 ∼ π, then P(θ0 = i, θ1 = j) = P(θ0 = j, θ1 = i), i.e. we have the same probability
of starting at i and then moving to j or vice-versa, which is also called being “time
reversible”.)
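On a small finite state space the argument can be verified numerically. The Python sketch below builds the Metropolis-Hastings transition matrix from a target π and a proposal matrix q, then checks both the symmetry π(i) P(i, j) = π(j) P(j, i) and the stationarity condition. The four-state target and the deliberately asymmetric proposal are arbitrary illustrative choices, and the function name is hypothetical.

```python
def metropolis_transition_matrix(pi, q):
    """P(i, j) = q(i, j) * min(1, pi(j) q(j, i) / (pi(i) q(i, j))) for i != j;
    the leftover probability (rejected proposals) stays at i."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and q[i][j] > 0:
                P[i][j] = q[i][j] * min(1.0, pi[j] * q[j][i] / (pi[i] * q[i][j]))
        P[i][i] = 1.0 - sum(P[i])
    return P

pi = [0.1, 0.2, 0.3, 0.4]
q = [[0.0, 0.5, 0.25, 0.25],
     [0.2, 0.0, 0.4, 0.4],
     [1 / 3, 1 / 3, 0.0, 1 / 3],
     [0.6, 0.3, 0.1, 0.0]]
P = metropolis_transition_matrix(pi, q)

n = len(pi)
# Largest violation of the detailed balance identity pi(i) P(i, j) = pi(j) P(j, i).
balance_gap = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) for i in range(n) for j in range(n))
# Largest violation of stationarity: sum_i pi(i) P(i, j) = pi(j).
stationarity_gap = max(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) for j in range(n))
print(balance_gap, stationarity_gap)   # both zero up to floating-point rounding
```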
The importance of reversibility is that it in turn implies stationarity of π. Indeed,
using reversibility, we compute that if θ0 ∼ π, then:
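\[ \begin{aligned}
\mathbf{P}(\theta_1 = j) &= \sum_{i \in \Theta} \mathbf{P}(\theta_0 = i)\, P(i, j) = \sum_{i \in \Theta} \pi(i)\, P(i, j) = \sum_{i \in \Theta} \pi(j)\, P(j, i) \\
&= \pi(j) \sum_{i \in \Theta} P(j, i) = \pi(j),
\end{aligned} \]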
is geometrically ergodic if there is ρ < 1, and M : Θ → [0, ∞] which is Π-a.e. finite,
such that D(ζ, t) ≤ M(ζ) ρ^t for all ζ ∈ Θ and t ∈ N, i.e. such that the convergence
to Π happens exponentially quickly.
One important fact is that if a Markov chain is geometrically ergodic, and if
g : Θ → R is such that E(|g|^(2+a)) < ∞ for some a > 0, then a Central Limit Theorem
(CLT) holds for quantities like $e = \frac{1}{M-B} \sum_{i=B+1}^{M} g(\theta_i)$ (44; 19), i.e. we have the normal
approximation that e ≈ N(u, v). (In fact, if the Markov chain is reversible as above,
then it suffices to take a = 0 (51).) As explained in Section 7.2 above, this is then
key to obtaining confidence intervals and thus more reliable estimates.
Now, if the state space Θ is finite, then assuming irreducibility and aperiodicity,
any Markov chain on Θ is always geometrically ergodic. However, on infinite state
spaces this is not the case. The random-walk Metropolis (RWM) algorithm is known
to be geometrically ergodic essentially (i.e., under a few mild technical conditions) if
and only if π has exponential tails, i.e. there are a, b, c > 0 such that π(θ) ≤ a e^(−b|θ|)
whenever |θ| > c (52; 53). And the Gibbs sampler is known to be geometrically
ergodic for certain models (e.g. 54). But in some cases, geometric ergodicity can be
difficult to ascertain.
In the absence of theoretical convergence bounds, it is difficult to ascertain whether
the chain has reached stationarity or not. One option is to independently run some
large number K of chains that have each been started from the same overdispersed
starting distribution. If M and B are large enough, then we expect the estimators
provided by each chain to be in agreement. For mathematical formalization of this
general principle see e.g. (55) and (56).
1000 iterations. We initialize the chain using draws from an overdispersed starting
distribution centred at the MLE, by setting βinit = β̂MLE + W where W is a vector
of 3 iid random variables each generated from a Student distribution with 2 degrees
of freedom.
We repeated this entire experiment a total of K = 350 times with proposal
variance-covariance matrix Σ1 = 0.6 I3 (Figure 8), and another K = 350 times with
proposal variance-covariance matrix Σ2 = 1.2 I3 (Figure 9). Inspection of the corre-
sponding lists of estimates of the three βi values illustrated in these figures shows that
despite the wide overdispersed starting distributions, the resulting estimates are fairly
closely concentrated around particular values (boxplots, top rows; histograms, bottom
rows) showing fairly good convergence, and are approximately normally distributed
(normal Q-Q plots, middle rows; histograms, bottom rows) showing an approximate
CLT. Choosing larger values of M and B would be expected to result in even more
concentrated values and more normal-looking distributions of the various estimates.
This brief experiment illustrates that even in the absence of theoretical conver-
gence bounds, one can use multiple independent runs from overdispersed starting
distributions to assess the convergence and accuracy and normality of MCMC esti-
mates.
a free quantitative bound thrown in.
For a simple specific example, consider an independence sampler on Θ = [0, ∞)
with target density π(θ) = e^(−θ). If the proposal density is, say, q(θ) = 0.01 e^(−0.01θ), then
q(θ) ≥ 0.01 π(θ) for all θ ∈ Θ, i.e. the above condition holds with δ = 0.01, so the chain
is geometrically ergodic with D(ζ, t) ≤ (1 − δ)^t = (0.99)^t and hence converges in t∗ =
459 iterations (since (0.99)^459 < 0.01). By contrast, if q(θ) = 5e^(−5θ), then the above
condition does not hold for any value δ > 0, so the chain is not geometrically ergodic,
and in fact it has been shown (57) that in this case 4,000,000 ≤ t∗ ≤ 14,000,000,
i.e. it takes at least four million iterations to converge. This illustrates how geometric
ergodicity can sometimes make a tremendous difference between MCMC algorithms
which converge efficiently and those which converge very poorly.
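The figure of 459 iterations for the first proposal follows directly from the bound (1 − δ)^t < 0.01, and can be reproduced with a short Python check. The function name `iterations_to_converge` and the reading of "converged" as the bound dropping below a tolerance of 0.01 are taken from the example above; nothing else is assumed.

```python
import math

def iterations_to_converge(delta, tolerance=0.01):
    """Smallest integer t with (1 - delta)**t < tolerance."""
    t = math.ceil(math.log(tolerance) / math.log(1.0 - delta))
    while (1.0 - delta) ** t >= tolerance:    # guard against a boundary or rounding case
        t += 1
    return t

t_star = iterations_to_converge(0.01)
print(t_star)  # 459
```

A tenfold smaller δ raises the required number of iterations roughly tenfold, which shows how sensitive the bound is to the quality of the proposal.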
References
[1] Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. 2002. Bayesian measures
of model complexity and fit (with discussion). Journal of the Royal Statistical
Society, Series B 64:583–639
[2] Brooks S, Gelman A, Jones GL, Meng XL, eds. 2011. Handbook of Markov chain
Monte Carlo. Chapman & Hall/CRC, Boca Raton, FL
[3] Chen MH, Shao QM, Ibrahim J. 2000. Monte Carlo methods in Bayesian com-
putation. Springer Verlag
[4] Liu JS. 2001. Monte Carlo strategies in scientific computing. Springer
[5] Robert CP, Casella G. 2004. Monte Carlo statistical methods. Springer-Verlag
New York Inc.
[6] Robert CP, Casella G. 2010. Introducing Monte Carlo methods with R. Use R!
New York: Springer
[7] van Dyk D, Meng XL. 2001. The art of data augmentation (with discussion). J.
Comput. Graph. Statist. 10:1–111
[8] Craiu RV, Meng XL. 2005. Multi-process parallel antithetic coupling for forward
and backward MCMC. Ann. Statist. 33:661–697
[9] Craiu RV, Lemieux C. 2007. Acceleration of the multiple-try Metropolis
algorithm using antithetic and stratified sampling. Statistics and Computing
17:109–120
[11] Hastings WK. 1970. Monte Carlo sampling methods using Markov chains and
their applications. Biometrika 57:97–109
[13] Liu J, Liang F, Wong W. 2000. The multiple-try method and local optimization
in Metropolis sampling. Journal of the American Statistical Association 95:121–
134
[14] Casarin R, Craiu RV, Leisen F. 2013. Interacting multiple try algorithms with
different proposal distributions. Statistics and Computing: to appear
[15] Green PJ. 1995. Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika 82:711–732
[16] Richardson S, Green PJ. 1997. On Bayesian analysis of mixtures with an
unknown number of components (with discussion). Journal of the Royal Statistical
Society: Series B 59:731–792
[17] Roberts GO, Gelman A, Gilks WR. 1997. Weak convergence and optimal scaling
of random walk Metropolis algorithms. Ann. Appl. Probab. 7:110–120
[18] Roberts GO, Rosenthal JS. 2001. Optimal scaling for various Metropolis-Hastings
algorithms. Statist. Sci. 16:351–367
[19] Geyer CJ. 1992. Practical Markov chain Monte Carlo (with discussion).
Statistical Science 7:473–483
[20] Geman S, Geman D. 1984. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence PAMI-6:721–741
[22] Tanner MA, Wong WH. 1987. The calculation of posterior distributions by data
augmentation. J. Amer. Statist. Assoc. 82:528–540
[23] Amit Y. 1991. On rates of convergence of stochastic relaxation for Gaussian and
non-Gaussian distributions. J. Multivariate Anal. 38:82–100
[24] Amit Y. 1996. Convergence properties of the Gibbs sampler for perturbations of
Gaussians. Ann. Statist. 24:122–140
[25] Liu JS, Wong WH, Kong A. 1995. Covariance structure and convergence rate of
the Gibbs sampler with various scans. Journal of the Royal Statistical Society,
Series B 57:157–169
[26] Liu JS, Wong WH, Kong A. 1994. Covariance structure of the Gibbs sampler
with applications to the comparisons of estimators and augmentation schemes.
Biometrika 81:27–40
[27] Albert J, Chib S. 1993. Bayesian analysis of binary and polychotomous response
data. Journal of the American Statistical Association 88:669–679
[28] Liu JS, Wu YN. 1999. Parameter expansion for data augmentation. Journal of
the American Statistical Association 94:1264–1274
[29] Meng XL, van Dyk D. 1999. Seeking efficient data augmentation schemes via
conditional and marginal augmentation. Biometrika 86:301–320
[30] Bedard M. 2006. On the robustness of optimal scaling for random walk
Metropolis algorithms. Ph.D. thesis, Department of Statistics, University of Toronto
[32] Roberts GO, Rosenthal JS. 2009. Examples of adaptive MCMC. J. Comput.
Graph. Statist. 18:349–367
[33] Craiu RV, Rosenthal JS, Yang C. 2009. Learn from thy neighbor: Parallel-chain
adaptive and regional MCMC. Journal of the American Statistical Association
104:1454–1466
[34] Bai Y, Craiu RV, Di Narzo A. 2011. Divide and conquer: A mixture-based
approach to regional adaptation for MCMC. J. Comput. Graph. Statist. 20:63–
79
[36] Roberts GO, Rosenthal JS. 2007. Coupling and ergodicity of adaptive Markov
chain Monte Carlo algorithms. J. Appl. Probab. 44:458–475
[37] Propp JG, Wilson DB. 1996. Exact sampling with coupled Markov chains and
applications to statistical mechanics. Random Structures and Algorithms 9:223–
252
[38] Craiu RV, Meng XL. 2011. In Handbook of Markov Chain Monte Carlo, eds.
S Brooks, A Gelman, G Jones, XL Meng. Chapman & Hall/CRC, Boca Raton,
FL, 199–226
[39] Adler SL. 1981. Over-relaxation methods for the Monte Carlo evaluation of the
partition function for multiquadratic actions. Phys. Rev. D 23:2901–2904
[40] Barone P, Frigessi A. 1990. Improving stochastic relaxation for Gaussian random
fields. Probab. Engrg. Inform. Sci. 4:369–389
[41] Neal RM. 1995. Suppressing random walks in Markov chain Monte Carlo using
ordered overrelaxation. Tech. Rep. 9508, University of Toronto
[42] Flegal J, Haran M, Jones G. 2008. Markov chain Monte Carlo: Can we trust the
third significant figure? Statist. Sci. 23:250–260
[43] Meyn SP, Tweedie RL. 1993. Markov Chains and Stochastic Stability.
Communications and Control Engineering Series. London: Springer-Verlag
[44] Tierney L. 1994. Markov chains for exploring posterior distributions. Ann.
Statist. 22:1701–1728
[45] Rosenthal JS. 2001. A review of asymptotic convergence for general state space
Markov chains. Far East J. Theor. Stat. 5:37–50
[46] Roberts GO, Rosenthal JS. 2004. General state space Markov chains and MCMC
algorithms. Probab. Surv. 1:20–71 (electronic)
[47] Rosenthal JS. 1995. Minorization conditions and convergence rates for Markov
chain Monte Carlo. J. Amer. Statist. Assoc. 90:558–566
[49] Rosenthal JS. 2002. Quantitative convergence rates of Markov chains: a simple
account. Electron. Comm. Probab. 7:123–128 (electronic)
[51] Roberts GO, Rosenthal JS. 1997. Geometric ergodicity and hybrid Markov
chains. Electron. Comm. Probab. 2:no. 2, 13–25 (electronic)
[52] Mengersen KL, Tweedie RL. 1996. Rates of convergence of the Hastings and
Metropolis algorithms. The Annals of Statistics 24:101–121
[53] Roberts GO, Tweedie RL. 1996. Geometric convergence and central limit
theorems for multidimensional Hastings and Metropolis algorithms. Biometrika
83:95–110
[54] Papaspiliopoulos O, Roberts GO. 2008. Stability of the Gibbs sampler for
Bayesian hierarchical models. Ann. Statist. 36:95–117
[55] Gelman A, Rubin DB. 1992. Inference from iterative simulation using multiple
sequences. Statistical Science 7:457–472
[56] Brooks SP, Gelman A. 1998. General methods for monitoring convergence of
iterative simulations. J. Comput. Graph. Statist. 7:434–455
Figure 1: Left panels: Scatterplots of 5,000 samples for (β0, β1) obtained using the
RWM with proposal variance (a) Σ1 and (c) Σ2. The points are superimposed on
the 2-dimensional projection of the contour plot for the target π_PR. Right panels:
Autocorrelation plots (ACF against lag) for the three components of the chain for
the RWM with proposal variance (b) Σ1 and (d) Σ2.
Figure 2: Pair plots of (β0, β1, β2) for the samples obtained using the RWM with
proposal variance Σ2.
Figure 3: Left panels: (a) Trajectory of the Gibbs chain for 300 updates for (β0, β1).
(c) Scatterplots of 5,000 samples for (β0, β1) obtained using the variable-at-a-time
MH. The points are superimposed on the 2-dimensional projection of the contour plot
for the target π_PR. Right panels: Autocorrelation plots (ACF against lag) for the
three components of the chain for (b) Gibbs sampler and (d) variable-at-a-time MH.
Figure 4: Trace plots of the first coordinate of RWM, on the same 20-dimensional
target, with acceptance rates both approximately 0.234, where the proposal covariance
matrix Σ is proportional to either (a) the identity I_20 or (b) the target covariance
matrix. The run in (b) clearly mixes much faster.
Figure 5: Top left panel: Scatterplots of 30,000 samples for (β0, β1) obtained using
the RWM with adaptive variance. The points are superimposed on the 2-dimensional
projection of the contour plot for the target π_PR. Top right panel: Autocorrelation
plots (ACF against lag) for the three components of the chain show much lower serial
dependence when compared with the non-adaptive RWM samplers.
Figure 6: (a) The highly multimodal target density π(θ) = (1/2) N(0, 1; θ) + (1/2) N(20, 1; θ).
(b) A somewhat flatter density π_4 = (1/2) N(0, 4^2; θ) + (1/2) N(20, 4^2; θ). (c) An even flatter
density π_10 = (1/2) N(0, 10^2; θ) + (1/2) N(20, 10^2; θ).
Figure 7: Trace plots for the π(θ) = (1/2) N(0, 1; θ) + (1/2) N(20, 1; θ) example. (a) Ordinary
RWM gets stuck in π’s modal region near 20, and cannot find the second modal
region near 0. (b) The θ coordinate of simulated tempering for π. (c) Identifying (red
circles) the θ values of the simulated tempering corresponding to τ = 1 (and hence to
π). (d) The coordinate θ1 for the corresponding parallel tempering algorithm, showing
excellent mixing.
[Two figure pages with identical layout; captions not recovered. For each of the
estimates β̂0, β̂1, β̂2: one panel labelled "Sample Quantiles" and one panel
labelled "Density".]