Introduction To Markov Chain Monte Carlo (MCMC) and Its Role in Modern Bayesian Analysis
Phil Gregory
March 2010
Outline
1. Bayesian primer
5. Conclusions
What is Bayesian Probability Theory (BPT)?
Product rule: p(A,B|C) = p(A|C) p(B|A,C)
                       = p(B|C) p(A|B,C)
Bayes' theorem: p(A|B,C) = p(A|C) p(B|A,C) / p(B|C)
p(Hi|D,I) = p(Hi|I) p(D|Hi,I) / p(D|I)
The left-hand side is the posterior probability that Hi is true, given the new data D and prior information I; the denominator p(D|I) is the normalizing constant.
Every item to the right of the
vertical bar | is assumed to be true
The likelihood p(D|Hi,I), also written as L(Hi), stands for the probability that we would have obtained the data D that we did, if Hi is true.
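To make the notation concrete, here is a minimal numerical sketch (not from the talk; the priors and likelihood values are invented for illustration) of Bayes' theorem applied to a discrete hypothesis space:

```python
import numpy as np

# Illustrative numbers only: two hypotheses with assumed priors p(Hi|I)
# and assumed likelihoods p(D|Hi,I).
prior = np.array([0.5, 0.5])
likelihood = np.array([0.01, 0.08])

# Bayes' theorem: posterior proportional to prior * likelihood,
# normalized by p(D|I) = sum_i p(Hi|I) p(D|Hi,I).
evidence = np.sum(prior * likelihood)          # p(D|I)
posterior = prior * likelihood / evidence      # p(Hi|D,I)

print(posterior)                               # [0.111..., 0.888...]
```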
outline
As a theory of extended logic, BPT can be used to find optimal answers to well-posed scientific questions for a given state of knowledge, in contrast to a numerical recipe approach.
Two basic problems
1. Model selection (discrete hypothesis space)
“Which one of 2 or more models (hypotheses) is most probable
given our current state of knowledge?”
e.g.
• Hypothesis or model M0 asserts that the star has no planets.
• Hypothesis M1 asserts that the star has 1 planet.
• Hypothesis Mi asserts that the star has i planets.
[Figure: probability scale from 0 (false) to 1 (true); the realm of science and inductive logic lies between these two deductive extremes.]
2. Parameter estimation (continuous hypothesis space)
Calculation of a simple likelihood p(D|M,X,I)
Let d_i represent the i-th measured data value. We model d_i by
d_i = f_i(X) + e_i,
where f_i(X) is the model prediction for the i-th data value for the current choice of parameters X, and e_i represents the error component in the measurement.
[Figure: Gaussian probability density p(D_i|M,X,I) centered on the model prediction f_i(X); the probability of datum d_i is proportional to the height of the curve at a distance e_i from the prediction.]
For independent Gaussian errors the likelihood is
p(D|M,X,I) = (2π)^(-N/2) [Π_{i=1}^{N} σ_i]^(-1) exp[ -Σ_{i=1}^{N} (d_i - f_i(X))² / (2σ_i²) ]
The exponent is -χ²/2, where
χ² = Σ_{i=1}^{N} (d_i - f_i(X))² / σ_i²
is the familiar χ² statistic used in least-squares.
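A minimal sketch of how this likelihood might be evaluated in practice; the function and argument names are assumptions, not part of the talk:

```python
import numpy as np

def log_likelihood(X, d, sigma, model):
    """Gaussian log-likelihood ln p(D|M,X,I) for independent errors.

    X     : model parameters (passed through to `model`)
    d     : measured data values d_i
    sigma : known noise standard deviations sigma_i (same length as d)
    model : function returning the predictions f_i(X)
    """
    f = model(X)
    chi2 = np.sum((d - f) ** 2 / sigma ** 2)      # the familiar chi-squared
    norm = -0.5 * len(d) * np.log(2 * np.pi) - np.sum(np.log(sigma))
    return norm - 0.5 * chi2
```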
Prior for a scale parameter P:
p(P|M,I) dP = dP / [P ln(P_max/P_min)]
p(ln P|M,I) d ln P = d ln P / ln(P_max/P_min)
or equivalently, for a frequency parameter f,
p(ln f|M,I) d ln f = d ln f / ln(f_max/f_min)
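A short illustrative sketch of evaluating and sampling such a scale-invariant prior; the bounds P_min and P_max below are assumed placeholder values:

```python
import numpy as np

P_min, P_max = 1.0, 1.0e4            # assumed prior range for the scale parameter P
log_range = np.log(P_max / P_min)

def log_prior(P):
    """ln p(P|M,I) for the prior dP / [P ln(P_max/P_min)]."""
    if P_min <= P <= P_max:
        return -np.log(P) - np.log(log_range)
    return -np.inf                    # zero probability outside the prior range

# Drawing samples: uniform in ln P is equivalent to this prior on P.
rng = np.random.default_rng(1)
P_samples = np.exp(rng.uniform(np.log(P_min), np.log(P_max), size=1000))
```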
Data
To test this prediction, a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained. The spectrometer has 64 frequency channels.
Questions of interest
p(D|M1,T,I) = (2π)^(-N/2) σ^(-N) exp[ -Σ_{i=1}^{N} (d_i - T f_i)² / (2σ²) ]
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with single parameter,
α, for 4 simulated data sets of different size ranging from N = 5 to N = 80.
The N = 5 case has the broadest distribution and exhibits 4 maxima.
Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.
Integration not minimization
p(T|D,M1,I) = ∫dν0 ∫dσ_L ∫ds p(T, ν0, σ_L, s|D, M1, I)
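With MCMC samples of the joint posterior in hand, this marginalization reduces to simply ignoring the nuisance-parameter columns of the chain. A minimal sketch, with the file name and column ordering assumed for illustration:

```python
import numpy as np

# Hypothetical array of post-burn-in MCMC samples of the joint posterior,
# one row per iteration; columns assumed ordered as (T, nu0, sigma_L, s).
chain = np.load("chain.npy")

T_samples = chain[:, 0]     # marginalizing = ignoring the other columns

# Histogram estimate of the marginal posterior p(T|D,M1,I),
# plus a summary mean and 68% credible interval.
density, edges = np.histogram(T_samples, bins=50, density=True)
T_mean = T_samples.mean()
T_68 = np.percentile(T_samples, [16, 84])
```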
Chapters
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)
The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time-independent.
Compute the Metropolis-Hastings acceptance ratio
r = [p(Y|D,I) q(X_t|Y)] / [p(X_t|D,I) q(Y|X_t)]
and draw a uniform random number U in (0,1).
- If U ≤ r, then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
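A minimal sketch of one such step in Python, assuming a symmetric Gaussian proposal (so the q terms cancel) and a user-supplied log_posterior function; the names and step size are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(X_t, log_posterior, step=0.1):
    """One Metropolis step with a symmetric Gaussian proposal, so q(X_t|Y)/q(Y|X_t) = 1."""
    Y = X_t + step * rng.normal(size=X_t.shape)      # proposal Y ~ q(Y|X_t)
    log_r = log_posterior(Y) - log_posterior(X_t)    # ln r (proposal terms cancel)
    if np.log(rng.uniform()) <= log_r:               # U <= r, tested in log space
        return Y                                     # accept: X_{t+1} = Y
    return X_t                                       # reject: X_{t+1} = X_t
```

Iterating this step produces the chain {X_t}; after burn-in, the samples are distributed according to the target posterior.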
[Figure: MCMC parameter samples (P1 versus P2) for a Kepler model with 2 planets.]
Target posterior: p({X_α}|D,M,I)
Related algorithms:
• Parallel tempering
• Simulated annealing
• Genetic algorithm
• Differential evolution
• Quasi-Monte Carlo
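Parallel tempering, which the conclusions below rely on, runs several chains against tempered versions of the posterior and occasionally swaps their states. A compact sketch under simplifying assumptions (one scalar parameter per chain, an assumed β ladder, user-supplied log_likelihood and log_prior):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.array([1.0, 0.5, 0.25, 0.1])   # assumed tempering ladder (beta = 1 is the target)

def pt_sweep(X, log_likelihood, log_prior, step=0.1):
    """One parallel-tempering sweep for scalar states X[j], one per beta.

    Chain j targets the tempered posterior  prior(X) * likelihood(X)**beta_j.
    """
    # Within-chain Metropolis updates.
    for j, beta in enumerate(betas):
        Y = X[j] + step * rng.normal()
        log_r = (log_prior(Y) + beta * log_likelihood(Y)
                 - log_prior(X[j]) - beta * log_likelihood(X[j]))
        if np.log(rng.uniform()) <= log_r:
            X[j] = Y
    # Propose a swap between a random pair of adjacent chains.
    j = rng.integers(len(betas) - 1)
    log_r = (betas[j] - betas[j + 1]) * (log_likelihood(X[j + 1]) - log_likelihood(X[j]))
    if np.log(rng.uniform()) <= log_r:
        X[j], X[j + 1] = X[j + 1], X[j]
    return X
```

Chains at small β explore the parameter space broadly while the β = 1 chain samples the full posterior; the swaps let the cold chain escape local maxima.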
Calculation of p(D|M0,I)
Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write
p(D|M0,s,I) = (2π)^(-N/2) (σ² + s²)^(-N/2) exp[ -Σ_{i=1}^{N} (d_i - 0)² / (2(σ² + s²)) ]
Model parameters:
{v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K)}
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to obtain the posterior probability density function (PDF) for each model parameter.
2. Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters, MCMC
techniques are really most competitive for models with a much larger
number of parameters m ≥ 15.
3. Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution.
This is fine for parameter estimation.
For model selection we need to determine the proportionality constant in order to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m; we need two to know whether either is valid.
One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
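A hedged sketch of that idea via thermodynamic integration, ln p(D|M,I) = ∫₀¹ ⟨ln L⟩_β dβ, approximated from the per-chain average log-likelihoods; the β ladder and averages below are placeholder values, not results from the talk:

```python
import numpy as np

# betas    : inverse temperatures of the parallel tempering chains (0 < beta <= 1)
# mean_lnL : post-burn-in average of ln p(D|X,M,I) within each chain
# Both would come from the actual MCMC run; the values below are placeholders only.
betas = np.array([1e-4, 1e-3, 1e-2, 0.1, 0.3, 0.6, 1.0])
mean_lnL = np.array([-260.0, -240.0, -190.0, -150.0, -135.0, -128.0, -125.0])

# Thermodynamic integration: ln p(D|M,I) = integral_0^1 <ln L>_beta d(beta),
# approximated by the trapezoidal rule over the beta ladder.
ln_marginal_likelihood = np.trapz(mean_lnL, betas)
print(ln_marginal_likelihood)
```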
For a copy of this talk please Google Phil Gregory
Let θ_ij represent the i-th iteration of the j-th of m independent simulations. Extract the last h post-burn-in iterations from each simulation.
Mean within-chain variance:
W = [1 / (m(h-1))] Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_ij - θ̄_j)²
Between-chain variance:
B = [h / (m-1)] Σ_{j=1}^{m} (θ̄_j - θ̿)²
Estimated variance:
V̂(θ) = (1 - 1/h) W + (1/h) B
Gelman-Rubin statistic:
R̂ = sqrt( V̂(θ) / W )
The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05).
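A small sketch of this computation for one parameter, assuming `chains` is an (m, h) array of post-burn-in samples from the m simulations:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for an (m, h) array: m chains, h post-burn-in draws each."""
    m, h = chains.shape
    chain_means = chains.mean(axis=1)                 # theta_bar_j
    grand_mean = chain_means.mean()                   # overall mean of the chain means

    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))   # within-chain variance
    B = h * np.sum((chain_means - grand_mean) ** 2) / (m - 1)          # between-chain variance

    V_hat = (1.0 - 1.0 / h) * W + B / h               # estimated variance V_hat(theta)
    return np.sqrt(V_hat / W)                         # should be close to 1.0 (e.g., < 1.05)
```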