Introduction To Markov Chain Monte Carlo (MCMC) and Its Role in Modern Bayesian Analysis


Introduction to Markov chain Monte Carlo (MCMC)

and its role in modern Bayesian analysis

Phil Gregory

University of British Columbia

March 2010
Outline

1. Bayesian primer
2. Spectral line problem
   - Challenge of nonlinear models
3. Introduction to Markov chain Monte Carlo (MCMC)
   - Parallel tempering
   - Hybrid MCMC
4. Mathematica MCMC demonstration
5. Conclusions

What is Bayesian Probability Theory (BPT)?

BPT = a theory of extended logic

Deductive logic is based on axiomatic knowledge.

In science we never know that any theory of nature is true, because our reasoning is based on incomplete information.

Our conclusions are at best probabilities.

Any extension of logic to deal with situations of incomplete information (the realm of inductive logic) requires a theory of probability.

A new perception of probability has arisen in recognition that the mathematical rules of probability are not merely rules for manipulating random variables.

They are now recognized as valid principles of logic for conducting inference about any hypothesis of interest.

This view of "Probability Theory as Logic" was championed in the late 20th century by E. T. Jaynes ("Probability Theory: The Logic of Science", Cambridge University Press, 2003).

It is also commonly referred to as Bayesian Probability Theory, in recognition of the work of the 18th-century English clergyman and mathematician Thomas Bayes.

Logic is concerned with the truth of propositions.

A proposition asserts that something is true.

We will need to consider compound propositions like

A,B which asserts that propositions A and B are true

A,B|C which asserts that propositions A and B are true, given that proposition C is true

Rules for manipulating probabilities

Sum rule: $p(A|C) + p(\bar{A}|C) = 1$

Product rule: $p(A,B|C) = p(A|C)\,p(B|A,C) = p(B|C)\,p(A|B,C)$

Bayes' theorem: $p(A|B,C) = \dfrac{p(A|C)\,p(B|A,C)}{p(B|C)}$

How to proceed in a Bayesian analysis?

Write down Bayes' theorem, identify the terms and solve.

$$p(H_i|D,I) = \frac{p(H_i|I)\,p(D|H_i,I)}{p(D|I)}$$

- $p(H_i|D,I)$: posterior probability that $H_i$ is true, given the new data D and prior information I
- $p(H_i|I)$: prior probability
- $p(D|H_i,I)$: likelihood
- $p(D|I)$: normalizing constant

Every item to the right of the vertical bar | is assumed to be true.

The likelihood $p(D|H_i,I)$, also written as $\mathcal{L}(H_i)$, stands for the probability that we would have gotten the data D that we did, if $H_i$ is true.

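For a discrete hypothesis space, "write down Bayes' theorem, identify the terms and solve" reduces to normalizing prior × likelihood. The sketch below assumes three hypotheses with made-up priors and likelihoods; the numbers and the function name `posterior` are illustrative only and do not come from the talk.

```python
def posterior(priors, likelihoods):
    """Return p(H_i|D,I) for a discrete set of hypotheses.

    priors[i]      = p(H_i|I)
    likelihoods[i] = p(D|H_i,I)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # p(D|I), the normalizing constant
    return [j / evidence for j in joint]


# Hypothetical numbers for three competing hypotheses (assumption).
priors = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.05, 0.20]
post = posterior(priors, likelihoods)
print(post)
```

Although the third hypothesis starts with the smallest prior, its larger likelihood gives it the largest posterior probability.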
As a theory of extended logic BPT can be used to find optimal
answers to well posed scientific questions for a given state of
knowledge, in contrast to a numerical recipe approach.
Two basic problems
1. Model selection (discrete hypothesis space)
“Which one of 2 or more models (hypotheses) is most probable
given our current state of knowledge?”
e.g.
• Hypothesis or model M0 asserts that the star has no planets.
• Hypothesis M1 asserts that the star has 1 planet.
• Hypothesis Mi asserts that the star has i planets.

2. Parameter estimation (continuous hypothesis space)
"Assuming the truth of M1, solve for the probability density distribution for each of the model parameters based on our current state of knowledge."
e.g.
• Hypothesis H asserts that the orbital period is between P and P+dP.

Significance of this development

Probabilities are commonly quantified by a real number between 0 and 1.

[Figure: probability scale running from 0 (false) to 1 (true); the interval in between is the realm of science and inductive logic.]

The end-points, corresponding to absolutely false and absolutely true, are simply the extreme limits of this infinity of real numbers.

Bayesian probability theory spans the whole range.

Deductive logic is just a special case of Bayesian probability theory in the idealized limit of complete information.

Calculation of a simple likelihood $p(D|M,X,I)$

Let $d_i$ represent the ith measured data value. We model $d_i$ by

$$d_i = f_i(X) + e_i$$

where $f_i(X)$ is the model prediction for the ith data value for the current choice of parameters X, and $e_i$ represents the error component in the measurement.

Since M, X is assumed to be true, if it were not for the error $e_i$, $d_i$ would equal the model prediction $f_i$.

Now suppose prior information I indicates that $e_i$ has a Gaussian probability distribution. Then

$$p(D_i|M,X,I) = \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left[-\frac{e_i^2}{2\sigma_i^2}\right] = \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left[-\frac{(d_i - f_i(X))^2}{2\sigma_i^2}\right]$$

[Figure: Gaussian error curve. Probability density $p(D_i|M,X,I)$ versus signal strength, showing the measured value $d_i$ a distance $e_i$ from the predicted value $f_i(X)$.]

The probability of getting a data value $d_i$ a distance $e_i$ away from the predicted value $f_i$ is proportional to the height of the Gaussian error curve at that location.

Calculation of a simple likelihood $p(D|M,X,I)$

For independent data the likelihood for the entire data set $D = (D_1, D_2, \ldots, D_N)$ is the product of N Gaussians.

$$p(D|M,X,I) = (2\pi)^{-N/2} \left\{\prod_{i=1}^{N} \sigma_i^{-1}\right\} \exp\left[-0.5 \sum_{i=1}^{N} \frac{(d_i - f_i(X))^2}{\sigma_i^2}\right]$$

The sum in the exponent is the familiar $\chi^2$ statistic used in least-squares.

Maximizing the likelihood corresponds to minimizing $\chi^2$.

Recall: Bayesian posterior ∝ prior × likelihood

Thus, only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.

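The link between the Gaussian likelihood and $\chi^2$ can be checked numerically. The following is a minimal sketch, assuming a hypothetical one-parameter model $f_i(X) = X$ (a constant) and four made-up data values with unit errors; none of these numbers come from the talk.

```python
import numpy as np


def chi_squared(d, f, sigma):
    """Sum of squared, sigma-weighted residuals."""
    return np.sum(((d - f) / sigma) ** 2)


def log_likelihood(d, f, sigma):
    """ln p(D|M,X,I) for independent Gaussian errors."""
    n = d.size
    return (-0.5 * n * np.log(2.0 * np.pi)
            - np.sum(np.log(sigma))
            - 0.5 * chi_squared(d, f, sigma))


# Hypothetical data and a constant model f_i(X) = X (assumption).
d = np.array([1.2, 0.8, 1.1, 0.9])
sigma = np.ones_like(d)
candidates = np.linspace(0.0, 2.0, 201)

chi2 = np.array([chi_squared(d, x, sigma) for x in candidates])
logl = np.array([log_likelihood(d, x, sigma) for x in candidates])

best_by_chi2 = candidates[np.argmin(chi2)]
best_by_like = candidates[np.argmax(logl)]
print(best_by_chi2, best_by_like)
```

Both searches pick the same grid point because the log-likelihood differs from $-\chi^2/2$ only by terms that do not depend on X.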
Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown orbital period P is very large, from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars).

Suppose we assume a uniform prior probability density for the P parameter. This would imply that we believed that it was ~$10^4$ times more probable that the true period was in the upper decade ($10^4$ to $10^5$ d) of the prior range than in the lowest decade from 1 to 10 d.

$$\frac{\int_{10^4}^{10^5} p(P|M,I)\,dP}{\int_{1}^{10} p(P|M,I)\,dP} = 10^4$$

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance or equal probability per decade. The Jeffreys prior has this scale invariant property.

Jeffreys prior (scale invariant)

$$p(P|M,I)\,dP = \frac{dP}{P \ln(P_{max}/P_{min})}$$

or equivalently

$$p(\ln P|M,I)\,d\ln P = \frac{d\ln P}{\ln(P_{max}/P_{min})}$$

Equal probability per decade:

$$\int_{1}^{10} p(P|M,I)\,dP = \int_{10^4}^{10^5} p(P|M,I)\,dP$$

Actually, there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged.

$$p(\ln f|M,I)\,d\ln f = \frac{d\ln f}{\ln(f_{max}/f_{min})}$$

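The contrast between the two priors can be checked with a few lines of arithmetic. This sketch assumes an illustrative prior range of 1 to $10^5$ d and compares the probability assigned to the upper decade with that assigned to the lowest decade; the function names are hypothetical.

```python
import math


def uniform_prob(a, b, p_min, p_max):
    """Probability of a <= P <= b under a uniform prior on [p_min, p_max]."""
    return (b - a) / (p_max - p_min)


def jeffreys_prob(a, b, p_min, p_max):
    """Probability of a <= P <= b under p(P) = 1 / (P ln(p_max / p_min))."""
    return math.log(b / a) / math.log(p_max / p_min)


P_MIN, P_MAX = 1.0, 1.0e5  # days; illustrative range (assumption)

ratio_uniform = (uniform_prob(1e4, 1e5, P_MIN, P_MAX)
                 / uniform_prob(1.0, 10.0, P_MIN, P_MAX))
ratio_jeffreys = (jeffreys_prob(1e4, 1e5, P_MIN, P_MAX)
                  / jeffreys_prob(1.0, 10.0, P_MIN, P_MAX))
print(ratio_uniform, ratio_jeffreys)
```

The uniform prior favors the upper decade by a factor of $10^4$, while the Jeffreys prior gives both decades the same probability.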

Integration not minimization

A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization.

However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, such as Monte Carlo simulations.

We will shortly discuss an efficient method for integrating over a large parameter space called Markov chain Monte Carlo (MCMC).

End of Bayesian primer


Simple Spectral Line Problem

Background (prior) information:

Two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately, it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.

Data

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.

Simple Spectral Line Problem

The predicted line shape has the form $T\,f_i(\nu_0, \sigma_L)$, where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency, $\nu_i$, is in units of the spectrometer channel number and the line center frequency is $\nu_0$.

In this version of the problem T, $\nu_0$ and $\sigma_L$ are all unknowns with prior limits:
T = 0.0 – 100.0
$\nu_0$ = 1 – 44
$\sigma_L$ = 0.5 – 4.0

[Figure: line profile for a given $\nu_0$, $\sigma_L$]

Extra noise term, $e_{0i}$

We will represent the measured data by the equation
$$d_i = f_i + e_i + e_{0i}$$
$d_i$ = ith measured data value
$f_i$ = model prediction
$e_i$ = component of $d_i$ which arises from measurement errors
$e_{0i}$ = any additional unknown measurement errors plus any real signal in the data that cannot be explained by the model prediction $f_i$

In the absence of detailed knowledge of the sampling distribution for $e_{0i}$, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e., maximally noncommittal about the information we don't have).

We therefore adopt a Gaussian distribution for $e_{0i}$ with a variance $s^2$. Thus the combination $e_i + e_{0i}$ has a Gaussian distribution with variance = $\sigma_i^2 + s^2$.

In a Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s = 0 – 0.5 × data range.

Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum,
1) what do we conclude about the relative probabilities of the two competing theories, and
2) what is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0, no line exists"
M1 ≡ "Model 1, line exists"

M1 has 3 unknown parameters, the line temperature T, $\nu_0$ and $\sigma_L$, and one nuisance parameter s.

M0 has no unknown parameters, and one nuisance parameter s.

Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

$$p(D|M_1,T,I) = (2\pi)^{-N/2}\,\sigma^{-N} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - T f_i)^2}{2\sigma^2}\right]$$

Our new likelihood for the more complicated model with unknown variables T, $\nu_0$, $\sigma_L$, s is

$$p(D|M_1,T,\nu_0,\sigma_L,s,I) = (2\pi)^{-N/2}\,(\sigma^2 + s^2)^{-N/2} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - T f_i(\nu_0,\sigma_L))^2}{2(\sigma^2 + s^2)}\right]$$

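A minimal sketch of how this likelihood might be coded is given below. The Gaussian form chosen for the line profile $f_i(\nu_0, \sigma_L)$ is an assumption made for illustration, as are the function names and the simulated spectrum (64 channels, σ = 1 mK, a made-up 5 mK line at channel 37).

```python
import numpy as np


def line_profile(nu, nu0, sigma_l):
    """Assumed unit-amplitude Gaussian line shape f_i(nu0, sigma_L)."""
    return np.exp(-((nu - nu0) ** 2) / (2.0 * sigma_l ** 2))


def log_like_m1(d, nu, T, nu0, sigma_l, s, sigma=1.0):
    """ln p(D|M1,T,nu0,sigma_L,s,I) with effective variance sigma^2 + s^2."""
    var = sigma ** 2 + s ** 2
    resid = d - T * line_profile(nu, nu0, sigma_l)
    return -0.5 * d.size * np.log(2.0 * np.pi * var) - np.sum(resid ** 2) / (2.0 * var)


def log_like_m0(d, s, sigma=1.0):
    """ln p(D|M0,s,I): the same form with the model prediction set to zero."""
    var = sigma ** 2 + s ** 2
    return -0.5 * d.size * np.log(2.0 * np.pi * var) - np.sum(d ** 2) / (2.0 * var)


# Simulated spectrum (all values hypothetical).
rng = np.random.default_rng(0)
nu = np.arange(1, 65, dtype=float)
d = 5.0 * line_profile(nu, 37.0, 2.0) + rng.normal(0.0, 1.0, nu.size)

ll_true = log_like_m1(d, nu, T=5.0, nu0=37.0, sigma_l=2.0, s=0.0)
ll_wrong = log_like_m1(d, nu, T=5.0, nu0=10.0, sigma_l=2.0, s=0.0)
print(ll_true, ll_wrong)
```

Working with the logarithm of the likelihood avoids numerical underflow when N is large. Setting T = 0 in `log_like_m1` reproduces `log_like_m0`, the no-line model.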
Simple nonlinear model with a single parameter α

[Figure: posterior density versus α for the four data sets, with the true value marked]

The Bayesian posterior density for a nonlinear model with a single parameter, α, for 4 simulated data sets of different size ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima.

Asymptotic theory says that the maximum likelihood estimator becomes less biased, more normally distributed and of smaller variance as the sample size becomes larger.

Integration not minimization

In least-squares analysis we minimize some statistic like $\chi^2$.

In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for the line strength T, we need to integrate the joint posterior over all the other parameters.

$$p(T|D,M_1,I) = \int d\nu_0 \int d\sigma_L \int ds\; p(T,\nu_0,\sigma_L,s|D,M_1,I)$$

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters.

Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, such as Monte Carlo simulations.

We will shortly discuss an efficient method for integrating over a large parameter space called Markov chain Monte Carlo (MCMC).

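In low dimensions the marginalization integral can be done by brute force on a grid, which makes the operation concrete. The sketch below assumes a hypothetical two-parameter joint posterior (a correlated Gaussian in a parameter of interest `a` and a nuisance parameter `b`) and integrates out `b`.

```python
import numpy as np

# Hypothetical joint posterior on a grid: unit variances, correlation rho (assumption).
a = np.linspace(-5.0, 5.0, 401)
b = np.linspace(-5.0, 5.0, 401)
A, B = np.meshgrid(a, b, indexing="ij")
rho = 0.8
joint = np.exp(-(A**2 - 2.0 * rho * A * B + B**2) / (2.0 * (1.0 - rho**2)))

da, db = a[1] - a[0], b[1] - b[0]
joint /= joint.sum() * da * db          # normalize the joint posterior

marginal_a = joint.sum(axis=1) * db     # integrate over the nuisance parameter b
mean_a = np.sum(a * marginal_a) * da
var_a = np.sum((a - mean_a) ** 2 * marginal_a) * da
print(mean_a, var_a)
```

The cost of such a grid grows exponentially with the number of parameters. With MCMC samples the same marginal is obtained by simply histogramming the values of one parameter and ignoring the others.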
Numerical tools for Bayesian model fitting

Data D, Model M, Prior I → Posterior

Linear models (uniform priors) (chapter 10)
- Posterior has a single peak (multi-dimensional Gaussian)
- Parameters given by the normal equations of linear least-squares
- No integration required; solution very fast using linear algebra

Nonlinear models + linear models (non-uniform priors)
- Posterior may have multiple peaks
- Brute force integration: for some parameters analytic integration is sometimes possible
- Asymptotic approx.'s (chapter 11): peak finding algorithms, (1) Levenberg-Marquardt, (2) Simulated annealing, (3) Genetic algorithm; Laplace approx.'s
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature
- High dimensions: MCMC (chapter 12)


Chapters
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica based support software available.

Introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.

MCMC for integration in large parameter spaces

Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated $X_{t+1}$, depends on the previous sample $X_t$ according to an entity called the transition probability or kernel, $p(X_{t+1}|X_t)$. The transition kernel is assumed to be time independent.

Starting point: Metropolis-Hastings MCMC algorithm

$p(X|D,M,I)$ = target posterior probability distribution
(X represents the set of model parameters)

1. Choose $X_0$, an initial location in the parameter space. Set t = 0.

2. Repeat:
- Obtain a new sample Y from a proposal distribution $q(Y|X_t)$ that is easy to evaluate. $q(Y|X_t)$ can have almost any form. I use a Gaussian proposal distribution, i.e., a normal distribution $N(X_t, \sigma)$.
- Sample a Uniform(0, 1) random variable U.
- If $U \le \dfrac{p(Y|D,I)}{p(X_t|D,I)} \times \dfrac{q(X_t|Y)}{q(Y|X_t)}$, then set $X_{t+1} = Y$; otherwise set $X_{t+1} = X_t$. The second factor equals 1 for a symmetric proposal distribution like a Gaussian.
- Increment t.

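The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's Mathematica code: it assumes a symmetric Gaussian proposal (so the proposal ratio drops out), works with log probabilities for numerical stability, and uses a made-up 1-D Gaussian target. The names `metropolis` and `log_post` are hypothetical.

```python
import numpy as np


def metropolis(log_post, x0, prop_sigma, n_iter, rng):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal.

    log_post   : function returning ln of the (unnormalized) target posterior
    x0         : initial parameter vector X_0
    prop_sigma : proposal standard deviation (scalar or one per parameter)
    Returns the chain, shape (n_iter + 1, ndim), and the acceptance rate.
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter + 1, x.size))
    chain[0] = x
    lp = log_post(x)
    accepted = 0
    for t in range(n_iter):
        y = x + prop_sigma * rng.standard_normal(x.size)  # Y ~ N(X_t, sigma)
        lp_y = log_post(y)
        # Symmetric proposal: q(X_t|Y) / q(Y|X_t) = 1, so only the
        # posterior ratio enters the acceptance test (done in logs).
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp = y, lp_y
            accepted += 1
        chain[t + 1] = x
    return chain, accepted / n_iter


def log_post(x):
    """Hypothetical target: 1-D Gaussian with mean 3 and sigma 0.5."""
    return -0.5 * np.sum(((x - 3.0) / 0.5) ** 2)


rng = np.random.default_rng(1)
chain, acc_rate = metropolis(log_post, x0=[0.0], prop_sigma=1.0,
                             n_iter=20000, rng=rng)
samples = chain[2000:, 0]  # discard the burn-in period
print(samples.mean(), samples.std(), acc_rate)
```

Only ratios of the posterior appear in the acceptance test, which is why the normalizing constant $p(D|I)$ is never needed for parameter estimation.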

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians indicated by the contours.

[Figure: three toy MCMC runs with acceptance rates of 95%, 63% and 4%]

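The dependence of the acceptance rate on the proposal width is easy to reproduce. The sketch below assumes a target made of two 2-dimensional unit Gaussians centered at (-2, -2) and (2, 2) and three arbitrary proposal σ values; these settings are illustrative and are not the ones behind the talk's figures.

```python
import numpy as np


def log_post(x):
    """Hypothetical target: equal mixture of two 2-D unit Gaussians."""
    left = -0.5 * np.sum((x + 2.0) ** 2)
    right = -0.5 * np.sum((x - 2.0) ** 2)
    return np.logaddexp(left, right)


def acceptance_rate(prop_sigma, n_iter=5000, seed=0):
    """Run random-walk Metropolis and return the fraction of accepted proposals."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    lp = log_post(x)
    accepted = 0
    for _ in range(n_iter):
        y = x + prop_sigma * rng.standard_normal(2)
        lp_y = log_post(y)
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp = y, lp_y
            accepted += 1
    return accepted / n_iter


rates = {sigma: acceptance_rate(sigma) for sigma in (0.05, 1.0, 20.0)}
for sigma, rate in rates.items():
    print(f"proposal sigma = {sigma:5.2f}  acceptance rate = {rate:.2f}")
```

A very small σ gives a high acceptance rate but tiny, highly correlated steps; a very large σ proposes big jumps that are almost always rejected. Efficient sampling lies in between.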
[Figure: MCMC parameter samples, post burn-in, for a Kepler model with 2 planets; panels show the periods P1 and P2.]

Source: P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS, 374, 1321, 2007.

Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

$$p(X|D,M,\beta,I) = p(X|M,I)\,p(D|X,M,I)^{\beta}, \qquad 0 < \beta \le 1$$

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0

β = 1 corresponds to our desired target distribution. The others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations are chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations.

In the low β simulations, radically different configurations can arise, whereas at higher β, a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation.

Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.

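A compact sketch of the tempering ladder and swap step follows. It is an illustration under stated assumptions, not the author's implementation: the target is a made-up 1-D likelihood with two narrow peaks under a uniform prior, every chain uses the same proposal σ, and the swap is accepted with the standard rule min{1, exp[(β_k − β_{k+1})(ln L_{k+1} − ln L_k)]}, which the slide does not spell out.

```python
import numpy as np


def log_prior(x):
    """Hypothetical uniform prior on [-10, 10]."""
    return 0.0 if -10.0 <= x <= 10.0 else -np.inf


def log_like(x):
    """Hypothetical bimodal likelihood: narrow peaks at x = -2 and x = +2."""
    return np.logaddexp(-0.5 * ((x + 2.0) / 0.3) ** 2,
                        -0.5 * ((x - 2.0) / 0.3) ** 2)


def parallel_tempering(betas, n_iter, prop_sigma, swap_every, rng):
    """Return the samples of the beta = 1 chain (the last entry of betas)."""
    n = len(betas)
    x = np.full(n, -2.0)                       # start every chain in the left peak
    ll = np.array([log_like(v) for v in x])
    samples = np.empty(n_iter)
    for t in range(n_iter):
        for k in range(n):                     # one Metropolis step per chain
            y = x[k] + prop_sigma * rng.standard_normal()
            if not np.isfinite(log_prior(y)):  # outside the prior: reject
                continue
            ll_y = log_like(y)
            if np.log(rng.uniform()) <= betas[k] * (ll_y - ll[k]):
                x[k], ll[k] = y, ll_y
        if (t + 1) % swap_every == 0:          # propose swapping an adjacent pair
            k = rng.integers(n - 1)
            log_swap = (betas[k] - betas[k + 1]) * (ll[k + 1] - ll[k])
            if np.log(rng.uniform()) <= log_swap:
                x[[k, k + 1]] = x[[k + 1, k]]
                ll[[k, k + 1]] = ll[[k + 1, k]]
        samples[t] = x[-1]
    return samples


betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]
rng = np.random.default_rng(2)
samples = parallel_tempering(betas, n_iter=20000, prop_sigma=0.5,
                             swap_every=5, rng=rng)
frac_right = np.mean(samples > 0.0)
print(frac_right)
```

A single β = 1 chain with this proposal would be very unlikely to cross the deep valley between the peaks; the flatter low-β chains cross it easily and pass the new configurations up the ladder through swaps.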

MCMC Technical Difficulties

1. Deciding on the burn-in period.

2. Making a good choice for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's. This can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters.
Ans: transform parameter set or differential MCMC

4. Deciding how many iterations are sufficient.
Ans: use Gelman-Rubin statistic

5. Deciding on a good choice of tempering levels (β values).

My involvement: since 2002, ongoing
development of a general Bayesian Nonlinear
model fitting program.
My latest hybrid Markov chain Monte Carlo (MCMC)
nonlinear model fitting algorithm incorporates:
-Parallel tempering
-Simulated annealing
-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications:
-precision radial velocity data – (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing.
I run on an 8 core PC and achieve a speed-up of 7 times.

Blind searches with hybrid MCMC

Parallel tempering
Simulated annealing
Genetic algorithm
Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in $\chi^2$. By combining all four in a hybrid MCMC we greatly increase the probability of realizing this goal.

Inputs to the hybrid parallel tempering MCMC nonlinear model fitting program:
- Data D, Model M, Prior information I
- Target posterior $p(\{X_\alpha\}|D,M,I)$
- n = no. of iterations
- $\{X_\alpha\}_{init}$ = start parameters
- $\{\sigma_\alpha\}_{init}$ = start proposal σ's
- $\{\beta\}$ = tempering levels

Outputs:
- $\{X_\alpha\}$ iterations
- Control system diagnostics
- Summary statistics
- $\{X_\alpha\}$ marginals
- Best fit model & residuals
- $\{X_\alpha\}$ 68.3% credible regions
- $p(D|M,I)$ marginal likelihood for model comparison

Adaptive Two Stage Control System

1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.

2) Monitors MCMC for emergence of significantly improved parameter set and resets MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Adaptive Hybrid MCMC

Output at each iteration from 8 parallel tempering Metropolis chains (β = 1/T), linked by parallel tempering swap operations. For each of β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 the output is: parameters, logprior + β × loglike, logprior + loglike.

2 stage proposal σ control system:
- Anneal Gaussian proposal σ's. Error signal = (actual joint acceptance rate – 0.25). Effectively defines the burn-in interval.
- Refine & update Gaussian proposal σ's.

Monitor for parameters with peak probability:
- Peak parameter set: if (logprior + loglike) > previous best by a threshold, then update and reset burn-in.

Genetic algorithm:
- Every 10th iteration perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.


Calculation of $p(D|M_0,I)$

Model M0 assumes the spectrum is consistent with noise and has no free parameters so we can write

$$p(D|M_0,s,I) = (2\pi)^{-N/2}\,(\sigma^2 + s^2)^{-N/2} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)}\right]$$

Model selection results

Bayes factor = $4.5 \times 10^4$

Methanol emission in the Sgr A* environment

M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in 96 GHz band

Parameters: {v (km s⁻¹), FWHM (km s⁻¹), $T_J$ (K), $(N/Z)_A$ (cm⁻²), $(N/Z)_A$ (cm⁻²), $T_K$ (K), $\nu_{UL}$ (MHz), $FWHM_{UL}$ (km s⁻¹), $T_{UL}$ (K), $ds_{96}$, $ds_{242}$, s (K)}

$\nu_{UL}$ (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity, v (km s⁻¹).

Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter.
2. Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters, MCMC
techniques are really most competitive for models with a much larger
number of parameters m ≥ 15.
3. Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution.
This is fine for parameter estimation.
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi ,I) for each model. This is a
much more difficult problem still in search of two good solutions for
large m. We need two to know if either is valid.
One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk please Google Phil Gregory

The rewards of data analysis:

'The universe is full of magical things, patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters.

Let $\theta_j^i$ represent the ith iteration of the jth of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:
$$W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left(\theta_j^i - \bar{\theta}_j\right)^2$$

Between-chain variance:
$$B = \frac{h}{m-1} \sum_{j=1}^{m} \left(\bar{\theta}_j - \bar{\theta}\right)^2$$

Estimated variance:
$$\hat{V}(\theta) = \left(1 - \frac{1}{h}\right) W + \frac{1}{h} B$$

Gelman-Rubin statistic:
$$\sqrt{\frac{\hat{V}(\theta)}{W}}$$

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D. B. Rubin (1992), 'Inference from iterative simulations using multiple sequences (with discussion)', Statistical Science 7, pp. 457–511.
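The statistic defined above translates directly into code. In this sketch the function name `gelman_rubin` is hypothetical, and independent Gaussian draws stand in for real post burn-in MCMC chains (an assumption made to keep the example short).

```python
import numpy as np


def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.

    chains : array of shape (m, h) holding the last h post burn-in
             iterations of m independent simulations.
    """
    chains = np.asarray(chains, dtype=float)
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))
    B = h / (m - 1) * np.sum((chain_means - grand_mean) ** 2)
    V_hat = (1.0 - 1.0 / h) * W + B / h
    return np.sqrt(V_hat / W)


rng = np.random.default_rng(3)
converged = rng.normal(0.0, 1.0, size=(4, 5000))   # four chains, same distribution
offsets = np.array([0.0, 0.0, 3.0, 3.0])[:, None]
stuck = converged + offsets                         # two chains sit at another mode

r_ok = gelman_rubin(converged)
r_bad = gelman_rubin(stuck)
print(r_ok, r_bad)
```

When all chains sample the same distribution the statistic is very close to 1; when some chains are stuck in a different region the between-chain variance inflates it well above the 1.05 guideline.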
