Tan Bui-Thanh∗
Institute for Computational Engineering and Sciences, The University of Texas at Austin
1 Introduction
3 Construction of likelihood
4 Construction of Prior(S)
4.1 Smooth priors
4.2 “Non-smooth” priors
8 Matlab codes
References
1 Introduction
In this note, we are interested in solving inverse problems using statistical
techniques. Let us motivate you by considering the following particular in-
verse problem, namely, the deconvolution problem. Given the observation
signal g(s), we would like to reconstruct the input signal f (t) : [0, 1] → R,
where the observation and the input obey the following relation
$$ g(s_j) = \int_0^1 a(s_j, t) f(t) \, dt, \quad 0 \le j \le n. \tag{1} $$
$$ \min_{f(t)} \sum_{j=0}^{n} \left( g(s_j) - \int_0^1 a(s_j, t) f(t) \, dt \right)^2. \tag{2} $$
However, the ill-conditioned nature of our inverse problem does not go away.
Indeed, (2) may have multiple solutions and multiple minima. In addition, a
solution to (2) may not depend continuously on g(s_j), 0 ≤ j ≤ n. So what is
the point of the recast? Clearly, if the cost function (also known as the data
misfit) is a parabola, then the optimal solution is unique. This immediately
suggests that one should add a quadratic term to the cost function to make it
more like a parabola, and hence make the optimization problem easier. This is
essentially the idea behind Tikhonov regularization, which proposes to solve
the nearby problem
$$ \min_{f(t)} \sum_{j=0}^{n} \left( g(s_j) - \int_0^1 a(s_j, t) f(t) \, dt \right)^2 + \frac{\kappa}{2} \left\| R^{1/2} f \right\|^2 $$
$$ g^{obs}(s_j) = g(s_j) + e_j, \quad 0 \le j \le n, $$
where $e_j$, j = 0, . . . , n, are random noise terms. You can think of the noise
as the inaccuracy in observation/measurement devices. The question you may
2. Some concepts from probability theory
problem. Specifically, one’s belief is based on his known information (expressed in terms of σ-
algebra) and “weights” on each information (expressed in terms of probability measure). That is,
people working with different probability spaces have different solutions.
2.4 example. Let us consider the event of tossing a coin. Clearly, this is a
random event since we don’t know whether head or tail will appear.
Nevertheless, we believe that out of n tosses, n/2 are heads and n/2
are tails.² We express this belief in terms of probability as: the (subjective)
probability of getting a head is 1/2 and the (subjective) probability of getting a
tail is 1/2.
2.5 example. Back to the coin-tossing example, we trivially have Ω = {head, tail},
F = {∅, {head}, {tail}, Ω}. The weights are P[∅] = 0, P[{tail}] = P[{head}] = 1/2,
and P[{head, tail}] = 1.
P [ A ∩ B] = P [ A] × P [ B] .
$$ P[A|B] = \frac{P[A \cap B]}{P[B]}, \tag{3} $$
which can also be rephrased as the probability that A happens provided B has
already happened.
(Margin note: This is the cornerstone formula on which most results in this
note are built; make sure that you feel comfortable with it.)
2.6 example. Assume that we want to roll a die. Denote B as the event of
getting a face bigger than 4, and A the event of getting face 6. Using (3) we
2 One can believe that out of n tosses, n/3 are heads and 2n/3 are tails if one uses
an unfair coin.
3 Probability theory is often believed to be a part of measure theory, but independence is where
tional probability.
5 This was initially introduced by Kolmogorov, a father of modern probability theory.
$$ P[A|B] = \frac{1/6}{1/3} = \frac{1}{2}. $$
We can solve the problem using a more elementary argument. B happens
when we either get face 5 or face 6. The probability of getting face 6 when B
has already happened is clearly 1/2.
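The die computation can be checked by brute-force enumeration. The following is a minimal Python sketch (standard library only, not one of the Matlab scripts the note refers to); the helper names `prob` and `conditional` are illustrative choices, and the fair-die weights are the assumption behind the example.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die; each face has weight 1/6.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

def conditional(a, b):
    """P[A|B] = P[A and B] / P[B], formula (3); requires P[B] > 0."""
    return prob(a & b) / prob(b)

A = {6}        # getting face 6
B = {5, 6}     # getting a face bigger than 4

p_a_given_b = conditional(A, B)
# The Bayes formula (4) gives the same value from P[B|A], P[A] and P[B].
p_bayes = conditional(B, A) * prob(A) / prob(B)
print(p_a_given_b, p_bayes)
```

Exact fractions are used so that the two routes, the definition (3) and the Bayes formula (4), can be compared without rounding.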
(a) P [ A| B] = (b) P [ A| B] =
2.8 exercise. Show that the following Bayes formula for conditional probabil-
ity holds
$$ P[A|B] = \frac{P[B|A] \, P[A]}{P[B]}. \tag{4} $$
P [ A| B] = P [ A] , P [ B| A] = P [ B] .
2.9 definition. The state space S is the set containing all the possible out-
In this note, the state space S (and also T) is the standard Euclidean space
$\mathbb{R}^n$, where n is the dimension. We are in a position to introduce the
key player, the random variable.
$$ M : \Omega \ni \omega \mapsto M(\omega) \in S. $$
2.11 definition. The probability distribution (or distribution or law for short)
of a random variable M is defined as
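$$ \mu_M(A) = P\left[ M^{-1}(A) \right] = P\left[ \{ M \in A \} \right], \quad \forall A \subset S, \tag{5} $$
where
$$ M^{-1}(A) \overset{\text{def}}{=} \{ M \in A \} \overset{\text{def}}{=} \{ \omega \in \Omega : M(\omega) \in A \}. $$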
From the definition, we can see that the distribution is a probability
measure⁸ on S. In other words, the random variable M induces a probability
measure, denoted by $\mu_M$, on the state space S. The key property of the
induced probability measure $\mu_M$ is the following. The probability for an
event A in the state space to happen, denoted as $\mu_M(A)$, is defined as the
probability for an event $B = M^{-1}(A)$ in the sample space to happen (see
Figure 2 for an illustration). The distribution and the probability density⁹
$\pi_M$ of M obey the
6 Measurable map is the rigorous definition, but we avoid technicalities here since it involves
operations on σ-algebra.
7 Rigorously, A must be a measurable subset of S.
8 In fact, it is the push-forward measure by the random variable M.
9 Here, the density is understood with respect to the Lebesgue measure on S = Rn . Rigorously,
following relation
$$ \mu_M(A) \overset{\text{def}}{=} \int_A \pi_M(m) \, dm \overset{\text{def}}{=} \int_{\{M \in A\}} d\omega, \quad \forall A \subset S, \tag{6} $$
where the second equality of the definition is from (5). The meaning of the
random variable M(ω) can now be seen in Figure 2. It maps the event B ⊂ Ω
into the set A = M(B) in the state space such that the area under the density
function $\pi_M(m)$ and above A is exactly the probability that B happens.
(Margin note: Do you see this?)
[Figure 2: the event B in Ω, its image A = M(B) in the state space, and the density $\pi_M(m)$.]
µ M (dm) = π (m) dm = dω .
nore the underlying probability space in practice, since we don’t need them
in computation of probability in the state space. However, to intuitively un-
derstand the source of randomness, we need to go back to the probability
space where the outcome of all events, except Ω, is uncertain. As a result,
the pair (S, π M (m)) contains complete information describing our ignorance
about the outcome of the random variable M. For the rest of this note, we shall
work directly on the state space.
2.13 remark. At this point, you may wonder what the point is of introducing
the abstract probability space (Ω, F, P) to make life more complicated. Well,
the reason for its introduction is twofold. First, as discussed above, the
probability space not only shows the origin of randomness but also provides the
probability measure P for the computation of the randomness; it is also used to
define random variables and furnishes a decent understanding of them. Second,
the concepts of distribution and density in (5) and (6), which are introduced
for a random variable M, a map from Ω to S, are valid for maps¹¹ from an
arbitrary space V to another space W. Here, W plays the role of S, and V
the role of Ω on which we have a probability measure. For example, later in
Section 3, we introduce the parameter-to-observable map
$h(m) : S \to \mathbb{R}^r$; then S plays the role of Ω and $\mathbb{R}^r$ that
of S in (5) and (6).
As we will see, the Bayes formula for probability densities is about the joint
density of two or more random variables. So let us define the joint distribution
and joint density of two random variables here.
µ MY ({ M ∈ A} , {Y ∈ B}) = µ M ( A) µY ( B) , ∀ A × B ⊂ S × T,
or if
$$ \pi(m|y) = \frac{\pi(m, y)}{\pi(y)} $$
By symmetry, we have
$$ \pi(m|y) = \frac{\pi(y|m) \, \pi(m)}{\pi(y)}. \tag{9} $$
2.18 exercise. Prove directly the Bayes formula for conditional density (9)
using the Bayes formula for conditional probability (4).
2.20 definition (Prior). We call π (m) the prior. It is the probability density
of M regardless of Y. The prior encodes, in the Bayesian framework, all infor-
mation before any observations/data are made.
2.21 definition (Posterior). The density π (m|y) is called the posterior, the
distribution of parameter m given the measurement y, and it is the solution of
the Bayesian inverse problem under consideration.
2.22 definition. The conditional mean is defined as
$$ E[M|y] = \int_S m \, \pi(m|y) \, dm. $$
(Margin note: Can you write the conditional mean using dω?)
3 Construction of likelihood
y = h (m) ,
$$ Y^{obs} = h(M) + E, \tag{10} $$
where $Y^{obs}$ is the actual observation rather than Y = h(M). Since the noise
comes from external sources, in this note, it is assumed to be independent of
M. In the likelihood modeling, we pretend to have realization(s) of M and the
task is to construct the distribution of $Y^{obs}$. From (10), one can see that
the randomness in $Y^{obs}$ is the randomness in E shifted by an amount h(m), see
Figure 3, and hence $\pi_{Y^{obs}|m}\left(y^{obs}|m\right) = \pi_E\left(y^{obs} - h(m)\right)$.
More rigorously, assuming that both $Y^{obs}$ and E are random variables on the
same probability space, we have
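$$ \int_A \pi_{Y^{obs}|m}\left(y^{obs}|m\right) dy^{obs} = \mu_{Y^{obs}|m}(A) = \mu_E\left(A - h(m)\right) = \int_{A - h(m)} \pi_E(e) \, de \overset{\text{change of variable}}{=} \int_A \pi_E\left(y^{obs} - h(m)\right) dy^{obs}, \quad \forall A \subset S, $$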
which implies
$$ \pi_{Y^{obs}|m}\left(y^{obs}|m\right) = \pi_E\left(y^{obs} - h(m)\right). $$
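As an illustration of the shift argument, here is a minimal Python sketch of the additive-noise likelihood for scalar Gaussian noise. The forward map `h` used below is a hypothetical example, and the Gaussian choice for the noise density is an assumption of this sketch; any noise density could be passed through the same shift.

```python
import math

def gaussian_density(e, sigma):
    """Density of scalar noise E ~ N(0, sigma^2) evaluated at e."""
    return math.exp(-0.5 * (e / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(y_obs, m, h, sigma):
    """pi(y_obs | m) = pi_E(y_obs - h(m)) for the additive model Y_obs = h(M) + E."""
    return gaussian_density(y_obs - h(m), sigma)

# Hypothetical scalar forward map, used only for illustration.
def h(m):
    return 2.0 * m

sigma = 0.5
peak = likelihood(y_obs=2.0, m=1.0, h=h, sigma=sigma)      # y_obs equals h(m)
shifted = likelihood(y_obs=2.5, m=1.0, h=h, sigma=sigma)   # one sigma away from h(m)
print(peak, shifted)
```

The likelihood is largest when the observation coincides with h(m) and decays as the misfit grows, which is exactly the noise density re-centered at h(m).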
$$ Y^{obs} = E \, h(M). \tag{11} $$
3.1 exercise. Show that the likelihood for multiplicative noise model (11) has
the following form
$$ \pi_{Y^{obs}|m}\left(y^{obs}|m\right) = \frac{\pi_E\left(y^{obs}/h(m)\right)}{h(m)}, \quad h(m) \neq 0. $$
3.2 exercise. Can you generalize the result for the noise model $e = g\left(y^{obs}, h(x)\right)$?
4 Construction of Prior(S)
$$ g(s_j) = \int_0^1 a(s_j, t) f(t) \, dt + e(s_j), \quad 0 \le j \le n, $$
$$ Y^{obs} = A M + E. $$
the Bayesian solution to our inverse problem is, by virtue of the Bayes formula
(9), given by
$$ \pi_{post}\left(m|y^{obs}\right) \propto \mathcal{N}\left(y^{obs} - Am, \sigma^2 I\right) \times \pi_{prior}(m), \tag{12} $$
where we have ignored the denominator $\pi\left(y^{obs}\right)$ since it does not depend on
the parameter of interest m. We now start our prior elicitation.
In this section, we believe that the unknown function f(t) is smooth, which
can be translated into, among other possibilities, the following simplest
requirement on the pointwise values $f(s_i)$, and hence $m_i$,
$$ m_i = \frac{1}{2}\left( m_{i-1} + m_{i+1} \right), \tag{13} $$
that is, the value of f(s) at a point is more or less the average of the values
at its two neighbors. But this is by no means the exact behavior of the unknown
function f(s). We therefore admit some uncertainty in our belief (13) by adding
an innovation term $W_i$ such that
$$ M_i = \frac{1}{2}\left( M_{i-1} + M_{i+1} \right) + W_i, $$
reconstructed function f (t) departs from the smoothness model (13). In terms
of matrices, we obtain
LM = W,
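where L is given by
$$ L = \frac{1}{2} \begin{bmatrix} -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \end{bmatrix} \in \mathbb{R}^{(n-1) \times (n+1)}, $$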
which is, up to a constant factor, the second order finite difference matrix
approximating the Laplacian ∆f. Indeed, on a uniform mesh with spacing 1/n,
$$ \Delta f(s_j) \approx -2 n^2 (LM)_j. \tag{14} $$
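To see concretely that L acts as a scaled second difference, the following Python sketch (standard library only) assembles L row by row from $m_i - (m_{i-1} + m_{i+1})/2$ and applies it to samples of two test functions. The uniform mesh $s_j = j/n$ and the helper names are assumptions of this sketch.

```python
def smoothness_matrix(n):
    """Return L in R^{(n-1) x (n+1)}: row i encodes m_i - (m_{i-1} + m_{i+1}) / 2."""
    rows = []
    for i in range(1, n):
        row = [0.0] * (n + 1)
        row[i - 1], row[i], row[i + 1] = -0.5, 1.0, -0.5
        rows.append(row)
    return rows

def matvec(mat, vec):
    """Dense matrix-vector product on plain lists."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]

n = 10
s = [j / n for j in range(n + 1)]          # assumed uniform mesh on [0, 1]
L = smoothness_matrix(n)

linear = matvec(L, [3.0 * sj + 1.0 for sj in s])   # straight line: zero second difference
quadratic = matvec(L, [sj ** 2 for sj in s])       # f'' = 2 everywhere
print(max(abs(v) for v in linear))
print(quadratic[0], -1.0 / n ** 2)
```

A straight line is annihilated by L, while for f(s) = s² every entry of Lm equals −f''/(2n²) = −1/n², so L is proportional to the discrete second derivative.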
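$$ M_0 = \frac{1}{2}\left( M_{-1} + M_1 \right) + W_0 = \frac{1}{2} M_1 + W_0, \quad W_0 \sim \mathcal{N}\left(0, \gamma^2\right), $$
$$ M_n = \frac{1}{2}\left( M_{n-1} + M_{n+1} \right) + W_n = \frac{1}{2} M_{n-1} + W_n, \quad W_n \sim \mathcal{N}\left(0, \gamma^2\right). $$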
Note that we have extended f (s) by zero outside the domain [0, 1] since we
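“know” that it is smooth. Consequently, we have $L_D M = W$ with
$$ L_D = \frac{1}{2} \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{bmatrix} \in \mathbb{R}^{(n+1) \times (n+1)}, \tag{16} $$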
which is the second order finite difference matrix corresponding to zero Dirich-
let boundary conditions. The prior density in this case reads
$$ \pi^D_{prior}(m) \propto \exp\left( -\frac{1}{2\gamma^2} \left\| L_D m \right\|^2 \right). \tag{17} $$
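Since $L_D M = W$ with W having independent N(0, γ²) entries, one way to draw from this prior is to sample W and solve the tridiagonal system. The sketch below does this in plain Python with the Thomas algorithm; the function names, the seed, and the values n = 50 and γ = 1 are illustrative assumptions, not the settings of BayesianPriorElicitation.m.

```python
import random

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm. Row i reads lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def draw_dirichlet_prior(n, gamma, rng):
    """One draw M by solving L_D M = W with W ~ N(0, gamma^2 I); returns (M, W)."""
    size = n + 1
    w = [rng.gauss(0.0, gamma) for _ in range(size)]
    # L_D = (1/2) tridiag(-1, 2, -1): diagonal 1, off-diagonals -1/2.
    m = solve_tridiagonal([-0.5] * size, [1.0] * size, [-0.5] * size, w)
    return m, w

rng = random.Random(0)
m, w = draw_dirichlet_prior(n=50, gamma=1.0, rng=rng)
# Residual check of (L_D m)_i = m_i - (m_{i-1} + m_{i+1}) / 2, with zeros outside [0, 1].
padded = [0.0] + m + [0.0]
residual = max(abs(padded[i + 1] - 0.5 * (padded[i] + padded[i + 2]) - w[i])
               for i in range(len(m)))
print(residual)
```

Solving with the factor $L_D$ directly avoids forming the covariance matrix, and the residual check confirms that the draw satisfies the stochastic smoothness model with zero extension at both ends.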
where $e_j$ is the jth canonical basis vector in $\mathbb{R}^{n+1}$, and we have
used the fact that the prior is Gaussian in the last equality. So we in fact
plot the square root of the diagonal of $\gamma^2 \left( L_D^T L_D \right)^{-1}$,
the covariance matrix, as the standard deviation curve. One can see that the
uncertainty is largest in the middle of the domain since it is farthest from
the constrained boundary. The points closer to the boundaries have smaller
variance, that is, they are more correlated to the “known” boundary data, and
hence less uncertain.
(Margin note: Do we really have the complete continuous ...)
Figure 4: Prior random draws from $\pi^D_{prior}$ together with the standard deviation curve.
Now, you may ask why f (s) must be zero at the boundary, and you are
right! There is no reason to believe that must be the case. However, we don’t
know the exact values of f (s) at the boundary either, even though we believe
that we may have non-zero Dirichlet boundary condition. If this is the case,
we have to admit our ignorance and let the data from the likelihood correct
us in the posterior. To be consistent with the Bayesian philosophy, if we do
not know anything about boundary conditions, let them be, for convenience,
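$$ L_R = \frac{1}{2} \begin{bmatrix} 2\delta_0 & 0 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & 0 & 2\delta_n \end{bmatrix} \in \mathbb{R}^{(n+1) \times (n+1)}. $$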
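$$ \mathrm{Var}[M_0] = \frac{\gamma^2}{\delta_0^2} = \mathrm{Var}[M_n] = \frac{\gamma^2}{\delta_n^2} = \mathrm{Var}\left[ M_{[n/2]} \right] = \gamma^2 e_{[n/2]}^T \left( L_R^T L_R \right)^{-1} e_{[n/2]}, $$
where [n/2] denotes the largest integer not exceeding n/2. It follows that
$$ \delta_0^2 = \delta_n^2 = \frac{1}{e_{[n/2]}^T \left( L_R^T L_R \right)^{-1} e_{[n/2]}}. $$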
Again, we draw five random realizations from $\pi^R_{prior}$ and put them
together with the standard deviation curve in Figure 5. As can be observed, the
uncertainty is more or less the same at every point and prior realizations are
no longer constrained to have zero boundary conditions.
Figure 5: Prior random draws from $\pi^R_{prior}$ together with the standard deviation curve.
$$ m_i = \lambda_i m_{i-1} + (1 - \lambda_i) m_{i+1} + E_i, \quad 0 \le \lambda_i \le 1. $$
Convince yourself that by choosing a particular set of $\lambda_i$, you can
recover all the above prior models. Replace BayesianPriorElicitation.m by a
generic code with input parameters $\lambda_i$. Experiment with new prior models
by using different values of $\lambda_i$ (those that don’t reproduce priors
presented in the text).
We first consider the case in which we believe that f (s) is still smooth but may
have discontinuities at known locations on the mesh. Can we design a prior to
convey this belief? A natural approach is to require that M j is equal to M j−1
$$ M_j = M_{j-1} + E_j, $$
$$ L_N = \begin{bmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{bmatrix} \in \mathbb{R}^{n \times n}. $$
But, if we think that there is a particularly big jump, relative to others, from
$M_{j-1}$ to $M_j$, then the mathematical translation of this belief is
$E_j \sim \mathcal{N}\left(0, \gamma^2/\theta^2\right)$ with θ < 1. The
corresponding prior in this case reads
$$ \pi^O_{prior}(m) \propto \exp\left( -\frac{1}{2\gamma^2} \left\| J L_N m \right\|^2 \right), \tag{20} $$
Let’s draw some random realizations from $\pi^O_{prior}(m)$ in Figure 6 with
n = 160, j = 80, β = 1, and θ = 0.01. As desired, all realizations have a sudden
jump at j = 80, and the standard deviation of the jump is 1/θ = 100. In addition,
compared to priors in Figures 4 and 5, the realizations from $\pi^O_{prior}(m)$
are less smooth, which confirms that our belief is indeed conveyed.
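The jump prior can be simulated directly from its increment model. The Python sketch below (standard library only) builds realizations of $M_j = M_{j-1} + E_j$ and inflates the increment at one index; starting the walk at $M_0 = 0$, taking γ = 1, and the helper names are assumptions made for this illustration.

```python
import random

def draw_jump_prior(n, jump_index, gamma, theta, rng):
    """Random walk M_j = M_{j-1} + E_j, started at M_0 = 0 (an assumption of this sketch).

    E_j ~ N(0, gamma^2), except at jump_index where E_j ~ N(0, (gamma / theta)^2).
    """
    m = [0.0]
    for j in range(1, n + 1):
        std = gamma / theta if j == jump_index else gamma
        m.append(m[-1] + rng.gauss(0.0, std))
    return m

def rms(values):
    """Root mean square of a list of numbers."""
    return (sum(v * v for v in values) / len(values)) ** 0.5

rng = random.Random(1)
draws = [draw_jump_prior(n=160, jump_index=80, gamma=1.0, theta=0.01, rng=rng)
         for _ in range(200)]
jump_rms = rms([d[80] - d[79] for d in draws])     # increment at the jump location
regular_rms = rms([d[40] - d[39] for d in draws])  # increment elsewhere
print(jump_rms, regular_rms)
```

Over many draws the increment at j = 80 has a spread close to 1/θ = 100, while increments elsewhere have a spread close to 1, which is the belief the scaling by θ is meant to encode.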
5. Posterior as the solution to Bayesian inverse problems
Figure 6: Prior random draws from $\pi^O_{prior}$ together with the standard deviation curve.
M = diag (θ1 , . . . , θn ) ,
4.4 exercise. Modify the scheme in Exercise 4.1 to include priors with sudden
In this section, we explore the posterior (12), the solution of our Bayesian
problem, given the likelihood in Section 3 and priors in Section 4.
To derive results that are valid for all priors discussed so far, we work with
where T (m) is the familiar (to you I hope) Tikhonov functional; it is sometimes
called the potential. We re-emphasize here that the Bayesian solution is the
posterior probability density, and if we draw samples from it, we want to
know what the most likely function m is going to be. In other words, we
ask for the most probable point m in the posterior distribution. This point is
known as the Maximum A Posteriori (MAP) estimator/point, namely, the point
at which the posterior density is maximized. Let us denote this point as m MAP ,
and we have
$$ m_{MAP} = \arg\max_m \pi_{post}\left(m|y^{obs}\right) = \arg\min_m T(m). $$
Hence, the MAP point is exactly the deterministic solution of the Tikhonov
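Since both likelihood and prior are Gaussian, the posterior is also a
Gaussian. For our case, the resulting posterior Gaussian reads
(Margin note: This is fundamental. If you have not seen this, prove it!)
$$ \begin{aligned}
\pi_{post}\left(m|y^{obs}\right) &\propto \exp\left( -\frac{1}{2} \left\| m - \frac{1}{\sigma^2} H^{-1} A^T y^{obs} \right\|_H^2 \right) \\
&= \exp\left( -\frac{1}{2} \left( m - \frac{1}{\sigma^2} H^{-1} A^T y^{obs}, \; H \left( m - \frac{1}{\sigma^2} H^{-1} A^T y^{obs} \right) \right) \right) \\
&\overset{\text{def}}{=} \exp\left( -\frac{1}{2} \left( m - \frac{1}{\sigma^2} H^{-1} A^T y^{obs}, \; \Gamma_{post}^{-1} \left( m - \frac{1}{\sigma^2} H^{-1} A^T y^{obs} \right) \right) \right),
\end{aligned} $$
where
$$ H = \frac{1}{\sigma^2} A^T A + \frac{1}{\gamma^2} \Gamma^{-1} $$
is the Hessian of the Tikhonov functional (aka the regularized misfit), and we
have used the weighted norm $\|\cdot\|_H^2 = \left\| H^{1/2} \cdot \right\|^2$.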
The other important point is that the posterior covariance matrix is
precisely the inverse of the Hessian of the regularized misfit, i.e.,
$$ \Gamma_{post} = H^{-1}. $$
Last, but not least, we have shown that the MAP point is given by
$$ m_{MAP} = \frac{1}{\sigma^2} H^{-1} A^T y^{obs} = \frac{1}{\sigma^2} \left( \frac{1}{\sigma^2} A^T A + \frac{1}{\gamma^2} \Gamma^{-1} \right)^{-1} A^T y^{obs}, $$
which is, again, exactly the minimizer of the Tikhonov functional for the
linear inverse problem.
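For a concrete feel of these formulas, here is a small Python sketch (standard library only) that forms the Hessian and the MAP point for a problem with two unknowns. The 3 x 2 matrix A, the data, the choice Γ = I, and the values of σ and γ are all hypothetical and serve only to illustrate the algebra.

```python
def map_point_2d(A, y_obs, sigma, gamma):
    """MAP point and posterior covariance for y = A m + noise with Gamma = I (assumed).

    Solves H m = A^T y / sigma^2 with H = A^T A / sigma^2 + I / gamma^2 for 2 unknowns.
    """
    H = [[sum(r[i] * r[j] for r in A) / sigma**2 + (1.0 / gamma**2 if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    b = [sum(r[i] * yi for r, yi in zip(A, y_obs)) / sigma**2 for i in range(2)]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    # Cramer's rule for the 2 x 2 system H m = b.
    m = [(b[0] * H[1][1] - H[0][1] * b[1]) / det,
         (H[0][0] * b[1] - H[1][0] * b[0]) / det]
    # Posterior covariance is the inverse Hessian.
    cov = [[H[1][1] / det, -H[0][1] / det], [-H[1][0] / det, H[0][0] / det]]
    return m, cov

A = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # hypothetical forward matrix
y_obs = [1.0, 3.1, 1.9]                    # hypothetical data
sigma, gamma = 0.1, 10.0
m_map, cov_post = map_point_2d(A, y_obs, sigma, gamma)

# The gradient of the Tikhonov functional should vanish at the MAP point.
misfit = [sum(r[k] * m_map[k] for k in range(2)) - yi for r, yi in zip(A, y_obs)]
grad = [sum(r[i] * e for r, e in zip(A, misfit)) / sigma**2 + m_map[i] / gamma**2
        for i in range(2)]
print(m_map, grad)
```

The vanishing gradient confirms that the MAP point is a stationary point of the regularized misfit, and the diagonal of `cov_post` gives the posterior variances of the two unknowns.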
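5.2 exercise. Show that $m_{MAP}$ is also the least squares solution of the
following over-determined system
$$ \begin{bmatrix} \frac{1}{\sigma} A \\ \frac{1}{\gamma} \Gamma^{-1/2} \end{bmatrix} m = \begin{bmatrix} \frac{1}{\sigma} y^{obs} \\ 0 \end{bmatrix}. $$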
5.3 exercise. Show that the posterior mean, which is in fact the conditional
mean, is precisely the MAP point.
The noise level is taken to be 5% of the maximum value of f(s), i.e.
$\sigma = 0.05 \max_{s \in [0,1]} |f(s)|$.
We first consider the belief described by $\pi^D_{prior}$ in which we think that
f(s) is zero at the boundaries. Figure 7 plots the MAP estimator, the truth
function f(s), and the predicted uncertainty. As can be observed, the MAP is
in good agreement with the truth function inside the interval [0, 1], though
it is far from recovering f(s) at the boundaries. This is the price we have to
pay for not admitting our ignorance about the boundary values of f(s). The
likelihood in fact sees this discrepancy in the prior knowledge and tries to
make a correction by lifting the MAP away from 0, but not enough to be a good
reconstruction. The reason is that our incorrect prior is so strong that the
information from the data $y^{obs}$ cannot help much.
Figure 7: The MAP estimator, the truth function, and the predicted uncertainty
(95% credibility region) using $\pi^D_{prior}$.
5.4 exercise. Can you make the prior less strong? Change some parameter to
reduce the prior contribution! Use BayesianPosterior.m to test your answer. Is
the prediction better in terms of satisfying the boundary conditions? Is the
uncertainty smaller? If not, why?
On the other hand, if we admit this ignorance and use the corresponding
prior $\pi^R_{prior}$, we see a much better reconstruction in Figure 8. In this
case, we in fact let the information from the data $y^{obs}$ determine the
appropriate values for the Dirichlet boundary conditions rather than setting
them to zero. By doing this, we allow the likelihood and the prior to be
well-balanced, leading to good reconstruction and uncertainty quantification.
Figure 8: The MAP estimator, the truth function, and the predicted uncertainty
(95% credibility region) using $\pi^R_{prior}$.
5.6 exercise. Use your favorite deterministic inversion approach to solve the
above deconvolution problem and then compare it with the solution in Figure
Now consider the case in which the truth function has a jump discontinuity
at j = 70. Assume we also know that the magnitude of the jump is 10. In
particular, we take the truth function f (s) as the following step function
$$ f(s) = \begin{cases} 0 & \text{if } s \le 0.7, \\ 10 & \text{otherwise}. \end{cases} $$
6. Connection between Bayesian inverse problems and deterministic
inverse problems
Figure 9: The MAP estimator, the truth function, and the predicted uncertainty
(95% credibility region) using $\pi^O_{prior}$.
5.8 exercise. Use your favorite deterministic inversion approach to solve the
above deconvolution problem with discontinuity and then compare it with
the solution in Figure 10.
Figure 10: The MAP estimator, the truth function, and the predicted
uncertainty (95% credibility region) using $\pi^O_{prior}$.
ining the diagonal of the posterior covariance matrix. We can even discuss
the posterior correlation structure by looking at the off-diagonal elements,
though we are not going to do it here in this lecture note. Since, again,
both likelihood and prior are Gaussian, the posterior is a Gaussian
distribution, and hence the MAP point (the first order moment) and the
covariance matrix (the second order moment) are the complete description of the
posterior. If, however, the likelihood is not Gaussian in m, say when Am is
replaced by a nonlinear map, then one can explore higher moments.
We hope the arguments above convince you that the Bayesian solution
provides information far beyond the deterministic counterpart. In the remainder
of this section, let us examine in detail the connection between the MAP point
and the deterministic solution, particularly in the context of the deconvolution
problem. Recall the definition of the MAP point
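$$ \begin{aligned}
m_{MAP} \overset{\text{def}}{=} \arg\min_m \sigma^2 T(m) &= \arg\min_m \left\{ \frac{1}{2} \left\| y^{obs} - Am \right\|^2 + \frac{1}{2} \frac{\sigma^2}{\gamma^2} \left\| \Gamma^{-1/2} m \right\|^2 \right\} \\
&= \arg\min_m \left\{ \frac{1}{2} \left\| y^{obs} - y \right\|^2 + \frac{1}{2} \kappa \left\| R^{1/2} m \right\|^2 \right\},
\end{aligned} $$
where we have defined $\kappa = \sigma^2/\gamma^2$, $R^{1/2} = \Gamma^{-1/2}$, and y = Am.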
We begin our discussion with the zero Dirichlet boundary condition prior
$\pi^D_{prior}(m)$ in (17). Recall from (14) and (16) that $L_D M$ is
proportional to a discretization of the Laplacian operator with zero boundary
conditions using the second order finite difference method. Therefore, our
Tikhonov functional is in fact a discretization, up to a constant, of the
following potential in the infinite dimensional setting
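$$ T_\infty(f) = \frac{1}{2} \left\| y - y^{obs} \right\|^2 + \frac{1}{2} \kappa \left\| \Delta f \right\|^2_{L^2(0,1)}, $$
where $\|\cdot\|^2_{L^2(0,1)} \overset{\text{def}}{=} \int_0^1 (\cdot)^2 \, ds$. Rewrite the preceding equation informally as
$$ T_\infty(f) = \frac{1}{2} \left\| y - y^{obs} \right\|^2 + \frac{1}{2} \kappa \left( f, \Delta^2 f \right)_{L^2(0,1)}, $$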
and we immediately realize that the potential in our prior description, namely
$\left\| L_D m \right\|^2$, is in fact a discretization of Tikhonov regularization
using the biharmonic operator. This is another explanation for the smoothness
of the prior realizations and the name smooth prior, since biharmonic
regularization produces very smooth functions.¹³
The power of the statistical approach lies in the construction of the prior
$\pi^R_{prior}(m)$. Here, the interpretation of rows corresponding to interior
nodes $s_j$ is still the discretization of the biharmonic regularization, but
the design of those corresponding to the boundary points is purely statistical,
for which we have no corresponding deterministic counterpart (or at least it is
not clear how to construct it from a purely deterministic point of view). As
the results in Section 5 showed, $\pi^R_{prior}(m)$ provided much more
satisfactory results both in the prediction and in uncertainty quantification.
As for the “non-smooth” priors in Section 4.2, a simple inspection shows
that $L_N m$ is, up to a constant, a discretization of ∇f. Similar to the above
discussion, the potential in our prior description, namely
$\left\| L_N m \right\|^2$, is now in fact a discretization of Tikhonov
regularization using the Laplacian operator.¹⁴ As a result, the current prior
is less smooth than the previous one with the biharmonic operator.
Nevertheless, all the prior realizations corresponding to $\pi_{pren}(m)$
13 From a functional analysis point of view, $\|\Delta f\|^2_{L^2(0,1)}$ is finite if $f \in H^2(0, 1)$, and by the Sobolev
embedding theorem we know that in fact $f \in C^{1,1/2-\varepsilon}$, the space of continuously differentiable
functions whose first derivative is in the Hölder space of continuous functions $C^{1/2-\varepsilon}$, for any
0 < ε < 1/2. So indeed f is more than continuously differentiable.
14 Again, the Sobolev embedding theorem shows that $f \in C^{1/2-\varepsilon}$ for $\|\nabla f\|^2_{L^2(0,1)}$ to be finite. Hence,
all prior realizations corresponding to $\pi_{pren}(m)$ are at least continuous. The prior $\pi^O_{prior}(m)$ is
different, due to the scaling matrix J. As long as θ stays away from zero, prior realizations are
still in $H^1(0, 1)$, and hence continuous though having steep gradient at $s_j$ as shown in Figures 9
and 10. But as θ approaches zero, prior realizations are leaving $H^1(0, 1)$, and therefore may be no
longer continuous. Note that in one dimension, $H^{1/2+\varepsilon}$ is enough to be embedded in the space of
$C^{\varepsilon}$-Hölder continuous functions. If you would like to know a bit about the Sobolev embedding theorem,
see [3].
7. Markov chain Monte Carlo
are at least continuous, though they may have steep gradients at $s_j$ as shown
in Figures 9 and 10. The rigorous arguments for the prior smoothness require the
Sobolev embedding theorem, but we avoid the details.
For those who have not seen the Sobolev embedding theorem, you only
lose the insight on why $\pi^O_{prior}(m)$ could give very steep gradient
realizations (which is the prior belief we start with). Nevertheless, you still
can see that $\pi^O_{prior}(m)$ gives less smooth realizations than
$\pi^D_{prior}(m)$ does, since, at least, the MAP point corresponding to
$\pi^O_{prior}(m)$ only requires a finite first derivative of f while the second
derivative of f needs to be finite at the MAP point if $\pi^D_{prior}(m)$ is used.
$$ m = E[M], $$
$$ m \approx \frac{1}{N} \left( M_1 + \ldots + M_N \right). \tag{22} $$
But this kind of method cannot be extended to $S = \mathbb{R}^n$. This is where
the central limit theorem and the law of large numbers come to the rescue. They
say that the simple formula (22) is still valid with a simple error estimation
expression.
7.1 theorem (central limit theorem (CLT)). Assume that real valued random vari-
ables M1 , . . . are independent and identically distributed (iid), each with expectation
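$$ \lim_{N \to \infty} P\left[ Z_N \le m \right] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{m} \exp\left( -\frac{t^2}{2} \right) dt. \tag{23} $$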
Proof. The proof is elementary, though technical, using the concept of charac-
teristic function (Fourier transform of a random variable). You can consult [4]
for the complete proof.
7.2 theorem (Strong law of large numbers (LLN)). Assume random variables
M1 , . . . are independent and identically distributed (iid), each with finite expectation
m and finite variance σ2 . Then
$$ \lim_{N \to \infty} S_N = \lim_{N \to \infty} \frac{1}{N} \left( M_1 + M_2 + \cdots + M_N \right) = m \tag{24} $$
almost surely¹⁶.
7.3 remark. The central limit theorem says that no matter what the underly-
ing common distribution looks like, the sum of iid random variables, when
properly scaled and centralized, converges in distribution to a standard nor-
mal distribution. The strong law of large numbers, on the other hand, states
that the average of the sum is, as expected in the limit, precisely the mean of
the common distribution with probability one.
Both the central limit theorem (CLT) and the strong law of large numbers
(LLN) are useful, particularly LLN, and we use them routinely. For example,
if you are given an iid sample $\{M_1, M_2, \cdots, M_N\}$ from a common
distribution π(m), the first thing you should do is to compute the sample mean
$S_N$ to estimate the actual mean m. From LLN we know that the sample mean
can be as close to the actual mean as desired if N is sufficiently large. A
question that immediately arises is whether we can estimate the error between
the sample mean and the
15 Convergence in distribution is also known as weak convergence and it is beyond the scope of
this introductory note. You can think of the distribution of Zn is more and more like the standard
normal distribution as n → ∞, and it is precisely (23).
16 Almost sure convergence is the same as convergence with probability one, that is, the event
on which the convergence (24) does not happen has zero probability.
true mean, given a finite N. Let us first give an answer based on a simple
application of the CLT. Since the sample $\{M_1, M_2, \cdots, M_N\}$ satisfies the
condition of the CLT, we know that $Z_N$ converges to N(0, 1). It follows that,
at least for sufficiently large N, the mean squared error between $z_N$ and 0
can be estimated as
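$$ \left\| z_N - 0 \right\|^2_{L^2(S,P)} \overset{\text{def}}{=} E\left[ (z_N - 0)^2 \right] \overset{\text{def}}{=} \mathrm{Var}\left[ Z_N - 0 \right] \approx 1, $$
$$ \left\| S_N - m \right\|^2_{L^2(S,P)} \overset{\text{def}}{=} \mathrm{Var}\left[ S_N - m \right] \approx \frac{\sigma^2}{N}. \tag{25} $$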
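If you are a little bit delicate, you may not feel completely comfortable with
the error estimate (25) since you can rewrite it as
$$ \left\| S_N - m \right\|_{L^2(S,P)} = C \frac{\sigma}{\sqrt{N}}, $$
and you are not sure how big C is and how C depends on N. Let us try
to make you happy. We have
$$ \begin{aligned}
\left\| S_N - m \right\|^2_{L^2(S,P)} &= \frac{1}{N^2} E\left[ \left( \sum_{i=1}^{N} (M_i - m) \right) \left( \sum_{j=1}^{N} (M_j - m) \right) \right] \\
&= \frac{1}{N^2} E\left[ \sum_{i=1}^{N} (M_i - m)^2 \right] = \frac{1}{N^2} \sum_{i=1}^{N} \sigma^2 = \frac{\sigma^2}{N},
\end{aligned} $$
where the cross terms vanish because the $M_i$ are independent.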
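The σ²/N scaling of the mean squared error is easy to observe numerically. The Python sketch below (standard library only) repeats the sample-mean experiment many times for iid draws from U[0, 1]; the choice of U[0, 1], the sample sizes, and the number of repetitions are assumptions made for illustration.

```python
import random
import statistics

def sample_mean_errors(n_samples, n_repeats, rng):
    """Errors S_N - m for repeated iid samples of size N from U[0, 1] (mean m = 1/2)."""
    return [statistics.fmean(rng.random() for _ in range(n_samples)) - 0.5
            for _ in range(n_repeats)]

def root_mean_square(values):
    """Square root of the average of the squared values."""
    return (sum(v * v for v in values) / len(values)) ** 0.5

rng = random.Random(0)
sigma = (1.0 / 12.0) ** 0.5      # standard deviation of U[0, 1]
for n_samples in (100, 400, 1600):
    errors = sample_mean_errors(n_samples, n_repeats=500, rng=rng)
    # Empirical root mean squared error next to the prediction sigma / sqrt(N).
    print(n_samples, root_mean_square(errors), sigma / n_samples ** 0.5)
```

Quadrupling N should roughly halve the observed error, in line with the estimate (25).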
17 We avoid technicalities here, but g needs to be a Borel function for the statement to be true.
Perhaps, one of the most popular and practical problems is to evaluate the
mean of g, i.e.,
$$ I = E[G(M)] = \int_S g(m) \, \pi(m) \, dm, \tag{26} $$
which is an integral in $\mathbb{R}^n$.
7.7 exercise. Define z = g (m) ∈ T, the definition of the mean in (7) gives
$$ E[G(M)] \equiv E[Z] = \int_T z \, \pi_Z(z) \, dz. $$
$$ g(m) = \left( m - E[M] \right) \left( m - E[M] \right)^T, $$
The average $I_N$ in this case is known as the sample (aka empirical) covariance
matrix. Denote
$$ \hat{\Gamma} = \frac{1}{N} \sum_{i=1}^{N} (m_i - m)(m_i - m)^T $$
as the sample covariance matrix. Note that the mean m is typically not
available in practice, and we have to resort to a computable approximation
$$ \hat{\Gamma} = \frac{1}{N} \sum_{i=1}^{N} (m_i - \hat{m})(m_i - \hat{m})^T, $$
Sampling methods discussed in this note are based on two fundamental iid
random generators that are available as built-in functions in Matlab. The first
one is rand.m function which can draw iid random numbers (vectors) from
the uniform distribution in [0, 1], denoted as U [0, 1], and the second one is
randn.m function that generates iid numbers (vectors) from standard normal
distribution N (0, I ), where I is the identity matrix of appropriate size.
The most trivial task is how to draw iid samples $\{M_1, M_2, \ldots, M_N\}$ from a
multivariate Gaussian N(m, Γ). This can be done through a so-called whitening
process. The first step is to carry out the following decomposition
$$ \Gamma = R R^T, $$
which can be done, for example, using Cholesky factorization. The second
step is to define a new random variable as
$$ Z = R^{-1} (M - m), $$
then Z is a standard multivariate Gaussian, i.e. its density is N(0, I), for
which randn.m can be used to generate iid samples
$$ \{Z_1, Z_2, \ldots, Z_N\} = \texttt{randn(n,N)}. $$
(Margin note: Show that Z is a standard multivariate Gaussian.)
$$ M_i = m + R Z_i. $$
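The whitening recipe can be sketched in a few lines of Python (standard library only, standing in for the Matlab randn call). The hand-written Cholesky routine, the 2 x 2 covariance matrix, and the mean vector below are assumptions chosen for illustration.

```python
import math
import random

def cholesky(gamma):
    """Lower-triangular R with R R^T = gamma, for a symmetric positive definite matrix."""
    n = len(gamma)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(r[i][k] * r[j][k] for k in range(j))
            if i == j:
                r[i][j] = math.sqrt(gamma[i][i] - s)
            else:
                r[i][j] = (gamma[i][j] - s) / r[j][j]
    return r

def draw_gaussian(mean, r, rng):
    """One draw M = mean + R Z with Z ~ N(0, I)."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [mean[i] + sum(r[i][k] * z[k] for k in range(len(z)))
            for i in range(len(mean))]

mean = [1.0, -2.0]
cov = [[4.0, 1.2], [1.2, 1.0]]          # hypothetical covariance matrix
R = cholesky(cov)
rng = random.Random(0)
samples = [draw_gaussian(mean, R, rng) for _ in range(20000)]
emp_mean = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
emp_cov01 = sum((s[0] - emp_mean[0]) * (s[1] - emp_mean[1]) for s in samples) / len(samples)
print(emp_mean, emp_cov01)
```

The empirical mean and the empirical off-diagonal covariance of the draws should be close to the prescribed values, which is a quick sanity check that the factor R was applied on the correct side.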
You may ask what if the distribution under consideration is not Gaussian,
which is true for most practical applications. Well, if the target density π (m)
is one dimensional or multivariate with independent components (in this case,
we can draw samples from individual components separately), then we still
can draw iid samples from π (m), but this time via the standard uniform
distribution U [0, 1]. If you have not seen it before, here is the definition: U [0, 1]
has 1 as its density function, i.e.,
$$ \mu_U(A) = \int_A ds, \quad \forall A \subset [0, 1]. \tag{27} $$
Now suppose that we would like to draw iid samples from a one dimensional
(S = R) distribution with density π (m) > 0. We still allow π (m) to be zero,
but only at isolated points on R, and you will see the reason in a moment.
Define the cumulative distribution function (CDF) as
$$ \Phi(w) = \int_{-\infty}^{w} \pi(m) \, dm, \tag{28} $$
then it is clear that Φ(w) is non-decreasing and 0 ≤ Φ(w) ≤ 1. (Margin note:
Why?) Let us define a new random variable Z as
$$ Z = \Phi(M). \tag{29} $$
Our next step is to prove that Z is actually a standard uniform random
variable, i.e. Z ∼ U[0, 1], and then show how to draw M via Z. We begin with the
following observation
$$ P[Z < a] = P\left[ \Phi(M) < a \right] = P\left[ M < \Phi^{-1}(a) \right] = \int_{-\infty}^{\Phi^{-1}(a)} \pi(m) \, dm, \tag{30} $$
where we have used (29) in the first equality, the monotonicity of Φ in the
second equality, and the definition of the CDF (28) in the last equality. Now,
we can view (29) as the change of variable formula z = Φ(m), and combining
this fact with (28) gives
$$ P[Z < a] = \int_0^a dz = \mu_Z(Z < a), $$
which says that the density of Z is 1, and hence Z must be a standard uniform
random variable. In terms of our language at the end of Section 7.1, we can
define M = g(Z) = Φ−1(Z); drawing iid samples of M is then simple: first
draw iid samples of Z, then map them through g. Let us summarize
the idea in Algorithm 1.
Algorithm 1
1. Draw z ∼ U[0, 1].
2. Compute the inverse of the CDF to draw m, i.e. m = Φ−1(z). Go back
to Step 1.
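The two steps above can be sketched as follows. This is a minimal illustration in Python (standard library only, rather than Matlab), assuming a target whose CDF has a closed-form inverse. The logistic distribution is a hypothetical choice made for this example: its density is positive on all of R, its CDF is Φ(w) = 1/(1 + exp(−w)), and Φ−1(z) = log(z/(1 − z)).

```python
import math
import random

def logistic_cdf(w):
    """CDF of the (hypothetical) logistic target."""
    return 1.0 / (1.0 + math.exp(-w))

def logistic_inverse_cdf(z):
    """Closed-form inverse of logistic_cdf, valid for 0 < z < 1."""
    return math.log(z / (1.0 - z))

def draw_by_inverse_cdf(inverse_cdf, n_samples, rng):
    """Draw iid samples by mapping uniform draws through the inverse CDF."""
    draws = []
    while len(draws) < n_samples:
        z = rng.random()               # Step 1: z ~ U[0, 1)
        if z == 0.0:
            continue                   # the inverse CDF is unbounded at z = 0; redraw
        draws.append(inverse_cdf(z))   # Step 2: m = inverse CDF of z
    return draws

rng = random.Random(1)
draws = draw_by_inverse_cdf(logistic_inverse_cdf, 50000, rng)

# The logistic distribution has mean 0 and variance pi^2 / 3 (about 3.29).
mean_estimate = sum(draws) / len(draws)
variance_estimate = sum((d - mean_estimate) ** 2 for d in draws) / (len(draws) - 1)
```

The design hinges entirely on having `logistic_inverse_cdf` in closed form; for a target without an analytical inverse, the caveats discussed in the text apply.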
The above method works perfectly if one can compute the analytical in-
verse of the CDF easily and efficiently; it is particularly efficient for discrete
random variables, as we shall show. You may say that you can always compute
the inverse CDF numerically. Yes, you are right, but you need to be careful
about this. Note that the CDF is an integral operation, and hence its inverse
is some kind of differentiation. The fact is that numerical differentiation is an
ill-posed problem! You don’t want to add extra ill-posedness on top of the
original ill-posed inverse problem that you started with, do you? If not, let us
introduce a simpler but more robust algorithm that works for multivariate
distributions without requiring independence of the individual components.
We shall first introduce the algorithm and then analyze it to show you
why it works.
Suppose that you want to draw iid samples from a target density π (m),
but you only know it up to a constant C > 0, that is, you only know Cπ (m).
(This is perfect for our Bayesian inversion framework since we typically know
the posterior up to a constant as in (12).) Assume that we have a proposal distri-
bution q (m) at hand, for which we know how to sample easily and efficiently.
This is not a limitation since we can always take either the standard normal
distribution or uniform distribution as the proposal distribution. We further
assume that there exists D > 0 such that
$C\pi(m) \le D q(m)$ for all m, and define the acceptance probability
$\alpha = \frac{C\pi(m)}{D q(m)},$
(32) $P[B \mid m] = \alpha = \frac{C\pi(m)}{D q(m)}.$
$P[m \in dA \mid B] = \frac{P[B \mid m]\, q(m)\, dm}{P[B]} = \pi(m)\, dm,$
where we have used (32), and P[B], the probability of accepting a draw from
q, is the following marginal probability
$P[B] = \int_S P[B \mid m]\, q(m)\, dm = \frac{C}{D} \int_S \pi(m)\, dm = \frac{C}{D}.$
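The accept/reject mechanism analyzed above can be sketched as follows: draw m from q, then accept it with probability α = Cπ(m)/(Dq(m)). This is a minimal illustration in Python (standard library only, rather than Matlab). The target, the proposal, and the constant D are hypothetical choices made for this example: Cπ(m) = m(1 − m) on [0, 1] (so that C = 1/6, which the code never uses), q = U[0, 1], and D = 1/4, which satisfies Cπ(m) ≤ Dq(m) everywhere.

```python
import random

def unnormalized_target(m):
    """C * pi(m) = m (1 - m) on [0, 1]; the constant C = 1/6 is never needed."""
    return m * (1.0 - m) if 0.0 <= m <= 1.0 else 0.0

def rejection_sample(n_samples, D, rng):
    """Draw n_samples iid samples from pi using the uniform proposal q = U[0, 1]."""
    accepted = []
    n_proposed = 0
    while len(accepted) < n_samples:
        m = rng.random()                         # draw m from q, with q(m) = 1
        n_proposed += 1
        alpha = unnormalized_target(m) / (D * 1.0)   # alpha = C pi(m) / (D q(m))
        if rng.random() < alpha:                 # accept with probability alpha
            accepted.append(m)
    return accepted, n_proposed

rng = random.Random(2)
D = 0.25                                         # max of m (1 - m) on [0, 1]
accepted, n_proposed = rejection_sample(20000, D, rng)

acceptance_rate = len(accepted) / n_proposed     # should be close to C / D = 2/3
accepted_mean = sum(accepted) / len(accepted)    # the target has mean 1/2
```

The observed `acceptance_rate` gives a direct check of P[B] = C/D: a loose bound D wastes proposals, which is why the method may be slow in practice.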
Note that
g (m)
π (m) = exp − ,
C 2
We have presented a few methods to draw iid samples from a target distribution
π(m). The most robust method that works in any dimension is the
rejection-acceptance sampling algorithm, though it may be slow in practice.
In this section, we introduce the Markov chain Monte Carlo scheme, which
is the most popular sampling approach. It is in general more effective than
any method discussed so far, particularly for complex target densities in high
dimensions, though it has its own problems. One of them is that we no longer
have iid samples but correlated ones. Let us start the motivation by considering
the following web-page ranking problem.
Assume that we have a set of Internet websites that may be linked to each
other. We represent these sites as nodes and the links between them by directed
arrows connecting the nodes, as in Figure 11. Now you are seeking sites that
contain a keyword of interest, which all the nodes, and hence websites,
contain. A good search engine will show you all these websites. The question
is now which website should be ranked first, second, and so on? You may
guess that node 4 should be the first one in the list. Let us present a
probabilistic method to see whether your guess is correct or not. We first assign the
The jth column of P contains the probabilities of moving from the jth node
to the rest. For example, the first column says that if we start from node 1, we
can move to either node 2 or node 4, each with probability 1/2. Note that we
have treated all the nodes equally, that is, the transition probability from one
node to other linked nodes is the same (a node is not linked to itself in this
model). Note also that each column sums to 1, meaning that every website must
have a link to a website in the network.
Assume we are initially at node 4, and we represent the initial probability
density as
$\pi_0 = (0, 0, 0, 1, 0)^T,$
that is, we are initially at node 4 with certainty. In order to know the next node
to visit, we first compute the probability density of the next state by
π1 = Pπ0 ,
then randomly move to a node by drawing a sample from the (discrete) probability
density $\pi_1$ (see Exercise 7.12). In general, the probability density after
k steps is given by
(34) $\pi_k = P\pi_{k-1} = \cdots = P^k \pi_0,$
where the jth component of πk is the probability of moving to the jth node.
Observing (34) you may wonder what happens if k approaches infinity.
Assume, on credit, that the limit probability density π∞ exists; then it ought to satisfy
(35) $\pi_\infty = P\pi_\infty.$
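The recursion (34) and the fixed-point relation (35) can be checked numerically. The following is a minimal sketch in Python (standard library only, rather than Matlab). The 5-node transition matrix below is a hypothetical example constructed for this illustration and is not the matrix of the website ranking problem; it only shares the property that from node 1 one moves to node 2 or node 4 with probability 1/2 each.

```python
# Hypothetical column-stochastic matrix: P[i][j] is the probability of moving
# from node j+1 to node i+1, so every column sums to 1.
P = [
    [0.0, 0.5, 1.0, 1.0 / 3.0, 0.5],
    [0.5, 0.0, 0.0, 0.0,       0.0],
    [0.0, 0.5, 0.0, 1.0 / 3.0, 0.0],
    [0.5, 0.0, 0.0, 0.0,       0.5],
    [0.0, 0.0, 0.0, 1.0 / 3.0, 0.0],
]

def apply_transition(P, pi):
    """Return the matrix-vector product P pi."""
    n = len(pi)
    return [sum(P[i][j] * pi[j] for j in range(n)) for i in range(n)]

# Start at node 4 with certainty and iterate pi_k = P pi_{k-1}.
pi = [0.0, 0.0, 0.0, 1.0, 0.0]
for _ in range(500):
    pi = apply_transition(P, pi)

# If the limit exists, P pi - pi should vanish up to round-off.
residual = max(abs(a - b) for a, b in zip(apply_transition(P, pi), pi))
most_probable_node = pi.index(max(pi)) + 1
```

For this particular matrix the limit can also be found by hand from π∞ = Pπ∞ together with the normalization, which gives π∞ = (4/11, 2/11, 9/55, 12/55, 4/55).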
Figure 12 shows the visiting frequency (blue) for each node after N = 1500
moves. Here, visiting frequency of a node is the number of visits to that node
divided by N. We expect that numerical visiting frequencies approximate the
visiting probabilities in the limit. We confirm this expectation by also plotting
the components of π∞ (red) in Figure 12. By the way, π1500 is equal to π∞ up
to machine zero, meaning that a draw from π N , N ≥ 1500, is distributed by
the limit distribution π∞ . We are now in the position to answer our ranking
question. Figure 12 shows that node 1 is the most visited one, and hence
should appear at the top of the website list coming from the search engine.
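The visiting-frequency experiment can be sketched as a random walk on the graph. The following is a minimal illustration in Python (standard library only, rather than Matlab). The transition matrix is a hypothetical 5-node example, not the matrix behind Figure 12, and the number of moves is larger than in the text so that the comparison is less noisy. Each move draws the next node from a column of P by accumulating probabilities until a uniform draw is exceeded.

```python
import random

# Hypothetical column-stochastic matrix: P[i][j] is the probability of moving
# from node j+1 to node i+1.
P = [
    [0.0, 0.5, 1.0, 1.0 / 3.0, 0.5],
    [0.5, 0.0, 0.0, 0.0,       0.0],
    [0.0, 0.5, 0.0, 1.0 / 3.0, 0.0],
    [0.5, 0.0, 0.0, 0.0,       0.5],
    [0.0, 0.0, 0.0, 1.0 / 3.0, 0.0],
]

def next_node(P, current, rng):
    """Draw the next node index from column `current` of P."""
    u = rng.random()
    cumulative = 0.0
    for i in range(len(P)):
        cumulative += P[i][current]
        if u < cumulative:
            return i
    # Round-off guard: fall back to the last node with positive probability.
    return max(i for i in range(len(P)) if P[i][current] > 0.0)

rng = random.Random(4)
n_moves = 20000
visits = [0] * 5
node = 3                                  # start at node 4 (index 3)
for _ in range(n_moves):
    node = next_node(P, node, rng)
    visits[node] += 1

frequency = [v / n_moves for v in visits]
# Limit density of this hypothetical chain, obtained by solving pi = P pi by hand.
limit_density = [4.0 / 11.0, 2.0 / 11.0, 9.0 / 55.0, 12.0 / 55.0, 4.0 / 55.0]
```

Comparing `frequency` with `limit_density` mirrors the blue and red bars of the figure: the two should agree up to sampling noise.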
Figure 12: Visiting frequency after N = 1500 moves and the first eigenvector. Legend: visiting frequency, first eigenvector; horizontal axis: nodes 1 to 5.
7.13 exercise. Use the above probabilistic method to determine the probability
that the economy, as shown in Figure 13, is in recession.
Now assume that we know all states up to mk−1 , say mk−1 = 1. It follows that
$\pi_k(m_k \mid m_{k-1}, \ldots, m_0) = \pi_k(m_k \mid m_{k-1}),$
7.16 example. Let $m_{k-1} = 4$ in our website ranking problem, then the probability
kernel $P(m_{k-1} = 4, m)$ is exactly the probability density in (36).
(Margin question: What is the transition probability $P(m_{k-1} = 4, m_k = 1)$?)
7.18 example. The discrete version of (37), applied to our website ranking
problem, reads
$\pi_\infty(j) = \sum_{k=1}^{5} P(j, k)\, \pi_\infty(k) = P(j, :)\, \pi_\infty, \quad \forall j = 1, \ldots, 5,$
7.20 exercise. What is the discrete version of (38)? Does the transition matrix
in the website ranking problem satisfy reversibility? If not, why? How
about the transition matrix in Exercise 7.13?
The above discussion shows that if a Markov chain is reversible then even-
tually the states in the chain are distributed by the underlying invariant dis-
tribution. A question you may ask is how to construct a transition kernel such
that reversibility holds. This is exactly the question Markov chain Monte Carlo
methods are designed to answer. Let us now present the Metropolis-Hastings
MCMC method in Algorithm 3.
$\alpha(m_k, p) = \min\left\{1, \frac{\pi(p)\, q(p, m_k)}{\pi(m_k)\, q(m_k, p)}\right\}$
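The acceptance probability above can be put to work in a short sampler. The following is a minimal sketch in Python (standard library only, rather than Matlab) of a standard Metropolis-Hastings iteration: propose p from q(mk, ·), accept it with probability α(mk, p), otherwise stay at mk. The Gaussian random-walk proposal with stepsize gamma and the standard normal target are hypothetical choices made for this example. Since this proposal is symmetric, q(p, mk) = q(mk, p) and the ratio reduces to π(p)/π(mk); only an unnormalized π is needed.

```python
import math
import random

def log_target(m):
    """Log of an unnormalized standard normal density (hypothetical target)."""
    return -0.5 * m * m

def metropolis_hastings(log_pi, m0, gamma, n_steps, rng):
    """Random-walk Metropolis-Hastings chain with proposal N(m_k, gamma^2)."""
    chain = [m0]
    n_accepted = 0
    m = m0
    for _ in range(n_steps):
        p = m + gamma * rng.gauss(0.0, 1.0)           # draw proposal p from q(m, .)
        # Symmetric proposal: alpha = min{1, pi(p) / pi(m)}, evaluated in logs.
        log_alpha = min(0.0, log_pi(p) - log_pi(m))
        if rng.random() < math.exp(log_alpha):        # accept with probability alpha
            m = p
            n_accepted += 1
        chain.append(m)                               # on rejection, m repeats
    return chain, n_accepted / n_steps

rng = random.Random(5)
chain, acceptance_rate = metropolis_hastings(log_target, 0.0, 2.4, 50000, rng)

chain_mean = sum(chain) / len(chain)
chain_variance = sum((m - chain_mean) ** 2 for m in chain) / (len(chain) - 1)
```

Working with logarithms avoids underflow when π is very small, and appending the current state on rejection is what makes the chain's states distributed by π in the long run.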
Proof. We proceed in two steps. In the first step, we consider the case in which
the proposal p is accepted. Denote by B the event of accepting a draw from q (the
acceptance event). Following the same argument as in the proof of Proposition 7.9, we have
$P[B \mid p] = \alpha(m_k, p),$
so that the joint event of drawing p from q(mk, p) and accepting it, starting
from mk, has probability density q(mk, p) α(mk, p), which is exactly P(mk, p).
It follows that reversibility holds since
$\pi(m_k)\, P(m_k, p) = \pi(m_k)\, q(m_k, p) \min\left\{1, \frac{\pi(p)\, q(p, m_k)}{\pi(m_k)\, q(m_k, p)}\right\}$
$= \min\left\{\pi(m_k)\, q(m_k, p),\ \pi(p)\, q(p, m_k)\right\}$
$= \min\left\{\frac{\pi(m_k)\, q(m_k, p)}{\pi(p)\, q(p, m_k)},\ 1\right\} \pi(p)\, q(p, m_k)$
$= \pi(p)\, P(p, m_k).$
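The chain of equalities above can be verified numerically on a small discrete state space. The following is a minimal sketch in Python (standard library only, rather than Matlab). The target and the non-symmetric proposal are hypothetical choices made for this example, and the code uses the kernel convention of the proof, K[i][j] being the probability of moving from state i to state j, so here the rows (not the columns) sum to 1.

```python
def mh_transition_matrix(target, proposal):
    """K[i][j]: probability of moving from state i to state j under Metropolis-Hastings."""
    n = len(target)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and proposal[i][j] > 0.0:
                ratio = (target[j] * proposal[j][i]) / (target[i] * proposal[i][j])
                K[i][j] = proposal[i][j] * min(1.0, ratio)   # propose j, then accept
        K[i][i] = 1.0 - sum(K[i])        # remaining mass: stay at state i
    return K

# Hypothetical discrete target and non-symmetric proposal (each row sums to 1).
target = [0.1, 0.2, 0.3, 0.4]
proposal = [
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
]
K = mh_transition_matrix(target, proposal)

n = len(target)
# Reversibility: pi(i) K(i, j) = pi(j) K(j, i) for all pairs of states.
balance_gap = max(abs(target[i] * K[i][j] - target[j] * K[j][i])
                  for i in range(n) for j in range(n))
# Invariance: sum_i pi(i) K(i, j) = pi(j), which follows from reversibility.
invariance_gap = max(abs(sum(target[i] * K[i][j] for i in range(n)) - target[j])
                     for j in range(n))
```

Both gaps should vanish up to round-off, regardless of which proposal is chosen, as long as every proposed move can also be proposed in reverse.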
Let us now increase the proposal stepsize γ to 5, and we show the corre-
sponding chain in Figure 14(b). This time, the chain immediately explores the
[Figure panels: (a) γ = 0.02, (b) γ = 5, (c) γ = 0.5.]
$c_k = \sum_{j=0}^{N-k} m_{j+k}\, m_j, \quad k = 0, \ldots, N-1,$
If ĉk is zero, then we say that the correlation length of the Markov chain is
approximately k, that is, any state $m_j$ is considered to be insignificantly
correlated with $m_{j-k}$ (and hence any state before $m_{j-k}$), and with $m_{j+k}$ (and hence any
state after $m_{j+k}$). In other words, every kth sample point can be considered to
be approximately independent. Note that this is simply a heuristic, and one
should be aware that independence implies uncorrelatedness but not vice versa.
Let us now approximately compute the correlation length for three Markov
chains corresponding to γ = 0.02, γ = 0.5, and γ = 5, respectively, with
N = 100000. We first subtract away the sample mean as
$z_j = m_j - \frac{1}{N+1} \sum_{i=0}^{N} m_i.$
Then, we plot the autocorrelation functions ĉk for each component of the zero
mean sample $\{z_j\}_{j=0}^{N}$ in Figure 16. As can be observed, the autocorrelation
length for the chain with optimal stepsize γ = 0.5 is about k = 100, while the
others are much larger (not shown here). That is, every 100th sample point can
be considered to be approximately independent for γ = 0.5. The case with γ = 0.02 is the
worst, indicating slow movement around the target density. The stepsize γ = 5
is better, but it is so big that the chain remains at each state for a long period of
time, and hence the autocorrelation length is still significant relative to that of
γ = 0.5.
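The autocorrelation computation can be sketched as follows. This is a minimal illustration in Python (standard library only, rather than Matlab) for a scalar chain: subtract the sample mean, form the sums c_k over the zero mean sample, and normalize. The normalization ĉ_k = c_k / c_0 is an assumption of this sketch, chosen so that ĉ_0 = 1. The test series is a hypothetical first-order autoregressive sequence, whose lag-k autocorrelation is known to be rho^k, rather than an MCMC chain from the note.

```python
import random

def autocorrelation(samples, max_lag):
    """Return [c_hat_0, ..., c_hat_max_lag] with c_hat_k = c_k / c_0."""
    n = len(samples)
    mean = sum(samples) / n
    z = [s - mean for s in samples]               # zero mean sample
    c0 = sum(v * v for v in z)
    c_hat = []
    for k in range(max_lag + 1):
        ck = sum(z[j + k] * z[j] for j in range(n - k))   # c_k for the zero mean sample
        c_hat.append(ck / c0)
    return c_hat

# Hypothetical correlated series: x_{t+1} = rho * x_t + standard normal noise.
rng = random.Random(6)
rho = 0.8
x = 0.0
series = []
for _ in range(20000):
    x = rho * x + rng.gauss(0.0, 1.0)
    series.append(x)

c_hat = autocorrelation(series, 20)
```

For a vector-valued chain, the same function would be applied to each component separately, as in Figure 16; the lag at which `c_hat` becomes negligible is the heuristic correlation length.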
Extensive MCMC methods including improvements on the standard RWMH
algorithm can be found in [6]. Let us introduce two simple modifications
through the following two exercises.
$h(m) = \begin{pmatrix} m_1^2 - m_2 \\ m_2/5 \end{pmatrix}, \quad y = \begin{pmatrix} \ldots \\ 0.1 \end{pmatrix}.$
7.25 exercise. So far the proposal density q(m, p) is isotropic and independent
of the target density π(m). For an anisotropic target density, an isotropic
proposal is intuitively not a good idea. The reason is that the proposal is
distributed equally in all directions, whereas the target density is not. A
natural idea is to shape the proposal density to make it locally resemble the
target density. A simple idea in this direction is to linearize h(m), and then
define the proposal density as
$q(m_k, p) \propto \exp\left(-\frac{1}{2\delta^2} \|p\|^2 - \frac{1}{2\sigma^2} \left\|y - h(m_k) - \nabla h(m_k)(p - m_k)\right\|^2\right),$
1. Determine $H(m_k)$ such that $q(m_k, p) = \mathcal{N}\left(m_k, H(m_k)^{-1}\right)$, by keeping
only the quadratic term in $p - m_k$.
8 Matlab codes
A set of Matlab codes that can be used to reproduce most of the figures in the
note can be downloaded from
References
[1] D. Calvetti and E. Somersalo. Introduction to Bayesian Scientific Computing:
Ten Lectures on Subjective Computing. Springer, New York, 2007.
[2] Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverse Prob-
lems, volume 160 of Applied Mathematical Sciences. Springer-Verlag, New
York, 2005.
[3] Todd Arbogast and Jerry L. Bona. Methods of Applied Mathematics. Univer-
sity of Texas at Austin, 2008. Lecture notes in applied mathematics.
[4] Rick Durrett. Probability: Theory and Examples. Cambridge University Press,
[5] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling for various
Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
[6] Christian P. Robert and George Casella. Monte Carlo Statistical Methods
(Springer Texts in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ,
USA, 2005.
Figure 16: Autocorrelation function plot for both components of m with different $\gamma^2$. Panels: (a) γ = 0.02, (b) γ = 5, (c) γ = 0.5. Vertical axis: sample autocorrelation; legend: 1st component, 2nd component.