
Some of Bayesian Statistics: The Essential Parts

Rebecca C. Steorts

Monday 11th December, 2017



Contents

1 Introduction
1.1 Advantages of Bayesian Methods
1.2 de Finetti’s Theorem

2 Introduction to Bayesian Methods
2.1 Decision Theory
2.2 Frequentist Risk
2.3 Motivation for Bayes
2.4 Bayesian Decision Theory
◯ Frequentist Interpretation: Risk
◯ Bayesian Interpretation: Posterior Risk
◯ Hybrid Ideas
2.5 Bayesian Parametric Models
2.6 How to Choose Priors
2.7 Hierarchical Bayesian Models
2.8 Empirical Bayesian Models
2.9 Posterior Predictive Distributions

3 Being Objective
◯ Meaning Of Flat
◯ Objective Priors in More Detail
3.1 Reference Priors
◯ Laplace Approximation
◯ Some Probability Theory
◯ Shrinkage Argument of J.K. Ghosh
◯ Reference Priors
3.2 Final Thoughts on Being Objective

4 Evaluating Bayesian Procedures
4.1 Confidence Intervals versus Credible Intervals
4.2 Credible Sets or Intervals
4.3 Bayesian Hypothesis Testing
◯ Lavine and Schervish (The American Statistician, 1999): Bayes Factors: What They Are and What They Are Not
4.4 Bayesian p-values
◯ Prior Predictive p-value
◯ Other Bayesian p-values
4.5 Appendix to Chapter 4 (Done by Rafael Stern)

5 Monte Carlo Methods
5.1 A Quick Review of Monte Carlo Methods
◯ Classical Monte Carlo Integration
◯ Importance Sampling
◯ Importance Sampling with Unknown Normalizing Constant
◯ Rejection Sampling
5.2 Introduction to Gibbs and MCMC
◯ Markov Chains and Gibbs Samplers
◯ The Two-Stage Gibbs Sampler
◯ The Multistage Gibbs Sampler
◯ Application of the GS to Latent Variable Models
5.3 MCMC Diagnostics
5.4 Theory and Application Based Example
◯ PlA2 Example
5.5 Metropolis and Metropolis-Hastings
◯ Metropolis-Hastings Algorithm
◯ Metropolis and Gibbs Combined
5.6 Introduction to Nonparametric Bayes
◯ Motivations
◯ The Dirichlet Process
◯ Polya Urn Scheme on an Urn With Finitely Many Colors
◯ Polya Urn Scheme in General
◯ De Finetti and Exchangeability
◯ Chinese Restaurant Process
◯ Clustering: How to Choose K?

Chapter 1

Introduction

There are three kinds of lies: lies, damned lies and statistics.
—Mark Twain

The word “Bayesian” traces its origin to the 18th century and English Rev-
erend Thomas Bayes, who along with Pierre-Simon Laplace was among the first
thinkers to consider the laws of chance and randomness in a quantitative, scien-
tific way. Both Bayes and Laplace were aware of a relation that is now known
as Bayes Theorem:
p(θ∣x) = p(x∣θ)p(θ)/p(x) ∝ p(x∣θ)p(θ). (1.1)
The proportionality ∝ in Eq. (1.1) signifies that the 1/p(x) factor is constant
and may be ignored when viewing p(θ∣x) as a function of θ. We can decompose
Bayes’ Theorem into three principal terms:
p(θ∣x) posterior
p(x∣θ) likelihood
p(θ) prior
In effect, Bayes’ Theorem provides a general recipe for updating prior beliefs
about an unknown parameter θ based on observing some data x.

However, the notion of having prior beliefs about a parameter that is ostensibly
“unknown” did not sit well with many people who considered the problem in
the 19th and early 20th centuries. The resulting search for a way to practice
statistics without priors led to the development of frequentist statistics by such
eminent figures as Sir Ronald Fisher, Karl Pearson, Jerzy Neyman, Abraham
Wald, and many others.

The frequentist way of thinking came to dominate statistical theory and practice


in the 20th century, to the point that most students who take only introduc-
tory statistics courses are never even aware of the existence of an alternative
paradigm. However, recent decades have seen a resurgence of Bayesian statistics
(partially due to advances in computing power), and an increasing number of
statisticians subscribe to the Bayesian school of thought. Perhaps most encour-
agingly, both frequentists and Bayesians have become more willing to recognize
the strengths of the opposite approach and the weaknesses of their own, and it
is now common for open-minded statisticians to freely use techniques from both
sides when appropriate.

1.1 Advantages of Bayesian Methods

The basic philosophical difference between the frequentist and Bayesian paradigms
is that Bayesians treat an unknown parameter θ as random and use probability
to quantify their uncertainty about it. In contrast, frequentists treat θ as un-
known but fixed, and they therefore believe that probability statements about
θ are useless. This fundamental disagreement leads to entirely different ways to
handle statistical problems, even problems that might at first seem very basic.

To motivate the Bayesian approach, we now discuss two simple examples in which the frequentist way of thinking leads to answers that might be considered awkward, or even nonsensical.
Example 1.1: Let θ be the probability of a particular coin landing on heads,
and suppose we want to test the hypotheses
H0 ∶ θ = 1/2, H1 ∶ θ > 1/2
at a significance level of α = 0.05. Now suppose we observe the following se-
quence of flips:
heads, heads, heads, heads, heads, tails (5 heads, 1 tail)
To perform a frequentist hypothesis test, we must define a random variable to
describe the data. The proper way to do this depends on exactly which of the
following two experiments was actually performed:

• Suppose that the experiment was “Flip six times and record the results.”
In this case, the random variable X counts the number of heads, and
X ∼ Binomial(6, θ). The observed data was x = 5, and the p-value of our
hypothesis test is
p-value = Pθ=1/2(X ≥ 5)
        = Pθ=1/2(X = 5) + Pθ=1/2(X = 6)
        = 6/64 + 1/64 = 7/64 = 0.109375 > 0.05.

So we fail to reject H0 at α = 0.05.

• Suppose instead that the experiment was “Flip until we get tails.” In this
case, the random variable X counts the number of the flip on which the
first tails occurs, and X ∼ Geometric(1 − θ). The observed data was x = 6,
and the p-value of our hypothesis test is

p-value = Pθ=1/2(X ≥ 6)
        = 1 − Pθ=1/2(X < 6)
        = 1 − ∑_{x=1}^{5} Pθ=1/2(X = x)
        = 1 − (1/2 + 1/4 + 1/8 + 1/16 + 1/32) = 1/32 = 0.03125 < 0.05.
So we reject H0 at α = 0.05.

The conclusions differ, which seems absurd. Moreover the p-values aren’t even
close—one is 3.5 times as large as the other. Essentially, the result of our
hypothesis test depends on whether we would have stopped flipping if we had
gotten a tails sooner. In other words, the frequentist approach requires us to
specify what we would have done had the data been something that we already
know it wasn’t.

Note that despite the different results, the likelihood for the actual value of x
that was observed is the same for both experiments (up to a constant):

p(x∣θ) ∝ θ^5 (1 − θ).

A Bayesian approach would take the data into account only through this likeli-
hood and would therefore be guaranteed to provide the same answers regardless
of which experiment was being performed.
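
Both p-values are quick to verify in R (a sketch, not part of the original notes):

# Binomial experiment: "flip six times," observed x = 5 heads
pbinom(4, size = 6, prob = 0.5, lower.tail = FALSE)   # P(X >= 5) = 7/64 = 0.109375
# Geometric experiment: "flip until tails," first tails on flip 6;
# in R, X - 1 ~ Geometric(1/2) counts the heads before the first tails
pgeom(4, prob = 0.5, lower.tail = FALSE)              # P(X >= 6) = 1/32 = 0.03125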

Example 1.2: Suppose we want to test whether the voltage θ across some
electrical component differs from 9 V, based on noisy readings of this voltage
from a voltmeter. Suppose the data is as follows:

9.7, 9.4, 9.8, 8.7, 8.6

A frequentist might assume that the voltage readings Xi are iid from some
N (θ, σ 2 ) distribution, which would lead to a basic one-sample t-test.

However, the frequentist is then presented with an additional piece of information: The voltmeter used for the experiment only went up to 10 V, and any
readings that might have otherwise been higher are instead truncated to that
value. Notice that none of the voltages in the data are 10 V. In other words,
we already know that the 10 V limit was completely irrelevant for the data we
actually observed.

Nevertheless, a frequentist must now redo the analysis and could perhaps obtain
a different conclusion, because the 10 V limit changes the distribution of the
observations under the null hypothesis. Like in the last example, the frequentist
results change based on what would have happened had the data been something
that we already know it wasn’t.

The problems in Examples 1.1 and 1.2 arise from the way the frequentist
paradigm forces itself to interpret probability. Another familiar aspect of this
problem is the awkward definition of “confidence” in frequentist confidence inter-
vals. The most natural interpretation of a 95% confidence interval (L, U )—that
there is a 95% chance that the parameter is between L and U —is dead wrong
from the frequentist point of view. Instead, the notion of “confidence” must
be interpreted in terms of repeating the experiment a large number of times
(in principle, an infinite number), and no probabilistic statement can be made about this particular confidence interval computed from the data we actually observed.

1.2 de Finetti’s Theorem

In this section, we’ll motivate the use of priors on parameters and indeed motivate the very use of parameters. We begin with a definition.
Definition 1.1: (Infinite exchangeability). We say that (x1 , x2 , . . . ) is an in-
finitely exchangeable sequence of random variables if, for any n, the joint prob-
ability p(x1 , x2 , ..., xn ) is invariant to permutation of the indices. That is, for
any permutation π,

p(x1 , x2 , ..., xn ) = p(xπ1 , xπ2 , ..., xπn ).

A key assumption of many statistical analyses is that the random variables being
studied are independent and identically distributed (iid). Note that iid random
variables are always infinitely exchangeable. However, infinite exchangeability
is a much broader concept than being iid; an infinitely exchangeable sequence
is not necessarily iid. For example, let (x1 , x2 , . . . ) be iid, and let x0 be a non-
trivial random variable independent of the rest. Then (x0 + x1 , x0 + x2 , . . .) is
infinitely exchangeable but not iid. The usefulness of infinite exchangeability
lies in the following theorem.
Theorem 1.1. (De Finetti). A sequence of random variables (x1 , x2 , . . . ) is
infinitely exchangeable iff, for all n,
p(x1 , x2 , ..., xn ) = ∫ ∏_{i=1}^{n} p(xi ∣θ) P(dθ),

for some measure P on θ.



If the distribution on θ has a density, we can replace P (dθ) with p(θ) dθ, but
the theorem applies to a much broader class of cases than just those with a
density for θ.

Clearly, since ∏_{i=1}^{n} p(xi ∣θ) is invariant to reordering, we have that any sequence
of distributions that can be written as

∫ ∏_{i=1}^{n} p(xi ∣θ) p(θ) dθ,

for all n must be infinitely exchangeable. The other direction, though, is much
deeper. It says that if we have exchangeable data, then:

• There must exist a parameter θ.


• There must exist a likelihood p(x∣θ).

• There must exist a distribution P on θ.


• The above quantities must exist so as to render the data (x1 , . . . , xn )
conditionally independent.

Thus, the theorem provides an answer to the questions of why we should use
parameters and why we should put priors on parameters.
Example 1.3: (Document processing and information retrieval). To highlight
the difference between iid and infinitely exchangeable sequences, consider that
search engines have historically used “bag-of-words” models to model docu-
ments. That is, for the moment, pretend that the order of words in a document
does not matter. Even so, the words are definitely not iid. If we see one word
and it is a French word, we then expect that the rest of the document is likely to
be in French. If we see the French words voyage (travel), passeport (passport),
and douane (customs), we expect the rest of the document to be both in French
and on the subject of travel. Since we are assuming infinite exchangeability,
there is some θ governing these intuitions. Thus, we see that θ can be very rich,
and it seems implausible that θ might always be finite-dimensional in Theorem
1.1. In fact, it is the case that θ can be infinite-dimensional in Theorem 1.1. For
example, in nonparametric Bayesian work, θ can be a stochastic process.

Chapter 2

Introduction to Bayesian Methods

Every time I think I know what’s going on, suddenly there’s another layer of
complications. I just want this damned thing solved.
—John Scalzi, The Last Colony

We introduce Bayesian methods by first motivating them through decision theory and introducing the ideas of loss functions, among others. Advanced topics in decision theory will be covered much later in the course. We will also cover the following topics in this chapter:

• hierarchical and empirical Bayesian methods

• the difference between subjective and objective priors

• posterior predictive distributions.

2.1 Decision Theory

Another motivation for the Bayesian approach is decision theory. Its origins go
back to Von Neumann and Morgenstern’s game theory, but the main character
was Wald. In statistical decision theory, we formalize good and bad results with
a loss function.

A loss function L(θ, δ(x)) is a function of a parameter or index θ ∈ Θ and a decision δ(x) based on the data x ∈ X. For example, δ(x) = n^{−1} ∑_{i=1}^{n} xi might be

the sample mean, and θ might be the true mean. The loss function determines
the penalty for deciding δ(x) if θ is the true parameter. To give some intuition,
in the discrete case, we might use a 0–1 loss, which assigns


L(θ, δ(x)) = { 0 if δ(x) = θ,  1 if δ(x) ≠ θ },

or in the continuous case, we might use the squared error loss L(θ, δ(x)) = (θ − δ(x))^2. Notice that in general, δ(x) does not necessarily have to be an estimate
of θ. Loss functions provide a very good foundation for statistical decision theory.
They are simply a function of the state of nature (θ) and a decision function
(δ(⋅)). In order to compare procedures we need to calculate which procedure is
best even though we cannot observe the true nature of the parameter space Θ
and data X. This is the main challenge of decision theory and the break between
frequentists and Bayesians.

2.2 Frequentist Risk

Definition 2.1: The frequentist risk is

R(θ, δ(x)) = Eθ[L(θ, δ(x))] = ∫_X L(θ, δ(x)) f(x∣θ) dx,

where θ is held fixed and the expectation is taken over X.

Thus, the risk measures the long-term average loss resulting from using δ.

Figure 2.1 shows the risk of three different decisions as a function of θ ∈ Θ.

Often one decision does not dominate the other everywhere, as is the case with
decisions δ1 and δ3. The challenge is in saying whether, for example, δ1 or δ3 is
better. In other words, how should we aggregate over Θ?

Frequentists have a few answers for deciding which is better:

1. Admissibility. A decision which is inadmissible is one that is dominated everywhere. For example, in Figure 2.1, δ2 dominates δ1 for all values of θ. It would be easy to compare decisions if all but one were inadmissible. But usually the risk functions overlap, so this criterion fails.

2. Restricted classes of procedures. We say that an estimator θ̂ is an unbiased estimator of θ if Eθ[θ̂] = θ for all θ. If we restrict our attention to only unbiased estimators, then we can often reduce the situation to only risk curves like δ1 and δ2 in Figure 2.1, eliminating overlapping

Figure 2.1: Frequentist risk. [Plot of the risk curves R(θ, δ1), R(θ, δ2), and R(θ, δ3) as functions of θ.]

curves like δ3. The existence of an optimal unbiased procedure is a nice piece of frequentist theory, but many good procedures are biased—for example, Bayesian procedures are typically biased. More surprisingly, some unbiased procedures are actually inadmissible. For example, James and Stein showed that the sample mean is an inadmissible estimate of the mean of a multivariate Gaussian in three or more dimensions. There are also some problems where no unbiased estimator exists—for example, when p is a binomial proportion and we wish to estimate 1/p (see Example 2.1.2 on page 83 of Lehmann and Casella). If we restrict our class of procedures to those which are equivariant, we also get nice properties. We do not go into detail here, but these are procedures with the same group-theoretic properties as the data.

3. Minimax. In this approach we get around the problem by just looking at sup_Θ R(θ, δ(x)), where R(θ, δ(x)) = Eθ[L(θ, δ(x))]. For example, in Figure 2.2, δ2 would be chosen over δ1 because its maximum worst-case risk (the grey dotted line) is lower.

A Bayesian answer is to introduce a weighting function p(θ) to tell which part of Θ is important and integrate with respect to p(θ). In some sense the frequentist approach is the opposite of the Bayesian approach. However, sometimes an equivalent Bayesian procedure can be derived using a certain prior. Before moving on, again note that R(θ, δ(x)) = Eθ[L(θ, δ(x))] is an expectation on X, assuming fixed θ. A Bayesian would only look at x, the data you observed, not all possible X.

Figure 2.2: Minimax frequentist risk. [Plot of R(θ, δ1) and R(θ, δ2) as functions of θ.]

An alternative definition of a frequentist is, “Someone who is happy to look at other data they could have gotten but didn’t.”

2.3 Motivation for Bayes

The Bayesian approach can also be motivated by a set of principles. Some books
and classes start with a long list of axioms and principles conceived in the 1950s
and 1960s. However, we will focus on three main principles.


1. Conditionality Principle: The idea here for a Bayesian is that we condition on the data x.

• Suppose we have an experiment concerning inference about θ that is chosen from a collection of possible experiments independently.
• Then any experiment not chosen is irrelevant to the inference (this
is the opposite of what we do in frequentist inference).

Example 2.1: For example, two different labs estimate the potency of drugs. Both have some error or noise in their measurements, which can be accurately estimated from past tests. Now we introduce a new drug and test its potency at a randomly chosen lab. Suppose the sample sizes matter dramatically.
• Suppose the sample size of the first experiment (lab 1) is 1 and the
sample size of the second experiment (lab 2) is 100.
• What happens if we’re doing a frequentist experiment in terms of
the variance? Since this is a randomized experiment, we need to take
into account all of the data. In essence, the variance will do some
sort of averaging to take into account the sample sizes of each.
• However, taking a Bayesian approach, we just care about the data
that we see. Thus, the variance calculation will only come from the
actual data at the randomly chosen lab.
Thus, the question we ask is: should we use the noise level from the lab where the drug was tested, or average over both? Intuitively, we use the noise level from the lab where it was tested, but in some frequentist approaches, it is not always so straightforward.

2. Likelihood Principle: The relevant information in any inference about θ after observing x is contained entirely in the likelihood function. Remember
ber the likelihood function p(x∣θ) for fixed x is viewed as a function of θ,
not x. For example in Bayesian approaches, p(θ∣x) ∝ p(x∣θ)p(θ), so clearly
inference about θ is based on the likelihood. Another approach based on
the likelihood principle is Fisher’s maximum likelihood estimation. This
approach can also be justified by asymptotics. In case this principle seems
too indisputable, here is an example using hypothesis testing in coin toss-
ing that shows how some reasonable procedures may not follow it.
Example 2.2: Let θ be the probability of a particular coin landing on
heads and let
H0 ∶ θ = 1/2, H1 ∶ θ > 1/2.
Now suppose we observe the following sequence of flips:

H,H,T,H,T,H,H,H,H,H,H,T (9 heads, 3 tails)

Then the likelihood is simply

p(x∣θ) ∝ θ^9 (1 − θ)^3.

Many non-Bayesian analyses would pick an experimental design that is reflected in p(x∣θ), for example binomial (toss a coin 12 times) or negative
binomial (toss a coin until you get 3 tails). However the two lead to
different probabilities over the sample space X. This results in different
assumed tail probabilities and p-values.

We repeat a few definitions from mathematical statistics that will be used in the course at one point or another. For examples of sufficient statistics or distributions that fall in exponential families, we refer the reader to Theory of Point Estimation (TPE), Chapter 1.

Definition 2.2: Sufficiency


Recall that for a data set x = (x1 , . . . , xn ), a sufficient statistic T (x) is
a function such that the likelihood p(x∣θ) = p(x1 , . . . , xn ∣θ) depends on
x1 , . . . , xn only through T (x). Then the likelihood p(x∣θ) may be written
as p(x∣θ) = g(θ, T (x)) h(x) for some functions g and h.

Definition 2.3: Exponential Families


A family {Pθ} of distributions is said to form an s-dimensional exponential family if the distributions Pθ have densities of the form

pθ(x) = exp[ ∑_{i=1}^{s} ηi(θ) Ti(x) − B(θ) ] h(x).

3. Sufficiency Principle: The sufficiency principle states that if two different observations x and y have the same sufficient statistic T(x) = T(y), then inference based on x and y should be the same. The sufficiency principle is the least controversial principle.

Theorem 2.1. The posterior distribution p(θ∣y) depends on the data only through the sufficient statistic T(y).

Proof. By the factorization theorem, if T (y) is sufficient,

f (y∣θ) = g(θ, T (y)) h(y).

Then we know the posterior can be written as

p(θ∣y) = f(y∣θ)π(θ) / ∫ f(y∣θ)π(θ) dθ
       = g(θ, T(y)) h(y) π(θ) / ∫ g(θ, T(y)) h(y) π(θ) dθ
       = g(θ, T(y)) π(θ) / ∫ g(θ, T(y)) π(θ) dθ
       ∝ g(θ, T(y)) π(θ),

which only depends on y through T(y).

Example 2.3: Sufficiency

Let y := ∑i yi and consider

y1 , . . . , yn ∣ θ ∼iid Bin(1, θ).

Then p(y∣θ) = C(n, y) θ^y (1 − θ)^{n−y}. Let p(θ) represent a general prior. Then

p(θ∣y) ∝ θ^y (1 − θ)^{n−y} p(θ),

which only depends on the data through the sufficient statistic y.

Theorem 2.2. (Birnbaum).

Sufficiency Principle + Conditionality Principle = Likelihood Principle.

So if we assume the sufficiency principle, then the conditionality and likelihood principles are equivalent. The Bayesian approach satisfies all of these principles.

2.4 Bayesian Decision Theory

Earlier we discussed the frequentist approach to statistical decision theory. Now


we discuss the Bayesian approach in which we condition on x and integrate over
Θ (remember it was the other way around in the frequentist approach). The
posterior risk is defined as ρ(π, δ(x)) = ∫Θ L(θ, δ(x))π(θ∣x) dθ.

The Bayes action δ ∗ (x) for any fixed x is the decision δ(x) that minimizes the
posterior risk. If the problem at hand is to estimate some unknown parameter
θ, then we typically call this the Bayes estimator instead.

Theorem 2.3. Under squared error loss, the decision δ(x) that minimizes the
posterior risk is the posterior mean.

Proof. Suppose that L(θ, δ(x)) = (θ − δ(x))^2. Now note that

ρ(π, δ(x)) = ∫ (θ − δ(x))^2 π(θ∣x) dθ
           = ∫ θ^2 π(θ∣x) dθ + δ(x)^2 ∫ π(θ∣x) dθ − 2δ(x) ∫ θ π(θ∣x) dθ.

Then

∂[ρ(π, δ(x))]/∂[δ(x)] = 2δ(x) − 2 ∫ θ π(θ∣x) dθ = 0 ⟺ δ(x) = E[θ∣x],

and ∂^2[ρ(π, δ(x))]/∂[δ(x)]^2 = 2 > 0, so δ(x) = E[θ∣x] is the minimizer.
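
A quick Monte Carlo illustration of this theorem (a sketch, not part of the original notes; the Beta(7, 5) "posterior" here is an arbitrary hypothetical choice):

set.seed(42)
theta = rbeta(1e5, 7, 5)          # pretend draws from some posterior pi(theta | x)
cand = seq(0.3, 0.9, by = 0.005)  # candidate decisions delta(x)
risk = sapply(cand, function(d) mean((theta - d)^2))  # approximate posterior risk
cand[which.min(risk)]             # minimizer: close to...
mean(theta)                       # ...the posterior mean, 7/12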

Recall that decision theory provides a quantification of what it means for a pro-
cedure to be ‘good.’ This quantification comes from the loss function L(θ, δ(x)).
Frequentists and Bayesians use the loss function differently.

◯ Frequentist Interpretation: Risk

In frequentist usage, the parameter θ is fixed, and thus it is the sample space
over which averages are taken. Letting R(θ, δ(x)) denote the frequentist risk,
recall that R(θ, δ(x)) = Eθ [L(θ, δ(x))]. This expectation is taken over the data
X, with the parameter θ held fixed. Note that the data, X, is capitalized,
emphasizing that it is a random variable.

Example 2.4: (Squared error loss). Let the loss function be squared error. In this case, the risk is

R(θ, δ(x)) = Eθ[(θ − δ(x))^2]
           = Eθ[{θ − Eθ[δ(x)] + Eθ[δ(x)] − δ(x)}^2]
           = {θ − Eθ[δ(x)]}^2 + Eθ[{δ(x) − Eθ[δ(x)]}^2]
           = Bias^2 + Variance.

This result allows a frequentist to analyze the variance and bias of an estimator separately, and can be used to motivate frequentist ideas, e.g. minimum variance unbiased estimators (MVUEs).

◯ Bayesian Interpretation: Posterior Risk

Bayesians do not find the previous idea compelling, because it does not adhere to the conditionality principle: it averages over all possible data sets. Hence, in a Bayesian framework, we define the posterior risk based on the data x and a prior π:

ρ(π, δ(x)) = ∫_Θ L(θ, δ(x)) π(θ∣x) dθ.

Note that the prior enters the equation when calculating the posterior density. Using the posterior risk, we can define a bit of jargon. Recall that the Bayes action δ∗(x) is the value of δ(x) that minimizes the posterior risk. We already showed that the Bayes action under squared error loss is the posterior mean.

◯ Hybrid Ideas

Despite the tensions between frequentists and Bayesians, they occasionally steal
ideas from each other.

Definition 2.4: The Bayes risk is denoted by r(π, δ(x)). While the Bayes risk
is a frequentist concept since it averages over X, the expression can also be

interpreted differently. Consider

r(π, δ(x)) = ∫∫ L(θ, δ(x)) f(x∣θ) π(θ) dx dθ
           = ∫∫ L(θ, δ(x)) π(θ∣x) π(x) dx dθ
           = ∫ ρ(π, δ(x)) π(x) dx.

Note that the last equation is the posterior risk averaged over the marginal
distribution of x. Another connection with frequentist theory includes that
finding a Bayes rule against the “worst possible prior” gives you a minimax
estimator. While a Bayesian might not find this particularly interesting, it is
useful from a frequentist perspective because it provides a way to compute the
minimax estimator.

We will come back to more decision theory in a later chapter on advanced decision theory, where we will cover topics such as minimaxity, admissibility, and James–Stein estimators.

2.5 Bayesian Parametric Models

For now we will consider parametric models, which means that the parameter θ
is a fixed-dimensional vector of numbers. Let x ∈ X be the observed data and
θ ∈ Θ be the parameter. Note that X may be called the sample space, while Θ
may be called the parameter space. Now we define some notation that we will
reuse throughout the course:
p(x∣θ)  likelihood
π(θ)  prior
p(x) = ∫ p(x∣θ)π(θ) dθ  marginal likelihood
p(θ∣x) = p(x∣θ)π(θ)/p(x)  posterior probability
p(xnew∣x) = ∫ p(xnew∣θ)π(θ∣x) dθ  predictive probability

Most of Bayesian analysis is calculating these quantities in some way or another. Note that the definition of the predictive probability assumes exchangeability, but it can easily be modified if the data are not exchangeable. As a helpful hint, note that for the posterior distribution,

p(θ∣x) = p(x∣θ)π(θ)/p(x) ∝ p(x∣θ)π(θ),

and oftentimes it’s best to not calculate the normalizing constant p(x) because
you can recognize the form of p(x∣θ)π(θ) as a probability distribution you know.
So don’t normalize until the end!

Remark: The prior distribution that we take on θ does not have to be a proper distribution; however, the posterior is always required to be proper for valid inference. By proper, I mean that the distribution must integrate to 1.

Two questions we still need to address are

• How do we choose priors?

• How do we compute the aforementioned quantities, such as posterior distributions?

We’ll focus on choosing priors for now.

2.6 How to Choose Priors

We will discuss objective and subjective priors. Objective priors may be obtained from the likelihood or through some type of invariance argument. Subjective priors are typically arrived at by a process involving interviews with domain experts and thinking really hard; in fact, there is arguably more philosophy and psychology in the study of subjective priors than mathematics. We start with conjugate priors. The main justification for the use of conjugate priors is that they are computationally convenient and have asymptotically desirable properties.

Choosing prior probabilities: Subjective or Objective.

Subjective
A prior probability could be subjective based on the information a person might
have due to past experience, scientific considerations, or simple common sense.
For example, suppose we wish to estimate the probability that a randomly
selected woman has breast cancer. A simple prior could be formulated based
on the national or worldwide incidence of breast cancer. A more sophisticated
approach might take into account the woman’s age, ethnicity, and family history.
Neither approach could necessarily be classified as right or wrong—again, it’s
subjective.

As another example, say a doctor administers a treatment to patients and finds 48 out of 50 are cured. If the same doctor later wishes to investigate the cure probability of a similar but slightly different treatment, he might expect that its cure probability will also be around 48/50 = 0.96. However, a different doctor may have only had 8/10 patients cured by the first treatment and might therefore specify a prior suggesting a cure rate of around 0.8 for the new treatment. For convenience, subjective priors are often chosen to take the form of common distributions, such as the normal, gamma, or beta distribution.

Objective
An objective prior (also called default, vague, or noninformative) can be used in a given situation even in the absence of sufficient prior information. Examples of objective priors are flat priors such as Laplace’s, Haldane’s, Jeffreys’, and Bernardo’s reference priors. These priors will be discussed later.

2.7 Hierarchical Bayesian Models

In a hierarchical Bayesian model, rather than specifying the prior distribution as a single function, we specify it as a hierarchy. Thus, we put a prior on the unknown parameter of interest, say θ, and we also specify priors on any unknown hyperparameters of the model (hyperparameters are what the parameters of a prior are often called, as we have already said). We write

X∣θ ∼ f(x∣θ)
Θ∣γ ∼ π(θ∣γ)
Γ ∼ φ(γ),

where we assume that φ(γ) is known and does not depend on any further unknown hyperparameters. Note that we can continue this hierarchical modeling and add more stages to the model; however, doing so adds more complexity to the model (and, as we will see, may result in a posterior that we cannot compute without the aid of numerical integration or MCMC, which we will cover in detail in a later chapter).

Definition 2.5: (Conjugate Distributions). Let F be the class of sampling distributions p(y∣θ), and let P denote the class of prior distributions on θ. Then P is said to be conjugate to F if for every p(θ) ∈ P and p(y∣θ) ∈ F, the posterior p(θ∣y) ∈ P.
Simple definition: a family of priors such that, upon being multiplied by the likelihood, it yields a posterior in the same family.

Example 2.5: (Beta-Binomial) If X∣θ is distributed as Binomial(n, θ), then a conjugate prior is the beta family of distributions, where we can show that the posterior is

π(θ∣x) ∝ p(x∣θ)p(θ)
       ∝ C(n, x) θ^x (1 − θ)^{n−x} [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1} (1 − θ)^{b−1}
       ∝ θ^x (1 − θ)^{n−x} θ^{a−1} (1 − θ)^{b−1}
       ∝ θ^{x+a−1} (1 − θ)^{n−x+b−1} ⟹

θ∣x ∼ Beta(x + a, n − x + b).

Let’s apply this to a real example! We’re interested in the proportion of people
that approve of President Obama in PA.

• We take a random sample of 10 people in PA and find that 6 approve of President Obama.

• The national approval rating (Zogby poll) of President Obama in mid-December was 45%. We’ll assume that in PA his approval rating is approximately 50%.

• Based on this prior information, we’ll use a Beta prior for θ and we’ll
choose a and b. (Won’t get into this here).

• We can plot the prior and likelihood distributions in R and then see how
the two mix to form the posterior distribution.
[Three R plots: the Beta prior for θ; the prior with the likelihood; and the prior, likelihood, and posterior densities of θ on [0, 1].]
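
The notes do not include the code for these plots; below is one way to produce them in R. The Beta(4, 4) prior here is a hypothetical choice centered at 0.5 (the notes choose a and b from the prior information but do not give the values):

theta = seq(0, 1, length = 500)
n = 10; x = 6                            # 6 of 10 sampled approve
a = 4; b = 4                             # hypothetical Beta(4, 4) prior centered at 1/2
prior = dbeta(theta, a, b)
like  = dbeta(theta, x + 1, n - x + 1)   # likelihood, normalized as a Beta(x+1, n-x+1) density
post  = dbeta(theta, x + a, n - x + b)   # conjugate posterior Beta(x+a, n-x+b)
plot(theta, post, type = "l", lty = 3, lwd = 3, xlab = expression(theta), ylab = "Density")
lines(theta, like, lty = 1, lwd = 3)
lines(theta, prior, lty = 2, lwd = 3)
legend("topright", c("Prior", "Likelihood", "Posterior"), lty = c(2, 1, 3), lwd = 3)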

Example 2.6: (Normal-Uniform Prior)

X1 , . . . , Xn ∣θ ∼iid Normal(θ, σ^2), σ^2 known,
θ ∼ Uniform(−∞, ∞),

where θ ∼ Uniform(−∞, ∞) means that p(θ) ∝ 1.

Calculate the posterior distribution of θ given the data:

p(θ∣x) ∝ ∏_{i=1}^{n} (1/√(2πσ^2)) exp{ −(xi − θ)^2/(2σ^2) }
       = (2πσ^2)^{−n/2} exp{ −(1/(2σ^2)) ∑_{i=1}^{n} (xi − θ)^2 }
       ∝ exp{ −(1/(2σ^2)) ∑_i (xi − θ)^2 }.

Note that ∑_i (xi − θ)^2 = ∑_i (xi − x̄)^2 + n(x̄ − θ)^2. Then

p(θ∣x) ∝ exp{ −(1/(2σ^2)) ∑_i (xi − x̄)^2 } exp{ −(n/(2σ^2)) (x̄ − θ)^2 }
       ∝ exp{ −(n/(2σ^2)) (x̄ − θ)^2 }
       = exp{ −(n/(2σ^2)) (θ − x̄)^2 }.

Thus,
θ∣x1 , . . . , xn ∼ Normal(x̄, σ^2/n).

Example 2.7: Normal-Normal

X1 , . . . , Xn ∣θ ∼iid N(θ, σ^2),
θ ∼ N(µ, τ^2),

where σ^2 is known. Calculate the distribution of θ∣x1 , . . . , xn .

p(θ∣x1 , . . . , xn ) ∝ ∏_{i=1}^{n} (1/√(2πσ^2)) exp{ −(xi − θ)^2/(2σ^2) } × (1/√(2πτ^2)) exp{ −(θ − µ)^2/(2τ^2) }
∝ exp{ −(1/(2σ^2)) ∑_i (xi − θ)^2 } exp{ −(1/(2τ^2)) (θ − µ)^2 }.

Consider

∑_i (xi − θ)^2 = ∑_i (xi − x̄ + x̄ − θ)^2 = ∑_i (xi − x̄)^2 + n(x̄ − θ)^2.

Then

p(θ∣x1 , . . . , xn ) ∝ exp{ −(1/(2σ^2)) ∑_i (xi − x̄)^2 } exp{ −(n/(2σ^2)) (x̄ − θ)^2 } exp{ −(1/(2τ^2)) (θ − µ)^2 }
∝ exp{ −(n/(2σ^2)) (x̄ − θ)^2 } exp{ −(1/(2τ^2)) (θ − µ)^2 }
= exp{ −(1/2) [ (n/σ^2)(x̄^2 − 2x̄θ + θ^2) + (1/τ^2)(θ^2 − 2θµ + µ^2) ] }
= exp{ −(1/2) [ (n/σ^2 + 1/τ^2) θ^2 − 2θ (nx̄/σ^2 + µ/τ^2) + nx̄^2/σ^2 + µ^2/τ^2 ] }
∝ exp{ −(1/2) [ (n/σ^2 + 1/τ^2) θ^2 − 2θ (nx̄/σ^2 + µ/τ^2) ] }
∝ exp{ −(1/2) (n/σ^2 + 1/τ^2) [ θ − (nx̄/σ^2 + µ/τ^2)/(n/σ^2 + 1/τ^2) ]^2 }.

Recall what it means to complete the square as we did above.¹ Thus,

θ∣x1 , . . . , xn ∼ N( (nx̄/σ^2 + µ/τ^2)/(n/σ^2 + 1/τ^2), 1/(n/σ^2 + 1/τ^2) )
                 = N( (nx̄τ^2 + µσ^2)/(nτ^2 + σ^2), σ^2τ^2/(nτ^2 + σ^2) ).

Definition 2.6: The reciprocal of the variance is referred to as the precision. That is,

Precision = 1/Variance.
Theorem 2.4. Let δn be a sequence of estimators of g(θ) with mean squared error E(δn − g(θ))^2.

(i) If E[δn − g(θ)]^2 → 0, then δn is consistent for g(θ).

(ii) Equivalently, δn is consistent if the bias bn(θ) → 0 and Var(δn) → 0 for all θ.

(iii) In particular (and most useful), δn is consistent if it is unbiased for each n and if Var(δn) → 0 for all θ.

We omit the proof since it requires Chebyshev’s Inequality along with a bit of probability theory. See Problem 1.8.1 in TPE for the exercise of proving this.
¹ Recall from algebra that (x − b)^2 = x^2 − 2bx + b^2. We want to complete something that resembles x^2 − 2bx = x^2 − 2bx + (2b/2)^2 − (2b/2)^2 = (x − b)^2 − b^2.



Example 2.8: (Normal-Normal Revisited) Recall Example 2.7. We write the posterior mean as E(θ∣x). In this example,

E(θ∣x) = (nx̄/σ^2 + µ/τ^2) / (n/σ^2 + 1/τ^2)
       = [ (n/σ^2)/(n/σ^2 + 1/τ^2) ] x̄ + [ (1/τ^2)/(n/σ^2 + 1/τ^2) ] µ.

We also write the posterior variance as

V(θ∣x) = 1 / (n/σ^2 + 1/τ^2).
We can see that the posterior mean is a weighted average of the sample mean and
the prior mean. The weights are proportional to the reciprocal of the respective
variances (precision). In this case,
Posterior Precision = 1/Posterior Variance
                    = (n/σ^2) + (1/τ^2)
                    = Sample Precision + Prior Precision.

The posterior precision is larger than either the sample precision or the prior
precision. Equivalently, the posterior variance, denoted by V (θ∣x), is smaller
than either the sample variance or the prior variance.
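
A small numerical illustration of these precision formulas in R (hypothetical numbers, not from the notes):

n = 25; xbar = 5.2        # hypothetical sample size and sample mean
sigma2 = 4                # known data variance
mu = 0; tau2 = 1          # prior mean and prior variance
post.prec = n/sigma2 + 1/tau2                      # sample precision + prior precision
post.mean = (n*xbar/sigma2 + mu/tau2)/post.prec    # precision-weighted average
post.var = 1/post.prec
c(post.mean, post.var)    # mean lies between mu and xbar; variance is below both sigma2/n and tau2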

What happens as n → ∞?

Divide the numerator and denominator of the posterior mean by n and take n → ∞. Then

E(θ∣x) = (x̄/σ^2 + µ/(nτ^2)) / (1/σ^2 + 1/(nτ^2)) → (x̄/σ^2)/(1/σ^2) = x̄ as n → ∞.

In the case of the posterior variance, divide the numerator and denominator by n. Then

V(θ∣x) = (1/n) / (1/σ^2 + 1/(nτ^2)) ≈ σ^2/n → 0 as n → ∞.

Since the bias of the posterior mean tends to 0 and its variance tends to 0 as n → ∞, the posterior mean is consistent by Theorem 2.4.

Example 2.9:

X∣α, β ∼ Gamma(α, β), α known, β unknown
β ∼ IG(a, b).

Calculate the posterior distribution of β∣x:

p(β∣x) ∝ [1/(Γ(α)β^α)] x^{α−1} e^{−x/β} × [b^a/Γ(a)] β^{−a−1} e^{−b/β}
       ∝ (1/β^α) e^{−x/β} β^{−a−1} e^{−b/β}
       = β^{−α−a−1} e^{−(x+b)/β}.

Notice that this looks like an Inverse Gamma distribution with parameters α + a and x + b. Thus,

β∣x ∼ IG(α + a, x + b).

Example 2.10: (Bayesian versus frequentist)


Suppose a child is given an IQ test and his score is X. We assume that

X∣θ ∼ Normal(θ, 100)


θ ∼ Normal(100, 225)

From previous calculations, we know that the posterior is

θ∣x ∼ Normal( (400 + 9x)/13, 900/13 ).

Here the posterior mean is (400 + 9x)/13. Suppose x = 115. Then the posterior
mean becomes 110.4. Contrasting this, we know that the frequentist estimate
is the mle, which is x = 115 in this example.

The posterior variance is 900/13 = 69.23, whereas the variance of the data is
σ 2 = 100.

Now suppose we take the Uniform(−∞, ∞) prior on θ. From an earlier example, we found that the posterior is

θ∣x ∼ Normal(115, 100).

Notice that the posterior mean and mle are both 115 and the posterior variance
and variance of the data are both 100.

When we put little/no prior information on θ, the data washes away most/all
of the prior information (and the results of frequentist and Bayesian estimation
are similar or equivalent in this case).
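
A one-line check of these posterior quantities via the Normal-Normal formulas of Example 2.8, with n = 1 (a sketch, not in the original notes):

sigma2 = 100; mu = 100; tau2 = 225; x = 115
prec = 1/sigma2 + 1/tau2               # posterior precision
c((x/sigma2 + mu/tau2)/prec, 1/prec)   # posterior mean 110.38, posterior variance 69.23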

Example 2.11: (Normal Example with Unknown Variance)

Consider

X1 , . . . , Xn ∣θ ∼iid Normal(θ, σ^2), θ known, σ^2 unknown
p(σ^2) ∝ (σ^2)^{−1}.

Calculate p(σ^2∣x1 , . . . , xn ):

p(σ^2∣x1 , . . . , xn ) ∝ (2πσ^2)^{−n/2} exp{ −(1/(2σ^2)) ∑_i (xi − θ)^2 } (σ^2)^{−1}
                       ∝ (σ^2)^{−n/2−1} exp{ −(1/(2σ^2)) ∑_i (xi − θ)^2 }.

Recall, if Y ∼ IG(a, b), then f(y) = [b^a/Γ(a)] y^{−a−1} e^{−b/y}. Thus,

σ^2∣x1 , . . . , xn ∼ IG( n/2, ∑_{i=1}^{n} (xi − θ)^2 / 2 ).

Example 2.12: (Football Data) Gelman et al. (2003) consider the problem of estimating an unknown variance using American football scores. The focus is on the difference d between a game outcome (winning score minus losing score) and a published point spread.

• We observe d1 , . . . , dn , the observed differences between game outcomes and point spreads for n = 2240 football games.

• We assume these differences are a random sample from a Normal distribution with mean 0 and unknown variance σ^2.

• Our goal is to make inference on the unknown parameter σ^2, which represents the variability in the game outcomes and point spreads.

We can refer to Example 2.11, since the setup here is the same (with θ = 0). Hence the posterior becomes

σ^2∣d1 , . . . , dn ∼ IG( n/2, ∑_i d_i^2 / 2 ).

The next logical step would be plotting the posterior distribution in R. As far as
I can tell, there is not a built-in function predefined in R for the Inverse Gamma
density. However, someone saw the need for it and built one in using the pscl
package.

Proceeding below, we try to calculate the posterior using the function densigamma, which corresponds to the Inverse Gamma density. However, running this line in the code gives the following warning:

Warning message:
In densigamma(sigmas, n/2, sum(d^2)/2) : value out of range in ’gammafn’

What’s the problem? Think about what the posterior looks like. Recall that

p(σ^2∣d) = [ (∑_i d_i^2 / 2)^{n/2} / Γ(n/2) ] (σ^2)^{−n/2−1} e^{−(∑_i d_i^2)/(2σ^2)}.

In the calculation R is doing, it’s dividing by Γ(1120), which is a very large factorial. This is too large for even R to compute, so we’re out of luck here. So, what can we do to analyze the data?

setwd("~/Desktop/sta4930/football")
data = read.table("football.txt",header=T)
names(data)
attach(data)
score = favorite-underdog
d = score-spread
n = length(d)
hist(d)
install.packages("pscl",repos="http://cran.opensourceresources.org")
library(pscl)
?densigamma
sigmas = seq(10,20,by=0.1)
post = densigamma(sigmas,n/2,sum(d^2)/2)
v = sum(d^2)

We know we can’t use the Inverse Gamma density (because of the function in
R), but we do know a relationship regarding the Inverse Gamma and Gamma
distributions. So, let’s apply this fact.

You may be thinking, we’re going to run into the same problem because we’ll
still be dividing by Γ(1120). This is true, except the Gamma density function
dgamma was built into R by the original writers. The dgamma function is able
to do some internal tricks that let it calculate the gamma density even though
the individual piece Γ(n/2) by itself is too large for R to handle. So, moving
forward, we will apply the following fact that we already learned:
If X ∼ IG(a, b), then 1/X ∼ Gamma(a, 1/b).

Since

σ^2∣d1 , . . . , dn ∼ IG( n/2, ∑_i d_i^2 / 2 ),

we know that

1/σ^2 ∣ d1 , . . . , dn ∼ Gamma(n/2, 2/v), where v = ∑_i d_i^2.

Now we can plot this posterior distribution in R; however, in terms of making inference about σ^2, this isn’t going to be very useful.

In the code below, we plot the posterior of 1/σ^2∣d. In order to do so, we must create a new sequence of x-values, since the mean of our Gamma will be at n/v ≈ 0.0053.

xnew = seq(0.004,0.007,.000001)
pdf("football_sigmainv.pdf", width = 5, height = 4.5)
post.d = dgamma(xnew,n/2,scale = 2/v)
plot(xnew,post.d, type= "l", xlab = expression(1/sigma^2), ylab= "density")
dev.off()

As we can see from the plot below, viewing the posterior of 1/σ^2∣d isn’t very useful. We would like to get the parameter in terms of σ^2, so that we could plot the posterior distribution of interest as well as calculate the posterior mean and variance.
Figure 2.3: Posterior Distribution p(1/σ^2 ∣ d1 , . . . , dn ). [Plot omitted: density of 1/σ^2 over roughly 0.004 to 0.007.]

To recap, we know

1/σ^2 ∣ d1 , . . . , dn ∼ Gamma(n/2, 2/v), where v = ∑_i d_i^2.

Let u = 1/σ^2. We are going to make a transformation of variables now to write the density in terms of σ^2. Since u = 1/σ^2, this implies σ^2 = 1/u. Then ∣∂u/∂σ^2∣ = 1/σ^4.

Now applying the transformation of variables we find that

f(σ^2∣d1 , . . . , dn ) = [ 1 / (Γ(n/2)(2/v)^{n/2}) ] (1/σ^2)^{n/2−1} e^{−v/(2σ^2)} (1/σ^4).

Thus, the density of σ^2∣d is the Gamma(n/2, 2/v) density evaluated at 1/σ^2, multiplied by the Jacobian 1/σ^4.

Now, we know the density of σ 2 ∣d in a form we can calculate in R.

x.s = seq(150,250,1)
pdf("football_sigma.pdf", height = 5, width = 4.5)
post.s = dgamma(1/x.s,n/2, scale = 2/v)*(1/x.s^2)
plot(x.s,post.s, type="l", xlab = expression(sigma^2), ylab="density")
dev.off()
detach(data)

From the posterior plot in Figure 2.4 we can see that the posterior mean is
around 185. This means that the variability of the actual game result around
the point spread has a standard deviation around 14 points. If you wanted to
actually calculate the posterior mean and variance, you could do this using a
numerical method in R.
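
For instance, a sketch (not in the original notes) that reuses n and v from the code above and integrates numerically over a range containing essentially all of the posterior mass:

f = function(s2) dgamma(1/s2, n/2, scale = 2/v)/s2^2   # density of sigma^2 | d from above
post.mean = integrate(function(s2) s2*f(s2), 100, 400)$value
post.var = integrate(function(s2) (s2 - post.mean)^2*f(s2), 100, 400)$value
c(post.mean, sqrt(post.mean))   # posterior mean near 185; sd of game results near 13.6 points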

What’s interesting about this example is that there is a lot more variability in
football games than the average person would most likely think.

• Assume that (1) the standard deviation actually is 14 points, and (2)
game result is normally distributed (which it’s not, exactly, but this is a
reasonable approximation).
• Things with a normal distribution fall two or more standard deviations
from their mean about 5% of the time, so this means that, roughly speak-
ing, about 5% of football games end up 28 or more points away from their
spread.

Figure 2.4: Posterior Distribution p(σ^2 ∣ d1 , . . . , dn ). [Plot omitted: density of σ^2 over roughly 160 to 240.]



Example 2.13:

Y1 , . . . , Yn ∣µ, σ^2 ∼iid Normal(µ, σ^2),
µ∣σ^2 ∼ Normal(µ0 , σ^2/κ0 ),
σ^2 ∼ IG(ν0/2, σ0^2/2),

where µ0 , κ0 , ν0 , σ0^2 are constant.

Find p(µ, σ^2∣y1 , . . . , yn ). Notice that

p(µ, σ^2∣y1 , . . . , yn ) = p(µ, σ^2, y1 , . . . , yn ) / p(y1 , . . . , yn )
                          ∝ p(y1 , . . . , yn ∣µ, σ^2) p(µ, σ^2)
                          = p(y1 , . . . , yn ∣µ, σ^2) p(µ∣σ^2) p(σ^2).

Then

p(µ, σ^2∣y1 , . . . , yn ) ∝ (σ^2)^{−n/2} exp{ −(1/(2σ^2)) ∑_{i=1}^{n} (yi − µ)^2 } (σ^2)^{−1/2} exp{ −(κ0/(2σ^2)) (µ − µ0)^2 }
                           × (σ^2)^{−ν0/2−1} exp{ −σ0^2/(2σ^2) }.

Consider ∑i (yi − µ)2 = ∑i (yi − ȳ)2 + n(ȳ − µ)2 .

Then

n(ȳ − µ)^2 + κ0(µ − µ0)^2 = nȳ^2 − 2nȳµ + nµ^2 + κ0µ^2 − 2κ0µµ0 + κ0µ0^2
  = (n + κ0)µ^2 − 2(nȳ + κ0µ0)µ + nȳ^2 + κ0µ0^2
  = (n + κ0) [ µ − (nȳ + κ0µ0)/(n + κ0) ]^2 − (nȳ + κ0µ0)^2/(n + κ0) + nȳ^2 + κ0µ0^2.

Now consider

nȳ^2 + κ0µ0^2 − (nȳ + κ0µ0)^2/(n + κ0)
  = [ n^2ȳ^2 + nκ0µ0^2 + nκ0ȳ^2 + κ0^2µ0^2 − n^2ȳ^2 − 2nκ0µ0ȳ − κ0^2µ0^2 ] / (n + κ0)
  = nκ0(µ0^2 − 2µ0ȳ + ȳ^2) / (n + κ0)
  = nκ0(µ0 − ȳ)^2 / (n + κ0).

Putting this all together, we find

p(µ, σ^2∣y1 , . . . , yn ) ∝ exp{ −nκ0(µ0 − ȳ)^2 / (2σ^2(n + κ0)) } exp{ −(1/(2σ^2)) ∑_i (yi − ȳ)^2 } exp{ −σ0^2/(2σ^2) } (σ^2)^{−(n+ν0)/2−1}
  × exp{ −((n + κ0)/(2σ^2)) [ µ − (nȳ + κ0µ0)/(n + κ0) ]^2 } (σ^2)^{−1/2}.

Since the posterior above factors, we find

µ∣σ^2, y ∼ Normal( (nȳ + κ0µ0)/(n + κ0), σ^2/(n + κ0) ),

σ^2∣y ∼ IG( (n + ν0)/2, (1/2) [ ∑_i (yi − ȳ)^2 + nκ0(µ0 − ȳ)^2/(n + κ0) + σ0^2 ] ).
Example 2.14: Suppose we calculate E[θ∣y] where y = x(n). Let

Xi ∣ θ ∼ Uniform(0, θ),
θ ∼ IG(a, 1/b).

Show that

E[θ∣y] = [1/(b(n + a − 1))] · P(χ^2_{2(n+a−1)} < 2/(by)) / P(χ^2_{2(n+a)} < 2/(by)).

Proof. Recall that the posterior depends on the data only through the sufficient statistic y. Consider that P(Y ≤ y) = P(X1 ≤ y)^n = (y/θ)^n, which implies f_Y(y) = (n/θ)(y/θ)^{n−1} = n y^{n−1}/θ^n. Then

E[θ∣x] = ∫ θ f(y∣θ)π(θ) dθ / ∫ f(y∣θ)π(θ) dθ
       = [ ∫_y^∞ θ (n y^{n−1}/θ^n) θ^{−a−1} e^{−1/(θb)} / (Γ(a)b^a) dθ ] / [ ∫_y^∞ (n y^{n−1}/θ^n) θ^{−a−1} e^{−1/(θb)} / (Γ(a)b^a) dθ ]
       = ∫_y^∞ θ^{−n−a} e^{−1/(θb)} dθ / ∫_y^∞ θ^{−n−a−1} e^{−1/(θb)} dθ.

Let θ = 2/(xb), so that dθ = −2/(bx^2) dx, and recall that a Gamma(ν/2, 2) distribution is χ^2_ν. Then

E[θ∣x] = ∫_0^{2/(by)} (2/(xb))^{−n−a} e^{−x/2} (2/(bx^2)) dx / ∫_0^{2/(by)} (2/(xb))^{−n−a−1} e^{−x/2} (2/(bx^2)) dx
       = [ (b^{n+a−1}/2^{n+a−1}) ∫_0^{2/(by)} x^{n+a−2} e^{−x/2} dx ] / [ (b^{n+a}/2^{n+a}) ∫_0^{2/(by)} x^{n+a−1} e^{−x/2} dx ]
       = [ P(χ^2_{2(n+a−1)} < 2/(by)) b^{n+a−1} Γ(n + a − 1) ] / [ P(χ^2_{2(n+a)} < 2/(by)) b^{n+a} Γ(n + a) ]
       = [1/(b(n + a − 1))] · P(χ^2_{2(n+a−1)} < 2/(by)) / P(χ^2_{2(n+a)} < 2/(by)).
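
Because expressions like this are easy to garble, here is a numerical sanity check in R by direct integration of the two integrals above (a sketch with arbitrary hypothetical values):

n = 5; a = 2; b = 1; y = 0.8   # hypothetical values; y = max(x_1, ..., x_n)
num = integrate(function(t) t^(-n - a)*exp(-1/(t*b)), lower = y, upper = Inf)$value
den = integrate(function(t) t^(-n - a - 1)*exp(-1/(t*b)), lower = y, upper = Inf)$value
num/den   # E[theta | x] by direct integration
(1/(b*(n + a - 1)))*pchisq(2/(b*y), 2*(n + a - 1))/pchisq(2/(b*y), 2*(n + a))   # closed form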

2.8 Empirical Bayesian Models

Another generalization of Bayes estimation is called empirical Bayes (EB) estimation, which most consider to fall outside the Bayesian paradigm (in the sense that it is not fully Bayesian). However, it has proved to be a technique for constructing estimators that perform well under both Bayesian and frequentist criteria. One reason for this is that EB estimators tend to be more robust against misspecification of the prior distribution.

We start again with a hierarchical Bayes model; however, this time we assume that γ is unknown and must be estimated. We begin with the Bayes model

Xi ∣θ ∼ f(x∣θ), i = 1, . . . , p
Θ∣γ ∼ π(θ∣γ).

We then calculate the marginal distribution of X with density

m(x∣γ) = ∫ ∏ f (xi ∣θ)π(θ∣γ) dθ.

Based on m(x∣γ), we obtain an estimate γ̂(x) of γ. It is most common to find the estimate using maximum likelihood estimation (MLE), but the method of moments could be used as well (or other methods). We now substitute γ̂(x) for γ in π(θ∣γ) and determine the estimator that minimizes the empirical posterior loss

∫ L(θ, δ) π(θ∣γ̂(x)) dθ.

Remark: An alternative definition is obtained by substituting γ̂(x) for γ in the Bayes estimator. (This is left as a homework exercise, 4.6.1 in TPE.)

Example 2.15: Empirical Bayes Binomial

Suppose there are K different groups of patients, where each group has n patients. Each group is given a different treatment for the same illness, and in the kth group we count Xk , k = 1, . . . , K, the number of successful treatments out of n.

Since the groups receive different treatments, we expect different success rates,
however, since we are treating the same illness, these rates should be related to
each other. These considerations suggest the following model:

Xk ∼ Bin(n, pk ),
pk ∼ Beta(a, b),

where the K groups are tied together by the common prior distribution.

It is easy to show that the Bayes estimator of pk under squared error loss is

E(pk ∣xk , a, b) = (a + xk )/(a + b + n).

Suppose now that we are told that a, b are unknown and we wish to estimate

them using EB. We first calculate


m(x∣a, b) = ∫_{[0,1]^K} ∏_{k=1}^{K} C(n, xk ) p_k^{xk} (1 − p_k)^{n−xk} × [Γ(a + b)/(Γ(a)Γ(b))] p_k^{a−1} (1 − p_k)^{b−1} dp_k
          = ∫_{[0,1]^K} ∏_{k=1}^{K} C(n, xk ) [Γ(a + b)/(Γ(a)Γ(b))] p_k^{xk+a−1} (1 − p_k)^{n−xk+b−1} dp_k
          = ∏_{k=1}^{K} C(n, xk ) Γ(a + b)Γ(a + xk )Γ(n − xk + b) / [Γ(a)Γ(b)Γ(a + b + n)],

which is a product of beta-binomials. Although the MLEs of a and b are not expressible in closed form, they can be calculated numerically to construct the EB estimator

δ̂^{EB}(x) = (â + xk )/(â + b̂ + n).
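
A sketch of the numerical MLE step in R (simulated hypothetical data; the log-scale parameterization keeps a, b > 0, and lbeta/lchoose avoid the overflow issues seen earlier):

set.seed(1)
K = 20; n = 30
x = rbinom(K, n, rbeta(K, 2, 3))   # simulated counts; true (a, b) = (2, 3) is hypothetical
negloglik = function(par) {
  a = exp(par[1]); b = exp(par[2])
  -sum(lchoose(n, x) + lbeta(a + x, b + n - x) - lbeta(a, b))   # beta-binomial log likelihood
}
fit = optim(c(0, 0), negloglik)
ab = exp(fit$par)                            # MLEs of (a, b)
delta.EB = (ab[1] + x)/(ab[1] + ab[2] + n)   # EB estimates of p_1, ..., p_K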

2.9 Posterior Predictive Distributions

We have just gone through many examples illustrating how to calculate simple posterior distributions. This is the main goal of a Bayesian analysis. Another goal might be prediction: given some data y and a new observation ỹ, we may wish to find the conditional distribution of ỹ given y. This distribution is referred to as the posterior predictive distribution. That is, our goal is to find p(ỹ∣y).

We’ll derive the posterior predictive distribution assuming θ is continuous; the discrete case is the same, with the integrals replaced by sums.

Consider

p(ỹ∣y) = p(ỹ, y)/p(y)
       = ∫_θ p(ỹ, y, θ) dθ / p(y)
       = ∫_θ p(ỹ∣y, θ) p(y, θ) dθ / p(y)
       = ∫_θ p(ỹ∣y, θ) p(θ∣y) dθ.

In most contexts, if θ is given, then ỹ∣θ is independent of y, i.e., the value of θ determines the distribution of ỹ without needing to also know y. When this is the case, we say that ỹ and y are conditionally independent given θ. Then the above becomes

p(ỹ∣y) = ∫_θ p(ỹ∣θ) p(θ∣y) dθ.

Theorem 2.5. If θ is discrete and ỹ and y are conditionally independent given θ, then the posterior predictive distribution is

p(ỹ∣y) = ∑_θ p(ỹ∣θ) p(θ∣y).

If θ is continuous and ỹ and y are conditionally independent given θ, then the posterior predictive distribution is

p(ỹ∣y) = ∫_θ p(ỹ∣θ) p(θ∣y) dθ.

Theorem 2.6. Suppose p(x) is a pdf that can be written p(x) = c f(x), where c is a constant and f is a continuous function of x. Since

∫_x p(x) dx = ∫_x c f(x) dx = 1,

it follows that

∫_x f(x) dx = 1/c.

Note: No calculus is needed to compute ∫_x f(x) dx if f(x) looks like a known pdf.

Example 2.16: Human males have one X-chromosome and one Y-chromosome,
whereas females have two X-chromosomes, each chromosome being inherited
from one parent. Hemophilia is a disease that exhibits X-chromosome-linked
recessive inheritance, meaning that a male who inherits the gene that causes
the disease on the X-chromosome is affected, whereas a female carrying the
gene on only one of her X-chromosomes is not affected. The disease is generally
fatal for women who inherit two such genes, and this is very rare, since the
frequency of occurrence of the gene is very low in human populations.

Consider a woman who has an affected brother (xY), which implies that her
mother must be a carrier of the hemophilia gene (xX). We are also told that
her father is not affected (XY), thus the woman herself has a fifty-fifty chance
of having the gene.

Let θ denote the state of the woman. It can take two values: the woman is a
carrier (θ = 1) or not (θ = 0). Based on this, the prior can be written as

P (θ = 1) = P (θ = 0) = 1/2.

Suppose the woman has a son who does not have hemophilia (S1 = 0). Now
suppose the woman has another son. Calculate the probability that this second
son also will not have hemophilia (S2 = 0), given that the first son does not have
hemophilia. Assume son one and son two are conditionally independent given
θ.

Solution:

p(S2 = 0∣S1 = 0) = ∑_θ p(S2 = 0∣θ) p(θ∣S1 = 0).

First compute

p(θ∣S1 = 0) = p(S1 = 0∣θ)p(θ) / [ p(S1 = 0∣θ = 0)p(θ = 0) + p(S1 = 0∣θ = 1)p(θ = 1) ],

which gives

p(θ = 0∣S1 = 0) = (1)(1/2) / [ (1)(1/2) + (1/2)(1/2) ] = 2/3 and p(θ = 1∣S1 = 0) = 1/3.

Then

p(S2 = 0∣S1 = 0) = p(S2 = 0∣θ = 0)p(θ = 0∣S1 = 0) + p(S2 = 0∣θ = 1)p(θ = 1∣S1 = 0)
                 = (1)(2/3) + (1/2)(1/3) = 5/6.
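
The same calculation in R (a sketch, not part of the original notes):

prior = c(0.5, 0.5)                   # P(theta = 0), P(theta = 1)
like1 = c(1, 0.5)                     # P(S1 = 0 | theta = 0), P(S1 = 0 | theta = 1)
post = prior*like1/sum(prior*like1)   # posterior (2/3, 1/3)
sum(c(1, 0.5)*post)                   # P(S2 = 0 | S1 = 0) = 5/6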

Negative Binomial Distribution


Before doing the next example, we will introduce the Negative Binomial distri-
bution. The binomial distribution counts the numbers of successes in a fixed
number of iid Bernoulli trials. Recall, a Bernoulli trial has a fixed success prob-
ability p.

Suppose instead that we count the number of Bernoulli trials required to get
a fixed number of successes. This formulation leads to the Negative Binomial
distribution.

In a sequence of independent Bernoulli(p) trials, let X denote the trial at which


the rth success occurs, where r is a fixed integer.

Then

f(x) = C(x − 1, r − 1) p^r (1 − p)^{x−r}, x = r, r + 1, . . . ,

and we say X ∼ Negative Binom(r, p).

There is another useful formulation of the Negative Binomial distribution. In many cases, it is defined as Y = number of failures before the rth success. This formulation is statistically equivalent to the one given above in terms of X = trial at which the rth success occurs, since Y = X − r. Then

f(y) = C(r + y − 1, y) p^r (1 − p)^y, y = 0, 1, 2, . . . ,

and we say Y ∼ Negative Binom(r, p).

When we refer to the Negative Binomial distribution in this class, we will refer
to the second one defined unless we indicate otherwise.
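In R, dnbinom uses exactly this second formulation: size is r and prob is the
success probability. A quick sanity check with arbitrary values r = 3, p = 0.4:

r = 3; p = 0.4; y = 0:5
manual = choose(r + y - 1, y)*p^r*(1 - p)^y   # pmf from the formula above
builtin = dnbinom(y, size = r, prob = p)      # R's built-in Negative Binomial pmf
all.equal(manual, builtin)                    # TRUE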

Example 2.17: (Poisson-Gamma)

X∣λ ∼ Poisson(λ)
λ ∼ Gamma(a, b)

Assume that X̃∣λ ∼ Poisson(λ), conditionally independent of X given λ. Assume
we have a new observation x̃. Find the posterior predictive distribution, p(x̃∣x).
Assume that a is an integer.

Solution:

First, we must find p(λ∣x).


Recall

p(λ∣x) ∝ p(x∣λ)p(λ)
       ∝ e^(−λ) λ^x λ^(a−1) e^(−λ/b)
       = λ^(x+a−1) e^(−λ(1+1/b)).

Thus, λ∣x ∼ Gamma(x + a, 1/(1 + 1/b)), i.e., λ∣x ∼ Gamma(x + a, b/(b+1)).

It then follows that

p(x̃∣x) = ∫ p(x̃∣λ)p(λ∣x) dλ

       = ∫ [e^(−λ) λ^x̃ / x̃!] [λ^(x+a−1) e^(−λ(b+1)/b) / (Γ(x+a) (b/(b+1))^(x+a))] dλ

       = [1 / (x̃! Γ(x+a) (b/(b+1))^(x+a))] ∫ λ^(x̃+x+a−1) e^(−λ(2b+1)/b) dλ

       = Γ(x̃+x+a) (b/(2b+1))^(x̃+x+a) / (x̃! Γ(x+a) (b/(b+1))^(x+a))

       = [Γ(x̃+x+a) / (x̃! Γ(x+a))] ⋅ b^(x̃+x+a) (b+1)^(x+a) / (b^(x+a) (2b+1)^(x̃+x+a))

       = [(x̃+x+a−1)! / ((x+a−1)! x̃!)] ⋅ b^x̃ (b+1)^(x+a) / (2b+1)^(x̃+x+a)

       = (x̃+x+a−1 choose x̃) (b/(2b+1))^x̃ ((b+1)/(2b+1))^(x+a).
Let p = b/(2b+1), which implies 1 − p = (b+1)/(2b+1). Then

p(x̃∣x) = (x̃+x+a−1 choose x̃) p^x̃ (1−p)^(x+a).

Matching this with the second (failures before the rth success) formulation
defined above, with r = x + a and success probability 1 − p, we conclude

x̃∣x ∼ Negative Binom(x + a, (b+1)/(2b+1)),

where the second parameter is the success probability.
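This identification is easy to check by simulation. A sketch, with arbitrary
values x = 4, a = 2, b = 3:

set.seed(1)
x = 4; a = 2; b = 3
## simulate from the posterior predictive: draw lambda | x, then xtilde | lambda
lam = rgamma(100000, shape = x + a, scale = b/(b + 1))
xtilde = rpois(100000, lam)
## compare simulated frequencies with the Negative Binomial pmf
rbind(simulated = as.numeric(table(factor(xtilde, levels = 0:5)))/100000,
      exact = dnbinom(0:5, size = x + a, prob = (b + 1)/(2*b + 1)))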

Example 2.18: Suppose that X is the number of pregnant women arriving at


a particular hospital to deliver their babies during a given month. The discrete
count nature of the data plus its natural interpretation as an arrival rate suggest
modeling it with a Poisson likelihood.

To use a Bayesian analysis, we require a prior distribution for λ having support
on the positive real line. A convenient choice is given by the Gamma distribu-
tion, since it is conjugate for the Poisson likelihood.

The model is given by

X∣λ ∼ Poisson(λ)
λ ∼ Gamma(a, b).

We are also told 42 moms are observed arriving at the particular hospital during
December 2007. Using prior study information given, we are told a = 5 and b = 6.
(We found a, b by working backwards from a prior mean of 30 and prior variance
of 180).

We would like to find several things in this example:

1. Plot the likelihood, prior, and posterior distributions as functions of λ in


R.
2. Plot the posterior predictive distribution where the number of pregnant
women arriving falls between [0,100], integer valued.
3. Find the posterior predictive probability that the number of pregnant
women who arrive is between 40 and 45 (inclusive).

Solution: The first things we need to know to do this problem are p(λ∣x) and
p(x̃∣x). We found these in Example 2.17. So,

λ∣x ∼ Gamma(x + a, b/(b+1)),

and, in the failures formulation with success probability (b+1)/(2b+1),

x̃∣x ∼ Negative Binom(x + a, (b+1)/(2b+1)).

Next, we can move right into R for our analysis.

setwd("~/Desktop/sta4930/ch3")
lam = seq(0,100, length=500)
x = 42   # observed count for December 2007
a = 5    # prior shape
b = 6    # prior scale
like = dgamma(lam, x+1, scale=1)        # likelihood in lambda, normalized as a Gamma(x+1,1) density
prior = dgamma(lam, a, scale=b)         # Gamma(5,6) prior
post = dgamma(lam, x+a, scale=b/(b+1))  # posterior from Example 2.17
pdf("preg.pdf", width = 5, height = 4.5)
plot(lam, post, xlab = expression(lambda), ylab = "Density", lty=2, lwd=3, type="l")
lines(lam, like, lty=1, lwd=3)
lines(lam, prior, lty=3, lwd=3)
legend(70, .06, c("Prior", "Likelihood", "Posterior"), lty = c(2,1,3),
       lwd = c(3,3,3))
dev.off()

## posterior predictive distribution
xnew = seq(0, 100)   # integer support 0, 1, ..., 100
post_pred_values = dnbinom(xnew, x+a, (b+1)/(2*b+1))
plot(xnew, post_pred_values, type="h", xlab = "x",
     ylab = "Posterior Predictive Distribution")

## posterior predictive probability that the number of pregnant
## women arriving is between 40 and 45 (inclusive)
(ans = sum(post_pred_values[41:46]))   ## recall we included 0

In the first part of the code, we plot the prior, likelihood, and posterior.
This should be self-explanatory since we have already done an example.

When we find our posterior predictive distribution, we must create a sequence


of integers from 0 to 100 (inclusive) using the seq command. Then we find the
posterior predictive values using the function dnbinom. Then we simply plot
the sequence of xnew on the x-axis and the corresponding posterior predictive
values on the y-axis. We set type="h" so that our plot will appear somewhat
like a histogram of the predictive probabilities.

Finally, in order to calculate the posterior predictive probability that the num-
ber of pregnant women who arrive is between 40 and 45, we simply add up the
posterior predictive probabilities corresponding to these values. We find a
posterior predictive probability of approximately 0.26 that the number of
pregnant women who arrive is between 40 and 45.

Chapter 3

Being Objective

No, it does not make sense for me to be an ‘Objective Bayesian’!


—Stephen E. Fienberg

Thus far in this course, we have mostly considered informative or subjective pri-
ors. Ideally, we want to choose a prior reflecting our beliefs about the unknown
parameter of interest. This is a subjective choice. All Bayesians agree that
wherever prior information is available, one should try to incorporate a prior
reflecting this information as much as possible. We have mentioned how incor-
poration of a prior expert opinion would strengthen purely data-based analysis
in real-life decision problems. Using prior information can also be useful in
problems of statistical inference when your sample size is small or you have a
high- or infinite-dimensional parameter space.

However, in dealing with real-life problems you may run into problems such as

• not having past historical data,

• not having an expert opinion to base your prior knowledge on (perhaps
  your research is cutting-edge and new), or

• as your model becomes more complicated, it becomes hard to know what
  priors to put on each unknown parameter.

The problems we have dealt with all semester have been very simple in nature.
We have only had one parameter to estimate (except for one example). Think
about a more complex problem such as the following (we looked at this problem


in Chapter 1):

X∣θ ∼ N(θ, σ²)
θ∣σ² ∼ N(µ, τ²)
σ² ∼ IG(a, b),

where now θ and σ² are both unknown and we must find the posterior distribu-
tions θ∣X, σ² and σ²∣X. For this slightly more complex problem, it is much
harder to think about what values µ, τ², a, b should take for a particular
problem. What should we do in these types of situations?

Often no reliable prior information concerning θ exists, or inference based com-


pletely on the data is desired. It might appear that inference in such settings
would be impossible, but reaching this conclusion is too hasty.

Suppose we could find a distribution p(θ) that contained no or little information


about θ in the sense that it didn’t favor one value of θ over another (provided
this is possible). Then it would be natural to refer to such a distribution as a
noninformative prior. We could also argue that all or most of the information
contained in the posterior distribution, p(θ∣x), came from the data. Thus, all
resulting inferences would be objective and not subjective.
Definition 3.1: Informative/subjective priors represent our prior beliefs about
parameter values before collecting any data. For example, in reality, if statisti-
cians are unsure about specifying the prior, they will turn to the experts in the
field or experimenters to look at past data to help fix the prior.
Example 3.1: (Pregnant Mothers) Suppose that X is the number of pregnant
mothers arriving at a hospital to deliver their babies during a given month. The
discrete count nature of the data as well as its natural interpretation leads to
adopting a Poisson likelihood,

e−θ θx
p(x∣θ) = , x ∈ {0, 1, 2, . . .}, θ > 0.
x!
A convenient choice for the prior distribution here is a Gamma(a, b) since it is
conjugate for the Poisson likelihood. To illustrate the example further, suppose
that 42 moms deliver babies during the month of December. Suppose from past
data at this hospital, we assume a prior of Gamma(5, 6). From this, we can
easily calculate the posterior distribution, posterior mean and variance, and do
various calculations of interest in R.
Definition 3.2: Noninformative/objective priors contain little or no informa-
tion about θ in the sense that they do not favor one value of θ over another.
Therefore, when we calculate the posterior distribution, most if not all of the
inference will arise from the likelihood. Inferences in this case are objective and
not subjective. Let’s look at the following example to see why we might consider
such priors.

Example 3.2: (Pregnant Mothers Continued) Recall Example 3.1. As we
noted earlier, it would be natural to take the prior on θ as Gamma(a, b) since it
is the conjugate prior for the Poisson likelihood. However, suppose that for this
data set we do not have any information on the number of pregnant mothers
arriving at the hospital, so there is no basis for using a Gamma prior or any
other informative prior. In this situation, we could take some noninformative
prior.

Comment: Since many objective priors are improper, we must check that the
posterior is proper.
Theorem 3.1. Propriety of the Posterior

• If the prior is proper, then the posterior will always be proper.


• If the prior is improper, you must check that the posterior is proper.

◯ Meaning Of Flat

What does a “flat prior” really mean? People really abuse the word flat and
interchange it for noninformative. Let’s talk about what people really mean
when they use the term “flat,” since it can have different meanings.
Example 3.3: Often statisticians will refer to a prior as being flat, when a plot
of its density actually looks flat, i.e., uniform. An example of this would be
taking such a prior to be
θ ∼ Unif(0, 1).
We can plot the density of this prior to see that the density is flat.

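For instance, in R:

## density of the Uniform(0,1) prior: constant at 1 on (0,1)
curve(dunif(x, 0, 1), from = 0, to = 1, xlab = expression(theta), ylab = "Density")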

◯ Objective Priors in More Detail

Uniform Prior of Bayes and Laplace

Example 3.4: (Thomas Bayes) In 1763, Thomas Bayes considered the ques-
tion of what prior to use when estimating a binomial success probability p. He
described the problem quite differently back then by considering throwing balls
onto a billiard table. He separated the billiard table into many different inter-
vals and considered different events. By doing so (and not going into the details
of this), he argued that a Uniform(0,1) prior was appropriate for p.
Example 3.5: (Laplace) In 1814, Pierre-Simon Laplace wanted to know the
probability that the sun will rise tomorrow. He answered this question using
the following Bayesian analysis:

• Let X represent the number of days the sun rises. Let p be the probability
the sun will rise tomorrow.

• Let X∣p ∼ Bin(n, p).

• Suppose p ∼ Uniform(0, 1).

• Based on reading the Bible, Laplace computed the total number of days n
in recorded history, and the number of days x on which the sun rose.
Clearly, x = n.

Then

π(p∣x) ∝ (n choose x) p^x (1 − p)^(n−x) ⋅ 1 ∝ p^(x+1−1) (1 − p)^(n−x+1−1).

This implies

p∣x ∼ Beta(x + 1, n − x + 1).

Then

p̂ = E[p∣x] = (x + 1)/(x + 1 + n − x + 1) = (x + 1)/(n + 2) = (n + 1)/(n + 2),

using x = n.
Thus, Laplace’s estimate for the probability that the sun rises tomorrow is
(n + 1)/(n + 2), where n is the total number of days recorded in history. For
instance, if so far we have encountered 100 days in the history of our universe,
this would say that the probability the sun will rise tomorrow is 101/102 ≈
0.9902. However, we know that this calculation is ridiculous. Here, we have
extremely strong subjective information (the laws of physics) that says it is
extremely likely that the sun will rise tomorrow. Thus, objective Bayesian
methods shouldn’t be recklessly applied to every problem we study—especially
when subjective information this strong is available.

Criticism of the Uniform Prior

The Uniform prior of Bayes and Laplace and has been criticized for many dif-
ferent reasons. We will discuss one important reason for criticism and not go
into the other reasons since they go beyond the scope of this course.

In statistics, it is often a good property when a rule for choosing a prior is


invariant under what are called one-to-one transformations. Invariant basi-
cally means unchanging in some sense. The invariance principle means that a
rule for choosing a prior should provide equivalent beliefs even if we consider a
transformed version of our parameter, like p2 or log p instead of p.

Jeffreys’ Prior

One prior that is invariant under one-to-one transformations is Jeffreys’ prior.

What does the invariance principle mean? Suppose our prior parameter is θ,
however we would like to transform to φ.

Define φ = f (θ), where f is a one-to-one function.

Jeffreys’ prior says that if θ has the distribution specified by Jeffreys’ prior for
θ, then f (θ) will have the distribution specified by Jeffreys’ prior for φ. We will
clarify by going over two examples to illustrate this idea.

Note, for example, that if θ has a Uniform prior, then one can show that φ = f(θ)
will not have a Uniform prior (unless f is the identity function).

Aside from the invariance property of Jeffreys’ prior, in the univariate case,
Jeffreys’ prior satisfies many optimality criteria that statisticians are interested
in.
Definition 3.3: Define

I(θ) = −E[∂² log p(y∣θ) / ∂θ²],

where I(θ) is called the Fisher information. Then Jeffreys' prior is defined to be

pJ(θ) = √I(θ).
Example 3.6: (Uniform Prior is Not Invariant to Transformation)
Let θ ∼ Uniform(0, 1). Suppose now we would like to transform from θ to θ².

Let φ = θ². Then θ = √φ. It follows that

∂θ/∂φ = 1/(2√φ).

Thus, p(φ) = 1/(2√φ), 0 < φ < 1, which shows that φ is not Uniform on (0, 1).
Hence, the Uniform prior is not invariant under this transformation. Criticism
such as this led to consideration of Jeffreys' prior.
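A quick simulation makes the non-invariance concrete (the sample size below
is arbitrary):

set.seed(1)
theta = runif(100000)     # theta ~ Uniform(0,1)
phi = theta^2             # transformed parameter
hist(phi, breaks = 50, freq = FALSE, xlab = expression(phi))
curve(1/(2*sqrt(x)), add = TRUE, lwd = 2)   # the density 1/(2*sqrt(phi)) matches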
Example 3.7: (Jeffreys' Prior Invariance Example)
Suppose
X∣θ ∼ Exp(θ).
One can show using calculus that I(θ) = 1/θ². Then pJ(θ) = 1/θ. Suppose that
φ = θ². It follows that

∂θ/∂φ = 1/(2√φ).

Then transforming pJ(θ) gives

pJ(φ) = pJ(√φ) ∣∂θ/∂φ∣ = (1/√φ) ⋅ (1/(2√φ)) ∝ 1/φ.

On the other hand, computing Jeffreys' prior directly in the φ parametrization
gives I(φ) = I(θ)(∂θ/∂φ)² = (1/φ)(1/(4φ)), so pJ(φ) = √I(φ) ∝ 1/φ as well.
The two agree, so Jeffreys' prior is invariant under the transformation φ = θ².
Example 3.8: (Jeffreys' prior) Suppose

X∣θ ∼ Binomial(n, θ).

Let's calculate the posterior using Jeffreys' prior. To do so we need to calculate
I(θ). Ignoring terms that don't depend on θ, we find

log p(x∣θ) = x log(θ) + (n − x) log(1 − θ) Ô⇒

∂ log p(x∣θ)/∂θ = x/θ − (n − x)/(1 − θ),
∂² log p(x∣θ)/∂θ² = −x/θ² − (n − x)/(1 − θ)².

Since E(X) = nθ, then

I(θ) = −E[−X/θ² − (n − X)/(1 − θ)²] = nθ/θ² + (n − nθ)/(1 − θ)² = n/θ + n/(1 − θ) = n/(θ(1 − θ)).

This implies that

pJ(θ) = √(n/(θ(1 − θ))) ∝ θ^(1/2−1) (1 − θ)^(1/2−1) ∝ Beta(1/2, 1/2).

Figure 3.1 compares the prior density πJ (θ) with that for a flat prior, which is
equivalent to a Beta(1,1) distribution.

Note that in this case the prior is inversely proportional to the standard devia-
tion. Why does this make sense?

We see that the data has the least effect on the posterior when the true θ = 0.5,
and has the greatest effect near the extremes, θ = 0 or 1. Jeffreys' prior
compensates for this by placing more mass near the extremes of the range,
where the data has the strongest effect. We could get the same effect by (for
example) letting the prior be π(θ) ∝ 1/Var θ instead of π(θ) ∝ 1/[Var θ]^(1/2).
However, the former prior is not invariant under reparameterization, as we
would prefer.

Figure 3.1: Jeffreys’ prior and flat prior densities
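Figure 3.1 can be reproduced with a few lines of R (a sketch):

theta = seq(0.001, 0.999, length = 500)
plot(theta, dbeta(theta, 1/2, 1/2), type = "l", lty = 1, lwd = 3,
     xlab = expression(theta), ylab = "Density", ylim = c(0, 2))
lines(theta, dbeta(theta, 1, 1), lty = 2, lwd = 3)
legend("top", c("Beta(1/2,1/2)", "Beta(1,1)"), lty = c(1, 2), lwd = 3)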

We then find that

p(θ∣x) ∝ θ^x (1 − θ)^(n−x) θ^(1/2−1) (1 − θ)^(1/2−1)
       = θ^(x−1/2) (1 − θ)^(n−x−1/2)
       = θ^(x+1/2−1) (1 − θ)^(n−x+1/2−1).

Thus, θ∣x ∼ Beta(x + 1/2, n − x + 1/2), which is a proper posterior since the prior
is proper.

Note: Remember that it is important to check that the posterior is proper.

Jeffreys’ and Conjugacy


Jeffreys' priors are widely used in Bayesian analysis. In general, they are not
conjugate priors; the fact that we ended up with a conjugate Beta prior for
the binomial example above is just a lucky coincidence. For example, with a
Gaussian model X ∼ N(µ, σ²), it can be shown that πJ(µ) ∝ 1 and πJ(σ) = 1/σ,
which do not look anything like a Gaussian or an inverse gamma, respectively.
However, it can be shown that Jeffreys' priors are limits of conjugate prior den-
sities. For example, a Gaussian density N(µo, σo²) approaches a flat prior as
σo² → ∞, while the inverse gamma density σ^(−(a+1)) e^(−b/σ) → σ^(−1) as a, b → 0.

Limitations of Jeffreys'
Jeffreys' priors work well for single-parameter models, but not for models with
multidimensional parameters. By analogy with the one-dimensional case, one
might construct a naive Jeffreys' prior as the joint density

πJ(θ) = ∣I(θ)∣^(1/2),

where ∣ ⋅ ∣ denotes the determinant and the (i, j)th element of the Fisher infor-
mation matrix is given by

I(θ)ij = −E[∂² log p(X∣θ) / ∂θi ∂θj].

Let’s see what happens when we apply a Jeffreys’ prior for θ to a multivari-
ate Gaussian location model. Suppose X ∼ Np (θ, I), and we are interested
in performing inference on ∣∣θ∣∣2 . In this case the Jeffreys’ prior for θ is flat.
It turns out that the posterior has the form of a non-central χ2 distribution
with p degrees of freedom. The posterior mean given one observation of X is
E(∣∣θ∣∣2 ∣X) = ∣∣X∣∣2 + p. This is not a good estimate because it adds p to the
square of the norm of X, whereas we might normally want to shrink our esti-
mate towards zero. By contrast, the minimum variance frequentist estimate of
∣∣θ∣∣2 is ∣∣X∣∣2 − p.

Intuitively, a multidimensional flat prior carries a lot of information about the


expected value of a parameter. Since most of the mass of a flat prior distribution
is in a shell at infinite distance, it says that we expect the value of θ to lie at
some extreme distance from the origin, which causes our estimate of the norm
to be pushed further away from zero.

Haldane’s Prior

In 1963, Haldane introduced the following improper prior for a binomial pro-
portion:
p(θ) ∝ θ−1 (1 − θ)−1 .

It can be shown to be improper using simple calculus, which we will not go into.
However, the posterior is proper under certain conditions. Let

Y ∣θ ∼ Bin(n, θ).

Calculate p(θ∣y) and show that it is improper when y = 0 or y = n.

Remark: Recall that for a Binomial distribution, Y can take values y =


0, 1, 2, . . . , n.

We will first calculate p(θ∣y).

p(θ∣y) ∝ [(n choose y) θ^y (1 − θ)^(n−y)] / [θ(1 − θ)]
       ∝ θ^(y−1) (1 − θ)^(n−y−1).

The density of a Beta(a, b) is the following:

f(θ) = [Γ(a + b)/(Γ(a)Γ(b))] θ^(a−1) (1 − θ)^(b−1), 0 < θ < 1.

This implies that θ∣Y ∼ Beta(y, n − y).

Finally, we need to check that our posterior is proper. Recall that the parameters
of the Beta need to be positive. Thus, y > 0 and n − y > 0. This means that y ≠ 0
and y ≠ n in order for the posterior to be proper.

Remark: Recall that the Beta density integrates to 1 whenever its param-
eters are positive. When a parameter is not positive, the integral diverges,
so the kernel cannot be normalized. Thus, for the problem above, the
posterior is improper when y = 0 or y = n.

There are many other objective priors that are used in Bayesian inference, how-
ever, this is the level of exposure that we will cover in this course. If you’re
interested in learning more about objective priors (g-prior, probability match-
ing priors), see me and I can give you some references.

3.1 Reference Priors

Reference priors were proposed by Jose Bernardo in a 1979 paper, and further
developed by Jim Berger and others from the 1980s through the present. They
are credited with bringing about an objective Bayesian renaissance; an annual
conference is now devoted to the objective Bayesian approach.

The idea behind reference priors is to formalize what exactly we mean by an


uninformative prior: it is a function that maximizes some measure of distance or
divergence between the posterior and prior, as data observations are made. Any
of several possible divergence measures can be chosen, for example the Kullback-
Leibler divergence or the Hellinger distance. By maximizing the divergence, we
allow the data to have the maximum effect on the posterior estimates.

For one-dimensional parameters, it will turn out that reference priors and Jef-
freys’ priors are equivalent. For multidimensional parameters, they differ. One

might ask, how can we choose a prior to maximize the divergence between the
posterior and prior, without having seen the data first? Reference priors handle
this by taking the expectation of the divergence, given a model distribution for
the data. This sounds superficially like a frequentist approach—basing inference
on imagined data. But once the prior is chosen based on some model, inference
proceeds in a standard Bayesian fashion. (This contrasts with the frequentist
approach, which continues to deal with imagined data even after seeing the real
data!)

◯ Laplace Approximation

Before deriving reference priors in some detail, we go through the Laplace ap-
proximation which is very useful in Bayesian analysis since we often need to
evaluate integrals of the form

∫ g(θ)f (x∣θ)π(θ) dθ.

For example, when g(θ) = 1, the integral reduces to the marginal likelihood of x.
The posterior mean requires evaluation of two integrals ∫ θf (x∣θ)π(θ) dθ and
∫ f (x∣θ)π(θ) dθ. Laplace’s method is a technique for approximating integrals
when the integrand has a sharp maximum.

Remark: There is a nice refinement of the Laplace approximation due to Tierney,


Kass, and Kadane (JASA, 1989). Due to time constraints, we won’t go into this,
but if you’re looking to apply this in research, this is something you should look
up in the literature and use when needed.
Theorem 3.2. Laplace Approximation
Let I = ∫ q(θ) exp{nh(θ)} dθ. Assume that θ̂ maximizes h and that h has a
sharp maximum at θ̂. Let c = −h″(θ̂) > 0. Then

I = q(θ̂) exp{nh(θ̂)} √(2π/(nc)) (1 + O(n^(−1))) ≈ q(θ̂) exp{nh(θ̂)} √(2π/(nc)).

Proof. Apply a Taylor expansion about θ̂, using h′(θ̂) = 0:

I ≈ ∫ from θ̂−δ to θ̂+δ of [q(θ̂) + (θ − θ̂)q′(θ̂) + (1/2)(θ − θ̂)² q″(θ̂)]
      × exp{nh(θ̂) + (n/2)(θ − θ̂)² h″(θ̂)} dθ + ⋯

  ≈ q(θ̂) e^(nh(θ̂)) ∫ [1 + (θ − θ̂) q′(θ̂)/q(θ̂) + (1/2)(θ − θ̂)² q″(θ̂)/q(θ̂)]
      × exp{−(nc/2)(θ − θ̂)²} dθ + ⋯ .

Now let t = √(nc)(θ − θ̂), which implies dθ = dt/√(nc). Hence,

I ≈ [q(θ̂) e^(nh(θ̂)) / √(nc)] ∫ from −δ√(nc) to δ√(nc) of
      [1 + t q′(θ̂)/(√(nc) q(θ̂)) + t² q″(θ̂)/(2nc q(θ̂))] e^(−t²/2) dt

  ≈ [q(θ̂) e^(nh(θ̂)) / √(nc)] √(2π) [1 + 0 + q″(θ̂)/(2nc q(θ̂))]

  ≈ [q(θ̂) e^(nh(θ̂)) / √(nc)] √(2π) [1 + O(1/n)] ≈ q(θ̂) e^(nh(θ̂)) √(2π/(nc)).
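The theorem is easy to check numerically. As an illustration (our own choice,
not from the text), take q(θ) = 1 and h(θ) = log θ − θ, so θ̂ = 1, c = −h″(1) = 1,
and I = ∫ θ^n e^(−nθ) dθ = Γ(n+1)/n^(n+1) exactly:

n = 20
h = function(theta) log(theta) - theta
## exact value of the integral of exp(n*h(theta)) = theta^n*exp(-n*theta)
exact = integrate(function(t) exp(n*h(t)), 0, Inf)$value
## Laplace approximation: q(1)*exp(n*h(1))*sqrt(2*pi/(n*c)) with c = 1
laplace = exp(n*h(1))*sqrt(2*pi/n)
c(exact = exact, laplace = laplace, rel.error = (laplace - exact)/exact)

Here the relative error is roughly −1/(12n), consistent with the 1 + O(n^(−1))
factor in the theorem.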

◯ Some Probability Theory

First, we give a few definitions from probability theory (you may have seen these
before) and we will be informal about these.

• If Xn is O(n^(−1)), then Xn "goes to 0 at least as fast as 1/n."

• If Xn is o(n^(−1)), then Xn "goes to 0 faster than 1/n."

Definition 3.4: Formally, writing

Xn = o(rn) as n → ∞

means that Xn/rn → 0. Similarly,

Xn = O(rn) as n → ∞

means that Xn/rn is bounded.

This shouldn’t be confused with the definition below:

Definition 3.5: Formally, let Xn, n ≥ 1 be random vectors and Rn, n ≥ 1 be
positive random variables. Then

Xn = op(Rn) if Xn/Rn → 0 in probability,

and

Xn = Op(Rn) if Xn/Rn is bounded in probability.

Recall that Xn is bounded in probability if {Pn, n ≥ 1} is uniformly tight, where
Pn(A) = Pr(Xn ∈ A) for A ⊂ R^k; i.e., given any ε > 0 there exists an M such
that Pr(∣∣Xn∣∣ ≤ M) ≥ 1 − ε for all n ≥ 1. For full details and examples, see
Billingsley or van der Vaart.

◯ Shrinkage Argument of J.K. Ghosh

This argument given by J.K. Ghosh will be used to derive reference priors. It
can be used in many other theoretical proofs in Bayesian theory. If interested in
seeing these, please refer to his book for details as listed on the syllabus. Please
note that below I am hand waving over some of the details regarding analysis
that are important but not completely necessary to grasp the basic concept
here.

We consider a possibly vector-valued r.v. X with pdf f(⋅∣θ). Our goal is to
find an expression for Eθ[q(X, θ)] for some function q(X, θ), where the integral
∫ q(x, θ) f(x∣θ) dx is too difficult to calculate directly. There are three steps to
go through to find the desired quantity. The steps are outlined without proof.

Step 1: Consider a proper prior π̄(⋅) for θ such that the support of π̄(⋅) is a
compact rectangle in the parameter space and π̄(⋅) vanishes on the boundary of
the support, while remaining positive on the interior. Consider the posterior of
θ under π̄(⋅) and hence obtain E π̄ [q(X, θ)∣x].

Step 2: Find Eθ E π̄ [q(x, θ)∣x] = λ(θ) for θ in the interior of the support of π̄(⋅).

Step 3: Integrate λ(⋅) with respect to π̄(⋅) and then allow π̄(⋅) to converge to
the degenerate prior at the true value of θ (say θ0 ) supposing that the true θ is
an interior point of the support of π̄(⋅). This yields Eθ [q(X, θ)].

◯ Reference Priors

Bernardo (1979) suggested choosing the prior to maximize the expected Kullback-
Leibler divergence between the posterior and prior,

E[log(π(θ∣x)/π(θ))],

where expectation is taken over the joint distribution of X and θ. It is shown in


Berger and Bernardo (1989) that if one does this maximization for fixed n, it
may lead to a discrete prior with finitely many jumps—a far cry from a diffuse
prior. Instead, the maximization must be done asymptotically, i.e., as n → ∞.
This is achieved as follows:
First write

E[log(π(θ∣x)/π(θ))] = ∫∫ [log(π(θ∣x)/π(θ))] π(θ∣x) m(x) dx dθ
                    = ∫∫ [log(π(θ∣x)/π(θ))] f(x∣θ) π(θ) dx dθ
                    = ∫ π(θ) E[log(π(θ∣x)/π(θ)) ∣ θ] dθ.

Consider E[log(π(θ∣x)/π(θ)) ∣ θ] = E[log π(θ∣x) ∣ θ] − log π(θ).
Then by iterated expectation,

E[log(π(θ∣x)/π(θ))] = ∫ π(θ) {E[log π(θ∣x) ∣ θ] − log π(θ)} dθ
                    = ∫ E[log π(θ∣x) ∣ θ] π(θ) dθ − ∫ log π(θ) π(θ) dθ.   (3.1)

Since we cannot calculate E[log π(θ∣x) ∣ θ] directly, we set q(X, θ) = log π(θ∣X)
and use Step 1 of the shrinkage argument of J.K. Ghosh to find E^π̄[q(X, θ)∣x].

Step 1: Find E^π̄[log π(θ∣x) ∣ x] = ∫ log π(θ∣x) π̄(θ∣x) dθ.

Let Ln(θ) be defined such that f(x∣θ) = exp{Ln(θ)}. Then

π(θ∣x) = exp{Ln(θ)} π(θ) / ∫ exp{Ln(θ)} π(θ) dθ
       = exp{Ln(θ) − Ln(θ̂n)} π(θ) / ∫ exp{Ln(θ) − Ln(θ̂n)} π(θ) dθ,

where θ̂n denotes the maximum likelihood estimator. Let t = √n (θ − θ̂n), so
that θ = θ̂n + n^(−1/2) t and dθ = n^(−1/2) dt. We now substitute in for θ and
then perform a Taylor expansion. Recall that in general f(x + h) = f(x) +
h f′(x) + (1/2) h² f″(x) + ⋯ . Then

π(θ∣x) = exp{Ln(θ̂n + n^(−1/2) t) − Ln(θ̂n)} π(θ̂n + n^(−1/2) t)
         / ∫ exp{Ln(θ̂n + n^(−1/2) t) − Ln(θ̂n)} π(θ̂n + n^(−1/2) t) n^(−1/2) dt
       = exp{n^(−1/2) t L′n(θ̂n) + (1/2) n^(−1) t² L″n(θ̂n) + ⋯} π(θ̂n + n^(−1/2) t)
         / ∫ exp{n^(−1/2) t L′n(θ̂n) + (1/2) n^(−1) t² L″n(θ̂n) + ⋯} π(θ̂n + n^(−1/2) t) n^(−1/2) dt.

Since θ̂n is the maximum likelihood estimate, L′n(θ̂n) = 0. Now define the
per-observation observed information

Î_n := Î_n(θ̂n) = −(1/n) ∂² log f(x∣θ)/∂θ² evaluated at θ = θ̂n, i.e., Î_n = −(1/n) L″n(θ̂n),

so that (1/2) n^(−1) t² L″n(θ̂n) = −(1/2) t² Î_n. Also, under mild regularity
conditions,

Î_n(θ̂n)^(1/2) √n (θ̂n − θ) →d N(0, Ip).

Then we have that

π(θ∣x) = exp{−(1/2) t² Î_n + ⋯} π(θ̂n + n^(−1/2) t)
         / ∫ exp{−(1/2) t² Î_n + ⋯} π(θ̂n + n^(−1/2) t) n^(−1/2) dt
       = exp{−(1/2) t² Î_n + O(n^(−1/2))} [π(θ̂n) + O(n^(−1/2))]
         / ∫ exp{−(1/2) t² Î_n + O(n^(−1/2))} [π(θ̂n) + O(n^(−1/2))] n^(−1/2) dt
       = √n exp{−(1/2) t² Î_n} π(θ̂n) [1 + O(n^(−1/2))]
         / (√(2π) Î_n^(−1/2) π(θ̂n) [1 + O(n^(−1/2))]),

noting that the denominator takes the form of a constant times the integral of
a normal density with variance Î_n^(−1). Hence,

π(θ∣x) = (√n Î_n^(1/2) / √(2π)) exp(−(1/2) t² Î_n) [1 + O(n^(−1/2))].   (3.2)

Then

log π(θ∣x) = (1/2) log n − (1/2) log(2π) − (1/2) t² Î_n + (1/2) log Î_n + O(n^(−1/2)).

Now consider

E^π̄[log π(θ∣x) ∣ x] = (1/2) log n − (1/2) log(2π) − E^π̄[(1/2) t² Î_n ∣ x]
                        + (1/2) log Î_n + O(n^(−1/2)).

To evaluate E^π̄[(1/2) t² Î_n ∣ x], note that (3.2) states that, up to order n^(−1/2),
π(t∣x) is approximately normal with mean zero and variance Î_n^(−1). Since this
does not depend on the form of the prior π, it follows that π̄(t∣x) is also approxi-
mately normal with mean zero and variance Î_n^(−1), again up to order n^(−1/2).
Then E^π̄[(1/2) t² Î_n ∣ x] = 1/2, which implies that

E^π̄[log π(θ∣x) ∣ x] = (1/2) log n − (1/2) log(2π) − 1/2 + log Î_n^(1/2) + O(n^(−1/2))
                    = (1/2) log n − (1/2) log(2πe) + log Î_n^(1/2) + O(n^(−1/2)).

Step 2: Calculate λ(θ) = ∫ E^π̄[log π(θ∣x) ∣ x] f(x∣θ) dx. This is simply

λ(θ) = (1/2) log n − (1/2) log(2πe) + log[I(θ)]^(1/2) + O(n^(−1/2)).

Step 3: Since λ(θ) is continuous, the process of calculating ∫ λ(θ) π̄(θ) dθ and
allowing π̄(⋅) to converge to degeneracy at θ simply yields λ(θ) again. Thus,

E[log π(θ∣x) ∣ θ] = (1/2) log n − (1/2) log(2πe) + log[I(θ)]^(1/2) + O(n^(−1/2)).

Thus, returning to (3.1), the quantity we need to maximize is

(1/2) log n − (1/2) log(2πe) + ∫ log{ [I(θ)]^(1/2) / π(θ) } π(θ) dθ + O(n^(−1/2)).

The integral is non-positive (when I^(1/2) normalizes to a density) and is maxi-
mized when it equals 0, that is, when π(θ) = I^(1/2)(θ), i.e., Jeffreys' prior.

Take away: If there are no nuisance parameters, Jeffreys’ prior is the reference
prior.

Multiparameter generalization

In the absence of nuisance parameters, the expected K-L divergence becomes

E[log(π(θ∣x)/π(θ))] = (p/2) log n − (p/2) log(2πe)
                        + ∫ log( ∣I(θ)∣^(1/2) / π(θ) ) π(θ) dθ + O(n^(−1/2)).

Note that this is maximized when π(θ) = ∣I(θ)∣^(1/2), meaning that Jeffreys' prior
maximizes the divergence between the prior and posterior. In the presence of
nuisance parameters, things change considerably.

Example 3.9: Bernardo's reference prior, 1979, JASA

Let θ = (θ1, θ2), where θ1 is p1 × 1 and θ2 is p2 × 1. We define p = p1 + p2. In
2 × 2 block form, let

I(θ) = I(θ1, θ2) = ( I11(θ)  I12(θ) ; I21(θ)  I22(θ) ).

Suppose that θ1 is the parameter of interest and θ2 is a nuisance parameter
(meaning that it is not really of interest to us in the model).

Begin with π(θ2∣θ1) = ∣I22(θ)∣^(1/2) c(θ1), where c(θ1) is the constant that makes
this distribution a proper density. Now try to maximize

E[log(π(θ1∣x)/π(θ1))]

to find the marginal prior π(θ1). We write

log(π(θ1∣x)/π(θ1)) = log [π(θ1, θ2∣x)/π(θ2∣θ1, x)] / [π(θ1, θ2)/π(θ2∣θ1)]
                   = log(π(θ∣x)/π(θ)) − log(π(θ2∣θ1, x)/π(θ2∣θ1)).   (3.3)

Arguing as before,

E[log(π(θ∣x)/π(θ))] = (p/2) log n − (p/2) log 2πe
    + ∫ π(θ) log( ∣I(θ)∣^(1/2) / π(θ) ) dθ + O(n^(−1/2)).   (3.4)

Similarly,

E[log(π(θ2∣θ1, x)/π(θ2∣θ1))] = (p2/2) log n − (p2/2) log 2πe
    + ∫ π(θ) log( ∣I22(θ)∣^(1/2) / π(θ2∣θ1) ) dθ + O(n^(−1/2)).   (3.5)

From (3.3)–(3.5), we find

E[log(π(θ1∣x)/π(θ1))] = (p1/2) log n − (p1/2) log 2πe
    + ∫ π(θ) log( ∣I11.2(θ)∣^(1/2) / π(θ1) ) dθ + O(n^(−1/2)),   (3.6)

where I11.2(θ) = I11(θ) − I12(θ) I22^(−1)(θ) I21(θ) and ∣I(θ)∣ = ∣I22∣ ∣I11 −
I12 I22^(−1) I21∣ = ∣I22∣ ∣I11.2∣. These facts can be found in Searle's book on
matrix algebra.

We now break up the integral in (3.6) and we define

log ψ(θ1) = ∫ π(θ2∣θ1) log ∣I11.2(θ)∣^(1/2) dθ2.

We find that

E[log(π(θ1∣x)/π(θ1))]
  = (p1/2) log n − (p1/2) log 2πe + ∫ π(θ) log ∣I11.2(θ)∣^(1/2) dθ
        − ∫ π(θ) log π(θ1) dθ + O(n^(−1/2))
  = (p1/2) log n − (p1/2) log 2πe + ∫ π(θ1) [∫ π(θ2∣θ1) log ∣I11.2(θ)∣^(1/2) dθ2] dθ1
        − ∫ π(θ1) log π(θ1) dθ1 + O(n^(−1/2))
  = (p1/2) log n − (p1/2) log 2πe + ∫ π(θ1) log( ψ(θ1)/π(θ1) ) dθ1 + O(n^(−1/2)).

To maximize the integral above, we choose π(θ1) = ψ(θ1). Note that I11.2(θ)^(−1)
is the (1, 1) block of I^(−1)(θ).

Writing out our prior, we find that

π(θ1) = exp{∫ π(θ2∣θ1) log ∣I11.2(θ)∣^(1/2) dθ2}
      = exp{c(θ1) ∫ ∣I22(θ)∣^(1/2) log ∣I11.2(θ)∣^(1/2) dθ2}.

Remark: An important point that should be highlighted is that all these calcu-
lations (especially the evaluations of all integrals) are carried out over an increas-
ing sequence of compact sets K whose union is the parameter space. For exam-
ple, if the parameter space is R × R⁺, take the increasing sequence of compact
rectangles [−i, i] × [i^(−1), i] and then eventually take i → ∞. Also, the proofs
are carried out by considering a sequence of priors πi with support Ki, and we
eventually take i → ∞. This fact should be taken into account when doing the
examples and calculations of these types of problems.
Example 3.10: Let X1, . . . , Xn∣µ, σ² ∼ iid N(µ, σ²), where the scale is a nuisance
parameter. Consider the sequence of priors πi with support [−i, i] × [i^(−1), i],
i = 1, 2, . . . . In the (µ, σ) parametrization,

I(µ, σ) = ( 1/σ²  0 ; 0  2/σ² ).

Then π(σ∣µ) = √2 c_i2/σ for i^(−1) ≤ σ ≤ i. The normalization ∫ from i^(−1) to i
of (√2 c_i2/σ) dσ = 1 implies c_i2 = 1/(2√2 ln i). Thus,

π(σ∣µ) = (1/(2 ln i)) (1/σ),  i^(−1) ≤ σ ≤ i.

Now find π(µ). Observe that

π(µ) = exp{∫ π(σ∣µ) log ∣I11.2(θ)∣^(1/2) dσ}.

Recall that I11.2 = I11 − I12 I22^(−1) I21 = 1/σ². Thus,

π(µ) = exp{∫ from i^(−1) to i of (1/(2 ln i)) (1/σ) log(1/σ) dσ + constant} = c,

a constant free of µ. We want to find π(µ, σ). We know that π(σ∣µ) = π(µ, σ)/π(µ),
which implies

π(µ, σ) = π(µ) π(σ∣µ) = (c/(2 ln i)) (1/σ) ∝ 1/σ.

Problems with Reference Priors: See page 128 of the little green book and the
discussion in Bernardo and Berger (1992).

3.2 Final Thoughts on Being Objective

Some Thoughts from a Review Paper by Stephen Fienberg

We have spent just a short amount of time covering objective Bayesian proce-
dures, but already we have seen how each is flawed in some sense. As Fienberg
(2009) points out in a review article (see the webpage), there are two main parts
of being Bayesian: the prior and the likelihood. What is Fienberg's point? He
claims that robustness should be carried through at both levels of the model.
That is, we should care about subjectivity of the likelihood as well as the prior.

What about pragmatism (what works best in terms of implementation)? It


really comes down to the quality of the data that we’re analyzing. Fienberg
looks at two examples. The first involves the NBC Election Night model of
the 1960s and 1970s (which used a fully HB model). Here considering different
priors is important. For this illustration, there were multiple priors based on
past elections to choose from in real time, and the choice of prior was often
crucial in close elections. However, in other examples such as Mosteller and
Wallace (1964) analysis of the Federalist papers, the likelihood mattered more.
In this case, the posterior odds for several papers shifted when Mosteller and
Wallace used a negative binomial versus a Poisson for word counts.

My favorite part that Fienberg illustrates in this paper is the view that “objec-
tive Bayes is like the search for the Holy Grail.” He mentions that Good (1972)
once wrote that there are “46,656 Varieties of Bayesians,” which was a num-
ber that he admitted exceeded the number of professional statisticians during
that time. Today? There seem to be as many choices coming about of objective
Bayes for trying to arrive at the perfect choice of an objective prior. Each seems
to fail because of foundational principles. For example, Eaton and Freedman
(2004) criticize why you shouldn’t use Jeffreys’ prior for the normal covariance
matrix. We didn’t look at intrinsic priors, but they have been criticized by
Fienberg for contingency tables because of their dependence on the likelihood
function and because of bizarre properties when extended to deal with large

sparse tables.

Fienberg's conclusion: "No, it does not make sense for me to be an 'Objective
Bayesian'!" Read the other papers on the web when you have time and you can
make your own decision.

Chapter 4

Evaluating Bayesian
Procedures

They say statistics are for losers, but losers are usually the ones saying that.
—Urban Meyer

In this chapter, we give a brief overview of how to evaluate Bayesian procedures


by looking at how frequentist confidence intervals differ from Bayesian credible
intervals. We also introduce Bayesian hypothesis testing and Bayesian p-values.
Again, we emphasize that this is a rough overview and for more details, one
should look in Gelman et al. (2004) or Carlin and Louis (2009).

4.1 Confidence Intervals versus Credible Intervals

One major difference between Bayesians and frequentists is how they interpret
intervals. Let’s quickly review what a frequentist confidence interval is and how
to interpret one.

Frequentist Confidence Intervals

A confidence interval for an unknown (fixed) parameter θ is an interval of num-


bers that we believe is likely to contain the true value of θ. Intervals are impor-
tant because they provide us with an idea of how well we can estimate θ.

Definition 4.1: A confidence interval is constructed to contain θ a percentage


of the time, say 95%. Suppose our confidence level is 95% and our interval is
(L, U ). Then we are 95% confident that the true value of θ is contained in


(L, U) in the long run. "In the long run" means that this would occur nearly 95%
of the time if we repeated our study millions and millions of times.

Common Misconceptions in Statistical Inference

• A confidence interval is a statement about θ (a population parameter). It


is not a statement about the sample.

• Remember that a confidence interval is not a statement about individual


subjects in the population. As an example, suppose that I tell you that
a 95% confidence interval for the average amount of television watched
by Americans is (2.69, 6.04) hours. This doesn’t mean we can say that
95% of all Americans watch between 2.69 and 6.04 hours of television.
We also cannot say that 95% of Americans in the sample watch between
2.69 and 6.04 hours of television. Beware that statements such as these
are false. However, we can say that we are 95 percent confident that the
average amount of television watched by Americans is between 2.69 and
6.04 hours.

Bayesian Credible Intervals

Recall that frequentists treat θ as fixed, but Bayesians treat θ as a random vari-
able. The main difference between frequentist confidence intervals and Bayesian
credible intervals is the following:

• Frequentists invoke the concept of probability before observing the data.


For any fixed value of θ, a frequentist confidence interval will contain the
true parameter θ with some probability, e.g., 0.95.

• Bayesians invoke the concept of probability after observing the data. For
some particular set of data X = x, the random variable θ lies in a Bayesian
credible interval with some probability, e.g., 0.95.

Assumptions

In lower-level classes, you wrote down assumptions whenever you did confidence
intervals. This is redundant for any problem we construct in this course since
we always know the data is randomly distributed and we assume it comes from
some underlying distribution, say Normal, Gamma, etc. We also always assume
our observations are i.i.d. (independent and identically distributed), meaning
that the observations are all independent and all have the same distribution.
Thus, when working a particular problem, we will assume these assumptions
are satisfied given the proposed model holds.

Definition 4.2: A Bayesian credible interval of size 1 − α is an interval (a, b)
such that

P(a ≤ θ ≤ b∣x) = 1 − α, i.e., ∫ from a to b of p(θ∣x) dθ = 1 − α.

Remark: When you’re calculating credible intervals, you’ll find the values
of a and b by several means. You could be asked do the following:

• Find a and b using calculus to determine the credible interval or set.
• Use a Z-table when appropriate.
• Use R to approximate the values of a and b.
• You could be given R code/output and asked to find the values of a
and b.

Important Point

Our definition for the credible interval could lead to many choices of (a, b) for
particular problems.

Suppose that we required our credible interval to have equal probability α/2 in
each tail. That is, we will assume
P (θ < a∣x) = α/2
and
P (θ > b∣x) = α/2.
Is the credible interval still unique? No. Consider

π(θ∣x) = I(0 < θ < 0.025) + I(1 < θ < 1.95) + I(3 < θ < 3.025),

so that the density has three separate plateaus. Now notice that any (a, b)
such that 0.025 < a < 1 and 1.95 < b < 3 satisfies the proposed definition of an
ostensibly "unique" credible interval. To fix this, we can simply require that

{θ ∶ π(θ∣x) is positive}

(i.e., the support of the posterior) must be an interval.

Bayesian interval estimates for θ are similar to confidence intervals of classical


inference. They are called credible intervals or sets. Bayesian credible intervals
have a nice interpretation as we will soon see.

To see this more clearly, see Figure 4.1.


Definition 4.3: A Bayesian credible set C of level or size 1 − α is a set C such
that 1 − α ≤ P (C∣y) = ∫C p(θ∣y) dθ. (Of course in discrete settings, the integral
is simply replaced by summation).

Note: We use ≤ instead of = to include discrete settings since obtaining exact


coverage in a discrete setting may not be possible.

This definition enables direct probability statements about the likelihood of θ


falling in C. That is,
“The probability that θ lies in C given the observed data y is at least (1 − α).”

This greatly contrasts with the usual frequentist CI, for which the corresponding
statement is something like “If we could recompute C for a large number of
datasets collected in the same way as ours, about (1 − α) × 100% of them would
contain the true value θ. ”

This classical statement is not one of comfort. We may not be able to repeat our
experiment a large number of times (suppose we have an interval estimate

for the 1993 U.S. unemployment rate). If we are in physical possession of just
one dataset, our computed C will either contain θ or it won’t, so the actual
coverage probability will be 0 or 1. For the frequentist, the confidence level
(1 − α) is only a “tag” that indicates the quality of the procedure. But for a
Bayesian, the credible set provides an actual probability statement based only
on the observed data and whatever prior information we add.

Figure 4.1: Illustration of 95% credible interval

Interpretation

We interpret Bayesian credible intervals as follows: There is a 95% probability


that the true value of θ is in the interval (a, b), given the data.

Comparisons

• Conceptually, probability comes into play in a frequentist confidence inter-


val before collecting the data, i.e., there is a 95% probability that we will
collect data that produces an interval that contains the true parameter
value. However, this is awkward, because we would like to make state-
ments about the probability that the interval contains the true parameter
value given the data that we actually observed.

• Meanwhile, probability comes into play in a Bayesian credible interval


after collecting the data, i.e., based on the data, we now think there is a
95% probability that the true parameter value is in the interval. This is
more natural because we want to make a probability statement regarding
that data after we have observed it.

Example 4.1: Suppose

X1, . . . , Xn∣θ ∼ N(θ, σ²)
θ ∼ N(µ, τ²),

where µ, σ², and τ² are known. Calculate a 95% credible interval for θ.

Recall

θ∣x1, . . . , xn ∼ N( (nx̄τ² + µσ²)/(nτ² + σ²), σ²τ²/(nτ² + σ²) ).

Let

µ* = (nx̄τ² + µσ²)/(nτ² + σ²),
σ*² = σ²τ²/(nτ² + σ²).

We want to calculate a and b such that P(θ < a∣x1, . . . , xn) = 0.05/2 = 0.025 and
P(θ > b∣x1, . . . , xn) = 0.05/2 = 0.025. So,

0.025 = P(θ < a∣x1, . . . , xn)
      = P( (θ − µ*)/σ* < (a − µ*)/σ* ∣ x1, . . . , xn )
      = P( Z < (a − µ*)/σ* ∣ x1, . . . , xn ), where Z ∼ N(0, 1).

Thus, we now must find an a such that P(Z < (a − µ*)/σ* ∣ x1, . . . , xn) = 0.025.
From a Z-table, we know that

(a − µ*)/σ* = −1.96.

This tells us that a = µ* − 1.96σ*. Similarly, b = µ* + 1.96σ*. (Work this part
out on your own at home.) Therefore, a 95% credible interval is

µ* ± 1.96σ*.

Example 4.2: We’re interested in knowing the true average number of orna-
ments on a Christmas tree. Call this θ. We take a random sample of n Christmas
trees, count the ornaments on each one, and call the results X1 , . . . , Xn . Let the
prior on θ be Normal(75, 225).

Using data (trees.txt) we have, we will calculate the 95% credible interval
and confidence interval for θ. In R we first read in the data file trees.txt. We
then set the initial values for our known parameters, n, σ, µ, and τ.

Next, we refer to Example 4.1, and calculate the values of µ∗ and σ ∗ using this
example. Finally, again referring to Example 4.1, we recall that the formula for
a 95% credible interval here is
µ∗ ± 1.96σ ∗ .
On the other hand, recalling back to any basic statistics course, a 95% confidence
interval in this situation is √
x̄ ± 1.96σ/ n.

From the R code, we find that there is a 95% probability that the average number
of ornaments per tree is in (45.00, 57.13) given the data. We also find that we
are 95% confident that the average number of ornaments per tree is contained
in (43.80, 56.20). If we compare the width of each interval, we see that the
credible interval is slightly narrower. It is also shifted towards slightly higher
values than the confidence interval for this data, which makes sense because the
prior mean was higher than the sample mean. What would happen to the width
of the intervals if we increased n? Does this make sense?

x = read.table("trees.txt", header=T)
attach(x)
n = 10        # sample size
sigma = 10    # known data standard deviation
mu = 75       # prior mean
tau = 15      # prior standard deviation

mu.star = (n*mean(orn)*tau^2 + mu*sigma^2)/(n*tau^2 + sigma^2)
sigma.star = sqrt((sigma^2*tau^2)/(n*tau^2 + sigma^2))

(cred.i = mu.star + c(-1,1)*qnorm(0.975)*sigma.star)      # 95% credible interval
(conf.i = mean(orn) + c(-1,1)*qnorm(0.975)*sigma/sqrt(n)) # 95% confidence interval

diff(cred.i)   # width of the credible interval
diff(conf.i)   # width of the confidence interval
detach(x)
Example 4.3: (Sleep Example)
Recall the Beta-Binomial. Consider that we were interested in the proportion
of the population of American college students that sleep at least eight hours
each night (θ).

Suppose we take a random sample of 27 students from UF, of whom 11 recorded
that they slept at least eight hours each night. So, we assume the data is
distributed as Binomial(27, θ).

Suppose that the prior on θ was Beta(3.3, 7.2). Thus, the posterior distribution
is

θ∣11 ∼ Beta(11 + 3.3, 27 − 11 + 7.2), i.e., θ∣11 ∼ Beta(14.3, 23.2).

Suppose now we would like to find a 90% credible interval for θ. We cannot
compute this in closed form since computing probabilities for Beta distributions
involves messy integrals that we do not know how to compute. However, we can
use R to find the interval.

We need to solve

P(θ < c∣x) = 0.05 and P(θ > d∣x) = 0.05 for c and d.

The reason we cannot compute this in closed form is because we need to compute

∫ from 0 to c of Beta(14.3, 23.2) dθ = 0.05

and

∫ from d to 1 of Beta(14.3, 23.2) dθ = 0.05.

Note that Beta(14.3, 23.2) here represents its density,

f(θ) = [Γ(37.5)/(Γ(14.3)Γ(23.2))] θ^(14.3−1) (1 − θ)^(23.2−1).

The R code for this is very straightforward:

a = 3.3
b = 7.2
n = 27
x = 11
a.star = x+a
b.star = n-x+b

c = qbeta(0.05,a.star,b.star)
d = qbeta(1-0.05,a.star,b.star)

Running the code in R, we find that a 90% credible interval for θ is (0.256, 0.514),
meaning that there is a 90% probability that the proportion of UF students who
sleep eight or more hours per night is between 0.256 and 0.514 given the data.

4.2 Credible Sets or Intervals

Definition 4.4: Suppose the posterior density for θ is unimodal. A highest
posterior density (HPD) credible set of size 1 − α is a set C such that C = {θ ∶
p(θ∣Y = y) ≥ kα}, where kα is chosen so that P(θ ∈ C ∣ y) ≥ 1 − α.
Example 4.4: Normal HPD credible interval
Suppose that

y∣θ ∼ N(θ, σ²), θ ∼ N(µ, τ²).

Then θ∣y ∼ N(µ1, τ1²), where we have derived µ1 and τ1² before. The HPD
credible interval for θ is simply µ1 ± z_(α/2) τ1. Note that the HPD credible
interval is the same as the equal-tailed interval centered at the posterior mean.

Remark: Credible intervals are very easy to calculate unlike confidence inter-
vals, which require pivotal quantities or inversion of a family of tests.

In general, plot the posterior distribution and find the HPD credible set. One
important point is that the posterior must be unimodal in order to guarantee
that the HPD credible set is an interval. (Unimodality of the posterior is a
sufficient condition for the credible set to be an interval, but it’s not necessary.)
Example 4.5: Suppose

y1, . . . , yn∣σ² ∼ iid N(0, σ²),
p(σ²) ∝ (σ²)^(−α/2−1) e^(−β/(2σ²)).

Let z = 1/σ². Then

p(z) ∝ z^(α/2+1) e^(−βz/2) ∣−1/z²∣ = z^(α/2−1) e^(−βz/2).

Then

p(σ²∣y) ∝ (σ²)^(−(n+α)/2−1) e^(−(∑i yi² + β)/(2σ²)),

which implies that

σ²∣y ∼ IG( (n + α)/2, (∑i yi² + β)/2 ).

This posterior distribution is unimodal, but how do we know this? One way of
showing the posterior is unimodal is to show that it is increasing in σ² up to
a point and then decreasing afterwards. The log of the posterior has the same
feature. Then

log p(σ²∣y) = c1 − [(n + α)/2 + 1] log(σ²) − (1/(2σ²)) (∑i yi² + β).

This implies that

∂ log p(σ²∣y)/∂σ² = −(n + α + 2)/(2σ²) + (1/(2σ⁴)) (∑i yi² + β)
                  = [ (∑i yi² + β) − (n + α + 2)σ² ] / (2σ⁴),

which is positive, zero, or negative according as σ² is less than, equal to, or
greater than (∑i yi² + β)/(n + α + 2). Thus, the posterior is unimodal, so we can
get an HPD interval for σ².
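Since the posterior is unimodal, a simple grid approximation finds the HPD
interval: evaluate the density on a grid, keep grid points in decreasing order of
density until they accumulate mass 1 − α, and take their range. A sketch with
invented data and hyperparameters:

set.seed(1)
y = rnorm(20, 0, 2)            # invented data
alpha = 4; beta = 2            # invented prior hyperparameters
shape = (length(y) + alpha)/2
rate = (sum(y^2) + beta)/2
dens = function(s2) rate^shape/gamma(shape)*s2^(-shape - 1)*exp(-rate/s2)
grid = seq(0.01, 30, length = 5000)
p = dens(grid)/sum(dens(grid))        # normalized grid probabilities
ord = order(p, decreasing = TRUE)     # highest-density points first
keep = ord[cumsum(p[ord]) <= 0.95]    # smallest set with about 95% mass
range(grid[keep])                     # the HPD interval (an interval here, by unimodality)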

4.3 Bayesian Hypothesis Testing

Let’s first review p-values and why they might not make sense in the grand
scheme of things. In classical statistics, the traditional approach proposed by
Fisher, Neyman, and Pearson is where we have a null hypothesis and an al-
ternative. After determining some test statistic T (y), we compute the p-value,
which is

p-value = P {T (Y ) is more extreme than T (yobs ) ∣ Ho } ,

where extremeness is in the direction of the alternative hypothesis. If the p-value
is less than some pre-specified Type I error rate, we reject Ho, and otherwise
we don't.

Clearly, classical statistics has deep roots and a long history. It's popular with
practitioners, but does it make sense? The approach can be applied in a straight-
forward manner only when the two hypotheses in question are nested (meaning
one within the other). This means that Ho must be a simplification of Ha.
Many practical testing problems involve a choice between two or more models
that aren't nested (choosing between quadratic and exponential growth models,
for example).

Another difficulty is that tests of this type can only offer evidence against the
null hypothesis. A small p-value indicates that the larger, alternative model has
significantly more explanatory power. But a large p-value does not suggest that
the two models are equivalent (only that we lack evidence that they are not).
This limitation/difficulty is often swept under the rug and never dealt with. We
simply say, "we fail to reject the null hypothesis" and leave it at that.

Third, the p-value offers no direct interpretation as a “weight of evidence” but


only as a long-term probability of obtaining data at least as unusual as what
we observe. Unfortunately, the fact that small p-values imply rejection of Ho
causes many consumers of statistical analyses to assume that the p-value is the
probability that Ho is true, even though it’s nothing of the sort.

Finally, one last criticism is that p-values depend not only on the observed data
but also on the total sampling probability of certain unobserved data points,
namely, the more extreme T (Y ) values. Because of this, two experiments with
identical likelihoods could result in different p-values if the two experiments were
designed differently. (This violates the Likelihood Principle.) See Example 1.1
in Chapter 1 for an illustration of how this can happen.

In classical settings, we talk about Type I and Type II errors. In Bayesian
hypothesis testing, we will consider the following scenarios:

Ho ∶ θ ∈ Θo versus Ha ∶ θ ∈ Θ1.
Ho ∶ θ = θo versus Ha ∶ θ ≠ θo.
Ho ∶ θ ≤ θo versus Ha ∶ θ > θo.

A Bayesian talks about posterior odds and Bayes factors.
Definition 4.5: Prior odds
Let πo = P(θ ∈ Θo), π1 = P(θ ∈ Θ1), and πo + π1 = 1. Then the prior odds in
favor of Ho are πo/π1.
Definition 4.6: Posterior odds
Let αo = P(θ ∈ Θo∣y) and α1 = P(θ ∈ Θ1∣y). Then the posterior odds are αo/α1.
Definition 4.7: Bayes Factor
The Bayes Factor (BF) = posterior odds / prior odds = (αo/α1) ÷ (πo/π1) = αoπ1/(α1πo).
Example 4.6: IQ Scores
Suppose we're studying IQ scores and so we assume that the data follow the
model where

y∣θ ∼ N(θ, 10²)
θ ∼ N(100, 15²).

We'd like to be able to say something about the mean of the IQ scores and
whether it's below or larger than 100. Then

Ho ∶ θ ≤ 100 versus Ha ∶ θ > 100.

The prior odds are then πo/π1 = P(θ ≤ 100)/P(θ > 100) = (1/2)/(1/2) = 1 by
symmetry.

Suppose we find that y = 115. Then θ∣y = 115 ∼ N(110.38, 69.23).

Then αo = P(θ ≤ 100∣y = 115) = 0.106 and α1 = P(θ > 100∣y = 115) = 0.894.
Thus, αo/α1 = 0.1185. Since the prior odds equal 1, BF = 0.1185.
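These numbers are straightforward to reproduce in R (a sketch; y is the single
observation):

y = 115; sigma2 = 100; mu0 = 100; tau2 = 225
v.star = 1/(1/sigma2 + 1/tau2)               # posterior variance, 69.23
mu.star = v.star*(y/sigma2 + mu0/tau2)       # posterior mean, 110.38
alpha0 = pnorm(100, mu.star, sqrt(v.star))   # P(theta <= 100 | y) = 0.106
alpha0/(1 - alpha0)   # posterior odds, about 0.119; equals BF since prior odds = 1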

◯ Lavine and Schervish (The American Statistician, 1999):
Bayes Factors: What They Are and What They Are Not

We present an example from the paper above to illustrate an important point


regarding Bayes Factors. Suppose a coin is known to be a 2-sided head, a 2-
sided tail, or fair. Then let θ be the probability of a head ∈ {0, 1/2, 1}. Suppose
the data tell us that the coin was tossed 4 times and always landed on heads.

Furthermore, suppose that

π({0}) = 0.01, π({1/2}) = 0.98, π({1}) = 0.01.

Consider

H1 ∶ θ = 1 versus H4 ∶ θ ≠ 1
H2 ∶ θ = 1/2 versus H5 ∶ θ ≠ 1/2
H3 ∶ θ = 0 versus H6 ∶ θ ≠ 0.

Then

f(x∣H1) = P(four heads∣θ = 1) = 1
f(x∣H2) = P(four heads∣θ = 1/2) = (1/2)⁴ = 1/16
f(x∣H3) = P(four heads∣θ = 0) = 0
f(x∣H4) = P(four heads∣θ ≠ 1) = [ (1/16)(0.98) + 0 × 0.01 ] / (0.98 + 0.01) = (1/16)(98/99) ≈ 0.0619
f(x∣H5) = P(four heads∣θ ≠ 1/2) = [ 1 × 0.01 + 0 × 0.01 ] / (0.01 + 0.01) = 1/2
f(x∣H6) = P(four heads∣θ ≠ 0) = [ (1/16)(0.98) + 1 × 0.01 ] / (0.98 + 0.01) ≈ 0.072

We then find that

f(x∣H1)/f(x∣H4) = 1/0.0619 ≈ 16.2 and f(x∣H2)/f(x∣H5) = (1/16)/(1/2) = 0.125.

Let k ∈ (0.0619, 0.125) and suppose we reject the null whenever the Bayes factor
of the null against the alternative is below k. Testing H4 against H1, we reject
H4, since its Bayes factor is 0.0619 < k; testing H2 against H5, we fail to reject
H2, since its Bayes factor is 0.125 > k. But H2 implies H4, so failing to reject
H2 should imply failing to reject H4: evidence in favor of H4 should be at least
as strong as that in favor of H2. The Bayes factor violates this. Lavine and
Schervish refer to this as lack of coherence. The problem does not occur with
the posterior odds since if

P(Θo∣x) < P(Θa∣x)

holds, then

P(Θo∣x)/(1 − P(Θo∣x)) < P(Θa∣x)/(1 − P(Θa∣x)).

(This result can be generalized.)
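The marginal likelihoods above are easy to verify in R (a sketch; marg is our
own helper function):

theta = c(0, 1/2, 1)                  # two-sided tail, fair coin, two-sided head
prior = c(0.01, 0.98, 0.01)
like = theta^4                        # P(four heads | theta)
## marginal likelihood of the data under the composite hypothesis {theta in S}
marg = function(S) sum(like[S]*prior[S])/sum(prior[S])
c(H4 = marg(c(1, 2)),                 # theta != 1: 0.0619
  H5 = marg(c(1, 3)),                 # theta != 1/2: 0.5
  H6 = marg(c(2, 3)))                 # theta != 0: 0.072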

• Bayes factors are said to be insensitive to the choice of prior; however, this
statement is misleading (Berger, 1995). We will see why in Examples 4.7
and 4.8.

• BF measures the change from the prior odds to the posterior odds.

Example 4.7: Simple Null versus Simple Alternative

Ho ∶ θ = θo versus Ha ∶ θ = θ1.

Then πo = P(θ = θo) and π1 = P(θ = θ1), so πo + π1 = 1. Then

αo = P(θ = θo∣y) = P(y∣θ = θo)πo / [P(y∣θ = θo)πo + P(y∣θ = θ1)π1].

This implies that αo/α1 = (πo/π1) ⋅ P(y∣θ = θo)/P(y∣θ = θ1), and hence

BF = P(y∣θ = θo)/P(y∣θ = θ1),

which is the likelihood ratio. This does not depend on the choice of the prior.
However, in general the Bayes factor depends on how the prior spreads mass
over the null and alternative (so Berger's statement is misleading).

Example 4.8:
Ho ∶ θ ∈ Θo versus Ha ∶ θ ∈ Θ1.

Derive the BF. Let go(θ) and g1(θ) be probability density functions such that
∫Θo go(θ) dθ = 1 and ∫Θ1 g1(θ) dθ = 1. Let

π(θ) = πo go(θ) if θ ∈ Θo, and π(θ) = π1 g1(θ) if θ ∈ Θ1.

Then ∫ π(θ) dθ = ∫Θo πo go(θ) dθ + ∫Θ1 π1 g1(θ) dθ = πo + π1 = 1.

This implies that αo/α1 = ∫Θo π(θ∣y) dθ / ∫Θ1 π(θ∣y) dθ. Since π(θ∣y) =
p(y∣θ)π(θ)/m(y), it follows that

αo/α1 = [∫Θo p(y∣θ)π(θ) dθ / m(y)] / [∫Θ1 p(y∣θ)π(θ) dθ / m(y)]
      = (πo/π1) ⋅ ∫Θo p(y∣θ)go(θ) dθ / ∫Θ1 p(y∣θ)g1(θ) dθ.

Hence

BF = ∫Θo p(y∣θ)go(θ) dθ / ∫Θ1 p(y∣θ)g1(θ) dθ,

which is the marginal of y under Ho divided by the marginal of y under H1.

4.4 Bayesian p-values

Bayes factors are meant to compare two or more models; however, often we are
interested in the goodness of fit of a particular model rather than a comparison
of models. Bayesian p-values were proposed to address this problem.

◯ Prior Predictive p-value

George Box proposed the prior predictive p-value. Suppose that T(x) is a test
statistic and π is some prior. Then we calculate the marginal (prior predictive)
probability

m[T(x) ≥ T(xobs) ∣ Mo],

where Mo is the null model under consideration.


Example 4.9: X1, . . . , Xn ∣ θ, σ² iid∼ N(θ, σ²). Let Mo : θ = 0 and T(x) = √n∣X̄∣.

Suppose the prior π(σ²) is degenerate at σo², i.e., π(σ² = σo²) = 1. Marginally,

X̄ ∼ N(0, σo²/n)

under Mo. Also,
P(√n∣X̄∣ ≥ √n∣x̄obs∣) = P(√n∣X̄∣/σo ≥ √n∣x̄obs∣/σo) = 2Φ(−√n∣x̄obs∣/σo).

If the guessed σo is much smaller than the actual model standard deviation, then
the p-value is small and the evidence against Mo is overestimated.

Remark: The takeaway message is that the prior predictive p-value is heavily
influenced by the prior.
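
As a small illustration of this sensitivity, the prior predictive p-value from
Example 4.9 can be computed directly; the data and both guesses of σo below
are hypothetical:

# Prior predictive p-value 2*Phi(-sqrt(n)|xbar|/sigma0) from Example 4.9;
# the data x and the two values of sigma0 are illustrative choices.
set.seed(1)
x <- rnorm(20, mean=0, sd=2)  # data actually generated with sigma = 2
n <- length(x)
pval <- function(sigma0) 2*pnorm(-sqrt(n)*abs(mean(x))/sigma0)
pval(2)    # guessed sigma0 matches the truth
pval(0.5)  # sigma0 guessed too small: evidence against Mo is inflated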

◯ Other Bayesian p-values

Since then, the posterior predictive p-value (PPP) has been proposed by Rubin
(1984), Meng (1994), and Gelman et al. (1996). They propose looking at the
posterior predictive distribution of a future observation x under some prior π.
That is, we calculate

m(x∣xobs ) = ∫ f (x∣θ)π(θ∣xobs ) dθ.

Then the PPP is defined to be

P* = P(T(X) ≥ T(xobs) ∣ xobs),

the probability that a future observation X, drawn from the posterior predictive
distribution m(x ∣ xobs) under prior π, yields T(X) ≥ T(xobs).
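
A minimal simulation sketch of the PPP for a normal model with known variance;
the prior, data, and test statistic below are all illustrative choices:

# PPP sketch: X_i ~ N(theta, sigma2) with sigma2 known, theta ~ N(mu0, tau20),
# T(x) = |xbar|. All numerical settings here are hypothetical.
set.seed(1)
sigma2 <- 1; mu0 <- 0; tau20 <- 10
x.obs <- rnorm(20, 0.3, 1); n <- length(x.obs)
T.obs <- abs(mean(x.obs))

tau2n <- 1/(n/sigma2 + 1/tau20)                    # posterior variance
mun   <- tau2n*(n*mean(x.obs)/sigma2 + mu0/tau20)  # posterior mean

T.rep <- replicate(10000, {
  theta <- rnorm(1, mun, sqrt(tau2n))     # draw theta from the posterior
  x.rep <- rnorm(n, theta, sqrt(sigma2))  # draw a replicate data set
  abs(mean(x.rep))
})
mean(T.rep >= T.obs)   # Monte Carlo estimate of the PPP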

Remark: For details, see the papers. A general criticism by Bayarri and Berger
points out that the procedure uses the data twice: the data are used to find the
posterior distribution of θ and again to compute the posterior predictive
p-value. As an alternative, they have suggested conditional predictive p-values
(CPP). This involves splitting the data into two statistics, say T(X) and U(X);
we use U(X) to find the predictive distribution, and T(X) continues to be the
test statistic.

Potential Fixes to the Prior Predictive p-value

We first consider the conditional and partial predictive p-values of Bayarri
and Berger, JASA (1999, 2000). They propose splitting the data into two parts
(T(X), U(X)), where T is the test statistic, and the p-value is computed from
the predictive distribution of a future T conditional on U. The choice of U is
unclear, and for complex problems it is nearly impossible to find. We note that
if U(X) is taken to be the entire data and T(X) is some test statistic, then we
get the PPP back.

Also, Robins, van der Vaart, and Ventura, JASA, 2000, investigate Bayarri
and Berger's claims that, for a parametric model, their conditional and partial
predictive p-values are superior to the parametric bootstrap p-value and to
previously proposed p-values (the prior predictive p-value of Guttman, 1967,
and Rubin, 1984, and the discrepancy p-value of Gelman et al., 1995, 1996, and
Meng, 1994). Robins et al. note that Bayarri and Berger's claims of superiority
are based on small-sample properties for specific examples. They investigate
large-sample properties and conclude that asymptotic results confirm the
superiority of the conditional predictive and partial posterior predictive p-values.

Robins et al. (2000) also explore corrections for when these p-values are
difficult to compute. In Section 4 of their paper, they discuss how to modify
the test statistic for the parametric bootstrap p-value, posterior predictive
p-value, and discrepancy p-value. Modifications are made such that they are
asymptotically uniform. They claim that their approach is successful for the
discrepancy p-value (and the authors derive a test based on this). Note: the
discrepancy p-value can be difficult to calculate for complex models.

4.5 Appendix to Chapter 4 (Done by Rafael Stern)

Added Example for Chapter 4 on March 21, 2013


The following example is an adaptation from Carlos Alberto de Bragança
Pereira (2006).

Consider that U1, U2, U3, U4 are conditionally i.i.d. given θ and such that U1∣θ
has a Uniform distribution on (θ − 0.5, θ + 0.5). Next, we construct a confidence
interval and a credible interval for θ.

Let's start with a confidence interval. Let U(1) = min{U1, U2, U3, U4} and
U(4) = max{U1, U2, U3, U4}. Let's prove that (U(1), U(4)) is an 87.5% confidence
interval for θ.

Consider

P(θ ∉ (U(1), U(4)) ∣ θ) = P(U(1) > θ ∪ U(4) < θ ∣ θ)
 = P(U(1) > θ ∣ θ) + P(U(4) < θ ∣ θ)
 = P(Ui > θ, i = 1, 2, 3, 4 ∣ θ) + P(Ui < θ, i = 1, 2, 3, 4 ∣ θ)
 = (0.5)^4 + (0.5)^4 = (0.5)^3 = 0.125.

Hence, P(θ ∈ (U(1), U(4)) ∣ θ) = 0.875, which proves that (U(1), U(4)) is an 87.5%
confidence interval for θ.
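
As a quick sanity check, this coverage is easy to verify by simulation (θ = 0
below is an arbitrary illustrative value):

# Monte Carlo check that (U(1), U(4)) covers theta with probability 0.875.
set.seed(1)
theta <- 0
covered <- replicate(1e5, {
  u <- runif(4, theta - 0.5, theta + 0.5)
  (min(u) < theta) && (theta < max(u))
})
mean(covered)  # should be close to 0.875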

Consider that U(1) = 0.1 and that U(4) = 0.9. The 87.5% probability has to do
with the random interval (U(1) , U(4) ) and not with the particular observed value
of (0.1, 0.9).

Let's do some investigative work! Observe that, for every ui, ui > θ − 0.5.
Hence, u(1) > θ − 0.5, that is, θ < u(1) + 0.5. Similarly, θ > u(4) − 0.5. Hence,
θ ∈ (u(4) − 0.5, u(1) + 0.5). Plugging in u(1) = 0.1 and u(4) = 0.9, we obtain
θ ∈ (0.4, 0.6). That is, even though the observed 87.5% confidence interval
is (0.1, 0.9), we know that θ ∈ (0.4, 0.6) with certainty.

Let's now compute an 87.5% centered credible interval. This depends on the
prior for θ. Consider the improper prior p(θ) = 1, θ ∈ R. Observe that:

P(θ ∣ u1, u2, u3, u4) ∝ P(θ)P(u1, u2, u3, u4 ∣ θ)
 = ∏_{i=1}^{4} I(ui)_{(θ−0.5, θ+0.5)}
 = I(θ)_{(u(4)−0.5, u(1)+0.5)}.

That is, θ∣u has a Uniform distribution on (u(4) − 0.5, u(1) + 0.5). Let a = u(4) − 0.5
and b = u(1) + 0.5. The centered 87.5% credible interval is (l, u) such that
∫_a^l 1/(b − a) dx = 2^{−4} and ∫_u^b 1/(b − a) dx = 2^{−4}. Hence, l = a + (b − a)/2^4 and
u = b − (b − a)/2^4.

Observe that this interval is always a subset of (a, b), which we know contains
θ for sure.

Does (U(4) − 0.5 + [1 − (U(4) − U(1))]/2^4, U(1) + 0.5 − [1 − (U(4) − U(1))]/2^4) have
any confidence guarantees? Before getting into troublesome calculations, we can
check this through simulations.

The following R code generates a barplot of how often the credible interval
captures the correct parameter, for different parameter values.

# Estimate how often the 87.5% credible interval captures theta,
# based on nsim simulated samples of size 4.
sim_capture_theta <- function(theta,nsim) {

  samples <- runif(4*nsim,theta-0.5,theta+0.5)
  dim(samples) <- c(nsim,4)

  success <- function(sample)
  {
    aa <- min(sample)
    bb <- max(sample)
    # the credible interval is (aa + (bb-aa)/16, bb - (bb-aa)/16)
    return((theta > aa + (bb-aa)/16) && (theta < bb - (bb-aa)/16))
  }

  return(mean(apply(samples,1,success)))
}

capture_frequency <- sapply(0:1000, function(ii){sim_capture_theta(ii,1000)})

barplot(capture_frequency,
  main="Coverage of credible interval for parameter in 0,1,...,1000")
abline(h=0.825,lty=2)
abline(h=0.85,lty=2)
abline(h=0.875,lty=2)

Figure 4.2: Coverage of the credible interval for parameters in 0, 1, . . . , 1000.

The result in Figure 4.2 shows that, in this case, the coverage of the credible
interval seems to be uniform over the parameter space. This is not guaranteed
to always happen! Also, although we constructed an 87.5% credible interval,
the picture suggests that (U(4) − 0.5 + [1 − (U(4) − U(1))]/2^4, U(1) + 0.5 − [1 − (U(4) − U(1))]/2^4)
is somewhere near an 85% confidence interval.

Observe that the wider the gap between U(1) and U(4), the smaller the region
in which θ can lie. In this sense, it would be nice if the interval were smaller
the larger this gap is. The example shows that there exist both credible and
confidence intervals with this property, but this property isn't achieved by
guaranteeing confidence alone.

Chapter 5

Monte Carlo Methods

Every time I think I know what's going on, suddenly there's another layer of
complications. I just want this damned thing solved.
—John Scalzi, The Last Colony

5.1 A Quick Review of Monte Carlo Methods

One motivation for Monte Carlo methods is to approximate an integral of the


form ∫X h(x)f (x) dx that is intractable, where f is a probability density. You
might wonder why we wouldn’t just use numerical integration techniques. There
are a few reasons:

• The most serious problem is the so-called “curse of dimensionality.” Suppose
we have a p-dimensional integral. Numerical integration typically entails
evaluating the integrand over some grid of points. However, if p is even
moderately large, then any reasonably fine grid will contain an impractically
large number of points. For example, if p = 6, then a grid with just ten points
in each dimension—already too coarse for any sensible amount of precision—will
consist of 10^6 points. If p = 50, then even an absurdly coarse grid with just
two points in each dimension will consist of 2^50 points (note that 2^50 > 10^15).

• There can still be problems even when the dimensionality is small. R provides
the functions area (in the MASS package) and integrate; however, area cannot
deal with infinite bounds in the integral, and even though integrate can handle
infinite bounds, it is fragile and often produces output that's not trustworthy
(Robert and Casella, 2010).


◯ Classical Monte Carlo Integration

The generic problem here is to evaluate Ef[h(X)] = ∫X h(x)f(x) dx. The classical
way to solve this is to generate a sample (X1, . . . , Xn) from f and propose as an
approximation the empirical average

h̄n = (1/n) ∑_{j=1}^{n} h(xj).

Why? It can be shown that h̄n converges a.s. (i.e. for almost every generated
sequence) to Ef [h(X)] by the Strong Law of Large Numbers.

Also, under certain assumptions (which we won't get into; see Casella and
Robert, page 65, for details), the asymptotic variance can be approximated and
then estimated from the sample (X1, . . . , Xn) by

vn = (1/n²) ∑_{j=1}^{n} [h(xj) − h̄n]².

Finally, by the CLT (for large n),

(h̄n − Ef[h(X)]) / √vn ∼ N(0, 1) approximately.

There are examples in Casella and Robert (2010) along with R code for those
that haven’t seen these methods before or want to review them.
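
As a small illustration (the integrand h below is a hypothetical choice), the
estimator, the variance estimate vn, and the CLT interval take only a few lines:

# Classical Monte Carlo: estimate E[h(X)] for h(x) = exp(-x^2), X ~ N(0,1).
# The true value is 1/sqrt(3) = 0.577.
set.seed(1)
n <- 1e5
x <- rnorm(n)
h <- exp(-x^2)
hbar <- mean(h)                  # Monte Carlo estimate
vn <- sum((h - hbar)^2)/n^2      # estimate of Var(hbar)
hbar + c(-1.96, 1.96)*sqrt(vn)   # approximate 95% CI from the CLT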

◯ Importance Sampling

Importance sampling involves generating random variables from a different
distribution and then reweighting the output. Its name comes from the fact
that the new distribution is chosen to give greater mass to regions where h is
large (the important part of the space).

Let g be an arbitrary density function; then we can write

I = Ef[h(X)] = ∫X h(x) [f(x)/g(x)] g(x) dx = Eg[h(X)f(X)/g(X)]. (5.1)

This is estimated by

Î = (1/n) ∑_{j=1}^{n} h(Xj) f(Xj)/g(Xj) Ð→ Ef[h(X)] (5.2)

based on a sample generated from g (not f ). Since (5.1) can be written as an


expectation under g, (5.2) converges to (5.1) for the same reason the Monte
Carlo estimator h̄n converges.
ˆ we find
Remark: Calculating the variance of Î, we find

Var(Î) = (1/n²) ∑_i Var(h(Xi)f(Xi)/g(Xi)) = (1/n) Var(h(Xi)f(Xi)/g(Xi)),

which suggests the estimate

V̂ar(Î) = (1/n) V̂ar(h(Xi)f(Xi)/g(Xi)).
Example 5.1: Suppose we want to estimate P (X > 5), where X ∼ N (0, 1).

Naive method: Generate n iid standard normals and use the proportion p̂ that
are larger than 5.

Importance sampling: We will sample from a distribution that gives high prob-
ability to the “important region” (the set (5, ∞)) and then reweight.

Solution: Let φo and φθ be the densities of the N(0, 1) and N(θ, 1) distributions
(θ taken around 5 will work). We have

p = ∫ I(u > 5)φo(u) du = ∫ [I(u > 5) φo(u)/φθ(u)] φθ(u) du.

In other words, if

h(u) = I(u > 5) φo(u)/φθ(u),

then p = Eφθ[h(X)]. If X1, . . . , Xn ∼ N(θ, 1), then an unbiased estimate is
p̂ = (1/n) ∑_i h(Xi).

We implement this in R as follows:



1 - pnorm(5) # gives 2.866516e-07

# Naive method
set.seed(1)
ss <- 100000
x <- rnorm(n=ss)
phat <- sum(x>5)/length(x)
sdphat <- sqrt(phat*(1-phat)/length(x)) # gives 0

# IS method

set.seed(1)
y <- rnorm(n=ss, mean=5)
h <- dnorm(y, mean=0)/dnorm(y, mean=5) * I(y>5)
mean(h) # gives 2.865596e-07
sd(h)/sqrt(length(h)) # gives 2.157211e-09

Example 5.2: Let f(x) be the pdf of a N(0, 1). Assume we want to compute

a = ∫_{−1}^{1} f(x) dx = ∫_{−1}^{1} N(0, 1) dx.

We can use importance sampling to do this calculation. Let g(x) be an arbitrary
pdf:

a = ∫_{−1}^{1} [f(x)/g(x)] g(x) dx.

We want to be able to draw Y ∼ g easily. But how should we go about
choosing g?

• Note that if Y ∼ g, then a = E[I_[−1,1](Y) f(Y)/g(Y)].

• The variance of I_[−1,1](Y) f(Y)/g(Y) is minimized by picking g ∝ I_[−1,1](x)f(x).
Nevertheless, simulating from this g is usually expensive.

• Some g's which are easy to simulate from are the pdfs of the Uniform(−1, 1),
the Normal(0, 1), and a Cauchy with location parameter 0.

• Below is code for obtaining a sample of I_[−1,1](Y) f(Y)/g(Y) for these
distributions.

# Draws of I_[-1,1](Y) f(Y)/g(Y) for the three proposal densities g.

# g = Uniform(-1,1)
uniformIS <- function(nn) {
  sapply(runif(nn,-1,1),
         function(xx) dnorm(xx,0,1)/dunif(xx,-1,1)) }

# g = Cauchy(0,1), i.e., a t distribution with 1 degree of freedom
# (the weight uses dt(xx,1) to match the rt(nn,1) draws)
cauchyIS <- function(nn) {
  sapply(rt(nn,1),
         function(xx) (xx <= 1)*(xx >= -1)*dnorm(xx,0,1)/dt(xx,1)) }

# g = Normal(0,1): the ratio f/g is 1, leaving only the indicator
gaussianIS <- function(nn) {
  sapply(rnorm(nn,0,1),
         function(xx) (xx <= 1)*(xx >= -1)) }

Figure 5.1 presents histograms for a sample of size 1000 from each of these
distributions. The sample variance of I_[−1,1](Y) f(Y)/g(Y) was, respectively,
0.009, 0.349 and 0.227 (for the uniform, the Cauchy, and the normal).

• Even though the shape of the uniform distribution is very different from that
of f(x), a standard normal, on (−1, 1), the uniform is supported exactly on the
region of integration; f(x) has a lot of mass outside of (−1, 1), where draws
from the Cauchy and normal proposals contribute zeros.

• This is why the histograms for the Cauchy and the Normal have big bars at 0,
and the variance obtained from the uniform distribution is the lowest.

• How would these results change if we wanted to compute the integral over
the range (−3, 3) instead of (−1, 1)? This is left as a homework exercise.
Figure 5.1: Histograms for samples from I_[−1,1](Y) f(Y)/g(Y) when g is,
respectively, a uniform, a Cauchy, and a normal pdf.

◯ Importance Sampling with unknown normalizing constant

Often we can sample from µ, but we know π(x)/µ(x) only up to a multiplicative
constant. A typical example is the Bayesian situation:

• π = νY = posterior density of θ given Y when the prior density is ν.

• µ = λY = posterior density of θ given Y when the prior density is λ.

We want to estimate

π(x)/µ(x) = cν L(θ)ν(θ) / [cλ L(θ)λ(θ)] = c ν(θ)/λ(θ) = c ℓ(x),

where ℓ(x) is known and c is unknown.
Remark: we get a ratio of priors.

Then if we're estimating ∫ h(x)π(x) dx, we find

∫ h(x)π(x) dx = ∫ h(x) c ℓ(x)µ(x) dx
 = ∫ h(x) c ℓ(x)µ(x) dx / ∫ µ(x) dx    (since ∫ µ(x) dx = 1)
 = ∫ h(x) c ℓ(x)µ(x) dx / ∫ c ℓ(x)µ(x) dx    (since ∫ c ℓ(x)µ(x) dx = ∫ π(x) dx = 1)
 = ∫ h(x) ℓ(x)µ(x) dx / ∫ ℓ(x)µ(x) dx.

Generate X1, . . . , Xn ∼ µ and estimate via

∑_i h(Xi) ℓ(Xi) / ∑_i ℓ(Xi) = ∑_i h(Xi) [ℓ(Xi) / ∑_j ℓ(Xj)] = ∑_i wi h(Xi),

where wi = ℓ(Xi) / ∑_j ℓ(Xj) = [ν(θi)/λ(θi)] / ∑_j [ν(θj)/λ(θj)].

Motivation
Why the choice above for ℓ(x)? We are just taking a ratio of priors. The
motivation is, for example, the following (an R sketch appears after the
numbered list below):

– Suppose our application is to Bayesian statistics, where θ1, . . . , θn ∼ λY.

– The posterior corresponding to the conjugate prior λ is easy to deal with.

– Think of π = ν as a complicated prior and µ = λ as a conjugate prior.

– Then the weights are wi = [ν(θi)/λ(θi)] / ∑_j [ν(θj)/λ(θj)].

1. If µ and π (i.e., λ and ν) differ greatly, most of the weight will be taken
up by a few observations, resulting in an unstable estimate.
2. We can get an estimate of the variance of ∑_i h(Xi)ℓ(Xi) / ∑_i ℓ(Xi), but we
need to use theorems from advanced probability theory (the Cramer-Wold
device and the multivariate delta method). We'll skip these details.
3. In the application to Bayesian statistics, the cancellation of a potentially
very complicated likelihood can lead to a great simplification.
4. The original purpose of importance sampling was to sample more heavily
from regions that are important. So we may do importance sampling using a
density µ because it's more convenient than using a density π. (These could
also be measures if the densities don't exist, for those taking measure
theory.)
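
Here is a minimal sketch of this prior-ratio importance sampler. The model and
both priors are hypothetical choices: Xi ∼ N(θ, 1), a conjugate prior
λ = N(0, 100), and a nonconjugate prior ν = Cauchy(0, 1).

# Self-normalized importance sampling with weights nu(theta)/lambda(theta).
set.seed(1)
x <- c(1.2, 0.8, 1.5); n <- length(x)

# Posterior under the conjugate prior lambda: N(m.post, v.post)
v.post <- 1/(n + 1/100)
m.post <- v.post*n*mean(x)
theta <- rnorm(1e5, m.post, sqrt(v.post))

# log weights: the likelihood cancels, leaving the ratio of priors
lw <- dcauchy(theta, log=TRUE) - dnorm(theta, 0, 10, log=TRUE)
w <- exp(lw - max(lw)); w <- w/sum(w)
sum(w*theta)   # estimate of the posterior mean of theta under prior nu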

◯ Rejection Sampling

Suppose π is a density on the reals and suppose π(x) = c l(x), where l is known
and c is not known. We are interested in the case where π is complicated. We
want to generate X ∼ π.

Motivating idea: look at a very simple case of rejection sampling.

Suppose first that l is bounded and is zero outside of [0, 1]. Suppose also l
is constant on the intervals ((j − 1)/k, j/k), j = 1, . . . , k. Let M be such that
M ≥ l(x) for all x.

For a very simple case, consider the following procedure.

1. Generate a point (U1 , U2 ) uniformly at random from the rectangle of


height M sitting on top of the interval [0, 1].

2. If the point is below the graph of the function l, retain U1 . Else, reject the
point and go back to (1).

Remark: This uses the probability integral transformation in reverse: if
U ∼ Uniform(0, 1) and X = F^{−1}(U), then X ∼ F.

Remark: Think about what this is doing: we're generating many draws that are
wasted (rejected). Think about the restriction to [0, 1] and whether this makes
sense.

General Case:

Suppose the density g is such that for some known constant M, M g(x) ≥ l(x)
for all x. Procedure:

1. Generate X ∼ g, and calculate r(X) = l(X)/(M g(X)).

2. Flip a coin with probability of success r(X). If we have a success, retain
X. Else return to (1).

To show that an accepted point has distribution π, let I be the indicator that
the point is accepted. Then

P(I = 1) = ∫ P(I = 1 ∣ X = x) g(x) dx = ∫ [π(x)/c]/[M g(x)] g(x) dx = 1/(cM).

Thus, if gI is the conditional distribution of X given I, we have

gI(x ∣ I = 1) = g(x) [π(x)/c]/[M g(x)] / P(I = 1) = π(x).

Example 5.3: Suppose we want to generate random variables from the Beta(5.5,5.5)
distribution. Note: There are no direct methods for generating from Beta(a,b)
if a,b are not integers.

One possibility is to use a Uniform(0,1) as the trial distribution. A better idea


is to use an approximating normal distribution.

##simple rejection sampler for Beta(5.5,5.5), 3.26.13

a <- 5.5; b <- 5.5


m <- a/(a+b); s <- sqrt((a/(a+b))*(b/(a+b))/(a+b+1))
funct1 <- function(x) {dnorm(x, mean=m, sd=s)}
funct2 <- function(x) {dbeta(x, shape1=a, shape2=b)}

##plotting normal and beta densities


pdf(file = "beta1.pdf", height = 4.5, width = 5)
plot(funct1, from=0, to=1, col="blue", ylab="")
plot(funct2, from=0, to=1, col="red", add=T)
dev.off()

##M=1.3 (this is trial and error to get a good M)


funct1 <- function(x) {1.3*dnorm(x, mean=m, sd=s)}
funct2 <- function(x) {dbeta(x, shape1=a, shape2=b)}
pdf(file = "beta2.pdf", height = 4.5, width = 5)
plot(funct1, from=0, to=1, col="blue", ylab="")
plot(funct2, from=0, to=1, col="red", add=T)

dev.off()

##Doing accept-reject
##substance of code
set.seed(1); nsim <- 1e5
x <- rnorm(n=nsim, mean=m, sd=s)
u <- runif(n=nsim)
ratio <- dbeta(x, shape1=a, shape2=b) /
(1.3*dnorm(x, mean=m, sd=s))
ind <- I(u < ratio)
betas <- x[ind==1]
# as a check to make sure we have enough
length(betas) # gives 76836

funct2 <- function(x) {dbeta(x, shape1=a, shape2=b)}


pdf(file = "beta3.pdf", height = 4.5, width = 5)
plot(density(betas))
plot(funct2, from=0, to=1, col="red", lty=2, add=T)
dev.off()
Figure 5.2: Normal enveloping Beta.
Figure 5.3: Naive rejection sampling, M = 1.3.

Figure 5.4: Rejection sampler (kernel density estimate of the accepted draws;
N = 76836, bandwidth = 0.01372).

5.2 Introduction to Gibbs and MCMC

The main idea here involves iterative simulation. We sample values of a random
variable from a sequence of distributions that converge, as iterations continue,
to a target distribution. The simulated values are generated by a Markov chain
whose stationary distribution is the target distribution, i.e., the posterior
distribution.

Geman and Geman (1984) introduced Gibbs sampling for simulating a multivariate
probability distribution p(x) using a random walk on a vector x, where p(x) is
not necessarily a posterior density.

◯ Markov Chains and Gibbs Samplers

We have a probability distribution π on some space X and we are interested in


estimating π or ∫ h(x)π(x)dx, where h is some function. We are considering the
situation where π is analytically intractable.

The Basic idea of MCMC

• Construct a sequence of random variables X1 , X2 , . . . with the property


that the distribution of Xn converges to π as n → ∞.

• If n0 is large, then X_{n0}, X_{n0+1}, . . . all have (approximately) the
distribution π, and these can be used to estimate π and ∫ h(x)π(x)dx.

Two problems:

1. The distribution of X_{n0}, X_{n0+1}, . . . is only approximately π.

2. The random variables X_{n0}, X_{n0+1}, . . . are NOT independent; they may be
correlated.

The MCMC Method Setup: We have a probability distribution π which is
analytically intractable. We want to estimate π or ∫ h(x)π(x)dx, where h is
some function.

The MCMC method consists of coming up with a transition probability function
P(x, A) with the property that it has stationary distribution π.

A Markov chain with Markov transition function P(⋅, ⋅) is a sequence of random
variables X1, X2, . . . on a measurable space such that:

1. P(Xn+1 ∈ A ∣ Xn = x) = P(x, A).

2. P(Xn+1 ∈ A ∣ X1, X2, . . . , Xn) = P(Xn+1 ∈ A ∣ Xn).

Property 1.) says that P is a Markov transition function, and 2.) is the Markov
property, which says “where I'm going next only depends on where I am right now.”

Coming back to the MCMC method, we fix a starting point x0 and generate an
observation X1 from P(x0, ⋅), an observation X2 from P(X1, ⋅), etc. This
generates the Markov chain x0 = X0, X1, X2, . . . .

If we can show that

sup_{C∈B} ∣P^n(x, C) − π(C)∣ → 0 for all x ∈ X,

then by running the chain sufficiently long, we succeed in generating an
observation Xn with distribution approximately π.

What is a Markov chain?


Start with a sequence of dependent random variables, {X (t) }. That is we have
the sequence
X (0) , X (1) , . . . , X (t) , . . .
such that the probability distribution of X (t) given all the past variables only
depends on the very last one X (t−1) . This conditional probability is called the
transition kernel or Markov kernel K, i.e.,

X (t+1) ∣X (0) , X (1) , . . . , X (t) ∼ K(X (t) , X (t+1) ).

• For a given Markov kernel K, there may exist a distribution f such that

∫_X K(x, y)f(x) dx = f(y).

• If f satisfies this equation, we call f a stationary distribution of K. What
this means is that if X^(t) ∼ f, then X^(t+1) ∼ f as well.

The theory of Markov chains provides various results about the existence and
uniqueness of stationary distributions, but such results are beyond the scope of
this course. However, one specific result is that under fairly general conditions
that are typically satisfied in practice, if a stationary distribution f exists, then
f is the limiting distribution of {X^(t)} for almost any initial value or
distribution of X^(0). This property is called ergodicity. From a simulation
point of view, it means that if a given kernel K produces an ergodic Markov
chain with stationary distribution f , generating a chain from this kernel will
eventually produce simulations that are approximately from f.

In particular, a very important result can be derived. For integrable functions h,
the standard average

(1/M) ∑_{t=1}^{M} h(X^(t)) Ð→ Ef[h(X)].

This means that the LLN lies at the basis of Monte Carlo methods which can
be applied in MCMC settings. The result shown above is called the Ergodic
Theorem.

Of course, even in applied settings, it should always be confirmed that the


Markov chain in question behaves as desired before blindly using MCMC to
perform Bayesian calculations. Again, such theoretical verifications are beyond
the scope of this course. Practically speaking, however, the MCMC methods we
will discuss do indeed behave nicely in an extremely wide variety of problems.

Now we turn to Gibbs. The name Gibbs sampling comes from a paper by Geman
and Geman (1984), which first applied a Gibbs sampler on a Gibbs random
field. The name stuck from there. It’s actually a special case of something from

Markov chain Monte Carlo (MCMC), and more specifically a method called
Metropolis-Hastings, which we will hopefully get to. We’ll start by studying the
simple case of the two-stage sampler and then look at the multi-stage sampler.

◯ The Two-Stage Gibbs Sampler

The two-stage Gibbs sampler creates a Markov chain from a joint distribution.
Suppose we have two random variables X and Y with joint density f (x, y).
They also have respective conditional densities fY ∣X and fX∣Y . The two-stage
sampler generates a Markov chain {(Xt , Yt )} according to the following steps:

Algorithm 5.1: Two-stage Gibbs Sampler


Take X0 = x0 . Then for t = 1, 2, . . . , generate

1. Xt ∼ fX∣Y (⋅∣yt−1 )

2. Yt ∼ fY ∣X (⋅∣xt ).

As long as we can write down both conditionals (and simulate from them), it is
easy to implement the algorithm above.

Example 5.4: Bivariate Normal


Consider the bivariate normal model

(X, Y) ∼ N2( 0, ( 1 ρ ; ρ 1 ) ).

Recall the following fact from Casella and Berger (2009): if

(X, Y) ∼ N2( (µX, µY), ( σX² ρσXσY ; ρσXσY σY² ) ),

then

Y ∣ X = x ∼ N( µY + ρ (σY/σX)(x − µX), σY²(1 − ρ²) ).
Suppose we calculate the Gibbs sampler just given the starting point (x0 , y0 ).
Since this is a toy example, let’s suppose we only care about X. Note that we
don’t really need both components of the starting point, since if we pick x0 , we
can generate Y0 from fY ∣X (⋅∣x0 ).

We know that Y0 ∼ N(ρx0, 1 − ρ²) and X1 ∣ Y0 = y0 ∼ N(ρy0, 1 − ρ²). Then

E[X1] = E[E[X1 ∣ Y0]] = ρ(ρx0) = ρ²x0

and

Var[X1] = E[Var[X1 ∣ Y0]] + Var[E[X1 ∣ Y0]] = (1 − ρ²) + ρ²(1 − ρ²) = 1 − ρ⁴.

Then

X1 ∼ N(ρ²x0, 1 − ρ⁴).

We want the unconditional distribution of X2 eventually, so we need to update
(X2, Y2). We need Y1 so we can generate Y1 ∣ X1 = x1. Since we only care about
X, we can use the conditional distribution formula to find that
Y1 ∣ X1 = x1 ∼ N(ρx1, 1 − ρ²). Then using iterated expectation and iterated
variance, we can show that

X2 ∼ N(ρ⁴x0, 1 − ρ⁸).

If we keep iterating, we find that


Xn ∼ N(ρ^{2n} x0, 1 − ρ^{4n}).

(To see this, iterate a few times and find the pattern.) What happens as n → ∞?
Since ρ^{2n} → 0 and ρ^{4n} → 0 for ∣ρ∣ < 1,

Xn ≈ N(0, 1).
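
A minimal R sketch of this two-stage sampler (ρ, the starting value, and the
chain length below are illustrative choices) confirms that the chain forgets x0:

# Two-stage Gibbs sampler for the bivariate normal of Example 5.4.
set.seed(1)
rho <- 0.8; S <- 5000
x <- numeric(S); y <- numeric(S)
x[1] <- 10   # deliberately far from the stationary mean of 0
y[1] <- rnorm(1, rho*x[1], sqrt(1 - rho^2))
for (s in 2:S) {
  x[s] <- rnorm(1, rho*y[s-1], sqrt(1 - rho^2))  # X_t | Y_{t-1}
  y[s] <- rnorm(1, rho*x[s],  sqrt(1 - rho^2))   # Y_t | X_t
}
c(mean(x[-(1:100)]), var(x[-(1:100)]))  # should be near (0, 1)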
Example 5.5: Binomial-Beta
Suppose X∣θ ∼ Bin(n, θ) and θ ∼ Beta(a, b). Then the joint distribution is
f(x, θ) = (n choose x) [Γ(a + b)/(Γ(a)Γ(b))] θ^{x+a−1} (1 − θ)^{n−x+b−1}.
The distribution of X∣θ is given above, and θ∣X ∼ Beta(x + a, n − x + b).

We can implement the Gibbs sampler in R as

gibbs_beta_bin <- function(nsim, nn, aa, bb)


{
xx <- rep(NA,nsim)
tt <- rep(NA,nsim)

tt[1] <- rbeta(1,aa,bb)


xx[1] <- rbinom(1,nn,tt[1])

for(ii in 2:nsim)
{
tt[ii] <- rbeta(1,aa+xx[ii-1],bb+nn-xx[ii-1])
xx[ii] <- rbinom(1,nn,tt[ii])
}

return(list(beta_bin=xx,beta=tt))
}

Since X has a discrete distribution, we can use a rootogram to check if the


Gibbs sampler performed a good approximation. The rootogram plot is imple-
mented in the library vcd in R. The following are the commands to generate
this rootogram:

Figure 5.5: Rootogram from a Beta-Binomial(15, 3, 7).

gibbs_sample <- gibbs_beta_bin(5000,15,3,7)

# Density of a beta-binomial distribution with parameters


# nn: sample size of the binomial
# aa: first parameter of the beta
# bb: second parameter of the beta
dbetabi <- function(xx, nn, aa, bb)
{
return(choose(nn,xx)*exp(lgamma(aa+xx)-lgamma(aa)+lgamma(nn-xx+bb)-
lgamma(bb)-lgamma(nn+aa+bb)+lgamma(aa+bb)))
}

#Rootogram for the marginal distribution of X.


library(vcd)
beta_bin_sample <- gibbs_sample$beta_bin
max_observed <- max(beta_bin_sample)
rootogram(table(beta_bin_sample),5000*dbetabi(0:max_observed,15,3,7),
scale="raw",xlab="X",main="Rootogram for Beta Binomial sample")

Figure 5.5 presents the rootogram for the Gibbs sample for the Beta-Binomial
distribution. Similarly, Figure 5.6 shows the same for the marginal distribution
of θ obtained through the following commands:

Figure 5.6: Histogram for a Beta(3, 7).

#Histogram for the marginal distribution of Theta.


beta_sample <- gibbs_sample$beta
hist(beta_sample,probability=TRUE,xlab=expression(theta),
ylab="Marginal Density", main="Histogram for Beta sample")
curve(dbeta(x,3,7),from=0,to=1,add=TRUE)

Example 5.6: Consider the posterior on (θ, σ 2 ) associated with the following
model:

Xi ∣θ ∼ N (θ, σ 2 ), i = 1, . . . , n,
θ ∼ N (θo , τ 2 )
σ 2 ∼ InverseGamma(a, b),

where θo, τ², a, b are known. Recall that p(σ²) = [b^a/Γ(a)] (σ²)^{−(a+1)} e^{−b/σ²}.
The Gibbs sampler for these conditional distributions can be coded in R as
follows:

# gibbs_gaussian: Gibbs sampler for marginal of theta|X=xx and sigma2|X=xx


# when Theta ~ Normal(theta0,tau2) and Sigma2 ~ Inv-Gamma(aa,bb) and

# X|Theta=tt,Sigma2=ss ~ Normal(tt,ss)
#
# returns a list gibbs_sample
# gibbs_sample$theta : sample from the marginal distribution of Theta|X=xx
# gibbs_sample$sigma2: sample from the marginal distribution of Sigma2|X=xx

gibbs_gaussian <- function(nsim,xx,theta0,tau2,aa,bb)


{
nn <- length(xx)
xbar <- mean(xx)
RSS <- sum((xx-xbar)^2)
post_sigma_shape <- aa + nn/2

theta <- rep(NA,nsim)


sigma2 <- rep(NA,nsim)

sigma2[1] <- 1/rgamma(1,shape=aa,rate=bb)


ww <- sigma2[1]/(sigma2[1]+nn*tau2)
theta[1] <- rnorm(1,mean=ww*theta0+(1-ww)*xbar, sd=sqrt(tau2*ww))

for(ii in 2:nsim)
{
new_post_sigma_rate <- (1/2)*(RSS+ nn*(xbar-theta[ii-1])^2) + bb
sigma2[ii] <- 1/rgamma(1,shape=post_sigma_shape,
rate=new_post_sigma_rate)

new_ww <- sigma2[ii]/(sigma2[ii]+nn*tau2)


theta[ii] <- rnorm(1,mean=new_ww*theta0+(1-new_ww)*xbar,
sd=sqrt(tau2*new_ww))
}

return(list(theta=theta,sigma2=sigma2))
}

The histograms in Figure 5.7 for the posterior for θ and σ 2 are obtained as
follows:

library(mcsm)
data(Energy)
gibbs_sample <- gibbs_gaussian(5000,log(Energy[,1]),5,10,3,3)

par(mfrow=c(1,2))
hist(gibbs_sample$theta,xlab=expression(theta~"|X=x"),main="")
hist(sqrt(gibbs_sample$sigma2),xlab=expression(sigma~"|X=x"),main="")

Figure 5.7: Histograms for the posterior mean θ ∣ X = x and standard deviation σ ∣ X = x.

◯ The Multistage Gibbs Sampler

There is a natural extension from the two-stage Gibbs sampler to the general
multistage Gibbs sampler. Suppose that for p > 1, we can write the random
variable X = (X1 , . . . , Xp ), where the Xi ’s are either unidimensional or mul-
tidimensional components. Suppose that we can simulate from corresponding
conditional densities f1 , . . . , fp . That is, we can simulate

Xi ∣x1 , . . . , xi−1 , xi+1 , . . . , xp ∼ f (xi ∣x1 , . . . , xi−1 , xi+1 , . . . , xp )

for i = 1, . . . , p. The associated Gibbs sampling algorithm is given by the follow-


ing transition from X (t) to X (t+1) ∶
Algorithm 5.2: The Multistage Gibbs sampler
At iteration t = 1, 2, . . . , given x^(t−1) = (x1^(t−1), . . . , xp^(t−1)), generate

1. X1^(t) ∼ f(x1 ∣ x2^(t−1), . . . , xp^(t−1)),

2. X2^(t) ∼ f(x2 ∣ x1^(t), x3^(t−1), . . . , xp^(t−1)),
⋮
p−1. X_{p−1}^(t) ∼ f(x_{p−1} ∣ x1^(t), . . . , x_{p−2}^(t), xp^(t−1)),

p. Xp^(t) ∼ f(xp ∣ x1^(t), . . . , x_{p−1}^(t)).

The densities f1 , . . . , fp are called the full conditionals, and a particular feature
of the Gibbs sampler is that these are the only densities used for simulation.
Hence, even for high-dimensional problems, all of the simulations may be uni-
variate, which is a major advantage.
Example 5.7: (Casella and Robert, p. 207) Consider the following model:

Xij ∣ θi, σ² ind∼ N(θi, σ²), 1 ≤ i ≤ k, 1 ≤ j ≤ ni,
θi ∣ µ, τ² iid∼ N(µ, τ²),
µ ∣ σµ² ∼ N(µ0, σµ²),
σ² ∼ IG(a1, b1),
τ² ∼ IG(a2, b2),
σµ² ∼ IG(a3, b3).

The conditional independencies in this example can be visualized by the Bayesian
network in Figure 5.8. Using these conditional independencies, we can compute
the complete conditional distributions for each of the variables as

θi ∼ N( [σ²/(σ² + ni τ²)] µ + [ni τ²/(σ² + ni τ²)] X̄i, σ²τ²/(σ² + ni τ²) ),

µ ∼ N( [τ²/(τ² + kσµ²)] µ0 + [kσµ²/(τ² + kσµ²)] θ̄, σµ²τ²/(τ² + kσµ²) ),

σ² ∼ IG( ∑i ni/2 + a1, (1/2) ∑i,j (Xi,j − θi)² + b1 ),

τ² ∼ IG( k/2 + a2, (1/2) ∑i (θi − µ)² + b2 ),

σµ² ∼ IG( 1/2 + a3, (1/2)(µ − µ0)² + b3 ),

where θ̄ = ∑i ni θi / ∑i ni.

Running the chain with µ0 = 5 and a1 = a2 = a3 = b1 = b2 = b3 = 3 and chain size


5000, we get the histograms in Figure 5.9.
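
One full scan of this Gibbs sampler can be sketched in R as follows; the data X
(a list of k vectors) and all hyperparameters are assumed given, and the current
state is passed in and returned as a list:

# Sketch: one scan of the multistage Gibbs sampler for Example 5.7.
gibbs_scan <- function(X, st, mu0, a1, b1, a2, b2, a3, b3)
{
  k <- length(X); ni <- sapply(X, length); xbar <- sapply(X, mean)
  # theta_i | rest
  v <- st$sig2*st$tau2/(st$sig2 + ni*st$tau2)
  m <- (st$sig2*st$mu + ni*st$tau2*xbar)/(st$sig2 + ni*st$tau2)
  st$theta <- rnorm(k, m, sqrt(v))
  # mu | rest
  thbar <- sum(ni*st$theta)/sum(ni)
  vmu <- st$sig2mu*st$tau2/(st$tau2 + k*st$sig2mu)
  mmu <- (st$tau2*mu0 + k*st$sig2mu*thbar)/(st$tau2 + k*st$sig2mu)
  st$mu <- rnorm(1, mmu, sqrt(vmu))
  # variances | rest (inverse gammas drawn as 1/rgamma)
  rss <- sum(unlist(mapply(function(xi, ti) (xi - ti)^2, X, st$theta)))
  st$sig2   <- 1/rgamma(1, sum(ni)/2 + a1, rate = rss/2 + b1)
  st$tau2   <- 1/rgamma(1, k/2 + a2, rate = sum((st$theta - st$mu)^2)/2 + b2)
  st$sig2mu <- 1/rgamma(1, 1/2 + a3, rate = (st$mu - mu0)^2/2 + b3)
  st
}

Iterating st <- gibbs_scan(X, st, 5, 3, 3, 3, 3, 3, 3) and recording the
components would produce chains like those summarized in Figure 5.9.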

◯ Application of the GS to latent variable models

We give an example of applying the Gibbs sampler to a data augmentation
problem arising in a genetic linkage analysis. This example is given in Rao
(1973, pp. 368-369), where it is analyzed in a frequentist setting; it was
re-analyzed in Dempster, Laird and Rubin (1977), and re-analyzed in a Bayesian
framework in Tanner and Wong (1987).

Figure 5.8: Bayesian Network for Example 5.7 (nodes τ², µ, σµ², σ², θi, Xij).
Figure 5.9: Histograms for posterior quantities (µ, θ1, θ2, σµ², τ², σ²).



Example 5.8: A genetic model specifies that 197 animals are distributed
multinomially into four categories, with cell probabilities given by

π = (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4). (∗)

The actual observations are y = (125, 18, 20, 34). We want to estimate θ.

Biological basis for model:

Suppose we have two factors, call them α and β (say eye color and leg length).

• Each comes at two levels: α comes in levels A and a, and β comes in levels
B and b.
• Suppose A is dominant, a is recessive; also B dominant, b recessive.
• Suppose further that P (A) = 1/2 = P (a) [and similarly for the other
factor].
• Now suppose that the two factors are related: P (B∣A) = 1−η and P (b∣A) =
η.
• Similarly, P (B∣a) = η and P (b∣a) = 1 − η.

To calculate the probability of the phenotypes AB, Ab, aB and ab in an
offspring (the phenotype is what we actually see in the offspring), we suppose
that mother and father are chosen independently from the population, and make
the following table involving the genotypes (the genotype is what is actually in
the genes, and is not seen).

Then

P(Father is AB) = P(B ∣ A)P(A) = (1/2)(1 − η),
P(Mother is AB) = P(B ∣ A)P(A) = (1/2)(1 − η),
P(O.S. is AB) = (1/4)(1 − η)².

Note: η = 1/2 means no linkage and people like to estimate η.


What now?
Suppose we put the prior Beta(a, b) on θ. How do we get the posterior?

Here is one method, using the Gibbs sampler.

Split the first cell into two cells, one with probability 1/2, the other with
probability θ/4.

Augment the data into a 5-category multinomial, call it X, where X1 is the
count in the probability-1/2 cell. Now consider the conditional distribution of
X1 given θ (and the data). We will run a Gibbs sampler of length 2:

• The conditional distribution of X1 ∣ θ (and given the data) is
Bin(125, (1/2)/(1/2 + θ/4)).

• Given the data, conditional on X1 the model is simply a binomial with
n = 197 − X1 and probability of success θ, with data consisting of
(125 − X1 + X5) successes and (X3 + X4) failures.

• Thus, the conditional distribution of θ ∣ X1 and the data is

Beta(a + 125 − X1 + X5, b + X3 + X4).

R Code to implement G.S.:

set.seed(1)
a <- 1; b <- 1
z <- c(125,18,20,34)
x <- c(z[1]/2, z[1]/2, z[2:4])
nsim <- 50000 # runs in about 2 seconds on 3.8GHz P4
theta <- rep(a/(a+b), nsim)
for (j in 1:nsim)
{
theta[j] <- rbeta(n=1, shape1=a+125-x[1]+x[5],
shape2=b+x[3]+x[4])
x[1] <- rbinom(n=1, z[1], (2/(2+theta[j])))
}
mean(theta) # gives 0.623
pdf(file="post-dist-theta.pdf",
horiz=F, height=5.0, width=5.0)
plot(density(theta), xlab=expression(theta), ylab="",
main=expression(paste("Post Dist of ", theta)))
dev.off()
eta <- 1 - sqrt(theta) # Variable of actual interest
plot(density(eta))
sum(eta > .4)/nsim # gives 0

Figure 5.10: Posterior Distribution of θ for Genetic Linkage.



5.3 MCMC Diagnostics

We will want to check any chain that we run to assess any lack of convergence.

The adequate length of a run will depend on

• a burn-in period (debatable topic).


• mixing rate.

• variance of quantity we are monitoring.

Quick checks:

• trace plots: a time series plot of the parameters of interest; indicates how
quickly the chain is mixing, or failure to mix.

• Autocorrelations plots.
• Plots of log posterior densities – used mostly in high dimensional problems.
• Multiple starting points – diagnostic to attempt to handle problems when
we obtain different estimates when we start with multiple (different) start-
ing values.

Definition: An autocorrelation plot graphically measures the correlation
between Xi and Xi+k for each lag k in the chain.

• The lag-k correlation is Corr(Xi, Xi+k).

• By looking at autocorrelation plots of parameters that we are interested
in, we can decide how much to thin or subsample our chain by (see the coda
sketch after this list).

• Then rerun the Gibbs sampler using the new thin value.
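
For a chain already stored in R as a numeric vector (chain below is assumed to
exist), the coda package provides these diagnostics directly; a minimal sketch:

# Autocorrelation diagnostics with coda for a chain stored in `chain`.
library(coda)
mc <- mcmc(chain)
autocorr.plot(mc)                     # lag-k autocorrelations
autocorr(mc, lags = c(1, 2, 5, 10))   # numerical lag-k correlations
effectiveSize(mc)                     # effective sample size of the chain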

For a real data example that I’m working on:



Figure 5.11: Trace Plot for RL Example (panels for λ2.9111, λ2.9145, λ2.9378,
and λ2.9248).



Figure 5.12: Max Autocorrelation Plot for RL Example (correlation against lag).

Multiple Starting Points: Can help determine if burn-in is long enough.

• Basic idea: we want to estimate the mean of a parameter θ.

• Run chain 1 starting at x0. Estimate the mean to be 10 ± 0.1.
• Run chain 2 starting at x1. Estimate the mean to be 11 ± 0.1.
• Then we know that the effect of the starting point hasn't been forgotten.
• Maybe the chain hasn't reached the area of high probability yet and needs
to be run longer?
• Try running multiple chains.

Gelman-Rubin

• Idea is that if we run several chains, the behavior of the chains should be
basically the same.
• Check informally using trace plots.
• Check using the Gelman-Rubin diagnostic – but can fail like any test.
• Suggestions – Geweke – more robust when normality fails.

5.4 Theory and Application Based Example

◯ PlA2 Example

Twelve studies were run to investigate a potential link between the presence of
a certain genetic trait and the risk of heart attack. Each was case-control, and
considered a group of individuals with coronary heart disease and another group
with no history of heart disease. For each study i (i = 1, . . . , 12), the
proportion having the genetic trait in each group was noted and a log odds
ratio ψ̂i was calculated, together with a standard error σi. Results are
summarized in the table below (data from Burr et al. 2003).

i 1 2 3 4 5 6
ψ̂i 1.06 -0.10 0.62 0.02 1.07 -0.02
σi 0.37 0.11 0.22 0.11 0.12 0.12

i 7 8 9 10 11 12
ψ̂i -0.12 -0.38 0.51 0.00 0.38 0.40
σi 0.22 0.23 0.18 0.32 0.20 0.25


Let ψi represent the true log odds ratio for study i. Then a typical hierarchical
model would look like:

ψ̂i ∣ ψi ind∼ N(ψi, σi²), i = 1, . . . , 12,
ψi ∣ µ, τ iid∼ N(µ, τ²), i = 1, . . . , 12,
(µ, τ) ∼ ν.

From this, the likelihood is

L(µ, τ) = ∫⋯∫ ∏_{i=1}^{12} N_{ψi,σi}(ψ̂i) ∏_{i=1}^{12} N_{µ,τ}(ψi) dψ1⋯dψ12.

The posterior can be written (as long as µ and τ have densities) as

π(µ, τ ∣ ψ̂) = c^{−1} [∫⋯∫ ∏_{i=1}^{12} N_{ψi,σi}(ψ̂i) ∏_{i=1}^{12} N_{µ,τ}(ψi) dψ1⋯dψ12] p(µ, τ).

Suppose we take ν to be the “Normal/Inverse Gamma prior.” Then, conditional on
τ, µ ∼ N(c, dτ²), and γ = 1/τ² ∼ Gamma(a, b).

Remark: The reason for taking this prior is that it is conjugate for the normal
distribution with both mean and variance unknown (that is, it is conjugate for
the model in which the ψi's are observed).

We will use the notation NIG(a, b, c, d) to denote this prior. Taking a = .1, b
= .1, c = 0, and d = 1000 gives a flat prior.

• If we are frequentists, then we need to calculate the likelihood

L(µ, τ) = ∫⋯∫ ∏_{i=1}^{12} N_{ψi,σi}(ψ̂i) ∏_{i=1}^{12} N_{µ,τ}(ψi) dψ1⋯dψ12.

• If we are Bayesians, we need to calculate the likelihood, and in addition we
need to calculate the normalizing constant in order to find the posterior

π(µ, τ ∣ ψ̂) = L(µ, τ)p(µ, τ) / ∫ L(µ, τ)p(µ, τ) dµ dτ.

• Neither of the above is easy to do.

We have a choice:

• Select a model that doesn't fit the data well but gives answers that are
easy to obtain, i.e., in closed form.

• Select a model that is appropriate for the data but is computationally
difficult to deal with.

MCMC methods often allow us (in many cases) to make the second choice.

Going back to the example and fitting a model:



Recall the general model:

ψ̂i ∣ ψi ind∼ N(ψi, σi²), i = 1, . . . , 12,
ψi ∣ µ, τ iid∼ N(µ, τ²), i = 1, . . . , 12,
(µ, τ) ∼ NIG(a, b, c, d).

Then the posterior of (µ, τ) is NIG(a′, b′, c′, d′), with

a′ = a + n/2,   b′ = b + (1/2) ∑_i (Xi − X̄)² + n(X̄ − c)²/(2(1 + nd)),

and

c′ = (c + ndX̄)/(nd + 1),   d′ = 1/(n + d^{−1}).

This means that

µ ∣ τ², y ∼ N(c′, d′τ²)

and

τ² ∣ y ∼ InverseGamma(a′, b′).

Implementing the Gibbs sampler:

Want the posterior distribution of (µ, τ ).

• In order to clarify what we are doing, we use the notation that subscripting
a distribution by a random variable denotes conditioning.

• Thus, if U and V are two random variables, L(U ∣V ) and LV (U ) will both
denote the conditional distribution of U given V.

We want to find Lψ̂ (µ, τ, ψ) . We’ll run a Gibbs sampler of length 2:

• Given (µ, τ), the ψ's are independent. The conditional distribution of ψi
given ψ̂ is the conditional distribution of ψi given only ψ̂i. This conditional
distribution is given by a standard result for the conjugate normal/normal
situation: it is N(µ′, τ²′), where

µ′ = (σi²µ + τ²ψ̂i)/(σi² + τ²),   τ²′ = σi²τ²/(σi² + τ²).

• Given the ψ's, the data are superfluous, i.e., Lψ̂(µ, τ ∣ ψ) = L(µ, τ ∣ ψ).
This conditional distribution is given by the conjugacy of the Normal/Inverse
Gamma prior: L(µ, τ ∣ ψ) = NIG(a′, b′, c′, d′), where

a′ = a + n/2,   b′ = b + (1/2) ∑_i (ψi − ψ̄)² + n(ψ̄ − c)²/(2(1 + nd)),

and

c′ = (c + ndψ̄)/(nd + 1),   d′ = 1/(n + d^{−1}).

This gives us a sequence (µ^(g), τ^(g), ψ1^(g), . . . , ψn^(g)), g = 1, . . . , G, from
Lψ̂(µ, τ, ψ). If we're interested in, e.g., the posterior distribution of µ, we
just retain the first coordinate in the sequence.

Specific Example for PlA2 data

Our proposed hierarchical model is

ψ̂i ∣ ψi ind∼ N(ψi, σi²), i = 1, . . . , 12,
ψi ∣ µ, τ² iid∼ N(µ, τ²), i = 1, . . . , 12,
µ ∣ τ² ∼ N(0, 1000τ²),
γ = 1/τ² ∼ Gamma(0.1, 0.1).

Why is a normal prior taken for µ? It is conjugate for the normal distribution
when the variance is known. The two priors above, with the chosen
hyperparameters, result in noninformative hyperpriors.


The file model.txt contains

model {
for (i in 1:N) {
psihat[i] ~ dnorm(psi[i],1/(sigma[i])^2)
psi[i] ~ dnorm(mu,1/tau^2)
}
mu ~ dnorm(0,1/(1000*tau^2))
tau <- 1/sqrt(gam)
gam ~ dgamma(0.1,0.1)
}

Note: In BUGS, use dnorm(mean,precision), where precision = 1/variance.


The file data.txt contains

"N" <- 12
"psihat" <- c(1.055, -0.097, 0.626, 0.017, 1.068,
-0.025, -0.117, -0.381, 0.507, 0, 0.385, 0.405)
"sigma" <- c(0.373, 0.116, 0.229, 0.117, 0.471,
0.120, 0.220, 0.239, 0.186, 0.328, 0.206, 0.254)

The file inits1.txt contains

".RNG.name" <- "base::Super-Duper"


".RNG.seed" <- 12
"psi" <- c(0,0,0,0,0,0,0,0,0,0,0,0)
"mu" <- 0
"gam" <- 1


The file script.txt contains


model clear
data clear
model in "model"
data in "data"
compile, nchains(2)
inits in "inits1", chain(1)
inits in "inits2", chain(2)
initialize
update 10000
monitor mu
monitor psi
monitor gam
update 100000
coda *, stem(CODA1)
coda *, stem(CODA2)

Now, we read in the coda files into R from the current directory and continue our
analysis. The first part of our analysis will consist of some diagnostic procedures.

We will consider

• Autocorrelation Plots
• Trace Plots
• Gelman-Rubin Diagnostic
• Geweke Diagnostic

Definition: An autocorrelation plot graphically measures the correlation
between Xi and Xi+k for each lag k in the chain.

• The lag-k correlation is Corr(Xi, Xi+k).

• By looking at autocorrelation plots of parameters that we are interested
in, we can decide how much to thin or subsample our chain by.
• We can rerun our JAGS script using our thin value.

We take the thin value to be the first lag whose correlation ≤ 0.2. For this plot,
we take a thin of 2. We will go back and rerun our JAGS script and skip every
other value in each chain. After thinning, we will proceed with other diagnostic
procedures of interest.
[Autocorrelation plot for µ: correlation against lag.]


The file script_thin.txt contains
model clear
data clear
model in "model"
data in "data"
compile, nchains(2)
inits in "inits1", chain(1)
inits in "inits2", chain(2)
initialize
update 10000
monitor mu, thin(6)
monitor psi, thin(6)
monitor gam, thin(6)
update 100000
coda *, stem(CODA1_thin)
coda *, stem(CODA2_thin)

Definition: A trace plot is a time series plot of the parameter, say µ, that we
monitor as the Markov chain(s) proceed(s).
[Trace plot of µ over 2,000 iterations.]

Definition: The Gelman-Rubin diagnostic tests that burn-in is adequate and


requires that multiple starting points be used.

To compute the G-R statistic, we must



• Run two chains in JAGS using two different sets of initial values (and two
different seeds).

• Load coda package in R and run gelman.diag(mcmc.list(chain1,chain2)).

How do we interpret the Gelman-Rubin Diagnostic?

• If the chain has reached convergence, the G-R test statistic R ≈ 1. We


conclude that burn-in is adequate.

• Values above 1.05 indicate lack of convergence.

Warning: The distribution of R under the null hypothesis is essentially an F


distribution. Recall that the F-test for comparing two variances is not robust
to violations of normality. Thus, we want to be cautious in using the G-R
diagnostic.


Doing this in R, for the PlA-2 example, we find

Point est. 97.5% quantile
mu 1 1
psi[1] 1 1
psi[2] 1 1
...
psi[11] 1 1
psi[12] 1 1
gam 1 1

Since 1 is in all the 95% CIs, we find no evidence of a failure to converge.

Suppose µ is the parameter of interest.

Main Idea: If burn-in is adequate, the mean of the posterior distribution of µ


from the first half of the chain should equal the mean from the second half of
the chain.

To compute the Geweke statistic, we must



• Run a chain in JAGS along with a set of initial values.

• Load the coda package in R and run geweke.diag(mcmc.list(chain)) (a short
sketch follows this list).

• The Geweke statistic asymptotically has a standard normal distribution,
so if the values from R are outside −2.5 or 2.5, this indicates nonstationarity
of the chain and that burn-in is not sufficient.
• Using the Geweke diagnostic on the PlA2 data indicates that a burn-in of
10,000 is sufficient (the largest absolute Z-score is 1.75).
• Observe that the Geweke diagnostic does not require multiple starting
points as Gelman-Rubin does.
• The Geweke statistic (based on a T-test) is robust against violations of
normality, so the Geweke test is preferred to Gelman-Rubin.
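
A minimal sketch of both diagnostics with coda (the chains chain1 and chain2
are assumed to exist):

# Gelman-Rubin (needs two or more chains) and Geweke (one chain) with coda.
library(coda)
gelman.diag(mcmc.list(mcmc(chain1), mcmc(chain2)))  # R near 1 is good
geweke.diag(mcmc(chain1))   # |Z| > 2.5 suggests burn-in is insufficient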

Using Gelman-Rubin and Geweke we have shown that burn-in is “sufficient.”

• We can look at summary statistics such as means, standard errors, and


credible intervals using the summary function.
• We can use kernel density functions in R to estimate posterior distributions
that we are interested in using the density function.

Post Mean Post SD Post Naive SE


µ 0.217272 0.127 0.0009834
ψ1 0.594141 0.2883 0.0022334
ψ2 -0.062498 0.1108 0.0008583
ψ3 0.490872 0.2012 0.0015588
ψ4 0.040284 0.1118 0.0008658
ψ5 0.51521 0.3157 0.0024453
ψ6 0.003678 0.114 0.0008831
ψ7 -0.015558 0.1883 0.0014586
ψ8 -0.175852 0.2064 0.0015988
ψ9 0.433525 0.1689 0.0013084
ψ10 0.101912 0.2423 0.0018769
ψ11 0.332775 0.1803 0.0013965
ψ12 0.331466 0.2107 0.0016318
γ 10.465411 6.6611 0.051596

The posterior of µ ∣ data is shown in Figure 5.13.

Alternatively, we can estimate the conditional distributions of exp(ψi )’s given


the data. A few are shown below.

Figure 5.13: Posterior of µ ∣ data.

• So here we're looking at the odds of getting heart disease given that you
have the genetic trait relative to the odds given that you do not have the
trait. Note that all estimates are pulled toward the mean, showing a Bayesian
Stein effect.

• This is the odds ratio of having a heart attack for those who have the
genetic trait versus those who don't (looking at study i).

[Figure: posterior densities of exp(ψ1), exp(ψ2), exp(ψ3), and exp(ψ4).]

Moreover, we could just as easily have done this analysis in WinBUGS. Below is
the code:
model{
for (i in 1:N) {
psihat[i] ~ dnorm(psi[i],rho[i])
psi[i] ~ dnorm(mu,gam)
rho[i] <- 1/pow(sigma[i],2)
}

mu ~ dnorm(0,gamt)
gam ~ dgamma(0.1,0.1)
gamt <- gam/1000
}

Finally, we can either run the analysis using WinBUGS or JAGS and R. I will
demonstrate how to do this using JAGS for this example. I have included the
basic code to run this on a Windows machine via WinBUGS. Both methods
yield essentially the same results.

To run WinBUGS within R, you need the following:

• Load the R2WinBUGS library.

• Read in data and format it as a list().

• Format initial values as a list().



• Format the unknown parameters using c().

• Run the bugs() command to open/run WinBUGS.

• Read in the G.S. values using read.coda().

setwd("C:/Documents and Settings/Tina Greenly
/Desktop/beka_winbugs/novartis/pla2")
library(R2WinBUGS)
pla2 <- read.table("pla2_data.txt",header=T)
attach(pla2)
names(pla2)
N<-length(psihat)
data <- list("psihat", "sigma", "N")

inits1 <- list(psi = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), mu = 0, gam = 1)


inits2 <- list(psi = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), mu = 1, gam = 2)
inits = list(inits1, inits2)
parameters <- c("mu", "psi", "gam")
pla2.sim <- bugs(data, inits, parameters,
"pla2.bug", n.chains=2, n.iter = 110000,
codaPkg=T,debug=T,n.burnin = 10000,n.thin=1,bugs.seed=c(12,13),
working.directory="C:/Documents and Settings/Tina Greenly/
Desktop/beka_winbugs/novartis/pla2")
detach(pla2)
coda1 = read.coda("coda1.txt","codaIndex.txt")
coda2 = read.coda("coda2.txt","codaIndex.txt")

5.5 Metropolis and Metropolis-Hastings

The Metropolis-Hastings algorithm is a general term for a family of Markov


chain simulation methods that are useful for drawing samples from Bayesian
posterior distributions. The Gibbs sampler can be viewed as a special case of
Metropolis-Hastings (as we will soon see). Here, we present the basic Metropolis
algorithm and its generalization to the Metropolis-Hastings algorithm, which is
often useful in applications (and has many extensions).

Suppose we can sample from p(θ∣y). Then we could generate

θ^(1), . . . , θ^(S) iid∼ p(θ∣y)

and obtain Monte Carlo approximations of posterior quantities:

E[g(θ)∣y] ≈ (1/S) ∑_{i=1}^{S} g(θ^(i)).

But what if we cannot sample directly from p(θ∣y)? The important concept here
is that we are able to construct a large collection of θ values (rather than
them being iid, since in most realistic situations this will not hold). Thus,
for any two different θ values θa and θb, we need

#{θ's in the collection equal to θa} / #{θ's in the collection equal to θb} ≈ p(θa∣y)/p(θb∣y).

How might we intuitively construct such a collection?

• Suppose we have a working collection {θ(1) , . . . , θ(s) } and we want to add


a new value θ(s+1) .

• Consider adding a value θ∗ which is nearby θ(s) .

• Should we include θ∗ or not?

• If p(θ∗ ∣y) > p(θ(s) ∣y), then we want more θ∗ ’s in the set than θ(s) ’s.

• But if p(θ∗ ∣y) < p(θ(s) ∣y), we shouldn’t necessarily include θ∗ .

Based on the above, perhaps our decision to include θ* or not should be based
upon a comparison of p(θ*∣y) and p(θ^(s)∣y). We can do this by computing r:

r = p(θ*∣y)/p(θ^(s)∣y) = [p(y ∣ θ*)p(θ*)] / [p(y ∣ θ^(s))p(θ^(s))].

Having computed r, what should we do next?

• If r > 1 (intuition): Since θ^(s) is already in our set, we should include θ*
as it has a higher probability than θ^(s).

(procedure): Accept θ* into our set and let θ^(s+1) = θ*.

• If r < 1 (intuition): The relative frequency of θ-values in our set equal to
θ* compared to those equal to θ^(s) should be

p(θ*∣y)/p(θ^(s)∣y) = r.

This means that for every instance of θ^(s), we should only have a fraction
of an instance of a θ* value.

(procedure): Set θ^(s+1) equal to either θ* or θ^(s) with probability r and
1 − r, respectively.

This is the basic intuition behind the Metropolis (1953) algorithm. More
formally:

• It proceeds by sampling a proposal value θ* nearby the current value θ^(s)
using a symmetric proposal distribution J(θ* ∣ θ^(s)).
• What does symmetry mean here? It means that J(θa ∣ θb ) = J(θb ∣ θa ).
That is, the probability of proposing θ∗ = θa given that θ(s) = θb is equal
to the probability of proposing θ∗ = θb given that θ(s) = θa .

• Symmetric proposals include:

J(θ∗ ∣ θ(s) ) = Uniform(θ(s) − δ, θ(s) + δ)


and
J(θ∗ ∣ θ(s) ) = Normal(θ(s) , δ 2 ).

The Metropolis algorithm proceeds as follows:

1. Sample θ* ∼ J(θ ∣ θ^(s)).

2. Compute the acceptance ratio

r = p(θ*∣y)/p(θ^(s)∣y) = [p(y ∣ θ*)p(θ*)] / [p(y ∣ θ^(s))p(θ^(s))].

3. Let

θ^(s+1) = θ* with probability min(r, 1), and θ^(s+1) = θ^(s) otherwise.

Remark: Step 3 can be accomplished by sampling u ∼ Uniform(0, 1) and setting


θ(s+1) = θ∗ if u < r and setting θ(s+1) = θ(s) otherwise.

Example 5.9: Metropolis for Normal-Normal


Let’s test out the Metropolis algorithm for the conjugate Normal-Normal model
with a known variance situation.

That is, let

X1 , . . . , Xn ∣ θ ∼ Normal(θ, σ 2 )  (iid)
θ ∼ Normal(µ, τ 2 ).

Recall that the posterior of θ is Normal(µn , τn2 ), where

µn = [(n/σ 2 ) / (n/σ 2 + 1/τ 2 )] x̄ + [(1/τ 2 ) / (n/σ 2 + 1/τ 2 )] µ

and

τn2 = 1 / (n/σ 2 + 1/τ 2 ).

Suppose (taken from Hoff, 2009), σ 2 = 1, τ 2 = 10, µ = 5, n = 5, and y =


(9.37, 10.18, 9.16, 11.60, 10.33). For these data, µn = 10.03 and τn2 = 0.20.

Suppose that for some ridiculous reason we cannot derive the posterior
distribution in closed form and instead need the Metropolis algorithm to
approximate it (note how incredibly silly this is here; the example is just
to illustrate the method).

Based on this model and prior, we need to compute the acceptance ratio

r = p(θ∗ ∣x) / p(θ(s) ∣x) = [p(x∣θ∗ )p(θ∗ )] / [p(x∣θ(s) )p(θ(s) )]
  = [∏i dnorm(xi , θ∗ , σ) / ∏i dnorm(xi , θ(s) , σ)] × [dnorm(θ∗ , µ, τ ) / dnorm(θ(s) , µ, τ )].

In many cases, computing the ratio r directly can be numerically unstable;
this can be remedied by working with log r instead:

log r = ∑i [log dnorm(xi , θ∗ , σ) − log dnorm(xi , θ(s) , σ)]
      + [log dnorm(θ∗ , µ, τ ) − log dnorm(θ(s) , µ, τ )].

Then a proposal is accepted if log u < log r, where u is sampled from the
Uniform(0,1).

The R code below generates 10,000 iterations of the Metropolis algorithm
starting at θ(0) = 0 and using a normal proposal distribution with variance
δ = 2:

θ∗ ∼ Normal(θ(s) , δ).



Below is R-code for running the above model. Figure 5.14 shows a trace plot for
this run as well as a histogram for the Metropolis algorithm compared with a
draw from the true normal density. From the trace plot, although the value of
θ does not start near the posterior mean of 10.03, it quickly arrives there after
just a few iterations. The second plot shows that the empirical distribution of
the simulated values is very close to the true posterior distribution.
Figure 5.14: Results from the Metropolis sampler for the normal model.

## initializing values for normal-normal example and setting seed


# MH algorithm for one-sample normal problem with known variance

s2<-1
t2<-10 ; mu<-5; set.seed(1); n<-5; y<-round(rnorm(n,10,1),2)
mu.n<-( mean(y)*n/s2 + mu/t2 )/( n/s2+1/t2)
t2.n<-1/(n/s2+1/t2)

####metropolis part####
y<-c(9.37, 10.18, 9.16, 11.60, 10.33)
##S = total num of simulations
theta<-0 ; delta<-2 ; S<-10000 ; THETA<-NULL ; set.seed(1)

for(s in 1:S)
{

## simulating our proposal


theta.star<-rnorm(1,theta,sqrt(delta))

##taking the log of the ratio r


log.r<-( sum(dnorm(y,theta.star,sqrt(s2),log=TRUE)) +
dnorm(theta.star,mu,sqrt(t2),log=TRUE) ) -
( sum(dnorm(y,theta,sqrt(s2),log=TRUE)) +
dnorm(theta,mu,sqrt(t2),log=TRUE) )

if(log(runif(1))<log.r) { theta<-theta.star }

##updating THETA
THETA<-c(THETA,theta)
}

##two plots: trace of theta and comparing the empirical distribution


##of simulated values to the true posterior

pdf("metropolis_normal.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))

skeep<-seq(10,S,by=10)
plot(skeep,THETA[skeep],type="l",xlab="iteration",ylab=expression(theta))

hist(THETA[-(1:50)],prob=TRUE,main="",xlab=expression(theta),ylab="density")
th<-seq(min(THETA),max(THETA),length=100)
lines(th,dnorm(th,mu.n,sqrt(t2.n)) )

dev.off()

◯ Metropolis-Hastings Algorithm

Recall that a Markov chain is a sequentially generated sequence {x(1) , x(2) , . . .}
such that the mechanism that generates x(s+1) can depend on the value of x(s)
but not on anything in the sequence before it. A better way of putting
this: for a Markov chain, the future depends on the present and not on the past.

The Gibbs sampler and the Metropolis algorithm are both ways of generating
Markov chains that approximate a target probability distribution.

We first consider a simple example where our target probability distribution is


po (u, v), a bivariate distribution for two random variables U and V. In the one-
sample normal problem, we would have U = θ, V = σ 2 and po (u, v) = p(θ, σ 2 ∣y).

What does the Gibbs sampler have us do? It has us iteratively sample values
of U and V from their conditional distributions. That is,

1. update U ∶ sample u(s+1) ∼ po (u ∣ v (s) )

2. update V ∶ sample v (s+1) ∼ po (v ∣ u(s+1) ).

In contrast, Metropolis proposes changes to X = (U, V ) and then accepts or
rejects those changes based on po . An alternative way to implement the Metropolis
algorithm is to propose, and then accept or reject, a change to one element at a
time:

1. update U ∶

(a) sample u∗ ∼ Ju (u ∣ u(s) )

(b) compute r = po (u∗ , v (s) ) / po (u(s) , v (s) )

(c) set u(s+1) equal to u∗ or u(s) with probability min(1, r) and max(0, 1 − r), respectively.

2. update V ∶

(a) sample v ∗ ∼ Jv (v ∣ v (s) )

(b) compute r = po (u(s+1) , v ∗ ) / po (u(s+1) , v (s) )

(c) set v (s+1) equal to v ∗ or v (s) with probability min(1, r) and max(0, 1 − r), respectively.

Here, Ju and Jv are separate symmetric proposal distributions for U and V.



• The Metropolis algorithm generates proposals from Ju and Jv


• It accepts them with some probability min(1,r).
• Similarly, each step of Gibbs can be seen as generating a proposal from a
full conditional and then accepting it with probability 1.
• The Metropolis-Hastings (MH) algorithm generalizes both of these ap-
proaches by allowing arbitrary proposal distributions.
• The proposal distributions can be symmetric around the current values,
full conditionals, or something else entirely. A MH algorithm for approx-
imating po (u, v) runs as follows:

1. update U ∶
(a) sample u∗ ∼ Ju (u ∣ u(s) , v (s) )
(b) compute

r = [po (u∗ , v (s) ) / po (u(s) , v (s) )] × [Ju (u(s) ∣ u∗ , v (s) ) / Ju (u∗ ∣ u(s) , v (s) )]

(c) set u(s+1) equal to u∗ or u(s) with probability min(1, r) and max(0, 1 − r).
2. update V ∶
(a) sample v ∗ ∼ Jv (v ∣ u(s+1) , v (s) )
(b) compute

r = [po (u(s+1) , v ∗ ) / po (u(s+1) , v (s) )] × [Jv (v (s) ∣ u(s+1) , v ∗ ) / Jv (v ∗ ∣ u(s+1) , v (s) )]

(c) set v (s+1) equal to v ∗ or v (s) with probability min(1, r) and max(0, 1 − r).

In the above algorithm, the proposal distributions Ju and Jv are not required to
be symmetric. The only requirement is that they not depend on U or V values
in our sequence previous to the most current values. This requirement ensures
that the sequence is a Markov chain.

Doesn’t the algorithm above look familiar? Yes, it looks a lot like Metropolis,
except the acceptance ratio r contains an extra factor:

• It contains the ratio of the prob of generating the current value from the
proposed to the prob of generating the proposed from the current.
• This can be viewed as a correction factor.
• If a value u∗ is much more likely to be proposed than the current value
u(s) then we must down-weight the probability of accepting u.

• Otherwise, such a value u∗ will be overrepresented in the chain. (A small sketch of this correction appears below.)
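To make the correction factor concrete, here is a small sketch in R, not from the notes: a single MH update for a variance parameter s2 > 0, proposed on the log scale. On the original scale this is an asymmetric log-normal proposal, and the ratio J(s2 ∣ s2∗ )/J(s2∗ ∣ s2) reduces to s2∗ /s2; log.post is a hypothetical user-supplied log-target.

## Sketch of one MH step with an asymmetric proposal (hypothetical helper).
## Proposing on the log scale is a log-normal proposal on the original scale;
## the correction factor J(s2 | s2.star)/J(s2.star | s2) equals s2.star/s2,
## so its log is log(s2.star) - log(s2).
mh.variance.step <- function(log.post, s2, delta) {
  s2.star <- s2 * exp(rnorm(1, 0, delta))
  log.r <- log.post(s2.star) - log.post(s2) + log(s2.star) - log(s2)
  if (log(runif(1)) < log.r) s2.star else s2
}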

Exercise 1: Show that Metropolis is a special case of MH. Hint: Think about
the jumps J.

Exercise 2: Show that Gibbs is a special case of MH. Hint: Show that r = 1.

Note: The MH algorithm can easily be generalized.


Example 5.10: Poisson Regression

We implement the Metropolis algorithm for a Poisson regression model.

• We have a sample from a population of 52 song sparrows that was stud-


ied over the course of a summer and their reproductive activities were
recorded.
• In particular, their age and number of new offspring were recorded for
each sparrow (Arcese et al., 1992).
• A simple probability model to fit the data would be a Poisson regression
where, Y = number of offspring conditional on x = age.

Thus, we assume that

Y ∣ θx ∼ Poisson(θx ).

For stability of the model, we assume that the mean number of offspring θx is
a smooth function of age. Thus, we could express θx = β1 + β2 x + β3 x2 .

Remark: This parameterization allows some values of θx to be negative, so as
an alternative we reparameterize and model the log-mean of Y, so that

log E(Y ∣x) = log θx = β1 + β2 x + β3 x2 ,

which implies that

θx = exp(β1 + β2 x + β3 x2 ) = exp(β T x).

Now back to the problem of implementing Metropolis. For this problem, we will
write

log E(Yi ∣xi ) = β1 + β2 xi + β3 x2i = β T xi ,

where xi is the age of sparrow i. We will abuse notation slightly and write
xi = (1, xi , x2i ).

• We will assume the prior on the regression coefficients is iid Normal(0, 100).

• Given a current value β (s) and a value β ∗ generated from J(β ∗ ∣ β (s) ), the
acceptance ratio for the Metropolis algorithm is

r = p(β ∗ ∣X, y) / p(β (s) ∣X, y)
  = [∏_{i=1}^n dpois(yi , exp(xTi β ∗ )) / ∏_{i=1}^n dpois(yi , exp(xTi β (s) ))] × [∏_{j=1}^3 dnorm(βj∗ , 0, 10) / ∏_{j=1}^3 dnorm(βj(s) , 0, 10)].

• We just need to specify the proposal distribution for β ∗ .
• A convenient choice is a multivariate normal distribution with mean β (s) .
• In many problems, the posterior variance can be an efficient choice of
proposal variance. But we don't know it here.
• However, it's often sufficient to use a rough approximation. In a normal
regression problem, the posterior variance will be close to σ 2 (X T X)−1 ,
where σ 2 is the variance of Y.

In our problem, E[log Y ] ≈ β T x, so we can try a proposal variance of σ̂ 2 (X T X)−1 ,
where σ̂ 2 is the sample variance of log(y + 1/2).

Remark: Note we add 1/2 because otherwise log 0 is undefined. The code
implementing the algorithm is included below.
Figure 5.15: Plot of the Markov chain in β3 along with autocorrelation functions.

###example 5.10 -- sparrow Poisson regression


yX.sparrow<-dget("http://www.stat.washington.edu/~hoff/Book/Data/data/yX.sparrow")

### sample from the multivariate normal distribution


rmvnorm<-function(n,mu,Sigma)
{
p<-length(mu)
res<-matrix(0,nrow=n,ncol=p)
if( n>0 & p>0 )
{

E<-matrix(rnorm(n*p),n,p)
res<-t( t(E%*%chol(Sigma)) +c(mu))
}
res
}

y<- yX.sparrow[,1]; X<- yX.sparrow[,-1]


n<-length(y) ; p<-dim(X)[2]

pmn.beta<-rep(0,p)
psd.beta<-rep(10,p)

var.prop<- var(log(y+1/2))*solve( t(X)%*%X )


beta<-rep(0,p)
S<-10000
BETA<-matrix(0,nrow=S,ncol=p)
ac<-0
set.seed(1)

for(s in 1:S) {

#propose a new beta

beta.p<- t(rmvnorm(1, beta, var.prop ))

lhr<- sum(dpois(y,exp(X%*%beta.p),log=T)) -
sum(dpois(y,exp(X%*%beta),log=T)) +
sum(dnorm(beta.p,pmn.beta,psd.beta,log=T)) -
sum(dnorm(beta,pmn.beta,psd.beta,log=T))

if( log(runif(1))< lhr ) { beta<-beta.p ; ac<-ac+1 }

BETA[s,]<-beta
}
cat(ac/S,"\n")

#######

library(coda)
apply(BETA,2,effectiveSize)

####
pdf("sparrow_plot1.pdf",family="Times",height=1.75,width=5)

par(mar=c(2.75,2.75,.5,.5),mgp=c(1.7,.7,0))
par(mfrow=c(1,3))
blabs<-c(expression(beta[1]),expression(beta[2]),expression(beta[3]))
thin<-c(1,(1:1000)*(S/1000))
j<-3
plot(thin,BETA[thin,j],type="l",xlab="iteration",ylab=blabs[j])
abline(h=mean(BETA[,j]) )

acf(BETA[,j],ci.col="gray",xlab="lag")
acf(BETA[thin,j],xlab="lag/10",ci.col="gray")
dev.off()
####

◯ Metropolis and Gibbs Combined

In complex models, it is often the case that the conditional distributions are
available for some parameters but not for others. What can we do then? In these
situations we can combine Gibbs and Metropolis-type proposal distributions to
generate a Markov chain to approximate the joint posterior distribution of all
the parameters.

• Here, we look at an example of estimating the parameters in a regression


model for time-series data, where the errors are temporally correlated.
• The full conditionals are available for the regression parameters here, but
not the parameter describing the dependence among the observations.
Example 5.11: Historical CO2 and temperature data

Analyses of ice cores from East Antarctica have allowed scientists to deduce
historical atmospheric conditions over the last few hundred thousand years (Petit et al, 1999).
Figure 5.16 plots time-series of temperature and carbon dioxide concentration
on a standardized scale (centered and scaled to have mean zero and variance
one).

• The data include 200 values of temperature measured at roughly equal


time intervals, with time between consecutive measurements being around
2,000 years.
• For each value of temperature there is a CO2 concentration value that
corresponds to data that is about 1,000 years previous to the temperature
value (on average).
• Temperature is recorded in terms of its difference from the present-day
temperature in degrees Celsius, and CO2 concentration is recorded in parts
per million by volume.
Figure 5.16: Temperature and carbon dioxide data.

• The plot indicates that the temporal histories of temperature and CO2 follow
very similar patterns.
• The second plot in Figure 5.16 indicates that CO2 concentration at a given
time is predictive of temperature following that time point.
• We can quantify this using a linear regression model for temperature (Y )
as a function of CO2 (x).
• The validity of the standard error relies on the error terms in the regression
model being iid, and standard confidence intervals further rely on the errors
being normally distributed.

• These two assumptions are examined in the two residual diagnostic plots
in Figure 5.17.

• The first plot shows a histogram of the residuals and indicates no serious
deviation from normality.

• The second plot gives the autocorrelation function of the residuals, indicating
a nontrivial correlation of 0.52 between residuals at consecutive
time points.
• Such a positive correlation generally implies there is less information in
the data, and less evidence for a relationship between the two variables, than
is assumed by the OLS regression analysis.
Figure 5.17: Residual diagnostics for the OLS fit: histogram of the residuals (left) and their autocorrelation function (right).

The ordinary regression model is

Y ∼ N (Xβ, σ 2 I).

The diagnostic plots suggest that a more appropriate model for the ice core data
is one in which the error terms are not independent, but temporally correlated.

We will replace σ 2 I with a covariance matrix Σ that can represent the posi-
tive correlation between sequential observations. One simple, popular class of
covariance matrices for temporally correlated data are those having first order
autoregressive structure:

         ⎛ 1        ρ        ρ2      . . .  ρ^{n−1} ⎞
         ⎜ ρ        1        ρ       . . .  ρ^{n−2} ⎟
Σ = σ 2 Cρ = σ 2 ⎜ ρ2       ρ        1       . . .          ⎟
         ⎜ ⋮                          ⋱              ⎟
         ⎝ ρ^{n−1}  ρ^{n−2}          . . .  1        ⎠

Under this covariance matrix the variance of Yi ∣ β, xi is σ 2 but the correlation
between Yi and Yi+t is ρt . Using a multivariate normal prior for β and an
inverse-gamma prior for σ 2 , it is left as an exercise to show that

β ∣ X, y, σ 2 , ρ ∼ N (βn , Σn ),
σ 2 ∣ X, y, β, ρ ∼ IG((νo + n)/2, [νo σo2 + SSRρ ]/2),

where Σn = (X T Cρ−1 X/σ 2 + Σ−1
o )−1 , βn = Σn (X T Cρ−1 y/σ 2 + Σ−1
o βo ), and
SSRρ = (y − Xβ)T Cρ−1 (y − Xβ).

• If Σo has large diagonal entries, then βn is very close to

(X T Cρ−1 X)−1 X T Cρ−1 y.

• If ρ were known, this would be the generalized least squares (GLS) estimate
of β.

• This is a type of weighted least squares estimate that is used when the error terms are
not iid. In such situations, both OLS and GLS provide unbiased estimates
of β, but GLS has lower variance.
• Bayesian analysis using a model that accounts for correlated errors
provides parameter estimates that are similar to those of GLS, so for
convenience we will refer to our analysis as “Bayesian GLS.”

If we knew the value of ρ, we could just implement Gibbs to approximate
p(β, σ 2 ∣X, y, ρ). However, ρ is unknown, and its full conditional distribution is
nonstandard for most prior distributions, suggesting that a pure Gibbs sampler
isn't applicable. What can we do instead?

We can use the generality of the MH algorithm. Recall that we are allowed to use
different proposals at each step, so we can iteratively update β, σ 2 , and ρ in
separate steps (using Gibbs-type proposals for the first two). That is:

• We will make proposals for β and σ 2 using the full conditionals, and
• make a symmetric proposal for ρ.
• Following the rules of MH, we accept with probability 1 any proposal coming
from a full conditional distribution, whereas we have to calculate an
acceptance probability for proposals of ρ.

We run the following algorithm:

1. Update β: Sample β (s+1) ∼ N (βn , Σn ), where βn and Σn depend on σ 2(s)


and ρ(s) .

2. Update σ 2 : Sample σ 2(s+1) ∼IG( (νo +n)/2, [νo σo2 +SSRρ ]/2) where SSRρ
depends on β (s+1) and ρ(s) .

3. Update ρ ∶
(a) Propose ρ∗ ∼ Uniform(ρ(s) − δ, ρ(s) + δ). If ρ∗ < 0 then
reassign it to be ∣ρ∗ ∣. If ρ∗ > 1 then reassign it to be 2 − ρ∗ .
(b) Compute the acceptance ratio

r = [p(y ∣ X, β (s+1) , σ 2(s+1) , ρ∗ ) / p(y ∣ X, β (s+1) , σ 2(s+1) , ρ(s) )] × [p(ρ∗ ) / p(ρ(s) )]

and sample u ∼ Uniform(0, 1). If u < r, set ρ(s+1) = ρ∗ ; otherwise ρ(s+1) = ρ(s) .

The proposal used in Step 3(a) is called a reflecting random walk, which ensures
that 0 < ρ < 1. Note that a sequence of MH steps in which each parameter is
updated is often referred to as a scan of the algorithm.

Exercise: Show that the proposal is symmetric.

For convenience and ease, we're going to use diffuse priors for the parameters,
with βo = 0, Σo = diag(1000), νo = 1, and σo2 = 1. Our prior on ρ will be
Uniform(0,1). We first run 1,000 iterations of the MH algorithm and show a trace
plot of ρ as well as an autocorrelation plot (Figure 5.18).

Suppose now we want to generate 25,000 scans for a total of 100,000 parameter
values. The Markov chain is highly correlated, so we keep only every 25th value in
the chain (Figure 5.19). This thinning reduces the autocorrelation.

The Monte Carlo approximation of the posterior density of β2 (the slope)
appears in Figure 5.20. The posterior mean is 0.028 with a 95 percent posterior
credible interval of (0.01, 0.05), indicating that the relationship between
temperature and CO2 is positive. As indicated in the second plot, this relationship
seems much weaker than suggested by the OLS estimate of 0.08. For the OLS
estimation, the small number of data points with high y-values have a large
influence on the estimate of β. The GLS model, on the other hand, recognizes that
many of these extreme points are highly correlated with one another and
down-weights their influence.

Remark: this weaker regression coefficient is a result of the temporally
correlated data, and not of the particular prior distribution we used or the
Bayesian approach in general.
Figure 5.18: The first 1,000 values of ρ generated from the Markov chain.
Figure 5.19: Every 25th value of ρ generated from the Markov chain of length
25,000.
Figure 5.20: Posterior distribution of the slope parameter β2 and posterior
mean regression line (from the Markov chain of length 25,000, thinned
every 25 scans).

Exercise: Repeat the analysis with different prior distributions and perform
non-Bayesian GLS for comparison.

#####
##example 5.10 in notes
# MH and Gibbs problem
##temperature and co2 problem

source("http://www.stat.washington.edu/~hoff/Book/Data/data/chapter10.r")

### sample from the multivariate normal distribution


rmvnorm<-function(n,mu,Sigma)
{
p<-length(mu)
res<-matrix(0,nrow=n,ncol=p)
if( n>0 & p>0 )
{
E<-matrix(rnorm(n*p),n,p)
res<-t( t(E%*%chol(Sigma)) +c(mu))
}

res
}
###

##reading in the data and storing it


dtmp<-as.matrix(read.table("volstok.txt",header=F), sep = "-")
dco2<-as.matrix(read.table("co2.txt",header=F, sep = "\t"))
dtmp[,2]<- -dtmp[,2]
dco2[,2]<- -dco2[,2]
library(nlme)

#### get evenly spaced temperature points


ymin<-max( c(min(dtmp[,2]),min(dco2[,2])))
ymax<-min( c(max(dtmp[,2]),max(dco2[,2])))
n<-200
syear<-seq(ymin,ymax,length=n)
dat<-NULL
for(i in 1:n) {
tmp<-dtmp[ dtmp[,2]>=syear[i] ,]
dat<-rbind(dat, tmp[dim(tmp)[1],c(2,4)] )
}
dat<-as.matrix(dat)
####

####
dct<-NULL
for(i in 1:n) {
xc<-dco2[ dco2[,2] < dat[i,1] ,,drop=FALSE]
xc<-xc[ 1, ]
dct<-rbind(dct, c( xc[c(2,4)], dat[i,] ) )
}

mean( dct[,3]-dct[,1])

dct<-dct[,c(3,2,4)]
colnames(dct)<-c("year","co2","tmp")
rownames(dct)<-NULL
dct<-as.data.frame(dct)

##looking at temporal history of co2 and temperature


########
pdf("temp_co2.pdf",family="Times",height=1.75,width=5)
par(mar=c(2.75,2.75,.5,.5),mgp=c(1.7,.7,0))
layout(matrix( c(1,1,2),nrow=1,ncol=3) )

#plot(dct[,1],qnorm( rank(dct[,3])/(length(dct[,3])+1 )) ,
plot(dct[,1], (dct[,3]-mean(dct[,3]))/sd(dct[,3]) ,
type="l",col="black",
xlab="year",ylab="standardized measurement",ylim=c(-2.5,3))
legend(-115000,3.2,legend=c("temp",expression(CO[2])),bty="n",
lwd=c(2,2),col=c("black","gray"))
lines(dct[,1], (dct[,2]-mean(dct[,2]))/sd(dct[,2]),
#lines(dct[,1],qnorm( rank(dct[,2])/(length(dct[,2])+1 )),
type="l",col="gray")

plot(dct[,2], dct[,3],xlab=expression(paste(CO[2],"(ppmv)")),
ylab="temperature difference (deg C)")
dev.off()
########

##residual analysis for the least squares estimation


########
pdf("residual_analysis.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))

lmfit<-lm(dct$tmp~dct$co2)
hist(lmfit$res,main="",xlab="residual",ylab="frequency")
#plot(dct$year, lmfit$res,xlab="year",ylab="residual",type="l" ); abline(h=0)
acf(lmfit$res,ci.col="gray",xlab="lag")
dev.off()
########

##BEGINNING THE GIBBS WITHIN METROPOLIS

######## starting values (DIFFUSE)


n<-dim(dct)[1]
y<-dct[,3]
X<-cbind(rep(1,n),dct[,2])
DY<-abs(outer( (1:n),(1:n) ,"-"))

lmfit<-lm(y~-1+X)
fit.gls <- gls(y~X[,2], correlation=corARMA(p=1), method="ML")
beta<-lmfit$coef
s2<-summary(lmfit)$sigma^2
phi<-acf(lmfit$res,plot=FALSE)$acf[2]
nu0<-1 ; s20<-1 ; T0<-diag(1/1000,nrow=2)
###
set.seed(1)

###number of MH steps
S<-25000 ; odens<-S/1000
OUT<-NULL ; ac<-0 ; par(mfrow=c(1,2))
library(psych)
for(s in 1:S)
{

Cor<-phi^DY ; iCor<-solve(Cor)
V.beta<- solve( t(X)%*%iCor%*%X/s2 + T0)
E.beta<- V.beta%*%( t(X)%*%iCor%*%y/s2 )
beta<-t(rmvnorm(1,E.beta,V.beta) )

s2<-1/rgamma(1,(nu0+n)/2,(nu0*s20+t(y-X%*%beta)%*%iCor%*%(y-X%*%beta)) /2 )

phi.p<-abs(runif(1,phi-.1,phi+.1))
phi.p<- min( phi.p, 2-phi.p)
lr<- -.5*( determinant(phi.p^DY,log=TRUE)$mod -
determinant(phi^DY,log=TRUE)$mod +
tr( (y-X%*%beta)%*%t(y-X%*%beta)%*%(solve(phi.p^DY) -solve(phi^DY)) )/s2 )

if( log(runif(1)) < lr ) { phi<-phi.p ; ac<-ac+1 }

if(s%%odens==0)
{
cat(s,ac/s,beta,s2,phi,"\n") ; OUT<-rbind(OUT,c(beta,s2,phi))
# par(mfrow=c(2,2))
# plot(OUT[,1]) ; abline(h=fit.gls$coef[1])
# plot(OUT[,2]) ; abline(h=fit.gls$coef[2])
# plot(OUT[,3]) ; abline(h=fit.gls$sigma^2)
# plot(OUT[,4]) ; abline(h=.8284)

}
}
#####

OUT.25000<-OUT
library(coda)
apply(OUT,2,effectiveSize )

OUT.25000<-dget("data.f10_10.f10_11")
apply(OUT.25000,2,effectiveSize )

pdf("trace_auto_1000.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))

par(mfrow=c(1,2))
## OUT.1000: assumed to hold the output of an analogous run with S<-1000 (not shown)
plot(OUT.1000[,4],xlab="scan",ylab=expression(rho),type="l")
acf(OUT.1000[,4],ci.col="gray",xlab="lag")
dev.off()

pdf("trace_thin_25.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
plot(OUT.25000[,4],xlab="scan/25",ylab=expression(rho),type="l")
acf(OUT.25000[,4],ci.col="gray",xlab="lag/25")
dev.off()

pdf("fig10_11.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))

plot(density(OUT.25000[,2],adj=2),xlab=expression(beta[2]),
ylab="posterior marginal density",main="")

plot(y~X[,2],xlab=expression(CO[2]),ylab="temperature")
abline(mean(OUT.25000[,1]),mean(OUT.25000[,2]),lwd=2)
abline(lmfit$coef,col="gray",lwd=2)
legend(180,2.5,legend=c("GLS estimate","OLS estimate"),bty="n",
lwd=c(2,2),col=c("black","gray"))
dev.off()

quantile(OUT.25000[,2],probs=c(.025,.975) )

plot(X[,2],y,type="l")
points(X[,2],y,cex=2,pch=19)
points(X[,2],y,cex=1.9,pch=19,col="white")
text(X[,2],y,1:n)

iC<-solve( mean(OUT[,4])^DY )
Lev.gls<-solve(t(X)%*%iC%*%X)%*%t(X)%*%iC
Lev.ols<-solve(t(X)%*%X)%*%t(X)

plot(y,Lev.ols[2,] )
plot(y,Lev.gls[2,] )

5.6 Introduction to Nonparametric Bayes

As we have seen, Bayesian parametric methods apply the classical machinery of
prior and posterior distributions to models with a finite number of parameters.
The number of parameters in such models is often kept low for computational
reasons; in current research problems, however, we deal with high-dimensional
data and high-dimensional parameters. The origins of Bayesian methods date to
the mid-1700s, and the field is still thriving today. The applicability of
Bayesian parametric models remains wide and has grown with the advancements of
modern computing and the growth of methods available in Markov chain Monte Carlo.

Frequentist nonparametrics covers a wide array of areas in statistics. The area
is well known for testing procedures that are, or become asymptotically,
distribution free, which lead to nonparametric confidence intervals, bands,
etc. (Hjort et al., 2010). Further information on these methods can be found in
Wasserman (2006).

Nonparametric Bayesian methods are models and methods characterized generally
by large parameter spaces, such as spaces of unknown density and regression
functions, and by the construction of probability measures over these spaces.
Typical examples seen in practice include density estimation, nonparametric
regression with fixed error distributions, and hazard rate and survival function
estimation. For a thorough introduction to this subject see Hjort et al. (2010).

◯ Motivations

The motivation is the following:

• We have X1 , . . . , Xn ∼ F (iid), with F ∈ F. We usually assume that F is a
parametric family.

• In that case, putting a prior on F amounts to putting a prior on Rd for some d.

• We would like to be able to put a prior on the set of all cdf's. And we
would like the prior to have some basic features:

1. The prior should have large support.

2. The prior should give rise to posteriors which are analytically tractable or
computationally manageable.

3. We should be able to center the prior around a given parametric family.



◯ The Dirichlet Process

Review of Finite Dimensional Dirichlet Distribution

• This is a distribution on the k-dimensional simplex.

• Let (α1 , . . . , αk ) be such that αj > 0 for all j.

• The Dirichlet distribution with parameter vector (α1 , . . . , αk ) has density

p(θ) = [Γ(α1 + ⋯ + αk ) / ∏_{j=1}^k Γ(αj )] θ1^{α1 −1} ⋯ θk^{αk −1} .

• It is conjugate to the Multinomial distribution. That is, if Y = (Y1 , . . . , Yk ) ∼ Multinomial(n, θ)
and θ ∼ Dir(α1 , . . . , αk ), then it can be shown that

θ ∣ y ∼ Dir(α1 + y1 , . . . , αk + yk ).

• It can be shown that E(θj ) = αj /α, where α = ∑j αj , and that

Var(θj ) = αj (α − αj ) / [α2 (α + 1)].

A small numerical illustration of these facts follows below.
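As a quick check of these facts, here is a small R sketch (not from the notes): it draws from a Dirichlet by normalizing independent gamma variables, a standard construction, and verifies the Multinomial-Dirichlet conjugacy numerically. The function name rdirichlet is a hypothetical helper.

## Sketch: Dirichlet draws via normalized gammas, plus a conjugacy check.
rdirichlet <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = alpha), nrow = n, ncol = k, byrow = TRUE)
  g / rowSums(g)                                # each row is one Dir(alpha) draw
}
set.seed(1)
alpha <- c(2, 3, 4)
y <- as.vector(rmultinom(1, size = 50, prob = c(.2, .3, .5)))  # multinomial counts
post <- rdirichlet(10000, alpha + y)            # posterior is Dir(alpha + y)
colMeans(post)                                  # Monte Carlo posterior mean
(alpha + y) / sum(alpha + y)                    # exact posterior mean, for comparison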

Infinite Dimensional Dirichlet Distribution

Let α be a finite (non-null) measure on R (think of a probability distribution
multiplied by a total mass). Sometimes α will be called the concentration
parameter (in scenarios where we hand-wave the measure theory, for example, or
it's not needed).
You should think about the infinite-dimensional Dirichlet distribution as
a distribution on distributions, as we will soon see.
Definition 5.1: F has the Dirichlet distribution with parameter (measure)
α if for every finite measurable partition A1 , . . . , Ak of R the k-dimensional
random vector (F (A1 ), . . . , F (Ak )) has the finite k-dimensional Dirichlet
distribution

Dir(α(A1 ), . . . , α(Ak )).

For more on this see: Freedman (1963), Ferguson (1973, 1974).

Intuition: Each F (Aj ) ∈ [0, 1] since F is a probability distribution. Also,

F (A1 ) + ⋯ + F (Ak ) = 1,

thus (F (A1 ), . . . , F (Ak )) lives on the k-dimensional simplex.
Remark: For those with measure theory: α cannot be the zero measure, and it
must be finite. Note that Lebesgue measure isn't finite on the reals.

We will construct the Dirichlet process based on the “Polya urn scheme” of
Blackwell and MacQueen (1973), which gives an intuitive way to understand it.
This is one of the most intuitive approaches. Others in the literature include
Ferguson (1973, 1974), which gives two constructions. One, based on the
Kolmogorov extension theorem, is incorrect (the problem is that the relevant
sets aren't measurable, an important technical detail). The other is a correct
construction based on something called the gamma process (this involves much
overhead, including the existence of the gamma process).

◯ Polya Urn Scheme on an Urn With Finitely Many Colors
– Consider an urn containing a finite number of balls of k different
colors.
– There are α1 , . . . , αk balls of colors 1 . . . , k, respectively.
– We pick a ball at random, look at its color, return it to the urn
together with another ball of the same color.
– We repeat this indefinitely.
– Let p1 (n), . . . pk (n) be the proportions of balls of colors 1 . . . , k at
time n.
Example 5.12: Polya Urn for Three Balls
Suppose we have three colors of balls in our urn. Let red correspond to color 1,
blue to color 2, and green to color 3. Furthermore, suppose that P (red) = 2/9,
P (blue) = 3/9, and P (green) = 4/9.
Let αo be the following probability measure (or rather discrete probability
distribution):
– αo (1) = 2/9.
– αo (2) = 3/9.
– αo (3) = 4/9.
Another way of writing this is to define αo = (2/9) δ1 + (3/9) δ2 + (4/9) δ3 , where

δ1 (A) = 1 if 1 ∈ A, and 0 otherwise.

◯ Polya Urn Scheme in General


Let α be a finite measure on the space X = R.
1. (Step 1) Suppose X1 ∼ αo , where αo = α/α(R) is the normalized base measure.
2. (Step 2) Now create a new measure α + δX1 , where

δX1 (A) = 1 if X1 ∈ A, and 0 otherwise.

Then

X2 ∼ (α + δX1 ) / (α(R) + δX1 (R)) = (α + δX1 ) / (α(R) + 1).

Fact: δX1 (R) = 1. Think about why this is intuitively true.
What does the above equation really mean?
– α represents the original distribution of balls.
– δX1 represents the ball we just added.
Deeper understanding:
– Suppose the urn contained N total balls when we started.
– Then the probability that the second ball drawn, X2 , will be one of the
original N balls is N /(N + 1).
– This represents the α part of the distribution of X2 .
– We want the probability of drawing the new ball to be 1/(N + 1). This
goes with δX1 .
– When we write X2 ∼ (α + δX1 )/(norm. constant), we want N /(N + 1) of the
probability to go to α/(norm. constant) and 1/(N + 1) to go to δX1 /(norm. constant).
How does this continue? Since we want

δX1 (R) / (α(R) + 1) = 1/(N + 1),

this implies that

1 / (α(R) + 1) = 1/(N + 1) ⟹ α(R) = N.

Hence, we take α(R) = N ; plugging back in, we find that αo = α/N ⟹ α = αo N.
This implies that

X2 ∼ (αo N + δX1 ) / (N + 1),

which is now in terms of αo and N (which we know).

(Step 3) Continue forming new measures: α + δX1 + δX2 . Then

X3 ∼ (α + δX1 + δX2 ) / (α(R) + δX1 (R) + δX2 (R)) = (α + δX1 + δX2 ) / (α(R) + 2).

In general, it can be shown that

P (Xn+1 ∈ A ∣ X1 , . . . , Xn ) = [α(A) + ∑_{i=1}^n δXi (A)] / [α(R) + n] = [αo (A)N + ∑_{i=1}^n δXi (A)] / [N + n].
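The following R sketch (not from the notes) simulates a Polya urn sequence using exactly this predictive rule, with the three-color base measure of Example 5.12 and N = 9; polya.urn is a hypothetical helper. The empirical color proportions settle down along each run, though the limiting proportions themselves are random.

## Sketch: simulate a Polya urn sequence with a discrete base measure alpha0
## and total mass N, so P(X_i = j | past) = (N*alpha0[j] + count_j)/(N + i - 1).
polya.urn <- function(n, alpha0, N) {
  x <- integer(n)
  x[1] <- sample(seq_along(alpha0), 1, prob = alpha0)
  for (i in 2:n) {
    counts <- tabulate(x[1:(i - 1)], nbins = length(alpha0))
    x[i] <- sample(seq_along(alpha0), 1, prob = N * alpha0 + counts)
  }
  x
}
set.seed(1)
x <- polya.urn(5000, alpha0 = c(2, 3, 4) / 9, N = 9)
table(x) / length(x)   # proportions converge within a run, but the limit is random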

Polya Urn Scheme in General Case: Theorem

– Let α be a finite measure on a space X (this space can be very general,
but we will assume it's the reals).
– Define a sequence {X1 , X2 , . . .} of random variables to be a Polya
urn sequence with parameter measure α, written PUS(α), if
∗ P (X1 ∈ A) = α(A)/α(X ) = αo (A) for every Borel set A, and
∗ for every n and every Borel set A,

P (Xn+1 ∈ A ∣ X1 , . . . , Xn ) = [α(A) + ∑i δXi (A)] / [α(X ) + n].

◯ De Finetti and Exchangeability


Recall what exchangeability means. Suppose that Y1 , Y2 , . . . , Yn is a
sequence of random variables. This sequence is said to be exchangeable if

(Y1 , Y2 , . . . , Yn ) =d (Yπ(1) , Yπ(2) , . . . , Yπ(n) )

for every permutation π of 1, . . . , n.


Note: This means that we can permute the random variables and the
distribution doesn’t change.

An infinite sequence is said to be exchangeable if for every n, Y1 , Y2 , . . . , Yn
is exchangeable. That is, we don't require exchangeability under infinite
permutations, but it must be true for every finite “chunk” of length n.

Definition 5.2: De Finetti's General Theorem

Let X1 , X2 . . . be an infinite exchangeable sequence of random variables.
Then there exists a probability measure π such that

X1 , X2 , . . . ∣ F ∼ F  (iid)
F ∼ π.

Remark: In the binary case, suppose that X1 , X2 . . . is an infinite exchangeable
sequence of binary random variables. Then there exists a probability measure
(distribution) µ on [0, 1] such that for every n and any x1 , . . . , xn ∈ {0, 1},

P (X1 = x1 , . . . , Xn = xn ) = ∫01 p^{∑i xi} (1 − p)^{n−∑i xi} µ(dp),

where µ is the measure, or probability distribution, or prior that we take on p.

Theorem 5.1. A General Result (without proof)

Let X1 , X2 . . . be PUS(α). Then this can be thought of as a two-stage
process where
– F ∼ Dir(α)
– X1 , X2 , . . . ∣ F ∼ F  (iid).
If we consider the process consisting of X2 , X3 , . . . , then it is a
PUS(α + δX1 ). That is, F ∣ X1 ∼ Dir(α + δX1 ).
More generally, it can be shown that

F ∣ X1 , . . . , Xn ∼ Dir(α + ∑_{i=1}^n δXi ).

◯ Chinese Restaurant Process


– There are Bayesian NP approaches to many of the main issues in
statistics including:
∗ regression.
∗ classification.
∗ clustering.
∗ survival analysis.
∗ time series analysis.
∗ spatial data analysis.

– These generally involve assumptions of exchangeability or partial ex-


changeability.
∗ and corresponding distributions on random objects of various
kinds (functions, partitions, measures, etc.)
– We look at the problem of clustering for concreteness.

◯ Clustering: How to choose K?

– Ad hoc approaches (hierarchical clustering)
∗ these methods do yield a data-driven choice of K
∗ there is little understanding of how good these choices are (the checks
are ad hoc, based on some criterion)
– Methods based on objective functions (M -estimators)
∗ K-means, spectral clustering
∗ they come with some frequentist guarantees
∗ it's often hard to turn these into data-driven choices of K
– Parametric likelihood-based methods
∗ finite mixture models, Bayesian variants
∗ various model choice methods: hypothesis testing, cross-validation,
bootstrap, AIC, BIC, Laplace, reversible jump MCMC
∗ the assumptions underlying the method rarely apply to the setting
– Something different: the Chinese restaurant process.

Basic idea: In many data analysis settings, we don't know the number of
latent clusters and would like to learn it from the data. BNP clustering
addresses this by assuming there is an infinite number of latent clusters,
but that only a finite number of them is used to generate the observed
data. Under these assumptions, the posterior yields a distribution over
the number of clusters, the assignment of data to clusters, and the parameters
associated with each cluster. In addition, the predictive distribution, i.e.,
the assignment of the next data point, allows new data to be assigned to a
previously unseen cluster.
How does it work: BNP clustering finesses the problem of choosing the
number of clusters by assuming it is infinite, but it specifies a prior
over the infinite groupings P (c) that favors assigning data to a small
number of groups, where c refers to the cluster assignments. The prior over
groupings is a well-known object called the Chinese restaurant process (CRP),
a distribution over infinite partitions of the integers (Aldous, 1985;
Pitman, 2002).
Where does the name come from?

– Imagine that Sam and Mike own a restaurant with an infinite number
of tables.
– Imagine a sequence of customers entering their restaurant and sitting
down.
– The first customer (Liz) enters and sits at the first table.
– The second customer enters and sits at the first table with probability
1/(1 + α) and at a new table with probability α/(1 + α), where α is positive
and real.
– Liz is friendly and people would want to sit and talk with her. So,
we would assume that 1/(1 + α) is a high probability, meaning that α is a
small number.
– What happens with the nth customer?
∗ He sits at each of the previously occupied tables with probability
proportional to the number of previous customers sitting there.
∗ He sits at the next unoccupied table with probability proportional to α.

More formally, let cn be the table assignment of customer n. A draw from
this distribution can be generated by sequentially assigning observations
with probability

P (cn = k ∣ c1 , . . . , cn−1 ) = mk /(n − 1 + α) if k ≤ K+ (i.e., k is a previously occupied table),
P (cn = k ∣ c1 , . . . , cn−1 ) = α/(n − 1 + α) otherwise (i.e., k is the next unoccupied table),

where mk is the number of customers sitting at table k and K+ is the number
of tables for which mk > 0. The parameter α is called the concentration
parameter.
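This sequential description translates directly into code. Below is a short R sketch (not from the notes) that samples table assignments from a CRP(α) using exactly the probabilities displayed above; crp is a hypothetical helper.

## Sketch: sample table assignments c_1, ..., c_n from a CRP(alpha).
crp <- function(n, alpha) {
  c.assign <- integer(n)
  c.assign[1] <- 1
  for (i in 2:n) {
    m <- tabulate(c.assign[1:(i - 1)])          # occupancies m_k of current tables
    probs <- c(m, alpha) / (i - 1 + alpha)      # occupied tables, then a new table
    c.assign[i] <- sample(length(probs), 1, prob = probs)
  }
  c.assign
}
set.seed(1)
table(crp(100, alpha = 1))   # typically a few large tables and several small ones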

The rich just get richer


– CRP rule: next customer sits at a table with prob. proportional to
number of customers already sitting at it (and sits at new table with
prob. proportional to α).
– Customers tend to sit at most popular tables.
– Most popular tables attract the most new customers, and become
even more popular.
– CRPs exhibit power law behavior, where a few tables attract the bulk
of the customers.
– The concentration parameter α determines how likely a customer is
to sit at a fresh table.
More formally stated:
– A larger value of α will produce more occupied tables (and fewer
customers per table); a smaller value of α produces more customers at each table.
– The CRP exhibits an important invariance property: the cluster
assignments under this distribution are exchangeable.
– This means that p(c) is unchanged if the order of customers is shuffled
(up to label changes). This may be counter-intuitive, since the process
was just described sequentially.

The CRP and Clustering

– The data points refer to the customers and the tables are the clusters.
∗ Then the CRP defines a prior distribution on the partition of the
data and on the number of tables.
– The prior can be completed with:
∗ A likelihood, meaning there is a parameterized probability
distribution corresponding to each table.
∗ A prior for the parameters: the first customer to sit at table k
chooses the parameter vector for that table (φk ) from the prior.
– With these pieces, we have a distribution for any quantity we care about in
a clustering setting.
Now, let's write this process out formally. We're writing out a mixture
model with a component that's nonparametric.
Let's define the following:
– yn is the nth observation.
– cn is the latent cluster assignment that generates yn .
– F is a parametric family of distributions for yn .
– θk are the cluster parameters.
– Go is a general prior for the cluster parameters (this is the
nonparametric part).
We also assume that each observation is conditionally independent given
its latent cluster assignment and its cluster parameters.
Using the CRP, we can view the model as

yn ∣ cn , θ ∼ F (θcn )
c = (c1 , c2 , . . .) ∼ CRP(α)  (i.e., c has the CRP prior p(c))
θk ∼ Go .

We want to compute p(c ∣ y). By Bayes' rule,

p(c∣y) = p(y∣c)p(c) / ∑c p(y∣c)p(c),

where

p(y ∣ c) = ∫ [ ∏_{n=1}^N F (yn ∣ θcn ) ∏_{k=1}^K Go (θk ) ] dθ.

A Go that is conjugate allows this integral to be calculated analytically.
For example, the Gaussian is the conjugate prior to a Gaussian mean with fixed
variance (and thus a mixture of Gaussians model is computationally
convenient). We illustrate this specific example below.

Example 5.13: Suppose

yn ∣ cn , θ ∼ N (θcn , 1)
cn ∼ Multinomial(1, p)
θk ∼ N (µ, τ 2 ),

where p, µ, and τ 2 are known.

Then

p(y∣c) = ∫ [ ∏_{n=1}^N Normal(yn ; θcn , 1) × ∏_{k=1}^K Normal(θk ; µ, τ 2 ) ] dθ.

The term inside the integral is, as a function of θ, just another
(unnormalized) normal density, so we can integrate θ out as we have in problems
before. Once we calculate p(y∣c), we can simply plug this and p(c) into

p(c∣y) = p(y∣c)p(c) / ∑c p(y∣c)p(c).
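For this example the integral can be done in closed form: within a cluster of size m, the y's are jointly normal with mean µ and covariance I + τ 2 J (J the all-ones matrix), so each cluster's marginal log-likelihood is available analytically. The R sketch below (not from the notes; log.marg.cluster and log.marg are hypothetical helpers) computes log p(y ∣ c) this way and compares two candidate partitions of a toy dataset.

## Sketch: closed-form log p(y | c) for the model of Example 5.13.
log.marg.cluster <- function(y, mu, tau2) {
  m <- length(y); ybar <- mean(y)
  ## log of the N(mu*1, I + tau2*J) density at y, after integrating out theta_k
  -0.5 * m * log(2 * pi) - 0.5 * log(1 + m * tau2) -
    0.5 * (sum((y - ybar)^2) + m * (ybar - mu)^2 / (1 + m * tau2))
}
log.marg <- function(y, c.assign, mu, tau2)
  sum(sapply(unique(c.assign),
             function(k) log.marg.cluster(y[c.assign == k], mu, tau2)))

y <- c(-2.1, -1.9, -2.3, 2.0, 2.2, 1.8)                 # toy data
log.marg(y, c(1, 1, 1, 2, 2, 2), mu = 0, tau2 = 10)     # two-cluster partition
log.marg(y, rep(1, 6), mu = 0, tau2 = 10)               # one-cluster partition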
Example 5.14: Gaussian Mixture using R
Information on the R package profdpm:
This package facilitates inference at the posterior mode in a class of
conjugate product partition models (PPMs) by approximating the maximum
a posteriori (MAP) partition. The class of PPMs is motivated by an
augmented formulation of the Dirichlet process mixture, which is currently
the ONLY available member of this class. The profdpm package consists
of two model fitting functions, profBinary and profLinear, their associated
summary methods summary.profBinary and summary.profLinear,
and a function (pci) that computes several metrics of agreement between
two data partitions. However, the profdpm package was designed to be
extensible to other types of product partition models. For more on this
package, see help(profdpm) after installation.

– The following example simulates a dataset consisting of 99 longitu-


dinal measurements on 33 units of observation, or subjects.
– Each subject is measured at three times, drawn uniformly and inde-
pendently from the unit interval.
– Each of the three measurements per subject are drawn independently
from the normal distribution with one of three linear mean functions
of time, and with unit variance.
– The linear mean functions vary by intercept and slope. The longitu-
dinal structure imposes a grouping among measurements on a single
subject.

– Observations grouped in this way should always cluster together. A


grouping structure is specified using the group parameter; a factor
that behaves similarly to the groups parameter of lattice graphics
functions.
– For the PPM of conjugate binary models, the grouping structure is
imposed by the model formula.
– Grouped observations correspond to rows of the model matrix, result-
ing from a call to model.matrix on the formula passed to profBinary.
Hence, the profBinary function does not have a group parameter in
its prototype.
– The goal of the following example is to recover the simulated partition
and to create simultaneous 95% credible bands for the mean within
each cluster. The following R code block creates the simulated dataset.

set.seed(42)
sim <- function(multiplier = 1) {
x <- as.matrix(runif(99))
a <- multiplier * c(5,0,-5)
s <- multiplier * c(-10,0,10)
y <- c(a[1]+s[1]*x[1:33],
a[2]+s[2]*x[34:66],
a[3]+s[3]*x[67:99]) + rnorm(99)
group <- rep(1:33, rep(3,33))
return(data.frame(x=x,y=y,gr=group))
}
dat <- sim()
library("profdpm")
fitL <- profLinear(y ~ x, group=gr, data=dat)
sfitL <- summary(fitL)
# pdf("np_plot.pdf")
plot(fitL$x[,2], fitL$y, col=grey(0.9), xlab="x", ylab="y")
for(grp in unique(fitL$group)) {
ind <- which(fitL$group==grp)
ord <- order(fitL$x[ind,2])
lines(fitL$x[ind,2][ord],
fitL$y[ind][ord],
col=grey(0.9))
}
for(cls in 1:length(sfitL)) {
# The following implements the (3rd) method of
# Hanson & McMillan (2012) for simultaneous credible bands
# Generate coefficients from profile posterior
n <- 1e4


Figure 5.21: Simulated data; 99 longitudinal measurements on 33 subjects,
with simultaneous 95% credible bands for the mean within each of the
three clusters.

tau <- rgamma(n, shape=fitL$a[[cls]]/2, scale=2/fitL$b[[cls]])


muz <- matrix(rnorm(n*2, 0, 1),n,2)
mus <- (muz / sqrt(tau)) %*% chol(solve(fitL$s[[cls]]))
mu <- outer(rep(1,n), fitL$m[[cls]]) + mus

# Compute Mahalanobis distances


mhd <- rowSums(muz^2)

# Find the smallest 95% in terms of Mahalanobis distance


# I.e., a 95% credible region for mu
ord <- order(mhd, decreasing=TRUE)[-(1:floor(n*0.05))]
mu <- mu[ord,]
#Compute the 95% credible band
plotx <- seq(min(dat$x), max(dat$x), length.out=200)
ral <- apply(mu, 1, function(m) m[1] + m[2] * plotx)
rlo <- apply(ral, 1, min)
rhi <- apply(ral, 1, max)
rmd <- fitL$m[[cls]][1] + fitL$m[[cls]][2] * plotx

lines(plotx, rmd, col=cls, lty=2)


lines(plotx, rhi, col=cls)
lines(plotx, rlo, col=cls)
}
# dev.off()
