Essential Bayes
Rebecca C. Steorts
Contents

1 Introduction
1.1 Advantages of Bayesian Methods
1.2 de Finetti's Theorem
3 Being Objective
◯ Meaning Of Flat
◯ Objective Priors in More Detail
3.1 Reference Priors
◯ Laplace Approximation
◯ Some Probability Theory
◯ Shrinkage Argument of J.K. Ghosh
◯ Reference Priors
3.2 Final Thoughts on Being Objective
Chapter 1
Introduction
There are three kinds of lies: lies, damned lies and statistics.
—Mark Twain
The word “Bayesian” traces its origin to the 18th century and English Rev-
erend Thomas Bayes, who along with Pierre-Simon Laplace was among the first
thinkers to consider the laws of chance and randomness in a quantitative, scien-
tific way. Both Bayes and Laplace were aware of a relation that is now known
as Bayes Theorem:
p(θ∣x) = p(x∣θ)p(θ)/p(x) ∝ p(x∣θ)p(θ). (1.1)
The proportionality ∝ in Eq. (1.1) signifies that the 1/p(x) factor is constant
and may be ignored when viewing p(θ∣x) as a function of θ. We can decompose
Bayes’ Theorem into three principal terms:
p(θ∣x) posterior
p(x∣θ) likelihood
p(θ) prior
In effect, Bayes’ Theorem provides a general recipe for updating prior beliefs
about an unknown parameter θ based on observing some data x.
However, the notion of having prior beliefs about a parameter that is ostensibly
“unknown” did not sit well with many people who considered the problem in
the 19th and early 20th centuries. The resulting search for a way to practice
statistics without priors led to the development of frequentist statistics by such
eminent figures as Sir Ronald Fisher, Karl Pearson, Jerzy Neyman, Abraham
Wald, and many others.
The frequentist way of thinking came to dominate statistical theory and practice
in the 20th century, to the point that most students who take only introduc-
tory statistics courses are never even aware of the existence of an alternative
paradigm. However, recent decades have seen a resurgence of Bayesian statistics
(partially due to advances in computing power), and an increasing number of
statisticians subscribe to the Bayesian school of thought. Perhaps most encour-
agingly, both frequentists and Bayesians have become more willing to recognize
the strengths of the opposite approach and the weaknesses of their own, and it
is now common for open-minded statisticians to freely use techniques from both
sides when appropriate.
The basic philosophical difference between the frequentist and Bayesian paradigms
is that Bayesians treat an unknown parameter θ as random and use probability
to quantify their uncertainty about it. In contrast, frequentists treat θ as un-
known but fixed, and they therefore believe that probability statements about
θ are useless. This fundamental disagreement leads to entirely different ways to
handle statistical problems, even problems that might at first seem very basic.
Example 1.1: Suppose a coin is flipped, resulting in five heads followed by a tail. Let θ be the probability of heads, and suppose we wish to test H0 : θ = 1/2 against H1 : θ > 1/2 at level α = 0.05. The frequentist answer depends on the design of the experiment.
• Suppose that the experiment was "Flip six times and record the results." In this case, the random variable X counts the number of heads, and X ∼ Binomial(6, θ). The observed data was x = 5, and the p-value of our hypothesis test is
p-value = Pθ=1/2(X ≥ 5)
= Pθ=1/2(X = 5) + Pθ=1/2(X = 6)
= 6/64 + 1/64 = 7/64 = 0.109375 > 0.05,

so we fail to reject H0 at α = 0.05.
• Suppose instead that the experiment was “Flip until we get tails.” In this
case, the random variable X counts the number of the flip on which the
first tails occurs, and X ∼ Geometric(1 − θ). The observed data was x = 6,
and the p-value of our hypothesis test is
p-value = Pθ=1/2(X ≥ 6)
= 1 − Pθ=1/2(X < 6)
= 1 − ∑_{x=1}^{5} Pθ=1/2(X = x)
= 1 − (1/2 + 1/4 + 1/8 + 1/16 + 1/32) = 1/32 = 0.03125 < 0.05.
So we reject H0 at α = 0.05.
The conclusions differ, which seems absurd. Moreover the p-values aren’t even
close—one is 3.5 times as large as the other. Essentially, the result of our
hypothesis test depends on whether we would have stopped flipping if we had
gotten a tails sooner. In other words, the frequentist approach requires us to
specify what we would have done had the data been something that we already
know it wasn’t.
Note that despite the different results, the likelihood for the actual value of x
that was observed is the same for both experiments (up to a constant):
p(x∣θ) ∝ θ^5 (1 − θ).
A Bayesian approach would take the data into account only through this likeli-
hood and would therefore be guaranteed to provide the same answers regardless
of which experiment was being performed.
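As a quick check of the two frequentist calculations above, here is a short R sketch. (R's pgeom counts failures before the first success, so our X, the flip number of the first tails, equals that count plus one.)

1 - pbinom(4, size = 6, prob = 0.5)   # P(X >= 5) for X ~ Binomial(6, 1/2): 0.109375
1 - pgeom(4, prob = 0.5)              # P(first tails on flip 6 or later): 0.03125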
Example 1.2: Suppose we want to test whether the voltage θ across some
electrical component differs from 9 V, based on noisy readings of this voltage
from a voltmeter. Suppose the data is as follows:
A frequentist might assume that the voltage readings Xi are iid from some
N (θ, σ 2 ) distribution, which would lead to a basic one-sample t-test.
Now suppose we learn that the voltmeter used could only display voltages up to 10 V (none of our readings reached that limit). A Bayesian analysis is unaffected, since the likelihood of the data actually observed is unchanged. Nevertheless, a frequentist must now redo the analysis and could perhaps obtain a different conclusion, because the 10 V limit changes the distribution of the observations under the null hypothesis. As in the last example, the frequentist results change based on what would have happened had the data been something that we already know it wasn't.
The problems in Examples 1.1 and 1.2 arise from the way the frequentist
paradigm forces itself to interpret probability. Another familiar aspect of this
problem is the awkward definition of “confidence” in frequentist confidence inter-
vals. The most natural interpretation of a 95% confidence interval (L, U )—that
there is a 95% chance that the parameter is between L and U —is dead wrong
from the frequentist point of view. Instead, the notion of “confidence” must
be interpreted in terms of repeating the experiment a large number of times
(in principle, an infinite number), and no probabilistic statement can be made
about this particular confidence interval computed from the data we actually
observed.
1.2 de Finetti's Theorem

In this section, we'll motivate the use of priors on parameters and indeed motivate the very use of parameters. We begin with a definition.
Definition 1.1: (Infinite exchangeability). We say that (x1, x2, . . . ) is an infinitely exchangeable sequence of random variables if, for any n, the joint probability p(x1, x2, . . . , xn) is invariant to permutation of the indices. That is, for any permutation π,

p(x1, x2, . . . , xn) = p(xπ(1), xπ(2), . . . , xπ(n)).
A key assumption of many statistical analyses is that the random variables being
studied are independent and identically distributed (iid). Note that iid random
variables are always infinitely exchangeable. However, infinite exchangeability
is a much broader concept than being iid; an infinitely exchangeable sequence
is not necessarily iid. For example, let (x1 , x2 , . . . ) be iid, and let x0 be a non-
trivial random variable independent of the rest. Then (x0 + x1 , x0 + x2 , . . .) is
infinitely exchangeable but not iid. The usefulness of infinite exchangeability
lies in the following theorem.
Theorem 1.1. (De Finetti). A sequence of random variables (x1, x2, . . . ) is infinitely exchangeable iff, for all n,

p(x1, x2, . . . , xn) = ∫ ∏_{i=1}^{n} p(xi∣θ) P(dθ),

for some distribution P on θ. If the distribution on θ has a density, we can replace P(dθ) with p(θ) dθ, but the theorem applies to a much broader class of cases than just those with a density for θ.
Clearly, since ∏_{i=1}^{n} p(xi∣θ) is invariant to reordering, any sequence of distributions that can be written as

∫ ∏_{i=1}^{n} p(xi∣θ) p(θ) dθ

for all n must be infinitely exchangeable. The other direction, though, is much deeper. It says that if we have exchangeable data, then there must exist a parameter θ, a likelihood p(x∣θ), and a distribution P on θ. Thus, the theorem provides an answer to the questions of why we should use parameters and why we should put priors on parameters.
Example 1.3: (Document processing and information retrieval). To highlight
the difference between iid and infinitely exchangeable sequences, consider that
search engines have historically used “bag-of-words” models to model docu-
ments. That is, for the moment, pretend that the order of words in a document
does not matter. Even so, the words are definitely not iid. If we see one word
and it is a French word, we then expect that the rest of the document is likely to
be in French. If we see the French words voyage (travel), passeport (passport),
and douane (customs), we expect the rest of the document to be both in French
and on the subject of travel. Since we are assuming infinite exchangeability,
there is some θ governing these intuitions. Thus, we see that θ can be very rich,
and it seems implausible that θ might always be finite-dimensional in Theorem 1.1. In fact, it is the case that θ can be infinite-dimensional in Theorem 1.1. For example, in nonparametric Bayesian work, θ can be a stochastic process.
Chapter 2
Introduction to Bayesian
Methods
Every time I think I know what's going on, suddenly there's another layer of complications. I just want this damned thing solved.
—John Scalzi, The Last Colony
Another motivation for the Bayesian approach is decision theory. Its origins go
back to Von Neumann and Morgenstern’s game theory, but the main character
was Wald. In statistical decision theory, we formalize good and bad results with
a loss function.
A loss function L(θ, δ(x)) depends on the unknown parameter θ and a decision δ(x) computed from the data; for example, δ(x) might be the sample mean, and θ might be the true mean. The loss function determines the penalty for deciding δ(x) if θ is the true parameter. To give some intuition,
in the discrete case, we might use a 0–1 loss, which assigns

L(θ, δ(x)) = 0 if δ(x) = θ, and L(θ, δ(x)) = 1 if δ(x) ≠ θ,
or in the continuous case, we might use the squared error loss L(θ, δ(x)) = (θ −
δ(x))2 . Notice that in general, δ(x) does not necessarily have to be an estimate
of θ. Loss functions provide a foundation for statistical decision theory. They are simply a function of the state of nature (θ) and a decision rule (δ(⋅)). In order to compare procedures, we need some way of saying which procedure is best, even though we never observe the true parameter θ ∈ Θ and see only one realization of the data X. This is the main challenge of decision theory and the break between frequentists and Bayesians.
The frequentist risk is R(θ, δ) = Eθ[L(θ, δ(X))], the average loss over the data X when θ is the true parameter. Thus, the risk measures the long-term average loss resulting from using δ.
Often one decision does not dominate the other everywhere as is the case with
decisions δ1 , δ2 . The challenge is in saying whether, for example, δ1 or δ3 is
better. In other words, how should we aggregate over Θ?
[Figure: frequentist risk functions R(θ, δ1), R(θ, δ2), and R(θ, δ3) plotted against θ.]
In computing the risk, the frequentist "is happy to look at other data they could have gotten but didn't."
The Bayesian approach can also be motivated by a set of principles. Some books
and classes start with a long list of axioms and principles conceived in the 1950s
and 1960s. However, we will focus on three main principles.
One of these is the conditionality principle: inferences should depend only on the experiment that was actually performed. The following example illustrates it.
Example 2.1: Two different labs estimate the potency of drugs. Both have some error or noise in their measurements, which can be accurately estimated from past tests. Now we introduce a new drug and test its potency at a randomly chosen lab. Suppose the sample sizes of the two labs differ dramatically.
• Suppose the sample size of the first experiment (lab 1) is 1 and the
sample size of the second experiment (lab 2) is 100.
• What happens if we’re doing a frequentist experiment in terms of
the variance? Since this is a randomized experiment, we need to take
into account all of the data. In essence, the variance will do some
sort of averaging to take into account the sample sizes of each.
• However, taking a Bayesian approach, we just care about the data
that we see. Thus, the variance calculation will only come from the
actual data at the randomly chosen lab.
Thus, the question we ask is: should we use the noise level from the lab where the drug was tested, or average over both? Intuitively, we use the noise level from the lab where it was tested, but in some frequentist approaches it is not always so straightforward.
For instance, for data with 9 successes and 3 failures, p(x∣θ) ∝ θ^9 (1 − θ)^3.
Theorem 2.1. The posterior distribution p(θ∣y) depends on the data only through the sufficient statistic T(y).

Proof. By the factorization theorem, sufficiency of T(y) means we can write f(y∣θ) = g(θ, T(y)) h(y). Then

p(θ∣y) = f(y∣θ) π(θ) / ∫ f(y∣θ) π(θ) dθ
= g(θ, T(y)) h(y) π(θ) / ∫ g(θ, T(y)) h(y) π(θ) dθ
= g(θ, T(y)) π(θ) / ∫ g(θ, T(y)) π(θ) dθ
∝ g(θ, T(y)) π(θ),

which depends on y only through T(y).
For example, if Y∣θ ∼ Binomial(n, θ), then p(y∣θ) = C(n, y) θ^y (1 − θ)^{n−y} = g(θ, T(y)) h(y) with T(y) = y, so for a general prior p(θ) the posterior depends on the data only through y.
The Bayes action δ ∗ (x) for any fixed x is the decision δ(x) that minimizes the
posterior risk. If the problem at hand is to estimate some unknown parameter
θ, then we typically call this the Bayes estimator instead.
Theorem 2.3. Under squared error loss, the decision δ(x) that minimizes the posterior risk is the posterior mean.

Proof sketch. The posterior risk is ρ(π, δ(x)) = ∫ (θ − δ(x))² π(θ∣x) dθ. Differentiating with respect to δ(x) and setting the derivative to zero,

∂[ρ(π, δ(x))]/∂[δ(x)] = 2δ(x) − 2 ∫ θ π(θ∣x) dθ = 0 ⟺ δ(x) = E[θ∣x],

and the second derivative is positive, so this is a minimum.
Recall that decision theory provides a quantification of what it means for a pro-
cedure to be ‘good.’ This quantification comes from the loss function L(θ, δ(x)).
Frequentists and Bayesians use the loss function differently.
In frequentist usage, the parameter θ is fixed, and thus it is the sample space
over which averages are taken. Letting R(θ, δ(x)) denote the frequentist risk,
recall that R(θ, δ(x)) = Eθ [L(θ, δ(x))]. This expectation is taken over the data
X, with the parameter θ held fixed. Note that the data, X, is capitalized,
emphasizing that it is a random variable.
Example 2.4: (Squared error loss). Let the loss function be squared error. In this case, the risk is

R(θ, δ(x)) = Eθ[(θ − δ(X))²] = Varθ[δ(X)] + (Eθ[δ(X)] − θ)²,

the familiar decomposition into variance plus squared bias. This result allows a frequentist to analyze the variance and bias of an estimator separately, and can be used to motivate frequentist ideas, e.g., minimum variance unbiased estimators (MVUEs).
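To make the decomposition concrete, here is a minimal R sketch that approximates the frequentist risk of the sample mean by Monte Carlo; the model N(θ, 1) with δ(X) = X̄ is an illustrative assumption, not taken from the text.

theta = 2; n = 10; reps = 100000
set.seed(1)                                          # for reproducibility
delta = replicate(reps, mean(rnorm(n, theta, 1)))    # sample mean over many data sets
mean((delta - theta)^2)                              # approximates Var + Bias^2 = 1/n = 0.1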
Bayesians do not find the previous idea compelling because it does not adhere to the conditionality principle: it averages over all possible data sets. Hence, in a Bayesian framework, we define the posterior risk based on the observed data x and a prior π:

ρ(π, δ(x)) = ∫ L(θ, δ(x)) π(θ∣x) dθ.
Note that the prior enters the equation when calculating the posterior density.
Using the posterior risk, we can define a bit of jargon. Recall that the Bayes action δ*(x) is the value of δ(x) that minimizes the posterior risk. We already showed that the Bayes action under squared error loss is the posterior mean.
◯ Hybrid Ideas
Despite the tensions between frequentists and Bayesians, they occasionally steal
ideas from each other.
Definition 2.4: The Bayes risk is r(π, δ(x)) = ∫ R(θ, δ(x)) π(θ) dθ. While the Bayes risk is a frequentist concept, since it averages over X, the expression can also be written as r(π, δ(x)) = ∫ ρ(π, δ(x)) p(x) dx; that is, the posterior risk averaged over the marginal distribution of x. Another connection with frequentist theory is that
finding a Bayes rule against the “worst possible prior” gives you a minimax
estimator. While a Bayesian might not find this particularly interesting, it is
useful from a frequentist perspective because it provides a way to compute the
minimax estimator.
We will come back to decision theory in a later chapter on advanced decision theory, where we will cover topics such as minimaxity, admissibility, and James–Stein estimators.
For now we will consider parametric models, which means that the parameter θ
is a fixed-dimensional vector of numbers. Let x ∈ X be the observed data and
θ ∈ Θ be the parameter. Note that X may be called the sample space, while Θ
may be called the parameter space. Now we define some notation that we will
reuse throughout the course:
p(x∣θ) likelihood
π(θ) prior
p(x) = ∫ p(x∣θ)π(θ) dθ marginal likelihood
p(θ∣x) = p(x∣θ)π(θ)/p(x) posterior probability
p(xnew∣x) = ∫ p(xnew∣θ)π(θ∣x) dθ predictive probability
and oftentimes it’s best to not calculate the normalizing constant p(x) because
you can recognize the form of p(x∣θ)π(θ) as a probability distribution you know.
So don’t normalize until the end!
Remark: The prior distribution we take on θ does not have to be a proper distribution; however, the posterior is always required to be proper for valid inference. By proper, I mean that the distribution must integrate to 1.
We will discuss objective and subjective priors. Objective priors may be ob-
tained from the likelihood or through some type of invariance argument. Sub-
jective priors are typically arrived at by a process involving interviews with
domain experts and thinking really hard; in fact, there is arguably more philos-
ophy and psychology in the study of subjective priors than mathematics. We
start with conjugate priors. The main justification for the use of conjugate priors is that they are computationally convenient and have asymptotically desirable properties.
Subjective
A prior probability could be subjective based on the information a person might
have due to past experience, scientific considerations, or simple common sense.
For example, suppose we wish to estimate the probability that a randomly
selected woman has breast cancer. A simple prior could be formulated based
on the national or worldwide incidence of breast cancer. A more sophisticated
approach might take into account the woman’s age, ethnicity, and family history.
Neither approach could necessarily be classified as right or wrong—again, it’s
subjective.
A clinician may have had 8/10 patients cured by the first treatment and might therefore specify a prior suggesting a cure rate of around 0.8 for the new treatment.
For convenience, subjective priors are often chosen to take the form of common
distributions, such as the normal, gamma, or beta distribution.
Objective
An objective prior (also called a default, vague, or noninformative prior) can be used when little or no prior information is available. Examples of objective priors are flat priors such as Laplace's, Haldane's, Jeffreys', and Bernardo's reference priors. These priors will be discussed later.
X∣θ ∼ f (x∣θ)
Θ∣γ ∼ π(θ∣γ)
Γ ∼ φ(γ),
where we assume that φ(γ) is known and does not depend on any other unknown hyperparameters (hyperparameters being, as we have already said, what the parameters of the prior are often called). We can continue this hierarchical modeling and add more stages to the model; note, however, that doing so adds complexity, and, as we will see, may result in a posterior that we cannot compute without the aid of numerical integration or MCMC, which we will cover in detail in a later chapter.
As a first example, suppose X∣θ ∼ Binomial(n, θ) and θ ∼ Beta(a, b). Then the posterior is

π(θ∣x) ∝ p(x∣θ) p(θ)
∝ C(n, x) θ^x (1 − θ)^{n−x} · [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1} (1 − θ)^{b−1}
∝ θ^x (1 − θ)^{n−x} θ^{a−1} (1 − θ)^{b−1}
∝ θ^{x+a−1} (1 − θ)^{n−x+b−1}
⟹ θ∣x ∼ Beta(x + a, n − x + b).
Let’s apply this to a real example! We’re interested in the proportion of people
that approve of President Obama in PA.
• Based on this prior information, we’ll use a Beta prior for θ and we’ll
choose a and b. (Won’t get into this here).
• We can plot the prior and likelihood distributions in R and then see how
the two mix to form the posterior distribution.
[Figures: the Beta prior density for θ; the prior together with the likelihood; and the prior, likelihood, and posterior densities.]
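The figures above can be reproduced along the following lines; the hyperparameters a, b and the poll counts n, x below are hypothetical placeholders, since the notes do not record the values used.

a = 1.5; b = 1.5        # hypothetical Beta prior parameters
n = 100; x = 55         # hypothetical poll: x approvals out of n
theta = seq(0, 1, length = 500)
prior = dbeta(theta, a, b)
like  = dbeta(theta, x + 1, n - x + 1)    # likelihood, normalized as a density
post  = dbeta(theta, x + a, n - x + b)    # the Beta(x + a, n - x + b) posterior
plot(theta, post, type = "l", lty = 3, lwd = 3, xlab = expression(theta), ylab = "Density")
lines(theta, like, lty = 1, lwd = 3)
lines(theta, prior, lty = 2, lwd = 3)
legend("topleft", c("Prior", "Likelihood", "Posterior"), lty = c(2, 1, 3), lwd = 3)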
For instance, under a flat prior p(θ) ∝ 1, writing ∑i (xi − θ)² = ∑i (xi − x̄)² + n(x̄ − θ)² gives

p(θ∣x) ∝ exp{ −(1/(2σ²)) ∑i (xi − x̄)² } exp{ −(n/(2σ²)) (x̄ − θ)² }
∝ exp{ −(n/(2σ²)) (x̄ − θ)² }
= exp{ −(n/(2σ²)) (θ − x̄)² }.

Thus,

θ∣x1, . . . , xn ∼ Normal(x̄, σ²/n).
Now suppose instead

X1, . . . , Xn∣θ iid∼ N(θ, σ²)
θ ∼ N(µ, τ²).

Then

p(θ∣x1, . . . , xn) ∝ ∏_{i=1}^{n} (1/√(2πσ²)) exp{ −(1/(2σ²)) (xi − θ)² } × (1/√(2πτ²)) exp{ −(1/(2τ²)) (θ − µ)² }
∝ exp{ −(1/(2σ²)) ∑i (xi − θ)² } exp{ −(1/(2τ²)) (θ − µ)² }.
Consider the identity ∑i (xi − θ)² = ∑i (xi − x̄)² + n(x̄ − θ)². Then
p(θ∣x1, . . . , xn) = exp{ −(1/(2σ²)) ∑i (xi − x̄)² } × exp{ −(n/(2σ²)) (x̄ − θ)² } × exp{ −(1/(2τ²)) (θ − µ)² }
∝ exp{ −(n/(2σ²)) (x̄ − θ)² } exp{ −(1/(2τ²)) (θ − µ)² }
= exp{ −(1/2) [ (n/σ²)(x̄² − 2x̄θ + θ²) + (1/τ²)(θ² − 2θµ + µ²) ] }
= exp{ −(1/2) [ (n/σ² + 1/τ²) θ² − 2θ (nx̄/σ² + µ/τ²) + nx̄²/σ² + µ²/τ² ] }
∝ exp{ −(1/2) [ (n/σ² + 1/τ²) θ² − 2θ (nx̄/σ² + µ/τ²) ] }
∝ exp{ −(1/2) (n/σ² + 1/τ²) [ θ − (nx̄/σ² + µ/τ²)/(n/σ² + 1/τ²) ]² }.

Thus,

θ∣x1, . . . , xn ∼ N( (nx̄/σ² + µ/τ²)/(n/σ² + 1/τ²), 1/(n/σ² + 1/τ²) )
= N( (nx̄τ² + µσ²)/(nτ² + σ²), σ²τ²/(nτ² + σ²) ).
Theorem 2.4. If an estimator is unbiased and its variance tends to zero as n → ∞, then it is consistent. We omit the proof since it requires Chebyshev's inequality along with a bit of probability theory; see Problem 1.8.1 in TPE for the exercise of proving this.
(Recall from algebra that (x − b)² = x² − 2bx + b²; completing the square in θ is what produced the form above.) The posterior mean is

E(θ∣x) = ( nx̄/σ² + µ/τ² ) / ( n/σ² + 1/τ² )
= [ (n/σ²) / (n/σ² + 1/τ²) ] x̄ + [ (1/τ²) / (n/σ² + 1/τ²) ] µ,

a weighted average of the sample mean x̄ and the prior mean µ.
The posterior precision, n/σ² + 1/τ², is the sum of the data precision and the prior precision, so it is larger than either the sample precision or the prior precision. Equivalently, the posterior variance, denoted by V(θ∣x), is smaller than either the sample variance σ²/n or the prior variance τ².
What happens as n → ∞?
In the case of the posterior variance, divide the denominator and numerator by n. Then

V(θ∣x) = 1/(n/σ² + 1/τ²) = (1/n) / (1/σ² + 1/(nτ²)) ≈ σ²/n → 0 as n → ∞.
Since the bias of the posterior mean vanishes and the posterior variance goes to 0 as n → ∞, the posterior mean is consistent by Theorem 2.4.
Example 2.9: Suppose X∣β ∼ Gamma(α, β) with α known, and β ∼ IG(a, b). Then

p(β∣x) ∝ β^{−α} e^{−x/β} · β^{−a−1} e^{−b/β} = β^{−(α+a)−1} e^{−(x+b)/β}.

Notice that this looks like an Inverse Gamma distribution with parameters α + a and x + b. Thus,

β∣x ∼ IG(α + a, x + b).
Suppose x = 115 is a single observation from N(θ, σ² = 100), with prior θ ∼ N(100, 225) (the values implied by the expressions below). Here the posterior mean is (400 + 9x)/13, which for x = 115 becomes 110.4. Contrasting this, we know that the frequentist estimate is the MLE, which is x = 115 in this example. The posterior variance is 900/13 = 69.23, whereas the variance of the data is σ² = 100. In the limit of a flat prior (τ² → ∞), the posterior mean and MLE are both 115, and the posterior variance and the variance of the data are both 100.
When we put little/no prior information on θ, the data washes away most/all
of the prior information (and the results of frequentist and Bayesian estimation
are similar or equivalent in this case).
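As a sketch, the numbers in this example can be checked in R, assuming (as the algebra implies) a single observation with σ² = 100 and a N(100, 225) prior:

x = 115; sigma2 = 100; mu = 100; tau2 = 225     # values inferred from the text
(x*tau2 + mu*sigma2)/(tau2 + sigma2)            # posterior mean: (400 + 9*115)/13 = 110.38
sigma2*tau2/(tau2 + sigma2)                     # posterior variance: 900/13 = 69.23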
Example 2.11: Suppose X1, . . . , Xn∣σ² iid∼ N(θ, σ²) with θ known, and take the prior p(σ²) ∝ (σ²)^{−1}. Then the posterior is

p(σ²∣x1, . . . , xn) ∝ (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ∑i (xi − θ)² } (σ²)^{−1}
∝ (σ²)^{−n/2−1} exp{ −(1/(2σ²)) ∑i (xi − θ)² }.
Recall, if Y ∼ IG(a, b), then f(y) = (b^a/Γ(a)) y^{−a−1} e^{−b/y}. Thus,

σ²∣x1, . . . , xn ∼ IG(n/2, ∑i (xi − θ)²/2).
Example 2.12: (Football Data) Gelman et al. (2003) consider the problem of estimating an unknown variance using American football scores. The focus is on the difference d between a game outcome (winning score minus losing score) and a published point spread. We can refer to Example 2.11, since the setup here is the same (with θ = 0, i.e., the point spread is taken as unbiased). Hence the posterior becomes

σ²∣d1, . . . , dn ∼ IG(n/2, ∑i di²/2).
The next logical step would be plotting the posterior distribution in R. As far as
I can tell, there is not a built-in function predefined in R for the Inverse Gamma
density. However, someone saw the need for it and built one in using the pscl
package.
Proceeding below, we try to calculate the posterior using the function densigamma, which corresponds to the Inverse Gamma density. However, running this line in the code gives the following error:
Warning message:
In densigamma(sigmas, n/2, sum(d^2)/2) : value out of range in ’gammafn’
What's the problem? Think about what the posterior looks like. Recall that

p(σ²∣d) = [ (∑i di²/2)^{n/2} / Γ(n/2) ] (σ²)^{−n/2−1} e^{−(∑i di²)/(2σ²)}.

With n = 2240 games, the normalizing constant involves Γ(n/2) = Γ(1120), a value far too large for R to represent directly.
setwd("~/Desktop/sta4930/football")
data = read.table("football.txt",header=T)
names(data)
attach(data)
score = favorite-underdog     # favorite's score minus underdog's score
d = score-spread              # game outcome minus the published point spread
n = length(d)
hist(d)
install.packages("pscl",repos="http://cran.opensourceresources.org")
library(pscl)                 # provides densigamma
?densigamma
sigmas = seq(10,20,by=0.1)
post = densigamma(sigmas,n/2,sum(d^2)/2)   # fails: Gamma(n/2) overflows (see below)
v = sum(d^2)
We know we can’t use the Inverse Gamma density (because of the function in
R), but we do know a relationship regarding the Inverse Gamma and Gamma
distributions. So, let’s apply this fact.
You may be thinking, we’re going to run into the same problem because we’ll
still be dividing by Γ(1120). This is true, except the Gamma density function
dgamma was built into R by the original writers. The dgamma function is able
to do some internal tricks that let it calculate the gamma density even though
the individual piece Γ(n/2) by itself is too large for R to handle. So, moving
forward, we will apply the following fact that we already learned:
If X ∼ IG(a, b), then 1/X ∼ Gamma(a, 1/b).
Since

σ²∣d1, . . . , dn ∼ IG(n/2, ∑i di²/2),

we know that

(1/σ²)∣d1, . . . , dn ∼ Gamma(n/2, 2/v), where v = ∑i di².
In the code below, we plot the posterior of (1/σ²)∣d. In order to do so, we must create a new sequence of x-values, since the mean of our Gamma will be at n/v ≈ 0.0053.
xnew = seq(0.004,0.007,.000001)
pdf("football_sigmainv.pdf", width = 5, height = 4.5)
post.d = dgamma(xnew,n/2,scale = 2/v)
plot(xnew,post.d, type= "l", xlab = expression(1/sigma^2), ylab= "density")
dev.off()
[Figure: posterior density of 1/σ².]
To recap, we know

(1/σ²)∣d1, . . . , dn ∼ Gamma(n/2, 2/v), where v = ∑i di².

Let u = 1/σ². We are going to make a transformation of variables now to write the density in terms of σ².
Since u = 1/σ², we have σ² = 1/u and ∣∂u/∂σ²∣ = 1/σ⁴. Now applying the transformation of variables, we find that

f(σ²∣d1, . . . , dn) = [ 1 / ( Γ(n/2) (2/v)^{n/2} ) ] (1/σ²)^{n/2−1} e^{−v/(2σ²)} (1/σ⁴).

Thus, the density of σ²∣d is the Gamma(n/2, 2/v) density evaluated at 1/σ², multiplied by the Jacobian 1/σ⁴.
x.s = seq(150,250,1)
pdf("football_sigma.pdf", height = 5, width = 4.5)
post.s = dgamma(1/x.s,n/2, scale = 2/v)*(1/x.s^2)
plot(x.s,post.s, type="l", xlab = expression(sigma^2), ylab="density")
dev.off()
detach(data)
From the posterior plot in Figure 2.4 we can see that the posterior mean is
around 185. This means that the variability of the actual game result around
the point spread has a standard deviation around 14 points. If you wanted to
actually calculate the posterior mean and variance, you could do this using a
numerical method in R.
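For instance, here is one such sketch, assuming n and v from the code above are still in scope: the IG(n/2, v/2) posterior mean is available in closed form as (v/2)/(n/2 − 1), and we can check it against a numerical integral of the transformed density.

(v/2)/(n/2 - 1)                        # closed-form posterior mean of sigma^2
post.s2 = function(s2) dgamma(1/s2, n/2, scale = 2/v)/s2^2   # density of sigma^2 | d
integrate(function(s2) s2*post.s2(s2), lower = 100, upper = 400)$value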
What’s interesting about this example is that there is a lot more variability in
football games than the average person would most likely think.
• Assume that (1) the standard deviation actually is 14 points, and (2)
game result is normally distributed (which it’s not, exactly, but this is a
reasonable approximation).
• Things with a normal distribution fall two or more standard deviations
from their mean about 5% of the time, so this means that, roughly speak-
ing, about 5% of football games end up 28 or more points away from their
spread.
[Figure 2.4: posterior density of σ².]
Example 2.13: Suppose

Y1, . . . , Yn∣µ, σ² iid∼ Normal(µ, σ²),
µ∣σ² ∼ Normal(µ0, σ²/κ0),
σ² ∼ IG(ν0/2, σ0²/2).
Then

p(µ, σ²∣y1, . . . , yn) = p(µ, σ², y1, . . . , yn) / p(y1, . . . , yn)
∝ p(y1, . . . , yn∣µ, σ²) p(µ, σ²)
= p(y1, . . . , yn∣µ, σ²) p(µ∣σ²) p(σ²).
Expanding the likelihood and the priors and collecting terms in the exponent (several lines of algebra are omitted here), the key simplification needed is the following:
nȳ² + κ0µ0² − (nȳ + κ0µ0)²/(n + κ0)
= nȳ² + κ0µ0² + ( −n²ȳ² − 2nκ0µ0ȳ − κ0²µ0² )/(n + κ0)
= ( n²ȳ² + nκ0µ0² + nκ0ȳ² + κ0²µ0² − n²ȳ² − 2nκ0µ0ȳ − κ0²µ0² )/(n + κ0)
= ( nκ0µ0² + nκ0ȳ² − 2nκ0µ0ȳ )/(n + κ0)
= nκ0( µ0² − 2µ0ȳ + ȳ² )/(n + κ0)
= nκ0( µ0 − ȳ )²/(n + κ0).
Combining terms, the marginal posterior of σ² is

σ²∣y ∼ IG( (n + ν0)/2, (1/2)[ ∑i (yi − ȳ)² + nκ0(µ0 − ȳ)²/(n + κ0) + σ0² ] ).
Example 2.14: Let

Xi∣θ ∼ Uniform(0, θ)
θ ∼ IG(a, 1/b),

so that π(θ) ∝ θ^{−a−1} e^{−1/(θb)} (the inverse-Gamma form used in the proof below), and suppose we calculate E[θ∣y], where y = x(n) = maxᵢ xᵢ. Show that

E[θ∣x] = [ 1/(b(n + a − 1)) ] · P(χ²_{2(n+a−1)} < 2/(by)) / P(χ²_{2(n+a)} < 2/(by)).
Proof. Recall that the posterior depends on the data only through the sufficient statistic y = x(n). Since P(Y ≤ y) = P(X1 ≤ y)ⁿ = (y/θ)ⁿ for 0 ≤ y ≤ θ, the density of y is

f(y∣θ) = (n/θ)(y/θ)^{n−1} = n y^{n−1}/θⁿ.
Then

E[θ∣x] = ∫ θ f(y∣θ) π(θ) dθ / ∫ f(y∣θ) π(θ) dθ
= [ ∫_y^∞ θ (n y^{n−1}/θⁿ) θ^{−a−1} e^{−1/(θb)} / (Γ(a) bᵃ) dθ ] / [ ∫_y^∞ (n y^{n−1}/θⁿ) θ^{−a−1} e^{−1/(θb)} / (Γ(a) bᵃ) dθ ]
= ∫_y^∞ θ^{−n−a} e^{−1/(θb)} dθ / ∫_y^∞ θ^{−n−a−1} e^{−1/(θb)} dθ.

Substituting x = 2/(θb), so that θ = 2/(bx) and dθ = −(2/(bx²)) dx, each integral becomes an incomplete Gamma (chi-squared) integral:

∫_y^∞ θ^{−n−a} e^{−1/(θb)} dθ = (b/2)^{n+a−1} ∫_0^{2/(by)} x^{(n+a−1)−1} e^{−x/2} dx = b^{n+a−1} Γ(n + a − 1) P(χ²_{2(n+a−1)} < 2/(by)),

and similarly

∫_y^∞ θ^{−n−a−1} e^{−1/(θb)} dθ = b^{n+a} Γ(n + a) P(χ²_{2(n+a)} < 2/(by)).

Taking the ratio and using Γ(n + a) = (n + a − 1) Γ(n + a − 1),

E[θ∣x] = [ 1/(b(n + a − 1)) ] · P(χ²_{2(n+a−1)} < 2/(by)) / P(χ²_{2(n+a)} < 2/(by)).
2.8 Empirical Bayesian Models

Empirical Bayesian models take the form

Xi∣θ ∼ f(x∣θ), i = 1, . . . , p
Θ∣γ ∼ π(θ∣γ).

For example, suppose K groups of patients with the same illness each receive a different treatment. Since the groups receive different treatments, we expect different success rates; however, since we are treating the same illness, these rates should be related to each other. These considerations suggest the following model:

Xk ∼ Bin(n, pk),
pk ∼ Beta(a, b),

where the K groups are tied together by the common prior distribution.
It is easy to show that the Bayes estimator of pk under squared error loss is

E(pk∣xk, a, b) = (a + xk)/(a + b + n).
Suppose now that we are told that a and b are unknown and we wish to estimate them. The empirical Bayes approach estimates a and b from the marginal distribution of the data and plugs those estimates back into the formula above; the resulting estimator is called the empirical Bayes estimator.
2.9 Posterior Predictive Distributions
We have just gone through many examples illustrating how to calculate simple posterior distributions; this is the main goal of a Bayesian analysis. Another goal might be prediction. That is, given some data y and a new observation ỹ, we may wish to find the conditional distribution of ỹ given y. This distribution is referred to as the posterior predictive distribution; that is, our goal is to find p(ỹ∣y).
We'll derive the posterior predictive distribution for the continuous case (θ continuous). The discrete case is the same, with the integrals replaced by sums.
Consider

p(ỹ∣y) = p(ỹ, y) / p(y)
= ∫_θ p(ỹ, y, θ) dθ / p(y)
= ∫_θ p(ỹ∣y, θ) p(y, θ) dθ / p(y)
= ∫_θ p(ỹ∣y, θ) p(θ∣y) dθ.
In most contexts, if θ is given, then ỹ∣θ is independent of y, i.e., the value of θ
determines the distribution of ỹ, without needing to also know y. When this is
the case, we say that ỹ and y are conditionally independent given θ. Then the
above becomes
p(ỹ∣y) = ∫ p(ỹ∣θ)p(θ∣y) dθ.
θ
In the discrete case, the same argument gives p(ỹ∣y) = ∑_θ p(ỹ∣θ) p(θ∣y).
Theorem 2.6. Suppose p(x) is a pdf that looks like p(x) = c f(x), where c is a constant and f is a continuous function of x. Since

∫_x p(x) dx = ∫_x c f(x) dx = 1,

it follows that ∫_x f(x) dx = 1/c.
Example 2.16: Human males have one X-chromosome and one Y-chromosome,
whereas females have two X-chromosomes, each chromosome being inherited
from one parent. Hemophilia is a disease that exhibits X-chromosome-linked
recessive inheritance, meaning that a male who inherits the gene that causes
the disease on the X-chromosome is affected, whereas a female carrying the
gene on only one of her X-chromosomes is not affected. The disease is generally
fatal for women who inherit two such genes, and this is very rare, since the
frequency of occurrence of the gene is very low in human populations.
Consider a woman who has an affected brother (xY), which implies that her
mother must be a carrier of the hemophilia gene (xX). We are also told that
her father is not affected (XY), thus the woman herself has a fifty-fifty chance
of having the gene.
Let θ denote the state of the woman. It can take two values: the woman is a
carrier (θ = 1) or not (θ = 0). Based on this, the prior can be written as
P (θ = 1) = P (θ = 0) = 1/2.
Suppose the woman has a son who does not have hemophilia (S1 = 0). Now
suppose the woman has another son. Calculate the probability that this second
son also will not have hemophilia (S2 = 0), given that the first son does not have
hemophilia. Assume son one and son two are conditionally independent given
θ.
Solution: First compute

p(θ∣S1 = 0) = p(S1 = 0∣θ) p(θ) / [ p(S1 = 0∣θ = 0) p(θ = 0) + p(S1 = 0∣θ = 1) p(θ = 1) ],

which gives

p(θ = 0∣S1 = 0) = (1)(1/2) / [ (1)(1/2) + (1/2)(1/2) ] = 2/3 and p(θ = 1∣S1 = 0) = 1/3.
Then
p(S2 = 0∣S1 = 0) = p(S2 = 0∣θ = 0)p(θ = 0∣S1 = 0) + p(S2 = 0∣θ = 1)p(θ = 1∣S1 = 0)
= (1)(2/3) + (1/2)(1/3) = 5/6.
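The same calculation in R, as a quick sketch:

post = c(2/3, 1/3)    # p(theta = 0 | S1 = 0) and p(theta = 1 | S1 = 0)
pS2  = c(1, 1/2)      # p(S2 = 0 | theta = 0) and p(S2 = 0 | theta = 1)
sum(pS2 * post)       # 5/6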
Suppose instead that we count the number of Bernoulli trials required to get
a fixed number of successes. This formulation leads to the Negative Binomial
distribution.
Then

f(x) = C(x − 1, r − 1) p^r (1 − p)^{x−r}, x = r, r + 1, . . . ,

and we say X ∼ Negative Binom(r, p).
When we refer to the Negative Binomial distribution in this class, we will refer
to the second one defined unless we indicate otherwise.
Example 2.17: Suppose

X∣λ ∼ Poisson(λ)
λ ∼ Gamma(a, b).

Find the posterior and posterior predictive distributions.

Solution: Recall

p(λ∣x) ∝ p(x∣λ) p(λ)
∝ e^{−λ} λ^x λ^{a−1} e^{−λ/b}
= λ^{x+a−1} e^{−λ(1+1/b)}.

Thus, λ∣x ∼ Gamma( x + a, b/(b + 1) ).
For the posterior predictive, consider

p(x̃∣x) = ∫_λ p(x̃∣λ) p(λ∣x) dλ
= ∫_λ [ e^{−λ} λ^{x̃} / x̃! ] [ λ^{x+a−1} e^{−λ(b+1)/b} / ( Γ(x + a) (b/(b+1))^{x+a} ) ] dλ
= [ 1 / ( x̃! Γ(x + a) (b/(b+1))^{x+a} ) ] ∫_λ λ^{x̃+x+a−1} e^{−λ(2b+1)/b} dλ
= Γ(x̃ + x + a) ( b/(2b+1) )^{x̃+x+a} / ( x̃! Γ(x + a) (b/(b+1))^{x+a} ).

Then, writing p = b/(2b + 1) (so 1 − p = (b + 1)/(2b + 1)),

p(x̃∣x) = C(x̃ + x + a − 1, x̃) p^{x̃} (1 − p)^{x+a}.
Thus,

x̃∣x ∼ Negative Binom( x + a, b/(2b + 1) ),
where we are assuming the Negative Binomial distribution as defined in Wikipedia
(and not as defined earlier in the notes).
Example 2.18: Suppose the number of pregnant women arriving at a particular hospital during a month is modeled as

X∣λ ∼ Poisson(λ)
λ ∼ Gamma(a, b).

We are told that 42 moms are observed arriving at the particular hospital during December 2007. Using prior study information given, we take a = 5 and b = 6. (We found a, b by working backwards from a prior mean of 30 and prior variance of 180.)
Solution: The first things we need for this problem are p(λ∣x) and p(x̃∣x). We found these in Example 2.17. So,

λ∣x ∼ Gamma( x + a, b/(b + 1) ),

and

x̃∣x ∼ Negative Binom( x + a, b/(2b + 1) ).
setwd("~/Desktop/sta4930/ch3")
lam = seq(0,100, length=500)
x = 42
a = 5
b = 6
like = dgamma(lam,x+1,scale=1)   # Poisson likelihood in lambda, normalized (a Gamma(x+1,1) density)
prior = dgamma(lam,5,scale=6)
post = dgamma(lam,x+a,scale=b/(b+1))
pdf("preg.pdf", width = 5, height = 4.5)
plot(lam, post, xlab = expression(lambda), ylab= "Density", lty=2, lwd=3, type="l")
lines(lam,like, lty=1,lwd=3)
lines(lam,prior, lty=3,lwd=3)
legend(70,.06,c("Prior", "Likelihood","Posterior"), lty = c(2,1,3),
lwd=c(3,3,3))
dev.off()
In the first part of the code, we plot the prior, likelihood, and posterior. This should be self-explanatory since we have already done a similar example.
Finally, in order to calculate the posterior predictive probability that the number of pregnant women who arrive is between 40 and 45, we simply add up the posterior predictive probabilities that correspond to these values. We find that this posterior predictive probability is 0.1284.
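A sketch of that last step in R: the pmf C(x̃ + x + a − 1, x̃) p^{x̃} (1 − p)^{x+a} with p = b/(2b + 1) corresponds to dnbinom with size = x + a and prob = 1 − p, so the call below rests on that assumed parameterization mapping.

x = 42; a = 5; b = 6
sum(dnbinom(40:45, size = x + a, prob = (b + 1)/(2*b + 1)))   # P(40 <= xtilde <= 45 | x)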
Chapter 3
Being Objective
Thus far in this course, we have mostly considered informative or subjective pri-
ors. Ideally, we want to choose a prior reflecting our beliefs about the unknown
parameter of interest. This is a subjective choice. All Bayesians agree that
wherever prior information is available, one should try to incorporate a prior
reflecting this information as much as possible. We have mentioned how incorporating expert prior opinion can strengthen a purely data-based analysis in real-life decision problems. Using prior information can also be useful in problems of statistical inference when your sample size is small or you have a high- or infinite-dimensional parameter space.
However, in dealing with real-life problems you may run into situations where reliable prior information is simply not available.
The problems we have dealt with all semester have been very simple in nature. We have only had one parameter to estimate (except for one example). Think about a more complex problem such as the following (we looked at this problem in Chapter 1):
X∣θ ∼ N (θ, σ 2 )
θ∣σ 2 ∼ N (µ, τ 2 )
σ 2 ∼ IG(a, b)
where now θ and σ² are both unknown and we must find the posterior distributions of θ∣X, σ² and σ²∣X. For this slightly more complex problem, it is much harder to think about what values µ, τ², a, b should take for a particular problem. What should we do in these types of situations?
As a running example, recall the pregnancy example, where

p(x∣θ) = e^{−θ} θ^x / x!, x ∈ {0, 1, 2, . . .}, θ > 0.
A convenient choice for the prior distribution here is a Gamma(a, b) since it is
conjugate for the Poisson likelihood. To illustrate the example further, suppose
that 42 moms deliver babies during the month of December. Suppose from past
data at this hospital, we assume a prior of Gamma(5, 6). From this, we can
easily calculate the posterior distribution, posterior mean and variance, and do
various calculations of interest in R.
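For instance, a minimal sketch using the Gamma(x + a, b/(b + 1)) posterior from Example 2.17:

x = 42; a = 5; b = 6
a.star = x + a; s.star = b/(b + 1)      # posterior shape and scale
a.star*s.star                           # posterior mean, about 40.3
a.star*s.star^2                         # posterior variance, about 34.5
qgamma(c(0.025, 0.975), a.star, scale = s.star)   # a 95% credible interval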
Definition 3.2: Noninformative/objective priors contain little or no informa-
tion about θ in the sense that they do not favor one value of θ over another.
Therefore, when we calculate the posterior distribution, most if not all of the
inference will arise from the likelihood. Inferences in this case are objective and
not subjective. Let’s look at the following example to see why we might consider
such priors.
Comment: Since many of the objective priors are improper, we must check that the posterior is proper.

Theorem 3.1 (Propriety of the Posterior). The posterior is proper if and only if the marginal likelihood is finite: ∫ p(x∣θ) π(θ) dθ < ∞.
◯ Meaning Of Flat
What does a “flat prior” really mean? People really abuse the word flat and
interchange it for noninformative. Let’s talk about what people really mean
when they use the term “flat,” since it can have different meanings.
Example 3.3: Often statisticians will refer to a prior as being flat, when a plot
of its density actually looks flat, i.e., uniform. An example of this would be
taking such a prior to be
θ ∼ Unif(0, 1).
We can plot the density of this prior to see that the density is flat.
[Figure: density of the Uniform(0, 1) prior.]
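For instance, in R:

curve(dunif(x, 0, 1), from = 0, to = 1, xlab = expression(theta), ylab = "Density")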
Example 3.4: (Thomas Bayes) In 1763, Thomas Bayes considered the ques-
tion of what prior to use when estimating a binomial success probability p. He
described the problem quite differently back then by considering throwing balls
onto a billiard table. He separated the billiard table into many different inter-
vals and considered different events. By doing so (and not going into the details
of this), he argued that a Uniform(0,1) prior was appropriate for p.
Example 3.5: (Laplace) In 1814, Pierre-Simon Laplace wanted to know the
probability that the sun will rise tomorrow. He answered this question using
the following Bayesian analysis:
• Let X represent the number of days the sun rises. Let p be the probability
the sun will rise tomorrow.
• Based on reading the Bible, Laplace computed the total number of days n
in recorded history, and the number of days x on which the sun rose.
Clearly, x = n.
Then

π(p∣x) ∝ C(n, x) p^x (1 − p)^{n−x} · 1
∝ p^{x+1−1} (1 − p)^{n−x+1−1}.

This implies

p∣x ∼ Beta(x + 1, n − x + 1).

Then

p̂ = E[p∣x] = (x + 1)/(x + 1 + n − x + 1) = (x + 1)/(n + 2) = (n + 1)/(n + 2),

using x = n.
Thus, Laplace’s estimate for the probability that the sun rises tomorrow is
(n + 1)/(n + 2), where n is the total number of days recorded in history. For
instance, if so far we have encountered 100 days in the history of our universe,
this would say that the probability the sun will rise tomorrow is 101/102 ≈
0.9902. However, we know that this calculation is ridiculous. Here, we have
extremely strong subjective information (the laws of physics) that says it is
extremely likely that the sun will rise tomorrow. Thus, objective Bayesian
methods shouldn’t be recklessly applied to every problem we study—especially
when subjective information this strong is available.
The Uniform prior of Bayes and Laplace has been criticized for many different reasons. We will discuss one important reason for criticism and not go into the other reasons, since they go beyond the scope of this course.
Jeffreys’ Prior
What does the invariance principle mean? Suppose our prior parameter is θ, but we would like instead to work with a transformation φ = f(θ). Jeffreys' prior has the property that if θ has the distribution specified by Jeffreys' prior for θ, then φ = f(θ) will have the distribution specified by Jeffreys' prior for φ. We will clarify by going over two examples that illustrate this idea. Note, for example, that if θ has a Uniform prior, then one can show that φ = f(θ) will not have a Uniform prior (unless f is the identity function).
Aside from the invariance property of Jeffreys’ prior, in the univariate case,
Jeffreys’ prior satisfies many optimality criteria that statisticians are interested
in.
Definition 3.3: Define

I(θ) = −E[ ∂² log p(y∣θ) / ∂θ² ],

where I(θ) is called the Fisher information. Then Jeffreys' prior is defined to be

pJ(θ) = √(I(θ)).
Example 3.6: (Uniform Prior is Not Invariant to Transformation)
Let θ ∼ Uniform(0, 1). Suppose now we would like to transform from θ to θ². Let φ = θ². Then θ = √φ, and it follows that

∂θ/∂φ = 1/(2√φ).

Thus, p(φ) = 1/(2√φ), 0 < φ < 1, which shows that φ is not Uniform on (0, 1). Hence, the Uniform prior is not invariant under this transformation. Criticism such as this led to consideration of Jeffreys' prior.
Example 3.7: (Jeffreys’ Prior Invariance Example)
Suppose
X∣θ ∼ Exp(θ).
One can show using calculus that I(θ) = 1/θ², so that pJ(θ) = 1/θ. Suppose that φ = θ². It follows that

∂θ/∂φ = 1/(2√φ).

Then

pJ(φ) = pJ(√φ) ∣∂θ/∂φ∣ = (1/√φ)(1/(2√φ)) ∝ 1/φ.
Hence, we have shown for this example, that Jeffreys’ prior is invariant under
the transformation φ = θ2 .
Example 3.8: (Jeffreys' prior) Suppose X∣θ ∼ Binomial(n, θ). Then

I(θ) = −E[ −x/θ² − (n − x)/(1 − θ)² ] = nθ/θ² + (n − nθ)/(1 − θ)² = n/θ + n/(1 − θ) = n/(θ(1 − θ)).

This implies that

pJ(θ) = √( n/(θ(1 − θ)) ) ∝ θ^{1/2−1} (1 − θ)^{1/2−1},

which is the kernel of a Beta(1/2, 1/2) distribution.
Figure 3.1 compares the prior density πJ (θ) with that for a flat prior, which is
equivalent to a Beta(1,1) distribution.
Note that in this case the prior is inversely proportional to the standard devia-
tion. Why does this make sense?
We see that the data has the least effect on the posterior when the true θ =
0.5, and has the greatest effect near the extremes, θ = 0 or 1. Jeffreys’ prior
compensates for this by placing more mass near the extremes of the range, where
the data has the strongest effect. We could get the same effect by (for example)
letting the prior be π(θ) ∝ 1/Varθ(X) instead of π(θ) ∝ 1/[Varθ(X)]^{1/2}. However, the former prior is not invariant under reparameterization, as we would prefer.
[Figure 3.1: the Jeffreys' prior Beta(1/2, 1/2) compared with the flat Beta(1, 1) prior.]
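A sketch of Figure 3.1 in R:

theta = seq(0.005, 0.995, by = 0.005)
plot(theta, dbeta(theta, 0.5, 0.5), type = "l", ylim = c(0, 2),
     xlab = expression(theta), ylab = expression(p(theta)))
lines(theta, dbeta(theta, 1, 1), lty = 2)            # the flat Beta(1, 1) prior
legend("top", c("Beta(1/2,1/2)", "Beta(1,1)"), lty = c(1, 2))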
Thus, θ∣x ∼ Beta(x + 1/2, n − x + 1/2), which is a proper posterior since the prior
is proper.
Limitations of Jeffreys’
Jeffreys' priors work well for single-parameter models, but not for models with multidimensional parameters. By analogy with the one-dimensional case, one can define

πJ(θ) = ∣I(θ)∣^{1/2},

where ∣ · ∣ denotes the determinant and the (i, j)th element of the Fisher information matrix is given by

I(θ)ij = −E[ ∂² log p(X∣θ) / (∂θi ∂θj) ].
Let’s see what happens when we apply a Jeffreys’ prior for θ to a multivari-
ate Gaussian location model. Suppose X ∼ Np (θ, I), and we are interested
in performing inference on ∣∣θ∣∣2 . In this case the Jeffreys’ prior for θ is flat.
It turns out that the posterior has the form of a non-central χ2 distribution
with p degrees of freedom. The posterior mean given one observation of X is
E(∣∣θ∣∣2 ∣X) = ∣∣X∣∣2 + p. This is not a good estimate because it adds p to the
square of the norm of X, whereas we might normally want to shrink our esti-
mate towards zero. By contrast, the minimum variance frequentist estimate of
∣∣θ∣∣2 is ∣∣X∣∣2 − p.
Haldane’s Prior
In 1963, Haldane introduced the following improper prior for a binomial pro-
portion:
p(θ) ∝ θ−1 (1 − θ)−1 .
It can be shown to be improper using simple calculus, which we will not go into.
However, the posterior is proper under certain conditions. Let
Y ∣θ ∼ Bin(n, θ).
p(θ∣y) ∝ C(n, y) θ^y (1 − θ)^{n−y} / ( θ(1 − θ) )
∝ θ^{y−1} (1 − θ)^{n−y−1}
= θ^{y−1} (1 − θ)^{(n−y)−1},

which is the kernel of a Beta(y, n − y) distribution. Finally, we need to check that our posterior is proper. Recall that the parameters of the Beta need to be positive. Thus, y > 0 and n − y > 0; this means that y ≠ 0 and y ≠ n in order for the posterior to be proper.
Remark: Recall that the Beta density integrates to 1 whenever its parameter values are positive. When they are not positive, the integral diverges, so for the problem above the posterior is improper when y = 0 or y = n.
There are many other objective priors that are used in Bayesian inference, how-
ever, this is the level of exposure that we will cover in this course. If you’re
interested in learning more about objective priors (g-prior, probability match-
ing priors), see me and I can give you some references.
3.1 Reference Priors

Reference priors were proposed by Jose Bernardo in a 1979 paper, and further developed by Jim Berger and others from the 1980s through the present. They are credited with bringing about an objective Bayesian renaissance; an annual conference is now devoted to the objective Bayesian approach.
For one-dimensional parameters, it will turn out that reference priors and Jef-
freys’ priors are equivalent. For multidimensional parameters, they differ. One
might ask, how can we choose a prior to maximize the divergence between the
posterior and prior, without having seen the data first? Reference priors handle
this by taking the expectation of the divergence, given a model distribution for
the data. This sounds superficially like a frequentist approach—basing inference
on imagined data. But once the prior is chosen based on some model, inference
proceeds in a standard Bayesian fashion. (This contrasts with the frequentist
approach, which continues to deal with imagined data even after seeing the real
data!)
◯ Laplace Approximation
Before deriving reference priors in some detail, we go through the Laplace approximation, which is very useful in Bayesian analysis since we often need to evaluate integrals of the form

I = ∫ g(θ) f(x∣θ) π(θ) dθ.

For example, when g(θ) = 1, the integral reduces to the marginal likelihood of x. The posterior mean requires evaluation of two integrals, ∫ θ f(x∣θ)π(θ) dθ and ∫ f(x∣θ)π(θ) dθ. Laplace's method is a technique for approximating integrals when the integrand has a sharp maximum.
Write the integral of interest as I = ∫ q(θ) e^{nh(θ)} dθ, where h has a sharp maximum at θ̂ with h′(θ̂) = 0, and set c = −h″(θ̂) > 0. Expanding q and h in Taylor series about θ̂ over a window (θ̂ − δ, θ̂ + δ), let t = √(nc)(θ − θ̂), which implies dθ = dt/√(nc). Hence,

I ≈ [ q(θ̂) e^{nh(θ̂)} / √(nc) ] ∫_{−δ√(nc)}^{δ√(nc)} [ 1 + t q′(θ̂)/(√(nc) q(θ̂)) + t² q″(θ̂)/(2nc q(θ̂)) ] e^{−t²/2} dt
≈ [ q(θ̂) e^{nh(θ̂)} / √(nc) ] √(2π) [ 1 + 0 + q″(θ̂)/(2nc q(θ̂)) ]
= [ q(θ̂) e^{nh(θ̂)} / √(nc) ] √(2π) [ 1 + O(1/n) ] ≈ q(θ̂) e^{nh(θ̂)} √(2π)/√(nc).
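As a sanity check on the formula, here is a short R sketch for a Binomial integrand with q(θ) = 1, so that I = ∫ θ^x (1 − θ)^{n−x} dθ has the Beta function as its exact value; the values of n and x are illustrative assumptions.

n = 20; x = 14; xbar = x/n
h = function(th) xbar*log(th) + (1 - xbar)*log(1 - th)   # e^{n h(theta)} is the likelihood
th.hat = xbar                           # maximizer of h
c.hat = 1/(xbar*(1 - xbar))             # c = -h''(th.hat)
exp(n*h(th.hat))*sqrt(2*pi/(n*c.hat))   # Laplace approximation: about 1.27e-06
beta(x + 1, n - x + 1)                  # exact value: about 1.23e-06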
◯ Some Probability Theory

First, we give a few definitions from probability theory (you may have seen these before), and we will be informal about them. We write

Xn = o(rn) as n → ∞

to mean that Xn/rn → 0. Similarly,

Xn = O(rn) as n → ∞

means that Xn/rn is bounded.
◯ Shrinkage Argument of J.K. Ghosh

This argument, given by J.K. Ghosh, will be used to derive reference priors. It can be used in many other theoretical proofs in Bayesian theory; if interested in seeing these, please refer to his book for details, as listed on the syllabus. Please note that below I am hand waving over some of the details regarding analysis that are important but not completely necessary to grasp the basic concept here.
Step 1: Consider a proper prior π̄(⋅) for θ such that the support of π̄(⋅) is a
compact rectangle in the parameter space and π̄(⋅) vanishes on the boundary of
the support, while remaining positive on the interior. Consider the posterior of
θ under π̄(⋅) and hence obtain E π̄ [q(X, θ)∣x].
Step 2: Find Eθ E π̄ [q(x, θ)∣x] = λ(θ) for θ in the interior of the support of π̄(⋅).
Step 3: Integrate λ(⋅) with respect to π̄(⋅) and then allow π̄(⋅) to converge to
the degenerate prior at the true value of θ (say θ0 ) supposing that the true θ is
an interior point of the support of π̄(⋅). This yields Eθ [q(X, θ)].
◯ Reference Priors
Bernardo (1979) suggested choosing the prior to maximize the expected Kullback–Leibler divergence between the posterior and prior,

E[ log( π(θ∣x)/π(θ) ) ],

where the expectation is taken over both the data and the parameter.
First write

E[ log(π(θ∣x)/π(θ)) ] = ∫∫ [ log(π(θ∣x)/π(θ)) ] π(θ∣x) m(x) dx dθ
= ∫∫ [ log(π(θ∣x)/π(θ)) ] f(x∣θ) π(θ) dx dθ
= ∫ π(θ) E[ log(π(θ∣x)/π(θ)) ∣ θ ] dθ.

Consider E[ log(π(θ∣x)/π(θ)) ∣ θ ] = E[ log π(θ∣x) ∣ θ ] − log π(θ). Then by iterated expectation,

E[ log(π(θ∣x)/π(θ)) ] = ∫ π(θ) { E[ log π(θ∣x) ∣ θ ] − log π(θ) } dθ
= ∫ E[ log π(θ∣x) ∣ θ ] π(θ) dθ − ∫ log π(θ) π(θ) dθ. (3.1)
We will use Step 1 of the shrinkage argument of J.K. Ghosh to find E^π̄[q(X, θ)∣x]. Writing Ln(θ) for the log-likelihood, the posterior can be written as

π(θ∣x) = exp{Ln(θ)} π(θ) / ∫ exp{Ln(θ)} π(θ) dθ.
Since θ̂n is the maximum likelihood estimate, L′n (θ̂n ) = 0. Now define the
quantity
Iˆn := Iˆn(θ̂n) = − ( ∂² log f(x∣θ)/∂θ² ) ∣_{θ=θ̂n} = −L″n(θ̂n).
Expanding Ln about θ̂n and writing t = √n(θ − θ̂n), this becomes

π(θ∣x) = √n exp{ −½ t² Iˆn } π(θ̂n) [1 + O(n^{−1/2})] / ( √(2π) Iˆn^{−1/2} π(θ̂n) [1 + O(n^{−1/2})] ),

noting that the denominator takes the form of a constant times the integral of a normal density with variance Iˆn^{−1}. Hence,

π(θ∣x) = ( √n Iˆn^{1/2} / √(2π) ) exp( −½ t² Iˆn ) [1 + O(n^{−1/2})]. (3.2)
Then

log π(θ∣x) = ½ log n − log √(2π) − ½ t² Iˆn + ½ log Iˆn + O(n^{−1/2}),

since log[1 + O(n^{−1/2})] = O(n^{−1/2}). Now consider

E^π̄[ log π(θ∣x) ] = ½ log n − log √(2π) − E^π̄[ ½ t² Iˆn ] + ½ log Iˆn + O(n^{−1/2}).
To evaluate E^π̄[ ½ t² Iˆn ], note that (3.2) states that, up to order n^{−1/2}, π(t∣x) is approximately normal with mean zero and variance Iˆn^{−1}. Since this does not depend on the form of the prior π, it follows that π̄(t∣x) is also approximately normal with mean zero and variance Iˆn^{−1}, again up to order n^{−1/2}. Then E^π̄[ ½ t² Iˆn ] = ½, which implies that

E^π̄[ log π(θ∣x) ] = ½ log n − log √(2π) − ½ + log Iˆn^{1/2} + O(n^{−1/2})
= ½ log n − ½ log(2πe) + log Iˆn^{1/2} + O(n^{−1/2}).
Step 3: Since λ(θ) is continuous, the process of calculating ∫ λ(θ) π̄(θ) dθ and allowing π̄(·) to converge to degeneracy at θ simply yields λ(θ) again. Thus,

E[ log π(θ∣x) ∣ θ ] = ½ log n − ½ log(2πe) + log [I(θ)]^{1/2} + O(n^{−1/2}).

Substituting this into (3.1),

E[ log( π(θ∣x)/π(θ) ) ] = ½ log n − ½ log(2πe) + ∫ log{ [I(θ)]^{1/2} / π(θ) } π(θ) dθ + O(n^{−1/2}),

which is maximized by taking π(θ) ∝ [I(θ)]^{1/2}.
Take away: If there are no nuisance parameters, Jeffreys’ prior is the reference
prior.
Multiparameter generalization
E[ log( π(θ∣x)/π(θ) ) ] = (p/2) log n − (p/2) log(2πe) + ∫ log( ∣I(θ)∣^{1/2}/π(θ) ) π(θ) dθ + O(n^{−1/2}).
Note that this is maximized when π(θ) ∝ ∣I(θ)∣^{1/2}, meaning that Jeffreys' prior maximizes the expected divergence between the prior and posterior. In the presence of nuisance parameters, things change considerably.
Suppose θ = (θ1, θ2), where θ2 is a nuisance parameter. Begin with π(θ2∣θ1) = ∣I22(θ)∣^{1/2} c(θ1), where c(θ1) is the constant that makes this conditional distribution a proper density, and now try to maximize the expected divergence for θ1. Arguing as before,

E[ log( π(θ∣x)/π(θ) ) ] = (p/2) log n − (p/2) log(2πe) + ∫ π(θ) log( ∣I(θ)∣^{1/2}/π(θ) ) dθ + O(n^{−1/2}). (3.4)
Similarly, the analogous expansion for θ1 alone involves I11.2(θ), where

I11.2(θ) = I11(θ) − I12(θ) I22^{−1}(θ) I21(θ) and ∣I(θ)∣ = ∣I22∣ ∣I11 − I12 I22^{−1} I21∣ = ∣I22∣ ∣I11.2∣.

These identities can be derived from Searle's book on matrix algebra.
We find that

E[ log( π(θ1∣x)/π(θ1) ) ] = (p1/2) log n − (p1/2) log(2πe) + ∫ π(θ) log ∣I11.2(θ)∣^{1/2} dθ − ∫ π(θ) log π(θ1) dθ + O(n^{−1/2})
= (p1/2) log n − (p1/2) log(2πe) + ∫ π(θ1) [ ∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2 ] dθ1 − ∫ π(θ1) log π(θ1) dθ1 + O(n^{−1/2})
= (p1/2) log n − (p1/2) log(2πe) + ∫ π(θ1) log( ψ(θ1)/π(θ1) ) dθ1 + O(n^{−1/2}),

where ψ(θ1) = exp{ ∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2 }.
To maximize the integral above, we choose π(θ1) = ψ(θ1). Note that I11.2(θ)^{−1} is the (1, 1) block of the inverse information matrix I^{−1}(θ), whose off-diagonal blocks are −I11.2^{−1}(θ) I12(θ) I22^{−1}(θ) and −I22^{−1}(θ) I21(θ) I11.2^{−1}(θ).
π(θ1) = exp{ ∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2 } = exp{ ∫ c(θ1) ∣I22(θ)∣^{1/2} log ∣I11.2(θ)∣^{1/2} dθ2 }.
Remark: An important point that should be highlighted is that all these calculations (especially evaluations of all integrals) are carried out over an increasing sequence of compact sets K whose union is the parameter space. For example, if the parameter space is R × R⁺, take the increasing sequence of compact rectangles [−i, i] × [i^{−1}, i] and eventually let i → ∞. Also, the proofs are carried out by considering a sequence of priors πi with support Ki, eventually taking i → ∞. This fact should be taken into account when doing examples and calculations of these types of problems.
Example 3.10: Let X1, . . . , Xn∣µ, σ iid∼ N(µ, σ²), where σ is a nuisance parameter. Consider the sequence of priors πi with support [−i, i] × [i^{−1}, i], i = 1, 2, . . . . Here

I(µ, σ) = ( 1/σ²   0
             0    2/σ² ).

Then π(σ∣µ) = √2 ci2/σ, i^{−1} ≤ σ ≤ i, and requiring ∫_{i^{−1}}^{i} π(σ∣µ) dσ = 1 gives ci2 = 1/(2√2 ln i). Thus,

π(σ∣µ) = (1/(2 ln i)) (1/σ), i^{−1} ≤ σ ≤ i.

Now find π(µ). Observe that I11.2 = I11 − I12 I22^{−1} I21 = 1/σ². Thus

π(µ) ∝ exp{ ∫_{i^{−1}}^{i} (1/(2 ln i)) (1/σ) log(1/σ) dσ } = exp{0} = c,

a constant, since the integral vanishes by the symmetry of log σ over [i^{−1}, i]. We want to find π(µ, σ). We know that π(σ∣µ) = π(µ, σ)/π(µ), which implies

π(µ, σ) = π(µ) π(σ∣µ) = (c/(2 ln i)) (1/σ) ∝ 1/σ.
Problems with Reference Priors: see page 128 of the little green book and the suggestions in Bernardo and Berger (1992).
3.2 Final Thoughts on Being Objective

We have spent just a short amount of time covering objective Bayesian procedures, but already we have seen how each is flawed in some sense. As Fienberg (2009) points out in a review article (see the webpage), there are two main parts of being Bayesian: the prior and the likelihood. What is Fienberg's point? He claims that robustness should be carried through at both levels of the model; that is, we should care about subjectivity of the likelihood as well as the prior.
My favorite part that Fienberg illustrates in this paper is the view that “objec-
tive Bayes is like the search for the Holy Grail.” He mentions that Good (1972)
once wrote that there are “46,656 Varieties of Bayesians,” which was a num-
ber that he admitted exceeded the number of professional statisticians during
that time. Today? There seem to be at least as many varieties of objective Bayesians, each trying to arrive at the perfect choice of an objective prior, and each attempt seems to fail because of some foundational principle. For example, Eaton and Freedman (2004) explain why you shouldn't use Jeffreys' prior for the normal covariance matrix. We didn't look at intrinsic priors, but they have been criticized by Fienberg for contingency tables because of their dependence on the likelihood function and because of bizarre properties when extended to deal with large sparse tables.
Chapter 4
Evaluating Bayesian
Procedures
They say statistics are for losers, but losers are usually the ones saying that.
—Urban Meyer
One major difference between Bayesians and frequentists is how they interpret
intervals. Let’s quickly review what a frequentist confidence interval is and how
to interpret one.
Recall that a frequentist 95% confidence interval is an interval (L, U), computed from the data, that contains the true parameter θ in the long run. In the long run means that this would occur nearly 95% of the time if we repeated our study millions and millions of times.
Recall that frequentists treat θ as fixed, but Bayesians treat θ as a random vari-
able. The main difference between frequentist confidence intervals and Bayesian
credible intervals is the following:
• Bayesians invoke the concept of probability after observing the data. For some particular set of data X = x, the random variable θ lies in a Bayesian credible interval with some probability, e.g., 0.95.
• Frequentists invoke the concept of probability before the data are observed. For a fixed value of θ, the random interval (L, U) contains θ with some probability, e.g., 0.95, over repeated sampling.
Assumptions
In lower-level classes, you wrote down assumptions whenever you did confidence
intervals. This is redundant for any problem we construct in this course since
we always know the data is randomly distributed and we assume it comes from
some underlying distribution, say Normal, Gamma, etc. We also always assume our observations are i.i.d. (independent and identically distributed), meaning that the observations are independent and all come from the same distribution.
Thus, when working a particular problem, we will assume these assumptions
are satisfied given the proposed model holds.
Remark: When you're calculating credible intervals, you'll find the values of a and b by several means. You could be asked to do the following:
Important Point
Our definition for the credible interval could lead to many choices of (a, b) for
particular problems.
Suppose that we required our credible interval to have equal probability α/2 in
each tail. That is, we will assume
P (θ < a∣x) = α/2
and
P (θ > b∣x) = α/2.
Is the credible interval still unique? No. Consider
π(θ∣x) = I(0 < θ < 0.025) + I(1 < θ < 1.95) + I(3 < θ < 3.025)
so that the density has three separate plateaus. Now notice that any (a, b) such that 0.025 < a < 1 and 1.95 < b < 3 satisfies the proposed definition of an ostensibly "unique" credible interval. To fix this, we can simply require that
{θ ∶ π(θ∣x) is positive}
(i.e., the support of the posterior) must be an interval.
This greatly contrasts with the usual frequentist CI, for which the corresponding
statement is something like “If we could recompute C for a large number of
datasets collected in the same way as ours, about (1 − α) × 100% of them would
contain the true value θ. ”
This classical statement is not one of comfort. We may not be able to repeat our
experiment a large number of times (suppose we have an interval estimate for the 1993 U.S. unemployment rate). If we are in physical possession of just
one dataset, our computed C will either contain θ or it won’t, so the actual
coverage probability will be 0 or 1. For the frequentist, the confidence level
(1 − α) is only a “tag” that indicates the quality of the procedure. But for a
Bayesian, the credible set provides an actual probability statement based only
on the observed data and whatever prior information we add.
Figure: posterior density of θ∣x with a 95% central credible interval (a, b) and probability 0.025 in each tail.
Interpretation and Comparisons
Example 4.1: Suppose
X1 , . . . , Xn ∣θ ∼ N (θ, σ 2 ),
θ ∼ N (µ, τ 2 ).
Recall that
θ∣x1 , . . . , xn ∼ N ( (nx̄τ 2 + µσ 2 )/(nτ 2 + σ 2 ), σ 2 τ 2 /(nτ 2 + σ 2 ) ).
Let
µ∗ = (nx̄τ 2 + µσ 2 )/(nτ 2 + σ 2 ),
σ ∗2 = σ 2 τ 2 /(nτ 2 + σ 2 ).
We want to calculate a and b such that P (θ < a∣x1 , . . . , xn ) = 0.05/2 = 0.025 and P (θ > b∣x1 , . . . , xn ) = 0.05/2 = 0.025. Standardizing, we must find an a such that P (Z < (a − µ∗ )/σ ∗ ∣ x1 , . . . , xn ) = 0.025. From a Z-table, we know that
(a − µ∗ )/σ ∗ = −1.96.
This tells us that a = µ∗ − 1.96σ ∗ . Similarly, b = µ∗ + 1.96σ ∗ . (Work this part out
on your own at home). Therefore, a 95% credible interval is
µ∗ ± 1.96σ ∗ .
Example 4.2: We’re interested in knowing the true average number of orna-
ments on a Christmas tree. Call this θ. We take a random sample of n Christmas
trees, count the ornaments on each one, and call the results X1 , . . . , Xn . Let the
prior on θ be Normal(75, 225).
Using data (trees.txt) we have, we will calculate the 95% credible interval
and confidence interval for θ. In R we first read in the data file trees.txt. We
then set the initial values for our known parameters, n, σ, µ, and τ.
Next, we refer to Example 4.1, and calculate the values of µ∗ and σ ∗ using this
example. Finally, again referring to Example 4.1, we recall that the formula for
a 95% credible interval here is
µ∗ ± 1.96σ ∗ .
On the other hand, recalling back to any basic statistics course, a 95% confidence interval in this situation is x̄ ± 1.96σ/√n.
From the R code, we find that there is a 95% probability that the average number
of ornaments per tree is in (45.00, 57.13) given the data. We also find that we
are 95% confident that the average number of ornaments per tree is contained
in (43.80, 56.20). If we compare the width of each interval, we see that the
credible interval is slightly narrower. It is also shifted towards slightly higher
values than the confidence interval for this data, which makes sense because the
prior mean was higher than the sample mean. What would happen to the width
of the intervals if we increased n? Does this make sense?
x = read.table("trees.txt",header=T)
attach(x)
# known parameter values
n = 10
sigma = 10
mu = 75    # prior mean
tau = 15   # prior standard deviation
# posterior mean and standard deviation from Example 4.1
mu.star = (n*mean(orn)*tau^2+mu*sigma^2)/(n*tau^2+sigma^2)
sigma.star = sqrt((sigma^2*tau^2)/(n*tau^2+sigma^2))
# 95% credible and confidence intervals, and their widths
(cred.i = mu.star+c(-1,1)*qnorm(0.975)*sigma.star)
(conf.i = mean(orn)+c(-1,1)*qnorm(0.975)*sigma/sqrt(n))
diff(cred.i)
diff(conf.i)
detach(x)
Example 4.3: (Sleep Example)
Recall the Beta-Binomial. Consider that we were interested in the proportion
of the population of American college students that sleep at least eight hours
each night (θ).
Suppose that the prior on θ was Beta(3.3, 7.2). With x = 11 of the n = 27 sampled students sleeping at least eight hours (as in the code below), the posterior distribution is θ∣x ∼ Beta(3.3 + 11, 7.2 + 27 − 11) = Beta(14.3, 23.2).
Suppose now we would like to find a 90% credible interval for θ. We cannot
compute this in closed form since computing probabilities for Beta distributions
involves messy integrals that we do not know how to compute. However, we can
use R to find the interval.
We need to solve
P (θ < c∣x) = 0.05
and
P (θ > d∣x) = 0.05 for c and d.
The reason we cannot compute this in closed form is because we need to compute
∫_0^c Beta(14.3, 23.2) dθ = 0.05
and
∫_d^1 Beta(14.3, 23.2) dθ = 0.05,
where the Beta(14.3, 23.2) density is
f (θ) = [Γ(37.5)/(Γ(14.3)Γ(23.2))] θ^(14.3−1) (1 − θ)^(23.2−1) .
a = 3.3   # prior shape parameters
b = 7.2
n = 27    # sample size
x = 11    # number who sleep at least eight hours
a.star = x+a       # posterior shape parameters
b.star = n-x+b
c = qbeta(0.05,a.star,b.star)    # lower endpoint
d = qbeta(1-0.05,a.star,b.star)  # upper endpoint
Running the code in R, we find that a 90% credible interval for θ is (0.256, 0.514), meaning that there is a 90% probability that the proportion of American college students who sleep eight or more hours per night is between 0.256 and 0.514 given the data.
Remark: Credible intervals are very easy to calculate unlike confidence inter-
vals, which require pivotal quantities or inversion of a family of tests.
In general, plot the posterior distribution and find the HPD (highest posterior density) credible set. One important point is that the posterior must be unimodal in order to guarantee that the HPD credible set is an interval. (Unimodality of the posterior is a sufficient condition for the credible set to be an interval, but it’s not necessary.)
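As a concrete illustration, here is a minimal grid-based R sketch (ours, not from the notes; the function name hpd_beta and the grid resolution are our own choices) that approximates an HPD interval for the Beta(14.3, 23.2) posterior of Example 4.3:
hpd_beta <- function(a, b, level = 0.90) {
  theta <- seq(0.001, 0.999, by = 0.001)       # grid over the support
  dens  <- dbeta(theta, a, b)
  ord   <- order(dens, decreasing = TRUE)      # highest-density points first
  keep  <- ord[cumsum(dens[ord])/sum(dens) <= level]
  range(theta[keep])  # for a unimodal posterior this set is an interval
}
hpd_beta(14.3, 23.2)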
Example 4.5: Suppose
y1 , . . . , yn ∣σ 2 ∼ iid N (0, σ 2 ),
p(σ 2 ) ∝ (σ 2 )^(−α/2−1) e^(−β/(2σ 2 )) .
Then the posterior is
p(σ 2 ∣y) ∝ (σ 2 )^(−[(n+α)/2+1]) exp ( −(∑i yi2 + β)/(2σ 2 ) ).
This posterior distribution is unimodal, but how do we know this? One way of showing the posterior is unimodal is to show that it is increasing in σ 2 up to a point and then decreasing afterwards. The log of the posterior has the same feature. Now
log(p(σ 2 ∣y)) = c1 − [(n + α)/2 + 1] log(σ 2 ) − (∑i yi2 + β)/(2σ 2 ).
Differentiating with respect to σ 2 gives −[(n + α)/2 + 1]/σ 2 + (∑i yi2 + β)/(2σ 4 ), which is positive for σ 2 < (∑i yi2 + β)/(n + α + 2) and negative afterwards, so the posterior rises to a single mode and then falls.
Let’s first review p-values and why they might not make sense in the grand scheme of things. In the classical approach of Fisher, Neyman, and Pearson, we have a null hypothesis Ho and an alternative Ha . After determining some test statistic T (y), we compute the p-value, which is
p = P ( T (Y ) ≥ T (yobs ) ∣ Ho ),
the probability, computed under the null hypothesis, of a test statistic at least as extreme as the one observed.
Clearly, classical statistics has deep roots and a long history. It’s popular with practitioners, but does it make sense? The approach can be applied in a straightforward manner only when the two hypotheses in question are nested (meaning one within the other). This means that Ho must be a simplification of Ha . Many practical testing problems involve a choice between two or more models that aren’t nested (choosing between quadratic and exponential growth models, for example).
Another difficulty is that tests of this type can only offer evidence against the null hypothesis. A small p-value indicates that the alternative model has significantly more explanatory power. But a large p-value does not suggest that the two models are equivalent (only that we lack evidence that they are not). This limitation/difficulty is often swept under the rug and never dealt with. We simply say, “we fail to reject the null hypothesis,” and leave it at that.
Finally, one last criticism is that p-values depend not only on the observed data
but also on the total sampling probability of certain unobserved data points,
namely, the more extreme T (Y ) values. Because of this, two experiments with
identical likelihoods could result in different p-values if the two experiments were
designed differently. (This violates the Likelihood Principle.) See Example 1.1
in Chapter 1 for an illustration of how this can happen.
The hypotheses we consider take forms such as
Ho ∶ θ ∈ Θo versus Ha ∶ θ ∈ Θ1 ,
Ho ∶ θ = θo versus Ha ∶ θ ≠ θo ,
Ho ∶ θ ≤ θo versus Ha ∶ θ > θo .
A Bayesian talks about posterior odds and Bayes factors.
Definition 4.5: Prior odds
Let πo = P (θ ∈ Θo ), π1 = P (θ ∈ Θ1 ), and πo + π1 = 1. Then the prior odds in favor of Ho are πo /π1 .
Definition 4.6: Posterior odds
Let αo = P (θ ∈ Θo ∣y) and α1 = P (θ ∈ Θ1 ∣y). Then the posterior odds are αo /α1 .
Definition 4.7: Bayes Factor
The Bayes Factor (BF) = posterior odds / prior odds = (αo /α1 ) ÷ (πo /π1 ) = (αo π1 )/(α1 πo ).
Example 4.6: IQ Scores
Suppose we’re studying IQ scores, modeled as normally distributed with mean θ, and the prior on θ is symmetric about 100. We’d like to be able to say something about whether the mean of the IQ scores is below or above 100. The prior odds are then
πo /π1 = P (θ ≤ 100)/P (θ > 100) = (1/2)/(1/2) = 1
by symmetry. Having observed y = 115, the posterior gives αo = P (θ ≤ 100∣y = 115) = 0.106 and α1 = P (θ > 100∣y = 115) = 0.894. Thus the posterior odds are αo /α1 = 0.1185, and hence BF = 0.1185.
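A quick sanity check of this arithmetic in R (our snippet, not from the notes; it simply recomputes the odds and Bayes factor from the probabilities above):
pi0 <- 0.5 ; pi1 <- 0.5       # prior probabilities (symmetric about 100)
a0  <- 0.106 ; a1 <- 0.894    # posterior probabilities given y = 115
(post.odds <- a0/a1)          # posterior odds, about 0.1185
(BF <- post.odds/(pi0/pi1))   # equals the posterior odds since prior odds = 1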
Consider
H1 ∶ θ = 1 versus H4 ∶ θ ≠ 1
H2 ∶ θ = 1/2 versus H5 ∶ θ ≠ 1/2
H3 ∶ θ = 0 versus H6 ∶ θ ≠ 0.
Suppose the resulting Bayes factors are such that, for a rejection threshold k ∈ (0.0619, 0.125) with the rule “reject whenever BF01 < k,” we reject H4 in favor of H1 , but we fail to reject H2 . Now H2 (θ = 1/2) implies H4 (θ ≠ 1), so failing to reject H2 should imply failing to reject H4 ; evidence in favor of H4 should be at least as strong as evidence in favor of H2 . But the Bayes factor violates this. Lavine and Schervish refer to this as a lack of coherence. The problem does not occur with the posterior odds: if Θo ⊂ Θa , then P (Θo ∣x) ≤ P (Θa ∣x), and since p/(1 − p) is increasing in p,
P (Θo ∣x)/(1 − P (Θo ∣x)) ≤ P (Θa ∣x)/(1 − P (Θa ∣x)).
(This result can be generalized.)
• It is sometimes said that Bayes factors are insensitive to the choice of prior; however, this statement is misleading (Berger, 1995). We will see why below.
Consider
Ho ∶ θ = θo versus Ha ∶ θ = θ1 .
Then πo = P (θ = θo ) and π1 = P (θ = θ1 ), so πo + π1 = 1. This implies that
αo /α1 = (πo /π1 ) × P (y∣θ = θo )/P (y∣θ = θ1 ),
and hence BF = P (y∣θ = θo )/P (y∣θ = θ1 ), which is the likelihood ratio. This does not depend on the choice of the prior. However, in general the Bayes factor depends on how the prior spreads mass over the null and alternative (so Berger’s statement is misleading).
Example 4.8: Consider
Ho ∶ θ ∈ Θo versus Ha ∶ θ ∈ Θ1 .
Derive the BF. Let go (θ) and g1 (θ) be probability density functions such that ∫Θo go (θ) dθ = 1 and ∫Θ1 g1 (θ) dθ = 1. Let
π(θ) = πo go (θ) if θ ∈ Θo , and π(θ) = π1 g1 (θ) if θ ∈ Θ1 .
It follows that
αo /α1 = [∫Θo p(y∣θ)π(θ) dθ / m(y)] / [∫Θ1 p(y∣θ)π(θ) dθ / m(y)] = [πo ∫Θo p(y∣θ)go (θ) dθ] / [π1 ∫Θ1 p(y∣θ)g1 (θ) dθ],
and hence
BF = ∫Θo p(y∣θ)go (θ) dθ / ∫Θ1 p(y∣θ)g1 (θ) dθ.
Bayes factors are meant to compare two or more models; however, often we are interested in the goodness of fit of a particular model rather than a comparison of models. Bayesian p-values were proposed to address this problem. George Box proposed a prior predictive p-value. Suppose that T (x) is a test statistic and π is some prior. Then we calculate the marginal (prior predictive) distribution m(x) = ∫ p(x∣θ)π(θ) dθ and compute p = Pm (T (X) ≥ T (xobs )).
Suppose the prior π(σ 2 ) is degenerate at σo2 , i.e., π(σ 2 = σo2 ) = 1, and that under Mo the data are X1 , . . . , Xn ∼ N (0, σ 2 ). Marginally,
X̄ ∼ N (0, σo2 /n)
under Mo . Also, taking T (X) = √n ∣X̄∣,
P (√n ∣X̄∣ ≥ √n ∣x̄obs ∣) = P ( √n ∣X̄∣/σo ≥ √n ∣x̄obs ∣/σo ) = 2Φ(−√n ∣x̄obs ∣/σo ).
If the guessed σo is much smaller than the actual model variance, then the p-value is small and the evidence against Mo is overestimated.
Since then, the posterior predictive p-value (PPP) has been proposed by Rubin (1984), Meng (1994), and Gelman et al. (1996). They propose looking at the posterior predictive distribution of a future observation X under some prior π. That is, we calculate
p = P ( T (X) ≥ T (xobs ) ∣ xobs ),
the conditional probability, under the posterior predictive distribution of X given xobs and prior π, that T (X) ≥ T (xobs ).
Remark: For details, see the papers. A general criticism by Bayarri and Berger
points out that the procedure involves using the data twice. The data is used in
finding the posterior distribution of θ and also in finding the posterior predictive
p-value. As an alternative, they have suggested using conditional predictive p-
values (CPP). This involves splitting the data into two parts, say T (X) and
U (X). We use U (X) to find the posterior predictive distribution and T (X)
continues to be the test statistic.
Also, Robins, van der Vaart, and Ventura (JASA, 2000) investigate Bayarri and Berger’s claims that, for a parametric model, their conditional and partial predictive p-values are superior to the parametric bootstrap p-value and to previously proposed p-values (the prior predictive p-value of Guttman, 1967 and Rubin, 1984, and the discrepancy p-value of Gelman et al., 1995, 1996 and Meng, 1994). Robins et al. note that Bayarri and Berger’s claims of superiority are based on small-sample properties for specific examples. They investigate large-sample properties and conclude that the asymptotic results confirm the superiority of the conditional predictive and partial posterior predictive p-values. Robins et al. (2000) also explore corrections for when these p-values are difficult to compute. In Section 4 of their paper, they discuss how to modify the test statistic for the parametric bootstrap p-value, posterior predictive p-value, and discrepancy p-value. Modifications are made such that they are asymptotically uniform. They claim that their approach is successful for the discrepancy p-value (and the authors derive a test based on this). Note: the discrepancy p-value can be difficult to calculate for complex models.
Consider U1 , . . . , U4 iid Uniform(θ − 0.5, θ + 0.5), with order statistics U(1) ≤ ⋯ ≤ U(4) . Then
P (θ ∉ (U(1) , U(4) )∣θ) = P (U(1) > θ ∪ U(4) < θ∣θ)
= P (U(1) > θ∣θ) + P (U(4) < θ∣θ)
= P (Ui > θ, i = 1, 2, 3, 4∣θ) + P (Ui < θ, i = 1, 2, 3, 4∣θ)
= (0.5)^4 + (0.5)^4 = (0.5)^3 = 0.125.
Hence, P (θ ∈ (U(1) , U(4) )∣θ) = 0.875, which proves that (U(1) , U(4) ) is an 87.5% confidence interval for θ.
Consider that U(1) = 0.1 and that U(4) = 0.9. The 87.5% probability has to do
with the random interval (U(1) , U(4) ) and not with the particular observed value
of (0.1, 0.9).
Let’s do some investigative work! Observe that, for every ui , ui > θ − 0.5. Hence, u(1) > θ − 0.5, that is, θ < u(1) + 0.5. Similarly, θ > u(4) − 0.5. Hence, θ ∈ (u(4) − 0.5, u(1) + 0.5). Plugging in u(1) = 0.1 and u(4) = 0.9, we obtain θ ∈ (0.4, 0.6). That is, even though the observed 87.5% confidence interval is (0.1, 0.9), we know that θ ∈ (0.4, 0.6) with certainty.
Let’s now compute an 87.5% centered credible interval. This depends on the posterior of θ. Under a flat prior on θ, the posterior is proportional to the likelihood, which is constant on (u(4) − 0.5, u(1) + 0.5) and zero elsewhere. That is, θ∣u has a Uniform distribution on (u(4) − 0.5, u(1) + 0.5). Let a = u(4) − 0.5 and b = u(1) + 0.5. The centered 87.5% credible interval is (l, u) such that
∫_a^l 1/(b − a) dx = 2^−4 and ∫_u^b 1/(b − a) dx = 2^−4 .
Hence, l = a + (b − a)/2^4 and u = b − (b − a)/2^4 .
Observe that this interval is always a subset of (a, b), which we know contains θ for sure.
Does (U(4) − 0.5 + [1 − (U(4) − U(1) )]/2^4 , U(1) + 0.5 − [1 − (U(4) − U(1) )]/2^4 ) have any confidence guarantees? Before getting into troublesome calculations, we can check this through simulations.
The following R code generates a barplot of how often the credible interval captures the true parameter across a grid of parameter values. Only the final line of the original function survives in these notes; a reconstructed sketch is given below.
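Here is a minimal sketch of such a coverage simulation, assuming a grid of θ values; the function name coverage, the helper success, and the simulation sizes are our own reconstruction, ending in the surviving return line:
coverage <- function(theta, nsim = 10000) {
  # each row: four draws from Uniform(theta - 0.5, theta + 0.5)
  samples <- matrix(runif(4*nsim, theta - 0.5, theta + 0.5), nrow = nsim)
  success <- function(u) {
    a <- max(u) - 0.5 ; b <- min(u) + 0.5         # region certain to contain theta
    l <- a + (b - a)/2^4 ; up <- b - (b - a)/2^4  # centered 87.5% credible interval
    (theta > l) & (theta < up)
  }
  return(mean(apply(samples,1,success)))
}
thetas <- seq(0, 1, by = 0.1)
barplot(sapply(thetas, coverage), names.arg = thetas,
        xlab = expression(theta), ylab = "estimated coverage")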
Figure 4.2: Estimated coverage of the credible interval across a grid of parameter values.
The result in Figure 4.2 shows that, in this case, the coverage of the credible interval seems to be uniform on the parameter space. This is not guaranteed to always happen! Also, although we constructed an 87.5% credible interval, the picture suggests that (U(4) − 0.5 + [1 − (U(4) − U(1) )]/2^4 , U(1) + 0.5 − [1 − (U(4) − U(1) )]/2^4 ) is somewhere near an 85% confidence interval.
Observe that the wider the gap between U(1) and U(4) , the smaller the region in which θ can lie. In this sense, it would be nice if the interval were smaller the larger this gap is. The example shows that there exist both credible and confidence intervals with this property, but this property isn’t achieved by guaranteeing confidence alone.
Chapter 5
Every time I think I know what’s going on, suddenly there’s another layer of complications. I just want this damned thing solved.
—John Scalzi, The Last Colony
• There can still be problems even when the dimensionality is small. R provides the function area (in the MASS package) and the built-in integrate; however, area cannot deal with infinite bounds in the integral, and even though integrate can handle infinite bounds, it is fragile and often produces output that’s not trustworthy (Robert and Casella, 2010).
The generic problem here is to evaluate Ef [h(X)] = ∫X h(x)f (x) dx. The classical way to solve this is to generate a sample (X1 , . . . , Xn ) from f and propose as an approximation the empirical average
h̄n = (1/n) ∑j=1^n h(xj ).
Why? It can be shown that h̄n converges a.s. (i.e. for almost every generated
sequence) to Ef [h(X)] by the Strong Law of Large Numbers.
Also, under certain assumptions (which we won’t get into, see Casella and
Robert, page 65, for details), the asymptotic variance can be approximated
and then can be estimated from the sample (X1 , . . . , Xn ) by
vn = (1/n^2 ) ∑j=1^n [h(xj ) − h̄n ]^2 .
There are examples in Casella and Robert (2010) along with R code for those
that haven’t seen these methods before or want to review them.
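As a small illustration (ours, not from Casella and Robert), the following R snippet estimates Ef [h(X)] for h(x) = x^2 with X ∼ N (0, 1), where the true value is 1, together with the variance estimate vn above:
set.seed(1)
n <- 100000
x <- rnorm(n)
h <- x^2
(hbar <- mean(h))               # Monte Carlo estimate of E[h(X)] = 1
(vn <- sum((h - hbar)^2)/n^2)   # estimated variance of hbar
sqrt(vn)                        # its standard error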
◯ Importance Sampling
Instead of sampling from f , we may sample X1 , . . . , Xn from another density g and use the importance sampling estimator
Î = (1/n) ∑i h(Xi )f (Xi )/g(Xi ),
whose variance can be estimated by
V̂ar(Î) = (1/n) V̂ar ( h(Xi )f (Xi )/g(Xi ) ).
Example 5.1: Suppose we want to estimate P (X > 5), where X ∼ N (0, 1).
Naive method: Generate n iid standard normals and use the proportion p̂ that
are larger than 5.
Importance sampling: We will sample from a distribution that gives high prob-
ability to the “important region” (the set (5, ∞)) and then reweight.
Solution: Let φo and φθ be the densities of the N (0, 1) and N (θ, 1) distributions (θ taken around 5 will work). We have
p = ∫ I(u > 5)φo (u) du = ∫ [ I(u > 5) φo (u)/φθ (u) ] φθ (u) du.
In other words, if
h(u) = I(u > 5) φo (u)/φθ (u),
then p = Eφθ [h(X)]. If X1 , . . . , Xn ∼ N (θ, 1), then an unbiased estimate is p̂ = (1/n) ∑i h(Xi ).
# Naive method
set.seed(1)
ss <- 100000
x <- rnorm(n=ss)
phat <- sum(x>5)/length(x)
sdphat <- sqrt(phat*(1-phat)/length(x)) # gives 0
# IS method
set.seed(1)
y <- rnorm(n=ss, mean=5)
h <- dnorm(y, mean=0)/dnorm(y, mean=5) * I(y>5)
mean(h) # gives 2.865596e-07
sd(h)/sqrt(length(h)) # gives 2.157211e-09
Example 5.2: Let f (x) be the pdf of a N (0, 1). Assume we want to compute
a = ∫−1^1 f (x) dx = ∫−1^1 N (0, 1) dx.
• Note that if Y ∼ g, then a = E[ I[−1,1] (Y ) f (Y )/g(Y ) ].
• The variance of I[−1,1] (Y ) f (Y )/g(Y ) is minimized by picking g ∝ I[−1,1] (x)f (x). Nevertheless, simulating from this g is usually expensive.
• Some g’s which are easy to simulate from are the pdfs of the Uniform(−1, 1), the Normal(0, 1), and a Cauchy with location parameter 0.
• Below is code for obtaining a sample of I[−1,1] (Y ) f (Y )/g(Y ) for these proposal distributions; only a fragment of the original survives, so the sketch that follows is a reconstruction.
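A minimal sketch, assuming a sample size of 1000 as in Figure 5.1 (the object names are our own):
set.seed(1) ; n <- 1000
f <- function(xx) dnorm(xx, 0, 1)
y.u <- runif(n, -1, 1) ; h.u <- (abs(y.u) < 1) * f(y.u)/dunif(y.u, -1, 1)
y.c <- rcauchy(n, 0)   ; h.c <- (abs(y.c) < 1) * f(y.c)/dcauchy(y.c, 0)
y.n <- rnorm(n, 0, 1)  ; h.n <- (abs(y.n) < 1) * f(y.n)/dnorm(y.n, 0, 1)
c(var(h.u), var(h.c), var(h.n))  # sample variances of the three ratios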
Figure 5.1 presents histograms for a sample of size 1000 from each of these distributions. The sample variance of I[−1,1] (Y ) f (Y )/g(Y ) was, respectively, 0.009, 0.349, and 0.227 (for the uniform, the Cauchy, and the normal).
• Even though the shape of the uniform density is very different from that of f (x), a standard normal, on (−1, 1), the Cauchy and normal proposals put a lot of mass outside of (−1, 1), where the integrand is zero.
• This is why the histograms for the Cauchy and the normal have big bars at 0, and why the variance obtained from the uniform proposal is the lowest.
• How would these results change if we wanted to compute the integral over
the range (−3, 3) instead of (−1, 1)? This is left as a homework exercise.
Figure 5.1: Histograms of samples of I[−1,1] (Y ) f (Y )/g(Y ) when g is, respectively, a uniform, a Cauchy, and a normal pdf.
Often we can sample from a density µ but know the target π(x) only up to a multiplicative constant: π(x) = c ℓ(x)µ(x), where ℓ is known but c is not. The typical example is the Bayesian situation, where µ is the prior, ℓ the likelihood, and π the posterior. Then
∫ h(x)π(x) dx = ∫ h(x) c ℓ(x)µ(x) dx / ∫ µ(x) dx = ∫ h(x) c ℓ(x)µ(x) dx / ∫ c ℓ(x)µ(x) dx = ∫ h(x)ℓ(x)µ(x) dx / ∫ ℓ(x)µ(x) dx,
which, given X1 , . . . , Xn ∼ µ, can be estimated by the weighted average ∑i h(Xi )ℓ(Xi ) / ∑i ℓ(Xi ).
Motivation
Why the choice above for ℓ(X)? One motivation is the following. Suppose we have draws θ1 , . . . , θn from a posterior computed under one prior λ, but we want posterior expectations under another prior ν. Taking ℓ to be the ratio of priors,
– the weights are wi = (ν(θi )/λ(θi )) / ∑j ν(θj )/λ(θj ).
1. If µ and π (i.e., ν and λ) differ greatly, most of the weight will be taken up by a few observations, resulting in an unstable estimate.
2. We can get an estimate of the variance of ∑i h(Xi )ℓ(Xi ) / ∑i ℓ(Xi ), but we need to use theorems from advanced probability theory (the Cramér–Wold device and the multivariate delta method). We’ll skip these details.
3. In the application of Bayesian statistics, the cancellation of a poten-
tially very complicated likelihood can lead to a great simplification.
4. The original purpose of importance sampling was to sample more
heavily from regions that are important. So, we may do importance
sampling using a density µ because it’s more convenient than using a
density π. (These could also be measures if the densities don’t exist
for those taking measure theory).
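As a tiny sketch of this prior-based reweighting (ours, not from the notes; the Beta prior and Binomial likelihood reuse the sleep-example numbers purely for illustration):
set.seed(1)
n <- 10000
theta <- rbeta(n, 1, 1)        # draws from the prior mu = Beta(1, 1)
ell <- dbinom(11, 27, theta)   # known likelihood l(theta): 11 successes in 27
w <- ell / sum(ell)            # self-normalized importance weights
sum(w * theta)                 # approximates the posterior mean 12/29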
◯ Rejection Sampling
Suppose π is a density on the reals and suppose π(x) = c l(x), where l is known but c is not. We are interested in the case where π is complicated. We want to generate X ∼ π.
Suppose first that l is bounded and is zero outside of [0, 1]. Suppose also l is constant on the intervals ((j − 1)/k, j/k), j = 1, . . . , k. Let M be such that M ≥ l(x) for all x. Procedure:
1. Generate a point (U1 , U2 ) uniformly at random in the rectangle [0, 1] × [0, M ].
2. If the point is below the graph of the function l (i.e., U2 ≤ l(U1 )), retain U1 . Else, reject the point and go back to (1).
Remark: Think about what this is doing: we’re generating many draws that waste time, since they get rejected. Think also about the restriction to [0, 1] and whether this makes sense.
General Case:
Suppose the density g is such that for some known constant M , M g(x) ≥ l(x) for all x. Procedure:
1. Generate X ∼ g, and calculate r(X) = l(X)/(M g(X)).
2. Flip a coin with probability of success r(X). If we have a success, retain X. Else return to (1).
To show that an accepted point has distribution π, let I = the indicator that the point is accepted. Then
P (I = 1) = ∫ P (I = 1 ∣ X = x)g(x) dx = ∫ [l(x)/(M g(x))] g(x) dx = ∫ π(x)/(cM ) dx = 1/(cM ).
Similarly, P (X ≤ t, I = 1) = ∫−∞^t π(x)/(cM ) dx, so P (X ≤ t ∣ I = 1) = ∫−∞^t π(x) dx; that is, an accepted draw has density π.
Example 5.3: Suppose we want to generate random variables from the Beta(5.5, 5.5) distribution. Note: there are no simple direct methods (e.g., transformations of uniforms) for generating from Beta(a, b) when a, b are not integers.
##Doing accept-reject
##substance of code
## (the values of a, b, m, s below are assumptions; they were lost in
## extraction, but these choices make 1.3*dnorm(x, m, s) an envelope
## for the Beta density and match the acceptance rate reported below)
a <- 5.5; b <- 5.5
m <- 0.5; s <- 0.2
set.seed(1); nsim <- 1e5
x <- rnorm(n=nsim, mean=m, sd=s)
u <- runif(n=nsim)
ratio <- dbeta(x, shape1=a, shape2=b) /
(1.3*dnorm(x, mean=m, sd=s))
ind <- I(u < ratio)
betas <- x[ind==1]
# as a check to make sure we have enough
length(betas) # gives 76836
Figure: the Beta(5.5, 5.5) target with its scaled normal envelope, and the estimated density of the accepted draws (betas).
The main idea here involves iterative simulation. We sample values of a random variable from a sequence of distributions that converge, as iterations continue, to a target distribution. The simulated values are generated by a Markov chain whose stationary distribution is the target distribution, i.e., the posterior distribution.
Geman and Geman (1984) introduced Gibbs sampling for simulating a multivariate probability distribution p(x) using a random walk on a vector x, where p(x) is not necessarily a posterior density.
• If n0 is large, then Xn0 , Xn0 +1 , . . . all have approximately the distribution π, and these can be used to estimate π and ∫ h(x)π(x) dx.
Two points: 1.) P (x, ⋅) is called a Markov transition function, and 2.) the Markov property says “where I’m going next only depends on where I am right now.”
Coming back to the MCMC method, we fix a starting point xo and generate an observation X1 from P (xo , ⋅), an observation X2 from P (X1 , ⋅), etc. This generates the Markov chain Xo = xo , X1 , X2 , . . . .
• For a given Markov kernel K, there may exist a distribution f such that ∫ K(x, y)f (x) dx = f (y); such an f is called a stationary distribution.
The theory of Markov chains provides various results about the existence and uniqueness of stationary distributions, but such results are beyond the scope of this course. However, one specific result is that under fairly general conditions that are typically satisfied in practice, if a stationary distribution f exists, then f is the limiting distribution of {X (t) } for almost any initial value or distribution of X (0) . This property is called ergodicity. From a simulation point of view, it means that if a given kernel K produces an ergodic Markov chain with stationary distribution f , generating a chain from this kernel will eventually produce simulations that are approximately from f.
Moreover,
(1/M ) ∑t=1^M h(X (t) ) Ð→ Ef [h(X)].
This means that the LLN, which lies at the basis of Monte Carlo methods, can also be applied in MCMC settings. The result shown above is called the Ergodic Theorem.
Now we turn to Gibbs. The name Gibbs sampling comes from a paper by Geman and Geman (1984), which first applied a Gibbs sampler on a Gibbs random field; the name stuck from there. It’s actually a special case of a more general Markov chain Monte Carlo (MCMC) method called Metropolis-Hastings, which we will hopefully get to. We’ll start by studying the simple case of the two-stage sampler and then look at the multi-stage sampler.
The two-stage Gibbs sampler creates a Markov chain from a joint distribution.
Suppose we have two random variables X and Y with joint density f (x, y).
They also have respective conditional densities fY ∣X and fX∣Y . The two-stage
sampler generates a Markov chain {(Xt , Yt )} according to the following steps:
1. Xt ∼ fX∣Y (⋅∣yt−1 )
2. Yt ∼ fY ∣X (⋅∣xt ).
As long as we can write down both conditionals (and simulate from them), it is
easy to implement the algorithm above.
Consider first the standard bivariate normal
(X, Y ) ∼ N2 ( (0, 0), [ 1 ρ ; ρ 1 ] ).
More generally, if
(X, Y ) ∼ N2 ( (µX , µY ), [ σX2 ρσX σY ; ρσX σY σY2 ] ),
then
Y ∣X = x ∼ N ( µY + ρ (σY /σX )(x − µX ), σY2 (1 − ρ²) ).
Suppose we run the Gibbs sampler for the standard bivariate normal given the starting point (x0 , y0 ). Since this is a toy example, let’s suppose we only care about X. Note that we don’t really need both components of the starting point, since if we pick x0 , we can generate Y0 from fY ∣X (⋅∣x0 ). Here Y0 ∣x0 ∼ N (ρx0 , 1 − ρ²) and X1 ∣Y0 = y0 ∼ N (ρy0 , 1 − ρ²), so by iterated expectation and variance,
E[X1 ] = E[ E(X1 ∣Y0 ) ] = E[ρY0 ] = ρ²x0
and
Var[X1 ] = E[ Var(X1 ∣Y0 ) ] + Var[ E(X1 ∣Y0 ) ] = (1 − ρ²) + ρ²(1 − ρ²) = 1 − ρ⁴.
Then
X1 ∼ N (ρ²x0 , 1 − ρ⁴).
We want the unconditional distribution of X2 eventually, so we need to update (X2 , Y2 ). We need Y1 so we can generate X2 ∣Y1 = y1 . Using the conditional distribution formula, Y1 ∣X1 = x1 ∼ N (ρx1 , 1 − ρ²). Then, using iterated expectation and iterated variance again, we can show that
X2 ∼ N (ρ⁴x0 , 1 − ρ⁸).
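A minimal R sketch of this two-stage sampler (ours, not from the notes), which can be used to check that the X draws settle into the N (0, 1) stationary marginal:
set.seed(1) ; rho <- 0.9 ; S <- 5000
x <- y <- numeric(S)
x[1] <- 0 ; y[1] <- rnorm(1, rho*x[1], sqrt(1 - rho^2))
for (t in 2:S) {
  x[t] <- rnorm(1, rho*y[t-1], sqrt(1 - rho^2))  # X | Y = y
  y[t] <- rnorm(1, rho*x[t],   sqrt(1 - rho^2))  # Y | X = x
}
c(mean(x), var(x))  # compare with the stationary N(0, 1) marginal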
For the Beta-Binomial model (X∣θ ∼ Bin(nn, θ), θ ∼ Beta(aa, bb)), the two-stage Gibbs sampler alternates between the two conditionals:
# (the function header and initialization are a reconstruction; only
#  the loop body and return statement survived extraction)
beta_bin_gibbs <- function(nsim, nn, aa, bb)
{
xx <- tt <- numeric(nsim)
tt[1] <- rbeta(1, aa, bb) ; xx[1] <- rbinom(1, nn, tt[1])
for(ii in 2:nsim)
{
tt[ii] <- rbeta(1,aa+xx[ii-1],bb+nn-xx[ii-1])
xx[ii] <- rbinom(1,nn,tt[ii])
}
return(list(beta_bin=xx,beta=tt))
}
Figure 5.5: Rootogram of the Gibbs sample from the Beta-Binomial distribution (values 0 through 15).
Figure 5.5 presents the rootogram for the Gibbs sample for the Beta-Binomial
distribution. Similarly, Figure 5.6 shows the same for the marginal distribution
of θ obtained through the following commands:
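A minimal sketch, since the original commands were not preserved; it uses the beta_bin_gibbs reconstruction above with assumed parameter values (nn = 15 to match the rootogram; aa and bb are placeholders):
sample <- beta_bin_gibbs(5000, nn = 15, aa = 2, bb = 4)  # parameter values assumed
plot(density(sample$beta), xlab = expression(theta),
     ylab = "Marginal Density", main = "")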
Figure 5.6: Estimated marginal density of θ.
Example 5.6: Consider the posterior on (θ, σ 2 ) associated with the following model:
Xi ∣θ ∼ N (θ, σ 2 ), i = 1, . . . , n,
θ ∼ N (θo , τ 2 ),
σ 2 ∼ InverseGamma(a, b),
where θo , τ 2 , a, b are known. Recall that if X ∼ InverseGamma(a, b), then p(x) = (b^a /Γ(a)) x^−(a+1) e^(−b/x) .
By conjugacy, the full conditionals are
θ∣σ 2 , x ∼ N ( (nx̄τ 2 + θo σ 2 )/(nτ 2 + σ 2 ), σ 2 τ 2 /(nτ 2 + σ 2 ) ),
σ 2 ∣θ, x ∼ InverseGamma ( a + n/2, b + (1/2) ∑i (xi − θ)2 ).
The Gibbs sampler for these conditional distributions can be coded in R as follows:
# X|Theta=tt,Sigma2=ss ~ Normal(tt,ss)
#
# returns a list gibbs_sample
# gibbs_sample$theta : sample from the marginal distribution of Theta|X=xx
# gibbs_sample$sigma2: sample from the marginal distribution of Sigma2|X=xx
# (the function header, initialization, and theta update are a
#  reconstruction; only the sigma2 update survived extraction)
gibbs_gaussian <- function(nsim, xx, theta0, tau2, aa, bb)
{
nn <- length(xx) ; xbar <- mean(xx) ; RSS <- sum((xx-xbar)^2)
theta <- sigma2 <- numeric(nsim)
theta[1] <- xbar ; sigma2[1] <- var(xx)
post_sigma_shape <- nn/2 + aa
for(ii in 2:nsim)
{
new_post_sigma_rate <- (1/2)*(RSS+ nn*(xbar-theta[ii-1])^2) + bb
sigma2[ii] <- 1/rgamma(1,shape=post_sigma_shape,
rate=new_post_sigma_rate)
post_var <- 1/(nn/sigma2[ii] + 1/tau2)
theta[ii] <- rnorm(1, post_var*(nn*xbar/sigma2[ii] + theta0/tau2),
sqrt(post_var))
}
return(list(theta=theta,sigma2=sigma2))
}
The histograms in Figure 5.7 for the posterior for θ and σ 2 are obtained as
follows:
library(mcsm)
data(Energy)
gibbs_sample <- gibbs_gaussian(5000,log(Energy[,1]),5,10,3,3)
par(mfrow=c(1,2))
hist(gibbs_sample$theta,xlab=expression(theta~"|X=x"),main="")
hist(sqrt(gibbs_sample$sigma2),xlab=expression(sigma~"|X=x"),main="")
Figure 5.7: Posterior histograms of θ∣X = x (left) and σ∣X = x (right).
There is a natural extension from the two-stage Gibbs sampler to the general multistage Gibbs sampler. Suppose that for p > 1, we can write the random variable X = (X1 , . . . , Xp ), where the Xi ’s are either unidimensional or multidimensional components. Suppose that we can simulate from the corresponding conditional densities f1 , . . . , fp . That is, we can simulate
Xi ∣x1 , . . . , xi−1 , xi+1 , . . . , xp ∼ fi (xi ∣x1 , . . . , xi−1 , xi+1 , . . . , xp ), i = 1, . . . , p.
The densities f1 , . . . , fp are called the full conditionals, and a particular feature of the Gibbs sampler is that these are the only densities used for simulation. Hence, even for high-dimensional problems, all of the simulations may be univariate, which is a major advantage.
Example 5.7: (Casella and Robert, p. 207) Consider the following model:
ind
Xij ∣θi , σ 2 ∼ N (θi , σ 2 ) 1 ≤ i ≤ k, 1 ≤ j ≤ ni
iid
θi ∣µ, τ 2 ∼ N (µ, τ 2 )
µ∣σµ2 ∼ N (µ0 , σµ2 )
σ 2 ∼ IG(a1 , b1 )
τ 2 ∼ IG(a2 , b2 )
σµ2 ∼ IG(a3 , b3 )
The conditional independencies in this example can be visualized by the Bayesian
Network in Figure 5.8. Using these conditional independencies, we can compute
the complete conditional distributions for each of the variables as
θi ∼ N ( [σ 2 /(σ 2 + ni τ 2 )] µ + [ni τ 2 /(σ 2 + ni τ 2 )] X̄i , σ 2 τ 2 /(σ 2 + ni τ 2 ) ),
µ ∼ N ( [τ 2 /(τ 2 + kσµ2 )] µ0 + [kσµ2 /(τ 2 + kσµ2 )] θ̄, σµ2 τ 2 /(τ 2 + kσµ2 ) ),
σ 2 ∼ IG ( ∑i ni /2 + a1 , (1/2) ∑i,j (Xi,j − θi )2 + b1 ),
where θ̄ = ∑i ni θi / ∑i ni .
Figure 5.8: Bayesian Network for Example 5.7, with nodes µ, τ 2 , σµ2 , σ 2 , θi , and Xij .
Figure: Posterior histograms for µ, θ1 , θ2 (top row) and σµ2 , τ 2 , σ 2 (bottom row).
Example 5.8: A genetic model specifies that 197 animals are distributed multinomially into four categories, with cell probabilities given by
( 1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ).
The actual observations are y = (125, 18, 20, 34). We want to estimate θ.
Suppose we have two factors, call them α and β (say eye color and leg length).
• Each comes at two levels: α comes in levels A and a, and β comes in levels
B and b.
• Suppose A is dominant, a is recessive; also B dominant, b recessive.
• Suppose further that P (A) = 1/2 = P (a) [and similarly for the other
factor].
• Now suppose that the two factors are related: P (B∣A) = 1−η and P (b∣A) =
η.
• Similarly, P (B∣a) = η and P (b∣a) = 1 − η.
Then
P (Father is AB) = P (A)P (B∣A) = (1/2)(1 − η),
P (Mother is AB) = P (A)P (B∣A) = (1/2)(1 − η),
P (O.S. is AB) = [(1/2)(1 − η)]² = (1/4)(1 − η)².
What now?
Suppose we put the prior Beta(a, b) on θ. How do we get the posterior?
Split first cell into two cells, one with probability 1/2, the other with probability
θ/4.
• The conditional distribution of X1 ∣ θ (and given the data) is Bin(125, (1/2)/(1/2 + θ/4)).
• The conditional distribution of θ given the augmented data is Beta(a + 125 − X1 + X5 , b + X3 + X4 ).
set.seed(1)
a <- 1; b <- 1
z <- c(125,18,20,34)
x <- c(z[1]/2, z[1]/2, z[2:4])
nsim <- 50000 # runs in about 2 seconds on 3.8GHz P4
theta <- rep(a/(a+b), nsim)
for (j in 1:nsim)
{
theta[j] <- rbeta(n=1, shape1=a+125-x[1]+x[5],
shape2=b+x[3]+x[4])
x[1] <- rbinom(n=1, z[1], (2/(2+theta[j])))
}
mean(theta) # gives 0.623
pdf(file="post-dist-theta.pdf",
horiz=F, height=5.0, width=5.0)
plot(density(theta), xlab=expression(theta), ylab="",
main=expression(paste("Post Dist of ", theta)))
dev.off()
eta <- 1 - sqrt(theta) # Variable of actual interest
plot(density(eta))
sum(eta > .4)/nsim # gives 0
Figure: Posterior density of θ.
We will want to check any chain that we run to assess any lack of convergence. Quick checks:
• Trace plots: a time series plot of the parameters of interest; indicates how quickly the chain is mixing, or its failure to mix.
• Autocorrelation plots.
• Plots of log posterior densities – used mostly in high dimensional problems.
• Multiple starting points – a diagnostic to detect problems when we obtain different estimates from multiple (different) starting values.
Figure: Trace plots for four parameters (λ2.9111 , λ2.9145 , λ2.9378 , λ2.9248 ).
Figure: Autocorrelation plot (correlation versus lag).
Gelman-Rubin
• The idea is that if we run several chains, the behavior of the chains should be basically the same.
• Check informally using trace plots.
• Check using the Gelman-Rubin diagnostic – but it can fail like any test.
• Suggestion: the Geweke diagnostic is more robust when normality fails.
◯ PlA2 Example
i 1 2 3 4 5 6
ψ̂i 1.06 -0.10 0.62 0.02 1.07 -0.02
σi 0.37 0.11 0.22 0.11 0.12 0.12
i 7 8 9 10 11 12
ψ̂i -0.12 -0.38 0.51 0.00 0.38 0.40
σi 0.22 0.23 0.18 0.32 0.20 0.25
Setup:
• Twelve studies were run to investigate the potential link between presence
of a certain genetic trait and risk of heart attack.
• For each study i (i = 1, ⋯, 12) the proportion having the genetic trait in
each group was recorded.
• For each study, a log odds ratio, ψ̂i , and standard error, σi , were calcu-
lated.
Let ψi represent the true log odds ratio for study i. Then a typical hierarchical model would look like:
ψ̂i ∣ ψi ∼ ind N (ψi , σi2 ), i = 1, . . . , 12,
ψi ∣ µ, τ ∼ iid N (µ, τ 2 ), i = 1, . . . , 12,
(µ, τ ) ∼ ν.
The likelihood of (µ, τ ) integrates out the ψi :
L(µ, τ ) = ∫ ⋯ ∫ ∏i=1^12 Nψi ,σi (ψ̂i ) ∏i=1^12 Nµ,τ (ψi ) dψ1 . . . dψ12 .
Remark: The reason for taking this prior is that it is conjugate for the normal
distribution with both mean and variance unknown (that is, it is conjugate for
the model in which the ψi ’s are observed).
We will use the notation NIG(a, b, c, d) to denote this prior. Taking a = .1, b
= .1, c = 0, and d = 1000 gives a flat prior.
π(µ, τ ∣ ψ̂) = L(µ, τ )p(µ, τ ) / ∫ L(µ, τ )p(µ, τ ) dµ dτ.
We have a choice:
• Select a model that doesn’t fit the data well but gives answers that are easy to obtain, i.e., in closed form; or
• select a more realistic model whose answers must be computed numerically.
MCMC methods often allow us (in many cases) to make the second choice.
a′ = a + n/2, b′ = b + (1/2) ∑i (Xi − X̄)2 + n(X̄ − c)2 /(2(1 + nd)),
and
c′ = (c + ndX̄)/(nd + 1), d′ = 1/(n + d−1 ).
• In order to clarify what we are doing, we use the notation that subscripting a distribution by a random variable denotes conditioning.
• Thus, if U and V are two random variables, L(U ∣V ) and LV (U ) will both denote the conditional distribution of U given V.
• Given the ψ’s, the data are superfluous, i.e., Lψ̂ (µ, τ ∣ ψ) = L(µ, τ ∣ ψ). This conditional distribution is given by the conjugacy of the Normal / Inverse gamma prior: L(µ, τ ∣ ψ) = NIG(a′ , b′ , c′ , d′ ), where
a′ = a + n/2, b′ = b + (1/2) ∑i (ψi − ψ̄)2 + n(ψ̄ − c)2 /(2(1 + nd)),
and
c′ = (c + ndψ̄)/(nd + 1), d′ = 1/(n + d−1 ).
ψi ∣ µ, τ 2 ∼ iid N (µ, τ 2 ), i = 1, . . . , 12,
µ ∣ τ 2 ∼ N (0, 1000τ 2 ).
The corresponding JAGS model is:
model {
for (i in 1:N) {
psihat[i] ~ dnorm(psi[i],1/(sigma[i])^2)
psi[i] ~ dnorm(mu,1/tau^2)
}
mu ~ dnorm(0,1/(1000*tau^2))
tau <- 1/sqrt(gam)
gam ~ dgamma(0.1,0.1)
}
The data are:
"N" <- 12
"psihat" <- c(1.055, -0.097, 0.626, 0.017, 1.068,
-0.025, -0.117, -0.381, 0.507, 0, 0.385, 0.405)
"sigma" <- c(0.373, 0.116, 0.229, 0.117, 0.471,
0.120, 0.220, 0.239, 0.186, 0.328, 0.206, 0.254)
Now, we read in the coda files into R from the current directory and continue our
analysis. The first part of our analysis will consist of some diagnostic procedures.
We will consider
• Autocorrelation Plots
• Trace Plots
• Gelman-Rubin Diagnostic
• Geweke Diagnostic
We take the thin value to be the first lag whose correlation ≤ 0.2. For this plot,
we take a thin of 2. We will go back and rerun our JAGS script and skip every
other value in each chain. After thinning, we will proceed with other diagnostic
procedures of interest.
Figure: Autocorrelation plot for µ (correlation versus lag).
Definition: A trace plot is a time series plot of the parameter, say µ, that we
monitor as the Markov chain(s) proceed(s).
Figure: Trace plot of µ across iterations.
• Run two chains in JAGS using two different sets of initial values (and two
different seeds).
The Gelman-Rubin diagnostic output from coda is:
Point est. 97.5% quantile
mu 1 1
psi[1] 1 1
psi[2] 1 1
...
psi[11] 1 1
psi[12] 1 1
gam 1 1
Since 1 is contained in all of the 95% intervals, we find no evidence of a failure to converge.
Figure: Posterior density estimate.
• So, here we’re looking at the odds ratios comparing the odds of heart disease for those who have the genetic trait to the odds for those who don’t. Note that all estimates are pulled toward the mean, showing a Bayesian Stein effect.
• exp(ψi ) is the odds ratio of having a heart attack for those who have the genetic trait versus those who don’t (looking at study i).
Figure: Posterior densities of exp(ψ1 ), exp(ψ2 ), exp(ψ3 ), and exp(ψ4 ).
Moreover, we could just as easily have done this analysis in WinBUGS. Below is the corresponding model code:
model{
for (i in 1:N) {
psihat[i] ~ dnorm(psi[i],rho[i])
psi[i] ~ dnorm(mu,gam)
rho[i] <- 1/pow(sigma[i],2)
}
mu ~ dnorm(0,gamt)
gam ~ dgamma(0.1,0.1)
gamt <- gam/1000
}
Finally, we can either run the analysis using WinBUGS or JAGS and R. I will
demonstrate how to do this using JAGS for this example. I have included the
basic code to run this on a Windows machine via WinBUGS. Both methods
yield essentially the same results.
setwd("C:/Documents and Settings/Tina Greenly
/Desktop/beka_winbugs/novartis/pla2")
library(R2WinBUGS)
pla2 <- read.table("pla2_data.txt",header=T)
attach(pla2)
names(pla2)
N<-length(psihat)
data <- list("psihat", "sigma", "N")
But what if we cannot sample directly from p(θ∣y)? The important concept here is that we are still able to construct a large collection of θ values whose empirical distribution approximates the posterior (rather than requiring them to be iid draws from it, which for most realistic situations will not hold). Thus, for any two different θ values θa and θb , we need the relative frequency of θa to θb in the collection to approximate p(θa ∣y)/p(θb ∣y).
• If p(θ∗ ∣y) > p(θ(s) ∣y), then we want more θ∗ ’s in the set than θ(s) ’s.
Based on the above, perhaps our decision to include θ∗ or not should be based
upon a comparison of p(θ∗ ∣y) and p(θ(s) ∣y). We can do this by computing r:
r = p(θ∗ ∣y) / p(θ(s) ∣y).
• If p(θ∗ ∣y) < p(θ(s) ∣y) (so r < 1), then for every instance of θ(s) , we should have only a fraction r of an instance of a θ∗ value.
This is the basic intuition behind the Metropolis (1953) algorithm. More formally, given the current state θ(s) , it proceeds as follows:
1. Sample a proposal θ∗ ∼ J(θ∗ ∣θ(s) ), where the proposal distribution J is symmetric.
2. Compute the acceptance ratio r = p(θ∗ ∣y)/p(θ(s) ∣y).
3. Let
θ(s+1) = θ∗ with probability min(r, 1), and θ(s+1) = θ(s) otherwise.
Example 5.9: (Normal model with known variance) That is, let
X1 , . . . , Xn ∣ θ ∼ iid Normal(θ, σ 2 ),
θ ∼ Normal(µ, τ 2 ).
The posterior is θ∣x1 , . . . , xn ∼ Normal(µn , τn2 ), where
µn = x̄ [n/σ 2 ]/[n/σ 2 + 1/τ 2 ] + µ [1/τ 2 ]/[n/σ 2 + 1/τ 2 ]
and
τn2 = 1/[n/σ 2 + 1/τ 2 ].
Suppose that for some ridiculous reason we cannot come up with the posterior distribution and instead we need the Metropolis algorithm to approximate it (please note how incredibly silly this example is; it’s just to illustrate the method).
Based on this model and prior, the acceptance ratio is
r = [ ∏i dnorm(xi , θ∗ , σ) × dnorm(θ∗ , µ, τ ) ] / [ ∏i dnorm(xi , θ(s) , σ) × dnorm(θ(s) , µ, τ ) ].
In practice we work on the log scale:
log r = ∑i [ log dnorm(xi , θ∗ , σ) − log dnorm(xi , θ(s) , σ) ] + log dnorm(θ∗ , µ, τ ) − log dnorm(θ(s) , µ, τ ).
Then a proposal is accepted if log u < log r, where u is a sample from the Uniform(0,1).
The R code below generates 10,000 iterations of the Metropolis algorithm starting at θ(0) = 0 and using a normal proposal distribution, θ∗ ∼ Normal(θ(s) , δ) with δ = 2 (the proposal variance, an assumption consistent with the code below).
Below is R-code for running the above model. Figure 5.14 shows a trace plot for
this run as well as a histogram for the Metropolis algorithm compared with a
draw from the true normal density. From the trace plot, although the value of
θ does not start near the posterior mean of 10.03, it quickly arrives there after
just a few iterations. The second plot shows that the empirical distribution of
the simulated values is very close to the true posterior distribution.
Figure 5.14: Results from the Metropolis sampler for the normal model.
s2<-1
t2<-10 ; mu<-5; set.seed(1); n<-5; y<-round(rnorm(n,10,1),2)
mu.n<-( mean(y)*n/s2 + mu/t2 )/( n/s2+1/t2)
t2.n<-1/(n/s2+1/t2)
####metropolis part####
y<-c(9.37, 10.18, 9.16, 11.60, 10.33)
##S = total num of simulations
theta<-0 ; delta<-2 ; S<-10000 ; THETA<-NULL ; set.seed(1)
for(s in 1:S)
{
## proposal and log acceptance ratio (these two steps were lost in
## extraction; reconstructed to match the algorithm described above)
theta.star<-rnorm(1,theta,sqrt(delta))
log.r<-sum( dnorm(y,theta.star,sqrt(s2),log=TRUE) -
dnorm(y,theta,sqrt(s2),log=TRUE) ) +
dnorm(theta.star,mu,sqrt(t2),log=TRUE) -
dnorm(theta,mu,sqrt(t2),log=TRUE)
if(log(runif(1))<log.r) { theta<-theta.star }
##updating THETA
THETA<-c(THETA,theta)
}
pdf("metropolis_normal.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
skeep<-seq(10,S,by=10)
plot(skeep,THETA[skeep],type="l",xlab="iteration",ylab=expression(theta))
hist(THETA[-(1:50)],prob=TRUE,main="",xlab=expression(theta),ylab="density")
th<-seq(min(THETA),max(THETA),length=100)
lines(th,dnorm(th,mu.n,sqrt(t2.n)) )
dev.off()
◯ Metropolis-Hastings Algorithm
The Gibbs sampler and the Metropolis algorithm are both ways of generating
Markov chains that approximate a target probability distribution.
What does the Gibbs sampler have us do? It has us iteratively sample values of U and V from their full conditional distributions. That is,
1. update U ∶ sample u(s+1) ∼ po (u ∣ v (s) );
2. update V ∶ sample v (s+1) ∼ po (v ∣ u(s+1) ).
The Metropolis-Hastings algorithm instead proceeds as follows:
1. update U ∶
(a) sample u∗ ∼ Ju (u ∣ u(s) , v (s) )
(b) compute
r = [ po (u∗ , v (s) ) / po (u(s) , v (s) ) ] × [ Ju (u(s) ∣ u∗ , v (s) ) / Ju (u∗ ∣ u(s) , v (s) ) ]
(c) set u(s+1) equal to u∗ with probability min(1, r) and to u(s) otherwise.
2. update V ∶
(a) sample v ∗ ∼ Jv (v ∣ u(s+1) , v (s) )
(b) compute
r = [ po (u(s+1) , v ∗ ) / po (u(s+1) , v (s) ) ] × [ Jv (v (s) ∣ u(s+1) , v ∗ ) / Jv (v ∗ ∣ u(s+1) , v (s) ) ]
(c) set v (s+1) equal to v ∗ with probability min(1, r) and to v (s) otherwise.
In the above algorithm, the proposal distributions Ju and Jv are not required to be symmetric. The only requirement is that they not depend on U or V values in our sequence previous to the most current values. This requirement ensures that the sequence is a Markov chain.
Doesn’t the algorithm above look familiar? Yes, it looks a lot like Metropolis, except the acceptance ratio r contains an extra factor:
• It contains the ratio of the probability of generating the current value from the proposed one to the probability of generating the proposed value from the current one.
• This can be viewed as a correction factor.
• If a value u∗ is much more likely to be proposed than the current value u(s) , then we must down-weight the probability of accepting u∗ .
Exercise 1: Show that Metropolis is a special case of MH. Hint: Think about
the jumps J.
Exercise 2: Show that Gibbs is a special case of MH. Hint: Show that r = 1.
Now back to the problem of implementing Metropolis. For this problem, where Yi is a count (the number of offspring of sparrow i), we will write
log E(Yi ∣xi ) = β1 + β2 xi + β3 x2i = β T xi ,
where xi is the age of sparrow i. We will abuse notation slightly and write xi = (1, xi , x2i ).
Remark: Note we add 1/2 to the counts before taking logs because otherwise log 0 is undefined. Code implementing the algorithm is included below.
Figure: Trace plot and autocorrelation functions for β3 .
## multivariate normal sampler via the Cholesky factor (the header and
## first lines were lost in extraction; reconstructed here)
rmvnorm<-function(n,mu,Sigma)
{
p<-length(mu)
res<-matrix(0,nrow=n,ncol=p)
if( n>0 & p>0 ) {
E<-matrix(rnorm(n*p),n,p)
res<-t( t(E%*%chol(Sigma)) +c(mu))
}
res
}
## (initialization, proposal, and accept/reject lines reconstructed
## around the surviving fragment)
p<-ncol(X)            # X assumed to hold the (1, x, x^2) columns
pmn.beta<-rep(0,p)    # prior mean for beta
psd.beta<-rep(10,p)   # prior sd for beta
var.prop<-var(log(y+1/2))*solve(t(X)%*%X)  # proposal covariance (see remark above)
S<-10000 ; beta<-rep(0,p) ; BETA<-matrix(0,nrow=S,ncol=p) ; ac<-0 ; set.seed(1)
for(s in 1:S) {
beta.p<-t(rmvnorm(1,beta,var.prop))        # propose (reconstructed)
lhr<- sum(dpois(y,exp(X%*%beta.p),log=T)) -
sum(dpois(y,exp(X%*%beta),log=T)) +
sum(dnorm(beta.p,pmn.beta,psd.beta,log=T)) -
sum(dnorm(beta,pmn.beta,psd.beta,log=T))
if( log(runif(1))<lhr ) { beta<-beta.p ; ac<-ac+1 }  # accept/reject (reconstructed)
BETA[s,]<-beta
}
cat(ac/S,"\n")
#######
library(coda)
apply(BETA,2,effectiveSize)
####
pdf("sparrow_plot1.pdf",family="Times",height=1.75,width=5)
par(mar=c(2.75,2.75,.5,.5),mgp=c(1.7,.7,0))
par(mfrow=c(1,3))
blabs<-c(expression(beta[1]),expression(beta[2]),expression(beta[3]))
thin<-c(1,(1:1000)*(S/1000))
j<-3
plot(thin,BETA[thin,j],type="l",xlab="iteration",ylab=blabs[j])
abline(h=mean(BETA[,j]) )
acf(BETA[,j],ci.col="gray",xlab="lag")
acf(BETA[thin,j],xlab="lag/10",ci.col="gray")
dev.off()
####
In complex models, it is often the case that the conditional distributions are
available for some parameters but not for others. What can we do then? In these
situations we can combine Gibbs and Metropolis-type proposal distributions to
generate a Markov chain to approximate the joint posterior distribution of all
the parameters.
Analyses of ice cores from East Antarctica have allowed scientists to deduce historical atmospheric conditions over the last several hundred thousand years (Petit et al., 1999). Figure 5.18 plots time series of temperature and carbon dioxide concentration on a standardized scale (centered and scaled to have mean zero and variance one).
Figure 5.18: Standardized time series of temperature and CO2 , and temperature plotted against CO2 .
• The plot indicates the temporal history of temperature and CO2 follow
very similar patterns.
• The second plot in Figure 5.18 indicates that CO2 concentration at a given
time is predictive of temperature following that time point.
• We can quantify this using a linear regression model for temperature (Y ) as a function of CO2 (x).
• The validity of the standard error relies on the error terms in the regression
model being iid and standard confidence intervals further rely on the errors
being normally distributed.
• These two assumptions are examined in the two residual diagnostic plots
in Figure 5.19.
• The first plot shows a histogram of the residuals and indicates no serious deviation from normality.
• The second plot gives the autocorrelation function of the residuals, in-
dicating a nontrivial correlation of 0.52 between residuals at consecutive
time points.
• Such a positive correlation generally implies there is less information in
the data and less evidence for a relations between the two variables than
is assumed by the OLS regression analysis.
Figure 5.19: Histogram and autocorrelation function of the OLS residuals for the temperature and carbon dioxide data.
The OLS analysis corresponds to the model Y ∼ N (Xβ, σ 2 I). The diagnostic plots suggest that a more appropriate model for the ice core data is one in which the error terms are not independent, but temporally correlated. We will replace σ 2 I with a covariance matrix Σ that can represent the positive correlation between sequential observations. One simple, popular class of covariance matrices for temporally correlated data are those having first-order autoregressive structure:
Σ = σ 2 Cρ , where Cρ has (i, j) entry ρ^∣i−j∣ :
Cρ = ( 1 ρ ρ² ⋯ ρ^(n−1) ; ρ 1 ρ ⋯ ρ^(n−2) ; ρ² ρ 1 ⋯ ; ⋮ ⋮ ⋱ ; ρ^(n−1) ρ^(n−2) ⋯ 1 ).
Under this covariance matrix the variance of Yi ∣β, xi is σ 2 , but the correlation between Yi and Yi+t is ρ^t . Using a multivariate normal prior on β and an inverse gamma prior on σ 2 (it is left as an exercise to show that)
β ∣ X, y, σ 2 , ρ ∼ N (βn , Σn ),
σ 2 ∣ X, y, β, ρ ∼ IG( (νo + n)/2, [νo σo2 + SSRρ ]/2 ),
where βn = Σn (X T Cρ−1 y/σ 2 + Σo−1 βo ), Σn = (X T Cρ−1 X/σ 2 + Σo−1 )−1 , and SSRρ = (y − Xβ)T Cρ−1 (y − Xβ).
• If ρ were known this would be the generalized least squares (GLS) estimate
of β.
• This is a type of weighted LS estimate that is used when the error terms are
not iid. In such situations, both OLS and GLS provide unbiased estimates
of β but the GLS has lower variance.
• Bayesian analysis using a model that accounts for correlation errors pro-
vides parameter estimates that are similar to those of GLS, so for conve-
nience we will refer to our analysis as “Bayesian GLS.”
We can use the generality of the MH algorithm. Recall we are allowed to use
different proposals at each step. We can iteratively update β, σ 2 , and ρ at
different steps (using Gibbs proposals). That is:
• We will make proposals for β and σ 2 using the full conditionals and
• make a symmetric proposal for ρ.
• Following the rules of MH, we accept with probability 1 any proposal coming from a full conditional distribution, whereas we have to calculate an acceptance probability for proposals of ρ.
One scan of the algorithm proceeds as follows:
1. Update β: Sample β (s+1) ∼ N (βn , Σn ), where βn and Σn are computed using σ 2(s) and ρ(s) .
2. Update σ 2 : Sample σ 2(s+1) ∼ IG( (νo + n)/2, [νo σo2 + SSRρ ]/2 ), where SSRρ depends on β (s+1) and ρ(s) .
3. Update ρ:
(a) Propose ρ∗ ∼ Uniform(ρ(s) − δ, ρ(s) + δ), reflecting off the boundaries so that 0 < ρ∗ < 1.
(b) Accept ρ∗ with probability min(1, r), where r is the MH ratio; otherwise set ρ(s+1) = ρ(s) .
The proposal used in Step 3(a) is called a reflecting random walk, which ensures that 0 < ρ < 1. Note that a sequence of MH steps in which each parameter is updated is often referred to as a scan of the algorithm.
For convenience and ease, we’re going to use diffuse priors for the parameters, with βo = 0, Σo = diag(1000), νo = 1, and σo2 = 1. Our prior on ρ will be Uniform(0,1). We first run 1000 iterations of the MH algorithm and show a trace plot of ρ as well as an autocorrelation plot (Figure 5.20).
Suppose now we want to generate 25,000 scans for a total of 100,000 parameter values. The chain is highly correlated, so we thin, keeping every 25th value in the chain. This reduces the autocorrelation.
The Monte Carlo approximation of the posterior density of β2 (the slope) ap-
pears in the Figure 5.20. The posterior mean is 0.028 with 95 percent posterior
credible interval of (0.01,0.05), indicating that the relationship between tem-
perature and CO2 is positive. As indicated in the second plot this relationship
seems much weaker than suggested by the OLS estimate of 0.08. For the OLS
estimation, the small number of data points with high y-values have a large
influence on the estimate of β. On the other hand, the GLS model recognizes
many of these extreme points are highly correlated with one another and down
weights their influence.
Figure: Trace plots and autocorrelation functions of ρ, before thinning (top) and after thinning every 25th scan (bottom).
Figure 5.20: Posterior distribution of the slope parameter β2 and posterior
mean regression line (after generating the Markov chain with
length 25,000 with thin 25).
Exercise: Repeat the analysis with different prior distributions and perform
non-Bayesian GLS for comparison.
#####
##example 5.10 in notes
# MH and Gibbs problem
##temperature and co2 problem
source("http://www.stat.washington.edu/~hoff/Book/Data/data/chapter10.r")
## (tail of the rmvnorm helper; see its reconstruction earlier)
###
####
dct<-NULL
for(i in 1:n) {
xc<-dco2[ dco2[,2] < dat[i,1] ,,drop=FALSE]
xc<-xc[ 1, ]
dct<-rbind(dct, c( xc[c(2,4)], dat[i,] ) )
}
mean( dct[,3]-dct[,1])
dct<-dct[,c(3,2,4)]
colnames(dct)<-c("year","co2","tmp")
rownames(dct)<-NULL
dct<-as.data.frame(dct)
#plot(dct[,1],qnorm( rank(dct[,3])/(length(dct[,3])+1 )) ,
plot(dct[,1], (dct[,3]-mean(dct[,3]))/sd(dct[,3]) ,
type="l",col="black",
xlab="year",ylab="standardized measurement",ylim=c(-2.5,3))
legend(-115000,3.2,legend=c("temp",expression(CO[2])),bty="n",
lwd=c(2,2),col=c("black","gray"))
lines(dct[,1], (dct[,2]-mean(dct[,2]))/sd(dct[,2]),
#lines(dct[,1],qnorm( rank(dct[,2])/(length(dct[,2])+1 )),
type="l",col="gray")
plot(dct[,2], dct[,3],xlab=expression(paste(CO[2],"(ppmv)")),
ylab="temperature difference (deg C)")
dev.off()
########
lmfit<-lm(dct$tmp~dct$co2)
hist(lmfit$res,main="",xlab="residual",ylab="frequency")
#plot(dct$year, lmfit$res,xlab="year",ylab="residual",type="l" ); abline(h=0)
acf(lmfit$res,ci.col="gray",xlab="lag")
dev.off()
########
lmfit<-lm(y~-1+X)
fit.gls <- gls(y~X[,2], correlation=corARMA(p=1), method="ML")
beta<-lmfit$coef
s2<-summary(lmfit)$sigma^2
phi<-acf(lmfit$res,plot=FALSE)$acf[2]
nu0<-1 ; s20<-1 ; T0<-diag(1/1000,nrow=2)
## DY (assumed; not preserved in the notes): matrix of |i - j| lags,
## so that Cor = phi^DY is the AR(1) correlation matrix used below
DY<-abs(outer( (1:n),(1:n),"-"))
###
set.seed(1)
###number of MH steps
S<-25000 ; odens<-S/1000
OUT<-NULL ; ac<-0 ; par(mfrow=c(1,2))
library(psych)
for(s in 1:S)
{
Cor<-phi^DY ; iCor<-solve(Cor)
V.beta<- solve( t(X)%*%iCor%*%X/s2 + T0)
E.beta<- V.beta%*%( t(X)%*%iCor%*%y/s2 )
beta<-t(rmvnorm(1,E.beta,V.beta) )
s2<-1/rgamma(1,(nu0+n)/2,(nu0*s20+t(y-X%*%beta)%*%iCor%*%(y-X%*%beta)) /2 )
phi.p<-abs(runif(1,phi-.1,phi+.1))
phi.p<- min( phi.p, 2-phi.p)
lr<- -.5*( determinant(phi.p^DY,log=TRUE)$mod -
determinant(phi^DY,log=TRUE)$mod +
tr( (y-X%*%beta)%*%t(y-X%*%beta)%*%(solve(phi.p^DY) -solve(phi^DY)) )/s2 )
## accept/reject step for rho (reconstructed; missing from the extracted notes)
if( log(runif(1)) < lr ) { phi<-phi.p ; ac<-ac+1 }
if(s%%odens==0)
{
cat(s,ac/s,beta,s2,phi,"\n") ; OUT<-rbind(OUT,c(beta,s2,phi))
# par(mfrow=c(2,2))
# plot(OUT[,1]) ; abline(h=fit.gls$coef[1])
# plot(OUT[,2]) ; abline(h=fit.gls$coef[2])
# plot(OUT[,3]) ; abline(h=fit.gls$sigma^2)
# plot(OUT[,4]) ; abline(h=.8284)
}
}
#####
OUT.25000<-OUT
library(coda)
apply(OUT,2,effectiveSize )
OUT.25000<-dget("data.f10_10.f10_11")
apply(OUT.25000,2,effectiveSize )
pdf("trace_auto_1000.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
plot(OUT.1000[,4],xlab="scan",ylab=expression(rho),type="l")
acf(OUT.1000[,4],ci.col="gray",xlab="lag")
dev.off()
pdf("trace_thin_25.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
plot(OUT.25000[,4],xlab="scan/25",ylab=expression(rho),type="l")
acf(OUT.25000[,4],ci.col="gray",xlab="lag/25")
dev.off()
pdf("fig10_11.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
plot(density(OUT.25000[,2],adj=2),xlab=expression(beta[2]),
ylab="posterior marginal density",main="")
plot(y~X[,2],xlab=expression(CO[2]),ylab="temperature")
abline(mean(OUT.25000[,1]),mean(OUT.25000[,2]),lwd=2)
abline(lmfit$coef,col="gray",lwd=2)
legend(180,2.5,legend=c("GLS estimate","OLS estimate"),bty="n",
lwd=c(2,2),col=c("black","gray"))
dev.off()
quantile(OUT.25000[,2],probs=c(.025,.975) )
plot(X[,2],y,type="l")
points(X[,2],y,cex=2,pch=19)
points(X[,2],y,cex=1.9,pch=19,col="white")
text(X[,2],y,1:n)
iC<-solve( mean(OUT[,4])^DY )
Lev.gls<-solve(t(X)%*%iC%*%X)%*%t(X)%*%iC
Lev.ols<-solve(t(X)%*%X)%*%t(X)
plot(y,Lev.ols[2,] )
plot(y,Lev.gls[2,] )
◯ Motivations
• We have X1 , . . . , Xn ∼ iid F, F ∈ F. We usually assume that F is a parametric family.
• We would like to be able to put a prior on the set of all cdf’s. And we would like the prior to have some basic features:
1. The prior should have large support, i.e., it should assign positive probability to a rich class of distributions.
2. The prior should give rise to posteriors which are analytically tractable or computationally manageable.
The Dirichlet distribution Dir(α1 , . . . , αk ) has density
p(θ) = [ Γ(α1 + ⋯ + αk ) / ∏j=1^k Γ(αj ) ] θ1^(α1 −1) ⋯ θk^(αk −1) .
If y has a multinomial distribution with category counts N1 , . . . , Nk , then
θ∣y ∼ Dir(α1 + N1 , . . . , αk + Nk ).
Writing α = ∑j αj ,
Var(θj ) = αj (α − αj ) / [α² (α + 1)].
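A standard way to simulate from a Dirichlet (a sketch of ours; the helper name rdirichlet is an assumption) is to normalize independent gamma draws:
rdirichlet <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n*k, shape = alpha), ncol = k, byrow = TRUE)
  g / rowSums(g)   # each row is one Dirichlet(alpha) draw
}
round(colMeans(rdirichlet(10000, c(2, 3, 5))), 2)  # approx alpha_j / sum(alpha)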
We write F ∼ π for a prior π over distributions.
Remark (de Finetti): Suppose that X1 , X2 , . . . is an infinite exchangeable sequence of binary random variables. Then there exists a probability measure (distribution) µ on [0, 1] such that for every n,
P (X1 = x1 , . . . , Xn = xn ) = ∫0^1 p^(∑i xi ) (1 − p)^(n−∑i xi ) µ(dp)
for any x1 , . . . , xn ∈ {0, 1}.
Basic idea: In many data analysis settings, we don’t know the number of latent clusters and would like to learn it from the data. BNP clustering addresses this by assuming there is an infinite number of latent clusters, but that only a finite number of them is used to generate the observed data. Under these assumptions, the posterior yields a distribution over the number of clusters, the assignment of data to clusters, and the cluster parameters.
– Imagine that Sam and Mike own a restaurant with an infinite number
of tables.
– Imagine a sequence of customers entering their restaurant and sitting
down.
– The first customer (Liz) enters and sits at the first table.
– The second customer enters and sits at the first table with probability 1/(1 + α) and at a new table with probability α/(1 + α), where α is positive and real.
– Liz is friendly and people would want to sit and talk with her. So, we would assume that 1/(1 + α) is a high probability, meaning that α is a small number.
– What happens with the nth customer?
∗ He sits at each of the previously occupied tables with probability
proportional to the number previous customers sitting there.
∗ He sits at the next unoccupied table with probability propor-
tional to α.
More precisely,
P (cn = k ∣ c) = mk /(n − 1 + α) if k ≤ K+ (i.e., k is a previously occupied table), and
P (cn = k ∣ c) = α/(n − 1 + α) otherwise (i.e., k is the next unoccupied table),
where mk is the number of customers sitting at table k and K+ is the number of tables for which mk > 0. The parameter α is called the concentration parameter.
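A minimal sketch of simulating table assignments from this process (our code; the function name crp and the choice α = 1 are assumptions):
crp <- function(n, alpha) {
  tables <- integer(0)            # current customer counts per table
  assign <- integer(n)
  for (cust in 1:n) {
    probs <- c(tables, alpha)     # occupied tables, then a new table
    k <- sample(length(probs), 1, prob = probs)
    if (k > length(tables)) tables <- c(tables, 1L) else tables[k] <- tables[k] + 1L
    assign[cust] <- k
  }
  assign
}
set.seed(1)
table(crp(100, alpha = 1))        # cluster sizes for 100 customers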
yn ∣ cn , θ ∼ F (θcn )
cn ∼ p(cn )
θk ∼ Go .
yn ∣ cn , θ ∼ N (θcn , 1)
cn ∼ Multinomial(1, p)
θk ∼ N (µ, τ 2 ),
Then
p(y∣c) = ∫ [ ∏n=1^N Normal(θcn , 1)(yn ) × ∏k=1^K Normal(µ, τ 2 )(θk ) ] dθ.
The term inside the integral is, as a function of θ, proportional to a normal density, so we can integrate θ out analytically as we have in problems before.
Once we calculate p(y∣c), we can simply plug this and p(c) into
p(y∣c)p(c)
p(c∣y) = .
∑c p(y∣c)p(c)
Example 5.14: Gaussian Mixture using R
Information on the R package profdpm:
This package facilitates inference at the posterior mode in a class of conjugate product partition models (PPM) by approximating the maximum a posteriori (MAP) partition. The class of PPMs is motivated by an augmented formulation of the Dirichlet process mixture, which is currently the ONLY available member of this class. The profdpm package consists of two model fitting functions, profBinary and profLinear, their associated summary methods summary.profBinary and summary.profLinear, and a function (pci) that computes several metrics of agreement between two data partitions. However, the profdpm package was designed to be extensible to other types of product partition models. For more on this package, see help(profdpm) after installation.
set.seed(42)
sim <- function(multiplier = 1) {
x <- as.matrix(runif(99))
a <- multiplier * c(5,0,-5)
s <- multiplier * c(-10,0,10)
y <- c(a[1]+s[1]*x[1:33],
a[2]+s[2]*x[34:66],
a[3]+s[3]*x[67:99]) + rnorm(99)
group <- rep(1:33, rep(3,33))
return(data.frame(x=x,y=y,gr=group))
}
dat <- sim()
library("profdpm")
fitL <- profLinear(y ~ x, group=gr, data=dat)
sfitL <- summary(fitL)
# pdf("np_plot.pdf")
plot(fitL$x[,2], fitL$y, col=grey(0.9), xlab="x", ylab="y")
for(grp in unique(fitL$group)) {
ind <- which(fitL$group==grp)
ord <- order(fitL$x[ind,2])
lines(fitL$x[ind,2][ord],
fitL$y[ind][ord],
col=grey(0.9))
}
for(cls in 1:length(sfitL)) {
# The following implements the (3rd) method of
# Hanson & McMillan (2012) for simultaneous credible bands
# Generate coefficients from profile posterior
n <- 1e4
Figure: Simulated data (y versus x) with the fitted profLinear partition and credible bands.